
This article by Dejan Sarka, Miloš Radivojević, and William Durkin, the authors of the book SQL Server 2016 Developer’s Guide, explains that before delving into the new features in SQL Server 2016, it is worth making a quick recapitulation of the SQL Server features for developers that were already available in previous versions of SQL Server. Recapitulating the most important features will help you remember what you already have in your development toolbox and also understand the need for, and the benefits of, the new or improved features in SQL Server 2016.

The recapitulation starts with the mighty T-SQL SELECT statement. Besides the basic clauses, advanced techniques such as window functions, common table expressions, and the APPLY operator are explained. Then you will move quickly through creating and altering database objects, including tables and programmable objects such as triggers, views, user-defined functions, and stored procedures. You will also review the data modification language statements. Of course, errors might appear, so you have to know how to handle them. In addition, data integrity rules might require that two or more statements execute as an atomic, indivisible block. You can achieve this with the help of transactions.

The last section of this article deals with the parts of the SQL Server Database Engine marketed under the common name “Beyond Relational”. There is actually nothing here that goes beyond the relational model; “beyond relational” is really just a marketing term. Nevertheless, you will review the following:

  • How SQL Server supports spatial data
  • How you can enhance the T-SQL language with Common Language Runtime (CLR) elements written in a .NET language such as Visual C#
  • How SQL Server supports XML data

The code in this article uses the WideWorldImportersDW demo database. In order to test the code, this database must be present on the SQL Server instance you are using for testing, and you must also have SQL Server Management Studio (SSMS) as the client tool.

This article will cover the following points:

  • Core Transact-SQL SELECT statement elements
  • Advanced SELECT techniques
  • Error handling
  • Using transactions
  • Spatial data
  • XML support in SQL Server


The Mighty Transact-SQL SELECT

You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage:

  • SELECT to define the columns returned, or a projection of all table columns
  • FROM to list the tables used in the query and how they are associated, or joined
  • WHERE to filter the data to return only the rows that satisfy the condition in the predicate
  • GROUP BY to define the groups over which the data is aggregated
  • HAVING to filter the data after the grouping with conditions that refer to aggregations
  • ORDER BY to sort the rows returned to the client application

Besides these basic clauses, SELECT offers a variety of advanced possibilities as well. These advanced techniques are unfortunately less exploited by developers, although they are really powerful and efficient. Therefore, I urge you to review them and potentially use them in your applications. The advanced query techniques presented here include:

  • Queries inside queries, or subqueries for short
  • Window functions
  • TOP and OFFSET…FETCH expressions
  • The APPLY operator
  • Common table expressions, or CTEs

Core Transact-SQL SELECT Statement Elements

Let us start with the simplest concept of SQL, which every Tom, Dick, and Harry is aware of! The simplest query you can write to retrieve data includes the SELECT and the FROM clauses. In the SELECT clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table.

USE WideWorldImportersDW;
SELECT *
FROM Dimension.Customer;

The code returns 403 rows, all customers with all columns.

Using SELECT * is not recommended in production. Such queries can return unexpected results when the table structure changes, and they are also less suitable for good optimization.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table.

SELECT [Customer Key], [WWI Customer ID],
[Customer], [Buying Group]
FROM Dimension.Customer;

Below is the shortened result, limited to the first three rows only.

Customer Key WWI Customer ID Customer                     Buying Group
------------ --------------- ----------------------------- -------------
0           0               Unknown                       N/A
1           1               Tailspin Toys (Head Office)   Tailspin Toys
2           2               Tailspin Toys (Sylvanite, MT) Tailspin Toys

You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following query returns exactly the same data as the previous one, just with the columns renamed by aliases to avoid delimited names.

SELECT [Customer Key] AS CustomerKey,
[WWI Customer ID] AS CustomerId,
[Customer],
[Buying Group] AS BuyingGroup
FROM Dimension.Customer;

You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause.

SELECT [Customer Key] AS CustomerKey,
[WWI Customer ID] AS CustomerId,
[Customer],
[Buying Group] AS BuyingGroup
FROM Dimension.Customer
WHERE [Customer Key] <> 0;

In a relational database, you typically have data spread across multiple tables. Each table represents a set of entities of the same kind, such as the customers in the examples you have seen so far. In order to get result sets meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some condition. The most frequent kind of join is the inner join. The rows returned are those for which the condition in the join predicate for the two joined tables evaluates to true. Note that in a relational database you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL (unknown). For an inner join, the order of the tables involved in the join is not important. In the following example, you can see the Fact.Sale table joined with an inner join to the Dimension.Customer table.

SELECT c.[Customer Key] AS CustomerKey,
c.[WWI Customer ID] AS CustomerId,
c.[Customer],
c.[Buying Group] AS BuyingGroup,
f.Quantity,
f.[Total Excluding Tax] AS Amount,
f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key];

In the query, you can see that table aliases are used. If a column’s name is unique across all tables in the query, you can use it without the table name. If not, you need to use the table name in front of the column name, in the format table.column, to avoid ambiguous column references. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the name of the table it comes from to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can’t refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment you write it. However, somebody could later add a column with the same name to another table involved in the query. If the column name is not preceded by an alias or by the table name, you would then get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returns 228,265 rows. It is always advisable to know at least approximately the number of rows your query should return. This number is the first check of the correctness of the result set, or, said differently, of whether the query is written logically correctly. The query returns the unknown customer and the orders associated with this customer, or, more precisely, associated with this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, just as you use it in a single-table query. The following query filters out the unknown customer rows.

SELECT c.[Customer Key] AS CustomerKey,
c.[WWI Customer ID] AS CustomerId,
c.[Customer],
c.[Buying Group] AS BuyingGroup,
f.Quantity,
f.[Total Excluding Tax] AS Amount,
f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0;

The query returns 143,968 rows. You can see that many sales rows are associated with the unknown customer.

Of course, the Fact.Sale table can be joined to tables other than the Dimension.Customer table. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join.

SELECT d.Date, f.[Total Excluding Tax],
f.[Delivery Date Key]
FROM Fact.Sale AS f
INNER JOIN Dimension.Date AS d
   ON f.[Delivery Date Key] = d.Date;

The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks like not all Fact.Sale table rows have a known delivery date, so not all rows can match the Dimension.Date table rows. You can use an outer join to check this.

With an outer join, you preserve the rows from one or both tables, even if they don’t have a match in the other table. The result set returned includes all of the matched rows, as you get from an inner join, plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, the rows from the left table are preserved. If you use RIGHT OUTER JOIN, the rows from the right table are preserved. In both of these cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date, descending, using the ORDER BY clause.

SELECT d.Date, f.[Total Excluding Tax],
f.[Delivery Date Key], f.[Invoice Date Key]
FROM Fact.Sale AS f
LEFT OUTER JOIN Dimension.Date AS d
   ON f.[Delivery Date Key] = d.Date
ORDER BY f.[Invoice Date Key] DESC;

The query returns 228,265 rows. Here is the partial result of the query.

Date       Total Excluding Tax Delivery Date Key Invoice Date Key
---------- ------------------- ----------------- ----------------
NULL       180.00              NULL              2016-05-31
NULL       120.00              NULL              2016-05-31
NULL       160.00              NULL              2016-05-31
…          …                   …                 …
2016-05-31 2565.00             2016-05-31        2016-05-30
2016-05-31 88.80               2016-05-31        2016-05-30
2016-05-31 50.00               2016-05-31        2016-05-30

For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column from the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table.
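A FULL OUTER JOIN, mentioned above but not demonstrated, preserves unmatched rows from both sides. A minimal sketch against the same two tables (treat it as an illustration: any Dimension.Date rows with no matching deliveries would show NULLs in the fact columns, and the row count depends on your data):

```sql
-- Sketch: rows from BOTH Fact.Sale and Dimension.Date are preserved
SELECT d.Date, f.[Sale Key], f.[Delivery Date Key]
FROM Fact.Sale AS f
FULL OUTER JOIN Dimension.Date AS d
   ON f.[Delivery Date Key] = d.Date;
```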

Joining more than two tables is not tricky if all of the joins are inner joins: the order of the joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don’t control the join order when outer joins are involved, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query joins the Fact.Sale table with inner joins to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with a left outer join to the Dimension.Date table.

SELECT cu.[Customer Key] AS CustomerKey, cu.Customer,
ci.[City Key] AS CityKey, ci.City,
ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTerritory,
d.Date, d.[Calendar Month Label] AS CalendarMonth,
d.[Calendar Year] AS CalendarYear,
s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color,
e.[Employee Key] AS EmployeeKey, e.Employee,
f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit
FROM (Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
   ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
   ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
   ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Employee AS e
   ON f.[Salesperson Key] = e.[Employee Key])
LEFT OUTER JOIN Dimension.Date AS d
   ON f.[Delivery Date Key] = d.Date;

The query returns 228,265 rows. Note that with the use of the parentheses, the order of the joins is defined in the following way:

  • Perform all inner joins, with an arbitrary order among them
  • Execute the left outer join after all of the inner joins

So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date table to return all of the rows. It would be good to check this number in advance. You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function.

SELECT COUNT(*) AS SalesCount
FROM Fact.Sale;

Now you can be sure that the Fact.Sale table has exactly 228,265 rows.

Many times you need to aggregate data in groups. This is where the GROUP BY clause comes in handy. The following query aggregates the sales data for each customer.

SELECT c.Customer,
SUM(f.Quantity) AS TotalQuantity,
SUM(f.[Total Excluding Tax]) AS TotalAmount,
COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer;

The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping or aggregated columns. For each column not included in the GROUP BY list, you need to return a scalar, a single aggregated value per group.
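To illustrate the rule, the following query fails with error 8120 (“Column … is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”), because f.Quantity is neither grouped nor aggregated:

```sql
-- Fails with error 8120: f.Quantity is neither in GROUP BY nor aggregated
SELECT c.Customer, f.Quantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
GROUP BY c.Customer;
```

To fix it, either add f.Quantity to the GROUP BY list or wrap it in an aggregate such as SUM(f.Quantity).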

Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, like the following query shows.

SELECT c.Customer,
SUM(f.Quantity) AS TotalQuantity,
SUM(f.[Total Excluding Tax]) AS TotalAmount,
COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
HAVING COUNT(*) > 400;

The query returns 45 rows, for the 45 most frequent known customers. Note that you can’t use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all of the other clauses of the query, so the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, so the column aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filter the data to include the frequent customers only, and sort the result set descending by the number of rows of each customer in the Fact.Sale table.

SELECT c.Customer,
SUM(f.Quantity) AS TotalQuantity,
SUM(f.[Total Excluding Tax]) AS TotalAmount,
COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
HAVING COUNT(*) > 400
ORDER BY InvoiceLinesCount DESC;

The query returns 45 rows. Below is the shortened result set.

Customer                             TotalQuantity TotalAmount InvoiceLinesCount
------------------------------------ ------------- ----------- -----------------
Tailspin Toys (Vidrine, LA)          18899         340163.80   455
Tailspin Toys (North Crows Nest, IN) 17684         313999.50   443
Tailspin Toys (Tolna, ND)            16240         294759.10   443

Advanced SELECT Techniques

Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only – either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries.

The following query shows an example of using two subqueries in a single query. In the SELECT clause, there is a subquery that calculates the sum of the quantity for each customer. It returns a scalar value. The subquery refers to the customer key from the outer query; it can’t execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all customers. This query returns a table, although it is a table with a single row and a single column. It is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets: each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwantedly huge result set. For example, if you cross join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, you typically want to avoid cross joins in production. However, in the example in the following query, 143,968 rows from the left side are cross joined to a single row from the subquery, therefore producing only 143,968 rows. Effectively, this means that the overall total column is added to each detail row.

SELECT c.Customer,
f.Quantity,
(SELECT SUM(f1.Quantity) FROM Fact.Sale AS f1
   WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity,
f2.TotalQuantity
FROM (Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key])
CROSS JOIN
   (SELECT SUM(f2.Quantity) FROM Fact.Sale AS f2
            WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity)
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

Here is an abbreviated output of the query.

Customer                    Quantity   TotalCustomerQuantity TotalQuantity
---------------------------- ----------- --------------------- -------------
Tailspin Toys (Absecon, NJ) 360         12415                 5667611
Tailspin Toys (Absecon, NJ) 324         12415                 5667611
Tailspin Toys (Absecon, NJ) 288         12415                 5667611

In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total into the FROM clause, where it logically executes only once. Although SQL Server can often optimize correlated subqueries and convert them to joins, there is also a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using window functions.

The following query uses the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or windows, of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, meaning the calculation is done over the whole input rowset. This query produces exactly the same result as the previous one.

SELECT c.Customer,
f.Quantity,
SUM(f.Quantity)
   OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity,
SUM(f.Quantity)
   OVER() AS TotalQuantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

You can use many other functions for window calculations. For example, you can use ranking functions, such as ROW_NUMBER(), to calculate a rank in the window or in the overall rowset. However, a rank can be defined only over some order of the calculation. You specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, not the order of the rows returned. A standalone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once over the whole input rowset. The logical order of the calculation is over quantity descending, meaning that row number 1 is assigned to the largest quantity, either the largest for each customer or the largest in the whole input rowset.

SELECT c.Customer,
f.Quantity,
ROW_NUMBER()
   OVER(PARTITION BY c.Customer
       ORDER BY f.Quantity DESC) AS CustomerOrderPosition,
ROW_NUMBER()
   OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

The query produces the following result, again abbreviated to a couple of rows only.

Customer                     Quantity   CustomerOrderPosition TotalOrderPosition
----------------------------- ----------- --------------------- --------------------
Tailspin Toys (Absecon, NJ)   360         1                     129
Tailspin Toys (Absecon, NJ)   324         2                     162
Tailspin Toys (Absecon, NJ)   288         3                     374
…                             …           …                     …
Tailspin Toys (Aceitunas, PR) 288         1                     392
Tailspin Toys (Aceitunas, PR) 250         4                     1331
Tailspin Toys (Aceitunas, PR) 250         3                     1315
Tailspin Toys (Aceitunas, PR) 250         2                     1313
Tailspin Toys (Aceitunas, PR) 240         5                     1478

Note the position, or the row number, for the second customer. The order does not look completely correct: it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, as you might expect. This is due to the repeating value for the second-largest quantity, the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, not over the row number. You can’t know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance.
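If you need a deterministic numbering, you can add a tiebreaker on a unique column to the ORDER BY inside the OVER clause. A sketch, assuming the [Sale Key] column uniquely identifies rows in Fact.Sale:

```sql
-- Ties on Quantity are broken by the (assumed unique) Sale Key,
-- so each row gets the same number on every execution
SELECT c.Customer,
 f.Quantity,
 ROW_NUMBER()
    OVER(PARTITION BY c.Customer
         ORDER BY f.Quantity DESC, f.[Sale Key]) AS CustomerOrderPosition
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;
```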

Window functions are useful for some advanced calculations as well, such as running totals and moving averages. However, these calculations are not performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only.

The following query calculates the running total of the quantity per customer (the column alias Q_RT in the query) ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity (the column alias Q_MA in the query) for the last three rows.

SELECT c.Customer,
f.[Sale Key] AS SaleKey,
f.Quantity,
SUM(f.Quantity)
   OVER(PARTITION BY c.Customer
       ORDER BY [Sale Key]
               ROWS BETWEEN UNBOUNDED PRECEDING
                 AND CURRENT ROW) AS Q_RT,
AVG(f.Quantity)
   OVER(PARTITION BY c.Customer
       ORDER BY [Sale Key]
               ROWS BETWEEN 2 PRECEDING
                 AND CURRENT ROW) AS Q_MA
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.[Sale Key];

The query returns the following (abbreviated) result.

Customer                     SaleKey Quantity   Q_RT       Q_MA
---------------------------- -------- ----------- ----------- -----------
Tailspin Toys (Absecon, NJ) 2869     216         216         216
Tailspin Toys (Absecon, NJ) 2870     2           218         109
Tailspin Toys (Absecon, NJ) 2871     2           220         73

Let’s find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET…FETCH clause after the ORDER BY clause, like the following query shows.

SELECT c.Customer,
f.[Sale Key] AS SaleKey,
f.Quantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;

This is the complete result of the query.

Customer                       SaleKey Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR) 36964   288
Tailspin Toys (Aceitunas, PR) 126253   250
Tailspin Toys (Aceitunas, PR) 79272   250

But wait… Didn’t the second-largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somewhat nondeterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don’t know the number of rows in the output in advance. The following query shows this approach.

SELECT TOP 3 WITH TIES
c.Customer,
f.[Sale Key] AS SaleKey,
f.Quantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC;

This is the complete result of the previous query – this time it is four rows.

Customer                       SaleKey Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR) 36964   288
Tailspin Toys (Aceitunas, PR) 223106   250
Tailspin Toys (Aceitunas, PR) 126253   250
Tailspin Toys (Aceitunas, PR) 79272   250

The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any rows. The OUTER APPLY version preserves the row from the left side even if the tabular expression on the right side does not return any rows, similarly to the LEFT OUTER JOIN. Of course, the columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate the top three orders by quantity for each customer that actually does have some orders.

SELECT c.Customer,
t3.SaleKey, t3.Quantity
FROM Dimension.Customer AS c
CROSS APPLY (SELECT TOP(3)
                 f.[Sale Key] AS SaleKey,
                 f.Quantity
               FROM Fact.Sale AS f
               WHERE f.[Customer Key] = c.[Customer Key]
               ORDER BY f.Quantity DESC) AS t3
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, t3.Quantity DESC;

Below is the result of this query, shortened to the first nine rows.

Customer                           SaleKey Quantity
---------------------------------- -------- -----------
Tailspin Toys (Absecon, NJ)       5620     360
Tailspin Toys (Absecon, NJ)       114397   324
Tailspin Toys (Absecon, NJ)       82868   288
Tailspin Toys (Aceitunas, PR)     36964   288
Tailspin Toys (Aceitunas, PR)     126253   250
Tailspin Toys (Aceitunas, PR)     79272   250
Tailspin Toys (Airport Drive, MO) 43184   250
Tailspin Toys (Airport Drive, MO) 70842   240
Tailspin Toys (Airport Drive, MO) 630     225

For the final task in this section, assume that you need to calculate some statistics over the totals of customers’ orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer. This means you need to calculate the totals over customers in advance, and then use the aggregate functions AVG() and STDEV() on these aggregates. You could do the aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such a subquery is called a common table expression, or CTE.

CTEs are more readable than derived tables, and might also be more efficient. You can use the result of the same CTE multiple times in the outer query. If you use derived tables, you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer.

WITH CustomerSalesCTE AS
(
SELECT c.Customer,
SUM(f.[Total Excluding Tax]) AS TotalAmount,
COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
)
SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer,
ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer,
AVG(InvoiceLinesCount) AS AvgCountPerCustomer
FROM CustomerSalesCTE;

It returns the following result.

AvgAmountPerCustomer StDevAmountPerCustomer AvgCountPerCustomer
--------------------- ---------------------- -------------------
270479.217661         38586.082621           358
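As noted above, a CTE can be referenced more than once in the outer query, which a derived table cannot do without repeating its definition. A minimal sketch of this idea (an illustrative variation on the query above, not from the book): the same CTE feeds both the detail rows and an aggregate of itself.

```sql
WITH CustomerSalesCTE AS
(
SELECT c.Customer,
 SUM(f.[Total Excluding Tax]) AS TotalAmount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c
   ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
)
SELECT cs.Customer, cs.TotalAmount,
 cs.TotalAmount - a.AvgAmount AS DiffFromAvg
FROM CustomerSalesCTE AS cs
CROSS JOIN (SELECT AVG(TotalAmount) AS AvgAmount
            FROM CustomerSalesCTE) AS a;  -- same CTE, referenced a second time
```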

Transactions and Error Handling

In a real-world application, errors always appear. There can be syntax or even logical errors in the code, the database design might be incorrect, and there might even be a bug in the database management system you are using. Even if everything works correctly, you might get an error because the users insert wrong data. With Transact-SQL error handling you can catch such user errors and decide how to respond to them. Typically, you want to log the errors, inform the users about them, and sometimes even correct them in the error handling code.

Error handling for user errors works on the statement level. If you send SQL Server a batch of two or more statements and the error is in the last statement, the previous statements execute successfully. This might not be what you desire. Many times you need to execute a batch of statements as a unit, and fail all of the statements if one of the statements fails. You can achieve this by using transactions. You will learn in this section about:

  • Error handling
  • Transaction management

Error Handling

You can demonstrate the need for error handling by producing an error. The following code tries to insert an order and a detail row for this order.

EXEC dbo.InsertSimpleOrder
@OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';
EXEC dbo.InsertSimpleOrderDetail
@OrderId = 6, @ProductId = 2, @Quantity = 0;

In SQL Server Management Studio, you can see that an error happened. You should get a message that error 547 occurred: the INSERT statement conflicted with the CHECK constraint. If you remember, in order details only rows where the value for the quantity is not equal to zero are allowed. The error occurred in the second statement, in the call of the procedure that inserts an order detail. The procedure that inserts an order executed without an error. Therefore, an order with an id equal to six must be in the dbo.SimpleOrders table. The following code tries to insert order six again.

EXEC dbo.InsertSimpleOrder
@OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';

Of course, another error occurred. This time it should be error 2627, a violation of the PRIMARY KEY constraint. The values of the OrderId column must be unique. Let’s check the state of the data after these successful and unsuccessful inserts.

SELECT o.OrderId, o.OrderDate, o.Customer,
od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o
   ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The previous query checks only orders and their associated details where the order id value is greater than five. The query returns the following result set.

OrderId     OrderDate Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE   NULL       NULL

You can see that only the first insert of the order with id 6 succeeded. The second insert of an order with the same id and the insert of the detail row for order six did not succeed.

You start handling errors by enclosing the statements in the batch you are executing in the BEGIN TRY … END TRY block. You can catch the errors in the BEGIN CATCH … END CATCH block. The BEGIN CATCH statement must be immediately after the END TRY statement. The control of the execution is passed from the try part to the catch part immediately after the first error occurs.

In the catch part, you can decide how to handle the errors. If you want to log the data about the error or inform an end user about the details of the error, the following functions might be very handy:

  • ERROR_NUMBER() – this function returns the number of the error.
  • ERROR_SEVERITY() – it returns the severity level. The severity of the error indicates the type of problem encountered. Severity levels 11 to 16 indicate errors that can be corrected by the user.
  • ERROR_STATE() – this function returns the error state number. Error state gives more details about a specific error. You might want to use this number together with the error number to search Microsoft knowledge base for the specific details of the error you encountered.
  • ERROR_PROCEDURE() – it returns the name of the stored procedure or trigger where the error occurred, or NULL if the error did not occur within a stored procedure or trigger.
  • ERROR_LINE() – it returns the line number at which the error occurred. This might be the line number in a routine if the error occurred within a stored procedure or trigger, or the line number in the batch.
  • ERROR_MESSAGE() – this function returns the text of the error message.

The following code uses the try…catch block to handle possible errors in the batch of statements, and returns information about the error using the above-mentioned functions. Note that the error happens in the first statement of the batch.

BEGIN TRY
EXEC dbo.InsertSimpleOrder
@OrderId = 6, @OrderDate = '20160706', @Customer = N'CustF';
EXEC dbo.InsertSimpleOrderDetail
@OrderId = 6, @ProductId = 2, @Quantity = 5;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() as ErrorLine;
END CATCH

There was a violation of the PRIMARY KEY constraint again, because the code tried to insert an order with id six again. The second statement would succeed if you executed it in its own batch, without error handling. However, because of the error handling, the control was passed to the catch block immediately after the error in the first statement, and the second statement never executed. You can check the data with the following query.

SELECT o.OrderId, o.OrderDate, o.Customer,
od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o
   ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The result set should be the same as the result set of the last check of the orders with id greater than five – a single order without details. The following code produces an error in the second statement.

BEGIN TRY
EXEC dbo.InsertSimpleOrder
@OrderId = 7, @OrderDate = '20160706', @Customer = N'CustF';
EXEC dbo.InsertSimpleOrderDetail
@OrderId = 7, @ProductId = 2, @Quantity = 0;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() as ErrorLine;
END CATCH

You can see that the insert of the order detail violates the CHECK constraint for the quantity. If you check the data again with the same query as in the last two checks, you will see that there are orders with ids six and seven in the data, both without order details.

Using Transactions

Your business logic might require that the insert of the first statement fails when the second statement fails. You might need to roll back the changes of the first statement on the failure of the second statement. You can define that a batch of statements executes as a unit by using transactions. The following code shows how to use transactions. Again, the second statement in the batch in the try block is the one that produces an error.

BEGIN TRY
BEGIN TRANSACTION
EXEC dbo.InsertSimpleOrder
   @OrderId = 8, @OrderDate = '20160706', @Customer = N'CustG';
EXEC dbo.InsertSimpleOrderDetail
   @OrderId = 8, @ProductId = 2, @Quantity = 0;
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() as ErrorLine;
IF XACT_STATE() <> 0
   ROLLBACK TRANSACTION;
END CATCH

You can check the data again.

SELECT o.OrderId, o.OrderDate, o.Customer,
od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o
   ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

Here is the result of the check:

OrderId     OrderDate Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE   NULL       NULL
7           2016-07-06 CustF   NULL       NULL

You can see that the order with id 8 does not exist in your data. Because the insert of the detail row for this order failed, the insert of the order was rolled back as well. Note that in the catch block, the XACT_STATE() function was used to check whether the transaction still exists. If the transaction had already been rolled back automatically by SQL Server, then the ROLLBACK TRANSACTION statement would produce a new error.

The following code drops the objects (in the correct order, due to object dependencies) created for the explanation of the DDL and DML statements, programmable objects, error handling, and transactions.

DROP FUNCTION dbo.Top2OrderDetails;
DROP VIEW dbo.OrdersWithoutDetails;
DROP PROCEDURE dbo.InsertSimpleOrderDetail;
DROP PROCEDURE dbo.InsertSimpleOrder;
DROP TABLE dbo.SimpleOrderDetails;
DROP TABLE dbo.SimpleOrders;

Beyond Relational

The term “beyond relational” is actually only a marketing term. The relational model, used in relational database management systems, is in no way limited to specific data types or specific languages. However, with the term beyond relational, we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages like Visual C#, or statistical languages like R. Versions of SQL Server before 2016 already support some of the features mentioned. Here is a quick review of this support, which includes:

  • Spatial data
  • CLR support
  • XML data

Defining Locations and Shapes with Spatial Data

In modern applications, you often want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From simple shapes you can create complex geometrical or geographical objects, for example cities and roads. Spatial data appears in many contemporary databases. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you work with spatial data. SQL Server supports two spatial data types, both implemented as .NET common language runtime (CLR) data types, since version 2008:

  • The geometry type represents data in a Euclidean (flat) coordinate system.
  • The geography type represents data in a round-earth coordinate system.

We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation.

In the planar, or flat-earth, system, you define the units of measurement. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You as the database developer know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units, regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know whether these 5 units mean 5 kilometers or 5 inches.
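You can verify this with the geometry type itself. The following minimal sketch computes the planar distance between the two points mentioned above; SRID 0 here simply denotes an unspecified planar coordinate system.

```sql
-- A sketch: planar distance between (1, 3) and (4, 7) with the geometry type.
-- geometry::Point(x, y, SRID); SRID 0 means no particular reference system.
DECLARE @p1 AS GEOMETRY = geometry::Point(1, 3, 0);
DECLARE @p2 AS GEOMETRY = geometry::Point(4, 7, 0);
SELECT @p1.STDistance(@p2) AS Distance;  -- returns 5
```

Whether those 5 units are kilometers or inches is up to you, the designer of the coordinate system.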

When talking about locations on earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal, system. Lengths and areas are usually measured in the metric system, in meters and square meters. However, the metric system is not used for spatial data everywhere in the world. The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring a distance or area in the ellipsoidal system, you should always also quote the SRID used, which defines the units.

In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (0, 5), (10, 0), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the earth into two hemispheres. Is your spatial data describing the northern or the southern hemisphere?
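The following sketch illustrates the point, assuming SQL Server 2012 or later (earlier versions reject geography polygons larger than a hemisphere). The same ring, traversed in opposite directions, describes two complementary areas on the globe; the ReorientObject() method flips the ring orientation.

```sql
-- A sketch: ring orientation decides which side of the ring the geography
-- polygon covers. SRID 4326 (WGS 84) is used; WKT coordinates are lon lat.
DECLARE @p AS GEOGRAPHY = geography::STGeomFromText(
   'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 4326);
SELECT @p.STArea() AS AreaM2,
       @p.ReorientObject().STArea() AS ComplementAreaM2;
-- One result is the small region near the equator; the other covers the
-- rest of the earth's surface.
```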

The Wide World Importers data warehouse includes the city location in the Dimension.City table. The following query retrieves it for cities in the main part of the USA.

SELECT City,
[Sales Territory] AS SalesTerritory,
Location AS LocationBinary,
Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] <> 0
AND [Sales Territory] NOT IN
     (N'External', N'Far West');

Here is the partial result of the query.

City         SalesTerritory LocationBinary       LocationLongLat              
------------ --------------- -------------------- -------------------------------
Carrollton   Mideast         0xE6100000010C70... POINT (-78.651695 42.1083969)
Carrollton   Southeast       0xE6100000010C88... POINT (-76.5605078 36.9468152)
Carrollton   Great Lakes     0xE6100000010CDB... POINT (-90.4070632 39.3022693)

You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude.

In SSMS, if you send the results of the previous query to a grid, you also get an additional representation for the spatial data in the results pane. Click the Spatial results tab, and you can see the points represented in the longitude – latitude coordinate system, as shown in the following figure.

Figure 2-1: Spatial results showing customers’ locations

If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations: it can show only 5,000 objects, so the result displays only the first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the locations of customers from the USA.

The following query gives you the details, like location and population, for Denver, Colorado.

SELECT [City Key] AS CityKey, City,
[State Province] AS State,
[Latest Recorded Population] AS Population,
Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] = 114129
AND [Valid To] = '9999-12-31 23:59:59.9999999';

Spatial data types have many useful methods. For example, the STDistance() method returns the shortest line between two geography instances. This is a close approximation of the geodesic distance, defined as the shortest route between two points on the Earth’s surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington.

DECLARE @g AS GEOGRAPHY;
DECLARE @h AS GEOGRAPHY;
DECLARE @unit AS NVARCHAR(50);
SET @g = (SELECT Location FROM Dimension.City
         WHERE [City Key] = 114129);
SET @h = (SELECT Location FROM Dimension.City
         WHERE [City Key] = 108657);
SET @unit = (SELECT unit_of_measure
             FROM sys.spatial_reference_systems
             WHERE spatial_reference_id = @g.STSrid);
SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance,
@unit AS Unit;

The result of the previous batch is below.

Distance     Unit
------------- ------
1,643,936.69 metre

Note that the code uses the sys.spatial_reference_systems catalog view to get the unit of measure for the distance of the SRID used to store the geographical instances of data. The unit is the meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers.

The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as cities with a population larger than 200,000.

DECLARE @g AS GEOGRAPHY;
SET @g = (SELECT Location FROM Dimension.City
         WHERE [City Key] = 114129);
SELECT DISTINCT City,
[State Province] AS State,
FORMAT([Latest Recorded Population], '000,000') AS Population,
FORMAT(@g.STDistance(Location), '000,000.00') AS Distance
FROM Dimension.City
WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1
AND [Latest Recorded Population] > 200000
AND [City Key] <> 114129
AND [Valid To] = '9999-12-31 23:59:59.9999999'
ORDER BY Distance;

Here is the result abbreviated to the twelve closest cities to Denver, Colorado.

City             State       Population Distance  
----------------- ----------- ----------- -----------
Aurora           Colorado   325,078     013,141.64
Colorado Springs Colorado   416,427     101,487.28
Albuquerque       New Mexico 545,852     537,221.38
Wichita          Kansas     382,368     702,553.01
Lincoln           Nebraska   258,379     716,934.90
Lubbock           Texas       229,573     738,625.38
Omaha             Nebraska   408,958     784,842.10
Oklahoma City     Oklahoma   579,999     809,747.65
Tulsa             Oklahoma   391,906     882,203.51
El Paso           Texas       649,121     895,789.96
Kansas City       Missouri   459,787     898,397.45
Scottsdale       Arizona     217,385     926,980.71

There are many more useful methods and properties implemented in the two spatial data types. In addition, you can improve the performance of spatial queries with the help of specialized spatial indexes. Please refer to the MSDN article “Spatial Data (SQL Server)” at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on the spatial data types, their methods, and spatial indexes.

XML Support in SQL Server

SQL Server 2005 started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts with generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task.

The following query generates an XML document from a regular tabular result set by using the FOR XML clause with the AUTO option, to generate an element-centric XML instance, with a namespace and inline schema included.

SELECT c.[Customer Key] AS CustomerKey,
c.[WWI Customer ID] AS CustomerId,
c.[Customer],
c.[Buying Group] AS BuyingGroup,
f.Quantity,
f.[Total Excluding Tax] AS Amount,
f.Profit
FROM Dimension.Customer AS c
INNER JOIN Fact.Sale AS f
   ON c.[Customer Key] = f.[Customer Key]
WHERE c.[Customer Key] IN (127, 128)
FOR XML AUTO, ELEMENTS,
ROOT('CustomersOrders'),
XMLSCHEMA('CustomersOrdersSchema');
GO

Here is the partial result of this query. The first part of the result is the inline schema.

<CustomersOrders>
<xsd:schema targetNamespace="CustomersOrdersSchema" …
   <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" …
   <xsd:element name="c">
     <xsd:complexType>
       <xsd:sequence>
         <xsd:element name="CustomerKey" type="sqltypes:int" />
         <xsd:element name="CustomerId" type="sqltypes:int" />
         <xsd:element name="Customer">
           <xsd:simpleType>
             <xsd:restriction base="sqltypes:nvarchar" …
               <xsd:maxLength value="100" />
             </xsd:restriction>
           </xsd:simpleType>
         </xsd:element>
         …
       </xsd:sequence>
     </xsd:complexType>
   </xsd:element>
</xsd:schema>
<c >
   <CustomerKey>127</CustomerKey>
   <CustomerId>127</CustomerId>
   <Customer>Tailspin Toys (Point Roberts, WA)</Customer>
   <BuyingGroup>Tailspin Toys</BuyingGroup>
   <f>
     <Quantity>3</Quantity>
     <Amount>48.00</Amount>
     <Profit>31.50</Profit>
   </f>
   <f>
     <Quantity>9</Quantity>
     <Amount>2160.00</Amount>
     <Profit>1363.50</Profit>
   </f>
</c>
<c >
   <CustomerKey>128</CustomerKey>
   <CustomerId>128</CustomerId>
   <Customer>Tailspin Toys (East Portal, CO)</Customer>
   <BuyingGroup>Tailspin Toys</BuyingGroup>
   <f>
     <Quantity>84</Quantity>
     <Amount>420.00</Amount>
     <Profit>294.00</Profit>
   </f>
</c>
…
</CustomersOrders>

You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function.
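As a minimal sketch of shredding with the nodes() method (the XML content below is hypothetical, not from the demo database): nodes() turns each matching element into a row, and value() extracts atomic values from it.

```sql
-- A sketch of shredding an XML instance into rows with nodes() and value().
DECLARE @x AS XML = N'
<Orders>
  <Order orderid="1"><orderdate>2016-07-01</orderdate></Order>
  <Order orderid="9"><orderdate>2016-07-03</orderdate></Order>
</Orders>';
SELECT o.n.value('@orderid', 'INT') AS OrderId,
       o.n.value('(orderdate)[1]', 'DATE') AS OrderDate
FROM @x.nodes('/Orders/Order') AS o(n);  -- one row per Order element
```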

Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and is supported inside XML data type methods.

You can store XML instances inside a SQL Server database in a column of the XML data type. The XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method) as opposed to replacing the whole instance, and shredding XML data into multiple rows in a result set (the nodes() method).
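Before looking at the query() method in detail, here is a brief sketch of the exist() and modify() methods on a hypothetical XML instance (the element names are made up for illustration):

```sql
-- A sketch of exist() and modify(). modify() uses the XML DML language
-- to change a section of the instance in place.
DECLARE @x AS XML =
   N'<Customer custid="1"><companyname>CustA</companyname></Customer>';
SELECT @x.exist('/Customer[@custid = 1]') AS CustomerExists;  -- returns 1
SET @x.modify('replace value of
               (/Customer/companyname/text())[1] with "CustZ"');
SELECT @x.value('(/Customer/companyname)[1]', 'NVARCHAR(100)') AS NewName;
```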

The following code creates a variable of the XML data type to store an XML instance in it. Then it uses the query() method to return XML fragments from the XML instance. This method accepts an XQuery query as a parameter. The XQuery query uses FLWOR expressions to define and shape the XML returned.

DECLARE @x AS XML;
SET @x = N'
<CustomersOrders>
<Customer custid="1">
   <!-- Comment 111 -->
   <companyname>CustA</companyname>
   <Order orderid="1">
     <orderdate>2016-07-01T00:00:00</orderdate>
   </Order>
   <Order orderid="9">
     <orderdate>2016-07-03T00:00:00</orderdate>
   </Order>
   <Order orderid="12">
     <orderdate>2016-07-12T00:00:00</orderdate>
   </Order>
</Customer>
<Customer custid="2">
   <!-- Comment 222 -->
   <companyname>CustB</companyname>
   <Order orderid="3">
     <orderdate>2016-07-01T00:00:00</orderdate>
   </Order>
   <Order orderid="10">
     <orderdate>2016-07-05T00:00:00</orderdate>
   </Order>
</Customer>
</CustomersOrders>';
SELECT @x.query('for $i in CustomersOrders/Customer/Order
                 let $j := $i/orderdate
                 where $i/@orderid < 10900
                 order by ($j)[1]
                 return
                 <Order-orderid-element>
                 <orderid>{data($i/@orderid)}</orderid>
                 {$j}
                 </Order-orderid-element>')
       AS [Filtered, sorted and reformatted orders with let clause];

Here is the result of the previous query.

<Order-orderid-element>
<orderid>1</orderid>
<orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
<orderid>3</orderid>
<orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
<orderid>9</orderid>
<orderdate>2016-07-03T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
<orderid>10</orderid>
<orderdate>2016-07-05T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
<orderid>12</orderid>
<orderdate>2016-07-12T00:00:00</orderdate>
</Order-orderid-element>

Summary

In this article, you got a review of the SQL Server features for developers that already exist in previous versions. You can see that this support goes well beyond basic SQL statements, and also beyond pure Transact-SQL.
