Transforming data with the Pivot transform in Data Services
In this article by Iwan Shomnikov, author of the book SAP Data Services 4.x Cookbook, you will learn that the Pivot transform belongs to the Data Integrator group of transform objects in Data Services, which are usually all about the generation or transformation (meaning change in the structure) of data.
Simply inserting the Pivot transform allows you to convert columns into rows. The Pivot transformation increases the number of rows in a dataset: for every column converted into a row, an extra row is created for each combination of key (non-pivoted) column values. The converted columns are called pivot columns.
Pivoting rows to columns or columns to rows is quite a common transformation operation in data migration tasks, and traditionally, the simplest way to perform it in plain SQL is with CASE expressions (or vendor-specific functions such as Oracle's decode()) inside your SELECT statements. Depending on the complexity of the source and target datasets before and after pivoting, the SELECT statement can become extremely heavy and difficult to understand.
Data Services provides a simple and flexible way of pivoting data inside the Data Services ETL code using the Pivot and Reverse_Pivot dataflow object transforms.
The following steps show how exactly you can create, configure, and use these transforms in Data Services in order to pivot your data.
We will use the SQL Server database for the source and target objects, which will be used to demonstrate the Pivot transform available in Data Services. The steps in this section describe the preparation of a source table and the data required in it for a demonstration of the Pivot transform in the Data Services development environment:
- Create a new database or import the existing test database, AdventureWorks OLTP, available for download and free test use at https://msftdbprodsamples.codeplex.com/releases/view/55330.
- We will download the database file from the preceding link and deploy it to SQL Server, naming our database AdventureWorks_OLTP.
- Run the following SQL statements against the AdventureWorks_OLTP database to create a source table and populate it with data:
```sql
create table Sales.AccountBalance (
    [AccountID]     integer,
    [AccountNumber] integer,
    [Year]          integer,
    [Q1] decimal(10,2),
    [Q2] decimal(10,2),
    [Q3] decimal(10,2),
    [Q4] decimal(10,2)
);

-- Row 1
insert into Sales.AccountBalance
    ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4])
values (1,100,2015,100.00,150.00,120.00,300.00);

-- Row 2
insert into Sales.AccountBalance
    ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4])
values (2,100,2015,50.00,350.00,620.00,180.00);

-- Row 3
insert into Sales.AccountBalance
    ([AccountID],[AccountNumber],[Year],[Q1],[Q2],[Q3],[Q4])
values (3,200,2015,333.33,440.00,12.00,105.50);
```
So, the source table would look similar to the one in the following figure:
- Create an OLTP datastore in Data Services referencing the AdventureWorks_OLTP database, and import the AccountBalance table created in the previous step into it.
- Create the DS_STAGE datastore in Data Services pointing to the same OLTP database. We will use this datastore as the target staging area of our environment, into which we insert the resulting pivoted dataset extracted from the OLTP system.
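As a quick mental model outside Data Services, the three rows created by the INSERT statements above can be represented in plain Python (for illustration only; this is not Data Services code):

```python
# The Sales.AccountBalance rows created by the INSERT statements above,
# represented as plain Python dictionaries for illustration only.
source_rows = [
    {"AccountID": 1, "AccountNumber": 100, "Year": 2015,
     "Q1": 100.00, "Q2": 150.00, "Q3": 120.00, "Q4": 300.00},
    {"AccountID": 2, "AccountNumber": 100, "Year": 2015,
     "Q1": 50.00, "Q2": 350.00, "Q3": 620.00, "Q4": 180.00},
    {"AccountID": 3, "AccountNumber": 200, "Year": 2015,
     "Q1": 333.33, "Q2": 440.00, "Q3": 12.00, "Q4": 105.50},
]

# One row per account, with the quarterly amounts spread across four
# columns -- the "wide" shape that the Pivot transform will unpivot.
print(len(source_rows))  # 3
```

Note the shape: each quarterly amount lives in its own column, which is exactly what the Pivot transform will change.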
How to do it…
This section describes the ETL development process, which takes place in the Data Services Designer application. We will not create any workflow or script object in our test jobs; we will keep things simple and have only one batch job object with a dataflow object inside it performing the migration and pivoting of data from the ACCOUNTBALANCE source table of our OLTP database. Here are the steps to do this:
- Create a new batch job object and place the new dataflow in it, naming it DF_OLTP_Pivot_STAGE_AccountBalance.
- Open the dataflow in the workspace window to edit it, and place the ACCOUNTBALANCE source table from the OLTP datastore created in the preparation steps.
- Link the source table to the Extract query transform, and propagate all the source columns to the target schema.
- Place the new Pivot transform object in a dataflow and link the Extract query to it. The Pivot transform can be found by navigating to Local Object Library | Transforms | Data Integrator.
- Open the Pivot transform in the workspace to edit it, and configure its parameters according to the following screenshot:
- Close the Pivot transform and link it to another query transform named Prepare_to_Load.
- Propagate all the source columns to the target schema of the Prepare_to_Load transform, and finally link it to the target ACCOUNTBALANCE template table created in the DS_STAGE datastore. Choose the dbo schema when creating the ACCOUNTBALANCE template table object in this datastore.
- Before executing the job, open the Prepare_to_Load query transform in a workspace window, double-click on the PIVOT_SEQ column, and select the Primary key checkbox to specify the additional column as being the primary key column for the migrated dataset.
- Save and run the job, selecting the default execution options.
- Open the dataflow again, import the target table, and set the Delete data from table before loading flag in the target table's loading options.
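The effect of the steps above can be sketched outside Data Services as well. The following pure-Python function emulates what the Pivot transform does to the dataset with Q1–Q4 as pivot columns; the output column names QUARTER and AMOUNT are illustrative assumptions for this sketch, matching whatever names you entered in the transform's configuration:

```python
def pivot(rows, non_pivot_cols, pivot_cols,
          header_col="QUARTER", data_col="AMOUNT"):
    """Emulate the Data Services Pivot transform: turn each pivot
    column into its own output row, numbered by PIVOT_SEQ."""
    out = []
    for row in rows:
        for seq, col in enumerate(pivot_cols, start=1):
            new_row = {c: row[c] for c in non_pivot_cols}
            new_row["PIVOT_SEQ"] = seq    # generated sequence column
            new_row[header_col] = col     # name of the original column
            new_row[data_col] = row[col]  # its value
            out.append(new_row)
    return out

# One source row from Sales.AccountBalance, in "wide" shape.
source_rows = [
    {"AccountID": 1, "AccountNumber": 100, "Year": 2015,
     "Q1": 100.00, "Q2": 150.00, "Q3": 120.00, "Q4": 300.00},
]

pivoted = pivot(source_rows,
                non_pivot_cols=["AccountID", "AccountNumber", "Year"],
                pivot_cols=["Q1", "Q2", "Q3", "Q4"])

# Each source row becomes four rows, one per quarter.
print(len(pivoted))  # 4
```

This also shows why PIVOT_SEQ must be added to the key: a single AccountID now appears on four output rows, one per pivoted column.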
How it works…
Pivot columns are columns whose values are merged in one column after the pivoting operation, thus producing an extra row for every pivoted column.
Non-pivot columns are columns that are not affected by the pivot operation.
As you can see, the pivoting operation denormalizes the dataset, generating more rows. This is why ACCOUNTID does not define the uniqueness of the record anymore, and we have to specify the extra key column, PIVOT_SEQ.
So, you may wonder, why pivot? Why don’t we just use data as it is and perform the required operation on data from the columns Q1-Q4?
The answer, in the given example, is very simple: it is much more difficult to perform an aggregation when the amounts are spread across different columns. Instead of summarizing a single column with the sum(AMOUNT) function, we would have to write the sum(Q1 + Q2 + Q3 + Q4) expression every time. Quarters are not even the worst case; imagine a table that stores amounts in a dozen columns defining month periods, and you have to filter data by these time periods.
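The difference is easy to see in a small pure-Python sketch (the column names QUARTER and AMOUNT are the illustrative names assumed for the pivoted output):

```python
# Aggregating the pivoted ("long") shape: a single sum over one column.
pivoted = [
    {"AccountNumber": 100, "QUARTER": "Q1", "AMOUNT": 100.00},
    {"AccountNumber": 100, "QUARTER": "Q2", "AMOUNT": 150.00},
    {"AccountNumber": 100, "QUARTER": "Q1", "AMOUNT": 50.00},
]
total = sum(r["AMOUNT"] for r in pivoted)

# The same total from the original "wide" shape needs every column
# spelled out in the expression.
wide = [
    {"AccountNumber": 100, "Q1": 100.00, "Q2": 150.00},
    {"AccountNumber": 100, "Q1": 50.00, "Q2": 0.00},
]
total_wide = sum(r["Q1"] + r["Q2"] for r in wide)

# Filtering by period is trivial on the long shape: just a predicate
# on the header column, no knowledge of column names required.
q1_total = sum(r["AMOUNT"] for r in pivoted if r["QUARTER"] == "Q1")
print(total, q1_total)  # 300.0 150.0
```

With twelve month columns instead of two quarter columns, the wide-shape expression grows accordingly, while the long-shape query stays unchanged.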
Of course, the opposite case exists as well: sometimes storing data across multiple columns instead of a single one is justified. If your data does not already arrive in that shape, you can use the Reverse_Pivot transform, which does exactly the opposite: it converts rows into columns. Look at the following example of a Reverse_Pivot configuration:
Reverse pivoting, or the transformation of rows into columns, introduces another term: the pivot axis column. This is the column that holds the categories defining the different output columns after a reverse pivot operation. It corresponds to the Header column option in the Pivot transform configuration.
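A Reverse_Pivot can be sketched in the same illustrative style: distinct values of the pivot axis column (here assumed to be named QUARTER) become column names again, one output row per key:

```python
def reverse_pivot(rows, key_cols, axis_col, data_col):
    """Emulate the Reverse_Pivot transform: one output row per key,
    with one column per distinct value of the pivot axis column."""
    out = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Start the output row with the key columns, then attach one
        # column per axis value encountered for this key.
        rec = out.setdefault(key, {c: row[c] for c in key_cols})
        rec[row[axis_col]] = row[data_col]
    return list(out.values())

long_rows = [
    {"AccountID": 1, "QUARTER": "Q1", "AMOUNT": 100.00},
    {"AccountID": 1, "QUARTER": "Q2", "AMOUNT": 150.00},
]
wide = reverse_pivot(long_rows, key_cols=["AccountID"],
                     axis_col="QUARTER", data_col="AMOUNT")
print(wide)  # [{'AccountID': 1, 'Q1': 100.0, 'Q2': 150.0}]
```

Two long rows for the same AccountID collapse back into a single wide row, which is exactly the inverse of the Pivot step performed earlier.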
As you saw in this article, the Pivot and Reverse_Pivot transform objects available in Data Services Designer are a simple and easily configurable way to pivot data of any complexity. The GUI of the Designer tool makes the ETL process developed in Data Services easy to maintain and keeps it readable.
If you make any changes to the pivot configuration options, Data Services automatically regenerates the output schema in pivot transforms accordingly.