SQL Server Integration Services (SSIS)


SSIS as an ETL (extract, transform, and load) tool

The primary objective of an ETL tool is to import and export data between heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while it moves between those systems and the databases. SSIS can be used to import data into SQL Server and export data from it. It can even move data between external non-SQL systems without SQL Server being either the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file.
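The extract, transform, and load stages described above can be sketched outside SSIS in a few lines of Python. This is only an illustration of the pattern, not how SSIS implements it: the inline CSV text stands in for a flat-file source, an in-memory SQLite database stands in for the destination, and the names SOURCE_CSV, sales, and run_etl are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file content; a real source could be a file or an FTP download.
SOURCE_CSV = "id,name,amount\n1, alice ,10.5\n2,BOB,\n3,carol,7\n"


def extract(text):
    """Extract: read the CSV source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and normalise names, default missing amounts to 0.0."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        amount = float(row["amount"] or 0.0)
        cleaned.append((int(row["id"]), name, amount))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order and return what landed in the destination."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

Calling run_etl(SOURCE_CSV, sqlite3.connect(":memory:")) returns the cleaned rows as stored in the destination. Neither end of the flow has to be SQL Server, which is the point the paragraph above makes about SSIS.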

SSIS also provides a workflow engine for automating the different tasks (for example, data flows, task executions, and so on) that are executed in an ETL job. An SSIS package execution can itself be one step of a SQL Agent job, and SQL Agent can run multiple jobs independently of each other.

An SSIS solution consists of one or more packages, each containing a control flow that performs a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, automation of command-line commands, and others. In particular, a control flow usually includes one or more data flow tasks, each of which encapsulates an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has exactly one control flow and as many data flows as necessary. The control flow dictates when each data flow executes.
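The relationship between a control flow and a data flow can be illustrated with a small Python sketch. It is a conceptual model only and makes no claim about the SSIS engine's internals: the function names, the buffer size, and the is_large column are all assumptions made for the example. The control flow runs tasks in order, and one of those tasks is a data flow that pushes rows from a source through a transformation to a destination in fixed-size in-memory buffers.

```python
from itertools import islice


def source(n):
    """Source component: yields rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 10}


def buffers(rows, size):
    """Group rows into fixed-size in-memory buffers (plain lists)."""
    it = iter(rows)
    while True:
        buf = list(islice(it, size))
        if not buf:
            return
        yield buf


def add_flag(buffer_stream):
    """Transformation: add a derived column to every row, one buffer at a time."""
    for buf in buffer_stream:
        yield [dict(row, is_large=row["value"] >= 30) for row in buf]


def data_flow_task(destination, n_rows=5, buffer_size=2):
    """Data flow: source -> transformation -> destination; returns buffers processed."""
    count = 0
    for buf in add_flag(buffers(source(n_rows), buffer_size)):
        destination.extend(buf)  # destination component: append the buffer
        count += 1
    return count


def run_control_flow(tasks):
    """Control flow: run the named tasks in order and record which ones ran."""
    log = []
    for name, task in tasks:
        task()
        log.append(name)
    return log
```

A package in this model is just the list passed to run_control_flow, for example a hypothetical "prepare" task that clears the destination followed by a "data flow" task that calls data_flow_task. The data flow runs only when the control flow reaches it.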

A detailed discussion of SSIS and its components is outside the scope of this article, which assumes that you are familiar with basic SSIS package development using Business Intelligence Development Studio (SQL Server 2005/2008/2008 R2) or SQL Server Data Tools (SQL Server 2012). If you are a beginner in SSIS, it is highly recommended that you first read one of the many good SSIS books available. In the rest of this article, we will focus on how to consume Hive data from SSIS using the Hive ODBC driver.

The prerequisites for developing the package shown in this article are SQL Server Data Tools (which comes as a part of SQL Server 2012 Client Tools and Components) and the 32-bit Hive ODBC Driver. You will also need your Hadoop cluster up and running, with Hive running on it.

Developing the package

SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, develop, and deploy SSIS packages. SSDT is installed when you choose to install SQL Server Client Tools and Workstation Components from your SQL Server installation media. SSDT supports the creation of Integration Services, Analysis Services, and Reporting Services projects. Here, we will focus on the Integration Services project type.

Creating the project

Launch SQL Server Data Tools from SQL Server 2012 Program folders as shown in the following screenshot:
Create a new Project and choose Integration Services Project in the New Project dialog as shown in the following screenshot:
This should create the SSIS project with a blank Package.dtsx inside it, visible in the Solution Explorer window of the project, as shown in the following screenshot:

Creating the Data Flow

A Data Flow is an SSIS package component that consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. Before you can add a data flow to a package, the package control flow must include a Data Flow task. The Data Flow task is the executable within the SSIS package that creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. To create a Data Flow task, perform the following steps:

Double-click on (or drag-and-drop) a Data Flow Task from the toolbox on the left. This should place a Data Flow Task in the Control Flow canvas of the package as in the following screenshot:
Double-click on the Data Flow Task or click on the Data Flow tab in SSDT to edit the task and design the source and destination components as in the following screenshot:

Creating the source Hive connection

The first thing we need to do is create a connection manager that will connect to our Hive data tables hosted in the Hadoop cluster. We will use an ADO.NET connection, which will use the DSN HadoopOnLinux we created earlier to connect to Hive. To create the connection, perform the following steps:

Right-click on the Connection Managers section in the project and click on New ADO.Net Connection… as shown in the following screenshot:
From the list of providers, navigate to .Net Providers | ODBC Data Provider and click on OK in the Connection Manager window as shown in the following screenshot:
Select the HadoopOnLinux DSN from the Data Sources list. Provide the Hadoop cluster credentials and click on Test Connection; the test should succeed as shown in the following screenshot:
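Before wiring the DSN into SSIS, it can be useful to check it from a script. The following is a minimal sketch that assumes the third-party pyodbc package is installed and that the Python interpreter has the same bitness as the installed Hive ODBC driver. The helper names and the credentials in the usage note are hypothetical; only the HadoopOnLinux DSN name comes from this article. The connection is opened with autocommit enabled on the assumption that the Hive driver does not support transactions.

```python
def build_connection_string(dsn, user, password):
    """Build a DSN-based ODBC connection string."""
    return f"DSN={dsn};UID={user};PWD={password}"


def fetch_rows(conn_str, query):
    """Run a query over ODBC and return all rows.

    pyodbc is imported lazily so the module still loads where it is not installed.
    """
    import pyodbc  # third-party package, assumed to be installed

    # autocommit=True: assumption that the Hive driver does not support transactions.
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()
```

For example, fetch_rows(build_connection_string("HadoopOnLinux", "your_user", "your_password"), "SHOW TABLES") should list the Hive tables if the DSN and credentials are valid. A failure here points to a DSN or driver problem rather than an SSIS one.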