Data Analytics is the art of taking data and deriving information from it in order to make informed decisions. A large part of building and validating datasets for the decision making process is data integration—the moving, cleansing, and transformation of data from the source to a target. This article will focus on some of the tools that take Kettle beyond the normal data processing capabilities and integrate processes into analytical tools.
Reading data from a SAS datafile
SAS is one of the leading analytics suites, providing robust commercial tools for decision making in many different fields. Kettle can read files written in SAS’ specialized data format known as sas7bdat using a new (since Version 4.3) input step called SAS Input. While SAS does support other format types (such as CSV and Excel), sas7bdat is a format most similar to other analytics packages’ special formats (such as Weka’s ARFF file format). This recipe will show you how to do it.
Why read a SAS file?
There are two main reasons for wanting to read a SAS file as part of a Kettle process. The first is that a dataset created by a SAS program is already in place, but the output of this process is used elsewhere in other Business Intelligence solutions (for instance, using the output for integration into reports, visualizations, or other analytic tools). The second is when there is already a standard library of business logic and rules built in Kettle that the dataset needs to run through before it can be used.
To be able to use the SAS Input step, a sas7bdat file will be required. The Centers for Disease Control and Prevention have some sample datasets as part of the NHANES Dietary dataset. Their tutorial datasets can be found at their website at http://www.cdc.gov/nchs/tutorials/dietary/downloads/downloads.htm. We will be using the calcmilk.sas7bdat dataset for this recipe.
How to do it…
Perform the following steps to read in the calcmilk.sas7bd dataset:
Open Spoon and create a new transformation.
From the input folder of the Design pallet, bring over a Get File Names step.
Open the Get File Names step. Click on the Browse button and find the calcmilk. sas7bd file downloaded for the recipe and click on OK.
From the input folder of the Design pallet, bring over a SAS Input step. Create a hop from the Get File Names step to the SAS Input step.
Open the SAS Input step. For the Field in the input to use as filename field, select the Filename field from the dropdown.
Click on Get Fields. Select the calcmilk.sas7bd file and click on OK.
To clean the stream up and only have the calcmilk data, add a Select Values step and add a hop between the SAS Input step to the Select Values step. Open the Select Values step and switch to the Remove tab. Select the fields generated from the Get File Names step (filename, short_filename, path, and so on). Click on OK to close the step.
Preview the Select Values step. The data from the SAS Input step should appear in a data grid, as shown in the following screenshot:
If you are using Version 4.4 of Kettle, you will receive a java.lang.NoClassDefFoundError message. There is a work around which can be found on the Pentaho wiki at http://wiki.pentaho.com/display/EAI/SAS+Input.
How it works…
The SAS Input step takes advantage of Kasper Sørensen’s Sassy Reader project (http://sassyreader.eobjects.org). Sassy is a Java library used to read datasets in the sas7bdat format and is derived from the R package created by Matt Shotwell (https://github.com/BioStatMatt/sas7bdat). Before those projects, it was not possible to read the proprietary file format outside of SAS’ own tools.
The SAS Input step requires the processed filenames to be provided from another step (like the Get File Names step). Also of note, while the sas7bdat format only has two format types (strings and numbers), PDI is able to convert fields to any of the built-in formats (dates, integers, and so on).