12 min read

 

Pentaho Data Integration 4 Cookbook

Pentaho Data Integration 4 CookbookOver 70 recipes to solve ETL problems using Pentaho Kettle      

Introduction

Kettle, also known as PDI, is mostly used as a stand-alone application. However, it is not an isolated tool, but part of the Pentaho Business Intelligence Suite. As such, it can also interact with other components of the suite; for example, as the datasource for a report, or as part of a bigger process. This chapter shows you how to run Kettle jobs and transformations in that context.

The article assumes a basic knowledge of the Pentaho BI platform and the tools that made up the Pentaho Suite. If you are not familiar with these tools, it is recommended that you visit the wiki page (wiki.pentaho.com) or the Pentaho BI Suite Community Edition (CE) site: http://community.pentaho.com/.

As another option, you can get the Pentaho Solutions book (Wiley) by Roland Bouman and Jos van Dongen that gives you a good introduction to the whole suite.

A sample transformation

The different recipes in this article show you how to run Kettle transformations and jobs integrated with several components of the Pentaho BI suite. In order to focus on the integration itself rather than on Kettle development, we have created a sample transformation named weather.ktr that will be used through the different recipes.

The transformation receives the name of a city as the first parameter from the command line, for example Madrid, Spain. Then, it consumes a web service to get the current weather conditions and the forecast for the next five days for that city. The transformation has a couple of named parameters:

Integrating Kettle and the Pentaho Suite

The following diagram shows what the transformation looks like:

Integrating Kettle and the Pentaho Suite

It receives the command-line argument and the named parameters, calls the service, and retrieves the information in the desired scales for temperature and wind speed.

You can download the transformation from the book’s site and test it. Do a preview on the next_days, current_conditions, and current_conditions_normalized steps to see what the results look like.

The following is a sample preview of the next_days step:

Integrating Kettle and the Pentaho Suite

The following is a sample preview of the current_conditions step:

Integrating Kettle and the Pentaho Suite

Finally, the following screenshot shows you a sample preview of the current_conditions_normalized step:

Integrating Kettle and the Pentaho Suite

There is also another transformation named weather_np.ktr. This transformation does exactly the same, but it reads the city as a named parameter instead of reading it from the command line. The Getting ready sections of each recipe will tell you which of these transformations will be used.

Avoiding consuming the web service
It may happen that you do not want to consume the web service (for example, for delay reasons), or you cannot do it (for example, if you do not have Internet access). Besides, if you call a free web service like this too often, then your IP might be banned from the service. Don’t worry. Along with the sample transformations on the book’s site, you will find another version of the transformations that instead of using the web service, reads sample fictional data from a file containing the forecast for over 250 cities. The transformations are weather (file version).ktr and weather_np (file version).ktr. Feel free to use these transformations instead. You should not have any trouble as the parameters and the metadata of the data retrieved are exactly the same as in the transformations explained earlier.

If you use transformations that do not call the web service, remember that they rely on the file with the fictional data (weatheroffline.txt). Wherever you copy the transformations, do not forget to copy that file as well.

Creating a Pentaho report with data coming from PDI

The Pentaho Reporting Engine allows designing, creating, and distributing reports in various popular formats (HTML, PDF, and so on) from different kind of sources (JDBC, OLAP, XML, and so on).

There are occasions where you need other kinds of sources such as text files or Excel files, or situations where you must process the information before using it in a report. In those cases, you can use the output of a Kettle transformation as the source of your report. This recipe shows you this capability of the Pentaho Reporting Engine.

For this recipe, you will develop a very simple report: The report will ask for a city and a temperature scale and will report the current conditions in that city. The temperature will be expressed in the selected scale.

Getting ready

A basic understanding of the Pentaho Report Designer tool is required in order to follow this recipe. You should be able to create a report, add parameters, build a simple report, and preview the final result.

Regarding the software, you will need the Pentaho Report Designer. You can download the latest version from the following URL:

http://sourceforge.net/projects/pentaho/files/Report%20Designer/

You will also need the sample transformation weather.ktr.

The sample transformation has a couple of UDJE steps. These steps rely on the Janino library. In order to be able to run the transformation from Report Designer, you will have to copy the janino.jar file from the Kettle libext directory into the Report Designer lib directory.

How to do it…

In the first part of the recipe, you will create the report and define the parameters for the report: the city and the temperature scale.

  1. Launch Pentaho Report Designer and create a new blank report.
  2. Add two mandatory parameters: A parameter named city_param, with Lisbon, Portugal as Default Value and a parameter named scale_param which accepts two possible values: C meaning Celsius or F meaning Fahrenheit.

Now, you will define the data source for the report:

  1. In the Data menu, select Add Data Source and then Pentaho Data Integration.
  2. Click on the Add a new query button. A new query named Query 1 will be added. Give the query a proper name, for example, forecast.
  3. Click on the Browse button. Browse to the sample transformation and select it. The Steps listbox will be populated with the names of the steps in the transformation.
  4. Select the step current_conditions. So far, you have the following:

    Integrating Kettle and the Pentaho Suite

    The specification of the transformation file name with the complete path will work only inside Report Designer. Before publishing the report, you should edit the file name (C:Pentahoreportingweather.ktr in the preceding example) and leave just a path relative to the directory where the report is to be published (for example, reportsweather.ktr).

  5. Click on Preview; you will see an empty resultset. The important thing here is that the headers should be the same as the output fields of the current_conditions step: city, observation_time, weatherDesc, and so on.
  6. Now, close that window and click on Edit Parameters.
  7. You will see two grids: Transformation Parameter and Transformation Arguments. Fill in the grids as shown in the following screenshot. You can type the values or select them from the available drop-down lists:

    Integrating Kettle and the Pentaho Suite

  8. Close the Pentaho Data Integration Data Source window. You should have the following:

    Integrating Kettle and the Pentaho Suite

    The data coming from Kettle is ready to be used in your report.

  9. Build the report layout: Drag and drop some fields into the canvas and arrange them as you please. Provide a title as well. The following screenshot is a sample report you can design:

    Integrating Kettle and the Pentaho Suite

  10. Now, you can do a Print Preview. The sample report above will look like the one shown in the following screenshot:

    Integrating Kettle and the Pentaho Suite

Note that the output of the current_condition step has just one row.

If for data source you choose the next_days or the current_condition_normalized step instead, then the result will have several rows. In that case, you could design a report by columns: one column for each field.

How it works…

Using the output of a Kettle transformation as the data source of a report is very useful because you can take advantage of all the functionality of the PDI tool. For instance, in this case you built a report based on the result of consuming a web service. You could not have done this with Pentaho Report Designer alone.

In order to use the output of your Kettle transformation, you just added a Pentaho Data Integration datasource. You selected the transformation to run and the step that would deliver your data.

In order to be executed, your transformation needs a command-line parameter: the name of the city. The transformation also defines two named parameters: the temperature scale and the wind scale. From the Pentaho Report Designer you provided both—a value for the city and a value for the temperature scale. You did it by filling in the Edit Parameter setting window inside the Pentaho Data Integration Data Source window. Note that you did not supply a value for the SPEED named parameter, but that is not necessary because Kettle uses the default value.

As you can see in the recipe, the data source created by the report engine has the same structure as the data coming from the selected step: the same fields with the same names, same data types, and in the same order.

Once you configured this data source, you were able to design your report as you would have done with any other kind of data source.

Finally, when you are done and want to publish your report on the server, do not forget to fix the path as explained in the recipethe File should be specified with a path relative to the solution folder. For example, suppose that your report will be published in my_solution/reports, and you put the transformation file in my_solution/reports/resources. In that case, for File, you should type resources/ plus the name of the transformation.

There’s more…

Pentaho Reporting is a suite of Java projects built for report generation. The suite is made up of the Pentaho Reporting Engine and a set of tools such as the Report Designer (the tool used in this recipe), Report Design Wizard, and Pentaho’s web-based Ad Hoc Reporting user interface.

In order to be able to run transformations, the Pentaho Reporting software includes the Kettle libraries. To avoid any inconvenience, be sure that the versions of the libraries included are the same or newer than the version of Kettle you are using. For instance, Pentaho Reporting 3.8 includes Kettle 4.1.2 libraries. If you are using a different version of Pentaho Reporting, then you can verify the Kettle version by looking in the lib folder inside the reporting installation folder. You should look for files named kettle-core-<version>.jar, kettle-db-<version>.jar, and kettle-engine-<version>.jar.

Besides, if the transformations you want to use as data sources rely on external libraries, then you have to copy the proper jar files from the Kettle libext directory into the Report Designer lib folder, just as you did with the janino.jar file in the recipe.

For more information about Pentaho Reporting, just visit the following wiki website:

http://wiki.pentaho.com/display/Reporting/Pentaho+Reporting+Community+Documentation

Alternatively, you can get the book Pentaho Reporting 3.5 for Java Developers (Packt Publishing) by Will Gorman.

Configuring the Pentaho BI Server for running PDI jobs and transformations

Configuring the Pentaho BI Server for running PDI jobs and transformations

The Pentaho BI Server is a collection of software components that provide the architecture and infrastructure required to build business intelligence solutions. With the Pentaho BI Server, you are able to run reports, visualize dashboards, schedule tasks, and more. Among these tasks, there is the ability to run Kettle jobs and transformations. This recipe shows you the minor changes you might have to make in order to be able to run Kettle jobs and transformations.

Getting ready

In order to follow this recipe, you will need some experience with the Pentaho BI Server.

For configuring the Pentaho BI server, you obviously need the software. You can download the latest version of the Pentaho BI Server from the following URL:

http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/

Make sure you download the distribution that matches your platform.

If you intend to run jobs and transformations from a Kettle repository, then make sure you have the name of the repository and proper credentials (user and password).

How to do it…

Carry out the following steps:

  1. If you intend to run a transformation or a job from a file, skip to the How it works section.
  2. Edit the settings.xml file located in the biserver-cepentaho-solutionssystemkettle folder inside the Pentaho BI Server installation folder.
  3. In the repository.type tag, replace the default value files with rdbms. Provide the name of your Kettle repository and the user and password, as shown in the following example:

    <kettle-repository>
    <!– The values within <properties> are passed directly to the
    Kettle Pentaho components. –>
    <!– This is the location of the Kettle repositories.xml file,
    leave empty if the default is used: $HOME/.kettle/repositories.xml
    –>
    <repositories.xml.file></repositories.xml.file>
    <repository.type>rdbms</repository.type>
    <!– The name of the repository to use –>
    <repository.name>pdirepo</repository.name>
    <!– The name of the repository user –>
    <repository.userid>dev</repository.userid>
    <!– The password –>
    <repository.password>1234</repository.password>
    </kettle-repository>

    
    
  4. Start the server. It will be ready to run jobs and transformations from your Kettle repository.

How it works…

If you want to run Kettle transformations and jobs, then the Pentaho BI server already includes the Kettle libraries. The server is ready to run both jobs and transformations from files. If you intend to use a repository, then you have to provide the repository settings. In order to do this, you just have to edit the settings.xml file, as you did in the recipe.

There’s more…

To avoid any inconvenience, be sure that the version of the libraries included are the same or newer than the version of Kettle you are using. For instance, Pentaho BI Server 3.7 includes Kettle 4.1 libraries. If you are using a different version of the server, then you can verify the Kettle version by looking in the following folder:

biserver-cetomcatwebappspentahoWEB-INFlib

This folder is inside the server installation folder. You should look for files named kettlecore-TRUNK-SNAPSHOT .jar, kettle-db-TRUNK-SNAPSHOT.jar, and kettleengine-TRUNK-SNAPSHOT.jar.

Unzip any of them and look for the META-INFMANIFEST.MF file. There, you will find the Kettle version. You will see a line like this: Implementation-Version: 4.1.0.

There is even an easier way: In the Pentaho User Console (PUC), look for the option 2. Get Environment Information inside the Data Integration with Kettle folder of the BI Developer Examples solution; run it and you will get detailed information about the Kettle environment.

For your information, the transformation that is run behind the scenes is GetPDIEnvironment.ktr located in the biservercepentaho-solutionsbi-developersetl folder.

LEAVE A REPLY

Please enter your comment!
Please enter your name here