Pentaho Data Integration 4 Cookbook

Over 70 recipes to solve ETL problems using Pentaho Kettle

This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways of combining, splitting, and manipulating streams or flows of data using Kettle transformations.

In this article, we will cover:

  • Splitting a stream into two or more streams based on a condition
  • Merging rows from two streams with the same or different structure
  • Comparing two streams and generating differences
  • Generating all possible pairs formed from two datasets


Introduction

The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.

When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.

The dataset that flows from step to step is nothing more than a set of rows, all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name.
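To picture this contract in code, here is a minimal sketch in plain Java (the Column and DatasetSketch names are hypothetical, purely for illustration; Kettle’s actual row API is different):

import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a dataset's metadata: every row carries
// one value for each named, typed column, always in the same order.
class Column {
    final String name;
    final Class<?> type;

    Column(String name, Class<?> type) {
        this.name = name;
        this.type = type;
    }
}

class DatasetSketch {
    static final List<Column> META = Arrays.asList(
            new Column("id_product", Long.class),
            new Column("desc_product", String.class),
            new Column("price", Double.class),
            new Column("category", String.class));

    // A row is simply one value per column, matching the metadata above.
    static final Object[] ROW = { 1L, "Swedish Firesteel - Army Model", 19.0, "kitchen" };
}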

Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, you have all steps connected one after the other in a row. In other words, you have the simplest transformation from a structural point of view. In this case, you don’t have to worry much about the structure of your data stream, nor about the origin or destination of the rows. The interesting part comes when you face other situations, for example:

  • You want a step to start processing rows only after another given step has processed all rows
  • You have more than one stream and you have to combine them into a single stream
  • You have to inject rows in the middle of your stream and those rows don’t have the same structure as the rows in your dataset

With Kettle you can do all of these things, but you have to be careful: it’s easy to end up doing the wrong thing and getting unexpected results or, even worse, errors.

The first example is not the default behavior, due to the parallel nature of transformations explained earlier. There are, however, two steps that can help:

  • Blocking Step: This step blocks processing until all incoming rows have been processed.
  • Block this step until steps finish: This step blocks processing until the selected steps finish.

Both these steps are in the Flow category.
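As a mental model, the following plain-Java sketch shows the buffering idea behind the Blocking Step (the class is hypothetical and only illustrates the behavior; it is not Kettle’s API):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hold every incoming row and release them downstream only once the
// upstream step has delivered its last row.
class BlockingBuffer<R> {
    private final List<R> buffer = new ArrayList<>();

    void onRow(R row) {
        buffer.add(row);          // retain the row instead of passing it on
    }

    Iterator<R> onUpstreamFinished() {
        return buffer.iterator(); // only now do the rows flow downstream
    }
}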

This article and the next one, on Working with Complex Data Flows, focus on the other two examples and some similar use cases, explaining the different ways of combining, splitting, and manipulating streams of data.

Splitting a stream into two or more streams based on a condition

In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There’s more… section, you will also see alternative and more efficient ways of doing the same thing in different scenarios.

Let’s assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from the other kinds of products, and also create a subclassification of the tents depending on their prices.

Let’s see a sample of this data:

id_product,desc_product,price,category
1,"Swedish Firesteel – Army Model",19,"kitchen"
2,"Mountain House #10 Can Freeze-Dried Food",53,"kitchen"
3,"Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch Round Griddle",14,"kitchen"

Getting ready

To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product.

How to do it…

Carry out the following steps:

  1. Create a transformation.
  2. Drag into the canvas a Text file input step and fill in the File tab to read the file named outdoorProducts.txt. If you are using the sample text file, type a comma (,) as the Separator.
  3. Under the Fields tab, use the Get Fields button to populate the grid. Adjust the entries so that the grid looks like the one shown in the following screenshot:

    [Screenshot: the Fields tab grid]

  4. Now, let’s add the steps to manage the flow of the rows. To do this, drag two Filter rows steps from the Flow category. Also, drag three Dummy steps that will represent the three resulting streams.
  5. Create the hops, as shown in the following screenshot. When you create the hops, make sure that you choose the options according to the image: Result is TRUE for creating a hop with a green icon, and Result is FALSE for creating a hop with a red icon in it.

    [Screenshot: two Filter rows steps and three Dummy steps connected by hops]

  6. Double-click on the first Filter rows step and complete the condition, as shown in the following screenshot:

    [Screenshot: the condition category = tents]

  7. Double-click on the second Filter rows step and complete the condition with price < 100.
  8. You have just split the original dataset into three groups. You can verify this by previewing each Dummy step. The first group has the products whose category is not tents; the second, the tents under 100 US$; and the last, the expensive tents: those whose price is 100 US$ or more.
  9. The preview of the last Dummy step will show the following:

    [Screenshot: preview of the last Dummy step showing the expensive tents]

How it works…

The main objective in the recipe is to split a dataset with products depending on their category and price. To do this, you used the Filter rows step.

In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send ‘true’ data to step and Send ‘false’ data to step. The destination steps can be set by using the hop properties as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the name of the destination steps from the available drop-down lists.

You also have to enter the condition. The condition has the following different parts:

[Screenshot: the parts of the Filter rows condition editor]

  • The upper-left box is meant to negate the condition.
  • The box on the left is meant to select the field that will be used for the comparison.
  • Then, you have a list of possible comparators to choose from.
  • On the right, you have two textboxes: the upper one for comparing against another field and the lower one for comparing against a constant value.

Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it.

In the first Filter rows step of the recipe, you typed a simple condition: you compared a field (category) with a fixed value (tents) by using the equal (=) operator. You did this to separate the tents from the other products.

The second filter had the purpose of differentiating the expensive and the cheap tents.
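If it helps to see this flow logic outside of Spoon, the following plain-Java sketch mirrors what the two Filter rows steps decide for each row (the Product class and the stream labels are made up for illustration):

// Hypothetical row type standing in for the fields read from the file.
class Product {
    String category;
    double price;

    Product(String category, double price) {
        this.category = category;
        this.price = price;
    }
}

class StreamSplitter {
    // Mirrors the recipe: first filter on category, then filter on price.
    static String route(Product p) {
        if (!"tents".equals(p.category)) {
            return "other products";  // first filter: 'false' hop
        } else if (p.price < 100) {
            return "cheap tents";     // second filter: 'true' hop
        } else {
            return "expensive tents"; // second filter: 'false' hop
        }
    }

    public static void main(String[] args) {
        System.out.println(route(new Product("kitchen", 19))); // other products
        System.out.println(route(new Product("tents", 80)));   // cheap tents
        System.out.println(route(new Product("tents", 150)));  // expensive tents
    }
}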

There’s more…

You will find more filter features in the following subsections.

Avoiding the use of Dummy steps

In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheapest tents and you don’t care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the ‘false’ data to a Dummy step, as shown in the following diagram:

[Screenshot: a single Filter rows step sending the ‘false’ data to a Dummy step]

The rows that don’t meet the condition will end up at the Dummy step. Although this is a very common solution for keeping just the rows that meet the condition, there is a simpler way to implement it. When you create the hop from the Filter rows step toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send ‘true’ data to step and Send ‘false’ data to step will remain empty. This will cause two things:

  1. Only the rows that meet the condition will pass.
  2. The rest will be discarded.
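Per row, the combined condition boils down to a single Boolean check. Here is a minimal sketch in plain Java (class and method names are invented for illustration):

class CheapTentsFilter {
    // Combined condition from the diagram: keep only the cheap tents.
    // Rows for which this returns false are simply discarded, which is
    // what leaving both destination options empty achieves.
    static boolean keep(String category, double price) {
        return "tents".equals(category) && price < 100;
    }

    public static void main(String[] args) {
        System.out.println(keep("tents", 80));   // true  -> row passes
        System.out.println(keep("tents", 150));  // false -> row discarded
        System.out.println(keep("kitchen", 19)); // false -> row discarded
    }
}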

Comparing against the value of a Kettle variable

The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable?

Let’s assume, for example, that you have a named parameter called categ with kitchen as its Default Value. As you might know, named parameters are a particular kind of Kettle variable.

You create the named parameters under the Parameter tab of the Settings option in the Edit menu.

To use this variable in a condition, you must add it to your dataset in advance. You do this as follows:

  1. Add a Get Variables step from the Job category. Put it in the stream after the Text file input step and before the Filter rows step; use it to create a new field named categ of String type, with the value ${categ} in the Variable column.
  2. Now, the transformation looks like the one shown in the following screenshot:

    [Screenshot: the transformation with the Get Variables step added]

  3. After this, you can set the condition of the first Filter rows step to category = categ, selecting categ from the listbox of fields to the right. This way, you will be filtering the kitchen products.
  4. If you run the transformation and set the parameter to tents, you will obtain similar results to those that were obtained in the main recipe.
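Conceptually, the Get Variables and Filter rows combination does something like the following plain-Java sketch (a system property stands in for the ${categ} named parameter; all names are illustrative):

import java.util.Arrays;
import java.util.List;

class VariableFilterSketch {
    public static void main(String[] args) {
        // Stand-in for the ${categ} named parameter (kitchen by default).
        String categ = System.getProperty("categ", "kitchen");

        List<String> categories = Arrays.asList("kitchen", "tents", "kitchen");

        for (String category : categories) {
            // Get Variables attaches categ to every row as a new field;
            // Filter rows then performs a plain field-to-field comparison.
            if (category.equals(categ)) {
                System.out.println("row passes: " + category);
            }
        }
    }
}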

Avoiding the use of nested Filter Rows steps

Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps.

Let’s assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step. This way, you avoid adding one Filter rows step for each category.

In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value—Target step pairs. The following screenshot shows how to fill in the grid for our particular problem:

[Screenshot: the Case values grid of the Switch / Case step]

The following are some considerations about this step:

  • You can have multiple values directed to the same target step
  • You can leave the value column blank to specify a target step for empty values
  • You have a listbox named Default target step to specify the target step for rows that do not match any of the case values
  • You can only compare with an equal operator
  • If you want to compare against a substring of the field, you can enable the Use string contains option and, as the Case Value, type the substring you are interested in. For example, if for Case Value you type tent, then all categories containing tent, such as tent_large, tent_small, or best_tents, will be redirected to the same target step.
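The semantics of the step map directly onto a Java switch over the selected field. Here is a minimal sketch (the value and target names are invented for illustration):

class CategoryRouter {
    // One case per row in the Case values grid, plus the default target.
    static String targetStep(String category) {
        switch (category) {
            case "tents":   return "Tents stream";
            case "kitchen": return "Kitchen stream";
            case "lights":  return "Kitchen stream"; // several values may share a target
            default:        return "Default target step";
        }
    }

    public static void main(String[] args) {
        System.out.println(targetStep("tents")); // Tents stream
        System.out.println(targetStep("tools")); // Default target step
    }
}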

Overcoming the difficulties of complex conditions

There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest several of them and create temporary fields in order to solve the problem, but it is more efficient to use the Java Filter or User Defined Java Expression steps, as explained next.

You can find the Java Filter step in the Flow category. The difference compared to the Filter rows step is that in this step, you write the condition as a Java expression.

The names of the listboxes—Destination step for matching rows (optional) and Destination step for non-matching rows (optional)—differ from the names in the Filter rows step, but their purpose is the same.

As an example, the following are the conditions you used in the recipe, rewritten as Java expressions: category.equals("tents") and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result.

If you can’t guarantee that the category will never be null, you’d better invert the first expression and put "tents".equals(category) instead. By writing it this way whenever you have to check whether a field is equal to a constant, you avoid an unexpected NullPointerException when the field is null.
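The following short sketch shows why the inverted form is safer:

class NullSafeComparison {
    public static void main(String[] args) {
        String category = null; // e.g., a missing value in the incoming row

        System.out.println("tents".equals(category)); // false: safe even when null
        // category.equals("tents") would throw a NullPointerException here.
    }
}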

Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows:

[Screenshot: the desired category mapping: tents priced under 100 become cheap_tents, other tents become expensive_tents, and all other categories stay unchanged]

Doing this with nested Filter rows steps leads to a transformation like the following:

[Screenshot: the transformation implemented with nested Filter rows steps]

You can do the same thing in a simpler way:

  1. Replace all the steps but the Text file input with a User Defined Java Expression step located in the Scripting category.
  2. In the setting window of this step, add a row in order to replace the value of the category field: type category both as the New field and as the Replace value. As Value type, select String. As Java expression, type the following:

    (category.equals("tents"))?(price<100?"cheap_tents":"expensive_tents"):category

    The preceding expression uses the Java ternary operator ?:. If you’re not familiar with the syntax, think of it as shorthand for an if-then-else statement. For example, the inner expression price<100?"cheap_tents":"expensive_tents" means if (price<100) then return "cheap_tents" else return "expensive_tents". You will find the whole expression unfolded into an equivalent method after these steps.

  3. Do a preview on this step. You will see something similar to the following:

[Screenshot: preview of the User Defined Java Expression step]
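As promised, here is the expression from step 2 unfolded into an equivalent, ordinary Java method (for illustration only; the step itself requires a single expression, which is why the ternary form is used there):

class CategoryRename {
    // The same decision as the ternary expression, written with if-then-else.
    static String newCategory(String category, double price) {
        if ("tents".equals(category)) {
            if (price < 100) {
                return "cheap_tents";
            }
            return "expensive_tents";
        }
        return category; // every other category is left unchanged
    }

    public static void main(String[] args) {
        System.out.println(newCategory("tents", 53));   // cheap_tents
        System.out.println(newCategory("tents", 199));  // expensive_tents
        System.out.println(newCategory("kitchen", 19)); // kitchen
    }
}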
