Read more about this book |
This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways for combining, splitting, or manipulating streams or flows of data using Kettle transformations. The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.
In this article, we will cover:
(For more resources on this subject, see here.)
The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.
When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.
The dataset that flows from step to step is not more than a set of rows all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name.
Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, you have all steps connected in a row one after the other. In other words, you have the simplest of the transformations from the point of view of its structure. In this case, you don’t have to worry much about the structure of your data stream, nor the origin or destination of the rows. The interesting part comes when you face other situations, for example:
With Kettle, you can actually do this, but you have to be careful because it’s easy to end up doing wrong things and getting unexpected results or even worse: undesirable errors.
With regard to the first example, it doesn’t represent a default behavior due to the parallel nature of the transformations as explained earlier. There are two steps however, that might help, which are as follows:
Both these steps are in the Flow category.
This and the next article on Working with Complex Data Flows focuses on the other two examples and some similar use cases, by explaining the different ways for combining, splitting, or manipulating streams of data.
In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There’s more section, you will also see alternative and more efficient ways for doing the same thing in different scenarios.
Let’s assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from other kind of products, and also create a subclassification of the tents depending on their prices.
Let’s see a sample of this data:
id_product,desc_product,price,category
1,”Swedish Firesteel – Army Model”,19,”kitchen”
2,”Mountain House #10 Can Freeze-Dried Food”,53,”kitchen”
3,”Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch Round
Griddle”,14,”kitchen”
…
To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product.
Carry out the following steps:
The main objective in the recipe is to split a dataset with products depending on their category and price. To do this, you used the Filter rows step.
In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send ‘true’ data to step and Send ‘false’ data to step. The destination steps can be set by using the hop properties as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the name of the destination steps from the available drop-down lists.
You also have to enter the condition. The condition has the following different parts:
Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it.
In the first Filter rows step of the recipe, you typed a simple condition: You compared a field (category) with a fixed value (tents) by using the equal (=) operator. You did this to separate the tents products from the others.
The second filter had the purpose of differentiating the expensive and the cheap tents.
You will find more filter features in the following subsections.
In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheapest tents and you don’t care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the ‘false’ data to a Dummy step, as in shown in the following diagram:
The rows that don’t meet the condition will end at the Dummy step. Although this is a very commonly used solution for keeping just the rows that meet the conditions, there is a simpler way to implement it. When you create the hop from the Filter rows toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send ‘true’ data to step and Send ‘false’ data to step will remain empty. This will cause two things:
The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable?
Let’s assume, for example, you have a named parameter called categ with kitchen as Default Value. As you might know, named parameters are a particular kind of Kettle variable.
You create the named parameters under the Parameter tab from the Settings option of the Edit menu.
To use this variable in a condition, you must add it to your dataset in advance. You do this as follows:
Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps.
Let’s assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step. This way you avoid adding one Filter row step for each category.
In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value—Target step pairs. The following screenshot shows how to fill in the grid for our particular problem:
The following are some considerations about this step:
There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest them and create temporary fields in order to solve the problem, but it would be more efficient if you used the Java Filter or User Defined Java Expression step as explained next.
You can find the Java Filter step in the Flow category. The difference compared to the Filter Rows step is that in this step, you write the condition using a Java expression.
The names of the listboxes—Destination step for matching rows (optional) and Destination step for non-matching rows (optional)—differ from the names in the Filter rows step, but their purpose is the same.
As an example, the following are the conditions you used in the recipe rewritten as Java expressions: category.equals(“tents”) and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result.
If you can’t guarantee that the category will not be null, you’d better invert the first expression and put “tents”.equals(category) instead. By doing this, whenever you have to check if a field is equal to a constant, you avoid an unexpected Java error.
Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows:
Doing this with nested Filter rows steps leads to a transformation like the following:
You can do the same thing in a simpler way:
(category.equals(“tents”))?(price<100?”cheap_tents”:”expensive_
tents”):category
The preceding expression uses the Java ternary operator ?:. If you’re not familiar with the syntax, think of it as shorthand for the if-then-else statement. For example, the inner expression price<100?”cheap_tents”:”expensive_tents” means if (price<100) then return “cheap_tents” else return “expensive_tents”.
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…