
Joining two or more streams based on given conditions

There are occasions when you will need to join two datasets. If you are working with databases, you can use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution.

Kettle provides the Merge Join step to join data coming from any kind of source.

Let’s assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs, and you want to compare the two to see how the work is going.

Getting ready

To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls has the real starting date, end date, and cost for tasks that have already started.

You can download the sample files from here.

How to do it…

Carry out the following steps:

  1. Create a new transformation.
  2. Drop two Excel input steps into the canvas.
  3. Use one step for reading the budget information (budget.xls file) and the other for reading the costs information (costs.xls file).
  4. Under the Fields tab of these steps, click on the Get fields from header row… button in order to populate the grid automatically. Apply the format dd/MM/yyyy to the fields of type Date and $0.00 to the fields with costs.
  5. Add a Merge Join step from the Join category, and create a hop from each Excel input step toward this step. The following diagram depicts what you have so far:

    [Diagram: two Excel input steps connected by hops to a Merge Join step]

  6. Configure the Merge Join step, as shown in the following screenshot:

    [Screenshot: Merge Join step configuration]

  7. If you do a preview on this step, you will obtain the result of merging the two Excel files. To organize the columns, add a Select values step from the Transform category. In this new step, select the fields in this order: task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost.
  8. If you do a preview on the last step, you will obtain the merged data with the columns of both Excel files interspersed, as shown in the following screenshot:

[Screenshot: preview of the merged data]

How it works…

In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input.

In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task field.

The Merge Join step expects the rows of both incoming streams to be sorted in ascending order on the specified key fields. If your input is not already sorted, add a Sort rows step before the Merge Join step.
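The join performed in the recipe can be sketched in plain Python. This is a simplified illustration with made-up task data, not Kettle's implementation (which walks the two sorted streams directly); here a lookup table stands in for the right-hand stream:

```python
# Made-up rows standing in for budget.xls (left) and costs.xls (right),
# both sorted ascending on the key field "task".
budget = [
    {"task": "Floor", "cost (est.)": 5000},
    {"task": "Roof", "cost (est.)": 8000},
    {"task": "Walls", "cost (est.)": 3000},
]
costs = [
    {"task": "Floor", "cost": 5400},
    {"task": "Walls", "cost": 2900},
]

def left_outer_join(left, right, key):
    """LEFT OUTER JOIN: keep every left row; add right fields when the key matches."""
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key], {})  # unmatched left rows get no extra fields
        joined.append({**row, **match})
    return joined

merged = left_outer_join(budget, costs, "task")
```

The Roof task has not started yet, so its row comes out with the estimated cost only, which is exactly the behavior the LEFT OUTER JOIN setting gives you in the recipe.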

There’s more…

In the example, you set the Join Type to LEFT OUTER JOIN. The possible join options are explained below:

[Table: explanation of the INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER join types]
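In terms of which rows survive, the four join types behave like set operations on the key values. A minimal sketch with hypothetical task names:

```python
# Hypothetical key sets: tasks present in each input stream.
budget_tasks = {"Floor", "Roof", "Walls"}  # left stream (budget.xls)
cost_tasks = {"Floor", "Walls", "Paint"}   # right stream (costs.xls)

# INNER: only tasks found in both streams
inner = budget_tasks & cost_tasks
# LEFT OUTER: every budget task; cost fields are null where unmatched
left_outer = budget_tasks
# RIGHT OUTER: every cost task; budget fields are null where unmatched
right_outer = cost_tasks
# FULL OUTER: every task from either stream
full_outer = budget_tasks | cost_tasks
```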

Interspersing new rows between existent rows

In most Kettle datasets, all rows share a common meaning; they represent the same kind of entity, for example:

  • In a dataset with sold items, each row has data about one item
  • In a dataset with the mean temperature for a range of days in five different regions, each row has the mean temperature for a different day in one of those regions
  • In a dataset with a list of people ordered by age range (0-10, 11-20, 21-40, and so on), each row has data about one person

Sometimes, there is a need to intersperse new rows between your current rows. Taking the previous examples, imagine the following situations:

  • In the sold items dataset, every 10 items, you have to insert a row with the running quantity of items and running sold price from the first line until that line.
  • In the temperatures dataset, you have to order the data by region, and the last row for each region has to have the average temperature for that region.
  • In the people’s dataset, for each age range, you have to insert a header row just before the rows of people in that range.

In general, the rows you need to intersperse can contain fixed data, subtotals of the numbers in previous rows, headers for the rows that follow, and so on. What they have in common is that they have a different structure or meaning compared to the rows in your dataset.

Interspersing these rows is not a complicated task, but it is a tricky one. In this recipe, you will learn how to do it.

Suppose that you have to create a list of products by category. For each category, you have to insert a header row with the category description and the number of products inside that category.

The final result should be as follows:

[Screenshot: product list with an interspersed header row for each category]

Getting ready

This recipe uses an outdoor database with the structure shown in Appendix, Data Structures (Download here). As source, you can use a database like this or any other source, for example a text file with the same structure.

How to do it…

Carry out the following steps:

  1. Create a transformation and drag a Table Input step into the canvas. Select the connection to the outdoor database, or create it if it doesn’t exist. Then enter the following statement:

    SELECT category
         , desc_product
    FROM products p
    JOIN categories c ON p.id_category = c.id_category
    ORDER BY category

  2. Do a preview of this step. You already have the product list!
  3. Now, you have to create and intersperse the header rows. To create the headers, add a Group by step from the Statistics category and fill in the grids, as shown in the following screenshot:

    [Screenshot: Group by step configuration]

  4. From the Scripting category, add a User Defined Java Expression step, and use it to add two fields: the first will be a String named desc_product, with the value ("Category: " + category).toUpperCase(); the second will be an Integer field named order with the value 1.
  5. Use a Select values step to reorder the fields as category, desc_product, qty_prod, and order. Do a preview on this step; you should see the following result:

    [Screenshot: preview of the header rows]

  6. Those are the headers. The next step is mixing all the rows in the proper order. Drag an Add constants step and a Sort rows step into the canvas, and link them to the other steps as shown:

    [Screenshot: transformation with the Add constants and Sort rows steps linked]

  7. Use the Add constants step to add two Integer fields: qty_prod and order. Leave the Value of the first field empty, and type 2 for the second one.
  8. Use the Sort rows step for sorting by category, order, and desc_product.
  9. Select the last step and do a preview. You should see the rows exactly as shown in the introduction.

How it works…

When you have to intersperse rows between existing rows, there are just four main tasks to do, as follows:

  1. Create a secondary stream that will be used for creating new rows. In this case, the rows with the headers of the categories.
  2. In each stream, add a field that will help you intersperse rows in the proper order. In this case, the key field was named order.
  3. Before joining the two streams, add, remove, and reorder the fields in each stream to make sure that the output fields in each stream have the same metadata.
  4. Join the streams and sort by the fields that you consider appropriate, including the field created earlier. In this case, you sorted by category, then inside each category by the field named order, and finally by the product description.
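The four tasks above can be sketched in Python. The field names follow the recipe (qty_prod, order); the product data is made up:

```python
from itertools import groupby

# Made-up detail rows, already sorted by category (as the SQL ORDER BY ensures)
products = [
    {"category": "kitchen", "desc_product": "Blender"},
    {"category": "kitchen", "desc_product": "Toaster"},
    {"category": "tools", "desc_product": "Hammer"},
]

# 1. Secondary stream: one header row per category, with order = 1
headers = []
for cat, rows in groupby(products, key=lambda r: r["category"]):
    headers.append({"category": cat,
                    "desc_product": ("Category: " + cat).upper(),
                    "qty_prod": len(list(rows)),
                    "order": 1})

# 2./3. Give the main stream the same fields: empty qty_prod, order = 2
detail = [{**r, "qty_prod": None, "order": 2} for r in products]

# 4. Join the streams and sort: category, then order (headers first), then product
result = sorted(headers + detail,
                key=lambda r: (r["category"], r["order"], r["desc_product"]))
```

Because a header row carries order = 1 and every detail row carries order = 2, the sort drops each header right in front of the products of its category.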

Note that in this case, you created a single secondary stream. You could create more if needed, for example, if you need a header and footer for each category.
