[box type="note" align="" class="" width=""]This is a book excerpt from Learning Pentaho Data Integration 8 CE – Third Edition written by María Carina Roldán. From this book, you will learn to explore, transform, and integrate your data across multiple sources.[/box]
Today, we will learn how to configure and use the Job Executor step, and how to capture the result filenames of the executed Job.
The Job Executor is a PDI step that allows you to execute a Job several times simulating a loop. The executor receives a dataset, and then executes the Job once for each row or a set of rows of the incoming dataset. To understand how this works, we will build a very simple example. The Job that we will execute will have two parameters: a folder and a file. It will create the folder, and then it will create an empty file inside the new folder. Both the name of the folder and the name of the file will be taken from the parameters. The main transformation will execute the Job iteratively for a list of folder and file names.
Let’s start by creating the Job:
5. Double-click the Create file entry. As File name, type ${FOLDER_NAME}/${FILE_NAME}.
6. Save the Job and test it, providing values for the folder and filename. The Job should create a folder with an empty file inside, both with the names that you provide as parameters.
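Outside PDI, the work this Job does can be sketched in a few lines of Python. This is only an illustration of the Job's logic; the function name is made up, and the two arguments play the role of the Job's named parameters:

```python
import os

def create_folder_and_file(folder_name, file_name):
    """Mimic the sample Job: create a folder, then an empty file inside it.

    folder_name and file_name stand in for the named parameters
    FOLDER_NAME and FILE_NAME.
    """
    os.makedirs(folder_name, exist_ok=True)  # 'Create a folder' entry
    # 'Create file' entry: an empty file at ${FOLDER_NAME}/${FILE_NAME}
    open(os.path.join(folder_name, file_name), "w").close()

create_folder_and_file("folder1", "sample.tmp")
```

Running the sketch once produces the same filesystem effect as one execution of the Job with those parameter values.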
Now create the main Transformation:
6. At the end of the stream, add a Job Executor step. You will find it under the Flow category of steps.
7. Double-click on the Job Executor step.
8. As Job, select the path to the Job created before, for example, ${Internal.Entry.Current.Directory}/create_folder_and_file.kjb
9. Configure the Parameters grid as follows:
11. Run the transformation. The Step Metrics tab in the Execution Results window reflects what happens:
13. Browse your filesystem. You will find all the folders and files just created.
As you can see, PDI executes the Job as many times as there are rows arriving at the Job Executor step, once for every row. Each time the Job executes, it receives values for the named parameters, and creates the folder and file using those values.
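Conceptually, the executor behaves like the following hypothetical loop. The row layout and the run_job stand-in are assumptions made for illustration, not PDI APIs:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # keep the demo self-contained

# Incoming dataset: one row per execution, fields mapped to named parameters.
rows = [
    {"FOLDER_NAME": os.path.join(base, "folder1"), "FILE_NAME": "file1.tmp"},
    {"FOLDER_NAME": os.path.join(base, "folder2"), "FILE_NAME": "file2.tmp"},
    {"FOLDER_NAME": os.path.join(base, "folder3"), "FILE_NAME": "file3.tmp"},
]

def run_job(params):
    # Stand-in for executing create_folder_and_file.kjb with these parameters.
    os.makedirs(params["FOLDER_NAME"], exist_ok=True)
    open(os.path.join(params["FOLDER_NAME"], params["FILE_NAME"]), "w").close()

# The executor: the Job runs once for every incoming row.
for row in rows:
    run_job(row)
```

Three rows in, three Job executions out: one folder and one empty file per row.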
Just like the Transformation Executor steps that you already know, the Job Executor offers a number of settings that let you customize the behavior and the output of the Job to be executed. Let's summarize the options.
The Job Executor doesn’t cause the Transformation to abort if the Job that it runs has errors. To verify this, run the sample transformation again. As the folders already exist, you expect that each individual execution fails. However, the Job Executor ends without error. In order to capture the errors in the execution of the Job, you have to get the execution results. This is how you do it:
5. With the destination step selected, run a preview. You will see the results that you just defined, as shown in the next example:
If you look at the log, you will see the details for the execution, as shown in the following example:
2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create a folder]
2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create file]
2017/10/26 23:45:53 - Create file - File [c:/pentaho/files/folder1/sample_50n9q8oqsg6ib.tmp] created!
2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create file] (result=[true])
2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create a folder] (result=[true])
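The idea of capturing execution results instead of aborting can be sketched as follows. The field names in the dictionaries are modeled on the defaults you see in the step's Execution results tab, but the loop itself is only an illustration, not PDI code:

```python
import os
import tempfile

base = tempfile.mkdtemp()
folder = os.path.join(base, "folder1")
# Two identical rows: the second execution fails because the folder
# and the file already exist, just as in the sample transformation.
rows = [
    {"FOLDER_NAME": folder, "FILE_NAME": "sample.tmp"},
    {"FOLDER_NAME": folder, "FILE_NAME": "sample.tmp"},
]

def run_job(params):
    os.makedirs(params["FOLDER_NAME"])  # no exist_ok: fails on rerun
    open(os.path.join(params["FOLDER_NAME"], params["FILE_NAME"]), "x").close()

results = []
for row in rows:
    try:
        run_job(row)
        results.append({"ExecutionResult": True, "ExecutionNrErrors": 0,
                        "ExecutionLogText": ""})
    except OSError as err:
        # The executor itself doesn't abort; the failure becomes data
        # in the output stream, one result row per execution.
        results.append({"ExecutionResult": False, "ExecutionNrErrors": 1,
                        "ExecutionLogText": str(err)})
```

The downstream steps can then filter on the result field to react to failed executions.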
As you know, jobs don’t work with datasets. Transformations do. However, you can still use the Job Executor to send the rows to the Job. Then, any transformation executed by your Job can get the rows using a Get rows from result step.
By default, the Job Executor executes the Job once for every row in your dataset, but you can change this behavior in the Row Grouping tab of the configuration window, for example, by sending groups of rows to the Job instead of a single row at a time.
If the Job has named parameters—as in the example that we built—you provide values for them in the Parameters tab of the Job Executor step. For each named parameter, you can assign the value of a field or a fixed (static) value. If you execute the Job for a group of rows instead of a single one, the parameters take their values from the first row of data sent to the Job.
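For instance, if the executor were configured to send rows in groups of three, the parameter values would come from the first row of each group. A hypothetical sketch of that grouping logic, not actual PDI code:

```python
rows = [
    {"FOLDER_NAME": "folder1", "FILE_NAME": "a.tmp"},
    {"FOLDER_NAME": "folder1", "FILE_NAME": "b.tmp"},
    {"FOLDER_NAME": "folder1", "FILE_NAME": "c.tmp"},
    {"FOLDER_NAME": "folder2", "FILE_NAME": "d.tmp"},
    {"FOLDER_NAME": "folder2", "FILE_NAME": "e.tmp"},
    {"FOLDER_NAME": "folder2", "FILE_NAME": "f.tmp"},
]

GROUP_SIZE = 3  # group size, as configured in the Row Grouping tab
params_per_execution = []
for i in range(0, len(rows), GROUP_SIZE):
    group = rows[i:i + GROUP_SIZE]
    # Named parameters take their values from the FIRST row of the group;
    # the whole group is still sent to the Job as result rows.
    params_per_execution.append(group[0])
```

Six rows with a group size of three yield two Job executions, parameterized by rows one and four.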
At the output of the Job Executor, you can also get the result filenames. Let's modify the Transformation that we created to show an example of this kind of output:
5. Configure it as shown:
…
… - Write to log.0 -
… - Write to log.0 - ------------> Linenr 1------------------------------
… - Write to log.0 - filename = file:///c:/pentaho/files/folder1/sample_5agh7lj6ncqh7.tmp
… - Write to log.0 -
… - Write to log.0 - ====================
… - Write to log.0 -
… - Write to log.0 - ------------> Linenr 2------------------------------
… - Write to log.0 - filename = file:///c:/pentaho/files/folder2/sample_6n0rhmrpvj21n.tmp
… - Write to log.0 -
… - Write to log.0 - ====================
… - Write to log.0 -
… - Write to log.0 - ------------> Linenr 3------------------------------
… - Write to log.0 - filename = file:///c:/pentaho/files/folder3/sample_7ulkja68vf1td.tmp
… - Write to log.0 -
… - Write to log.0 - ====================
…
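The behavior behind that output can be sketched as follows. Collecting one entry per file the Job touched is an illustration of the Result filenames output, not PDI code, and the file names are placeholders:

```python
import os
import tempfile

base = tempfile.mkdtemp()
rows = [
    {"FOLDER_NAME": "folder1", "FILE_NAME": "sample1.tmp"},
    {"FOLDER_NAME": "folder2", "FILE_NAME": "sample2.tmp"},
    {"FOLDER_NAME": "folder3", "FILE_NAME": "sample3.tmp"},
]

result_filenames = []
for row in rows:
    folder = os.path.join(base, row["FOLDER_NAME"])
    os.makedirs(folder)
    path = os.path.join(folder, row["FILE_NAME"])
    open(path, "w").close()
    # Result filenames output: one row per file created by the execution,
    # which a step like Write to log can then print line by line.
    result_filenames.append(path)

for linenr, name in enumerate(result_filenames, start=1):
    print(f"Linenr {linenr}: filename = {name}")
```

Each executed Job contributes the files it created, so three executions produce three filename rows.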
The example that you just created demonstrated this option with a Job Executor.
We learned how to nest jobs and iterate the execution of jobs. You can learn more about executing transformations iteratively and launching transformations and jobs from the command line in Learning Pentaho Data Integration 8 CE – Third Edition.