(For more resources related to this topic, see here.)
Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:
All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services.
This set of software and services form a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite.
The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:
The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version.
You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.
Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.
The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article.
In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect.
When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said:
We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.
From that moment, the tool has grown with no pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:
Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.
In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool, that it is common to see it used for these and for many other purposes. Here you have some examples.
The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area, or business rules.
But in every case, no exception, the process involves the following steps:
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:
Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.
It’s important and even critical that data be correct and accurate for the efficiency of business, to generate trust conclusions in data mining or statistical studies, to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don’t follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.
Think of a company, any size, which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor type the information by hand. Kettle makes the migration possible thanks to its ability to interact with most kind of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others.
Data may need to be exported for numerous reasons:
Kettle has the power to take raw data from the source and generate these kind of ad-hoc reports.
The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.
The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.
In order to work with PDI, you need to install the software. It’s a simple task, so let’s do it now.
These are the instructions to install PDI, for whatever operating system you may be using.
The only prerequisite to install the tool is to have JRE 6.0 installed. If you don’t have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:
c:/util/kettle
or /home/pdi_user/kettle
./home/pdi_user/kettle
as the installation folder, execute: cd /home/pdi_user/kettle chmod +x *.sh
JavaApplicationStub
file. Look for this file; it is located in Data Integration 32-bit.appContentsMacOS
, or Data Integration 64-bit.appContentsMacOS
depending on your system.You have installed the tool in just a few minutes. Now, you have all you need to start working.
Now that you’ve installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let’s launch Spoon and see what it looks like.
In this section, you are going to launch the PDI graphical designer, and get familiarized with its main features.
Spoon.bat
You can just double-click on the Spoon.bat
icon, or Spoon
if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window—by selecting Run in the Windows start menu, and executing cmd
, and run Spoon.bat
in the terminal.
spoon.sh
spoon.sh
executable, you may type sh spoon.sh
JavaApplicationStub
file, or click on the Data Integration 32-bit.app
, or Data Integration 64-bit.app
iconYou ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration.
In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before.
The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen.
You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon!
Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots.
In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings.
Remember to restart Spoon in order to see the changes applied.
In particular, please take note of the following suggestion about the configuration of the preferred language.
If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language.
One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them.
You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen.
The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it.
As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods:
It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool.
By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.
Why did we choose not to work with repositories? Or, in other words, to work with the files method? Mainly for two reasons:
There is a third method called File repository, that is a mix of the two above—it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latest is the most broadly used. Therefore, throughout this article we will use the files method.
Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…