Note: This article is a book excerpt from Learning Pentaho Data Integration 8 CE – Third Edition written by María Carina Roldán. In this book you will explore the features and capabilities of Pentaho Data Integration 8 Community Edition.
In today’s tutorial, we will introduce Pentaho Data Integration (PDI) and look at how it is used in real-world scenarios.
Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:
All of these tools can be used standalone but also integrated. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example:
This set of software and services forms a complete BI Suite, which makes Pentaho the world’s leading open source BI option on the market.
Note: You can find out more about the platform at https://community.hds.com/community/products-and-solutions/pentaho/. There is also an Enterprise Edition with additional features and support. You can find more on this at http://www.pentaho.com/.
Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception; Pentaho Data Integration is the new name for the business intelligence tool born as Kettle.
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.
Since then, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, and greater ease of use, along with major changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:
Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.
In fact, PDI serves as more than a data integrator or an ETL tool. It is powerful enough that it is commonly used for these and many other purposes. Here are some examples.
The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules.
However, in every case, with no exception, the process involves the following steps:
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:
Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversions, validation, and transfer of data have to be done. PDI is meant to do all these tasks.
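To make the ERP and CRM case concrete, here is a minimal sketch in plain Python of the kind of logic such an integration involves: converting each source to a common shape and then merging on a shared key. It is not PDI code (in PDI this would be built with graphical steps), and the field names (`CUST_NO`, `CUST_NAME`, `id`, `full_name`, `email`) and the merge rule are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    name: str
    email: str


def from_erp(row: dict) -> Customer:
    # Hypothetical ERP layout: id as text, upper-case name, no email field.
    return Customer(int(row["CUST_NO"]), row["CUST_NAME"].title(), "")


def from_crm(row: dict) -> Customer:
    # Hypothetical CRM layout: id as text, padded name, mixed-case email.
    return Customer(int(row["id"]), row["full_name"].strip(), row["email"].lower())


def integrate(erp_rows, crm_rows):
    """Merge both sources into one list of customers, keyed by customer id."""
    merged = {}
    for row in erp_rows:
        customer = from_erp(row)
        merged[customer.customer_id] = customer
    for row in crm_rows:
        customer = from_crm(row)
        if customer.customer_id in merged:
            # Assumed rule: keep the ERP name, take the email from the CRM.
            merged[customer.customer_id].email = customer.email
        else:
            merged[customer.customer_id] = customer
    return [merged[key] for key in sorted(merged)]


erp = [{"CUST_NO": "1", "CUST_NAME": "ADA LOVELACE"}]
crm = [
    {"id": "1", "full_name": " Ada Lovelace ", "email": "ADA@Example.com"},
    {"id": "2", "full_name": "Alan Turing", "email": "alan@example.com"},
]
unified = integrate(erp, crm)
```

Here `unified` holds two customers: the first combines the ERP name with the CRM email, and the second exists only in the CRM.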
Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying whether the data meets certain rules, discarding or correcting values that don’t follow the expected pattern, setting default values for missing data, eliminating duplicated information, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
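The cleansing rules listed above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not how Kettle implements it; the field names (`email`, `country`, `age`), the email pattern, the default value, and the age limits are all assumptions, and "normalizing to minimum and maximum values" is interpreted here as clamping a number into a range.

```python
import re

# Deliberately simple pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def cleanse(rows, min_age=0, max_age=120, default_country="Unknown"):
    """Return cleaned copies of the rows that pass validation."""
    seen_emails = set()
    clean = []
    for row in rows:
        # Correct what can be corrected: trim and lower-case the email.
        email = (row.get("email") or "").strip().lower()
        # Discard rows that don't follow the expected pattern.
        if not EMAIL_PATTERN.match(email):
            continue
        # Eliminate duplicates (same email after normalization).
        if email in seen_emails:
            continue
        seen_emails.add(email)
        # Set a default value for missing data.
        country = row.get("country") or default_country
        # Clamp the age into the allowed range.
        age = min(max(int(row["age"]), min_age), max_age)
        clean.append({"email": email, "country": country, "age": age})
    return clean


raw = [
    {"email": " Ann@Example.com ", "country": "", "age": 150},
    {"email": "ann@example.com", "country": "AR", "age": 30},
    {"email": "not-an-email", "country": "AR", "age": 30},
    {"email": "bob@example.com", "country": "UY", "age": -5},
]
cleaned = cleanse(raw)
```

Of the four input rows, two survive: the duplicate and the malformed email are dropped, the empty country is replaced by the default, and the out-of-range ages are clamped to 120 and 0.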
Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming a significant share of the budget, so they decide to migrate to an open source ERP. The company will no longer have to pay for licenses, but to make the change it will have to migrate its information. Obviously, starting from scratch or typing the information in by hand is not an option. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.
Data may need to be exported for numerous reasons:
Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.
The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.
In order to work with PDI, you need to install the software.
The following instructions apply to installing the PDI software, irrespective of the operating system you are using:
And that’s all. You have installed the tool in just a few minutes.
We learned about installing and using PDI. You can learn more about extending PDI functionality and launching the PDI graphical designer from Learning Pentaho Data Integration 8 CE – Third Edition.