
This article is a book excerpt from Learning Pentaho Data Integration 8 CE – Third Edition, written by María Carina Roldán. In this book, you will explore the features and capabilities of Pentaho Data Integration 8 Community Edition.

In today’s tutorial, we will introduce Pentaho Data Integration (PDI) and look at how it is used in real-world scenarios.

Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). PDI is part of the Pentaho Business Intelligence Suite, a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

  • Analysis: The analysis engine serves multidimensional analysis. It’s provided by
    the Mondrian OLAP server.
  • Reporting: The reporting engine allows designing, creating, and distributing
    reports in various known formats (HTML, PDF, and so on), from different kinds
    of sources. In the Enterprise Edition of Pentaho, you can also generate interactive
    reports.
  • Data mining: Data mining is used for running data through algorithms in order
    to understand the business and do predictive analysis. Data mining is possible
    thanks to the Weka project.
  • Dashboards: Dashboards are used to monitor and analyze Key Performance
    Indicators (KPIs). CTools is a set of tools and components created to help the
    user to build custom dashboards on top of Pentaho. There are specific CTools for
    different purposes, including a Community Dashboard Editor (CDE), a very
    powerful charting library (CCC), and a plugin for accessing data with great
    flexibility (CDA), among others. While CTools lets you develop advanced,
    custom dashboards, there is also a Dashboard Designer, available only in Pentaho
    Enterprise Edition, that lets you build dashboards in an easy way.
  • Data integration: Data integration is used to integrate scattered information from
    different sources (for example, applications, databases, and files) and make the
    integrated information available to the final user. PDI, the tool that we will learn
    to use throughout the book, is the engine that provides this functionality. PDI
    also interacts with the rest of the tools, for example, by reading OLAP cubes,
    generating Pentaho Reports, and doing data mining with the R Script Executor and
    the CPython Script Executor.

All of these tools can be used standalone or integrated with one another. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example:

  • Authentication and authorization
  • Scheduling
  • Security
  • Web services
  • Scalability and failover

This set of software and services forms a complete BI suite, which makes Pentaho one of the leading open source BI options on the market.

Note: You can find out more about the platform at https://community.hds.com/community/products-and-solutions/pentaho/. There is also an Enterprise Edition with additional features and support. You can find more on this at http://www.pentaho.com/.

Introducing Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception; Pentaho Data Integration is the new name for the business intelligence tool born as Kettle.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.
Since then, the tool has grown steadily. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, and better ease of use, along with great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

  • June 2006: PDI 2.3 was released. Numerous developers had joined the project and
    there were bug fixes provided by people in various regions of the world. The
    version included, among other changes, enhancements for large-scale
    environments and multilingual capabilities.
  • November 2007: PDI 3.0 emerged totally redesigned. Its major library changed to gain massive performance improvements. The look and feel had also changed completely.
  • April 2009: PDI 3.2 was released with a really large number of changes for a minor version: new functionality, visualization and performance improvements, and a huge number of bug fixes.
  • June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements.
  • November 2013: PDI 5.0 was released, offering better previewing of data, easier looping, a lot of big data improvements, an improved plugin marketplace, and hundreds of bug fixes and feature enhancements, as in all releases. In its Enterprise version, it offered interesting low-level features, such as step load balancing, Job transactions, and restartability.
  • December 2015: PDI 6.0 was released with new features such as data services, data lineage, broader support for Big Data, and several changes in the graphical designer for improving the PDI user experience. Some months later, PDI 6.1 was released including metadata injection, a feature that enables the user to modify Transformations at runtime. Metadata injection had been available in earlier versions, but it was in 6.1 that Pentaho started to put a big effort into implementing this powerful feature.
  • November 2016: PDI 7.0 emerged with many improvements in the enterprise version, including data inspection capabilities, more support for Big Data technologies, and improved repository management. In the community version, the main change was expanded metadata injection support.
  • November 2017: Pentaho 8.0 was released. The highlights of this latest version are the optimization of processing resources, a better user experience, and enhanced connectivity to streaming data sources, that is, real-time processing.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

In fact, PDI does not only serve as a data integrator or an ETL tool. PDI is such a powerful tool that it is common to see it being used for these and for many other purposes. Here are some examples.

Loading data warehouses or data marts

The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules.

However, in every case, with no exception, the process involves the following steps:

  1. Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules.
  2. Transforming the obtained data to meet the business and technical needs required on the target. Transforming includes tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
  3. Loading the transformed data into the target database or file store. Depending on the requirements, the loading may overwrite the existing information or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:

[Screenshot: a simple ETL designed with PDI]
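In PDI these three steps are designed graphically rather than coded. Purely as an illustration of the extract, transform, and load stages described above, here is a minimal plain-Python sketch (it is not PDI code). The sample data, the column names, and the `sales_by_region` table are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical input; in practice this would be a file or a database query.
RAW_CSV = """order_id,region,amount
1,north,10.50
2,south,abc
3,north,4.50
4,,7.00
5,south,20.00
"""


def extract(text):
    """Extract: read rows and discard those that fail basic validation."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            float(row["amount"])
        except ValueError:
            continue  # amount is not numeric
        if not row["region"]:
            continue  # region is missing
        rows.append(row)
    return rows


def transform(rows):
    """Transform: convert data types and summarize the amount per region."""
    totals = {}
    for row in rows:
        region = row["region"].upper()
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


def load(conn, totals, overwrite=True):
    """Load: write the summary to the target table, overwriting or appending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total REAL)"
    )
    if overwrite:
        conn.execute("DELETE FROM sales_by_region")
    conn.executemany(
        "INSERT INTO sales_by_region VALUES (?, ?)", sorted(totals.items())
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```

The `overwrite` flag mirrors the choice mentioned in step 3: the load either replaces the existing information or adds to it on each run.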

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application that are not connected to each other. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversions, validation, and transfer of data have to be done. PDI is meant to do all these tasks.
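To make the ERP and CRM case more concrete, the following plain-Python sketch (again, not PDI code) combines records from two unconnected sources into one unified view. The field names, the sample records, and the choice of a normalized email address as the matching key are assumptions made only for this illustration.

```python
# Hypothetical records from two unconnected systems; field names are made up.
erp_customers = [
    {"cust_no": "001", "name": "Acme Corp", "email": "Sales@Acme.com "},
    {"cust_no": "002", "name": "Globex", "email": "info@globex.com"},
]
crm_contacts = [
    {"contact_email": "sales@acme.com", "phone": "555-0100"},
    {"contact_email": "hello@initech.com", "phone": "555-0199"},
]


def normalize_email(value):
    """Conversion step: bring the keys of both systems to one comparable form."""
    return value.strip().lower()


def integrate(erp_rows, crm_rows):
    """Build one unified record per email, combining fields from both sources."""
    unified = {}
    for row in erp_rows:
        key = normalize_email(row["email"])
        unified[key] = {
            "email": key,
            "cust_no": int(row["cust_no"]),  # convert text to a number
            "name": row["name"],
            "phone": None,
        }
    for row in crm_rows:
        key = normalize_email(row["contact_email"])
        record = unified.setdefault(
            key, {"email": key, "cust_no": None, "name": None, "phone": None}
        )
        record["phone"] = row["phone"]
    return [unified[key] for key in sorted(unified)]


customers = integrate(erp_customers, crm_contacts)
for customer in customers:
    print(customer)
```

Records that exist in only one system are kept, with the missing fields left as `None`, so the unified view does not silently drop data.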

Data cleansing

Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying that the data meets certain rules, discarding or correcting values that don’t follow the expected pattern, setting default values for missing data, eliminating duplicated information, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
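The cleansing tasks listed above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not how PDI implements them. The email pattern, the age limits, and the default country are hypothetical rules chosen for the example.

```python
import re

# Hypothetical rules; real rules depend on the business.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MIN_AGE, MAX_AGE = 0, 120
DEFAULT_COUNTRY = "unknown"


def cleanse(rows):
    """Return cleansed rows: validated, defaulted, clamped, and de-duplicated."""
    seen = set()
    clean = []
    for row in rows:
        email = row.get("email", "").strip().lower()
        if not EMAIL_PATTERN.match(email):
            continue  # discard rows that do not follow the expected pattern
        if email in seen:
            continue  # eliminate duplicated information
        seen.add(email)
        country = row.get("country") or DEFAULT_COUNTRY  # default for missing data
        age = min(max(row.get("age", MIN_AGE), MIN_AGE), MAX_AGE)  # clamp to range
        clean.append({"email": email, "country": country, "age": age})
    return clean


raw = [
    {"email": "Ana@Example.com", "country": "AR", "age": 34},
    {"email": "ana@example.com", "country": "AR", "age": 34},
    {"email": "not-an-email", "country": "UY", "age": 20},
    {"email": "bob@example.com", "country": "", "age": 250},
]
cleaned = cleanse(raw)
for row in cleaned:
    print(row)
```

Here the second row is dropped as a duplicate, the third is discarded because it fails the pattern, and the fourth gets a default country and an age clamped to the maximum.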

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of the budget, so they decide to migrate to an open source ERP. The company will no longer have to pay for licenses, but in order to change, it will have to migrate its information. Obviously, it is not an option to start from scratch or type the information by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

  • To create detailed business reports
  • To allow communication between different departments within the same company
  • To deliver data from your legacy systems to comply with government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle can also be embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

Installing PDI

In order to work with PDI, you need to install the software.

The following instructions for installing the PDI software apply regardless of the operating system you are using:

  1. Go to the Download page at http://sourceforge.net/projects/pentaho/files/DataIntegration.
  2. Choose the newest stable release. At this time, it is 8.0, as shown in the following screenshot: Pentaho Data Integration
  3. Download the available zip file, which works on all platforms.
  4. Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.

And that’s all. You have installed the tool in just a few minutes.
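If you prefer to script the unzip step, a minimal sketch using only the Python standard library could look like the following. The function name `unzip_pdi` and the archive name in the usage comment are hypothetical; use the name of the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def unzip_pdi(archive_path, target_dir):
    """Extract the downloaded PDI zip into target_dir and return that folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target


# Hypothetical usage; replace the archive name with the file you downloaded:
# unzip_pdi("pdi-ce.zip", "/home/pdi_user/kettle")
```

Note that Python's `zipfile` module does not restore Unix permission bits, so on Linux or macOS you may need to make the extracted shell scripts executable afterwards.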

We have covered what PDI is, where it is used, and how to install it. To learn more about extending PDI functionality and launching the PDI Graphical Designer, see Learning Pentaho Data Integration 8 CE – Third Edition.
