Big Data analytics – platform requirements
Organizations are striving to become more data-driven and to leverage data for competitive advantage. It is inevitable that any current business intelligence infrastructure needs to be upgraded to include Big Data technologies, and analytics needs to be embedded into every core business process. The following diagram depicts a matrix that maps information management systems and analytics applications from low storage/cost to high storage/cost.
The following section lists all the capabilities that an integrated platform for Big Data analytics should have:
- A data integration platform that can integrate data of any type and any volume from any source. This includes efficient data extraction, data cleansing, transformation, and loading capabilities.
- A data storage platform that can hold structured, unstructured, and semi-structured data, with the capability to slice and dice data to any degree regardless of format. In short, while storing data we should be able to use the platform best suited to a given data format (for example, a relational store for structured data, a NoSQL store for semi-structured data, and a file store for unstructured data) and still be able to join data across platforms to run analytics.
- Support for running standard analytics functions and standard analytical tools on data that has characteristics described previously.
- Modular and elastically scalable hardware that doesn't force changes to the architecture/design as the need to handle bigger data and more complex processing grows.
- A centralized management and monitoring system.
- A highly available, fault-tolerant platform that can seamlessly repair itself in the event of any hardware failure.
- Support for advanced visualizations to communicate insights in an effective way.
- A collaboration platform that can help end users perform the functions of loading, exploring, and visualizing data, and other workflow aspects as an end-to-end process.
The following figure depicts core software components of Greenplum UAP:
In this section, we will take a brief look at what each component is and take a deep dive into their functions in the sections to follow.
Greenplum Database is a shared nothing, massively parallel processing solution built to support next-generation data warehousing and Big Data analytics processing. It stores and analyzes voluminous structured data. It comes in a software-only version that works on commodity servers (this being its unique selling point) and is also available as an appliance (DCA) that can take advantage of large clusters of powerful servers, storage, and switches. GPDB (Greenplum Database) comes with a parallel query optimizer that uses a cost-based algorithm to evaluate and select optimal query plans. Its high-speed interconnect supports continuous pipelining for data processing.
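As a quick illustration, the plan selected by the cost-based optimizer can be inspected with the standard EXPLAIN command; the table and column names in this sketch are hypothetical:

```sql
-- Hypothetical table: EXPLAIN prints each plan node with its
-- estimated cost and row count, including the motion operations
-- that move data between segments over the interconnect.
EXPLAIN
SELECT region, count(*)
FROM   sales
GROUP  BY region;
```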
In its new distribution under Pivotal, Greenplum Database is called Pivotal (Greenplum) Database.
Shared nothing, massively parallel processing (MPP) systems, and elastic scalability
Until now, our applications have been benchmarked for a certain level of performance, and the core hardware and its architecture determined their readiness for further scalability, which came at a cost, be it changes to the design or hardware augmentation. With growing data volumes, scalability and total cost of ownership are becoming big challenges, and the need for elastic scalability has become paramount.
This section compares shared disk, shared memory, and shared nothing data architectures and introduces the concept of massive parallel processing.
Greenplum Database and HD components implement a shared nothing data architecture with a master/worker paradigm, demonstrating massively parallel processing capabilities.
Shared disk data architecture
Have a look at the following figure which gives an idea about shared disk data architecture:
Shared disk data architecture refers to an architecture where a single data disk holds all the data and each node in the cluster accesses it for processing. Any data operation can be performed by any node at a given point in time; if two nodes attempt to persist/write a tuple at the same time, a disk-based lock or intent-lock communication must be passed on to ensure consistency, which affects performance. Further, as the number of nodes increases, contention at the database level increases. These architectures are write-limited, as locks need to be handled across the nodes in the cluster. Even in the case of reads, partitioning should be implemented effectively to avoid complete table scans.
Shared memory data architecture
Have a look at the following figure which gives an idea about shared memory data architecture:
In-memory data grids come under the shared memory data architecture category. In this architecture paradigm, data is held in memory that is accessible to all the nodes within the cluster. The major advantage of this architecture is that no disk I/O is involved, so data access is very quick. This advantage comes with the additional need to load data into memory and keep it synchronized with the underlying data store. The memory layer seen in the following figure can be distributed and local to the compute nodes, or can exist as a separate data node.
Shared nothing data architecture
Though an old paradigm, shared nothing data architecture is gaining traction in the context of Big Data. Here the data is distributed across the nodes in the cluster and every processor operates on the data local to itself. The location where data resides is referred to as the data node, and where the processing logic resides is called the compute node. The compute and data nodes can be physically the same machine. These nodes within the cluster are connected using high-speed interconnects.
The following figure depicts two aspects of the architecture: the one on the left represents decoupled data and compute processes, and the one on the right represents co-located data and compute processes:
One of the most important aspects of shared nothing data architecture is the fact that there will not be any contention or locks that would need to be addressed. Data is distributed across the nodes within the cluster using a distribution plan that is defined as a part of the schema definition. Additionally, for higher query efficiency, partitioning can be done at the node level. Any requirement for a distributed lock would bring in complexity and an efficient distribution and partitioning strategy would be a key success factor.
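In Greenplum, for example, both the distribution plan and partitioning are declared in the table DDL; the following is a hypothetical sketch (table and column names are assumptions):

```sql
-- Rows are hash-distributed across segments on customer_id (the
-- distribution key); within each segment, data is additionally
-- range-partitioned by month for higher query efficiency.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      numeric
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2013-01-01') INCLUSIVE
    END   (date '2014-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```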
Reads are usually the most efficient relative to shared disk databases. Again, the efficiency is determined by the distribution policy: if a query needs to join data across the nodes in the cluster, users will see a temporary redistribution step that brings the required data elements together on another node before the query result is returned.
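This redistribution step is visible in the query plan. As a hypothetical sketch, consider joining two tables on a column that is not the distribution key of one of them:

```sql
-- If orders is distributed on order_id but joined on customer_id,
-- the plan will typically include a "Redistribute Motion" node that
-- moves matching rows between segments before the join executes.
EXPLAIN
SELECT c.customer_id, o.amount
FROM   customers c
JOIN   orders o ON o.customer_id = c.customer_id;
```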
Shared nothing data architecture thus supports massive parallel processing capabilities. Some of the features of shared nothing data architecture are as follows:
- It can scale extremely well on general purpose systems
- It provides automatic parallelization in loading and querying any database
- It has optimized I/O and can scan and process nodes in parallel
- It supports linear scalability, also referred to as elastic scalability: by adding a new node to the cluster, additional storage and processing capability is gained, in terms of both load performance and query performance
The Greenplum high-availability architecture
In addition to the primary Greenplum system components, we can optionally deploy redundant components for high availability, avoiding single points of failure.
The following components need to be implemented for data redundancy:
- Mirror segment instances: A mirror segment always resides on a different host than its primary segment. Mirroring provides you with a replica of the database contained in a segment. This may be useful in the event of disk/hardware failure. The metadata regarding the replica is stored on the master server in system catalog tables.
- Standby master host: For a fully redundant Greenplum Database system, a mirror of the Greenplum master can be deployed. A backup Greenplum master host serves as a warm standby when the primary master host becomes unavailable. The standby master host is kept up-to-date by a transaction log replication process that runs on the standby master and keeps it in sync with the master host. In the event of a master host failure, the standby master is activated and reconstructed using the transaction logs.
- Dual interconnect switches: A highly available interconnect can be achieved by deploying redundant network interfaces on all Greenplum hosts and dual Gigabit Ethernet switches. The default configuration is one network interface per primary segment instance on a segment host (both interconnects are 10 GbE by default in the DCA).
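Since the metadata about primaries and their mirrors lives in the master's system catalog, the current role and up/down status of every segment instance can be checked there; for example:

```sql
-- gp_segment_configuration is the system catalog table recording
-- every segment instance: content ID, current role (p = primary,
-- m = mirror), up/down status, and host.
SELECT content, role, preferred_role, status, hostname
FROM   gp_segment_configuration
ORDER  BY content, role;
```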
External tables in Greenplum are database tables that let Greenplum Database access data from a source outside of the database. We can have different external tables for different formats. Greenplum supports fast, parallel, as well as nonparallel data loading and unloading. The external tables act as an interfacing point to an external data source and give the impression of a local data source to the accessing function.
File-based data sources are supported by external tables. The following file formats can be loaded onto external tables:
- Regular file-based source (supports Text, CSV, and XML data formats): file:// or gpfdist:// protocol
- Web-based file source (supports Text, CSV, OS commands, and scripts): http:// protocol
- Hadoop-based file source (supports Text and custom/user-defined formats): gphdfs:// protocol
Following is the syntax for the creation and deletion of readable and writable external tables:
- To create a read-only external table:
CREATE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>) LOCATION ('<<file paths>>') | EXECUTE '<<command>>' FORMAT '<<format name, for example TEXT>>' (DELIMITER '<<delimiter>>');
- To create a writable external table:
CREATE WRITABLE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>) LOCATION ('<<file paths>>') | EXECUTE '<<command>>' FORMAT '<<format name, for example TEXT>>' (DELIMITER '<<delimiter>>');
- To drop an external table:
DROP EXTERNAL (WEB) TABLE <<table name>>;
Following are the examples on using file:// and gphdfs:// protocol:
CREATE EXTERNAL TABLE test_load_file ( id int, name text, date date, description text ) LOCATION ( 'file://filehost:6781/data/folder1/*', 'file://filehost:6781/data/folder2/*', 'file://filehost:6781/data/folder3/*.csv' ) FORMAT 'CSV' (HEADER);
In the preceding example, data is loaded from three different file server locations; also, as you can see, the wildcard notation can differ for each location. In the case where the files are located on HDFS, the following notation needs to be used (in the following example, the file is '|' delimited):
CREATE EXTERNAL TABLE test_load_file ( id int, name text, date date, description text ) LOCATION ( 'gphdfs://hdfshost:8081/data/filename.txt' ) FORMAT 'TEXT' (DELIMITER '|');
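To complement the readable examples above, a writable external table can be used to unload query results in parallel; in this hypothetical sketch, the host, port, and table names are assumptions:

```sql
-- Writable external table: rows INSERTed into it are streamed out
-- through the gpfdist file server rather than stored in the database.
CREATE WRITABLE EXTERNAL TABLE test_unload_file (
    id int, name text, date date, description text
)
LOCATION ('gpfdist://filehost:6781/unload/out.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Unload data selected from a regular (hypothetical) table:
INSERT INTO test_unload_file
SELECT id, name, date, description FROM some_local_table;
```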
In this article, we have learned about Greenplum UAP and Greenplum Database, along with the core components of Greenplum UAP.