Uber yesterday released “Marmaray”, an open source data ingestion and dispersal framework for Apache Hadoop. Marmaray is a plug-in based framework built on top of the Hadoop ecosystem by Uber’s Hadoop Platform team. It connects a collection of systems and services in a cohesive manner so that together they can perform the following functions:
- Marmaray is capable of producing quality schematized data via Uber’s schema management library and services.
- It ingests data from multiple data stores into Uber’s Hadoop data lake.
- It can build pipelines using Uber’s internal workflow orchestration service. These pipelines process the ingested data, then calculate business metrics from it and store them in Hive.
- Marmaray serves the processed results from Hive to an online data store, so that internal customers can query the data and get near-instant results.
Most of the fundamental building blocks and abstractions in Marmaray’s design were inspired by Gobblin, a similar project developed at LinkedIn.
Marmaray has several generic components that facilitate its overall job flow: DataConverters, WorkUnitCalculator, Metadata Manager, ISource and ISink. Let’s discuss these components.
All raw data must conform to a schema before it is ingested into Uber’s Hadoop data lake, and this is where DataConverters come into the picture. They filter out any data that is malformed, is missing required fields, or has other issues, and they produce error records for that data with every transformation.
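To make the idea concrete, here is a minimal Python sketch of a converter that splits raw input into schema-conforming rows and error records. It is only an illustration: Marmaray itself is not written this way, and the `SCHEMA` mapping, the `ErrorRecord` class and the `convert` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical schema: required field name -> type used to coerce the value.
SCHEMA = {"trip_id": str, "fare": float}


@dataclass
class ErrorRecord:
    raw: dict    # the offending input, kept for later inspection
    reason: str  # why the record was rejected


def convert(records):
    """Split raw records into schema-conforming rows and error records."""
    valid, errors = [], []
    for raw in records:
        missing = [name for name in SCHEMA if name not in raw]
        if missing:
            errors.append(ErrorRecord(raw, f"missing fields: {missing}"))
            continue
        try:
            # Coerce each required field to its declared type.
            row = {name: cast(raw[name]) for name, cast in SCHEMA.items()}
        except (TypeError, ValueError) as exc:
            errors.append(ErrorRecord(raw, f"malformed value: {exc}"))
            continue
        valid.append(row)
    return valid, errors
```

Only the rows in `valid` would move on to the data lake, while `errors` keeps each rejected record next to the reason it failed.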
Uber introduced the concept of a WorkUnitCalculator to determine how much data to process in each run. At a high level, the WorkUnitCalculator looks at the type of input source and the previously stored checkpoint, and then calculates the next work unit, or batch of work.
The WorkUnitCalculator also takes throttling information into account when sizing the next batch of data to process.
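The calculation can be pictured with a short Python sketch. It assumes a source that exposes a numeric offset per partition (as a message queue would) and a throttle expressed as a maximum record count; the function `next_work_unit` and its parameters are hypothetical and are not Marmaray’s actual API.

```python
def next_work_unit(latest_offsets, checkpoint, max_records):
    """Return {partition: (start, end)} offset ranges for the next batch.

    latest_offsets: newest offset available per partition at the source
    checkpoint:     offset up to which each partition was already processed
    max_records:    throttle, the most records this run may read in total
    """
    work_unit = {}
    budget = max_records
    for partition in sorted(latest_offsets):
        # Resume where the previous run stopped, or from 0 on a first run.
        start = checkpoint.get(partition, 0)
        available = latest_offsets[partition] - start
        take = min(available, budget)
        if take <= 0:
            continue  # nothing new here, or the throttle is used up
        work_unit[partition] = (start, start + take)
        budget -= take
    return work_unit
```

For example, `next_work_unit({0: 100, 1: 50}, {0: 90}, 30)` returns `{0: (90, 100), 1: (0, 20)}`: partition 0 contributes its 10 new records and the remaining budget of 20 goes to partition 1. This sketch spends the budget greedily in partition order; a real calculator would likely spread it more evenly.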
The Metadata Manager is responsible for caching job-level metadata. The metadata store can hold any relevant metrics that are useful for tracking, describing, or collecting status on jobs.
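A job-level metadata cache of this kind could look roughly like the Python sketch below. It assumes the metadata is a small key-value map that is loaded when a job starts and saved when it ends; the JSON file backing and the `MetadataManager` class shown here are assumptions made for illustration, not a description of how Marmaray persists its metadata.

```python
import json
from pathlib import Path


class MetadataManager:
    """Caches job-level metadata in memory and persists it as JSON."""

    def __init__(self, path):
        self._path = Path(path)
        # Load the state left behind by the previous run, if there is one.
        if self._path.exists():
            self._data = json.loads(self._path.read_text())
        else:
            self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

    def save(self):
        # Write the cache back so that the next run can pick it up.
        self._path.write_text(json.dumps(self._data))
```

A job would typically read its last checkpoint with `get`, record the new one with `set` once the batch succeeds, and call `save` before exiting.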
ISource and ISink
ISource holds all the information needed to read data from the source for the requested work units, and ISink holds all the information needed to write that data to the sink.
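The value of this separation is that a job only ever talks to the two interfaces, so any source can be paired with any sink. The Python sketch below illustrates the pattern with in-memory stand-ins; the classes `Source`, `Sink`, `ListSource` and `ListSink` and the function `run_job` are hypothetical and do not mirror the signatures of Marmaray’s ISource and ISink.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    @abstractmethod
    def read(self, work_unit):
        """Return the records covered by the given work unit."""


class Sink(ABC):
    @abstractmethod
    def write(self, records):
        """Persist the given records."""


class ListSource(Source):
    """In-memory source; a work unit is a (start, end) index range."""

    def __init__(self, rows):
        self._rows = rows

    def read(self, work_unit):
        start, end = work_unit
        return self._rows[start:end]


class ListSink(Sink):
    """In-memory sink that collects everything written to it."""

    def __init__(self):
        self.rows = []

    def write(self, records):
        self.rows.extend(records)


def run_job(source, sink, work_unit):
    # The job depends only on the two interfaces, never on concrete systems.
    records = source.read(work_unit)
    sink.write(records)
    return len(records)
```

Swapping `ListSource` for a class that reads from a message queue, or `ListSink` for one that writes to a database, would leave `run_job` unchanged.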
Marmaray’s support for any-source to any-sink data pipelines can be applied to a wide range of use cases both in the Hadoop ecosystem and for data migration.
“We hope that Marmaray will serve the data needs of other organizations, and that open source developers will broaden its functionalities,” reads the Uber Blog.
For more information, check out the official Uber Blog.