In this article by Pradeep Pasupuleti and Beulah Salome Purra, authors of the book Data Lake Development with Big Data, we will see the set of formal processes that ensure that data within the enterprise meets the following expectations:
- Is acquired from reliable sources and meets predefined quality standards.
- Is fit for use in further processing.
- Conforms to well-defined business rules.
- Is defined and modified only by the right people.
- Follows a well-documented change control process.
- Is aligned with the organizational strategy.
- Retains its trustworthiness as it flows through various transformation cycles.
The primary motivating factors for organizations to have Data Governance in place are to minimize information risk and to maximize the value of information by extracting insights. In the following sections, let us explore the components of Data Governance in detail.
Metadata management and lineage tracking
Big Data often relies on extracting value from huge volumes of unstructured data. The first thing we do after this data enters the Data Lake is classify it and "understand" it by extracting its metadata. Metadata captures vital information about the data as it enters the Data Lake and indexes this information while the data is stored, so that users can search the metadata before they access the data and perform any manipulation on it.
Metadata also provides vital information to the users of the Data Lake about the background and significance of the data in the Data Lake and helps in data classification using Taxonomies.
Taxonomies are relationships between data elements and are constructed by tagging data with information on how they are related to each other. Taxonomies are implemented as parent-child hierarchies of relationships that use standard terminology to categorize data.
For instance, using the data contained in the rows and columns of a dataset, a taxonomy can be inferred about a column's business definition, its meaning, its relationships with other business entities, and the various classifications and sensitivity tags associated with it. Using these inferences, operational metadata such as who can access the data, what can be accessed, and when it can be accessed can be linked to the taxonomy.
This approach makes it possible to understand and relate vast amounts of data consistently, much as a taxonomy-based approach is used to classify billions of life forms into a manageable hierarchical structure of family-genus-species.
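The parent-child taxonomy described above can be sketched in a few lines of code. The following is a minimal illustration, not an implementation from any specific product; the node names and tags are assumptions chosen for the example:

```python
# A minimal sketch of a parent-child taxonomy for classifying data elements.
# Node names and tags are illustrative assumptions.

class TaxonomyNode:
    def __init__(self, name, parent=None, tags=None):
        self.name = name
        self.parent = parent
        self.tags = set(tags or [])
        self.children = []
        if parent:
            parent.children.append(self)

    def path(self):
        """Return the full hierarchy path, for example 'Finance/Payroll/Salary'."""
        node, parts = self, []
        while node:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def effective_tags(self):
        """The node's own tags plus tags inherited from every ancestor."""
        node, tags = self, set()
        while node:
            tags |= node.tags
            node = node.parent
        return tags

finance = TaxonomyNode("Finance", tags={"restricted"})
payroll = TaxonomyNode("Payroll", parent=finance, tags={"pii"})
salary = TaxonomyNode("Salary", parent=payroll)

print(salary.path())                    # Finance/Payroll/Salary
print(sorted(salary.effective_tags()))  # ['pii', 'restricted']
```

Because tags are inherited down the hierarchy, tagging a single parent node is enough to classify everything beneath it, which is what makes the taxonomy approach scale to vast amounts of data.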
Once the data is deconstructed and classified, its other attributes are collected as a part of its metadata catalog. The following is a general overview of these attributes, which are extracted and stored in a common metadata repository:
- Data identification, data profile, quality information
- Data lineage, versioning, stewardship information
- Entity and attribute information
- Security attributes, data distribution attributes
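A record in the common metadata repository could combine the attribute groups listed above. The following sketch uses a Python dataclass; the field names mirror the list above but are assumptions, not a standard schema:

```python
# Illustrative sketch of a common metadata repository record.
# Field names mirror the attribute groups listed above; they are
# assumptions, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class MetadataRecord:
    dataset_id: str                                  # data identification
    profile: dict = field(default_factory=dict)      # row counts, null ratios, etc.
    quality_score: float = 0.0                       # quality information
    lineage: list = field(default_factory=list)      # ordered processing steps
    version: int = 1                                 # versioning
    steward: str = ""                                # stewardship information
    entities: dict = field(default_factory=dict)     # entity and attribute info
    security_tags: set = field(default_factory=set)  # security attributes
    distribution: dict = field(default_factory=dict) # data distribution attributes

record = MetadataRecord(
    dataset_id="sales_2015_q4",
    profile={"rows": 120000, "null_ratio": 0.02},
    quality_score=0.97,
    lineage=["crm_export", "intake/raw", "management/standardized"],
    steward="data.steward@example.com",
    security_tags={"internal"},
)
```

Keeping lineage as an ordered list of processing steps is what later allows data to be traced from ingest through every transformation, as discussed under lineage tracking.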
The following image illustrates one possible view of how metadata is collected as the data passes through the various stages of the Data Lake:
Metadata collection at various stages
The collection of metadata and the derivation of Taxonomies could be implemented in a distributed store to facilitate Data Governance. This enables the Data Lake to exchange metadata with other external systems it integrates with so that the data emanating from these systems could be traced as it enters the Data Lake. This metadata-enabled data access helps in addressing complete Information Lifecycle Management issues right from data ingest to data disposition.
Data security and privacy
Data security is an integral part of the Data Governance process that ensures the data is confidential, has integrity, and is available at all times. Data security protects information from malicious and unauthorized disclosure, use, modification, and destruction.
Data Privacy defines for what specific purposes the data collected about individuals can be used, how long it can be retained, and when it must be disposed of, and it respects the right of the individual to opt in to or out of the data collection. The data concerning individuals should be tagged with privacy metadata so that other processes may take notice. Marking personally identifiable data as sensitive is the first step in enabling the governance process to protect private data.
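Tagging records with privacy metadata, as described above, can be sketched as follows. The field names treated as personally identifiable are illustrative assumptions; a real deployment would drive this list from policy:

```python
# A sketch of tagging records with privacy metadata so that downstream
# processes can take notice. The PII field list is an illustrative assumption.
PII_FIELDS = {"ssn", "email", "date_of_birth"}

def tag_privacy(record):
    """Attach privacy metadata marking personally identifiable fields."""
    sensitive = sorted(set(record) & PII_FIELDS)
    return {
        "data": record,
        "privacy": {
            "contains_pii": bool(sensitive),
            "pii_fields": sensitive,
            "consent_required": bool(sensitive),
        },
    }

tagged = tag_privacy({"name": "Ann", "email": "ann@example.com", "city": "Oslo"})
print(tagged["privacy"]["pii_fields"])  # ['email']
```

Every process that later touches the record can inspect the attached privacy block rather than re-deriving sensitivity on its own.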
Data can be secure without being private, but it cannot be private without being secure.
The following figure provides a quick overview about the policies that are enforced by security and privacy:
Security and privacy
Security issues in the Data Lake tiers
Let us now understand the security and privacy concerns as data flows through the three tiers of the Data Lake as discussed in the following subsections.
The Intake Tier
The Intake Tier is the place where we interface the Data Lake with external systems and pull the raw data. One of the stated purposes of the Data Lake is to enable exploratory analytics on raw data, identify its hidden potential, and use raw data before further processing such as joining, refinement, cubing, matching, and so on. Raw data yields valid insights in many use cases such as entity analytics and fraud detection.
The data in this zone poses a high risk due to the following reasons:
- The raw data has not been “understood” from the perspective of usefulness; it is in its pristine raw state devoid of any classification about its eventual usage, authorizations, and privacy
- At the time of ingest, the raw data may contain personally identifiable information, and since it is untouched and no data masking has been performed yet, there is a higher propensity of exposing sensitive data
- Raw data containing sensitive information can be combined or linked with data already existing in the data lake resulting in security breaches
- Decisions and analytics based on bad raw data whose veracity or trust is at question can result in loss of face for the organization
The following are the few ways in which the security and privacy risks associated with raw data stored in the Intake Tier of the Data Lake can be mitigated:
- Perform raw data analysis in a quarantined landing zone where only a small number of authorized users are provided access to all the data
- Perform basic non-intrusive sanity tests to identify if the raw data being accessed is related to obvious business-sensitive information such as payroll, patents, and intellectual capital
- Use basic security controls such as user IDs, strong passwords, access control lists, and so on
- Encrypt file systems and monitor network activity across the intake tier to ensure minimal risk
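The second mitigation above, a basic non-intrusive sanity test for business-sensitive terms, can be sketched as a simple keyword scan. The keyword list is an illustrative assumption; a real check would be driven by the organization's sensitivity taxonomy:

```python
# A minimal, non-intrusive sanity check of raw text for obvious
# business-sensitive terms. The keyword list is an illustrative assumption.
SENSITIVE_TERMS = ("payroll", "patent", "salary", "confidential")

def flag_sensitive(raw_text):
    """Return the sensitive terms that occur in a raw record, if any."""
    lowered = raw_text.lower()
    return [term for term in SENSITIVE_TERMS if term in lowered]

hits = flag_sensitive("Q4 payroll export - CONFIDENTIAL draft")
print(hits)  # ['payroll', 'confidential']
```

A non-empty result would route the record to the quarantined landing zone rather than the general raw store.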
The Management Tier
It is in the Management Tier of the Data Lake that the raw data is integrated with various existing data; it is profiled and validated through automated quality checks, its integrity is established, and eventually all of the raw data is standardized and cleansed into a well-defined structure that is amenable to consumption. The majority of the metadata is collected at each step in this tier. As the integration steps are performed, tracking information, along with activity logging and quality monitoring information, is persisted as metadata.
The overall security risk perception of data in this tier is relatively low compared to the Intake Tier, as the data is already parsed and deconstructed to identify the metadata.
The first step to use this metadata-based security approach is to classify or tag data based on different security and protection requirements of the multiple types of data deposited in the Data Lake.
Sensitive data can be classified by setting business-driven priorities and understanding whether the data is to be considered sensitive; if yes, is that based on personally identifiable information or corporate secrets? Once the sensitivity of the data is ascertained, a business taxonomy is created to determine the relationship of the data to the users and applications that eventually use it, paving the way for authorization and authentication controls built on this metadata.
The Data Lake uses the sensitivity metadata and the derived taxonomy to do the following:
- Understand where the data originates from, by continuously monitoring external database access and file shares on the network.
- Know on which component of the Data Lake the data now exists and how it relates to the other data in the organization.
- Safeguard sensitive structured data contained in databases by preventing unauthorized access.
- Safeguard sensitive unstructured data contained in documents by redacting sensitive information while it is being shared.
- Protect non-production environments such as development, training, testing, quality assurance, and staging by masking data that contains confidential information, while keeping the functionality of these environments intact.
- Safeguard sensitive production data by encrypting it so that it is scrambled and only users with the right authorization can see it.
- Proactively monitor databases and file systems that contain sensitive data to recognize unauthorized data access, alert on malicious access attempts, and thus ensure data integrity and compliance.
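Redaction of sensitive information in shared documents, mentioned in the list above, can be sketched with pattern substitution. The patterns below cover only two simple illustrative cases (US-style SSNs and email addresses); real redaction engines use far richer detectors:

```python
import re

# A sketch of redacting sensitive values from unstructured documents before
# sharing. The patterns are illustrative and cover only two simple cases.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text):
    """Replace each detected sensitive value with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

print(redact("Contact ann@example.com, SSN 123-45-6789"))
# Contact [REDACTED-EMAIL], SSN [REDACTED-SSN]
```

The redacted copy can then be shared freely while the original stays behind the Data Lake's access controls.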
The Consumption Tier
The Consumption Tier is where the data is accessed and consumed, either in raw format from the Raw Zone or in structured format from the Data Hub. Data is provisioned through this tier for external access for analytics, visualization, or other application access through web services. The data is discovered through the data catalog published in the Consumption Zone, and the actual data access is governed by security controls to limit unwarranted access.
In the Consumption Tier, the risk posed to data security is the least when compared to the other two tiers. As the security processes have already been operationalized, in this tier, we measure the effectiveness of the security processes against defined objectives. The effectiveness can be determined by metrics such as the number of monitored data elements that are sensitive, number of systems that are sensitive, total number of breaches, types of alerts, exceptions in file access patterns, data leakage statistics, and so on.
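The effectiveness metrics mentioned above can be rolled up from an access log. The event format below is an assumption for illustration; any monitoring system with per-access records of sensitivity and authorization could feed the same computation:

```python
# An illustrative roll-up of security-effectiveness metrics in the
# Consumption Tier. The event log format is an assumption.
events = [
    {"type": "access", "sensitive": True,  "authorized": True},
    {"type": "access", "sensitive": True,  "authorized": False},  # a breach
    {"type": "access", "sensitive": False, "authorized": True},
]

metrics = {
    "sensitive_accesses": sum(e["sensitive"] for e in events),
    "breaches": sum(e["sensitive"] and not e["authorized"] for e in events),
}
print(metrics)  # {'sensitive_accesses': 2, 'breaches': 1}
```

Tracking such counts over time is what lets the governance team measure the security processes against their defined objectives.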
Information Lifecycle Management
Information Lifecycle Management is a sub-process of Data Governance; it is also known as Data Lifecycle Controls or Data Lifecycle Management.
Information Lifecycle Management (ILM) is a process of controlling and managing the storage of data in the organization's infrastructure. The core idea of ILM is that the lifespan of data can be partitioned into multiple separate phases characterized by different patterns of usage, and therefore during different phases, the data can be stored differently. ILM also helps to identify the true value of data over its lifetime and classifies it accordingly, so that data is stored, migrated, or removed from the organization's storage infrastructure according to its value.
ILM works under the premise that the data has a finite lifecycle of relevance and the storage infrastructure has a finite capacity to hold data. ILM gives us a structured capability to classify data based upon business relevance: it helps us understand how data evolves or grows over time, comprehend its usage patterns, and eventually manage the growth in a systematic way so that the data of least business value is destroyed.
As data arrives into the organization, we can assume with a certain level of certainty that the data related to key business processes such as payroll, transaction processing, customer relationship management, and so on, does have an intrinsic value. Similarly, data in the form of generic email messages, memos, photographs, and so on, is relatively less business critical.
As new data is acquired into the organization, the older data decreases in relative value; this can be ascertained by performing an analysis of the access pattern of the data. New data is accessed and updated often; as the data becomes old, its frequency of usage diminishes. The frequency of access is one of the methods to find whether the data is relevant to business or not. There are other ways to classify data based on the legal compliance requirements, privacy laws, data availability criteria, and eventual use for analytics, and so on.
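Classifying data value by access pattern, as described above, can be sketched with a simple age-based rule. The thresholds below are illustrative assumptions, not a standard:

```python
from datetime import date

# A sketch of ascertaining relative data value from access patterns.
# The age thresholds are illustrative assumptions.
def classify_by_access(last_accessed, today=None):
    """Classify data by how recently it was accessed."""
    today = today or date.today()
    age_days = (today - last_accessed).days
    if age_days <= 30:
        return "hot"    # new data, accessed and updated often
    if age_days <= 365:
        return "warm"   # frequency of usage is diminishing
    return "cold"       # candidate for archive or disposition

print(classify_by_access(date(2015, 1, 10), today=date(2015, 1, 20)))  # hot
print(classify_by_access(date(2013, 1, 1), today=date(2015, 1, 1)))    # cold
```

As the text notes, access frequency is only one signal; legal compliance requirements, privacy laws, and analytics needs would override a purely age-based classification.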
The following figure provides a quick overview about the policies that are enforced by ILM:
Information Lifecycle Management
Implementing ILM using Data Lake
Let us now understand how ILM processes are applied as data moves through the three tiers of the Data Lake.
The Intake Tier
The Intake Tier is the place where we store raw data and enable exploratory analytics on it. This data stays here untouched until it is used. As the size of the raw data grows, most of it will not have been touched in any way; this makes it complex to ascertain how useful the data is based on access frequency, and whether to store the data in this tier forever or to move it into an archive.
The following are some of the mechanisms in which the ILM processes can be applied on the raw data in the Intake Tier:
- As the Data Lake stores unstructured raw data in file-based formats, data partitioning techniques give us the flexibility to classify it and govern it better
- In order to save considerable space in the raw zone, we could implement an automated process that can perform shallow compression on unstructured data after a configurable predetermined period of time
- We can implement columnar compression for structured data after a configurable predetermined period of time
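The automated compression process described above can be sketched as a job that gzips raw files older than a configurable period. The directory layout and age threshold are illustrative assumptions:

```python
import gzip
import os
import tempfile
import time

# A sketch of automated compression of raw files older than a configurable
# period. The age threshold and file layout are illustrative assumptions.
def compress_old_files(directory, max_age_days=90, now=None):
    """Gzip files older than max_age_days and remove the originals."""
    now = now or time.time()
    cutoff = now - max_age_days * 86400
    compressed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".gz") or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) < cutoff:
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                dst.write(src.read())
            os.remove(path)
            compressed.append(name)
    return compressed

# Demonstration on a throwaway directory with one artificially aged file.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.log")
with open(raw_path, "wb") as f:
    f.write(b"unstructured raw data")
os.utime(raw_path, (time.time() - 200 * 86400,) * 2)  # pretend it is 200 days old
done = compress_old_files(workdir, max_age_days=90)
print(done)  # ['raw.log']
```

Scheduling such a job (for example, nightly) keeps the raw zone's storage footprint in check without deleting any data.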
The Management Tier
The Management Tier of the Data Lake integrates raw data and other existing data by standardizing it into well-defined structures that are amenable to further processing. In this tier, the metadata is collected at each step of the process along with the tracking information, activity logging, and quality monitoring information.
In order to make real use of the ILM processes to govern the data stored in the management tier, the initial step would be to look at all the types of data that could be potentially kept in the Data Lake and perform a classification based on the following criteria:
- Which data is critical
- Which zone of Data Lake does this data exist in now
- How does this data flow within the Data Lake and how does it flow within the organization
- What changes happen to the data over a period of time; will its value diminish? Is it okay to hold it in the Data Lake?
- The extent of the data protection and data availability needed for this data
- What are the business requirements of this data
- What are the applicable legal policies
The Management Tier implements two tiering techniques, storage tiering and compression tiering, to store data. Tiering enables partitioning of data based on its lifecycle and class so that the least important data does not end up using costly storage, thus improving the performance of data access and reducing overall costs.
Storage tiering allows data to be moved from one class of storage to another in order to free up space on costlier storage so that more important data can be stored in it.
Compression tiering allows you to use different types of compression to match the different access patterns of data, so that the least important data, which is rarely used, can be compressed more aggressively to free up storage space.
The following are a few suggested storage and compression tiers:
- The high-performance tier: It is in this tier that all the mission critical data that is frequently accessed and updated is stored. Here the level of compression used is negligible. This tier typically utilizes smaller and faster high performance storage devices such as SSDs. This tier is housed in the Data Lake’s Data Hub Zone.
- The low-cost storage tier: This tier is where the less frequently accessed and relatively less useful data is stored. Here, the level of compression used is more. This tier is composed of larger and slower multi-array storage disks. This tier can be implemented in the Data Hub Zone as a low-cost storage array.
- The online archive tier: This tier is where all the data that is never used is stored. This tier is very large and usually stores the maximum quantity of data. Here, the level of compression used is the most. Typically, data is stored in an indexed compressed format so that it can be retrieved faster when necessary. This tier is located in an external archive database within the Data Lake; it uses ATA hard disks, which make retrieval easier than tape storage.
- The offline archive tier: This tier is where the data that is never used is stored on tape storage. It is very similar to the online archive, except that tape storage is used instead of hard disks. One major disadvantage of tape storage is slow data retrieval. This tier is optional to the Data Lake and can be implemented when there is a need to offload the online archive.
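Assigning data to the tiers above can be sketched as a rule based on access frequency and criticality. The thresholds and tier labels are illustrative assumptions:

```python
# A sketch of mapping data to the storage/compression tiers described above,
# based on access frequency. Thresholds and labels are illustrative assumptions.
def choose_tier(accesses_last_90_days, mission_critical=False):
    """Return a (storage tier, compression level) pair for a dataset."""
    if mission_critical or accesses_last_90_days > 100:
        return ("high-performance", "negligible compression")
    if accesses_last_90_days > 10:
        return ("low-cost storage", "moderate compression")
    if accesses_last_90_days > 0:
        return ("online archive", "maximum compression, indexed")
    return ("offline archive", "maximum compression, tape")

print(choose_tier(250))  # ('high-performance', 'negligible compression')
print(choose_tier(0))    # ('offline archive', 'maximum compression, tape')
```

In practice, such a rule would be re-evaluated periodically so that data migrates downward through the tiers as its usage diminishes.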
If the data in the Management Tier does not fit in any of the preceding storage and compression tiers, it can be marked for permanent deletion. This aspect of ILM is called defensible disposition. The data is typically deleted in a manner that is defensible in a court of law, so that the entire audit trail of the data from its creation to its deletion exists. Defensible disposition of data is often mandated by legal and corporate policies, which explicitly state the need to compulsorily store data for a certain time period, after which the data can be deleted; this period is termed the Legal Retention Period.
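Defensible disposition can be sketched as a check against the Legal Retention Period plus an audit-trail entry for every deletion. The seven-year retention period is an illustrative assumption; the actual period comes from legal and corporate policy:

```python
from datetime import date

# A sketch of defensible disposition: data may be deleted only after its
# Legal Retention Period, and every deletion is recorded in an audit trail.
RETENTION_YEARS = 7  # illustrative assumption; set by legal policy
audit_trail = []

def dispose_if_eligible(dataset_id, created, today):
    """Delete only if the retention period has elapsed; log the deletion."""
    retention_end = created.replace(year=created.year + RETENTION_YEARS)
    if today < retention_end:
        return False  # still within the Legal Retention Period
    audit_trail.append({"dataset": dataset_id, "deleted_on": today.isoformat()})
    return True

ok = dispose_if_eligible("hr_2005", date(2005, 3, 1), date(2015, 6, 1))        # True
too_early = dispose_if_eligible("hr_2012", date(2012, 3, 1), date(2015, 6, 1)) # False
```

The audit trail, persisted alongside the rest of the lineage metadata, is what makes the disposition defensible.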
The Consumption Tier
The Consumption Tier is where data is distributed in the raw format from the Raw Zone or in the structured format from the Data Hub. Data is consumed through this tier for external application access for analytics, visualization, or other application access through web services.
In this tier, all the Big Data applications implemented in the Data Lake are already integrated into the business workflow and are already providing business value. This tier implicitly assumes that all the controls are in place and every data flow is regularly monitored from the ILM perspective.
The following are the aspects tracked in the consumption tier as a part of ILM:
- The distribution of Data: As the Consumption Tier deals with the distribution of data from the Data Lake to internal and external data customers, every transaction that distributes data is tracked, logged, and monitored so that they adhere to the ILM policies of the organization.
- The use of Data: Once the data is distributed in the Consumption Tier, it is used by the data customers for generating analytical insights or for other purposes. ILM processes ensure that the right data is used by the right people by enforcing fine-grained access policies and also monitoring the usage patterns of the data.
This article explained Data Governance in detail and the ways to manage data with a focus on its availability, usability, integrity, retention, and security. We also saw how implementing Data Governance within the Data Lake makes these processes far more efficient.