
In this article by Joe Sremack, author of the book Big Data Forensics, we will cover the following topics:

  • An overview of how to identify Big Data forensic evidence
  • Techniques for previewing Hadoop data


Hadoop and other Big Data systems pose unique challenges to forensic investigators. Hadoop clusters are distributed systems with voluminous data storage, complex data processing, and data that is split and made redundant at the data block level. Performing forensics on Hadoop with traditional methods is therefore not always feasible. Instead, forensic investigators, experts, and legal professionals, such as attorneys and court officials, need to understand how forensics is performed against these complex systems.

The first step in a forensic investigation is to identify the evidence. In this article, several of the concepts involved in identifying forensic evidence from Hadoop and its applications are covered.

Identifying forensic evidence is a complex process for any type of investigation. It involves surveying a set of possible sources of evidence and determining which sources warrant collection. Data in any organization's systems is rarely well organized or documented. Investigators need to take a set of investigation requirements and determine which data needs to be collected. This requires a few first steps:

  1. Properly reviewing system and data documentation.
  2. Interviewing staff.
  3. Locating backup and non-centralized data repositories.
  4. Previewing data.

The process of identifying Big Data evidence is made difficult by the large volume of data, distributed filesystems, the numerous types of data, and the potential for large-scale redundancy in evidence.

Big Data solutions are also unique in that evidence can reside in different layers. Within Hadoop, evidence can take multiple forms, ranging from files stored in the Hadoop Distributed File System (HDFS) to data extracted from applications. To properly identify the evidence in Hadoop, multiple layers are examined. While all the data may reside in HDFS, the form may differ in a Hadoop application (for example, HBase), or the data may be more easily extracted to a usable format from HDFS using an application (such as Pig or Sqoop).

Identifying Big Data evidence can also be complicated by redundancies caused by:

  • Systems that input to or receive output from Big Data systems
  • Archived systems that may have previously stored the evidence in the Big Data system

A primary goal of identifying evidence is to capture all relevant evidence while minimizing redundant information.

Outsiders looking at a company's data needs may assume that identifying the relevant data is as simple as asking several individuals where the data resides. In reality, the process is much more complicated, for a number of possible reasons:

  • The organization may be an adverse party and cannot be trusted to provide reliable information about the data
  • The organization is large and no single person knows where all data is stored and what the contents of the data are
  • The organization is divided into business units, with no business unit knowing what data the others store
  • Data is stored with a third-party data hosting provider
  • IT staff may know where data and systems reside, but only the business users know the type of content the data stores

For example, one might assume a pharmaceutical sales company would have an internal system structured with the following attributes:

  • A division whose data is collected from a sales database
  • An HR department database containing employee compensation, performance, and retention information
  • A database of customer demographic information
  • An accounting department database to assess what costs are associated with each sale

In such a system, that data is then cleanly unified and compelling analyses are created to drive sales. In reality, an investigator will probably find that the Big Data sales system actually comprises a larger set of data that originates inside and outside the organization. There may be a collection of spreadsheets on sales employees' desktops and laptops, along with some of the older versions on backup tapes and file server shared folders. There may be a new Salesforce database implemented two years ago that is incomplete and is actually the replacement for a previous database, which was custom-developed and used by 75% of employees. A Hadoop instance running HBase for analysis receives a filtered set of data from social media feeds, the Salesforce database, and sales reports. All of these data sources may be managed by different teams, so identifying how to collect this information requires a series of steps to isolate the relevant information.

The problem for large, or even midsize, companies is much more difficult than our pharmaceutical sales company example. Simply creating a map of every data source and the contents of those systems could require weeks of in-depth interviews with key business owners and staff. Several departments may have their own databases and Big Data solutions that may or may not be housed in a centralized repository. Backups for these systems could be located anywhere. Data retention policies will vary by department, and most likely by system. Data warehouses and other aggregators may contain important information that will not surface through normal interviews with staff. These data warehouses and aggregators may have previously generated reports that could serve as valuable reference points for future analysis; however, all data may not be available online, and some data may be inaccessible. In such cases, the company's data will most likely reside in off-site servers maintained by an outsourcing vendor, or worse, in a cloud-based solution.

Big Data evidence can be intertwined with non-Big Data evidence. Email, document files, and other evidence can be extremely valuable for performing an investigation. The process for identifying Big Data evidence is very similar to the process for identifying other evidence, so the identification process described in this book can be carried out in conjunction with identifying other evidence. An important consideration for investigators to keep in mind is whether the Big Data evidence needs to be collected at all, that is, whether it is relevant to the requirements of the investigation or whether the same evidence can be collected more easily from other non-Big Data systems.

The following figure illustrates the process for identifying Big Data evidence:

[Figure: the Big Data evidence identification process]

Initial steps

The process for identifying evidence is:

  1. Examining requirements
  2. Examining the organization’s system architecture
  3. Determining the kinds of data in each system
  4. Assessing which systems to collect

In the book, Big Data Forensics, the topics of examining requirements and examining the organization’s system architecture are covered in detail. The purpose of these two steps is to take the requirements of the investigation and match those to known data sources. From this, the investigator can begin to document which data sources should be examined and what types of data may be relevant.

Assessing data viability

Assessing the viability of data serves several purposes. It can:

  • Allow the investigator to identify which data sources are potentially relevant
  • Yield information that can corroborate the interview and documentation review information
  • Highlight data limitations or gaps
  • Provide the investigator with information to create a better data collection plan

Up until this point in the investigation, the investigator has only gathered information about the data. Previewing and assessing samples of the data gives the investigator the chance to actually see what information is contained in the data and determine which data sources can meet the requirements of the investigation.

Assessing the viability and relevance of data in a Big Data forensic investigation is different from that of a traditional digital forensic investigation. In a traditional digital forensic investigation, the data is typically not previewed out of fear of altering the data or metadata. With Big Data, however, the data can be previewed in some situations where metadata is not relevant or available. This factor opens up the opportunity for a forensic investigator to preview data when identifying which data should be collected.

The main considerations for each source of data include the following:

  • Data quality
  • Data completeness
  • Supporting documentation
  • Validating the collected data
  • Previous systems where the data resided
  • How the data enters and leaves the system
  • The available formats for extraction
  • How well the data meets the data requirements

There are several methods for previewing data. The first is to review a data extract, the results of a query, or sample text files that are stored in Hadoop. This method allows the investigator to determine the types of information available and how the information is represented in the data. In highly complex systems consisting of thousands of data sources, this may not be feasible or may require a significant investment of time and effort.
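
For example, where previewing is appropriate (that is, where the HDFS metadata is not itself at issue), a small sample of a plain text file stored in HDFS can be viewed directly from a client without retrieving the full file. The following is a minimal sketch; the path /data/sales/2015/part-00000 is a hypothetical file identified from a directory listing:

hdfs dfs -du -h /data/sales/2015
hdfs dfs -cat /data/sales/2015/part-00000 | head -n 20

The first command reports the size of each file before previewing; the second prints only the first twenty lines of the sample file.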

The second method is to review reports or canned query output that were derived from the data. Some Big Data solutions are designed with reporting applications connected to the Big Data system. These reports are a powerful tool, enabling an investigator to quickly gain an understanding of the contents of the system without requiring much up-front effort to gain access to the systems.

Data retention policies and data purge schedules should be reviewed and considered in this step as well. Given the large volume of data involved, many organizations routinely purge data after a certain period of time.

Data purging can mean the archival of data to near-line or offline storage, or it can mean the destruction of old data without backup. When data is archived, the investigator should also determine whether any of the data in near-line or offline backup media needs to be collected or if the live system data is sufficient. Regardless, the investigator should determine what the next purge cycle is and whether that necessitates an expedited collection to prevent loss of critical information. Additionally, the investigator should determine whether the organization should implement a litigation hold, which halts data purging during the investigation. When data is purged without backup, the investigator must determine:

  • How the purge affects the investigation
  • When the data needs to be collected
  • Whether supplemental data sources must be collected to account for the lost data (for example, reports previously created from the purged data or other systems that created or received the purged data)

Identifying HDFS evidence

HDFS evidence can be identified in a number of ways. In some cases, the investigator does not want to preview the data, in order to preserve the integrity of the metadata. In other cases, the investigator is only interested in collecting a subset of the data. Limiting the data can be necessary when the data volume prohibits a complete collection or when forensically imaging the entire cluster is not possible.

The primary methods for identifying HDFS evidence are to:

  • Generate directory listings from the cluster
  • Review the total data volume on each of the nodes

Generating directory listings from the cluster is a straightforward process of accessing the cluster from a client and running the Hadoop directory listing command. The cluster is accessed from a client by either directly logging in through a cluster node or by logging in from a remote machine. The command to print a directory listing is as follows:

# hdfs dfs -lsr /

This generates a recursive directory listing of all HDFS files starting from the root directory. The command produces the filenames, directories, permissions, and file sizes of all files. (On newer Hadoop releases, -lsr is deprecated in favor of the equivalent hdfs dfs -ls -R /.) The output can be redirected to an output file, which should be saved to an external storage device for review.
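
As a minimal sketch, the listing can be redirected to a file on an external evidence drive and hashed so that it can be validated later; the mount point /mnt/evidence is hypothetical:

hdfs dfs -ls -R / > /mnt/evidence/hdfs_listing.txt
sha256sum /mnt/evidence/hdfs_listing.txt > /mnt/evidence/hdfs_listing.txt.sha256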

Identifying Hive evidence

Hive evidence can be identified through HiveQL commands. The following table lists the commands that can be used to get a full listing of all databases and tables as well as the tables’ formats:

Command                                   Description
SHOW DATABASES;                           Lists all available databases
SHOW TABLES;                              Lists all tables in the current database
USE databaseName;                         Makes databaseName the current database
DESCRIBE (FORMATTED|EXTENDED) table;      Lists the formatting details of the specified table

Identifying all tables and their formats requires iterating through every database and generating a list of tables and each table’s formats. This process can be performed either manually or through an automated HiveQL script file. These commands do not provide information about database and table metadata—such as number of records and last modified date—but they do give a full listing of all available, online Hive data.
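
Where the environment permits, this enumeration can be scripted from a client shell rather than typed manually. The following is a minimal sketch, assuming the hive command-line client is available; the output file names are hypothetical, and a second pass over each table list can issue the DESCRIBE commands:

hive -S -e "SHOW DATABASES;" > databases.txt
while read db; do
  hive -S -e "USE $db; SHOW TABLES;" > "tables_$db.txt"
done < databases.txt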

HiveQL can also be used to preview the data using subset queries. The following example returns the first ten rows of a Hive table:

SELECT *
FROM Table1
LIMIT 10;
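
To preserve this preview rather than viewing it interactively, the same query can be run non-interactively from the shell and redirected to a file. A sketch, with a hypothetical output path:

hive -S -e "SELECT * FROM Table1 LIMIT 10;" > /mnt/evidence/table1_preview.txt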

Identifying HBase evidence

HBase evidence is stored in tables, and identifying the names of the tables and the properties of each is important for data collection. HBase stores metadata information in the -ROOT- and .META. tables. These tables can be queried using HBase shell commands to identify the information about all tables in the HBase cluster.

Information about the HBase cluster can be gathered using the status command from the HBase shell:

status

This produces the following output:

2 servers, 0 dead, 1.5000 average load

For additional information about the names and locations of the servers—as well as the total disk sizes for the memstores and HFiles—the status command can be given the detailed parameter.
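
For example, the detailed form is invoked from the HBase shell as follows (the output varies by cluster):

status 'detailed'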

The list command outputs the name of every HBase table. In this example, the cluster contains a single table, testTable, which is shown via the following command:

list

This produces the following output:

TABLE
testTable
1 row(s) in 0.0370 seconds
=> ["testTable"]

Information about each table can be generated using the describe command:

describe 'testTable'

The following output is generated:

'testTable', {NAME => 'account', DATA_BLOCK_ENCODING => 'NONE',
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY =>
'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME =>
'address', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS =>
'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK
=> 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.0300 seconds

The describe command yields several useful pieces of information about each table. Each column family is listed, and for each family, the output includes the data block encoding, the compression setting, the maximum number of cell versions retained (VERSIONS), and whether deleted cells are kept (KEEP_DELETED_CELLS).
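
Where previewing a small sample of table contents is acceptable (as with the Hive LIMIT query shown earlier), the HBase shell scan command also accepts a row limit; a minimal sketch against the example table:

scan 'testTable', {LIMIT => 10}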

Security information about each table can be gathered using the user_permission command as follows:

user_permission 'testTable'

This command is useful for identifying the users who currently have access to the table. As mentioned before, user accounts are not as meaningful in Hadoop because of the distributed nature of Hadoop configurations, but in some cases, knowing who had access to tables can be tied back to system logs to identify individuals who accessed the system and data.

Summary

Hadoop evidence comes in many forms. The methods for identifying the evidence require the forensic investigator to understand the Hadoop architecture and the options for identifying the evidence within HDFS and Hadoop applications. In Big Data Forensics, these topics are covered in more depth, from the internals of Hadoop to conducting a collection from a distributed system.
