Master Final Report: Data Spillage in Hadoop Clouds

Oluwatosin Alabi1, Joe Beckman2, Dheeraj Gurugubelli3
1 Purdue University, oogunwuy@purdue.edu; 2 Purdue University, beckmanj@purdue.edu; 3 Purdue University, dj@purdue.edu
Keywords: Data Spillage, Hadoop Clouds, Data Carving, Data Wiping, Digital Forensics
Data spillage in Hadoop clusters has been identified by the National Security Agency as a
security threat because sensitive information that is stored on these clusters and spilled has the
potential to be seen by those without permission to access such data. Military and other
government entities perceive data spillage as a security threat when sensitive information is
introduced onto one or more unauthorized platforms. This project focuses on tracking sensitive
information spilled through the introduction of a sensitive document onto a non-sensitive
Hadoop cluster by a user. The goal of the project is to contain the spilled data more quickly and
aid in the secure removal of that data from the impacted cluster. We seek to establish a
procedure to respond to this type of data spillage event by tracking the spillage through a
forensic process within a Hadoop environment on a cloud infrastructure.
Oluwatosin Alabi: Hadoop cluster configuration and system evaluation
Dheeraj Gurugubelli: Cyber forensics analysis and evaluation
ALL: Experimental design and data analysis
Joe Beckman: Data Management and recommendation of policy related controls
Project Deliverables:
1. Final Project Report: Hadoop: Understanding, Preventing, and Mitigating the Impacts of
Data Spillage from Hadoop Clusters using Information Security Controls. Expected
Delivery Date: 12/8/2014.
2. Project Poster: The project team will generate a poster for presentation of the project at
the annual CERIAS Symposium. Expected Delivery Date: 3/24 - 3/25/2015.
3. Research Conference Presentation/Journal Paper:
3.1. Network and Distributed System Security Symposium/International Journal of Security and Its Applications. Expected Delivery Date: 1/1/2015.
3.2. Storage Network Industry Association Data Storage Innovation Conference. Expected Delivery: April 7-9, 2015, Santa Clara, CA, USA.
Table of Contents
Executive Summary
1. Introduction
   1.1 Scope
   1.2 Significance
   1.3 Problem Statement
   1.4 Assumptions
   1.5 Limitations
2. Literature Review
   2.1 Hadoop Distributed File System and Data Spillage
   2.2 Digital Forensics and Cloud Environments
   2.3 Data Carving
   2.4 Contributions of Literature
3. Approach/Methodology
   3.1 Data Management Plan
4. Results and Conclusions
5. Schedule
6. Budget
7. Final Discussion and Future Directions
8. Bibliography
9. Biographical sketches of the team members
Executive Summary
Data spillage in Hadoop clusters has been identified by the National Security Agency as a
security threat because sensitive information that is stored on these clusters and spilled has the
potential to be seen by those without permission to access such data. Military and other
government entities perceive data spillage as a security threat when sensitive information is
introduced onto one or more unauthorized platforms. As the use of Hadoop clusters to manage
large amounts of data both inside and outside of government grows, the ability to locate and
remove data effectively and efficiently in Hadoop clusters will become increasingly important. This project focuses on tracking classified information spillage in Hadoop-based clusters in order to contain the spillage more quickly and aid in the data wiping process. The goal of this project is to
establish a procedure to handle data spillage by tracking the spilled data through the Hadoop
Distributed File System (HDFS) and applying digital forensics processes to aid in the removal of
spilled data within an impacted Hadoop cluster on a cloud infrastructure. Our approach to the
problem is novel because we merged the HDFS structure understanding with digital forensic
procedures to identify, acquire, analyze and report the locations of spilled data and any remnants
on virtual disks. Toward that end, we used the procedural framework illustrated below to track
and analyze user-introduced data spillage in the Hadoop-based cloud environment. By following
this procedure, we were able to image each impacted cluster node for analysis, locate all occurrences of a .pdf file after it was loaded onto the Hadoop Distributed File System (HDFS) cluster, and recover the file once it was deleted from the cluster using HDFS commands.
Data Spillage in Hadoop Clouds
Oluwatosin Alabi1, Joe Beckman2, Dheeraj Gurugubelli3
1 Purdue University, oogunwuy@purdue.edu; 2 Purdue University, beckmanj@purdue.edu; 3 Purdue University, dj@purdue.edu
Keywords: Data Spillage, Hadoop Clouds, Data Carving, Data Wiping, Digital Forensics
1. Introduction
Data spillage in Hadoop clusters has been identified by the National Security Agency as a
security threat because sensitive information that is stored on these clusters and spilled has the
potential to be seen by those without permission to access such data. Military and other
government entities perceive data spillage as a security threat when sensitive information is
introduced onto one or more unauthorized platforms. This project focuses on tracking sensitive
information in the Hadoop Distributed File System (HDFS) as a result of the introduction of a
sensitive document onto a non-sensitive Hadoop cluster by a user. The goal of the project is to find all occurrences of the spilled data, defined in this project as a user-introduced sensitive file, on the Hadoop cluster and to image the impacted cluster nodes for forensic analysis, in order to aid in removing that sensitive data from the impacted Hadoop cluster. We seek to establish a
procedure that can be used to respond to this type of data spillage event by tracking the spilled
information using digital forensics processes within a Hadoop environment on a cloud
infrastructure.
1.1 Scope
The scope of this project addresses locating all instances of deleted sensitive data on an impacted, virtualized Hadoop cluster and preserving that information on the impacted node images for forensic analysis under the following technical constraints:
• Hadoop Cluster: To fit the study within the required time line, the number of nodes and the size of each node had to be carefully chosen. This study used an 8-node cluster with 80 GB of storage space on each node.
• Virtualization: The Hadoop cluster used in this experiment was created using VMware ESXi software. Therefore, all nodes in the cluster exist not as physical servers but as virtual machines.
• Storage Disk Type: This study is valid only when the storage units used are Hard Disk Drives (HDDs).
1.2 Significance
Data spillage is a critical threat to the confidentiality of sensitive data. As use of
increasingly large data sets grows, use of Hadoop clusters to handle these data sets is also
growing, and with it, the potential for data spillage events. In this context, the ability to
completely remove sensitive data from a Hadoop cluster is critical, especially in areas that
impact national security. Without a process to effect the complete removal of sensitive data from
Hadoop clusters, user-induced data spillage events in Hadoop clusters could impact the privacy
of billions of people worldwide, and in classified United States government settings, the national
security of the United States.
1.3 Problem Statement
Data spillage in Hadoop clusters has been identified by the National Security Agency as a
security threat because sensitive information spilled onto unauthorized Hadoop clusters has the
potential to be seen by those without permission to access such data, which could negatively impact the national security of the United States.
1.4 Assumptions
This study assumes that the following are true:
• The drives used for storage in the cluster are Hard Disk Drives.
• AccessData's FTK Imager version 2.6 and Forensic Toolkit version 5 correctly represent the file structure of a forensic image file and the PDF files used in this study.
• The file load onto HDFS is successful, and the location of the spilled data on the DataNodes can be determined from the metadata in the logs of the NameNode.
1.5 Limitations
The following are limitations of this study.
• Time seriously constrained this research. Forensic procedures such as image acquisition and analysis consumed, on average, 13-18 hours for each of the nodes.
• The computing infrastructure available for digital forensics processing was limited; higher transfer and processing capability could have significantly reduced the amount of time spent processing the impacted cluster nodes.
• This research addressed a subset of the configurable options available when deploying a virtual machine-based Hadoop cluster. We believe that these configuration options had little or no impact on the results of the study, but re-running the experiment with different configuration options was outside the scope of this study.
• This research used VMware as a virtual machine platform. Other products may produce different results.
• The VMware ESXi platform was used to acquire the .vmdk images from the data store. Changes in the versions of that software could produce different results.
2. Literature Review
2.1 Hadoop Distributed File System and Data Spillage
Big data sets are characterized by the following attributes (Ramanathan et al., 2013): high volume (data size), variety (multiple sources and data types), velocity (the rate at which new information is added to the data set), and value (the utility and quality of the data set). A growing number of organizations are implementing data repositories, referred to as data lakes (EMC White Paper, 2014), on private cloud infrastructures and high performance computing resources to facilitate shared access to their data. The National Institute of Standards and Technology (NIST) has defined cloud computing as “a model for enabling convenient, on demand network access to a shared pool of configurable resources (e.g. networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Mell & Grance, 2011). Due to the trend toward the use of cloud computing infrastructures for data storage and processing, companies must address new and different security risks than those associated with traditional data storage and processing systems. One particular security risk is the loss of control over sensitive and protected data within an organization’s information technology infrastructure, which characterizes data leakage.
Data leakage is defined as the accidental or intentional distribution of classified or private information to an unauthorized entity (Anjali et al., 2013). Data spillage, a specific type of data leakage, occurs when classified or sensitive information is moved onto an unauthorized or undesignated compute node or storage medium (e.g. disk). The ability to control the privacy of sensitive information, then, is a critical component of protecting national infrastructure and security. Currently, research related to understanding and determining incident response techniques for dealing with data spillage within Hadoop’s distributed file system, HDFS, is limited.
Hadoop is the open source implementation of the Google™ MapReduce parallel computing framework. There are two major components of a Hadoop system: the HDFS file system for data storage, and the MapReduce parallel data processing framework. HDFS is one of a number of distributed file systems, such as PVFS, Lustre, and the Google File System (GFS). Unlike PVFS and Lustre, HDFS does not use RAID as part of its data protection mechanism. Instead, HDFS replicates data over multiple nodes, called DataNodes, to ensure reliability (DeRoos, 2014). The HDFS architecture is modeled after the Unix file system, which stores files as blocks. Each block stored on a DataNode holds 64 MB or 128 MB of data, as defined by the system administrator. Each group of blocks has metadata descriptions that are stored by the NameNode. The NameNode manages the storage of file locations and monitors the availability of DataNodes in the system, as described in Figure 1 below. Although there are a number of system level configurations that system administrators can implement to help secure Hadoop systems, they do not eliminate data spillage incidents related to user error (The Apache Software Foundation, 2014). This study will focus specifically on a case in which a user loads a confidential or sensitive file onto a Hadoop cluster that is not authorized to store or process that classified or sensitive data.
Figure 1. Hadoop Distributed File System
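Because the NameNode holds the block-to-DataNode mapping described above, an administrator can ask it directly which DataNodes hold the replicas of a given file. The following is a minimal sketch, not part of the original study, using the standard `hdfs fsck` command with its `-files -blocks -locations` options; the HDFS path "/user/analyst/spilled.pdf" is a hypothetical example.

```python
# Minimal sketch: list the DataNodes reported by the NameNode for one HDFS file.
# Assumes the `hdfs` CLI is on PATH and the cluster is reachable.
import re
import subprocess

def block_locations(hdfs_path: str) -> list:
    """Return the unique DataNode IP:port pairs that hold blocks of hdfs_path."""
    out = subprocess.run(
        ["hdfs", "fsck", hdfs_path, "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    ).stdout
    # fsck prints replica locations as IP:port pairs inside its report.
    return sorted(set(re.findall(r"\d+\.\d+\.\d+\.\d+:\d+", out)))

if __name__ == "__main__":
    for node in block_locations("/user/analyst/spilled.pdf"):  # hypothetical path
        print(node)
```

In a spillage response, output like this narrows the set of nodes that must be imaged, complementing the audit-log search described in the Results section.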
2.2 Digital Forensics and Cloud Environments
Lu et al. (2010) researched techniques for providing secure data provenance in cloud computing environments. They proposed a scheme characterized by a) providing information confidentiality for sensitive documents stored in the cloud, b) anonymous authentication of user access, and c) provenance tracking of disputed documents. Lu et al. (2010) were pioneers in proposing a feasible security scheme to ensure the confidentiality of sensitive data stored in cloud environments. Research has also been conducted to identify technical issues in digital forensics investigations performed on cloud-based computing platforms. The authors of this research argue that, due to the decentralized nature of clouds and of data processing in the cloud, traditional digital investigative approaches to evidence collection and recovery are not practical in cloud environments (Birk & Wegener, 2011). We propose to extend existing research by applying cloud-based digital forensics frameworks for use in incident management within Hadoop clusters. Our research addresses this process in the context of the Hadoop Distributed File System (HDFS), where published research is currently limited.
The theoretical framework used in the analysis portion of our investigation is the cloud
forensics framework used and published by Martini and Choo (2014). These authors validated
the cloud forensics framework outlined by McKemmish (1999) and the National Institute of
Standards and Technology (NIST) for conducting digital forensics investigations (Kent,
Chevalier, Grance, & Dang, 2006; Martini & Choo, 2014). The framework, depicted in Figure 2,
describes a four-stage iterative process that includes the identification, collection, analysis, and reporting of digital artifacts within the HDFS data storage system.
Figure 2. Digital investigation process overview used to guide this study.
2.3 Data Carving
When a file is deleted in most file systems, including HDFS, only the reference to that data, called a pointer, is deleted; the data itself remains. In the FAT file system, for example, when a file is deleted the file’s directory entry is simply changed to reflect that the space the data occupies is unallocated. The first character of the file name is replaced with a marker, and the actual file data is left unchanged. The file data remnants remain present unless they are overwritten with new data (Carrier & Spafford, 2003). Similarly, in HDFS, when a file is deleted, only the pointer to the file on the NameNode is deleted. The data remnants remain unchanged on the DataNodes until that data is overwritten.
According to the Digital Forensic Research Workshop, “Data carving is the process of extracting a collection of data from a larger data set. Data carving techniques frequently occur during a digital investigation when the unallocated file system space is analyzed to extract files. The files are 'carved' from the unallocated space using file type-specific header and footer values. File system structures are not used during the process” (Garfinkel, 2007). Simply stated, file carving is the process of extracting the remnants of data from a larger storage space. Data carving techniques are an important part of digital investigations, and digital forensics examiners commonly look for data remnants in unallocated file system space. Beek (2011) wrote a white paper explaining data carving concepts in which he referred to several data carving tools. In his paper, Beek also explained the difference between data carving and data recovery. According to Beek, data recovery is the retrieval of data based on the file system structure, which would not be useful after a system format; for carving, by contrast, the file system that stored the data is not important to the retrieval process. In the case examined by our research, we depend on HDFS metadata to identify the nodes that need to be carved.
Garfinkel (2007) proposed a file carving taxonomy which includes the following suggested types of file carving:
• Block-Based Carving
• Statistical Carving
• Header/Footer Carving
• Header/Maximum (file) size Carving
• Header/Embedded Length Carving
• File structure based Carving
• Semantic Carving
• Carving with Validation
• Fragment Recovery Carving
• Repackaging Carving
Every file stored on disk has a file type, and each file type is associated with header and footer values. For example, a PDF file starts with “%PDF” and ends with “%%EOF” and can be discovered by searching the disk space for the hex values of these header and footer signatures. There are a number of techniques to “carve” data remnants of deleted files from a disk, and the string search method is one of them (Povar & Bhadran, 2011). The Boyer-Moore search algorithm, described in R. S. Boyer and J. S. Moore's 1977 paper “A Fast String Searching Algorithm,” is one of the best-known ways to perform substring searches in a given search space.
The header and footer signatures for some common file types are shown in Figure 3 below:
Extension | Header (Hex) | Footer (Hex)
DOC | D0 CF 11 E0 A1 B1 1A E1 | 57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E
XLS | D0 CF 11 E0 A1 B1 1A E1 | FE FF FF FF 00 00 00 00 00 00 00 00 57 00 6F 00 72 00 6B 00 62 00 6F 00 6F 00 6B 00
PPT | D0 CF 11 E0 A1 B1 1A E1 | 50 00 6F 00 77 00 65 00 72 00 50 00 6F 00 69 00 6E 00 74 00 20 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74
ZIP | 50 4B 03 04 14 | 50 4B 05 06 00
JPG | FF D8 FF E0 00 10 4A 46 49 46 00 01 01 | D9 (“Better To Use File size Check”)
GIF | 47 49 46 38 39 61 4E 01 53 00 C4 | 21 00 00 3B 00
PDF | 25 50 44 46 2D 31 2E | 25 25 45 4F 46
Figure 3: Headers and Footers for Common File Types
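To make the header/footer method concrete, the sketch below scans a raw disk image for the PDF signatures listed in Figure 3 and writes out each candidate span. It is illustrative only: the study itself used FTK and EnCase for carving, and the image file name "datanode.dd" and the 50 MB size cap are assumptions introduced here.

```python
# Minimal header/footer carving sketch for PDFs in a raw (dd) image.
# "%PDF" = 25 50 44 46, "%%EOF" = 25 25 45 4F 46 (see Figure 3).
HEADER = b"%PDF"
FOOTER = b"%%EOF"
MAX_SIZE = 50 * 1024 * 1024  # stop looking for a footer after 50 MB (assumed cap)

def carve_pdfs(image_path: str, out_prefix: str = "carved") -> int:
    """Scan the image for header/footer pairs and write each candidate PDF."""
    with open(image_path, "rb") as f:
        data = f.read()  # acceptable for small test images; use mmap for large ones
    count = 0
    start = data.find(HEADER)
    while start != -1:
        end = data.find(FOOTER, start, start + MAX_SIZE)
        if end != -1:
            with open(f"{out_prefix}_{count}.pdf", "wb") as out:
                out.write(data[start:end + len(FOOTER)])
            count += 1
        start = data.find(HEADER, start + 1)
    return count

if __name__ == "__main__":
    print(carve_pdfs("datanode.dd"), "candidate PDFs carved")  # hypothetical image name
```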
Povar and Bhadran (2011) proposed an algorithm to carve PDF files. The algorithm contains six steps, quoted below.
Step 1. Look for the header signature (%PDF)
Step 2. Check for the version number [file offset 6-8]
Step 3. If version no. > 1.1 go to Step 4, else go to Step 6
Step 4. Search for the string “Linearized” in the first few bytes of the file
Step 5. If it finds the string in Step 4, then the length of the file is preceded by a “/L ” character sequence. Carved file size = embedded length; else go to Step 6.
Step 6. Use search algorithms to find the footer signature (%%EOF). Searching will be continued until the carved file size <= user specified file size.
2.4 Contributions of Literature
Frameworks and processes exist in the literature that support digital forensics processes in
virtual and cloud environments. These artifacts are not, however, specific to Hadoop
environments. To guide our efforts to locate deleted data in HDFS, image the impacted disks for
digital forensic analysis, and support the secure removal of remnants of sensitive data within the
cluster, we will extend the cloud forensics framework from NIST's Mell and Grance (2011) and
McKemmish (1999), pictured in Figure 4 below, to support digital forensics operations in HDFS.
Figure 4: Cloud Cyberforensics framework adapted for HDFS
3. Approach/Methodology
Following the scoping of the project and preparation of the Hadoop cluster environment,
the basic four stages of digital forensics analysis are outlined below, along with some of the practical considerations related to conducting a forensic investigation in a cloud computing environment such as HDFS.
1. Evidence Source Identification and Preservation: The focus of this stage is to identify and preserve potential sources of evidence. In the case of HDFS, the NameNode and DataNodes are the points to consider when identifying where evidence may be located. The NameNode can be used to locate where and how a file is distributed among the distributed file system’s DataNodes. We must also consider other components related to the removal of evidence or to system administration, such as the trash folder and the virtual machine snapshots taken to capture the state and data stored within a virtual machine at specified points in time.
2. Collection and Recovery: This stage is aimed at the collection of the data/information identified and preserved during stage one. In a virtual environment, access to the physical media is limited, so we instead collect an image of the disk contents that were provisioned to the virtual machine (VM), stored in the Virtual Machine Disk (VMDK) file format. These VMDK files are retrieved from the VM’s local disk, accessed through the vSphere client’s Datastore. The VMDK file must then be converted to a raw (dd) image file format as part of the imaging steps needed for stage three (a minimal integrity-hashing sketch for acquired images follows this list).
3. Examination and Analysis: This stage is aimed at examining the forensic data and analyzing that data, using a scientifically sound process, to help gather facts related to the incident under investigation. In our investigation we divide this into two steps:
a. Data remanence: capturing the information/data stored in the VMDK file and inspecting it for relevant factual evidence related to the incident under investigation. Here we use Guidance Software's EnCase and AccessData's Forensic Toolkit (FTK) to help process and analyze the .vmdk files retrieved from the HDFS cluster nodes. The aim here is the recovery of any trace evidence that may relate to the data spillage incident and the removal of sensitive information from the DataNodes.
b. Data Recovery: Here, efforts are made to examine the image files using EnCase, FTK, and the Stellar forensic investigation tool to recover data deleted from the VM's filesystem. Both EnCase and FTK allow raw disk images to be loaded and processed as dd file types. Stellar requires that images be mounted in Windows as a logical drive using a third party tool (Buchanan-Wollaston, Storer, & Glisson, 2013).
4. Reporting and Presentation: This stage relates to the legal presentation of the collected evidence and the investigation. In the case of this project, we are concerned with the reporting of evidence and the practical steps that were implemented to capture/recover evidence using a scientifically sound process that is repeatable. This will be documented and presented to the project Technical Directors and other stakeholders.
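As referenced in stage two, a standard preservation practice is to hash each acquired image so its integrity can be verified after transfer and before analysis. The sketch below illustrates that practice; it is an assumption of this write-up rather than a step recorded verbatim by the team, and the file name "DataNode1.vmdk" is hypothetical.

```python
# Minimal sketch: compute a digest of an acquired image for integrity verification.
import hashlib

def image_digest(path: str, algorithm: str = "sha256", chunk: int = 1 << 20) -> str:
    """Stream the image through the hash so multi-GB files never load fully into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

if __name__ == "__main__":
    print(image_digest("DataNode1.vmdk"))  # record this value in the chain-of-custody notes
```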
3.1 Data Management Plan
Managing data throughout the research life cycle is essential in order for this experiment to be replicated for further testing, to serve as guidance for various federal agencies, and to allow re-testing as new versions of Hadoop are released. The ability to publish and re-create this work involves the thorough documentation of processes and configuration settings. Initial data management will document the hardware and software specifications of the equipment used, the processes used in configuring both the Hadoop cluster and the physical and virtual nodes, and a detailed description of the tagged data and the process of loading it into the cluster. Any software or data used will be backed up to PURR when such backups are technically feasible and will not violate applicable laws. Results of forensic analysis will also be thoroughly documented and uploaded to PURR.
PURR will serve the data management requirements of this project well because it provides data storage and tools for uploading project data, communicating that data, and other services such as data security, fidelity, backup, and mirroring. Purdue Libraries also offers consulting services at no cost to project researchers to facilitate the selection and uploading of data, inclusive of the generating application and the metadata necessary to ensure proper long-term data stewardship. Documentation of results and processes will also be uploaded to the team's shared space on Google Docs as a backup to PURR. The contact information of the project's appointed data manager will also be uploaded to PURR to facilitate long term access to project data.
4. Results and Conclusions
Using the framework and digital forensics processes detailed above, we successfully
located and recovered from the impacted DataNodes the deleted .pdf file that we had designated
as sensitive. Four tests were conducted on the four images acquired during the acquisition process. For the purpose of testing, a Known Evidence File (KEF), a PDF document, was introduced onto the Hadoop Distributed File System.
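The KEF was introduced and later deleted using standard HDFS shell commands. The sketch below shows one way this could be driven; the exact commands and paths used during the tests are not recorded here, and "/user/test/KEF.pdf" is a hypothetical path.

```python
# Hedged sketch: load and later delete a test file using the standard `hdfs dfs` CLI.
import subprocess

def hdfs(*args: str) -> None:
    """Run one `hdfs dfs` sub-command and fail loudly if it returns non-zero."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

if __name__ == "__main__":
    hdfs("-put", "KEF.pdf", "/user/test/KEF.pdf")  # introduce the KEF (the simulated spill)
    hdfs("-rm", "/user/test/KEF.pdf")              # later: delete it (moved to trash if trash is enabled)
```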
Test 1
Metadata Search from Image 1: NameNode.vmdk (Before deletion of the KEF)
The aim of this test was to find metadata from the NameNode image before deletion of the KEF. The expected outcome was to find metadata indicating the nodes to which the KEF had been replicated. FTK Imager was able to mount the directory structure of the VM image added as evidence. The event logging file, hdfs-audit.log, was located, and the replication of the KEF was logged in it. The logs indicated the source path of the KEF, the IP address of the NameNode, and the disk IDs of the DataNodes to which the data was replicated. Thus, the DataNodes that needed to be imaged were identified.
Figure 5: A snapshot of the hdfs-audit.log
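Tests 1 and 3 relied on reading hdfs-audit.log entries that reference the KEF. The sketch below shows one way such a search could be scripted rather than performed by hand; the log location, the "KEF.pdf" search string, and the assumption that relevant entries carry a "cmd=" field are illustrative and not taken from the study's artifacts.

```python
# Hedged sketch: pull hdfs-audit.log lines that mention the KEF's path.
def audit_entries(log_path: str, needle: str) -> list:
    """Return audit-log lines that mention `needle` and look like HDFS command entries."""
    hits = []
    with open(log_path, "r", errors="replace") as log:
        for line in log:
            if needle in line and "cmd=" in line:   # assumed audit-entry marker
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for entry in audit_entries("hdfs-audit.log", "KEF.pdf"):  # hypothetical names
        print(entry)
```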
Test 2
Data Carving from Image 2: DataNode.vmdk (Before deletion of the KEF replica)
The aim of this test was to find the KEF in the file structure of the DataNode and identify the path and physical location to which it had been copied. The expected outcome was to locate the complete KEF. FTK Imager was able to mount the directory structure of the VM image added as evidence. The KEF, a probability and statistics book in .PDF format, was located.
Figure 6: Path to the KEF pdf.
Figure 7: The physical location of the KEF pdf
Figure 8: The KEF in Natural View in FTK Imager
The KEF file was located in the dfs directory, within the dn folder, in the 35th subdirectory. Thus, it was confirmed that the file had been replicated to the DataNodes indicated in the metadata on the NameNode. The physical location of the KEF is known and could be used in the data wiping process.
Test 3
Log Search from Image3: NameNode.vmdk (After deletion of the KEF)
The aim of this test was to find logs from the NameNode image after deletion of the KEF. The expected outcome was to find logs indicating the deletion of the KEF. FTK Imager was able to mount the directory structure of the VM image added as evidence. The event logging file, hdfs-audit.log, was located, and the deletion of the KEF was logged in it. The logs indicated that the KEF had been deleted and moved to .\trash. Thus, the KEF was successfully deleted, and its replicas were deleted on the DataNodes.
Figure 9: The hdfs-audit.log confirming the deletion of the KEF
Test 4
Data Carving from Image 4: DataNode.vmdk (After deletion of the KEF replica)
The aim of this test was to carve the KEF from the DataNode image after deletion of the KEF. The expected outcome was to be able to find and carve the KEF. Forensic Toolkit 5.3 was able to mount the VM image and process the image added as an evidence file. Though many carving techniques are available, as mentioned in the literature review, the header/footer carving method was used for this test. Searching the node image for 25 50 44 46 (the hex header for PDF) returned many PDF files and finally led to the discovery of the KEF PDF. The KEF was found at the exact same path found in Test 2, which is a deviation from the expected location: when a file is deleted, the pointer to the file is deleted and the space is labeled as unallocated, yet the KEF was still found at the same path location. Test 3 provides evidence that the file had been deleted, which implies its removal from the file structure, but the results show that it remains in the dfs directory. Thus, the KEF PDF was carved.
Figure 10: KEF found in the Header/Footer Search
Figure 11: Using the Save Selection option, the hex was exported
Figure 12: The file found is confirmed to be the KEF
5. Schedule
The chart above is the proposed time line for this project, and represents completed items
with green cells. Current progress, represented by the yellow cell, is behind original projections.
The project team has faced significant and unanticipated challenges during the analysis of the cluster nodes, including reloading data onto the cluster and re-analyzing the new data.
6. Budget
6.1 Proposed Budget
Team Hours 3 members, 12 weeks, $1,984/member: $5,592.00
Fringe Benefits 3 members, $750/member: $2,250.00
Purdue Indirect 1, $15,677.86 * .54: $8,466.00
Conference Travel 3 members, $2,000.00/member: $6,000.00
TOTAL: $22,308.00
6.2 Actual Expenditures
Team Hours 3 members, 14 weeks $2,315/member: $6,944.00
Fringe Benefits 3 members, $750/member: $2,250.00
Purdue Indirect 1, $15,677.86 * .54: $8,466.00
USB 1TB External Hard Disk Drive: $70.00
Conference Travel 3 members, $2,000.00/member: $6,000.00
TOTAL: $23,730.00
6.3 Discussion of Discrepancies
Actual expenditures exceeded the proposed budget due to the greater number of hours spent working on the project by team members and the purchase of an external hard disk drive used to store node images. Team members worked more hours on the project than initially budgeted because of the challenges of locating the text file on the DataNode images. These challenges forced the team to load a second file, in .pdf format, onto the Hadoop cluster. Following this load, the cluster nodes were re-imaged and re-examined. In order to store several 80 GB node images for examination and data management, the team required a 1 TB external hard drive, both to speed the examination process and because the PURR system listed in the initial proposal as the storage medium for these images had insufficient storage for archiving them.
7. Final Discussion and Future Directions
This study illuminated in greater depth several important issues involved in locating
sensitive data in Hadoop clusters. Many of these issues were not related directly to Hadoop
itself, but to the environment in which Hadoop is run. Because the study was limited in physical
resources, the team had to navigate issues with the virtual environment, as well as those related
to Hadoop itself. Unexpectedly, the changes that virtualization of the environment required in the analysis of the DataNodes added to the difficulty of the analysis process. As a result, several areas of future work related to this study exist. It is
important to also understand how the process of locating removed data and preserving that data
for forensic examination on a physical Hadoop cluster differs from the same process working on
a virtual cluster. Further, different virtualization techniques may also require changes to the
process. Beyond finding and preserving the evidence of deleted sensitive information in the
Hadoop environment, an organization that makes use of these techniques is also likely interested
in the ability to remove all traces of the sensitive data to a specification that is more robust than
simply issuing commands to Hadoop to remove the data. In these cases, the procedures
discussed in this study would need to be augmented to include the destruction of remaining
traces of data using DOJ or other data removal standards. Finally, because data is replicated
across several nodes within the Hadoop cluster based on Hadoop configuration settings, a data removal process that involves removing impacted nodes from the cluster could be extremely costly in performance and time. Future work should also include studies around
the ability of the organization to remove data within the Hadoop cluster to a desired specification
without removing nodes from their duties in the cluster. The full answer to the questions posed
by this problem will involve not only finding and preserving data from Hadoop clusters, but the
live removal of remaining traces of data to a specification selected by the organization, and the
automation of that process.
8. Bibliography
Anjali, N. B., Geetanjali, P. R., Shivlila, P., R. Shetkar, S., & B., K. (2013). Data leakage
detection. International Journal of Computer and Mobile Computing, 2(May), 283–288.
The Apache Software Foundation. (2014). Apache Hadoop 2.6.0 - Cluster Setup. Retrieved October 12, 2014, from http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Beek, C. (2011). Introduction to file carving. White paper. McAfee.
Birk, D., & Wegener, C. (2011, May). Technical issues of forensic investigations in cloud
computing environments. In Systematic Approaches to Digital Forensic Engineering
(SADFE), 2011 IEEE Sixth International Workshop on (pp. 1-10). IEEE.
Buchanan-Wollaston, J., Storer, T., & Glisson, W. (2013). Comparison of the Data Recovery
Function of Forensic Tools. In Springer (Ed.), Advances in Digital Forensics IX (IX., pp.
331–347). Springer Berlin Heidelberg.
Carrier, B., & Spafford, E. H. (2003). Getting physical with the digital investigation process.
International Journal of digital evidence, 2(2), 1-20.
DeRoos, D. (2014). Hadoop for Dummies. Wiley & Sons. New York. pp. 10-14.
EMC White Paper. (2014). Security and compliance for scale-out hadoop data lakes.
Garfinkel, S. L. (2007). Carving contiguous and fragmented files with fast object validation. Digital Investigation, 4, 2-12.
Kent, K., Chevalier, S., Grance, T., & Dang, H. (2006). Guide to Integrating Forensic Techniques into Incident Response (NIST Special Publication 800-86). Gaithersburg, MD: National Institute of Standards and Technology. Retrieved from http://cybersd.com/sec2/800-86Summary.pdf
Lu, R., Lin, X., Liang, X., & Shen, X. S. (2010, April). Secure provenance: The essential of bread and butter of data forensics in cloud computing. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security (pp. 282-292). ACM.
Martini, B., & Choo, K.-K. R. (2014). Distributed filesystem forensics: XtreemFS as a case
study. Digital Investigation, 1–19. doi:10.1016/j.diin.2014.08.002
McKemmish, R. (1999). What is forensic computing? Trends & Issues in Crime and Criminal Justice, No. 118 (pp. 1–6). Retrieved from http://aic.gov.au/documents/9/C/A/%7B9CA41AE8-EADB-4BBF-9894-64E0DF87BDF7%7Dti118.pdf
Mell, P., & Grance, T. (2011). The NIST definition of cloud computing (NIST Special Publication 800-145). Gaithersburg, MD: National Institute of Standards and Technology.
Povar, D., & Bhadran, V. K. (2011). Forensic data carving. In Digital Forensics and Cyber
Crime (pp. 137-148). Springer Berlin Heidelberg.
Ramanathan, A., Pullum, L., Steed, C. A., Quinn, S. S., Chennubhotla, C. S., & Parker, T. (2013).
Integrating Heterogeneous Healthcare Datasets and Visual Analytics for Disease Biosurveillance and Dynamics.
9. Biographical sketches of the team members
Oluwatosin Alabi is a doctoral research assistant for Drs. Dark and Springer. Her research
background in system modeling and analysis allows her to approach this project from a statistical modeling perspective. Over the summer, she worked on developing a data motion power consumption model for multicore systems for the Department of Energy (DOE). In
addition, she is working towards growing expertise and experience in big data analytics under
her adviser Dr. Springer, the head of the Discovery Advancements Through Analytics (D.A.T.A.)
Lab.
Joe Beckman is a Ph.D. student specializing in secure data structures for national health data
infrastructure. His most recent experience includes security and privacy analysis of laws,
policies, and information technologies for the Office of the National Coordinator for Health
Information Technology within the Department of Health and Human Services.
Dheeraj Gurugubelli is currently pursuing his second master's degree at Purdue University in Computer Science and Information Technology. Dheeraj is a qualified professional with diverse interests and experiences. Prior to his graduate studies at the University of Warwick, Dheeraj spent four years studying computer science engineering at MVGR College of Engineering. Continuing his education, he graduated with a master's degree in Cyber Security and Management from the University of Warwick, UK. While at Warwick, Dheeraj spent his days building the semantic analysis phase of a prototype cloud compliance automation tool at Hewlett-Packard's Cloud and Security Labs. Dheeraj has since worked at Purdue University as a Research Scholar in the domain of cyber security and digital forensics. He is an active member of IEEE.