What Is Hadoop? Managing Big Data in the Enterprise

Introduction

Data volumes are growing much faster than compute power. This growth demands new strategies for processing and analyzing information.

According to IDC¹, the amount of digital information produced in 2011 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be “unstructured” – complex data poorly suited to management by structured storage systems like relational databases. Unstructured data comes from many sources and takes many forms – web logs, text files, sensor readings, user-generated content like product reviews or text messages, audio, video, still imagery and more.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do RFID tag reads show anomalies in the movement of goods during distribution? Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers? Can a collection of nucleotide sequences be assembled into a single gene? Companies that can extract facts like these from huge volumes of data can better control processes and costs, better predict demand and build better products.

Dealing with big data requires two things:
• Inexpensive, reliable storage; and
• New tools for analyzing unstructured and structured data.

Apache Hadoop is a powerful open source software platform that addresses both of these problems. Hadoop is an Apache Software Foundation project; Cloudera offers commercial support and services to Hadoop users.

¹ “An Updated Forecast of Worldwide Information Growth Through 2011,” IDC, March 2008.

Reliable Storage: HDFS

Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these same techniques to store enterprise data.

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each block redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each block to three different servers:

[Figure 1: HDFS distributes file blocks among servers]

HDFS has several useful features. In the simple example shown, any two servers can fail and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of the missing data from the replicas it manages. Because the cluster stores several copies of every block, more clients can read the data at the same time without creating bottlenecks.

Other fault-tolerant storage systems, including the various redundancy strategies employed by RAID machines, are often more expensive than HDFS. HDFS offers two key advantages over RAID: it requires no special hardware, since it can be built from commodity servers, and it can survive more kinds of failure – a disk, a node on the network or a network interface. The one obvious objection to HDFS – that it consumes three times the necessary storage space for the files it manages – is not so serious, given the plummeting cost of storage. In addition, HDFS offers some real advantages for data processing, as the next section will show.

Hadoop for Big Data Analysis

Many popular tools for enterprise data management – relational database systems, for example – are designed to make simple queries run quickly. They use techniques like indexing to examine just a small portion of all the available data in order to answer a question.

Hadoop is a different sort of tool, designed for large-scale analyses that need to examine all the data in a repository. Text analysis and image processing, for example, generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly.
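To make the programming model concrete, here is a minimal sketch of a MapReduce job, the classic word count, written against Hadoop's Java MapReduce API. The sketch is illustrative rather than drawn from this paper: the class names, the job name and the command-line input and output paths are placeholders, and small details such as the Job.getInstance() call vary between Hadoop releases. The map function runs independently over each input record; the reduce function collates the intermediate (word, count) pairs into final totals.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job; class names and paths are placeholders, not from the paper.
public class WordCount {

  // The map step runs on many servers at once, each reading the blocks stored locally.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);        // emit (word, 1) for every word seen
      }
    }
  }

  // The reduce step collates the partial results into one total per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // results are written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged as a JAR and submitted to the cluster; Hadoop then schedules the map tasks on the servers that already hold the relevant blocks, as the next paragraphs describe.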
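The speed of such a job depends on where the map tasks run, which in turn depends on how HDFS has laid out the data. As a rough illustration of the storage side, the following sketch uses the standard org.apache.hadoop.fs.FileSystem client API to write a small file and then ask where each of its blocks is stored; the cluster address, file path and sample record are hypothetical, and the configuration property names follow newer Hadoop releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative HDFS client; the cluster address and file path below are hypothetical.
public class HdfsBlocksExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical NameNode address
    conf.set("dfs.replication", "3");                             // keep three copies of each block

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/pos/sales.log");                  // hypothetical file path

    // Write a file; HDFS breaks it into blocks and replicates each one across servers.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("store_id,sku,quantity,timestamp\n");        // hypothetical sample record
    }

    // Ask the NameNode which servers hold each block of the file.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("block at offset " + block.getOffset()
          + " is stored on " + String.join(", ", block.getHosts()));
    }
  }
}

The block locations reported here are exactly the information Hadoop consults when deciding where to run each piece of an analysis.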
In the previous section, we saw that HDFS distributes the blocks of a single file among a large number of servers for reliability. Hadoop takes advantage of this data distribution by pushing the work involved in an analysis out to many different servers in the cluster. Each server runs the analysis on its own blocks of the file, so the work proceeds in parallel and avoids the bottlenecks imposed by monolithic storage systems. Results are collated and digested into a single answer after each piece has been analyzed.

[Figure 2: Hadoop pushes work out to the data]

Running the analysis on the nodes that actually store the data delivers much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server that holds a copy of the data.

Summary

Hadoop's MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
+1-650-362-0488
http://www.cloudera.com/