"Cloudera Hadoop" by Kittirak Moungmingsuk, Managing Director of Cluster Kit Co., Ltd. and President of the Open Source Education and Development Association. Big Data & Analytics seminar by Data Cube (facebook.com/datacube.th)

Cloudera Hadoop
Kittirak Moungmingsuk
kittirak@clusterkit.co.th
Apr 4, 2015
Data Cube Seminar @ KU HOME

First, an Introduction
Kittirak Moungmingsuk, nickname "กก"
Currently wears many hats at a small company called "Cluster Kit", and has been entrusted by many people to serve as President of the Open Source Education and Development Association (OSEDA)
Education: Nak Tham Tri (elementary-level Dhamma studies), Ubon Ratchathani provincial school, Wat Pa Wiwek (Thammachan)
Enjoys surfing the Internet, traveling, and joining all kinds of activities

Cluster Kit: Achievements
ThaiGrid (Tera Cluster)
  800 Cores, Linux Cluster
  133 Cores, Win Cluster
Sila Cluster @Ramkhamhaeng U.
  286 Cores
BIOTEC (Eclipse Cluster)
  704 Cores
Virgin Radio Thailand
  7 nodes, Web Cluster
Geo-Informatics and Space Technology Development Agency (GISTDA)
  10 nodes, Web Cluster
HAII (HAII Cluster I, II)
  480 Cores

Top500.org (update Nov 2014)

Top500 Architecture Share (June 2014)

Top500 OS Share

Why Big Data?

Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg

Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction

Facebook Usage Statistics (June 2014)
829 million daily active users
654 million mobile daily active users
1.32 billion monthly active users
1.07 billion mobile monthly active users
Approximately 81.7% of daily active users are outside the US and Canada
Source: http://newsroom.fb.com/company-info/

Google Usage Statistics
Data from http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/#.VDavqq2mhNA
Monthly Google searches: 11.944 billion (3/20/14)
Monthly unique visitors: 187 million (3/25/14)

Number of Servers
Google: more than a million servers
Facebook: 180,900 servers
https://www.facebook.com/ArcadianLearnings/posts/549836811713533

Low Cost High Performance

http://www.opencompute.org/

Software
Linux
Python
C++, Java, Javascript, Go, Sawzall (a custom logging language)
Hadoop

Linux
PHP, C++, Java, Python, and Ruby
Apache Web Server
MySQL
Hadoop
Memcached, Flashcache
HipHop, which transforms PHP source code into C++ to gain performance benefits

What is Hadoop?
HDFS
MapReduce
How to build a Hadoop cluster
How to execute MapReduce
Hive SQL

Hadoop: How Was It Born?
To process huge volumes of data, as the amount of generated data continued to increase rapidly (Big Data).
The Web was also generating more and more information, and indexing its content was becoming quite challenging.

What Is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Image Source: http://blogs.ejb.cc/archives/4290/hadoop-technical-manuals-athe-hadoop-ecosystem/tumblr_lbbwggcer71qappj8

HDFS Architecture
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Data Replication
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Hadoop - Basic Architecture
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

Hadoop - Basic Architecture (contd.)
Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

MapReduce
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google.
(Source: Wikipedia)
Map: (K1, V1) -> list(K2, V2)
Reduce: (K2, list(V2)) -> list(K3, V3)

MapReduce
Output as a list of (Key, Value)
Output as a list of (Key, List of Values)
Image source: http://www.rabidgremlin.com/data20/#(3)/

WordCount - MapReduce
Map function
Reduce function

WordCount by Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
https://pig.apache.org/
  A = load 'input/*';
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  C = group B by word;
  D = foreach C generate COUNT(B), group;
  store D into 'output/wordcount-pig';
Source: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm

Hive
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
  hive> show tables;
  hive> create table country (country_id int, country string) row format delimited fields terminated by ',' stored as textfile;
  hive> desc country;
  hive> load data local inpath '/tmp/country.csv' into table country;
  hive> select count(country_id) from country where country like 'T%';

Apache™ Mahout is a library of scalable machine learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm.
Mahout supports four main data science use cases:
Collaborative filtering
Clustering
Classification
Frequent itemset mining

List of algorithms (for distributed mode)
Distributed Item-based Collaborative Filtering
Canopy Clustering
Dirichlet Process Clustering
Hierarchical Clustering
Collaborative Filtering Using a Parallel Matrix Factorization
Latent Dirichlet Allocation
Bayesian
Fuzzy K-Means
K-Means Clustering
Mean Shift Clustering
Minhash Clustering
Spectral Clustering
Random Forests
Parallel FP Growth Algorithm
Source: http://hortonworks.com/hadoop/mahout/

Source: http://imgbuddy.com/hadoop-ecosystem-components.asp

Hadoop 1.0 vs 2.0
Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/

Hadoop 2.0: YARN (Yet Another Resource Negotiator)
Source: http://hortonworks.com/get-started/yarn/

Cloudera Hadoop (CDH)
CDH is Cloudera's open source software distribution of Apache Hadoop and related projects
http://www.cloudera.com/
Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

Cloudera GUI (Hue)

Other Hadoop Platforms
Hortonworks http://hortonworks.com/
MapR https://www.mapr.com/

References
https://www.facebook.com/Engineering
https://www.facebook.com/data
https://www.facebook.com/publication
http://research.google.com
http://googleblog.blogspot.com

References
Stratapps, "An Introduction to Hadoop", http://stratapps.net/intro-hadoop.php
edureka!, "Introduction to Hadoop 2.0 and advantages of Hadoop 2.0 over 1.0", http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/, May 2014

Second-hand Computers for Children in Rural Areas Project

The End.
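
Backup: WordCount Map and Reduce functions (sketch)
The WordCount - MapReduce slide names a Map function and a Reduce function, but the listings themselves did not survive as text. The following is a minimal single-process sketch in Python of how such a pair might look, following the map: (K1, V1) -> list(K2, V2) and reduce: (K2, list(V2)) -> list(K3, V3) shapes from the MapReduce slide. It is not the code shown in the talk and it does not use the Hadoop API; the names map_fn, shuffle, reduce_fn and word_count are hypothetical, and the grouping step that Hadoop performs between the two phases is simulated here with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """map: (K1, V1) -> list(K2, V2). Emits (word, 1) for every word."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Groups values by key, standing in for Hadoop's shuffle/sort step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """reduce: (K2, list(V2)) -> (K3, V3). Sums the counts for one word."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    """Runs map, shuffle and reduce in sequence on in-memory lines."""
    # The line index plays the role of the input key (K1).
    mapped = (pair
              for offset, line in enumerate(lines)
              for pair in map_fn(offset, line))
    return dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "Hadoop cluster"]))
```

On a real cluster the map and reduce steps would run on different nodes and the framework would handle the grouping; only the two function bodies correspond to what a job author writes.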