大规模数据处理/云计算 (Large-Scale Data Processing / Cloud Computing)
Lecture 6 – Inverted Index
彭波 北京大学信息科学技术学院 (Peking University, School of EECS)
7/17/2014
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin, University of Maryland
SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Word Co-occurrence Tasks
• Perform word co-occurrence analysis on the Shakespeare collection and the AP collection, found under the directories /public/Shakespeare and /public/AP of our sewm cluster (or your own virtual cluster). By default, your map function receives one line of text as input. (80 points)
• Try to optimize your program and find the fastest version. Describe your approaches and their evaluation in your report. (20 points)
• Analyze the resulting data matrix and find something interesting. (10 points bonus)
• Write a report describing your approach to each task, the problems you met, etc.

Discussions in Course Group

Q: How do I view the logs through the Web UI?
• Visit http://changping11:50030/

Shuffle and Sort
[Figure: the shuffle-and-sort pipeline — each mapper fills a circular buffer (in memory), spills to disk (optionally through a combiner), and merges its spills into intermediate files (on disk); these are fetched by this reducer and by other reducers, while this reducer likewise pulls from other mappers.]

Job History Disappeared?
• Job history is stored on the local filesystem of the jobtracker (hadoop.job.history.location).
– The jobtracker's history files are kept for 30 days before being deleted by the system.
• A second copy is stored for the user in the _logs/history subdirectory of the job's output directory (hadoop.job.history.user.location).
– Never deleted by the system.
• View it through the web UI or on the command line:
– $ hadoop job -history output-dir

Job Retirement Policy
• Once a job is complete, it is kept in memory (up to mapred.jobtracker.completeuserjobs.maximum jobs).
• Overall retirement policy for completed jobs:
– Key: mapred.jobtracker.retirejob.interval
– Default: 24 * 60 * 60 * 1000 ms (1 day)

Task Logs
• There are some controls for managing the retention and size of task logs.
• By default, logs are deleted after a minimum of 24 hours (set this with the mapred.userlog.retain.hours property).
• mapred.userlog.limit.kb is 0 by default, meaning there is no size cap.

Q: I can't open the stdout debugging output in the logs!
• Right — the changping hostnames need to resolve on your machine.
• Add the entries below to your hosts file:
--------------
• 222.29.134.11 changping11
• 222.29.134.12 changping12
• 222.29.134.14 changping14
• 222.29.134.15 changping15
• 222.29.134.17 changping17
• 222.29.134.18 changping18
• 222.29.134.20 changping20

A Confession…
• I was wondering how to change the number of map tasks, and found these in the API:
• setMinInputSplitSize(Job, long)
• setMaxInputSplitSize(Job, long)
• The default number of "Counters" was 44, so I (foolishly) added the following two lines to "word count":
• FileInputFormat.setMinInputSplitSize(job, 10);
• FileInputFormat.setMaxInputSplitSize(job, 10);
• ...and then the tragedy shown in the figure happened...

Overload — Why?
• 1. Overload — why?
– Slot configuration: originally 12 map slots and 6 reduce slots, which is a bit much for only 12 cores.
• 2. Someone asked what the scheduling policy is. It's FairScheduler, but the effect seems mediocre — some people's job tasks run flat out while others wait forever. Why?
– From what I've read, the fair scheduler has a "multiassign" mechanism that hands a tasktracker all the tasks it can run in one go. Could that be what causes the unfairness? I changed this setting to assign 2 tasks at a time.
• 3. I saw people adjusting job priority — smart, but unfair.
– Changed the configuration: priority is now final, fixed at NORMAL.
• One more note: the cluster's job scheduling does guarantee a degree of fairness. If your job runs slowly, first check its counters — especially whether the input/output data sizes are abnormal — and don't overlook problems in your own program.

Q: For the AP dataset in Assignment 2, how do we reduce the number of mappers (1050 by default)?
• Personally I feel that launching this many mappers is a terrible waste of mapper lives~~
• FileInputFormat.setMinInputSplitSize(job, 100*1024*1024);
• Or try CombineFileInputFormat.
• ......

Q: A problem with MapWritable
• Why, after the reduce output, does a MapWritable print as something like org.apache.hadoop.io.MapWritable@12f1eff?
• Does it implement serialization but not override toString()?
• If so, the map output files should also be in this address form — so how does the reduce side deserialize its input?
• Could it be that map output does not go through toString(), while reduce output does?
• (Exactly: intermediate map output is written in binary via Writable.write() and read back via readFields(), so toString() is never involved; only the final text output of the reducer calls toString().)

Q: What is the difference between setSortComparatorClass and setGroupingComparatorClass?
• To send all WordPairs with the same first key to the same reducer, both methods seem able to do the job — what's the difference between them?
• You can use org.apache.hadoop.mapreduce.Job's setPartitionerClass method to use a custom partitioner with a MapReduce job.
• Sort Comparator Class: used to control how the keys are sorted before the reduce step.
• Grouping Comparator Class: used to control which keys are grouped into a single call to reduce().
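The distinction between the two comparators can be illustrated outside Hadoop with plain Java comparators. This is a hypothetical sketch — WordPair is modeled as a two-element String array, and the class and method names are illustrative, not from the Hadoop API:

```java
import java.util.*;

// Illustration of sort comparator vs. grouping comparator for a
// composite key WordPair = (first, second). The sort comparator
// orders keys by BOTH fields; the grouping comparator decides which
// consecutive sorted keys are fed to the SAME reduce() call — here,
// all keys sharing the first word.
public class ComparatorDemo {
    public static final Comparator<String[]> SORT =
        Comparator.<String[], String>comparing(p -> p[0])
                  .thenComparing(p -> p[1]);          // full ordering
    public static final Comparator<String[]> GROUP =
        Comparator.comparing(p -> p[0]);              // first key only

    // Given keys in sorted order, count how many reduce() calls the
    // grouping comparator would produce.
    public static int reduceCalls(List<String[]> sortedKeys) {
        int calls = 0;
        String[] prev = null;
        for (String[] k : sortedKeys) {
            if (prev == null || GROUP.compare(prev, k) != 0) calls++;
            prev = k;
        }
        return calls;
    }
}
```

With keys ("fish","red"), ("fish","blue"), ("one","two"), the sort comparator orders the two "fish" pairs by second word, while the grouping comparator merges them into one reduce() call — which is exactly what secondary-sort patterns exploit. The partitioner, by contrast, only decides which reducer a key is shipped to.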
Q: I hit this error: Java heap space
• Your method is probably using too much memory — by default each task JVM only gets 200 MB (mapred.child.java.opts). Consider revising your algorithm.
• Or just make mapred.child.java.opts bigger~
• A well-written pairs implementation shouldn't use much memory... Classmates report that stripes overflows; one workaround is to override the mapper's run() method and flush the accumulated output every XX bytes.

Q: Must the co-occurring word pairs be unordered?
• Does "co-occurrence" just mean appearing on the same line? The words need not be adjacent?
• "A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context — a natural unit such as a sentence, paragraph, or a document, or a certain window of m words (where m is an application-dependent parameter) ... need not be symmetric."
• The notion of word "co-occurrence" is itself fuzzy — that is why the pseudo-code only gives a neighbors() function, whose concrete implementation is application-dependent. For example, if our goal is to study "what is distinctive about the words co-occurring with various place names in AP newswire", then the word pairs we care about are clearly unordered. On the other hand, if we wanted to build a speech-recognition system with context analysis, the co-occurring word pairs used as reference would need to be ordered.
• Incidentally, Assignment #2's default setting — one line as the mapper's input — is actually not ideal. For example, in the AP newswire many long words at the end of a sentence are hyphenated across two lines, and line-by-line processing splits them into two words. A more reasonable unit of mapper input is a natural sentence, but that demands a lot of preprocessing — you could try achieving it with another MapReduce application.

More...
• Apache Spark™, a new open-source project built on Hadoop
• Graph-mining algorithms

Writing a Technical Report

Technical Writing
• Written communication is, in fact, an integral part of engineering tasks.
• A technical report must inform readers of the reasons, means, results, and conclusions of the subject matter being reported.
• The mechanics and format of writing a report may vary, but the content is always similar.

Abstract
• An abstract of a technical report briefly summarizes the report.
• It should describe motivations, methods, results, and conclusions.
• Be concise in the abstract.

Table of Contents
• The Table of Contents lists what is in the report.
• Major sections of the report must be listed with page numbers.

List of Figures and Tables

Acknowledgements
• The author(s) must acknowledge every person or agency involved in funding, guiding, advising, or working on the project who is not part of the authoring team.
• Failure to acknowledge someone who contributed to the project is a serious breach of etiquette and may be construed as plagiarism, a very serious offense.
Introduction
• Quickly explain the importance of the experiment being reported.
• This section is where the concepts that were applied to obtain the results are explained.

Experimental Details
• Details of the experiments or research conducted.
• The description must contain enough detail to enable someone else to duplicate the experiment.
• Engineering and scientific experiments must be repeatable and verifiable.

Results and Discussions
• Report only the final results.
• Raw data and intermediate results that are not central to the topic of the report can be placed in the Appendix if needed.
• This is the most substantial part of the report.

Conclusions and Recommendations
• Think of the conclusion as a short restatement of the important points presented in the report.
• Give recommendations as to the uses of those conclusions.
• Mention restrictions or limits pertaining to the use of the results.
• Suggest what the next step in the study should be.

References
• Any idea, formula, etc., not originating from the author must be cited.
• Failure to reference prior work may be interpreted as claiming that work as your own.
• Any work, formula, or discussion that is common knowledge in the field does not need to be referenced.

Appendices
• It is imperative that the way you determine the results from the raw data be made clear.
• Others should be able to duplicate the experiment according to the instructions in the "Experimental Details" section and reduce the data according to the "Sample Calculations" in the Appendix, obtaining results similar to what is reported.

Suggestions
• A formal report is written in the third person.
• All tables and figures must include captions.
• Data presented as a graph are plotted without lines connecting the data points.
Cluster size: 38 cores.
Data source: Associated Press Worldstream (APW) portion of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed).

Inverted Index

1. An Inverted Index (80 points)
• Given a text collection, the inverted indexer uses Hadoop to produce an index of all the words in the corpus.
• As we know, an inverted index consists of two logical parts: one is the Dictionary, the other is the PostingList file.
• Design your inverted index storage structure and implement the index-builder program.

Inverted Index: Positional Information
• Example collection: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat"; Doc 4 "green eggs and ham".
• Postings with document frequency (df) and, per document, (docno, tf, [positions]):

term   df  postings
blue    1  (2, 1, [3])
cat     1  (3, 1, [1])
egg     1  (4, 1, [2])
fish    2  (1, 2, [2,4]), (2, 2, [2,4])
green   1  (4, 1, [1])
ham     1  (4, 1, [3])
hat     1  (3, 1, [2])
one     1  (1, 1, [1])
red     1  (2, 1, [1])
two     1  (1, 1, [3])

LineIndexer
• Given an input text, the LineIndexer outputs an index for each word in it.
• The index of a word is just a list of all the locations where the word appears.
• No separate Dictionary file exists.

Line Indexer Mapper
• Outputs <"word", "filename@offset"> pairs.
• Locations of individual words are byte offsets within the file, not line numbers.
• To get the current filename, refer to the class FileSplit.

Line Indexer Reducer
• Simply concatenates all the values into a single large string, using "^" to separate them:
• "shakespeare.txt@38624^shakespeare.txt@12046^shakespeare.txt@34739^..."

2. Add KWIC support to the index (10 points)
• Add a text excerpt of each line where the word appears into the index.
• For example, running the LineIndexer on the complete works of Shakespeare yields the following entry for the word "cipher":
• 38624 To cipher what is writ in learned books,
• 12046 To cipher me how fondly I did dote;
• 34739 Mine were the very cipher of a function,
• 16844 MOTH To prove you a cipher.
• 66001 ORLANDO Which I take to be either a fool or a cipher.

3. Add compression to the index (10 points bonus)
• Try and evaluate d-gap compression techniques for the numbers.

Postings Encoding
• Conceptually, the postings for "fish" are (docno, tf) pairs: (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
• In practice: don't encode docnos, encode the gaps between them (d-gaps): (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
• But it's not obvious that this alone saves space…

Side Data Distribution
• Using the job configuration:
– Set arbitrary key-value pairs in the job configuration.
– Suitable for no more than a few kilobytes of data.
• Using HDFS:
– Read the data from HDFS in setup().
• Using the distributed cache:
– Copies files and archives to the task nodes in time for the tasks to use them when they run.
– hadoop jar xxxx.jar -files xxxx.dat

4. Split the index (10 points bonus)
• Split the index by document range rather than by index-term range.

Term vs. Document Partitioning
[Figure: the term–document matrix partitioned two ways — term partitioning slices it into term ranges T1, T2, T3 across all documents; document partitioning slices it into document ranges D1, D2, D3 across all terms.]

References
• Hadoop: The Definitive Guide
• "How to Write a Technical Report", http://www.mech.utah.edu/~rusmeeha/references/Writing.pdf
• "Engineering Technical Reports", http://writing.colostate.edu/guides/guide.cfm?guideid=88

Q&A
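As a closing sketch, here is a local, non-Hadoop simulation of the LineIndexer described above. This is hypothetical illustration code, not the assignment solution: the inner loop plays the role of the Mapper (emitting <word, "filename@offset"> pairs), and the final join plays the role of the Reducer. Offsets are the byte offsets of each line, as TextInputFormat supplies them as map keys.

```java
import java.util.*;

// Local simulation of the LineIndexer job: map emits
// <word, "filename@offset">, reduce concatenates locations with "^".
public class LineIndexerDemo {
    public static Map<String, String> index(String filename, List<String> lines) {
        Map<String, List<String>> postings = new TreeMap<>();
        long offset = 0;                         // byte offset of the line
        for (String line : lines) {
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    postings.computeIfAbsent(word, k -> new ArrayList<>())
                            .add(filename + "@" + offset);
            offset += line.length() + 1;         // +1 for the newline byte
        }
        Map<String, String> out = new TreeMap<>(); // "reduce" step
        postings.forEach((w, locs) -> out.put(w, String.join("^", locs)));
        return out;
    }
}
```

For KWIC support (Task 2), the value emitted per word would carry the line text itself alongside the offset, as in the "cipher" example above.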
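Similarly, the d-gap compression of Task 3 can be sketched with variable-byte (VByte) encoding, one common way to exploit the small gaps — each gap is stored in 7-bit chunks, with the high bit marking the final byte, so a small gap costs one byte instead of a 4-byte int. The class name is hypothetical; this is a minimal sketch, not a tuned codec.

```java
import java.io.ByteArrayOutputStream;

// d-gap + variable-byte encoding for a sorted postings list of docnos.
public class DGap {
    public static byte[] encode(int[] docnos) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int d : docnos) {
            int gap = d - prev;                  // store the gap, not the docno
            prev = d;
            while (gap >= 128) { out.write(gap & 0x7F); gap >>>= 7; }
            out.write(gap | 0x80);               // high bit = last byte of gap
        }
        return out.toByteArray();
    }

    public static int[] decode(byte[] bytes, int count) {
        int[] docnos = new int[count];
        int prev = 0, pos = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0, shift = 0, b;
            do { b = bytes[pos++] & 0xFF; gap |= (b & 0x7F) << shift; shift += 7; }
            while ((b & 0x80) == 0);
            prev += gap;                         // prefix-sum undoes the d-gap
            docnos[i] = prev;
        }
        return docnos;
    }
}
```

For the "fish" postings above, docnos 1, 9, 21, 34, 35, 80 become gaps 1, 8, 12, 13, 1, 45 — all under 128, so the whole list fits in 6 bytes.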
© Copyright 2024