Institute of Network Computing and Information Systems, Peking University

Large-Scale Data Processing / Cloud Computing
Lecture 6 – Inverted Index
彭波
School of Electronics Engineering and Computer Science, Peking University
7/17/2014
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin
University of Maryland
SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Word Cooccurrence Tasks
• Do word co-occurrence analysis on the
Shakespeare Collection and the AP Collection,
which are under the directories
/public/Shakespeare and /public/AP of our sewm
cluster (or your own virtual cluster). By default, the
map function receives one line of text as input
to process. (80 points)
• Try to optimize your program and find the
fastest version. Describe your approaches and
evaluation in your report. (20 points)
• Analyze the resulting data matrix and find
something interesting. (10 points bonus)
• Write a report describing your approach to each task,
the problems you met, etc.
Discussions in Course Group
How do I view logs through the Web UI?
• Visit http://changping11:50030/
Shuffle and Sort
[Figure: Mapper → circular buffer (in memory) → Combiner → spills (on disk) → Combiner? → merged spills (on disk) → intermediate files (on disk) → Reducer. Merged spills also feed other reducers; the Reducer also reads intermediate files from other mappers.]
Job History Disappeared?
• Job history is stored on the local filesystem of the
jobtracker (hadoop.job.history.location).
– The jobtracker’s history files are kept for 30 days
before being deleted by the system.
• A second copy is also stored for the user in the
_logs/history subdirectory of the job’s output
directory (hadoop.job.history.user.location).
– This copy is never deleted by the system.
• View the history through the web UI or the command line:
– $ hadoop job -history output-dir
Job Retirement Policy
• Once a job is complete, it is kept in
memory (up to
mapred.jobtracker.completeuserjobs.maximum)
• Overall retirement policy for completed jobs
– Key: mapred.jobtracker.retirejob.interval
– Default: 24 * 60 * 60 * 1000 ms (1 day)
Task Logs
• There are some controls for managing the
retention and size of task logs.
• By default, logs are deleted after a
minimum of 24 hours (set this using the
mapred.userlog.retain.hours property).
• mapred.userlog.limit.kb, 0 by default,
meaning there is no cap
Cannot open the stdout debug output in the logs!
• Right, your machine needs to resolve the hostnames changping11 to changping60.
• Add the following entries to your hosts file:
• --------------
• 222.29.134.11 changping11
• 222.29.134.12 changping12
• 222.29.134.14 changping14
• 222.29.134.15 changping15
• 222.29.134.17 changping17
• 222.29.134.18 changping18
• 222.29.134.20 changping20
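A Self-Criticism...
• I was wondering how to change the number of "map tasks", looked through the API, and found these:
• setMinInputSplitSize(Job, long)
• setMaxInputSplitSize(Job, long)
• The default value under "Counters" was "44", so I (foolishly) added
the following two lines to "word count":
• FileInputFormat.setMinInputSplitSize(job, 10);
• FileInputFormat.setMaxInputSplitSize(job, 10);
• And then the tragic scene in the screenshot happened...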
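Overload, Why?
• 1. Overload, why?
– Slot configuration: originally 12 map slots and 6 reduce slots, which was excessive for only 12 cores.
• 2. Someone asked what the scheduling policy is. It is FairScheduler, but it does not seem to work very well:
why are some people's job tasks running flat out while others keep waiting?
– From what I found, the fair scheduler has a multi-assign mechanism that hands a
tasktracker all the tasks it can run at once. Could that be the cause of the unfairness?
This setting has been changed to assign 2 tasks at a time.
• 3. I saw someone adjusting job priority: smart, but unfair.
– The setting has been changed: priority is now final, with value NORMAL.
• One more point: cluster job scheduling guarantees a certain degree of fairness. If a job runs slowly,
first check that job's counters, especially whether the input and output data sizes look
abnormal. Do not overlook problems in your own program.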
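For the AP dataset in Assignment 2, how can the
number of mappers (1050 by default) be reduced?
• Personally, I feel that launching this many Mappers is a great
waste of Mapper lives~~
• FileInputFormat.setMinInputSplitSize(job,
100*1024*1024);
• try CombineFileInputFormat
• ......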
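To see why the two split-size settings matter so much, the following is a minimal plain-Java sketch (no Hadoop classes) of the clamping rule that FileInputFormat applies when it picks a split size. The class name, the 64 MB block size, and the 1 MB file size are assumptions chosen for illustration, and the estimate ignores the small slack Hadoop allows on the last split.

```java
public class SplitSizeSketch {
    // Clamp the block size between the minimum and maximum split sizes.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits for one file (rounds up, ignores last-split slack).
    public static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // assumed HDFS block size
        long file = 1024L * 1024;       // hypothetical 1 MB input file

        // min = max = 10 bytes, as in the earlier word count experiment
        long tiny = computeSplitSize(block, 10, 10);
        System.out.println(tiny + " bytes per split, about "
            + estimateSplits(file, tiny) + " map tasks for a 1 MB file");

        // min = 100 MB, max left unbounded
        long big = computeSplitSize(block, 100L * 1024 * 1024, Long.MAX_VALUE);
        System.out.println(big + " bytes per split");
    }
}
```

Note that a split never spans two files, so a larger minimum split size only reduces the mapper count for files bigger than one split. For a collection of many small files, an input format such as CombineFileInputFormat is the more likely fix.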
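The MapWritable Problem
• Why does a MapWritable, once written out by reduce,
produce output that looks like this:
org.apache.hadoop.io.MapWritable@12f1eff?
• Does it implement serialization without overriding toString?
• If so, the map output files should also be in
this address form, so how are they
deserialized when reduce reads them?
• Or is it that map output does not call toString, while
reduce output does call toString?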
On the difference between setSortComparatorClass and
setGroupingComparatorClass
• To send WordPairs with the same first key to the same reducer,
both methods seem able to do the job. What is the
difference between them?
• You can use
org.apache.hadoop.mapreduce.Job's
setPartitionerClass method to use a custom
partitioner with a MapReduce job.
• Sort Comparator Class: Used to control how the
keys are sorted before the reduce step.
• Grouping Comparator Class: Used to control
which keys are grouped into a single call to reduce.
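The difference between the two comparators can be illustrated with a small plain-Java simulation. It does not use any Hadoop classes: the class name, the `String[] {left, right}` key layout, and the `reduceCalls` helper are assumptions made for this sketch. Keys are first ordered by the sort comparator, and a new reduce call begins whenever the grouping comparator reports that the key has changed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortVsGroupSketch {
    // Sort comparator: orders keys by left word, then by right word.
    public static final Comparator<String[]> SORT = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    };

    // Grouping comparator: keys count as equal when their left words match.
    public static final Comparator<String[]> GROUP = (a, b) -> a[0].compareTo(b[0]);

    // Sorts the keys, then opens a new reduce call whenever GROUP sees a change.
    public static List<List<String>> reduceCalls(List<String[]> keys) {
        List<String[]> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<String>> calls = new ArrayList<>();
        String[] first = null; // first key of the current group
        for (String[] k : sorted) {
            if (first == null || GROUP.compare(first, k) != 0) {
                calls.add(new ArrayList<>());
                first = k;
            }
            calls.get(calls.size() - 1).add(k[0] + "," + k[1]);
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
            new String[] {"fish", "two"}, new String[] {"cat", "hat"},
            new String[] {"fish", "one"}, new String[] {"fish", "red"});
        // One inner list per simulated reduce call.
        System.out.println(reduceCalls(keys));
    }
}
```

In this sketch the three fish keys end up in one reduce call, already ordered by their right word. Which reducer receives a key is a separate decision, made by the partitioner.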
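Error encountered: Java heap space
• Your method is probably using too much memory. By default,
each task's child JVM currently gets only 200 MB
(mapred.child.java.opts). Consider revising
your algorithm.
• Just raise mapred.child.java.opts a bit~
• A well-written pairs approach should not use much
memory... Classmates report that stripes can overflow. One way to handle this
is to override the mapper's run method and emit output every XX
bytes.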
Do co-occurring word pairs have to be unordered?
• Is it enough for two words to appear on the same line? Do they not have
to be adjacent?
• A cell m_ij contains the number of times
word w_i co-occurs with word w_j within a
specific context: a natural unit such as a sentence,
paragraph, or a document, or a certain
window of m words (where m is an
application-dependent parameter)...need
not be symmetric.
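• The notion of word "co-occurrence" is itself quite fuzzy, which is why the pseudocode
only gives a neighbors function whose concrete implementation depends on the
application. For example, if our goal is to study "what characterizes the words that
co-occur with different place names in AP news stories", then the word pairs of interest
are clearly unordered. On the other hand, if we want to build a speech recognition system
with context analysis, then the co-occurring word pairs used for reference
need to be ordered.
• By the way, Assignment #2 requires the default setting of one line
as the Mapper input, which is actually not ideal. For example, in the AP news
stories many long words at line ends are hyphenated across two lines, and with this
approach they are bluntly split into two words. A more reasonable approach is to use
one natural sentence as the Mapper input, but this demands a lot of
preprocessing. The original poster could try doing that with another MapReduce application.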
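The co-occurrence matrix described above can be sketched in plain Java, without Hadoop, as counts over ordered pairs. This is only one possible reading of the neighbors function: here it is assumed to mean "within `window` tokens on the same line", tokens are split on whitespace only, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class CooccurrencePairsSketch {
    // Counts ordered pairs "wi,wj" for every two distinct positions i != j
    // that lie within `window` tokens of each other on the same line.
    public static Map<String, Integer> count(String[] lines, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] w = line.trim().toLowerCase().split("\\s+");
            for (int i = 0; i < w.length; i++) {
                int lo = Math.max(0, i - window);
                int hi = Math.min(w.length - 1, i + window);
                for (int j = lo; j <= hi; j++) {
                    if (j == i) {
                        continue; // a position is not its own neighbor
                    }
                    counts.merge(w[i] + "," + w[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Window of 3 covers the whole four-word line.
        System.out.println(count(new String[] {"one fish two fish"}, 3));
    }
}
```

Because the window used here extends equally to the left and right, the resulting matrix happens to be symmetric. A one-sided window (for example, only the words that follow) would give an asymmetric matrix, which is the ordered case discussed above.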
More...
• Apache Spark™, a new open-source project based on Hadoop
• Graph mining algorithms
Writing Technical Report
Technical Writing
• Written communication, in fact, is an integral part
of engineering tasks.
• A technical report must inform readers of the
reasons, means, results, and conclusions of the
subject matter being reported.
• The mechanics and format of writing a report
may vary but the content is always similar.
Abstract
• An abstract of a technical report briefly
summarizes the report.
• It should describe motivations, methods,
results, and conclusions.
• Be concise in the abstract.
Table of Contents
• Table of Contents is the list of what is in
the report.
• Major sections of the report must be listed
with page numbers.
List of Figures and Tables
Acknowledgements
• The author(s) must acknowledge every
person or agency involved in funding,
guiding, advising, and working on the
project that are not part of the authoring
team.
• Failure to acknowledge someone
contributing to the project is a serious
breach of etiquette and may be construed
as plagiarism, a very serious offense.
Introduction
• Quickly explain the importance of the
experiment being reported.
• This section is where the necessary
concepts that were applied in order to
obtain the results are explained.
Experimental Details
• Details of the experiments or research
conducted.
• The description must contain enough
details to enable someone else to
duplicate the experiment.
• Engineering and scientific experiments
must be repeatable and verifiable.
Results and Discussions
• Report only the final results.
• Raw data and intermediate results that are
not central to the topic of the report can be
placed in the Appendix if needed.
• Most substantial part of the report.
Conclusions and Recommendations
• Think of the conclusion as a short
restatement of important points being
presented in the report.
• Make recommendations as to the utility of
those conclusions.
• Mention restrictions or limits pertaining to
the use of the results.
• Suggest what the next step in the study
should be.
References
• Any idea, formula, etc., not originating
from the author must be cited.
• Failure to reference prior works may be
interpreted as claiming those works to be
your own.
• Any work, formulae, or discussion that is a
common knowledge in the field does not
need to be referenced.
Appendices
• It is imperative that the way you
determine the result from the raw
data be made clear.
• Others should be able to duplicate the
experiment according to the instructions
provided in the “Experimental Details”
section and reduce the data
according to the “Sample Calculations”
in the Appendix to obtain results
similar to what is reported.
Suggestions
• A formal report is written in third person.
• All tables and figures must include
captions.
• Data presented as a graph are plotted
without lines connecting the data points.
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Inverted Index
1. An Inverted Index: (80 points)
• Given a text collection, the inverted indexer uses
Hadoop to produce an index of all the words in
the corpus.
• As we know, an inverted index consists of two
logical parts: the dictionary and the
postings list file.
• Design the data storage structure for your inverted
index and implement the index builder
program.
Inverted Index: Positional Information
Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

term    df   postings (docno, tf, [positions])
blue    1    (2, 1, [3])
cat     1    (3, 1, [1])
egg     1    (4, 1, [2])
fish    2    (1, 2, [2,4]) (2, 2, [2,4])
green   1    (4, 1, [1])
ham     1    (4, 1, [3])
hat     1    (3, 1, [2])
one     1    (1, 1, [1])
red     1    (2, 1, [1])
two     1    (1, 1, [3])
LineIndexer
• Given an input text, the LineIndexer outputs
an index for each word in it.
• The index of a word is just a list of all the
locations where the word appears.
• No separate Dictionary file exists.
Line Indexer Mapper
• Output <"word", "filename@offset"> pairs.
• Locations of individual words are byte
offsets within the file, not line numbers.
• To get the current filename, refer to class
FileSplit.
Line Indexer Reducer
• Simply concatenate all the values to
make a single large string, using "^" to
separate the values.
• "shakespeare.txt@38624^shakespeare.txt@12046^shakespeare.txt@34739^..."
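The mapper and reducer described above can be sketched as two plain Java functions, with no Hadoop dependency. Several details are assumptions made for this sketch: the offset recorded is the byte offset of the start of the line containing the word, words are lowercased and split on non-alphanumeric characters, lines end in a single '\n', and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LineIndexerSketch {
    // "Map" step: emit (word, filename@offset) for each word, where offset is
    // the byte offset of the start of the line that contains the word.
    public static List<String[]> map(String filename, String text) {
        List<String[]> out = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            for (String word : line.toLowerCase().split("[^a-z0-9]+")) {
                if (!word.isEmpty()) {
                    out.add(new String[] {word, filename + "@" + offset});
                }
            }
            // Advance past this line plus its '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    // "Reduce" step: join all locations of the same word with '^'.
    public static Map<String, String> reduce(List<String[]> pairs) {
        Map<String, String> index = new TreeMap<>();
        for (String[] p : pairs) {
            index.merge(p[0], p[1], (a, b) -> a + "^" + b);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("f.txt", "to be\nor not to be")));
    }
}
```

In a real job the filename would come from the input split rather than a parameter, and the grouping by word would be done by the framework's shuffle instead of a map in memory.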
2. Add KWIC (keyword in context) support to the index. (10 points)
• Add a text excerpt of each line where the word appears
to the index.
• For example, running the LineIndexer on the complete
works of Shakespeare yields the following entry for the
word cipher:
• 38624 To cipher what is writ in learned books,
• 12046 To cipher me how fondly I did dote;
• 34739 Mine were the very cipher of a function,
• 16844 MOTH To prove you a cipher.
• 66001 ORLANDO Which I take to be either a fool or a
cipher.
3. Add compression for the index. (10 points bonus)
• Try and evaluate d-gap compression
techniques for numbers.
Postings Encoding
Each posting is (docno, tf).
Conceptually:
fish: (1, 2) (9, 1) (21, 3) (34, 1) (35, 2) (80, 3) …
In Practice:
• Don’t encode docnos, encode gaps (or d-gaps)
fish: (1, 2) (8, 1) (12, 3) (13, 1) (1, 2) (45, 3) …
• But it’s not obvious that this saves space…
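The d-gap transformation itself is small enough to sketch directly. The following plain-Java example (class and method names are hypothetical) converts an ascending list of docnos to gaps and back, using the docnos for "fish" from the slide above.

```java
import java.util.Arrays;

public class DGapSketch {
    // Replace each docno (sorted ascending) by its difference from the previous one.
    public static int[] encode(int[] docnos) {
        int[] gaps = new int[docnos.length];
        int prev = 0;
        for (int i = 0; i < docnos.length; i++) {
            gaps[i] = docnos[i] - prev;
            prev = docnos[i];
        }
        return gaps;
    }

    // Recover the docnos as a running sum of the gaps.
    public static int[] decode(int[] gaps) {
        int[] docnos = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i];
            docnos[i] = sum;
        }
        return docnos;
    }

    public static void main(String[] args) {
        int[] docnos = {1, 9, 21, 34, 35, 80};
        int[] gaps = encode(docnos);
        System.out.println(Arrays.toString(gaps));
        System.out.println(Arrays.toString(decode(gaps)));
    }
}
```

On their own the gaps occupy the same fixed-width integers as the docnos, which is why the saving is not obvious. The benefit appears only when the smaller numbers are stored with a variable-length integer code, which this sketch does not include.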
Side Data Distribution
• Using the Job Configuration
– set arbitrary key-value pairs in the job
configuration
– no more than a few kilobytes of data
• Using HDFS
– read data from hdfs in setup()
• Using Distributed Cache
– copying files and archives to the task nodes in
time for the tasks to use them when they run
– hadoop jar xxxx.jar -files xxxx.dat
4. Split the index. (10 points bonus)
• Split the index according to document
range rather than index term range.
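Term vs. Document Partitioning
[Figure: a term-document matrix with terms T on one axis and documents D on the other. Term partitioning splits it by term range into T1, T2, T3, …; document partitioning splits it by document range into D1, D2, D3, …]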
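Document partitioning can be sketched in plain Java as routing each posting to a shard chosen by its docno, so that every shard becomes a complete small index over its own document range. The class name, the fixed `docsPerShard` range width, and the 1-based docnos are assumptions made for this sketch. Docnos beyond the last range are placed in the last shard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DocumentPartitioningSketch {
    // Shard k holds the postings whose docno falls in
    // [k * docsPerShard + 1, (k + 1) * docsPerShard].
    public static List<Map<String, List<Integer>>> split(
            Map<String, List<Integer>> index, int docsPerShard, int numShards) {
        List<Map<String, List<Integer>>> shards = new ArrayList<>();
        for (int k = 0; k < numShards; k++) {
            shards.add(new TreeMap<>());
        }
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            for (int docno : e.getValue()) {
                int k = Math.min(numShards - 1, (docno - 1) / docsPerShard);
                shards.get(k)
                      .computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                      .add(docno);
            }
        }
        return shards;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = new TreeMap<>();
        index.put("fish", Arrays.asList(1, 2));
        index.put("blue", Arrays.asList(2));
        index.put("cat", Arrays.asList(3));
        index.put("egg", Arrays.asList(4));
        // Two shards: docs 1-2 and docs 3-4.
        System.out.println(split(index, 2, 2));
    }
}
```

With this layout a term may appear in several shards, so a query has to consult every shard. Term partitioning would instead keep each term's full postings list in exactly one shard.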
References
• Hadoop: The Definitive Guide
• "How to Write a Technical Report",
http://www.mech.utah.edu/~rusmeeha/references/Writing.pdf
• "Engineering Technical Reports",
http://writing.colostate.edu/guides/guide.cfm?guideid=88
Q&A