Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams

Virginia Tech.
Computer Science
CS 5614 – (Big) Data Management Systems
Fall 2014, Prakash
(due October 29th, 2014, 2:30pm, in class; hard copy please)
Reminders:
a. Out of 100 points. Contains 4 pages.
b. Rough time-estimates: 7-10 hours.
c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader.
Only drawings may be hand-drawn, as long as they are neat and legible.
d. There could be more than one correct answer. We shall accept them all.
e. Whenever you are making an assumption, please state it clearly.
f. Each HW has to be done individually, without taking any help from non-class resources (e.g.,
websites).
Q1. Map-Reduce [45 points]
In this question, we will use MapReduce to count the number of 2-grams in a large
text corpus, given all the distinct 4-grams from the text corpus. The idea is to
convince you that using Hadoop on AWS has now really become a low-enough
cost/effort proposition (compared to setting up your own cluster). You can use one of
Java/Python/Ruby to implement this question. You are free to use Hadoop Streaming
as well if you want.
Familiarize yourself with AWS (Amazon Web Services). Read the set-up guidelines
posted on the website to set up your AWS account and redeem your free credit ($100)--do this early!
Link: http://people.cs.vt.edu/~badityap/classes/cs5614-Fall14/homeworks/hw3/AWS-setup.pdf
The pricing for various services provided by AWS can be found at
http://aws.amazon.com/pricing/. The services we would be primarily using for this
assignment are the Amazon S3 storage, the Amazon Elastic Compute Cloud (EC2)
virtual servers in the cloud and the Amazon Elastic MapReduce (EMR) managed
Hadoop framework. Play around with AWS and try to create MapReduce job flows (not
required, or graded) or try the sample job flows on AWS.
The questions in this assignment will ideally use up only a very small fraction of your
$100 credit. AWS allows you to use up to 20 instances total (that means 1 master
instance and up to 19 core instances) without filling out a “limit request form”. For this
assignment, you should not exceed this quota of 20 instances. You can learn about
these instance types by going through the extensive AWS documentation. Of course,
after you are done with the HW, feel free to use your remaining credits for any other
fun computations/applications you may have in mind! These credits are applicable
more generally for AWS as a whole, not just MapReduce.
We will use data from the Google Books n-gram viewer corpus. N-grams are fixed size
tuples of items. In this case the items are words extracted from the Google Books corpus.
The n specifies the number of elements in the tuple, so a 5-gram contains five words.
This data set is freely available on Amazon S3 in a Hadoop friendly file format and is
licensed under a Creative Commons Attribution 3.0 Unported License. The original
dataset is available from http://books.google.com/ngrams/.
The subset we will be using for this assignment is the 4-gram English 1M dataset in the
following S3 bucket (directory) and is freely accessible to all:
s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-1M/4gram/data
IMP: Note that the dataset is stored in the SequenceFile format (which is not plain text), so make sure you look up how to read/parse this file in Hadoop (feel free to search
online to figure this out). Some helpful links are given below:
https://aws.amazon.com/datasets/8172056142375670
http://stackoverflow.com/questions/18882197/processing-lzo-sequence-files-with-mrjob
Refer to our setup guidelines to see how to set this data as input to your MapReduce job
(Section 6 in the guidelines). We have provided a screenshot to configure the EMR
cluster, which demonstrates how to access input data from some given bucket (here our
bucket is the one given above).
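As a rough illustration of what each record looks like once the SequenceFile has been decoded, here is a minimal sketch. It assumes, following the AWS dataset description linked above, that each value is a tab-separated row with the fields n-gram, year, occurrences, pages, books. The names `NgramRecord` and `parse_ngram_record` and the sample row are hypothetical; please verify the layout against the actual data before relying on it.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One decoded row (assumed layout, not confirmed by this handout)."""
    ngram: str
    year: int
    occurrences: int
    pages: int
    books: int


def parse_ngram_record(value: str) -> NgramRecord:
    # Assumption: five tab-separated fields per row.
    ngram, year, occurrences, pages, books = value.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(occurrences), int(pages), int(books))


# Hypothetical sample row, made up for illustration only.
sample = "analysis is often described\t1991\t1\t1\t1"
record = parse_ngram_record(sample)
words = record.ngram.split(" ")
print(record.year, words)
```

Note that under this layout the same 4-gram can appear in several rows (one per year), which matters when counting distinct 4-grams.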
Q1.1. (10 points) The dataset contains all the 4-grams in the Eng-1M dataset. What is
the total number of distinct 4-grams in the dataset? Write a simple MapReduce
job to compute this.
Q1.2. (5 points) Plot the frequency distribution for the occurrence counts of the 4-grams, i.e., a plot where the x-axis is the occurrence count (say k), and the y-axis is
the number of 4-grams which occur k times. Just paste the figure as the answer.
Hint: It will be easiest if you write a simple MR job to pull out just the
occurrence information from the dataset, and then compute the distribution
locally on your machine.
Q1.3. (25 points) Write a MapReduce job to compute the total number of distinct 2-grams using the same dataset.
Q1.4. (5 points) Write down the total number of the 2-grams you get (you may need
another separate MR job for this; there is no need to show us the code for it, just
write down the number).
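The overall shape of these jobs can be tried out locally before spending AWS credit. The following is a minimal in-memory sketch, not a Hadoop job: `run_mapreduce` is a hypothetical helper that imitates map, shuffle (group by key) and reduce, the three input rows are made up, and the tab-separated row layout (n-gram, year, occurrences, pages, books) is an assumption taken from the AWS dataset description. It also assumes that the 2-grams of interest are the pairs of adjacent words inside each 4-gram.

```python
from collections import Counter, defaultdict


def run_mapreduce(records, mapper, reducer):
    """Imitate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def fourgram_mapper(line):
    # Assumed layout: n-gram, year, occurrences, pages, books.
    ngram, _year, occurrences, _pages, _books = line.split("\t")
    yield ngram, int(occurrences)


def twogram_mapper(line):
    words = line.split("\t")[0].split(" ")
    # Each pair of adjacent words in the 4-gram is emitted as a 2-gram.
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}", 1


def sum_reducer(key, values):
    return key, sum(values)


def distinct_reducer(key, _values):
    # One output per key, however many times the key was emitted.
    return key, 1


# Made-up rows for illustration only.
rows = [
    "a b c d\t1990\t2\t2\t1",
    "a b c d\t1991\t3\t3\t2",
    "b c d e\t1991\t5\t4\t2",
]

totals = run_mapreduce(rows, fourgram_mapper, sum_reducer)
distinct_4grams = len(totals)
# Occurrence-count distribution: k -> number of 4-grams occurring k times.
distribution = Counter(count for _ngram, count in totals)
distinct_2grams = len(run_mapreduce(rows, twogram_mapper, distinct_reducer))
print(distinct_4grams, dict(distribution), distinct_2grams)
```

On the real dataset the final counting step would itself need to be a reducer (or a second job), since the list of keys will not fit comfortably on one machine.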
Code Deliverables: For Q1.1: Give the mapper and reducer files (in addition to the
number). For Q1.3: Give the mapper and reducer for computing the 2-grams from the
dataset. Zip all of these as YOUR-LASTNAME.zip and send it to Vanessa (email:
vcedeno@vt.edu) with the subject ‘HW3-Code-Q1’. Also copy-paste these in your hard
copy.
Q2. Finding Similar Items [30 points]
Q2.1. (15 points) In class, we saw how to construct signature matrices using random
permutations. Unfortunately, permuting rows of a large matrix is prohibitive.
Section 3.3.5 of your textbook (in Chapter 3) gives a fairly simple method to
‘simulate’ this randomness using different hash functions. Please read through
that section before attempting this question.
Now consider the matrix below.
a. (9 points) Compute the minhash signature for each column using the
method given in Sec 3.3.5 of your textbook, if we use the following three
hash functions: (A) h1(x) = (2x + 1) mod 6; (B) h2(x) = (3x + 2) mod 6; and (C)
h3(x) = (5x + 2) mod 6. So you will finally get a 3x4 matrix as the signature
matrix. Just show the initially computed hash function values and the
final signature matrix.
b. (6 points) How close are the estimated Jaccard similarities for the six pairs
of columns to the true Jaccard similarities (i.e. give the ratio of the
estimated/true for each of the pairs)?
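To make the row-scanning procedure concrete, here is a minimal sketch of minhashing with hash functions in place of permutations. The three hash functions are the ones given in part (a); everything else is an assumption for illustration: `EXAMPLE_MATRIX` is a made-up 6x4 characteristic matrix (it is not the matrix of this question), and rows are numbered from 0.

```python
from itertools import combinations

# Hash functions from part (a).
HASHES = [
    lambda x: (2 * x + 1) % 6,
    lambda x: (3 * x + 2) % 6,
    lambda x: (5 * x + 2) % 6,
]

# Hypothetical characteristic matrix: 6 rows (elements) x 4 columns (sets).
EXAMPLE_MATRIX = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]


def minhash_signatures(matrix, hashes):
    """Scan rows once; keep the minimum hash value seen per column."""
    n_cols = len(matrix[0])
    sig = [[float("inf")] * n_cols for _ in hashes]
    for row_index, row in enumerate(matrix):
        hash_values = [h(row_index) for h in hashes]
        for col, bit in enumerate(row):
            if bit:
                for i, value in enumerate(hash_values):
                    sig[i][col] = min(sig[i][col], value)
    return sig


def estimated_similarity(sig, a, b):
    # Fraction of signature rows in which the two columns agree.
    return sum(row[a] == row[b] for row in sig) / len(sig)


def true_jaccard(matrix, a, b):
    both = sum(1 for row in matrix if row[a] and row[b])
    either = sum(1 for row in matrix if row[a] or row[b])
    return both / either


signatures = minhash_signatures(EXAMPLE_MATRIX, HASHES)
for a, b in combinations(range(4), 2):
    print(a, b, estimated_similarity(signatures, a, b),
          true_jaccard(EXAMPLE_MATRIX, a, b))
```

With only three hash functions over six rows, estimates can differ substantially from the true similarities, which is the effect part (b) asks you to quantify.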
Q2.2. (15 points) Recall that in LSH, given b and r, the probability that a pair of
documents having Jaccard similarity s will be a candidate pair is given by the
function $f(s) = 1 - (1 - s^r)^b$.
a. (6 points) Show that $s^* = (1/b)^{1/r}$ is a good approximation to the value of s
when the slope of f(s) is the maximum. Hint: Feel free to use Mathematica if
you want; it will save you some time ☺
b. (7 points) Recall that the threshold is the value of s at which the probability
of becoming a candidate pair is ½. Given b=32, r=8, plot a graph of f(s) vs s
(of course s should vary from 0 to 1) and numerically estimate the value of
s when f(s) is ½ (call this value s1). Show the plot here, demonstrating your
computation.
c. (2 points) How does s1 compare to s* (for the same values of b=32 and r=8)?
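One possible way to do the numerical part is sketched below. It evaluates the candidate-pair probability f(s) = 1 - (1 - s^r)^b and locates the point where it crosses 1/2 by bisection, which works because f is increasing on [0, 1]. The function names are hypothetical, and plotting is left out to keep the sketch free of third-party libraries.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with similarity s becomes a candidate."""
    return 1.0 - (1.0 - s ** r) ** b


def threshold_by_bisection(b, r, target=0.5, tol=1e-9):
    """Find s in [0, 1] with f(s) = target, using that f is increasing."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if candidate_probability(mid, b, r) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def approx_steepest_point(b, r):
    """The approximation s* = (1/b)^(1/r) from part (a)."""
    return (1.0 / b) ** (1.0 / r)


b, r = 32, 8
s1 = threshold_by_bisection(b, r)
s_star = approx_steepest_point(b, r)
print(f"s1 = {s1:.4f}, s* = {s_star:.4f}")

# Points for a plot of f(s) against s, e.g. to feed into a plotting tool.
curve = [(i / 100, candidate_probability(i / 100, b, r)) for i in range(101)]
```

The `curve` list can be passed to any plotting tool to produce the S-curve requested in part (b).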
Q3. Stream Mining [20 points]
Q3.1. (10 points) Bloom Filters: Suppose we have n bits of memory available, and our
set S has m members. Instead of using k hash functions, we could divide the n
bits into k arrays, and hash once to each array. As a function of n, m, and k,
what is the probability of a false positive? How does it compare with using k
hash functions into a single array?
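For checking a derivation numerically, the two layouts can be compared with a short sketch. It relies on the usual simplifying assumption that bits are set independently and hash values are uniform, and it assumes that n is divisible by k so that each of the k arrays has n/k bits. The function names and the sample values of n, m and k are hypothetical.

```python
import math


def fp_single_array(n, m, k):
    """k hash functions into one array of n bits."""
    zero_prob = (1 - 1 / n) ** (k * m)  # a given bit is still 0
    return (1 - zero_prob) ** k


def fp_partitioned(n, m, k):
    """k arrays of n/k bits each, one hash function per array."""
    bits_per_array = n / k
    zero_prob = (1 - 1 / bits_per_array) ** m
    return (1 - zero_prob) ** k


def fp_exponential_approx(n, m, k):
    """Common large-n approximation (1 - e^(-km/n))^k."""
    return (1 - math.exp(-k * m / n)) ** k


# Made-up parameters for illustration only.
n, m, k = 8000, 1000, 4
single = fp_single_array(n, m, k)
split = fp_partitioned(n, m, k)
approx = fp_exponential_approx(n, m, k)
print(single, split, approx)
```

Trying a few values of n, m and k this way gives a quick sanity check on a closed-form answer before writing it up.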
Q3.2. (5 points) AMS algorithm: Compute the surprise number (second moment) for
the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?
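The moments can be checked directly from their definition: the k-th moment is the sum, over distinct elements, of the element's count raised to the power k. The sketch below computes them exactly with a counter; it is not the AMS estimator itself, which approximates these values with limited memory. The function name `frequency_moment` is hypothetical.

```python
from collections import Counter


def frequency_moment(stream, order):
    """Exact k-th frequency moment: sum of count^k over distinct elements."""
    counts = Counter(stream)
    return sum(count ** order for count in counts.values())


stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
second_moment = frequency_moment(stream, 2)
third_moment = frequency_moment(stream, 3)
print(second_moment, third_moment)
```

As a consistency check, the first moment equals the stream length and the zeroth moment equals the number of distinct elements.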
Q3.3. (5 points) DGIM algorithm: Suppose the window is as shown below. (Most
recent bit is on the right)
Estimate the number of 1’s in the last k positions, for k = (a) 5 (b) 15. In each case,
how far off the correct value is your estimate?
Q4. Frequent Itemsets [5 points]
Let there be I items in a market-basket data set of B baskets. Suppose that every basket
contains exactly K items. As a function of I, B, and K, how much space does the
triangular-matrix method take to store the counts of all pairs of items assuming four
bytes per array element?
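For reference, the triangular-matrix layout can be sketched as a one-dimensional array holding one count per unordered pair of items. The sketch assumes items are numbered 0 to n-1 (the textbook numbers them from 1, so its index formula looks slightly different); the function names and the sample baskets are hypothetical.

```python
from itertools import combinations

BYTES_PER_COUNT = 4  # as stated in the question


def triangular_size(n_items):
    """Number of unordered pairs {i, j} with i != j."""
    return n_items * (n_items - 1) // 2


def triangular_index(i, j, n_items):
    """Position of pair {i, j} in the flat array, for 0 <= i < j < n_items."""
    if i > j:
        i, j = j, i
    if not 0 <= i < j < n_items:
        raise ValueError("need two distinct item ids in range")
    return i * n_items - i * (i + 1) // 2 + (j - i - 1)


def count_pairs(baskets, n_items):
    counts = [0] * triangular_size(n_items)
    for basket in baskets:
        for i, j in combinations(sorted(set(basket)), 2):
            counts[triangular_index(i, j, n_items)] += 1
    return counts


# Made-up baskets over 5 items, for illustration only.
n_items = 5
baskets = [[0, 1, 2], [1, 2, 3]]
counts = count_pairs(baskets, n_items)
space_bytes = len(counts) * BYTES_PER_COUNT
print(counts, space_bytes)
```

In this sketch the array is allocated up front for every possible pair, so its length is fixed before any basket is read.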