Perspective in Informatics 3
Reading Assignments
 J. Leskovec, A. Rajarman, J.D. Ullman, Mining of Massive Datasets, 2014
 M. Charikar, Similarity estimation techniques from rounding algorithms, STOC 2002
 N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, STOC 1996.
 G. Cormode, Sketch techniques for approximate query processing, Foundations and Trends in Databases, 2011
Assignment #1  2014/11/03
Question 1 [30 marks]

3.5.1: on the space of nonnegative integers, which of the following functions are distance measures? If so, prove it; if not, prove that it fails to satisfy one or more of the axioms.
a) max(x, y) = the larger of x and y. b) diff(x, y) = x − y (the absolute magnitude of the difference between x and y). c) sum(x, y) = x + y.

3.7.2: Let us compute sketches using the following four “random” vectors:
V1= [+1,+1,+1,1] V2=[+1,+1,1,+1] V3=[+1,1,+1,+1] V4=[1,+1,+1,+1] Compute the sketches of the following vectors.
a)[2,3,4,5] b)[2,3,4,5] c)[2,3,4,5]
For each pair, what is the estimated angle between them, according to the sketches? What are the true angles?

3.7.3: suppose we form sketches by using all sixteen of the vectors of length 4, whose components are each +1 or 1. Compute the sketches of the three vectors in Exercise 3.7.2.
Question 2 [10 marks]
 3.7.4(A): Suppose we form sketches using the four vectors from Exercise 3.7.2. What are the constrains on a, b, c, and d that will cause the sketch of the vector [a, b, c, d] to be [+1,+1,+1,+1]? (write your constrains in as simple form as possible)
Question 3 [10 marks]

a) Consider a universe U with n elements, and let R and S be subsets of U both of size m, chosen uniformly at random. What is the expected value of the Jaccard similarity of R and S?

b) How does your answer to part (a) change if R and S must include a certain element (say z) of U?

c) How does your answer to part (a) change if R and S must be disjoint?
question sheet  answer sheet  marked answers
Assignment #2  2014/11/25
Question 1 [15 Marks]

4.2.1 Suppose we have a stream of tuples with the schema:
Grade (University, courseID, studentID, grade)
Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., “CS101”) and likewise, studentID’s are unique only within a university (different university may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For each of queries below, indicate how you would construct the sample. That is, tell what the key attributes should be.
a) For each university, estimate the average number of students in a course.
b) Estimate the fraction of students who have a GPA of 3.5 or more.
c) Estimate the fraction of courses where at least half the students got “A.”

4.3.1 For the situation of our running example (8 billion bits, 1 billion members of the set S),calculate the falsepositive rate if we use three hash functions? What if we use four hash functions?
Question 2 [15 Marks]

4.4.1 Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = ax + b mode 32 for some a and b. You should treat the result as a 5bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements.
a) h(x) = 2x + 1 mod 32 b) h(x) = 3x + 7 mod 32 c) h(x) = 4x mod 32

4.4.2 Do you see any problems with the choice of hash functions in Exercise 4.4.1? What advice could you give someone who was going to use as hash function of the form h(x) = ax + b mod 2k?
Question 3 [20 Marks]

4.5.1 Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

4.5.2 If a stream has n elements, of which m are distinct, what are the minimum and maximum possible surprise number, as a function of m and n?

4.5.3 Suppose we are given the stream of Exercise 4.5.1, to which we apply the AlonMatiasSzegedy Algorithm to estimate the surprise number. For each possible value of i, if X_i_ is a variable starting position _i_, what is the value of X_i.value_?
Question 4 [10 Marks] Consider the count sketch (also referred to as FastAGMS) discussed in class; you have seen how the sketch is used to estimate the selfjoin size of a stream. Now suppose you are given two streams s1 and s2 and are asked to estimate the join size of s1 and s2. How can you use the countsketch for this estimation? Explain how and why the estimation would work.
question sheet  answer sheet  marked answers
Lecture Notes
Part I:
 Lecture 1: Introduction to Data Analysis [PDF]
 Lecture 2+3: Similarity (1) [PDF]
 Lecture 4+5: Similarity (2) [PDF]
 Lecture 6+7: Social Network Analysis [PDF]
 Lecture 8: Mining Data Streams [PDF]
Part II:
 Part II Introduction [PDF]
 Lecture 9: Language Technology Overview [PDF]
 Lecture 10: Language Resources as Our Culture[PDF]
 Lecture 11: Globalization or Localization [PDF]
 Lecture 12: Big Data and Applications [PDF]
 Lecture 13: Analysis of language big data of the East Japan Earthquake[PDF]
 Lecture 14: Policy and Initiatives [PDF]
 Lecture 15: Future Direction [PDF]
 Final Report: Future Direction for Language Technology  A Case Study of Watson System [PDF]