sheet
uco
points
Consider the following collection of four documents $d_i$:
• $d_1$: BREAKTHROUGH DRUG FOR HIV
• $d_2$: NEW HIV DRUG
• $d_3$: NEW APPROACH FOR TREATMENT OF HIV
• $d_4$: NEW HOPES FOR HIV PATIENTS
Produce a list of (term, document ID) tuples [1 point], sort this list in lexicographical order [1 point], and use the sorted list to construct an inverted index [1 point]. Write down each step. Describe how you would produce this index using the MapReduce distributed framework [2 points].
(breakthrough, 1), (drug, 1), (for, 1), (HIV, 1), [remainder of the unsorted list illegible]
(approach, 3), (breakthrough, 1), (drug, 1), (drug, 2), (for, 1),
(for, 3), (for, 4), (HIV, [illegible]), (HIV, [illegible]), (HIV, 4),
(hopes, 4), (new, 2), (new, 3), (new, 4), (of, 3), (patients, 4)
for -> 1 -> 3 -> 4
hopes -> 4
new -> 2 -> 3 -> 4
Each [illegible] would process [illegible] document ID list.
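The three steps of the exercise (emit tuples, sort, merge into postings lists) line up with the map, shuffle, and reduce phases of MapReduce. The following is a minimal single-process sketch of that correspondence, not a distributed implementation; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and lowercasing the terms is an assumption made here for normalization.

```python
from collections import defaultdict

# The four documents from the exercise, keyed by document ID.
DOCS = {
    1: "BREAKTHROUGH DRUG FOR HIV",
    2: "NEW HIV DRUG",
    3: "NEW APPROACH FOR TREATMENT OF HIV",
    4: "NEW HOPES FOR HIV PATIENTS",
}


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) tuple per token of one document."""
    return [(term.lower(), doc_id) for term in text.split()]


def shuffle(pairs):
    """Shuffle/sort: order the tuples by term, then by document ID."""
    return sorted(pairs)


def reduce_phase(sorted_pairs):
    """Reducer: merge the document IDs of each term into a postings list."""
    index = defaultdict(list)
    for term, doc_id in sorted_pairs:
        # Skip duplicates so each document appears once per term.
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)


# In a real cluster each mapper would handle a split of the collection;
# here all documents are mapped in one process.
pairs = [p for doc_id, text in DOCS.items() for p in map_phase(doc_id, text)]
index = reduce_phase(shuffle(pairs))

for term, postings in index.items():
    print(term, "->", postings)
```

In an actual MapReduce job the framework performs the sort between the two phases, and each reducer receives all tuples for a subset of the terms.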
Write your solution only on this side of the sheet!
Explain the following aspects of the K-means flat clustering algorithm [2 points]:
1. What do we need to know about our dataset before using the algorithm?
2. What is the input and the output of the algorithm?
3. What are the two steps that take place in every epoch?
4. How do we decide in which epoch to stop the algorithm?
I: unclassified points and K seeds. O: K [illegible] clusters of points. Reassignment, recomputing centroids.
Centroids converged.
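The two steps per epoch and the stopping criterion can be sketched as follows. This is a minimal illustration, assuming 2-D points and squared Euclidean distance; the sample points and seeds are hypothetical because the figure on the sheet is not recoverable, and `max_epochs` is an added safety limit.

```python
def kmeans(points, seeds, max_epochs=100):
    """Plain K-means on 2-D points; returns (centroids, assignment)."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_epochs):
        # Step 1 (reassignment): attach each point to its nearest centroid.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Stopping criterion: no point changed its cluster in this epoch.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2 (recomputation): move each centroid to the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment


# Hypothetical data: two well-separated groups and K = 2 seeds.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
seeds = [(0, 0), (5, 5)]
centroids, assignment = kmeans(points, seeds)
print(centroids, assignment)
```

Stopping when the assignment no longer changes is equivalent to stopping when the centroids no longer move, since the centroids are a function of the assignment.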
Given the points ○ and the seeds □, run the K-means algorithm for three epochs. Draw the state of the algorithm at the beginning and after every epoch; no computation should be necessary. What is the output of the algorithm? [2 points]
[Figure: grid showing the points and seeds, with the hand-drawn state of the K-means algorithm per epoch; not recoverable from the scan.]
Perform a hierarchical clustering of the above dataset into three classes using the single-link hierarchical agglomerative clustering algorithm, and draw the resulting dendrogram. [1 point] Is the output the same as the output of the K-means flat clustering algorithm above? [1 point]
Yes, it is.
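For reference, single-link agglomerative clustering can be sketched as below. This is a naive illustration under stated assumptions: the points are hypothetical (the dataset in the figure is not recoverable), distance is Euclidean, and the function name `single_link` is made up for this example. The recorded merge order is what a dendrogram would draw from bottom to top.

```python
from itertools import combinations
from math import dist


def single_link(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the minimum distance between any two members
    (single link). Returns the clusters as lists of point indices and
    the list of merges as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest single-link distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append(
            (clusters[a], clusters[b], cluster_distance(clusters[a], clusters[b]))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


# Hypothetical points forming three visually separate groups.
POINTS = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_link(POINTS, n_clusters=3)
print(clusters)
```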
You maintain a text retrieval system. Let $E_1$ denote the complete set of documents in the index of your system and let $E_2$ denote the complete set of documents in the index of a competing system. Suppose the indices of both systems are independent uniform random samples without replacement from the World Wide Web $N$. The size of $E_1$ is $|E_1| = 110$ trillion ($110 \cdot 10^{12}$) documents. You take a uniform random subsample of documents without replacement from $E_1$ and you submit each document to the competing system. This gives you an estimate $x = 0.2$ of the conditional probability $P(d \in E_2 \mid d \in E_1), d \in N$. You repeat the same procedure with $E_2$, obtaining an estimate $y = 0.4$ of the conditional probability $P(d \in E_1 \mid d \in E_2), d \in N$. Assume the estimates $x, y$ are the true probabilities. What is the size $|E_2|$ of the competing system's index? [3 points]
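A worked derivation, assuming that each conditional probability can be read as the fraction of one index that is covered by the intersection of the two indices:

```latex
\begin{align*}
x &= P(d \in E_2 \mid d \in E_1) = \frac{|E_1 \cap E_2|}{|E_1|}, &
y &= P(d \in E_1 \mid d \in E_2) = \frac{|E_1 \cap E_2|}{|E_2|} \\
x\,|E_1| &= |E_1 \cap E_2| = y\,|E_2| \\
|E_2| &= \frac{x}{y}\,|E_1| = \frac{0.2}{0.4} \cdot 110 \cdot 10^{12} = 55 \cdot 10^{12}
\end{align*}
```

Under this reading the competing index would hold about 55 trillion documents.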
The grey parrot, native to equatorial Africa, is categorized as an endangered species by the International Union for Conservation of Nature (IUCN). Suppose you take a uniform random sample $M$ without replacement of size $|M| = 8\,000$ from the grey parrot population $N$ and mark the sampled animals. After returning the marked animals back into the population, you take a second independent uniform random sample $T$ without replacement of the same size $|T| = 8\,000$ from the population. The number of marked animals $R = M \cap T$ in the second sample is $|R| = 10$. What is the most likely size $|N|$ of the grey parrot population? [2 points]
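The mark-and-recapture setting suggests the Lincoln-Petersen estimate, which equates the marked fraction of the second sample with the marked fraction of the whole population: |R| / |T| = |M| / |N|. The sketch below assumes that this estimator is the intended solution; the function name `lincoln_petersen` is hypothetical.

```python
def lincoln_petersen(marked, sampled, recaptured):
    """Estimate the population size from one mark-and-recapture experiment.

    marked:     size of the first (marked) sample, |M|
    sampled:    size of the second sample, |T|
    recaptured: number of marked animals in the second sample, |R|
    """
    if recaptured == 0:
        raise ValueError("no marked animals recaptured; the estimate is undefined")
    # Rounded down, because a population size is a whole number.
    return marked * sampled // recaptured


estimate = lincoln_petersen(marked=8_000, sampled=8_000, recaptured=10)
print(estimate)  # 6400000
```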
Given a directed graph $G$ that represents three Web pages $V(G) = \{a, b, c\}$, and the links $E(G) = \{(b, a), (c, a), (c, b), (b, c)\}$ between these three pages, draw $G$ [1 point] and produce the adjacency matrix (also known as the link matrix) $A$ [1 point], and the Markov transition matrix $P$ [2 points].
Describe the intuition behind the PageRank algorithm [1 point]. Compute the PageRank of the pages a, b, and c using a single iteration of the PageRank algorithm [2 points].
Describe what we mean when we call a page a hub or an authority [1 point]. Compute the hub and authority scores of the pages a, b, and c [2 points].
A =
0 0 0
1 0 1
1 1 0
P = [handwritten matrix, only partly legible; first row: 1/3 1/3 1/3]
The PageRank algorithm computes the probability [illegible] a web page that many pages point to.
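The transition matrix and a single PageRank iteration can be sketched as follows. Several choices here are assumptions rather than part of the exercise: the dangling page a (no out-links) jumps uniformly to all pages, the teleportation rate `alpha` defaults to 0, and the iteration starts from the uniform distribution. The names `transition_matrix` and `step` are hypothetical.

```python
from fractions import Fraction

PAGES = ["a", "b", "c"]
# Adjacency matrix for the links (b,a), (c,a), (c,b), (b,c);
# row = source page, column = target page.
A = [
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
]


def transition_matrix(adj, alpha=Fraction(0)):
    """Row-normalise adj; alpha is an assumed teleportation rate."""
    n = len(adj)
    matrix = []
    for row in adj:
        out_links = sum(row)
        if out_links == 0:
            # Dangling page: assumed to jump to every page uniformly.
            matrix.append([Fraction(1, n)] * n)
        else:
            matrix.append(
                [(1 - alpha) * Fraction(v, out_links) + alpha * Fraction(1, n)
                 for v in row]
            )
    return matrix


def step(x, matrix):
    """One power-iteration step: x' = x P (x is a row vector)."""
    n = len(matrix)
    return [sum(x[i] * matrix[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(A)
x0 = [Fraction(1, 3)] * 3  # assumed uniform starting distribution
x1 = step(x0, P)
for page, score in zip(PAGES, x1):
    print(page, score)
```

With these assumptions, one iteration gives a = 4/9 and b = c = 5/18; a nonzero `alpha` would pull all three scores toward 1/3.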
[Handwritten computation of the product A^T · A and of the hub and authority scores; illegible in the scan.]
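Hub and authority scores reinforce each other: a good hub links to good authorities, and a good authority is linked from good hubs. The sketch below iterates these two updates on the adjacency matrix of the exercise. The starting values of 1, the scaling by the largest score, and the iteration count are assumptions; other normalisations give the same ranking.

```python
# Adjacency matrix for the links (b,a), (c,a), (c,b), (b,c);
# row = source page, column = target page.
A = [
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
]


def hits(adj, iterations=20):
    """Iterate the HITS updates; returns (hub_scores, authority_scores)."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of the hub scores of the pages linking to j.
        auths = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub of i: sum of the authority scores of the pages i links to.
        hubs = [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Scale so that the largest score is 1 (guarding against all zeros).
        max_auth = max(auths) or 1.0
        max_hub = max(hubs) or 1.0
        auths = [v / max_auth for v in auths]
        hubs = [v / max_hub for v in hubs]
    return hubs, auths


hub_scores, authority_scores = hits(A)
print("hubs:", hub_scores)
print("authorities:", authority_scores)
```

For this graph the scores settle after the first iteration: pages b and c are equally good hubs, page a is no hub at all, and page a is the strongest authority.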
Compute an unbiased estimate of a text retrieval system's precision, recall, and the $F_1$ measure on the first five results [2 points], and the precision at 40% recall [2 points] given the following lists of results for queries $q_1$ and $q_2$, where R is a relevant result, and N is a non-relevant result:
• Results for $q_1$: RNNRRNR (10 relevant results for $q_1$ exist in the collection.)
• Results for $q_2$: NRNRRRRN (5 relevant results for $q_2$ exist in the collection.)
The first five results for query $q_1$: RNNRR
[Handwritten computation, partly illegible; recoverable fragments: 3/5 + 3/10, 9/10.]
The results with 40% recall for query $q_1$: RNNRRNR
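The measures can be computed per query and then averaged over the two queries. The sketch below makes two assumptions: the unbiased estimate is read as the arithmetic mean over queries, and precision at 40% recall is read as the precision at the first rank where recall reaches 40% (not the interpolated precision). The function names are hypothetical.

```python
def prf_at_k(results, total_relevant, k):
    """Precision, recall, and F1 on the first k results of one query."""
    top = results[:k]
    found = top.count("R")
    precision = found / len(top)
    recall = found / total_relevant
    f1 = 2 * precision * recall / (precision + recall) if found else 0.0
    return precision, recall, f1


def precision_at_recall(results, total_relevant, recall_level):
    """Precision at the first rank where recall reaches recall_level."""
    found = 0
    for rank, label in enumerate(results, start=1):
        found += label == "R"
        if found / total_relevant >= recall_level:
            return found / rank
    return 0.0  # the result list never reaches this recall level


# (result list, number of relevant documents in the collection)
QUERIES = [("RNNRRNR", 10), ("NRNRRRRN", 5)]

per_query = [prf_at_k(results, n, 5) for results, n in QUERIES]
mean_p, mean_r, mean_f1 = (sum(col) / len(col) for col in zip(*per_query))
mean_p40 = sum(precision_at_recall(r, n, 0.4) for r, n in QUERIES) / len(QUERIES)

print(per_query)
print(mean_p, mean_r, mean_f1, mean_p40)
```

For $q_1$ this gives precision 3/5, recall 3/10, and $F_1$ = 0.4 on the first five results, and precision 4/7 at 40% recall; for $q_2$ the precision at 40% recall is 2/4.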