sheet
udo
points
Describe the SoundEx phonetic retrieval algorithm [2 points]. Give an example of two similarly-sounding words with the same SoundEx code [1 point]. Explain the weak point of SoundEx [1 point], and give an example of two similarly-sounding words with different SoundEx codes [1 point].
cImUM™*^ i/o&ttj timtLir tk^tisk Sfifcy W^vU*.
^WOrX A.kA ^Ki/L Utb- $&3flf'
You may also write on the other side of the sheet, but it won't be scanned.
uio l j l !j l*—(, l—I l I J—'j L—X points
Consider the following XML document:
iu>
-
-
Lobster Thermidor With a fried egg on top, and spam 49.9
Draw the XML document as a graph [1 point], and count all structural terms in the document [2 points]. Compute the structural similarity between the structural term item/title#"That's not got much spam in it", and the queries //menu//price#30.9, and //title#"Lobster Thermidor" [2 points].
f
*Um lhou.$ "fWs ..." hehiA/^e^/hofe#/V^4,s ..."
You may also write on the other side of the sheet, but it won't be scanned.
sheet
□ . 4 □ G1E 9
points
Explain the following aspects of the if-means flat clustering algorithm [2 points]:
1. What do we need to know about our dataset before using the algorithm?
2. What is the input and the output of the algorithm?
3. What are the two steps that take place in every epoch?
4. How do we decide in which epoch to stop the algorithm?
4. h«eK to ^KOtf/ fix fanLir tU$S€$ |<( imli'nt Ihliifc
Given the points O, and the seeds □, run the if-means algorithm for three epochs. Draw the state of the algorithm at the beginning and after every epoch; no computation should be necessary. What is the output of the algorithm? [2 points]
)-r— l \
1
J
K t
s / ( \ n
\J /
J J
■f—c )—
J r \ \
\- J \ i
-
A —■ \ A N
] A J \
n / \ J
/
t j
Perform a hierarchical clustering of the above dataset into three classes using the single-link hierarchical agglomerative clustering algorithm, and draw the resulting dendrogram. [1 point] Is the output the same as the output of the if-means flat hierarchical clustering algorithm above? [1 point]
, P o
7<25( H '5.
You may also write on the other side of the sheet, but it won't be scanned.
sheet l- -j l I uio u -j L L J-4. L—i c I J—u l—I points
o
C ( s
State the Z/p/'s /aw [1 point] and the Heaps' law [1 point] in its complete variant [1 point]. Let C denote the term-document matrix of a document collection that contains M unique terms. According to the Heaps' law, what is the size of C in relation to M? [3 points]
zr^a e*w*. t^* a/z , ^ u *u ututf*
Ke^s1 M=kT vxAeire M is i/oe*Ut
( T ?s (Lo fe_04?on 8%« ?i 4o£ei/iS( ah/
You may also write on the other side of the sheet, but it won't be scanned.
Consider the following collection of four documents di.
• d\\ home sales rise in july
• di'. July new home sales rise
• d%: new home sales top forecasts
• d±: increase in home sales in July
Produce a list of (term, document ID) tuples [1 point], sort this list in lexicographical order [1 points], and use the sorted list to construct an inverted index [3 points]. Write down each step. Describe how you would produce this index using the Map Reduce distributed framework [3 points].
O<7(0, , (Met, 0,
('",/•), C'V,*1), ('^««, t,)i0«'^14)
hit -o 4
£«vcZ. t^Ou/^ jjuoc^u
You may also write on the other side of the sheet, but it won't be scanned.