Sheet    UCO    Points

Consider the following collection of four documents d_i:

• d_1: BREAKTHROUGH DRUG FOR HIV
• d_2: NEW HIV DRUG
• d_3: NEW APPROACH FOR TREATMENT OF HIV
• d_4: NEW HOPES FOR HIV PATIENTS

Produce a list of (term, document ID) tuples [1 point], sort this list in lexicographical order [1 point], and use the sorted list to construct an inverted index [1 point]. Write down each step. Describe how you would produce this index using the MapReduce distributed framework [2 points].

[Handwritten answer, mostly illegible in the scan. Legible fragments: (breakthrough, 1), (drug, 1), (for, 1), (HIV, 1), ...; postings "hopes → 4" and "new → 2 → 3 → 4"; "... process document ID) list."]

Write your solution only on this side of the sheet!

Sheet E    UCO    Points

Explain the following aspects of the K-means flat clustering algorithm [2 points]:

1. What do we need to know about our dataset before using the algorithm?
2. What is the input and the output of the algorithm?
3. What are the two steps that take place in every epoch?
4. How do we decide in which epoch to stop the algorithm?

[Handwritten answer, illegible in the scan.]

Compute an unbiased estimate of a text retrieval system's precision, recall, and the F_1 measure on the first five results [2 points], and the precision at 40% recall [2 points] given the following lists of results for queries q_1 and q_2, where R is a relevant result, and N is a non-relevant result:

• Results for q_1: RNNRRNR (10 relevant results for q_1 exist in the collection.)
• Results for q_2: NRNRRRRN (5 relevant results for q_2 exist in the collection.)

The first five r