Seminar 12 Exercise 12/1 Pick a topic of your interest and describe it by 5-10 words. Open Sketch Engine https://ske.fi.muni.cz/. Go to WebBootCaT. Create a corpus using the description words as seed. Wait until data are downloaded. Search the word corpus for collocations. Solutions can vary. Definition 1 (Index Relations) Suppose that we could pick a random page from the index of 𝐸1 and test whether it is in 𝐸2’s index and symmetrically, test whether a random page from 𝐸2 is in 𝐸1. These experiments give us fractions π‘₯ and 𝑦 such that our estimate is that a fraction π‘₯ of the pages in 𝐸1 are in 𝐸2, while a fraction 𝑦 of the pages in 𝐸2 are in 𝐸1. Then, letting |𝐸𝑖| denote the size of the index of search engine 𝐸𝑖, we have π‘₯|𝐸1| β‰ˆ 𝑦|𝐸2|, from which we have the form we will use |𝐸1| |𝐸2| β‰ˆ 𝑦 π‘₯ . Exercise 12/2 Two web search engines 𝐴 and 𝐡 each generate a large number of pages uniformly at random from their indexes. 30% of 𝐴’s pages are present in 𝐡’s index, while 50% of 𝐡’s pages are present in 𝐴’s index. What is the number of pages in 𝐴’s index relative to 𝐡’s? Substituting to the Definition 1 we get the fractions βˆ™ 3 10 of 𝐴 is in 𝐡 βˆ™ 5 10 of 𝐡 is in 𝐴 and we get the equation 0.3|𝐴| β‰ˆ 0.5|𝐡| |𝐴| |𝐡| β‰ˆ 0.5 0.3 |𝐴| |𝐡| β‰ˆ 5 3 Definition 2 (Path Similarity) Similarity between a query XPath 𝑐 π‘ž and a document path 𝑐 𝑑 is calculated as 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = {οΈƒ 1+|𝑐 π‘ž| 1+|𝑐 𝑑| if 𝑐 π‘ž can be expanded to 𝑐 𝑑 by adding nodes to the path 0 otherwise 1 Definition 3 (Structural Term) Structural term is defined as an XML-context/term pair denoted by of existing path to a value and the value itself, where the value itself is also a node in the XML document. For example, an XML document containing only a root element with test. test contains two structural terms and . Exercise 12/3 Consider the following the XML document: Computer Science Jennifer Widom Programming Methodology Introduction to the engineering of computer applications emphasizing modern software engineering principles. Jerry R. Cain Eric Roberts Mehran Sahami Programming Abstractions Abstraction and its relation to programming. Eric Roberts 2 Jerry R. Cain CS106A 1. Write the following expressions: a) Return all titles (including both departments and courses). b) Return all course titles that contain the word programming. c) Return the surnames of all instructors teaching at least one course that contains the word software in its description. d) Return the surnames of all professors teaching at least one course that contains the word software in its description. 2. Calculate the similarity between the queries and the corresponding document paths. a) //Instructors//Last_Name#Cain b) //Course/Instructors/Lecturer/Last_Name#Cain 1. a) //Title b) //Course//Title[contains(current(),’programming’)] c) //Course[contains(Description,’software’)]/Instructors//Last_Name d) //Course[contains(Description,’software’)]/Instructors/Professor/Last_Name 2. By Definition 2, a query 𝑐 π‘ž corresponds to document 𝑐 𝑑 if and only if it can be expanded. Original query for a) can be expanded to Course_Catalog/Department/Course/Instructors/Lecturer/Last_Name. Since 𝑐 π‘ž is expandable to 𝑐 𝑑, use the equation from the definition. Substituting for the query length 𝑐 π‘ž = 2 and the document length 𝑐 𝑑 = 6 to the formula we get 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = 1 + |𝑐 π‘ž| 1 + |𝑐 𝑑| = 1 + 2 1 + 6 = 3 7 . For b) we only change the query length 𝑐 π‘ž = 4 and obtain 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = 1 + |𝑐 π‘ž| 1 + |𝑐 𝑑| = 1 + 4 1 + 6 = 5 7 . 3 Exercise 12/4 Count how many structural terms are present in the XML tree: Programming Abstractions Abstraction and its relation to programming Eric Roberts To save space, mark each element with its first letter only (D stands for Description, P for Professor, . . . ). By Definition 3, we count all combinations and write them into the table: C T Programming Abstractions C I P F Eric C I P L Roberts T Programming Abstractions I P F Eric I P L Roberts Programming Abstractions P F Eric P L Roberts C D Abstraction . . . F Eric L Roberts D Abstraction . . . Eric Roberts Abstraction . . . There are 16 structural terms in total. 4