Grouping Words PA154 Jazykové modelovaní (11.1) Pavel Rychlý pary@fi.muni.cz May 11, 2021 PA154 Jazykové modelování (11.1) Grouping Words Linguistic Objects in this Course ■ Trees (with strings at the nodes) ► Syntax, semantics ► Algorithms: ■ Sequences (of strings) ► n-grams, tag sequences ► morpheme sequences, phoneme sequences ► Algorithms: Finite-state, best-paths, forward-backward ■ "Atoms"(unanalyzed strings) ► Words, morphemes ► Represent by contexts - other words they occur with ► Algorithms: PA154 Jazykové modelování (11.1) Grouping Words 2/27 A Concordance for "party" - from www.webcorp.org.uk WebCogfc Concordance the web in real-time. WordIist Tool User Guide WebCorp LSE Publications Feedback WebCorp Live lets you access the Web as a corpus - a large collection of texts from which examples of real language use can be extracted. More... Search: party Case Insensitive: * Span: 50 characters T Search API: Google Language: Not specified 0 Advanced Options Resetovat Search PA154 Jazykové modelování (11.1) Grouping Words 3/27 A Concordance for "party" - from www.webcorp.org.uk ■ thing. She was talking at a party thrown at Daphne's restaurant in ■ have turned it into the hot dinner-party topic. The comedy is the ■ selection for the World Cup party, which will be announced on May 1 ■ in the 1983 general election for a party which, when it could not bear to ■ to attack the Scottish National Party, who look set to seize Perth and ■ that had been passed to a second party who made a financial decision ■ the by-pass there will be a street party. "Then," he says, "we are going ■ number-crunchers within the Labour party, there now seems little doubt ■ political tradition and the same party. They are both relatively Anglophilic ■ he told Tony Blair's modernised party they must not retreat into "warm ■ "Oh no, I'm just here for the party,"they said. "I think it's terrible ■ A future obliges each party to the contract to fulfil it by ■ be signed by or on behalf of each party to the contract." Mr David N PA154 Jazykové modelování (11.1) Grouping Words 4/27 What Good are Word Senses? ■ thing. She was talking at a party thrown at Daphne's restaurant in ■ have turned it into the hot dinner-party topic. The comedy is the ■ selection for the World Cup party, which will be announced on May 1 ■ in the 1983 general election for a r which, when it could not bear to ■ to attack the Scottish National y, who look set to seize Perth and ■ that had been passed to a second ' who made a financial decision ■ the by-pass there will be a street party. "Then," he says, "we are going ■ number-crunchers within the Labour | y, there now seems little doubt ■ political tradition and the same | y. They are both relatively Anglophilic ■ he told Tony Blair's modernised ' they must not retreat into "warm ■ "Oh no, I'm just here for the party,"they said. "I think it's terrible ■ A future obliges each ' to the contract to fulfil it by ■ be signed by or on behalf of each * to the contract." Mr David N PA154 Jazykové modelování (11.1) Grouping Words 5/27 What Good are Word Senses? ■ thing. She was talking at a party thrown at Daphne's restaurant in ■ have turned it into the hot dinner-party topic. The comedy is the ■ selection for the World Cup party, which will be announced on May 1 ■ the by-pass there will be a street party. "Then," he says, "we are going ■ "Oh no, I'm just here for the party,"they said. "I think it's terrible ■ in the 1983 general election for a r which, when it could not bear to ■ to attack the Scottish National y, who look set to seize Perth and ■ number-crunchers within the Labour | y, there now seems little doubt ■ political tradition and the same | y. They are both relatively Anglophilic ■ he told Tony Blair's modernised ' they must not retreat into "warm ■ that had been passed to a second ' who made a financial decision ■ A future obliges each ' to the contract to fulfil it by ■ be signed by or on behalf of each * to the contract." Mr David N PA154 Jazykové modelování (11.1) Grouping Words 6/27 What Good are Word Senses? John threw a "rain forest" party last December. His living room was full of plants and his box was playing Brazilian music ... PA154 Jazykové modelování (11.1) Grouping Words 7/27 What Good are Word Senses? ■ Replace word w with sense s Q ts w into senses: distinguishes this token of w from tokens with sense t Q roups w with other words: groups this token of w with tokens of x that also have sense s PA154 Jazykové modelování (11.1) Grouping Words 8/27 What Good are Word Senses? ■ number-crunchers within the Labour party, there now seems little doubt ■ political tradition and the same party. They are both relatively Anglophilic ■ he told Tony Blair's modernised party they must not retreat into "warm ■ thing. She was talking at a party thrown at Daphne's restaurant in ■ have turned it into the hot dinner-party topic. The comedy is the ■ selection for the World Cup Party, which will be announced on May 1 ■ the by-pass there will be a street party. "Then," he says, "we are going ■ "Oh no, I'm just here for the party,"they said. "I think it's terrible ■ an appearance at the annual awards bash , but feels in no fit state to ■ -known families at a fundraising bash on Thursday night for Learning ■ Who was paying for the bash? The only clue was the name Asprey, ■ Mail, always hosted the annual bash for the Scottish Labour front- ■ popular. Their method is to bash sense into criminals with a short, ■ just cut off people's heads and bash their brains out over the floor, PA154 Jazykové modelování (11.1) Grouping Words 9/27 What Good are Word Senses? ■ number-crunchers within the Labour | y, there now seems little doubt ■ political tradition and the same | y. They are both relatively Anglophilic ■ he told Tony Blair's modernised ' they must not retreat into "warm ■ thing. She was talking at a party thrown at Daphne's restaurant in ■ have turned it into the hot dinner-party topic. The comedy is the ■ selection for the World Cup party, which will be announced on May 1 ■ the by-pass there will be a street party. "Then," he says, "we are going ■ "Oh no", I'm just here for the party,"they said. "I think it's terrible ■ an appearance at the annual awards bash, but feels in no fit state to ■ -known families at a fundraising bash on Thursday night for Learning ■ Who was paying for the bash? The only clue was the name Asprey, ■ Mail, always hosted the annual bash for the Scottish Labour front- popular. Their method is to I i sense into criminals with a short, just cut off people's heads and i their brains out over the floor, PR Grouping Words PA154 Jazykové modelování (11.1) 10/27 What Good are Word Senses? ■ Semantics / Text understanding ► Axioms about TRANSFER apply to (some tokens of) throw ► Axioms about BUILDING apply to (some tokens of) bank ■ Machine translation ■ Info retrieval / Question answering / Text categ. ► Query or pattern might not match document exactly ■ Backoff for just about anything ► what word comes next? (speech recognition, language ID,. ..) ► trigrams are sparse but tri-meanings might not be ► bilexical PCFGs: ► p(S[devour] —>► NP[lion] VP[devour] | S[devour]) ► approximate by p(S[EAT] -> NP[lion] VP [EAT] | S[EAT]) ■ Speaker's real intention is senses; words are a noisy channel PA154 Jazykové modelování (11.1) Grouping Words 11/27 Cues to Word Sense Adjacent words (or their senses) Grammatically related words (subject, object, . ..) Other nearby words Topic of document Sense of other tokens of the word in the same document PA154 Jazykové modelování (11.1) Grouping Words 12/27 Word Classes by Tagging Every tag is a kind of class Tagger assigns a class to each word token bigram ) rju model L Start ■Tl replatľ- PN Verb Det ill Bill directed a Noun I cortege Prep 1 of PA154 Jazykové modelování (11.1) Grouping Words Word Classes by Tagging ■ Every tag is a kind of class ■ Tagger assigns a class to each word token ► Simultaneously groups and splits words ► "party" gets split into N and V senses ► "bash" gets split into N and V senses ► {party/N, bash/N} vs. {party/V, bash/V} ► What good are these groupings? PA154 Jazykové modelování (11.1) Grouping Words Learning Word Classes ■ Every tag is a kind of class ■ Tagger assigns a class to each word token ► {party/N, bash/N} vs. {party/V, bash/V} ► What good are these groupings? ► Good for predicting next word or its class! ■ Role of forward-backward algorithm? ► It adjusts classes etc. in order to predict sequence of words better (with lower perplexity) PA154 Jazykové modelování (11.1) Grouping Words 15/27 Words and Vectors ■ Represent each word type w (party) by a point in k-dimensional space ► e.g., k is size of vocabulary ► the 17th coordinate of w represents strength of w's association with vocabulary word 17 PA154 Jazykové modelování (11.1) Grouping Words 16/27 Word aardvark abacus abandoned abbot abduct above zygote zymurgy Count 0 0 3 1 0 7 1 0 too high too low From corpus: Jim Jeffords abandoned the Republican party. There were lots of abbots and nuns dancing at that party. The party above the art gallery was, above all, a laboratory for synthesizing zygotes and beer. Grouping Words ■ Represent each word type w (party) by a point in k-dimensional space ► e.g., k is size of vocabulary ► the 17th coordinate of w represents strength of w's association with vocabulary word 17. ■ How might you measure this? ► how often words appear next to each other ► how often words appear near each other ► how often words are syntactically linked ► should correct for commonness of word (e.g., "above") PA154 Jazykové modelování (11.1) Grouping Words 18/27 ■ Represent each word type w (party) by a point in k-dimensional space ► e.g., k is size of vocabulary ► the 17th coordinate of w represents strength of w's association with vocabulary word 17. ■ Plot all word types in k-dimensional space ■ Look for clusters of close-together types PA154 Jazykové modelování (11.1) Grouping Words 19/27 rning Classes by Clustering Plot all word types in k-dimensional space Look for clusters of close-together types Plot in k dimensions (k=3) ___ • • • PA154 Jazykové modelování (11.1) Grouping Words Bottom-Up Clustering Start with one cluster per point Repeatedly merge 2 closest clusters ► Single-link: dist(A,B) = min dist(a,b) for a E A, b E B ► Complete-link: dist(A,B) = max dist(a,b) for a E A, b E B PA154 Jazykové modelování (11.1) Grouping Words 21/27 Single-Link each word type is a single-point cluster Again, merge closest pair of clusters: ► Single-link: clusters are close if any of their points are dist(A.B) = min dist(a.b) for a e A, b e B PA154 Jazykové modelování (11.1) Grouping Words 22/27 Single-Link Fast, but tend to get long, stringy, meandering clusters PA154 Jazykové modelování (11.1) Grouping Words Complete-Link ■ Again, merge closest pair of clusters: ► Complete-link: clusters are close only if all of their points are dist(A.B) = max dist(a.b) for a e A, b e B PA154 Jazykové modelování (11.1) Grouping Words 24/27 Complete-Link ■ Slow to find closest pair - need quadratically many distances Grouping Words Summary ■ Start with one cluster per point ■ Repeatedly merge 2 closest clusters ► Single-link: dist(A,B) = min dist(a,b) for a G A, b G B ► Complete-link: dist(A,B) = max dist(a,b) for a G /4, b G B ► too slow to update cluster distances after each merge; but alternatives! ► Average-link: dist(A,B) = mean dist(a,b) for a G /4, b G B ► Centroid-link: dist(A,B) = dist(mean(A),mean(B)) ■ Stop when clusters are "big enough" ► e.g., provide adequate support for backoff (on a development corpus) ■ Some flexibility in defining dist(a,b) ► Might not be Euclidean distance; e.g., use vector angle PA154 Jazykové modelování (11.1) Grouping Words 26/27 EM Clustering (for k clusters) ■ EM algorithm ► Viterbi version - called "k-means clustering" ► Full EM version - called "Gaussian mixtures" ■ Expectation step: Use current parameters (and observations) to reconstruct hidden structure ■ Maximization step: Use that hidden structure (and observations) to reestimate parameters ■ Parameters: k points representing cluster centers ■ Hidden structure: for each data point (word type), which center generated it? PA154 Jazykové modelování (11.1) Grouping Words 27/27