MUNI FI
Grouping Words
PA154 Language Modeling (7.1)
Pavel Rychlý
March 30, 2023
Source: Natural Language Processing (600.465), Jason Eisner, Johns Hopkins Univ.
Pavel Rychlý • Grouping Words • March 30, 2023 1/27

Linguistic Objects in this Course
■ Trees (with strings at the nodes)
  ■ Syntax, semantics
  ■ Algorithms: Generation, parsing, inside-outside, build semantics
■ Sequences (of strings)
  ■ n-grams, tag sequences
  ■ morpheme sequences, phoneme sequences
  ■ Algorithms: Finite-state, best-paths, forward-backward
■ "Atoms" (unanalyzed strings)
  ■ Words, morphemes
  ■ Represent by contexts - other words they occur with
  ■ Algorithms: Grouping similar words, splitting words into senses

A Concordance for "party"
[Screenshot: WebCorp Live search form ("Concordance the web in real-time"); search: party, span: 50 characters, search API: Google]
WebCorp Live lets you access the Web as a corpus - a large collection of texts from which examples of real language use can be extracted.
■ thing. She was talking at a party thrown at Daphne's restaurant in
■ have turned it into the hot dinner-party topic. The comedy is the
■ selection for the World Cup party, which will be announced on May 1
■ in the 1983 general election for a party which, when it could not bear to
■ to attack the Scottish National Party, who look set to seize Perth and
■ that had been passed to a second party who made a financial decision
■ the by-pass there will be a street party.

John threw a "rain forest" party last December. His living room was full of plants and his box was playing Brazilian music ...

What Good are Word Senses?
■ Replace word w with sense s
  1. Splits w into senses: distinguishes this token of w from tokens with sense t
  2. Groups w with other words: groups this token of w with tokens of x that also have sense s

What Good are Word Senses?
■ number-crunchers within the Labour party, there now seems little doubt
■ political tradition and the same party. They are both relatively Anglophilic
■ he told Tony Blair's modernised party they must not retreat into "warm
■ thing. She was talking at a party thrown at Daphne's restaurant in
■ have turned it into the hot dinner-party topic. The comedy is the
■ selection for the World Cup party, which will be announced on May 1
■ the by-pass there will be a street party. "Then," he says, "we are going
■ "Oh no, I'm just here for the party," they said. "I think it's terrible
■ an appearance at the annual awards bash, but feels in no fit state to
■ -known families at a fundraising bash on Thursday night for Learning
■ Who was paying for the bash? The only clue was the name Asprey,
■ Mail, always hosted the annual bash for the Scottish Labour front-
■ popular. Their method is to bash sense into criminals with a short,
■ just cut off people's heads and bash their brains out over the floor,

What Good are Word Senses?
■ Semantics / Text understanding
  ■ Axioms about TRANSFER apply to (some tokens of) throw
  ■ Axioms about BUILDING apply to (some tokens of) bank
■ Machine translation
■ Info retrieval / Question answering / Text categ.
  ■ Query or pattern might not match document exactly
■ Backoff for just about anything
  ■ what word comes next? (speech recognition, language ID, ...)
    ■ trigrams are sparse but tri-meanings might not be
  ■ bilexical PCFGs:
    ■ p(S[devour] -> NP[lion] VP[devour] | S[devour])
    ■ approximate by p(S[EAT] -> NP[lion] VP[EAT] | S[EAT])
■ Speaker's real intention is senses; words are a noisy channel

Cues to Word Sense
■ Adjacent words (or their senses)
■ Grammatically related words (subject, object, ...)
■ Other nearby words
■ Topic of document
■ Sense of other tokens of the word in the same document

Word Classes by Tagging
■ Every tag is a kind of class
■ Tagger assigns a class to each word token
[Figure: tag path starting at Start over the sentence "Bill directed a cortege of autos through ...", with tags PN, Verb, Det, Noun, Prep, Noun, Prep]
■ Simultaneously groups and splits words
  ■ "party" gets split into N and V senses
  ■ "bash" gets split into N and V senses
  ■ {party/N, bash/N} vs. {party/V, bash/V}
■ What good are these groupings?

Learning Word Classes
■ Every tag is a kind of class
■ Tagger assigns a class to each word token
  ■ {party/N, bash/N} vs. {party/V, bash/V}
■ What good are these groupings?
  ■ Good for predicting next word or its class!
■ Role of forward-backward algorithm?
  ■ It adjusts classes etc. in order to predict sequence of words better (with lower perplexity)

Words and Vectors
■ Represent each word type w (party) by a point in k-dimensional space
  ■ e.g., k is size of vocabulary
  ■ the 17th coordinate of w represents strength of w's association with vocabulary word 17

Word:   aardvark  abacus  abandoned  abbot  abduct  above  zygote  zymurgy
Count:  0         0       3          1      0       7      1       0
(figure annotations on the counts: "too high", "too low")

From corpus:
■ Jim Jeffords abandoned the Republican party.
■ There were lots of abbots and nuns dancing at that party.
■ The party above the art gallery was, above all, a laboratory for synthesizing zygotes and beer.

■ How might you measure this?
  ■ how often words appear next to each other
  ■ how often words appear near each other
  ■ how often words are syntactically linked
  ■ should correct for commonness of word (e.g., "above")
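
The measurement options listed above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method used in the lecture: the tokenizer is a crude lowercase regex, there is no lemmatization (so "abbots" and "zygotes" are not mapped to "abbot" and "zygote"), and positive pointwise mutual information (PPMI) is swapped in as one possible way to "correct for commonness"; the slides do not name a specific correction. The function names `tokenize`, `cooccurrence_counts`, and `ppmi` are hypothetical. Because only the three example sentences are used, the counts will not match the figures in the table, which presumably come from a larger corpus.

```python
import math
import re
from collections import Counter

# The three example sentences quoted on the slide.
CORPUS = [
    "Jim Jeffords abandoned the Republican party.",
    "There were lots of abbots and nuns dancing at that party.",
    "The party above the art gallery was, above all, a laboratory "
    "for synthesizing zygotes and beer.",
]


def tokenize(sentence):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())


def cooccurrence_counts(sentences, window):
    """For each word, count the words within `window` tokens of it in the same sentence."""
    counts = {}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]  # neighbours, excluding position i
            counts.setdefault(word, Counter()).update(context)
    return counts


def ppmi(counts):
    """Positive PMI, max(0, log2(p(w, c) / (p(w) p(c)))), for every observed pair."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    row = {w: sum(ctx.values()) for w, ctx in counts.items()}
    scores = {}
    for w, ctx in counts.items():
        scores[w] = {}
        for c, n in ctx.items():
            # Counts are symmetric here, so row[c] also serves as the context total for c.
            pmi = math.log2(n * total / (row[w] * row[c]))
            scores[w][c] = max(0.0, pmi)
    return scores


# window=1 would mean "next to each other"; a large window means "near each other".
counts = cooccurrence_counts(CORPUS, window=100)  # 100 covers every whole sentence
scores = ppmi(counts)
for context_word in ("the", "above", "abandoned"):
    print(context_word, counts["party"][context_word], round(scores["party"][context_word], 3))
```

On this toy corpus "the" has a higher raw count with "party" than "abandoned" does, but a lower PPMI score, which is the kind of correction for common words the last bullet asks for.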

Learning Classes by Clustering
■ Plot all word types in k-dimensional space
■ Look for clusters of close-together types
[Figure: word types plotted in k dimensions (k=3)]

Bottom-Up Clustering
■ Start with one cluster per point
■ Repeatedly merge 2 closest clusters
  ■ Single-Link: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
  ■ Complete-Link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B

Single-Link
■ Again, merge closest pair of clusters:
  ■ Single-Link: clusters are close if any of their points are
    dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
■ Fast, but tend to get long, stringy, meandering clusters

Complete-Link
■ Again, merge closest pair of clusters:
  ■ Complete-Link: clusters are close only if all of their points are
    dist(A,B) = max dist(a,b) for a