NLP: The Main Issues

Why is NLP difficult?
- many "words", many "phenomena" --> many "rules"
  * OED: 400k words; Finnish lexicon (of word forms): ~2·10^7
  * sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
- irregularity (exceptions, exceptions to the exceptions, ...)
  * potato -> potatoes (tomato, hero, ...); photo -> photos; and even both: mango -> mangos or -> mangoes
  * Adjective/Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, but: Governor General

Difficulties in NLP (cont.)
- ambiguity
  * books: NOUN or VERB?
    - you need many books vs. she books her flights online
  * No left turn weekdays 4-6 pm / except transit vehicles (Charles Street at Cold Spring)
    - when may transit vehicles turn: always? never?
  * Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus)
    - Thank you for not eating without earphones??
    - or even: Thank you for not drinking without earphones!?
  * My neighbor's hat was taken by wind. He tried to catch it.
    - ...catch the wind, or ...catch the hat?

(Categorical) Rules or Statistics?
- Preferences:
  * clear cases: context clues: she books --> books is a verb
    - rule: if an ambiguous word (verb/non-verb) is preceded by a matching personal pronoun -> the word is a verb (a toy sketch of this rule appears below)
  * less clear cases: pronoun reference
    - she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
  * selectional:
    - catching hat >> catching wind (but why not?)
  * semantic:
    - never thank for drinking in a bus! (but what about the earphones?)

Solutions
- Don't guess if you know:
  * morphology (inflections)
  * lexicons (lists of words)
  * unambiguous names
  * perhaps some (really) fixed phrases
  * syntactic rules?
- Use statistics (based on real-world data) for preferences (only?)
- No doubt; but this is the big question!

Statistical NLP
- Imagine:
  * Each sentence W = {w1, w2, ..., wn} gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
  * For every possible context X, sort all the imaginable sentences W according to P(W|X)
  * Ideal situation: [figure: imaginable sentences sorted by P(W|X), with the best sentence (most probable in context X) at the top; NB: same for ...]

Real World Situation
- Unable to specify the set of grammatical sentences today using fixed "categorical" rules (maybe never; cf. the arguments in MS)
- Use a statistical "model" based on REAL-WORLD DATA and care about the best sentence only (disregarding the "grammaticality" issue); a toy sketch of picking the best sentence follows below
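
As a concrete illustration of the categorical rule on the "(Categorical) Rules or Statistics?" slide, here is a minimal Python sketch. The pronoun list, the toy set of ambiguous words, and the tag names are assumptions made for illustration; they are not part of the lecture.

```python
# Toy version of the slide's rule: "if an ambiguous word (verb/non-verb)
# is preceded by a matching personal pronoun -> the word is a verb".
# Word lists below are illustrative assumptions, not from the lecture.

PERSONAL_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
AMBIGUOUS = {"books", "flies", "runs"}  # verb-or-noun homographs (toy list)

def tag_word(prev_word: str, word: str) -> str:
    """Tag an ambiguous word as VERB when a personal pronoun precedes it."""
    if word.lower() in AMBIGUOUS and prev_word.lower() in PERSONAL_PRONOUNS:
        return "VERB"
    return "NOUN"  # default reading of the ambiguous word in this toy rule

print(tag_word("she", "books"))   # VERB  ("she books her flights online")
print(tag_word("many", "books"))  # NOUN  ("you need many books")
```

Such a rule handles the clear cases but, as the slide notes, the less clear cases (pronoun reference, selectional and semantic preferences) resist this kind of categorical treatment.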
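
And a minimal sketch of the "best sentence" idea from the last two slides: given a context X, rank candidate sentences W by P(W|X) and keep the most probable one. The candidate sentences and their probabilities below are invented toy numbers, standing in for whatever real statistical model supplies P(W|X).

```python
# Toy "ideal situation": score every candidate sentence W by P(W|X)
# in a fixed context X, sort, and take the best (most probable) one.
# The probability table is made-up illustrative data.

candidates = {
    "she books her flights online": 0.50,
    "she book her flights online":  0.05,
    "books she her online flights": 0.01,
}

# Sort all candidate sentences by P(W|X), most probable first ...
ranked = sorted(candidates.items(), key=lambda wp: wp[1], reverse=True)

# ... and keep the best sentence (most probable in context X).
best_sentence, best_p = ranked[0]
print(best_sentence, best_p)  # she books her flights online 0.5
```

Note that the ranking never asks whether a candidate is "grammatical"; as the last slide says, the model only cares which sentence is best given the real-world data it was built from.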