Sentence Level Text Analysis

Vojtěch Kovář
Natural Language Processing Centre
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno
xkovar3@fi.muni.cz

Workshop of the Natural Language Processing Centre, 28 May 2013


Simon spoke about sex with Britney Spears


Zkolaboval katastr nemovitostí, lidé musejí přespávat v parcích
(The land registry has collapsed, people have to sleep in parks)

  Zkolaboval katastr nemovitostí
      - who/what: katastr nemovitostí (the land registry)
      - predicate: Zkolaboval (has collapsed)
  lidé musejí přespávat v parcích
      - who/what: lidé (people)
      - predicate: musejí přespávat (have to sleep)
      - where: v parcích (in parks)

  source: www.infobaden.cz


Sentence level (syntactic) analysis

  - Natural language syntax describes relationships among words
  - Automatic syntactic analysis
      - reveals inter-word relationships on various levels
      - detects noun (prepositional, verb, ...) phrases and clauses
      - finds relationships (dependencies) among these units

  | Simon | spoke | about sex | with Britney Spears |
  | Simon | spoke | about sex with Britney Spears |


Syntactic trees

  Simon spoke about sex with Britney Spears

  [Figures: syntactic trees of the sentence, without and with dependency labels]

  Labels in the trees: Simon (subject), spoke, about (pp), sex (prep-object), with (pp), Britney (prep-object), Spears (attr)


Why are we doing this?

  - Syntactic units are carriers of meaning
      - “in the city”: the meaning of “in” or “the” on its own is unclear and complicated
      - the meaning of “in the city” is simply “where”
  - Words are sometimes not enough
      - red brick house vs. brick house red vs.
        red house brick
      - Honey, give me love vs. Love, give me honey
  - Starting point for intelligent natural language applications
      - extraction of facts & question answering
      - logical analysis
      - punctuation detection & grammar checking
      - natural text generation
      - authorship detection
      - machine translation


Example: Extraction of facts

  Zkolaboval katastr nemovitostí, lidé musejí přespávat v parcích
  (The land registry has collapsed, people have to sleep in parks)

  Zkolaboval katastr nemovitostí
      - who/what: katastr nemovitostí (the land registry)
      - predicate: Zkolaboval (has collapsed)
  lidé musejí přespávat v parcích
      - who/what: lidé (people)
      - predicate: musejí přespávat (have to sleep)
      - where: v parcích (in parks)

  source: www.infobaden.cz

  text → syntactic analysis → clauses, phrases → phrase classification → facts


Example: Logical analysis

  Žádný mobilní agent není statický.
  (No mobile agent is static.)

  text → syntactic analysis → trees → tree conversion → formulae

  λw1 λt2 [Not, [True_w1t2, λw3 λt4 (∃i5)([statický_w3t4 i5] ∧ [[mobilní, agent]_w3t4, i5])]] ... π

  ¬∃x(mobilni(x) ∧ agent(x) ∧ staticky(x))


Example: Grammar checking

  Let’s eat grandma!
      - syntactic analysis
      - detection of improbable constructions
          → grandma is not a usual object of eating
          → correction suggestion
  Let’s eat, grandma!
      - life saved :)

  - Similarly with other grammar phenomena
      - “This is worth try” → “This is worth trying”


How to analyze natural language syntax?

  Prerequisites
      - word level analysis (part of speech, gender, number)
      - named entity recognition
      - lexical semantic information (e.g. “pregnant” goes with women only)
  Named entity recognition
      - determines that e.g. “prof. Václav Šplíchal” is a person
      - can be viewed as a sub-task of syntactic analysis


How to analyze natural language syntax?

  Statistical methods
      - people annotate a corpus
      - statistical methods learn rules from the corpus
      - universal across languages (to some extent)
      - annotation is expensive
      - hard to customize for different applications
      - data are usually not big enough
  Rule-based methods
      - specialists develop a set of rules (“grammar”)
      - not universal, depends on specialists
      - the grammar can become hard to maintain
      - easy to customize for different applications
  Hybrids


Syntactic analysers in the NLP Centre

  Synt
      - C++, fast (0.07 s/sentence)
      - based on a large meta-grammar
  SET
      - Python, slower but easily adaptable
      - based on a set of patterns
  Both
      - rule-based backbone with statistical extensions
      - grammars for Czech, English and Slovak
      - accuracy 85–90 % on journal texts
  Word Sketches
      - very fast shallow syntax for large corpora
      - 31 languages

See you in demo :)
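

Appendix sketch: phrase detection and attachment ambiguity

  The phrase detection step and the two bracketings of "Simon spoke about sex with Britney Spears" can be illustrated with a small Python sketch. This is not how Synt or SET work internally. The three-letter tagset, the chunk rules and the function names chunk, readings and bracket are all assumptions made for illustration, and the word tags are assigned by hand rather than by a real tagger.

```python
from itertools import product

# Hand-assigned toy tags (hypothetical tagset, not the NLP Centre's):
# N = noun or name, V = verb, P = preposition.
TAGGED = [("Simon", "N"), ("spoke", "V"), ("about", "P"), ("sex", "N"),
          ("with", "P"), ("Britney", "N"), ("Spears", "N")]


def chunk(tagged):
    """Group tagged words into flat chunks: PP = P N*, NP = N+, VP = V."""
    chunks, i = [], 0
    while i < len(tagged):
        tag = tagged[i][1]
        j = i + 1
        if tag in ("P", "N"):
            # Absorb the nouns that directly follow a preposition or a noun.
            while j < len(tagged) and tagged[j][1] == "N":
                j += 1
            label = "PP" if tag == "P" else "NP"
        else:
            # The toy tagset has only one other tag: V.
            label = "VP"
        chunks.append((label, " ".join(word for word, _ in tagged[i:j])))
        i = j
    return chunks


def readings(chunks):
    """Enumerate PP attachments: each PP after the verb may modify the verb
    or any earlier PP. Crossing attachments are not filtered out, so longer
    PP sequences are overgenerated."""
    verb = next(i for i, (label, _) in enumerate(chunks) if label == "VP")
    pps = [i for i, (label, _) in enumerate(chunks) if label == "PP" and i > verb]
    options = [[verb] + [p for p in pps if p < pp] for pp in pps]
    return [dict(zip(pps, heads)) for heads in product(*options)]


def bracket(chunks, reading):
    """Render one reading in the slide notation; a PP attached to another PP
    is merged into that PP's group."""
    groups = {}
    for i, (_, text) in enumerate(chunks):
        root = i
        # Climb PP-to-PP attachments to find the group this chunk belongs to.
        while root in reading and chunks[reading[root]][0] == "PP":
            root = reading[root]
        groups.setdefault(root, []).append(text)
    return "| " + " | ".join(" ".join(g) for g in groups.values()) + " |"


if __name__ == "__main__":
    found = chunk(TAGGED)
    for reading in readings(found):
        print(bracket(found, reading))
```

  Run as a script, the sketch prints the two segmentations shown on the "Sentence level (syntactic) analysis" slide. Choosing between them is exactly where lexical semantic information or statistics would have to come in; the sketch only enumerates the candidates.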