Text Summarization
22.11.2016 Matej Gallo

What?
An automatic summary is a text generated by software that is coherent and contains a significant amount of relevant information from the source text. Its compression rate τ (summary length relative to source length) is less than one third.
• Produced from one or more documents
• Preserves important information
• Short

Why?
“too much information kills information”
• Professional summarizers
  • Expensive
  • Lack expertise
• Reduces reading time
• Easier selection of documents
• Improves effectiveness of indexing
• Less biased
• Personalized summaries for QA systems

Summary Categorization
• Extractive
• Abstractive
• Single-document
• Multi-document
• Indicative
• Informative
• Headline summarization
• Ultra-summarization
• Keyword summarization
• Generic
• Query-focused
• Update

Summary Categorization
• Monolingual
• Multi-lingual
• Cross-lingual
• News
• Specialized
• Literary
• Encyclopedic…
• Author
• Expert
• Professional
• Multimedia

Abstractive Summarization
• Understands the text, then generates a summary (NLG)
• Abstract
• Very difficult
• Compression
• Fusion
• Information Extraction

Extractive Summarization
• Selects sentences from the source document
• Extract
• Cohesion
• Coherence
• Unresolved co-references
• Discourse relations

Extractive Summarization
• Intermediate representation
• Scoring sentences
• Selecting summary

Intermediate Representation
• Topic representation
  • VSM, lexical chains, LSA, Bayesian topic models
• Indicator representation
  • sentence length, sentence location, proper nouns, numerical data…
• Graph representation
  • directed forward (backward), undirected

Scoring Methods
• Topic representation
  • ability of a sentence to express the topic
• Indicator representation
  • machine learning
• Graph representation
  • stochastic methods
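
The three extractive steps listed above (intermediate representation, scoring sentences, selecting a summary) can be illustrated with a minimal sketch. It assumes the simplest possible choices, none of which come from the slides: plain word frequencies as the topic representation, the average word frequency as the sentence score, and a naive regex sentence splitter. The function names `split_sentences`, `tokenize` and `summarize` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: naive splitter that breaks after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Assumption: lowercase alphanumeric tokens, no stop-word removal or stemming.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def summarize(text, n=2):
    sentences = split_sentences(text)

    # 1. Intermediate representation: word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    # 2. Scoring: average document frequency of the words in a sentence.
    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # 3. Selection: keep the n best-scoring sentences, restored to source order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats sleep. Dogs bark loudly at night."
    print(summarize(doc, n=2))
```

Restoring the selected sentences to their source order is a small concession to coherence. It does nothing about unresolved co-references or redundancy between the selected sentences.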
Examples
[http://www.sciencedirect.com/science/article/pii/S0957417413002601]

Selecting a summary
• Length constraint
• best n approach
• Maximal marginal relevance
• Global selection
  • Maximize importance, maximize coherence, minimize redundancy

Evaluation
• Manual
• Semi-automatic
• ROUGE-n
• Automatic
• ROUGE-n
• Lexical level
• Abbreviations (BEwT-E, PYRAMID)

\[
\mathrm{ROUGE\text{-}n} = \frac{\sum \text{n-grams} \in \{\mathrm{Sum}_{can} \cap \mathrm{Sum}_{ref}\}}{\sum \text{n-grams} \in \mathrm{Sum}_{ref}}
\]

Frequent Patterns
• Single-document
• Monolingual
• Graph representation
• Dynamic graph: mimicking reading
• DGRMiner
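
As a rough illustration of the ROUGE-n formula on the Evaluation slide, the sketch below counts the n-grams shared by a candidate summary (Sum_can) and a reference summary (Sum_ref) and divides by the number of n-grams in the reference. Two details are assumptions rather than something the slides specify: tokens come from lowercased whitespace splitting, and the intersection is computed with clipped counts, so a repeated n-gram is credited at most as often as it occurs in the reference. The names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    # Assumption: lowercased whitespace tokenization, no stemming.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    total = sum(ref.values())
    if total == 0:
        # Reference has no n-grams of this order; define the score as 0.
        return 0.0

    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    return overlap / total


if __name__ == "__main__":
    print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
    print(rouge_n("the cat sat", "the cat sat on the mat", n=2))
```

Because the denominator counts only reference n-grams, this form is recall-oriented: a longer candidate is never penalized for extra material, which is why it is usually paired with a length constraint on the summary.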