{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.1"},"colab":{"provenance":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"A8Cj_X1VnmC0"},"source":["# PB016: Artificial Intelligence I, labs 13 - Natural language processing\n","\n","This week's topic is natural language processing (NLP). We'll focus namely on:\n","1. __Text acquisition and pre-processing__\n","2. __Tokenization, tagging and stemming (dummy pipeline)__\n","3. __The NLTK NLP library, shallow syntactic analysis (smarter pipeline)__\n","4. __Sentiment analysis__"]},{"cell_type":"markdown","metadata":{"id":"MjLOiihWnmC1"},"source":["---\n","\n","## 1. Text acquisition and preprocessing\n","\n","__Basic facts__\n","- Usually a necessary step before applying the natural language processing methods themselves.\n","- It consists mainly of preparing texts for machine processing (e.g., removal of OCR noise, markup, splitting in segments, normalisation, etc.).\n","\n","__Examples of typical tasks__\n","- Conversion and cleaning of text - obtaining material (e.g., by downloading), encoding transformations, conversion from the original format to plain text, possible removal of noise in the form of OCR errors, formatting and other marks, annotation passages, etc.\n","- Removal of \"irrelevant\" words - filtering of [stop-list](https://en.wikipedia.org/wiki/Stop_word) expressions that are a valid part of the language, but introduce noise in the context of a given NLP task (e.g., articles, prepositions and even certain verbs and nouns in English are noise for most tasks based on the \"bag of words\" approach, where only statistical parameters of the text matter regardless of its explicit syntactic structure)."]},{"cell_type":"markdown","metadata":{"id":"WMpnWpX6SV5J"},"source":["### Downloading a sample text\n","\n","![orwell](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/1984first.jpg)"]},{"cell_type":"code","metadata":{"id":"xkjyNb3C0Xhr"},"source":["# import library for opening URLs, etc.\n","import urllib.request\n","\n","# open a link to sample text\n","\n","sample_text_link = \"http://gutenberg.net.au/ebooks01/0100021.txt\"\n","f = urllib.request.urlopen(sample_text_link)\n","\n","# decoding the contents of the link (just convert the binary string to text - \n","# it's already in a relatively clean plain text format)\n","\n","sample_text = f.read().decode(\"utf-8\")\n","\n","# print the beginning and ending of the text\n","\n","beginning = sample_text[:4207]\n","ending = sample_text[-6431:]\n","\n","print('***** The beginning of the \"raw version\" of the 1984 novel *****\\n')\n","print(beginning)\n","\n","print('\\n***** The end of the \"raw version\" of the 1984 novel *****\\n')\n","print(ending)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"NbpYgDZsXTMX"},"source":["### Cleaning the sample text\n","- Removal of meta-data about the publication, appendices behind the story, other adjustments aimed at obtaining the text itself without structural annotations (i.e., annotation of parts, chapters, etc.).\n","- Notes on the solution:\n"," - The procedure is often very arbitrary, depending on the source text and what we want to do with it.\n"," - A good start is to look at parts of the text, e.g., with 
{"cell_type":"markdown","metadata":{"id":"WMpnWpX6SV5J"},"source":["### Downloading a sample text\n","\n","![orwell](https://www.fi.muni.cz/~novacek/courses/pb016/labs/img/1984first.jpg)"]},
{"cell_type":"code","metadata":{"id":"xkjyNb3C0Xhr"},"source":["# import a library for opening URLs\n","import urllib.request\n","\n","# open a link to the sample text\n","\n","sample_text_link = \"http://gutenberg.net.au/ebooks01/0100021.txt\"\n","f = urllib.request.urlopen(sample_text_link)\n","\n","# decode the contents of the link (just convert the binary string to text -\n","# it's already in a relatively clean plain text format)\n","\n","sample_text = f.read().decode(\"utf-8\")\n","\n","# print the beginning and ending of the text\n","\n","beginning = sample_text[:4207]\n","ending = sample_text[-6431:]\n","\n","print('***** The beginning of the \"raw version\" of the 1984 novel *****\\n')\n","print(beginning)\n","\n","print('\\n***** The end of the \"raw version\" of the 1984 novel *****\\n')\n","print(ending)"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"NbpYgDZsXTMX"},"source":["### Cleaning the sample text\n","- Removal of the metadata about the publication, the appendix after the story, and other adjustments aimed at obtaining the text itself without structural annotations (i.e., annotations of parts, chapters, etc.).\n","- Notes on the solution:\n","  - The procedure is often rather arbitrary, depending on the source text and what we want to do with it.\n","  - A good start is to look at parts of the text, e.g., with `print(sample_text[:K])` and `print(sample_text[-K:])`, where `K` is a reasonably small number of characters (see above).\n","  - From what we see, we decide what to delete, replace, etc.\n","  - Substitutions using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are often useful for text cleanup (with or without specialized NLP libraries). For details, see the [re](https://docs.python.org/3/library/re.html) module in the standard Python library, specifically the `re.sub()` function."]},
{"cell_type":"code","metadata":{"id":"_JumLLSjX5U1"},"source":["# cutting the metadata at the beginning\n","\n","cleaner_text = sample_text.split('PART ONE')[1]\n","\n","# cutting the appendix after the main story\n","\n","cleaner_text = cleaner_text.split('APPENDIX')[0]\n","\n","# deleting the '\\r' characters\n","\n","cleaner_text = cleaner_text.replace('\\r','')\n","\n","# removing structural annotations using the re module\n","\n","import re\n","\n","cleaner_text = re.sub('PART [A-Z]+','',cleaner_text)\n","cleaner_text = re.sub('Chapter [0-9]+','',cleaner_text)\n","cleaner_text = re.sub('THE END','',cleaner_text)\n","\n","# cutting the whitespace around the text\n","\n","cleaner_text = cleaner_text.strip()\n","\n","# printing the beginning and ending\n","\n","cleaner_beginning = cleaner_text[:3010]\n","cleaner_ending = cleaner_text[-412:]\n","\n","print('***** Cleaner beginning of the 1984 novel *****\\n')\n","print(cleaner_beginning)\n","\n","print('\\n***** Cleaner ending of the 1984 novel *****\\n')\n","print(cleaner_ending)"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"Wuyr49GV3nId"},"source":["---\n","## 2. Tokenization, tagging and stemming (dummy pipeline)\n","\n","__Basic facts - tokenization__\n","- Splitting the sample text into lists of individual paragraphs, sentences and words.\n","\n","__Basic facts - tagging__\n","- Assignment of grammatical or other categories to individual parts of the text.\n","- One of the most common forms is the assignment of part-of-speech tags to individual words (POS tagging).\n","- However, more complex units can also be tagged, e.g., noun and verb phrases, modifier phrases, other, more abstract parts of syntactic trees, semantic tags, etc.\n","\n","__Basic facts - stemming (and lemmatisation)__\n","- The process of reducing inflected (or sometimes derived) words to their word stem, base or root form, or some other sort of canonical form.\n","  - Stemming reduces a word to a base form that is common to all its morphological variants.\n","  - Lemmatization is another (though closely related) form of normalisation that groups together the inflected forms of a word so they can be analysed as a single canonical item that still bears the meaning of the original word - a lemma, or dictionary form.\n","- An example in English: `wait, waits, waiting, ...` $\\rightarrow$ `wait` (stem), `wait` (lemma).\n","- An example in Czech: `čekat, čeká, čekající, ...` $\\rightarrow$ `ček` (stem), `čekat` (lemma).\n","- A quick contrast of the two in code follows below."]},
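{"cell_type":"markdown","metadata":{},"source":["A quick illustrative contrast of stemming and lemmatization, sketched with NLTK (the library itself is introduced in section 3; this assumes the `wordnet` and `omw-1.4` resources can be downloaded)."]},
{"cell_type":"code","metadata":{},"source":["# a minimal sketch contrasting stemming and lemmatization with NLTK\n","# (assumes the 'wordnet' and 'omw-1.4' resources can be downloaded;\n","# NLTK itself is introduced properly in section 3)\n","import nltk\n","nltk.download('wordnet', quiet=True)\n","nltk.download('omw-1.4', quiet=True)\n","\n","from nltk.stem import PorterStemmer, WordNetLemmatizer\n","\n","stemmer = PorterStemmer()\n","lemmatizer = WordNetLemmatizer()\n","\n","for word in ['wait', 'waits', 'waiting', 'waited']:\n","    print(word, '->', stemmer.stem(word), '(stem),',\n","          lemmatizer.lemmatize(word, pos='v'), '(lemma)')"],"execution_count":null,"outputs":[]},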
{"cell_type":"markdown","metadata":{"id":"efgjiY6Ls7tZ"},"source":["### Naive tokenization of the sample text\n","- Notes on the solution:\n","  - For paragraphs, one can take advantage of the fact that they are separated by double new lines. However, it is also good to deal with the individual lines within the paragraphs themselves, so that each paragraph becomes a uniform, unwrapped text.\n","  - For splitting the text into sentences and words, it is possible to use punctuation marks or spaces, respectively, together with a `split()` function (either from the standard Python library or from the `re` module). A note on the limitations of this approach follows after the code cell below."]},
{"cell_type":"code","metadata":{"id":"Mj_jb6his7t6"},"source":["# function to create a paragraph list from the input text\n","def paragraph_tokenizer(text):\n","    return [x.replace('\\n', ' ') for x in text.split('\\n\\n')]\n","\n","# function to create a sentence list from the input text\n","def sentence_tokenizer(text):\n","    return re.split('[.?!]', text)\n","\n","# function to create a word list from the input text\n","def word_tokenizer(text):\n","    return [x.strip('.,;!') for x in text.split()]\n","\n","# creating a list of paragraphs of the 1984 novel and printing the first 4\n","paragraphs = paragraph_tokenizer(cleaner_text)\n","print('The first 4 paragraphs of 1984:\\n', paragraphs[:4])\n","# creating a list of sentences of the 1984 novel and printing the first 2\n","sentences = []\n","for paragraph in paragraphs:\n","    sentences += sentence_tokenizer(paragraph)\n","print('\\nThe first 2 sentences of 1984:\\n', sentences[:2])\n","# creating a list of words of the 1984 novel and printing the first 8\n","words = []\n","for sentence in sentences:\n","    words += word_tokenizer(sentence)\n","print('\\nThe first 8 words of 1984:\\n', words[:8])"],"execution_count":null,"outputs":[]},
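{"cell_type":"markdown","metadata":{},"source":["A quick sanity check of the naive approach (added for illustration): splitting on every `.`, `?` and `!` breaks on abbreviations and decimal numbers, which is one of the motivations for the smarter NLTK pipeline in section 3."]},
{"cell_type":"code","metadata":{},"source":["# the naive splitter treats every '.', '?' and '!' as a sentence boundary,\n","# so abbreviations and decimal numbers break it - a known limitation\n","tricky = 'Mr. Smith paid 3.50 dollars. He left!'\n","print(sentence_tokenizer(tricky))\n","# -> ['Mr', ' Smith paid 3', '50 dollars', ' He left', '']"],"execution_count":null,"outputs":[]},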
{"cell_type":"markdown","metadata":{"id":"FPrDqOGTzPlP"},"source":["### Naive tagging and stemming of the sample text\n","- Tagging the first sentence of the 1984 novel with symbols of the relevant parts of speech (POS).\n","- Consequent stemming depending on the specific POS tags.\n","- Notes on the solution:\n","  - One can simply assign tags to individual words according to a hard-coded look-up dictionary (which is not very smart or scalable in practice, but it's OK for the purposes of this exercise).\n","  - One can assume that the tags for each relevant word type are as follows: `DT` for articles (\"a\", \"an\", \"the\"), `NN` for nouns or numerals, `VB` for verbs, `PR` for pronouns, `JJ` for adjectives, `IN` for prepositions, `CC` for conjunctions, and `??` for unknown words.\n","  - Stemming then simply strips (some of) the word suffixes based on the POS tags."]},
{"cell_type":"code","metadata":{"id":"n7PcdjCw0frC"},"source":["# naive POS tagger\n","\n","def dummy_pos_tagger(words):\n","    # list of tagged words\n","    tagged_words = []\n","    # look-up dictionary with tags assigned to the words of interest\n","    tag_dict = {\n","        'it' : 'PR',\n","        'was' : 'VB',\n","        'a' : 'DT',\n","        'bright' : 'JJ',\n","        'cold' : 'JJ',\n","        'day' : 'NN',\n","        'in' : 'IN',\n","        'april' : 'NN',\n","        'and' : 'CC',\n","        'the' : 'DT',\n","        'clocks' : 'NN',\n","        'were' : 'VB',\n","        'striking' : 'VB',\n","        'thirteen' : 'NN'\n","    }\n","    # the tagging itself\n","    for word in words:\n","        tag = '??' # tag for unknown words\n","        if word.lower() in tag_dict:\n","            tag = tag_dict[word.lower()]\n","        tagged_words.append((word,tag))\n","    return tagged_words\n","\n","# naive stemmer\n","\n","def dummy_stemmer(tagged_words):\n","    stemmed_words = []\n","    for word, tag in tagged_words:\n","        stemmed = word\n","        # note: str.rstrip() strips a set of characters, not a suffix,\n","        # so the suffixes are cut off by slicing instead\n","        if tag == 'NN' and word.endswith('s'):\n","            stemmed = word[:-1]\n","        elif tag == 'VB':\n","            if word.endswith('ed'):\n","                stemmed = word[:-2]\n","            elif word.endswith('ing'):\n","                stemmed = word[:-3]\n","        stemmed_words.append(stemmed)\n","    return stemmed_words\n","\n","# printing the words of the first sentence\n","words = word_tokenizer(sentences[0])\n","print('RAW WORDS:\\n'+'\\n'.join(words))\n","# printing the tagged words of the first sentence\n","tagged_words = dummy_pos_tagger(words)\n","print('\\nTAGGED WORDS:\\n'+'\\n'.join([str(x) for x in tagged_words]))\n","# printing the stemmed words of the first sentence\n","stemmed_words = dummy_stemmer(tagged_words)\n","print('\\nSTEMMED WORDS:\\n'+'\\n'.join([str(x) for x in stemmed_words]))"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"K3nUx95eVjuD"},"source":["---\n","\n","## 3. The NLTK NLP library, shallow syntactic analysis (smarter pipeline)\n","\n","__Basic facts__\n","- There are a number of mature tools for doing NLP.\n","- One of the most widely used libraries in Python is the Natural Language Toolkit - [NLTK](https://www.nltk.org/).\n","- NLTK contains easy-to-use implementations of a number of state-of-the-art techniques, algorithms, corpora, etc.\n","  - It can thus solve all the above tasks (tokenization, tagging, stemming), among other things, which you will experiment with yourselves (a minimal smoke test of the library follows below).\n","- Once these tasks are solved, one can focus on one of the \"holy grails\" of NLP - syntactic analysis (parsing) of sentences in natural language.\n","  - It is essential for determining the grammatical correctness of sentences, and also a prerequisite for various methods of determining the meaning of sentences in as broad a range of contexts as possible (semantic and discourse analysis - arguably the so far rather unattainable ultimate goals of NLP).\n","  - Parsing can be relatively [shallow](https://en.wikipedia.org/wiki/Shallow_parsing), or more complex, such as [dependency](https://en.wikipedia.org/wiki/Dependency_grammar) and/or [probabilistic](https://en.wikipedia.org/wiki/Statistical_parsing) parsing (for more details, see the lecture and/or specialized NLP courses).\n","  - However, in these labs we will only deal with the shallow one."]},
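{"cell_type":"markdown","metadata":{},"source":["A minimal smoke test of NLTK on a toy sentence (a sketch, not a solution to the exercise below; it assumes the `punkt` and `averaged_perceptron_tagger` resources can be downloaded - recent NLTK releases may name them `punkt_tab` and `averaged_perceptron_tagger_eng`)."]},
{"cell_type":"code","metadata":{},"source":["# a minimal smoke test of NLTK - not a solution to Exercise 3.1\n","# (assumes the 'punkt' and 'averaged_perceptron_tagger' resources are\n","# available; newer NLTK releases may use 'punkt_tab' and\n","# 'averaged_perceptron_tagger_eng' instead)\n","import nltk\n","nltk.download('punkt', quiet=True)\n","nltk.download('averaged_perceptron_tagger', quiet=True)\n","\n","toy = 'The clocks were striking thirteen.'\n","toy_words = nltk.word_tokenize(toy)\n","print(nltk.pos_tag(toy_words))"],"execution_count":null,"outputs":[]},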
{"cell_type":"markdown","metadata":{"id":"8RL0A3bXy8Co"},"source":["### __Exercise 3.1: Tokenization, tagging and stemming using NLTK__\n","- Implement a simple NLP pipeline solving the tasks presented in the previous sections using the [NLTK](https://www.nltk.org/) library (which will lead to a much more systematic and robust solution).\n","- More specifically, check the NLTK docs and other relevant online materials to find out how to do the following:\n","  - Tokenize the cleaned text of the 1984 novel to sentences.\n","  - Tokenize the first sentence of the novel to words.\n","  - POS-tag the first sentence of the novel.\n","  - Stem the words in the first sentence of the novel."]},
{"cell_type":"code","metadata":{"id":"XW6xI6to4_gI"},"source":["# importing NLTK\n","import nltk\n","\n","# TODO - YOUR OWN SOLUTION GOES HERE"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"LOlRu54okuMS"},"source":["### __Exercise 3.2: Shallow parsing using NLTK__\n","- Use the list of tagged words from the previous example to create a shallow syntactic tree for the first sentence of the 1984 novel.\n","- In this tree, deal primarily with simple noun and verb phrases, but you may also deal with other, more complex syntactic structures if you wish.\n","- Notes on the solution:\n","  - NLTK can do most of the work for you (search for the corresponding docs or StackOverflow questions).\n","  - For example, the class `RegexpParser` can come in handy (as included in the pre-filled code cell below). It only needs a grammar, which defines individual syntactic groups (noun phrases, verb phrases, etc.) based on the sequence of POS tags of the respective words contained in them.\n","  - A grammar is simply a string defining rules for regular expression analysis (see the pre-filled code below, and the toy example after it).\n","  - E.g., a sequence of two or more adjectives would match the following rule: `JJ2: {<JJ><JJ>+}`.\n","  - You can assume that noun phrases typically consist of nouns (`<NN>`) and their modification by adjectives (`<JJ>`) with the possible use of articles (`<DT>`), while verb phrases consist of some unbroken sequence of verb forms (`<VB>`)."]},
{"cell_type":"code","metadata":{"id":"k2LaEg3Co89p"},"source":["# TODO - complete the grammar for shallow parsing\n","chunk_grammar = \"\"\"\n","    NP: {TODO}\n","    VP: {TODO}\n","\"\"\"\n","\n","# the actual syntactic analysis\n","chunk_parser = nltk.RegexpParser(chunk_grammar)\n","chunked = chunk_parser.parse(tagged_words)\n","\n","# printing the syntactic tree\n","print('A shallow parse tree of the first sentence of 1984:')\n","print(chunked)"],"execution_count":null,"outputs":[]},
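{"cell_type":"markdown","metadata":{},"source":["For orientation, a toy `RegexpParser` run on a tiny hand-tagged sentence (a sketch, not the exercise solution - the single NP rule is just one possibility consistent with the hints above)."]},
{"cell_type":"code","metadata":{},"source":["# a toy RegexpParser run on a hand-tagged sentence - not the exercise solution\n","# (the single NP rule below is just one possibility consistent with the hints)\n","import nltk\n","\n","toy_grammar = 'NP: {<DT>?<JJ>*<NN>}'\n","toy_parser = nltk.RegexpParser(toy_grammar)\n","toy_tagged = [('a', 'DT'), ('bright', 'JJ'), ('day', 'NN')]\n","print(toy_parser.parse(toy_tagged))"],"execution_count":null,"outputs":[]},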
{"cell_type":"markdown","metadata":{"id":"BGdY7oEfWQaf"},"source":["---\n","\n","## 4. Sentiment analysis\n","\n","__Basic facts__\n","- Analysis of the emotional polarity of texts, or possibly of multimodal data (reviews, news, political debates, social networks, etc.).\n","- A technique with a number of practical applications (product development, marketing, market analysis and forecasting, public opinion mining, disease outbreak detection and management, crime detection, etc.).\n","\n","__Your task__\n","- Split into groups (min 2, max 4 people).\n","- Check the NLTK documentation, StackOverflow, etc., to find out which NLTK module(s) can be used for sentiment analysis and how they work. Feel free to use other sentiment analysis tools, though, should you find anything else that feels more appropriate and/or convenient. (One possible starting point is sketched after the code cell below.)\n","- Choose the approach that works best for you and try to answer the following questions:\n","  - What is the overall tone of the 1984 novel? Is it a glorious utopia, or rather a gloomy dystopia?\n","  - What is the most cheerful sentence (and optionally also paragraph) of the novel?\n","  - What is the most desperate sentence (and optionally also paragraph) of the novel?\n","- Take your time, play around (no need to have a complete solution - what's important is to search for possible approaches, play and learn). Feel free to ask for help anytime.\n","- Discuss your analysis of the 1984 novel and your general observations with the lab tutor (and, optionally, the rest of the class)! The collaborating members of the group with the best relative results and/or an interesting/elegant/efficient/unusual solution can earn bonus points.\n","\n","__Optional task__\n","- If you solve the primary task quickly and get bored (which can happen if you come across an easy-to-use off-the-shelf tool), you can think about solutions to the following problems:\n","  - Many off-the-shelf techniques for sentiment analysis work quite well for individual words, but for more complex phrases, sentences, and larger text units in general, they may not be that great (unless you use, for instance, a truly sophisticated, fine-tuned model based on a state-of-the-art language model).\n","  - The main causes of these problems include, for example, non-trivially negated expressions reversing the sentiment of individual words, irony, metaphors or quotations, and in general the non-trivial nature of semantic and even discourse analysis, which are prerequisites for truly accurate sentiment analysis.\n","  - What clever tricks (or even standard NLP techniques) could contribute to at least a partial solution to these problems?"]},
{"cell_type":"code","metadata":{"id":"zJCfQf-fp3dv"},"source":["# TODO - analysis of the sentiment of the novel 1984 with the help of NLTK\n","# (totally up to you)"],"execution_count":null,"outputs":[]},
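{"cell_type":"markdown","metadata":{},"source":["One possible starting point, sketched with NLTK's VADER analyser (a sketch assuming the `vader_lexicon` resource can be downloaded; note that VADER is tuned for short social-media texts, so its scores on literary prose are only indicative)."]},
{"cell_type":"code","metadata":{},"source":["# a sketch of sentence-level sentiment scoring with NLTK's VADER analyser\n","# (assumes the 'vader_lexicon' resource can be downloaded; VADER is tuned\n","# for short social-media texts, so scores on literary prose are indicative)\n","import nltk\n","nltk.download('vader_lexicon', quiet=True)\n","from nltk.sentiment import SentimentIntensityAnalyzer\n","\n","sia = SentimentIntensityAnalyzer()\n","\n","# score the sentences from the naive tokenization above by the compound score\n","scored = [(sia.polarity_scores(s)['compound'], s) for s in sentences if s.strip()]\n","scored.sort()\n","\n","print('Most desperate sentence candidate:', scored[0])\n","print('Most cheerful sentence candidate:', scored[-1])"],"execution_count":null,"outputs":[]},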
{"cell_type":"markdown","source":["---\n","\n","#### _Final note_ - the materials used in this notebook are adapted from works licensed by the original authors as follows:\n","- Picture of the first edition of the novel 1984:\n","  - Retrieved from Wikipedia [here](https://en.wikipedia.org/wiki/File:1984first.jpg)\n","  - Author: [Brown University Library](http://library.brown.edu/search/c?SEARCH=PR6029.R8+N49+1949b)\n","  - License: none, or rather [Public Domain](https://en.wikipedia.org/wiki/public_domain) in the US, but probably still copyrighted in the country of origin (UK)\n","- The novel itself:\n","  - Retrieved from the Australian [Project Gutenberg](https://www.gutenberg.org/) site [here](http://gutenberg.net.au/ebooks01/0100021.txt)\n","  - Author: [George Orwell](https://en.wikipedia.org/wiki/George_Orwell)\n","  - License: [Public Domain](https://en.wikipedia.org/wiki/public_domain) in Australia and possibly in other jurisdictions, but in general, the copyright is held by the George Orwell estate and the text should be treated accordingly in terms of its public or any other use"],"metadata":{"id":"tOcmk4AaRaN8"}}]}