{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Analysis\n", "The aim of the notebook is to begin with quantitative analysis of text data. We select a Czech text, split it into tokens, perform frequency analysis, and observe the nature of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install necessary packages\n", "In this notebook, we use NLTK (Natural Language ToolKit) for tokenization of input text, and Pandas, a package for easy handling of tabular data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# do not run in G13, all packages are already installed\n", "!pip3 install --user nltk\n", "!pip3 install --user pandas\n", "!pip3 install --user matplotlib\n", "!pip3 install --user numpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import nltk\n", "nltk.download('punkt')\n", "from nltk.tokenize import word_tokenize\n", "from collections import Counter\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data\n", "Here, you have to probably change the filename." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = None\n", "with open('../01-DH/maj.txt') as f: # modify the path if needed\n", " text = f.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = Counter()\n", "for token in word_tokenize(text):\n", " if token:\n", " tokens[token] += 1\n", "tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create DataFrame\n", "Pandas DataFrame is a data object, easy to handle. Let's experiment with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame.from_dict({\"token\": [k for k,v in dict(tokens).items()], \"freq\": [v for k,v in dict(tokens).items()]})\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame Info\n", "**TASK 1**: How many different tokens are in the text? This number is the *vocabulary size*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sort_values(by='token', ascending=True).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TASK 2**: How many *hapax legomena* do we have in the data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pandas Series\n", "Pandas Series is a slice of DataFrame. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Tagging\n", "Clearly, we could carry out further analysis if we had more information, for example about which parts of speech (POS) appear in the text. Note that the tagging task (assigning one POS to each word) is language dependent and sometimes very difficult, e.g.:\n", "- \"hope\" - verb or noun\n", "- \"loving\" - noun, adjective, or verb\n", "- \"stát\" - verb or noun\n", "- \"svíčková\" - noun or adjective" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use of remote services\n", "POS-tagging is a common NLP task provided by many services. To annotate your own text, you can either upload it somewhere and download the result manually, or you can let computer programs do the work via Application Programming Interfaces (APIs). The role of an API is similar to that of a waiter: it carries your order (the request) to the kitchen (the server) and brings the dish (the response) back to you.\n", "\n", "Analogously, we let our computer program send the request \"I need this tokenized text to be POS-tagged\" and let it present the result.\n", "\n", "As an example API, we will use the Language Services at NLPC FI MUNI: https://nlp.fi.muni.cz/languageservices/. We will use the Python library `requests`. Requests and responses between the programs are written in `JSON` notation (see the illustration below)." ] },
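{ "cell_type": "markdown", "metadata": {}, "source": [ "`JSON` is a plain-text notation for structured data. The next cell is a small self-contained illustration (a toy example of ours, independent of the service): a Python dictionary is serialized to a JSON string and parsed back." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# a toy example: serialize a Python dict to a JSON string and parse it back\n", "request_example = {\"call\": \"tagger\", \"lang\": \"cs\", \"output\": \"json\"}\n", "encoded = json.dumps(request_example)\n", "print(encoded)\n", "print(json.loads(encoded) == request_example)" ] },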
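{ "cell_type": "markdown", "metadata": {}, "source": [ "Before calling the remote Czech tagger, the next cell shows, for comparison, NLTK's built-in English POS tagger on a classic ambiguous sentence. This is a sketch of ours, not part of the original exercise; it assumes the `averaged_perceptron_tagger` model, which it downloads first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a minimal sketch using NLTK's built-in English tagger, not the Czech service below\n", "nltk.download('averaged_perceptron_tagger')\n", "# 'refuse' and 'permit' each occur once as a verb and once as a noun\n", "nltk.pos_tag(word_tokenize(\"They refuse to permit us to obtain the refuse permit\"))" ] },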
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip3 install --user requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import json" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# build the request for the remote tagger; semicolons in the text are\n", "# replaced with commas before sending\n", "data = {\"call\": \"tagger\", \n", "        \"lang\": \"cs\",\n", "        \"output\": \"json\",\n", "        \"text\": text.replace(';', ',')\n", "       }\n", "uri = \"https://nlp.fi.muni.cz/languageservices/service.py\"\n", "r = requests.post(uri, params=data)\n", "r  # a response like <Response [200]> means the request succeeded" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parse the JSON response (note: this overwrites the request dictionary above)\n", "data = r.json()\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# keep only the (word, lemma, tag) triples from the vertical output\n", "tokens = [token for token in data['vertical'] if len(token)==3]\n", "df2 = pd.DataFrame.from_dict({\"word\": [word for word, lemma, tag in tokens], \n", "                              \"lemma\": [lemma for word, lemma, tag in tokens], \n", "                              \"tag\": [tag for word, lemma, tag in tokens]\n", "                             })\n", "df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the first two characters of the tag (e.g. 'k4') encode the part of speech\n", "pos = [tag[0:2] for tag in df2[\"tag\"]]\n", "df2[\"pos\"] = pos\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List numerals appearing in the text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2[df2[\"pos\"]==\"k4\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TASK 3**: List the prepositions and store them in the variable `prep`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count preposition frequencies\n", "If you stored the prepositions in `prep`, you can display their frequencies in the text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = prep.groupby(by=\"lemma\").count()['word']\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count POS frequencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.groupby(by=\"pos\").count()['word']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Visualization\n", "Play with data visualization. Display frequencies of some aspects of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# token frequencies sorted in descending order; note the long-tailed distribution\n", "ax = df.sort_values(by='freq', ascending=False).plot(kind='bar')\n", "ax.get_xaxis().set_visible(False)  # hide the unreadable token labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }