# Text Analysis
This notebook is a first step in the quantitative analysis of text data. We take a Czech text, split it into tokens, perform a frequency analysis, and observe the nature of the data.

## Install necessary packages
In this notebook, we use NLTK (Natural Language ToolKit) for tokenization of input text, and Pandas, a package for easy handling of tabular data.

In [None]:
# do not run in G13, all packages are already installed
!pip3 install --user nltk
!pip3 install --user pandas
!pip3 install --user matplotlib
!pip3 install --user numpy

In [None]:
import pandas as pd
import nltk
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from collections import Counter
import numpy as np

## Get the data
You will probably have to change the filename here.

In [None]:
# Czech text contains non-ASCII characters; this assumes the file is UTF-8 encoded
with open('../01-DH/maj.txt', encoding='utf-8') as f:  # modify the path if needed
    text = f.read()

In [None]:
# Count how many times each token occurs
tokens = Counter()
for token in word_tokenize(text):
    if token:
        tokens[token] += 1
tokens
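
To see what the counting loop produces, here is a minimal sketch on a toy sentence. The sentence is made up for illustration, and plain whitespace splitting stands in for `word_tokenize`, so the result only approximates what NLTK would return.

```python
from collections import Counter

# Toy input, made up for illustration; str.split() stands in for word_tokenize
toy_text = "to be or not to be"
toy_counts = Counter(toy_text.split())

# most_common(n) returns (token, frequency) pairs, highest frequency first
print(toy_counts.most_common(2))
```

A `Counter` behaves like a dictionary, so `toy_counts["to"]` gives the frequency of a single token.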

## Create DataFrame
A pandas DataFrame is a table-like data object with named columns that is easy to filter, sort, and aggregate. Let's experiment with it.

In [None]:
# One row per distinct token, together with its frequency
df = pd.DataFrame({"token": list(tokens.keys()), "freq": list(tokens.values())})
df.head()

### DataFrame Info
**TASK 1**: How many different tokens are in the text? This number is the *vocabulary size*.

In [None]:
df.info()

In [None]:
df.sort_values(by='token', ascending=True).head()

**TASK 2**: How many *hapax legomena* (tokens that occur exactly once) do we have in the data?

### Pandas Series
A pandas Series is a one-dimensional labeled array. A single row or a single column of a DataFrame is a Series.
Let's look at a single row, a single column, and a single cell.



In [None]:
df.loc[0]

In [None]:
df['freq']

In [None]:
df['token'][0]

#### Tokens with a certain frequency

In [None]:
df.loc[df.freq==10]
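
The filter above works in two steps: the comparison produces a boolean Series, and `.loc` keeps the rows where it is `True`. A minimal sketch on a toy table (the values are made up for illustration):

```python
import pandas as pd

# Toy frequency table, made up for illustration
toy = pd.DataFrame({"token": ["a", "b", "c"], "freq": [10, 3, 10]})

mask = toy["freq"] == 10   # boolean Series, one True/False per row
selected = toy.loc[mask]   # DataFrame with only the rows where mask is True

print(mask.tolist())
print(selected["token"].tolist())
```

Note that the result of such a filter is again a DataFrame, not a Series.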

### Processing of the Text
So far, we have only performed **tokenization** in order to observe single words. Tokenization is fairly simple for languages that separate words with spaces (unlike, for example, Chinese, Japanese, or Thai). However, there are decisions to be made, and some of them are language dependent:
 - "can't" -> "can", "not" or "can", "'", "t"
 - "won't" -> "will", "not" or "won", "'", "t"
 - "cannot" -> "can", "not" or "cannot"
 - "přišels" -> "přišel", "jsi" or "přišels"
 - "P. D. Jamesová" -> "P.", "D.", "Jamesová" or "P", ".", "D", ".", "Jamesová"
 - "16/10/2019" -> "16", "/", "10", "/", "2019" or "16/10/2019" or "16/", "10/", "2019"
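
Some of these choices can be reproduced with two deliberately simple tokenizers. This is only a sketch using the standard library; both functions are hypothetical helpers written for this illustration and neither is what `word_tokenize` does internally.

```python
import re

def split_on_whitespace(s):
    # Keeps punctuation attached to the neighbouring word
    return s.split()

def split_off_punctuation(s):
    # Runs of word characters, or any single character that is
    # neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", s)

for sample in ["can't", "P. D. Jamesová", "16/10/2019"]:
    print(split_on_whitespace(sample), split_off_punctuation(sample))
```

Neither variant is right or wrong in general; which one is appropriate depends on the language and on what the tokens are used for afterwards.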

### Tagging
We could analyze the text further if we had more information, for example about the parts of speech (POS) that occur in it. Note that the tagging task (assigning one POS to each word) is language dependent and sometimes very difficult, e.g.:
- "hope" - verb or noun
- "loving" - noun, adjective, verb
- "stát" - verb or noun
- "svíčková" - noun or adjective

### Use of remote services
POS-tagging is a common NLP task provided by many services. To annotate your own text, you can either upload it somewhere and download the result, or let a computer program do the work via an Application Programming Interface (API). The task of an API is similar to that of a waiter: the waiter takes your order to the kitchen and brings back the meal, so you do not need to know how the kitchen works.

By analogy, we let our computer program send the request "I need this tokenized text to be POS-tagged" and present the result.

As an example API, we will use the Language Services at NLPC FI MUNI: https://nlp.fi.muni.cz/languageservices/. We will use the Python library `requests`. The requests and responses exchanged between programs are written in `JSON`.
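
As a rough illustration of how JSON maps to Python objects, the sketch below serializes a request dictionary and parses a response without contacting any server. The response body is hypothetical: it only imitates the shape this notebook expects (a `vertical` list of word, lemma, tag triples), and the tags are abbreviated placeholders.

```python
import json

# Request parameters as a Python dict, serialized to a JSON string
request = {"call": "tagger", "lang": "cs", "output": "json", "text": "Byl večer"}
request_body = json.dumps(request, ensure_ascii=False)
print(request_body)

# Hypothetical response body, made up for illustration;
# the real service output may differ in detail
response_body = '{"vertical": [["Byl", "být", "k5"], ["večer", "večer", "k1"]]}'
response = json.loads(response_body)

# JSON arrays become Python lists, JSON objects become dicts
for word, lemma, tag in response["vertical"]:
    print(word, lemma, tag)
```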

In [None]:
!pip3 install --user requests

In [None]:
import requests
import json

In [None]:
data = {
    "call": "tagger",
    "lang": "cs",
    "output": "json",
    "text": text.replace(';', ','),
}
uri = "https://nlp.fi.muni.cz/languageservices/service.py"
r = requests.post(uri, params=data)
r

In [None]:
data = r.json()
data

In [None]:
# Keep only complete (word, lemma, tag) triples from the response
rows = [token for token in data['vertical'] if len(token) == 3]
df2 = pd.DataFrame(rows, columns=["word", "lemma", "tag"])
df2

In [None]:
# The first two characters of each tag encode the part of speech
df2["pos"] = df2["tag"].str[:2]
df2

### List numerals appearing in text

In [None]:
df2[df2["pos"]=="k4"]

**TASK 3**: List the prepositions and store them in the variable `prep`.

### Count preposition frequencies
If you stored the prepositions in `prep`, you can now see how often each of them occurs in the text.

In [None]:
x = prep.groupby(by="lemma").count()['word']
x

### Count POS frequencies

In [None]:
df2.groupby(by="pos").count()['word']
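
`groupby(...).count()` counts the non-missing values in every column, which is why a single column (`word`) is selected afterwards. The sketch below compares this with two equivalent pandas idioms on a toy table; the words and tags are made up for illustration.

```python
import pandas as pd

# Toy tagged data, made up for illustration
toy = pd.DataFrame({
    "word": ["dva", "tři", "pes"],
    "pos":  ["k4", "k4", "k1"],
})

by_count = toy.groupby(by="pos").count()["word"]  # same pattern as above
by_size = toy.groupby(by="pos").size()            # number of rows per group
by_value_counts = toy["pos"].value_counts()       # sorted by frequency, descending

print(by_count.to_dict())
print(by_size.to_dict())
print(by_value_counts.to_dict())
```

The three results agree as long as the `word` column has no missing values; `value_counts()` additionally sorts the groups by frequency.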

### Data Visualization
Play with data visualization. Display frequencies of some aspects of the data.

In [None]:
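# Bar chart of token frequencies, most frequent token first
ax = df.sort_values(by='freq', ascending=False).plot(kind='bar')
# There is one bar per token, so the x-axis labels would be unreadable: hide them
ax.get_xaxis().set_visible(False)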