# PA164 - Lab 3: Playing with document clustering (classical)

__Outline:__
1. New preprocessing pipeline - digging into Shakespeare
2. Creating vector representations of Shakespeare's works
3. Clustering the works

---

## 1. New preprocessing pipeline - digging into Shakespeare

<img src="https://www.fi.muni.cz/~novacek/courses/pa164/labs/img/Shakespeare.jpg" alt="will" width="400px" title="Shakespeare's portrait, retrieved from Wikipedia. Author: John Taylor. License: Public Domain"/>

### Downloading and cleaning Shakespeare's works

In [2]:
import urllib.request # import library for opening URLs, etc.

# open a link to sample text

sample_text_link = "https://www.gutenberg.org/files/100/100-0.txt"
with urllib.request.urlopen(sample_text_link) as f:
  # decoding the content of the link (just convert the binary string to text;
  # it is already in a relatively clean plain text format)
  sample_text = f.read().decode("utf-8")

# cutting the metadata at the beginning (split only at the first occurrence,
# so that everything after the 'Contents' heading is kept)

cleaner_text = sample_text.split('      Contents', 1)[1]

# cutting the appendix after the main story

cleaner_text = cleaner_text.split('*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***')[0]

# deleting the '\r' characters

cleaner_text = cleaner_text.replace('\r','')

### Getting the separate texts of Shakespeare's works

In [None]:
# getting the list of titles of Shakespeare's work from the table of contents

# to split at the TOC from the bottom
splitter_bot = """THE SONNETS

                    1"""

# to split at the TOC from the top
splitter_top = """VENUS AND ADONIS






"""

# list of titles from the TOC
titles = [x.strip() for x in cleaner_text.split(splitter_bot)[0].split('\n\n')
          if x.strip()]

# the rest of the text after TOC
body = cleaner_text.split(splitter_top)[-1]

# printing out the list of works

print(len(titles), "Shakespeare's works:", titles)

# populating a mapping from works' titles to their texts - the KEY VARIABLE!

works = {}

for i in range(len(titles)):
  # base text - from the current title till the end of the all-in-one file
  text_down = titles[i] + '\n\n' + body.split(titles[i])[-1].strip()
  if i == len(titles) - 1: # the last text in the all-in-one file
    works[titles[i]] = text_down
  else:                    # other texts, enclosed between consecutive titles
    works[titles[i]] = text_down.split(titles[i+1])[0]

# printing out opening and ending samples of three selected works

print('*********** SONNETS opening sample:')
print(works['THE SONNETS'][:1000])
print('\n\n*********** SONNETS ending sample:')
print(works['THE SONNETS'][-1000:])
print('\n--------------------------------------------\n')
print('*********** AS YOU LIKE IT opening sample:')
print(works['AS YOU LIKE IT'][:1000])
print('\n\n*********** AS YOU LIKE IT ending sample:')
print(works['AS YOU LIKE IT'][-1000:])
print('\n--------------------------------------------\n')
print('*********** VENUS AND ADONIS opening sample:')
print(works['VENUS AND ADONIS'][:1000])
print('\n\n*********** VENUS AND ADONIS ending sample:')
print(works['VENUS AND ADONIS'][-1000:])
print('\n--------------------------------------------\n')

---

## 2. Creating vector representations of Shakespeare's works
- Create a vector space model of Shakespeare's works using [scikit-learn](https://scikit-learn.org/)
  - Apply a stop-list to filter out words that are too common and/or noisy (you can use either a generic one, for instance via [NLTK](https://pythonspot.com/nltk-stop-words/), or one derived from the statistics of this particular corpus)
  - Use TF-IDF weighting with uni- and bi-gram tokens, similarly to the way we did it with the 1984 paragraphs
- Apply LSA to get a dense representation of the model (experiment with a few different numbers k of top latent factors)
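
One possible way to wire these steps together is sketched below. It is only a starting point under several assumptions: `toy_works` is a made-up stand-in for the `works` mapping from section 1 (so the cell runs without the download), `build_lsa_vectors` is a hypothetical helper name, the stop-list is scikit-learn's built-in English list rather than NLTK's, and the extra archaic stop words are merely examples of what a corpus-specific list might contain. LSA is approximated here with `TruncatedSVD` followed by L2 normalisation of the document vectors.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# ASSUMPTION: hypothetical toy texts standing in for the `works` mapping,
# so that the sketch runs without downloading anything
toy_works = {
    "TOY COMEDY": "the fool laughs and the lovers marry in the forest",
    "TOY TRAGEDY": "the king dies and the prince takes bloody revenge",
    "TOY HISTORY": "the king fights a war and the crown changes hands",
    "TOY POEM": "sweet love and beauty fade as the summer roses fade",
}


def build_lsa_vectors(texts, n_components, extra_stop_words=()):
    """Return (fitted vectorizer, fitted SVD, dense document vectors)."""
    # generic English stop-list, optionally extended with corpus-specific words
    stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))
    # TF-IDF over uni- and bi-grams
    vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(texts)
    # LSA: truncated SVD of the TF-IDF matrix, then L2-normalised rows
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    lsa = make_pipeline(svd, Normalizer(copy=False))
    dense = lsa.fit_transform(tfidf)
    return vectorizer, svd, dense


texts = list(toy_works.values())

# trying a few alternatives for the number of latent factors
for k in (2, 3):
    vectorizer, svd, dense = build_lsa_vectors(
        texts, n_components=k, extra_stop_words={"thou", "thee", "thy"}
    )
    explained = float(svd.explained_variance_ratio_.sum())
    print(k, dense.shape, round(explained, 3))
```

With the real data, `texts` would be `list(works.values())`, and k can be considerably larger than in this toy run, as long as it stays below the number of works.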

---

## 3. Clustering the works
- Use the dense vector space representation of Shakespeare's works to cluster them using [scikit-learn](https://scikit-learn.org/)
  - A useful guide to which method(s) might be suitable is [here](https://scikit-learn.org/stable/modules/clustering.html)
  - [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) is probably a good run-of-the-mill starting point
  - However, try to come up with another, more appropriate and/or wilder method (think about which algorithm selection criteria follow from the specifics of this "Shakespeare" use case: the number of data points, the expected number of clusters, their relative sizes, the geometry of the space, etc.)
- Finally, have a look at the clusters you found with the different methods. Pretend you are a literary scholar and see whether your discoveries are consistent with the standard lore on the master playwright
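
As a hedged illustration of the comparison suggested above, the sketch below runs K-means next to one alternative, agglomerative clustering with Ward linkage, and groups the titles by cluster label. Everything named `toy_*` is made-up stand-in data, and `cluster_works` and `group_titles` are hypothetical helper names. Ward linkage is just one candidate that seems reasonable for a few dozen data points; it is not necessarily the best fit for this corpus.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

# ASSUMPTION: hypothetical titles and 2-D vectors standing in for the real
# titles and the LSA representation of the works
toy_titles = ["PLAY A", "PLAY B", "PLAY C", "POEM D", "POEM E", "POEM F"]
toy_vectors = normalize(np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],   # one tight group
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],   # another tight group
]))


def cluster_works(vectors, n_clusters):
    """Cluster the vectors with K-means and with Ward-linkage agglomeration."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans_labels = kmeans.fit_predict(vectors)
    ward = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    ward_labels = ward.fit_predict(vectors)
    return kmeans_labels, ward_labels


def group_titles(titles, labels):
    """Map each cluster label to the list of titles assigned to it."""
    groups = {}
    for title, label in zip(titles, labels):
        groups.setdefault(int(label), []).append(title)
    return groups


kmeans_labels, ward_labels = cluster_works(toy_vectors, n_clusters=2)
print("K-means:", group_titles(toy_titles, kmeans_labels))
print("Ward:   ", group_titles(toy_titles, ward_labels))

# agreement between the two partitions (1.0 means identical up to relabelling)
print("ARI:", adjusted_rand_score(kmeans_labels, ward_labels))
```

On L2-normalised vectors, squared Euclidean distance equals 2(1 - cosine similarity), so both methods here effectively group documents by cosine similarity. Hierarchical methods also need no random initialisation, which can make the resulting clusters easier to compare across runs when reading them as a literary scholar would.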
