{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "LAB-03_document-clustering-classical.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Z5anb3UHd6pv"
      },
      "source": [
        "# PA164 - Lab 3: Playing with document clustering (classical)\n",
        "\n",
        "__Outline:__\n",
        "1. New preprocessing pipeline - diggin into Shakespeare\n",
        "2. Creating vector representations of Shakespeare's works\n",
        "3. Clustering the works"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MjLOiihWnmC1"
      },
      "source": [
        "---\n",
        "\n",
        "## 1. New preprocessing pipeline - digging into Shakespeare\n",
        "\n",
        "<img src=\"https://www.fi.muni.cz/~novacek/courses/pa164/labs/img/Shakespeare.jpg\" alt=\"will\" width=\"400px\" title=\"Shakespeare's portrait, retrieved from Wikipedia. Author: John Taylor. License: Public Domain\"/>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WMpnWpX6SV5J"
      },
      "source": [
        "### Downloading and cleaning the Shakespeare's works"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xkjyNb3C0Xhr"
      },
      "source": [
        "import urllib.request # import library for opening URLs, etc.\n",
        "\n",
        "# open a link to sample text\n",
        "\n",
        "sample_text_link = \"https://www.gutenberg.org/files/100/100-0.txt\"\n",
        "f = urllib.request.urlopen(sample_text_link)\n",
        "\n",
        "# decoding the content of the link (just convert the binary string to text - \n",
        "# it is already in a relatively clean plain text format)\n",
        "\n",
        "sample_text = f.read().decode(\"utf-8\")\n",
        "\n",
        "# cutting the metadata in the beginning\n",
        "\n",
        "cleaner_text = sample_text.split('      Contents')[1]\n",
        "\n",
        "# cutting the appendix after the main story\n",
        "\n",
        "cleaner_text = cleaner_text.split('*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***')[0]\n",
        "\n",
        "# deleting the '\\r' characters\n",
        "\n",
        "cleaner_text = cleaner_text.replace('\\r','')"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LFFrvDzP_2DF"
      },
      "source": [
        "### Getting the separate texts of Shakespeare's works"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PpBW0Oo-iSxS"
      },
      "source": [
        "# getting the list of titles of Shakespeare's work from the table of contents\n",
        "\n",
        "# to split at the TOC from the bottom\n",
        "splitter_bot = \"\"\"THE SONNETS\n",
        "\n",
        "                    1\"\"\"\n",
        "\n",
        "# to split at the TOC from the top\n",
        "splitter_top = \"\"\"VENUS AND ADONIS\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\"\"\"\n",
        "\n",
        "# list of titles from the TOC\n",
        "titles = [x.strip() for x in cleaner_text.split(splitter_bot)[0].split('\\n\\n')\\\n",
        "          if len(x.strip())]\n",
        "\n",
        "# the rest of the text after TOC\n",
        "body = cleaner_text.split(splitter_top)[-1]\n",
        "\n",
        "# printing out the list of works\n",
        "\n",
        "print(len(titles), \"Shakespeare's works:\", titles)\n",
        "\n",
        "# populating a mapping from works' titles to their texts - the KEY VARIABLE!\n",
        "\n",
        "works = {}\n",
        "\n",
        "for i in range(len(titles)):\n",
        "  # base text - from the current title till the end of the all-in-one file\n",
        "  text_down = titles[i] + '\\n\\n' + body.split(titles[i])[-1].strip()\n",
        "  if i == len(titles) - 1: # the last text in the all-in-one file\n",
        "    works[titles[i]] = text_down\n",
        "  else:                    # other texts, enclosed between consecutive titles\n",
        "    works[titles[i]] = text_down.split(titles[i+1])[0]\n",
        "\n",
        "# printing out opening and ending samples of three selected works\n",
        "\n",
        "print('*********** SONNETS opening sample:')\n",
        "print(works['THE SONNETS'][:1000])\n",
        "print('\\n\\n*********** SONNETS ending sample:')\n",
        "print(works['THE SONNETS'][-1000:])\n",
        "print('\\n--------------------------------------------\\n')\n",
        "print('*********** AS YOU LIKE IT opening sample:')\n",
        "print(works['AS YOU LIKE IT'][:1000])\n",
        "print('\\n\\n*********** AS YOU LIKE IT ending sample:')\n",
        "print(works['AS YOU LIKE IT'][-1000:])\n",
        "print('\\n--------------------------------------------\\n')\n",
        "print('*********** VENUS AND ADONIS opening sample:')\n",
        "print(works['VENUS AND ADONIS'][:1000])\n",
        "print('\\n\\n*********** VENUS AND ADONIS ending sample:')\n",
        "print(works['VENUS AND ADONIS'][-1000:])\n",
        "print('\\n--------------------------------------------\\n')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Af31cJWSNL2L"
      },
      "source": [
        "---\n",
        "\n",
        "## 2. Creating vector representations of Shakespeare's works\n",
        "- Create a vector space model of Shakespeare's works using [scikit-learn](https://scikit-learn.org/)\n",
        " - Apply a stop-list to filter too common and/or noisy words (you can use either a generic, for instance via [NLTK](https://pythonspot.com/nltk-stop-words/), or one that is specific to the stats of the corpus in question)\n",
        " - Use TF-IDF normalisation with uni- and bi-gram tokens, similarly to the way we did it with the 1984 paragraphs\n",
        "- Apply LSA to get a dense representation of the model (play with a few alternatives of top-k latent factors in the result)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-hT4qLXu3CPj"
      },
      "source": [
        "---\n",
        "\n",
        "## 3. Clustering the works\n",
        "- Use the dense vector space representation of the Shakespeare's works to cluster them using [scikit-learn](https://scikit-learn.org/)\n",
        " - A useful guide on what method(s) might be good to apply is [here](https://scikit-learn.org/stable/modules/clustering.html)\n",
        " - [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) is probably a good run-of-the-mill starting point\n",
        " - However, try to come up with another more appropriate and/or wilder method (think about what kind of algorithm selection criteria can be derived from the specifics of this \"Shakespear\" use case in terms of data points, expected numbers of clusters, their relative sizes, the geometry of the space, etc.)\n",
        "- Finally, have a look at the clusters you found with the different methods, pretend you are a literary scholar and see whether your discoveries are consistent with the standard lore on the master playwright\n"
      ]
    }
  ]
}