Similarity Search with Gensim

Gensim – Topic modelling for humans

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = may,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}

Prerequisites

In an IS folder named similarity-search-with-gensim, you will find the following files. Please download all these files before you continue:

  1. The file named gensim.ipynb is a Jupyter Notebook, which we are going to use to interface with the Gensim library.
  2. The file named wiki-tabbed.tsv contains a uniform random sample of 250 English Wikipedia articles in the tab-separated values (TSV) format. You could produce such a sample yourself by downloading the latest English Wikipedia dump and using the gensim.corpora.wikicorpus module.
  3. The file named dwarf-rabbit.html contains the article on the Dwarf rabbit from the English Wikipedia.

We will also require a working installation of Python. The sirene workstations in room B311 come with Python and the Anaconda package manager pre-installed, so we only need to install Gensim and Jupyter Notebook as follows:

  1. Press the start button, search for the “Anaconda Prompt”, and execute it. A window with a command line should open.
  2. Create a new Anaconda environment named pv211 and install the gensim and jupyter packages into this environment by executing the following command:

    conda create --name pv211 python=2 gensim jupyter
    • To use lemmatization, we will also require the pattern package. The pattern package is not available in the main Anaconda repositories and only supports Python 2, which is the main reason we use Python 2 instead of Python 3 in this tutorial. Install the pattern package by executing the following command:

      conda install --name pv211 --channel asmeurer pattern
  3. Activate the pv211 Anaconda environment and change the working directory to the directory where you have downloaded the above files by executing the following commands:

    activate pv211
    pushd \\ad.fi.muni.cz\DFS\home\%USERNAME%\_profile\Downloads
  4. Open the Jupyter Notebook named gensim.ipynb in a Web browser by executing the following command. A Jupyter Notebook allows you to create documents that combine Markdown markup with code in Python and other programming languages. Output produced by executing the Python code is recorded in the document, and the final document can be exported to a variety of formats, including HTML and PDF.

    jupyter notebook gensim.ipynb
  5. After you have finished the assignments, shut down the Jupyter Notebook by pressing Ctrl+C in the Anaconda Prompt window, and remove the pv211 Anaconda environment by executing the following commands. Dismiss any administrator prompts by pressing “No”.

    deactivate
    conda env remove --name pv211

To perform the above procedure on a Linux computer with the pip package manager instead of Anaconda, you would use the following commands:

$ mkdir /tmp/pv211
$ virtualenv -p `which python2` /tmp/pv211
$ source /tmp/pv211/bin/activate
$ pip install --upgrade pip
$ pip install gensim pattern jupyter
$ jupyter notebook /path/to/gensim.ipynb
$ # ... work on assignments
$ deactivate
$ rm -r /tmp/pv211

Assignments

  1. Read the Jupyter Notebook project and follow the instructions. Try to understand the code.

  2. Make a copy of the Jupyter Notebook by selecting “File”, and then “Make a Copy” from the Jupyter Notebook horizontal menu. In this new Jupyter Notebook, use Latent Dirichlet Allocation (LDA) instead of Latent Semantic Analysis (LSA) to compute a low-rank approximation of the term-document matrix.

  3. Add code that uses the Web page in the file named dwarf-rabbit.html as a query to find similar articles. Extract the article text using an XML parser such as xml.etree.ElementTree or lxml; for simplicity, consider extracting only the text content of the <p> elements that represent paragraphs. To tokenize the extracted text, draw inspiration from the code in the Jupyter Notebook.