In an IS folder named similarity-search-with-gensim, you will find the following files. Please download all these files before you continue:
We will also require a working installation of Python. The nymfe workstations in room A219 come with Python and the pip package manager pre-installed, so we only need to install Gensim and Jupyter Notebook as follows:
Create a new virtual environment named pv211 and install the gensim and jupyter packages in this environment by executing the following commands:
module add python3
mkdir pv211/
virtualenv --python="$(which python3)" pv211/
source pv211/bin/activate
pip install --upgrade pip
pip install gensim jupyter
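To check that the installation succeeded, you can ask the Python interpreter in the virtual environment to import Gensim and print its version (an optional sanity check, not part of the original instructions):
python -c "import gensim; print(gensim.__version__)"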
Open the Jupyter Notebook project file named gensim.ipynb in a web browser by executing the following command. A Jupyter Notebook allows you to create documents combining Markdown markup with code in Python and other programming languages. Output produced by executing the Python code is recorded in the document. The final document can be exported to a variety of formats, including HTML and PDF.
jupyter notebook gensim.ipynb
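For example, once you have finished working on the notebook, you can export it to a standalone HTML file with nbconvert, which is installed together with Jupyter (the command below is only an illustration; other supported formats can be used instead):
jupyter nbconvert --to html gensim.ipynb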
After you have finished the assignments, shut down the Jupyter Notebook by pressing Ctrl+C in the terminal window, and remove the pv211 virtual environment by executing the following commands:
deactivate
rm -r pv211
Read the Jupyter Notebook project and follow the instructions. Try to understand the code.
Make a copy of the Jupyter Notebook project file by selecting “File” and then “Make a Copy” from the Jupyter Notebook horizontal menu. In this new Jupyter Notebook project, use Latent Dirichlet Allocation (LDA) instead of Latent Semantic Analysis (LSA) to compute a low-rank approximation of the term-document matrix.
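The following is a minimal, self-contained sketch of the change (the toy documents, variable names, and number of topics are made up for illustration and are not taken from the notebook). It shows that Gensim's LdaModel accepts the same corpus, id2word, and num_topics arguments as LsiModel, so it can be dropped in where the notebook builds its LSA model:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized documents standing in for the corpus built in the notebook.
documents = [
    ["information", "retrieval", "search", "query"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["search", "engine", "query", "ranking"],
]

dictionary = Dictionary(documents)                       # term <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

# LDA instead of LSA: replace the notebook's LsiModel(...) call with LdaModel(...).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Each document is now represented by its distribution over topics.
print(lda[corpus[0]])
The resulting topic vectors can then be indexed for similarity queries in the same way as the LSA vectors.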
Extract the text from the <p> elements that represent paragraphs. To tokenize the extracted text, draw inspiration from the code in the Jupyter Notebook project file.
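As a sketch of what the extraction and tokenization might look like (the use of BeautifulSoup and of Gensim's simple_preprocess here are assumptions; the notebook may tokenize differently):
from bs4 import BeautifulSoup             # assumption: BeautifulSoup is used for HTML parsing
from gensim.utils import simple_preprocess

html = "<html><body><p>Topic modelling with Gensim.</p><p>Another paragraph.</p></body></html>"

# Keep only the text of the <p> elements, i.e. the paragraphs of the page.
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Lowercase each paragraph and split it into tokens, dropping punctuation.
tokenized = [simple_preprocess(paragraph) for paragraph in paragraphs]
print(tokenized)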