@inproceedings{rehurek_lrec,
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50},
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
note={\url{http://is.muni.cz/publication/884893/en}},
language={English}
}
In an IS folder named similarity-search-with-gensim, you will find the following files. Please download all these files before you continue:
We will also require a working installation of Python. The sirene workstations in room B311 come with Python and the Anaconda package manager pre-installed, so we only need to install Gensim and Jupyter Notebook as follows:
Create a new Anaconda environment named pv211 and install the packages gensim and jupyter in this environment by executing the following command:
conda create --name pv211 python=2 gensim jupyter
To use lemmatization, we will also require the pattern package. This package is not available in the main Anaconda repositories and it only supports Python 2, which is the main reason we use Python 2 instead of Python 3 in this tutorial. Install the pattern package by executing the following command:
conda install --name pv211 --channel asmeurer pattern
Activate the pv211 Anaconda environment and change the working directory to the directory where you have downloaded the above files by executing the following commands:
activate pv211
pushd \\ad.fi.muni.cz\DFS\home\%USERNAME%\_profile\Downloads
A Jupyter Notebook allows you to create documents that combine Markdown markup with code in Python and other programming languages. Output produced by executing the code is recorded in the document. The final document can be exported to a variety of formats, including HTML and PDF. Open the Jupyter Notebook project file named gensim.ipynb in a Web browser by executing the following command:
jupyter notebook gensim.ipynb
After you have finished the assignments, shut down the Jupyter Notebook by pressing Ctrl+C in the Anaconda Prompt window, and remove the pv211 Anaconda environment by executing the following commands. Skip all administrator prompts by pressing “No”.
deactivate
conda env remove --name pv211
To perform the above procedure on a Linux computer with the pip package manager instead of Anaconda, you would use the following commands:
$ mkdir /tmp/pv211
$ virtualenv -p `which python2` /tmp/pv211
$ source /tmp/pv211/bin/activate
$ pip install --upgrade pip
$ pip install gensim pattern jupyter
$ jupyter notebook /path/to/gensim.ipynb
$ # ... work on assignments
$ deactivate
$ rm -r /tmp/pv211
Read the Jupyter Notebook project and follow the instructions. Try to understand the code.
Make a copy of the Jupyter Notebook project file by selecting “File” and then “Make a Copy” from the Jupyter Notebook horizontal menu. In this new Jupyter Notebook project, use Latent Dirichlet Allocation (LDA) instead of Latent Semantic Analysis (LSA) to compute a low-rank approximation of the term-document matrix.
<p> elements that represent paragraphs. To tokenize the extracted text, draw inspiration from the code in the Jupyter Notebook project file.
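One possible way to collect the text of <p> elements is sketched below using only the Python standard library. The class name ParagraphExtractor, the helper functions, and the sample HTML are hypothetical names introduced for illustration, and the regular-expression tokenizer is a simple stand-in for the tokenization code in the notebook, which you should prefer. The import is written so that the sketch runs under both Python 2 (as used in this tutorial) and Python 3.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> ends where the next <p> starts.
            self._flush()
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def _flush(self):
        # Store the paragraph collected so far with normalized whitespace.
        if self._inside:
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
        self._inside = False
        self._chunks = []


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # handle a trailing <p> that was never closed
    return parser.paragraphs


def tokenize(text):
    # Placeholder tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower(), re.UNICODE)


sample_html = ("<html><body><h1>Title</h1>"
               "<p>Gensim builds <b>topic</b> models.</p>"
               "<div>skip me</div>"
               "<p>It scales to large corpora.</p></body></html>")
paragraphs = extract_paragraphs(sample_html)
tokens = [tokenize(paragraph) for paragraph in paragraphs]
```

Text outside <p> elements, such as the heading and the <div> in the sample, is ignored, while inline markup inside a paragraph is flattened into its text.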