In an IS folder named similarity-search-with-gensim, you will find the following files. Please download all these files before you continue:
We will also require a working installation of Python. The nymfe workstations in room A219 come with Python and the pip package manager pre-installed, so we only need to install Gensim and Jupyter Notebook as follows:
Create a new virtual environment named pv211 and install the gensim and jupyter packages in this environment by executing the following commands:
module add python3
mkdir pv211/
virtualenv --python="$(which python3)" pv211/
source pv211/bin/activate
pip install --upgrade pip
pip install gensim jupyter
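To check that the installation succeeded, you can ask the Python interpreter in the virtual environment to import Gensim and print its version (an optional sanity check, not part of the original instructions):
python -c "import gensim; print(gensim.__version__)"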
Open the Jupyter Notebook project file named gensim.ipynb in a web browser by executing the following command. A Jupyter Notebook allows you to create documents combining Markdown markup with code in Python and other programming languages. Output produced by executing the Python code is recorded in the document. The final document can be exported to a variety of formats, including HTML and PDF.
jupyter notebook gensim.ipynb
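For example, once you have finished working on the notebook, you can export it to a standalone HTML file with nbconvert, which is installed together with Jupyter (the command below is only an illustration; other supported formats can be used instead):
jupyter nbconvert --to html gensim.ipynb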
After you have finished the assignments, shut down the Jupyter Notebook by pressing Ctrl+C in the terminal window, and remove the pv211 virtual environment by executing the following commands:
deactivate
rm -r pv211
Read the Jupyter Notebook project and follow the instructions. Try to understand the code.
Make a copy of the Jupyter Notebook project file by selecting “File” and then “Make a Copy” from the Jupyter Notebook horizontal menu. In this new Jupyter Notebook project, use Latent Dirichlet Allocation (LDA) instead of Latent Semantic Analysis (LSA) to compute a low-rank approximation of the term-document matrix.
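The following is a minimal, self-contained sketch of the change (the toy documents, variable names, and number of topics are made up for illustration and are not taken from the notebook). It shows that Gensim's LdaModel accepts the same corpus, id2word, and num_topics arguments as LsiModel, so it can be dropped in where the notebook builds its LSA model:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized documents standing in for the corpus built in the notebook.
documents = [
    ["information", "retrieval", "search", "query"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["search", "engine", "query", "ranking"],
]

dictionary = Dictionary(documents)                       # term <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

# LDA instead of LSA: replace the notebook's LsiModel(...) call with LdaModel(...).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Each document is now represented by its distribution over topics.
print(lda[corpus[0]])
The resulting topic vectors can then be indexed for similarity queries in the same way as the LSA vectors.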
Extract the text from the <p> elements that represent paragraphs. To tokenize the extracted text, draw inspiration from the code in the Jupyter Notebook project file.
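As a sketch of what the extraction and tokenization might look like (the use of BeautifulSoup and of Gensim's simple_preprocess here are assumptions; the notebook may tokenize differently):
from bs4 import BeautifulSoup             # assumption: BeautifulSoup is used for HTML parsing
from gensim.utils import simple_preprocess

html = "<html><body><p>Topic modelling with Gensim.</p><p>Another paragraph.</p></body></html>"

# Keep only the text of the <p> elements, i.e. the paragraphs of the page.
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Lowercase each paragraph and split it into tokens, dropping punctuation.
tokenized = [simple_preprocess(paragraph) for paragraph in paragraphs]
print(tokenized)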