{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TF-IDF for computing keywords\n", "This notebook shows how TF-IDF (https://cs.wikipedia.org/wiki/Tf-idf) computes keywords. The technique works for any language that we can tokenize (split into words), and it is somewhat better suited to languages with less rich inflection. For richly inflected languages, TF-IDF can be computed on lemmas, but lemmatization itself can be a problem and can introduce errors into the computation.\n", "\n", "We will use the packages scikit-learn (https://scikit-learn.org/stable/) for machine learning and pandas (https://pandas.pydata.org/) for data analysis." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: sklearn in /home/popelucha/.local/lib/python3.5/site-packages (0.0)\n", "Requirement already satisfied: scikit-learn in /home/popelucha/.local/lib/python3.5/site-packages (from sklearn) (0.20.3)\n", "Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python3.5/dist-packages (from scikit-learn->sklearn) (1.16.2)\n", "Requirement already satisfied: scipy>=0.13.3 in /home/popelucha/.local/lib/python3.5/site-packages (from scikit-learn->sklearn) (1.2.1)\n", "\u001b[33mYou are using pip version 19.0.3, however version 19.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", "Requirement already satisfied: pandas in /home/popelucha/.local/lib/python3.5/site-packages (0.24.2)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in /usr/local/lib/python3.5/dist-packages (from pandas) (2.8.0)\n", "Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.5/dist-packages (from pandas) (1.16.2)\n", "Requirement already satisfied: pytz>=2011k in /home/popelucha/.local/lib/python3.5/site-packages (from pandas) (2019.1)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.5/dist-packages 
(from python-dateutil>=2.5.0->pandas) (1.12.0)\n", "\u001b[33mYou are using pip version 19.0.3, however version 19.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "!pip install scikit-learn --user\n", "!pip install pandas --user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the dataset we chose Paradise Lost by John Milton (https://cs.wikipedia.org/wiki/Ztracen%C3%BD_r%C3%A1j).\n", "\n", "In the first example we compare four texts, one for each of the characters Adam, Eve, God, and Satan, by their keywords.\n", "\n", "## TF-IDF on a small amount of data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import pandas as pd\n", "import os\n", "\n", "# One input file per character, stored under sp/\n", "characters = ['Adam', 'Eve', 'God', 'Satan']\n", "#for root, subdirectories, characters in os.walk('sp/'):\n", "# print(list(characters))\n", "files = ['sp/' + character for character in characters]\n", "contents = [open(file, encoding='utf-8', errors='ignore').read() \n", " for file in files]\n", "\n", "# sublinear_tf=True replaces the raw term frequency tf with 1 + log(tf)\n", "vectorizer = TfidfVectorizer(sublinear_tf=True)\n", "tfidf_matrix = vectorizer.fit_transform(contents)\n", "# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2\n", "feature_names = vectorizer.get_feature_names()\n", "# Convert the sparse matrix to a nested list, one row per text\n", "dense = tfidf_matrix.todense()\n", "denselist = dense.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TfidfVectorizer created one vector per text. A word that does not occur in a text has the value 0; otherwise its value is the computed TF-IDF score." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | abandon | abhor | abide | abject | abjure | able | abode | abolish | abominable | abortive | ... | yoke | yon | yonder | you | younger | your | yours | youth | zodiac | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adam | 0.000000 | 0.012393 | 0.000000 | 0.000000 | 0.015719 | 0.016988 | 0.000000 | 0.015719 | 0.000000 | 0.012393 | ... | 0.012393 | 0.020983 | 0.019575 | 0.00000 | 0.015719 | 0.008203 | 0.000000 | 0.015719 | 0.015719 | 0.015719 |
| Eve | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.014044 | 0.00000 | 0.000000 | 0.014044 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| God | 0.000000 | 0.000000 | 0.021124 | 0.000000 | 0.000000 | 0.017101 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.013982 | 0.00000 | 0.000000 | 0.023673 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Satan | 0.018467 | 0.014560 | 0.014560 | 0.031268 | 0.000000 | 0.011787 | 0.018467 | 0.000000 | 0.018467 | 0.014560 | ... | 0.024652 | 0.014560 | 0.009637 | 0.06099 | 0.000000 | 0.042088 | 0.038756 | 0.000000 | 0.000000 | 0.000000 |
4 rows × 4091 columns
\n", "