SUMMARY OF PRODUCT REVIEWS

Are you annoyed by having to read through hundreds of reviews when you want to buy something
specific? Don’t worry, a solution will be available soon. Currently, I’m working on this in my
bachelor thesis. I’m a perfectionist, and when something isn’t perfect, I’m extremely annoyed. The
same thing arises when purchasing a product that should meet my requirements. Sure, you can just
read the product specifications, but the cons are always missing – that's why I often rely on
reviews, where you can find everything – pros and cons.

WEB SCRAPING

If the e-shop doesn’t provide you with a dataset, you need to scrape the web. What does it mean?
The definition of web scraping is collecting data from a website using a program. It’s important to
be familiar with HTML programming language because you could get lost while browsing the code.

Another thing to consider is creating a program to accomplish this task. I used a Python package
Beautiful Soup, which is also the most common library for scraping information from web pages. This
library is an HTML or XML parser. It breaks down text into recognized strings of characters for
further analysis, allowing you to get rid of HTML tags and other non-relevant information. Then you
can save the necessary data into a CSV file as a table.

PREPROCESSING

The structure of data may become non-systematic after opening the CSV file, but another library
called Pandas can solve this issue. It provides functions for analysing, cleaning, exploring, and
manipulating data.

🐼 Fun fact: The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis".
However, I always imagine the panda bear from China.

Once you reached the good structure of the text, you can proceed with further data preprocessing.
It often happens that some data contain errors or are even missing, and you must remove them, as
well as punctuation in the text. Computers can then easily identify patterns and relationships
between words.

The next step involves converting the text to lowercase and choosing stop words to eliminate –
words that don’t have any semantic meaning, such as “was”, “she”, “for” … These are commonly used
words in a language that don’t carry useful information. There are many lists available online,
especially for the English language; however, for Czech it’s more challenging.

LANGUAGE MODELLING

When the dataset is cleaned, we use a language model, which is a type of machine learning model
trained to predict a probability distribution over words. To achieve satisfactory results, the
dataset should be extensive. I chose the fastText library created by Facebook's AI Research (FAIR)
lab. It’s also considered a language model for learning word embeddings and text classification.

FastText is efficient for morphologically complex languages. It learns to understand words by
breaking them down into smaller parts called n-grams. From this, it creates word embeddings to
determine semantic similarity in words as vectors. The closer vectors are, the more similarity they
have. Therefore, it also handles spelling errors better than other language model, such as
Word2Vec, which use word embeddings as entire words, not subwords.

In conclusion, fastText is much more complex than we have discussed here, but for the purposes of
summarising my bachelor thesis, I believe it’s sufficient. 😊