PROJECT ASSIGNMENTS
General info
- Each project is to be implemented as a single Python notebook
- The .ipynb files are to be uploaded to the corresponding vault in IS
- Make sure the notebooks run as is in Google Colaboratory!
- If there are any supplementary data used by your notebook, upload them to some permanent and accessible online location and load them from there in the code (similarly to handling external data in this notebook)
- Present your projects in one dedicated session (the exact date is to be specified, but it will be some time during the exam term, after June 15th)
Task 3A
- Use Biopython/Entrez and PubMed to create a corpus of titles and abstracts of biomedical articles on "Pauling" and "vitamin C"
- Extract biomedical entities from the corpus using SciSpaCy
- Create a frequency dictionary from the list of extracted entities (where keys are the entities and values are the number of their occurrences in the corpus)
- Try to answer the following questions:
- What are the 10 most and the 10 least common entities in the corpus?
- What does it say about the curious ability of people (not only certain morally sub-optimal Czech politicians, but also double-Nobel laureates) to insist on their often silly convictions despite all evidence?
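The corpus-building and counting steps above can be sketched as follows. This is a minimal outline, not the required solution: the PubMed fetch is wrapped in a function (it needs network access, biopython installed, and a contact e-mail for NCBI, so it is not executed here), and a toy entity list stands in for real SciSpaCy output when building the frequency dictionary.

```python
from collections import Counter

def fetch_abstracts(query, email, retmax=100):
    """Fetch titles and abstracts from PubMed via Biopython's Entrez module.

    Requires network access and biopython; shown as a sketch, not run here.
    """
    from Bio import Entrez, Medline
    Entrez.email = email  # NCBI requires a contact e-mail
    ids = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=retmax))["IdList"]
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")
    # "TI" and "AB" are the Medline fields for title and abstract
    return [(r.get("TI", ""), r.get("AB", "")) for r in Medline.parse(handle)]

# Toy stand-in for the entity strings a SciSpaCy pipeline would extract:
entities = ["vitamin C", "common cold", "vitamin C", "ascorbic acid",
            "vitamin C", "common cold", "Pauling"]

freq = Counter(entities)            # entity -> number of occurrences
most_common = freq.most_common(10)  # the (up to) 10 most frequent entities
least_common = freq.most_common()[:-11:-1]  # the (up to) 10 least frequent
```

For the real corpus, the query would be something like `"Pauling AND vitamin C"`, and `entities` would come from iterating over `doc.ents` of a SciSpaCy model such as `en_core_sci_sm`.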
- In addition to using SciSpaCy for constructing the frequency dictionary, try also the regexp-based shallow parsing module of NLTK
- Hint on a basic way to extract entities with NLTK:
- Use the default tokenizer and POS tagger from NLTK (nltk.sent_tokenize() and nltk.pos_tag()) to preprocess the text (i.e. split it into sentences and annotate them with part-of-speech tags); make sure unknown words are tagged as nouns by default
- Define a simple noun phrase regular expression pattern following the chunk rule for noun phrases defined in the HOWTO (you may experiment with several regexp patterns, though, to increase the coverage of biomedical entities)
- Parse the text in the corpus with the chunk parser using the defined pattern
- Extract the resulting noun phrases as entities
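The chunking steps above can be sketched like this. The NP pattern follows the common determiner/adjective/noun rule from the NLTK chunking HOWTO; a hand-tagged sentence stands in for nltk.pos_tag() output so the sketch runs without downloading the tagger models.

```python
import nltk

# NP chunk rule along the lines of the NLTK HOWTO:
# optional determiner, any number of adjectives, one or more nouns
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

# Hand-tagged stand-in for nltk.pos_tag() output on one tokenized sentence:
tagged = [("Vitamin", "NN"), ("C", "NN"), ("does", "VBZ"), ("not", "RB"),
          ("cure", "VB"), ("the", "DT"), ("common", "JJ"), ("cold", "NN")]

tree = chunker.parse(tagged)
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "NP"]
# noun_phrases -> ["Vitamin C", "the common cold"]
```

For the default-to-noun behaviour mentioned above, one option is to combine a trained tagger with nltk.DefaultTagger("NN") as its backoff instead of using nltk.pos_tag() directly.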
- Note:
- You may also experiment with the default pre-trained method nltk.chunk.ne_chunk() for named-entity chunking, but the results may not be best suited to this task (since it is tailored to the types of entities occurring in general texts)
- Alternatively, you might train your own POS tagger on an annotated biomedical corpus if you feel really experimental
- Compare the results obtained with both methods
Task 3B
- Register on Kaggle
- Download the data from the heart failure challenge
- Try to beat the accuracy results of the current submissions to the Predict heart failure task with a bespoke model of your choice, developed using scikit-learn
- Note:
- Feel free to get inspired by the plentiful insights into the data and the problem itself presented in many of the current submissions to the task
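A minimal scikit-learn baseline for this kind of tabular classification task might look like the sketch below. Synthetic data stands in for the Kaggle CSV so the sketch runs standalone; the commented read_csv lines and the "DEATH_EVENT" target column name are assumptions about the downloaded file, to be checked against the actual data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle data; in the notebook you would instead do
# something like (target column name is an assumption about the CSV):
#   df = pd.read_csv(DATA_URL)
#   X, y = df.drop(columns=["DEATH_EVENT"]).values, df["DEATH_EVENT"].values
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

A random forest is just one reasonable starting point; the task asks for a model of your choice, so comparing several estimators (e.g. via cross-validation) is the natural next step.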
Task 3C
- Download the protein-protein interaction data from the Yeast PPI Kaggle project
- Compute low-rank embeddings of each protein from the yeast PPI network using the principles of graph representation learning
- You can either use the node2vec tool...
- ... or experiment with one of the many methods reviewed in this recent survey
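The random-walk idea behind such embeddings can be sketched as below. This is the plain DeepWalk variant (uniform walks); node2vec additionally biases the walks with its return/in-out parameters p and q. The skip-gram step is wrapped in a function and not executed here, since it needs gensim installed; the toy protein IDs are made up.

```python
import random

def random_walks(adj, num_walks=10, walk_len=20, seed=0):
    """Uniform random walks over an adjacency dict {node: [neighbours]}."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

def embed(walks, dim=64):
    """Skip-gram embeddings of the walks (needs gensim; not executed here)."""
    from gensim.models import Word2Vec
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=1, sg=1)
    return {node: model.wv[node] for node in model.wv.index_to_key}

# Toy PPI graph (protein IDs are made up):
ppi = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"]}
walks = random_walks(ppi)
```

The node2vec tool packages both steps (biased walks plus skip-gram training), so in practice you would feed it an edge list of the yeast PPI network rather than reimplementing the walks.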
- Experiment with at least 3 clustering methods from scikit-learn to obtain clusters of the yeast proteins based on their embeddings
- At least one of the methods should produce a hierarchical clustering structure, i.e., a dendrogram
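The clustering step can be sketched as below with three scikit-learn methods, one of them hierarchical; a SciPy linkage matrix gives the dendrogram. A random matrix stands in for the protein embeddings, and the cluster counts and DBSCAN parameters are placeholder choices to be tuned on the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))  # stand-in for 16-dim embeddings of 50 proteins

# Three clustering methods; parameters here are placeholders to tune:
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)
dbscan_labels = DBSCAN(eps=5.0, min_samples=3).fit_predict(X)

# Hierarchical structure for the dendrogram requirement:
Z = linkage(X, method="ward")
# dendrogram(Z)  # draws the tree in the notebook via matplotlib
```

AgglomerativeClustering gives flat clusters cut from a hierarchy; the linkage/dendrogram pair exposes the full tree, which is what the dendrogram requirement asks for.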
- Analyse the obtained clusters and check whether their contents and structure correspond to any meaningful biological intuitions
- Questions you are expected to try to answer include, for instance:
- Do proteins in the same (or nearby) clusters share a predominant biological function?
- Do they share cellular location?
- Are they parts of the same pathway?
- ...
- Hint:
- Use UniProt to get information on the particular proteins via searching for their IDs
- For analysing the clusters, this and this blog post may come in really handy
- You can also try importing the cluster data into Cytoscape if you would prefer to analyse the results in a visual way