Exercises
December 3, 2009
1 Introductory remarks
General comments We have attempted to compile a set of exercises and
questions that represent a selection of typical steps you will perform, and problems
you might encounter, when analyzing biological data. The data you will
work with is, whenever possible, the same you will start the exercises with. It
is thus as close to the reality as possible. However, in a few cases the data do
not have all the features we would like them to have. We might then have to
switch to a different data set, to allow for the planned analyses.
Web pages and data bases In the course of the exercise, you will need to
go to several web-sites and databases to collect data for your analyses. Unfortunately
we lack the time to mention every single of these web sites in detail. If
you are not already familiar with these pages, please take some time and have a
look what kind of information they provide. However, before you get completely
lost, please ask!
Functions and programs Similar to the web pages and databases, a number
of unix-functions, such as less, chmod, sed, grep, head, tail, tr, sort, comm, uniq
might come in handy for data manipulation and quick data analysis. Again,
we will not be able to give a thorough introduction into every single one of
these functions. It is up to you what way you choose to complete the exercises.
However, make sure that your approach is scalable to larger amounts of data
than the one we will work with. If you are not sure how to complete a certain
task, or if you are interested in a different way of doing things, please ask. Please
keep in mind that now and then it's worthwhile to read and think for an hour
finding out how a program can do something for you in less than a second, even
1
though you would only need five minutes to do it manually. Again, we can only
suggest how to accomplish certain tasks. If you find other ways to be more
efficient, go for it.
Documentation and solutions One part of the exercise is to document the
individual steps of your analyses. Please enter the answers to the individual
questions you'll find below at the appropriate place in your documentation.
Furthermore, please add a remark to the individual questions, whether you find
them
* trivial
* appropriate
* complex
* too difficult
We will collect your documentation at the end of the course(!), so please make
sure that it is structured, complete and that you either do have it in electronic
format or keep a hard copy for your own record.
2 Sequence retrieval and ortholog prediction
The first set of exercises and questions is concerned with putting together a data
set to address the question of the phylogenetic position of our taxon of interest.
2.1 Collecting a dataset
1. ! Download the file ENCCU prot.fa.gz from the following URL:
http://www.cibiv.at/~ingo/Brno/data
and unzip it.
Note, the sequences are in fasta-format. In this format, each sequence has
to be preceded by a sequence header starting with a >. Any information
can be stored in this sequence header. The actual sequence starts in the
next line following the header. Both, protein and DNA sequences can be
stored in fasta format. Individual sequences, multiple unaligned sequences
and alignments can be stored in fasta format.
2
2. How many sequences does this file contain?
3. ! Extract the sequences with the sequence identifiers Q8SSC4 and Q8SQP5
from the file ENCCU prot.fa and address the following questions:
(a) From what species are these sequences most likely derived?
(b) What are the likely functions of these proteins?
(c) Compare the evalues and the scores between the best BLAST hit and
the next best Hit. Which one is informative in ranking the hits?
Please use the BLAST tool at
http://blast.ncbi.nlm.nih.gov/Blast.cgi
for this purpose.
4. ! Find the complete protein sets for the following taxa:
* Saccharomyces cerevisiae (SACCE)
* Homo sapiens (HOMSA)
* Arabidopsis thaliana (ARATH)
Note: Use the diArk-web pages at
http://www.diark.org
as a start for your search.
(a) How many choices for each taxon does diARK give you?
(b) ! Download the protein sequences from an appropriate data repository
in fasta format. If you have the choice, choose ENSEMBL
BioMart and use protein id and gene id as sequence identifier. Filter
for genes with a protein id assigned and a gene status known
(c) ! Rename the files to adhere to the following standard: SPECIES prot.fa.
(Please use for SPECIES the short versions given above, e.g. SACCE).
(d) How many protein sequences are in each data set. Compare the
values to the numbers of protein coding genes present in each of the
genomes and explain your observation.
3
(e) ! Process the ARATH, HOMSA, and SACCE protein sets using the
shell script unique.sh and nentferner.pl that you can download from
http://www.cibiv.at/~ingo/Brno/scripts/
Note, unique.sh will call the script nentferner.pl and expects it to be
in your executable path.
(f) Describe the individual steps and functions in the script unique.sh.
Note, nentferner.pl just removes newlines in a fasta file. Issue nentferner.pl
-h for more info.
2.2 Orthology prediction
In this part of the exercises you will perform ortholog predictions using the
sequences you have retrieved in the previous steps. The aim is to compile
groups of evolutionarily related sequences that can be, in the next step,
used to infer the phylogenetic relationships of the species under study.
5. ! Use the two sequences Q8SSC4 and Q8SQP and perform a local reciprocal
Blast search between
(a) ENCCU and HOMSA
(b) ENCCU and ARATH
(c) ENCCU and SACCE
Check whether you can identify orthologous sequences to the two query
proteins in the three other species. If so, combine the orthologs for each
protein in a file and name this file Q8SSC4 ortho.fa and Q8SQP5 ortho.fa,
respectively.
To complete this task, you need to use the local blast version blastall
installed on the system. Perform the follwowing steps:
(a) Reformat the fasta files for the four proteomes separately into blast
databases. Use the program formatdb for this purpose.
(b) Put the blast databases into a directory blast db
4
(c) Use blastall -p blastp to search for hits in SACCE with the first query
Q8SSC4. Issue the command blastall to obtain all the options for this
program.
(d) Repeat the reciprocal blast search for ARATH and HOMSA.
(e) Repeat the entire procedure for the second protein Q8SQP5.
(f) Explain what you would need to do to confirm orthology for all proteins
in Q8SSC4 ortho.fa.
6. ! Start an InParanoid ortholog search between all ENCCU proteins and
all proteins from one(!) of the other three species. Talk to your neighbor
and make sure that within the course all 3 searches are performed. Once
the InParanoid searches are completed make sure that you obtain the
results from the two searches you have not obtained.
7. ! Go to the OMA project at
http://omabrowser.org
and identify the OMA ortholog groups that correspond to the two genes
you are analyzing. Use the protein search function in the OMA browser.
8. ! Download the two ortholog groups in fasta format and complement the
files Q8SSC4 ortho.fa and Q8SQP5 ortho.fa with the taxa listed in the
file taxon list.txt you can download from
http://www.cibiv.at/~ingo/Brno/taxon_list.txt
9. Use mafft to align the sequences in the two files. Use the most sensitive
alignment settings --maxiterate 1000 --localpair.
10. Use clustalw -convert to convert the fasta format from the mafft output
into phylip format. Issue the command clustalw -help for more info. What
does clustalw usually do? What are the problems you may encounter with
the sequence ids?
5
2.3 Maximum Parsimony tree reconstruction
Use the web page at
http://mobyle.pasteur.fr/
for computing the maximum parsimony trees. Perform 100 Bootstrap replicates
to assess how good the data supports the tree. Please note, the 100 bootstrap
replicates can take a while. As a suggestion, first compute the MP tree that
you have something to work on and then afterwards compute the 100 boostrap
maybe overnight.
1. ! Compute the maximum parsimony tree for the dataset Q8SSC4 ortho.phy
2. ! Compute the maximum parsimony tree for the dataset Q8SQP5 ortho.phy
3. Visualize the two trees with figtree, compare and discuss the differences.
2.4 Maximum likelihood tree reconstruction
Use raxmlHPC and the model PROTGAMMAIWAGF for tree reconstruction.
Alternatively, you can also use the online tools at
http://mobyle.pasteur.fr/
for this purpose. If you choose the online tools then use the options
* substitution model: WAG
* estimated proportion of invariable sites -v
Please note, performing 100 bootstrap replicates can take a while. As a suggestion,
first compute the ML tree that you have something to work on and then
afterwards compute the 100 boostrap maybe overnight.
1. !Compute the maximum likelihood tree for the dataset Q8SSC4 ortho.phy
2. !Compute the maximum likelihood tree for the dataset Q8SQP5 ortho.phy
3. Visualize the two trees with figtree, compare and discuss the differences.
4. Discuss the results from the MP and the ML analysis.
2.5 Distance tree reconstruction
This part is optional! Use the web tools at
http://mobyle.pasteur.fr/
6
for this purpose. Use the programs protdist to construct a distance matrix and
bionj for tree reconstruction.
1. Compute the distance tree for the dataset Q8SSC4 ortho.phy
2. Compute the distance tree for the dataset Q8SQP5 ortho.phy
3. Visualize the two trees with figtree, compare and discuss the differences.
4. Compare to the results from the MP and the ML analysis.
7