Exercises
November 30, 2007
1 Introductory remarks
General comments We have attempted to compile a set of exercises and
questions that represent a selection of typical steps you will perform, and problems
you might encounter, when doing phylogeny reconstruction analyses. The
data you will work with is, whenever possible, 'real biological data' rather than
simulated sequences, and thus as close to phylogeny reconstruction reality as
possible. However, in a few cases the data does not have all the features we
would like them to have. Rather than switching to a different data set then,
we will ask you to modify original sequences or alignments in a way that the
problem we would like to point out becomes obvious. Please make sure that
whenever you modify a file you save it under a different file name, and that you
keep track of what version you are using.
Web pages and data bases In the course of the exercise, you will need to
go to several web-sites and databases to collect data for your analyses. Unfortunately
we lack the time to mention every single of these web sites in detail. If
you are not already familiar with these pages, please take some time and have a
look what kind of information they provide. However, before you get completely
lost, please ask!
Functions and programs Similar to the web pages and databases, a number
of unix-functions, such as less, chmod, sed, grep, head, tail, tr might be handy
to use for data manipulation and quick data analysis. Again, we will not be
able to give a thorough introduction into every single one of these functions.
Rather, we mention them in appropriate places and leave it to you to read the
man-pages and find out how to use them in case you are not already familiar
1
with them. Please keep in mind that sometimes iťs worthwhile to read and
think for an hour finding out how a program can do something for you in less
than a second, even though you would only need five minutes to do it manually.
Again, we can only suggest how to accomplish certain tasks. If you find other
ways to be more efficient, go for it.
Documentation and solutions One part of the exercise is the documentation
of the individual steps you have done during your analyses. Please enter the
answers to the individual questions you'll find below at the appropriate place
in your documentation. Furthermore, please add a remark to the individual
questions, whether you find them
* trivial
* appropriate
* complex
* too difficult
We will collect your documentation at the end of the course and will use it for
our own evaluation of the lecture. So please make sure that you either do have
it in electronic format or keep a hard copy for your own record.
2 Sequence retrieval and sequence alignment
The first set of exercises and questions is concerned with putting together an
initial data set for phylogeny reconstruction
2.1 Collecting a dataset
1. ! Download the sequences gorgo-genomic.fa and ponpy-genomic.fa from
the following location:
http://www.cibiv.at/~ingo/Brno/exercises
Note, the sequences are in fasta-format. In this format, each sequence has
to be preceded by a sequence header starting with a >. Any information
can be stored in this sequence header. The actual sequence starts in the
next line following the header. Both, protein and DNA sequences can be
2
stored in fasta format. Individual sequences, multiple unaligned sequences
and alignments can be stored in fasta format.
(a) What type of biological sequence is it?
(b) How long are the sequences?
(c) How many undetermined positions (N) do they each contain?
2. ! Try to change the format of gorgo-genomic.fa into:
(a) Phylip
(b) Nexus
Please use clustalw for this purpose. Try clustalw -h for online help.
Focus on the options for DATA, VERBS, PARAMETERS. Describe
the result of your attempt and why you get this result.
3. Perform a BLAST search with gorgo-genomic.la at
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Run two data base searches, one BlastN search and one BlastX search to
get more information about your sequence. Choose in both cases 'nr' as
the search database.
(a) What is the difference between BlastN and BlastX searches?
(b) Why is the maximal hit length in BlastX substantially shorter than
in the BlastN search?
(c) What can you say about the information content of gorgo-genomic.fa
(d) What would be your first guess about the species from which gorgogenomic.fa
was determined from?
4. ! Determine the corresponding regions to gorgo-genomic.fa from the following
species:
* Homo sapiens (homsa)
* Pan troglodytes (pantr)
* Macaca mulatta (macmu)
3
To do so, please go the Human Genome Browser Website at:
http://www.genome.ucsc.edu
and choose Blat from the menu list. Perform the Blat-search on the three
genomes and:
(a) download the corresponding dna-sequences. Name them homsa-genomic.fa,
pantr-genomic.fa and macmu-genomic.fa, respectively.
(b) Change the sequence header to contain only the short name of the
species.
(c) Put the sequences of all 5 taxa into a single multi-fasta file. Name
this file datasetl-genomic.fa
5. ! Align the sequences in datasetl-genomic.fa with clustalw.
(a) Make yourself familiar with the alignment you have just generated.
(b) Create a copy of the file datasetl-genomic.fa and name it datasetl-
genomic-trunc.fa.
(c) Modify the un-aligned file datasetl-genomic-trunc.fa in the following
way: in the human sequence (homsa) delete the subsequence starting
with AGGAGAA and ending with ATCAGCTT. This shortens the
sequence by 300 bp from 837 bp to 537 bp.
(d) Align the sequences in datasetl-genomic-trunc.fa with clustalw. Compare
the alignment result with the alignment of the unmodified sequences.
Does the alignment reflect your expectation? What effect
to you observe?
(e) Store the alignments for datasetl-genomic.fa and the clustal-alignment
for datasetl-genomic-trunc.fa for later analyses in all three file formats:
fasta, nexus, phylip.
6. ! Gather a second data set that covers now a broader phylogenetic range.
Start with the human gene identified by ENSG00000142687.
(a) What does the gene code for?
4
(b) Get the corresponding proteins and transcrips (orthologs) for the
following species:
* Homo sapiens, Human, HOMSA
* Pan troglodytes, Chimp, PANTR
* Macaca mulatto,, Rhesus, MACMU
* Mus musculus, Mouse, MUSMU
* Rattus norvegicus, Rat, RATNO
* Canís familiaris, Dog, CANFA
* Takifugu rubripes, Pufferfish, FUGRU
* Ciona intestinalis, Sea squirt, CIOIN
(c) Put the protein sequences into a multi-fasta file called datasetlprotein.fa,
and the transcript sequences into a multi-fasta file called
dataset 1-transcript.fa.
(d) Align both files with clustalw and create nexus and phylip versions
of the alignments.
(e) Why do we need to use orthologous sequences for species tree recon-
struction?
To complete this exercise, please go to the the following web site to obtain
the sequences:
http://inparanoid.sbc.su.se
Use Search by Sequence IDs and the Gene id provided above. Note,
in some format, such as nexus or phylip, the length of the sequence name
is limited to 10 characters and will be truncated if they are longer. It is,
thus, helpful to restrict sequence names to 10 characters right from the
beginning.
3 Phylogeny reconstruction using Maximum Parsimony
(MP) and Sequence distance
By now, you have gathered three sets of orthologous sequences, that can be used
for tree reconstruction. Furthermore, you should have one modified alignment
5
that will serve as an example of how artefacts introduced by errors in the analysis
can influence your conclusions. If you haven't had sufficient time to complete
all of the previous exercises, you can download the missing datasets from
http://www.cibiv.at/~ingo/Brno/exercises/datasets/
However, please make sure to complete the exercises whenever you have spare
time. In the following, you will use various simple methods to reconstruct the
phylogeny of the taxa under study.
3.1 The criteria of Maximum Parsimony
Analysis by eye
1. ! Open the alignment file datasetl-genomic.aln in text editor. Identify
and count the number of parsimony informative sites in the alignment.
2. ! What tree does this alignment support. Write the tree
(a) in Newick-format
(b) as a graph
3. Check the results of the previous exercise by trying to open the tree file
in newick-format with a tree viewer software.
4. How many unrooted tree topologies exist for 5 taxa?
5. ! Assign the parsimony informative sites to the individual splits in the
tree. Do all parsimony informative sites argue for the same tree? If not:
(a) Which site disagrees?
(b) What split does it support?
(c) What could be reasons for the observation of such incompatible
splits?
6. Given the tree, what species would you assign the two unknown sequences
to? Compare your assignment to the conclusion you have drawn from the
Blast-search.
6
Analysis with software tools: T h e program paup Since paup is a commercial
software, we need to run the analyses on the embnet-cluster in Vienna.
Please log in via ssh into the emb-net server. To do so, type
ssh <your-account> ůembl.hec.univie.ac.at
Enter your password. Next, transfer the four alignments to your home directory
using scp. paup requires the alignment files to be in nexus format! To do so,
issue the following command in a second shell:
scp <yourfilename> <your-account>@embl.bcc.univie.ac.at:
Please check next, if the file transfer was successful. For obtaining a quick-start
introduction of paup please visit the following link:
http://www.eibiv.at/~ingo/Brno/paup_quick-start.pdf
1. ! Open the file datasetl-genomic.nxs in paup. To do so, simply type paup
datasetl-genomic.nxs
2. ! Type help to get an overview of the options paup gives you.
3. ! What information does the command cstatus give you? Compare this
information to the results you have obtained in your analysis of the alignment
by eye.
4. ! Set the information criterion to parsimony. Compute the maximum
parsimony tree using the command alltrees. What kind of search do you
perform? How many trees are evaluated?
5. ! View the result of your tree search by issuing the command showtrees.
Compare the result to your manually reconstructed tree. Do they agree?
6. ! Compute the maximum parsimony tree using the command hsearch.
Make sure to understand the difference to the command searchall?
7. Set homsa (Homo sapiens) as the outgroup, root the tree and display
it. Next, unroot the tree, change the outgroup to m a c m u (Macaca mulatto)
and re-root the tree. Hint: use the commands outgroup, roottrees,
deroottrees.
8. Save the tree for later inspection.
7
3.2 The criterion of distance
1. ! Load the alignment datasetl-genomic-trunc.nxs into paup. Use the
command execute. Remember, that these are the same sequences as in
the alignment datasetl-genomic.nxs, with the exception that you have removed
300 bp from whithin the human sequence. Compute the maximum
parsimony tree and compare it to the tree from the previous exercise.
What are the differences.
2. ! Compute the neighbor joining tree for the alignment. To do so, use the
command nj
3. ! Compute the optimal tree according to the distance criterion. Set the
criterion to distance, and perform both exhaustive and heuristic search.
Compare the distance tree to the MP-tree. Explain your observation.
4. Close paup, activate the t-coffee environment (simply type activate TCOFFEE
and perform a t.coffee alignment of datasetl-genomic-trunc.fa.
To do so, simply type t-coffee <filename>. Reformat the output to nexus
and repeat the tree reconstruction using MP and Distance methods. Compare
the results to the ones obtained with the clustalw-alignment. Interpret
your results.
3.3 Robustness of the tree reconstruction inferred from
resampling methods
1. ! Load the dataset datasetl-transcript.nxs into paup. Set the criterion to
parsimony. Perform bootstrap and jackknife analysis with default settings
on the tree (make sure that you understand what these two methods do).
What can you say about the support of the individual splits in the tree?
2. Switch to distance analysis and repeat the analysis. Compare the results
to the MP-analysis. Why do these two criteria differ in the robustness of
the reconstructed tree?
3. Repeat the bootstrap and jackknife analysis with the alignment dataset 1protein.nxs.
Why does MP perform better than with the DNA sequence
data?
8
4. ! Download the sequences drome-protein.fa, aedae.fa and anoga.fa from
h t t p : / / w w w . c i b i v . a t / ~ i n g o / B r n o / e x e r c i s e s / d a t a s e t l / p r o t e i n /
Add these sequences to the sequences in datasetl-protein.fa, perform a
clustalw-alignment, convert to nexus format and perform Distance-tree
reconstruction.
5. ! Do the three sequences form a monophyletic clade?
6. All sequences that you have analyzed so far, with the exception of Ciona
intestinalis, are vertebrates. By looking at the tree, can you decide whether
the three new sequences are from vertebrate species or not? Furthermore,
by looking at the tree, can you decide whether the three sequences
come from different individuals of a single species, or from three different
species? Explain.
7. Visualize the tree using the command describetrees. Note, you need to
specify what tree should be described. Try the command describetrees ?.
8. What is the branch length of the Takifugu rubripes branch?
4 Maximum Likelihood with Phylip
All of the following exercises will be using the package phylip. Phylip is not a
single program, but rather a group of programs that all solve smaller problems.
For example the program dnaml finds the ML tree from DNA sequences while
proml uses protein sequences. You can find a rather large introduction at:
h t t p : / / w w w . c i b i v . a t / ~ g r e g / P h y l i p / p h y l i p 2 . p d f .
The Web page is:
h t t p : / / e v o l u t i o n . g e n e t i c s . W a s h i n g t o n . e d u / p h y l i p . h t m l
Phylip, like many other programs, uses slightly different file formats. You can
use clustalw to convert alignments from one format to another. We will need
to convert clustalw alignment files to phylip files. This can be done as follows:
clustalw -convert -output=phylip - i n f i l e = i n . a l n - o u t f i l e = o u t . p h y
Phylip also tends to be very sensitive with file formats. The real problem is
that when a file format is wrong it will not tell you, but the program will simply
crash. So if something is not working it is very likely the file format.
9
Phylip writes its output to the same file names, regardless of what the program
does. So read the warnings and remember to move/copy output files before
running other phylip programs.
4.1 Short Phylip command List
Command description
dnaml Finds a ML tree from DNA sequences
dnamlk same as dnaml only with a molecular clock
proml Finds a ML tree from Amino Acid sequences
seqboot generates bootstrap replicates
treedist Compares trees
retree allows tree manipulation i.e rerooting
consense combines a set of trees into a consense tree
4.2 Small trees
1. Using the alignment file datasetl-genomic.aln from the previous exercise.
Convert the file to phylip format, then use dnaml to find a ML tree. Do
this with global rearrangements on and off as well as with both the rough
method on and off.
2. There is a option to Randomise input order of sequences. Why do you
think this is provided. Does it change anything? Don't forget that dnaml
will overwrite old output files. Also in order to compare trees you can use
the treedist program. However these are small enough to just look.
3. Compare the resulting tree with the Maximum Parsimony tree and comment
on differences. Would you expect there to be large differences?
4. Find the ML tree with the truncated sequence data (datasetl-genomictrunc.aln)
and compare to original tree. Do the options change any de-
tails?
4.3 Larger Trees, and Bootstraps
When using rate variation models, use a Coefficient of variation of substitution
rate among sites of 1. When you are asked about HMM categories, use 4. This
is just the number of categories in other programs.
10
1. Using the datasetl-transcript.aln dataset, find the ML tree with dna se-
quences.
2. Again use global rearrangements and randomise the input sequence order.
Does this change the tree? II so, why?
3. Use the gamma model for rate variation among sites. What is different
about the tree? Consider branch lengths when comparing.
4. Does the result change much when the transition transversion ratio is
changed? Explain.
5. Use the program seqboot to generate 100 re-sampled dataset replicates.
6. Find the ML trees for all 100 datasets by using the "Analyze multiple data
sets?" option in dnaml. Do not use rate variation across sights.
7. Find the consense tree of the 100 bootstrap samples with consense. Use
both the MRe and the strict option if appropriate.
8. Repeat the bootstrap analysis while including rate variation. Which gives
better bootstrap support? Why do you think this is the case?
9. Carry out a jackknife again by using seqboot to generate 100 replicates.
Compare the support with a normal bootstrap.
10. Now use the Amino Acid sequence data from datasetl-protein-ext.aln.
Using global and non-rough options find the ML tree proml. Compare to
the DNA based tree.
11. Estimate the tree with Rate variation. Compare to previous results.
12. Carry out a normal bootstrap and a jackknife on the protein sequences in
the same way as before with the protein sequences. How does the support
compare to the DNA bootstrap and jackknife results.
4.4 Extra Exercises
1. Estimate ML trees from both datasetl-transcript.aln and datasetl-proteinext.aln
using a different ML program. IPQNNI and phyml and Tree Puzzle
are all examples. Phyml has a web site http://atgc.lirmm.fr/phyml/
11
and Tree Puzzle is installed locally. Compare to previous results and explain
why there are difference if any.
2. Carry out bootstraps with the program that you chose in the previous
question. How do the results compare to the previous results and also
how do the DNA to protein tree compare?
12