PHYLOGENY RECONSTRUCTION: THE
BASICS
Ciona s.Homo
Anopheles g.
C. elegans
C. briggsae
Apis m.
Drosophila m.
Drosophila p.
Danio r.Takifugu r.Tetraodon n.
Gallus g.
Xenopus t.
Monodelphis d. Mus m. Rattus n.
Canis f.Bos t.Maccaca m.Pan t.
Saccharomyces s.
Ciona s.Homo
Anopheles g.
C. elegans
C. briggsae
Apis m.
Drosophila m.
Drosophila p.
Danio r.Takifugu r.Tetraodon n.
Gallus g.
Xenopus t.
Monodelphis d. Mus m. Rattus n.
Canis f.Bos t.Maccaca m.Pan t.
Saccharomyces s.
A Simple Concept of Speciation
The Coalescent Model
 Wright Fisher population model
with constant population size.
 Each generation chooses its
parent at random.
 Pairs of lineages are traced back
to a coalescent event.
 Kingman (1982) developed a
continuous model that allows to
estimate times between the
coalescent events.
 The coalescent rate for any pair
of genetic lineages is proportional to
1/Ne in generations or to 1/ in
substitutions.
The evolution of biological sequences
The evolution of biological sequences
Biological sequences can change by substitutions
...deletions and insertions
At the leafs of the tree we find the contemporary
sequences
a
a
Divergence and Coalescent
What we have...
What we want...
?
Public Data Sources
The Problem: Finding the homologous positions
Seq1: - A C G A
Seq2: T A C G T
Seq3: - A T - T
Seq4: - A T G T
A C G A T A C G T A T T A T G T
Seq1
Seq2
Seq4
Seq3
The objective function
An mathematical function able to measure the biological
quality of an alignment...
The objective function
Related questions:
What should a biologically correct alignment look like?
To what extent can we define and formalize its properties?
An mathematical function able to measure the biological
quality of an alignment...
The objective function
Mathematical
Optimal Alignment
Biologically
Optimal Alignment
minimize
Related questions:
What should a biologically correct alignment look like?
To what extent can we define and formalize its properties?
An mathematical function able to measure the biological
quality of an alignment...
The objective function
A mathematical function meant to measure the
biological quality of an alignment...
!
"(#) = S(ai
i=1
n
$ ,bi )
!
(): the score of the pairwise alignment 
n : length of 
ai : letter of sequence A at position i in 
bi : letter of sequence B at position i in 
The objective function
A mathematical function meant to measure the
biological quality of an alignment...
!
Objective: find  that maximizes ()!
!
"(#) = S(ai
i=1
n
$ ,bi )
The scoring function S, an example
then we look for that alignment, that gives us the highest score
by summing up the column scores S(ai,bj) for all columns of the
alignment.
Given two sequences A ={a1,a2,....,an} and B={b1,b2,....,bm}
and a scoring function S such that
!
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
The scoring function S, an example
For example:
T G C T C G T A
T - - T C A T A
+5 -6 -6 +5 +5 +5 +5 = 11
then we look for that alignment, that gives us the highest score
by summing up the column scores S(ai,bj) for all columns of the
alignment.!
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Given two sequences A ={a1,a2,....,an} and B={b1,b2,....,bm}
and a scoring function S such that
continue.....
T G C T C G T A
T - - T C A T A
+5 -6 -6 +5 +5 +5 +5 = 11
T G C T C G T A
T - T - C A T A
+5 -6 -2 -6 +5 +5 +5 = 4
A1:
A2:
etc...
Why not just scoring all alignments?
* There are far too many
number of possible pairwise alignments:
for two sequences of length N=300 there are 10179
possibilities !
2n
n
"
#
$
%
&
'
Why not just scoring all alignments?
* There are far too many
number of possible pairwise alignments:
for two sequences of length N=300 there are 10179
possibilities !
2n
n
"
#
$
%
&
'
Hence, we need a smart way to cut the computation short, like the
dynamic programming approach for pairwise alignments by
Needleman and Wunsch (1970).
Re-use of previous results
T G C T C G T A
T - - T C A T A
+5 -6 -6 +5 +5 +5 +5 = 11
T G C T C G T A
T - T - C A T A
+5 -6 -2 -6 +5 +5 +5 = 4
A1:
A2:
etc...
Dynamic Programming
A dynamic programming approach usually includes:
 A mathematical description of the (biological) quality of an
solution, i.e. an recursive objective function
 The computation of all intermediate values needed to obtain the
globally optimal solution, thereby avoiding double-computations
 The reconstruction of the globally optimal solution from the
values obtained in the previous step (backtracking)
The Needleman-Wunsch pair-wise alignment
"(i, j) = max
"(i #1, j #1) + S(ai,bj )
"(i, j #1) + S(gap,bj )
"(i #1, j) + S(ai,gap)
$
%
&
'
&
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Scoring function Objective function
The Needleman-Wunsch algorithm
(i,j-1)
(i-1,j)(i-1,j-1)
!
"(i, j) = max
"(i #1, j #1) + S(ai,bj )
"(i, j #1) + S(gap,bj )
"(i #1, j) + S(ai,gap)
$
%
&
'
&
 (i,j) is the
optimal alignment
score up to and
including ai and bj
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Needleman-Wunsch algorithm: Initialization
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
The Needleman-Wunsch algorithm: Recursion
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
The Needleman-Wunsch algorithm: Recursion
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
The Needleman-Wunsch algorithm: Recursion
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Needleman-Wunsch algorithm: Backtrack
*
*
Needleman-Wunsch algorithm: Backtrack
A*
A*
Needleman-Wunsch algorithm: Backtrack
TA*
TA*
Needleman-Wunsch algorithm: Backtrack
Alignment Score: 11*TGCTCGTA*
*T--TCATA*
Smith-Waterman pairwise local alignment
"(i, j) = max
"(i #1, j #1) + S(ai,bj )
"(i, j #1) + S(gap)
"(i #1, j) + S(gap)
0
$
%
&
&
'
&
&
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Smith-Waterman pairwise local alignment
*TCGTA*
*TCATA*
Alignment Score: 18
Affine Gap costs
!
g(l) = go + l" ge
!
"(i, j) = max
"(i #1, j #1) + S(ai,bj )
maxk= 0
i#1
("(k, j) + g(i # k)), gap in B
maxk= 0
j#1
("(i,k) + g( j # k)), gap in A
$
%
&
'
&
!
"(i, j) = max
"(i #1, j #1) + S(ai,bj )
"(i, j #1) + S(gap,bj )
"(i #1, j) + S(ai,gap)
$
%
&
'
&
Alternative Scoring Functions
Blosum62:
PAM250:
Many others...
Exact vs. Heuristic searches
Both, Needleman-Wunsch and Smith-Waterman alignment methods
are exact methods since they guarantee a globally optimal solution
for the optimization problem!
Drawback: Computational expensive, i.e. O(nm) in time and
memory
Exact vs. Heuristic searches
Solutions:
 omit regions from the grid, that cannot contribute to the optimal
alignment (reduction of the search space, by remaining exact)
Exact vs. Heuristic searches
Solutions:
 use of heuristics (more rigorous reduction of the search space,
sacrificing the guaranteed optimal solution for search speed)
Hashing
* Lookup method for finding an alignment
Pos: 1 2 3 4 5 6 7 8 9 10 11
Seq 1: k c s p t a . . . . .
Seq 2: . . . . . a c s p r k
-10-r
066a
--5t
-594p
-583s
-572c
10111k
OffsetPos in Seq 2Pos in Seq 1Amino acid
Hashing
* Lookup method for finding an alignment
Pos: 1 2 3 4 5 6 7 8 9 10 11
Seq 1: k c s p t a . . . . .
Seq 2: . . . . . a c s p r k
-10-r
066a
--5t
-594p
-583s
-572c
10111k
OffsetPos in Seq 2Pos in Seq 1Amino acid
Seq 1: k c s p t a
Seq 2: a c s p r k
Resulting alignment:
What we are really looking for:
How to construct Multiple Sequence Alignments?
Optimal Solution:
Extend Needleman-Wunsch or Smith-Waterman to multiple sequences
How to construct Multiple Sequence Alignments?
Optimal Solution:
Extend Needleman-Wunsch or Smith-Waterman to multiple sequences
How to construct Multiple Sequence Alignments?
Optimal Solution:
Extend Needleman-Wunsch or Smith-Waterman to multiple sequences
But O(nm) in time and memory:
Computationally not feasible... 4 sequences of length 1000 -> 1TB RAM
A new objective function: Sum of Pairs
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq3: AGAAACTT
A new objective function: Sum of Pairs
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq3: AGAAACTT
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq1: AGA--CTA
Seq3: AGAAACTT
Seq2: G-A--CTT
Seq3: AGAAACTT
!
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq1: AGA--CTA
Seq3: AGAAACTT
Seq2: G-A--CTT
Seq3: AGAAACTT
Score: +5 Score: +11 Score: 0
A new objective function: Sum of Pairs
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq3: AGAAACTT
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq1: AGA--CTA
Seq3: AGAAACTT
Seq2: G-A--CTT
Seq3: AGAAACTT
!
S(ai,bj ) =
+5, if ai = bj
"2, if ai # bj
"6, for introduction of a gap
$
%
&
'
&
Seq1: AGA--CTA
Seq2: G-A--CTT
Seq1: AGA--CTA
Seq3: AGAAACTT
Seq2: G-A--CTT
Seq3: AGAAACTT
Score: +5 Score: +11 Score: 0
SUM OF PAIRS SCORE: 16
Progressive Alignment Strategies (ClustalW)
 The sequences are added stepwise. Thus, never more than
two sequences (or multiple sequence alignments) are
simultaneously aligned
 Sequences or MSAs are aligned using Dynamic
Programming
Progressive Alignment Strategies (ClustalW)
Scoring for the alignment of two alignments
respective weights of the sequences x and yx, y
score for aligning position i in sequence x from alignment a to
position j in sequence y from alignment b
number of sequences in alignments a and b, respectivelyn,m
score for aligning column i from alignment (or sequence) a to
column j from alignment or sequence b
(ai,bj):
!
S(ax
i
,by
j
)
!
"(ai
,b j
) =
1
n + m
S(ax
i
,by
j
)
y=1
m
#
x=1
n
# $%x $%y
Scoring for the alignment of two alignments
With sequence weights:
Score = (S(t,v)*15
+ S(t,i)*16
+ S(l,v)*25
+ S(l,i)*26
+ S(k,v)*35
+ S(k,i)*36
+ S(k,v)*45
+ S(k,i)*46)/8
1 peeksavtal
2 geekaavlal
3 padktnvkaa
4 aadktnvkaa
4 egewglvlhv
5 aaektkirsa
!
"(ai
,b j
) =
1
n + m
S(ax
i
,by
j
)
y=1
m
#
x=1
n
# $%x $%y
Features of ClustalW
 progressive strategy
 Distance based generation of a guide tree (approximative or exact)
 tree-guided (NJ) alignment
 change of the scoring matrix as the alignment proceeds
(adaptation to increasing divergence of the sequences
 dynamic variation of gap penalties in position- and residue-specific
manner
* gap opening penalties are locally reduced in stretches of 5 or
more hydrophilic residues (indicative of loop or random coil
regions).
* gap penalties are locally increased within eight residues of
existing gaps.
 sequence weighting
(Known) Problem of ClustalW: Local Optima
a.k.a: Once a gap always a gap
Iterative Alignment Strategy
set of
sequences
initial
(suboptimal)
alignment
refined
alignment
optimized
alignment
refinement
checkno!
yes!
convergence?
END
Stochastic Iterative Alignment
set of
sequences
initial
(suboptimal)
alignment
modified
alignment
optimized
alignment
random
modification
assess
score
no!
yes!
convergence?
END
acceptance
function
accept
reject/revert alignment
Stochastic Iterative Alignment (SAGA)
set of
sequences
initial
(suboptimal)
alignment(s)
modified
alignment
optimized
alignment
random
modification
assess
score
no!
yes!
convergence?
END
acceptance
function
accept
reject/revert alignment
Genetic Algorithm:
 Alignments evolve by `mutation' and
crossing over
 alignments score determines fitness
 over the generations, alignments
survive and reproduce or die
Non-Stochastic Iterative Alignment
Point: The initial alignment is modified by splitting it into two
groups and re-aligning them with dynamic programming.
Example: Prrp, both, alignment (inner loop) and tree/weight (outer loop) are optimized.
Consistency based algorithm
Point: The optimal MSA is defined as the one that agrees the
most with all optimal pair-wise alignments
Features:
 does not depend on a specific substitution rate
 can apply any method capable to align two sequences
 position dependent, i.e. the score associated with the alignment
of two residues depends on their position within the sequence
rather that their individual nature
 rationale: given a set of independent observations, the
constellation most often observed is often closer to the truth
Consistency based Objective Function For alignEment Evaluation (COFFEE)
The Principle of T-Coffee
Position specific substitution matrix
The score of each pair of residues
depends on the compatibility of
this pair with the rest of the library
A comparison
Reconstructing
Trees from
Sequences
Reconstructing the tree of life
Notations
How many possible trees are there?
How many possible trees are there?
!
b(n) =
(2n " 5)!
2n"3
(n " 3)!
!
b(10) = 2027025
!
b(55) = 2.9 "1084
!
b(100) =1.7 "10182
Finding the root of the tree
Finding the root of the tree
Outgroup-routing Midpoint-routing
Tree Formats
Three typical representations of the same tree
(((((CIOIN:0.4222,HOMSA:0.2777)97:0.0575,(ACRMI:0.2611,HYD
MA:0.3700)100:0.0745)100:0.0764,DROME:0.4200)100:0.1034,
CAEEL:0.6027):0.5804,SACCE:0.5832);
#NEXUS
begin taxa;
dimensions ntax=7;
taxlabels
DROME
CIOIN
HYDMA
SACCE
CAEEL
ACRMI
HOMSA
;
end;
begin trees;
tree [&r] tree_1 = (((((CIOIN:0.4222,HOMSA:0.2777)
[&label=97]:0.0575,
(ACRMI:0.2611,HYDMA:0.37)[&label=100]:0.0745)
[&label=100]:0.0764,
DROME:0.42)[&label=100]:0.1034,CAEEL:0.6027):0.5804,
SACCE:0.5832);
end;
begin figtree;
set appearance.backgroundColour=#-1;
end figtree;
Complex trees
Tree display is an unsolved problem
Some more notations
Monophyletic group
(clade, sistergroup)
Chimpanzee Human Gorilla
Paraphyletic group
(e.g. reptiles)
Birds Crocodiles Lizards Turtles
Storks Birds of Prey Old world
vultures
New world
vultures
Polyphyletic group
Character based phylogeny reconstruction:
 A character has to be expressed in at
least two states in the taxa under study.
Taxa are grouped on the basis of shared
character states.
 An evolutionary derived character (state)
is called an Apomorphy
 Aut-Apomorphy: an evolutionary derived
character (state) present only in a single
taxon
 Syn-Apomorphy: an evolutionary derived
character (state) shared by a group of
taxa.
 Plesiomorphy: an ancestral character
(state) shared by a group of extant taxa.
 Homoplasy: A derived character (state)
that is shared for reasons other than
common decent.
The Parsimony Principle
A rule in science and philosophy stating that
entities should not be multiplied needlessly.
This rule is interpreted to mean that the
simplest of two or more competing theories is
preferable and that an explanation for
unknown phenomena should first be
attempted in terms of what is already known.
Also called law of parsimony. (Ockham`s
razor, ca 1285-1350)
The Parsimony Principle
A,B A,B,C A,C
A,B
A,B,C
+C
-B
A,B A,B,C A,C
A,C
+B,-C
+B
The Parsimony Principle
A,B A,B,C A,C
A,B
A,B,C
+C
-B
A,B A,B,C A,C
A,C
+B,-C
+B
How to infer a tree from the data
The Criterion of Maximum Parsimony
The Criterion of Maximum Parsimony
Find the tree  that minimizes the following expression:
where diff measures the distance between two characters
j is an alignment specific weight factor
L alignment length
B number of branches in the tree
k` and k`` are the two nodes connected by branch k
The Criterion of Distance (Hamming Distance)
caactgattattcacseq 4
tagccctttgaacgcseq 3
tagccctttaaatgcseq 2
tcattgtccattcgaseq 1
09108Seq 4
90211Seq 3
102011Seq 2
811110Seq 1
Seq 4Seq 3Seq 2Seq 1 !
Distance-based tree reconstruction
Find branch lengths L(b) such that the sum of the branch
lengths connecting any two leaves gets close to the measured
distances between all pairs of leaves. That is
Dmeasured (A,B)= L(1)+L(2)+L(3)+L(4)
A BD C
L(1)
L(2)
L(3)
L(4)
Clustering Methods
UPGMA = Unweighted Pair Group Methods using
Arithmetic means.
The ultrametric condition implies the molecular
clock
Clustering methods work well, if sequences evolve
according to a molecular clock
or equivalently: if the ultrametric inequality is holds:
d(A,B) " max d(A,C),d(B,C){ }
for each triple (A,B,C)
Theorem: Four-Point-Condition
A distance matrix (di,j)i,j=1....n is representable as a tree, if and only if
!
d(u,v)+ d(x,z) " max d(u,x)+ d(v,z),d(u,z)+ d(v,x){ }
for all { }nzxvu ,...,2,1,,, !
Getting rid of the molecular clock requirement
Theorem: Four-Point-Condition
A distance matrix (di,j)i,j=1....n is representable as a tree, if and only if
!
d(u,v)+ d(x,z) " max d(u,x)+ d(v,z),d(u,z)+ d(v,x){ }
for all { }nzxvu ,...,2,1,,, !
2.5
3.5
2
2.5
8.5
The Neighbour Joining Algorithm
1. begin with star tree:
2. compute for each pair (1,2) the net-divergence
3. take the pair (A,B) that minimizes Eq. (1)
(1)
The Neighbour Joining Algorithm
4. cluster (A,B) and define an interior node W
5. compute branch lengths for the external edges:
L(A,W ) =
1
2
D(A,B) +
1
m " 2
D(A,k) " D(B,k)
k=1
m
#
$
%
&
'
(
)
L(B,W ) =
D(A,B)
2
" L(A,W )
The Neighbour Joining Algorithm
6. compute distance W to the remaining m-2 leaves:
!
D(W ,k) =
1
2
D(A,k) + D(B,k) " D(A,B)( )
7. continue with step 1 with the reduced set of leaves
The Neighbour Joining Algorithm
Least-squares Methods
Find a tree  that minimizes
!
S(") = #(i,k) $ D(i,k)( )
i,k
%
2
where (i,k) is the length of the unique path
connecting leaves i and k in the tree.
Distance Correction
A G C C G C G C A G
T = 0 A G C C A T G C A G
T = 1
T = 2
T = 3
T = 4
T = 5
T = 6
T = 7
T = 8
T = 9
T =10
A G C C A C G C A G
A G C C C C G C A G
A G C C C G G C A G
A G C A C G G C A G
A G G A C G G C A G
A G G A C T G C A G
A G G A C T G C A A
A G A A C T G C A A
A G A A C C G C A A
substitutions
0
2
4
6
8
10
0 2 4 6 8 10
true substitutions
observerd differences
Substitutions
Jukes Cantor Correction Forumla
!
obs(d) =
3
4
"
3
4
Exp["4d /3]
obs(d) can be estimated from the number of observed
different pairs of positions n1 between two aligned sequences
of length l.
Solving
!
n1
l
=
3
4
"
3
4
Exp["4d /3]
leads to Jukes Cantor correction:
d = "
3
4
Log 1"
4
3
n1
l
#
$%
&
'(
Distance Correction
substitutions
true substitutions
observerd differences
Substitutions
The Problem: Different alignments, different trees
Seq1: - N Y L S
Seq2: N K Y L S
Seq3: - N F - S
Seq4: - N F L S
N Y L S N K Y L S N F S N F L S
Seq1
Seq2
Seq4
Seq3
Seq1: N - Y L S
Seq2: N K Y L S
Seq3: N - F - S
Seq4: N - F L S
Seq1
Seq4
Seq2
Seq3
The Problem: Different alignments, different trees
Seq1: - N Y L S
Seq2: N K Y L S
Seq3: - N F - S
Seq4: - N F L S
N Y L S N K Y L S N F S N F L S
Seq1
Seq2
Seq4
Seq3
The alignment strategy may have more impact
on the reconstructed tree than does the type
of tree building method.
Morrison and Ellis (1997) Mol. Biol. Evol.
14:428-441
Focussing on stable parts of the alignment
Gblocks (Castresana (2000) Mol. Biol. Evol. 17:540-552
Objective:
Define a set of conserved blocks from an alignment to
be used in phylogeny reconstuction
Approach:
1) Classification of Columns
 non-conserved : <n/2 + 1 identical residues, or a gap
 conserved : >n/2 + 1 and < 85% identical residues
 highly conserved :>85% identical residues
2) discard contiguous stretches of non-conserved positions (default l = 8)
3) from remaining blocks: remove flanking positions until blocks begin and
end with highly conserved positions, i.e. selected blocks are anchored by
positions that can be aligned with high confidence
4) discard blocks with l < 15
5) remove all positions with gaps together with adjacent positions until a
conserved position is reached
6) discard blocks with l < 10
Note: all given values are the program defaults as given in the original publication
Focussing on stable parts of the alignment
Hidden paralogy mimics orthology
ma :0)
A typical variant: Weighted Sum of Pairs
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq4: AGAAACTT
Seq1: AGA--CTA
Seq3: G-A--CTT
Seq1: AGA--CTA
Seq4: AGAAACTT
Seq3: G-A--CTT
Seq4: AGAAACTT
SUM OF PAIRS SCORE: 62
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq2: AGA--CTA
Seq4: AGAAACTT
Score: 2*(+5) Score: 2*(+11) Score: 0Score: +30
A typical variant: Weighted Sum of Pairs
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq4: AGAAACTT
Seq1: AGA--CTA
Seq3: G-A--CTT
Seq1: AGA--CTA
Seq4: AGAAACTT
Seq3: G-A--CTT
Seq4: AGAAACTT
SUM OF PAIRS SCORE: 62
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq2: AGA--CTA
Seq4: AGAAACTT
Score: 2*(+5) Score: 2*(+11) Score: 0Score: +30
Seq1/Seq2
Seq4
Seq30.49
0.2
0.29
Weighting of sequences: one variant
Seq1/Seq2
Seq4
Seq30.49
0.2
0.29
Seq1: AGACTA
Seq2: AGACTA
Seq3: GACTT
Seq4: AGAAACTT
Dataset:
-
3
3
-4
-2
-1
421
Compute
Pairwise Distance Matrix
Reconstruct
Seq1: (0.29/2+0.2/3)=0.21
Seq2: (0.29/2+0.2/3)=0.21
Seq3: 0.49
Seq4: (0.29+0.2/3)=0.36
Apply weights
Seq1: 0.43
Seq2: 0.43
Seq3: 1
Seq4: 0.73
Normalize
A typical variant: Weighted Sum of Pairs
!
"wsop (#) = $i
i< j
% $ j S(#i,# j )
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq4: AGAAACTT
Seq1: AGA--CTA
Seq3: G-A--CTT
Seq1: AGA--CTA
Seq4: AGAAACTT
Seq3: G-A--CTT
Seq4: AGAAACTT
SUM OF PAIRS SCORE: 16.7
Seq1: AGA--CTA
Seq2: AGA--CTA
Seq2: AGA--CTA
Seq3: G-A--CTT
Seq2: AGA--CTA
Seq4: AGAAACTT
Score: (0.43x5)2 Score: (0.43x0.73x11)2 Score: 0Score: 0.432x30
Orthologous Sequences, Please!!
Arguments for orthology assumption:
 a sequence tree that is congruent to the
species tree
 conservation of genomic position
 sequence similarity (typically, reciprocal
best blast hit)
 similarity of function
Orthologous Sequences, Please!!
Arguments for orthology assumption:
 a sequence tree that is congruent to the
species tree
 conservation of genomic position
 sequence similarity (typically, reciprocal
best blast hit)
 similarity of function