1. Profile Hidden Markov Models

2. Methods for Characterizing a Protein Family
- Objective: given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family.
- Some standard methods for characterization:
  - Multiple alignments
  - Regular expressions
  - Consensus sequences
  - Hidden Markov models

3. A Characterization Example
- How could we characterize this (hypothetical) family of nucleotide sequences?
- Keep the multiple alignment.
- Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]
  - But what about "T G C T - - A G G" vs. "A C A C - - A T C"? The expression accepts both and cannot distinguish between them.
- Try a consensus sequence: A C A - - - A T C
  - Depends on the distance measure.
- Example borrowed from Salzberg, 1998.

4. HMMs to the rescue!
- (Figure: the model with its transition probabilities and emission probabilities.)

5. Insert (Loop) States
- (Figure.)

6. Scoring our simple HMM
- Sequence #1: "T G C T - - A G G" vs. sequence #2: "A C A C - - A T C"
- Regular expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]):
  - #1: member; #2: member
- HMM:
  - #1: score of 0.0023%; #2: score of 4.7% (probability)
  - #1: score of -0.97; #2: score of 6.7 (log odds)

7. Standard Profile HMM Architecture
- Three types of states:
  - Match
  - Insert
  - Delete
- One delete and one match state per position in the model
- One insert state per transition in the model
- Start and end "dummy" states
- Example borrowed from Cline, 1999.

8. Match States
- (Figure; example borrowed from Cline, 1999.)

9. Insert States
- (Figure; example borrowed from Cline, 1999.)

10. Delete States
- (Figure; example borrowed from Cline, 1999.)

11. Aligning and Training HMMs
- Training from a multiple alignment
- Aligning a sequence to a model
  - Can be used to create an alignment
  - Can be used to score a sequence
  - Can be used to interpret a sequence
- Training from unaligned sequences

12. Training from an existing alignment
- This process is what we have been seeing up to this point.
- Start with a predetermined number of states in your HMM.
- For each position in the model, assign a column in the multiple alignment that is relatively conserved.
- Emission probabilities are set according to the amino acid counts in each column (see the sketch after slide 15).
- Transition probabilities are set according to how many sequences make use of a given delete or insert state.

13. Remember the simple example
- Chose six positions in the model.
- The highlighted area was selected to be modeled by an insert state because of its variability.
- Can also do neat tricks for picking the length of the model, such as model pruning.

14. Aligning sequences to a model
- Now that we have a profile model, let's use it!
- Try every possible path through the model that could produce the target sequence.
  - Keep the best one and its probability.
- The Viterbi algorithm has been around for a while:
  - A dynamic-programming-based method
  - Time: O(N*M); space: O(N*M) (assuming a constant number of transitions per state)
  - N = length of the sequence, M = number of states in the HMM

15. So… what do we do with an alignment to a model?
- Align a set of sequences to the model and get a new multiple alignment.
- Align a single sequence to the model and get a numerical score stating how well it fits the model.
  - "Find me all sequences in the database that match family profile X with a log-odds score of at least Y."
- Align a single sequence to the model and get a description of its columns.
  - "Columns 124 and 125 map to insert states of family Y; I wonder what that means?"
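To make slides 12 and 15 concrete, here is a minimal sketch of how match-state emission probabilities could be estimated from conserved alignment columns and how a sequence could then be given a log-odds score against those match states. It is an illustration under simplifying assumptions rather than the full profile-HMM machinery: insert and delete states and all transition probabilities are ignored, the uniform background model and pseudocount are arbitrary choices, and the example columns are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"                         # toy nucleotide alphabet, as in the example
BACKGROUND = {a: 0.25 for a in ALPHABET}  # assumed uniform null model

def match_emissions(columns, pseudocount=1.0):
    """Estimate one match state's emission distribution per conserved column.
    Each column is a string of aligned residues (gaps written as '-');
    pseudocounts keep unseen residues from getting probability zero."""
    states = []
    for col in columns:
        counts = Counter(res for res in col if res != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        states.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return states

def log_odds_score(seq, states):
    """Log-odds score (in bits) of a sequence emitted match state by match state,
    relative to the background model; transitions and insert/delete states
    are ignored here for simplicity."""
    assert len(seq) == len(states)
    return sum(math.log2(emit[res] / BACKGROUND[res])
               for res, emit in zip(seq, states))

# Hypothetical conserved columns (one string per column, one character per sequence).
columns = ["AAAT", "CCCC", "AACA", "AAAA", "TTGT", "GCCC"]
model = match_emissions(columns)
print(log_odds_score("ACAATG", model))
```

In a real profile HMM, the Viterbi algorithm from slide 14 would also consider insert and delete states and transition probabilities, and would pick the best full path through the model rather than scoring the match states alone.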
16. Training from unaligned sequences
- One method (a code sketch of this loop follows the last slide):
  - Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities.
  - Align all the sequences to the model.
  - Use the alignment to alter the emission and transition probabilities.
  - Repeat; continue until the model stops changing.
- By-product: the process also produces a multiple alignment.

17. Training from unaligned sequences, continued
- Advantages:
  - You take full advantage of the expressiveness of your HMM.
  - You might not have a multiple alignment on hand.
- Disadvantages:
  - HMM training methods are local optimizers; you may not get the best alignment or the best model unless you are very careful.
  - This can be alleviated by starting from a reasonable initial model instead of a random one.

18. How do we build a model using only one sequence?

19. Profile HMM Effectiveness Overview
- Advantages:
  - A very expressive profiling method
  - A transparent method: you can view and interpret the model it produces
  - Very effective at detecting remote homologs
- Disadvantages:
  - Slow: a full search over a database of 400,000 sequences can take 15 hours (though not with HMMER 3)
  - You have to avoid over-fitting and locally optimal models

20. pHMM tools
- Tools: SAM, HMMER
- GUIs: HMMVE, UGENE (plugin)
- Database: Pfam
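As a closing illustration of the training loop on slides 16 and 17, the sketch below implements the simplest form of iterate-and-re-estimate training: start from random parameters, decode every sequence with Viterbi, count the transitions and emissions along the best paths, re-normalize, and repeat. To keep it short it trains a small fully connected HMM rather than a true profile HMM with silent delete states, and the number of states, iteration count, pseudocounts, and toy sequences are all arbitrary illustrative choices; real tools such as SAM and HMMER use the profile architecture of slide 7 and their own estimation procedures.

```python
import numpy as np

rng = np.random.default_rng(0)

def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable state path for one sequence (standard dynamic programming)."""
    N, M = len(obs), log_start.shape[0]
    v = np.full((N, M), -np.inf)
    back = np.zeros((N, M), dtype=int)
    v[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, N):
        scores = v[t - 1][:, None] + log_trans      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(v[-1].argmax())]
    for t in range(N - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def viterbi_train(seqs, n_states, n_symbols, n_iter=20, pseudo=1.0):
    """Iterative training from unaligned sequences: start from random parameters,
    decode every sequence, re-estimate from the counts, repeat.  This is a
    local optimizer, as the slides warn."""
    start = rng.dirichlet(np.ones(n_states))
    trans = rng.dirichlet(np.ones(n_states), size=n_states)
    emit = rng.dirichlet(np.ones(n_symbols), size=n_states)
    for _ in range(n_iter):
        sc = np.full(n_states, pseudo)               # start-state counts
        tc = np.full((n_states, n_states), pseudo)   # transition counts
        ec = np.full((n_states, n_symbols), pseudo)  # emission counts
        for obs in seqs:
            path = viterbi(obs, np.log(start), np.log(trans), np.log(emit))
            sc[path[0]] += 1
            for a, b in zip(path, path[1:]):
                tc[a, b] += 1
            for s, o in zip(path, obs):
                ec[s, o] += 1
        start = sc / sc.sum()
        trans = tc / tc.sum(axis=1, keepdims=True)
        emit = ec / ec.sum(axis=1, keepdims=True)
    return start, trans, emit

# Toy usage: three unaligned sequences encoded as symbol indices (A=0, C=1, G=2, T=3).
seqs = [[0, 1, 0, 0, 3, 2], [3, 2, 1, 3, 0, 2, 2], [0, 1, 0, 1, 0, 3, 1]]
start, trans, emit = viterbi_train(seqs, n_states=3, n_symbols=4)
```

Because this loop is a local optimizer, different random initializations can converge to different models, which is exactly the caveat raised on slide 17 and the reason a sensible starting model helps.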