Explanation

Gn.Ex : gene number, exon number (for reference)
Type  : Init = Initial exon (ATG to 5' splice site)
        Intr = Internal exon (3' splice site to 5' splice site)
        Term = Terminal exon (3' splice site to stop codon)
        Sngl = Single-exon gene (ATG to stop)
        Prom = Promoter (TATA box / initiation site)
        PlyA = poly-A signal (consensus: AATAAA)
S     : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End   : end point of exon or signal (numbered on input strand)
Len   : length of exon or signal (bp)
Fr    : reading frame (a forward strand codon ending at x has frame x mod 3)
Ph    : net phase of exon (exon length modulo 3), i.e. the position of the
        intron relative to the ORF of the exon (0, 1 or 2)
I/Ac  : initiation signal or 3' splice site score (tenth bit units)
Do/T  : 5' splice site or termination signal score (tenth bit units)
CodRg : coding region score (tenth bit units)
P     : probability of exon (sum over all parses containing exon)
Tscr  : exon score (depends on length, I/Ac, Do/T and CodRg scores)
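The arithmetic behind the Len, Ph and Fr columns can be sketched in a few lines of Python. This is only an illustration of the definitions given above, not GENSCAN's own code. The function names and the example coordinates are hypothetical. The sketch also assumes that Begin and End are inclusive 1-based positions, and it uses abs() so that a feature listed with Begin greater than End would still get a positive length.

```python
def exon_length(begin: int, end: int) -> int:
    """Length in bp of a feature, assuming inclusive 1-based coordinates."""
    # abs() is an assumption that tolerates Begin > End in a row.
    return abs(end - begin) + 1


def net_phase(begin: int, end: int) -> int:
    """Ph column: exon length modulo 3 (0, 1 or 2)."""
    return exon_length(begin, end) % 3


def forward_frame(codon_end: int) -> int:
    """Fr column: a forward strand codon ending at x has frame x mod 3."""
    return codon_end % 3


# Hypothetical coordinates, chosen only to exercise the formulas.
length = exon_length(1201, 1350)   # 150 bp
phase = net_phase(1201, 1350)      # 150 % 3 == 0
frame = forward_frame(1204)        # 1204 % 3 == 1
```

An exon with Ph = 0 contributes a whole number of codons; a nonzero value means a codon is split across the adjacent intron.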
Comments

The SCORE of a predicted feature (e.g., exon or splice site) is a log-odds
measure of the quality of the feature based on local sequence properties.
For example, a predicted 5' splice site with score > 100 is strong; 50-100 is
moderate; 0-50 is weak; and below 0 is poor (more than likely not a real donor
site).

The PROBABILITY of a predicted exon is the estimated probability under
GENSCAN's model of genomic sequence structure that the exon is correct. This
probability depends in general on global as well as local sequence properties,
e.g., it depends on how well the exon fits with neighboring exons. It has been
shown that predicted exons with higher probabilities are more likely to be correct
than those with lower probabilities.
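The score bands quoted above for 5' splice sites can be written as a small lookup. The sketch below is a hypothetical helper, not part of GENSCAN. The text does not say which band the boundary values 100, 50 and 0 belong to, so assigning each of them to the lower of its two neighbouring bands is an assumption.

```python
def classify_donor_score(score: float) -> str:
    """Map a 5' splice site score to the qualitative bands from the text."""
    if score > 100:
        return "strong"
    if score >= 50:    # assumption: exactly 100 counts as moderate
        return "moderate"
    if score >= 0:     # assumption: exactly 0 counts as weak, not poor
        return "weak"
    return "poor"      # below 0: more than likely not a real donor site


# Hypothetical scores, for illustration only.
labels = [classify_donor_score(s) for s in (120, 75, 10, -5)]
```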
What are the suboptimal exons?
Under the probabilistic model of gene structural and compositional properties
used by GENSCAN, each possible "parse" (gene structure description) which is
compatible with the sequence is assigned a probability. The default output of the
program is simply the "optimal" (highest probability) parse of the sequence. The
exons in this optimal parse are referred to as "optimal exons" and the translation
products of the corresponding "optimal genes" are printed as GENSCAN predicted
peptides. (All the data in our J Mol Biol paper and on the other GENSCAN web
pages refer exclusively to the optimal parse/optimal exons.)

Of course, the optimal parse does not always correspond to the actual
(biological) parse of the sequence, that is, the actual set of exons/genes
present. In addition, there may be more than one parse which can be considered
"correct", for example, in the case of a gene which is alternatively transcribed,
translated or spliced. For both of these reasons, it may be of interest to
consider "suboptimal" ("near-optimal") exons as well, i.e. exons which have
reasonably high probability but are not present in the optimal parse.

Specifically, for every potential exon E in the sequence, the probability P(E) is
defined as the sum of the probabilities under the model of all possible "parses"
(gene structures) which contain the exact exon E in the correct reading frame.
(This quantity is calculated as described on the GENSCAN exon probability page.)
Given a probability cutoff C, suboptimal exons are those potential exons with
P(E) > C which are not present in the optimal parse.

Suboptimal exons have a variety of potential uses. First, suboptimal exons
sometimes correspond to real exons which were missed for whatever reason by
the optimal parse of the sequence. Second, regions of a prediction which contain
multiple overlapping and/or incompatible optimal and suboptimal exons may in
some cases indicate alternatively spliced regions of a gene (Burge & Karlin, in
preparation).

The probability cutoff C used to determine which potential exons qualify as
suboptimal exons can be set to any of a range of values between 0.01 and 1.00.
The default value on the web page is 1.00, meaning that no suboptimal exons are
printed. For most applications, a cutoff value of about 0.10 is recommended.
Setting the value much lower than 0.10 will often lead to an explosion in the
number of suboptimal exons, most of which will probably not be useful. On the
other hand, if the value is set much higher than 0.10, then potentially
interesting suboptimal exons may be missed.
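
The definitions of P(E) and of suboptimal exons can be made concrete with a toy example. The sketch below lists a handful of parses explicitly and sums their probabilities, purely for illustration. It is not a description of how GENSCAN computes these values; the text only says the quantity is calculated as described on the GENSCAN exon probability page. The data layout (an exon as a (begin, end, frame) tuple, a parse as a probability plus a set of exons), the function names and all numbers are hypothetical.

```python
from collections import defaultdict


def exon_probabilities(parses):
    """P(E): sum of the probabilities of all parses that contain exon E."""
    totals = defaultdict(float)
    for probability, exons in parses:
        for exon in exons:
            totals[exon] += probability
    return dict(totals)


def suboptimal_exons(parses, cutoff):
    """Exons with P(E) > cutoff that are absent from the optimal parse."""
    # The optimal parse is the single highest-probability parse.
    _, optimal = max(parses, key=lambda parse: parse[0])
    probs = exon_probabilities(parses)
    return sorted(e for e, p in probs.items() if p > cutoff and e not in optimal)


# Hypothetical exons (begin, end, frame) and parse probabilities.
A, B = (100, 200, 1), (300, 400, 0)
C, D = (300, 420, 0), (500, 600, 2)
parses = [(0.65, {A, B}), (0.30, {A, C}), (0.05, {D})]

at_recommended = suboptimal_exons(parses, 0.10)  # only C, since P(C) = 0.30
at_minimum = suboptimal_exons(parses, 0.01)      # C and D
at_default = suboptimal_exons(parses, 1.00)      # empty: nothing exceeds 1.00
```

In this toy case, lowering the cutoff from 0.10 to 0.01 admits the low-probability exon D, which mirrors the trade-off described above.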