CodeML: Behind Selectome
[bottom]
Theoretical principles:
CodeML is a program from the package
PAML.
It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide)
alignment, according to the phylogenetic tree. The parameter of interest to us is
ω = dN/dS,
which indicates selective pressure.
Selectome uses the
branch-site model, which estimates different dN/dS values among branches and among sites.
A branch of interest is selected and called "
foreground branch". All other branches in the tree
are the "
background" branches. The background branches share the same distribution of ω = dN/dS value among
sites, whereas different values can apply to the foreground branch.
Two models are computed: a null model (H0), in which the foreground branch may have different proportions of
sites under neutral selection than the background (i.e. relaxed purifying selection), and an alternative model (H1),
in which the foreground branch may have a proportion of sites under positive selection.
As the alternative model is the general case, it is easier to present it first.
Alternative model (H1): three classes of sites on the foreground branch
ω
0: dN/dS < 1
ω
1: dN/dS = 1
ω
2: dN/dS ≥ 1
K0 : Proportion of sites that are under purifying selection (ω
0 < 1) on both foreground and background branches.
K1 : Proportion of sites that are under neutral evolution (ω
1 = 1) on both foreground and background branches.
K2a: Proportion of sites that are under
positive selection (ω2 ≥ 1) on the foreground branch and under
purifying selection (ω
0 < 1) on background branches.
K2b: Proportion of sites that are under
positive selection (ω2 ≥ 1) on the foreground branch and under
neutral evolution (ω
1 = 1) on background branches.
For each category, we get the proportion of sites and the associated dN/dS values.
Null model (H0) (ω2 fixed to 1):
K0 : Sites that are under purifying selection (ω
0 < 1) on both foreground and background branches.
K1 : Sites that are under neutral evolution (ω
1 = 1) on both foreground and background branches.
K2a: Sites that are under
neutral evolution (ω2 = 1) on the foreground branch and under
purifying selection (ω
0 < 1) on background branches.
K2b: Sites that are under
neutral evolution (ω2 = 1) on the foreground branch and under
neutral evolution (ω
1 = 1) on background branches.
For each model, we get the log likelihood value (lnL
1 for the alternative and lnL
0 for the null models),
from which we compute the Likelihood Ratio Test (LRT).
The 2×(lnL
1-lnL
0) follows a χ² curve with degree of freedom of 1, so we can get a
p-value for this LRT.
Files Preparation:
We need four files to run CodeML:
1) The multiple nucleotide (CDS) alignment, in PHYLIP format:
ENSGT00390000005194.Euteleostomi.001.phy
2) The phylogenetic tree with the branch of interest specified by "#1":
ENSGT00390000005194.Euteleostomi.001.034.nwk
3) A command file where all parameters to run CodeML under the alternative model are specified:
ENSGT00390000005194.Euteleostomi.001.034.H1.ctl
4) A command file where all parameters to run CodeML under the null model are specified:
ENSGT00390000005194.Euteleostomi.001.034.H0.ctl
Alignment file:
In
Phylip format. CodeML will strictly remove any position that contains an undefined "N" nucleotide.
Tree file:
The tree file is in classical
NEWICK forms. We added a "
#1" on the estimated foreground branch.
((((((((a001:0.080601,a002:0.087761):0.098029,a003:0.064073):0.128789,a004:0.151837):0.000005,
a005:0.110686):0.121064,((a006:0.057627,a007:0.105676):0.062503,(((((((((a008:0.067174,a009:0.048846):0.074180,
a010:0.025564):0.034071,a011:0.022204):0.054541,a012:0.022975):0.028882,a013:0.061145):0.022390,a014:0.073039):0.067948,
a015:0.061092):0.029120,(((a016:0.068823,a017:0.038036):0.058143,a018:0.049684):0.026616,(a019:0.050394,
a020:0.123071):0.084815):0.077086):0.018279,(((((a021:0.075107,a022:0.084427):0.048606,a023:0.100325):0.060762,
((a024:0.073681,a025:0.088203):0.074288,a026:0.097719):0.081522):0.060130,(a027:0.024757,a028:0.053627):0.049310):0.074308,
a029:0.036030):0.023405):0.059574):0.090379):0.071366,a030:0.204436):0.037369,a031:0.118374):0.035100,((((a032:0.068424,
a033:0.026283):0.027761,(a034:0.054111,a035:0.056049):0.090568):0.101341,a036:0.031963)#1:0.075484,a037:0.149411):0.055173):0.0;
Run command file (alternative model H1):
We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0).
The branch-site model is specified by setting the
model parameter to 2 (different dN/dS for branches) and the
NSsites
values to 2 (which allows 3 categories for sites: purifying, neutral and positive selection).
- We keep sites with ambiguity data (cleandata = 1). CodeML treats them as missing data.
- We adjust genetic code if required. E.g. mammalian mitochondrial genes will use icode = 1.
- In order to improve CodeML speed and convergence, we run an M0 model before the branch-site model.
This allows us to use an initial/starting-point value for the kappa parameter, as well as already estimated branch lengths for the tree fix_blength = 1.
seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name
treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name
outfile = ENSGT00390000005194.Euteleostomi.001.034.H1.mlc * main result file name
noisy = 0 * 0,1,2,3,9: how much rubbish on the screen
verbose = 1 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
ndata = 1 * specifies the number of separate data sets in the file
clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree
aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
* 7:AAClasses
model = 2 * models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
* models for AAs or codon-translated AAs:
* 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
* 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189)
NSsites = 2 * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs;
* 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ
* 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
* 13:3normal>0
icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below
Mgene = 0 * 0:rates, 1:separate;
fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
kappa = 2.05154 * initial or fixed kappa
fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
omega = 1
getSE = 0 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
Small_Diff = .5e-6 * small value used in the difference approximation of derivatives
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed
method = 0 * 0: simultaneous; 1: one branch at a time
Run command file (null model H0):
The command file for the null model is the same as for the alternative model, except for two parameters (in red):
1) The name of the outfile is different.
2) The dN/dS ratio is fixed to 1 (fix_omega = 1).
seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name
treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name
outfile = ENSGT00390000005194.Euteleostomi.001.034.H0.mlc * main result file name
noisy = 9 * 0,1,2,3,9: how much rubbish on the screen
verbose = 1 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
ndata = 1 * specifies the number of separate data sets in the file
clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree
aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
* 7:AAClasses
model = 2 * models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
* models for AAs or codon-translated AAs:
* 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
* 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189)
NSsites = 2 * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs;
* 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ
* 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
* 13:3normal>0
icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below
Mgene = 0 * 0:rates, 1:separate;
fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
kappa = 2.05154 * initial or fixed kappa
fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate
omega = 1
getSE = 0 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
Small_Diff = .5e-6 * small value used in the difference approximation of derivatives
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed
method = 0 * 0: simultaneous; 1: one branch at a time
Launch CodeML:
In Unix (Linux, MacOSX), this will look like:
codeml ./ENSGT00390000005194.Euteleostomi.001.034.H1.ctl
codeml ./ENSGT00390000005194.Euteleostomi.001.034.H0.ctl
Analyze results:
1) Assign significance of the detection of positive selection on the selected branch:
Two outfiles are produced:
ENSGT00390000005194.Euteleostomi.001.034.H1.mlc (alternative model) and
ENSGT00390000005194.Euteleostomi.001.034.H0.mlc (null model).
In the
ENSGT00390000005194.Euteleostomi.001.034.H0.mlc, we retrieve
only the likelihood lnL
0 value in the following line:
lnL(ntime: 72 np: 76): -23384.936701 +0.000000
In the
ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we retrieve again the
likelihood value lnL
1 in the following line:
lnL(ntime: 72 np: 77): -23372.806179 +0.000000
We can construct the LRT:
2×(lnL1 - lnL0) = 2×(-23372.806179 - (-23384.936701)) = 24.261044
The degree of freedom is 1 (np
1 - np
0 = 77-76).
p-value =
0.00000084123 (under χ²) => significant.
(In SELECTOME, we add another test to control the False Discovery Rate (FDR): the
q-value. But it is relevant only for multiple analysis.)
2) If significant, retrieve sites under positive selection detected by Bayes Empirical Bayes (BEB):
In the
ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we can retrieve sites under positive selection using the BEB method:
Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positive sites for foreground lineages Prob(w>1):
257 A 0.947
293 T 0.620
467 C 0.998**
616 D 0.991**
617 R 0.989*
647 G 0.693
676 P 0.971*
688 S 0.741
738 E 0.532
866 C 0.997**
903 N 0.948
992 I 0.570
Amino acids refer to the first sequence in the alignment.
Position 467 has high probability (99.8%) to be under positive selection. Position 688 has a lower probability (between 50% & 95%) to be under positive selection.
Only positions above 50% are reported.
References:
[top]