Selectome © 2008/2018 |

- CodeML: Behind Selectome
- Filtering steps
- Software releases

[bottom]

Theoretical principles:

CodeML is a program from the package PAML.

It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, according to the phylogenetic tree. The parameter of interest to us is

Selectome uses the

A branch of interest is selected and called "

Two models are computed: a null model (H0), in which the foreground branch may have different proportions of sites under neutral selection than the background (i.e. relaxed purifying selection), and an alternative model (H1), in which the foreground branch may have a proportion of sites under positive selection.

As the alternative model is the general case, it is easier to present it first.

ω

ω

ω

K0 : Proportion of sites that are under purifying selection (ω

K1 : Proportion of sites that are under neutral evolution (ω

K2a: Proportion of sites that are under

K2b: Proportion of sites that are under

For each category, we get the proportion of sites and the associated dN/dS values.

K0 : Sites that are under purifying selection (ω

K1 : Sites that are under neutral evolution (ω

K2a: Sites that are under

K2b: Sites that are under

For each model, we get the log likelihood value (lnL

The 2×(lnL

Files Preparation:

We need four files to run CodeML:

1) The multiple nucleotide (CDS) alignment, in PHYLIP format: ENSGT00390000005194.Euteleostomi.001.phy

2) The phylogenetic tree with the branch of interest specified by "#1": ENSGT00390000005194.Euteleostomi.001.034.nwk

3) A command file where all parameters to run CodeML under the alternative model are specified: ENSGT00390000005194.Euteleostomi.001.034.H1.ctl

4) A command file where all parameters to run CodeML under the null model are specified: ENSGT00390000005194.Euteleostomi.001.034.H0.ctl

Alignment file:

In Phylip format. CodeML will strictly remove any position that contains an undefined "N" nucleotide.

Tree file:

The tree file is in classical NEWICK forms. We added a "

((((((((a001:0.080601,a002:0.087761):0.098029,a003:0.064073):0.128789,a004:0.151837):0.000005, a005:0.110686):0.121064,((a006:0.057627,a007:0.105676):0.062503,(((((((((a008:0.067174,a009:0.048846):0.074180, a010:0.025564):0.034071,a011:0.022204):0.054541,a012:0.022975):0.028882,a013:0.061145):0.022390,a014:0.073039):0.067948, a015:0.061092):0.029120,(((a016:0.068823,a017:0.038036):0.058143,a018:0.049684):0.026616,(a019:0.050394, a020:0.123071):0.084815):0.077086):0.018279,(((((a021:0.075107,a022:0.084427):0.048606,a023:0.100325):0.060762, ((a024:0.073681,a025:0.088203):0.074288,a026:0.097719):0.081522):0.060130,(a027:0.024757,a028:0.053627):0.049310):0.074308, a029:0.036030):0.023405):0.059574):0.090379):0.071366,a030:0.204436):0.037369,a031:0.118374):0.035100,((((a032:0.068424, a033:0.026283):0.027761,(a034:0.054111,a035:0.056049):0.090568):0.101341,a036:0.031963)#1:0.075484,a037:0.149411):0.055173):0.0;

Run command file (alternative model H1):

We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0).

The branch-site model is specified by setting the model parameter to 2 (different dN/dS for branches) and the NSsites values to 2 (which allows 3 categories for sites: purifying, neutral and positive selection).

- We keep sites with ambiguity data (
*cleandata = 1*). CodeML treats them as missing data. - We adjust genetic code if required. E.g. mammalian mitochondrial genes will use
*icode = 1*. - In order to improve CodeML speed and convergence, we run an M0 model before the branch-site model.

This allows us to use an initial/starting-point value for the kappa parameter, as well as already estimated branch lengths for the tree*fix_blength = 1*.

seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name outfile = ENSGT00390000005194.Euteleostomi.001.034.H1.mlc * main result file name noisy = 0 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table ndata = 1 * specifies the number of separate data sets in the file clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a * 7:AAClasses model = 2 * models for codons: * 0:one, 1:b,2:2 or more dN/dS ratios for branches* models for AAs or codon-translated AAs: * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) NSsites = 2 * 0:one w; 1:neutral;2:positive selection;3:discrete; 4:freqs; * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1; * 13:3normal>0 icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below Mgene = 0 * 0:rates, 1:separate; fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2.05154 * initial or fixed kappa fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate omega = 1 getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .5e-6 * small value used in the difference approximation of derivatives cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed method = 0 * 0: simultaneous; 1: one branch at a time

Run command file (null model H0):

The command file for the null model is the same as for the alternative model, except for two parameters (in red):

1) The name of the outfile is different.

2) The dN/dS ratio is fixed to 1 (fix_omega = 1).

seqfile = ENSGT00390000005194.Euteleostomi.001.phy * sequence data file name treefile = ENSGT00390000005194.Euteleostomi.001.034.nwk * tree structure file name outfile = ENSGT00390000005194.Euteleostomi.001.034.H0.mlc * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 1 * 1: detailed output, 0: concise output runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table ndata = 1 * specifies the number of separate data sets in the file clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a * 7:AAClasses model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches * models for AAs or codon-translated AAs: * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) NSsites = 2 * 0:one w; 1:neutral; 2:positive selection; 3:discrete; 4:freqs; * 5:gamma; 6:2gamma; 7:beta; 8:beta&w; 9:betaγ * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1; * 13:3normal>0 icode = 0 * 0:universal code; 1:mammalian mt; 2-11:see below Mgene = 0 * 0:rates, 1:separate; fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2.05154 * initial or fixed kappa fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate omega = 1 getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .5e-6 * small value used in the difference approximation of derivatives cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed method = 0 * 0: simultaneous; 1: one branch at a time

Launch CodeML:

In Unix (Linux, MacOSX), this will look like:

codeml ./ENSGT00390000005194.Euteleostomi.001.034.H1.ctl codeml ./ENSGT00390000005194.Euteleostomi.001.034.H0.ctl

Analyze results:

Two outfiles are produced:

ENSGT00390000005194.Euteleostomi.001.034.H1.mlc (alternative model) and ENSGT00390000005194.Euteleostomi.001.034.H0.mlc (null model).

In the ENSGT00390000005194.Euteleostomi.001.034.H0.mlc, we retrieve only the likelihood lnL

lnL(ntime: 72 np: 76): -23384.936701 +0.000000

In the ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we retrieve again the likelihood value lnL

lnL(ntime: 72 np: 77): -23372.806179 +0.000000

We can construct the LRT:

The degree of freedom is 1 (np

p-value =

(In SELECTOME, we add another test to control the False Discovery Rate (FDR): the

In the ENSGT00390000005194.Euteleostomi.001.034.H1.mlc, we can retrieve sites under positive selection using the BEB method:

Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118) Positive sites for foreground lineages Prob(w>1): 257 A 0.947 293 T 0.620 467 C 0.998** 616 D 0.991** 617 R 0.989* 647 G 0.693 676 P 0.971* 688 S 0.741 738 E 0.532 866 C 0.997** 903 N 0.948 992 I 0.570Amino acids refer to the first sequence in the alignment.

Position 467 has high probability (99.8%) to be under positive selection. Position 688 has a lower probability (between 50% & 95%) to be under positive selection.

Only positions above 50% are reported.

References:

[top]