ProPhylER Dataflow and Analysis Pipeline III

Release

ProPhylER 1.0 is live now.

News

January 5 2010
The ProPhylER paper is now published in Genome Research

March 12 2010
Searching by name is now supported on the search page

March 12 2010
Searching with hg 18 coordinates for evaluating coding SNPs is now supported

Contacts

prophyler [at] prophyler.org
arend [at] stanford.edu

Funded by

NIH/NHGRI

2. From Clusters of Close Homologs to Alignments and Trees

[Figure: dataflow diagram for Step 2. Legend: Programs, Data, Curator on GUI, Automated Script, Results for User, External Databases, Generic Data]

The first phase of Step 2 is critical for the rest of the pipeline: the generation of reliable alignments on which all downstream analyses depend.

Clusters of orthologs can contain sequences that should not be part of an alignment on which predictive analyses are based. Usually, such poison sequences are incorrect gene predictions or partial cDNA sequence translations. Another class of poison sequences is highly diverged orthologs that have undergone so much point and indel evolution that their alignment to the other sequences is not of the quality required for this resource; for such orthologs, the assumption that structure and function have been stable since the last common ancestor with the other cluster members becomes questionable.

Combine builds alignments from all sequences in the cluster. A curator inspected all cluster alignments and removed the poison sequences. Very large clusters were fed into the Protein Alignment Editor (PAE).

PAE allows the hierarchical grouping of obvious subsets of sequences into subalignments, so-called Master Sets and Subsets. A Master Set contains all non-poison sequences of a cluster; Subsets can be any number of hierarchically grouped subalignments that contain a smaller number of sequences. Subsets are useful because some proteins have enough sequences in parts of the overall tree that they can be analyzed separately, yet their connection to the rest of the cluster is maintained by the higher-level Subsets and the Master Set. One example is the breaking out of p53 as a vertebrate-specific Subset from the Master Set that contains p53, p63, and p73 as well as invertebrate orthologs.
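The Master Set and Subset hierarchy described above can be pictured as a small tree of sequence groups. The following is a minimal sketch in Python, not PAE's actual data model: the class name AlignmentSet, its methods, and all sequence IDs are hypothetical and chosen only for illustration. It assumes that every Subset must hold strictly fewer sequences than its parent.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentSet:
    """A named group of sequence IDs; children are nested Subsets (hypothetical model)."""
    name: str
    sequence_ids: frozenset
    children: list = field(default_factory=list)

    def add_subset(self, name, sequence_ids):
        """Attach a Subset that holds strictly fewer sequences than this set."""
        ids = frozenset(sequence_ids)
        if not ids < self.sequence_ids:  # proper-subset check
            raise ValueError(f"{name} is not a proper subset of {self.name}")
        child = AlignmentSet(name, ids)
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, set) pairs from this set down through all Subsets."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Placeholder sequence IDs, not real database identifiers.
master = AlignmentSet(
    "Master Set",
    frozenset({"human_p53", "mouse_p53", "human_p63", "human_p73", "invertebrate_1"}),
)
vertebrate_p53 = master.add_subset("vertebrate p53", {"human_p53", "mouse_p53"})

for depth, group in master.walk():
    print("  " * depth + f"{group.name}: {len(group.sequence_ids)} sequences")
```

Because each Subset keeps a link to its parent through the children list, a separately analyzed group such as the vertebrate p53 Subset stays connected to the full cluster.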

The second phase of Step 2 builds trees, which are necessary (along with the alignments) for the predictive analyses to follow in Step 3.

TreeBuilder calls Semphy to build a maximum likelihood tree for each Master Set and Subset alignment. If the tree requires manual curation, the curator uses TreeClimber (TC), which allows rerooting of the tree and a number of other operations. TC then compares the species tree and the gene tree and automatically annotates each internal node as either an ancestral species or a gene duplication. This results in a tree-based, and therefore objective, determination of orthology and paralogy. The resulting annotated trees are informative in their own right and, like the alignments, are made available via the ProPhylER interface. The next step generates the predictive data that annotate the protein sequences.
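To illustrate how comparing a gene tree with a species tree can label internal nodes, here is a minimal Python sketch of the standard last-common-ancestor (LCA) reconciliation. Whether TreeClimber uses this exact procedure is an assumption; the source does not say. Trees are written as nested tuples, and all gene and species names, as well as the function names, are toy placeholders. Each gene-tree node is mapped to the smallest species clade that contains all of its species; a node that maps to the same clade as one of its children is labeled a duplication, and otherwise an ancestral species (speciation).

```python
def collect_clades(tree, out):
    """Append the leaf set of every node in a nested-tuple species tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def lca(species, species_clades):
    """Return the smallest species clade that contains all of `species`."""
    return min((c for c in species_clades if species <= c), key=len)


def annotate(node, gene_to_species, species_clades, events):
    """Post-order walk of the gene tree; records (genes, label) for each internal node."""
    if isinstance(node, str):
        return frozenset([node]), lca(frozenset([gene_to_species[node]]), species_clades)
    genes, child_maps = frozenset(), []
    for child in node:
        child_genes, child_map = annotate(child, gene_to_species, species_clades, events)
        genes |= child_genes
        child_maps.append(child_map)
    mapped = lca(frozenset().union(*child_maps), species_clades)
    # A node mapping to the same species clade as a child implies a duplication.
    label = "duplication" if mapped in child_maps else "speciation"
    events.append((genes, label))
    return genes, mapped


# Toy input (hypothetical): two paralogs A and B in human and mouse, one fly gene.
species_tree = (("human", "mouse"), "fly")
gene_tree = ((("human_A", "mouse_A"), ("human_B", "mouse_B")), "fly_A")
gene_to_species = {
    "human_A": "human", "mouse_A": "mouse",
    "human_B": "human", "mouse_B": "mouse",
    "fly_A": "fly",
}

species_clades = []
collect_clades(species_tree, species_clades)
events = []
annotate(gene_tree, gene_to_species, species_clades, events)
for genes, label in events:
    print(sorted(genes), label)
```

In this toy case the node joining the A and B groups is labeled a duplication, so A and B genes are paralogs, while human_A and mouse_A, which are joined by a speciation node, are orthologs.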

Go to Detailed Dataflow 3: From Alignments and Trees to Predictive analyses
Go back to Dataflow 1: From Individual Protein Sequences to Clusters of Close Homologs
Go back to Dataflow overview



Last updated 8/20/08