ProPhylER Dataflow and Analysis Pipeline I

Overview

ProPhylER's data were generated by a dataflow that consists of three main parts:

  1. Generation of clusters of orthologs and closely related paralogs (curated)
  2. Construction of alignments and trees (curated)
  3. Computation of rates, constraints, and predictions (automated)

Cluster generation is based on Smith-Waterman alignments of pairs of sequences and is tuned to identify likely orthologs and close paralogs in whole-genome proteomes while excluding distant paralogs. Two manual curation steps ensured that each cluster contains enough sequences to be informative but no divergent paralogs. Once the genome protein clusters are built, they are augmented with UniProt sequences, which contribute substantial additional information, both through their annotations and because many UniProt sequences come from species whose genomes have not been sequenced. A further manual curation step at this point ensured the integrity and specificity of the clusters.
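To illustrate the idea of clustering from pairwise Smith-Waterman scores, here is a minimal sketch in Python. It is not ProPhylER's implementation: the simple match/mismatch/gap scores (rather than a protein substitution matrix), the `min_score` threshold, the single-linkage grouping, and the function names `smith_waterman_score` and `cluster_by_score` are all assumptions made for illustration only.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not ProPhylER's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def cluster_by_score(seqs, min_score):
    """Group sequences by single linkage: two sequences share a cluster
    if their pairwise score reaches min_score (a hypothetical cutoff)."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(seqs, 2):
        if smith_waterman_score(seqs[x], seqs[y]) >= min_score:
            parent[find(x)] = find(y)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "GGGGGWWWWW"}
    print(cluster_by_score(toy, min_score=10))
```

In this toy input, sequences `a` and `b` differ only in their last residue and end up in one cluster, while `c` shares no residues with either and stays alone. A stricter threshold plays the role that tuning plays in the text: it keeps close homologs together while leaving more distant relatives out.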

From this point on, we require robust alignments of orthologs and close paralogs that

This was enforced by manual curation of the alignments, in which sequences that violated ProPhylER's high standards for alignment specificity were removed.

For each alignment, trees are then built automatically. After treebuilding, curator input may also be required. ProPhylER's post-treebuilding algorithms are then applied to

A number of large clusters have undergone additional curation in which subclusters were generated that can be viewed in the same ProPhylER session, allowing comparisons between different phylogenetic scopes or paralogs.

Detailed Dataflow Descriptions

1. From Individual Protein Sequences to Clusters of Closely Related Homologs

2. From Clusters of Close Homologs to Alignments and Trees

3. From Alignments and Trees to Profiles, Constrained Regions, and Mutation Impact Scores



Last updated 8/25/08

Release

ProPhylER 1.0 is live now.

News

January 5 2010
The ProPhylER paper is now published in Genome Research

March 12 2010
Searching by name is now supported on the search page

March 12 2010
Searching with hg18 coordinates for evaluating coding SNPs is now supported

Contacts

prophyler [at] prophyler.org
arend [at] stanford.edu

Resource Links

Ensembl
UniProt
PDB
WuBlast
Probcons
Semphy
Jmol
Java

Other Links

Sidow Lab
Stanford Pathology Dept
Stanford Genetics Dept
Stanford School of Medicine

Funded by

NIH/NHGRI