ProPhylER Dataflow and Analysis Pipeline II
Release
ProPhylER 1.0 is live now.
News
January 5 2010
The ProPhylER paper is now published in Genome Research
March 12 2010
Searching by name is now supported on the search page
March 12 2010
Searching with hg 18 coordinates for evaluating coding SNPs is now supported
Contacts
prophyler [at] prophyler.org
arend [at] stanford.edu
1. From Individual Protein Sequences to Clusters of Closely Related Homologs
Initializing the clusters with proteins from sequenced genomes, rather than with proteins from GenBank or UniProt, simplifies the incorporation of a phylogenetic criterion during cluster building: instead of dealing with the entire NCBI taxonomy, we only have to consider the handful of species whose genomes have been sequenced and that have a high-quality set of gene predictions. At the time of this writing, the set of species includes human, mouse, chicken, Tetraodon, Ciona intestinalis, Drosophila melanogaster, C. elegans, and some fungi.
At regular intervals, Downloader gets the protein sequences of fully sequenced genomes from Ensembl, SGD, and other genome databases. BookKeeper logs transactions and checks for new sequences.
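The check for new sequences can be illustrated with a minimal sketch. The names sequence_digest and find_new_sequences, and the idea of keeping a ledger of checksums, are assumptions made for this example; the source does not describe how BookKeeper is actually implemented.

```python
import hashlib
from typing import Dict


def sequence_digest(sequence: str) -> str:
    """Fingerprint a protein sequence, ignoring case and whitespace."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def find_new_sequences(downloaded: Dict[str, str],
                       ledger: Dict[str, str]) -> Dict[str, str]:
    """Return sequences that are unseen or changed, and record them.

    `ledger` is a hypothetical bookkeeping table mapping a sequence ID
    to the digest of the sequence seen in the previous download.
    """
    new_entries = {}
    for seq_id, sequence in downloaded.items():
        digest = sequence_digest(sequence)
        if ledger.get(seq_id) != digest:
            new_entries[seq_id] = sequence
            ledger[seq_id] = digest  # log the transaction
    return new_entries
```

Only the entries returned by find_new_sequences would need to be aligned again in a later step, which is one plausible reason to track new sequences at all.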
Using WuBlast, MasterBlaster performs all pairwise alignments. The postsw option is used to compute Smith-Waterman alignments for all pairs of sequences that have a BLAST match. This improves the scoring, which is important for accurate clustering.
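To show what a Smith-Waterman score is, here is a minimal sketch of the local alignment recurrence. It is not WuBlast's implementation: the flat match, mismatch, and gap values are assumptions chosen for brevity, whereas a protein aligner would typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table, since only the score is needed here.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                           # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Because the score never drops below zero, a strongly matching region is scored on its own merits even when the flanking sequence is unrelated.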
On the basis of the all-by-all BLAST results, loose clusters are first built by a simple SingleLinkage algorithm. Then the Normalized MinCut algorithm is applied, in conjunction with a phylogenetic criterion, to split heterogeneous clusters into smaller subclusters that contain only close homologs.
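The single-linkage step can be sketched with a union-find structure: any two sequences joined by a sufficiently strong hit end up in the same loose cluster. The function name, the input layout, and the min_score threshold are assumptions for this example, not details taken from the pipeline.

```python
def single_linkage_clusters(nodes, scored_pairs, min_score):
    """Group nodes that are connected by any chain of hits >= min_score.

    nodes        -- iterable of sequence IDs
    scored_pairs -- iterable of (id_a, id_b, score) tuples
    Returns a sorted list of clusters, each a sorted list of IDs.
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(members) for members in clusters.values())
```

A single weak but above-threshold hit is enough to fuse two otherwise separate groups, which is why single-linkage clusters tend to be loose and need a subsequent cutting step.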
Figure: The Cluster concept in ProPhylER. Nodes are genes, edges are the Smith-Waterman scores of the alignments. A single-linkage cluster is broken in two by the MinCut, which severs the cluster where the overall strength of connections is weakest. This process is iterated until the likelihood that orthologs are being cut off from one another becomes high. Cutting that was too aggressive is reversed by a curator using the ClusterMuster tool.
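The cut described in the figure can be illustrated by evaluating the normalized cut criterion directly. The sketch below tries every bipartition of a small graph, which is only feasible for a handful of nodes; a production pipeline would presumably rely on an approximation. The function names and the edge dictionary layout are assumptions for this example, and the phylogenetic stopping criterion is not modeled.

```python
from itertools import combinations


def normalized_cut(weights, side_a, nodes):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V).

    weights maps an undirected edge (u, v), stored once, to its score.
    """
    side_a = set(side_a)
    cut = assoc_a = assoc_b = 0.0
    for (u, v), w in weights.items():
        if (u in side_a) != (v in side_a):
            cut += w          # edge crosses the partition
            assoc_a += w
            assoc_b += w
        elif u in side_a:
            assoc_a += 2 * w  # internal edge, counted from both endpoints
        else:
            assoc_b += 2 * w
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b


def best_normalized_cut(nodes, weights):
    """Exhaustively find the bipartition with the lowest normalized cut."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = (float("inf"), None, None)
    # Fixing the first node on side A avoids testing mirrored partitions.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = {first, *extra}
            score = normalized_cut(weights, side_a, nodes)
            if score < best[0]:
                best = (score, sorted(side_a), sorted(set(nodes) - side_a))
    return best
```

Normalizing by the total connection strength of each side discourages cuts that merely peel off a single weakly connected gene, in favor of cuts between two internally cohesive groups.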
The results from the MinCut were proofread by a curator using the ClusterMuster interface. A cluster may not have passed muster because it was cut off from a sister cluster when it should not have been. ClusterMuster allows the curator to correct this by merging clusters back together.
When clusters have passed muster, the ClusterAugmenter blasts all of UniProt's eukaryotic sequences against the genome protein sequences, and the UniProt sequences are assigned to their most closely related clusters.
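One simple reading of "most closely related cluster" is the cluster of the best-scoring genome protein hit. The sketch below implements that reading; the best-hit rule, the function name, and the input layout are assumptions for this example, and the actual ClusterAugmenter may use a different rule.

```python
def assign_to_clusters(hits, protein_to_cluster):
    """Assign each query sequence to the cluster of its best-scoring hit.

    hits               -- iterable of (query_id, genome_protein_id, score)
    protein_to_cluster -- maps genome protein IDs to cluster IDs
    Queries whose hits all fall outside known clusters stay unassigned.
    """
    best = {}
    for query, subject, score in hits:
        cluster = protein_to_cluster.get(subject)
        if cluster is None:
            continue  # the hit is to a protein that belongs to no cluster
        if query not in best or score > best[query][0]:
            best[query] = (score, cluster)
    return {query: cluster for query, (_, cluster) in best.items()}
```

For example, a UniProt sequence with hits of 310 to a protein in cluster C1 and 120 to a protein in cluster C2 would be placed in C1.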
After the assignment of UniProt sequences to clusters by ClusterAugmenter, we performed another round of curation (not shown) to check the integrity of the clusters, because UniProt sequences provide additional useful information. In this round, many more clusters were merged and many were split on the basis of preliminary multiple sequence alignments, with the goal of generating optimal clusters.
Complete clusters contain orthologs and closely related paralogs. Most orthologs will have been assigned, as we use reasonably sensitive parameters. However, this does not mean that all of the homologs in a cluster will be useful in a multiple alignment. In Step 2, sequences that poison the alignment will be culled away to give rise to clean alignments of close homologs.
Go to Detailed Dataflow 2: From Clusters to Alignments and Trees
Go back to Dataflow overview

Last updated 8/25/08