Purpose
ProPhylER provides predictive evolutionary analyses of eukaryotic proteins.
Evolutionary constraint is a powerful but underutilized type of data that facilitates analyses of protein structure and function. ProPhylER quantifies evolutionary constraint to annotate functionally or structurally important regions in proteins and to predict the impact of coding polymorphisms. ProPhylER puts comprehensive constraint data on tens of thousands of eukaryotic proteins, represented by hundreds of thousands of individual sequences, at the fingertips of the researcher.
Philosophy
ProPhylER emphasizes stringent clustering of closely related sequences and high alignment quality.
ProPhylER's predictive analyses use state-of-the-art statistical methodology to give you the best results. Before predictions can be generated, the initial sequence data are processed through a data flow involving blasting, clustering, aligning, and tree building. The predictions are only as good as the data produced by that process.
In the course of constructing the ProPhylER data flow and conducting the analyses, we learned that the required data quality could only be achieved by expert curation. The final grouping of eukaryotic sequences into clusters of closely related sequences, and the final alignments of the sequences in each cluster, are the result of several cycles of automation and intensive manual work by curators.
ProPhylER's focus on high-quality multiple alignments of closely related sequences requires extreme stringency in eliminating poorly predicted proteins. That is how ProPhylER ensures that the majority of variation in the alignments is due to actual evolutionary variation and not due to errors in gene prediction or poor initial clustering of sequences.
Features
The ProPhylER Interface displays:
The Crystal Painter displays:
ProPhylER's data presentation is intended to be maximally convenient for the user:
Comprehensiveness
ProPhylER's predictive analyses require a minimum of evolutionary diversity.
If your favorite protein is represented in the public protein sequence databases, and if it is present in a sufficiently diverse set of species, ProPhylER will probably have an analysis for it. A minimum of three closely related sequences and a certain minimum amount of evolutionary change is necessary for representation in ProPhylER.
If your protein is species-specific, or if its homology to other sequences is confined to structural domains, ProPhylER may not have an analysis for it. This is because ProPhylER enriches its data sets for orthologs and closely related paralogs.
The database stats pages summarize ProPhylER's sequence representation. Best-represented are animals. Fungal sequences are present if they have close homologs in fission yeast or budding yeast, or in animals. Plant and protist sequences tend to be underrepresented. As the volume of sequence data increases, more proteins will have analyses in ProPhylER.
Underlying Sequence Data
ProPhylER's protein sequence data come primarily from Ensembl and UniProt.
The process that generates ProPhylER clusters and alignments begins with grouping predicted protein sequences from Ensembl into initial clusters. (While there are many sequenced genomes with comprehensive gene predictions, we focused only on those genomes that have extensive independent cDNA data to support gene predictions.) The initial clusters are then broken up into smaller clusters that are enriched for orthologs. These smaller clusters are augmented with UniProt sequences and examined again for integrity. Depending on their structure, they may be further broken up or merged with closely related clusters.
ProPhylER emphasizes specificity over inclusiveness.
Preliminary alignments are then built from the sequences in each cluster, and examined for quality. Sequences that cause a lot of gaps in the preliminary alignments, such as poor predictions from genomes, translations from investigator-submitted partial sequences, or even highly diverged orthologs, are thrown out to prevent the poisoning of alignments. As a consequence, some sequences you would expect to be present in ProPhylER's alignments may not be included. Then, for each cluster and for all its subgroups, trees are built and analyses of constraint are performed.
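The gap-based filtering step described above can be illustrated with a small sketch. This is not ProPhylER's actual procedure, which combines automation with manual curation; the function names, the "gap burden" score, and the 0.3 cutoff are all assumptions made up for illustration. The idea is to score each aligned sequence by the fraction of columns in which its gap state disagrees with the majority of the alignment, so that both sequences with large missing regions and sequences whose insertions force gaps into everyone else score high.

```python
def gap_burden(alignment):
    """Score each sequence in a preliminary alignment.

    `alignment` maps sequence names to aligned strings of equal length,
    with '-' marking a gap. The score is the fraction of columns where
    the sequence's state (gap vs. residue) differs from the majority.
    This scoring scheme is a hypothetical illustration.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows or not rows[0]:
        return {name: 0.0 for name in names}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    mismatches = {name: 0 for name in names}
    for col in range(length):
        is_gap = [row[col] == "-" for row in rows]
        # A column counts as a "gap column" if more than half the rows have a gap.
        majority_is_gap = sum(is_gap) * 2 > len(rows)
        for name, gap in zip(names, is_gap):
            if gap != majority_is_gap:
                mismatches[name] += 1
    return {name: mismatches[name] / length for name in names}


def flag_gappy(alignment, max_burden=0.3):
    """Return names of sequences whose burden exceeds the (assumed) cutoff."""
    burden = gap_burden(alignment)
    return sorted(name for name, score in burden.items() if score > max_burden)


# Toy example: three full-length sequences and one partial sequence
# with an insertion that forces gaps into the others.
example = {
    "human":   "MKT--LLVA",
    "mouse":   "MKT--LIVA",
    "chick":   "MRT--LLVA",
    "partial": "---GGLL--",
}
print(flag_gappy(example))
```

In this toy alignment only the partial sequence is flagged. In practice, a flagged sequence would be a candidate for review rather than automatic removal, since a real pipeline would realign the cluster after each exclusion.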
To learn more, read the Documentation pages, which contain a wealth of information on the background and mechanics of ProPhylER's analyses.
Notes
ProPhylER contains complex data that are displayed with interactive Java interfaces. Read the Help pages to avoid frustration.
There is currently no way to download all of ProPhylER's data at once. You can only download the data for a single cluster via the Interface. A future version of ProPhylER will allow bulk downloads.
Remember that ProPhylER leverages variation in closely related homologs under the assumption that binding and substrate specificities have not changed during the evolution of the aligned sequences. If you are interested in more distant comparisons, for example between structurally similar domains of functionally distinct proteins, then you may want to use Pfam or other resources instead.
ProPhylER is new. If you find bugs, omissions, or oddities, contact us.
Last updated 10/07/08
Release
ProPhylER 1.0beta is live now.
News
ProPhylER 1.0 is a beta release with all functionality implemented except browsing by name.
Contacts
prophyler [at] prophyler.org
arend [at] stanford.edu
Resource Links
Ensembl
UniProt
PDB
WuBlast
Probcons
Semphy
Jmol
Java
Other Links
Sidow Lab
Stanford Pathology Dept
Stanford Genetics Dept
Stanford School of Medicine
Funded by