Structurally similar proteins need not share significant sequence identity. The early observation of structurally and functionally similar proteins (such as hemoglobin and myoglobin) led to the natural separation of these structures into discrete sets, or folds [1, 2]. However, as more structures were determined and more folds were discovered, it became clear that not all members of a fold were necessarily linked by a common function. In addition, the determination of structures with conserved structural cores surrounded by variable structural regions complicated the classification of new structures into existing folds. This raises the question: to what degree can structural variation be tolerated between a domain and a potential cousin before the two no longer belong to the same fold? Different weightings of the factors that determine structural similarity have led to domain dictionaries with different fold classifications. We can minimize the effect of these idiosyncrasies by deriving a consensus from publicly available domain dictionaries. We have previously demonstrated the application of this method to SCOP, CATH and the Dali Domain Dictionary to generate a consensus domain dictionary [4-6].
SCOP and CATH, gold standards among hierarchical domain dictionaries, have been the subject of detailed comparison. In general, both weigh potential functional and evolutionary relationships between fold members with different strengths at different levels of the hierarchy. In their early formulations, the two domain dictionaries represented different design methodologies. Whereas SCOP was hand-curated by experts, CATH was maintained by a combination of automated processes and expert curation [5, 6]. However, SCOP has adopted more automated pre-classification of new structures in response to the increasing rate of structure determination, minimizing this methodological distinction. There also exist non-hierarchical methods of classifying domains into folds. The Dali Domain Dictionary is one such method; it relies on clustering a set of all-vs-all similarity scores generated between domains using the Dali structural similarity method. Since their inception, all of these domain dictionaries have faced a single overwhelming problem: the rate of structure determination has quickly outpaced the ability of any process to categorize new structures. Different responses have been formulated. Neither SCOP nor CATH still guarantees that a release will cover all structures in the PDB up to the release date; both instead focus on categorizing putative novel topologies.
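The idea of deriving folds by clustering all-vs-all similarity scores, mentioned above for non-hierarchical classification, can be illustrated with a minimal sketch. This is not the procedure used by the Dali Domain Dictionary: it assumes a simple single-linkage rule with a fixed score cutoff, and the function name `cluster_by_similarity`, the domain identifiers, the scores and the threshold are all hypothetical values chosen for illustration.

```python
from itertools import combinations


def cluster_by_similarity(domains, scores, threshold):
    """Group domains by single-linkage clustering of pairwise scores.

    Assumption: two domains are placed in the same cluster whenever a chain
    of pairwise scores at or above `threshold` connects them. Real domain
    dictionaries may use a different linkage rule and cutoff.
    """
    # Union-find structure: each domain starts as its own cluster root.
    parent = {d: d for d in domains}

    def find(d):
        # Follow parent links to the root, shortening the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in combinations(domains, 2):
        # Scores are treated as symmetric; missing pairs count as dissimilar.
        score = scores.get((a, b), scores.get((b, a), 0.0))
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for d in domains:
        clusters.setdefault(find(d), set()).add(d)
    # Return clusters as sorted lists for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical domains and similarity scores (illustrative values only).
domains = ["d1", "d2", "d3", "d4"]
scores = {("d1", "d2"): 12.0, ("d2", "d3"): 9.5, ("d3", "d4"): 1.2}
print(cluster_by_similarity(domains, scores, threshold=8.0))
# → [['d1', 'd2', 'd3'], ['d4']]
```

Under single linkage, d1 and d3 fall into the same cluster through d2 even though they have no direct score above the cutoff. This shows how the choice of linkage rule and threshold alone can shift fold boundaries, one source of the disagreement between classifications that a consensus is intended to reduce.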
The consensus domain dictionary (CDD) is the backbone of our Dynameomics mass molecular dynamics initiative. It is the basis for our selection of a topologically diverse sample of targets. It is therefore imperative that the CDD be kept up to date, so that we can identify novel topologies as they are classified and observe potential splits within, and merges between, our metafolds as classifications shift. Since we use the contents of the CDD as potential targets for simulation of the folding pathway, it is important that we identify domains that appear to be autonomous folding units. There exists a broad category of domains that cannot be understood as folding units, but only as artifacts of multi-domain or complex structures. The details of the generation of this CDD have been published elsewhere.