Generation
2009 Consensus Set
Consensus domains were identified between pairs of domain dictionaries (SCOP/CATH, SCOP/DALI, DALI/CATH). The agreement between two domain dictionaries can be measured, for each dictionary of the pair, as the fraction of its domains on shared structures that are consensus domains. Restricting the measure to shared structures reduces the effect of differing release dates. CATH and DALI had the highest agreement, with 96% of CATH domains and 90% of DALI domains included in the CATH/DALI consensus domain set. SCOP and CATH had the next highest agreement, with 79% of SCOP domains and 82% of CATH domains included in the SCOP/CATH consensus domain set. Finally, SCOP and DALI had the lowest agreement, with 65% of SCOP domains and 61% of DALI domains included in the SCOP/DALI consensus domain set. A consensus domain need not exist solely between a single pair of domain dictionaries. Where a consensus domain was identified between multiple pairs of domain dictionaries, it was collapsed into a single domain in the consensus domain dictionary (CDD).
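The agreement measure and the collapsing step described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the CDD: the function names, the domain keys (PDB identifier, chain, domain index), and the toy data are all assumptions introduced here, and the sketch presumes that matching domains from different dictionaries have already been mapped to a common key.

```python
from collections import defaultdict

# Fixed label order so that class names read like those in the text.
DICTIONARY_ORDER = ("SCOP", "CATH", "DALI")


def agreement(consensus_domains, domains_on_shared_structures):
    """Fraction of one dictionary's domains (shared structures only)
    that also appear in the pairwise consensus set."""
    total = len(domains_on_shared_structures)
    if total == 0:
        return 0.0
    return len(consensus_domains & domains_on_shared_structures) / total


def collapse(pairwise_matches):
    """Merge pairwise consensus domains into one entry per domain.

    pairwise_matches: iterable of (domain_key, (dict_a, dict_b)).
    Returns {domain_key: class_label}, e.g. 'SCOP/CATH/DALI'.
    """
    sources = defaultdict(set)
    for domain_key, pair in pairwise_matches:
        sources[domain_key].update(pair)
    return {
        key: "/".join(d for d in DICTIONARY_ORDER if d in dicts)
        for key, dicts in sources.items()
    }


# Hypothetical toy input: one domain matched by two pairs, one by a single pair.
matches = [
    (("1abc", "A", 1), ("SCOP", "CATH")),
    (("1abc", "A", 1), ("CATH", "DALI")),
    (("2xyz", "B", 1), ("SCOP", "DALI")),
]
classes = collapse(matches)

# Hypothetical toy input for the agreement measure.
scop_domains = {"d1", "d2", "d3", "d4"}
scop_cath_consensus = {"d1", "d2", "d3"}
scop_agreement = agreement(scop_cath_consensus, scop_domains)
```

In this sketch a domain found by any two pairwise comparisons is labelled with all three dictionaries, which mirrors the collapsing rule stated above; how domains are matched between dictionaries in the first place is outside its scope.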
Thus, four classes of consensus domains were created: SCOP/CATH, SCOP/DALI, DALI/CATH, and SCOP/CATH/DALI. The 2009 CDD was composed of 80,062 domains originating from 27,140 PDB structures. The domains of the CDD were distributed among these classes as follows: 50.8% SCOP/CATH/DALI, 29.6% SCOP/CATH, 10% SCOP/DALI, and 9.4% DALI/CATH. Accordingly, the SCOP/CATH pairwise match identified over 80% of the domains in the CDD (50.8% + 29.6% = 80.4%), whereas the other pairwise matches accounted for the remaining 20%. Summary statistics for the domains in our input domain dictionaries are presented in Table 1. To generate the fold list, the CDD must first be filtered by sequence identity.
We independently applied the CATH ‘SOLID’ identifiers and the SCOP ‘ASTRAL95’ non-redundant structure measures to the CDD to generate the non-redundant CDD (nrCDD). The nrCDD retained 13,345 domains from the CDD. Each domain from the input domain dictionaries has a fold identifier. For SCOP and CATH, this fold identifier is derived from the level in each respective hierarchy at which we choose to cluster (Fold for SCOP, Topology for CATH). Each domain in the CDD has a composite fold identifier derived from its fold classification in the input domain dictionaries. Where multiple composite fold identifiers share two or more input fold identifiers, they are clustered together into a metafold. The domains possessing these composite fold identifiers are then associated with their respective metafolds. The domains in the CDD clustered into 1,695 metafolds. Taken together, these metafolds incorporate 4,217 unique consensus fold identifiers derived from 971 unique SCOP folds, 923 unique CATH topologies, and 2,362 DALI folds.
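The metafold clustering rule can be illustrated with a small union-find sketch. It rests on assumptions not stated in the text: each composite fold identifier is modelled as a set of (dictionary, fold) pairs, the fold labels shown are hypothetical, and clustering is treated as transitive (single linkage), so that two identifiers can end up in the same metafold through a shared neighbour.

```python
from itertools import combinations


def cluster_metafolds(composite_ids, min_shared=2):
    """Group composite fold identifiers that share at least `min_shared`
    input fold identifiers, using union-find over all pairs."""
    parent = list(range(len(composite_ids)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Pairwise comparison; quadratic, but adequate for a few thousand identifiers.
    for i, j in combinations(range(len(composite_ids)), 2):
        if len(composite_ids[i] & composite_ids[j]) >= min_shared:
            union(i, j)

    clusters = {}
    for i, composite_id in enumerate(composite_ids):
        clusters.setdefault(find(i), []).append(composite_id)
    return list(clusters.values())


# Hypothetical composite identifiers; the fold labels are placeholders.
ids = [
    frozenset({("SCOP", "c.1"), ("CATH", "3.20.20"), ("DALI", "17")}),
    frozenset({("SCOP", "c.1"), ("CATH", "3.20.20")}),
    frozenset({("SCOP", "c.1"), ("DALI", "99")}),
    frozenset({("SCOP", "b.1"), ("CATH", "2.60.40")}),
]
metafolds = cluster_metafolds(ids)
```

Here the first two identifiers share a SCOP fold and a CATH topology and so fall into one metafold, while the third shares only the SCOP fold with them and remains separate. Tagging each fold with its source dictionary keeps identically named folds from different dictionaries from being counted as shared.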
| 2003 | Chains | Domains | Folds | Domains/Chains |
|---|---|---|---|---|
| SCOP | 27,308 | 35,095 | 783 | 1.29 |
| CATH | 25,622 | 36,480 | 1,453 | 1.42 |
| Dali | 21,493 | 35,492 | 1,088 | 1.65 |

| 2009 | Chains | Domains | Folds | Domains/Chains |
|---|---|---|---|---|
| SCOP | 74,608 | 96,973 | 1,280 | 1.29 |
| CATH | 74,240 | 108,691 | 1,110 | 1.46 |
| Dali | 52,740 | 73,609 | 2,783 | 1.39 |
Table 1: Summary Statistics for Domains in Input Domain Dictionaries
Figure 1: 2003 Consensus Domain Dictionary Distribution of Domains in Input Dictionaries
Figure 2: 2009 Consensus Domain Dictionary Distribution of Domains in Input Dictionaries