BLOCKS Database Version 14.3, April 2007
Copyright 2007 by Fred Hutchinson Cancer Research Center
1100 Fairview AV N, A1-162, Seattle, WA 98109
Version 14.3 of the BLOCKS Database consists of 29,068 blocks representing 5900 groups documented in InterPro 14.0 keyed to SWISS-PROT 51.3 and TrEMBL 34.3 obtained from the InterPro server.
NOTICE: This is the final release of the Blocks Database. The first release of the Blocks Database was made in 1990. Our current funding expires in July 2007 and we have decided not to pursue additional funding as our lab has moved on to other projects and there are now many excellent resources for protein family identification. We recommend starting with the InterPro server which has links to other protein family collections.
The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT and TrEMBL and with cross-references to one or more of PROSITE, PRINTS, SMART, PFAM and ProDom.
The BLOCKS Database was constructed by the PROTOMAT system (S Henikoff & JG Henikoff, "Automated assembly of protein blocks for database searching", NAR (1991) 19:6565-6572) using the MOTIF algorithm (HO Smith, et al, "Finding sequence motifs in groups of functionally related proteins", PNAS (1990) 87:826-830) as implemented in Block Maker.
To avoid using possible false positive sequences added to the InterPro entries automatically (without human oversight), BLOCKS were made for each InterPro entry using just the sequences in SWISS-PROT, and then TrEMBL sequences were added if they fit the resulting BLOCKS model.
Version 14.3 is an incremental update. Relative to Blocks 14.2, additional sequences were added to 5057 entries, 539 entries were dropped and 290 entries were added. Many redundant entries have been removed from this version of the Blocks Database because the Blocks model of multiple conserved regions in a group of related proteins often conflicts with the InterPro model of individual conserved motifs. For example, InterPro includes the entries IPR012856 "Peptidase S12, aminopeptidase DmpB, region B" and IPR012857 "Peptidase S12, aminopeptidase DmpB, region C". Since these two entries include the same protein sequences, the same blocks would be made for both, so only one is now included in the Blocks Database.
InterPro 14.0 consisted of 13,828 entries. The 5900 entries of these
represented in BLOCKS 14.3 were selected as follows:
  13828
  -3723  entries with no PROSITE, PRINTS, SMART, PFAM or ProDom component (1)
  -3162  entries with fewer than 3 SWISS-PROT sequences eligible for PROTOMAT (2)
   -293  entries participating in InterPro parent/child relationships (3)
    -36  entries with too many sequences to process with PROTOMAT
    -12  entries for which PROTOMAT failed to find blocks
   -105  entries for which final blocks were obviously useless (5)
   -597  redundant entries
  -----
   5900

   1345  blocks entries taken from PRINTS (4)
   4555  blocks entries made by PROTOMAT

NOTES:

(1) InterPro now contains entries from several other sources. However, these five sources tend to define a protein family in terms most amenable to the BLOCKS model which is short, highly conserved regions. In particular, PROTOMAT will generally produce unsatisfactory results for groups comprised of a few, long, globally alignable sequences.

(2) PROTOMAT requires at least 3 sequences to make blocks. To be more confident that the sequences used are actually members of the InterPro protein family, we used only sequences from SWISS-PROT. Then, to reduce redundancy, we used only the longest SWISS-PROT sequence among those with the same gene name (characters before the "_" in the SWISS-PROT ID) and similar organism name (first three characters following the "_"). For example, if an InterPro group included SWISS-PROT sequences named

    AANT_HDVAM|P25989 LENGTH=214
    AANT_HDVD3|P29996 LENGTH=195
    AANT_HDVWO|P29997 LENGTH=205

only AANT_HDVAM would be used by PROTOMAT.

(3) Several InterPro entries are arranged into parent/child hierarchies where all the sequences in a child entry are included in the parent entry. Since PROTOMAT will tend to find the same blocks for the parent and children, each major branch of a hierarchy is represented by only one BLOCKS entry.

(4) Because the PRINTS model is the same as the BLOCKS model and PRINTS is a curated collection of alignments, the PRINTS blocks were used directly for InterPro entries with only a PRINTS component as long as the PRINTS blocks had at least three sequences from any source. Then additional sequences were added from TrEMBL if they fit the PRINTS model.

(5) These entries tend to be sites (e.g. IPR000886, IPB001216), repeats (e.g. IPR000479, IPR001473) and viral proteins (e.g. IPR000208, IPR000752).
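
The redundancy rule in note (2) can be illustrated with a short Python sketch. This is not the PROTOMAT code: the function names `redundancy_key` and `keep_longest` are hypothetical, the input is assumed to be already parsed into (SWISS-PROT ID, length) pairs, and the handling of ties (the first sequence seen is kept) is an assumption, since the note does not say how ties were resolved.

```python
from typing import Dict, Iterable, List, Tuple


def redundancy_key(swissprot_id: str) -> Tuple[str, str]:
    """Return (gene name, first three characters of the organism code).

    The gene name is everything before the first "_" in the ID; the
    organism code is everything after it.
    """
    gene, _, organism = swissprot_id.partition("_")
    return gene, organism[:3]


def keep_longest(entries: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only the longest sequence for each redundancy key.

    Assumption: on a tie in length, the first sequence encountered wins.
    """
    best: Dict[Tuple[str, str], Tuple[str, int]] = {}
    for seq_id, length in entries:
        key = redundancy_key(seq_id)
        if key not in best or length > best[key][1]:
            best[key] = (seq_id, length)
    return list(best.values())


# The three sequences from the example in note (2).
example = [
    ("AANT_HDVAM", 214),
    ("AANT_HDVD3", 195),
    ("AANT_HDVWO", 205),
]

print(keep_longest(example))  # [('AANT_HDVAM', 214)]
```

All three IDs share the gene name AANT and the organism prefix HDV, so they fall into one group and only the 214-residue AANT_HDVAM remains, matching the outcome described in note (2).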
Please note: The PROSITE pattern is not used in any way to make the BLOCKS Database and BLOCKS made from an InterPro PROSITE group may or may not contain the PROSITE pattern. Similarly, the SMART and PFAM multiple alignments are not used in any way to make the BLOCKS Database and BLOCKS made from an InterPro PROSITE, SMART or PFAM group may or may not overlap with the multiple alignments in those databases.