InterPro – An integrated documentation resource for protein families, domains and functional sites

R.Apweiler (1), T.K.Attwood (4), A.Bairoch (2), A.Bateman (5), E.Birney (5), P.Bucher (3), J-J.Codani (8), F.Corpet (6), M.D.R.Croning (1,4), R.Durbin (5), T.Etzold (9), W.Fleischmann (1), J.Gouzy (6), H.Hermjakob (1), I.Jonassen (7), D.Kahn (6), A.Kanapin (1), R.Schneider (9), F.Servant (6), E.Zdobnov (1)

1 EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
2 Swiss Institute for Bioinformatics, Geneva, Switzerland.
3 Swiss Institute for Experimental Cancer Research, Lausanne, Switzerland.
4 School of Biological Sciences, The University of Manchester, Manchester, UK.
5 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
6 CNRS/INRA, Toulouse, France.
7 University of Bergen, Norway.
8 INRIA, 78153 Le-Chesnay Cedex, France.
9 LION bioscience AG

Abstract

InterPro is a new integrated documentation resource for protein families, domains and functional sites, developed as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Merged annotations from PRINTS, PROSITE and Pfam form the InterPro core. Each combined InterPro entry includes functional descriptions and literature references, and links are made back to the relevant parent database(s), allowing users to see at a glance whether a particular family or domain has associated patterns, profiles, fingerprints, etc. Merged and individual entries (i.e., those that have no counterpart in the companion resources) are assigned unique accession numbers. The first release of InterPro contains around 2,400 entries, representing families, domains, repeats and sites of post-translational modification (PTMs) encoded by 4,300 regular expressions, profiles, fingerprints and Hidden Markov Models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 370,000 hits in total). The database is accessible for text-based searches at http://www.ebi.ac.uk/interpro/.

Introduction

Pattern databases have become vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. During the last decade, several pattern-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. To perform a comprehensive analysis, a user therefore has to know several important things. For example, what are the resources and where can they be found? What is the difference between them in terms of diagnostic performance and family coverage? What do the different search outputs mean? Is it sufficient to use just one of the databases, and if so, which one? Or, given the seeming complexity, won’t PSI-BLAST (1) do just as well?

Currently, the most commonly-used pattern databases include: PROSITE, home of regular expressions and profiles (2); Pfam, keeper of hidden Markov models (HMMs) (3); PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs) (4); and Blocks, the source of aligned, weighted motifs or blocks (5). Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. For example, regular expressions are likely to be unreliable in the identification of members of highly divergent super-families (where HMMs and profiles excel); fingerprints perform relatively poorly in the diagnosis of very short motifs (where regular expressions do well); and profiles and HMMs are less likely to give specific sub-family diagnoses (where fingerprints excel).
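To make the regular-expression approach concrete, the following minimal sketch translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It is an illustration only, not PROSITE's own tooling: the function names `prosite_to_regex` and `find_sites` are hypothetical, only the most common syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors), and the example sequence is invented.

```python
import re
from typing import List


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported elements (a simplifying assumption of this sketch):
    x = any residue, [ABC] = one of, {ABC} = none of,
    (n) or (n,m) = repeat count, < and > = sequence start / end.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional repeat count.
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> List[int]:
    """Return the 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # A short site pattern of the kind regular expressions handle well.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "AANGSAANPSA"))
```

A pattern of this kind either matches or it does not, which is why it suits short, well-conserved sites but copes poorly with divergent families, where the scored, position-specific models of profiles and HMMs are more tolerant.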

In terms of family coverage, the pattern databases are similar in size but differ in content – each contains between 1,000 and 1,500 entries, spanning a range of globular and membrane proteins, modules and mosaics, repeats, and so on. While all of the resources share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from super-family down to sub-family levels in order to pin-point specific functions (e.g., PRINTS).

A number of sequence cluster databases are also commonly used in sequence analysis, for example to facilitate domain identification (e.g., ProDom (6)). Unlike pattern databases, the clustered resources are derived automatically from sequence databases, using different clustering algorithms. This allows them to be relatively comprehensive, because they do not depend on manual crafting and validation of family discriminators; but the biological relevance of clusters can be unclear, as no annotation is provided.

Given these complexities, analysis strategies should endeavour to combine a range of databases, as none alone is sufficient. In concert, however, they can complement routine sequence database searches by providing more specific diagnoses than are possible with tools such as PSI-BLAST. PSI-BLAST highlights generic similarities by gathering sequences into families using an iterative profiling technique. However, there are problems with this approach. For example, if a multi-domain protein is matched, it may not be clear whether the region matched is the functional part of the protein, and hence whether functional annotations can be reliably transferred to the query; similarly, if a large super-family has been matched, it may be difficult to make the correct family or sub-family diagnosis.

In the task of sequence characterisation, we need more reliable, concerted methods for identifying protein family traits and for inheriting functional annotation. This is especially important given our dependence on automatic methods for assigning functions to the raw sequence data issuing from genome projects. But rationalising this process by creating a single coherent resource for diagnosis and documentation of protein families is difficult, given entirely different database formats, different search tools and different search outputs. Nevertheless, in an attempt to address some of these issues, we have developed InterPro. This new resource provides an integrated view of a number of commonly used pattern databases, and currently offers a Web interface for text-based searches.

Source database and methods

The first release of InterPro was built from Pfam 4.1 (1,488 domains), PRINTS 23.1 (1,159 fingerprints) and PROSITE 15 (1,034 families).

Flat-files submitted by each of the groups were systematically merged and dismantled. Where relevant, family annotations were amalgamated, and all method-specific annotation separated out. This process was complicated by the relationships that can exist, both between entries in the same database, and between entries in different databases. Different types of parent-child relationship were evident, leading us to recognise ‘sub-types’ and ‘sub-strings’. A sub-string means that a motif or motifs are contained within a region of sequence encoded by a wider pattern (e.g., a PROSITE pattern is typically contained within a PRINTS fingerprint; or a fingerprint might be contained within a Pfam domain). A sub-type means that one or more motifs are specific for a sub-set of sequences captured by another more general pattern (e.g., a super-family fingerprint may contain several family- and sub-family-specific fingerprints; or a generic Pfam domain may include several family fingerprints).
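The two relationship types can be illustrated with a toy rule. The text does not say how the relationships were decided in practice, so the sketch below is an assumption throughout: each signature is reduced to a hypothetical table of matched proteins and matched regions (one region per protein), a child that hits only a proper subset of the parent's proteins is called a sub-type, and a child that hits the same proteins inside the parent's regions is called a sub-string. The function name, the data layout, the rule order and the example accessions are all invented for illustration.

```python
from typing import Dict, Tuple

# Hypothetical representation: protein accession -> (start, end) of the
# region matched by one signature (one match per protein, for simplicity).
Matches = Dict[str, Tuple[int, int]]


def classify_relationship(child: Matches, parent: Matches) -> str:
    """Classify how `child` relates to `parent` under the toy rules above."""
    shared = child.keys() & parent.keys()
    if not shared:
        return "unrelated"
    # Sub-type: the child is specific for a sub-set of the parent's sequences.
    if child.keys() < parent.keys():
        return "sub-type"
    # Sub-string: same sequences, child region contained in the parent region.
    nested = all(
        parent[p][0] <= child[p][0] and child[p][1] <= parent[p][1]
        for p in shared
    )
    if child.keys() == parent.keys() and nested:
        return "sub-string"
    return "overlapping"


if __name__ == "__main__":
    # Invented match tables for a wide domain, a short site and a family.
    domain = {"P1": (10, 200), "P2": (5, 180), "P3": (20, 210)}
    site = {"P1": (50, 60), "P2": (40, 50), "P3": (70, 80)}
    family = {"P1": (30, 150)}
    print(classify_relationship(site, domain))
    print(classify_relationship(family, domain))
```

Real signatures can match a protein several times and rarely agree exactly on their match lists, so any practical version would need tolerance thresholds and manual review.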

Having classified the parent-child relationships of overlapping PROSITE, PRINTS and Pfam entries, all recognisably distinct entities were assigned unique accession numbers (which take the form IPR00000). In doing this, we adopted the general principle that parents and children with sub-string relationships usually have the same IPR numbers, while sub-type parent-child relationships warrant their own IPRs.
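The accession principle can be sketched as a grouping problem: signatures linked by a sub-string relationship are merged into one group and share an accession, while every other signature, including sub-type children, keeps its own. The sketch below uses a small union-find structure for the merging. It is a hedged illustration rather than the procedure actually used: the function name, the signature identifiers and the numbering order are hypothetical, and the zero padding simply follows the accession form quoted above and is exposed as a parameter.

```python
from typing import Dict, Iterable, List, Tuple


def assign_accessions(
    signatures: List[str],
    substring_pairs: Iterable[Tuple[str, str]],
    template: str = "IPR{:05d}",  # assumed padding, following the form in the text
) -> Dict[str, str]:
    """Give signatures in the same sub-string group the same accession."""
    parent = {s: s for s in signatures}

    def find(s: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge the two groups of every sub-string pair.
    for a, b in substring_pairs:
        parent[find(a)] = find(b)

    group_accession: Dict[str, str] = {}
    result: Dict[str, str] = {}
    for s in signatures:
        root = find(s)
        if root not in group_accession:
            # Number groups in order of first appearance (an arbitrary choice).
            group_accession[root] = template.format(len(group_accession) + 1)
        result[s] = group_accession[root]
    return result


if __name__ == "__main__":
    # Invented identifiers: a domain, a site inside it, and two family
    # signatures that are sub-types of the domain and so stay separate.
    sigs = ["DOMAIN_A", "SITE_A", "FAMILY_A1", "FAMILY_A2"]
    for name, acc in assign_accessions(sigs, [("SITE_A", "DOMAIN_A")]).items():
        print(name, acc)
```

Because the text says sub-string partners "usually" share a number, a real assignment would also need a way to record curated exceptions.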

Database Format and Content

To facilitate in-house maintenance, InterPro is managed within a relational database system. For users, however, the core InterPro entries are released in a single ASCII flatfile, written in XML. The overall data flow, from individual data provider, through the DBMS, out to the flatfile and on to the user, is fairly complex – a flavour of this complexity is given in Figure 1.
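As a rough illustration of how a user might consume such an XML flatfile, the sketch below reads entries with Python's standard `xml.etree.ElementTree` module. The schema is not described here, so every element and attribute name in the sample (`interprodb`, `interpro`, `name`, `member_list`, `db_xref`, `id`, `type`, `db`, `dbkey`) is an assumption made for the example, and the accessions are placeholders rather than real entries.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Invented sample; the tag and attribute names are assumptions, and the
# accessions are placeholders rather than real database entries.
SAMPLE = """<interprodb>
  <interpro id="IPR00000" type="Domain">
    <name>Example domain</name>
    <member_list>
      <db_xref db="PFAM" dbkey="PF00000"/>
      <db_xref db="PROSITE" dbkey="PS00000"/>
    </member_list>
  </interpro>
</interprodb>"""


def read_entries(xml_text: str) -> Dict[str, dict]:
    """Map each entry accession to its name, type and member signatures."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.iter("interpro"):
        entries[entry.get("id")] = {
            "name": entry.findtext("name"),
            "type": entry.get("type"),
            # Links back to the parent databases that contributed signatures.
            "members": [(x.get("db"), x.get("dbkey")) for x in entry.iter("db_xref")],
        }
    return entries


if __name__ == "__main__":
    for accession, info in read_entries(SAMPLE).items():
        print(accession, info["name"], info["members"])
```

For a full release file, `ET.iterparse` would avoid loading the whole document into memory at once.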

Figure 1. Representation of the InterPro data flow scheme, illustrating the route from the source database providers via the RDBMS to the end user.

Release 1.0 contains nearly 2,300 entries, representing families, domains, repeats and PTMs encoded by 4,300 different regular expressions, profiles, fingerprints and HMMs. Overall, InterPro lists more than 370,000 matches in SWISS-PROT and TrEMBL (7), accounting for around two thirds of all sequences. A complete content list is available from the Web site.

Database Access and Distribution

InterPro is accessible for interactive use via the EBI Web server. The interface allows text-based searches using SRS (8) and output interpretation is facilitated by means of graphics. Thus, for each sequence, the domain and/or motif organisation can be seen at a glance. The flatfile distribution may be retrieved from the EBI anonymous-ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro).

Future Directions

While the first InterPro release was created from PRINTS, PROSITE and Pfam, ProDom will shortly also be included. Various factors rendered a step-wise approach to the development of InterPro desirable. First, the scale of the task of amalgamating just the first three databases was immense. The rational merging of apparently equivalent database entries that in fact simultaneously define a specific family, domains within that family, or even repeats within those domains, presented an enormous challenge. Thus, the immediate goal for InterPro was to limit the problem only to databases that offered annotation. A second important consideration was that while Pfam, PRINTS and PROSITE are true pattern databases, ProDom is based solely on automatic clustering of sequences by similarity (i.e., discriminators are not derived). Resulting clusters need not have precise biological correlations and family designations have changed between database versions. It was therefore necessary that ProDom should adopt stable accession numbers before its entries could be meaningfully considered for inclusion in InterPro.

Once the founder members of InterPro have been assimilated into the resource, other pattern databases will be included (e.g., Blocks (5) and SMART (9)). Ultimately, we hope to include many other family databases to give a more comprehensive view of the resources available.

Applications

A primary application of InterPro’s family, domain and functional site definitions will be in the functional classification of newly determined sequences that lack biochemical characterisation. Thus InterPro will be used to enhance the automated annotation of TrEMBL (10). This should be more efficient and reliable than using each of the pattern databases separately, because InterPro will provide internal consistency checks and deeper coverage.

Conclusion

InterPro is an international initiative that was conceived in an attempt to streamline the efforts of the pattern database providers. The project aims to reduce duplication of effort in the labour-intensive, rate-limiting process of annotation, and will facilitate communication between the disparate resources. By uniting these databases, we capitalise on their individual strengths, producing a single entity that is far greater than the sum of its parts. As it evolves, InterPro will streamline the analysis of newly determined sequences for the individual user, and will make a significant contribution in the demanding task of automatic annotation of predicted proteins from genome sequencing projects.

Acknowledgements

The InterPro project is supported by grant number BIO4-CT98-0052 of the European Commission. TKA is a Royal Society University Research Fellow.

References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Nucleic Acids Res., 25, 3389-3402.

2. Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) Nucleic Acids Res., 27, 215-219.

3. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D. and Sonnhammer, E.L.L. (1999) Nucleic Acids Res., 27, 260-262.

4. Attwood, T.K., Flower, D.R., Lewis, A.P., Mabey, J.E., Morgan, S.R., Scordis, P., Selley, J.N. and Wright, W. (1999) Nucleic Acids Res., 27, 220-225.

5. Henikoff, S., Henikoff, J.G. and Pietrokovski, S. (1999) Bioinformatics, 15, 471-479.

6. Corpet, F., Gouzy, J. and Kahn, D. (1999) Nucleic Acids Res., 27, 263-267.

7. Bairoch, A. and Apweiler, R. (1999) Nucleic Acids Res., 27, 49-54.

8. Etzold, T., Ulyanov, A. and Argos, P. (1996) Methods Enzymol., 266, 114-128.

9. Schultz, J., Milpetz, F., Bork, P. and Ponting, C.P. (1998) Proc. Natl. Acad. Sci. USA, 95, 5857-5864.

10. Fleischmann, W., Möller, S., Gateau, A. and Apweiler, R. (1999) Bioinformatics, 15, 228-233.