The human proteome in UniProtKB

sylvain.poux@isb-sib.ch
Head of curation department, Swiss-Prot group
SIB Swiss Institute of Bioinformatics
Beijing, April 2015

Different users for different needs
• Some users need detailed knowledge on a protein in free text
• Some need information in machine readable format
• Others need the complete set of proteins for an organism
• Or all variations for the human proteome
• Etc.

UniProtKB
UniProtKB: protein knowledgebase
• UniProtKB/Swiss-Prot: expert curation
• UniProtKB/TrEMBL: automatic annotation

How to access the human proteome?
www.uniprot.org

The human proteome in Swiss-Prot
• 20,198 entries are present in UniProtKB/Swiss-Prot covering all known human protein-coding genes
• All entries have been reviewed by our curators
• What do we mean by expert curation?

Expert curation in Swiss-Prot
Sequences are curated in collaboration with the HAVANA, Ensembl, RefSeq and Consensus Coding Sequence (CCDS) projects

Status of the complete proteome
• 20,198 entries covering all protein-coding genes, as well as 21,841 manually reviewed alternative products
• More than 95% of CCDS sequences are present in UniProtKB/Swiss-Prot
• 48,313 additional predicted alternative products are available in the TrEMBL section of UniProtKB and can be retrieved as part of the complete human proteome

The human proteome in Swiss-Prot
• The proteome is regularly revisited as new knowledge becomes available
• Newly identified proteins are added to Swiss-Prot
• We delete entries when evidence shows that a protein does not exist

Human variations
• 1 nucleotide out of 1,000 differs between 2 randomly selected genomes
• 3.3 million single nucleotide polymorphisms (SNPs) between 2 individuals
• 10 million SNPs in the human population

Human variations in Swiss-Prot
Characterized variants found in the literature are curated in UniProtKB/Swiss-Prot entries
70,780 genetic variants are manually annotated in UniProtKB/Swiss-Prot, of which
• 26,000 (40%) are associated with a genetic disease
• 8,500 (12%) are associated with functional characterization data
All Swiss-Prot variants are listed in the humsavar.txt table (http://www.uniprot.org/docs/humsavar)

Controlled vocabulary for variation data
• Functional characterization of variants is being standardized using a combination of controlled vocabularies
• 20% of functional annotations have been structured using controlled vocabulary in a test phase, producing 2,800 standardized annotations
• Remaining free-text annotations will be mapped and made available

Human variations in UniProt
• What about other variants?
• Variants that have been automatically mapped to UniProtKB sequences can be found in the homo_sapiens_variation.txt.gz file on the ftp site (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/)

Proteomics in UniProtKB
• A lot of data is produced using different experimental techniques and identification methods
• Interpretation of data differs between groups

Challenges for integrating high-throughput proteomics data in UniProtKB
• Methods and technologies evolve
• The number of false positives should be limited
• We established our own pipelines for annotation of proteomics data

• An expert-driven pipeline for UniProtKB
  • Evaluation and selection of publications by curators
  • Published peptides are filtered and mapped to the latest UniProtKB sequences
• An automatic pipeline for UniProtKB/TrEMBL
  • Peptides from a selected set of public mass spectrometry repositories are filtered and mapped to the latest UniProtKB sequences

Human proteomics in UniProtKB
● Source: human phosphoproteome, identification by MS
  • 55’061 peptides
  • 22’446 phosphosites
  • 6’526 phosphoproteins
● Processing by our pipeline: application of stringent filtering rules
  • Peptide: Mascot >= 40 and PEP <= 0.01
  • PTM: Ascore >= 19
● Filtered data
  • 26’497 certified peptides
  • 8’537 certified phosphosites
● Integration into UniProt entries
  • 4’132 Swiss-Prot and 6’594 TrEMBL

48 high-throughput proteomics papers were evaluated
• 31 were incorporated in UniProtKB
• 17 were rejected because of low confidence in PTM localisation
84,520 peptides passed the filtering, covering 8,605 Swiss-Prot entries and 18,327 TrEMBL entries, generating more than 22,000 PTMs

On the importance of curation
Shigeo Fukuda

Thank you for your attention
and thanks to the UniProt teams at:
• SIB
• EBI
• PIR
Funding agencies:
• National Institutes of Health (NIH) (NHGRI and NIGMS)
• Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI)
• The European Commission
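
Note on the filtering rules
The stringent filtering rules quoted on the "Human proteomics in UniProtKB" slide (peptide: Mascot >= 40 and PEP <= 0.01; PTM: Ascore >= 19) can be illustrated with a minimal Python sketch. Only the three thresholds come from the slide. The record layout, the field names, the example values and the choice to require a passing peptide before a site is kept are assumptions made for illustration; this is not the UniProt pipeline itself.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted on the slide.
MIN_MASCOT = 40.0   # minimum Mascot score for a peptide
MAX_PEP = 0.01      # maximum posterior error probability (PEP)
MIN_ASCORE = 19.0   # minimum Ascore for PTM localisation


@dataclass
class PeptideHit:
    """Hypothetical record for one peptide identification (not a UniProt format)."""
    sequence: str
    mascot: float
    pep: float
    ascore: Optional[float] = None  # None when no modification site is reported


def peptide_passes(hit: PeptideHit) -> bool:
    """Peptide rule: Mascot >= 40 and PEP <= 0.01."""
    return hit.mascot >= MIN_MASCOT and hit.pep <= MAX_PEP


def site_passes(hit: PeptideHit) -> bool:
    """PTM rule: Ascore >= 19, applied here only to peptides that pass (assumption)."""
    return peptide_passes(hit) and hit.ascore is not None and hit.ascore >= MIN_ASCORE


# Made-up example values, for illustration only.
hits = [
    PeptideHit("AAAK", mascot=55.0, pep=0.001, ascore=25.0),
    PeptideHit("CCCK", mascot=42.0, pep=0.005, ascore=12.0),
    PeptideHit("DDDK", mascot=30.0, pep=0.001, ascore=30.0),
]

certified_peptides = [h.sequence for h in hits if peptide_passes(h)]
certified_sites = [h.sequence for h in hits if site_passes(h)]

print(certified_peptides)  # ['AAAK', 'CCCK']
print(certified_sites)     # ['AAAK']
```

In this sketch the second peptide is kept but its site is dropped because of the low Ascore, which mirrors the slide's distinction between certified peptides and certified phosphosites.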