Human proteomics in UniProtKB

sylvain.poux@isb-sib.ch
Head of curation department
Swiss-Prot group
SIB Swiss Institute of Bioinformatics
Beijing, April 2015
The human proteome in UniProtKB
Different users for different needs
• Some users need detailed knowledge on a
protein in free text
• Some need information in machine readable
format
• Others need the complete set of proteins for
an organism
• Or all variations for the human proteome
• Etc.
UniProtKB
UniProtKB, the protein knowledgebase, has two sections:
• UniProtKB/Swiss-Prot: expert curation
• UniProtKB/TrEMBL: automatic annotation
How to access the human proteome?
www.uniprot.org
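Besides the website, the same data can be requested programmatically. The sketch below only builds a search URL and makes no network call. The endpoint `https://rest.uniprot.org/uniprotkb/search`, the query fields `organism_id` and `reviewed`, and the helper name `human_proteome_url` are assumptions that do not appear on the slides; check them against the current UniProt documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST service (not stated on the slides).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def human_proteome_url(reviewed_only=True, fmt="fasta", size=500):
    """Build a search URL for human (NCBI taxonomy ID 9606) UniProtKB entries.

    reviewed_only=True restricts the query to the Swiss-Prot section.
    """
    query = "organism_id:9606"
    if reviewed_only:
        query += " AND reviewed:true"
    params = {"query": query, "format": fmt, "size": size}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URL so it can be pasted into a browser or passed to a downloader.
    print(human_proteome_url())
```

Setting `reviewed_only=False` would, under the same assumptions, also include the TrEMBL section.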
The human proteome in Swiss-Prot
• 20,198 entries are present in UniProtKB/Swiss-Prot covering all
known human protein-coding genes
• All entries have been reviewed by our curators
• What do we mean by expert curation?
Expert curation in Swiss-Prot
Sequences are curated in collaboration with HAVANA, Ensembl, RefSeq
and the Consensus Coding Sequence (CCDS) project
Status of the complete proteome
• 20,198 entries covering all protein-coding genes as well as
21,841 manually reviewed alternative products
• More than 95% of CCDS sequences are present in
UniProtKB/Swiss-Prot
• 48,313 additional predicted alternative products are available in
the TrEMBL section of UniProtKB and can be retrieved as part of
the complete human proteome
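As a rough illustration of how these figures combine, the short sketch below adds them up. It assumes the three categories are disjoint and simply additive, which the slide does not state explicitly; the variable names are illustrative.

```python
# Figures quoted on the slide (April 2015).
canonical_reviewed = 20_198   # Swiss-Prot entries, one per protein-coding gene
reviewed_isoforms = 21_841    # manually reviewed alternative products
predicted_trembl = 48_313     # predicted alternative products in TrEMBL

# Assumption: the categories do not overlap, so they can be summed.
reviewed_total = canonical_reviewed + reviewed_isoforms
complete_total = reviewed_total + predicted_trembl

print(f"Reviewed sequences: {reviewed_total:,}")
print(f"Complete proteome:  {complete_total:,}")
```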
The human proteome in Swiss-Prot
• The proteome is regularly revisited as new knowledge
becomes available
• Newly identified proteins are added to Swiss-Prot
• We delete entries when evidence shows that a protein does
not exist
Human variations
• About 1 nucleotide in 1,000 differs between 2 randomly selected
genomes
• 3.3 million single nucleotide polymorphisms (SNPs)
between 2 individuals
• 10 million SNPs in the human population
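The first two figures are linked by simple arithmetic, sketched below. The genome size of 3.3 billion base pairs is an assumption inferred from the numbers on the slide, not a value the slide states.

```python
# Assumed genome size in base pairs (inferred from the slide's figures).
genome_size_bp = 3_300_000_000

# The slide's rate: about 1 differing nucleotide per 1,000.
positions_per_difference = 1_000

expected_differences = genome_size_bp // positions_per_difference
print(f"Expected differences between 2 genomes: {expected_differences:,}")
```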
Human variations in Swiss-Prot
Characterized variants found in the literature are curated in UniProtKB/Swiss-Prot entries
70,780 genetic variants are manually annotated in UniProtKB/Swiss-Prot,
of which
• 26,000 (40%) are associated with a genetic disease
• 8,500 (12%) are associated with functional characterization data
All Swiss-Prot variants are listed in the humsavar.txt table
(http://www.uniprot.org/docs/humsavar)
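A table of this kind can be summarized with a few lines of code. The sketch below is a minimal example under stated assumptions: the rows are made up, the seven-column whitespace-delimited layout and the category labels ("Disease", "Polymorphism", "Unclassified") are assumptions about the file, and the real humsavar.txt also contains header and footer lines that would need to be skipped.

```python
from collections import Counter

# Made-up rows in an assumed humsavar-like layout:
# gene, accession, variant ID, amino acid change, type, dbSNP, disease name.
SAMPLE = """\
GENE1  P00001  VAR_000001  p.Ala10Thr  Disease       rs0000001  Example disease (EXD)
GENE1  P00001  VAR_000002  p.Gly20Ser  Polymorphism  -          -
GENE2  P00002  VAR_000003  p.Arg30His  Unclassified  rs0000002  -
"""


def parse_row(line):
    """Split one row into named fields; the disease name may contain spaces."""
    gene, accession, ftid, change, vtype, dbsnp, disease = line.split(None, 6)
    return {
        "gene": gene,
        "accession": accession,
        "ftid": ftid,
        "change": change,
        "type": vtype,
        "dbsnp": dbsnp,
        "disease": disease.strip(),
    }


def count_by_type(text):
    """Count variants per category, ignoring blank lines."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    return Counter(row["type"] for row in rows)


if __name__ == "__main__":
    for variant_type, n in count_by_type(SAMPLE).items():
        print(variant_type, n)
```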
Controlled vocabulary for variation data
• The functional characterization of variants is being standardized
using a combination of controlled vocabularies
• 20% of functional annotations have been structured using
controlled vocabulary in a test phase, producing 2,800
standardized annotations
• Remaining free-text annotations will be mapped and made
available
Human variations in UniProt
• What about other variants?
• Variants that have been automatically mapped to UniProtKB
sequences can be found in the homo_sapiens_variation.txt.gz
file on the FTP site
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/)
Proteomics in UniProtKB
• A lot of data is produced using different experimental techniques and
identification methods
• Interpretation of the data differs between groups
Challenges for integrating high-throughput proteomics data in
UniProtKB
• Methods and technologies evolve
• The number of false positives should be limited
• We established our own pipelines for the annotation of proteomics data
• An expert-driven pipeline for UniProtKB
  • Evaluation and selection of publications by curators
  • Published peptides are filtered and mapped to the latest UniProtKB
    sequences
• An automatic pipeline for UniProtKB/TrEMBL
  • Peptides from a selected set of public mass spectrometry
    repositories are filtered and mapped to the latest UniProtKB
    sequences
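The mapping step common to both pipelines can be illustrated with exact substring matching. This is a minimal sketch, not the UniProt implementation: the function name `map_peptide`, the accessions and the sequences are all hypothetical, and a production pipeline would also have to handle details such as isoforms and equivalent residues.

```python
def map_peptide(peptide, sequences):
    """Return {accession: [1-based start positions]} for exact matches.

    `sequences` maps an accession to its protein sequence.
    """
    hits = {}
    for accession, sequence in sequences.items():
        positions = []
        start = sequence.find(peptide)
        while start != -1:
            positions.append(start + 1)  # report 1-based positions
            start = sequence.find(peptide, start + 1)
        if positions:
            hits[accession] = positions
    return hits


# Hypothetical accessions and sequences, for illustration only.
SEQUENCES = {
    "ACC_A": "MKTAYIAKQRQISFVK",
    "ACC_B": "MSFVKSFVK",
    "ACC_C": "MAAAPLLGG",
}

if __name__ == "__main__":
    print(map_peptide("SFVK", SEQUENCES))
```

A peptide that matches more than one entry, as here, cannot be assigned to a single protein without further evidence.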
Human proteomics in UniProtKB
● Source: Human phosphoproteome
Identification by MS:
• 55,061 peptides
• 22,446 phosphosites
• 6,526 phosphoproteins
● Processing by our pipeline
Application of stringent filtering rules:
• Peptide: Mascot score >= 40 and PEP <= 0.01
• PTM: Ascore >= 19
● Filtered data
• 26,497 certified peptides
• 8,537 certified phosphosites
● Integration into UniProt entries
• 4,132 Swiss-Prot and 6,594 TrEMBL entries
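The filtering rules above can be expressed as two simple predicates. The thresholds come from the slide; the record layout, the names, the example values, and the choice that a phosphosite is certified only if its peptide also passes are assumptions made for this sketch.

```python
from dataclasses import dataclass

# Thresholds quoted on the slide.
MASCOT_MIN = 40.0   # minimum Mascot score for a peptide
PEP_MAX = 0.01      # maximum PEP value for a peptide
ASCORE_MIN = 19.0   # minimum Ascore for a PTM site


@dataclass
class Identification:
    """One peptide identification with a candidate phosphosite (hypothetical layout)."""
    peptide: str
    mascot: float
    pep: float
    ascore: float


def peptide_passes(hit):
    """Peptide-level rule: Mascot score >= 40 and PEP <= 0.01."""
    return hit.mascot >= MASCOT_MIN and hit.pep <= PEP_MAX


def site_passes(hit):
    """Site-level rule: Ascore >= 19, assuming the peptide must also pass."""
    return peptide_passes(hit) and hit.ascore >= ASCORE_MIN


# Made-up identifications, for illustration only.
HITS = [
    Identification("AAASPEK", mascot=52.0, pep=0.001, ascore=25.0),
    Identification("GGTSLLR", mascot=45.0, pep=0.005, ascore=12.0),
    Identification("LLSPTVK", mascot=31.0, pep=0.002, ascore=30.0),
]

if __name__ == "__main__":
    certified_peptides = [h.peptide for h in HITS if peptide_passes(h)]
    certified_sites = [h.peptide for h in HITS if site_passes(h)]
    print("Certified peptides:", certified_peptides)
    print("Certified sites:   ", certified_sites)
```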
48 high-throughput proteomics papers were evaluated
• 31 were incorporated in UniProtKB
• 17 were rejected because of low confidence in PTM
localisation
84,520 peptides passed the filtering, covering 8,605
Swiss-Prot entries and 18,327 TrEMBL entries
and generating more than 22,000 PTMs
On the importance of curation
Shigeo Fukuda
Thank you for your attention and thanks to
the UniProt teams at:
• SIB
• EBI
• PIR
Funding agencies:
• National Institutes of Health (NIH) (NHGRI and NIGMS)
• Swiss Federal Government through the State Secretariat for Education,
Research and Innovation (SERI)
• The European Commission