Detangling transcriptional complexity in GENCODE
using cutting-edge transcriptomics and proteomics
Mark Thomas
Wellcome Trust Sanger Institute
Biocuration 2015, Beijing
Overview
• Introduction to manual annotation tools used for
GENCODE pipelines
• Improving the comprehensive gene set and
trying to catalogue function
• Validation of the lncRNA gene sets with long
read technology
• Impact of proteomics on functional annotation
gencodegenes.org
Current GENCODE releases
Human 22
Mouse M4
GENCODE pipeline
Specialised prediction pipelines
(CONGO, PseudoPipe, RetroFinder, APPRIS, etc.)
Automated annotation
(Ensembl)
QC
Tracking System
(Annotrack)
Manual
annotation
(HAVANA)
Merged
dataset
Experimental
Verification
(RT-PCR, RACE)
Assign validation levels
Highlight conflicts
Track solutions
Annotate new regions
Update annotation with QC
Manual Annotation Tools
BLAST
Gene predictions
RepeatMasker
CpG prediction
Pfam
RefSeq
Ensembl
Merged manual
and automated
annotation
DAS = Distributed Annotation System
Annotation of THUMPD3 locus
THUMPD3
Ab initio
Transcript evidence
HAVANA
Annotation
THUMPD3-AS
(lncRNA)
RNA_seq
models
Conservation
Protein evidence
RNA_seq
models
Vega browser for manual annotation
Gene annotation updates in Vega
Current Gene Annotation
Updated Gene Annotation
HAVANA update tracks in Vega
HAVANA Updates
Current Release
GENCODE in Ensembl
Transcript Flags
• TSL – Transcript Support Levels
• GENCODE Basic
Basic vs Comprehensive gene sets
Comprehensive
Basic
GENCODE Summary
• GENCODE provides manual, evidence-based gene
annotation, combined with automated approaches
• The gene set reveals increasing transcriptional
complexity that presents challenges for functional
studies
• To facilitate functional analysis, a ‘Basic’ gene set is
provided that includes primary transcripts only
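As an illustration of how the Basic set can be separated from the Comprehensive set, the sketch below selects transcript records carrying the `basic` tag from GENCODE-style GTF lines. It assumes the standard nine-column GTF layout with `tag "basic";` in the attribute column; the function names and the toy records (`G1`, `TX1`, `TX2`) are hypothetical and not taken from the talk.

```python
import re

# Matches `key "value";` or `key value;` pairs in a GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_attributes(attr_field):
    """Return a dict mapping each attribute key to the list of its values.

    Keys such as `tag` can occur several times on one record, so every
    key maps to a list.
    """
    attrs = {}
    for key, value in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, []).append(value)
    return attrs


def basic_transcripts(gtf_lines):
    """Yield the transcript_id of every transcript record tagged 'basic'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records are considered
        attrs = parse_attributes(fields[8])
        if "basic" in attrs.get("tag", []):
            yield attrs["transcript_id"][0]


# Toy records (hypothetical identifiers, not real GENCODE entries).
toy_gtf = [
    "##description: toy example",
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "TX1"; transcript_type "protein_coding"; tag "basic";',
    'chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "TX2"; transcript_type "retained_intron";',
]
basic_ids = list(basic_transcripts(toy_gtf))
```

Filtering this way simplifies downstream analysis, at the cost of dropping transcripts such as the retained-intron model above.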
Transcriptional Complexity, Functionality,
and the ‘non-productive’ transcriptome
Transcriptome Complexity
Coding
Productive
‘Functional’
Non-Coding
Non-Productive
Non-‘Functional’
Non-functional transcripts
mRNA
Functional protein
Exon skip
Alternative protein isoform
or spliceosome failure?
26,542 retained introns in GENCODE
… non-functional transcripts (?)
CDS exon
Non-coding / UTR
This ‘biological noise’ is common
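A figure such as the retained-intron count can be reproduced in outline by tallying transcript biotypes in an annotation file. The sketch below is a minimal version of that tally, assuming GENCODE-style GTF records whose attribute column contains `transcript_type "..."`; the function name and the toy records are hypothetical, and the code is not the method used to obtain the number on the slide.

```python
import re
from collections import Counter

# Extracts the biotype from a GTF attribute column.
TYPE_RE = re.compile(r'transcript_type "([^"]+)"')


def count_transcript_types(gtf_lines):
    """Count transcript-level records per transcript_type."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # count each transcript once, not each exon
        match = TYPE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Toy records (hypothetical identifiers).
toy_gtf = [
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "TX1"; transcript_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "TX2"; transcript_type "retained_intron";',
    'chr1\tHAVANA\ttranscript\t150\t650\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "TX3"; transcript_type "retained_intron";',
]
type_counts = count_transcript_types(toy_gtf)
```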
Is ‘biological noise’ important?
Splicing signals are frequently weak
Productive
*
Non-productive
e.g. weak splice donor
ON
OFF
‘Stand By’
Transcription diverted from a
productive to non-productive
state for 86 genes
Wong et al. 2013
Functional transcription for HNRNPDL
ATG
CDS exon
‘start not found’
NMD exon
Transcript evidence may not contain TSS
ESTs
cDNAs
Most 5’ point of evidence presumed to be TSS of the gene
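The rule stated above can be written down in a few lines, which also makes its weakness visible: the presumed TSS is only as good as the longest piece of evidence. The sketch below assumes alignments are given as 1-based inclusive `(start, end)` genomic coordinates for a single locus; the function name and coordinates are hypothetical.

```python
def presumed_tss(alignments, strand):
    """Return the most 5' genomic position covered by transcript evidence.

    alignments: list of (start, end) tuples for ESTs/cDNAs at one locus,
                with start <= end in genomic coordinates.
    strand:     '+' or '-'. On the reverse strand the 5' end of a
                transcript is its highest genomic coordinate.
    """
    if not alignments:
        raise ValueError("no transcript evidence supplied")
    if strand == "+":
        return min(start for start, _ in alignments)
    if strand == "-":
        return max(end for _, end in alignments)
    raise ValueError("strand must be '+' or '-'")


# Hypothetical evidence for one locus.
evidence = [(150, 900), (120, 880), (300, 950)]
forward_tss = presumed_tss(evidence, "+")
reverse_tss = presumed_tss(evidence, "-")
```

Adding a single longer cDNA moves the presumed TSS, which is why independent 5' data such as CAGE is useful for confirming it.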
TSS complexity for HNRNPDL
Exon 1 [zoom]
CAGE tags
uTSS outside annotation
dTSS within CDS; 10x higher
GM12878 raw CAGE (RIKEN; whole cell)
RAMPAGE
MCF7 raw CAGE (RIKEN; whole cell)
Functional CDS complexity in HNRNPDL
Stronger dTSS
CAGE
HUMAN
Conservation of annotated ATG/CDS
Human
Chimp
Rhesus
Mouse
Dog
Opossum
MEVPPRLSHVPPP...
MEVPPRLSHVPPP...
MEVPPRLSHVPPP...
MEVPPRLSHVPPP...
MEVPPRLSHVPPP...
MEVPPRLSQVPPP...
Ribosome profiling
supports ATG.
MOUSE
CAGE
Stronger dTSS conserved
Updated annotation of HNRNPDL
reverse strand
dbSNP
rs149817562 / COSM106459 - linked to melanoma
Non-coding genome
• The non-coding genome is rapidly expanding and will
soon be greater than the protein-coding genome
• Translation of coding genes helps define length and
functionality
• Without a translated sequence, determining the length
and functionality of long non-coding loci is challenging
• The functionality of many well-defined lncRNAs remains
unknown, and may not be transcript-associated.
Annotation of lncRNA to understand functionality
lncRNA are shorter than coding transcripts
Are these full-length?
We have targeted 400 lncRNAs using
5’ and 3’ RACE with 454 sequencing
274 loci were extended, with 1,669
novel transcripts identified
Extension of a novel lincRNA
CpG
Island
454 reads
RACE
target
OTTHUMG00000066036 has been extended to a
CpG island using RACE and 454 sequencing
Improving lncRNA annotation
• Non-coding transcripts are poorly represented by existing
transcript libraries (e.g. ESTs, cDNAs)
• Shorter reads from RNA-seq provide support for splice
junctions, but not exon structure
• Longer reads (e.g. 454 or PacBio) are required to
support full-length transcripts
Airn – regulatory lncRNA
Igf2r protein coding
MOUSE
Airn lncRNA
• Paternal expression of Airn silences Igf2r
• Degrading the lncRNA has no phenotypic effect (Latos et al. 2012)
• Repression occurs via Airn transcription across the Igf2r promoter
Airn transcripts can be considered non-functional, but their
transcription drives a functional process
Non-productive transcriptome
• Studies have traditionally focused on the productive
transcriptome
• Non-productive transcription may help us understand
gene function and regulation
• This may be a common model…
... and should therefore be included in annotation projects
Checking if lncRNAs are coding
Pandey - Human Proteome
• 30 Tissues
• 85 Experiments
• 25 Million HCD Spectra
• Tryptic Peptides
• Adult and Fetal Samples
Kuster – Human Body Map
• 35 Tissues
• 47 Experiments
• 14 Million Spectra
• Multiple Enzymes
Cutler – Human Tissue
• 9 Tissues
• 12 Experiments
• 13 Million CID Spectra
GENCODE
Release 20
James.wright@sanger.ac.uk
Coding or non-coding?
PIGBOS1 changed from lncRNA to protein coding gene
RAB27A
CAGE tag
54aa peptide
PSM
PIGBOS1
Human
ATGTTTAGGAGA
Cow
ATGTTTAGGAAG
Mouse
ATGCTCGGGAGA
Rhesus
ATGTTTAGGAGA
Opossum
ATGTTTGGGAGA
Conservation of initiating ATG
across multiple species
(HGNC and RefSeq still have it as lncRNA)
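The conservation check on this slide can be sketched as follows: confirm that each orthologous sequence begins with ATG and translate the first codons with the standard genetic code. The five 12-nt sequences are copied from the slide; the function and variable names are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# First 12 nt of the PIGBOS1 ORF in each species, as shown on the slide.
ORTHOLOGUE_STARTS = {
    "Human": "ATGTTTAGGAGA",
    "Cow": "ATGTTTAGGAAG",
    "Mouse": "ATGCTCGGGAGA",
    "Rhesus": "ATGTTTAGGAGA",
    "Opossum": "ATGTTTGGGAGA",
}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def atg_conserved(sequences):
    """Return True if every sequence starts with an ATG codon."""
    return all(seq.upper().startswith("ATG") for seq in sequences.values())


translations = {sp: translate(seq) for sp, seq in ORTHOLOGUE_STARTS.items()}
all_start_with_atg = atg_conserved(ORTHOLOGUE_STARTS)
```

A conserved start codon alone does not prove coding potential, which is why the slide pairs it with CAGE and peptide evidence.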
GAS5
conservation
snoRNAs
MILKLKQMGISLRKKMEINLKLKQ
25aa Peptide - PSM shown in green
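Because the proteomics datasets above are largely built from tryptic peptides, a candidate lncRNA translation is typically digested in silico before being matched against spectra. The sketch below applies the common simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) to the peptide sequence exactly as printed on the slide. The function name is hypothetical, and real search engines use more elaborate rules.

```python
def tryptic_digest(protein):
    """Split a protein into fully tryptic peptides.

    Simplified rule: cleave C-terminal to K or R, except when the
    following residue is P. Missed cleavages are not modelled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


# Sequence copied verbatim from the slide.
gas5_peptide = "MILKLKQMGISLRKKMEINLKLKQ"
gas5_tryptic = tryptic_digest(gas5_peptide)
```

Lysine-rich sequences like this one yield mostly very short fragments under trypsin, which may be one reason datasets using multiple enzymes are valuable for small ORFs.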
Summary
• GENCODE aims to annotate all evidence-based gene
features for the human and mouse genomes
• New technologies are changing the way we interpret
transcriptional evidence
• This increasing transcriptional complexity presents challenges
for functional analysis
• A Basic GENCODE gene set may simplify analysis, but could
obscure regulatory mechanisms
• Distinguishing productive transcripts from non-productive
transcripts is influenced by our understanding of function
Acknowledgements
HAVANA
Annosoft
GENCODE Consortium
Jen Harrow
Ed Griffiths
James Gilbert
Michael Gray
Steve Miller
Gemma Barson
Tim Hubbard
Guigo lab
Julien Lagarde
Barbara Uszczynski
Kellis Lab
Irwin Jungreis
Tress lab
Reymond Lab
Anne-Maud Ferreira
Gerstein Lab
Cristina Sisu
Baikang Pei
Suganthi Bala
Fabio Navarro
UCSC
Mark Diekhans
Benedict Paten
Rachel Harte
If Barnes
Andrew Berry
Alex Bignell
Claire Davidson
Gloria Despacio-Reyes
Sarah Donaldson
Adam Frankish
Matt Hardy
Mike Kay
Jane Loveland
Deepa Manthravadi
Jonathan Mudge
Gaurab Mukherjee
Charles Steward
Marie-Marthe Suner
Mark Thomas
Jo Howes (RA)
Annotrack/gencode
Jose Manuel Gonzalez
Electra Tapanari
Ensembl
Paul Flicek
Bronwen Aken
Stephen Trevanion
Proteomics
Jyoti Choudhary
James Wright
454 extension compared against
other methods
454 extension
Iyer dataset
Original lncRNA
gencode7
Clark_capture array
454 data more extensions
RP11-415J8.3-004
RP11-415J8.5-001
Transcript details in Vega
Novel RACE isoforms in targeted loci