Detangling transcriptional complexity in GENCODE using cutting-edge transcriptomics and proteomics
Mark Thomas
Wellcome Trust Sanger Institute
Biocuration 2015, Beijing

Overview
• Introduction to manual annotation tools used for GENCODE pipelines
• Improving the comprehensive gene set and trying to catalogue function
• Validation of the lncRNA gene sets with long read technology
• Impact of proteomics on functional annotation

gencodegenes.org
Current GENCODE releases: Human 22, Mouse M4

GENCODE pipeline
[Figure labels: Specialised prediction pipelines (Congo, Pseudopipe, Retrofinder, Appris, etc.); Automated annotation (Ensembl); Manual annotation (HAVANA); Merged dataset; QC Tracking System (Annotrack); Experimental verification (RT-PCR, RACE); Assign validation levels; Highlight conflicts; Track solutions; Annotate new regions; Update annotation with QC]

Manual Annotation Tools
[Figure labels: BLAST; Gene predictions; RepeatMasker; CpG prediction; Pfam; RefSeq; Ensembl; Merged manual and automated annotation]
DAS = Distributed Annotation System

Annotation of THUMPD3 locus
[Figure labels: THUMPD3; Ab initio; Transcript evidence; HAVANA annotation; THUMPD3-AS (lncRNA); RNA_seq models; Conservation; Protein evidence; RNA_seq models]

Vega browser for manual annotation

Gene annotation updates in Vega
[Figure labels: Current gene annotation; Updated gene annotation]

HAVANA update tracks in VEGA
[Figure labels: HAVANA updates; Current release]

GENCODE in Ensembl
Transcript Flags
• TSL – Transcript Support Levels
• GENCODE Basic

Basic vs Comprehensive gene sets
[Figure labels: Comprehensive; Basic]

GENCODE Summary
• GENCODE provides manual evidence based gene annotation, combined with automated approaches
• The gene set reveals increasing transcriptional complexity that presents challenges for functional studies
• To facilitate functional analysis, a ‘Basic’ gene set is provided that includes primary transcripts only

Transcriptional Complexity, Functionality, and the ‘non-productive’ transcriptome

Transcriptome Complexity
Coding / Productive / ‘Functional’
Non-Coding / Non-Productive /
Non-‘Functional’

Non-functional transcripts
[Figure labels: mRNA; Functional protein; Exon skip; CDS exon; Non-coding / UTR]
Alternative protein isoform or spliceosome failure?
26,542 retained introns in GENCODE … non-functional transcripts (?)

This ‘biological noise’ is common

Is ‘biological noise’ important?

Splicing signals are frequently weak
[Figure labels: Productive; * Non-productive (e.g. weak splice donor); ON; OFF; ‘Stand By’]
Transcription diverted from a productive to non-productive state for 86 genes (Wong et al. 2013)

Functional transcription for HNRNPDL
[Figure labels: ATG; CDS exon; ‘start not found’; NMD exon]

Transcript evidence may not contain TSS
[Figure labels: ESTs; cDNAs]
Most 5’ point of evidence presumed to be TSS of the gene

TSS complexity for HNRNPDL
[Figure labels: Exon 1 [zoom]; CAGE tags; GM12878 raw CAGE (RIKEN; whole cell); RAMPAGE; MCF7 raw CAGE (RIKEN; whole cell)]
uTSS outside annotation
dTSS within CDS; 10x higher

Functional CDS complexity in HNRNPDL
HUMAN CAGE: stronger dTSS
Conservation of annotated ATG/CDS:
Human    MEVPPRLSHVPPP...
Chimp    MEVPPRLSHVPPP...
Rhesus   MEVPPRLSHVPPP...
Mouse    MEVPPRLSHVPPP...
Dog      MEVPPRLSHVPPP...
Opossum  MEVPPRLSQVPPP...
Ribosome profiling supports ATG.
MOUSE CAGE: stronger dTSS conserved

Updated annotation of HNRNPDL
[Figure labels: reverse strand]
dbSNP rs149817562 / COSM106459 - linked to melanoma

Non-coding genome
• The non-coding genome is rapidly expanding and will soon be greater than the protein coding genome
• Translation of coding genes helps define length and functionality
• Without a translated sequence, determining the length and functionality of long non-coding loci is challenging
• The functionality of many well-defined lncRNAs remains unknown, and may not be transcript-associated.

Annotation of lncRNA to understand functionality
lncRNA are shorter than coding transcripts
Are these full-length?
We have targeted 400 lncRNAs using 5’ and 3’ RACE with 454 sequencing
274 loci were extended, with 1,669 novel transcripts identified

Extension of a novel lincRNA
[Figure labels: CpG island; 454 reads; RACE target]
OTTHUMG00000066036 has been extended to a CpG island using RACE and 454 sequencing

Improving lncRNA annotation
• Non-coding transcripts are poorly represented by existing transcript libraries (e.g. ESTs, cDNAs)
• Shorter reads from RNA-seq provide support for splice junctions, but not exon structure
• Longer reads (e.g. 454 or PacBio) are required to support full-length transcripts

Airn – regulatory lncRNA
[Figure labels: MOUSE; Igf2r (protein coding); Airn (lncRNA)]
• Paternal expression of Airn silences Igf2r
• Degrading the lncRNA has no phenotypic effect (Latos et al. 2012)
• Repression occurs via Airn transcription across the Igf2r promoter
Airn transcripts can be considered non-functional, but their transcription drives a functional process

Non-productive transcriptome
• Studies have traditionally focused on the productive transcriptome
• Non-productive transcription may help us understand gene function and regulation
• This may be a common model…
... and should therefore be included in annotation projects

Checking if lncRNAs are coding
Pandey - Human Proteome
• 30 Tissues
• 85 Experiments
• 25 Million HCD Spectra
• Tryptic Peptides
• Adult and Fetal Samples
Kuster – Human Body Map
• 35 Tissues
• 47 Experiments
• 14 Million Spectra
• Multiple Enzymes
Cutler – Human Tissue
• 9 Tissues
• 12 Experiments
• 13 Million CID Spectra
GENCODE Release 20
James.wright@sanger.ac.uk

Coding or non-coding?
PIGBOS1 changed from lncRNA to protein coding gene
[Figure labels: RAB27A; CAGE tag; 54aa peptide; PSM; PIGBOS1]
Human    ATGTTTAGGAGA
Cow      ATGTTTAGGAAG
Mouse    ATGCTCGGGAGA
Rhesus   ATGTTTAGGAGA
Opossum  ATGTTTGGGAGA
Conservation of initiating ATG across multiple species
(HGNC and RefSeq still have it as lncRNA)

GAS5
[Figure labels: conservation; snoRNAs]
MILKLKQMGISLRKKMEINLKLKQ
25aa Peptide - PSM shown in green

Summary
• GENCODE aims to annotate all evidence based gene features for the human and mouse genomes
• New technologies are changing the way we interpret transcriptional evidence
• This increasing transcriptional complexity presents challenges for functional analysis
• A Basic GENCODE gene set may simplify analysis, but could obscure regulatory mechanisms
• Distinguishing productive transcripts from non-productive transcripts is influenced by our understanding of function

Acknowledgements
HAVANA / Annosoft / GENCODE Consortium
Jen Harrow, Ed Griffiths, James Gilbert, Michael Gray, Steve Miller, Gemma Barson, Tim Hubbard
Guigo lab: Julien Legarde, Barbara Uszczynski
Kellis Lab: Irwin Jungreis
Tress lab
Reymond Lab: Anne-Maude Ferreira
Gerstein Lab: Cristina Sisu, Baikang Pei, Suganthi Bala, Fabio Navara
UCSC: Mark Diekhans, Benedict Paten, Rachel Harte
If Barnes, Andrew Berry, Alex Bignell, Claire Davidson, Gloria Despacio-Reyes, Sarah Donaldson, Adam Frankish, Matt Hardy, Mike Kay, Jane Loveland, Deepa Manthravadi, Jonathan Mudge, Gaurab Mukherjee, Charles Steward, Marie-Marthe Suner, Mark Thomas, Jo Howes (RA)
Annotrack/gencode: Jose Manuel Gonzalez, Electra Tapanari
Ensembl: Paul Flicek, Bronwen Aken, Stephen Trevanion
Proteomics: Jyoti Choudhay, James Wright

454 extension compared against other methods
[Figure labels: 454 extension; Iyer dataset; Original lncRNA gencode7; Clark_capture array; RP11-415J8.3-004; RP11-415J8.5-001]
454 data: more extensions

Transcript details in Vega

Novel RACE isoforms in targeted loci
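
Worked example for the PIGBOS1 slide: the slide lists the first 12 nucleotides of the putative PIGBOS1 coding sequence in five species and notes that the initiating ATG is conserved. The short Python sketch below is only an illustration of that check, not the HAVANA or proteomics pipeline; the function names (translate, start_codon_conserved) are hypothetical, and the sequences are copied verbatim from the slide. It translates each fragment with the standard genetic code and tests whether every fragment begins with ATG.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# 12-nt fragments copied from the PIGBOS1 slide (not full-length sequences).
PIGBOS1_STARTS = {
    "Human": "ATGTTTAGGAGA",
    "Cow": "ATGTTTAGGAAG",
    "Mouse": "ATGCTCGGGAGA",
    "Rhesus": "ATGTTTAGGAGA",
    "Opossum": "ATGTTTGGGAGA",
}


def translate(seq):
    """Translate a DNA sequence in frame 0; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def start_codon_conserved(seqs):
    """Return True if every sequence in the mapping begins with ATG."""
    return all(s.upper().startswith("ATG") for s in seqs.values())


if __name__ == "__main__":
    for species, seq in PIGBOS1_STARTS.items():
        print(f"{species:8s} {seq} {translate(seq)}")
    print("ATG conserved in all species:", start_codon_conserved(PIGBOS1_STARTS))
```

A conserved start codon in a 12-nt fragment is weak evidence on its own; on the slide it is shown alongside a CAGE tag and a peptide-spectrum match (PSM), which this sketch does not model.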