EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context
Katie Whalen
Enzyme Function Initiative (EFI)
2015 ASBMB Annual Meeting
March 30th, 2015

What is a Genome Neighborhood Network?
High sequence homology -> Enzyme function
Low/Med. sequence homology + Genome context -> Enzyme function

What is a Genome Neighborhood Network?
Genes << Operon << Regulon
[Figure: genes R, A, B, C; gene products forming a biological pathway]
Genome neighborhood information facilitates enzyme function discovery via contextual evidence.

What is a Genome Neighborhood Network?
The GNN organizes genome neighborhood information for thousands of query genes in a high throughput and rapid fashion. The resulting network allows a user to quickly identify the protein families that are encoded by the genes within close proximity to the SSN dataset.

GNN Generation
SSN Cluster Inventory
• SSN network file parsing
• Singletons excluded
• Clusters assigned number and unique color
Neighbor Annotation Gathering
• European Nucleotide Archive (ENA) is queried with each SSN sequence
• Protein-encoding genes are compared to Pfam
• Additional annotation information is gathered
Network Generation
• Network xgmml file written
• Nodes = Query sequences and neighbor sequences
• Edge = Drawn between genome proximal sequences
The entire process is fast and computationally inexpensive.

EFI-GNT: Genome neighborhood networks (GNNs)
[Figure series; labels: Query families, Genome neighbors, Pfam family neighbors]
GNNs: SSN query families
GNNs: bacterial proteins in gene clusters
GNNs: collect neighbors (±10)
GNNs: cluster neighbors into Pfam families
GNNs: deduce function
• shared context, same pathway, same function
• unique context, unique pathway, unique function

Tools: EFI-EST and EFI-GNT
http://enzymefunction.org/

EFI-GNT: Tutorial
EFI-GNT (EFI-Genome Neighborhood Tool)

Input SSN
Input SSN: from EFI-EST
Output SSN: families (queries) with unique colors

Output GNN: neighborhood Pfams colored by query
[Figure labels: Shared context; Unique context]

Output GNN: neighborhood Pfam clusters

Network Visualization
GNN files best viewed in Cytoscape 3.2
Best layout: Prefuse Force Directed
www.cytoscape.org

Network Manipulation
Generally, the full ±10 neighbor GNN presents an overwhelming amount of information.

Network Manipulation
[Figure panels: Full SSN; Single SSN Cluster; Full GNN; Cluster-Specific GNN]
Analyze the neighboring Pfam families specific to this SSN-cluster, 93.
Instructions: http://enzymefunction.org/content/cytoscape-3-and-gnns

GNN Format
The GNN visually organizes genome neighborhood information into multiple hub-and-spoke clusters.

Hub Nodes = a neighboring Pfam family
Node Attributes:
• Num_neighbors = the number of neighbor sequences belonging to this Pfam family
• Num_queries = the number of sequences from SSN that retrieved this Pfam family
• pfam = Pfam number, e.g., PF13365
• Pfam description = a short description of the family, e.g., Trypsin-like peptidase domain
• Name = Short name for Pfam family, used as label

Hub Nodes = a neighboring Pfam family
Neighbor_Accessions = list of all Pfam members found in genome context of SSN, with the following additional information:
• EC number
• PDB code
• PDB-hit
• Swiss-Prot status (reviewed/unreviewed)

PDB-Hit
PDB-hit = a sequence that shares significant homology (E-value < 10^-30) with a protein that has an X-ray crystal structure in the RCSB Protein Data Bank.
[Figure labels: UniProt 88M; BLASTp; PDB 284k; PDB-Hit Database 22M; “PDB code:E-value”]
Related structure -> homology model for docking
For users that are new to homology modeling, see resources by the Sali lab at the University of California at San Francisco.
http://salilab.org/our_resources.html

Spoke Nodes = Represent SSN Clusters
Node Attributes:
• Cluster Number = # assigned to SSN-cluster (e.g. 93)
• SSN Cluster Size = the total # of sequences in the SSN-cluster (e.g.
83 sequences)
• Num_ratio = the % co-occurrence as a ratio
• ClusterFraction = the % co-occurrence as a decimal
• Distance = a list of distances between query and neighbor

Spoke Node Size
% co-occurrence = # neighbors retrieved / SSN cluster size * 100
Web tool default = 20%
• SPASM occurs in 180% of Cluster 93 genome contexts: 83 sequences returned 153 SPASM neighbors
• DegT Aminotransferase occurs in 47% of Cluster 93 genome contexts: 83 sequences returned 39 DegT neighbors

Spoke Node Shape
If a single sequence in the node possesses:
• EC number = triangle
• PDB code = square
• PDB-hit (aligns to sequence in PDB with E-value < 10^-30) = square
• Both PDB(-hit) & EC = diamond
• None of the above = circle

Pfam and the GNN
www.pfam.xfam.org

Pfam Name                          Co-occurrence
NAD-dep. Epimerase                 120%
4Fe-4S Cluster Domain              150%
Glycosyl Transferase 1             75%
Methyltransferase                  100%
dTDP-Dehydrorhamnose Epimerase     66%
Radical SAM                        220%
DegT Aminotransferase              47%
Glycosyl Transferase 2             116%
Nucleotidyl Transferase            53%
SPASM                              180%

Neighborhood Size
EFI-GNT default neighborhood size = ±10 genes*
Users may lower this to ±3-9 genes*
[Figure: gene positions -3 to +5 around the query]
Zheng et al. 2002, Genome Research 12, 1221
* Genes = ENA entries

GNN Signal-to-Noise: Added Noise
The utility of the GNN is limited primarily by its signal-to-noise ratio.
Signal = proximal and functionally related genes
Noise = proximal and irrelevant genes

Source of Noise                    Remedy
Distant genes                      Decrease neighborhood size
Uncommonly co-occurring genes      Increase co-occurrence threshold
SSN over-fractionation             Return SSN to less stringent e-value

GNN Signal-to-Noise: Lost Signal
Why did my query sequence return fewer than 20 neighbors?
• Query sequence does not match to the ENA sub-databases (eukaryotic)
• Not protein-encoding (RNA)
• Query sequence is located near the beginning or end of the ENA file
• The neighbor entry does not have an associated EMBL accession number
• The neighbor entry has not been incorporated into a current Pfam family
[Figure: gene positions -3 to +5 around the query, with positions 1 and 3 marked X]

EFI-GNT Web tool
www.enzymefunction.org

EFI-GNT Input
www.efi.igb.illinois.edu/efi-gnt
1. Upload xgmml network, full or rep-node
2. Pick neighborhood size: ±3-10 genes
3. Enter co-occurrence lower limit (1-100%)
4. Enter E-mail address
5. Hit “GO”
Upload Status Bar

EFI-GNT Processing
Do Not Close Browser Window!

EFI-GNT Output
A download link will be sent to the E-mail address provided. Data are stored on the server for 7 days.

EFI-GNT Output
The EFI-GNT output is a pair of .xgmml files, plus tabular output:
• Colored SSN
• Genome neighborhood network (GNN)
• Tabular output

Tutorial Pages
Tutorial pages containing content similar to this presentation
Tutorials specific to Cytoscape use: http://enzymefunction.org/content/cytoscape-3-and-gnns

Feel free to explore the sequence identity space and genome context of a protein family using EFI-EST and EFI-GNT.
Please see posters by Katie Whalen (#2642) and Dr. Brian San Francisco (#8436) for further examples of EFI-EST & EFI-GNT use.
Feel free to contact us with questions/comments/suggestions.
efi@enzymefunction.org

Acknowledgements
GNN Development
• Katie Whalen (UIUC)
• Daniel Davidson (UIUC)
• Jason Bouvier (UIUC)
• Suwen Zhao (UCSF)
• Alan Barber (Pythoscape, UCSF)
Website Build
• David Slater (UIUC)
• Gabe Horton (UIUC)
Principal Investigators
• John Gerlt (UIUC)
• Matthew Jacobson (UCSF)

INDEX SLIDES

Network Visualization
NOTE – in Cytoscape the automatic rendering and coloring of the colorized SSN is size dependent.
Cytoscape settings include a “Threshold View” that needs to be adjusted in the following manner in order to automatically view your colored SSN:
• In any version 3.X, go to Edit -> Preferences -> Properties
• With “cytoscape 3” selected in the pull-down menu at the top, scroll to the bottom of the Property list and select “viewThreshold”
• Click “Modify” and add 5 zeros to the end of the displayed number
• Click “OK”
Restart Cytoscape (this should only need to be done once per version of Cytoscape installed on your machine)

Example: proline racemase superfamily
[Figure labels: < 10^-120; > 60% ID]
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275

GNN: pathway “parts”
[Figure labels: DAO, ALDH, DHDPS, LDH/MDH, OCD]
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275

From GNN: complete pathways
[Figure labels: DAO, DHDPS, ALDH, OCD, LDH/MDH]
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275

Spoke Nodes
Spoke node size is dependent on the % co-occurrence of that Pfam in the neighborhood of that SSN cluster.
% co-occurrence = # neighbors retrieved / SSN cluster size * 100

% Co-occurrence    Indicative Situation
< 100%             The neighbor gene is not well-conserved and potentially unimportant to the physiological pathway of the query gene.
< 100%             This particular SSN-cluster is not isofunctional, containing multiple neighborhood contexts.
≈ 100%             The neighbor gene is a well-conserved member of the genome neighborhood.
> 100%             Two or more instances of neighbors from this particular Pfam family exist in the genome neighborhood.

EFI-GNT Output
NOTE – depending on your browser, the files may download with an additional file extension, such as: .xgmml.txt or .xgmml.xml
You must delete the .txt or .xml extension in order to open these files in Cytoscape!
Cytoscape opens .xgmml
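
The extension fix in the note above can also be scripted. The following is a minimal sketch, not part of EFI-GNT: the function names `strip_extra_extension` and `fix_downloads` and the folder argument are hypothetical, and it assumes the only unwanted endings are the two named on the slide (.xgmml.txt and .xgmml.xml).

```python
from pathlib import Path


def strip_extra_extension(path: Path) -> Path:
    """Return the path without a trailing .txt/.xml that follows .xgmml.

    Hypothetical helper: paths that do not end in .xgmml.txt or
    .xgmml.xml are returned unchanged.
    """
    name = path.name.lower()
    if name.endswith((".xgmml.txt", ".xgmml.xml")):
        # with_suffix("") drops only the final suffix, leaving ".xgmml"
        return path.with_suffix("")
    return path


def fix_downloads(folder: str) -> list[Path]:
    """Rename affected files in `folder`; return the new paths."""
    renamed = []
    for old in Path(folder).iterdir():
        new = strip_extra_extension(old)
        # Skip untouched files and never overwrite an existing file
        if new != old and not new.exists():
            old.rename(new)
            renamed.append(new)
    return renamed
```

Calling `fix_downloads` on the download folder (the folder location is up to the user) renames only files with the doubled extension and leaves everything else alone.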
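
The % co-occurrence formula from the Spoke Nodes slide can be checked with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical and not taken from the EFI-GNT code, and the 20% default is the web tool default quoted in the slides. The inputs are the Cluster 93 numbers from the Spoke Node Size slide (83 sequences, 153 SPASM neighbors, 39 DegT neighbors), which the slides quote as 180% and 47%.

```python
def percent_cooccurrence(num_neighbors: int, ssn_cluster_size: int) -> float:
    """% co-occurrence = # neighbors retrieved / SSN cluster size * 100."""
    if ssn_cluster_size <= 0:
        raise ValueError("SSN cluster size must be positive")
    return num_neighbors / ssn_cluster_size * 100


def passes_threshold(pct: float, lower_limit: float = 20.0) -> bool:
    """True if a Pfam family meets the co-occurrence lower limit.

    20% is the web tool default given in the slides; the comparison
    being inclusive is an assumption of this sketch.
    """
    return pct >= lower_limit


cluster_size = 83  # SSN Cluster 93
for pfam_name, neighbors in [("SPASM", 153), ("DegT Aminotransferase", 39)]:
    pct = percent_cooccurrence(neighbors, cluster_size)
    print(f"{pfam_name}: {pct:.0f}% (kept: {passes_threshold(pct)})")
```

A value above 100%, as for SPASM, means that on average more than one neighbor from that Pfam family sits in each genome neighborhood, matching the "> 100%" row of the table.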