EFI-Genome Neighborhood Tool: a web
tool for large-scale analysis of genome
context
Katie Whalen
Enzyme Function Initiative (EFI)
2015 ASBMB Annual Meeting
March 30th, 2015
What is a Genome Neighborhood Network?
High sequence homology
Enzyme function
Low/Med. Sequence homology + Genome Context
Enzyme function
What is a Genome Neighborhood Network?
Genes << Operon << Regulon
gene products forming a biological pathway
R A B C
Genome neighborhood information facilitates enzyme function
discovery via contextual evidence
What is a Genome Neighborhood Network?
The GNN organizes genome
neighborhood information for
thousands of query genes in a high
throughput and rapid fashion.
The resulting network allows a user to
quickly identify the protein families that
are encoded by the genes within close
proximity to the SSN dataset.
GNN Generation
SSN Cluster Inventory
•  SSN network file parsing
•  Singletons excluded
•  Clusters assigned number and unique color
Neighbor Annotation Gathering
•  European Nucleotide Archive (ENA) is queried with each SSN sequence
•  Protein-encoding genes are compared to Pfam
•  Additional annotation information is gathered
Network Generation
•  Network xgmml file written
•  Nodes = Query sequences and neighbor sequences
•  Edge = Drawn between genome-proximal sequences
The entire process is fast and computationally inexpensive
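The three stages above can be sketched in a few lines of Python. This is only an illustration, not the EFI-GNT source: the function name `build_gnn` and the two lookup dictionaries are hypothetical stand-ins for the ENA query and the Pfam comparison, and the output groups neighbors by Pfam family and SSN cluster, following the hub-and-spoke format described later in this deck.

```python
from collections import defaultdict

def build_gnn(ssn_clusters, neighbors_of, pfam_of):
    """Group neighbor sequences by (Pfam family, SSN cluster number).

    ssn_clusters: {cluster_number: [query accessions]}, singletons already excluded
    neighbors_of: {query accession: [neighbor accessions]}  (stand-in for the ENA lookup)
    pfam_of:      {neighbor accession: Pfam id}             (stand-in for the Pfam comparison)
    """
    edges = defaultdict(list)
    for cluster, queries in ssn_clusters.items():
        for query in queries:
            for neighbor in neighbors_of.get(query, []):
                pfam = pfam_of.get(neighbor)
                if pfam is None:
                    # Neighbor not assigned to a Pfam family: it cannot join a hub.
                    continue
                edges[(pfam, cluster)].append(neighbor)
    return dict(edges)

# Toy data (made-up accessions) for a single SSN cluster.
clusters = {93: ["Q1", "Q2"]}
neighbors = {"Q1": ["N1", "N2"], "Q2": ["N3"]}
pfams = {"N1": "PF13365", "N3": "PF13365"}  # N2 has no Pfam assignment
gnn = build_gnn(clusters, neighbors, pfams)
```

Each key of `gnn` corresponds to one hub-spoke edge, and the list length is the number of neighbors retrieved for that pair.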
EFI-GNT: Genome neighborhood networks (GNNs)
Query families
GNNs: SSN query families
Query families
GNNs: bacterial proteins in gene clusters
Query families
Genome neighbors
GNNs: collect neighbors (±10)
Query families
Genome neighbors
GNNs: cluster neighbors into Pfam families
Query families
Genome neighbors
Pfam family neighbors
GNNs: deduce function
Query families
Genome neighbors
shared context
same pathway
same function
Pfam family neighbors
unique context
unique pathway
unique function
Tools: EFI-EST and EFI-GNT
http://enzymefunction.org/
EFI-GNT: Tutorial
EFI-GNT (EFI-Genome Neighborhood Tool)
Input
SSN
Input SSN: from EFI-EST
Output SSN: families (queries) with unique colors
Output GNN: neighborhood Pfams colored by query
Shared context
Unique context
Output GNN: neighborhood Pfam clusters
Network Visualization
GNN files best viewed in Cytoscape 3.2
Best layout: Prefuse Force Directed
www.cytoscape.org
Network Manipulation
Generally, the full ±10 neighbor GNN
presents an overwhelming amount of
information.
Network Manipulation
[Figure panels: Full SSN, Single SSN Cluster, Full GNN, Cluster-Specific GNN]
Analyze the neighboring Pfam families specific to this SSN-cluster, 93.
Instructions: http://enzymefunction.org/content/cytoscape-3-and-gnns
GNN Format
The GNN visually organizes genome neighborhood
information into multiple hub-and-spoke clusters.
Hub Nodes = a neighboring Pfam family
Node Attributes:
•  Num_neighbors = the number of neighbor sequences
belonging to this Pfam family
•  Num_queries = the number of sequences from SSN that
retrieved this Pfam family
•  pfam = Pfam number, e.g., PF13365
•  Pfam description = a short description of the family, e.g.,
Trypsin-like peptidase domain
•  Name = Short name for Pfam family, used as label
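For users who want to read these attributes outside Cytoscape, a minimal sketch with Python's standard XML parser follows. It assumes the usual XGMML layout of `<node>` elements carrying `<att name="..." value="..."/>` children; the sample document, its node label, and its counts are made up for illustration, and a real EFI-GNT file may nest list attributes differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general XGMML shape; values are illustrative only.
SAMPLE = """<graph label="gnn" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="Trypsin_2">
    <att name="pfam" type="string" value="PF13365"/>
    <att name="Pfam description" type="string" value="Trypsin-like peptidase domain"/>
    <att name="Num_neighbors" type="integer" value="39"/>
    <att name="Num_queries" type="integer" value="35"/>
  </node>
</graph>"""

def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def node_attributes(xgmml_text):
    """Return {node label: {attribute name: value string}} for every node."""
    root = ET.fromstring(xgmml_text)
    nodes = {}
    for elem in root.iter():
        if local(elem.tag) != "node":
            continue
        atts = {a.get("name"): a.get("value") for a in elem if local(a.tag) == "att"}
        nodes[elem.get("label")] = atts
    return nodes

hubs = node_attributes(SAMPLE)
```

All values come back as strings, so numeric attributes such as Num_neighbors need an explicit `int(...)` before any arithmetic.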
Hub Nodes = a neighboring Pfam family
Neighbor_Accessions = list of all Pfam
members found in genome context of
SSN, with the following additional
information:
•  EC number
•  PDB code
•  PDB-hit
•  Swiss-Prot status (reviewed/
unreviewed)
PDB-Hit
PDB-hit = a sequence that shares significant (E-value < 10^-30) homology with a protein with
an X-ray crystal structure in the RCSB Protein
Data Bank.
[Figure: UniProt (88M) searched by BLASTp against PDB (284k) gives the PDB-Hit Database (22M), recorded as “PDB code:E-value”]
Related structure → homology model for docking
For users that are new to homology
modeling, see resources by Sali lab at the
University of California at San Francisco.
http://salilab.org/our_resources.html
Spoke Nodes = Represent SSN Clusters
The Node Attributes:
•  Cluster Number = # assigned to SSN-cluster (e.g., 93)
•  SSN Cluster Size = the total # of
sequences in the SSN-cluster (e.g.,
83 sequences)
•  Num_ratio = the % co-occurrence as
a ratio
•  ClusterFraction = the % co-occurrence as a decimal
•  Distance = a list of distances between
query and neighbor.
Spoke Node Size
% co-occurrence = # neighbors retrieved / SSN cluster size * 100
Web tool default = 20%
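As a worked example of the formula, here is a small sketch using the DegT numbers for Cluster 93 from this slide (39 neighbors, 83 sequences). The function names are hypothetical, and treating the lower limit as inclusive (>=) is an assumption about how the web tool applies it.

```python
def percent_cooccurrence(neighbors_retrieved, ssn_cluster_size):
    """% co-occurrence = # neighbors retrieved / SSN cluster size * 100."""
    if ssn_cluster_size <= 0:
        raise ValueError("SSN cluster size must be positive")
    return neighbors_retrieved / ssn_cluster_size * 100

def passes_threshold(pct, lower_limit=20.0):
    """True if a spoke would be kept at the given lower limit (assumed inclusive)."""
    return pct >= lower_limit

degt = percent_cooccurrence(39, 83)   # DegT Aminotransferase in Cluster 93
rare = percent_cooccurrence(10, 83)   # a made-up, less common neighbor family
print(round(degt), passes_threshold(degt), passes_threshold(rare))
```

Because a genome neighborhood can contain more than one member of the same Pfam family, the value is not capped at 100%.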
SPASM occurs in 180% of Cluster 93 genome contexts: 83 sequences returned 153 SPASM neighbors
DegT Aminotransferase occurs in 47% of Cluster 93 genome contexts: 83 sequences returned 39 DegT neighbors
Spoke Node Shape
If a single sequence in the node
possesses:
•  EC number = triangle
•  PDB code = square
•  PDB-hit (aligns to sequence in PDB
with E-value < 10^-30) = square
•  Both PDB(-hit) & EC = diamond
•  None of the above = circle
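The shape rules above amount to a short decision function. The sketch below is one way to write them down; the function and argument names are hypothetical, and each flag means "at least one sequence in the node has this annotation".

```python
def spoke_node_shape(has_ec, has_pdb, has_pdb_hit):
    """Map the annotations present in a spoke node to its display shape."""
    structural = has_pdb or has_pdb_hit   # PDB code and PDB-hit are treated alike
    if structural and has_ec:
        return "diamond"
    if has_ec:
        return "triangle"
    if structural:
        return "square"
    return "circle"

print(spoke_node_shape(has_ec=True, has_pdb=False, has_pdb_hit=True))
```

The combined case is checked first so that a node with both kinds of evidence is not reported as a plain triangle or square.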
www.pfam.xfam.org
Pfam and the GNN
Pfam Name                         Co-occurrence
NAD-dep. Epimerase                120%
4Fe-4S Cluster Domain             150%
Glycosyl Transferase 1            75%
Methyltransferase                 100%
dTDP-Dehydrorhamnose Epimerase    66%
Radical SAM                       220%
DegT Aminotransferase             47%
Glycosyl Transferase 2            116%
Nucleotidyl Transferase           53%
SPASM                             180%
www.pfam.xfam.org
Neighborhood Size
EFI-GNT default neighborhood size = ± 10 genes*
Users may lower this to ± 3 - 9 genes*
-3 -2 -1 query 1 2 3 4 5
Zheng et al. 2002, Genome Research 12, 1221
* Genes = ENA entries
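The neighborhood window can be pictured as a slice of an ordered gene list. The sketch below assumes the genes of one ENA entry are available as a Python list in genomic order, which is a simplification; the function name is hypothetical.

```python
def neighborhood(genes, query_index, size=10):
    """Return (offset, gene) pairs within +/- size positions of the query.

    The query itself is excluded. The window is clipped at the ends of the
    list, so a query near either end yields fewer than 2 * size neighbors.
    """
    if not 3 <= size <= 10:
        raise ValueError("neighborhood size must be between 3 and 10")
    start = max(0, query_index - size)
    stop = min(len(genes), query_index + size + 1)
    return [(i - query_index, genes[i]) for i in range(start, stop) if i != query_index]

genes = list("ABCDEFGHIJ")            # made-up gene labels in genomic order
window = neighborhood(genes, query_index=1, size=3)
print(window)
```

Here the query "B" sits one position from the start of the list, so only one upstream neighbor is returned instead of three.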
GNN Signal-to-Noise: Added Noise
The utility of the GNN is limited primarily by its signal-to-noise
Signal = proximal and functionally related genes
Noise = proximal and irrelevant genes
Source of Noise                    Remedy
Distant genes                      Decrease neighborhood size
Uncommonly co-occurring genes      Increase co-occurrence threshold
SSN over-fractionation             Return SSN to less stringent e-value
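The first two remedies can be illustrated on toy data. In this sketch each hit is a (Pfam, distance) pair for one SSN cluster; the Pfam labels, the distances, and the function names are all made up, and the real tool applies these settings when the GNN is generated rather than afterwards.

```python
from collections import Counter

def within_distance(hits, max_distance):
    """Keep only neighbor hits no more than max_distance genes from the query."""
    return [(pfam, d) for pfam, d in hits if abs(d) <= max_distance]

def cooccurring_pfams(hits, cluster_size, min_pct):
    """Return {Pfam: % co-occurrence} for families at or above min_pct."""
    counts = Counter(pfam for pfam, _ in hits)
    result = {}
    for pfam, n in counts.items():
        pct = n / cluster_size * 100
        if pct >= min_pct:
            result[pfam] = pct
    return result

# Four query sequences in the cluster; PF_B is both distant and uncommon.
hits = [("PF_A", 1), ("PF_A", -2), ("PF_B", 9), ("PF_A", 3)]
loose = cooccurring_pfams(hits, cluster_size=4, min_pct=20)
tight = cooccurring_pfams(within_distance(hits, 3), cluster_size=4, min_pct=50)
print(loose, tight)
```

With the looser settings both families appear; shrinking the neighborhood or raising the threshold leaves only the conserved, proximal family.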
GNN Signal-to-Noise: Lost Signal
Why did my query sequence return fewer than 20 neighbors?
•  Query sequence does not match the ENA sub-databases (eukaryotic)
•  The neighbor is not protein-encoding (RNA)
•  Query sequence is located near the beginning or end of the ENA file
•  The neighbor entry does not have an associated EMBL accession number
•  The neighbor entry has not been incorporated into a current Pfam family
-3 -2 -1 query X1 2 X3 4 5
EFI-GNT Web tool
www.enzymefunction.org
EFI-GNT Input
www.efi.igb.illinois.edu/efi-gnt
1. Upload xgmml network,
full or rep-node
2. Pick neighborhood size:
±3 to ±10 genes
3. Enter co-occurrence lower
limit (1-100%)
4. Enter E-mail address
5. Hit “GO”
Upload Status Bar
EFI-GNT Processing
Do Not Close Browser Window!
EFI-GNT Output
A download link will be sent to the E-mail address provided.
Data stored on server for 7 days.
EFI-GNT Output
The EFI-GNT output is a pair of .xgmml files:
•  Colored SSN
•  Genome neighborhood network (GNN)
•  Tabular output
Tutorial Pages
Tutorial pages containing content similar to this presentation
Tutorials specific to Cytoscape use: http://enzymefunction.org/content/cytoscape-3-and-gnns
Feel free to explore the sequence identity space and genome context of
a protein family using EFI-EST and EFI-GNT.
Please see posters by Katie Whalen (#2642) and Dr. Brian San
Francisco (#8436) for further examples of EFI-EST & EFI-GNT use.
Feel free to contact us with questions/comments/suggestions.
efi@enzymefunction.org
Acknowledgements
GNN Development
Katie Whalen (UIUC)
Daniel Davidson (UIUC)
Jason Bouvier (UIUC)
Suwen Zhao (UCSF)
Alan Barber (Pythoscape, UCSF)
Website Build
David Slater (UIUC)
Gabe Horton (UIUC)
Principal Investigators
John Gerlt (UIUC)
Matthew Jacobson (UCSF)
INDEX SLIDES
Network Visualization
NOTE – in Cytoscape the automatic rendering and coloring of
the colorized SSN is size dependent. Cytoscape settings
include a “Threshold View” that needs to be adjusted in the
following manner in order to automatically view your colored
SSN:
•  In any version 3.X, go to Edit -> Preferences -> Properties
•  With “cytoscape 3” selected in the pull-down menu at the
top, scroll to the bottom of the Property list and select
“viewThreshold”
•  Click “Modify” and append five zeros to the end of the displayed
number
•  Click “OK”
Restart Cytoscape (this should only need to be done once per
version of Cytoscape installed on your machine)
Example: proline racemase superfamily
< 10^-120
> 60% ID
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275
GNN: pathway “parts”
DAO ALDH DHDPS LDH/MDH OCD
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275
From GNN: complete pathways
DAO DHDPS ALDH OCD LDH/MDH
Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275
Spoke Nodes
Spoke node size is dependent on the % co-occurrence of that Pfam in
the neighborhood of that SSN cluster.
% co-occurrence = # neighbors retrieved / SSN cluster size * 100
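The interpretation guide on this slide can be written as a small helper. This is only a sketch: the function name is hypothetical, and the 5-point tolerance used to decide what counts as "≈ 100%" is an assumption, not a value used by EFI-GNT.

```python
def interpret_cooccurrence(pct, tolerance=5.0):
    """Return a rough reading of a % co-occurrence value.

    tolerance (assumed) sets how close to 100 still counts as "about 100%".
    """
    if pct > 100 + tolerance:
        return "two or more neighbors from this Pfam family per neighborhood"
    if pct >= 100 - tolerance:
        return "well-conserved member of the genome neighborhood"
    return "not well-conserved, or the SSN cluster is not isofunctional"

for value in (47, 100, 180):
    print(value, "->", interpret_cooccurrence(value))
```

A value below 100% has two possible readings, so the helper reports both rather than choosing between them.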
% Co-occurrence    Indicative Situation
< 100%             The neighbor gene is not well-conserved and potentially unimportant to the physiological pathway of the query gene.
< 100%             This particular SSN-cluster is not isofunctional, containing multiple neighborhood contexts.
≈ 100%             The neighbor gene is a well-conserved member of the genome neighborhood.
> 100%             Two or more instances of neighbors from this particular Pfam family exist in the genome neighborhood.
EFI-GNT Output
NOTE – depending on your browser, the files may download with an
additional file extension, such as: .xgmml.txt or .xgmml.xml
You must delete the .txt or .xml extension in order to open these
files in Cytoscape!
Cytoscape opens .xgmml
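If many files need fixing, the extra extension can be stripped with a few lines of Python. This sketch only computes the corrected file name; the function name is hypothetical, and the file names shown are examples.

```python
from pathlib import Path

def fix_xgmml_name(path):
    """Return the path without a trailing .txt or .xml added after .xgmml."""
    p = Path(path)
    extra = p.suffix.lower() in {".txt", ".xml"}
    if extra and p.stem.lower().endswith(".xgmml"):
        return p.with_name(p.stem)   # e.g. gnn.xgmml.txt -> gnn.xgmml
    return p                         # already fine, or not an xgmml download

for name in ("gnn.xgmml.txt", "ssn.xgmml.xml", "ssn.xgmml"):
    print(name, "->", fix_xgmml_name(name).name)
```

To rename a file on disk, pass the result to `Path(old).rename(new)` after checking that the target name is not already in use.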