HEALTH-F4-2007-200754
www.gen2phen.org

D3.5 High-Level Domain Model Version 2, with Sample/Phenotype Focus
WP3 – Standard data models and terminologies
V5.0 Final draft

Lead beneficiary: EMBL
Date: 10/08/2009
Nature: Report
Dissemination level: PU (Public)
Authors: Tomasz Adamusiak, Juha Muilu, Morris Swertz, Helen Parkinson
Security: PU
Version: v5.0 Final Draft

© Copyright 2009 GEN2PHEN Consortium

TABLE OF CONTENTS

DOCUMENT INFORMATION
DOCUMENT HISTORY
1. INTRODUCTION
2. DESCRIPTION OF WORK
3. EXISTING MODEL EVALUATION
   3.1. GENOMEUTWIN
   3.2. PAGE-OM
   3.3. XGAP
4. GEN2PHEN PHENOTYPE MODEL
   4.1. PHENOTYPE MODEL CLASS DESCRIPTIONS
   4.2. OBJECT INSTANCE
5. PHENOTYPE MODEL IMPLEMENTATION AND TESTING
6. FUTURE PLANS
   6.1. A HIGH-LEVEL DOMAIN MODEL VERSION 3 (D3.6)
   6.2. DERIVATION AND SPECIFICATION OF EXCHANGE FORMAT (D3.7)
7. ABBREVIATIONS
REFERENCES

Document Information

Grant Agreement Number:  HEALTH-F4-2007-200754
Acronym:                 GEN2PHEN
Full title:              Genotype-To-Phenotype Databases: A Holistic Solution
Project URL:             http://www.gen2phen.org
EU Project officer:      Frederick Marcus (Frederick.Marcus@ec.europa.eu)
Deliverable:             Number D3.5; Title High-Level Domain Model Version 2, with Sample/Phenotype Focus
Work package:            Number 3; Title WP3 – Standard data models and terminologies
Delivery date:           Contractual June 2009; Actual August 2009
Status:                  Version 5.0, final
Nature:                  Report
Dissemination Level:     Public
Authors (Partner):       Tomasz Adamusiak (EMBL), Juha Muilu (UH.FGC), Morris Swertz (EMBL), Helen Parkinson (EMBL)
Responsible Author:      Helen Parkinson
                         Email: parkinson@ebi.ac.uk
                         Partner: EMBL-EBI
                         Phone: +44 (0)1223 494 672

Document History

Name               Date        Version   Description
Tomasz Adamusiak   16/6/2009   1
Helen Parkinson    7/7/2009    2
Tomasz Adamusiak   12/7/2009   3
Helen Parkinson    14/7/2009   4
Review             10/8/2009   5

1. INTRODUCTION

Work package 3, 'Standard data models and terminologies', provides domain standards to develop a GEN2PHEN-specific architecture, facilitate data exchange and integrate data across existing and emerging resources. This work package focuses on providing standards that act as the foundation for much of the database development activity of other work packages. Its objectives include the rapid development of one or more standard data models capable of representing the minimum agreed content standard (as determined by WP2), and a derived data exchange format.

Data models developed in coordination with WP3 will have several uses in GEN2PHEN: data from pre-existing databases will be mapped to generate data in a derived data exchange format, thus offering a flexible solution for integrating and exchanging existing and new data. In this respect, data model development is a necessary prerequisite, initially separated from implementation details.

2. DESCRIPTION OF WORK

The focus of the development process for the GEN2PHEN High-Level Domain Model Version 2, with Sample/Phenotype Focus, is:
• To evaluate relevant public phenotype models
• To develop a core GEN2PHEN phenotype model
• To support primary GEN2PHEN use cases, especially in LSDB and HTP domains

The two GEN2PHEN modelling workshops, in Hinxton (April 9-11, 2008) and Helsinki (January 19-22, 2009), laid the groundwork for specific sub-domain development. Work continued during the first GEN2PHEN Phenotype Workshop (Geneva, May 7-8, 2009), hosted by SIB. Use cases were gathered, models were developed, and minimum content standards for exchanging data between partners were discussed in the context of specific phenotype extensions. See Appendix 1 for detailed workshop proceedings.
External invited participants from the epidemiology, medical genetics, ontology development and model organism communities provided expertise and use cases beyond those of Consortium Partners.

3. Existing model evaluation

Several public data models currently exist in the phenotype space. Some of them have been documented at www.schemalet.org, an experimental wiki site for documenting use-case-specific data models. During the First Phenotype Workshop, the models most closely aligned to GEN2PHEN were evaluated for relevance, domain coverage compared to existing resources, ease of use and complexity.

3.1. GenomEUtwin

GenomEUtwin is a 5th Framework Programme project aimed at unifying studies of European volunteer twins to identify genes underlying common diseases. The GenomEUtwin object model has already been tested on large population cohorts by UH.FGC. See paragraph 4.1 in Appendix 1 for a diagram and more details of the model.

3.2. PAGE-OM

PaGE-OM is a complete OMG standard reference model that represents genotype data both in summary and at the level of the individual. It also represents LSDB-type data and phenotype, and supports some legacy technology use cases. PaGE-OM is very detailed and is useful as a reference model: GEN2PHEN-specific models can be aligned to it, and it can be used as a meta-mapping model for mapping external data representations. It is, however, rather complex, and one aim of the WP3 modelling activities is to develop 'modules' whereby domain-specific models can be developed, used alone, implemented and made interoperable. See paragraph 4.2 in Appendix 1 for a diagram and more details of the model.

3.3. XGAP

See http://www.xgap.org for the XGAP model.
XGAP addresses the data management, querying and integration challenges of system-wide genetics experiments. It provides a simple tabular text file format for exchanging data between collaborators, a customizable data infrastructure to store, query and integrate data, and a foundation for analysis tools. See paragraph 4.3 in Appendix 1 for a diagram and more details of the model.

4. GEN2PHEN Phenotype Model

Figure 1. GEN2PHEN Phenotype Model

A GEN2PHEN phenotype model was developed during the Phenotype Workshop in Geneva, based on Partners' input and the opinions of invited domain experts. It was later iterated through a series of face-to-face meetings and teleconferences among Partners. Figure 1 presents the 1.0 version of the model, constructed in Enterprise Architect. It is also available from the schemalet.org website, and in Enterprise Architect and XML formats from the GEN2PHEN SVN (https://svn.gene.le.ac.uk/gen2phen/trunk/object_models/).

4.1. Phenotype Model class descriptions

• Individual – Subject of a study.
• Inferred_value – Inferred conclusion, derived from zero or many Observed_value instances.
• Observable_feature – A measurable feature of an Individual, e.g. blood pressure.
• Observation_target – Superclass of all observation targets, such as Individual or Panel.
• Ontology_term – Term defined in a specific namespace (ontology source).
All names and terms should be defined using ontology terms whenever possible.
• Observed_value – Specific value measured in an experiment, e.g. 120 (systolic BP, mmHg).
• Panel – Collection of Individual instances.
• Protocol – Describes how a measurement is to be performed, or a specific Standard Operating Procedure.
• Protocol_application – Describes how a Protocol was instantiated in a particular case, i.e. how the measurement was done, e.g. on 16/6/2009 by Tomasz Adamusiak.
• Variable_definition – Extends the Observable_feature class to enable a precise definition of the feature in the applications that use it (for example, it has a unit).

Mappings to PaGE-OM and XGAP are available on the schemalet wiki at: http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype

4.2. Object instance

Figure 2. GEN2PHEN Phenotype Model object instance

An example instance of the model is shown in Figure 2. A blood pressure measuring protocol was applied to the observation target Juha on 25/5/2009. Two values were measured at 10am: 150 and 90, the systolic and diastolic blood pressure in mmHg, respectively.

Figure 3. Inferred value example

The instance depicted in Figure 3 extends the previous one to show how a previously measured blood pressure can be used to infer disease status. A separate inference protocol was applied on 31/5/2009, and high blood pressure was inferred at 2pm.

5. Phenotype Model implementation and testing

Figure 4. GEN2PHEN Phenotype Model implementation in MOLGENIS notation

To test and develop the GEN2PHEN phenotype model, we have collaborated with the developers of MOLGENIS [1, 2]. MOLGENIS is an open source software platform to efficiently design, implement and autogenerate databases, APIs and web applications from object models. Its power lies in the use of models and generators, so that the best solutions are easily reused between applications. In one step, MOLGENIS generates a database (MySQL or PostgreSQL), a web-based GUI, programmatic interfaces including a Java API and SOAP web services usable in tools like Taverna (http://taverna.sourceforge.net) and by statistical scripts written in the R language (http://www.r-project.org), as well as full documentation of the object model. Several Java plug-in mechanisms are also available to customize the generated software. By developing smaller models and ensuring their interoperability using MOLGENIS, some or all of the models can be consumed by various partners, the majority of whom have use cases that encompass only some of the models.

MOLGENIS has been successfully used within the GEN2PHEN Consortium for:
1. MAGE-TAB OM: http://magetab-om.sourceforge.net
2. The LSDB object model developed in the course of the Second Modelling Workshop: http://magetab-om.sourceforge.net/lsdb/1.0/object_model.html
3. An example LSDB, Findis, the Finnish National Mutation Database (NMDB): http://www.schemalet.org/mediawiki/index.php/FINDIS:Database

Figure 4 depicts the GEN2PHEN Phenotype Model as implemented on the MOLGENIS platform.
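To make the class descriptions in section 4.1 and the object instances in section 4.2 more concrete, the following is a minimal sketch of the model as plain Python dataclasses. It is an illustration only, not the MOLGENIS-generated implementation. The attribute names (performed_at, derived_from, unit and so on), the use of inheritance for Variable_definition, and the protocol and feature names are assumptions, since the text gives class descriptions but not their attributes. Plain strings stand in for Ontology_term references.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ObservationTarget:
    """Superclass of all observation targets, such as Individual or Panel."""
    name: str


@dataclass
class Individual(ObservationTarget):
    """Subject of a study."""


@dataclass
class Panel(ObservationTarget):
    """Collection of Individual instances."""
    individuals: List[Individual] = field(default_factory=list)


@dataclass
class ObservableFeature:
    """A measurable feature of an Individual, e.g. blood pressure."""
    name: str


@dataclass
class VariableDefinition(ObservableFeature):
    """Observable feature with a more precise definition (assumed: only a unit)."""
    unit: Optional[str] = None


@dataclass
class Protocol:
    """Describes how a measurement is to be performed."""
    name: str


@dataclass
class ProtocolApplication:
    """One application of a Protocol to a target (attribute names are assumed)."""
    protocol: Protocol
    target: ObservationTarget
    performed_at: datetime


@dataclass
class ObservedValue:
    """Specific value measured in an experiment."""
    feature: ObservableFeature
    value: float
    protocol_application: ProtocolApplication


@dataclass
class InferredValue:
    """Conclusion derived from zero or many ObservedValue instances."""
    feature: ObservableFeature
    value: str
    protocol_application: ProtocolApplication
    derived_from: List[ObservedValue] = field(default_factory=list)


# Figure 2: blood pressure measured for Juha on 25/5/2009 at 10am.
juha = Individual("Juha")
measurement = ProtocolApplication(
    Protocol("Blood pressure measurement"), juha, datetime(2009, 5, 25, 10, 0))
systolic = ObservedValue(
    VariableDefinition("Systolic blood pressure", "mmHg"), 150, measurement)
diastolic = ObservedValue(
    VariableDefinition("Diastolic blood pressure", "mmHg"), 90, measurement)

# Figure 3: a separate inference protocol applied on 31/5/2009 at 2pm.
inference = ProtocolApplication(
    Protocol("Inference"), juha, datetime(2009, 5, 31, 14, 0))
status = InferredValue(
    ObservableFeature("Disease status"), "high blood pressure",
    inference, [systolic, diastolic])

print(status.value, "derived from", [v.value for v in status.derived_from])
```

The sketch records the inferred value and the observed values it was derived from, but does not encode an inference rule, because the text does not specify one.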
Full documentation of the MOLGENIS implementation is available in Appendix 2, and a working implementation of the model, comprising a back-end database, GUI, etc., is available from: http://wwwdev.ebi.ac.uk/microarray-srv/pheno/

6. FUTURE PLANS

6.1. A High-Level Domain Model Version 3 (D3.6)

This will be an improved and tested set of standard UML data models for all required domains, ready to be implemented by all Partners. Feedback from Partners will then be used to provide the ultimate design underpinnings for all GEN2PHEN databases in Iterative Specialized Domain Modelling Complete (D3.9). These sub-domain models, including the GEN2PHEN Phenotype Model, will all be extensively tested, and a reference implementation will be provided on the MOLGENIS platform.

6.2. Derivation and Specification of Exchange Format (D3.7)

The priorities for data formats in GEN2PHEN are data exchange between locus-specific databases and central repositories, and HTP data. The modelling work to date has separated these domains to support immediate needs for data exchange. The models developed will eventually support the phenotype extension reported here as well.

Validation of the LSDB data model commenced in 2009, working with existing LSDBs inside and outside the GEN2PHEN consortium, most of whom have existing data formats. Those formats will support the data content of the GEN2PHEN Phenotype Model.

Validation of the MAGE-TAB OM is underway and progress is promising. We envisage that phenotypic descriptors, e.g. membership of a cohort through a shared phenotype or trait, will require an extension of MAGE-TAB, and that the requirement to provide details of markers in the context of HTP data will also require an extension.

7. Abbreviations

HGVS      Human Genome Variation Society
LSDB      Locus Specific Database
XGAP      Xtensible Genotype And Phenotype data platform
PaGE-OM   Phenotype and Genotype Experiment object model

REFERENCES

1. Swertz, M.A., et al., Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases. Bioinformatics, 2004. 20(13): p. 2075-83.
2. Swertz, M.A. and R.C. Jansen, Beyond standardization: dynamic software infrastructures for systems biology. Nat Rev Genet, 2007. 8(3): p. 235-43.
3. Wildeman, M., et al., Improving sequence variant descriptions in mutation databases and literature using the Mutalyzer sequence variation nomenclature checker. Hum Mutat, 2008. 29(1): p. 6-13.