HEALTH-F4-2007-200754
www.gen2phen.org

D3.5 High-Level Domain Model Version 2, with Sample/Phenotype Focus
WP3 – Standard data models and terminologies
V5.0 Final

Lead beneficiary: EMBL
Date: 10/08/2009
Nature: Report
Dissemination level: PU (Public)
Authors: Tomasz Adamusiak, Juha Muilu, Morris Swertz, Helen Parkinson

© Copyright 2009 GEN2PHEN Consortium

TABLE OF CONTENTS

DOCUMENT INFORMATION
DOCUMENT HISTORY
1. INTRODUCTION
2. DESCRIPTION OF WORK
3. EXISTING MODEL EVALUATION
3.1. GENOMEUTWIN
3.2. PAGE-OM
3.3. XGAP
4. GEN2PHEN PHENOTYPE MODEL
4.1. PHENOTYPE MODEL CLASS DESCRIPTIONS
4.2. OBJECT INSTANCE
5. PHENOTYPE MODEL IMPLEMENTATION AND TESTING
6. FUTURE PLANS
6.1. A HIGH-LEVEL DOMAIN MODEL VERSION 3 (D3.6)
6.2. DERIVATION AND SPECIFICATION OF EXCHANGE FORMAT (D3.7)
7. ABBREVIATIONS
REFERENCES
APPENDIX I - Report on the First GEN2PHEN Phenotype Workshop
APPENDIX II - GEN2PHEN Phenotype Model Reference Implementation

Document Information

Grant Agreement Number: HEALTH-F4-2007-200754
Acronym: GEN2PHEN
Full title: Genotype-To-Phenotype Databases: A Holistic Solution
Project URL: http://www.gen2phen.org
EU Project officer: Frederick Marcus (Frederick.Marcus@ec.europa.eu)
Deliverable: Number D3.5; Title High-Level Domain Model Version 2, with Sample/Phenotype Focus
Work package: Number 3; Title WP3 – Standard data models and terminologies
Delivery date: Contractual June 2009; Actual August 2009
Status: Version 5.0, final
Nature: Report
Dissemination Level: Public
Authors (Partner): Tomasz Adamusiak (EMBL), Juha Muilu (UH.FGC), Morris Swertz (EMBL), Helen Parkinson (EMBL)
Responsible Author: Helen Parkinson; Email parkinson@ebi.ac.uk; Partner EMBL-EBI; Phone +44 (0)1223 494 672

Document History

Name                Date         Version   Description
Tomasz Adamusiak    16/6/2009    1         First Draft Created
Helen Parkinson     7/7/2009     2         Internal Review
Tomasz Adamusiak    12/7/2009    3         Corrections
Helen Parkinson     14/7/2009    4         Comments
Helen Parkinson     10/8/2009    5         Review

Definitions

Partners of the GEN2PHEN Consortium are referred to herein according to the following codes:

ULEIC – University of Leicester (UK) – Coordinator
EMBL – European Molecular Biology Laboratory (Germany) – Beneficiary
FIMIM – Fundació IMIM (Spain) – Beneficiary
LUMC – Leiden University Medical Center (Netherlands) – Beneficiary
INSERM – Institut National de la Santé et de la Recherche Médicale (France) – Beneficiary
KI – Karolinska Institutet (Sweden) – Beneficiary
FORTH – Foundation for Research and Technology Hellas (Greece) – Beneficiary
CEA – Commissariat à l’Energie Atomique (France) – Beneficiary
EMC – Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
UH.FGC – Helsingin Yliopisto (Finland) – Beneficiary
UAVR – Universidade de Aveiro (Portugal) – Beneficiary
UWC – University of the Western Cape (South Africa) – Beneficiary
CSIR – Council of Scientific and Industrial Research (India) – Beneficiary
SIB – Swiss Institute of Bioinformatics (Switzerland) – Beneficiary
UNIMAN – The University of Manchester (UK) – Beneficiary
BIOBASE – BioBase GmbH. (Germany) – Beneficiary
deCODE – Islensk Erfoagreining EH (Iceland) – Beneficiary
PHENO – Phenosystems S.A. (Belgium) – Beneficiary
BCP – Biocomputing Platforms Ltd. Oy (Finland) – Beneficiary

Grant Agreement: The agreement signed between the beneficiaries and the European Commission for the undertaking of the GEN2PHEN project (HEALTH-200754).

Project: The sum of all activities carried out in the framework of the Grant Agreement by the Consortium.
Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out for the GEN2PHEN project, as specified in Annex I to the Grant Agreement.

Consortium: The GEN2PHEN Consortium, formed by the above-mentioned legal entities.

Consortium agreement: Agreement concluded amongst GEN2PHEN participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.

1. INTRODUCTION

Work package 3 ‘Standard data models and terminologies’ provides domain standards to develop GEN2PHEN-specific architecture, facilitate data exchange and integrate data across existing and emerging resources. This work package is focused on providing standards that act as the foundation for much of the database development activity in other work packages. The work package objectives include the rapid development of one or more standard data models capable of representing the minimum agreed content standard (as determined by WP2) and a derived data exchange format.

Data models developed in coordination with WP3 will have several uses in GEN2PHEN: data from pre-existing databases will be mapped to generate data in a derived data exchange format, thus offering a flexible solution for integrating and exchanging existing and new data. In this respect, data model development is a necessary prerequisite, initially separated from implementation details.

2. DESCRIPTION OF WORK

The development process for the GEN2PHEN High-Level Domain Model Version 2, with Sample/Phenotype Focus, has three aims:
• To evaluate relevant public phenotype models
• To develop a core GEN2PHEN phenotype model
• To support primary GEN2PHEN use cases, especially in LSDB and HTP domains

The two GEN2PHEN modelling workshops, Hinxton (April 9-11, 2008) and Helsinki (January 19-22, 2009), laid the groundwork for specific sub-domain development. Work continued during the first GEN2PHEN Phenotype Workshop (Geneva, May 7-8, 2009), hosted by SIB. Use cases were gathered, models were developed, and minimum content standards to be used in exchanging data between partners were discussed in the context of specific phenotype extensions. See Appendix 1 for detailed workshop proceedings. External invited participants from the epidemiology, medical genetics, ontology development and model organism communities provided expertise and use cases beyond those of Consortium Partners.

3. Existing model evaluation

Several public data models¹ currently exist in the phenotype space. Those closely aligned to GEN2PHEN were evaluated during the First Phenotype Workshop for relevance, domain coverage compared to existing resources, ease of use and complexity.

¹ Some of the data models have been documented at www.schemalet.org, which is an experimental wiki site for documenting use case specific data models.

3.1. GenomEUtwin

GenomEUtwin is a 5th Framework Programme project aimed at unifying studies of European volunteer twins to identify genes underlying common diseases. The GenomEUtwin object model has already been tested on large population cohorts by UH.FGC.
See paragraph 4.1 in Appendix 1 for a diagram and more details of the model.

3.2. PaGE-OM

PaGE-OM is a complete OMG standard reference model that represents genotype data both in summary and at the level of the individual. It also represents LSDB-type data and phenotype, and supports some legacy technology use cases. PaGE-OM is very detailed and is useful as a reference model, meaning that GEN2PHEN-specific models can be aligned to it and it can be used as a meta-mapping model for mapping external data representations. It is, however, rather complex, and one aim of WP3 modelling activities is to develop ‘modules’ whereby domain-specific models can be developed, used alone, implemented and made interoperable. See paragraph 4.2 in Appendix 1 for a diagram and more details of the model.

3.3. XGAP

The XGAP model (http://www.xgap.org) addresses the challenges of system-wide genetics experiments in data management, querying and integration. It provides a simple tabular text file format to exchange data between collaborators, a customizable data infrastructure to store, query and integrate data, and a foundation for analysis tools. See paragraph 4.3 in Appendix 1 for a diagram and more details of the model.

4. GEN2PHEN Phenotype Model

Figure 1. GEN2PHEN Phenotype Model

A GEN2PHEN phenotype model was developed during the Phenotype Workshop in Geneva based on Partners’ input and invited domain experts’ opinions. It was later iterated through a series of face-to-face meetings and teleconferences among Partners. Figure 1 presents the 1.0 version of the model, constructed in Enterprise Architect.
It is also available from the schemalet.org website, as well as in Enterprise Architect and XML formats from the GEN2PHEN SVN (https://svn.gene.le.ac.uk/gen2phen/trunk/object_models/).

4.1. Phenotype Model class descriptions

• Individual – An individual; the subject of a study.
• Inferred_value – Inferred conclusion, derived from zero or many Observed_value instances.
• Observable_feature – A measurable feature of an Individual, e.g. blood pressure.
• Observation_target – Super class of all observation targets, such as Individual or Panel.
• Ontology_term – Term defined in a specific namespace (ontology source). All names and terms should be defined using ontology terms whenever possible.
• Observed_value – Specific value measured in an experiment, e.g. 120 (systolic BP, mmHg).
• Panel – Collection of Individual instances.
• Protocol – Describes how a measurement is to be performed, or a specific Standard Operating Procedure.
• Protocol_application – Describes how a Protocol was instantiated in a particular case, i.e. how the measurement was actually done, e.g. on 16/6/2009 by Tomasz Adamusiak.
• Variable_definition – Extends the Observable_feature class to enable a precise definition of the feature in the applications where it is used (for example, by specifying a unit).

Mappings to PaGE-OM and XGAP are available on the schemalet wiki at: http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype

4.2. Object instance

Figure 2. GEN2PHEN Phenotype Model object instance

An example instance of the model is shown in Figure 2. A blood pressure measuring protocol was applied to observation target Juha on 25/5/2009. Two values were measured at 10am: 150 and 90, which were systolic and diastolic blood pressure in mmHg respectively.
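The Figure 2 instance can also be written down in code. The following Python sketch is illustrative only: the class names follow the class descriptions in section 4.1, but every attribute name (name, unit, applied_on, observed_at and so on) is an assumption made for this example, since the report does not list the attributes of the model at this point.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class ObservationTarget:
    """Super class of all observation targets (Individual, Panel)."""
    name: str  # assumed attribute


@dataclass
class Individual(ObservationTarget):
    """Subject of a study."""


@dataclass
class ObservableFeature:
    """A measurable feature, e.g. blood pressure."""
    name: str  # assumed attribute


@dataclass
class VariableDefinition(ObservableFeature):
    """Observable feature made precise for an application, here by a unit."""
    unit: str  # assumed attribute


@dataclass
class Protocol:
    """Describes how a measurement is to be performed."""
    name: str  # assumed attribute


@dataclass
class ProtocolApplication:
    """One concrete use of a Protocol."""
    protocol: Protocol
    applied_on: date  # assumed attribute


@dataclass
class ObservedValue:
    """Specific value measured for one target and one feature."""
    target: ObservationTarget
    feature: ObservableFeature
    protocol_application: ProtocolApplication
    value: float
    observed_at: datetime  # assumed attribute


# The instance described for Figure 2.
juha = Individual(name="Juha")
bp_protocol = Protocol(name="Blood pressure measurement")
application = ProtocolApplication(protocol=bp_protocol, applied_on=date(2009, 5, 25))
systolic = VariableDefinition(name="Systolic blood pressure", unit="mmHg")
diastolic = VariableDefinition(name="Diastolic blood pressure", unit="mmHg")
measured_at = datetime(2009, 5, 25, 10, 0)

observed_values = [
    ObservedValue(juha, systolic, application, 150, measured_at),
    ObservedValue(juha, diastolic, application, 90, measured_at),
]

for ov in observed_values:
    print(f"{ov.target.name}: {ov.feature.name} = {ov.value} {ov.feature.unit}")
```

Both observed values share one ProtocolApplication, which mirrors the description above: a single application of the protocol yields two measurements on the same observation target.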
Figure 3. Inferred value example

The instance depicted in Figure 3 extends the previous one to show how a previously measured blood pressure can be used to infer disease status. A separate inference protocol was applied on 31/5/2009, and high blood pressure was inferred at 2pm.

5. Phenotype Model implementation and testing

Figure 4. GEN2PHEN Phenotype Model implementation in Molgenis notation

In order to test and develop the GEN2PHEN phenotype model we have collaborated with the developers of MOLGENIS [1, 2]. MOLGENIS is an open source software platform to efficiently design, implement, and autogenerate databases, APIs, and web applications from object models. Its power lies in the use of models and generators, so that the best solutions are easily reused between applications. In one simple step MOLGENIS generates a database (MySQL or PostgreSQL), a web-based GUI, programmatic interfaces including a Java API and SOAP web services usable in tools like Taverna (http://taverna.sourceforge.net) and by statistical scripts written in the R language (http://www.r-project.org), as well as full documentation of the object model.
Several Java plug-in mechanisms are also available to customize the generated software. By developing smaller models and ensuring interoperability using MOLGENIS, some or all of the models can be consumed by various partners, the majority of whom have use cases which encompass only some of the models.

MOLGENIS has been successfully used within the GEN2PHEN Consortium for:
1. MAGE-TAB OM: http://magetab-om.sourceforge.net
2. The LSDB object model developed in the course of the Second Modelling Workshop: http://magetab-om.sourceforge.net/lsdb/1.0/object_model.html
3. An example LSDB, Findis, the Finnish National Mutation Database (NMDB): http://www.schemalet.org/mediawiki/index.php/FINDIS:Database

Figure 4 depicts the GEN2PHEN Phenotype Model as implemented on the MOLGENIS platform. Full documentation is available in Appendix 2, and a working implementation of the model, comprising a back end database, GUI, etc., is available from: http://wwwdev.ebi.ac.uk/microarray-srv/pheno/

6. FUTURE PLANS

6.1. A High-Level Domain Model Version 3 (D3.6)

This will be an improved and tested set of standard UML data models for all required domains, ready to be implemented by all Partners. Feedback from Partners will then be used to provide the ultimate design underpinnings for all GEN2PHEN databases in Iterative Specialized Domain Modelling Complete (D3.9). These sub-domain models, including the GEN2PHEN Phenotype Model, will all be extensively tested and a reference implementation will be provided on the MOLGENIS platform.

6.2. Derivation and Specification of Exchange Format (D3.7)

The priorities for data formats in GEN2PHEN are the data exchange between locus specific databases and central repositories and HTP data.
The modelling work to date has separated these domains to support immediate needs for data exchange. The models developed will eventually support the phenotype extension reported here as well.

Validation of the LSDB data model commenced in 2009 by working with the existing LSDBs inside and outside the GEN2PHEN consortium, most of whom have existing data formats. Those formats will support the data content of the GEN2PHEN Phenotype Model.

Validation of the MAGE-TAB OM is underway and progress is promising. We envisage that the phenotypic descriptors, e.g. membership of a cohort through a shared phenotype or trait, will require an extension of MAGE-TAB, and the requirement to provide details of markers in the context of HTP data will also require an extension.

7. Abbreviations

HGVS      Human Genome Variation Society
LSDB      Locus Specific Database
XGAP      Xtensible Genotype And Phenotype data platform
PaGE-OM   Phenotype and Genotype Experiment object model

REFERENCES

1. Swertz, M.A., et al., Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases. Bioinformatics, 2004. 20(13): p. 2075-83.
2. Swertz, M.A. and R.C. Jansen, Beyond standardization: dynamic software infrastructures for systems biology. Nat Rev Genet, 2007. 8(3): p. 235-43.
3. Wildeman, M., et al., Improving sequence variant descriptions in mutation databases and literature using the Mutalyzer sequence variation nomenclature checker. Hum Mutat, 2008. 29(1): p. 6-13.
Appendix 1 - Report on the First GEN2PHEN Phenotype Workshop
HEALTH-200754
WP3 – Standard data models and terminologies
Authors: Tomasz Adamusiak, Juha Muilu, Morris Swertz, Gudmundur Thorisson, Helen Parkinson
Security: PU
Version: v1.0

Host: Swiss Institute of Bioinformatics (SIB)
Venue: Centre Medicale Universitaire (CMU), 1 Rue Michel-Servet, CH1211 Geneva
Dates: 7-8 May 2009

1. Overview

The First GEN2PHEN Phenotype Workshop (Geneva, 7-8 May 2009) was hosted by SIB as a follow-up to the Second Modelling Workshop hosted by UH.FGC (Helsinki, 19-22.1.2009). See http://askja.gene.le.ac.uk/drupal5/Modelling_Workshop_2_Report for details on the previous workshop.

Use cases and models evaluated previously served as a basis for developing minimum content standards for exchanging phenotypic information among partners, as well as for building and evaluating a preliminary phenotype model in partial fulfilment of WP3 deliverable D3.5.

Use cases identified in the Genotype to Phenotype domain in a previous deliverable, D3.1, were subsequently refined by contact with the wider community and used to drive the development of a domain-independent phenotype model. Various pre-existing domain models exist, and the workshop began the process of evaluating these for GEN2PHEN needs. This report describes the workshop content.

2. Participants

Consortium members

Name                      Organisation
Andrew Devereau           UNIMAN
Mike Cornell              UNIMAN
Veronique Humbertclaude   INSERM
Christophe Beroud         INSERM
Anna Pigeon               INSERM
David Atlan               PHENO
Gudmundur Thorisson       ULEIC
Sergio Matos              UAVR
Anne-Lise Veuthey         SIB
Lydie Bougueleret         SIB
Annais Mottaz             SIB
Lina Yip                  SIB
Juha Muilu                UH.FGC
Helen Parkinson           EMBL
James Malone              EMBL
Tomasz Adamusiak          EMBL

Invited domain experts

Domain experts represented, among others, the following consortia: CASIMIR (www.casimir.org.uk), ENGAGE (www.euengage.org), GenomEUTwin (www.genomeutwin.org), BBMRI (www.bbmri.eu) and P3G (www.p3g.org).

Name                  Organisation
Alan Rector           UNIMAN
Peter Robinson        Charite Universitaetsmedizin
John Hancock          MRC
Paul Burton           ULEIC
Isabel Fortier        ENEP
Morris Swertz         University Medical Center Groningen
Mauno Vihinen         EMBL
Maria Krestyaninowa   EMBL
Mike Gostev           EMBL
Ilkka Lappalainen     EMBL
Sraboni Ghost         Genionics
Abriel Hugues         Universitaet Bern

3. Agenda and slides

Agenda and speakers' slides are available from http://askja.gene.le.ac.uk/drupal5/content/firstphenotype-workshop-agenda

4. Models evaluated

4.1. TWIN:Phenotype

Observation is a phenotypic observation done by a specific method, which is documented under an observation framework. Classification is an inferred or classified conclusion of measurement(s) (here blood pressure).
Ontology is the name space (e.g. EUTwin) used for the vocabulary (i.e. high blood pressure, low blood pressure), and Classification method provides information on the classification specification. Time_accuracy is needed because it is not always possible to know the time exactly (e.g. in some cases the exact time cannot be given, and date and month must be coded using an agreed convention).

More information on the model is available on the Schemalet website: http://www.schemalet.org/mediawiki/index.php/TWIN:Phenotype

4.2. PAGEOM:Phenotype

Observable features (nose size) can be measured using different observation methods (e.g. ruler), leading to single or multiple observed values (nose size) over observation target(s) (individual). Features can be categorised under different feature categories (e.g. clinical test, heart function, etc.).

More information on the model is available on the Schemalet website: http://www.schemalet.org/mediawiki/index.php/PAGEOM:Phenotype

4.3. XGAP:Trait

XGAP-OM is the conceptual model behind the XGAP platform. It can be used to consistently model a wide variety of organisms, experimental designs, and biomolecular profiling technologies:
• Describe core experimental data using only four core data types: Trait, Subject, Data and DataElement.
• Add experimental design annotations using core FuGE data types: Investigation, Protocols and ProtocolApplications, OntologyTerms, etc.
• Consistently annotate Traits and Subjects using standardized extensions of Trait (e.g. Probe, Marker) and Subject (e.g. Individual, Strain).
• Consistently extend XGAP for new types of annotations by adding more types of Trait and Subject (e.g. add 'MassPeak' as a new Trait to annotate 'retentiontime' and 'mz').

More information on the model is available from http://www.xgap.org/objectmodel.html

5. Models developed

5.1. COMMON:Phenotype

Note: the attributes were not added during the workshop and the model will be amended with them after a cooperative iteration effort.

• Individual - An individual; the subject of a study.
• Inferred_value - Inferred conclusion, derived from zero or many observed values.
• Observable_feature - Something we can measure in relation to an individual, for example blood pressure.
• Observation_target - Super class of all observation targets, such as Individual or Panel.
• Observed_value - Measured value.
• Panel - Collection of individuals.
• Protocol - Description of how a measurement is planned to be done.
• Protocol_Application - Description of how an actual measurement was done (optionally different from the protocol).
• Ontology_term - Term defined in a specific name space (ontology source).
All names and terms will be defined using ontology terms.

More information on the model is available on the Schemalet website: http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype

The model is also available for download in the following formats:
• Enterprise Architect: http://bio-models.svn.sourceforge.net/viewvc/bio-models/trunk/object_models/enterprise_architect/phenotype.eap?view=log
• XML: http://bio-models.svn.sourceforge.net/viewvc/biomodels/trunk/object_models/enterprise_architect/phenotype.xml?view=log

5.2. MOLGENIS:Pheno implementation

This is a preliminary evaluation of the model, which will be further developed among Partners. More detailed documentation is available from http://bio-models.svn.sourceforge.net/viewvc/bio-models/molgenis4phenotype/WebContent/doc/objectmodel.html

6. Minimal information on phenotype

It was agreed that reporting of phenotypes is inconsistent. For example, only some of the observation targets are annotated: an ultrasound of the liver may be reported as significant in one of the subjects, while no information is given for the other observation targets. Thus it is unclear whether they have also been tested. There are also a number of ethical ramifications, which will be followed up in the Ethics Session during the upcoming Fourth GEN2PHEN General Assembly Meeting.
It was also suggested that minimal information should be content specific, e.g. obligatory smoking status in reporting of hypertension. It was agreed that published phenotypic information should at least contain the following information about observation targets:
• Age
• Gender
• Age of onset
• Ontology (controlled vocabulary) term for signs and symptoms

Optional information would include:
• Therapy information (ontology coverage is coming up short in this domain)

7. Pathogenicity

Agreeing on the meaning of pathogenicity was a challenging task, as different communities use the term in slightly different ways. It was proposed to distinguish between pathogenicity modifiers (positive/negative) and factors that are directly pathogenic. Pathogenicity could refer to a variant causing disease or risk, but in a medical setting it rather refers to a mutation causing a disease. A definition for diagnostic labs would also have to be different. A definition stating that pathogenicity leads to disease was found too broad, and the final version defined pathogenicity as an ability to cause disease.

Issues raised during the discussion
• Laboratory testing aims to link the existence of a variant to the occurrence of a disease (bias in over-reporting of pathogenicity).
• Pathogenicity is not recorded often enough, although it is hugely important and extremely useful.
• How to record values? It was proposed to use a continuous scale (e.g. p-values) to represent pathogenicity values. It was agreed that from a practical point of view it is more feasible to deal with four levels, but these should be extended with two further values: not known and unclassified.
• Pathogenicity values should be backed up by an evidence reference, e.g. a journal paper.
• In some cases a context is required, e.g. it is pathogenic only in association with...

Appendix 2. GEN2PHEN Phenotype Model reference implementation
HEALTH-200754
WP3 – Standard data models and terminologies
Security: PU
Authors: Morris Swertz
Version: 1

Appendix 2
GEN2PHEN Phenotype Model Reference Implementation

The GEN2PHEN Phenotype Model is a minimal data model to represent a data set of phenotypic observations resulting from one or more investigations. The objective is to harmonize the exchange of phenotype descriptions between various repositories and to host phenotype information ranging from annotations in locus specific databases to rich clinical reports from cohort studies. The initial version of this model was compiled at the GEN2PHEN phenotype workshop (Geneva, 7th-8th May 2009), building on previous modeling efforts from the XGAP, PaGE, FuGE, LOVD, and MAGE-TAB projects. Where appropriate, a mapping to these models is provided.

This document was created by: Morris Swertz, Juha Muilu, Gudmundur Thorisson, Tomasz Adamusiak, Isabel Fortier, Paul Burton, John Hancock, Ilkka Lappalainen, Anthony Brookes, other members of the GEN2PHEN collaboration and Helen Parkinson. This work is sponsored by EU-GEN2PHEN, EU-CASIMIR, P3G, NWO-Rubicon, NBIC BioAssist/Biobanking.

Changelog/decisions 11-06-2009 (following G2P AM4):
1. Added a self-reference on Protocol to create aggregated protocols.
   Use case: a study is a set of questionnaires, each questionnaire being a protocol.
2. Added VariableDefinition as a subclass of ObservableFeature and moved the attribute 'unit' from ObservedValue to VariableDefinition. VariableDefinition can refer to one (?) ObservableFeature concept.
   Use case: a questionnaire (protocol) is defined to measure 'length' in cm; 'length' is the observable feature, 'length in cm' the VariableDefinition.
Motivation: if the unit were defined on ObservedValue, then one could not define the unit for a protocol. If the unit were defined in two places (protocol and value level), the two definitions could conflict with each other.
3. Added a timestamp to both ProtocolApplication and ObservedValue.
Use case: blood pressure was measured five times at ten-minute intervals (8:00, 8:10, 8:20, etc.).
Motivation: protocols often include repeated measurements. A positive example is the use case of a blood pressure time series. A negative example is 'blood pressure standing' and 'blood pressure lying down', which are different ObservableFeatures.
4. Adapted the description of ProtocolApplication to say it is an 'instance' of the protocol usage.
5. Did not change ObservableFeature.name into ObservableFeature.description; this is not advisable, as it would be inconsistent.
6. For clarity, did not replace the subclass InferredValue with a directional self-reference on ObservedValue.

Changelog/decisions 12-06-2009 (following meeting Juha Muilu, Morris Swertz, Tomasz Adamusiak):
1. Protocol.name is not unique within an investigation, as a protocol can be reused in multiple studies; the relationship can be defined via ProtocolApplication.
2. ObservationTargets are not unique to one investigation, as they can be observed in multiple studies; the relationship can be defined via ObservedValue.
3. Self-recursion on ObservedValue for multi-value and derived values was dropped for simplicity. Until shown otherwise, multi-value features can be grouped by protocol.
4. ObservedValue name is not made unique within an investigation, as that would defeat its purpose of integrating between studies.
5. There is no explicit relationship between ObservedValue.value and Code.term; such constraint checking is outside the scope of this model.
6.
Added a 'value' to ParameterValue, which was missing.
7. Changed Code so that it no longer extends the OntologyTerm class but instead refers to an OntologyTerm instance.
8. InferredValue seems not to be normalized, in the sense that one has to repeat the ObservationTarget, which is already implied via the ObservedValues it refers to. However, this is not changed, because an inference may be provided without providing the ObservedValues, or a Panel-level inference may be derived from a set of individual-level ObservedValues.

Table of contents
pheno.system package: Identifiable, Nameable, OntologySource, OntologyTerm
pheno.observation package: Investigation, ObservableFeature, ObservedValue, InferredValue, ObservationTarget
pheno.target package: Individual, Panel
pheno.variable package: VariableDefinition, CodeList, Code
pheno.protocol package: Protocol, ProtocolApplication, ProtocolParameter, ParameterValue

1. pheno.system package
This package describes basic classes that are used as building blocks for the pheno.core model.

1.1. Identifiable (interface)
(For implementation purposes) The Identifiable interface provides its sub-classes with a unique numeric identifier within the scope of one database. This class maps to FuGE::Identifiable (together with the Nameable interface).
Attributes:
    id: int (required)
        Automatically generated id-field.

1.2. Nameable (interface)
(For modeling purposes) The Nameable interface provides its sub-classes with a meaningful name that need not be unique. This class maps to FuGE::Identifiable (together with the Identifiable interface).
Attributes:
    name: string (required)
        A human-readable and potentially ambiguous common identifier.

1.3.
OntologySource implements Identifiable, Nameable
The OntologySource class defines a reference to an existing ontology or controlled vocabulary from which well-defined and stable (ontology) terms can be obtained. For instance: MO, GO, EFO, UMLS, etc. Use of existing ontologies/vocabularies is recommended to harmonize phenotypic descriptions. This class maps to FuGE::OntologySource, MAGETAB::TermSourceREF.
Attributes:
    ontologyURI: hyperlink (required)
        A URI that references the location of the ontology.

1.4. OntologyTerm implements Identifiable
The OntologyTerm class defines references to a single entry from an ontology or a controlled vocabulary. Other classes can refer to this OntologyTerm to harmonize the naming of concepts. Each term should have a local, unique label. Good practice is to label it 'sourceid:term', e.g. 'MO:cell'. If no suitable ontology term exists, one can define new terms locally, in which case there is no formal accession for the term. In those cases the local name should be repeated in both term and termAccession. Maps to FuGE::OntologyIndividual; in MAGE-TAB there is no separate entity to model terms.
Attributes:
    term: string (required)
        The ontology term itself, also known as the 'local name' in some ontologies.
    termLabel: string (required)
        The label that is used to refer to this term inside this data set. For instance 'MO:cell'.
    termAccession: string (optional)
        The accession number assigned to the ontology term in the source ontology. If empty it is assumed to be a locally defined term.
Associations:
    termSource: OntologySource (0..1)
        The source ontology or controlled vocabulary list that ontology terms have been obtained from.

2. pheno.observation package
This package describes the minimal model for phenotypes.

2.1.
Investigation implements Identifiable, Nameable
The Investigation class defines self-contained units of study, each having a unique name and a group of actions (ProtocolApplications) and/or results (in ObservedValues). For instance: Framingham study. Maps to XGAP/FuGE Investigation and MAGE-TAB experiment.
Discussion: should we adopt MAGE-TAB::IDF type of minimal information about an investigation?

2.2. ObservableFeature implements Identifiable, Nameable
The ObservableFeature class defines anything that can be observed (there may be many alternative protocols to measure them). For instance: systolic blood pressure, diastolic blood pressure, treatment for hypertension. These names are unique within a data set. Preferably each ObservableFeature should be named according to a well-defined ontology. This class maps to XGAP Trait, FuGE DimensionElement and PaGE ObservableFeature. Multi-value features can be grouped by protocol. For instance: blood pressure consists of observations for the features systolic and diastolic blood pressure.
Associations:
    ontologyReference: OntologyTerm (0..1)
        Reference to the formal ontology definition for this feature.

2.3. ObservedValue implements Identifiable
The ObservedValue class defines the actual observation. For instance: 160 mmHg, 90 mmHg, "no treatment". This class has no FuGE equivalent because in FuGE the association between data and ProtocolApplication is reversed, i.e. the ProtocolApplication has input/output Data (which could be ObservedValues). Maps to XGAP DataElement, which uses the FuGE approach, so observed values are grouped into 'Data'; maps to PaGE observed value.
Attributes:
    time: datetime (required)
        Time when the protocol was applied.
    value: string (required)
        The value observed.
Associations:
    investigation: Investigation (1..1)
        Reference to the Investigation this ObservedValue belongs to.
    observationTarget: ObservationTarget (1..1)
        Reference to the subject that has been observed.
    observableFeature: ObservableFeature (1..1)
        Reference to the feature that was observed.
    protocolApplication: ProtocolApplication (0..1)
        Reference to the protocol application that produced this observation.

2.4. InferredValue extends ObservedValue
The InferredValue class defines ObservedValues that are inferred as a result of human or computational post-processing of previous ObservedValues. The protocol used for this inference can be defined via the protocolApplication association that is inherited from ObservedValue. For instance: hypertensive = yes when mean arterial pressure = 135 AND no hypertension-affecting medicine is taken. This class has no direct mapping to other models: XGAP would use input/output Data; PaGE would use a self-reference on ObservedValue.
Implementation discussion: how to make the derivedFrom relationship understandable in the UI. This would need a multi-column lookup including target, feature, value, and unit. Now one just gets a value.
Associations:
    derivedFrom: ObservedValue (1..n)
        References to one or more observed values that were used to infer this observation.

2.5. ObservationTarget implements Identifiable, Nameable
The ObservationTarget class defines the subjects of observation. For instance: individual 1 from study x. This class maps to XGAP subject and to PaGE Abstract_Observation_Target. The name of an ObservationTarget is unique within its Investigation.

3.
pheno.target package

3.1. Individual extends ObservationTarget
The Individual class defines human cases that are used as observation targets. This class maps to XGAP and PaGE individual.
Discussion: what minimal properties should be hard-coded? E.g. sex is assumed to be an ObservableFeature, while in PaGE/XGAP it is a direct property of individual.
Attributes:
    sex: enum (required)
Associations:
    species: OntologyTerm (1..1)
    mother: Individual (0..1)
        Refers to the mother of the individual.
    father: Individual (0..1)
        Refers to the father of the individual.

3.2. Panel extends ObservationTarget
The Panel class defines groups of individuals that can act as a single ObservationTarget. Thus a whole group can have ObservedValues such as 'middle aged man' or 'recombinant mouse inbred Line dba x b6'. This class maps to XGAP/PaGE panel classes.
Associations:
    individuals: Individual (1..n)
        The list of individuals in this panel.

4. pheno.variable package
The variable package provides classes to define variables as used within a protocol/questionnaire. Variables are specific types of observable features in that they have a unit attached.

4.1. VariableDefinition extends ObservableFeature
The VariableDefinition class extends the ObservableFeature class to enable precise definition of the unit of an ObservableFeature.
Associations:
    unit: OntologyTerm (1..1)
        Reference to the well-defined measurement unit used to observe this feature (if the feature is that concrete). E.g. mmHg.
    codeList: CodeList (0..1)

4.2.
CodeList implements Identifiable, Nameable
The CodeList class names lists of discrete values that are available as options for a particular VariableDefinition.

4.3. Code implements Identifiable
The Code class names the code values for a particular CodeList. It refers to an OntologyTerm and adds the option to define pretty labels. For instance 'f=female', 'm=male'.
Attributes:
    value: string (required)
        The value that represents the code in the data.
    label: string (required)
        The pretty label that represents the human-understandable meaning of the code. For instance the label on a CRF.
Associations:
    codeList: CodeList (1..1)
        The code list this code is defined to be part of.
    ontologyTerm: OntologyTerm (0..1)

5. pheno.protocol package
The protocol package provides classes to describe protocols that are planned, or have been used, for observation. This can include questionnaires, wet-lab protocols and dry-lab protocols. It is very similar to FuGE/XGAP and MAGE-TAB.

5.1. Protocol implements Identifiable, Nameable
The Protocol class defines parameterizable descriptions of methods; each protocol has a unique name within a data set. Each Protocol can define the ObservableFeatures it can observe as well as the optional Parameters. For instance: SOP for blood pressure measurement used by UK Biobank. This class maps to FuGE/XGAP/MAGE-TAB Protocol, but in contrast to FuGE it is not required to extend Protocol before use. Note that FuGE's mechanism of parameters (for protocol) and parameter values (for application) is not shown. Has no equivalent in PaGE.
Associations:
    observableFeatures: ObservableFeature (0..n)
        The features that can be observed using this protocol.
    protocolComponents: Protocol (0..n)
        The set of protocols that together make up this protocol. For instance: a set of questionnaires.

5.2. ProtocolApplication implements Identifiable, Nameable
The ProtocolApplication class defines the actual action of observation by instantiating a protocol and optional ParameterValues. For example: the action of blood pressure measurement on 1000 individuals, using a particular protocol, resulting in 1000 associated observed values. This class maps to FuGE/XGAP ProtocolApplication, but in FuGE ProtocolApplications can take Material or Data (or both) as input and produce Material or Data (or both) as output. Similar to PaGE.ObservationMethod.
Attributes:
    time: datetime (required)
        Time when the protocol was applied.
Associations:
    protocol: Protocol (1..1)
        Reference to the protocol that is being used.
    investigation: Investigation (1..1)
        Reference to the Investigation this ProtocolApplication belongs to.

5.3. ProtocolParameter implements Identifiable, Nameable
ProtocolParameter represents a variable of a Protocol that is instantiated as a ParameterValue (see ParameterValue). For instance 'growth temperature' in a protocol where yeast are grown at permissive and non-permissive temperatures. It implements Unit to define the parameter type and allowed values. ProtocolParameter maps to FuGE::Parameter.
Associations:
    protocol: Protocol (0..1)

5.4. ParameterValue implements Identifiable
A ParameterValue is instantiated when a ProtocolApplication applies a Protocol with Parameters. ParameterValue implements Measurement to provide values and Units for ParameterValues.
The FuGE equivalent to ParameterValue is FuGE::ParameterValue.
Attributes:
    value: string (required)
        The chosen value of the parameter within this protocol application.
Associations:
    protocolApplication: ProtocolApplication (1..1)
        Reference to the protocol application for which this parameter value was chosen.
    protocolParameter: ProtocolParameter (1..1)
        Reference to the protocol parameter that is being bound by this value.

6. Supplementary figure: complete data model
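
As a textual complement to the class descriptions, the core of the pheno.observation package can be sketched in code. The Python sketch below is illustrative only and is not the reference implementation: the snake_case attribute names, the next_id helper (standing in for the automatically generated Identifiable id) and the example timestamp are assumptions made for this sketch. Protocol, ProtocolApplication and the ontology classes are omitted for brevity. The check on derived_from follows the stated multiplicity derivedFrom (1..n); the changelog discussion suggests that a looser rule may be wanted in practice.

```python
from dataclasses import dataclass, field
from datetime import datetime
from itertools import count
from typing import List

_id_counter = count(1)


def next_id() -> int:
    """Hypothetical stand-in for the automatically generated Identifiable.id."""
    return next(_id_counter)


@dataclass
class Investigation:  # Identifiable, Nameable
    name: str
    id: int = field(default_factory=next_id)


@dataclass
class ObservableFeature:  # Identifiable, Nameable
    name: str
    id: int = field(default_factory=next_id)


@dataclass
class ObservationTarget:  # Identifiable, Nameable
    name: str
    id: int = field(default_factory=next_id)


@dataclass
class ObservedValue:  # Identifiable
    time: datetime
    value: str
    investigation: Investigation            # 1..1
    observation_target: ObservationTarget   # 1..1
    observable_feature: ObservableFeature   # 1..1
    id: int = field(default_factory=next_id)


@dataclass
class InferredValue(ObservedValue):
    derived_from: List[ObservedValue] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: enforce the 1..n multiplicity of derivedFrom.
        if not self.derived_from:
            raise ValueError("derivedFrom needs at least one ObservedValue")


# Example instances, using the blood pressure examples from the text.
study = Investigation(name="Framingham study")
target = ObservationTarget(name="individual 1")
when = datetime(2009, 5, 8, 8, 0)  # arbitrary example timestamp

systolic = ObservedValue(when, "160", study, target,
                         ObservableFeature("systolic blood pressure"))
diastolic = ObservedValue(when, "90", study, target,
                          ObservableFeature("diastolic blood pressure"))
inferred = InferredValue(when, "yes", study, target,
                         ObservableFeature("hypertensive"),
                         derived_from=[systolic, diastolic])

print(inferred.observable_feature.name, "=", inferred.value,
      "derived from", len(inferred.derived_from), "observed values")
```

Note that the values are kept as plain strings, as in the model, and that the unit (e.g. mmHg) would be attached to a VariableDefinition rather than to the ObservedValue itself.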