ONTOLOGY-BASED WEB INFORMATICS SYSTEM

By

WENYANG HU

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Wenyang Hu

ACKNOWLEDGMENTS

I express my sincere gratitude to my advisor, Prof. Limin Fu, for giving me the opportunity to work on this challenging topic and for providing continuous guidance during my thesis writing. I am thankful to Prof. Joachim Hammer and Prof. Jonathan Liu for agreeing to be on my supervisory committee.

I would like to take this opportunity to thank my parents, my husband, and my son for their continued and encouraging support throughout my period of study and especially in this endeavor.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Background and Motivation
  1.2 Organization of This Thesis

2 WHY DEVELOP AN ONTOLOGY?
  2.1 Ontology Role in the Information Retrieval
    2.1.1 Basic Concepts of an Ontology
    2.1.2 Refined Definition and Categories of Ontologies
      2.1.2.1 Redefinition of Ontology
      2.1.2.2 Different kinds of ontologies
  2.2 Current Ontology Applications
    2.2.1 Reusability
    2.2.2 Search
    2.2.3 Specification and Knowledge Acquisition
    2.2.4 Reliability and Maintenance
  2.3 Building Ontologies – Ontology Editors, Languages, and Platforms
    2.3.1 Ontobroker
    2.3.2 DAML+OIL
    2.3.3 SHOE
    2.3.4 OntoEdit
    2.3.5 Protégé-2000

3 BUILD MEDICAL SUBJECT HEADING ONTOLOGY WITH PROTEGE-2000
  3.1 Overview of Protégé-2000
  3.2 Protégé-2000 Ontology Model
    3.2.1 Creating and Editing Classes
    3.2.2 Creating and Editing Instances
    3.2.3 Storage Models and Persistence
  3.3 Background of Medical Subject Heading
  3.4 Motivation of Creating an Ontology in Medical Subject Heading Domain
  3.5 From MeSH Thesaurus to MeSH Ontology
    3.5.1 Ontology and Structure-based Search
    3.5.2 Building MeSH Ontology
    3.5.3 Ontology Data Import and Export
      Importing existing data of MeSH thesaurus to ontology
      Exporting MeSH ontology to an XML document

4 ONTOLOGY, XML AND XQUERY
  4.1 Extensible Markup Language XML
  4.2 Ontologies as Conceptual Models for Generating XML Documents
    4.2.1 XML Itself Is Not Enough
    4.2.2 Add Ontology as Conceptual Model
  4.3 XQuery
    4.3.1 XML Query Language XQuery
    4.3.2 XQuery Implementation Quip

5 IMPLEMENTATION OF AN ONTOLOGY-BASED WEB APPLICATION SYSTEM
  5.1 Building Web Informatics System
    5.1.1 Ontology-based Web Application
    5.1.2 Using JSP
    5.1.3 System Architecture
  5.2 Query the Ontology
    5.2.1 Query from Direct Typing
    5.2.2 Upload Existing Local Query Files
    5.2.3 Query Ontology or Build Up Ontology Objects with the Ontology Wizard
    5.2.4 Choose the Query File and Download the Result

6 CONCLUSIONS AND FUTURE WORK

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF FIGURES

Figure
3-1 Protégé architecture
3-2 Protégé-2000: Class definition in MeSH ontology
3-3 Acquire instances in Protégé-2000
3-5 Medical subject heading hierarchy
3-6 MeSH thesaurus browser
3-7 MeSH ontology class hierarchy
3-8 Detailed structures and relationships of DescriptorRecord, Concept and Terms in MeSH ontology
3-9 Ontology exported to an XML document
5-1 Web informatics system architecture
5-2 Interface of server program and Quip execution
5-2--continued Interface of server program and Quip execution
5-3 Online links point to ontology structures
5-4 Query the ontology by typing or uploading local query files
5-5 Query the ontology under the help of ontology query wizard

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering

ONTOLOGY-BASED WEB INFORMATICS SYSTEM

By

Wenyang Hu

May 2002

Chair: Prof. Limin Fu
Department: Computer and Information Science and Engineering

In the context of knowledge, the term ontology means a specification of a conceptualization. On the application side, the main emphasis is given to the use of ontologies for knowledge sharing and reuse, electronic commerce, and enterprise integration. Ontologies also make it possible to add more semantics to web pages to make information extraction precise and efficient.

In this project, I applied a popular ontology editor, Protégé-2000, to build an ontology of the field Medical Subject Heading (MeSH), using concepts, relationships, and data from that field to design ontology classes, slots, facets, proper constraints, and objects. The ontology was later exported to an XML document. A web informatics system was implemented based on that ontology. A carefully designed interface makes it possible for the web application system to invoke an XQuery engine and return the query output to the end user. The ontology query wizard uses ontology information to guide users in forming legal and proper queries on the ontology data, or constructs particular ontology objects for them. Information retrieval on more structured web systems, such as the one constructed in this project, could provide answers to sophisticated knowledge-based queries.

CHAPTER 1
INTRODUCTION

1.1 Background and Motivation

Imagine that you want to buy the book “The Little Prince” by Antoine de Saint-Exupery online. Searching existing web indices in your favorite search engines yields thousands of pages.
Only a few lead to places where you can actually purchase the book; the others lead to a variety of fan sites and baby product sites. This scenario is common to many people on the World Wide Web.

A major problem with this kind of search (called keyword-based search) on the web today is that data available on the web have little semantic organization beyond a simple structural arrangement of text, keywords, titles, or abstracts. As the web expands exponentially in size, this lack of organization makes it difficult to retrieve useful information from the web. Researchers are trying to find efficient ways to let the web answer queries such as:

Find all web pages where X isA book, Y isA person, Title(X) = “The Little Prince”, Name(Y) = “Antoine de Saint-Exupery”, Published-by(X, Y)

An ordinary HTML page is not appropriate for such queries if no semantics have been added.

Previous information retrieval approaches include keyword-based search and field-based search, both of which have disadvantages. A keyword-based search suffers because it equates the semantic meaning of web pages with their actual lexical or syntactic content. Hundreds or even thousands of irrelevant result pages are thus unavoidable. The field-based approach describes an item not by a set of keywords, but by a set of attribute-value pairs. However, this type of search is usually supported only by specially designed browsers for a specific application domain. Apart from simple linkage, neither approach allows for inferences about relationships between web pages. Sophisticated queries are therefore clearly out of reach.

The solution to these problems is to add semantics to HTML pages, but terms and definitions often differ between groups. Sometimes different groups use identical terms with different meanings. So there is a need to share the meaning of terms in a given domain.
A shared understanding is achieved by agreeing on an appropriate way to conceptualize the domain and then making that conceptualization explicit in a language. The result is an ontology, which can be applied in a variety of contexts for various purposes. By providing a shared and common understanding of a domain, ontologies can be communicated across people and application systems to facilitate knowledge sharing and exchange, and they form the conceptual backbone of the Semantic Web. Only in the last few years has interest in ontology research and applications increased significantly.

The Medical Subject Heading (MeSH) is a thesaurus that has been used and updated for the past 40 years. It has been successful for indexing and searching journal articles in medical subject databases, books, and so forth. However, as a thesaurus MeSH still has some limitations, and it could be updated and extended to achieve better performance. I chose this thesaurus as a starting point to implement an ontology. The MeSH ontology belongs to the same domain, imports all useful structures and data from the MeSH thesaurus, and adds deeper semantics and ontology validation constraints. Through a web informatics system based on the MeSH ontology, users are given more flexibility to form sophisticated queries and obtain satisfactory results.

1.2 Organization of This Thesis

Chapter 1 provides an introduction and overview. Chapter 2 discusses some concepts related to ontology and its applications, such as definitions, current ontology application areas, and why people want to develop domain ontologies. The first part of my implementation is to use Protégé-2000 as a tool to create the ontology. In Chapter 3, I introduce key features of Protégé-2000, provide background information about the MeSH thesaurus, and state how and why I use Protégé-2000 to develop a medical subject heading ontology.
The MeSH ontology was created and exported to an XML document in my project implementation. In Chapter 4, I present some of the basic ideas about XML and its query language XQuery, describe the relationship between ontology and XML, and explain why the XML format was chosen to save the ontology. Details of the second part of my implementation, that is, how I designed a web informatics system based on the MeSH ontology, are introduced in Chapter 5. In Chapter 6, I mention some of the flaws still existing in this system and possible future improvements.

CHAPTER 2
WHY DEVELOP AN ONTOLOGY?

2.1 Ontology Role in the Information Retrieval

2.1.1 Basic Concepts of an Ontology

An ontology is a shared and common understanding of some domain that can be communicated across users and computers. It can be defined as “a formal, explicit specification of a shared conceptualization” (Gruber, 1993). Conceptualization refers to an abstract model of some phenomenon in the world. Explicit means that the types of concepts and the relationships between them are explicitly defined. Shared reflects the fact that an ontology captures consensual knowledge, which is accepted by a group of people. Formal refers to the fact that an ontology should be machine readable and accessible. Typically an ontology is constructed in a collaborative effort of domain experts, end users, and IT specialists.

Research on ontology is becoming increasingly widespread in the computer science community. The term ontology is actually borrowed from philosophy. Though this term was largely confined to philosophy in the past, it is now gaining a specific role in many diverse fields. In the research field of AI, an ontology refers to an engineering artifact. It is constituted by a specific vocabulary used to describe a certain reality, plus a set of assumptions regarding the intended meaning of the vocabulary words.
This set of assumptions usually has the form of a first-order logical theory in which vocabulary words appear as unary predicate names (concepts) or binary predicate names (relationships).

In the context of knowledge sharing, ontology means a specification of a conceptualization. That is, an ontology is a description of the concepts and relationships that can exist for an agent or a community of agents. In the simplest case, an ontology describes a hierarchy of concepts related by subsumption relationships. In more sophisticated cases, suitable axioms are added in order to express other relationships between concepts and to constrain their intended interpretations.

2.1.2 Refined Definition and Categories of Ontologies

2.1.2.1 Redefinition of ontology

An ontology can be considered as a set of logical axioms designed to account for the intended meaning of a vocabulary. Given a language L with an ontological commitment K, an ontology for L is a set of axioms designed in such a way that the set of its models approximates as closely as possible the set of intended models of L according to K. The following definition of ontology refines Gruber’s definition by making clear the difference between an ontology and a conceptualization:

    An ontology is a logical theory accounting for the intended meaning of a formal vocabulary, i.e., its ontological commitment to a particular conceptualization of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models. (Guarino, “Formal Ontology and Information System”, page 5, 1998)

An ontology is language-dependent, while a conceptualization is language-independent. It is essential to distinguish an ontology from a conceptualization when addressing issues related to ontology sharing, fusion, and translation.
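As a minimal illustration of the "simplest case" described in Section 2.1.1 (a hierarchy of concepts related by subsumption), the sketch below stores is-a links and checks whether one concept subsumes another. The concept names, the `IS_A` table, and the helpers `ancestors` and `subsumes` are hypothetical; they are not taken from any ontology or tool discussed in this thesis.

```python
# Hypothetical is-a links (child -> parent); the names are illustrative only.
IS_A = {
    "Book": "Publication",
    "Journal": "Publication",
    "Publication": "Thing",
    "Person": "Thing",
}


def ancestors(concept):
    """Return the concepts that subsume `concept`, nearest first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    return general == specific or general in ancestors(specific)
```

A richer ontology would add axioms on top of such a hierarchy; this sketch covers only the subsumption relation and assumes each concept has a single parent.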
2.1.2.2 Different kinds of ontologies

Ontologies can be classified according to their accuracy in characterizing the conceptualization to which they commit. An ontology can get closer to a conceptualization in several possible ways, such as by developing a richer axiomatization or by adopting a richer domain and/or a richer set of relevant conceptual relations. Ontologies that get closer to a conceptualization are called “fine-grained” ontologies, as opposed to “coarse” ontologies.

A tradeoff exists between a coarse and a fine-grained ontology committing to the same conceptualization. A fine-grained ontology may be used to establish a consensus about sharing a vocabulary because it gets closer to specifying the intended meaning of that vocabulary. But it may be hard to develop and to reason on, because of the number of axioms and the expressiveness of the language adopted. Such detailed ontologies are usually built to be consulted only from time to time. A coarse ontology, on the other hand, may consist of a minimal set of axioms written in a minimally expressive language. It is intended to be shared among users who already agree on the underlying conceptualization and to support only a limited set of specific services, for example, a system’s core functionalities.

According to their level of generality, ontologies can also be categorized into top-level ontologies, domain and task ontologies, and application ontologies. Top-level ontologies describe very general concepts, independent of a particular problem or domain. Domain ontologies describe the vocabulary related to a generic domain (such as the medical subject heading domain on which we focus). Task ontologies describe a generic task or activity (such as diagnosing, advertising, and so forth). Domain and task ontologies inherit and specialize the terms introduced in the top-level ontology. Application ontologies describe concepts depending on both a particular domain and task.
These concepts often correspond to roles played by domain entities while performing a certain task. It therefore seems quite obvious and reasonable to have unified top-level ontologies for large communities of users (Guarino, 1998).

An ontology can be regarded as a particular knowledge base, describing facts assumed to be always true by a group of users in a certain domain, by authority of the agreed-upon meaning of the vocabulary used. It contains state-independent information, while the “core” knowledge base, on the other hand, contains state-dependent information. So an ontology is a “kind of” knowledge base, but the two concepts are different.

2.2 Current Ontology Applications

The research and application communities in which ontology has been useful include software developers, standards organizations, and database communities. They all need to overcome interoperability difficulties brought by disparate vocabularies, representations, and tools in their respective contexts. Fundamentally, ontologies are used to improve communication between humans or computers. The current applications of ontology can be grouped into the following areas.

2.2.1 Reusability

Many researchers have been designing ontologies for the purpose of enabling knowledge sharing and reuse. The ontology is the basis for a formal representation of the important concepts, processes, and their interrelationships in the domain of interest. This formal representation may be a reusable and shared component in a system; it may also be translated between different modeling systems and used as an interchange format. For example, an author creates an ontology that different application developers agree to use. Each pair of translators for a given application, in effect, defines an application interface that can be used to read data from and write data to the ontology.
EcoCyc (Karp et al., 1996), for example, is a commercial product that uses a shared ontology to provide access to various heterogeneous databases in the field of molecular biology. In Protégé-2000, developed by Stanford Medical Informatics (SMI), various data formats (XML, Ontolingua, RDF) can be used to import data into the ontology. The ontology can also be exported to different data formats (RDF, XML, OKBC, relational DBMS) so that various communities can share the ontology data in their own application formats.

2.2.2 Search

In recent years, numerous papers and reports have announced attempts at applying ontologies, with some successes, especially in the area of search and information retrieval. An ontology can be used as metadata serving as an index into a repository of systematically ordered relevant concepts in a given domain. A consensual ontology can assist knowledge workers in identifying the concepts in which they are interested by providing various users with a clean common vocabulary and clearly defined relationships. The motivation is to improve precision (making sophisticated queries possible) as well as to reduce the overall amount of time spent on searching. Supporting technologies in this application area include ontology browsers, search engines, automatic tagging tools, automatic classification of documents, metadata languages such as XML, and so forth.

One variation is to assist in query formulation. An ontology can drive the user interface for creating and refining queries, which is the case in my project. I will show in my project how a sophisticated query becomes possible based on ontology information. Yahoo is another example: a large web taxonomy that categorizes web sites to make searching more efficient.

2.2.3 Specification and Knowledge Acquisition

An ontology can assist the process of identifying requirements and defining a specification for an IT system.
The basic idea of this scenario is to let an ontology model the application domain and provide a vocabulary for specifying the requirements for one or multiple target applications. When building knowledge-based systems, using an existing ontology as the starting point and basis for guiding knowledge acquisition may also increase speed and reliability.

A typical example of this scenario is Protégé-2000, which is used to help automate the process of knowledge acquisition and software development. By generating knowledge acquisition tools from an ontology automatically, Protégé-2000 ensures that the acquisition user interface connects tightly to the ontology. This facilitates the entering of knowledge and also makes certain that the ontology can be validated at the most appropriate time.

2.2.4 Reliability and Maintenance

Using an ontology in system development or as part of the end application can make maintenance easier in various ways. First, the formal representation of an ontology makes software consistency checking possible. Such checking is automatic and therefore more dependable, which in turn makes the system more reliable. Second, building software using explicit ontology data helps to improve the documentation, which reduces the cost of maintenance.

2.3 Building Ontologies – Ontology Editors, Languages, and Platforms

Ontologies have become common on the World Wide Web. As the importance of ontology in various application fields was recognized, a number of languages for defining ontologies on the web, such as RDF(S) and DAML+OIL, were developed. Ontology editors and platforms that help to create ontologies reliably and efficiently are also being developed. In this section, I briefly introduce some of the popular tools.

2.3.1 Ontobroker

Ontobroker processes information sources and content descriptions in HTML, XML, and RDF formats and provides information retrieval, query answering, and maintenance support.
Use of ontologies to explicitly describe background knowledge is central in Ontobroker. Ontobroker provides a broker architecture with four elements: a query interface, an info agent, an inference engine, and a database manager. It is an integrated system for collecting knowledge from the web using annotations, creating and querying ontologies, and deriving additional implicit factual knowledge automatically.

• The query engine receives queries and answers them by checking the content of the databases that were filled by the info and inference agents.
• The info agent is responsible for collecting factual knowledge from the web using various styles of meta annotations and direct annotations. Annotation can be done manually using an RDF format or using a small extension of HTML called HTMLA to integrate semantic annotations in HTML documents.
• The inference engine uses facts and ontologies to derive additional factual knowledge that is only provided implicitly. It frees knowledge providers from the burden of specifying each fact explicitly.
• The database manager is the backbone of the entire system. It receives facts from the info agent, exchanges facts as input and output with the inference agent, and provides facts to the query engine.

A representation language based on Frame logic (Kifer et al., 1995) is used to formulate an ontology in Ontobroker. Basically, the language provides classes, attributes with domain and range definitions, and is-a hierarchies with set inclusion of subclasses and multiple attribute inheritance. It also provides logical axioms that can be used to further characterize relationships between elements of an ontology and its instances.

2.3.2 DAML+OIL

In order to support the use of ontologies, a number of representational formats have been proposed, including the Resource Description Framework (RDF) Schema, the Ontology Interchange Language (OIL), and the DARPA Agent Markup Language (DAML).
DAML+OIL, the language now being proposed as a W3C standard for ontological and metadata representation, is formed by bringing the last two languages together.

DAML+OIL is written in RDF, which in turn is written in XML, using XML namespaces and URIs. Yet it is a language for expressing far more sophisticated classifications and properties of resources than RDFS. It draws heavily on the original OIL specification, but has some key differences. OIL is a proposal for a web-based representation and inference layer for ontologies, combining frame-based representations, description logics, and web-based languages. It is compatible with RDF and presents a layered approach to a standard ontology language. Each additional layer adds functionality and complexity to the previous layer. One example of the difference between DAML+OIL and OIL is that OIL has explicit "OIL" instances, whereas DAML+OIL relies on RDF for instances. The March 2001 edition of DAML+OIL emphasizes the work done to support W3C XML Schema.

2.3.3 SHOE

Compared to DAML+OIL, SHOE is much simpler and less expressive. SHOE is an SGML/XML, HTML-based knowledge representation language that can be regarded as a superset of HTML, as it adds the tags necessary to embed arbitrary semantic data into web pages. The general steps to add semantics to a web page using SHOE are as follows:

1) Define an ontology describing valid classifications of web objects plus valid relationships between web objects. This ontology may also borrow from other ontologies.
2) Annotate HTML pages to describe themselves, other pages, or subsections of themselves, with attributes as described in one or more ontologies.

2.3.4 OntoEdit

OntoEdit is a development environment for the design, adaptation, and import of knowledge models for application systems, using a GUI to represent views on concepts, concept hierarchies, relations, and axioms.
It is a tool that enables inspecting, browsing, codifying, and modifying ontologies, and in this way it supports the ontology maintenance task. Modeling ontologies with OntoEdit means modeling as independently as possible of a concrete representation language. The conceptual model of an ontology is stored internally using a powerful ontology model, which can be mapped onto different, concrete representation languages.

2.3.5 Protégé-2000

Protégé-2000 is similar to OntoEdit but is more flexible. It is a platform that allows the user to construct a domain ontology, customize knowledge-acquisition forms, and enter domain knowledge. Its flexibility comes from the many plug-ins and widgets available for its GUI. Each plug-in can extend either the system capability or the system functionality. Extended with graphical widgets, Protégé-2000 can have tables, diagrams, and animation components to access other knowledge-based systems' embedded applications. Protégé-2000 can also be regarded as a library that other applications can use to access and display knowledge bases. We will go into detail about Protégé-2000 in Chapter 3.

There are many other tools, for example, OntoLingua, OntoSeek, OntoWeb, and so forth. There is no one correct way to model a domain; there are always viable alternatives. The best solution almost always depends on the application that one has in mind and the extensions that one anticipates. Among several viable alternatives, we need to determine which one would work better for the projected task and be more intuitive, more extensible, and more maintainable.

CHAPTER 3
BUILD MEDICAL SUBJECT HEADING ONTOLOGY WITH PROTEGE-2000

The Knowledge Modeling Group (KMG) at Stanford University has developed a variety of knowledge-modeling tools as part of the Protégé project for the past 15 years.
The current version of the ontology edit tool Protégé is an extensible, open-source application that is now available as free software under the open-source Mozilla Public License and compatible with a wide range of knowledge representation languages. Some basic features of Protégé were introduced in Chapter 2. More detailed technique backgrounds of Protégé, especially how and why I use Protégé-2000 to build an ontology in the Medical Subject Heading domain, will be covered in this chapter. 3.1 Overview of Protégé-2000 Protégé-2000 is a tool that allows the user to 1. Construct a domain ontology by defining classes and class hierarchy, slots and slot-value restrictions, relationships between classes, and properties of these relationships. 2. Customize knowledge-acquisition forms by generating a default form for acquiring instances, based on the types of the slots that the user specified. 3. Enter domain knowledge. You can use the instances tab in Protégé, which is a knowledge-acquisition tool to acquire instances of the classes defined in the ontology. 14 15 As it is shown in Figure3-1, Protégé-2000 system architecture includes three parts-Core protégé Framework, widgets and plug-ins that extend system functionalities and ontology storage models. Strorage Strorage Storage Model Model Model Core Protege Framework Widget Widget Widget Plug-in Plug-in Plug-in Widgets know how to display certain value types while Plug-ins extend system functionality. Storage model is to save the ontology to more persistent storage. Currently three models available in ProtegeRDF schema, JDBC relational database, and standard text format Core protege framework is responsible for maintaining the in-memory ontology using API. It is also responsible for managing Protege name spaces. Figure 3-1 Protégé architecture 3.2 Protégé-2000 Ontology Model Protégé-2000 is a frame-based system. The main elements of the Protégé ontology model are frames representing • Classes. 
correspond to concepts in the domain
• Instances. of classes
• Slots. properties of classes and instances
• Facets. properties of slots

Classes are organized into a “subclass-of” hierarchy with multiple inheritance. Every instance of a class A is also an instance of every superclass of A. Classes themselves can be instances of other classes. Slots are first-class objects in Protégé-2000. Slots are attached to classes and instances either as template slots or as own slots. Template slots attached to a class describe the properties of instances of that class. Value-type restrictions can be defined for template slots. Template slots for a class become own slots when instances of that class are created.

3.2.1 Creating and Editing Classes

Protégé-2000 simplifies the task of developing an appropriate class hierarchy for a given application. Users can easily create or browse the class hierarchy and bind slots to classes in the Protégé-2000 ontology editor.

Figure 3-2 Protégé-2000: Class definition in the MeSH ontology

In Figure 3-2, the left-hand pane visualizes the class hierarchy and the right-hand pane summarizes the slots that are attached to the highlighted class. Each slot has a cardinality (single or multiple) and a value type defining the types of values it accepts. Additional restrictions on the values can be specified using facets, according to the value type defined.

3.2.2 Creating and Editing Instances

The Instances tab in Protégé-2000 provides the interface for creating instances of classes. Protégé-2000 makes a distinction between classes and instances. Classes correspond to definitions of concepts (just like schemas in a database) and instances correspond to specific examples of a concept (just like tuples in a database). In addition, slots are a third type of modeling abstraction. They are first-class objects that correspond to attributes of either a class or an instance. A forms interface is used in Protégé-2000 for acquiring the slot values for instances.
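The frame model described in Section 3.2 (classes with multiple inheritance, template slots that become own slots on instantiation, and facets for value type and cardinality) can be illustrated with a minimal Python sketch. This is not the Protégé API; the names Slot, FrameClass, Instance, and the "Reference" superclass are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field

# Illustrative mapping from two Protégé-style value types to Python types.
VALUE_TYPES = {"String": str, "Boolean": bool}


@dataclass
class Slot:
    name: str
    value_type: str = "String"   # facet: value type
    multiple: bool = False       # facet: cardinality (single or multiple)


@dataclass
class FrameClass:
    name: str
    superclasses: list = field(default_factory=list)
    template_slots: list = field(default_factory=list)

    def all_template_slots(self):
        """Collect template slots from every superclass, then this class."""
        slots = {}
        for sup in self.superclasses:
            slots.update(sup.all_template_slots())
        for slot in self.template_slots:
            slots[slot.name] = slot
        return slots

    def is_subclass_of(self, other):
        return self is other or any(s.is_subclass_of(other) for s in self.superclasses)


class Instance:
    def __init__(self, cls):
        self.cls = cls
        # Template slots of the class become own slots of the instance.
        self.slots = cls.all_template_slots()
        self.own_slots = {n: ([] if s.multiple else None) for n, s in self.slots.items()}

    def set(self, name, value):
        slot = self.slots[name]  # KeyError if the slot is not attached
        if slot.multiple and not isinstance(value, list):
            raise TypeError(f"{name} has multiple cardinality and needs a list")
        values = value if slot.multiple else [value]
        expected = VALUE_TYPES[slot.value_type]
        if not all(isinstance(v, expected) for v in values):
            raise TypeError(f"{name} expects values of type {slot.value_type}")
        self.own_slots[name] = value

    def is_instance_of(self, cls):
        # An instance of class A is also an instance of A's superclasses.
        return self.cls.is_subclass_of(cls)


# Hypothetical usage with slot names taken from the MeSH ontology.
reference = FrameClass("Reference", template_slots=[Slot("DescriptorUI")])
descriptor = FrameClass(
    "DescriptorReference",
    superclasses=[reference],
    template_slots=[Slot("DescriptorName"), Slot("TreeNumberList", multiple=True)],
)
record = Instance(descriptor)
record.set("DescriptorUI", "D000029")
record.set("TreeNumberList", ["E04.520.050.055"])
```

Keeping the facets on the Slot object, rather than on the class, mirrors the statement that slots are first-class objects.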
Protégé-2000 automatically generates the layout and content of the instance forms based on the value types and cardinalities of the slots of the class. The user can then customize the forms using the Forms tab (Figure 3-3).

Figure 3-3 Acquire instances in Protégé-2000

The complete editing cycle is therefore as follows: define a concept, lay out the associated form, and use the form to acquire instances.

3.2.3 Storage Models and Persistence

The core framework of Protégé-2000, as shown in Figure 3-1, interacts with ontology storage through a published (and formally defined) API. In this way the widgets and user interface are decoupled from the actual persistent storage mechanism, which enables Protégé-2000 to save a given ontology in a wide variety of formats. For example, an RDF storage layer can import RDF files into Protégé, and a Protégé ontology can be stored in RDF. The RDF storage layer also performs the necessary interpretation and translation. Counting the formats made available by plug-ins, the system can export the ontology and its content knowledge to target formats such as OKBC, XML, RDF, Ontolingua, and JDBC databases. For the special case of exporting an ontology to an XML file (which we shall use in our project), the transfer follows these rules, as shown in Figure 3-4:
• Un-referenced instances become top-level elements (cyclic references are handled)
• Classes and slots become tag names
• Objects that are referenced more than once are shared and reused with id or idref.

In the first part of my implementation, I created an ontology of the Medical Subject Heading using Protégé-2000. I will introduce some of the related background information in that field in the next section.

Figure 3-4 Exporting ontology to XML

3.3 Background of Medical Subject Heading

The MeSH thesaurus has been produced by the National Library of Medicine (NLM) since 1960.
Thesauri, also known as classification structures, controlled vocabularies, and ordering systems, include carefully constructed sets of terms and relationships among the terms. The relationships are usually represented as “broader-than”, “narrower-than”, and “related” links. The MeSH thesaurus is NLM’s controlled vocabulary for subject indexing and searching of journal articles in MEDLINE, and of books, journal titles, and non-print materials in NLM’s catalog. Translated into many different languages, MeSH is widely used in indexing and cataloging by libraries and other institutions around the world. Forty years of heavy use have led to a significant expansion in the MeSH content and to considerable evolution in its structure. It is one of the most sophisticated thesauri in existence today. The selection and assignment of thesaurus terms are crucial to an information retrieval system, and MeSH is quite successful in this respect.

MeSH applications
1. It is a vital component of NLM’s computer-based information retrieval system.
2. The MeSH thesaurus is used by NLM for indexing articles from 4,300 of the world’s leading biomedical journals for the MEDLINE database and for other NLM-produced databases, which include cataloging of the books, documents, and audiovisuals acquired by the library. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. A retrieval query can then be formed using MeSH terms to find items on a desired topic.
3. MeSH is the source of the headings used as index terms in NLM’s Index Medicus and is fundamental to the organization of this monthly guide to articles from more than 3,400 international journals. [MeSH fact sheet]

An example of a partial MeSH hierarchy is represented in Figure 3-5:
1.
Anatomy [A]
   o Body Regions [A01] +
   o Musculoskeletal System [A02] +
   o Digestive System [A03] +
      • Biliary Tract [A03.159] +
      • Esophagus [A03.365] +
      • Gastrointestinal System [A03.492] +
      • Liver [A03.620] +
      • Pancreas [A03.734] +
   o Respiratory System [A04] +
   o Urogenital System [A05] +
   o Animal Structures [A13] +
   o Stomatognathic System [A14] +
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]

Figure 3-5 Medical subject heading hierarchy.

3.4 Motivation of Creating an Ontology in Medical Subject Heading Domain

From the introduction in Chapter 1, we can see that ontologies are used as a solution to sophisticated queries and other issues related to the Semantic Web in many application domains, mainly because of their ability to specify semantics and relations explicitly and to express them in a computer-understandable language. Conventional knowledge organization tools such as the MeSH thesaurus resemble ontologies in that they define concepts and relationships in a systematic manner (MeSH also has hierarchical, associative, and equivalence relationships), but they are less expressive than ontologies when it comes to machine-processable representation. The major difference between the two models is the value that an ontology adds through deeper semantics in describing objects, both conceptually and relationally. The MeSH thesaurus is essentially a collection of terms and knowledge that are widely used in medical subject indexing, cataloging, and querying. This successful application area is a good place to apply ontology techniques so that domain knowledge is shared and reused more efficiently by people or related software agents, which will lead to more powerful queries. The large amount of useful data in the thesaurus can be reused efficiently.
Ontologies can also help users at run time to build their own knowledge base or ontology objects in the specific field, even though the user may lack familiarity with some of the special vocabulary used in the thesaurus.

3.5 From MeSH Thesaurus to MeSH Ontology

The ontology includes machine-interpretable definitions of basic concepts in the domain and the relations among them. Recall why someone would want to develop an ontology. Some of the reasons are
• To share common understanding of the structure of information among people or software agents
• To enable reuse of domain knowledge
• To make domain assumptions explicit
• To analyze domain knowledge

The MeSH thesaurus is strong in many respects, but if we want to create a knowledge-rich description of objects, such as that required by the Semantic Web, the thesaurus turns out to provide only part of the knowledge needed. The goal of the Semantic Web initiative is to annotate large amounts of information resources with knowledge-rich metadata. Such annotations achieve much better performance when they are based on a rich metadata structure connected to an ontology. In the ontology construction process, additional knowledge was added to the basic hierarchical structure of the concepts derived from the thesaurus.

3.5.1 Ontology and Structure-based Search

We are familiar with keyword-based search without any closed vocabulary (as used in Yahoo and Google). Large numbers of irrelevant query results are familiar to most of us. To circumvent the problems of ambiguity in keyword searching, search descriptions should usually be limited to a fixed set of predefined structures and a closed vocabulary. Thus we come to another solution, namely the field-based approach, which describes or retrieves an item not by a set of keywords but by a set of attribute-value pairs. Typically, a metadata system is predefined and describes the elements (fields), giving some indication of what values can be assigned to a particular field.
Many of the field-based initiatives recommend the use of closed vocabularies but do not associate particular parts of a thesaurus with a field. As a consequence, the only support that a human indexer has is the thesaurus browser. The MeSH browser, for example, presents the thesaurus to users, who are then restricted to this specific browser to obtain all the information they need. Users are offered no way to create a sub-thesaurus for their own limited purposes, and they have no flexibility in choosing which fields to search on beyond the fixed ones in the browser. Figure 3-6 shows the MeSH browser screen.

Figure 3-6 MeSH thesaurus browser

Compared with the flat structure of attribute-value pairs used in a field-based search, the “structure-based” approach allows a more complex description involving relations, which introduces a large degree of complexity into the indexing process. Considering that relational descriptions can vary widely between different objects, we need to find a way to reduce the complexity of the indexing and annotating process. One possible solution is to use contextual information to constrain the relations and terms presented to the indexer. Suppose the structured descriptions are created by a human annotator using specialized tools. How can a human be supported during the annotating process? From the previous discussion, we know that a related ontology is the best answer in this situation.

3.5.2 Building MeSH Ontology

The MeSH thesaurus is useful in field- and structure-based approaches. We use it as a basis for building our ontology. Transforming this thesaurus into an ontology makes it possible for us to augment it with more semantic information. A number of concepts with additional slots and fillers can be added to the previous thesaurus structure.
Another step is to add information about the relationships between possible values of fields and nodes in the ontology. In Protégé-2000, each slot has related facets that specify the constraints applied to that slot, such as “Does it have multiple values?” or “Is it required to be single?” Protégé-2000 also provides a Protégé Axiom Language (“Pal”) constraint plug-in, which can be used to validate the whole ontology during its construction by applying restrictions and relation-constraint checks to the ontology data. The mapping from the MeSH thesaurus structure to the Medical Subject Heading ontology was as follows: the databases of the thesaurus were modeled in the ontology as different classes, attributes in specific databases became slots, and constraints became facet restrictions plus an ontology validation mechanism, the “Pal” constraints. Of course, the enormous amount of data in MeSH also helped me a lot when building the ontology.

1. The full MeSH “part/whole” hierarchy was converted into the ontology as a structural hierarchy, where each concept has a class definition corresponding to the main term in MeSH. The MeSH thesaurus developers had their own considerations when defining the MeSH “class/subclass” hierarchy. I followed their definitions so that articles in MeSH are indexed with the most specific headings available. Hierarchical information about class/subclass instances can be traced through the levels of the MeSH tree numbers. Both of the hierarchical relationships in MeSH are represented at the level of the descriptor instead of at the level of the concept. The general hierarchical structure of the MeSH ontology is presented in the following figures.
[Figure 3-7 diagram. Node labels: DescriptorRecord; Concept; DescriptorRef; Term; ConceptRef; EntryCombination; DescriptorRec (Pharmacological Action); DescriptorRec; QualifierRef; DesQua Combination; QualifierRef]

Figure 3-7 MeSH ontology class hierarchy

[Figure 3-8 diagram, with columns SLOTS and CLASS. Labels: Annotation; DescriptorClass; DescriptorRef; DescriptorRecord; DateCreated; EntryCombinationList; DescriptorReference; PublicMeshNote; ConceptRef; QualifierReference; OnlineNote; ScopeNote; MeSHOntology; SeeRelatedList; Concept; TermList; PreviousIndexingList; Is-instanceOf; QualifierReference; DateEstablished; RelatedRegistry; NumberList; CASN1Name; TreeNumberList; PharmacologicalActionList; HistoryNote; PreferredConcept; ConceptReference; ConsiderAlso; Term; Is-instanceOf; ConceptList; AllowableQualifierList]

Figure 3-8 Detailed structures and relationships of DescriptorRecord, Concept and Terms in MeSH ontology.

2. The next step is to augment a number of concepts with additional slots and fillers. For example, “DescriptorRecord” is a major class in the MeSH ontology that records most of the medical heading information. I added a slot named “DescriptorRef”, which is defined to be an instance of the DescriptorReference class. The DescriptorReference class, in turn, includes some unique information about the DescriptorRecord, namely DescriptorUI (unique id) and DescriptorName. The DescriptorRecord class also includes multiple instances of QualifierRef, which holds the related qualifier information, that is, QualifierUI (unique id), QualifierName, and the Abbreviation of that qualifier. This makes the relationships between DescriptorRecord and DescriptorRef, and between DescriptorRecord and QualifierRef, explicit (see Figure 3-8).

3. The third step was to add a data validation mechanism. Validation guarantees that the ontology data accurately reflect the real world, as long as the constraints are well defined. The first level of data validation is provided by the notion of slot value-types.
When a slot is attached to a class, it can be given a value-type (one of Any, Boolean, Class, Instance, String, Symbol). Slots also have an associated cardinality, either single or multiple. Protégé facets let users define all these first-level restrictions on all slots. Protégé supports about a dozen facets that are exposed in the user interface on slot forms. It is not always possible for the user to keep the ontology data consistent while editing; therefore the user should decide when to check the constraints. “Pal”, the Protégé Axiom Language, helps to enforce the semantic properties of ontology data encoded in Protégé. It uses the Knowledge Interchange Format (KIF) connectives and the KIF syntax. A sample Pal axiom would look like this:

   (defrange ?X1 :FRAME DescriptorReference)
   (defrange ?X2 :FRAME DescriptorReference)
   (forall ?X1 (forall ?X2
      (=> (not (= ?X1 ?X2))
          (not (= (DescriptorUI ?X1) (DescriptorUI ?X2))))))

This constraint means that no two distinct DescriptorReference instances, and therefore no two DescriptorRecords, have identical ids.

3.5.3 Ontology Data Import and Export

Importing existing data of MeSH thesaurus to ontology

There are more than 19,000 main headings in the MeSH thesaurus. In addition to these headings, there are 103,500 headings called Supplementary Concept Records within a separate chemical thesaurus. The NLM updates the thesaurus data at the beginning of each month. The development of our system therefore benefits greatly from reusing the existing data in the MeSH databases. The task of importing data from the MeSH thesaurus into the MeSH ontology is a kind of metadata transformation that converts one model (the MeSH thesaurus) into a rendering format (the Protégé data input format). The release of the MeSH thesaurus in various formats made this process easier and somewhat reduced the development time spent on it. Many interoperability tools can be used to carry out this task, for example XSLT.
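The shape of such a metadata transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in this project: the input record layout (dictionaries with the keys ui, name, and tree_numbers) is hypothetical, the output element names follow the exported ontology format of this chapter, and the duplicate check is a plain-Python analogue of the uniqueness constraint on DescriptorUI rather than a Pal evaluation.

```python
import xml.etree.ElementTree as ET


def to_ontology_xml(records):
    """Turn flat thesaurus-like records into nested ontology-style XML.

    `records` is a list of dicts with the hypothetical keys
    'ui', 'name', and (optionally) 'tree_numbers'.
    """
    project = ET.Element("Project")
    seen_ids = set()
    for rec in records:
        ui = rec["ui"]
        # Analogue of the uniqueness constraint: no two descriptors share an id.
        if ui in seen_ids:
            raise ValueError(f"duplicate DescriptorUI: {ui}")
        seen_ids.add(ui)

        record = ET.SubElement(project, "DescriptorRecord")
        ref = ET.SubElement(record, "DescriptorRef", {"p_attr": "t"})
        reference = ET.SubElement(ref, "DescriptorReference")
        ET.SubElement(reference, "DescriptorUI").text = ui
        ET.SubElement(reference, "DescriptorName").text = rec["name"]
        for number in rec.get("tree_numbers", []):
            ET.SubElement(record, "TreeNumberList").text = number
    return project


sample = [{"ui": "D000029", "name": "Abortion, Legal",
           "tree_numbers": ["E04.520.050.055"]}]
root = to_ontology_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Rejecting duplicates during the transformation keeps invalid data out of the ontology before any later validation step runs.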
Based on the MeSH thesaurus structure and on what I learned from the Protégé developers about its data import requirements, I used XQuery (discussed in Chapter 4) to develop a program that makes the existing data of the MeSH thesaurus reusable by the MeSH ontology. The program read the data from the thesaurus according to its structure and matched those data with the format of the MeSH ontology, based on the class, slot, and facet definitions. The thesaurus data file was then imported successfully into the ontology under validation by the facets and Pal restrictions.

Exporting MeSH ontology to an XML document

The ontology file was exported to an XML document after the structure construction and data import were finished. I chose to save the ontology as an XML document instead of in a traditional database based on the following considerations:
• Relational databases are particularly good at storing highly structured information and not particularly good at managing semi-structured data such as the data we meet here: not all data content in Medical Subject Headings is highly structured. The instances of the same slot vary in length, degree of completeness, and cardinality. Document-oriented XML usually has a varying structure to allow for the flexibility inherent in prose.
• The most common use for XML is as a means of integration or data interchange between enterprise applications inside and outside the firewall. We will talk about more features of XML in Chapter 4. Data in XML format can easily be reused and shared by other users or agents in the same domain of interest.
• In the second part of my project, I implemented a web informatics system. XML is clearly targeted at the web. The many ways and supporting tools to connect XML documents with web applications will certainly ease the development of my later implementation.

This is also a widely debated area.
We need to distinguish fully structured data from semi-structured data before we can accept a common conclusion. Semi-structured data are data that have some structure but are not rigidly structured. An example of semi-structured data is a health record: one patient might have a list of vaccinations, another might have height and weight, and a third might have a list of operations he or she has undergone. (Medical Subject Headings can be regarded as semi-structured data.) Semi-structured data are difficult to store in a relational database because this requires either many different tables (which means many joins and slow retrieval) or a single table with many null columns (as is the case in the MeSH thesaurus). Semi-structured data are very easy to store as XML and are a good fit for a native XML database. Although the details vary depending on the individual RDBMS, generally speaking, relational DBMSs do not handle semi-structured data very well. On the other hand, handling semi-structured data is one of the main virtues of the (data-centric) XML data model. In the real world, one always tries one's best to establish standards, but inevitably requirements and needs change over time. For example, new users enter the arena, the software is expanded to take on a wider scope of jobs, and business practices change slightly over time. It is therefore difficult to keep everything rigidly fixed and expect schemas to remain the same forever. XML provides a framework in which formats can evolve carefully with backward compatibility. If someone adds a new child element somewhere, my existing XQuery (or XPath) expressions continue to work, as long as the addition has not disrupted the semantics too harshly. In Chapter 4, I will discuss the relationship between ontologies and XML techniques, and the XQuery engine used to query XML documents.
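The point about path expressions surviving schema evolution can be checked with a small Python sketch. It uses the limited XPath subset of the standard xml.etree.ElementTree module rather than an XQuery engine, and the two sample documents are made up for illustration: the second one adds an Annotation child and a record without a DescriptorName, as semi-structured data often would.

```python
import xml.etree.ElementTree as ET

# Original document: one record with an id and a name.
OLD = """
<Project>
  <DescriptorRecord>
    <DescriptorRef>
      <DescriptorReference>
        <DescriptorUI>D000029</DescriptorUI>
        <DescriptorName>Abortion, Legal</DescriptorName>
      </DescriptorReference>
    </DescriptorRef>
  </DescriptorRecord>
</Project>
"""

# Evolved document: a new child element was added, and a second record
# lacks a DescriptorName (illustrative data only).
NEW = """
<Project>
  <DescriptorRecord>
    <Annotation>added later</Annotation>
    <DescriptorRef>
      <DescriptorReference>
        <DescriptorUI>D000029</DescriptorUI>
        <DescriptorName>Abortion, Legal</DescriptorName>
      </DescriptorReference>
    </DescriptorRef>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorRef>
      <DescriptorReference>
        <DescriptorUI>D999999</DescriptorUI>
      </DescriptorReference>
    </DescriptorRef>
  </DescriptorRecord>
</Project>
"""


def descriptor_names(xml_text):
    """Return the text of every DescriptorName under a DescriptorReference."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall(".//DescriptorReference/DescriptorName")]


# The same path expression works on both versions of the document.
print(descriptor_names(OLD))
print(descriptor_names(NEW))
```

A relational schema would have needed a new column or table for the added element, whereas the path expression here is simply unaffected by it.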
In the later description of my implementation, we will see that building such an ontology allows it to be used and shared by more agents in diverse environments and makes queries much more flexible and efficient. Moreover, the ontology can help users create their own ontology objects in XML (which can later be imported into knowledge application systems). Part of the ontology in XML format is shown in Figure 3-9.

<Project>
  <DescriptorRecord>
    <DescriptorRef p_attr="t">
      <DescriptorReference>
        <DescriptorUI>D000029</DescriptorUI>
        <DescriptorName>Abortion, Legal</DescriptorName>
      </DescriptorReference>
    </DescriptorRef>
    <DateCreated>01/01/1999</DateCreated>
    <DateEstablished>01/01/1964</DateEstablished>
    <AllowableQualifierList p_attr="t">
      <QualifierReference>
        <QualifierUI>Q000009</QualifierUI>
        <QualifierName>adverse effects</QualifierName>
        <Abbreviation>AE</Abbreviation>
      </QualifierReference>
      …
    </AllowableQualifierList>
    <HistoryNote>64</HistoryNote>
    <PublicMeSHNote>64</PublicMeSHNote>
    <TreeNumberList>E04.520.050.055</TreeNumberList>
    <ConceptList p_attr="t">
      <Concept PreferredConceptYN="Y">
        <ConceptRef p_attr="t">
          <ConceptReference>
            <ConceptUI>M0000047</ConceptUI>
            <ConceptName>Abortion, Legal</ConceptName>
            <ConceptUMLSUI>C0000812</ConceptUMLSUI>
          </ConceptReference>
        </ConceptRef>
        <ScopeNote>Termination of pregnancy under conditions allowed under local laws.
(POPLINE Thesaurus, 1991)</ScopeNote>
        <PharmacologicalActionList p_attr="t" />
        <TermList p_attr="t">
          <Term ConceptPreferredTermYN="Y">
            <String>Abortion, Legal</String>
            <TermUI>T000087</TermUI>
          </Term>
          <Term>
            <String>Abortions, Legal</String>
            <TermUI>T000087</TermUI>
          </Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</Project>

Figure 3-9 Ontology exported to an XML document

CHAPTER 4
ONTOLOGY, XML AND XQUERY

4.1 Extensible Markup Language XML

XML is widely regarded as a significant step forward both for electronic commerce and for a number of other types of Internet applications. Because it is becoming widely known, I will only briefly mention the concepts that are related to my work.

Extensible Markup Language (XML) is an extremely flexible method for creating a consistent way of sharing information over the Internet, intranets, or anywhere else. XML is a simplified subset of SGML (Standard Generalized Markup Language). The use of XML for "tagging" data based on content allows for a more focused and powerful way to search data. Because XML enables documents to use semantic markup that identifies data elements according to what they are, rather than how they should appear, many diverse applications can make use of the information in XML documents. Because authors are free to define whatever tags are appropriate for adding semantic information, XML can describe document types for all conceivable domains and purposes, for example multimedia presentations, HTML pages of arbitrary content (XHTML), and business transactions. Through standardized interfaces such as SAX and DOM, explicitly structured textual XML documents can easily be accessed by application programs. Suppose web developers, database developers, document managers, desktop publishers, programmers, scientists, or other professionals all get involved in a certain project.
XML in this case can provide a simple format that is flexible enough to accommodate such diverse needs. Simplicity, extensibility, interoperability, and openness are the most significant features offered by XML. Thus XML could play an important role as a basic technology in the context of knowledge management and dissemination, and also when it comes to managing large-scale web sites. XML supports corporate design, style sheets (XSL), automatic generation of customized views of documents, consistency between documents, superior linking facilities (XLink, XPointer), and so forth. All these features are based on an individually definable tag set that is tailored to the application needs. In HTML, tags serve purely layout and presentation purposes; in XML, tags have semantic purposes, so they can be exploited for several tasks, such as those mentioned above, or as metadata that support intelligent information retrieval.

4.2 Ontologies as Conceptual Models for Generating XML Documents

4.2.1 XML Itself Is Not Enough

In spite of these positive features of XML, it must be understood that XML is solely a description language for specifying the structure of documents, and thus their syntactic dimension. It is a widely accepted foundation layer upon which to build, but it is not a cure-all for system interoperability. The document structure can represent some semantic properties, but these are understood only by special-purpose applications unless there is a way to make those properties available externally. XML permits us to use tags but gives us no guidance as to which words are appropriate and commonly acceptable (no closed vocabulary is agreed upon by a group of users in the domain of interest). So two questions arise: How should XML be extended to support the representation of business information? How can the numerous heterogeneous systems be unified to enable the low-friction marketplace of the future?
These two questions lead directly to the use of ontologies.

4.2.2 Add Ontology as Conceptual Model

Ontologies establish a joint terminology between members of a community of interest. If we add true semantics to XML documents by relating the document structure to an ontology, XML documents can be authored to represent facts that are compatible with the designed domain model, that is, the ontology. This can be done by mapping ontology concepts and attributes to XML elements through the definition of a Document Type Definition (DTD). An ontology is a “formal specification of a conceptualization” [Gruber 1993] and thus provides a basis for semantics-based processing of XML documents. At the XML level, we have only sequential order and element nesting. Only by reaching the level of the ontology can we speak about concepts (classes) and semantic relationships (class hierarchy, attribute restrictions, and ontology validation constraints), and that should be regarded as the appropriate level for structuring the contents of documents. Of course, concepts and relationships have to be expressed and stored in linear form in documents, but this is pure representation; that is, DTDs and document structures alone are not enough to give XML sound semantics. Using an ontology as the primary source for structuring a set of XML documents in a certain domain of interest makes the ontology act as a kind of mediator between the information seeker and those XML documents. The ontology unifies the different syntaxes and structures of these documents. The documents can then be accessed in a more semantic way (for example, by using conceptual terms for retrieving facts). In my approach, I took the MeSH ontology (represented in classes, slots, facets, and logical constraints, not in XML) and generated from it an XML file that can be used later for queries or other purposes.
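The mapping of ontology concepts and attributes to XML elements through a DTD can be sketched as follows. This is a minimal Python illustration under simplifying assumptions, not the procedure used in this project: the ontology is reduced to a hypothetical dictionary of class names and (slot name, is-multiple) pairs, single-cardinality slots are treated as optional, and every slot that is not itself a class is declared as text content.

```python
def generate_dtd(classes):
    """Generate DTD element declarations from a simplified ontology description.

    `classes` maps a class name to a list of (slot_name, is_multiple) pairs.
    Classes become elements whose content model lists their slots; slots that
    are not classes themselves become #PCDATA elements.
    """
    lines = []
    declared = set()
    for class_name, slots in classes.items():
        if slots:
            # Cardinality facet: single -> at most one ('?'), multiple -> '*'.
            content = ", ".join(
                f"{name}{'*' if multiple else '?'}" for name, multiple in slots
            )
            lines.append(f"<!ELEMENT {class_name} ({content})>")
        else:
            lines.append(f"<!ELEMENT {class_name} EMPTY>")
        declared.add(class_name)
    for slots in classes.values():
        for name, _ in slots:
            if name not in declared:
                lines.append(f"<!ELEMENT {name} (#PCDATA)>")
                declared.add(name)
    return "\n".join(lines)


# Class and slot names taken from the MeSH ontology; the nesting is simplified.
mesh_classes = {
    "AllowableQualifierList": [("QualifierReference", True)],
    "QualifierReference": [
        ("QualifierUI", False),
        ("QualifierName", False),
        ("Abbreviation", False),
    ],
}
dtd_text = generate_dtd(mesh_classes)
print(dtd_text)
```

The generated declarations capture only the syntactic layer; value-type facets and logical constraints of the ontology have no counterpart in a DTD, which is the limitation discussed in this section.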
(In real application cases, a set of XML files may already be available, and we need to generate a DTD from the existing ontology to make the XML files compatible with the ontology.)

4.3 XQuery

4.3.1 XML Query Language XQuery

XML is an extremely versatile markup language that is capable of labeling the information content of diverse data sources (structured and semi-structured documents, relational databases, and so forth). A query language that uses the structure of XML intelligently should be able to express queries across all these kinds of data, whether they are physically stored in XML or viewed as XML via middleware. Most existing proposals for XML query languages are robust only for specific kinds of data. XQuery, on the other hand, is designed to be broadly applicable across all types of XML data sources.

XQuery is the W3C's query language for XML. It is derived from Quilt, an earlier XML query language, which in turn borrowed features from several other languages, including XPath 1.0, XQL, XML-QL, SQL, and OQL. XQuery is designed to meet the requirements identified by the W3C XML Query Working Group in XML Query 1.0 Requirements and the use cases in XML Query Use Cases. It is designed to be a small, easily implementable language in which queries are concise and easily understood. It is also flexible enough to query a broad spectrum of XML information sources, including both databases and documents. In addition, XQuery is designed to meet the requirement of a human-readable query syntax.

At its simplest, an XQuery expression could be:

   document("bookPublished.xml")//book

This is standard XPath and is a complete, valid, self-contained XQuery query. It returns a list of all book elements in the document “bookPublished.xml”.
If we assume for the moment that we are getting back serialized XML, the above query pulls <book> elements out of the document and returns content that looks like this:

   <book year="1994">
     <title>TCP/IP Illustrated</title>
     <author><last>Stevens</last><first>W.</first></author>
     ...
   <book year="1992">
     <title>Advanced Programming in the Unix environment</title>
     ...
   <book year="2000">
     <title>Data on the Web</title>
     ...

The FLWR (pronounced "flower") expression makes queries in XQuery much more interesting and efficient. FLWR is an acronym for four of the possible XQuery subexpressions: FOR, LET, WHERE, and RETURN. The productions in the XQuery grammar that formally define a FLWR expression are as follows:

   FlwrExpr     ::= (ForClause | LetClause)+ WhereClause? ReturnClause
   ForClause    ::= 'FOR' Variable 'IN' Expr (',' Variable 'IN' Expr)*
   LetClause    ::= 'LET' Variable ':=' Expr (',' Variable ':=' Expr)*
   WhereClause  ::= 'WHERE' Expr
   ReturnClause ::= 'RETURN' Expr

These definitions make the FLWR production very malleable, highly recursive, and capable of generating a large number of possible query instances, including just about any combination of FOR, LET, WHERE, and RETURN statements imaginable.

Debates still exist on whether XQuery overlaps too much with XSLT. Both XQuery and XSLT will use XPath 2.0, and the two Working Groups are working closely together on this, so the two languages will share a great deal. The main differences between the two languages are differences of culture and perspective, but XQuery is also more ambitious than XSLT and requires more complex optimizations. At a high level, XSLT is a data transformation language with an interchange-centric model, while XQuery is a query language with a storage-centric model. XSLT uses XPath as a "sublanguage" located in attributes of XML syntax, while XQuery is constructed as a superset of XPath.
XQuery is oriented toward large sets of XML data; therefore better acceptance could be expected for commercial XQuery implementations. Three areas that justify creating XQuery as a language separate from XSLT are 1) ease of use; 2) optimizability; and 3) strong data typing. [W3C Working Draft 2001]

4.3.2 XQuery Implementation QuiP

Software AG's QuiP is a prototype implementation of XQuery, the W3C XML Query Language, for 32-bit Windows platforms. QuiP is designed to make it easy to learn and use the language. The following points describe QuiP concisely:
• Graphical user interface for writing queries and viewing results
• Online help includes syntax diagrams for XQuery
• Examples include 76 queries and 51 XML files
• Syntax conforms to the 07 June 2001 Working Draft of XQuery
• Most of the XQuery language has been implemented
• Queries may be made against XML files or XML stored in a Tamino database
[Jonathan Robie, Software AG, Nov. 2001]

I used the QuiP engine in my web project mainly for the following four reasons:
• Ease of use. With the ease of use of QuiP plus the help of the MeSH ontology query wizard, users will be able to write their own queries in QuiP without difficulty;
• It is the most up-to-date implementation of XQuery, according to the W3C XQuery requirements draft and use cases;
• The software is open source and can be obtained freely online;
• Careful analysis of the QuiP engine implementation made it possible for me to write an interface that embeds the QuiP query engine in my server program (implemented as Java servlets on Tomcat).

CHAPTER 5
IMPLEMENTATION OF AN ONTOLOGY-BASED WEB APPLICATION SYSTEM

5.1 Building Web Informatics System

5.1.1 Ontology-based Web Application

Sharing common understanding of the structure of information among users or software agents in the domain of interest is one of the common goals in developing the Medical Subject Heading ontology.
As is common in ontology applications, this ontology also amounts to defining a set of data and their structures for other programs to use. With the help of the ontology information built earlier, users can build their own ontology objects in XML format, which can later be imported into other knowledge bases; in addition, the ontology data (instances) can itself back a web site that offers its users semantic queries over its rich content. From the implementation and techniques introduced in previous chapters, we know that 1) the MeSH thesaurus is a good starting point for building an ontology in the same domain and 2) Protégé-2000 can be used as an efficient tool to construct the ontology and acquire ontology instances through its user API. It is also possible to apply PAL (Protégé Axiom Language) validation constraints to ontology data.

In this chapter, I introduce the implementation of a web informatics system based on the existing MeSH ontology. This web informatics system gives the user the flexibility to query the ontology instances and the option to construct ontology objects in various ways.

The first scenario is as follows: the ontology itself remains on the server side, and user queries are typed in and sent to the server over HTTP, as in normal web application systems. The server accepts the query, searches the ontology by invoking the XQuery engine on the server side, and sends back the query results, provided the query is valid.

The second scenario is as follows: users want to query their own documents instead of the default ontology. In this case, they first upload to the server their own query files (with file extension .xquery) and document files (with file extension .xml); the server then runs the queries on those files on behalf of the clients, just like a web agent.

These two scenarios require that the users know a good deal about the ontology data model.
Because they write the queries themselves, the users must understand XQuery grammar as well. Sometimes, however, users are not familiar with the structures and relationships defined in the ontology model, or they are not interested in learning XQuery grammar. The third scenario is the most appropriate one in this case: an online ontology query wizard helps clients form proper queries by supplying the necessary ontology knowledge and query format. Clients can thus overcome the difficulties of writing valid queries on that specific ontology and can query it according to their own purposes.

5.1.2 Using JSP
Increasingly sophisticated web applications need to present dynamic information. First-generation solutions for this kind of application included CGI, a mechanism for running external programs through the web server. The problem with CGI is scalability: a new process is created for every request. Second-generation solutions included programming platforms, plug-ins, and APIs that web server vendors provided for their own servers; these solutions were therefore specific to those server products. For example, ASP worked only on Microsoft IIS or Personal Web Server. Using ASP means losing the freedom to select a favorite web server and operating system. JSP pages are a third-generation solution that can be combined easily with some second-generation solutions to create dynamic web content, making it easier and faster to build web-based applications. These web-based applications work with a variety of other technologies: servers, browsers, and other development tools. I use JSP for server programming, which includes a server-side security check, file upload and download, query requests, query result responses, the ontology query wizard program, and so forth.
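As an illustration of the query wizard idea described above, the following minimal Java sketch shows one way a FLWR query string could be assembled from a user's menu selections. It is not the program used in this system: the class WizardQuerySketch, the method buildQuery, and all parameter names are hypothetical. Keywords are written in upper case as in the FLWR grammar of Chapter 4, and every requested field is addressed with $var// for simplicity.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of one wizard step; not the thesis implementation. */
final class WizardQuerySketch {

    /**
     * Assembles a FLWR query string from wizard selections.
     * document  : name of the target document chosen from the first menu
     * startPath : element chosen as the starting point in the ontology structure
     * condition : optional constraint relative to $var (null or blank for none)
     * rootTag   : tag wrapping the whole result
     * itemTag   : tag wrapping each constructed ontology object
     * fields    : element names to copy into each new object
     */
    static String buildQuery(String document, String startPath, String condition,
                             String rootTag, String itemTag, List<String> fields) {
        StringBuilder q = new StringBuilder();
        q.append('<').append(rootTag).append("> {\n");
        q.append("  FOR $var IN document(\"").append(document).append("\")//")
         .append(startPath).append('\n');
        // The WHERE clause is optional in a FLWR expression.
        if (condition != null && !condition.trim().isEmpty()) {
            q.append("  WHERE $var/").append(condition.trim()).append('\n');
        }
        q.append("  RETURN <").append(itemTag).append('>');
        for (String field : fields) {
            q.append(" {$var//").append(field).append('}');
        }
        q.append(" </").append(itemTag).append(">\n");
        q.append("} </").append(rootTag).append('>');
        return q.toString();
    }

    public static void main(String[] args) {
        String query = buildQuery("MeSHOntology", "DescriptorReference",
                "DescriptorUI < \"D000019\"", "Project", "Descriptor",
                Arrays.asList("DescriptorName", "Annotation", "ScopeNote"));
        System.out.println(query);
    }
}
```

A real wizard would also need to validate the user-supplied condition and tag names before embedding them in the query; this sketch omits that step.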
[Figure 5-1: diagram. On the client side, queries come from three sources (typing in XQuery, the Ontology Query Wizard, or local XQuery files), and local XML files can be uploaded. On the server side, Quip (the XQuery engine) runs against the default document MeSHOntology.xml or user-created XML files (sub-ontologies), and the client can download the results. The figure shows a generated query:

    <Project> {
      FOR $var IN document("MeSHOntology")//DescriptorReference
      WHERE $var/DescriptorUI < "D000019"
      RETURN <Descriptor> {$var/DescriptorName} {$var//Annotation} {$var//ScopeNote} </Descriptor>
    } </Project>

and a result of the form:

    <Project>
      <Descriptor>
        <DescriptorName>Calcimycin</DescriptorName>
        <ScopeNote>An ionophorous, polyether antibiotic from Streptomyces chartreusensis.</ScopeNote>
      </Descriptor>
      <Descriptor> ...
    </Project>]

Figure 5-1 Web informatics system architecture

5.1.3 System Architecture
The overall system architecture is presented in Figure 5-1. Clients form queries in one of three ways, according to the scenarios introduced above. The server accepts query requests from the client side and invokes Quip (the XQuery engine) to query the ontology data (or the document data the user created or uploaded earlier). The query results are then returned to the clients, provided the queries are valid.

Interface to invoke XQuery in server program
As I introduced in Chapter 4, Quip is a successful implementation of the XQuery working draft, produced by Software AG. Another well-known product of that company is Tamino, its native XML database. (I might consider importing the MeSH ontology into the Tamino database in the future.) Quip is implemented in Java, which makes it possible to embed Quip queries in my server JSP program. The class java.lang.Runtime features a static method called getRuntime(), which returns the Runtime object associated with the current Java application and is the only way to obtain a reference to that object. With that reference, I can run external programs (e.g., Quip) by invoking the Runtime class's exec() method, but I need to pay special attention to redirecting the input and output of the external process.
In a JSP server program, both the standard input/output streams and the HTTP input/output streams are in play in the client/server architecture. I could not simply call exec(“java quip…”) to pass a query file to Quip and capture the output of the Quip execution. The solution was to create a special interface class that handles the process streams so that Quip runs smoothly inside the server program. Part of this interface file is as follows:

    import java.io.*;

    // Drains one stream of the external process on its own thread so that
    // the process cannot block on a full output buffer.
    class StreamGobbler extends Thread {
        InputStream is;
        String type;
        OutputStream os;

        StreamGobbler(InputStream is, String type) {
            this(is, type, null);
        }

        StreamGobbler(InputStream is, String type, OutputStream redirect) {
            this.is = is;
            this.type = type;
            this.os = redirect;
        }

        public void run() {
            try {
                PrintWriter pw = null;
                if (os != null)
                    pw = new PrintWriter(os);
                InputStreamReader isr = new InputStreamReader(is);
                BufferedReader br = new BufferedReader(isr);
                String line = null;
                while ((line = br.readLine()) != null) {
                    if (pw != null)
                        pw.println(line);   // copy to the redirect target, if any
                    System.out.println(type + ">" + line);
                }
                if (pw != null)
                    pw.flush();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
    …
    try {
        FileOutputStream fos = new FileOutputStream(xml_result);
        Runtime rt = Runtime.getRuntime();
        String inqf = null;
        if (cfile != null)
            inqf = _filename + cfile;
        else
            inqf = _filename + pfilenm;
        Process proc = rt.exec("java -classpath quip.jar;crimson.jar;jaxp.jar"
                + " com.softwareag.xtools.quip.Main --quipcmd quip.exe"
                + " -f " + inqf + " -o " + xml_result);
        // any error message?
        StreamGobbler errorGobbler = new StreamGobbler(proc.getErrorStream(), "ERROR");
        // any output?
        StreamGobbler outputGobbler = new StreamGobbler(proc.getInputStream(), "OUTPUT", fos);
        // kick them off
        errorGobbler.start();
        outputGobbler.start();
        // any error???
        int exitVal = proc.waitFor();
        System.out.println("ExitValue: " + exitVal);
        // let the gobbler threads finish draining before the file is closed
        errorGobbler.join();
        outputGobbler.join();
        fos.flush();
        fos.close();
    } catch (Throwable t) {
        t.printStackTrace();
    }

Figure 5-2 Interface of server program and Quip execution

5.2 Query the Ontology
Considering all the ways clients may want to query the ontology, the web application was implemented flexibly so that clients can choose the most convenient way to query the ontology. Professionals from this research domain may create their queries ahead of time in XQuery format and save them locally. For those who are not familiar with the ontology structures, or who do not know how to form a legal query for Quip, a wizard presents structural information about the ontology and guides the users through the query process.

5.2.1 Query from Direct Typing
After their username and password have been checked, users can type queries on the client side. As in a normal web application, the JSP server programs display a dynamic web page on the client side, including useful information about and descriptions of the ontology. Clients fill in some input forms and can type their query directly into a scrolling text area. When the forms are submitted, the server JSP programs process the submitted content, invoke the Quip query, and return the query output to the clients when it finishes (see Figure 5-4).

5.2.2 Upload Existing Local Query Files
Clients can also choose to upload existing query files to the server when typing the query online is not practical, for example, when the query is quite long. Both this method and the previous one require users to understand at least the structure of the ontology in order to write queries or construct ontology objects correctly. They should also know how to write legal queries based on XQuery grammar.
Useful information describing the ontology structures appears on the web pages (see Figures 5-3 and 5-4).

Figure 5-3 Online links point to ontology structures

Figure 5-4 Query the ontology by typing or uploading local query files

5.2.3 Query Ontology or Build Up Ontology Objects with the Ontology Wizard
We need to consider the situation in which users are not familiar with the ontology structures or do not know how to construct legal XQuery queries. I implemented a MeSH ontology query wizard to give online hints and proper restrictions for querying the ontology. The following steps are designed into the wizard not only to let the user form a proper query according to his initial purpose but also to facilitate constructing well-structured ontology objects in the same domain (see Figure 5-5).
• The ontology hierarchy is extracted from the ontology and saved in the ontology configuration file.
• One menu of the wizard page lists the files on the server that are available for a particular user to query. Every user is able to query the default ontology file. Users who uploaded their own XML files from local computers can also query those files. The same rule applies to ontology objects generated automatically in earlier queries.
• The other menu, itemized by ontology structures, is constructed from the ontology configuration file.
• A JSP program does as much as it can to form legal queries automatically, based on the ontology and some information given by the user. The users need to choose the query target file from one menu; select, from another menu, an initial point within the ontology structure from which to start the query; type in any condition constraints they want to put on the query; and, if they want to construct a new ontology object, give the tag name to use in the new ontology objects and specify which fields of data in the initial ontology should appear in the new objects. Then they submit their queries.
• The JSP program displays the complete query to users to help them become familiar with the format, although users do not have to remember everything.

Figure 5-5 Query the ontology with the help of the ontology query wizard

5.2.4 Choose the Query File and Download the Result
Not only the default ontology file MeSHOntology.xml can be queried; local XML files (or ontology object files saved as XML documents in previous queries) can also serve as query targets. To do this, the web application first uploads the local XML files to the server and saves them under that user's folder. The user can then query his own XML files in the same way he queries the MeSH ontology. When the query results come back, users can choose to download the results to their own computers. This also makes it possible for local files in diverse formats to be queried or accessed within a community of people who commit to the same ontology.

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

Ontologies are usually built through the cooperation of computer professionals and domain experts. Because I lack medical subject knowledge, the ontology I built is not ideally defined and constructed, and the domain knowledge may not be completely captured in it. In the future, if medical subject professionals get involved in improving the system, better results can be achieved. This web application system is not restricted to one specific domain, Medical Subject Headings. Its features could be adapted with ease to let users from various domains apply this technique to different ontologies and query in diverse client/server systems.

The most obvious use of an ontology is in connection with a database component. An ontology can be compared with the schema component of a database. An ontology can play an important role in the requirements analysis and conceptual modeling phase. The resulting conceptual model can be represented as an ontology that can be processed by computer.
From there it can be mapped to concrete target platforms, including databases, so that the system gains the advantages of a database. This system currently saves the ontology in an XML document. As such, it lacks many of the benefits found in real databases, such as efficient storage, indices, security, transactions and data integrity, multi-user access, triggers, and so on. Considering the irregular format of MeSH data, importing the ontology into a native XML database (e.g., Tamino) is more reasonable than importing it into a traditional database, in order to get the most out of both a database system and an ontology. I might consider this issue in future improvements.

LIST OF REFERENCES

Abason, J. M., Gomez, M. “MELISA. An Ontology-based Agent for Information Retrieval in Medicine.” Viewed: Jan. 2002. http://www.ics.forth.gr/proj/isst/SemWeb/proceedings/session3-1/paper.pdf

Bourret, R. “XML and Databases.” February 2002. http://www.rpbourret.com/xml/XMLAndDatabases.htm

Clark, J. “XSL Transformations (XSLT) Specification 1.0.” W3C Working Draft, April 21, 1999. http://www.w3.org/TR/1999/WD-xslt-19990421.html

Decker, S., Harmelen, F. V., Broekstra, J. “The Semantic Web - on the Respective Roles of XML and RDF.” http://www.ontoknowledge.org/oil/downl/IEEE00.pdf

Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D. “A Query Language for XML.” 1998. http://citeseer.nj.nec.com/correct/138418

Extensible Markup Language (XML). Viewed: Jan. 2002. http://www.w3.org/XML/

Fensel, D., Angele, J., Decker, S. “On2broker: Semantic-Based Access to Information Sources at the WWW.” Proceedings of the World Conference on the WWW and Internet (WebNet 99), Honolulu, Hawaii, 1999. ftp://ftp.aifb.uni-karlsruhe.de/pub/mike/dfe/paper/webnet.pdf

Gruber, T. R. “A Translation Approach to Portable Ontologies.” Knowledge Acquisition, 5(2):199-220, 1993a. http://ksl-web.stanford.edu/KSL_Abstracts/KSL-92-71.html

Gruber, T. R.
“Toward Principles for the Design of Ontologies Used for Knowledge Sharing.” Presented at the Padua Workshop on Formal Ontology, March 1993b. http://ksl-web.stanford.edu/KSL_Abstracts/KSL-93-04.html

Guarino, N. “Formal Ontology in Information Systems.” Proceedings of FOIS’98, Trento, Italy, 6-8 June 1998. Amsterdam, IOS Press, pp. 3-15.

Heflin, J. “Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment.” Viewed: Feb. 2002. http://www.cs.umd.edu/projects/plus/SHOE/pubs/heflin-thesis-orig.pdf

Huffman, S. B., Baudin, C. “Toward Structured Retrieval in Semi-structured Information Spaces.” In Proceedings of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), 1997.

Hunter, J. “MetaNet - A Metadata Term Thesaurus to Enable Semantic Interoperability Between Metadata Domains.” Journal of Digital Information, 1(8), 2001. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Hunter/

Jasper, R., Uschold, M. “A Framework for Understanding and Classifying Ontology Applications.” In Proceedings of the IJCAI-99 Ontology Workshop, Stockholm, Sweden, 1999.

Katz, H. “A Look at the W3C’s Proposed Standard for an XML Query Language.” June 2001. http://www-106.ibm.com/developerworks/xml/library/x-xquery.html

Lawrence, S., Giles, C. L. “Context and Page Analysis for Improved Web Search.” IEEE Internet Computing, 2(4), 38-46, 1998. http://www.neci.nec.com/~lawrence/papers/search-ic98/

Mahalingam, K., Huhns, M. N. “An Ontology Tool for Query Formulation in an Agent-Based Context.” Proceedings of the 2nd IFCIS International Conference on Cooperative Information Systems (CoopIS '97), 1997.

Mahmoud, Q. H. “Web Application Development with JSP and XML.” June 2001. http://developer.java.sun.com/developer/technicalArticles/xml/WebAppDev/

Nelson, S. J., Johnston, W. D., Humphreys, B. L. “Relationships in Medical Subject Headings (MeSH).” National Library of Medicine, Viewed: Jan. 2002.
http://www.nlm.nih.gov/mesh/meshrels.html

Noy, N. F., McGuinness, D. L. “Ontology Development 101: A Guide to Creating Your First Ontology.” Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001. http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinnessabstract.html

Qin, J., Paling, S. “Converting a Controlled Vocabulary into an Ontology: The Case of GEM.” Information Research, 6(2), 2001. http://InformationR.net/ir/6-2/paper94.html

Savage, A. “Changes in MeSH Data Structure.” NLM Tech Bull. 2000 Mar-Apr (313):e2.

Schmidt, A., Kersten, M., Windhouwer, M., Waas, F. “Efficient Relational Storage and Retrieval of XML Documents.” http://www.cwi.nl/themes/ins1/publications/docs/ScKeWiWa:WEBDB:00.pdf

Soergel, D. “Functions of a Thesaurus / Classification / Ontological Knowledge Base.” October 1997. http://www.clis.umd.edu/faculty/soergel/soergelfctclass.pdf

United States National Library of Medicine. “Medical Subject Heading (MeSH) Fact Sheet.” Viewed: Jan. 2002. http://www.nlm.nih.gov/mesh

Volz, R. “OntoServer – Infrastructure for the Semantic Web.” University of Karlsruhe, Germany, 2001. http://www.aifb.uni-karlsruhe.de/WBS

Wielinga, B. J., Schreiber, A. Th., Wielemaker, J., Sandberg, J. A. C. “From Thesaurus to Ontology.” 2001. http://www.swi.psy.uva.nl/usr/Schreiber/papers/Wielinga01a.pdf

W3C Working Draft. “XQuery 1.0: An XML Query Language.” 20 December 2001. http://www.w3.org/TR/xquery/

W3C Working Draft. “XML Query Use Cases.” 20 December 2001. http://www.w3.org/TR/xmlquery-use-cases

www-rdf-logic. “Annotated DAML+OIL (March 2001) Ontology Markup.” Viewed: March 2002. http://www.daml.org/2001/03/daml+oil-walkthru.html

BIOGRAPHICAL SKETCH

Wenyang Hu was born in Hefei, Anhui Province, China. She received a Bachelor of Science degree in computer engineering from Zhejiang University, Hangzhou, China, in July 1988.
Then she entered the graduate program at Zhejiang University and obtained a Master of Engineering degree in computer engineering in 1991. After graduation, she became a faculty member at Anhui University, teaching courses in computer science. At the same time she took part in many co-op projects. The courses she taught included Programming Languages C/C++, DBMS Introduction, Data Structure, and Operating System Principles. The projects in which she participated included “DIV” (Dial in Voice) and the “Multi-Model Interactive Platform.” She enrolled at the University of Florida in August 2000 in the Department of Computer and Information Science and Engineering. She worked as a research assistant with Prof. Limin Fu, and later worked as a teaching assistant for Prof. Beverly A. Sanders. Her future research interests include ontology-based web applications and web server programming.