RQUERY: Rewriting Text Queries to Alleviate the Vocabulary Mismatch Problem on RDF Knowledge Bases

Saeedeh Shekarpour¹ and Sören Auer¹,²
¹ EIS Research Group, University of Bonn
² Fraunhofer-Institute IAIS

Abstract. For non-expert users, a textual query is the most popular and simplest means of communicating with a retrieval or question answering system. However, there is a risk of receiving queries which do not match the background knowledge. Although query expansion is a solution in traditional information retrieval, it is not the appropriate choice for question answering systems. In this paper, we propose a new method for automatically rewriting input queries by leveraging background knowledge. We employ Hidden Markov Models to determine the most suitable derived words from linguistic resources. We introduce the concept of triple-based co-occurrence for recognizing co-occurring words in RDF data. This model was bootstrapped with three different statistical distributions. The training as well as the test datasets were extracted from the QALD benchmark. Our experimental study demonstrates the high accuracy of the approach.

1 Introduction

Although the amount of information being published on the Web of Data is dramatically high, retrieving information remains difficult due to several known challenges. A key challenge is the lack of accurate knowledge of the used vocabularies, which even expert users frequently use incorrectly. Thus, communication with search systems, particularly those based on simple interfaces (i.e., textual queries), requires automatic ways of tackling the vocabulary mismatch challenge. This challenge is even more important for schema-aware search systems, such as question answering systems, since there a precise interpretation of the input query as well as accurate spotting of the answer are more demanding.
The vocabulary mismatch problem has several origins, the main ones being:
– inflectional form: the variation of a word for different grammatical categories such as tense, aspect, person, number, etc. For example, the keyword ‘actress’ may need to be altered to ‘actor’ or ‘companies’ to ‘company’. Stemming and lemmatization aim at reducing inflectional forms by converting words to a base form.
– lexical form: relates words based on lexical categories. For example, the keyword ‘wife’ can be altered to ‘spouse’ or ‘altitude’ to ‘elevation’, because they hold the same meaning. Synonyms, hyponyms and hypernyms are examples of lexical relations.
– abbreviation form: a shortened form of a word or phrase. For example, ‘UN’ is the abbreviation of ‘United Nations’. A common way to deal with abbreviations is using a dictionary.

Various approaches have been proposed to address the vocabulary mismatch problem, the most important ones being:
– Using a controlled vocabulary maintained by human editors. This approach is common for relatively small or restricted domains.
– Automatically deriving a thesaurus. For instance, word co-occurrence statistics over a text corpus are an automatic way to induce a thesaurus.
– Interactive query expansion, which provides a list of recommendations for the end user. The recommendations can come from sources such as query logs or a thesaurus.
– Automatic query expansion, which automatically (without any user intervention) adds derived words to the input query in order to increase recall in retrieval systems.
– Query rewriting based on query log mining, which leverages manual query rewriting. This approach requires comprehensive query logs and is thus particularly appropriate for web search engines.

RDF data is structured data which can be viewed as a directed, labeled graph $G_i = (V_i, E_i)$ where $V_i$ is a set of nodes comprising all entities and literal property values, and $E_i$ is a set of directed edges, i.e.
the set of all properties. In this paper, we exploit the topology as well as the semantics of RDF data for query rewriting. We propose RQUERY³, a method for automatic query rewriting which exploits the internal structure of RDF data. We employ a Hidden Markov Model (HMM) to obtain optimal tuples of derived words. We test different bootstrapping methods applying well-known distributions as well as an algorithm based on Hyperlink-Induced Topic Search (HITS). Our main contributions are as follows:
– We define the concept of triple-based co-occurrence of words in RDF knowledge bases.
– We extend the Hidden Markov Model for producing and ranking query rewrites.
– We extend a benchmark extracted from the QALD benchmark for the query rewriting task.
– We assess and analyse the effectiveness of the proposed approach in two directions: (1) how effective the approach is in addressing the vocabulary mismatch problem (using queries that have the vocabulary mismatch problem) and (2) how effective the approach is in avoiding noise (using queries that do not have the vocabulary mismatch problem).

The paper is structured as follows: In the following section, we present an overview of RQUERY along with some of the notations and concepts used in this work. Section 3 presents the proposed approach for query rewriting. Our evaluation results are presented in Section 4; then, related work is reviewed in Section 5. Finally, we conclude and give pointers to future work.

³ Demo is at: http://rquery.sina.eis.iai.uni-bonn.de/

[Figure 1: Overview of RQUERY. An input textual query passes through segment generation, segment expansion, derived word validation, HMM model construction and the Viterbi algorithm run, producing a ranked list of rewritten queries. WordNet and the RDF knowledge base serve as external resources.]

2 RQUERY Overview

RQUERY obtains a textual query Q as input; as output it provides a ranked list of query rewrites, each of which can be considered as an alternative to the original input.
Figure 1 illustrates the high-level overview of RQUERY, which comprises six main modules. Each module runs sequentially after the previous one and consumes its output. Furthermore, RQUERY relies on external sources, i.e., a linguistic thesaurus (e.g. WordNet [5]) and an RDF knowledge base along with its associated ontology (e.g. DBpedia [9]). In the following, we describe the main objective of each module.

Segment Generation. An input textual query Q is initially preprocessed (i.e. by applying tokenization and stop word removal). Then, the remaining keywords are considered as an n-tuple of keywords $q = (k_1, k_2, ..., k_n)$. Each contiguous subsequence of this tuple is called a segment.

Definition 1 (Segment). For a given n-tuple of keywords $q = (k_1, k_2, ..., k_n)$, the segment $s_{(i,j)}$ is the sequence of keywords from start position i to end position j, i.e., $s_{(i,j)} = (k_i, k_{i+1}, ..., k_j)$.

We generate all possible segments which can be derived from the given n-tuple of keywords $q = (k_1, k_2, ..., k_n)$. Since the number of keywords is low (in most queries less than six⁴), generating this set is not computationally expensive. This set is represented as $S = \{s_{(i,j)} \mid 1 \le i \le j \le n\}$.

⁴ http://www.keyworddiscovery.com/keyword-stats.html?date=2012-08-01

Segment Expansion. This module expands the segments derived by the previous module using WordNet as linguistic thesaurus. The linguistic features of WordNet which are employed in RQUERY are:
– Synonyms: words having the same or a very similar meaning to the input word.
– Hypernyms: words representing a generalization of the input word.

An observation from our previous research [14] revealed that the hyponym⁵ relationship leads to deriving a large number of expansion terms whereas their contribution to the vocabulary mismatch task is low. Thus, to prevent a negative influence on efficiency, we do not take the hyponym relationship into account. We employ both the original segment $s$ and its lemma $s'$ for expansion⁶.
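The segment generation step of Definition 1 can be illustrated with a minimal Python sketch. The stop word list and the whitespace tokenizer below are assumptions made for illustration only; the paper does not specify which preprocessing components RQUERY uses.

```python
# Hypothetical stop word list; the list actually used by RQUERY is not given.
STOP_WORDS = {"of", "the", "in", "a", "an", "is", "by", "with"}


def preprocess(query):
    """Tokenise on whitespace, lowercase, and drop stop words."""
    tokens = query.lower().replace("?", " ").split()
    return [t for t in tokens if t not in STOP_WORDS]


def generate_segments(keywords):
    """Return every segment s(i,j) with 1 <= i <= j <= n as a tuple of keywords."""
    n = len(keywords)
    return [tuple(keywords[i:j + 1]) for i in range(n) for j in range(i, n)]


keywords = preprocess("profession of bandleader")
segments = generate_segments(keywords)
print(segments)
```

For n keywords this enumeration yields n(n+1)/2 segments, which stays small for the short queries considered here.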
In other words, if the original segment $s$ differs from its lemma $s'$, we first derive words for the original segment and then derive words for its lemma. The expansion set is formally defined as follows.

Definition 2 (Segment Expansion Set). For a given segment $s$, we define the associated expansion set, denoted by $ES_s$, as the union of all the following words:
1. Original words: the given segment $s$ extracted from the n-tuple of keywords $q$.
2. Lemmatized words: words determined as the lemma of the given segment $s$.
3. Linguistic words: words derived via applying linguistic features over the given segment $s$ as well as its lemma $s'$.

Derived Word Validation. For every given segment $s \in S$ we construct its associated expansion set $ES_s$. Then we form the set of all derived words as the union of all available expansion sets, $W = \bigcup_{s \in S} ES_s$. Subsequently, we validate each word $w \in W$ against the existing vocabulary in the underlying RDF knowledge base (KB). More precisely, we check the occurrence of each word $w$ in the KB by sub-string matching in the literal position of all available triples in the RDF knowledge base. If no occurrence is observed, we exclude that word from our derived word set. After the validation phase, the remaining words in the derived word set are used for the following operations.

Hidden Markov Model Construct. We aim at distinguishing a group of derived words which respects the intention of the input query Q. In other words, this group of words conveys the same meaning as the input query, but may contain words which differ from the input query words. If the grouped words do not suffer from the vocabulary mismatch problem, substituting them for the input query resolves the vocabulary mismatch problem. The substitution of the input query is called a query rewrite, which is defined as follows:

Definition 3 (Query Rewrite).
For a given n-tuple query $q = (k_1, k_2, ..., k_n)$, a query rewrite is an m-tuple of keywords $q_r = (k'_1, k'_2, ..., k'_m)$ where each $k'_i$ either matches a keyword $k_j$ in $q$ or was linguistically derived from either a keyword $k_x$ or a segment $s_{(x,y)}$ of the input query.

We address the problem of finding the appropriate query rewriting by employing a Hidden Markov Model (HMM). In Section 3, we elaborate on all aspects of constructing the model as well as bootstrapping the parameters. Here, we succinctly present an overview. We construct a hidden Markov model in three steps:
1. The state space is populated.
2. Transitions between states are established.
3. Parameters are bootstrapped.

⁵ Words representing a specialization of the input word.
⁶ For brevity, we skip the indexes of segment symbols which represent start and end positions.

[Figure 2: The constructed state space for the query "profession of bandleader". A start node leads to states holding the derived words job, occupation, profession, line, business, vacation, band leader, bandleader, director, music director and conductor; the two observations are the keywords profession (Observation 1) and bandleader (Observation 2).]

Assume the input query is “profession of bandleader” and, according to the benchmark, it should be rewritten as “occupation of bandleader”. The state space is populated with all validated words derived from this query. Then, all the transitions between states are recognised and established. Figure 2 illustrates this model. Each ellipse represents a state containing a derived word. The dashed arrows originating from the states and pointing to the keywords determine the emitted keyword of each state. Transitions between states are represented via black arrows. Arrows originating from the start point indicate states from which the first input keyword is observable.

Viterbi Algorithm Run. The Viterbi algorithm or Viterbi path [18] is a dynamic programming approach for finding the optimal path through an HMM for a given input query.
It discovers the most likely sequence of states through which the sequence of input keywords is observable.

3 Rewriting Queries using HMM

In this section, we describe how we use a Hidden Markov Model (HMM) for rewriting the input query. First, we introduce the notation of the HMM parameters, the construction of the state space and the transitions between states; then we detail how we bootstrap the parameters of our HMM. Formally, a Hidden Markov Model (HMM) is a quintuple $\lambda = (X, Y, A, B, \pi)$ where:
– $X$ is a finite set of states. In our case, $X$ equals the set of the validated derived words $W$.
– $Y$ denotes the set of observations. Herein, $Y$ equals the set of all segments $Seg$ derived from the input n-tuple of keywords $q$.
– $A : X \times X \to [0, 1]$ is the transition matrix. Each entry $a_{ij}$ is the transition probability $Pr(S_j \mid S_i)$ from state $i$ to state $j$.
– $B : X \times Y \to [0, 1]$ represents the emission matrix. Each entry $b_{ih} = Pr(h \mid S_i)$ is the probability of emitting the symbol $h$ from state $i$.
– $\pi : X \to [0, 1]$ denotes the initial probability of states.

3.1 State Space

A priori, the state space is populated with as many states as there are words in the literal positions of the triples available in the underlying RDF knowledge base. The number of states is thus potentially large. To reduce the number of states, we exclude irrelevant states based on the following observation: a relevant state is a state whose associated word is (1) equal to a segment of the input query, (2) equal to a lemma of a segment of the input query, or (3) linguistically derived from a segment of the input query. Thus, we limit the state space $X$ to the set of validated derived words $W$. From each state, the observable keyword (i.e., the emitted string) is the segment $s$ from which the associated word $w$ of the state is derived. For instance, the word job is derived from the segment profession, so the keyword profession is emitted from the state associated with the word job.
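The restriction of the state space described in Section 3.1 can be sketched in a few lines of Python. The thesaurus dictionary below is a hypothetical stand-in for WordNet, and the list of literals is a hypothetical stand-in for the labels found in the RDF knowledge base; neither reproduces the actual resources used by RQUERY.

```python
# Hypothetical thesaurus entries (synonyms and hypernyms), standing in for WordNet.
THESAURUS = {
    "profession": ["occupation", "job", "business", "line"],
    "bandleader": ["director", "music director", "conductor"],
}
LEMMAS = {}  # segment -> lemma; segments missing here are their own lemma

# Hypothetical literals, standing in for literal values in the knowledge base.
KB_LITERALS = ["occupation", "profession", "job title",
               "bandleader", "music director", "conductor"]


def expansion_set(segment):
    """Definition 2: the segment, its lemma, and words derived from both."""
    lemma = LEMMAS.get(segment, segment)
    words = {segment, lemma}
    words.update(THESAURUS.get(segment, ()))
    words.update(THESAURUS.get(lemma, ()))
    return words


def build_state_space(segments):
    """Map each validated derived word to the segments it emits."""
    emits = {}
    for seg in segments:
        for word in expansion_set(seg):
            # Validation by sub-string matching against the literals.
            if any(word in literal for literal in KB_LITERALS):
                emits.setdefault(word, set()).add(seg)
    return emits


states = build_state_space(["profession", "bandleader", "profession bandleader"])
print(sorted(states))
```

In this toy setting the words business and line are dropped during validation because no literal contains them, while the state for job remains and emits the segment profession.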
3.2 Transitions between States

We define transitions between states based on the concept of co-occurrence of words. We adapt the concept of co-occurrence of words from the traditional information retrieval context to the context of RDF knowledge bases. Triple-based co-occurrence means co-occurrence of words in literals found in the resource descriptions of two resources of a given triple:
1. Two words w1 and w2 co-occur in literal values of the property rdfs:label of resources placed in the
– subject as well as predicate of a given triple (subject-predicate co-occurrence).
– subject as well as object of a given triple (subject-object co-occurrence).
– predicate as well as object of a given triple (predicate-object co-occurrence).
2. Two words w1 and w2 co-occur in the literal of a given triple as well as in the rdfs:label of the resource placed in the
– subject of that triple (subject-literal co-occurrence).
– predicate of that triple (predicate-literal co-occurrence).

Figure 3 illustrates the five graph patterns which we employed to check the co-occurrence of two words w1 and w2. In the following, we present the formal definition along with a sample SPARQL query recognizing co-occurrence in the subject-predicate position.

Definition 4 (Triple-based Co-occurrence). In a given triple $t = (s, p, o)$, two words w1 and w2 are co-occurring if they appear in the labels (rdfs:label) of at least two resources (i.e., (s, p), (s, o) or (o, p)).

The following SPARQL query checks the co-occurrence of w1 and w2 in the subject-predicate position.

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    ASK WHERE {
      { ?s ?p ?o .
        ?s rdfs:label ?sl .
        ?p rdfs:label ?pl .
        FILTER(regex(?sl, "word1") && regex(?pl, "word2")) }
      UNION
      { ?s ?p ?o .
        ?s rdfs:label ?sl .
        ?p rdfs:label ?pl .
        FILTER(regex(?sl, "word2") && regex(?pl, "word1")) }
      FILTER(langMatches(lang(?sl), "en") && langMatches(lang(?pl), "en"))
    }

[Figure 3: The graph patterns employed for recognising co-occurrence of the two given words w1 and w2: (a) subject-predicate, (b) subject-object, (c) subject-literal, (d) predicate-object, (e) predicate-literal. Please note that the letters s, p, o, c, l, a respectively stand for subject, predicate, object, class, rdfs:label, rdf:class.]

3.3 Bootstrapping Parameters

Commonly, supervised learning is employed for estimating the Hidden Markov Model parameters. An important consideration here is that we deal with dynamic modelling, meaning that the state space as well as the issued observations (i.e., the sequence of input keywords) vary from query to query. Thus, the learnt probability values should be generic and not query-dependent, because learning model probabilities for each individual query is not feasible. Instead, we rely on bootstrapping, a technique used to estimate an unknown probability distribution function. We apply three distributions (i.e., normal, uniform and zipfian) to find out the most appropriate distribution. For bootstrapping the model parameters $A$ and $\pi$, we take into account the co-occurrence between words as well as the frequency of words in the RDF knowledge base. Word frequency is generally defined as the number of times a word appears in the knowledge base. Herein, we adopt an implicit word frequency which considers the type of the resource in whose rdfs:label literal the given word appears. In our previous research [13], we observed that the connectivity degree of resources of type class and property is generally higher than that of instance resources. In DBpedia, for example, classes have an average connectivity degree of 14,022, while properties have on average 1,243 and instances 37. We assign a static word frequency, denoted by $wf$, based on the position in which the word appears. In other words, if a given word $w$ appears in the label of a class, it obtains a higher word frequency value. Equation 1 specifies the word frequency values according to their appearance position.
With respect to our underlying knowledge base (i.e., DBpedia), we assign the logarithm of the average connectivity degree (as mentioned above) to each $\alpha$; for instance, $\alpha_2$ approximates $\log(1243) \sim 3$.

$$wf = \begin{cases} \alpha_1 & w \in ClassLabels \\ \alpha_2 & w \in PropertyLabels \\ \alpha_3 & w \in InstanceLiterals \end{cases} \qquad (1)$$

We transform the parameters, i.e., word co-occurrence as well as word frequency, into hub and authority values computed by the HITS algorithm. Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that was originally developed for ranking Web pages [8]. It assigns a hub and an authority value to each Web page. The hub value estimates the value of links to other pages and the authority value estimates the value of the content on a page. Hub and authority values are mutually interdependent and computed in a series of iterations. In each iteration, the authority value is updated to the sum of the hub scores of each referring page, and the hub value is updated to the sum of the authority scores of each page it links to. After each iteration, hub and authority values are normalized. This normalization process causes these values to converge eventually.

Since RDF data is a graph of linked entities, we employ a weighted version of the HITS algorithm in order to implicitly take co-occurrence as well as frequency of words into account. The weight $wf_i$ is the frequency of the word associated with the state $S_i$. Authority and hub values are computed as follows:

$$auth(S_j) = \sum_{S_i} wf_i \cdot hub(S_i)$$

$$hub(S_j) = \sum_{S_i} wf_i \cdot auth(S_i)$$

Transition Probability. The transition probability of state $S_j$ following state $S_i$ is denoted as $a_{ij} = Pr(S_j \mid S_i)$. Note that the condition $\sum_{S_i} Pr(S_j \mid S_i) = 1$ holds. The transition probability from the state $S_i$ to the state $S_j$ is computed by:

$$a_{ij} = Pr(S_j \mid S_i) = \frac{auth(S_j)}{\sum_{\forall a_{ik} > 0} auth(S_k)} \times hub(S_i)$$

Here, the probabilities from state $S_i$ to the neighbouring states are uniformly distributed based on the authority values.
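The weighted HITS computation and the transition formula above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the paper: the edge list is hypothetical, the logarithm is taken to base 10 (consistent with log(1243) being roughly 3), the assignment of words to label types is invented for the example, and scores are normalised with the Euclidean norm after each iteration.

```python
import math

# Assumed base-10 logarithm of the average connectivity degrees reported for DBpedia.
ALPHA_CLASS = math.log10(14022)
ALPHA_PROPERTY = math.log10(1243)
ALPHA_INSTANCE = math.log10(37)

# Hypothetical word frequencies wf and co-occurrence edges between states.
WF = {"profession": ALPHA_INSTANCE, "occupation": ALPHA_PROPERTY, "job": ALPHA_INSTANCE,
      "bandleader": ALPHA_INSTANCE, "director": ALPHA_PROPERTY}
EDGES = [("profession", "bandleader"), ("profession", "director"),
         ("occupation", "bandleader"), ("occupation", "director"),
         ("job", "director")]


def _normalise(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def weighted_hits(edges, wf, iterations=50):
    """Iterate the weighted authority and hub updates, normalising each round."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:          # authority: weighted hubs of referring states
            new_auth[dst] += wf[src] * hub[src]
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:          # hub: weighted authorities of referred states
            new_hub[src] += wf[dst] * new_auth[dst]
        auth, hub = _normalise(new_auth), _normalise(new_hub)
    return hub, auth


def transition_probabilities(edges, hub, auth):
    """a_ij = auth(S_j) / sum of auth over successors of S_i, times hub(S_i)."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    a = {}
    for src, dsts in successors.items():
        total = sum(auth[d] for d in dsts)
        for d in dsts:
            a[(src, d)] = auth[d] / total * hub[src] if total else 0.0
    return a


hub, auth = weighted_hits(EDGES, WF)
A = transition_probabilities(EDGES, hub, auth)
```

With the formula as given, the outgoing values of a state sum to its hub value, so states with larger hub scores distribute more probability mass over their successors.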
Consequently, states with higher authority values are more likely to be visited.

Initial Probability. The initial probability $\pi(S_i)$ is the probability that the model assigns to the initial state $S_i$ at the beginning. The initial probabilities fulfill the condition $\sum_{\forall S_i} \pi(S_i) = 1$. We denote the states for which the first keyword is observable by InitialStates. The initial probabilities are defined as follows:

$$\pi(S_i) = \frac{hub(S_i)}{\sum_{\forall S_j \in InitialStates} hub(S_j)}$$

In fact, $\pi(S_i)$ of an initial state is uniformly distributed on hub values.

Emission Probability. The probability of emitting a given segment $seg$ from the state $S_i$ depends on the linguistic relation between the state-associated word $w_{S_i}$ and the given segment $seg$. This probability takes the value $\theta$ if the state-associated word $w_{S_i}$ equals either $seg$ or its lemma $seg'$, and the value $\eta$ if the state-associated word $w_{S_i}$ is linguistically derived from $seg$. This assignment is denoted below.

$$b_{ik} = Pr(seg \mid S_i) = \begin{cases} \theta & w_{S_i} = seg \lor w_{S_i} = seg' \\ \eta & w_{S_i} \in ES_{seg} - \{seg, seg'\} \\ 0 & w_{S_i} \notin ES_{seg} \end{cases} \qquad (2)$$

Intuitively, $\theta$ should be larger than $\eta$. A statistical analysis of our query corpus confirms this assumption. Accordingly, around 82% of the words do not have a vocabulary mismatch problem. Hence, taking either the original words or the lemmatised words into account suffices to a large extent. Only around 12% of the words have a vocabulary mismatch problem. However, we cannot rely solely on these statistics. We consider the difference $\gamma$ between $\theta$ and $\eta$ and perform a parameter sensitivity evaluation on $\gamma$.

3.4 Viterbi Algorithm

The Viterbi algorithm is a dynamic programming approach for finding the optimal path through an HMM for a given input. It discovers the most likely sequence of hidden states through which the input query is observable. The most likely path has the maximum joint emission and transition probability of the involved states. Each subpath of this path also has the maximum probability for the corresponding sub-sequence of input keywords.
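To make the interplay of Equation 2 and the Viterbi recursion concrete, the following Python sketch runs the standard Viterbi algorithm (single best path only) on a toy model for the query "profession of bandleader". All numbers are hypothetical: the values of $\theta$ and $\eta$ are chosen only so that $\gamma = \theta - \eta = 0.9$, and the initial and transition probabilities are invented. Each observation is assumed to be a single-keyword segment.

```python
THETA, ETA = 0.95, 0.05  # hypothetical values with gamma = THETA - ETA = 0.9

# Hypothetical expansion sets and lemmas for the two observed segments.
EXPANSION = {"profession": {"profession", "occupation", "job"},
             "bandleader": {"bandleader", "director"}}
LEMMA = {}  # segment -> lemma; segments missing here are their own lemma

STATES = ["profession", "occupation", "job", "bandleader", "director"]
START_P = {"profession": 0.3, "occupation": 0.5, "job": 0.2}
TRANS_P = {("profession", "bandleader"): 0.2, ("profession", "director"): 0.3,
           ("occupation", "bandleader"): 0.6, ("occupation", "director"): 0.4,
           ("job", "director"): 1.0}


def emission(word, seg):
    """Equation 2: theta for the segment or its lemma, eta for derived words, else 0."""
    if word in (seg, LEMMA.get(seg, seg)):
        return THETA
    if word in EXPANSION.get(seg, set()):
        return ETA
    return 0.0


def viterbi(observations, states, start_p, trans_p):
    """Return the most likely state sequence and its joint probability."""
    # best[t][s] = (probability of the best path ending in s at step t, predecessor)
    best = [{s: (start_p.get(s, 0.0) * emission(s, observations[0]), None)
             for s in states}]
    for t in range(1, len(observations)):
        layer = {}
        for s in states:
            e = emission(s, observations[t])
            layer[s] = max(((best[t - 1][r][0] * trans_p.get((r, s), 0.0) * e, r)
                            for r in states), key=lambda c: c[0])
        best.append(layer)
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(observations) - 1, 0, -1):  # follow the back-pointers
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, prob


path, prob = viterbi(["profession", "bandleader"], STATES, START_P, TRANS_P)
print(path, prob)
```

With these toy numbers the original keywords win because the large $\gamma$ strongly favours states that match the input; a rewrite such as occupation bandleader only ranks higher when its initial and transition probabilities outweigh that emission penalty.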
The common version of this algorithm tracks only the most likely path. We extended it so that it detects and lists all possible paths through which the given input query is observable.

3.5 Ranking Mechanisms

A hidden Markov model is prone to detecting paths with the same probability values. In other words, two paths might tie for a place in the corresponding ranking. We adopt two ranking mechanisms (i.e. dense ranking and modified competition ranking)⁷. Although both of them assign the same ranking number to items which obtained equal probability values, the latter leaves a gap in the ranking numbers. They are defined as follows:
1. Dense ranking assigns the same rank number to paths with equal probability values and does not leave any gap either after or before items with the same values. In fact, the next item(s) are assigned the following ranking number. Dense ranking is denoted by $R$. For instance, assume the output is four paths labeled A, B, C, D. The highest probability value belongs to A; B and C obtain equal probability values, and D acquires the lowest probability value. Dense ranking places A 1st, both B and C 2nd, and D 3rd.
2. Modified competition ranking places the gap before the items with equal probability values. This gap covers a range that is one less than the number of items with equal probability values. Modified competition ranking is denoted by $R'$. With respect to the example above, applying modified competition ranking results in placing A 1st, then one gap is left; subsequently both B and C are placed 3rd, and ultimately D is placed 4th.

⁷ http://www.merriam-webster.com/dictionary/ranking

Example 1. Let us continue with the given query “profession of bandleader”.
The columns of Table 1 respectively show a subset of (1) the segments, (2) the states of the state space which are associated with derived words, and (3) the semantic relation between the derived words and the segments. After running the Viterbi algorithm, the generated top-6 outputs are as follows:

R | R′ | Probability | Query Rewrite
1 | 1 | 0.0327 | profession bandleader
2 | 2 | 0.0138 | profession director
3 | 3 | 0.0036 | profession conductor
4 | 5 | 0.00327 | profession music director
4 | 5 | 0.00327 | occupation bandleader
5 | 6 | 0.00138 | occupation director

Table 1: A subset of segments, state space, deriving relation type for the given query “profession of bandleader”.
Segment and State: profession, professing, profession, business, job, occupation, director, bandleader, music director, director, profession bandleader, profession bandleader
Relation Type: original keyword, synonym, hypernym, hypernym, hypernym, original keyword, hypernym, hypernym, original keyword

4 Evaluation

Expansion and rewriting methods are prone to yielding a large number of irrelevant words, which negatively influences runtime as well as accuracy. For instance, in [14] we showed that for short queries, the number of derived words is significantly high. Thus, the goal of our evaluation is to investigate positive as well as negative impacts by raising the following two questions: (1) How effective is the approach in addressing the vocabulary mismatch problem, using queries that have a vocabulary mismatch problem? (2) How effective is the approach in avoiding noise, using queries which do not have a vocabulary mismatch problem? We employ the Mean Reciprocal Rank $MRR$ as our metric for measuring the accuracy⁸ of the outputs. This metric takes the order of results into account. The mean reciprocal rank is defined as the average of the reciprocal ranks of the outputs for a set of queries and calculated as:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$
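The two ranking mechanisms of Section 3.5 and the mean reciprocal rank can be written down compactly. The sketch below is illustrative only; it compares probabilities with exact equality, which is an assumption that suits the rounded values of Example 1 but may need a tolerance for raw floating-point scores.

```python
def dense_ranks(probs):
    """Equal probabilities share a rank; the next distinct value gets the next integer."""
    distinct = sorted(set(probs), reverse=True)
    rank_of = {p: i + 1 for i, p in enumerate(distinct)}
    return [rank_of[p] for p in probs]


def modified_competition_ranks(probs):
    """Tied items share the lowest place of their group, so the gap precedes them."""
    return [sum(1 for q in probs if q >= p) for p in probs]


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over the rank of the correct output for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Probabilities of the top-6 outputs listed in Example 1.
example_probs = [0.0327, 0.0138, 0.0036, 0.00327, 0.00327, 0.00138]
print(dense_ranks(example_probs))
print(modified_competition_ranks(example_probs))
```

Applied to the A, B, C, D example of Section 3.5 (with any probabilities where B and C tie), the functions return the ranks 1, 2, 2, 3 and 1, 3, 3, 4 respectively.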
Mean reciprocal rank is computed on the dense ranking as well as the modified competition ranking; the former is denoted by $MRR$ and the latter by $MRR'$. The QALD series benchmarks⁹ are the only available benchmarks tailored for question answering on Linked Data. In these benchmarks, every textual query is associated with an individual SPARQL query. Generally, both the textual and the SPARQL queries vary in complexity; thus, the transformation of textual queries to SPARQL queries poses various challenges. The vocabulary mismatch problem is one of the important challenges. In [14] we extracted a dataset from the QALD series benchmarks. Herein, we reuse and extend that benchmark. Half of the queries in this dataset have a vocabulary mismatch problem and half do not. Our experimental study is divided into two parts. First, we perform an evaluation of the bootstrapping parameters of the Hidden Markov Model. We use 10 initial queries from the benchmark for this purpose. Second, we evaluate the overall accuracy using the remaining queries.

4.1 Evaluation of Bootstrapping

We perform an experimental study to discover the optimum setting for the emission, initial and transition probabilities. These parameters interact and jointly influence the output. Thus, we run a multi-dimensional experiment. The main dimensions as well as goals of this experiment are as follows:
– To discover a suitable distribution function for the transition and initial probabilities. In this respect, we apply and compare the uniform distribution versus the normal distribution.
– To find the optimum value for $\gamma$ as the difference between $\theta$ and $\eta$ in the emission probability. We ran a parameter sensitivity evaluation over $\gamma$ ranging from 0 to 0.9.
– To measure the effectiveness of the HITS algorithm.
In this respect, we ran the distribution functions separately with two input random variables X and Y, defined respectively as the word frequency and the sum of the hub and the authority of the sink state¹⁰.

⁸ Since all our computations are carried out on the fly, runtime is not the subject of our evaluation in this work.
⁹ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task1&q=3n for n = 1, 2, 3, 4, 5.

ID | Query | Mismatch word | Match word
Q1 | profession bandleader | profession | occupation
Q2 | Barak Obama wife | wife | spouse
Q3 | Is Natalie Portman an actress? | actress | actor
Q4 | Lawrence of Arabia conflict | conflict | battle
Q5 | children of Benjamin Franklin | children | child
Q6 | movies with Tom Cruise | movie | film
Q7 | husband of Amanda Palmer | husband | spouse
Q8 | altitude of Everest | altitude | elevation
Q9 | companies in California | company | organisation
Q10 | writer of Wikipedia | writer | author
Q11 | soccer clubs in Spain | - | -
Q12 | employees of Google | - | -
Q13 | successor of John F. Kennedy | - | -
Q14 | nicknames of San Francisco | - | -
Q15 | Statue of Liberty built | - | -
Q16 | capital of Canada | - | -
Q17 | companies in Munich | - | -
Q18 | governor of Texas | - | -
Q19 | official languages of the Philippines | - | -
Q20 | Who founded Intel? | - | -

Table 2: Queries of the training dataset.

Table 2 and Figure 5 show the queries of the training dataset. Half of the queries have a vocabulary mismatch problem and the other half do not. Figure 4 represents the $MRR$ and $MRR'$ achieved in the bootstrapping experiment over different settings. The substantial findings of this experiment can be summarised as follows:
– The uniform distribution significantly outperforms the normal distribution in the majority of settings.
– Increasing the value of $\gamma$ consistently results in a more precise ranking. The optimum value is $\gamma = 0.9$. This experimental finding confirms our previous observation that around 80% of the keywords of queries do not have a vocabulary mismatch problem.
– While employing the input variable Y has a more substantial positive impact for queries which do not have a vocabulary mismatch problem, its effect is slightly smaller than that of the input variable X for queries having a vocabulary mismatch problem. On average, the input variable Y has the higher impact.

¹⁰ Note that for a given edge the source state is the one from which the edge originates and the sink state is the one where the edge ends.

[Figure 4: Monitoring $MRR$ and $MRR'$ for various settings on the training dataset. Four panels (uniform distribution on X, uniform distribution on Y, normal distribution on X, normal distribution on Y) plot the mean reciprocal rank (0 to 1) against gamma (0 to 0.9), each with the series MRR: Q1-Q10, MRR: Q10-Q20, MRR′: Q1-Q10 and MRR′: Q10-Q20.]

[Figure 5: $MRR$ and $MRR'$ of uniform and normal distributions for $\gamma = 0.9$. Bars for the normal and uniform distributions on X and on Y are grouped by MRR: Q1-Q10, MRR: Q10-Q20, MRR′: Q1-Q10 and MRR′: Q10-Q20 on a scale from 0 to 1.]

4.2 Evaluation

In this part, we show the result of the experiment over the test queries, taking the best learnt setting into account. Table 3 represents $MRR$ and $MRR'$ for the queries of the test dataset. In four cases (Q1, Q2, Q3 and Q4), RQUERY failed. The failure occurs because the match word cannot be derived through lexical or inflectional relations. For instance, the keyword ‘extinct’ should be matched to ‘conservation status’, which is not possible with the current relations. For the rest of the queries, no failure was observed.

ID | Query | Mismatch word | Match word | RR | RR′
Q1 | actors casted in film | cast | starring | |
Q2 | animals that are extinct | extinct | conservation status | |
Q3 | Who owns Aldi? | owns | key person | |
Q4 | city inhabitants | inhabitants | population | |
Q5 | television shows were created by Walt Disney | created | creator | 1 | 1
Q6 | cars produced in germany | car | automobile | 0.11 | 0.11
Q7 | launch pads operated by NASA | operated | operator | 1 | 1
Q8 | official languages of Countries | countries | country | 1 | 1
Q9 | Who developed the video game World of Warcraft? | developed | developer | 1 | 1
Q10 | Greek goddesses dwell | dwell | abide | 0.125 | 0.125
Q11 | Abraham Lincoln’s death place | - | - | 1 | 1
Q12 | area code of Berlin | - | - | 1 | 1
Q13 | Is proinsulin a protein? | - | - | 1 | 1
Q14 | owner of universal studios | - | - | 1 | 1
Q15 | currency of the Czech Republic | - | - | 1 | 1
Q16 | spoken language in Estonia | - | - | 0.33 | 0.33
Q17 | Tim Burton ’s films | - | - | 1 | 1
Q18 | breeds of the German Shepherd dog | - | - | 0.5 | 0.5
Q19 | ingredients of carrot cake | - | - | 1 | 1
Q20 | largest city in Canada | - | - | 1 | 1

Table 3: $MRR$ and $MRR'$ on queries of the test dataset.

5 Related Work

Automatic Query Expansion (AQE) has been a focus of researchers for a long time. It aims at improving retrieval efficiency. Here we divide the available works into two parts. The first part discusses a range of approaches employed for query expansion on the Web of Documents. The second part presents the state of the art of query expansion on the Semantic Web.

Query Expansion on Web of Documents. Various approaches differ in the choice of data sources and features. A data source is a collection of words or documents. The choice of the data source influences the size of the vocabulary (expansion terms) as well as the available features. Furthermore, a data source is often restricted to a certain domain and thus constitutes a certain context for the search terms. Common choices for data sources are text corpora, WordNet synsets, hyponyms and hypernyms, anchor texts, search engine query logs or top ranked documents.
A popular way of choosing the data source is Pseudo Relevance Feedback (PRF) based query expansion. In this method, the top-n retrieved documents are assumed to be relevant and are employed as the data source. One of the early works, [10], expands queries based on word similarity using the cosine coefficient. [16, 17] give the theoretical basis for detecting similarity based on co-occurrence. Feature selection [11] consists of two parts: (1) feature weighting assigns a score to each feature and (2) the feature selection criterion determines which of the features to retain based on the weights. Some common feature selection methods are mutual information, information gain, divergence from randomness and relevance models. A framework for feature weighting and selection methods is presented in [11]. The authors compare different feature ranking schemes and show that SVM achieves the highest F1-score of the examined methods. [4] presents a comprehensive survey of AQE in information retrieval and details a large number of candidate feature extraction methods.

Query Expansion on Semantic Web

Since the Semantic Web publishes structured data, expansion methods can be enhanced by taking the structure of the data into account. A very important choice is defining and employing new features. In [14] we presented semantic features which can perform as well as linguistic features. A similar approach is [2], which relies on supervised learning and uses only semantic expansion features instead of a combination of both semantic and linguistic ones. There is also an approach for mining equivalent relations from Linked Data [20] that relies on three measures of equivalency: triple overlap, subject agreement and cardinality ratio. However, while AQE is prevalent in traditional search engines, the existing semantic search engines either do not address the vocabulary mismatch problem or address it in a naive way.
SemSeK [1], SINA [15] and QAKiS [3] are semantic search engines which still do not apply query expansion. Alexandria [19] uses Freebase to include synonyms and different surface forms. MHE¹¹ combines query expansion and entity recognition by using textual references to a concept and extracting Wikipedia anchor texts of links. This approach takes advantage of a large number of hand-made mappings that emerge as a byproduct. However, it is only viable for Wikipedia-DBpedia or other text corpora whose links are mapped to resources. Eager [7] expands a set of resources with resources of the same type using DBpedia and Wikipedia categories (instead of the linguistic and semantic features used in our case). Eager extracts implicit category and synonym information from abstracts and redirect information, determines additional categories from the DBpedia category hierarchy and then extracts additional resources which have the same categories in common. PowerAqua [12] is an ontology-based system that answers natural language queries and uses WordNet synonyms and hypernyms as well as resources related through the owl:sameAs property.

6 Conclusion and Future Work

In this paper, we presented a method for automatic query rewriting. The proposed approach exploits the structure of the data to recognise the best query rewrites. It employs a Hidden Markov Model whose transitions between states are defined based on the concept of triple-based co-occurrence. An experimental study was performed on a training dataset in order to detect the optimum setting for the model parameters. While the result of the evaluation shows the feasibility as well as the high accuracy of the approach, we still need to take more semantic relations into account for tackling the vocabulary mismatch problem. Furthermore, since all computations are carried out on the fly, in the future we are going to construct a semantic index on the co-occurrence of words in order to speed up retrieval requests.
Another extension is to enlarge our benchmark by including more queries as well as more datasets.

11 http://ups.savba.sk/~marek

References

1. N. Aggarwal and P. Buitelaar. A system description of natural language query over DBpedia. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 97–100, 2012.
2. I. Augenstein, A. L. Gentile, B. Norton, Z. Zhang, and F. Ciravegna. Mapping keywords to linked data resources for automatic query expansion. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, Lecture Notes in Computer Science. Springer, 2013.
3. E. Cabrio, A. P. Aprosio, J. Cojan, B. Magnini, F. Gandon, and A. Lavelli. QAKiS @ QALD-2. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 88–96, 2012.
4. C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, Jan. 2012.
5. C. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998.
6. J. E. L. Gayo, D. Kontokostas, and S. Auer. Multilingual linked data patterns. Semantic Web Journal, 2013.
7. O. Gunes, C. Schallhart, T. Furche, J. Lehmann, and A.-C. N. Ngomo. Eager: extending automatically gazetteers for entity recognition. In Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP, 2012.
8. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5), 1999.
9. J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
10. M. E. Lesk. Word-word associations in document retrieval systems. American Documentation, 20(1):27–38, 1969.
11. S. Li, R. Xia, C. Zong, and C.-R. Huang. A framework of feature selection methods for text categorization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL ’09, pages 692–700, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
12. V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: supporting users in querying and exploring the semantic web content. In Semantic Web Journal, page 17. IOS Press, March 2011.
13. S. Shekarpour, S. Auer, A.-C. Ngonga Ngomo, D. Gerber, S. Hellmann, and C. Stadler. Keyword-driven SPARQL query generation leveraging background knowledge. In International Conference on Web Intelligence, 2011.
14. S. Shekarpour, K. Höffner, J. Lehmann, and S. Auer. Keyword query expansion on linked data using linguistic and semantic features. In 7th IEEE International Conference on Semantic Computing, September 16-18, 2013, Irvine, California, USA, 2013.
15. S. Shekarpour, A.-C. N. Ngomo, and S. Auer. Question answering on interlinked data. In D. Schwabe, V. A. F. Almeida, H. Glaser, R. A. Baeza-Yates, and S. B. Moon, editors, WWW, pages 1145–1156. International World Wide Web Conferences Steering Committee / ACM, 2013.
16. C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1989.
17. C. J. van Rijsbergen. The geometry of information retrieval. Cambridge University Press, 2004.
18. A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2), 1967.
19. M. Wendt, M. Gerlach, and H. Duwiger. Linguistic modeling of linked open data for question answering. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 75–87, 2012.
20. Z. Zhang, A. L. Gentile, I. Augenstein, E. Blomqvist, and F. Ciravegna. Mining equivalent relations from linked data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, August 2013.