Automatically Finding Answers to "Why" and "How to"
Questions for Arabic Language
Ziad Salem, Jawad Sadek, Fairouz Chakkour, and Nadia Haskkour*
Aleppo University, Electrical and Electronic Engineering Faculty, Computer Engineering
Department, *Faculty of Arts and Humanities, Aleppo, Syria
dr_ziad@hotmail.co.uk, jawad10a@yahoo.com, feirouzch@yahoo.fr
Abstract. This paper addresses the task of extracting answers to why and how
to-questions from Arabic texts, a task that has not yet been addressed for the Arabic
language in the field of question answering (QA) systems. The system developed here uses one of the leading theories in computational linguistics,
Rhetorical Structure Theory (RST), and relies on cue phrases to determine both
the elementary units and the set of rhetorical relations relevant to the targeted questions. Our experiment was conducted on raw (automatically annotated) Arabic texts taken from Arabic websites and gave a good result
compared with an earlier study on why-question answering for the English language.
Keywords: Rhetorical Structure Theory, Natural Language Processing, Question
Answering for Arabic, why and how to questions, Discourse analysis.
1 Introduction
Day by day, the amount of information available on the internet is growing, and it
becomes more and more difficult to find answers on the WWW using standard search
engines. As a consequence, question answering (QA) systems will become increasingly
important. The main aim of QA systems is to give the user flexible access to
information by allowing him to write a question in natural language and by presenting a
short answer rather than a list of possibly relevant documents that contain the answer.
Arabic is the sixth most widely spoken language in the world [1], yet there are relatively few studies on improving Arabic information search and retrieval compared to
other languages, and this also holds for the QA task. A few researchers have built QA
systems oriented to the Arabic language. These systems focused on factoid questions such as who, what, where and when questions [2][3], in which named entity recognition can make a substantial contribution to identifying potential answers in a source
document, but none of those systems addressed why and how to-questions, for which different techniques are needed.
The research in the current paper aims at developing a system for answering why
and how to-questions for the Arabic language, including a proper evaluation method, as
a first attempt to address this type of question. The system uses RST, which has been
applied in a large number of computational linguistics applications.
R. Setchi et al. (Eds.): KES 2010, Part IV, LNAI 6279, pp. 586–593, 2010.
© Springer-Verlag Berlin Heidelberg 2010
2 Rhetorical Structure Theory
Rhetorical Structure Theory was developed at USC (University of Southern California)
by William Mann and Sandra Thompson. The aim was to find a theory of discourse
structure or function that provides enough detail to guide a computer program in generating texts. Based on their observation of edited text from a wide variety of
sources, Mann & Thompson made several assumptions about how written text functions and how it involves words, phrases and grammatical structure, summarized as
follows [4]:
● Organization: Texts consist of functionally significant parts.
● Unity and coherence: There must be a sense of unity to which every part contributes.
● Hierarchy: Elementary parts of a text are composed into larger parts, which in turn
are composed of yet larger parts, up to the scale of the text as a whole.
● Relation Composition: Relations hold between parts of a text. Every part
of a text has a role, a function to play, with respect to other parts of the text. A small
finite set of highly recurrent relations holding between pairs of parts of text is used
to link parts together to form larger parts. All rhetorical relations that can possibly
occur in a text can be categorized into a finite set of relation types.
● Asymmetry of Relations: RST establishes two different types of units. Nuclei are
the most important parts of a text, whereas satellites contribute to the nuclei and are
secondary. The most common type of text structuring relation is an asymmetric
class called nucleus-satellite relations. The nucleus is considered to be the basic information and is more essential to the writer's purpose than the satellite. The satellite
contains additional information about the nucleus and is often incomprehensible
without the nucleus, whereas a text from which the satellites have been deleted can
still be understood to a certain extent.
Table 1 illustrates some of the relations identified by Mann and Thompson.
Table 1. Some of the relations used in RST
Relation name | Nucleus | Satellite
Background | Text whose understanding is being facilitated | Text for facilitating understanding
Elaboration | Basic information | Additional information
Antithesis | Ideas favored by the author | Ideas disfavored by the author
Enablement | An action | Information intended to aid the reader in performing an action
Evaluation | A situation | An evaluative comment about the situation
Years of text analysis using RST have shown that RST is useful to capture the underlying structure of texts. Furthermore, RST has proven to be adequate in computational
implementations, in the automatic analysis of texts and in the generation of coherent
text [5].
3 Using Rhetorical Relations for Question Answering
Some types of rhetorical relations may be relevant to why and how to-questions and can help in finding answers to those questions. Let us consider the following two
examples, taken from Arabic websites, which clarify the method used to extract
answers:
3.1 Example 1
[قالت دراسة نشرت في صحيفة بريتش ميديكال إن الشاي الاسود الذي تم إعداده عند درجة حرارة تزيد عن 70 درجة مئوية يزيد من خطر الإصابة بالسرطان]1 [وإن ذلك يفسر ارتفاع الإصابة بسرطان المري بين بعض الشعوب الغير غربية.]2
[A study published in the British Medical Journal found that black tea prepared at a
temperature greater than 70 °C can raise the risk of cancer,]1 [and that this may be the
cause of the high rates of esophageal cancer among non-Western people.]2
In this example, unit 1 gives information about the cause of the problem presented
in unit 2, so we can say that an interpretation relation holds between the two units, as
illustrated in Fig. 1.
[Figure: RST tree with root span 1-2 joining unit 1 and unit 2 of the Arabic text.]
Fig. 1. The schema of the Arabic text in Example 1
Now in case of the following question:
{‫} ﻟﻤﺎذا ﺗﻌﺪ اﻹﺻﺎﺑﺔ ﺑﺴﺮﻃﺎن اﻟﻤﺮي ﻣﺮﺗﻔﻌﺔ ﺑﻴﻦ اﻟﺸﻌﻮب اﻟﻐﻴﺮ ﻏﺮﺑﻴﺔ ؟‬
{Why does esophageal cancer have high rates among non-Western people?}
We notice that the question corresponds to unit 2, so the other part of the relation,
unit 1, is the answer to the question.
3.2 Example 2
[يمكنك أن تصل إلى حالة من الاسترخاء العميق]1 [من خلال تنفس متعادل، يكون فيه كل شهيق وكل زفير متساويين في الطول ويساوي كل منهما الآخر في الطول، أغمض عينيك واستنشق وأنت ..........]2
[You can reach a state of deep relaxation]1[through equal breathing where each
inhalation and exhalation are long and of equal length. Close your eyes and inhale
while…...] 2
In this example too, we notice that unit 2 explains how to achieve the state mentioned in unit 1, so
we can say that an explanation relation holds between the two units, as illustrated in
Fig. 2.
Given the following question:
{كيف يمكن الوصول إلى الاسترخاء العميق؟}
{How can one reach a state of deep relaxation?}
The question corresponds to unit 1, so we can consider the other part of the relation,
unit 2, as the answer to the question.
[Figure: RST tree with root span 1-2 joining unit 1 and unit 2 of the Arabic text.]
Fig. 2. The schema of the Arabic text in Example 2
We analyzed Arabic texts in order to extract a set of rhetorical relations that can
lead to answers to why and how to-questions. Al-Sanie [6] identified eleven rhetorical
relations and applied them in an Arabic text summarization system. We chose four rhetorical relations from his work (Interpretation, Base, Result, Antithesis) and added
four other relations (Causal, Evidence, Explanation, Purpose) to obtain the set of relations, and the types of question each relation can answer, shown in Table 2.
To derive the text structure automatically, one first needs to determine the elementary units of a text and then find the rhetorical relations that hold between these
units. Marcu [7] relied on cue phrases to perform these two steps, as sufficiently
accurate indicators of the boundaries between elementary textual units and of the rhetorical relations that hold between them. We use the same method in the present work.
Cue phrases are words and phrases that are used by the writer as cohesive ties between adjacent
clauses and sentences, and they are crucial to the reader for understanding the text.
By analyzing an Arabic corpus and studying the way Arabic writers convey
their thoughts to the reader [8][9], we generated a set of cue phrases that signal each
relation shown in Table 2. For example, the relation Explanation can be hypothesized
on the basis of the occurrence of the cue phrases ("من خلال"، "عن طريق"، "بواسطة"، ...).
Similarly, ("وأشار"، "أكد"، "وقال"، ...) can signal an Evidence relation.
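The cue-phrase lookup described here can be sketched as two small tables. This is only an illustration, not the authors' implementation: the names `CUE_PHRASES`, `RELATION_QUESTION_TYPES`, `hypothesize_relations` and `relations_for` are hypothetical, only the cue phrases quoted in the text are listed, and the relation-to-question-type mapping assumes that the rows of Table 2 align in the order in which they are listed.

```python
# Hypothetical lookup tables; the paper quotes cue phrases for two relations only.
CUE_PHRASES = {
    "Explanation": ["من خلال", "عن طريق", "بواسطة"],
    "Evidence": ["وأشار", "أكد", "وقال"],
}

# Assumed reading of Table 2: relation name -> question types it can answer.
RELATION_QUESTION_TYPES = {
    "Interpretation": {"why", "how to"},
    "Causal": {"why"},
    "Result": {"why"},
    "Base": {"why"},
    "Antithesis": {"why", "how to"},
    "Purpose": {"why", "how to"},
    "Explanation": {"how to"},
    "Evidence": {"why"},
}


def hypothesize_relations(unit_text):
    """Return the relations whose cue phrases occur in the given text."""
    # Plain substring search is a simplification: a cue phrase could also
    # match inside a longer word, which a real system would have to rule out.
    return [
        relation
        for relation, cues in CUE_PHRASES.items()
        if any(cue in unit_text for cue in cues)
    ]


def relations_for(question_type):
    """Return the relation names relevant to a question type, sorted by name."""
    return sorted(
        relation
        for relation, types in RELATION_QUESTION_TYPES.items()
        if question_type in types
    )
```

For a how to-question, `relations_for("how to")` returns the four relations marked "how to" in Table 2, and a unit that contains one of the Explanation cue phrases is hypothesized to carry an Explanation relation.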
Table 2. The set of Arabic rhetorical relations used to answer Arabic why and how to-questions
English equivalent | Relation name (اسم العلاقة) | Question type | Question type in Arabic (نوع السؤال)
Interpretation | تفسير | Why - how to | لماذا – كيف
Causal | سببية | Why | لماذا
Result | نتيجة | Why | لماذا
Base | قاعدة | Why | لماذا
Antithesis | استدراك | Why - how to | لماذا – كيف
Purpose | غاية | Why - how to | لماذا – كيف
Explanation | شرح | How to | كيف
Evidence | اثبات | Why | لماذا
4 Textual Units and Question Processing
Before starting the answer retrieval task, we need to process and tokenize both the
question and the text in which the answer may be found. This involves performing
the following steps:
● Normalization: certain combinations of characters can be written in different ways
in the Arabic language. For instance, glyphs that combine HAMZA or MADDA
with ALEF (أ، إ، آ) are sometimes written as a plain ALEF (ا), and the letter TAA
MARBUTA (ة) is sometimes changed to HAA (ه) at the end of a word. This makes
some Arabic words difficult to recognize, so we have to normalize all orthographic variations.
● Stemming: Arabic, like all Semitic languages, is a highly inflected language with
a very complex morphology; a given headword can be found in a huge number of different forms. This abundance of forms results in a greater likelihood of mismatch
between the form of a word in a question and the forms found in text relevant to the
question. Stemming is therefore a basic step in this context, and many research
studies have attempted to develop Arabic stemmers. In our system we used Larkey's
light stemmer [10] when the word's category is noun, and Khoja's root-based stemmer
[11] for verbs, which is more efficient as proposed by Al-Shammari [12].
● Stop word removal: due to the absence of a standardized list of Arabic stop words,
we dropped 300 high-frequency common words, chosen on the basis of Arabic literature and excluding the cue phrase list. These words bring no benefit to the matching results, and removing them may save
space and speed up searching.
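A minimal sketch of the normalization and stop word steps is given below. It is not the authors' code: the direction of the TAA MARBUTA folding (towards HAA) is an assumption, the three-word `STOP_WORDS` set is a hypothetical stand-in for the 300-word list, and stemming is left out because it depends on the external stemmers cited above.

```python
# Assumed normalization: fold ALEF variants to plain ALEF and
# TAA MARBUTA to HAA, so that both spellings of a word compare equal.
NORMALIZATION_TABLE = str.maketrans({
    "\u0623": "\u0627",  # ALEF with HAMZA above -> ALEF
    "\u0625": "\u0627",  # ALEF with HAMZA below -> ALEF
    "\u0622": "\u0627",  # ALEF with MADDA above -> ALEF
    "\u0629": "\u0647",  # TAA MARBUTA -> HAA
})

# Hypothetical placeholder for the 300-word stop list used in the paper.
STOP_WORDS = {"في", "على", "هذا"}


def normalize(text):
    """Apply the orthographic normalization to a string."""
    return text.translate(NORMALIZATION_TABLE)


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, split on whitespace and drop stop words."""
    tokens = normalize(text).split()
    return [token for token in tokens if token not in stop_words]
```

The same `preprocess` function would be applied to both the question and each textual unit, so that the later matching step compares like with like. Cue phrase detection is assumed to run on the text before stop words are dropped.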
We compute the similarity between the question and the textual units by applying
the Vector Space Model, and we rank the textual units in descending order of their
similarity values using the formula shown below:
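$$\mathrm{Sim}(Q, U_i) = \cos(Q, U_i) = \frac{\sum_{j} W_{Q,j}\, W_{i,j}}{\sqrt{\sum_{j} W_{Q,j}^{2}}\;\sqrt{\sum_{j} W_{i,j}^{2}}} \qquad (1)$$
where $W_{Q,j}$ and $W_{i,j}$ are the weights of the jth keyword of the question Q and of the textual
unit $U_i$, respectively.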
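The similarity in formula (1) can be sketched as follows. The paper does not state how the keyword weights are computed, so this example assumes raw term frequencies; the function names `cosine_similarity` and `rank_units` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(question_tokens, unit_tokens):
    """Cosine of the angle between two bag-of-words vectors.

    Weights are raw term frequencies, which is an assumption; the paper
    does not specify the weighting scheme.
    """
    w_q = Counter(question_tokens)
    w_u = Counter(unit_tokens)
    dot = sum(w_q[term] * w_u[term] for term in w_q.keys() & w_u.keys())
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    norm_u = math.sqrt(sum(w * w for w in w_u.values()))
    if norm_q == 0 or norm_u == 0:
        return 0.0  # an empty question or unit matches nothing
    return dot / (norm_q * norm_u)


def rank_units(question_tokens, units):
    """Return (similarity, unit index) pairs in descending order of similarity."""
    scored = [
        (cosine_similarity(question_tokens, unit), index)
        for index, unit in enumerate(units)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Here each unit is a list of preprocessed tokens, and the unit index returned by `rank_units` identifies the textual unit that best matches the question.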
The algorithm presented in Fig. 3 takes as input a sequence of textual units belonging to a text and a question related to the text, and then returns a set of ranked answers.
Input:  A question Q,
        a sequence U[n] of textual units and a list RR of
        relations that hold among the units in U.
Output: A set A of candidate answers.

A := ∅;
Identify the type of Q;
Identify the set of relations rr in RR corresponding to the type of Q;
Match Q against the textual units U[n];
for each match Ui
    if (Ui has a relation rri of one of the types in rr)
        sp := related span of rri;
        A := A ∪ sp;
    else
        discard the current Ui;
    end if
end for
Rank the answers;
Fig. 3. Algorithm that selects answers for a given question
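The selection loop of Fig. 3 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' Java implementation: the `Relation` class and `select_answers` function are hypothetical, each relation is assumed to link exactly two units, and answers are assumed to inherit the rank of the unit that matched the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A rhetorical relation between two textual units (by index)."""
    name: str
    first: int
    second: int

    def other(self, unit):
        """Return the unit on the other side of the relation, or None."""
        if unit == self.first:
            return self.second
        if unit == self.second:
            return self.first
        return None


def select_answers(question_type, matched_units, relations, relevant):
    """Collect the spans related to the matched units.

    matched_units: unit indices, best match first.
    relevant: maps a question type to the set of relation names it accepts.
    """
    wanted = relevant[question_type]
    answers = []
    for unit in matched_units:
        for relation in relations:
            if relation.name not in wanted:
                continue  # relation type does not answer this question type
            span = relation.other(unit)
            if span is not None and span not in answers:
                answers.append(span)
    return answers
```

Applied to Example 1, with unit 1 and unit 2 stored at indices 0 and 1 and an Interpretation relation between them, a why-question that matches index 1 yields index 0 as the answer. A matched unit that takes part in no relevant relation contributes nothing, which corresponds to the discard branch in Fig. 3.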
5 Experiments and Results
We implemented our system in the Java programming language. To measure the
performance of our system, we followed the experimental setup used by
Verberne et al. [13]. We selected a number of texts of 150-350 words each. The texts
were extracted from Arabic news websites. We then distributed those texts to 15
people from different disciplines and asked them to read some of the texts and to
formulate why and how to-questions whose answers could be found in the text. The
subjects were also asked to formulate an answer to each of their questions. This resulted
in a set of 98 why and how to-question and answer pairs.
We ran our system on the 98 questions we collected and then compared the answers found by the system to the user-formulated answers; if the answer found
matched the answer formulated by the subject, we judged the answer found as correct. The system found the correct answer for 54 questions, which is 55.1% of all questions. The result is given in Table 3.
For the system created by Verberne et al. [13], a set of 336 why-question and
answer pairs was collected, connected to seven manually annotated English texts of
350-550 words each from the RST Treebank. When they evaluated the system, they obtained a
recall of 53.3%.
Comparing our result with the one obtained by Verberne et al. (Table 4), it can be seen
that they selected longer texts than we did. On the other hand, we dealt with raw
text (the structure was derived automatically), whereas they dealt with manually annotated data. Additionally, they reported that performance would decline if they used
automatically created annotation [13]. As a consequence, using the rhetorical relations
proposed in this research for answering Arabic why and how to-questions shows
promising results.
Table 3. Outcome of the system
Category | # questions | % of all questions
Questions handled | 98 | 100
Correctly answered | 54 | 55.1
Wrongly answered | 44 | 44.9
Table 4. Comparison between the two question answering systems
System | Questions # | Words # | Structure derivation | Source | Recall
Arabic QA | 98 | 150-350 | Automatically | Arabic websites | 55%
English QA | 336 | 350-550 | Manually | RST Treebank | 53.3%
6 Conclusion and Future Work
In this paper we presented the first study on automatically finding answers to why
and how to-questions for the Arabic language based on Rhetorical Structure Theory. We
performed a manual analysis of a set of Arabic texts to select a number of relation
types that are relevant for those kinds of questions; we also selected a set of cue phrases to signal the extracted relations. Additionally, we carried out an evaluation of the
system and compared it with the study by Verberne et al. [13]. The result is promising, and future work is
directed towards dealing with longer texts than those handled in this study.
References
1. The Bridge Language Report (2007), http://www.bridgelanguagecenter.com
2. Benajiba, Y., Rosso, P., Lyhyaoui, A.: Implementation of the Arabic QA Question Answering System’s Computers. In: ICTC (2007)
3. Hammo, B., Abu-Salem, H., Lytinen, S., Evens, M.: QARAB: A question answering
system to support the Arabic language. In: Workshop on Computational Approaches to
Semitic Languages, ACL (2002)
4. Mann, W., Matthiessen, C., Thompson, S.: Rhetorical Structure Theory and Text Analysis.
In: A Frame Work for the Analysis of Texts, pp. 79–195 (1992)
5. Mann, W., Taboada, M.: Rhetorical Structure Theory: Looking back and moving ahead.
SAGE. Discourse Studies. 8, 423–459 (2006)
6. Al-Sanie, W., Touir, A., Mathkour, H.: Towards a Rhetorical Parsing of Arabic Text. In:
International Conference on Computational Intelligence for Modeling, Web Technologies
and Internet Commerce (CIMCA-IAWTIC 2005) (2005)
7. Marcu, D.: The Theory and Practice of Discourse Parsing and Summarization. The MIT
Press, London (2000)
8. Jattal, M.: Nezam al-Jumlah, pp. 127–140. Aleppo University (1979)
9. Haskour, N.: Al-Sababieh fe Tarkeb al-Jumlah Al-Arabih. Aleppo University (1990)
10. Larkey, L.S., Ballesteros, L., Connell, M.E.: Improving stemming for Arabic information
retrieval: Light stemming and co-occurrence analysis. In: 25th SIGIR International Conference on Research and Development in Information Retrieval, pp. 275–282. Tampere, Finland
(2002)
11. Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department. Lancaster University, Lancaster (1999)
12. Al-Shammari, E.: Towards an Error Free Stemming. In: IADIS European Conference on
Data Mining (ECDM 2008). Amsterdam, The Netherlands (2008)
13. Verberne, S., Boves, L., Oostdijk, N.: Discourse-based answering of why-questions. Traitement
Automatique des Langues, Special Issue on Computational Approaches to Discourse and
Document Processing 47(2), 21–41 (2007)