INFORMATION RETRIEVAL
Yu Hong and Heng Ji
jih@rpi.edu
October 15, 2014
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
Information
Basic Function of Information
• Information = transmission of thought
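[Figure: thoughts are encoded as words, and words as sounds (speech) or writing; the receiver decodes them back into words and thoughts. Direct thought-to-thought transfer would be telepathy.]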
Information Theory
• Better called “communication theory”
• Developed by Claude Shannon in 1940’s
• Concerned with the transmission of electrical signals over wires
• How do we send information quickly and reliably?
• Underlies modern electronic communication:
• Voice and data traffic…
• Over copper, fiber optic, wireless, etc.
• Famous result: Channel Capacity Theorem
• Formal measure of information in terms of entropy
• Information = “reduction in surprise”
The Noisy Channel Model
• Information transmission = producing the same message at the destination as was sent at the source
• The message must be encoded for transmission across a medium (called the channel)
• But the channel is noisy and can distort the message
[Figure: Source → message → Transmitter → channel (with noise) → Receiver → message → Destination]
A Synthesis
• Information retrieval as communication over time and
space, across a noisy channel
[Figure: Source → message → Transmitter → channel (with noise) → Receiver → message → Destination]
[Figure: Sender → message → Encoding (indexing/writing) → storage (with noise) → Decoding (acquisition/reading) → message → Recipient]
What is Information Retrieval?
• Most people equate IR with web-search
• highly visible, commercially successful endeavors
• leverage 3+ decades of academic research
• IR: finding any kind of relevant information
• web-pages, news events, answers, images, …
• “relevance” is a key notion
Interesting Examples
• Google image search
http://images.google.com/
• Google video search
http://video.google.com/
• People Search
• http://www.intelius.com
• Social Network Search
• http://arnetminer.org/
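[Figure: the IR system placed in the communication model: Sender → message → Encoding (indexing/writing) → storage (with noise) → Decoding (acquisition/reading) → message → Recipient. The IR system takes a document corpus and a query string and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]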
The IR Black Box
[Figure: a Query and the Documents go into the IR black box; Results come out.]
Inside The IR Black Box
[Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; the Comparison Function matches the Query Representation against the Index and returns Results.]
Building the IR Black Box
• Fetching model
• Comparison model
• Representation Model
• Indexing Model
Building the IR Black Box
• Fetching models
• Crawling model
• Gentle Crawling model
• Comparison models
• Boolean model
• Vector space model
• Probabilistic models
• Language models
• PageRank
• Representation Models
• How do we capture the meaning of documents?
• Is meaning just the sum of all terms?
• Indexing Models
• How do we actually store all those words?
• How do we access indexed terms quickly?
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
Fetching model: Crawling
[Figure: the World Wide Web is crawled by the Fetching Function, which supplies web pages (Documents) to the search engine; the rest of the pipeline is unchanged: Representation Functions, Query Representation, Document Representation, Index, Comparison Function, Results.]
Fetching model: Crawling
• Q1: How many web pages should we fetch?
• As many as we can.
More web pages = richer knowledge = more intelligent search engine
[Figure: document corpus and query string → IR system → ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Fetching model: Crawling
• Q1: How many web pages should we fetch?
• As many as we can.
• The fetching model enriches the knowledge in the "brain" of the search engine
[Figure: the Fetching Function feeds the IR system, which exclaims "I know everything now, hahahahaha!"]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• First, we should know the basic network structure of the web
• Basic Structure: Nodes and Links (hyperlinks)
[Figure: the World Wide Web and its basic structure of nodes and links]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• A crawling program (crawler) visits each node in the web by following hyperlinks.
[Figure: the IR system's crawler moving over the basic network structure]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-1: What are the known nodes?
• Known nodes are those whose addresses the crawler knows
• The nodes are web pages
• So the addresses are URLs (URL: Uniform Resource Locator)
• Such as: www.yahoo.com, www.sohu.com, www.sina.com, etc.
• Q2-2: What are the unknown nodes?
• Unknown nodes are those whose addresses the crawler does not know
• The seed nodes are the known ones
• Before dispatching the crawler, a search engine gives it the addresses of some web pages. These web pages are the earliest known nodes (the so-called seeds)
Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-3: How can the crawler find the unknown nodes?
[Figure: animation over several slides. The crawler ("I can do this. Believe me.") starts at a known node surrounded by unknown nodes, fetches its document, and passes it to the parser; the linked nodes then change from unknown to known ("Good news for me.").]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-3: How can the crawler find the unknown nodes?
• If you introduce a web page to the crawler (that is, give it the page's address), the crawler uses a source-code parser to extract links to many new web pages. Of course, the crawler then knows their addresses too.
• But if you don't tell the crawler anything, it goes on strike because it can do nothing.
• That is why we need the seed nodes (seed web pages) to wake up the crawler.
[Figure: the crawler asking "Give me some seeds."]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• To traverse the whole network of the web, the crawler needs some auxiliary equipment:
• A FIFO (first in, first out) register, such as a queue
• An Access Control Program (ACP)
• A Source Code Parser (SCP)
• Seed nodes
[Figure: the crawler ("I need some equipment.") with its FIFO register, ACP, and SCP]
Fetching model: Crawling
• Q2: How to fetch the web pages?
• Robotic crawling procedure (only five steps)
• Initialization: push the seed nodes (known web pages) into the empty queue
• Step 1: Take a node out of the queue (FIFO) and visit it (ACP)
• Step 2: Extract ("steal") the necessary information from the source code of the node (SCP)
• Step 3: Send the extracted text information (title, text body, keywords, and language) back to the search engine for storage (ACP)
• Step 4: Push the newly found nodes into the queue
• Step 5: Repeat Steps 1-4 until the queue is empty
[Figure: the crawler saying "I am working now."]
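The crawling procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB`, `toy_fetch`, and `toy_parse` are hypothetical stand-ins for the real web, the ACP, and the SCP, and the `max_visits` cap is an added safety limit that is not part of the slide's procedure. The sketch deliberately does nothing about pages that link back to each other.

```python
from collections import deque

def crawl(seeds, fetch, parse, max_visits=1000):
    """Queue-based crawl loop; returns (url, text) pairs in visit order."""
    queue = deque(seeds)                 # Initialization: push the seed nodes
    stored = []                          # stands in for the search engine's storage
    while queue and len(stored) < max_visits:
        url = queue.popleft()            # Step 1: take the oldest node (FIFO) and visit it
        source = fetch(url)
        text, links = parse(source)      # Step 2: extract text and links from the source
        stored.append((url, text))       # Step 3: send the text back for storage
        queue.extend(links)              # Step 4: push the newly found nodes
    return stored                        # Step 5 is the while loop itself

# Hypothetical in-memory "web": url -> (text, outgoing links). It is a tree,
# so the crawl terminates on its own.
TOY_WEB = {
    "seed": ("home page", ["a", "b"]),
    "a": ("page a", ["c"]),
    "b": ("page b", []),
    "c": ("page c", []),
}

def toy_fetch(url):
    return TOY_WEB[url]

def toy_parse(source):
    return source                        # the toy source is already (text, links)

visited = [url for url, _ in crawl(["seed"], toy_fetch, toy_parse)]
print(visited)
```

Because the register is a FIFO queue, the pages are visited level by level (breadth first) starting from the seeds.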
Fetching model: Crawling
• Q2: How to fetch the web pages?
• Through these steps, the number of known nodes grows continuously
• This is the underlying reason why the crawler can traverse the whole web
[Figure: the FIFO register ("I control this.") starts with three seed slots and fills with newly found node slots as crawling proceeds]
• The crawler keeps working until the register is empty
• By the time the register is empty, the information of all nodes in the web has been extracted and stored on the search engine's server.
Fetching model: Crawling
• Problems
• 1) In practice, the crawler cannot traverse the whole web.
• For example, it runs into an infinite loop when it falls into a partially closed circle of links (a snare) in the web
[Figure: a ring of nodes linking to one another, trapping the crawler ("No.")]
Fetching model: Crawling
• Problems
• 2) Crude crawling.
• A portal web site puts a series of homologous nodes into the register. Under the FIFO rule, crawling these nodes one after another keeps visiting the server they share. This is crude crawling.
A class of homologous web pages linking to a portal site:
https://www.yahoo.com
https://screen.yahoo.com/live/
https://games.yahoo.com/
https://mobile.yahoo.com/
https://groups.yahoo.com/neo
https://answers.yahoo.com/
http://finance.yahoo.com/
https://weather.yahoo.com/
https://autos.yahoo.com/
https://shopping.yahoo.com/
https://www.yahoo.com/health
https://www.yahoo.com/food
https://www.yahoo.com/style
[Figure: the network of the web next to the register, whose slots fill with runs of nodes from the same site]
Fetching model: Crawling
• Homework
• 1) How can we overcome the infinite loop caused by a partially closed circle of links in the web?
• 2) Please find a way to crawl the web like a gentleman (not crudely).
• Please select one of the problems as the topic of your homework. A short paper of no more than 500 words is required. Please include at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm.
• Send it to me. Email: tianxianer@gmail.com
• Thanks.
Building the IR Black Box
• Fetching models
• Crawling model
• Gentle Crawling model
• Comparison models
• Boolean model
• Vector space model
• Probabilistic models
• Language models
• PageRank
• Representation Models
• How do we capture the meaning of documents?
• Is meaning just the sum of all terms?
• Indexing Models
• How do we actually store all those words?
• How do we access indexed terms quickly?
[Figure: the IR pipeline shown twice (Query and Documents → Representation Functions → Query Representation and Document Representation → Index → Comparison Function → Results); in the second copy part of the pipeline is marked "Ignore Now".]
A heuristic formula for IR (Boolean model)
• Rank docs by similarity to the query
• suppose the query is “spiderman film”
• Relevance= # query words in the doc
• favors documents with both “spiderman” and “film”
• mathematically:
$sim(D, Q) = \sum_{q \in Q} 1_{q \in D}$
• Logical variations (set-based)
• Boolean AND (require all words):
$AND(D, Q) = \prod_{q \in Q} 1_{q \in D}$
• Boolean OR (any of the words):
$OR(D, Q) = 1 - \prod_{q \in Q} 1_{q \notin D}$
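As a rough illustration of the three set-based scores (overlap count, Boolean AND, Boolean OR), here is a minimal sketch. The function names and the two toy documents are made up for this example; documents are treated as plain sets of words.

```python
def overlap_score(doc_terms, query_terms):
    """Number of query words that occur in the document."""
    return sum(1 for q in query_terms if q in doc_terms)

def boolean_and(doc_terms, query_terms):
    """1 if every query word occurs in the document, else 0."""
    return int(all(q in doc_terms for q in query_terms))

def boolean_or(doc_terms, query_terms):
    """1 if at least one query word occurs in the document, else 0."""
    return int(any(q in doc_terms for q in query_terms))

query = ["spiderman", "film"]
doc_a = {"spiderman", "film", "review"}   # hypothetical document
doc_b = {"film", "festival"}              # hypothetical document

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, overlap_score(doc, query), boolean_and(doc, query), boolean_or(doc, query))
```

The overlap count ranks `doc_a` above `doc_b`, whereas AND and OR only say whether a document matches at all.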
Term Frequency (TF)
• Observation:
• key words tend to be repeated in a document
• Modify our similarity measure:
• give more weight if word occurs multiple times
$sim(D, Q) = \sum_{q \in Q} tf_D(q)$
• Problem:
• biased towards long documents
• spurious occurrences
• normalize by length:
$sim(D, Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|}$
Inverse Document Frequency (IDF)
• Observation:
• rare words carry more meaning: cryogenic, apollo
• frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
• give more weight to rare words
… but don’t be too aggressive (why?)
$sim(D, Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)}$
• |C| … total number of documents
• df(q) … total number of documents that contain q
TF normalization
• Observation:
• D1={cryogenic,labs}, D2 ={cryogenic,cryogenic}
• which document is more relevant?
• which one is ranked higher? (df(labs) > df(cryogenic))
• Correction:
• first occurrence more important than a repeat (why?)
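• “squash” the linearity of TF:
$\frac{tf(q)}{tf(q) + K}$
[Figure: plot of $tf(q) / (tf(q) + K)$ against tf = 1, 2, 3, ...; the curve rises quickly at first and then flattens]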
State-of-the-art Formula
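$sim(D, Q) = \sum_{q \in Q} \frac{tf_D(q)}{tf_D(q) + K \cdot |D|} \cdot \log \frac{|C|}{df(q)}$
• Sum over $q \in Q$: more query words → good
• $tf_D(q)$ in the numerator: repetitions of query words → good
• $K \cdot |D|$ in the denominator: penalize very long documents
• $\log(|C| / df(q))$: common words less important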
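To make the formula concrete, the sketch below scores the two documents from the TF normalization slide, D1 = {cryogenic, labs} and D2 = {cryogenic, cryogenic}. The collection size, the document frequencies, and K = 1 are invented for illustration; only the shape of the formula comes from the slides.

```python
import math
from collections import Counter

def squashed_tfidf(doc_tokens, query_terms, doc_freq, num_docs, k=1.0):
    """sum over q of tf/(tf + K*|D|) * log(|C|/df(q))."""
    tf = Counter(doc_tokens)
    length = len(doc_tokens)
    total = 0.0
    for q in query_terms:
        if tf[q] == 0 or doc_freq.get(q, 0) == 0:
            continue                      # absent terms contribute nothing
        total += tf[q] / (tf[q] + k * length) * math.log(num_docs / doc_freq[q])
    return total

def linear_tfidf(doc_tokens, query_terms, doc_freq, num_docs):
    """Earlier variant: sum over q of tf/|D| * log(|C|/df(q))."""
    tf = Counter(doc_tokens)
    return sum(tf[q] / len(doc_tokens) * math.log(num_docs / doc_freq[q])
               for q in query_terms if tf[q] and doc_freq.get(q, 0))

# Hypothetical statistics: 100 documents, "labs" more common than "cryogenic".
NUM_DOCS = 100
DOC_FREQ = {"cryogenic": 2, "labs": 10}
D1 = ["cryogenic", "labs"]
D2 = ["cryogenic", "cryogenic"]
QUERY = ["cryogenic", "labs"]

for name, doc in (("D1", D1), ("D2", D2)):
    print(name,
          round(linear_tfidf(doc, QUERY, DOC_FREQ, NUM_DOCS), 3),
          round(squashed_tfidf(doc, QUERY, DOC_FREQ, NUM_DOCS), 3))
```

With these made-up numbers the linear TF score prefers D2 (the repeated rare word), while the squashed version prefers D1, which matches both query words.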
Strengths and Weaknesses
• Strengths
• Precise, if you know the right strategies
• Precise, if you have an idea of what you’re looking for
• Implementations are fast and efficient
• Weaknesses
• Users must learn Boolean logic
• Boolean logic insufficient to capture the richness of language
• No control over size of result set: either too many hits or none
• When do you stop reading? All documents in the result set are
considered “equally good”
• What about partial matches? Documents that “don’t quite match”
the query may be useful also
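Vector-space approach to IR
[Figure: documents drawn as vectors in a space with axes cat, pig, and dog. “cat cat” and “cat cat cat” lie along the cat axis; “cat pig” and “pig cat” share one direction at an angle θ from it; “cat cat pig dog dog” points into the full three-term space.]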
Assumption: Documents that are “close together” in
vector space “talk about” the same things
Therefore, retrieve documents based on how close the
document is to the query (i.e., similarity ~ “closeness”)
Some formulas for Similarity
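Dot product: $Sim(D, Q) = \sum_i a_i b_i$
Cosine: $Sim(D, Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \cdot \sqrt{\sum_i b_i^2}}$
Dice: $Sim(D, Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard: $Sim(D, Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$
Here $a_i$ and $b_i$ are the weights of term $t_i$ in D and Q.
[Figure: D and Q drawn as vectors in the plane of terms t1 and t2]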
An Example
• A document space is defined by three terms:
• hardware, software, users
• the vocabulary
• A set of documents are defined as:
• A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
• A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
• A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the Query is “hardware and software”
• what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
• document A4, A7 will be retrieved (“AND”)
• retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
• q=(1, 1, 0)
• S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
• S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
• S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
• Document retrieved set (with ranking)=
• {A4, A7, A1, A2, A5, A6, A8, A9}
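The cosine scores and the ranking in this example can be checked with a short script. The helper names below are illustrative; the vectors and the query are the ones given on the slide, and the Dice and Jaccard measures are included only for comparison.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

# Term order: (hardware, software, users)
DOCS = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

scores = {name: cosine(q, vec) for name, vec in DOCS.items()}
# Keep documents with a non-zero score; ties keep their original order.
ranking = [name for name in sorted(scores, key=scores.get, reverse=True)
           if scores[name] > 0]
print({name: round(s, 2) for name, s in scores.items()})
print(ranking)
```

A3 shares no term with the query, so its score is 0 and it drops out of the retrieved set.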
Probabilistic model
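• Given D, estimate P(R|D) and P(NR|D)
• $P(R \mid D) = \frac{P(D \mid R) \cdot P(R)}{P(D)} \propto P(D \mid R)$ (P(D), P(R) constant)
• $D = \{t_1 = x_1, t_2 = x_2, \ldots\}$, where $x_i = 1$ if term $t_i$ is present and $x_i = 0$ if it is absent
• $P(D \mid R) = \prod_{(t_i = x_i) \in D} P(t_i = x_i \mid R) = \prod_{t_i} P(t_i = 1 \mid R)^{x_i} \, P(t_i = 0 \mid R)^{(1 - x_i)} = \prod_{t_i} p_i^{x_i} (1 - p_i)^{(1 - x_i)}$
• $P(D \mid NR) = \prod_{t_i} P(t_i = 1 \mid NR)^{x_i} \, P(t_i = 0 \mid NR)^{(1 - x_i)} = \prod_{t_i} q_i^{x_i} (1 - q_i)^{(1 - x_i)}$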
Prob. model (cont’d)
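For document ranking
$Odd(D) = \log \frac{P(D \mid R)}{P(D \mid NR)} = \log \frac{\prod_{t_i} p_i^{x_i} (1 - p_i)^{(1 - x_i)}}{\prod_{t_i} q_i^{x_i} (1 - q_i)^{(1 - x_i)}}$
$= \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{t_i} \log \frac{1 - p_i}{1 - q_i}$
$\propto \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$
(the second sum does not depend on $x_i$, so it is the same for every document and can be dropped for ranking)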
Prob. model (cont’d)
• How to estimate $p_i$ and $q_i$?
• A set of N relevant and irrelevant samples:
$p_i = \frac{r_i}{R_i}$, $q_i = \frac{n_i - r_i}{N - R_i}$
                 Rel. doc.     Irrel. doc.             Doc. (all samples)
with t_i         r_i           n_i - r_i               n_i
without t_i      R_i - r_i     N - R_i - n_i + r_i     N - n_i
total            R_i           N - R_i                 N
Prob. model (cont’d)
$Odd(D) = \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \sum_{t_i} x_i \log \frac{r_i (N - R_i - n_i + r_i)}{(R_i - r_i)(n_i - r_i)}$
• Smoothing (Robertson-Sparck-Jones formula)
$Odd(D) = \sum_{t_i} x_i \log \frac{(r_i + 0.5)(N - R_i - n_i + r_i + 0.5)}{(R_i - r_i + 0.5)(n_i - r_i + 0.5)} = \sum_{t_i \in D} w_i$
• When no sample is available:
$p_i = 0.5$, $q_i = (n_i + 0.5)/(N + 0.5) \approx n_i / N$
• May be implemented as VSM
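A small sketch of the smoothed Robertson-Sparck-Jones term weight follows. The function names and all counts are hypothetical; `R` plays the role of $R_i$ on the slide (the number of relevant samples), and the logarithm is the one used in the Odd(D) definition.

```python
import math

def rsj_weight(r_i, n_i, R, N):
    """Smoothed weight w_i for one term.

    r_i: relevant samples containing the term
    n_i: all samples containing the term
    R:   relevant samples
    N:   all samples
    """
    numerator = (r_i + 0.5) * (N - R - n_i + r_i + 0.5)
    denominator = (R - r_i + 0.5) * (n_i - r_i + 0.5)
    return math.log(numerator / denominator)

def odd_score(doc_terms, term_stats, R, N):
    """Sum of w_i over the terms present in the document."""
    return sum(rsj_weight(r_i, n_i, R, N)
               for term, (r_i, n_i) in term_stats.items()
               if term in doc_terms)

# Hypothetical sample: 100 documents, 10 of them relevant.
N, R = 100, 10
TERM_STATS = {
    "apollo": (8, 10),   # appears mostly in relevant documents
    "said": (1, 50),     # appears mostly in irrelevant documents
}
print(round(rsj_weight(8, 10, R, N), 3), round(rsj_weight(1, 50, R, N), 3))
print(round(odd_score({"apollo", "said"}, TERM_STATS, R, N), 3))
```

A term concentrated in the relevant samples gets a positive weight and a term concentrated in the irrelevant ones a negative weight. With no relevance information (R = 0, r_i = 0) the weight reduces to log((N - n_i + 0.5)/(n_i + 0.5)), which behaves like an idf.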
An Appraisal of Probabilistic Models
Among the oldest formal models in IR
Maron & Kuhns, 1960: Since an IR system cannot predict
with certainty which document is relevant, we should deal
with probabilities
Assumptions for getting reasonable approximations of the needed
probabilities:
Boolean representation of documents/queries/relevance
Term independence
Out-of-query terms do not affect retrieval
Document relevance values are independent
An Appraisal of Probabilistic Models
The difference between ‘vector space’ and ‘probabilistic’ IR is not that great:
In either case you build an information retrieval scheme in
the exact same way.
Difference: for probabilistic IR, at the end, you score
queries not by cosine similarity and tf-idf in a vector space,
but by a slightly different formula motivated by probability
theory
Language-modeling Approach
• query is a random sample from a “perfect” document
• words are “sampled” independently of each other
• rank documents by the probability of generating query
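[Figure: a document D and a four-token query; the query tokens were shown as images]
$P(query \mid D) = P(q_1 \mid D) \cdot P(q_2 \mid D) \cdot P(q_3 \mid D) \cdot P(q_4 \mid D) = 4/9 \cdot 2/9 \cdot 4/9 \cdot 3/9$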
Naive Bayes and LM generative models
• Naive Bayes: We want to classify document d.
LM: We want to classify a query q.
• Naive Bayes: Classes are, e.g., geographical regions like China, UK, Kenya.
LM: Each document in the collection is a different class.
• Naive Bayes: Assume that d was generated by the generative model.
LM: Assume that q was generated by a generative model.
• Naive Bayes: Key question: Which of the classes is most likely to have generated the document?
LM: Which document (=class) is most likely to have generated the query q?
• Naive Bayes: Or: for which class do we have the most evidence?
LM: For which document (as the source of the query) do we have the most evidence?
Using language models (LMs) for IR
❶ LM = language model
❷ We view the document as a generative model that generates the query.
❸ What we need to do:
❹ Define the precise generative model we want to use
❺ Estimate parameters (different parameters for each document’s model)
❻ Smooth to avoid zeros
❼ Apply to query and find document most likely to have generated the query
❽ Present most likely document(s) to user
❾ Note that x – y is pretty much what we did in Naive Bayes.
What is a language model?
We can view a finite state automaton as a deterministic language model.
Example: an automaton that generates “I wish I wish I wish I wish . . .” It cannot generate “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.
A probabilistic language model
This is a one-state probabilistic finite-state automaton (a unigram language model) and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.
String: frog said that toad likes frog STOP
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02
= 0.00000000000048 = 4.8 · 10^-13
A different language model for each document
String: frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 0.00000000000048 = 4.8 · 10^-13
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.02 = 0.0000000000012 = 12 · 10^-13
P(string|Md1) < P(string|Md2)
Thus, document d2 is “more relevant” to the string “frog said that toad likes frog STOP” than d1 is.
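The two products are easy to check by machine. In the sketch below the emission probabilities for each model are read off the factors in the products above (one factor per token of the string); the dictionary layout and function name are only illustrative, and `math.prod` needs Python 3.8 or later.

```python
import math

def string_probability(tokens, model):
    """Unigram model: multiply the emission probability of every token."""
    return math.prod(model[token] for token in tokens)

STRING = "frog said that toad likes frog STOP".split()

# Emission probabilities taken from the factors in the two products.
M_D1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "STOP": 0.02}
M_D2 = {"frog": 0.01, "said": 0.03, "that": 0.05, "toad": 0.02, "likes": 0.02, "STOP": 0.02}

p1 = string_probability(STRING, M_D1)
p2 = string_probability(STRING, M_D2)
print(p1, p2, p1 < p2)
```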
Using language models in IR
• Each document is treated as (the basis for) a language model.
• Given a query q
• Rank documents based on $P(d \mid q) = \frac{P(q \mid d) \, P(d)}{P(q)}$
• P(q) is the same for all documents, so ignore
• P(d) is the prior: often treated as the same for all d
• But we can give a prior to “high-quality” documents, e.g., those with high PageRank.
• P(q|d) is the probability of q given d.
• So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
Where we are
• In the LM approach to IR, we attempt to model the query generation process.
• Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
• That is, we rank according to P(q|d).
• Next: how do we compute P(q|d)?
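How to compute P(q|d)
• We will make the same conditional independence assumption as for Naive Bayes.
$P(q \mid M_d) = P(\langle t_1, \ldots, t_{|q|} \rangle \mid M_d) = \prod_{1 \le k \le |q|} P(t_k \mid M_d)$
($|q|$: length of q; $t_k$: the token occurring at position k in q)
• This is equivalent to:
$P(q \mid M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t \mid M_d)^{tf_{t,q}}$
• $tf_{t,q}$: term frequency (# occurrences) of t in q
• Multinomial model (omitting constant factor)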
Parameter estimation
• Missing piece: Where do the parameters $P(t \mid M_d)$ come from?
• Start with maximum likelihood estimates (as we did for Naive Bayes)
$\hat{P}(t \mid M_d) = \frac{tf_{t,d}}{|d|}$
($|d|$: length of d; $tf_{t,d}$: # occurrences of t in d)
• As in Naive Bayes, we have a problem with zeros.
• A single t with $P(t \mid M_d) = 0$ will make $P(q \mid M_d) = \prod P(t \mid M_d)$ zero.
• We would give a single term “veto power”.
• For example, for query [Michael Jackson top hits] a document about “top songs” (but not using the word “hits”) would have $P(t \mid M_d) = 0$ for t = “hits”. That’s bad.
• We need to smooth the estimates to avoid zeros.
Smoothing
• Key intuition: A nonoccurring term is possible (even though it didn’t occur), . . .
• . . . but no more likely than would be expected by chance in the collection.
• Notation: $M_c$: the collection model; $cf_t$: the number of occurrences of t in the collection; $T = \sum_t cf_t$: the total number of tokens in the collection.
• We will use $\hat{P}(t \mid M_c) = \frac{cf_t}{T}$ to “smooth” P(t|d) away from zero.
Mixture model
• $P(t \mid d) = \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c)$
• Mixes the probability from the document with the general collection frequency of the word.
• High value of λ: “conjunctive-like” search, which tends to retrieve documents containing all query words.
• Low value of λ: more disjunctive, suitable for long queries
• Correctly setting λ is very important for good performance.
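Mixture model: Summary
$P(q \mid d) \propto \prod_{1 \le k \le |q|} \left( \lambda P(t_k \mid M_d) + (1 - \lambda) P(t_k \mid M_c) \right)$
• What we model: The user has a document in mind and generates the query from this document.
• The equation represents the probability that the document that the user had in mind was in fact this one.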
Example
• Collection: d1 and d2
• d1: Jackson was one of the most talented entertainers of all time
• d2: Michael Jackson anointed himself King of Pop
• Query q: Michael Jackson
• Use mixture model with λ = 1/2
• P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
• P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
• Ranking: d2 > d1
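The example can be reproduced with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and lowercasing (neither is specified on the slides); the function name `mixture_score` is made up for illustration.

```python
from collections import Counter

def mixture_score(query_tokens, doc_tokens, collection, lam=0.5):
    """P(q|d) under the mixture model: product over query tokens of
    lam * tf(t, d)/|d| + (1 - lam) * cf(t)/T."""
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(token for doc in collection for token in doc)
    total_tokens = sum(coll_tf.values())
    score = 1.0
    for t in query_tokens:
        p_doc = doc_tf[t] / len(doc_tokens)       # maximum likelihood estimate from d
        p_coll = coll_tf[t] / total_tokens        # collection model
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

d1 = "Jackson was one of the most talented entertainers of all time".lower().split()
d2 = "Michael Jackson anointed himself King of Pop".lower().split()
query = "Michael Jackson".lower().split()

p_d1 = mixture_score(query, d1, [d1, d2])
p_d2 = mixture_score(query, d2, [d1, d2])
print(round(p_d1, 3), round(p_d2, 3))
```

Even though d1 never mentions "michael", its score is not zero, because the collection model contributes 1/18 for that term.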
Exercise: Compute ranking
• Collection: d1 and d2
• d1: Xerox reports a profit but revenue is down
• d2: Lucene narrows quarter loss but revenue decreases further
• Query q: revenue down
• Use mixture model with λ = 1/2
• P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
• P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
• Ranking: d1 > d2
LMs vs. vector space model (1)
• LMs have some things in common with vector space models.
• Term frequency is directly in the model.
• But it is not scaled in LMs.
• Probabilities are inherently “length-normalized”.
• Cosine normalization does something similar for vector space.
• Mixing document and collection frequencies has an effect similar to idf.
• Terms rare in the general collection, but common in some documents will have a greater influence on the ranking.
LMs vs. vector space model (2)
• LMs vs. vector space model: commonalities
• Term frequency is directly in the model.
• Probabilities are inherently “length-normalized”.
• Mixing document and collection frequencies has an effect similar to idf.
• LMs vs. vector space model: differences
• LMs: based on probability theory
• Vector space: based on similarity, a geometric/linear algebra notion
• Collection frequency vs. document frequency
• Details of term frequency, length normalization etc.
Language models for IR: Assumptions
• Simplifying assumption: Queries and documents are objects of the same type. Not true!
• There are other LMs for IR that do not make this assumption.
• The vector space model makes the same assumption.
• Simplifying assumption: Terms are conditionally independent.
• Again, vector space model (and Naive Bayes) makes the same assumption.
• Cleaner statement of assumptions than vector space
• Thus, better theoretical foundation than vector space
• … but “pure” LMs perform much worse than “tuned” LMs.
Relevance Using Hyperlinks
• Number of documents relevant to a query can be
enormous if only term frequencies are taken into
account
• Using term frequencies makes “spamming” easy
• E.g., a travel agency can add many occurrences of the words “travel”
to its page to make its rank very high
• Most of the time people are looking for pages from
popular sites
• Idea: use popularity of Web site (e.g., how many people
visit it) to rank site pages that match given keywords
• Problem: hard to find actual popularity of site
• Solution: next slide
Relevance Using Hyperlinks (Cont.)
• Solution: use number of hyperlinks to a site as a measure
of the popularity or prestige of the site
• Count only one hyperlink from each site (why? - see previous slide)
• Popularity measure is for site, not for individual page
• But, most hyperlinks are to root of site
• Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu
contains many unrelated pages of varying popularity
• Refinements
• When computing prestige based on links to a site, give more weight
to links from sites that themselves have higher prestige
• Definition is circular
• Set up and solve system of simultaneous linear equations
• Above idea is basis of the Google PageRank ranking mechanism
PageRank in Google
PageRank in Google (Cont’)
[Figure: link graph over pages I1, I2, A, and B]
$PR(A) = (1 - d) + d \sum_i \frac{PR(I_i)}{C(I_i)}$
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more this page is
important
• d: damping factor (0.85)
• Many other criteria: e.g. proximity of query words
• “…information retrieval …” better than “… information … retrieval …”
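A sketch of iterating the PageRank formula follows. Two things here are assumptions rather than facts from the slide: $C(I_i)$ is taken to be the number of outgoing links of page $I_i$ (as in the usual statement of PageRank), and the toy graph has I1 and I2 linking to A and A linking to B. The function name and iteration count are illustrative.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / C(i)) over pages i linking to p.

    links maps each page to the list of pages it links to; C(i) is len(links[i]).
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[src] / len(links[src])
                           for src in links if page in links[src])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Assumed toy graph: I1 -> A, I2 -> A, A -> B.
LINKS = {"I1": ["A"], "I2": ["A"], "A": ["B"], "B": []}
ranks = pagerank(LINKS)
print({page: round(value, 5) for page, value in ranks.items()})
```

Pages with no incoming links settle at 1 - d = 0.15. A is pointed to by two pages and B by the more important A, so B ends up with the highest value in this tiny graph.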
Relevance Using Hyperlinks (Cont.)
• Connections to social networking theories that ranked
prestige of people
• E.g., the president of the U.S.A has a high prestige since many
people know him
• Someone known by multiple prestigious people has high prestige
• Hub and authority based ranking
• A hub is a page that stores links to many pages (on a topic)
• An authority is a page that contains actual information on a topic
• Each page gets a hub prestige based on prestige of authorities
that it points to
• Each page gets an authority prestige based on prestige of hubs
that point to it
• Again, prestige definitions are cyclic, and can be obtained by solving linear equations
• Use authority prestige when ranking answers to a query
HITS: Hubs and authorities
HITS update rules
• A: link matrix
• h: vector of hub scores
• a: vector of authority scores
• HITS algorithm:
• Compute $h = A a$
• Compute $a = A^T h$
• Iterate until convergence
• Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score
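The update rules can be sketched without any matrix library. In this illustration `adj[i][j] = 1` means page i links to page j; starting from all-ones vectors, the per-iteration length normalization, and the fixed iteration count are assumptions added to keep the numbers bounded, since the slide only says to iterate until convergence.

```python
import math

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

def hits(adj, iterations=50):
    """Return (hub scores h, authority scores a) for link matrix adj."""
    n = len(adj)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # h = A a: a page is a good hub if it points to good authorities
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]
        # a = A^T h: a page is a good authority if good hubs point to it
        a = [sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)]
        h, a = normalize(h), normalize(a)
    return h, a

# Hypothetical graph: page 0 -> 2, page 1 -> 2, page 1 -> 3.
ADJ = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hub, auth = hits(ADJ)
print([round(x, 3) for x in hub])
print([round(x, 3) for x in auth])
```

Page 1 is the best hub because it links to both authorities, and page 2 is the best authority because both hubs link to it.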
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
Keyword Search
• Simplest notion of relevance is that the query string
appears verbatim in the document.
• Slightly less strict notion is that the words in the query
appear frequently in the document, in any order (bag of
words).
Problems with Keywords
• May not retrieve relevant documents that include
synonymous terms.
• “restaurant” vs. “café”
• “PRC” vs. “China”
• May retrieve irrelevant documents that include
ambiguous terms.
• “bat” (baseball vs. mammal)
• “Apple” (company vs. fruit)
• “bit” (unit of data vs. act of eating)
Query Expansion
• http://www.lemurproject.org/lemur/IndriQueryLanguage.php
• Most errors caused by vocabulary mismatch
• query: “cars”, document: “automobiles”
• solution: automatically add highly-related words
• Thesaurus / WordNet lookup:
• add semantically-related words (synonyms)
• cannot take context into account:
• “rail car” vs. “race car” vs. “car and cdr”
• Statistical Expansion:
• add statistically-related words (co-occurrence)
• very successful
Indri Query Examples
• <parameters><query>#combine( #weight( 0.063356 #1(explosion) 0.187417 #1(blast) 0.411817 #1(wounded) 0.101370 #1(injured) 0.161191 #1(death) 0.074849 #1(deaths)) #weight( 0.311760 #1(Davao City international airport) 0.311760 #1(Tuesday) 0.103044 #1(DAVAO) 0.195505 #1(Philippines) 0.019817 #1(DXDC) 0.058113 #1(Davao Medical Center)))</query></parameters>
Synonyms and Homonyms
• Synonyms
• E.g., document: “motorcycle repair”, query: “motorcycle
maintenance”
• Need to realize that “maintenance” and “repair” are synonyms
• System can extend query as “motorcycle and (repair or
maintenance)”
• Homonyms
• E.g., “object” has different meanings as noun/verb
• Can disambiguate meanings (to some extent) from the context
• Extending queries automatically using synonyms can be
problematic
• Need to understand intended meaning in order to infer synonyms
• Or verify synonyms with user
• Synonyms may have other meanings as well
Concept-Based Querying
• Approach
• For each word, determine the concept it represents from context
• Use one or more ontologies:
• Hierarchical structure showing relationship between concepts
• E.g., the ISA relationship that we saw in the E-R model
• This approach can be used to standardize terminology in
a specific field
• Ontologies can link multiple languages
• Foundation of the Semantic Web (not covered here)
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
Indexing of Documents
• An inverted index maps each keyword Ki to a set of
documents Si that contain the keyword
• Documents identified by identifiers
• Inverted index may record
• Keyword locations within document to allow proximity based
ranking
• Counts of number of occurrences of keyword to compute TF
• and operation: finds documents that contain all of K1, K2, ..., Kn.
• Intersection S1 ∩ S2 ∩ ... ∩ Sn
• or operation: finds documents that contain at least one of K1, K2, ..., Kn
• Union S1 ∪ S2 ∪ ... ∪ Sn
• Each Si is kept sorted to allow efficient intersection/union by
merging
• “not” can also be efficiently implemented by merging of sorted lists
Indexing of Documents
• Goal = Find the important meanings and create an internal
representation
• Factors to consider:
• Accuracy to represent meanings (semantics)
• Exhaustiveness (cover all the contents)
• Facility for computer to manipulate
• What is the best representation of contents?
• Char. string (char trigrams): not precise enough
• Word: good coverage, not precise
• Phrase: poor coverage, more precise
• Concept: poor coverage, precise
[Figure: String, Word, Phrase, and Concept plotted by coverage (recall) against accuracy (precision); coverage falls and accuracy rises from String to Concept]
Indexer steps
• Sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term (Doc #), in order of occurrence:
I (1), did (1), enact (1), julius (1), caesar (1), I (1), was (1), killed (1), i' (1), the (1), capitol (1), brutus (1), killed (1), me (1),
so (2), let (2), it (2), be (2), with (2), caesar (2), the (2), noble (2), brutus (2), hath (2), told (2), you (2), caesar (2), was (2), ambitious (2)
Term (Doc #), sorted by term:
ambitious (2), be (2), brutus (1), brutus (2), capitol (1), caesar (1), caesar (2), caesar (2), did (1), enact (1), hath (2), I (1), I (1), i' (1), it (2), julius (1), killed (1), killed (1), let (2), me (1), noble (2), so (2), the (1), the (2), told (2), you (2), was (1), was (2), with (2)
• Multiple term entries in a single document are merged.
• Frequency information is added.
Term         Doc #   Term freq
ambitious    2       1
be           2       1
brutus       1       1
brutus       2       1
capitol      1       1
caesar       1       1
caesar       2       2
did          1       1
enact        1       1
hath         2       1
I            1       2
i'           1       1
it           2       1
julius       1       1
killed       1       2
let          2       1
me           1       1
noble        2       1
so           2       1
the          1       1
the          2       1
told         2       1
you          2       1
was          1       1
was          2       1
with         2       1
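The indexer steps (emit pairs, sort, merge duplicates and add frequencies) can be sketched as follows. The tokenizer regular expression and the decision to lowercase every token are simplifying assumptions (the slide keeps "I" capitalized), and the function names are made up for this example.

```python
import re
from collections import Counter

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Split into word tokens, keeping the apostrophe in i'."""
    return re.findall(r"[A-Za-z']+", text)

def build_index(docs):
    # Step 1: sequence of (modified token, document ID) pairs
    pairs = [(token.lower(), doc_id)
             for doc_id, text in docs.items()
             for token in tokenize(text)]
    # Step 2: sort by term, then by document ID
    pairs.sort()
    # Step 3: merge repeated (term, doc) entries and record the term frequency
    index = {}
    for (term, doc_id), freq in Counter(pairs).items():
        index.setdefault(term, []).append((doc_id, freq))
    return pairs, index

pairs, index = build_index(DOCS)
print(len(pairs))
print(index["caesar"], index["killed"], index["hath"])
```

Each entry of `index` is a postings list of (document ID, term frequency) pairs, already sorted by document ID because the pairs were sorted before merging.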
An example
Stopwords / Stoplist
• Function words do not bear useful information for IR: of, in, about, with, I, although, …
• Stoplist: contains stopwords, which are not used as index terms
• Prepositions
• Articles
• Pronouns
• Some adverbs and adjectives
• Some frequent words (e.g. document)
• The removal of stopwords usually improves IR effectiveness
• A few “standard” stoplists are commonly used.
Stemming
• Reason:
• Different word forms may bear similar meaning (e.g. search,
searching): create a “standard” representation for them
• Stemming:
• Removing some endings of words
computer, compute, computes, computing, computed, computation → comput
Lemmatization
• Transform to standard form according to syntactic category.
E.g. verb + ing → verb
noun + s → noun
• Needs POS tagging
• More accurate than stemming, but needs more resources
• Crucial to choose stemming/lemmatization rules: noise vs. recognition rate
• Compromise between precision and recall:
light/no stemming: -recall, +precision
severe stemming: +recall, -precision
Simple conjunctive query (two terms)
Consider the query: BRUTUS AND CALPURNIA
To find all matching documents using the inverted index:
❶ Locate BRUTUS in the dictionary
❷ Retrieve its postings list from the postings file
❸ Locate CALPURNIA in the dictionary
❹ Retrieve its postings list from the postings file
❺ Intersect the two postings lists
❻ Return intersection to user
Intersecting two posting lists
This is linear in the length of the postings lists.
Note: This only works if postings lists are sorted.
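The linear-time intersection can be sketched as a two-pointer merge over sorted postings lists. The function name `intersect` and the document IDs in the example are made up for illustration.

```python
def intersect(p1, p2):
    # Walk both sorted postings lists in parallel, always advancing the smaller docID.
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Hypothetical postings lists (docIDs invented for this example).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times.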
Does Google use the Boolean model?
On Google, the default interpretation of a query [w1 w2 … wn] is w1 AND w2 AND … AND wn
Cases where you get hits that do not contain one of the wi:
• anchor text
• page contains a variant of wi (morphology, spelling correction, synonym)
• long queries (n large)
• Boolean expression generates very few hits
Simple Boolean vs. Ranking of result set
Simple Boolean retrieval returns matching documents in no
particular order.
Google (and most well designed Boolean engines) rank the
result set – they rank good hits (according to some estimator
of relevance) higher than bad hits.
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
IR Evaluation
• Efficiency: time, space
• Effectiveness:
• How well does a system retrieve relevant documents?
• Is a system better than another one?
• Metrics often used (together):
• Precision = retrieved relevant docs / retrieved docs
• Recall = retrieved relevant docs / relevant docs
[Figure: Venn diagram of the relevant and retrieved document sets; their overlap is the set of retrieved relevant documents]
IR Evaluation (Cont’)
• Information-retrieval systems save space by using
index structures that support only approximate
retrieval. May result in:
• false negative (false drop) - some relevant documents may
not be retrieved.
• false positive - some irrelevant documents may be
retrieved.
• For many applications a good index should not permit any
false drops, but may permit a few false positives.
• Relevant performance metrics:
• precision - what percentage of the retrieved documents are
relevant to the query.
• recall - what percentage of the documents relevant to the
query were retrieved.
IR Evaluation (Cont’)
• Recall vs. precision tradeoff:
• Can increase recall by retrieving many documents (down to a low
level of relevance ranking), but many irrelevant documents would
be fetched, reducing precision
• Measures of retrieval effectiveness:
• Recall as a function of number of documents fetched, or
• Precision as a function of recall
• Equivalently, as a function of number of documents fetched
• E.g., “precision of 75% at recall of 50%, and 60% at a recall
of 75%”
• Problem: which documents are actually relevant, and
which are not
General form of precision/recall
[Figure: precision (y-axis, up to 1.0) plotted against recall (x-axis, up to 1.0)]
- Precision changes with recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)
An illustration of P/R calculation
Assume: 5 relevant docs.
Rank | Doc | Rel? | Recall | Precision
1 | Doc1 | Y | 0.2 | 1.0
2 | Doc2 | N | 0.2 | 0.5
3 | Doc3 | Y | 0.4 | 0.67
4 | Doc4 | Y | 0.6 | 0.75
5 | Doc5 | N | 0.6 | 0.6
…
[Figure: the (recall, precision) points above, plotted with recall on the x-axis and precision on the y-axis]
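The points in this illustration can be recomputed by walking down the ranked list and updating recall and precision after each document. A minimal sketch, assuming five relevant documents in total and Doc1, Doc3, Doc4 judged relevant (as the plotted points imply); `pr_points` is a hypothetical helper name.

```python
def pr_points(relevance, total_relevant):
    # relevance: one boolean per ranked document, in rank order.
    points, hits = [], 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant  # True counts as 1
        recall = hits / total_relevant
        precision = hits / rank
        points.append((recall, round(precision, 2)))
    return points

# Doc1..Doc5 from the illustration: Y, N, Y, Y, N
points = pr_points([True, False, True, True, False], total_relevant=5)
print(points)
```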
MAP (Mean Average Precision)
$$\mathrm{MAP} = \frac{1}{n} \sum_{Q_i} \frac{1}{|R_i|} \sum_{D_j \in R_i} \frac{j}{r_{ij}}$$
• rij = rank of the j-th relevant document for Qi
• |Ri| = #rel. doc. for Qi
• n = # test queries
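• E.g. ranks of the relevant documents:
Query | 1st rel. doc. | 2nd rel. doc. | 3rd rel. doc.
Q1 | 1 | 5 | 10
Q2 | 4 | 8 | -
$$\mathrm{MAP} = \frac{1}{2}\left[\frac{1}{3}\left(\frac{1}{1} + \frac{2}{5} + \frac{3}{10}\right) + \frac{1}{2}\left(\frac{1}{4} + \frac{2}{8}\right)\right] \approx 0.41$$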
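The same computation as a short Python sketch, taking the ranks of the relevant documents for each query as input. The function names are hypothetical, and every relevant document is assumed to appear somewhere in the ranking, so |Ri| equals the number of ranks given.

```python
def average_precision(relevant_ranks):
    # The j-th relevant document, found at rank r, contributes a precision of j / r.
    ranks = sorted(relevant_ranks)
    return sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)

def mean_average_precision(queries):
    # queries: one list of relevant-document ranks per test query.
    return sum(average_precision(ranks) for ranks in queries) / len(queries)

# The two queries of the example: relevant docs at ranks 1, 5, 10 and at ranks 4, 8.
map_score = mean_average_precision([[1, 5, 10], [4, 8]])
print(round(map_score, 4))
```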
Some other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence = non-retrieved relevant docs / relevant docs
• Noise = 1 – Precision; Silence = 1 – Recall
• Fallout = retrieved irrel. docs / irrel. docs
• Single value measures:
• F-measure = 2 P * R / (P + R)
• Average precision = average at 11 points of recall
• Precision at n documents (often used for Web IR)
• Expected search length (no. irrelevant documents to read before
obtaining n relevant doc.)
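The set-based measures above follow directly from three sets: the retrieved documents, the relevant documents, and the whole collection. A minimal sketch with made-up document IDs; the function `evaluate` is hypothetical and assumes all three sets are non-empty and that some documents are irrelevant.

```python
def evaluate(retrieved, relevant, collection):
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    hits = retrieved & relevant          # retrieved relevant docs
    irrelevant = collection - relevant   # all irrelevant docs in the collection
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "noise": 1 - precision,
        "silence": 1 - recall,
        "fallout": len(retrieved - relevant) / len(irrelevant),
        "f_measure": f_measure,
    }

# Hypothetical collection of 10 docs, 4 of them relevant, 3 retrieved.
scores = evaluate(retrieved={1, 2, 5}, relevant={1, 2, 3, 4}, collection=range(1, 11))
print(scores)
```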
Interactive system’s evaluation
• Definition:
Evaluation = the process of systematically collecting data
that informs us about what it is like for a particular user or
group of users to use a product/system for a particular
task in a certain type of environment.
Problems
• Attitudes:
• Designers assume that if they and their colleagues can use the
system and find it attractive, others will too
• Features vs. usability or security
• Executives want the product on the market yesterday
• Problems “can” be addressed in versions 1.x
• Consumers accept low levels of usability
• “I’m so silly”
Two main types of evaluation
• Formative evaluation is done at different stages of
development to check that the product meets users’ needs.
• Part of the user-centered design approach
• Supports design decisions at various stages
• May test parts of the system or alternative designs
• Summative evaluation assesses the quality of a finished
product.
• May test the usability or the output quality
• May compare competing systems
What to evaluate
Iterative design & evaluation is a continuous process that
examines:
• Early ideas for conceptual model
• Early prototypes of the new system
• Later, more complete prototypes
Designers need to check that they understand users’
requirements and that the design assumptions hold.
Four evaluation paradigms
• ‘quick and dirty’
• usability testing
• field studies
• predictive evaluation
Quick and dirty
• ‘quick & dirty’ evaluation describes the common
practice in which designers informally get feedback from
users or consultants to confirm that their ideas are in-line
with users’ needs and are liked.
• Quick & dirty evaluations are done any time.
• The emphasis is on fast input to the design process rather
than carefully documented findings.
Usability testing
• Usability testing involves recording typical users’
performance on typical tasks in controlled settings. Field
observations may also be used.
• As the users perform these tasks they are watched &
recorded on video & their key presses are logged.
• This data is used to calculate performance times, identify
errors & help explain why the users did what they did.
• User satisfaction questionnaires & interviews are used to
elicit users’ opinions.
Usability testing
• It is very time consuming to conduct and analyze
• Explain the system, do some training
• Explain the task, do a mock task
• Questionnaires before and after the test & after each task
• Pilot test is usually needed
• Insufficient number of subjects for ‘proper’ statistical
analysis
• In laboratory conditions, subjects do not behave exactly
like in a normal environment
Field studies
• Field studies are done in natural settings
• The aim is to understand what users do naturally and how
technology impacts them.
• In product design field studies can be used to:
- identify opportunities for new technology
- determine design requirements
- decide how best to introduce new technology
- evaluate technology in use
Predictive evaluation
• Experts apply their knowledge of typical users, often
guided by heuristics, to predict usability problems.
• Another approach involves theoretically based models.
• A key feature of predictive evaluation is that users need
not be present
• Relatively quick & inexpensive
The TREC experiments
• Once per year
• A set of documents and queries are distributed
to the participants (the standard answers are
unknown) (April)
• Participants work (very hard) to construct and fine-tune their systems, and submit the answers
(1000/query) at the deadline (July)
• NIST people manually evaluate the answers and
provide correct answers (and classification of IR
systems) (July – August)
• TREC conference (November)
TREC evaluation methodology
• Known document collection (>100K) and query set (50)
• Submission of 1000 documents for each query by each participant
• Merge the first 100 documents of each participant -> global pool
• Human relevance judgment of the global pool
• The other documents are assumed to be irrelevant
• Evaluation of each system (with 1000 answers)
• Partial relevance judgments
• But stable for system ranking
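The pooling step can be sketched as a set union over the top-ranked documents of every submitted run. This is an illustration only: `build_pool`, `is_relevant`, the run names and the document IDs are hypothetical, and the example uses a pool depth of 2 instead of 100 to stay small.

```python
def build_pool(runs, depth=100):
    # runs: {system name: ranked list of doc IDs for one query}.
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def is_relevant(doc_id, judgments):
    # judgments covers pooled documents only; unjudged documents count as irrelevant.
    return judgments.get(doc_id, False)

runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d1"]}
pool = build_pool(runs, depth=2)
judgments = {"d1": True, "d2": False, "d4": True}  # made-up human judgments
print(sorted(pool), is_relevant("d3", judgments))
```

Here d3 was returned by sysA but falls outside the pool, so it is treated as irrelevant; this is why the judgments are only partial.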
Tracks (tasks)
• Ad Hoc track: given document collection, different topics
• Routing (filtering): stable interests (user profile), incoming document flow
• CLIR: Ad Hoc, but with queries in a different language
• Web: a large set of Web pages
• Question-Answering: When did Nixon visit China?
• Interactive: put users into action with system
• Spoken document retrieval
• Image and video retrieval
• Information tracking: new topic / follow up
CLEF and NTCIR
• CLEF = Cross-Language Evaluation Forum
• for European languages
• organized by Europeans
• Once per year (March – Oct.)
• NTCIR:
• Organized by NII (Japan)
• For Asian languages
• 1.5-year cycle
Impact of TREC
• Provide large collections for further experiments
• Compare different systems/techniques on realistic
data
• Develop new methodology for system evaluation
• Similar experiments are organized in other areas
(NLP, Machine translation, Summarization, …)
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
IR on the Web
• No stable document collection (spider, crawler)
• Invalid document, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation of document quality
• Multilingual problem
•…
Web Search
• Application of IR to HTML documents on the World
Wide Web.
• Differences:
• Must assemble document corpus by spidering the web.
• Can exploit the structural layout information in HTML (XML).
• Documents change uncontrollably.
• Can exploit the link structure of the web.
Web Search System
[Figure: a Spider crawls the Web to build the Document corpus; the IR System takes a Query String and the corpus and returns Ranked Documents (1. Page1, 2. Page2, 3. Page3, …)]
Challenges
• Scale, distribution of documents
• Controversy over the unit of indexing
• What is a document ? (hypertext)
• What does the user expect to be retrieved ?
• High heterogeneity
• Document structure, size, quality, level of abstraction / specialization
• User search or domain expertise, expectations
• Retrieval strategies
• What do people want ?
• Evaluation
Web documents / data
• No traditional collection
• Huge
• Time and space to crawl and index
• IRSs cannot store copies of documents
• Dynamic, volatile, anarchic, un-controlled
• Homogeneous sub-collections
• Structure
• In documents (un-/semi-/fully-structured)
• Between docs: network of inter-connected nodes
• Hyper-links - conceptual vs. physical documents
Web documents / data
• Mark-up
• HTML – look & feel
• XML – structure, semantics
• Dublin Core Metadata
• Can webpage authors be trusted to correctly mark-up / index their
pages ?
• Multi-lingual documents
• Multi-media
Theoretical models for indexing / searching
• Content-based weighting
• As in traditional IRS, but trying to incorporate
• hyperlinks
• the dynamic nature of the Web (page validity, page caching)
• Link-based weighting
• Quality of webpages
• Hubs & authorities
• Bookmarked pages
• Iterative estimation of quality
Architecture
• Centralized
• Main server contains the index, built by an indexer, searched by a
query engine
• Advantage: control, easy update
• Disadvantage: system requirements (memory, disk, safety/recovery)
• Distributed
• Brokers & gatherers
• Advantage: flexibility, load balancing, redundancy
• Disadvantage: software complexity, update
User variability
• Power and flexibility for expert users vs. intuitiveness and
ease of use for novice users
• Multi-modal user interface
• Distinguish between experts and beginners, offer distinct interfaces
(functionality)
• Advantage: can make assumptions on users
• Disadvantage: habit formation, cognitive shift
• Uni-modal interface
• Make essential functionality obvious
• Make advanced functionality accessible
Search strategies
• Web directories
• Query-based searching
• Link-based browsing (provided by the browser, not the IRS)
• “More like this”
• Known site (bookmarking)
• A combination of the above
Support for Relevance Feedback
• RF can improve search effectiveness … but is rarely used
• Voluntary vs. forced feedback
• At document vs. word level
• “Magic” vs. control
Some techniques to improve IR effectiveness
• Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant
document
• The use of relevance feedback
• To improve query expression:
Qnew = *Qold + *Rel_d - *Nrel_d
where Rel_d = centroid of relevant documents
NRel_d = centroid of non-relevant documents
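The query update can be sketched with sparse term-weight vectors stored as dictionaries. In this sketch the three weighting coefficients are called `alpha`, `beta` and `gamma`; their default values, the function names and the example vectors are all assumptions made for illustration.

```python
def centroid(vectors):
    # Average of sparse vectors given as {term: weight} dictionaries.
    if not vectors:
        return {}
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def update_query(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # Qnew = alpha * Qold + beta * Rel_d - gamma * NRel_d
    rel_c, nrel_c = centroid(rel_docs), centroid(nonrel_docs)
    terms = set(q_old) | set(rel_c) | set(nrel_c)
    return {
        t: alpha * q_old.get(t, 0.0) + beta * rel_c.get(t, 0.0) - gamma * nrel_c.get(t, 0.0)
        for t in terms
    }

q_new = update_query(
    {"jaguar": 1.0},
    rel_docs=[{"jaguar": 1.0, "car": 1.0}, {"car": 1.0, "engine": 1.0}],
    nonrel_docs=[{"jaguar": 1.0, "cat": 1.0}],
)
print(q_new)
```

Terms from relevant documents ("car", "engine") enter the query with positive weight, while "cat" gets a negative weight.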
Modified relevance feedback
• Users usually do not cooperate (e.g. AltaVista in early
years)
• Pseudo-relevance feedback (Blind RF)
• Using the top-ranked documents as if they are relevant:
• Select m terms from n top-ranked documents
• One can usually obtain about 10% improvement
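Blind feedback can be sketched as: take the n top-ranked documents, count their terms, and append the m most frequent new terms to the query. Selecting terms by raw frequency is an assumption made here for brevity (the slide does not say how the m terms are chosen), and `expand_query` is a hypothetical name.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, n=2, m=1):
    # ranked_docs: token lists of the retrieved documents, best first.
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(t for t in doc if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(m)]
    return list(query_terms) + expansion

ranked_docs = [
    ["jaguar", "car", "engine", "car"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat"],  # below the top n, so it is ignored
]
expanded = expand_query(["jaguar"], ranked_docs, n=2, m=1)
print(expanded)
```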
Term clustering
• Based on ‘similarity’ between terms
• Collocation in documents, paragraphs, sentences
• Based on document clustering
• Terms specific for bottom-level document clusters are assumed to
represent a topic
• Use
• Thesauri
• Query expansion
User modelling
• Build a model / profile of the user by recording
• the ‘context’
• topics of interest
• preferences
based on interpreting his/her actions:
• Implicit or explicit relevance feedback
• Recommendations from ‘peers’
• Customization of the environment
Personalised systems
• Information filtering
• Ex: in a TV guide only show programs of interest
• Use user model to disambiguate queries
• Query expansion
• Update the model continuously
• Customize the functionality and the look-and-feel of the
system
• Ex: skins; remember the levels of the user interface
Autonomous agents
• Purpose: find relevant information on behalf of the user
• Input: the user profile
• Output: pull vs. push
• Positive aspects:
• Can work in the background, implicitly
• Can update the master with new, relevant info
• Negative aspects: control
• Integration with collaborative systems
Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI
Document Representation
<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes …
<h1>Agriculture</h1> …
<h1>Chemistry</h1> …
<h1>Computer Science</h1> …
<h1>Electrical Engineering</h1> …
…
<h1>Zoology</h1>
</body>
</html>
Context | Context text | Extents
<title> | <title>department descriptions</title> | 1. department descriptions
<body> | <body>the following list describes … <h1>agriculture</h1> … </body> | 1. the following list describes <h1>agriculture</h1> …
<h1> | <h1>agriculture</h1> <h1>chemistry</h1> … <h1>zoology</h1> | 1. agriculture 2. chemistry … 36. zoology
…
Model
• Based on original inference network retrieval
framework [Turtle and Croft ’91]
• Casts retrieval as inference in simple graphical
model
• Extensions made to original model
• Incorporation of probabilities based on language
modeling rather than tf.idf
• Multiple language models allowed in the network (one
per indexed context)
Model
[Figure: inference network, from top to bottom:
• Model hyperparameters (observed): α,β for each context (title, body, h1)
• Document node (observed): D
• Context language models: θtitle, θbody, θh1
• Representation nodes (terms, phrases, etc…): r1 … rN under each context language model
• Belief nodes (#combine, #not, #max): q1, q2
• Information need node (belief node): I]
Model
P( r | θ )
• Probability of observing a term, phrase, or “concept”
given a context language model
• ri nodes are binary
• Assume r ~ Bernoulli( θ )
• “Model B” – [Metzler, Lavrenko, Croft ’04]
• Nearly any model may be used here
• tf.idf-based estimates (INQUERY)
• Mixture models
Model
P( θ | α, β, D )
• Prior over context language model determined by α, β
• Assume P( θ | α, β ) ~ Beta( α, β )
• Bernoulli’s conjugate prior
• αw = μP( w | C ) + 1
• βw = μP( ¬ w | C ) + 1
• μ is a free parameter
$$P(r_i \mid \alpha, \beta, D) = \int_{\theta} P(r_i \mid \theta)\, P(\theta \mid \alpha, \beta, D)\, d\theta = \frac{tf_{w,D} + \mu P(w \mid C)}{|D| + \mu}$$
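The closed form at the end of this derivation is easy to evaluate directly: the term frequency in the document is smoothed with the collection probability, weighted by the free parameter μ. A minimal sketch; the function name `term_belief` and the numbers in the example are hypothetical, and no default value for μ is implied by the slides.

```python
def term_belief(tf, doc_len, p_collection, mu):
    # (tf_{w,D} + mu * P(w|C)) / (|D| + mu)
    return (tf + mu * p_collection) / (doc_len + mu)

# Hypothetical numbers: a 100-word document, a term with collection probability 0.001.
seen = term_belief(tf=3, doc_len=100, p_collection=0.001, mu=1000)
unseen = term_belief(tf=0, doc_len=100, p_collection=0.001, mu=1000)
print(seen, unseen)
```

A term absent from the document still gets a non-zero belief, and larger μ pulls every estimate toward the collection probability.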
Model
P( q | r ) and P( I | r )
• Belief nodes are created dynamically based
on query
• Belief node CPTs are derived from standard
link matrices
• Combine evidence from parents in various ways
• Allows fast inference by making marginalization
computationally tractable
• Information need node is simply a belief node
that combines all network evidence into a
single value
• Documents are ranked according to:
P( I | α, β, D)
Example: #AND
A | B | P(Q=true|a,b)
false | false | 0
false | true | 0
true | false | 0
true | true | 1
[Figure: parent nodes A and B point to the belief node Q]
$$\begin{aligned}
P_{\#and}(Q = true) &= \sum_{a,b} P(Q = true \mid A = a, B = b)\, P(A = a)\, P(B = b) \\
&= P(t \mid f, f)(1 - p_A)(1 - p_B) + P(t \mid f, t)(1 - p_A)\, p_B + P(t \mid t, f)\, p_A (1 - p_B) + P(t \mid t, t)\, p_A p_B \\
&= 0\,(1 - p_A)(1 - p_B) + 0\,(1 - p_A)\, p_B + 0\, p_A (1 - p_B) + 1\, p_A p_B \\
&= p_A p_B
\end{aligned}$$
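The marginalization over parent states can be checked numerically with a brute-force sum over the link matrix. This is a sketch for illustration only (the names `belief` and `and_matrix` are hypothetical); a real system would use the closed form, which is the point of choosing such link matrices.

```python
from itertools import product

def belief(link_matrix, parent_probs):
    # Sum P(Q=true | parent states) * P(parent states) over all true/false assignments,
    # treating the parents as independent.
    total = 0.0
    for states in product([False, True], repeat=len(parent_probs)):
        weight = 1.0
        for state, p in zip(states, parent_probs):
            weight *= p if state else (1.0 - p)
        total += link_matrix(states) * weight
    return total

def and_matrix(states):
    # Link matrix of #and: Q is true only when every parent is true.
    return 1.0 if all(states) else 0.0

p_a, p_b = 0.8, 0.5
print(belief(and_matrix, [p_a, p_b]))
```

The brute-force sum visits 2^k parent assignments, while the closed form needs only the product of the parent beliefs.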
Query Language
• Extension of INQUERY query language
• Structured query language
• Term weighting
• Ordered / unordered windows
• Synonyms
• Additional features
• Language modeling motivated constructs
• Added flexibility to deal with fields via contexts
• Generalization of passage retrieval (extent retrieval)
• Robust query language that handles many current
language modeling tasks
Terms
Type | Example | Matches
Stemmed term | dog | All occurrences of dog (and its stems)
Surface term | "dogs" | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <"dogs" canine> | All occurrences of dogs (without stemming) or canine (and its stems)
Extent match | #any:person | Any occurrence of an extent of type person
Date / Numeric Fields
Type | Example | Matches
#less | #less(URLDEPTH 3) | Any URLDEPTH numeric field extent with value less than 3
#greater | #greater(READINGLEVEL 3) | Any READINGLEVEL numeric field extent with value greater than 3
#between | #between(SENTIMENT 0 2) | Any SENTIMENT numeric field extent with value between 0 and 2
#equals | #equals(VERSION 5) | Any VERSION numeric field extent with value equal to 5
#date:before | #date:before(1 Jan 1900) | Any DATE field before 1900
#date:after | #date:after(June 1 2004) | Any DATE field after June 1, 2004
#date:between | #date:between(1 Jun 2000 1 Sep 2001) | Any DATE field between 1 Jun 2000 and 1 Sep 2001
Proximity
Type | Example | Matches
#odN(e1 … em) or #N(e1 … em) | #od5(saddam hussein) or #5(saddam hussein) | All occurrences of saddam and hussein appearing ordered within 5 words of each other
#uwN(e1 … em) | #uw5(information retrieval) | All occurrences of information and retrieval that appear in any order within a window of 5 words
#uw(e1 … em) | #uw(john kerry) | All occurrences of john and kerry that appear in any order within any sized window
#phrase(e1 … em) | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System dependent implementation (defaults to #odm)
Context Restriction
Example | Matches
yahoo.title | All occurrences of yahoo appearing in the title context
yahoo.title,paragraph | All occurrences of yahoo appearing in both a title and a paragraph context (may not be possible)
<yahoo.title yahoo.paragraph> | All occurrences of yahoo appearing in either a title context or a paragraph context
#5(apple ipod).title | All matching windows contained within a title context
Context Evaluation
Example | Evaluated
google.(title) | The term google evaluated using the title context as the document
google.(title, paragraph) | The term google evaluated using the concatenation of the title and paragraph contexts as the document
google.figure(paragraph) | The term google restricted to figure tags within the paragraph context
Belief Operators
INQUERY | INDRI
#sum / #and | #combine
#wsum* | #weight
#or | #or
#not | #not
#max | #max
* #wsum is still available in INDRI, but should be used with discretion
Extent / Passage Retrieval
Example | Evaluated
#combine[section](dog canine) | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except it is evaluated for each extent associated with either the title context or the section context
#combine[passage100:50](white house) | Evaluates #combine(white house) on 100-word passages, treating every 50 words as the beginning of a new passage
#sum(#sum[section](dog)) | Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) | Same as previous, except it returns the maximum score
Extent Retrieval Example
<document>
<section><head>Introduction</head>
Statistical language modeling allows formal
methods to be applied to information retrieval.
...
</section>
<section><head>Multinomial Model</head>
Here we provide a quick review of multinomial
language models.
...
</section>
<section><head>Multiple-Bernoulli Model</head>
We now examine two formal methods for
statistically modeling documents and queries
based on the multiple-Bernoulli distribution.
...
</section>
…
</document>
Query:
#combine[section]( dirichlet smoothing )
1. Treat each section extent as a “document”
2. Score each “document” according to #combine( … )
(scores shown next to the three sections above: 0.15, 0.50, 0.05)
3. Return a ranked list of extents:
SCORE | DOCID | BEGIN | END
0.50 | IR-352 | 51 | 205
0.35 | IR-352 | 405 | 548
0.15 | IR-352 | 0 | 50
… | … | … | …
Other Operators
Type | Example | Description
Filter require | #filreq( #less(READINGLEVEL 10) ben franklin ) | Requires that documents have a reading level less than 10. Documents are then ranked by the query ben franklin
Filter reject | #filrej( #greater(URLDEPTH 1) microsoft ) | Rejects (does not score) documents with a URL depth greater than 1. Documents are then ranked by the query microsoft
Prior | #prior( DATE ) | Applies the document prior specified for the DATE field
System Overview
• Indexing
• Inverted lists for terms and fields
• Repository consists of inverted lists, parsed documents, and
document vectors
• Query processing
• Local or distributed
• Computing local / global statistics
• Features