Digital information discovery systems for universities

Prepared to support the presentation of an invited lecture at the
International Conference on Digital Libraries (ICDL),
in Delhi, India, 27-29 November 2013,
by Paul.Nieuwenhuysen@vub.ac.be,
2B114, Vrije Universiteit Brussel, B-1050 Brussel, Belgium
Text published in Proceedings ICDL2013.
These slides should be available from the WWW site
http://www.vub.ac.be/BIBLIO/nieuwenhuysen/presentations/
(note: BIBLIO and not biblio)
Contents = summary = structure = overview of this presentation
• Introduction
• Research problem
• Findings
1. Federated search
2. Merging information
3. Commercial discovery systems versus Google Scholar
4. Empirical case studies
• Concluding remarks
INTRODUCTION:
Information discovery & access
1. Information discovery → information discovery system
2. Information delivery / access → information delivery / access system
INTRODUCTION:
Information discovery is important
• The quantity of
information in
digital form is
growing fast.
• Information sources are scattered.
• Even metadata are scattered.
→ NO single simple way to discover
suitable sources
in a dynamic digital network
environment.
INTRODUCTION:
Information discovery process
Numerous indexing and abstracting services
Primary, scholarly information sources
INTRODUCTION:
Scattering of sources
• Integration / aggregation
is still far from perfect. ☹
INTRODUCTION:
Scattering of sources → difficulties
Using several information retrieval systems costs time:
»to learn about the contents and purposes of each database,
»to choose one or several suitable databases,
»to learn about the various user interfaces and efficient
ways to query each chosen database, which is confusing,
»to formulate a suitable query adapted to the target
database,
»to inspect the results…
»to repeat all the actions above for each further selected
database,
»to merge and deduplicate the interesting results
& to save these in some way
(which is hindered by variations among the visible output
formats on the computer display and by variations among the
field structures of records from various databases). ☹
INTRODUCTION:
Scattering of sources → difficulties
Besides this user’s viewpoint,
the viewpoint of librarians is that they spend a considerable
part of their budget on databases,
while these may not be well appreciated and exploited
effectively by their clients. ☹
PROBLEM STATEMENT
1. Which methods are available these days to
make it easier and more efficient to find
relevant information sources?
2. What are the pros and cons of these
methods?
In particular we focus on methods suitable for exploratory
discovery of scholarly information sources in an academic
environment
by users who do not have much experience with selecting
and searching specific databases,
such as undergraduate students
& others without expertise in the domain where their new
information need occurs.
FINDINGS
Scattering of sources → difficulties → solutions?!
• Federated searching
• Merging of databases
• Federated search
Federated searching
through scattered databases
[Diagram: each user sends a query to the federated search system, which passes it on to the separate search engine of each database (three search engine / database pairs are drawn).]
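The arrangement on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three databases, their contents and the function names `search_one` and `federated_search` are hypothetical, and the "search engine" of each database is reduced to a naive substring match.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three scattered databases,
# each with its own (very naive) search engine.
DATABASES = {
    "db_a": ["Togu na: meeting house of the Dogon", "Masks of central Africa"],
    "db_b": ["Federated search in libraries", "Dogon architecture and the togu na"],
    "db_c": ["Union catalogues in Europe"],
}

def search_one(db_name, query):
    """Search engine of one database: case-insensitive substring match on titles."""
    q = query.lower()
    return [(db_name, title) for title in DATABASES[db_name] if q in title.lower()]

def federated_search(query):
    """Broadcast the query to every database at search time and concatenate the answers."""
    with ThreadPoolExecutor(max_workers=len(DATABASES)) as pool:
        per_db = list(pool.map(lambda name: search_one(name, query), DATABASES))
    return [hit for hits in per_db for hit in hits]

hits = federated_search("togu na")
for db_name, title in hits:
    print(db_name, "|", title)
```

Note that nothing is analysed before the query arrives: the results are simply concatenated in database order, so relevance ranking and deduplication across databases would have to be done on the fly.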
Federated searching:
terminology / vocabulary / synonyms
federated searching
= meta-searching = metasearching
= cross-database searching
= multi-database searching
= multi-threaded searching
= one-stop searching
= poly-searching = polysearching
= broadcast searching
= searching through a portal / gateway
Information discovery process
with federated search
Some federated search system
Numerous indexing and abstracting databases
Primary, scholarly information sources
• Merging of information into 1 database
Merging information
into a searchable database
[Diagram: several databases, web sites or other sources are merged into one aggregated database, which users query through a single search engine.]
Merging: applications:
Example: union catalogues of libraries
Information discovery process
with an integrating system
based on merging into 1 database
Some integrating
information discovery system
Numerous
indexing and abstracting databases
Primary, scholarly information sources
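As a contrast with federated searching, the merging approach can be sketched as follows. This is a minimal illustration, not the method of any particular product: the two sources, their field names (`title`/`year`, `ti`/`py`), the second record and the matching key are all assumptions made for the example.

```python
import re

# Hypothetical exports from two sources; note the different field structures.
SOURCE_A = [{"title": "Togu Na", "year": 1977}]
SOURCE_B = [{"ti": "togu na.", "py": "1977"},
            {"ti": "Example title two", "py": "2001"}]

def normalise(title, year):
    """Build a match key: lower-case title without punctuation, plus the year."""
    words = re.sub(r"[^a-z0-9 ]", "", title.lower()).split()
    return (" ".join(words), int(year))

def build_aggregated_database(sources):
    """Merge all sources BEFORE any search; records with the same key become 1 record."""
    merged = {}
    for source_name, records, title_field, year_field in sources:
        for rec in records:
            key = normalise(rec[title_field], rec[year_field])
            entry = merged.setdefault(
                key, {"title": rec[title_field], "year": key[1], "sources": []}
            )
            entry["sources"].append(source_name)
    return merged

def search(database, query):
    """Single search engine on top of the aggregated database."""
    q = query.lower()
    return [rec for key, rec in database.items() if q in key[0]]

database = build_aggregated_database([
    ("catalogue", SOURCE_A, "title", "year"),
    ("index", SOURCE_B, "ti", "py"),
])
results = search(database, "togu na")
print(results)
```

Because the merge happens ahead of search time, duplicates are eliminated once and each merged record remembers which sources contributed to it.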
• Comparison of merging databases with federated searching
Comparison of methods for
information retrieval
• The more general evolution of information and
communication technology has partially determined the
evolution of the information retrieval systems discussed
here:
»Federated searching has been pushed forward since the
Internet made implementations possible with acceptable
speed.
»Merging information sources has more recently seen more
implementations due to increasing capacities of computers
and hard disks at decreasing prices.
Comparison of methods
for information retrieval
[Table comparing the two methods (rows: Merging databases; Federated searching) on five criteria (columns):
1. Size of the coverage
2. Independent of Internet / WWW
3. Up-to-date information
4. Pre-search analysis of all data (for better relevance ranking, to eliminate duplicates, to merge related database records into 1 record, etc…)
5. Speed of retrieval and display
Cell scores in extraction order: -+, +-, +-, -+, +, +, -, +-+]
Comparison of methods for
information retrieval: conclusions
• A single, simple, standard method = approach = solution
does NOT (yet) exist.
• Two basic methods are common.
• They have their own advantages and disadvantages.
• Commercially available information discovery systems / services
Federated search versus merging
in digital libraries
• In digital library searches:
»Up to date information is not crucial in most cases,
so that federated search is not required.
»The method of a priori merging sources can perform better
than federated search.
• Therefore a few big players in the library information
industry have built services based on this method,
even though considerable investments are needed in
terms of computer systems, manpower, internet connectivity,
etc.
Commercial
information discovery services
• Several companies offer discovery services that are based
mainly on collocating existing bibliographic databases
into bigger merged databases
to obtain a fast and panoramic discovery system
that is hosted somewhere on the WWW = ‘in the cloud’.
Such a discovery system can include > 1 BILLION items!
Commercial information
discovery services as OPAC
• The contents of the catalog of the library that implements
such a system can also be imported into the database of the
system. ☺
Commercial
information discovery services
• Terms used for such systems are
»Information discovery systems
»Resource discovery systems
»Web-scale discovery systems
• Their strength in usability and searching makes them also
usable as “next generation library catalogs”.
Commercial
information discovery services
A few producers and systems / services:
»EBSCO Publishing offers EBSCO Discovery Service = EDS
»Ex Libris offers Primo
»Innovative Interfaces
»OCLC
»ProQuest (ex-Serials Solutions) offers Summon
Information discovery process with an
“information discovery system”
based on merging into 1 database
Some integrating
information discovery system
Numerous
indexing and abstracting databases
Primary, scholarly information sources
The online catalog:
evolution
[Diagram with two axes, COVERAGE (less ☹ to more ☺) and FUNCTIONALITY (less to more), on which the evolution of the online catalog is drawn.]
Commercial
information discovery services
Comments by librarians range
from enthusiastic ☺
to skeptical ☹.
Commercial
information discovery services
Decentralized:        Centralized / integrated:
Native database 1     Content of database 1
Native database 2     Content of database 2
Native database 3     Content of database 3
Native database 4     Content of database 4
Etc…                  Etc…
OPAC                  OPAC
Commercial
information discovery services
Decentralized:                      Centralized / integrated:
Content centered                    User centered
Coupled content & user interface    Decoupled content ↔ user interface
Many different user interfaces      One uniform user interface
Commercial
information discovery services
[Diagram: User(s) connected to: the classical, integrated library management system (OPAC, catalog database, lending management); 1 or several union catalogs; 1 or several federated search systems; an information discovery system.]
Information discovery services:
limitations / drawbacks
• These discovery systems offer a huge amount of
metadata,
but they cannot and do NOT cover
»ALL information published,
»ALL information available directly in full text,
»ALL publications that have been licensed by the local
library for a fee. ☹
• Commercially available information discovery systems
compared with free discovery services
Commercial discovery services
versus free discovery services
• Besides the various discovery systems mentioned above,
which can be implemented by a digital library service,
many great discovery systems have become available
relatively recently, which offer
»a high coverage,
»a user friendly interface,
»all this free of charge.
Commercial discovery services
versus free discovery services
• The availability of more high-quality free discovery
services leads to
»the declining value of subscription-based abstracting and
indexing services
»doubts among librarians about the cost-effectiveness of the
commercial information discovery services
Commercial discovery services
versus free discovery services
• Example:
Of course the popular general WWW search systems,
led by Google for a few years now.
Commercial discovery services
versus free discovery services
• Example:
More specialized but similar systems devoted to scholarly
information, such as Google Scholar.
This is a relatively ‘new kid on the block’.
The system provides good coverage and is increasingly
used by students and researchers as a discovery system.
‘It appears that Google Scholar has supplanted the
traditional library bibliographic database as a means of
subject searching for journal full-text.’
Commercial discovery services
versus free discovery services
• Producers and vendors of commercially available
discovery systems talk and write about their competition
as if this consists of the few other similar commercially
available products.
• This is misleading.
• Who is a really significant competitor?
Google Scholar:
screenshot
Google Scholar
coverage and quality
Google Scholar is steadily improving in coverage and
quality.
Chen, Xiaotian (2010)
Google Scholar's Dramatic Coverage Improvement Five Years after Debut.
Serials Review, vol. 36, no. 4, pp. 221-226.
Google Scholar
coverage and quality
“Google Scholar’s coverage is also comprehensive”
Harzing, A.W. (2013)
A preliminary test of Google Scholar as a source for citation data: A longitudinal study of Nobel Prize winners.
Scientometrics, vol. 93, no. 3, pp. 1057-1075.
Google Scholar
coverage and quality
“Our data suggest that Google Scholar coverage is now
increasing at a stable rate”
Harzing, A.W. (2013)
A LONGITUDINAL STUDY OF GOOGLE SCHOLAR COVERAGE BETWEEN 2012 AND 2013
http://www.harzing.com/download/gs_coverage.pdf
Information discovery services
versus Google Scholar
• It is hard for companies in the information industry
to compete with the leading big company Google
that produces Google Scholar
&
that offers this free of charge on the public internet.
Information discovery services
versus Google Scholar
[Diagram: two routes to the primary, scholarly information sources: either some federated search system or some integrating information discovery system on top of numerous indexing and abstracting databases, OR ? the Google Scholar search & discovery system/service.]
Information discovery services
versus Google Scholar
Coverage supports exploratory search:
Google Scholar: +
Commercially available discovery systems: +
Information discovery services
versus Google Scholar
Search results offer the user a link to local library
holdings and access rights
(if the local library integrates its knowledge base with the
discovery system in a link generator).
If the desired document is not directly available,
then the user can directly request the document from the
local library document delivery service
(if the local library integrates this service with the discovery
system in a link generator).
Google Scholar: +
Commercially available discovery systems: +
Information discovery services
versus Google Scholar
Can be used / implemented by a library free of charge:
Google Scholar: +
Commercially available discovery systems: -
(a few 1000 $ per implementation in/by a library,
besides costs of access to databases)
Coverage includes not only classical publications,
but also other files on the WWW,
such as web pages and presentation files:
Google Scholar: +
Commercially available discovery systems: -
Information discovery services
versus Google Scholar
Provides links from a bibliographic description
NOT only to the publication on the site of the publisher,
(which is perhaps NOT accessible)
but also to open access copies
on the website of the author at a university
Google Scholar: +
Commercially available discovery systems: -
Information discovery services
versus Google Scholar
Ranking of results exploits citations received by the retrieved document
→ more influential documents rank higher:
Google Scholar: +
Commercially available discovery systems: -
Each document is accompanied by the number of citations received
from other documents & by links to those citing documents:
Google Scholar: +
Commercially available discovery systems: -
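How citations can influence ranking may be illustrated with a small sketch. It is emphatically not the ranking algorithm of Google Scholar, which is not public; the documents, the citation counts and the scoring formula (number of matching query terms, boosted by the logarithm of the citation count) are assumptions chosen only to show the principle.

```python
import math

# Hypothetical documents with hypothetical citation counts.
DOCS = [
    {"title": "masks of central africa", "cited_by": 2},
    {"title": "masks and ritual", "cited_by": 84},
    {"title": "wood carving", "cited_by": 500},
]

def score(doc, query):
    """Matching query terms in the title, boosted by log(1 + citations received)."""
    terms = query.lower().split()
    words = doc["title"].lower().split()
    matches = sum(1 for term in terms if term in words)
    return matches * (1.0 + math.log1p(doc["cited_by"]))

def rank(docs, query):
    """Return matching documents, the more influential ones first."""
    scored = [(score(doc, query), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for s, doc in scored if s > 0]

ranked = rank(DOCS, "masks")
for doc in ranked:
    print(doc["cited_by"], "citations |", doc["title"])
```

The highly cited but non-matching document is left out: in this sketch citations only reorder the documents that match the query.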
Information discovery services
versus Google Scholar
Offers search for documents on the WWW,
with a similar user interface:
Google Scholar: + (Google = classic WWW search)
Commercially available discovery systems: -
Information discovery services
versus Google Scholar
Offers search for images on the WWW,
with a similar user interface:
Google Scholar: + (images.google)
Commercially available discovery systems: -
Information discovery services
versus Google Scholar
If the service / system is chosen by a library,
then branding by the library is possible.
(But this is probably more important
for libraries as organizations
than for the users they are serving, who do not care about
who provides a good service.)
Google Scholar: -
Commercially available discovery systems: +
Information discovery services
versus Google Scholar
The local library can export local catalogue / holdings,
for import in the database of the discovery system,
mainly to add bibliographic descriptions of unique, local
items that are not yet included from other sources
Google Scholar: -
Commercially available discovery systems: +
Overlaps
of bibliographical databases
[Diagram of three overlapping sets: commercial information discovery service; catalogue of the library; Google Scholar discovery service.]
Information discovery services versus
Google Scholar: limiting to the library
• An additional aspect in the comparison:
Can the system / service limit searches to documents
available in the library?
• Commercially available information discovery systems
are built to do this,
while Google Scholar does NOT work in this way.
• This aspect is exploited by producers of discovery systems
to convince librarians that their system is superior
& that Google Scholar should not even be considered as
an alternative.
• Commercially available information discovery systems
versus Google Scholar in case studies
Information discovery services versus
Google Scholar: case studies
• Empirical case studies have been carried out
»to assess the validity and reality of the general comparison
»to compare the precision in the results of searches
Information discovery services versus
Google Scholar: case studies
• The precision of the first 10 results is determined by
several factors such as
1. coverage,
2. enrichment of metadata,
3. inclusion or not, of full-text of the document in indexing,
4. taking into account or not, of the citations and links
received from other document files,
5. indexing algorithms,
6. relevance ranking algorithms, etc…
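The measure used here, the precision of the first 10 results, is simply the fraction of those 10 results that the evaluator judges relevant. A minimal sketch follows; the function name and the two example lists of relevance judgments are hypothetical and only illustrate the calculation.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the first k results judged relevant.

    judgments: list of booleans in ranked order (True = relevant).
    Missing results (fewer than k returned) count as not relevant.
    """
    return sum(judgments[:k]) / k

# Hypothetical judgments for the first 10 results of two systems.
system_a = [True, False, True, False, True, False, True, False, True, False]
system_b = [True] * 10

p_a = precision_at_k(system_a)
p_b = precision_at_k(system_b)
print(p_a, p_b)
```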
Information discovery services versus
Google Scholar: test method
• Identical searches have been carried out in different
information discovery systems.
Information discovery services versus
Google Scholar: test method
• As examples of information discovery services,
we used the implementation of Summon
»at Chalmers University in Sweden &
»at ULB in Belgium.
• These services can be used by anyone free of charge from
anywhere
(at least the discovery / search phase;
access / delivery of full document in the case of licensed
materials is restricted.)
Information discovery services versus
Google Scholar: test method
• The search options are set
»to include material that is NOT directly made accessible by
the library, to expand coverage
(This is NOT a default setting.)
»to rank results according to relevance as estimated by the
system
Information discovery services versus
Google Scholar: test method
• The topics for searches are well known to the user who
performs the tests.
• The queries are simple, using only 1 or 2 words and no
operators, to simulate queries of non-expert users for
whom these systems have been developed in the first
place.
• The relevance of results and links to further information
have been evaluated.
Information discovery services versus
Google Scholar: test case 1
• A test of finding information on a particular, concrete
subject / topic:
the wooden pillars / poles / posts of the meeting house for
communal decision making,
with a low ceiling,
which is present in each village of the Dogon people in
Mali, West Africa; these are often decorated with a
protective spirit in the form of a female (or male) figure;
such a house is named toguna or togu na (with a space).
• Search query to start with is: toguna
Information discovery services versus
Google Scholar: test case 1, results IDS
• The Summon information discovery services gave NO
relevant results in the top 10 results
in 2 tests that were performed with an interval of a few
weeks. ☹
Information discovery services versus
Google Scholar: test case 1, results IDS
• Using as search term togu na:
»At Chalmers University again NO relevant result, even
though a book has been published with this title;
restricting results to books gives again NO relevant result. ☹
»At ULB result 1 refers to the book published with this title
and to the copy available in the ULB library.
• So when the book is not in the library collection, the user
does not discover it, even though this is the most important
publication on the subject. ☹
Information discovery services versus
Google Scholar: test case 1, results GS1
• Google Scholar, in a first test, extended the search
automatically to include togu na with a space. ☺
• Results 1, 2, 4 point immediately to the important
publication
= the printed book with title Togu na published in 1977. ☺
• These results also link to other documents that include a
citation to this book! ☺
• Each result offers also links to related documents and
some of these are also relevant. ☺
Information discovery services versus
Google Scholar: test case 1, results GS2
• In a test that was performed a few weeks later,
Google Scholar does NOT extend the search automatically
to include togu na with a space;
the first 10 results are NOT relevant.
• When the query is changed to togu na (with the space),
then results 2, 3, 4 point immediately to the important
publication = the printed book with title Togu na.
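The difference between the two tests comes down to whether the spelling variants toguna and togu na are treated as the same query. One simple way to obtain that behaviour is sketched below; this is an assumption for illustration, not a description of how Google Scholar or Summon handle such variants, and apart from Togu na the titles are hypothetical.

```python
def squash(text):
    """Lower-case and drop all whitespace, so 'togu na' and 'toguna' compare equal."""
    return "".join(text.lower().split())

def search_ignoring_spaces(titles, query):
    """Return the titles that contain the query once spaces are ignored on both sides."""
    q = squash(query)
    return [title for title in titles if q in squash(title)]

# 'Togu na' is the book title from the test case; the other titles are hypothetical.
TITLES = ["Togu na", "Toguna pillars", "Dogon villages"]

with_space = search_ignoring_spaces(TITLES, "togu na")
without_space = search_ignoring_spaces(TITLES, "toguna")
print(with_space, without_space)
```

The price of this simple normalisation is a risk of false matches across word boundaries, so a real system would probably apply it more selectively.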
Information discovery services versus
Google Scholar: test case 1, results GS2
• These results also link to other documents that include a
citation to this book! ☺
• Each result offers also links to related documents and
some of these are also relevant. ☺
Information discovery services versus
Google Scholar: test case 1, results GS2
• Using classical Google web search with both queries does
not directly reveal the book.
• Using Google Books with the term toguna, the system
does NOT reveal the book,
but when the query togu na is used, then result 1 gives the
book. ☺
Information discovery services versus
Google Scholar: test case 1, results GS2
• A few book titles that have been published were searched
in Google Scholar;
they were NOT found, in July 2013.
All these findings indicate that Google Scholar has
changed search functionality
as well as inclusion of scholarly books.
Information discovery services versus
Google Scholar: test case 2
1. A test of finding more information about the only
journal article that has been published in a well-known scholarly journal
by two specific authors known by the user.
2. Search query is: mettrop nieuwenhuysen
Information discovery services versus
Google Scholar: test case 2, results IDS
The Summon implementations give directly the
bibliographical description of the article,
as result 1, as expected. ☺
Information discovery services versus
Google Scholar: test case 2, results GS
1. Google Scholar gives also directly the bibliographical
description of the journal article, as result 1, as expected! ☺
2. Furthermore, the service informs us directly that this
article has received 84 citations, as found by Google ☺
& it provides links to those citing documents! ☺
3. Results 2, 3, 5, 6, 8 give descriptions of presentations at
conferences and preliminary publications that are related
to the main published article and that may also be
relevant! ☺
Information discovery services versus
Google Scholar: test case 3
1. A test of finding information about the famous type of
mask created by the Songye people in DRC, Africa,
which is named kifwebe.
2. Search query is: kifwebe
Information discovery services versus
Google Scholar: test case 3, results IDS
• The Summon implementations give links to at most 5
relevant document descriptions in the first 10 results; most
results link to an image and not to a document. ☹
• When the search is refined by choosing
Limit to articles from scholarly publications,
then about 5 of the first 10 results are relevant.
Information discovery services versus
Google Scholar: test case 3, results GS
• Google Scholar always gives mainly links to scholarly
sources by default
& here the first 10 results are all relevant, most of them
written by authors respected in the field of study;
citations to printed books are included. ☺
• Furthermore, as always, the service gives us the number of
citations received by each document, as found by Google ☺
& it provides links to those citing documents! ☺
Information discovery services versus
Google Scholar: results
The outcome of these case studies can be formulated briefly
and roughly by the scores:
Information
discovery
service
0-3
Information discovery services versus
Google Scholar: discussion
• Evaluating and comparing the various systems / services
is hindered
»by lack of information about their coverage,
their indexing methods, and their search & ranking
algorithms
»by the changes over time in their functions and
performance
CONCLUDING REMARKS
• Increasing the chance that users discover the most
suitable relevant information is an important task of
libraries and information services.
• More methods / techniques / services / systems are
available than ever before to assist users with information
discovery.
• The systems are evolving fast.
• Implementing a commercial information discovery
system brings an additional COST to an information
service.
• The aim of this work is to help librarians to make
decisions on the way forward in information discovery.
Questions are welcome