Examining the GloWbe corpus for lexicographic evidence in SgE, MyE and HK English
Vincent B Y Ooi
vinceooi@nus.edu.sg
National University of Singapore
Outline
• A. Using the Internet as a lexicographic resource
• B. Summary and application of Davies and Fuchs’ (2015) findings to Singapore, Malaysia and Hong Kong
• C. Findings beyond Davies and Fuchs: Singapore, Malaysian and Hong Kong English
• D. Evaluation of the GloWbe corpus for lexicographic evidence
A. Using the Internet as a lexicographic resource
• Fuertes-Olivera (2012): “I am using the concept of corpus in a lexicographical way, i.e., a lexicographical corpus is any collection of texts where lexicographers can find inspiration for completing the dictionary structures they need when making a real dictionary… I will focus on ways of exploiting and exploring the Internet as a lexicographical corpus, i.e., the virtual space in which lexicographers can easily access data they might need.”
A. Sinclair (2004) On the Web…
• “The World Wide Web is not a corpus, because its dimensions are unknown and constantly changing, and because it has not been designed from a linguistic perspective. At present it is quite mysterious, because the search engines, through which the retrieval programs operate, are all different, none of them are comprehensive, and it is not at all clear what population is being sampled.”
A. Sinclair, on curating the Web
• “It is important to know precisely what is actually copied or downloaded from a web page. This is not always obvious, and quite often it is not at all the document that is required…The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder.”
B. English World-Wide 36:1 (Feb 2015)
“The Internet as a lexicographical resource” can be reified, especially for those interested in varieties of English, in the 1.9 billion-word Global Web-based English Corpus (GloWbe).
• Davies and Fuchs (2015), in the February issue of English World-Wide
• Responses to D&F by:
  i) Christian Mair, for Nigerian English etc.
  ii) Joybrato Mukherjee, for South Asian English
  iii) Pam Peters, for Australian English etc.
  iv) Gerald Nelson, for the ICE corpus
B. Davies and Fuchs on the ICE corpus
• ICE corpus
  ◦ 1m words each (600,000 spoken; 400,000 written)
  ◦ 14 varieties of English (all of which GloWbe also covers), a total of merely 12.2 million words
• However, ICE is limited in size
  ◦ Enough data for high-frequency syntactic constructions only
  ◦ Not so useful for lexical variation, which needs more data examples (for lexicographic evidence)
B. Motivation for GloWbe
• Need for a larger corpus to study World Englishes
• The GloWbe corpus
  ◦ 1.9b words
  ◦ 20 different countries (6 inner circle, 14 outer circle)
  ◦ Notice that Expanding Circle countries (e.g. Japan, China, Korea) are excluded
  ◦ Strength: compare the frequency of a word, phrase or grammatical construction across these different varieties of English → mapping different varieties of English
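Because the country sub-corpora differ in size, raw hit counts are not directly comparable; one common approach is to normalize each count to a rate per million words. A minimal Python sketch of that idea follows. The hit counts and sub-corpus sizes are hypothetical placeholders for illustration, not GloWbe results, and `per_million` is an illustrative helper name.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical figures for illustration only; NOT taken from GloWbe.
subcorpora = {
    "SG": {"hits": 400, "size": 40_000_000},
    "MY": {"hits": 150, "size": 50_000_000},
    "HK": {"hits": 10, "size": 20_000_000},
}

# Rate per million words for each variety.
normalized = {
    code: per_million(data["hits"], data["size"])
    for code, data in subcorpora.items()
}

# List varieties from the highest to the lowest normalized frequency.
for code, rate in sorted(normalized.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{code}: {rate:.2f} per million words")
```

Comparing rates rather than raw counts keeps a large sub-corpus from looking like a heavier user of a word simply because it contains more text.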
B. D&F on data collection
• Genre balance
  ◦ Between formal & informal language (like ICE corpora)
  ◦ ~60% from informal blogs, ~40% from other formal genres & text types
• Accuracy in identifying dialect
  ◦ Google “Advanced Search”, with the search limited by region (← we’ll revisit this later)
B. Size by country (VO – note the uneven sizes)
B. D&F – Lexical variation freak out
B. D&F – Concordance for freak out
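A concordance of this kind lists each hit with its left and right context (keyword in context, KWIC). The following is a minimal sketch of the general technique in Python, not the GloWbe interface itself; the function name `kwic` and the sample sentence are invented for illustration.

```python
import re


def kwic(text: str, phrase: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per case-insensitive match of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, for illustration only.
sample = "Please do not freak out about the exam. Some students Freak Out anyway."
for line in kwic(sample, "freak out", width=20):
    print(line)
```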
B. D&F – Lexical variation fortnight
• More British English than U.S. English
B. D&F – Lexical variation banjaxed
• Irish English (‘ruined’, ‘screwed up’)
B. D&F – Lexical variation eve teas
• “Public sexual harassment”
B. D&F – Lexical variation handphone
• Mobile / cell phone
B. D&F lexical variation: equipments
B. D&F phraseology [keep in] view
B. D&F phraseology [discuss] about
B. D&F (be) different to
B. D&F (be) different than
B. D&F – had + {gotten/got}
B. D&F – the quotative “like” construction (May Wong on HkE)
B. Singular/plural agreement: Each of them is/are (“innovative” plural)
B. The “way” construction (not typically HkE – May Wong)
C. Singapore: killer litter
C. Dictionary entry for killer litter (this is not Singlish)
• killer litter /…./ noun (uncount; Singapore and Malaysian English)
• Killer litter is something heavy, e.g. a television, that is disposed of by being thrown from the higher storeys of a building, putting passers-by below at risk of injury: The throwing of killer litter is irresponsible and highly dangerous.
C. GloWbe: killer litter
C. GloWbe concordance: killer litter
C. Google Advanced search
C. Google adv search: killer litter (Au)
C. GloWbe: lepak (Malaysian and Singapore English; not in HkE)
C. Concordance of lepak (MyE; SgE)
C. Google adv search: lepak (MyE)
C. Oxforddictionaries.com: lepak
C. GloWbe: shroff
C. Concordance for shroff
C. shroff in oxforddictionaries.com
C. Dictionary defn for shroff/shroffing
C. Measuring diglossia – kiasu, the most prototypical “Singlish” item
C. Measuring diglossia – kiasu for Sri Lanka?!
C. TCEED2 Appendix entry for kiasu
• kiasu / … / adjective
• (of a person) afraid to lose out.
C. kiasu in Oxforddictionaries.com
D. Evaluating GloWbe for lexicographic evidence
• What does GloWbe represent? “Whatever is found on the web…[so] it may include very little from certain genres, such as students’ academic writing, fiction and business letters.” (D&F responding to Nelson)
• “Blogs are not the same as spontaneous spoken conversation” (D&F). This may pose an issue for capturing informal/colloquial Malaysian, Singapore and Hong Kong English (“Singlish”, for instance, is inherently spoken in nature). Still, GloWbe is remarkable in capturing quite a number of the sociocultural features characteristic of the informal varieties, e.g. kiasu, the most prototypical Singlish item.
D. Evaluating GloWbe for lexicographic evidence
• Mair asks whether blogs constitute a recognizable genre in the first place.
• While this is a fair question, the 60% proportion of blogs may mean that everyday topics and everyday values are represented in the personal blog (but D&F haven’t disclosed the proportion of different types of blogs, e.g. travel blogs). There’s also the question of ‘blog death’, so it would be good to know how the sampling is done.
D. Evaluating GloWbe for lexicographic evidence
• Gerry Nelson and J. Mukherjee suggest that some writers from a particular country domain may not actually be from the country in question. D&F say that they provide the original URLs for each of the 1.8 million pages. Users may want to examine the original pages in doubtful cases.
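One way to decide which pages deserve a closer look is to compare each page's domain suffix with the country assigned to it. The sketch below assumes a hypothetical list of (URL, country code) records; the `example` domains and both helper names are invented, and a matching suffix is only a hint, since it says nothing certain about where the writer is from.

```python
from urllib.parse import urlparse


def country_suffix(url: str) -> str:
    """Return the last label of the URL's host name (e.g. 'sg'), or '' if none."""
    host = urlparse(url).hostname or ""
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[-1].lower()


def needs_manual_check(url: str, assigned_country: str) -> bool:
    """Flag a page whose domain suffix differs from its assigned country code."""
    return country_suffix(url) != assigned_country.lower()


# Hypothetical records: (source URL, country code assigned in the corpus).
pages = [
    ("http://blog.example.sg/post-1", "sg"),
    ("http://www.example.com/forum/thread-2", "my"),
    ("http://news.example.hk/story-3", "hk"),
]

for url, country in pages:
    if needs_manual_check(url, country):
        print(f"check by hand: {url} (assigned {country.upper()})")
```

Pages on generic domains such as .com are flagged here simply because the suffix carries no country information, so the original page has to be read to judge the variety.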
D. Evaluating GloWbe for lexicographic evidence
• In conclusion, GloWbe is useful as a welcome and additional “toolbox” for researchers of world Englishes. It should be triangulated with the ICE corpus and other sources of data available.
• GloWbe still allows us to confirm many of our intuitions and provisional findings on varieties of English. For Stephanie Horch (Mair’s student), “GloWbe is the best source of data: free, fast, vast.”
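When triangulating, a frequency found in one corpus can be checked against another with a log-likelihood (G2) score, a statistic widely used for comparing word frequencies across corpora of different sizes. This is a sketch of the simplified two-term form, offered as one possible approach rather than a method used by D&F; the function name is illustrative and the counts in the usage line are hypothetical.

```python
import math


def log_likelihood(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """G2 score for one word's frequency in corpus A versus corpus B."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: 30 hits in 1,000 words versus 10 hits in 1,000 words.
score = log_likelihood(30, 1_000, 10, 1_000)
# Scores above 3.84 correspond to p < 0.05 (chi-square, 1 degree of freedom).
print(f"G2 = {score:.2f}")
```

A score near zero means the two corpora agree on the word's relative frequency; a large score suggests a real difference, or a sampling difference worth investigating.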
D. Evaluating GloWbe for lexicographic evidence
• Disadvantages
  ◦ No actual spoken material
  ◦ Each website is attributed to a particular country, but the writer’s origin is not checked
• Davies & Fuchs encourage us to use the various corpora available in a combinational & complementary way (my emphasis!)
References
• Bolton, K. 2003. Chinese Englishes: A Sociolinguistic History. Cambridge: Cambridge University Press.
• Davies, M. and R. Fuchs. 2015. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English corpus (GloWbe). English World-Wide 36:1, pp. 1-29.
• Fuertes-Olivera, P. 2012. Lexicography and the Internet as a (Re-)source. Lexicographica 28:1.
• Kilgarriff, A. and G. Grefenstette. 2003. Web as corpus. URL: http://www.kilgarriff.co.uk/Publications/2003-KilgGrefenstetteWACIntro.pdf
• Sinclair, J. 2004a. Corpus and text – basic principles. In Developing Linguistic Corpora: A Guide to Good Practice. URL: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm
• Sinclair, J. 2004b. Appendix – how to build a corpus. URL: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/appendix.htm
Thank you for your kind attention!