WrELFA corpus progress report: 500k words

This little fellow wants Dionysus’ grapes. From the Capitoline Museums in Rome.
© Nina Valtavirta

There’s growing interest within English as a lingua franca (ELF) research in describing written ELF. Up to now, ELF data has almost exclusively been drawn from spoken interaction, which is where a lingua franca gets used in the first place. But the use of English as a second/foreign language extends into the written mode as well, and written texts may also be directed to an international audience. In globalised networks such as academia, examples of English used as a written lingua franca aren’t hard to find. As in other high-stakes domains of ELF, an academic career involves producing English texts that are used to evaluate the author’s professional competence.

Alongside the growth in ELF research has been growing awareness of a power imbalance in academic publishing – journals concentrated in the US & UK typically treat a perfect imitation of “native-like” English as a basic criterion for being published. This goes beyond just “correct grammar” and extends into idiomatic usage, phraseological choices, and rhetorical style. So while there’s no dispute that non-native users of English as a lingua franca far outnumber the native English speakers of the world, academic journals tip the balance of power in favor of English native speakers. In short, “good English” is equated with “native-like English”.

This is a question of interest to descriptive ELF research. How does “good English” written by educated professionals who speak a first language other than English differ from the mythologised “native-like English”? This question and the issues surrounding it are persuasively developed by David Owen (2011) in an article on academic publishing and language revision. In his work doing language revision in a Spanish university, he observes that papers rejected on linguistic grounds are often “formally impeccable”, and he presents a series of extracts to illustrate this “correct” vs. “native-like” distinction. In the end, he calls for descriptive ELF research that could clarify this timely question. What does good written ELF look like?

WrELFA: a corpus of written ELF in academia

Late in the same year that Owen’s article was published, Anna Mauranen tasked me with starting compilation of the Corpus of Written English as a Lingua Franca in Academic Settings (WrELFA corpus), which she had been talking about for some time. The million-word ELFA corpus of spoken academic ELF interaction was completed in 2008, and a written companion was a natural development. I’ve been working on this project ever since, and with help from research assistant Jani Ahtiainen, this summer we reached 500,000 words of processed WrELFA text. At this halfway mark to our million-word goal, I thought I’d give an update on our progress.

We first targeted two text types – public online research blogs and preliminary examiners’ statements for newly submitted PhD theses. Both provide insight into expert ELF practice in professional academic contexts. First, these texts don’t pass through any linguistic gatekeepers or language experts on the way from the author to the intended audience. Second, the texts are written by active scientists ranging from PhD students (in the case of research blogs) to senior professors (in the case of preliminary examiners’ statements and also blogs). We’re thus in the domain of successful academics using ELF as their means of professional self-representation.

So what makes a text in English ELF? We’ve applied two criteria for selecting authors to be sampled in WrELFA. First, they should have a first language other than English. In addition, and for the sake of clearly defined criteria, we exclude all authors who are professionally based in an “Inner Circle” English-speaking country (e.g. US, UK, Canada). It takes two to have a lingua franca, so what about the audience? Blog comments reveal a diverse international audience, with English native speakers also in the ELF mix. The examiners’ statements are written by and for ELF users, specifically for PhD students (from Finland and abroad) and the faculty administration (of a Finnish university). As such, these are seen as naturally occurring, written academic ELF.
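The two author criteria above amount to a simple filter. Here is a minimal sketch of that logic in Python, for illustration only: the `Author` record, its field names, and the sample authors are invented for this example, and the country set contains only the three examples named in the text, not a complete list of Inner Circle countries.

```python
from dataclasses import dataclass

# Only the examples given in the text; a real list would be longer (assumption).
INNER_CIRCLE_EXAMPLES = {"US", "UK", "Canada"}


@dataclass
class Author:
    """Hypothetical author record; field names are invented for this sketch."""
    name: str
    first_language: str
    based_in: str


def eligible(author: Author) -> bool:
    """Apply both criteria: non-English first language, and not
    professionally based in an Inner Circle country."""
    has_other_l1 = author.first_language != "English"
    outside_inner_circle = author.based_in not in INNER_CIRCLE_EXAMPLES
    return has_other_l1 and outside_inner_circle


# Invented sample authors, purely to exercise the filter.
candidates = [
    Author("Author A", "Finnish", "Finland"),
    Author("Author B", "English", "Finland"),
    Author("Author C", "German", "US"),
]
sampled = [a.name for a in candidates if eligible(a)]
print(sampled)
```

Note that the second criterion excludes authors by where they work, not by language, which is why the invented German speaker based in the US is filtered out.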

Figure 1. Distribution of academic domains in the WrELFA blog subcorpus (Sept. 2013).

Academic research blogs

To find academic ELF bloggers, we used researchblogging.org (RB) as a starting place. RB is an aggregator for bloggers who discuss published research in their posts. These listings of research blogs from various fields (with an emphasis on natural science) led us to a sample of bloggers who fit the above first-language criteria. When sampling blog posts, preference was given to discussions of published research, but any field-related post was considered, including conference reports and reflections on professional life. Longer posts were also preferred for sampling, as well as those that included some discussion with readers in the comments section.

Using January 2011 as an arbitrary starting point for sampling, 316 posts from 40 different blogs were processed (just under eight posts per blog). This amounts to just under 305,000 words of text, for an average of 7600 words per blog. In terms of academic domains, natural sciences dominate with 46% of words in this blog subcorpus, followed by medicine (28%) and technology (9% – see all domains in Figure 1 above). Among identifiable authors, 17 first languages are represented (see first-language distributions in Figure 2), along with 40,000 words of non-blogger comments from a mostly anonymous global audience.
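The averages in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures reported above, so its results are approximate; the variable names are invented for the example, and the "other domains" share is derived here by subtraction rather than taken from the text.

```python
# Figures as reported in the text. The word count is "just under 305,000",
# so the rounded value used here slightly overstates it.
posts = 316
blogs = 40
words = 305_000

posts_per_blog = posts / blogs   # the text says "just under eight"
words_per_blog = words / blogs   # the text says "an average of 7600"

# Domain shares named in the text (percent of words).
named_domains = {"natural sciences": 46, "medicine": 28, "technology": 9}
# Whatever is left belongs to the domains shown only in Figure 1.
other_domains = 100 - sum(named_domains.values())

print(f"{posts_per_blog:.1f} posts per blog")
print(f"{words_per_blog:.0f} words per blog")
print(f"{other_domains}% of words in other domains")
```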

Figure 2. Distribution of authors’ first languages in the WrELFA blog subcorpus (Sept. 2013).

This interactive dimension of blogging was of special interest to us, but the proportion of text derived from comments was relatively low at 20% of words. To supplement this interactive material, I compiled a series of additional discussions from the most active blog in the WrELFA sample – The Reference Frame by string theory physicist Luboš Motl, whose blog currently reports 250,000 visitors from around the world in the past three months alone. The blog is a site of lively discussion and argument, and these 26 additional discussions from physics-related blog posts in Jan.-Feb. 2011 added 67,000 words of interactive text.

With just over 370,000 words in total, this research blog subcorpus is considered finished. Having added the discussions from The Reference Frame, the proportion of words from blog posts stands at 65%, with the share of comments rising to 35% of words. All blog comments have also been extracted to a stand-alone subcorpus for targeted research on online ELF interaction, such as the work I presented at the ELF6 conference earlier this month (and I’ll someday blog about it too).

Preliminary examiners’ statements

There are interesting similarities between research blogs and preliminary examiners’ statements for a PhD defense. They both tend to be evaluative in summarising and discussing the relative merits of others’ research. In addition, they’re often highly technical and field-specific, displaying conventions of academic writing while mixing in more personal evaluations. The examiners’ statements in particular are a fascinating source of expert ELF. Written by senior, successful academics, these texts are, I can personally attest, “good English”. They may not, however, be native-like. Interestingly, typos and misspellings are not uncommon, underscoring the authenticity of the texts.

Figure 3. Distribution of authors' first languages in WrELFA examiners' statements (Sept. 2013).

This component of WrELFA is a work in progress. We’ve processed the examiners’ statements from 2011-12 in three faculties of the University of Helsinki – humanities (Humanistinen tiedekunta), medicine (Lääketieteellinen tiedekunta), and mathematics & natural science (Matemaattis-luonnontieteellinen tiedekunta). Authors are contacted for permission to use the text and for self-reporting of first language status. Unlike the blogs, which are publicly available and cited to an academic standard, the examiners’ statements are fully anonymised so that no one can be identified from the texts themselves. Another six years of data from three more faculties are waiting to be processed, so all the figures that follow are about to change!

At the time of writing, 130 examiners’ statements have been processed for a total of 127,551 words of text, an average of just under 1000 words per statement. Among the three represented faculties, 40% of words come from math & science, with 38% from humanities and 22% from medicine. The humanists wrote the longest texts, with 28 statements of 1731 words on average. The 31 statements of medical reviewers were 911 words on average, with math & science having the largest number of statements (71) with the smallest average word count of 716 words. As for language backgrounds, authors reported 26 different first languages, with native speakers of German, Finnish, and Swedish together making up 54% of words (see Figure 3 above).
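As a quick consistency check, the per-faculty figures in this paragraph can be multiplied back out. This is only a sketch: the counts and averages are those reported above, the function and variable names are invented for the example, and because the averages are rounded to whole words, the reconstructed total lands a few words short of the reported 127,551.

```python
# Per-faculty figures as reported in the text:
# (number of statements, average words per statement).
faculties = {
    "humanities": (28, 1731),
    "medicine": (31, 911),
    "math & science": (71, 716),
}


def summarise(figures):
    """Rebuild approximate word totals and percentage shares from
    statement counts and rounded per-statement averages."""
    words = {name: n * mean for name, (n, mean) in figures.items()}
    total_words = sum(words.values())
    total_statements = sum(n for n, _ in figures.values())
    shares = {name: round(100 * w / total_words) for name, w in words.items()}
    return total_statements, total_words, shares


total_statements, total_words, shares = summarise(faculties)
print(total_statements, total_words)
print(shares)
print(f"{total_words / total_statements:.0f} words per statement on average")
```

The reconstructed shares match the reported 38%, 22%, and 40%, and the overall average comes out just under 1000 words per statement, as stated.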

Figure 4. Distribution of the 29 known first languages in WrELFA, with the 10 most-represented first languages shown. 16% of words in the interim corpus are from unidentified blog commenters (Sept. 2013).

Figure 4 above shows the first-language distributions for the half-million-word interim corpus as a whole. Among the 29 first languages in WrELFA so far, the 10 best-represented languages make up two-thirds of the word count. Authors with an additional 19 first-language backgrounds make up 17% of words, with the remaining 16% from an anonymous mix of blog commenters. In all, it’s a broadly distributed mix, albeit with a bias toward Indo-European languages, but without any especially dominant first-language group.

More to come!

This gives just a surface overview of the WrELFA database in progress. I hope in future posts to also discuss the benefits of compiling a corpus in XML format and post-processing texts for different research purposes. In addition, this year we’re starting work on the third component of WrELFA, a collection of scientific articles that have not yet undergone professional language revision. This database will specifically serve the research questions raised by Owen (2011) and others as outlined at the start of this post.

In the meantime, I’ve begun reporting findings from the early WrELFA data, including a just-published paper in the Journal of English as a Lingua Franca on variations in organising chunks in ELFA and WrELFA data. Future updates on the corpus will be posted here and to the WrELFA project website: www.helsinki.fi/elfa/wrelfa.


Owen, David (2011) “Native here and to the manner born”: Academic publishing and ‘proper’ English. English Text Construction, 4 (2), 279-302. DOI: 10.1075/etc.4.2.06owe

