WrELFA 2015: the written corpus of academic ELF


During the past two years that I’ve kept up this blog, I’ve been working on the compilation of the first corpus of written ELF (English as a lingua franca) for Anna Mauranen’s ELFA project. I started loitering around her group shortly after the ELFA corpus of spoken academic ELF was completed, and a written corpus was already being discussed. A couple of years later, with the proper mix of time, money, and research assistants, we launched the WrELFA corpus project – the Written Corpus of English as a Lingua Franca in Academic Settings. And now we can announce that the WrELFA corpus compilation is complete.

I’ve been blogging about this work-in-progress over the past couple of years, so I don’t need to repeat it all here. There are three text types included in WrELFA, each of which invites its own investigation. These three components are:

  1. Academic research blogs – this subcorpus is drawn from 40 different blogs maintained by second-language users of English and totals 372,000 words (see this and this post).
  2. PhD examiner reports – 330 evaluations by senior academics with 33 different first languages (402,000 words). I’ve discussed this data in depth in this post.
  3. SciELF corpus – a collaborative, stand-alone corpus of 150 unedited research papers by academics from 10 first-language backgrounds. Partners from 12 universities contributed texts to the 759,000 total words (see this post).

Taken together, these three components total just over 1.5 million words of text with a rough binary division between the natural sciences (55% of words) and disciplines in social sciences and humanities (45% of words). For more detailed information on the make-up of the corpus, see the ELFA project homepage, where I’ve recently done a major update of the WrELFA corpus pages, with documentation of the corpus components, compilation principles, and authors’ L1 distributions.

Dialogue & directness: evidence from ngrams

At first glance, these three text types would appear to be quite unrelated, with different purposes and audiences. But in terms of linguistic features, there are interesting overlaps between the corpus components. One of the main things that unites the texts is the act of evaluation. Research articles evaluate previous research along with new findings, research blogs engage in evaluation of published research, and the PhD examiners evaluate the readiness of a thesis for public defence. Thus, academic evaluation broadly unites the texts in purpose and form.

The major difference between the texts, however, is the way in which this evaluation is communicated – how is a dialogue constructed with the reader? There are different audiences for each text type, and different degrees of directness assumed with the reader. In research blogs, and especially in blog discussions, this relationship is direct indeed, and I needn’t explain it further here. The conventionalised indirectness of scientific articles lies on the opposite end of the spectrum, with the PhD examiners falling somewhere in between. These reports are directly addressed to university administrators, the candidate is often addressed as well, and examiners frequently use the first-person pronoun I in their assessments.

To get a better sense of these points of convergence, I used AntConc to retrieve all 3-grams and 4-grams occurring at least 10 times in each of the three subcorpora. I then selected the 30 most frequent ngrams from each set for side-by-side comparison across the text types. This way, I could get a picture of the degree of overlap between the most frequent ngrams in the corpus. Looking first at the 3-grams, there are six that appear in each of the top-30 lists, and five of them are the most frequent in WrELFA as a whole:

rank  count  3-gram
1     695    as well as
2     593    in order to
3     557    one of the
4     490    in terms of
5     453    part of the
14    300    there is a

This list should look familiar to anyone with an interest in corpus-based studies of academic English. Likewise, among the top-30 3-grams, there are another 11 which appear in more than one of the subcorpus lists (ranks and counts drawn from WrELFA as a whole, so a few ranks fall outside the overall top 30):

rank  count  3-gram
6     408    the use of
8     365    the fact that
11    317    due to the
12    314    the number of
13    303    on the other
18    269    the role of
20    244    some of the
22    243    the other hand
25    226    a number of
31    219    the end of
35    211    this is a

Here too, we can see the overlapping 3-grams are squarely within the expectations of academic writing. Turning to the 4-grams, the subcorpus lists start to diverge. The only top-30 4-grams appearing in all three subcorpus lists are on the other hand and at the same time. The other top-30 4-grams that appear in more than one subcorpus list are also conventional academic phrases, such as in the case/field/context of, the fact that the, and on the basis of. But the differences noted above in dialogicity and directness also become evident in the 4-grams.

There is only one top-30 4-gram that appears on the lists for both the research blogs and PhD examiner reports, but not in the top-30 4-grams of SciELF research papers – I would like to (see a sample of concordance lines below). This direct self-reference is found 56 times in the PhD examiner reports (14 times per 100,000 words), 38 times in research blogs (10 times/100k words), but only 19 times in SciELF (2.5 times/100k words). It is the only 4-gram in the entire SciELF list that includes this first-person pronoun. The dialogic quality of the blogs, however, is evidenced by the 4-grams on the blog list that do not appear on the other subcorpus lists: as far as I, I am not sure, I don’t think that, I think it is, and I think that the.
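The per-100,000-word figures above are simple normalised frequencies: the raw count divided by the subcorpus size, multiplied by 100,000. A minimal sketch of the arithmetic, assuming the rounded subcorpus word counts given at the top of this post (the function name `per_100k` is just an illustration, not part of any corpus tool):

```python
def per_100k(count, corpus_words):
    """Normalise a raw count to occurrences per 100,000 words."""
    return count / corpus_words * 100_000


# Raw counts of "I would like to" and (rounded) subcorpus sizes from this post
figures = {
    "PhD examiner reports": (56, 402_000),
    "Research blogs": (38, 372_000),
    "SciELF": (19, 759_000),
}

for name, (count, words) in figures.items():
    print(f"{name}: {per_100k(count, words):.1f} per 100k words")

# PhD examiner reports: 13.9 per 100k words
# Research blogs: 10.2 per 100k words
# SciELF: 2.5 per 100k words
```

Because the word counts are rounded to the nearest thousand, the results are approximate, but they match the figures reported above once rounded.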

A sample of concordance lines for "I would like to" from the research blogs (files starting with "B") and PhD examiners (files starting with "UDEF") subcorpora.

In sum, the three WrELFA text types show broad similarities as conventional academic texts, but with interesting differences in their degrees of dialogicity and the ways in which the author-audience relationship is constructed.
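For anyone who wants to try a similar comparison on their own texts, the procedure described in this section (ngrams occurring at least 10 times, the 30 most frequent per subcorpus, then a check of which ones turn up on more than one list) can be roughly sketched in Python. This is not how AntConc works internally. The tokenisation is a simplifying assumption, and the function names are made up for illustration, so counts will not necessarily match AntConc’s.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a naive assumption, not AntConc's tokeniser)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, min_count=10, top=30):
    """Count n-grams text by text; keep the `top` most frequent with count >= min_count."""
    counts = Counter()
    for text in texts:  # counting per text keeps n-grams from spanning two files
        counts.update(ngrams(tokenize(text), n))
    frequent = [(gram, c) for gram, c in counts.most_common() if c >= min_count]
    return frequent[:top]


def list_membership(top_lists):
    """Map each n-gram to the number of top lists it appears on."""
    membership = Counter()
    for top_list in top_lists:
        membership.update({gram for gram, _ in top_list})
    return membership


def shared_by_all(top_lists):
    """Return the n-grams that appear on every one of the top lists."""
    membership = list_membership(top_lists)
    return {gram for gram, k in membership.items() if k == len(top_lists)}
```

Given three lists of texts (one per subcorpus), calling `top_ngrams(texts, 4)` on each and passing the results to `shared_by_all` returns the 4-grams common to all three top-30 lists, while `list_membership` shows which ones appear on two lists only.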

Our data is your data, just not yet

Everyone is an advocate for open data, especially those who never collect it themselves. Data is cheap, right? Actually, compiling data is ridiculously expensive. Sure, you’ve heard about some 14-trillion-word corpus of God-knows-what scraped from the internet, but a specialised, well-formed, computer-processable, properly character encoded, reasonably balanced, proofread, regularised, and anonymised corpus in which hundreds of author permissions are procured is another matter altogether. So all you lovers of open data with your hands out can take a deep breath and wait. For about a year.

When I started working in the ELFA project, I didn’t understand why Prof. Mauranen preferred to keep the ELFA corpus in-house before releasing it to the broader research community. Now that I’ve converted that one-million-word corpus to XML for public distribution and managed the compilation of another 1.5 million words of text, I understand her thinking perfectly. We did the hard work, and we spent a lot of money doing it. Our international partners who contributed research articles and author permissions for the SciELF corpus also did a lot of work, and they’ve already been given an advance copy of the SciELF corpus files.

Bottom line: the WrELFA corpus will be publicly available at about the same time that the first studies based on it go to publication. If any of you jealous nerds have a problem with that, we can fight it out in the comments. 😉

Here are the top-30 ngram lists in an Excel spreadsheet if you’re still curious.


WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/wrelfa.html.


4 thoughts on “WrELFA 2015: the written corpus of academic ELF”

  1. eflnotes says:

    congrats and a year is not long at all 🙂

  2. I like this policy “Our data is your data, just not yet.” I’ll also share the learner corpora I built publicly after my own studies get published. Maybe a Creative Commons license would be useful.

    • Ray Carey says:

      I think this might be the best justification for sharing data — to support the findings we report in our research. I think it’s great if others have an interest in analysing the same data with new research questions, but even if they don’t, I want my research to be able to withstand scrutiny, and open data makes that possible.
