WrELFA 2015: the written corpus of academic ELF


During the past two years that I’ve kept up this blog, I’ve been working on the compilation of the first corpus of written ELF (English as a lingua franca) for Anna Mauranen’s ELFA project. I started loitering around her group shortly after the ELFA corpus of spoken academic ELF was completed, when a written corpus was already being discussed. A couple of years later, with the proper mix of time, money, and research assistants, we launched the WrELFA corpus project – the Written Corpus of English as a Lingua Franca in Academic Settings. And now we can announce that the WrELFA corpus compilation is complete.

I’ve been blogging about this work-in-progress over the past couple years, so I don’t need to repeat it all here. There are three text types included in WrELFA, each of which invites its own investigation. These three components are:

Academic research blogs – this subcorpus is drawn from 40 different blogs maintained by second-language users of English and totals 372,000 words (see this and this post).
PhD examiner reports – 330 evaluations by senior academics with 33 different first languages (402,000 words). I’ve discussed this data in depth in this post.
SciELF corpus – a collaborative, stand-alone corpus of 150 unedited research papers by academics from 10 first-language backgrounds. Partners from 12 universities contributed texts to the 759,000 total words (see this post).

Taken together, these three components total just over 1.5 million words of text with a rough binary division between the natural sciences (55% of words) and disciplines in social sciences and humanities (45% of words). For more detailed information on the make-up of the corpus, see the ELFA project homepage, where I’ve recently done a major update of the WrELFA corpus pages, with documentation of the corpus components, compilation principles, and authors’ L1 distributions.

Dialogue & directness: evidence from ngrams

At first glance, these three text types would appear to be quite unrelated, with different purposes and audiences. But in terms of linguistic features, there are interesting overlaps between the corpus components. One of the main things that unites the texts is the act of evaluation. Research articles evaluate previous research along with new findings, research blogs engage in evaluation of published research, and the PhD examiners evaluate the readiness of a thesis for public defence. Thus, academic evaluation broadly unites the texts in purpose and form.

The major difference between the texts, however, is the way in which this evaluation is communicated – how is a dialogue constructed with the reader? There are different audiences for each text type, and different degrees of directness assumed with the reader. In research blogs, and especially in blog discussions, this relationship is direct indeed, and I needn’t explain it further here. The conventionalised indirectness of scientific articles lies on the opposite end of the spectrum, with the PhD examiners falling somewhere in between. These reports are directly addressed to university administrators, the candidate is often addressed as well, and examiners frequently use the first-person pronoun “I” in their assessments.

To get a better sense of these points of convergence, I used AntConc to retrieve all 3-grams and 4-grams occurring at least 10 times in each of the three subcorpora. I then selected the 30 most frequent ngrams from each set for side-by-side comparison across the text types. This way, I could get a picture of the degree of overlap between the most frequent ngrams in the corpus. Looking first at the 3-grams, there are six that appear in each of the top-30 lists, and five of them are the most frequent in WrELFA as a whole:

rank	count	3-gram
1	695	as well as
2	593	in order to
3	557	one of the
4	490	in terms of
5	453	part of the
14	300	there is a
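
For readers who want to try a similar comparison on their own texts, the retrieval step described above can be roughly approximated in a few lines of Python. This is only a sketch under stated assumptions, not a reproduction of the AntConc settings used for WrELFA: the tokenizer is a simple regular expression, n-grams are not allowed to cross file boundaries, and the function names (`tokenize`, `top_ngrams`, `shared_ngrams`) are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase word tokens, apostrophes kept inside words.
    # AntConc's own token definition may differ.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def top_ngrams(texts, n, min_freq=10, top_k=30):
    # texts: one string per file in a subcorpus.
    counts = Counter()
    for text in texts:
        counts.update(ngram_counts(tokenize(text), n))
    # Keep n-grams at or above the frequency threshold, most frequent first.
    frequent = [(gram, c) for gram, c in counts.most_common() if c >= min_freq]
    return frequent[:top_k]

def shared_ngrams(*top_lists):
    # N-grams present in every one of the given top-k lists.
    return set.intersection(*(set(gram for gram, _ in lst) for lst in top_lists))

# Toy usage with a lowered threshold (real subcorpora would use min_freq=10).
blogs = ["As well as this.", "As well as that."]
reports = ["In order to pass, as well as to publish.", "As well as in order to."]
top_blogs = top_ngrams(blogs, 3, min_freq=2)
top_reports = top_ngrams(reports, 3, min_freq=2)
print(shared_ngrams(top_blogs, top_reports))
```

With three real subcorpora, `shared_ngrams` would take all three top-30 lists at once, and pairwise calls would give the n-grams shared by only two text types.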

The six shared 3-grams should look familiar to anyone with an interest in corpus-based studies of academic English. Likewise, there are a further 11 3-grams which appear in more than one of the subcorpus top-30 lists (ranks and counts are drawn from WrELFA as a whole, which is why some ranks fall outside the top 30):

rank	count	3-gram
6	408	the use of
8	365	the fact that
11	317	due to the
12	314	the number of
13	303	on the other
18	269	the role of
20	244	some of the
22	243	the other hand
25	226	a number of
31	219	the end of
35	211	this is a

Here too, we can see the overlapping 3-grams are squarely within the expectations of academic writing. Turning to the 4-grams, the subcorpus lists start to diverge. The only top-30 4-grams appearing in all three subcorpus lists are “on the other hand” and “at the same time”. The other top-30 4-grams that appear in more than one subcorpus list are also conventional academic phrases, such as “in the case/field/context of”, “the fact that the”, and “on the basis of”. But the differences noted above in dialogicity and directness also become evident in the 4-grams.

There is only one top-30 4-gram that appears on the lists for both the research blogs and PhD examiner reports, but not in the top-30 4-grams of SciELF research papers: “I would like to” (see a sample of concordance lines below). This direct self-reference is found 56 times in the PhD examiner reports (14 times per 100,000 words), 38 times in research blogs (10 times/100k words), but only 19 times in SciELF (2.5 times/100k words). It is the only 4-gram in the entire SciELF list that includes this first-person pronoun. The dialogic quality of the blogs, however, is evidenced by the 4-grams on the blog list that do not appear on the other subcorpus lists: “as far as I”, “I am not sure”, “I don’t think that”, “I think it is”, and “I think that the”.
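
The normalised rates quoted above follow from dividing each raw count by the size of its subcorpus and scaling to 100,000 words. A minimal sketch of that arithmetic, using the rounded word totals given earlier in this post (the helper name `per_100k` is just for illustration):

```python
def per_100k(hits, corpus_words):
    # Normalised frequency: occurrences per 100,000 words of text.
    return hits / corpus_words * 100_000

# (raw count of "I would like to", approximate subcorpus size in words)
subcorpora = {
    "PhD examiner reports": (56, 402_000),
    "research blogs": (38, 372_000),
    "SciELF": (19, 759_000),
}

for name, (hits, words) in subcorpora.items():
    print(f"{name}: {per_100k(hits, words):.1f} per 100k words")
# PhD examiner reports: 13.9 per 100k words
# research blogs: 10.2 per 100k words
# SciELF: 2.5 per 100k words
```

Normalising this way is what makes the three counts comparable, since SciELF is roughly twice the size of either of the other two subcorpora.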

A sample of concordance lines for “I would like to” from the research blogs (files starting with “B”) and PhD examiners (files starting with “UDEF”) subcorpora.

In sum, the three WrELFA text types show broad similarities as conventional academic texts, but with interesting differences in their degrees of dialogicity and the ways in which the author-audience relationship is constructed.

Our data is your data, just not yet

Everyone is an advocate for open data, especially those who never collect it themselves. Data is cheap, right? Actually, compiling data is ridiculously expensive. Sure, you’ve heard about some 14-trillion-word corpus of God-knows-what scraped from the internet, but a specialised, well-formed, computer-processable, properly character-encoded, reasonably balanced, proofread, regularised, and anonymised corpus for which hundreds of author permissions have been procured is another matter altogether. So all you lovers of open data with your hands out can take a deep breath and wait. For about a year.

When I started working in the ELFA project, I didn’t understand why Prof. Mauranen preferred to keep the ELFA corpus in-house before releasing it to the broader research community. Now that I’ve converted that one-million-word corpus to XML for public distribution and managed the compilation of another 1.5 million words of text, I understand her thinking perfectly. We did the hard work, and we spent a lot of money doing it. Our international partners who contributed research articles and author permissions for the SciELF corpus also did a lot of work, and they’ve already been given an advance copy of the SciELF corpus files.

Bottom line: the WrELFA corpus will be publicly available at about the same time that the first studies based on it go to publication. If any of you jealous nerds have a problem with that, we can fight it out in the comments. 😉

Here are the top-30 ngram lists in an Excel spreadsheet if you’re still curious.

References

WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/wrelfa.html.

4 thoughts on “WrELFA 2015: the written corpus of academic ELF”

eflnotes says:

April 20, 2015 at 5:39 pm

congrats and a year is not long at all 🙂

- Ray Carey says:
  
  April 20, 2015 at 5:49 pm
  
  Thanks Mura!
  
Ho Simon Wang says:

April 23, 2015 at 2:41 pm

I like this policy “Our data is your data, just not yet.” I’ll also share the learner corpora I built publicly after my own studies get published. Maybe a Creative Commons license would be useful.

- Ray Carey says:
  
  April 24, 2015 at 2:15 pm
  
  I think this might be the best justification for sharing data — to support the findings we report in our research. I think it’s great if others have an interest in analysing the same data with new research questions, but even if they don’t, I want my research to be able to withstand scrutiny, and open data makes that possible.

ELFA project

a research blog from the University of Helsinki's ELFA project (English as a Lingua Franca in Academic Settings)