During the past two years that I’ve kept up this blog, I’ve been working on the compilation of the first corpus of written ELF (English as a lingua franca) for Anna Mauranen’s ELFA project. I started loitering around her group shortly after the ELFA corpus of spoken academic ELF was completed, and a written corpus was already being discussed. A couple of years later, with the proper mix of time, money, and research assistants, we launched the WrELFA corpus project – the Written Corpus of English as a Lingua Franca in Academic Settings. And now we can announce that the WrELFA corpus compilation is complete.
I’ve been blogging about this work-in-progress over the past couple of years, so I don’t need to repeat it all here. WrELFA includes three text types, each of which invites its own investigation. These three components are:
- Academic research blogs – this subcorpus is drawn from 40 different blogs maintained by second-language users of English and totals 372,000 words (see this and this post).
- PhD examiner reports – 330 evaluations by senior academics with 33 different first languages (402,000 words). I’ve discussed this data in depth in this post.
- SciELF corpus – a collaborative, stand-alone corpus of 150 unedited research papers by academics from 10 first-language backgrounds. Partners from 12 universities contributed texts totaling 759,000 words (see this post).
Taken together, these three components total just over 1.5 million words of text with a rough binary division between the natural sciences (55% of words) and disciplines in social sciences and humanities (45% of words). For more detailed information on the make-up of the corpus, see the ELFA project homepage, where I’ve recently done a major update of the WrELFA corpus pages, with documentation of the corpus components, compilation principles, and authors’ L1 distributions.