During the past two years that I’ve kept up this blog, I’ve been working on the compilation of the first corpus of written ELF (English as a lingua franca) for Anna Mauranen’s ELFA project. I started loitering around her group shortly after the ELFA corpus of spoken academic ELF was completed, and a written corpus was already being discussed. A couple of years later, with the proper mix of time, money, and research assistants, we launched the WrELFA corpus project – the Corpus of Written English as a Lingua Franca in Academic Settings. And now we can announce that the WrELFA corpus compilation is complete.
I’ve been blogging about this work-in-progress over the past couple years, so I don’t need to repeat it all here. There are three text types included in WrELFA, each of which invites its own investigation. These three components are:
- Academic research blogs – this subcorpus is drawn from 40 different blogs maintained by second-language users of English and totals 372,000 words (see this and this post).
- PhD examiner reports – 330 evaluations by senior academics with 33 different first languages (402,000 words). I’ve discussed this data in depth in this post.
- SciELF corpus – a collaborative, stand-alone corpus of 150 unedited research papers by academics from 10 first-language backgrounds. Partners from 12 universities contributed texts to the 759,000 total words (see this post).
Taken together, these three components total just over 1.5 million words of text with a rough binary division between the natural sciences (55% of words) and disciplines in social sciences and humanities (45% of words). For more detailed information on the make-up of the corpus, see the ELFA project homepage, where I’ve recently done a major update of the WrELFA corpus pages, with documentation of the corpus components, compilation principles, and authors’ L1 distributions.
Dialogue & directness: evidence from ngrams
At first glance, these three text types would appear to be quite unrelated, with different purposes and audiences. But in terms of linguistic features, there are interesting overlaps between the corpus components. One of the main things that unites the texts is the act of evaluation. Research articles evaluate previous research along with new findings, research blogs engage in evaluation of published research, and the PhD examiners evaluate the readiness of a thesis for public defence. Thus, academic evaluation broadly unites the texts in purpose and form.
The major difference between the texts, however, is the way in which this evaluation is communicated – how is a dialogue constructed with the reader? There are different audiences for each text type, and different degrees of directness assumed with the reader. In research blogs, and especially in blog discussions, this relationship is direct indeed, and I needn’t explain it further here. The conventionalised indirectness of scientific articles lies on the opposite end of the spectrum, with the PhD examiners falling somewhere in between. These reports are directly addressed to university administrators, the candidate is often addressed as well, and examiners frequently use the first-person pronoun I in their assessments.
To get a better sense of these points of convergence, I used AntConc to retrieve all 3-grams and 4-grams occurring at least 10 times in each of the three subcorpora. I then selected the 30 most frequent ngrams from each set for side-by-side comparison across the text types. This way, I could get a picture of the degree of overlap between the most frequent ngrams in the corpus. Looking first at the 3-grams, there are six that appear in each of the top-30 lists, and five of them are the most frequent in WrELFA as a whole:
| Rank (WrELFA) | Count (WrELFA) | 3-gram |
|---|---|---|
| 1 | 695 | as well as |
| 2 | 593 | in order to |
| 3 | 557 | one of the |
| 4 | 490 | in terms of |
| 5 | 453 | part of the |
| 14 | 300 | there is a |
This list should look familiar to anyone with an interest in corpus-based studies of academic English. Likewise, among the top-30 3-grams, a further 11 appear in two of the three subcorpus lists (ranks and counts drawn from WrELFA as a whole):
| Rank (WrELFA) | Count (WrELFA) | 3-gram |
|---|---|---|
| 6 | 408 | the use of |
| 8 | 365 | the fact that |
| 11 | 317 | due to the |
| 12 | 314 | the number of |
| 13 | 303 | on the other |
| 18 | 269 | the role of |
| 20 | 244 | some of the |
| 22 | 243 | the other hand |
| 25 | 226 | a number of |
| 31 | 219 | the end of |
| 35 | 211 | this is a |
Here too, we can see the overlapping 3-grams are squarely within the expectations of academic writing. Turning to the 4-grams, the subcorpus lists start to diverge. The only top-30 4-grams appearing in all three subcorpus lists are “on the other hand” and “at the same time”. The other top-30 4-grams that appear in more than one subcorpus list are also conventional academic phrases, such as “in the case/field/context of”, “the fact that the”, and “on the basis of”. But the differences noted above in dialogicity and directness also become evident in the 4-grams.
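For readers who want to try this kind of top-30 comparison on their own texts, here is a minimal Python sketch of the general idea. It is not the AntConc procedure used for WrELFA: it assumes naive lowercased whitespace tokenisation, the function names (`ngrams`, `top_ngrams`, `shared_ngrams`) are hypothetical, and the three toy strings are invented stand-ins for the subcorpora. With real data the frequency threshold would be 10 rather than 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(text, n, k=30, min_freq=10):
    """Return up to k of the most frequent n-grams occurring at least min_freq times."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenisation
    counts = Counter(ngrams(tokens, n))
    return [(gram, c) for gram, c in counts.most_common(k) if c >= min_freq]


def shared_ngrams(top_lists):
    """Split n-grams into those on every list and those on several but not all lists."""
    membership = Counter()
    for lst in top_lists:
        # Count each n-gram once per list, however frequent it is within that list.
        membership.update({gram for gram, _ in lst})
    in_all = {g for g, c in membership.items() if c == len(top_lists)}
    in_some = {g for g, c in membership.items() if 1 < c < len(top_lists)}
    return in_all, in_some


# Invented toy texts standing in for the three subcorpora (not WrELFA data).
blogs = "i would like to say that on the other hand i would like to add"
reports = "on the other hand i would like to thank the candidate"
papers = "on the other hand the results show that the fact that"

lists = [top_ngrams(t, 4, k=30, min_freq=1) for t in (blogs, reports, papers)]
in_all, in_some = shared_ngrams(lists)
print(sorted(in_all))   # ['on the other hand']
print(sorted(in_some))
```

Note that this sketch lets n-grams run across sentence boundaries, which a more careful treatment would probably want to prevent.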
There is only one top-30 4-gram that appears on the lists for both the research blogs and PhD examiner reports, but not in the top-30 4-grams of SciELF research papers – “I would like to” (see a sample of concordance lines below). This direct self-reference is found 56 times in the PhD examiner reports (14 times per 100,000 words), 38 times in research blogs (10 times/100k words), but only 19 times in SciELF (2.5 times/100k words). It is the only 4-gram in the entire SciELF list that includes this first-person pronoun. The dialogic quality of the blogs, however, is evidenced by the 4-grams on the blog list that do not appear on the other subcorpus lists: “as far as I”, “I am not sure”, “I don’t think that”, “I think it is”, and “I think that the”.
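The per-100,000-word rates quoted above are raw counts scaled by subcorpus size. A short Python sketch of that arithmetic follows, using the rounded word totals given earlier in this post, so the results are approximate. The helper name `per_100k` is hypothetical.

```python
def per_100k(raw_count, corpus_words):
    """Normalise a raw frequency to occurrences per 100,000 words."""
    return raw_count / corpus_words * 100_000


# (hits of the 4-gram, subcorpus size in words); sizes are the rounded totals from the post.
subcorpora = {
    "PhD examiner reports": (56, 402_000),
    "Research blogs": (38, 372_000),
    "SciELF": (19, 759_000),
}

for name, (hits, words) in subcorpora.items():
    print(f"{name}: {per_100k(hits, words):.1f} per 100k words")
# PhD examiner reports: 13.9 per 100k words
# Research blogs: 10.2 per 100k words
# SciELF: 2.5 per 100k words
```

Normalising like this is what makes the counts comparable: SciELF is roughly twice the size of either of the other two subcorpora, so its 19 raw hits represent a much lower rate than the raw numbers alone suggest.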
In sum, the three WrELFA text types show broad similarities as conventional academic texts, but with interesting differences in their degrees of dialogicity and the ways in which the author-audience relationship is constructed.
Our data is your data, just not yet
Everyone is an advocate for open data, especially those who never collect it themselves. Data is cheap, right? Actually, compiling data is ridiculously expensive. Sure, you’ve heard about some 14-trillion-word corpus of God-knows-what scraped from the internet, but a specialised, well-formed, computer-processable, properly character-encoded, reasonably balanced, proofread, regularised, and anonymised corpus in which hundreds of author permissions are procured is another matter altogether. So all you lovers of open data with your hands out can take a deep breath and wait. For about a year.
When I started working in the ELFA project, I didn’t understand why Prof. Mauranen preferred to keep the ELFA corpus in-house before releasing it to the broader research community. Now that I’ve converted that one-million-word corpus to XML for public distribution and managed the compilation of another 1.5 million words of text, I understand her thinking perfectly. We did the hard work, and we spent a lot of money doing it. Our international partners who contributed research articles and author permissions for the SciELF corpus also did a lot of work, and they’ve already been given an advance copy of the SciELF corpus files.
Bottom line: the WrELFA corpus will be publicly available at about the same time that the first studies based on it go to publication. If any of you jealous nerds have a problem with that, we can fight it out in the comments. 😉
Here are the top-30 ngram lists in an Excel spreadsheet if you’re still curious.
WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/wrelfa.html.