Category Archives: Ray Carey

WrELFA 2015: the written corpus of academic ELF


During the past two years that I’ve kept up this blog, I’ve been working on the compilation of the first corpus of written ELF (English as a lingua franca) for Anna Mauranen’s ELFA project. I started loitering around her group shortly after the ELFA corpus of spoken academic ELF was completed, and a written corpus was already being discussed. A couple of years later, with the proper mix of time, money, and research assistants, we launched the WrELFA corpus project – the Written Corpus of English as a Lingua Franca in Academic Settings. And now we can announce that the WrELFA corpus compilation is complete.

I’ve been blogging about this work-in-progress over the past couple of years, so I don’t need to repeat it all here. There are three text types included in WrELFA, each of which invites its own investigation. These three components are:

  1. Academic research blogs – this subcorpus is drawn from 40 different blogs maintained by second-language users of English and totals 372,000 words (see this and this post).
  2. PhD examiner reports – 330 evaluations by senior academics with 33 different first languages (402,000 words). I’ve discussed this data in depth in this post.
  3. SciELF corpus – a collaborative, stand-alone corpus of 150 unedited research papers by academics from 10 first-language backgrounds. Partners from 12 universities contributed texts to the 759,000 total words (see this post).

Taken together, these three components total just over 1.5 million words of text with a rough binary division between the natural sciences (55% of words) and disciplines in social sciences and humanities (45% of words). For more detailed information on the make-up of the corpus, see the ELFA project homepage, where I’ve recently done a major update of the WrELFA corpus pages, with documentation of the corpus components, compilation principles, and authors’ L1 distributions. Keep reading…


Adventures in correcting the (semi-)scientific record

Photo shared by Michiel1972 via Wikimedia Commons

One of the blogs I follow is Retraction Watch, which documents the world of quality control in scientific research – pre-publication peer review (and its abuses); post-publication peer review in fora such as research blogs; retractions and corrections by journals; and plagiarism and fraud. The large majority of cases they report on are drawn from the “hard sciences”. From time to time, a case pops up from the humanities as well, and it’s not outrageous to ask – who cares anyway? Well, I do.

I’m one of those humanistic researchers who likes to imagine that I do something resembling science. One of the most frustrating things about humanistic research that can’t stand up to scrutiny is the feeling that it doesn’t matter, anyway; nobody cares about this stuff but us. Does that make me some starry-eyed idealist? No, I just don’t like sloppy work. And when I see it, it makes me look bad too, a humanistic guilt by association. Several of the posts on this blog can be seen as post-publication peer review, and during the past year I had my own experience with attempting to correct the (semi-)scientific record.

Last year I read an article by Prof. Hilary Nesi in the Journal of English for Academic Purposes (JEAP) entitled Laughter in university lectures. It contained an obvious error in the word count of the Corpus of British Academic Spoken English (BASE), which resulted in erroneous claims about the frequency of laughter in this linguistic database. The natural response, again, might be: who cares? Several people should care, because the author, two peer reviewers, and the journal editors apparently didn’t look very carefully at the figures reported in two of the tables in the paper. I decided to start with the author. Keep reading…

On the other side: variations in organising chunks in ELF

Variations in organising chunks aren’t that common, but they do tend to stand out.
Source: Livio Bourbon via The Telegraph

When working with ELF data – English used as a lingua franca between second/foreign-language speakers – one of the things that stands out is the slight variation in conventional chunks of language. A formulaic chunk like as a matter of fact might be realised as as the matter of fact, or you could hear now that you mention it spoken as now that you say it. There’s no sense in calling them errors, since the variants won’t cause miscommunication, they resemble their conventional counterparts in both function and form, and the less-preferred variant is likely found elsewhere. It’s just not the English native-speaker preference.

These variations are interesting linguistically and they tend to stand out impressionistically for researchers, but I’ve wondered how often these variations actually occur in ELF – both in frequency and also in their distribution relative to conventional forms. It’s not an easy question to answer. Many of these formulaic chunks of language occur infrequently, so finding a couple of variants doesn’t really tell you much. The example above of now that you say it occurs twice in the million-word ELFA corpus, with just one instance of the conventional form. Alternatively, as the matter of fact is found in ELFA 21 times compared to just eight occurrences of the expected chunk, but only two speakers account for those 21 instances.

We can see from these examples that a formulaic chunk that rarely shows up won’t reveal much about how often variation occurs among ELF users, across speech events, in different times and places. To find out more, I wanted to start with the highest frequency chunks I could find. These are described by Linear Unit Grammar as organising chunks, the recurring and relatively fixed chunks we use to structure our speech and writing, like on the other hand. Using the corpus freeware AntConc, I looked at the most frequent 3-, 4- and 5-word clusters (aka n-grams) in the ELFA corpus of spoken academic ELF. Keep reading…
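As a rough illustration of what a cluster search does, here is a minimal Python sketch that counts the most frequent 3-, 4- and 5-word sequences in a string. It is not how AntConc works internally and it is not run on ELFA data: the function name ngram_counts and the sample text are invented for this example, and tokens are simply split on whitespace.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive tokens in text (hypothetical helper)."""
    # Naive tokenisation: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Zipping n staggered copies of the token list yields each n-word window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Invented sample text, not taken from any corpus.
sample = (
    "on the other hand it works but on the other hand it is slow "
    "and on the other side it is cheap"
)

for n in (3, 4, 5):
    print(n, ngram_counts(sample, n).most_common(3))
```

Even in this toy text, the conventional on the other hand and the variant on the other side turn up as separate 4-word clusters, so their counts can be compared directly.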

In praise of Finnish innovation: a tribute to Ossi Ihalainen

Prof. Ossi Ihalainen (1941-1993), the man who makes my research possible.
Source: VARIENG website

When my misguided and unconventional life crash-landed in the University of Helsinki’s English department, I had no idea I was in the midst of a world-class center of linguistic research. It took less than a year to figure it out, and before I had finished my bachelor’s degree, I was working as a research assistant. Now as a PhD student in the same department, I understand very well that I’m part of a proud tradition of innovators in English linguistics.

Earlier this year, I was awarded a three-year grant from the Finnish Cultural Foundation (Suomen Kulttuurirahasto) to pursue my research in Linear Unit Grammar (LUG) and English as a lingua franca (ELF). On August 1, the foundation dispensed the first installment of my grant from the Ossi Ihalainen trust, and I want to take a moment to honor the man who supports my work, some 20 years after his passing. Oddly enough, we have a few things in common.

Ihalainen (right) making field recordings of interviews with speakers of the Somerset dialect.
Source: VARIENG website

Prof. Ossi Ihalainen of the University of Helsinki’s English department was a pioneer in the use of computers to research linguistic corpora. This was during the 1980s, when mainframe computers were needed to perform searches and the programming language of the day was still FORTRAN. He was also an expert in British dialectology, having conducted extensive fieldwork in the 1970s on the dialects of the Somerset region of southwest England. His crowning achievement was a hundred-page chapter on English dialects for the Cambridge History of the English Language.

Prof. Ihalainen died before his time at the age of 51. He fought leukemia for over a year before his death in 1993, and, in an act of academic heroism, he labored through his illness to complete the above-mentioned chapter on English dialects. At the time of his death, his will established a trust to support linguistic research in Finland. And still in 2013, that’s exactly what he’s doing, and I do indeed thank him for that. Keep reading…

Fluent chunks 2: How to label your chunks

Photo by Alan Chia via Wikimedia Commons

Most people recognise that we don’t speak in “sentences”. Still, speech is analysed and described using the concepts of sentence grammars, even when these writing-based systems must be bent and stretched, or vice versa – isn’t it cheating to “clean up” naturally occurring speech so it fits into a sentence grammar?

In a previous post I introduced Linear Unit Grammar, or LUG, a chunk-based approach to analysing spoken and written text. In that post I introduced the linear, word-by-word process of chunking up a string of transcribed speech by placing intuitively directed chunk boundaries. The discussion focused on this short extract from an academic conference in the ELFA corpus. When asked about her experience with students in Brazil, the speaker responded:

er i c- i i so i i went to portugal er i live in portugal er for 13 years so i er my experience with brazilian students is is a long way @@ okay a long time ago (note: @@ = laughter and er is like uh in the US style)

How do you divide this into a well-formed constituency tree? The short answer is you don’t, and neither do speakers in actual interaction. LUG analysis attempts to mirror the real-time, linear processing of language as multi-word chunks, regardless of “grammaticality”. Keep reading…

Fluent chunks: an intro to Linear Unit Grammar

The spring equinox has passed, the sun is shining, and the place is still frozen solid.
© Nina Valtavirta

The question of how to evaluate English proficiency in lingua franca settings such as English-medium university programs has interested me for a while. One of the criticisms heard against ELF research is that it promotes an “anything goes” attitude toward English. But clearly anything does not go – at least not in high-stakes, professional contexts like academia. Yet, it doesn’t make a huge amount of sense to bring in the British Council to evaluate the non-British English used by non-British instructors to teach non-British students outside Britain. The need for contextually appropriate teaching and testing is one of the main motivations for ELF research.

It was my turn to talk at the ELF seminar this month, which was held on 14 March. I introduced my PhD project, which officially started this January and unofficially began over a year ago. I’m researching fluency in spoken academic ELF, but with a data-driven approach; instead of evaluating ELF users by an idealised “native-speaker model”, I’m starting by describing the features of fluency and dysfluency in a corpus of naturally occurring academic ELF. These texts are transcriptions of university-level talk from pedagogical settings (lectures, seminar discussions from various fields) and professional events (conference presentations and discussions).

The idea is to first describe what is ordinary ELF in academic settings – what are the recurring patterns and routines of fluent interaction, and what are the (dys)fluency features which differentiate individual ELF users? This is the big question, but the even bigger problem is how to systematically identify and describe these patterns.

Keep reading…