Tag Archives: BASE corpus

Adventures in correcting the (semi-)scientific record

Photo shared by Michiel1972 via Wikimedia Commons


One of the blogs I follow is Retraction Watch, which documents the world of quality control in scientific research – pre-publication peer review (and its abuses); post-publication peer review in fora such as research blogs; retractions and corrections by journals; and plagiarism and fraud. The large majority of cases they report on are drawn from the “hard sciences”. From time to time, a case pops up from the humanities as well, and it’s not outrageous to ask – who cares anyway? Well, I do.

I’m one of those humanistic researchers who likes to imagine that I do something resembling science. One of the most frustrating things about humanistic research that can’t stand up to scrutiny is the feeling that it doesn’t matter, anyway; nobody cares about this stuff but us. Does that make me some starry-eyed idealist? No, I just don’t like sloppy work. And when I see it, it makes me look bad too, a humanistic guilt by association. Several of the posts on this blog can be seen as post-publication peer review, and during the past year I had my own experience with attempting to correct the (semi-)scientific record.

Last year I read an article by Prof. Hilary Nesi in the Journal of English for Academic Purposes (JEAP) entitled “Laughter in university lectures”. It contained an obvious error in the word count of the Corpus of British Academic Spoken English (BASE), which resulted in erroneous claims about the frequency of laughter in this linguistic database. The natural response, again, might be: who cares? Several people should care, because the author, two peer reviewers, and the journal editors apparently didn’t look very carefully at the figures reported in two of the tables in the paper. I decided to start with the author.


Laughter in academic talk: Brits, Yanks & ELF compared

Click to jump to the original article (behind paywall): Nesi, Hilary (2012) Laughter in university lectures. Journal of English for Academic Purposes, 11(2). 79-89.


Update 30.12.2013: this updated post reflects improvements to the Python scripts used to generate the token counts. Links to the improved scripts are available in the footnotes. Minor changes to the token counts and frequencies have been made in the tables and text, but the main content of the post remains unchanged.

When I was earlier blogging on the frequencies of laughter in academic ELF (English as a lingua franca), I came across an article by Prof. Hilary Nesi, a compiler of the BASE corpus – the Corpus of British Academic Spoken English. She provides a qualitative analysis of the types and functions of laughter episodes in lectures from the BASE corpus and she concludes with the uncontroversial advice that British lecturers might want to adjust their use of humor when lecturing for an international audience.

I’ve waited until now to blog on Nesi’s article, since it contains obvious statistical errors that I wanted to research further. When I say obvious, I mean obvious – she cites the word count of the BASE lecture subcorpus as 2,646,920 words, when the official count of the entire corpus is only 1,644,942 words (cited in the same article). Nesi uses this oddly inflated word count to compute the standardised frequencies of laughter in lectures, which are therefore artificially low. Being naturally curious, I emailed Prof. Nesi in April to ask if she could clarify the situation, and naturally I received no reply.
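To make the size of the problem concrete, here is a minimal Python sketch of the arithmetic, not Nesi's or my actual script. The two word counts are the ones quoted above; the laughter count is a made-up placeholder, since the real figure isn't given here. Because the lecture subcorpus can't be larger than the whole corpus, the ratio below is a lower bound on how much the denominator was inflated.

```python
# Word counts quoted in the article (see the paragraph above).
CLAIMED_LECTURE_WORDS = 2_646_920   # cited for the lecture subcorpus
OFFICIAL_CORPUS_WORDS = 1_644_942   # official count for the entire corpus


def per_thousand(count, total_words):
    """Standardised frequency: occurrences per 1,000 words."""
    return count / total_words * 1000


# The lecture subcorpus cannot exceed the whole corpus, so this ratio
# is the minimum factor by which the denominator was inflated.
inflation = CLAIMED_LECTURE_WORDS / OFFICIAL_CORPUS_WORDS

# HYPOTHETICAL laughter count, for illustration only.
hypothetical_laughs = 1000

reported = per_thousand(hypothetical_laughs, CLAIMED_LECTURE_WORDS)
upper_bound_denominator = per_thousand(hypothetical_laughs, OFFICIAL_CORPUS_WORDS)

print(f"denominator inflated by at least {inflation:.3f}x")
print(f"frequency with the claimed count:  {reported:.3f} per 1,000 words")
print(f"frequency with the official count: {upper_bound_denominator:.3f} per 1,000 words")
```

Whatever the true laughter count is, dividing by the claimed figure understates the frequency by at least this factor, because the count cancels out of the ratio.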

To be fair, everyone makes mistakes and the quantitative findings don’t really affect her qualitative analysis. But this was published in a major peer-reviewed journal, the Journal of English for Academic Purposes. When a statistical error this basic can get past a senior researcher, two peer reviewers, and an editorial staff, it gives this junior researcher a fairly discouraging picture of academic rigor in the humanities. I might just be the first person on earth to look carefully at Nesi’s tables.

When in doubt, do it yourself

The thing that makes corpus research almost seem like real science is reproducibility – as with real experimental results, another researcher can take a linguistic corpus and try to reproduce a study’s findings. So, I downloaded the BASE corpus in XML format and set out to reproduce Nesi’s figures. She also uses the XML version of BASE, but only to search for laughter tags using the WordSmith Tools application. My first theory was that she had generated a word count for the lectures without excluding the XML markup, but even this approach didn’t produce a count as high as hers.
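The difference between the two kinds of count can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script behind my figures: the sample markup is invented rather than copied from a BASE file, and splitting on whitespace is a deliberately crude tokenisation rule.

```python
import xml.etree.ElementTree as ET

# INVENTED sample in a TEI-like style; not an excerpt from a real BASE file.
SAMPLE = (
    '<text><u who="nm0001">so <pause dur="0.4"/> today we look at '
    '<vocal desc="laughter" iterated="y" dur="1"/> syntax</u></text>'
)


def count_words_text_only(xml_string):
    """Count whitespace-separated tokens in the text content only."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text between tags and skips the tags themselves.
    text = " ".join(root.itertext())
    return len(text.split())


def count_words_with_markup(xml_string):
    """Naive count over the raw file: tags and attributes count as 'words'."""
    return len(xml_string.split())


text_only = count_words_text_only(SAMPLE)
with_markup = count_words_with_markup(SAMPLE)
print(f"text only: {text_only} tokens; raw file with markup: {with_markup} tokens")
```

Counting the raw file always gives the larger number, which is why leaving the markup in was my first suspect for an inflated total.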

