Update 30.12.2013: this updated post reflects improvements to the Python scripts used to generate the token counts. Links to the improved scripts are available in the footnotes. Minor changes to the token counts and frequencies have been made in the tables and text, but the main content of the post remains unchanged.
When I was earlier blogging on the frequencies of laughter in academic ELF (English as a lingua franca), I came across an article by Prof. Hilary Nesi, a compiler of the BASE corpus – the Corpus of British Academic Spoken English. She provides a qualitative analysis of the types and functions of laughter episodes in lectures from the BASE corpus and she concludes with the uncontroversial advice that British lecturers might want to adjust their use of humor when lecturing for an international audience.
I’ve waited until now to blog on Nesi’s article, since it contains obvious statistical errors that I wanted to research further. When I say obvious, I mean obvious – she cites the word count of the BASE lecture subcorpus as 2,646,920 words, when the official count of the entire corpus is only 1,644,942 words (cited in the same article). Nesi uses this oddly inflated word count to compute the standardised frequencies of laughter in lectures, which are therefore artificially low. Being naturally curious, I emailed Prof. Nesi in April to ask if she could clarify the situation, and naturally I received no reply.
To be fair, everyone makes mistakes and the quantitative findings don’t really affect her qualitative analysis. But this was published in a major peer-reviewed journal, the Journal of English for Academic Purposes. When a statistical error this basic can get past a senior researcher, two peer reviewers, and an editorial staff, it gives this junior researcher a fairly discouraging picture of academic rigor in the humanities. I might just be the first person on earth to look carefully at Nesi’s tables.
When in doubt, do it yourself
The thing that makes corpus research almost seem like real science is reproducibility – as with real experimental results, another researcher can take a linguistic corpus and try to reproduce a study’s findings. So, I downloaded the BASE corpus in XML format and set out to reproduce Nesi’s figures. She also uses the XML version of BASE, but only to search for laughter tags using the WordSmith Tools application. My first theory was that she had generated a word count for the lectures without excluding the XML markup, but even this approach didn’t reach her inflated word count.
The value of XML is in being well-formed, structured data that is computer readable. It allows you to query a corpus via a programming language, counting and calculating whatever you wish in a single pass through the corpus. I wrote such a program in Python¹ that reproduces Nesi’s tables 2 & 4 (2012: 83-84) – the results are found in Table 1 at right. It shows the raw counts for tokens² and laughter in the BASE corpus, subdivided by event type (lecture or seminar) and by the four academic domains in the corpus. Finally, the standardised frequencies are shown as laughter tags per 10,000 tokens, with the standardised frequencies from Nesi’s article shown on the far right for comparison.
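To make the idea of a single pass concrete, here is a minimal sketch of the approach, not the actual script linked in the footnotes. It uses Python 3 and the standard library's xml.etree.ElementTree instead of lxml, and the sample transcript, the element name `vocal` and the attribute `desc="laughter"` are assumptions made for illustration; the real BASE markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature transcript; element and attribute names are
# assumptions for illustration, not taken from the BASE files.
SAMPLE = """<TEI>
  <teiHeader><title>Sample lecture header text</title></teiHeader>
  <text>
    <u who="nm0001">so today we look at <vocal desc="laughter"/> corpus data</u>
    <u who="sf0002">okay <vocal desc="laughter"/> thanks</u>
  </text>
</TEI>"""


def count_tokens_and_laughs(xml_string, laugh_tag="vocal",
                            laugh_attr="desc", laugh_value="laughter"):
    """Return (token_count, laugh_count) for one transcript."""
    root = ET.fromstring(xml_string)
    body = root.find("text")  # ignore everything in the header
    # itertext() yields only character data, so the markup is stripped
    plain_text = " ".join(body.itertext())
    tokens = plain_text.split()  # tokenise at whitespace
    laughs = sum(1 for element in body.iter(laugh_tag)
                 if element.get(laugh_attr) == laugh_value)
    return len(tokens), laughs


token_count, laugh_count = count_tokens_and_laughs(SAMPLE)
print(token_count, laugh_count)
```

A parser also refuses to read a file that is not well-formed, which is how a mis-tagged transcript gets noticed at all.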
Nesi’s and my counts for laughter tags very nearly match³, and we have similar token counts and standardised frequencies for the seminar section of BASE. The problem is with Nesi’s figures for the lecture subcorpus, as her standardised frequencies are all about half of what I found. This leads me to believe that she somehow counted the tokens in the lecture subcorpus twice. Whatever the case, it appears that laughter is about twice as frequent in BASE lectures as is reported in Nesi’s article.
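The effect of a doubled denominator is easy to check. The sketch below is only an illustration in Python 3: the function name `per_10k` is hypothetical, the word count is the one Nesi cites for the lecture subcorpus, and 2,000 is the lecture laughter count quoted later in this post.

```python
def per_10k(laughs, tokens):
    """Standardised frequency: laughter tags per 10,000 tokens."""
    return laughs / tokens * 10_000


LECTURE_LAUGHS = 2000          # lecture laughter count quoted in this post
NESI_WORD_COUNT = 2_646_920    # lecture word count cited in Nesi's article

# Frequency implied by the inflated word count
print(round(per_10k(LECTURE_LAUGHS, NESI_WORD_COUNT), 1))

# Doubling the token count halves the standardised frequency,
# whatever the true token count happens to be.
some_token_count = 1_000_000   # arbitrary placeholder, not a corpus figure
assert per_10k(LECTURE_LAUGHS, 2 * some_token_count) == \
    per_10k(LECTURE_LAUGHS, some_token_count) / 2
```

Counting the same tokens twice therefore produces exactly the pattern in the table: matching laughter counts with frequencies at roughly half the expected value.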
Laughter in lectures: Brits, Yanks, and ELF
Nesi also compares her findings to those of Lee (2006) from North American data, the Michigan Corpus of Academic Spoken English, or MICASE corpus. Nesi claims that laughter occurs much more frequently on average in the BASE lectures (a laugh each 1.4 minutes; Nesi 2012: 82) compared to lectures in MICASE (a laugh each 5 minutes; Lee 2006: 54), and she explains this as genre differences between the British and American speech events. But the difference in standardised frequencies (laughs per 10,000 words) is not great at all – 15 laughs per 10,000 tokens in MICASE lectures (per Lee 2006), and 17 laughs per 10,000 tokens in BASE lectures, using my corrected figures (see Table 2 above). How to explain this inconsistency?
Since Nesi is using a per minute regularisation, I consulted the BASE holdings documentation, a spreadsheet which totals the timings of all BASE speech events. I separated the events by lecture/seminar to reproduce Nesi’s findings. As it turns out, the 2000 laughter events in the 152 hours and 46 minutes of BASE lectures come to an average of a laugh each 4.6 minutes – about the same as Lee’s MICASE finding of a laugh each 5 minutes. So where did Nesi get this odd figure of a laugh each 1.4 minutes? Well, if you take the 43 hours and 23 minutes of BASE seminars and divide it by the 2000 laughs in BASE lectures, you get a laugh per 1.3 minutes, very close to Nesi’s finding. It appears to be little more than another case of sloppy calculations.
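The arithmetic is simple enough to verify in a few lines. This is just a worked check in Python 3 using the durations and laughter count given above; the function name is hypothetical.

```python
def minutes_per_laugh(hours, minutes, laughs):
    """Average number of minutes between laughs."""
    total_minutes = hours * 60 + minutes
    return total_minutes / laughs


# BASE lecture time divided by BASE lecture laughs
lecture_rate = minutes_per_laugh(152, 46, 2000)

# BASE seminar time divided by BASE lecture laughs (the suspected mix-up)
mismatched_rate = minutes_per_laugh(43, 23, 2000)

print(round(lecture_rate, 1))     # 4.6
print(round(mismatched_rate, 1))  # 1.3
```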
We can thus dismiss Nesi’s speculations on why laughter frequencies in BASE and MICASE lectures are so different – they’re not. But I want to take up a claim she makes about laughter in lectures outside ENL (English as a native language) settings:
Laughter episodes in lectures may also be less common in countries where English is used as a medium of university instruction but is spoken as a second or foreign language, because lecturers who deliver the information content of their lectures in English tend to revert to their first language for other social and class management purposes.
(Nesi 2012: 87)
Nesi seems to be unaware that much English-medium instruction (EMI) around the world takes place with English as a lingua franca (ELF), precisely because students and teachers don’t share a first language. Such is the case with the events in the ELFA corpus, recorded in EMI university settings in Finland. With speakers of 50 first languages other than Finnish in the corpus, switching to Finnish for “social and class management” was not an option. What about laughter in these ELF lectures?
It turns out Nesi is right – Table 2 above shows the figures for laughter in lectures from the BASE, MICASE and ELFA corpora. With the caveat that the ELFA lecture subcorpus is quite a bit smaller than the others, the ELF lectures contain fewer laughs, and the difference is statistically significant.⁴ Likewise, Nesi calculated the number of times the audience laughs in BASE lectures, and the difference from my ELFA findings was also statistically significant – ELF lecture audience members laugh only 4 times per 10,000 tokens (see Table 3). Those poor ELF users must be in a grim situation, right?
Laughter in ELFA: the rest of the story
An interesting feature of the ELFA corpus is that the lecture files are divided into two parts – monologues and discussions. The figures in Tables 2 & 3 are taken from only the monologic files in order to match the lecture type described by Nesi. When the ELFA lecture monologues are compared to the lecture discussions, the difference is massive: there are 32 laughs per 10,000 words in the lecture discussions, compared to the 10 laughs per 10,000 words in the monologues (see Table 4). What’s even more striking is the fact that among the 12 lecture discussion files in ELFA, 10 are continuations of monologues also included in the corpus.
What these findings suggest is that the real genre difference concerns monologic lectures in ELF: at least in Finnish university settings, there is significantly less laughter in monologic lectures than in corresponding ENL events. But when the monologue ends and the discussion begins, the frequency of laughter more than triples. Relatively infrequent laughter may be a feature of ELF monologues, but this doesn’t mean that ELF users are unable to carry out their “social and class management” in English or that they’re unable to experience humor in a foreign language. In fact, the ELF seminar data reveals the rest of the story.
Nesi includes the laughter data from BASE seminars and notes the much higher frequency of laughter in the MICASE seminars, again invoking the likelihood of genre difference. But her description of the BASE seminars matches exactly the seminars in MICASE and ELFA: “smaller classes where the students are expected to do most of the talking, in discussion or through oral presentations” (Nesi 2012: 83). So, I compared the frequencies of laughter from Lee’s MICASE study (2006) and my findings from BASE and ELFA (see Table 5). The difference is dramatic indeed – the frequency of laughter in ELFA seminars is 55 laughs per 10,000 tokens compared to the rather pitiful 4 laughs per 10,000 words in BASE seminars.
As I’ve said before, a qualitative analysis is needed of all these laughter events to understand what’s really going on in these interactions. But it’s not possible to dismiss these ELFA findings as mere “overuse” or claim that it’s just the nervous laughter of struggling non-native speakers. As you see in Table 5, the MICASE seminars generated a similarly high frequency of laughter. When it comes to laughter in interactive seminars held in English in the U.S., Britain, and Finland, the Brits are clearly the oddballs. We might even turn the tables on the native-speaker bias and conclude that these poor British English speakers need some help.
Finally, the situation in ELF becomes clearest when we again separate the ELFA seminar files, which are divided into student presentations and their ensuing discussions. The figures in Table 5 take all these files together to match the makeup of the other corpora. However, when we separate ELFA seminar presentations from discussions, the preference for laughter in interactive discussions is indisputable, with a whopping 64 laughs per 10,000 words (see Table 6). Moreover, we see that laughter occurs in seminar presentations 18 times per 10,000 words – almost identical to the frequency in BASE and MICASE lectures (Table 2).
This leads me to two general conclusions. First, academic status likely plays a role in the distribution of laughter in ELF. In the ELFA seminar presentations, where speakers and audience share a similar academic status, laughter occurs at a rate similar to that in the ENL lectures. The frequency of laughter drops in the ELF lectures, with the accompanying inequality between lecturer and students. Second, ELF users are clearly engaged in good-natured, mutually supportive interaction in these lecture and seminar discussions. The British seminars with little laughter appear to be odd indeed. Nesi’s suggestion that British lecturers should modify their use of humor in lectures outside the UK is sound. But it looks like British students are the ones who most need to adapt to academic interaction in a lingua franca setting.
1 My analytical programs are written in Python 2.6.6 with the lxml library. In the interests of openness and reproducibility, you can view/download my Python code for the BASE corpus from this link. Likewise, my findings reported here from the ELFA corpus are derived from a modified version of this code, which you can also access from this link. Both corpora are freely available from the links under “Primary Sources” below.
2 I use “tokens” instead of “words”, since a word is surprisingly hard to define. Different corpus software counts words in different ways, e.g. “don’t” is a single word in WordSmith, but is treated as two words (do + n’t) in AntConc. For this study, I’ve taken a minimalist approach. My Python programs tokenise the XML files by 1) getting rid of the header, 2) stripping all XML markup from the text, and 3) tokenising what’s left at spaces. Thus, my token counts are slightly lower than the official BASE corpus word counts.
3 I’ve tried to account for the differences in Nesi’s and my laughter counts. Our counts for the BASE lecture subcorpus are one off, and I can’t account for that. My count for BASE seminars is five higher than Nesi’s, but that’s because one BASE file (sssem006.xml) is not well-formed XML due to several laughter elements being mis-tagged (among other problems). I corrected these by hand so my Python programs could parse the XML. This is also an example of why analysing XML in concordance software is more prone to error than within a programming environment; Nesi would have no way of knowing from WordSmith that a file is marked up in error.
4 The statistical tests performed here use Paul Rayson’s online log-likelihood calculator. The results for comparing all laughter in the BASE/ELFA lectures (Table 2) were G2=37.42 (p<0.0001), and the results for comparing audience laughter only (Table 3) were G2=25.13, also p<0.0001.
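For readers who want to see what such a test involves, here is a minimal sketch of a two-corpus log-likelihood (G2) calculation in Python 3. It uses the two-term formula commonly associated with Rayson's calculator; that the online tool computed exactly this is an assumption, and the counts in the example are invented placeholders, not figures from BASE or ELFA.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """G2 for one item observed freq1 times in a corpus of size1 tokens
    and freq2 times in a corpus of size2 tokens."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if the item were equally common in both corpora
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented placeholder counts: 170 vs 100 laughs in two corpora
# of 100,000 tokens each.
example_g2 = log_likelihood(170, 100_000, 100, 100_000)

# With one degree of freedom, 15.13 is the usual critical value
# for p < 0.0001.
print(example_g2 > 15.13)
```

When the two corpora have identical relative frequencies the function returns 0, and the value grows as the frequencies diverge or the corpora get larger.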
ELFA (2008) The Corpus of English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. http://www.helsinki.fi/elfa/elfacorpus (27.7.2013).
And here’s the “mandatory” citation demanded by the BASE website: “The transcriptions used in this study come from the British Academic Spoken English (BASE) corpus project. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.” They don’t give a year of corpus completion. Here’s the website: http://www2.warwick.ac.uk/fac/soc/al/research/collect/base/ (27.7.2013).
Lee, David (2006) Humor in academic spoken discourse. NUCB Journal of Language, Culture and Communication, 8 (1), 49-68. Available online: http://ci.nii.ac.jp/naid/110004860604/en (27.7.2013).
Nesi, Hilary (2012) Laughter in university lectures. Journal of English for Academic Purposes, 11 (2), 79-89. DOI: 10.1016/j.jeap.2011.12.003.