ELFA project

In defense of good data: the question of third-person singular –s


There’s a special place in heaven called “Midsummer in Finland”. This is a recent sunset viewed at the Saimaa in eastern Finland.
© Nina Valtavirta

In the early days of ELF research, it was sometimes claimed that English used as a lingua franca (ELF) between its second language speakers might be a separate and unique variety of English. No one seems to want to defend this claim any longer, and more emphasis is placed on the inherent complexity and fluidity of these lingua franca encounters. Yet, despite this distance from explicit claims of variety status, there is still a tendency for ELF researchers to treat ELF as a bounded object.

This is the argument developed by Janus Mortensen in the latest issue of the Journal of English as a Lingua Franca. He discusses a tendency in ELF research to treat ELF as a language system alongside English as a native language (ENL), in effect reifying ELF or treating it as a bounded object. As a result of this reification, ELF is “turned into a bounded object that can be delimited and characterized in terms of specific properties”, including properties of a formal linguistic nature (Mortensen 2013: 30).

One such linguistic property that Mortensen discusses is the marking of 3rd-person singular verbs in present simple tense: she studies in the university. This so-called 3rd-person singular –s morpheme is an anomaly of the English verb system (I study, you study, we study, and they study, but she studies), and some varieties of English regularise this feature: she study in the university. This “dropping” of the 3rd-person –s (also referred to as 3rd-person zero) has been proposed as a prominent feature of ELF talk since the early 2000s, and it is precisely this notion of a broadly claimed “ELF variant” that Mortensen objects to.

“Emerging as the default option”

As recently as 2012, ELF researchers Alessia Cogo and Martin Dewey claimed that “at least in certain types of ELF settings, 3rd person zero appears to be emerging as the default option in informal naturally occurring communications” (Cogo & Dewey 2012: 49). This claim is based on the findings of Dewey’s PhD dissertation (2007a), much of which is also discussed in Cogo & Dewey (2006). In brief, Dewey transcribed eight hours of ELF interaction to compile a 60,000-word corpus. He identified all instances of 3rd-person present singular verbs and noted whether they appeared with or without the –s morpheme.

Table 1. The findings from Dewey’s research on 3rd-person singular zero.
Source: Cogo & Dewey 2006: 77.

His results? Of the 211 main verbs (excluding auxiliary verbs) that function as a 3rd-person singular verb, 103 (48%) were in the –s form, while 108 (52%) were found with zero marking (Cogo & Dewey 2006: 77). This reported finding is the basis for the claims that “the s and zero are competing variants in ELF communication” and “It also appears from the data that the 3rd person zero is the variant that is winning this competition” (Cogo & Dewey 2006: 76–77). More troublingly, these percentages are repeated elsewhere (e.g. Jenkins et al. 2011: 290) without mentioning the small size of Dewey’s corpus or what, in fact, is in it.

All Dewey will tell us about his 60,000 words is that they come from 38 transcribed communicative events “ranging from informal, entirely unplanned conversations to semi-formal seminar presentations, but with a heavy bias towards naturally occurring non-instructional interactions” (Cogo & Dewey 2012: 40). This is a phrase that gets repeated in his other publications,1 but it’s not enough information to evaluate the contents of the corpus. So, I emailed Dewey in October 2012 to request a copy of his PhD dissertation for more information on his corpus. He never replied.

Dewey’s PhD (2007a) is publicly available if you have access to the single copy sitting in Maughan Library at King’s College London. The rest of us have to wonder what’s in this unpublished work, which is cited almost 20 times according to Google Scholar, including in eight publications (co)authored by Dewey, Cogo, and/or Jenkins. More importantly, we have to wonder exactly where and how this data was collected. I would encourage Dewey to make his PhD publicly available with the detail it affords on his corpus compilation.2

Big conclusions from a small database

Leaving the question of data quality aside, Mortensen argues the following about these claims concerning 3rd-person zero marking:

There are cases of ELF data where the zero variant is very frequent and may perhaps be said to be the unmarked choice from an emic perspective (as in the case of Cogo and Dewey’s data), but this does not allow us to generalize and posit the zero variant as the unmarked choice in all ELF interactions.

(Mortensen 2013: 32, emphasis in original)

He further points out that other studies employing similarly small sets of data found quite different results, and nothing close to Dewey’s high rates of occurrence. What these 3rd-person –s studies seem to demonstrate is that small sets of data are unrepresentative – it’s dangerous to make sweeping claims of “emerging trends” based on such limited linguistic data. Fortunately, these questions can now be answered with a meaningful dataset.

Table 2. Comparison of the contents of Dewey’s ELF corpus and the VOICE corpus.

Earlier this year, the members of the VOICE project (Vienna-Oxford International Corpus of English) at the University of Vienna released an updated, part-of-speech tagged (POS-tagged) version of their one-million-word corpus of transcribed ELF interactions. Since automatic POS-taggers are unreliable on spoken text, the VOICE team performed the laborious task of checking all these tags and correcting them by hand, ensuring accurate labels.

Moreover, since ELF research seeks to capture the variation inherent to ELF, a double POS-tag was employed – one code for the word’s function, and another for the word’s form. In this way, recurring features such as 3rd-person zero are easily identified. In an utterance like she study in the university, the verb study receives a form tag for a generic verb (V), but also a function tag for 3rd-person present singular (VVZ). So, all you have to do is find all the verbs that function as 3rd-person present singular, but don’t have the 3rd-person singular form.
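To make the double-tag logic concrete, here is a minimal sketch in Python 3. It is not the script used for this post, and the `Token` structure is a hypothetical stand-in for however the corpus actually stores the two tags. The tag labels (V, VVZ) are simply the ones used in the paragraph above.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical container for one word and its two POS tags."""
    word: str
    form_tag: str      # what the word looks like
    function_tag: str  # what the word does in the utterance


def functions_as_3sg_without_3sg_form(token: Token) -> bool:
    """True when a verb functions as 3rd-person singular (VVZ)
    but does not carry the matching VVZ form tag."""
    return token.function_tag == "VVZ" and token.form_tag != "VVZ"


# "she study in the university" vs. "she studies in the university"
zero_marked = Token("study", "V", "VVZ")
s_marked = Token("studies", "VVZ", "VVZ")

print(functions_as_3sg_without_3sg_form(zero_marked))  # True
print(functions_as_3sg_without_3sg_form(s_marked))     # False
```

The check compares the two tags rather than looking at the spelling of the verb, which is why it relies entirely on the hand-corrected tagging.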

The danger of unrepresentative data

The POS-tagged VOICE corpus can be downloaded in XML format, which makes it possible to let the computer find these cases and to test Dewey’s claims on a systematically compiled corpus. I wrote a program in Python that checks each of the 1,142,982 word elements (<w>) in the XML corpus and examines the double POS-tag, counting the number of verbs that function as 3rd-person singular, but bear a different form.

Rather than follow Dewey’s methodology of manually excluding all auxiliary verbs, I took advantage of the POS-tagging conventions to limit the dataset. There are three tags for 3rd-person singular verbs in the VOICE corpus – VBZ for the verb be (she is a student), VHZ for the verb have (she has a class on Thursday), and VVZ for all other verbs, as in the example she studies in the university. By excluding these forms of be and have, I exclude two high-frequency verbs that are often auxiliaries.
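The counting procedure described above can be sketched roughly as follows. This is an illustration in Python 3 using the standard library's `xml.etree.ElementTree` rather than lxml, and it is not the original script. The inline sample and the `form` and `function` attribute names are assumptions made for the example; the real VOICE XML encodes its tags differently and should be checked against the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature corpus. The attribute names "form" and
# "function" are invented for illustration, not the VOICE schema.
SAMPLE = """
<text>
  <u>
    <w form="V" function="VVZ">study</w>
    <w form="VVZ" function="VVZ">studies</w>
    <w form="VBZ" function="VBZ">is</w>
    <w form="VHZ" function="VHZ">has</w>
    <w form="V" function="VHZ">have</w>
  </u>
</text>
"""

# The three 3rd-person singular function tags named in the post.
THIRD_SINGULAR = ("VBZ", "VHZ", "VVZ")


def count_third_singular(root):
    """Count, per function tag, all 3rd-person singular verbs and
    those whose form tag does not match the function tag."""
    totals = Counter()
    mismatches = Counter()
    for w in root.iter("w"):
        function = w.get("function")
        if function in THIRD_SINGULAR:
            totals[function] += 1
            if w.get("form") != function:
                mismatches[function] += 1
    return totals, mismatches


root = ET.fromstring(SAMPLE.strip())
totals, mismatches = count_third_singular(root)
for tag in THIRD_SINGULAR:
    print(tag, "total:", totals[tag], "non-matching form:", mismatches[tag])
```

Keeping the counts separate per tag makes it easy to report be and have on their own and then set them aside, leaving only the VVZ figures.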

Figure 1. The data generated by my Python script on the 3rd-person singular POS tags in the VOICE POS XML corpus.

Similar to Dewey’s claims about auxiliaries, these high-frequency verbs overwhelmingly occur in conventional form in VOICE. Of the 1,461 occurrences of VHZ (she has), only 26 (1.8%) have an unexpected form. Of the 15,632 instances of VBZ (she is / she’s), a mere 19 (0.1%) have a non-matching form. After removing these 17,093 observations from consideration, we’re left with 5,335 verbs functioning as 3rd-person singular (tagged VVZ).

Of these remaining 5,335 verbs functioning as 3rd-person singular, only 310 have a non-matching form tag, almost all of these being the famous 3rd-person zero. Unfortunately for Dewey’s claims of 3rd-person zero “winning this competition”, these 310 cases amount to 5.8% of verbs functioning as 3rd-person singular – a very long way from Dewey’s finding of 52%. If we add the previously excluded 3rd-person singular cases of be and have, the percentage of “innovative” verb forms drops to less than 2%.
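The percentages above can be rechecked from the reported counts alone. The short Python 3 sketch below only redoes the arithmetic on the figures given in this post; the variable and function names are illustrative.

```python
# (total occurrences, occurrences with a non-matching form tag),
# as reported in the text above.
counts = {
    "VHZ": (1461, 26),
    "VBZ": (15632, 19),
    "VVZ": (5335, 310),
}


def share(mismatched, total):
    """Percentage of occurrences with a non-matching form."""
    return 100.0 * mismatched / total


for tag, (total, mismatched) in counts.items():
    print(f"{tag}: {mismatched}/{total} = {share(mismatched, total):.1f}%")

# Pool all three tags to get the overall rate, be and have included.
all_total = sum(total for total, _ in counts.values())
all_mismatched = sum(mismatched for _, mismatched in counts.values())
print(f"all: {all_mismatched}/{all_total} = "
      f"{share(all_mismatched, all_total):.1f}%")
```

This gives 1.8% for VHZ, 0.1% for VBZ and 5.8% for VVZ. Pooled, 355 of 22,428 verbs have a non-matching form, about 1.6%, which is the "less than 2%" mentioned above.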

Letting good data speak for itself

These findings from the VOICE corpus render Dewey’s claims about 3rd-person zero – including his speculations about speakers’ motivations for employing this “default option” – highly dubious. Although 3rd-person zero marking is indeed present in ELF interaction as an unproblematic feature, the evidence from VOICE gives no credence to researchers’ intuition that 3rd-person zero is especially prominent in ELF. In some speech events (such as in Dewey’s data) it may be more prominent, but there is simply no evidence for a global, ELF-specific trend.

This is connected to Mortensen’s criticisms of ELF being reified, or treated as a language system of its own. He argues instead for conceptualising ELF as English used as a lingua franca, which may vary considerably between ELF speech events. It is recognisably English, but not such a homogeneous entity that it can be called an English, in the sense of a bounded object. This shift of focus to English used as a lingua franca puts greater focus on the context and purpose of use, which can influence the form of ELF more than the fact that it is being deployed by second-language users.

Finally, in the interest of open science, you can inspect my programming code and reproduce my findings from the VOICE corpus. The POS-tagged XML can be downloaded here, and my Python code which analyses the 3rd-person POS tags can be downloaded from this link.3 The code outputs the findings as plain text, which you can also download here (Figure 1 is based on this). This output file also includes all the linguistic types of the 3rd-person zero forms discussed above (n=310), together with their number of occurrences.

See also: In search of wild diversity: a closer look at 3rd-person zero marking in ELF

Notes

1 See e.g. Cogo & Dewey 2006: 63 and Dewey 2007b: 351n for similar claims.

2 In the meantime, if you’re one of those lucky few who have a copy of the dissertation, have a look at Appendix B (Dewey 2007a: 309–317) with a calculator and decide for yourself if his 60,000 words indeed represent “a heavy bias towards naturally occurring non-instructional interactions”.

3 On a technical note, the code is written in Python 2.6.6 with the lxml library (v. 3.0.1), and to make it work you have to first delete all the namespace declarations in the VOICE XML. The Python file is designed to be saved and run in IDLE from the parent directory of the “VOICEPOSXML2.0” folder of the download.

If you’re not familiar with programming but still curious, the code can also be viewed in a plain text editor such as Notepad.
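As a possible alternative to deleting the namespace declarations by hand (note 3), the namespace prefixes can be stripped after parsing. The sketch below is in Python 3 with the standard library's `xml.etree.ElementTree`, not the Python 2.6.6 and lxml setup described above. The sample document and its TEI namespace URI are assumptions made for illustration and have not been checked against the VOICE files.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced sample; the namespace URI is an assumption.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<text><w>she</w><w>studies</w></text>"
    "</TEI>"
)


def strip_namespaces(root):
    """Rewrite tags like '{uri}w' to plain 'w', in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))
words = [w.text for w in root.iter("w")]
print(words)
```

After stripping, plain tag names such as `w` can be used in searches without editing the downloaded XML itself.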

References

Cogo, A., & Dewey, M. (2006) Efficiency in ELF Communication: From Pragmatic Motives to Lexico-grammatical Innovation. Nordic Journal of English Studies, 5 (2), 59–93. Available online: http://ub016045.ub.gu.se/ojs/index.php/njes/article/view/65.

Cogo, A., & Dewey, M. (2012) Analysing English as a Lingua Franca: A Corpus-driven Investigation. London: Continuum.

Dewey, M. (2007a) English as a Lingua Franca: an empirical study of innovation in lexis and grammar. Unpublished PhD thesis, King’s College London.

Dewey, M. (2007b) English as a lingua franca and globalization: an interconnected perspective. International Journal of Applied Linguistics, 17 (3), 332–354. DOI: 10.1111/j.1473-4192.2007.00177.x

Jenkins, J., Cogo, A., & Dewey, M. (2011) Review of developments in research into English as a lingua franca. Language Teaching, 44 (3), 281–315. DOI: 10.1017/S0261444811000115

Mortensen, J. (2013) Notes on English used as a lingua franca as an object of study. Journal of English as a Lingua Franca, 2 (1), 25–46. DOI: 10.1515/jelf-2013-0002

VOICE (2013) The Vienna-Oxford International Corpus of English (version POS XML 2.0). Director: Barbara Seidlhofer. Available online: http://www.univie.ac.at/voice/page/download_voice_xml
