In defense of good data: the question of third-person singular –s

There’s a special place in heaven called “Midsummer in Finland”. This is a recent sunset viewed at the Saimaa in eastern Finland.
© Nina Valtavirta

In the early days of ELF research, it was sometimes claimed that English used as a lingua franca (ELF) between its second language speakers might be a separate and unique variety of English. No one seems to want to defend this claim any longer, and more emphasis is placed on the inherent complexity and fluidity of these lingua franca encounters. Yet, despite this distance from explicit claims of variety status, there is still the tendency for ELF researchers to treat ELF as a bounded object.

This is the argument developed by Janus Mortensen in the latest issue of the Journal of English as a Lingua Franca. He discusses a tendency in ELF research to treat ELF as a language system alongside English as a native language (ENL), in effect reifying ELF or treating it as a bounded object. As a result of this reification, ELF is “turned into a bounded object that can be delimited and characterized in terms of specific properties”, including properties of a formal linguistic nature (Mortensen 2013: 30).

One such linguistic property that Mortensen discusses is the marking of 3rd-person singular verbs in present simple tense: she studies in the university. This so-called 3rd-person singular –s morpheme is an anomaly of the English verb system (I study, you study, we study, and they study, but she studies), and some varieties of English regularise this feature: she study in the university. This “dropping” of the 3rd-person –s (also referred to as 3rd-person zero) has been proposed as a prominent feature of ELF talk since the early 2000s, and it is precisely this notion of a broadly claimed “ELF variant” that Mortensen objects to.

“Emerging as the default option”

As recently as 2012, ELF researchers Alessia Cogo and Martin Dewey claimed that “at least in certain types of ELF settings, 3rd person zero appears to be emerging as the default option in informal naturally occurring communications” (Cogo & Dewey 2012: 49). This claim is based on the findings of Dewey’s PhD dissertation (2007a), much of which is also discussed in Cogo & Dewey (2006). In brief, Dewey transcribed eight hours of ELF interaction to compile a 60,000-word corpus. He identified all instances of 3rd-person present singular verbs, and noted whether they appeared with or without the –s morpheme.

Table 1. The findings from Dewey’s research on 3rd-person singular zero.
Source: Cogo & Dewey 2006: 77.

His results? Of the 211 main verbs (excluding auxiliary verbs) that function as a 3rd-person singular verb, 103 (48%) were in the –s form, while 108 (52%) were found with zero marking (Cogo & Dewey 2006: 77). This reported finding is the basis for the claims that “the s and zero are competing variants in ELF communication” and “It also appears from the data that the 3rd person zero is the variant that is winning this competition” (Cogo & Dewey 2006: 76–77). More troublingly, these percentages are repeated elsewhere (e.g. Jenkins et al. 2011: 290) without mentioning the small size of Dewey’s corpus or what, in fact, is in it.

All Dewey will tell us about his 60,000 words is that they come from 38 transcribed communicative events “ranging from informal, entirely unplanned conversations to semi-formal seminar presentations, but with a heavy bias towards naturally occurring non-instructional interactions” (Cogo & Dewey 2012: 40). This is a phrase that gets repeated in his other publications,1 but it’s not enough information to evaluate the contents of the corpus. So, I emailed Dewey in October 2012 to request a copy of his PhD dissertation for more information on his corpus. He never replied.

Dewey’s PhD (2007a) is publicly available if you have access to the single copy sitting in Maughan Library at King’s College London. For the rest of us, we have to wonder what’s in this unpublished work cited almost 20 times according to Google Scholar, including in eight publications (co)authored by Dewey, Cogo, and/or Jenkins. More importantly, we have to wonder exactly where and how this data was collected. I would encourage Dewey to make his PhD publicly available with the detail it affords on his corpus compilation.2

Big conclusions from a small database

Leaving the question of data quality aside, Mortensen argues the following about these claims concerning 3rd-person zero marking:

There are cases of ELF data where the zero variant is very frequent and may perhaps be said to be the unmarked choice from an emic perspective (as in the case of Cogo and Dewey’s data), but this does not allow us to generalize and posit the zero variant as the unmarked choice in all ELF interactions.

(Mortensen 2013: 32, emphasis in original)

He further points out that other studies employing similarly small sets of data found quite different results, and nothing close to Dewey’s high rates of occurrence. What these 3rd-person –s studies seem to demonstrate is that small sets of data are unrepresentative – it’s dangerous to make sweeping claims of “emerging trends” based on such limited linguistic data. Fortunately, these questions can now be answered with a meaningful dataset.

Table 2. Comparison of the contents of Dewey’s ELF corpus and the VOICE corpus.

Earlier this year, the members of the VOICE project (Vienna-Oxford International Corpus of English) at the University of Vienna released an updated, part-of-speech tagged (POS-tagged) version of their one-million-word corpus of transcribed ELF interactions. Since automatic POS-taggers are unreliable on spoken text, they performed the laborious task of checking all these tags and correcting them by hand, ensuring accurate labels.

Moreover, since ELF research seeks to capture the variation inherent to ELF, a double POS-tag was employed – one code for the word’s function, and another for the word’s form. In this way, recurring features such as 3rd-person zero are easily identified. In an utterance like she study in the university, the verb study receives a form tag for a generic verb (V), but also a function tag for 3rd-person present singular (VVZ). So, all you have to do is find all the verbs that function as 3rd-person present singular, but don’t have the 3rd-person singular form.
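The form/function check described above can be sketched in a few lines of Python. This is only an illustration, not the script used for this post: the `Token` record and its field names are hypothetical, and the way VOICE actually encodes the two tags in its XML may differ. The tag values V and VVZ are the ones given in the text.

```python
from collections import namedtuple

# Hypothetical token record; VOICE's real encoding of the double tag may differ.
Token = namedtuple("Token", ["word", "form_tag", "function_tag"])


def is_nonmatching_3sg(token):
    """True if the token functions as 3rd-person singular (VVZ)
    but does not carry the 3rd-person singular form tag."""
    return token.function_tag == "VVZ" and token.form_tag != "VVZ"


# Only the verbs from the two example utterances are shown here.
tokens = [
    Token("studies", "VVZ", "VVZ"),  # she studies in the university
    Token("study", "V", "VVZ"),      # she study in the university
]

hits = [t.word for t in tokens if is_nonmatching_3sg(t)]
print(hits)
```

Only the second token is flagged, since its form tag (V) does not match its function tag (VVZ).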

The danger of unrepresentative data

The POS-tagged VOICE corpus can be downloaded in XML format, which makes it possible to let the computer find these cases and to test Dewey’s claims on a systematically compiled corpus. I wrote a program in Python that checks each of the 1,142,982 word elements (<w>) in the XML corpus and examines the double POS-tag, counting the number of verbs that function as 3rd-person singular, but bear a different form.

Rather than follow Dewey’s methodology of manually excluding all auxiliary verbs, I took advantage of the POS-tagging conventions to limit the dataset. There are three tags for 3rd-person singular verbs in the VOICE corpus – VBZ for the verb be (she is a student), VHZ for the verb have (she has a class on Thursday), and VVZ for all other verbs, like the she studies in the university example. By excluding these forms of be and have, I exclude two high-frequency verbs that are often auxiliaries.
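For readers who want a feel for the counting step, here is a minimal sketch under stated assumptions. It is not my original script (which used lxml under Python 2); it uses Python 3's standard `xml.etree.ElementTree`, and the `form` and `function` attributes on `<w>` are hypothetical stand-ins for however the downloaded XML really stores the double tag. The sample fragment is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The three 3rd-person singular function tags named in the text.
THIRD_SG_TAGS = ("VBZ", "VHZ", "VVZ")


def count_third_singular(xml_text):
    """Count <w> elements per 3rd-person singular function tag, and how
    many of them carry a form tag that differs from the function tag.

    Assumes hypothetical 'form' and 'function' attributes on each <w>.
    """
    root = ET.fromstring(xml_text)
    totals = Counter()
    nonmatching = Counter()
    for w in root.iter("w"):
        function_tag = w.get("function")
        if function_tag in THIRD_SG_TAGS:
            totals[function_tag] += 1
            if w.get("form") != function_tag:
                nonmatching[function_tag] += 1
    return totals, nonmatching


# Invented fragment, containing verbs only.
SAMPLE = """<u>
  <w form="VBZ" function="VBZ">is</w>
  <w form="VHZ" function="VHZ">has</w>
  <w form="VVZ" function="VVZ">studies</w>
  <w form="V" function="VVZ">study</w>
</u>"""

totals, nonmatching = count_third_singular(SAMPLE)
for tag in THIRD_SG_TAGS:
    print(tag, totals[tag], nonmatching[tag])
```

Keeping separate counters per tag makes it easy to report be and have on their own and then leave them out of the main-verb figure.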

Figure 1. The data generated by my Python script on the 3rd-person singular POS tags in the VOICE POS XML corpus.

Similar to Dewey’s claims about auxiliaries, these high-frequency verbs overwhelmingly occur in conventional form in VOICE. Of the 1461 occurrences of VHZ (she has), only 26 (1.8%) have an unexpected form. Of the 15,632 instances of VBZ (she is / she’s), a mere 19 (0.1%) have a non-matching form. After removing these 17,093 observations from consideration, we’re left with 5335 verbs functioning as 3rd-person singular (tagged VVZ).

Of these remaining 5335 verbs functioning as 3rd-person singular, only 310 have a non-matching form tag, almost all of these being the famous 3rd-person zero. Unfortunately for Dewey’s claims of 3rd-person zero “winning this competition”, these 310 cases amount to 5.8% of verbs functioning as 3rd-person singular – a very long way from Dewey’s finding of 52%. If we add the previously excluded 3rd-person singular cases of be and have, the percentage of “innovative” verb forms drops to less than 2%.
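The percentages above can be checked with a short calculation. The counts are the ones reported in this post; the variable and function names are just for illustration.

```python
# (non-matching forms, total occurrences) per function tag, as reported above.
counts = {
    "VHZ": (26, 1461),
    "VBZ": (19, 15632),
    "VVZ": (310, 5335),
}


def percent(part, whole):
    """Share of 'part' in 'whole', as a percentage."""
    return 100.0 * part / whole


# Rate of non-matching forms for each tag, rounded to one decimal place.
rates = {tag: round(percent(n, total), 1) for tag, (n, total) in counts.items()}

# Rate across all three tags combined (be and have included).
all_nonmatching = sum(n for n, _ in counts.values())
all_total = sum(total for _, total in counts.values())
overall = round(percent(all_nonmatching, all_total), 1)

print(rates)
print(all_nonmatching, all_total, overall)
```

With be and have included, 355 of 22,428 verbs have a non-matching form, which is about 1.6% and consistent with the "less than 2%" figure.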

Letting good data speak for itself

These findings from the VOICE corpus render Dewey’s claims about 3rd-person zero – including his speculations about speakers’ motivations for employing this “default option” – highly dubious. Although 3rd-person zero marking is indeed present in ELF interaction as an unproblematic feature, the evidence from VOICE gives no credence to researchers’ intuition that 3rd-person zero is especially prominent in ELF. In some speech events (such as in Dewey’s data) it may be more prominent, but there is simply no evidence for a global, ELF-specific trend.

This is connected to Mortensen’s criticisms of ELF being reified, or treated as a language system of its own. He argues instead for conceptualising ELF as English used as a lingua franca, which may vary considerably between ELF speech events. It is recognisably English, but not such a homogeneous entity that it can be called an English, in the sense of a bounded object. This shift of focus to English used as a lingua franca puts greater focus on the context and purpose of use, which can influence the form of ELF more than the fact that it is being deployed by second-language users.

Finally, in the interest of open science, you can inspect my programming code and reproduce my findings from the VOICE corpus. The POS-tagged XML can be downloaded here, and my Python code which analyses the 3rd-person POS tags can be downloaded from this link.3 The code outputs the findings as plain text, which you can also download here (Figure 1 is based on this). This output file also includes all the linguistic types of the 3rd-person zero forms discussed above (n=310), together with their number of occurrences.

See also: In search of wild diversity: a closer look at 3rd-person zero marking in ELF

Notes

1 See e.g. Cogo & Dewey 2006: 63 and Dewey 2007b: 351n for similar claims.

2 In the meantime, if you’re one of those lucky few who have a copy of the dissertation, have a look at Appendix B (Dewey 2007a: 309–317) with a calculator and decide for yourself if his 60,000 words indeed represent “a heavy bias towards naturally occurring non-instructional interactions”.

3 On a technical note, the code is written in Python 2.6.6 with the lxml library (v. 3.0.1), and to make it work you have to first delete all the namespace declarations in the VOICE XML. The Python file is designed to be saved and run in IDLE from the parent directory of the “VOICEPOSXML2.0” folder of the download.

If you’re not familiar with programming but still curious, the code can also be viewed in a plain text editor such as Notepad.
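As an alternative to deleting the namespace declarations by hand (note 3), the namespace prefixes can be stripped from element tags after parsing. This is a sketch using Python 3's standard `xml.etree.ElementTree` rather than lxml, and I have not run it against the full VOICE download; the sample fragment and its namespace URI are placeholders.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Drop the '{namespace-uri}' prefix from every element tag, in place.

    Attribute names are left untouched.
    """
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return root


# Placeholder fragment with a default namespace (the URI is made up).
SAMPLE = '<text xmlns="http://example.org/ns"><u><w>she</w><w>study</w></u></text>'

root = strip_namespaces(ET.fromstring(SAMPLE))
words = [w.text for w in root.iter("w")]
print(root.tag, words)
```

After stripping, elements can be found by their plain names such as `w`, without spelling out the namespace in every lookup.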

References

Cogo, A., & Dewey, M. (2006) Efficiency in ELF Communication: From Pragmatic Motives to Lexico-grammatical Innovation. Nordic Journal of English Studies, 5 (2), 59-93. Available online: http://ub016045.ub.gu.se/ojs/index.php/njes/article/view/65.

Cogo, A. & Dewey, M. (2012) Analysing English as a Lingua Franca: A Corpus-driven Investigation. London: Continuum.

Dewey, M. (2007a) English as a Lingua Franca: an empirical study of innovation in lexis and grammar. Unpublished PhD thesis, King’s College London.

Dewey, M. (2007b) English as a lingua franca and globalization: an interconnected perspective. International Journal of Applied Linguistics, 17 (3), 332-354. DOI: 10.1111/j.1473-4192.2007.00177.x

Jenkins, J., Cogo, A., & Dewey, M. (2011) Review of developments in research into English as a lingua franca. Language Teaching, 44 (3), 281-315. DOI: 10.1017/S0261444811000115

Mortensen, J. (2013) Notes on English used as a lingua franca as an object of study. Journal of English as a Lingua Franca, 2 (1), 25-46. DOI: 10.1515/jelf-2013-0002

VOICE (2013) The Vienna-Oxford International Corpus of English (version POS XML 2.0). Director: Barbara Seidlhofer. Available online: http://www.univie.ac.at/voice/page/download_voice_xml

4 thoughts on “In defense of good data: the question of third-person singular –s”

  1. eflnotes says:

    hi

    very interesting analysis

    i was wondering whether differences in the way Cogo and Dewey annotated their corpus contribute to their finding; by contrast, the VOICE corpus was purposely coded from an ELF point of view. that is, if the VOICE corpus was coded like an L1 corpus, would there be a greater percentage of 3rd person zero?

    ta
    mura

    • Ray Carey says:

      Thanks mura! It’s not clear what annotation was used in Dewey’s work, but I don’t think that’s the issue here. The unique thing about VOICE is not the ELF point of view, but their inclusion of both a form and function POS tag. Typically POS-tagged corpora don’t distinguish between these, but that’s because of POS-tagging’s basis in sentence grammars and written text, not because they’re L1 corpora.

      The difference in the VOICE/Dewey data shouldn’t be in the approach to annotation. The form/function analysis is usually straightforward, and both databases have been analyzed using this same form/function distinction, regardless of how they coded it. The difference is likely just the small size of Dewey’s sample.

      An interesting question that could also be answered is the frequency of 3rd-person zero in individual VOICE speech events. Are there individual events in the VOICE corpus with rates of occurrence that are similar to Dewey’s data? It’s very possible, as it only takes a few speakers who exhibit this feature regularly to skew the findings of a single event. That’s why we need large corpora to draw reasonable conclusions.

      I could adjust the existing programming code to also find these figures, and I probably will. But these posts with original findings take a lot of time, especially when debugging code and verifying counts are involved! Always more curiosity than time.

      – Ray

  2. Barry Kavanagh says:

    Hi Ray.

    It is interesting that you began this entry mentioning that ELF is not considered to be a unique variety of English and ‘No one seems to want to defend this claim…’. So far so clear.

    Yet it seems to me that in some of the texts describing ELF corpora, ELF is being treated as a variety of English. The VOICE corpus of ELF announces, ‘It is the ultimate aim of the VOICE project to open the way for a large-scale and in-depth linguistic description’ of ELF (VOICE, What is Voice? page http://www.univie.ac.at/voice/page/what_is_voice), while the ELFA corpus of ELF ‘offers a contribution towards an empirical basis for understanding this variety of English… ELF research seeks to establish its characteristic features which deviate from Standard English, and look for possible “core” features of ELF’ (Mauranen, description of the project http://www.helsinki.fi/englanti/elfa/project.html).

    Barry

    • Ray Carey says:

      Hi Barry, thanks for your comment! You’ve pointed out the type of language that Mortensen also criticises in his article. I’ve only looked at one aspect of his argument in this post, but part of his paper looks specifically at the VOICE website definition of ELF that describes it as “an additionally acquired language system”. I just noticed, however, that this language has since been removed from the VOICE FAQ, where Mortensen retrieved it. I think Mortensen’s article has been successful in prompting me as well to be more precise in exactly what it is I’m talking about. And that’s a good thing.

      Regarding Mauranen’s text on the ELFA website: I recently noticed the same passages you’ve pointed out, and I wondered if I ought to ask her if she might want to make revisions to the text. There are a couple reasons for this. First, the text is fairly old. I updated it a year ago only to add the mention of the WrELFA corpus, but I think it’s been on the site for around five years (i.e. at least since I started hanging around). Second, the language you’ve pointed out doesn’t seem to reflect her current thinking. The following passage is from Mauranen 2012 (Exploring ELF: Academic English shaped by non-native speakers), where she offers a definition of ELF:

      ELF speakers must be identified with respect to a definition of a lingua franca. This book adopts a general definition compatible with the traditional concept of a lingua franca as a vehicular language used by speakers who do not share a first language. This is a situation-based definition, and speakers may take on the identity of a lingua franca speaker in response to situational demands, irrespectively of whether in other situations they approach the same language as a target language to be learned. (Mauranen 2012: 8)

      So you can see that this situational definition and the website text are quite different. Moreover, I have the pdf proofs of the 2012 book, and an automated search shows no place in the text where ELF is described in terms of “variety” or “core features” or “characteristic features”. The fact is, all the earliest ELF researchers have adjusted their conceptualisations of ELF over the years, as good scholars tend to do. Your observation is well-founded, and a good motivation to finally bring the ELFA website text back to Anna’s attention.

      Regarding the quote from “What is VOICE”, I should point out the rest of the quote: “large-scale and in-depth linguistic description of this most common contemporary use of English”. So the description is of English in use, in this same situational sense. I believe this is within the bounds of Mortensen’s criticism. We are most certainly doing linguistic description of English – the ‘E’ is not in doubt. The issue is more with how to conceptualize the ‘LF’, either treating ELF as a bounded object like another second-language variety (e.g. Singaporean English) or treating ELF as a situationally specific form of English in use.

      Good observations, and thanks for your interest!
