ELFA project

Needles in a haystack: questioning the “fluidity” of ELF


As I’ve earlier argued on this blog, sometimes the claims of “fluidity”, “diversity”, and “innovation” found in English as a lingua franca (ELF) research are overstated. It’s so diverse that even ordinary diversity won’t do – it’s “super-diversity” now. It could very well be ultra-mega-diversity-squared, but the question of the prominence of these presumably innovative features is a quantitative one. More specifically, it’s a question of how frequently any variant forms might occur in naturally occurring ELF interaction, relative to the conventional forms. One of my shameless nerd hobbies is writing little Python programs to query corpora, and several of these mini-studies have appeared on this blog. I especially enjoy working with the VOICE corpus, which is great because 1) it contains a million words of unelicited ELF interaction; 2) it’s ready for processing as well-formed XML; and 3) it has been meticulously part-of-speech (POS) tagged for both the form and function of each word in the corpus.

The value of this double form-function tag is that it reveals every token in the corpus where a word like fluently, which is formally recognisable as an adverb, functions in a different way, like as an adjective: i think you are very fluently in english. This example of fluently from VOICE has a form tag of RB (adverb), but a function tag of JJ (adjective) to reflect that fluently seems to be serving in an adjectival function. This kind of form-function variation in ELF is presumably prominent enough that it necessitates this double tagging to adequately describe the fluidity. The VOICE team was kind enough to carry out this formidable task involving manual inspection of all million words. Now that this resource is in place (and freely available), the instances of these form-function mismatches can be easily found, counted, and viewed in context.
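For readers who like to poke at XML themselves, here is a minimal sketch of the kind of query involved. The element and attribute names below are my own stand-ins, not the real VOICE markup (see the VOICE POS-tagging manual for that); the point is simply comparing each token's form tag against its function tag.

```python
import xml.etree.ElementTree as ET

# A made-up mini-sample in the spirit of VOICE's POS-tagged XML. The
# "form"/"function" attribute names are my own invention for illustration.
SAMPLE = """<u>
  <w form="PP" function="PP">you</w>
  <w form="VBP" function="VBP">are</w>
  <w form="RB" function="RB">very</w>
  <w form="RB" function="JJ">fluently</w>
</u>"""

def mismatches(xml_string):
    """Return (word, form, function) for tokens whose two tags differ."""
    root = ET.fromstring(xml_string)
    return [(w.text, w.get("form"), w.get("function"))
            for w in root.iter("w")
            if w.get("form") != w.get("function")]

print(mismatches(SAMPLE))  # [('fluently', 'RB', 'JJ')]
```

With the real corpus, the same comparison just runs over every <w>-type element in every file.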

I’ve wondered for some time how often these variant form-function tokens occur overall, in relation to their conventional forms. My interest was renewed by the recent paper by VOICE project researcher Ruth Osimk-Teasdale in the Journal of English as a Lingua Franca. One of the main workers on the VOICE POS-tagging project, she investigates word class shifts in VOICE. She narrows her data to double form-function tags that reflect a shift of category across word classes (like from adverb to adjective). These inter-categorical word class shifts therefore exclude variations within a word class, like singular nouns which are treated as plural. She focuses on items like fluently above, where word class conversion occurs without any change to the form of the word itself.

The assignment of these form-function tags – and the analysis of them – are directly linked to the fluidity of ELF: “… the fluidity of language becomes especially apparent in the dynamics of online ELF interactions” (Osimk-Teasdale 2014: 114), which are also described as “variable and in flux” and containing “unconventional language use” (ibid. 117). Moreover, the examples of this fluidity are said to be numerous. On the VOICE corpus website, we’re told that the data “often had to be dealt with in novel and unprecedented ways because of its spoken and variable nature”. Osimk-Teasdale reports that during transcription of VOICE

… numerous cases were encountered in which ELF users exploit language by transcending conventionally established boundaries of word class categories in codified English by change of word class but without any change of morphological form (2014: 111).

These transcendent word class shifts thus motivated the double POS-tagging project, which would be better suited to describe “the fluid and process-like nature of the use of word class categories observed in VOICE data” (ibid. 114). But just how “fluid” is it? This is a quantitative question, and now we can get a more precise number than “numerous”. How prominent is this fluidity in relation to conventional, “codified” forms?

A drop in the bucket

Osimk-Teasdale (2014: 120-21; henceforth Ruth, since I’ve talked with her before at the ELF5 conference and we’re equally junior in the academic Game of Thrones) reports a total of 152 non-matching form-function tag combinations in VOICE, after excluding some non-interesting groups like discourse markers and gerunds (-ing words functioning as nouns). Out of these 152 tag combinations, 32 combinations reflect the inter-categorical shifts (across word classes) that are the target of the study. Altogether, these inter-categorical shifts amount to 370 tokens in the corpus, and Ruth’s paper considers the nine most frequent shifts. Here are examples of the top five groups:

type of shift       raw count   example from VOICE (filename)
Adjective→Adverb    99          you get the money-back guarantee _0 and this is particular a motivation to use this service (PBmtg300)
Adverb→Adjective    52          she asked the question if there’s a really need for community (EDwsd499)
Adjective→Noun      41          there is a big difficult of there is bit er big el- a big difficulty (EDint330)
Noun→Adjective      35          they don’t follow _1 very in a fanatic way _1 the the tradition yeah (LEcon8)
Verb→Noun           32          so they were keeping our suggest to the very very end (PBmtg269)

Ruth points out that nearly all of these instances of word class shifts in VOICE are one-off events, meaning they aren’t found elsewhere in the corpus. At the same time, almost all of these conversion types are also found in first- and second-language varieties of English, so they’re not unprecedented observations, either. However, the second-most frequent type of shift – adverb to adjective – is only attested in the VOICE data. Ruth argues that this can be explained by the added prominence bestowed by the extra morpheme. But going through concordance lines of these word class shifts (get the full data below), one finds that some tags appear dubious, and many reflect the futility of attempting to POS-tag every word of spoken interaction. Other examples, such as win-win, eastbound and westbound discussed under adjective to noun shifts, appear decidedly non-transcendent to me. I would use these items in the same sense myself without thinking twice, and I’m one of the least transcendent people I know.

Assuming that these 370 instances of inter-categorical shifts are equally interesting, we still have the question of prominence – how often do these word class shifts occur, relatively speaking? Is this a rushing torrent of linguistic fluidity, or just a drop in the bucket? To get the bigger picture, I expanded my own search to all tokens in the VOICE corpus where the function tag does not match the form tag. What do we see when we look at the distribution of all non-matching form-function tags throughout all one million words of spoken text?

Needles in a haystack

In order to get a sense of the relative prominence of these non-matching form-function tags, it was necessary to likewise count the “conventional” items in which the form and function coincide. But in order to get meaningful results, it was first necessary to exclude a number of items. Since every token in VOICE is tagged, many items such as discourse features (e.g. breathing, laughter, filled and unfilled pauses) and grammatical items (e.g. infinitive to, conjunctions, modal verbs) don’t really have any possibility for form-function variation, so these tags all match. I excluded these, along with the same set of items excluded by Ruth (form tags PVC, VVG, VBG, UNK; also all discourse markers). This narrowed my form-function comparison to only the potentially interesting sources of variation: those tokens with a form tag for adjectives, adverbs, cardinal numbers, nouns, personal & possessive pronouns, and verbs.
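The counting step described above can be sketched roughly as follows. I assume here that tokens have already been extracted as (form tag, function tag) pairs; the tag prefixes are simplifications of the real VOICE tagset, so treat this as an illustration of the logic rather than a faithful reimplementation.

```python
from collections import Counter

# Form tags excluded outright, following the paper (PVC, VVG, VBG, UNK).
EXCLUDE = {"PVC", "VVG", "VBG", "UNK"}
# Kept classes: adjectives, adverbs, cardinal numbers, nouns, pronouns, verbs.
# These prefixes are a simplification of the actual VOICE tagset.
KEEP_PREFIXES = ("JJ", "RB", "CD", "NN", "PP", "V")

def tally(tokens):
    """Count matching vs non-matching form/function tags per form-tag class.

    tokens: iterable of (form_tag, function_tag) pairs.
    Returns a Counter keyed by (form_tag, tags_match).
    """
    counts = Counter()
    for form, function in tokens:
        if form in EXCLUDE or not form.startswith(KEEP_PREFIXES):
            continue
        counts[(form, form == function)] += 1
    return counts
```

Summing the False-keyed entries gives the non-matching total; dividing by the grand total gives the overall proportion.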

Altogether, this dataset consists of 556,296 words in the corpus. By comparing the form and function tags, I was able to count both the matching and non-matching tags for each set of word classes, and here are the results:

Only 1904 tokens with non-matching tags were found, or 0.3% of these 556,296 observations. I didn’t expect these figures to be so low, and they frankly call into question the whole enterprise of double POS-tagging a million-word corpus for the sake of finding 1904 instances of possibly interesting variation (judge for yourself). None of the individual word class categories rises above 0.6% non-matching tags, with the exception of V, the “generic verb tag” in VOICE (n=776). With 81% non-matching function tags, the form tag V appears to be a repository for several groups of word class shifts, the largest being the famous 3rd-person present singular zero (tagged V/VVZ, n=256). I’ve already blogged elsewhere about these findings from VOICE, which do not support the sometimes outlandish claims that dropping the 3rd-person singular -s morpheme is a preferred feature in ELF. And yet I can see from the figures above that, in spite of its relatively infrequent appearance in VOICE, this 3rd-person zero marking must be one of the better-represented variants coming out of this form-function data.


I want to be clear that I am a friend of VOICE; I may well appreciate the corpus more than anyone else in ELF-land. A meticulously compiled corpus of naturally occurring (non-elicited) ELF interaction is a wonderful thing, the double POS-tagging was a massive effort, and I’ve spent many tens of hours exploring it (with more to come). The fact that the “fluidity” between form and function does not seem to match the exuberant descriptions of ELF researchers does not make the investigation less meaningful or fun. Nor is this a criticism of Ruth in particular; her study contains the typical language of the field (try Google Scholar with a search string of “english as a lingua franca” fluid and see what you find). The point I hope to make with these findings is that the interesting stuff in ELF interaction is not at word level, but at the level of multi-word chunks and the syntagmatic associations between words.

Just as with speakers of first- and second-language varieties of English, there’s a flexibility and generosity of interpretation that enables ELF users to manage, in real time, the inevitably fuzzy edges that come with creating an utterance before we know how it’s going to end. It might not always meet the expectations of the listener, but the linear sequence of speech moves fast, and there’s not much time to dwell on these relatively insignificant – and infrequent – mismatches between form and function. Listeners likely don’t invest much attention in reconciling any (potential) mismatch between a single word’s form and function, as attention moves quickly to the next chunk of information. I think it’s fair to say that we ELF researchers are way more interested in these fuzzy edges of talk than the language users themselves. In this respect, much ELF research – and to be fair, a great deal of linguistic research based on transcriptions of speech – makes a mountain out of a molehill, when it’s really just needles in a haystack.

Take my data please

If you’ve read this far, you might like to see the data yourself. While writing the program to generate these findings, I spent a few hours making a concordancer for the VOICE POS XML corpus. The online search interface for the POS-tagged VOICE gives an inexplicably meager four words of context (two left, two right), which is often unhelpful. For this study, I output all 1904 examples of non-matching form-function tags to spreadsheet formats (Excel and .csv), with each non-matching token in its own column, up to 10 tokens of context on the left and right, and additional columns for the form tag, function tag, and filename. By using the “Filters” feature of Excel, for example, it’s easy to find and sort through whatever tag combinations you’re interested in. If you need a more detailed description of what this data includes and excludes, please send me an email. If you need info on the VOICE tagging and transcription conventions, see the VOICE POS-tagging manual.
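The concordancing step itself is simple once the tokens are in hand. Here is a sketch, assuming each token has already been flattened to a (word, form tag, function tag, filename) tuple; the column names and tuple layout are my own choices, not anything prescribed by VOICE.

```python
import csv

def concordance_rows(tokens, width=10):
    """Yield one row per non-matching token, with up to `width` words of
    context on each side, joined into a single string per column.

    tokens: list of (word, form_tag, function_tag, filename) tuples.
    """
    words = [t[0] for t in tokens]
    for i, (word, form, function, fname) in enumerate(tokens):
        if form != function:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            yield [left, word, right, form, function, fname]

def write_concordance(tokens, path, width=10):
    """Write the concordance rows to a .csv file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["left context", "token", "right context",
                         "form tag", "function tag", "file"])
        writer.writerows(concordance_rows(tokens, width))
```

In real use the token list would be built per corpus file, so that context windows don’t spill across file boundaries.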

Here’s the data in two spreadsheet formats:


Update 28.8.14: Below are the files used to generate the figures above. The ‘VOICE_tag_list_pruned.xml’ file is parsed from within the Python program (v. 2.6.6, see comments file). The two Python programs are identical, but the file with the ‘COMMENTS’ label contains lots of added commentary on what is going on where. It assumes basic knowledge of what Python is and basics like iterating with ‘for’ loops, if/else statements, and slices. I’m no expert, so suggestions for improvement are welcome.

Right-click the links to save the files. All three files are just text files, so if you left-click on one it will appear in your browser as what it is, plain text. The program only runs from within the Python environment, so it won’t run or otherwise harm your computer if you want to click on one and look at the code. If you’re thinking this is all a devious ploy to hijack your computer, take heart – I’m not that clever. 🙂



Osimk-Teasdale, R. (2014) “I just wanted to give a partly answer”: capturing and exploring word class variation in ELF data. Journal of English as a Lingua Franca, 3(1), 109–143. DOI: 10.1515/jelf-2014-0005.