Needles in a haystack: questioning the “fluidity” of ELF

As I’ve earlier argued on this blog, sometimes the claims of “fluidity”, “diversity”, and “innovation” found in English as a lingua franca (ELF) research are overstated. It’s so diverse that even ordinary diversity won’t do – it’s “super-diversity” now. It could very well be ultra-mega-diversity-squared, but the question of the prominence of these presumably innovative features is a quantitative one. More specifically, it’s a question of how frequently any variant forms might occur in naturally occurring ELF interaction, relative to the conventional forms. One of my shameless nerd hobbies is writing little Python programs to query corpora, and several of these mini-studies have appeared on this blog. I especially enjoy working with the VOICE corpus, which is great because 1) it contains a million words of unelicited ELF interaction; 2) it’s ready for processing as well-formed XML; and 3) it has been meticulously part-of-speech (POS) tagged for both the form and function of each word in the corpus.

The value of this double form-function tag is that it reveals every token in the corpus where a word like fluently, which is formally recognisable as an adverb, functions in a different way, for example as an adjective: i think you are very fluently in english. This example of fluently from VOICE has a form tag of RB (adverb), but a function tag of JJ (adjective) to reflect that fluently seems to be serving an adjectival function. This kind of form-function variation in ELF is presumably prominent enough that it necessitates this double tagging to adequately describe the fluidity. The VOICE team was kind enough to carry out this formidable task involving manual inspection of all million words. Now that this resource is in place (and freely available), the instances of these form-function mismatches can be easily found, counted, and viewed in context.
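To make the double tag concrete, here is a minimal Python sketch of how a combined tag could be split into its two halves. It assumes the tag is written as FORM, a lowercase f, then FUNCTION, with an optional leading #, which is the shape of tags like #VVGfNN mentioned in the comments below. The names split_tag and is_mismatch are made up for illustration and don't come from any VOICE tool.

```python
import re

# Assumed tag shape: optional '#', FORM tag, lowercase 'f', FUNCTION tag (e.g. '#RBfJJ').
TAG_PATTERN = re.compile(r"^#?([A-Z$]+)f([A-Z$]+)$")


def split_tag(tag):
    """Return (form, function) for a combined tag; a bare tag counts as matching."""
    match = TAG_PATTERN.match(tag)
    if match:
        return match.group(1), match.group(2)
    bare = tag.lstrip("#")
    return bare, bare


def is_mismatch(tag):
    """True when the form tag and the function tag differ."""
    form, function = split_tag(tag)
    return form != function


print(split_tag("#RBfJJ"))    # ('RB', 'JJ')
print(is_mismatch("#RBfJJ"))  # True
print(is_mismatch("#NNfNN"))  # False
```

With fluently tagged RB for form and JJ for function, the check flags it as a mismatch, while a noun tagged NN on both sides passes as conventional.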

I’ve wondered for some time how often these variant form-function tokens occur overall, in relation to their conventional forms. My interest was renewed by the recent paper by VOICE project researcher Ruth Osimk-Teasdale in the Journal of English as a Lingua Franca. One of the main workers on the VOICE POS-tagging project, she investigates word class shifts in VOICE. She narrows her data to double form-function tags that reflect a shift of category across word classes (like from adverb to adjective). These inter-categorical word class shifts therefore exclude variations within a word class, like singular nouns which are treated as plural. She focuses on items like fluently above, where word class conversion occurs without any change to the form of the word itself.

Assigning these form-function tags – and the analysis of them – are directly linked to the fluidity of ELF: “… the fluidity of language becomes especially apparent in the dynamics of online ELF interactions” (Osimk-Teasdale 2014: 114), which are also described as “variable and in flux” and containing “unconventional language use” (ibid. 117). Moreover, the examples of this fluidity are said to be numerous. On the VOICE corpus website, we’re told that the data “often had to be dealt with in novel and unprecedented ways because of its spoken and variable nature”. Osimk-Teasdale reports that during transcription of VOICE

… numerous cases were encountered in which ELF users exploit language by transcending conventionally established boundaries of word class categories in codified English by change of word class but without any change of morphological form (2014: 111).

These transcendent word class shifts thus motivated the double POS-tagging project, which would be better suited to describe “the fluid and process-like nature of the use of word class categories observed in VOICE data” (ibid. 114). But just how “fluid” is it? This is a quantitative question, and now we can get a more precise number than “numerous”. How prominent is this fluidity in relation to conventional, “codified” forms?

A drop in the bucket

Osimk-Teasdale (2014: 120-21; henceforth Ruth, since I’ve talked with her before at the ELF5 conference and we’re equally junior in the academic Game of Thrones) reports a total of 152 non-matching form-function tag combinations in VOICE, after excluding some non-interesting groups like discourse markers and gerunds (-ing words functioning as nouns). Out of these 152 tag combinations, 32 combinations reflect the inter-categorical shifts (across word classes) that are the target of the study. Altogether, these inter-categorical shifts amount to 370 tokens in the corpus, and Ruth’s paper considers the nine most frequent shifts. Here are examples of the top five groups:

Type of shift | Raw count | Example from VOICE (filename)
Adjective→Adverb | 99 | you get the money-back guarantee _0 and this is particular a motivation to use this service (PBmtg300)
Adverb→Adjective | 52 | she asked the question if there’s a really need for community (EDwsd499)
Adjective→Noun | 41 | there is a big difficult of there is bit er big el- a big difficulty (EDint330)
Noun→Adjective | 35 | they don’t follow _1 very in a fanatic way _1 the the tradition yeah (LEcon8)
Verb→Noun | 32 | so they were keeping our suggest to the very very end (PBmtg269)

Ruth points out that nearly all of these instances of word class shifts in VOICE are one-off events, meaning they aren’t found elsewhere in the corpus. At the same time, almost all of these conversion types are also found in first- and second-language varieties of English, so they’re not unprecedented observations, either. However, the second-most frequent type of shift – adverb to adjective – is only attested in the VOICE data. Ruth argues that this can be explained by the added prominence bestowed by an added morpheme. But going through concordance lines of these word class shifts (get the full data below), one finds that some tags appear dubious, and many reflect the futility of attempting to POS-tag every word of spoken interaction. Other examples, such as win-win, eastbound and westbound discussed under adjective to noun shifts, appear decidedly non-transcendent to me. I would use these items in the same sense myself without thinking twice, and I’m one of the least transcendent people I know.

Assuming that these 370 instances of inter-categorical shifts are equally interesting, we still have the question of prominence – how often do these word class shifts occur, relatively speaking? Is this a rushing torrent of linguistic fluidity, or just a drop in the bucket? To get the bigger picture, I expanded my own search to all tokens in the VOICE corpus where the function tag does not match the form tag. What do we see when we look at the distribution of all non-matching form-function tags throughout all one million words of spoken text?

Needles in a haystack

In order to get a sense of the relative prominence of these non-matching form-function tags, it was necessary to likewise count the “conventional” items in which the form and function coincide. But in order to get meaningful results, it was first necessary to exclude a number of items. Since every token in VOICE is tagged, many items such as discourse features (e.g. breathing, laughter, filled and unfilled pauses) and grammatical items (e.g. infinitive to, conjunctions, modal verbs) don’t really have any possibility for form-function variation, so these tags all match. I excluded these, along with the same set of items excluded by Ruth (form tags PVC, VVG, VBG, UNK; also all discourse markers). This narrowed my form-function comparison to only the potentially interesting sources of variation: those tokens with a form tag for adjectives, adverbs, cardinal numbers, nouns, personal & possessive pronouns, and verbs.
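The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a list of (form, function) tag pairs that have already been pulled out of the corpus, the toy data is invented, and the sketch handles just the four excluded form tags named above (not the discourse markers or the restriction to particular word classes).

```python
from collections import Counter

# Form tags left out of the comparison, as listed in the text.
EXCLUDED_FORM_TAGS = {"PVC", "VVG", "VBG", "UNK"}


def count_matches(tag_pairs):
    """Count matching and non-matching (form, function) pairs per form tag."""
    counts = Counter()
    for form, function in tag_pairs:
        if form in EXCLUDED_FORM_TAGS:
            continue
        status = "match" if form == function else "mismatch"
        counts[(form, status)] += 1
    return counts


# Invented toy input, not real corpus counts.
sample = [("RB", "JJ"), ("RB", "RB"), ("NN", "NN"), ("VVG", "NN"), ("JJ", "JJ")]
counts = count_matches(sample)
print(counts[("RB", "mismatch")])  # 1
print(counts[("RB", "match")])     # 1
print(sum(counts.values()))        # 4
```

The VVG token is skipped, so only four of the five toy tokens are counted.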

Altogether, this dataset consists of 556,296 words in the corpus. By comparing the form and function tags, I was able to count both the matching and non-matching tags for each set of word classes, and here are the results:

[Figure: VOICE_form_function_figures, counts of matching and non-matching form-function tags by word class]

Only 1904 tokens with non-matching tags were found, or 0.3% of these 556,296 observations. I didn’t expect these figures to be so low, as it frankly calls into question the whole enterprise of double POS-tagging a million-word corpus for the sake of finding 1904 instances of possibly interesting variation (judge for yourself). None of the individual word class categories rises above 0.6% non-matching tags, with the exception of V, the “generic verb tag” in VOICE (n=776). With 81% non-matching function tags, the form tag V appears to be a repository for several groups of word class shifts, the largest being the famous 3rd-person present singular zero (tagged V/VVZ, n=256). I’ve already blogged elsewhere about these findings from VOICE, which do not support the sometimes outlandish claims that dropping the 3rd-person singular -s morpheme is a preferred feature in ELF. And yet I can see from the figures above that, in spite of its relatively infrequent appearance in VOICE, this 3rd-person zero marking must be one of the better-represented variants coming out of this form-function data.
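For anyone who wants to check the percentage, the arithmetic is simple enough to sketch in Python (Python 3 syntax assumed here). The second calculation assumes that the 256 tokens tagged V/VVZ are counted among the 1904 non-matching tokens.

```python
non_matching = 1904   # tokens with differing form and function tags
total = 556296        # tokens in the narrowed dataset

share = non_matching / total * 100
print(f"{share:.2f}%")  # 0.34%

third_person_zero = 256  # tokens tagged V/VVZ
zero_share = third_person_zero / non_matching * 100
print(f"{zero_share:.1f}%")  # 13.4%
```

On that assumption, the 3rd-person zero accounts for roughly one in seven of the non-matching tokens.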

Check out Ruth’s study by clicking on the image
Osimk-Teasdale, R. (2014) “I just wanted to give a partly answer”: capturing and exploring word class variation in ELF data. Journal of English as a Lingua Franca, 3(1), 109–143.

I want to be clear that I am a friend of VOICE; I believe that I must appreciate the corpus more than anyone else in ELF-land. A meticulously compiled corpus of naturally occurring (non-elicited) ELF interaction is a wonderful thing, the double POS-tagging was a massive effort, and I’ve spent many tens of hours exploring it (with more to come). The fact that the “fluidity” between form and function does not seem to match the exuberant descriptions of ELF researchers does not make the investigation less meaningful or fun. Nor is this a criticism of Ruth in particular; her study contains the typical language of the field (try Google Scholar with a search string of “english as a lingua franca” fluid and see what you find). The point I hope to make with these findings is that the interesting stuff in ELF interaction is not at word level, but at the level of multi-word chunks and the syntagmatic associations between words, rather than the words themselves.

Just as with speakers of first- and second-language varieties of English, there’s a flexibility and generosity of interpretation that enables ELF users to manage, in real time, the inevitably fuzzy edges that come with creating an utterance before we know how it’s going to end. It might not always meet the expectations of the listener, but the linear sequence of speech moves fast, and there’s not much time to dwell on these relatively insignificant – and infrequent – mismatches between form and function. Listeners are unlikely to invest much attention in reconciling any (potential) mismatches between a single word’s form and function, as attention moves quickly to the next chunk of information. I think it’s fair to say that we ELF researchers are way more interested in these fuzzy edges of talk than the language users themselves. In this respect, much ELF research – and to be fair, a great deal of linguistic research based on transcriptions of speech – makes a mountain out of a molehill, when it’s really just needles in a haystack.

Take my data please

If you’ve read this far, you might like to see the data yourself. While writing the program to generate these findings, I spent a few hours making a concordancer for the VOICE POS XML corpus. The online search interface for the POS-tagged VOICE gives an inexplicably meager four words of context (two left, two right), which is often unhelpful. For this study, I output all 1904 examples of non-matching form-function tags to spreadsheet formats (Excel and .csv), with each non-matching token in its own column, up to 10 tokens of context on the left and right, and additional columns for the form tag, function tag, and filename. By using the “Filters” feature of Excel, for example, it’s easy to find and sort through whatever tag combinations you’re interested in. If you need a more detailed description of what this data includes and excludes, please send me an email. If you need info on the VOICE tagging and transcription conventions, see the VOICE POS-tagging manual.
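The general idea of the concordance output can be sketched as follows. This is a simplified illustration, not the program used for the spreadsheets: it assumes the tokens are already available as (word, form, function, filename) tuples, it joins the left and right context into single cells, and it doesn't stop the context window at file boundaries. The tags for every word other than fluently, and the name demo_file, are invented.

```python
import csv
import io


def concordance_rows(tokens, width=10):
    """Yield one row per token whose form and function tags differ.

    `tokens` is a list of (word, form, function, filename) tuples.
    """
    words = [t[0] for t in tokens]
    for i, (word, form, function, filename) in enumerate(tokens):
        if form == function:
            continue
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        yield [left, word, right, form, function, filename]


# Toy utterance from the example above; only the RB/JJ tag on 'fluently' is from the text.
tokens = [
    ("i", "PP", "PP", "demo_file"),
    ("think", "VVP", "VVP", "demo_file"),
    ("you", "PP", "PP", "demo_file"),
    ("are", "VBP", "VBP", "demo_file"),
    ("very", "RB", "RB", "demo_file"),
    ("fluently", "RB", "JJ", "demo_file"),
    ("in", "IN", "IN", "demo_file"),
    ("english", "NN", "NN", "demo_file"),
]

rows = list(concordance_rows(tokens, width=10))

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["left", "token", "right", "form", "function", "file"])
writer.writerows(rows)
print(buffer.getvalue())
```

The single output row puts fluently in the middle column with its left and right context on either side, which is the layout that makes filtering by tag combination in a spreadsheet convenient.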

Here’s the data in two spreadsheet formats:

 

Update 28.8.14: Below are the files used to generate the figures above. The ‘VOICE_tag_list_pruned.xml’ file is parsed from within the Python program (v. 2.6.6, see comments file). The two Python programs are identical, but the file with the ‘COMMENTS’ label contains lots of added commentary on what is going on where. It assumes basic knowledge of what Python is and stuff like iterating with ‘for’ loops, if/else statements, slices. I’m no expert, so suggestions for improvement are welcome.

Right-click the links to save the files. All three files are just text files, so if you left-click on one it will appear in your browser as what it is, plain text. The program only runs from within the Python environment, so it won’t run or otherwise harm your computer if you want to click on one and look at the code. If you’re thinking this is all a devious ploy to hijack your computer, take heart – I’m not that clever. 🙂
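For readers who just want the flavour of the approach without opening the files, here is a stripped-down sketch of looping over tagged words with Python’s etree module. It is not the program linked above. The XML fragment is invented, and the element name w and the attribute ana are assumptions rather than something copied from the VOICE schema; real corpus files may also use XML namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented fragment: element name 'w' and attribute 'ana' are assumptions.
SAMPLE = """<u>
  <w ana="#PPfPP">you</w>
  <w ana="#VBPfVBP">are</w>
  <w ana="#RBfRB">very</w>
  <w ana="#RBfJJ">fluently</w>
</u>"""

root = ET.fromstring(SAMPLE)
mismatches = []
for element in root.iter("w"):
    # Split 'RBfJJ' into form 'RB' and function 'JJ' at the lowercase 'f'.
    form, _, function = element.get("ana", "").lstrip("#").partition("f")
    if function and form != function:
        mismatches.append((element.text, form, function))

print(mismatches)  # [('fluently', 'RB', 'JJ')]
```

It really is just a ‘for’ loop and an if statement: walk through the words, compare the two halves of the tag, and keep the ones that differ.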

 

References

Osimk-Teasdale, R. (2014) “I just wanted to give a partly answer”: capturing and exploring word class variation in ELF data. Journal of English as a Lingua Franca, 3 (1), 109-143. DOI: 10.1515/jelf-2014-0005.

12 thoughts on “Needles in a haystack: questioning the “fluidity” of ELF”

  1. eflnotes says:

    great stuff Ray thanks!

    anychance of the script you used to extract the data? i am still trying to muddle my way through xml, xpath etc.

    ta
    mura

    • Ray Carey says:

      I’m happy to share the script, but it might be helpful if I put some more comments in the code of what’s going on where. It’s a pretty straightforward count ’em up and print ’em out operation, and it doesn’t rely on Xpath. The features of the etree Python module are sufficient here. There must be more efficient or elegant ways of doing the same thing, but as far as I can see it does what I expect.

      I’ll post the code to the blog in a day or two, and it could be a good example if anyone else is thinking of going down this dark and perilous road. I don’t mind sharing the concordancer code too, but it was a trickier thing and I might want to work on that some more.

      Thanks for the interest!
      Ray

    • Ray Carey says:

      Ok Mura, the code with comments is up now. I wrote a lot of stuff there (maybe too much), but it goes step-by-step through everything. Feel free to send me an email or post here if you want to talk about some aspect of the code.

      Best,
      Ray

      • eflnotes says:

        this is great Ray thanks a lot, i think i can say that what a lot of people who follow your blog really like is the data you make available for those so interested.

        by the way here is something of my paltry knowledge so far of xml , i extracted gender information from BNC spoken, though there seems to be some mismatches with say BNC64, not yet figured out why!?

        http://bncgender.englishup.me

        ta
        mura

  2. geoffjordan says:

    Hi Ray,

    I haven’t seen the VOICE corpus before; shame on me! (BTW The link to the VOICE corpus in para. 1 of this post doesn’t work.) Tried out the corpus online and couldn’t get a KWIC display. Anyway, although the corpus is still a bit small (don’t you think?), it’s certainly great to be able to search it for form words.

    Best,

    Geoff

    • Ray Carey says:

      Hi Geoff, thanks a lot for pointing out the bad link, it’s fixed!

      Well, there’s small and there’s small. Specialised corpora like the ones I work with tend to be 1-2 million words, like the spoken academic English trio of ELFA, MICASE, and BASE. I think that if you’re going to be doing qualitative work as well, especially with interaction, you’ve got to be pretty familiar with the data, and then a million words is quite a lot. I converted the million-word ELFA corpus from plain text to XML, which involved manual checking of overlapping speech synchronisation, and 130 hours of talk sure seemed like a lot of words to me.

      As for VOICE’s online interface, I don’t think you can really consider it a corpus tool, but more of an example generator. It doesn’t provide KWIC lines, except for the POS-tagged search, which provides the four words of context already mentioned. All my work in VOICE has been with their POS-tagged XML version, through querying via Python and now generating my own concordance lines.

      In my view, the biggest problem with VOICE is that the plain text version is not suitable for analysis in a stand-alone concordancer. Every utterance in each file starts with the file ID, utterance number, and speaker ID, which aren’t relevant to word lists, n-grams etc., and they’re not in angle brackets, so you can’t exclude them. One of these years I’m going to write an XSL transformation to turn VOICE XML into the ELFA corpus plain text format, which is concordancer-friendly.

      Glad to learn about your site as well!

      Ray

  3. […] As I've earlier argued on this blog, sometimes the claims of "fluidity", "diversity", and "innovation" found in English as a lingua franca (ELF) research are overstated. It's so diverse that even o…  […]

  4. Ruth says:

    I would like to respond to the blog entry based on my recent article. The analysis carried out is interesting; however, it also seems to be based on a couple of misconceptions of the article it discusses. In particular, I would like to point out two things:

    The first refers to the following calculation of Ray Carey:
    “Only 1904 tokens with non-matching tags were found, or 0.3% of these 556,296 observations. I didn’t expect these figures to be so low, as it frankly calls into question the whole enterprise of double POS-tagging a million-word corpus for the sake of finding 1904 instances of possibly interesting variation”

    This is based on a misconception of the goal of the dual tagging procedure of VOICE. Contrary to how it is presented in the blog, the dual tagging procedure was not carried out in order to find “non-matching” forms and functions, but, quite the opposite: The tagging of VOICE, which presented a highly challenging task of capturing the way English is used by ELF speakers, attempted to display this use within the POS-tagging of VOICE – which we found was well represented by the dual tagging procedure, amongst many other measures. A second goal of the tagging of VOICE in general was to be open to where challenges to the categorisation might occur, as this might have pointed us to interesting aspects worth investigating in the data. This was also, amongst other measures, met by the dual tagging procedure. However, the exploration of word class shifts, discussed in my article, is just one of many things which can be done with the dual tagging procedure. As to the numbers given, i.e. “1904 tokens with non-matching tags”, this is also misleading. In total, in VOICE I counted over 17,000 tokens which were tagged with differing form-function tags. These, of course, serve to signal a number of aspects in the tagging of VOICE, and the investigation of word class shifts is of course only one area these can be useful for.

    A second misconception I would like to point out refers to the following part of the blog:
    “Assigning these form-function tags – and the analysis of them – are directly linked to the fluidity of ELF: “… the fluidity of language becomes especially apparent in the dynamics of online ELF interactions” (Osimk-Teasdale 2014: 114), which are also described as “variable and in flux” and containing “unconventional language use” (ibid. 117). Moreover, the examples of this fluidity are said to be numerous. On the VOICE corpus website, we’re told that the data “often had to be dealt with in novel and unprecedented ways because of its spoken and variable nature”.

    The first quote here is directly linked to conversion and multifunctionality, which represent other concepts used to describe word class change for stable L1 and L2 speech communities, which ELF is not. The form-function relationships constitute ONE example of unconventional language use – that of non-codified relationships between paradigmatic forms and syntagmatic functions. The full quote reads: “This is because the very nature of ELF data means that one cannot rely on conventionalised form–function correspondences for the description of ELF usage, which has been shown to be essentially variable and in flux, and not necessarily bound to codified L1 use.” As the quote suggests, this refers not only to word class variation in ELF data but also to previous research on ELF. Numerous papers have come to the conclusion that ELF is used creatively in that it is used efficiently while not conforming to codified language use on various linguistic levels, e.g. forms and (communicative) functions, e.g. Pitzl (2012), Pitzl (2011), Pitzl, Breiteneder & Klimpfinger (2008), Dorn (2011), Hülmbauer (2009), Hülmbauer (2013), Holzschuh (2013), Seidlhofer & Widdowson (2009) or phonetics and phonology, e.g. Jenkins (2000), Osimk (2010).
    The quote on the VOICE website also clearly does not refer to only these form-function relationships but to the POS-tagging of VOICE in general, where many aspects were encountered which showed that the spoken, interactive ELF data in VOICE is difficult to categorise, and points to variable language use. Such issues, in addition to the transcendent use of word class categories were: ambiguities, the flexible use of the –ing category, non-codified use of collective nouns, pronunciation variation and coinages, code-mixing, or multiword items. All of these, therefore, would have to be taken into account when inquiring about the variability of spoken ELF in VOICE. (Note, by the way, that whenever I refer to ELF data I cannot make universal claims about ELF in general, but only talk about a specific set of data, namely that in VOICE.)

    It is difficult for me to imagine how one could quantify the degrees of ‘fluidity’ of a particular language use. However, if one was to carry out a quantitative analysis, I do not think that the question of how ‘in flux’ and ‘fluid’ ELF is, can be answered in terms of comparing its number to all other tokens in the VOICE corpus. Otherwise other corpus research would only be interesting if it investigated high-frequency phenomena, and much other corpus research would be deemed insignificant merely because the interesting phenomena it investigates appear in rather small numbers (though, needless to say, frequency analysis is highly useful for a number of phenomena, such as grammaticalisation). And, it would, in fact, be surprising if the speakers in VOICE would transcend word class categories and exploit the potential of the English language in every second word. Presumably their intelligibility would be greatly diminished and hence would not serve the goal of the respective interaction. Rather, I would argue with Widdowson (2012: 14) that corpora can give you forms (and their frequencies) “but not why [the speakers] produced them and to what pragmatic ends”. In other words, frequencies alone cannot provide the relevance of a phenomenon. This is subject to analysis and independent of whether we are investigating a high frequency phenomenon or something which occurs less often. In any case, the phenomenon of word class shifts exists in the ELF data of VOICE and it is – linked to other papers which investigate other aspects of ELF – a further piece of a puzzle which points to the fact that ELF users do not always conform to codified norms, but that they exploit the language (hence rendering it ‘fluid’ and ‘in flux’) whenever the NEED ARISES.

    But in any case, if one should want to inquire about how ‘variable’ VOICE data is with regard to word class shifts, other corpora need to be looked at and the percentage of the word class shifts compared there. I would suspect that the number of word class boundaries transcended, the number of –ing forms used flexibly, the number of words coined etc. is greater – in terms of percentages – in VOICE than in any corpus of a more stable speech community. However, such an investigation is bound to be a tedious and enormously time-consuming enterprise as I know of only a couple of corpora which have even attempted to display such variation (cf. Denison 2011, Denison 2013). This is because they can only be detected manually (cf. my paper page 118). It is therefore not surprising that such an investigation has not been carried out, though it would be interesting what the result may be.

    References

    Denison, David. 2011. “Grammatical mark-up: Some more demarcation disputes”. New Methods in Historical Corpora. Manchester, 2011. https://www.escholar.manchester.ac.uk/item/?pid=uk-ac-man-scw:172518 (9 April 2013).
    Denison, David. 2013. “Grammatical mark-up: Some more demarcation disputes”. In Bennett, Paul; Durrell, Martin; Scheible, Silke; Whitt, Richard J. (eds.). New methods in historical corpora. Tübingen: Narr Verlag, 17-35.
    Dorn, Nora. 2011. Exploring –ing: The progressive in English as a lingua franca. Saarbrücken: VDM-Verlag Müller.
    Holzschuh, Elisabeth. 2013. Discourse marker use in ELF conversations: The case of “you know”. MA Thesis, University of Vienna.
    Hülmbauer, Cornelia. 2009. “‘We don’t take the right way. We just take the way that we think you will understand’ – The Shifting Relationship between Correctness and Effectiveness in ELF”. In Mauranen, Anna; Ranta, Elina (eds.). English as a Lingua Franca: Studies and findings. Newcastle upon Tyne: Cambridge Scholars Publishing, 323-347.
    Hülmbauer, Cornelia. 2013. “From within and without: the virtual and the plurilingual in ELF”. Journal of English as a Lingua Franca 2(1), 47-73.
    Jenkins, Jennifer. 2000. The phonology of English as an International Language: new models, new norms, new goals. Oxford: Oxford University Press.
    Osimk, Ruth. 2010. “Testing the intelligibility of ELF sounds”. Speak Out! 42, 14-18.
    Pitzl, Marie-Luise. 2011. Creativity in English as a lingua franca: idiom and metaphor. PhD Thesis, University of Vienna.
    Pitzl, Marie-Luise. 2012. “Creativity meets convention: idiom variation and remetaphorization in ELF”. Journal of English as a Lingua Franca 1(1).
    Pitzl, Marie-Luise; Breiteneder, Angelika; Klimpfinger, Theresa. 2008. “A world of words: processes of lexical innovation in VOICE”. Views 17(2), 21-46.
    Seidlhofer, Barbara; Widdowson, Henry. 2009. “Conformity and creativity in ELF and learner English”. In Albl-Mikasa, Michaela; Braun, Sabine; Kalina, Sylvia (eds.). Dimensionen der Zweitsprachenforschung. Dimensions of Second Language Research. (Festschrift for Kurt Kohn). Tübingen: Narr Verlag, 93-107.
    Widdowson, Henry G. 2012. “ELF and the inconvenience of established concepts”. Journal of English as a Lingua Franca 1(1), 5-26.

    • Ray Carey says:

      Hi Ruth, and thanks for this detailed and thorough response! You’ve represented your position well, and the dialogue is most welcome. Let me respond to the misconceptions you’ve mentioned.

      Starting with the second misconception, you’ve pointed out a few quotes I took out of their immediate context when illustrating the language of creativity and fluidity. There wasn’t anything devious intended, and I don’t think it’s a stretch to make these associations. If you feel I’ve misrepresented the original texts, I apologise for that, and I’m glad you’ve taken the opportunity to clear it up. I don’t think I’ve claimed anything contrary to your main point: the double POS-tagging was one of several responses to the unique challenges posed by spoken ELF data. I’m not convinced, however, that these challenges are especially unique to ELF, but would be encountered in any corpus of naturally occurring spoken English.

      This ties in with the other misconception involving my figures on non-matching form/function tags. You point out that they are misleading because it looks like only 1904 tokens received these tags, when in fact over 17,000 tokens have a different tag for form and function. You’re right that the numbers I’ve provided don’t give the full picture, and I’m happy to clarify them further (after triple-checking my work, of course).

      I find 17,020 non-matching form/function tags in VOICE. More than half of these are connected to the discourse markers “i mean”, “you know” and “i see”, which make up 8822 of these tokens (tagged “i/you” #PPfDM “mean/know/see” #VVPfDM). I excluded these along with the other discourse markers, since they don’t seem particularly innovative or ELF-specific. The other big group of excluded items were progressive verbs functioning as nouns (tags #VVGfNN, #VVGfNNS), which appear to be mostly gerunds and a pretty ordinary feature of English (n=3243).

      The other excluded items, as mentioned in the blog, were non-matching tags with the following form tags: UNK (unknown, n=18), PVC (“pronunciation variation and coinages”, n=2157, more on this below) and VBG (n=10, almost all of them the “being” in “human being”), along with 9 more miscellaneous items with form tags DT, FI, and IN. Lastly, there’s the question of ambiguous tags, where a token is enclosed in two or more elements, each with a different POS-tag. My counts incorporate the “first choice”, i.e. they consider the first option but ignore the nested alternatives, which number 857.

      So, starting with the 17,020 instances of non-matching form/function tags and subtracting all the omitted items above, the remaining 1904 tokens are the ones I’ve reported as being the “potentially interesting” variants that might capture some of the creativity and fluidity of ELF, including the interesting examples from your paper. I don’t question their interest — I just wanted a sense of how often I should expect to find them in a representative database. I still think this is a legitimate question, but only the first and most obvious; the additional research questions you’ve outlined are certainly worth pursuing.
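
      As a quick check on the subtraction, the tally can be written out in a few lines of Python (Python 3 syntax assumed). The dictionary labels are just shorthand for the groups listed above, and the PVC figure used is the n=2157 given in the list of excluded form tags.

```python
total_non_matching = 17020

# Excluded groups and their token counts, as given in the paragraphs above.
excluded = {
    "discourse markers i mean / you know / i see": 8822,
    "-ing forms functioning as nouns (VVGfNN, VVGfNNS)": 3243,
    "form tag UNK": 18,
    "form tag PVC": 2157,
    "form tag VBG": 10,
    "miscellaneous (form tags DT, FI, IN)": 9,
    "nested alternatives of ambiguous tags": 857,
}

excluded_total = sum(excluded.values())
remaining = total_non_matching - excluded_total
print(excluded_total)  # 15116
print(remaining)       # 1904
```

      The excluded groups add up to 15,116 tokens, which leaves the 1904 reported in the post.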

      I should mention the PVC-tagged items (pronunciation variation and coinages), which are a big group of excluded items. These are presumably interesting sources of creativity, and all of these 2195 tokens have been excluded here. That’s because I’ve also looked into these items and they are not always interesting, either. Many of them are examples of word formation processes (I’m familiar with Pitzl et al 2008) and a creativity analysis makes sense, e.g. “administrate”, “approvement”, “criterias”, “domestical”. However, a great many of these presumed coinages are only interesting for not being found in your reference dictionary – “multilingualism” (n=31), “intercultural” (n=28), “vice-rector(s)” (n=30), mostly unremarkable words with “non-” prefix (n=117), and lots and lots of field-specific terms that can’t be called coinages. So I excluded these, as the PVC tag itself is not a direct indicator of creativity or something ELF specific.

      I hope this more complete information will show that I wasn’t deliberately trying to be misleading about the figures. Beyond this, I think your comments are thoughtful and deserve consideration. Thanks again for your contribution!

      Best,
      Ray
