Fluent chunks: an intro to Linear Unit Grammar

The spring equinox has passed, the sun is shining, and the place is still frozen solid.
© Nina Valtavirta

The question of how to evaluate English proficiency in lingua franca settings such as English-medium university programs has interested me for a while. One of the criticisms heard against ELF research is that it promotes an “anything goes” attitude toward English. But clearly anything does not go – at least not in high-stakes, professional contexts like academia. Yet, it doesn’t make a huge amount of sense to bring in the British Council to evaluate the non-British English used by non-British instructors to teach non-British students outside Britain. The need for contextually appropriate teaching and testing is one of the main motivations for ELF research.

It was my turn to talk at the ELF seminar this month, which was held on 14.3. I introduced my PhD project, which officially started this January and unofficially began over a year ago. I’m researching fluency in spoken academic ELF, but with a data-driven approach: instead of evaluating ELF users against an idealised “native-speaker model”, I’m starting by describing the features of fluency and dysfluency in a corpus of naturally occurring academic ELF. These texts are transcriptions of university-level talk from pedagogical settings (lectures and seminar discussions from various fields) and professional events (conference presentations and discussions).

The idea is to first describe what is ordinary ELF in academic settings – what are the recurring patterns and routines of fluent interaction, and what are the (dys)fluency features which differentiate individual ELF users? This is the big question, but the even bigger problem is how to systematically identify and describe these patterns.

A LUG-nut is born

I had already been working and studying in Anna Mauranen’s SELF project for some time before I read her 2009 article on chunking in ELF. Her discussion on how ELF users produce approximate forms of conventional organising chunks like in my view and from my point of view was based on Linear Unit Grammar, or LUG, which she developed with John Sinclair, the godfather of corpus linguistics. It got me curious and I read their 2006 book, Linear Unit Grammar: Integrating speech & writing. And that was how I lost my soul to LUG. (and naturally, when I found a WordPress template entitled “Chunk”, I couldn’t resist)

The joy of discovery – a LUG-nut is born.
© Nina Valtavirta

The thing I kept thinking while reading the book was how could it be possible that no one had thought of this and written it down before 2006. LUG is striking for being intuitively satisfying in its simple approach to analysing speech and text as a linear stream of short chunks of words, the same moment-to-moment experience of language we have when listening or speaking in real time. In addition, since it was developed on naturally occurring speech – including ELFA corpus data – the LUG analytical framework is robust enough to handle the ordinary dysfluencies that are so hard to squeeze into a hierarchical sentence grammar.

Naturally, most people have no idea what any of this means, because they’re well-adjusted members of society. That’s why it’s important to start with a sample of speech and show how LUG analysis works. (fair warning: 1300 words of chunking analysis follow. Only the truly LUG-curious and maladjusted need continue)

Get your LUG on

First, consider this short transcription of speech from the ELFA corpus, recorded at an academic conference in Finland. A teacher from Brazil has been asked about her experience teaching Brazilian students:


This is how we might best render speech in writing, as a steady stream of sound. But already we can move through many parts of this transcript and find the boundaries of words. We naturally chunk these letters/sounds into larger units, and the transcription looks like this:

er i c- i i so i i went to portugal er i live in portugal er for 13 years so i er my experience with brazilian students is is a long way @@ okay a long time ago

(note: @@ = laughter, and er is a filled pause, like uh in US usage)

This is what LUG analysis starts with – an unanalysed stream of talk that we take, like the listeners at the conference, one word at a time into our immediate awareness. The first step is to place boundaries between the words, which separate them into larger chunks. This chunking process is treated as intuitive and pre-theoretical; no attempt is made to define a chunk. With a few guiding principles, the LUG model proposes that speakers of English (and any other natural language) will chunk up a text in pretty much the same way.

Taking the first few words of our transcript as a starting point (er i c- i i so i i went to portugal er) we encounter an ordinary aspect of speaking – a short run of so-called dysfluencies that occur as we formulate our thoughts. The filled pause er can be treated as a chunk in its own right, insofar as it is a sound intended to hold the floor while the next chunk is being formulated. It’s then followed by a “false start” i c-, which can be seen as a partially formulated chunk (maybe i could or i can). So, these can be the first chunk boundaries:

er | i c- | i i so i i went to portugal er

Moving along to the next words, another common dysfluency arises – a one-word repetition. Since these are essentially larger chunks of language that are started and re-started, a chunk boundary is always placed between these evident re-starts. Encountering so, we come to something like a small bridging chunk in its own right, a minimal signal of a transition taking place. LUG also parts with conventional grammars by treating all connecting words, including and and but, as independent, organising chunks. So, so will get chunk boundaries as well. Then another re-start, and our transcript looks like this:

er | i c- | i | i | so | i | i went to portugal er

In the next bit we come to a “fluent run”, in which a single unit of meaning is uttered without the earlier hesitations and re-starts. When performing the word-by-word linear analysis of these more meaning-oriented chunks, the focus is on prospection and completion. With even a couple words, like i went, a number of likely ways forward can be prospected, or anticipated by looking ahead. Listeners are oriented to these possible points of completion, and i went to portugal is an interpretably complete unit – nothing more is required to complete the forward prospection. We can therefore place a chunk boundary here, before the filled pause er.

Now let’s take a bit more of the text:

i went to portugal | er | i live in portugal er for 13 years so

As with the preceding chunk, a short, fluent utterance comprises a single unit – i live in portugal. Then, with chunk boundaries around the filled pause, there’s a short addition to this complete chunk, for 13 years, which adds information and creates a new point of completion. Then another short organising chunk, and we’re making good progress. Let’s see what we’ve chunked so far:

er | i c- | i | i | so | i | i went to portugal | er | i live in portugal | er | for 13 years | so
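For the programmatically inclined, the boundary decisions made so far are mechanical enough to sketch in Python. This is a toy illustration, not a real LUG chunker: the function name chunk_dysfluencies and the word lists FILLED_PAUSES and ORGANISERS are assumptions made up for this example, and the sketch only covers the mechanical cases (filled pauses, organising words, immediate repetitions, cut-off words). It reproduces the boundaries above only because this stretch happens to contain no boundary that depends on judging prospection.

```python
# Toy sketch of the mechanical boundary rules described above.
# The word lists are illustrative assumptions, not an inventory taken from LUG.
FILLED_PAUSES = {"er"}
ORGANISERS = {"so", "and", "but"}


def chunk_dysfluencies(transcript):
    """Split a transcript into chunks using four simple rules."""
    chunks = []
    current = []  # words of the chunk being built

    def flush():
        # Close the chunk in progress, if there is one.
        if current:
            chunks.append(" ".join(current))
            current.clear()

    for word in transcript.split():
        if word in FILLED_PAUSES or word in ORGANISERS:
            # Rules 1 and 2: filled pauses and organising words stand alone.
            flush()
            chunks.append(word)
        elif current and current[-1] == word:
            # Rule 3: an immediate repetition marks a re-start.
            flush()
            current.append(word)
        else:
            current.append(word)
            if word.endswith("-"):
                # Rule 4: a cut-off word ends a false start.
                flush()
    flush()
    return chunks


text = "er i c- i i so i i went to portugal er i live in portugal er for 13 years so"
print(" | ".join(chunk_dysfluencies(text)))
```

Rules like these would not find a boundary such as the one inside a longer fluent run, where the decision rests on what the listener expects to come next rather than on surface cues.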

If you’ve made it this far, you earned a look at one of my early chunking experiments. It was confirmed that the linear modeling of chunks is preferable to vertical hierarchies.
© Ray’s mom

The organising so acts as a bridge between these message-oriented chunks and the ensuing point of the speaker’s observations: i er my experience with brazilian students is is a long way @@ okay a long time ago. First, a couple short dysfluencies during formulation – a fragment i and filled pause. These take chunk boundaries, and then a longer stretch of fluent speech.

Here’s a good example of evaluating prospection. To me, the first two words, my experience, set up the expectation of some completing element to follow. As soon as with is uttered, the speaker signals that additional information about my experience is coming. So, I’m inclined to place a chunk boundary before and after with brazilian students to indicate that it adds information to the prospective chunk already started, but without providing any resolution. Then is gets repeated, and a chunk boundary separates this re-start, like this:

i | er | my experience | with brazilian students | is | is a long way @@ okay a long time ago

The boundary between experience and with is debatable, but systematising these types of judgments is an important part of compiling a LUG corpus, which I’ll discuss in future posts when I outline the principles behind a LUG-chunking algorithm I’m working on.

Now we’re getting close. I place the next boundary after is a long way, which completes the prospection already started with my experience. The @@ symbols signify laughter, which can be treated as an interactive chunk in its own right, followed by a similarly interactive unit okay. What’s left is a long time ago, which, as a concluding chunk, appears to rephrase the preceding chunk is a long way. So, it’s another supplementary chunk that adds information about what has come just before.

Ordinary dysfluency

Good news and bad news – the good news is that we’re done chunking our transcription. The bad news is that we still need to analyse the chunks and categorise them based on their linear function. I’ll continue with that in a follow-up post. But for now, this is what our chunked text looks like:

er | i c- | i | i | so | i | i went to portugal | er | i live in portugal | er | for 13 years | so | i | er | my experience | with brazilian students | is | is a long way | @@ | okay | a long time ago
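Since the chunked text is just the original words with | boundaries added, it is easy to load into Python for a sanity check and a first tally. The following is a minimal sketch; the variable names are arbitrary, and the counts are simple descriptive figures rather than any kind of LUG analysis.

```python
transcript = ("er i c- i i so i i went to portugal er i live in portugal er "
              "for 13 years so i er my experience with brazilian students "
              "is is a long way @@ okay a long time ago")
chunked = ("er | i c- | i | i | so | i | i went to portugal | er | "
           "i live in portugal | er | for 13 years | so | i | er | "
           "my experience | with brazilian students | is | is a long way | "
           "@@ | okay | a long time ago")

# Each chunk becomes a list of words.
chunks = [part.split() for part in chunked.split("|")]

# Chunking is linear: it adds boundaries but never reorders or drops words.
words = [word for chunk in chunks for word in chunk]
assert words == transcript.split()

print(len(chunks), "chunks,", len(words), "words")
print("single-word chunks:", sum(len(chunk) == 1 for chunk in chunks))
print("longest chunk:", max(len(chunk) for chunk in chunks), "words")
```

For this transcript the tally comes to 21 chunks over 39 words, 13 of them single-word chunks, and no chunk longer than four words.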

Even before we start to analyse this further, it’s clear that this stretch of talk is utterly unremarkable. These are what we might call “ordinary dysfluencies”. A speaker who has been asked a question takes a few moments to formulate a response, holding the floor with some fragmentary starts and re-starts. Then, with shorter dysfluencies punctuating the on-line planning of following chunks, a bit of uncertainty about a formulaic expression (a long time ago) is resolved at the end.

This study begins with the assumption that “dysfluency”, that scourge of standardised testing, is a profoundly ordinary feature of speaking and hardly noticed in content-oriented discussions such as those in academia. Dysfluencies are normal features of real-time utterance planning that everyone produces, regardless of what language we’re speaking. So the question is not as simple as “how many dysfluencies are too many?” Instead, we should ask “where, when, and how do they occur, and what’s going on in their immediate surroundings?”
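To make the idea of “immediate surroundings” a little more concrete, here is a rough Python sketch that lists each filled pause in the chunked text together with the chunks on either side of it. The helper name pause_contexts and the decision to treat only er as a filled pause are assumptions for illustration; a real study would need a much richer inventory of dysfluency types.

```python
chunked = ("er | i c- | i | i | so | i | i went to portugal | er | "
           "i live in portugal | er | for 13 years | so | i | er | "
           "my experience | with brazilian students | is | is a long way | "
           "@@ | okay | a long time ago")


def pause_contexts(chunked_text, pause="er"):
    """Return (left, pause, right) triples for each chunk equal to `pause`."""
    chunks = [part.strip() for part in chunked_text.split("|")]
    contexts = []
    for i, chunk in enumerate(chunks):
        if chunk == pause:
            # None marks the start or end of the transcript.
            left = chunks[i - 1] if i > 0 else None
            right = chunks[i + 1] if i + 1 < len(chunks) else None
            contexts.append((left, chunk, right))
    return contexts


for left, pause, right in pause_contexts(chunked):
    print(f"{left!s:>20} | {pause} | {right}")
```

In this sample, each of the four filled pauses is followed by a multi-word chunk, which is at least consistent with the view of er as holding the floor while the next chunk is formulated. One short transcript is far too little to generalise from, of course.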

To answer this, we have to go back to our chunked transcription and see how the chunks are working in the linear development of meaning. And that will take another post.

Note: the continuation post can be found here: Fluent chunks 2: How to label your chunks



Mauranen, A. (2009). Chunking in ELF: Expressions for managing interaction. Intercultural Pragmatics, 6 (2). DOI: 10.1515/IPRG.2009.012.

Mauranen, A. (2012). Linear Unit Grammar. The Encyclopedia of Applied Linguistics. DOI: 10.1002/9781405198431.wbeal0707.

Sinclair, J. McH. & Mauranen, A. (2006). Linear Unit Grammar: Integrating speech and writing. ISBN 978 90 272 2299 2.

