Evolving English: Globalisation and AI

You probably shouldn’t dismiss content as AI-written just because it uses certain words

A professor in a department store, talking to a young sales rep.

AI-generated image | DALL-E

The American linguist William Labov spent a tiny piece of the 1960s asking New Yorkers where things were. He didn’t really care; he just wanted to hear them say words with the possibility of rhotic pronunciation. That’s when the “r” in a word like “butter” is pronounced, something familiar to most American accents today.

In the 1960s, however, New York experienced a linguistic shift. Where non-rhoticity — meaning many “r” sounds are silent — had once been a mark of higher social status, rhoticity emerged as the city’s new sign of class.

Labov demonstrated this by visiting three department stores and asking the staff questions to elicit the answer “fourth floor”. Sure enough, the staff at upmarket Saks pronounced the “r” sounds more often than the staff at middle-income Macy’s, and the staff at budget-friendly S. Klein barely pronounced them at all.

Languages change over time and distance, particularly languages like English, which are famously unregulated. There is no equivalent of the Académie Française to tell English speakers what is right or wrong. Instead, we all muddle through, globally and across social strata, making our language highly dynamic, with no single correct version.

I wonder what Labov would make of the recent pop-culture idea that you can tell something is AI-written by its overuse of a few particular words.

That idea began with the observation that certain words have become suspiciously abundant since the major AI models came to market. And in some cases, such as Jeremy Nguyen’s look at the more than ten-fold increase in the word “delve” on PubMed, that observation is statistically valid.

It is also an observation backed up by sound reasoning. AI, or at least what we’re currently calling AI, isn’t fundamentally different from what we used to call machine learning — i.e. it’s an algorithm that goes through a training process. For the current generation of AI models, that training includes humans telling the model how well it’s performing so that it can learn to improve. And humans are expensive; they require rest, pastoral support and suitable working conditions.

For a combination of reasons (and we don’t know exactly what they are), major AI vendors have hired many of their human trainers in lower-income countries. Africa, where certain words — like “delve” — are used much more frequently, has been a notable hiring hub, and it isn’t a huge leap to hypothesise that AI’s lexicon has been shaped by the Africans who played such an important role in its training.

There is a worrying aspect to that hypothesis — if you dismiss writing as AI-generated because you think it overuses a word like “delve” (other suggested AI telltales include “tapestry”, “resonate” and “embrace”), then you might be dismissing the work of one of the 200–300 million English speakers in Africa.

And if you’re reading this from the core Anglosphere and think the estimate of 200–300 million English-speaking Africans seems high, then I have news — somewhere between two-thirds and three-quarters of proficient English speakers are in countries other than Britain, the US, Canada, Australia, or New Zealand.

This had raised a question even before the recent AI advances — whose English will come to dominate our rapidly globalising world? Whose idioms, colloquialisms, pronunciations and spellings will permeate most widely as our communications become more interconnected and developing parts of the world become more digital, more able to spread their forms of English through cultural output?

Many might assume that America, as the largest English-speaking country, will have the greatest influence; others might think that Standard British English will maintain a prestige factor. I’ve always thought that the sheer weight of non-native English speakers would shape the language into a kind of global amalgam.

I had not anticipated that a major AI advance would accelerate that process. We will all soon be mass consumers of AI-generated content, and not only, or even mostly, because of nefarious humans passing AI content off as their own. Most of it will be consumed far more honestly.

Many of us, for instance, have become accustomed to asking an AI for information that we would have previously searched the web for. And now that search engines are rolling out AI-generated answers, that way of obtaining information will quickly normalise through society.

A more nuanced example is that we’re so used to automatic spelling and grammar correction that these tools’ inevitable transition to newer AI versions will barely be noticed. Such tools offer advice on grammar and word choices, and that advice will be based on AI’s idea of good English, which currently appears to include a heavy dose of African word preferences. Tomorrow, it could be moulded by the increasing mass of English web content from highly populated South Asia.

Which is entirely normal from a linguistic perspective. If English (like every language) evolves for humans, then it ought to evolve for AI too, and there is no reason that English’s near-term evolution for AI should conform to any one English-speaking group’s liking.

The question then is this — will those of us in geopolitically dominant English-speaking countries accept that AI doesn’t, and won’t, focus its language preferences on us?

The recent flurry around vocabulary-based AI detection suggests that we probably won’t. That’s a pity.