Will technology end the English language's global domination?

Language is a complicated thing often used for simple ends: to communicate, to warn, to woo, to trade, to learn. Its development can demonstrate history in miniature, yet writ large upon global events.

Take the phrase "lingua franca," for instance. Meaning any language used between those who do not share a common tongue, it could be a pidgin or creole-like corruption or conjunction of two languages, developed between parties with a mutual need for understanding – in commerce or communication. 

Such was the case with the original lingua franca – a pidgin, or combination, language used in the middle centuries of the last millennium by traders and diplomats around the eastern Mediterranean Sea. Based on a simplified Italian but borrowing words from Portuguese, Spanish, French, Greek, Turkish and Arabic, this first lingua franca was named – perhaps ironically, perhaps fittingly – after the Franks: Germanic peoples whose name was used, by those from the Byzantine Empire, to represent all Western Europeans.

Then again, a lingua franca can be a language imposed from above, either voluntarily when a certain culture dominates by pre-eminence at a formative period – for example, the standard usage of Italian in opera and musical notation, or French in ballet – or more forcefully as part of a military campaign of subjugation and control. 

This superimposition explains the eminence of English as one of the globe's most widely used languages. The British empire took its usage around the world, and even in the post-imperial phase many former colonies have continued to use it as the primary tongue, often as a unifying common ground above a multiplicity of indigenous languages. And as that empire waned, another Anglophone culture grew into global pre-eminence.

 

Whatever way you look at it, English is the dominant language in the world... the default in practically all domains of global communication

- Gaston Dorren, linguistics author and polyglot

"Whatever way you look at it, English is the dominant language in the world," Gaston Dorren tells CGTN Europe – and he should know, as a linguistics author who speaks six languages, can read another nine and has written books in several of them. 

"That's because of the very successful colonial past, whatever your take on that is," he says. "But it's also because of very dominant American culture, through Hollywood, it's because of American military power, economic power, World War II, and so on."

As communication and transport have shrunk the globe, English has solidified as the world's lingua franca. As Dorren has put it, "In recent decades, when American economic, cultural, political and military predominance coincided with globalization, English became the default language in practically all domains of global communication, from cinema and pop music to science and civil aviation."

 

The world's top 20 languages, by number of speakers

  1. English 1.5bn

  2. Mandarin 1.3bn

  3. Spanish 575m

  4. Hindi-Urdu 550m

  5. Arabic 375m

  6. Bengali 275m

  7. Portuguese 275m

  8. Russian 275m

  9. Malay 275m

  10. French 250m

  11. German 200m

  12. Swahili 135m

  13. Japanese 130m

  14. Punjabi 125m

  15. Persian 110m

  16. Javanese 95m

  17. Turkish 90m

  18. Tamil 90m

  19. Korean 85m

  20. Vietnamese 85m

    Source: Babel, Gaston Dorren

     

Your computer speaks English

He might have added computer technology. While older sciences tend to take their vocabulary from what Europeans regard as the "classical" languages of Latin and Greek, computing – having come of age in what many call "the American century" – is largely based upon English… and therefore, so is the architecture of the internet.

In 2019, Gretchen McCulloch, the linguist and author of the New York Times-bestselling Because Internet: Understanding the New Rules of Language, found that "software programs and social media platforms are now often available in some 30 to 100 languages." That doesn't sound too bad until you realize that there are something like 6,000 or 7,000 languages spoken and signed on the planet.

Furthermore, McCulloch found a deeper problem, one concerning what Karl Marx might have called the means of production: "What about the tools that make us creators, not just consumers, of computational tools? I've found four programming languages that are widely available in multilingual versions. Not 400. Four (4)."

While computers themselves are language-agnostic, the platforms they run upon are not, and that raises a barrier to enormous swathes of the world. As McCulloch noted, "Even huge languages that have extensive literary traditions and are used as regional trade languages, like Mandarin, Spanish, Hindi, and Arabic, still aren't widespread as languages of code."

But with computer pioneers largely working in Britain and then the United States, English became computing's lingua franca almost by default. While that may be limiting to non-English speakers who wish to pursue careers in coding, it can have effects upon a much wider population: in essence, anyone who wants to use a computer or the internet has had to wait for a 'customized' version which suits their own language.

This hasn't always been as quick as you might expect. Early computers were limited by ASCII (the American Standard Code for Information Interchange), an English-based character-encoding scheme introduced in 1963 and limited to just 128 characters.
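To see that limit in practice, here is a minimal Python sketch (an illustration, not anything from the period): even a single accented letter falls outside ASCII's 128-character repertoire, while modern Unicode encodings accept any script.

```python
# ASCII's 7-bit repertoire of 128 characters covers unaccented Latin
# letters, digits and punctuation - and nothing else.
print("cafe".encode("ascii"))      # b'cafe' - fits

try:
    "café".encode("ascii")         # the é has no ASCII code point
except UnicodeEncodeError as err:
    print(err)

# UTF-8, today's web default, can encode any script:
print("café".encode("utf-8"))      # b'caf\xc3\xa9'
print("北京".encode("utf-8"))       # b'\xe5\x8c\x97\xe4\xba\xac'
```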

"English is the internet's mother tongue," Dorren says. "Computers come from Britain, then modern computers come from the United States and the internet, while I think it was developed in Europe, certainly boomed out of Silicon Valley. I try to imagine what would have happened if the internet had started in Germany or France – I'm sure we would have seen a different trajectory."

 

A very brief history of English-based communication technology 

Such anglocentricity is nothing new in language-dispersal technology. ASCII was developed from telegraph code, which itself was castigated for its lack of letters with accents and other diacritics – glyphs which augment the letters of the Latin alphabet, such as the cedilla (ç), tilde (ñ) or umlaut (ü). These are almost vanishingly rare in English compared to other languages using the Latin alphabet, let alone other sets of letters or characters.

Similarly, the QWERTY typewriter layout was created in 1873 by American newspaper editor Christopher Latham Sholes – although he omitted the numbers 0 and 1, figuring capital O and I could do those jobs instead. More pertinently for most of the planet, the design didn't particularly cater for accents, diacritics and other character sets.  

Those who employed different alphabets had to either invent their own hardware (such as new keyboard layouts) or character sets (such as Friedrich Clemens Gerke's umlaut-friendly revision of the American Samuel Morse's self-titled but diacritic-free code), or simply acquiesce to Anglicized alternatives. 

Even as the world globalized, the situation was slow to improve. While non-English inventors were free to build their own typewriters, and Gerke's wider code soon superseded Morse's, ASCII remained the standard for the web's domain names until 2008 – which is why domain names before the last decade or so were restricted to English characters.

 

This is the first time that one language is the lingua franca throughout the world. There are many fields where we can only dream of this

- Gaston Dorren

"There is a difference between the status of English today and of all other lingua francas that went before: This is the first time that one language is the lingua franca throughout the world," says Dorren. And in some respects, he thinks that's not a bad thing – or certainly not as disruptive as the idea of any other language becoming so dominant instead.

"There are many fields where we can only dream of this. Just think of sockets: you travel abroad and you need all sorts of adapters to plug in your computer. If we had one type of socket and then somebody said 'No, you know what, I'm dominant now, I want to change that,' we wouldn't adapt, we wouldn't do that. And the same, I think, with language."

 

Alphabets and other inputs

Language can go a long way with the right help. The Roman empire spread the Latin alphabet over much of Europe; the Latin language itself went on to form the basis of Italian, French, Spanish, Portuguese and Romanian, with a further influence on Dutch, Norwegian, Danish, Swedish and particularly English. 

In turn, these more modern European languages – and the alphabet they share – were spread around much of the globe in the second half of the last millennium, to the point where the Latin alphabet is now used by around 70 percent of the world's population.  

Cultural expansion still happens now through loanwords, which arrive in one language from another without translation (words or phrases which are translated are called calques), thus representing a little linguistic outpost in a foreign tongue. 

They aren't always welcome; the Académie Française has fought long and hard against the creeping Anglicization of the French language – if not always successfully. The dominance of U.S.-led culture – in the entertainment channels of music, films and television, but also in an increasingly heavily branded commercial world – has led to continuing "cultural creep."

Less than a century ago, Turkish (currently spoken by 90 million) pivoted from the Perso-Arabic script to a bespoke version of the Latin alphabet while overhauling much of its vocabulary. This unparalleled act of lexical revolution simultaneously re-expressed Turkey's proud heritage – many old loanwords were dropped, either for regional dialect synonyms or simply for neologisms based on Turkic roots – while refocusing the country's cultural gaze from the east to the west.

Even so, around 30 percent of the world does not use the Latin alphabet. Gaston Dorren's book Babel examines each of the world's 20 most popular languages, precisely half of which use non-Latin scripts: Korean (spoken by 85 million), Tamil (90 million), Persian (110 million), Punjabi (125 million), Japanese (130 million), Russian (275 million), Bengali (275 million), Arabic (375 million), Hindi/Urdu (550 million) and Mandarin Chinese (1.3 billion).

For these huge swathes of the planet, the internet's default Latin ABC is irrelevant. One increasingly popular byproduct of this divide is pinyin – the 'romanized' system for representing Chinese characters. Although originally developed by Chinese linguists back in the 1950s, pinyin (literally "spell sound") has been boosted by recent technological advances which prioritize the Latin alphabet.
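Just how mechanical that romanization is can be sketched in a few lines of Python, assuming the third-party pypinyin package is installed (pip install pypinyin):

```python
# Sketch of character-to-pinyin conversion; assumes the third-party
# pypinyin package.
from pypinyin import pinyin, lazy_pinyin

hanzi = "拼音"  # the word "pinyin" itself, in Chinese characters

print(lazy_pinyin(hanzi))  # ['pin', 'yin'] - bare Latin letters
print(pinyin(hanzi))       # [['pīn'], ['yīn']] - with tone marks
```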

But isn't this another example of cultural creep, of Anglophonic domination by stealth? Not for Dorren, who doubts the world's ability to learn Chinese but welcomes any bridge between speakers of different languages.

 

I could imagine online information in Chinese automatically transcribed into pinyin – like Serbian websites automatically transliterate between Latin and Cyrillic

- Gaston Dorren

"Pinyin is most definitely a good thing for many reasons," he says. "If Mandarin becomes more internationally successful than I frankly expect it to be, pinyin would be an excellent interface. I could imagine online information in Chinese to be automatically transcribed into pinyin – much like Serbian websites automatically transliterate between Latin and Cyrillic, which they do impeccably.

"That would be possible with pinyin because the system works in such a way that it wouldn't be very complicated. A computer has no problem memorizing 6,000 or 10,000 or even the infamous 50,000 characters [of Mandarin Chinese]. A computer can do that: it's just us poor, wet humans who don't manage it."

However, we poor, wet humans can boss our hand-held computers around using another emerging technology which is totally keyboard-free: voice input. Studies suggest around half of all searches are now made by voice, boosted by the popularity of assistants such as Amazon's Alexa (via its Echo speakers) and Apple's Siri.

Using a tech cocktail of machine learning and natural language processing (NLP), such technologies automatically ascertain the spoken language – or languages, switching effortlessly from one to another, a frequent necessity given that only around 40 percent of the world is monolingual.
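Identifying the language is itself a statistical classification over character and word patterns. A minimal sketch of the idea, assuming the third-party langdetect package is installed (pip install langdetect):

```python
# Sketch of automatic language identification; assumes the third-party
# langdetect package. Detection is statistical, so very short snippets
# can be misclassified.
from langdetect import detect, detect_langs

print(detect("Whatever way you look at it, English dominates."))  # en
print(detect("Die Grenzen meiner Sprache sind die Grenzen meiner Welt."))  # de
print(detect_langs("Je ne parle pas anglais."))  # e.g. [fr:0.999...]
```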

 

Domains and languages

In the meantime, the architecture of the internet is belatedly opening up to non-Latin characters. "People who use different alphabets – the Cyrillic alphabet, Korean or Mandarin Chinese characters, Indian scripts – today, they can all use those on their computers and on their smartphones and on their tablets," says Dorren.

"I interviewed a guy from Google whose team had worked very hard producing dozens of Indian scripts to make tablets and computers and smartphones accessible to people and to stop them from having to write in English or perhaps Hindi. The monopoly has certainly been broken, and your and my impression that it's all Latin online may be a bit of a parochial view because that's where we go."

Indeed, India – a huge market with a perhaps bewildering array of languages numbering in the thousands – has been at the forefront of ending the baked-in bias towards the Latin alphabet in general and the English language in particular.

In November 2019, Twitter India's MD Manish Maheshwari proudly announced that his fast-growing platform was now less than half English, after adding a preferred-language option to help algorithms surface user-specific content. 

"We felt solving for language in India is going to be very important," said Maheshwari. "Already, non-English tweets are 50 percent of overall tweets. We have witnessed this in the last six to eight months because of the changes we have made to the product." Twitter India also worked with 70 media partners to ingest local content and start the conversations.

 

In 1998 about 85% of online content was English, in 2018 it was about 20%. It's expected to bottom out at around 10%

- Vasco Pedro, CEO of translation platform Unbabel

Meanwhile, since 2010 the Internationalized Domain Names (IDN) project has released the internet from its ASCII shackles and enabled 152 top-level domains (the bit after the dot, like .com or .net) including 75 in Chinese, Japanese or Korean scripts plus 33 in Arabic scripts. There are now more than nine million registered IDNs, which is around one in 40 of the global total – and rising fast.
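Under the hood, these names still travel across the DNS as ASCII, via a reversible encoding called Punycode. A quick Python sketch of the round trip, using the standard library's idna codec (which implements the older IDNA 2003 rules; registries have since moved to IDNA 2008, but the principle is the same):

```python
# An internationalized domain name is encoded to an ASCII-compatible
# "xn--" form for transmission, and decoded back for display.
name = "bücher.example"

wire_form = name.encode("idna")
print(wire_form)                 # b'xn--bcher-kva.example'
print(wire_form.decode("idna"))  # bücher.example
```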

A study by the Council of European National Top-Level Domain Registries and the Oxford Information Labs suggested that such tailored domains "boost the presence of local languages online and show lower levels of English language than is found in the domain name sector worldwide." 

That boost may help save those of the world's 6,000 or so languages threatened with extinction by the dominance of English. In 2003 Unesco adopted a recommendation to promote multilingualism online, yet according to a 2018 study by campaign group Whose Knowledge?, the internet only features around 7 percent of the world's tongues.

Even Facebook supports just over 100 languages. That makes it the most multilingual online social media platform, but it wants to go further to keep the conversation going across language barriers. In October 2020 it announced an AI model (M2M-100) capable of translating between any pair of 100 languages without first translating to English, as many existing systems do – one more step away from the anglocentricity of internet architecture.

 

Translation and polarization

Translation is what Vasco Pedro knows best. Portuguese by birth, he's the CEO of Unbabel, a translation platform headquartered in Lisbon and San Francisco which combines neural machine translation with human editing, chiefly for customer service departments. He also happens to have a PhD in Language Technologies.

"It's hard to pinpoint exactly, but roughly speaking, in 1998 about 85 percent of all online content was English and in 2018 it was about 20 percent, and it's expected to bottom out at around 10 percent," he tells CGTN. "By and large, we tend to create content in our own language and nowadays through social networks. And so if you put together the amount of content that people are creating every day in social networks, it just starts to overwhelm everything else." 

Pedro acknowledges the pre-eminence of English in modern communication networks – "For organic reasons and historical reasons, it created the scaffolding where everything runs" – but, like Dorren, has witnessed an increase in non-English tools, both user-facing and behind the scenes, which will continue to level the playing field. 

"Until about three years ago, I thought that [English dominance] would be kind of a permanent feature," he recalls, but a meeting with a major Chinese retailer changed his mind. "I was in China having a meeting, and someone said 'We've realized that one of the difficulties our developers face is that because all of the infrastructure for coding is in English, they necessarily need to understand enough English to be able to work in the systems. But we're now starting to develop Chinesecentric tools to produce Chinesecentric software.' 

 

If you can consume everything you need within your language, you're less incentivized to go outside of your language to search for the knowledge that you need

- Vasco Pedro

"The implications of something like that are incredible, because once you have the same kind of stacks and architectures built from the ground up in Chinese, the reverse would be true, where anyone that would want to use those would have to learn Chinese to be able to interact with it. My current perception is that in certain areas like vision and actually a number of other AI areas, there's significant evidence that China is ahead in terms of output of research, in terms of practical applications, in terms of dissemination."

But every advance has its drawback, and while the increasing availability of more languages online is inarguably a good thing, Pedro is wary of the segregation and lack of cross-pollination that could result.

"It's creating this microcosmos where you're exposed primarily to your language everywhere you go. For someone like me, I tend to exist in multiple different languages, but globalization also is driving similar experiences everywhere you go: the variety of experiences diminishes, which means that it's less different to go to a different place. 

"It's tied in to polarization of news and polarization of opinion: If you can consume all the news you can consume within a particular point of view, then you're not likely to want to go outside of that point of view. If you can consume everything you need within your language, then you're less incentivized to go outside of your language to search for the knowledge that you need."

 

Dialects and idioms

Humans have spent millennia attempting to overcome language barriers, but only to a certain degree. As polyglots, both Pedro and Dorren acknowledge that sometimes we use language as a protective cloak.

"There's an inherent need for humans to differentiate themselves through language," says Pedro. "Language is one of the main indicators of tribe. You look like me, so we're similar. But if you talk like me, then we can connect easily. And if you speak my language, that's already some trust. If you speak my language in the way that I do, with my accent, immediately, there's certain cultural references that I assume. 

"We see that over and over with teenagers, for example, they tend to reinvent the language as a way to differentiate themselves from the previous generation. There's something inherent in human beings, they tend to create language and use language as a way to also create differentiation between groups." 

"Idioms, figurative language, cultural references: those elements of our daily language will be a really hard nut to crack for machines," says Dorren. "They're a hard nut to crack for humans – I've been hearing English for 40 years, and I regularly come across cultural references that I'm just not getting because I'm not British. Potentially a very advanced computer would be better – but the machine is not always on the lookout for humor, whereas many humans are."

Even among speakers of ostensibly the same language, these minor differences are growing rather than diminishing. "People have been speaking English, in the modern form, let's say 300 years, 500 years. There's no convergence towards one English at all," says Pedro. "The Scottish are not going to stop speaking the way they do, or Irish or northern England or Americans. If you just take British and American English, I'm not seeing it coming closer."

 

[The move to machine translation] is always going more slowly than the optimists expect – like autonomous vehicles or nuclear fusion

- Gaston Dorren

That predominantly oral language use is one reason why Dorren distinguishes the potential "between written, which is literally a translation, and oral, which I would prefer to call interpretation. For written language, I'm frequently amazed at the quality – and then again, I am amazed by the gaffes that these softwares produce."

One example cited on Dorren's website didn't even need to be translated out of English to be badly mangled. Reading a book, he noted a quote from the Scottish philosopher-economist Adam Smith mentioning "the beggar, who suns himself by the side of the motorway" – somewhat unlikely, given that Smith was writing in the 18th century.

Dorren's research showed that the original quote ended in "highway" – but the British English publisher had altered it while reversioning the original American English book in which it appeared. It's not known whether the fault was human or computer, but Dorren thinks the distance between machine and human translation is disappearing – gradually, but inevitably. 

"It's always going more slowly than the optimists expect, much like autonomous vehicles, for instance, or nuclear fusion," he says. "When I see the progress over the past 10 years, I kind of expect it to happen like an asymptote – is that the English word? One of these curves that creep towards a vertical line... it will get closer and closer until the machine is indistinguishable from a human being."

Even if a translation machine could ever pass this version of the Turing Test, Dorren warns the results might not be immediate, because of the different structures of languages. 

"The software would have to wait until the end of a sentence in order to translate it," he warns. "Many of the verbs that come at the end in the German sentence come at the beginning in an English sentence; if you were a machine translator, you would have to wait until the end of the German sentence to even start translating the English one."

 

Toward a real-life Babel fish

In 1978, author Douglas Adams' sci-fi comedy The Hitchhiker's Guide to the Galaxy imagined the existence of a Babel fish, which absorbed and produced soundwaves of precisely the right sort for it to act as an instantaneous translator of any language in the universe. 

Adams, an early adopter of tech including the fledgling Apple Macintosh computers, once said "Technology is a word that describes something that doesn't work yet." But since then, technological advances have made the possibility of real-time translation feel tantalizingly close – if not, presumably, fish-based.

"In terms of having an implant in the brain that starts augmenting your capabilities through non-biological means, I think we're very close," says Pedro. "Millions of people in the world have a cochlear implant and it enables them to hear in a way they couldn't before. We're not far from a time where someone with the cochlear implant will hear way better than someone without it, where they'll be able to hear different frequencies, and that will be an advantage. 

"Having a Babel fish in the sense that you can say something and it captures what you're saying in writing, I think we're not that far off because it's a matter of noise. Speech recognition is simpler – there's some sounds and there's an actual thing you said and that's it."

 

Maybe I won't need to learn a language any more, but I will want to because it's a sign of knowledge

- Vasco Pedro

But we might want to be careful what we wish for – and not just because Adams warned that the Babel fish, "by effectively removing all barriers to communication between different races and cultures, caused more and bloodier wars than anything else in the history of creation." 

More realistically, if instant (and reliable) translation becomes possible, surely fewer people will bother learning a second language? Pedro acknowledges the possibility, but doubts we will suddenly become a world of monoglots. 

"Medium term, for a long time I think you will learn the language of your parents, your birth, your where you're exposed," he says, "and unless that is English or Chinese becomes dominant, you will probably learn one more language in school because it's a good tool to have: it's good from a brain development perspective to learn other languages."

For Dorren, it depends on your needs. "Many people will give up on foreign language learning because the return on effort will be much lower," he says. "If you run a car rental on some Greek island now, you need to learn English to do your job. If you can do that with machine interpretation, you're probably no longer as motivated to do that. I mean, you're not doing it to read Shakespeare, you're doing it to rent cars, right?"

Pedro takes the same point in the opposite direction. "Maybe I won't need to learn a language any more, but I will want to because it's a sign of knowledge," he insists. "Nobody needs to learn how to play a violin, there's no need to learn how to ride a horse any more, but lots of people do because it's fun. Any phone chess program can beat the best human champion, and yet there have never been more people learning how to play chess than there are now."


Originally published by CGTN Europe, 26 Dec 2020