A Sign Corpus For All

On the 13th of November UCL’s Deafness, Cognition and Language Research Centre (DCAL) celebrated its 10th anniversary. In ten years DCAL has had a profound effect on a number of areas, from Clinical Psychology to Education. One of the most exciting projects from a linguist’s perspective is probably their British Sign Language (BSL) Corpus Project. Before 2008 there was no large accessible collection of BSL signing. DCAL decided to address this gap and set out to collect signing data from Deaf participants from different areas of the UK. Ultimately signing data was collected from 249 Deaf people in 8 cities (London, Bristol, Birmingham, Manchester, Newcastle, Glasgow, Cardiff and Belfast). The signers also represented different genders, ages, ethnic groups and occupations. Participants were interviewed, held conversations with other signers and were asked to provide their preferred sign for 102 different concepts (e.g. ‘America’ or ‘dog’). This gave DCAL a wealth of signing data unlike anything ever collected on BSL before.



So, why is this important?

  1. The project makes this data accessible to the general public. This means that signers, learners of BSL and linguists (including you!) can all look at videos of signing for any purpose.
  2. The corpus acts as a BSL time capsule. DCAL has shown that language change is happening very quickly in BSL and by having the BSL Corpus it is possible to keep a record of what BSL looks like now.
  3. Linguists can study the corpus to get a better understanding of the structure of BSL. This, in turn, influences the teaching of BSL and the training of interpreters.
  4. The corpus records the regional variety of BSL as well as the differences across age groups and genders. This is of particular interest to sociologists and sociolinguists. How would we have known before the corpus that there were at least 17 variations of the sign PURPLE?
  5. Other countries have been spurred on to create sign language corpora and this may allow future comparison between different sign languages.
  6. In the future, DCAL will make the corpus completely searchable like the corpora of written or spoken language. Once it is machine readable it will be open to further research by computational linguists and may be more easily compared with corpora from spoken languages.
  7. The BSL Corpus Project has been used to produce a free online dictionary of BSL based on the signs provided by corpus participants. This is an invaluable tool for learners of BSL and contains over 2,500 signs from the different regions of the UK.

If you are interested in finding out more about the BSL Corpus, visit their website. You can also hear Dr Adam Schembri talking about the project 5 years ago on UCL’s Mini-lecture series here.

Twitter dialectology

Traditionally, we’ve found out about variation in how people speak—whether that be variation between people in different places, of different classes, genders, or whatever—by doing surveys. Dialectologists have travelled around the country interviewing a few people in each town to record how each would say a set of words. Sociolinguists have interviewed wide ranges of people from different educational and social backgrounds and looked for differences in how they speak. These sorts of methods have been very successful—but they’re also very costly. Sending out researchers to do dialectological surveys is an expensive business: many researchers are needed to carry out the long process of getting to know local people and finding some who are willing to be interviewed in every locality and all those researchers have to be paid for their time and travel. The reality is, there just hasn’t been the funding in humanities and social science research to do this sort of work on a large scale for some years and so much of our data is rather out of date.

But in the era of the internet and ‘Big Data’ there’s a new way of finding out about language variation: using social media. And so a new generation of research into language variation using language data from social media is just starting to appear.

Using social media data for research is a very different proposition to traditional survey data. Obviously, it’s mostly written rather than spoken data, which immediately puts some limits on the sorts of things it can tell us. More problematically, you can rarely find out as much information about each person in your study as in a traditional survey, and even what information you can find out is unreliable. As an interviewer in person the researcher can ask for more information when needed: ‘You say you’re from York—were you born and brought up there, or did you move around as a child? Were your parents also from York?’ But dealing with online data, the vast majority of the time what you see is all you get. You know what the user chose to write in the ‘Hometown’ box but not necessarily what they meant by it. You know where their phone was when they tweeted—but you don’t know if that’s the place that they live and were brought up, or indeed whether those are the same places.

Nevertheless, there is one big advantage to this sort of data: there’s lots of it. And a big enough quantity of data can often make up for low quality data, if we’re asking the right questions. Because of the uncertainties about who’s really behind the keyboard, we can rarely use social media to make definitive statements about how much a given group of people speaks or writes in a certain way (that would be statements like ‘people under 25 from London use the word order “give it me” 50% of the time and “give me it” 50% of the time’)—but we can make comparative statements (like ‘people from London use the word order “give it me” twice as often as people from Lincolnshire’).
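To make the contrast concrete, here is a minimal Python sketch using invented counts (the numbers and the `usage_rate` helper are assumptions for illustration, not data from any study). Each region's percentage is the kind of absolute figure we should be wary of; the ratio between the two regions is the kind of comparative figure described above.

```python
# Hypothetical tweet counts per region; these figures are invented for illustration.
counts = {
    "London": {"give it me": 120, "give me it": 480},
    "Lincolnshire": {"give it me": 30, "give me it": 270},
}

def usage_rate(region_counts, variant):
    """Share of all tokens in a region that use the given variant."""
    total = sum(region_counts.values())
    return region_counts[variant] / total

london = usage_rate(counts["London"], "give it me")
lincolnshire = usage_rate(counts["Lincolnshire"], "give it me")

# Comparative statement: how many times more often London uses the variant.
ratio = london / lincolnshire
print(f"London: {london:.0%}, Lincolnshire: {lincolnshire:.0%}, ratio: {ratio:.1f}")
```

With these made-up counts the ratio comes out at 2, matching the ‘twice as often’ phrasing.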

To exemplify what sort of work is being done with social media at the moment, I’ll take you through a couple of interesting recent papers (links to both are found at the bottom of the post). Gonçalves & Sánchez (2014) gathered around 50,000,000 tweets written in Spanish and associated with a GPS location over two years. They then tracked lexical variation—variation in the words people choose to use to describe a given concept—to see if they could find differences in people’s language use associated with different places. The map below is reproduced from their paper, showing the different words used for ‘car’. As you can see, five distinct areas emerge: people in North America and northern South America largely use ‘carro’; people in Central America and in Spain usually use ‘coche’; and people in the southern half of South America generally use ‘auto’.


[Map: words used for ‘car’ in Spanish-language tweets, from Gonçalves & Sánchez (2014)]

They then took results like this for many words and used machine learning algorithms (specifically K-means clustering) to investigate whether there were identifiable groups of dialects. The result was very surprising. Instead of showing big, regional dialects associated with contiguous areas on the map, the algorithm identified just two dialects: one associated with the big urban areas and one with everywhere else. Gonçalves & Sánchez write: “Superdialect α is utilized by speakers in main American and Spanish cities and corresponds to an international variety with a strongly urban component while superdialect β is comprised mostly of rural areas and small towns” (6). They see this as evidence for the homogenising effect of globalisation on language.
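For readers who haven’t met K-means before, here is a toy Python sketch of the general technique. It is not the authors’ pipeline: the place names, the usage proportions and the choice of starting centroids are all invented for illustration. Each place is described by how often it uses each of three words for ‘car’, and the algorithm repeatedly assigns places to the nearest cluster centre and then moves each centre to the average of its members.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points serve as starting centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical places with invented proportions of (carro, coche, auto).
places = {
    "city_A": (0.40, 0.35, 0.25),
    "town_B": (0.90, 0.05, 0.05),
    "city_C": (0.35, 0.40, 0.25),
    "town_D": (0.85, 0.10, 0.05),
    "city_E": (0.38, 0.36, 0.26),
    "town_F": (0.92, 0.04, 0.04),
}

labels, centroids = kmeans(list(places.values()), k=2)
for name, label in zip(places, labels):
    print(name, "-> cluster", label)
```

In this made-up data the three ‘cities’ mix all three words while the three ‘towns’ strongly prefer one, so the two clusters separate them. Real data would need many more words and a more careful choice of starting points.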

Eisenstein et al. (2014) focused not on the static facts of whole dialects but on fast-paced processes of change associated with new words entering the language. They collected a corpus of 107,000,000 tweets in English from 2009-2012 and looked only at words whose frequencies changed significantly over time. Below is an example, reproduced from their paper. It shows the expansion of the term ion (short for ‘I don’t’ as in ‘ion even care’) over a 150 week period.

[Figure: spread of ‘ion’ over 150 weeks, from Eisenstein et al. (2014)]

One interesting finding which is immediately clear from such figures is that even for these sorts of words which are fundamentally written and exist (basically) only online, geography is relevant. On the face of it, we might expect words on the internet to spread randomly across space, as most of what is posted is publicly visible regardless of where you are. But the reality is that words basically spread through social networks, and these exist in real space, even if we’re watching them in action online.

Eisenstein et al. go on to examine the most common routes of linguistic diffusion, mapping the paths most often taken by new words between cities, and then investigate what factors favour such linguistic pathways. They found that racial demographics were crucially important: new words were more likely to be transmitted between cities with similar proportions of African American citizens and Hispanic citizens. Small geographic distance, similar proportions of residents living in urbanised areas and similar median incomes also facilitated linguistic influence. Population also had an effect: larger settlements were more likely to exert influence than be subject to it.

These two studies are just a small intimation of the potential for linguistic research with social media, but hopefully you can start to see what an exciting area this promises to be!

Eisenstein, Jacob, Brendan O’Connor, Noah A. Smith & Eric P. Xing. 2014. Diffusion of Lexical Change in Social Media. PLoS ONE 9(11). e113114. doi:10.1371/journal.pone.
Gonçalves, Bruno & David Sánchez. 2014. Crowdsourcing Dialect Characterization through Twitter. PLoS ONE 9(11). e112074. doi:10.1371/journal.pone.0112074.




As a linguist, I often fail to match my non-linguist friends in how-cool-is-my-degree anecdotes: Japanese word order alternations just don’t have the same shock effect as a budding doctor declaring how formaldehyde makes them hungry during dissections, and Middle English sound changes don’t make you quite as hip as the classicist indulging in ancient Bacchanalia. But a few weeks back, I enjoyed a rare moment of subject coolness when I declared (to the hip classicist, as it happens) that – brace for impact – Finnish only has one word, hän, for both he and she. Nor is it the only language of its kind.

Classicists just wanna have fun. Peter Paul Rubens: Bacchanalia.


For a moment, I felt that the ensuing silence, followed by somewhat excessive OMG-nowaying (at which point I was seriously considering offering the poor classicist a paper bag to prevent hyperventilation) was perhaps veering into the overreacting side of things. However, a gender-neutral third-person pronoun has glimmered as the Holy Grail of linguistic equality in the minds of generations of activists and regularly crops up in newspaper headlines – not an insignificant subject to get excited about. English, for instance, has seen suggestions ranging from hu and peh to xe, jee and many more, as alternatives for unifying he and she.

Why bother about such minuscule issues? Haven’t English-speakers been quite content with the distinction since, well, English began? Part of the answer is stylistic: everyone knows the awkwardness of conscientiously typing he/she whenever reference is not specified, turning the prose of budding Shakespeares into a satire of political correctness, or the cumbersome singular they shot down by prescriptivists. The other part, as any self-respecting feminist in the footsteps of Beauvoir will point out, is that language is a tool of, mostly patriarchal, power – more often than not, masculine terms carry the connotations of the standard and the positive. The list is continued by the question of how to refer to those who do not identify themselves in the gender binary, and cases where the use of he has turned androgynous entities into gendered beings (never imagined the Christian God as a bearded fellow on the rim of a cloud?).

It may all seem like a never-ending debate conducted from ivory towers but in 2012, Sweden saw gender-egalitarian fuel thrown onto its pronominal flames as the first children’s book with hen, the proposed gender-neutral equivalent for feminine hon and masculine han, was published. In Kivi och Monsterhund, the protagonist Kivi has no specified gender and it is left to the reader to choose how they see, uhm, them. However, in the wonderland of equality where even the main airport features unisex toilets and where hen has finally achieved dictionary-status this year, the opposition has raised a surprisingly animated and even imaginative counter-attack, not least because of underlying seeds of misunderstanding: it is only the extremists who want to see the gender distinction fully erased from pronouns, whereas the majority regard hen as an opportunity to avoid awkward problems of reference. Sadly enough, it is the extremist view that has gained the most eager attention.

Hon? Han? Whatever. ClipArtBest.

Some say hen confuses language use, and the major newspaper Dagens Nyheter banned its use for the same reason; others claim it confuses not only language but also their children’s gender identity. Those inclined to think in terms of conspiracy theories see it all as one big feminist plot to erase gender not only from language but humankind (NOT mankind, please) as well. And as hen has a meaning quite different in English, oh the irony of gender-neutral hens!

However, according to the author, Jesper Lundqvist, Kivi the gender-neutral protagonist is less of a feminist tour de force and more of a creative possibility. No proverbial governments have been taken over: children still assign gender, usually their own, when reciting the tale. And while Lundqvist admits to using increasing numbers of hen in casual speech, those still fearing the loss of their gender may rest assured for Kivi does have gender-specific parents – the good old mother and father.

Yet all is not well in allegedly gender-neutral language paradises, either, where, perhaps surprisingly, the opposite trend is emerging. Look east of Sweden, and Finnish language enthusiasts strike back. Finnish may well have only gender-neutral pronouns, hän for people and se for things as well as people, but there is a yearning, however faint and underground, for something gender-specific. The original translation of Joyce’s Ulysses, for instance, was criticized for using the single pronoun in situations where the original English she/he distinction clarifies reference, while the most recent version has been furnished with the additional, made-up hen for feminine reference – likewise the object of unhappy feelings aplenty.

What about greater gender equality propelled by hän? Would the introduction of a gender-specific pronoun not undermine the goals of gender-neutral language enthusiasts? In language, certainly no stylistic issues arise with hän when the referent could be of any, or no, gender, and in society, women were given the right to vote second in the world (which is slightly overshadowed by the fact that gender-specific English-speaking New Zealand got there first). However, all these enthusiasts’ arguments reduce to nothing but wishful thinking in the face of studies showing that people uniformly manifest predominantly male associations with hän – only girls under school age will see it as female. So much for gender-neutral socialization through a single pronoun.

Gender-neutral pronouns - not so much about votes for women. Image credit: Sony Pictures Classics.


Risking a blow to my subject coolness, I had to inform the classicist of a further anticlimax: what seems to be constantly forgotten in these debates is that hen and its peer neologisms, whatever their social impact, are very unlikely to gain ground other than as stylistic alternatives to the already existing pronouns. Studies into the history of languages show repeatedly that the basic ingredients of language are the most resistant to change and as such cannot be changed by willpower. Just think about it: can you imagine yourself saying thon or ze instead of he and she without the ridicule carried by so many expressions created to be politically correct?

Deep breaths. English pronouns aren’t going gender-neutral any time soon.


Macavity wasn’t there


I was asked to write this post at fairly short notice, so as I was walking home earlier I was thinking vaguely about what I might include, not having prepared anything in advance. I was also listening to music piped from my phone via headphones, which happened to include the following lyric:

He’s outwardly respectable (I know he cheats at cards)
And his footprints are not found in any files of Scotland Yard’s
And when the larder’s looted or the jewel case is rifled
Or when the milk is missing or another peke’s been stifled
Or the greenhouse glass is broken and the trellis past repair
There’s the wonder of the thing: Macavity’s not there!

These words are from the song Macavity from Cats, Andrew Lloyd Webber’s musical adaptation of T.S. Eliot’s volume of poetry Old Possum’s Book of Practical Cats. What particularly caught my attention, however, was the pronunciation of the word jewel (in the third line of the above extract).

Historically (and to this day for some speakers), English accents distinguished a sound linguists write /ʤ/ from a sequence of sounds represented with /dj/. /ʤ/ is the ordinary English j sound, found in joke, jape, jollity etc. – sort of a d followed by the consonant in the middle of vision or at the end of beige. /dj/ is a sequence of d plus the sound found at the start of year or yes.

Now, once upon a time it was usual to pronounce words like duel or duke with a /dj/ at the start: “dyuel”, “dyuke”. (This was simply a combination of d plus the “long u” sound, pronounced yu, which we find at the start of words like universe and university. Nowadays it’s especially associated with the British “Received Pronunciation” accent, or RP.) But a lot of speakers, particularly in the US, now just pronounce these same words with an ordinary d and no y: “dooel”, “dook”. And lots of other people pronounce them with /ʤ/: so duel is pronounced the same as jewel and duke is pronounced the same as juke.

What has this got to do with the song? Well, in the recording on my phone, the word jewel is pronounced with /dj/. This is interesting because historically jewel never had the pronunciation /dj/, only /ʤ/ (as evidenced by the j in the spelling). But duel, as noted above, was formerly pronounced as standard with /dj/. So, in RP and accents like it, duel and jewel were pronounced differently.

This, then, looks like an instance of hypercorrection: a “correcting” of something according to a rule that doesn’t actually apply in this instance, thus actually resulting in an “incorrect” form. T.S. Eliot, though born in America, lived most of his life in England and many of the poems in Old Possum’s Book of Practical Cats are explicitly set in the UK: this, then, is also the setting for Cats. Further, the poems are very much early twentieth century in character, when traditional RP was much more widely spoken than today.

(This is, incidentally, important in another of the poems/songs, Skimbleshanks, which includes the line They’d be off at last for the northern part of the northern hemisphere. In my opinion, last and part are supposed to form a sort of half-rhyme, both with a long “ah” sound (found in last in RP but not many other dialects). In at least one US recording, however, last is pronounced with a short “a”, whereas part nevertheless has a long “ah” on account of the following r – and thus the similarity between the two words is lost.)

Anyway, back to Macavity. Whoever was singing this particular part about the rifled jewel case is clearly aware that duel, in RP, was pronounced with a /dj/. In their ordinary everyday speech, however, they probably pronounce it with a /ʤ/ (the same as jewel). They may then have a rule they use when putting on RP for the purposes of performance: “my ordinary /ʤ/ is pronounced /dj/ in RP”. Now, this rule correctly applies in duel. The singer in question, however, seems to have overapplied their rule to jewel and come out with a pronunciation that is strictly correct only for duel.

Is there a moral to the story? Really, this is just something I thought was interesting and worth remarking on. But if you want a lesson: sometimes, in trying to replace a “wrong” form with a “right” one, you may in fact end up doing the opposite, and getting it “wrong” where if you hadn’t done anything you’d have got it right.

Lunchtime Linguistics Puzzle

You may remember a few weeks ago, Jessica wrote about the UK Linguistics Olympiad (see here). This week, we are challenging you to solve one of the Linguistics Olympiad puzzles. To find more of their puzzles, visit the International Linguistics Olympiad webpage.

Here is the puzzle (credit to Patrick Littell who originally created this puzzle for the North American Computational Linguistics Olympiad 2008):

Aymara is a South American language spoken by more than 2 million people in the area around Lake Titicaca, which, at 12,507 feet above sea level, is the highest navigable lake in the world. Among the speakers of Aymara are the Uros, a fishing people who live on artificial islands, woven from reeds, that float on the surface of Lake Titicaca.

1. Below, seven fishermen describe their catch. Who caught what?
[Image: the seven catches]
1) “Mä hach’a challwawa challwataxa.”
2) “Kimsa hach’a challwawa challwataxa.”
3) “Mä challwa mä hach’a challwampiwa challwataxa.”
4) “Mä hach’a challwa kimsa challwallampiwa challwataxa.”
5) “Paya challwallawa challwataxa.”
6) “Mä challwalla paya challwampiwa challwataxa.”
7) “Kimsa challwa paya challwallampiwa challwataxa.”

Also, watch out! One of the fishermen is lying.

2. Your daily catch is pictured below. Describe it in Aymara, and don’t lie!
[Image: your daily catch]
Note: ä is a long a; ll is pronounced as ly; x as the ch in Scottish loch. Some vowels transcribed here are deleted in actual speech.


A not uninteresting sighting

Living in a picturesque and historic city has many perks. One of them is being frequented by film crews, who like Cambridge as a period backdrop – with a bit of blacking out the double-yellows and covering over the odd signpost, you can easily get the 1880s, the 1930s or the 1960s, as takes your fancy. This month’s cinematic excitement in town has been the filming of ITV’s Grantchester, which is set in the nearby eponymous village. Of course, the local free paper could not resist a wittily-entitled piece. And it caught my eye for another, linguistic, reason.

In the article, a (real) local vicar talks about the drama’s Rev Sidney, and comments: “… although they don’t show them in full, on the show they sometimes round it off with a bit of a sermon. They are certainly well-written, and he is not unprofound.”
Cambridge News & Crier, 15/10/15 p.7

Now, perhaps you think there’s nothing particularly remarkable about that. But what I found interesting was the last phrase, containing that double negation ‘not unprofound’. Why say that, rather than simply ‘profound’?

I think there are two steps to understanding the speaker’s choice here: a semantic one and a pragmatic one. First up, the semantics of negation. At least in English, adding un- to the front of an adjective (like ‘profound’) can have two effects. Have a look at these examples:

wise – unwise              worthy – unworthy
happy – unhappy            impoverished – unimpoverished
friendly – unfriendly      aware – unaware
fair – unfair              reliable – unreliable
kind – unkind              blemished – unblemished

Now, individuals’ intuitions about this do vary, but generally, the column on the left contains pairs which are contraries: the negated adjective, with un-, means the extreme opposite of the positive. Formally, a pair of adjectives are contraries, if they can be simultaneously false, but not simultaneously true. For example, you can say ‘Bob isn’t happy, but he’s not unhappy either’, but you can’t say ‘Bob is happy, and he’s unhappy, too’. To put it another way, there is a middle ground between contraries, where you’re neither friendly nor unfriendly, neither kind nor unkind, and so on.

Unfriendly              Friendly
————————— ……………………… ————————————

Pairs of adjectives in the right column, on the other hand, are contradictories: the negated adjective, with un-, is just the opposite of the positive, the absence of the positive. These pairs cannot be true at the same time or false at the same time. You can’t say ‘Bob is neither aware nor unaware of the situation’, and nor can you say ‘Bob is aware and unaware’. In other words, there is no middle ground between these terms: you have to be one thing or the other.

Unaware                 Aware
——————————————— —————————————

Now these distinctions were observed as far back as Aristotle, and they seem to have to do with the nature of the positive adjective – whether it’s something that can have degrees, whether it’s gradable. You can be more or less friendly, but you can’t be more or less aware. However, the important thing for us is that there is this distinction between two types of negative adjectives.
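The two definitions can also be checked mechanically. Below is a small Python sketch under some loud assumptions: the mood scale from -5 to 5, the cut-off points for ‘happy’ and ‘unhappy’, and the `classify_pair` helper are all invented for illustration. The function simply looks for a case where both adjectives hold and a case where neither holds.

```python
def classify_pair(positive, negative, domain):
    """Classify an adjective pair by checking every case in the domain."""
    both_true = any(positive(x) and negative(x) for x in domain)
    both_false = any(not positive(x) and not negative(x) for x in domain)
    if both_true:
        return "neither"
    # Never both true: contraries if both can be false, otherwise contradictories.
    return "contraries" if both_false else "contradictories"

# Gradable adjective: a hypothetical mood score with a neutral middle band.
mood_scores = range(-5, 6)
happy = lambda score: score >= 2
unhappy = lambda score: score <= -2

# Non-gradable adjective: you either know about the situation or you don't.
knowledge_states = [True, False]
aware = lambda knows: knows
unaware = lambda knows: not knows

happy_result = classify_pair(happy, unhappy, mood_scores)
aware_result = classify_pair(aware, unaware, knowledge_states)
print("happy/unhappy:", happy_result)
print("aware/unaware:", aware_result)
```

The middle band of scores is what makes ‘happy’ and ‘unhappy’ come out as contraries; with only two possible states, ‘aware’ and ‘unaware’ leave no such gap.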

So which column does ‘profound – unprofound’ go in? I have to say, my intuition is not clear. And the OED says that it means ‘not profound; shallow, superficial’, which suggests that it could go in either. ‘Not profound’ is the contradictory, whereas ‘shallow’ is the contrary. How can we tell what our vicar meant here?

The question then is, what happens when you stick another negative, ‘not’, on the front: ‘not unhappy’ vs ‘not unaware’? Well, with the contraries, like ‘not unhappy’, you get a meaning that could be ‘happy’ or ‘neither happy nor unhappy’: it covers both the opposite extreme and the middle ground. And because we don’t have a single word with that meaning in English, we can see a good motivation for using a phrase like ‘not unhappy’. But with contradictories, ‘not unaware’ logically means ‘aware’. So why on earth would we say it?

Well, this is where the pragmatics comes in. If the speaker has gone to extra lengths to say ‘not unaware’, when they could have saved themselves a couple of syllables with the simple ‘aware’, this must be because they want to communicate something extra. Something that using ‘aware’ on its own would not convey. Take these examples:
What’s the service like at your local garage?
It’s not unreliable.
+> It’s not as reliable as ‘reliable’ would suggest.

I was not unaware of the situation.
+> I was acutely aware of the situation.

The exact inference depends on the context, but it seems you might get the kind of inferences indicated by +>. Either there’s a tempering of the meaning of the positive term, or there’s an intensification of it.

So back to our newspaper clipping, ‘he is not unprofound’. If ‘unprofound’ for the local rev means ‘shallow’, then he is simply saying that it is not the case that his TV counterpart’s sermons are shallow. But if his mental lexicon has ‘not profound’ for ‘unprofound’, then choosing the phrase ‘not unprofound’ is implicating something more. And in the context, that the sermons “are certainly well-written”, we can infer that he means that they really are profound.

But to find out whether you agree, you’ll have to wait until next year, when the next series of Grantchester is aired.

Further reading:
Levinson, S. C. (2000). Presumptive meanings: The theory of generalized conversational implicature. Cambridge, MA: MIT Press.
Horn, L. R. (1989). A natural history of negation. Chicago, IL: University of Chicago Press.

“Distant relatives” and “close neighbours” in second language acquisition

“Relatives” and “neighbours” are two key concepts deeply rooted in the minds of Chinese people, at least for people of my age or a bit older (I’m in my mid-20s, if you are curious). I still vaguely remember the days before we finally moved to the flat purchased by my parents: we lived in the residential area built by my father’s academy, and most of the neighbours were from the same research department. Almost everyone knew everyone else, and in case of emergency, our neighbours were always willing to lend a hand. Even in primary school, we learnt a proverb about neighbourly relations – “Distant relatives are not as good as close neighbours”, which means “neighbours next door may be more friendly and helpful than relatives who live far away”.

While I don’t want to waste more precious bytes on the importance of “neighbourhood” in Chinese culture, I did find that “neighbours” and “relatives” are rather important in the field of linguistics, or, to be more precise, second language acquisition. I should admit that this is not my own creation; almost forty years ago, Eric Kellerman from Radboud University Nijmegen and his colleagues suggested that the distance perceived by the learners between two languages (the native language of the learner, which is abbreviated as NL, and the target language that the learner is learning, which is TL) might influence the learners’ strategies in second language acquisition.

Learners who acquire (or learn) a second language through formal education have already acquired their native language and are able to use it grammatically by the time they start. Some features of the grammar and vocabulary of the native language might influence the learners’ production and judgement of the target language; for instance, some learners may translate proverbs from their mother tongue into their second language word for word, which leads to some weird and twisted expressions. Such a process is called “transfer” by Kellerman, while it is also known as “native language interference” or “cross-linguistic influence”. However, Kellerman’s research and analysis show that learners are not that dumb; they do not transfer every feature of their native language to the target language, but judge each situation against certain constraints. One of those constraints is the “distance” that I mentioned in the last paragraph.

We are surrounded by theories and myths about languages from the moment we have the concept of “language”. From our parents, teachers or some prominent “folk linguists”, we pick up views about “relative languages” and “neighbour languages”. I guess British kids may hear something about the close relation between English and German, or sometimes between English and French, and then comes the conclusion – “it is easier to learn German and French, whilst Chinese and Japanese are more difficult”. A similar situation arises in China as well: I have repeatedly heard about the “advantages of language learning” that European children have over Chinese students, such as “if your native language is English, you already know half French and half German and even some Latin”. Under the influence of these sayings and ideas, a learner will gradually construct a “language map” in her mind, on which different languages are placed at a certain distance from her native language. The concept of such “distance” is called “psychotypology” by Kellerman, because it is psychologically constructed and sometimes bears only a slight resemblance to the real map of linguistic typology.

So how can psychotypology affect the strategies of second language acquisition? The mechanism is, actually, not that difficult – just as with other learning processes, if you believe that two things are similar, you always have the urge to apply the methods with which you dealt with A to problems regarding B. Kellerman proposes one possible influence of psychotypology. If a learner perceives the “distance” between her native language A and her target language B to be relatively small, which indicates that she believes that the two languages share a number of similar features, then she could transfer some features and uses of A to the production of B. Ideally, the smaller the psychological distance between A and B, the more features could be transferred. If a learner believes that the distance between A and B is too great, she will give up any transfer from her native language to the target language, because she presumes that there is little similarity between the two languages. In a study with Jordens (Jordens and Kellerman 1981), Kellerman presents a comparison between Dutch students learning German and English: students are more willing to transfer Dutch phrases and idioms to German than to English, because they believe that Dutch is closer to German and farther from English, although all of them are Germanic languages from the Indo-European language family.

We now see that psychotypology reflects the similarity between two languages as perceived by the learner, and that it could affect the transfer strategy from the native language (NL) to the target language (TL). For linguists and people with a fair knowledge of linguistic typology, it is rather natural to suggest that languages from the same language family are perceived as closer and more similar by learners. Interestingly – and somewhat unfortunately – the world does not follow our imagination. When it comes to languages, people still have the opinion that “distant relatives are not as good as close neighbours”.

This summer I went back to Beijing for my pilot study, in which I included a small survey of psychotypology between Chinese and eleven languages that are well-known to Chinese students. The eleven languages are as follows: Japanese, Tibetan, Korean, English, Arabic, French, German, Vietnamese, Mongolian, Spanish and Thai. Among the eleven languages, some are the “close neighbours” of Chinese, such as Japanese, Korean and Vietnamese, all of which use or have historically used Chinese characters in their writing systems. Tibetan is the real relative from the Sino-Tibetan family but less known to the public, and the rest are more like “strangers” that we meet on the street every day. My main aim was to measure the psychological distance between Chinese and English, so I asked my participants to number those languages from 1 (the closest) to 11 (the farthest) according to their distance from Chinese in their minds, and to add a few words to justify their answers if they liked.

After the first stage of tidying up the data, I found that the close neighbours beat the real relative nicely this time. My participants, including senior high school students, undergraduates and graduates, generally believe that Japanese is closest to Chinese (average 1.6 in the high school group, 1.87 in the university group, both ranking first), while the real relative Tibetan only ranks fourth in the high school group (average 4.67) and fifth in the university group (average 4.73). All the “close neighbours”, including Japanese, Korean and Vietnamese, are among the languages ranked closest to Chinese, although all of them are typologically distant from Chinese: Japanese is from the Japonic family, Korean is believed to be a language isolate, and Vietnamese is an Austroasiatic language. When asked about the reasons for the ranking, participants provided relatively uniform answers, such as “Japanese uses kana and kanji, which are borrowed from Chinese”, and “they are geographically close so Japanese/Korean/Vietnamese received a lot of influences from ancient Chinese historically”. In the university group, only one participant put Tibetan in first position, with the short comment “they (Tibetan and Chinese) are both from Sino-Tibetan family”. What a relief.
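For readers curious how averages like these are obtained, here is a minimal sketch in Python of the calculation: each participant's ranks are averaged per language, and the languages are then ordered by mean rank. The responses below are invented purely for illustration (they are not the pilot study's data), and the function name `mean_ranks` is my own.

```python
from statistics import mean

# Hypothetical responses, invented for illustration only (not the survey data).
# Each participant ranks every language from 1 (closest to Chinese) upwards.
responses = [
    {"Japanese": 1, "Korean": 2, "Tibetan": 3, "English": 4},
    {"Japanese": 2, "Korean": 1, "Tibetan": 4, "English": 3},
    {"Japanese": 1, "Korean": 3, "Tibetan": 2, "English": 4},
]

def mean_ranks(responses):
    """Return (language, mean rank) pairs, perceived-closest language first."""
    languages = responses[0].keys()
    averages = {lang: mean(r[lang] for r in responses) for lang in languages}
    return sorted(averages.items(), key=lambda item: item[1])

ordering = mean_ranks(responses)
for position, (language, average) in enumerate(ordering, start=1):
    print(f"{position}. {language}: {average:.2f}")
```

A lower mean rank means a smaller perceived distance, so the first language printed is the one the (hypothetical) group considers closest to Chinese.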

The fight between “distant relatives” and “close neighbours” is not the purpose of my pilot study, but it has captured my attention and leads to some interesting questions (that I am not going to answer either in my research or in this blog post): Will the process of second language acquisition gradually change one’s perception of language distance? If so, what factors will be crucial? I would like to be the example this time; right after acquiring the sentence structure of Japanese, I realised that Japanese is totally different from Chinese, and now for me the psychological distance between Japanese and Chinese is even greater than that between English and Chinese (my justification: Japanese is an SOV language with case markers, whilst both English and Chinese are SVO languages without case markers). After all, we only make use of part of the knowledge of a foreign language to decide whether it is close to our native language, and when we have little understanding of the language itself, we may rely on the geographical distance or historical factors, and ignore the typological connection between two languages. “Close neighbours” are sometimes more attractive than those “distant relatives” that we seldom hear about – but will such a belief be sustained, if we go deeper in the area of second language acquisition?


For more about psychotypology, please see:

Jordens, Peter, and Eric Kellerman. “Investigations into the ‘transfer strategy’ in second language learning.” Actes du 5e Congrès de l’AILA. 1981.

Kellerman, Eric. “Now you see it, now you don’t.” Language transfer in language learning 54.12 (1983): 112-134.

p.s. I was so shocked when I found that Dr Kellerman is now a photographer.

Humble beginnings: the innovation of language changes

It is a universal truth—although not one universally acknowledged—that languages always change. Speech and writing are never entirely stable: new forms are always expanding in popularity and old ones dying out, the meanings of words and expressions are always shifting or being replaced by new ones, and particular pronunciations are always rising and falling in popularity. One area of research—indeed, the main area of research for many historical linguists—is the process by which these changes start.

In historical linguistics we use the term ‘innovation’ to describe the invention of a new form by a speaker. It’s obvious that every change that has ever taken place (from the loss of the ‘wh’ sound /ʍ/ in English English to the replacement of the Middle English word bene by Middle French praere to mean ‘prayer’) as well as every change that is still ongoing (from ‘th-fronting’, the process by which the ‘th’ sound /θ/ is replaced by an ‘f’ sound /f/, to the use of ‘like’ to mark quoted speech) must once have been an innovation. There must have been a first time and a first speaker for each of these changes: a first speaker to pronounce ‘wh’ /ʍ/ the same as ‘w’ /w/, a first speaker to use the then-French word praere instead of bene when speaking English, a first speaker to say ‘f’ /f/ for ‘th’ /θ/ and a first speaker to use ‘like’ to mark a quotation. We might ask: who were these speakers that have had such a great effect on the language? And why did they produce these innovations?

Innovation is actually happening all the time. All speakers use language creatively, frequently coining new words and using them just once in a particular context (‘nonce formations’), borrowing words from other languages to try to get across specific meanings (‘nonce borrowings’), producing slight differences in pronunciation or using words and phrases in subtly different ways than usual. This is both a blessing and a curse for the study of innovation. On the one hand, it means that it’s relatively easy to get data to study, either by trawling recordings of spoken language data or trying to prompt people to produce innovations in lab settings. However, the vast majority of these innovations never spread beyond the first speaker who uses them, and it’s vanishingly unlikely that the moment of innovation of any particular change which goes on to spread more widely will happen to be recorded. As a result, it’s practically impossible to study the process of innovation for successful changes.

Nevertheless, we can get some ideas about how innovations happen by studying unsuccessful innovations caught in corpora or in the lab, and by theorising about the changes which we observe spreading successfully.

Let’s start with the easiest of our examples. It’s pretty clear that borrowings (like the replacement of Middle English bene by praere) start life with bilingual speakers. In many communities of bilingual speakers, ‘code-switching’ (the use of multiple languages in the same utterance) is common. This gives one possible way in which a Middle French word could first have been used in an otherwise Middle English utterance. Alternatively, bilingual speakers could have chosen to start using the Middle French word rather than the Middle English one because it had subtly different connotations which they wanted to make use of. Or it might be that the word was particularly associated with a ‘domain’ (context) in which Middle French rather than Middle English was the normal language to use; this does appear to be true in this case, for the domain of religion. Once praere had started turning up in the Middle English of bilingual speakers by one or any of these routes, monolingual Middle English speakers could simply hear it as a new word and start using it themselves. An interesting property of borrowing as a type of innovation is that clearly there’s no reason to assume that just one speaker did it first: as there were lots of bilingual speakers, many of them might have borrowed the form independently, making it all the more likely to spread to monolingual Middle English speakers.

A different possible source of innovations is speech errors. All speakers occasionally produce speech errors in all areas of language, and when a change looks like a form which could have been produced by a speech error, we must consider the possibility that it started life as just such an error. In the case of the loss of the ‘wh’ sound, it was replaced by a sound which was very similar: the only difference between /ʍ/ and /w/ is that the former is voiceless and the latter voiced. This sound change might have started life by /ʍ/ being accidentally voiced, perhaps in a context when it was surrounded by other voiced sounds. Like the case of borrowing, there’s clearly no reason to assume that innovations of this type happened just once to one speaker. Instead, they might have started life in multiple places and times and with multiple speakers independently.

The above case was a ‘production error’ (an error made by a speaker producing language). Other innovations might have started life as ‘perception errors’ (errors made by listeners). The case of th-fronting might have started life as a perception error. The ‘th’ /θ/ and ‘f’ /f/ sounds are not especially close in terms of how they are pronounced, and so it is relatively unlikely that one would be produced in error for the other. However, perceptually they are very close, so it is quite possible that a listener might have misheard a speaker and thought they were producing [f] for /θ/. Once this had happened, that listener might have gone on to actually produce [f]s for /θ/s, thinking that they were imitating the speaker they’d heard before.

Many researchers take a particular interest in innovations produced during the native acquisition of language by children. It’s clear that production and especially perception errors like those described above might be more likely to occur in child speech, and so many innovations of these types may have started during language acquisition. One special type of innovation can only happen during child language acquisition: ‘reanalysis’. Reanalysis takes place when the grammar of some piece of language is ambiguous and the learner comes to a different conclusion about what the grammar actually is than the speakers from whom they’re learning. Reanalysis has been subject to a huge amount of study in historical linguistics and is a tricky topic for which there isn’t really room here. If you’re interested, though, there’s lots of material online from which you can learn more.

All in all, we know of lots of different ways in which language changes can start. Nevertheless, determining conclusively what was the mechanism behind any specific innovation remains a difficult subject. The question of why certain innovations spread where the vast majority do not is an even thornier one, and perhaps a topic for a future blogpost…

Young olympians in town

Surely the giddiest daydream of every budding linguist is to have been born early enough to have had a crack at deciphering the Egyptian hieroglyphs at a time when their secrets still remained concealed – and then of course, off the back of this rush of excitement, to be recruited into the war-time effort at Bletchley Park, working furiously to glean information from undeciphered code. Well, although anachronistic time-travel sadly remains impossible, if you were born in the UK in 1994 or afterwards, under the auspices of the UK Linguistics Olympiad, you can do the next best thing.

The UK Linguistics Olympiad is now entering its eighth year, setting secondary and sixth form pupils the challenge of cracking codes, scripts and other language-related conundrums within the space of a tense two and a half hours every February. The highest scorers in the advanced paper take part in a second round held over a weekend at a UK university Linguistics department. The top four to eight participants form crack teams to compete in the International Linguistics Olympiad every July – Mysore, India in 2016. In the short amount of time that the UK has taken part in the International Linguistics Olympiad, the national effort has grown to take in a haul, in 2015, of two golds, one silver, a bronze, three honorable mentions and the top spot in the team competition.

Although participation at the round 1 advanced level currently stands at a healthy 1300 pupils, the main aim for the UK Linguistics Olympiad as a still very young endeavour is to grow uptake, particularly amongst schools in the state sector, where the largest number of potential competitors is to be found. To this end, the first training course for high-scoring linguistics olympiad participants from state-maintained schools was held in Cambridge in conjunction with Corpus Christi College from 1st to 4th September 2015.

The training course welcomed twenty participants to sessions exploring languages across the world (led by Elspeth Wilson), examining strategies for tackling language puzzles (led by Neil Sheldon), reconstructing olympiad puzzles (led by Jessica Brown), simulating the decipherment of alien and pictorial languages in real time (led by Paul Meara), and exploring puzzles in the field of Linguistics as an academic discipline (led by Billy Clark). Participants also engaged with cutting-edge research in Linguistics through Jim Baker, Rowena Bermingham, Jamie Douglas and Joe Perry, and were introduced to the history of the UK competition by the UK Linguistics Olympiad chair, Dick Hudson.

Cracking olympiad puzzles is no mean feat – if you fancy your chances, take a look at Patrick Littell’s 2012 problem on ´Phags-pa, rated the toughest in the UK Linguistics Olympiad archives, which requires deciphering a script encoding the Băijiāxìng, essentially, a Song Dynasty Who’s who – and the enthusiasm and tenacity with which the participants tackled the puzzles was extremely impressive.


The end of the course with smiles all around!


To find out more about the UK Linguistics Olympiad, visit www.uklo.org

Get involved as an academic by joining the 2016 marking panel: http://www.uklo.org/for-universities

As a teacher, you can enter pupils from your school here: http://www.uklo.org/registration-expressions-of-interest

Or to see what you’re made of as a one-time budding linguist: http://www.uklo.org/example-questions/past-tests

How much (of one’s own) language is needed for a sense of identity?

On the occasion of the Rugby World Cup, which will be held in England starting on September 18th this year, I noticed the very different approaches the participating countries take to choosing and performing their national anthems. Some countries, bi- or multilingual in nature, are of particular interest. This led me to ask, going by this measure, how much of a language of one’s own is needed to create an identity – and even whether there is such a thing as too many languages or, in other terms, too much fragmentation.
While the 20 countries participating at the Rugby World Cup constitute a bit of a random sample, many of them do share some common features, mainly a former colonial background and English as the main language or one of several, which makes for an interesting point of comparison and brings with it a certain innate tension regarding social and linguistic history and current situations.
While participant countries like Argentina, Uruguay or Italy are largely monolingual, beyond a certain dialect continuum, many of the countries involved show a complex system of sociolinguistic distributions, mainly the result of colonisation by England and the British Empire. France is an exception in that it has its own strong language policies in place based on the notion of La Grande Nation and “one nation, one language”, virtually ignoring Breton, Catalan and Basque, as well as Occitan and other languages or dialects closer to French – although the European Charter for Regional or Minority Languages (which France, tellingly, has not ratified!) attempts to address such issues.
Wales, Scotland and Ireland were all at some stage annexed by (basically) England to form Great Britain and the United Kingdom respectively, though Ireland has meanwhile regained its independence. The effects of English as a mighty competitor are nevertheless clear to see in all three cases. In Scotland, Gaelic is just gaining some more recognition as a topic on the national agenda where before it was largely the language of the more or less isolated population on the Northwestern periphery. Interestingly, as recent research shows, (new) urban learners don’t share that identity of the “Gael”, while they still see Gaelic as an important element of their identity, partly in areas where it was hardly ever even spoken. Given the small number of speakers and the lack of long-term political support, it is no wonder that Gaelic is not in any way represented in the Scottish national anthem Flower of Scotland, beyond a short reference to Scotland’s “hills and glens”, glen being a Gaelic borrowing into Scots and Scottish English. On the topic of Scots, it is in a way equally cursed and blessed by its linguistic proximity to English, being at the end of a North to South dialect continuum (strikingly similar to Gaelic with Irish!) with some additional elements partly based on contact with other languages like Gaelic, Dutch and Norse languages. Scots, in a boiled down version, contributes to the English of Scotland without in its pure form having a strong standing with the younger generation. The use of “wee” in the anthem, more symbolic than natural in a sense, is a good example of this.
Ireland have chosen to have a combined anthem for the Republic and Northern Ireland in Rugby, where both countries are represented by the same team. Meanwhile, the Republic uses the Irish language for its anthem (Amhrán na bhFiann) – translated from an English original (A Soldiers’ Song). The only time I remember hearing it rendered louder than the (actually musically much-maligned) English-language combined Rugby anthem (Ireland’s Call) was when England played Rugby – or any sport – at Croke Park for the first time in 2007. Croke Park was the scene of the killing of 14 Irish civilians by the Black and Tans at a Gaelic Football game on the original Bloody Sunday on November 21, 1920. On that occasion in 2007, the English (British) God Save the Queen was well-respected and then the Irish anthem was sung vigorously and resoundingly, while on other occasions the lack of true knowledge of the language in Ireland shows – as is also reflected in the fact that not a single version recorded by a native or native-like speaker appears to exist on YouTube or on the manifold CDs in Dublin tourist shops.
Wales is palpably more connected to its anthem Hen Wlad Fy Nhadau, which was originally written in Welsh by the father-and-son combination of poet Evan James and composer James James in 1856. About 20% of the Welsh population of 3 million still speak the language, many of them natively or at a native-like level. Even non-speakers often know enough Welsh to sing the anthem, for example. This shows in the stadium, where the rousing renditions of the Welsh anthem are a marvel to visiting teams and were once described by Rugby’s greatest commentator, the late Bill McLaren, as worth a 5-point lead for Wales. This is one aspect that indicates that the language is more central to Welsh identity than, for example, Irish or Gaelic to their respective countries.
In Australia, despite language revival efforts now increasingly supported by universities, the Aboriginal languages do not enter the public perception much, with the anthem also being solely English.
In contrast, New Zealand not only does a lot for the Maori language and not only is the first verse of the anthem (God Defend New Zealand) in Maori, but the national flag is also currently being redesigned to represent the entire population better than the current one. In addition to that, arguably the most iconic cultural element in any sport, the All Black Haka, often imitated and much respected, is based on traditional Maori war / welcome dances. And, unsurprisingly, going beyond the Maori language, which is a rare example of relatively successful language revival policies based on a strong identity, Maori culture and sociological status have been on the rise in an open-minded and inclusive society.


Another interesting fact is that Fiji, though it has a language of its own in the form of Fijian and though immigration from the British Empire was minimal, retained English as an official language after independence in 1970. Hence, and unlike island cousins Tonga and Samoa, it uses English in its anthem, too.
The biggest counterexample, however, is South Africa. Its anthem consists of two parts, namely the former anthems ‘Nkosi Sikelel’ iAfrika’ (God Bless Africa) of the Black Liberation movement and ‘Stem van Suid-Afrika’ (Call of South Africa), the former official anthem of the apartheid regime, and it represents the 5 biggest of South Africa’s 11 official languages: the African languages Xhosa (first two lines), Zulu (next two lines), Sesotho (second stanza) and the two languages of the European invaders, English and Dutch-based Afrikaans. The anthem issue was a huge point of contention in the 1990s and only resolved by Nelson Mandela himself. He saw the huge relevance of the choice of anthem for reconciliation due to the high emotional value of the anthem and its language to the formerly ruling population. The hybrid version of the song was a diplomatic and straightforward solution, representing both groups as one, also linguistically, in the new ‘rainbow nation’.


However, in practice, the performance of the anthem once again represents the reality of the situation rather well: while the African-language parts are (by now) often sung by large parts of the audience, there is still a clear rise in volume and passion from the predominantly white crowds when the Afrikaans part starts. This is often interpreted as a throw-back to apartheid practices and hence again a point of debate, despite the best intentions of Mandela in creating the hybrid version.
What does it tell us? Is there any correlation between anthems and sociolinguistic status of a language at all? Tentatively, I would argue there clearly is, as some of the examples show. However, it cannot be taken as paramount concerning its relevance to the role or status of a (minority) language.