Report on the 51st International Conference on Sino-Tibetan Languages and Linguistics

by Xiaolan Cao (University of Melbourne)

With the generous bursary from the Philological Society, I was able to present my research paper at the 51st International Conference on Sino-Tibetan Languages and Linguistics held at Kyoto University, Japan, 25–28 September 2018.

During the conference, three posters and seventy-two papers on the most recent research on Sino-Tibetan languages and linguistics were presented, covering various topics in the fields of phonetics, phonology, morphology, syntax, and diachrony. Professor Sun Jackson and Professor James Matisoff gave the plenary talks. All the papers presented at the conference are freely available here.

On the second day of the conference, I presented my paper on the phonology of Southern Pinghua and phonological dialectal variances. In this paper, I first presented the phonology of Southern Pinghua based on the Wucun dialect. I organized this section of my paper in the order of consonants, vowels, tones, and syllable structure. After going through the phonology and phonological features of the Wucun dialect, I presented my study on the phonological variances between 32 Southern Pinghua dialects. Based on variance analysis, I concluded that Southern Pinghua dialects are relatively diverse, which partly explains the low degree of mutual intelligibility between those dialects. Thus, it is neither prudent nor rigorous to use one dialect to represent the whole Southern Pinghua group without thorough comparative studies investigating dialectal variants.

After my presentation, I received valuable feedback on my paper and connected with researchers who share research interests in Sinitic languages. With all the feedback I received, I am currently preparing a journal paper based on my presentation with additions on the diachrony of Southern Pinghua phonology, which I hope to submit to the Transactions of the Philological Society.

Finally, I would like to take this opportunity to thank the Philological Society again for the generous bursary. Without this support, I would not have been able to make my trip to the conference to share my research findings and exchange ideas with researchers from all over the world on Sino-Tibetan languages and linguistics.

Creoles in Costa Rica – the 22nd Biennial SCL Conference

by Marina Merryweather (Queen Mary University of London)

With the generous funding of the Philological Society’s Martin Burr fund, I was able to attend the 22nd Biennial Conference of the Society for Caribbean Linguistics, hosted jointly with the Society for Pidgin and Creole Linguistics.

The Society for Caribbean Linguistics, as the name suggests, focuses on the many languages studied around the Caribbean, be it the post-colonial languages of the different islands and coastal regions, or the indigenous languages still prevalent in the parts of Latin America that border the sea. Meanwhile, the Society for Pidgin and Creole Linguistics generally covers all areas of study related to pidgins and creoles, regardless of where they are found. There is, of course, a lot of overlap between the aims of the two societies, given the prevalence of creole languages all around the Caribbean basin; nonetheless, both societies have their own areas that they cover, and there is a lot that differs between them. This was evident in the talks on offer at the conference.

The theme of this conference was “Connecting the Caribbean: Languages, Borders and Identities”. This led to a lot of focus in the talks on concepts such as language policy, language endangerment, and minority languages more generally. For example, there were panels devoted to the discussion of Limonese creole, an English-lexifier creole spoken on the Caribbean coast of Costa Rica, which is closely related to Jamaican creole – and was, indeed, imported by Jamaican migrant workers in the early 20th century. The panels included the launch of a book containing a standardised alphabet for the language, as well as a round-table with local speakers on their attitudes towards it, and how they intended to preserve it as they became more assimilated into the predominantly Hispanophone Costa Rican society. There were also a considerable number of talks on indigenous languages in Central America, and especially Costa Rica. This included a plenary from Prof. Juan Diego Quesada (National University of Costa Rica and University of Costa Rica) on the typology of indigenous languages, and the way that certain typological features could be attributed to three general groups in the northern, central, and southern regions of Central America.

In addition to the panels on Limon creole mentioned above, there were some other creole and creole-adjacent languages that caught my interest at the conference. There was, for example, an entire panel devoted to languages in St Lucia, and particularly the development of a vernacular English, as opposed to both the standard English and the French creole that they speak on the island. Debate is still taking place as to whether this English is a creole, or whether something else is happening, such as relexification of the French creole, or the development of a mixed language. There was also a plenary chaired by Joyce Pereira on the huge advances in Papiamento language policy in Aruba; in a short space of time, it went from being a language entirely shunned by the Dutch government, to the centre of a pilot scheme to become the main language of education in the country.

The research that I was presenting at the conference was also on a creole language, specifically on the variety of Antillean French Creole spoken in Martinique. Literature on this creole, particularly written in English, is limited, but what little has been written suggests that a process known as decreolisation – whereby creole languages evolve to more closely resemble their lexifier languages – is currently underway (Lefebvre, 1974; Bernabé & Confiant, 2002; Bernabé, 2006). My MA thesis investigated whether this was still a process that was taking place. Basing my research on a study done by Vaillant (2009), I conducted a form of matched-guise test to see if certain grammatical features were considered acceptable by participants. These features were not usually considered obligatory in Martinican Creole, but are in French; if the sample sentences without these French features were considered incorrect, this would point to the French standard becoming the norm. These features included the use of a relative pronoun as well as the use of a reflexive construction with verbs typically considered reflexive in French. The results of this study, however, pointed neither one way nor another; one of the features was considered grammatical and the other was not. This pointed to a number of different possibilities, from an argument against decreolisation, to a different theory being needed to explain the changes at hand – such as the concept of interlecte (Prudent 1981) – to there being structural effects at play determining which positions the morphosyntactic features are needed in. The study was small and took place in a very restricted setting, but by bringing it to the conference, I hoped to develop a paper which could inspire future creolists to look further into the language.

This was just a small flavour of some of the enormous variety of talks I listened to over the five days of the conference, which was the first I have ever attended. Between the people, the cultural events, and the luscious surroundings of Costa Rica, I had an immensely enjoyable week, and learnt a lot. I am grateful to the Philological Society for giving me the opportunity to attend, and hope that I will be able to also make their next conference in Trinidad in 2020!

The latest from Austronesian historical linguistics

by Laura Arnold (University of Edinburgh)

The 14th International Conference on Austronesian Linguistics was held on 17–20 July 2018, at the campus of the Université d’Antananarivo in the capital of Madagascar, the westernmost outpost of the Austronesian world. With four keynote speakers and 176 participants, the conference brought together Austronesian researchers from all over the world to share their latest research on this huge and diverse language family. The four days of talks were followed by an excursion to the UNESCO world heritage site of the Royal Hill of Ambohimanga, situated on a soaring hill above stunning landscapes and rice paddies, 24 km to the northeast of the city. Photographs of the conference by David Gil can be found here.

As ever, there were many talks that dealt with historical, comparative, and philological issues in Austronesian linguistics. The question of the origin and movement of the pre-Austronesians and the subsequent expansion of Austronesian languages throughout insular Southeast Asia was the subject of lively debate throughout the conference. In his keynote speech, Waruno Mahdi – a proponent of the proto-Austric hypothesis, which links Austro-Tai to Austroasiatic – used genetic, archaeological, and linguistic data to argue that speakers of proto-Austronesian comprised two distinct population groups. One was a subtropical group (the ‘Deutero-Malays’), descended from the rice-cultivating Austro-Tai group; and the other was an equatorial group (the ‘Proto-Malays’), who migrated from the south towards the Proto-Austronesian homeland of Taiwan when the Sunda shelf was flooded, around 7000–4000 years BP. Laurent Sagart, on the other hand, who proposes that Austronesian is a sister of Sino-Tibetan, later argued that the pre-Austronesians originated from the Yellow River valley in north China, approximately 9000–7500 years BP. This conclusion is based on agricultural archaeological evidence regarding the spread of millet domestication; the spread of the ritual ablation of upper lateral incisors; and mtDNA and Y chromosome data showing a link between Sino-Tibetan- and Austronesian-speaking populations. Regarding the dispersal of the Malayo-Polynesians, Marian Klamer emphasised that the traditional farming dispersal model of Austronesian expansion throughout Island Southeast Asia is too simplistic, and cannot account for the linguistic and archaeological diversity found throughout the area – especially for the so-called Western Malayo-Polynesian and Central Malayo-Polynesian languages, which comprise over 600 languages across the majority of Island Southeast Asia.
She reminded us that the Malayo-Polynesian expansion most likely did not occur in one fell swoop across the archipelago, but that there may have been hundreds or possibly thousands of migrations across the area; and that we need detailed, bottom-up micro-comparisons in order to work out the history of the linguistic dispersal of Malayo-Polynesian languages. This sentiment in particular appeared to strike a chord with the conference participants, and was something I heard echoed many times over coffee, lunch, and the cocktail party that closed the conference.

Another topic of interest was the linguistic inferences that can be made about the history of Malagasy from 17th-century sources. One of the keynote speakers, Narivelo Rajaonarimanana, outlined his work on the Sorabe manuscripts and texts held in the National Library of Paris, which he has been transcribing and translating. He discussed the use of Qur’anic verses in these manuscripts in healing prayers, and as talismans for protection. He also sketched out some aspects of the grammar of the volañ’onjatsy dialect, spoken by a group living around the Matataña River, which is represented in these texts. Earlier in the conference, Alexander Adelaar (see also here) presented several speculations regarding the phonology of early Malagasy, using evidence from 17th-century Sorabe texts, and a 1603 textbook and wordlist of Malagasy compiled by Frederik de Houtman. First, he concluded that *y and *w in proto-Southeast Barito (the Bornean ancestor of Malagasy) were still vocoids at the point when Sorabe, a derivative of the Arabic script, was first adapted to transcribe Malagasy. Second, he established that the contraction of like vowels in originally disyllabic roots (e.g. *fu(h)u ‘heart’ > fu, *raa ‘blood’ > ra) had not yet taken place. Third, he discussed problems with the traditional identification of the Sorabe texts with the Taimoro dialect, providing linguistic evidence to show that the oldest Sorabe texts have features in common with the Tanosy dialect. Sorabe was originally practiced in a wider area, and its identification with the Taimoro dialect and region alone is too narrow and only reflects the current state of affairs. Finally, the orthography used in the wordlist, as well as comparison with cognate forms in other languages, suggests Malagasy still had a palatal nasal ñ at the time Houtman was compiling this wordlist.

My travel to the conference was funded in part by the Philological Society. In my presentation, I looked at a split in the tone system of a dialect of Ambel, a South Halmahera-West New Guinea language spoken in West Papua, Indonesia. This split was conditioned by vowel height, such that toneless syllables with non-high vowel nuclei *e, *a, and *o developed High tone, whereas toneless syllables with high vowel nuclei *i or *u remained toneless. There are two interesting points about this split. First, tone splits conditioned by vowel quality are very rare. Second, in all other cases of tone splits or tonogenesis conditioned by vowel quality that have been described in the literature so far, high vowels are associated with High tone. The conditioning of High tone by non-high vowels, as we find in Ambel, has not previously been attested. I went on to present a possible phonetic motivation for the split. This motivation makes reference to the complementary phenomena of intrinsic F0 and intrinsic pitch. All things being equal, higher vowels (e.g. /i/, /u/) are generally produced with a higher F0 than lower vowels (e.g. /a/). However, intrinsic pitch compensates for this, in that, when the F0 is identical, hearers perceive lower vowels as being higher in pitch than higher vowels. One important exception to intrinsic F0 is at the lower end of a speaker’s pitch range (e.g. in a tonal language, Low-toned syllables), where differences in F0 are reduced or completely neutralised. Toneless vowels in Ambel are realised with low pitch. I therefore suggested that, when proto-Ambel first developed tone, and toneless syllables came to be realised with low pitch, the intrinsic F0 of these toneless vowels was neutralised; however, the intrinsic pitch that formerly compensated for intrinsic F0 differences was maintained. This meant that speakers of Ambel came to perceive the toneless non-high vowels (*e, *a, and *o) as higher in pitch than the toneless high vowels (*i and *u). 
Eventually, this perceptual difference resulted in the merger of toneless syllables with non-high vowels with other High-toned syllables. Slides from this presentation can be found here.
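
As a rough illustration of the conditioning described above, the split can be written as a single rule over toneless syllables. This is only a sketch: the data layout (a list of vowel and tone pairs), the function name, and the use of None for "toneless" are assumptions made for the example, not part of the analysis presented at the conference.

```python
# Illustrative sketch only. The representation of syllables as
# (vowel, tone) pairs and the names below are assumptions for this example.
NON_HIGH_VOWELS = {"e", "a", "o"}  # *e, *a, *o: conditioned High tone
HIGH_VOWELS = {"i", "u"}           # *i, *u: remained toneless


def apply_tone_split(syllables):
    """Apply the vowel-height-conditioned split to a list of syllables.

    Each syllable is a (vowel, tone) pair, where tone is "H" for High
    or None for toneless. Toneless syllables with a non-high vowel
    become High; all other syllables are returned unchanged.
    """
    result = []
    for vowel, tone in syllables:
        if tone is None and vowel in NON_HIGH_VOWELS:
            result.append((vowel, "H"))
        else:
            result.append((vowel, tone))
    return result


if __name__ == "__main__":
    before = [("a", None), ("i", None), ("u", "H"), ("o", None)]
    print(apply_tone_split(before))
```

The rule only ever adds High tone to previously toneless syllables, which mirrors the merger with existing High-toned syllables described above.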

Other talks that may be of interest to members and followers of the Philological Society are as follows (in order of presentation):

  • Owen Edwards explored the possible phonetic quality of proto-Austronesian *j. Three pieces of evidence lead him to the conclusion that the best reconstruction may be the affricate *dz. First, *dz is preserved as /dz/ in three primary branches of Austronesian, including Malayo-Polynesian; second, most of the reflexes in the present-day languages can be accounted for by making reference to natural and well-attested sound changes; and third, reconstructing *dz leads to a balanced and typologically-expected phonological inventory in proto-Austronesian.
  • Francesca Moro presented empirical data demonstrating that the morphological simplification of Alorese that has occurred since the most recent common ancestor with Lamaholot can be explained by the large number of L2 speakers of the language, which has historically been used as the lingua franca of the area.
  • Albert Davletshin looked at the diachrony of case marking in Nukeria, a Polynesian outlier – specifically, an agentive marker a, which is preposed only to singular personal and demonstrative pronouns, the question word ai ‘who?’, and personal names. He showed that the development and distribution of a can be explained by an interaction between semantics and phonology. On the semantic level, he discussed the phenomenon of differential agent marking, found elsewhere in Polynesian languages, in which highly-individuated NPs (such as pronouns, personal names, and definite NPs) are marked, whereas lower-individuated NPs are not. The distribution of a can be further explained by making reference to a phonological constraint in Nukeria which prevents the bimoraic singular pronouns and any bimoraic personal names from being realised without additional marking.
  • In a paper by Ritsuko Kikusawa (see also here), John Lowry, Paul Geraghty, Apolonia Tamata, Fumita Sano, Susuma Okamoto, and Hirofumi Teramura, results from a pilot project in Fiji combining linguistic and GIS data were discussed. In this project, the data are used to map different ‘communalects’, depending on how similar forms for a particular meaning are to Standard Fijian. This methodology can also be used to calculate the similarity of forms to a reconstructed ancestor form, and has the potential to be used in testing hypotheses with regard to historical population movements, for example where the ports of entry for a particular island may have been.
  • A paper by Juliette Huber and Antoinette Schapper looked at Austronesian borrowings into the non-Austronesian Eastern Timor languages. On the basis of sound changes in both the Austronesian and non-Austronesian languages, several layers of borrowing can be identified, indicating a complex and long-term history of contact. In addition, Austronesian borrowings from unidentified sources in the Eastern Timor languages suggest that there has been contact with a now-extinct Austronesian substrate in East Timor; and shared vocabulary throughout the languages of the area points to contact between the proto-languages of the Austronesian and non-Austronesian languages spoken today, although the source of these words is difficult to determine.
  • Kirsten Culhane and Owen Edwards presented data from the Meto dialect cluster, in which there are very diverse patterns of intervocalic consonant insertions. A diachronic perspective is necessary to understand this diversity – most of the consonants used in insertion can be easily explained by making reference to well-attested sound changes in each of the dialects. However, a structural analysis is insufficient to account for the synchronic state. Instead, a social perspective which makes reference to the distinct identity of each of the dialect communities is necessary to explain the observed differences.
  • Corinna Handschuh provided an overview of common and proprial articles in Austronesian. Various languages throughout the family have a system in which different articles are used to mark common and proper nouns: most notably in Oceanic, but also elsewhere, such as in Tagalog. The distinction has also been reconstructed to proto-Austronesian. This system is highly unusual, in that it has not so far been attested in any other language family. She thus focussed on the stability of such a typologically unusual system over such a great time depth, flagging up the similarities with nominal classification systems such as gender, which are typically stable over time.
  • Emily Gasser discussed a ‘crazy rule’ of /β/, /r/, and /k/ mutation, which is attested in the majority of the languages of the South Halmahera-West New Guinea (SHWNG) subbranch. While her presentation focussed on the synchrony of this mutation, in the question and answer session she proposed that it may be helpful in the subgrouping of SHWNG – specifically, that the mutation provides evidence for grouping the SHWNG languages spoken around Cenderawasih Bay into a single primary branch.
  • Tobias Weber discussed the typological profile of the languages of Sumatra and the Barrier Islands, investigating mostly structures mentioned in the WALS. He assumed that certain features of these languages—the larger-than-average vowel inventories, the denasalisation of consonants in Enggano and Mentawai, numeral classifiers, and clausal head-marking (indexing of arguments on the predicate)—may be explained by influence from a now-extinct pre-Austronesian substrate.
  • Peter Slomanson looked at the development of negation in the contact languages Sri Lankan Malay and Sri Lankan Portuguese. He showed that these two varieties are in some ways structurally closer to each other than they are to their co-territorial model languages, Tamil and Sinhala, yet the contact languages still differ from each other in their respective negation systems. The parallels that there are, for example in the ordering of functional markers, suggest that contact between what would become Sri Lankan Malay and Sri Lankan Portuguese may have begun in Java, before continuing in Sri Lanka.
  • Penelope Howe presented preliminary data from matched guise tests, showing that an emergent lexical tone contrast in the Central dialects of Malagasy additionally indexes social meaning. Her results suggest that the use of tone in these dialects is associated with more positive attributes (e.g. friendliness, honesty). However, when tone is absent, the speakers of these dialects are associated with more negative attributes (e.g. reticence, indifference).

For further information about any of these presentations, readers are encouraged to contact the relevant author(s).

Semantically driven grammaticalisation: the systematic pathways of Estonian polar question particles

by Mari Aigro (University of Tartu)

Seeing grammaticalisation as being analogically driven takes the explanatory power, which is frequently assigned to syntactic position, and assigns it to the semantic analogy between the source and the target. This case study focuses on the semantic cohesion patterns in the pathways of contemporary as well as historical Estonian polar question particles (PQPs). It will show that not only is the semantic component of function words much more relevant to grammaticalisation than is commonly thought, but also that the grammaticalisation network surrounding a functional category can in fact be semantically so uniform that one can devise a model based on a semantic map and assign it a certain degree of explanatory power regarding why certain markers become PQPs and others are much less likely to do so.

While the most frequently mentioned PQP sources are negation and disjunction markers (Heine & Kuteva 2002), a comprehensive literature review reveals altogether six source categories. In addition to disjunction and negation markers, this list also includes clause conjunction markers, embedded PQPs, conditional markers and pronominal interrogatives (König & Siemund 2007, Nordström 2010, Metslang et al. 2017). These sources appear to form a systematic set – all of the above could be classified as markers of polarity or truth values (see Payne 1985 for coordinators, Nordström 2010 for conditionals). To investigate whether or not this principle would hold for additional data and other newly discovered source categories, an in-depth corpus study was carried out on Estonian, a language especially rich in both neutral and biased PQPs.

Nearly 2400 polar questions using the particle strategy (inversion and zero-marking strategies are used alongside) were manually encoded in the Corpus of Old Written Estonian (17th–19th century) and the Corpus of Standard Estonian (20th century). I found six different PQPs—four biased and two neutral—used between the 17th and 21st centuries. Three of them—kas, või, ega—are still in use in Standard Modern Estonian. The source of kas is either a clause conjunction (“also”) or an embedded PQP; või most likely originates from a disjunction (“or”); and ega from a clause rejection marker (“nor”). The three historical polar question markers are eks, eps and jo/ju; while the first two originate from negation, the source of the latter is an affirmative focus marker. Only three have given rise to new functional structures: eks became an affirmative polar tag question marker; kas gave rise to the disjunction marker “either”; and jo/ju, after its brief time as a PQP, became a marker of evidentiality when occurring sentence-initially (retaining the older focus reading in other positions).

Hence, the new source categories introduced by the corpus study were polarity-sensitive focus markers (for ju) and rejection markers (for ega), both of which confirm the hypothesis that polar question particles originate from non-interrogative markers, which already involve the semantic component of negation, affirmation or neutral (open) polarity. Table 1 depicts the pathways of Estonian PQPs on a semantic map, which links the two dimensions of polarity – interrogation and bias.

Table 1: Semantic map of Estonian PQPs

Markers in the neutral category are especially relevant. They leave the truth value unknown, assigning open polarity even without interrogation, and due to this share a close link with PQPs. PQPs are more frequently homophonous with disjunction markers than other particles and both of the non-biased Estonian PQPs, kas and või, originate from the neutral category. Additionally, all functional markers originating from PQPs belong in this category. However, although the fact that the map accommodates all known sources of PQPs implies causality, it can only constitute a probabilistic rather than a deterministic model.


The moment of truth: Testing the Matrix Language Frame model in English–Vietnamese bilingual speech

by Li Nguyen (University of Cambridge)

Over the last few decades, there has been burgeoning interest in the study of code-switching in the research of bilingualism. Despite various definitions of what the phenomenon might entail, it is generally agreed in the literature that code-switching broadly refers to bilinguals’ ability to effortlessly alternate between two different languages in their daily speech (Bullock and Toribio 2008:1). This ability enables speakers’ behaviour of language mixing, which, as researchers have come to realise, is far from random but rather governed by specific structural constraints (Poplack 1980; Bullock & Toribio 2009). The nature of such constraints has inspired the search for a ‘universal pattern’, resulting in new investigations involving a number of language pairs, such as English–Spanish (Poplack 1980; Travis & Torres Cacoullos 2013; Aaron 2015), English–Welsh (Stammers & Deuchar 2012), Ukrainian–English (Budzhak-Jones & Poplack 1997), Igbo–English (Eze 1997), or Acadian French–English (Turpin 1998).

One of the most influential theoretical accounts in code-switching literature is Myers-Scotton’s (2002) Matrix Language Frame model (MLF), which assumes an asymmetrical relationship between the two languages in bilingual discourse. According to the MLF, ‘speakers and hearers generally agree on which language the mixed sentence is “coming from”’ (Joshi 1985:190–191), and it is this language that constitutes the ‘matrix language’ (ML) of the conversation. In a code-switched clause, the MLF predicts that the ML (i) supplies closed-class system morphemes such as finite verbs or function words, and (ii) determines word order. Although the need and the practicality of identifying an ML in some language pairs are debatable (Sankoff & Poplack 1981; Clyne 1987), the asymmetrical relationship between the two languages involved is borne out in many existing datasets. Most often, the asymmetry is more obvious in pairs that are structurally different, with existing evidence heavily involving an Indo-European language and an Asian or African language (see Chan 2009:184 for an exhaustive list). The question is then: does the MLF actually generate accurate predictions in spontaneous speech?

In this project, I am testing the applicability of the MLF in English–Vietnamese code-switching data. This pair provides an interesting testing platform, since the two languages share a similar surface word order (SVO) despite other typological differences. In other words, at a clausal level, the word-order morpheme principle is not applicable to determining the Matrix Language. The focus of the study thus lies on the so-called ‘conflict sites’, points at which the word order of the participating languages differs. These conflicts involve the sequence head-modifier within NPs and Possessive Phrases. Specifically, modifiers and possessors precede head nouns in English, but follow head nouns in Vietnamese. When bilingual speakers are presented with such a conflict, the MLF predicts that the matrix language (i.e. the language of the finite verbs or function words) should determine the word order. Furthermore, as an isolating language, Vietnamese has virtually no overt morphology. This adds an extra layer to the complexity of determining the Matrix Language at the clausal level, which is traditionally assigned by the language of the finite verb, thereby testing the MLF predictions when these two languages come into contact.
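
The prediction at a conflict site can be sketched as a simple lookup. The sketch below is hypothetical: the function names, the order labels, and the assumption that the matrix language has already been identified for the clause are all illustrative choices, and it is not the coding procedure used in the study.

```python
# Hypothetical sketch: all names and labels here are illustrative assumptions.
# It encodes only the word-order facts stated above: modifiers and possessors
# precede the head noun in English and follow it in Vietnamese.
HEAD_MODIFIER_ORDER = {
    "English": "modifier-head",
    "Vietnamese": "head-modifier",
}


def predicted_np_order(matrix_language):
    """Return the NP-internal order the MLF predicts for a mixed clause,
    given the matrix language already assigned to that clause."""
    return HEAD_MODIFIER_ORDER[matrix_language]


def consistent_with_mlf(matrix_language, observed_order):
    """Check whether an observed order at a conflict site matches the
    order predicted from the clause's matrix language."""
    return observed_order == predicted_np_order(matrix_language)


if __name__ == "__main__":
    # A clause with Vietnamese as matrix language and a mixed NP in
    # head-modifier order would count as consistent with the prediction.
    print(consistent_with_mlf("Vietnamese", "head-modifier"))
```

The sketch deliberately leaves out the harder step, which is deciding the matrix language in the first place when one of the languages has no overt morphology.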

Thanks to fieldwork funding support from the Philological Society, I was able to carry out my fieldwork in Canberra, Australia, where I had existing connections with the Vietnamese bilingual community. Data collection took place between June and September 2017. My principle in building the corpus was drawn from Labov’s emphasis on the vernacular, where ‘minimum attention is paid to speech’ (Labov 1984:29). This approach was chosen because the vernacular reflects the most natural, systematic form of the language acquired by the speaker ‘before any subsequent efforts at (hyper-) correction or style shifting are made’ (Poplack 1993:252). Recruited speakers were thus free to choose their own interlocutors, in an environment that they were most comfortable with. They were asked to self-record a conversation on their personal mobile phone device, of a minimum of 30 minutes. After the recording was returned, speakers were asked to fill in a questionnaire to obtain information on extra-linguistic variables. The questionnaire consists of 18 questions, available both in English and Vietnamese.

The data collection process was successfully completed, resulting in a corpus of 10 hours of spontaneous speech. Results from this research should offer concrete, empirical evidence for or against the applicability of the MLF in language contact situations in which the participating languages are typologically disparate. If found non-applicable, it is hoped that the patterns found will form the foundation of a new theoretical framework accounting for the data in question. Methodologically, the study demonstrates a systematic approach to determining the ML, especially in problematic situations where the overarching word order of the participating languages converges, and one of the languages lacks overt morphology. When made publicly available, the data will also constitute the first digitalised English–Vietnamese bilingual corpus, providing a valuable resource for future research on this language pair in particular, and for bilingualism research as a whole.


Trilingual families in bilingual capital cities

by Kaisa Pankakoski (Cardiff University)

Open borders, superdiversity and globalisation have enabled the formation of a large number of families where children are potentially multilingual and may have more than one native language. The parents of multilingual children have different strategies, methods and principles in place to promote intergenerational language transmission or the passing of a non-native language to their offspring.

What principles and other factors influence bringing up a trilingual child? How do the potentially multilingual children feel about their complex language repertoires? Is there a link between a certain method and the children’s attitudes towards their languages?

In my thesis I investigate trilingual families; the factors influencing language transmission; and the perspectives of the multilingual children in my two home cities: Helsinki and Cardiff. The reason why these two capital cities are compared is that they have very different approaches to bilingual education and heritage language promotion while having several similarities, from a visible minority language population to substantial support from the governments for the minority languages. The two countries are also officially bilingual, which offers a different foundation for trilingual language transmission than, for instance, monolingual countries.

Previous research
There are various aspects influencing the transmission of minority languages in the home. These consist of linguistic environment factors such as families’ language strategies and methods of transmission; sociocultural factors including parental and societal attitudes, the roles of the languages or parental and societal support; and finally familial factors that may involve siblings, extended family and possible family mobility.

The most recent research strand of multilingualism, Family Language Policy (FLP), looks at the importance of parental strategies, which are fluid and may change over time. Like multilingualism research in general, most FLP and language transmission research is based on bilingual rather than multilingual contexts.

Previous work has not looked at trilingual children’s perceptions or the link between perceptions and language strategies. Furthermore, most multilingualism studies fall into the category of linguistics and language acquisition rather than sociolinguistics. There is no transmission research in contexts with a community majority and minority language.

Funding from PhilSoc to carry out fieldwork in Helsinki
From April 2017 until August 2017 I was based in Finland at the University of Helsinki, Department of Modern Languages. This enabled me to interview seven multilingual case study families living in the Helsinki Metropolitan Area. The families were settled in the country and each had at least one trilingual primary-school-aged child speaking two official languages of the country (Swedish and Finnish) and one or more additional language(s).

The methodological approach draws on a qualitative, mixed-methods approach to data collection and analysis. First, the parents filled in an online questionnaire to clarify the family’s language pattern. Then semi-structured interviews and observations within the family homes explored issues that affect language acquisition within families. Both parents and children aged five to twelve were interviewed.

I spent three to six hours with each family in their homes. The data collected include fourteen completed questionnaires, fifteen hours of audio-recorded interviews, seven hours of recorded audio and/or video observation, as well as photographs and notes of each family participating in the research.

This winter, any extended family members will be sent an online questionnaire, which will hopefully reveal their perspectives. After completing the fieldwork in Helsinki, I will carry out the interviews and observations in Cardiff.

More information about the research
There is a news item on the Cardiff University website as well as a Welsh-language BBC article about my research and fieldwork in Helsinki. For more information about my research questions and methods, see my Cardiff University page.

Natural Language Processing meets social media corpora

by Yin Yin Lu (University of Oxford)

From 17 to 19 May I attended the CLARIN workshop on the ‘Creation and Use of Social Media Resources’ in Kaunas, Lithuania. The thirty participants represented a broad range of backgrounds: computer science, corpus linguistics, political science, sociology, communication and media studies, sociolinguistics, psychology, and journalism. Our goal was to share best practices in the large-scale collection and analysis of social media data, particularly from a natural language processing (NLP) perspective.

As Michael Beißwenger noted during the first workshop session, there is a ‘social media gap’ in the corpus linguistics landscape. This is because social media corpora are the ‘naughty stepchild’ of text and speech corpora. Traditional natural language processing tools (for, e.g., news articles, political documents, speeches, essays, books) are not always appropriate for social media texts, given the unique communicative characteristics of such texts. Part-of-speech tagging, tokenisation, dependency parsing, sentiment analysis, irony detection, and topic modelling are notoriously difficult. In addition, the personal nature of much social media content creates legal and ethical challenges for the mining and dissemination of social media corpora: Twitter, for example, forbids researchers from publishing collections of tweets; only their IDs can be shared.
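To make the tokenisation problem concrete, here is a minimal sketch in Python. It is only an illustration: the example tweet, the URL, and the `TWEET_TOKEN` pattern are all hypothetical, and the pattern covers just a handful of cases rather than reproducing any tool discussed at the workshop.

```python
import re

# A made-up tweet containing a mention, a contraction, a hashtag,
# an emoticon and a URL.
tweet = "@bbcnews can't wait for #EUref results :-) https://example.org/poll"

# Naive approach: keep only runs of letters, digits and underscores.
# This strips '@' and '#', splits the contraction, drops the emoticon
# and breaks the URL into pieces.
naive_tokens = re.findall(r"\w+", tweet)

# A tweet-aware pattern (hypothetical and far from complete).
TWEET_TOKEN = re.compile(
    r"https?://\S+"      # URLs kept whole
    r"|[@#]\w+"          # mentions and hashtags
    r"|[:;]-?[()DP]"     # a few common emoticons
    r"|\w+(?:'\w+)*"     # words, keeping internal apostrophes
)
tweet_tokens = TWEET_TOKEN.findall(tweet)

print(naive_tokens)
print(tweet_tokens)
```

The naive tokeniser loses exactly the features (mentions, hashtags, emoticons) that carry much of the communicative meaning of a tweet, which is one reason tools built for edited text tend to struggle here.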

I made invaluable connections with researchers at the intersection of NLP and social media data – and Twitter data in particular, which is the area of my own research. Dirk Hovy, an associate professor at the University of Copenhagen, spoke broadly about the challenges of NLP: engineers assume that all language is independently and identically distributed. This is clearly not true, as language varies with demographic differences. How can we add extra-linguistic information to NLP models? His proposed solution is word embedding: transforming words into vectors, trained on large amounts of data from different demographic groups. These vectors should capture the linguistic peculiarities of the groups.

A variant of word embedding is document embedding – and tweets can be treated as documents. Thus, it should be possible to transform tweets into vectors to capture the demographic-driven linguistic differences that they contain. I will be applying this approach to my own corpus of 12 million tweets related to the EU referendum.
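As a rough illustration of what turning a tweet into a vector can mean, the sketch below averages word vectors to obtain one vector per tweet. This is only one simple way of building a document embedding and is not necessarily the method presented at the workshop. The three-dimensional vectors in `WORD_VECTORS` are invented for the example; real embeddings would be learned from large amounts of data, and capturing demographic differences would additionally require training on data from the relevant groups.

```python
from math import sqrt

# Toy word vectors; the values are invented for illustration only.
WORD_VECTORS = {
    "vote":    [0.9, 0.1, 0.0],
    "leave":   [0.8, 0.2, 0.1],
    "remain":  [0.7, 0.1, 0.3],
    "tea":     [0.0, 0.9, 0.2],
    "biscuit": [0.1, 0.8, 0.1],
}

def embed_document(text):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None  # no known words, so no embedding
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

tweet_a = embed_document("Vote leave")
tweet_b = embed_document("Vote remain")
tweet_c = embed_document("Tea and biscuit")

# Tweets on the same topic end up closer together than unrelated ones.
print(cosine(tweet_a, tweet_b) > cosine(tweet_a, tweet_c))
```

Once tweets are represented as vectors, they can be compared, clustered or fed into a classifier like any other numerical data.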

Andrea Cimino, a postdoc from the Italian NLP Lab, spoke about his work on adapting existing NLP tools, which are trained on traditional text, for social media text. The NLP Lab has developed the best POS tagger for social media based upon deep neural networks (long short-term memory), which are able to capture long-range relationships between words in a sentence. The tagger has achieved 93.2% accuracy, but it currently works only on Italian texts. Similar taggers can be developed for English texts, given the appropriate training data.

Rebekah Tromble, an assistant professor at Leiden University, presented on the limitations and biases of data collected from Twitter’s Application Programming Interface (API). There are two public APIs that can be used: the historic Search API and the real-time Streaming API. The former returns up to 18,000 tweets from the previous seven to ten days, whichever limit is reached first. The Streaming API allows for up to 1% of all tweets to be collected in real time; as there are 500 million tweets a day, this is approximately 5 million tweets a day.
