A spoken corpus of Cameroon Pidgin English: Compilation, applications and next steps

by Melanie Green (Sussex) & Gabriel Ozón (Sheffield)

Cameroon Pidgin English (CPE) is an expanded pidgin/creole spoken in some form by an estimated 50% of Cameroon’s 22,000,000 population (Simons & Fennig 2017). CPE is spoken primarily in the Anglophone west regions, but also in urban centres throughout Cameroon. As a predominantly spoken language, CPE has no standardised orthography, but enjoys a vigorous oral tradition, not least through its presence in the broadcast media. The language has stigmatised status in the face of French and English, prestige languages of Cameroon, where it also co-exists with an estimated 280 indigenous languages (Simons & Fennig 2017).

We describe the spoken corpus of CPE, a British Academy/Leverhulme-funded pilot study (Green et al. 2016, Ozón et al. 2017). The corpus consists of 20 hours of recordings made in five locations, resulting in a total of 240,000 words (80 texts of 15 minutes/3,000 words). Proportions of text types are guided by the International Corpus of English project (Nelson 1996), and the texts contain mark-up and part-of-speech tagging. The corpus files, which are freely available from the Oxford Text Archive, include sound files (*.mp3 and *.wav), raw and annotated text files, participant metadata, a field manual, a tagging manual and a spelling list.

We then briefly describe some case studies of linguistic phenomena that the pilot corpus allows us to investigate, focusing on grammatical and lexical phenomena, as well as codeswitching, demonstrating that while a small corpus provides a robust test-bed for the investigation of grammatical phenomena, a larger dataset is required for the full investigation of lexical and sociolinguistic phenomena. Finally, we outline our plans for a 1-million-word corpus, a project for which a funding application is in preparation.

This paper was read at the Philological Society meeting at SOAS, University of London, on Friday, 18 January 2019, 4.15pm. A video recording of the presentation can be found below; the slides are available here.

References
Green, Melanie, Miriam Ayafor and Gabriel Ozón. 2016. A spoken corpus of Cameroon Pidgin English: pilot study. British Academy/Leverhulme funded digital database (ref. SG140663).

Nelson, Gerald. 1996. The design of the corpus. In Sidney Greenbaum (ed.). Comparing English worldwide. The International Corpus of English. Oxford: Clarendon Press, 27–35.

Ozón, Gabriel, Miriam Ayafor, Melanie Green and Sarah Fitzgerald. 2017. A spoken corpus of Cameroon Pidgin English. World Englishes 36: 427–447.

Simons, Gary F. and Charles D. Fennig (eds.). 2018. Ethnologue: Languages of the World, Twenty-first edition. Dallas, Texas: SIL International.

In Memoriam Matti Rissanen

by Sylvia Adamson (University of Sheffield)

It is with great sadness that the Society has received news of the death of Matti Rissanen, Professor Emeritus of English Philology at the University of Helsinki, at the age of 80 on 24 January 2018.

A long-time member and supporter of the Philological Society, Matti Rissanen was a pioneer in English historical corpus linguistics, and the director of the project that produced the Helsinki Corpus of English Texts, which covers a thousand years of the history of English and has been used widely since its publication in 1991.

Matti Rissanen was one of the rare scholars to command the history of the English language from its early stages to the present, beginning with his PhD thesis (1967) on the Old English numeral ONE. His wide range of publications includes a number of original articles and several co-edited volumes of corpus-based research, such as Early English in the Computer Age (1993), English in Transition and Grammaticalization at Work (1997), as well as the much cited chapter on ‘Early Modern English syntax’ in The Cambridge History of the English Language (vol. 3, 1999). Also taking an active interest in early American English, he was one of the international team that re-edited the Records of the Salem Witch-Hunt (2009).

His retirement in 2001 did not mark an end to his research activities. His philological expertise made an important contribution to the publication project that resulted in a new Finnish translation of all Shakespeare’s works. One of his long-lasting research interests was the history of English connectives, on which he was working to the very last days of his life.

Active in numerous professional organizations, Matti Rissanen served as president of the Societas Linguistica Europaea and chaired the Board of the International Computer Archive of Modern and Medieval English (ICAME). He was the founder and first director of the Research Unit for Variation, Contacts and Change in English (VARIENG), an Academy of Finland Centre of Excellence from 2000 to 2011. He was also a driving force in the foundation of the Finnish Institute in London and the Language Centre of the University of Helsinki. In recognition of his achievements Matti Rissanen received many honours, including an honorary doctorate of the University of Uppsala, Sweden, and election to the Finnish Academy of Science and Letters. He was an Honorary Member of the Modern Language Society, the International Society of Anglo-Saxonists, and the Japan Association for English Corpus Studies.

On the personal level, Matti was supervisor to several generations of undergraduate and doctoral students in Helsinki, while providing unfailing encouragement and support to many more students and colleagues both in Finland and abroad. He will be greatly missed by his wide circle of friends.

Anyone who would like to share their memories and recollections of him is invited to do so by adding them as comments (in English or Finnish) to this VARIENG blog post.

This notice has been adapted, with permission, from the notice posted by Matti’s colleagues in Helsinki.

The moment of truth: Testing the Matrix Language Frame model in English–Vietnamese bilingual speech

by Li Nguyen (University of Cambridge)

Over the last few decades, there has been burgeoning interest in the study of code-switching within bilingualism research. Despite various definitions of what the phenomenon might entail, it is generally agreed in the literature that code-switching broadly refers to bilinguals’ ability to effortlessly alternate between two different languages in their daily speech (Bullock & Toribio 2009:1). This ability underlies speakers’ language mixing, which, as researchers have come to realise, is far from random but rather governed by specific structural constraints (Poplack 1980; Bullock & Toribio 2009). The nature of such constraints has inspired the search for a ‘universal pattern’, resulting in new investigations involving a number of language pairs, such as English–Spanish (Poplack 1980; Travis & Torres Cacoullos 2013; Aaron 2015), English–Welsh (Stammers & Deuchar 2012), Ukrainian–English (Budzhak-Jones & Poplack 1997), Igbo–English (Eze 1997), or Acadian French–English (Turpin 1998).

One of the most influential theoretical accounts in the code-switching literature is Myers-Scotton’s (2002) Matrix Language Frame model (MLF), which assumes an asymmetrical relationship between the two languages in bilingual discourse. According to the MLF, ‘speakers and hearers generally agree on which language the mixed sentence is “coming from”’ (Joshi 1985:190–191), and it is this language that constitutes the ‘matrix language’ (ML) of the conversation. In a code-switched clause, the MLF predicts that the ML (i) supplies closed-class system morphemes such as finite verbs or function words, and (ii) determines word order. Although the need for, and the practicality of, identifying an ML in some language pairs are debatable (Sankoff & Poplack 1981; Clyne 1987), the asymmetrical relationship between the two languages involved is borne out in many existing datasets. Most often, the asymmetry is more obvious in pairs that are structurally different, with existing evidence heavily involving an Indo-European language and an Asian or African language (see Chan 2009:184 for an exhaustive list). The question is then: does the MLF actually generate accurate predictions in spontaneous speech?

In this project, I am testing the applicability of the MLF in English–Vietnamese code-switching data. This pair provides an interesting testing platform, since the two languages share a similar surface word order (SVO) despite other typological differences. In other words, at the clausal level, the word-order morpheme principle cannot be used to determine the Matrix Language. The focus of the study thus lies on the so-called ‘conflict sites’, points at which the word order of the participating languages differs. These conflicts involve the order of head and modifier within NPs and Possessive Phrases. Specifically, modifiers and possessors precede head nouns in English, but follow head nouns in Vietnamese. When bilingual speakers are presented with such a conflict, the MLF predicts that the matrix language (i.e. the language of the finite verbs or function words) should determine the word order. Furthermore, as an isolating language, Vietnamese has virtually no overt morphology. This adds an extra layer of complexity to determining the Matrix Language at the clausal level, which is traditionally assigned on the basis of the language of the finite verb, and so provides a further test of the MLF’s predictions when these two languages come into contact.
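The word-order prediction at a conflict site can be sketched as a small lookup. This is only a toy illustration of the reasoning in the paragraph above, not the procedure used in the study: the function name and the table are hypothetical, and the words "house" and "big" are placeholders rather than corpus data.

```python
# Toy illustration of the MLF word-order prediction at an NP conflict site.
# NP_ORDER and predicted_np_order are hypothetical names, not part of any
# published tool; the orders themselves are those stated in the text above.

NP_ORDER = {
    "English": ("modifier", "head"),     # modifiers/possessors precede the head noun
    "Vietnamese": ("head", "modifier"),  # modifiers/possessors follow the head noun
}


def predicted_np_order(matrix_language, head, modifier):
    """Return the head/modifier order the MLF predicts for this matrix language."""
    slots = {"head": head, "modifier": modifier}
    return [slots[role] for role in NP_ORDER[matrix_language]]


# Placeholder words only; the language each word comes from plays no role here,
# because under the MLF it is the matrix language that fixes the order.
print(predicted_np_order("English", head="house", modifier="big"))     # ['big', 'house']
print(predicted_np_order("Vietnamese", head="house", modifier="big"))  # ['house', 'big']
```

In a real analysis the hard part is the step this sketch takes for granted, namely deciding which language is the matrix language of a given clause.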

Thanks to fieldwork funding support from the Philological Society, I was able to carry out my fieldwork in Canberra, Australia, where I had existing connections with the Vietnamese bilingual community. Data collection took place between June and September 2017. My principle in building the corpus was drawn from Labov’s emphasis on the vernacular, where ‘minimum attention is paid to speech’ (Labov 1984:29). This approach was chosen because the vernacular reflects the most natural, systematic form of the language acquired by the speaker ‘before any subsequent efforts at (hyper-) correction or style shifting are made’ (Poplack 1993:252). Recruited speakers were thus free to choose their own interlocutors, in an environment that they were most comfortable with. They were asked to self-record a conversation on their personal mobile phone device, of a minimum of 30 minutes. After the recording was returned, speakers were asked to fill in a questionnaire to obtain information on extra-linguistic variables. The questionnaire consists of 18 questions, available both in English and Vietnamese.

The data collection process was successfully completed, resulting in a corpus of 10 hours of spontaneous speech. Results from this research should offer concrete, empirical evidence for or against the applicability of the MLF in language contact situations in which the participating languages are typologically disparate. If the MLF is found not to apply, it is hoped that the patterns observed will form the foundation of a new theoretical framework accounting for the data in question. Methodologically, the study demonstrates a systematic approach to determining the ML, especially in problematic situations where the overarching word order of the participating languages converges, and one of the languages lacks overt morphology. When made publicly available, the data will also constitute the first digitalised English–Vietnamese bilingual corpus, providing a valuable resource for future research on this language pair in particular, and for bilingualism research as a whole.

References:

Aaron, J. E. (2015). Lone English-origin nouns in Spanish: The precedence of community norms. International Journal of Bilingualism 19(4), 429–480.

Budzhak-Jones, S. & Poplack, S. (1997). Two generations, two strategies: the fate of bare English-origin nouns in Ukrainian. Journal of Sociolinguistics 1(2), 225–258.

Bullock, B. & Toribio, A. (eds.) (2009). The Cambridge Handbook of Linguistic Code-switching. Cambridge: Cambridge University Press.

Chan, B. (2009). Code-switching between typologically distinct languages. In B. Bullock & A. Toribio (eds.), The Cambridge Handbook of Linguistic Code-switching. Cambridge: Cambridge University Press, 182–198.

Clyne, M. (1987). Constraints on code-switching: How universal are they? Linguistics 25, 739–76.

Eze, E. (1997). Aspects of language contact: A variationist perspective on codeswitching and borrowing in Igbo-English bilingual discourse. PhD dissertation. Ottawa: University of Ottawa.

Joshi, A. K. (1985). Processing of sentences with intrasentential code switching. In D. R. Dowty, L. Karttunen & A. Zwicky (eds.), Natural language parsing. Cambridge: Cambridge University Press, 190–205.

Labov, W. (1984). Field methods of the project on linguistic change and variation. In J. Baugh & J. Sherzer (eds.), Language in use: Readings in sociolinguistics. Englewood Cliffs, NJ: Prentice Hall, 28–53.

Myers-Scotton, C. (2002). Contact Linguistics: Bilingual Encounters and Grammatical Outcomes. Oxford: Oxford University Press.

Poplack, S. (1980). Sometimes I’ll start a sentence in Spanish y termino en español: Toward a typology of codeswitching. Linguistics 18(7–8), 581–618.

Poplack, S. (1993). Variation theory and language contact. In D. Preston (ed.), American dialect research: An anthology celebrating the 100th anniversary of the American Dialect Society. Amsterdam: Benjamins, 251–268.

Sankoff, D. & Poplack, S. (1981). A formal grammar for code-switching. Papers in Linguistics 14(1), 3–46.

Stammers, J. & Deuchar, M. (2012). Testing the nonce borrowing hypothesis: Counter-evidence from English-origin verbs in Welsh. Bilingualism: Language and Cognition 15(3), 630–664.

Travis, C. & Torres Cacoullos, R. (2013). Making voices count: Corpus compilation in bilingual communities. Australian Journal of Linguistics 33(2), 170–194.

Turpin, D. (1998). ‘Le français, c’est le last frontier’: The status of English-origin nouns in Acadian French. International Journal of Bilingualism 2(2), 221–233.

The Faces of PhilSoc: Melanie Green

Name: Melanie Green

Position: Reader in Linguistics and English Language

Institution: University of Sussex

Role in PhilSoc: Council Member

About You

How did you become a linguist – was there a decisive event, or was it a gradual development?

Somewhere between doing my A-levels (in English, French and Latin) and applying for university, when I found the SOAS prospectus in the school cupboard. At that point I realised that studying language didn’t have to mean studying literature, and I applied to study Hausa at SOAS. In my final year, I took a course that focused on the linguistic description of Hausa (taught by Professor Philip Jaggar), and it was this course that led me upstairs to the Linguistics Department, where I then took my MA and PhD.

What was the topic of your doctoral thesis? Do you still believe in your conclusions?

My doctoral thesis was on focus and copular constructions in Hausa, and offered a minimalist analysis. I still believe in the descriptive conclusions, which relate to the grammaticalisation of non-verbal copula into focus marker, but I’m less convinced these days by formal theory. I still enjoy teaching it though, because I think it makes students think carefully (and critically) about formal similarities and differences between languages.

On what project / topic are you currently working?

Together with Gabriel Ozón at Sheffield and Miriam Ayafor at Yaoundé I, I’ve just completed a BA/Leverhulme-funded project to build a pilot spoken corpus of Cameroon Pidgin English. Based on this corpus, Miriam and I co-authored a descriptive grammar of the variety, which is in press.

What directions in the future do you see your research taking?

In my dreams, typologically-framed language documentation. In reality, probably more corpus linguistics, since this seems to be what attracts funding at the moment.

How did you get involved with the Philological Society?

The PhilSoc published my first book, Focus in Hausa.

‘Personal’ Questions

Do you have a favourite language – and if so, why?

No.

Minimalism or LFG?

Minimalism.

Teaching or Research?

Both.

Do you have a linguistic pet peeve?

No.

Looking to the Future

Is there something that you would like to change in academia / HE?

I would like there to be more funding for language documentation. Languages are dying faster than we can describe them.

(How) Do you manage to have a reasonable work-life balance?

I do, but that only became possible in mid-career. I achieve it with careful planning, so when I’m off work, I’m really off work.

What is your prime tip for younger colleagues?

Start publishing as early as possible.

Natural Language Processing meets social media corpora

by Yin Yin Lu (University of Oxford)

From 17 to 19 May I attended the CLARIN workshop on the ‘Creation and Use of Social Media Resources’ in Kaunas, Lithuania. The thirty participants represented a broad range of backgrounds: computer science, corpus linguistics, political science, sociology, communication and media studies, sociolinguistics, psychology, and journalism. Our goal was to share best practices in the large-scale collection and analysis of social media data, particularly from a natural language processing (NLP) perspective.

As Michael Beißwenger noted during the first workshop session, there is a ‘social media gap’ in the corpus linguistics landscape. This is because social media corpora are the “naughty stepchild” of text and speech corpora. Traditional natural language processing tools (for, e.g., news articles, political documents, speeches, essays, books) are not always appropriate for social media texts, given the unique communicative characteristics of such texts. Part-of-speech tagging, tokenisation, dependency parsing, sentiment analysis, irony detection, and topic modelling are notoriously difficult. In addition, the personal nature of much social media creates legal and ethical challenges for the data mining and dissemination of social media corpora: Twitter, for example, forbids researchers from publishing collections of tweets; only their IDs can be shared.

I made invaluable connections with researchers at the intersection of NLP and social media data – and Twitter data in particular, which is the area of my own research. Dirk Hovy, an associate professor at the University of Copenhagen, spoke broadly about the challenges of NLP: engineers assume that all language is identically and independently distributed. This is clearly not true, as language is driven by demographic differences. How can we add extra-linguistic information to NLP models? His proposed solution is word embedding: transforming words into vectors, trained on large amounts of data from different demographic groups. These vectors should capture the linguistic peculiarities of the groups.

A variant of word embedding is document embedding – and tweets can be treated as documents. Thus, it should be possible to transform tweets into vectors to capture the demographic-driven linguistic differences that they contain. I will be applying this approach to my own corpus of 12 million tweets related to the EU referendum.
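As a rough illustration of the idea, one very simple way to turn a tweet into a vector is to average the vectors of the words it contains. This is a minimal sketch under stated assumptions, not the method presented at the workshop or the one used on the referendum corpus: the three-dimensional word vectors below are invented numbers, and `embed_document` and `cosine` are hypothetical helper names. Real embeddings are learned from large amounts of text and have far more dimensions.

```python
from math import sqrt

# Invented toy word vectors (assumption: real ones would be trained on data).
WORD_VECTORS = {
    "leave": [0.9, 0.1, 0.0],
    "remain": [0.1, 0.9, 0.0],
    "vote": [0.5, 0.5, 0.2],
    "eu": [0.4, 0.4, 0.6],
}


def embed_document(text):
    """Average the vectors of all known words; return None if none are known."""
    tokens = [t for t in text.lower().split() if t in WORD_VECTORS]
    if not tokens:
        return None
    dims = len(next(iter(WORD_VECTORS.values())))
    return [sum(WORD_VECTORS[t][i] for t in tokens) / len(tokens) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


tweet_a = embed_document("vote leave")
tweet_b = embed_document("leave eu")
tweet_c = embed_document("vote remain")

# With these toy numbers, the two "leave" tweets end up closer to each other
# than the "leave" tweet is to the "remain" tweet.
print(cosine(tweet_a, tweet_b) > cosine(tweet_a, tweet_c))  # True
```

Averaging ignores word order and treats every word as equally important, so it is best read as a baseline against which dedicated document-embedding models can be compared.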

Andrea Cimino, a postdoc from the Italian NLP Lab, spoke about his work on adapting existing NLP tools, which are trained on traditional text, for social media text. The NLP Lab has developed the best POS tagger for social media, based upon deep neural networks (long short-term memory), which are able to capture long-range relationships between words in a sentence. The tagger has achieved 93.2% accuracy, but currently works only on Italian texts. Similar taggers can be developed for English texts, given the appropriate training data.

Rebekah Tromble, an assistant professor at Leiden University, presented on the limitations and biases of data collected from Twitter’s Application Programming Interface (API). There are two public APIs that can be used: the historic Search API and the real-time Streaming API. The former returns up to 18,000 tweets from the previous seven to ten days, whichever limit is reached first. The Streaming API allows for up to 1% of all tweets to be collected in real time; as there are 500 million tweets a day, this is approximately 5 million tweets a day.

Big and small data in ancient languages

by Nicholas Zair (University of Cambridge)

Back in November I gave a talk at the Society’s round table on ‘Sources of evidence for linguistic analysis’ on ‘Big and small data in ancient languages’. Here I’m going to focus on one of the case studies I considered under the heading of ‘small data’, which is based on an article that Katherine McDonald and I have written (more details below) about a particular document from ancient Italy known as the Tabula Bantina.

It comes from Bantia, modern day Banzi in Basilicata and is written in Oscan, a language which was spoken in Southern Italy in the second half of the first millennium BC, including in Pompeii prior to a switch to speaking Latin towards the end of that period. Since Oscan did not survive as a spoken language, we know it almost entirely from inscriptions written on non-perishable materials such as stone, metal and clay. There aren’t very many of these inscriptions: perhaps a few hundred, depending on definitions (for instance, do you include control marks consisting of a single letter?). We are lucky that Oscan is an Indo-European language, and, along with a number of other languages from ancient Italy, quite closely related to Latin, so we can make good headway with it. Nonetheless, our knowledge of Oscan and its speakers is fairly limited: it is certainly a language that comes under the heading of ‘small data’.

One of the ways scholars have addressed the problem of so-called corpus languages like Oscan, and even better-attested but still limited ones like Latin, has been to combine as many relevant sources of information as possible, from ancient historians to the insights of modern sociolinguistic theory, as a way of squeezing as much information as possible from what we have, and of trying to fill in the blanks where information is lacking. This has been a huge success, but this approach can also be dangerous, especially when it comes to studying language death. Given that we know a language will die out in the end, it is very tempting to see every piece of evidence as a staging post in the process, and try to fit it into our narrative of language death. Often this provides very plausible histories, but we must remember that, while in hindsight history can look teleological, things are rarely so clear at the time.

The Tabula Bantina is a bronze tablet with a Latin law on one side and an Oscan law on the other side. It is generally agreed that the Latin text was written before the Oscan one, but the Oscan is not a translation of the Latin: the writer of the Oscan text simply used the conveniently blank side of the tablet to write the new material on. The striking things about the Oscan text are that it is written in the Latin alphabet, and there are lots of mistakes. It also strongly resembles Latin legal language. This side probably dates to between about 100 and 90 BC, just before Rome’s ‘allies’, which is to say conquered peoples and cities in Italy, rose up against it in a rebellion generally known as the Social War.

‘Counting’: quality and quantity in literary language and tools for investigating it

by Jonathan Hope (Strathclyde University, Glasgow)

The transcription of a substantial proportion of Early Modern English books by the Text Creation Partnership has placed more than 60,000 digital texts in the hands of literary and linguistic researchers. Linguists are in many cases used to dealing with large electronic corpora, but for literary scholars this is a new experience. Used to arguing from the quality, rather than quantity of evidence, literary scholars have a new set of norms and procedures to learn, and are faced with the exciting, or perhaps depressing, prospect that their object of study has changed.

In this talk I’ll look at some specific case studies that illustrate the potential, and the problems, of quantity-based studies – and will highlight key areas where literary scholars need to reassess their expectations of ‘evidence’, and the texts we use. A possible alternative title might be ‘Learning to live with error: gappy texts and crappy metadata’.

A screencast of the talk can be found below.

This paper was read at the Philological Society meeting in Oxford, Wolfson College, on Saturday, 11 March, 4.15pm.

The Philological Society Blog

News from the Society and its members