Big and small data in ancient languages

by Nicholas Zair (University of Cambridge)

Back in November I gave a talk at the Society’s round table on ‘Sources of evidence for linguistic analysis’ on ‘Big and small data in ancient languages’. Here I’m going to focus on one of the case studies I considered under the heading of ‘small data’, which is based on an article that Katherine McDonald and I have written (more details below) about a particular document from ancient Italy known as the Tabula Bantina.


It comes from Bantia, modern-day Banzi in Basilicata, and is written in Oscan, a language which was spoken in Southern Italy in the second half of the first millennium BC, including in Pompeii prior to a switch to speaking Latin towards the end of that period. Since Oscan did not survive as a spoken language, we know it almost entirely from inscriptions written on non-perishable materials such as stone, metal and clay. There aren’t very many of these inscriptions: perhaps a few hundred, depending on definitions (for instance, do you include control marks consisting of a single letter?). We are lucky that Oscan is an Indo-European language, and, along with a number of other languages from ancient Italy, quite closely related to Latin, so we can make good headway with it. Nonetheless, our knowledge of Oscan and its speakers is fairly limited: it is certainly a language that comes under the heading of ‘small data’.



One of the ways scholars have addressed the problem of so-called corpus languages like Oscan, and even better-attested but still limited ones like Latin, has been to combine as many relevant sources of information as possible, from ancient historians to the insights of modern sociolinguistic theory, as a way of squeezing as much information as we can from what we have – and trying to fill in the blanks where information is lacking. This has been a huge success, but the approach can also be dangerous, especially when it comes to studying language death. Given that we know a language will die out in the end, it is very tempting to see every piece of evidence as a staging post in the process, and try to fit it into our narrative of language death. Often this provides very plausible histories, but we must remember that, while in hindsight history can look teleological, things are rarely so clear at the time.

The Tabula Bantina is a bronze tablet with a Latin law on one side and an Oscan law on the other side. It is generally agreed that the Latin text was written before the Oscan one, but the Oscan is not a translation of the Latin: the writer of the Oscan text simply used the conveniently blank side of the tablet to write the new material on. The striking things about the Oscan text are that it is written in the Latin alphabet, and there are lots of mistakes. It also strongly resembles Latin legal language. The date of this side is probably between about 100 and 90 BC, just before Rome’s ‘allies’, which is to say conquered peoples and cities in Italy, rose up against it in a rebellion generally known as the Social War.

‘Counting’: quality and quantity in literary language and tools for investigating it

by Jonathan Hope (Strathclyde University, Glasgow)

The transcription of a substantial proportion of Early Modern English books by the Text Creation Partnership has placed more than 60,000 digital texts in the hands of literary and linguistic researchers. Linguists are in many cases used to dealing with large electronic corpora, but for literary scholars this is a new experience. Used to arguing from the quality, rather than the quantity, of evidence, literary scholars have a new set of norms and procedures to learn, and are faced with the exciting, or perhaps depressing, prospect that their object of study has changed.

In this talk I’ll look at some specific case studies that illustrate the potential, and the problems, of quantity-based studies – and will highlight key areas where literary scholars need to reassess their expectations of ‘evidence’, and the texts we use. A possible alternative title might be ‘Learning to live with error: gappy texts and crappy metadata’.

A screencast of the talk can be found below.

This paper was read at the Philological Society meeting in Oxford, Wolfson College, on Saturday, 11 March, 4.15pm.

Old Norwegian vowel harmony and the value of quantitative data for descriptive linguistics

by Tam Blaxter (University of Cambridge)

Quantitative methods in historical linguistics are most often used to answer ‘variationist’ questions. We assume that we know what the possible forms of a language were, but ask questions about their distribution: when was one form replaced by another? Who used which forms? Were some more common in particular linguistic contexts, genres or text types? For this reason, quantitative methods might seem unappealing to historical linguists primarily interested in describing a historical variety—its grammar and lexicon—or describing etymologies. From time to time, however, quantitative data can throw a light on these more basic descriptive questions.

An excerpt from the Old Norwegian Homily Book

Old Norwegian, unlike its better-studied West Nordic sister Old Icelandic, exhibited height harmony of unstressed non-low vowels. Readers familiar with Old Icelandic texts will expect to see three distinct vowels in unstressed syllables: /a i u/ written <a i u>. In Old Norwegian texts we find an additional two graphemes, <e o>, in complementary distribution with <i u>. These vowels agree with the vowel of the stressed syllable for height: <i u> appear in unstressed syllables whenever the vowel of the stressed syllable was high and <e o> whenever it was non-high. There are two exceptions to this rule. When the stressed syllable contained the vowel normalised ǫ, which was the u-umlaut product of *a, we find unstressed syllables with <u> and either <e> or <i>. When the stressed syllable contained the i-umlaut product of *a (usually normalised e but sometimes written ę to distinguish it from /e/ < Proto-Germanic *e), we find unstressed syllables with <i> and either <u> or <o>.
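The rule just described can be written out as a small Python sketch. This is only an illustration of the description above, not an established tool: the function names are hypothetical, and the set of stressed vowels treated as high is an assumption made for the example (diphthongs and other details are ignored).

```python
# Illustrative sketch of the harmony rule as described in the text.
# ASSUMPTION: the stressed vowels counted as "high" (normalised spelling).
HIGH_STRESSED = {"i", "í", "y", "ý", "u", "ú"}


def expected_unstressed(stressed):
    """Return the unstressed non-low graphemes expected after `stressed`.

    The result maps "front" and "back" to the set of graphemes the
    description above leads us to expect.
    """
    if stressed == "ǫ":
        # Exception 1: u-umlaut product of *a takes <u> and either <e> or <i>.
        return {"front": {"e", "i"}, "back": {"u"}}
    if stressed == "ę":
        # Exception 2: i-umlaut product of *a takes <i> and either <u> or <o>.
        return {"front": {"i"}, "back": {"u", "o"}}
    if stressed in HIGH_STRESSED:
        # High stressed vowel: unstressed <i u>.
        return {"front": {"i"}, "back": {"u"}}
    # Any other (non-high) stressed vowel: unstressed <e o>.
    return {"front": {"e"}, "back": {"o"}}


def is_harmonic(stressed, unstressed):
    """True if the unstressed grapheme is consistent with the rule."""
    if unstressed == "a":
        # The low vowel does not take part in height harmony.
        return True
    expected = expected_unstressed(stressed)
    return unstressed in expected["front"] | expected["back"]
```

On this sketch, `is_harmonic("ę", "i")` holds while `is_harmonic("ę", "e")` does not, which is the contrast with plain /e/ that the rule predicts.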

In theory, then, we could use the vowel harmony to distinguish between the stressed phonemes /e/ and /ę/ which were not (consistently) distinguished in the orthography: the former should have harmony vowels <e o> while the latter should have <i o/u>. However, Old Norwegian vowel harmony is a slippery creature. Few texts exhibit it totally consistently, making it difficult to sort out what is orthographic and what phonological variation. If we take a qualitative approach in which we read individual texts and describe their orthographies, we can’t confidently interpret deviations from vowel harmony as meaningful. If, on the other hand, we take a quantitative approach which includes data from many different texts, interesting patterns may become clear.

Sources of evidence for linguistic analysis

Round table discussion with Aaron Ecay (University of York), Seth Mehl (University of Sheffield), Nick Zair (University of Cambridge), chaired by Cécile De Cat (University of Leeds)

Is linguistics an empirical science? How reliable are the data on which linguistic analyses and theories are based? These questions are not new, but in light of the disturbing findings of the Reproducibility Project in psychological sciences, the need to revisit them has become more pressing. This round table discussion will start with presentations from three postdoctoral researchers, who will discuss the question of data collection and analysis and the interpretation of linguistic evidence.


This panel will be held on 11 November 2016 at 4.15pm in the Great Woodhouse Room, University House, University of Leeds, LS2 9JS.

For more information about the individual panelists’ presentations, see their abstracts below. The presentations have been live-tweeted under the hashtag , and George Walkden has kindly provided a storified version of the tweets.