Sources of evidence for linguistic analysis

Round table discussion with Aaron Ecay (University of York), Seth Mehl (University of Sheffield), Nick Zair (University of Cambridge), chaired by Cécile De Cat (University of Leeds)

Is linguistics an empirical science? How reliable are the data on which linguistic analyses and theories are based? These questions are not new, but in light of the disturbing findings of the Reproducibility Project in psychological sciences, the need to revisit them has become more pressing.  This round table discussion will start with presentations from three postdoctoral researchers, who will discuss the question of data collection and analysis and the interpretation of linguistic evidence.


This panel will be held on 11 November 2016 at 4.15pm in the Great Woodhouse Room, University House, University of Leeds, LS2 9JS.

For more information about the individual panelists’ presentations, see their abstracts below. The presentations have been live-tweeted under the hashtag .

Big and small data in ancient languages

by Nicholas Zair (University of Cambridge)

Ancient linguists often have to deal with ‘bad’ data; in practice, this often means ‘small’ data, and a particularly fruitful approach has been to view the data through the lens of sociolinguistic theory, especially with regard to multilingualism, language as a marker of identity, and language death. However, there are dangers in using theory to ‘fill in the gaps’ in our data; for example, we might wonder how relevant modern cases of language death are to ancient linguistic situations, and to what extent disparate pieces of evidence can reasonably be made to fit into a narrative (pre)defined by ‘what we expect’. On the other hand, in recent years there has been a great increase in digital resources for ancient languages. These often allow much faster collection and analysis of large amounts of data, but can also pose challenges – not least the danger of making ‘bad’ data worse. I will discuss issues surrounding sources for and use of ancient linguistic data, providing case studies from ancient Italy.

Corpus semantics: From texts to data to meaning

by Seth Mehl (University of Sheffield)

Corpus semantics applies quantitative data science techniques to questions of meaning in language. In order to be successful, such research must account for the nature of corpus data and the nature of semantic meaning, and connect the two. In this talk, I explore the nature of linguistic data, the processes for collecting and structuring it, and the possible relationships between corpus data, semantic meaning, and quantitative calculations. Can computers count words to find meaning? In addressing this question, I present examples from my own research on the Linguistic DNA project, which employs computational methods and close reading to model semantic and conceptual change across tens of thousands of texts, and over a billion words, of Early Modern English.

Bridge to nowhere? – Progress and problems in relating syntactic variation and change to syntactic theory

by Aaron Ecay (University of York)

Modern formal syntactic theories rely on a crucial methodological assumption: that speakers’ mental representations can be accessed and interrogated by way of acceptability judgments of test sentences created by the investigator. When studying extinct language varieties, however, native speakers are not available to provide judgments. And when variable phenomena are studied, judgments are no help, since speakers (by and large) accept all variants equally. Corpora, and the quantitative data they provide, offer one way of studying syntactic variation and change. But how can the data yielded by corpora connect with theoretical analyses?

In order to bridge the gap between the theoretical and quantitative-empirical domains, linguists have developed linking hypotheses.  In this talk, I will review several of these, such as:

  • The constant rate hypothesis (Kroch 1989)
  • The exponential model of morphophonological rules (Guy 1991)
  • The variational model of syntactic acquisition (Yang 2000)

After reviewing the models and how they serve as effective linking hypotheses, I’ll go on to consider the work that has followed from them. These models, I will argue, have all encountered challenges arising from the increase in available data and quantitative sophistication brought about in recent decades by the computer revolution.

What, then, will happen next at the interface between syntactic theory and quantitative data?  Several developments are on the horizon. Firstly, traditional syntactic acceptability judgments have taken a recent quantitative turn (see e.g. Sprouse et al. 2013).  Secondly, newly available sources of data, larger by orders of magnitude than what was previously available, have in the past several years been brought to bear on questions of both historical and contemporary variation, uncovering finer variation than was previously assumed to exist (see e.g. Grieve 2012).  Finally, new quantitative linking hypotheses are being developed to augment those listed above (see e.g. Kauhanen 2016).  These point to a future where the gap between syntactic theory and syntactic variation is narrower, and will be bridged by a common understanding of the processes that produce and regulate variation.

Grieve, J. (2012) “A statistical analysis of regional variation in adverb position in a corpus of written Standard American English.” Corpus Linguistics and Linguistic Theory 8, pp. 39-72.
Guy, G. (1991) “An exponential model of morphological constraints.” Language Variation and Change 3, pp. 1-22.
Kauhanen, H. (2016) “Neutral change.” Journal of Linguistics. (Accepted, to appear; available online.)
Kroch, A. (1989) “Reflexes of grammar in patterns of language change.” Language Variation and Change 1, pp. 199-244.
Sprouse, J., Schütze, C., Almeida, D. (2013) “A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010.” Lingua 134, pp. 219-248.
Yang, C. (2000) “Internal and external forces in language change.” Language Variation and Change 12, pp. 231-250.

3 thoughts on “Sources of evidence for linguistic analysis”

  1. “Is linguistics an empirical science?” requires bold criticism, for it does not have a straightforward yes-no answer. To be an empirical science, linguistics must free itself from advocating this kind of ascientific “reproducibility” theory. “Reproducibility”, as a new version of the traditional “representationism” or “correspondence” theory, continues to popularize the view that the mind has “ideas” which symbolize “outside things”. Proposing a polarity of two distinct levels in this sense is straightforwardly guilty of conveying to us an unobservable “matter” which would make possible a relation between the two. We must do away with this kind of “psychological fallacy”, the inclination of the [psychologist] to believe that his knowledge of the agent’s operations is possessed by the agent himself, as William James pointed out; our knowledge is observational and propositional. That is, any issue raised must be discussed in its own terms and not in terms of the psychologist’s inclinations.
    I am currently rewriting Tongan history since its first recordings and annotations by some members of the London Missionary Society who were stationed in Tonga during the 19th century. Based entirely on my critiques of the missionaries’ recordings of local stories and myths and of other Tongan historians’ dissertations, I am of course directing the method of rewriting to coincide with the logical relationship between the ‘subject’ and ‘predicate’, connected by a ‘copula’, of a given proposition. Dualism as a central feature of social scientific practice is, in my work, collapsed into just one complex state of interconnected affairs…


    1. On behalf of Aaron Ecay:
      This comment highlights a tension between theories of mental representations and theories of (non-mental) phenomena.
      The debate is an old and wide-ranging one; one of its most celebrated ramifications for modern formal linguistics is the exchange between Chomsky and Skinner in the late 1950s. Since then, most linguists have accepted that mental representations are necessary for an adequate theory of language (though they differ, often sharply, on the form and content that should be attributed to those representations).
      Reopening such a fundamental question is unlikely to be a productive way of examining past or current trends in linguistic analysis. In other disciplines the fundamental question is answered differently; the commenter may be interested, for example, in “bottom-up” approaches to cognitive science through neural imaging. Questions have been raised, however, about whether this approach succeeds even on its own terms: see and the references therein for an overview.
      The questions that I will be addressing in my contribution to the panel are different, namely how does evidence from a variety of sources shape linguistic argumentation (that is, argumentation in the field of linguistics)? How do new sources of evidence support, contradict, or otherwise interact with those which have longer traditions of use? What use are linguists making of the variety of evidence at their disposal, and will (or should) practice evolve in the future?
      I believe that my fellow panelists will also be speaking in this vein, though as I write this I have not yet had the benefit of hearing their talks.


  2. In a short reply to “On behalf of Aaron Ecay”: it doesn’t matter how old the debate is; the question at stake has still not been resolved. Chomsky only changed the debate, hence his theory of transformational-generative grammar, and, somehow, rationalistically asserted the aloofness of linguistic, in other words Fregean, “sense” as opposed to “reference”. An adequate theory of language is not necessarily reducible to mental representation. Language is rather a social phenomenon. It is our way of dealing with whatever situation arises: when we talk about situation X, for example, we at the same time describe it, in no representative manner, as Y. Further, what is in question here is not the use of words, for, while the use of a word may be described as arbitrary, we are not using it as a word unless we refer by means of it to a particular sort of thing. It implies that we are directly acquainted with that sort of thing or with things as of that sort. Further, we have to be acquainted with the word as a noise of a certain sort; and the “reference” of this to the other sort of thing is a further situation with which we become acquainted…
    Quite frankly, I would not be interested in reading about the “bottom-up approaches to cognitive science” as suggested. What I said has nothing to do with, and did not imply, any such approach.

