Natural Language Processing meets social media corpora

by Yin Yin Lu (University of Oxford)

From 17 to 19 May I attended the CLARIN workshop on the ‘Creation and Use of Social Media Resources’ in Kaunas, Lithuania. The thirty participants represented a broad range of backgrounds: computer science, corpus linguistics, political science, sociology, communication and media studies, sociolinguistics, psychology, and journalism. Our goal was to share best practice in the large-scale collection and analysis of social media data, particularly from a natural language processing (NLP) perspective.

As Michael Beißwenger noted during the first workshop session, there is a ‘social media gap’ in the corpus linguistics landscape: social media corpora are the ‘naughty stepchild’ of text and speech corpora. Traditional natural language processing tools – designed for, e.g., news articles, political documents, speeches, essays, and books – are not always appropriate for social media texts, given the unique communicative characteristics of such texts. Part-of-speech tagging, tokenisation, dependency parsing, sentiment analysis, irony detection, and topic modelling are all notoriously difficult on social media data. In addition, the personal nature of much social media creates legal and ethical challenges for the mining and dissemination of social media corpora: Twitter, for example, forbids researchers from publishing collections of tweets; only their IDs can be shared.

I made invaluable connections with researchers at the intersection of NLP and social media data – and Twitter data in particular, which is the area of my own research. Dirk Hovy, an associate professor at the University of Copenhagen, spoke broadly about the challenges of NLP: engineers assume that all language is independently and identically distributed. This is clearly not true, as language is driven by demographic differences. How can we add extra-linguistic information to NLP models? His proposed solution is word embedding: transforming words into vectors, trained on large amounts of data from different demographic groups. These vectors should capture the linguistic peculiarities of each group.
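To make the idea concrete, here is a minimal sketch – my own illustration, not code from the talk – of how one might train a separate embedding model per demographic group using the gensim library. The group labels and toy sentences are invented; a real study would use millions of tweets per group.

```python
# Sketch only: one word-embedding model per (hypothetical) demographic group.
from gensim.models import Word2Vec

corpora = {
    "younger_users": [
        ["this", "is", "proper", "mint"],
        ["well", "jel", "of", "your", "holiday"],
    ],
    "older_users": [
        ["this", "is", "very", "nice"],
        ["quite", "envious", "of", "your", "holiday"],
    ],
}

models = {}
for group, sentences in corpora.items():
    # min_count=1 only because this toy corpus is tiny; real corpora
    # would use a higher threshold and far more data.
    models[group] = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

# Each group now has its own vector space; comparing the neighbours of the
# same word across groups surfaces demographic differences in usage.
print(models["younger_users"].wv.most_similar("holiday", topn=2))
```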

A variant of word embedding is document embedding – and tweets can be treated as documents. Thus, it should be possible to transform tweets into vectors to capture the demographic-driven linguistic differences that they contain. I will be applying this approach to my own corpus of 12 million tweets related to the EU referendum.
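As a rough sketch of what that could look like in practice – again my own illustration, with invented placeholder tweets rather than the referendum corpus – gensim’s Doc2Vec treats each tweet as a tagged document and learns one vector per tweet:

```python
# Sketch only: embed whole tweets as documents with Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets = [
    ["leave", "means", "leave"],
    ["stronger", "in", "europe"],
    ["what", "does", "brexit", "even", "mean"],
]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(tweets)]

model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=20)

# model.dv[0] is the learned vector for the first tweet; unseen tweets can
# be embedded with infer_vector and then clustered or compared by similarity.
vector = model.infer_vector(["brexit", "means", "brexit"])
```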

Andrea Cimino, a postdoc from the Italian NLP Lab, spoke about his work on adapting existing NLP tools – which are trained on traditional text – to social media text. The NLP Lab has developed the best-performing POS tagger for social media text, based on deep neural networks (long short-term memory networks), which are able to capture long-distance relationships between words in a sentence. The tagger achieves 93.2% accuracy, though it currently works only on Italian texts. Similar taggers could be developed for English texts, given the appropriate training data.
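For readers curious about the general shape of such a tagger, here is a toy sketch in Keras. It is not the NLP Lab’s system: the vocabulary size, tag set, and random stand-in training data are all invented, and a serious tagger would add character-level features and pre-trained embeddings.

```python
# Sketch only: a bidirectional LSTM tagger over padded sequences of word IDs.
import numpy as np
from tensorflow.keras import layers, models

VOCAB, TAGS, MAXLEN = 5000, 20, 40  # illustrative sizes

model = models.Sequential([
    layers.Embedding(VOCAB, 64),
    # The bidirectional LSTM reads the whole tweet, so each position's tag
    # can depend on words far to its left and to its right.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random word IDs and tag IDs standing in for an annotated corpus.
X = np.random.randint(0, VOCAB, size=(32, MAXLEN))
y = np.random.randint(0, TAGS, size=(32, MAXLEN))
model.fit(X, y, epochs=1, verbose=0)
```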

Rebekah Tromble, an assistant professor at Leiden University, presented on the limitations and biases of data collected from Twitter’s Application Programming Interface (API). There are two public APIs that can be used: the historic Search API and the real-time Streaming API. The former returns tweets from roughly the last seven to ten days, up to a cap of 18,000 tweets, whichever limit is reached first. The Streaming API allows up to 1% of all tweets to be collected in real time; as there are 500 million tweets a day, this is approximately 5 million tweets a day.
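By way of illustration, here is a minimal sketch of collecting tweets from the Streaming API with the tweepy library, using its v3-era interface (newer versions of tweepy have changed it). The credentials and track terms are placeholders, and storing only tweet IDs respects the sharing restriction mentioned above.

```python
# Sketch only: stream tweets matching some keywords and keep just their IDs.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class IdCollector(tweepy.StreamListener):
    def on_status(self, status):
        # Store only the tweet ID: Twitter's terms allow sharing IDs,
        # but not full collections of tweets.
        with open("tweet_ids.txt", "a") as out:
            out.write(str(status.id) + "\n")

stream = tweepy.Stream(auth=auth, listener=IdCollector())
# However broad the filter, the stream yields at most ~1% of all tweets.
stream.filter(track=["euref", "brexit"], languages=["en"])
```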

Continue reading “Natural Language Processing meets social media corpora”

Membership survey 2016 

by Richard K. Ashdowne (University of Oxford; Honorary Membership Secretary, PhilSoc)

In spring 2016 the Council of the Society ran an online survey to find out members’ views on the Society’s current activities, and in particular its programme of meetings.

More than 200 members completed the survey, from a wide range of the Society’s very diverse membership, including new and student associate members and those who have been members of the Society for many decades.

The chief results of the survey were that more than half of the respondents typically do not attend any meetings of the Society in a given year, while fewer than 10% said they typically manage to attend three or more meetings. Over a quarter of those who completed the survey said they had never attended a meeting of the Society.

The most frequently given reasons for being unable to attend meetings were the difficulty and/or cost of travel to meetings and the pressure of other work or family commitments. Various other reasons were each given by smaller numbers of respondents.

The Society very much understands that the investment of time and money for a member to attend a meeting in person is often considerable. For this reason we have now encouraged speakers to provide a brief abstract that will enable members to make a more informed decision about attending.

With a view to making its meetings more accessible to UK members living outside the south-east of England, the Society is continuing to arrange at least one of its regular meetings each year outside this area. Recent examples include the meetings held in Newcastle and Leeds in 2016. The Society – via the Secretary – is keen to hear from members who would be willing to host such events in the future.

The survey asked whether respondents had viewed the videos of some of the Society’s joint events with the British Academy and whether members would watch recordings of other meetings in addition to or instead of attending. Since this possibility was generally welcomed by those who responded, the Society has now begun to experiment with making video recordings of some of its regular meetings and making these available via YouTube. It is hoped that members who are unable to attend meetings in person may find these of interest. We would be interested in any feedback on these videos in the comments on this post.

Council keeps the arrangements for meetings under regular review and so we’d also be interested in any comments in general on the Society’s events via the comments on this post.

Fieldwork on West Polesian

by Kristian Roncero (University of Surrey)

West Polesian belongs to the Eastern Slavonic subgroup and is spoken in the Polish region of Podlasie, the south-western half of the Brest region in Belarus, and the Volyn region of Ukraine. West Polesian has hardly been studied in its own right, yet it differs considerably from the national standard (or literary) languages of the countries where it is spoken. One of the main reasons is its isolation: older stages of Common Eastern Slavonic language and culture have been preserved because Polesians live in a marshy area that can be difficult to access, as it is frequently flooded. In Žydča (see map), some speakers remember that when they were children a helicopter would bring bread to the village because the ‘road’ was flooded (before some of the roads were drained in the 1980s and 1990s).

[Map of the studied villages in the region of Brest (Belarus)]

There is very little work on West Polesian grammar, which is why I decided that I needed to gather data first-hand from speakers themselves. Continue reading “Fieldwork on West Polesian”

Exaptation: acquiring the unacquirable

by Benjamin Lowell Sluckin (Humboldt University of Berlin, formerly University of Cambridge)

I was fortunate enough to receive a PhilSoc Masters Bursary in 2015/16, which has been of greater value to me than the £4,000 awarded. It enabled me to study for an MPhil in Theoretical and Applied Linguistics at my institution of choice, the University of Cambridge. I’m happy to say it was worth it! So before I get down to writing about my experiences of postgraduate study and research, I want to thank PhilSoc for their generosity and for seeing value in that hopeful letter of application penned in early spring 2015.

First I’ll say a bit about my general experience, and then I’ll get down to the linguistic meat. Cambridge is a weird and wonderful place: it is like stepping into a time machine and stepping out in 1870, except that everyone has a MacBook. It is a bubble, as everyone says; the real world seems distant, and at times one can feel claustrophobic. However, the bubble is good for doing research. It is quiet, there are talks almost every day, and there was always the possibility of valuable academic discussion with my peers and seniors in the department, from whom I learnt a great deal. Like any university – but perhaps especially here – there is also the constant opportunity to have your assumptions about anything and everything challenged by those who know better, or at least pretend to. The Masters Bursary allowed me not only to learn some serious linguistics but also to acquire the ability to power a very unstable boat with a very long stick. All in all, I can now say with some confidence that I understand enough syntax to follow what people are disagreeing about most of the time, though not always why they insist on disagreeing.

In my bursary application I said I wanted to specialise in diachronic morphosyntax in Germanic, and I specifically “promised” to look at exaptive changes in language (my thanks to George Walkden, whose support and lectures got me thinking about these things). In short, Lass (1990, 1997) argued that when form-to-function mappings are eroded in a language, we can be left with functionless linguistic “junk”, which can then be co-opted for an unrelated function. The canonical example from Lass (1990) is the recycling of adjectival agreement marking in Afrikaans: in Dutch the ending marks syntactic agreement for gender and definiteness (1a,b), whereas in Afrikaans it is conditioned by the morphological character of the adjective itself (1c,d): simple vs. complex. I found Lass’ ideas interesting, and I knew that David Willis in Cambridge had been working on this topic, so I was keen to get in on the action (for lack of a better term). Once I arrived, he was always ready to challenge my ideas and encourage me to refine my arguments.

(1) Examples

a. Dutch common/neuter definite & common indefinite

de gevaarlijk-e muis/paard
the dangerous-e mouse.COM/horse.NEUT

b. Dutch neuter, indefinite

een gevaarlijk-∅ paard
a dangerous-∅ horse.NEUT
(adapted from ex. 23, Norde & Trousdale 2016: 187)

c. Afrikaans simple adjective

die groot-∅ groep
the large-∅ group
([Lubbe & Plessis 2014: 28] cf. Sluckin 2016: 6)

d. Afrikaans complex adjective

die belangrik-e rol
the important-e role
([Lubbe & Plessis 2014: 21] cf. Sluckin 2016: 6)

Scholars have argued about exaptation for 25 years, so I will admit now that I approach this problem from a minimalist perspective. That means I treat child language acquisition as the primary locus of morphosyntactic change; I reject “junk”, i.e. genuinely functionless material, as impossible (like many, but not all, scholars); and, crucially, my work assumes that the syntactic architecture is based on a hierarchical generation of formal features and projecting heads, and so on.

This type of change is especially interesting because, to my mind, it shows the incredible capacity of the child acquiring language to regularise seemingly incoherent data. Research into exaptive reanalyses can tell us something about how humans make good data from bad data.

So what is bad data? Well, “junk” doesn’t work if we assume that every utterance is somehow a representation of linguistic units stored in the lexicon – or whatever we call it. Sadly, I don’t have the space to elaborate on all past approaches (see Vincent 1995; Willis 2010, 2016; Lass 1997; and Van de Velde & Norde 2016 for a review), but my hypothesis can be summed up as follows: breakdown in language can, over time, render structures increasingly difficult to acquire; this can reach a point where the target structure – dare I say parameter – is no longer acquirable from the input. The child is then faced with the choice of losing the structure or finding any other possible analysis. What’s the difference between this and any other reanalysis, I hear you ask. Well, one standard view is that reanalysis works on the basis of ambiguity between possible analyses: if there are two or more possible analyses, the child is more likely to choose the simpler one (2a). If the more economical analysis were not found, the original would still be available from the input. I argue that in exaptation the original analysis is instead removed completely for the acquirer (2b); any new analysis therefore does not rely on ambiguity between the target and other analyses, as the target simply does not factor into the child’s making sense of the input.

I have tried to test this for syntax alone, whereas past work focused more on morphosyntax. The questions I am trying to answer are: how pervasive is exaptive reanalysis, and what strategies do children use to find analyses when they cannot draw on strategies of economy? To these ends, I am looking for explanations orthogonal to Universal Grammar. My MPhil thesis research on the collapse of V2 and its reanalysis as Locative Inversion in Early Modern English, involving the actuation of locative formal features (e.g. out of the woods came the bear), seems to suggest that phonologically silent syntactic heads might be especially vulnerable to this kind of change, as their acquisition is dictated purely by overt syntax (3a,b: trees for those who like them – click on the “Read more” button). Metaphorically speaking, we knew Pluto was there before we could see it because we could see things orbiting it. Syntax works similarly; the only difference is that if we change an orbit, we change the planet – or rather the syntactic head – too. I am pursuing these ideas in larger case studies as part of my PhD project at the Humboldt University of Berlin, where I am now part of Artemis Alexiadou’s research group. I am also trying to see how grammar competition, language contact, and exaptive reanalysis might go hand in hand in certain situations.

Continue reading “Exaptation: acquiring the unacquirable”