Natural Language Processing meets social media corpora

by Yin Yin Lu (University of Oxford)

From 17 to 19 May I attended the CLARIN workshop on the ‘Creation and Use of Social Media Resources’ in Kaunas, Lithuania. The thirty participants represented a broad range of backgrounds: computer science, corpus linguistics, political science, sociology, communication and media studies, sociolinguistics, psychology, and journalism. Our goal was to share best practices in the large-scale collection and analysis of social media data, particularly from a natural language processing (NLP) perspective.

As Michael Beißwenger noted during the first workshop session, there is a ‘social media gap’ in the corpus linguistics landscape. This is because social media corpora are the “naughty stepchild” of text and speech corpora. Traditional natural language processing tools (for, e.g., news articles, political documents, speeches, essays, books) are not always appropriate for social media texts, given the unique communicative characteristics of such texts. Part-of-speech tagging, tokenisation, dependency parsing, sentiment analysis, irony detection, and topic modelling are notoriously difficult. In addition, the personal nature of much social media creates legal and ethical challenges for the data mining and dissemination of social media corpora: Twitter, for example, forbids researchers from publishing collections of tweets; only their IDs can be shared.

I made invaluable connections with researchers at the intersection of NLP and social media data – and Twitter data in particular, which is the area of my own research. Dirk Hovy, an associate professor at the University of Copenhagen, spoke broadly about the challenges of NLP: engineers assume that all language is independent and identically distributed. This is clearly not true, as language is driven by demographic differences. How can we add extra-linguistic information to NLP models? His proposed solution is word embedding: transforming words into vectors, trained on large amounts of data from different demographic groups. These vectors should capture the linguistic peculiarities of the groups.

A variant of word embedding is document embedding – and tweets can be treated as documents. Thus, it should be possible to transform tweets into vectors to capture the demographic-driven linguistic differences that they contain. I will be applying this approach to my own corpus of 12 million tweets related to the EU referendum.
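To make the idea of turning a tweet into a vector more concrete, here is a minimal sketch in Python. It uses the simplest possible stand-in for document embedding: averaging the vectors of the words in a tweet. This is not the method presented at the workshop, nor the one I will use on my corpus; the three-dimensional vectors and the names `TOY_VECTORS`, `embed_tweet`, and `cosine` are invented for illustration, and real embeddings would be learned from large amounts of data.

```python
import math

# Hypothetical toy word vectors (assumption: real embeddings are learned
# from large corpora and typically have hundreds of dimensions).
TOY_VECTORS = {
    "leave": [0.9, 0.1, 0.0],
    "remain": [0.1, 0.9, 0.0],
    "vote": [0.5, 0.5, 0.2],
    "eu": [0.4, 0.6, 0.1],
}


def embed_tweet(text, vectors=TOY_VECTORS):
    """Return the average vector of the known words in a tweet.

    Words are split on whitespace only; unknown words are ignored.
    Returns None if no word in the tweet has a vector.
    """
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a = embed_tweet("vote leave")
    b = embed_tweet("vote remain")
    c = embed_tweet("leave eu")
    # With these toy vectors, the two 'leave' tweets end up closer together.
    print(cosine(a, c) > cosine(a, b))
```

If the vectors were trained separately on data from different demographic groups, the same comparison could in principle show how differently those groups use the same words.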

Andrea Cimino, a postdoc from the Italian NLP Lab, spoke about his work on adapting existing NLP tools, which are trained on traditional text, for social media text. The NLP Lab has developed the best POS tagger for social media, based on deep neural networks (long short-term memory networks), which are able to capture long-range relationships between words in a sentence. The tagger has achieved 93.2% accuracy, but currently works only on Italian texts. Similar taggers can be developed for English texts, given the appropriate training data.

Rebekah Tromble, an assistant professor at Leiden University, presented on the limitations and biases of data collected from Twitter’s Application Programming Interface (API). There are two public APIs that can be used: the historical Search API and the real-time Streaming API. The Search API returns up to 18,000 tweets from the last seven to ten days, whichever limit is reached first. The Streaming API allows up to 1% of all tweets to be collected in real time; as there are 500 million tweets a day, this is approximately 5 million tweets a day.

Through the statistical analysis of samples, Tromble demonstrated that bias is likely to be introduced by both APIs: the samples are not random. Moreover, there are systematic differences between the factors that influence which tweets are collected from the Search and Streaming APIs. Retweets are less likely to appear in Search API results, and even less likely to appear in the Streaming API. In both APIs, tweets with more mentions are less likely to appear, while tweets with more hashtags are more likely to appear. In the Streaming API, tweets with multimedia (images, videos, animated GIFs) are prioritised – but not in the Search API.

All of these results seem to suggest that content matters more than user characteristic variables (number of tweets, followers, or friends) for Twitter APIs: original tweets that are rich in media and connected with broader discourses are emphasised, regardless of how popular, prolific, or engaged their authors are. The fact that public APIs are biased toward especially ‘rich’ content has clear consequences for interpretation. As I am examining both content and user variables in the context of Brexit tweets, Tromble’s findings are highly relevant for my research.

There were two practical sessions with Twitter data – the first was led by Nikola Ljubešić from the Jožef Stefan Institute, on harvesting, processing, and visualising geo-encoded tweets using the TweetCat tool. We learned how to collect tweets published in low-frequency languages using seed words, and how to filter tweets in these languages from geo-encoded data. We visualised these tweets on maps.
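The seed-word idea can be sketched in a few lines of Python. This is only an illustration of the general principle, not how TweetCat itself is implemented: the seed list (a handful of common Lithuanian words), the `min_hits` threshold, the assumed tweet format (a dictionary with a `text` key), and all function names are my own assumptions.

```python
import re

# Hypothetical seed list: a few very common Lithuanian words.
# A real seed list would be chosen carefully for the target language.
SEED_WORDS = {"ir", "kad", "yra", "labai", "tai"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def matches_language(text, seeds=SEED_WORDS, min_hits=2):
    """True if the text contains at least `min_hits` distinct seed words."""
    hits = sum(1 for token in set(tokenize(text)) if token in seeds)
    return hits >= min_hits


def filter_tweets(tweets, seeds=SEED_WORDS, min_hits=2):
    """Keep only the tweets whose text matches the seed-word heuristic.

    Assumes each tweet is a dict with a 'text' key.
    """
    return [t for t in tweets if matches_language(t["text"], seeds, min_hits)]


if __name__ == "__main__":
    sample = [
        {"text": "Šiandien yra labai gražu"},
        {"text": "It is a nice day"},
    ]
    for tweet in filter_tweets(sample):
        print(tweet["text"])
```

Requiring at least two distinct seed words is a crude guard against false positives, since a single short word can occur in several languages.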

The second session was a demonstration of an open source toolkit for tweet analysis based on GATE, software developed for text processing. Diana Maynard, the lead computational linguist on the GATE research team, presented how the software can collect, analyse, index, query, and visualise tweets. Analysis tools include semantic annotation, topic detection, sentiment analysis, and user classification. GATE has been applied to tweets relating to Brexit, Trump, and Earth Hour. In fact, a ‘Brexit Analyser’ pipeline has been developed and is available on the cloud. It would be interesting to compare the results of my analysis with the results of this pipeline.

On the final day of the workshop I delivered a presentation on my own research on resonance and rhetoric in the EU referendum. What makes a political message work on social media – how can it be expressed and delivered most effectively? The feedback was very useful, and has provided me with different NLP approaches to explore, as well as provoked me to reflect upon the role of gatekeepers in virality. Message success might not only be due to content and user features, but also to the structure of the user’s network.

I left the workshop feeling very satisfied with all of the connections I had made; I do not think I could have met these researchers in any other context. There will probably be a follow-up workshop a year from now – by then, NLP tools will be more advanced, but social media APIs will have evolved as well. At the end of the day, analysing social media content is a cat-and-mouse game.
