Datathon on Text Annotation: How Programmers and Humanities Specialists Analyze Texts
This past weekend, ITMO University held its first-ever datathon on text annotation. Over 80 participants from fields ranging from programming to the humanities competed to solve tasks on the linguistic annotation of historical and culturological sources. The datathon went hand in hand with an extensive educational program presented by educators from the seminar on natural language processing organized by ITMO together with Huawei. In their presentations, they focused on current tasks in applied artificial intelligence and modern trends in natural language processing. Here’s more about the datathon.
Format of the datathon
The datathon was a two-day competition in which participants teamed up to develop the best solution for a task based on the analysis of humanities data. It was organized by ITMO University’s Digital Humanities Lab (DHLab) and the machine learning center of the international laboratory “Computer Technologies”, whose specialists conduct research on machine learning, evolutionary computation, and discrete optimization. DHLab is both ITMO University’s international scientific laboratory for digital humanities research and the first large-scale project in St. Petersburg to popularize digital humanities, a field at the intersection of IT and the liberal arts (you can read about the project in this ITMO.NEWS article).
The company Huawei took on the role of the main partner of ITMO University’s text annotation datathon. Since December 2018, Huawei and the university’s Computer Technologies laboratory have been hosting regular seminars on natural language processing. Among the event’s other partners was the State Historical Museum of Religion, which houses religious artifacts from ancient times to the present day.
According to Antonina Puchkovskaya, head of ITMO University’s DHLab, what made the datathon special was that it brought together not only programmers but also specialists from such humanities fields as linguistics, culturology, history, and the fine arts.
“The main goal of the datathon was to solve a problem that we encounter as part of our large interdisciplinary project on the creation of an interactive map of St. Petersburg. In this project, we analyze different humanities sources, ranging from memoirs to periodicals. We had to come up with a way to annotate all this data, so we decided to bring that problem to the datathon. As we work with original sources in the Russian language, we invited not only programmers but also humanities specialists: the task required an in-depth understanding of the texts’ structure,” she notes. “Datathons are now common practice in many countries all over the world, but in Russia, the format is yet to become widespread. That’s why, apart from our main task, we also aim to promote events like this one, among the humanities community in particular.”
The participants of the datathon, as well as others interested in the analysis of humanities data and in working in the field of digital humanities, had the opportunity to talk with experts in the field and learn about ITMO University’s new Master’s program “Data, Culture and Visualization”, which is already accepting applications for its first intake. The program trains specialists capable of developing algorithms and software for intelligent and linguistic data analysis, as well as of applying information technologies in the humanities with the help of data mining, machine learning, and computational linguistics tools.
The first day of the datathon included lectures by ITMO and Huawei specialists, who talked about research objectives in applied artificial intelligence and modern trends in natural language processing, and shared cases from their experience of working in the field.
For instance, Huawei Technologies specialists Denis Teslenko and Vlad Tretyak provided insights into named entity recognition; Evgenia Bogacheva, a staff member at the machine learning center of ITMO’s Computer Technologies laboratory, gave a presentation on coreference resolution; and Daria Rodionova, also from Huawei Technologies, drew on her own projects to talk about sentiment analysis and how to teach a machine to tell good from bad in events and people.
“Also known as tonal analysis, sentiment analysis is one of the most oft-encountered tasks of natural language processing. It is known, for example, that upon hearing some news, humans always try to give it an emotional connotation: whether it’s good, bad or neutral for them personally. So if we as researchers gather a bunch of these connotations as ascribed to an object or phenomenon, we can successfully use these to create different ratings. Movie ratings have the same underlying principle, as do the ones for goods and services. The main difficulty here, however, is ensuring maximum objectivity, as any evaluation is of course rather subjective,” shared the expert.
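The rating idea described above can be sketched with a minimal lexicon-based scorer. The word lists and the counting rule below are invented for illustration and are not taken from the speakers’ projects; real systems use much larger lexicons or trained models.

```python
# A minimal lexicon-based sentiment scorer (illustrative only; the word
# lists and the simple counting rule are invented for this sketch).
POSITIVE = {"good", "great", "excellent", "enjoyed", "love"}
NEGATIVE = {"bad", "terrible", "boring", "hate", "broken"}

def sentiment(text: str) -> str:
    """Label a text as positive, negative, or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I enjoyed this great movie"))   # positive
print(sentiment("The plot was boring and bad"))  # negative
```

Averaging such labels over many reviews of one object is, in essence, how the ratings mentioned above are built; the hard part the expert points to, subjectivity, shows up here as the choice of the word lists themselves.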
Why texts need to be annotated and how it can be of use in culturological data analysis
All in all, over 80 young professionals took part in the datathon. On the first day, the participants split into interdisciplinary teams, each including at least one programmer and one humanities specialist. They then set to work on a task in the linguistic annotation of historical and culturological sources, needed to create a corpus of texts and enable the subsequent training of a neural network.
At the teams’ disposal was a large body of culturally significant data, such as the diaries of composers and prominent figures of the 18th-19th century St. Petersburg theater scene, as well as texts on the history of rock music in St. Petersburg and similar sources on jazz and theater. All of these materials need to be processed and analyzed as part of the project to create an interactive map of St. Petersburg, currently underway at ITMO University’s Digital Humanities Lab.
The datathon participants’ task was to use this data to create a linguistic annotation (in other words, to find links between words in the texts) that would make it possible to group mentions referring to the same entity. The links in question could be of different kinds, with one example being finding names that allude to one real entity. For instance, while Pyotr Tchaikovsky may be referred to as “Pyotr Tchaikovsky”, “Pyotr”, or “a great Russian composer”, all of these mentions still pertain to one person. Another example is zero anaphora, a linguistic term for situations where an entity is clearly understood, yet the sentence lacks a word or phrase referring to it; the missing reference can be restored through contextual analysis.
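One way to picture the result of such an annotation is as clusters of mentions, each cluster pointing to one real-world entity. Here is a minimal sketch; the clusters and the helper function are invented for illustration, and a real annotation would store character offsets into the source text rather than bare strings.

```python
# Illustrative coreference clusters: each set groups mentions of one entity.
# The spans here are invented for the example; real annotations record
# character offsets in the source text, not detached strings.
clusters = [
    {"Pyotr Tchaikovsky", "Pyotr", "a great Russian composer"},
    {"St. Petersburg", "the city"},
]

def corefer(mention_a: str, mention_b: str) -> bool:
    """True if the two mentions belong to the same annotated cluster."""
    return any(mention_a in c and mention_b in c for c in clusters)

print(corefer("Pyotr", "a great Russian composer"))  # True
print(corefer("Pyotr", "the city"))                  # False
```

A neural model trained on such a corpus learns to produce these clusters automatically for unseen texts.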
“Consider the sentence “I bought a car and a bike, the first broke and the second I can’t ride and got rusty”. Here, the word “first” refers to the car and “second” refers to the bike, but the reference of the phrase “got rusty” is not so unequivocal. One part of the task for the datathon participants was to tackle this ambiguity and ensure that the references trace to the right entity,” says Andrey Filchenkov, head of the machine learning center of ITMO University’s international laboratory “Computer Technologies”. “Thus, the work that was carried out as part of the datathon also contributed to the creation of a coreference corpus of the Russian language. Coreference resolution is a classic task that is central to natural language processing. Usually, when it comes to working with the Russian language, we encounter the problem of the lack of resources, as it’s not as replete with data as is English. When we finish this corpus, we plan to make it public to help our colleagues in the field in developing new models and conducting a comprehensive analysis.”
Evgenia Bogacheva adds that the participants of the datathon not only helped expand the corpus but also advanced the development of the first Russian-language corpus of zero anaphora. While such tasks have long been studied in English-language research, in Russian, where the omission of a subject is a very common structural feature, specialists still have a lot of work to do and problems to solve before the zero anaphora corpus is finally created.
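A zero anaphora annotation can be pictured as a record that links a clause with a missing subject to a restored referent. The field names below are invented for this sketch (they are not the datathon’s actual format), and the resolution shown is just one possible reading of the ambiguous car-and-bike example above.

```python
# Illustrative zero anaphora record: "got rusty" has no overt subject, so
# the annotator restores one. Field names are invented for this sketch,
# and "a car" is only one possible reading of the ambiguous sentence.
record = {
    "sentence": "I bought a car and a bike, the first broke "
                "and the second I can't ride and got rusty",
    "zero_anaphora": [
        {"clause": "got rusty", "restored_subject": "a car"},
    ],
}

print(record["zero_anaphora"][0]["restored_subject"])  # a car
```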
Each text was annotated by two participants, which, the organizers say, significantly increased the quality of the finished annotation. When the annotators’ opinions clashed, team members had the opportunity to appeal to the judges, all of them expert linguists. To help them in their task, the participants used a software toolset created specially by the datathon organizers.
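Double annotation like this is commonly evaluated with an inter-annotator agreement measure such as Cohen’s kappa. The article doesn’t say which metric, if any, the organizers used, so the sketch below is purely illustrative, with made-up labels marking whether each mention pair is coreferent.

```python
# Cohen's kappa: agreement between two annotators, corrected for chance.
# The labels below are invented; "coref"/"none" mark whether a mention
# pair was judged coreferent. Illustrative only, not the datathon's metric.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labeled at random with their own frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["coref", "coref", "none", "coref", "none", "none"]
b = ["coref", "none", "none", "coref", "none", "coref"]
print(round(cohens_kappa(a, b), 2))  # 0.33
```

A kappa near 1.0 means the two annotations almost always coincide; low values flag exactly the disagreements that, at the datathon, went to the judges for adjudication.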
It was the GUM team that emerged as the winner of the two-day datathon, receiving 15,000 rubles in prize money. The teams that claimed second and third place were awarded free tickets to the State Historical Museum of Religion and prizes from ITMO University. In addition, all participants received commemorative gifts from the organizers and bonus points toward applications to the new international Master’s program “Data, Culture and Visualization”.