A path to extract word meanings from texts and resolve lexical polysemy

January 12, 2018
Every year Russia hosts "Dialogue", its largest conference on computational linguistics, where experts discuss methods for the computer analysis of Russian, assess the state of computational linguistics, and set directions for its development. Each year the conference also runs Dialogue Evaluation, a series of competitions in the automatic processing of Russian. In this post we explain how the Dialogue Evaluation competitions are organized, and then look more closely at one of them, RUSSE, and at what awaits its participants this year. Let's go.

Dialogue Evaluation – a competition for assessing the quality of Russian text analysis methods

Over the last seven years, 13 Dialogue Evaluation competitions have been held on a variety of topics, from a competition of morphological analyzers to a campaign assessing the quality of machine translation systems. All of them resemble the international SemEval competitions for text analysis systems, but focus on the particulars of processing Russian text, such as its rich morphology and a word order that is freer than in English. In structure, Dialogue Evaluation tasks are similar to Kaggle tasks in data analysis, SemEval tasks in computational linguistics, TREC tasks in information retrieval, and ILSVRC tasks in image recognition.

Participants are given a task that must be solved within an agreed timeframe, usually a few weeks. SemEval and Dialogue Evaluation competitions are held in two stages. In the first stage, participants receive a description of the task and a training sample, which they can use to develop methods for solving the problem and to assess the quality of those methods. For example, the 2015 track provided pairs of semantically close words that participants could use to develop word vector models. The organizers specify which external resources may and may not be used. In the second stage, participants receive a test sample and must apply to it the models developed in the first stage. Unlike the training sample, the test sample contains no annotation. At this stage the annotation of the test sample is available only to the organizers, which guarantees the fairness of the competition. Finally, participants submit their solutions for the test sample to the organizers, who evaluate the results and publish the ranking of participants.

As a rule, once the competition is over, the test samples are made publicly available and can be used in further research. If the competition is held as part of a scientific conference, participants can publish reports on their participation in the conference proceedings.

RUSSE – a competition for evaluating methods of computational lexical semantics for Russian

RUSSE (Russian Semantic Evaluation) is a series of events for the systematic evaluation of methods of computational lexical semantics for Russian. The first RUSSE competition was held in 2015 as part of the "Dialogue" conference and compared methods for determining the semantic similarity of words. To assess the quality of distributional semantic models, Russian datasets were created for the first time, analogous to widely used English datasets such as WordSim353. More than ten teams evaluated word vector models for Russian such as word2vec and GloVe.

The second RUSSE competition takes place this year. It focuses on evaluating word sense embeddings and other models for word sense induction and disambiguation.

RUSSE 2018: extracting word meanings from texts and resolving lexical polysemy

Many words in a language have several meanings. However, simple word vector models such as word2vec do not take this into account and mix the different senses of a word into a single vector. The task of extracting word senses from texts and automatically disambiguating ambiguous words in a text corpus is meant to address this problem. Within the SemEval competitions, methods for automatically extracting word senses and resolving lexical polysemy have been studied for Western European languages: English, French, and German. No systematic evaluation of such methods has been carried out for the Slavic languages. The 2018 competition aims to draw researchers' attention to the problem of automatic resolution of lexical polysemy and to identify effective approaches to it, using Russian as the example.

One of the main difficulties in processing Russian and other Slavic languages is the absence or limited availability of high-quality lexical resources such as WordNet for English. We believe that the results of RUSSE will be useful for the automatic processing not only of the Slavic languages, but also of other languages with limited lexical-semantic resources.

Problem description

Participants in RUSSE 2018 are asked to solve a short-text clustering problem. During the testing phase, they receive a set of ambiguous words, for example the Russian word "замок" (which can mean either "lock" or "castle"), and a set of text fragments (contexts) in which these words occur, for example "the castle of Vladimir Monomakh in Lyubech" or "the movement of the bolt by the key in the lock". Participants must cluster the contexts so that each cluster corresponds to one sense of the word. The number of senses, and therefore the number of clusters, is not known in advance. In this example, the contexts need to be grouped into two clusters corresponding to the two senses of "замок": "a device that prevents access to something" and "a structure".
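To make the shape of the task concrete, here is a minimal sketch of context clustering with an unknown number of clusters. It is only a toy illustration, not one of the organizers' baselines: the greedy word-overlap heuristic, the `threshold` value, the function names, and the English example contexts are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contexts(contexts, target, threshold=0.2):
    """Greedy single-pass clustering of contexts of one target word.

    Each context joins the existing cluster whose pooled vocabulary is
    most similar to it (if the similarity reaches `threshold`);
    otherwise it opens a new cluster. The number of clusters is
    therefore not fixed in advance. Returns one arbitrary identifier
    per context, e.g. "lock#1".
    """
    cluster_vocabs = []  # pooled token set of each cluster
    labels = []
    for text in contexts:
        tokens = tokenize(text) - {target}
        best, best_sim = None, threshold
        for i, vocab in enumerate(cluster_vocabs):
            sim = jaccard(tokens, vocab)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            cluster_vocabs.append(set(tokens))
            best = len(cluster_vocabs) - 1
        else:
            cluster_vocabs[best] |= tokens
        labels.append(f"{target}#{best + 1}")
    return labels


# Made-up English contexts, for illustration only.
contexts = [
    "the key turned in the lock of the door",
    "he put the key in the lock and opened the door",
    "the canal lock raised the boat",
    "the boat waited at the lock on the canal",
]
labels = cluster_contexts(contexts, "lock")
```

A real system would replace the word-overlap similarity with something far stronger, such as sense embeddings, but the input and output have the same shape: a list of contexts in, one cluster identifier per context out.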

For the competition, the organizers have prepared three datasets from different sources. For each of them, participants need to fill in the "predicted value identifier" column (the identifier of the predicted sense) and upload the answer file through the CodaLab platform. CodaLab lets participants see their results immediately, calculated on a part of the test dataset.

The task in this competition is close to the task formulated for English at the SemEval-2007 and SemEval-2010 competitions. Note that in tasks of this type, participants are not given a reference list of the word's senses, the so-called sense inventory. A participant can therefore label contexts with arbitrary identifiers, for example lock#1 or lock (device).

The main stages of the competition

  • November 1, 2017 – publication of the training datasets
  • December 15, 2017 – release of the test datasets
  • January 15, 2018 – deadline for submitting results for evaluation
  • February 1, 2018 – announcement of the competition results

From December 15 to January 15, participants can upload their solutions to the CodaLab platform. In total, three datasets are to be labeled, each set up as a separate task on CodaLab.

Data sets

The competition offers three datasets for building models. They are based on different corpora and different sense inventories, as described in the table below:

[Table: the three datasets with their sense inventories and corpora]

The first dataset (wiki-wiki) uses the sense distinctions made in Wikipedia as its sense inventory; its contexts are taken from Wikipedia articles. The bts-rnc dataset uses the Great Explanatory Dictionary of the Russian Language (BTS), edited by S. A. Kuznetsov, as its sense inventory; its contexts are taken from the Russian National Corpus (RNC). Finally, active-dict uses the senses from the Active Dictionary of the Russian Language, edited by Yu. D. Apresyan; its contexts are also taken from the Active Dictionary, namely the examples and illustrations in the dictionary entries.

Six datasets (three main and three additional), built on different sense inventories, are available for training and developing systems. All training datasets are listed in the table:

[Table: the six training datasets]

The test datasets (wiki-wiki, bts-rnc, active-dict) are structured in the same way as the training ones, except that the field holding the sense of the target word, the "predicted value identifier", is empty. Participants fill in this field, and their answers are compared with the reference ones. Participating systems are ranked by a quality measure. The training and test datasets are labeled according to the same principle, but the ambiguous words in them are different.
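As a rough illustration of filling in that field, the sketch below reads a tab-separated table, writes a predicted identifier into each row, and returns the completed table. The tab-separated layout, the column names `word`, `context` and `predict_sense_id`, and the `predict` callback are all assumptions made for this example; check the published data files and instructions for the actual format.

```python
import csv
import io


def fill_predictions(tsv_text, predict, column="predict_sense_id"):
    """Fill `column` in every row of a tab-separated table.

    `predict` is any function taking (word, context) and returning an
    identifier string. The column names used here are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise ValueError(f"column {column!r} not found in the header")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[column] = predict(row["word"], row["context"])
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up table with an empty prediction column.
sample = (
    "context_id\tword\tpredict_sense_id\tcontext\n"
    "1\tlock\t\tthe key is in the lock\n"
)
# A trivial "system" that assigns every context to the same sense.
filled = fill_predictions(sample, lambda word, context: word + "#1")
```

Because the identifiers are arbitrary, only the grouping they induce matters: any consistent naming scheme for the clusters of a word would be scored the same way.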

Quality measure

As in other similar competitions, such as SemEval-2010, the quality of a system is assessed by comparing its answers with a gold standard. The gold standard is a set of sentences in which human annotators have determined the senses of the ambiguous target words: the target word in each sentence is manually assigned an identifier from the given sense inventory. Once a participant's system has assigned a predicted sense identifier to the target word in every sentence of the test sample, we compare the system's grouping of sentences by word sense with the gold standard. This amounts to comparing two clusterings, and we use the adjusted Rand index for it.
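For readers unfamiliar with the measure, here is a minimal sketch of the standard adjusted Rand index for two labelings of the same items. It is a generic textbook implementation, not the organizers' evaluation script, and it says nothing about how scores for different words are aggregated; the function name and the example labels are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, predicted):
    """Adjusted Rand index between two labelings of the same items.

    Label names do not matter, only which items are grouped together.
    The score is 1.0 for identical groupings and close to 0.0 for a
    grouping that agrees with the gold standard no better than chance.
    """
    if len(gold) != len(predicted):
        raise ValueError("both labelings must cover the same items")
    n = len(gold)
    if n < 2:
        return 1.0  # nothing to compare: treat as perfect agreement

    # Pairs of items that are together in both / in gold / in predicted.
    together_both = sum(comb(c, 2) for c in Counter(zip(gold, predicted)).values())
    together_gold = sum(comb(c, 2) for c in Counter(gold).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = together_gold * together_pred / comb(n, 2)
    maximum = (together_gold + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (together_both - expected) / (maximum - expected)


gold = ["castle", "castle", "lock", "lock"]
same_grouping = adjusted_rand_index(gold, ["a", "a", "b", "b"])
crossed_grouping = adjusted_rand_index(gold, ["a", "b", "a", "b"])
```

In this example `same_grouping` is 1.0 even though the identifiers differ from the gold ones, while `crossed_grouping` is negative because the predicted clusters split every gold cluster in half.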

Competition tracks

The competition includes two tracks:

  • In the knowledge-free track (no linguistic resources), participants must cluster the contexts by sense and assign an identifier to each sense using only a text corpus.
  • In the knowledge-rich track, participants can use any additional resources, such as dictionaries, to identify the senses of the target words.

Our approach to evaluation means that practically any model for resolving lexical polysemy can take part in the competition: both unsupervised machine learning approaches (distributional vector representations, graph-based methods) and approaches based on lexical resources such as WordNet.

Reference systems

To make the task easier to understand, we are publishing several ready-made solutions against which participants can compare their results. For the knowledge-free track, we recommend looking at unsupervised systems for resolving lexical polysemy, such as AdaGram. For the knowledge-rich track, we recommend using vector representations of word senses built from existing lexical-semantic resources such as RuThes and RuWordNet. This can be done with methods such as AutoExtend.

Recommendations to participants

The training datasets have already been published. As a starting point, you can take our models from the GitHub repository (which includes a detailed manual) and refine them. Please follow the instructions, but do not hesitate to ask questions: the organizers will be happy to help!

Discussion and publication of the results

Participants are invited to write an article about their system and submit it to the international conference on computational linguistics "Dialogue 2018". The proceedings of this conference are indexed by Scopus. The results of the competition will be discussed in a special section of the conference.

Contacts

The organizers of the competition will be happy to answer your questions in the Google group and on Facebook. For more information, see the competition page.
