MAEL - MultimediA Entity Linking
Axis : DataSense
Coordinators : Hervé Le Borgne
Candidate : Omar Adlali
Institutions : CEA list, LIMSI
Administrator laboratory : CEA list
Reference working group : GT D2K & GT TAL&SEM
Engagements : June 2018 - May 2019
The normalization of named entities in queries is known to have a positive impact on information retrieval. Named entity disambiguation, also called entity linking, consists in automatically linking mentions of entities identified in a text to entities in a knowledge base, thus yielding an unambiguous normalization of these entities.
This task is sometimes generalized to a more complex one that globally disambiguates all the concepts of a text with respect to a given knowledge base, whether they are named entities or nominal expressions (e.g. Wikify or Babelfy). An entity disambiguation system usually has three main components (Ji et al., 2014). First, the query text is analyzed to identify "entity mentions" that can be disambiguated with respect to the reference knowledge base. Then, for each entity mention, the system produces several "candidate entities" from the knowledge base. Finally, it selects the best entity among the candidates. One of the main difficulties in this context is handling the very large number of entities generally present in the knowledge base, which gives rise to many ambiguities.
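The three components described above can be illustrated with a deliberately naive sketch. Everything in it is an assumption made for illustration: the toy knowledge base `KB`, the capitalized-word mention detector, and the word-overlap scoring are hypothetical stand-ins, not the methods used in the project or in Ji et al. (2014).

```python
import re
from collections import Counter

# Hypothetical toy knowledge base: entity id -> (surface names, description).
KB = {
    "Paris_(France)": ({"paris"}, "capital city of france on the seine"),
    "Paris_(Texas)": ({"paris"}, "city in texas united states"),
    "Paris_Hilton": ({"paris", "paris hilton"},
                     "american media personality and businesswoman"),
}


def detect_mentions(text):
    """Component 1: naive mention detection (runs of capitalized words)."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def generate_candidates(mention):
    """Component 2: candidate entities whose known names match the mention."""
    key = mention.lower()
    return [eid for eid, (names, _) in KB.items() if key in names]


def bag_of_words(text):
    """Lowercased word counts, used as a crude context representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def select_entity(text, candidates):
    """Component 3: keep the candidate whose description shares the most
    words with the query text (an assumed, very simple scoring function)."""
    context = bag_of_words(text)

    def overlap(eid):
        return sum((context & bag_of_words(KB[eid][1])).values())

    return max(candidates, key=overlap)


def link(text):
    """Run the three components and return {mention: entity id}."""
    links = {}
    for mention in detect_mentions(text):
        candidates = generate_candidates(mention)
        if candidates:  # mentions unknown to the knowledge base are skipped
            links[mention] = select_entity(text, candidates)
    return links


if __name__ == "__main__":
    print(link("The Seine flows through Paris, the capital of France."))
    print(link("She moved to Paris, a small city in Texas."))
```

Even with three entities, the mention "Paris" is ambiguous, which hints at the difficulty with knowledge bases of several million entries.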
To date, named entity disambiguation as such has dealt exclusively with textual data, at the intersection of natural language processing and knowledge representation. MAEL aims to use visual information to help disambiguation whenever it can be useful. This is of course the case when the analysed document is multimodal by nature, such as a text accompanied by illustrative photos, or the subtitles (or audio transcription) of a video. The visual recognition of a person, a film, a place, or an organization via its logo can then greatly facilitate disambiguation. More subtly, some concepts are more easily represented visually than textually, especially where colour is concerned.
The main objective of the post-doc is to determine how visual information can benefit a named entity disambiguation system.
Implementing a multimedia entity linking system requires exploring several questions. The first is to determine which types of entity can benefit from the visual dimension, and which methods are available to extract this information. Examples are face recognition for people or logo recognition for organizations. Such specialized approaches generally yield a significant gain in recognition performance. Nevertheless, this heterogeneity of approaches poses several problems. First, the system can quickly become overly complex if it has to run a whole battery of visual detectors for each disambiguation. Pre-selecting the visual tools according to the entity type inherits any error made in identifying that type and therefore cannot correct it. Moreover, in the multimodal case, text and image may not be strictly consistent. For example, an article about an advertisement for an organization can be illustrated by a portrait of one of its executives or a spokesperson, one of its flagship products, or the buildings of its head office.
Thus, one of the main challenges is to establish a homogeneous representation of visual information despite the possible heterogeneity of the methods used to extract it. Moreover, this visual representation must also be easily comparable to the representations derived from the text. Creating a space common to all these representations appears to be a promising avenue, although possibly difficult to implement. The post-doc co-supervisors have already proposed such spaces in the context of cross-modal retrieval and classification, for the text and image modalities. In addition, at CEA they implemented a textual entity linking system with several million entities in the knowledge base, and at LIMSI they explored the joint learning of distributed representations of words and entities in the same space, which provides a robust model for comparing the local context of an entity mention with the candidate entities.
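To make the idea of a common space concrete, here is a minimal sketch under strong assumptions. The projection matrices `W_TEXT` and `W_IMAGE`, the entity vectors in `ENTITIES`, and the weighted-sum fusion controlled by `alpha` are all hypothetical values and choices made for illustration; in a real system the projections would be learned, and the project does not specify this particular fusion scheme.

```python
import math


def project(matrix, vector):
    """Apply a linear map (list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is null."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical projections into a 2-d common space (hand-set, not learned).
W_TEXT = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]          # 3-d text features -> common space
W_IMAGE = [[0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]    # 4-d visual features -> common space

# Hypothetical candidate entities, already embedded in the common space.
ENTITIES = {
    "Apple_Inc.": [1.0, 0.1],
    "Apple_(fruit)": [0.1, 1.0],
}


def rank(text_features, image_features, alpha=0.5):
    """Rank candidates by a weighted sum of text and image similarities.

    alpha is an assumed fusion weight: 1.0 uses text only, 0.0 image only.
    """
    t = project(W_TEXT, text_features)
    v = project(W_IMAGE, image_features)
    scores = {
        name: alpha * cosine(t, emb) + (1.0 - alpha) * cosine(v, emb)
        for name, emb in ENTITIES.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    ambiguous_text = [0.5, 0.5, 0.3]       # equally close to both candidates
    logo_like_image = [0.0, 0.9, 0.0, 0.1]
    print(rank(ambiguous_text, logo_like_image))
```

In this toy setting the text features alone are equally close to both candidates, and the visual features break the tie. Because every modality is compared in the same space, the visual extractor could be swapped without changing the ranking code.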
At large scale, building a multimedia knowledge base is also a challenge, both in terms of data collection and annotation. In the field of vision, this issue has been actively addressed since the re-emergence of deep learning in 2012. The post-doc co-supervisors have contributed to this field, on the one hand by using weakly annotated data to train convolutional networks, and on the other hand by improving visual representations in a transfer learning context at almost zero annotation cost.
Scientific production :