Extending the YAGO Knowledge Base
Axis : DataSense
Subject : Extending the YAGO knowledge base
Director : Fabian SUCHANEK
Institution : Télécom ParisTech, Max Planck Institute for Informatics in Germany
Administrator laboratory : LTCI
PhD Student : Thomas REBELE
Beginning : March 15, 2015
Defense : July 19, 2018
Scientific production :
Within the framework of Thomas Rebele's PhD thesis “Extending the YAGO knowledge base” and Katerina Tzompanaki's postdoctoral fellowship:
- Thomas Rebele, Thomas Pellissier Tanon, Fabian M. Suchanek: “Bash Datalog: Answering Datalog Queries with Unix Shell Commands”, International Semantic Web Conference (ISWC), 2018
- Thomas Rebele, Katerina Tzompanaki, Fabian M. Suchanek: “Adding Missing Words to Regular Expressions”, Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2018
- Thomas Rebele, Arash Nekoei, Fabian M. Suchanek: “Using YAGO for the Humanities”, Workshop on Humanities in the Semantic Web (WHISE), 2017
- Thomas Rebele, Katerina Tzompanaki, Fabian M. Suchanek: “Visualizing the addition of missing words to regular expressions”, International Semantic Web Conference (ISWC) demo track, 2017
- Thomas Rebele, Fabian M. Suchanek, Johannes Hoffart, Joanna Asia Biega, Erdal Kuzey, Gerhard Weikum: “YAGO: a multilingual knowledge base from Wikipedia, Wordnet, and Geonames”, International Semantic Web Conference (ISWC) resource paper track, 2016
- Hiep Le, Thomas Rebele, Fabian M. Suchanek: “Open Digital Forms”, Theory and Practice of Digital Libraries (TPDL/ECDL) demo track, 2016
PhD thesis manuscript: https://www.thomasrebele.org/publications/2018_phd_thesis.pdf
A knowledge base is a set of facts about the world. YAGO was one of the first large-scale knowledge bases to be constructed automatically. This thesis focuses on extending the YAGO knowledge base along two axes: extraction and preprocessing.
The first main contribution of this thesis is increasing the number of facts about people. The thesis describes algorithms and heuristics for extracting more facts about birth and death dates, gender, and place of residence. The thesis also shows how to use these data for studies in Digital Humanities.
The second main contribution consists of two algorithms for repairing a regular expression automatically so that it matches a given set of words. Experiments on various datasets show the effectiveness and generality of these algorithms. Both algorithms improve the recall of the initial regular expression while achieving similar or better precision.
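To make the repair task concrete, the following minimal sketch shows a naive baseline, not the algorithms of the thesis: it appends every missing word as a literal alternative to the original expression. The function name `naive_repair` and the example pattern are assumptions chosen for illustration. Such a baseline reaches full recall on the given words but does not generalize to similar unseen words, which is presumably the gap that a more careful repair aims to close.

```python
import re


def naive_repair(pattern: str, missing_words: list[str]) -> str:
    """Return a regex matching everything `pattern` matches, plus the missing words.

    Hypothetical baseline for illustration only: each word that the original
    pattern does not fully match is added as an escaped literal alternative.
    """
    compiled = re.compile(pattern)
    # Keep only the words the original expression fails to match completely.
    unmatched = [w for w in missing_words if not compiled.fullmatch(w)]
    if not unmatched:
        return pattern
    alternatives = "|".join(re.escape(w) for w in unmatched)
    # Non-capturing groups keep the original pattern and the additions separate.
    return f"(?:{pattern})|(?:{alternatives})"


# Example: a year-month pattern that misses a textual date.
repaired = naive_repair(r"\d{4}-\d{2}", ["2018-07", "July 2018"])
print(repaired)
```

The repaired expression accepts "July 2018" but still rejects "June 2018", which illustrates why adding literals alone does not generalize.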
The last contribution is a system for translating database queries into Bash scripts. This approach allows preprocessing large tabular datasets and knowledge bases by executing Datalog and SPARQL queries, without installing any software beyond a Unix-like operating system. Experiments show that the performance of the system is comparable to that of state-of-the-art systems.
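As a rough illustration of the idea of compiling a query into shell commands, the sketch below turns a single selection-and-projection over a tab-separated file into an `awk` command line. This is a simplified assumption about how such a translation could look, not the translation scheme of the actual system; the function name `atom_to_awk`, the file name `facts.tsv`, and the relation name `wasBornIn` are hypothetical. The sketch also assumes that constants contain no double quotes or backslashes.

```python
import shlex


def atom_to_awk(tsv_path: str, constants: dict[int, str], output_columns: list[int]) -> str:
    """Build an awk command that filters and projects rows of a TSV file.

    constants maps 1-based column numbers to the value required in that column;
    output_columns lists the 1-based columns to print, in order.
    Hypothetical helper for illustration only.
    """
    # One equality test per constant; "1" (always true) if there are none.
    conditions = " && ".join(
        f'${col} == "{value}"' for col, value in sorted(constants.items())
    ) or "1"
    projection = ", ".join(f"${col}" for col in output_columns)
    program = f"{conditions} {{ print {projection} }}"
    # Tab is used as both the input and the output field separator.
    return f"awk -F'\\t' -v OFS='\\t' {shlex.quote(program)} {shlex.quote(tsv_path)}"


# Roughly corresponds to a rule such as: answer(X, Y) :- facts(X, "wasBornIn", Y).
command = atom_to_awk("facts.tsv", {2: "wasBornIn"}, [1, 3])
print(command)
```

Joins between several atoms, recursion, and SPARQL input are outside the scope of this sketch.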