Minoan ER

Minoan ER is an Entity Resolution (ER) framework built by researchers in Crete, the land of the ancient Minoan civilization. Entity resolution aims to identify descriptions that refer to the same entity within or across knowledge bases.

What created the need for the Minoan ER framework?

Over the past decade, numerous knowledge bases (KBs) have been built to power large-scale knowledge sharing, as well as entity-centric Web search that mixes structured data with text querying. These KBs offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places, products, events), published on the Web as Linked Data. Traditionally, KBs were manually crafted by dedicated teams of knowledge engineers, as in the pioneering Wordnet and Cyc projects. Today, more and more KBs are built from existing Web content using information extraction tools. Such an automated approach offers an unprecedented opportunity to scale up KB construction and leverage existing knowledge published in HTML documents.

Although they may be derived from the same data source (e.g., a Wikipedia entry), KBs (e.g., DBpedia, Freebase) may provide multiple, non-identical descriptions of the same real-world entities. This is mainly due to the different information extraction tools and curation policies employed by each KB, resulting in complementary and sometimes conflicting entity descriptions. Entity resolution (ER) aims to identify descriptions that refer to the same real-world entity, appearing either within or across KBs. ER is essential for improving interlinking in the Web of data, even by third parties, and thus for raising the level of analytics and sense-making over Web data content.

Isn't that an old problem?

ER is a problem that has been studied extensively in data warehouses. However, new ER challenges stem from the openness of the Web of data, where entities are described by an unbounded number of KBs; from the semantic and structural diversity of the descriptions provided across domains, even for the same real-world entities; and from the autonomy of KBs in the processes they adopt for creating and curating entity descriptions. In particular:

The scale, diversity and graph structuring of entity descriptions in the Web of data challenge the way two descriptions can be effectively compared, in order to efficiently decide whether they refer to the same real-world entity. The core task of ER is to decide whether two descriptions match using an adequate similarity function. For specific domains and a relatively small number of KBs, such similarity functions can be defined easily, possibly drawing on experts' knowledge. In cross-domain and large-scale ER, even deciding which parts of the descriptions are most appropriate for performing comparisons is an open research issue. For example, should we consider only the values of the descriptions, or also their graph structuring? What is a reasonable trade-off between the content-based and the structure-based similarity of two descriptions? Going one step further, how does schematic information, in terms of employed attribute names and types, affect the degree of similarity of two descriptions?
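
To make the content-versus-structure trade-off concrete, here is a minimal sketch (not part of Minoan ER itself) that mixes a value-based similarity with a neighborhood-based one through a tunable weight. The Jaccard coefficient, the `tokens`/`neighbors` fields and the weight `alpha` are all illustrative assumptions, not the framework's actual similarity function.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def combined_similarity(desc1, desc2, alpha=0.5):
    """Weighted mix of content-based similarity (over value tokens) and
    structure-based similarity (over neighboring entity ids).
    alpha=1.0 uses values only; alpha=0.0 uses graph structure only."""
    content = jaccard(set(desc1["tokens"]), set(desc2["tokens"]))
    structure = jaccard(set(desc1["neighbors"]), set(desc2["neighbors"]))
    return alpha * content + (1 - alpha) * structure
```

Choosing `alpha` is precisely the open question raised above: for rich, value-heavy descriptions a high weight on content may suffice, while loosely structured descriptions may need more evidence from their neighborhoods.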

What can the Minoan ER framework actually do?

We focus on designing and developing the Minoan ER framework, which is capable of resolving entity descriptions appearing in the Web of data. We use blocking as a preprocessing step for ER, to reduce the number of required comparisons. Specifically, blocking places similar entity descriptions into blocks, leaving to the entity resolution algorithm only the comparisons between descriptions within the same block. Its goal is to place as many matching descriptions as possible in common blocks, missing as few matches as possible. To place two descriptions into the same block, different criteria are employed, mostly reflecting the content similarity of the descriptions. To further reduce the number of comparisons performed by ER, blocking can be accompanied by block post-processing steps. Such steps are worthwhile when blocking misses only few matches and the whole process is faster than exhaustively comparing all descriptions. Clearly, attaining the best trade-off between pruning many comparisons and retaining the comparisons between matches is not straightforward, since it is not easy to select, or even construct, an appropriate similarity function.
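
As a rough illustration of the idea, the sketch below implements simple token blocking: each description is placed into one block per token appearing in its values, and only descriptions sharing a block are compared. This is a generic textbook scheme, assumed here for illustration; Minoan ER's own blocking criteria and post-processing steps are more involved.

```python
from collections import defaultdict

def token_blocking(descriptions):
    """Place each description into one block per token in its values.
    descriptions: dict mapping entity id -> iterable of value tokens."""
    blocks = defaultdict(set)
    for eid, tokens in descriptions.items():
        for token in tokens:
            blocks[token].add(eid)
    # Blocks containing a single description yield no comparisons; drop them.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks):
    """Distinct comparison pairs suggested by the blocks."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Instead of comparing every pair of descriptions, the ER algorithm then examines only the candidate pairs, which is where the savings in comparisons come from.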

To minimize the number of missed matches, an iterative entity resolution process can exploit in a pay-as-you-go fashion any intermediate results of blocking and matching, progressively discovering new candidate description pairs for resolution. Such an iterative process considers similarity evidence provided by entity descriptions placed into the same block or being structurally related in the original entity graph. This way, an iterative approach is more suitable for coping with the varying data quality (e.g., incompleteness) and loose structuring (e.g., diverse entity graphs) of entity descriptions in the Web of data.
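The pay-as-you-go idea can be sketched as a simple work queue: resolving a pair and finding a match makes the neighbors of the matched descriptions promising new candidates. This is only an illustrative loop under assumed inputs (a seed set of blocking pairs, a neighbor map from the entity graph, and a black-box `match` function), not the published Minoan ER algorithm.

```python
from collections import deque

def iterative_resolution(seed_pairs, neighbors, match):
    """Iteratively resolve pairs, discovering new candidates from matches.
    seed_pairs: initial candidate pairs (e.g., produced by blocking).
    neighbors: dict mapping entity id -> ids of structurally related entities.
    match: function deciding whether two entity ids refer to the same entity."""
    queue = deque(seed_pairs)
    seen = set(seed_pairs)
    matches = set()
    while queue:
        e1, e2 = queue.popleft()
        if match(e1, e2):
            matches.add((e1, e2))
            # A match makes the neighbors of e1 and e2 promising candidates.
            for n1 in neighbors.get(e1, ()):
                for n2 in neighbors.get(e2, ()):
                    pair = tuple(sorted((n1, n2)))
                    if n1 != n2 and pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return matches
```

Because new candidates are enqueued as matches are found, intermediate results become useful immediately, and the process can be stopped at any point with the matches discovered so far.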

The Minoan ER framework provides several publicly available RDF datasets (entity collections) from the LOD cloud, both central and peripheral, each with a ground truth of known matches. Highly similar descriptions, found in central LOD collections, feature many common tokens in the values of common attributes, while somehow similar descriptions, found in peripheral collections, have significantly fewer common tokens, in attributes that are not necessarily semantically related. Hence, the former can be compared on their content alone (i.e., values), while the latter require contextual information, e.g., the similarity of neighboring descriptions, linked with different types of relationships. Moreover, we offer the source code of the MapReduce algorithms that we have used in our publications.



We would like to acknowledge our many collaborators who have influenced our thoughts and our understanding of this research area over the years, and the following projects for their support in our research efforts: EU FP7-ICT-2011-9 DIACHRON (Managing the Evolution and Preservation of the Data Web), EU FP7-PEOPLE- 2013-IRSES SemData (Semantic Data Management), EU FP7-ICT-318552 IdeaGarden (An Interactive Learning Environment Fostering Creativity), and LoDGoV (Generate, Manage, Preserve, Share, and Protect Resources in the Web of Data) of the Research Programme ARISTEIA (EXCELLENCE), GSRT, Ministry of Education, Greece, and the European Regional Development Fund. Finally, we would like to thank the ~okeanos GRNET cloud service.

News

October 2016: Our paper "Benchmarking Blocking Algorithms for Web Entities" was accepted at the IEEE Transactions on Big Data journal.

March 2016: Our paper "Minoan ER: Progressive Entity Resolution in the Web of Data" was presented at EDBT 2016 @ Bordeaux, France (March 15-18, 2016).

September 2015: Our papers "Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web" and "Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data" were accepted at the IEEE Big Data Conference @ Santa Clara, CA (Oct 29 - Nov 1, 2015).

August 2015: Our book "Entity Resolution in the Web of Data" was published by Morgan & Claypool. You can find it here.

July 2015: Our poster "WebER: Resolving Entities in the Web" was accepted at the European Data Forum @ Luxembourg (Nov 16-17, 2015).