Minoan ER

Minoan ER is an Entity Resolution (ER) framework built by researchers in Crete, the land of the ancient Minoan civilization. Entity resolution aims to identify descriptions that refer to the same entity within or across knowledge bases.

What created the need for the Minoan ER framework?

Over the past decade, numerous knowledge bases (KBs) have been built to power large-scale knowledge sharing, as well as entity-centric Web search that mixes structured data with text querying. These KBs offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places, products, events), published on the Web as Linked Data. Traditionally, KBs were manually crafted by dedicated teams of knowledge engineers, as in the pioneering Wordnet and Cyc projects. Today, more and more KBs are built from existing Web content using information extraction tools. Such an automated approach offers an unprecedented opportunity to scale up KB construction and leverage existing knowledge published in HTML documents.

Although they may be derived from the same data source (e.g., a Wikipedia entry), KBs (e.g., DBpedia, Freebase) may provide multiple, non-identical descriptions of the same real-world entities. This is mainly due to the different information extraction tools and curation policies employed by each KB, resulting in complementary and sometimes conflicting entity descriptions. Entity resolution (ER) aims to identify descriptions that refer to the same real-world entity, appearing either within or across KBs. ER is essential for improving interlinking in the Web of data, even by third parties, and thus for raising the level of analytics and sense-making over Web data content.

Isn't that an old problem?

ER is a problem that has been studied extensively in data warehouses. However, new ER challenges stem from the openness of the Web of data, where entities are described by an unbounded number of KBs; from the semantic and structural diversity of the descriptions provided across domains, even for the same real-world entities; and from the autonomy of KBs in the processes they adopt for creating and curating entity descriptions. In particular:

The scale, diversity and graph structuring of entity descriptions in the Web of data challenge the way two descriptions can be effectively compared, in order to efficiently decide whether they refer to the same real-world entity. The core task of ER is to decide whether two descriptions match using an adequate similarity function. For specific domains and a relatively small number of KBs, such similarity functions can be defined easily, possibly drawing on experts' knowledge. In cross-domain and large-scale ER, even deciding which parts of the descriptions are most appropriate for performing comparisons is an open research issue. For example, should we consider only the values of the descriptions, or also their graph structuring? What is a reasonable trade-off between the content-based and the structure-based similarity of two descriptions? Going one step further, how does schematic information, in terms of employed attribute names and types, affect the degree of similarity of two descriptions?
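
To make the content-versus-structure trade-off concrete, here is a minimal sketch (not part of Minoan ER itself) that mixes a value-based similarity with a neighborhood-based one through a tunable weight. The Jaccard coefficient, the `tokens`/`neighbors` fields and the weight `alpha` are all illustrative assumptions, not the framework's actual similarity function.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def combined_similarity(desc1, desc2, alpha=0.5):
    """Weighted mix of content-based similarity (over value tokens) and
    structure-based similarity (over neighboring entity ids).
    alpha=1.0 uses values only; alpha=0.0 uses graph structure only."""
    content = jaccard(set(desc1["tokens"]), set(desc2["tokens"]))
    structure = jaccard(set(desc1["neighbors"]), set(desc2["neighbors"]))
    return alpha * content + (1 - alpha) * structure
```

Choosing `alpha` is precisely the open question raised above: for rich, value-heavy descriptions a high weight on content may suffice, while loosely structured descriptions may need more evidence from their neighborhoods.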

What can the Minoan ER framework actually do?

We focus on designing and developing the Minoan ER framework, which is capable of resolving entity descriptions appearing in the Web of data. We use blocking as a preprocessing step for ER, to reduce the number of required comparisons. Specifically, blocking places similar entity descriptions into blocks, leaving to the entity resolution algorithm only the comparisons between descriptions within the same block. Its goal is to place as many matching descriptions as possible in common blocks, missing as few matches as possible. To place two descriptions into the same block, different criteria are employed, mostly reflecting the content similarity of the descriptions. To further reduce the number of comparisons performed by ER, blocking can be accompanied by block post-processing steps. Such steps are worthwhile when blocking misses only few matches and the whole process is faster than exhaustively comparing all descriptions. Clearly, attaining the best trade-off between pruning many comparisons and retaining the comparisons between matches is not straightforward, since it is not easy to select, or even construct, an appropriate similarity function.
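
As a rough illustration of the idea, the sketch below implements simple token blocking: each description is placed into one block per token appearing in its values, and only descriptions sharing a block are compared. This is a generic textbook scheme, assumed here for illustration; Minoan ER's own blocking criteria and post-processing steps are more involved.

```python
from collections import defaultdict

def token_blocking(descriptions):
    """Place each description into one block per token in its values.
    descriptions: dict mapping entity id -> iterable of value tokens."""
    blocks = defaultdict(set)
    for eid, tokens in descriptions.items():
        for token in tokens:
            blocks[token].add(eid)
    # Blocks containing a single description yield no comparisons; drop them.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks):
    """Distinct comparison pairs suggested by the blocks."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Instead of comparing every pair of descriptions, the ER algorithm then examines only the candidate pairs, which is where the savings in comparisons come from.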

To minimize the number of missed matches, an iterative entity resolution process can exploit in a pay-as-you-go fashion any intermediate results of blocking and matching, progressively discovering new candidate description pairs for resolution. Such an iterative process considers similarity evidence provided by entity descriptions placed into the same block or being structurally related in the original entity graph. This way, an iterative approach is more suitable for coping with the varying data quality (e.g., incompleteness) and loose structuring (e.g., diverse entity graphs) of entity descriptions in the Web of data.
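The pay-as-you-go idea can be sketched as a simple work queue: resolving a pair and finding a match makes the neighbors of the matched descriptions promising new candidates. This is only an illustrative loop under assumed inputs (a seed set of blocking pairs, a neighbor map from the entity graph, and a black-box `match` function), not the published Minoan ER algorithm.

```python
from collections import deque

def iterative_resolution(seed_pairs, neighbors, match):
    """Iteratively resolve pairs, discovering new candidates from matches.
    seed_pairs: initial candidate pairs (e.g., produced by blocking).
    neighbors: dict mapping entity id -> ids of structurally related entities.
    match: function deciding whether two entity ids refer to the same entity."""
    queue = deque(seed_pairs)
    seen = set(seed_pairs)
    matches = set()
    while queue:
        e1, e2 = queue.popleft()
        if match(e1, e2):
            matches.add((e1, e2))
            # A match makes the neighbors of e1 and e2 promising candidates.
            for n1 in neighbors.get(e1, ()):
                for n2 in neighbors.get(e2, ()):
                    pair = tuple(sorted((n1, n2)))
                    if n1 != n2 and pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return matches
```

Because new candidates are enqueued as matches are found, intermediate results become useful immediately, and the process can be stopped at any point with the matches discovered so far.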

The Minoan ER framework provides several publicly available RDF datasets (entity collections) from the LOD cloud, both central and peripheral, each with a ground truth of known matches. Highly similar descriptions, found in central LOD collections, feature many common tokens in the values of common attributes, while somehow similar descriptions, found in peripheral collections, have significantly fewer common tokens, in attributes that are not necessarily semantically related. Hence, the former can be compared on their content alone (i.e., values), while the latter require contextual information, e.g., the similarity of neighboring descriptions, linked with different types of relationships. Moreover, we offer the source code of the MapReduce algorithms that we have used in our publications.



We would like to acknowledge our many collaborators who have influenced our thoughts and our understanding of this research area over the years, and the following projects for their support in our research efforts: EU FP7-ICT-2011-9 DIACHRON (Managing the Evolution and Preservation of the Data Web), EU FP7-PEOPLE- 2013-IRSES SemData (Semantic Data Management), EU FP7-ICT-318552 IdeaGarden (An Interactive Learning Environment Fostering Creativity), and LoDGoV (Generate, Manage, Preserve, Share, and Protect Resources in the Web of Data) of the Research Programme ARISTEIA (EXCELLENCE), GSRT, Ministry of Education, Greece, and the European Regional Development Fund. Finally, we would like to thank the ~okeanos GRNET cloud service.

News

October 2016: Our paper "Benchmarking Blocking Algorithms for Web Entities" was accepted at the IEEE Transactions on Big Data journal.

March 2016: Our paper "Minoan ER: Progressive Entity Resolution in the Web of Data" was presented at EDBT 2016 @ Bordeaux, France (March 15-18, 2016).

September 2015: Our papers "Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web" and "Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data" were accepted at the IEEE Big Data Conference @ Santa Clara, CA (Oct 29 - Nov 1, 2015).

August 2015: Our book "Entity Resolution in the Web of Data" was published by Morgan & Claypool. You can find it here.

July 2015: Our poster "WebER: Resolving Entities in the Web" was accepted at the European Data Forum @ Luxembourg (Nov 16-17, 2015).