There is a vast and rapidly increasing quantity of government, scientific, corporate, and crowd-sourced data published nowadays on the Web. The emerging Web of Data realizes the vision of a data-driven decision-making and is expected to play a catalyst role in the way structured information is exploited across-domains in a large scale. It offers a great potential for building innovative data products and services that create new value from already collected data by fostering active citizenship (awareness regarding greenhouse gas emissions, food supply-chains, etc.), urban sustainability (smart energy consumption, mobility, etc.), but also world-wide research according to the "fourth paradigm of science".
In particular, according to the Linked Data paradigm, real world entities are described on the Web by interlinked data rather than documents. Data publishers are thus encouraged to describe entities using W3C standards (RDF, etc.) by naming them with HTTP URIs, so that people can access their descriptions on the Web, as well as by including links to other URIs, so that people can discover more entities (persons, places, artefacts, etc.). Due to the open and decentralized nature of the Web, real world entities are usually described in multiple datasets using different URIs in a partial, overlapping and sometimes evolving way. Recognizing descriptions of the same real-world entity across (and sometimes within) data sources emerges as a central problem in order to increase dataset linking but also to search the Web of data for entities and their relations (e.g., persons, things, events, places, etc.).
Data describing entities could be made available in the Web under different formats (e.g., tabular, tree or graph) of varying structuredness. Entity resolution techniques proposed for relational data (merging customer databases, library catalogues, etc.) are not suited for the Web of data due to the high heterogeneity (even across domains) and non-regularity in data structuring. As a matter of fact an entity described in knowledge bases, such as Yago or Freebase, is declared to be instance of several semantic types (i.e. classes) while its description may employ properties from different vocabularies (i.e., structural type). As a result the ontologies used by linked data represent in their vast majority simple vocabularies (of classes or properties) rather than schemas imposing extensive structural and semantic data constraints.
Thus, matching loosely structured entities is one of the major challenges of entity resolution that arise in the context of the Web of data compared to named entity matching for unstructured data and attribute value matching in relational databases. The fact that Web data heavily rely on heterogeneous vocabularies calls for cross-domain entity resolution techniques. For example, only Yago features 10M entities in 350K classes and 120M facts (RDF triples) for 100 relations, while Freebase features 25M entities described by 2000 topics and 100M facts using 4000 properties. Finally, entity resolution in the scale of the Web of data (more than 60B of triples) calls for efficient parallel techniques. Our goal is to highlight these challenges, to present recent techniques addressing them and identify potential gaps for future research.
The solution space of entity resolution
Entity resolution has been studied in a variety of contexts, using several approaches. In this tutorial, we introduce three dimensions for the solution space of entity resolution, namely type of method input, objective of method and type of method. Concerning the type of input data, we consider the cases of tabular, tree and graph data. We study entity resolution approaches with respect to their particular objectives, i.e. effectiveness, efficiency and scalability, and their methods, by distinguishing them between blocking, iterative and learning ones.
Type of method input: Different types of input data impose different solutions for the problem of entity resolution. In particular, due to the high structuredness of tabular data, for computing similarities between entity descriptions of this type, it suffices to compare the values of their common attributes. For the case of tree data, e.g. XML, where a hierarchical structure exists, i.e. ancestor and descendant relations, the similarity of values, used to compare two entity descriptions, is affected by the similarity of their ancestors and descendants. Data can be also graph structured, as in RDF, featuring cycles and non-unique root elements. Hence, the problem of computing similarities becomes harder.
Objective of method: We discern entity resolution methods with respect to their main objectives between those that aim at effectiveness, efficiency, and scalability. Algorithms focusing on effectiveness, try to find as many true matches and as few false matches as possible, while algorithms focusing on efficiency aim at resolving the given entity descriptions as fast as possible, usually by reducing the huge number of redundant comparisons. Scalable entity resolution methods are those that can cope with Big Data, typically, by parallelizing the task of entity resolution and distributing it to multiple computational resources, such as Map/Reduce nodes.
Type of method: Methods that perform entity resolution are categorized as blocking, iterative, or learning. Blocking-based approaches group together, in the same block, entity descriptions that are close to each other, denoting possible matches. To accomplish that, they typically rely on blocking keys, i.e. criteria based on which the descriptions are placed into blocks. Iterative algorithms are based on the idea that identifying some matching entity descriptions can lead to new matches, e.g. by using the transitive closure of the matched pairs, or by using the already merged descriptions. Learning-based approaches typically use an initial set of training data, annotated as matches and non-matches, and then classify the entity descriptions in these two categories, using statistical inference.
The type of input data determines the complexity of the similarity measure used for comparing entity descriptions. High effectiveness is achieved either by increasing the number of comparisons per description in a single iteration, or by increasing the number of iterations, each exploiting the matches already found. Both iterative and learning methods target high effectiveness.
Methods focusing on efficiency, aim at reducing the number of comparisons, while preserving a certain level of effectiveness. Blocking approaches typically target high efficiency, by comparing only the contents of each block, instead of the whole input set of descriptions. To do this, they exploit knowledge regarding the semantics of the input data. Regardless of effectiveness and efficiency, scalable methods are capable of exploiting multiple machines to deal with large scale data.