CS561 Web Data Management

Projects

Spring 2013
Professor: Vassilis Christophides
Teaching Assistant: Michalis Chortis
E-mails: {christop, mhortis}@ics.forth.gr

Course Hours: Tuesday 3-5 PM and Wednesday 11 AM-1 PM
Room: H.204
Office Hours: After the lectures or by appointment
Course Credits: 4


Project Description - Project Papers - Project Assignments - Schedule of Presentations

Project Description

This is a comprehensive survey project for one or two people, in which you perform an in-depth analysis of the research literature in the area covered by the course. The key to the success of this project is your creativity and dedication. Specifically, you need to do the following:

Determine a research topic for your project from the list of papers in the Project Papers section of the course's web page.
- Each group of students should choose different papers and every student should choose at least 2 papers.
Read a sufficient number of papers (usually more than the papers you are going to present) in order to perform an in-depth analysis of the research described in the papers. Specifically, you need to focus on the following:
- Definition of criteria by which published results can be classified. These may include the types of problems to be solved, the types of methods used, the types of frameworks/architectures, etc.
- Comprehensive and systematic description of the technical aspects of the papers. This includes the concepts, ideas, algorithms, methodologies, experimental results, etc.
- Particular emphasis on your personal critique of the papers' material.
Give a 30- to 45-minute presentation in class, whose organization your report should also follow. [Presentation Guidelines]
Write a technical report including all items mentioned above. [Guidelines for the presentation of written work]

Requirements

A survey must analyze a good number of papers (minimum 2 per student) related to the selected topic. The survey report and the presentation will be evaluated on both breadth (i.e., how complete the coverage is) and depth (i.e., how much insight they bring out). The presentation is graded on several aspects. For the understanding of the papers (12%), the grading is as follows:

8% technique/approach,
2% background and
2% shortcomings/open problems

For the delivery of your talk (8%), the grading is as follows:

3% slides,
3% speech and
2% question-and-answer session.

The report must have the following sections: Abstract (up to 250 words), Introduction, the main technical sections, Conclusion / Contributions (with respect to the related work) and Bibliographic References. Basically, you should follow the structure of research papers such as those you have read.

Both report and presentation should address several or possibly all of the following issues:

What is the most important point of each of the papers?
Why is the work notable, novel, or neither?
Why are the problems tackled by the paper important or not important?
Why is the proposed solution potentially useful or not useful?
Are the assumptions clearly specified, and are they reasonable and practically valid?
Point out additional contexts in which the same idea or technology could be applied; relate the work to another paper that you find during your literature search.
How are the proposed ideas evaluated, and how thorough is that evaluation?
Identify a list of possible future research tasks that would improve the proposed work, develop a different solution strategy, drop some of the given assumptions, and so on.

The report should be between 15 and 30 pages long.

What to hand in

The electronic version of your presentation and printed handouts (before the presentation in the classroom).
The electronic version and a hard copy of your report.

Hint

When you study the papers, we suggest that you make a list of the points that you find particularly confusing, ambiguous, interesting, controversial, etc., and try to formulate your own comments, possible answers, and examples to address those points. These points and related materials can be part of your report. You may also be asked to address those points in class during your presentation, so have your critiques and other relevant information in mind when you arrive in class.

Advice on research and writing

Project Assignments

Student ID	Student Name	Assigned Papers
784	Ozokan Doruk	Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing; Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools
783	Jamous Hassan	From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra; Efficient Processing of RDF Graph Pattern Matching on MapReduce Platforms
634	Lantzaki Christina	On Blank Nodes; Efficient Query Answering against Dynamic RDF Databases
778	Efthymiou Vassilis	A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces; To Compare or Not to Compare: Making Entity Resolution more Efficient
777	Alogdiannaki Eleni	RDF-3X: a RISC-style Engine for RDF; Scalable Join Processing on Very Large RDF Graphs
770	Choudhury Vineet	Scalable SPARQL Querying of Large RDF Graphs; Towards Effective Partition Management for Large Graphs
776	Manjing Tham	Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets; An Empirical Study of Real-World SPARQL Queries
775	Theivapulendra Enotharani	Static Analysis and Optimization of Semantic Web Queries; Efficient Distributed Query Processing for Autonomous RDF Databases; Distributed SPARQL querying with Avalanche
771	Dixit Prabhakar Madhukar	Efficient Execution of Top-K SPARQL Queries; Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data
779	Nikolov Nikolay	Rewriting Queries on SPARQL Views; SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources
781	Sher Imran Falak	On Directly Mapping Relational Databases to RDF and OWL; R2RML: RDB to RDF Mapping Language; A Direct Mapping of Relational Data to RDF
773	Xia Siliang	gStore: Answering SPARQL Queries via Subgraph Matching; Storing and Indexing Massive RDF Data Sets
793	Saniat Mahmudur Rahman	FedBench: A Benchmark Suite for Federated Semantic Data Query Processing; Benchmarking Federated SPARQL Query Engines: Are Existing Testbeds Enough?
772	Berkley Roger Alekos	Effective Page Refresh Policies For Web Crawlers; Swoogle: A Search and Metadata Engine for the Semantic Web
703	Seliniotaki Aleka	CLARO: Modeling and Processing Uncertain Data Streams; PODS: A New Model and Processing Algorithms for Uncertain Data Streams

Schedule of Presentations

Presentation Date	Time Slot	Student Name	ID	Papers	Presentation files
Tuesday 14/05	15:00-16:00	Nikolov Nikolay	779	Rewriting Queries on SPARQL Views; SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources	PDF
Tuesday 14/05	16:00-17:00	Xia Siliang	773	gStore: Answering SPARQL Queries via Subgraph Matching; Storing and Indexing Massive RDF Data Sets	PDF
Tuesday 14/05	17:00-18:00	Jamous Hassan	783	From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra; Efficient Processing of RDF Graph Pattern Matching on MapReduce Platforms	PDF
Wednesday 15/05	11:00-12:00	Theivapulendra Enotharani	775	Static Analysis and Optimization of Semantic Web Queries; Efficient Distributed Query Processing for Autonomous RDF Databases; Distributed SPARQL querying with Avalanche	PDF 1 PDF 2
Wednesday 15/05	12:00-13:00	Saniat Mahmudur Rahman	793	FedBench: A Benchmark Suite for Federated Semantic Data Query Processing; Benchmarking Federated SPARQL Query Engines: Are Existing Testbeds Enough?	PDF
Wednesday 15/05	13:00-14:00	Dixit Prabhakar Madhukar	771	Efficient Execution of Top-K SPARQL Queries; Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data	PDF
Wednesday 15/05	14:00-15:00	Ozokan Doruk	784	Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing; Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools	PDF
Tuesday 21/05	15:00-16:00	Sher Imran Falak	781	On Directly Mapping Relational Databases to RDF and OWL; R2RML: RDB to RDF Mapping Language; A Direct Mapping of Relational Data to RDF	PDF
Tuesday 21/05	16:00-17:00	Choudhury Vineet	770	Scalable SPARQL Querying of Large RDF Graphs; Towards Effective Partition Management for Large Graphs	PPTX
Tuesday 21/05	17:00-18:00	Alogdiannaki Eleni	777	RDF-3X: a RISC-style Engine for RDF; Scalable Join Processing on Very Large RDF Graphs	PDF
Wednesday 22/05	11:00-12:00	Manjing Tham	776	Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets; An Empirical Study of Real-World SPARQL Queries	PDF
Wednesday 22/05	12:00-13:00	Berkley Roger Alekos	772	Effective Page Refresh Policies For Web Crawlers; Swoogle: A Search and Metadata Engine for the Semantic Web	PDF
Wednesday 22/05	13:00-14:00	Seliniotaki Aleka	703	CLARO: Modeling and Processing Uncertain Data Streams; PODS: A New Model and Processing Algorithms for Uncertain Data Streams	PDF
Wednesday 22/05	14:00-15:00	Lantzaki Christina	634	On Blank Nodes; Efficient Query Answering against Dynamic RDF Databases	PDF

Project Papers

[Data Integration in the Web of Data] [Web Data Storage and Access] [Benchmarking]

Data Integration in the Web of Data

Crawling

Sindice.com: Weaving the Open Linked Data
Giovanni Tummarello, Renaud Delbru, Eyal Oren, ISWC/ASWC 2007
LDSpider - An open-source crawling framework for the Web of Linked Data
Robert Isele, Jurgen Umbrich, Christian Bizer, Andreas Harth, ISWC Posters&Demos 2010
Semantic Navigation on the Web of Data: Specification of Routes, Web Fragments and Actions
Valeria Fionda, Claudio Gutierrez, Giuseppe Pirro, WWW 2012
MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data
Andreas Harth, Jurgen Umbrich, Stefan Decker, ISWC 2006
Searching Semantic Web Objects Based on Class Hierarchies
Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, LDOW 2008
Swoogle: A Search and Metadata Engine for the Semantic Web
Li Ding, Timothy W. Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, Joel Sachs, CIKM 2004
Effective Page Refresh Policies For Web Crawlers
Junghoo Cho, Hector Garcia-Molina, ACM Trans. Database Syst. 2003

SPARQL Federation and Query Mediation

Efficient Distributed Query Processing for Autonomous RDF Databases
Fabian Prasser, Alfons Kemper, Klaus A. Kuhn, EDBT 2012
Linked Data Query Processing Strategies
Gunter Ladwig, Thanh Tran, ISWC 2010
Pay-as-you-go Data Integration for Linked Data: opportunities, challenges and architectures
Norman W. Paton, Klitos Christodoulou, Alvaro A. A. Fernandes, Bijan Parsia, Cornelia Hedeler, SWIM 2012
Rewriting Queries on SPARQL Views
Wangchao Le, Songyun Duan, Anastasios Kementsietsidis, Feifei Li, Min Wang, WWW 2011
SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources
Konstantinos Makris, Nikos Bikakis, Nektarios Gioldasis, Stavros Christodoulakis, EDBT 2012
Efficient Query Answering against Dynamic RDF Databases
Francois Goasdoue, Ioana Manolescu, Alexandra Roatis, EDBT 2013

Data Mapping and Entity Resolution

A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces
George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederee, Wolfgang Nejdl, IEEE TKDE 2012
To Compare or Not to Compare: Making Entity Resolution more Efficient
George Papadakis, Ekaterini Ioannou, Claudia Niederee, Themis Palpanas, Wolfgang Nejdl, SWIM 2011

Data Quality and Provenance

Algebraic Structures for Capturing the Provenance of SPARQL Queries
Floris Geerts, Grigoris Karvounarakis, Vassilis Christophides and Irini Fundulaki, EDBT 2013

RDF Data Exchange

On Directly Mapping Relational Databases to RDF and OWL
Juan Sequeda, Marcelo Arenas, Daniel P. Miranker, WWW 2012

Web Data Storage and Access

SPARQL Engines and Optimization

Characteristic Sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins
Thomas Neumann, Guido Moerkotte, ICDE 2011
Database foundations for scalable RDF processing
Katja Hose, Ralf Schenkel, Martin Theobald, Gerhard Weikum, Reasoning Web 2011
Heuristics-based Query Optimisation for SPARQL
Petros Tsialiamanis, Lefteris Sidirourgos, Irini Fundulaki, Vassilis Christophides, Peter A. Boncz, EDBT 2012
RDF-3X: a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum, PVLDB 2008
Scalable Join Processing on Very Large RDF Graphs
Thomas Neumann, Gerhard Weikum, SIGMOD Conference 2009
Storing and Indexing Massive RDF Data Sets
Yongming Luo, Francois Picalausa, George H.L. Fletcher, Jan Hidders, Stijn Vansummeren, Semantic Search over the Web 2012
gStore: Answering SPARQL Queries via Subgraph Matching
Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Ozsu, Dongyan Zhao, PVLDB 2011
Static Analysis and Optimization of Semantic Web Queries
Andres Letelier, Jorge Perez, Reinhard Pichler, Sebastian Skritek, PODS 2012

Continuous Query Processing

An Execution Environment for C-SPARQL Queries
Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Michael Grossniklaus, EDBT 2010
Linked Stream Data Processing
Danh Le Phuoc, Josiane Xavier Parreira, Manfred Hauswirth, Reasoning Web 2012
Querying RDF Streams with C-SPARQL
Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, Michael Grossniklaus, SIGMOD Record 2010

Multi-Query Processing

Scalable Multi-Query Optimization for SPARQL
Wangchao Le, Anastasios Kementsietsidis, Songyun Duan, Feifei Li, ICDE 2012

Large Scale Data Management

From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra
HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu, PVLDB 2011
Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing
Mohammad Farhan Husain, James P. McGlothlin, Mohammad M. Masud, Latifur R. Khan, Bhavani M. Thuraisingham, IEEE Transactions on Knowledge and Data Engineering 2011
Large-scale Linked Data Processing - Cloud Computing to the Rescue?
Michael Hausenblas, Robert Grossman, Andreas Harth, Philippe Cudre-Mauroux, CLOSER 2012
RDF Data Management in the Amazon Cloud
Francesca Bugiotti, Francois Goasdoue, Zoi Kaoudi, Ioana Manolescu, EDBT/ICDT Workshops 2012
Scalable SPARQL Querying of Large RDF Graphs
Jiewen Huang, Daniel J. Abadi, Kun Ren, PVLDB 2011
CumulusRDF: Linked Data Management on Nested Key-Value Stores
Gunter Ladwig, Andreas Harth, SSWS 2011
Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools
Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham, IEEE CLOUD 2010
Efficient Processing of RDF Graph Pattern Matching on MapReduce Platforms
Padmashree Ravindra, Seokyong Hong, HyeongSik Kim, Kemafor Anyanwu, DataCloud-SC 2011
PigSPARQL: Mapping SPARQL to Pig Latin
Alexander Schatzle, Martin Przyjaciel-Zablocki, Georg Lausen, SWIM 2011
Rya: A Scalable RDF Triple Store for the Clouds
Roshan Punnoose, Adina Crainiceanu, David Rapp, Cloud-I 2012
RDFPath: Path Query Processing on Large RDF Graphs with MapReduce
Martin Przyjaciel-Zablocki, Alexander Schatzle, Thomas Hornung, Georg Lausen, ESWC Workshops 2011
Towards Effective Partition Management for Large Graphs
Shengqi Yang, Xifeng Yan, Bo Zong, Arijit Khan, SIGMOD Conference 2012

Keyword-based and Top-K Querying

Efficient Execution of Top-K SPARQL Queries
Sara Magliacane, Alessandro Bozzon, Emanuele Della Valle, ISWC 2012
Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends
Andre Freitas, Edward Curry, Joao Gabriel Oliveira, Sean O'Riain, IEEE Internet Computing 2012
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data
Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano, ICDE 2009
Natural Language Questions for the Web of Data
Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, Gerhard Weikum, EMNLP-CoNLL 2012

Benchmarking

SP^2Bench: A SPARQL Performance Benchmark
Michael Schmidt, Thomas Hornung, Georg Lausen, Christoph Pinkel, ICDE 2009
An Evaluation of Approaches to Federated Query Processing over Linked Data
Peter Haase, Tobias Matha, Michael Ziller, I-SEMANTICS 2010
Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets
Songyun Duan, Anastasios Kementsietsidis, Kavitha Srinivas, Octavian Udrea, SIGMOD Conference 2011
Benchmarking Federated SPARQL Query Engines: Are Existing Testbeds Enough?
Gabriela Montoya, Maria-Esther Vidal, Oscar Corcho, Edna Ruckhaus, Carlos Buil Aranda, ISWC 2012
D1.2 Benchmarking RDF Storage Engines
Ying Zhang, Pham Minh Duc, Fabian Groffen, Erietta Liarou, Peter Boncz, Martin Kersten, Jean Paul Calbimonte, Oscar Corcho, 2012
Column-Store Support for RDF Data Management: Not All Swans Are White
Lefteris Sidirourgos, Romulo Goncalves, Martin L. Kersten, Niels Nes, Stefan Manegold, PVLDB 2008
MonetDB Release with Optimized Graph Path Processing
Collaborative Project, 2012
On Generating Benchmark Data for Entity Matching
Ekaterini Ioannou, Nataliya Rassadko, Yannis Velegrakis, Journal on Data Semantics 2012
An Empirical Study of Real-World SPARQL Queries
Mario Arias, Javier D. Fernandez, Miguel A. Martinez-Prieto, Pablo de la Fuente, CoRR 2011
FedBench: A Benchmark Suite for Federated Semantic Data Query Processing
Michael Schmidt, Olaf Gorlitz, Peter Haase, Gunter Ladwig, Andreas Schwarte, Thanh Tran, ISWC 2011