CS-562 Advanced Topics in Databases

Big Data Processing & Analytics

Fall 2025
Instructor: Haridimos Kondylakis
Teaching Assistant: Sophia Sideri
E-mails: {kondylak, sophisid}@csd.uoc.gr
Mailing List: hy562-list@csd.uoc.gr

Course Hours: Mon 13.00-15.00, Wed 13.00-15.00
Room: H.208
Weekly Lab: Fri 12.00-14.00, Room H.208
Office Hours: Available upon request


Announcements

There will be no lecture on 29/9; the professor will be abroad.

Assignment 1 has been uploaded to Elearn (Link). Deadline: Friday, 17 Oct 2025. If using Scala, you can use Metals for project management. Download starter code here.

Big Data requires storing, organizing, and processing data, typically heterogeneous and arriving as streams, at a scale and efficiency that go well beyond the capabilities of conventional information technologies. Such requirements were first introduced for processing the web and are today commonplace in many industries. As a result, many traditional assumptions break, new query and programming interfaces are required (Map/Reduce), and new computing models emerge (Cloud Computing). This course aims to introduce parallel/distributed data processing using the MapReduce (M/R) paradigm and to provide insights for developing applications on top of the Hadoop platform.
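As a rough illustration of the MapReduce paradigm mentioned above, the following single-machine sketch in plain Python mimics the map, shuffle, and reduce phases for a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example only; they are not part of the Hadoop or Spark APIs, and a real framework would run these phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Toy input, invented for this sketch.
docs = ["big data big ideas", "data at scale"]
counts = word_count(docs)
print(counts)
```

Because each call to `map_phase` depends only on its own record and each call to `reduce_phase` only on its own key, both phases can in principle be distributed without coordination; only the shuffle moves data between workers.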

Big Data also raises new challenges in data mining. Given the scale and speed of the data that needs to be processed, as well as the variety of parameters to be taken into account, state-of-the-art machine learning algorithms that work offline and expect homogeneous, clean data are also challenged. There is an ongoing effort to design Big Data mining algorithms that accommodate parallel/distributed or even streaming evaluation. Such incremental, partial evaluation affects the quality of the resulting statistical models, so algorithms must trade off learning quality against computation time. The course will adopt an algorithmic viewpoint: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort.
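To make the trade-off between result quality and computation cost concrete, here is a minimal sketch of one well-known streaming technique, the Misra-Gries frequent-items summary. It is chosen here purely as an illustration and is not necessarily the algorithm covered in the lectures. It reads the stream once, keeps at most `k - 1` counters, and returns counts that may underestimate the true frequencies; any item occurring more than `n / k` times in a stream of length `n` is guaranteed to remain in the summary.

```python
def misra_gries(stream, k):
    """One-pass approximate frequent items using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream, invented for this sketch: "a" occurs 4 times out of 8.
stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

A larger `k` uses more memory but tightens the error bound, which is the kind of compromise the paragraph above describes.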

The course will consist of lectures based both on textbook material (freely available for download on the Web) and on scientific papers. It will also include programming assignments that give students hands-on experience in building data-intensive applications using existing Big Data tools and platforms. The intended audience is MSc and PhD students, as well as practitioners who plan to design or develop solutions using the state-of-the-art algorithms available today for Big Data analysis.

Learning Objectives

Understand different models of computation:

MapReduce
Streams and online algorithms

Mine different types of data:

Data is high dimensional
Data is infinite/never-ending

Use different mathematical ‘tools’:

Hashing (LSH, Bloom filters)
Dynamic programming (frequent itemsets)

Solve real-world problems:

Duplicate document detection
Market Basket Analysis
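As a small taste of the duplicate document detection problem listed above, the sketch below compares two documents by the Jaccard similarity of their word shingles. This is the exact, brute-force computation; techniques such as LSH exist to approximate it at scale. The helper names `shingles` and `jaccard`, the shingle length `k=3`, and the sample sentences are assumptions made for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two near-duplicate sentences that differ in a single word.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(doc1), shingles(doc2))
print(similarity)
```

A single changed word alters every shingle that overlaps it, so longer shingles make the measure stricter; comparing all pairs this way is quadratic in the number of documents, which is the cost that hashing-based methods aim to avoid.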

Course Material

Jure Leskovec, Anand Rajaraman, Jeff Ullman. “Mining of Massive Datasets” Cambridge University Press, 2020

Free download

Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills. “Advanced Analytics With Spark: Patterns for Learning from Data at Scale” O'Reilly Media, 2017

Spark Tutorial, University of Maryland

Anand Rajaraman, Jeffrey David Ullman, Jure Leskovec. “Mining of Massive Datasets” (Greek edition: Εξόρυξη από Μεγάλα Σύνολα Δεδομένων), 3rd Edition, 2020

[Book code in Eudoxus: 94700707]

Articles from newspapers

Course Organization

2-3 Programming Exercises (30%)

MapReduce & Spark

1 Research Presentation (20%)

Topic to be announced
Useful links: How to read an academic paper: Part 1, Part 2, Part 3

Final Project (50%)

Tentative Schedule

Week 1 (22/09-26/09): Course Overview

Course Overview [pdf]

22/09/2025: Course Overview

24/09/2025: Scalable Data Analytics

Week 2 (29/09-03/10): Scalable Data Analytics using Spark

Introduction to Scalable Data Analytics using Apache Spark

29/09/2025: Scalable Data Analytics (lecture)

01/10/2025: Scalable Data Analytics (lecture)

Lab 1 (03/10): MapReduce Programming Fri 13.00-15.00 - online [pdf]

Assignment 1 (Elearn): Spark Intro + Embeddings + Clustering
If using Scala, you can use Metals for project management. Download starter code here.

Week 3 (06/10-10/10): Finding Similar Items, Massive Data Processing (Assign. 1)

Finding Similar Sets

06/10/2025: Finding Similar Items

08/10/2025: Massive Data Processing

Lab 2 (10/10): Programming in Spark Fri 13.00-15.00 [pdf]

10/10/2025: Lab 2 — Programming in Spark

Week 4 (13/10-17/10): Extracting Association Rules

Assignment 1 Due

Relational Data Processing

13/10/2025: Extracting Association Rules

15/10/2025: Extracting Association Rules

17/10/2025: Lab 3 — Intro to Data Frames and Spark SQL — Assignment 1 Due

Lab 3 (17/10): Intro to Data Frames and Spark SQL [pdf]

Week 5 (20/10-24/10): Streaming Analytics (Assign. 2)

20/10/2025: Streaming Analytics — Assignment 2 Announcement

22/10/2025: Streaming Analytics

Assignment 2 on Elearn Assignment 2

Week 6 (27/10-31/10): Schema Discovery

27/10/2025: Schema Discovery

29/10/2025: Schema Discovery

Lab 4: Spark Streaming [pdf]

Week 7 (03/11-07/11): Semantic Summaries

03/11/2025: Semantic Summaries

05/11/2025: Semantic Summaries

Lab 5: Schema Discovery + Hands-on [pdf] - Assignment 2 Due

Week 8 (10/11-14/11): Property Graphs

10/11/2025: Property Graphs

12/11/2025: Property Graphs

Week 9 (17/11-21/11): Data Ethics

17/11/2025: Data Ethics

19/11/2025: Data Ethics

Week 10 (24/11-28/11): Big Data Processing using LLMs/RAGs

24/11/2025: Big Data Processing using LLMs/RAGs

Week 11 (01/12-05/12):

01/12/2025: Property Graphs Partitioning + Hands-on (Elisjana)

Student Paper Presentations: 03/12

Week 12 (09/12-12/12): Data Management in the Quantum Era

09/12/2025: Quantum Data Management + Hands-on (Limnaios)

Student Paper Presentations: 11/12

Week 13 (15/12-19/12): -