Course Hours: Mon 13.00-15.00, Wed 13.00-15.00
Room: H.208
Weekly Lab: Fri 12.00-14.00, Room H.208
Office Hours: Available upon request
Big Data requires storing, organizing, and processing data, typically heterogeneous and arriving as streams, at a scale and efficiency that go well beyond the capabilities of conventional information technologies. Such requirements were first introduced for processing the Web, and today they are commonplace in many industries. As a result, many traditional assumptions break, new query and programming interfaces are required (MapReduce), and new computing models are emerging (Cloud Computing). This course aims to introduce parallel/distributed data processing using the MapReduce (M/R) paradigm and to provide insights for developing applications on top of the Hadoop platform.
Big Data also raises new challenges in data mining. Given the scale and speed of the data that needs to be processed, as well as the variety of parameters to be taken into account, state-of-the-art machine learning algorithms that work offline and expect homogeneous, clean data are also challenged. There is an ongoing effort to design Big Data mining algorithms that accommodate parallel/distributed or even streaming evaluation. Of course, this kind of incremental, partial evaluation affects the quality of the resulting statistical models, so algorithms must trade off the quality of learning against computation time. The course will adopt an algorithmic viewpoint: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort.
The course will consist of lectures based on both textbook material (freely available for download on the Web) and scientific papers. It will also include programming assignments that give students hands-on experience in building data-intensive applications using existing Big Data tools and platforms. The intended audience is MSc and PhD students, as well as practitioners who plan to design or develop solutions using the state-of-the-art algorithms available today for Big Data analysis.
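As a rough illustration of the MapReduce paradigm mentioned in the description, the sketch below counts words in a few documents using plain Python on a single machine. It is not course material and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the three stages of the model.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (single-process sketch)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data mining"]))
```

In a real distributed setting, the map and reduce functions would run in parallel on different machines and the framework would handle the shuffle; here everything runs in one process purely to show the data flow.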
Course Overview [pdf]
22/09/2025: Course Overview
24/09/2025: Scalable Data Analytics
Introduction to Scalable Data Analytics using Apache Spark
29/09/2025: Scalable Data Analytics (lecture)
01/10/2025: Scalable Data Analytics (lecture)
Lab 1 (03/10): MapReduce Programming Fri 13.00-15.00 - online [pdf]
Assignment 1
Elearn Spark Intro + Embeddings + Clustering
If using Scala, you can use Metals for project management.
Download starter code here.
Finding Similar Sets
06/10/2025: Finding Similar Items
08/10/2025: Massive Data Processing
Lab 2 (10/10): Programming in Spark Fri 13.00-15.00 [pdf]
Assignment 1 Due
Relational Data Processing
13/10/2025: Extracting Association Rules
15/10/2025: Extracting Association Rules
17/10/2025: Lab 3 — Intro to Data Frames and Spark SQL — Assignment 1 Due
20/10/2025: Streaming Analytics — Assignment 2 Announcement
22/10/2025: Streaming Analytics
Assignment 2 TBA
27/10/2025: Schema Discovery
29/10/2025: Schema Discovery
Lab 5: Spark Streaming
03/11/2025: Semantic Summaries
05/11/2025: Semantic Summaries
Lab 6: Schema Discovery + Hands-on - Assignment 2 Due
10/11/2025: Property Graphs
12/11/2025: Property Graphs
17/11/2025: Data Ethics
19/11/2025: Data Ethics
24/11/2025: Big Data Processing using LLMs/RAGs
01/12/2025: Property Graphs Partitioning + Hands-on (Elisjana)
Student Paper Presentations : 03/12
09/12/2025: Quantum Data Management + Hands-on (Limnaios)
Student Paper Presentations : 11/12