MPCS 53112 Advanced Data Analytics (Autumn 2020)

Section 1
Instructor(s) Chaudhary, Amitabh (amitabh)
Location Online Only
Meeting Times Wednesday 5:20pm - 8:20pm
Fulfills Elective Specialization - Data Analytics (DA-2)


*This course will be conducted remotely and will be online only for Autumn 2020*

In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning.

We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality-sensitive hashing, Bloom filters, and data stream mining. We apply these ideas to problems such as finding similar items, market-basket analysis, clustering, and building recommendation systems, all on massive datasets. They are the foundation of modern data analysis at companies such as Google, Facebook, and Netflix.
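As a rough illustration of the randomization idea, the sketch below shows a toy Bloom filter in Python. It is not course material: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice of salted SHA-256 as the hash family are all assumptions made for this example. A Bloom filter answers "have I seen this item?" using a fixed amount of memory, at the cost of occasional false positives.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position associated with the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # The item is "possibly present" only if all of its bits are set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["spark", "hadoop", "mapreduce"]:
    seen.add(word)

print("spark" in seen)  # True: an added item is always reported as present
print("flink" in seen)  # usually False, but a false positive is possible
```

The memory used depends only on `num_bits`, not on the number of items added, which is what makes the structure attractive for massive datasets.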

A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include analyzing sentiment in tweets and restaurant reviews, determining risk of crime when using public transportation, diagnosing eye disease from retina images, predicting a film's critical success based on the script, and analyzing NBA data for the "hot hand phenomenon."

Topics (tentative list)

  • MapReduce framework
  • Designing and analyzing MapReduce algorithms
  • Spark framework
  • Spark machine learning library (MLlib)
  • Locality-sensitive hashing for finding similar items
  • Data stream mining
  • Large-scale market-basket analysis and association rules
  • Large-scale clustering
  • Recommendation systems
  • Other advanced data analysis/machine learning topics based on student interest
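For readers unfamiliar with the first topic, the following single-machine Python sketch mimics the three stages of a MapReduce word count: map, shuffle, and reduce. It is only an illustration under stated assumptions. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real Hadoop or Spark job would distribute these stages across many machines rather than run them in one process.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Run the mapper over every document, then group and reduce by word.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


docs = ["big data big ideas", "data streams"]
counts = word_count(docs)
print(counts)
```

Because each mapper call looks at one document and each reducer call looks at one key, both stages can in principle run in parallel, which is the property the distributed frameworks exploit.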


Grading

  • Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
  • Three Quizzes: 35%
  • Project: 35%

Primary Textbook

Course Prerequisites

MPCS 50101 Math for Computer Science
MPCS 55001 Algorithms
MPCS 51042 Python Programming (or Programming core requirement with prior knowledge of Python)
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning

A grade of B+ or above is required in each of the courses above.

Other Prerequisites

The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:

Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.

Data structures: trees, graphs, and hash tables.

Basic multivariate calculus: including differentiation, integration, and finding maxima and minima.

Basic linear algebra: vectors, matrices, and matrix multiplication.

Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.

Overlapping Classes

This class is scheduled at a time that conflicts with these other classes:

  • MPCS 51230-1 -- User Interface and User Experience Design
  • MPCS 52011-1 -- Introduction to Computer Systems
  • MPCS 56420-1 -- Bioinformatics for Computer Scientists
  • MPCS 50103-2 -- Mathematics for Computer Science: Discrete Mathematics