MPCS 53112 Advanced Data Analytics (Autumn 2022)

Section 1
Instructor(s) Chaudhary, Amitabh (amitabh)
Location RY 277
Meeting Times Wednesday 5:30pm - 8:30pm
Fulfills Elective Specialization - Data Analytics (DA-2)


In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning. We also cover the foundations of reinforcement learning.

We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality-sensitive hashing, Bloom filters, and data stream mining. These ideas are the foundation of modern data analysis at companies such as Google, Facebook, and Netflix.
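As a small illustration of the randomization idea, here is a minimal Bloom filter sketch in Python. It is not course material: the class name, the default sizes, and the choice of seeded SHA-256 hashes are assumptions made for this example. A Bloom filter answers "have I possibly seen this item?" using a fixed amount of memory, at the cost of occasional false positives (it never gives false negatives).

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hash scheme are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for simplicity rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("spark")
    print(seen.might_contain("spark"))
```

Memory use depends only on num_bits, not on the number of items added, which is why the structure suits very large datasets and streams.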

Reinforcement learning refers to the situation in which you want to model your environment, but you don’t have a data set for training. Instead you learn by interacting with your environment. We’ll learn algorithms that, e.g., teach themselves how to play chess by simply playing the game (against another copy of themselves) millions of times! They have applications in autonomous systems, robotics, operations research, responsive website design, stock trading, etc.
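To make "learning by interacting" concrete, here is a minimal sketch of tabular Q-learning on a toy problem. It is far simpler than chess and is not taken from the course: the five-state corridor, the reward of 1 at the right end, the uniformly random exploration, and all parameter values are assumptions chosen for illustration. The agent starts at the left end and is never told that moving right is good; it discovers this from the rewards it observes.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the goal and ends the episode
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: a corridor with a reward of 1 for reaching the right end."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Explore with a uniformly random action; Q-learning is off-policy,
            # so it can still learn the values of the greedy policy.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the learned values."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

No training dataset is involved: the only input is the stream of (state, action, reward, next state) transitions generated by acting in the environment.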

A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include movie recommendation systems, text analysis to predict stock movements, a reinforcement learning system for stock trading, diagnosing eye disease from retina images, adding components to Spark’s machine learning library, building a system to play the game Pong using reinforcement learning, and a deep learning system for lip reading.

Topics (tentative list)

  • MapReduce framework
  • Designing and analyzing MapReduce algorithms
  • Spark framework
  • Spark machine learning library (MLlib)
  • Locality sensitive hashing for finding similar items
  • Data stream mining
  • Finite Markov decision processes
  • Reinforcement learning algorithms: Sarsa, Q-learning
  • Recommendation systems
  • Other advanced data analysis/machine learning topics based on student interest


  • Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
  • Three Quizzes: 35%
  • Project: 35%

Primary Textbooks

Course Prerequisites

MPCS 50103 Math for Computer Science
MPCS 55001 Algorithms
MPCS 51042 Python Programming (or Programming core requirement with prior knowledge of Python)
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning

In all the above courses a grade of B+ or above is required.

Other Prerequisites

The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:

Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.

Data structures: such as trees, graphs, and hash tables.

Basic multivariate calculus: including differentiation, integration, and finding maxima and minima.

Basic linear algebra: vectors, matrices, and matrix multiplication.

Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.

This course requires competency in Unix and Linux. Please plan to attend the MPCS Unix Bootcamp and/or review the UChicago CS Student Resource Guide.

Overlapping Classes

This class is scheduled at a time that conflicts with these other classes:

  • MPCS 53020-1 -- Foundations of Database Systems
  • MPCS 56511-1 -- Introduction to Computer Security
  • MPCS 55001-2 -- Algorithms
  • MPCS 52553-1 -- Web Development

Eligible Programs

  • Masters Program in Computer Science
  • Bx/MS in Computer Science (Option 2: Professionally-oriented - CS Majors)
  • Bx/MS in Computer Science (Option 3: Professionally-oriented - Non-CS Majors)