MPCS 53112 Advanced Data Analytics (Autumn 2022)

Section 1
Instructor(s)
Location None
Meeting Times
Fulfills Elective Specialization - Data Analytics (DA-2)

Syllabus

*Please note: This is the syllabus from the 2021/22 academic year and is subject to change.*

In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning. We also cover the foundations of reinforcement learning.

We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality sensitive hashing, Bloom filters, and data stream mining. These two ideas are the foundation of modern data analysis in companies such as Google, Facebook, and Netflix.
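To give a flavor of the randomization side, here is a minimal Python sketch of a Bloom filter. It is an illustration only, not taken from the course materials: the class name `BloomFilter`, the method `might_contain`, the default sizes, and the use of salted SHA-256 as a hash family are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["spark", "hadoop", "mapreduce"]:
    seen.add(word)
```

Every added item is always reported as possibly present, while an item that was never added is usually, but not always, reported as absent. That one-sided error is what lets the structure use far less memory than an exact set.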

Reinforcement learning refers to the situation in which you want to model your environment, but you don’t have a data set for training. Instead you learn by interacting with your environment. We’ll learn algorithms that, e.g., teach themselves how to play chess by simply playing the game (against another copy of themselves) millions of times! They have applications in autonomous systems, robotics, operations research, responsive website design, stock trading, etc.
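As a rough illustration of learning by interaction, the sketch below runs tabular Q-learning on a toy five-state corridor. The environment, the constants, and the function names `step` and `train` are hypothetical and chosen only for this example; they are not part of the course materials.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Toy environment: a deterministic walk on a line with walls at both ends."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so a uniformly random behavior policy
            # is enough to explore this small environment.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted best future value.
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
```

No training dataset is supplied: the value table is filled in purely from the rewards observed while acting, and the greedy policy read off the table moves right toward the rewarding state.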

A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include movie recommendation systems, text analysis to predict stock movements, a reinforcement learning system for stock trading, diagnosing eye disease from retina images, adding components to Spark’s machine learning library, building a system to play the game Pong using reinforcement learning, and a deep learning system for lip reading.

Topics (tentative list)

  • MapReduce framework
  • Designing and analyzing MapReduce algorithms
  • Spark framework
  • Spark machine learning library (MLlib)
  • Locality sensitive hashing for finding similar items
  • Data stream mining
  • Finite Markov decision processes
  • Reinforcement learning algorithms: Sarsa, Q-learning
  • Recommendation systems
  • Other advanced data analysis/machine learning topics based on student interest

Evaluation

  • Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
  • Three Quizzes: 35%
  • Project: 35%

Primary Textbook

Course Prerequisites

MPCS 50103 Math for Computer Science
MPCS 55001 Algorithms
MPCS 51042 Python Programming (or Programming core requirement with prior knowledge of Python)
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning

In all the above courses a grade of B+ or above is required.

Other Prerequisites

The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:

Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.

Data structures: such as trees, graphs, and hash tables.

Basic multivariate calculus: including differentiation, integration, and finding maxima and minima.

Basic linear algebra: vectors, matrices, and matrix multiplication.

Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.

This course requires competency in Unix and Linux. Please plan to attend the MPCS Unix Bootcamp (https://masters.cs.uchicago.edu/page/mpcs-unix-bootcamp) or take the online MPCS Unix Bootcamp Course on Canvas.

Overlapping Classes

This class is scheduled at a time that does not conflict with any other classes this quarter.

Eligible Programs

Masters Program in Computer Science
Bx/MS in Computer Science (Option 2: Professionally-oriented - CS Majors)
Bx/MS in Computer Science (Option 3: Professionally-oriented - Non-CS Majors)