Section | 1 |
---|---|
Instructor(s) | Chaudhary, Amitabh (amitabh) |
Location | RY 277 |
Meeting Times | Wednesday 5:30pm - 8:30pm |
Fulfills | Elective Specialization - Data Analytics (DA-2) |
In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning. We also cover the foundations of reinforcement learning.
We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality-sensitive hashing, Bloom filters, and data stream mining. These ideas are the foundation of modern data analysis in companies such as Google, Facebook, and Netflix.
Reinforcement learning refers to the situation in which you want to model your environment, but you don’t have a dataset for training. Instead, you learn by interacting with your environment. We’ll learn algorithms that, e.g., teach themselves how to play chess by simply playing the game (against another copy of themselves) millions of times! These algorithms have applications in autonomous systems, robotics, operations research, responsive website design, stock trading, etc.
A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include movie recommendation systems, text analysis to predict stock movements, a reinforcement learning system for stock trading, diagnosing eye disease from retina images, adding components to Spark’s machine learning library, building a system to play the game Pong using reinforcement learning, and a deep learning system for lip reading.
Topics (tentative list)
Evaluation
Primary Textbooks
MPCS 50103 Math for Computer Science
MPCS 55001 Algorithms
MPCS 51042 Python Programming (or Programming core requirement with prior knowledge of Python)
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning
A grade of B+ or above is required in all of the above courses.
The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:
Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.
Data structures: such as trees, graphs, and hash tables.
Basic multivariate calculus: including differentiation, integration, and finding maxima and minima.
Basic linear algebra: vectors, matrices, and matrix multiplication.
Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.
This course requires competency in Unix and Linux. Please plan to attend the MPCS Unix Bootcamp (https://masters.cs.uchicago.edu/page/mpcs-unix-bootcamp) and/or review the UChicago CS Student Resource Guide here: https://uchicago-cs.github.io/student-resource-guide/.
This class is scheduled at a time that conflicts with these other classes: