In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning.
We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. Under the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. Under the latter, we explore techniques such as locality-sensitive hashing, Bloom filters, and data stream mining. These ideas are applied to problems such as finding similar items, market-basket analysis, clustering, and building recommendation systems, all on massive datasets. They are the foundation of modern data analysis at companies such as Google, Facebook, and Netflix.
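As a small taste of the randomization side, here is a minimal sketch of a Bloom filter in Python. It is an illustration only, not course material: the class name `BloomFilter`, the method `might_contain`, the default sizes, and the use of salted SHA-256 digests as hash functions are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: num_hashes hash functions over a num_bits bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive one bit position per hash function by salting SHA-256
        # with the hash index (an assumption made for simplicity).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        # Set every bit position that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means the item was definitely never added;
        # True means it was possibly added (false positives can occur).
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["spark", "hadoop", "mapreduce"]:
    seen.add(word)
```

Calling `seen.might_contain("spark")` returns `True`. For a word that was never added, the filter usually returns `False` but can occasionally report a false positive. That is the trade-off that lets it use a fixed amount of memory regardless of how many items it has seen.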
A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include analyzing sentiment in tweets and restaurant reviews, determining risk of crime when using public transportation, diagnosing eye disease from retina images, predicting a film's critical success based on the script, and analyzing NBA data for the "hot hand phenomenon."
Topics (tentative list)
- MapReduce framework
- Designing and analyzing MapReduce algorithms
- Spark framework
- Spark machine learning library (MLlib)
- Locality-sensitive hashing for finding similar items
- Data stream mining
- Large-scale market-basket analysis and association rules
- Large-scale clustering
- Recommendation systems
- Other advanced data analysis/machine learning topics based on student interest
Grading
- Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
- Three Quizzes: 35%
- Project: 35%
The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:
Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.
Data structures: trees, graphs, and hash tables.
Basic multivariate calculus: differentiation, integration, and finding maxima and minima.
Basic linear algebra: vectors, matrices, and matrix multiplication.
Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.