In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning.
We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality-sensitive hashing, Bloom filters, and data stream mining. These ideas are applied to problems such as finding similar items, market-basket analysis, clustering, and building recommendation systems, all on massive datasets. They are the foundation of modern data analysis at companies such as Google, Facebook, and Netflix.
A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include analyzing sentiment in tweets and restaurant reviews, determining the risk of crime when using public transportation, diagnosing eye disease from retina images, predicting a film's critical success based on its script, and analyzing NBA data for the "hot hand phenomenon."
Topics (tentative list)
- MapReduce framework
- Designing and analyzing MapReduce algorithms
- Spark framework
- Spark machine learning library (MLlib)
- Locality-sensitive hashing for finding similar items
- Data stream mining
- Large-scale market-basket analysis and association rules
- Large-scale clustering
- Recommendation systems
- Other advanced data analysis/machine learning topics based on student interest
Grading
- Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
- Three Quizzes: 35%
- Project: 35%
Prerequisites
MPCS 50101 Math for Computer Science
MPCS 55001 Algorithms
MPCS 51042 Python Programming (or Programming core requirement with prior knowledge of Python)
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning
A grade of B+ or above is required in each of the above courses. Please contact the instructor if you instead have equivalent courses or experience, or if you meet most but not all of the requirements.
The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:
- Programming in Python: lists, dictionaries, conditionals, classes, and reading from and writing to files.
- Data structures: trees, graphs, and hash tables.
- Basic multivariate calculus: differentiation, integration, and finding maxima and minima.
- Basic linear algebra: vectors, matrices, and matrix multiplication.
Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.