MPCS 53112 Advanced Data Analytics (Autumn 2016)

Section 1
Instructor(s) Chaudhary, Amitabh (amitabh)
Location Ryerson 276
Meeting Times Tuesday 5:30pm - 8:30pm
Fulfills Elective Specialization - Data Analytics (DA-2)

Syllabus

In this course we study the algorithms and the associated distributed computing systems used in analyzing massive datasets, or big data, and in large-scale machine learning.

We focus on two fundamental ideas for scaling analysis to large datasets: (i) distributed computing, and (ii) randomization. In the former, we study how to design, implement, and evaluate data analysis algorithms for the distributed computing platforms MapReduce/Hadoop and Spark. In the latter, we explore techniques such as locality sensitive hashing, Bloom filters, and data stream mining. We apply these ideas to problems such as finding similar items, market-basket analysis, clustering, and building recommendation systems, all on massive datasets. They are the foundation of modern data analysis in companies such as Google, Facebook, and Netflix.
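As a small illustration of the randomization idea, the following is a minimal Python sketch of a Bloom filter, one of the techniques named above. It is not taken from the course materials: the class name BloomFilter, the parameters num_bits and num_hashes, and the use of salted SHA-256 digests as hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a compact set representation that may
    report false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary choices for this sketch, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Simulate num_hashes hash functions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True only if every position for this item has been set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)

print(seen.might_contain("alice"))  # True: added items are always reported
print(seen.might_contain("dave"))   # usually False; a false positive is possible
```

The filter stores only a fixed-size bit array, so memory use does not grow with the number of items. The price is a small probability of a false positive, which is the kind of trade-off that makes randomized techniques attractive for massive datasets.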

A major component of the course is a quarter-long project in which students build a prototype system for solving a real-world data analysis problem. Examples of past student projects include analyzing sentiment in tweets and restaurant reviews, determining risk of crime when using public transportation, diagnosing eye disease from retina images, predicting a film's critical success based on the script, and analyzing NBA data for the "hot hand phenomenon."

Topics (tentative list)

  • MapReduce framework
  • Designing and analyzing MapReduce algorithms
  • Spark framework
  • Spark machine learning library (MLlib)
  • Locality sensitive hashing for finding similar items
  • Data stream mining
  • Large-scale market-basket analysis and association rules
  • Large-scale clustering
  • Recommendation systems

Evaluation

  • Weekly Readings, Programming and Theory Assignments, Class Participation: 30%
  • Three Quizzes: 35%
  • Project: 35%

Primary Textbook

Course Prerequisites

MPCS 50101 Math for Computer Science
MPCS 55001 Algorithms
MPCS Programming core requirement
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning

A grade of B+ or above is required in each of the above courses. Please contact the instructor if you instead have equivalent courses or experience, or if you meet most but not all of the requirements.

Other Prerequisites

The course requires mathematical, algorithmic, and programming maturity. Students are expected to know the following:

Programming in Python: use of lists, dictionaries, conditionals, classes, and reading from and writing to files.

Data structures: such as trees, graphs, and hash tables.

Basic multivariate calculus: including differentiation, integration, and finding maxima and minima.

Basic linear algebra: vectors, matrices, and matrix multiplication.

Further, students should be prepared to learn new libraries, languages (e.g., Scala), and programming paradigms.

Overlapping Classes

This class is scheduled at a time that conflicts with these other classes:

  • MPCS 55001-1 -- Algorithms
  • MPCS 53001-1 -- Databases