*Please note: This is the syllabus from the 2021/22 academic year and is subject to change.*
The purpose of this class is to learn how to build applications at scale. It provides you with the techniques and tools needed to deliver subsecond response times to millions of users interacting with petabytes of data.
In this course, we will cover both the theory and practice of building Big Data applications. We will not only learn how to use technologies such as HDFS, MapReduce, Spark, Kafka, Hive, Thrift, HBase, Zookeeper, columnar stores, etc., but also understand why Big Data applications employ such a diverse array of technologies and where each one of them fits.
We will demonstrate the practice of Big Data application architecture by implementing a running Big Data web application that explores the relationship between weather and flight performance, using all of the weather and flight delay information in the United States over the last decade.
To develop a sound understanding of the theory of Big Data, we will learn about important formulations of Big Data application architectures, such as Nathan Marz's lambda architecture, proper use of normalized and denormalized data stores within large-scale web applications, application of the CAP theorem, etc. We will also continuously keep in mind important additional topics that invariably arise in real-world applications of Big Data, such as budgeting, compliance, etc.
Students are required to bring a laptop to class every week.
- Overview of Big Data
- Lambda architecture
- Data model/storage
- Batch layer
- Serving layer
- Speed layer
- Technologies including Hadoop/Spark/Hive/HBase and other NoSQL databases/Thrift/Zookeeper etc.
- Scraping and cleaning data
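As a rough illustration of how the batch, serving, and speed layers listed above fit together, here is a minimal in-memory sketch in Python. It is not course material: the record layout, the sample delay values, and the names `build_batch_view`, `SpeedLayer`, and `average_delay` are all hypothetical, and plain dictionaries stand in for the distributed stores a real system would use.

```python
from collections import defaultdict

# Hypothetical master dataset: (origin airport, delay in minutes).
# The values are made up for illustration only.
master_dataset = [("ORD", 10), ("ORD", 30), ("SFO", 5)]


def build_batch_view(records):
    """Batch layer: recompute per-airport totals from the full master dataset."""
    view = defaultdict(lambda: [0, 0])  # airport -> [total delay, flight count]
    for airport, delay in records:
        view[airport][0] += delay
        view[airport][1] += 1
    return dict(view)


class SpeedLayer:
    """Speed layer: incrementally tracks records that arrived after the last batch run."""

    def __init__(self):
        self.view = {}

    def update(self, airport, delay):
        total, count = self.view.get(airport, (0, 0))
        self.view[airport] = (total + delay, count + 1)


def average_delay(airport, batch_view, speed_view):
    """Serving-side query: merge the batch view and the real-time view at read time."""
    batch_total, batch_count = batch_view.get(airport, (0, 0))
    speed_total, speed_count = speed_view.get(airport, (0, 0))
    count = batch_count + speed_count
    return (batch_total + speed_total) / count if count else None


batch_view = build_batch_view(master_dataset)
speed = SpeedLayer()
speed.update("ORD", 50)  # a new flight record arriving after the batch run

print(average_delay("ORD", batch_view, speed.view))  # (10 + 30 + 50) / 3 = 30.0
```

The point of the sketch is the division of labor: the batch view is rebuilt from scratch from all historical data, the speed layer only covers recent records, and a query combines the two so that answers stay fresh between batch runs.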
There will be weekly homework assignments on particular topics. At the end of the course, each student will build a Big Data web application on a topic of their choice. Past projects have included analyzing Divvy bike rental trends, looking at the effect of weather on Chicago crime data, protein folding, data mining Wikipedia, and more. In addition to being cool, discovering that you know everything you need to develop a complete Big Data web application is a great experience.
Relationship to other MPCS Big Data courses
This course focuses on architecting large-scale Big Data applications. It only lightly touches on other Big Data-relevant topics such as managing infrastructure in the public cloud or Big Data machine learning algorithms (we do discuss how to adapt traditional analytics queries to Big Data environments). The course is complete on its own and will leave you comfortable building enterprise-grade Big Data web applications regardless of what other courses you take, but it also complements other Big Data courses. One useful way to view it: if you develop a powerful new Big Data analytic using machine learning techniques from an ML course, this course will teach you how to architect and implement a Big Data web application that leverages the new analytic and can be used by millions of users on petabytes of data. That application can then be deployed in the public cloud using the PaaS and IaaS techniques taught in a Cloud Computing class.