*Please note that the following course information is from a previous offering of the course and is subject to change.*
Parallel programming is ubiquitous in both the largest compute clusters and the smallest, low-power embedded devices. Though this has been the status quo for many years, achieving optimal parallel performance can still be a challenging, multi-disciplinary effort.
In this course, we will focus on compute-intensive (rather than data-intensive) parallel programming, representative of numerical applications. Computer architecture and systems will be a pervasive theme, and we will discuss how parallel APIs map to the underlying hardware.
We will implement and optimize C/C++ applications on large-scale, multicore CPU and GPU compute clusters. We will learn widely used parallel programming APIs (OpenMP, CUDA, and MPI) and use them to solve problems in linear algebra, Monte Carlo simulation, discretized partial differential equations, and machine learning.
The majority of coding assignments can be completed in either C or C++. Certain applications will require coding portions in pure C; however, in these cases, we will cover the requisite information for those with previous exposure to only C++. Previous or concurrent courses in systems and architecture can be helpful, but no prerequisite knowledge of systems/architectures is assumed.
- Overview of CPU and GPU Architectures
  - Instruction sets
  - Functional units
  - Memory hierarchies
- Performance Metrics
  - Latency and bandwidth
  - Roofline modeling
- Single-core optimization
  - Compiler-assisted vectorization (data-level parallelism)
  - Design patterns for cache-based optimization
- Multi-threaded CPU programming
  - Worksharing, synchronization, and atomic operations
  - Memory access patterns, including non-uniform memory access
  - The OpenMP API
- GPU programming
  - Thread-mapping for optimal vectorization and memory access
  - Task-scheduling for latency reduction
  - The CUDA and OpenMP offload APIs
- Distributed parallelism
  - Synchronous and asynchronous communication patterns
  - Data decomposition
  - Hybrid models for distributed multi-threaded and GPU programming
  - The MPI API
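To give a flavor of the roofline modeling topic, here is a minimal sketch of the model's core idea: attainable performance is the lesser of the machine's peak compute rate and its memory bandwidth times the kernel's arithmetic intensity. The function name and the numbers in the usage note are illustrative, not course material.

```c
/* Roofline model sketch (illustrative): attainable performance is
 * capped either by peak compute or by memory bandwidth multiplied by
 * arithmetic intensity (flops per byte moved).
 */
double roofline_gflops(double peak_gflops, double bw_gb_s,
                       double flops_per_byte) {
    double memory_bound = bw_gb_s * flops_per_byte;   /* bandwidth roof */
    return memory_bound < peak_gflops ? memory_bound  /* memory-bound   */
                                      : peak_gflops;  /* compute-bound  */
}
```

For example, a daxpy (`y[i] += a * x[i]`) performs 2 flops per 24 bytes of double-precision traffic (load `x[i]`, load `y[i]`, store `y[i]`), so on a hypothetical machine with 100 GB/s of bandwidth and 1000 GFLOP/s of peak compute it is memory-bound at roughly 8.3 GFLOP/s.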
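The worksharing and synchronization topics under multi-threaded CPU programming can be previewed with a small OpenMP example (illustrative, not an assignment): a dot product parallelized with a worksharing loop, using a reduction clause to avoid a data race on the accumulator.

```c
/* OpenMP worksharing sketch: each thread handles a chunk of the loop,
 * and reduction(+:sum) combines the per-thread partial sums safely.
 * Compile with e.g. `gcc -O2 -fopenmp`; without -fopenmp the pragma is
 * ignored and the loop runs serially with the same result.
 */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Without the reduction clause, concurrent updates to `sum` would race; the clause gives each thread a private accumulator and merges them at the end of the loop.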
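As a taste of the data decomposition topic under distributed parallelism, the sketch below computes a 1-D block partition of `n` items over `p` ranks, of the kind typically computed before distributing work with MPI. The type and function names are illustrative.

```c
/* 1-D block decomposition sketch (illustrative): the first n % p ranks
 * receive one extra item, so load is balanced to within one element.
 */
typedef struct { long lo; long hi; } range_t;   /* half-open: [lo, hi) */

range_t block_range(long n, int p, int rank) {
    long base = n / p;                           /* minimum block size  */
    long rem  = n % p;                           /* leftover items      */
    range_t r;
    r.lo = (long)rank * base + (rank < rem ? rank : rem);
    r.hi = r.lo + base + (rank < rem ? 1 : 0);
    return r;
}
```

For instance, 10 items over 3 ranks yields the ranges [0, 4), [4, 7), and [7, 10); in an MPI program, each rank would compute its own range from its rank number and the communicator size.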
Throughout the course, we will draw on examples from linear algebra, Monte Carlo simulations, discretized partial differential equations, and machine learning.
The graded coursework will consist of six out-of-class, individually completed coding projects. Most will be one-week assignments, but the final ones will be larger, two-week projects.
There will also be brief conceptual quizzes, which will be discussed in class and graded for completion.
We will draw on material from the following texts. None are required, but they can be helpful resources throughout your career.