MPCS 53113 Natural Language Processing (Summer 2016)

Section 1
Instructor(s) Chaudhary, Amitabh (amitabh)
Location Ryerson 276
Meeting Times Wednesday 5:30pm - 8:30pm
Fulfills Elective Specialization - Data Analytics (DA-2)


Can we predict how people will vote based on their Twitter conversations? Can we identify pairs of researchers who will benefit from collaborating with each other based on their published articles? In this course we will study techniques for automatically detecting patterns and learning hidden structures in text data.  Such techniques are of tremendous value, given the explosion in the amount of available text data and their potential benefits to the social sciences and business.

We will learn the fundamental steps in natural language processing, such as syntactic parsing (understanding the structure of a sentence) and semantic analysis (deriving the meaning of a sentence from the meanings of the words in it).

We will see that the primary challenge is that natural languages are ambiguous.  For instance, the sentence "I made her duck" can be interpreted in five different ways.  So we will focus on probabilistic and machine learning methods that learn to resolve ambiguity by training on large text corpora.  These include sequence models such as Markov models, hidden Markov models, and conditional random fields.  They also include classification and clustering techniques, such as logistic regression, naive Bayes, support vector machines, Gaussian mixture models, and EM clustering.
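
To give a flavor of the classification techniques above, here is a minimal sketch (not course-provided code) of a multinomial naive Bayes text classifier with add-one smoothing, using only the standard library; the toy training data and function names are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Collect the counts prediction needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(tokens, label_counts, word_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word | label),
    with add-one (Laplace) smoothing on the word probabilities."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy sentiment data, purely for illustration.
train = [(["great", "movie"], "pos"), (["terrible", "movie"], "neg"),
         (["great", "acting"], "pos"), (["boring", "plot"], "neg")]
lc, wc, V = train_nb(train)
label = predict_nb(["great", "plot"], lc, wc, V)
```

Working in log space avoids underflow when multiplying many small probabilities; the smoothing keeps unseen words from zeroing out a class.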

Throughout the course we will both implement algorithms in Python and use Python-based libraries such as the Natural Language Toolkit (NLTK) to process real-world data.
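
As a taste of the algorithms we will implement from scratch, here is a small sketch (not course-provided code) of a bigram language model with add-one smoothing, built from standard-library counters; the corpus and names are illustrative:

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])                # contexts for each bigram
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams, vocab

def bigram_prob(w_prev, w, unigrams, bigrams, vocab):
    """Add-one (Laplace) smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

corpus = [["i", "made", "her", "duck"], ["i", "saw", "her", "duck"]]
uni, bi, V = train_bigram_model(corpus)
p = bigram_prob("her", "duck", uni, bi, V)
```

Smoothing reserves some probability mass for bigrams never seen in training, so the model never assigns a sentence probability of exactly zero.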

A tentative list of topics follows.

  • Text processing applications, ambiguity in natural languages.
  • The Natural Language Toolkit (NLTK).
  • N-gram language models: parameter estimation and smoothing, model evaluation.
  • Part-of-speech tagging and sequence labeling: hidden Markov models, maximum entropy models, conditional random fields.
  • Text classification: logistic regression, naive Bayes, EM clustering.
  • Syntactic parsing and grammar formalisms: parsing context-free grammars, statistical parsing.
  • Semantic analysis: lexical semantics, word-sense disambiguation.
  • Information retrieval: TF-IDF, PageRank.
  • Information Extraction: named entity recognition, relation extraction.
  • Question answering and summarization systems.
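
To illustrate the information retrieval topic above, here is a minimal sketch (not course-provided code) of one common TF-IDF variant, raw term frequency times log inverse document frequency, using only the standard library; the documents and names are illustrative:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.
    tf = raw count in the document; idf = log(N / df),
    where df is the number of documents containing the word."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

docs = [["nlp", "is", "fun"], ["nlp", "models", "text"], ["text", "is", "data"]]
w = tf_idf(docs)
```

Note the intended behavior: a word that appears in every document gets idf = log(1) = 0, so ubiquitous words carry no weight, while rare words are boosted.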

Coursework and Evaluation
  • Assignments: These will be weekly for the first half of the course and will consist of theoretical and programming questions that help students develop a deeper understanding of the material.  They will be worth 25% of the grade.
  • Quizzes: There will be three in-class quizzes that will test the fundamental concepts. They will count toward 40% of the grade.
  • Course project: Students will work on a project of their choice, individually or in teams of two. This will be worth 30% of the grade.
  • Readings: Students will read assigned material to prepare for each class and answer review questions.  These, along with class participation, will count for 5% of the grade.

Readings will be assigned primarily from Speech and Language Processing by D. Jurafsky and J.H. Martin.  Material for other readings will be available on Chalk electronic reserves.

Course Prerequisites

MPCS 50103 Math for Computer Science
MPCS Programming core requirement
MPCS 53110 Foundations of Computational Data Analysis
MPCS 53111 Machine Learning

Equivalent courses or experience will be accepted with instructor permission.

Other Prerequisites

Overlapping Classes

This class is scheduled at a time that does not conflict with any other classes this quarter.