Can we predict how people will vote based on their Twitter conversations? Can we identify pairs of researchers who will benefit from collaborating with each other based on their published articles? In this course we will study techniques for automatically detecting patterns and learning hidden structures in text data. Such techniques are of tremendous value because of the explosion in the amount of available text data and their potential benefit to the social sciences and businesses.
We will learn the fundamental steps in natural language processing, such as syntactic parsing (understanding the structure of a sentence) and semantic analysis (understanding the meaning of a sentence from the meanings of the words in it). These will help us build sophisticated models for text classification, such as for detecting sentiment or identifying fake news.
We will see that a primary challenge is that natural languages are ambiguous. For instance, the sentence "I made her duck" can be interpreted in five different ways! So our models are probabilistic, and we resolve the ambiguity by training on large text corpora.
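To make the idea of resolving ambiguity with probabilities concrete, here is a minimal sketch in plain Python. It estimates how likely the word "duck" is to be a noun or a verb by counting tags in a tiny hand-tagged corpus. The corpus, the tag names, and the function `tag_probabilities` are all invented for illustration; a real model would be trained on a large corpus and would also use the surrounding context.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented purely for illustration.
TAGGED = [
    ("the", "DET"), ("duck", "NOUN"), ("swam", "VERB"),
    ("a", "DET"), ("duck", "NOUN"), ("quacked", "VERB"),
    ("they", "PRON"), ("duck", "VERB"), ("quickly", "ADV"),
]


def tag_probabilities(tagged_words):
    """Estimate P(tag | word) by relative frequency in the tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {
        word: {tag: n / sum(tags.values()) for tag, n in tags.items()}
        for word, tags in counts.items()
    }


probs = tag_probabilities(TAGGED)
# Pick the tag with the highest estimated probability for "duck".
best_tag = max(probs["duck"], key=probs["duck"].get)
print(probs["duck"], best_tag)
```

In this toy corpus "duck" is tagged as a noun twice and as a verb once, so the noun reading is preferred. With more data, and with the neighbouring words taken into account, the same counting idea underlies several of the models studied in the course.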
We will study a variety of models in the context of text processing including Markov and hidden Markov models, naive Bayes, logistic regression, and neural networks.
All through the course we will use Python and libraries such as the Natural Language Toolkit (NLTK) for processing real-world data.
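As a small taste of the kind of model listed above, the following is a minimal sketch of a bigram (first-order Markov) language model written with only the Python standard library. The three-sentence corpus and the function names `train_bigram_model` and `bigram_probability` are assumptions made for this example, not course material, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts


def bigram_probability(counts, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev is unseen."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[curr] / total if total else 0.0


# Toy corpus, invented for illustration only.
corpus = ["I made her duck", "I made her dinner", "I saw her duck"]
model = train_bigram_model(corpus)
p_duck = bigram_probability(model, "her", "duck")
print(f"P(duck | her) = {p_duck:.3f}")
```

Here "her" is followed by "duck" in two of its three occurrences, so the estimate is 2/3. Unseen word pairs receive probability zero under this estimate, which is one reason practical n-gram models rely on smoothing.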
A tentative list of topics follows.
- Text processing applications, ambiguity in natural languages
- The Natural Language Toolkit
- N-gram language models
- Information Retrieval
- Text Classification, Naive Bayes
- Logistic Regression
- Neural Networks
- Part-of-Speech Tagging
- Syntactic and Statistical Parsing
- Semantic Analysis
- Information Extraction
Textbook
Readings will be assigned primarily from Speech and Language Processing by D. Jurafsky and J. H. Martin. Material for other readings will be available on Canvas electronic reserves.
Coursework and Evaluation
- Assignments: These will be given weekly during the first half of the course and will consist of theoretical and programming questions to help students develop a deeper understanding of the material. They will be worth 25% of the grade.
- Quizzes: There will be three in-class quizzes that will test the fundamental concepts. They will count toward 40% of the grade.
- Course project: Students will work on a project of their choice, individually or in teams of two. This will be worth 30% of the grade.
- Readings: Students will read assigned material to prepare for each class and answer review questions. These along with class participation will count for 5% of the grade.