Project 2: Reproducibility in Natural Language Processing

The assignment content is located in proj02-nlp.ipynb.

Learning Objectives:

Deliverables: For this assignment, you will have a single GitHub repository for your group. Your repository should contain the following:

There are two core components to this homework:

  1. Working with text data building from simple to complex methods (Parts 1-4).

  2. Building a reproducible and structured repository (Part 5).

We have provided detailed questions with hints, and we will cover similar material in lab.

The total grade for this homework is divided between:

Parts 1-4 will each be graded from their respective notebooks. Part 5 will be graded by assessing whether all deliverables mentioned above are present, as well as your team’s use of Git (e.g. effective commit messages, use of .gitignore for handling extraneous or unnecessary files, and repo organization).

When running your analyses and making plots, it may be useful to save intermediate results, such as vectorized data files, and then read them back in so that you can regenerate just the plots without repeating the earlier steps.
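One way to save and reuse an intermediate step is a small "load or compute" helper. The sketch below is illustrative only and is not part of the assignment's starter code: the names `load_or_compute` and `count_words`, the `outputs/word_counts.pkl` path, and the two placeholder sentences are all hypothetical, and the toy word count merely stands in for whatever vectorization your notebook actually performs.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the object cached at cache_path; compute and save it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the expensive step and read the saved result.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    # Create the output directory on first use, then save for next time.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


def count_words(texts):
    """Toy stand-in for an expensive vectorization: per-document word counts."""
    return [Counter(text.lower().split()) for text in texts]


if __name__ == "__main__":
    # Placeholder documents (hypothetical, not the real dataset).
    speeches = ["the state of the union", "the union is strong"]
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "outputs" / "word_counts.pkl"
        first = load_or_compute(cache, lambda: count_words(speeches))   # computes
        second = load_or_compute(cache, lambda: count_words(speeches))  # loads
        print(first == second)
```

A plotting notebook can then call the same helper and pick up the saved file instead of recomputing. If your vectorized data is a SciPy sparse matrix, `scipy.sparse.save_npz` and `scipy.sparse.load_npz` are an alternative to pickling. Large cached files are usually good candidates for `.gitignore`, provided the code that regenerates them is committed.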

Note: The parts build on each other successively, so you will need to strategize how to divide and conquer in your groups. For example, some topic models from Part 3 require the vectorization used in Part 2.

Acknowledgment: This homework assignment has been revamped from the previous Spring 2017 assignment created by Prof. Fernando Perez and Eli Ben-Michael, adapted from Prof. Deb Nolan's previous version of this project. The data for the project come from this Kaggle dataset: https://www.kaggle.com/datasets/nicholasheyerdahl/state-of-the-union-address-texts-1790-2024?resource=download