Project 2: Reproducibility in Natural Language Processing

Statistics 159/259, Fall 2025
Due 11/21/2025, 11:59PM PT
Prof. F. Pérez, GSI J. Butler, and GSI S. Andrade, Department of Statistics, UC Berkeley.
Your score for this assignment is out of 70 points (with a potential of 7 points extra credit). This project as a whole accounts for 20% of the final course grade.
Assignment type: group project assignment

The assignment content is located in proj02-nlp.ipynb.

Learning Objectives:

Implement Modern NLP Pipelines:
- Load and Prepare Data: Ingest and clean a real-world text dataset using pandas.
- Process Text Efficiently: Implement a text pre-processing pipeline using spaCy, a production-grade NLP library.
- Understand Core Concepts: Understand the difference between a token, a lemma, a stop word, and punctuation, and understand why lemmatization was preferred for text analysis prior to LLMs.
Conduct Core Text Analysis:
- Extract Linguistic Features: Use spaCy to efficiently extract lemmas, parts-of-speech, and named entities from raw text.
- Perform Frequency Analysis: Use spaCy outputs to compare the most frequent words in different documents.
- Analyze Text Over Time: Compare the language of modern presidents to historical ones, drawing initial conclusions about how political communication has evolved.
Build and Compare Topic Models:
- Vectorize Text: Understand and apply the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization method to prepare text for machine learning.
- Implement Traditional Topic Modeling: Build, train, and interpret a Latent Dirichlet Allocation (LDA) topic model using gensim to find topics based on word co-occurrence.
- Implement Modern Topic Modeling: Use BERTopic to leverage transformer embeddings (BERT), clustering, and semantic similarity to discover conceptually coherent topics.
- Critically Evaluate Models: Compare the topics generated by LDA and BERTopic, analyzing the trade-offs and advantages of each approach (i.e., “bag-of-words” vs. “semantic similarity”).
Reproducibility and Collaboration:
- Open-Ended Exploration: Work on more open-ended questions that involve self-directed learning, use of external resources, and exploration.
- Collaboration on Complex Projects: Gain experience coordinating within a team to complete tasks that depend on one another.
- Reproducibility: Gain experience describing your work in accessible formats and making it reproducible.
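
As a rough illustration of the TF-IDF objective above, here is a minimal sketch of the basic weighting, tf × log(N / df), in plain Python. The toy corpus and the function name `tf_idf` are made up for illustration and are not part of the assignment. Library implementations typically add smoothing and normalization, so their numbers will differ from this textbook form.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the project data); each document is
# already split into lowercase tokens.
docs = [
    ["the", "union", "state", "congress"],
    ["the", "union", "economy", "jobs", "jobs"],
    ["the", "state", "economy", "war"],
]


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
```

In this sketch a term that occurs in every document, such as "the", gets weight zero, while a term concentrated in one document, such as "jobs", gets the largest weight in that document.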

Deliverables: For this assignment, you will have a single GitHub repository for your group. Your repository should contain the following:

- One notebook for each of Parts 1, 2, and 3 in the assignment notebook proj02-nlp.ipynb (and Part 4 if you wish to do it) that includes code to create the plots and simulations the question asks for, along with any written responses, commentary, discussion, and documentation where applicable. Please name each notebook using the convention nlp-PXX.ipynb, with XX corresponding to the number of the part. Please remember to use markdown headings for each section/subsection so the entire notebook document is readable. All figures should be both rendered in the notebook and saved in a separate folder called outputs. Please also structure your notebooks as a clean, well-presented data analysis report. Do not include our prompts/problem statements in the final report notebooks.
- A completed contribution statement in contribution_statement.md, briefly and qualitatively detailing each group member’s contributions to the assignment.
- An ai_documentation.txt file where your group will put any prompts and output from AI companions.
- A README describing the aims of the project, with a Binder link to the repo so your notebooks can be run.
- A MyST Site for your project, deployed to GitHub Pages. Each notebook should have its own tab on the website.
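
Before submitting, a quick self-check like the following can list which of the files above are still missing. This is only a sketch under assumptions: it reads XX as a zero-padded part number (for example nlp-P01.ipynb), expects everything at the repository root, and accepts any file whose name starts with README. The function name `missing_deliverables` is hypothetical, and the script does not check the Binder link or the MyST site.

```python
import tempfile
from pathlib import Path


def missing_deliverables(repo_root, parts=(1, 2, 3)):
    """Return the expected files or folders not found under repo_root."""
    root = Path(repo_root)
    # Assumption: XX in nlp-PXX.ipynb is the zero-padded part number.
    expected = [f"nlp-P{part:02d}.ipynb" for part in parts]
    expected += ["contribution_statement.md", "ai_documentation.txt", "outputs"]
    missing = [name for name in expected if not (root / name).exists()]
    # Assumption: the README may carry any extension (README.md, README.rst, ...).
    if not any(root.glob("README*")):
        missing.append("README")
    return missing


# Demo on a throwaway directory that holds only two of the deliverables.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nlp-P01.ipynb").touch()
    (Path(tmp) / "README.md").touch()
    demo_missing = missing_deliverables(tmp)
```

Pass `parts=(1, 2, 3, 4)` if your group also attempts Part 4.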

There are two core components to this homework:

- Working with text data, building from simple to complex methods (Parts 1-4).
- Building a reproducible and structured repository (Part 5).

We have provided you with detailed questions with hints and will cover similar material in the lab.

The total grade for this homework is divided between:

- [15 Points] Part 1: Data Loading and Initial Exploration.
- [20 Points] Part 2: Simple Text Processing - Tokenization, Lemmatization, Word Frequency, Vectorization.
- [20 Points] Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling.
- [Extra Credit: 7 Points] Part 4: Choose your own adventure!
- [15 Points] Part 5: Project Structure.

Parts 1-4 will each be graded from their respective notebooks. Part 5 will be graded by assessing whether all deliverables mentioned above are present, as well as your team’s use of Git (e.g., effective commit messages, use of .gitignore for handling extraneous/unnecessary files, repo organization, etc.).

When running your analyses and making plots, it may be useful to save intermediate results, such as vectorized data files, and then read them back in so that you can regenerate the plots without rerunning the earlier steps.
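
One way to follow this advice is a small "load or compute" helper. The sketch below is an illustration, not a required approach: the helper name `load_or_compute`, the cache path, and the toy word counts are all hypothetical. It uses JSON, which only suits plain Python data; arrays or sparse matrices would need a format suited to them.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached JSON result if it exists; otherwise compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = compute()
    # Create the cache folder on first use, then store the result for later runs.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []  # records how many times the expensive step actually runs


def word_counts():
    """Stand-in for an expensive step such as processing the full corpus."""
    calls.append(1)
    return {"union": 3, "state": 2}


# Demo in a throwaway directory: the second call reads the file instead of recomputing.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "intermediate" / "word_counts.json"
    first = load_or_compute(cache, word_counts)
    second = load_or_compute(cache, word_counts)
```

If large cache files are not meant to be tracked in Git, their folder is a natural candidate for .gitignore. Deleting a cache file forces that step to be recomputed on the next run.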

Note: The parts successively build on each other, so you will need to strategize how to divide and conquer in your groups. For example, some topic models from Part 3 require the vectorization used in Part 2.

Acknowledgment: This homework assignment has been revamped from the Spring 2017 assignment created by Prof. Fernando Pérez and Eli Ben-Michael, which was adapted from Prof. Deb Nolan's previous version of this project. The data for this project come from the following Kaggle dataset: https://www.kaggle.com/datasets/nicholasheyerdahl/state-of-the-union-address-texts-1790-2024?resource=download