N-gram Language Models and Neural Dependency Parsing

Overview

Two assignments from the NLP course at Vrije Universiteit Amsterdam (M.Sc. AI). The pairing is intentional: the first builds the classical statistical foundation, and the second builds on it with a neural model. Completed with Julien Testu and Christopher Lam.

Part 1: N-gram Language Modeling

Built unigram, bigram, and trigram language models on the Brown corpus, with a full pipeline from corpus statistics to text generation.

Corpus analysis. Token / type counts, frequency distributions, and Zipf's-law verification across genres (news, adventure, romance, etc.) to confirm the corpus behaves as expected before modelling.
Probability estimation. Maximum-likelihood probabilities at each n-gram order, then Laplace smoothing and interpolation smoothing layered on top to handle zero-probability events on unseen n-grams.
Evaluation. Perplexity computed on held-out text. Lower perplexity means the model assigns higher probability to the words that actually occur, so it is a better predictor of that text.
Generation. Sampled text from each model order for comparison. Trigram outputs are noticeably more coherent than bigram outputs, which are in turn more coherent than unigram outputs, consistent with the perplexity results.
Bonus task. Explored efficiency improvements for the lookup tables at large vocabulary sizes.
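
The estimation and evaluation steps above can be sketched on a toy corpus. This is a minimal illustration, not the assignment code: it covers only the bigram order, the sentences are invented, the interpolation weight lam=0.7 is an arbitrary assumption, and out-of-vocabulary words are not handled. All function names are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (an assumed convention)

def count_ngrams(sentences):
    """Count unigrams and bigrams over boundary-padded sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_mle(word, prev, unigrams, bigrams):
    """Maximum likelihood: count(prev, word) / count(prev). Zero if unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(word, prev, unigrams, bigrams):
    """Add-one smoothing: every possible continuation gets one extra count."""
    vocab_size = len(unigrams) - 1  # BOS is a context but is never predicted
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def p_interp(word, prev, unigrams, bigrams, lam=0.7):
    """Linear interpolation of the bigram and unigram MLE estimates."""
    total = sum(unigrams.values()) - unigrams[BOS]
    p_uni = unigrams[word] / total
    return lam * p_mle(word, prev, unigrams, bigrams) + (1 - lam) * p_uni

def perplexity(sentences, prob_fn, unigrams, bigrams):
    """2 ** (average negative log2 probability per predicted token)."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            p = prob_fn(word, prev, unigrams, bigrams)
            if p == 0.0:
                return math.inf  # one zero-probability event ruins the score
            log_prob += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Toy data, invented for illustration only.
train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
heldout = [["the", "dog", "ran"]]  # contains the unseen bigram (dog, ran)

unigrams, bigrams = count_ngrams(train)
for name, fn in [("mle", p_mle), ("laplace", p_laplace), ("interp", p_interp)]:
    print(name, perplexity(heldout, fn, unigrams, bigrams))
```

The held-out sentence contains a bigram that never occurs in training, so the unsmoothed model returns infinite perplexity while both smoothed models stay finite. That is the zero-probability problem smoothing is there to solve.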

Part 2: Neural Dependency Parsing

Implemented a feed-forward transition-based dependency parser in PyTorch using the shift-reduce paradigm.

Transition system. Three actions: SHIFT (move the next word from the buffer onto the stack), LEFT-ARC (add a dependency from the top of the stack to the second item, making the top the head, and remove the second item), RIGHT-ARC (add a dependency from the second item to the top, making the second item the head, and remove the top).
Architecture. Embedding layers map words, POS tags, and dependency labels to dense vectors, followed by two hidden layers with ReLU activation. Dropout regularisation is applied during training only and disabled at evaluation so predictions are deterministic.
Training. Adam optimiser on annotated dependency treebanks. Adam combines momentum (a running average of gradients, which smooths the updates) with adaptive per-parameter learning rates (relatively larger steps for parameters whose gradients have been small), which makes it a common default for models like this.
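
The transition system can be sketched without any neural component. The snippet below is an illustration under assumptions, not the assignment code: arcs are unlabelled, index 0 stands for an artificial ROOT, and the function names are hypothetical. The static oracle, which turns a gold tree into the action sequence a classifier would be trained to predict, is a standard technique for this kind of parser but is not described in the text above.

```python
def apply_transitions(n_words, actions):
    """Run a sequence of actions and return the (head, dependent) arcs."""
    stack = [0]                             # 0 is the artificial ROOT
    buffer = list(range(1, n_words + 1))    # word indices, left to right
    arcs = []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            dependent = stack.pop(-2)       # second item is attached and removed
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":
            dependent = stack.pop()         # top item is attached and removed
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs

def oracle(heads):
    """Derive the action sequence for a projective gold tree.

    heads[i] is the gold head of word i (1-indexed); heads[0] is unused.
    """
    n = len(heads) - 1
    stack, buffer = [0], list(range(1, n + 1))
    attached, actions = set(), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            if second != 0 and heads[second] == top:
                actions.append("LEFT-ARC")
                attached.add(second)
                stack.pop(-2)
                continue
            # Only attach the top once all of its own dependents are attached.
            pending = any(heads[d] == top and d not in attached
                          for d in range(1, n + 1))
            if heads[top] == second and not pending:
                actions.append("RIGHT-ARC")
                attached.add(top)
                stack.pop()
                continue
        if not buffer:
            raise ValueError("tree is not projective or heads are invalid")
        actions.append("SHIFT")
        stack.append(buffer.pop(0))
    return actions

# "She eats fish": eats (2) is the root, She (1) and fish (3) depend on it.
gold_heads = [None, 2, 0, 2]
gold_actions = oracle(gold_heads)
predicted_arcs = apply_transitions(3, gold_actions)
print(gold_actions)
print(predicted_arcs)
```

In a parser of this kind, the network replaces the oracle at test time: at each step it reads features from the current stack and buffer and predicts which of the three actions to apply.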

Why both halves are in one project

Statistical LMs and neural parsers feel like different worlds, but the assignment was structured to make the bridge explicit: the n-gram half teaches you to think in joint and conditional probabilities, and the parser half teaches you to learn those distributions with a trained model instead of estimating them from smoothed counts. Much of the gain in modern NLP comes from that switch: same probabilistic framing, different way of estimating the parameters.

Stack

Python, PyTorch, NLTK, NumPy, Matplotlib.