NLP: Language Modeling and Dependency Parsing
N-gram language models with smoothing and text generation, plus a neural transition-based dependency parser built in PyTorch.
May 1, 2024
NLP · PyTorch · Language Modeling · Deep Learning
Overview
Two group assignments from the NLP course at Vrije Universiteit Amsterdam (M.Sc. AI, 2024), covering both classical and neural approaches to core NLP tasks. Completed jointly with Julien Testu and Christopher Lam.
Part 1: N-gram Language Modeling
Built and evaluated statistical language models on the Brown corpus:
- Corpus analysis: Token/type statistics, word frequency distributions, Zipf's law verification across genres (news, adventure, etc.)
- N-gram models: Unigram, bigram, and trigram language models with maximum likelihood probability estimation
- Smoothing: Implemented Laplace (add-one) and linear-interpolation smoothing to assign nonzero probability to n-grams unseen in training, avoiding zero-probability events
- Evaluation: Perplexity computation to compare model quality — lower perplexity indicates better prediction of held-out text
- Text generation: Sampling from the trained models to generate text, comparing fluency across n-gram orders
- Optimization: Bonus task exploring computational efficiency improvements for large vocabulary sizes
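A minimal sketch of the core Part 1 ideas — bigram counts with add-one (Laplace) smoothing, perplexity of a sentence, and sampling — using a tiny toy corpus in place of Brown; all names and sizes here are illustrative, not the assignment's actual code:

```python
import math
import random
from collections import Counter

# Toy corpus standing in for Brown sentences (illustrative only).
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]

BOS, EOS = "<s>", "</s>"
sents = [[BOS] + s + [EOS] for s in corpus]
unigrams = Counter(w for s in sents for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sents for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size, including the sentence markers

def laplace_prob(w_prev, w):
    # Add-one smoothing: every bigram gets a pseudo-count of 1,
    # so unseen continuations still receive nonzero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(sentence):
    # Perplexity = exp of the average negative log-probability per token;
    # lower values mean the model predicts the text better.
    s = [BOS] + sentence + [EOS]
    log_prob = sum(math.log(laplace_prob(s[i], s[i + 1]))
                   for i in range(len(s) - 1))
    return math.exp(-log_prob / (len(s) - 1))

def generate(max_len=10, seed=0):
    # Sample each next word from the smoothed bigram distribution.
    rng = random.Random(seed)
    out, prev = [], BOS
    for _ in range(max_len):
        words = list(unigrams)
        weights = [laplace_prob(prev, w) for w in words]
        prev = rng.choices(words, weights=weights)[0]
        if prev == EOS:
            break
        out.append(prev)
    return out
```

A sentence seen in training scores lower perplexity than a novel one, which is exactly the comparison used to rank models on held-out text.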
Part 2: Neural Dependency Parsing
Implemented a feed-forward neural network for transition-based dependency parsing in PyTorch. The parser learns to predict the syntactic structure of sentences under the shift-reduce paradigm:
- Architecture: Embedding layer mapping words/POS tags/dependency labels to dense vectors, two hidden layers with ReLU activation, dropout regularization to prevent overfitting
- Transition system: Three actions — SHIFT (move the next word from the buffer onto the stack), LEFT-ARC (make the second item on the stack a dependent of the top item), RIGHT-ARC (make the top item a dependent of the second)
- Training: Supervised learning on annotated dependency treebanks with the Adam optimizer, which combines momentum (exponentially averaged gradients for lower-variance updates) with per-parameter adaptive learning rates (parameters with historically smaller gradients receive relatively larger updates)
- Dropout: Applied during training only — randomly deactivating neurons prevents overfitting; disabled during evaluation to ensure deterministic predictions
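To make the transition system concrete, here is a minimal arc-standard sketch (a hypothetical helper, not the assignment code) that replays a transition sequence and collects the resulting arcs; words are indexed from 1 and index 0 is the ROOT:

```python
# Each arc is (head_index, dependent_index); index 0 is the ROOT token.
def parse(n_words, transitions):
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    for action in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))    # move next word onto the stack
        elif action == "LEFT-ARC":
            dep = stack.pop(-2)            # second item becomes dependent...
            arcs.append((stack[-1], dep))  # ...of the top of the stack
        elif action == "RIGHT-ARC":
            dep = stack.pop()              # top item becomes dependent...
            arcs.append((stack[-1], dep))  # ...of the item below it
    return arcs

# "She eats fish": she <- eats -> fish, with eats attached to ROOT.
arcs = parse(3, ["SHIFT", "SHIFT", "LEFT-ARC",
                 "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
```

Training reduces parsing to a sequence-classification problem: at each parser state, the network predicts which of these actions the gold tree requires.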
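The architecture described above can be sketched in PyTorch roughly as follows; the feature count, embedding size, and hidden width are placeholder values, and sharing one embedding table for words, POS tags, and labels is a simplification of the described setup:

```python
import torch
import torch.nn as nn

class ParserMLP(nn.Module):
    # Hypothetical dimensions; the assignment's actual sizes may differ.
    def __init__(self, vocab_size=100, n_features=18, emb_dim=50,
                 hidden_dim=200, n_actions=3, dropout=0.5):
        super().__init__()
        # One embedding table standing in for word/POS/label embeddings.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(n_features * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_actions),  # SHIFT / LEFT-ARC / RIGHT-ARC scores
        )

    def forward(self, feature_ids):
        # feature_ids: (batch, n_features) integer indices from the parser state.
        x = self.embed(feature_ids).flatten(start_dim=1)
        return self.net(x)

model = ParserMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.eval()  # eval mode disables dropout, so predictions are deterministic
with torch.no_grad():
    scores = model(torch.randint(0, 100, (4, 18)))
```

Calling `model.train()` re-enables dropout for the training loop; `model.eval()` switches it off, matching the train-only dropout behavior noted above.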
Technologies
Python, PyTorch, NLTK, NumPy, Matplotlib, Conda