Reece Shuttleworth

pic w pigeons

My Table of Contents

pic w suit on


Hello, I'm Reece! I want to create safe AI that aims to solve our most important and pressing problems. Feel free to contact me at

Currently, I am an MEng student at MIT, working to understand how and why transformers and other deep neural nets work in hopes to make better models. I am primarily doing this work as a part of Jacob Andreas's lab. I also did my undergrad at MIT in 2.5 years, studying computer science and cognitive science.

This past summer, I interned in the Bay Area at Numenta, and worked on parameter efficient fine-tuning methods for sparse transformer models in order to reduce the hardware requirements associated with fine-tuning. Previously, I worked in CSAIL in the Learning and Intelligent Systems group, where I worked on solving planning problems with AI. Separately, I have also done work using Large Language Models to solve math and science problems.

I also run for MIT.

Projects & Publications


Check back later for updates 😁


Characterizing Sparsity in Transformers

[paper] [code]

Theoretical and biological reasons suggest that sparsity may be important for deep neural network to perform well. Here, we examine the attention blocks of large transformer models to identify sparse features in their weights and/or activations. One interesting finding is that the weight matricies used in attention have very low stable rank, especially the matrix product \(W_qW_k^T\).

Bias in BERT models

[paper] [code] [WandB runs]

Historical training data used for training machine learning models can be biased, and machine learning models are susceptible to inheriting this bias. In this work, we use the masked token prediction capabilities of BERT models to show that they contain gender and racial bias. We create a dataset and use a novel loss function in order to reduce bias via finetuning. Preliminary analysis shows that this finetuning is successful at reducing bias, but needs to be examined further.

Gabor filter-constrained CNNs

[paper] [code]

Humans have more general and robust object recognition capabilities in comparison to machines. In this work, we investigated whether constraining convolutional neural networks to be more human-like, via the use of Gabor filters, improves their performance and their robustness against adversarial attacks.

MIT Pokerbots


I competed in the 2023 MIT Pokerbots competition and placed in the top 10%, resulting in a cash prize. The variant played in this competition was River of Blood Hold'em.

PyTorch, but in NumPy


Implemented basic PyTorch functionality from scratch using only NumPy arrays. Neural networks converge and perform well on non-trivial problems.


PDDL Planning with Pretrained Large Language Models

[paper] [code]

Published in the NeurIPS FMDM Workshop.

Paper Abstract: We study few-shot prompting of pretrained large language models (LLMs) towards solving PDDL planning problems. We are interested in two questions: (1) To what extent can LLMs solve PDDL planning problems on their own? (2) How and to what extent can LLMs be used to guide AI planners? Recent work by Valmeekam et al. (2022) presents negative evidence for (1) in the classic blocks world domain. We confirm this finding, but expand the inquiry to 18 domains and find more mixed results with a few clear successes. For (2), we propose a simple mechanism for using good-but-imperfect LLM outputs to aid a heuristic-search planner. We also find that the LLM performance is due not only to syntactic pattern matching, but also to its commonsense understanding of English terms that appear in the PDDL.

Using Large Language Models to Solve College Math

[paper] [code]

Published in PNAS.

Paper Abstract: We demonstrate that a neural network pretrained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a dataset of questions from Massachusetts Institute of Technology (MIT)'s largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pretrained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8 to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work automatically solves university-level mathematics course questions at a human level and explains and generates university-level mathematics course questions at scale, a milestone for higher education.


Under Construction.