Currently under review.
Published in the NeurIPS FMDM Workshop.
Published in PNAS.
We extend recent work (Scaling Laws for Precision) showing that quantized models degrade in performance more significantly past a certain number of training steps. We confirm their findings on larger models (OLMo-1B & 7B) and on downstream tasks. Further, we find no abnormally large or growing activation statistics across training steps, suggesting that large activations are not the cause of this degradation. Lastly, we trace the quantization error for each weight matrix of the model and find that, across training steps, the attention projection (\(W_{QKV}\)) weights consistently have growing quantization error, while other modules have roughly constant quantization error across steps. This suggests that quantization-induced degradation may be due to growing quantization error in the queries, keys, and values across training steps.
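A minimal sketch of the kind of per-module measurement this involves, assuming simple symmetric round-to-nearest quantization (the checkpoint paths and bit width are illustrative, not the exact setup used):

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().max() / qmax).clamp(min=1e-8)
    return (w / scale).round().clamp(-qmax, qmax) * scale

def per_module_quant_error(state_dict, bits: int = 4):
    """Relative Frobenius-norm error introduced by quantizing each 2-D weight matrix."""
    errors = {}
    for name, w in state_dict.items():
        if w.ndim != 2:
            continue
        w = w.float()
        errors[name] = ((w - quantize_rtn(w, bits)).norm() / w.norm()).item()
    return errors

# Comparing checkpoints across training steps (paths are illustrative):
# for step in (100_000, 200_000, 400_000):
#     print(step, per_module_quant_error(torch.load(f"olmo_step{step}.pt")))
```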
Studied mixed-precision and activation-aware quantization in order to extend lossless quantization of weights and activations past 8 bits. Our investigation finds limitations when extending quantization past 6 bits, suggesting that lossless 4-bit quantization, which could be leveraged for practical benefits, will be challenging to achieve.
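To illustrate the activation-aware idea, here is an AWQ-style sketch (not the exact method studied): weight columns tied to high-magnitude activations are scaled up before round-to-nearest quantization, so salient channels lose less precision, and the scale is folded back out afterwards.

```python
import torch

def activation_aware_quantize(w: torch.Tensor, act_scale: torch.Tensor, bits: int = 4):
    """Sketch: protect salient input channels by scaling before quantization."""
    qmax = 2 ** (bits - 1) - 1
    s = act_scale.clamp(min=1e-5)            # per-input-channel activation magnitude
    w_scaled = w * s                          # emphasize salient channels
    scale = (w_scaled.abs().max() / qmax).clamp(min=1e-8)
    w_q = (w_scaled / scale).round().clamp(-qmax, qmax) * scale
    return w_q / s                            # undo the scaling
```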
Studied inference optimizations, like KV-Caching and Grouped Query Attention, in the attention module of transformers, including their impact on inference speed and energy usage.
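A minimal sketch of what a KV cache does during autoregressive decoding (single head, no batching; shapes and the cache layout are illustrative):

```python
import torch
import torch.nn.functional as F

def decode_step(q, new_k, new_v, cache):
    """One decoding step: append this token's key/value, attend over the cache.
    q, new_k, new_v: (1, d); cache holds growing (t, d) tensors."""
    cache["k"] = torch.cat([cache["k"], new_k], dim=0)
    cache["v"] = torch.cat([cache["v"], new_v], dim=0)
    attn = F.softmax(q @ cache["k"].T / cache["k"].shape[-1] ** 0.5, dim=-1)
    return attn @ cache["v"]   # past keys/values are reused, not recomputed

# d = 64
# cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
```

Grouped Query Attention shrinks this cache further by sharing each key/value head across a group of query heads.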
Theoretical and biological reasons suggest that sparsity may be important for deep neural networks to perform well. Here, we examine the attention blocks of large transformer models to identify sparse features in their weights and/or activations. One interesting finding is that the weight matrices used in attention have very low stable rank, especially the matrix product \(W_qW_k^T\).
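Stable rank here is the usual \(\|W\|_F^2 / \|W\|_2^2\), i.e. the sum of squared singular values over the largest squared singular value; a small sketch of how it can be computed:

```python
import torch

def stable_rank(w: torch.Tensor) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    sigma = torch.linalg.svdvals(w.float())   # singular values, descending
    return ((sigma ** 2).sum() / sigma[0] ** 2).item()

# e.g. stable_rank(W_q @ W_k.T) for one attention head's query/key projections
```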
Historical data used to train machine learning models can be biased, and models are susceptible to inheriting this bias. In this work, we use the masked token prediction capabilities of BERT models to show that they contain gender and racial bias. We create a dataset and a novel loss function to reduce bias via finetuning. Preliminary analysis shows that this finetuning is successful at reducing bias, but it needs to be examined further.
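A small sketch of the masked-token probing idea, using the Hugging Face fill-mask pipeline (the templates and professions are illustrative, not the dataset used in this work):

```python
from transformers import pipeline

# Compare the probability BERT assigns to gendered pronouns in a profession template.
fill = pipeline("fill-mask", model="bert-base-uncased")
for profession in ["doctor", "nurse", "engineer"]:
    preds = fill(f"[MASK] worked as a {profession}.", targets=["he", "she"])
    print(profession, {p["token_str"]: round(p["score"], 4) for p in preds})
```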
Humans have more general and robust object recognition capabilities in comparison to machines. In this work, we investigated whether constraining convolutional neural networks to be more human-like, via the use of Gabor filters, improves their performance and their robustness against adversarial attacks.
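One way to impose such a constraint, shown as a sketch (filter size, orientations, and layer shape are illustrative), is to initialize or freeze the first convolutional layer with Gabor filters at varied orientations:

```python
import numpy as np
import torch
import torch.nn as nn

def gabor_kernel(size=7, theta=0.0, sigma=2.0, lam=4.0, psi=0.0):
    """Real-valued Gabor filter: a sinusoid modulated by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(xr**2 + y**2 * 0 + (x * -np.sin(theta) + y * np.cos(theta))**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * xr / lam + psi)

# First conv layer whose 8 filters are Gabor kernels at 8 orientations.
conv = nn.Conv2d(1, 8, kernel_size=7, padding=3, bias=False)
with torch.no_grad():
    for i in range(8):
        conv.weight[i, 0] = torch.tensor(gabor_kernel(theta=i * np.pi / 8), dtype=torch.float32)
```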
I competed in the 2023 MIT Pokerbots competition and placed in the top 10%, resulting in a cash prize. The variant played in this competition was River of Blood Hold'em.
Implemented basic PyTorch functionality from scratch using only NumPy arrays. Networks built with it converge and perform well on non-trivial problems.
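A minimal sketch of the style of component involved (a fully connected layer with a hand-written backward pass; names and the inline SGD update are illustrative):

```python
import numpy as np

class Linear:
    """Fully connected layer with manual forward/backward passes."""
    def __init__(self, in_dim, out_dim):
        self.w = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
        self.b = np.zeros(out_dim)

    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return x @ self.w + self.b

    def backward(self, grad_out, lr=1e-2):
        grad_in = grad_out @ self.w.T    # gradient w.r.t. the layer input
        self.w -= lr * self.x.T @ grad_out
        self.b -= lr * grad_out.sum(axis=0)
        return grad_in
```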
Created a functioning 8-bit computer by hand-wiring logic chips on breadboards. Wrote microinstructions directly in binary and built hardware primitives out of those microinstructions. Used the 16 bytes of memory to write a multiplication program from these primitives.
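For reference, the high-level logic of that multiplication program, sketched in Python (the real version is, of course, binary microinstructions operating on an 8-bit accumulator):

```python
def multiply(a: int, b: int) -> int:
    """8-bit multiplication by repeated addition (conceptual sketch)."""
    result = 0
    while b > 0:
        result = (result + a) & 0xFF   # 8-bit accumulator wraps around
        b -= 1
    return result
```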