Projects - AI Alignment

Sparse Autoencoders

SAE Feature Image

Sparse autoencoders (SAEs) are an unsupervised technique for decomposing a model’s activations into interpretable feature vectors. I recently published a set of sparse autoencoders for the residual stream of GPT-2 Small here. You can browse these features on Neuronpedia.
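The decomposition described above can be sketched in a few lines. This is a minimal, illustrative forward pass, not the published SAEs: the dimensions, weight initialisation, and loss coefficient are all assumptions chosen for the toy example.

```python
import numpy as np

# Minimal sparse autoencoder sketch (toy sizes, assumed hyperparameters).
rng = np.random.default_rng(0)

d_model = 8    # residual-stream width (toy value)
d_sae = 32     # overcomplete dictionary of learned features

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative feature coefficients, then decode."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps feature activations >= 0
    x_hat = f @ W_dec + b_dec               # reconstruct activations from features
    return f, x_hat

x = rng.normal(size=(4, d_model))           # a batch of model activations
f, x_hat = sae_forward(x)

# Training would minimise reconstruction error plus an L1 sparsity penalty,
# which pushes most feature coefficients to exactly zero:
recon_loss = np.mean((x - x_hat) ** 2)
l1_penalty = np.mean(np.abs(f).sum(axis=-1))
loss = recon_loss + 1e-3 * l1_penalty
```

The key design choice is the overcomplete dictionary (`d_sae > d_model`) combined with the L1 penalty: together they encourage each activation to be explained by a small number of interpretable directions.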

My research into sparse autoencoders is currently being supervised by Neel Nanda at the MATS Program.

Decision Transformer Interpretability


In this project I apply the mathematical framework for transformer circuits to Decision Transformers, a reinforcement learning method designed to produce agents that can simulate players of arbitrary quality. This project gave me a deeper understanding of many mechanistic interpretability techniques, of the nuances of studying circuits, and of how to look for goal representations inside neural networks.
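The "simulate players of arbitrary quality" framing comes from how decision transformers condition on returns. Here is an illustrative sketch, not the project's actual code: the rewards, token names, and sequence layout are assumptions used to show the idea.

```python
import numpy as np

# Toy episode: reward arrives only at the final step.
rewards = np.array([0.0, 0.0, 1.0])

# Return-to-go at step t is the sum of rewards from t onward; conditioning
# the model on a high initial return-to-go asks it to act like a
# high-quality player, a low one like a weak player.
rtg = np.cumsum(rewards[::-1])[::-1]

states = ["s0", "s1", "s2"]      # placeholder state tokens (assumed)
actions = ["a0", "a1", "a2"]     # placeholder action tokens (assumed)

# Interleave (return-to-go, state, action) triples into the sequence
# the transformer is trained on, treating RL as sequence modelling.
sequence = []
for r, s, a in zip(rtg, states, actions):
    sequence.extend([("rtg", float(r)), ("state", s), ("action", a)])
```

At inference time the practitioner chooses the initial return-to-go, which is what makes the trained model a controllable simulator of players at different skill levels.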

You can find an initial write-up here. The main GitHub repo for the project is here. I published an update with some findings here, which I then applied to understanding spelling in GPT-J here.


ARENA

ARENA (Alignment Research ENgineering Accelerator) was a 9-week research engineering accelerator I participated in, during which we completed a series of increasingly sophisticated projects, culminating in my capstone on Decision Transformer Interpretability. These projects included:

Projects - Computational Biology

I studied computational biology at university and have worked on a number of projects in this area. These include: