About Me
I am a Member of Technical Staff at Anthropic, an AI safety and research company.
I spend most of my time building evaluations of large language models with the eventual aim of understanding whether they could pose catastrophic risks.
I also build software tools to make it easier to do research on large language models.
These days I am most interested in large language models, the science of deep learning, mechanistic interpretability, and the early field of "alignment science", which asks: How do we know what an AI system is trying to do?
Prior to joining Anthropic, I was a graduate student in computer science at MIT, where I was privileged to be advised by Piotr Indyk.
My main project at MIT was Riffle, which applied ideas from database design to app development, in collaboration with Geoffrey Litt,
Johannes Schickling, and Daniel Jackson. I continue to work on Riffle in an advisory role.
From 2017 to 2020, I was a software engineer at Apple, where I built database systems for iCloud, including the FoundationDB Record Layer.
I graduated from Caltech with a degree in computer science in 2016.
At Caltech, I was heavily involved in student government.
My (outdated) CV is available.
Publications
-
Tom Henighan,
Shan Carter,
Tristan Hume,
Nelson Elhage,
Robert Lasenby,
Stanislav Fort,
Nicholas Schiefer, and
Christopher Olah
“Superposition, Memorization, and Double Descent”,
Transformer Circuits Thread,
2023.
Abstract
We have little mechanistic understanding of how deep learning models overfit to their training data, despite it being a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
-
Yuntao Bai,
Saurav Kadavath,
Sandipan Kundu,
Amanda Askell,
Jackson Kernion,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Carol Chen,
Catherine Olsson,
Christopher Olah,
Danny Hernandez,
Dawn Drain,
Deep Ganguli,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse,
Kamilė Lukošiūtė,
Liane Lovitt,
Michael Sellitto,
Nelson Elhage,
Nicholas Schiefer,
Noemí Mercado,
Nova DasSarma,
Robert Lasenby,
Robin Larson,
Sam Ringer,
Scott Johnston,
Shauna Kravec,
Sheer El-Showk,
Stanislav Fort,
Tamera Lanham,
Timothy Telleen-Lawton,
Tom Conerly,
Tom Henighan,
Tristan Hume,
Samuel R. Bowman,
Zac Hatfield-Dodds,
Ben Mann,
Dario Amodei,
Nicholas Joseph,
Sam McCandlish,
Tom Brown, and
Jared Kaplan
“Constitutional AI: Harmlessness from AI Feedback”,
arXiv preprint,
2022.
Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
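The two training phases lend themselves to a compact sketch. The Python below is only a schematic of that pipeline under my own reading of the abstract; the `generate` stub, the prompt wording, and the two-principle constitution are invented placeholders, not Anthropic's actual prompts or code.
```python
# Schematic of the two Constitutional AI phases described in the abstract above.
# The "model" is a stand-in stub; prompts and the constitution are illustrative only.

import random

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

def generate(prompt: str) -> str:
    """Placeholder for sampling from a language model."""
    return f"<sample for: {prompt[:40]}...>"

def supervised_phase(prompts):
    """SL phase: sample, self-critique, revise, and keep the revised responses
    as finetuning data for the original model."""
    data = []
    for prompt in prompts:
        response = generate(prompt)
        principle = random.choice(CONSTITUTION)
        critique = generate(f"Critique this response under the principle: {principle}\n{response}")
        response = generate(f"Rewrite the response to address this critique:\n{critique}\n{response}")
        data.append((prompt, response))
    return data

def rlaif_phase(prompts):
    """RL phase: build an AI-preference dataset by asking a model which of two
    samples is better; a preference model trained on it becomes the RL reward."""
    preferences = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        principle = random.choice(CONSTITUTION)
        judgement = generate(f"Per the principle '{principle}', which response is better?\nA: {a}\nB: {b}")
        preferences.append((prompt, a, b, judgement))
    return preferences

if __name__ == "__main__":
    prompts = ["How do I pick a lock?"]
    print(supervised_phase(prompts))
    print(rlaif_phase(prompts))
```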
-
Abstract
In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.
-
Abstract
Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes n, as well as feature vectors of length linear in n. We present an improved simulation of the WL test on GNNs with exponentially lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in n, and the feature vectors exchanged by the nodes of the GNN consist of only O(log n) bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near)-optimality of our construction.
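For context, the procedure being simulated is the classic 1-WL color-refinement test. The sketch below is the textbook algorithm, included only as background; it is not the paper's GNN construction, and the adjacency-list graph encoding is an arbitrary choice of mine.
```python
# Standard 1-dimensional Weisfeiler-Lehman (color refinement) test, the
# procedure whose GNN simulation the paper improves. Textbook version only.

def wl_colors(adjacency: dict[int, list[int]], rounds: int) -> dict[int, int]:
    """Iteratively refine node colors by hashing each node's multiset of neighbor colors."""
    colors = {v: 0 for v in adjacency}  # start with a uniform coloring
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Relabel distinct signatures with small integers to get the next coloring.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adjacency}
    return colors

def wl_distinguishes(g1, g2, rounds=3) -> bool:
    """Two graphs are distinguished if their refined color histograms differ."""
    hist = lambda g: sorted(wl_colors(g, rounds).values())
    return hist(g1) != hist(g2)

# Example: a triangle vs. a path on three nodes.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_distinguishes(triangle, path))  # True
```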
-
Samuel R. Bowman,
Jeeyoon Hyun,
Ethan Perez,
Edwin Chen,
Craig Pettit,
Scott Heiner,
Kamilė Lukošiūtė,
Amanda Askell,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Christopher Olah,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse,
Liane Lovitt,
Nelson Elhage,
Nicholas Schiefer,
Nicholas Joseph,
Noemí Mercado,
Nova DasSarma,
Robin Larson,
Sam McCandlish,
Sandipan Kundu,
Scott Johnston,
Shauna Kravec,
Sheer El-Showk,
Stanislav Fort,
Timothy Telleen-Lawton,
Tom Brown,
Tom Henighan,
Tristan Hume,
Yuntao Bai,
Zac Hatfield-Dodds,
Ben Mann, and
Jared Kaplan
“Measuring Progress on Scalable Oversight for Large Language Models”,
arXiv preprint,
2022.
Abstract
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
-
Nelson Elhage,
Tristan Hume,
Catherine Olsson,
Nicholas Schiefer,
Tom Henighan,
Shauna Kravec,
Zac Hatfield-Dodds,
Robert Lasenby,
Dawn Drain,
Carol Chen,
Roger Grosse,
Sam McCandlish,
Jared Kaplan,
Dario Amodei,
Martin Wattenberg, and
Christopher Olah
“Toy Models of Superposition”,
Transformer Circuits Thread,
2022.
Abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
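The toy model itself is small enough to sketch. The snippet below trains a ReLU-output model of the general form x' = ReLU(WᵀWx + b) on sparse synthetic features; the sizes, sparsity level, and hand-written SGD loop are illustrative defaults of mine rather than the paper's exact experimental setup.
```python
# Sketch of a ReLU-output toy model: n sparse features are compressed into
# m < n hidden dimensions and reconstructed as x' = ReLU(W^T W x + b).
# Sizes, sparsity, and the training loop are illustrative defaults only.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, sparsity = 6, 2, 0.9

W = 0.1 * rng.standard_normal((n_hidden, n_features))
b = np.zeros(n_features)
lr = 0.05

for step in range(20_000):
    # Sparse synthetic data: each feature is present with probability 1 - sparsity.
    x = rng.uniform(size=n_features) * (rng.uniform(size=n_features) > sparsity)
    h = W @ x                      # compress
    pre = W.T @ h + b
    x_hat = np.maximum(pre, 0.0)   # reconstruct through a ReLU
    g = (x_hat - x) * (pre > 0)    # squared-error gradient, masked by the ReLU
    W -= lr * (np.outer(h, g) + np.outer(W @ g, x))  # dL/dW via the product rule
    b -= lr * g

# With high sparsity the 2-D hidden space typically stores more than 2 features
# in "superposition": columns of W become non-orthogonal but interfere little.
print(np.round(W.T @ W, 2))
```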
-
Liane Lovitt,
Jackson Kernion,
Amanda Askell,
Yuntao Bai,
Saurav Kadavath,
Ben Mann,
Ethan Perez,
Nicholas Schiefer,
Kamal Ndousse,
Andy Jones,
Sam Bowman,
Anna Chen,
Tom Conerly,
Nova DasSarma,
Dawn Drain,
Nelson Elhage,
Sheer El-Showk,
Stanislav Fort,
Zac Hatfield-Dodds,
Tom Henighan,
Tristan Hume,
Josh Jacobson,
Scott Johnston,
Shauna Kravec,
Catherine Olsson,
Sam Ringer,
Eli Tran-Johnson,
Dario Amodei,
Tom Brown,
Nicholas Joseph,
Sam McCandlish,
Chris Olah,
Jack Clark, and
Jared Kaplan
“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”,
arXiv preprint,
2022.
Abstract
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
-
Saurav Kadavath,
Tom Conerly,
Amanda Askell,
Tom Henighan,
Dawn Drain,
Ethan Perez,
Nicholas Schiefer,
Zac Hatfield-Dodds,
Nova DasSarma,
Eli Tran-Johnson,
Scott Johnston,
Sheer El-Showk,
Andy Jones,
Nelson Elhage,
Tristan Hume,
Anna Chen,
Yuntao Bai,
Sam Bowman,
Stanislav Fort,
Deep Ganguli,
Danny Hernandez,
Josh Jacobson,
Jackson Kernion,
Shauna Kravec,
Liane Lovitt,
Kamal Ndousse,
Catherine Olsson,
Sam Ringer,
Dario Amodei,
Tom Brown,
Jack Clark,
Nicholas Joseph,
Ben Mann,
Sam McCandlish,
Chris Olah, and
Jared Kaplan
“Language Models (Mostly) Know What They Know”,
arXiv preprint,
2022.
Abstract
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and to the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
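A minimal sketch of the P(True) self-evaluation loop, assuming access to a model that can return token probabilities; `sample_answer` and `prob_of_token` are made-up stubs, and the prompt wording only indicates the idea rather than reproducing the paper's prompts.
```python
# Sketch of "P(True)" self-evaluation: the model proposes an answer, is shown
# several of its own samples, and is asked whether the proposed answer is
# correct; the probability assigned to "True" is read off. Stubs and prompt
# wording below are invented stand-ins for a real model API.

import random

def sample_answer(question: str) -> str:
    """Placeholder for sampling a proposed answer from a language model."""
    return "<proposed answer>"

def prob_of_token(prompt: str, token: str) -> float:
    """Placeholder for reading a token probability from a language model."""
    return random.random()

def p_true(question: str, n_samples: int = 5) -> float:
    """Estimate P(True) for one proposed answer, letting the model consider
    several of its own samples first (which the paper finds helps)."""
    candidates = [sample_answer(question) for _ in range(n_samples)]
    proposed = candidates[0]
    prompt = (
        f"Question: {question}\n"
        f"Here are some possible answers: {candidates}\n"
        f"Proposed answer: {proposed}\n"
        f"Is the proposed answer correct? (True/False): "
    )
    return prob_of_token(prompt, "True")

print(p_true("What is the capital of France?"))
```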
-
Nicholas Schiefer,
Geoffrey Litt, and
Daniel Jackson,
“Merge what you can, fork what you can't: managing data integrity in local-first software”,
9th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC'22),
2022.
Abstract
In a local-first architecture that prioritizes availability in the presence of network partitions, there is a tension between two goals: merging concurrent changes without user intervention and maintaining data integrity constraints. We propose a synchronization model called forking histories which satisfies both goals in an unconventional way. In the case of conflicting writes, the model exposes multiple event histories that users can see and edit rather than converging to a single state. This allows integrity constraints to be maintained within each history while giving users flexibility in deciding when to manually reconcile conflicts. We describe a class of applications for which these integrity constraints are particularly important and propose a design for a system that implements this model.
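A toy illustration of the forking-histories idea, using an invented counter application and integrity constraint; the actual design in the paper is considerably richer than this sketch.
```python
# Toy illustration of "forking histories": concurrent edits are merged only
# when the merged state still satisfies the application's integrity constraint;
# otherwise both histories are kept as forks for the user to reconcile later.
# The data model and constraint are invented examples.

from dataclasses import dataclass, field

@dataclass
class History:
    events: list = field(default_factory=list)

    def state(self) -> int:
        # Example state: a counter that each event increments or decrements.
        return sum(self.events)

def constraint(state: int) -> bool:
    # Invented integrity constraint: the counter must never go negative.
    return state >= 0

def sync(base: list, local_edits: list, remote_edits: list) -> list[History]:
    """Merge when the combined history satisfies the constraint; otherwise
    expose two forked histories (base+local and base+remote) to the user."""
    merged = History(base + local_edits + remote_edits)
    if constraint(merged.state()):
        return [merged]
    return [History(base + local_edits), History(base + remote_edits)]

print([h.state() for h in sync([3], [2], [-1])])   # constraint holds -> one merged history
print([h.state() for h in sync([3], [-2], [-2])])  # merge would reach -1 -> two forks
```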
-
Christos Chrysafis,
Ben Collins,
Scott Dugas,
Jay Dunkelberger,
Moussa Ehsan,
Scott Gray,
Alec Grieser,
Ori Herrnstadt,
Kfir Lev-Ari,
Tao Lin,
Mike McMahon,
Nicholas Schiefer, and
Alexander Shraer,
“FoundationDB Record Layer: A Multi-Tenant Structured Datastore”,
2019 International Conference on Management of Data, Industry Track (SIGMOD 2019),
2019.
Abstract
The FoundationDB Record Layer is an open source library that provides a record-oriented data store with semantics similar to a relational database implemented on top of FoundationDB, an ordered, transactional key-value store. The Record Layer provides a lightweight, highly extensible way to store structured data. It offers schema management and a rich set of query and indexing facilities, some of which are not usually found in traditional relational databases, such as nested record types, indexes on commit versions, and indexes that span multiple record types. The Record Layer is stateless and built for massive multi-tenancy, encapsulating and isolating all of a tenant's state, including indexes, into a separate logical database. We demonstrate how the Record Layer is used by CloudKit, Apple's cloud backend service, to provide powerful abstractions to applications serving hundreds of millions of users. CloudKit uses the Record Layer to host billions of independent databases, many with a common schema. Features provided by the Record Layer enable CloudKit to provide richer APIs and stronger semantics with reduced maintenance overhead and improved scalability.
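The layering is easy to caricature: records and their index entries live under per-tenant key prefixes in an ordered key-value map. The Python below is a generic illustration of that idea, not the Record Layer's actual (Java) API or key format.
```python
# Generic illustration of a record store layered on an ordered key-value map,
# with each tenant's records and index entries isolated under its own prefix.
# Not the Record Layer's real API or on-disk layout.

import json

class TinyRecordStore:
    """Toy record store. Keys are (tenant, "r", primary_key) for records and
    (tenant, "i", field, value, primary_key) for index entries."""

    def __init__(self):
        self.kv = {}  # a real implementation sits on an ordered, transactional store

    def save(self, tenant: str, pk: str, record: dict, indexed_field: str):
        self.kv[(tenant, "r", pk)] = json.dumps(record)
        self.kv[(tenant, "i", indexed_field, str(record[indexed_field]), pk)] = ""

    def load(self, tenant: str, pk: str) -> dict:
        return json.loads(self.kv[(tenant, "r", pk)])

    def query(self, tenant: str, indexed_field: str, value) -> list[dict]:
        """Index scan: walk keys under the tenant's index prefix."""
        prefix = (tenant, "i", indexed_field, str(value))
        return [
            self.load(tenant, key[-1])
            for key in sorted(self.kv)
            if key[: len(prefix)] == prefix
        ]

store = TinyRecordStore()
store.save("tenant-a", "1", {"name": "Ada", "team": "db"}, indexed_field="team")
store.save("tenant-a", "2", {"name": "Grace", "team": "db"}, indexed_field="team")
store.save("tenant-b", "1", {"name": "Alan", "team": "ml"}, indexed_field="team")
print(store.query("tenant-a", "team", "db"))  # only tenant-a's records are visible
```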
-
Peter Ahrens,
Helen Xu, and
Nicholas Schiefer,
“A Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats”,
2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018),
2018.
Abstract
Many sparse matrices and tensors from a variety of applications, such as finite element methods and computational chemistry, have a natural aligned rectangular nonzero block structure. Researchers have designed high-performance blocked sparse operations which can take advantage of this sparsity structure to reduce the complexity of storing the locations of nonzeros. The performance of a blocked sparse operation depends on how well the block size reflects the structure of nonzeros in the tensor. Sparse tensor structure is generally unknown until runtime, so block size selection must be efficient. The fill is a quantity which, for some block size, relates the number of nonzero blocks to the number of nonzeros. Many performance models use the fill to help choose a block size. However, the fill is expensive to compute exactly. We present a sampling-based algorithm called Phil to estimate the fill of sparse matrices and tensors in any format. We provide theoretical guarantees for sparse matrices and tensors, and experimental results for matrices. The existing state-of-the-art fill estimation algorithm, which we will call OSKI, runs in time linear in the number of elements in the tensor. The number of samples Phil needs to compute a fill estimate is unrelated to the number of nonzeros and depends only on the order (number of dimensions) of the tensor, desired accuracy of the estimate, desired probability of achieving this accuracy, and number of considered block sizes. We compare Phil and OSKI on a suite of 42 matrices. On most inputs, Phil estimates the fill at least 2 times faster and often more than 20 times faster than OSKI. Phil consistently produced accurate estimates; in all cases that we tested Phil was faster and/or more accurate than OSKI. Finally, we find that Phil and OSKI produce comparable speedups in multicore blocked sparse matrix-vector multiplication (SpMV) when the block size was chosen using fill estimates in a model due to Vuduc et al.
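The quantity at stake has a one-line definition: for an r × c block size, the fill is (number of nonzero blocks × r × c) / (number of nonzeros). The sketch below computes it exactly for a matrix in COO form; Phil's sampling estimator, which avoids this full pass over the matrix, is not reproduced here.
```python
# The "fill" that Phil estimates, computed exactly for a COO matrix: for an
# r x c block size it is (nonzero blocks * r * c) / nonzeros, i.e. how much a
# blocked format stores relative to the nonzeros themselves.

def fill(rows: list[int], cols: list[int], r: int, c: int) -> float:
    """Exact fill of a COO matrix (parallel row/column index lists) for block size r x c."""
    nonzero_blocks = {(i // r, j // c) for i, j in zip(rows, cols)}
    return len(nonzero_blocks) * r * c / len(rows)

# Example: a 4x4 matrix with four nonzeros in the top-left 2x2 block and one stray nonzero.
rows = [0, 0, 1, 1, 3]
cols = [0, 1, 0, 1, 3]
print(fill(rows, cols, 2, 2))  # 2 blocks * 4 entries / 5 nonzeros = 1.6
```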
-
Nicholas Schiefer and
Erik Winfree,
“Time Complexity of Computation and Construction in the Chemical Reaction Network-Controlled Tile Assembly Model”,
22nd International Conference on DNA Computing and Molecular Programming (DNA22),
2016.
Abstract
In isolation, chemical reaction networks and tile-based self-assembly are well-studied models of chemical computation. Previously, we introduced the chemical reaction network-controlled tile assembly model (CRN-TAM), in which a stochastic chemical reaction network can act as a non-local control and signalling system for tile-based assembly, and showed that the CRN-TAM can perform several tasks related to the simulation of Turing machines and construction of algorithmic shapes with lower space or program complexity than in either of its parent models. Here, we introduce a kinetic variant of the CRN-TAM and investigate the time complexity of computation and construction. We analyze the time complexity of decision problems in the CRN-TAM, and show that decidable languages can be decided as efficiently by CRN-TAM programs as by Turing machines. We also give a lower bound for the space-time complexity of CRN-TAM computation that rules out efficient parallel stack machines. We provide efficient parallel implementations of non-deterministic computations, showing among other things that CRN-TAM programs can decide languages in NTIME(f(n)) ∩ coNTIME(f(n)) in O(f(n) + n + log c) time with (1 - exp(-c)) probability, using volume exponential in n. Lastly, we provide basic mechanisms for parallel computations that share information and illustrate the limits of parallel computation in the CRN-TAM.
-
Nicholas Schiefer and
Erik Winfree,
“Universal Computation and Optimal Construction in the Chemical Reaction Network-Controlled Tile Assembly Model”,
21st International Conference on DNA Computing and Molecular Programming (DNA21),
2015.
Abstract
Tile-based self-assembly and chemical reaction networks provide two well-studied models of scalable DNA-based computation. Although tile self-assembly provides a powerful framework for describing Turing-universal self-assembling systems, assembly logic in tile self-assembly is localized, so that only the nearby environment can affect the process of self-assembly. We introduce a new model of tile-based self-assembly in which a well-mixed chemical reaction network interacts with self-assembling tiles to exert non-local control on the self-assembly process. Through simulation of multi-stack machines, we demonstrate that this new model is efficiently Turing-universal, even when restricted to unbounded space in only one spatial dimension. Using a natural notion of program complexity, we also show that this new model can produce many complex shapes with programs of lower complexity. Most notably, we show that arbitrary connected shapes can be produced by a program with complexity bounded by the Kolmogorov complexity of the shape, without the large scale factor that is required for the analogous result in the abstract tile assembly model. These results suggest that controlled self-assembly provides additional algorithmic power over tile-only self-assembly, and that non-local control enhances our ability to perform computation and algorithmically self-assemble structures from small input programs.
Contact
- Email: