Vladimir Mikulik

research

∙ 07/28/2023

The Hydra Effect: Emergent Self-repair in Language Model Computations

We investigate the internal structure of language model computations usi...

Thomas McGrath, et al.

research

∙ 07/18/2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Circuit analysis is a promising technique for understanding the internal...

Tom Lieberum, et al.

research

∙ 01/12/2023

Tracr: Compiled Transformers as a Laboratory for Interpretability

Interpretability research aims to build tools for understanding machine ...

David Lindner, et al.

research

∙ 03/21/2022

Teaching language models to support answers with verified quotes

Recent large language models often answer factual questions correctly. B...

Jacob Menick, et al.

research

∙ 03/26/2021

Alignment of Language Agents

For artificial intelligence to be beneficial to humans the behaviour of ...

Zachary Kenton, et al.

research

∙ 03/05/2021

Causal Analysis of Agent Behavior for AI Safety

As machine learning systems become more powerful they also become increa...

Grégoire Delétang, et al.

research

∙ 10/23/2020

Algorithms for Causal Reasoning in Probability Trees

Probability trees are one of the simplest models of causal generative pr...

Tim Genewein, et al.

research

∙ 10/21/2020

Meta-trained agents implement Bayes-optimal agents

Memory-based meta-learning is a powerful technique to build agents that ...

Vladimir Mikulik, et al.

research

∙ 09/25/2019

Neural networks are a priori biased towards Boolean functions with low entropy

Understanding the inductive bias of neural networks is critical to expla...

Chris Mingard, et al.

research

∙ 06/05/2019

Risks from Learned Optimization in Advanced Machine Learning Systems

We analyze the type of learned optimization that occurs when a learned m...

Evan Hubinger, et al.
