Research

Current Research

A non-exhaustive list of directions and questions I am currently working on:

  • Interpretability
    • Science of Activation Engineering
      • What is the geometry of activations in abstract concept spaces (for example, alignment)?
      • How do we unlock better steering methods for LLMs?
      • Can we measure the goals of AI systems using interpretability methods?
      • How does the geometry of personas evolve during EM finetuning?
  • Reinforcement Learning
    • Can implicit decision-aware auditing prevent goal misgeneralization?

Publications