Research

Current Research

A non-exhaustive list of directions and questions I am currently working on:

  • Interpretability
    • Science of Activation Engineering
      • What is the geometry of activations in abstract concept spaces (for example, alignment)?
      • What concepts are steerable, and what concepts are not?
      • What scaling laws govern concept linearity?
      • How does the geometry of personas evolve during EM finetuning?
    • Interpretability for Evaluation
      • Can interpretability methods outperform LLM-as-a-judge systems at detecting reward-hacking behavior, especially when the judge itself can be hacked?
  • Evaluations
    • How should we evaluate agentic situational awareness?
    • How can we run efficient agentic evals using fluid benchmarking?
  • Reinforcement Learning
    • Can implicit decision-aware auditing prevent goal misgeneralization?
  • Philosophy and Meta-Research
    • What are the ethics of personalized digital twins?
    • Academic research contains substantial noise. Why does this happen, and what can we do to reduce it?

Publications