Research
Current Research
A non-exhaustive list of directions and questions I am currently working on:
- Interpretability
- Science of Activation Engineering
- What is the geometry of activations in abstract concept spaces (for example, alignment)?
- How do we unlock better steering methods for LLMs?
- Can we measure the goals of AI systems using interpretability methods?
- How does the geometry of personas evolve during emergent misalignment (EM) fine-tuning?
- Reinforcement Learning
- Can implicit decision-aware auditing prevent goal misgeneralization?
Publications
- Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Mis-alignment in LLMs
Aneja, K., Mittal, M., Goel, A., Kumaraguru, P., Bonagiri, V.K.
AI4GOOD, ICML 2026
- Efficient Safety Benchmarking via Item Response Theory
Spagliardi, F.*, Silva, M.*, Datta, A.*, Zhou, A., Bonagiri, V.K., Cruz, D.
AI4GOOD, ICML 2026
- Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
Vamshi Krishna Bonagiri, Ponnurangam Kumaraguru, Khanh Nguyen, Benjamin Plaut
Reliable ML and Regulatable ML workshops, NeurIPS 2025
- If Pigs Could Fly... Can LLMs Logically Reason Through Counterfactuals?
Ishwar B Balappanawar*, Vamshi Krishna Bonagiri*, Anish R Joishy*, Manas Gaur, Krishnaprasad Thirunarayan, Ponnurangam Kumaraguru
arXiv 2025 (Under Review)
- SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, Manas Gaur
International Conference on Computational Linguistics, COLING 2024
- Dark Side of the Tune: Investigating the maladaptive outcomes of excessive music consumption in the age of unlimited music access
Vamshi Krishna Bonagiri, Vinoo Alluri
18th International Conference on Music Perception and Cognition, ICMPC 2025
- Measuring Moral Inconsistencies in Large Language Models
Vamshi Krishna Bonagiri, Sreeram Vennam, Manas Gaur, Ponnurangam Kumaraguru
The Sixth BlackboxNLP Workshop, EMNLP 2023
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP 2025)
- Towards Effective Paraphrasing for Information Disguise
Anmol Agarwal, Shrey Gupta, Vamshi Krishna Bonagiri, Manas Gaur, Joseph Reagle, Ponnurangam Kumaraguru
European Conference on Information Retrieval, ECIR 2023
- Are Deepfakes Concerning? Analyzing Conversations of Deepfakes on Reddit and Exploring Societal Implications
Dilrukshi Gamage, Piyush Ghasiya, Vamshi Krishna Bonagiri, Mark E Whiting, Kazutoshi Sasahara
CHI Conference on Human Factors in Computing Systems, CHI 2022
- Cobias: Contextual Reliability in Bias Assessment
Priyanshul Govil, Hemang Jain, Vamshi Krishna Bonagiri, Aman Chadha, Sanorita Dey, Ponnurangam Kumaraguru, Manas Gaur
Web Science Conference, WebSci 2025
- Representation Learning for Identifying Depression Causes in Social Media
Priyanshul Govil, Vamshi Krishna Bonagiri, Mayank Gaur, Ponnurangam Kumaraguru
Third ACM SIGKDD Workshop on Knowledge-infused Learning (KiL 2023)