publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
2024
- Under ReviewSecret Collusion Among Generative AI AgentsSumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, and Christian Schroeder Witt2024
Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both the AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.
2023
- ICLR 2024STARC: A General Framework For Quantifying Differences Between Reward FunctionsJ. Skalse, L. Farnik, S. R. Motwani, E. Jenner, A. Gleave, and A. AbateThe Twelfth International Conference on Learning Representations, Sep 2023
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm with high probability will learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms both easier and more principled.
- NeurIPS MASECA Perfect Collusion Benchmark: How can AI agents be prevented from colluding with information-theoretic undetectability?S. R. Motwani, M. Baranchuk, L. Hammond, and C. S. WittIn Multi-Agent Security Workshop, NeurIPS’23, Oct 2023
Secret collusion among advanced AI agents is widely considered a significant risk to AI safety. In this paper, we investigate whether LLM agents can learn to collude undetectably through hiding secret messages in their overt communications. To this end, we implement a variant of Simmon’s prisoner problem using LLM agents and turn it into a stegosystem by leveraging recent advances in perfectly secure steganography. We suggest that our resulting benchmark environment can be used to investigate how easily LLM agents can learn to use perfectly secure steganography tools, and how secret collusion between agents can be countered pre-emptively through paraphrasing attacks on communication channels. Our work yields unprecedented empirical insight into the question of whether advanced AI agents may be able to collude unnoticed.