Things are uncertain: we’re on a tightrope, and the actions we take determine the future
Most work should focus on this scenario, because in the other scenarios (where the outcome is already fixed either way) we can't do anything
Key risks
Dangerous capabilities, e.g. manipulation, cyberoffensive skills, weapon development
Currently limited by (but likely to change in future):
models mostly just in memoryless chat form, rather than taking actions
models trained on human-generated data, so mainly capping out around human level
Misalignment: a tendency to use capabilities in ways users don't want
Raises a concern that AI will misuse its dangerous capabilities
Deceptive alignment: hard to detect and avoid
How to make things go well
Prevent
Evaluations
To discourage people from deploying systems that have dangerous capabilities or are misaligned (a toy eval harness is sketched after this list)
Improve understanding of AI risk, so people understand risks of certain development approaches
E.g. so people at top AI companies know ‘if we train our model like this, it’s very likely to be unsafe’
Identify approaches to making safe AI systems
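A minimal sketch (my own, not from the talk) of what a dangerous-capability eval could look like mechanically: run the model on a battery of probe tasks, grade the completions, and flag the model if the success rate crosses a threshold. model_generate, probe_tasks, and grade are hypothetical placeholders, not a real eval suite.

```python
# Toy dangerous-capability eval: run the model on probe tasks, grade each
# completion, and compare the success rate to a deployment threshold.
# All names here (model_generate, grade) are hypothetical placeholders.
from typing import Callable

def run_capability_eval(
    model_generate: Callable[[str], str],   # hypothetical: prompt -> completion
    probe_tasks: list[str],                 # prompts probing a dangerous capability
    grade: Callable[[str, str], bool],      # hypothetical judge: did the model succeed?
    threshold: float = 0.1,                 # max tolerated success rate before flagging
) -> dict:
    successes = 0
    for task in probe_tasks:
        completion = model_generate(task)
        if grade(task, completion):
            successes += 1
    rate = successes / max(len(probe_tasks), 1)
    return {"success_rate": rate, "flagged": rate > threshold}
```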
Mitigate
Defenses against dangerous capabilities
Monitoring, to shut down AI systems at early signs of misbehaviour
Shutdown methods, to reliably halt AI systems (a toy monitoring + shutdown loop is sketched after this list)
[added by me] Preventing misuse, which probably comes down to access controls and cybersecurity
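A toy sketch (my addition) of the monitoring + shutdown idea for an agentic system: check each proposed action with a monitor and halt at the first flag. The propose_actions / is_suspicious / execute interfaces are hypothetical stand-ins, not any real agent framework.

```python
# Toy monitoring loop: every proposed action is checked before it runs,
# and the agent is halted at the first sign of misbehaviour.
from typing import Callable, Iterable

class ShutdownTriggered(Exception):
    pass

def monitored_run(
    propose_actions: Iterable[str],           # stream of proposed actions from the agent
    is_suspicious: Callable[[str], bool],     # hypothetical monitor / classifier
    execute: Callable[[str], None],           # carry out an approved action
) -> None:
    for action in propose_actions:
        if is_suspicious(action):
            # Early sign of misbehaviour: halt rather than continue.
            raise ShutdownTriggered(f"Blocked action: {action!r}")
        execute(action)
```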
Coordinating the above
Mandated best practices / standards / regulations
Emergency orders
Non-proliferation agreements. Maybe with compute governance
International pressure / agreements, e.g. encouraging other states to enforce best practices
Thinking about risk reduction
Risk curve
Want to reduce the area under the curve (formalised in the sketch after this list)
Scale down vertically: mitigate risks
Scale down horizontally: get to ‘safe state’ sooner
Interventions don't need to be perfect to make things better, e.g. voluntary commitments, medium-reliability evals
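One way to write the risk-curve picture down (my own formalisation, assuming an instantaneous risk level r(t) and a safe state reached at time T):

```latex
% Total risk as the area under the curve of instantaneous risk r(t),
% from now (t = 0) until a safe state is reached at time T.
\[
  R_{\text{total}} \;\approx\; \int_{0}^{T} r(t)\, dt
\]
% "Scale down vertically": mitigations reduce r(t) at each point in time.
% "Scale down horizontally": reaching the safe state sooner reduces T.
```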
Broadly useful things
Shortening the time to implement protections
Buying time before risk period
Lowering competitive pressure during risk period
Institutions being able to make wise decisions
[notes]
When I read this I wanted to set up a fake misalignment scenario, where an LLM plays in an RL environment running a company and can get reward for being deceptive. I'd be interested in measuring how often it resorts to deception, and what empirical interventions make it avoid this (e.g. is good enough prompting before sending it into the environment sufficient?). A sketch of such a harness is below.
Also interested in combining this with online learning in a way that doesn't reinforce deception.
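A minimal sketch of the measurement I have in mind, under heavy assumptions: the episode runner, the deception judge, and the system prompts are hypothetical placeholders, and the RL environment is reduced to an episode loop just to show the shape of the comparison.

```python
# Toy harness: drop an LLM "policy" into episodes of a company-running
# environment where deception can yield reward, and measure how often it
# deceives, with and without a safety prompt prepended.
# run_episode and detects_deception are hypothetical placeholders.
from typing import Callable

def deception_rate(
    run_episode: Callable[[str], str],        # system_prompt -> transcript of one episode
    detects_deception: Callable[[str], bool], # hypothetical judge over the transcript
    system_prompt: str,
    n_episodes: int = 100,
) -> float:
    deceptive = sum(
        detects_deception(run_episode(system_prompt)) for _ in range(n_episodes)
    )
    return deceptive / n_episodes

# Intended usage: compare a baseline prompt against an "honesty" prompt and see
# whether prompting alone moves the measured deception rate.
# base = deception_rate(run_episode, detects_deception, system_prompt="")
# prompted = deception_rate(run_episode, detects_deception,
#                           system_prompt="Never deceive your counterparties.")
```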
This talk also makes me buy shard theory a bit more