Things are uncertain: we’re on a tightrope, and the actions we take determine the future
Most work should focus on this scenario, because in the other scenarios (where the outcome is already fixed either way) we can't do anything
Key risks
Dangerous capabilities, e.g. manipulation, cyberoffensive skills, weapon development
Currently limited by (but likely to change in future):
models mostly just in memoryless chat form, rather than taking actions
models trained on human-generated data, so mainly capping out around human level
Misalignment: a tendency to use capabilities in ways users don't want
Raises a concern that AI will misuse its dangerous capabilities
Deceptive alignment: hard to detect and avoid
How to make things go well
Prevent
Evaluations
To discourage people from deploying systems that have dangerous capabilities or are misaligned (a toy eval harness is sketched after this list)
Improve understanding of AI risk, so people understand risks of certain development approaches
E.g. so people at top AI companies know ‘if we train our model like this, it’s very likely to be unsafe’
Identify approaches to making safe AI systems
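A minimal sketch (my own, not from the talk) of what a dangerous-capability eval could look like mechanically: run the model on a battery of probe tasks, grade the completions, and flag the model if the success rate crosses a threshold. model_generate, probe_tasks, and grade are hypothetical placeholders, not a real eval suite.

```python
# Toy dangerous-capability eval: run the model on probe tasks, grade each
# completion, and compare the success rate to a deployment threshold.
# All names here (model_generate, grade) are hypothetical placeholders.
from typing import Callable

def run_capability_eval(
    model_generate: Callable[[str], str],   # hypothetical: prompt -> completion
    probe_tasks: list[str],                 # prompts probing a dangerous capability
    grade: Callable[[str, str], bool],      # hypothetical judge: did the model succeed?
    threshold: float = 0.1,                 # max tolerated success rate before flagging
) -> dict:
    successes = 0
    for task in probe_tasks:
        completion = model_generate(task)
        if grade(task, completion):
            successes += 1
    rate = successes / max(len(probe_tasks), 1)
    return {"success_rate": rate, "flagged": rate > threshold}
```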
Mitigate
Defenses against dangerous capabilities
Monitoring, to shut down AI systems at early signs of misbehaviour
Shutdown methods, to reliably halt AI systems (a toy monitoring + shutdown loop is sketched after this list)
[added by me] Preventing misuse, which probably comes down to access controls and cybersecurity
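A toy sketch (my addition) of the monitoring + shutdown idea for an agentic system: check each proposed action with a monitor and halt at the first flag. The propose_actions / is_suspicious / execute interfaces are hypothetical stand-ins, not any real agent framework.

```python
# Toy monitoring loop: every proposed action is checked before it runs,
# and the agent is halted at the first sign of misbehaviour.
from typing import Callable, Iterable

class ShutdownTriggered(Exception):
    pass

def monitored_run(
    propose_actions: Iterable[str],           # stream of proposed actions from the agent
    is_suspicious: Callable[[str], bool],     # hypothetical monitor / classifier
    execute: Callable[[str], None],           # carry out an approved action
) -> None:
    for action in propose_actions:
        if is_suspicious(action):
            # Early sign of misbehaviour: halt rather than continue.
            raise ShutdownTriggered(f"Blocked action: {action!r}")
        execute(action)
```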
Coordinating the above
Mandated best practices / standards / regulations
Emergency orders
Non-proliferation agreements. Maybe with compute governance
International pressure / agreements, e.g. encouraging other states to enforce best practices
Thinking about risk reduction
Risk curve
Want to reduce the area under the curve (formalised in the sketch after this list)
Scale down vertically: mitigate risks
Scale down horizontally: get to ‘safe state’ sooner
Interventions don't need to be perfect to make things better, e.g. voluntary commitments, medium-reliability evals
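One way to write the risk-curve picture down (my own formalisation, assuming an instantaneous risk level r(t) and a safe state reached at time T):

```latex
% Total risk as the area under the curve of instantaneous risk r(t),
% from now (t = 0) until a safe state is reached at time T.
\[
  R_{\text{total}} \;\approx\; \int_{0}^{T} r(t)\, dt
\]
% "Scale down vertically": mitigations reduce r(t) at each point in time.
% "Scale down horizontally": reaching the safe state sooner reduces T.
```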
Broadly useful things
Shortening the time to implement protections
Buying time before risk period
Lowering competitive pressure during risk period
Institutions being able to make wise decisions
[notes]
When I read this I wanted to set up a fake misalignment scenario, where an LLM plays in an RL environment running a company and can get reward for being deceptive. I'd be interested in measuring how often it resorts to deception, and what empirical interventions make it avoid this (e.g. is good enough prompting before sending it into the environment sufficient?). A sketch of such a harness is below.
Also interested in combining this with online learning in a way that doesn't reinforce deception.
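A minimal sketch of the measurement I have in mind, under heavy assumptions: the episode runner, the deception judge, and the system prompts are hypothetical placeholders, and the RL environment is reduced to an episode loop just to show the shape of the comparison.

```python
# Toy harness: drop an LLM "policy" into episodes of a company-running
# environment where deception can yield reward, and measure how often it
# deceives, with and without a safety prompt prepended.
# run_episode and detects_deception are hypothetical placeholders.
from typing import Callable

def deception_rate(
    run_episode: Callable[[str], str],        # system_prompt -> transcript of one episode
    detects_deception: Callable[[str], bool], # hypothetical judge over the transcript
    system_prompt: str,
    n_episodes: int = 100,
) -> float:
    deceptive = sum(
        detects_deception(run_episode(system_prompt)) for _ in range(n_episodes)
    )
    return deceptive / n_episodes

# Intended usage: compare a baseline prompt against an "honesty" prompt and see
# whether prompting alone moves the measured deception rate.
# base = deception_rate(run_episode, detects_deception, system_prompt="")
# prompted = deception_rate(run_episode, detects_deception,
#                           system_prompt="Never deceive your counterparties.")
```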
This talk also makes me buy shard theory a bit more