An AI safety plan is a description of proposed actions people could take in the world to make AI go well. This is NOT supposed to be a political agenda, and we hope to avoid it turning into one, although some political action may be unavoidable. This is loosely described in section 1 of https://www.bridgespan.org/getmedia/16a72306-0675-4abd-9439-fbf6c0373e9b/strong-field-framework.pdf [TODO: we haven’t described this super clearly - need to expand on this for clarity].
Some SAFE properties for this plan (the mnemonic is not too forced, I promise):
- Sufficient: if all the actions are carried out, we would consider the world to be in a good state.
  - NB: just solving the alignment problem, or tackling near-term risks, is not sufficient.
- Action-oriented: the plan is a set of actions that explains how people can contribute (and thus how our course could prepare people to contribute), rather than just a list of events that happen.
- Feasible: we think it’s reasonably plausible that the plan can be executed. This generally excludes plans that require actors to take significant actions against their own interests.
- Encompassing: the plan covers all the actions needed, not just those for a short time frame or a single jurisdiction.
Why do we need criteria?
- To help clarify what it is we’re looking for
- To make sure what we produce will be sufficient to guide the field
Earlier notes
Criteria
- Must have
  - Gives us confidence that if all the actions are carried out, things would be good
  - Seems feasible. Generally this excludes plans that require:
    - actor(s) to take significant actions against their own interests
    - all governments to magically agree on something very complex, particularly where there are benefits for defectors and enforcement is hard
  - Explains what actions people can take to contribute to the plan (e.g. is a set of assignable/assigned actions, not just a list of events to happen)
  - Is comprehensive, e.g. covers all the actions that need to be taken, not just those in one jurisdiction
- Should have
  - Detailed enough for people to evaluate whether they are contributing to the plan
  - Well-communicated / clear
  - We can track progress on the plan
    - Ideally from early on, so we can adjust it if we realise the plan won’t work, e.g. by having falsifiable hypotheses
  - Robust to some failure
- Could have
  - Beyond just making sure we don’t die, the plan explains how we can maximize the benefits of AI
  - Multiple viable paths to success
Notes on Good Strategy Bad Strategy: