I was accepted into the summer cohort of AI safety fundamentals by BlueDot Impact, a UK based organization offering courses related to AI safety. The curriculum is freely available for anyone at anytime but it has an application process for participating in a facilitated cohort, which is also offered without any cost.
The program is well designed collecting a number of resources that have been influential in the field of AI safety.
So far we have completed four sessions.
AI and the years ahead
This had articles for people to understand the AI systems, from the architecture and principles to the general overview of development of AI and computing trends.
What is AI alignment?
The basic idea of AI alignment. I was a bit unsatisfied by some of the resources here. I felt there could have been better resources to introduce this. AI alignment has various definitions with the basic idea being
AI systems being align with human values
This has been introduced in many ways:
AI being safe for humanity
AI doing what its creator intends to do
In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows:
If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire
This is a fairly difficult problem to even formulate.
A system... will often set... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want."
Specifying what you want is an extremely difficult task and it would most often differ from what you really want.
💡
I believe that we should differentiate different tasks performed by AI and scrutinize what is happening there.
Examples:
Image classification
LLM agent : An agent trained on using a fundamental fine-tuned LLM that is helpful and harmless.
Question: Can an agent using a foundational LLM trained on wide variety of internet data be truly harmless?
Conjecture: Any such Agent can be jailbroken.
Experiment: Using an AI agent to jailbreak another agent.
Question: How can it be made safer?
Experiment: Trained to not reveal information under various persuasive
Question: Can you get hold of sensitive information embedded in the training dataset based.
💡
I believe that LLM can be best used by training agents that make decisions on the internet search, tools use, verifying things on certain principles, relying only on the citable resources.
Question: What kinds of behaviour can be enforced on an AI?
Reinforcement learning from human (or AI) feedback