August 24th

  • Long wait between updates.
  • Week 8 of the AI alignment program
  • Project on understanding LLMs
  • Transformers learn shortcuts to Automata
    • They train the network on the inputs and outputs of automata.
    • It seems they don't claim the automaton itself is what the transformer is learning (need to confirm).
    • They show that a parallel "shortcut" theoretical solution exists for these automata, and they think this is what the transformer might be learning (sketched below).
    • The transformer can also be forced to learn the recursive solution, which is shown to be more robust.
    • Again, I am not sure if the first version that the transformer learns is actually a solution or simply memorization.
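A minimal sketch of the contrast, using my own toy automaton (the states, transition table, and function names are illustrative, not the paper's code): the recursive simulation walks the input step by step in O(T) sequential steps, while the parallel "shortcut" composes transition functions with a prefix-style scan in O(log T) depth.

```python
# Toy illustration: sequential (recursive) automaton simulation vs. a parallel
# "shortcut" that composes transition functions pairwise in logarithmic depth.
import random

STATES = range(3)
# Hypothetical automaton: each input symbol induces a transition function,
# written as a tuple mapping state -> next state.
DELTA = {0: (1, 2, 0), 1: (0, 0, 1)}

def recursive_run(inputs, start=0):
    # O(T) sequential recursion: state_t = delta[x_t](state_{t-1})
    state = start
    for x in inputs:
        state = DELTA[x][state]
    return state

def compose(f, g):
    # apply f first, then g; function composition is associative
    return tuple(g[f[s]] for s in STATES)

def shortcut_run(inputs, start=0):
    # O(log T) depth: pairwise-compose transition functions level by level;
    # all compositions within a level are independent (parallelizable).
    fs = [DELTA[x] for x in inputs]
    while len(fs) > 1:
        fs = [compose(fs[i], fs[i + 1]) if i + 1 < len(fs) else fs[i]
              for i in range(0, len(fs), 2)]
    return fs[0][start]

xs = [random.choice([0, 1]) for _ in range(64)]
assert recursive_run(xs) == shortcut_run(xs)
```

The trick works because composition of transition functions is associative, which is what lets a shallow parallel circuit (and, the paper argues, a shallow transformer) stand in for the recursion.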
     

August 23rd

  • Theoretical linguistic capabilities of LLMs
  • Recursion in LLMs

August 11th

  • Technical AI governance
     

August 5th

  • An intuition for the transformer model from the perspective of the residual stream
  • E.g., GPT-style models (decoder-only transformers)
  • For simplification, they leave out the MLPs: they want to view the model as a linear sum, so they avoid the activation functions that cause the non-linearity (see the sketch after the callout below).
💡
I am curious how informative this toy model can be: it is already a small model, and on top of that we are not considering the activation function.
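To make the linear-sum picture concrete, here is a toy sketch (all shapes and names are my own assumptions, with layer norm dropped along with the MLPs): each attention head writes additively into the residual stream, so the final residual is exactly the embedding plus the sum of head outputs.

```python
# Toy residual-stream view: with MLPs and nonlinear projections removed, every
# attention head just adds its output back into the stream.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads, n_layers = 8, 16, 2, 2

def head_output(resid, W_qk, W_ov):
    scores = resid @ W_qk @ resid.T
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)     # causal mask
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                # softmax attention pattern
    return A @ resid @ W_ov                      # move info, then project

resid = rng.normal(size=(T, d_model))            # token embeddings
contributions = [resid.copy()]                   # the "direct path"
for _ in range(n_layers):
    for _ in range(n_heads):                     # (heads read sequentially here
        W_qk = rng.normal(size=(d_model, d_model)) / d_model  # for brevity)
        W_ov = rng.normal(size=(d_model, d_model)) / d_model
        out = head_output(resid, W_qk, W_ov)
        contributions.append(out)
        resid = resid + out                      # each head writes additively

# final residual == embedding + sum over all head outputs
assert np.allclose(resid, sum(contributions))
```

Note that softmax attention remains the one nonlinearity kept here, so the model reads as a linear sum only once the attention patterns are treated as fixed.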
         
Question: What loss function is used for the next-word predictor?
Answer: Cross-entropy loss over an output vector whose size is the vocabulary size, treating prediction as classification among all possible words in the vocabulary (a minimal snippet below).
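For concreteness (shapes and names here are my assumptions):

```python
# Next-word loss: one logit per vocabulary word at each position,
# cross-entropy against the actual next token.
import torch
import torch.nn.functional as F

n_vocab, T = 50_000, 12
logits = torch.randn(T, n_vocab)         # one score per vocabulary word
targets = torch.randint(n_vocab, (T,))   # the observed next tokens
loss = F.cross_entropy(logits, targets)  # softmax + negative log-likelihood
```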
  • 0-layer transformer: embedding followed directly by unembedding, so at best it can learn bigram statistics (sketched below).
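A sketch of that case (matrix names follow the usual W_E / W_U convention; the numbers are mine): with zero layers the logits factor through the single matrix W_U W_E, i.e. a bigram table.

```python
# 0-layer transformer: the next-token logits depend only on the current token.
import torch

n_vocab, d_model = 1000, 64
W_E = torch.randn(d_model, n_vocab)      # embedding
W_U = torch.randn(n_vocab, d_model)      # unembedding
tokens = torch.tensor([3, 17, 42])
logits = W_U @ W_E[:, tokens]            # (n_vocab, seq) per-token logits
# equivalently, the whole model is one matrix acting as a bigram logit table:
bigram_table = W_U @ W_E                 # (n_vocab, n_vocab)
assert torch.allclose(logits, bigram_table[:, tokens])
```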
         
         

August 2nd

  • Go through this week's readings on robustness to adversarial attacks.
  • First paper: attacks on LLMs to make them produce harmful content.
    • Manual jailbreaking requires human ingenuity and is time-intensive.
    • Previous attempts at generating harmful responses using gradients of the loss function depended on a particular model and a particular response.
    • Here the technique is to make the model start its answer with "Sure, here is …", where … restates the harmful prompt. In that case the model will generate the harmful content with high likelihood. The same trick is used in prompt-based jailbreaking as well, but the success rate is lower.
    • The model's gradients are used to find an optimal suffix which, appended to any prompt, makes the answer start as above (a sketch of this objective follows).
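A rough reconstruction of that objective, not the paper's code: the loss is the cross-entropy of the affirmative target given prompt + suffix, and the gradient through a one-hot relaxation of the suffix ranks candidate token substitutions. The attribute `model.embedding.weight` and the `inputs_embeds` forward signature are assumptions about the model object.

```python
# Sketch of a greedy-coordinate-gradient style suffix attack objective.
import torch
import torch.nn.functional as F

def suffix_loss(model, prompt_ids, suffix_onehot, target_ids):
    # Embed via one-hot @ embedding so the loss is differentiable w.r.t. the suffix.
    E = model.embedding.weight                       # (n_vocab, d_model), assumed
    embeds = torch.cat([
        E[prompt_ids],                               # fixed user prompt
        suffix_onehot @ E,                           # optimizable adversarial suffix
        E[target_ids],                               # target: "Sure, here is ..."
    ]).unsqueeze(0)
    logits = model(inputs_embeds=embeds).squeeze(0)  # assumed forward signature
    n = len(target_ids)
    # cross-entropy of the target tokens, predicted from the preceding positions
    return F.cross_entropy(logits[-n - 1:-1], target_ids)

def top_substitutions(model, prompt_ids, suffix_ids, target_ids, k=8):
    # One step: linearize the loss around the current one-hot suffix and return,
    # per suffix position, the k most promising replacement tokens.
    n_vocab = model.embedding.weight.shape[0]
    onehot = F.one_hot(suffix_ids, n_vocab).float().requires_grad_()
    suffix_loss(model, prompt_ids, onehot, target_ids).backward()
    return (-onehot.grad).topk(k, dim=-1).indices    # lower loss = better
```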
  • Cohort call for the AI Safety program
    • Discussion on AI control, unlearning, etc.
  • Chris Olah's talk

August 1st

  • Interpretable AI: various techniques have classically been studied to understand the internals of neural networks.
  • Read the section on
  • They identify the pixels in the input space that correspond to the higher activations, mainly via two kinds of techniques (a saliency sketch follows this list):
    • Occlusion- or perturbation-based: methods like SHAP and LIME manipulate parts of the image to generate explanations (model-agnostic).
    • Gradient-based: many methods compute the gradient of the prediction (or classification score) with respect to the input features; these methods (of which there are many) mostly differ in how the gradient is computed.
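As a concrete instance of the gradient-based family, a minimal vanilla-saliency sketch (in the spirit of Simonyan et al.; `model` and the shapes are placeholders, not a specific library API):

```python
# Vanilla gradient saliency: gradient of the class score w.r.t. input pixels.
import torch

def saliency_map(model, image, class_idx):
    # image: (3, H, W); returns (H, W) map of |d score / d pixel|, max over channels
    x = image.unsqueeze(0).requires_grad_()
    score = model(x)[0, class_idx]          # scalar class score (pre-softmax)
    score.backward()
    return x.grad[0].abs().amax(dim=0)      # high value = pixel mattered
```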
         

Visualization in ConvNets

         
Lecture 12 | Visualizing and Understanding (CS231n: Convolutional Neural Networks for Visual Recognition)
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture12.pdf
  • Visualize the filters in the first layer.
    • Shows the shapes each filter looks for via template matching.
  • Nearest neighbour in the last hidden layer (a retrieval sketch follows).
    • Semantically similar photos end up near each other even if they differ widely at the pixel level.
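A small sketch of how that nearest-neighbour visualization could be computed, with `feature_extractor` standing in for the network truncated at the last hidden layer (a placeholder of mine, not the lecture's code):

```python
# Nearest neighbours in feature space rather than pixel space.
import torch

def nearest_neighbours(feature_extractor, query, gallery, k=5):
    # query: (3, H, W); gallery: (N, 3, H, W) -> indices of the k closest images
    with torch.no_grad():
        q = feature_extractor(query.unsqueeze(0))    # (1, d) feature vector
        g = feature_extractor(gallery)               # (N, d)
    dists = torch.cdist(q, g)                        # L2 in feature space
    return dists.topk(k, largest=False).indices[0]
```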
         