August 24th

  • Long wait between updates.
  • Week 8 of the AI alignment program
  • Project on understanding LLMs
  • Paper: Transformers Learn Shortcuts to Automata
    • They train the network on the inputs and outputs of automata.
    • They do not seem to claim that the automaton itself is what the transformer learns (need to confirm).
    • They show that a parallel solution exists in theory for these automata, and they think this is what the transformer might be learning.
    • When the transformer is instead forced to learn the recurrent (step-by-step) solution, that solution is shown to be more robust.
    • Again, I am not sure whether the first version the transformer learns is an actual solution or simply memorization.
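To make the difference between the step-by-step and the parallel solution concrete for myself, here is a minimal sketch. The parity automaton, the function names, and the scan layout are my own assumptions for illustration and not the construction from the paper. The point is only that composing transition functions is associative, so the states can be computed in about log2(T) rounds instead of T sequential steps.

```python
# Hypothetical example automaton: parity of the number of 1s seen so far.
STATES = (0, 1)


def step(state, symbol):
    """One transition of the parity automaton."""
    return state ^ symbol


def run_recurrent(inputs, start=0):
    """Sequential solution: one state update per input symbol."""
    states, state = [], start
    for symbol in inputs:
        state = step(state, symbol)
        states.append(state)
    return states


def as_map(symbol):
    """Transition function of one symbol as a tuple: index = old state, value = new state."""
    return tuple(step(s, symbol) for s in STATES)


def compose(f, g):
    """Apply f first, then g. Composition is associative, so it can be regrouped."""
    return tuple(g[f[s]] for s in STATES)


def run_parallel(inputs, start=0):
    """Prefix scan over transition maps in about log2(T) rounds."""
    maps = [as_map(x) for x in inputs]
    offset = 1
    while offset < len(maps):
        # Within a round, every position only reads the previous round's maps,
        # so all positions could be updated at the same time.
        maps = [
            compose(maps[i - offset], maps[i]) if i >= offset else maps[i]
            for i in range(len(maps))
        ]
        offset *= 2
    return [m[start] for m in maps]


if __name__ == "__main__":
    seq = [1, 0, 1, 1, 0, 1]
    print(run_recurrent(seq))
    print(run_parallel(seq))
```

Both functions return the same state sequence. Only the amount of sequential work differs, which is the sense of "parallel" I have in mind in the notes above.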

    August 23rd

    • Theoretical linguistics capabilities of LLMs
    • Recursion in LLMs

    August 11th

    • Technical AI governance

    August 5th

      • An intuition for the transformer model from the perspective of the residual stream
      • E.g., GPT-type models (decoder-only transformers)
      • For simplification, they leave out the MLPs: without the activation functions, which introduce non-linearity, the model can be viewed as a linear sum.
        I am curious how good these toy models can be: the model is already small, and on top of that we are ignoring the activation functions.
        Question: What loss function is used for the next-word predictor?
        Answer: Cross-entropy loss on the output vector, whose size is the vocabulary size, i.e. a classification probability over all possible words in the vocabulary.
        • 0-layer transformer
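A small sketch to check my understanding of the two points above, the 0-layer transformer and the cross-entropy loss. I am assuming that a 0-layer model is just a token embedding followed directly by an unembedding (no attention, no MLP), so it can only use the current token to predict the next one. The sizes, the random weights, and the names `W_E` and `W_U` are placeholders of mine.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, D_MODEL = 5, 3  # hypothetical toy sizes


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


W_E = rand_matrix(VOCAB_SIZE, D_MODEL)  # embedding: token id -> residual stream vector
W_U = rand_matrix(D_MODEL, VOCAB_SIZE)  # unembedding: residual stream vector -> logits


def zero_layer_logits(token_id):
    """Embed the token, then unembed straight away (nothing in between)."""
    resid = W_E[token_id]
    return [
        sum(resid[d] * W_U[d][v] for d in range(D_MODEL))
        for v in range(VOCAB_SIZE)
    ]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_id])


def sequence_loss(token_ids):
    """Average next-token cross-entropy over a token sequence."""
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    return sum(cross_entropy(zero_layer_logits(cur), nxt) for cur, nxt in pairs) / len(pairs)


if __name__ == "__main__":
    print(sequence_loss([0, 3, 1, 4, 2]))
```

With uniform logits the loss equals log(vocabulary size), which is a handy sanity check for an untrained model.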

        August 2nd

        • Go through this week's readings on robustness to adversarial attacks.
        • First paper: attacks on LLMs to make them produce harmful content.
          • Manual jailbreaking requires human ingenuity and is time-intensive.
          • Previous attempts at generating harmful responses using gradients of the loss function depended on a particular model and a particular response.
          • Here the technique is to make the model start its answer with "Yes sure, here is …", where … is the harmful prompt. In that case the model generates the harmful content with high likelihood. Prompt-based jailbreaking uses the same trick, but its chances of success are lower.
          • The model is used to find an optimal suffix that, appended to any prompt, makes the answer start as above.
        • Cohort call for AI Safety program
          • Discussion on AI control, unlearning, etc.
        • Chris Olah’s talk

        August 1st

        • Interpretable AI: There are various techniques that have been classically studied to understand the internals of the neural networks.
        • Read the section on
        • They identify the pixels in the input space that correspond to higher activation, mainly via two kinds of techniques:
          • Occlusion- or perturbation-based: Methods like SHAP and LIME manipulate parts of the image to generate explanations (model-agnostic).
          • Gradient-based: Many methods compute the gradient of the prediction (or classification score) with respect to the input features. The gradient-based methods (of which there are many) mostly differ in how the gradient is computed.
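A toy comparison of the two families above, as a sketch only. The `score` function is a made-up stand-in for a classifier score, not a real network, and the gradient is estimated with finite differences instead of backpropagation. This is not how SHAP or LIME work internally; it only shows the basic "occlude and re-score" idea next to the "gradient with respect to the input" idea.

```python
def score(image, weights):
    """Hypothetical stand-in for a classifier score: weighted sum of squared pixels."""
    return sum(
        w * p * p
        for row_w, row_p in zip(weights, image)
        for w, p in zip(row_w, row_p)
    )


def occlusion_map(image, weights, baseline=0.0):
    """Saliency of a pixel = drop in score when that pixel is replaced by a baseline value."""
    base = score(image, weights)
    saliency = []
    for r, row in enumerate(image):
        out_row = []
        for c in range(len(row)):
            occluded = [list(x) for x in image]  # copy so the input is not modified
            occluded[r][c] = baseline
            out_row.append(base - score(occluded, weights))
        saliency.append(out_row)
    return saliency


def gradient_map(image, weights, eps=1e-5):
    """Saliency of a pixel = central finite-difference estimate of d(score)/d(pixel)."""
    saliency = []
    for r, row in enumerate(image):
        out_row = []
        for c in range(len(row)):
            plus = [list(x) for x in image]
            minus = [list(x) for x in image]
            plus[r][c] += eps
            minus[r][c] -= eps
            out_row.append((score(plus, weights) - score(minus, weights)) / (2 * eps))
        saliency.append(out_row)
    return saliency


if __name__ == "__main__":
    image = [[1.0, 2.0], [0.0, 3.0]]
    weights = [[1.0, 0.0], [5.0, 0.5]]
    print(occlusion_map(image, weights))
    print(gradient_map(image, weights))
```

In this toy case the two maps disagree on magnitudes (occlusion gives w·p², the gradient gives 2·w·p), which is a small reminder that different attribution methods answer slightly different questions.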

        Visualization in ConvNets

        Lecture 12 | Visualizing and Understanding
        • Visualize the filters in the first layer.
          • Shows the shapes the filters look for via template matching.
        • Nearest neighbour in the last hidden layer.
          • Semantically similar photos are near each other, even if they differ from each other in pixel positions.
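A minimal sketch of the nearest-neighbour idea. The feature vectors and image names below are invented for illustration; in the lecture setting they would be the activations of the last hidden layer for each image.

```python
import math


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbours(query, features, k=2):
    """Return the names of the k images whose features are closest to the query."""
    ranked = sorted(features, key=lambda name: l2(features[name], query))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical last-layer features: two cat photos with different framing, one car.
    features = {
        "cat_left": [0.9, 0.1],
        "cat_right": [0.8, 0.2],
        "car": [0.1, 0.9],
    }
    print(nearest_neighbours(features["cat_left"], features, k=2))
```

The same search done on raw pixels would compare pixel positions directly, which is why the two cat photos could end up far apart there while being close in feature space.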