[Figure: "A transformer layer converges with a GD step". In-context test loss (y-axis, ticks 0 to 0.8) versus training steps (x-axis, ticks 0 to 4500). Legend: GD-equivalent transformer; Linear transformer (1 layer). The plot is parameterized by the number of in-context points.]