[Figure: "A transformer layer converges with a GD step". In-context test loss (y-axis) versus training steps (x-axis) for two models: the GD-equivalent transformer and a linear transformer (1 layer). Interactive slider: number of in-context points, shown at 100.]