Path: blob/master/notebooks/book1/15/kernel_regression_attention.ipynb
Nadaraya-Watson kernel regression in 1d using attention
We show how to interpret kernel regression as an attention mechanism. Based on sec 10.2 of http://d2l.ai/chapter_attention-mechanisms/nadaraya-waston.html
Data
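A minimal sketch of the synthetic 1d data, assuming the same setup as the d2l section (target $f(x) = 2\sin(x) + x^{0.8}$ with Gaussian noise); the variable names below are illustrative:

```python
import torch

torch.manual_seed(0)
n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)   # training inputs in [0, 5]

def f(x):
    return 2 * torch.sin(x) + x**0.8                # true regression function

y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))  # noisy training targets
x_test = torch.arange(0, 5, 0.1)                    # test inputs on a grid
y_truth = f(x_test)                                 # noise-free test targets
```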
Constant baseline
As a baseline, we use the empirical mean of y.
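As a one-line sketch (using the hypothetical y_train and x_test tensors above):

```python
# Predict the empirical mean of the training outputs at every test input.
y_hat_baseline = torch.repeat_interleave(y_train.mean(), len(x_test))
```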
Kernel regression
We can visualize the kernel matrix to see which inputs are used to predict each output.
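A sketch of the nonparametric Nadaraya-Watson estimate with a unit-bandwidth Gaussian kernel, plus a heatmap of the resulting attention (kernel) matrix (assumes the x_train, y_train, x_test tensors above):

```python
import matplotlib.pyplot as plt
import torch.nn.functional as F

# Gaussian kernel weights: row i holds the attention paid to each training input
# when predicting at test input i (rows sum to 1).
diffs = x_test.reshape(-1, 1) - x_train.reshape(1, -1)   # (n_test, n_train)
attention_weights = F.softmax(-diffs**2 / 2, dim=1)
y_hat = attention_weights @ y_train                       # Nadaraya-Watson predictions

plt.imshow(attention_weights.numpy(), aspect='auto')
plt.xlabel('training inputs (keys)')
plt.ylabel('test inputs (queries)')
plt.colorbar()
plt.show()
```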
Implementation using learned attention
As an illustration of how to learn attention kernels, we make the bandwidth parameter adjustable, so we can optimize it by backprop.
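A sketch of a kernel-regression module with a learnable bandwidth $w$, following the d2l implementation (the torch.bmm call is explained in the next cell):

```python
from torch import nn

class NWKernelRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable bandwidth (inverse length-scale) of the Gaussian kernel.
        self.w = nn.Parameter(torch.rand((1,)))

    def forward(self, queries, keys, values):
        # queries: (n_queries,); keys, values: (n_queries, n_keys)
        queries = queries.repeat_interleave(keys.shape[1]).reshape((-1, keys.shape[1]))
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        # Weighted average of the values: (n_queries, 1, n_keys) x (n_queries, n_keys, 1).
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)
```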
The implementation uses batch matrix multiplication (torch.bmm). This is defined as follows: suppose the first batch contains $n$ matrices $X_i$ of size $a \times b$, and the second batch contains $n$ matrices $Y_i$ of size $b \times c$; then the output is the batch of $n$ products $X_i Y_i$, with shape $(n, a, c)$.
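For example, checking the shapes only:

```python
X = torch.ones((2, 1, 4))     # batch of 2 matrices of size 1 x 4
Y = torch.ones((2, 4, 6))     # batch of 2 matrices of size 4 x 6
print(torch.bmm(X, Y).shape)  # torch.Size([2, 1, 6])
```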
To apply attention to kernel regression, we make a batch of size $n$, where $n$ is the number of training points. In batch $i$, the query is the $i$'th training input $x_i$ (whose output we are trying to predict), the keys are all the other inputs $x_{-i}$, and the values are all the other outputs $y_{-i}$. A sketch of this construction is given below.
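A sketch of this leave-one-out batching, following the d2l construction (assumes the x_train and y_train tensors above):

```python
n = len(x_train)
queries = x_train                                   # query i is the i'th training input
X_tile = x_train.repeat((n, 1))                     # (n, n): each row repeats all inputs
Y_tile = y_train.repeat((n, 1))                     # (n, n): each row repeats all outputs
mask = (1 - torch.eye(n)).type(torch.bool)          # drop the diagonal (the point itself)
keys = X_tile[mask].reshape((n, -1))                # (n, n-1): all inputs except x_i
values = Y_tile[mask].reshape((n, -1))              # (n, n-1): all outputs except y_i
```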
Train using SGD.
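A sketch of the training loop (SGD on the squared error, assuming the net, queries, keys, and values defined above; the learning rate and number of epochs are illustrative):

```python
net = NWKernelRegression()
loss_fn = nn.MSELoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.5)

for epoch in range(5):
    trainer.zero_grad()
    loss = loss_fn(net(queries, keys, values), y_train).sum()
    loss.backward()
    trainer.step()
    print(f'epoch {epoch + 1}, loss {float(loss):.6f}')
```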
Results of training
Not surprisingly, fitting the hyper-parameter $w$ (the bandwidth of the kernel) results in overfitting, as we show below. However, for parametric attention, this is less likely to occur.