Recurrent Parameterless Attention is a Consensus Algorithm
- Ethan Smith
- May 24
- 2 min read
In another post, I wrote about parameterless (boneless) attention as a means of mixing information across datapoints weighted by their correlation.
Doing this recurrently until convergence yields what I'd like to call a consensus algorithm.
A consensus algorithm smells a lot like an unsupervised clustering algorithm in that it identifies groupings of points. The added component is that the values of the points themselves are modified at each iteration, moving closer to their neighbors until the points reach a "consensus": an agreed-upon value they all take on.
To make this more concrete, here's how it differs from something like K-means clustering (a small code sketch follows the comparison).
K-Means
Initialize centroids in the same space as the data
Move these centroids around to minimize their distance to the data points in their cluster
Reassign data points to centroids based on how close they are to each centroid.
Consensus algorithm
Data points themselves serve as the centroids
No predefined number of clusters: there can be as many clusters as there are data points. The hyperparameters of the update step influence how many clusters the points converge to
Instead of updating separately initialized centroids at each step, we update the data points themselves: each point takes a step in a direction determined by all the other data points, with its closest neighbors carrying the most weight
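For contrast, here is roughly what one K-means iteration looks like. This is only a sketch, assuming PyTorch tensors; the function name kmeans_step is mine. Note that the centroids live apart from the data, and only the centroids move.

```python
import torch

def kmeans_step(points: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    # Assign each point to its nearest centroid.
    dists = torch.cdist(points, centroids)   # (n_points, k)
    assignments = dists.argmin(dim=-1)       # (n_points,)
    # Move each centroid to the mean of its assigned points; the data never changes.
    return torch.stack([
        points[assignments == k].mean(dim=0) if (assignments == k).any() else centroids[k]
        for k in range(centroids.shape[0])
    ])
```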

Parameterless attention is one possible consensus algorithm. I originally discovered this use case when trying to design a color-clustering loss that would be differentiable for backpropagation and benefit from GPU acceleration. Of note: despite being differentiable, it did not work very well as a loss.
Assuming you know how attention works, and specifically the QKNorm variant, the procedure is as follows (a code sketch follows these steps):
For each data point:
Calculate vector of cosine similarities with all other points
Apply temperature scaling, a hyperparameter that decides how strongly to prioritize nearer points over farther away points
Apply softmax over this vector, giving us the mixing coefficients or weights.
Take a weighted sum of all the data points scaled by the mixing coefficients
Perform the update: a linear interpolation between the existing data points and the output of the attention operation we just did.
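Here is a minimal sketch of one such update, assuming PyTorch and a points tensor of shape (n, d). The names consensus_step, run_to_consensus, scale, and step_size are mine, not from the original code.

```python
import torch
import torch.nn.functional as F

def consensus_step(points: torch.Tensor, scale: float = 10.0, step_size: float = 0.85) -> torch.Tensor:
    # Normalize so dot products become cosine similarities (the QKNorm idea).
    normed = F.normalize(points, dim=-1)
    sims = normed @ normed.T                       # (n, n) pairwise cosine similarities
    # Temperature scaling, then softmax over each row to get mixing weights.
    weights = torch.softmax(sims * scale, dim=-1)
    # Weighted sum of all points: each point attends to every other point.
    mixed = weights @ points
    # Linear interpolation between the current points and the attention output.
    return (1.0 - step_size) * points + step_size * mixed

def run_to_consensus(points: torch.Tensor, n_steps: int = 50, **kwargs) -> torch.Tensor:
    # Apply the update recurrently until (approximate) convergence.
    for _ in range(n_steps):
        points = consensus_step(points, **kwargs)
    return points
```

Running consensus_step recurrently is the whole algorithm; the experiments below only change what the points are and how the hyperparameters are set.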
This should probably ring a bell if you've seen the attention methods used in graph neural networks.
It looks like this:

We can use it to perform segmentation of an image in color space.
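As a rough usage sketch, reusing the hypothetical run_to_consensus helper above: treat each pixel as a point in RGB space, run to consensus, and pixels that agree on a value form a segment. The image, step count, and scale below are made-up stand-ins.

```python
import torch

H, W = 64, 64
image = torch.rand(H, W, 3)                  # stand-in for a real RGB image in [0, 1]
pixels = image.reshape(-1, 3)                # (H*W, 3) points in color space
converged = run_to_consensus(pixels, n_steps=50, scale=30.0, step_size=0.85)
segmented = converged.reshape(H, W, 3)       # each pixel replaced by its cluster's color
n_segments = torch.unique(converged.round(decimals=2), dim=0).shape[0]
```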
Or, by initializing random noise centroids instead and having only those attend to the data, something closer to "Perceiver Attention", we get what amounts to a "soft" K-means algorithm.
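A sketch of that variant, under the same assumptions: randomly initialized "centroid" points cross-attend to the data, Perceiver-style, and only the centroids are updated.

```python
import torch
import torch.nn.functional as F

def soft_kmeans_step(data: torch.Tensor, centroids: torch.Tensor,
                     scale: float = 10.0, step_size: float = 0.85) -> torch.Tensor:
    # Cross-attention: centroids act as queries, data points as keys and values.
    q = F.normalize(centroids, dim=-1)
    k = F.normalize(data, dim=-1)
    weights = torch.softmax((q @ k.T) * scale, dim=-1)   # (n_centroids, n_data)
    mixed = weights @ data                               # soft cluster means of the data
    return (1.0 - step_size) * centroids + step_size * mixed

# Hypothetical usage: random noise centroids, iterated to convergence.
data = torch.randn(500, 2)
centroids = torch.randn(8, 2)
for _ in range(50):
    centroids = soft_kmeans_step(data, centroids)
```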


We can also perform consensus in spatial domains.
I initially tried using the same hyperparameters as before but found that the points did not move at all. With the scale as high as it was, it's possible that nearly all of the softmax weight was put on each point itself and its immediate neighbors.
scale=100.0
step_size=0.85

At smaller scales, we just collapse to the mean: with a low temperature scale the softmax weights are nearly uniform, so every point steps toward the global average. That's probably pretty reasonable behavior for a single Gaussian, which really is just a single cluster of points.
scale=1.0
step_size=0.85

This time around, I initialized 3 Gaussians with means (1, 1), (0, -1), and (-1, 1). This appears to reach consensus with 3 clusters, as we might expect.
scale=10.0
step_size=0.85
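A rough reproduction sketch, again reusing the hypothetical run_to_consensus helper. Only the three means and the scale/step_size values come from the text; the spread, sample count per Gaussian, and step count are made up.

```python
import torch

# Three 2D Gaussians with means (1, 1), (0, -1), and (-1, 1).
means = torch.tensor([[1.0, 1.0], [0.0, -1.0], [-1.0, 1.0]])
points = torch.cat([m + 0.2 * torch.randn(200, 2) for m in means])

converged = run_to_consensus(points, n_steps=100, scale=10.0, step_size=0.85)
# The points should settle into roughly three tight groups, one per Gaussian.
print(torch.unique(converged.round(decimals=1), dim=0))
```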

The code I used for this is here: