graph attention model

[[concept]]
Graph Attention (GAT)

$$(x_\ell)_i = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij}\,(x_{\ell-1})_j\,H_\ell\Big)$$

where the $\alpha_{ij}$ are the learned graph attention coefficients, computed as

$$\alpha_{ij} = \operatorname{softmax}_{N(i)}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j' \in N(i)} \exp(e_{ij'})}, \qquad e_{ij} = \rho\big(a\,(x_i H \,\|\, x_j H)\big)$$

Here, we are learning the matrix of edge weights $E = [e_{ij}]$ by learning $a$ and $H$. We can think of $\alpha_{ij}$ as the "relative similarity" between the transformed features of $i$ and $j$, compared against the similarities across all neighbors of $i$.

  • $(\|)$ is the row-wise concatenation operation.
    • so $a \in \mathbb{R}^{1 \times 2d}$
    • and each $x_i H \in \mathbb{R}^d$
  • $\rho$ is a pointwise nonlinearity, typically a leaky ReLU

The learnable parameters for this model are $a_\ell, H_\ell$ for $\ell = 1, \dots, L$.
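
To make the equations concrete, here is a minimal single-head sketch in plain PyTorch (dense adjacency matrix, $\sigma$ taken to be a sigmoid and $\rho$ a leaky ReLU; the name `GATLayer` and the `adj` input are illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head, dense-adjacency sketch of one GAT layer:
    (x_l)_i = sigma( sum_{j in N(i)} alpha_ij (x_{l-1})_j H_l )."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.H = nn.Parameter(torch.empty(d_in, d_out))   # feature transform H_l
        self.a = nn.Parameter(torch.empty(1, 2 * d_out))  # attention vector a_l in R^{1 x 2d}
        nn.init.xavier_uniform_(self.H)
        nn.init.xavier_uniform_(self.a)

    def forward(self, x, adj):
        # x: (n, d_in) node features; adj: (n, n) 0/1 adjacency (the graph sparsity pattern).
        # Assumes every node has at least one neighbor, otherwise its softmax row is all -inf.
        xH = x @ self.H                        # x_i H for every node, shape (n, d_out)
        d = xH.size(1)
        # e_ij = rho(a (x_i H || x_j H)); split a = [a_src, a_dst] so the pairwise scores
        # come from broadcasting instead of materializing every concatenation
        src = xH @ self.a[:, :d].T             # (n, 1): a_src . (x_i H)
        dst = xH @ self.a[:, d:].T             # (n, 1): a_dst . (x_j H)
        e = F.leaky_relu(src + dst.T)          # (n, n) raw scores e_ij, rho = leaky ReLU
        # softmax restricted to the neighborhood N(i): mask non-edges before normalizing
        alpha = torch.softmax(e.masked_fill(adj == 0, float("-inf")), dim=1)
        return torch.sigmoid(alpha @ xH)       # sigma( sum_j alpha_ij x_j H )
```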

Notes

  • This architecture is local: it still respects the graph sparsity pattern, while learning the "similarities" of the transformed features at each layer of the network.
  • The number of parameters does not depend on the size of the graph, only on the feature dimensions (see the count below this list).
  • It can still be expensive to compute, however, if the graph is dense: we need to compute $|E|$ coefficients $\alpha_{ij}$, which is up to $n^2$ for a complete graph.
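
To make the size-independence point concrete, assume for simplicity that every layer maps $\mathbb{R}^d \to \mathbb{R}^d$; then the per-layer parameter count is

$$\underbrace{2d}_{a_\ell \,\in\, \mathbb{R}^{1 \times 2d}} + \underbrace{d^2}_{H_\ell \,\in\, \mathbb{R}^{d \times d}} = d^2 + 2d,$$

which involves only the feature dimension $d$, not the number of nodes $n$; only the forward pass (the $|E|$ coefficients) scales with the graph.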

This additional flexibility increases the capacity of the architecture (it is less likely to underfit the training data), and this has been observed empirically. But it comes at the cost of a more expensive forward pass.

This architecture is implemented in PyTorch Geometric as `GATConv`.
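
For reference, a minimal usage sketch of PyG's `GATConv` on a toy graph (the tensors below are made up for illustration):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GATConv

# Toy graph: 3 nodes with 8-dim features, edges stored as a 2 x |E| index tensor
x = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])  # undirected edges as two directed pairs
data = Data(x=x, edge_index=edge_index)

# One GAT layer: 8 -> 16 features per head, 4 attention heads (outputs are concatenated)
conv = GATConv(in_channels=8, out_channels=16, heads=4)
out = conv(data.x, data.edge_index)        # shape (3, 16 * 4)
print(out.shape)
```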

Note

We can think of the GAT layer $(x_\ell)_i = \sigma\big(\sum_{j \in N(i)} \alpha_{ij} (x_{\ell-1})_j H_\ell\big)$ as a convolutional layer where the GSO $S$ is learned

  • however, this architecture is not convolutional, since $S$ changes at each layer (it depends on the current features).
  • But it can be useful to see how we can write/think about it as something that looks convolutional (if you squint), as in the matrix form below.
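
Collecting the coefficients into a matrix (a notational convenience, not a claim that $S$ is a fixed operator), one layer can be written as

$$x_\ell = \sigma\big(S_\ell\, x_{\ell-1} H_\ell\big), \qquad [S_\ell]_{ij} = \begin{cases} \alpha_{ij} & j \in N(i) \\ 0 & \text{otherwise,} \end{cases}$$

so $S_\ell$ has the same sparsity pattern as the adjacency matrix, but its entries are recomputed from $x_{\ell-1}$ at every layer.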

Mentions

File
2025-02-19 graphs lecture 9
2025-04-16 lecture 21
2025-02-25 equivariant lecture 4