[[lecture-data]] 2025-02-17
Quick note from last lecture on sparse matrix representations:
coordinate (COO) representation and compressed sparse row (CSR) representation - a better explanation of CSR:
Compressed Sparse Row (CSR) representation encodes a sparse matrix as a set of 3 tensors/"lists". We get this representation by realizing that we can encode the row index of the COO representation a bit more efficiently.
Instead of writing out the row index for each element, we collect the column indices for each row, put these collections together, and then have a `pointer` tensor tell us where to start reading each row. If $A \in \mathbb{R}^{n \times n}$ has $\text{nnz}$ non-zero entries, then
- the `column` tensor contains $\text{nnz}$ entries. Each entry contains the column index for one of the non-zero elements, sorted by ascending row index.
- the `row` (or `pointer`) tensor contains $n+1$ entries: the first $n$ entries contain the starting index of each row's elements in the `column` tensor, and the last entry is $\text{nnz}$.
- the `value` tensor contains $\text{nnz}$ entries: the non-zero elements sorted by row then column index.
If you've taken a numerical linear algebra class, then you are familiar with seeing how the total number of operations in a sparse matrix-vector product scales with $\text{nnz}$ rather than $n^2$.
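A minimal NumPy sketch of this conversion (the function and variable names are my own, not from the lecture), assuming the COO entries are already sorted by row then column:

```python
import numpy as np

# Build the three CSR tensors from a COO representation (row, col, val),
# assuming entries are sorted by row then column index.
def coo_to_csr(row, col, val, n_rows):
    row = np.asarray(row)
    columns = np.asarray(col)                  # column tensor: nnz entries
    values = np.asarray(val)                   # value tensor: nnz entries
    pointers = np.zeros(n_rows + 1, dtype=np.int64)
    np.add.at(pointers, row + 1, 1)            # count entries per row...
    pointers = np.cumsum(pointers)             # ...then prefix-sum into start offsets
    return pointers, columns, values           # pointer tensor: n_rows + 1 entries

# Example: A = [[0, 2, 0],
#               [3, 0, 4],
#               [0, 0, 5]]
ptr, cols, vals = coo_to_csr(row=[0, 1, 1, 2], col=[1, 0, 2, 2],
                             val=[2, 3, 4, 5], n_rows=3)
print(ptr, cols, vals)   # [0 1 3 4] [1 0 2 2] [2 3 4 5]
```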
Summary
- information theoretic threshold
- community detection
Today
- See Lecture 8.pdf
- Popular GNN architectures that can be expressed in convolutional form
- MPNN
- GCN
- ChebNet
- graphSAGE
- GATs
1. Graph Signals and Graph Signal Processing
Recall the definition of a graph neural network:
GNN
A "full-fledged" GNN layer is a multi-dimensional generalization of the multi-layer graph perceptron, given by
$$x_{\ell} = \sigma(U_{\ell}) = \sigma\left( \sum_{k=0}^{K-1} S^k x_{\ell-1} H_{\ell, k} \right)$$
with $H_{\ell,k} \in \mathbb{R}^{d_{\ell-1} \times d_{\ell}}$ and $1 \leq \ell \leq L$, where
- $\sigma$ is a pointwise nonlinearity
- $S$ is a graph shift operator, and the multiplication $U_{\ell} = \sum_{k=0}^{K-1} S^k x_{\ell-1} H_{\ell,k}$ is a convolutional filter bank applied to $x_{\ell-1}$
Here, $x_{L}$ is still called an $L$-layer embedding.
This is a convolutional GNN because it is based on graph convolution.
- GNNs have great properties which help explain their practical success.
- This form is general enough to describe a majority of the most popular architectures in use today.
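As a concrete reference point, here is a minimal NumPy sketch of a single convolutional GNN layer $x_{\ell} = \sigma\left(\sum_{k=0}^{K-1} S^{k} x_{\ell-1} H_{\ell,k}\right)$ (function and variable names are my own, not from the lecture):

```python
import numpy as np

# One convolutional GNN layer: x_out = relu( sum_k S^k x_in H_k ).
# S is n x n (a graph shift operator, e.g. the adjacency), x_in is n x d_in,
# and H is a list of K matrices of shape d_in x d_out.
def gnn_layer(S, x_in, H):
    z = np.zeros((x_in.shape[0], H[0].shape[1]))
    Skx = x_in                       # running S^k x_in, starting at k = 0
    for H_k in H:
        z += Skx @ H_k               # accumulate S^k x_in H_k
        Skx = S @ Skx                # one more diffusion step for the next term
    return np.maximum(z, 0)          # pointwise nonlinearity (ReLU)

# Example usage on a random symmetric shift operator and signal
rng = np.random.default_rng(0)
n, d_in, d_out, K = 5, 3, 4, 2
S = rng.random((n, n)); S = (S + S.T) / 2
x = rng.standard_normal((n, d_in))
H = [rng.standard_normal((d_in, d_out)) for _ in range(K)]
print(gnn_layer(S, x, H).shape)      # (5, 4)
```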
Message Passing Neural Network (MPNN)
This type of network was initially introduced by Gilmer et al. In a message passing neural network, each layer consists of 2 operations: the message and the update.
The message $M_{\ell}$ is a function that processes
- the signal at node $i$: $(x_{\ell})_{i}$
- the signal at each of the neighbors $j \in N(i)$: $(x_{\ell})_{j}$
- the edge weights $A_{ij}$
The update $U_{\ell}$ is a function that maps the signal of the graph from the previous layer to the next, taking into account the message and the current value at each node: $(x_{\ell+1})_{i} = U_{\ell}[(x_{\ell})_{i}, (m_{\ell})_{i}]$ (see message passing neural network)
Idea
Consider an MPNN. As long as $M_{\ell}$ is linear, the "message" can be expressed as a graph convolution.
Example
$$\begin{aligned} M_{\ell}[(x_{\ell})_{i}, (x_{\ell})_{j}, A_{ij}] &= \alpha \cdot(x_{\ell})_{i} + \beta \cdot A_{ij}(x_{\ell})_{j}\\ \implies m_{\ell} &= \alpha x_{\ell} + \beta Ax_{\ell} \end{aligned}$$
which is a graph convolution with filter coefficients $h_{0} = \alpha$ and $h_{1} = \beta$ (and shift operator $S = A$).
As long as the update $U_{\ell}$ is the composition of a pointwise nonlinearity $\sigma$ with a linear function, then the MPNN layer can be expressed as a GNN layer.
Example
$$\begin{aligned} U_{\ell}[(x_{\ell})_{i}, (m_{\ell})_{i}] &= \sigma(\alpha' (x_{\ell})_{i} + \beta' (m_{\ell})_{i}) \\ \implies x_{\ell+1} &= \sigma([\alpha'+\beta'\alpha]x_{\ell} + \beta'\beta A x_{\ell}) \end{aligned}$$
which is a graph convolution with filter coefficients $h_{0} = \alpha' + \beta'\alpha$ and $h_{1} = \beta'\beta$.
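A quick numerical sanity check of this equivalence (the specific values of $\alpha, \beta, \alpha', \beta'$ and all names are arbitrary choices of mine):

```python
import numpy as np

# Check that the node-wise message/update form equals the convolutional form
#   x_{l+1} = sigma( [alpha' + beta'*alpha] x_l + beta'*beta A x_l ).
rng = np.random.default_rng(1)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
x = rng.standard_normal(n)
alpha, beta, alpha_p, beta_p = 0.7, 0.3, 0.5, 1.2
sigma = np.tanh

# message/update form: m_i = alpha*x_i + beta*(A x)_i, then nonlinear update
m = alpha * x + beta * (A @ x)
x_next_mpnn = sigma(alpha_p * x + beta_p * m)

# convolutional form with h0 = alpha' + beta'*alpha, h1 = beta'*beta
x_next_conv = sigma((alpha_p + beta_p * alpha) * x + beta_p * beta * (A @ x))

print(np.allclose(x_next_mpnn, x_next_conv))        # True
```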
Question
From the perspective of learning the weights ($\alpha, \beta, \alpha', \beta'$ vs. $h_{0}, h_{1}$), what is different?
Graph Convolutional Networks
Introduced by Kipf and Welling (2017), a graph convolutional network layer is given by
$$(x_{\ell})_{i} = \sigma\left( \sum_{j \in N(i)} \frac{(x_{\ell-1})_{j}}{\sqrt{ \lvert N(i) \rvert \, \lvert N(j) \rvert }} H_{\ell} \right)$$
Note that the $(x_{\ell})_{i}$ are row vectors. We can think of each layer as a "degree-normalized aggregation".
Idea
We can write a graph convolutional network layer as a graph convolution. This is easy to see by simply writing
$$x_{\ell} = \sigma\left( D^{-1/2}\bar{A}D^{-1/2}\, x_{\ell-1} H_{\ell} \right)$$
which is exactly the form of a graph convolution with shift operator $S = D^{-1/2}\bar{A}D^{-1/2}$ and a single non-zero filter tap $H_{\ell,1} = H_{\ell}$, where $\bar{A}$ is the binary adjacency matrix (obtained in Python via, e.g., `A[A > 0] = 1` or `(A > 0).astype(int)`).
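A minimal sketch of this layer (function and variable names are mine), assuming a nonnegative weighted adjacency with no isolated nodes:

```python
import numpy as np

# One GCN layer as written above: x_l = relu(D^{-1/2} Abar D^{-1/2} x_{l-1} H),
# where Abar is the binary adjacency and D its degree matrix.
def gcn_layer(A, x_in, H):
    Abar = (A > 0).astype(float)                 # binarize the adjacency
    d = Abar.sum(axis=1)                         # node degrees |N(i)| (assumed > 0)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = Dinv_sqrt @ Abar @ Dinv_sqrt             # degree-normalized shift operator
    return np.maximum(S @ x_in @ H, 0)           # pointwise ReLU
```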
Note
Advantages of GCNs:
- One local diffusion step per layer
- Simple and interpretable
- Neighborhood size normalization prevents the vanishing/exploding gradient problem
Disadvantages
- only one diffusion step per layer
- no edge weights
- only supports the (binary) adjacency as the shift operator
- no self-loops unless they are present in $\bar{A}$ (the embedding of node $i$ at layer $\ell$ is not informed by its own embedding at layer $\ell-1$). Even if self-loops are in the graph, there is no ability to weight $(x_{\ell-1})_{i}$ and $(x_{\ell-1})_{j}$ differently.
ChebNet
The idea behind ChebNets, which were introduced by Defferrard et al. in 2016, is to create a "fast" implementation of graph convolution in the spectral domain. Each layer of a ChebNet is given by
$$x_{\ell} = \sigma\left( \sum_{k=0}^{K-1} T_{k}(\tilde{L})\, x_{\ell-1} H_{\ell,k} \right)$$
where $\tilde{L} = \frac{2}{\lambda_{\max}}L - I$ to ensure the spectrum lies in $[-1,1]$, and the $T_{k}$ are the Chebyshev Polynomials.
We use the Chebyshev polynomials instead of Taylor polynomials to approximate an analytic function because
- by the Chebyshev equioscillation theorem, they provide a better approximation with fewer coefficients
- Chebyshev polynomials are orthogonal, meaning we get an orthonormal basis for free. There is no need to implement the filters in the spectral domain.
(see ChebNet)
Learning in the spectral domain provides an “inductive bias” for learning localized spectral filters
In practice, we need to use an analytic function surrogate for this type of filter.
Problem
Recall that we can represent any analytic function of the shift operator with convolutional graph filters, so the expressivity of these filters is quite strong. However, translating to and from the spectral domain is expensive because we need to compute the graph Fourier transform $\hat{x} = V^{H}x$ and also the inverse graph Fourier transform $x = V\hat{x}$.
This costs 2 additional matrix-vector multiplications, plus the diagonalization of $L = V\Lambda V^{H}$.
In the typical spectral representation of a convolutional graph filter, we have
$$z = V \,\mathrm{diag}\big(\hat{h}(\lambda_{1}), \dots, \hat{h}(\lambda_{n})\big)\, V^{H} x$$
where $L = V\Lambda V^{H}$ is the Laplacian. Since the spectral graph filter operates on the signal pointwise in the frequency domain, the same filter weights $\hat{h}(\lambda_{i})$ are shared across the whole graph signal.
Is there a way to keep the expressivity of graph convolutions, but without having to compute as many expensive operations?
Chebyshev Polynomial
The Chebyshev Polynomials can be calculated using the recurrence relation:
$$\begin{aligned} T_{0}(\lambda) &=1 \\ T_{1}(\lambda) &= \lambda\\ &\vdots \\ T_{k+1}(\lambda) &= 2\lambda T_{k}(\lambda) - T_{k-1}(\lambda) \end{aligned}$$
and each $T_{n}$ satisfies $T_{n}(\cos\theta) = \cos(n\theta)$.
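Because of this recurrence, a ChebNet-style filter $\sum_{k} h_{k} T_{k}(\tilde{L})\, x$ can be applied with only $K$ matrix-vector products and no diagonalization, which is exactly what the Problem above asks for. A minimal sketch (names are mine; scalar filter taps for simplicity):

```python
import numpy as np

# Apply z = sum_k h[k] * T_k(L_tilde) x using only matrix-vector products,
# via the recurrence T_{k+1} = 2 * L_tilde * T_k - T_{k-1}.
# L_tilde is assumed rescaled so its spectrum lies in [-1, 1].
def chebyshev_filter(L_tilde, x, h):
    T_prev = x                       # T_0(L_tilde) x = x
    z = h[0] * T_prev
    if len(h) == 1:
        return z
    T_curr = L_tilde @ x             # T_1(L_tilde) x = L_tilde x
    z += h[1] * T_curr
    for h_k in h[2:]:
        T_next = 2 * (L_tilde @ T_curr) - T_prev
        z += h_k * T_next
        T_prev, T_curr = T_curr, T_next
    return z
```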
In conventional signal processing, filters based on these polynomials have better cutoff behavior in the spectral domain (compared to general polynomial filters)
- i.e., they need fewer coefficients than the Taylor approximation for the same quality
For the formal theorem statement/proof: see chebyshev equioscillation theorem
Using a Taylor approximation requires more coefficients than a Chebyshev approximation (i.e., a higher polynomial degree) for the same approximation quality.
Intuition for why this is better than the Taylor polynomials:
- the Chebyshev polynomial expansion minimizes the worst-case ($L^{\infty}$) error over the whole approximation interval ($[-1,1]$), while a Taylor series is a local approximation around some point $\lambda_{0}$ (see the small numerical comparison below)
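A small numerical illustration of this point (the function, degree, and evaluation grid are my own choices, not from the lecture):

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

# Approximate exp on [-1, 1] with a degree-5 Chebyshev expansion versus a
# degree-5 Taylor polynomial around 0, and compare worst-case (sup-norm) errors.
deg = 5
x = np.linspace(-1, 1, 2001)

cheb_coefs = C.chebinterpolate(np.exp, deg)          # interpolate at Chebyshev points
cheb_err = np.max(np.abs(C.chebval(x, cheb_coefs) - np.exp(x)))

taylor = sum(x**k / math.factorial(k) for k in range(deg + 1))
taylor_err = np.max(np.abs(taylor - np.exp(x)))

print(f"Chebyshev sup-error: {cheb_err:.1e}, Taylor sup-error: {taylor_err:.1e}")
# Same number of coefficients, but the Chebyshev error is noticeably smaller.
```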
Additionally,
Theorem
Let $\langle f, g \rangle = \int_{-1}^{1} f(x)g(x) \frac{1}{\sqrt{ 1-x^{2} }} \, dx$.
The Chebyshev Polynomials are orthogonal with respect to the inner product $\langle \cdot, \cdot \rangle$.
Proof
$$\begin{aligned}
\int_{-1}^{1} T_{n}(x)T_{m}(x) \frac{1}{\sqrt{ 1-x^2 }} \, dx
&\qquad \text{substitute } x=\cos \theta,\;\; dx=-\sin \theta \, d\theta,\;\; \sqrt{ 1-\cos^2\theta } = -\sin\theta \text{ on } [\pi, 2\pi]\\
&= \int_{\pi}^{2\pi} T_{n}(\cos\theta)\, T_{m}(\cos\theta) \, d\theta \\
&= \int_{\pi}^{2\pi} \cos(n\theta) \cos(m\theta) \, d\theta \;\; \text{ since } T_{n}(\cos\theta) = \cos(n\theta) \\
&= \int_{\pi}^{2\pi} \frac{e^{i n\theta} + e^{- i n \theta}}{2} \cdot \frac{e^{i m\theta} + e^{- i m \theta}}{2} \, d\theta \\
&= \frac{1}{4} \int_{\pi}^{2\pi} e^{i (n+m)\theta} + e^{i(n-m)\theta} + e^{-i(n-m)\theta} + e^{-i(n+m)\theta}\, d\theta \\
&= \begin{cases} \frac{1}{4} \int_{\pi}^{2\pi} 4 \, d\theta = \theta \big|_{\pi}^{2\pi} = \pi \;\;\;\text{ if } n=m=0\\ \frac{1}{4} \int_{\pi}^{2\pi} 2 \, d\theta + \frac{1}{4}\int_{\pi}^{2\pi}2 \cos((n+m)\theta) \, d\theta = \frac{\pi}{2} + 0 = \frac{\pi}{2}\;\;\; \text{ if } n=m\neq 0 \\ \frac{1}{4} \int_{\pi}^{2\pi}2\cos((n+m)\theta) + 2\cos((n-m)\theta)\,d\theta = 0\;\; \; \text{ if } n \neq m \end{cases}
\end{aligned}$$
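A quick numerical check of the three cases, using the same substitution $x = \cos\theta$ (so the weighted integral becomes $\int_{\pi}^{2\pi}\cos(n\theta)\cos(m\theta)\,d\theta$); the pairs of indices below are my own choices:

```python
import numpy as np

# Midpoint-rule approximation of int_{pi}^{2pi} cos(n*theta) cos(m*theta) d(theta).
N = 200000
theta = np.pi + (np.arange(N) + 0.5) * (np.pi / N)   # midpoints of [pi, 2*pi]
for n, m in [(0, 0), (3, 3), (2, 5)]:
    val = np.sum(np.cos(n * theta) * np.cos(m * theta)) * (np.pi / N)
    print(n, m, round(val, 6))   # approximately pi, pi/2, and 0, as derived above
```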
Note
Advantages
- Fast and cheap - easy to calculate polynomials and they are orthogonal for free
- localized spectral filters
Disadvantages
- less stable than Taylor approximations
- perturbing the adjacency matrix can cause large fluctuations in the Chebyshev polynomial approximation, whereas the Taylor approximation will remain basically the same
- the spectrum must be rescaled to lie in $[-1,1]$ (hence the use of $\tilde{L}$)
- difficult to interpret in the node domain
Graph SAGE
Introduced by Hamilton, Ying, and Leskovec in 2017, each GraphSAGE layer implements the following 2 operations:
Aggregate: $(U_{\ell})_{i} = \mathrm{AGG}\left( \{ (x_{\ell-1})_{j} : j \in N(i) \} \right)$
Concatenate: $(x_{\ell})_{i} = \sigma\left( [(x_{\ell-1})_{i} \;\; (U_{\ell})_{i}]\, H \right)$
The standard aggregation operation is an average over $N(i)$. Letting $S$ denote the degree-normalized adjacency, we get
$$\begin{aligned} (U_{\ell})_{i} &= \frac{1}{\lvert N(i) \rvert} \sum_{j \in N(i)} (x_{\ell-1})_{j} \\ \implies U_{\ell} &= Sx_{\ell-1},\qquad S=D^{-1}A\\ x_{\ell} = \sigma([x_{\ell-1} \;\; U_{\ell}] H) &= \sigma(x_{\ell-1}H_{0} + Sx_{\ell-1}H_{1}) \end{aligned}$$
where $A$ is the binary [[Concept Wiki/adjacency matrix]]. This is equivalent to a [[Concept Wiki/graph convolutional network|GCN]]-style layer with an added nodewise normalization $\frac{(x_{\ell})_{i}}{\lvert \lvert (x_{\ell})_{i} \rvert \rvert_{2}}$. This normalization helps in some cases (empirically). (see [[Concept Wiki/graph SAGE]])
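A minimal sketch of one such layer with the mean aggregator (function and variable names are mine; assumes a binary adjacency with no isolated nodes):

```python
import numpy as np

# One GraphSAGE layer with the mean aggregator:
# aggregate U = D^{-1} A X, then concatenate-and-transform, then l2-normalize rows.
def sage_layer(A, X, H0, H1):
    deg = A.sum(axis=1, keepdims=True)                     # |N(i)| for each node
    U = (A @ X) / deg                                      # aggregate: mean over neighbors
    Z = np.maximum(X @ H0 + U @ H1, 0)                     # same as relu([X  U] @ vstack([H0, H1]))
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)  # nodewise l2 normalization
```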
notes
- SAGE implementations allow for a variety of aggregation functions, including pooling aggregators and LSTMs (which look at $N(i)$ as a sequence, losing permutation equivariance). In these cases, SAGE is no longer a graph convolution.
- There are both advantages and disadvantages to staying a graph convolution vs. using these other aggregation functions.
- the authors of SAGE popularized the aggregate/update representation of GNNs, meaning
$U_{\ell}$ (the neighborhood aggregation) is the "aggregate" and the transformation $\sigma([x_{\ell-1} \;\; U_{\ell}]H)$ is the "update"