[[lecture-data]] 2025-02-17
Quick note from last lecture on sparse matrix representations:
coordinate (COO) representation and compressed sparse row (CSR) representation - a better explanation of CSR:
Compressed Sparse Row (CSR) representation encodes a sparse matrix as a set of 3 tensors/"lists". We get this representation by realizing that we can encode the row index of the COO representation a bit more efficiently.
Instead of writing out the row index for each element, we collect the column indices for each row, put these collections together, and then have a `pointer` tensor tell us where to start reading each row. If $A \in \mathbb{R}^{n \times n}$ has $\text{nnz}$ non-zero entries, then
- the `column` tensor contains $\text{nnz}$ entries. Each entry contains the column index for one of the non-zero elements, sorted by ascending row index.
- the `row` (or `pointer`) tensor contains $n+1$ entries: the first $n$ entries contain the starting index of each row's elements in the `column` tensor, and the last entry is $\text{nnz}$.
- the `value` tensor contains $\text{nnz}$ entries: the non-zero elements sorted by row then column index.
If you've taken a numerical linear algebra class, then you are familiar with seeing how the total number of operations in a sparse matrix-vector product scales with $\text{nnz}$ rather than $n^2$.
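A minimal NumPy sketch of this conversion (the function and variable names are my own, not from the lecture), assuming the COO entries are already sorted by row then column:

```python
import numpy as np

# Build the three CSR tensors from a COO representation (row, col, val),
# assuming entries are sorted by row then column index.
def coo_to_csr(row, col, val, n_rows):
    row = np.asarray(row)
    columns = np.asarray(col)                  # column tensor: nnz entries
    values = np.asarray(val)                   # value tensor: nnz entries
    pointers = np.zeros(n_rows + 1, dtype=np.int64)
    np.add.at(pointers, row + 1, 1)            # count entries per row...
    pointers = np.cumsum(pointers)             # ...then prefix-sum into start offsets
    return pointers, columns, values           # pointer tensor: n_rows + 1 entries

# Example: A = [[0, 2, 0],
#               [3, 0, 4],
#               [0, 0, 5]]
ptr, cols, vals = coo_to_csr(row=[0, 1, 1, 2], col=[1, 0, 2, 2],
                             val=[2, 3, 4, 5], n_rows=3)
print(ptr, cols, vals)   # [0 1 3 4] [1 0 2 2] [2 3 4 5]
```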
Summary
- information theoretic threshold
- community detection
Today
- See Lecture 8.pdf
- Popular GNN architectures that can be expressed in convolutional form
- MPNN
- GCN
- ChebNet
- graphSAGE
- GATs
1. Graph Signals and Graph Signal Processing
Recall the definition of a graph neural network:
GNN
A "full-fledged" GNN layer is a multi-dimensional generalization of the multi-layer graph perceptron, given by
$$x_{\ell} = \sigma(U_{\ell}) = \sigma\left( \sum_{k=0}^{K-1} S^k x_{\ell-1} H_{\ell, k} \right)$$
with $H_{\ell,k} \in \mathbb{R}^{d_{\ell-1} \times d_{\ell}}$ and $1 \leq \ell \leq L$, where
- $\sigma$ is a pointwise nonlinearity
- $S$ is a graph shift operator, and the multiplication $U_{\ell} = \sum_{k=0}^{K-1} S^k x_{\ell-1} H_{\ell,k}$ is a convolutional filter bank applied to $x_{\ell-1}$
Here, $x_{L}$ is still called an $L$-layer embedding.
This is a convolutional GNN because it is based on graph convolution.
- GNNs have great properties which help explain their practical success.
- This form is general enough to describe a majority of the most popular architectures in use today.
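As a concrete reference point, here is a minimal NumPy sketch of a single convolutional GNN layer $x_{\ell} = \sigma\left(\sum_{k=0}^{K-1} S^{k} x_{\ell-1} H_{\ell,k}\right)$ (function and variable names are my own, not from the lecture):

```python
import numpy as np

# One convolutional GNN layer: x_out = relu( sum_k S^k x_in H_k ).
# S is n x n (a graph shift operator, e.g. the adjacency), x_in is n x d_in,
# and H is a list of K matrices of shape d_in x d_out.
def gnn_layer(S, x_in, H):
    z = np.zeros((x_in.shape[0], H[0].shape[1]))
    Skx = x_in                       # running S^k x_in, starting at k = 0
    for H_k in H:
        z += Skx @ H_k               # accumulate S^k x_in H_k
        Skx = S @ Skx                # one more diffusion step for the next term
    return np.maximum(z, 0)          # pointwise nonlinearity (ReLU)

# Example usage on a random symmetric shift operator and signal
rng = np.random.default_rng(0)
n, d_in, d_out, K = 5, 3, 4, 2
S = rng.random((n, n)); S = (S + S.T) / 2
x = rng.standard_normal((n, d_in))
H = [rng.standard_normal((d_in, d_out)) for _ in range(K)]
print(gnn_layer(S, x, H).shape)      # (5, 4)
```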
Message Passing Neural Network (MPNN)
This type of network was initially introduced by Gilmer et al. In a message passing neural network, each layer consists of 2 operations: the message and the update.
The message $M_{\ell}$ is a function that processes
- the signal at node $i$: $(x_{\ell})_{i}$
- the signal at each of the neighbors $j \in N(i)$: $(x_{\ell})_{j}$
- the edge weights $A_{ij}$
The update $U_{\ell}$ is a function that maps the signal of the graph from the previous layer to the next, taking into account the message and the current value at each node: $(x_{\ell+1})_{i} = U_{\ell}[(x_{\ell})_{i}, (m_{\ell})_{i}]$ (see message passing neural network)
Idea
Consider an MPNN. As long as $M_{\ell}$ is linear, the "message" can be expressed as a graph convolution.
Example
$$\begin{aligned} M_{\ell}[(x_{\ell})_{i}, (x_{\ell})_{j}, A_{ij}] &= \alpha \cdot(x_{\ell})_{i} + \beta \cdot A_{ij}(x_{\ell})_{j}\\ \implies m_{\ell} &= \alpha x_{\ell} + \beta Ax_{\ell} \end{aligned}$$
which is a graph convolution with filter coefficients $h_{0} = \alpha$ and $h_{1} = \beta$ (and shift operator $S = A$).
As long as the update $U_{\ell}$ is the composition of a pointwise nonlinearity $\sigma$ with a linear function, then the MPNN layer can be expressed as a GNN layer.
Example
$$\begin{aligned} U_{\ell}[(x_{\ell})_{i}, (m_{\ell})_{i}] &= \sigma(\alpha' (x_{\ell})_{i} + \beta' (m_{\ell})_{i}) \\ \implies x_{\ell+1} &= \sigma([\alpha'+\beta'\alpha]x_{\ell} + \beta'\beta A x_{\ell}) \end{aligned}$$
which is a graph convolution with filter coefficients $h_{0} = \alpha' + \beta'\alpha$ and $h_{1} = \beta'\beta$.
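A quick numerical sanity check of this equivalence (the specific values of $\alpha, \beta, \alpha', \beta'$ and all names are arbitrary choices of mine):

```python
import numpy as np

# Check that the node-wise message/update form equals the convolutional form
#   x_{l+1} = sigma( [alpha' + beta'*alpha] x_l + beta'*beta A x_l ).
rng = np.random.default_rng(1)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
x = rng.standard_normal(n)
alpha, beta, alpha_p, beta_p = 0.7, 0.3, 0.5, 1.2
sigma = np.tanh

# message/update form: m_i = alpha*x_i + beta*(A x)_i, then nonlinear update
m = alpha * x + beta * (A @ x)
x_next_mpnn = sigma(alpha_p * x + beta_p * m)

# convolutional form with h0 = alpha' + beta'*alpha, h1 = beta'*beta
x_next_conv = sigma((alpha_p + beta_p * alpha) * x + beta_p * beta * (A @ x))

print(np.allclose(x_next_mpnn, x_next_conv))        # True
```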
Question
From the perspective of learning the weights ($\alpha, \beta, \alpha', \beta'$ vs. $h_{0}, h_{1}$), what is different?
Graph Convolutional Networks
Introduced by Kipf and Welling (2017), a graph convolutional network layer is given by
$$(x_{\ell})_{i} = \sigma\left( \sum_{j \in N(i)} \frac{(x_{\ell-1})_{j}}{\sqrt{ \lvert N(i) \rvert \, \lvert N(j) \rvert }} H_{\ell} \right)$$
Note that the $(x_{\ell})_{i}$ are row vectors. We can think of each layer as a "degree-normalized aggregation".
Idea
We can write a graph convolutional network layer as a graph convolution. This is easy to see by simply writing
$$x_{\ell} = \sigma\left( D^{-1/2}\bar{A}D^{-1/2}\, x_{\ell-1} H_{\ell} \right)$$
which is exactly the form of a graph convolution with shift operator $S = D^{-1/2}\bar{A}D^{-1/2}$ and a single non-zero filter tap $H_{\ell,1} = H_{\ell}$, where $\bar{A}$ is the binary adjacency matrix (obtained in Python via, e.g., `A[A > 0] = 1` or `(A > 0).astype(int)`).
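A minimal sketch of this layer (function and variable names are mine), assuming a nonnegative weighted adjacency with no isolated nodes:

```python
import numpy as np

# One GCN layer as written above: x_l = relu(D^{-1/2} Abar D^{-1/2} x_{l-1} H),
# where Abar is the binary adjacency and D its degree matrix.
def gcn_layer(A, x_in, H):
    Abar = (A > 0).astype(float)                 # binarize the adjacency
    d = Abar.sum(axis=1)                         # node degrees |N(i)| (assumed > 0)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = Dinv_sqrt @ Abar @ Dinv_sqrt             # degree-normalized shift operator
    return np.maximum(S @ x_in @ H, 0)           # pointwise ReLU
```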
Note
Advantages of GCNs:
- One local diffusion step per layer
- Simple and interpretable
- Neighborhood size normalization prevents the vanishing/exploding gradient problem
Disadvantages
- only one diffusion step per layer
- no edge weights
- only supports the (binary) adjacency as the shift operator
- no self-loops unless they are present in $\bar{A}$ (the embedding of node $i$ at layer $\ell$ is not informed by its own embedding at layer $\ell-1$). Even if self-loops are in the graph, there is no ability to weight $(x_{\ell-1})_{i}$ and $(x_{\ell-1})_{j}$ differently.
ChebNet
The idea behind ChebNets, which were introduced by Defferrard et al. in 2016, is to create a "fast" implementation of graph convolution in the spectral domain. Each layer of a ChebNet is given by
$$x_{\ell} = \sigma\left( \sum_{k=0}^{K-1} T_{k}(\tilde{L})\, x_{\ell-1} H_{\ell,k} \right)$$
where $\tilde{L} = \frac{2}{\lambda_{\max}}L - I$ to ensure the spectrum lies in $[-1,1]$, and the $T_{k}$ are the Chebyshev Polynomials.
We use the Chebyshev polynomials instead of Taylor polynomials to approximate an analytic function because
- by the Chebyshev equioscillation theorem, they provide a better approximation with fewer coefficients
- Chebyshev polynomials are orthogonal, meaning we get an orthonormal basis for free. There is no need to implement the filters in the spectral domain.
(see ChebNet)
Learning in the spectral domain provides an “inductive bias” for learning localized spectral filters
In practice, we need to use an analytic function surrogate for this type of filter.
Problem
Recall that we can represent any analytic function of the shift operator with convolutional graph filters, so the expressivity of these filters is quite strong. However, translating to and from the spectral domain is expensive because we need to compute the graph Fourier transform $\hat{x} = V^{H}x$ and also the inverse graph Fourier transform $x = V\hat{x}$.
This costs 2 additional matrix-vector multiplications, plus the diagonalization of $L = V\Lambda V^{H}$.
In the typical spectral representation of a convolutional graph filter, we have
$$z = V \,\mathrm{diag}\big(\hat{h}(\lambda_{1}), \dots, \hat{h}(\lambda_{n})\big)\, V^{H} x$$
where $L = V\Lambda V^{H}$ is the Laplacian. Since the spectral graph filter operates on the signal pointwise in the frequency domain, the same filter weights $\hat{h}(\lambda_{i})$ are shared across the whole graph signal.
Is there a way to keep the expressivity of graph convolutions, but without having to compute as many expensive operations?
Chebyshev Polynomial
The Chebyshev Polynomials can be calculated using the recurrence relation:
$$\begin{aligned} T_{0}(\lambda) &=1 \\ T_{1}(\lambda) &= \lambda\\ &\vdots \\ T_{k+1}(\lambda) &= 2\lambda T_{k}(\lambda) - T_{k-1}(\lambda) \end{aligned}$$
and each $T_{n}$ satisfies $T_{n}(\cos\theta) = \cos(n\theta)$.
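Because of this recurrence, a ChebNet-style filter $\sum_{k} h_{k} T_{k}(\tilde{L})\, x$ can be applied with only $K$ matrix-vector products and no diagonalization, which is exactly what the Problem above asks for. A minimal sketch (names are mine; scalar filter taps for simplicity):

```python
import numpy as np

# Apply z = sum_k h[k] * T_k(L_tilde) x using only matrix-vector products,
# via the recurrence T_{k+1} = 2 * L_tilde * T_k - T_{k-1}.
# L_tilde is assumed rescaled so its spectrum lies in [-1, 1].
def chebyshev_filter(L_tilde, x, h):
    T_prev = x                       # T_0(L_tilde) x = x
    z = h[0] * T_prev
    if len(h) == 1:
        return z
    T_curr = L_tilde @ x             # T_1(L_tilde) x = L_tilde x
    z += h[1] * T_curr
    for h_k in h[2:]:
        T_next = 2 * (L_tilde @ T_curr) - T_prev
        z += h_k * T_next
        T_prev, T_curr = T_curr, T_next
    return z
```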
In conventional signal processing, filters based on these polynomials have better cutoff behavior in the spectral domain (compared to general polynomial filters)
- i.e., they need fewer coefficients than the Taylor approximation for the same quality
For the formal theorem statement/proof: see chebyshev equioscillation theorem
Using a Taylor approximation requires more coefficients than a Chebyshev approximation (i.e., a higher polynomial degree) for the same approximation quality.
Intuition for why this is better than the Taylor polynomials:
- the Chebyshev polynomial expansion minimizes the worst-case ($L^{\infty}$) error over the whole approximation interval ($[-1,1]$), while a Taylor series is a local approximation around some point $\lambda_{0}$ (see the small numerical comparison below)
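A small numerical illustration of this point (the function, degree, and evaluation grid are my own choices, not from the lecture):

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

# Approximate exp on [-1, 1] with a degree-5 Chebyshev expansion versus a
# degree-5 Taylor polynomial around 0, and compare worst-case (sup-norm) errors.
deg = 5
x = np.linspace(-1, 1, 2001)

cheb_coefs = C.chebinterpolate(np.exp, deg)          # interpolate at Chebyshev points
cheb_err = np.max(np.abs(C.chebval(x, cheb_coefs) - np.exp(x)))

taylor = sum(x**k / math.factorial(k) for k in range(deg + 1))
taylor_err = np.max(np.abs(taylor - np.exp(x)))

print(f"Chebyshev sup-error: {cheb_err:.1e}, Taylor sup-error: {taylor_err:.1e}")
# Same number of coefficients, but the Chebyshev error is noticeably smaller.
```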
Additionally,
Theorem
Let $\langle f, g \rangle = \int_{-1}^{1} f(x)g(x) \frac{1}{\sqrt{ 1-x^{2} }} \, dx$.
The Chebyshev Polynomials are orthogonal with respect to the inner product $\langle \cdot, \cdot \rangle$.
Proof
$$\begin{aligned}
\int_{-1}^{1} T_{n}(x)T_{m}(x) \frac{1}{\sqrt{ 1-x^2 }} \, dx
&\qquad \text{substitute } x=\cos \theta,\;\; dx=-\sin \theta \, d\theta,\;\; \sqrt{ 1-\cos^2\theta } = -\sin\theta \text{ on } [\pi, 2\pi]\\
&= \int_{\pi}^{2\pi} T_{n}(\cos\theta)\, T_{m}(\cos\theta) \, d\theta \\
&= \int_{\pi}^{2\pi} \cos(n\theta) \cos(m\theta) \, d\theta \;\; \text{ since } T_{n}(\cos\theta) = \cos(n\theta) \\
&= \int_{\pi}^{2\pi} \frac{e^{i n\theta} + e^{- i n \theta}}{2} \cdot \frac{e^{i m\theta} + e^{- i m \theta}}{2} \, d\theta \\
&= \frac{1}{4} \int_{\pi}^{2\pi} e^{i (n+m)\theta} + e^{i(n-m)\theta} + e^{-i(n-m)\theta} + e^{-i(n+m)\theta}\, d\theta \\
&= \begin{cases} \frac{1}{4} \int_{\pi}^{2\pi} 4 \, d\theta = \theta \big|_{\pi}^{2\pi} = \pi \;\;\;\text{ if } n=m=0\\ \frac{1}{4} \int_{\pi}^{2\pi} 2 \, d\theta + \frac{1}{4}\int_{\pi}^{2\pi}2 \cos((n+m)\theta) \, d\theta = \frac{\pi}{2} + 0 = \frac{\pi}{2}\;\;\; \text{ if } n=m\neq 0 \\ \frac{1}{4} \int_{\pi}^{2\pi}2\cos((n+m)\theta) + 2\cos((n-m)\theta)\,d\theta = 0\;\; \; \text{ if } n \neq m \end{cases}
\end{aligned}$$
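A quick numerical check of the three cases, using the same substitution $x = \cos\theta$ (so the weighted integral becomes $\int_{\pi}^{2\pi}\cos(n\theta)\cos(m\theta)\,d\theta$); the pairs of indices below are my own choices:

```python
import numpy as np

# Midpoint-rule approximation of int_{pi}^{2pi} cos(n*theta) cos(m*theta) d(theta).
N = 200000
theta = np.pi + (np.arange(N) + 0.5) * (np.pi / N)   # midpoints of [pi, 2*pi]
for n, m in [(0, 0), (3, 3), (2, 5)]:
    val = np.sum(np.cos(n * theta) * np.cos(m * theta)) * (np.pi / N)
    print(n, m, round(val, 6))   # approximately pi, pi/2, and 0, as derived above
```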
Note
Advantages
- Fast and cheap - easy to calculate polynomials and they are orthogonal for free
- localized spectral filters
Disadvantages
- less stable than Taylor approximations
- perturbing the adjacency matrix can cause large fluctuations in the Chebyshev polynomial approximation, whereas the Taylor approximation will remain basically the same
- the spectrum must be rescaled to lie in $[-1,1]$ (hence the use of $\tilde{L}$)
- difficult to interpret in the node domain
Graph SAGE
Introduced by Hamilton, Ying, and Leskovec in 2017, each GraphSAGE layer implements the following 2 operations:
Aggregate: $(U_{\ell})_{i} = \mathrm{AGG}\left( \{ (x_{\ell-1})_{j} : j \in N(i) \} \right)$
Concatenate: $(x_{\ell})_{i} = \sigma\left( [(x_{\ell-1})_{i} \;\; (U_{\ell})_{i}]\, H \right)$
The standard aggregation operation is an average over $N(i)$. Letting $S$ denote the degree-normalized adjacency, we get
$$\begin{aligned} (U_{\ell})_{i} &= \frac{1}{\lvert N(i) \rvert} \sum_{j \in N(i)} (x_{\ell-1})_{j} \\ \implies U_{\ell} &= Sx_{\ell-1},\qquad S=D^{-1}A\\ x_{\ell} = \sigma([x_{\ell-1} \;\; U_{\ell}] H) &= \sigma(x_{\ell-1}H_{0} + Sx_{\ell-1}H_{1}) \end{aligned}$$
where $A$ is the binary [[Concept Wiki/adjacency matrix]]. This is equivalent to a [[Concept Wiki/graph convolutional network|GCN]]-style layer with an added nodewise normalization $\frac{(x_{\ell})_{i}}{\lvert \lvert (x_{\ell})_{i} \rvert \rvert_{2}}$. This normalization helps in some cases (empirically). (see [[Concept Wiki/graph SAGE]])
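A minimal sketch of one such layer with the mean aggregator (function and variable names are mine; assumes a binary adjacency with no isolated nodes):

```python
import numpy as np

# One GraphSAGE layer with the mean aggregator:
# aggregate U = D^{-1} A X, then concatenate-and-transform, then l2-normalize rows.
def sage_layer(A, X, H0, H1):
    deg = A.sum(axis=1, keepdims=True)                     # |N(i)| for each node
    U = (A @ X) / deg                                      # aggregate: mean over neighbors
    Z = np.maximum(X @ H0 + U @ H1, 0)                     # same as relu([X  U] @ vstack([H0, H1]))
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)  # nodewise l2 normalization
```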
notes
- SAGE implementations allow for a variety of aggregation functions, including pooling aggregators and LSTMs (which look at $N(i)$ as a sequence, losing permutation equivariance). In these cases, SAGE is no longer a graph convolution.
- There are both advantages and disadvantages to staying a graph convolution vs. using these other aggregation functions.
- the authors of SAGE popularized the aggregate/update representation of GNNs, meaning
$U_{\ell}$ (the neighborhood aggregation) is the "aggregate" and the transformation $\sigma([x_{\ell-1} \;\; U_{\ell}]H)$ is the "update"