[[lecture-data]]2025-02-12
Summary
1. Graph Signals and Graph Signal Processing
information theoretic threshold
Recall from last time that in the balanced SBM with intra- and inter-community probabilities $p$ and $q$, if $p$ and $q$ are close, there is a region around the threshold $SNR_{p,q} = \frac{1}{n}$ where the detection of communities is impossible from an information theory perspective, as measured by the signal to noise ratio.
Example
In our example with two balanced communities, with intra-community probability $p$ and inter-community probability $q$, the signal to noise ratio is $SNR_{p,q} = \frac{(p-q)^2}{2(p+q)}$
- the numerator is the signal: the degree discrepancy between the two communities
- the denominator is the noise: the average degree across all nodes in the graph
Theorem
Suppose we have a stochastic block model problem where the adjacency matrix $A$ is drawn from a balanced two-community SBM with intra-community probability $p$ and inter-community probability $q$.
If the signal to noise ratio $SNR_{p,q} < \frac{1}{n}$, then almost exact recovery is impossible.
Even with infinite time and resources, there is no algorithm that can recover the true communities, even given the adjacency probabilities. This gives us the information theoretic threshold.
Proof
See the proofs in Massoulié (2014) and Mossel et al. (2014). Aside: these are interesting to compare, since they use different proof methods from different domains. See also Abbe's survey papers.
Example: Sparse Graphs
Let $p = \frac{a}{n}$ and $q = \frac{b}{n}$ as $n \to \infty$ (the edge probabilities vanish, so the graph is sparse).
Then
$$\begin{aligned}
SNR_{p,q} &= \frac{\left( \frac{a}{n} - \frac{b}{n} \right)^2}{2\frac{(a+b)}{n}} = \frac{1}{2n} \frac{(a-b)^2}{(a+b)} \\
SNR_{p,q} < \frac{1}{n} &\iff \frac{1}{2n} \frac{(a-b)^2}{a+b} < \frac{1}{n} \iff \frac{(a-b)^2}{2(a+b)} < 1
\end{aligned}$$
i.e., in sparse graphs it is not difficult to identify the information theoretic threshold (we can easily compute the signal to noise ratio).
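For a quick sanity check of this condition, here is a small Python snippet (the values of $a$ and $b$ are arbitrary, chosen only for illustration):

```python
# Detectability check in the sparse regime p = a/n, q = b/n:
# communities are detectable (above the threshold) when (a - b)^2 > 2 (a + b).

def sparse_snr(a: float, b: float) -> float:
    """Normalized SNR (a - b)^2 / (2 (a + b)); detection requires a value > 1."""
    return (a - b) ** 2 / (2 * (a + b))

print(sparse_snr(a=5, b=1))  # 16 / 12 ≈ 1.33 > 1  -> above the threshold
print(sparse_snr(a=4, b=2))  #  4 / 12 ≈ 0.33 < 1  -> below the threshold
```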
Takeaway
Community detection is difficult when the graph is sparse. Traditional methods may fail even before we hit this threshold!
contextual stochastic block model (C-SBM)
The (binary) contextual stochastic block model (C-SBM) is like an SBM, but includes node features that are drawn from a Gaussian distribution. Here,
- $y \in \{ -1,1 \}^n$ (or $\{ 0,1 \}$) and $B \in \mathbb{R}^{2 \times 2}$
- The adjacency matrix $A$ has independent Bernoulli entries whose expectation has the block structure
$$\mathbb{E}[A] = \left[\begin{array}{ccc|ccc} p & \dots & p & q & \dots & q \\ \vdots & & \vdots & \vdots & & \vdots \\ p & \dots & p & q & \dots & q \\ \hline q & \dots & q & p & \dots & p \\ \vdots & & \vdots & \vdots & & \vdots \\ q & \dots & q & p & \dots & p\end{array}\right]$$
And [[Concept Wiki/graph signals\|node features]] $X$ are drawn as
- $x_{i} = \sqrt{ \frac{\mu}{n} } y_{i} u + \frac{z_{i}}{\sqrt{ d }}$, with $u \sim \text{Normal}\left( 0, \frac{I_{d}}{d} \right)$ and $z_{i} \sim \text{Normal}(0, I_{d})$

So $$X_{i}\mid Y_{i}, u \sim \text{Normal}\left( \pm \sqrt{ \frac{\mu}{n} }u, \frac{I_{d}}{d} \right)$$
(see [[Concept Wiki/contextual stochastic block model]])
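A minimal sketch of sampling from this model in Python (the parameter values, the balanced label split, and the function name are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def sample_csbm(n=200, d=50, p=0.05, q=0.01, mu=2.0, rng=np.random.default_rng(0)):
    """Sample (A, X, y) from the binary C-SBM described above."""
    # balanced labels y_i in {-1, +1}
    y = np.repeat([1, -1], n // 2)

    # adjacency: P(A_ij = 1) = p if y_i = y_j, else q (undirected, no self-loops)
    same = np.equal.outer(y, y)
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)

    # features: x_i = sqrt(mu/n) * y_i * u + z_i / sqrt(d)
    u = rng.normal(0.0, np.sqrt(1.0 / d), size=d)   # u ~ Normal(0, I_d / d)
    Z = rng.normal(0.0, 1.0, size=(n, d))           # z_i ~ Normal(0, I_d)
    X = np.sqrt(mu / n) * y[:, None] * u + Z / np.sqrt(d)
    return A, X, y
```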
Feature-aware spectral embeddings
Recall that spectral clustering is fully unsupervised. For spectral embedding, we make use of some of the known community assignments.
feature-aware spectral embeddings
Feature-aware spectral embeddings incorporate node features (for example, those of a C-SBM) into our predictions when using spectral embedding.
Let $G$ be a graph with diagonalizable adjacency matrix $A$ and node features $X$. Suppose we have $C$ communities that we want to assign to the nodes.
- Diagonalize $A$ as $A = V \Lambda V^{T}$
- Pick the top $c$ eigenvectors to create the embedding
- Define $V_{c} = [v_{1}, \dots, v_{c}]$, where $v_{1}, \dots, v_{c}$ are the top eigenvectors of $A$.
Before, we had
> [!equation] Spectral Embedding Problem
> $$\min_{f} \sum_{i \in T} \mathbb{1}\left( f(A)_{i} \neq y_{i} \right), \quad f \in \left\{ f(A) = \sigma(V_{c}W),\; W \in \mathbb{R}^{C\times C} \right\}$$
Now, our hypothesis class is instead:
Feature-Aware Spectral Embedding Hypothesis Class
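The exact hypothesis class was not written down here; one plausible sketch, assuming the node features are simply concatenated with the spectral embedding before the linear map (so that $W \in \mathbb{R}^{(c + d)\times C}$), is:

```python
import numpy as np

def feature_aware_embedding(A, X, c):
    """Concatenate the top-c eigenvectors of A with the node features X.

    Assumption: concatenation is one simple way to make the embedding
    feature-aware; the hypothesis class from lecture may combine A and X differently.
    """
    eigvals, eigvecs = np.linalg.eigh(A)           # A assumed symmetric
    top = np.argsort(np.abs(eigvals))[::-1][:c]    # top-c eigenvectors by |eigenvalue|
    V_c = eigvecs[:, top]                          # n x c spectral embedding
    return np.concatenate([V_c, X], axis=1)        # n x (c + d) feature-aware embedding
```

A linear map $W$ followed by $\sigma$ on this concatenated embedding then plays the same role as $W$ in the earlier hypothesis class; the concatenation itself is only one illustrative choice.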
Note
In the presence of node features, the information theoretic threshold for community detection becomes $\lambda^{2} + \frac{\mu^{2}}{\gamma} = 1$ (with $\lambda^{2} = \frac{(a-b)^{2}}{2(a+b)}$ and $\frac{n}{d} \to \gamma$)
Takeaway
Community detection is possible in an information theoretic sense when $\lambda^{2} + \frac{\mu^{2}}{\gamma} > 1$, i.e. as long as the graph signal is strong (high $\lambda$) and/or the means of the communities are sufficiently separated (high $\mu$).
Computational thresholds for spectral clustering (non-rigorous)
Maybe there exists an algorithm that can find these communities, but can we do so in polynomial time? Can we do even better?
Example
Spectral redemption (Krzakala et al., 2013): computational threshold for spectral methods that use the adjacency matrix.
Takeaway
Even when doing spectral embedding or feature-aware spectral embeddings, since we have to fix $C$ (the number of communities, which is often unknown) and $c$ (the number of eigenvectors we use), our spectral algorithms might fail.
In practice, spectral algorithms fail above the information theoretic threshold (sometimes, significantly above that threshold!)
Question
Can GNNs help?
Recall Lecture 3: we saw that the expressivity of graph convolutions lets them find regression solutions in many cases.
Let’s go back to that result and recall the conditions for finding a convolutional graph filter:
Theorem
Let $S = V \Lambda V^{H}$ be a shift operator with eigenvalues $\lambda_{1}, \dots, \lambda_{n}$, and let $x$ and $y$ be graph signals. Let $\tilde{x} = V^{H}x$ and $\tilde{y} = V^{H}y$ be their GFTs.
Suppose
- $\lambda_{i} \neq \lambda_{j}$ for all $i \neq j$, and
- $[\tilde{x}]_{i} \neq 0$ for all $i$.
Then there exist coefficients $h_{0}, \dots, h_{n-1}$ such that $y = \sum_{k=0}^{n-1} h_{k} S^{k} x$, and so $y$ is a graph convolution of $x$
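A small numerical illustration of this existence result (numpy; the random $S$, $x$, and $y$ are placeholders): in the GFT domain the identity $\tilde{y}_{i} = \big(\sum_{k} h_{k}\lambda_{i}^{k}\big)\tilde{x}_{i}$ becomes a Vandermonde system for the coefficients $h_{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# a symmetric shift operator with (almost surely) distinct eigenvalues
S = rng.normal(size=(n, n)); S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)

x = rng.normal(size=n)                       # graph signal with nonzero GFT (a.s.)
y = rng.normal(size=n)                       # arbitrary target signal
x_hat, y_hat = V.T @ x, V.T @ y              # GFTs of x and y

# In the spectral domain, y_hat_i = (sum_k h_k * lam_i^k) * x_hat_i,
# so h solves the Vandermonde system  Vand @ h = y_hat / x_hat.
Vand = np.vander(lam, N=n, increasing=True)  # columns: 1, lam_i, lam_i^2, ...
h = np.linalg.solve(Vand, y_hat / x_hat)

# verify: sum_k h_k S^k x reproduces y
y_rec = sum(hk * np.linalg.matrix_power(S, k) @ x for k, hk in enumerate(h))
print(np.allclose(y_rec, y))                 # True (up to numerical error)
```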
In the example above, having access to eigenvectors might have helped. But not only do we need to fix $C$ when it is unknown, we also have to pick a large enough number of eigenvectors $c$ so that we don’t miss informative eigenvalues “lost in the bulk”, as in the spectral redemption paper.
This motivates using GNNs for semi-supervised community detection on graphs
Takeaway
We can use GNNs to solve the semi-supervised community detection node-level task on graphs.
Let $G$ be a graph with diagonalizable adjacency matrix $A$ and node features $X$. Suppose we have $C$ communities that we want to assign to the nodes.
As long as $A$ and $X$ satisfy the conditions for finding a convolutional graph filter, namely that
- the eigenvalues of $A$ are distinct ($\lambda_{i} \neq \lambda_{j}$ for $i \neq j$), and
- the GFT of the features is nonzero, $[\tilde{x}]_{i} \neq 0$ for all $i$,
then there exists a graph convolution (or a 1-layer linear GNN) that approximates the community assignments $y$. The optimization problem is then the same as before:
Community Detection Optimization Problem
$$\min_{f} \sum_{i \in T} \ell\left( f(A, X)_{i}, y_{i} \right)$$
where $\ell$ is a surrogate of the 0-1 loss, each $y_{i}$ is the one-hot community vector of node $i$, and $f$ is some parametric function of $A$ and $X$.
And we can choose our hypothesis class so that $f(A, X)$ is a GNN with learnable parameters (this is the form of the function we defined in Lecture 5).
(see we can use GNNs to solve feature-aware semi-supervised learning problems)
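A minimal sketch of this setup in PyTorch (the filter order $K$, the initialization, and the training loop are illustrative assumptions, not the exact architecture from lecture):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """A 1-layer linear graph convolution: logits = sum_k S^k X H_k."""
    def __init__(self, d_in, n_classes, K=3):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(d_in, n_classes)) for _ in range(K)]
        )

    def forward(self, S, X):
        out, Z = 0.0, X
        for H_k in self.weights:
            out = out + Z @ H_k            # add the k-th tap: S^k X H_k
            Z = S @ Z                      # shift the signal for the next tap
        return out                         # logits; softmax is applied inside the loss

# train on the labeled nodes only (cross-entropy as the surrogate of the 0-1 loss)
def train(S, X, labels, train_idx, n_classes, epochs=200):
    model = GraphConv(X.shape[1], n_classes)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(S, X)[train_idx], labels[train_idx])
        loss.backward()
        opt.step()
    return model
```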
Housekeeping
- Homework due 2 weeks from today (first week of)
A note on sparse matrix-vector multiplications
Consider a shift operator $S \in \mathbb{R}^{n \times n}$ and a signal $x \in \mathbb{R}^{n}$.
- the usual `matmul` operation in PyTorch requires $O(n^{2})$ operations to compute $Sx$ for a dense $S$
Example
Illustration to see the operations required
Typically the graph shift operator (i.e., adjacency matrix, graph Laplacian, etc.) is a sparse matrix, i.e. it has only $\lvert E \rvert \ll n^{2}$ nonzero entries
- in this case, matrix-vector multiplication can achieve $O(\lvert E \rvert)$ complexity. But, to achieve it you need to use the sparse tensor representation of $S$
How sparse matmuls work
Suppose we have a sparse adjacency matrix
$$A = \begin{bmatrix} 12 & 0 & 26 & 0 \\ 0 & 0 & 19 & 14 \\ 26 & 19 & 0 & 0 \\ 0 & 14 & 0 & 7 \end{bmatrix}$$
There are two main options to store this more efficiently.
> [!def] Coordinate (COO) Representation
>
> $A$ can become a set of $\lvert E \rvert$ tuples of the form $(\text{row index}, \text{column index}, \text{value})$. So this matrix becomes the ordered set of these tuples, sorted by row and then column index.
>
> This can be better from a memory perspective, but can still be quite large if we have a few nodes with high degree. A nice solution to this is to use [[Concept Wiki/compressed sparse row representation#^definition\|compressed sparse row representation]] instead.
> (see [[Concept Wiki/coordinate representation]])

> [!example]
>
> $$A = \begin{bmatrix} 12 & 0 & 26 & 0 \\ 0 & 0 & 19 & 14 \\ 26 & 19 & 0 & 0 \\ 0 & 14 & 0 & 7 \end{bmatrix}$$
> can become
> $(0,0,12), (0, 2, 26), (1,2,19), (1,3,14), (2,0, 26), (2,1,19), (3,1,14), (3,3,7)$

Another method to compress a sparse matrix's information is through compressed sparse row (CSR) representation - this is the most popular.
> [!def] Compressed Sparse Row (CSR) representation
>
> Suppose $A \in \mathbb{R}^{n \times m}$ is a sparse matrix. We can represent it as a set of 3 tensors/"lists" called **compressed sparse row** representation (CSR). We get this representation by realizing that we can encode the row index in the COO representation a bit more efficiently.
>
> Instead of writing out the row index for each element, we can collect the column indices for each row, put each of these collections together, and then have a `pointer` tell us where to start reading.
>
> If $A$ has $z$ non-zero entries, then
> - the `column` tensor contains $z$ entries. Each entry contains the column index for one of the non-zero elements, sorted by ascending row index.
> - the `row` or `pointer` tensor contains $n+1$ entries
>     - the first $n$ entries contain the starting index for where the row's elements start in the `column` tensor
>     - the last entry is $z$
> - the `value` tensor contains $z$ entries and contains the non-zero elements sorted by row then column index.
>
> (see [[Concept Wiki/compressed sparse row representation]])

> [!example]+
>
> $$A = \begin{bmatrix}
> 12 & 0 & 26 & 0 \\
> 0 & 0 & 19 & 14 \\
> 26 & 19 & 0 & 0 \\
> 0 & 14 & 0 & 7
> \end{bmatrix}$$
>
> `Pointer = [0 2 4 6 8]`
> $$\begin{array}{c|ccccc} \text{index} & 0 & 1 & 2 & 3 & 4 \\ \hline \text{read start} & 0 & 2 & 4 & 6 & 8 \end{array}$$
> `Column = [0 2 2 3 0 1 1 3]`
> $$\begin{array}{c|cc|cc|cc|cc} \text{index} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline \text{col indx} & 0 & 2 & 2 & 3 & 0 & 1 & 1 & 3 \end{array}$$
> `Value = [12 26 19 14 26 19 14 7]`
> $$\begin{array}{c|cc|cc|cc|cc} \text{index} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline \text{value} & 12 & 26 & 19 & 14 & 26 & 19 & 14 & 7 \end{array}$$

Matrix multiplication can then become much more efficient computationally. Using sparse matrix multiplication we get $$\text{total operations} = \sum_{i=1}^n \lvert N_{i} \rvert = \sum_{i=1}^n d_{i} = \mathbb{1}^TA\mathbb{1} = \lvert E \rvert$$
In `PyTorch`, to transform to a sparse tensor:
- `S.to_sparse()` for COO
- `S.to_sparse_csr()` for CSR

Matrix-vector multiplication is as before: `S @ x` or `torch.mv(S,x)` for both representations, and this will give a faster multiplication.
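A runnable version of the PyTorch calls above, using the $4\times 4$ matrix from the CSR example (the values shown in the comments are what the representations should contain):

```python
import torch

A = torch.tensor([[12., 0., 26., 0.],
                  [ 0., 0., 19., 14.],
                  [26., 19., 0., 0.],
                  [ 0., 14., 0., 7.]])
x = torch.tensor([1., 2., 3., 4.])

A_coo = A.to_sparse()        # COO: indices + values
A_csr = A.to_sparse_csr()    # CSR: crow_indices (pointer), col_indices, values

print(A_csr.crow_indices())  # tensor([0, 2, 4, 6, 8])  -- the `pointer` tensor
print(A_csr.col_indices())   # tensor([0, 2, 2, 3, 0, 1, 1, 3])
print(A_csr.values())        # tensor([12., 26., 19., 14., 26., 19., 14., 7.])

# sparse matrix-vector products agree with the dense result
dense = A @ x
print(torch.allclose(dense, torch.mv(A_coo, x)))  # True
print(torch.allclose(dense, torch.mv(A_csr, x)))  # True
```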