2025-02-10 graphs lecture 6


1. Graph Signals and Graph Signal Processing

Community graph example

Summary

Today

  • community detection (HW 1 - Wednesday)
  • recap SBM (last time)
  • spectral clustering algorithm
  • spectral embeddings
  • contextual SBM
stochastic block model

an n-node stochastic block model (SBM) graph with C communities is given by

$$P = YBY^T, \qquad A \sim P$$

where

  • $Y \in \{0,1\}^{n \times C}$ is the community assignment matrix (if $Y_{ic}=1$ then node $i$ belongs to community $c$)
  • $B$ is the matrix of intra- and inter-community probabilities: $B_{c_1,c_2}$ is the edge probability for a node pair $(i,j)$ such that $Y_{ic_1}=1, Y_{jc_2}=1$
  • A is the adjacency matrix
Note

While $B$ can take any value (as long as it is symmetric), often we have $$B_{c_1 c_2} = \begin{cases}p & \text{if }c_{1}=c_{2} \\ q & \text{otherwise}\end{cases}$$
with $q = 1 - p$

(see stochastic block model)
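As an illustration, here is a minimal numpy sketch of sampling $A \sim P = YBY^T$ from a label vector and a probability matrix $B$; the function name `sample_sbm` and the parameter values are my own choices for the example, not from the lecture.

```python
import numpy as np

def sample_sbm(y, B, seed=None):
    """Sample an undirected SBM adjacency matrix A ~ P = Y B Y^T (no self-loops)."""
    rng = np.random.default_rng(seed)
    Y = np.eye(B.shape[0])[y]                      # one-hot assignment matrix, shape (n, C)
    P = Y @ B @ Y.T                                # P_ij = B_{c_i, c_j}
    upper = np.triu(rng.random(P.shape) < P, k=1)  # one Bernoulli draw per node pair
    A = (upper | upper.T).astype(int)              # symmetrize
    return A, P

# balanced 2-community example: intra-community prob p, inter-community prob q
n, p, q = 100, 0.6, 0.1
y = np.repeat([0, 1], n // 2)
A, P = sample_sbm(y, np.array([[p, q], [q, p]]))
```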

We say an SBM is balanced when all communities have the same size ($\frac{n}{C}$ nodes each); it is unbalanced otherwise.
(see balanced stochastic block model)

Example

Consider the 2-community balanced SBM. For an $n$-node graph, suppose the nodes are labeled such that the first $n/2$ are in community 1 and the remaining nodes are in community 2. Then

$$Y \in \mathbb{R}^{n\times 2}=\begin{bmatrix}1 & 0\\ \vdots & \vdots\\ 1 & 0\\ 0 & 1\\ \vdots & \vdots\\ 0 & 1\end{bmatrix},\qquad P \in \mathbb{R}^{(n/2+n/2)\times(n/2+n/2)}=\begin{bmatrix}p & \cdots & p & q & \cdots & q\\ \vdots & & \vdots & \vdots & & \vdots\\ p & \cdots & p & q & \cdots & q\\ q & \cdots & q & p & \cdots & p\\ \vdots & & \vdots & \vdots & & \vdots\\ q & \cdots & q & p & \cdots & p\end{bmatrix}$$

Recall that $A \sim P$, i.e. $$A_{ij}=\begin{cases}1 & \text{with prob. } P_{ij}\\ 0 & \text{with prob. } 1-P_{ij}\end{cases}$$

Thus, $\mathbb{E}(A)=P$. Let's compute the eigenvectors and eigenvalues of $P=\mathbb{E}(A)$. For $p \neq q$, this $P$ has exactly two nonzero eigenvalues: $\lambda_{1}=\frac{n(p+q)}{2}$ with eigenvector $\mathbf{1}$ (the all-ones vector), and $\lambda_{2}=\frac{n(p-q)}{2}$ with eigenvector $\begin{bmatrix}\mathbf{1}^T & -\mathbf{1}^T\end{bmatrix}^T$, whose sign pattern is exactly the community split.

Note

This means that we can use the vector associated with the second largest eigenvalue (in absolute value) of E(A)=P to reveal the community assignments of the nodes.

This is the intuition behind spectral clustering. For an arbitrary graph with adjacency matrix $A$, we assume that $A = A_{\mathrm{SBM}} + \epsilon$, where $\epsilon$ is random noise with $\mathbb{E}(\epsilon)=0$.

Thus, we can use the observed (sample) adjacency matrix, whose expectation is $P$, to estimate the community assignments of our nodes!
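As a numerical sanity check, here is a short numpy sketch (the parameter values are just for illustration, not from the lecture) that diagonalizes the expected adjacency matrix $P$ of the balanced 2-community example and confirms that the eigenvector for the second-largest eigenvalue (in absolute value) is piecewise constant with opposite signs on the two communities.

```python
import numpy as np

# expected adjacency matrix P for the balanced 2-community SBM (p within, q across)
n, p, q = 8, 0.7, 0.1
half = np.ones((n // 2, n // 2))
P = np.block([[p * half, q * half],
              [q * half, p * half]])

eigvals, eigvecs = np.linalg.eigh(P)       # eigenvalues in ascending order
order = np.argsort(-np.abs(eigvals))       # reorder by decreasing |eigenvalue|
print(eigvals[order][:2])                  # ~[n(p+q)/2, n(p-q)/2] = [3.2, 2.4]; the rest are 0
v2 = eigvecs[:, order[1]]                  # eigenvector of the 2nd largest |eigenvalue|
print(np.sign(v2))                         # constant sign within each community
```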

Spectral Clustering

Tip

What happens when C>2 (more general case than example above)?

  • The community assignments lie in the span of (i.e., are linear combinations of) the top $C$ eigenvectors (by eigenvalue magnitude)

Suppose we are given A and C. How do we estimate y?

spectral clustering algorithm

  1. Diagonalize $A=V\Lambda V^T$
  2. Order the eigenvectors by decreasing eigenvalue magnitude
    This yields $$V_{C} = \begin{bmatrix}v_{1} & v_{2} & \dots & v_{C}\end{bmatrix} = \begin{bmatrix}u_{1} \\ u_{2} \\ \vdots \\ u_{n}\end{bmatrix}, \quad v_{i} \in \mathbb{R}^n,\ u_{j} \in \mathbb{R}^C$$
    We can see the rows $u_{j}$ as embeddings of the nodes in $\mathbb{R}^C$ (community space)
  3. Now we can cluster the rows $u_{j}$ (k-means or whatever you want; Gaussian mixture models also work; see the sketch below)

(see spectral clustering)
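A minimal sketch of the algorithm above in numpy/scikit-learn, assuming the graph is undirected so $A$ is symmetric; the function name, random seed, and SBM parameters are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, C):
    """Cluster the nodes of a graph with adjacency matrix A into C communities."""
    eigvals, eigvecs = np.linalg.eigh(A)      # diagonalize A = V Lambda V^T (A symmetric)
    order = np.argsort(-np.abs(eigvals))      # sort by decreasing eigenvalue magnitude
    U = eigvecs[:, order[:C]]                 # V_C: rows u_j are node embeddings in R^C
    return KMeans(n_clusters=C, n_init=10).fit_predict(U)

# sample a balanced 2-community SBM and cluster it
rng = np.random.default_rng(0)
n, p, q = 200, 0.5, 0.05
y = np.repeat([0, 1], n // 2)                          # ground-truth communities
P = np.where(y[:, None] == y[None, :], p, q)           # P = Y B Y^T with B = [[p, q], [q, p]]
A = np.triu(rng.random((n, n)) < P, k=1)
A = (A | A.T).astype(float)                            # symmetric adjacency matrix, A ~ P
labels = spectral_clustering(A, C=2)                   # should agree with y up to a label swap
```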

This is unsupervised learning: it only assumes knowledge of $A$. When we know the labels of some subset of nodes $T \subseteq V$, these can be used to make better predictions and turn the task into a semi-supervised one. This is what we do more often in practice.

Explicitly, we can convert the problem to:

$$\min_{f} \sum_{i \in T} \mathbb{1}\left(f(A)_{i} \neq y_{i}\right)$$

where $f$ is some parametric function and $y_{i}$ is the one-hot community vector, with $y_{ic}=1$ if node $i$ is in community $c$. In practice, the indicator/0-1 loss is replaced by the cross-entropy loss.

Spectral Embedding

Suppose we are given a diagonalizable adjacency matrix $A$ for a graph with nodes partitioned into $C$ classes, and suppose we know the assignments for some subset of the nodes $T \subseteq V$. Let $V_{C}$ be the matrix of the top $C$ eigenvectors of $A$.

Finding the spectral embedding amounts to solving the problem $$\min_{f} \sum_{i \in T} \mathbb{1}\left(f(A)_{i} \neq y_{i}\right)$$ where $f$ comes from the hypothesis class $\{f : f(A)=\sigma(V_{C} W),\ W \in \mathbb{R}^{C\times C}\}$ and $\sigma$ is a predetermined pointwise nonlinearity.

Notes

This can be thought of as a fully connected neural network (FC-NN) on the $C$-dimensional embeddings $u_{1},\dots,u_{n}$ of the nodes of the graph into feature space.

  • in practice, a surrogate (e.g. cross-entropy) is usually used for the 0-1 loss
  • to find the optimal $f$, i.e. the optimal $W$, we solve using gradient descent methods

(see spectral embedding)
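A minimal numpy sketch of this semi-supervised setup, with $\sigma$ taken to be a row-wise softmax and the 0-1 loss replaced by cross-entropy (as noted above), fit by plain gradient descent on the labeled set $T$; the function name, learning rate, and step count are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def fit_spectral_embedding(A, C, labeled_idx, labeled_y, lr=0.5, steps=500):
    """Fit f(A) = softmax(V_C W) on the labeled nodes; a sketch, not a reference implementation."""
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(eigvals))
    U = eigvecs[:, order[:C]]                          # node embeddings u_j in R^C
    W = np.zeros((C, C))                               # parameters of the hypothesis class
    Y = np.eye(C)[labeled_y]                           # one-hot targets on the labeled set T
    Ut = U[labeled_idx]
    for _ in range(steps):
        Z = Ut @ W
        probs = np.exp(Z - Z.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)      # sigma = row-wise softmax
        grad = Ut.T @ (probs - Y) / len(labeled_idx)   # gradient of mean cross-entropy w.r.t. W
        W -= lr * grad                                 # gradient descent step
    return (U @ W).argmax(axis=1)                      # predicted community for every node
```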

Spectral clustering is fully unsupervised; spectral embedding incorporates some prior knowledge, making it semi-supervised.

information theoretic threshold

Question

What happens when $p \to q$ in the balanced SBM with $C=2$?

Answer

It becomes impossible to cluster! When $p=q$, we have an Erdős–Rényi graph, in which the edge probability is the same for all node pairs. There are no communities to distinguish.

But even when $p \neq q$, there is a region around $p=q$ where detection of communities is impossible in an information-theoretic sense.

We want to recover the exact partition of nodes into communities as closely as possible. Without perfect information, this amounts to achieving almost exact recovery:

Almost exact recovery

In a stochastic block model problem, almost exact recovery of the assignments of nodes to communities is achieved when

$$P\left\{\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left(f(A)_{i}=y_{i}\right)=1-o(1)\right\}=1-o(1)$$

i.e. $P\left(\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left[f(A)_{i}=y_{i}\right]=1\right)=1$, i.e. the fraction of correctly labeled nodes converges almost surely to 1.

(see almost exact recovery)
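Since the community labels themselves are only defined up to renaming, this fraction is typically computed after maximizing over relabelings of the communities. A tiny helper one might use for that (the function name is made up for illustration):

```python
import numpy as np
from itertools import permutations

def recovered_fraction(pred, y, C):
    """Largest fraction of correctly labeled nodes over all relabelings of the C communities."""
    return max(np.mean(np.array(perm)[pred] == y) for perm in permutations(range(C)))
```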

How do we know if almost exact recovery is possible? From an information theoretic point of view, we can determine this using the signal to noise ratio.

signal-to-noise ratio (SNR)

The signal-to-noise ratio is the ratio of the power of the signal (the thing we want to measure) to the power of the extraneous noise. This is given as:

$$\text{SNR}=\frac{P_{\text{signal}}}{P_{\text{noise}}}$$

where $P_{(\cdot)}$ is the average power, or second moment. When the signal and noise are random variables, we have

$$\text{SNR}=\frac{\mathbb{E}[X_{\text{signal}}^2]}{\mathbb{E}[X_{\text{noise}}^2]}$$

(see signal to noise ratio)

Example

In our example with $C=2$, with $B_{c_1 c_2}=\begin{cases}p & \text{if } c_{1}=c_{2}\\ q & \text{otherwise}\end{cases}$, then

$$\text{SNR}=\frac{(p-q)^{2}}{2(p+q)}$$
  • the numerator is the signal: the discrepancy between intra- and inter-community edge probabilities (and hence between within- and across-community degrees)
  • the denominator is the noise: proportional to the average degree over all nodes in the graph
    ^f75880
Theorem

Suppose we have a stochastic block model problem where the adjacency matrix $A \sim P$, $P=YBY^T$, and $B_{c_1 c_2}=\begin{cases}p & \text{if } c_{1}=c_{2}\\ q & \text{otherwise.}\end{cases}$

If the signal-to-noise ratio $\text{SNR}<\frac{1}{n}$, then almost exact recovery is impossible. Even with infinite time and resources, there is no algorithm that can recover the true communities with just $A$.

(see almost exact recovery is impossible when the signal to noise ratio is less than the threshold)

Proof

See the proofs in Massoulié (2014) and Mossel (2014). Aside: these are interesting to compare, since they use different proof methods from different domains. See also Abbe's survey papers.

Example: Sparse Graphs

Let $p=\frac{a}{n}$, $q=\frac{b}{n}$, and $\frac{|E|}{n^{2}} \to 0$ as $n \to \infty$ (the edge density vanishes, i.e. the graph is sparse).
Then

$$\text{SNR}=\frac{\left(\frac{a}{n}-\frac{b}{n}\right)^{2}}{2\left(\frac{a+b}{n}\right)}=\frac{1}{2n}\frac{(a-b)^{2}}{(a+b)}$$

$$\text{SNR}_{p,q}<\frac{1}{n} \iff \frac{1}{2n}\frac{(a-b)^{2}}{a+b}<\frac{1}{n} \iff \frac{(a-b)^{2}}{2(a+b)}<1$$

i.e., in sparse graphs, it is not difficult to identify the information-theoretic threshold (we can easily calculate the signal-to-noise ratio).
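For instance, a tiny sketch that plugs hypothetical values of $a$ and $b$ into the condition above (these values are made up for illustration):

```python
def below_threshold(a, b):
    """True when (a - b)^2 / (2 (a + b)) < 1, i.e. almost exact recovery is impossible."""
    return (a - b) ** 2 / (2 * (a + b)) < 1

print(below_threshold(a=3, b=2))   # True:  1 / 10 = 0.1 < 1, recovery is impossible
print(below_threshold(a=8, b=1))   # False: 49 / 18 ~ 2.72, not ruled out by the theorem
```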

Next time, we will look at another SBM with more information (in the form of node features), from which we can get better thresholds.

Note

From 2025-02-05 graphs lecture 5: there was a "typo" in the distributions for our SBM example.