\mathbb{P}\left[ \left\lvert \, \lVert x \rVert^2 - d \, \right\rvert \geq t \right] &= \mathbb{P}\left[ \left\lvert \sum_{i=1}^d x_{i}^2 - d \right\rvert \geq t \right] \\
&\leq 2 \begin{cases}
\exp\left( -\frac{t^2}{8d} \right)&\text{if } t \leq d \\
\exp\left( -\frac{t}{8} \right) & \text{if }t \geq d
\end{cases} \\
&= 2\exp\left( -\frac{1}{8} \min \left\{ \frac{t^2}{d}, t \right\}\right)
\end{align}$$
>[!important]
>This tells us that, with high probability, we have
>$$\begin{align}
>\lVert x \rVert^2 &= d + {\cal O}(\sqrt{ d }) \\
>\implies \lVert x \rVert &= \sqrt{ d + {\cal O}(\sqrt{ d }) } \\
>&= \sqrt{ d } \cdot \sqrt{ 1+ {\cal O}\left( \frac{1}{\sqrt{ d }} \right)} \\
>&= \sqrt{ d }\left( 1+{\cal O}\left( \frac{1}{\sqrt{ d }} \right) \right) \\
>&= \sqrt{ d } + {\cal O}(1)
>\end{align}$$
>i.e., the standard [[Concept Wiki/gaussian random vector]] lies, with high probability, in a spherical shell of width ${\cal O}(1)$ around the sphere of radius $\sqrt{ d }$.
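A quick Monte Carlo sketch of this concentration (not part of the notes; assumes `numpy`, with the dimension and sample count chosen arbitrarily):

```python
import numpy as np

# Sketch: sample standard Gaussian vectors in high dimension and check that
# their norms concentrate in an O(1)-width shell around radius sqrt(d).
rng = np.random.default_rng(0)
d, n = 10_000, 2_000                 # dimension, number of samples (arbitrary)

x = rng.standard_normal((n, d))      # rows are iid samples of x ~ N(0, I_d)
norms = np.linalg.norm(x, axis=1)    # ||x|| for each sample

print(f"sqrt(d)    = {np.sqrt(d):.2f}")
print(f"mean ||x|| = {norms.mean():.2f}")   # close to sqrt(d) = 100
print(f"std  ||x|| = {norms.std():.3f}")    # O(1): does not grow with d
```

The standard deviation of $\lVert x \rVert$ stays bounded as $d$ grows, which is exactly the ${\cal O}(1)$ shell width above.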
We bound only one tail of the distribution; the desired two-sided inequality then follows by multiplying by $2$ to account for the other tail.
Note that $\mathbb{E}[x_{i}^2]=1$, so define $s_{i} := x_{i}^2-1$. Then, for any $\lambda > 0$,
$$\begin{align}
\mathbb{P}\left[ \sum_{i=1}^d x_{i}^2 - d \geq t\right] &= \mathbb{P}\left[ \sum_{i=1}^d s_{i} \geq t \right] \\
&= \mathbb{P}\left[ \exp\left( \lambda \sum_{i=1}^d s_{i} \right) \geq \exp(\lambda t) \right] \\
(*)& \leq \frac{\mathbb{E}\left[ \exp\left( \lambda \sum s_{i} \right) \right]}{\exp(\lambda t)} \\
(**)& = \frac{\mathbb{E}[\exp(\lambda s_{1})]^d}{\exp(\lambda t)} \\
&= \exp(-\lambda t + d \log(\mathbb{E}[\exp(\lambda s_{1})]))
\end{align}$$
where
- $(*)$ is from [[Concept Wiki/Markov's Inequality]]
- $(**)$ is because the $x_{i}$ are independent
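The factorization in $(**)$ can be sanity-checked numerically (a sketch, not from the notes; assumes `numpy`, with $\lambda$, $d$, and the sample count chosen arbitrarily):

```python
import numpy as np

# Check step (**): for iid s_i = x_i^2 - 1, the MGF of the sum factorizes,
# E[exp(lam * sum_i s_i)] = E[exp(lam * s_1)]^d.  Also compare against the
# closed form E[exp(lam * s_1)] = exp(-lam) / sqrt(1 - 2*lam)  (lam < 1/2).
rng = np.random.default_rng(1)
lam, d, n = 0.1, 5, 1_000_000

s = rng.standard_normal((n, d)) ** 2 - 1     # iid copies of s_i

lhs = np.exp(lam * s.sum(axis=1)).mean()     # E[exp(lam * sum s_i)]
rhs = np.exp(lam * s[:, 0]).mean() ** d      # E[exp(lam * s_1)]^d
exact = (np.exp(-lam) / np.sqrt(1 - 2 * lam)) ** d

print(f"lhs={lhs:.4f}  rhs={rhs:.4f}  exact={exact:.4f}")
```

All three quantities agree up to Monte Carlo error, confirming both the independence step and the closed-form MGF used below.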
Now, define
$$\psi(\lambda) := \log(\mathbb{E}[\exp(\lambda s_{1})]) = \log(\mathbb{E}[\exp(\lambda(x_{1}^2 - 1))])$$
as the *cumulant generating function* ([wikipedia](https://en.wikipedia.org/wiki/Cumulant)) of $x_{1}^2-1$ (which has expectation $0$). Note that $\mathbb{E}[\exp(\lambda x_{1}^2)]$ is finite if and only if $\lambda < \frac{1}{2}$, and in this case
$$\begin{align}
\mathbb{E}[\exp(\lambda x_{1}^2)] &= \frac{1}{\sqrt{ 1-2\lambda }} \\
\implies \psi(\lambda) &= -\lambda + \frac{1}{2}\log\left( \frac{1}{1-2\lambda} \right) \\
({\dagger})&= \lambda^2+{\cal O}(\lambda^3)
\end{align}$$
where we get $({\dagger})$ via the Taylor expansion $-\frac{1}{2}\log(1-2\lambda) = \lambda + \lambda^2 + {\cal O}(\lambda^3)$. This yields the bound $\psi(\lambda) \leq 2\lambda^2$ for $0 \leq \lambda \leq \frac{1}{4}$.
>[!exercise]- Exercise (tedious) - Verify the bound
>- Can plot $\psi(\lambda)$ and the bound $2\lambda^2$ (see notes page 6; figure 1.1)
>- Verify via algebra that $\psi(\lambda) \leq 2\lambda^2$ for all $\lambda \leq \frac{1}{4}$
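A numerical version of the exercise (a sketch, not from the notes; assumes `numpy`, with the grid resolution chosen arbitrarily):

```python
import numpy as np

# Check psi(lam) = -lam - (1/2) * log(1 - 2*lam) against the bound 2*lam^2
# on a fine grid over [0, 1/4].
lam = np.linspace(0.0, 0.25, 10_001)
psi = -lam - 0.5 * np.log1p(-2.0 * lam)   # log1p(-2x) = log(1 - 2x), accurate near 0
gap = 2.0 * lam**2 - psi                  # bound minus psi; should be >= 0

print(f"min gap over grid: {gap.min():.2e}")
```

The gap is nonnegative across the whole grid (it vanishes at $\lambda = 0$, where both sides are $0$), consistent with the claimed bound.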
Assuming $0 < \lambda \leq \frac{1}{4}$, we have
$$\begin{align}
\mathbb{P}\left[ \sum_{i=1}^d x_{i}^2 - d \geq t \right] &\leq \exp(-\lambda t + d \log(\mathbb{E}[\exp(\lambda s_{1})])) \\
&=\exp(-\lambda t + d\psi(\lambda)) \\
&\leq \exp(-\lambda t + 2d\lambda^2)
\end{align}$$
Now, $2d\lambda^2 -\lambda t$ is a convex quadratic in $\lambda$, and is thus minimized at $\lambda^* = \frac{t}{4d}$. We want to set $\lambda=\lambda^*$, but we must ensure that $\lambda \leq \frac{1}{4}$ for the bound $\psi(\lambda) \leq 2\lambda^2$ to apply. Thus, we have two cases:
- If $t \leq d$, then $\lambda^* \leq \frac{1}{4}$ and we can safely set $\lambda = \lambda^*$.
$$\mathbb{P}\left[ \sum_{i=1}^d x_{i}^2 - d \geq t \right] \leq \exp(-\lambda^* t + 2d(\lambda^*)^2)=\exp\left( -\frac{t^2}{4d} + \frac{t^2}{8d} \right) = \exp\left( -\frac{t^2}{8d} \right)$$
- If $t \geq d$, then $\frac{t}{4d} \not\leq \frac{1}{4}$, so we instead set $\lambda=\frac{1}{4}$.
$$\mathbb{P}\left[ \sum_{i=1}^d x_{i}^2 - d \geq t \right] \leq \exp(-\lambda t + 2d\lambda^2)=\exp\left( -\frac{t}{4}+\frac{d}{8} \right) \leq \exp\left( -\frac{t}{8} \right)$$
which is the desired one-sided bound.
$$\tag*{$\blacksquare$}$$
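The final bound can be checked empirically against simulated chi-squared tails (a sketch, not from the notes; assumes `numpy`, with $d$, the sample count, and the $t$ values chosen arbitrarily):

```python
import numpy as np

# Compare the simulated two-sided tail P[|sum x_i^2 - d| >= t] with the
# proven bound 2 * exp(-min(t^2 / d, t) / 8).
rng = np.random.default_rng(2)
d, n = 50, 200_000

dev = np.abs((rng.standard_normal((n, d)) ** 2).sum(axis=1) - d)

for t in (20.0, 40.0, 60.0):
    empirical = (dev >= t).mean()
    bound = 2.0 * np.exp(-min(t**2 / d, t) / 8.0)
    print(f"t={t:5.1f}  empirical={empirical:.5f}  bound={bound:.5f}")
    assert empirical <= bound   # the bound holds (with room to spare)
```

The bound is loose by design (the constant $\frac{1}{8}$ is not optimized), so the empirical tails sit well below it.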
Remarks
>[!note]
>The important part of the proof is finding some $\psi(\lambda) = {\cal O}(\lambda^2)$ for bounded $\lambda$. This means that the random variable is *subexponential* (a weaker version of being subgaussian).
This result is characteristic of sums of $d$ iid subexponential random variables. Sums of this type have "Gaussian tails" scaling as $\exp\left( -\frac{ct^2}{d} \right)$ up to a cutoff of $t \sim d$, after which they have "exponential tails" scaling as $\exp(-ct)$.
Bernstein's inequality is a general tool for expressing this type of behavior (see notes pg. 7 for more + reference).
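For reference, one standard sub-exponential form of Bernstein's inequality (stated here from the general literature, e.g. Vershynin's *High-Dimensional Probability*, not from the notes, so constants may differ from the version on pg. 7): if $s_{1}, \dots, s_{d}$ are iid mean-zero subexponential random variables with $\lVert s_{i} \rVert_{\psi_{1}} \leq K$, then there is an absolute constant $c > 0$ such that for all $t \geq 0$,
$$\mathbb{P}\left[ \left\lvert \sum_{i=1}^d s_{i} \right\rvert \geq t \right] \leq 2\exp\left( -c \min\left\{ \frac{t^2}{dK^2}, \frac{t}{K} \right\} \right)$$
The chi-squared bound proved above has exactly this shape, with $K$ a constant and $c = \frac{1}{8}$.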