## Class Notes

### Summary

- Non-commutative Khintchine inequality
- Matrix Bernstein
- Matrix Chernoff

## 8. Matrix Concentrations
Recall from last time: we had symmetric $A_{1},\dots,A_{n} \in \mathbb{R}^{d\times d}$ and the (centered) Gaussian series $$X = \sum_{i} g_{i}A_{i}, \qquad g_{i} \overset{\text{iid}}{\sim} {\cal N}(0,1).$$ Call $V := \mathbb{E}_{}\left[X^2\right] = \sum_{i}A_{i}^2$ and let $\sigma(X) := \lvert \lvert V \rvert \rvert ^{1/2}$.
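As a warm-up, a minimal Monte Carlo sketch (my own, not from the lecture; the dimensions and the random symmetric $A_i$ are assumptions) checking that $\mathbb{E}[X^2] = \sum_i A_i^2$ for such a series:

```python
# Hypothetical sanity check: build X = sum_i g_i A_i from random symmetric A_i
# and confirm E[X^2] ~= V = sum_i A_i^2 by Monte Carlo averaging.
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 20, 30, 2000
As = [(M + M.T) / 2 for M in rng.standard_normal((n, d, d))]
V = sum(A @ A for A in As)

EX2 = np.zeros((d, d))
for _ in range(trials):
    g = rng.standard_normal(n)
    X = sum(gi * A for gi, A in zip(g, As))
    EX2 += X @ X / trials

print(np.linalg.norm(EX2 - V, 2) / np.linalg.norm(V, 2))  # small relative error
print(np.linalg.norm(V, 2) ** 0.5)                        # sigma(X) = ||V||^{1/2}
```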
Theorem (Non-commutative Khintchine)
$$\mathbb{E}_{}\left[\lvert \lvert X \rvert \rvert \right] \lesssim \sqrt{ \log d }\,\sigma(X)$$
Lemma
$$\mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})\right] \leq (2k-1)^k\, \mathrm{Tr}(V^k)$$
Proof of Theorem
$$\begin{align} \mathbb{E}_{}\left[\lvert \lvert X \rvert \rvert \right] &\leq \mathbb{E}_{}\left[(\mathrm{Tr}(X^{2k}))^{1/2k}\right] \\ &\overset{\text{Jensen}}\leq \left(\mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})\right]\right)^{1/2k} \\ &\overset{\text{lemma}}\leq \sqrt{ 2k }\, \mathrm{Tr}(V^k)^{1/2k} \\ &\overset{\mathrm{Tr}(V^k) \leq d \lvert \lvert V \rvert \rvert ^k}\leq \underbrace{ \sqrt{ 2k } \cdot d^{1/2k} }_{ \asymp \sqrt{ \log d } \text{ for } k \asymp \log d } \cdot \underbrace{ \lvert \lvert V \rvert \rvert ^{1/2} }_{ \sigma(X) } \end{align}$$
Proof of Lemma
$$\begin{align} T=\mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})\right] &= \mathbb{E}_{g_{i}}\left[\mathrm{Tr}\left( \sum_{i} g_{i} A_{i} \right)^{2k}\right] \\ &= \sum_{i} \mathbb{E}_{g_{i}}\left[g_{i}\, \mathrm{Tr}\left( A_{i}\left( \sum_{j}g_{j}A_{j} \right)^{2k-1} \right)\right] \\ &\overset{\text{integration by parts}}= \sum_{i} \mathbb{E}_{}\left[\sum_{\ell=0}^{2k-2}\mathrm{Tr}\left( A_{i}\left( \underbrace{ \sum_{j}g_{j}A_{j} }_{ X } \right)^{\ell}A_{i}\left( \underbrace{ \sum_{j}g_{j}A_{j} }_{ X } \right)^{2k-2-\ell} \right)\right] \\ \text{prop} \implies T &\leq \sum_{i} \sum_{\ell=0}^{2k-2} \mathbb{E}_{}\left[\mathrm{Tr}(A_{i}^2 X^{2k-2})\right] \\ &=(2k-1)\, \mathbb{E}_{}\left[\mathrm{Tr}(VX^{2k-2})\right] \\ &= (2k-1)\, \mathbb{E}_{}\left[\langle V, X^{2k-2} \rangle \right] \end{align}$$
Prop
Suppose $A,B \in \mathbb{R}^{d\times d}$ are [[hermitian|symmetric]]. Then $$0 \leq s \leq 2t \implies \mathrm{Tr}(AB^sAB^{2t-s})\leq \mathrm{Tr}(A^2B^{2t})$$ with equality when $A$ and $B$ commute.
Proof
$$\begin{align} B &= \sum_{i}^d \lambda_{i} v_{i}v_{i}^{\intercal} \\ \implies \mathrm{Tr}(AB^sAB^{2t-s})&= \sum_{i,j} \lambda_{i}^s \lambda_{j}^{2t-s}\, \mathrm{Tr}\left( Av_{i}v_{i}^{\intercal} Av_{j}v_{j}^{\intercal}\right) \\ &\leq \sum_{i,j} \lvert \lambda_{i} \rvert ^s \lvert \lambda_{j} \rvert ^{2t-s} \left( v_{i}^{\intercal}Av_{j} \right)^2 \\ &= \sum_{i,j} \left( v_{i}^{\intercal}Av_{j} \right)^2 \lvert \lambda_{j} \rvert ^{2t} \left( \frac{\lvert \lambda_{i} \rvert }{\lvert \lambda_{j} \rvert } \right)^s \end{align}$$
For fixed $i,j$ this is a convex function of $s \in [0, 2t]$, so it is maximized at the endpoints $s \in \{ 0, 2t \}$; by symmetry of the sum in $i,j$ both endpoints give the same value, so
$$\begin{align} &\leq \sum_{i,j} \lambda_{j}^{2t}\left( v_{i}^{\intercal}Av_{j} \right)^2 = \mathrm{Tr}(A^2 B^{2t}) \end{align}$$
$$\tag*{$\blacksquare$}$$
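A quick numerical check of this Prop (my own sketch; the random symmetric $A,B$ and integer exponents are assumptions):

```python
# Hedged sanity check of Tr(A B^s A B^{2t-s}) <= Tr(A^2 B^{2t})
# for random symmetric A, B and integer 0 <= s <= 2t.
import numpy as np

rng = np.random.default_rng(1)
d, t = 8, 3
mpow = np.linalg.matrix_power
for _ in range(100):
    A = rng.standard_normal((d, d)); A = (A + A.T) / 2
    B = rng.standard_normal((d, d)); B = (B + B.T) / 2
    rhs = np.trace(A @ A @ mpow(B, 2 * t))
    for s in range(2 * t + 1):
        lhs = np.trace(A @ mpow(B, s) @ A @ mpow(B, 2 * t - s))
        assert lhs <= rhs + 1e-6 * (1 + abs(rhs))  # equality at s in {0, 2t}
```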
---
Prop (Von Neumann inequality)
$$\langle A,B \rangle \leq \langle \lambda(A), \lambda(B) \rangle = \sum_{i}\lambda_{i}(A) \lambda_{i}(B)$$
Proof
Note $\lvert \lvert A \rvert \rvert_{F}^2 = \lvert \lvert \lambda(A) \rvert \rvert_{2}^2$. Thus
$$\begin{align} \lvert \lvert \lambda(A) - \lambda(B) \rvert \rvert _{2}^2 & \overset{(*)}\leq \lvert \lvert A-B \rvert \rvert _{F}^2 \\ \implies \cancel{ \lvert \lvert A \rvert \rvert _{F}^2 } + \cancel{ \lvert \lvert B \rvert \rvert _{F}^2 } - 2\langle \lambda(A), \lambda(B) \rangle &\leq \cancel{ \lvert \lvert A \rvert \rvert _{F}^2 } + \cancel{ \lvert \lvert B \rvert \rvert _{F}^2 } - 2 \langle A,B \rangle \\ \implies \langle A,B \rangle &\leq \langle \lambda(A), \lambda(B) \rangle \end{align}$$
$$\tag*{$\blacksquare$}$$
---
Lemma
$$\mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})\right] \leq (2k-1)^k\, \mathrm{Tr}(V^k)$$
Proof
Recall $T = \mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})\right]$, and note $\mathrm{Tr}(X^{2k})=\sum_{i}\lambda_{i}(X)^{2k} = \lvert \lvert \lambda(X) \rvert \rvert_{2k}^{2k}$. Continuing from the integration-by-parts bound above,
$$\begin{align} T &\overset{\text{V.N.}}\leq (2k-1)\,\mathbb{E}_{}\left[\langle \lambda(V), \underbrace{ \lambda(X^{2k-2}) }_{ (\lambda_{1}^{2k-2},\dots,\lambda_{d}^{2k-2}) } \rangle \right] \\ &\overset{\text{Hölder}}\leq (2k-1)\,\underbrace{ \mathbb{E}_{}\left[\lvert \lvert \lambda(X^{2k-2}) \rvert \rvert _{\frac{k}{k-1}}\right] }_{ \mathbb{E}\left[\left( \sum_{i}\lambda_{i}(X)^{(2k-2)\cdot\frac{k}{k-1}} \right)^{(k-1)/k}\right] = \mathbb{E}\left[(\mathrm{Tr}(X^{2k}))^{(k-1)/k}\right] }\, \underbrace{ \lvert \lvert \lambda(V) \rvert \rvert _{k} }_{ \left( \sum_{i}\lambda_{i}(V)^k \right)^{1/k} =(\mathrm{Tr}(V^k))^{1/k}} \\ \implies T &\leq (2k-1)\cdot(\mathrm{Tr}(V^k))^{1/k} \cdot\mathbb{E}_{}\left[\mathrm{Tr}(X^{2k})^{(k-1)/k}\right] \\ &\overset{\text{Jensen}}\leq (2k-1)\, (\mathrm{Tr}(V^k))^{1/k}\, T^{(k-1)/k} \\ \implies T &\leq (2k-1)^k\, \mathrm{Tr}(V^k) \end{align}$$
$$\tag*{$\blacksquare$}$$
---
### Trick 1: Gaussian $\to \pm 1$ (Rademacher variables)

Lemma
$$\mathbb{E}_{z_{i} \sim \text{Unif}\{ \pm 1 \}}\left[\left\lvert \left\lvert \sum_{i} z_{i} A_{i} \right\rvert \right\rvert \right] \leq \sqrt{ \frac{\pi}{2} }\, \mathbb{E}_{g_{i} \sim {\cal N}(0, 1)}\left[ \left\lvert \left\lvert \sum_{i}g_{i}A_{i} \right\rvert \right\rvert \right]$$

Corollary ($\pm 1$ NCK)
$$\mathbb{E}_{}\left[\left\lvert \left\lvert \underbrace{ \sum_{i} z_{i}A_{i} }_{ X } \right\rvert \right\rvert \right] \lesssim \sqrt{ \log d }\,\sigma(X)$$

Proof of Lemma
$$\begin{align} \text{Law}(g_{i}) &= \text{Law}(z_{i}\cdot\lvert g_{i} \rvert ) \\ \mathbb{E}_{}\left[\lvert g_{i} \rvert \right] &= \sqrt{ \frac{2}{\pi} } \\ \implies \mathbb{E}_{g_{i}}\left[\left\lvert \left\lvert \sum_{i} g_{i}A_{i} \right\rvert \right\rvert \right] &= \mathbb{E}_{z_{i}, g_{i}}\left[\left\lvert \left\lvert \sum_{i}z_{i} \lvert g_{i} \rvert A_{i} \right\rvert \right\rvert \right] \\ &\overset{\text{Jensen}} \geq \mathbb{E}_{z_{i}}\left[\left\lvert \left\lvert \mathbb{E}_{g_{i}}\left[\sum_{i} \lvert g_{i} \rvert z_{i} A_{i}\right] \right\rvert \right\rvert \right] \\ &= \sqrt{ \frac{2}{\pi} }\, \mathbb{E}_{z_{i}}\left[\left\lvert \left\lvert \sum_{i}z_{i}A_{i} \right\rvert \right\rvert \right] \end{align}$$

### Trick 2: Symmetrization

Lemma
Suppose $H_{i} \in \mathbb{R}^{d\times d}_{\text{sym}}$ are arbitrary independent random matrices. Then $$\mathbb{E}_{H_{i}}\left[\left\lvert \left\lvert \sum_{i}(H_{i} - \mathbb{E}_{}\left[H_{i}\right] ) \right\rvert \right\rvert \right] \leq 2\, \mathbb{E}_{z_{i} \sim \text{Unif}\{ \pm 1 \},\, H_{i}}\left[\left\lvert \left\lvert \sum_{i} z_{i} H_{i} \right\rvert \right\rvert \right]$$

Corollary
$$\mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i}(H_{i} - \mathbb{E}_{}\left[H_{i}\right] ) \right\rvert \right\rvert \right] \lesssim \sqrt{ \log d }\cdot \underbrace{ \mathbb{E}_{H_{i}}\left[\left\lvert \left\lvert \sum_{i}H_{i}^2 \right\rvert \right\rvert ^{1/2}\right] }_{ (*) }$$

Note that $(*)$ is still hard to compute.
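To see the $\pm 1$ NCK corollary numerically, a hedged Monte Carlo sketch (the choice of random $A_i$ is mine):

```python
# Compare E||sum_i z_i A_i|| against sqrt(log d) * sigma(X)
# for Rademacher signs z_i; the bound holds up to a universal constant.
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 50, 40, 500
As = [(M + M.T) / 2 for M in rng.standard_normal((n, d, d))]
sigma = np.linalg.norm(sum(A @ A for A in As), 2) ** 0.5

norms = []
for _ in range(trials):
    z = rng.choice([-1.0, 1.0], size=n)
    norms.append(np.linalg.norm(sum(zi * A for zi, A in zip(z, As)), 2))

print(np.mean(norms), np.sqrt(np.log(d)) * sigma)  # LHS <~ constant * RHS
```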
Proof
Use the above Lemma and the $\pm 1$ NCK conditioned on fixed $H_{i}$.

Proof of Lemma
Let $H_{i}'$ be independent copies of the $H_{i}$.
$$\begin{align} \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i} (H_{i} - \underbrace{ \mathbb{E}_{}\left[H_{i}\right] }_{ =\mathbb{E}_{}\left[H_{i}'\right] } ) \right\rvert \right\rvert \right] &= \mathbb{E}_{H_{i}}\left[\left\lvert \left\lvert \mathbb{E}_{H_{i}'}\left[\sum_{i}(H_{i}-H_{i}')\right] \right\rvert \right\rvert \right] \\ &\overset{\text{Jensen}}\leq \mathbb{E}_{H_{i},H_{i}'}\left[\left\lvert \left\lvert \sum_{i} (H_{i}-H_{i}') \right\rvert \right\rvert \right] \\ &\overset{(^{\dagger})}= \mathbb{E}_{z_{i},H_{i},H_{i}'}\left[\left\lvert \left\lvert \sum_{i} z_{i}(H_{i}-H_{i}') \right\rvert \right\rvert \right] \\ &\leq \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i} z_{i} H_{i} \right\rvert \right\rvert \right] + \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i} z_{i}H_{i}' \right\rvert \right\rvert \right] = 2\,\mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i}z_{i} H_{i} \right\rvert \right\rvert \right] \end{align}$$
Where $(^{\dagger})$ is because the distribution of $H_{i}-H_{i}'$ is symmetric, since they are iid copies.

> [!references] References for this type of thing / trick
>
> Probability in Banach Spaces (book)
> - Older probability books that don't use measure theory as a basis
> - e.g. Feller
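A small numerical illustration of the symmetrization lemma (my own example, not from the lecture: I take $H_i = x_i x_i^{\intercal}$ with standard Gaussian $x_i$, so $\mathbb{E}[H_i] = I$):

```python
# Check E||sum_i (H_i - E H_i)|| <= 2 E||sum_i z_i H_i|| by Monte Carlo,
# for H_i = x_i x_i^T with x_i ~ N(0, I_d), so E[H_i] = I.
import numpy as np

rng = np.random.default_rng(3)
d, n, trials = 20, 100, 300
lhs = rhs = 0.0
for _ in range(trials):
    xs = rng.standard_normal((n, d))
    Hs = [np.outer(x, x) for x in xs]
    z = rng.choice([-1.0, 1.0], size=n)
    lhs += np.linalg.norm(sum(Hs) - n * np.eye(d), 2) / trials
    rhs += 2 * np.linalg.norm(sum(zi * H for zi, H in zip(z, Hs)), 2) / trials
print(lhs, rhs)  # lhs should come out below rhs
```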
Theorem (Matrix Chernoff Bound)
For absolute constants $c_{1},c_{2} >0$, any $\varepsilon >0$, and $H_{i} \in \mathbb{R}^{d\times d}_{\text{sym}}$ independent with $H_{i} \succeq 0$ [[almost everywhere|almost surely]],
$$\begin{align} \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i}H_{i} \right\rvert \right\rvert \right] &\leq \left[ \underbrace{ \left\lvert \left\lvert \sum_{i} \mathbb{E}_{}\left[H_{i}\right] \right\rvert \right\rvert ^{1/2} }_{ \text{"easy"} } + c_{1} \sqrt{ \log d }\cdot \left(\mathbb{E}_{}\left[\max_{i} \lvert \lvert H_{i} \rvert \rvert \right]\right)^{1/2} \right]^{2} \\ &\overset{(*)}\leq (1+\varepsilon) \left\lvert \left\lvert \sum_{i}\mathbb{E}_{}\left[H_{i}\right] \right\rvert \right\rvert + \left( 1+\frac{1}{\varepsilon} \right)c_{2}\log d \cdot\underbrace{ \mathbb{E}_{}\left[\max_{i}\lvert \lvert H_{i} \rvert \rvert \right] }_{ K } \end{align}$$
Where we get $(*)$ via a weighted [[arithmetic-geometric mean inequality]]: $2xy = 2(\sqrt{ \varepsilon }\,x)\left( \frac{1}{\sqrt{ \varepsilon }}y \right) \leq \varepsilon x^2 + \frac{1}{\varepsilon}y^2$, so $(x+y)^2 \leq (1+\varepsilon)x^2 + \left( 1+\frac{1}{\varepsilon} \right)y^2$.

Often,
1. $\lvert \lvert H_{i} \rvert \rvert \leq K$ almost surely, and
2. the second term is $\ll$ the first for $\varepsilon = {\cal O}(1)$.

Together,
$$\begin{align} \implies \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i} H_{i} \right\rvert \right\rvert \right] &\leq (1 + o(1))\cdot \left\lvert \left\lvert \sum_{i} \mathbb{E}_{}\left[H_{i}\right] \right\rvert \right\rvert \\ &\lesssim \left\lvert \left\lvert \sum_{i}\mathbb{E}_{}\left[H_{i}\right] \right\rvert \right\rvert + K \log d \end{align}$$

>[!proof]
> Let $E:= \mathbb{E}_{}\left[\left\lvert \left\lvert \sum _{i }H_{i} \right\rvert \right\rvert\right]$. Then note that $$\begin{align} E & \leq \underbrace{ \left\lvert \left\lvert \sum_{i} \mathbb{E}_{}\left[H_{i}\right] \right\rvert \right\rvert }_{ A } + \mathbb{E}_{}\left[\left\lvert \left\lvert \underbrace{ \sum_{i}(H_{i}- \mathbb{E}_{}\left[H_{i}\right] ) }_{ B } \right\rvert \right\rvert \right] \\ \mathbb{E}_{}\left[\lvert \lvert B \rvert \rvert \right] &\overset{(*)}\leq C \sqrt{ \log d }\cdot \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i}\underbrace{ H_{i}^2 }_{ H_{i}^{1/2}H_{i}H_{i}^{1/2} \preceq (\max_{i}\lvert \lvert H_{i} \rvert \rvert ) H_{i} } \right\rvert \right\rvert ^{1/2}\right] \\ &\leq C \sqrt{ \log d }\cdot \mathbb{E}_{}\left[\left( \max_{i} \lvert \lvert H_{i} \rvert \rvert \right)^{1/2} \left\lvert \left\lvert \sum_{i} H_{i} \right\rvert \right\rvert^{1/2}\right] \\ &\overset{(**)}\leq C \sqrt{ \log d } \cdot \underbrace{ \left( \mathbb{E}_{}\left[\max_{i}\lvert \lvert H_{i} \rvert \rvert \right] \right)^{1/2} }_{ M }\left( \underbrace{ \mathbb{E}_{}\left[\left\lvert \left\lvert \sum_{i} H_{i} \right\rvert \right\rvert \right] }_{ E }\right)^{1/2} \\ \implies E &\leq A + CM\sqrt{ E } \quad (\text{absorbing } \sqrt{ \log d } \text{ into } C) \\ \implies (\sqrt{ E })^2 - CM\sqrt{ E } - A &\leq 0 \\ \implies \sqrt{ E } &\leq \text{larger root of } x^2 - CMx -A=0 \\ &= \frac{1}{2} \left(CM + \sqrt{ C^2 M^2 + 4A }\right) \\ &\overset{\sqrt{ x+y } \leq \sqrt{ x } + \sqrt{ y }}\leq \frac{1}{2}(CM + CM + 2\sqrt{ A }) \\ &= CM + \sqrt{ A } \\ \implies E &\leq (\sqrt{ A } + CM)^2 \end{align}$$
> $$\tag*{$\blacksquare$}$$
> Where $(*)$ is by the Corollary of Trick 2 (symmetrization + $\pm 1$ NCK), and $(**)$ is by [[cauchy-schwarz theorem|Cauchy-Schwarz]]
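Finally, a hedged numerical illustration of the Matrix Chernoff bound (my own rank-one example, not from the lecture: $H_i = x_i x_i^{\intercal}$ with $x_i$ uniform on the unit sphere, so $K = 1$ and $\left\lvert \left\lvert \sum_i \mathbb{E}[H_i] \right\rvert \right\rvert = n/d$):

```python
# H_i = x_i x_i^T with unit vectors x_i, so ||H_i|| = 1 =: K and
# sum_i E[H_i] = (n/d) I; compare E||sum_i H_i|| with n/d + K log d.
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 30, 300, 200
norms = []
for _ in range(trials):
    xs = rng.standard_normal((n, d))
    xs /= np.linalg.norm(xs, axis=1, keepdims=True)  # x_i uniform on the sphere
    norms.append(np.linalg.norm(xs.T @ xs, 2))       # ||sum_i x_i x_i^T||
print(np.mean(norms), n / d + np.log(d))  # LHS <~ constant * RHS
```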