class: center, middle, inverse, title-slide

# 06 Duality
### STAT 535A
### Daniel J. McDonald
### Last modified - 2021-01-27

---

## Last time

$$
\newcommand{\R}{\mathbb{R}}
\newcommand{\argmin}[1]{\underset{#1}{\textrm{argmin}}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\indicator}{1}
\renewcommand{\bar}{\overline}
\renewcommand{\hat}{\widehat}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\Set}[1]{\left\{#1\right\}}
\newcommand{\dom}[1]{\textrm{dom}\left(#1\right)}
\newcommand{\st}{\;\;\textrm{ s.t. }\;\;}
$$

$$
\min_x f(x) = g(x) + h(x)
$$

* `\(g(x)\)` is convex and differentiable
* `\(h(x)\)` is convex

Define the "proximal mapping/operator" of `\(h\)`:

$$
\textrm{prox}_{h,\gamma}(y) = \arg\min_z \frac{1}{2\gamma} \norm{ z - y }_2^2 + h(z)
$$

Prox GD:

1. `\(x^{g} = x^{(k)} - \gamma \nabla g(x^{(k)})\)`
2. `\(x^{(k+1)} = \textrm{prox}_{h,\gamma}(x^g)\)`

SGD:

* Approximate an average of gradients with an average of a sample of gradients

---

## Today

1. Lower bounding the Primal problem
2. The dual problem
3. Conjugate functions
4. Connecting conjugates to subgradients and proximal operators

---

## Duality

- Suppose we want to _lower bound_ our convex program
- Find `\(B\leq \min_x f(x),\quad x\in C\)`.

__Example:__

`$$\begin{aligned} \min_{x,y} &\quad x + y\\ \textrm{s.t.}&\quad x+y \geq 2\\ &\quad x,y\geq 0 \end{aligned}$$`

- What is `\(B\)`?

---

## Harder

__Example:__

`$$\begin{aligned} \min_{x,y} &\quad x + 3y\\ \textrm{s.t.}&\quad x+y \geq 2\\ &\quad x,y\geq 0 \end{aligned}$$`

- What is `\(B\)`?

--

**Solution:** rewrite the objective as

$$
`\begin{aligned} \min_{x,y} &\quad (x + y)+ 2y\\ \textrm{s.t.}&\quad x+y \geq 2\\ &\quad x,y\geq 0 \end{aligned}`
$$

- Since `\(x+y \geq 2\)` and `\(2y \geq 0\)`, we get `\(B = 2\)`.

---

## Harderer

__Example:__

$$
`\begin{aligned} \min_{x,y} &\quad px + qy\\ \textrm{s.t.}&\quad x+y \geq 2\\ &\quad x,y\geq 0 \end{aligned}`
$$

- What is `\(B\)`?

--

**Solution:** scale the constraints by any `\(a, b, c \geq 0\)`:

$$
`\begin{aligned} \min_{x,y} &\quad px + qy\\ \textrm{s.t.}&\quad ax+ay \geq 2a\\ &\quad bx \geq 0,\quad cy\geq 0 \end{aligned}`
$$

- Adding the constraints implies

$$ (a+b)x+(a+c)y\geq 2a$$

- Setting `\(p=(a+b)\)` and `\(q=(a+c)\)`, we get `\(B=2a\)`

---

## Better

- We can make this lower bound bigger by maximizing:

$$
`\begin{aligned} \max_{a,b,c} &\quad 2a\\ \textrm{s.t.}&\quad a+b = p\\ &\quad a+c=q\\ &\quad a,b,c \geq 0 \end{aligned}`
$$

- This is the __Dual__ of the __Primal__ LP

$$
`\begin{aligned} \min_{x,y} &\quad px + qy\\ \textrm{s.t.}&\quad x+y \geq 2\\ &\quad x,y\geq 0 \end{aligned}`
$$

- Note that the number of Dual variables (3) equals the number of Primal constraints

---

## What is this for?

1. Primal/Dual have complementary numbers of variables: the number of dual variables equals the number of primal constraints
2. Dual depends on the complementary (dual) norm
3. Dual has the same smoothness
4. Dual can shift linear transformations between terms
5. Dual will tell us additional properties of the solutions

---

## General LP

$$
`\begin{aligned} &\quad \textrm{(P)}\\ \min_x &\quad c^\top x\\ \textrm{s.t.} &\quad Ax=b\\ &\quad Gx\leq h \end{aligned}`
\quad\quad
`\begin{aligned} &\quad \textrm{(D)}\\ \max_{u,v} &\quad -b^\top u - h^\top v \\ \textrm{s.t.} &\quad -A^\top u -G^\top v=c\\ &\quad v \geq 0 \end{aligned}`
$$

__Exercise:__ derive (D) from (P).
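---

## Checking an LP dual numerically

A minimal sketch of the pair above, taking `\(p=1,\ q=3\)` from the "Harder" example; it assumes `scipy` is available, and the variable names are ours, not from the derivation.

```python
# Solve the small primal LP and its dual, and compare optimal values.
from scipy.optimize import linprog

p, q = 1.0, 3.0

# Primal: min px + qy  s.t.  x + y >= 2,  x, y >= 0.
# linprog expects A_ub @ x <= b_ub, so write x + y >= 2 as -x - y <= -2.
primal = linprog(c=[p, q], A_ub=[[-1.0, -1.0]], b_ub=[-2.0],
                 bounds=[(0, None), (0, None)])

# Dual: max 2a  s.t.  a + b = p,  a + c = q,  a, b, c >= 0.
# linprog minimizes, so use the negated objective -2a.
dual = linprog(c=[-2.0, 0.0, 0.0],
               A_eq=[[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]], b_eq=[p, q],
               bounds=[(0, None)] * 3)

print(primal.fun, -dual.fun)  # 2.0 2.0: the optima agree (strong duality)
```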
---

## Alternate derivation

$$
`\begin{aligned} \min_x &\quad c^\top x\\ \textrm{s.t.} &\quad Ax=b\\ &\quad Gx\leq h \end{aligned}`
$$

- Suppose that some `\(x\)` is feasible. Then, for that `\(x\)`,

$$
c^\top x \geq c^\top x + u^\top (Ax-b) + v^\top (Gx-h) =: L(x,u,v),
$$

as long as `\(v\geq 0\)` and `\(u\)` is anything.

- We call `\(L(x,u,v)\)` the __Lagrangian__. Now, suppose `\(C\)` is the feasible set, and `\(f^*\)` is the optimum. Then

$$
f^* \geq \min_{x\in C} L(x,u,v) \geq \min_x L(x,u,v) =: g(u,v)
$$

- We call `\(g(u,v)\)` the __Lagrange Dual Function__
- Maximizing `\(g(u,v)\)` over `\(u\)` and `\(v \geq 0\)` is the Dual problem.

---

## Weak duality

Consider the generic (primal) convex program

$$
`\begin{aligned} \min_x &\quad f(x)\\ \textrm{s.t.} &\quad Ax=b\\ &\quad h_i(x)\leq 0 \end{aligned}`
$$

- For feasible `\(x\)` and `\(v\geq 0\)`,

$$
f(x) \geq f(x) + u^\top (Ax-b) + v^\top h(x) \geq \min_x L(x,u,v) = g(u,v).
$$

- Therefore,

`$$f^* \geq \max_{u,\, v\geq 0} g(u,v) = g^*.$$`

- This is __weak duality__.
- Note that the Dual Program is always convex, even if **P** is not: `\(g\)` is a pointwise minimum of affine functions of `\((u,v)\)`, hence concave, and we maximize it.

---

## Strong duality

`$$f^* = g^*$$`

- Sufficient conditions for strong duality: __Slater's conditions__
- If __P__ is convex and there exists `\(x\)` such that for all `\(i\)`, `\(h_i(x) < 0\)` (strictly feasible), then we have strong duality. (Refinement: strict inequality is needed only for those `\(i\)` where `\(h_i\)` is not affine.)
- Sufficient conditions for strong duality of an LP: strong duality holds if either __P__ or __D__ is feasible. (Dual of __D__ = __P__)

---

## Example: Dual for the Support Vector Machine

$$
`\begin{aligned} &\quad \textrm{(P)}\\ \min_{\xi,\beta,\beta_0} &\quad \frac{1}{2}\norm{\beta}_2^2 +C\indicator^\top\xi\\ \textrm{s.t.} &\quad \xi_i \geq 0\\ &\quad y_i(x_i^\top \beta+\beta_0)\geq 1-\xi_i \end{aligned}`
\quad\quad\quad
`\begin{aligned} &\quad \textrm{(D)}\\ \max_{w} &\quad -\frac{1}{2}w^\top\tilde{X}^\top\tilde{X}w +\indicator^\top w\\ \textrm{s.t.} &\quad 0\leq w \leq C\indicator\\ &\quad w^\top y = 0 \end{aligned}`
$$

where `\(\tilde{X} = \textrm{diag}(y)X\)`.

- `\(w=0\)` is Dual feasible. (We don't need strict feasibility, because the constraints are affine.)
- We have strong duality by Slater's conditions.

---

## The Conjugate Function

Given `\(f: \mathbb{R}^n \rightarrow \mathbb{R}\)`, define `\(f^*: \mathbb{R}^n \rightarrow \mathbb{R}\)` by

`$$f^*(y) = \max\limits_x y^\top x - f(x)$$`

**Note:** The conjugate function is always convex: it is the "largest gap between the line `\(y^\top x\)` and `\(f(x)\)`", a pointwise maximum of affine functions of `\(y\)`.

If `\(f\)` is differentiable, this is called the "Legendre transform" of `\(f\)`.

---

## Properties

- Fenchel's inequality: `\(\forall x, y\)`,

`$$f(x) + f^*(y) \ge x^\top y$$`

- `\(\Longrightarrow f^{**} \le f\)`
- If `\(f\)` is closed and convex, then `\(f^{**} = f\)`
- If `\(f\)` is closed and convex, then `\(\forall x, y\)`,

`$$x \in \partial f^*(y) \Longleftrightarrow y \in \partial f(x)\Longleftrightarrow f(x) + f^*(y) = x^\top y$$`

- If `\(f(u,v) = f_1(u) + f_2(v)\)`, then `\(f^*(w,z) = f^*_1(w) + f^*_2(z)\)`
- If `\(f(x) = a g(x) + b\)` with `\(a > 0\)`, then `\(f^*(y) = a g^*\left(\frac{y}{a}\right) - b\)`
- If `\(A\in \mathbb{R}^{n\times n}\)` is non-singular and `\(f(x) = g(Ax+b)\)`, then `\(f^*(y) = g^*(A^{-\top} y) - b^\top A^{-\top} y\)`

---

## Examples

* `\(f(x) = \frac{1}{2} x^\top Q x\)` with `\(Q \succ 0\)`. We want `\(\max\limits_x y^\top x - \frac{1}{2} x^\top Q x\)`.

  Take the partial derivative and set it to 0:

`$$y - Qx = 0 \Longrightarrow x^* = Q^{-1} y \Longrightarrow f^*(y) = \frac{1}{2} y^\top Q^{-1} y$$`

* If `\(f(x) = I_C(x) = \begin{cases}0 &x\in C,\\\infty &\textrm{else,}\end{cases}\)` the conjugate is the __support function__ `\(f^*(y) = \max\limits_{x\in C} y^\top x\)`
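---

## Checking a conjugate numerically

A quick sketch verifying the quadratic example on the previous slide; it assumes `numpy` and `scipy` are available, and `Q` and `y` are arbitrary test values of our choosing.

```python
# For f(x) = 1/2 x'Qx with Q > 0, check that max_x y'x - f(x) = 1/2 y'Q^{-1}y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(535)
A = rng.normal(size=(3, 3))
Q = A @ A.T + 3.0 * np.eye(3)   # a random positive definite matrix
y = rng.normal(size=3)

# f*(y) by direct maximization of y'x - f(x) (minimize the negative).
neg_gap = lambda x: -(y @ x - 0.5 * x @ Q @ x)
fstar_numeric = -minimize(neg_gap, x0=np.zeros(3)).fun

fstar_closed_form = 0.5 * y @ np.linalg.solve(Q, y)   # 1/2 y'Q^{-1}y
print(np.isclose(fstar_numeric, fstar_closed_form))   # True
```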
---

## Conjugates for norms

* __General norms.__ Given a norm `\(\norm{\cdot}\)`, its __dual norm__ is defined by `\(\norm{y}_* = \max_{\norm{x} \leq 1} y^\top x\)`.
* The dual norm (not the conjugate) of an `\(L_p\)` norm `\(\norm{\cdot}_p\)` is `\(\norm{\cdot}_q\)` with `\(q^{-1} + p^{-1} = 1\)` (Hölder conjugates)
* The conjugate of a norm `\(f(x) = \norm{x}\)` is `\(f^*(y) = \begin{cases} 0 & \norm{y}_* \leq 1\\ \infty & \textrm{else}\end{cases}\)`
* The conjugate of a squared norm `\(f(x) = \frac{1}{2} \| x \|^2\)` is

`$$f^*(y) = \frac{1}{2} \| y \|_*^2$$`

---

## Conjugates can help us find the subgradient

I want the subgradient of `\(f(x) = \norm{x}_1\)` at `\(x=0\)`.

Fenchel's inequality: `\(f(x) + f^*(z) \geq z^\top x\)`

Subgradient definition: a vector `\(g\)` is a subgradient of `\(f\)` at `\(x\)` if `\(f(y) \geq f(x) + g^\top (y-x)\)` for all `\(y\)`

But `\(g \in \partial f(x) \Longrightarrow g^\top x = f(x) + f^*(g)\)`. Plug in `\(f\)` and `\(x=0\)`:

`\(\Longrightarrow f^*(g) = 0\)` for all `\(g \in \partial \norm{0}_1\)`

The dual norm of `\(\norm{\cdot}_1\)` is `\(\norm{\cdot}_\infty\)`, so `\(f^*(g) = \begin{cases} 0 & \max_i{|g_i|} \leq 1\\ \infty & \textrm{else}\end{cases}\)`

Therefore `\(\partial \norm{0}_1\)` is the set of vectors `\(g\)` with `\(\max_i |g_i| \leq 1\)`

---

## Conjugates can help us find the dual problem

`\(\min_x f(x) + h(x) \Longleftrightarrow \min_{x,z} f(x) + h(z)\)` subject to `\(z=x\)`.

Apply the definition (via the Lagrangian) to find the dual function:

`$$g(u) = \min_{x,z}\; f(x) + h(z) + u^\top(z-x) = -f^*(u) - h^*(-u)$$`

So the dual problem is

`$$\max_u\; -f^*(u) - h^*(-u)$$`

Primal: `\(\min_x f(x) + \norm{x}\)`

Dual: `\(\max_u -f^*(u) \quad\mbox{subject to}\quad \norm{u}_* \leq 1\)`

--

**Moreau decomposition:** For any `\(x \in \mathcal{X}\)` (taking `\(\gamma = 1\)`),

`$$x = \textrm{prox}_{f}(x) + \textrm{prox}_{f^*}(x)$$`

---

## Duality and the group lasso

0. Fix a group.
1. Let `\(f(x) = \norm{x}_2\)`. We want to find the proximal operator of this.
2. Argue that `\(f^*(x) = I_{\mathcal{B}}(x)\)` where `\(\mathcal{B} = \{u : \norm{u}_2 \leq 1 \}\)`.
3. Find the proximal operator for `\(f^*\)` (projection onto `\(\mathcal{B}\)`).
4. Make some adjustments for `\(\lambda\)` using the properties above.
5. Use the Moreau decomposition.
6. Done. (See the sketch on the next slide.)
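---

## The group lasso prox in code

A sketch of the recipe on the previous slide, assuming `numpy`; `project_l2_ball` and `prox_group_norm` are our names for the two pieces, applied to a single group.

```python
import numpy as np

def project_l2_ball(x, radius):
    """Projection onto {u : ||u||_2 <= radius}: the prox of the indicator
    I_B, which is the conjugate of radius * ||.||_2 (steps 2-4)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def prox_group_norm(x, lam):
    """prox of lam * ||.||_2 via Moreau: x = prox_f(x) + prox_{f*}(x) (step 5)."""
    return x - project_l2_ball(x, lam)  # block soft-thresholding

x = np.array([3.0, 4.0])            # one group's coefficients; ||x||_2 = 5
print(prox_group_norm(x, lam=1.0))  # (1 - 1/5) * x = [2.4, 3.2]
print(prox_group_norm(x, lam=6.0))  # ||x||_2 <= lam: the group is zeroed out
```

Note that `prox_group_norm` never touches `\(f\)` directly: the shrinkage toward zero falls out of projecting onto the dual-norm ball, exactly as the Moreau decomposition promises.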