TU Delft & University of Pisa
June 23, 2025
Consider a parametrized convex optimization problem \[ \begin{aligned} & \min_x \quad && f(x, \theta) \\ & \text{s.t.} && g(x, \theta) \leq 0 \\ &&& h(x, \theta) = 0 \end{aligned} \]
where \(f\) and \(g\) are convex functions and \(h\) is an affine function.
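For concreteness, a minimal parametric instance (used purely as a running illustration in what follows; the data \(Q, G, d, A, b\) are placeholders) is the quadratic program \[ \min_x \;\; \tfrac{1}{2} x^T Q x + \theta^T x \quad \text{s.t.} \quad G x \leq d, \quad A x = b, \] with \(Q \succeq 0\), so that \(f(x, \theta) = \tfrac{1}{2} x^T Q x + \theta^T x\) is convex, \(g(x, \theta) = G x - d\) and \(h(x, \theta) = A x - b\) are affine, and the parameter \(\theta\) enters only the linear term of the objective.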
The Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers to problems with inequality constraints. They are necessary conditions for optimality; for convex problems satisfying a constraint qualification (e.g. Slater's condition), they are also sufficient.
The KKT conditions are:
\[ \begin{aligned} g(x^*) &\leq 0, \qquad &\text{(Primal feasibility)} \\ h(x^*) &= 0, &\text{(Primal feasibility)} \\ \lambda^* &\geq 0, &\text{(Dual feasibility)} \\ \lambda^* \circ g(x^*) &= 0, &\text{(Complementary slackness)} \\ \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) &= 0 &\text{(Stationarity)} \end{aligned} \]
where \(\lambda^*\) and \(\nu^*\) are the Lagrange multipliers for the inequality and equality constraints, respectively, and \(\circ\) denotes the element-wise (Hadamard) product.
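For the illustrative QP above, these conditions read \[ Q x^* + \theta + G^T \lambda^* + A^T \nu^* = 0, \qquad G x^* \leq d, \qquad A x^* = b, \qquad \lambda^* \geq 0, \qquad \lambda^* \circ (G x^* - d) = 0. \]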
In general, the optimal solution is not available in closed form, so the gradient of the optimal solution with respect to the parameters \(\theta\) cannot be read off directly. It can, however, be obtained by implicitly differentiating the KKT conditions.
At the solution, we can view the solver as finding a root of the nonlinear system formed by the stationarity, complementary slackness, and equality-feasibility conditions:
\[ G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) \\ \lambda^* \circ g(x^*) \\ h(x^*) \end{bmatrix} = 0 \]
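As a sketch in Python/NumPy, for the illustrative QP (the function name and signature are hypothetical), the residual \(G\) can be written as:

```python
import numpy as np

def kkt_residual(x, lam, nu, theta, Q, G, d, A, b):
    """Residual G(x, lam, nu, theta) for the illustrative QP
    min 0.5 x^T Q x + theta^T x  s.t.  G x <= d,  A x = b."""
    stationarity = Q @ x + theta + G.T @ lam + A.T @ nu   # grad f + grad g^T lam + grad h^T nu
    comp_slack = lam * (G @ x - d)                        # lambda o g(x)
    primal_eq = A @ x - b                                 # h(x)
    return np.concatenate([stationarity, comp_slack, primal_eq])
```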
Writing \((x^*, \lambda^*, \nu^*)(\theta)\) for the primal-dual optimal solution as a function of \(\theta\), we have
\[ \begin{split} & \partial_{\theta} G(x^\star(\theta), \lambda^\star(\theta), \nu^\star(\theta), \theta) = 0 \\ \Longrightarrow \;\; & \partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) + \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta) = 0 \\ \Longrightarrow \;\; & \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) = -\left (\partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \right)^{-1} \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta). \end{split} \]
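For the illustrative QP, \(\theta\) enters \(G\) only through the stationarity row (via the linear term \(\theta^T x\)), so \[ \partial_\theta G(x^\star, \lambda^\star, \nu^\star, \theta) = \begin{bmatrix} I \\ 0 \\ 0 \end{bmatrix}, \] and the sensitivities follow from a single linear solve with the Jacobian \(\partial_{x, \lambda, \nu} G\), given below.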
The Jacobian of \(G\) with respect to the primal-dual variables, which must be inverted (or factorized) in the expression above, is
\[ \partial_{x, \lambda, \nu} G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla^2_x f + (\lambda^*)^T \nabla^2_x g & \partial_x g & \partial_x h \\ \operatorname{diag}(\lambda^*)\, \partial_x g^T & \operatorname{diag}(g(x^*)) & 0 \\ \partial_x h^T & 0 & 0 \end{bmatrix} \]
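A minimal end-to-end sketch for the illustrative QP, assuming CVXPY is available for the forward solve; all problem data and variable names below are made up for demonstration:

```python
import numpy as np
import cvxpy as cp

# Illustrative (made-up) data: n = 2 variables, m = 2 inequalities, p = 1 equality.
n, m, p = 2, 2, 1
Q = np.array([[2.0, 0.5], [0.5, 1.0]])      # positive definite
G = np.eye(2)
d = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])
b = np.array([0.5])
theta = np.array([0.3, -0.7])

# Forward pass: solve the QP and read off the primal-dual solution.
x = cp.Variable(n)
constraints = [G @ x <= d, A @ x == b]
problem = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + theta @ x), constraints)
problem.solve()
x_star, lam_star, nu_star = x.value, constraints[0].dual_value, constraints[1].dual_value

# Backward pass: assemble the KKT Jacobian (the block matrix above; g and h are
# affine here, so the Hessian block reduces to Q) ...
J = np.block([
    [Q,                     G.T,                      A.T             ],
    [np.diag(lam_star) @ G, np.diag(G @ x_star - d),  np.zeros((m, p))],
    [A,                     np.zeros((p, m)),         np.zeros((p, p))],
])
# ... and the derivative of G with respect to theta: theta only enters the
# stationarity row (through the linear term theta^T x), so d_theta G = [I; 0; 0].
dG_dtheta = np.vstack([np.eye(n), np.zeros((m + p, n))])

# Sensitivity of the primal-dual solution; the first n rows are dx*/dtheta.
sens = -np.linalg.solve(J, dG_dtheta)
dx_dtheta = sens[:n, :]
print(dx_dtheta)
```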
The forward pass computes the optimal primal-dual solution; the backward pass solves the linear system above to obtain the gradient of the optimal solution with respect to the parameters \(\theta\).
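In practice the full sensitivity matrix is rarely formed: given an incoming gradient \(\partial L / \partial x^*\) of some scalar loss, a single linear solve with the transposed KKT Jacobian gives \(\partial L / \partial \theta\). Continuing the sketch above (reusing `J`, `dG_dtheta`, `m`, and `p` defined there, with a hypothetical incoming gradient):

```python
# Hypothetical gradient of a downstream scalar loss L with respect to x*.
dL_dx = np.array([1.0, 0.0])
dL_dz = np.concatenate([dL_dx, np.zeros(m + p)])   # pad the lambda* and nu* slots

# Vector-Jacobian product: dL/dtheta = -(d_theta G)^T J^{-T} dL/dz.
y = np.linalg.solve(J.T, dL_dz)
dL_dtheta = -dG_dtheta.T @ y
print(dL_dtheta)
```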
From BPQP