Differentiable Optimization

Davide De Benedittis

TU Delft & University of Pisa

June 23, 2025

Problem Description 📝

Convex Optimization Problem

Consider a parametrized convex optimization problem \[ \begin{aligned} & \min_x \quad && f(x, \theta) \\ & \text{s.t.} && g(x, \theta) \leq 0 \\ &&& h(x, \theta) = 0 \end{aligned} \]

where \(f\) and \(g\) are convex in \(x\) and \(h\) is affine in \(x\).
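As a concrete illustration, here is a minimal NumPy sketch of this problem class, specialized to an equality-constrained QP. The instance (matrices `Q`, `A` and vectors `q`, `b`) is a hypothetical toy example, not taken from the slides:

```python
import numpy as np

# Hypothetical toy instance of the convex problem above:
#   f(x) = 1/2 x^T Q x + q^T x,  h(x) = A x - b,  no inequality constraints.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])  # positive definite => f is convex
q = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])              # h(x) = x1 + x2 - 1 = 0
b = np.array([1.0])

# For this special case the optimality conditions reduce to one linear system:
#   [Q  A^T] [x ]   [-q]
#   [A   0 ] [nu] = [ b]
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star, nu_star = sol[:2], sol[2:]
print(x_star, nu_star)  # x* = [0, 1], nu* = [2]
```

For this instance the unconstrained minimizer \((1, 2)\) violates the constraint, and the solver trades it off to \(x^* = (0, 1)\) with multiplier \(\nu^* = 2\).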

KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions extend the method of Lagrange multipliers to problems with inequality constraints. Under suitable constraint qualifications they are necessary for optimality; for convex problems they are also sufficient.

The KKT conditions are:

\[ \begin{aligned} g(x^*) &\leq 0, \qquad &\text{(Primal feasibility)} \\ h(x^*) &= 0, &\text{(Primal feasibility)} \\ \lambda^* &\geq 0, &\text{(Dual feasibility)} \\ \lambda^* \circ g(x^*) &= 0, &\text{(Complementary slackness)} \\ \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) &= 0 &\text{(Stationarity)} \end{aligned} \]

Where \(\lambda^*\) and \(\nu^*\) are the Lagrange multipliers for the inequality and equality constraints, respectively. The \(\circ\) operator denotes the element-wise product (Hadamard product).
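The four conditions can be checked mechanically on a tiny instance. The following sketch uses a hypothetical 1-D problem (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 1-D instance:
#   f(x) = x^2,  g(x) = 1 - x <= 0  (i.e. x >= 1),  no equality constraints.
# The constraint is active at the optimum x* = 1, and stationarity
# f'(x*) + lam* g'(x*) = 2 x* - lam* = 0 gives lam* = 2.
x_star, lam_star = 1.0, 2.0
g = 1.0 - x_star

assert g <= 0                                    # primal feasibility
assert lam_star >= 0                             # dual feasibility
assert np.isclose(lam_star * g, 0.0)             # complementary slackness
assert np.isclose(2.0 * x_star - lam_star, 0.0)  # stationarity
print("all four KKT conditions hold")
```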

Differentiable Optimization (1/2)

The gradient of the optimal solution with respect to the parameters \(\theta\) can be computed with

  • numerical methods, e.g., finite differences (😢)
  • explicit methods – write the computational graph that calculates the solution and perform autodiff (😢)
  • implicit methods – use KKT conditions and implicit function theorem (😄)

In general, the optimal solution is not available in closed form. However, it is possible to derive the gradient of the optimal solution with respect to the parameters \(\theta\) using the KKT conditions.

Differentiable Optimization (2/2)

At the solution, we can treat the solver as finding a root of the nonlinear system of equations

\[ G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) \\ \lambda^* \circ g(x^*) \\ h(x^*) \end{bmatrix} = 0 \]

Writing \((x^\star, \lambda^\star, \nu^\star)(\theta)\) for the primal-dual optimal solution as a function of \(\theta\), and differentiating \(G = 0\) with respect to \(\theta\) using the chain rule, we have

\[ \begin{split} & \partial_{\theta} G(x^\star(\theta), \lambda^\star(\theta), \nu^\star(\theta), \theta) = 0 \\ \Longrightarrow \;\; & \partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) + \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta) = 0 \\ \Longrightarrow \;\; & \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) = -\left (\partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \right)^{-1} \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta). \end{split} \]
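The implicit-function step above can be checked numerically on a hypothetical instance whose solution is known in closed form (the problem below is illustrative, not from the slides):

```python
import numpy as np

# Hypothetical instance:
#   min 1/2 ||x||^2  s.t.  x1 + x2 = theta,
# whose solution x*(theta) = (theta/2, theta/2) is known analytically.
Q = np.eye(2)
A = np.array([[1.0, 1.0]])

# Here G(x, nu, theta) = [Q x + A^T nu; A x - theta], so the Jacobian of G
# w.r.t. (x, nu) is the KKT matrix and dG/dtheta = (0, 0, -1).
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
dG_dtheta = np.array([0.0, 0.0, -1.0])

# Implicit function theorem: d(x*, nu*)/dtheta = -K^{-1} dG/dtheta.
dsol_dtheta = -np.linalg.solve(K, dG_dtheta)
print(dsol_dtheta[:2])  # matches the analytic dx*/dtheta = (1/2, 1/2)
```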

This requires the Jacobian of \(G\) with respect to the primal-dual variables,

\[ \partial_{x, \lambda, \nu} G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla^2_x f + \sum_i \lambda_i^* \nabla^2_x g_i & \partial_x g^T & \partial_x h^T \\ \operatorname{diag}(\lambda^*) \, \partial_x g & \operatorname{diag}(g(x^*)) & 0 \\ \partial_x h & 0 & 0 \end{bmatrix} \]
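This Jacobian can be assembled block by block. The sketch below uses a hypothetical instance with a single active inequality constraint and no equality constraints (all values are illustrative):

```python
import numpy as np

# Hypothetical instance with one (active) inequality constraint:
#   f(x) = 1/2 ||x||^2 + q^T x,  g(x) = -x1 <= 0,  q = (1, -1).
# The optimum is x* = (0, 1) with multiplier lam* = 1.
q = np.array([1.0, -1.0])
x_star = np.array([0.0, 1.0])
lam_star = np.array([1.0])
dg = np.array([[-1.0, 0.0]])   # Jacobian dg/dx (one row per constraint)
g_val = dg @ x_star            # g(x*) = 0: the constraint is active

H = np.eye(2)                  # Hessian term (g is linear, so no curvature from g)
J = np.block([
    [H,                      dg.T],
    [np.diag(lam_star) @ dg, np.diag(g_val)],
])
print(np.linalg.det(J))  # nonzero, so the inverse in the formula above exists here
```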

Optimization as a Layer 🧅

The forward pass computes the optimal solution; the backward pass computes the gradient of the optimal solution with respect to the parameters \(\theta\).

Figure from BPQP.
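A minimal forward/backward sketch of such a layer, on a hypothetical toy problem (this is an illustration of the idea, not the BPQP implementation):

```python
import numpy as np

# Hypothetical toy layer:
#   min 1/2 ||x||^2  s.t.  x1 + x2 = theta.
Q = np.eye(2)
A = np.array([[1.0, 1.0]])
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])

def forward(theta):
    # Forward pass: solve the KKT system for the primal solution x*(theta).
    sol = np.linalg.solve(K, np.array([0.0, 0.0, theta]))
    return sol[:2]

def backward(grad_x):
    # Backward pass: chain an upstream gradient dL/dx* through
    # dx*/dtheta = -K^{-1} dG/dtheta (implicit differentiation).
    dsol = -np.linalg.solve(K, np.array([0.0, 0.0, -1.0]))
    return grad_x @ dsol[:2]

x = forward(2.0)           # x* = [1, 1]
g = backward(np.ones(2))   # dL/dtheta = 1.0 when dL/dx* = [1, 1]
print(x, g)
```

In an autodiff framework, `forward` and `backward` would be registered as a custom differentiable operation; the key point is that the backward pass reuses the same KKT matrix that the forward solve factorized.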

Applications 🚀

Applications

  • Learn hard constraints
  • Modeling predictions
  • Game theory (e.g., Nash equilibrium)
  • Meta learning
  • RL and control
    • Optimization layers in RL
    • MPC tuning
    • Multi-agent control
    • Safety
    • High dimensional control

Figure from Actor-Critic MPC.

References 📚

Thank you for your attention (◕‿◕)