Differentiable Optimization

Davide De Benedittis

TU Delft & University of Pisa

June 23, 2025

Problem Description 📝

Convex Optimization Problem

Consider a parametrized convex optimization problem \[ \begin{aligned} & \min_x \quad && f(x, \theta) \\ & \text{s.t.} && g(x, \theta) \leq 0 \\ &&& h(x, \theta) = 0 \end{aligned} \]

where \(f\) and \(g\) are convex in \(x\) and \(h\) is affine in \(x\).
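As a concrete illustration, here is a minimal NumPy sketch of this problem class, specialized to an equality-constrained QP. The instance (matrices `Q`, `A` and vectors `q`, `b`) is a hypothetical toy example, not taken from the slides:

```python
import numpy as np

# Hypothetical toy instance of the convex problem above:
#   f(x) = 1/2 x^T Q x + q^T x,  h(x) = A x - b,  no inequality constraints.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])  # positive definite => f is convex
q = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])              # h(x) = x1 + x2 - 1 = 0
b = np.array([1.0])

# For this special case the optimality conditions reduce to one linear system:
#   [Q  A^T] [x ]   [-q]
#   [A   0 ] [nu] = [ b]
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star, nu_star = sol[:2], sol[2:]
print(x_star, nu_star)  # x* = [0, 1], nu* = [2]
```

For this instance the unconstrained minimizer \((1, 2)\) violates the constraint, and the solver trades it off to \(x^* = (0, 1)\) with multiplier \(\nu^* = 2\).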

KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions extend the method of Lagrange multipliers to problems with inequality constraints. Under suitable constraint qualifications they are necessary for optimality; for convex problems they are also sufficient.

The KKT conditions are:

\[ \begin{aligned} g(x^*) &\leq 0, \qquad &\text{(Primal feasibility)} \\ h(x^*) &= 0, &\text{(Primal feasibility)} \\ \lambda^* &\geq 0, &\text{(Dual feasibility)} \\ \lambda^* \circ g(x^*) &= 0, &\text{(Complementary slackness)} \\ \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) &= 0 &\text{(Stationarity)} \end{aligned} \]

Where \(\lambda^*\) and \(\nu^*\) are the Lagrange multipliers for the inequality and equality constraints, respectively. The \(\circ\) operator denotes the element-wise product (Hadamard product).
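The four conditions can be checked mechanically on a tiny instance. The following sketch uses a hypothetical 1-D problem (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 1-D instance:
#   f(x) = x^2,  g(x) = 1 - x <= 0  (i.e. x >= 1),  no equality constraints.
# The constraint is active at the optimum x* = 1, and stationarity
# f'(x*) + lam* g'(x*) = 2 x* - lam* = 0 gives lam* = 2.
x_star, lam_star = 1.0, 2.0
g = 1.0 - x_star

assert g <= 0                                    # primal feasibility
assert lam_star >= 0                             # dual feasibility
assert np.isclose(lam_star * g, 0.0)             # complementary slackness
assert np.isclose(2.0 * x_star - lam_star, 0.0)  # stationarity
print("all four KKT conditions hold")
```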

Differentiable Optimization (1/2)

The gradient of the optimal solution with respect to the parameters \(\theta\) can be computed with

  • numerical methods, e.g., finite differences (😢)
  • explicit methods – write the computational graph that calculates the solution and perform autodiff (😢)
  • implicit methods – use KKT conditions and implicit function theorem (😄)

In general, the optimal solution is not available in closed form. However, it is possible to derive the gradient of the optimal solution with respect to the parameters \(\theta\) using the KKT conditions.

Differentiable Optimization (2/2)

At the solution, we can treat the solver as finding a root of the nonlinear system of equations

\[ G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\nu^*)^T \nabla h(x^*) \\ \lambda^* \circ g(x^*) \\ h(x^*) \end{bmatrix} = 0 \]

Writing \((x^\star, \lambda^\star, \nu^\star)(\theta)\) for the primal-dual optimal solution as a function of \(\theta\), and differentiating \(G = 0\) with respect to \(\theta\) using the chain rule, we have

\[ \begin{split} & \partial_{\theta} G(x^\star(\theta), \lambda^\star(\theta), \nu^\star(\theta), \theta) = 0 \\ \Longrightarrow \;\; & \partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) + \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta) = 0 \\ \Longrightarrow \;\; & \partial_\theta (x^\star, \lambda^\star, \nu^\star)(\theta) = -\left (\partial_{x, \lambda, \nu} G(x^\star, \lambda^\star, \nu^\star, \theta) \right)^{-1} \partial_{\theta} G(x^\star, \lambda^\star, \nu^\star, \theta). \end{split} \]
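The implicit-function step above can be checked numerically on a hypothetical instance whose solution is known in closed form (the problem below is illustrative, not from the slides):

```python
import numpy as np

# Hypothetical instance:
#   min 1/2 ||x||^2  s.t.  x1 + x2 = theta,
# whose solution x*(theta) = (theta/2, theta/2) is known analytically.
Q = np.eye(2)
A = np.array([[1.0, 1.0]])

# Here G(x, nu, theta) = [Q x + A^T nu; A x - theta], so the Jacobian of G
# w.r.t. (x, nu) is the KKT matrix and dG/dtheta = (0, 0, -1).
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
dG_dtheta = np.array([0.0, 0.0, -1.0])

# Implicit function theorem: d(x*, nu*)/dtheta = -K^{-1} dG/dtheta.
dsol_dtheta = -np.linalg.solve(K, dG_dtheta)
print(dsol_dtheta[:2])  # matches the analytic dx*/dtheta = (1/2, 1/2)
```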

This requires the Jacobian of \(G\) with respect to the primal-dual variables,

\[ \partial_{x, \lambda, \nu} G(x^*, \lambda^*, \nu^*, \theta) = \begin{bmatrix} \nabla^2_x f + \sum_i \lambda_i^* \nabla^2_x g_i & \partial_x g^T & \partial_x h^T \\ \operatorname{diag}(\lambda^*) \, \partial_x g & \operatorname{diag}(g(x^*)) & 0 \\ \partial_x h & 0 & 0 \end{bmatrix} \]
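This Jacobian can be assembled block by block. The sketch below uses a hypothetical instance with a single active inequality constraint and no equality constraints (all values are illustrative):

```python
import numpy as np

# Hypothetical instance with one (active) inequality constraint:
#   f(x) = 1/2 ||x||^2 + q^T x,  g(x) = -x1 <= 0,  q = (1, -1).
# The optimum is x* = (0, 1) with multiplier lam* = 1.
q = np.array([1.0, -1.0])
x_star = np.array([0.0, 1.0])
lam_star = np.array([1.0])
dg = np.array([[-1.0, 0.0]])   # Jacobian dg/dx (one row per constraint)
g_val = dg @ x_star            # g(x*) = 0: the constraint is active

H = np.eye(2)                  # Hessian term (g is linear, so no curvature from g)
J = np.block([
    [H,                      dg.T],
    [np.diag(lam_star) @ dg, np.diag(g_val)],
])
print(np.linalg.det(J))  # nonzero, so the inverse in the formula above exists here
```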

Optimization as a Layer 🧅

The forward pass computes the optimal solution; the backward pass computes the gradient of the optimal solution with respect to the parameters \(\theta\).

Figure from BPQP.
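A minimal forward/backward sketch of such a layer, on a hypothetical toy problem (this is an illustration of the idea, not the BPQP implementation):

```python
import numpy as np

# Hypothetical toy layer:
#   min 1/2 ||x||^2  s.t.  x1 + x2 = theta.
Q = np.eye(2)
A = np.array([[1.0, 1.0]])
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])

def forward(theta):
    # Forward pass: solve the KKT system for the primal solution x*(theta).
    sol = np.linalg.solve(K, np.array([0.0, 0.0, theta]))
    return sol[:2]

def backward(grad_x):
    # Backward pass: chain an upstream gradient dL/dx* through
    # dx*/dtheta = -K^{-1} dG/dtheta (implicit differentiation).
    dsol = -np.linalg.solve(K, np.array([0.0, 0.0, -1.0]))
    return grad_x @ dsol[:2]

x = forward(2.0)           # x* = [1, 1]
g = backward(np.ones(2))   # dL/dtheta = 1.0 when dL/dx* = [1, 1]
print(x, g)
```

In an autodiff framework, `forward` and `backward` would be registered as a custom differentiable operation; the key point is that the backward pass reuses the same KKT matrix that the forward solve factorized.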

Applications 🚀

Applications

  • Learn hard constraints
  • Modeling predictions
  • Game theory (e.g., Nash equilibrium)
  • Meta learning
  • RL and control
    • Optimization layers in RL
    • MPC tuning
    • Multi-agent control
    • Safety
    • High dimensional control

Figure from Actor-Critic MPC.

References 📚

Thank you for your attention (◕‿◕)