Exam 4 Study Guide/Practice Aid

Content

The exam covers content from gradient descent through differentiation matrices. Roughly, in terms of subsections, that is:

  • Optimization
    • gradient descent and steepest descent
    • Newton's method ($n$D)
    • quasi-Newton methods (BFGS)
    • nonlinear least squares (Gauss-Newton, Levenberg-Marquardt)
    • constrained optimization, KKT conditions
  • Interpolation
    • setup, Vandermonde matrices
    • Lagrange, Newton, and orthogonal polynomial bases
    • Chebyshev polynomials and nodes
    • error estimates
    • piecewise interpolation and splines
  • Quadrature
    • interpolatory quadrature, Newton-Cotes
    • accuracy and stability
    • Gaussian quadrature
    • composite quadrature
  • Numerical differentiation
    • accuracy
    • differentiation matrices

Some questions to ask yourself

Gradient Descent / Steepest Descent

  • What is the steepest descent direction for a function $f$? $$ \boldsymbol{s}_k = -\nabla f(\boldsymbol{x}_k) $$
  • What is the role of the line search? What method can be used?
  • Write out one full step of gradient descent.
  • What is the convergence rate of steepest descent? (Linear convergence.)
  • For the quadratic model $f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T A \boldsymbol{x} + \boldsymbol{c}^T \boldsymbol{x}$ with $A$ SPD, what is the convergence factor? $$ \|\boldsymbol{e}_{k+1}\|_A \le \frac{\kappa(A)-1}{\kappa(A)+1}\|\boldsymbol{e}_k\|_A $$
  • Why does ill-conditioning slow down gradient descent?
  • What are the (weak) Wolfe conditions, and why are they used instead of an exact line search?
    • Sufficient decrease: $\phi(\alpha) \le \phi(0) + c_1\alpha\phi'(0)$
    • Curvature condition: $\phi'(\alpha) \ge c_2\phi'(0)$
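
A minimal NumPy sketch of the above, assuming a small SPD quadratic test problem (the matrix, vector, and line-search constants are illustrative choices, not from lecture): each step takes the steepest descent direction and backtracks until the sufficient-decrease condition holds.

```python
import numpy as np

# Illustrative SPD quadratic: f(x) = 1/2 x^T A x + c^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x + c @ x
grad = lambda x: A @ x + c

def gd_step(x, c1=1e-4, alpha=1.0, shrink=0.5):
    """One steepest-descent step; backtrack until sufficient decrease holds."""
    s = -grad(x)                               # steepest descent direction
    phi0, dphi0 = f(x), grad(x) @ s            # phi(0) and phi'(0)
    while f(x + alpha * s) > phi0 + c1 * alpha * dphi0:
        alpha *= shrink                        # backtracking line search
    return x + alpha * s

x = np.array([2.0, 2.0])
for _ in range(50):
    x = gd_step(x)
print(x, np.linalg.solve(A, -c))               # iterate vs. exact minimizer
```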

Heavy Ball and Conjugate Gradient

  • What is the heavy ball (momentum) method? $$ \boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha_k\nabla f(\boldsymbol{x}_k) + \beta_k(\boldsymbol{x}_k - \boldsymbol{x}_{k-1}) $$
  • For the quadratic model with optimal $\alpha,\beta$, what convergence factor does heavy ball achieve? $$ \|\boldsymbol{e}_{k+1}\|_A \le \frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\|\boldsymbol{e}_k\|_A \quad \text{(asymptotically)} $$
  • What is stochastic gradient descent (SGD)? When is it used?
  • What is the conjugate gradient method for optimization? How does it relate to Lanczos iteration and the Krylov space?
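
A sketch of heavy ball on the same kind of SPD quadratic, using the standard optimal constant $\alpha,\beta$ expressed via the extreme eigenvalues (the test matrix is again an illustrative assumption):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])          # illustrative SPD matrix
c = np.array([-1.0, 1.0])
grad = lambda x: A @ x + c

lam_min, lam_max = np.linalg.eigvalsh(A)[[0, -1]]
alpha = 4.0 / (np.sqrt(lam_max) + np.sqrt(lam_min))**2
beta = ((np.sqrt(lam_max) - np.sqrt(lam_min))
        / (np.sqrt(lam_max) + np.sqrt(lam_min)))**2  # = ((sqrt(k)-1)/(sqrt(k)+1))^2

x_prev = x = np.array([2.0, 2.0])
for _ in range(50):
    # momentum step: gradient step plus a multiple of the previous displacement
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
print(x, np.linalg.solve(A, -c))                # iterate vs. exact minimizer
```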

Newton's Method ($n$D) for Optimization

  • Derive Newton's method for optimization in $n$ dimensions from the quadratic Taylor approximation: $$ H_f(\boldsymbol{x}_k)\boldsymbol{s}_k = -\nabla f(\boldsymbol{x}_k), \qquad \boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \boldsymbol{s}_k $$
  • What are the drawbacks? (Only locally convergent; requires second derivatives; the Newton step may not be a descent direction when $H_f$ is indefinite or nearly singular.)
  • What is the convergence rate of Newton's method for optimization? (Quadratic, near a strict local minimum.)
  • What is the Nelder-Mead method?
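
A sketch of the Newton iteration on a small smooth test function (the function and starting point are assumptions for illustration; note the start must be near the minimizer, since convergence is only local):

```python
import numpy as np

# Illustrative test function: f(x) = x0^4 + x0*x1 + (1 + x1)^2.
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])

x = np.array([0.75, -1.25])                     # start near the minimizer
for _ in range(10):
    s = np.linalg.solve(hess(x), -grad(x))      # H_f(x_k) s_k = -grad f(x_k)
    x = x + s
print(x, grad(x))                               # gradient ~ 0 at the minimizer
```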

Quasi-Newton Methods (BFGS)

  • What is the secant condition for a Hessian approximation $B_{k+1}$? $$ B_{k+1}\boldsymbol{s}_k = \boldsymbol{y}_k,\quad \boldsymbol{s}_k = \boldsymbol{x}_{k+1}-\boldsymbol{x}_k,\quad \boldsymbol{y}_k = \nabla f(\boldsymbol{x}_{k+1})-\nabla f(\boldsymbol{x}_k) $$
  • What is the BFGS update formula?
  • Why does BFGS retain positive definiteness of $B_{k+1}$ (given the Wolfe curvature condition)?
  • What is the convergence behavior of BFGS in practice?
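
A sketch of the rank-two BFGS update itself; the vectors below are random stand-ins chosen so that $\boldsymbol{y}^T\boldsymbol{s} > 0$ (which the Wolfe curvature condition would guarantee in a real run):

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update: the new B satisfies the secant condition
    B_new @ s == y and stays SPD whenever y @ s > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)            # ensures y @ s > 0 here
assert y @ s > 0
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))                # secant condition holds
print(np.linalg.eigvalsh(B_new))                # eigenvalues stay positive
```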

Nonlinear Least Squares

  • What is the nonlinear least-squares problem? $$ \min_{\boldsymbol{x}} \varphi(\boldsymbol{x}) = \tfrac{1}{2}\|\boldsymbol{r}(\boldsymbol{x})\|_2^2 $$
  • Derive the gradient of $\varphi$: $\nabla\varphi = J_{\boldsymbol{r}}^T\boldsymbol{r}$.
  • What is the Gauss-Newton method? How does it approximate the full Hessian? $$ J^T J\,\boldsymbol{s}_k = -J^T\boldsymbol{r}(\boldsymbol{x}_k) $$ Why is this a least-squares problem?
  • When does Gauss-Newton converge well, and when does it struggle? (Large residual at the solution $\Rightarrow$ poor convergence.)
  • What is Levenberg-Marquardt? $$ (J^T J + \mu_k I)\boldsymbol{s}_k = -J^T\boldsymbol{r}(\boldsymbol{x}_k) $$ What does $\mu_k\to 0$ and $\mu_k\to\infty$ correspond to?
  • How does the trust-region criterion $\rho_k$ guide the choice of $\mu_k$ in Levenberg-Marquardt?
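
A Levenberg-Marquardt sketch on an assumed exponential-fit problem $r_i(\boldsymbol{x}) = x_0 e^{x_1 t_i} - b_i$; the data and the fixed damping parameter $\mu$ are illustrative (a real implementation adapts $\mu_k$ via the trust-region ratio $\rho_k$):

```python
import numpy as np

t = np.linspace(0, 1, 20)                       # assumed data for the fit
b = 2.0 * np.exp(-1.0 * t)                      # generated by x = (2, -1)

r = lambda x: x[0] * np.exp(x[1] * t) - b       # residual vector
J = lambda x: np.column_stack([np.exp(x[1] * t),
                               x[0] * t * np.exp(x[1] * t)])  # Jacobian of r

x, mu = np.array([1.0, 0.0]), 1e-2              # fixed damping for brevity
for _ in range(30):
    Jk, rk = J(x), r(x)
    # LM step: (J^T J + mu I) s = -J^T r
    # (mu -> 0: Gauss-Newton; mu -> inf: short steepest-descent step)
    s = np.linalg.solve(Jk.T @ Jk + mu * np.eye(2), -Jk.T @ rk)
    x = x + s
print(x)                                        # ~ (2, -1): recovers the data
```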

Constrained Optimization and KKT Conditions

  • State the general constrained optimization problem (equality constraints $\boldsymbol{g}=\boldsymbol{0}$, inequality constraints $\boldsymbol{h}\ge\boldsymbol{0}$).
  • What is the Lagrangian? $$ \mathcal{L}(\boldsymbol{x},\boldsymbol{\lambda}_g,\boldsymbol{\lambda}_h) = f(\boldsymbol{x}) - \boldsymbol{\lambda}_g^T\boldsymbol{g}(\boldsymbol{x}) - \boldsymbol{\lambda}_h^T\boldsymbol{h}(\boldsymbol{x}) $$
  • State the KKT necessary conditions:
    • $\nabla_{\boldsymbol{x}}\mathcal{L} = \boldsymbol{0}$
    • $\boldsymbol{g}(\boldsymbol{x}^*) = \boldsymbol{0}$
    • $\boldsymbol{h}(\boldsymbol{x}^*) \ge \boldsymbol{0}$
    • $\boldsymbol{\lambda}_h^* \ge \boldsymbol{0}$
    • Complementarity: $\boldsymbol{h}(\boldsymbol{x}^*)\cdot\boldsymbol{\lambda}_h^* = 0$
  • What does it mean for a constraint to be active at $\boldsymbol{x}^*$?
  • Why does the sign of the multiplier matter for inequality constraints?
  • What is the constraint qualification, and what can go wrong when it fails? (e.g., two active constraints with parallel gradients $\Rightarrow$ multipliers not unique or KKT fails to be necessary.)
  • What is Farkas' Lemma and how does it underpin KKT?
  • What computational approach solves the KKT equations? (Apply Newton's method to a subset of the KKT conditions, as in the sketch below.)
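
A sketch of that approach for an assumed equality-constrained toy problem: minimize $x_0^2 + x_1^2$ subject to $x_0 + x_1 - 1 = 0$. Newton's method is applied to the stationarity and primal-feasibility conditions; here the KKT system is linear, so one step already suffices.

```python
import numpy as np

# KKT residual for L(x, lam) = f(x) - lam * g(x):
#   grad f(x) - lam * grad g(x) = 0   (stationarity)
#   g(x) = 0                          (primal feasibility)
def kkt_res(z):
    x0, x1, lam = z
    return np.array([2 * x0 - lam, 2 * x1 - lam, x0 + x1 - 1])

def kkt_jac(z):                                  # Jacobian w.r.t. (x0, x1, lam)
    return np.array([[2.0, 0.0, -1.0],
                     [0.0, 2.0, -1.0],
                     [1.0, 1.0,  0.0]])

z = np.zeros(3)
for _ in range(5):                               # Newton on the KKT equations
    z = z + np.linalg.solve(kkt_jac(z), -kkt_res(z))
print(z)                                         # x* = (0.5, 0.5), lambda* = 1
```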

Interpolation: Setup and Vandermonde Matrices

  • What is the interpolation problem? Given nodes $(x_i)$ and values $(y_i)$, find $f$ with $f(x_i)=y_i$.
  • How is the problem made unique? (Pick a basis $\{\varphi_j\}$, set $N_\text{func}=N$, solve the Vandermonde system $V\boldsymbol{\alpha}=\boldsymbol{y}$.)
  • What is the Vandermonde matrix $V_{ij} = \varphi_j(x_i)$?
  • What happens to $V$ when a node is repeated? (Two rows become identical $\Rightarrow$ $V$ is singular.)
  • What is the Lebesgue constant $\Lambda$, and what does it measure?
  • How does $\Lambda$ connect interpolation to best-approximation? $\|f - Pf\|_\infty \le (1+\Lambda)\|f-q\|_\infty$ for any polynomial $q$.
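
A minimal Vandermonde-interpolation sketch in the monomial basis (the function, interval, and node count are illustrative assumptions):

```python
import numpy as np

f, n = np.sin, 6
x = np.linspace(0, np.pi, n)                     # nodes
V = np.vander(x, increasing=True)                # V[i, j] = x_i ** j
alpha = np.linalg.solve(V, f(x))                 # coefficients: V alpha = y

xx = np.linspace(0, np.pi, 200)
p = np.vander(xx, n, increasing=True) @ alpha    # evaluate the interpolant
print(np.max(np.abs(p - f(xx))))                 # max interpolation error
print(np.linalg.cond(V))                         # conditioning of V
```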

Interpolation Bases

  • What are the Lagrange basis polynomials $\varphi_j$? (They make $V=I$; $\varphi_j(x_i)=\delta_{ij}$.)
  • Write down the Lagrange interpolant for nodes $(x_i)$ and values $(y_i)$.
  • What is the Newton (triangular) basis? How are its coefficients found? ($O(n^2)$ via forward substitution.)
  • Why are monomial bases on equispaced nodes problematic? (Near-linear dependence $\Rightarrow$ ill-conditioned $V$.)
  • What are orthogonal polynomials? How are Legendre polynomials constructed? (Gram-Schmidt on monomials under $\langle f,g\rangle=\int_{-1}^1 fg\,dx$.)
  • State the three-term recurrence for Legendre polynomials.
  • What are Chebyshev polynomials? Give three equivalent definitions.
  • What are the Chebyshev nodes (extrema and roots of $T_k$)? Why are edge-clustered nodes beneficial?
  • Why is Chebyshev interpolation preferred over equispaced polynomial interpolation?
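
A quick comparison you can run, assuming Runge's function $1/(1+25x^2)$ on $[-1,1]$ (the classic example) with equispaced vs. Chebyshev nodes:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)          # Runge's function
n = 12
x_eq = np.linspace(-1, 1, n)                     # equispaced nodes
x_ch = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))  # roots of T_n

xx = np.linspace(-1, 1, 500)
for nodes in (x_eq, x_ch):
    a = np.linalg.solve(np.vander(nodes, increasing=True), f(nodes))
    p = np.vander(xx, n, increasing=True) @ a
    print(np.max(np.abs(p - f(xx))))             # Chebyshev error is far smaller
```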

Interpolation Error

  • State the interpolation error formula: $$ f(x) - p_{n-1}(x) = \frac{f^{(n)}(\xi)}{n!}\prod_{i=1}^n(x-x_i) $$
  • What is the simplified error bound when $h=x_n-x_1$? $$ \max_x|f(x)-p_{n-1}(x)| \le C M h^n, \quad M=\max|f^{(n)}| $$
  • How do Chebyshev nodes minimize $\max_x|\prod(x-x_i)|$?
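
An empirical check of the $O(h^n)$ bound (the test function and interval sizes are assumptions): with $n=3$ nodes on shrinking intervals $[0,h]$, the error should drop by roughly $2^3 = 8$ per halving of $h$.

```python
import numpy as np

f, n = np.sin, 3
for h in [1.0, 0.5, 0.25, 0.125]:
    x = np.linspace(0, h, n)                     # 3 nodes on [0, h]
    a = np.linalg.solve(np.vander(x, increasing=True), f(x))
    xx = np.linspace(0, h, 200)
    err = np.max(np.abs(np.vander(xx, n, increasing=True) @ a - f(xx)))
    print(h, err)                                # error ~ C h^3
```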

Piecewise Interpolation and Splines

  • What is piecewise linear interpolation? What is its error order? ($O(h^2)$)
  • What is a natural cubic spline? How many unknowns and equations are there?
  • What conditions are imposed at interior nodes for a cubic spline? (Continuity of $f$, $f'$, $f''$.)
  • What are the "extra" boundary conditions for a natural spline? ($f''=0$ at the endpoints.)
  • What are B-splines?
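
A SciPy usage sketch for a natural cubic spline (the data are an illustrative assumption); the natural boundary condition shows up as $f''=0$ at both endpoints:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 2 * np.pi, 8)
spl = CubicSpline(x, np.sin(x), bc_type='natural')  # f'' = 0 at both ends

xx = np.linspace(0, 2 * np.pi, 200)
print(np.max(np.abs(spl(xx) - np.sin(xx))))      # interpolation error
print(spl(x[0], 2), spl(x[-1], 2))               # second derivative ~ 0 at ends
```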

Quadrature: Interpolatory/Newton-Cotes

  • Derive interpolatory quadrature from the Lagrange interpolant: $$ \int_a^b f\,dx \approx \sum_i \omega_i f(x_i), \qquad \omega_i = \int_a^b \ell_i(x)\,dx $$
  • What is the method of undetermined coefficients for finding quadrature weights?
  • To what polynomial degree are the midpoint, trapezoidal, and Simpson's rules exact? (Midpoint: 1; Trapezoidal: 1; Simpson's: 3.)
  • Why do rules with an odd number of nodes gain an extra degree of exactness? (Cancellation of odd-order error terms by symmetry.)
  • What are the drawbacks of Newton-Cotes quadrature at high order? (Possible large negative weights $\Rightarrow$ instability; Runge-like divergence.)
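
A sketch of the method of undetermined coefficients: requiring exactness on $1, x, \ldots, x^{n-1}$ gives a transposed-Vandermonde system for the weights. As a check, it recovers Simpson's rule, with weights $(1/6, 4/6, 1/6)$ on $[0,1]$.

```python
import numpy as np

def interp_weights(x, a, b):
    """Weights of the interpolatory rule on nodes x over [a, b]:
    enforce sum_i w_i x_i^j = int_a^b x^j dx for j = 0, ..., n-1."""
    n = len(x)
    V = np.vander(x, increasing=True)            # V[i, j] = x_i^j
    moments = np.array([(b**(j + 1) - a**(j + 1)) / (j + 1) for j in range(n)])
    return np.linalg.solve(V.T, moments)

# Simpson's rule on [0, 1]:
print(interp_weights(np.array([0.0, 0.5, 1.0]), 0.0, 1.0))
```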

Quadrature: Accuracy and Stability

  • Derive the accuracy bound for interpolatory quadrature: using the interpolation error, $$ \left|\int_a^b f\,dx - \sum_i\omega_i f(x_i)\right| \le C h^{n+1}\|f^{(n)}\|_\infty $$ (quadrature gains one order over interpolation).
  • Why do quadratures with large negative weights have poor stability?
  • What is Gaussian quadrature? How many points does it need to integrate degree-$(2n-1)$ polynomials exactly? ($n$ points.)
  • How are Gaussian quadrature nodes related to orthogonal polynomials?
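
A quick check with NumPy's Gauss-Legendre nodes and weights: $n=3$ points integrate up to degree $2n-1 = 5$ exactly on $[-1,1]$, but not degree 6.

```python
import numpy as np

n = 3
x, w = np.polynomial.legendre.leggauss(n)        # nodes = roots of Legendre P_3
print(w @ x**4, 2.0 / 5.0)                       # degree 4: exact
print(w @ x**5, 0.0)                             # degree 5 = 2n-1: exact
print(w @ x**6, 2.0 / 7.0)                       # degree 6: no longer exact
```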

Composite Quadrature

  • What is composite quadrature? Why is it preferred over a single high-order rule?
  • Derive the error bound for composite quadrature using $m$ panels of length $h$: $$ \left|\int_a^b f - \text{composite}\right| \le C\|f^{(n)}\|_\infty h^n(b-a) $$ (composite loses one order vs. non-composite).
  • What is adaptivity? Why is it useful?
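
A composite-trapezoid sketch (the integrand and panel counts are illustrative): since the rule is $O(h^2)$, the error should shrink by about $4\times$ each time the number of panels doubles.

```python
import numpy as np

def composite_trapezoid(f, a, b, m):
    """Trapezoidal rule on m equal panels of width h = (b - a) / m."""
    x = np.linspace(a, b, m + 1)
    h = (b - a) / m
    return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

exact = 2.0                                      # int_0^pi sin(x) dx
for m in [4, 8, 16, 32]:
    print(m, abs(composite_trapezoid(np.sin, 0.0, np.pi, m) - exact))
```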

Numerical Differentiation and Finite Differences

  • Why is numerical differentiation harder than integration? (Ill-conditioned; amplifies noise; subject to cancellation.)
  • What is the accuracy order of numerical differentiation with $n$ interpolation points? ($O(h^{n-1})$ — one order lower than interpolation.)
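
A demonstration of the cancellation issue (the step sizes are illustrative): the forward-difference error first shrinks with $h$ (truncation error) and then grows again (rounding error), with the best accuracy near $h \approx \sqrt{\varepsilon_{\text{mach}}}$.

```python
import numpy as np

x0 = 1.0
for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    fd = (np.exp(x0 + h) - np.exp(x0)) / h       # forward difference
    print(h, abs(fd - np.exp(x0)))               # error vs. exact derivative
```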

Differentiation Matrices

  • How is numerical differentiation cast as a matrix-vector operation? $$ D = V' V^{-1}, \qquad f'(\boldsymbol{x}) \approx D f(\boldsymbol{x}) $$ where $V'_{ij} = \varphi_j'(x_i)$.
  • Does $D$ depend on the function being differentiated? (No; it depends only on the nodes and the interpolation basis.)
  • What is the nullspace of $D$? (Constant vectors; rows of $D$ always sum to zero.)
  • How do you compute second derivatives with $D$? ($D^2$.)
  • How does $D$ change under a shift of the nodes? (It does not: $D_{\boldsymbol{x}+c} = D_{\boldsymbol{x}}$.)
  • How does $D$ change under a scaling of the nodes? ($D_{\alpha\boldsymbol{x}} = D_{\boldsymbol{x}}/\alpha$.)
  • How do rows of $D$ relate to finite difference formulas? (Row $i$ gives the FD weights for $f'(x_i)$.)
  • How is a single small $D$ (e.g. $3\times 3$) applied to a large equispaced grid? (First row for left boundary, interior row for interior points, last row for right boundary.)
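
A sketch building $D = V'V^{-1}$ in the monomial basis on three assumed nodes; the rows reproduce the standard three-point finite-difference weights.

```python
import numpy as np

def diff_matrix(x):
    """Differentiation matrix D = V' V^{-1} in the monomial basis."""
    n = len(x)
    V = np.vander(x, increasing=True)            # V[i, j] = x_i^j
    Vp = np.zeros_like(V)
    Vp[:, 1:] = V[:, :-1] * np.arange(1, n)      # V'[i, j] = j * x_i^(j-1)
    return Vp @ np.linalg.inv(V)

x = np.array([-1.0, 0.0, 1.0])
D = diff_matrix(x)
print(D)                                         # rows = FD weights at each node
print(D @ x**2)                                  # equals 2x: exact on degree <= 2
print(D.sum(axis=1))                             # rows sum to zero (constants)
```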