Exam 4 Study Guide/Practice Aid

Content

The exam covers content from gradient descent through differentiation matrices. Roughly, in terms of subsections, that is:

  • Optimization
    • gradient descent and steepest descent
    • Newton's method ($n$D)
    • quasi-Newton methods (BFGS)
    • nonlinear least squares (Gauss-Newton, Levenberg-Marquardt)
    • constrained optimization, KKT conditions
  • Interpolation
    • setup, Vandermonde matrices
    • Lagrange, Newton, and orthogonal polynomial bases
    • Chebyshev polynomials and nodes
    • error estimates
    • piecewise interpolation and splines
  • Quadrature
    • interpolatory quadrature, Newton-Cotes
    • accuracy and stability
    • Gaussian quadrature
    • composite quadrature
  • Numerical differentiation
    • accuracy
    • differentiation matrices

Some questions to ask yourself

Gradient Descent / Steepest Descent

  • What is the steepest descent direction for a function $f$? $$ \boldsymbol{s}_k = -\nabla f(\boldsymbol{x}_k) $$
  • What is the role of the line search? What method can be used?
  • Write out one full step of gradient descent.
  • What is the convergence rate of steepest descent? (Linear convergence.)
  • For the quadratic model $f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^T A \boldsymbol{x} + \boldsymbol{c}^T \boldsymbol{x}$ with $A$ SPD, what is the convergence factor? $$ \|\boldsymbol{e}_{k+1}\|_A \le \frac{\kappa(A)-1}{\kappa(A)+1}\|\boldsymbol{e}_k\|_A $$
  • Why does ill-conditioning slow down gradient descent?
  • What are the (weak) Wolfe conditions, and why are they used instead of an exact line search?
    • Sufficient decrease: $\phi(\alpha) \le \phi(0) + c_1\alpha\phi'(0)$
    • Curvature condition: $\phi'(\alpha) \ge c_2\phi'(0)$
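
A minimal NumPy sketch of the above, assuming a small SPD quadratic test problem (the matrix, vector, and line-search constants are illustrative choices, not from lecture): each step takes the steepest descent direction and backtracks until the sufficient-decrease condition holds.

```python
import numpy as np

# Illustrative SPD quadratic: f(x) = 1/2 x^T A x + c^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x + c @ x
grad = lambda x: A @ x + c

def gd_step(x, c1=1e-4, alpha=1.0, shrink=0.5):
    """One steepest-descent step; backtrack until sufficient decrease holds."""
    s = -grad(x)                               # steepest descent direction
    phi0, dphi0 = f(x), grad(x) @ s            # phi(0) and phi'(0)
    while f(x + alpha * s) > phi0 + c1 * alpha * dphi0:
        alpha *= shrink                        # backtracking line search
    return x + alpha * s

x = np.array([2.0, 2.0])
for _ in range(50):
    x = gd_step(x)
print(x, np.linalg.solve(A, -c))               # iterate vs. exact minimizer
```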

Heavy Ball and Conjugate Gradient

  • What is the heavy ball (momentum) method? $$ \boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha_k\nabla f(\boldsymbol{x}_k) + \beta_k(\boldsymbol{x}_k - \boldsymbol{x}_{k-1}) $$
  • For the quadratic model with optimal $\alpha,\beta$, what convergence factor does heavy ball achieve? $$ \|\boldsymbol{e}_{k+1}\|_A \le \frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\|\boldsymbol{e}_k\|_A \quad \text{(asymptotically)} $$
  • What is stochastic gradient descent (SGD)? When is it used?
  • What is the conjugate gradient method for optimization? How does it relate to Lanczos iteration and the Krylov space?
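
A sketch of heavy ball on the same kind of SPD quadratic, using the standard optimal constant $\alpha,\beta$ expressed via the extreme eigenvalues (the test matrix is again an illustrative assumption):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])          # illustrative SPD matrix
c = np.array([-1.0, 1.0])
grad = lambda x: A @ x + c

lam_min, lam_max = np.linalg.eigvalsh(A)[[0, -1]]
alpha = 4.0 / (np.sqrt(lam_max) + np.sqrt(lam_min))**2
beta = ((np.sqrt(lam_max) - np.sqrt(lam_min))
        / (np.sqrt(lam_max) + np.sqrt(lam_min)))**2  # = ((sqrt(k)-1)/(sqrt(k)+1))^2

x_prev = x = np.array([2.0, 2.0])
for _ in range(50):
    # momentum step: gradient step plus a multiple of the previous displacement
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
print(x, np.linalg.solve(A, -c))                # iterate vs. exact minimizer
```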

Newton's Method ($n$D) for Optimization

  • Derive Newton's method for optimization in $n$ dimensions from the quadratic Taylor approximation: $$ H_f(\boldsymbol{x}_k)\boldsymbol{s}_k = -\nabla f(\boldsymbol{x}_k), \qquad \boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \boldsymbol{s}_k $$
  • What are the drawbacks? (Only locally convergent; requires second derivatives; the Newton step may not be a descent direction when $H_f$ is indefinite or nearly singular.)
  • What is the convergence rate of Newton's method for optimization? (Quadratic, near a strict local minimum.)
  • What is the Nelder-Mead method?
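
A sketch of the Newton iteration on a small smooth test function (the function and starting point are assumptions for illustration; note the start must be near the minimizer, since convergence is only local):

```python
import numpy as np

# Illustrative test function: f(x) = x0^4 + x0*x1 + (1 + x1)^2.
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])

x = np.array([0.75, -1.25])                     # start near the minimizer
for _ in range(10):
    s = np.linalg.solve(hess(x), -grad(x))      # H_f(x_k) s_k = -grad f(x_k)
    x = x + s
print(x, grad(x))                               # gradient ~ 0 at the minimizer
```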

Quasi-Newton Methods (BFGS)

  • What is the secant condition for a Hessian approximation $B_{k+1}$? $$ B_{k+1}\boldsymbol{s}_k = \boldsymbol{y}_k,\quad \boldsymbol{s}_k = \boldsymbol{x}_{k+1}-\boldsymbol{x}_k,\quad \boldsymbol{y}_k = \nabla f(\boldsymbol{x}_{k+1})-\nabla f(\boldsymbol{x}_k) $$
  • What is the BFGS update formula?
  • Why does BFGS retain positive definiteness of $B_{k+1}$ (given the Wolfe curvature condition)?
  • What is the convergence behavior of BFGS in practice?
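
A sketch of the rank-two BFGS update itself; the vectors below are random stand-ins chosen so that $\boldsymbol{y}^T\boldsymbol{s} > 0$ (which the Wolfe curvature condition would guarantee in a real run):

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update: the new B satisfies the secant condition
    B_new @ s == y and stays SPD whenever y @ s > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)            # ensures y @ s > 0 here
assert y @ s > 0
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))                # secant condition holds
print(np.linalg.eigvalsh(B_new))                # eigenvalues stay positive
```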

Nonlinear Least Squares

  • What is the nonlinear least-squares problem? $$ \min_{\boldsymbol{x}} \varphi(\boldsymbol{x}) = \tfrac{1}{2}\|\boldsymbol{r}(\boldsymbol{x})\|_2^2 $$
  • Derive the gradient of $\varphi$: $\nabla\varphi = J_{\boldsymbol{r}}^T\boldsymbol{r}$.
  • What is the Gauss-Newton method? How does it approximate the full Hessian? $$ J^T J\,\boldsymbol{s}_k = -J^T\boldsymbol{r}(\boldsymbol{x}_k) $$ Why is this a least-squares problem?
  • When does Gauss-Newton converge well, and when does it struggle? (Large residual at the solution $\Rightarrow$ poor convergence.)
  • What is Levenberg-Marquardt? $$ (J^T J + \mu_k I)\boldsymbol{s}_k = -J^T\boldsymbol{r}(\boldsymbol{x}_k) $$ What does $\mu_k\to 0$ and $\mu_k\to\infty$ correspond to?
  • How does the trust-region criterion $\rho_k$ guide the choice of $\mu_k$ in Levenberg-Marquardt?
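
A Levenberg-Marquardt sketch on an assumed exponential-fit problem $r_i(\boldsymbol{x}) = x_0 e^{x_1 t_i} - b_i$; the data and the fixed damping parameter $\mu$ are illustrative (a real implementation adapts $\mu_k$ via the trust-region ratio $\rho_k$):

```python
import numpy as np

t = np.linspace(0, 1, 20)                       # assumed data for the fit
b = 2.0 * np.exp(-1.0 * t)                      # generated by x = (2, -1)

r = lambda x: x[0] * np.exp(x[1] * t) - b       # residual vector
J = lambda x: np.column_stack([np.exp(x[1] * t),
                               x[0] * t * np.exp(x[1] * t)])  # Jacobian of r

x, mu = np.array([1.0, 0.0]), 1e-2              # fixed damping for brevity
for _ in range(30):
    Jk, rk = J(x), r(x)
    # LM step: (J^T J + mu I) s = -J^T r
    # (mu -> 0: Gauss-Newton; mu -> inf: short steepest-descent step)
    s = np.linalg.solve(Jk.T @ Jk + mu * np.eye(2), -Jk.T @ rk)
    x = x + s
print(x)                                        # ~ (2, -1): recovers the data
```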

Constrained Optimization and KKT Conditions

  • State the general constrained optimization problem (equality constraints $\boldsymbol{g}=\boldsymbol{0}$, inequality constraints $\boldsymbol{h}\ge\boldsymbol{0}$).
  • What is the Lagrangian? $$ \mathcal{L}(\boldsymbol{x},\boldsymbol{\lambda}_g,\boldsymbol{\lambda}_h) = f(\boldsymbol{x}) - \boldsymbol{\lambda}_g^T\boldsymbol{g}(\boldsymbol{x}) - \boldsymbol{\lambda}_h^T\boldsymbol{h}(\boldsymbol{x}) $$
  • State the KKT necessary conditions:
    • $\nabla_{\boldsymbol{x}}\mathcal{L} = \boldsymbol{0}$
    • $\boldsymbol{g}(\boldsymbol{x}^*) = \boldsymbol{0}$
    • $\boldsymbol{h}(\boldsymbol{x}^*) \ge \boldsymbol{0}$
    • $\boldsymbol{\lambda}_h^* \ge \boldsymbol{0}$
    • Complementarity: $\boldsymbol{h}(\boldsymbol{x}^*)\cdot\boldsymbol{\lambda}_h^* = 0$
  • What does it mean for a constraint to be active at $\boldsymbol{x}^*$?
  • Why does the sign of the multiplier matter for inequality constraints?
  • What is the constraint qualification, and what can go wrong when it fails? (e.g., two active constraints with parallel gradients $\Rightarrow$ multipliers not unique or KKT fails to be necessary.)
  • What is Farkas' Lemma and how does it underpin KKT?
  • What computational approach solves the KKT equations? (Apply Newton's method to a subset of the KKT conditions, as in the sketch below.)
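
A sketch of that approach for an assumed equality-constrained toy problem: minimize $x_0^2 + x_1^2$ subject to $x_0 + x_1 - 1 = 0$. Newton's method is applied to the stationarity and primal-feasibility conditions; here the KKT system is linear, so one step already suffices.

```python
import numpy as np

# KKT residual for L(x, lam) = f(x) - lam * g(x):
#   grad f(x) - lam * grad g(x) = 0   (stationarity)
#   g(x) = 0                          (primal feasibility)
def kkt_res(z):
    x0, x1, lam = z
    return np.array([2 * x0 - lam, 2 * x1 - lam, x0 + x1 - 1])

def kkt_jac(z):                                  # Jacobian w.r.t. (x0, x1, lam)
    return np.array([[2.0, 0.0, -1.0],
                     [0.0, 2.0, -1.0],
                     [1.0, 1.0,  0.0]])

z = np.zeros(3)
for _ in range(5):                               # Newton on the KKT equations
    z = z + np.linalg.solve(kkt_jac(z), -kkt_res(z))
print(z)                                         # x* = (0.5, 0.5), lambda* = 1
```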

Interpolation: Setup and Vandermonde Matrices

  • What is the interpolation problem? Given nodes $(x_i)$ and values $(y_i)$, find $f$ with $f(x_i)=y_i$.
  • How is the problem made unique? (Pick a basis $\{\varphi_j\}$, set $N_\text{func}=N$, solve the Vandermonde system $V\boldsymbol{\alpha}=\boldsymbol{y}$.)
  • What is the Vandermonde matrix $V_{ij} = \varphi_j(x_i)$?
  • What happens to $V$ when a node is repeated? (Two rows become identical $\Rightarrow$ $V$ is singular.)
  • What is the Lebesgue constant $\Lambda$, and what does it measure?
  • How does $\Lambda$ connect interpolation to best-approximation? $\|f - Pf\|_\infty \le (1+\Lambda)\|f-q\|_\infty$ for any polynomial $q$.
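
A minimal Vandermonde-interpolation sketch in the monomial basis (the function, interval, and node count are illustrative assumptions):

```python
import numpy as np

f, n = np.sin, 6
x = np.linspace(0, np.pi, n)                     # nodes
V = np.vander(x, increasing=True)                # V[i, j] = x_i ** j
alpha = np.linalg.solve(V, f(x))                 # coefficients: V alpha = y

xx = np.linspace(0, np.pi, 200)
p = np.vander(xx, n, increasing=True) @ alpha    # evaluate the interpolant
print(np.max(np.abs(p - f(xx))))                 # max interpolation error
print(np.linalg.cond(V))                         # conditioning of V
```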

Interpolation Bases

  • What are the Lagrange basis polynomials $\varphi_j$? (They make $V=I$; $\varphi_j(x_i)=\delta_{ij}$.)
  • Write down the Lagrange interpolant for nodes $(x_i)$ and values $(y_i)$.
  • What is the Newton (triangular) basis? How are its coefficients found? ($O(n^2)$ via forward substitution.)
  • Why are monomial bases on equispaced nodes problematic? (Near-linear dependence $\Rightarrow$ ill-conditioned $V$.)
  • What are orthogonal polynomials? How are Legendre polynomials constructed? (Gram-Schmidt on monomials under $\langle f,g\rangle=\int_{-1}^1 fg\,dx$.)
  • State the three-term recurrence for Legendre polynomials.
  • What are Chebyshev polynomials? Give three equivalent definitions.
  • What are the Chebyshev nodes (extrema and roots of $T_k$)? Why are edge-clustered nodes beneficial?
  • Why is Chebyshev interpolation preferred over equispaced polynomial interpolation?
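
A quick comparison you can run, assuming Runge's function $1/(1+25x^2)$ on $[-1,1]$ (the classic example) with equispaced vs. Chebyshev nodes:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)          # Runge's function
n = 12
x_eq = np.linspace(-1, 1, n)                     # equispaced nodes
x_ch = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))  # roots of T_n

xx = np.linspace(-1, 1, 500)
for nodes in (x_eq, x_ch):
    a = np.linalg.solve(np.vander(nodes, increasing=True), f(nodes))
    p = np.vander(xx, n, increasing=True) @ a
    print(np.max(np.abs(p - f(xx))))             # Chebyshev error is far smaller
```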

Interpolation Error

  • State the interpolation error formula: $$ f(x) - p_{n-1}(x) = \frac{f^{(n)}(\xi)}{n!}\prod_{i=1}^n(x-x_i) $$
  • What is the simplified error bound when $h=x_n-x_1$? $$ \max_x|f(x)-p_{n-1}(x)| \le C M h^n, \quad M=\max|f^{(n)}| $$
  • How do Chebyshev nodes minimize $\max_x|\prod(x-x_i)|$?
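
An empirical check of the $O(h^n)$ bound (the test function and interval sizes are assumptions): with $n=3$ nodes on shrinking intervals $[0,h]$, the error should drop by roughly $2^3 = 8$ per halving of $h$.

```python
import numpy as np

f, n = np.sin, 3
for h in [1.0, 0.5, 0.25, 0.125]:
    x = np.linspace(0, h, n)                     # 3 nodes on [0, h]
    a = np.linalg.solve(np.vander(x, increasing=True), f(x))
    xx = np.linspace(0, h, 200)
    err = np.max(np.abs(np.vander(xx, n, increasing=True) @ a - f(xx)))
    print(h, err)                                # error ~ C h^3
```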

Piecewise Interpolation and Splines

  • What is piecewise linear interpolation? What is its error order? ($O(h^2)$)
  • What is a natural cubic spline? How many unknowns and equations are there?
  • What conditions are imposed at interior nodes for a cubic spline? (Continuity of $f$, $f'$, $f''$.)
  • What are the "extra" boundary conditions for a natural spline? ($f''=0$ at the endpoints.)
  • What are B-splines?
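
A SciPy usage sketch for a natural cubic spline (the data are an illustrative assumption); the natural boundary condition shows up as $f''=0$ at both endpoints:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 2 * np.pi, 8)
spl = CubicSpline(x, np.sin(x), bc_type='natural')  # f'' = 0 at both ends

xx = np.linspace(0, 2 * np.pi, 200)
print(np.max(np.abs(spl(xx) - np.sin(xx))))      # interpolation error
print(spl(x[0], 2), spl(x[-1], 2))               # second derivative ~ 0 at ends
```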

Quadrature: Interpolatory/Newton-Cotes

  • Derive interpolatory quadrature from the Lagrange interpolant: $$ \int_a^b f\,dx \approx \sum_i \omega_i f(x_i), \qquad \omega_i = \int_a^b \ell_i(x)\,dx $$
  • What is the method of undetermined coefficients for finding quadrature weights?
  • To what polynomial degree are the midpoint, trapezoidal, and Simpson's rules exact? (Midpoint: 1; Trapezoidal: 1; Simpson's: 3.)
  • Why do rules with an odd number of nodes gain an extra degree of exactness? (Cancellation of odd-order error terms by symmetry.)
  • What are the drawbacks of Newton-Cotes quadrature at high order? (Possible large negative weights $\Rightarrow$ instability; Runge-like divergence.)
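
A sketch of the method of undetermined coefficients: requiring exactness on $1, x, \ldots, x^{n-1}$ gives a transposed-Vandermonde system for the weights. As a check, it recovers Simpson's rule, with weights $(1/6, 4/6, 1/6)$ on $[0,1]$.

```python
import numpy as np

def interp_weights(x, a, b):
    """Weights of the interpolatory rule on nodes x over [a, b]:
    enforce sum_i w_i x_i^j = int_a^b x^j dx for j = 0, ..., n-1."""
    n = len(x)
    V = np.vander(x, increasing=True)            # V[i, j] = x_i^j
    moments = np.array([(b**(j + 1) - a**(j + 1)) / (j + 1) for j in range(n)])
    return np.linalg.solve(V.T, moments)

# Simpson's rule on [0, 1]:
print(interp_weights(np.array([0.0, 0.5, 1.0]), 0.0, 1.0))
```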

Quadrature: Accuracy and Stability

  • Derive the accuracy bound for interpolatory quadrature: using the interpolation error, $$ \left|\int_a^b f\,dx - \sum_i\omega_i f(x_i)\right| \le C h^{n+1}\|f^{(n)}\|_\infty $$ (quadrature gains one order over interpolation).
  • Why do quadratures with large negative weights have poor stability?
  • What is Gaussian quadrature? How many points does it need to integrate degree-$(2n-1)$ polynomials exactly? ($n$ points.)
  • How are Gaussian quadrature nodes related to orthogonal polynomials?
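
A quick check with NumPy's Gauss-Legendre nodes and weights: $n=3$ points integrate up to degree $2n-1 = 5$ exactly on $[-1,1]$, but not degree 6.

```python
import numpy as np

n = 3
x, w = np.polynomial.legendre.leggauss(n)        # nodes = roots of Legendre P_3
print(w @ x**4, 2.0 / 5.0)                       # degree 4: exact
print(w @ x**5, 0.0)                             # degree 5 = 2n-1: exact
print(w @ x**6, 2.0 / 7.0)                       # degree 6: no longer exact
```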

Composite Quadrature

  • What is composite quadrature? Why is it preferred over a single high-order rule?
  • Derive the error bound for composite quadrature using $m$ panels of length $h$: $$ \left|\int_a^b f - \text{composite}\right| \le C\|f^{(n)}\|_\infty h^n(b-a) $$ (composite loses one order vs. non-composite).
  • What is adaptivity? Why is it useful?
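
A composite-trapezoid sketch (the integrand and panel counts are illustrative): since the rule is $O(h^2)$, the error should shrink by about $4\times$ each time the number of panels doubles.

```python
import numpy as np

def composite_trapezoid(f, a, b, m):
    """Trapezoidal rule on m equal panels of width h = (b - a) / m."""
    x = np.linspace(a, b, m + 1)
    h = (b - a) / m
    return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

exact = 2.0                                      # int_0^pi sin(x) dx
for m in [4, 8, 16, 32]:
    print(m, abs(composite_trapezoid(np.sin, 0.0, np.pi, m) - exact))
```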

Numerical Differentiation and Finite Differences

  • Why is numerical differentiation harder than integration? (Ill-conditioned; amplifies noise; subject to cancellation.)
  • What is the accuracy order of numerical differentiation with $n$ interpolation points? ($O(h^{n-1})$ — one order lower than interpolation.)
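
A demonstration of the cancellation issue (the step sizes are illustrative): the forward-difference error first shrinks with $h$ (truncation error) and then grows again (rounding error), with the best accuracy near $h \approx \sqrt{\varepsilon_{\text{mach}}}$.

```python
import numpy as np

x0 = 1.0
for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    fd = (np.exp(x0 + h) - np.exp(x0)) / h       # forward difference
    print(h, abs(fd - np.exp(x0)))               # error vs. exact derivative
```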

Differentiation Matrices

  • How is numerical differentiation cast as a matrix-vector operation? $$ D = V' V^{-1}, \qquad f'(\boldsymbol{x}) \approx D f(\boldsymbol{x}) $$ where $V'_{ij} = \varphi_j'(x_i)$.
  • Does $D$ depend on the function being differentiated? (No; it depends only on the nodes and the interpolation basis.)
  • What is the nullspace of $D$? (Constant vectors; rows of $D$ always sum to zero.)
  • How do you compute second derivatives with $D$? ($D^2$.)
  • How does $D$ change under a shift of the nodes? (It does not: $D_{\boldsymbol{x}+c} = D_{\boldsymbol{x}}$.)
  • How does $D$ change under a scaling of the nodes? ($D_{\alpha\boldsymbol{x}} = D_{\boldsymbol{x}}/\alpha$.)
  • How do rows of $D$ relate to finite difference formulas? (Row $i$ gives the FD weights for $f'(x_i)$.)
  • How is a single small $D$ (e.g. $3\times 3$) applied to a large equispaced grid? (First row for left boundary, interior row for interior points, last row for right boundary.)
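
A sketch building $D = V'V^{-1}$ in the monomial basis on three assumed nodes; the rows reproduce the standard three-point finite-difference weights.

```python
import numpy as np

def diff_matrix(x):
    """Differentiation matrix D = V' V^{-1} in the monomial basis."""
    n = len(x)
    V = np.vander(x, increasing=True)            # V[i, j] = x_i^j
    Vp = np.zeros_like(V)
    Vp[:, 1:] = V[:, :-1] * np.arange(1, n)      # V'[i, j] = j * x_i^(j-1)
    return Vp @ np.linalg.inv(V)

x = np.array([-1.0, 0.0, 1.0])
D = diff_matrix(x)
print(D)                                         # rows = FD weights at each node
print(D @ x**2)                                  # equals 2x: exact on degree <= 2
print(D.sum(axis=1))                             # rows sum to zero (constants)
```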