<h1 id="differentiable-programming-from-scratch">Differentiable Programming from Scratch</h1>
<p>Max Slater, 2022-07-31</p>
<p><a href="https://en.wikipedia.org/wiki/Differentiable_programming">Differentiable programming</a> has been a hot research topic over the past few years, and not only due to the popularity of machine learning libraries like TensorFlow, PyTorch, and JAX. Many fields beyond machine learning are finding differentiable programming to be a useful tool for solving optimization problems. In computer graphics, differentiable <a href="https://www.youtube.com/watch?v=Tou8or1ed6E">rendering</a>, differentiable <a href="https://physicsbaseddeeplearning.org/diffphys.html">physics</a>, and <a href="https://nvlabs.github.io/instant-ngp/">neural representations</a> are all poised to be important tools going forward.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#prerequisites">Prerequisites</a>
<ul>
<li><a href="#differentiation">Differentiation</a></li>
<li><a href="#optimization">Optimization</a></li>
</ul>
</li>
<li><a href="#differentiating-code">Differentiating Code</a>
<ul>
<li><a href="#numerical">Numerical</a></li>
<li><a href="#symbolic">Symbolic</a></li>
</ul>
</li>
<li><a href="#automatic-differentiation">Automatic Differentiation</a>
<ul>
<li><a href="#forward-mode">Forward Mode</a></li>
<li><a href="#backward-mode">Backward Mode</a></li>
</ul>
</li>
<li><a href="#de-blurring-an-image">De-blurring an Image</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
<h1 id="prerequisites">Prerequisites</h1>
<h2 id="differentiation">Differentiation</h2>
<p>It all starts with the definition you learn in calculus class:</p>
\[f^{\prime}(x) = \lim_{h\rightarrow 0} \frac{f(x + h) - f(x)}{h}\]
<p>In other words, the derivative computes how much \(f(x)\) changes when \(x\) is perturbed by an infinitesimal amount. If \(f\) is a one-dimensional function from \(\mathbb{R} \mapsto \mathbb{R}\), the derivative \(f^{\prime}(x)\) returns the slope of the graph of \(f\) at \(x\).</p>
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/slope.svg" /></div>
<p>However, there’s another perspective that provides better intuition in higher dimensions. If we think of \(f\) as a map from points in its domain to points in its range, we can think of \(f^{\prime}(x)\) as a map from <em>vectors</em> based at \(x\) to <em>vectors</em> based at \(f(x)\).</p>
<h3 id="one-to-one">One-to-One</h3>
<p>In 1D, this distinction is a bit subtle, as a 1D “vector” is just a single number. Still, evaluating \(f^{\prime}(x)\) shows us how a vector placed at \(x\) is scaled when transformed by \(f\). That’s just the slope of \(f\) at \(x\).</p>
<div class="center" style="width: 100%;"><img class="diagram2d_double center" src="/assets/autodiff/vec1d.svg" /></div>
<h3 id="many-to-one">Many-to-One</h3>
<p>If we consider a function \(g(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}\) (two inputs, one output), this perspective will become clearer. We can differentiate \(g\) with respect to any particular input, known as a partial derivative:</p>
\[g_x(x,y) = \lim_{h\rightarrow0} \frac{g(x+h,y) - g(x,y)}{h}\]
<p>The function \(g_x\) produces the change in \(g\) given a change in \(x\). If we combine it with the partial derivative for \(y\), we get the derivative, or <em>gradient</em>, of \(g\):</p>
\[\nabla g(x,y) = \begin{bmatrix}g_x(x,y)\\g_y(x,y)\end{bmatrix}\]
<p>That is, \(\nabla g(x,y)\) tells us how \(g\) changes if we change either \(x\) or \(y\). If we take the dot product of \(\nabla g(x,y)\) with a vector of differences \(\Delta x,\Delta y\), we’ll get their combined effect on \(g\):</p>
\[\nabla g(x,y) \cdot \begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} = \Delta xg_x(x,y) + \Delta yg_y(x,y)\]
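<p>To make this concrete, here is a small sketch using a hypothetical function \(g(x,y) = x^2 + y\) (not one from this article), predicting a change in \(g\) from its gradient:</p>

```javascript
// Hypothetical example: g(x, y) = x^2 + y, with partials
// g_x(x, y) = 2x and g_y(x, y) = 1.
function g(x, y) {
    return x * x + y;
}

function grad_g(x, y) {
    return [2 * x, 1];
}

// Dot the gradient with a vector of input differences to predict
// the combined effect on g.
function change_in_g(x, y, dx, dy) {
    const [gx, gy] = grad_g(x, y);
    return gx * dx + gy * dy;
}

// At (3, 5), nudging the input by (0.01, 0.02) changes g by about
// 2*3*0.01 + 1*0.02 = 0.08.
const predicted = change_in_g(3, 5, 0.01, 0.02);
const actual = g(3.01, 5.02) - g(3, 5);
```

<p>For small steps the prediction closely matches the true difference, since the gradient is a linear approximation of \(g\) at the point.</p>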
<p>It’s tempting to think of the gradient as just another vector. However, it’s often useful to think of the gradient as a <em>higher-order function</em>: \(\nabla g\) is a function that, when evaluated at \(x,y\), gives us <em>another function</em> that transforms <em>vectors</em> based at \(x,y\) to <em>vectors</em> based at \(g(x,y)\).</p>
<p>It just so happens that the function returned by our gradient is linear, so it can be represented as a dot product with the gradient vector.</p>
<!-- $$ \nabla g(x,y) (\Delta x, \Delta y) = \begin{bmatrix}g_x(x,y)\\g_y(x,y)\end{bmatrix}\cdot\begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} $$ -->
<div class="center" style="width: 80%;"><img class="diagram2d_double center" src="/assets/autodiff/vec2d.svg" /></div>
<p>The gradient is typically explained as the “direction of steepest ascent.” Why is that? When we evaluate the gradient at a point \(x,y\) and a vector \(\Delta x, \Delta y\), the result is a change in the output of \(g\). If we <em>maximize</em> the change in \(g\) with respect to the input vector, we’ll get the direction that makes the output increase the fastest.</p>
<p>Since the gradient function is just a dot product with \(\left[g_x(x,y), g_y(x,y)\right]\), the direction \(\left[\Delta x, \Delta y\right]\) that maximizes the result is easy to find: it’s parallel to \(\left[g_x(x,y), g_y(x,y)\right]\). That means the gradient vector is, in fact, the direction of steepest ascent.</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" src="/assets/autodiff/grad.svg" /></div>
<!-- (Aside: the relationship between maximizing the gradient and the direction of steepest ascent holds in arbitrary dimensions/spaces, but in curved spaces, it's not as simple as pulling out the vector.) -->
<h4 id="the-directional-derivative">The Directional Derivative</h4>
<p>Another important term is the <em>directional derivative</em>, which computes the derivative of a function along an arbitrary direction. It is a generalization of the partial derivative, which evaluates the directional derivative along a coordinate axis.</p>
\[D_{\mathbf{v}}f(x)\]
<p>Above, we discovered that our “gradient function” could be expressed as a dot product with the gradient vector. That was, actually, the directional derivative:</p>
\[D_{\mathbf{v}}f(x) = \nabla f(x) \cdot \mathbf{v}\]
<p>Which again illustrates why the gradient vector is the direction of steepest ascent: it is the \(\mathbf{v}\) that maximizes the directional derivative. Note that in curved spaces, the “steepest ascent” definition of the gradient still holds, but the directional derivative becomes more complicated than a dot product.</p>
<h3 id="many-to-many">Many-to-Many</h3>
<p>For completeness, let’s examine how this perspective extends to vector-valued functions of multiple variables. Consider \(h(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) (two inputs, two outputs).</p>
\[h(x,y) = \begin{bmatrix}h_1(x,y)\\h_2(x,y)\end{bmatrix}\]
<p>We can take the gradient of each part of \(h\):</p>
\[\begin{align*}
\nabla h_1(x,y) &= \begin{bmatrix}h_{1_x}(x,y)\\ h_{1_y}(x,y)\end{bmatrix} \\
\nabla h_2(x,y) &= \begin{bmatrix}h_{2_x}(x,y)\\ h_{2_y}(x,y)\end{bmatrix}
\end{align*}\]
<p>What object would represent \(\nabla h\)? We can build a matrix from the gradients of each component, called the <em>Jacobian</em>:</p>
\[\nabla h(x,y) = \begin{bmatrix}h_{1_x}(x,y) & h_{1_y}(x,y)\\ h_{2_x}(x,y) & h_{2_y}(x,y)\end{bmatrix}\]
<p>Above, we argued that the gradient (when evaluated at \(x,y\)) gives us a map from input <em>vectors</em> to output <em>vectors</em>. That remains the case here: the Jacobian is a 2x2 matrix, so it transforms 2D \(\Delta x,\Delta y\) vectors into 2D \(\Delta h_1, \Delta h_2\) vectors.</p>
<div class="center" style="width: 85%;"><img class="diagram2d_double center" src="/assets/autodiff/vec2d2d.svg" /></div>
<p>Adding more dimensions starts to make our functions hard to visualize, but we can always rely on the fact that the derivative tells us how input vectors (i.e. changes in the input) get mapped to output vectors (i.e. changes in the output).</p>
<h3 id="the-chain-rule">The Chain Rule</h3>
<p>The last aspect of differentiation we’ll need to understand is how to differentiate function composition.</p>
\[h(x) = g(f(x)) \implies h^\prime(x) = g^\prime(f(x))\cdot f^\prime(x)\]
<p>We could prove this fact with a bit of real analysis, but the relationship is again easier to understand by thinking of the derivative as a higher-order function. In this perspective, the chain rule itself is just function composition.</p>
<p>For example, let’s assume \(h(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\) is composed of \(f(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) and \(g(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\).</p>
<p>In order to translate a \(\Delta \mathbf{x}\) vector to a \(\Delta h\) vector, we can first use \(f^\prime\) to map \(\Delta \mathbf{x}\) to \(\Delta \mathbf{f}\), based at \(\mathbf{x}\). Then we can use \(g^\prime\) to map \(\Delta \mathbf{f}\) to \(\Delta g\), based at \(f(\mathbf{x})\).</p>
<div class="center" style="width: 100%;"><img class="center" style="width: 100%;" src="/assets/autodiff/vec2d2d1d.svg" /></div>
<p>Because our derivatives/gradients/Jacobians are linear functions, we’ve been representing them as scalars/vectors/matrices, respectively. That means we can easily compose them with the typical linear algebraic multiplication rules. Writing out the above example symbolically:</p>
\[\begin{align*} \nabla h(\mathbf{x}) &= \nabla g(f(\mathbf{x}))\cdot \nabla f(\mathbf{x}) \\
&= \begin{bmatrix}g_{x_1}(f(\mathbf{x})) & g_{x_2}(f(\mathbf{x}))\end{bmatrix} \cdot \begin{bmatrix}f_{1_{x_1}}(\mathbf{x}) & f_{1_{x_2}}(\mathbf{x})\\ f_{2_{x_1}}(\mathbf{x}) & f_{2_{x_2}}(\mathbf{x})\end{bmatrix} \\
&= \begin{bmatrix}g_{x_1}(f(\mathbf{x}))f_{1_{x_1}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_1}}(\mathbf{x}) & g_{x_1}(f(\mathbf{x}))f_{1_{x_2}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_2}}(\mathbf{x})\end{bmatrix}
\end{align*}\]
<p>The result is a 2D vector representing a gradient that transforms 2D vectors to 1D vectors. The composed function \(h\) had two inputs and one output, so that’s correct. We can also notice that each term corresponds to the chain rule applied to a different computational path from a component of \(\mathbf{x}\) to \(h\).</p>
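<p>The same composition can be carried out numerically. Below is a sketch with hypothetical components \(f(x,y) = [xy,\ x+y]\) and \(g(u,v) = uv\): multiplying the row gradient of \(g\) by the Jacobian of \(f\) yields the gradient of \(h = g \circ f\):</p>

```javascript
// Hypothetical example: f(x, y) = [x*y, x + y] and g(u, v) = u*v,
// so h(x, y) = g(f(x, y)) = xy(x + y).

// Jacobian of f: rows are the gradients of f_1 and f_2.
function jacobian_f(x, y) {
    return [[y, x],
            [1, 1]];
}

// Gradient of g as a row vector [g_u, g_v].
function grad_g(u, v) {
    return [v, u];
}

// Chain rule: (row vector) times (2x2 Jacobian) gives a row vector.
function grad_h(x, y) {
    const J = jacobian_f(x, y);
    const [gu, gv] = grad_g(x * y, x + y);
    return [gu * J[0][0] + gv * J[1][0],
            gu * J[0][1] + gv * J[1][1]];
}

// Differentiating h(x, y) = x^2 y + x y^2 by hand gives
// [2xy + y^2, x^2 + 2xy], which the composition reproduces.
const gh = grad_h(2, 3);
```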
<h2 id="optimization">Optimization</h2>
<p>We will focus on the application of differentiation to optimization via gradient descent, which is often used in machine learning and computer graphics. An optimization problem always involves computing the following expression:</p>
\[\underset{\mathbf{x}}{\arg\!\min} f(\mathbf{x})\]
<p>Which simply means “find the \(\mathbf{x}\) that results in the smallest possible value of \(f\).” The function \(f\), typically scalar-valued, is traditionally called an “energy,” or in machine learning, a “loss function.” Extra constraints are often enforced to limit the valid options for \(\mathbf{x}\), but we will disregard constrained optimization for now.</p>
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/argmin.svg" /></div>
<p>One way to solve an optimization problem is to iteratively follow the gradient of \(f\) “downhill.” This algorithm is known as gradient descent:</p>
<ul>
<li>Pick an initial guess \(\mathbf{\bar{x}}\).</li>
<li>Repeat:
<ul>
<li>Compute the gradient \(\nabla f(\mathbf{\bar{x}})\).</li>
<li>Step along the gradient: \(\mathbf{\bar{x}} \leftarrow \mathbf{\bar{x}} - \tau\nabla f(\mathbf{\bar{x}})\).</li>
</ul>
</li>
<li>while \(\|\nabla f(\mathbf{\bar{x}})\| > \epsilon\).</li>
</ul>
<p>Given some starting point \(\mathbf{x}\), computing \(\nabla f(\mathbf{x})\) will give us the direction from \(\mathbf{x}\) that would increase \(f\) the fastest. Hence, if we move our point \(\mathbf{x}\) a small distance \(\tau\) along the negated gradient, we will decrease the value of \(f\). The number \(\tau\) is known as the step size (or in ML, the learning rate). By iterating this process, we will eventually find an \(\mathbf{x}\) such that \(\nabla f(\mathbf{x}) \simeq 0\), which is hopefully the minimizer.</p>
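<p>Sketched in code, with a hypothetical quadratic energy \(f(x,y) = (x-1)^2 + (y+2)^2\) whose gradient we write by hand, the algorithm looks like this:</p>

```javascript
// Gradient of the hypothetical energy f(x, y) = (x-1)^2 + (y+2)^2,
// which has a unique minimizer at (1, -2).
function grad_f(p) {
    return [2 * (p[0] - 1), 2 * (p[1] + 2)];
}

// Basic gradient descent: step along the negated gradient until
// the gradient norm falls below eps.
function gradient_descent(x0, tau, eps) {
    let x = x0.slice();
    let g = grad_f(x);
    while (Math.hypot(g[0], g[1]) > eps) {
        x[0] -= tau * g[0];
        x[1] -= tau * g[1];
        g = grad_f(x);
    }
    return x;
}

const result = gradient_descent([5, 5], 0.1, 1e-6);
```

<p>With \(\tau = 0.1\), the distance to the minimizer shrinks by a factor of \(0.8\) each iteration; for this particular quadratic, \(\tau = 1\) oscillates forever and larger values diverge, illustrating the step-size tradeoff discussed below.</p>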
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/descent.svg" /></div>
<p>This description of gradient descent makes optimization sound easy, but in reality there is a lot that can go wrong. When gradient descent terminates, the result is only required to be a critical point of \(f\), i.e. somewhere \(f\) becomes flat. That means we could wind up at a maximum (unlikely), a saddle point (possible), or a <em>local</em> minimum (likely). At a local minimum, moving \(\mathbf{x}\) in any direction would increase the value of \(f\), but \(f(\mathbf{x})\) is <strong>not</strong> necessarily the minimum value \(f\) can take on globally.</p>
<table class="mathtable" style="width: 100%;"><tr>
<td style="width: 29%;"><img class="center" src="/assets/autodiff/maximum.svg" /></td>
<td style="width: 32%; padding-left: 5px;"><img class="center" src="/assets/autodiff/saddle.svg" /></td>
<td style="width: 39%;"><img class="center" src="/assets/autodiff/localmin.svg" /></td>
</tr><tr><td style="text-align: center;">Maximum</td><td style="text-align: center;">Saddle Point</td><td style="text-align: center;">Local Minimum</td></tr></table>
<p>Gradient descent can also diverge (i.e. never terminate) if \(\tau\) is too large. Because the gradient is only a linear approximation of \(f\), if we step too far along it, we might skip over changes in \(f\)’s behavior—or even end up <em>increasing</em> both \(f\) and \(\nabla f\). On the other hand, the smaller we make \(\tau\), the longer our algorithm takes to converge. Note that we’re assuming \(f\) has a lower bound and achieves it at a finite \(\mathbf{x}\) in the first place.</p>
<table class="mathtable" style="width: 100%;"><tr>
<td style="width: 50%;"><img class="diagram2d center" src="/assets/autodiff/diverge.svg" /></td>
<td style="width: 50%;"><img class="diagram2d center" src="/assets/autodiff/converge_slow.svg" /></td>
</tr><tr><td style="text-align: center;">Divergence</td><td style="text-align: center;">Slow Convergence</td></tr></table>
<p>The algorithm presented here is the most basic form of gradient descent: much research has been dedicated to devising loss functions and descent algorithms that have higher likelihoods of converging to reasonable results. The practice of adding constraints, loss function terms, and update rules is known as <em>regularization</em>. In fact, optimization is a whole field in itself: if you’d like to learn more, there’s a vast amount of literature to refer to, especially within machine learning. <a href="https://distill.pub/2017/momentum/">This interactive article</a> explaining <em>momentum</em> is a great example.</p>
<h1 id="differentiating-code">Differentiating Code</h1>
<p>Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider <em>pure</em> functions, i.e. functions whose output depends solely on their parameters (no mutable state).</p>
<p>If your program implements a relatively simple mathematical expression, it’s not too difficult to manually write another function that evaluates its derivative. However, what if your program is a deep neural network, or a physics simulation? It’s not feasible to differentiate something like that by hand, so we must turn to algorithms for automatic differentiation.</p>
<p>There are several techniques for differentiating programs. We will first look at numeric and symbolic differentiation, both of which have been in use as long as computers have existed. However, these approaches are distinct from the algorithm we now know as <em>autodiff</em>, which we will discuss later.</p>
<h2 id="numerical">Numerical</h2>
<p>Numerical differentiation is the most straightforward technique: it simply approximates the definition of the derivative.</p>
\[f^\prime(x) = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h} \simeq \frac{f(x+0.001)-f(x)}{0.001}\]
<p>By choosing a small \(h\), all we have to do is evaluate \(f\) at \(x\) and \(x+h\). This technique is also known as differentiation via <em>finite differences</em>.</p>
<p>Implementing numeric differentiation as a higher order function is quite easy. It doesn’t even require modifying the function to differentiate:</p>
<pre id="numerical-example" class="editor">
function numerical_diff(f, h) {
    return function (x) {
        return (f(x + h) - f(x)) / h;
    };
}

let df = numerical_diff(f, 0.001);
</pre>
<p>You can edit the following JavaScript example, where \(f\) is drawn in blue and numerical_diff(\(f\), 0.001) is drawn in purple. Note that using control flow is not a problem:</p>
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="numerical-editor" class="editor"></div>
<button id="numerical-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="numerical-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
<p>Unfortunately, finite differences have a big problem: they only compute the derivative of \(f\) in one direction. If our input is very high dimensional, computing the full gradient of \(f\) becomes computationally infeasible, as we would have to evaluate \(f\) for each dimension separately.</p>
<p>That said, if you only need to compute one directional derivative of \(f\), the full gradient is overkill: instead, compute a finite difference between \(f(\mathbf{x})\) and \(f(\mathbf{x} + \Delta\mathbf{x})\), where \(\Delta\mathbf{x}\) is a small step in your direction of interest.</p>
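<p>A sketch of that idea, assuming \(f\) takes an array of inputs (the function names here are hypothetical):</p>

```javascript
// Directional derivative via a single finite difference: one extra
// evaluation of f, regardless of the input dimension.
function directional_diff(f, x, v, h) {
    // Normalize v so that h alone controls the step length.
    const norm = Math.hypot(...v);
    const step = x.map((xi, i) => xi + (h * v[i]) / norm);
    return (f(step) - f(x)) / h;
}

// Example: f(x) = sum of squares, whose gradient is 2x, so the
// derivative along v is (2x . v) / |v|. Along the y-axis at
// (1, 2, 3), that's 2*2 = 4.
const sumsq = x => x.reduce((s, xi) => s + xi * xi, 0);
const d = directional_diff(sumsq, [1, 2, 3], [0, 1, 0], 1e-5);
```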
<p>Finally, always remember that numerical differentiation is only an approximation: we aren’t computing the actual limit as \(h \rightarrow 0\). While finite differences are quite easy to implement and can be very useful for validating other results, the technique should usually be superseded by another approach.</p>
<h2 id="symbolic">Symbolic</h2>
<p>Symbolic differentiation involves transforming a representation of \(f\) into a representation of \(f^\prime\). Unlike numerical differentiation, this requires specifying \(f\) in a domain-specific language where each syntactic construct has a known differentiation rule.</p>
<p>However, that limitation isn’t so bad—we can create a compiler that differentiates expressions in our symbolic language for us. This is the technique used in computer algebra packages like Mathematica.</p>
<p>For example, we could create a simple language of polynomials that is symbolically differentiable using the following set of recursive rules:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(n) -> 0
d(x) -> 1
d(Add(a, b)) -> Add(d(a), d(b))
d(Times(a, b)) -> Add(Times(d(a), b), Times(a, d(b)))
</code></pre></div></div>
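<p>The four rules translate almost directly into code. Here is a minimal sketch, using an object representation of my own (not the one used by the interactive example below):</p>

```javascript
// Expressions are numbers, the string 'x', or {op, a, b} nodes.
function d(e) {
    if (typeof e === 'number') return 0;      // d(n) -> 0
    if (e === 'x') return 1;                  // d(x) -> 1
    if (e.op === 'Add')                       // sum rule
        return {op: 'Add', a: d(e.a), b: d(e.b)};
    if (e.op === 'Times')                     // product rule
        return {op: 'Add',
                a: {op: 'Times', a: d(e.a), b: e.b},
                b: {op: 'Times', a: e.a, b: d(e.b)}};
}

// Evaluate an expression at a given value of x.
function evaluate(e, x) {
    if (typeof e === 'number') return e;
    if (e === 'x') return x;
    const a = evaluate(e.a, x), b = evaluate(e.b, x);
    return e.op === 'Add' ? a + b : a * b;
}

// d(x * x) represents 1*x + x*1 = 2x, e.g. 6 at x = 3.
const df = d({op: 'Times', a: 'x', b: 'x'});
```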
<table class="codetable" style="margin-top: 15px; margin-bottom: 15px;">
<tr>
<td style="width: 45%; padding-right: 10px;">
<div id="symbolic-editor" class="editor"></div>
<button id="symbolic-button" type="button">Differentiate</button>
</td>
<td style="width: 55%;">
<pre id="symbolic-output" class="editor"></pre>
<canvas id="symbolic-canvas" class="diagram_canvas" style="margin-top: 10px;"></canvas>
</td>
</tr>
</table>
<p>If we want our differentiable language to support more operations, we can simply add more differentiation rules. For example, to support trig functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(sin a) -> Times(d(a), cos a)
d(cos a) -> Times(d(a), Times(-1, sin a))
</code></pre></div></div>
<p>Unfortunately, there’s a catch: the size of \(f^\prime\)’s representation can become very large. Let’s write another recursive relationship that counts the number of terms in an expression:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Terms(n) -> 1
Terms(x) -> 1
Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1
Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1
</code></pre></div></div>
<p>And then prove that <code class="language-plaintext highlighter-rouge">Terms(a) <= Terms(d(a))</code>, i.e. differentiating an expression cannot decrease the number of terms:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Base Cases:
Terms(d(n)) -> 1 | Definition
Terms(n) -> 1 | Definition
=> Terms(n) <= Terms(d(n))
Terms(d(x)) -> 1 | Definition
Terms(x) -> 1 | Definition
=> Terms(x) <= Terms(d(x))
Inductive Case for Add:
Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 | Definition
Terms(d(Add(a, b))) -> Terms(d(a)) + Terms(d(b)) + 1 | Definition
Terms(a) <= Terms(d(a)) | Hypothesis
Terms(b) <= Terms(d(b)) | Hypothesis
=> Terms(Add(a, b)) <= Terms(d(Add(a, b)))
Inductive Case for Times:
Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 | Definition
Terms(d(Times(a, b))) -> Terms(a) + Terms(b) + 3 +
Terms(d(a)) + Terms(d(b)) | Definition
=> Terms(Times(a, b)) <= Terms(d(Times(a, b)))
</code></pre></div></div>
<p>This result might be acceptable if the size of <code class="language-plaintext highlighter-rouge">df</code> was linear in the size of <code class="language-plaintext highlighter-rouge">f</code>, but that’s not the case. Whenever we differentiate a <code class="language-plaintext highlighter-rouge">Times</code> expression, the number of terms in the result will at least <em>double</em>. That means the size of <code class="language-plaintext highlighter-rouge">df</code> grows <em>exponentially</em> with the number of <code class="language-plaintext highlighter-rouge">Times</code> we compose. You can demonstrate this phenomenon by nesting multiple <code class="language-plaintext highlighter-rouge">Times</code> in the JavaScript example.</p>
<p>Hence, symbolic differentiation is usually infeasible at the scales we’re interested in. However, if it works for your use case, it can be quite useful.</p>
<h1 id="automatic-differentiation">Automatic Differentiation</h1>
<p>We’re finally ready to discuss the automatic differentiation algorithm actually used in modern differentiable programming: autodiff! There are two flavors of autodiff, each named for the direction in which it computes derivatives.</p>
<h2 id="forward-mode">Forward Mode</h2>
<p>Forward mode autodiff improves on our two older techniques by computing exact derivatives without building a potentially exponentially-large representation of \(f^\prime\). It is based on the mathematical definition of <em>dual numbers</em>.</p>
<h3 id="dual-numbers">Dual Numbers</h3>
<p>Dual numbers are a bit like complex numbers: they’re defined by <a href="https://en.wikipedia.org/wiki/Field_extension">adjoining</a> a new quantity \(\epsilon\) to the reals. But unlike complex numbers where \(i^2 = -1\), dual numbers use \(\epsilon^2 = 0\).</p>
<p>In particular, we can use the \(\epsilon\) part of a dual number to represent the derivative of the scalar part. If we replace each variable \(x\) with \(x + x^\prime\epsilon\), we will find that dual arithmetic naturally expresses how derivatives combine:</p>
<p>Addition:</p>
\[(x + x^\prime\epsilon) + (y + y^\prime\epsilon) = (x + y) + (x^\prime + y^\prime)\epsilon\]
<p>Multiplication:</p>
\[\begin{align*} (x + x^\prime\epsilon) * (y + y^\prime\epsilon) &= xy + xy^\prime\epsilon + x^\prime y\epsilon + x^\prime y^\prime\epsilon^2 \\
&= xy + (x^\prime y + xy^\prime)\epsilon \end{align*}\]
<p>Division:</p>
\[\begin{align*} \frac{x + x^\prime\epsilon}{y + y^\prime\epsilon} &= \frac{\frac{x}{y}+\frac{x^\prime}{y}\epsilon}{1+\frac{y^\prime}{y}\epsilon} \\
&= \left(\frac{x}{y}+\frac{x^\prime}{y}\epsilon\right)\left(1-\frac{y^\prime}{y}\epsilon\right) \\
&= \frac{x}{y} + \frac{x^\prime y - xy^\prime}{y^2}\epsilon
\end{align*}\]
<p>The chain rule also works: \(f(x + x^\prime\epsilon) = f(x) + f'(x)x^\prime\epsilon\) for any smooth function \(f\). To prove this fact, let us first show that the property holds for positive integer exponentiation.</p>
<p>Base case:
\((x+x^\prime\epsilon)^1 = x^1 + 1x^0x^\prime\epsilon\)</p>
<p>Hypothesis:
\((x+x^\prime\epsilon)^n = x^n + nx^{n-1}x^\prime\epsilon\)</p>
<p>Induct:</p>
\[\begin{align*}
(x+x^\prime\epsilon)^{n+1} &= (x^n + nx^{n-1}x^\prime\epsilon)(x+x^\prime\epsilon) \tag{Hypothesis}\\
&= x^{n+1} + x^nx^\prime\epsilon + nx^nx^\prime\epsilon + nx^{n-1}x^{\prime^2}\epsilon^2\\
&= x^{n+1} + (n+1)x^nx^\prime\epsilon
\end{align*}\]
<p>We can use this result to prove the same property for any smooth function \(f\). Examining the Taylor expansion of \(f\) at zero (also known as its <em>Maclaurin series</em>):</p>
\[f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(0)x^n}{n!} = f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots\]
<p>By plugging in our dual number…</p>
\[\begin{align*} f(x+x^\prime\epsilon) &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x+x^\prime\epsilon)^2}{2!} + \frac{f^{\prime\prime\prime}(0)(x+x^\prime\epsilon)^3}{3!} + \dots\\
&= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x^2+2xx^\prime\epsilon)}{2!} + \frac{f^{\prime\prime\prime}(0)(x^3+3x^2x^\prime\epsilon)}{3!} + \dots \\
&= f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots \\
&\phantom{= }+ \left(f^\prime(0) + f^{\prime\prime}(0)x + \frac{f^{\prime\prime\prime}(0)x^2}{2!} + \dots \right)x^\prime\epsilon \\
&= f(x) + f^\prime(x)x^\prime\epsilon
\end{align*}\]
<p>…we prove the result! In the last step, we recover the Maclaurin series for both \(f(x)\) and \(f^\prime(x)\).</p>
<h3 id="implementation">Implementation</h3>
<p>Implementing forward-mode autodiff in code can be very straightforward: we just have to replace our <code class="language-plaintext highlighter-rouge">Float</code> type with a <code class="language-plaintext highlighter-rouge">DiffFloat</code> that keeps track of both our value and its dual coefficient. If we then implement the relevant math operations for <code class="language-plaintext highlighter-rouge">DiffFloat</code>, all we have to do is run the program!</p>
<p>Unfortunately, JavaScript does not support operator overloading, so we’ll define a <code class="language-plaintext highlighter-rouge">DiffFloat</code> to be a two-element array and use functions to implement some basic arithmetic operations:</p>
<pre id="forward-lib" class="editor">
function Const(n) {
    return [n, 0];
}
function Add(x, y) {
    return [x[0] + y[0], x[1] + y[1]];
}
function Times(x, y) {
    return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]];
}
</pre>
<p>If we implement our function \(f\) in terms of these primitives, evaluating \(f([x,1])\) will return \([f(x),f^\prime(x)]\)!</p>
<p>This property extends naturally to higher-dimensional functions, too. If \(f\) has multiple outputs, their derivatives pop out in the same way. If \(f\) has inputs other than \(x\), assigning them constants means the result will be the partial derivative \(f_x\).</p>
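<p>For example (repeating the primitives above so the sketch stands alone), differentiating \(f(x) = x^2 + 3x\):</p>

```javascript
// Dual-number primitives: [value, dual coefficient].
function Const(n) { return [n, 0]; }
function Add(x, y) { return [x[0] + y[0], x[1] + y[1]]; }
function Times(x, y) { return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]]; }

// f(x) = x^2 + 3x, built from the dual-number primitives.
function f(x) {
    return Add(Times(x, x), Times(Const(3), x));
}

// Seeding the dual part with 1 yields [f(x), f'(x)]:
// f(2) = 10 and f'(2) = 2*2 + 3 = 7.
const [value, derivative] = f([2, 1]);
```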
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="forward-editor" class="editor"></div>
<button id="forward-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="forward-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
<h3 id="limitations">Limitations</h3>
<p>While forward-mode autodiff does compute exact derivatives, it suffers from the same fundamental problem as finite differences: each invocation of \(f\) can only compute the directional derivative of \(f\) for a single direction.</p>
<p>It’s useful to think of a forward mode derivative as computing one <em>column</em> of the gradient matrix. Hence, if \(f\) has few inputs but many outputs, forward mode can still be quite efficient at recovering the full gradient:</p>
\[\begin{align*}
\nabla f &= \begin{bmatrix} f_{1_x}(x,y) & f_{1_y}(x,y) \\ \vdots & \vdots \\ f_{n_x}(x,y) & f_{n_y}(x,y) \end{bmatrix} \\
\hphantom{\nabla f}&\hphantom{ ==|}\begin{array}{} \underbrace{\hphantom{f_{n_x}(x,y)}}_{\text{Pass 1}} & \underbrace{\hphantom{f_{n_y}(x,y)}}_{\text{Pass 2}} \end{array}
\end{align*}\]
<p>Unfortunately, optimization problems in machine learning and graphics often have the opposite structure: \(f\) has a huge number of inputs (e.g. the coefficients of a 3D scene or neural network) and a single output. That is, \(\nabla f\) has many columns and few rows.</p>
<h2 id="backward-mode">Backward Mode</h2>
<p>As you might have guessed, backward mode autodiff provides a way to compute a <em>row</em> of the gradient using a single invocation of \(f\). For optimizing many-to-one functions, this is exactly what we want: the full gradient in one pass.</p>
<p>In this section, we will use Leibniz’s notation for derivatives, which is:</p>
\[f^\prime(x) = \frac{\partial f}{\partial x}\]
<p>Leibniz’s notation makes it easier to write down the derivative of an arbitrary variable with respect to an arbitrary input. Derivatives also take on nice algebraic properties, if you squint a bit:</p>
\[g(f(x))^\prime = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} = \frac{\partial g}{\partial x}\]
<h3 id="backpropagation">Backpropagation</h3>
<p>Similarly to how forward-mode autodiff propagated derivatives from inputs to outputs, backward-mode propagates derivatives from outputs to inputs.</p>
<p>That sounds easy enough, but the code only runs in one direction. How would we know what the gradient of our input should be before evaluating the rest of the function? We don’t—when evaluating \(f\), we use each operation to build a <em>computational graph</em> that represents \(f\). That is, when \(f\) tells us to perform an operation, we create a new node noting what the operation is and connect it to the nodes representing its inputs. In this way, a pure function can be nicely represented as a directed acyclic graph, or DAG.</p>
<p>For example, the function \(f(x,y) = x^2 + xy\) may be represented with the following graph:</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g1.svg" /></p>
<p>When evaluating \(f\) at a particular input, we write down the intermediate values computed by each node. This step is known as the <em>forward pass</em>, and computes <em>primal</em> values.</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g2.svg" /></p>
<p>Then, we begin the <em>backward pass</em>, where we compute <em>dual</em> values, or derivatives. Our ultimate goal is to compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\). At first, we only know the derivative of \(f\) with respect to the final \(+\) node: they’re the same value, so \(\frac{\partial f}{\partial +} = 1\).</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g3.svg" /></p>
<p>We can see that the output was computed by adding together two incoming values. Increasing either input to the sum would increase the output by an equal amount, so derivatives propagated through this node should be unaffected. That is, if \(+\) is the output and \(+_1,+_2\) are the inputs, \(\frac{\partial +}{\partial +_1} = \frac{\partial +}{\partial +_2} = 1\).</p>
<p>Now we can use the chain rule to combine our derivatives, getting closer to the desired result: \(\frac{\partial f}{\partial +_1} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_1} = 1\), \(\frac{\partial f}{\partial +_2} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_2} = 1\).</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g4.svg" /></p>
<p>When we evaluate a node, we know the derivative of \(f\) with respect to its output. That means we can propagate the derivative back along the node’s incoming edges, modifying it based on the node’s operation. As long as we evaluate all outputs of a node before the node itself, we only have to check each node once. To ensure proper ordering, we may traverse the graph in <a href="https://en.wikipedia.org/wiki/Topological_sorting"><em>reverse topological order</em></a>.</p>
<p>Once we get to a multiplication node, there’s slightly more to do: the derivative now depends on the <em>primal</em> input values. That is, if \(f(x,y) = xy\), \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\).</p>
<p>By applying the chain rule, we get \(\frac{\partial f}{\partial *_1} = 1\cdot*_2\) and \(\frac{\partial f}{\partial *_2} = 1\cdot*_1\) for both multiplication nodes.</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g5.svg" /></p>
<p>Applying the chain rule one last time, we get \(\frac{\partial f}{\partial y} = 2\). But \(x\) has multiple incoming derivatives—how do we combine them? Each incoming edge represents a different way \(x\) affects \(f\), so \(x\)’s total contribution is simply their sum. That means \(\frac{\partial f}{\partial x} = 7\). Let’s check our result:</p>
\[\begin{align*}
f_x(x,y) &= 2x + y &&\implies& f_x(2,3) &= 7 \\
f_y(x,y) &= x &&\implies& f_y(2,3) &= 2
\end{align*}\]
<p>You’ve probably noticed that traversing the graph built up a derivative term for each path from an input to the output. That’s exactly the behavior that arose when we manually computed the gradient using the chain rule!</p>
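<p>Concretely, the graph above computes \(f(x,y) = x\cdot x + x\cdot y\). Summing the product of local derivatives along each of the three paths from \(x\) to \(f\) reproduces the same total:</p>
\[\frac{\partial f}{\partial x} = \underbrace{1\cdot x}_{\text{left input of } x\cdot x} + \underbrace{1\cdot x}_{\text{right input of } x\cdot x} + \underbrace{1\cdot y}_{x\cdot y} = 2x + y = 7\]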
<p>Backpropagation is essentially the chain rule upgraded with dynamic programming. Traversing the graph in reverse topological order means we only have to evaluate each vertex once—and re-use its derivative everywhere else it shows up. Despite having to express \(f\) as a computational graph and traverse both forward and backward, the whole algorithm has the same time complexity as \(f\) itself. Space complexity, however, is a separate issue.</p>
<h3 id="implementation-1">Implementation</h3>
<p>We can implement backward mode autodiff using a similar approach as forward mode. Instead of making every operation use dual numbers, we can make each step add a node to our computational graph.</p>
<pre id="backward-lib" class="editor">
function Const(n) {
    return {op: 'const', in: [n], out: undefined, grad: 0};
}
function Add(x, y) {
    return {op: 'add', in: [x, y], out: undefined, grad: 0};
}
function Times(x, y) {
    return {op: 'times', in: [x, y], out: undefined, grad: 0};
}
</pre>
<p>Note that JavaScript stores references (not copies) within the <code class="language-plaintext highlighter-rouge">in</code> arrays, so shared sub-expressions build a DAG instead of a tree. If we implement our function \(f\) in terms of these primitives, we can evaluate it on an input node to automatically build the graph.</p>
<pre id="backward-lib-2" class="editor">
let in_node = {op: 'const', in: [/* TBD */], out: undefined, grad: 0};
let out_node = f(in_node);
</pre>
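<p>For example, implementing \(f(x) = x^2 + x\) with these primitives yields a graph in which the input node is shared by every term (a small sketch; this particular \(f\) is just an illustrative choice):</p>

```javascript
// Node constructors from above, repeated so this snippet stands alone.
function Const(n) { return {op: 'const', in: [n], out: undefined, grad: 0}; }
function Add(x, y) { return {op: 'add', in: [x, y], out: undefined, grad: 0}; }
function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; }

// f(x) = x*x + x: the same node object x appears three times in the
// result, so the structure is a DAG, not a tree.
function f(x) {
    return Add(Times(x, x), x);
}

const in_node = Const(3);
const out_node = f(in_node);
```

<p>Both inputs of the <code class="language-plaintext highlighter-rouge">Times</code> node and the second input of the <code class="language-plaintext highlighter-rouge">Add</code> node are literally the same object—which is what later lets the backward pass accumulate multiple incoming derivatives at \(x\).</p>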
<p>The forward pass performs a post-order traversal of the graph, translating inputs to outputs for each node. Remember we’re operating on a DAG: we must check whether a node is already resolved, lest we recompute values that could be reused.</p>
<pre id="backward-lib-3" class="editor">
function forward(node) {
    if (node.out !== undefined) return;
    if (node.op === 'const') {
        node.out = node.in[0];
    } else if (node.op === 'add') {
        forward(node.in[0]);
        forward(node.in[1]);
        node.out = node.in[0].out + node.in[1].out;
    } else if (node.op === 'times') {
        forward(node.in[0]);
        forward(node.in[1]);
        node.out = node.in[0].out * node.in[1].out;
    }
}
</pre>
<!-- (You might notice that given a value for $$x$$, the `out` values could have been computed while running $$f$$. That is true, but this purely graph-based approach will help illustrate a following section.) -->
<p>The backward pass is conceptually similar, but a naive pre-order traversal would end up tracing out every path in the DAG—every gradient has to be pushed back to the roots. Instead, we’ll first compute a reverse topological ordering of the nodes. This ordering guarantees that when we reach a node, everything “downstream” of it has already been resolved—we’ll never have to return.</p>
<pre id="backward-lib-4" class="editor">
function backward(out_node) {
    const order = topological_sort(out_node).reverse();
    for (const node of order) {
        if (node.op === 'add') {
            node.in[0].grad += node.grad;
            node.in[1].grad += node.grad;
        } else if (node.op === 'times') {
            node.in[0].grad += node.in[1].out * node.grad;
            node.in[1].grad += node.in[0].out * node.grad;
        }
    }
}
</pre>
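<p>The backward pass leans on <code class="language-plaintext highlighter-rouge">topological_sort</code>, which we haven’t defined. One minimal sketch (a hypothetical implementation matching our node format): a post-order depth-first traversal pushes each node after all of its inputs, which is a valid topological order of the DAG—reversing it then processes each node before its inputs.</p>

```javascript
// Minimal topological sort for our graph nodes. The visited set ensures
// shared subgraphs are emitted exactly once.
function topological_sort(out_node) {
    const visited = new Set();
    const order = [];
    function visit(node) {
        if (visited.has(node)) return;
        visited.add(node);
        if (node.op !== 'const') {  // const nodes hold numbers, not nodes
            node.in.forEach(visit);
        }
        order.push(node);           // post-order: inputs land first
    }
    visit(out_node);
    return order;                   // out_node is last; .reverse() puts it first
}
```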
<p>Finally, we can put our functions together to compute \(f(x)\) and \(f'(x)\):</p>
<pre id="backward-lib-5" class="editor">
function evaluate(x, in_node, out_node) {
    in_node.in = [x];
    forward(out_node);
    out_node.grad = 1;
    backward(out_node);
    return [out_node.out, in_node.grad];
}
</pre>
<p>Just remember to clear all the <code class="language-plaintext highlighter-rouge">out</code> and <code class="language-plaintext highlighter-rouge">grad</code> fields before evaluating again! Lastly, the working implementation:</p>
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="backward-editor" class="editor"></div>
<button id="backward-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="backward-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
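<p>As noted above, the <code class="language-plaintext highlighter-rouge">out</code> and <code class="language-plaintext highlighter-rouge">grad</code> fields must be cleared between evaluations. A minimal sketch of such a reset helper (a hypothetical implementation matching our node format):</p>

```javascript
// Clear the out and grad fields so the graph can be evaluated again.
// The early return makes revisits of shared nodes cheap: a node that is
// already reset has nothing left to clear.
function reset(node) {
    if (node.out === undefined && node.grad === 0) return;
    node.out = undefined;
    node.grad = 0;
    if (node.op !== 'const') {  // const nodes hold a number in `in`
        node.in.forEach(reset);
    }
}
```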
<h3 id="limitations-1">Limitations</h3>
<p>If \(f\) is a function of multiple variables, we can simply read the gradients from the corresponding input nodes. That means we’ve computed a whole row of \(\nabla f\). Of course, if the gradient has many rows and few columns, forward mode would have been more efficient.</p>
\[\begin{align*}
\nabla f &= \begin{bmatrix} \vphantom{\Big|} f_{0_a}(a,\dots,n) & \dots & f_{0_n}(a,\dots,n) \\ \vphantom{\Big|} f_{1_a}(a,\dots,n) & \dots & f_{1_n}(a,\dots,n) \end{bmatrix} \begin{matrix} \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 1} \\ \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 2} \end{matrix}
\end{align*}\]
<p>Unfortunately, backward mode comes with another catch: we had to store the intermediate result of every single computation inside \(f\)! If we’re passing around substantial chunks of data, say, weight matrices for a neural network, storing the intermediate results can require an unacceptable amount of memory and memory bandwidth. If \(f\) contains loops, it’s especially bad—because every value is immutable, naive loops will create long chains of intermediate values. For this reason, real-world frameworks tend to encapsulate loops in monolithic parallel operations that have analytic derivatives.</p>
<p>Many engineering hours have gone into reducing space requirements. One problem-agnostic approach is called checkpointing: we can choose not to store intermediate results at some nodes, instead re-computing them on the fly during the backward pass. Checkpointing gives us a natural space-time tradeoff: by strategically choosing which nodes store intermediate results (e.g. ones with expensive operations), we can reduce memory usage without dramatically increasing runtime.</p>
<p>Even with checkpointing, training the largest neural networks requires far more fast storage than is available to a single computer. By partitioning our computational graph between multiple systems, each one only needs to store values for its local nodes. Unfortunately, this implies edges connecting nodes assigned to different processors must send their values across a network, which is expensive. Hence, communication costs may be minimized by finding min-cost graph cuts.</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" style="width: 575px; margin-bottom: 20px;" src="/assets/autodiff/partition2.svg" /></div>
<h3 id="graphs-and-higher-order-autodiff">Graphs and Higher-Order Autodiff</h3>
<p>Earlier, we could have computed primal values while evaluating \(f\) itself. Frameworks like PyTorch and TensorFlow take this approach—evaluating \(f\) both builds the graph (also known as the ‘tape’) and evaluates the forward-pass results. The user may call <code class="language-plaintext highlighter-rouge">backward</code> at any point, propagating gradients to all inputs that contributed to the result.</p>
<p>However, the forward-backward approach can limit the system’s potential performance. The forward pass is relatively easy to optimize via parallelizing, vectorizing, and distributing graph traversal. The backward pass, on the other hand, is harder to parallelize, as it requires a topological traversal and coordinated gradient accumulation. Furthermore, the backward pass lacks some mathematical power. While computing a specific derivative is easy, we don’t get back a general <em>representation</em> of \(\nabla f\). If we wanted the gradient of the gradient (the Hessian), we’re back to relying on numerical differentiation.</p>
<p>Thinking about the gradient as a higher-order function reveals a potentially better approach. If we can represent \(f\) as a computational graph, there’s no reason we can’t also represent \(\nabla f\) <em>in the same way</em>. In fact, we can simply add nodes to the graph of \(f\) that compute derivatives with respect to each input. Because the graph already computes primal values, each node in \(f\) only requires us to add a constant number of nodes in the graph of \(\nabla f\). That means the result is only a constant factor larger than the input—evaluating it requires exactly the same computations as the forward-backward algorithm.</p>
<p>For example, given the graph of \(f(x) = x^2 + x\), we can produce the following:</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" style="width: 500px;" src="/assets/autodiff/graph2.svg" /></div>
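<p>As a concrete sketch of differentiation as a graph transformation, here is a hypothetical <code class="language-plaintext highlighter-rouge">grad_graph</code> that emits new <code class="language-plaintext highlighter-rouge">Add</code>/<code class="language-plaintext highlighter-rouge">Times</code> nodes <em>computing</em> each gradient, instead of accumulating numeric gradients. (The framework pieces from earlier sections are repeated in compact form so the sketch stands alone; the names are illustrative, not from any framework.)</p>

```javascript
// Framework pieces from earlier sections, compacted for self-containment.
function Const(n) { return {op: 'const', in: [n], out: undefined, grad: 0}; }
function Add(x, y) { return {op: 'add', in: [x, y], out: undefined, grad: 0}; }
function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; }
function topological_sort(out_node) {
    const visited = new Set(), order = [];
    (function visit(node) {
        if (visited.has(node)) return;
        visited.add(node);
        if (node.op !== 'const') node.in.forEach(visit);
        order.push(node);
    })(out_node);
    return order;
}
function forward(node) {
    if (node.out !== undefined) return;
    if (node.op === 'const') { node.out = node.in[0]; return; }
    node.in.forEach(forward);
    node.out = node.op === 'add' ? node.in[0].out + node.in[1].out
                                 : node.in[0].out * node.in[1].out;
}

// The transformation itself: walk the graph of f in reverse topological
// order, but build graph nodes for each gradient rather than numbers.
// The returned graph shares f's primal nodes, so evaluating it performs
// the same computations as the forward-backward algorithm.
function grad_graph(out_node, in_node) {
    const grads = new Map([[out_node, Const(1)]]);   // node -> graph of df/dnode
    const acc = (n, term) =>
        grads.set(n, grads.has(n) ? Add(grads.get(n), term) : term);
    for (const node of topological_sort(out_node).reverse()) {
        const g = grads.get(node);
        if (g === undefined || node.op === 'const') continue;
        if (node.op === 'add') {
            acc(node.in[0], g);
            acc(node.in[1], g);
        } else if (node.op === 'times') {
            acc(node.in[0], Times(g, node.in[1]));
            acc(node.in[1], Times(g, node.in[0]));
        }
    }
    return grads.get(in_node);   // a graph: evaluate it with forward()
}
```

<p>Because the result is itself a computational graph, we can feed it back into <code class="language-plaintext highlighter-rouge">grad_graph</code> to build second derivatives—the key idea behind this section.</p>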
<p>Defining differentiation as a function <em>on computational graphs</em> unifies the forward and backward passes: we get a single graph that computes both \(f\) and \(\nabla f\). That means we can work with higher order derivatives by applying the transformation again! Even better, distributed training is easier as we no longer have to worry about synchronizing gradient updates across multiple systems. JAX implements this approach, enabling its seamless gradient, JIT compilation, and vectorization transforms. PyTorch also supports higher-order differentiation by including backward operations in the computational graph, and <a href="https://github.com/pytorch/functorch">functorch</a> provides a JAX-like API.</p>
<h1 id="de-blurring-an-image">De-blurring an Image</h1>
<p>Let’s use our fledgling differentiable programming framework to solve a real optimization problem: de-blurring an image. We’ll assume our observed image was computed using a simple box filter, i.e., each blurred pixel is the average of the surrounding 3x3 ground-truth pixels. Of course, a blur loses information, so we won’t be able to reconstruct the exact input—but we can get pretty close!</p>
\[\text{Blur}(\text{Image})_{xy} = \frac{1}{9} \sum_{i=-1}^1 \sum_{j=-1}^1 \text{Image}_{(x+i)(y+j)}\]
<table class="imagetable"><tr style="width: 100%;">
<td style="display: inline-block;"><canvas id="image-orig"></canvas><p>Ground Truth Image</p></td>
<td style="display: inline-block;"><canvas id="image-blur"></canvas><p>Observed Image</p></td></tr>
</table>
<p>We’ll need to add one more operation to our framework: division. The operation and forward pass are much the same as addition and multiplication, but the backward pass must compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) for \(f = \frac{x}{y}\).</p>
<pre id="image-div" class="editor">
function Divide(x, y) {
    return {op: 'divide', in: [x, y], out: undefined, grad: 0};
}
// Forward...
if (node.op === 'divide') {
    forward(node.in[0]);
    forward(node.in[1]);
    node.out = node.in[0].out / node.in[1].out;
}
// Backward...
if (node.op === 'divide') {
    node.in[0].grad += node.grad / node.in[1].out;
    node.in[1].grad += -node.grad * node.in[0].out / (node.in[1].out * node.in[1].out);
}
</pre>
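<p>The backward rule above is just the quotient rule: for \(f = \frac{x}{y}\),</p>
\[\frac{\partial f}{\partial x} = \frac{1}{y} \qquad\qquad \frac{\partial f}{\partial y} = -\frac{x}{y^2}\]
<p>so each incoming gradient is scaled by \(\frac{1}{y}\) along the numerator’s edge and by \(-\frac{x}{y^2}\) along the denominator’s.</p>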
<p>Before we start programming, we need to express our task as an optimization problem. That entails minimizing a loss function that measures how far away we are from our goal.</p>
<p>Let’s start by guessing an arbitrary image—for example, a solid grey block. We can then compare the result of blurring our guess with the observed image. The farther our blurred result is from the observation, the larger the loss should be. For simplicity, we will define our loss as the total squared difference between each corresponding pixel.</p>
\[\text{Loss}(\text{Blur}(\text{Guess}), \text{Observed}) = \sum_{x=0}^W\sum_{y=0}^H (\text{Blur}(\text{Guess})_{xy} - \text{Observed}_{xy})^2\]
<p>Using differentiable programming, we can compute \(\frac{\partial \text{Loss}}{\partial \text{Guess}}\), i.e. how changes in our proposed image change the resulting loss. That means we can apply gradient descent to the guess, guiding it towards a state that minimizes the loss function. Hopefully, if our blurred guess matches the observed image, our guess will match the ground truth image.</p>
<p>Let’s implement our loss function in differentiable code. First, create the guess image by initializing the differentiable parameters to solid grey. Each pixel has three components: red, green, and blue.</p>
<pre id="image-impl-0" class="editor">
let guess_image = new Array(W*H*3);
for (let i = 0; i < W * H * 3; i++) {
    guess_image[i] = Const(127);
}
</pre>
<p>Second, apply the blur using differentiable operations.</p>
<pre id="image-impl-1" class="editor">
let blurred_guess_image = new Array(W*H*3);
for (let x = 0; x < W; x++) {
    for (let y = 0; y < H; y++) {
        let [r,g,b] = [Const(0), Const(0), Const(0)];
        // Accumulate the 3x3 neighborhood for averaging
        for (let i = -1; i <= 1; i++) {
            for (let j = -1; j <= 1; j++) {
                // Clamp to the image border, then convert the 2D pixel
                // coordinate to a 1D row-major array index
                const xi = clamp(x + i, 0, W - 1);
                const yj = clamp(y + j, 0, H - 1);
                const idx = (yj * W + xi) * 3;
                r = Add(r, guess_image[idx + 0]);
                g = Add(g, guess_image[idx + 1]);
                b = Add(b, guess_image[idx + 2]);
            }
        }
        // Set result to average
        const idx = (y * W + x) * 3;
        blurred_guess_image[idx + 0] = Divide(r, Const(9));
        blurred_guess_image[idx + 1] = Divide(g, Const(9));
        blurred_guess_image[idx + 2] = Divide(b, Const(9));
    }
}
</pre>
<p>Finally, compute the loss using differentiable operations.</p>
<pre id="image-impl-2" class="editor">
let loss = Const(0);
for (let x = 0; x < W; x++) {
    for (let y = 0; y < H; y++) {
        const idx = (y * W + x) * 3;
        let dr = Add(blurred_guess_image[idx + 0], Const(-observed_image[idx + 0]));
        let dg = Add(blurred_guess_image[idx + 1], Const(-observed_image[idx + 1]));
        let db = Add(blurred_guess_image[idx + 2], Const(-observed_image[idx + 2]));
        loss = Add(loss, Times(dr, dr));
        loss = Add(loss, Times(dg, dg));
        loss = Add(loss, Times(db, db));
    }
}
</pre>
<p>Calling <code class="language-plaintext highlighter-rouge">forward(loss)</code> performs the whole computation, storing results in each node’s <code class="language-plaintext highlighter-rouge">out</code> field. Calling <code class="language-plaintext highlighter-rouge">backward(loss)</code> computes the derivative of loss at every node, storing results in each node’s <code class="language-plaintext highlighter-rouge">grad</code> field.</p>
<p>Let’s write a simple optimization routine that performs gradient descent on the guess image.</p>
<pre id="image-impl-3" class="editor">
function gradient_descent_step(step_size) {
    // Clear output values and gradients
    reset(loss);
    // Forward pass
    forward(loss);
    // Backward pass
    loss.grad = 1;
    backward(loss);
    // Move parameters along the gradient
    for (let i = 0; i < W * H * 3; i++) {
        let p = guess_image[i];
        p.in[0] -= step_size * p.grad;
    }
}
</pre>
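<p>Before tackling images, we can sanity-check this loop on a one-dimensional problem: minimizing \((x - 5)^2\), whose minimizer is \(x = 5\). (A self-contained sketch: the framework pieces from earlier sections are repeated in compact form, and the problem and step size are arbitrary choices.)</p>

```javascript
// Framework pieces from earlier sections, compacted for self-containment.
function Const(n) { return {op: 'const', in: [n], out: undefined, grad: 0}; }
function Add(x, y) { return {op: 'add', in: [x, y], out: undefined, grad: 0}; }
function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; }
function topological_sort(out_node) {
    const visited = new Set(), order = [];
    (function visit(node) {
        if (visited.has(node)) return;
        visited.add(node);
        if (node.op !== 'const') node.in.forEach(visit);
        order.push(node);
    })(out_node);
    return order;
}
function forward(node) {
    if (node.out !== undefined) return;
    if (node.op === 'const') { node.out = node.in[0]; return; }
    node.in.forEach(forward);
    node.out = node.op === 'add' ? node.in[0].out + node.in[1].out
                                 : node.in[0].out * node.in[1].out;
}
function backward(out_node) {
    for (const node of topological_sort(out_node).reverse()) {
        if (node.op === 'add') {
            node.in[0].grad += node.grad;
            node.in[1].grad += node.grad;
        } else if (node.op === 'times') {
            node.in[0].grad += node.in[1].out * node.grad;
            node.in[1].grad += node.in[0].out * node.grad;
        }
    }
}

// Build loss(x) = (x - 5)^2 and descend along its gradient.
const x = Const(0);                         // initial guess x = 0
const diff = Add(x, Const(-5));             // x - 5
const loss = Times(diff, diff);             // (x - 5)^2
const nodes = topological_sort(loss);
for (let step = 0; step < 100; step++) {
    for (const n of nodes) { n.out = undefined; n.grad = 0; }  // reset
    forward(loss);
    loss.grad = 1;
    backward(loss);
    x.in[0] -= 0.1 * x.grad;                // step size 0.1
}
```

<p>With this step size the update is \(x \leftarrow x - 0.2(x - 5)\), so the guess converges geometrically toward \(5\)—the same behavior we’re about to watch on a much larger parameter vector.</p>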
<p>We’d also like to compute <code class="language-plaintext highlighter-rouge">error</code>, the squared distance between our guess image and the ground truth. We can’t use error to inform our algorithm—we’re not supposed to know what the ground truth was—but we can use it to measure how well we are reconstructing the image. For the current iteration, we visualize the guess image, the guess image after blurring, and the gradient of loss with respect to each pixel.</p>
<table class="imagetable">
<tr>
<td><canvas id="image-cur"></canvas><p id="image-error"></p></td>
<td><canvas id="image-cur-blur"></canvas><p style="width: 105%; margin-left: -2.5%;" id="image-loss"></p></td>
<td><canvas id="image-grad"></canvas><p><strong>Gradient Image</strong></p></td>
</tr>
</table>
<table class="imagetable" style="margin-top: -5px; margin-bottom: -15px;">
<tr>
<td style="width: 16.5%;"></td>
<td><div><p style="float:left;"><button id="image-reset">Reset</button> <button id="image-step">Step</button></p>
<input type="range" min="0" max="100" value="25" class="slider" id="image-step-size" style="width: 45%; margin-left: 16px; margin-top: 7px; float: left;" /></div></td>
</tr>
</table>
<p>The slider adjusts the step size. After running several steps, you’ll notice that even though loss goes to zero, error does not: the loss function does not provide enough information to exactly reconstruct the ground truth. We can also see that optimization behavior depends on the step size—small steps require many iterations to converge, and large steps may overshoot the target, oscillating between too-dark and too-bright images.</p>
<h1 id="further-reading">Further Reading</h1>
<p>If you’d like to learn more about differentiable programming in ML and graphics, check out the following resources:</p>
<ul>
<li><a href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html">JAX</a>, <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch</a>, <a href="https://www.tensorflow.org/tutorials/quickstart/beginner">TensorFlow</a>, <a href="https://github.com/geohot/tinygrad">tinygrad</a>, <a href="https://github.com/patr-schm/TinyAD">TinyAD</a></li>
<li><a href="https://physicsbaseddeeplearning.org/intro.html">Physics-Based Deep Learning</a>, <a href="https://www.youtube.com/watch?v=Tou8or1ed6E">Differentiable Rendering Intro</a>, <a href="https://rgl.epfl.ch/publications">Differentiable Rendering Papers</a>, <a href="https://neuralfields.cs.brown.edu/">Neural Fields in Visual Computing</a></li>
<li><a href="https://dlsyscourse.org/">CMU Deep Learning Systems</a></li>
</ul>
<link rel="stylesheet" type="text/css" href="/assets/js/prism.css" />
<script src="/assets/js/prism.js"></script>
<script type="module" src="/assets/autodiff/autodiff.js"></script>Differentiable programming has been a hot research topic over the past few years, and not only due to the popularity of machine learning libraries like TensorFlow, PyTorch, and JAX. Many fields apart from machine learning are also finding differentiable programming to be a useful tool for solving many kinds of optimization problems. In computer graphics, differentiable rendering, differentiable physics, and neural representations are all poised to be important tools going forward. Table of Contents Prerequisites Differentiation Optimization Differentiating Code Numerical Symbolic Automatic Differentiation Forward Mode Backward Mode De-blurring an Image Further Reading Prerequisites Differentiation It all starts with the definition you learn in calculus class: \[f^{\prime}(x) = \lim_{h\rightarrow 0} \frac{f(x + h) - f(x)}{h}\] In other words, the derivative computes how much \(f(x)\) changes when \(x\) is perturbed by an infinitesimal amount. If \(f\) is a one-dimensional function from \(\mathbb{R} \mapsto \mathbb{R}\), the derivative \(f^{\prime}(x)\) returns the slope of the graph of \(f\) at \(x\). However, there’s another perspective that provides better intuition in higher dimensions. If we think of \(f\) as a map from points in its domain to points in its range, we can think of \(f^{\prime}(x)\) as a map from vectors based at \(x\) to vectors based at \(f(x)\). One-to-One In 1D, this distinction is a bit subtle, as a 1D “vector” is just a single number. Still, evaluating \(f^{\prime}(x)\) shows us how a vector placed at \(x\) is scaled when transformed by \(f\). That’s just the slope of \(f\) at \(x\). Many-to-One If we consider a function \(g(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}\) (two inputs, one output), this perspective will become clearer. 
We can differentiate \(g\) with respect to any particular input, known as a partial derivative: \[g_x(x,y) = \lim_{h\rightarrow0} \frac{g(x+h,y) - g(x,y)}{h}\] The function \(g_x\) produces the change in \(g\) given a change in \(x\). If we combine it with the partial derivative for \(y\), we get the derivative, or gradient, of \(g\): \[\nabla g(x,y) = \begin{bmatrix}g_x(x,y)\\g_y(x,y)\end{bmatrix}\] That is, \(\nabla g(x,y)\) tells us how \(g\) changes if we change either \(x\) or \(y\). If we take the dot product of \(\nabla g(x,y)\) with a vector of differences \(\Delta x,\Delta y\), we’ll get their combined effect on \(g\): \[\nabla g(x,y) \cdot \begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} = \Delta xg_x(x,y) + \Delta yg_y(x,y)\] It’s tempting to think of the gradient as just another vector. However, it’s often useful to think of the gradient as a higher-order function: \(\nabla g\) is a function that, when evaluated at \(x,y\), gives us another function that transforms vectors based at \(x,y\) to vectors based at \(g(x,y)\). It just so happens that the function returned by our gradient is linear, so it can be represented as a dot product with the gradient vector. The gradient is typically explained as the “direction of steepest ascent.” Why is that? When we evaluate the gradient at a point \(x,y\) and a vector \(\Delta x, \Delta y\), the result is a change in the output of \(g\). If we maximize the change in \(g\) with respect to the input vector, we’ll get the direction that makes the output increase the fastest. Since the gradient function is just a dot product with \(\left[g_x(x,y), g_y(x,y)\right]\), the direction \(\left[\Delta x, \Delta y\right]\) that maximizes the result is easy to find: it’s parallel to \(\left[g_x(x,y), g_y(x,y)\right]\). That means the gradient vector is, in fact, the direction of steepest ascent. 
The Directional Derivative Another important term is the directional derivative, which computes the derivative of a function along an arbitrary direction. It is a generalization of the partial derivative, which evaluates the directional derivative along a coordinate axis. \[D_{\mathbf{v}}f(x)\] Above, we discovered that our “gradient function” could be expressed as a dot product with the gradient vector. That was, actually, the directional derivative: \[D_{\mathbf{v}}f(x) = \nabla f(x) \cdot \mathbf{v}\] Which again illustrates why the gradient vector is the direction of steepest ascent: it is the \(\mathbf{v}\) that maximizes the directional derivative. Note that in curved spaces, the “steepest ascent” definition of the gradient still holds, but the directional derivative becomes more complicated than a dot product. Many-to-Many For completeness, let’s examine how this perspective extends to vector-valued functions of multiple variables. Consider \(h(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) (two inputs, two outputs). \[h(x,y) = \begin{bmatrix}h_1(x,y)\\h_2(x,y)\end{bmatrix}\] We can take the gradient of each part of \(h\): \[\begin{align*} \nabla h_1(x,y) &= \begin{bmatrix}h_{1_x}(x,y)\\ h_{1_y}(x,y)\end{bmatrix} \\ \nabla h_2(x,y) &= \begin{bmatrix}h_{2_x}(x,y)\\ h_{2_y}(x,y)\end{bmatrix} \end{align*}\] What object would represent \(\nabla h\)? We can build a matrix from the gradients of each component, called the Jacobian: \[\nabla h(x,y) = \begin{bmatrix}h_{1_x}(x,y) & h_{1_y}(x,y)\\ h_{2_x}(x,y) & h_{2_y}(x,y)\end{bmatrix}\] Above, we argued that the gradient (when evaluated at \(x,y\)) gives us a map from input vectors to output vectors. That remains the case here: the Jacobian is a 2x2 matrix, so it transforms 2D \(\Delta x,\Delta y\) vectors into 2D \(\Delta h_1, \Delta h_2\) vectors. Adding more dimensions starts to make our functions hard to visualize, but we can always rely on the fact that the derivative tells us how input vectors (i.e. 
changes in the input) get mapped to output vectors (i.e. changes in the output). The Chain Rule The last aspect of differentiation we’ll need to understand is how to differentiate function composition. \[h(x) = g(f(x)) \implies h^\prime(x) = g^\prime(f(x))\cdot f^\prime(x)\] We could prove this fact with a bit of real analysis, but the relationship is again easier to understand by thinking of the derivative as higher order function. In this perspective, the chain rule itself is just a function composition. For example, let’s assume \(h(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\) is composed of \(f(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) and \(g(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\). In order to translate a \(\Delta \mathbf{x}\) vector to a \(\Delta h\) vector, we can first use \(f^\prime\) to map \(\Delta \mathbf{x}\) to \(\Delta \mathbf{f}\), based at \(\mathbf{x}\). Then we can use \(g^\prime\) to map \(\Delta \mathbf{f}\) to \(\Delta g\), based at \(f(x)\). Because our derivatives/gradients/Jacobians are linear functions, we’ve been representing them as scalars/vectors/matrices, respectively. That means we can easily compose them with the typical linear algebraic multiplication rules. Writing out the above example symbolically: \[\begin{align*} \nabla h(\mathbf{x}) &= \nabla g(f(\mathbf{x}))\cdot \nabla f(\mathbf{x}) \\ &= \begin{bmatrix}g_{1_{x_1}}(f(\mathbf{x})) & g_{1_{x_2}}(f(\mathbf{x}))\\ g_{2_{x_1}}(f(\mathbf{x})) & g_{2_{x_2}}(f(\mathbf{x}))\end{bmatrix} \cdot \begin{bmatrix}f_{x_1}(\mathbf{x})\\ f_{x_2}(\mathbf{x})\end{bmatrix} \\ &= \begin{bmatrix}g_{1_{x_1}}(f(\mathbf{x}))f_{x_1}(\mathbf{x}) + g_{1_{x_2}}(f(\mathbf{x}))f_{x_2}(\mathbf{x})\\ g_{2_{x_1}}(f(\mathbf{x}))f_{x_1}(\mathbf{x}) + g_{2_{x_2}}(f(\mathbf{x}))f_{x_2}(\mathbf{x})\end{bmatrix} \end{align*}\] The result is a 2D vector representing a gradient that transforms 2D vectors to 1D vectors. The composed function \(h\) had two inputs and one output, so that’s correct. 
We can also notice that each term corresponds to the chain rule applied to a different computational path from a component of \(\mathbf{x}\) to \(h\). Optimization We will focus on the application of differentiation to optimization via gradient descent, which is often used in machine learning and computer graphics. An optimization problem always involves computing the following expression: \[\underset{\mathbf{x}}{\arg\!\min} f(\mathbf{x})\] Which simply means “find the \(\mathbf{x}\) that results in the smallest possible value of \(f\).” The function \(f\), typically scalar-valued, is traditionally called an “energy,” or in machine learning, a “loss function.” Extra constraints are often enforced to limit the valid options for \(\mathbf{x}\), but we will disregard constrained optimization for now. One way to solve an optimization problem is to iteratively follow the gradient of \(f\) “downhill.” This algorithm is known as gradient descent: Pick an initial guess \(\mathbf{\bar{x}}\). Repeat: Compute the gradient \(\nabla f(\mathbf{\bar{x}})\). Step along the gradient: \(\mathbf{\bar{x}} \leftarrow \mathbf{\bar{x}} - \tau\nabla f(\mathbf{\bar{x}})\). while \(\|\nabla f(\mathbf{\bar{x}})\| > \epsilon\). Given some starting point \(\mathbf{x}\), computing \(\nabla f(\mathbf{x})\) will give us the direction from \(\mathbf{x}\) that would increase \(f\) the fastest. Hence, if we move our point \(\mathbf{x}\) a small distance \(\tau\) along the negated gradient, we will decrease the value of \(f\). The number \(\tau\) is known as the step size (or in ML, the learning rate). By iterating this process, we will eventually find an \(\mathbf{x}\) such that \(\nabla f(\mathbf{x}) \simeq 0\), which is hopefully the minimizer. This description of gradient descent makes optimization sound easy, but in reality there is a lot that can go wrong. When gradient descent terminates, the result is only required to be a critical point of \(f\), i.e. somewhere \(f\) becomes flat. 
That means we could wind up at a maximum (unlikely), a saddle point (possible), or a local minimum (likely). At a local minimum, moving \(\mathbf{x}\) in any direction would increase the value of \(f\), but \(\mathbf{x}\) it is not necessarily the minimum value \(f\) can take on globally. MaximumSaddle PointLocal Minimum Gradient descent can also diverge (i.e. never terminate) if \(\tau\) is too large. Because the gradient is only a linear approximation of \(f\), if we step too far along it, we might skip over changes in \(f\)’s behavior—or even end up increasing both \(f\) and \(\nabla f\). On the other hand, the smaller we make \(\tau\), the longer our algorithm takes to converge. Note that we’re assuming \(f\) has a lower bound and achieves it at a finite \(\mathbf{x}\) in the first place. DivergenceSlow Convergence The algorithm presented here is the most basic form of gradient descent: much research has been dedicated to devising loss functions and descent algorithms that have higher likelihoods of converging to reasonable results. The practice of adding constraints, loss function terms, and update rules is known as regularization. In fact, optimization is a whole field in of itself: if you’d like to learn more, there’s a vast amount of literature to refer to, especially within machine learning. This interactive article explaining momentum is a great example. Differentiating Code Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider pure functions, i.e. functions whose output depends solely on its parameters (no state). If your program implements a relatively simple mathematical expression, it’s not too difficult to manually write another function that evaluates its derivative. However, what if your program is a deep neural network, or a physics simulation? 
It’s not feasible to differentiate something like that by hand, so we must turn to algorithms for automatic differentiation. There are several techniques for differentiating programs. We will first look at numeric and symbolic differentiation, both of which have been in use as long as computers have existed. However, these approaches are distinct from the algorithm we now know as autodiff, which we will discuss later. Numerical Numerical differentiation is the most straightforward technique: it simply approximates the definition of the derivative. \[f^\prime(x) = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h} \simeq \frac{f(x+0.001)-f(x)}{0.001}\] By choosing a small \(h\), all we have to do is evaluate \(f\) at \(x\) and \(x+h\). This technique is also known as differentiation via finite differences. Implementing numeric differentiation as a higher order function is quite easy. It doesn’t even require modifying the function to differentiate: function numerical_diff(f, h) { return function (x) { return (f(x + h) - f(x)) / h; } } let df = numerical_diff(f, 0.001); You can edit the following Javascript example, where \(f\) is drawn in blue and numerical_diff(\(f\), 0.001) is drawn in purple. Note that using control flow is not a problem: Differentiate Unfortunately, finite differences have a big problem: they only compute the derivative of \(f\) in one direction. If our input is very high dimensional, computing the full gradient of \(f\) becomes computationally infeasible, as we would have to evaluate \(f\) for each dimension separately. That said, if you only need to compute one directional derivative of \(f\), the full gradient is overkill: instead, compute a finite difference between \(f(\mathbf{x})\) and \(f(\mathbf{x} + \Delta\mathbf{x})\), where \(\Delta\mathbf{x}\) is a small step in your direction of interest. Finally, always remember that numerical differentiation is only an approximation: we aren’t computing the actual limit as \(h \rightarrow 0\). 
While finite differences are quite easy to implement and can be very useful for validating other results, the technique should usually be superseded by another approach. Symbolic Symbolic differentiation involves transforming a representation of \(f\) into a representation of \(f^\prime\). Unlike numerical differentiation, this requires specifying \(f\) in a domain-specific language where each syntactic construct has a known differentiation rule. However, that limitation isn’t so bad—we can create a compiler that differentiates expressions in our symbolic language for us. This is the technique used in computer algebra packages like Mathematica. For example, we could create a simple language of polynomials that is symbolically differentiable using the following set of recursive rules: d(n) -> 0 d(x) -> 1 d(Add(a, b)) -> Add(d(a), d(b)) d(Times(a, b)) -> Add(Times(d(a), b), Times(a, d(b))) Differentiate If we want our differentiable language to support more operations, we can simply add more differentiation rules. For example, to support trig functions: d(sin a) -> Times(d(a), cos a) d(cos a) -> Times(d(a), Times(-1, sin a)) Unfortunately, there’s a catch: the size of \(f^\prime\)’s representation can become very large. Let’s write another recursive relationship that counts the number of terms in an expression: Terms(n) -> 1 Terms(x) -> 1 Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 And then prove that Terms(a) <= Terms(d(a)), i.e. 
differentiating an expression cannot decrease the number of terms: Base Cases: Terms(d(n)) -> 1 | Definition Terms(n) -> 1 | Definition => Terms(n) <= Terms(d(n)) Terms(d(x)) -> 1 | Definition Terms(x) -> 1 | Definition => Terms(x) <= Terms(d(x)) Inductive Case for Add: Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 | Definition Terms(d(Add(a, b))) -> Terms(d(a)) + Terms(d(b)) + 1 | Definition Terms(a) <= Terms(d(a)) | Hypothesis Terms(b) <= Terms(d(b)) | Hypothesis => Terms(Add(a, b)) <= Terms(d(Add(a, b))) Inductive Case for Times: Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 | Definition Terms(d(Times(a, b))) -> Terms(a) + Terms(b) + 3 + Terms(d(a)) + Terms(d(b)) | Definition => Terms(Times(a, b)) <= Terms(d(Times(a, b))) This result might be acceptable if the size of df was linear in the size of f, but that’s not the case. Whenever we differentiate a Times expression, the number of terms in the result will at least double. That means the size of df grows exponentially with the number of Times we compose. You can demonstrate this phenomenon by nesting multiple Times in the Javascript example. Hence, symbolic differentiation is usually infeasible at the scales we’re interested in. However, if it works for your use case, it can be quite useful. Automatic Differentiation We’re finally ready to discuss the automatic differentiation algorithm actually used in modern differentiable programming: autodiff! There are two flavors of autodiff, each named for the direction in which it computes derivatives. Forward Mode Forward mode autodiff improves on our two older techniques by computing exact derivatives without building a potentially exponentially-large representation of \(f^\prime\). It is based on the mathematical definition of dual numbers. Dual Numbers Dual numbers are a bit like complex numbers: they’re defined by adjoining a new quantity \(\epsilon\) to the reals. But unlike complex numbers where \(i^2 = -1\), dual numbers use \(\epsilon^2 = 0\). 
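Before going further, it’s worth noting that the four symbolic rules above are small enough to implement directly. The following sketch (my own; the post’s interactive version is not reproduced here) represents expressions as plain objects and applies the rules recursively:

```javascript
// Symbolic differentiation over a tiny expression language.
// An expression is {op: 'const'|'x'|'add'|'times', ...}.
function d(e) {
  switch (e.op) {
    case 'const': return {op: 'const', n: 0};                // d(n) -> 0
    case 'x':     return {op: 'const', n: 1};                // d(x) -> 1
    case 'add':   return {op: 'add', a: d(e.a), b: d(e.b)};  // sum rule
    case 'times': return {op: 'add',                         // product rule
      a: {op: 'times', a: d(e.a), b: e.b},
      b: {op: 'times', a: e.a, b: d(e.b)}};
  }
}

// Evaluate an expression at a given x.
function evalExpr(e, x) {
  switch (e.op) {
    case 'const': return e.n;
    case 'x':     return x;
    case 'add':   return evalExpr(e.a, x) + evalExpr(e.b, x);
    case 'times': return evalExpr(e.a, x) * evalExpr(e.b, x);
  }
}

// f(x) = x * x, so f'(3) should be 6.
const f = {op: 'times', a: {op: 'x'}, b: {op: 'x'}};
```

Nesting a few more Times nodes and printing the result of d quickly demonstrates the exponential blowup described above.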
In particular, we can use the \(\epsilon\) part of a dual number to represent the derivative of the scalar part. If we replace each variable \(x\) with \(x + x^\prime\epsilon\), we will find that dual arithmetic naturally expresses how derivatives combine: Addition: \[(x + x^\prime\epsilon) + (y + y^\prime\epsilon) = (x + y) + (x^\prime + y^\prime)\epsilon\] Multiplication: \[\begin{align*} (x + x^\prime\epsilon) * (y + y^\prime\epsilon) &= xy + xy^\prime\epsilon + x^\prime y\epsilon + x^\prime y^\prime\epsilon^2 \\ &= xy + (x^\prime y + xy^\prime)\epsilon \end{align*}\] Division: \[\begin{align*} \frac{x + x^\prime\epsilon}{y + y^\prime\epsilon} &= \frac{\frac{x}{y}+\frac{x^\prime}{y}\epsilon}{1+\frac{y^\prime}{y}\epsilon} \\ &= \left(\frac{x}{y}+\frac{x^\prime}{y}\epsilon\right)\left(1-\frac{y^\prime}{y}\epsilon\right) \\ &= \frac{x}{y} + \frac{x^\prime y - xy^\prime}{y^2}\epsilon \end{align*}\] The chain rule also works: \(f(x + x^\prime\epsilon) = f(x) + f'(x)x^\prime\epsilon\) for any smooth function \(f\). To prove this fact, let us first show that the property holds for positive integer exponentiation. Base case: \((x+x^\prime\epsilon)^1 = x^1 + 1x^0x^\prime\epsilon\) Hypothesis: \((x+x^\prime\epsilon)^n = x^n + nx^{n-1}x^\prime\epsilon\) Induct: \[\begin{align*} (x+x^\prime\epsilon)^{n+1} &= (x^n + nx^{n-1}x^\prime\epsilon)(x+x^\prime\epsilon) \tag{Hypothesis}\\ &= x^{n+1} + x^nx^\prime\epsilon + nx^nx^\prime\epsilon + nx^{n-1}x^{\prime^2}\epsilon^2\\ &= x^{n+1} + (n+1)x^nx^\prime\epsilon \end{align*}\] We can use this result to prove the same property for any smooth function \(f\). 
Examining the Taylor expansion of \(f\) at zero (also known as its Maclaurin series): \[f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(0)x^n}{n!} = f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots\] By plugging in our dual number… \[\begin{align*} f(x+x^\prime\epsilon) &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x+x^\prime\epsilon)^2}{2!} + \frac{f^{\prime\prime\prime}(0)(x+x^\prime\epsilon)^3}{3!} + \dots\\ &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x^2+2xx^\prime\epsilon)}{2!} + \frac{f^{\prime\prime\prime}(0)(x^3+3x^2x^\prime\epsilon)}{3!} + \dots \\ &= f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots \\ &\phantom{= }+ \left(f^\prime(0) + f^{\prime\prime}(0)x + \frac{f^{\prime\prime\prime}(0)x^2}{2!} + \dots \right)x^\prime\epsilon \\ &= f(x) + f^\prime(x)x^\prime\epsilon \end{align*}\] …we prove the result! In the last step, we recover the Maclaurin series for both \(f(x)\) and \(f^\prime(x)\). Implementation Implementing forward-mode autodiff in code can be very straightforward: we just have to replace our Float type with a DiffFloat that keeps track of both our value and its dual coefficient. If we then implement the relevant math operations for DiffFloat, all we have to do is run the program! Unfortunately, JavaScript does not support operator overloading, so we’ll define a DiffFloat to be a two-element array and use functions to implement some basic arithmetic operations: function Const(n) { return [n, 0]; } function Add(x, y) { return [x[0] + y[0], x[1] + y[1]]; } function Times(x, y) { return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]]; } If we implement our function \(f\) in terms of these primitives, evaluating \(f([x,1])\) will return \([f(x),f^\prime(x)]\)! This property extends naturally to higher-dimensional functions, too. If \(f\) has multiple outputs, their derivatives pop out in the same way. 
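Concretely, these three primitives already suffice to differentiate a small program. A self-contained sketch, repeating the definitions above so it runs on its own:

```javascript
// Forward-mode autodiff with dual numbers as two-element arrays:
// [value, derivative].
function Const(n) { return [n, 0]; }
function Add(x, y) { return [x[0] + y[0], x[1] + y[1]]; }
function Times(x, y) { return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]]; }

// f(x) = x^2 + x, so f'(x) = 2x + 1.
function f(x) { return Add(Times(x, x), x); }

// Seed the dual part with 1 to differentiate with respect to x.
const [value, derivative] = f([3, 1]); // [12, 7]
```

Evaluating f([3, 1]) yields [12, 7]: \(f(3) = 12\) and \(f^\prime(3) = 2 \cdot 3 + 1 = 7\).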
If \(f\) has inputs other than \(x\), assigning them constants means the result will be the partial derivative \(f_x\). Limitations While forward-mode autodiff does compute exact derivatives, it suffers from the same fundamental problem as finite differences: each invocation of \(f\) can only compute the directional derivative of \(f\) for a single direction. It’s useful to think of a forward mode derivative as computing one column of the gradient matrix. Hence, if \(f\) has few inputs but many outputs, forward mode can still be quite efficient at recovering the full gradient: \[\begin{align*} \nabla f &= \begin{bmatrix} f_{1_x}(x,y) & f_{1_y}(x,y) \\ \vdots & \vdots \\ f_{n_x}(x,y) & f_{n_y}(x,y) \end{bmatrix} \\ \hphantom{\nabla f}&\hphantom{ ==|}\begin{array}{} \underbrace{\hphantom{f_{n_x}(x,y)}}_{\text{Pass 1}} & \underbrace{\hphantom{f_{n_y}(x,y)}}_{\text{Pass 2}} \end{array} \end{align*}\] Unfortunately, optimization problems in machine learning and graphics often have the opposite structure: \(f\) has a huge number of inputs (e.g. the coefficients of a 3D scene or neural network) and a single output. That is, \(\nabla f\) has many columns and few rows. Backward Mode As you might have guessed, backward mode autodiff provides a way to compute a row of the gradient using a single invocation of \(f\). For optimizing many-to-one functions, this is exactly what we want: the full gradient in one pass. In this section, we will use Leibniz’s notation for derivatives, which is: \[f^\prime(x) = \frac{\partial f}{\partial x}\] Leibniz’s notation makes it easier to write down the derivative of an arbitrary variable with respect to an arbitrary input.
Derivatives also take on nice algebraic properties, if you squint a bit: \[g(f(x))^\prime = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} = \frac{\partial g}{\partial x}\] Backpropagation Similarly to how forward-mode autodiff propagated derivatives from inputs to outputs, backward-mode propagates derivatives from outputs to inputs. That sounds easy enough, but the code only runs in one direction. How would we know what the gradient of our input should be before evaluating the rest of the function? We don’t—when evaluating \(f\), we use each operation to build a computational graph that represents \(f\). That is, when \(f\) tells us to perform an operation, we create a new node noting what the operation is and connect it to the nodes representing its inputs. In this way, a pure function can be nicely represented as a directed acyclic graph, or DAG. For example, the function \(f(x,y) = x^2 + xy\) may be represented with the following graph: When evaluating \(f\) at a particular input, we write down the intermediate values computed by each node. This step is known as the forward pass, and computes primal values. Then, we begin the backward pass, where we compute dual values, or derivatives. Our ultimate goal is to compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\). At first, we only know the derivative of \(f\) with respect to the final plus—they’re the same value, so \(\frac{\partial f}{\partial +} = 1\). We can see that the output was computed by adding together two incoming values. Increasing either input to the sum would increase the output by an equal amount, so derivatives propagated through this node should be unaffected. That is, if \(+\) is the output and \(+_1,+_2\) are the inputs, \(\frac{\partial +}{\partial +_1} = \frac{\partial +}{\partial +_2} = 1\).
Now we can use the chain rule to combine our derivatives, getting closer to the desired result: \(\frac{\partial f}{\partial +_1} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_1} = 1\), \(\frac{\partial f}{\partial +_2} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_2} = 1\). When we evaluate a node, we know the derivative of \(f\) with respect to its output. That means we can propagate the derivative back along the node’s incoming edges, modifying it based on the node’s operation. As long as we evaluate all outputs of a node before the node itself, we only have to check each node once. To assure proper ordering, we may traverse the graph in reverse topological order. Once we get to a multiplication node, there’s slightly more to do: the derivative now depends on the primal input values. That is, if \(f(x,y) = xy\), \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\). By applying the chain rule, we get \(\frac{\partial f}{\partial *_1} = 1\cdot*_2\) and \(\frac{\partial f}{\partial *_2} = 1\cdot*_1\) for both multiplication nodes. Applying the chain rule one last time, we get \(\frac{\partial f}{\partial y} = 2\). But \(x\) has multiple incoming derivatives—how do we combine them? Each incoming edge represents a different way \(x\) affects \(f\), so \(x\)’s total contribution is simply their sum. That means \(\frac{\partial f}{\partial x} = 7\). Let’s check our result: \[\begin{align*} f_x(x,y) &= 2x + y &&\implies& f_x(2,3) &= 7 \\ f_y(x,y) &= x &&\implies& f_y(2,3) &= 2 \end{align*}\] You’ve probably noticed that traversing the graph built up a derivative term for each path from an input to the output. That’s exactly the behavior that arose when we manually computed the gradient using the chain rule! Backpropagation is essentially the chain rule upgraded with dynamic programming. 
Traversing the graph in reverse topological order means we only have to evaluate each vertex once—and re-use its derivative everywhere else it shows up. Despite having to express \(f\) as a computational graph and traverse both forward and backward, the whole algorithm has the same time complexity as \(f\) itself. Space complexity, however, is a separate issue. Implementation We can implement backward mode autodiff using a similar approach as forward mode. Instead of making every operation use dual numbers, we can make each step add a node to our computational graph. function Const(n) { return {op: 'const', in: [n], out: undefined, grad: 0}; } function Add(x, y) { return {op: 'add', in: [x, y], out: undefined, grad: 0}; } function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; } Note that JavaScript will automatically store references within the in arrays, hence build a DAG instead of a tree. If we implement our function \(f\) in terms of these primitives, we can evaluate it on an input node to automatically build the graph. let in_node = {op: 'const', in: [/* TBD */], out: undefined, grad: 0}; let out_node = f(in_node); The forward pass performs a post-order traversal of the graph, translating inputs to outputs for each node. Remember we’re operating on a DAG: we must check whether a node is already resolved, lest we recompute values that could be reused. function forward(node) { if (node.out !== undefined) return; if (node.op === 'const') { node.out = node.in[0]; } else if (node.op === 'add') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out + node.in[1].out; } else if (node.op === 'times') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out * node.in[1].out; } } The backward pass is conceptually similar, but a naive pre-order traversal would end up tracing out every path in the DAG—every gradient has to be pushed back to the roots. 
Instead, we’ll first compute a reverse topological ordering of the nodes. This ordering guarantees that when we reach a node, everything “downstream” of it has already been resolved—we’ll never have to return. function backward(out_node) { const order = topological_sort(out_node).reverse(); for (const node of order) { if (node.op === 'add') { node.in[0].grad += node.grad; node.in[1].grad += node.grad; } else if (node.op === 'times') { node.in[0].grad += node.in[1].out * node.grad; node.in[1].grad += node.in[0].out * node.grad; } } } Finally, we can put our functions together to compute \(f(x)\) and \(f'(x)\): function evaluate(x, in_node, out_node) { in_node.in = [x]; forward(out_node); out_node.grad = 1; backward(out_node); return [out_node.out, in_node.grad]; } Just remember to clear all the out and grad fields before evaluating again! Lastly, the working implementation: Limitations If \(f\) is a function of multiple variables, we can simply read the gradients from the corresponding input nodes. That means we’ve computed a whole row of \(\nabla f\). Of course, if the gradient has many rows and few columns, forward mode would have been more efficient. \[\begin{align*} \nabla f &= \begin{bmatrix} \vphantom{\Big|} f_{0_a}(a,\dots,n) & \dots & f_{0_n}(a,\dots,n) \\ \vphantom{\Big|} f_{1_a}(a,\dots,n) & \dots & f_{1_n}(a,\dots,n) \end{bmatrix} \begin{matrix} \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 1} \\ \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 2} \end{matrix} \end{align*}\] Unfortunately, backward mode comes with another catch: we had to store the intermediate result of every single computation inside \(f\)! If we’re passing around substantial chunks of data, say, weight matrices for a neural network, storing the intermediate results can require an unacceptable amount of memory and memory bandwidth.
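One loose end: backward relies on a topological_sort helper that isn’t defined in the text. Here is one possible sketch (my own, not the author’s code)—a depth-first post-order traversal of the DAG:

```javascript
// Topological sort via DFS post-order. The visited set ensures shared
// subexpressions are emitted exactly once, and pushing a node only after
// recursing into its inputs guarantees every input precedes its users.
function topological_sort(out_node) {
  const order = [];
  const visited = new Set();
  function visit(node) {
    if (visited.has(node)) return;
    visited.add(node);
    // const nodes store a raw number in `in`, not child nodes
    if (node.op !== 'const') {
      for (const input of node.in) visit(input);
    }
    order.push(node);
  }
  visit(out_node);
  return order;
}

// Example: the graph for x * x, where both inputs share one node.
const x = { op: 'const', in: [2], out: undefined, grad: 0 };
const square = { op: 'times', in: [x, x], out: undefined, grad: 0 };
const order = topological_sort(square); // [x, square]
```

Reversing this ordering, as backward does, then visits each node only after all of its consumers.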
If \(f\) contains loops, it’s especially bad—because every value is immutable, naive loops will create long chains of intermediate values. For this reason, real-world frameworks tend to encapsulate loops in monolithic parallel operations that have analytic derivatives. Many engineering hours have gone into reducing space requirements. One problem-agnostic approach is called checkpointing: we can choose not to store intermediate results at some nodes, rather re-computing them on the fly during the backward pass. Checkpointing gives us a natural space-time tradeoff: by strategically choosing which nodes store intermediate results (e.g. ones with expensive operations), we can reduce memory usage without dramatically increasing runtime. Even with checkpointing, training the largest neural networks requires far more fast storage than is available to a single computer. By partitioning our computational graph between multiple systems, each one only needs to store values for its local nodes. Unfortunately, this implies edges connecting nodes assigned to different processors must send their values across a network, which is expensive. Hence, communication costs may be minimized by finding min-cost graph cuts. Graphs and Higher-Order Autodiff Earlier, we could have computed primal values while evaluating \(f\) itself. Frameworks like PyTorch and TensorFlow take this approach—evaluating \(f\) both builds the graph (also known as the ‘tape’) and evaluates the forward-pass results. The user may call backward at any point, propagating gradients to all inputs that contributed to the result. However, the forward-backward approach can limit the system’s potential performance. The forward pass is relatively easy to optimize via parallelizing, vectorizing, and distributing graph traversal. The backward pass, on the other hand, is harder to parallelize, as it requires a topological traversal and coordinated gradient accumulation. 
Furthermore, the backward pass lacks some mathematical power. While computing a specific derivative is easy, we don’t get back a general representation of \(\nabla f\). If we wanted the gradient of the gradient (the Hessian), we’re back to relying on numerical differentiation. Thinking about the gradient as a higher-order function reveals a potentially better approach. If we can represent \(f\) as a computational graph, there’s no reason we can’t also represent \(\nabla f\) in the same way. In fact, we can simply add nodes to the graph of \(f\) that compute derivatives with respect to each input. Because the graph already computes primal values, each node in \(f\) only requires us to add a constant number of nodes in the graph of \(\nabla f\). That means the result is only a constant factor larger than the input—evaluating it requires exactly the same computations as the forward-backward algorithm. For example, given the graph of \(f(x) = x^2 + x\), we can produce the following: Defining differentiation as a function on computational graphs unifies the forward and backward passes: we get a single graph that computes both \(f\) and \(\nabla f\). That means we can work with higher order derivatives by applying the transformation again! Even better, distributed training is easier as we no longer have to worry about synchronizing gradient updates across multiple systems. JAX implements this approach, enabling its seamless gradient, JIT compilation, and vectorization transforms. PyTorch also supports higher-order differentiation via including backward operations in the computational graph, and functorch provides a JAX-like API. De-blurring an Image Let’s use our fledgling differentiable programming framework to solve a real optimization problem: de-blurring an image. We’ll assume our observed image was computed using a simple box filter, i.e., each blurred pixel is the average of the surrounding 3x3 ground-truth pixels. 
Of course, a blur loses information, so we won’t be able to reconstruct the exact input—but we can get pretty close! \[\text{Blur}(\text{Image})_{xy} = \frac{1}{9} \sum_{i=-1}^1 \sum_{j=-1}^1 \text{Image}_{(x+i)(y+j)}\] Ground Truth Image Observed Image We’ll need to add one more operation to our framework: division. The operation and forward pass are much the same as addition and multiplication, but the backward pass must compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) for \(f = \frac{x}{y}\). function Divide(x, y) { return {op: 'divide', in: [x, y], out: undefined, grad: 0}; } // Forward... if(node.op === 'divide') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out / node.in[1].out; } // Backward... if(node.op === 'divide') { node.in[0].grad += node.grad / node.in[1].out; node.in[1].grad += (-node.grad * node.in[0].out / (node.in[1].out * node.in[1].out)); } Before we start programming, we need to express our task as an optimization problem. That entails minimizing a loss function that measures how far away we are from our goal. Let’s start by guessing an arbitrary image—for example, a solid grey block. We can then compare the result of blurring our guess with the observed image. The farther our blurred result is from the observation, the larger the loss should be. For simplicity, we will define our loss as the total squared difference between each corresponding pixel. \[\text{Loss}(\text{Blur}(\text{Guess}), \text{Observed}) = \sum_{x=0}^W\sum_{y=0}^H (\text{Blur}(\text{Guess})_{xy} - \text{Observed}_{xy})^2\] Using differentiable programming, we can compute \(\frac{\partial \text{Loss}}{\partial \text{Guess}}\), i.e. how changes in our proposed image change the resulting loss. That means we can apply gradient descent to the guess, guiding it towards a state that minimizes the loss function. Hopefully, if our blurred guess matches the observed image, our guess will match the ground truth image.
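Before scaling up to images, the loss-gradient-step loop is worth seeing on a scalar toy problem. A framework-free sketch (my own) minimizing \((x - 5)^2\), whose gradient is \(2(x - 5)\):

```javascript
// Plain gradient descent on a 1D loss: loss(x) = (x - 5)^2.
// Each step moves x against the gradient, scaled by step_size.
function descend(x0, step_size, iterations) {
  let x = x0;
  for (let i = 0; i < iterations; i++) {
    const grad = 2 * (x - 5); // d/dx of (x - 5)^2
    x -= step_size * grad;
  }
  return x;
}
```

After a hundred steps with step size 0.1, x converges to within about \(10^{-9}\) of the minimizer 5; a step size above 1 makes the iterates diverge, the same overshoot behavior discussed below.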
Let’s implement our loss function in differentiable code. First, create the guess image by initializing the differentiable parameters to solid grey. Each pixel has three components: red, green, and blue. let guess_image = new Array(W*H*3); for (let i = 0; i < W * H * 3; i++) { guess_image[i] = Const(127); } Second, apply the blur using differentiable operations. let blurred_guess_image = new Array(W*H*3); for (let x = 0; x < W; x++) { for (let y = 0; y < H; y++) { let [r,g,b] = [Const(0), Const(0), Const(0)]; // Accumulate pixels for averaging for (let i = -1; i <= 1; i++) { for (let j = -1; j <= 1; j++) { // Convert 2D pixel coordinate to 1D row-major array index const xi = clamp(x + i, 0, W - 1); const yj = clamp(y + j, 0, H - 1); const idx = (yj * W + xi) * 3; r = Add(r, guess_image[idx + 0]); g = Add(g, guess_image[idx + 1]); b = Add(b, guess_image[idx + 2]); } } // Set result to average const idx = (y * W + x) * 3; blurred_guess_image[idx + 0] = Divide(r, Const(9)); blurred_guess_image[idx + 1] = Divide(g, Const(9)); blurred_guess_image[idx + 2] = Divide(b, Const(9)); } } Finally, compute the loss using differentiable operations. let loss = Const(0); for (let x = 0; x < W; x++) { for (let y = 0; y < H; y++) { const idx = (y * W + x) * 3; let dr = Add(blurred_guess_image[idx + 0], Const(-observed_image[idx + 0])); let dg = Add(blurred_guess_image[idx + 1], Const(-observed_image[idx + 1])); let db = Add(blurred_guess_image[idx + 2], Const(-observed_image[idx + 2])); loss = Add(loss, Times(dr, dr)); loss = Add(loss, Times(dg, dg)); loss = Add(loss, Times(db, db)); } } Calling forward(loss) performs the whole computation, storing results in each node’s out field. Calling backward(loss) computes the derivative of loss at every node, storing results in each node’s grad field. Let’s write a simple optimization routine that performs gradient descent on the guess image.
function gradient_descent_step(step_size) { // Clear output values and gradients reset(loss); // Forward pass forward(loss); // Backward pass loss.grad = 1; backward(loss); // Move parameters along gradient for (let i = 0; i < W * H * 3; i++) { let p = guess_image[i]; p.in[0] -= step_size * p.grad; } } We’d also like to compute error, the squared distance between our guess image and the ground truth. We can’t use error to inform our algorithm—we’re not supposed to know what the ground truth was—but we can use it to measure how well we are reconstructing the image. For the current iteration, we visualize the guess image, the guess image after blurring, and the gradient of loss with respect to each pixel. The slider adjusts the step size. After running several steps, you’ll notice that even though loss goes to zero, error does not: the loss function does not provide enough information to exactly reconstruct the ground truth. We can also see that optimization behavior depends on the step size—small steps require many iterations to converge, and large steps may overshoot the target, oscillating between too-dark and too-bright images. Further Reading If you’d like to learn more about differentiable programming in ML and graphics, check out the following resources: JAX, PyTorch, TensorFlow, tinygrad, TinyAD Physics-Based Deep Learning, Differentiable Rendering Intro, Differentiable Rendering Papers, Neural Fields in Visual Computing CMU Deep Learning SystemsLake Tahoe2022-06-10T00:00:00+00:002022-06-10T00:00:00+00:00https://thenumbat.github.io/Lake%20Tahoe<p>Incline Village, NV, 2022</p>
<p>My cat Ruby, 2022</p>
<div style="text-align: center;"><img src="/assets/tahoe/0.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/1.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/2.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/3.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/4.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/5.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/6.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/7.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/8.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/9.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/10.webp" /></div>
<p></p>Exponentially Better Rotations2022-04-15T00:00:00+00:002022-04-15T00:00:00+00:00https://thenumbat.github.io/Exponential-Rotations<p><em>Based on <a href="http://15462.courses.cs.cmu.edu/">CMU 15-462</a> course materials by <a href="https://www.cs.cmu.edu/~kmcrane/">Keenan Crane</a>.</em></p>
<p>If you’ve done any 3D programming, you’ve likely encountered the zoo of techniques and representations used when working with 3D rotations. Some of them are better than others, depending on the situation.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#representations">Representations</a>
<ul>
<li><a href="#rotation-matrices">Rotation Matrices</a></li>
<li><a href="#euler-angles">Euler Angles</a></li>
<li><a href="#quaternions">Quaternions</a></li>
<li><a href="#axisangle-rotations">Axis/Angle</a></li>
</ul>
</li>
<li><a href="#the-exponential-and-logarithmic-maps">The Exponential and Logarithmic Maps</a>
<ul>
<li><a href="#axisangle-in-2d">Axis/Angle in 2D</a></li>
<li><a href="#axisangle-in-3d">Axis/Angle in 3D</a></li>
<li><a href="#averaging-rotations">Averaging Rotations</a></li>
<li><a href="#quaternions-again">Quaternions (Again)</a></li>
</ul>
</li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
<h1 id="representations">Representations</h1>
<h2 id="rotation-matrices">Rotation Matrices</h2>
<p>Linear-algebra-wise, the most straightforward representation is an orthonormal 3x3 matrix (with positive determinant). The three columns of a rotation matrix specify where the x, y, and z axes end up after the rotation.</p>
<p>Rotation matrices are particularly useful for transforming points: just multiply! Even better, rotation matrices can be composed with any other linear transformations via matrix multiplication. That’s why we use rotation matrices when actually drawing things on screen: only one matrix multiplication is required to transform a point from world-space to the screen. However, rotation matrices are not so useful for actually working with <em>rotations</em>: because they don’t form a vector space, adding together two rotation matrices will not give you a rotation matrix back. For example, animating an object by linearly interpolating between two rotation matrices adds scaling:</p>
<div class="center" style="width: 100%;" id="matrix_interp_container">
<canvas id="matrix_interp" class="diagram3d"></canvas>
</div>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="matrix_interp_t" /></td>
<td><button id="matrix_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ R_0 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} $$</td>
<td id="matrix_interp_mj">$$ R(0) = \begin{bmatrix}1.00&0.00&0.00\\0.00&1.00&0.00\\0.00&0.00&1.00\end{bmatrix} $$</td>
<td id="matrix_interp_m1">$$ R_1 = \begin{bmatrix}-1&0&0\\0&1&0\\0&0&-1\end{bmatrix} $$</td>
</tr></table>
<h2 id="euler-angles">Euler Angles</h2>
<p>Another common representation is Euler angles, which specify three separate rotations about the x, y, and z axes (also known as pitch, yaw, and roll). The order in which the three component rotations are applied is an arbitrary convention—here we’ll apply x, then y, then z.</p>
<canvas id="euler" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0">
<tr><td>$$\theta_x$$</td><td><input type="range" min="0" max="100" value="0" class="slider" id="euler_x" /></td></tr>
<tr><td>$$\theta_y$$</td><td><input type="range" min="0" max="100" value="0" class="slider" id="euler_y" /></td><td><button id="euler_lock" type="button">Lock</button></td></tr>
<tr><td>$$\theta_z$$</td><td><input type="range" min="0" max="100" value="0" class="slider" id="euler_z" /></td></tr>
</table>
<p>Euler angles are generally well-understood and often used for authoring rotations. However, using them for anything else comes with some significant pitfalls. While it’s possible to manually create splines that nicely interpolate Euler angles, straightforward interpolation often produces undesirable results.</p>
<p>Euler angles suffer from <em>gimbal lock</em> when one component causes the other two axes of rotation to become parallel. Such configurations are called <em>singularities</em>. At a singularity, changing either of two ‘locked’ angles will cause the same output rotation. You can demonstrate this phenomenon above by pressing the ‘lock’ button and adjusting the x/z rotations (a quarter rotation about y aligns the z axis with the x axis).</p>
<p>Singularities break interpolation: if the desired path reaches a singularity, it gains a degree of freedom with which to represent its current position. Picking an arbitrary representation to continue with causes discontinuities in the interpolated output: even within an axis, interpolation won’t produce a constant angular velocity. That can be a problem if, for example, you’re using the output to drive a robot. Furthermore, since each component angle is cyclic, linear interpolation won’t always choose the shortest path between rotations.</p>
<canvas id="euler_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="euler_interp_t" /></td>
<td><button id="euler_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ \mathbf{\theta}_0 = \begin{bmatrix}0\\0\\0\end{bmatrix} $$</td>
<td id="euler_interp_mj">$$ \mathbf{\theta}(0) = \begin{bmatrix}0.00\\0.00\\0.00\end{bmatrix} $$</td>
<td id="euler_interp_m1">$$ \mathbf{\theta}_1 = \begin{bmatrix}-3.14\\0.00\\-3.14\end{bmatrix} $$</td>
</tr></table>
<p>Thankfully, interpolation is smooth if the path doesn’t go through a singularity, so these limitations <em>can</em> be worked around, especially if you don’t need to represent ‘straight up’ and ‘straight down.’</p>
<h2 id="quaternions">Quaternions</h2>
<p>At this point, you might be expecting yet another article on quaternions—don’t worry, we’re not going to delve into hyper-complex numbers today. It suffices to say that unit quaternions are the standard tool for composing and interpolating rotations, since <a href="https://en.wikipedia.org/wiki/Slerp">spherical linear interpolation</a> (slerp) chooses a constant-velocity shortest path between any two quaternions. However, unit quaternions also don’t form a vector space, are unintuitive to author, and can be computationally costly to interpolate<a href="http://number-none.com/product/Hacking%20Quaternions/">*</a>. Further, there’s no intuitive notion of scalar multiplication, nor averaging.</p>
<p>But, they’re still fascinating! If you’d like to understand quaternions more deeply (or, perhaps, learn what they are in the first place), read <a href="https://eater.net/quaternions">this</a>.</p>
<canvas id="quat_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="quat_interp_t" /></td>
<td><button id="quat_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ Q_0 = \begin{bmatrix}1\\0\\0\\0\end{bmatrix}\ \ $$</td>
<td id="quat_interp_mj">$$ Q(0) = \begin{bmatrix}1.00\\0.00\\0.00\\0.00\end{bmatrix}\ \ $$</td>
<td id="quat_interp_m1">$$ Q_1 = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} $$</td>
</tr></table>
<p>Note that since quaternions double-cover the space of rotations, sometimes \(Q(1)\) will go to \(-Q_1\).</p>
<h2 id="axisangle-rotations">Axis/Angle Rotations</h2>
<p>An axis/angle rotation is a 3D vector of real numbers. Its direction specifies the axis of rotation, and its magnitude specifies the angle to rotate about that axis. For convenience, we’ll write axis/angle rotations as \(\theta\mathbf{u}\), where \(\mathbf{u}\) is a unit-length vector and \(\theta\) is the rotation angle.</p>
<canvas id="axis_angle" class="diagram3d" style="padding-bottom: 10px;"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0">
<tr><td>$$\mathbf{u}_x$$</td><td><input type="range" min="0" max="100" value="60" class="slider" id="axis_angle_x" /></td></tr>
<tr><td>$$\mathbf{u}_y$$</td><td><input type="range" min="0" max="100" value="75" class="slider" id="axis_angle_y" /></td></tr>
<tr><td>$$\mathbf{u}_z$$</td><td><input type="range" min="0" max="100" value="50" class="slider" id="axis_angle_z" /></td></tr>
<tr><td>$$\theta$$</td><td><input type="range" min="0" max="100" value="80" class="slider" id="axis_angle_t" /></td></tr>
</table>
<p>Since axis/angle rotations are simply 3D vectors, they form a vector space: we can add, scale, and interpolate them to our heart’s content. Linearly interpolating between any two axis/angle rotations is smooth and imparts constant angular velocity. However, note that linearly interpolating between axis-angle rotations does not necessarily choose the shortest path: it depends on which axis/angle you use to specify the target rotation.</p>
<canvas id="axis_angle_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="axis_angle_interp_t" /></td>
<td><button id="axis_angle_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ \theta_0\mathbf{u}_0 = \begin{bmatrix}0\\0\\0\end{bmatrix}\ \ $$</td>
<td id="axis_angle_interp_mj">$$ \theta\mathbf{u}(0) = \begin{bmatrix}0.00\\0.00\\0.00\end{bmatrix}\ \ $$</td>
<td id="axis_angle_interp_m1">$$ \theta_1\mathbf{u}_1 = \begin{bmatrix}0\\3.14\\0\end{bmatrix} $$</td>
</tr></table>
<p>Like quaternions, axis/angle vectors double-cover the space of rotations: sometimes \(\theta\mathbf{u}(1)\) will go to \((2\pi - \theta_1)(-\mathbf{u}_1)\).</p>
<h1 id="the-exponential-and-logarithmic-maps">The Exponential and Logarithmic Maps</h1>
<p>Ideally, we could freely convert rotations between these diverse representations based on our use case. We will always want to get a rotation matrix out at the end, so we’ll consider matrices the ‘canonical’ form. Enter the <strong>exponential map</strong>: a function that takes a different kind of rotation object and gives us back an equivalent rotation matrix. The corresponding <strong>logarithmic map</strong> takes a rotation matrix and gives us back a rotation object. How these maps relate to the scalar <code class="language-plaintext highlighter-rouge">exp</code> and <code class="language-plaintext highlighter-rouge">log</code> functions will hopefully become clear later on.</p>
<div class="center" style="width: 90%;"><img style="width: 100%; max-width: 600px;" class="center" src="/assets/exp-rotations/map.png" /></div>
<p>Below, we’ll define an <code class="language-plaintext highlighter-rouge">exp</code> and <code class="language-plaintext highlighter-rouge">log</code> map translating between rotation matrices and axis/angle vectors. But first, to build up intuition, let us consider how rotations work in two dimensions.</p>
<h2 id="axisangle-in-2d">Axis/Angle in 2D</h2>
<p>In 2D, there’s only one axis to rotate around: the one pointing out of the plane. Hence, our ‘axis/angle’ rotations can be represented by just \(\theta\).</p>
<p>Given a 2D point \(\mathbf{p}\), how can we rotate \(\mathbf{p}\) by \(\theta\)? One way to visualize the transformation is by forming a coordinate frame in which the output is easy to describe. Consider \(\mathbf{p}\) and its quarter (\(90^\circ\)) rotation \(\mathcal{J}\mathbf{p}\):</p>
<div class="center" style="width: 90%;"><img class="diagram2d center" src="/assets/exp-rotations/2d_rot.svg" /></div>
<p>Using a bit of trigonometry, we can describe the rotated \(\mathbf{p}_\theta\) in two components:</p>
\[\begin{align*} \mathbf{p}_\theta &= \mathbf{p}\cos\theta + \mathcal{J}\mathbf{p}\sin\theta \\
&= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \end{align*}\]
<p>But, what actually are \(\mathcal{I}\) and \(\mathcal{J}\)? The former should take a 2D vector and return it unchanged: it’s the 2x2 identity matrix. The latter should perform a quarter turn, swapping the two components and negating the first, i.e. taking \((x, y)\) to \((-y, x)\):</p>
\[\begin{align*} \mathcal{I} &= \begin{bmatrix}1&0\\0&1\end{bmatrix} \\ \mathcal{J} &= \begin{bmatrix}0&-1\\1&0\end{bmatrix} \end{align*}\]
<p>Just to make sure we got \(\mathcal{J}\) right, let’s check what happens if we apply it twice (via \(\mathcal{J}^2\)):</p>
\[\mathcal{J}^2 = \begin{bmatrix}0&-1\\1&0\end{bmatrix}\begin{bmatrix}0&-1\\1&0\end{bmatrix} = \begin{bmatrix}-1&0\\0&-1\end{bmatrix}\]
<p>We got \(\mathcal{J}^2 = -\mathcal{I}\), which is a 180-degree rotation. So, \(\mathcal{J}\) indeed represents 90-degree rotation.</p>
<p>Now, what does our transform look like?</p>
\[\begin{align*}
\mathbf{p}_\theta &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \\
&= \left(\cos(\theta)\begin{bmatrix}1&0\\0&1\end{bmatrix} + \sin(\theta)\begin{bmatrix}0&-1\\1&0\end{bmatrix}\right)\mathbf{p} \\
&= \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\mathbf{p}
\end{align*}\]
<p>That’s the standard 2D rotation matrix. What a coincidence!</p>
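<p>As a quick sanity check, we can build \(\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J}\) numerically and confirm it rotates points as expected. A small Python sketch (the names <code class="language-plaintext highlighter-rouge">rot2</code> and <code class="language-plaintext highlighter-rouge">apply2</code> are ours, just for illustration):</p>

```python
import math

def rot2(theta):
    # Build cos(theta) * I + sin(theta) * J, where J is a quarter turn.
    c, s = math.cos(theta), math.sin(theta)
    I = ((1.0, 0.0), (0.0, 1.0))
    J = ((0.0, -1.0), (1.0, 0.0))
    return tuple(tuple(c * I[r][k] + s * J[r][k] for k in range(2))
                 for r in range(2))

def apply2(M, p):
    # Matrix-vector product for 2x2 matrices.
    return (M[0][0] * p[0] + M[0][1] * p[1],
            M[1][0] * p[0] + M[1][1] * p[1])
```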
<h3 id="complex-rotations">Complex Rotations</h3>
<p>If you’re familiar with complex numbers, you might notice that our first transform formula feels eerily similar to Euler’s formula, \(e^{ix} = \cos x + i\sin x\):</p>
\[\begin{align*}
\mathbf{p}_\theta &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \\
e^{i\theta}p &= (\cos(\theta) + i\sin(\theta))p
\end{align*}\]
<div class="center" style="width: 60%;"><img class="diagram2d center" src="/assets/exp-rotations/expi.svg" /></div>
<p>Here, \(i\) and \(\mathcal{J}\) both play the role of a quarter turn. We can check that in complex arithmetic, multiplying by \(i\) indeed has that effect:</p>
\[\mathcal{J}\mathbf{p} = \begin{bmatrix}0&-1\\1&0\end{bmatrix}\begin{bmatrix}a\\b\end{bmatrix} = \begin{bmatrix}-b\\a\end{bmatrix}\]
\[ip = i(a + bi) = ai + bi^2 = -b + ai\]
<p>So, there must be some connection to the exponential function here.</p>
<h3 id="the-2d-exponential-map">The 2D Exponential Map</h3>
<p>Recall the definition of the exponential function (or equivalently, its Taylor series):</p>
\[e^x = \sum_{k=0}^\infty \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots\]
<p>Using Euler’s formula, \(e^{i\theta}\) gave us a complex number representing a 2D rotation by \(\theta\). Can we do the same with \(e^{\theta\mathcal{J}}\)? If we plug a matrix into the above definition, the arithmetic still works out: 2x2 matrices certainly support multiplication, addition, and scaling. (More on matrix exponentiation <a href="https://www.youtube.com/watch?v=O85OWBJ2ayo">here</a>.)</p>
<p>Let \(\theta\mathcal{J} = A\) and plug it in:</p>
\[e^A = \sum_{k=0}^\infty \frac{A^k}{k!} = \mathcal{I} + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \dots\]
<p>Let’s pull out the first four terms to inspect further:</p>
\[\begin{align*}
e^A &= \mathcal{I} + A + \frac{1}{2!}A^2 + \frac{1}{3!}A^3 \\
&= \mathcal{I} + A\left(\mathcal{I} + \frac{1}{2}A\left(\mathcal{I} + \frac{1}{3}A\right)\right) \\
&= \mathcal{I} + A\left(\mathcal{I} + \frac{1}{2}A\begin{bmatrix}1&\frac{-\theta}{3}\\\frac{\theta}{3}&1\end{bmatrix}\right) \\
&= \mathcal{I} + A\begin{bmatrix}1-\frac{\theta^2}{6}&\frac{-\theta}{2}\\\frac{\theta}{2}&1-\frac{\theta^2}{6}\end{bmatrix} \\
&= \begin{bmatrix}1-\frac{\theta^2}{2}&-\theta+\frac{\theta^3}{6}\\\theta-\frac{\theta^3}{6}&1-\frac{\theta^2}{2}\end{bmatrix}
\end{align*}\]
<p>These entries look familiar. Recall the Taylor series that describe the functions \(\sin\) and \(\cos\):</p>
\[\begin{align*} \sin x &= x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots \\
\cos x &= 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots \end{align*}\]
<p>If we write out all the terms of \(e^A\), we’ll recover the expansions of \(\sin\theta\) and \(\cos\theta\)! Therefore:</p>
\[e^A = e^{\theta\mathcal{J}} = \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\]
<p>We’ve determined that the exponential function \(e^{\theta\mathcal{J}}\) converts our angle \(\theta\) into a corresponding 2D rotation matrix. In fact, we’ve proved a version of Euler’s formula with 2x2 matrices instead of complex numbers:</p>
\[\begin{align*} && e^{\theta\mathcal{J}} &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\\
&\implies& \mathbf{p}_\theta &= e^{\theta\mathcal{J}}\mathbf{p}
\end{align*}\]
<h3 id="the-2d-logarithmic-map">The 2D Logarithmic Map</h3>
<p>The logarithmic map should naturally be the inverse of the exponential:</p>
\[R = \exp(\theta\mathcal{J}) \implies \log(R) = \theta\mathcal{J}\]
<p>So, given \(R\), how can we recover \(\theta\mathcal{J}\)?</p>
\[\begin{align*}
&& R &= \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\\
&\implies& \theta &= \text{atan2}(R_{21},R_{11})\\
&\implies& \log(R) &= \begin{bmatrix}0&-\theta\\\theta&0\end{bmatrix}
\end{align*}\]
<p>Note that in general, our exponential map is not injective. Clearly, \(\exp(\theta\mathcal{J}) = \exp((\theta + 2\pi)\mathcal{J})\), since adding an extra full turn will always give us back the same rotation matrix. Therefore, our logarithmic map can’t be surjective—we’ll define it as returning the <em>smallest</em> angle \(\theta\mathcal{J}\) corresponding to the given rotation matrix. Using \(\text{atan2}\) implements this definition.</p>
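<p>In code, the 2D logarithmic map is a one-liner plus some packaging. A sketch (function name is ours):</p>

```python
import math

def log_rot2(R):
    # Logarithmic map for 2D rotation matrices: recover the smallest
    # equivalent angle with atan2, then return theta * J.
    theta = math.atan2(R[1][0], R[0][0])
    return [[0.0, -theta], [theta, 0.0]]
```

Because `atan2` returns a value in \((-\pi, \pi]\), a rotation by \(3\pi/2\) comes back as the equivalent smallest angle \(-\pi/2\).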
<h3 id="interpolation">Interpolation</h3>
<p>Consider two 2D rotation angles \(\theta_0\) and \(\theta_1\). The most obvious way to interpolate between these two rotations is to interpolate the angles and create the corresponding rotation matrix. This scheme is essentially a 2D version of axis-angle interpolation.</p>
\[\begin{align*} \theta(t) &= (1-t)\theta_0 + t\theta_1\\
R_\theta(t) &= \begin{bmatrix}\cos(\theta(t))&-\sin(\theta(t))\\\sin(\theta(t))&\cos(\theta(t))\end{bmatrix}
\end{align*}\]
<p>However, if \(\theta_0\) and \(\theta_1\) are separated by more than \(\pi\), this expression will take the long way around: it’s not aware that angles are cyclic.</p>
<canvas id="2d_angle_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="2d_angle_interp_t" /></td>
<td><button id="2d_angle_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ \theta_0 = 0\ \ $$</td>
<td id="2d_angle_interp_mj">$$ \theta(0) = 0.00\ \ $$</td>
<td id="2d_angle_interp_m1">$$ \theta_1 = 4.71 $$</td>
</tr></table>
<p>Instead, let’s devise an interpolation scheme based on our <code class="language-plaintext highlighter-rouge">exp</code>/<code class="language-plaintext highlighter-rouge">log</code> maps. Since we know the two rotation matrices \(R_0\), \(R_1\), we can express the rotation that takes us directly from the initial pose to the final pose: \(R_1R_0^{-1}\), i.e. first undo \(R_0\), then apply \(R_1\).</p>
<p>Using our logarithmic map, we can obtain the smallest angle that rotates from \(R_0\) to \(R_1\): \(\log(R_1R_0^{-1})\). Since \(\log\) gives us an axis-angle rotation, we can simply scale the result by \(t\) to perform interpolation. After scaling, we can use our exponential map to get back a rotation matrix. This matrix represents a rotation \(t\) of the way from \(R_0\) to \(R_1\).</p>
<p>Hence, our final parametric rotation matrix is \(R(t) = \exp(t\log(R_1R_0^{-1}))R_0\).</p>
\[\begin{align*} R(0) &= \exp(0)R_0 = R_0\\
R(1) &= \exp(\log(R_1R_0^{-1}))R_0 = R_1R_0^{-1}R_0 = R_1 \end{align*}\]
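<p>Putting the pieces together, here is a Python sketch of this interpolation scheme in 2D (names are ours; recall that a rotation matrix’s inverse is its transpose):</p>

```python
import math

def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def interp_rot2(R0, R1, t):
    # R(t) = exp(t log(R1 R0^-1)) R0. In 2D, log just extracts an angle
    # with atan2, and exp rebuilds the rotation matrix.
    D = mat_mul(R1, transpose(R0))
    theta = math.atan2(D[1][0], D[0][0])  # smallest angle from R0 to R1
    return mat_mul(rot(t * theta), R0)
```

Even when the target is specified as a rotation by \(3\pi/2\), this takes the short way around (through \(-\pi/4\) at the halfway point), exactly because \(\log\) returns the smallest equivalent angle.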
<canvas id="2d_exp_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="2d_exp_interp_t" /></td>
<td><button id="2d_exp_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ R_0 = \begin{bmatrix}1&0\\0&1\end{bmatrix} $$</td>
<td id="2d_exp_interp_mj">$$ R(0) = \begin{bmatrix}1.00&0.00\\0.00&1.00\end{bmatrix} $$</td>
<td id="2d_exp_interp_m1">$$ R_1 = \begin{bmatrix}0&-1\\1&0\end{bmatrix} $$</td>
</tr></table>
<p>Using <code class="language-plaintext highlighter-rouge">exp</code>/<code class="language-plaintext highlighter-rouge">log</code> for interpolation might seem like overkill for 2D—we could instead just check how far apart the angles are. But below, we’ll see how this interpolation scheme generalizes—without modification—to 3D, and in fact any number of dimensions.</p>
<h2 id="axisangle-in-3d">Axis/Angle in 3D</h2>
<p>We’re finally ready to derive an exponential and logarithmic map for 3D rotations. In 2D, our map arose from exponentiating \(\theta\mathcal{J}\), i.e. \(\theta\) times a matrix representing a counter-clockwise quarter turn about the axis of rotation. We will be able to do the same in 3D—but what transformation encodes a quarter turn about a 3D unit vector \(\mathbf{u}\)?</p>
<p>The cross product \(\mathbf{u}\times\mathbf{p}\) is typically defined as a vector normal to the plane containing both \(\mathbf{u}\) and \(\mathbf{p}\). However, we could also interpret \(\mathbf{u}\times\mathbf{p}\) as a quarter turn of the <em>projection</em> of \(\mathbf{p}\) into the plane with normal \(\mathbf{u}\), which we will call \(\mathbf{p}_\perp\):</p>
<div class="center" style="width: 85%;"><img class="diagram2d center" src="/assets/exp-rotations/crossu.svg" /></div>
<p>So, if we can compute the quarter rotation of \(\mathbf{p}_\perp\), it should be simple to recover the quarter rotation of \(\mathbf{p}\). Of course, \(\mathbf{p}=\mathbf{p}_\perp+\mathbf{p}_\parallel\), so we’ll just have to add back the parallel part \(\mathbf{p}_\parallel\). This is correct because a rotation about \(\mathbf{u}\) preserves \(\mathbf{p}_\parallel\):</p>
<div class="center" style="width: 85%;"><img class="diagram2d center" src="/assets/exp-rotations/crossu2.svg" /></div>
<p>However, “\(\mathbf{u} \times\)” is not a mathematical object we can work with. Instead, we can devise a matrix \(\mathbf{\hat{u}}\) that, when multiplied with a vector \(\mathbf{p}\), outputs the same result as \(\mathbf{u} \times \mathbf{p}\):</p>
\[\begin{align*}
\mathbf{u} \times \mathbf{p} &= \begin{bmatrix} u_yp_z - u_zp_y \\ u_zp_x - u_xp_z \\ u_xp_y - u_yp_x \end{bmatrix} \\
&= \begin{bmatrix}0&-u_z&u_y\\u_z&0&-u_x\\-u_y&u_x&0\end{bmatrix}\begin{bmatrix}p_x\\p_y\\p_z\end{bmatrix} \\
&= \mathbf{\hat{u}}\mathbf{p}
\end{align*}\]
<p>We can see that \(\mathbf{\hat{u}}^T = -\mathbf{\hat{u}}\), so \(\mathbf{\hat{u}}\) is a <em>skew-symmetric</em> matrix. (i.e. it has zeros along the diagonal, and the two halves are equal but negated.) Note that in the 2D case, our quarter turn \(\mathcal{J}\) was also skew-symmetric, and sneakily represented the 2D cross product! We must be on the right track.</p>
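<p>The hat operator is easy to write down and check numerically. A Python sketch (again with hypothetical names):</p>

```python
def hat(u):
    # Cross-product (skew-symmetric) matrix: hat(u) applied to p equals u x p.
    ux, uy, uz = u
    return [[0.0, -uz,  uy],
            [ uz, 0.0, -ux],
            [-uy,  ux, 0.0]]

def mat_vec(M, p):
    return [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]

def cross(u, p):
    # The usual component-wise definition of the cross product.
    return [u[1] * p[2] - u[2] * p[1],
            u[2] * p[0] - u[0] * p[2],
            u[0] * p[1] - u[1] * p[0]]
```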
<p>The reason we want to use axis/angle rotations in the first place is because they form a vector space. So, let’s make sure our translation to skew-symmetric matrices maintains that property. Given two skew-symmetric matrices \(A_1\) and \(A_2\):</p>
\[(A_1 + A_2)^T = A_1^T + A_2^T = -A_1 - A_2 = -(A_1 + A_2)\]
<p>Their sum is also a skew-symmetric matrix. Similarly:</p>
\[(cA)^T = c(A^T) = -cA\]
<p>Scalar multiplication also maintains skew-symmetry. The other vector space properties follow from the usual definition of matrix addition.</p>
<p>Finally, note that \(\mathbf{u} \times (\mathbf{u} \times (\mathbf{u} \times \mathbf{p})) = -\mathbf{u} \times \mathbf{p}\). Taking the cross product three times would rotate \(\mathbf{p}_\perp\) three-quarter turns about \(\mathbf{u}\), which is equivalent to a single negative-quarter turn. More generally, \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\) for any \(k>0\). We could prove this by writing out all the terms, but the geometric argument is easier:</p>
<div class="center" style="width: 100%; padding-bottom: 20px;"><img class="diagram2d center" src="/assets/exp-rotations/crossu3.svg" /></div>
<h3 id="the-3d-exponential-map">The 3D Exponential Map</h3>
<p>Given an axis/angle rotation \(\theta\mathbf{u}\), we can make \(\theta\mathbf{\hat{u}}\) using the above construction. What happens when we exponentiate it? Using the identity \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\):</p>
\[\begin{align*}
e^{\theta\mathbf{\hat{u}}} &= \mathcal{I} + \theta\mathbf{\hat{u}} + \frac{1}{2!}\theta^2\mathbf{\hat{u}}^2 + \frac{1}{3!}\theta^3\mathbf{\hat{u}}^3 + \frac{1}{4!}\theta^4\mathbf{\hat{u}}^4 + \frac{1}{5!}\theta^5\mathbf{\hat{u}}^5 + \dots \\
&= \mathcal{I} + \theta\mathbf{\hat{u}} + \frac{1}{2!}\theta^2\mathbf{\hat{u}}^2 - \frac{1}{3!}\theta^3\mathbf{\hat{u}} - \frac{1}{4!}\theta^4\mathbf{\hat{u}}^2 + \frac{1}{5!}\theta^5\mathbf{\hat{u}} + \dots \\
&= \mathcal{I} + \left(\theta - \frac{1}{3!}\theta^3 + \frac{1}{5!}\theta^5 - \dots\right)\mathbf{\hat{u}} + \left(1 - \left(1 - \frac{1}{2!}\theta^2 + \frac{1}{4!}\theta^4 - \dots\right)\right)\mathbf{\hat{u}}^2 \\
&= \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2
\end{align*}\]
<p>In the last step, we again recover the Taylor expansions of \(\sin\theta\) and \(\cos\theta\). Our final expression is known as <em>Rodrigues’ formula</em>.</p>
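<p>Rodrigues’ formula translates directly into code. A Python sketch of the 3D exponential map (function name ours), which normalizes its input axis so that \(\mathbf{u}\) is unit length as the derivation assumes:</p>

```python
import math

def rodrigues(axis, theta):
    # exp(theta * hat(u)) = I + sin(theta) U + (1 - cos(theta)) U^2
    n = math.sqrt(sum(a * a for a in axis))
    ux, uy, uz = (a / n for a in axis)  # normalize: u must be unit length
    U = [[0.0, -uz, uy], [uz, 0.0, -ux], [-uy, ux, 0.0]]
    U2 = [[sum(U[i][k] * U[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[float(i == j) + s * U[i][j] + c * U2[i][j] for j in range(3)]
            for i in range(3)]
```

A quarter turn about the \(z\) axis reproduces the familiar 2D rotation matrix in the upper-left block, with the \(z\) coordinate left unchanged.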
<p>This formula is already reminiscent of the 2D case: the latter two terms are building up a 2D rotation in the plane defined by \(\mathbf{u}\). To sanity check our 3D result, let’s compute our transform for \(\theta = 0\):</p>
\[e^{0\mathbf{\hat{u}}}\mathbf{p} = (\mathcal{I} + 0\mathbf{\hat{u}} + (1-1)\mathbf{\hat{u}}^2)\mathbf{p} = \mathbf{p}\]
<p>Rotating by \(\theta = 0\) preserves \(\mathbf{p}\), so the formula works. Then compute for \(\theta = \frac{\pi}{2}\):</p>
\[\begin{align*} e^{\frac{\pi}{2}\mathbf{\hat{u}}}\mathbf{p} &= (\mathcal{I} + 1\mathbf{\hat{u}} + (1-0)\mathbf{\hat{u}}^2)\mathbf{p} \\ &= \mathbf{p} + \mathbf{\hat{u}}\mathbf{p} + \mathbf{\hat{u}}^2\mathbf{p} \\
&= \mathbf{p} + \mathbf{u}\times\mathbf{p} + \mathbf{u}\times(\mathbf{u}\times\mathbf{p})\\
&= (\mathbf{p}_\perp + \mathbf{p}_\parallel) + \mathbf{u}\times\mathbf{p} - \mathbf{p}_\perp\\
&= \mathbf{u}\times\mathbf{p} + \mathbf{p}_\parallel
\end{align*}\]
<p>Above, we already concluded \(\mathbf{u}\times\mathbf{p} + \mathbf{p}_\parallel\) is a quarter rotation. So, our formula is also correct at \(\theta = \frac{\pi}{2}\). Then compute for \(\theta = \pi\):</p>
\[\begin{align*} e^{\pi\mathbf{\hat{u}}}\mathbf{p} &= (\mathcal{I} + 0\mathbf{\hat{u}} + (1-(-1))\mathbf{\hat{u}}^2)\mathbf{p} \\
&= \mathbf{p} + 2\mathbf{\hat{u}}^2\mathbf{p} \\
&= (\mathbf{p}_\perp + \mathbf{p}_\parallel) - 2\mathbf{p}_\perp \\
&= -\mathbf{p}_\perp + \mathbf{p}_\parallel
\end{align*}\]
<div class="center" style="width: 100%;"><img class="diagram2d center" src="/assets/exp-rotations/crossu4.svg" /></div>
<p>We end up with \(-\mathbf{p}_\perp + \mathbf{p}_\parallel\), which is a half rotation. Hence \(\theta = \pi\) is also correct.</p>
<p>So far, our formula checks out. Just to be sure, let’s prove that our 3D result is a rotation matrix, i.e. it’s orthonormal and has positive determinant. A matrix is orthonormal if \(A^TA = \mathcal{I}\), so again using \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\):</p>
\[\begin{align*}
&\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right)^T\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\
=& \left(\mathcal{I}^T + \sin(\theta)\mathbf{\hat{u}}^T + (1-\cos(\theta))\left(\mathbf{\hat{u}}^T\right)^2\right)\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\
=&\ (\mathcal{I} - \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2)\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\
=&\ \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)\mathbf{\hat{u}} - \sin^2(\theta)\mathbf{\hat{u}}^2 - \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}}^3 \\&+ (1-\cos(\theta))\mathbf{\hat{u}}^2 + \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}}^3 + (1-\cos(\theta))^2\mathbf{\hat{u}}^4 \\
=&\ \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)\mathbf{\hat{u}} - \sin^2(\theta)\mathbf{\hat{u}}^2 + \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}} \\&+ (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}} - (1-\cos(\theta))^2\mathbf{\hat{u}}^2\\
=&\ \mathcal{I} + \left(2(1-\cos(\theta)) - \sin^2(\theta) - (1-\cos(\theta))^2\right)\mathbf{\hat{u}}^2\\
=&\ \mathcal{I} + (-\sin^2(\theta)+1-\cos^2(\theta))\mathbf{\hat{u}}^2\\
=&\ \mathcal{I} + (1-(\sin^2(\theta)+\cos^2(\theta)))\mathbf{\hat{u}}^2\\
=&\ \mathcal{I}
\end{align*}\]
<p>Therefore, \(e^{\theta\mathbf{\hat{u}}}\) is orthonormal. We could show its determinant is positive (and therefore \(1\)) by writing out all the terms, but it suffices to argue that:</p>
<ul>
<li>Clearly, \(\begin{vmatrix}\exp(0\mathbf{\hat{u}})\end{vmatrix} = \begin{vmatrix}\mathcal{I}\end{vmatrix} = 1\)</li>
<li>There is no \(\theta\), \(\mathbf{\hat{u}}\) such that \(\begin{vmatrix}\exp(\theta\mathbf{\hat{u}})\end{vmatrix} = 0\): we just showed \(\exp(\theta\mathbf{\hat{u}})\) is orthonormal, so its determinant is always \(\pm1\).</li>
<li>\(\exp\) is continuous with respect to \(\theta\) and \(\mathbf{\hat{u}}\)</li>
</ul>
<p>Therefore, \(\begin{vmatrix}\exp(\theta\mathbf{\hat{u}})\end{vmatrix}\) can never become negative. That means \(\exp(\theta\mathbf{\hat{u}})\) is a 3D rotation matrix!</p>
<h3 id="the-3d-logarithmic-map">The 3D Logarithmic Map</h3>
<p>Similarly to the 2D case, the 3D exponential map is not injective, so the 3D logarithmic map will not be surjective. Instead, we will again define it to return the <em>smallest magnitude</em> axis-angle rotation corresponding to the given matrix. Our exponential map gave us:</p>
\[R = \exp(\theta\mathbf{\hat{u}}) = \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\]
<p>We can take the <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)">trace</a> (sum along the diagonal) of both sides:</p>
\[\operatorname{tr}(R) = \operatorname{tr}(\mathcal{I}) + \sin(\theta)\operatorname{tr}(\mathbf{\hat{u}}) + (1-\cos(\theta))\operatorname{tr}(\mathbf{\hat{u}}^2)\]
<p>Clearly \(\operatorname{tr}(\mathcal{I}) = 3\), and since \(\mathbf{\hat{u}}\) is skew-symmetric, its diagonal sum is zero. That just leaves \(\mathbf{\hat{u}}^2\):</p>
\[\mathbf{\hat{u}}^2 = \begin{bmatrix}-u_y^2-u_z^2&u_xu_y&u_xu_z\\u_xu_y&-u_x^2-u_z^2&u_yu_z\\u_xu_z&u_yu_z&-u_x^2-u_y^2\end{bmatrix}\]
<p>We can see \(\operatorname{tr}(\mathbf{\hat{u}}^2) = -2u_x^2-2u_y^2-2u_z^2 = -2\|\mathbf{u}\|^2 = -2\). (We originally defined \(\mathbf{u}\) as a unit vector.) Our final trace becomes:</p>
\[\begin{align*} && \operatorname{tr}(R) &= 3 + 0\sin(\theta) - 2(1-\cos(\theta)) \\ &&&= 1 + 2\cos\theta \\
&\implies& \theta &= \arccos\left(\frac{\operatorname{tr}(R)-1}{2}\right) \end{align*}\]
<p>That’s half of our logarithmic map. To recover \(\mathbf{\hat{u}}\), we can antisymmetrize \(R\). Recall \(\mathbf{\hat{u}}^T = -\mathbf{\hat{u}}\), and that \(\mathbf{\hat{u}}^2\) is symmetric (above).</p>
\[\begin{align*}
&& R - R^T &= \mathcal{I} - \mathcal{I}^T + \sin(\theta)(\mathbf{\hat{u}}-\mathbf{\hat{u}}^T) + (1-\cos(\theta))(\mathbf{\hat{u}}^2-(\mathbf{\hat{u}}^2)^T) \\
&&&= \sin(\theta)(\mathbf{\hat{u}}+\mathbf{\hat{u}}) + (1-\cos(\theta))(\mathbf{\hat{u}}^2-\mathbf{\hat{u}}^2) \\
&&&= 2\sin(\theta)\mathbf{\hat{u}} \\
&\implies& \mathbf{\hat{u}} &= \frac{1}{2\sin\theta}(R-R^T) \\
\end{align*}\]
<p>Finally, to get \(\mathbf{u}\), we can pull out the entries of \(\mathbf{\hat{u}}\), which we just derived:</p>
\[\mathbf{u} = \frac{1}{2\sin\theta}\begin{bmatrix}R_{32}-R_{23}\\ R_{13}-R_{31}\\R_{21}-R_{12}\end{bmatrix}\]
<p>We now have our full 3D logarithmic map!</p>
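<p>In code, the 3D logarithmic map looks like the following Python sketch (names ours). Note two caveats a robust implementation must handle: we clamp the \(\arccos\) argument against floating-point drift, and \(\theta\) near \(\pi\) (where \(\sin\theta \approx 0\)) needs more care than shown here.</p>

```python
import math

def log_rotation(R):
    # Smallest-magnitude axis/angle vector theta * u for a rotation matrix R.
    # Sketch only: theta near pi (where sin(theta) ~ 0) needs special care.
    trace = R[0][0] + R[1][1] + R[2][2]
    theta = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if theta < 1e-9:
        return [0.0, 0.0, 0.0]  # near the identity, the log is ~zero
    k = theta / (2.0 * math.sin(theta))
    # Antisymmetrize R to recover the entries of hat(u), scaled by theta.
    return [k * (R[2][1] - R[1][2]),
            k * (R[0][2] - R[2][0]),
            k * (R[1][0] - R[0][1])]
```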
<h3 id="interpolation-1">Interpolation</h3>
<p>Now equipped with our 3D <code class="language-plaintext highlighter-rouge">exp</code> and <code class="language-plaintext highlighter-rouge">log</code> maps, we can use them for interpolation. The exact same formula as the 2D case still applies:</p>
\[R(t) = \exp(t\log(R_1R_0^{-1}))R_0\]
<canvas id="3d_exp_interp" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><input type="range" min="0" max="100" value="0" class="slider" id="3d_exp_interp_t" /></td>
<td><button id="3d_exp_interp_rand" type="button">Randomize</button></td>
</tr></table>
<table class="mathtable" cellspacing="0" cellpadding="0"><tr>
<td>$$ R_0 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} $$</td>
<td id="3d_exp_interp_mj">$$ R(0) = \begin{bmatrix}1.00&0.00&0.00\\0.00&1.00&0.00\\0.00&0.00&1.00\end{bmatrix} $$</td>
<td id="3d_exp_interp_m1">$$ R_1 = \begin{bmatrix}-0.42&-0.59&-0.69\\0.51&-0.79&0.36\\-0.75&-0.20&0.63\end{bmatrix} $$</td>
</tr></table>
<p>Our interpolation scheme produces all the nice properties of axis/angle rotations—<strong>and</strong> chooses the shortest path every time. This wouldn’t look so smooth with Euler angles!</p>
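<p>Assembling our 3D <code class="language-plaintext highlighter-rouge">exp</code> and <code class="language-plaintext highlighter-rouge">log</code> maps into the interpolation formula gives the following self-contained Python sketch (helper names ours; the helpers are repeated so the snippet stands alone):</p>

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def exp_rotation(v):
    # Rodrigues' formula: axis/angle vector -> rotation matrix.
    theta = math.sqrt(sum(c * c for c in v))
    if theta < 1e-12:
        return [[float(i == j) for j in range(3)] for i in range(3)]
    ux, uy, uz = (c / theta for c in v)
    U = [[0.0, -uz, uy], [uz, 0.0, -ux], [-uy, ux, 0.0]]
    U2 = mat_mul(U, U)
    s, cc = math.sin(theta), 1.0 - math.cos(theta)
    return [[float(i == j) + s * U[i][j] + cc * U2[i][j] for j in range(3)]
            for i in range(3)]

def log_rotation(R):
    # Inverse map: rotation matrix -> smallest axis/angle vector.
    trace = R[0][0] + R[1][1] + R[2][2]
    theta = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if theta < 1e-9:
        return [0.0, 0.0, 0.0]
    k = theta / (2.0 * math.sin(theta))
    return [k * (R[2][1] - R[1][2]), k * (R[0][2] - R[2][0]),
            k * (R[1][0] - R[0][1])]

def interp_rotation(R0, R1, t):
    # R(t) = exp(t * log(R1 R0^-1)) R0, using transpose as the inverse.
    v = log_rotation(mat_mul(R1, transpose(R0)))
    return mat_mul(exp_rotation([t * c for c in v]), R0)
```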
<h2 id="averaging-rotations">Averaging Rotations</h2>
<p>However, we would have gotten an equally good interpolation scheme by just using quaternions instead of messing about with all this matrix math. Let’s consider something interesting we can <em>only</em> easily do with axis/angle rotations: averaging a set of rotation matrices.</p>
<p>The most straightforward method is to convert each matrix into an axis/angle rotation, average the resulting vectors, and convert back. That is certainly a valid strategy, but the resulting behavior won’t be very intuitive:</p>
<canvas id="rot_avg" class="diagram3d"></canvas>
<p><button class="center" id="rot_avg_rand" type="button">Randomize</button></p>
<p>In particular, summing axis-angle vectors can result in “catastrophic cancellation.” An extreme example is averaging \(\begin{bmatrix}\pi&0&0\end{bmatrix}\) and \(\begin{bmatrix}-\pi&0&0\end{bmatrix}\), resulting in zero—which is clearly not representative of the two equivalent rotations.</p>
<p>To find an alternative, let’s first consider a slightly unconventional way of averaging points in the plane. The average of a set of points is the point that minimizes total squared distance to all others. Hence, there’s an optimization-based algorithm for finding it. Given \(x_1, \dots, x_n\), we can iterate the following procedure:</p>
<ul>
<li>Pick an initial guess \(\bar{x} \in \mathbb{R}^2\) (can be one of the points).</li>
<li>Repeat:
<ul>
<li>For each point, get its translation from the guess: \(\mathbf{u}_i \leftarrow x_i - \bar{x}\)</li>
<li>Average the vectors: \(\mathbf{u} \leftarrow \frac{1}{n} \sum_{i=1}^n \mathbf{u}_i\)</li>
<li>Step toward the average direction: \(\bar{x} \leftarrow \bar{x} + \tau\mathbf{u}\)</li>
</ul>
</li>
<li>while \(\|\mathbf{u}\| > \epsilon\).</li>
</ul>
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/exp-rotations/pointavg.svg" /></div>
<p>As we run this procedure, \(\bar{x}\) will converge to the average point. Of course, we could have just averaged the points directly, but we’ll be able to translate this idea to the rotational case rather nicely.</p>
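<p>The point-averaging loop is a few lines of Python (names and step size \(\tau = 0.5\) are our choices; since the update just moves the guess a fraction of the way toward the mean, any \(0 < \tau \le 1\) converges):</p>

```python
import math

def average_points(points, tau=0.5, eps=1e-9):
    # Iteratively step the guess toward the mean displacement vector.
    x = list(points[0])  # initial guess: the first point
    while True:
        u = [sum(p[i] - x[i] for p in points) / len(points) for i in range(2)]
        x = [x[i] + tau * u[i] for i in range(2)]
        if math.hypot(u[0], u[1]) <= eps:
            return x
```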
<p>Our logarithmic map lets us convert rotation matrices to axis/angle rotations, which are themselves just 3D points. So, what if we use the point averaging algorithm on rotations \(R_1, \dots, R_n\)?</p>
<ul>
<li>Pick an initial guess \(\bar{R} \in \mathbb{R}^{3\times3}\) (can be \(\mathcal{I}\)).</li>
<li>Repeat:
<ul>
<li>For each matrix, get its axis/angle rotation from the guess: \(\mathbf{u}_i \leftarrow \log(R_i\bar{R}^{-1})\)</li>
<li>Average the vectors: \(\mathbf{u} \leftarrow \frac{1}{n} \sum_{i=1}^n \mathbf{u}_i\)</li>
<li>Step toward the average rotation: \(\bar{R} \leftarrow \exp(\tau\mathbf{u})\bar{R}\)</li>
</ul>
</li>
<li>while \(\|\mathbf{u}\| > \epsilon\).</li>
</ul>
<canvas id="karcher" class="diagram3d"></canvas>
<table class="slidertable" cellspacing="0" cellpadding="0"><tr>
<td><button id="karcher_step" type="button" class="l">Step</button></td>
<td><button id="karcher_rand" type="button" class="r">Randomize</button></td>
</tr></table>
<p>The result of this algorithm is formally known as the <em>Karcher mean</em>. Just like how the average point minimizes total squared distance from all other points, the Karcher mean is a rotation that minimizes squared <em>angular</em> distance from all other rotations. Therefore, it won’t be subject to catastrophic cancellation—we’ll always end up with a non-zero in-between rotation.</p>
<p>Try comparing the two averaging algorithms—randomizing will keep them in sync. While the results are often similar, the Karcher mean exhibits more consistent behavior.</p>
<h2 id="quaternions-again">Quaternions (Again)</h2>
<p><em>Warning: section assumes knowledge of quaternions</em></p>
<p>Okay, I couldn’t resist talking about quaternions at least a little bit, given how closely they’re related to axis/angle rotations. Just like how complex exponentiation turned out to be equivalent to (skew-symmetric) 2D matrix exponentiation, quaternion exponentiation is equivalent to (skew-symmetric) 3D matrix exponentiation.</p>
<p>In 2D, an axis/angle rotation was simply \(\theta\). We created a pure-imaginary complex number \(i\theta\) and exponentiated it:</p>
\[e^{i\theta} = \cos\theta + i\sin\theta\]
<p>We got back a complex number that when multiplied with a point, rotates it by \(\theta\). It’s always the case that \(\|\cos\theta + i\sin\theta\| = 1\), so 2D rotations can be represented as unit-norm complex numbers.</p>
<p>In 3D, an axis/angle rotation is a vector \(\mathbf{u}\) such that \(\|\mathbf{u}\| = \theta\). What happens if we create a pure-imaginary quaternion \(\mathbf{q} = u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k}\) and exponentiate it, too?</p>
<p>To make evaluating \(e^\mathbf{q}\) easier, first derive the following using the quaternion <a href="https://en.wikipedia.org/wiki/Quaternion#Multiplication_of_basis_elements">multiplication rules</a>:</p>
\[\begin{align*}
\mathbf{q}^2 &= (u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k})(u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k}) \\
&= u_x^2\mathbf{i}^2 + u_xu_y\mathbf{i}\mathbf{j} + u_xu_z\mathbf{i}\mathbf{k} + u_yu_x\mathbf{j}\mathbf{i} + u_y^2\mathbf{j}^2 + u_yu_z\mathbf{j}\mathbf{k} + u_zu_x\mathbf{k}\mathbf{i} + u_zu_y\mathbf{k}\mathbf{j} + u_z^2\mathbf{k}^2 \\
&= -u_x^2 + u_xu_y\mathbf{k} - u_xu_z\mathbf{j} - u_yu_x\mathbf{k} - u_y^2 + u_yu_z\mathbf{i} + u_zu_x\mathbf{j} - u_zu_y\mathbf{i} - u_z^2 \\
&= -u_x^2-u_y^2-u_z^2\\
&= -\|\mathbf{q}\|^2\\
&= -\theta^2
\end{align*}\]
<p>This is highly reminiscent of the skew-symmetric matrix identity used above. Therefore…</p>
\[\begin{align*}
e^\mathbf{q} &= 1 + \mathbf{q} + \frac{\mathbf{q}^2}{2!} + \frac{\mathbf{q}^3}{3!} + \frac{\mathbf{q}^4}{4!} + \frac{\mathbf{q}^5}{5!} + \dots \\
&= 1 + \frac{\theta\mathbf{q}}{\theta} - \frac{\theta^2}{2!} - \frac{\theta^3\mathbf{q}}{3!\theta} + \frac{\theta^4}{4!} + \frac{\theta^5\mathbf{q}}{5!\theta} \dots \\
&= \left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \dots\right) + \frac{\mathbf{q}}{\theta}\left(\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \dots\right) \\
&= \cos\theta + \frac{\mathbf{q}}{\theta}\sin\theta \\
&\approx \cos\theta + \frac{\mathbf{u}}{\|\mathbf{u}\|}\sin\theta
\end{align*}\]
<p>Our result looks almost exactly like the 2D case, just with three imaginary axes instead of one. In 2D, our axis/angle rotation became a unit-norm complex number. In 3D, it became a unit-norm quaternion. Now we can use this quaternion to rotate 3D points! Pretty cool, right?</p>
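<p>To make this concrete, here is a minimal Python sketch of the quaternion exponential map (an illustration, not this article’s own code), assuming quaternions are stored as plain <code>(w, x, y, z)</code> tuples; the helper name <code>quat_exp</code> is ours:</p>

```python
import math

def quat_exp(u):
    """Exponential map: axis/angle vector u (with |u| = theta) -> unit quaternion.

    Implements e^q = cos(theta) + (q / theta) sin(theta) for the
    pure-imaginary quaternion q = u_x*i + u_y*j + u_z*k.
    """
    theta = math.sqrt(u[0] ** 2 + u[1] ** 2 + u[2] ** 2)
    if theta < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)  # exp(0) is the identity rotation
    s = math.sin(theta) / theta
    return (math.cos(theta), s * u[0], s * u[1], s * u[2])

# A quarter turn about the z axis: |u| = theta = pi/2.
q = quat_exp((0.0, 0.0, math.pi / 2))
# The result is unit-norm, as expected of a rotation quaternion.
assert abs(sum(c * c for c in q) - 1.0) < 1e-12
```

<p>Note the small-angle guard: as \(\theta \rightarrow 0\), \(\sin(\theta)/\theta \rightarrow 1\), so the identity quaternion is the correct limit.</p>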
<p>One advantage of using quaternions is how easy the exponential map is to compute—if you don’t need a rotation matrix, it’s a good option. The quaternion logarithmic map is similarly simple:</p>
\[\theta = \arccos(\Re(\mathbf{q})), \mathbf{u} = \frac{1}{\sin\theta}\Im(\mathbf{q})\]
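<p>In the same illustrative <code>(w, x, y, z)</code> convention (with another hypothetical helper name), the logarithmic map recovers the axis/angle vector \(\theta\mathbf{u}\) from a unit quaternion:</p>

```python
import math

def quat_log(q):
    """Logarithmic map: unit quaternion (w, x, y, z) -> axis/angle vector with |u| = theta."""
    w, x, y, z = q
    theta = math.acos(max(-1.0, min(1.0, w)))  # clamp Re(q) to guard against rounding
    s = math.sin(theta)
    if abs(s) < 1e-12:
        return (0.0, 0.0, 0.0)  # q is (close to) +-1: identity rotation, no unique axis
    # Im(q) = sin(theta) * u, so divide by sin(theta) and rescale to length theta.
    return (theta * x / s, theta * y / s, theta * z / s)

# log inverts exp: a quarter turn about z comes back as (0, 0, pi/2).
u = quat_log((0.0, 0.0, 0.0, 1.0))
assert abs(u[2] - math.pi / 2) < 1e-12
```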
<p>Finally, note that the way to rotate a point \(\mathbf{p}\) by a quaternion \(\mathbf{q}\) is by evaluating the conjugation \(\mathbf{q}\mathbf{p}\mathbf{q}^{-1}\), where \(\mathbf{p} = p_x\mathbf{i} + p_y\mathbf{j} + p_z\mathbf{k}\) is another pure-imaginary quaternion representing our point. The conjugation technically rotates the point by \(2\theta\) about \(\mathbf{u}\), but that’s easily accounted for by making \(\|\mathbf{u}\| = \frac{\theta}{2}\) in the beginning.</p>
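<p>Putting it together, rotating a point by conjugation might look like the following sketch (again assuming <code>(w, x, y, z)</code> tuples; <code>qmul</code> and <code>rotate</code> are our own illustrative helpers). Note the half-angle, which compensates for conjugation rotating by \(2\theta\):</p>

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def rotate(p, axis, theta):
    """Rotate point p by theta about the unit vector axis, via q p q^{-1}."""
    half = theta / 2.0  # conjugation rotates by twice the quaternion's angle
    s = math.sin(half)
    q = (math.cos(half), s * axis[0], s * axis[1], s * axis[2])
    q_inv = (q[0], -q[1], -q[2], -q[3])  # the conjugate inverts a unit quaternion
    pq = (0.0, p[0], p[1], p[2])  # embed p as a pure-imaginary quaternion
    r = qmul(qmul(q, pq), q_inv)
    return (r[1], r[2], r[3])

# A quarter turn about z takes the x axis to the y axis (up to rounding).
x, y, z = rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
assert abs(x) < 1e-12 and abs(y - 1.0) < 1e-12 and abs(z) < 1e-12
```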
<h1 id="further-reading">Further Reading</h1>
<p>Made it this far? Well, there’s even more to learn about rotations.</p>
<p>Learn about quaternions <a href="https://eater.net/quaternions">here</a>, and why geometric algebra is more intuitive <a href="https://marctenbosch.com/quaternions/">here</a>.</p>
<p>Beyond understanding the four representations covered here (plus geometric algebra), it can be enlightening to learn about the algebraic structure underlying all 3D rotations: the group \(SO(3)\). I found <a href="https://www.youtube.com/watch?v=ACZC_XEyg9U">this video</a> to be a great resource: it explains \(SO(3)\) both intuitively and visually, demonstrating how it relates to the group \(SU(2)\) as well as why quaternions and axis/angle rotations double-cover 3D rotation matrices.</p>
<p>The <a href="https://en.wikipedia.org/wiki/3D_rotation_group">Wikipedia page on SO(3)</a> is also informative, though <em>very</em> math-heavy. It touches on connections with axis/angle rotations, topology, \(SU(2)\), quaternions, and Lie algebra. It turns out the vector space of skew-symmetric matrices we derived above makes up \(\mathfrak{so}(3)\), the Lie algebra that corresponds to \(SO(3)\)—but I don’t know what that entails.</p>
<script type="module" src="/assets/exp-rotations/rotations.js"></script>Based on CMU 15-462 course materials by Keenan Crane. If you’ve done any 3D programming, you’ve likely encountered the zoo of techniques and representations used when working with 3D rotations. Some of them are better than others, depending on the situation. Table of Contents Representations Rotation Matrices Euler Angles Quaternions Axis/Angle The Exponential and Logarithmic Maps Axis/Angle in 2D Axis/Angle in 3D Averaging Rotations Quaternions (Again) Further Reading Representations Rotation Matrices Linear-algebra-wise, the most straightforward representation is an orthonormal 3x3 matrix (with positive determinant). The three columns of a rotation matrix specify where the x, y, and z axes end up after the rotation. Rotation matrices are particularly useful for transforming points: just multiply! Even better, rotation matrices can be composed with any other linear transformations via matrix multiplication. That’s why we use rotation matrices when actually drawing things on screen: only one matrix multiplication is required to transform a point from world-space to the screen. However, rotation matrices are not so useful for actually working with rotations: because they don’t form a vector space, adding together two rotation matrices will not give you a rotation matrix back. For example, animating an object by linearly interpolating between two rotation matrices adds scaling: Randomize $$ R_0 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} $$ $$ R(0) = \begin{bmatrix}1.00&0.00&0.00\\0.00&1.00&0.00\\0.00&0.00&1.00\end{bmatrix} $$ $$ R_1 = \begin{bmatrix}-1&0&0\\0&1&0\\0&0&-1\end{bmatrix} $$ Euler Angles Another common representation is Euler angles, which specify three separate rotations about the x, y, and z axes (also known as pitch, yaw, and roll). The order in which the three component rotations are applied is an arbitrary convention—here we’ll apply x, then y, then z. 
$$\theta_x$$ $$\theta_y$$Lock $$\theta_z$$ Euler angles are generally well-understood and often used for authoring rotations. However, using them for anything else comes with some significant pitfalls. While it’s possible to manually create splines that nicely interpolate Euler angles, straightforward interpolation often produces undesirable results. Euler angles suffer from gimbal lock when one component causes the other two axes of rotation to become parallel. Such configurations are called singularities. At a singularity, changing either of two ‘locked’ angles will cause the same output rotation. You can demonstrate this phenomenon above by pressing the ‘lock’ button and adjusting the x/z rotations (a quarter rotation about y aligns the z axis with the x axis). Singularities break interpolation: if the desired path reaches a singularity, it gains a degree of freedom with which to represent its current position. Picking an arbitrary representation to continue with causes discontinuities in the interpolated output: even within an axis, interpolation won’t produce a constant angular velocity. That can be a problem if, for example, you’re using the output to drive a robot. Furthermore, since each component angle is cyclic, linear interpolation won’t always choose the shortest path between rotations. Randomize $$ \mathbf{\theta}_0 = \begin{bmatrix}0\\0\\0\end{bmatrix} $$ $$ \mathbf{\theta}(0) = \begin{bmatrix}0.00\\0.00\\0.00\end{bmatrix} $$ $$ \mathbf{\theta}_1 = \begin{bmatrix}-3.14\\0.00\\-3.14\end{bmatrix} $$ Thankfully, interpolation is smooth if the path doesn’t go through a singularity, so these limitations can be worked around, especially if you don’t need to represent ‘straight up’ and ‘straight down.’ Quaternions At this point, you might be expecting yet another article on quaternions—don’t worry, we’re not going to delve into hyper-complex numbers today. 
It suffices to say that unit quaternions are the standard tool for composing and interpolating rotations, since spherical linear interpolation (slerp) chooses a constant-velocity shortest path between any two quaternions. However, unit quaternions also don’t form a vector space, are unintuitive to author, and can be computationally costly to interpolate*. Further, there’s no intuitive notion of scalar multiplication, nor averaging. But, they’re still fascinating! If you’d like to understand quaternions more deeply (or, perhaps, learn what they are in the first place), read this. Randomize $$ Q_0 = \begin{bmatrix}1\\0\\0\\0\end{bmatrix}\ \ $$ $$ Q(0) = \begin{bmatrix}1.00\\0.00\\0.00\\0.00\end{bmatrix}\ \ $$ $$ Q_1 = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} $$ Note that since quaternions double-cover the space of rotations, sometimes \(Q(1)\) will go to \(-Q_1\). Axis/Angle Rotations An axis/angle rotation is a 3D vector of real numbers. Its direction specifies the axis of rotation, and its magnitude specifies the angle to rotate about that axis. For convenience, we’ll write axis/angle rotations as \(\theta\mathbf{u}\), where \(\mathbf{u}\) is a unit-length vector and \(\theta\) is the rotation angle. $$\mathbf{u}_x$$ $$\mathbf{u}_y$$ $$\mathbf{u}_z$$ $$\theta$$ Since axis/angle rotations are simply 3D vectors, they form a vector space: we can add, scale, and interpolate them to our heart’s content. Linearly interpolating between any two axis/angle rotations is smooth and imparts constant angular velocity. However, note that linearly interpolating between axis-angle rotations does not necessarily choose the shortest path: it depends on which axis/angle you use to specify the target rotation. 
Randomize $$ \theta_0\mathbf{u}_0 = \begin{bmatrix}0\\0\\0\end{bmatrix}\ \ $$ $$ \theta\mathbf{u}(0) = \begin{bmatrix}0.00\\0.00\\0.00\end{bmatrix}\ \ $$ $$ \theta_1\mathbf{u}_1 = \begin{bmatrix}0\\3.14\\0\end{bmatrix} $$ Like quaternions, axis/angle vectors double-cover the space of rotations: sometimes \(\theta\mathbf{u}(1)\) will go to \((2\pi - \theta_1)(-\mathbf{u}_1)\). The Exponential and Logarithmic Maps Ideally, we could freely convert rotations between these diverse representations based on our use case. We will always want to get a rotation matrix out at the end, so we’ll consider matrices the ‘canonical’ form. Enter the exponential map: a function that takes a different kind of rotation object and gives us back an equivalent rotation matrix. The corresponding logarithmic map takes a rotation matrix and gives us back a rotation object. How these maps relate to the scalar exp and log functions will hopefully become clear later on. Below, we’ll define an exp and log map translating between rotation matrices and axis/angle vectors. But first, to build up intuition, let us consider how rotations work in two dimensions. Axis/Angle in 2D In 2D, there’s only one axis to rotate around: the one pointing out of the plane. Hence, our ‘axis/angle’ rotations can be represented by just \(\theta\). Given a 2D point \(\mathbf{p}\), how can we rotate \(\mathbf{p}\) by \(\theta\)? One way to visualize the transformation is by forming a coordinate frame in which the output is easy to describe. Consider \(\mathbf{p}\) and its quarter (\(90^\circ\)) rotation \(\mathcal{J}\mathbf{p}\): Using a bit of trigonometry, we can describe the rotated \(\mathbf{p}_\theta\) in two components: \[\begin{align*} \mathbf{p}_\theta &= \mathbf{p}\cos\theta + \mathcal{J}\mathbf{p}\sin\theta \\ &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \end{align*}\] But, what actually are \(\mathcal{I}\) and \(\mathcal{J}\)? 
The former should take a 2D vector and return it unchanged: it’s the 2x2 identity matrix. The latter should be similar, but swap and negate the two components: \[\begin{align*} \mathcal{I} &= \begin{bmatrix}1&0\\0&1\end{bmatrix} \\ \mathcal{J} &= \begin{bmatrix}0&-1\\1&0\end{bmatrix} \end{align*}\] Just to make sure we got \(\mathcal{J}\) right, let’s check what happens if we apply it twice (via \(\mathcal{J}^2\)): \[\mathcal{J}^2 = \begin{bmatrix}0&-1\\1&0\end{bmatrix}\begin{bmatrix}0&-1\\1&0\end{bmatrix} = \begin{bmatrix}-1&0\\0&-1\end{bmatrix}\] We got \(\mathcal{J}^2 = -\mathcal{I}\), which is a 180-degree rotation. So, \(\mathcal{J}\) indeed represents 90-degree rotation. Now, what does our transform look like? \[\begin{align*} \mathbf{p}_\theta &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \\ &= \left(\cos(\theta)\begin{bmatrix}1&0\\0&1\end{bmatrix} + \sin(\theta)\begin{bmatrix}0&-1\\1&0\end{bmatrix}\right)\mathbf{p} \\ &= \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\mathbf{p} \end{align*}\] That’s the standard 2D rotation matrix. What a coincidence! Complex Rotations If you’re familiar with complex numbers, you might notice that our first transform formula feels eerily similar to Euler’s formula, \(e^{ix} = \cos x + i\sin x\): \[\begin{align*} \mathbf{p}_\theta &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\mathbf{p} \\ e^{i\theta}p &= (\cos(\theta) + i\sin(\theta))p \end{align*}\] Where \(i\) and \(\mathcal{J}\) both play the role of a quarter turn. We can see that in complex arithmetic, multiplying by \(i\) in fact has that effect: \[\mathcal{J}\mathbf{p} = \begin{bmatrix}0&-1\\1&0\end{bmatrix}\begin{bmatrix}a\\b\end{bmatrix} = \begin{bmatrix}-b\\a\end{bmatrix}\] \[ip = i(a + bi) = ai + bi^2 = -b + ai\] So, there must be some connection to the exponential function here. 
The 2D Exponential Map Recall the definition of the exponential function (or equivalently, its Taylor series): \[e^x = \sum_{k=0}^\infty \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots\] Using Euler’s formula, \(e^{i\theta}\) gave us a complex number representing a 2D rotation by \(\theta\). Can we do the same with \(e^{\theta\mathcal{J}}\)? If we plug a matrix into the above definition, the arithmetic still works out: 2x2 matrices certainly support multiplication, addition, and scaling. (More on matrix exponentiation here.) Let \(\theta\mathcal{J} = A\) and plug it in: \[e^A = \sum_{k=0}^\infty \frac{A^k}{k!} = \mathcal{I} + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \dots\] Let’s pull out the first four terms to inspect further: \[\begin{align*} e^A &= \mathcal{I} + A + \frac{1}{2!}A^2 + \frac{1}{3!}A^3 \\ &= \mathcal{I} + A\left(\mathcal{I} + \frac{1}{2}A\left(\mathcal{I} + \frac{1}{3}A\right)\right) \\ &= \mathcal{I} + A\left(\mathcal{I} + \frac{1}{2}A\begin{bmatrix}1&\frac{-\theta}{3}\\\frac{\theta}{3}&1\end{bmatrix}\right) \\ &= \mathcal{I} + A\begin{bmatrix}1-\frac{\theta^2}{6}&\frac{-\theta}{2}\\\frac{\theta}{2}&1-\frac{\theta^2}{6}\end{bmatrix} \\ &= \begin{bmatrix}1-\frac{\theta^2}{2}&-\theta+\frac{\theta^3}{6}\\\theta-\frac{\theta^3}{6}&1-\frac{\theta^2}{2}\end{bmatrix} \end{align*}\] These entries look familiar. Recall the Taylor series that describe the functions \(\sin\) and \(\cos\): \[\begin{align*} \sin x &= x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots \\ \cos x &= 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots \end{align*}\] If we write out all the terms of \(e^A\), we’ll recover the expansions of \(\sin\theta\) and \(\cos\theta\)! Therefore: \[e^A = e^{\theta\mathcal{J}} = \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\] We’ve determined that the exponential function \(e^{\theta\mathcal{J}}\) converts our angle \(\theta\) into a corresponding 2D rotation matrix. 
In fact, we’ve proved a version of Euler’s formula with 2x2 matrices instead of complex numbers: \[\begin{align*} && e^{\theta\mathcal{J}} &= (\cos(\theta)\mathcal{I} + \sin(\theta)\mathcal{J})\\ &\implies& \mathbf{p}_\theta &= e^{\theta\mathcal{J}}\mathbf{p} \end{align*}\] The 2D Logarithmic Map The logarithmic map should naturally be the inverse of the exponential: \[R = \exp(\theta\mathcal{J}) \implies \log(R) = \theta\mathcal{J}\] So, given \(R\), how can we recover \(\theta\mathcal{J}\)? \[\begin{align*} && R &= \begin{bmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{bmatrix}\\ &\implies& \theta &= \text{atan2}(R_{21},R_{11})\\ &\implies& \log(R) &= \begin{bmatrix}0&-\theta\\\theta&0\end{bmatrix} \end{align*}\] Note that in general, our exponential map is not injective. Clearly, \(\exp(\theta\mathcal{J}) = \exp((\theta + 2\pi)\mathcal{J})\), since adding an extra full turn will always give us back the same rotation matrix. Therefore, our logarithmic map can’t be surjective—we’ll define it as returning the smallest angle \(\theta\mathcal{J}\) corresponding to the given rotation matrix. Using \(\text{atan2}\) implements this definition. Interpolation Consider two 2D rotation angles \(\theta_0\) and \(\theta_1\). The most obvious way to interpolate between these two rotations is to interpolate the angles and create the corresponding rotation matrix. This scheme is essentially a 2D version of axis-angle interpolation. \[\begin{align*} \theta(t) &= (1-t)\theta_0 + t\theta_1\\ R_\theta(t) &= \begin{bmatrix}\cos(\theta(t))&-\sin(\theta(t))\\\sin(\theta(t))&\cos(\theta(t))\end{bmatrix} \end{align*}\] However, if \(\theta_0\) and \(\theta_1\) are separated by more than \(\pi\), this expression will take the long way around: it’s not aware that angles are cyclic. Randomize $$ \theta_0 = 0\ \ $$ $$ \theta(0) = 0.00\ \ $$ $$ \theta_1 = 4.71 $$ Instead, let’s devise an interpolation scheme based on our exp/log maps. 
Since we know the two rotation matrices \(R_0\), \(R_1\), we can express the rotation that takes us directly from the initial pose to the final pose: \(R_1R_0^{-1}\), i.e. first undo \(R_0\), then apply \(R_1\). Using our logarithmic map, we can obtain the smallest angle that rotates from \(R_0\) to \(R_1\): \(\log(R_1R_0^{-1})\). Since \(\log\) gives us an axis-angle rotation, we can simply scale the result by \(t\) to perform interpolation. After scaling, we can use our exponential map to get back a rotation matrix. This matrix represents a rotation \(t\) of the way from \(R_0\) to \(R_1\). Hence, our final parametric rotation matrix is \(R(t) = \exp(t\log(R_1R_0^{-1})))R_0\). \[\begin{align*} R(0) &= \exp(0)R_0 = R_0\\ R(1) &= \exp(\log(R_1R_0^{-1}))R_0 = R_1R_0^{-1}R_0 = R_1 \end{align*}\] Randomize $$ R_0 = \begin{bmatrix}1&0\\0&1\end{bmatrix} $$ $$ R(0) = \begin{bmatrix}1.00&0.00\\0.00&1.00\end{bmatrix} $$ $$ R_1 = \begin{bmatrix}0&-1\\1&0\end{bmatrix} $$ Using exp/log for interpolation might seem like overkill for 2D—we could instead just check how far apart the angles are. But below, we’ll see how this interpolation scheme generalizes—without modification—to 3D, and in fact any number of dimensions. Axis/Angle in 3D We’re finally ready to derive an exponential and logarithmic map for 3D rotations. In 2D, our map arose from exponentiating \(\theta\mathcal{J}\), i.e. \(\theta\) times a matrix representing a counter-clockwise quarter turn about the axis of rotation. We will be able to do the same in 3D—but what transformation encodes a quarter turn about a 3D unit vector \(\mathbf{u}\)? The cross product \(\mathbf{u}\times\mathbf{p}\) is typically defined as a vector normal to the plane containing both \(\mathbf{u}\) and \(\mathbf{p}\). 
However, we could also interpret \(\mathbf{u}\times\mathbf{p}\) as a quarter turn of the projection of \(\mathbf{p}\) into the plane with normal \(\mathbf{u}\), which we will call \(\mathbf{p}_\perp\): So, if we can compute the quarter rotation of \(\mathbf{p}_\perp\), it should be simple to recover the quarter rotation of \(\mathbf{p}\). Of course, \(\mathbf{p}=\mathbf{p}_\perp+\mathbf{p}_\parallel\), so we’ll just have to add back the parallel part \(\mathbf{p}_\parallel\). This is correct because a rotation about \(\mathbf{u}\) preserves \(\mathbf{p}_\parallel\): However, “\(\mathbf{u} \times\)” is not a mathematical object we can work with. Instead, we can devise a matrix \(\mathbf{\hat{u}}\) that when multiplied with a a vector \(\mathbf{p}\), outputs the same result as \(\mathbf{u} \times \mathbf{p}\): \[\begin{align*} \mathbf{u} \times \mathbf{p} &= \begin{bmatrix} u_yp_z - u_zp_y \\ u_zp_x - u_xp_z \\ u_xp_y - u_yp_x \end{bmatrix} \\ &= \begin{bmatrix}0&-u_z&u_y\\u_z&0&-u_x\\-u_y&u_x&0\end{bmatrix}\begin{bmatrix}p_x\\p_y\\p_z\end{bmatrix} \\ &= \mathbf{\hat{u}}\mathbf{p} \end{align*}\] We can see that \(\mathbf{\hat{u}}^T = -\mathbf{\hat{u}}\), so \(\mathbf{\hat{u}}\) is a skew-symmetric matrix. (i.e. it has zeros along the diagonal, and the two halves are equal but negated.) Note that in the 2D case, our quarter turn \(\mathcal{J}\) was also skew-symmetric, and sneakily represented the 2D cross product! We must be on the right track. The reason we want to use axis/angle rotations in the first place is because they form a vector space. So, let’s make sure our translation to skew-symmetric matrices maintains that property. Given two skew-symmetric matrices \(A_1\) and \(A_2\): \[(A_1 + A_2)^T = A_1^T + A_2^T = -A_1 - A_2 = -(A_1 + A_2)\] Their sum is also a skew-symmetric matrix. Similarly: \[(cA)^T = c(A^T) = -cA_1\] Scalar multiplication also maintains skew-symmetry. The other vector space properties follow from the usual definition of matrix addition. 
Finally, note that \(\mathbf{u} \times (\mathbf{u} \times (\mathbf{u} \times \mathbf{p})) = -\mathbf{u} \times \mathbf{p}\). Taking the cross product three times would rotate \(\mathbf{p}_\perp\) three-quarter turns about \(\mathbf{u}\), which is equivalent to a single negative-quarter turn. More generally, \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\) for any \(k>0\). We could prove this by writing out all the terms, but the geometric argument is easier: The 3D Exponential Map Given an axis/angle rotation \(\theta\mathbf{u}\), we can make \(\theta\mathbf{\hat{u}}\) using the above construction. What happens when we exponentiate it? Using the identity \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\): \[\begin{align*} e^{\theta\mathbf{\hat{u}}} &= \mathcal{I} + \theta\mathbf{\hat{u}} + \frac{1}{2!}\theta^2\mathbf{\hat{u}}^2 + \frac{1}{3!}\theta^3\mathbf{\hat{u}}^3 + \frac{1}{4!}\theta^4\mathbf{\hat{u}}^4 + \frac{1}{5!}\theta^5\mathbf{\hat{u}}^5 + \dots \\ &= \mathcal{I} + \theta\mathbf{\hat{u}} + \frac{1}{2!}\theta^2\mathbf{\hat{u}}^2 - \frac{1}{3!}\theta^3\mathbf{\hat{u}} - \frac{1}{4!}\theta^4\mathbf{\hat{u}}^2 + \frac{1}{5!}\theta^5\mathbf{\hat{u}} + \dots \\ &= \mathcal{I} + \left(\theta - \frac{1}{3!}\theta^2 + \frac{1}{5!}\theta^5 - \dots\right)\mathbf{\hat{u}} + \left(1 - \left(1 - \frac{1}{2!}\theta^2 + \frac{1}{4!}\theta^4 - \dots\right)\right)\mathbf{\hat{u}}^2 \\ &= \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2 \end{align*}\] In the last step, we again recover the Taylor expansions of \(\sin\theta\) and \(\cos\theta\). Our final expression is known as Rodrigues’ formula. This formula is already reminiscent of the 2D case: the latter two terms are building up a 2D rotation in the plane defined by \(\mathbf{u}\). 
To sanity check our 3D result, let’s compute our transform for \(\theta = 0\): \[e^{0\mathbf{\hat{u}}}\mathbf{p} = (\mathcal{I} + 0\mathbf{\hat{u}} + (1-1)\mathbf{\hat{u}}^2)\mathbf{p} = \mathbf{p}\] Rotating by \(\theta = 0\) preserves \(\mathbf{p}\), so the formula works. Then compute for \(\theta = \frac{\pi}{2}\): \[\begin{align*} e^{\frac{\pi}{2}\mathbf{\hat{u}}}\mathbf{p} &= (\mathcal{I} + 1\mathbf{\hat{u}} + (1-0)\mathbf{\hat{u}}^2)\mathbf{p} \\ &= \mathbf{p} + \mathbf{\hat{u}}\mathbf{p} + \mathbf{\hat{u}}^2\mathbf{p} \\ &= \mathbf{p} + \mathbf{u}\times\mathbf{p} + \mathbf{u}\times(\mathbf{u}\times\mathbf{p})\\ &= (\mathbf{p}_\perp + \mathbf{p}_\parallel) + \mathbf{u}\times\mathbf{p} - \mathbf{p}_\perp\\ &= \mathbf{u}\times\mathbf{p} + \mathbf{p}_\parallel \end{align*}\] Above, we already concluded \(\mathbf{u}\times\mathbf{p} + \mathbf{p}_\parallel\) is a quarter rotation. So, our formula is also correct at \(\theta = \frac{\pi}{2}\). Then compute for \(\theta = \pi\): \[\begin{align*} e^{\pi\mathbf{\hat{u}}}\mathbf{p} &= (\mathcal{I} + 0\mathbf{\hat{u}} + (1-(-1))\mathbf{\hat{u}}^2)\mathbf{p} \\ &= \mathbf{p} + 2\mathbf{\hat{u}}^2\mathbf{p} \\ &= (\mathbf{p}_\perp + \mathbf{p}_\parallel) - 2\mathbf{p}_\perp \\ &= -\mathbf{p}_\perp + \mathbf{p}_\parallel \end{align*}\] We end up with \(-\mathbf{p}_\perp + \mathbf{p}_\parallel\), which is a half rotation. Hence \(\theta = \pi\) is also correct. So far, our formula checks out. Just to be sure, let’s prove that our 3D result is a rotation matrix, i.e. it’s orthonormal and has positive determinant. 
A matrix is orthonormal if \(A^TA = \mathcal{I}\), so again using \(\mathbf{\hat{u}}^{k+2} = -\mathbf{\hat{u}}^k\): \[\begin{align*} &\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right)^T\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\ =& \left(\mathcal{I}^T + \sin(\theta)\mathbf{\hat{u}}^T + (1-\cos(\theta))\left(\mathbf{\hat{u}}^T\right)^2\right)\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\ =&\ (\mathcal{I} - \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2)\left(\mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\right) \\ =&\ \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)\mathbf{\hat{u}} - \sin^2(\theta)\mathbf{\hat{u}}^2 - \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}}^3 \\&+ (1-\cos(\theta))\mathbf{\hat{u}}^2 + \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}}^3 + (1-\cos(\theta))\mathbf{\hat{u}}^4 \\ =&\ \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)\mathbf{\hat{u}} - \sin^2(\theta)\mathbf{\hat{u}} + \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}} \\&+ (1-\cos(\theta))\mathbf{\hat{u}}^2 - \sin(\theta)(1-\cos(\theta))\mathbf{\hat{u}} - (1-\cos(\theta))^2\mathbf{\hat{u}}^2\\ =&\ \mathcal{I} - \sin^2(\theta)\mathbf{\hat{u}}^2 + (1-\cos(\theta))^2\mathbf{\hat{u}}^2\\ =&\ \mathcal{I} + (-\sin^2(\theta)+1-\cos^2(\theta))\mathbf{\hat{u}}^2\\ =&\ \mathcal{I} + (1-(\sin^2(\theta)+\cos^2(\theta)))\mathbf{\hat{u}}^2\\ =&\ \mathcal{I} \end{align*}\] Therefore, \(e^{\theta\mathbf{\hat{u}}}\) is orthonormal. 
We could show its determinant is positive (and therefore \(1\)) by writing out all the terms, but it suffices to argue that: Clearly, \(\begin{vmatrix}\exp(0\mathbf{\hat{u}})\end{vmatrix} = \begin{vmatrix}\mathcal{I}\end{vmatrix} = 1\) There is no \(\theta\), \(\mathbf{\hat{u}}\) such that \(\begin{vmatrix}\exp(\theta\mathbf{\hat{u}})\end{vmatrix} = 0\), since \(\mathbf{\hat{u}}\) and \(\mathbf{\hat{u}}^2\) can never cancel out \(\mathcal{I}\). \(\exp\) is continuous with respect to \(\theta\) and \(\mathbf{\hat{u}}\) Therefore, \(\begin{vmatrix}\exp(0\mathbf{\hat{u}})\end{vmatrix}\) can never become negative. That means \(\exp(\theta\mathbf{\hat{u}})\) is a 3D rotation matrix! The 3D Logarithmic Map Similarly to the 2D case, the 3D exponential map is not injective, so the 3D logarithmic map will not be surjective. Instead, we will again define it to return the smallest magnitude axis-angle rotation corresponding to the given matrix. Our exponential map gave us: \[R = \exp(\theta\mathbf{\hat{u}}) = \mathcal{I} + \sin(\theta)\mathbf{\hat{u}} + (1-\cos(\theta))\mathbf{\hat{u}}^2\] We can take the trace (sum along the diagonal) of both sides: \[\operatorname{tr}(R) = \operatorname{tr}(\mathcal{I}) + \sin(\theta)\operatorname{tr}(\mathbf{\hat{u}}) + (1-\cos(\theta))\operatorname{tr}(\mathbf{\hat{u}}^2)\] Clearly \(\operatorname{tr}(\mathcal{I}) = 3\), and since \(\mathbf{\hat{u}}\) is skew-symmetric, its diagonal sum is zero. That just leaves \(\mathbf{\hat{u}}^2\): \[\mathbf{\hat{u}}^2 = \begin{bmatrix}-u_y^2-u_z^2&u_xu_y&u_xu_z\\u_xu_y&-u_x^2-u_z^2&u_yu_z\\u_xu_z&u_yu_z&-u_x^2-u_y^2\end{bmatrix}\] We can see \(\operatorname{tr}(\mathbf{\hat{u}}^2) = -2u_x^2-2u_y^2-2u_z^2 = -2\|\mathbf{u}\|^2 = -2\). (We originally defined \(\mathbf{u}\) as a unit vector.) 
Our final trace becomes: \[\begin{align*} && \operatorname{tr}(R) &= 3 + 0\sin(\theta) - 2(1-\cos(\theta)) \\ &&&= 1 + 2\cos\theta \\ &\implies& \theta &= \arccos\left(\frac{\operatorname{tr}(R)-1}{2}\right) \end{align*}\] That’s half of our logarithmic map. To recover \(\mathbf{\hat{u}}\), we can antisymmetrize \(R\). Recall \(\mathbf{\hat{u}}^T = -\mathbf{\hat{u}}\), and that \(\mathbf{\hat{u}}^2\) is symmetric (above). \[\begin{align*} && R - R^T &= \mathcal{I} - \mathcal{I}^T + \sin(\theta)(\mathbf{\hat{u}}-\mathbf{\hat{u}}^T) + (1-\cos(\theta))(\mathbf{\hat{u}}^2-(\mathbf{\hat{u}}^2)^T) \\ &&&= \sin(\theta)(\mathbf{\hat{u}}+\mathbf{\hat{u}}) + (1-\cos(\theta))(\mathbf{\hat{u}}^2-\mathbf{\hat{u}}^2) \\ &&&= 2\sin(\theta)\mathbf{\hat{u}} \\ &\implies& \mathbf{\hat{u}} &= \frac{1}{2\sin\theta}(R-R^T) \\ \end{align*}\] Finally, to get \(\mathbf{u}\), we can pull out the entries of \(\mathbf{\hat{u}}\), which we just derived: \[\mathbf{u} = \frac{1}{2\sin\theta}\begin{bmatrix}R_{32}-R_{23}\\ R_{13}-R_{31}\\R_{21}-R_{12}\end{bmatrix}\] We now have our full 3D logarithmic map! Interpolation Now equipped with our 3D exp and log maps, we can use them for interpolation. The exact same formula as the 2D case still applies: \[R(t) = \exp(t\log(R_1R_0^{-1})))R_0\] Randomize $$ R_0 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} $$ $$ R(0) = \begin{bmatrix}1.00&0.00&0.00\\0.00&1.00&0.00\\0.00&0.00&1.00\end{bmatrix} $$ $$ R_1 = \begin{bmatrix}-0.42&-0.59&-0.69\\0.51&-0.79&0.36\\-0.75&-0.20&0.63\end{bmatrix} $$ Our interpolation scheme produces all the nice properties of axis/angle rotations—and chooses the shortest path every time. This wouldn’t look so smooth with Euler angles! Averaging Rotations However, we would have gotten an equally good interpolation scheme by just using quaternions instead of messing about with all this matrix math. Let’s consider something interesting we can only easily do with axis/angle rotations: averaging a set of rotation matrices. 
The most straightforward method is to convert each matrix into an axis/angle rotation, average the resulting vectors, and convert back. That is certainly a valid strategy, but the resulting behavior won’t be very intuitive: Randomize In particular, summing axis-angle vectors can result in “catastrophic cancellation.” An extreme example is averaging \(\begin{bmatrix}\pi&0&0\end{bmatrix}\) and \(\begin{bmatrix}-\pi&0&0\end{bmatrix}\), resulting in zero—which is clearly not representative of the two equivalent rotations. To find an alternative, let’s first consider a slightly unconventional way of averaging points in the plane. The average of a set of points is the point that minimizes total squared distance to all others. Hence, there’s an optimization-based algorithm for finding it. Given \(x_0, \dots, x_n\), we can iterate the following procedure: Pick an initial guess \(\bar{x} \in \mathbb{R}^2\) (can be one of the points). Repeat: For each point, get its translation from the guess: \(\mathbf{u}_i \leftarrow x_i - \bar{x}\) Average the vectors: \(\mathbf{u} \leftarrow \frac{1}{n} \sum_{i=1}^n \mathbf{u}_i\) Step toward the average direction: \(\bar{x} \leftarrow \bar{x} + \tau\mathbf{u}\) while \(\|\mathbf{u}\| > \epsilon\). As we run this procedure, \(\bar{x}\) will converge to the average point. Of course, we could have just averaged the points directly, but we’ll be able to translate this idea to the rotational case rather nicely. Our logarithmic map lets us convert rotation matrices to axis axis/angle rotations, which are themselves just 3D points. So, what if we use the point averaging algorithm on rotations \(R_0, \dots, R_n\)? Pick an initial guess \(\bar{R} \in \mathbb{R}^{3\times3}\) (can be \(\mathcal{I}\)). 
Repeat: For each matrix, get its axis/angle rotation from the guess: \(\mathbf{u}_i \leftarrow \log(R_i\bar{R}^{-1})\) Average the vectors: \(\mathbf{u} \leftarrow \frac{1}{n} \sum_{i=1}^n \mathbf{u}_i\) Step toward the average rotation: \(\bar{R} \leftarrow \exp(\tau\mathbf{u})\bar{R}\) while \(\|\mathbf{u}\| > \epsilon\). Step Randomize The result of this algorithm is formally known as the Karcher mean. Just like how the average point minimizes total squared distance from all other points, the Karcher mean is a rotation that minimizes squared angular distance from all other rotations. Therefore, it won’t be subject to catastrophic cancellation—we’ll always end up with a non-zero in-between rotation. Try comparing the two averaging algorithms—randomizing will keep them in sync. While the results are often similar, the Karcher mean exhibits more consistent behavior. Quaternions (Again) Warning: section assumes knowledge of quaternions Okay, I couldn’t resist talking about quaternions at least a little bit, given how closely they’re related to axis/angle rotations. Just like how complex exponentiation turned out to be equivalent to (skew-symmetric) 2D matrix exponentiation, quaternion exponentiation is equivalent to (skew-symmetric) 3D matrix exponentiation. In 2D, an axis/angle rotation was simply \(\theta\). We created a pure-imaginary complex number \(i\theta\) and exponentiated it: \[e^{i\theta} = \cos\theta + i\sin\theta\] We got back a complex number that when multiplied with a point, rotates it by \(\theta\). It’s always the case that \(\|\cos\theta + i\sin\theta\| = 1\), so 2D rotations can be represented as unit-norm complex numbers. In 3D, an axis/angle rotation is a vector \(\mathbf{u}\) such that \(\|\mathbf{u}\| = \theta\). What happens if we create a pure-imaginary quaternion \(\mathbf{q} = u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k}\) and exponentiate it, too? 
<p>To make evaluating \(e^\mathbf{q}\) easier, first derive the following using the quaternion multiplication rules:</p>
\[\begin{align*} \mathbf{q}^2 &= (u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k})(u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k}) \\ &= u_x^2\mathbf{i}^2 + u_xu_y\mathbf{i}\mathbf{j} + u_xu_z\mathbf{i}\mathbf{k} + u_yu_x\mathbf{j}\mathbf{i} + u_y^2\mathbf{j}^2 + u_yu_z\mathbf{j}\mathbf{k} + u_zu_x\mathbf{k}\mathbf{i} + u_zu_y\mathbf{k}\mathbf{j} + u_z^2\mathbf{k}^2 \\ &= -u_x^2 + u_xu_y\mathbf{k} - u_xu_z\mathbf{j} - u_yu_x\mathbf{k} - u_y^2 + u_yu_z\mathbf{i} + u_zu_x\mathbf{j} - u_zu_y\mathbf{i} - u_z^2 \\ &= -u_x^2-u_y^2-u_z^2\\ &= -\|\mathbf{q}\|^2\\ &= -\theta^2 \end{align*}\]
<p>This is highly reminiscent of the skew-symmetric matrix identity used above. Therefore…</p>
\[\begin{align*} e^\mathbf{q} &= 1 + \mathbf{q} + \frac{\mathbf{q}^2}{2!} + \frac{\mathbf{q}^3}{3!} + \frac{\mathbf{q}^4}{4!} + \frac{\mathbf{q}^5}{5!} + \dots \\ &= 1 + \frac{\theta\mathbf{q}}{\theta} - \frac{\theta^2}{2!} - \frac{\theta^3\mathbf{q}}{3!\theta} + \frac{\theta^4}{4!} + \frac{\theta^5\mathbf{q}}{5!\theta} + \dots \\ &= \left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \dots\right) + \frac{\mathbf{q}}{\theta}\left(\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \dots\right) \\ &= \cos\theta + \frac{\mathbf{q}}{\theta}\sin\theta \\ &\approx \cos\theta + \frac{\mathbf{u}}{\|\mathbf{u}\|}\sin\theta \end{align*}\]
<p>Our result looks almost exactly like the 2D case, just with three imaginary axes instead of one. In 2D, our axis/angle rotation became a unit-norm complex number; in 3D, it becomes a unit-norm quaternion. Now we can use this quaternion to rotate 3D points! Pretty cool, right?</p>
<p>One advantage of using quaternions is how easy the exponential map is to compute—if you don’t need a rotation matrix, it’s a good option.</p>
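<p>In code, the quaternion exponential map really is just a few lines. A sketch in Python (NumPy assumed; storing quaternions as <code>[w, x, y, z]</code> arrays is an arbitrary convention):</p>

```python
import numpy as np

def quat_exp(u):
    """Exponentiate the pure-imaginary quaternion u_x*i + u_y*j + u_z*k.

    Returns cos(theta) + sin(theta) * u/||u|| as a [w, x, y, z] array,
    where theta = ||u|| is the rotation angle.
    """
    theta = np.linalg.norm(u)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])  # exp(0): the identity rotation
    return np.concatenate(([np.cos(theta)], np.sin(theta) * u / theta))
```

<p>The result always has unit norm, matching the derivation: for instance, \(\mathbf{u} = \begin{bmatrix}0.5&0&0\end{bmatrix}\) maps to \(\begin{bmatrix}\cos 0.5 & \sin 0.5 & 0 & 0\end{bmatrix}\).</p>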
<p>The quaternion logarithmic map is similarly simple:</p>
\[\theta = \arccos(\Re(\mathbf{q})), \quad \mathbf{u} = \frac{\theta}{\sin\theta}\Im(\mathbf{q})\]
<p>Finally, note that the way to rotate a point \(\mathbf{p}\) by a quaternion \(\mathbf{q}\) is by evaluating the conjugation \(\mathbf{q}\mathbf{p}\mathbf{q}^{-1}\), where \(\mathbf{p} = p_x\mathbf{i} + p_y\mathbf{j} + p_z\mathbf{k}\) is another pure-imaginary quaternion representing our point. The conjugation technically rotates the point by \(2\theta\) about \(\mathbf{u}\), but that’s easily accounted for by making \(\|\mathbf{u}\| = \frac{\theta}{2}\) in the beginning.</p>
<h2 id="further-reading">Further Reading</h2>
<p>Made it this far? Well, there’s even more to learn about rotations. Learn about quaternions here, and why geometric algebra is more intuitive here. Beyond understanding the four representations covered here (plus geometric algebra), it can be enlightening to learn about the algebraic structure underlying all 3D rotations: the group \(SO(3)\). I found this video to be a great resource: it explains \(SO(3)\) both intuitively and visually, demonstrating how it relates to the group \(SU(2)\) as well as why quaternions and axis/angle rotations double-cover 3D rotation matrices.</p>
<p>The Wikipedia page on \(SO(3)\) is also informative, though very math heavy. It touches on connections with axis/angle rotations, topology, \(SU(2)\), quaternions, and Lie algebra. It turns out the vector space of skew-symmetric matrices we derived above makes up \(\mathfrak{so}(3)\), the Lie algebra that corresponds to \(SO(3)\)—but I don’t know what that entails.</p>Graphics Blogroll2022-04-04T00:00:00+00:002022-04-04T00:00:00+00:00https://thenumbat.github.io/Graphics-Blogroll<p>In no particular order, my list of cool computer graphics (or adjacent) blogs:</p>
<ul>
<li><a href="https://www.redblobgames.com/">Amit Patel</a>
<ul>
<li>Interactive algorithm visualizations</li>
<li>Highlight: <a href="https://www.redblobgames.com/pathfinding/a-star/introduction.html">Introduction to A*</a></li>
</ul>
</li>
<li><a href="https://iquilezles.org/">Inigo Quilez</a>
<ul>
<li>3D SDF rendering and <a href="https://www.shadertoy.com/">shadertoy</a></li>
<li>Highlight: <a href="https://iquilezles.org/www/articles/distfunctions/distfunctions.htm">3D SDF Index</a></li>
</ul>
</li>
<li><a href="https://fgiesen.wordpress.com/">Fabian Giesen</a>
<ul>
<li>Graphics programming, compression, comp arch, optimization</li>
<li>Highlight: <a href="https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/">A trip through the Graphics Pipeline 2011</a></li>
</ul>
</li>
<li><a href="https://bartwronski.com/">Bart Wronski</a>
<ul>
<li>Image/signal processing, real-time rendering, linear algebra</li>
<li>Highlight: <a href="https://bartwronski.com/2020/12/27/why-are-video-games-graphics-still-a-challenge-productionizing-rendering-algorithms/">Why are video game graphics still a challenge?</a></li>
</ul>
</li>
<li><a href="https://ncase.me/">Nicky Case</a>
<ul>
<li>‘Explorable’ real-world models, mental health</li>
<li>Highlight: <a href="https://ncase.me/polygons/">Parable of the Polygons</a></li>
</ul>
</li>
<li><a href="https://ciechanow.ski/archives/">Bartosz Ciechanowski</a>
<ul>
<li>Highly detailed interactive explainers</li>
<li>Highlight: <a href="https://ciechanow.ski/cameras-and-lenses/">Cameras and Lenses</a></li>
</ul>
</li>
<li><a href="https://www.alanzucconi.com/">Alan Zucconi</a>
<ul>
<li>Misc. game dev</li>
<li>Highlight: <a href="http://www.alanzucconi.com/2017/04/17/procedural-animations/">Introduction to Procedural Animations</a></li>
</ul>
</li>
<li><a href="https://marctenbosch.com/">Marc Ten Bosch</a>
<ul>
<li>Higher-dimensional physics, math</li>
<li>Highlight: <a href="https://marctenbosch.com/quaternions/">Let’s remove Quaternions from every 3D Engine</a></li>
</ul>
</li>
<li><a href="https://blog.demofox.org/">Alan Wolfe</a>
<ul>
<li>Image/signal processing, real-time graphics, math</li>
<li>Highlight: <a href="https://blog.demofox.org/2016/02/29/fast-voronoi-diagrams-and-distance-dield-textures-on-the-gpu-with-the-jump-flooding-algorithm/">Fast Voronoi Diagrams and Distance Fields with Jump Flooding</a></li>
</ul>
</li>
<li><a href="https://fabiensanglard.net/">Fabien Sanglard</a>
<ul>
<li>Retro game dev, graphics programming, game engines</li>
<li>Highlight: <a href="https://fabiensanglard.net/postcard_pathtracer/index.html">Deciphering the Postcard Sized Pathtracer</a></li>
</ul>
</li>
<li><a href="https://therealmjp.github.io/posts/">Matt Pettineo</a>
<ul>
<li>Graphics programming, real-time rendering</li>
<li>Highlight: <a href="https://therealmjp.github.io/posts/attack-of-the-depth-buffer/">Attack of the Depth Buffer</a></li>
</ul>
</li>
<li><a href="https://aras-p.info/">Aras Pranckevičius</a>
<ul>
<li>Misc. graphics, optimization, compilers</li>
<li>Highlight: <a href="https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/">“Modern” C++ Lamentations</a></li>
</ul>
</li>
<li><a href="https://bgolus.medium.com/">Ben Golus</a>
<ul>
<li>Graphics programming, shaders</li>
<li>Highlight: <a href="https://bgolus.medium.com/the-quest-for-very-wide-outlines-ba82ed442cd9">The Quest for Very Wide Outlines</a></li>
</ul>
</li>
<li><a href="https://interplayoflight.wordpress.com/">Kostas Anagnostou</a>
<ul>
<li>Graphics programming, rendering</li>
<li>Highlight: <a href="https://interplayoflight.wordpress.com/2022/03/26/raytraced-global-illumination-denoising/">Raytraced Global Illumination Denoising</a></li>
</ul>
</li>
<li><a href="http://www.adriancourreges.com/blog/">Adrian Courrèges</a>
<ul>
<li>Real-time rendering</li>
<li>Highlight: <a href="http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/">DOOM 2016 Graphics Study</a></li>
</ul>
</li>
<li><a href="https://raphlinus.github.io/">Raph Levien</a>
<ul>
<li>2D rendering, GPU programming</li>
<li>Highlight: <a href="https://raphlinus.github.io/ui/graphics/gpu/2021/10/22/swapchain-frame-pacing.html">Swapchains and Frame Pacing</a></li>
</ul>
</li>
<li><a href="https://pharr.org/matt/blog/">Matt Pharr</a>
<ul>
<li>Rendering, optimization</li>
<li>Highlight: <a href="http://pharr.org/matt/blog/2018/07/16/moana-island-pbrt-all.html">Rendering Moana with pbrt</a></li>
</ul>
</li>
<li><a href="https://wickedengine.net/category/devblog/">János Turánszki</a>
<ul>
<li>Real-time rendering, game engines</li>
<li>Highlight: <a href="https://wickedengine.net/2017/08/30/voxel-based-global-illumination/">Voxel-based Global Illumination</a></li>
</ul>
</li>
<li><a href="https://zeux.io/">Arseny Kapoulkine</a>
<ul>
<li>Real-time rendering, optimization</li>
<li>Highlight: <a href="https://zeux.io/2020/02/27/writing-an-efficient-vulkan-renderer/">Writing an efficient Vulkan renderer</a></li>
</ul>
</li>
<li><a href="http://theorangeduck.com/page/all">Daniel Holden</a>
<ul>
<li>Animation, geometry, game dev</li>
<li>Highlight: <a href="http://theorangeduck.com/page/spring-roll-call">Game Developer’s Spring-Roll-Call</a></li>
</ul>
</li>
<li><a href="https://blog.tuxedolabs.com/">Dennis Gustafsson</a>
<ul>
<li>Real-time rendering, game dev, GPU programming</li>
<li>Highlight: <a href="https://blog.tuxedolabs.com/2018/05/04/bokeh-depth-of-field-in-single-pass.html">Bokeh depth of field in a single pass</a></li>
</ul>
</li>
</ul>
<h3 id="possibly-dead">Possibly Dead</h3>
<p>These blogs are very infrequently updated or haven’t been updated in years.</p>
<ul>
<li>Jonathan Blow: <a href="http://number-none.com/blow/blog/">Braid</a>/<a href="http://the-witness.net/news/">The Witness</a>
<ul>
<li>Braid/The Witness development posts</li>
<li>Highlight: <a href="http://the-witness.net/news/">A Shader Trick</a></li>
</ul>
</li>
<li><a href="https://nickmcd.me/">Nicholas McDonald</a>
<ul>
<li>Procedural generation, math</li>
<li>Highlight: <a href="https://nickmcd.me/2019/10/30/markov-chains-for-procedural-buildings/">Markov Chains for Procedural Buildings</a></li>
</ul>
</li>
<li><a href="https://ijdykeman.github.io/">Isaac Dykeman</a>
<ul>
<li>Procedural generation, real-time graphics</li>
<li>Highlight: <a href="https://ijdykeman.github.io/ml/2017/10/12/wang-tile-procedural-generation.html">Procedural Worlds from Simple Tiles</a></li>
</ul>
</li>
<li><a href="https://realtimecollisiondetection.net/blog/">Christer Ericson</a>
<ul>
<li>Graphics programming, real-time physics, game dev</li>
<li>Highlight: <a href="http://realtimecollisiondetection.net/blog/?p=86">Order your graphics draw calls around!</a></li>
</ul>
</li>
<li><a href="http://phildec.users.sourceforge.net/index.php">Philippe Decaudin</a>
<ul>
<li>Real-time graphics</li>
<li>Highlight: <a href="http://phildec.users.sourceforge.net/Research/VolumetricBillboards.php">Volumetric Billboards</a></li>
</ul>
</li>
<li><a href="http://www.humus.name/index.php?page=News">Emil Persson</a>
<ul>
<li>Graphics programming, game dev</li>
<li>Highlight: <a href="http://www.humus.name/index.php?page=News&ID=383">Rules of Optimization</a></li>
</ul>
</li>
<li><a href="https://leifnode.com/">Leif Erkenbrach</a>
<ul>
<li>Graphics programming</li>
<li>Highlight: <a href="https://leifnode.com/2015/05/voxel-cone-traced-global-illumination/">Voxel Cone Traced Global Illumination</a></li>
</ul>
</li>
<li><a href="https://blog.mecheye.net/">Clean Rinse</a>
<ul>
<li>Graphics programming</li>
<li>Highlight: <a href="https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harder-than-3d-graphics/">Why are 2D vector graphics so much harder than 3D?</a></li>
</ul>
</li>
<li><a href="http://graphicrants.blogspot.com/">Brian Karis</a>
<ul>
<li>Graphics programming, math</li>
<li>Highlight: <a href="http://graphicrants.blogspot.com/2013/08/specular-brdf-reference.html">Specular BRDF Reference</a></li>
</ul>
</li>
<li><a href="http://diaryofagraphicsprogrammer.blogspot.com/">Wolfgang Engel</a>
<ul>
<li>Graphics programming</li>
<li>Highlight: <a href="http://diaryofagraphicsprogrammer.blogspot.com/2018/03/triangle-visibility-buffer.html">Triangle Visibility Buffer</a></li>
</ul>
</li>
<li><a href="http://www.aortiz.me/">Ángel Ortiz</a>
<ul>
<li>Graphics programming</li>
<li>Highlight: <a href="http://www.aortiz.me/2018/12/21/CG.html">A Primer On Efficient Rendering Algorithms & Clustered Shading</a></li>
</ul>
</li>
<li><a href="https://setosa.io/ev/">Setosa</a>
<ul>
<li>Interactive explainers</li>
<li>Highlight: <a href="https://setosa.io/ev/eigenvectors-and-eigenvalues/">Eigenvectors and Eigenvalues</a></li>
</ul>
</li>
<li><a href="https://kosmonautblog.wordpress.com/">Kosmonaut Games</a>
<ul>
<li>Graphics programming, real-time rendering</li>
<li>Highlight: <a href="https://kosmonautblog.wordpress.com/2017/05/01/signed-distance-field-rendering-journey-pt-1/">Signed Distance Field Rendering Journey</a></li>
</ul>
</li>
<li><a href="http://jamie-wong.com/">Jamie Wong</a>
<ul>
<li>Simulation, geometry, software dev</li>
<li>Highlight: <a href="http://jamie-wong.com/2016/07/15/ray-marching-signed-distance-functions/">Ray Marching and Signed Distance Fields</a></li>
</ul>
</li>
<li><a href="https://shaderbits.com/blog">Shader Bits</a>
<ul>
<li>Shaders, unreal engine</li>
<li>Highlight: <a href="http://shaderbits.com/blog/creating-volumetric-ray-marcher">Creating a Volumetric Ray Marcher</a></li>
</ul>
</li>
<li><a href="http://procworld.blogspot.com/">Miguel Cepero</a>
<ul>
<li>Procedural generation, voxels</li>
<li>Highlight: <a href="http://procworld.blogspot.com/2010/11/from-voxels-to-polygons.html">From Voxels to Polygons</a></li>
</ul>
</li>
<li><a href="https://0fps.net/">0 FPS</a>
<ul>
<li>Geometry, procedural generation, voxels</li>
<li>Highlight: <a href="https://0fps.net/2012/06/30/meshing-in-a-minecraft-game/">Meshing in a Minecraft Game</a></li>
</ul>
</li>
</ul>Las Vegas2022-04-03T00:00:00+00:002022-04-03T00:00:00+00:00https://thenumbat.github.io/Las%20Vegas<p>Las Vegas, NV, 2022</p>
<p>Hualapai Nation, 2022</p>
<div style="text-align: center;"><img src="/assets/vegas/0.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/1.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/2.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/3.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/4.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/5.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/6.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/7.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/8.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/9.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/vegas/16.webp" /></div>
<p></p>Hamming Hats2022-03-29T00:00:00+00:002022-03-29T00:00:00+00:00https://thenumbat.github.io/Hamming-Hats<p><em>I first encountered the <a href="https://www.nytimes.com/2001/04/10/science/why-mathematicians-now-care-about-their-hat-color.html">Hat Problem</a> in <a href="http://www.cs.cmu.edu/~15251/">CMU 15-251</a></em>.</p>
<h2 id="the-game">The Game</h2>
<p>Consider a 31-person game in which each player is assigned a red or blue hat, determined by an independent coin flip. The players can see everyone else’s hat color, but <strong>not</strong> their own.</p>
<p>Once the game begins, no communication is allowed between the players. After looking at the other hats, all players must <em>simultaneously</em> either announce the color of their hat or abstain from guessing. If at least one player correctly announces their hat color, and no players incorrectly announce theirs, the group wins.</p>
<p>What strategy can the group use to obtain a high probability of winning?</p>
<h2 id="a-simple-strategy">A Simple Strategy</h2>
<p>The group could choose one person to always announce red and have all other players always abstain. This strategy wins 50% of the time.
However, it clearly doesn’t make use of all the available information, so we can do better.</p>
<p>In fact, we can do <em>much</em> better by making use of an error-correcting code. But what do error-correcting codes have to do with the hat game?</p>
<p>Let’s start by describing the game state as a 31-bit binary number, where bit \(i\) specifies the color of player \(i\)’s hat. As a player, we can see everyone else’s hat, so we know 30 bits of this number. Since the group can agree on a consistent ordering of players beforehand, we also know the index of our unknown bit. If we could somehow use all the other bits to pin down ours, we could correctly announce our hat color and win the game. That sounds like a job for error correction.</p>
<h2 id="hamming-codes">Hamming Codes</h2>
<p>If you’re new to Hamming codes, first watch 3Blue1Brown’s <a href="https://www.youtube.com/watch?v=X8jsijhllIA">excellent video on the topic</a>.</p>
<p>Here, we’ll make use of a matrix-based definition of Hamming codes, which will allow us to use tools from linear algebra. We will be working with vectors over the field \(\mathbb{F}_2\), that is, the integers modulo 2. If you’ve just watched the video, you might notice that summing bits modulo two is the same as computing their parity.</p>
\[\begin{array} {|c|c|c|c|}\hline x & y & x+y & x \cdot y \\ \hline 0 & 0 & 0 & 0 \\ \hline 0 & 1 & 1 & 0 \\ \hline 1 & 0 & 1 & 0 \\ \hline 1 & 1 & 0 & 1 \\ \hline \end{array}\]
<p>To define a Hamming code, first choose a positive integer \(r\) and let \(n = 2^r - 1\). Our code will use \(n\)-bit words to encode \(n-r\) bits of data, providing the ability to correct arbitrary 1-bit errors.</p>
<p>The set of Hamming codewords is defined as:</p>
\[C = \{ c \in \{0,1\}^n\ \big|\ H_{r,n}c = 0 \}\]
<p>Where \(H_{r,n}\) is an \(r \times n\) matrix whose \(i\)th column is the \(r\)-bit binary representation of \(i\).</p>
<p>For example, if \(r = 3, n = 7\), then</p>
\[H_{3,7} = \begin{bmatrix}0&0&0&1&1&1&1\\0&1&1&0&0&1&1\\1&0&1&0&1&0&1\end{bmatrix}\]
<p>We call \(H\) “error-correcting” because given a non-codeword \(z\) (i.e. \(Hz \neq 0\)), \(Hz\) will be the binary representation of the position of the “error” in \(z\). That is, if we flip the bit at position \(Hz\) in \(z\), the result will be a codeword. For example,</p>
\[H_{3,7}z = \begin{bmatrix}0&0&0&1&1&1&1\\0&1&1&0&0&1&1\\1&0&1&0&1&0&1\end{bmatrix} \begin{bmatrix}1&0&1&0&1&1&1\end{bmatrix}^T = \begin{bmatrix}1&1&0\end{bmatrix}\]
<p>If we flip the sixth (\(110\)) bit in \(z\), we in fact get a codeword (\(1010101\)). This suggests that every non-codeword can be transmuted into a unique codeword by flipping exactly one bit. Let’s prove that.</p>
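<p>This decoding step is easy to check numerically. A quick sketch in Python (NumPy assumed; <code>hamming_H</code> builds the matrix from the definition above, with 1-indexed columns):</p>

```python
import numpy as np

def hamming_H(r):
    """H_{r,n}: column i (1-indexed) is the r-bit binary representation of i."""
    n = 2**r - 1
    return np.array([[(i >> (r - 1 - row)) & 1 for i in range(1, n + 1)]
                     for row in range(r)])

H = hamming_H(3)
z = np.array([1, 0, 1, 0, 1, 1, 1])
syndrome = H @ z % 2                       # [1, 1, 0]
pos = int("".join(map(str, syndrome)), 2)  # binary 110 = 6: position of the error
z[pos - 1] ^= 1                            # flip the sixth bit...
assert not (H @ z % 2).any()               # ...and z is now the codeword 1010101
```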
<p>By construction, any two columns in \(H\) differ in at least one position. That means every pair of columns is linearly independent—i.e. you can’t add two columns together and get the zero vector. Then, consider that for any non-zero codeword \(c\), \(Hc\) is a linear combination of the columns of \(H\). Since \(c\) is a codeword, we know \(Hc = 0\), so \(c\) must add together <em>at least three</em> columns from \(H\). Therefore, \(c\) must have at least three bits set.</p>
<p>Now consider any two codewords \(c_1 \neq c_2\):</p>
\[\begin{align*}
&& Hc_2 &= 0\\
&\implies& H(c_1 + (c_2 - c_1)) &= 0\\
&\implies& Hc_1 + H(c_2 - c_1) &= 0\\
&\implies& H(c_2 - c_1) &= 0 \tag{$Hc_1 = 0$}\\
\end{align*}\]
<p>That is, the difference between any two codewords is itself a codeword—and must have at least three bits set. Therefore, all codewords differ from all others in at least three positions.</p>
<p>Now, given an \(n\)-bit codeword \(c\), we can construct \(n\) other words that differ in exactly <em>one</em> bit. Because transmuting \(c\) into a different codeword would require changing at least <em>three</em> bits, we know that none of our new words can be one bit different from any codeword <em>other</em> than \(c\).</p>
<p>Finally, let’s count the number of codewords. Since by construction, the columns of \(H\) include each \(r\)-bit basis vector, \(H\) has rank \(r\) and nullity \(n-r\). Since we defined \(C\) as the null space of \(H\), there are \(2^{n-r}\) codewords.</p>
<p>For each codeword, we can generate \(n\) unique non-codewords, which accounts for…</p>
<div style="font-size: 20px;">
$$ \begin{align*}
\underbrace{2^{n-r}}_{\text{codewords}} + \underbrace{n \times 2^{n-r}}_{\text{non-codewords}} &= 2^{n-r} + (2^r-1)(2^{n-r})\\
&= 2^{n-r}2^r\\
&= 2^n \text{ words}
\end{align*}$$
</div>
<p>…all of the words! Therefore, every word is either a codeword or differs from a unique codeword in exactly one bit.</p>
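<p>This conclusion can be verified exhaustively for \(r = 3\). A sketch in Python (NumPy assumed; <code>hamming_H</code> constructs \(H_{3,7}\) from the definition above):</p>

```python
import numpy as np
from itertools import product

def hamming_H(r):
    """H_{r,n}: column i (1-indexed) is the r-bit binary representation of i."""
    n = 2**r - 1
    return np.array([[(i >> (r - 1 - row)) & 1 for i in range(1, n + 1)]
                     for row in range(r)])

H = hamming_H(3)
words = [np.array(w) for w in product([0, 1], repeat=7)]
codewords = [w for w in words if not (H @ w % 2).any()]
assert len(codewords) == 16  # 2^(n-r) = 2^4

# every 7-bit word is within Hamming distance 1 of exactly one codeword
for w in words:
    assert sum(1 for c in codewords if np.sum(w != c) <= 1) == 1
```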
<h2 id="a-better-strategy">A Better Strategy</h2>
<p>We’re now ready to devise a better strategy for the hat game. Beforehand, the group of \(n\) players can communicate the appropriate Hamming matrix \(H_{r,n}\) and set a player order.</p>
<p>Once hats are assigned, each player constructs both the game state \(s_b\) assuming that they’re wearing blue and the state \(s_r\) assuming they’re wearing red.</p>
<ul>
<li>If \(H_{r,n}s_b \neq 0\) and \(H_{r,n}s_r = 0\), they announce blue.</li>
<li>If \(H_{r,n}s_b = 0\) and \(H_{r,n}s_r \neq 0\), they announce red.</li>
<li>If \(H_{r,n}s_b \neq 0\) and \(H_{r,n}s_r \neq 0\), they abstain.</li>
</ul>
<p>Note that it’s impossible for both results to be zero: the two assumed states differ in only one bit, while distinct codewords differ in at least three.</p>
<p>The Hamming strategy wins the game whenever the initial distribution of hats is <em>not</em> a codeword. By the above proof, we know that when the state is not a codeword, it differs from exactly one codeword in exactly one bit. This means that there is a single player for whom switching their hat color would make the state a codeword: that player will observe \(Hs = 0\) when assuming the incorrect choice, and \(Hs \neq 0\) when assuming the correct choice. All other players will observe \(Hs \neq 0\) for both assumptions because toggling their bit would not make the state a codeword. Therefore, one player will announce their correct color, and the group wins.</p>
<p>For example, in a 7-player hat game with initial distribution \(\begin{bmatrix}1&0&1&0&1&1&1\end{bmatrix}\):</p>
<ul>
<li>Player 1 finds:</li>
</ul>
\[\begin{align*}H_{3,7}\begin{bmatrix}0_{?}&0&1&0&1&1&1\end{bmatrix}^T &= \begin{bmatrix}1&1&1\end{bmatrix}\\
H_{3,7}\begin{bmatrix}1_{?}&0&1&0&1&1&1\end{bmatrix}^T &= \begin{bmatrix}1&1&0\end{bmatrix}\end{align*}\]
<ul>
<li>Players 2-5 and 7 also find \(H_{3,7}s_b \neq 0\) and \(H_{3,7}s_r \neq 0\).</li>
<li>Player 6 finds:</li>
</ul>
\[\begin{align*}H_{3,7}\begin{bmatrix}1&0&1&0&1&0_{?}&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\\
H_{3,7}\begin{bmatrix}1&0&1&0&1&1_{?}&1\end{bmatrix}^T &= \begin{bmatrix}1&1&0\end{bmatrix} \tag{that's 6!}\end{align*}\]
<p>The lone player 6 announces color 1, and the group wins.</p>
<p>On the other hand, the group will lose the game when the initial distribution <em>is</em> a codeword. In this case, all players will determine that one hat color—the <em>correct</em> one—results in a codeword, and the other does not. This is because when the true state is a codeword, changing any one bit will result in a non-codeword. In this case, all players will announce the wrong color, and the group loses.</p>
<p>For example, again in the 7-player game, but this time with initial distribution \(\begin{bmatrix}1&0&1&0&1&0&1\end{bmatrix}\):</p>
<ul>
<li>Player 1 sees:</li>
</ul>
\[\begin{align*}H_{3,7}\begin{bmatrix}0_{?}&0&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&1\end{bmatrix} \tag{that's 1!}\\
H_{3,7}\begin{bmatrix}1_{?}&0&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\end{align*}\]
<ul>
<li>Player 2 sees:</li>
</ul>
\[\begin{align*}H_{3,7}\begin{bmatrix}1&0_{?}&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\\
H_{3,7}\begin{bmatrix}1&1_{?}&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&1&0\end{bmatrix}\tag{that's 2!}\end{align*}\]
<ul>
<li>All other players find the equivalent.</li>
</ul>
<p>Hence, all players announce the incorrect color.</p>
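<p>Both outcomes, along with the overall win rate, can be confirmed by simulating the strategy over all \(2^7\) hat assignments. A sketch in Python (NumPy assumed; encoding one hat color as 0 and the other as 1 is an arbitrary choice):</p>

```python
import numpy as np
from itertools import product

def hamming_H(r):
    """H_{r,n}: column i (1-indexed) is the r-bit binary representation of i."""
    n = 2**r - 1
    return np.array([[(i >> (r - 1 - row)) & 1 for i in range(1, n + 1)]
                     for row in range(r)])

def play(H, hats):
    """Run one round of the Hamming strategy; the group wins if at least one
    player guesses correctly and nobody guesses incorrectly."""
    correct = incorrect = 0
    for i in range(len(hats)):
        for color in (0, 1):
            s = np.array(hats[:i] + (color,) + hats[i + 1:])
            if not (H @ s % 2).any():
                # assuming `color` makes the state a codeword, so announce the
                # other color; if neither assumption is a codeword, abstain
                if 1 - color == hats[i]:
                    correct += 1
                else:
                    incorrect += 1
    return correct >= 1 and incorrect == 0

H = hamming_H(3)
wins = sum(play(H, hats) for hats in product((0, 1), repeat=7))
# the strategy loses only on the 16 codeword states
```

<p>As expected, the strategy wins 112 of the 128 possible rounds: exactly \(\frac{n}{n+1} = \frac{7}{8}\).</p>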
<h2 id="optimality">Optimality</h2>
<p>Our Hamming strategy wins whenever the true arrangement of hats does not represent a codeword. Since there are \(n\times 2^{n-r}\) non-codewords out of \(2^n\) total words:</p>
\[\mathbb{P}(\text{win}) = \frac{n \times 2^{n-r}}{2^n} = \frac{n}{2^r} = \frac{n}{n+1} = \frac{31}{32} \quad (n = 31)\]
<p>The 31-person group has an approximately 97% chance to win the game, given a uniformly random initial distribution. But, is 97% necessarily the best they can do?</p>
<p>Let’s bound the maximum number of games that <em>any possible</em> strategy could win. Given an arbitrary strategy, each individual player will guess correctly and incorrectly in an equal number of game states. This is because each player’s decision making process must be independent of the color of their <em>own</em> hat: given the same information about the 30 other hats, the player <em>must</em> either abstain or guess the same color in both the state where they’re wearing red <em>and</em> where they’re wearing blue. Hence, for the individual to be right in one state, they must be wrong in the other.</p>
<p>However, this fact doesn’t mean our overall strategy has to lose as much as it wins: if we can use up more incorrect guesses in losing rounds than correct guesses in winning rounds, our overall win-rate will increase. Optimally, we would have exactly one player guess correctly in every winning round, and have all players guess incorrectly in every losing round. This implies that for every \(n\) winning rounds, there must be at least one losing round.</p>
<p>Hence, if the strategy wins \(W\) rounds and loses \(L\) rounds out of \(2^n\), we know \(L \ge \frac{W}{n}\). Because \(W + L = 2^n\), we have:</p>
\[\begin{align*} 2^n - W &\ge \frac{W}{n} \\
n2^n &\ge W(n + 1)\\
\frac{n}{n+1} &\ge \frac{W}{2^n}\end{align*}\]
<p>This means the maximum win probability is in fact \(\frac{n}{n+1}\), which our Hamming strategy achieves. It is optimal!</p>
<p>Unfortunately, we’ve only defined our Hamming strategy for games with \(n = 2^r - 1\) players. It may be the case that when the number of players is not one less than a power of two, the optimal strategy is simply to ignore some players and run the largest Hamming strategy available—but this has never been proven!</p>I first encountered the Hat Problem in CMU 15-251. The Game Consider a 31-person game in which each player is assigned a red or blue hat, determined by an independent coin flip. The players can see everyone else’s hat color, but not their own. Once the game begins, no communication is allowed between the players. After looking at the other hats, all players must simultaneously either announce the color of their hat or abstain from guessing. If at least one player correctly announces their hat color, and no players incorrectly announce theirs, the group wins. What strategy can the group use to obtain a high probability of winning? A Simple Strategy The group could choose one person to always announce red and have all other players always abstain. This strategy wins 50% of the time. However, it clearly doesn’t make use of all the available information, so we can do better. In fact, we can do much better by making use of an error-correcting code. But what do error-correcting codes have to do with the hat game? Let’s start by describing the game state as a 31-bit binary number, where bit \(i\) specifies the color of player \(i\)’s hat. As a player, we can see everyone else’s hat, so we know 30 bits of this number. Since the group can agree on a consistent ordering of players beforehand, we also know the index of our unknown bit. If we could somehow use all the other bits to pin down ours, we could correctly announce our hat color and win the game. That sounds like a job for error correction. Hamming Codes If you’re new to Hamming codes, first watch 3Blue1Brown’s excellent video on the topic. 
Here, we’ll make use of a matrix-based definition of Hamming codes, which will allow us to use tools from linear algebra. We will be working with vectors over the field \(\mathbb{F}_2\), that is, the integers modulo 2. If you’ve just watched the video, you might notice that summing bits modulo two is the same as computing their parity. \[\begin{array} {|r|r|}\hline x & y & x+y & x*y \\ \hline 0 & 0 & 0 & 0 \\ \hline 0 & 1 & 1 & 0 \\ \hline 1 & 0 & 1 & 0 \\ \hline 1 & 1 & 0 & 1 \\ \hline \end{array}\] To define a Hamming code, first choose a positive integer \(r\) and let \(n = 2^r - 1\). Our code will use \(n\) bit words to encode \(n-r\) bits of data, providing the ability to correct arbitrary 1-bit errors. The set of Hamming codewords is defined as: \[C = \{ c \in \{0,1\}^n\ \big|\ H_{r,n}c = 0 \}\] Where \(H_{r,n}\) is a \(r \times n\) matrix whose \(i\)th column is the \(r\)-bit binary representation of \(i\). For example, if \(r = 3, n = 7\), then \[H_{3,7} = \begin{bmatrix}0&0&0&1&1&1&1\\0&1&1&0&0&1&1\\1&0&1&0&1&0&1\end{bmatrix}\] We call \(H\) ‘‘error-correcting’’ because given a non-codeword \(z\) (i.e. \(Hz \neq 0\)), \(Hz\) will be the binary representation of the position of the ‘‘error’’ in \(z\). That is, if we flip bit \(Hz\) in \(z\), the result will be a codeword. For example, \[H_{3,7}z = \begin{bmatrix}0&0&0&1&1&1&1\\0&1&1&0&0&1&1\\1&0&1&0&1&0&1\end{bmatrix} \begin{bmatrix}1&0&1&0&1&1&1\end{bmatrix}^T = \begin{bmatrix}1&1&0\end{bmatrix}\] If we flip the sixth (\(110\)) bit in \(z\), we in fact get a codeword (\(1010101\)). This suggests that every non-codeword can be transmuted into a unique codeword by flipping exactly one bit. Let’s prove that. By construction, any two columns in \(H\) differ in at least one position. That means every pair of columns is linearly independent—i.e. you can’t add two columns together and get the zero vector. Then, consider that for any non-zero codeword \(c\), \(Hc\) is a linear combination of the columns of \(H\). 
Since \(c\) is a codeword, we know \(Hc = 0\). No column of \(H\) is zero, and no two columns sum to zero, so \(c\) must add together at least three columns from \(H\). Therefore, \(c\) must have at least three bits set. Now consider any two codewords \(c_1 \neq c_2\): \[\begin{align*} && Hc_2 &= 0\\ &\implies& H(c_1 + (c_2 - c_1)) &= 0\\ &\implies& Hc_1 + H(c_2 - c_1) &= 0\\ &\implies& H(c_2 - c_1) &= 0 \tag{$Hc_1 = 0$}\\ \end{align*}\] That is, the difference between any two codewords is itself a codeword—and must have at least three bits set. Therefore, all codewords differ from all others in at least three positions. Now, given an \(n\)-bit codeword \(c\), we can construct \(n\) other words that differ in exactly one bit. Because transmuting \(c\) into a different codeword would require changing at least three bits, none of these new words can be within one bit of any codeword other than \(c\). Finally, let’s count the number of codewords. Since, by construction, the columns of \(H\) include each \(r\)-bit basis vector, \(H\) has rank \(r\) and nullity \(n-r\). Since we defined \(C\) as the null space of \(H\), there are \(2^{n-r}\) codewords. For each codeword, we can generate \(n\) unique non-codewords, which accounts for… \[\begin{align*} \underbrace{2^{n-r}}_{\text{codewords}} + \underbrace{n \times 2^{n-r}}_{\text{non-codewords}} &= 2^{n-r} + (2^r-1)(2^{n-r})\\ &= 2^{n-r}2^r\\ &= 2^n \text{ words} \end{align*}\] …all of the words! Therefore, every word is either a codeword or differs from a unique codeword in exactly one bit. A Better Strategy We’re now ready to devise a better strategy for the hat game. Beforehand, the group of \(n\) players can communicate the appropriate Hamming matrix \(H_{r,n}\) and set a player order. Once hats are assigned, each player constructs both the game state \(s_b\) assuming that they’re wearing blue and the state \(s_r\) assuming they’re wearing red. If \(H_{r,n}s_b \neq 0\) and \(H_{r,n}s_r = 0\), they announce blue. 
If \(H_{r,n}s_b = 0\) and \(H_{r,n}s_r \neq 0\), they announce red. If \(H_{r,n}s_b \neq 0\) and \(H_{r,n}s_r \neq 0\), they abstain. Note that it’s impossible for both results to be zero: that would require two codewords differing in a single bit. The Hamming strategy wins the game whenever the initial distribution of hats is not a codeword. By the above proof, we know that when the state is not a codeword, it differs from exactly one codeword in exactly one bit. This means that there is a single player for whom switching their hat color would make the state a codeword: that player will observe \(Hs = 0\) when assuming the incorrect choice, and \(Hs \neq 0\) when assuming the correct choice. All other players will observe \(Hs \neq 0\) for both assumptions because toggling their bit would not make the state a codeword. Therefore, one player will announce their correct color, and the group wins. For example, in a 7-player hat game with initial distribution \(\begin{bmatrix}1&0&1&0&1&1&1\end{bmatrix}\): Player 1 finds: \[\begin{align*}H_{3,7}\begin{bmatrix}0_{?}&0&1&0&1&1&1\end{bmatrix}^T &= \begin{bmatrix}1&1&1\end{bmatrix}\\ H_{3,7}\begin{bmatrix}1_{?}&0&1&0&1&1&1\end{bmatrix}^T &= \begin{bmatrix}1&1&0\end{bmatrix}\end{align*}\] Players 2-5 and 7 also find \(H_{3,7}s_b \neq 0\) and \(H_{3,7}s_r \neq 0\). Player 6 finds: \[\begin{align*}H_{3,7}\begin{bmatrix}1&0&1&0&1&0_{?}&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\\ H_{3,7}\begin{bmatrix}1&0&1&0&1&1_{?}&1\end{bmatrix}^T &= \begin{bmatrix}1&1&0\end{bmatrix} \tag{that's 6!}\end{align*}\] The lone player 6 announces red (\(1\)), and the group wins. On the other hand, the group will lose the game when the initial distribution is a codeword. In this case, all players will determine that one hat color—the correct one—results in a codeword, and the other does not. This is because when the true state is a codeword, changing any one bit will result in a non-codeword. Hence, all players will announce the wrong color, and the group loses. 
For example, again in the 7-player game, but this time with initial distribution \(\begin{bmatrix}1&0&1&0&1&0&1\end{bmatrix}\): Player 1 sees: \[\begin{align*}H_{3,7}\begin{bmatrix}0_{?}&0&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&1\end{bmatrix} \tag{that's 1!}\\ H_{3,7}\begin{bmatrix}1_{?}&0&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\end{align*}\] Player 2 sees: \[\begin{align*}H_{3,7}\begin{bmatrix}1&0_{?}&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&0&0\end{bmatrix}\\ H_{3,7}\begin{bmatrix}1&1_{?}&1&0&1&0&1\end{bmatrix}^T &= \begin{bmatrix}0&1&0\end{bmatrix}\tag{that's 2!}\end{align*}\] All other players find the equivalent. Hence, all players announce the incorrect color. Optimality Our Hamming strategy wins whenever the true arrangement of hats does not represent a codeword. Since there are \(n\times 2^{n-r}\) non-codewords out of \(2^n\) total words: \[\mathbb{P}(\text{win}) = \frac{n \times 2^{n-r}}{2^n} = \frac{n}{2^r} = \frac{n}{n+1}\] For \(n = 31\), this gives \(\frac{31}{32}\): the group has an approximately 97% chance to win the game, given a uniformly random initial distribution. But is 97% necessarily the best they can do? Let’s bound the maximum number of games that any possible strategy could win. Given an arbitrary strategy, each individual player will guess correctly and incorrectly in an equal number of game states. This is because each player’s decision-making process must be independent of the color of their own hat: given the same information about the other \(n-1\) hats, the player must either abstain or guess the same color in both the state where they’re wearing red and where they’re wearing blue. Hence, for the individual to be right in one state, they must be wrong in the other. However, this fact doesn’t mean our overall strategy has to lose as much as it wins: if we can use up more incorrect guesses in losing rounds than correct guesses in winning rounds, our overall win-rate will increase. 
Optimally, we would have exactly one player guess correctly in every winning round, and have all players guess incorrectly in every losing round. This implies that for every \(n\) winning rounds, there must be at least one losing round. Hence, if the strategy wins \(W\) rounds and loses \(L\) rounds out of \(2^n\), we know \(L \ge \frac{W}{n}\). Because \(W + L = 2^n\), we have: \[\begin{align*} 2^n - W &\ge \frac{W}{n} \\ n2^n &\ge W(n + 1)\\ \frac{n}{n+1} &\ge \frac{W}{2^n}\end{align*}\] This means the maximum win probability is in fact \(\frac{n}{n+1}\), which our Hamming strategy achieves. It is optimal! Unfortunately, we’ve only defined our Hamming strategy for games with \(n = 2^r - 1\) players. It may be the case that when the number of players is not one less than a power of two, the optimal strategy is simply to ignore some players and run the largest Hamming strategy available—but this has never been proven!Washington2022-03-16T00:00:00+00:002022-03-16T00:00:00+00:00https://thenumbat.github.io/Washington<p>Washington, D.C., 2022</p>
<div style="text-align: center;"><img src="/assets/washington/0.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/1.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/2.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/3.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/4.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/5.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/6.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/7.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/8.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/9.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/washington/16.webp" /></div>
<p></p>Kauai2022-01-22T00:00:00+00:002022-01-22T00:00:00+00:00https://thenumbat.github.io/Kauai<p>Kauai, HI, 2022</p>
<div style="text-align: center;"><img src="/assets/kauai/0.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/1.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/2.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/3.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/4.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/5.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/6.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/7.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/8.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/9.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/kauai/16.webp" /></div>
<p></p>A Compiler Bug2021-12-26T00:00:00+00:002021-12-26T00:00:00+00:00https://thenumbat.github.io/Compiler-Bug<p>While I was working on <a href="/projects/dawn">Dawn</a>, I ran into a curious bug in the Visual Studio 2019 C++ compiler. I reported it to the bug tracker, where it was confirmed to be an interference analysis issue. It was eventually fixed nearly a year later in 2020. Today, let’s investigate what the issue really was.</p>
<p>The problem arose upon implementing a Perlin noise type for procedural solid texturing. My type just so happened to include a 4-kilobyte array of pre-initialized random data, and this precise size caused writes to the data to interfere with a preceding struct member. The resulting bug ended up bricking the output of my path tracer—but only with optimizations enabled!</p>
<p>Assuming there was some bug in my code, I was able to narrow down the problem to an unintended write just preceding the random data array. However, I could not figure out what exactly was causing the write. These kinds of bugs (particularly those that only show up with optimizations enabled) are typically symptoms of undefined behavior, but there was no undefined behavior here. Hence, I started stripping out pieces of my code until I could minimally reproduce the issue. Eventually, I checked whether my code worked as expected in GCC and Clang, and it did: the issue was an MSVC compiler bug that resulted in broken code-gen.</p>
<p>I was able to capture the issue in the following example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <cstdint>
#include <cstdio>
</span>
<span class="k">struct</span> <span class="nc">data</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">_data</span><span class="p">[</span><span class="mi">4095</span><span class="p">]</span> <span class="o">=</span> <span class="p">{};</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="nc">container</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="n">type</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">data</span> <span class="n">n</span><span class="p">;</span>
<span class="k">static</span> <span class="n">container</span> <span class="n">make</span><span class="p">()</span> <span class="p">{</span>
<span class="n">container</span> <span class="n">ret</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Before: %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">ret</span><span class="p">.</span><span class="n">type</span><span class="p">);</span>
<span class="n">ret</span><span class="p">.</span><span class="n">n</span> <span class="o">=</span> <span class="n">data</span><span class="p">{};</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"After: %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">ret</span><span class="p">.</span><span class="n">type</span><span class="p">);</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">container</span><span class="p">()</span> <span class="p">{}</span>
<span class="n">container</span><span class="p">(</span><span class="k">const</span> <span class="n">container</span><span class="o">&</span> <span class="n">o</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="nf">func</span><span class="p">(</span><span class="n">container</span> <span class="n">c</span><span class="p">)</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">func</span><span class="p">(</span><span class="n">container</span><span class="o">::</span><span class="n">make</span><span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Which resulted in the following output when compiled via <code class="language-plaintext highlighter-rouge">cl bug.cpp -O2</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Before: 1
After: 0
</code></pre></div></div>
<p>My example seemed weirdly specific. Each of the following changes would correct the output:</p>
<ul>
<li>Changing <code class="language-plaintext highlighter-rouge">_data[4095]</code> to <code class="language-plaintext highlighter-rouge">_data[4094]</code> or smaller</li>
<li>Removing the container copy constructor</li>
<li>Removing the call to <code class="language-plaintext highlighter-rouge">func()</code></li>
<li>Adding an unrelated call to <code class="language-plaintext highlighter-rouge">container::make()</code> before <code class="language-plaintext highlighter-rouge">func()</code></li>
</ul>
<p>With the help of <a href="https://godbolt.org/z/5a9v99988">compiler explorer</a>, we can figure out what the issue was, as well as what changed after the fix. Bisecting the supported compiler versions shows that the assembly output has been quite stable over time, only changing with the jump from version 19.24 to 19.25.</p>
<p>Examining the output (x86, <code class="language-plaintext highlighter-rouge">-O2</code>), we will find that the only relevant code is in <code class="language-plaintext highlighter-rouge">main</code>—all function calls have been inlined. Comparing between the two compiler versions reveals near-identical assembly output: the only difference is that the correct version allocates extra stack space, preventing two temporaries from overlapping.</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nf">$T1</span> <span class="err">=</span> <span class="mi">32</span>
<span class="err">--</span> <span class="nf">$T2</span> <span class="err">=</span> <span class="mi">32</span>
<span class="err">++</span> <span class="nf">$T2</span> <span class="err">=</span> <span class="mi">4128</span>
<span class="err">--</span> <span class="nf">$T3</span> <span class="err">=</span> <span class="mi">4128</span>
<span class="err">++</span> <span class="nf">$T3</span> <span class="err">=</span> <span class="mi">8224</span>
<span class="nl">main:</span>
<span class="c1">; Save registers/allocate stack space</span>
<span class="nf">mov</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">8</span><span class="p">],</span> <span class="nb">rsi</span>
<span class="nf">push</span> <span class="nb">rdi</span>
<span class="err">--</span> <span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">8224</span>
<span class="err">++</span> <span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">12320</span>
<span class="nf">call</span> <span class="nv">__chkstk</span>
<span class="nf">sub</span> <span class="nb">rsp</span><span class="p">,</span> <span class="nb">rax</span>
<span class="c1">; Copy $T3 to $T2</span>
<span class="nf">lea</span> <span class="nb">rdi</span><span class="p">,</span> <span class="kc">$</span><span class="nv">T2</span><span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">ecx</span><span class="p">,</span> <span class="mi">4096</span>
<span class="nf">lea</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kc">$</span><span class="nv">T3</span><span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>
<span class="nf">xor</span> <span class="nb">edx</span><span class="p">,</span> <span class="nb">edx</span>
<span class="nf">rep</span> <span class="nv">movsb</span>
<span class="c1">; Create container in $T2 with type = 1</span>
<span class="nf">lea</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kc">$</span><span class="nv">T2</span><span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">r8d</span><span class="p">,</span> <span class="mi">4095</span>
<span class="nf">lea</span> <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="nf">mov</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">],</span> <span class="mi">1</span>
<span class="nf">call</span> <span class="nv">memset</span>
<span class="c1">; Print $T2->type</span>
<span class="nf">movzx</span> <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">lea</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">OFFSET</span> <span class="nv">FLAT</span><span class="p">:</span><span class="err">`</span><span class="nv">string</span><span class="err">'</span>
<span class="nf">call</span> <span class="nv">printf</span>
<span class="c1">; Create temporary data{} in $T1</span>
<span class="nf">xor</span> <span class="nb">edx</span><span class="p">,</span> <span class="nb">edx</span>
<span class="nf">lea</span> <span class="nb">rcx</span><span class="p">,</span> <span class="kc">$</span><span class="nv">T1</span><span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">r8d</span><span class="p">,</span> <span class="mi">4095</span>
<span class="nf">call</span> <span class="nv">memset</span>
<span class="c1">; Copy temporary data{} from $T1 to $T2->n</span>
<span class="nf">lea</span> <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">r8d</span><span class="p">,</span> <span class="mi">4095</span>
<span class="nf">lea</span> <span class="nb">rdx</span><span class="p">,</span> <span class="kc">$</span><span class="nv">T1</span><span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>
<span class="nf">call</span> <span class="nv">memcpy</span>
<span class="c1">; Print $T2->type</span>
<span class="nf">movzx</span> <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">lea</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">OFFSET</span> <span class="nv">FLAT</span><span class="p">:</span><span class="err">`</span><span class="nv">string</span><span class="err">'</span>
<span class="nf">call</span> <span class="nv">printf</span>
<span class="c1">; Deallocate stack space/load registers</span>
<span class="err">--</span> <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">8240</span><span class="p">]</span>
<span class="err">++</span> <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">12336</span><span class="p">]</span>
<span class="nf">xor</span> <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="err">--</span> <span class="nf">add</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">8224</span>
<span class="err">++</span> <span class="nf">add</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">12320</span>
<span class="nf">pop</span> <span class="nb">rdi</span>
<span class="nf">ret</span> <span class="mi">0</span>
</code></pre></div></div>
<p>So, what exactly are these temporaries, and why was the overlap a problem? We can deduce what each temporary means:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">$T1</code>: the anonymous <code class="language-plaintext highlighter-rouge">data{}</code> created in <code class="language-plaintext highlighter-rouge">container::make</code></li>
<li><code class="language-plaintext highlighter-rouge">$T2</code>: <code class="language-plaintext highlighter-rouge">ret</code>, created in <code class="language-plaintext highlighter-rouge">container::make</code> and passed to <code class="language-plaintext highlighter-rouge">func</code>. The same memory is used in both contexts due to <a href="https://en.cppreference.com/w/cpp/language/copy_elision">return value optimization and copy elision</a>—the copy constructor is never called.</li>
<li><code class="language-plaintext highlighter-rouge">$T3</code>: seemingly useless, uninitialized storage that is copied to <code class="language-plaintext highlighter-rouge">$T2</code> before the latter is overwritten. Perhaps it is a leftover from before copy elision was applied that was never optimized out?</li>
</ul>
<p>Referring back to the assembly listing, we can see that the broken version assigns <code class="language-plaintext highlighter-rouge">$T1</code> and <code class="language-plaintext highlighter-rouge">$T2</code> to overlapping stack locations!</p>
<table>
<thead>
<tr>
<th>Temporary</th>
<th>Type</th>
<th>Location</th>
</tr>
</thead>
<tbody>
<tr>
<td>$T1</td>
<td>data</td>
<td>32-4127</td>
</tr>
<tr>
<td>$T2</td>
<td>container</td>
<td>32-4128</td>
</tr>
<tr>
<td>$T3</td>
<td>container</td>
<td>4128-8224</td>
</tr>
</tbody>
</table>
<p>Hence, the code overwrites <code class="language-plaintext highlighter-rouge">$T2->type</code> when initializing <code class="language-plaintext highlighter-rouge">data{}</code> to zero. But that’s not all: <code class="language-plaintext highlighter-rouge">memcpy</code> is used to copy from <code class="language-plaintext highlighter-rouge">$T1</code> to <code class="language-plaintext highlighter-rouge">$T2->n</code>, which is undefined behavior because the source and destination ranges overlap.</p>
<p>The corrected version allocates separate space for the two temporaries, and everything works as expected.</p>
<table>
<thead>
<tr>
<th>Temporary</th>
<th>Type</th>
<th>Location</th>
</tr>
</thead>
<tbody>
<tr>
<td>$T1</td>
<td>data</td>
<td>32-4127</td>
</tr>
<tr>
<td>$T2</td>
<td>container</td>
<td>4128-8224</td>
</tr>
<tr>
<td>$T3</td>
<td>container</td>
<td>8224-12320</td>
</tr>
</tbody>
</table>
<p>Unfortunately, explaining <em>what</em> happened gives us little insight into <em>why</em>. Somehow, my code structure gave the compiler the false impression that <code class="language-plaintext highlighter-rouge">ret</code> and <code class="language-plaintext highlighter-rouge">data{}</code> were not alive at the same time, despite one being copied to the other. Given the context, I would guess the combination of inlining and copy elision caused confusion over stack ownership—but we might never know for sure.</p>Exile2021-06-26T00:00:00+00:002021-06-26T00:00:00+00:00https://thenumbat.github.io/projects/Exile<p>A <a href="https://handmade.network/manifesto">handmade</a>-style voxel game engine. Currently working on a re-write.</p>
<!--end_excerpt-->
<hr />
<p>I worked on the first version of Exile from 2017-2019, which can be found on <a href="https://github.com/TheNumbat/exile">GitHub</a>. As a learning experience, I implemented everything without using any libraries. It included the following features:</p>
<ul>
<li>Custom C-like standard library</li>
<li>Type reflection and generic serialization through metaprogramming</li>
<li>OpenGL 4.5 deferred renderer with an optimized voxel mesh pipeline, post-processing, dynamic lighting, and MSAA</li>
<li>Asynchronous voxel world generation and meshing</li>
<li>Platform abstraction layer wrapping Win32 or SDL2 and supporting hot code reloading</li>
<li>CPU and memory usage profiling system</li>
<li>Asset pipeline</li>
</ul>
<p>I also wrote several <a href="/">blog posts</a> about the various systems.</p>
<p>As of 2022, I’m working on a complete re-write of the project. So far, it includes a new modern C++ standard library and a Vulkan renderer using hardware accelerated ray tracing. Look out for fresh blog posts about these systems!</p>
<p><img src="/assets/projects/exile2.png" alt="exile2" /></p>