<p>Max Slater: Computer Graphics, Programming, and Math (feed: https://thenumb.at/feed.xml, generated by Jekyll, updated 2024-01-17T15:30:04+00:00)</p>
<p>Palm Springs (2024-01-07T00:00:00+00:00, https://thenumb.at/Palm-Springs)</p>
<p>Palm Springs, CA, 2023</p>
<div style="text-align: center;"><img src="/assets/palm-springs/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/palm-springs/13.webp" /></div>
<p></p>
<p>Oxidizing C++ (2024-01-06T00:00:00+00:00, https://thenumb.at/rpp)</p>
<p>I recently rewrote the C++20 infrastructure I use for personal projects, so I decided to snapshot its design in the style of <a href="https://nullprogram.com/blog/2023/10/08/">another recent article</a>.
The library (<em>rpp</em>) is primarily inspired by Rust, and may be found on <a href="https://github.com/TheNumbat/rpp">GitHub</a>.</p>
<p>I use <em>rpp</em> as an STL replacement, and I’m currently developing a <a href="https://twitter.com/The_Numbat/status/1731093556859699353">real-time path tracer</a> with it.
<em>rpp</em> attempts to provide the following features, roughly in priority order:</p>
<ul>
<li>
<p>Fast Compilation</p>
<p>Including <em>rpp</em> should not increase compilation time by more than 250ms.
It must not pull in any STL or system headers—they’re what’s really slow, not templates.</p>
</li>
<li>
<p>Debugging</p>
<p>User code should be easy to debug.
That requires good debug-build performance, defensive assertions, and prohibiting exceptions/RTTI.
Concepts should be used to produce straightforward compilation errors.</p>
</li>
<li>
<p>Performance</p>
<p>User code should have control over allocations, parallelism, and data layout.
This is enabled by a region-based allocator system, a coroutine scheduler, and better default data structure implementations.
Additionally, tracing-based profiling and allocation tracking are tightly integrated.</p>
</li>
<li>
<p>Explicitness</p>
<p>All non-trivial operations should be visible in the source code.
That means prohibiting implicit conversions, copies, and allocations, as well as using types with clear ownership semantics.</p>
</li>
<li>
<p>Metaprogramming</p>
<p>It should be easy to perform compile-time computations and reflect on types.
This involves pervasive use of concepts and constexpr, as well as defining a format for reflection metadata.</p>
</li>
</ul>
<p>These goals lead to a few categories of interesting features, which we’ll discuss below.</p>
<ul>
<li><a href="#compile-times">Compile Times</a></li>
<li><a href="#data-structures-and-reflection">Data Structures and Reflection</a></li>
<li><a href="#allocators-regions-and-tracing">Allocators, Regions, and Tracing</a></li>
<li><a href="#coroutines">Coroutines</a></li>
</ul>
<p><em>Standard disclaimers: no, it wouldn’t make sense to enforce this style on existing projects/teams, no, the library is not “production-ready,” and yes, C++ is Bad™ so I considered <a href="#appendix-why-not-language-x">switching to language X</a>.</em></p>
<h1 id="compile-times">Compile Times</h1>
<p><em>Note: this is all pre-modules, so may change soon. <a href="https://gitlab.kitware.com/cmake/cmake/-/issues/18355">CMake</a> recently released support!</em></p>
<p>C++ compilers are slow, but building your codebase doesn’t have to be.
Let’s investigate how MSVC spends its time compiling <a href="https://github.com/CMU-Graphics/Scotty3D">another project</a> of mine, which is written in a mainstream C++17 style.
Including dependencies, it’s about 100k lines of code and takes 20 seconds to compile with optimizations enabled.<sup id="fnref:benchmark" role="doc-noteref"><a href="#fn:benchmark" class="footnote" rel="footnote">1</a></sup></p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/std_compile.png" /></p>
<p>Processing this list of files took up the vast majority of compilation time, but only a handful are actually used in the project—the rest were pulled in to support the STL.</p>
<p>We’ll also find several compilation units that look like this:</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/chrono.png" /></p>
<p>Here, including <em>only</em> <code class="language-plaintext highlighter-rouge">std::chrono</code> took 75% of the entire invocation.</p>
<p>One would hope that incremental compilation is still fast, but the slowest single compilation unit spent <em>ten seconds</em> in the compiler frontend.
Much of that time was spent solving constraints, creating instances, and generating specializations for STL data structures.</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/tail.png" /></p>
<p>The STL’s long compile times can be mitigated by using precompiled headers (or modules…in C++23), but the core issue is simply that the STL is huge.</p>
<p>So, how fast can we go by avoiding it?
Here’s my rendering project, which is also roughly 100k lines of code, including <em>rpp</em> and other dependencies.
(<em>Click to play.</em>)</p>
<video class="box_shadow center" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/diopter_build.mp4" preload="" muted="" width="450px" onclick="this.play()"></video>
<p>The optimized build completes in under 3 seconds.
There’s no trick—I’ve just minimized the amount of code included into each compilation unit.
Let’s take a look at where MSVC spends its time:</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/diopter_include.png" /></p>
<p>Most of these files come from non-STL dependencies and are actually required by the project.
Several arise from <em>rpp</em>, but the worst offender (vmath) only accumulated 1s of work.</p>
<p>Incremental compilation is also much improved: the slowest compilation unit only spent 1.4s in the frontend.
Within each unit, we can see that including <code class="language-plaintext highlighter-rouge">base.h</code> (the bulk of <em>rpp</em>) took under 200ms on average.</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/base.png" /></p>
<p>Of course, this is not a proper benchmark: I’m comparing two entirely separate codebases.
Project structure has a large impact on compile times, as does usage code, so your mileage may vary.
Nevertheless, after working on both projects, I found the decreased iteration time alone to be worth ditching the STL.</p>
<h2 id="how">How?</h2>
<p>You might wonder whether <em>rpp</em> only compiles quickly because it omits most STL features.
This is obviously true to some extent: it’s only 10k lines of code, only supports Windows/Linux/x64, and isn’t concerned with backwards compatibility.</p>
<p>However, <em>rpp</em> replicates all of the STL features I actually use.
Implementing the core functionality takes relatively little code, and given some care, we can keep compile times low.
For example, we can still use <em>libc</em> and the Windows/Linux APIs, so long as we relegate their usage to particular compilation units.<sup id="fnref:lto" role="doc-noteref"><a href="#fn:lto" class="footnote" rel="footnote">2</a></sup>
User code only sees the bare-minimum symbols and types—<code class="language-plaintext highlighter-rouge">windows.h</code> need not apply.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">OS_Thread</span> <span class="nf">sys_start</span><span class="p">(</span><span class="n">OS_Thread_Ret</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="o">*</span><span class="p">),</span> <span class="kt">void</span><span class="o">*</span> <span class="n">data</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">sys_join</span><span class="p">(</span><span class="n">OS_Thread</span> <span class="kr">thread</span><span class="p">);</span>
</code></pre></div></div>
<p>Unfortunately, the same cannot be said for the STL proper, as (most) generic code must be visible to all compilation units.
Hence, <em>rpp</em> re-implements many STL data structures and algorithms.
They’re not that hard to compile quickly—STL authors have just never prioritized it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array.h box.h files.h format.h function.h hash.h heap.h log.h map.h
math.h opt.h pair.h thread.h queue.h rc.h reflect.h rng.h stack.h
string.h tuple.h variant.h vec.h
</code></pre></div></div>
<p>These two approaches let us entirely eliminate standard headers.
Due to built-in language features, we can’t quite get rid of <code class="language-plaintext highlighter-rouge">std::initializer_list</code> and <code class="language-plaintext highlighter-rouge">std::coroutine_handle</code>, but the necessary definitions can be manually extracted into a header that does not pull in anything else.</p>
<h1 id="data-structures-and-reflection">Data Structures and Reflection</h1>
<p>Let’s discuss <em>rpp</em>’s primary data structures.
We will write <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">A</code> to denote generic type parameters and allocators, respectively.
The <a href="#allocators-regions-and-tracing">following section</a> will discuss allocators in more detail.</p>
<h3 id="pointers"><em>Pointers</em></h3>
<p>There are five types of pointers, each with different ownership semantics.
Except for raw pointers, all must point to a valid object or be <code class="language-plaintext highlighter-rouge">null</code>.</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">T*</code> is rarely used.
The referenced memory is never owned; there are no other guarantees.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">Ref<T></code> is nearly the same as a raw pointer but removes pointer arithmetic and adds <code class="language-plaintext highlighter-rouge">null</code> checks.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">Box<T, A></code> is a uniquely owning pointer, similar to <code class="language-plaintext highlighter-rouge">std::unique_ptr</code>.
It may be moved, but not copied, and frees its memory when destroyed.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">Rc<T, A></code> is a <em>non-atomically</em> reference-counted pointer.
The reference count is stored contiguously with the pointed-to object.
It may be moved or copied and frees its memory when the last reference is destroyed.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">Arc<T, A></code> is an <em>atomically</em> reference-counted <code class="language-plaintext highlighter-rouge">Rc</code>, similar to <code class="language-plaintext highlighter-rouge">std::shared_ptr</code>.</p>
</li>
</ul>
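<p>To make the <code class="language-plaintext highlighter-rouge">Rc</code> layout concrete, here is a minimal sketch of a non-atomic reference-counted pointer whose count is stored in the same allocation as the object. It is not <em>rpp</em>'s code: the names <code class="language-plaintext highlighter-rouge">Rc_Sketch</code>, <code class="language-plaintext highlighter-rouge">make</code>, <code class="language-plaintext highlighter-rouge">dup</code>, and <code class="language-plaintext highlighter-rouge">references</code> are hypothetical, it uses plain <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code> instead of an allocator parameter, and it includes standard headers for brevity.</p>

```cpp
#include <cassert>
#include <utility> // std::forward

// Hypothetical sketch, not rpp's implementation.
template<typename T>
struct Rc_Sketch {
    Rc_Sketch() = default;

    // One allocation holds both the object and its reference count.
    template<typename... Args>
    static Rc_Sketch make(Args&&... args) {
        Rc_Sketch ret;
        ret.block = new Block{T{std::forward<Args>(args)...}, 1};
        return ret;
    }

    // Sharing is spelled out at the call site (an assumption of this sketch).
    Rc_Sketch dup() const {
        Rc_Sketch ret;
        ret.block = block;
        if(block) block->count++; // plain increment: not thread-safe
        return ret;
    }

    Rc_Sketch(const Rc_Sketch&) = delete;
    Rc_Sketch& operator=(const Rc_Sketch&) = delete;
    Rc_Sketch(Rc_Sketch&& src) : block(src.block) { src.block = nullptr; }
    Rc_Sketch& operator=(Rc_Sketch&& src) {
        if(this != &src) {
            release();
            block = src.block;
            src.block = nullptr;
        }
        return *this;
    }
    ~Rc_Sketch() { release(); }

    T& operator*() const {
        assert(block); // defensive null check
        return block->value;
    }
    unsigned long long references() const { return block ? block->count : 0; }

private:
    struct Block {
        T value;
        unsigned long long count;
    };
    Block* block = nullptr;

    // Frees the allocation when the last reference goes away.
    void release() {
        if(block && --block->count == 0) delete block;
        block = nullptr;
    }
};

inline int rc_demo() {
    Rc_Sketch<int> a = Rc_Sketch<int>::make(41);
    Rc_Sketch<int> b = a.dup();
    assert(a.references() == 2);
    *b += 1; // both handles refer to the same object
    return *a;
}
```

<p>An atomically counted variant would differ mainly in how the count is incremented and decremented.</p>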
<h3 id="strings"><em>Strings</em></h3>
<p><code class="language-plaintext highlighter-rouge">String<A></code> contains a pointer, a length, and a capacity.
Strings are not null-terminated and may contain null bytes.
There is no “small-string” optimization, so creating a non-empty string always allocates.<sup id="fnref:smallstring" role="doc-noteref"><a href="#fn:smallstring" class="footnote" rel="footnote">3</a></sup></p>
<p><code class="language-plaintext highlighter-rouge">String_View</code> provides a non-owning immutable view of a string, similar to <code class="language-plaintext highlighter-rouge">std::string_view</code>.
Most non-destructive string operations are implemented on <code class="language-plaintext highlighter-rouge">String_View</code>.
String literals may be converted to <code class="language-plaintext highlighter-rouge">String_View</code>s using the <code class="language-plaintext highlighter-rouge">_v</code> literal suffix.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">String_View</span> <span class="n">message</span> <span class="o">=</span> <span class="s">"foo"</span><span class="n">_v</span><span class="p">;</span>
<span class="n">String</span><span class="o"><></span> <span class="n">copy</span> <span class="o">=</span> <span class="n">message</span><span class="p">.</span><span class="n">string</span><span class="o"><></span><span class="p">();</span>
<span class="k">if</span><span class="p">(</span><span class="n">copy</span><span class="p">.</span><span class="n">view</span><span class="p">()</span> <span class="o">==</span> <span class="s">"bar"</span><span class="n">_v</span><span class="p">)</span> <span class="c1">// ...</span>
</code></pre></div></div>
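<p>A literal suffix like <code class="language-plaintext highlighter-rouge">_v</code> can be built with a user-defined literal operator. The sketch below shows one way this might look; <code class="language-plaintext highlighter-rouge">View_Sketch</code> and its fields are hypothetical stand-ins, not <em>rpp</em>'s actual definitions.</p>

```cpp
#include <cassert>
#include <cstddef> // std::size_t

// Hypothetical sketch: a non-owning view stores a pointer and a length,
// so it needs no null terminator and may contain null bytes.
struct View_Sketch {
    const char* data = nullptr;
    std::size_t length = 0;

    friend constexpr bool operator==(View_Sketch a, View_Sketch b) {
        if(a.length != b.length) return false;
        for(std::size_t i = 0; i < a.length; i++) {
            if(a.data[i] != b.data[i]) return false;
        }
        return true;
    }
};

// The compiler passes the literal's length, excluding the terminator.
constexpr View_Sketch operator""_v(const char* c_string, std::size_t length) {
    return View_Sketch{c_string, length};
}

// Views of literals can be compared at compile time.
static_assert("foo"_v == "foo"_v);
static_assert(!("foo"_v == "bar"_v));
static_assert(("a\0b"_v).length == 3); // embedded null bytes are counted
```

<p>Because the length comes from the compiler, constructing a view from a literal never has to scan for a terminator.</p>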
<h3 id="arrays"><em>Arrays</em></h3>
<p><code class="language-plaintext highlighter-rouge">Array<T, N></code> is a fixed-size array of length <code class="language-plaintext highlighter-rouge">N</code>.
It’s essentially an <em>rpp</em>-compatible <code class="language-plaintext highlighter-rouge">std::array</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Array</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="mi">3</span><span class="o">></span> <span class="n">int_array</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">};</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Vec<T, A></code> is a dynamically-sized array, similar to <code class="language-plaintext highlighter-rouge">std::vector</code>.
There is no “small-vector” optimization, so creating a non-empty vector always allocates.<sup id="fnref:smallvector" role="doc-noteref"><a href="#fn:smallvector" class="footnote" rel="footnote">4</a></sup></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">int_vec</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">};</span>
<span class="n">int_vec</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span>
<span class="n">int_vec</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Slice<T></code> is a non-owning immutable view of a contiguous array, similar to <code class="language-plaintext highlighter-rouge">std::span</code>.
A <code class="language-plaintext highlighter-rouge">Slice</code> may be created from an <code class="language-plaintext highlighter-rouge">Array</code>, a <code class="language-plaintext highlighter-rouge">Vec</code>, a <code class="language-plaintext highlighter-rouge">std::initializer_list</code>, or a raw pointer & length.</p>
<p>Non-destructive functions on arrays take <code class="language-plaintext highlighter-rouge">Slice</code> parameters, which abstract over the underlying storage.
This erases (e.g.) the size of the <code class="language-plaintext highlighter-rouge">Array</code> or the allocator of the <code class="language-plaintext highlighter-rouge">Vec</code>, reducing template bloat.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">print_slice</span><span class="p">(</span><span class="n">Slice</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">slice</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">i32</span> <span class="n">i</span> <span class="o">:</span> <span class="n">slice</span><span class="p">)</span> <span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
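<p>The following sketch illustrates how a slice can erase the size of its source. The types <code class="language-plaintext highlighter-rouge">Array_Sketch</code> and <code class="language-plaintext highlighter-rouge">Slice_Sketch</code> are hypothetical simplifications, and making the constructors explicit is an assumption of this sketch rather than a statement about <em>rpp</em>.</p>

```cpp
#include <cassert>
#include <cstddef> // std::size_t

// Hypothetical fixed-size array, standing in for Array.
template<typename T, std::size_t N>
struct Array_Sketch {
    T data[N];
};

// Hypothetical non-owning immutable view: just a pointer and a length.
template<typename T>
struct Slice_Sketch {
    const T* data = nullptr;
    std::size_t length = 0;

    explicit Slice_Sketch(const T* data_, std::size_t length_)
        : data(data_), length(length_) {}

    // N is a parameter of this constructor only, so the array's size
    // does not become part of the slice's type.
    template<std::size_t N>
    explicit Slice_Sketch(const Array_Sketch<T, N>& array)
        : data(array.data), length(N) {}

    const T* begin() const { return data; }
    const T* end() const { return data + length; }
};

// Compiled once for int elements, regardless of where they are stored.
inline int sum(Slice_Sketch<int> slice) {
    int total = 0;
    for(int i : slice) total += i;
    return total;
}

inline int slice_demo() {
    Array_Sketch<int, 3> three{{1, 2, 3}};
    int raw[2] = {4, 5};
    return sum(Slice_Sketch<int>{three}) + sum(Slice_Sketch<int>{raw, 2});
}
```

<p>Here <code class="language-plaintext highlighter-rouge">sum</code> is an ordinary function, not a template over the array size, which is the source of the reduced template bloat.</p>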
<p><code class="language-plaintext highlighter-rouge">Stack<T, A></code>, <code class="language-plaintext highlighter-rouge">Queue<T, A></code>, and <code class="language-plaintext highlighter-rouge">Heap<T, A></code> are array-based containers with stack, queue, and heap semantics, respectively.
<code class="language-plaintext highlighter-rouge">Queue</code> is implemented using a growable circular buffer, and <code class="language-plaintext highlighter-rouge">Heap</code> is a flat binary min-heap with respect to the less-than operator.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Stack</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">int_stack</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="n">int_stack</span><span class="p">.</span><span class="n">top</span><span class="p">();</span> <span class="c1">// 3</span>
<span class="n">Queue</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">int_queue</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="n">int_queue</span><span class="p">.</span><span class="n">front</span><span class="p">();</span> <span class="c1">// 1</span>
<span class="n">Heap</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">int_heap</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="n">int_heap</span><span class="p">.</span><span class="n">top</span><span class="p">();</span> <span class="c1">// 1</span>
</code></pre></div></div>
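<p>For reference, a flat binary min-heap ordered by the less-than operator can be sketched as follows. This is a generic textbook version rather than <em>rpp</em>'s implementation: <code class="language-plaintext highlighter-rouge">Heap_Sketch</code> is a hypothetical name, and <code class="language-plaintext highlighter-rouge">std::vector</code> stands in for <code class="language-plaintext highlighter-rouge">Vec</code> to keep the example self-contained.</p>

```cpp
#include <cassert>
#include <cstddef> // std::size_t
#include <utility> // std::move, std::swap
#include <vector>  // stand-in for rpp's Vec

// Hypothetical sketch: the children of index i live at 2i+1 and 2i+2.
template<typename T>
struct Heap_Sketch {
    void push(T value) {
        data.push_back(std::move(value));
        std::size_t i = data.size() - 1;
        // Sift up while the new element is less than its parent.
        while(i > 0) {
            std::size_t parent = (i - 1) / 2;
            if(!(data[i] < data[parent])) break;
            std::swap(data[i], data[parent]);
            i = parent;
        }
    }

    const T& top() const {
        assert(!data.empty());
        return data[0];
    }

    void pop() {
        assert(!data.empty());
        if(data.size() > 1) data[0] = std::move(data.back());
        data.pop_back();
        // Sift down, swapping with the smaller child.
        std::size_t i = 0;
        for(;;) {
            std::size_t left = 2 * i + 1, right = 2 * i + 2, smallest = i;
            if(left < data.size() && data[left] < data[smallest]) smallest = left;
            if(right < data.size() && data[right] < data[smallest]) smallest = right;
            if(smallest == i) break;
            std::swap(data[i], data[smallest]);
            i = smallest;
        }
    }

    bool empty() const { return data.empty(); }

private:
    std::vector<T> data;
};

// Pops every element and records the order as decimal digits.
inline int heap_demo() {
    Heap_Sketch<int> heap;
    heap.push(3);
    heap.push(1);
    heap.push(2);
    int order = 0;
    while(!heap.empty()) {
        order = order * 10 + heap.top();
        heap.pop();
    }
    return order;
}
```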
<h3 id="hash-map"><em>Hash Map</em></h3>
<p><code class="language-plaintext highlighter-rouge">Map<K, V, A></code> is an open-addressed hash table with Robin Hood linear probing.
For further discussion of why this is a good choice, see <a href="/Hashtables"><em>Optimizing Open Addressing</em></a>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Map</span><span class="o"><</span><span class="n">String_View</span><span class="p">,</span> <span class="n">i32</span><span class="o">></span> <span class="n">int_map</span><span class="p">{</span><span class="n">Pair</span><span class="p">{</span><span class="s">"foo"</span><span class="n">_v</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="n">Pair</span><span class="p">{</span><span class="s">"bar"</span><span class="n">_v</span><span class="p">,</span> <span class="mi">1</span><span class="p">}};</span>
<span class="n">int_map</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="s">"baz"</span><span class="n">_v</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">int_map</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="s">"bar"</span><span class="n">_v</span><span class="p">);</span>
<span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="p">[</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">]</span> <span class="o">:</span> <span class="n">int_map</span><span class="p">)</span> <span class="n">info</span><span class="p">(</span><span class="s">"%: %"</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
</code></pre></div></div>
<p>The implementation is very similar to <em>Optimizing Open Addressing</em>, except that the table also stores key hashes.
This improves performance when hashing keys is expensive—particularly for strings.<sup id="fnref:hash" role="doc-noteref"><a href="#fn:hash" class="footnote" rel="footnote">5</a></sup></p>
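<p>The sketch below shows how cached hashes fit into Robin Hood linear probing. It is a simplified illustration under several assumptions: the table has a fixed power-of-two capacity and never grows, keys are <code class="language-plaintext highlighter-rouge">std::string</code>, the hash is FNV-1a, and all names (<code class="language-plaintext highlighter-rouge">Map_Sketch</code>, <code class="language-plaintext highlighter-rouge">Slot</code>, <code class="language-plaintext highlighter-rouge">distance</code>) are hypothetical. It is not the implementation from <em>rpp</em> or from <em>Optimizing Open Addressing</em>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// FNV-1a, used here only as a stand-in hash function.
inline std::uint64_t hash_string(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;
    for(unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

struct Map_Sketch {
    struct Slot {
        std::uint64_t hash = 0; // cached so the key is never re-hashed
        bool full = false;
        std::string key;
        int value = 0;
    };

    explicit Map_Sketch(std::size_t capacity) : slots(capacity) {
        assert(capacity && (capacity & (capacity - 1)) == 0); // power of two
    }

    void insert(std::string key, int value) {
        assert(count < slots.size()); // this sketch never grows
        Slot incoming{hash_string(key), true, std::move(key), value};
        std::size_t index = incoming.hash & mask();
        for(std::size_t dist = 0;; dist++) {
            Slot& slot = slots[index];
            if(!slot.full) {
                slot = std::move(incoming);
                count++;
                return;
            }
            if(slot.hash == incoming.hash && slot.key == incoming.key) {
                slot.value = incoming.value; // overwrite existing key
                return;
            }
            // Robin Hood: displace entries that sit closer to their ideal slot.
            std::size_t existing = distance(index);
            if(existing < dist) {
                std::swap(slot, incoming);
                dist = existing;
            }
            index = (index + 1) & mask();
        }
    }

    const int* find(const std::string& key) const {
        std::uint64_t hash = hash_string(key);
        std::size_t index = hash & mask();
        for(std::size_t dist = 0; dist < slots.size(); dist++) {
            const Slot& slot = slots[index];
            if(!slot.full || distance(index) < dist) return nullptr;
            // Cheap hash comparison first; keys are compared only on a match.
            if(slot.hash == hash && slot.key == key) return &slot.value;
            index = (index + 1) & mask();
        }
        return nullptr;
    }

private:
    std::size_t mask() const { return slots.size() - 1; }
    // Probe distance of the entry at index, computed from its cached hash.
    std::size_t distance(std::size_t index) const {
        return (index + slots.size() - (slots[index].hash & mask())) & mask();
    }

    std::vector<Slot> slots;
    std::size_t count = 0;
};

inline int map_demo() {
    Map_Sketch map{4};
    map.insert("foo", 0);
    map.insert("bar", 1);
    map.insert("baz", 2);
    map.insert("bar", 10); // overwrites the previous value
    return *map.find("foo") + *map.find("bar") + *map.find("baz");
}
```

<p>Note that <code class="language-plaintext highlighter-rouge">distance</code> reads the cached hash instead of re-hashing the stored string. A growable table would reuse the cached hashes in the same way when re-inserting entries.</p>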
<h3 id="optional"><em>Optional</em></h3>
<p><code class="language-plaintext highlighter-rouge">Opt<T></code> is either empty or contains a value of type <code class="language-plaintext highlighter-rouge">T</code>.
It is not a pointer; it contains a boolean flag and storage for a <code class="language-plaintext highlighter-rouge">T</code>.
Optionals are used for error handling, as <em>rpp</em> does not use exceptions.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Opt</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">int_opt</span><span class="p">{</span><span class="mi">1</span><span class="p">};</span>
<span class="k">if</span><span class="p">(</span><span class="n">int_opt</span><span class="p">)</span> <span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="o">*</span><span class="n">int_opt</span><span class="p">);</span>
</code></pre></div></div>
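<p>The flag-plus-storage layout can be sketched as follows. <code class="language-plaintext highlighter-rouge">Opt_Sketch</code> and <code class="language-plaintext highlighter-rouge">checked_divide</code> are hypothetical names, and the sketch omits moves and copies that a real optional would need.</p>

```cpp
#include <cassert>
#include <new>     // placement new, std::launder
#include <utility> // std::move

// Hypothetical sketch: a boolean flag plus in-place storage for one T.
template<typename T>
struct Opt_Sketch {
    Opt_Sketch() = default;
    explicit Opt_Sketch(T value) : full(true) {
        new(storage) T(std::move(value));
    }
    ~Opt_Sketch() {
        if(full) get().~T();
    }
    Opt_Sketch(const Opt_Sketch&) = delete;
    Opt_Sketch& operator=(const Opt_Sketch&) = delete;

    explicit operator bool() const { return full; }
    T& operator*() {
        assert(full); // defensive check instead of an exception
        return get();
    }

private:
    T& get() { return *std::launder(reinterpret_cast<T*>(storage)); }

    alignas(T) unsigned char storage[sizeof(T)];
    bool full = false;
};

// Error handling without exceptions: an empty result signals failure.
inline Opt_Sketch<int> checked_divide(int a, int b) {
    if(b == 0) return Opt_Sketch<int>{};
    return Opt_Sketch<int>{a / b};
}
```

<p>The caller must test the result before dereferencing it, which keeps the failure path visible in the source.</p>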
<p>There is currently no <code class="language-plaintext highlighter-rouge">Result<T, E></code> type, but I plan to add one.</p>
<h3 id="tuples"><em>Tuples</em></h3>
<p><code class="language-plaintext highlighter-rouge">Pair<A, B></code> contains two values of types <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>.
It supports structured bindings and is used in the implementation of <code class="language-plaintext highlighter-rouge">Map</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Pair</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="n">f32</span><span class="o">></span> <span class="n">pair</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">};</span>
<span class="n">i32</span> <span class="n">one</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">first</span><span class="p">;</span>
<span class="k">auto</span> <span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">]</span> <span class="o">=</span> <span class="n">pair</span><span class="p">;</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Tuple<A, B, C, ...></code> is a heterogeneous product type, which may be accessed via (static) indexing or structured binding.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Tuple</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="n">f32</span><span class="p">,</span> <span class="n">String_View</span><span class="o">></span> <span class="n">product</span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">,</span> <span class="s">"foo"</span><span class="n">_v</span><span class="p">};</span>
<span class="n">i32</span> <span class="n">one</span> <span class="o">=</span> <span class="n">product</span><span class="p">.</span><span class="n">get</span><span class="o"><</span><span class="mi">0</span><span class="o">></span><span class="p">();</span>
<span class="k">auto</span> <span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">product</span><span class="p">;</span>
</code></pre></div></div>
<p>An N-element tuple is represented by a value and an N-1-element tuple.
At runtime, this layout is equivalent to a single struct with N fields.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span> <span class="n">A</span> <span class="n">first</span><span class="p">;</span> <span class="k">struct</span> <span class="p">{</span> <span class="n">B</span> <span class="n">first</span><span class="p">;</span> <span class="k">struct</span> <span class="p">{</span> <span class="n">C</span> <span class="n">first</span><span class="p">;</span> <span class="k">struct</span> <span class="p">{}}}}</span>
</code></pre></div></div>
<p>The linear structure makes compile-time tuple operations occur in linear time—but I haven’t found this to be an issue in practice.
Structuring tuples as binary trees would accelerate compilation, but the code would be more complex.</p>
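<p>A minimal sketch of the recursive representation, with static indexing implemented by peeling off one element per step. The name <code class="language-plaintext highlighter-rouge">Tuple_Sketch</code> and the member <code class="language-plaintext highlighter-rouge">rest</code> are hypothetical; only <code class="language-plaintext highlighter-rouge">first</code> and <code class="language-plaintext highlighter-rouge">get</code> appear in the examples above.</p>

```cpp
#include <cassert>
#include <cstddef> // std::size_t

template<typename... Ts>
struct Tuple_Sketch;

// The empty tuple terminates the recursion.
template<>
struct Tuple_Sketch<> {};

// An N-element tuple is a value followed by an (N-1)-element tuple.
template<typename T, typename... Ts>
struct Tuple_Sketch<T, Ts...> {
    T first;
    Tuple_Sketch<Ts...> rest;

    // Each step strips one element, so get<I> instantiates I + 1 functions.
    template<std::size_t I>
    auto& get() {
        static_assert(I <= sizeof...(Ts), "index out of range");
        if constexpr(I == 0) {
            return first;
        } else {
            return rest.template get<I - 1>();
        }
    }
};

inline int tuple_demo() {
    Tuple_Sketch<int, float, char> product{1, {2.0f, {'c', {}}}};
    product.get<0>() += 41;
    bool tail_ok = product.get<2>() == 'c';
    return tail_ok ? product.get<0>() : -1;
}
```

<p>The chain of <code class="language-plaintext highlighter-rouge">get</code> instantiations is where the linear compile-time cost comes from.</p>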
<h3 id="variant"><em>Variant</em></h3>
<p><code class="language-plaintext highlighter-rouge">Variant<A, B, C, ...></code> is a heterogeneous sum type.
It is superficially similar to <code class="language-plaintext highlighter-rouge">std::variant</code>, but has several important differences:</p>
<ul>
<li>It is never default constructible, so there is no <code class="language-plaintext highlighter-rouge">std::monostate</code>.</li>
<li>It cannot be accessed by index or by alternative, so there is no <code class="language-plaintext highlighter-rouge">std::bad_variant_access</code>.</li>
<li><em>rpp</em> does not use exceptions, so there is no <code class="language-plaintext highlighter-rouge">valueless_by_exception</code>.</li>
<li>The alternatives do not have to be destructible.</li>
<li>The alternatives must be distinct.</li>
</ul>
<p>Variants are accessed via (scuffed) pattern matching: the <code class="language-plaintext highlighter-rouge">match</code> method invokes a callable object overloaded for each alternative.
The <code class="language-plaintext highlighter-rouge">Overload</code> helper lets the user write a list of (potentially polymorphic) lambdas.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Variant</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="n">f32</span><span class="p">,</span> <span class="n">String_View</span><span class="o">></span> <span class="n">sum</span><span class="p">{</span><span class="mi">1</span><span class="p">};</span>
<span class="n">sum</span> <span class="o">=</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">;</span>
<span class="n">sum</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">Overload</span><span class="p">{</span>
<span class="p">[](</span><span class="n">i32</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"Int %"</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="p">},</span>
<span class="p">[](</span><span class="n">f32</span> <span class="n">f</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"Float %"</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="p">},</span>
<span class="p">[](</span><span class="k">auto</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"Other"</span><span class="p">);</span> <span class="p">},</span>
<span class="p">});</span>
</code></pre></div></div>
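<p>A helper like <code class="language-plaintext highlighter-rouge">Overload</code> is commonly written by inheriting from every lambda and pulling in each call operator. The sketch below shows that well-known idiom under the hypothetical name <code class="language-plaintext highlighter-rouge">Overload_Sketch</code>; whether <em>rpp</em> defines it exactly this way is an assumption.</p>

```cpp
#include <cassert>

// Inherit from every callable and expose all of their call operators,
// so ordinary overload resolution selects the matching one.
template<typename... Fs>
struct Overload_Sketch : Fs... {
    using Fs::operator()...;
};

// Deduction guide, spelled out so the sketch does not rely on
// class template argument deduction for aggregates.
template<typename... Fs>
Overload_Sketch(Fs...) -> Overload_Sketch<Fs...>;

// Returns 0 for int, 1 for float, and 2 for anything else.
template<typename T>
int describe(T value) {
    return Overload_Sketch{
        [](int) { return 0; },
        [](float) { return 1; },
        [](auto) { return 2; },
    }(value);
}
```

<p>The generic lambda acts as a fallback case: it is chosen only when no non-generic lambda matches the argument type exactly.</p>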
<p>Requiring distinct types means a variant can’t represent multiple alternatives of the same type.
We can work around this restriction using the <code class="language-plaintext highlighter-rouge">Named</code> helper, but the syntax is clunky.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Variant</span><span class="o"><</span><span class="n">Named</span><span class="o"><</span><span class="s">"X"</span><span class="p">,</span> <span class="n">i32</span><span class="o">></span><span class="p">,</span> <span class="n">Named</span><span class="o"><</span><span class="s">"Y"</span><span class="p">,</span> <span class="n">i32</span><span class="o">>></span> <span class="n">sum</span><span class="p">{</span><span class="n">Named</span><span class="o"><</span><span class="s">"X"</span><span class="p">,</span> <span class="n">i32</span><span class="o">></span><span class="p">{</span><span class="mi">1</span><span class="p">}};</span>
<span class="n">sum</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">Overload</span><span class="p">{</span>
<span class="p">[](</span><span class="n">Named</span><span class="o"><</span><span class="s">"X"</span><span class="p">,</span> <span class="n">i32</span><span class="o">></span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">value</span><span class="p">);</span> <span class="p">},</span>
<span class="p">[](</span><span class="n">Named</span><span class="o"><</span><span class="s">"Y"</span><span class="p">,</span> <span class="n">i32</span><span class="o">></span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">value</span><span class="p">);</span> <span class="p">},</span>
<span class="p">});</span>
</code></pre></div></div>
<p>Like optionals, variants are not pointers: they contain an index and sufficient storage for the largest alternative.
However, variants also need a way to map the runtime index to a compile-time type.
With at most eight alternatives, a variant simply checks the index with a chain of <code class="language-plaintext highlighter-rouge">if</code> statements.
Otherwise, it indexes a static table of function pointers, which is generated at compile time.
This introduces an indirect call, but is still relatively fast.</p>
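<p>The table-based path can be sketched like this. The example is a simplification with hypothetical names (<code class="language-plaintext highlighter-rouge">Variant_Sketch</code>, <code class="language-plaintext highlighter-rouge">invoke</code>, <code class="language-plaintext highlighter-rouge">index_of</code>, <code class="language-plaintext highlighter-rouge">max_size_of</code>), it only supports trivially destructible alternatives, and it always uses the table rather than switching to an <code class="language-plaintext highlighter-rouge">if</code> chain for small variants.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <new>         // placement new, std::launder
#include <type_traits> // used only to restrict and index the alternatives

// Size of the largest alternative.
template<typename... Ts>
constexpr std::size_t max_size_of() {
    std::size_t sizes[] = {sizeof(Ts)...};
    std::size_t max = 0;
    for(std::size_t size : sizes) {
        if(size > max) max = size;
    }
    return max;
}

// Position of T within Ts, or sizeof...(Ts) if absent.
template<typename T, typename... Ts>
constexpr std::size_t index_of() {
    bool matches[] = {std::is_same_v<T, Ts>...};
    for(std::size_t i = 0; i < sizeof...(Ts); i++) {
        if(matches[i]) return i;
    }
    return sizeof...(Ts);
}

template<typename... Ts>
struct Variant_Sketch {
    static_assert((std::is_trivially_destructible_v<Ts> && ...),
                  "this sketch only handles trivial alternatives");

    template<typename T>
    explicit Variant_Sketch(T value) : index(index_of<T, Ts...>()) {
        static_assert(index_of<T, Ts...>() < sizeof...(Ts), "not an alternative");
        new(storage) T(value);
    }

    template<typename F>
    void match(F f) {
        using Case = void (*)(unsigned char*, F&);
        // One entry per alternative, generated at compile time.
        static constexpr Case table[] = {&invoke<Ts, F>...};
        table[index](storage, f); // indirect call through the runtime index
    }

private:
    // Reinterprets the storage as T and calls the matching overload.
    template<typename T, typename F>
    static void invoke(unsigned char* data, F& f) {
        f(*std::launder(reinterpret_cast<T*>(data)));
    }

    alignas(Ts...) unsigned char storage[max_size_of<Ts...>()];
    std::size_t index;
};

// A hand-written overload set, standing in for Overload{...}.
struct Describe {
    int* out;
    void operator()(int) { *out = 0; }
    void operator()(float) { *out = 1; }
};

inline int variant_demo() {
    int which = -1;
    Variant_Sketch<int, float> sum{2.0f};
    sum.match(Describe{&which});
    return which;
}
```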
<h3 id="function"><em>Function</em></h3>
<p><code class="language-plaintext highlighter-rouge">Function<R(A, B, C...)></code> is a type-erased closure.
It must be provided parameters of types <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, <code class="language-plaintext highlighter-rouge">C</code>, etc., and returns a value of type <code class="language-plaintext highlighter-rouge">R</code>.
<code class="language-plaintext highlighter-rouge">Function</code> is similar to <code class="language-plaintext highlighter-rouge">std::function</code>, but has a few important differences:</p>
<ul>
<li>It is not default constructible, so there is no <code class="language-plaintext highlighter-rouge">std::bad_function_call</code>.</li>
<li>It is not copyable (like <code class="language-plaintext highlighter-rouge">std::move_only_function</code>).</li>
<li>The “small function” optimization is required: constructing a function never allocates.</li>
</ul>
<p>Functions can be constructed from any callable object with the proper type—in practice, usually a lambda.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Function</span><span class="o"><</span><span class="kt">void</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="o">></span> <span class="n">print_int</span> <span class="o">=</span> <span class="p">[](</span><span class="n">i32</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="p">};</span>
</code></pre></div></div>
<p>By default, a <code class="language-plaintext highlighter-rouge">Function</code> only has 4 words of storage, so there’s also <code class="language-plaintext highlighter-rouge">FunctionN<N, R(A, B, C...)></code>, which has <code class="language-plaintext highlighter-rouge">N</code> words of storage.
In-place storage implies that every function in a homogeneous collection must be sized for the largest callable it might hold.
If this becomes a problem, you should typically defunctionalize or otherwise restructure your code.</p>
<p>At runtime, functions contain an explicit vtable pointer and storage for the callable object.
The vtable is a static array of function pointers generated at compile time.
Calling, moving, or destroying the function requires an indirect call—where possible, it’s better to template usage code on the underlying function type.</p>
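<p>To make the layout concrete, here is a minimal sketch of a move-only, never-allocating closure with an explicit vtable. It is an assumption-laden illustration, not <em>rpp</em>'s <code class="language-plaintext highlighter-rouge">Function</code>: the name <code class="language-plaintext highlighter-rouge">SmallFn</code>, the fixed four-word buffer, and the three vtable entries are all hypothetical.</p>

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template<typename Sig>
class SmallFn;

template<typename R, typename... Args>
class SmallFn<R(Args...)> {
    // One static table of function pointers per stored callable type.
    struct VTable {
        R (*call)(void*, Args...);
        void (*move)(void* dst, void* src);
        void (*destroy)(void*);
    };
    template<typename F>
    static R call_impl(void* self, Args... args) {
        return (*static_cast<F*>(self))(std::forward<Args>(args)...);
    }
    template<typename F>
    static void move_impl(void* dst, void* src) {
        new(dst) F(std::move(*static_cast<F*>(src)));
    }
    template<typename F>
    static void destroy_impl(void* self) {
        static_cast<F*>(self)->~F();
    }
    template<typename F>
    static constexpr VTable vtable_for = {&call_impl<F>, &move_impl<F>, &destroy_impl<F>};

    alignas(void*) unsigned char storage_[4 * sizeof(void*)]; // four words, in place
    const VTable* vtable_;

public:
    // No default constructor: a SmallFn always holds a callable.
    template<typename F, typename = std::enable_if_t<!std::is_same_v<std::decay_t<F>, SmallFn>>>
    SmallFn(F&& f) {
        using D = std::decay_t<F>;
        static_assert(sizeof(D) <= sizeof(storage_), "callable does not fit in place");
        static_assert(alignof(D) <= alignof(void*), "callable is over-aligned");
        new(storage_) D(std::forward<F>(f));
        vtable_ = &vtable_for<D>;
    }
    SmallFn(SmallFn&& other) : vtable_(other.vtable_) {
        vtable_->move(storage_, other.storage_);
    }
    SmallFn(const SmallFn&) = delete;
    SmallFn& operator=(const SmallFn&) = delete;
    ~SmallFn() {
        vtable_->destroy(storage_);
    }

    // Every call goes through the vtable, i.e. one indirect call.
    R operator()(Args... args) {
        return vtable_->call(storage_, std::forward<Args>(args)...);
    }
};
```

<p>A callable that is too large fails to compile instead of silently allocating, which is the trade-off the fixed-size storage buys.</p>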
<h2 id="explicitness">Explicitness</h2>
<p>In <em>rpp</em>, all non-trivial operations should be visible in the source code.
This is not generally the case in mainstream C++: most types can be implicitly copied, single-argument constructors introduce implicit conversions, and many operations conditionally allocate memory.</p>
<h3 id="copies"><em>Copies</em></h3>
<p>Unlike the STL, <em>rpp</em> data structures cannot be implicitly copied.
To duplicate (most) non-trivial types, you must explicitly call their <code class="language-plaintext highlighter-rouge">clone</code> method.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">foo</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">};</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">bar</span> <span class="o">=</span> <span class="n">foo</span><span class="p">.</span><span class="n">clone</span><span class="p">();</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">baz</span> <span class="o">=</span> <span class="n">move</span><span class="p">(</span><span class="n">foo</span><span class="p">);</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Clone</code> concept expresses this capability, allowing generic code to clone supported types.
Additionally, concepts are used to pick the most efficient implementation of <code class="language-plaintext highlighter-rouge">clone</code>.
For example, “trivially copyable” types can be copied with <code class="language-plaintext highlighter-rouge">memcpy</code>, as seen in <code class="language-plaintext highlighter-rouge">Array</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Array</span> <span class="nf">clone</span><span class="p">()</span> <span class="k">const</span>
<span class="n">requires</span><span class="p">(</span><span class="n">Clone</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="o">||</span> <span class="n">Copy_Constructable</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="n">Array</span> <span class="n">result</span><span class="p">;</span>
<span class="k">if</span> <span class="k">constexpr</span><span class="p">(</span><span class="n">Trivially_Copyable</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="n">Libc</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">data_</span><span class="p">,</span> <span class="n">data_</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">*</span> <span class="n">N</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="k">constexpr</span><span class="p">(</span><span class="n">Clone</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">u64</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">result</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">clone</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">u64</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">result</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">T</span><span class="p">{</span><span class="n">data_</span><span class="p">[</span><span class="n">i</span><span class="p">]};</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="constructors"><em>Constructors</em></h3>
<p>Nearly all constructors are marked <code class="language-plaintext highlighter-rouge">explicit</code>, so they do not introduce implicit conversions.
I think this is the correct tradeoff, but it has a downside: type inference for template parameters is quite limited.
That means we often have to annotate otherwise-obvious types.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">make_vec</span><span class="p">(</span><span class="n">i32</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span><span class="p">{</span><span class="n">i</span><span class="p">};</span> <span class="c1">// Have to write the "<i32>"</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="reflection">Reflection</h2>
<p>Compared to previous versions, C++20 has much better native support for compile-time computation.
Hence, <em>rpp</em> does not provide many metaprogramming tools beyond using <code class="language-plaintext highlighter-rouge">constexpr</code>, <code class="language-plaintext highlighter-rouge">consteval</code>, etc. where possible.
However, it does include a reflection system, which is used to implement a generic <code class="language-plaintext highlighter-rouge">printf</code>.</p>
<p>To make a type reflectable, the user creates a specialization of the struct <code class="language-plaintext highlighter-rouge">Reflect::Refl</code>.
This may be done manually, or with a helper macro. For example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Data</span> <span class="p">{</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">ints</span><span class="p">;</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">f32</span><span class="o">></span> <span class="n">floats</span><span class="p">;</span>
<span class="p">};</span>
<span class="n">RPP_RECORD</span><span class="p">(</span><span class="n">Data</span><span class="p">,</span> <span class="n">RPP_FIELD</span><span class="p">(</span><span class="n">ints</span><span class="p">),</span> <span class="n">RPP_FIELD</span><span class="p">(</span><span class="n">floats</span><span class="p">));</span>
</code></pre></div></div>
<p>By default, <em>rpp</em> provides specializations for all built-in types and <em>rpp</em> data structures.
Eventually, we’ll get proper compiler support for reflection, but it’s still years away from being widely available.<sup id="fnref:metaprogram" role="doc-noteref"><a href="#fn:metaprogram" class="footnote" rel="footnote">6</a></sup></p>
<p>The generated reflection data enables operating on types at compile time.
For example, we can iterate the fields of an arbitrary record:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Print_Field</span> <span class="p">{</span>
<span class="k">template</span><span class="o"><</span><span class="n">Reflectable</span> <span class="n">T</span><span class="p">></span>
<span class="kt">void</span> <span class="n">apply</span><span class="p">(</span><span class="k">const</span> <span class="n">Literal</span><span class="o">&</span> <span class="n">field_name</span><span class="p">,</span> <span class="k">const</span> <span class="n">T</span><span class="o">&</span> <span class="n">field_value</span><span class="p">)</span> <span class="p">{</span>
<span class="n">info</span><span class="p">(</span><span class="s">"%: %"</span><span class="p">,</span> <span class="n">field_name</span><span class="p">,</span> <span class="n">field_value</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="n">Data</span> <span class="n">data</span><span class="p">;</span>
<span class="n">iterate_record</span><span class="p">(</span><span class="n">Print_Field</span><span class="p">{},</span> <span class="n">data</span><span class="p">);</span>
</code></pre></div></div>
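<p>To show the general mechanism without relying on <em>rpp</em>'s macros, here is a minimal hand-written sketch. The <code class="language-plaintext highlighter-rouge">Refl</code> template, its <code class="language-plaintext highlighter-rouge">iterate</code> member, and <code class="language-plaintext highlighter-rouge">describe</code> are hypothetical stand-ins; the real specialization generated by <code class="language-plaintext highlighter-rouge">RPP_RECORD</code> may look quite different.</p>

```cpp
#include <string>

// Hypothetical stand-in for a reflection trait: the primary template is left
// undefined, and each reflectable type provides a specialization.
template<typename T>
struct Refl;

struct Sample {
    int id;
    long count;
};

template<>
struct Refl<Sample> {
    static constexpr const char* name = "Sample";
    // Calls f(field_name, field_value) once per field, in declaration order.
    template<typename F>
    static void iterate(const Sample& s, F&& f) {
        f("id", s.id);
        f("count", s.count);
    }
};

// Generic code: works for any type with a Refl specialization whose fields
// are accepted by std::to_string.
template<typename T>
std::string describe(const T& value) {
    std::string out = std::string(Refl<T>::name) + "{";
    bool first = true;
    Refl<T>::iterate(value, [&](const char* field, const auto& v) {
        if(!first) out += ", ";
        first = false;
        out += std::string(field) + " : " + std::to_string(v);
    });
    return out + "}";
}
```

<p>The generic lambda is instantiated once per field type, so each field is handled with its static type and no runtime type information is involved.</p>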
<p>The reflection API includes various other low-level utilities, but it’s largely up to the user to build higher-level abstractions.
One example is included: a generic printing system, which is used in the logging macros we’ve seen so far.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Data</span> <span class="n">data</span><span class="p">{</span><span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="o">></span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">},</span> <span class="n">Vec</span><span class="o"><</span><span class="n">f32</span><span class="o">></span><span class="p">{</span><span class="mf">1.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">,</span> <span class="mf">3.0</span><span class="n">f</span><span class="p">}};</span>
<span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
</code></pre></div></div>
<p>By default, formatting a record will print each field.
Formatting behavior can be customized by specializing <code class="language-plaintext highlighter-rouge">Format::Write</code> / <code class="language-plaintext highlighter-rouge">Format::Measure</code>, which are provided for all built-in data structures.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Data{ints : Vec[1, 2, 3], floats : Vec[1.000000, 2.000000, 3.000000]}
</code></pre></div></div>
<p>The implementation is surprisingly concise: it’s essentially an <code class="language-plaintext highlighter-rouge">if constexpr</code> branch for each primitive type, plus iterating records/enums.
It’s also easy to extend: for my renderer, I wrote a similar system that generates <a href="https://github.com/ocornut/imgui">Dear ImGui</a> layouts for arbitrary types.</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/imgui.png" /></p>
<h1 id="allocators-regions-and-tracing">Allocators, Regions, and Tracing</h1>
<p>In <em>rpp</em>, all data structures are statically associated with an <em>allocator</em>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Mdefault</span><span class="o">></span> <span class="n">vector</span><span class="p">;</span>
</code></pre></div></div>
<p>This approach appears similar to the STL, which also parameterizes data structures over allocators, but is fundamentally different.
An <em>rpp</em> allocator is purely a type, not a value, so it is fully determined at compile time.
That means the <em>specific</em> allocator used by a data structure is part of its type.</p>
<p>By default, all allocations come from <code class="language-plaintext highlighter-rouge">Mdefault</code>, which is a thin wrapper around <code class="language-plaintext highlighter-rouge">malloc</code>/<code class="language-plaintext highlighter-rouge">free</code> that records allocation events.
It starts to get interesting once we allow the user to define new allocators:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Custom</span> <span class="o">=</span> <span class="n">Mallocator</span><span class="o"><</span><span class="s">"Custom"</span><span class="o">></span><span class="p">;</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Custom</span><span class="o">></span> <span class="n">vector</span><span class="p">;</span>
</code></pre></div></div>
<p>Here, <code class="language-plaintext highlighter-rouge">Custom</code> still just wraps <code class="language-plaintext highlighter-rouge">malloc</code>/<code class="language-plaintext highlighter-rouge">free</code>, but its allocation events are tracked under a different name.
Now, if we print allocation statistics:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Allocation stats for [Custom]:
Allocs: 3
Frees: 3
High water: 8192
Alloc size: 24576
Free size: 24576
</code></pre></div></div>
<p>Printing this data at program exit makes it easy to find memory leaks and provides a high-level overview of where your memory is going.</p>
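<p>A statically dispatched tracking allocator can be sketched in a few lines. The version below is a simplification under stated assumptions: it distinguishes allocators with a tag type rather than a string literal, it is not thread-safe, and <code class="language-plaintext highlighter-rouge">Stats</code>, <code class="language-plaintext highlighter-rouge">Tracked</code>, and <code class="language-plaintext highlighter-rouge">Custom_Tag</code> are hypothetical names.</p>

```cpp
#include <cstddef>
#include <cstdlib>

struct Stats {
    std::size_t allocs = 0;
    std::size_t frees = 0;
    std::size_t current = 0;    // bytes currently live
    std::size_t high_water = 0; // largest value 'current' has reached
};

// The allocator is a type with only static members, so a data structure
// parameterized on it needs no allocator pointer and no virtual dispatch.
template<typename Tag>
struct Tracked {
    static inline Stats stats;

    static void* alloc(std::size_t size) {
        stats.allocs++;
        stats.current += size;
        if(stats.current > stats.high_water) stats.high_water = stats.current;
        return std::malloc(size);
    }
    static void free(void* ptr, std::size_t size) {
        if(!ptr) return;
        stats.frees++;
        stats.current -= size;
        std::free(ptr);
    }
};

// Each distinct tag gets its own independent set of statistics.
struct Custom_Tag {};
using Custom = Tracked<Custom_Tag>;
```

<p>If <code class="language-plaintext highlighter-rouge">allocs</code> and <code class="language-plaintext highlighter-rouge">frees</code> differ at exit, something allocated through that tag was leaked.</p>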
<h2 id="regions">Regions</h2>
<p>Tracking heap allocations is useful, but it doesn’t obviously justify the added complexity.
More interestingly, <em>rpp</em> data structures also support stack allocation.
The <em>rpp</em> stack allocator is a global, thread-local, chunked buffer that is used to implement <em>regions</em>.
For further discussion of this design, see <a href="https://blog.janestreet.com/oxidizing-ocaml-locality/">Oxidizing OCaml</a>.</p>
<p>Passing the <code class="language-plaintext highlighter-rouge">Mregion</code> allocator to a data structure causes it to allocate from the current region.
Allocating region memory is almost free, as it simply bumps the stack pointer.
Freeing memory is a no-op.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Vec</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Mregion</span><span class="o"><</span><span class="n">R</span><span class="o">>></span> <span class="n">local_vector</span><span class="p">;</span>
</code></pre></div></div>
<p>Regions are denoted by the <code class="language-plaintext highlighter-rouge">Region</code> macro.
Each region lasts until the end of the following scope, at which point all associated allocations are freed.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Region</span><span class="p">(</span><span class="n">R</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// R starts here</span>
<span class="c1">// Allocate a Vec in R</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Mregion</span><span class="o"><</span><span class="n">R</span><span class="o">>></span> <span class="n">local_vector</span><span class="p">;</span>
<span class="n">local_vector</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">local_vector</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="p">}</span> <span class="c1">// R ends here</span>
</code></pre></div></div>
<p>Unfortunately, regions introduce a new class of memory bugs.
Region-local values must not escape their scope, and a region must not attempt to allocate memory associated with a parent.
In OCaml or Rust, these properties can be checked at compile time, but not so in C++.</p>
<p>Instead, <em>rpp</em> uses a runtime check to ensure region safety.
When allocating or freeing memory, the allocator checks that the current region has the same <em>brand</em> as the region associated with the allocation.
Brands are named by the user and passed as the template parameter of <code class="language-plaintext highlighter-rouge">Mregion</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Region</span><span class="p">(</span><span class="n">R0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Vec</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Mregion</span><span class="o"><</span><span class="n">R0</span><span class="o">>></span> <span class="n">local_vector</span><span class="p">;</span>
<span class="n">Region</span><span class="p">(</span><span class="n">R1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">local_vector</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// Region brand mismatch!</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Brands are compile-time constants derived from the region’s source location, so they do not introduce any runtime overhead beyond the check itself.</p>
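<p>The mechanics of a branded region can be sketched as follows. This is a simplified model, not <em>rpp</em>'s code: it uses one fixed-size buffer per thread instead of a chunked one, brands are hand-picked integers rather than values derived from the source location, and the <code class="language-plaintext highlighter-rouge">Scope</code> type and <code class="language-plaintext highlighter-rouge">matches</code> helper are hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>

namespace region {

constexpr std::size_t CAPACITY = 1 << 16;

// One stack per thread (fixed-size here; the article describes a chunked one).
alignas(16) thread_local unsigned char buffer[CAPACITY];
thread_local std::size_t top = 0;          // bump pointer into buffer
thread_local unsigned current_brand = 0;   // brand of the innermost live region

// RAII scope: entering records the stack top, leaving rewinds to it, which
// frees every allocation made inside the region at once.
template<unsigned Brand>
struct Scope {
    std::size_t saved_top = top;
    unsigned saved_brand = current_brand;
    Scope() { current_brand = Brand; }
    Scope(const Scope&) = delete;
    ~Scope() {
        top = saved_top;
        current_brand = saved_brand;
    }
};

template<unsigned Brand>
struct Mregion {
    // The runtime check: is this allocator's region the innermost one?
    static bool matches() { return current_brand == Brand; }

    static void* alloc(std::size_t size) {
        assert(matches() && "Region brand mismatch!");
        size = (size + 15) & ~std::size_t{15}; // keep allocations 16-byte aligned
        assert(top + size <= CAPACITY);
        void* result = buffer + top;
        top += size;
        return result;
    }
    // Freeing is a no-op apart from the brand check.
    static void free(void*) { assert(matches() && "Region brand mismatch!"); }
};

} // namespace region
```

<p>Allocating through <code class="language-plaintext highlighter-rouge">Mregion&lt;1&gt;</code> while a <code class="language-plaintext highlighter-rouge">Scope&lt;2&gt;</code> is innermost would trip the assertion, mirroring the mismatch shown above.</p>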
<p>Since C++17, the STL also supports allocator polymorphism—that is, <em>runtime</em> polymorphism.
For example, <code class="language-plaintext highlighter-rouge">std::pmr::monotonic_buffer_resource</code> can be used to implement regions, but it incurs an indirect call for every allocation.
Conversely, <em>rpp</em>’s regions are fully determined at compile time.
This design also sidesteps the problem of nested allocators, which is a major pain point in the STL.</p>
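<p>For comparison, this is roughly what the standard-library approach looks like. The function name <code class="language-plaintext highlighter-rouge">sum_with_local_buffer</code> is made up for the example; the <code class="language-plaintext highlighter-rouge">std::pmr</code> types are the standard C++17 ones.</p>

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Sums 0..n-1 using a vector whose storage is bump-allocated from a stack buffer.
inline int sum_with_local_buffer(int n) {
    std::byte buffer[1024];
    // Hands out memory from 'buffer' and falls back to the default upstream
    // resource once the buffer is exhausted.
    std::pmr::monotonic_buffer_resource resource{buffer, sizeof(buffer)};
    // The vector stores a pointer to the resource; each allocation is a
    // virtual call through std::pmr::memory_resource.
    std::pmr::vector<int> values(&resource);
    for(int i = 0; i < n; i++) values.push_back(i);
    int total = 0;
    for(int v : values) total += v;
    return total;
} // 'resource' releases everything it handed out when it is destroyed here
```

<p>Note that the resource is a runtime value stored inside the container, whereas an <em>rpp</em> allocator appears only in the container's type.</p>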
<h2 id="pools">Pools</h2>
<p>Most allocations are either very long-lived, so go on the heap, or very short-lived, so go on the stack.
However, it’s occasionally useful to have a middle ground, so <em>rpp</em> also provides <em>pool allocators</em>.</p>
<p>A pool is a simple free-list containing fixed-size blocks of memory.
When the pool is not empty, allocations are essentially free, as they simply pop a buffer off the list.
Freeing memory is also trivial, as it pushes the buffer back onto the list.</p>
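<p>A free-list pool of this kind can be sketched as below. It is a single-threaded illustration with hypothetical names (<code class="language-plaintext highlighter-rouge">Pool</code>, <code class="language-plaintext highlighter-rouge">Node</code>); it falls back to <code class="language-plaintext highlighter-rouge">malloc</code> when the list is empty and leaves out the locking a shared pool would need.</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

template<std::size_t Size>
class Pool {
    static_assert(Size >= sizeof(void*), "a free block must be able to hold a link");

    // A free block is reinterpreted as a list node pointing at the next free block.
    struct Node {
        Node* next;
    };
    Node* free_list_ = nullptr;

public:
    Pool() = default;
    Pool(const Pool&) = delete;
    Pool& operator=(const Pool&) = delete;
    ~Pool() {
        // Return all cached blocks to the system.
        while(free_list_) {
            Node* next = free_list_->next;
            std::free(free_list_);
            free_list_ = next;
        }
    }

    void* alloc() {
        if(!free_list_) return std::malloc(Size); // empty pool: get a fresh block
        Node* node = free_list_;                  // otherwise pop the list head
        free_list_ = node->next;
        return node;
    }

    void free(void* block) {
        // Push the block onto the list; no memory is returned to the system.
        free_list_ = new(block) Node{free_list_};
    }
};
```

<p>Because freed blocks are reused in last-in, first-out order, a block that was just released is the next one handed out.</p>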
<p>Passing <code class="language-plaintext highlighter-rouge">Mpool</code> to a data structure causes it to allocate from the properly sized pool.
Since pools are fixed-size, <code class="language-plaintext highlighter-rouge">Mpool</code> can only be used for “scalar” allocations, i.e. <code class="language-plaintext highlighter-rouge">Box</code>, <code class="language-plaintext highlighter-rouge">Rc</code>, and <code class="language-plaintext highlighter-rouge">Arc</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Box</span><span class="o"><</span><span class="n">u64</span><span class="p">,</span> <span class="n">Mpool</span><span class="o">></span> <span class="n">pooled_int</span><span class="p">{</span><span class="mi">1</span><span class="p">};</span>
</code></pre></div></div>
<p>Each size of pool gets its own statistics:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Allocation stats for [Pool<8>]:
Allocs: 2
Frees: 2
High water: 16
Alloc size: 16
Free size: 16
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Mpool</code> is similar to <code class="language-plaintext highlighter-rouge">std::pmr::synchronized_pool_resource</code>, but is much simpler, as it only supports one block size and does not itself coalesce allocations into chunks.
Currently, each pool is globally shared using a mutex.
This is not ideal, but may be improved by adding thread-local pools and a global fallback.</p>
<h2 id="polymorphism">Polymorphism</h2>
<p>Parameterizing data structures over allocators provides a lot of flexibility—but it makes all of their operations polymorphic, too.
Regions exacerbate the issue: for example, a function returning a locally-allocated <code class="language-plaintext highlighter-rouge">Vec</code> must be polymorphic over all regions.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o"><</span><span class="n">Region</span> <span class="n">R</span><span class="p">></span>
<span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="n">Mregion</span><span class="o"><</span><span class="n">R</span><span class="o">>></span> <span class="n">make_local_vec</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">Vec</span><span class="o"><</span><span class="n">i32</span><span class="p">,</span> <span class="n">Mregion</span><span class="o"><</span><span class="n">R</span><span class="o">>></span><span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The same friction is present in Rust, where references give rise to lifetime polymorphism.
(In fact, brands are a restricted form of lifetimes.)
The downside of this approach is code bloat.
Due to link-time deduplication, I haven’t run into any issues with binary size, but it could become a problem in larger projects.</p>
<h2 id="tracing">Tracing</h2>
<p>In addition to tracking allocator statistics, <em>rpp</em> records a list of allocation events for each frame.<sup id="fnref:frames" role="doc-noteref"><a href="#fn:frames" class="footnote" rel="footnote">7</a></sup>
This data makes it easy to target an average of zero heap allocations per frame—other than growing the stack allocator.</p>
<p>Furthermore, <em>rpp</em> includes a basic tracing profiler.
To record how long a block of code takes to execute, simply wrap it in <code class="language-plaintext highlighter-rouge">Trace</code>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Trace</span><span class="p">(</span><span class="s">"Trace Me"</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The profiler will compute how many times the block was entered, how much time was spent in the block, and how much time was spent in the block’s children.
The resulting timing tree can be traversed by user code, but <em>rpp</em> does not otherwise provide a UI.</p>
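<p>A block-scoped timer like this is typically built on RAII. The sketch below is one possible shape, not <em>rpp</em>'s profiler: it records only call counts and total time (no child times), it is not thread-safe, and <code class="language-plaintext highlighter-rouge">Scoped_Trace</code>, <code class="language-plaintext highlighter-rouge">trace_table</code>, and the <code class="language-plaintext highlighter-rouge">TRACE</code> macro are hypothetical. The macro relies on the C++17 <code class="language-plaintext highlighter-rouge">if</code> statement with initializer so that a braced block can follow it.</p>

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

struct Trace_Stats {
    std::uint64_t calls = 0;
    std::chrono::nanoseconds total{0};
};

// Global table of results, keyed by block name.
inline std::map<std::string, Trace_Stats>& trace_table() {
    static std::map<std::string, Trace_Stats> table;
    return table;
}

// Starts timing on construction and records the elapsed time on destruction.
class Scoped_Trace {
    using Clock = std::chrono::steady_clock;
    const char* name_;
    Clock::time_point start_ = Clock::now();

public:
    explicit Scoped_Trace(const char* name) : name_(name) {}
    Scoped_Trace(const Scoped_Trace&) = delete;
    ~Scoped_Trace() {
        Trace_Stats& stats = trace_table()[name_];
        stats.calls++;
        stats.total += std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start_);
    }
};

// The timer lives for the duration of the block that follows the macro.
#define TRACE(name) if(Scoped_Trace trace_scope_{name}; true)
```

<p>Tracking child time would additionally need a per-thread stack of active scopes, so that each scope can add its elapsed time to its parent's record.</p>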
<p>In my renderer, I’ve used <em>rpp</em>’s profile data to graph frame times (based on <a href="https://github.com/Raikiri/LegitProfiler">LegitProfiler</a>):</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:20px" src="/assets/rpp/profiler.png" /></p>
<p>Finally, note that tracing is disabled in release builds.</p>
<h1 id="coroutines">Coroutines</h1>
<p>Coroutines were a <a href="https://probablydance.com/2021/10/31/c-coroutines-do-not-spark-joy/">somewhat</a> <a href="https://www.jeremyong.com/c++/2023/12/24/cpp20-gamedev-naughty-nice/">controversial</a> addition to C++20, but I’ve found them to be a good fit for <em>rpp</em>.
Creating an <em>rpp</em> task involves returning an <code class="language-plaintext highlighter-rouge">Async::Task<T></code> from a coroutine:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Async</span><span class="o">::</span><span class="n">Task</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">task1</span><span class="p">()</span> <span class="p">{</span>
<span class="n">co_return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>To execute a task on another thread, simply create a thread pool and await <code class="language-plaintext highlighter-rouge">pool.suspend()</code>.
When a coroutine awaits another task, it will be suspended until the completion of the awaited task.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Async</span><span class="o">::</span><span class="n">Task</span><span class="o"><</span><span class="n">i32</span><span class="o">></span> <span class="n">task2</span><span class="p">(</span><span class="n">Async</span><span class="o">::</span><span class="n">Pool</span><span class="o"><>&</span> <span class="n">pool</span><span class="p">)</span> <span class="p">{</span>
<span class="n">co_await</span> <span class="n">pool</span><span class="p">.</span><span class="n">suspend</span><span class="p">();</span>
<span class="n">co_return</span> <span class="n">co_await</span> <span class="n">task1</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It’s possible to write your entire program in terms of coroutines, but it’s often more convenient to maintain a main thread.
Therefore, tasks also support synchronous waits:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Async</span><span class="o">::</span><span class="n">Pool</span><span class="o"><></span> <span class="n">pool</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">task</span> <span class="o">=</span> <span class="n">task2</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>
<span class="n">info</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="n">task</span><span class="p">.</span><span class="n">block</span><span class="p">());</span>
</code></pre></div></div>
<p>Tasks may be awaited at most once, cannot be copied, and cannot be cancelled after being scheduled.<sup id="fnref:once" role="doc-noteref"><a href="#fn:once" class="footnote" rel="footnote">8</a></sup>
These limitations allow the implementation to be quite efficient:</p>
<ul>
<li>Tasks are just pointers to coroutine frames.</li>
<li>Promises add minimal additional state: a status word, a flag for synchronous waits, and the result value.</li>
<li>Executing a coroutine only allocates a single object (the frame).</li>
</ul>
<p>Notably, managing the promise lifetime doesn’t require reference counting.
Instead, <em>rpp</em> uses a state machine based on <a href="https://devblogs.microsoft.com/oldnewthing/20210416-00/?p=105115">Raymond Chen’s</a> design.</p>
<table class="center2" style="margin-top:25px;margin-bottom:20px">
<tr><th>Task is...</th><th>Old Promise State</th><th>New Promise State</th><th>Action</th></tr>
<tr><td>Created</td><td></td><td>Start</td><td></td></tr>
<tr><td>Awaited</td><td>Done</td><td>Done</td><td>Resume Awaiter</td></tr>
<tr><td>Awaited</td><td>Start</td><td>Awaiter Address</td><td>Suspend Awaiter</td></tr>
<tr><td>Completed</td><td>Start</td><td>Done</td><td></td></tr>
<tr><td>Completed</td><td>Abandoned</td><td></td><td>Destroy</td></tr>
<tr><td>Completed</td><td>Awaiter Address</td><td>Done</td><td>Resume Awaiter</td></tr>
<tr><td>Dropped</td><td>Start</td><td>Abandoned</td><td></td></tr>
<tr><td>Dropped</td><td>Done</td><td></td><td>Destroy</td></tr>
</table>
<p>Since the awaiter’s address can’t be zero (Start), one (Done), or two (Abandoned), we can encode all states in a single word and perform each transition with a single atomic operation.
Additionally, we can use <a href="https://lewissbaker.github.io/2020/05/11/understanding_symmetric_transfer">symmetric transfer</a> to resume tasks without requeuing them.</p>
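<p>The table above can be modeled with one atomic word. The sketch below covers only the listed transitions and leaves out the coroutine machinery entirely; <code class="language-plaintext highlighter-rouge">Promise_State</code>, <code class="language-plaintext highlighter-rouge">Action</code>, and <code class="language-plaintext highlighter-rouge">Transition</code> are hypothetical names, and the memory orderings are a conservative assumption rather than what <em>rpp</em> uses.</p>

```cpp
#include <atomic>
#include <cstdint>

enum class Action { none, suspend_awaiter, resume_awaiter, destroy };

struct Transition {
    Action action;
    void* awaiter; // only meaningful when action == resume_awaiter
};

class Promise_State {
    // Small integers are never valid awaiter addresses, so they can share the word.
    static constexpr std::uintptr_t START = 0, DONE = 1, ABANDONED = 2;
    std::atomic<std::uintptr_t> word_{START};

public:
    // Awaited: try Start -> awaiter address. Failure means the task is already Done.
    Action on_await(void* awaiter) {
        std::uintptr_t expected = START;
        auto desired = reinterpret_cast<std::uintptr_t>(awaiter);
        if(word_.compare_exchange_strong(expected, desired, std::memory_order_acq_rel)) {
            return Action::suspend_awaiter;
        }
        return Action::resume_awaiter;
    }

    // Completed: publish Done and act on whatever state was there before.
    Transition on_complete() {
        std::uintptr_t old = word_.exchange(DONE, std::memory_order_acq_rel);
        if(old == START) return {Action::none, nullptr};
        if(old == ABANDONED) return {Action::destroy, nullptr};
        return {Action::resume_awaiter, reinterpret_cast<void*>(old)};
    }

    // Dropped: publish Abandoned; if the task had already finished, destroy it now.
    Action on_drop() {
        std::uintptr_t old = word_.exchange(ABANDONED, std::memory_order_acq_rel);
        return old == DONE ? Action::destroy : Action::none;
    }
};
```

<p>Whichever side observes that the other has already acted becomes responsible for resuming the awaiter or destroying the frame, which is why no reference count is needed.</p>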
<p>Currently, the scheduler does not support work stealing, prioritization, or CPU affinity, so blocking on another task can deadlock.
I plan to add support for these features in the future.</p>
<h2 id="asynchronous-io">Asynchronous I/O</h2>
<p>As seen in many other languages, coroutines are also useful for asynchronous I/O.
Therefore, the <em>rpp</em> scheduler supports awaiting platform-specific events.
Its design is roughly based on <a href="https://github.com/jeremyong/coop">coop</a>.</p>
<p>Only basic support for async file I/O and timers is included, but creating custom I/O methods is easy:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Task</span><span class="o"><</span><span class="n">IO</span><span class="o">></span> <span class="n">async_io</span><span class="p">(</span><span class="n">Pool</span><span class="o"><>&</span> <span class="n">pool</span><span class="p">)</span> <span class="p">{</span>
<span class="n">HANDLE</span> <span class="n">handle</span> <span class="o">=</span> <span class="cm">/* Start an IO operation */</span><span class="p">;</span>
<span class="n">co_await</span> <span class="n">pool</span><span class="p">.</span><span class="n">event</span><span class="p">(</span><span class="n">Event</span><span class="o">::</span><span class="n">of_sys</span><span class="p">(</span><span class="n">handle</span><span class="p">));</span>
<span class="n">co_return</span> <span class="cm">/* The result */</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In my renderer, this system enables awaitable GPU tasks based on <code class="language-plaintext highlighter-rouge">VK_KHR_external_fence</code>s.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">co_await</span> <span class="n">vk</span><span class="p">.</span><span class="n">async</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="p">[](</span><span class="n">Vk</span><span class="o">::</span><span class="n">Commands</span><span class="o">&</span> <span class="n">commands</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Add GPU work to the command buffer</span>
<span class="p">});</span>
</code></pre></div></div>
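<p>The general mechanism behind awaiting an event can be sketched with the standard library alone. The example below is a single-threaded illustration, not <em>rpp</em>’s implementation: <code class="language-plaintext highlighter-rouge">Manual_Event</code>, <code class="language-plaintext highlighter-rouge">Detached</code>, and <code class="language-plaintext highlighter-rouge">wait_then_store</code> are hypothetical names, and a real scheduler would presumably resume waiters from a thread that polls OS handles rather than from an explicit <code class="language-plaintext highlighter-rouge">signal()</code> call.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;coroutine&gt;
#include &lt;exception&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

// Hypothetical event: coroutines that await it are parked until signal().
struct Manual_Event {
    bool signaled = false;
    std::vector&lt;std::coroutine_handle&lt;&gt;&gt; waiters;

    struct Awaiter {
        Manual_Event&amp; event;
        // Skip suspension entirely if the event has already fired.
        bool await_ready() const noexcept { return event.signaled; }
        // Otherwise park the coroutine on the event's waiter list.
        void await_suspend(std::coroutine_handle&lt;&gt; handle) { event.waiters.push_back(handle); }
        void await_resume() const noexcept {}
    };
    Awaiter operator co_await() noexcept { return Awaiter{*this}; }

    // Mark the event as fired and resume every parked coroutine.
    void signal() {
        signaled = true;
        auto parked = std::move(waiters);
        waiters.clear();
        for(auto handle : parked) handle.resume();
    }
};

// Fire-and-forget coroutine type: starts eagerly and frees its own frame.
struct Detached {
    struct promise_type {
        Detached get_return_object() noexcept { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() noexcept {}
        void unhandled_exception() noexcept { std::terminate(); }
    };
};

// Runs until the co_await, then writes the result once the event fires.
Detached wait_then_store(Manual_Event&amp; event, int&amp; out) {
    co_await event;
    out = 42;
}
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">wait_then_store(event, result)</code> returns as soon as the coroutine suspends; <code class="language-plaintext highlighter-rouge">result</code> is only written after a later <code class="language-plaintext highlighter-rouge">event.signal()</code> resumes it.</p>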
<h2 id="compiler-bugs">Compiler Bugs</h2>
<p>Despite compilers having supported coroutines for several years, I still ran into multiple compiler bugs while implementing <em>rpp</em>.
This suggests that coroutines are still not widely used, or I imagine the bugs would have been fixed by now.</p>
<p>In particular, MSVC 19.37 exhibits a <a href="https://developercommunity.visualstudio.com/t/destroy-coroutine-from-final_suspend-r/10096047">bug</a> that makes symmetric transfer unsafe, so it’s currently possible for the <em>rpp</em> scheduler to stack overflow on Windows.
As of Jan 2024, a fix is pending release by Microsoft.
I ran into a similar bug in <code class="language-plaintext highlighter-rouge">clang</code>, but didn’t track it down precisely because updating to <code class="language-plaintext highlighter-rouge">clang-17</code> fixed the issue.</p>
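<p>For context, symmetric transfer means that <code class="language-plaintext highlighter-rouge">await_suspend</code> returns a coroutine handle, which the compiler resumes directly instead of nesting a call to <code class="language-plaintext highlighter-rouge">resume()</code>. The following is a minimal standard-library sketch of that pattern, not <em>rpp</em>’s <code class="language-plaintext highlighter-rouge">Async::Task</code>: the <code class="language-plaintext highlighter-rouge">Task</code> type, its <code class="language-plaintext highlighter-rouge">run</code> method, and <code class="language-plaintext highlighter-rouge">count</code> are hypothetical and single-threaded.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;coroutine&gt;
#include &lt;exception&gt;
#include &lt;utility&gt;

struct Task {
    struct promise_type {
        // Who to resume when this task finishes; a no-op if nobody awaits it.
        std::coroutine_handle&lt;&gt; continuation = std::noop_coroutine();
        int value = 0;

        Task get_return_object() {
            return Task{std::coroutine_handle&lt;promise_type&gt;::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }

        struct Final {
            bool await_ready() const noexcept { return false; }
            // Symmetric transfer: hand control straight to the awaiter.
            std::coroutine_handle&lt;&gt;
            await_suspend(std::coroutine_handle&lt;promise_type&gt; done) const noexcept {
                return done.promise().continuation;
            }
            void await_resume() const noexcept {}
        };
        Final final_suspend() noexcept { return {}; }
        void return_value(int v) noexcept { value = v; }
        void unhandled_exception() noexcept { std::terminate(); }
    };

    std::coroutine_handle&lt;promise_type&gt; handle;

    explicit Task(std::coroutine_handle&lt;promise_type&gt; h) : handle(h) {}
    Task(Task&amp;&amp; other) noexcept : handle(std::exchange(other.handle, {})) {}
    Task(const Task&amp;) = delete;
    ~Task() { if(handle) handle.destroy(); }

    bool await_ready() const noexcept { return false; }
    // Symmetric transfer again: record the awaiter, then jump into the child.
    std::coroutine_handle&lt;&gt; await_suspend(std::coroutine_handle&lt;&gt; awaiter) noexcept {
        handle.promise().continuation = awaiter;
        return handle;
    }
    int await_resume() const noexcept { return handle.promise().value; }

    // Drive a top-level task to completion on the calling thread.
    int run() {
        handle.resume();
        return handle.promise().value;
    }
};

// Awaits a chain of `depth` child tasks and counts them.
Task count(int depth) {
    if(depth == 0) co_return 0;
    Task child = count(depth - 1);
    int below = co_await child;
    co_return below + 1;
}
</code></pre></div></div>
<p>When the compiler implements the returned-handle form as a tail call, a long chain of awaits does not grow the stack; a bug that breaks this guarantee is what makes a stack overflow possible.</p>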
<h1 id="appendix-why-not-language-x">Appendix: why not language X?</h1>
<p><strong><a href="https://www.rust-lang.org/"><em>Rust</em></a></strong></p>
<p>This may seem like a lot of work that would be better spent transitioning to Rust.
Much of <em>rpp</em> is directly inspired by Rust—even the name—but I haven’t been convinced to switch just yet.
The kind of projects I work on (i.e. graphics & games) tend to be both borrow-checker-unfriendly and do not primarily suffer from memory or concurrency issues.</p>
<p>That’s not to say these kinds of bugs never come up, but I find debuggers, sanitizers, and profiling sufficient to fix issues and prevent regressions.
I would certainly <em>prefer</em> these bugs to be caught at compile time, but I haven’t found the language overhead of Rust to be obviously worth it—the painful bugs come from GPU code anyway!
Of course, it might be a different story if I wrote security-critical programs.</p>
<p>What has really held me back is simply that the Rust compiler is slow—but now that the <a href="https://blog.rust-lang.org/2023/11/09/parallel-rustc.html">parallel frontend</a> has landed, I may reconsider.
I would also concede that Rust is the better language to start learning in 2024.</p>
<p><strong><a href="https://www.youtube.com/playlist?list=PLmV5I2fxaiCKfxMBrNsU1kgKJXD3PkyxO"><em>Jai</em></a></strong></p>
<p>Jai prioritizes many of the same goals as <em>rpp</em>.
I sure would like to use it, but it’s been nine years and I still don’t have a copy of the compiler.</p>
<p><strong><em><a href="https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b">“Orthodox C++”</a></em></strong></p>
<p>In the past, I’ve written a lot of C-style C++, including the first version of <em>rpp</em> (unnamed at the time).
In retrospect, I don’t think this style is a great choice, unless you’re <em>very</em> disciplined about not using C++ features.
Introducing even a small amount of C++isms leads to a lot of poor interactions with C-style code.
Overall, I’d say it’s better to return to plain C.</p>
<p>That said, <em>rpp</em> is obviously influenced by this style: it doesn’t use exceptions, RTTI, multiple/virtual inheritance, or the STL.
However, it does make heavy use of templates.</p>
<p><strong><em>C</em></strong></p>
<p>I’ve also written a fair amount of C, and I’d say it’s still a good choice for OS-adjacent work.
However, I rarely need to think about topics like binary size, inline assembly, or ABI compatibility, so I’d rather have (e.g.) real language support for generics.
Plus, at least where available, Rust is often a better choice.</p>
<p><strong><a href="https://dlang.org/"><em>D</em></a></strong></p>
<p>I haven’t written much D, but from what I’ve seen, it’s a pretty sensible redesign of C++.
Many of the features that interact poorly in C++ work well in D.
However, there’s also the garbage collector—it can be disabled, but you lose out on a lot of library features.
Overall, I think D suffers from the problem of being “only” moderately better than C++, which isn’t enough to justify rebuilding the ecosystem.</p>
<p><strong><a href="https://odin-lang.org/"><em>Odin</em></a></strong></p>
<p>I tried Odin in 2017, but it wasn’t sufficiently stable at the time.
It appears to be in a good state today—and certainly improves on C for application programming—but I think it suffers from the same problem that D does relative to C++.</p>
<p><strong><a href="https://ocaml.org/"><em>OCaml</em></a></strong></p>
<p>I feel obligated to mention OCaml, since I now use it professionally, even for systems programming tasks.
Modes, unboxed types, and domains are starting to make OCaml very viable for systems-with-incremental-concurrent-generational-GC projects—but it’s not yet a great fit for games/graphics programming.</p>
<p>I would particularly like to see an ML-family language without a GC (maybe along the lines of <a href="https://koka-lang.github.io/koka/doc/index.html">Koka</a>), but not so imperative/object oriented as Rust.</p>
<p><strong><em>Zig</em>, <em>Nim</em>, <em>Go</em>, <em>C#</em>, <em>F#</em>, <em>Haskell</em>, <em>Beef</em>, <em>Vale</em>, <em>Java</em>, <em>JavaScript</em>, <em>Python</em>, <em>Crystal</em>, <em>Lisp</em></strong></p>
<p>I’ve learned about each of these languages to some small extent, but none seem sufficiently compelling for my use cases.
(Except maybe Zig, which scares me.)</p>
<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:benchmark" role="doc-endnote">
<p>The 100k LOC do not include the STL itself. Benchmarked on an AMD 5950X @ 4.4GHz, MSVC 19.38.33129, Windows 11 10.0.22631. <a href="#fnref:benchmark" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lto" role="doc-endnote">
<p>Unfortunately, this makes the implementations unavailable for inlining. However, I’m prioritizing compile times, and link-time optimization can make up for it. <a href="#fnref:lto" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:smallstring" role="doc-endnote">
<p>Adding the small-string optimization only makes sense if it happens to accelerate an existing codebase.
It causes strings’ allocation behavior to depend on their length, resulting in less predictable performance.
Further, if your program creates a lot of small, heap-allocated <code class="language-plaintext highlighter-rouge">String<></code>s, you should ideally refactor it to use <code class="language-plaintext highlighter-rouge">String_View</code>, the stack allocator, or a custom interned string representation. <a href="#fnref:smallstring" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:smallvector" role="doc-endnote">
<p>(For the same reasons as strings.) <a href="#fnref:smallvector" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hash" role="doc-endnote">
<p>It also stores hashes alongside integer keys, which doesn’t help. I’ll fix this eventually. <a href="#fnref:hash" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:metaprogram" role="doc-endnote">
<p>Of course, this data could be generated by a separate meta-program or a compiler plugin, but I haven’t found the additional complexity to be worth it. <a href="#fnref:metaprogram" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:frames" role="doc-endnote">
<p>Assuming your program is long-running and has a natural frame delineation. Otherwise, the entire execution is traced under one frame. <a href="#fnref:frames" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:once" role="doc-endnote">
<p>Because tasks may only be awaited once, they do not support <code class="language-plaintext highlighter-rouge">co_yield</code>. <a href="#fnref:once" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>I recently rewrote the C++20 infrastructure I use for personal projects, so I decided to snapshot its design in the style of another recent article. The library (rpp) is primarily inspired by Rust, and may be found on GitHub. I use rpp as an STL replacement, and I’m currently developing a real-time path tracer with it. rpp attempts to provide the following features, roughly in priority order: Fast Compilation Including rpp should not increase compilation time by more than 250ms. It must not pull in any STL or system headers—they’re what’s really slow, not templates. Debugging User code should be easy to debug. That requires good debug-build performance, defensive assertions, and prohibiting exceptions/RTTI. Concepts should be used to produce straightforward compilation errors. Performance User code should have control over allocations, parallelism, and data layout. This is enabled by a region-based allocator system, a coroutine scheduler, and better default data structure implementations. Additionally, tracing-based profiling and allocation tracking are tightly integrated. Explicitness All non-trivial operations should be visible in the source code. That means prohibiting implicit conversions, copies, and allocations, as well as using types with clear ownership semantics. Metaprogramming It should be easy to perform compile-time computations and reflect on types. This involves pervasive use of concepts and constexpr, as well as defining a format for reflection metadata. These goals lead to a few categories of interesting features, which we’ll discuss below. Compile Times Data Structures and Reflection Allocators, Regions, and Tracing Coroutines Standard disclaimers: no, it wouldn’t make sense to enforce this style on existing projects/teams, no, the library is not “production-ready,” and yes, C++ is Bad™ so I considered switching to language X. Compile Times Note: this is all pre-modules, so may change soon. CMake recently released support! 
C++ compilers are slow, but building your codebase doesn’t have to be. Let’s investigate how MSVC spends its time compiling another project of mine, which is written in a mainstream C++17 style. Including dependencies, it’s about 100k lines of code and takes 20 seconds to compile with optimizations enabled.1 Processing this list of files took up the vast majority of compilation time, but only a handful are actually used in the project—the rest were pulled in to support the STL. We’ll also find several compilation units that look like this: Here, including only std::chrono took 75% of the entire invocation. One would hope that incremental compilation is still fast, but the slowest single compilation unit spent ten seconds in the compiler frontend. Much of that time was spent solving constraints, creating instances, and generating specializations for STL data structures. The STL’s long compile times can be mitigated by using precompiled headers (or modules…in C++23), but the core issue is simply that the STL is huge. So, how fast can we go by avoiding it? Here’s my rendering project, which is also roughly 100k lines of code, including rpp and other dependencies. (Click to play.) The optimized build completes in under 3 seconds. There’s no trick—I’ve just minimized the amount of code included into each compilation unit. Let’s take a look at where MSVC spends its time: Most of these files come from non-STL dependencies and are actually required by the project. Several arise from rpp, but the worst offender (vmath) only accumulated 1s of work. Incremental compilation is also much improved: the slowest compilation unit only spent 1.4s in the frontend. Within each unit, we can see that including base.h (the bulk of rpp) took under 200ms on average. Of course, this is not a proper benchmark: I’m comparing two entirely separate codebases. Project structure has a large impact on compile times, as does usage code, so your milage may vary. 
Nevertheless, after working on both projects, I found the decreased iteration time alone to be worth ditching the STL. How? You might wonder whether rpp only compiles quickly because it omits most STL features. This is obviously true to some extent: it’s only 10k lines of code, only supports Windows/Linux/x64, and isn’t concerned with backwards compatibility. However, rpp replicates all of the STL features I actually use. Implementing the core functionality takes relatively little code, and given some care, we can keep compile times low. For example, we can still use libc and the Windows/Linux APIs, so long as we relegate their usage to particular compilation unit.2 User code only sees the bare-minimum symbols and types—windows.h need not apply. OS_Thread sys_start(OS_Thread_Ret (*f)(void*), void* data); void sys_join(OS_Thread thread); Unfortunately, the same cannot be said for the STL proper, as (most) generic code must be visible to all compilation units. Hence, rpp re-implements many STL data structures and algorithms. They’re not that hard to compile quickly—STL authors have just never prioritized it. array.h box.h files.h format.h function.h hash.h heap.h log.h map.h math.h opt.h pair.h thread.h queue.h rc.h reflect.h rng.h stack.h string.h tuple.h variant.h vec.h These two approaches let us entirely eliminate standard headers. Due to built-in language features, we can’t quite get rid of std::initializer_list and std::coroutine_handle, but the necessary definitions can be manually extracted into a header that does not pull in anything else. Data Structures and Reflection Let’s discuss rpp’s primary data structures. We will write T and A to denote generic type parameters and allocators, respectively. The following section will discuss allocators in more detail. Pointers There are five types of pointers, each with different ownership semantics. Except for raw pointers, all must point to a valid object or be null. T* is rarely used. 
The referenced memory is never owned; there are no other guarantees. Ref<T> is nearly the same as a raw pointer but removes pointer arithmetic and adds null checks. Box<T, A> is a uniquely owning pointer, similar to std::unique_ptr. It may be moved, but not copied, and frees its memory when destroyed. Rc<T, A> is a non-atomically reference-counted pointer. The reference count is stored contiguously with the pointed-to object. It may be moved or copied and frees its memory when the last reference is destroyed. Arc<T, A> is an atomically reference-counted Rc, similar to std::shared_ptr. Strings String<A> contains a pointer, a length, and a capacity. Strings are not null-terminated and may contain null bytes. There is no “small-string” optimization, so creating a non-empty string always allocates.3 String_View provides a non-owning immutable view of a string, similar to std::string_view. Most non-destructive string operations are implemented on String_View. String literals may be converted to String_Views using the _v literal suffix. String_View message = "foo"_v; String<> copy = message.string<>(); if(copy.view() == "bar"_v) // ... Arrays Array<T, N> is a fixed-size array of length N. It’s essentially an rpp-compatible std::array. Array<i32, 3> int_array{0, 1, 2}; Vec<T, A> is a dynamically-sized array, similar to std::vector. There is no “small-vector” optimization, so creating a non-empty vector always allocates.4 Vec<i32> int_vec{0, 1, 2}; int_vec.push(3); int_vec.pop(); Slice<T> is a non-owning immutable view of a contiguous array, similar to std::span. A Slice may be created from an Array, a Vec, a std::initializer_list, or a raw pointer & length. Non-destructive functions on arrays take Slice parameters, which abstract over the underlying storage. This erases (e.g.) the size of the Array or the allocator of the Vec, reducing template bloat. 
void print_slice(Slice<i32> slice) { for(i32 i : slice) info("%", i); } Stack<T, A>, Queue<T, A>, and Heap<T, A> are array-based containers with stack, queue, and heap semantics, respectively. Queue is implemented using a growable circular buffer, and Heap is a flat binary min-heap with respect to the less-than operator. Stack<i32> int_stack{1, 2, 3}; int_stack.top(); // 3 Queue<i32> int_queue{1, 2, 3}; int_queue.front(); // 1 Heap<i32> int_heap{1, 2, 3}; int_heap.top(); // 1 Hash Map Map<K, V, A> is an open-addressed hash table with Robin Hood linear probing. For further discussion of why this is a good choice, see Optimizing Open Addressing. Map<String_View, i32> int_map{Pair{"foo"_v, 0}, Pair{"bar"_v, 1}}; int_map.insert("baz"_v, 2); int_map.erase("bar"_v); for(auto& [key, value] : int_map) info("%: %", key, value); The implementation is very similar to Optimizing Open Addressing, except that the table also stores key hashes. This improves performance when hashing keys is expensive—particularly for strings.5 Optional Opt<T> is either empty or contains value of type T. It is not a pointer; it contains a boolean flag and storage for a T. Optionals are used for error handling, as rpp does not use exceptions. Opt<i32> int_opt{1}; if(int_opt) info("%", *int_opt); There is currently no Result<T, E> type, but I plan to add one. Tuples Pair<A, B> contains two values of types A and B. It supports structured bindings and is used in the implementation of Map. Pair<i32, f32> pair{1, 2.0f}; i32 one = pair.first; auto [i, f] = pair; Tuple<A, B, C, ...> is a heterogeneous product type, which may be accessed via (static) indexing or structured binding. Tuple<i32, f32, String_View> product{1, 2.0f, "foo"_v}; i32 one = product.get<0>(); auto [i, f, s] = product; An N-element tuple is represented by a value and an N-1-element tuple. At runtime, this layout is equivalent to a single struct with N fields. 
struct { A first; struct { B first; struct { C first; struct {}}}} The linear structure makes compile-time tuple operations occur in linear time—but I haven’t found this to be an issue in practice. Structuring tuples as binary trees would accelerate compilation, but the code would be more complex. Variant Variant<A, B, C, ...> is a heterogeneous sum type. It is superficially similar to std::variant, but has several important differences: It is never default constructable, so there is no std::monostate. It cannot be accessed by index or by alternative, so there is no std::bad_variant_access. rpp does not use exceptions, so there is no valueless_by_exception. The alternatives do not have to be destructible. The alternatives must be distinct. Variants are accessed via (scuffed) pattern matching: the match method invokes a callable object overloaded for each alternative. The Overload helper lets the user write a list of (potentially polymorphic) lambdas. Variant<i32, f32, String_View> sum{1}; sum = 2.0f; sum.match(Overload{ [](i32 i) { info("Int %", i); }, [](f32 f) { info("Float %", f); }, [](auto) { info("Other"); }, }); Requiring distinct types means a variant can’t represent multiple alternatives of the same type. We can work around this restriction using the Named helper, but the syntax is clunky. Variant<Named<"X", i32>, Named<"Y", i32>> sum{Named<"X", i32>{1}}; sum.match(Overload{ [](Named<"X", i32> x) { info("%", x.value); }, [](Named<"Y", i32> y) { info("%", y.value); }, }); Like optionals, variants are not pointers: they contain an index and sufficient storage for the largest alternative. However, variants also need a way to map the runtime index to a compile-time type. Given at most eight alternatives, it simply checks the index with a chain of if statements. Otherwise, it indexes a static table of function pointers, which is generated at compile time. This introduces an indirect call, but is still relatively fast. 
Function Function<R(A, B, C...)> is a type-erased closure. It must be provided parameters of types A, B, C, etc., and returns a value of type R. Function is similar to std::function, but has a few important differences: It is not default constructable, so there is no std::bad_function_call. It is not copyable (like std::move_only_function). The “small function” optimization is required: constructing a function never allocates. Functions can be constructed from any callable object with the proper type—in practice, usually a lambda. Function<void(i32)> print_int = [](i32 i) { info("%", i); }; By default, a Function only has 4 words of storage, so there’s also FunctionN<N, R(A, B, C...)>, which has N words of storage. In-place storage implies that all functions in a homogeneous collection must have a worst-case size. If this becomes a problem, you should typically defunctionalize or otherwise restructure your code. At runtime, functions contain an explicit vtable pointer and storage for the callable object. The vtable is a static array of function pointers generated at compile time. Calling, moving, or destroying the function requires an indirect call—where possible, it’s better to template usage code on the underlying function type. Explicitness In rpp, all non-trivial operations should be visible in the source code. This is not generally the case in mainstream C++: most types can be implicitly copied, single-argument constructors introduce implicit conversions, and many operations conditionally allocate memory. Copies Unlike the STL, rpp data structures cannot be implicitly copied. To duplicate (most) non-trivial types, you must explicitly call their clone method. Vec<i32> foo{0, 1, 2}; Vec<i32> bar = foo.clone(); Vec<i32> baz = move(foo); The Clone concept expresses this capability, allowing generic code to clone supported types. Additionally, concepts are used to pick the most efficient implementation of clone. 
For example, “trivially copyable” types can be copied with memcpy, as seen in Array: Array clone() const requires(Clone<T> || Copy_Constructable<T>) { Array result; if constexpr(Trivially_Copyable<T>) { Libc::memcpy(result.data_, data_, sizeof(T) * N); } else if constexpr(Clone<T>) { for(u64 i = 0; i < N; i++) result[i] = data_[i].clone(); } else { for(u64 i = 0; i < N; i++) result[i] = T{data_[i]}; } return result; } Constructors Roughly all constructors are marked as explicit, so do not introduce implicit conversions. I think this is the correct tradeoff, but it has a downside: type inference for template parameters is quite limited. That means we often have to annotate otherwise-obvious types. Vec<i32> make_vec(i32 i) { return Vec<i32>{i}; // Have to write the "<i32>" } Reflection Compared to previous versions, C++20 has much better native support for compile-time computation. Hence, rpp does not provide many metaprogramming tools beyond using constexpr, consteval, etc. where possible. However, it does include a reflection system, which is used to implement a generic printf. To make a type reflectable, the user creates a specialization of the struct Reflect::Refl. This may be done manually, or with a helper macro. For example: struct Data { Vec<i32> ints; Vec<f32> floats; }; RPP_RECORD(Data, RPP_FIELD(ints), RPP_FIELD(floats)); By default, rpp provides specializations for all built-in types and rpp data structures. Eventually, we’ll get proper compiler support for reflection, but it’s still years away from being widely available.6 The generated reflection data enables operating on types at compile time. 
For example, we can iterate the fields of an arbitrary record: struct Print_Field { template<Reflectable T> void apply(const Literal& field_name, const T& field_value) { info("%: %", field_name, field_value); } }; Data data; iterate_record(Print_Field{}, data); The reflection API includes various other low-level utilities, but it’s largely up to the user to build higher-level abstractions. One example is included: a generic printing system, which is used in the logging macros we’ve seen so far. Data data{Vec<i32>{1, 2, 3}, Vec<f32>{1.0f, 2.0f, 3.0f}}; info("%", data); By default, formatting a record will print each field. Formatting behavior can be customized by specializing Format::Write / Format::Measure, which are provided for all built-in data structures. Data{ints : Vec[1, 2, 3], floats : Vec[1.000000, 2.000000, 3.000000]} The implementation is surprisingly concise: it’s essentially an if constexpr branch for each primitive type, plus iterating records/enums. It’s also easy to extend: for my renderer, I wrote a similar system that generates Dear ImGui layouts for arbitrary types. Allocators, Regions, and Tracing In rpp, all data structures are statically associated with an allocator. Vec<u64, Mdefault> vector; This approach appears similar to the STL, which also parameterizes data structures over allocators, but is fundamentally different. An rpp allocator is purely a type, not a value, so is fully determined at compile time. That means the specific allocator used by a data structure is part of its type. By default, all allocations come from Mdefault, which is a thin wrapper around malloc/free that records allocation events. It starts to get interesting once we allow the user to define new allocators: using Custom = Mallocator<"Custom">; Vec<u64, Custom> vector; Here, Custom still just wraps malloc/free, but its allocation events are tracked under a different name. 
Now, if we print allocation statistics: Allocation stats for [Custom]: Allocs: 3 Frees: 3 High water: 8192 Alloc size: 24576 Free size: 24576 Printing this data at program exit makes it easy to find memory leaks and provides a high-level overview of where your memory is going. Regions Tracking heap allocations is useful, but it doesn’t obviously justify the added complexity. More interestingly, rpp data structures also support stack allocation. The rpp stack allocator is a global, thread-local, chunked buffer that is used to implement regions. For further discussion of this design, see Oxidizing OCaml. Passing the Mregion allocator to a data structure causes it to allocate from the current region. Allocating region memory is almost free, as it simply bumps the stack pointer. Freeing memory is a no-op. Vec<u64, Mregion<R>> local_vector; Regions are denoted by the Region macro. Each region lasts until the end of the following scope, at which point all associated allocations are freed. Region(R) { // R starts here // Allocate a Vec in R Vec<u64, Mregion<R>> local_vector; local_vector.push(1); local_vector.push(2); } // R ends here Unfortunately, regions introduce a new class of memory bugs. Region-local values must not escape their scope, and a region must not attempt to allocate memory associated with a parent. In OCaml or Rust, these properties can be checked at compile time, but not so in C++. Instead, rpp uses a runtime check to ensure region safety. When allocating or freeing memory, the allocator checks that the current region has the same brand as the region associated with the allocation. Brands are named by the user and passed as the template parameter of Mregion. Region(R0) { Vec<u64, Mregion<R0>> local_vector; Region(R1) { local_vector.push(1); // Region brand mismatch! } } Brands are compile-time constants derived from the region’s source location, so do not introduce any runtime overhead beyond the check itself. 
Since C++17, the STL also supports allocator polymorphism—that is, runtime polymorphism. For example, std::pmr::monotonic_buffer_resource can be used to implement regions, but it incurs an indirect call for every allocation. Conversely, rpp’s regions are fully determined at compile time. This design also sidesteps the problem of nested allocators, which is a major pain point in the STL. Pools Most allocations are either very long-lived, so go on the heap, or very short-lived, so go on the stack. However, it’s occasionally useful to have a middle ground, so rpp also provides pool allocators. A pool is a simple free-list containing fixed-sized blocks of memory. When the pool is not empty, allocations are essentially free, as they simply pop a buffer off the list. Freeing memory is also trivial, as it pushes the buffer back onto the list. Passing Mpool to a data structure causes it to allocate from the properly sized pool. Since pools are fixed-size, Mpool can only be used for “scalar” allocations, i.e. Box, Rc, and Arc. Box<u64, Mpool> pooled_int{1}; Each size of pool gets its own statistics: Allocation stats for [Pool<8>]: Allocs: 2 Frees: 2 High water: 16 Alloc size: 16 Free size: 16 Mpool is similar to std::pmr::synchronized_pool_resource, but is much simpler, as it only supports one block size and does not itself coalesce allocations into chunks. Currently, each pool is globally shared using a mutex. This is not ideal, but may be improved by adding thread-local pools and a global fallback. Polymorphism Parameterizing data structures over allocators provides a lot of flexibility—but it makes all of their operations polymorphic, too. Regions exacerbate the issue: for example, a function returning a locally-allocated Vec must be polymorphic over all regions. template<Region R> Vec<i32, Mregion<R>> make_local_vec() { return Vec<i32, Mregion<R>>{1, 2, 3}; } The same friction is present in Rust, where references give rise to lifetime polymorphism. 
(In fact, brands are a restricted form of lifetimes.) The downside of this approach is code bloat. Due to link-time deduplication, I haven’t run into any issues with binary size, but it could become a problem in larger projects. Tracing In addition to tracking allocator statistics, rpp records a list of allocation events for each frame.7 This data makes it easy to target an average of zero heap allocations per frame—other than growing the stack allocator. Furthermore, rpp includes a basic tracing profiler. To record how long a block of code takes to execute, simply wrap it in Trace: Trace("Trace Me") { // ... } The profiler will compute how many times the block was entered, how much time was spent in the block, and how much time was spent in the block’s children. The resulting timing tree can be traversed by user code, but rpp does not otherwise provide a UI. In my renderer, I’ve used rpp’s profile data to graph frame times (based on LegitProfiler): Finally, note that tracing is disabled in release builds. Coroutines Coroutines were a somewhat controversial addition to C++20, but I’ve found them to be a good fit for rpp. Creating an rpp task involves returning an Async::Task<T> from a coroutine: Async::Task<i32> task1() { co_return 1; }; To execute a task on another thread, simply create a thread pool and await pool.suspend(). When a coroutine awaits another task, it will be suspended until the completion of the awaited task. Async::Task<i32> task2(Async::Pool<>& pool) { co_await pool.suspend(); co_return co_await task1(); }; It’s possible to write your entire program in terms of coroutines, but it’s often more convenient to maintain a main thread. 
Therefore, tasks also support synchronous waits: Async::Pool<> pool; auto task = task2(pool); info("%", task.block()); Tasks may be awaited at most once, cannot be copied, and cannot be cancelled after being scheduled.8 These limitations allow the implementation to be quite efficient: Tasks are just pointers to coroutine frames. Promises add minimal additional state: a status word, a flag for synchronous waits, and the result value. Executing a coroutine only allocates a single object (the frame). Notably, managing the promise lifetime doesn’t require reference counting. Instead, rpp uses a state machine based on Raymond Chen’s design. Task is...Old Promise StateNew Promise StateAction CreatedStart AwaitedDoneDoneResume Awaiter AwaitedStartAwaiter AddressSuspend Awaiter CompletedStartDone CompletedAbandonedDestroy CompletedAwaiter AddressDoneResume Awaiter DroppedStartAbandoned DroppedDoneDestroy Since the awaiter’s address can’t be zero (Start) one (Done) or two (Abandoned), we can encode all states in a single word and perform each transition with a single atomic operation. Additionally, we can use symmetric transfer to resume tasks without requeuing them. Currently, the scheduler does not support work stealing, prioritization, or CPU affinity, so blocking on another task can deadlock. I plan to add support for these features in the future. Asynchronous I/O As seen in many other languages, coroutines are also useful for asynchronous I/O. Therefore, the rpp scheduler supports awaiting platform-specific events. Its design is roughly based on coop. Only basic support for async file I/O and timers is included, but creating custom I/O methods is easy: Task<IO> async_io(Pool<>& pool) { HANDLE handle = /* Start an IO operation */; co_await pool.event(Event::of_sys(handle)); co_return /* The result */; } In my renderer, this system enables awaitable GPU tasks based on VK_KHR_external_fences. 
co_await vk.async(pool, [](Vk::Commands& commands) {
    // Add GPU work to the command buffer
});

Compiler Bugs

Despite compilers having supported coroutines for several years, I still ran into multiple compiler bugs while implementing rpp. This suggests that coroutines are still not widely used, or I imagine the bugs would have been fixed by now.

In particular, MSVC 19.37 exhibits a bug that makes symmetric transfer unsafe, so it’s currently possible for the rpp scheduler to stack overflow on Windows. As of Jan 2024, a fix is pending release by Microsoft. I ran into a similar bug in clang, but didn’t track it down precisely, because updating to clang-17 fixed the issue.

Appendix: why not language X?

Rust

This may seem like a lot of work that would be better spent transitioning to Rust. Much of rpp is directly inspired by Rust—even the name—but I haven’t been convinced to switch just yet.

The kinds of projects I work on (i.e. graphics & games) tend to be borrow-checker-unfriendly and do not primarily suffer from memory or concurrency issues. That’s not to say these kinds of bugs never come up, but I find debuggers, sanitizers, and profiling sufficient to fix issues and prevent regressions. I would certainly prefer these bugs to be caught at compile time, but I haven’t found the language overhead of Rust to be obviously worth it—the painful bugs come from GPU code anyway! Of course, it might be a different story if I wrote security-critical programs.

What has really held me back is simply that the Rust compiler is slow—but now that the parallel frontend has landed, I may reconsider. I would also concede that Rust is the better language to start learning in 2024.

Jai

Jai prioritizes many of the same goals as rpp. I sure would like to use it, but it’s been nine years and I still don’t have a copy of the compiler.

“Orthodox C++”

In the past, I’ve written a lot of C-style C++, including the first version of rpp (unnamed at the time).
In retrospect, I don’t think this style is a great choice, unless you’re very disciplined about not using C++ features. Introducing even a small number of C++isms leads to a lot of poor interactions with C-style code. Overall, I’d say it’s better to return to plain C.

That said, rpp is obviously influenced by this style: it doesn’t use exceptions, RTTI, multiple/virtual inheritance, or the STL. However, it does make heavy use of templates.

C

I’ve also written a fair amount of C, and I’d say it’s still a good choice for OS-adjacent work. However, I rarely need to think about topics like binary size, inline assembly, or ABI compatibility, so I’d rather have (e.g.) real language support for generics. Plus, at least where available, Rust is often a better choice.

D

I haven’t written much D, but from what I’ve seen, it’s a pretty sensible redesign of C++. Many of the features that interact poorly in C++ work well in D. However, there’s also the garbage collector—it can be disabled, but you lose out on a lot of library features. Overall, I think D suffers from the problem of being “only” moderately better than C++, which isn’t enough to justify rebuilding the ecosystem.

Odin

I tried Odin in 2017, but it wasn’t sufficiently stable at the time. It appears to be in a good state today—and certainly improves on C for application programming—but I think it suffers from the same problem that D does relative to C++.

OCaml

I feel obligated to mention OCaml, since I now use it professionally, even for systems programming tasks. Modes, unboxed types, and domains are starting to make OCaml very viable for systems-with-incremental-concurrent-generational-GC projects—but it’s not yet a great fit for games/graphics programming. I would particularly like to see an ML-family language without a GC (maybe along the lines of Koka), but not so imperative/object oriented as Rust.
Zig, Nim, Go, C#, F#, Haskell, Beef, Vale, Java, JavaScript, Python, Crystal, Lisp

I’ve learned about each of these languages to some small extent, but none seem sufficiently compelling for my use cases. (Except maybe Zig, which scares me.)

Footnotes

1. The 100k LOC do not include the STL itself. Benchmarked on an AMD 5950X @ 4.4GHz, MSVC 19.38.33129, Windows 11 10.0.22631.
2. Unfortunately, this makes the implementations unavailable for inlining. However, I’m prioritizing compile times, and link-time optimization can make up for it.
3. Adding the small-string optimization only makes sense if it happens to accelerate an existing codebase. It causes strings’ allocation behavior to depend on their length, resulting in less predictable performance. Further, if your program creates a lot of small, heap-allocated String<>s, you should ideally refactor it to use String_View, the stack allocator, or a custom interned string representation.
4. (For the same reasons as strings.)
5. It also stores hashes alongside integer keys, which doesn’t help. I’ll fix this eventually.
6. Of course, this data could be generated by a separate meta-program or a compiler plugin, but I haven’t found the additional complexity to be worth it.
7. Assuming your program is long-running and has a natural frame delineation. Otherwise, the entire execution is traced under one frame.
8. Because tasks may only be awaited once, they do not support co_yield.

14 Years of (Graphics) Programming2023-11-16T00:00:00+00:002023-11-16T00:00:00+00:00https://thenumb.at/14-Years<p>Many people have chronicled their programming journey in the 80s, 90s, and 00s—here’s one for the 2010s.</p>
<h2 id="2010">2010</h2>
<div>
<p><img style="margin-left:20px;margin-bottom:15px;float:right" src="/assets/14years/ti83.jpg" width="125" /></p>
<div style="">
<p>I first encountered programming sometime around 2010, when I was 9 years old.</p>
<p>I can’t recall precisely how or why I got started, but I wrote my first programs in TI-BASIC, meticulously entering each symbol into my TI-83 calculator.
They computed simple algebraic expressions that I used to do math problems.</p>
<p>I was fascinated by video games, so after learning how to poll for input, I created <em>Catch the θ</em>.
The player would move an ‘I’ character up and down to catch falling ‘θ’ characters.
Each new ‘θ’ would fall faster than the last, until it became impossible to catch them all.
I soon discovered you could <a href="https://www.youtube.com/watch?v=W_mZ7smIz3U">transfer programs between TI calculators</a>, resulting in a new source of distraction for our math class.</p>
<p>The game looked like this:</p>
</div>
</div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIVES: 3 SCORE: 3
I
θ
</code></pre></div></div>
<p>Around the same time, I got my first personal computer: a big Gateway laptop with a first-gen Intel i3 and 4GB DDR3.
Previously, I had only used my parents’ old iMac G3, on which I learned to type, used the AppleWorks programs, and played Mac OS 9’s limited selection of games.<sup id="fnref:games" role="doc-noteref"><a href="#fn:games" class="footnote" rel="footnote">1</a></sup></p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/imac.jpg" width="400" /></p>
<h2 id="2011">2011</h2>
<p>My new laptop, on the other hand, ran the recently released Windows 7.
Windows opened up a whole new world of software I could download from the internet.</p>
<p><img class="center" style="margin-top:10px;margin-bottom:20px" src="/assets/14years/justbasic.jpg" /></p>
<p>I somehow ended up with an IDE called <a href="https://www.justbasic.com/">Just BASIC</a>, in which I wrote a few more utilities and text-based games.
I only remember one: a “HiLo” game where the player would find a number via binary search.<sup id="fnref:guessanumber" role="doc-noteref"><a href="#fn:guessanumber" class="footnote" rel="footnote">2</a></sup></p>
<p>Eventually, I wanted to make a game with graphics, so I downloaded <a href="https://www.yoyogames.com/gamemaker">GameMaker 8</a>.
After following the tutorials, I used the visual scripting language to make a couple games, but never really finished anything.
I did, however, write an essay about game development, in which I stated that 3D graphics was “very hard” for amateurs.</p>
<p><img class="center box_shadow" style="margin-top:10px;margin-bottom:20px" src="/assets/14years/gamemaker.jpg" width="450" /></p>
<p>Up until this point, I had been more interested in making physical systems than programming—I spent a lot of time on LEGOs, domino runs, and Rube Goldberg machines.
Programming something of comparable complexity on my own was still out of reach.</p>
<h2 id="2012">2012</h2>
<p>My interests started to converge when I discovered Minecraft upon its release in late 2011.
It was one of the first popular building games, and quickly became a mainstay for me.
By version 1.2, I was exploring mods that added technology & automation and hosting servers for friends.<sup id="fnref:mods" role="doc-noteref"><a href="#fn:mods" class="footnote" rel="footnote">3</a></sup></p>
<p><img class="center box_shadow" style="margin-top:15px;margin-bottom:20px" src="/assets/14years/ic2.jpg" width="450" /></p>
<p>I soon got a desktop computer: a Gateway mid-tower with an i5-2320 and 8GB DDR3.
This encouraged me to continue programming, but I didn’t learn Java—the internet claimed Minecraft’s mediocre performance was due to its choice of language.</p>
<p>Instead, I embarked on learning the <em>lingua franca</em> of game development: C++.
I started with <a href="https://www.bloodshed.net/">Dev-C++ 4.9.9</a>.</p>
<p><img class="center box_shadow" style="margin-top:10px;margin-bottom:20px" src="/assets/14years/devcpp.png" width="450" /></p>
<p>I can’t remember much from this period, but I believe I first learned C-with-classes style C++ from an online course.
Presumably, I then wrote some slightly bigger programs—but I have no record of them, so I must not have gotten very far.</p>
<p>At some point, I got stuck trying to get <a href="https://liballeg.org/">Allegro</a> to link properly.
Some things never change.</p>
<h2 id="2013">2013</h2>
<p>This is the first year from which any code is preserved.
In school, I took a robotics class, where we used <a href="https://www.robotc.net/">RobotC</a> to control LEGO Mindstorms robots.
The programs were simple and messy, including line following and maze solving (if you can call it that).</p>
<p><img class="center" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/mindstorms.jpg" width="300" /></p>
<p>Outside of school, I acquired an Arduino Uno and did some small projects in C.
Judging from files I still have lying around, I made a binary clock, as well as <em>Catch the LED</em>, a game where the player would try to press a button just as a sequence of lights passed a central LED.</p>
<p><img class="center" style="margin-top:-30px;margin-bottom:-35px" src="/assets/14years/arduino.png" width="300" /></p>
<p>I also upgraded the desktop with a GTX 650—to run Minecraft shaders, obviously.</p>
<h2 id="2014">2014</h2>
<p>I started making more progress with C++, eventually learning C++11 features and switching to Visual Studio.
Thanks to the <a href="https://lazyfoo.net/tutorials/SDL/index.php">Lazy Foo</a> tutorials, I was able to successfully link <a href="https://www.libsdl.org/">SDL</a> and write some graphical programs.</p>
<p>For reasons unknown, I then decided it was time to learn Java.
After going through another online course, I used Eclipse to write some more advanced programs, including a matrix calculator/linear solver, a Lorentz transform calculator, and a Pong clone.</p>
<p><img class="center box_shadow" style="margin-top:0px;margin-bottom:20px" src="/assets/14years/matrix.png" width="300" /></p>
<p>At this point, I started spending a lot of time reading articles and watching YouTube videos on technical topics.</p>
<p>I discovered cryptocurrencies and set up my GPU to mine Dogecoin (what else?) for a few weeks.
Too bad I lost the wallet—it’d be worth a whole $200 today.</p>
<div>
<p><img class="center" style="float:right;padding-left:10px;margin-left:15px;margin-bottom:15px" src="/assets/14years/hmhlogo.png" width="140" /></p>
<div>
<p>In late 2014, I happened upon one of Jonathan Blow’s streams discussing <a href="https://www.youtube.com/watch?v=TH9VCN6UkyQ&list=PLmV5I2fxaiCKfxMBrNsU1kgKJXD3PkyxO&index=1">Jai</a>, a programming language for games.
I recall being particularly impressed by the compile-time code execution demo, which seemed so much better than inscrutable C++ templates.
More importantly at the time, he mentioned <a href="https://handmadehero.org/">Handmade Hero</a>, Casey Muratori’s new series on creating a game from scratch.</p>
</div>
</div>
<p>It was just what I wanted to learn—the details of how complex programs actually worked.</p>
<h2 id="2015">2015</h2>
<p>I followed Handmade Hero for a few months, learning a lot about low-level programming and operating systems.<sup id="fnref:hmh" role="doc-noteref"><a href="#fn:hmh" class="footnote" rel="footnote">4</a></sup>
I attempted to follow along with the code, but didn’t have sufficient free time to keep it up indefinitely.
Still, the series had a lasting impact on my approach to programming.
I learned how to understand programs at multiple levels of abstraction, as well as the practice of <a href="https://caseymuratori.com/blog_0015">semantic compression</a>.</p>
<p><img class="center box_shadow" style="margin-top:0px;margin-bottom:20px" src="/assets/14years/truly_x86.png" width="450" /></p>
<p>That summer, I took my first formal Computer Science course at the local university.
It was an introduction to programming in C++, which I found quite boring.</p>
<p>I built my first computer, scouring the internet for deals and putting together an i7-5820k, GTX 970, 16GB DDR4 build for ~$1000.<sup id="fnref:desktop2" role="doc-noteref"><a href="#fn:desktop2" class="footnote" rel="footnote">5</a></sup>
I repurposed the old hardware as a media / game server running <a href="https://unraid.net/">Unraid</a>.</p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/pc1.jpg" width="400" /></p>
<p>In the fall, I took a second CS course, but it wasn’t much better.
More importantly, I started teaching programming.
I ran a weekly elective class, teaching yet another introduction to programming in C++.
The first semester covered the basics: control flow, functions, IO, memory management, classes, and sorting.
For sample projects, I wrote a square root calculator, maze escape game, and BS poker card game.</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:30px" src="/assets/14years/elective.png" width="700" /></p>
<p>Outside of school, I started using <a href="https://www.sublimetext.com/">Sublime Text</a> in conjunction with the Visual Studio debugger.
I learned how to use Git and pushed my first project to GitHub.
It was my most complex program yet: a text-based RPG engine that would interpret a file containing a tree of locations/prompts.
The code was rather disorganized and probably leaked memory, but it successfully performed recursive-descent parsing and tree traversal.</p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/rpg2.png" width="500" /></p>
<p>At some point, I learned a bit of Python, but didn’t stick with it.</p>
<h2 id="2016">2016</h2>
<p>Inspired by Handmade Hero, I started working on my own 2D game engine.
It wasn’t totally from scratch—I used SDL2 and the C++ STL.
I would work on it for the rest of the year.</p>
<p>The “engine” part mostly consisted of wrappers around SDL2 and associated libraries, but also included a sprite renderer, a chunked world system, hot reloading, text rendering, profiling features, and (buggy) collisions.</p>
<p><img class="center box_shadow" style="margin-top:15px;margin-bottom:20px" src="/assets/14years/2dgame.png" width="600" /></p>
<p>I taught a second semester of programming, this time covering more advanced C++ topics like operator overloading, basic data structures, templates, and OOP.
I started converting the notes I had written into a <a href="https://thenumb.at/cpp-course/">website</a>, which gets a fair amount of traffic to this day.
I wrote a “Simple Drawing Library” and used it to demonstrate DFS/BFS.</p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/sdl.png" width="300" /></p>
<p>In the fall, I took Data Structures, which was more interesting than the introductory courses.
I taught a third semester, this time focusing on SDL2 and game programming patterns.
I wrote my own condensed version of the Lazy Foo tutorials I had followed in years prior.</p>
<p>I got tired of the 2D engine and embarked on learning 3D rendering.
I started learning OpenGL 3.3 from <a href="https://learnopengl.com/">learnopengl.com</a> and found it quite fun.
At the time, I was taking Calculus III, so I started working on a 3D graphing program.
I also discovered <a href="https://github.com/ocornut/imgui">Dear ImGui</a>, which I’ve used ever since.</p>
<div style="text-align: center;margin-top:15px">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/3dgraph.png" width="500" />
<strong><em>
<a href="https://github.com/TheNumbat/3D-Grapher">[GitHub]</a>
</em></strong></p>
</div>
<p>I would occasionally add features for ~1.5 years, and the grapher would become my first program that anyone else actually used.</p>
<h2 id="2017">2017</h2>
<p>In spring, I took a “Math for Computer Science” (i.e. logic and proofs) class, which was new and interesting.
I played <a href="https://store.steampowered.com/app/210970/The_Witness/">The Witness</a>, which doesn’t directly relate to programming, but greatly influenced how I think about game design.</p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/witnessme.jpg" width="500" /></p>
<p>I taught a fourth semester, this time covering data structures and C++11 topics.
I stopped updating the website, but created lessons on stacks, queues, lists, heaps, balanced trees, hash maps, graphs, threading, smart pointers, move semantics, exceptions, and basic functional programming.</p>
<p>In the summer, I began a larger project: a Minecraft-inspired 3D voxel engine called <em>Exile</em>.
This time, it was from scratch: I wrote my own data structures, Win32 and Linux platform code, ImGui, and OpenGL 4 renderer.</p>
<p>The project started in the <a href="https://github.com/odin-lang/Odin">Odin</a> programming language, but I quickly switched back to C++ since the Odin compiler wasn’t stable at the time.
I also tried streaming my work on Twitch, but never consistently.
Finally, I started tracking my ever-growing repository of bookmarks on <a href="https://github.com/TheNumbat/Lists">GitHub</a>.</p>
<div style="text-align: center;margin-top:20px">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/cavegame.png" width="550" />
<strong><em>
<a href="https://github.com/TheNumbat/exile">[GitHub]</a>
</em></strong></p>
</div>
<p>I would work on <em>Exile</em> sporadically for the next several years, eventually rewriting it multiple times and evolving it into the pure-rendering project I’m working on today.</p>
<p>I did a month-long internship with a local mobile gaming startup, where I wrote some promotional Facebook games in JavaScript.
I don’t think they amounted to anything, but I got paid (a bit) for programming.</p>
<p>In the fall, I started supervising a few group projects instead of teaching new content.
I took Computer Graphics, where (yet again) I already knew a good deal of the content—but I enjoyed it anyway.
I made a solar system simulator, a physics-based pinball game, and a chunk-based voxel renderer that I later used in <em>Exile</em>.</p>
<div class="center" style="align-items: center;justify-content: center;display:flex;margin-top:-5px">
<p><img class="box_shadow" style="flex:50%" src="/assets/14years/solarsystem.png" width="400" />
<img class="box_shadow" style="flex:50%" src="/assets/14years/pinball.png" width="400" />
<img class="center box_shadow" style="flex:50%" src="/assets/14years/notminecraft.png" width="400" /></p>
</div>
<h2 id="2018">2018</h2>
<p>During spring/summer, I spent a lot of time working on <em>Exile</em>.</p>
<p>I learned more advanced rendering techniques, including deferred lighting, ambient occlusion, and environment maps.
I designed a memory efficient rendering pipeline for voxel data, and wrote a libclang-based metaprogram that generated reflection data.</p>
<div style="text-align: center;">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/exile.png" width="750" />
<strong><em>
<a href="https://github.com/TheNumbat/exile">[GitHub]</a>
</em></strong></p>
</div>
<p>I wrote my first three blog posts on <em>Exile</em>’s <a href="/Hot-Reloading-in-Exile">hot reloading</a>, <a href="/Reflection-in-Exile">reflection</a>, and <a href="/Voxel-Meshing-in-Exile">voxel meshing</a> systems.</p>
<p>I then left home to attend university.
I started taking many technical courses, but I’ll only be mentioning those with significant programming projects.
It started with another round of discrete math and data structures, which were a big step up in difficulty.</p>
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:20px" src="/assets/14years/dietcoke.png" width="250" /></p>
<p>My friend and I participated in a hackathon, where we made a visualization app for high-dimensional machine learning data.</p>
<div style="text-align: center;">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/vizml.png" width="450" />
<strong><em>
<a href="https://github.com/TheNumbat/viz.ml">[GitHub]</a>
</em></strong></p>
</div>
<p>I also started using VSCode, learned \(\LaTeX\), and basically never handwrote anything again.</p>
<h2 id="2019">2019</h2>
<p>In spring, I took a functional programming class, which I found pretty elegant and intuitive.
The class was taught in SML, exposing me to a language with a coherent design.
I continued using C++, but started applying more functional programming techniques.</p>
<p><img class="center" src="/assets/14years/smlnj.jpeg" width="125" /></p>
<p>In summer, I interned at NVIDIA, working on some automated OpenGL benchmarking tools.
It turned out there wasn’t much work to do, so I spent much of my time working on a ray tracer based
on <a href="https://raytracing.github.io/"><em>Ray Tracing in One Weekend</em></a>, as well as <em>Exile</em>.
I found and reported a miscompilation bug in the Visual Studio compiler.</p>
<div style="text-align: center;margin-top:20px">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/dawn.png" width="500" />
<strong><em>
<a href="https://github.com/TheNumbat/Dawn">[GitHub]</a>
</em></strong></p>
</div>
<p>In fall, I took several more classes.</p>
<ul>
<li>Computer systems, involving a malloc implementation, shell, and web proxy in C.</li>
<li>Parallel algorithms, involving various span-efficient programs in SML.</li>
<li>Computer graphics (again), involving a rasterizer, mesh editor, path tracer, and animation system in C++.</li>
</ul>
<p><img class="center box_shadow" style="margin-top:25px;margin-bottom:30px" src="/assets/14years/old_scotty3d.png" width="450" /></p>
<p>I particularly enjoyed graphics, but was frustrated with the course’s out-of-date OpenGL 2 codebase, <em>Scotty3D</em>.
Therefore, I applied to become a teaching assistant for the course, and began work on a new version of <em>Scotty3D</em> in modern C++17 and OpenGL 4.</p>
<div style="text-align: center;margin-bottom:-15px;margin-top:15px">
<p><img class="center box_shadow" style="margin-top:5px;margin-bottom:10px" src="/assets/14years/s4d.png" width="500" />
<strong><em>
<a href="https://github.com/CMU-Graphics/Scotty3D">[GitHub]</a>
</em></strong></p>
</div>
<h2 id="2020">2020</h2>
<p>In spring, I took a course on operating systems, where (along with a partner) I wrote a bare-metal Sokoban game, a user-space threading library, and an x86 kernel, all from scratch in C.
Our kernel ran on real hardware, supporting virtual memory, preemptive multitasking, and concurrent syscalls.
The course taught me a lot about concurrency and system design, though our final product was certainly <em>not</em> entirely concurrency-safe.<sup id="fnref:concurrency" role="doc-noteref"><a href="#fn:concurrency" class="footnote" rel="footnote">6</a></sup></p>
<p><img class="center box_shadow" style="margin-top:15px;margin-bottom:20px" src="/assets/14years/os.png" width="500" /></p>
<p>In the summer, I interned at Apple, working on more GPU benchmarking tools.
Unfortunately, this was the summer of COVID, so my project was derailed and I again didn’t end up with much work to do.
Hence, I spent a lot of time on <em>Scotty3D</em>.
By the end of the summer, it was ready to support three of the four major assignments.</p>
<div style="text-align: center;">
<div class="center" style="align-items: center;justify-content: center;display:flex;margin-bottom:-15px;margin-top:-15px">
<p><img class="box_shadow" style="flex:50%" src="/assets/14years/meshedit.png" width="393" />
<img class="box_shadow" style="flex:50%" src="/assets/14years/render.png" width="400" />
<img class="center box_shadow" style="flex:50%" src="/assets/14years/animate.png" width="475" /></p>
</div>
<p><strong><em>
<a href="https://github.com/CMU-Graphics/Scotty3D">[GitHub]</a>
</em></strong></p>
</div>
<p>That fall, we released the new version—I had to fix many bugs, but overall it was a big hit.
Students were able to produce much cooler results than in previous years.</p>
<p>I also took a compilers course, where (along with a partner) I wrote a compiler for the C0 language.
We implemented the compiler in Rust, which I would describe as a redesign of C++ without (most of) the insanity.<sup id="fnref:rust" role="doc-noteref"><a href="#fn:rust" class="footnote" rel="footnote">7</a></sup>
Much of the work was in the compiler backend, which supported SSA, SCCP, register allocation, x86 code generation, and fork/join parallelism.</p>
<p><img style="margin-top:-10px" class="center" src="/assets/14years/crab.png" width="175" /></p>
<p>Finally, I started writing a new version of <em>Exile</em> from scratch, this time using Vulkan and modern C++17.
Keeping with the theme, I again started by writing my own standard library.
To learn Vulkan, I went through the <a href="https://vulkan-tutorial.com/">Vulkan Tutorial</a>, which was useful, but I later realized it contained some bad advice.</p>
<h2 id="2021">2021</h2>
<p>I built a new computer: a compact ITX system with a Ryzen 5950X, RTX 3090, and 32GB DDR4.
The GPU shortage was in full swing, so I borrowed the GPU from one of the graphics PhD students—ostensibly for research purposes—and later replaced it with an RTX 3080 Ti.</p>
<p><img class="center box_shadow" style="margin-bottom:22px;margin-top:5px" src="/assets/14years/pc2.png" width="450" /></p>
<p>In spring, I took several more classes:</p>
<ul>
<li>Parallel architecture & programming, involving ILP, SIMD, CUDA, OpenMP, and MPI.</li>
<li>A research assistantship, where I worked on high-performance closest point queries.</li>
<li>Physics-based rendering, where I implemented a real-time path tracer using the 3090’s RTX hardware.</li>
</ul>
<div style="text-align: center;">
<p><img style="margin-top:15px;margin-bottom:10px" class="center box_shadow" src="/assets/14years/gpurt.png" width="500" />
<strong><em>
<a href="https://github.com/TheNumbat/GPU-RT">[GitHub]</a>
</em></strong></p>
</div>
<ul>
<li>Technical animation, where I added various simulation features to <em>Scotty3D</em>.</li>
</ul>
<div style="text-align: center;margin-top:20px;margin-bottom:15px">
<p><img class="box_shadow" style="flex:50%" src="/assets/14years/cloth.png" width="400" />
<img class="box_shadow" style="flex:50%" src="/assets/14years/fluid.png" width="390" /></p>
</div>
<p>In the summer, I interned at Jane Street, where I learned OCaml and gained more practical experience with functional programming.
I worked on an autocompletion engine for S-expressions, as well as APIs for risk management.</p>
<p><img class="center" style="margin-top:-5px;margin-bottom:20px" src="/assets/14years/ohcamel.jpg" width="300" /></p>
<p>In the fall, I took a programming languages course (also in OCaml), where I learned more about type systems, static analysis, and theorem provers.
I also took “deep learning systems,” where I implemented my own version of PyTorch in Python and CUDA.
I hadn’t done any formal machine learning courses, so I got to learn a lot about both deep learning and its practical implementation.</p>
<p><img class="center" style="margin-top:-10px;margin-bottom:5px" src="/assets/14years/conv.png" width="300" /></p>
<p>I did another research assistantship, where I worked on sampling strategies for differentiable rendering in <a href="https://www.mitsuba-renderer.org/">Mitsuba</a>.
I didn’t really have enough time for this project, so it didn’t amount to much.</p>
<p><img class="center" style="margin-top:0px;margin-bottom:20px" src="/assets/14years/tridir.png" width="450" /></p>
<p>Up until now, I had been a TA for the graphics course.
This semester, I switched to game programming, which was pretty fun.</p>
<p>Finally, I graduated.
With my newfound free time, I decided to restart my blog.
I wrote a couple short posts on <a href="/Compiler-Bug">that compiler bug from 2019</a> and <a href="/Hamming-Hats">hamming codes</a>.</p>
<h2 id="2022">2022</h2>
<p>In the spring, I felt pretty burnt out after seven semesters and three summers of work, so I didn’t do much of anything (other than finally getting to Grandmaster in Overwatch).</p>
<p>I eventually wrote a longer, interactive post about <a href="/Exponential-Rotations">3D rotations and the exponential map</a>.
The response greatly exceeded my expectations—it was my first post to get a lot of traffic, mostly from Hacker News and Twitter.</p>
<p><img class="center" style="margin-top:0px;margin-bottom:5px" src="/assets/14years/rotations.png" width="450" /></p>
<p>In the summer, I did another brief stint as a research assistant working on additional updates to <em>Scotty3D</em>.
We wrote a new instanced scene graph, a test suite for the assignments, a custom file format, and integrated the remaining major assignment into the codebase.</p>
<p>I wrote another interactive post on <a href="/Autodiff">differentiable programming</a>, which received an honorable mention in the <a href="https://www.youtube.com/watch?v=cDofhN-RJqg">Summer of Math Exposition</a>.</p>
<p><img class="center" style="margin-top:0px;margin-bottom:5px" src="/assets/14years/autodiff.png" width="550" /></p>
<p>I did some more work on <em>Exile</em>, rewriting the Vulkan abstractions and using RTX hardware to dynamically render huge amounts of cubes at high frame rates.</p>
<p><img class="center box_shadow" style="margin-top:0px;margin-bottom:25px" src="/assets/14years/cubes.png" width="550" /></p>
<p>I then became a full time software engineer on Jane Street’s compilers team.
I began working on profiling and optimization tools, as well as learning more OCaml and functional programming theory.</p>
<p>I wrote another post on <a href="/Neural-Graphics">neural graphics primitives</a>, the first to cover a research topic.</p>
<p><img class="center" style="margin-top:-5px;margin-bottom:5px" src="/assets/14years/ingp.png" width="500" /></p>
<p>In the winter, I didn’t do much programming, but briefly worked on a puzzle game concept with a friend.</p>
<p><img class="center box_shadow" style="margin-top:20px;margin-bottom:5px" src="/assets/14years/blocks.png" width="400" /></p>
<h2 id="2023">2023</h2>
<p>In spring, I wrote posts on <a href="/Spherical-Integration">spherical integration</a> and <a href="/Hashtables">open-addressed hash tables</a>, which also became relatively popular.</p>
<p><img class="center" style="margin-top:20px;margin-bottom:15px" src="/assets/14years/hashtables.png" width="750" /></p>
<p>For work, I wrote a series of articles covering the addition of <em>modes</em> to the OCaml type system.
We’re slowly making OCaml more akin to Rust, so I titled the series <a href="https://blog.janestreet.com/oxidizing-ocaml-locality/">Oxidizing OCaml</a>.</p>
<p><img class="center" style="margin-top:-5px;margin-bottom:35px" src="/assets/14years/capsule.svg" width="450" /></p>
<p>In the summer, I started contributing primarily to the OCaml compiler itself.
I implemented support for SIMD types & operations, as well as other additions to the x86 backend.</p>
<p>I wrote my longest post yet on <a href="/Functions-are-Vectors">functional analysis</a>, which received another honorable mention in the <a href="https://www.youtube.com/watch?v=6a1fLEToyvU&list=PLnQX-jgAF5pQS2GUFCsatSyZkSH7e8UM8">Summer of Math Exposition</a>.</p>
<p><img class="center" style="margin-top:-5px;margin-bottom:5px" src="/assets/14years/vectors.png" width="450" /></p>
<p><em>Functions are Vectors</em> took so long (likely 100+ hours) that I lost motivation to write anything else until the post you’re reading right now.</p>
<p>Instead, I did some more programming: I ripped out the core of <em>Exile</em> and started a new project focused on real-time path tracing.
I did another pass on the Vulkan abstraction layer, which I think is finally in a good state.
It covers all the fancy features you’re supposed to use in Vulkan, like multiple frames in flight, multithreaded command buffer recording, async compute and transfer queues, dynamic rendering, and custom allocators.</p>
<p><img class="center box_shadow" style="margin-top:25px;margin-bottom:25px" src="/assets/14years/dragon.png" width="450" /></p>
<p>For reasons of masochism, I then decided to rewrite my C++ standard library <em>again</em>, this time in C++20.
I think it’s also finally in a pretty good state, so I put the current version on <a href="https://github.com/TheNumbat/rpp">GitHub</a>.</p>
<p><img class="center box_shadow" style="margin-top:25px;margin-bottom:25px" src="/assets/14years/rpp.JPG" width="450" /></p>
<p>And that brings us to the present.
I’m working on a post covering the design principles behind the library, as well as how I got it to compile in 100ms per invocation.</p>
<h1 id="then-vs-now">Then vs. Now</h1>
<p>I’m part of the first generation to grow up with the Web, but the last to remember it before the rise of smartphones.
In some respects, this meant the late 2000s was the easiest time in history to get started in computing.</p>
<p>You may have noticed this post doesn’t reference any textbooks, nor any personal mentors (though it does include a degree).
That’s because I was able to learn primarily from the internet, which wasn’t really feasible for previous generations.
However, the leap from using computers to modifying them was still small: doing anything interesting required delving into desktop computing, which led to programming.</p>
<p>Now, users can get by without learning to type, let alone program—so most never do.
On the other hand, online resources are more accessible than ever.
The pandemic heralded a big increase in the amount of university-level content online, and I expect that trend to continue.
Going beyond YouTube and its ilk, students of the 2020s will have access to LLMs, which can (arguably) already function as both a search engine and a tutor.</p>
<p>We’ll see how it turns out.</p>
<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:games" role="doc-endnote">
<p>See <a href="https://store.steampowered.com/app/1368820/RollerCoaster_Tycoon_3_Complete_Edition/">Roller Coaster Tycoon 3</a>, <a href="https://en.wikipedia.org/wiki/Marble_Blast_Gold">Marble Blast Gold</a>, and <a href="https://en.wikipedia.org/wiki/Enigmo">Enigmo</a>. <a href="#fnref:games" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:guessanumber" role="doc-endnote">
<p>Did I watch <a href="https://www.youtube.com/watch?v=QS4weDRoOKQ">this video</a>? I have no idea. <a href="#fnref:guessanumber" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:mods" role="doc-endnote">
<p>See IndustrialCraft, BuildCraft, RedPower, ComputerCraft, and Railcraft. Remember Hamachi? <a href="#fnref:mods" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hmh" role="doc-endnote">
<p>Don’t start here if you want to finish a game. It was great for learning systems programming, though. <a href="#fnref:hmh" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:desktop2" role="doc-endnote">
<p>The human eye can’t see more than 3.5GB anyway. <a href="#fnref:desktop2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:concurrency" role="doc-endnote">
<p>Arguably, (unstructured) concurrency is the only really difficult problem in programming. Help from the compiler is sorely needed. <a href="#fnref:concurrency" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rust" role="doc-endnote">
<p>I think Rust is a good language, but I’ve since avoided it because I value short compile times and don’t work on anything security-critical. I also do borrow-checker-unfriendly things (e.g. using Vulkan) that are annoying in Rust. However, I may return to it now that the <a href="https://blog.rust-lang.org/2023/11/09/parallel-rustc.html">parallel frontend</a> has landed. <a href="#fnref:rust" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
NYC: Botanical Garden2023-08-15T00:00:00+00:002023-08-15T00:00:00+00:00https://thenumb.at/Botanical-Garden<p>New York Botanical Garden, New York, NY, 2023</p>
<div style="text-align: center;"><img src="/assets/botanical-garden/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/botanical-garden/14.webp" /></div>
<p></p>New York Botanical Garden, New York, NY, 2023Functions are Vectors2023-07-29T00:00:00+00:002023-07-29T00:00:00+00:00https://thenumb.at/Functions-are-Vectors<p>Conceptualizing functions as infinite-dimensional vectors lets us apply the tools of linear algebra to a vast landscape of new problems, from image and geometry processing to curve fitting, light transport, and machine learning.</p>
<p><em>Prerequisites: <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">introductory linear algebra</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr">introductory calculus</a>, <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6">introductory differential equations</a>.</em></p>
<p><em>This article received an honorable mention in 3Blue1Brown’s <a href="https://www.youtube.com/watch?v=6a1fLEToyvU">Summer of Math Exposition 3</a>!</em></p>
<ul>
<li><a href="#functions-as-vectors">Functions as Vectors</a>
<ul>
<li><a href="#vector-spaces">Vector Spaces</a></li>
<li><a href="#linear-operators">Linear Operators</a></li>
<li><a href="#diagonalization">Diagonalization</a></li>
<li><a href="#inner-product-spaces">Inner Product Spaces</a></li>
<li><a href="#the-spectral-theorem">The Spectral Theorem</a></li>
</ul>
</li>
<li><a href="#applications">Applications</a>
<ul>
<li><a href="#fourier-series">Fourier Series</a></li>
<li><a href="#image-compression">Image Compression</a></li>
<li><a href="#geometry-processing">Geometry Processing</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</li>
</ul>
<hr style="margin-top: 25px; margin-bottom: 20px" />
<h1 id="functions-as-vectors">Functions as Vectors</h1>
<p>Vectors are often first introduced as lists of real numbers—i.e. the familiar notation we use for points, directions, and more.</p>
<div class="flextable">
<div class="item">
$$ \mathbf{v} = \begin{bmatrix}x\\y\\z\end{bmatrix} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 200px" src="/assets/functions-are-vectors/vector.svg" />
</div>
</div>
<p>You may recall that this representation is only one example of an abstract <em>vector space</em>.
There are many other types of vectors, such as lists of <a href="https://en.wikipedia.org/wiki/Complex_number">complex numbers</a>, <a href="https://en.wikipedia.org/wiki/Cycle_space">graph cycles</a>, and even <a href="https://pdodds.w3.uvm.edu/files/papers/others/1980/ward1980a.pdf">magic squares</a>.</p>
<p>However, all of these vector spaces have one thing in common: a <em>finite</em> number of dimensions.
That is, each kind of vector can be represented as a collection of \(N\) numbers, though the definition of “number” varies.</p>
<p>If any \(N\)-dimensional vector is essentially a length-\(N\) list, we could also consider a vector to be a <em>mapping</em> from an index to a value.</p>
\[\begin{align*}
\mathbf{v}_1 &= x\\
\mathbf{v}_2 &= y\\
\mathbf{v}_3 &= z
\end{align*}\ \iff\ \mathbf{v} = \begin{bmatrix}x \\ y \\ z\end{bmatrix}\]
<p>What does this perspective hint at as we increase the number of dimensions?</p>
<svg id="fvec" class="center" width="500" height="350"></svg>
<div class="flextable">
<div class="item" style="margin-right: 15px">
<div style="display: inline-flex"><p>Dimensions</p>
<input type="range" min="5" max="60" value="5" class="slider item" id="fvec_slider" style="min-width: 150px" />
</div></div>
<div class="item" id="fvec_vec" style="margin-left: 15px"></div>
</div>
<p>In higher dimensions, vectors start to look more like functions!</p>
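<p>To make this concrete, here is a minimal Python sketch that builds an \(N\)-dimensional vector by sampling a function at \(N\) evenly spaced points. The helper name <code>sample</code>, the choice of \(\sin\), and the interval \([0, \pi]\) are assumptions made for illustration; they are not taken from the interactive figure above.</p>

```python
import math

def sample(f, n, lo=0.0, hi=1.0):
    """Return [f(x_0), ..., f(x_{n-1})] at n evenly spaced points in [lo, hi].

    Hypothetical helper: entry i of the resulting vector is the value
    of f at the i-th sample point, so the list is a mapping from an
    index to a value.
    """
    if n < 2:
        raise ValueError("need at least two sample points")
    step = (hi - lo) / (n - 1)
    return [f(lo + i * step) for i in range(n)]

# A coarse 5-dimensional vector and a finer 60-dimensional one.
v5 = sample(math.sin, 5, 0.0, math.pi)
v60 = sample(math.sin, 60, 0.0, math.pi)
```

<p>Plotting <code>v60</code> against its indices would trace out something much closer to the curve of \(\sin\) than <code>v5</code> does, even though both are ordinary finite lists.</p>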
<h2 id="countably-infinite-indices">Countably Infinite Indices</h2>
<p>Of course, a finite-length vector only specifies a value at a limited number of indices.
Could we instead define a vector that contains infinitely many values?</p>
<p>Writing down a vector representing a function on the natural numbers (\(\mathbb{N}\))—or any other <a href="https://en.wikipedia.org/wiki/Countable_set"><em>countably infinite</em></a> domain—is straightforward: just extend the list indefinitely.</p>
<div class="flextable">
<div class="item">
$$ \begin{align*}\mathbf{v}_1 &= 1\\\mathbf{v}_2 &= 2\\ &\vdots \\ \mathbf{v}_i &= i\end{align*}\ \iff\ \mathbf{v} = \begin{bmatrix}1 \\ 2 \\ 3 \\ \vdots \end{bmatrix} $$
</div>
<div class="item">
<img style="min-width: 150px; margin-top: 10px" src="/assets/functions-are-vectors/identity.svg" />
</div>
</div>
<p>This vector could represent the function \(f(x) = x\), where \(x \in \mathbb{N}\).<sup id="fnref:polynomials" role="doc-noteref"><a href="#fn:polynomials" class="footnote" rel="footnote">1</a></sup></p>
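<p>A program cannot store infinitely many entries, but it can describe them lazily. The following Python sketch, under the assumption that we only ever inspect finitely many entries, represents the vector above as a generator; the name <code>identity_vector</code> is hypothetical.</p>

```python
from itertools import count, islice

def identity_vector():
    """Lazily yield the entries v_1, v_2, v_3, ... where v_i = i."""
    return count(1)

# Only the entries we actually ask for are ever computed.
first_five = list(islice(identity_vector(), 5))
```

<p>Here <code>first_five</code> is a finite prefix of the infinite list, while the generator itself stands in for the whole vector.</p>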
<h2 id="uncountably-infinite-indices">Uncountably Infinite Indices</h2>
<p>Many interesting functions are defined on the real numbers (\(\mathbb{R}\)), so they may not be representable as a countably infinite vector.
Therefore, we will have to make a larger conceptual leap: not only will our set of indices be infinite, it will be <a href="https://en.wikipedia.org/wiki/Uncountable_set"><em>uncountably infinite</em></a>.</p>
<p>That means we can’t write down vectors as lists at all—it is <a href="https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">impossible</a> to assign an integer index to each element of an uncountable set.
So, how can we write down a vector mapping a <em>real</em> index to a certain value?</p>
<p>Now, a vector really is just an arbitrary function:</p>
<div class="flextable">
<div class="item">
$$ \mathbf{v}_{x} = x^2\ \iff\ \mathbf{v} = \begin{bmatrix} x \mapsto x^2 \end{bmatrix} $$
</div>
<div class="item">
<img class="diagram2d center" style="min-width: 150px; margin-top: 10px; margin-bottom: 10px" src="/assets/functions-are-vectors/quadratic.svg" />
</div>
</div>
<p>Precisely defining how and why we can represent functions as infinite-dimensional vectors is the purview of <em>functional analysis</em>.
In this post, we won’t attempt to prove our results in infinite dimensions: we will focus on building intuition via analogies to finite-dimensional linear algebra.</p>
<hr />
<h1 id="vector-spaces">Vector Spaces</h1>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=TgKwz5Ikpc8">Abstract vector spaces | Chapter 16, Essence of linear algebra</a>.</em></p>
</div>
<p>Formally, a <em>vector space</em> is defined by choosing a set of vectors \(\mathcal{V}\), a scalar <a href="https://en.wikipedia.org/wiki/Field_(mathematics)">field</a> \(\mathbb{F}\), and a zero vector \(\mathbf{0}\).
The field \(\mathbb{F}\) is often the real numbers (\(\mathbb{R}\)), complex numbers (\(\mathbb{C}\)), or a finite field such as the integers modulo a prime (\(\mathbb{Z}_p\)).</p>
<p>Additionally, we must specify how to add two vectors and how to multiply a vector by a scalar.</p>
\[\begin{align*}
(+)\ &:\ \mathcal{V}\times\mathcal{V}\mapsto\mathcal{V}\\
(\cdot)\ &:\ \mathbb{F}\times\mathcal{V} \mapsto \mathcal{V}
\end{align*}\]
<p>To describe a vector space, our definitions must entail several <em>vector space axioms</em>.</p>
<h2 id="a-functional-vector-space">A Functional Vector Space</h2>
<p>In the following sections, we’ll work with the vector space of real functions.
To avoid ambiguity, square brackets are used to denote function application.</p>
<ul>
<li>The scalar field \(\mathbb{F}\) is the real numbers \(\mathbb{R}\).</li>
<li>The set of vectors \(\mathcal{V}\) contains functions from \(\mathbb{R}\) to \(\mathbb{R}\).<sup id="fnref:rtor" role="doc-noteref"><a href="#fn:rtor" class="footnote" rel="footnote">2</a></sup></li>
<li>\(\mathbf{0}\) is the zero function, i.e. \(\mathbf{0}[x] = 0\).</li>
</ul>
<p>Adding functions corresponds to applying the functions separately and summing the results.</p>
<div class="flextable">
<div class="item">
$$ (f + g)[x] = f[x] + g[x] $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 300px; margin-top: 25px" src="/assets/functions-are-vectors/fadd.svg" />
</div>
</div>
<p>This definition generalizes the typical element-wise addition rule—it’s like adding the two values at each index.</p>
\[f+g = \begin{bmatrix}f_1 + g_1 \\ f_2 + g_2 \\ \vdots \end{bmatrix}\]
<p>Multiplying a function by a scalar corresponds to applying the function and scaling the result.</p>
<div class="flextable">
<div class="item">
$$ (\alpha f)[x] = \alpha f[x] $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 300px; margin-top: 25px" src="/assets/functions-are-vectors/fscale.svg" />
</div>
</div>
<p>This rule similarly generalizes element-wise multiplication—it’s like scaling the value at each index.</p>
\[\alpha f = \begin{bmatrix}\alpha f_1 \\ \alpha f_2 \\ \vdots \end{bmatrix}\]
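<p>These two rules translate almost directly into code. The following is a minimal sketch that models vectors as Python callables; the names <code>add</code>, <code>scale</code>, and <code>zero</code> are hypothetical helpers chosen for this example.</p>

```python
def add(f, g):
    """Vector addition: (f + g)[x] = f[x] + g[x]."""
    return lambda x: f(x) + g(x)

def scale(alpha, f):
    """Scalar multiplication: (alpha f)[x] = alpha * f[x]."""
    return lambda x: alpha * f(x)

def zero(x):
    """The zero vector: 0[x] = 0 for every x."""
    return 0.0

square = lambda x: x * x
double = lambda x: 2.0 * x

# h[x] = x^2 + 3 * (2x) = x^2 + 6x
h = add(square, scale(3.0, double))
```

<p>Note that <code>add</code> and <code>scale</code> return new functions, just as adding or scaling lists of numbers returns a new list.</p>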
<h2 id="proofs">Proofs</h2>
<p>Given these definitions, we can now prove all necessary vector space axioms.
We will illustrate the analog of each property in \(\mathbb{R}^2\), the familiar vector space of two-dimensional arrows.</p>
<div class="dropdown">
<div class="dropdown-header">
<strong>Vector Addition is Commutative</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u}, \mathbf{v} \in \mathcal{V}$$</span>:
<div class="flextable">
<div class="item">
$$\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 250px" src="/assets/functions-are-vectors/commut.svg" />
</div>
</div>
Since real addition is commutative, this property follows directly from our definition of vector addition:
$$\begin{align*}
(f + g)[x] &= f[x] + g[x]\\
&= g[x] + f[x]\\
&= (g + f)[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Vector Addition is Associative</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u}, \mathbf{v}, \mathbf{w} \in \mathcal{V}$$</span>:
<div class="flextable" style="margin-top: 5px; margin-bottom: 10px">
<div class="item">
$$(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 250px" src="/assets/functions-are-vectors/assoc.svg" />
</div>
</div>
This property also follows from our definition of vector addition:
$$\begin{align*}
((f + g) + h)[x] &= (f + g)[x] + h[x]\\
&= f[x] + g[x] + h[x]\\
&= f[x] + (g[x] + h[x])\\
&= f[x] + (g + h)[x]\\
&= (f + (g + h))[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong><span class="fix-katex">$$\mathbf{0}$$</span> is an Additive Identity</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span>:
<div class="flextable">
<div class="item">
$$\mathbf{0} + \mathbf{u} = \mathbf{u} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 225px" src="/assets/functions-are-vectors/zidentity.svg" />
</div>
</div>
This one is easy:
$$\begin{align*}
(\mathbf{0} + f)[x] &= \mathbf{0}[x] + f[x]\\
&= 0 + f[x]\\
&= f[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Additive Inverses Exist</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span>, there exists a vector <span class="fix-katex">$$-\mathbf{u} \in \mathcal{V}$$</span> such that:
<div class="flextable" style="margin-top: 20px; margin-bottom: 10px">
<div class="item">
$$\mathbf{u} + (-\mathbf{u}) = \mathbf{0}$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 225px" src="/assets/functions-are-vectors/inverse.svg" />
</div>
</div>
Negation is defined as applying <span class="fix-katex">$$f$$</span> and negating the result: <span class="fix-katex">$$(-f)[x] = -f[x]$$</span>.
Clearly, <span class="fix-katex">$$-f$$</span> is also in <span class="fix-katex">$$\mathcal{V}$$</span>.
$$\begin{align*}
(f + (-f))[x] &= f[x] + (-f)[x]\\
&= f[x] - f[x]\\
&= 0\\
&= \mathbf{0}[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong><span class="fix-katex">$$1$$</span> is a Multiplicative Identity</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span>:
<div class="flextable" style="margin-bottom: 10px">
<div class="item">
$$1\mathbf{u} = \mathbf{u}$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 225px" src="/assets/functions-are-vectors/midentity.svg" />
</div>
</div>
Note that <span class="fix-katex">$$1$$</span> is specified by the choice of <span class="fix-katex">$$\mathbb{F}$$</span>.
In our case, it is simply the real number <span class="fix-katex">$$1$$</span>.
$$\begin{align*}
(1 f)[x] &= 1 f[x]\\
&= f[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Scalar Multiplication is Associative</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span> and scalars <span class="fix-katex">$$\alpha, \beta \in \mathbb{F}$$</span>:
<div class="flextable" style="margin-top: 10px; margin-bottom: 10px">
<div class="item">
$$(\alpha \beta)\mathbf{u} = \alpha(\beta\mathbf{u})$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 325px" src="/assets/functions-are-vectors/mult.svg" />
</div>
</div>
This property follows from our definition of scalar multiplication:
$$\begin{align*}
((\alpha\beta) f)[x] &= (\alpha\beta)f[x]\\
&= \alpha(\beta f[x])\\
&= \alpha(\beta f)[x]\\
&= (\alpha(\beta f))[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Scalar Multiplication Distributes Over Vector Addition</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u}, \mathbf{v} \in \mathcal{V}$$</span> and scalars <span class="fix-katex">$$\alpha \in \mathbb{F}$$</span>:
<div class="flextable" style="margin-bottom: 10px">
<div class="item">
$$\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 400px" src="/assets/functions-are-vectors/distrib1.svg" />
</div>
</div>
Again using our definitions of vector addition and scalar multiplication:
$$\begin{align*}
(\alpha (f + g))[x] &= \alpha(f + g)[x]\\
&= \alpha(f[x] + g[x])\\
&= \alpha f[x] + \alpha g[x]\\
&= (\alpha f)[x] + (\alpha g)[x]\\
&= (\alpha f + \alpha g)[x]
\end{align*}$$
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Scalar Multiplication Distributes Over Scalar Addition</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad">
For all vectors <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span> and scalars <span class="fix-katex">$$\alpha, \beta \in \mathbb{F}$$</span>:
<div class="flextable" style=" margin-bottom: 10px">
<div class="item">
$$(\alpha + \beta)\mathbf{u} = \alpha\mathbf{u} + \beta\mathbf{u}$$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 250px" src="/assets/functions-are-vectors/sdistrib.svg" />
</div>
</div>
Again using our definitions of vector addition and scalar multiplication:
$$\begin{align*}
((\alpha + \beta)f)[x] &= (\alpha + \beta)f[x]\\
&= \alpha f[x] + \beta f[x] \\
&= (\alpha f)[x] + (\beta f)[x]\\
&= (\alpha f + \beta f)[x]
\end{align*}$$
</div>
</div>
</div>
<p>Therefore, we’ve built a vector space of functions!<sup id="fnref:otherspaces" role="doc-noteref"><a href="#fn:otherspaces" class="footnote" rel="footnote">3</a></sup>
It may not be immediately obvious why this result is useful, but bear with us through a few more definitions—we will spend the rest of this post exploring powerful techniques arising from this perspective.</p>
<h2 id="a-standard-basis-for-functions">A Standard Basis for Functions</h2>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=k7RM-ot2NWY">Linear combinations, span, and basis vectors | Chapter 2, Essence of linear algebra</a>.</em></p>
</div>
<p>Unless specified otherwise, vectors are written down with respect to the <em>standard basis</em>.
In \(\mathbb{R}^2\), the standard basis consists of the two coordinate axes.</p>
<div class="flextable">
<div class="item">
$$ \mathbf{e}_1 = \begin{bmatrix}1 \\ 0\end{bmatrix},\,\, \mathbf{e}_2 = \begin{bmatrix}0 \\ 1\end{bmatrix} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 175px; margin-top: -30px; margin-bottom: -20px" src="/assets/functions-are-vectors/r2std.svg" />
</div>
</div>
<p>Hence, vector notation is shorthand for a <em>linear combination</em> of the standard basis vectors.</p>
<div class="flextable">
<div class="item">
$$ \mathbf{u} = \begin{bmatrix}\alpha \\ \beta\end{bmatrix} = \alpha\mathbf{e}_1 + \beta\mathbf{e}_2 $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 175px; margin-top: 15px; margin-bottom: -25px" src="/assets/functions-are-vectors/r2stdv.svg" />
</div>
</div>
<p>Above, we represented functions as vectors by assuming each dimension of an infinite-length vector contains the function’s result for that index.
This construction points to a natural generalization of the standard basis.</p>
<p>Just like the coordinate axes, each <em>standard basis function</em> contains a \(1\) at one index and \(0\) everywhere else.
More precisely, for every \(\alpha \in \mathbb{R}\):</p>
<div style="margin-top: -15px">
\[\mathbf{e}_\alpha[x] = \begin{cases} 1 & \text{if } x = \alpha \\ 0 & \text{otherwise} \end{cases}\]
</div>
<p>Ideally, we could express an arbitrary function \(f\) as a linear combination of these basis functions.
However, there are uncountably many of them—and we can’t simply write down a sum over the reals.
Still, considering their linear combination is illustrative:</p>
\[\begin{align*} f[x] &= \sum_{\alpha \in \mathbb{R}} f[\alpha]\mathbf{e}_\alpha[x] \\ &= f[1]\mathbf{e}_1[x] + f[2]\mathbf{e}_2[x] + f[\pi]\mathbf{e}_\pi[x] + \dots \end{align*}\]
<p>If we evaluate this “sum” at \(x\), we’ll find that all terms are zero—except \(\mathbf{e}_x\), making the result \(f[x]\).</p>
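<p>We can't enumerate uncountably many basis functions in code, but a finite stand-in shows the same selection behavior. In this sketch, <code>basis</code> and <code>combine</code> are hypothetical helpers, and restricting the combination to a finite list of distinct indices is an assumption made purely so the example can run.</p>

```python
def basis(alpha):
    """Standard basis function e_alpha: 1 at x == alpha, 0 everywhere else."""
    # Exact equality is intentional: it mirrors the case split in the definition.
    return lambda x: 1.0 if x == alpha else 0.0

def combine(f, indices):
    """Finite partial 'sum' of f[alpha] * e_alpha over distinct indices."""
    return lambda x: sum(f(a) * basis(a)(x) for a in indices)

square = lambda x: x * x
g = combine(square, [1.0, 2.0, 3.0])
```

<p>Evaluating <code>g</code> at one of the listed indices returns the value of <code>square</code> there, because every other term vanishes; evaluating it anywhere else returns zero.</p>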
<hr />
<h1 id="linear-operators">Linear Operators</h1>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=P2LTAUO1TdA">Change of basis | Chapter 13, Essence of linear algebra</a>.</em></p>
</div>
<p>Now that we can manipulate functions as vectors, let’s start transferring the tools of linear algebra to the functional perspective.</p>
<p>One ubiquitous operation on finite-dimensional vectors is transforming them with matrices.
A matrix \(\mathbf{A}\) encodes a <em>linear transformation</em>, meaning multiplication preserves linear combinations.</p>
\[\mathbf{A}(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha \mathbf{A}\mathbf{x} + \beta \mathbf{A}\mathbf{y}\]
<p>Multiplying a vector by a matrix can be intuitively interpreted as defining a new set of coordinate axes from the matrix’s column vectors.
The result is a linear combination of the columns:</p>
<div class="flextable">
<div class="item">
<div class="hide_narrow">
\[\mathbf{Ax} = \begin{bmatrix} \vert & \vert & \vert \\ \mathbf{u} & \mathbf{v} & \mathbf{w} \\ \vert & \vert & \vert \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = x_1\mathbf{u} + x_2\mathbf{v} + x_3\mathbf{w}\]
</div>
<div class="hide_wide">
\[\begin{align*}
\mathbf{Ax} &= \begin{bmatrix} \vert & \vert & \vert \\ \mathbf{u} & \mathbf{v} & \mathbf{w} \\ \vert & \vert & \vert \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} \\ &= x_1\mathbf{u} + x_2\mathbf{v} + x_3\mathbf{w}
\end{align*}\]
</div>
</div>
<div class="item">
<img class="diagram2d_wide" src="/assets/functions-are-vectors/transform2.svg" />
</div>
</div>
<p>When all vectors can be expressed as a linear combination of \(\mathbf{u}\), \(\mathbf{v}\), and \(\mathbf{w}\), the columns form a basis for the underlying vector space.
Here, the matrix \(\mathbf{A}\) transforms a vector from the \(\mathbf{uvw}\) basis into the standard basis.</p>
<p>Since functions are vectors, we could imagine transforming a function by a matrix.
Such a matrix would be infinite-dimensional, so we will instead call it a <em>linear operator</em> and denote it with \(\mathcal{L}\).</p>
<div class="hide_narrow">
\[\mathcal{L}f = \begin{bmatrix} \vert & \vert & \vert & \\ \mathbf{f} & \mathbf{g} & \mathbf{h} & \cdots \\ \vert & \vert & \vert & \end{bmatrix} \begin{bmatrix}f_1\\ f_2 \\ f_3\\ \vdots\end{bmatrix} = f_1\mathbf{f} + f_2\mathbf{g} + f_3\mathbf{h} + \cdots\]
</div>
<div class="hide_wide">
\[\begin{align*}
\mathcal{L}f &= \begin{bmatrix} \vert & \vert & \vert & \\ \mathbf{f} & \mathbf{g} & \mathbf{h} & \cdots \\ \vert & \vert & \vert & \end{bmatrix} \begin{bmatrix}f_1\\ f_2\\ f_3 \\ \vdots\end{bmatrix} \\ &= f_1\mathbf{f} + f_2\mathbf{g} + f_3\mathbf{h} + \cdots
\end{align*}\]
</div>
<p>This visualization isn’t very accurate—we’re dealing with <em>uncountably</em> infinite-dimensional vectors, so we can’t actually write out an operator in matrix form.
Nonetheless, the structure is suggestive: each “column” of the operator describes a new basis function for our functional vector space.
Just like we saw with finite-dimensional vectors, \(\mathcal{L}\) represents a change of basis.</p>
<h2 id="differentiation">Differentiation</h2>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=S0_qX4VJhMQ">Derivative formulas through geometry | Chapter 3, Essence of calculus</a>.</em></p>
</div>
<p>So, what’s an example of a linear operator on functions?
You might recall that <em>differentiation</em> is linear:</p>
\[\frac{\partial}{\partial x} \left(\alpha f[x] + \beta g[x]\right) = \alpha\frac{\partial f}{\partial x} + \beta\frac{\partial g}{\partial x}\]
<p>It’s hard to visualize differentiation on general functions, but it’s feasible for the <a href="https://en.wikipedia.org/wiki/Linear_subspace">subspace</a> of <em>polynomials</em>, \(\mathcal{P}\).
Let’s take a slight detour to examine this smaller space of functions.</p>
\[\mathcal{P} = \{ p[x] = a + bx + cx^2 + dx^3 + \cdots \}\]
<p>We typically write down polynomials as a sequence of powers, i.e. \(1, x, x^2, x^3\), etc.
All polynomials are linear combinations of the functions \(\mathbf{e}_i[x] = x^i\), so they constitute a countably infinite basis for \(\mathcal{P}\).<sup id="fnref:polynatural" role="doc-noteref"><a href="#fn:polynatural" class="footnote" rel="footnote">4</a></sup></p>
<p>This basis provides a convenient vector notation:</p>
<div class="hide_narrow">
\[\begin{align*} p[x] &= a + bx + cx^2 + dx^3 + \cdots \\ &= a\mathbf{e}_0 + b\mathbf{e}_1 + c \mathbf{e}_2 + d\mathbf{e}_3 + \dots \end{align*}\ \iff\ \mathbf{p} = \begin{bmatrix}a\\ b\\ c\\ d\\ \vdots\end{bmatrix}\]
</div>
<div class="hide_wide">
\[\begin{align*} p[x] &= a + bx + cx^2 + dx^3 + \cdots \\ &= a\mathbf{e}_0 + b\mathbf{e}_1 + c \mathbf{e}_2 + d\mathbf{e}_3 + \dots \\& \iff\ \mathbf{p} = \begin{bmatrix}a\\ b\\ c\\ d\\ \vdots\end{bmatrix} \end{align*}\]
</div>
<p>Since differentiation is linear, we’re able to apply the rule \(\frac{\partial}{\partial x} x^n = nx^{n-1}\) to each term.</p>
<div class="hide_narrow">
\[\begin{align*}\frac{\partial}{\partial x}p[x] &= \vphantom{\Bigg\vert}a\frac{\partial}{\partial x}1 + b\frac{\partial}{\partial x}x + c\frac{\partial}{\partial x}x^2 + d\frac{\partial}{\partial x}x^3 + \dots \\ &= b + 2cx + 3dx^2 + \cdots\\ &= b\mathbf{e}_0 + 2c\mathbf{e}_1 + 3d\mathbf{e}_2 + \dots\end{align*} \ \iff\ \frac{\partial}{\partial x}\mathbf{p} = \begin{bmatrix}b\\ 2c\\ 3d\\ \vdots\end{bmatrix}\]
</div>
<div class="hide_wide">
\[\begin{align*}\frac{\partial}{\partial x}p[x] &= \vphantom{\Bigg\vert}a\frac{\partial}{\partial x}1 + b\frac{\partial}{\partial x}x + c\frac{\partial}{\partial x}x^2\, +\\ & \phantom{=} d\frac{\partial}{\partial x}x^3 + \dots \\ &= b + 2cx + 3dx^2 + \cdots\\ &= b\mathbf{e}_0 + 2c\mathbf{e}_1 + 3d\mathbf{e}_2 + \dots \\ &\iff\ \frac{\partial}{\partial x}\mathbf{p} = \begin{bmatrix}b\\ 2c\\ 3d\\ \vdots\end{bmatrix}\end{align*}\]
</div>
<p>We’ve performed a linear transformation on the coefficients, so we can represent differentiation as a matrix!</p>
\[\frac{\partial}{\partial x}\mathbf{p} = \begin{bmatrix}0 & 1 & 0 & 0 & \cdots\\ 0 & 0 & 2 & 0 & \cdots\\ 0 & 0 & 0 & 3 & \cdots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}\begin{bmatrix}a\\ b\\ c\\ d\\ \vdots\end{bmatrix} = \begin{bmatrix}b\\ 2c\\ 3d\\ \vdots\end{bmatrix}\]
<p>Each column of the differentiation operator is itself a polynomial, so this matrix represents a change of basis.</p>
\[\frac{\partial}{\partial x} = \begin{bmatrix} \vert & \vert & \vert & \vert & \vert & \\ 0 & 1 & 2x & 3x^2 & 4x^3 & \cdots \\ \vert & \vert & \vert & \vert & \vert & \end{bmatrix}\]
<p>As we can see, the differentiation operator simply maps each basis function to its derivative.</p>
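<p>For polynomials of bounded degree, the operator can be truncated to a finite matrix and applied to a coefficient list. This is only a sketch: the function names <code>diff_matrix</code> and <code>apply</code> are assumptions, and the truncation to \(n\) coefficients is what makes the matrix finite.</p>

```python
def diff_matrix(n):
    """n x n truncation of the differentiation operator in the power basis.

    Row i has the entry i + 1 in column i + 1, matching d/dx x^(i+1) = (i+1) x^i.
    """
    return [[float(j) if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]

def apply(matrix, vec):
    """Matrix-vector product using plain lists."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]

# p[x] = 1 + 2x + 3x^2 + 4x^3, stored as the coefficients [a, b, c, d].
p = [1.0, 2.0, 3.0, 4.0]

# Coefficients of the derivative 2 + 6x + 12x^2 (padded with a trailing zero).
dp = apply(diff_matrix(4), p)
```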
<p>This result also applies to the larger space of <em>analytic</em> real functions, which includes polynomials, exponential functions, trigonometric functions, logarithms, and other familiar names.
By definition, an analytic function can be expressed as a Taylor series about \(0\):</p>
\[f[x] = \sum_{n=0}^\infty \frac{f^{(n)}[0]}{n!}x^n = \sum_{n=0}^\infty \alpha_n x^n\]
<p>Which is a linear combination of our polynomial basis functions.
That means a Taylor expansion is essentially a change of basis into the sequence of powers, where our differentiation operator is quite simple.<sup id="fnref:polybasis" role="doc-noteref"><a href="#fn:polybasis" class="footnote" rel="footnote">5</a></sup></p>
<hr />
<h1 id="diagonalization">Diagonalization</h1>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=PFDu9oVAE-g">Eigenvectors and eigenvalues | Chapter 14, Essence of linear algebra</a>.</em></p>
</div>
<p>Matrix decompositions are arguably the crowning achievement of linear algebra.
To get started, let’s review what diagonalization means for a \(3\times3\) real matrix \(\mathbf{A}\).</p>
<h2 id="eigenvectors">Eigenvectors</h2>
<p>A vector \(\mathbf{u}\) is an <em>eigenvector</em> of the matrix \(\mathbf{A}\) when the following condition holds:</p>
<div class="flextable">
<div class="item">
$$ \mathbf{Au} = \lambda \mathbf{u} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 200px" src="/assets/functions-are-vectors/eigenvector.svg" />
</div>
</div>
<p>The <em>eigenvalue</em> \(\lambda\) may be computed by finding the roots of the <em>characteristic polynomial</em> of \(\mathbf{A}\).
Eigenvalues may be real or complex.</p>
<p>The matrix \(\mathbf{A}\) is <em>diagonalizable</em> when it admits three linearly independent eigenvectors, each with a corresponding <strong>real</strong> eigenvalue.
This set of eigenvectors constitutes an <em>eigenbasis</em> for the underlying vector space, indicating that we can express any vector \(\mathbf{x}\) via their linear combination.</p>
<div class="flextable">
<div class="item">
$$ \mathbf{x} = \alpha\mathbf{u}_1 + \beta\mathbf{u}_2 + \gamma\mathbf{u}_3 $$
</div>
<div class="item">
<img class="diagram2d" style="margin-left: 10px; min-width: 400px" src="/assets/functions-are-vectors/eigenbasis.svg" />
</div>
</div>
<p>To multiply \(\mathbf{x}\) by \(\mathbf{A}\), we just have to scale each component by its corresponding eigenvalue.</p>
<div class="flextable">
<div class="item" style="margin-right: -40px; margin-left: 15px">
$$ \begin{align*} \mathbf{Ax} &= \alpha\mathbf{A}\mathbf{u}_1 + \beta\mathbf{A}\mathbf{u}_2 + \gamma\mathbf{A}\mathbf{u}_3 \\
&= \alpha\lambda_1\mathbf{u}_1 + \beta\lambda_2\mathbf{u}_2 + \gamma\lambda_3\mathbf{u}_3 \end{align*} $$
</div>
<div class="item">
<img class="diagram2d wider_on_mobile" style="min-width: 525px" src="/assets/functions-are-vectors/eigenscale.svg" />
</div>
</div>
<p>Finally, re-combining the eigenvectors expresses the result in the standard basis.</p>
<p><img class="diagram2d center" style="width: 400px" src="/assets/functions-are-vectors/eigenbasis2.svg" /></p>
<p>Intuitively, we’ve shown that multiplying by \(\mathbf{A}\) is equivalent to a change of basis, a scaling, and a change back.
That means we can write \(\mathbf{A}\) as the product of an invertible matrix \(\mathbf{U}\) and a diagonal matrix \(\mathbf{\Lambda}\).</p>
<div class="hide_narrow">
\[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^{-1}} \\
&= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}
\begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}^{-1}
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^{-1}} \\
&= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}
\\ & \phantom{=} \begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\\ & \phantom{=} \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}^{-1}
\end{align*}\]
</div>
<p>Note that \(\mathbf{U}\) is invertible because its columns (the eigenvectors) form a basis for \(\mathbb{R}^3\).
When multiplying by \(\mathbf{x}\), \(\mathbf{U}^{-1}\) converts \(\mathbf{x}\) to the eigenbasis, \(\mathbf{\Lambda}\) scales by the corresponding eigenvalues, and \(\mathbf{U}\) takes us back to the standard basis.</p>
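<p>The three steps can be checked by hand on a small matrix. The sketch below uses a \(2\times2\) example rather than \(3\times3\) to keep the arithmetic short; the matrix, its eigenvectors, and the hand-computed inverse are all values chosen for illustration.</p>

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Product of a matrix (list of rows) and a vector (list)."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]

A = [[2.0, 1.0], [1.0, 2.0]]
U = [[1.0, 1.0], [1.0, -1.0]]       # columns are the eigenvectors (1, 1) and (1, -1)
Lam = [[3.0, 0.0], [0.0, 1.0]]      # corresponding eigenvalues 3 and 1
U_inv = [[0.5, 0.5], [0.5, -0.5]]   # inverse of U, worked out by hand

reconstructed = matmul(matmul(U, Lam), U_inv)

x = [3.0, 1.0]
coords = matvec(U_inv, x)       # 1. change to the eigenbasis
scaled = matvec(Lam, coords)    # 2. scale by the eigenvalues
result = matvec(U, scaled)      # 3. change back to the standard basis
```

<p>Both routes agree: <code>reconstructed</code> equals <code>A</code>, and <code>result</code> matches multiplying <code>x</code> by <code>A</code> directly.</p>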
<p>In the presence of complex eigenvalues, \(\mathbf{A}\) may still be diagonalizable if we allow \(\mathbf{U}\) and \(\mathbf{\Lambda}\) to include complex entries.
In this case, the decomposition as a whole still maps real vectors to real vectors, but the intermediate values become complex.</p>
<h2 id="eigenfunctions">Eigenfunctions</h2>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=m2MIpDrF7Es">What’s so special about Euler’s number e? | Chapter 5, Essence of calculus</a>.</em></p>
</div>
<p>So, what does diagonalization mean in a vector space of functions?
Given a linear operator \(\mathcal{L}\), you might imagine a corresponding definition for <em>eigenfunctions</em>:</p>
\[\mathcal{L}f = \psi f\]
<p>The scalar \(\psi\) is again known as an eigenvalue.
Since \(\mathcal{L}\) is infinite-dimensional, it doesn’t have a characteristic polynomial—there’s not a straightforward method for computing \(\psi\).</p>
<p>Nevertheless, let’s attempt to diagonalize differentiation on analytic functions.
The first step is to find the eigenfunctions.
Start by applying the above condition to our differentiation operator in the power basis:</p>
\[\begin{align*}
&& \frac{\partial}{\partial x}\mathbf{p} = \psi \mathbf{p} \vphantom{\Big|}& \\
&\iff& \begin{bmatrix}0 & 1 & 0 & 0 & \cdots\\ 0 & 0 & 2 & 0 & \cdots\\ 0 & 0 & 0 & 3 & \cdots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}\begin{bmatrix}p_0\\ p_1\\ p_2\\ p_3\\ \vdots\end{bmatrix}
&= \begin{bmatrix}\psi p_0\\ \psi p_1 \\ \psi p_2 \\ \psi p_3 \\ \vdots \end{bmatrix} \\
&\iff& \begin{cases} p_1 &= \psi p_0 \\ p_2 &= \frac{\psi}{2} p_1 \\ p_3 &= \frac{\psi}{3} p_2 \\ &\dots \end{cases} &
\end{align*}\]
<p>This system of equations implies that all coefficients are determined solely by our choice of constants \(p_0\) and \(\psi\).
We can explicitly write down their relationship as \(p_i = \frac{\psi^i}{i!}p_0\).</p>
<p>Now, let’s see what this class of polynomials actually looks like.</p>
<div class="hide_narrow">
\[p[x] = p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2 + p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots\]
</div>
<div class="hide_wide">
\[\begin{align*}
p[x] &= p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2\, +\\ &\phantom{=} p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots
\end{align*}\]
</div>
<p>Differentiation shows that this function is, in fact, an eigenfunction for the eigenvalue \(\psi\).</p>
<div class="hide_narrow">
\[\begin{align*} \frac{\partial}{\partial x} p[x] &= 0 + p_0\psi + p_0 \psi^2 x + p_0\frac{\psi^3}{2}x^2 + p_0\frac{\psi^4}{6}x^3 + \dots \\
&= \psi p[x] \end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*}
\frac{\partial}{\partial x} p[x] &= 0 + p_0\psi + p_0 \psi^2 x\, +\\ &\phantom{=} p_0\frac{\psi^3}{2}x^2 + p_0\frac{\psi^4}{6}x^3 + \dots \\
&= \psi p[x]
\end{align*}\]
</div>
<p>With a bit of algebraic manipulation, the definition of \(e^{x}\) pops out:</p>
<div class="hide_narrow">
\[\begin{align*} p[x] &= p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2 + p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots \\
&= p_0\left(1 + (\psi x) + \frac{1}{2!}(\psi x)^2 + \frac{1}{3!}(\psi x)^3 + \frac{1}{4!}(\psi x)^4 + \dots\right) \\
&= p_0 e^{\psi x} \end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*}
p[x] &= p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2\, +\\ &\phantom{=} p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots \\
&= p_0\Big(1 + (\psi x) + \frac{1}{2!}(\psi x)^2\, +\\ &\phantom{=p_0\Big((} \frac{1}{3!}(\psi x)^3 + \frac{1}{4!}(\psi x)^4 + \dots\Big) \\
&= p_0 e^{\psi x}
\end{align*}\]
</div>
<p>Therefore, functions of the form \(p_0e^{\psi x}\) are eigenfunctions for the eigenvalue \(\psi\), including when \(\psi=0\).</p>
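<p>This result is easy to sanity-check numerically. The sketch below builds the first few coefficients \(p_i = \frac{\psi^i}{i!}p_0\), differentiates term by term, and compares against \(p_0e^{\psi x}\). The helper names, the truncation to 30 terms, and the particular values of \(p_0\) and \(\psi\) are assumptions made for the example.</p>

```python
import math

def eigen_coeffs(p0, psi, n):
    """First n power-basis coefficients p_i = psi^i / i! * p0."""
    return [p0 * psi ** i / math.factorial(i) for i in range(n)]

def evaluate(coeffs, x):
    """Evaluate a polynomial from its coefficients in the power basis."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def derivative(coeffs):
    """Differentiate term by term: d/dx x^i = i x^(i-1)."""
    return [i * c for i, c in enumerate(coeffs)][1:]

p0, psi = 2.0, 0.5
p = eigen_coeffs(p0, psi, 30)
dp = derivative(p)

# Each derivative coefficient should be psi times the matching coefficient of p.
max_error = max(abs(d - psi * c) for d, c in zip(dp, p))
```

<p>Up to floating-point rounding, <code>max_error</code> is zero, and <code>evaluate(p, x)</code> agrees with <code>p0 * math.exp(psi * x)</code> for moderate \(x\).</p>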
<h2 id="diagonalizing-differentiation">Diagonalizing Differentiation</h2>
<p>We’ve found the eigenfunctions of the derivative operator, but can we diagonalize it?
Ideally, we would express differentiation as the combination of an invertible operator \(\mathcal{L}\) and a diagonal operator \(\mathcal{D}\).</p>
<div class="hide_narrow">
\[\begin{align*} \frac{\partial}{\partial x} &= \mathcal{L} \mathcal{D} \mathcal{L}^{-1} \\
&=
\begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}
\begin{bmatrix} \psi_1 & 0 & \dots \\ 0 & \psi_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}
{\color{red} \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}^{-1} }
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} \frac{\partial}{\partial x} &= \mathcal{L} \mathcal{D} \mathcal{L}^{-1} \\
&=
\begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}
\\ & \phantom{=} \begin{bmatrix} \psi_1 & 0 & \dots \\ 0 & \psi_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}
\\ & \phantom{=} {\color{red} \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}^{-1} }
\end{align*}\]
</div>
<p>Diagonalization is only possible when our eigenfunctions form a basis.
This would be true if all analytic functions are expressible as a linear combination of exponentials.
However…</p>
<div class="dropdown" style="margin-top: 15px; margin-bottom: 15px">
<div class="dropdown-header">
<strong>Counterexample: <span class="fix-katex">$$f[x] = x$$</span></strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="margin-top: -15px; margin-bottom: -5px">
<p>First assume that \(f[x] = x\) can be represented as a linear combination of exponentials.
Since analytic functions have countably infinite dimensionality, we should only need a countably infinite sum:</p>
\[f[x] = x = \sum_{n=0}^\infty \alpha_n e^{\psi_n x}\]
<p>Differentiating both sides:</p>
\[\begin{align*} f^{\prime}[x] &= 1 = \sum_{n=0}^\infty \psi_n\alpha_n e^{\psi_n x} \\
f^{\prime\prime}[x] &= 0 = \sum_{n=0}^\infty \psi_n^2\alpha_n e^{\psi_n x} \end{align*}\]
<p>Since \(e^{\psi_n x}\) and \(e^{\psi_m x}\) are linearly independent when \(n\neq m\) (taking the \(\psi_n\) to be distinct), the final equation implies that \(\psi_n^2\alpha_n = 0\) for every \(n\). Hence all \(\alpha_n = 0\), except possibly the \(\alpha_\xi\) corresponding to \(\psi_\xi = 0\).
Therefore:</p>
\[\begin{align*}
1 &= \sum_{n=0}^\infty \psi_n\alpha_n e^{\psi_n x}\\
&= \psi_\xi \alpha_\xi + \sum_{n\neq \xi} 0\psi_n e^{\psi_n x} \\
&= 0
\end{align*}\]
<p>That’s a contradiction—the linear combination representing \(f[x] = x\) does not exist.</p>
<p>A similar argument shows that we can’t represent any non-constant function whose \(n\)th derivative is zero, nor periodic functions like sine and cosine.</p>
</div>
</div>
</div>
<p>Real exponentials don’t constitute a basis, so we cannot construct an invertible \(\mathcal{L}\).</p>
<h3 id="the-laplace-transform">The Laplace Transform</h3>
<p>We previously mentioned that more matrices can be diagonalized if we allow the decomposition to contain complex numbers.
Analogously, more linear operators are diagonalizable in the larger vector space of functions from \(\mathbb{R}\) to \(\mathbb{C}\).</p>
<p>Differentiation works the same way in this space; we’ll still find that its eigenfunctions are exponential.</p>
\[\frac{\partial}{\partial x} e^{(a+bi)x} = (a+bi)e^{(a+bi)x}\]
<p>However, the new eigenfunctions have complex eigenvalues, so we still can’t diagonalize.
We’ll need to consider the still larger space of functions from \(\mathbb{C}\) to \(\mathbb{C}\).</p>
\[\frac{\partial}{\partial x} : (\mathbb{C}\mapsto\mathbb{C}) \mapsto (\mathbb{C}\mapsto\mathbb{C})\]
<p>In this space, differentiation can be diagonalized via the <a href="https://en.wikipedia.org/wiki/Laplace_transform">Laplace transform</a>.
Although useful for solving differential equations, the Laplace transform is non-trivial to invert, so we won’t discuss it further.
In the following sections, we’ll delve into an operator that can be easily diagonalized in \(\mathbb{R}\mapsto\mathbb{C}\): the Laplacian.</p>
<hr />
<h1 id="inner-product-spaces">Inner Product Spaces</h1>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=LyGKycYT2v0">Dot products and duality | Chapter 9, Essence of linear algebra</a>.</em></p>
</div>
<p>Before we get to the spectral theorem, we’ll need to understand one more topic: inner products.
You’re likely already familiar with one example of an inner product—the Euclidean dot product.</p>
\[\begin{bmatrix}x\\ y\\ z\end{bmatrix} \cdot \begin{bmatrix}a\\ b\\ c\end{bmatrix} = ax + by + cz\]
<p>An inner product describes how to measure a vector along another vector.
For example, \(\mathbf{u}\cdot\mathbf{v}\) is proportional to the length of the <em>projection</em> of \(\mathbf{u}\) onto \(\mathbf{v}\).</p>
<div class="flextable" style="margin-top: -30px; margin-bottom: -20px">
<div class="item">
$$ \mathbf{u} \cdot \mathbf{v} =\|\mathbf{u}\|\|\mathbf{v}\|\cos[\theta] $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 300px" src="/assets/functions-are-vectors/dotproduct.svg" />
</div>
</div>
<p>With a bit of trigonometry, we can show that the dot product is equivalent to multiplying the vectors’ lengths with the cosine of their angle.
This relationship suggests that the product of a vector with itself produces the square of its length.</p>
\[\begin{align*} \mathbf{u}\cdot\mathbf{u} &= \|\mathbf{u}\|\|\mathbf{u}\|\cos[0] \\
&= \|\mathbf{u}\|^2
\end{align*}\]
<p>Similarly, when two vectors form a right angle (are <em>orthogonal</em>), their dot product is zero.</p>
<div class="flextable" style="margin-top: -10px; margin-bottom: -10px">
<div class="item">
$$ \begin{align*} \mathbf{u} \cdot \mathbf{v} &= \|\mathbf{u}\|\|\mathbf{v}\|\cos[90^\circ] \\ &= 0 \end{align*} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 175px" src="/assets/functions-are-vectors/orthogonal.svg" />
</div>
</div>
<p>Of course, the Euclidean dot product is only one example of an inner product.
In more general spaces, the inner product is denoted using angle brackets, such as \(\langle \mathbf{u}, \mathbf{v} \rangle\).</p>
<ul>
<li>The length (also known as the <em>norm</em>) of a vector is defined as \(\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}\).</li>
<li>Two vectors are orthogonal if their inner product is zero: \(\ \mathbf{u} \perp \mathbf{v}\ \iff\ \langle \mathbf{u}, \mathbf{v} \rangle = 0\).</li>
</ul>
<p>A vector space augmented with an inner product is known as an inner product space.</p>
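<p>As a quick sanity check, these definitions can be sketched in a few lines of Python. The helper names <code>dot</code> and <code>norm</code> and the sample vectors are illustrative choices for this sketch, not part of any library.</p>

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Length of u: the square root of its product with itself."""
    return math.sqrt(dot(u, u))

# Illustrative sample vectors
u = [3.0, 4.0, 0.0]
v = [-4.0, 3.0, 5.0]

length_u = norm(u)   # sqrt(9 + 16)
uv = dot(u, v)       # -12 + 12 + 0, so u and v are orthogonal

# Recover the angle between two non-orthogonal vectors from the cosine formula
a, b = [1.0, 0.0], [1.0, 1.0]
angle = math.degrees(math.acos(dot(a, b) / (norm(a) * norm(b))))
print(length_u, uv, angle)
```

<p>The same two helpers work for any inner product: only the body of <code>dot</code> would need to change.</p>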
<h2 id="a-functional-inner-product">A Functional Inner Product</h2>
<p>We can’t directly apply the Euclidean dot product to our space of real functions, but its \(N\)-dimensional generalization is suggestive.</p>
\[\begin{align*} \mathbf{u} \cdot \mathbf{v} &= u_1v_1 + u_2v_2 + \dots + u_Nv_N \\ &= \sum_{i=1}^N u_iv_i \end{align*}\]
<p>Given countable indices, we simply match up the values, multiply them, and add the results.
When indices are uncountable, we can convert the discrete sum to its continuous analog: an integral!</p>
\[\langle f, g \rangle = \int_a^b f[x]g[x] \, dx\]
<p>When \(f\) and \(g\) are similar, multiplying them produces a larger function; when they’re different, they cancel out.
Integration measures their product over some domain to produce a scalar result.</p>
<div class="flextable">
<div class="item">
<img class="diagram2d" style="min-width: 300px; margin-top: 25px" src="/assets/functions-are-vectors/innerprod_large.svg" />
</div>
<div class="item">
<img class="diagram2d" style="min-width: 300px; margin-top: 25px" src="/assets/functions-are-vectors/innerprod_small.svg" />
</div>
</div>
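<p>To make the integral concrete, here is a minimal numerical sketch. It approximates the inner product with a midpoint rule; the helper name <code>inner</code>, the sample count, and the choice of \(\sin\) and \(\cos\) on \([0, 2\pi]\) are assumptions made for illustration.</p>

```python
import math

def inner(f, g, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f(x) g(x) over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h) for i in range(n))

a, b = 0.0, 2 * math.pi

# A function paired with itself: the product sin^2 is never negative
similar = inner(math.sin, math.sin, a, b)      # close to pi

# Two functions that cancel out: sin * cos is positive and negative equally often
different = inner(math.sin, math.cos, a, b)    # close to zero

norm_sin = math.sqrt(similar)
print(similar, different, norm_sin)
```

<p>In the vocabulary of the previous section, \(\sin\) and \(\cos\) are orthogonal on this domain, and \(\sin\) has norm \(\sqrt{\pi}\).</p>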
<p>Of course, not all functions can be integrated.
Our inner product space will only contain functions that are <a href="https://en.wikipedia.org/wiki/Square-integrable_function">square integrable</a> over the domain \([a, b]\), which may be \([-\infty, \infty]\).
Luckily, the important properties of our inner product do not depend on the choice of integration domain.</p>
<h2 id="proofs-1">Proofs</h2>
<p>Below, we’ll briefly cover functions from \(\mathbb{R}\) to \(\mathbb{C}\).
In this space, our intuitive notion of similarity still applies, but we’ll use a slightly more general inner product:</p>
\[\langle f,g \rangle = \int_a^b f[x]\overline{g[x]}\, dx\]
<p>Where \(\overline{x}\) denotes <a href="https://en.wikipedia.org/wiki/Complex_conjugate">conjugation</a>, i.e. \(\overline{a + bi} = a - bi\).</p>
<p>Like other vector space operations, an inner product must satisfy several axioms:</p>
<div class="dropdown">
<div class="dropdown-header">
<strong>Conjugate Symmetry</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="padding-bottom: 10px">
For all vectors <span class="fix-katex">$$\mathbf{u}, \mathbf{v} \in \mathcal{V}$$</span>:
$$\langle \mathbf{u}, \mathbf{v} \rangle = \overline{\langle \mathbf{v}, \mathbf{u} \rangle}$$
Conjugation may be taken outside the integral, making this one easy:
$$\begin{align*} \langle f, g \rangle &= \int_a^b f[x]\overline{g[x]} \, dx \\
&= \int_a^b \overline{g[x]\overline{f[x]}} \, dx \\
&= \overline{\int_a^b g[x]\overline{f[x]} \, dx} \\
&= \overline{\langle g, f \rangle}
\end{align*}$$
Note that we require <em>conjugate</em> symmetry because it implies <span class="fix-katex">$$\langle\mathbf{u}, \mathbf{u}\rangle = \overline{\langle\mathbf{u}, \mathbf{u}\rangle}$$</span>, i.e. the inner product of a vector with itself is real.
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Linearity in the First Argument</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="padding-bottom: 10px">
For all vectors <span class="fix-katex">$$\mathbf{u}, \mathbf{v}, \mathbf{w} \in \mathcal{V}$$</span> and scalars <span class="fix-katex">$$\alpha, \beta \in \mathbb{F}$$</span>:
$$\langle \alpha \mathbf{u} + \beta \mathbf{v}, \mathbf{w} \rangle = \alpha\langle \mathbf{u}, \mathbf{w} \rangle + \beta\langle \mathbf{v}, \mathbf{w} \rangle $$
The proof follows from linearity of integration, as well as our vector space axioms:
<div class="hide_narrow">
\[\begin{align*} \langle \alpha f + \beta g, h \rangle &= \int_a^b (\alpha f + \beta g)[x]\overline{h[x]} \, dx \\
&= \int_a^b (\alpha f[x] + \beta g[x])\overline{h[x]} \, dx \\
&= \int_a^b \alpha f[x]\overline{h[x]} + \beta g[x]\overline{h[x]} \, dx \\
&= \alpha\int_a^b f[x]\overline{h[x]}\, dx + \beta\int_a^b g[x]\overline{h[x]} \, dx \\
&= \alpha\langle f, h \rangle + \beta\langle g, h \rangle
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} &\langle \alpha f + \beta g, h \rangle\\ &= \int_a^b (\alpha f + \beta g)[x]\overline{h[x]} \, dx \\
&= \int_a^b (\alpha f[x] + \beta g[x])\overline{h[x]} \, dx \\
&= \int_a^b \alpha f[x]\overline{h[x]} + \beta g[x]\overline{h[x]} \, dx \\
&= \alpha\int_a^b f[x]\overline{h[x]}\, dx\, +\\&\hphantom{==} \beta\int_a^b g[x]\overline{h[x]} \, dx \\
&= \alpha\langle f, h \rangle + \beta\langle g, h \rangle
\end{align*}\]
</div>
Given conjugate symmetry, an inner product is also <a href="https://en.wikipedia.org/wiki/Antilinear_map">antilinear</a> in the second argument.
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Positive-Definiteness</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="padding-bottom: 10px">
For all <span class="fix-katex">$$\mathbf{u} \in \mathcal{V}$$</span>:
$$ \begin{cases} \langle \mathbf{u}, \mathbf{u} \rangle = 0 & \text{if } \mathbf{u} = \mathbf{0} \\ \langle \mathbf{u}, \mathbf{u} \rangle > 0 & \text{otherwise} \end{cases} $$
By conjugate symmetry, we know <span class="fix-katex">$$\langle f, f \rangle$$</span> is real, so we can compare it with zero.
<br /><br />
However, rigorously proving this result requires measure-theoretic concepts beyond the scope of this post.
In brief, we redefine <span class="fix-katex">$$\mathbf{0}$$</span> not as specifically <span class="fix-katex">$$\mathbf{0}[x] = 0$$</span>, but as an equivalence class of functions that are zero "almost everywhere."
If <span class="fix-katex">$$f$$</span> is zero almost everywhere, it is only non-zero on a set of measure zero, and therefore integrates to zero.
<br /><br />
After partitioning our set of functions into equivalence classes, all non-zero functions square-integrate to a positive value. This implies that every function has a real norm, <span class="fix-katex">$$\sqrt{\langle f, f \rangle}$$</span>.
</div>
</div>
</div>
<p>Along with the definition \(\|f\| = \sqrt{\langle f, f \rangle}\), these properties entail a variety of important results, including the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy–Schwarz</a> and <a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle</a> inequalities.</p>
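<p>These axioms can also be spot-checked numerically. The sketch below is not a proof: it approximates the complex inner product on \([0,1]\) with a midpoint rule and tests the properties on a few arbitrary functions. All names and test functions are assumptions chosen for illustration.</p>

```python
import cmath

def inner(f, g, a=0.0, b=1.0, n=2000):
    """Midpoint-rule approximation of the integral of f(x) conj(g(x)) over [a, b]."""
    h = (b - a) / n
    xs = [a + (i + 0.5) * h for i in range(n)]
    return h * sum(f(x) * complex(g(x)).conjugate() for x in xs)

# Arbitrary complex-valued test functions
f = lambda x: cmath.exp(2j * x) + x
g = lambda x: (1 + 3j) * x * x
w = lambda x: cmath.cos(x) - 1j

alpha, beta = 2 - 1j, 0.5 + 4j
combo = lambda x: alpha * f(x) + beta * g(x)

# Conjugate symmetry: the inner product of (f, g) equals the conjugate of (g, f)
symmetry_error = abs(inner(f, g) - inner(g, f).conjugate())

# Linearity in the first argument
linearity_error = abs(inner(combo, w) - (alpha * inner(f, w) + beta * inner(g, w)))

# Positive-definiteness: the inner product of f with itself is real and positive
ff = inner(f, f)

# Cauchy-Schwarz: |(f, g)| never exceeds the product of the norms
cs_gap = abs(ff) ** 0.5 * abs(inner(g, g)) ** 0.5 - abs(inner(f, g))
print(symmetry_error, linearity_error, ff, cs_gap)
```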
<hr />
<h1 id="the-spectral-theorem">The Spectral Theorem</h1>
<p>Diagonalization is already a powerful technique, but we’re building up to an even more important result regarding <em>orthonormal</em> eigenbases.
In an inner product space, an orthonormal basis must satisfy two conditions: each vector is unit length, and all vectors are mutually orthogonal.</p>
<div class="flextable" style="margin-top: -20px; margin-bottom: -10px">
<div class="item">
$$ \begin{cases} \langle\mathbf{u}_i,\mathbf{u}_i\rangle = 1 & \forall i \\ \langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0 & \forall i \neq j \end{cases} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 200px" src="/assets/functions-are-vectors/orthonormal.svg" />
</div>
</div>
<p>A matrix consisting of orthonormal columns is known as an <em>orthogonal</em> matrix.
Orthogonal matrices represent <em>rotations</em> of the standard basis.</p>
<p>In an inner product space, matrix-vector multiplication computes the inner product of the vector with each <em>row</em> of the matrix.
Something interesting happens when we multiply an orthogonal matrix \(\mathbf{U}\) by its transpose:</p>
<div style="margin-top: 30px; margin-bottom: -10px">
\[\begin{align*}
\mathbf{U}^T\mathbf{U} &=
\begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix}
\begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix} \\
&= \begin{bmatrix} \langle \mathbf{u}_1, \mathbf{u}_1 \rangle & \langle \mathbf{u}_1, \mathbf{u}_2 \rangle & \langle \mathbf{u}_1, \mathbf{u}_3 \rangle \\
\langle \mathbf{u}_2, \mathbf{u}_1 \rangle & \langle \mathbf{u}_2, \mathbf{u}_2 \rangle & \langle \mathbf{u}_2, \mathbf{u}_3 \rangle \\
\langle \mathbf{u}_3, \mathbf{u}_1 \rangle & \langle \mathbf{u}_3, \mathbf{u}_2 \rangle & \langle \mathbf{u}_3, \mathbf{u}_3 \rangle \end{bmatrix} \\
&= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \\
&= \mathcal{I}
\end{align*}\]
</div>
<p>Since \(\mathbf{U}^T\mathbf{U} = \mathcal{I}\) (and \(\mathbf{U}\mathbf{U}^T = \mathcal{I}\)), we’ve found that the transpose of \(\mathbf{U}\) is equal to its inverse.</p>
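<p>A small numerical sketch of this identity, using a rotation about the \(z\)-axis as the orthogonal matrix. The angle and the helper names <code>matmul</code> and <code>transpose</code> are arbitrary choices for this example.</p>

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

theta = 0.3  # arbitrary rotation angle
c, s = math.cos(theta), math.sin(theta)

# Rotation about the z-axis: its columns are unit length and mutually orthogonal
U = [[c,  -s,  0.0],
     [s,   c,  0.0],
     [0.0, 0.0, 1.0]]

UtU = matmul(transpose(U), U)
UUt = matmul(U, transpose(U))

# Largest deviation of either product from the identity matrix
identity_error = max(
    abs(M[i][j] - (1.0 if i == j else 0.0))
    for M in (UtU, UUt) for i in range(3) for j in range(3)
)
print(identity_error)
```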
<p>When diagonalizing \(\mathbf{A}\), we used \(\mathbf{U}\) to transform vectors from our <em>eigenbasis</em> to the standard basis.
Conversely, its inverse transformed vectors from the <em>standard basis</em> to our eigenbasis.
If \(\mathbf{U}\) happens to be orthogonal, transforming a vector \(\mathbf{x}\) into the eigenbasis is equivalent to <em>projecting</em> \(\mathbf{x}\) onto each eigenvector.</p>
<div class="flextable" style="margin-top: -10px; margin-bottom: -10px">
<div class="item">
$$ \mathbf{U}^{-1}\mathbf{x} = \mathbf{U}^T\mathbf{x} = \begin{bmatrix}\langle \mathbf{u}_1, \mathbf{x} \rangle \\ \langle \mathbf{u}_2, \mathbf{x} \rangle \\ \langle \mathbf{u}_3, \mathbf{x} \rangle \end{bmatrix} $$
</div>
<div class="item">
<img class="diagram2d" style="min-width: 250px" src="/assets/functions-are-vectors/orthoproj.svg" />
</div>
</div>
<p>Additionally, the diagonalization of \(\mathbf{A}\) becomes quite simple:</p>
<div class="hide_narrow">
\[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^T} \\
&= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}
\begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix} \end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^T} \\
&= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}
\\ &\phantom{=} \begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\\ &\phantom{=} \begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix} \end{align*}\]
</div>
<p>Given an orthogonal diagonalization of \(\mathbf{A}\), we can deduce that \(\mathbf{A}\) must be <em>symmetric</em>, i.e. \(\mathbf{A} = \mathbf{A}^T\).</p>
\[\begin{align*} \mathbf{A}^T &= (\mathbf{U\Lambda U}^T)^T \\
&= {\mathbf{U}^T}^T \mathbf{\Lambda }^T \mathbf{U}^T \\
&= \mathbf{U\Lambda U}^T \\
&= \mathbf{A}
\end{align*}\]
<p>The <a href="https://en.wikipedia.org/wiki/Spectral_theorem">spectral theorem</a> states that the converse is also true: \(\mathbf{A}\) is symmetric if and only if it admits an orthonormal eigenbasis with real eigenvalues.
Proving this result is <a href="https://mast.queensu.ca/~br66/419/spectraltheoremproof.pdf">somewhat involved</a> in finite dimensions and <a href="https://www.diva-portal.org/smash/get/diva2:1598080/FULLTEXT01.pdf">very involved</a> in infinite dimensions, so we won’t reproduce the proofs here.</p>
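<p>The forward direction is easy to check numerically. The sketch below builds \(\mathbf{A} = \mathbf{U\Lambda U}^T\) from an arbitrary rotation and two real eigenvalues, then confirms that \(\mathbf{A}\) is symmetric, that the columns of \(\mathbf{U}\) are eigenvectors, and that projecting onto them recovers a vector’s coordinates. The specific numbers and helper names are assumptions for illustration.</p>

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

theta = 0.3
c, s = math.cos(theta), math.sin(theta)
U = [[c, -s],
     [s,  c]]            # orthonormal columns u_1, u_2
Lam = [[2.0, 0.0],
       [0.0, -1.0]]      # real eigenvalues on the diagonal

A = matmul(matmul(U, Lam), transpose(U))
symmetry_error = abs(A[0][1] - A[1][0])

# Each column of U should satisfy A u = lambda u
eigen_error = 0.0
for i, lam in enumerate((2.0, -1.0)):
    u = [U[0][i], U[1][i]]
    Au = [A[0][0] * u[0] + A[0][1] * u[1], A[1][0] * u[0] + A[1][1] * u[1]]
    eigen_error = max(eigen_error, abs(Au[0] - lam * u[0]), abs(Au[1] - lam * u[1]))

# Coordinates in the eigenbasis are projections onto each column (U^T x);
# multiplying by U again rebuilds the original vector
x = [1.0, 2.0]
coords = [sum(U[k][i] * x[k] for k in range(2)) for i in range(2)]
rebuilt = [sum(U[k][i] * coords[i] for i in range(2)) for k in range(2)]
print(symmetry_error, eigen_error, rebuilt)
```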
<h2 id="self-adjoint-operators">Self-Adjoint Operators</h2>
<p>We can generalize the spectral theorem to our space of functions, where it states that a <em>self-adjoint</em> operator admits an orthonormal eigenbasis with real eigenvalues.<sup id="fnref:symmetric" role="doc-noteref"><a href="#fn:symmetric" class="footnote" rel="footnote">6</a></sup></p>
<p>Denoted as \(\mathbf{A}^{\hspace{-0.1em}\star\hspace{0.1em}}\), the <em>adjoint</em> of an operator \(\mathbf{A}\) is defined by the following relationship.</p>
\[\langle \mathbf{Ax}, \mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{A}^{\hspace{-0.1em}\star\hspace{0.1em}}\mathbf{y} \rangle\]
<p>When \(\mathbf{A} = \mathbf{A}^\star\), we say that \(\mathbf{A}\) is <em>self-adjoint</em>.</p>
<p>The adjoint can be thought of as a generalized transpose—but it’s not obvious what that means in infinite dimensions.
We will simply use our functional inner product to determine whether an operator is self-adjoint.</p>
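<p>In finite dimensions, the defining relationship can be checked directly. With the complex dot product that conjugates its second argument, the adjoint of a matrix is its conjugate transpose. The matrices and vectors below are arbitrary, and the helper names are assumptions for this sketch.</p>

```python
def inner(x, y):
    """Complex inner product, conjugating the second argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def apply(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def adjoint(M):
    """Conjugate transpose of a matrix."""
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

A = [[1 + 2j, 3 - 1j],
     [0 + 1j, 2 + 0j]]
x = [1 - 1j, 2 + 3j]
y = [-2 + 0.5j, 0 + 1j]

lhs = inner(apply(A, x), y)             # inner product of Ax with y
rhs = inner(x, apply(adjoint(A), y))    # inner product of x with A* y

# A matrix equal to its own conjugate transpose is self-adjoint
H = [[2 + 0j, 1 - 1j],
     [1 + 1j, 3 + 0j]]
is_self_adjoint = adjoint(H) == H
print(lhs, rhs, is_self_adjoint)
```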
<h2 id="the-laplace-operator">The Laplace Operator</h2>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=ToIXSwZ1pJU">Solving the heat equation | DE3</a>.</em></p>
</div>
<p>Earlier, we weren’t able to diagonalize (real) differentiation, so it must not be self-adjoint.
Therefore, we will explore another fundamental operator, the <a href="https://en.wikipedia.org/wiki/Laplace_operator">Laplacian</a>.</p>
<p>There are <a href="https://www.youtube.com/watch?v=oEq9ROl9Umk">many equivalent definitions of the Laplacian</a>, but in our space of one-dimensional functions, it’s just the second derivative. We will hence restrict our domain to twice-differentiable functions.</p>
\[\Delta f = \frac{\partial^2 f}{\partial x^2}\]
<p>We may compute \(\Delta^\star\) using two integrations by parts:</p>
<div class="hide_narrow">
\[\begin{align*}
\left\langle \Delta f[x], g[x] \right\rangle &= \int_a^b f^{\prime\prime}[x] g[x]\, dx \\
&= f^\prime[x]g[x]\Big|_a^b - \int_a^b f^{\prime}[x] g^{\prime}[x]\, dx \\
&= (f^\prime[x]g[x] - f[x]g^{\prime}[x])\Big|_a^b + \int_a^b f[x] g^{\prime\prime}[x]\, dx \\
&= \left\langle f[x], \Delta g[x] \right\rangle
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*}
\left\langle \Delta f[x], g[x] \right\rangle &= \int_a^b f^{\prime\prime}[x] g[x]\, dx \\
&= f^\prime[x]g[x]\Big|_a^b - \int_a^b f^{\prime}[x] g^{\prime}[x]\, dx \\
&= (f^\prime[x]g[x] - f[x]g^{\prime}[x])\Big|_a^b \\&\hphantom{=}+ \int_a^b f[x] g^{\prime\prime}[x]\, dx \\
&= \left\langle f[x], \Delta g[x] \right\rangle
\end{align*}\]
</div>
<p>In the final step, we assume that \((f^\prime[x]g[x] - f[x]g^{\prime}[x])\big|_a^b = 0\), which is not true in general.
To make our conclusion valid, we will constrain our domain to only include functions satisfying this boundary condition.
Specifically, we will only consider <em>periodic</em> functions with period \(b-a\).
These functions have the same value and derivative at \(a\) and \(b\), so the additional term vanishes.</p>
<p>For simplicity, we will also assume our domain to be \([0,1]\).
For example:</p>
<p><img class="diagram2d_wider_on_mobile center" style="margin-top: 25px" src="/assets/functions-are-vectors/periodic.png" /></p>
<p>Therefore, the Laplacian is self-adjoint…almost.
Technically, we’ve shown that the Laplacian is symmetric, not that \(\Delta = \Delta^\star\).
This is a subtle point, and it’s possible to prove self-adjointness, so we will omit this detail.</p>
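<p>The role of the boundary condition can be seen numerically. In this sketch, two periodic functions on \([0,1]\) give matching values for \(\langle \Delta f, g \rangle\) and \(\langle f, \Delta g \rangle\), while a non-periodic pair does not. The test functions and their hand-derived second derivatives are assumptions chosen for illustration.</p>

```python
import math

TWO_PI = 2 * math.pi

def inner(f, g, n=2000):
    """Midpoint-rule approximation of the integral of f(x) g(x) over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) * g((i + 0.5) * h) for i in range(n))

# Periodic pair, with second derivatives worked out by hand
def f(x):
    return math.sin(TWO_PI * x) + 0.5 * math.cos(2 * TWO_PI * x)

def f_xx(x):
    return (-TWO_PI**2 * math.sin(TWO_PI * x)
            - 0.5 * (2 * TWO_PI)**2 * math.cos(2 * TWO_PI * x))

def g(x):
    return math.exp(math.sin(TWO_PI * x))

def g_xx(x):
    return TWO_PI**2 * g(x) * (math.cos(TWO_PI * x)**2 - math.sin(TWO_PI * x))

periodic_lhs = inner(f_xx, g)
periodic_rhs = inner(f, g_xx)

# Non-periodic pair: p(x) = x^2 and q(x) = x violate the boundary condition
p, p_xx = (lambda x: x * x), (lambda x: 2.0)
q, q_xx = (lambda x: x), (lambda x: 0.0)

open_lhs = inner(p_xx, q)   # integral of 2x over [0, 1]
open_rhs = inner(p, q_xx)   # integral of zero
print(periodic_lhs, periodic_rhs, open_lhs, open_rhs)
```

<p>For the non-periodic pair, the gap between the two sides is exactly the boundary term \((f^\prime g - f g^\prime)\big|_0^1\).</p>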
<h2 id="laplacian-eigenfunctions">Laplacian Eigenfunctions</h2>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=v0YEaeIClKY">e^(iπ) in 3.14 minutes, using dynamics | DE5</a>.</em></p>
</div>
<p>Applying the spectral theorem tells us that the Laplacian admits an orthonormal eigenbasis.
Let’s find it.<sup id="fnref:thisspace" role="doc-noteref"><a href="#fn:thisspace" class="footnote" rel="footnote">7</a></sup></p>
<p>Since the Laplacian is simply the second derivative, real exponentials would still be eigenfunctions—but they’re not periodic, so we’ll have to exclude them.</p>
\[\color{red} \Delta e^{\psi x} = \psi^2 e^{\psi x}\]
<p>Luckily, a new class of periodic eigenfunctions appears:</p>
\[\begin{align*} \Delta \sin[\psi x] &= -\psi^2 \sin[\psi x] \\
\Delta \cos[\psi x] &= -\psi^2 \cos[\psi x] \end{align*}\]
<p>If we allow our diagonalization to introduce complex numbers, we can also consider functions from \(\mathbb{R}\) to \(\mathbb{C}\).
Here, exponentials with <em>purely imaginary</em> exponents are eigenfunctions with <em>real</em> eigenvalues.</p>
<div class="hide_narrow">
\[\begin{align*} \Delta e^{\psi i x} &= (\psi i)^2e^{\psi i x} \\
&= -\psi^2 e^{\psi i x} \\
&= -\psi^2 (\cos[\psi x] + i\sin[\psi x]) \tag{Euler's formula} \end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} \Delta e^{\psi i x} &= (\psi i)^2e^{\psi i x} \\
&= -\psi^2 e^{\psi i x} \\
&= -\psi^2 (\cos[\psi x] + i\sin[\psi x]) \\& \tag{Euler's formula} \end{align*}\]
</div>
<p>Using Euler’s formula, we can see that these two perspectives are equivalent: they both introduce \(\sin\) and \(\cos\) as eigenfunctions.
Either path can lead to our final result, but we’ll stick with the more compact complex case.</p>
<p>We also need to constrain the set of eigenfunctions to be periodic on \([0,1]\).
As suggested above, we can pick out the frequencies \(\psi\) that are integer multiples of \(2\pi\).</p>
\[e^{2\pi \xi i x} = \cos[2\pi \xi x] + i\sin[2\pi \xi x]\]
<p>Our set of eigenfunctions is therefore \(e^{2\pi \xi i x}\) for all integers \(\xi\).</p>
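<p>As a rough check, a finite-difference second derivative should reproduce each eigenvalue \(-(2\pi\xi)^2\), and each eigenfunction should take the same value at \(0\) and \(1\). The step size, sample points, and helper names below are assumptions for this sketch.</p>

```python
import cmath
import math

def eigenfunction(xi):
    """The candidate eigenfunction e^{2 pi xi i x} for a given frequency xi."""
    return lambda x: cmath.exp(2j * math.pi * xi * x)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

worst_eigen_error = 0.0
worst_wrap_error = 0.0
for xi in (-2, 0, 1, 3):
    e = eigenfunction(xi)
    expected = -(2 * math.pi * xi) ** 2
    for x in (0.1, 0.35, 0.8):
        ratio = second_derivative(e, x) / e(x)   # should be close to the eigenvalue
        worst_eigen_error = max(worst_eigen_error, abs(ratio - expected))
    # Periodicity on [0, 1]: the values at both endpoints agree
    worst_wrap_error = max(worst_wrap_error, abs(e(0.0) - e(1.0)))

# A non-integer frequency is still an eigenfunction, but it is not periodic on [0, 1]
half = eigenfunction(0.5)
half_wrap_gap = abs(half(0.0) - half(1.0))
print(worst_eigen_error, worst_wrap_error, half_wrap_gap)
```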
<h2 id="diagonalizing-the-laplacian">Diagonalizing the Laplacian</h2>
<p>Now that we’ve found suitable eigenfunctions, we can construct an orthonormal basis.</p>
<p>Our collection of eigenfunctions is linearly independent, as each one is also an eigenfunction of differentiation with a distinct eigenvalue \(2\pi\xi i\).
Next, we can check for orthogonality and unit magnitude:</p>
<div style="margin-top: 20px; magin-bottom: 40px">
<div class="dropdown">
<div class="dropdown-header">
<strong>Mutual Orthogonality</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="margin-top: -10px">
<p>Compute the inner product of \(e^{2\pi \xi_1 i x}\) and \(e^{2\pi \xi_2 i x}\) for \(\xi_1 \neq \xi_2\):</p>
<div class="hide_narrow">
\[\begin{align*} \langle e^{2\pi\xi_1 i x}, e^{2\pi\xi_2 i x} \rangle &= \int_0^1 e^{2\pi\xi_1 i x} \overline{e^{2\pi\xi_2 i x}}\, dx \\
&= \int_0^1 (\cos[2\pi\xi_1 x] + i\sin[2\pi\xi_1 x])(\cos[2\pi\xi_2 x] - i\sin[2\pi\xi_2 x])\, dx \\
&= \int_0^1 \cos[2\pi\xi_1 x]\cos[2\pi\xi_2 x] - i\cos[2\pi\xi_1 x]\sin[2\pi\xi_2 x] +\\ &\phantom{= \int_0^1} i\sin[2\pi\xi_1 x]\cos[2\pi\xi_2 x] + \sin[2\pi\xi_1 x]\sin[2\pi\xi_2 x] \, dx \\
&= \int_0^1 \cos[2\pi(\xi_1-\xi_2)x] + i\sin[2\pi(\xi_1-\xi_2)x]\, dx \\
&= \frac{1}{2\pi(\xi_1-\xi_2)}\left(\sin[2\pi(\xi_1-\xi_2) x]\Big|_0^1 - i\cos[2\pi(\xi_1-\xi_2) x]\Big|_0^1\right) \\
&= 0
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} &\langle e^{2\pi\xi_1 i x}, e^{2\pi\xi_2 i x} \rangle \\ &= \int_0^1 e^{2\pi\xi_1 i x} \overline{e^{2\pi\xi_2 i x}}\, dx \\
&= \int_0^1 (\cos[2\pi\xi_1 x] + i\sin[2\pi\xi_1 x])\cdot\\ &\hphantom{==\int_0^1}(\cos[2\pi\xi_2 x] - i\sin[2\pi\xi_2 x])\, dx \\
&= \int_0^1 \cos[2\pi\xi_1 x]\cos[2\pi\xi_2 x]\, -\\ &\hphantom{==\int_0^1} i\cos[2\pi\xi_1 x]\sin[2\pi\xi_2 x] +\\ &\hphantom{== \int_0^1} i\sin[2\pi\xi_1 x]\cos[2\pi\xi_2 x]\, +\\ &\hphantom{==\int_0^1} \sin[2\pi\xi_1 x]\sin[2\pi\xi_2 x] \, dx \\
&= \int_0^1 \cos[2\pi(\xi_1-\xi_2)x]\, +\\ &\hphantom{==\int_0^1} i\sin[2\pi(\xi_1-\xi_2)x]\, dx \\
&= \frac{1}{2\pi(\xi_1-\xi_2)}\Big(\sin[2\pi(\xi_1-\xi_2) x]\Big|_0^1\, -\\ &\hphantom{=== \frac{1}{2\pi(\xi_1-\xi_2)}} i\cos[2\pi(\xi_1-\xi_2) x]\Big|_0^1\Big) \\
&= 0
\end{align*}\]
</div>
<p>Note that the final step is valid because \(\xi_1-\xi_2\) is a non-zero integer.</p>
<p>This result also applies to any domain \([a,b]\), given functions periodic on \([a,b]\).</p>
<p>It’s possible to further generalize to \([-\infty,\infty]\), but doing so requires a <a href="https://math.stackexchange.com/questions/394237/understanding-weighted-inner-product-and-weighted-norms">weighted</a> inner product.</p>
</div>
</div>
</div>
<div class="dropdown">
<div class="dropdown-header">
<strong>Unit Magnitude</strong>
</div>
<div class="dropdown-content">
<div class="dropdown-content-pad" style="margin-top: -10px">
<p>It’s easy to show that all candidate functions have norm one:</p>
\[\begin{align*} \langle e^{2\pi\xi i x}, e^{2\pi\xi i x} \rangle &= \int_0^1 e^{2\pi\xi i x}\overline{e^{2\pi\xi i x}}\, dx \\
&= \int_0^1 e^{2\pi\xi i x}e^{-2\pi\xi i x}\, dx \\&= \int_0^1 1\, dx\\ &= 1 \end{align*}\]
<p>With the addition of a constant factor \(\frac{1}{b-a}\), this result generalizes to any \([a,b]\).</p>
<p>It’s possible to further generalize to \([-\infty,\infty]\), but doing so requires a <a href="https://math.stackexchange.com/questions/394237/understanding-weighted-inner-product-and-weighted-norms">weighted</a> inner product.</p>
</div>
</div>
</div>
</div>
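<p>Both properties can be confirmed numerically by building a table of inner products between a handful of eigenfunctions. The table should be close to the identity. The frequency range and helper names are assumptions for this sketch.</p>

```python
import cmath
import math

def basis(xi):
    """The eigenfunction e^{2 pi xi i x}."""
    return lambda x: cmath.exp(2j * math.pi * xi * x)

def inner(f, g, n=1000):
    """Midpoint-rule approximation of the integral of f(x) conj(g(x)) over [0, 1]."""
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    return h * sum(f(x) * g(x).conjugate() for x in xs)

freqs = range(-3, 4)

# Inner product of every pair of eigenfunctions in the chosen range
gram = {(p, q): inner(basis(p), basis(q)) for p in freqs for q in freqs}

# Orthonormality means 1 when the frequencies match and 0 otherwise
worst = max(abs(val - (1.0 if p == q else 0.0)) for (p, q), val in gram.items())
print(worst)
```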
<p>The final step is to show that all functions in our domain can be represented by a linear combination of eigenfunctions.
To do so, we will find an invertible operator \(\mathcal{L}\) representing the proper change of basis.</p>
<p>Critically, since our eigenbasis is <em>orthonormal</em>, we can intuitively consider the inverse of \(\mathcal{L}\) to be its transpose.</p>
<div class="hide_narrow">
\[\mathcal{I} = \mathcal{L}\mathcal{L}^{-1} = \mathcal{L}\mathcal{L}^{T} = \begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix}\begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix}\]
</div>
<div class="hide_wide">
\[\begin{align*} \mathcal{I} &= \mathcal{L}\mathcal{L}^{-1} \\&= \mathcal{L}\mathcal{L}^{T} \\&= \begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix}\\ & \phantom{=} \begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix} \end{align*}\]
</div>
<p>This visualization suggests that \(\mathcal{L}^Tf\) computes the inner product of \(f\) with each eigenvector.</p>
\[\mathcal{L}^Tf = \begin{bmatrix}\langle f, e^{2\pi\xi_1 i x} \rangle \\ \langle f, e^{2\pi\xi_2 i x} \rangle \\ \vdots \end{bmatrix}\]
<p>Which is highly reminiscent of the finite-dimensional case, where we projected onto each eigenvector of an orthogonal eigenbasis.</p>
<p>This insight allows us to write down the product \(\mathcal{L}^Tf\) as a function \(\hat{f}[\xi]\) of an integer argument \(\xi\).
Note that the complex inner product conjugates the second argument, so the exponent is negated.</p>
\[(\mathcal{L}^Tf)[\xi] = \hat{f}[\xi] = \int_0^1 f[x]e^{-2\pi\xi i x}\, dx\]
<p>Conversely, \(\mathcal{L}\) converts \(\hat{f}\) back to the standard basis.
It simply creates a linear combination of eigenfunctions.</p>
\[(\mathcal{L}\hat{f})[x] = f[x] = \sum_{\xi=-\infty}^\infty \hat{f}[\xi] e^{2\pi\xi i x}\]
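<p>A numerical round trip makes the two formulas concrete. This sketch estimates \(\hat{f}[\xi]\) with a midpoint rule, then rebuilds \(f\) from a truncated sum. The test function, the cutoff \(N\), and the helper names are assumptions for illustration.</p>

```python
import cmath
import math

def fourier_coefficient(f, xi, n=2000):
    """Midpoint-rule estimate of the integral of f(x) e^{-2 pi xi i x} over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * cmath.exp(-2j * math.pi * xi * (k + 0.5) * h)
                   for k in range(n))

def reconstruct(f_hat, x):
    """Linear combination of eigenfunctions weighted by the stored coefficients."""
    return sum(c * cmath.exp(2j * math.pi * xi * x) for xi, c in f_hat.items())

def f(x):
    # Illustrative test function: a constant plus two waves
    return 1.0 + 2.0 * math.cos(2 * math.pi * x) + math.sin(6 * math.pi * x)

N = 4
f_hat = {xi: fourier_coefficient(f, xi) for xi in range(-N, N + 1)}

# cos(2 pi x) splits evenly between xi = 1 and xi = -1;
# sin(6 pi x) contributes 1/(2i) = -0.5i at xi = 3 and +0.5i at xi = -3
round_trip_error = max(abs(reconstruct(f_hat, x) - f(x)) for x in (0.0, 0.2, 0.55, 0.9))
print(f_hat[0], f_hat[1], f_hat[3], round_trip_error)
```

<p>Because this test function only contains frequencies up to 3, the truncated sum recovers it up to floating-point error. A general function would need the full infinite sum.</p>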
<p>These operators are, in fact, inverses of each other, but a rigorous proof is beyond the scope of this post.
Therefore, we’ve diagonalized the Laplacian:</p>
<div class="hide_narrow">
\[\begin{align*} \Delta &= \mathcal{L} \mathcal{D} \mathcal{L}^T \\
&=
\begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix}
\begin{bmatrix} -(2\pi\xi_1)^2 & 0 & \dots \\ 0 & -(2\pi\xi_2)^2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}
\begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix}
\end{align*}\]
</div>
<div class="hide_wide">
\[\begin{align*} \Delta &= \mathcal{L} \mathcal{D} \mathcal{L}^T \\
&=
\begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix}
\\ & \phantom{=} \begin{bmatrix} -(2\pi\xi_1)^2 & 0 & \dots \\ 0 & -(2\pi\xi_2)^2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}
\\ & \phantom{=} \begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix}
\end{align*}\]
</div>
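<p>Read right to left, this factorization is a recipe: transform \(f\) into the eigenbasis, scale each coefficient by its eigenvalue, and transform back. The sketch below follows that recipe and compares the result against a hand-derived second derivative. The test function and helper names are assumptions for illustration.</p>

```python
import cmath
import math

TWO_PI = 2 * math.pi

def fourier_coefficient(f, xi, n=2000):
    """Midpoint-rule estimate of the integral of f(x) e^{-2 pi xi i x} over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * cmath.exp(-1j * TWO_PI * xi * (k + 0.5) * h)
                   for k in range(n))

def f(x):
    return 2.0 * math.cos(TWO_PI * x) + math.sin(3 * TWO_PI * x)

def f_xx(x):
    # Second derivative of f, worked out by hand for comparison
    return (-2.0 * TWO_PI**2 * math.cos(TWO_PI * x)
            - (3 * TWO_PI)**2 * math.sin(3 * TWO_PI * x))

N = 4
# Step 1: change of basis into the eigenbasis
f_hat = {xi: fourier_coefficient(f, xi) for xi in range(-N, N + 1)}
# Step 2: the diagonal operator scales each coefficient by its eigenvalue
scaled = {xi: -(TWO_PI * xi) ** 2 * c for xi, c in f_hat.items()}

def laplacian_via_basis(x):
    # Step 3: change of basis back to the standard basis
    return sum(c * cmath.exp(1j * TWO_PI * xi * x) for xi, c in scaled.items())

samples = [0.0, 0.13, 0.5, 0.77]
max_error = max(abs(laplacian_via_basis(x) - f_xx(x)) for x in samples)
max_imag = max(abs(laplacian_via_basis(x).imag) for x in samples)
print(max_error, max_imag)
```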
<p>Although \(\mathcal{L}^T\) transforms our real-valued function into a complex-valued function, \(\Delta\) as a whole still maps real functions to real functions.
Next, we’ll see how \(\mathcal{L}^T\) is itself an incredibly useful transformation.</p>
<hr />
<h1 id="applications">Applications</h1>
<p>In this section, we’ll explore several applications in <a href="https://en.wikipedia.org/wiki/Signal_processing">signal processing</a>, each of which arises from diagonalizing the Laplacian on a new domain.</p>
<h1 id="fourier-series">Fourier Series</h1>
<div class="review">
<p><em>Review: <a href="https://www.youtube.com/watch?v=r6sGWTCMz2k">But what is a Fourier series? From heat flow to drawing with circles | DE4</a>.</em></p>
</div>
<p>If you’re familiar with Fourier methods, you likely noticed that \(\hat{f}\) encodes the Fourier series of \(f\).
That’s because a Fourier transform is a change of basis into the Laplacian eigenbasis!</p>
<p>This basis consists of <em>waves</em>, which makes \(\hat{f}\) a particularly interesting representation of \(f\).
For example, consider evaluating \(\hat{f}[1]\):</p>
<div class="hide_narrow">
\[\hat{f}[1] = \int_0^1 f[x] e^{-2\pi i x}\, dx = \int_0^1 f[x](\cos[2\pi x] - i\sin[2\pi x])\, dx\]
</div>
<div class="hide_wide">
\[\begin{align*} \hat{f}[1] &= \int_0^1 f[x] e^{-2\pi i x}\, dx \\&= \int_0^1 f[x](\cos[2\pi x] - i\sin[2\pi x])\, dx \end{align*}\]
</div>
<p>This integral measures how much of \(f\) is represented by waves of frequency (positive) 1.
Naturally, \(\hat{f}[\xi]\) computes the same quantity for any integer frequency \(\xi\).</p>
<div style="margin-bottom: 35px"></div>
\[\hphantom{aaa} {\color{#9673A6} \text{Real}\left[e^{2\pi i \xi x}\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Imaginary}\left[e^{2\pi i \xi x}\right]}\]
<svg id="real_eigen" style="margin-top: 20px; margin-bottom: -20px;" class="center" width="500" height="250"></svg>
<div class="flextable" style="">
<div class="item" style="margin-right: 15px">
<div style="display: inline-flex"><p id="real_eigen_xi" style="min-width: 80px">$$\xi = 1$$</p>
<input type="range" min="-5" max="5" value="1" class="slider item" style="margin-top: 22px; min-width: 150px" id="real_eigen_slider" />
</div>
</div>
</div>
<p>Therefore, we say that \(\hat{f}\) expresses our function in the <em>frequency domain</em>.
To illustrate this point, we’ll use a Fourier series to decompose a piecewise linear function into a collection of waves.<sup id="fnref:vibecheck" role="doc-noteref"><a href="#fn:vibecheck" class="footnote" rel="footnote">8</a></sup>
Since our new basis is orthonormal, the transform is easy to invert by re-combining the waves.</p>
<p>Here, the \(\color{#9673A6}\text{purple}\) curve is \(f\); the \(\color{#D79B00}\text{orange}\) curve is a reconstruction of \(f\) from the first \(N\) coefficients of \(\hat{f}\).
Try varying the number of coefficients and moving the \(\color{#9673A6}\text{purple}\) dots to affect the results.</p>
<div style="width: 100%; margin-top: 25px; margin-bottom: 0px" class="center">
$$ \hphantom{aaa} {\color{#9673A6} f[x]}\,\,\,\,\,\,\,\,\,\,\,\, {\color{#D79B00} \sum_{\xi=-N}^N \hat{f}[\xi]e^{2\pi i \xi x}} $$
<div class="flextable" style="margin-top: -5px; width: 100%">
<svg id="real_fourier" style="margin-top: 15px; " class="center" width="600" height="250"></svg>
<div class="flextable" style="margin-top: 5px">
<div class="item" style="margin-right: 15px">
<div style="display: inline-flex"><p id="real_fourier_N" style="margin-top: -25px; min-width: 100px">$$N = 3$$</p>
<input type="range" min="0" max="16" value="3" class="slider item" id="real_fourier_slider" style="margin-left: -15px; margin-top: -20px; min-width: 125px" />
</div>
</div>
</div>
</div>
Additionally, explore the individual basis functions making up our result:
<table class="lol_table">
<tr>
<td>$$\hat{f}[0]$$</td>
<td>$$\hat{f}[1]e^{2\pi i x}$$</td>
<td>$$\hat{f}[2]e^{4\pi i x}$$</td>
<td>$$\hat{f}[3]e^{6\pi i x}$$</td>
</tr>
<tr>
<td rowspan="2"><div style="margin-bottom: -20.75%"> </div><svg id="real_fourier_0" width="200" height="100"></svg></td>
<td><svg id="real_fourier_c0" width="200" height="100"></svg></td>
<td><svg id="real_fourier_c1" width="200" height="100"></svg></td>
<td><svg id="real_fourier_c2" width="200" height="100"></svg></td>
</tr>
<tr>
<td><svg id="real_fourier_s0" width="200" height="100"></svg></td>
<td><svg id="real_fourier_s1" width="200" height="100"></svg></td>
<td><svg id="real_fourier_s2" width="200" height="100"></svg></td>
</tr>
</table>
</div>
<p>Many interesting operations become easy to compute in the frequency domain.
For example, by simply dropping Fourier coefficients beyond a certain threshold, we can reconstruct a smoothed version of our function.
This technique is known as a <em>low-pass filter</em>—try it out above.</p>
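<p>For readers without access to the interactive figure, here is a minimal offline sketch of the same idea. It uses a fixed triangle wave as the piecewise linear function and measures how the reconstruction error shrinks as more coefficients are kept. The function, the cutoffs, and the helper names are assumptions for illustration, not the code behind the figure.</p>

```python
import cmath
import math

def fourier_coefficient(f, xi, n=2000):
    """Midpoint-rule estimate of the integral of f(x) e^{-2 pi xi i x} over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * cmath.exp(-2j * math.pi * xi * (k + 0.5) * h)
                   for k in range(n))

def f(x):
    """Piecewise linear and periodic on [0, 1]: a triangle wave."""
    return abs(x - 0.5)

MAX_N = 16
f_hat = {xi: fourier_coefficient(f, xi) for xi in range(-MAX_N, MAX_N + 1)}

def low_pass(x, N):
    """Reconstruct f while dropping every coefficient with |xi| above N."""
    total = sum(f_hat[xi] * cmath.exp(2j * math.pi * xi * x) for xi in range(-N, N + 1))
    return total.real

grid = [j / 200 for j in range(200)]

# Largest pointwise error of each smoothed reconstruction
errors = {N: max(abs(low_pass(x, N) - f(x)) for x in grid) for N in (1, 3, 16)}
print(errors)
```

<p>The error concentrates at the kinks of the triangle wave, which is where the discarded high-frequency waves were doing most of their work.</p>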
<h1 id="image-compression">Image Compression</h1>
<p>Computationally, Fourier series are especially useful for <em>compression</em>.
Encoding a function \(f\) in the standard basis takes a lot of space, since we store a separate result for each input.
If we instead express \(f\) in the Fourier basis, we only need to store a few coefficients—we’ll be able to approximately reconstruct \(f\) by re-combining the corresponding basis functions.</p>
<p>So far, we’ve only defined a Fourier transform for functions on \(\mathbb{R}\).
Luckily, the transform arose via diagonalizing the Laplacian, and the Laplacian is not limited to one-dimensional functions.
<strong>In fact, wherever we can define a Laplacian, we can find a corresponding Fourier transform.</strong><sup id="fnref:morelaplacians" role="doc-noteref"><a href="#fn:morelaplacians" class="footnote" rel="footnote">9</a></sup></p>
<p>For example, in two dimensions, the Laplacian becomes a sum of second derivatives.</p>
\[\Delta f[x,y] = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}\]
<p>For the domain \([0,1]\times[0,1]\), we’ll find a familiar set of periodic eigenfunctions.</p>
<div class="hide_narrow">
\[e^{2\pi i(nx + my)} = \cos[2\pi(nx + my)] + i\sin[2\pi(nx + my)]\]
</div>
<div class="hide_wide">
\[\begin{align*} e^{2\pi i(nx + my)} &= \cos[2\pi(nx + my)]\, + \\ &\phantom{=}\, i\sin[2\pi(nx + my)] \end{align*}\]
</div>
<p>Where \(n\) and \(m\) are both integers.
Let’s see what these basis functions look like:</p>
<div class="flextable" style="margin-top: -5px; margin-bottom: 20px">
<div>
\[{\color{#9673A6} \text{Real}\left[e^{2\pi i(nx + my)}\right]}\]
<canvas style="margin-top: -40px" width="320" height="256" id="2d_eigen_re"></canvas>
</div>
<div>
\[{\color{#D79B00} \text{Imaginary}\left[e^{2\pi i(nx + my)}\right]}\]
<canvas style="margin-top: -40px" width="320" height="256" id="2d_eigen_im"></canvas>
</div>
</div>
<div class="flextable" style="margin-top: 10px; margin-bottom: -5px">
<div class="item" style="margin-right: 15px;">
<div style="display: inline-flex"><p id="2d_eigen_n" style="margin-top: -25px; min-width: 80px">$$n = 2$$</p>
<input type="range" min="-5" max="5" value="2" class="slider item" id="2d_eigen_slider_n" style="margin-top: -20px; min-width: 175px" />
</div>
<div style="display: inline-flex"><p id="2d_eigen_m" style="margin-top: -25px; min-width: 80px">$$m = 1$$</p>
<input type="range" min="-5" max="5" value="1" class="slider item" id="2d_eigen_slider_m" style="margin-top: -20px; min-width: 175px" />
</div>
</div></div>
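<p>As a quick check that the text does not spell out, we can apply the two-dimensional Laplacian to one of these functions.
Each second derivative pulls down a factor of \((2\pi i n)^2\) or \((2\pi i m)^2\):</p>
\[\begin{align*}
\Delta e^{2\pi i(nx + my)} &= \frac{\partial^2}{\partial x^2}e^{2\pi i(nx + my)} + \frac{\partial^2}{\partial y^2}e^{2\pi i(nx + my)} \\
&= \left((2\pi i n)^2 + (2\pi i m)^2\right)e^{2\pi i(nx + my)} \\
&= -4\pi^2(n^2 + m^2)\, e^{2\pi i(nx + my)}
\end{align*}\]
<p>So each function is an eigenfunction of the Laplacian with the real eigenvalue \(-4\pi^2(n^2 + m^2)\), and restricting \(n\) and \(m\) to integers is what makes it periodic on the unit square.</p>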
<p>Just like the 1D case, the corresponding Fourier transform is a change of basis into the Laplacian’s orthonormal eigenbasis.
Above, we decomposed a 1D function into a collection of 1D waves—here, we equivalently decompose a 2D image into a collection of 2D waves.</p>
<div class="flextable" style="margin-top: -15px; margin-bottom: 20px">
<div>
<div style="height: 85px">
\[\phantom{\Bigg|} {\color{#9673A6} f[x,y]}\]
</div>
<canvas width="256" height="256" id="2d_image_before" style="margin-right: 20px; margin-left: 20px"></canvas>
</div>
<div>
<div style="height: 85px">
\[{\color{#D79B00} \sum_{n=-N}^N \sum_{m=-N}^N \hat{f}[n,m]e^{2\pi i(nx + my)}}\]
</div>
<canvas width="256" height="256" id="2d_image_after" style="margin-right: 20px; margin-left: 20px"></canvas>
</div>
</div>
<div class="flextable" style="margin-top: 10px; margin-bottom: -5px">
<div class="item" style="margin-right: 15px;">
<div style="display: inline-flex"><p id="2d_image_N" style="margin-top: -25px; min-width: 80px">$$N = 3$$</p>
<input type="range" min="0" max="32" value="3" class="slider item" id="2d_image_slider_N" style="margin-top: -20px; min-width: 175px" />
</div>
</div></div>
<p>A variant of the 2D Fourier transform, the discrete cosine transform, is at the core of many image compression algorithms, including JPEG.</p>
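<p>To make the storage argument concrete, here is a rough sketch in plain Python that keeps only the coefficients \(\hat{f}[n,m]\) with \(|n|, |m| \le N\) and rebuilds the image from them.
It is not how JPEG works: the function names, the tiny synthetic image, and the brute-force sums are all assumptions made for illustration.</p>

```python
import cmath
import math

def compress(image, N):
    """Keep only f_hat[n, m] with |n|, |m| <= N (Riemann-sum approximation)."""
    S = len(image)
    coeffs = {}
    for n in range(-N, N + 1):
        for m in range(-N, N + 1):
            total = 0j
            for y in range(S):
                for x in range(S):
                    total += image[y][x] * cmath.exp(-2j * math.pi * (n * x + m * y) / S)
            coeffs[(n, m)] = total / (S * S)
    return coeffs

def decompress(coeffs, S):
    """Re-combine the stored 2D waves into an S x S image."""
    return [[sum(c * cmath.exp(2j * math.pi * (n * x + m * y) / S)
                 for (n, m), c in coeffs.items()).real
             for x in range(S)] for y in range(S)]

def max_error(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# A hypothetical 16x16 "image" built from a few low-frequency waves.
S = 16
image = [[0.5 + 0.3 * math.cos(2 * math.pi * x / S)
          + 0.2 * math.cos(2 * math.pi * (x + 2 * y) / S)
          for x in range(S)] for y in range(S)]

coeffs = compress(image, N=2)   # 25 complex coefficients stand in for 256 pixels
restored = decompress(coeffs, S)
```

<p>For this particular image, \(N = 2\) already captures every wave it contains, so <code>restored</code> matches the input up to rounding; lowering \(N\) to 1 discards the diagonal wave and the reconstruction becomes approximate.</p>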
<h2 id="spherical-harmonics">Spherical Harmonics</h2>
<p>Computer graphics is often concerned with functions on the unit sphere, so let’s see if we can find a corresponding Fourier transform.
In <a href="/Spherical-Integration">spherical coordinates</a>, the Laplacian can be defined as follows:</p>
<div class="hide_narrow">
\[\Delta f[\theta, \phi] = \frac{1}{\sin[\theta]}\frac{\partial}{\partial \theta}\left(\sin[\theta] \frac{\partial f}{\partial \theta}\right) + \frac{1}{\sin^2[\theta]}\frac{\partial^2 f}{\partial \phi^2}\]
</div>
<div class="hide_wide">
\[\begin{align*}
\Delta f[\theta, \phi] &= \frac{1}{\sin[\theta]}\frac{\partial}{\partial \theta}\left(\sin[\theta] \frac{\partial f}{\partial \theta}\right)\, + \\ &\phantom{=} \frac{1}{\sin^2[\theta]}\frac{\partial^2 f}{\partial \phi^2}
\end{align*}\]
</div>
<p>We won’t go through the full derivation, but this Laplacian admits an orthonormal eigenbasis known as the <em>spherical harmonics</em>.<sup id="fnref:harmonics" role="doc-noteref"><a href="#fn:harmonics" class="footnote" rel="footnote">10</a></sup></p>
\[Y_\ell^m[\theta, \phi] = N_\ell^m P_\ell^m[\cos[\theta]] e^{im\phi}\]
<p>Where \(Y_\ell^m\) is the spherical harmonic of degree \(\ell \ge 0\) and integer order \(m \in [-\ell,\ell]\). Note that \(N_\ell^m\) is a normalization constant and \(P_\ell^m\) are the <a href="https://en.wikipedia.org/wiki/Associated_Legendre_polynomials">associated Legendre polynomials</a>.<sup id="fnref:legendre" role="doc-noteref"><a href="#fn:legendre" class="footnote" rel="footnote">11</a></sup></p>
<div style="margin-bottom: 35px"></div>
\[\hphantom{aa} {\color{#9673A6} \text{Real}\left[Y_\ell^m[\theta,\phi]\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Imaginary}\left[Y_\ell^m[\theta,\phi]\right]}\]
<canvas id="spr_eigen" style="margin-top: -10px; margin-bottom: -10px" class="center wider_on_mobile" width="450" height="350"></canvas>
<div class="flextable" style="">
<div class="item">
<div style="display: inline-flex">
<p id="spr_eigen_ell" style="margin-left: 10px; min-width: 100px">$$\ell = 2$$</p>
<input type="range" min="0" max="6" value="2" class="slider item" style="margin-left: -10px; margin-right: 20px" id="spr_eigen_slider_ell" />
</div>
</div>
<div class="item">
<div style="display: inline-flex">
<p id="spr_eigen_m" style="margin-left: -10px; min-width: 100px">$$m = 1$$</p>
<input type="range" min="-2" max="2" value="1" class="slider item" style="margin-left: -10px; margin-right: 0px" id="spr_eigen_slider_m" />
</div>
</div>
</div>
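<p>The formula for \(Y_\ell^m\) can be evaluated directly. The sketch below is one possible reading of it, not a reference implementation: it assumes the common orthonormal choice \(N_\ell^m = \sqrt{\frac{2\ell+1}{4\pi}\frac{(\ell-m)!}{(\ell+m)!}}\), the Condon–Shortley sign convention for \(P_\ell^m\), and that \(\theta\) is the polar angle. None of these conventions are fixed by the text, and the function names are hypothetical.</p>

```python
import cmath
import math

def assoc_legendre(l, m, x):
    """P_l^m(x) for 0 <= m <= l, via the standard upward recurrence in l.
    Includes the Condon-Shortley phase (an assumed convention)."""
    pmm = 1.0
    if m > 0:
        somx2 = math.sqrt((1.0 - x) * (1.0 + x))
        fact = 1.0
        for _ in range(m):
            pmm *= -fact * somx2
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    for ll in range(m + 2, l + 1):
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1

def sph_harm(l, m, theta, phi):
    """Y_l^m[theta, phi] = N_l^m * P_l^m[cos theta] * e^(i m phi)."""
    if m < 0:
        # Negative orders follow from conjugate symmetry.
        return (-1) ** (-m) * sph_harm(l, -m, theta, phi).conjugate()
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - m) / math.factorial(l + m))
    return norm * assoc_legendre(l, m, math.cos(theta)) * cmath.exp(1j * m * phi)

def inner(l1, m1, l2, m2, n=64):
    """Midpoint-rule estimate of the integral of Y1 * conj(Y2) over the sphere."""
    total = 0j
    for i in range(n):
        theta = math.pi * (i + 0.5) / n
        for j in range(2 * n):
            phi = 2 * math.pi * (j + 0.5) / (2 * n)
            total += (sph_harm(l1, m1, theta, phi)
                      * sph_harm(l2, m2, theta, phi).conjugate()
                      * math.sin(theta))
    return total * (math.pi / n) * (math.pi / n)

same = inner(2, 1, 2, 1)       # approximately 1
different = inner(2, 1, 1, 1)  # approximately 0
```

<p>The two inner products at the end are a numerical spot check of orthonormality: a harmonic paired with itself integrates to roughly one, and two different harmonics integrate to roughly zero.</p>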
<p>As above, we define the spherical Fourier transform as a change of basis into the spherical harmonics.
In game engines, this transform is often used to compress diffuse <em>environment maps</em> (i.e. spherical images) and global illumination probes.</p>
<div class="flextable" style="margin-top: -5px; margin-bottom: 20px">
<div>
<div style="height: 80px">
\[\phantom{\Bigg|} {\color{#9673A6} f[\theta,\phi]}\]
</div>
<canvas width="320" height="256" id="spr_image_before"></canvas>
</div>
<div>
<div style="height: 80px">
\[{\color{#D79B00} \sum_{\ell=0}^N \sum_{m=-\ell}^\ell \hat{f}[\ell,m]\, Y_\ell^m[\theta,\phi]}\]
</div>
<canvas width="320" height="256" id="spr_image_after"></canvas>
</div>
</div>
<div class="flextable" style="margin-top: 10px; margin-bottom: -5px">
<div class="item" style="margin-right: 15px;">
<div style="display: inline-flex"><p id="spr_image_N" style="margin-top: -25px; min-width: 80px">$$N = 3$$</p>
<input type="range" min="0" max="32" value="3" class="slider item" id="spr_image_slider_N" style="margin-top: -20px; min-width: 175px" />
</div>
</div></div>
<p>You might also recognize spherical harmonics as electron orbitals—quantum mechanics is primarily concerned with the eigenfunctions of linear operators.</p>
<h1 id="geometry-processing">Geometry Processing</h1>
<p>Representing functions as vectors underlies many modern algorithms—image compression is only one example.
In fact, because computers can do linear algebra so efficiently, applying linear-algebraic techniques to functions produces a powerful new computational paradigm.</p>
<p>The nascent field of <a href="https://www.youtube.com/watch?v=mas-PUA3OvA&list=PL9_jI1bdZmz0hIrNCMQW1YmZysAiIYSSS">discrete</a> <a href="https://www.ams.org/publications/journals/notices/201710/rnoti-p1153.pdf">differential</a> <a href="http://www.cs.cmu.edu/~kmcrane/Projects/DDG/paper.pdf">geometry</a> uses this perspective to build algorithms for three-dimensional <em>geometry processing</em>.
In computer graphics, functions on <em>meshes</em> often represent textures, unwrappings, displacements, or simulation parameters.
DDG gives us a way to faithfully encode such functions as vectors: for example, by associating a value with each vertex of the mesh.</p>
<div class="flextable">
<div class="item">
$$ \mathbf{f} = \begin{bmatrix}
{\color{#666666} f_1}\\
{\color{#82B366} f_2}\\
{\color{#B85450} f_3}\\
{\color{#6C8EBF} f_4}\\
{\color{#D79B00} f_5}\\
{\color{#9673A6} f_6}\\
{\color{#D6B656} f_7}\\
\end{bmatrix} $$
</div>
<div class="item">
<img class="diagram2d" style="width: 375px; margin-top: 10px" src="/assets/functions-are-vectors/mesh.png" />
</div>
</div>
<div style="margin-top: 5px"></div>
<p>One particularly relevant result is a Laplace operator for meshes.
A mesh Laplacian is a finite-dimensional matrix, so we can use <a href="https://en.wikipedia.org/wiki/Numerical_linear_algebra">numerical linear algebra</a> to find its eigenfunctions.</p>
<p>As with the continuous case, these functions generalize sine and cosine to a new domain.
Here, we visualize the real and imaginary parts of each eigenfunction, where the two colors indicate positive vs. negative regions.</p>
<div style="margin-top: 30px"></div>
\[\hphantom{aa} {\color{#9673A6} \text{Rea}}{\color{#82B366}\text{l}\left[\psi_N\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Imagi}}{\color{#6C8EBF} \text{nary}\left[\psi_N\right]}\]
<canvas id="geom_eigen" style="margin-top: 5px; margin-bottom: 0px" class="center wider_on_mobile" width="600" height="350"></canvas>
<div class="flextable" style="margin-top: 10px; margin-bottom: -5px">
<div class="item" style="margin-right: 15px;">
<div style="display: inline-flex"><p id="geom_eigen_N" style="margin-left: 10px; margin-top: -22px; min-width: 170px; max-width: 170px">$$4\text{th Eigenfunction}$$</p>
<input type="range" min="0" max="30" value="3" class="slider item" id="geom_eigen_slider_N" style="margin-top: -15px; min-width: 175px" />
</div>
</div></div>
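<p>The simplest possible "mesh" is a closed loop of vertices, where we can see the connection to sine and cosine by hand.
The sketch below assumes a uniform Laplacian, \((\mathbf{L}\mathbf{f})_j = f_{j-1} - 2f_j + f_{j+1}\), which is a stand-in for the weighted Laplacians used on real triangle meshes; the names are hypothetical.</p>

```python
import math

def cycle_laplacian(n):
    """Uniform Laplacian of a closed loop of n vertices:
    (L f)_j = f_{j-1} - 2 f_j + f_{j+1}, with indices wrapping around."""
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = -2.0
        L[j][(j - 1) % n] += 1.0
        L[j][(j + 1) % n] += 1.0
    return L

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

n, k = 12, 2
L = cycle_laplacian(n)

# A sampled cosine wave with k periods around the loop.
psi = [math.cos(2 * math.pi * k * j / n) for j in range(n)]

# Claimed eigenvalue, from cos(a - b) + cos(a + b) = 2 cos(a) cos(b).
eigenvalue = 2 * math.cos(2 * math.pi * k / n) - 2

# L psi should equal eigenvalue * psi, so the residual should be ~0.
residual = max(abs(a - eigenvalue * b) for a, b in zip(matvec(L, psi), psi))
```

<p>The residual vanishes up to rounding, so the sampled cosine is an eigenvector of this matrix. On a general mesh there is no closed form, which is where a numerical eigensolver comes in.</p>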
<p>At this point, the implications might be obvious—this eigenbasis is useful for transforming and compressing functions on the mesh.
In fact, by interpreting the vertices’ positions as a function, we can even smooth or sharpen the geometry itself.</p>
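<p>As a rough illustration of smoothing geometry, the sketch below treats the \(x\) and \(y\) coordinates of a closed polygon as two functions on its vertices and repeatedly nudges them along a uniform Laplacian.
The explicit update rule, the step size, and the noisy circle are assumptions chosen for simplicity, not the method used for the figures in this post.</p>

```python
import math
import random

def smooth(points, step=0.25, iterations=10):
    """Repeatedly apply p <- p + step * (L p), where L is the uniform
    Laplacian on a closed polygon (each vertex has two neighbors)."""
    pts = list(points)
    n = len(pts)
    for _ in range(iterations):
        pts = [
            tuple(p + step * (a - 2 * p + b)
                  for p, a, b in zip(pts[j], pts[j - 1], pts[(j + 1) % n]))
            for j in range(n)
        ]
    return pts

def roughness(points):
    """Sum of squared Laplacian magnitudes; smaller means smoother."""
    n = len(points)
    return sum(
        sum((a - 2 * p + b) ** 2
            for p, a, b in zip(points[j], points[j - 1], points[(j + 1) % n]))
        for j in range(n)
    )

# A unit circle with a little radial noise (hypothetical input).
random.seed(0)
n = 64
noisy = []
for j in range(n):
    r = 1 + random.uniform(-0.05, 0.05)
    angle = 2 * math.pi * j / n
    noisy.append((r * math.cos(angle), r * math.sin(angle)))

smoothed = smooth(noisy)
```

<p>Each step damps the high-frequency eigenfunctions the most, so the noise fades before the overall circular shape does. Flipping the sign of <code>step</code> would amplify those components instead, which is a crude form of sharpening.</p>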
<h1 id="further-reading">Further Reading</h1>
<p>There’s far more to signal and geometry processing than we can cover here, let alone the many other applications in engineering, physics, and computer science.
We will conclude with an (incomplete, biased) list of topics for further exploration. See if you can follow the functions-are-vectors thread throughout:</p>
<ul>
<li>Geometry: <a href="http://www.cs.cmu.edu/~kmcrane/Projects/HeatMethod/index.html">Distances</a>, <a href="https://nmwsharp.com/research/vector-heat-method/">Parallel Transport</a>, <a href="https://geometrycollective.github.io/boundary-first-flattening/">Flattening</a>, <a href="http://www.cs.cmu.edu/~kmcrane/Projects/NonmanifoldLaplace/index.html">Non-manifold Meshes</a>, and <a href="https://dl.acm.org/doi/pdf/10.1145/3386569.3392389">Polygonal Meshes</a></li>
<li>Simulation: the <a href="https://en.wikipedia.org/wiki/Finite_element_method">Finite Element Method</a>, <a href="http://www.cs.cmu.edu/~kmcrane/Projects/MonteCarloGeometryProcessing/index.html">Monte Carlo PDEs</a>, <a href="https://cseweb.ucsd.edu/~alchern/projects/MinimalCurrent/">Minimal Surfaces</a>, and <a href="https://yhesper.github.io/fc23/fc23.html">Fluid Cohomology</a></li>
<li>Light Transport: <a href="https://en.wikipedia.org/wiki/Radiosity_(computer_graphics)">Radiosity</a>, <a href="https://graphics.stanford.edu/papers/veach_thesis/thesis.pdf">Operator Formulation (Ch.4)</a>, <a href="https://bartwronski.com/2022/02/15/light-transport-matrices-svd-spectral-analysis-and-matrix-completion/">Low-Rank Approximation</a>, and <a href="https://rgl.epfl.ch/publications">Inverse Rendering</a></li>
<li>Machine Learning: <a href="https://nmwsharp.com/research/diffusion-net/">DiffusionNet</a>, <a href="https://ranahanocka.github.io/MeshCNN/">MeshCNN</a>, <a href="https://nmwsharp.com/research/neural-physics-subspaces/">Kinematics</a>, <a href="https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf">Fourier</a> <a href="https://arxiv.org/pdf/2006.10739.pdf">Features</a>, and <a href="https://rgl.s3.eu-central-1.amazonaws.com/media/papers/Nicolet2021Large.pdf">Inverse Geometry</a></li>
<li><a href="https://www.youtube.com/watch?v=jvPPXbo87ds">Splines</a>: <a href="http://www.cemyuksel.com/research/interpolating_curves/">C2 Interpolation</a>, <a href="https://ttnghia.github.io/posts/quadratic-approximation-of-cubic-curves/">Quadratic Approximation</a>, and <a href="https://raphlinus.github.io/curves/2023/04/18/bezpath-simplify.html">Simplification</a></li>
</ul>
<p>Thanks to <a href="https://hachiyuki8.github.io/">Joanna Y</a>, <a href="https://yhesper.github.io/">Hesper Yin</a>, and <a href="https://fanpu.io/">Fan Pu Zeng</a> for providing feedback on this post.</p>
<hr />
<h1 id="footnotes">Footnotes</h1>
<link rel="stylesheet" type="text/css" href="/assets/functions-are-vectors/style.css" />
<script src="/assets/lib/geometry-processing/linear-algebra-asm.js"></script>
<script src="/assets/lib/geometry-processing/vector.js"></script>
<script src="/assets/lib/geometry-processing/complex.js"></script>
<script src="/assets/lib/geometry-processing/emscripten-memory-manager.js"></script>
<script> let memoryManager = new EmscriptenMemoryManager(); </script>
<script src="/assets/lib/geometry-processing/dense-matrix.js"></script>
<script src="/assets/lib/geometry-processing/sparse-matrix.js"></script>
<script src="/assets/lib/geometry-processing/complex-dense-matrix.js"></script>
<script src="/assets/lib/geometry-processing/complex-sparse-matrix.js"></script>
<script src="/assets/lib/geometry-processing/vertex.js"></script>
<script src="/assets/lib/geometry-processing/edge.js"></script>
<script src="/assets/lib/geometry-processing/face.js"></script>
<script src="/assets/lib/geometry-processing/halfedge.js"></script>
<script src="/assets/lib/geometry-processing/corner.js"></script>
<script src="/assets/lib/geometry-processing/mesh.js"></script>
<script src="/assets/lib/geometry-processing/geometry.js"></script>
<script src="/assets/lib/geometry-processing/meshio.js"></script>
<script type="module" src="/assets/functions-are-vectors/js/functions-are-vectors.js"></script>
<script src="/assets/lib/d3.v7.min.js"></script>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:polynomials" role="doc-endnote">
<p>We shouldn’t assume all countably infinite-dimensional vectors represent functions on the natural numbers.
Later on, we’ll discuss the space of real polynomial functions, which also has countably infinite dimensionality. <a href="#fnref:polynomials" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rtor" role="doc-endnote">
<p>If you’re alarmed by the fact that the set of all real functions does not form a <a href="https://en.wikipedia.org/wiki/Hilbert_space">Hilbert space</a>, you’re probably not in the target audience of this post.
Don’t worry, we will later move on to \(L^2(\mathbb{R})\) and \(L^2(\mathbb{R},\mathbb{C})\). <a href="#fnref:rtor" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:otherspaces" role="doc-endnote">
<p>If you’re paying close attention, you might notice that none of these proofs depended on the fact that the <em>domain</em> of our functions is the real numbers.
In fact, the only necessary assumption is that our functions <em>return</em> elements of a field, allowing us to apply commutativity, associativity, etc. to the outputs.</p>
<p>Therefore, we can define a vector space over the set of functions that map <em>any</em> fixed set \(\mathcal{S}\) to a field \(\mathbb{F}\).
This result illustrates why we could intuitively equate finite-dimensional vectors with mappings from <em>index</em> to <em>value</em>.
For example, consider \(\mathcal{S} = \{1,2,3\}\):</p>
<div class="flextable" style="margin-top: -20px; margin-bottom: 10px">
<div class="item" style="margin-right: 5px; margin-top: 10px">
$$ \begin{align*} f_{\mathcal{S}\mapsto\mathbb{R}} &= \{ 1 \mapsto x, 2 \mapsto y, 3 \mapsto z \} \\ &\vphantom{\Big|}\phantom{\,}\Updownarrow \\ f_{\mathbb{R}^3} &= \begin{bmatrix}x \\ y \\ z\end{bmatrix} \end{align*} $$
</div>
<div class="item" style="margin-left: 5px">
<img class="diagram2d center" style="min-width: 300px" src="/assets/functions-are-vectors/stor.svg" />
</div>
</div>
<p>If \(\mathcal{S}\) is finite, a function from \(\mathcal{S}\) to a field directly corresponds to a familiar finite-dimensional vector. <a href="#fnref:otherspaces" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:polynatural" role="doc-endnote">
<p>Earlier, we used countably infinite-dimensional vectors to represent functions on the natural numbers.
In doing so, we implicitly employed the standard basis of impulse functions indexed by \(i \in \mathbb{N}\).
Here, we encode real polynomials by choosing new basis functions of type \(\mathbb{R}\mapsto\mathbb{R}\), namely \(\mathbf{e}_i[x] = x^i\). <a href="#fnref:polynatural" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:polybasis" role="doc-endnote">
<p>Since this basis spans the space of analytic functions, analytic functions have countably infinite dimensionality. Relative to the uncountably infinite-dimensional space of all real functions, analytic functions are vanishingly rare! <a href="#fnref:polybasis" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:symmetric" role="doc-endnote">
<p>Actually, it states that a <a href="https://en.wikipedia.org/wiki/Compact_operator_on_Hilbert_space">compact self-adjoint operator on a Hilbert space</a> admits an orthonormal eigenbasis with real eigenvalues.
Rigorously defining these terms is beyond the scope of this post, but note that our intuition about “infinite-dimensional matrices” only legitimately applies to this class of operators. <a href="#fnref:symmetric" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:thisspace" role="doc-endnote">
<p>Specifically, a basis for the domain of our Laplacian, not all functions. We’ve built up several conditions that restrict our domain to square integrable, periodic, twice-differentiable real functions on \([0,1]\). <a href="#fnref:thisspace" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vibecheck" role="doc-endnote">
<p>Also known as a <a href="/assets/functions-are-vectors/vibecheck.jpg"><em>vibe check</em></a>. <a href="#fnref:vibecheck" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:morelaplacians" role="doc-endnote">
<p>Higher-dimensional Laplacians are sums of second derivatives, e.g.
\(\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}\).
The <em>Laplace-Beltrami</em> operator expresses this sum as the divergence of the gradient, i.e. \(\Delta f = \nabla \cdot \nabla f\).
Using exterior calculus, we can further generalize divergence and gradient to produce the <em>Laplace-de Rham</em> operator, \(\Delta f = \delta d f\), where \(\delta\) is the <a href="https://en.wikipedia.org/wiki/Hodge_star_operator#Codifferential">codifferential</a> and \(d\) is the <a href="https://en.wikipedia.org/wiki/Exterior_derivative">exterior derivative</a>. <a href="#fnref:morelaplacians" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:harmonics" role="doc-endnote">
<p>Typically, a “harmonic” function must satisfy \(\Delta f = 0\), but in this case we include all eigenfunctions. <a href="#fnref:harmonics" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:legendre" role="doc-endnote">
<p>The associated Legendre polynomials are closely related to the <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre polynomials</a>, which form an orthogonal basis for polynomials. Interestingly, they can be derived by applying the <a href="https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process">Gram–Schmidt</a> process to the basis of powers (\(1, x, x^2, \dots\)). <a href="#fnref:legendre" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
By definition, an analytic function can be expressed as a Taylor series about \(0\): \[f[x] = \sum_{n=0}^\infty \frac{f^{(n)}[0]}{n!}x^n = \sum_{n=0}^\infty \alpha_n x^n\] Which is a linear combination of our polynomial basis functions. That means a Taylor expansion is essentially a change of basis into the sequence of powers, where our differentiation operator is quite simple.5 Diagonalization Review: Eigenvectors and eigenvalues | Chapter 14, Essence of linear algebra. Matrix decompositions are arguably the crowning achievement of linear algebra. To get started, let’s review what diagonalization means for a \(3\times3\) real matrix \(\mathbf{A}\). Eigenvectors A vector \(\mathbf{u}\) is an eigenvector of the matrix \(\mathbf{A}\) when the following condition holds: $$ \mathbf{Au} = \lambda \mathbf{u} $$ The eigenvalue \(\lambda\) may be computed by solving the characteristic polynomial of \(\mathbf{A}\). Eigenvalues may be real or complex. The matrix \(\mathbf{A}\) is diagonalizable when it admits three linearly independent eigenvectors, each with a corresponding real eigenvalue. This set of eigenvectors constitutes an eigenbasis for the underlying vector space, indicating that we can express any vector \(\mathbf{x}\) via their linear combination. $$ \mathbf{x} = \alpha\mathbf{u}_1 + \beta\mathbf{u}_2 + \gamma\mathbf{u}_3 $$ To multiply \(\mathbf{x}\) by \(\mathbf{A}\), we just have to scale each component by its corresponding eigenvalue. $$ \begin{align*} \mathbf{Ax} &= \alpha\mathbf{A}\mathbf{u}_1 + \beta\mathbf{A}\mathbf{u}_2 + \gamma\mathbf{A}\mathbf{u}_3 \\ &= \alpha\lambda_1\mathbf{u}_1 + \beta\lambda_2\mathbf{u}_2 + \gamma\lambda_3\mathbf{u}_3 \end{align*} $$ Finally, re-combining the eigenvectors expresses the result in the standard basis. Intuitively, we’ve shown that multiplying by \(\mathbf{A}\) is equivalent to a change of basis, a scaling, and a change back. 
That means we can write \(\mathbf{A}\) as the product of an invertible matrix \(\mathbf{U}\), a diagonal matrix \(\mathbf{\Lambda}\), and the inverse \(\mathbf{U}^{-1}\). \[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^{-1}} \\ &= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix} \begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix}^{-1} \end{align*}\] Note that \(\mathbf{U}\) is invertible because its columns (the eigenvectors) form a basis for \(\mathbb{R}^3\). When multiplying by \(\mathbf{x}\), \(\mathbf{U}^{-1}\) converts \(\mathbf{x}\) to the eigenbasis, \(\mathbf{\Lambda}\) scales by the corresponding eigenvalues, and \(\mathbf{U}\) takes us back to the standard basis. In the presence of complex eigenvalues, \(\mathbf{A}\) may still be diagonalizable if we allow \(\mathbf{U}\) and \(\mathbf{\Lambda}\) to include complex entries. In this case, the decomposition as a whole still maps real vectors to real vectors, but the intermediate values become complex. Eigenfunctions Review: What’s so special about Euler’s number e? | Chapter 5, Essence of calculus. So, what does diagonalization mean in a vector space of functions? Given a linear operator \(\mathcal{L}\), you might imagine a corresponding definition for eigenfunctions: \[\mathcal{L}f = \psi f\] The scalar \(\psi\) is again known as an eigenvalue.
Since \(\mathcal{L}\) is infinite-dimensional, it doesn’t have a characteristic polynomial—there’s not a straightforward method for computing \(\psi\). Nevertheless, let’s attempt to diagonalize differentiation on analytic functions. The first step is to find the eigenfunctions. Start by applying the above condition to our differentiation operator in the power basis: \[\begin{align*} && \frac{\partial}{\partial x}\mathbf{p} = \psi \mathbf{p} \vphantom{\Big|}& \\ &\iff& \begin{bmatrix}0 & 1 & 0 & 0 & \cdots\\ 0 & 0 & 2 & 0 & \cdots\\ 0 & 0 & 0 & 3 & \cdots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}\begin{bmatrix}p_0\\ p_1\\ p_2\\ p_3\\ \vdots\end{bmatrix} &= \begin{bmatrix}\psi p_0\\ \psi p_1 \\ \psi p_2 \\ \psi p_3 \\ \vdots \end{bmatrix} \\ &\iff& \begin{cases} p_1 &= \psi p_0 \\ p_2 &= \frac{\psi}{2} p_1 \\ p_3 &= \frac{\psi}{3} p_2 \\ &\dots \end{cases} & \end{align*}\] This system of equations implies that all coefficients are determined solely by our choice of constants \(p_0\) and \(\psi\). We can explicitly write down their relationship as \(p_i = \frac{\psi^i}{i!}p_0\). Now, let’s see what this class of polynomials actually looks like. \[p[x] = p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2 + p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots\] \[\begin{align*} p[x] &= p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2\, +\\ &\phantom{=} p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots \end{align*}\] Differentiation shows that this function is, in fact, an eigenfunction for the eigenvalue \(\psi\). 
\[\begin{align*} \frac{\partial}{\partial x} p[x] &= 0 + p_0\psi + p_0 \psi^2 x + p_0\frac{\psi^3}{2}x^2 + p_0\frac{\psi^4}{6}x^3 + \dots \\ &= \psi p[x] \end{align*}\] With a bit of algebraic manipulation, the series definition of \(e^{x}\) pops out: \[\begin{align*} p[x] &= p_0 + p_0\psi x + p_0\frac{\psi^2}{2}x^2 + p_0\frac{\psi^3}{6}x^3 + p_0\frac{\psi^4}{24}x^4 + \dots \\ &= p_0\left(1 + (\psi x) + \frac{1}{2!}(\psi x)^2 + \frac{1}{3!}(\psi x)^3 + \frac{1}{4!}(\psi x)^4 + \dots\right) \\ &= p_0 e^{\psi x} \end{align*}\] Therefore, functions of the form \(p_0e^{\psi x}\) are eigenfunctions for the eigenvalue \(\psi\), including when \(\psi=0\). Diagonalizing Differentiation We’ve found the eigenfunctions of the derivative operator, but can we diagonalize it? Ideally, we would express differentiation as the combination of an invertible operator \(\mathcal{L}\) and a diagonal operator \(\mathcal{D}\).
\[\begin{align*} \frac{\partial}{\partial x} &= \mathcal{L} \mathcal{D} \mathcal{L}^{-1} \\ &= \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix} \begin{bmatrix} \psi_1 & 0 & \dots \\ 0 & \psi_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} {\color{red} \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}^{-1} } \end{align*}\] \[\begin{align*} \frac{\partial}{\partial x} &= \mathcal{L} \mathcal{D} \mathcal{L}^{-1} \\ &= \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix} \\ & \phantom{=} \begin{bmatrix} \psi_1 & 0 & \dots \\ 0 & \psi_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} \\ & \phantom{=} {\color{red} \begin{bmatrix} \vert & \vert & & \\ \alpha e^{\psi_1 x} & \beta e^{\psi_2 x} & \dots \\ \vert & \vert & \end{bmatrix}^{-1} } \end{align*}\] Diagonalization is only possible when our eigenfunctions form a basis. This would be true if all analytic functions are expressible as a linear combination of exponentials. However… Counterexample: $$f[x] = x$$ First assume that \(f[x] = x\) can be represented as a linear combination of exponentials. Since analytic functions have countably infinite dimensionality, we should only need a countably infinite sum: \[f[x] = x = \sum_{n=0}^\infty \alpha_n e^{\psi_n x}\] Differentiating both sides: \[\begin{align*} f^{\prime}[x] &= 1 = \sum_{n=0}^\infty \psi_n\alpha_n e^{\psi_n x} \\ f^{\prime\prime}[x] &= 0 = \sum_{n=0}^\infty \psi_n^2\alpha_n e^{\psi_n x} \end{align*}\] Since \(e^{\psi_n x}\) and \(e^{\psi_m x}\) are linearly independent when \(n\neq m\), the final equation implies that all \(\alpha = 0\), except possibly the \(\alpha_\xi\) corresponding to \(\psi_\xi = 0\). 
Therefore: \[\begin{align*} 1 &= \sum_{n=0}^\infty \psi_n\alpha_n e^{\psi_n x}\\ &= \psi_\xi \alpha_\xi + \sum_{n\neq \xi} 0\psi_n e^{\psi_n x} \\ &= 0 \end{align*}\] That’s a contradiction—the linear combination representing \(f[x] = x\) does not exist. A similar argument shows that we can’t represent any non-constant function whose \(n\)th derivative is zero, nor periodic functions like sine and cosine. Real exponentials don’t constitute a basis, so we cannot construct an invertible \(\mathcal{L}\). The Laplace Transform We previously mentioned that more matrices can be diagonalized if we allow the decomposition to contain complex numbers. Analogously, more linear operators are diagonalizable in the larger vector space of functions from \(\mathbb{R}\) to \(\mathbb{C}\). Differentiation works the same way in this space; we’ll still find that its eigenfunctions are exponential. \[\frac{\partial}{\partial x} e^{(a+bi)x} = (a+bi)e^{(a+bi)x}\] However, the new eigenfunctions have complex eigenvalues, so we still can’t diagonalize. We’ll need to consider the still larger space of functions from \(\mathbb{C}\) to \(\mathbb{C}\). \[\frac{\partial}{\partial x} : (\mathbb{C}\mapsto\mathbb{C}) \mapsto (\mathbb{C}\mapsto\mathbb{C})\] In this space, differentiation can be diagonalized via the Laplace transform. Although useful for solving differential equations, the Laplace transform is non-trivial to invert, so we won’t discuss it further. In the following sections, we’ll delve into an operator that can be easily diagonalized in \(\mathbb{R}\mapsto\mathbb{C}\): the Laplacian. Inner Product Spaces Review: Dot products and duality | Chapter 9, Essence of linear algebra. Before we get to the spectral theorem, we’ll need to understand one more topic: inner products. You’re likely already familiar with one example of an inner product—the Euclidean dot product. 
\[\begin{bmatrix}x\\ y\\ z\end{bmatrix} \cdot \begin{bmatrix}a\\ b\\ c\end{bmatrix} = ax + by + cz\] An inner product describes how to measure a vector along another vector. For example, \(\mathbf{u}\cdot\mathbf{v}\) is proportional to the length of the projection of \(\mathbf{u}\) onto \(\mathbf{v}\). $$ \mathbf{u} \cdot \mathbf{v} =\|\mathbf{u}\|\|\mathbf{v}\|\cos[\theta] $$ With a bit of trigonometry, we can show that the dot product is equivalent to multiplying the vectors’ lengths with the cosine of their angle. This relationship suggests that the product of a vector with itself produces the square of its length. \[\begin{align*} \mathbf{u}\cdot\mathbf{u} &= \|\mathbf{u}\|\|\mathbf{u}\|\cos[0] \\ &= \|\mathbf{u}\|^2 \end{align*}\] Similarly, when two vectors form a right angle (are orthogonal), their dot product is zero. $$ \begin{align*} \mathbf{u} \cdot \mathbf{v} &= \|\mathbf{u}\|\|\mathbf{v}\|\cos[90^\circ] \\ &= 0 \end{align*} $$ Of course, the Euclidean dot product is only one example of an inner product. In more general spaces, the inner product is denoted using angle brackets, such as \(\langle \mathbf{u}, \mathbf{v} \rangle\). The length (also known as the norm) of a vector is defined as \(\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}\). Two vectors are orthogonal if their inner product is zero: \(\ \mathbf{u} \perp \mathbf{v}\ \iff\ \langle \mathbf{u}, \mathbf{v} \rangle = 0\). A vector space augmented with an inner product is known as an inner product space. A Functional Inner Product We can’t directly apply the Euclidean dot product to our space of real functions, but its \(N\)-dimensional generalization is suggestive. \[\begin{align*} \mathbf{u} \cdot \mathbf{v} &= u_1v_1 + u_2v_2 + \dots + u_Nv_N \\ &= \sum_{i=1}^N u_iv_i \end{align*}\] Given countable indices, we simply match up the values, multiply them, and add the results. When indices are uncountable, we can convert the discrete sum to its continuous analog: an integral! 
\[\langle f, g \rangle = \int_a^b f[x]g[x] \, dx\] When \(f\) and \(g\) are similar, multiplying them produces a larger function; when they’re different, they cancel out. Integration measures their product over some domain to produce a scalar result. Of course, not all functions can be integrated. Our inner product space will only contain functions that are square integrable over the domain \([a, b]\), which may be \([-\infty, \infty]\). Luckily, the important properties of our inner product do not depend on the choice of integration domain. Proofs Below, we’ll briefly cover functions from \(\mathbb{R}\) to \(\mathbb{C}\). In this space, our intuitive notion of similarity still applies, but we’ll use a slightly more general inner product: \[\langle f,g \rangle = \int_a^b f[x]\overline{g[x]}\, dx\] Where \(\overline{x}\) denotes conjugation, i.e. \(\overline{a + bi} = a - bi\). Like other vector space operations, an inner product must satisfy several axioms: Conjugate Symmetry For all vectors $$\mathbf{u}, \mathbf{v} \in \mathcal{V}$$: $$\langle \mathbf{u}, \mathbf{v} \rangle = \overline{\langle \mathbf{v}, \mathbf{u} \rangle}$$ Conjugation may be taken outside the integral, making this one easy: $$\begin{align*} \langle f, g \rangle &= \int_a^b f[x]\overline{g[x]} \, dx \\ &= \int_a^b \overline{g[x]\overline{f[x]}} \, dx \\ &= \overline{\int_a^b g[x]\overline{f[x]} \, dx} \\ &= \overline{\langle g, f \rangle} \end{align*}$$ Note that we require conjugate symmetry because it implies $$\langle\mathbf{u}, \mathbf{u}\rangle = \overline{\langle\mathbf{u}, \mathbf{u}\rangle}$$, i.e. the inner product of a vector with itself is real. 
Linearity in the First Argument For all vectors $$\mathbf{u}, \mathbf{v}, \mathbf{w} \in \mathcal{V}$$ and scalars $$\alpha, \beta \in \mathbb{F}$$: $$\langle \alpha \mathbf{u} + \beta \mathbf{v}, \mathbf{w} \rangle = \alpha\langle \mathbf{u}, \mathbf{w} \rangle + \beta\langle \mathbf{v}, \mathbf{w} \rangle $$ The proof follows from linearity of integration, as well as our vector space axioms: \[\begin{align*} \langle \alpha f + \beta g, h \rangle &= \int_a^b (\alpha f + \beta g)[x]\overline{h[x]} \, dx \\ &= \int_a^b (\alpha f[x] + \beta g[x])\overline{h[x]} \, dx \\ &= \int_a^b \alpha f[x]\overline{h[x]} + \beta g[x]\overline{h[x]} \, dx \\ &= \alpha\int_a^b f[x]\overline{h[x]}\, dx + \beta\int_a^b g[x]\overline{h[x]} \, dx \\ &= \alpha\langle f, h \rangle + \beta\langle g, h \rangle \end{align*}\] \[\begin{align*} &\langle \alpha f + \beta g, h \rangle\\ &= \int_a^b (\alpha f + \beta g)[x]\overline{h[x]} \, dx \\ &= \int_a^b (\alpha f[x] + \beta g[x])\overline{h[x]} \, dx \\ &= \int_a^b \alpha f[x]\overline{h[x]} + \beta g[x]\overline{h[x]} \, dx \\ &= \alpha\int_a^b f[x]\overline{h[x]}\, dx\, +\\&\hphantom{==} \beta\int_a^b g[x]\overline{h[x]} \, dx \\ &= \alpha\langle f, h \rangle + \beta\langle g, h \rangle \end{align*}\] Given conjugate symmetry, an inner product is also antilinear in the second argument. Positive-Definiteness For all $$\mathbf{u} \in \mathcal{V}$$: $$ \begin{cases} \langle \mathbf{u}, \mathbf{u} \rangle = 0 & \text{if } \mathbf{u} = \mathbf{0} \\ \langle \mathbf{u}, \mathbf{u} \rangle > 0 & \text{otherwise} \end{cases} $$ By conjugate symmetry, we know $$\langle f, f \rangle$$ is real, so we can compare it with zero. However, rigorously proving this result requires measure-theoretic concepts beyond the scope of this post. In brief, we redefine $$\mathbf{0}$$ not as specifically $$\mathbf{0}[x] = 0$$, but as an equivalence class of functions that are zero "almost everywhere." 
If $$f$$ is zero almost everywhere, it is only non-zero on a set of measure zero, and therefore integrates to zero. After partitioning our set of functions into equivalence classes, all non-zero functions square-integrate to a positive value. This implies that every function has a real norm, $$\sqrt{\langle f, f \rangle}$$. Along with the definition \(\|f\| = \sqrt{\langle f, f \rangle}\), these properties entail a variety of important results, including the Cauchy–Schwarz and triangle inequalities. The Spectral Theorem Diagonalization is already a powerful technique, but we’re building up to an even more important result regarding orthonormal eigenbases. In an inner product space, an orthonormal basis must satisfy two conditions: each vector is unit length, and all vectors are mutually orthogonal. $$ \begin{cases} \langle\mathbf{u}_i,\mathbf{u}_i\rangle = 1 & \forall i \\ \langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0 & \forall i \neq j \end{cases} $$ A matrix consisting of orthonormal columns is known as an orthogonal matrix. Orthogonal matrices represent rotations of the standard basis. In an inner product space, matrix-vector multiplication computes the inner product of the vector with each row of the matrix. 
Something interesting happens when we multiply an orthogonal matrix \(\mathbf{U}\) by its transpose: \[\begin{align*} \mathbf{U}^T\mathbf{U} &= \begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix} \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix} \\ &= \begin{bmatrix} \langle \mathbf{u}_1, \mathbf{u}_1 \rangle & \langle \mathbf{u}_1, \mathbf{u}_2 \rangle & \langle \mathbf{u}_1, \mathbf{u}_3 \rangle \\ \langle \mathbf{u}_2, \mathbf{u}_1 \rangle & \langle \mathbf{u}_2, \mathbf{u}_2 \rangle & \langle \mathbf{u}_2, \mathbf{u}_3 \rangle \\ \langle \mathbf{u}_3, \mathbf{u}_1 \rangle & \langle \mathbf{u}_3, \mathbf{u}_2 \rangle & \langle \mathbf{u}_3, \mathbf{u}_3 \rangle \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \\ &= \mathcal{I} \end{align*}\] Since \(\mathbf{U}^T\mathbf{U} = \mathcal{I}\) (and \(\mathbf{U}\mathbf{U}^T = \mathcal{I}\)), we’ve found that the transpose of \(\mathbf{U}\) is equal to its inverse. When diagonalizing \(\mathbf{A}\), we used \(\mathbf{U}\) to transform vectors from our eigenbasis to the standard basis. Conversely, its inverse transformed vectors from the standard basis to our eigenbasis. If \(\mathbf{U}\) happens to be orthogonal, transforming a vector \(\mathbf{x}\) into the eigenbasis is equivalent to projecting \(\mathbf{x}\) onto each eigenvector. 
$$ \mathbf{U}^{-1}\mathbf{x} = \mathbf{U}^T\mathbf{x} = \begin{bmatrix}\langle \mathbf{u}_1, \mathbf{x} \rangle \\ \langle \mathbf{u}_2, \mathbf{x} \rangle \\ \langle \mathbf{u}_3, \mathbf{x} \rangle \end{bmatrix} $$ Additionally, the diagonalization of \(\mathbf{A}\) becomes quite simple: \[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^T} \\ &= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix} \begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix} \end{align*}\] \[\begin{align*} \mathbf{A} &= \mathbf{U\Lambda U^T} \\ &= \begin{bmatrix}\vert & \vert & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\ \vert & \vert & \vert \end{bmatrix} \\ &\phantom{=} \begin{bmatrix}\lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \\ &\phantom{=} \begin{bmatrix}\text{---} & \mathbf{u}_1 & \text{---} \\ \text{---} & \mathbf{u}_2 & \text{---} \\ \text{---} & \mathbf{u}_3 & \text{---} \end{bmatrix} \end{align*}\] Given an orthogonal diagonalization of \(\mathbf{A}\), we can deduce that \(\mathbf{A}\) must be symmetric, i.e. \(\mathbf{A} = \mathbf{A}^T\). \[\begin{align*} \mathbf{A}^T &= (\mathbf{U\Lambda U}^T)^T \\ &= {\mathbf{U}^T}^T \mathbf{\Lambda }^T \mathbf{U}^T \\ &= \mathbf{U\Lambda U}^T \\ &= \mathbf{A} \end{align*}\] The spectral theorem states that the converse is also true: \(\mathbf{A}\) is symmetric if and only if it admits an orthonormal eigenbasis with real eigenvalues. Proving this result is somewhat involved in finite dimensions and very involved in infinite dimensions, so we won’t reproduce the proofs here. 
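As a quick numerical sanity check of the orthogonal diagonalization above, the following sketch builds \(\mathbf{A} = \mathbf{U\Lambda U}^T\) from a hand-picked orthonormal basis, then confirms that the result is symmetric and that each column of \(\mathbf{U}\) is an eigenvector. The specific vectors, the eigenvalues, and the helper names (`matmul`, `transpose`, `all_close`) are illustrative assumptions, not part of the derivation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Swap rows and columns."""
    return [list(column) for column in zip(*a)]

def all_close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices."""
    return all(math.isclose(x, y, abs_tol=tol)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

s = 1 / math.sqrt(2)
# Assumed example: the columns u1, u2, u3 of U are orthonormal.
U = [[s,    s,   0.0],
     [s,   -s,   0.0],
     [0.0,  0.0, 1.0]]
eigenvalues = [2.0, -1.0, 5.0]  # arbitrary real eigenvalues
Lambda = [[eigenvalues[i] if i == j else 0.0 for j in range(3)]
          for i in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# U^T U = I, so the transpose of U is its inverse.
orthogonal = all_close(matmul(transpose(U), U), identity)

# A = U Lambda U^T is symmetric by construction.
A = matmul(matmul(U, Lambda), transpose(U))
symmetric = all_close(A, transpose(A))

# Column i of (A U) equals lambda_i times column i of U, i.e. A u_i = lambda_i u_i.
eigenpairs_hold = all_close(matmul(A, U), matmul(U, Lambda))

# U^T x holds the inner product of x with each eigenvector.
x = [[1.0], [2.0], [3.0]]
coordinates = matmul(transpose(U), x)

print(orthogonal, symmetric, eigenpairs_hold)
print(A)
```

Going the other way, from an arbitrary symmetric matrix to its eigenbasis, requires an eigenvalue solver, which this sketch deliberately leaves out.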
Self-Adjoint Operators We can generalize the spectral theorem to our space of functions, where it states that a self-adjoint operator admits an orthonormal eigenbasis with real eigenvalues.6 Denoted as \(\mathbf{A}^{\hspace{-0.1em}\star\hspace{0.1em}}\), the adjoint of an operator \(\mathbf{A}\) is defined by the following relationship. \[\langle \mathbf{Ax}, \mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{A}^{\hspace{-0.1em}\star\hspace{0.1em}}\mathbf{y} \rangle\] When \(\mathbf{A} = \mathbf{A}^\star\), we say that \(\mathbf{A}\) is self-adjoint. The adjoint can be thought of as a generalized transpose—but it’s not obvious what that means in infinite dimensions. We will simply use our functional inner product to determine whether an operator is self-adjoint. The Laplace Operator Review: Solving the heat equation | DE3. Earlier, we weren’t able to diagonalize (real) differentiation, so it must not be self-adjoint. Therefore, we will explore another fundamental operator, the Laplacian. There are many equivalent definitions of the Laplacian, but in our space of one-dimensional functions, it’s just the second derivative. We will hence restrict our domain to twice-differentiable functions. 
\[\Delta f = \frac{\partial^2 f}{\partial x^2}\] We may compute \(\Delta^\star\) using two integrations by parts: \[\begin{align*} \left\langle \Delta f[x], g[x] \right\rangle &= \int_a^b f^{\prime\prime}[x] g[x]\, dx \\ &= f^\prime[x]g[x]\Big|_a^b - \int_a^b f^{\prime}[x] g^{\prime}[x]\, dx \\ &= (f^\prime[x]g[x] - f[x]g^{\prime}[x])\Big|_a^b + \int_a^b f[x] g^{\prime\prime}[x]\, dx \\ &= \left\langle f[x], \Delta g[x] \right\rangle \end{align*}\] In the final step, we assume that \((f^\prime[x]g[x] - f[x]g^{\prime}[x])\big|_a^b = 0\), which is not true in general. To make our conclusion valid, we will constrain our domain to only include functions satisfying this boundary condition. Specifically, we will only consider periodic functions with period \(b-a\). These functions have the same value and derivative at \(a\) and \(b\), so the additional term vanishes. For simplicity, we will also assume our domain to be \([0,1]\). Therefore, the Laplacian is self-adjoint…almost. Technically, we’ve shown that the Laplacian is symmetric, not that \(\Delta = \Delta^\star\). This is a subtle point, and it’s possible to prove self-adjointness, so we will omit this detail. Laplacian Eigenfunctions Review: e^(iπ) in 3.14 minutes, using dynamics | DE5. Applying the spectral theorem tells us that the Laplacian admits an orthonormal eigenbasis. Let’s find it.7 Since the Laplacian is simply the second derivative, real exponentials would still be eigenfunctions—but they’re not periodic, so we’ll have to exclude them.
\[\color{red} \Delta e^{\psi x} = \psi^2 e^{\psi x}\] Luckily, a new class of periodic eigenfunctions appears: \[\begin{align*} \Delta \sin[\psi x] &= -\psi^2 \sin[\psi x] \\ \Delta \cos[\psi x] &= -\psi^2 \cos[\psi x] \end{align*}\] If we allow our diagonalization to introduce complex numbers, we can also consider functions from \(\mathbb{R}\) to \(\mathbb{C}\). Here, purely imaginary exponentials are eigenfunctions with real eigenvalues. \[\begin{align*} \Delta e^{\psi i x} &= (\psi i)^2e^{\psi i x} \\ &= -\psi^2 e^{\psi i x} \\ &= -\psi^2 (\cos[\psi x] + i\sin[\psi x]) \tag{Euler's formula} \end{align*}\] Using Euler’s formula, we can see that these two perspectives are equivalent: they both introduce \(\sin\) and \(\cos\) as eigenfunctions. Either path can lead to our final result, but we’ll stick with the more compact complex case. We also need to constrain the set of eigenfunctions to be periodic on \([0,1]\). As suggested above, we can pick out the frequencies \(\psi\) that are integer multiples of \(2\pi\). \[e^{2\pi \xi i x} = \cos[2\pi \xi x] + i\sin[2\pi \xi x]\] Our set of eigenfunctions is therefore \(e^{2\pi \xi i x}\) for all integers \(\xi\). Diagonalizing the Laplacian Now that we’ve found suitable eigenfunctions, we can construct an orthonormal basis. Our collection of eigenfunctions is linearly independent, as each one corresponds to a distinct eigenvalue.
Next, we can check for orthogonality and unit magnitude: Mutual Orthogonality Compute the inner product of \(e^{2\pi \xi_1 i x}\) and \(e^{2\pi \xi_2 i x}\) for \(\xi_1 \neq \xi_2\): \[\begin{align*} \langle e^{2\pi\xi_1 i x}, e^{2\pi\xi_2 i x} \rangle &= \int_0^1 e^{2\pi\xi_1 i x} \overline{e^{2\pi\xi_2 i x}}\, dx \\ &= \int_0^1 (\cos[2\pi\xi_1 x] + i\sin[2\pi\xi_1 x])(\cos[2\pi\xi_2 x] - i\sin[2\pi\xi_2 x])\, dx \\ &= \int_0^1 \cos[2\pi\xi_1 x]\cos[2\pi\xi_2 x] - i\cos[2\pi\xi_1 x]\sin[2\pi\xi_2 x] +\\ &\phantom{= \int_0^1} i\sin[2\pi\xi_1 x]\cos[2\pi\xi_2 x] + \sin[2\pi\xi_1 x]\sin[2\pi\xi_2 x] \, dx \\ &= \int_0^1 \cos[2\pi(\xi_1-\xi_2)x] + i\sin[2\pi(\xi_1-\xi_2)x]\, dx \\ &= \frac{1}{2\pi(\xi_1-\xi_2)}\left(\sin[2\pi(\xi_1-\xi_2) x]\Big|_0^1 - i\cos[2\pi(\xi_1-\xi_2) x]\Big|_0^1\right) \\ &= 0 \end{align*}\] Note that the final step is valid because \(\xi_1-\xi_2\) is a non-zero integer. This result also applies to any domain \([a,b]\), given functions periodic on \([a,b]\). It’s possible to further generalize to \([-\infty,\infty]\), but doing so requires a weighted inner product.
Unit Magnitude It’s easy to show that all candidate functions have norm one: \[\begin{align*} \langle e^{2\pi\xi i x}, e^{2\pi\xi i x} \rangle &= \int_0^1 e^{2\pi\xi i x}\overline{e^{2\pi\xi i x}}\, dx \\ &= \int_0^1 e^{2\pi\xi i x}e^{-2\pi\xi i x}\, dx \\&= \int_0^1 1\, dx\\ &= 1 \end{align*}\] With the addition of a constant factor \(\frac{1}{b-a}\), this result generalizes to any \([a,b]\). It’s possible to further generalize to \([-\infty,\infty]\), but doing so requires a weighted inner product. The final step is to show that all functions in our domain can be represented by a linear combination of eigenfunctions. To do so, we will find an invertible operator \(\mathcal{L}\) representing the proper change of basis. Critically, since our eigenbasis is orthonormal, we can intuitively consider the inverse of \(\mathcal{L}\) to be its transpose. \[\mathcal{I} = \mathcal{L}\mathcal{L}^{-1} = \mathcal{L}\mathcal{L}^{T} = \begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix}\begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix}\] This visualization suggests that \(\mathcal{L}^Tf\) computes the inner product of \(f\) with each eigenfunction. \[\mathcal{L}^Tf = \begin{bmatrix}\langle f, e^{2\pi\xi_1 i x} \rangle \\ \langle f, e^{2\pi\xi_2 i x} \rangle \\ \vdots \end{bmatrix}\] Which is highly reminiscent of the finite-dimensional case, where we projected onto each eigenvector of an orthogonal eigenbasis.
This insight allows us to write down the product \(\mathcal{L}^Tf\) as a function on the integers, \(\hat{f}[\xi]\). Note that the complex inner product conjugates the second argument, so the exponent is negated. \[(\mathcal{L}^Tf)[\xi] = \hat{f}[\xi] = \int_0^1 f[x]e^{-2\pi\xi i x}\, dx\] Conversely, \(\mathcal{L}\) converts \(\hat{f}\) back to the standard basis. It simply creates a linear combination of eigenfunctions. \[(\mathcal{L}\hat{f})[x] = f[x] = \sum_{\xi=-\infty}^\infty \hat{f}[\xi] e^{2\pi\xi i x}\] These operators are, in fact, inverses of each other, but a rigorous proof is beyond the scope of this post. Therefore, we’ve diagonalized the Laplacian: \[\begin{align*} \Delta &= \mathcal{L} \mathcal{D} \mathcal{L}^T \\ &= \begin{bmatrix} \vert & \vert & & \\ e^{2\pi\xi_1 i x} & e^{2\pi\xi_2 i x} & \dots \\ \vert & \vert & \end{bmatrix} \begin{bmatrix} -(2\pi\xi_1)^2 & 0 & \dots \\ 0 & -(2\pi\xi_2)^2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} \text{---} & e^{2\pi\xi_1 i x} & \text{---} \\ \text{---} & e^{2\pi\xi_2 i x} & \text{---} \\ & \vdots & \end{bmatrix} \end{align*}\] Although \(\mathcal{L}^T\) transforms our real-valued function into a complex-valued function, \(\Delta\) as a whole still maps real functions to real functions. Next, we’ll see how \(\mathcal{L}^T\) is itself an incredibly useful transformation. 
Applications In this section, we’ll explore several applications in signal processing, each of which arises from diagonalizing the Laplacian on a new domain. Fourier Series Review: But what is a Fourier series? From heat flow to drawing with circles | DE4. If you’re familiar with Fourier methods, you likely noticed that \(\hat{f}\) encodes the Fourier series of \(f\). That’s because a Fourier transform is a change of basis into the Laplacian eigenbasis! This basis consists of waves, which makes \(\hat{f}\) a particularly interesting representation for \(f\). For example, consider evaluating \(\hat{f}[1]\): \[\hat{f}[1] = \int_0^1 f[x] e^{-2\pi i x}\, dx = \int_0^1 f[x](\cos[2\pi x] - i\sin[2\pi x])\, dx\] This integral measures how much of \(f\) is represented by waves of frequency (positive) 1. Naturally, \(\hat{f}[\xi]\) computes the same quantity for any integer frequency \(\xi\). \[\hphantom{aaa} {\color{#9673A6} \text{Real}\left[e^{2\pi i \xi x}\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Complex}\left[e^{2\pi i \xi x}\right]}\] $$\xi = 1$$ Therefore, we say that \(\hat{f}\) expresses our function in the frequency domain. To illustrate this point, we’ll use a Fourier series to decompose a piecewise linear function into a collection of waves.8 Since our new basis is orthonormal, the transform is easy to invert by re-combining the waves. Here, the \(\color{#9673A6}\text{purple}\) curve is \(f\); the \(\color{#D79B00}\text{orange}\) curve is a reconstruction of \(f\) from the first \(N\) coefficients of \(\hat{f}\). Try varying the number of coefficients and moving the \(\color{#9673A6}\text{purple}\) dots to affect the results. 
$$ \hphantom{aaa} {\color{#9673A6} f[x]}\,\,\,\,\,\,\,\,\,\,\,\, {\color{#D79B00} \sum_{\xi=-N}^N \hat{f}[\xi]e^{2\pi i \xi x}} $$ $$N = 3$$ Additionally, explore the individual basis functions making up our result: $$\hat{f}[0]$$ $$\hat{f}[1]e^{2\pi i x}$$ $$\hat{f}[2]e^{4\pi i x}$$ $$\hat{f}[3]e^{6\pi i x}$$ Many interesting operations become easy to compute in the frequency domain. For example, by simply dropping Fourier coefficients beyond a certain threshold, we can reconstruct a smoothed version of our function. This technique is known as a low-pass filter—try it out above. Image Compression Computationally, Fourier series are especially useful for compression. Encoding a function \(f\) in the standard basis takes a lot of space, since we store a separate result for each input. If we instead express \(f\) in the Fourier basis, we only need to store a few coefficients—we’ll be able to approximately reconstruct \(f\) by re-combining the corresponding basis functions. So far, we’ve only defined a Fourier transform for functions on \(\mathbb{R}\). Luckily, the transform arose via diagonalizing the Laplacian, and the Laplacian is not limited to one-dimensional functions. In fact, wherever we can define a Laplacian, we can find a corresponding Fourier transform.9 For example, in two dimensions, the Laplacian becomes a sum of second derivatives. \[\Delta f[x,y] = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}\] For the domain \([0,1]\times[0,1]\), we’ll find a familiar set of periodic eigenfunctions. \[e^{2\pi i(nx + my)} = \cos[2\pi(nx + my)] + i\sin[2\pi(nx + my)]\] Where \(n\) and \(m\) are both integers. 
Let’s see what these basis functions look like: \[{\color{#9673A6} \text{Real}\left[e^{2\pi i(nx + my)}\right]}\] \[{\color{#D79B00} \text{Complex}\left[e^{2\pi i(nx + my)}\right]}\] $$n = 3$$ $$m = 3$$ Just like the 1D case, the corresponding Fourier transform is a change of basis into the Laplacian’s orthonormal eigenbasis. Above, we decomposed a 1D function into a collection of 1D waves—here, we equivalently decompose a 2D image into a collection of 2D waves. \[\phantom{\Bigg|} {\color{#9673A6} f[x,y]}\] \[{\color{#D79B00} \sum_{n=-N}^N \sum_{m=-N}^N \hat{f}[n,m]e^{2\pi i(nx + my)}}\] $$N = 3$$ A variant of the 2D Fourier transform is at the core of many image compression algorithms, including JPEG. Spherical Harmonics Computer graphics is often concerned with functions on the unit sphere, so let’s see if we can find a corresponding Fourier transform. In spherical coordinates, the Laplacian can be defined as follows: \[\Delta f(\theta, \phi) = \frac{1}{\sin[\theta]}\frac{\partial}{\partial \theta}\left(\sin[\theta] \frac{\partial f}{\partial \theta}\right) + \frac{1}{\sin^2[\theta]}\frac{\partial^2 f}{\partial \phi^2}\] We won’t go through the full derivation, but this Laplacian admits an orthonormal eigenbasis known as the spherical harmonics.10 \[Y_\ell^m[\theta, \phi] = N_\ell^m P_\ell^m[\cos[\theta]] e^{im\phi}\] Where \(Y_\ell^m\) is the spherical harmonic of degree \(\ell \ge 0\) and order \(m \in [-\ell,\ell]\). 
Note that \(N_\ell^m\) is a constant and \(P_\ell^m\) are the associated Legendre polynomials.11 \[\hphantom{aa} {\color{#9673A6} \text{Real}\left[Y_\ell^m[\theta,\phi]\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Complex}\left[Y_\ell^m[\theta,\phi]\right]}\] $$\ell = 0$$ $$m = 0$$ As above, we define the spherical Fourier transform as a change of basis into the spherical harmonics. In game engines, this transform is often used to compress diffuse environment maps (i.e. spherical images) and global illumination probes. \[\phantom{\Bigg|} {\color{#9673A6} f[\theta,\phi]}\] \[{\color{#D79B00} \sum_{\ell=0}^N \sum_{m=-\ell}^\ell \hat{f}[\ell,m]\left( Y_\ell^m[\theta,\phi]e^{im\phi} \right)}\] $$N = 3$$ You might also recognize spherical harmonics as electron orbitals—quantum mechanics is primarily concerned with the eigenfunctions of linear operators. Geometry Processing Representing functions as vectors underlies many modern algorithms—image compression is only one example. In fact, because computers can do linear algebra so efficiently, applying linear-algebraic techniques to functions produces a powerful new computational paradigm. The nascent field of discrete differential geometry uses this perspective to build algorithms for three-dimensional geometry processing. In computer graphics, functions on meshes often represent textures, unwrappings, displacements, or simulation parameters. DDG gives us a way to faithfully encode such functions as vectors: for example, by associating a value with each vertex of the mesh. $$ \mathbf{f} = \begin{bmatrix} {\color{#666666} f_1}\\ {\color{#82B366} f_2}\\ {\color{#B85450} f_3}\\ {\color{#6C8EBF} f_4}\\ {\color{#D79B00} f_5}\\ {\color{#9673A6} f_6}\\ {\color{#D6B656} f_7}\\ \end{bmatrix} $$ One particularly relevant result is a Laplace operator for meshes. A mesh Laplacian is a finite-dimensional matrix, so we can use numerical linear algebra to find its eigenfunctions. 
As with the continuous case, these functions generalize sine and cosine to a new domain. Here, we visualize the real and complex parts of each eigenfunction, where the two colors indicate positive vs. negative regions. \[\hphantom{aa} {\color{#9673A6} \text{Rea}}{\color{#82B366}\text{l}\left[\psi_N\right]}\,\,\,\,\,\,\,\, {\color{#D79B00} \text{Comp}}{\color{#6C8EBF} \text{lex}\left[\psi_N\right]}\] $$4\text{th Eigenfunction}$$ At this point, the implications might be obvious—this eigenbasis is useful for transforming and compressing functions on the mesh. In fact, by interpreting the vertices’ positions as a function, we can even smooth or sharpen the geometry itself. Further Reading There’s far more to signal and geometry processing than we can cover here, let alone the many other applications in engineering, physics, and computer science. We will conclude with an (incomplete, biased) list of topics for further exploration. See if you can follow the functions-are-vectors thread throughout: Geometry: Distances, Parallel Transport, Flattening, Non-manifold Meshes, and Polygonal Meshes Simulation: the Finite Element Method, Monte Carlo PDEs, Minimal Surfaces, and Fluid Cohomology Light Transport: Radiosity, Operator Formulation (Ch.4), Low-Rank Approximation, and Inverse Rendering Machine Learning: DiffusionNet, MeshCNN, Kinematics, Fourier Features, and Inverse Geometry Splines: C2 Interpolation, Quadratic Approximation, and Simplification Thanks to Joanna Y, Hesper Yin, and Fan Pu Zeng for providing feedback on this post. Footnotes We shouldn’t assume all countably infinite-dimensional vectors represent functions on the natural numbers. Later on, we’ll discuss the space of real polynomial functions, which also has countably infinite dimensionality. ↩ If you’re alarmed by the fact that the set of all real functions does not form a Hilbert space, you’re probably not in the target audience of this post. 
Don’t worry, we will later move on to \(L^2(\mathbb{R})\) and \(L^2(\mathbb{R},\mathbb{C})\). ↩ If you’re paying close attention, you might notice that none of these proofs depended on the fact that the domain of our functions is the real numbers. In fact, the only necessary assumption is that our functions return elements of a field, allowing us to apply commutativity, associativity, etc. to the outputs. Therefore, we can define a vector space over the set of functions that map any fixed set \(\mathcal{S}\) to a field \(\mathbb{F}\). This result illustrates why we could intuitively equate finite-dimensional vectors with mappings from index to value. For example, consider \(\mathcal{S} = \{1,2,3\}\): $$ \begin{align*} f_{\mathcal{S}\mapsto\mathbb{R}} &= \{ 1 \mapsto x, 2 \mapsto y, 3 \mapsto z \} \\ &\vphantom{\Big|}\phantom{\,}\Updownarrow \\ f_{\mathbb{R}^3} &= \begin{bmatrix}x \\ y \\ z\end{bmatrix} \end{align*} $$ If \(\mathcal{S}\) is finite, a function from \(\mathcal{S}\) to a field directly corresponds to a familiar finite-dimensional vector. ↩ Earlier, we used countably infinite-dimensional vectors to represent functions on the natural numbers. In doing so, we implicitly employed the standard basis of impulse functions indexed by \(i \in \mathbb{N}\). Here, we encode real polynomials by choosing new basis functions of type \(\mathbb{R}\mapsto\mathbb{R}\), namely \(\mathbf{e}_i[x] = x^i\). ↩ Since this basis spans the space of analytic functions, analytic functions have countably infinite dimensionality. Relative to the uncountably infinite-dimensional space of all real functions, analytic functions are vanishingly rare! ↩ Actually, it states that a compact self-adjoint operator on a Hilbert space admits an orthonormal eigenbasis with real eigenvalues. Rigorously defining these terms is beyond the scope of this post, but note that our intuition about “infinite-dimensional matrices” only legitimately applies to this class of operators. 
↩ Specifically, a basis for the domain of our Laplacian, not all functions. We’ve built up several conditions that restrict our domain to square integrable, periodic, twice-differentiable real functions on \([0,1]\). ↩ Also known as a vibe check. ↩ Higher-dimensional Laplacians are sums of second derivatives, e.g. \(\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}\). The Laplace-Beltrami operator expresses this sum as the divergence of the gradient, i.e. \(\Delta f = \nabla \cdot \nabla f\). Using exterior calculus, we can further generalize divergence and gradient to produce the Laplace-de Rham operator, \(\Delta f = \delta d f\), where \(\delta\) is the codifferential and \(d\) is the exterior derivative. ↩ Typically, a “harmonic” function must satisfy \(\Delta f = 0\), but in this case we include all eigenfunctions. ↩ The associated Legendre polynomials are closely related to the Legendre polynomials, which form an orthogonal basis for polynomials. Interestingly, they can be derived by applying the Gram–Schmidt process to the basis of powers (\(1, x, x^2, \dots\)). ↩Jersey City2023-07-06T00:00:00+00:002023-07-06T00:00:00+00:00https://thenumb.at/Jersey-City<p>Jersey City, NJ, 2023</p>
<div style="text-align: center;"><img src="/assets/jersey-city/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/02.webp" /></div>
<!-- <div style="text-align: center;"><img src="/assets/jersey-city/03.webp"></div> -->
<div style="text-align: center;"><img src="/assets/jersey-city/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/jersey-city/12.webp" /></div>
<p></p>Jersey City, NJ, 2023Pittsburgh2023-05-19T00:00:00+00:002023-05-19T00:00:00+00:00https://thenumb.at/Pittsburgh<p>University of Pittsburgh / Carnegie Mellon University, Pittsburgh, PA, 2023</p>
<div style="text-align: center;"><img src="/assets/pittsburgh/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/16.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/17.webp" /></div>
<p></p>University of Pittsburgh / Carnegie Mellon University, Pittsburgh, PA, 2023Optimizing Open Addressing2023-03-19T00:00:00+00:002023-03-19T00:00:00+00:00https://thenumb.at/Optimizing-Open-Addressing<p>Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion.
When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice.</p>
<p><em>Code for this article may be found on <a href="https://github.com/TheNumbat/hashtables">GitHub</a>.</em></p>
<ul>
<li><a href="#open-addressing-vs-separate-chaining">Open Addressing vs. Separate Chaining</a></li>
<li><a href="#benchmark-setup">Benchmark Setup</a></li>
<li><a href="#discussion">Discussion</a>
<ul>
<li><a href="#separate-chaining">Separate Chaining</a></li>
<li><a href="#linear-probing">Linear Probing</a></li>
<li><a href="#quadratic-probing">Quadratic Probing</a></li>
<li><a href="#double-hashing">Double Hashing</a></li>
<li><a href="#robin-hood-linear-probing">Robin Hood Linear Probing</a></li>
<li><a href="#two-way-chaining">Two Way Chaining</a></li>
</ul>
</li>
<li><a href="#unrolling-prefetching-and-simd">Unrolling, Prefetching, and SIMD</a></li>
<li><a href="#benchmark-data">Benchmark Data</a></li>
</ul>
<h1 id="open-addressing-vs-separate-chaining">Open Addressing vs. Separate Chaining</h1>
<p>Most people first encounter hash tables implemented using <em>separate chaining</em>, a model simple to understand and analyze mathematically.
In separate chaining, a hash function is used to map each key to one of \(K\) <em>buckets</em>.
Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket.</p>
<p><img class="diagram center" src="/assets/hashtables/chaining.svg" /></p>
<p>Given a hash function drawn from a <a href="https://en.wikipedia.org/wiki/Universal_hashing">universal family</a>, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s <em>load factor</em>.
Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time.</p>
<p>Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer.
In practice, hash tables based on <em>open addressing</em> can provide superior performance, and their limitations can be worked around in nearly all cases.</p>
<h2 id="open-addressing">Open Addressing</h2>
<p>In an open-addressed table, each bucket only contains a single key.
Collisions are handled by placing additional keys elsewhere in the table.
For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to.</p>
<p><img class="diagram center" src="/assets/hashtables/linear.svg" /></p>
<p>Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array.
<em>A priori</em>, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.<sup id="fnref:cacheusage" role="doc-noteref"><a href="#fn:cacheusage" class="footnote" rel="footnote">1</a></sup></p>
<p>However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized.
Typically, a larger array is allocated and all current keys are moved to the new table.
The size is grown by a constant factor (e.g. 2x), so insertions occur in <em>amortized</em> constant time.</p>
<h2 id="why-not-separate-chaining">Why Not Separate Chaining?</h2>
<p>In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s <code class="language-plaintext highlighter-rouge">std::unordered_map</code>.
At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for.</p>
<p>Separate chaining still has some purported benefits, but they seem unconvincing:</p>
<ul>
<li>
<p><em>Separately chained tables don’t require any linear-time operations.</em></p>
<p>First, this isn’t true.
To maintain a constant load factor, separate chaining hash tables <em>also</em> have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\).
Most separately chained tables, including <code class="language-plaintext highlighter-rouge">std::unordered_map</code>, only provide amortized constant time insertion.</p>
<p>Second, open-addressed tables can be incrementally resized.
When the table grows, there’s no reason we have to move all existing keys <em>right away</em>.
Instead, every subsequent operation can move a fixed number of old keys into the new table.
Eventually, all keys will have been moved and the old table may be deallocated.
This technique incurs overhead on every operation during a resize, but none will require linear time.</p>
</li>
<li>
<p><em>When each (key + value) entry is allocated separately, entries can have stable addresses. External code can safely store pointers to them, and large entries never have to be copied.</em></p>
<p>This property is easy to support by adding a single level of indirection. That is, allocate each entry separately, but use an open-addressed table to map
keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries.</p>
</li>
<li>
<p><em>Lower memory overhead.</em></p>
<p>It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets.
However, matching <a href="#robin-hood-linear-probing">high-quality</a> open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance.
Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation.</p>
</li>
</ul>
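<p>The indirection described in the second point can be illustrated with a small sketch. It is not taken from the article’s repository: <code class="language-plaintext highlighter-rouge">Entry</code>, <code class="language-plaintext highlighter-rouge">Buckets</code>, and <code class="language-plaintext highlighter-rouge">grow</code> are hypothetical names, and a plain <code class="language-plaintext highlighter-rouge">std::vector</code> of owning pointers stands in for the table’s flat bucket array. The only point it makes is that growing the array moves pointers while each entry keeps its address.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical entry type; each entry is heap-allocated once and never moved.
struct Entry {
    uint64_t key;
    uint64_t value;
};

// Stand-in for a flat bucket array that stores pointers-to-entries.
using Buckets = std::vector<std::unique_ptr<Entry>>;

// "Resizes" by moving the owning pointers into an array twice as large.
// A real table would rehash each key to choose its new bucket; this sketch
// keeps the old positions to stay focused on what actually gets copied.
inline Buckets grow(Buckets old_buckets) {
    Buckets bigger(old_buckets.size() * 2);
    for (std::size_t i = 0; i < old_buckets.size(); ++i) {
        bigger[i] = std::move(old_buckets[i]);  // moves one pointer, not the Entry
    }
    return bigger;
}
</code></pre></div></div>
<p>External code holding an <code class="language-plaintext highlighter-rouge">Entry*</code> obtained before the call to <code class="language-plaintext highlighter-rouge">grow</code> can keep using it afterwards, at the cost of one extra pointer dereference per lookup.</p>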
<p>In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate <em>and</em> entries cannot be moved.
These requirements can arise when using <a href="https://www.data-structures-in-practice.com/intrusive-linked-lists/">intrusive linked lists</a>, a common pattern in kernel data structures and embedded systems with no heap.</p>
<h1 id="benchmark-setup">Benchmark Setup</h1>
<p>For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values.
Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers.</p>
<p>We will evaluate tables using the following metrics:</p>
<ul>
<li>Time to insert \(N\) keys into an empty table.</li>
<li>Time to erase \(N\) existing keys.</li>
<li>Time to look up \(N\) existing keys.</li>
<li>Time to look up \(N\) missing keys.<sup id="fnref:missingelements" role="doc-noteref"><a href="#fn:missingelements" class="footnote" rel="footnote">2</a></sup></li>
<li>Average probe length for existing & missing keys.</li>
<li>Maximum probe length for existing & missing keys.</li>
<li>Memory amplification in a full table (total size / size of data).</li>
</ul>
<p>Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors.
Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 * \text{Load Factor} \rfloor - 1\).
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown.
The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU<sup id="fnref:mycpu" role="doc-noteref"><a href="#fn:mycpu" class="footnote" rel="footnote">3</a></sup> cannot effectively cache or prefetch memory accesses.</p>
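<p>As a quick check of the arithmetic, the formula for \(N\) can be written out directly. The helper name <code class="language-plaintext highlighter-rouge">keys_for_load</code> is an assumption made for this illustration and does not come from the benchmark code.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cmath>
#include <cstdint>

// Number of keys inserted for a given load factor, following
// N = floor(8388608 * load_factor) - 1 from the text above.
inline uint64_t keys_for_load(double load_factor) {
    constexpr double capacity = 8388608.0;  // 2^23 buckets, i.e. "8M entries"
    return static_cast<uint64_t>(std::floor(capacity * load_factor)) - 1;
}
</code></pre></div></div>
<p>Under this formula, a 50% load corresponds to 4,194,303 keys, 75% to 6,291,455 keys, and 90% to 7,549,746 keys.</p>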
<h2 id="table-sizes">Table Sizes</h2>
<p>All benchmarked tables use exclusively power-of-two sizes.
This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h % 2**n == h & (2**n - 1)
</code></pre></div></div>
<p>Power-of-two sizes also admit a neat virtual memory trick,<sup id="fnref:vmem" role="doc-noteref"><a href="#fn:vmem" class="footnote" rel="footnote">4</a></sup> but that optimization is not included here.</p>
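<p>In C++, the masking identity above might be wrapped as follows. This is a minimal sketch: the name <code class="language-plaintext highlighter-rouge">bucket_index</code> is hypothetical, and the assertion simply guards the power-of-two assumption.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>
#include <cstdint>

// Maps a hash to a bucket index, assuming `capacity` is a non-zero power of two.
// For such capacities, `hash & (capacity - 1)` equals `hash % capacity`.
inline uint64_t bucket_index(uint64_t hash, uint64_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    return hash & (capacity - 1);
}
</code></pre></div></div>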
<p>However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits.
An alternative strategy is to use <em>prime</em> sized tables, which are significantly more robust to the choice of hash function.
For the purposes of this post, we have a fixed hash function and non-adversarial inputs, so prime sizes were not used.<sup id="fnref:prime" role="doc-noteref"><a href="#fn:prime" class="footnote" rel="footnote">5</a></sup></p>
<h2 id="hash-function">Hash Function</h2>
<p>All benchmarked tables use the <code class="language-plaintext highlighter-rouge">squirrel3</code> hash function, a fast integer hash that (as far as I can tell) only exists in <a href="https://www.youtube.com/watch?v=LWFzPP8ZbdU">this GDC talk on noise-based RNG</a>.</p>
<p>Here, it is adapted to 64-bit integers by choosing three large 64-bit primes:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">squirrel3</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">at</span><span class="p">)</span> <span class="p">{</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE1</span> <span class="o">=</span> <span class="mh">0x9E3779B185EBCA87ULL</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE2</span> <span class="o">=</span> <span class="mh">0xC2B2AE3D27D4EB4FULL</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE3</span> <span class="o">=</span> <span class="mh">0x27D4EB2F165667C5ULL</span><span class="p">;</span>
<span class="n">at</span> <span class="o">*=</span> <span class="n">BIT_NOISE1</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o">>></span> <span class="mi">8</span><span class="p">);</span>
<span class="n">at</span> <span class="o">+=</span> <span class="n">BIT_NOISE2</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o"><<</span> <span class="mi">8</span><span class="p">);</span>
<span class="n">at</span> <span class="o">*=</span> <span class="n">BIT_NOISE3</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o">>></span> <span class="mi">8</span><span class="p">);</span>
<span class="k">return</span> <span class="n">at</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes <a href="https://github.com/rurban/smhasher">smhasher</a> when applied bytewise.</p>
<h2 id="other-benchmarks">Other Benchmarks</h2>
<p>Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and <strong>much</strong> worse performance for separate chaining.</p>
<ul>
<li>Time to look up each element by following a <a href="https://danluu.com/sattolo/">Sattolo cycle</a> (2-4x).<sup id="fnref:sattolo" role="doc-noteref"><a href="#fn:sattolo" class="footnote" rel="footnote">6</a></sup></li>
<li>Time to clear the table (100-200x).</li>
<li>Time to iterate all values (3-10x).</li>
</ul>
<h1 id="discussion">Discussion</h1>
<h2 id="separate-chaining">Separate Chaining</h2>
<p>To get our bearings, let’s first <a href="https://github.com/TheNumbat/hashtables/blob/main/code/chaining.h">implement</a> a simple separate chaining table and benchmark it at various load factors.</p>
<table class="fcenter" style="width: 100%">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Separate Chaining</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>100% Load</strong></td><td colspan="2"><strong>200% Load</strong></td><td colspan="2"><strong>500% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">115</td>
<td colspan="2">113</td>
<td colspan="2">120</td>
<td colspan="2">152</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">110</td>
<td colspan="2">128</td>
<td colspan="2">171</td>
<td colspan="2">306</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>28</td>
<td>28</td>
<td>35</td>
<td>40</td>
<td>50</td>
<td>68</td>
<td>102</td>
<td>185</td>
</tr><tr><td>Average Probe</td>
<td>1.25</td>
<td>0.50</td>
<td>1.50</td>
<td>1.00</td>
<td>2.00</td>
<td>2.00</td>
<td>3.50</td>
<td>5.00</td>
</tr><tr><td>Max Probe</td>
<td>7</td>
<td>7</td>
<td>10</td>
<td>9</td>
<td>12</td>
<td>13</td>
<td>21</td>
<td>21</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.50</td>
<td colspan="2">2.00</td>
<td colspan="2">1.75</td>
<td colspan="2">1.60</td>
</tr></table>
<p>These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency.
Insertions and erasures, on the other hand, are pretty slow—they involve allocation.</p>
<p>Technically, the reported memory amplification is an <em>underestimate</em>, as it does not include heap allocator overhead.
However, we should not <em>directly</em> compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio.</p>
<h3 id="stdunordered_map"><code class="language-plaintext highlighter-rouge">std::unordered_map</code></h3>
<p>Running these benchmarks on <code class="language-plaintext highlighter-rouge">std::unordered_map<uint64_t,uint64_t></code> produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears.
Using <code class="language-plaintext highlighter-rouge">squirrel3</code> with <code class="language-plaintext highlighter-rouge">std::unordered_map</code> additionally makes insertions slightly slower and lookups slightly faster.
Further comparisons will be relative to these results, since exact performance numbers for <code class="language-plaintext highlighter-rouge">std::unordered_map</code> depend on one’s standard library distribution (here, Microsoft’s).</p>
<h2 id="linear-probing">Linear Probing</h2>
<p>Next, let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear.h">implement</a> the simplest type of open addressing—linear probing.</p>
<p>In the following diagrams, assume that each letter hashes to its alphabetical index. The skull denotes a <em>tombstone</em>.</p>
<ul>
<li>
<p>Insert: starting from the key’s index, place the key in the first empty or tombstone bucket.</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_insert.svg" /></p>
</li>
<li>
<p>Lookup: starting from the key’s index, probe buckets in order until finding the key, reaching an empty non-tombstone bucket, or exhausting the table.</p>
<p><img class="probing center" style="margin-bottom: -25px" src="/assets/hashtables/linear_find.svg" /></p>
</li>
<li>
<p>Erase: look up the key and replace it with a tombstone.</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_erase.svg" /></p>
</li>
</ul>
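<p>The three operations above can be sketched in a few lines. This is a minimal illustration rather than the linked implementation: <code class="language-plaintext highlighter-rouge">LinearTable</code>, its fixed power-of-two capacity, and the identity hash are all assumptions made for brevity, and the table never grows.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity table; capacity must be a power of two.
struct LinearTable {
    enum class State : uint8_t { Empty, Full, Tombstone };
    struct Slot {
        State state = State::Empty;
        uint64_t key = 0;
        uint64_t value = 0;
    };

    std::vector<Slot> slots;

    explicit LinearTable(uint64_t capacity) : slots(capacity) {}

    uint64_t mask() const { return slots.size() - 1; }

    // Placeholder hash (assumption): the key itself, wrapped to the table size.
    uint64_t index_of(uint64_t key) const { return key & mask(); }

    // Insert: claim the first empty or tombstone bucket.
    // Assumes the key is not already present; returns false if the table is full.
    bool insert(uint64_t key, uint64_t value) {
        uint64_t i = index_of(key);
        for (uint64_t n = 0; n < slots.size(); n++, i = (i + 1) & mask()) {
            if (slots[i].state != State::Full) {
                slots[i].state = State::Full;
                slots[i].key = key;
                slots[i].value = value;
                return true;
            }
        }
        return false;
    }

    // Lookup: stop at the key, at a truly empty bucket, or after a full lap.
    Slot* find_slot(uint64_t key) {
        uint64_t i = index_of(key);
        for (uint64_t n = 0; n < slots.size(); n++, i = (i + 1) & mask()) {
            if (slots[i].state == State::Empty) return nullptr;
            if (slots[i].state == State::Full && slots[i].key == key) return &slots[i];
        }
        return nullptr;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    uint64_t* find(uint64_t key) {
        Slot* slot = find_slot(key);
        return slot ? &slot->value : nullptr;
    }

    // Erase: leave a tombstone so later keys in the run stay reachable.
    bool erase(uint64_t key) {
        Slot* slot = find_slot(key);
        if (slot == nullptr) return false;
        slot->state = State::Tombstone;
        return true;
    }
};
```

<p>Note that lookups walk straight past tombstones, which is why erasing a key cannot hide the keys that were inserted after it.</p>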
<p>The results are surprisingly good, considering the simplicity:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />(Naive)</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">59</td>
<td colspan="2">73</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">21</td>
<td colspan="2">36</td>
<td colspan="2">47</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>86</td>
<td>35</td>
<td>124</td>
<td>47</td>
<td>285</td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td>6.04</td>
<td>1.49</td>
<td>29.5</td>
<td>4.46</td>
<td>222</td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td>85</td>
<td>181</td>
<td>438</td>
<td>1604</td>
<td>2816</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>For existing keys, we’re already beating the separately chained tables, and with lower memory overhead!
Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor.
But, we can do much better.</p>
<h3 id="erase-rehashing">Erase: Rehashing</h3>
<p>One simple optimization is automatically recreating the table once it accumulates too many tombstones.
Erase now occurs in <em>amortized</em> constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal.
Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear_with_rehashing.h">implement</a> a table that re-creates itself after erase has been called \(\frac{N}{2}\) times.</p>
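<p>As a rough sketch of the idea (not the linked code), the table below counts erase calls and rebuilds itself in place once the count reaches half the bucket count. Taking \(N\) to be the number of buckets is an assumption here, as are the name <code class="language-plaintext highlighter-rouge">RehashingTable</code> and the identity hash.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical linear probing set with tombstones; capacity is a power of two.
struct RehashingTable {
    enum class State : uint8_t { Empty, Full, Tombstone };
    struct Slot {
        State state = State::Empty;
        uint64_t key = 0;
    };

    std::vector<Slot> slots;
    uint64_t erases = 0; // successful erase calls since the last rebuild

    explicit RehashingTable(uint64_t capacity) : slots(capacity) {}

    uint64_t mask() const { return slots.size() - 1; }

    // Assumes the key is absent and at least one bucket is not full.
    void insert(uint64_t key) {
        uint64_t i = key & mask(); // placeholder hash (assumption)
        while (slots[i].state == State::Full) i = (i + 1) & mask();
        slots[i].state = State::Full;
        slots[i].key = key;
    }

    // Returns the bucket holding the key, or nullptr if it is absent.
    Slot* find_slot(uint64_t key) {
        uint64_t i = key & mask();
        for (uint64_t n = 0; n < slots.size(); n++, i = (i + 1) & mask()) {
            if (slots[i].state == State::Empty) return nullptr;
            if (slots[i].state == State::Full && slots[i].key == key) return &slots[i];
        }
        return nullptr;
    }

    // Rebuild into a fresh array of the same size, dropping every tombstone.
    void rehash() {
        std::vector<Slot> old(slots.size());
        old.swap(slots);
        erases = 0;
        for (const Slot& slot : old) {
            if (slot.state == State::Full) insert(slot.key);
        }
    }

    bool erase(uint64_t key) {
        Slot* slot = find_slot(key);
        if (slot == nullptr) return false;
        slot->state = State::Tombstone;
        if (++erases >= slots.size() / 2) rehash(); // threshold: N/2 erases
        return true;
    }

    uint64_t tombstones() const {
        uint64_t count = 0;
        for (const Slot& slot : slots) count += slot.state == State::Tombstone;
        return count;
    }
};
```

<p>Each rebuild costs time proportional to the table size, but it only happens once per \(\frac{N}{2}\) erasures, which is where the amortized constant bound comes from.</p>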
<p>Compared to <strong>naive linear probing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">58</td>
<td colspan="2">70</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">23 <strong>(+10.4%)</strong></td>
<td class="good" colspan="2">27 <strong>(-23.9%)</strong></td>
<td class="good" colspan="2">29 <strong>(-39.3%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>85</td>
<td>35</td>
<td class="good">93 <strong>(-25.4%)</strong></td>
<td>46</td>
<td class="good">144 <strong>(-49.4%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td>6.04</td>
<td>1.49</td>
<td class="good">9.34 <strong>(-68.3%)</strong></td>
<td>4.46</td>
<td class="good">57.84 <strong>(-74.0%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td>85</td>
<td>181</td>
<td class="good">245 <strong>(-44.1%)</strong></td>
<td>1604</td>
<td class="good">1405 <strong>(-50.1%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow.
It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table.
Rehashing also does nothing to mitigate long probe sequences for existing keys.</p>
<h3 id="erase-backward-shift">Erase: Backward Shift</h3>
<p>Actually, who says we need tombstones? There’s a little-taught, but very simple algorithm for erasing keys that doesn’t degrade performance <em>or</em> have to recreate the table.</p>
<p>When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table.
Tombstones are one way to avoid this problem.
Instead, however, we can guarantee that traversal is never disrupted by simply <em>removing and reinserting all following keys</em>, up to the next empty bucket.</p>
<p>Note that despite the name, we are not simply shifting the following keys backward. Any keys that are already at their optimal index will not be moved.
For example:</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_deletion.svg" /></p>
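<p>A minimal sketch of this erase procedure follows. The names (<code class="language-plaintext highlighter-rouge">BackshiftTable</code>, <code class="language-plaintext highlighter-rouge">next</code>) and the identity hash are illustrative assumptions, and the table is assumed to always keep at least one empty bucket so that every probe loop terminates.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tombstone-free linear probing set; capacity is a power of two.
struct BackshiftTable {
    struct Slot {
        bool full = false;
        uint64_t key = 0;
    };

    std::vector<Slot> slots;

    explicit BackshiftTable(uint64_t capacity) : slots(capacity) {}

    uint64_t mask() const { return slots.size() - 1; }
    uint64_t next(uint64_t i) const { return (i + 1) & mask(); }

    // Assumes the key is absent and at least one bucket is empty.
    void insert(uint64_t key) {
        uint64_t i = key & mask(); // placeholder hash (assumption)
        while (slots[i].full) i = next(i);
        slots[i].full = true;
        slots[i].key = key;
    }

    // Without tombstones, the first empty bucket proves the key is absent.
    bool contains(uint64_t key) const {
        for (uint64_t i = key & mask(); slots[i].full; i = next(i)) {
            if (slots[i].key == key) return true;
        }
        return false;
    }

    bool erase(uint64_t key) {
        uint64_t i = key & mask();
        while (slots[i].full && slots[i].key != key) i = next(i);
        if (!slots[i].full) return false;
        slots[i].full = false;
        // Remove and reinsert every following key, up to the next empty bucket.
        // A reinserted key lands either in an earlier hole or back where it was.
        for (uint64_t j = next(i); slots[j].full; j = next(j)) {
            uint64_t moved = slots[j].key;
            slots[j].full = false;
            insert(moved);
        }
        return true;
    }
};
```

<p>For example, with keys 1, 9, and 3 in an 8-bucket table, erasing 1 moves 9 back into bucket 1, while 3 is reinserted into bucket 3 because it was already at its optimal index.</p>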
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear_with_deletion.h">implementing</a> backward-shift deletion, we can compare it to <strong>linear probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />+ Backshift</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">58</td>
<td colspan="2">70</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">47 <strong>(+105%)</strong></td>
<td class="bad" colspan="2">82 <strong>(+201%)</strong></td>
<td class="bad" colspan="2">147 <strong>(+415%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td class="good">37 <strong>(-56.2%)</strong></td>
<td>36</td>
<td>88</td>
<td>46</td>
<td>135</td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="good">1.50 <strong>(-75.2%)</strong></td>
<td>1.49</td>
<td class="good">7.46 <strong>(-20.1%)</strong></td>
<td>4.46</td>
<td class="good">49.7 <strong>(-14.1%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td class="good">50 <strong>(-41.2%)</strong></td>
<td>181</td>
<td>243</td>
<td>1604</td>
<td>1403</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load.
In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics.
But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still very high compared to separate chaining.</p>
<p><em>You might have noticed that neither of these upgrades improved average probe length for existing keys.
That’s because it’s not possible for linear probing to have any other average!
Think about why that is—it’ll be interesting later.</em></p>
<h2 id="quadratic-probing">Quadratic Probing</h2>
<p>Quadratic probing is a common upgrade to linear probing intended to decrease average and maximum probe lengths.</p>
<p>In the linear case, a probe of length \(n\) simply queries the bucket at index \(h(k) + n\).
Quadratic probing instead queries the bucket at index \(h(k) + \frac{n(n+1)}{2}\).
Other quadratic sequences are possible, but this one can be efficiently computed by adding \(n\) to the index at each step.</p>
<p>Each operation must be updated to use our new probe sequence, but the logic is otherwise almost identical:</p>
<p><img class="probing2 center" src="/assets/hashtables/quad_insert.svg" /></p>
<p>There are a couple of subtle caveats to quadratic probing:</p>
<ul>
<li>Although this specific sequence will touch every bucket in a power-of-two table, other sequences (e.g. \(n^2\)) may not.
If an insertion fails despite there being an empty bucket, the table must grow and the insertion is retried.</li>
<li>There actually <em>is</em> a <a href="https://epubs.siam.org/doi/pdf/10.1137/1.9781611975062.3">true deletion algorithm for quadratic probing</a>, but it’s a bit more involved than the linear case and hence not reproduced here. We will again use tombstones and rehashing.</li>
</ul>
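<p>The first caveat is easy to check empirically. The sketch below counts how many distinct buckets each sequence reaches within one table's worth of probes; the function names are hypothetical and the capacity is assumed to be a power of two.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts the distinct buckets reached by the first `capacity` probes of the
// triangular sequence h + n(n+1)/2. `capacity` must be a power of two.
uint64_t touched_triangular(uint64_t capacity, uint64_t home) {
    std::vector<bool> seen(capacity, false);
    uint64_t mask = capacity - 1;
    uint64_t index = home & mask;
    uint64_t count = 0;
    for (uint64_t n = 1; n <= capacity; n++) {
        if (!seen[index]) {
            seen[index] = true;
            count++;
        }
        index = (index + n) & mask; // adding n advances to the next triangular offset
    }
    return count;
}

// Same measurement for the sequence h + n^2.
uint64_t touched_squares(uint64_t capacity, uint64_t home) {
    std::vector<bool> seen(capacity, false);
    uint64_t mask = capacity - 1;
    uint64_t count = 0;
    for (uint64_t n = 0; n < capacity; n++) {
        uint64_t index = (home + n * n) & mask;
        if (!seen[index]) {
            seen[index] = true;
            count++;
        }
    }
    return count;
}
```

<p>In a 16-bucket table the triangular sequence reaches all 16 buckets, while \(n^2\) only ever reaches 4 of them, since squares modulo 16 are always 0, 1, 4, or 9.</p>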
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/quadratic.h">implementing</a> our quadratic table, we can compare it to <strong>linear probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" style="width:21%" class="header"><strong>Quadratic Probing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">47</td>
<td class="good" colspan="2">54 <strong>(-7.2%)</strong></td>
<td class="good" colspan="2">65 <strong>(-6.4%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">23</td>
<td colspan="2">28</td>
<td colspan="2">29</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>82</td>
<td class="good">31 <strong>(-13.2%)</strong></td>
<td>93</td>
<td class="good">42 <br /><strong>(-7.3%)</strong></td>
<td class="bad">163 <strong>(+12.8%)</strong></td>
</tr><tr><td>Average Probe</td>
<td class="good">0.43 <strong>(-12.9%)</strong></td>
<td class="good">4.81 <strong>(-20.3%)</strong></td>
<td class="good">0.99 <strong>(-33.4%)</strong></td>
<td class="good">5.31 <strong>(-43.1%)</strong></td>
<td class="good">1.87 <strong>(-58.1%)</strong></td>
<td class="good">18.22 <strong>(-68.5%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">21 <strong>(-51.2%)</strong></td>
<td class="good">49 <strong>(-42.4%)</strong></td>
<td class="good">46 <strong>(-74.6%)</strong></td>
<td class="good">76 <strong>(-69.0%)</strong></td>
<td class="good">108 <strong>(-93.3%)</strong></td>
<td class="good">263 <strong>(-81.3%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>As expected, quadratic probing dramatically reduces both the average and worst case probe lengths, especially at high load factors.
These gains are not <em>quite</em> as impressive when compared to the linear table with backshift deletion, but are still very significant.</p>
<p>Reducing the average probe length improves the average performance of all queries.
Erasures stay fast, on par with linear probing with rehashing, since quadratic probing quickly finds the element and marks it as a tombstone.
However, beyond 50% load, maximum probe lengths are still pretty concerning.</p>
<h2 id="double-hashing">Double Hashing</h2>
<p>Double hashing purports to provide even better behavior than quadratic probing.
Instead of using the same probe sequence for every key, double hashing determines the probe stride by hashing the key a second time.
That is, a probe of length \(n\) queries the bucket at index \(h_1(k) + n \cdot h_2(k)\).</p>
<ul>
<li>If \(h_2\) is always co-prime to the table size, insertions always succeed, since the probe sequence will touch every bucket. That might sound difficult to assure, but since our table sizes are always a power of two, we can simply make \(h_2\) always return an odd number.</li>
<li>Unfortunately, there is no true deletion algorithm for double hashing. We will again use tombstones and rehashing.</li>
</ul>
<p>So far, we’ve been throwing away the high bits of our 64-bit hash when computing table indices.
Instead of evaluating a second hash function, let’s just re-use the top 32 bits: \(h_2(k) = h_1(k) >> 32\).</p>
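<p>Putting the two points together, the probe sequence might be computed as in the sketch below. Forcing the stride to be odd by OR-ing in the low bit is an assumption about how that requirement could be met, and the function names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Second "hash": the top 32 bits of the 64-bit hash, forced odd so that it is
// co-prime to any power-of-two table size (the `| 1` is an assumption).
uint64_t stride_of(uint64_t hash) {
    return (hash >> 32) | 1;
}

// Bucket queried by a probe of length n; `capacity` must be a power of two.
uint64_t probe(uint64_t hash, uint64_t n, uint64_t capacity) {
    return (hash + n * stride_of(hash)) & (capacity - 1);
}

// Checks that the first `capacity` probes never repeat a bucket,
// i.e. that the sequence touches every bucket exactly once.
bool touches_every_bucket(uint64_t hash, uint64_t capacity) {
    std::vector<bool> seen(capacity, false);
    for (uint64_t n = 0; n < capacity; n++) {
        uint64_t index = probe(hash, n, capacity);
        if (seen[index]) return false;
        seen[index] = true;
    }
    return true;
}
```

<p>Because an odd stride is invertible modulo a power of two, two different probe lengths below the capacity can never land on the same bucket.</p>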
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/double.h">implementing</a> double hashing, we may compare the results to <strong>quadratic probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Double Hashing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td class="bad" colspan="2">70 <strong>(+49.6%)</strong></td>
<td class="bad" colspan="2">80 <strong>(+48.6%)</strong></td>
<td class="bad" colspan="2">88 <strong>(+35.1%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">31 <strong>(+33.0%)</strong></td>
<td class="bad" colspan="2">43 <strong>(+53.1%)</strong></td>
<td class="bad" colspan="2">47 <strong>(+61.5%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td class="bad">29 <strong>(+46.0%)</strong></td>
<td>84</td>
<td class="bad">38 <strong>(+25.5%)</strong></td>
<td>93</td>
<td class="bad">49 <strong>(+14.9%)</strong></td>
<td>174</td>
</tr><tr><td>Average Probe</td>
<td class="good">0.39 <strong>(-11.0%)</strong></td>
<td class="good">4.39 <strong>(-8.8%)</strong></td>
<td class="good">0.85 <strong>(-14.7%)</strong></td>
<td class="good">4.72 <strong>(-11.1%)</strong></td>
<td class="good">1.56 <strong>(-16.8%)</strong></td>
<td class="good">16.23 <strong>(-10.9%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">17 <strong>(-19.0%)</strong></td>
<td class="bad">61 <strong>(+24.5%)</strong></td>
<td class="good">44 <br /><strong>(-4.3%)</strong></td>
<td class="bad">83 <br /><strong>(+9.2%)</strong></td>
<td class="bad">114 <strong>(+5.6%)</strong></td>
<td class="good">250 <br /><strong>(-4.9%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Double hashing succeeds in further reducing average probe lengths, but max lengths are more of a mixed bag.
Unfortunately, most queries now traverse larger spans of memory, causing performance to lag quadratic probing.</p>
<h2 id="robin-hood-linear-probing">Robin Hood Linear Probing</h2>
<p>So far, quadratic probing and double hashing have provided lower probe lengths, but their raw performance isn’t much better than linear probing—especially on missing keys.
Enter <em><a href="https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf">Robin Hood</a> linear probing</em>.
This strategy drastically reins in maximum probe lengths while maintaining high query performance.</p>
<p>The key difference is that when inserting a key, we are allowed to steal the position of an existing key if it is closer to its ideal index (“richer”) than the new key is.
When this occurs, we simply swap the old and new keys and carry on inserting the old key in the same manner. This results in a more equitable distribution of probe lengths.</p>
<p><img class="probing2 center" src="/assets/hashtables/rh_insert.svg" /></p>
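<p>The swap-and-continue rule can be sketched as follows. <code class="language-plaintext highlighter-rouge">RobinHoodTable</code>, <code class="language-plaintext highlighter-rouge">distance</code>, and the identity hash are illustrative assumptions, and the sketch assumes the key is absent and the table is not full.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical Robin Hood set; capacity is a power of two.
struct RobinHoodTable {
    struct Slot {
        bool full = false;
        uint64_t key = 0;
    };

    std::vector<Slot> slots;

    explicit RobinHoodTable(uint64_t capacity) : slots(capacity) {}

    uint64_t mask() const { return slots.size() - 1; }

    // Placeholder hash (assumption): the key itself, wrapped to the table size.
    uint64_t home(uint64_t key) const { return key & mask(); }

    // How far the key stored in bucket `index` sits from its ideal index.
    uint64_t distance(uint64_t index) const {
        return (index - home(slots[index].key)) & mask();
    }

    // Assumes the key is absent and at least one bucket is empty.
    void insert(uint64_t key) {
        uint64_t i = home(key);
        uint64_t dist = 0; // probe length of the key currently being placed
        while (slots[i].full) {
            uint64_t resident = distance(i);
            if (resident < dist) {
                // The resident is "richer": take its bucket and carry on
                // inserting the resident from here instead.
                std::swap(key, slots[i].key);
                dist = resident;
            }
            i = (i + 1) & mask();
            dist++;
        }
        slots[i].full = true;
        slots[i].key = key;
    }
};
```

<p>Inserting 1, 2, and then 9 into an 8-bucket table leaves 9 in bucket 2 and pushes 2 to bucket 3, so the probe lengths are 0, 1, 1 rather than the 0, 0, 2 that plain linear probing would produce.</p>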
<p>This insertion method turns out to be so effective that we can entirely ignore rehashing as long as we keep track of the maximum probe length.
Lookups will simply probe until either finding the key or exhausting the maximum probe length.</p>
<p><a href="https://github.com/TheNumbat/hashtables/blob/main/code/robin_hood.h">Implementing</a> this table, compared to <strong>linear probing with backshift deletion</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Robin Hood<br />(Naive)</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td class="bad" colspan="2">53 <strong>(+16.1%)</strong></td>
<td class="bad" colspan="2">80 <strong>(+38.3%)</strong></td>
<td class="bad" colspan="2">124 <strong>(+76.3%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td class="good" colspan="2">19 <strong>(-58.8%)</strong></td>
<td class="good" colspan="2">31 <strong>(-63.0%)</strong></td>
<td class="good" colspan="2">77 <strong>(-47.9%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>19</td>
<td class="bad">58 <strong>(+56.7%)</strong></td>
<td class="good">31 <strong>(-11.3%)</strong></td>
<td class="bad">109 <strong>(+23.7%)</strong></td>
<td class="bad">79 <strong>(+72.2%)</strong></td>
<td class="bad">147 <strong>(+8.9%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="bad">13 <strong>(+769%)</strong></td>
<td>1.49</td>
<td class="bad">28 <br /><strong>(+275%)</strong></td>
<td>4.46</td>
<td class="bad">67 <strong>(+34.9%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">12 <strong>(-72.1%)</strong></td>
<td class="good">13 <strong>(-74.0%)</strong></td>
<td class="good">24 <strong>(-86.7%)</strong></td>
<td class="good">28 <br /><strong>(-88.5%)</strong></td>
<td class="good">58 <strong>(-96.4%)</strong></td>
<td class="good">67 <strong>(-95.2%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Maximum probe lengths improve <em>dramatically</em>, undercutting even quadratic probing and double hashing.
This solves the main problem with open addressing.</p>
<p>However, performance gains are less clear-cut:</p>
<ul>
<li>Insertion performance suffers across the board.</li>
<li>Existing lookups are slightly faster at low load, but much slower at 90%.</li>
<li>Missing lookups are maximally pessimistic.</li>
</ul>
<p>Fortunately, we’ve got one more trick up our sleeves.</p>
<h3 id="erase-backward-shift-1">Erase: Backward Shift</h3>
<p>When erasing a key, we can run a very similar backshift deletion algorithm.
There are just two differences:</p>
<ul>
<li>We can stop upon finding a key that is already in its optimal location. No further keys could have been pushed past it—they would have stolen this spot during insertion.</li>
<li>We don’t have to re-hash—since we stop at the first optimal key, all keys before it can simply be shifted backward.</li>
</ul>
<p><img class="probing2 center" src="/assets/hashtables/rh_deletion.svg" /></p>
<p>With backshift deletion, we no longer need to keep track of the maximum probe length.</p>
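<p>Here is a hedged sketch of the erase procedure, with the insertion routine included so the example stands alone. All names are hypothetical and the hash is a placeholder. The early exit in <code class="language-plaintext highlighter-rouge">locate</code> relies on the standard Robin Hood ordering property rather than on anything spelled out above, and the table is assumed to never be completely full.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical Robin Hood set with backshift deletion; power-of-two capacity.
struct RobinHoodTable {
    struct Slot {
        bool full = false;
        uint64_t key = 0;
    };

    std::vector<Slot> slots;

    explicit RobinHoodTable(uint64_t capacity) : slots(capacity) {}

    uint64_t mask() const { return slots.size() - 1; }
    uint64_t next(uint64_t i) const { return (i + 1) & mask(); }

    // Distance of the resident key from its ideal index (identity hash assumed).
    uint64_t distance(uint64_t index) const {
        return (index - (slots[index].key & mask())) & mask();
    }

    // Assumes the key is absent and at least one bucket is empty.
    void insert(uint64_t key) {
        uint64_t i = key & mask(), dist = 0;
        while (slots[i].full) {
            uint64_t resident = distance(i);
            if (resident < dist) {
                std::swap(key, slots[i].key);
                dist = resident;
            }
            i = next(i);
            dist++;
        }
        slots[i].full = true;
        slots[i].key = key;
    }

    // Returns the bucket index holding `key`, or slots.size() if it is absent.
    uint64_t locate(uint64_t key) const {
        uint64_t i = key & mask(), dist = 0;
        while (slots[i].full) {
            if (slots[i].key == key) return i;
            // A richer resident would have been displaced by our key on insert.
            if (distance(i) < dist) break;
            i = next(i);
            dist++;
        }
        return slots.size();
    }

    bool contains(uint64_t key) const { return locate(key) != slots.size(); }

    bool erase(uint64_t key) {
        uint64_t i = locate(key);
        if (i == slots.size()) return false;
        // Shift following keys back by one bucket, stopping at an empty bucket
        // or at a key that already sits at its optimal index.
        for (uint64_t j = next(i); slots[j].full && distance(j) > 0; j = next(j)) {
            slots[i].key = slots[j].key;
            i = j;
        }
        slots[i].full = false;
        return true;
    }
};
```

<p>Unlike the plain linear probing version, nothing is rehashed here: every moved key simply ends up one bucket closer to its ideal index.</p>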
<p><a href="https://github.com/TheNumbat/hashtables/blob/main/code/robin_hood_with_deletion.h">Implementing</a> this table, compared to <strong>naive Robin Hood probing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Robin Hood<br />+ Backshift</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">52</td>
<td colspan="2">78</td>
<td colspan="2">118</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">31 <strong>(+63.0%)</strong></td>
<td class="bad" colspan="2">47 <strong>(+52.7%)</strong></td>
<td class="good" colspan="2">59 <strong>(-23.3%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>19</td>
<td class="good">33 <strong>(-43.3%)</strong></td>
<td>31</td>
<td class="good">74 <strong>(-32.2%)</strong></td>
<td>80</td>
<td class="good">108 <strong>(-26.1%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="good">0.75 <strong>(-94.2%)</strong></td>
<td>1.49</td>
<td class="good">1.87 <strong>(-93.3%)</strong></td>
<td>4.46</td>
<td class="good">4.95 <strong>(-92.6%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>12</td>
<td class="good">12 <strong>(-7.7%)</strong></td>
<td>24</td>
<td class="good">25 <strong>(-10.7%)</strong></td>
<td>58</td>
<td>67</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>We again regress the performance of erase, but we finally achieve reasonable missing-key behavior.
Given the excellent maximum probe lengths and overall query performance, Robin Hood probing with backshift deletion should be your default choice of hash table.</p>
<p><em>You might notice that at 90% load, lookups are significantly slower than basic linear probing.
This gap actually isn’t concerning, and explaining why is an interesting exercise.</em><sup id="fnref:slowerat90" role="doc-noteref"><a href="#fn:slowerat90" class="footnote" rel="footnote">7</a></sup></p>
<h2 id="two-way-chaining">Two-Way Chaining</h2>
<p>Our Robin Hood table has excellent query performance, and exhibited a maximum probe length of only 12—it’s a good default choice.
However, another class of hash tables based on <em>Cuckoo Hashing</em> can <em>provably</em> never probe more than a constant number of buckets.
Here, we will examine a particularly practical formulation known as <em>Two-Way Chaining</em>.</p>
<p>Our table will still consist of a flat array of buckets, but each bucket will be able to hold a small handful of keys.
When inserting an element, we will hash the key with <em>two different</em> functions, then place it into whichever corresponding bucket has more space.
If both buckets happen to be filled, the entire table will grow and the insertion is retried.</p>
<p><img class="probing2 center" src="/assets/hashtables/two_way_insert.svg" /></p>
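<p>The insertion rule above can be sketched as follows. This is an illustration under stated assumptions: <code class="language-plaintext highlighter-rouge">TwoWayTable</code> is a hypothetical name, keys are treated as already-hashed 64-bit values whose low bits and top 32 bits act as the two hash functions, and growing the table is left to the caller.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical two-way chained set; the bucket count is a power of two.
template <uint64_t Capacity>
struct TwoWayTable {
    struct Bucket {
        uint64_t count = 0;
        uint64_t keys[Capacity] = {};
    };

    std::vector<Bucket> buckets;

    explicit TwoWayTable(uint64_t bucket_count) : buckets(bucket_count) {}

    // Two placeholder hash functions (assumption): low bits and top 32 bits.
    uint64_t first(uint64_t key) const { return key & (buckets.size() - 1); }
    uint64_t second(uint64_t key) const { return (key >> 32) & (buckets.size() - 1); }

    // Places the key in whichever candidate bucket has more space.
    // Returns false when both are full; a real table would grow and retry.
    bool insert(uint64_t key) {
        Bucket& a = buckets[first(key)];
        Bucket& b = buckets[second(key)];
        Bucket& target = (a.count <= b.count) ? a : b;
        if (target.count == Capacity) return false;
        target.keys[target.count++] = key;
        return true;
    }

    static bool holds(const Bucket& bucket, uint64_t key) {
        for (uint64_t i = 0; i < bucket.count; i++) {
            if (bucket.keys[i] == key) return true;
        }
        return false;
    }

    // A lookup inspects at most two buckets, so at most 2 * Capacity keys.
    bool contains(uint64_t key) const {
        return holds(buckets[first(key)], key) || holds(buckets[second(key)], key);
    }
};
```

<p>Choosing the emptier of the two buckets is what keeps the buckets balanced; the insertion only fails once both candidates have reached the cap.</p>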
<p>You might think this is crazy—in separate chaining, the expected maximum probe length scales with \(O(\frac{\lg N}{\lg \lg N})\), so wouldn’t this waste copious amounts of memory growing the table?
It turns out adding a second hash function reduces the expected maximum to \(O(\lg\lg N)\), for reasons <a href="https://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/slidesS14/load-balancing.pdf">explored elsewhere</a>.
That makes it practical to cap the maximum bucket size at a relatively small number.</p>
<p>Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/two_way.h">implement</a> a flat two-way chained table.
As with double hashing, we can use the high bits of our hash to represent the second hash function.</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Two-Way<br />Chaining</strong></td><td colspan="2"><strong>Capacity 2</strong></td><td colspan="2"><strong>Capacity 4</strong></td><td colspan="2"><strong>Capacity 8</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">265</td>
<td colspan="2">108</td>
<td colspan="2">111</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">24</td>
<td colspan="2">34</td>
<td colspan="2">45</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>23</td>
<td>23</td>
<td>30</td>
<td>30</td>
<td>39</td>
<td>42</td>
</tr><tr><td>Average Probe</td>
<td>0.03</td>
<td>0.24</td>
<td>0.74</td>
<td>2.73</td>
<td>3.61</td>
<td>8.96</td>
</tr><tr><td>Max Probe</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>13</td>
<td>14</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">32.00</td>
<td colspan="2">4.00</td>
<td colspan="2">2.00</td>
</tr></table>
<p>Since every key can be found in one of two possible buckets, query times become essentially deterministic.
Lookup times are especially impressive, only slightly lagging the Robin Hood table and having no penalty on missing keys.
The average probe lengths are very low, and max probe lengths are strictly bounded by two times the bucket size.</p>
<p>The table size balloons when each bucket can only hold two keys, and insertion is very slow due to growing the table.
However, capacities of 4-8 are reasonable choices for memory-unconstrained use cases.</p>
<p>Overall, two-way chaining is another good choice of hash table, at least when memory efficiency is not the highest priority.
In the next section, we’ll also see how lookups can be accelerated even further with prefetching and SIMD operations.</p>
<p><em>Remember that deterministic queries might not matter if you haven’t already implemented incremental table growth.</em></p>
<h3 id="cuckoo-hashing">Cuckoo Hashing</h3>
<p><a href="https://www.brics.dk/RS/01/32/BRICS-RS-01-32.pdf">Cuckoo hashing</a> involves storing keys in two separate tables, each of which holds one key per bucket.
This scheme will guarantee that every key is stored in its ideal bucket in one of the two tables.</p>
<p>Each key is hashed with two different functions and inserted using the following procedure:</p>
<ol>
<li>If at least one table’s bucket is empty, the key is inserted there.</li>
<li>Otherwise, the key is inserted anyway, evicting one of the existing keys.</li>
<li>The evicted key is transferred to the opposite table at its other possible position.</li>
<li>If doing so evicts another existing key, go to 3.</li>
</ol>
<p>If the loop ends up evicting the key we’re trying to insert, we know it has found a cycle and hence must grow the table.</p>
<p><img class="probing2 center" src="/assets/hashtables/cuckoo_insert.svg" /></p>
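<p>A rough sketch of the insertion procedure is shown below. It follows the cycle rule as stated above, which is conservative: it gives up as soon as the original key is evicted. The name <code class="language-plaintext highlighter-rouge">CuckooTable</code> and the two placeholder hash functions (low bits and top 32 bits of the key) are assumptions, and keys are assumed to be unique.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical cuckoo set: two tables, one key per bucket, power-of-two size.
struct CuckooTable {
    struct Slot {
        bool full = false;
        uint64_t key = 0;
    };

    std::vector<Slot> tables[2];

    explicit CuckooTable(uint64_t buckets_per_table) {
        tables[0].resize(buckets_per_table);
        tables[1].resize(buckets_per_table);
    }

    // Placeholder hashes (assumption): low bits for table 0, top bits for table 1.
    uint64_t index(int which, uint64_t key) const {
        uint64_t hash = (which == 0) ? key : (key >> 32);
        return hash & (tables[0].size() - 1);
    }

    bool contains(uint64_t key) const {
        for (int which = 0; which < 2; which++) {
            const Slot& slot = tables[which][index(which, key)];
            if (slot.full && slot.key == key) return true;
        }
        return false;
    }

    // Returns false when a cycle is found; the caller would grow and retry.
    // On failure, every previously stored key is still in a valid bucket.
    bool insert(uint64_t key) {
        // Step 1: use an empty bucket in either table if there is one.
        for (int which = 0; which < 2; which++) {
            Slot& slot = tables[which][index(which, key)];
            if (!slot.full) {
                slot.full = true;
                slot.key = key;
                return true;
            }
        }
        // Steps 2-4: insert anyway, then keep moving evicted keys across.
        const uint64_t original = key;
        int which = 0;
        while (true) {
            Slot& slot = tables[which][index(which, key)];
            if (!slot.full) {
                slot.full = true;
                slot.key = key;
                return true;
            }
            std::swap(key, slot.key);           // evict the resident key
            if (key == original) return false;  // evicted ourselves: a cycle
            which = 1 - which;                  // the evictee tries the other table
        }
    }
};
```

<p>A lookup only ever checks two buckets, one per table, which is the guarantee the scheme is built around.</p>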
<p>Because Cuckoo hashing does not strictly bound the insertion sequence length, I didn’t benchmark it here, but it’s still an interesting option.</p>
<h1 id="unrolling-prefetching-and-simd">Unrolling, Prefetching, and SIMD</h1>
<p>So far, the lookup benchmark has been implemented roughly like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">==</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Naively, this loop runs one lookup at a time.
However, you might have noticed that we’ve been able to find keys <em>faster than the CPU can load data from main memory</em>—so something more complex must be going on.</p>
<p>In reality, modern out-of-order CPUs can execute several iterations in parallel.
Expressing the loop as a dataflow graph lets us build a simplified mental model of how it actually runs: the CPU is able to execute a computation whenever its dependencies are available.</p>
<p>Here, green nodes are resolved, in-flight nodes are yellow, and white nodes have not begun:</p>
<p><img class="probes center" src="/assets/hashtables/dataflow.svg" /></p>
<p>In this example, we can see that each iteration’s root node (<code class="language-plaintext highlighter-rouge">i++</code>) only requires the value of <code class="language-plaintext highlighter-rouge">i</code> from the previous step—<strong>not</strong> the result of <code class="language-plaintext highlighter-rouge">find</code>.
Therefore, the CPU can run several <code class="language-plaintext highlighter-rouge">find</code> nodes in parallel, hiding the fact that each one takes ~100ns to fetch data from main memory.</p>
<h2 id="unrolling">Unrolling</h2>
<p>There’s a common optimization strategy known as <em>loop unrolling</em>.
Unrolling involves explicitly writing out the code for several <em>semantic</em> iterations inside each <em>actual</em> iteration of a loop.
When each inner pseudo-iteration is independent, unrolling explicitly severs them from the loop iterator dependency chain.</p>
<p>Most modern x86 CPUs can keep track of ~10 simultaneous pending loads, so let’s try manually unrolling our benchmark 10 times.
The new code looks like the following, where <code class="language-plaintext highlighter-rouge">index_of</code> hashes the key but doesn’t do the actual lookup.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">indices</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
<span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">indices</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">.</span><span class="n">index_of</span><span class="p">(</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
    <span class="p">}</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">find_index</span><span class="p">(</span><span class="n">indices</span><span class="p">[</span><span class="n">j</span><span class="p">])</span> <span class="o">==</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This isn’t true unrolling, since the inner loops introduce dependencies on <code class="language-plaintext highlighter-rouge">j</code>.
However, it doesn’t matter in practice, as computing <code class="language-plaintext highlighter-rouge">j</code> is never a bottleneck.
In fact, our compiler can automatically unroll the inner loop for us (<code class="language-plaintext highlighter-rouge">clang</code> in particular does this).</p>
<table class="fcenter" style="width: 55%">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td></tr>
<tr><td>Separate Chaining (100%)</td>
<td>35</td>
<td>34</td>
</tr><tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
</tr><tr><td>Quadratic (75%)</td>
<td>31</td>
<td>31</td>
</tr><tr><td>Double (75%)</td>
<td>38</td>
<td>37</td>
</tr><tr><td>Robin Hood (75%)</td>
<td>31</td>
<td>31</td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
</tr></table>
<p>Explicit unrolling has basically no effect, since the CPU was already able to saturate its pending load buffer.
However, manual unrolling now lets us introduce <em>software prefetching</em>.</p>
<h2 id="software-prefetching">Software Prefetching</h2>
<p>When traversing memory in a predictable pattern, the CPU is able to preemptively load data for future accesses.
This capability, known as <em>hardware prefetching</em>, is the reason our flat tables are so fast to iterate.
However, hash table <em>lookups</em> are inherently unpredictable: the address to load is determined by a hash function.
That means hardware prefetching cannot help us.</p>
<p>Instead, we can explicitly ask the CPU to prefetch address(es) we will load from in the near future.
Doing so can’t speed up an individual lookup—we’d immediately wait for the result—but software prefetching can accelerate batches of queries.
Given several keys, we can compute where each query <em>will</em> access the table, tell the CPU to prefetch those locations, and <strong>then</strong> start actually accessing the table.</p>
<p>To do so, let’s make <code class="language-plaintext highlighter-rouge">table.index_of</code> prefetch the location(s) examined by <code class="language-plaintext highlighter-rouge">table.find_index</code>:</p>
<ul>
<li>In open-addressed tables, we prefetch the slot the key hashes to.</li>
<li>In two-way chaining, we prefetch <em>both</em> buckets the key hashes to.</li>
</ul>
<p>Importantly, prefetching requests the entire cache line for the given address, i.e. the aligned 64-byte chunk of memory containing that address.
That means surrounding data (e.g. the following keys) may also be loaded into cache.</p>
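<p>As a rough illustration, the two-phase lookup might be sketched as follows for a linear-probing table. This is a sketch under stated assumptions, not the article’s implementation: <code class="language-plaintext highlighter-rouge">Flat_Table</code>, <code class="language-plaintext highlighter-rouge">prefetch</code>, <code class="language-plaintext highlighter-rouge">sum_found</code>, and the placeholder hash are hypothetical names, <code class="language-plaintext highlighter-rouge">find_index</code> takes the key as an extra parameter here, and the prefetch hint uses the GCC/Clang builtin with a no-op fallback elsewhere.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: wraps the compiler's prefetch hint where available.
inline void prefetch(const void* address) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(address); // a hint only; it never changes results
#else
    (void)address; // no-op fallback on other compilers
#endif
}

// Minimal linear-probing table of nonzero 64-bit keys (0 marks an empty slot).
// Assumes a power-of-two capacity and a table that is never completely full.
struct Flat_Table {
    struct Slot {
        uint64_t key = 0;
        uint64_t value = 0;
    };
    std::vector<Slot> slots;

    explicit Flat_Table(size_t capacity) : slots(capacity) {}

    // Placeholder mixer for illustration; not the article's squirrel3.
    static uint64_t hash(uint64_t key) {
        key *= 0x9E3779B97F4A7C15ULL;
        return key ^ (key >> 32);
    }

    // Phase 1: compute the slot the key hashes to and request its cache line.
    uint64_t index_of(uint64_t key) const {
        uint64_t index = hash(key) & (slots.size() - 1);
        prefetch(&slots[index]);
        return index;
    }

    // Phase 2: probe starting from a previously computed index.
    const uint64_t* find_index(uint64_t index, uint64_t key) const {
        for(;; index = (index + 1) & (slots.size() - 1)) {
            if(slots[index].key == key) return &slots[index].value;
            if(slots[index].key == 0) return nullptr;
        }
    }

    void insert(uint64_t key, uint64_t value) {
        uint64_t index = hash(key) & (slots.size() - 1);
        while(slots[index].key != 0 && slots[index].key != key) {
            index = (index + 1) & (slots.size() - 1);
        }
        slots[index].key = key;
        slots[index].value = value;
    }
};

// Looks up keys in batches: issue every prefetch first, then probe.
// Returns the sum of the values found; missing keys contribute nothing.
inline uint64_t sum_found(const Flat_Table& table, const std::vector<uint64_t>& keys) {
    constexpr size_t BATCH = 10;
    uint64_t indices[BATCH];
    uint64_t sum = 0;
    for(size_t i = 0; i < keys.size(); i += BATCH) {
        size_t n = keys.size() - i < BATCH ? keys.size() - i : BATCH;
        for(size_t j = 0; j < n; j++) {
            indices[j] = table.index_of(keys[i + j]);
        }
        for(size_t j = 0; j < n; j++) {
            if(const uint64_t* value = table.find_index(indices[j], keys[i + j])) {
                sum += *value;
            }
        }
    }
    return sum;
}
```

<p>The batch size of ten mirrors the unrolled loop above. By the time the second inner loop probes the first index, the prefetches for the rest of the batch are already in flight.</p>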
<table class="fcenter" style="width: 70%; margin-top: 15px; margin-bottom: 15px;">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td><td>Prefetched</td></tr>
<tr><td>Separate Chaining (100%)</td>
<td>35</td>
<td>34</td>
<td class="good">26 <strong>(-22.6%)</strong></td>
</tr><tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
<td class="good">23 <strong>(-36.3%)</strong></td>
</tr><tr><td>Quadratic (75%)</td>
<td>31</td>
<td>31</td>
<td class="good">22 <strong>(-27.8%)</strong></td>
</tr><tr><td>Double (75%)</td>
<td>38</td>
<td>37</td>
<td class="good">33 <strong>(-12.7%)</strong></td>
</tr><tr><td>Robin Hood (75%)</td>
<td>31</td>
<td>31</td>
<td class="good">22 <strong>(-29.4%)</strong></td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
<td class="good">19 <strong>(-40.2%)</strong></td>
</tr></table>
<p>Prefetching makes a big difference!</p>
<ul>
<li>Separate chaining benefits from prefetching the unpredictable initial load.<sup id="fnref:chainprefetch" role="doc-noteref"><a href="#fn:chainprefetch" class="footnote" rel="footnote">8</a></sup></li>
<li>Linear and quadratic tables with a low average probe length greatly benefit from prefetching.</li>
<li>Double-hashed probe sequences quickly leave the cache line, making prefetching a bit less effective.</li>
<li>Prefetching is ideal for two-way chaining: keys can <em>always</em> be found in a prefetched cache line—but each lookup requires fetching two locations, consuming extra load buffer slots.</li>
</ul>
<h2 id="simd">SIMD</h2>
<p>Modern CPUs provide SIMD (single instruction, multiple data) instructions that process multiple values at once.
These operations are useful for probing: for example, we can load a batch of \(N\) keys and compare them with our query in parallel.
In the best case, a SIMD probe sequence could run \(N\) times faster than checking each key individually.</p>
<p><img class="probes2 center" src="/assets/hashtables/simd_find.svg" /></p>
<p>Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/two_way_simd.h">implement</a> SIMD-accelerated lookups for linear probing and two-way chaining, with \(N=4\):</p>
<table class="fcenter" style="width: 90%; margin-top: 15px; margin-bottom: 15px">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td><td>Prefetched</td><td>Missing</td></tr>
<tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
<td>23</td>
<td>88</td>
</tr><tr><td>SIMD Linear (75%)</td>
<td class="bad">39 <strong>(+10.9%)</strong></td>
<td class="bad">42 <strong>(+18.0%)</strong></td>
<td class="bad">27 <strong>(+19.1%)</strong></td>
<td class="good">69 <strong>(-21.2%)</strong></td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
<td>19</td>
<td>30</td>
</tr><tr><td>SIMD Two-Way Chaining (x4)</td>
<td class="good">28 <strong>(-6.6%)</strong></td>
<td class="good">30 <strong>(-4.6%)</strong></td>
<td class="good">17 <strong>(-9.9%)</strong></td>
<td class="good">19 <strong>(-35.5%)</strong></td>
</tr></table>
<p>Unfortunately, SIMD probes are <em>slower</em> for the linearly probed table—an average probe length of \(1.5\) meant the overhead of setting up an (unaligned) SIMD comparison wasn’t worth it.
However, missing keys are found faster, since their average probe length is much higher (\(29.5\)).</p>
<p>On the other hand, two-way chaining consistently benefits from SIMD probing, as we can find any key using at most two (aligned) comparisons.
However, given a bucket capacity of four, lookups only need to query 1.7 keys on average, so we only see a slight improvement—SIMD would be a much more impactful optimization at higher capacities.
Still, missing key lookups get significantly faster, as they previously had to iterate through all keys in the bucket.</p>
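<p>For a sense of what a single SIMD probe looks like, here is a minimal sketch of comparing one capacity-4 bucket against a query key. It is not taken from the linked implementation: <code class="language-plaintext highlighter-rouge">Bucket</code>, <code class="language-plaintext highlighter-rouge">match_mask</code>, and <code class="language-plaintext highlighter-rouge">find_in_bucket</code> are hypothetical names, and the scalar fallback is included only so the sketch also compiles when AVX2 is not enabled.</p>

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical bucket layout: four 64-bit keys, aligned to 32 bytes so that
// one aligned 256-bit load covers the entire bucket.
struct alignas(32) Bucket {
    uint64_t keys[4];
};

// Returns a 4-bit mask; bit i is set when bucket.keys[i] == key.
inline uint32_t match_mask(const Bucket& bucket, uint64_t key) {
#if defined(__AVX2__)
    __m256i stored = _mm256_load_si256(reinterpret_cast<const __m256i*>(bucket.keys));
    __m256i query = _mm256_set1_epi64x(static_cast<long long>(key));
    __m256i equal = _mm256_cmpeq_epi64(stored, query); // all-ones in matching lanes
    // Gather the top bit of each 64-bit lane into the low four bits of an int.
    return static_cast<uint32_t>(_mm256_movemask_pd(_mm256_castsi256_pd(equal)));
#else
    // Scalar fallback that produces the same mask without AVX2.
    uint32_t mask = 0;
    for(uint32_t i = 0; i < 4; i++) {
        if(bucket.keys[i] == key) mask |= 1u << i;
    }
    return mask;
#endif
}

// Returns the first slot holding key, or -1 if the bucket does not contain it.
inline int find_in_bucket(const Bucket& bucket, uint64_t key) {
    uint32_t mask = match_mask(bucket, key);
    if(mask == 0) return -1;
    int slot = 0;
    while(!(mask & 1u)) {
        mask >>= 1;
        slot++;
    }
    return slot;
}
```

<p>The intrinsic path is only selected when the compiler defines <code class="language-plaintext highlighter-rouge">__AVX2__</code> (for example with <code class="language-plaintext highlighter-rouge">-mavx2</code> on GCC and Clang, or <code class="language-plaintext highlighter-rouge">/arch:AVX2</code> on MSVC). A missing key costs one comparison per bucket rather than four, which is consistent with the faster missing-key lookups reported above.</p>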
<h1 id="benchmark-data">Benchmark Data</h1>
<p><em>Updated 2023/04/05: additional optimizations for quadratic, double hashed, and two-way tables.</em></p>
<ul>
<li>The linear and Robin Hood tables use backshift deletion.</li>
<li>The two-way SIMD table has capacity-4 buckets and uses AVX2 256-bit SIMD operations.</li>
<li>Insert, erase, find, find unrolled, find prefetched, and find missing (“Find DNE”) report nanoseconds per operation.</li>
<li>Average probe, max probe, average missing probe (“DNE Avg”), and max missing probe (“DNE Max”) report number of iterations.</li>
<li>Clear and iterate report milliseconds for the entire operation.</li>
<li>Memory reports the ratio of allocated memory to size of stored data.</li>
</ul>
<table class="fcenter" id="summary" style="width: 100%">
<tr class="hrow2"><td class="header name"><strong>Table</strong></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Insert</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Erase</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Find</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Unrolled</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Prefetched</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Avg Probe</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Max Probe</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Find DNE</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">DNE Avg</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">DNE Max</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Clear (ms)</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Iterate (ms)</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Memory</div></div></td>
</tr>
<tr><td class="name">Chaining 50</td>
<td>115</td>
<td>110</td>
<td>28</td>
<td>26</td>
<td>21</td>
<td>1.2</td>
<td>7</td>
<td>28</td>
<td>0.5</td>
<td>7</td>
<td>113.1</td>
<td>16.2</td>
<td>2.5</td>
</tr>
<tr><td class="name">Chaining 100</td>
<td>113</td>
<td>128</td>
<td>35</td>
<td>34</td>
<td>26</td>
<td>1.5</td>
<td>10</td>
<td>40</td>
<td>1.0</td>
<td>9</td>
<td>118.1</td>
<td>13.9</td>
<td>2.0</td>
</tr>
<tr><td class="name">Chaining 200</td>
<td>120</td>
<td>171</td>
<td>50</td>
<td>50</td>
<td>37</td>
<td>2.0</td>
<td>12</td>
<td>68</td>
<td>2.0</td>
<td>13</td>
<td>122.6</td>
<td>14.3</td>
<td>1.8</td>
</tr>
<tr><td class="name">Chaining 500</td>
<td>152</td>
<td>306</td>
<td>102</td>
<td>109</td>
<td>77</td>
<td>3.5</td>
<td>21</td>
<td>185</td>
<td>5.0</td>
<td>21</td>
<td>153.7</td>
<td>23.0</td>
<td>1.6</td>
</tr>
<tr><td class="name">Linear 50</td>
<td>46</td>
<td>47</td>
<td>20</td>
<td>20</td>
<td>15</td>
<td>0.5</td>
<td>43</td>
<td>37</td>
<td>1.5</td>
<td>50</td>
<td>1.1</td>
<td>4.8</td>
<td>2.0</td>
</tr>
<tr><td class="name">Linear 75</td>
<td>58</td>
<td>82</td>
<td>36</td>
<td>35</td>
<td>23</td>
<td>1.5</td>
<td>181</td>
<td>88</td>
<td>7.5</td>
<td>243</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Linear 90</td>
<td>70</td>
<td>147</td>
<td>46</td>
<td>45</td>
<td>29</td>
<td>4.5</td>
<td>1604</td>
<td>135</td>
<td>49.7</td>
<td>1403</td>
<td>0.6</td>
<td>1.2</td>
<td>1.1</td>
</tr>
<tr><td class="name">Quadratic 50</td>
<td>47</td>
<td>23</td>
<td>20</td>
<td>20</td>
<td>16</td>
<td>0.4</td>
<td>21</td>
<td>82</td>
<td>4.8</td>
<td>49</td>
<td>1.1</td>
<td>5.1</td>
<td>2.0</td>
</tr>
<tr><td class="name">Quadratic 75</td>
<td>54</td>
<td>28</td>
<td>31</td>
<td>31</td>
<td>22</td>
<td>1.0</td>
<td>46</td>
<td>93</td>
<td>5.3</td>
<td>76</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Quadratic 90</td>
<td>65</td>
<td>29</td>
<td>42</td>
<td>42</td>
<td>30</td>
<td>1.9</td>
<td>108</td>
<td>163</td>
<td>18.2</td>
<td>263</td>
<td>0.6</td>
<td>1.0</td>
<td>1.1</td>
</tr>
<tr><td class="name">Double 50</td>
<td>70</td>
<td>31</td>
<td>29</td>
<td>28</td>
<td>23</td>
<td>0.4</td>
<td>17</td>
<td>84</td>
<td>4.4</td>
<td>61</td>
<td>1.1</td>
<td>5.6</td>
<td>2.0</td>
</tr>
<tr><td class="name">Double 75</td>
<td>80</td>
<td>43</td>
<td>38</td>
<td>37</td>
<td>33</td>
<td>0.8</td>
<td>44</td>
<td>93</td>
<td>4.7</td>
<td>83</td>
<td>0.8</td>
<td>2.2</td>
<td>1.3</td>
</tr>
<tr><td class="name">Double 90</td>
<td>88</td>
<td>47</td>
<td>49</td>
<td>49</td>
<td>42</td>
<td>1.6</td>
<td>114</td>
<td>174</td>
<td>16.2</td>
<td>250</td>
<td>0.6</td>
<td>1.0</td>
<td>1.1</td>
</tr>
<tr><td class="name">Robin Hood 50</td>
<td>52</td>
<td>31</td>
<td>19</td>
<td>19</td>
<td>15</td>
<td>0.5</td>
<td>12</td>
<td>33</td>
<td>0.7</td>
<td>12</td>
<td>1.1</td>
<td>4.8</td>
<td>2.0</td>
</tr>
<tr><td class="name">Robin Hood 75</td>
<td>78</td>
<td>47</td>
<td>31</td>
<td>31</td>
<td>22</td>
<td>1.5</td>
<td>24</td>
<td>74</td>
<td>1.9</td>
<td>25</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Robin Hood 90</td>
<td>118</td>
<td>59</td>
<td>80</td>
<td>79</td>
<td>41</td>
<td>4.5</td>
<td>58</td>
<td>108</td>
<td>5.0</td>
<td>67</td>
<td>0.6</td>
<td>1.2</td>
<td>1.1</td>
</tr>
<tr><td class="name">Two-way x2</td>
<td>265</td>
<td>24</td>
<td>23</td>
<td>25</td>
<td>17</td>
<td>0.0</td>
<td>3</td>
<td>23</td>
<td>0.2</td>
<td>4</td>
<td>17.9</td>
<td>19.6</td>
<td>32.0</td>
</tr>
<tr><td class="name">Two-way x4</td>
<td>108</td>
<td>34</td>
<td>30</td>
<td>32</td>
<td>19</td>
<td>0.7</td>
<td>6</td>
<td>30</td>
<td>2.7</td>
<td>8</td>
<td>2.2</td>
<td>3.7</td>
<td>4.0</td>
</tr>
<tr><td class="name">Two-way x8</td>
<td>111</td>
<td>45</td>
<td>39</td>
<td>39</td>
<td>28</td>
<td>3.6</td>
<td>13</td>
<td>42</td>
<td>9.0</td>
<td>14</td>
<td>1.1</td>
<td>1.5</td>
<td>2.0</td>
</tr>
<tr><td class="name">Two-way SIMD</td>
<td>124</td>
<td>46</td>
<td>28</td>
<td>30</td>
<td>17</td>
<td>0.3</td>
<td>1</td>
<td>19</td>
<td>1.0</td>
<td>1</td>
<td>2.2</td>
<td>3.7</td>
<td>4.0</td>
</tr>
</table>
<p><br /></p>
<h2 id="footnotes">Footnotes</h2>
<link rel="stylesheet" type="text/css" href="/assets/hashtables/style.css" />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:cacheusage" role="doc-endnote">
<p>Actually, this prior shouldn’t necessarily apply to hash tables, which by definition have unpredictable access patterns.
However, we will see that flat storage still provides performance benefits, especially for linear probing, whole-table iteration, and serial accesses. <a href="#fnref:cacheusage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:missingelements" role="doc-endnote">
<p>To account for various deletion strategies, the <code class="language-plaintext highlighter-rouge">find-missing</code> benchmark was run after the insertion & erasure tests, and after inserting a second set of elements.
This sequence allows separate chaining’s reported maximum probe length to be larger for missing keys than existing keys.
Looking up the latter set of elements was also benchmarked, but results are omitted here, as the tombstone-based tables were simply marginally slower across the board. <a href="#fnref:missingelements" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:mycpu" role="doc-endnote">
<p>AMD 5950X @ 4.4 GHz + DDR4 3600/CL16, Windows 11 Pro 22H2, MSVC 19.34.31937.<br />
No particular effort was made to isolate the benchmarks, but they were repeatable on an otherwise idle system.
Each was run 10 times and average results are reported. <a href="#fnref:mycpu" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vmem" role="doc-endnote">
<p>Virtual memory allows separate regions of our address space to refer to the <em>same</em> underlying pages.
If we map our table <em>twice</em>, one after the other, probe traversal never has to check whether it needs to wrap around to the start of the table.
Instead, it can simply proceed into the second mapping—since writes to one range are automatically reflected in the other, this just works.</p>
<p><img class="diagram center" src="/assets/hashtables/vmem.svg" /> <a href="#fnref:vmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:prime" role="doc-endnote">
<p>You might expect indexing a prime-sized table to require an expensive integer mod/div, but it actually doesn’t.
Given a fixed set of possible primes, a prime-specific indexing operation can be selected at runtime based on the current size.
Since each such operation is a modulo by a <em>constant</em>, <a href="https://lemire.me/blog/2021/04/28/ideal-divisors-when-a-division-compiles-down-to-just-a-multiplication/">the compiler can translate it to a fast integer multiplication</a>. <a href="#fnref:prime" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sattolo" role="doc-endnote">
<p>A Sattolo traversal causes each lookup to directly depend on the result of the previous one, serializing execution and preventing the CPU from hiding memory latency.
Hence, finding each element took ~<a href="https://gist.github.com/jboner/2841832">100ns</a> in <em>every</em> type of open table: roughly equal to the latency of fetching data from RAM.
Similarly, each lookup in the low-load separately chained table took ~200ns, i.e. \((1 + \text{Chain Length}) * \text{Memory Latency}\). <a href="#fnref:sattolo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:slowerat90" role="doc-endnote">
<p>First, <em>any</em> linearly probed table requires the same <em>total</em> number of probes to look up each element once—that’s why the average probe length never improved.
Every hash collision requires moving some key one slot further from its index, i.e. adds one to the total probe count.</p>
<p>Second, the cost of a lookup, per probe, can <em>decrease</em> with the number of probes required.
When traversing a long chain, the CPU can avoid cache misses by prefetching future accesses.
Therefore, allocating a larger portion of the same <em>total</em> probe count to long chains can increase performance.</p>
<p>Hence, linear probing tops this particular benchmark—but it’s not the most useful metric in practice.
Robin Hood probing greatly decreases maximum probe length, making query performance far more consistent even if technically slower on average.
Anyway, you really don’t need a 90% load factor—just use 75% for consistently high performance. <a href="#fnref:slowerat90" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:chainprefetch" role="doc-endnote">
<p>Is the CPU smart enough to automatically prefetch addresses contained within the explicitly prefetched cache line? I’m pretty sure it’s not, but that would be cool. <a href="#fnref:chainprefetch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
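<p>The prime-size indexing trick from the footnote on prime-sized tables can be sketched as below. This is an assumption-laden illustration rather than code from the benchmarked tables: <code class="language-plaintext highlighter-rouge">Prime_Size</code> and <code class="language-plaintext highlighter-rouge">index_for</code> are hypothetical names, and the four primes are an arbitrary subset chosen for the example.</p>

```cpp
#include <cassert>
#include <cstdint>

// Illustrative subset of table sizes; a real table would list one prime per
// growth step. Every value named below is prime.
enum class Prime_Size : uint8_t { p53, p97, p193, p389 };

// Each case is a modulo by a compile-time constant, which compilers typically
// lower to a multiply-and-shift sequence instead of a hardware divide.
inline uint64_t index_for(uint64_t hash, Prime_Size size) {
    switch(size) {
    case Prime_Size::p53: return hash % 53;
    case Prime_Size::p97: return hash % 97;
    case Prime_Size::p193: return hash % 193;
    case Prime_Size::p389: return hash % 389;
    }
    return 0; // unreachable for valid enum values
}
```

<p>The table would store the current <code class="language-plaintext highlighter-rouge">Prime_Size</code> alongside its storage and update it whenever it grows, so each lookup pays for one predictable branch plus a constant modulo.</p>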
</div>Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion. When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice. Code for this article may be found on GitHub. Open Addressing vs. Separate Chaining Benchmark Setup Discussion Separate Chaining Linear Probing Quadratic Probing Double Hashing Robin Hood Linear Probing Two Way Chaining Unrolling, Prefetching, and SIMD Benchmark Data Open Addressing vs. Separate Chaining Most people first encounter hash tables implemented using separate chaining, a model simple to understand and analyze mathematically. In separate chaining, a hash function is used to map each key to one of \(K\) buckets. Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket. Given a hash function drawn from a universal family, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s load factor. Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time. Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer. In practice, hash tables based on open addressing can provide superior performance, and their limitations can be worked around in nearly all cases. Open Addressing In an open-addressed table, each bucket only contains a single key. Collisions are handled by placing additional keys elsewhere in the table. For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to. Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array. 
A priori, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.1 However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized. Typically, a larger array is allocated and all current keys are moved to the new table. The size is grown by a constant factor (e.g. 2x), so insertions occur in amortized constant time. Why Not Separate Chaining? In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s std::unordered_map. At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for. Separate chaining still has some purported benefits, but they seem unconvincing: Separately chained tables don’t require any linear-time operations. First, this isn’t true. To maintain a constant load factor, separate chaining hash tables also have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\). Most separately chained tables, including std::unordered_map, only provide amortized constant time insertion. Second, open-addressed tables can be incrementally resized. When the table grows, there’s no reason we have to move all existing keys right away. Instead, every subsequent operation can move a fixed number of old keys into the new table. Eventually, all keys will have been moved and the old table may be deallocated. This technique incurs overhead on every operation during a resize, but none will require linear time. When each (key + value) entry is allocated separately, entries can have stable addresses. External code can safely store pointers to them, and large entries never have to be copied. This property is easy to support by adding a single level of indirection. 
That is, allocate each entry separately, but use an open-addressed table to map keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries. Lower memory overhead. It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets. However, matching high-quality open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance. Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation. In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate and entries cannot be moved. These requirements can arise when using intrusive linked lists, a common pattern in kernel data structures and embedded systems with no heap. Benchmark Setup For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values. Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers. We will evaluate tables using the following metrics: Time to insert \(N\) keys into an empty table. Time to erase \(N\) existing keys. Time to look up \(N\) existing keys. Time to look up \(N\) missing keys.2 Average probe length for existing & missing keys. Maximum probe length for existing & missing keys. Memory amplification in a full table (total size / size of data). Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors. Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 * \text{Load Factor} \rfloor - 1\). 
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown. The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU3 cannot effectively cache or prefetch memory accesses. Table Sizes All benchmarked tables use exclusively power-of-two sizes. This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since: h % 2**n == h & (2**n - 1) Power-of-two sizes also admit a neat virtual memory trick,4 but that optimization is not included here. However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits. An alternative strategy is to use prime sized tables, which are significantly more robust to the choice of hash function. For the purposes of this post, we have a fixed hash function and non-adversarial inputs, so prime sizes were not used.5 Hash Function All benchmarked tables use the squirrel3 hash function, a fast integer hash that (as far as I can tell) only exists in this GDC talk on noise-based RNG. Here, it is adapted to 64-bit integers by choosing three large 64-bit primes: uint64_t squirrel3(uint64_t at) { constexpr uint64_t BIT_NOISE1 = 0x9E3779B185EBCA87ULL; constexpr uint64_t BIT_NOISE2 = 0xC2B2AE3D27D4EB4FULL; constexpr uint64_t BIT_NOISE3 = 0x27D4EB2F165667C5ULL; at *= BIT_NOISE1; at ^= (at >> 8); at += BIT_NOISE2; at ^= (at << 8); at *= BIT_NOISE3; at ^= (at >> 8); return at; } I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes smhasher when applied bytewise. 
Other Benchmarks Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and much worse performance for separate chaining. Time to look up each element by following a Sattolo cycle (2-4x).6 Time to clear the table (100-200x). Time to iterate all values (3-10x). Discussion Separate Chaining To get our bearings, let’s first implement a simple separate chaining table and benchmark it at various load factors. Separate Chaining50% Load100% Load200% Load500% Load ExistingMissingExistingMissingExistingMissingExistingMissing Insert (ns/key) 115 113 120 152 Erase (ns/key) 110 128 171 306 Lookup (ns/key) 28 28 35 40 50 68 102 185 Average Probe 1.25 0.50 1.50 1.00 2.00 2.00 3.50 5.00 Max Probe 7 7 10 9 12 13 21 21 Memory Amplification 2.50 2.00 1.75 1.60 These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency. Insertions and erasures, on the other hand, are pretty slow: they involve allocation. Technically, the reported memory amplification is an underestimate, as it does not include heap allocator overhead. However, we should not directly compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio. std::unordered_map Running these benchmarks on std::unordered_map<uint64_t,uint64_t> produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears. Using squirrel3 with std::unordered_map additionally makes insertions slightly slower and lookups slightly faster. Further comparisons will be relative to these results, since exact performance numbers for std::unordered_map depend on one’s standard library distribution (here, Microsoft’s). Linear Probing Next, let’s implement the simplest type of open addressing: linear probing. In the following diagrams, assume that each letter hashes to its alphabetical index. 
The skull denotes a tombstone. Insert: starting from the key’s index, place the key in the first empty or tombstone bucket. Lookup: starting from the key’s index, probe buckets in order until finding the key, reaching an empty non-tombstone bucket, or exhausting the table. Erase: lookup the key and replace it with a tombstone. The results are surprisingly good, considering the simplicity: Linear Probing(Naive)50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 46 59 73 Erase (ns/key) 21 36 47 Lookup (ns/key) 20 86 35 124 47 285 Average Probe 0.50 6.04 1.49 29.5 4.46 222 Max Probe 43 85 181 438 1604 2816 Memory Amplification 2.00 1.33 1.11 For existing keys, we’re already beating the separately chained tables, and with lower memory overhead! Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor. But, we can do much better. Erase: Rehashing One simple optimization is automatically recreating the table once it accumulates too many tombstones. Erase now occurs in amortized constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal. Let’s implement a table that re-creates itself after erase has been called \(\frac{N}{2}\) times. Compared to naive linear probing: Linear Probing+ Rehashing50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 46 58 70 Erase (ns/key) 23 (+10.4%) 27 (-23.9%) 29 (-39.3%) Lookup (ns/key) 20 85 35 93 (-25.4%) 46 144 (-49.4%) Average Probe 0.50 6.04 1.49 9.34 (-68.3%) 4.46 57.84 (-74.0%) Max Probe 43 85 181 245 (-44.1%) 1604 1405 (-50.1%) Memory Amplification 2.00 1.33 1.11 Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow. It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table. 
Rehashing also does nothing to mitigate long probe sequences for existing keys. Erase: Backward Shift Actually, who says we need tombstones? There’s a little-taught, but very simple algorithm for erasing keys that doesn’t degrade performance or have to recreate the table. When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table. Tombstones are one way to avoid this problem. Instead, however, we can guarantee that traversal is never disrupted by simply removing and reinserting all following keys, up to the next empty bucket. Note that despite the name, we are not simply shifting the following keys backward. Any keys that are already at their optimal index will not be moved. For example: After implementing backward-shift deletion, we can compare it to linear probing with rehashing: Linear Probing+ Backshift50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 46 58 70 Erase (ns/key) 47 (+105%) 82 (+201%) 147 (+415%) Lookup (ns/key) 20 37 (-56.2%) 36 88 46 135 Average Probe 0.50 1.50 (-75.2%) 1.49 7.46 (-20.1%) 4.46 49.7 (-14.1%) Max Probe 43 50 (-41.2%) 181 243 1604 1403 Memory Amplification 2.00 1.33 1.11 Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load. In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics. But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still very high compared to separate chaining. You might have noticed that neither of these upgrades improved average probe length for existing keys. That’s because it’s not possible for linear probing to have any other average! Think about why that is—it’ll be interesting later. Quadratic Probing Quadratic probing is a common upgrade to linear probing intended to decrease average and maximum probe lengths. 
In the linear case, a probe of length \(n\) simply queries the bucket at index \(h(k) + n\). Quadratic probing instead queries the bucket at index \(h(k) + \frac{n(n+1)}{2}\). Other quadratic sequences are possible, but this one can be efficiently computed by adding \(n\) to the index at each step. Each operation must be updated to use our new probe sequence, but the logic is otherwise almost identical: There are a couple subtle caveats to quadratic probing: Although this specific sequence will touch every bucket in a power-of-two table, other sequences (e.g. \(n^2\)) may not. If an insertion fails despite there being an empty bucket, the table must grow and the insertion is retried. There actually is a true deletion algorithm for quadratic probing, but it’s a bit more involved than the linear case and hence not reproduced here. We will again use tombstones and rehashing. After implementing our quadratic table, we can compare it to linear probing with rehashing: Quadratic Probing+ Rehashing50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 47 54 (-7.2%) 65 (-6.4%) Erase (ns/key) 23 28 29 Lookup (ns/key) 20 82 31 (-13.2%) 93 42 (-7.3%) 163 (+12.8%) Average Probe 0.43 (-12.9%) 4.81 (-20.3%) 0.99 (-33.4%) 5.31 (-43.1%) 1.87 (-58.1%) 18.22 (-68.5%) Max Probe 21 (-51.2%) 49 (-42.4%) 46 (-74.6%) 76 (-69.0%) 108 (-93.3%) 263 (-81.3%) Memory Amplification 2.00 1.33 1.11 As expected, quadratic probing dramatically reduces both the average and worst case probe lengths, especially at high load factors. These gains are not quite as impressive when compared to the linear table with backshift deletion, but are still very significant. Reducing the average probe length improves the average performance of all queries. Erasures are the fastest yet, since quadratic probing quickly finds the element and marks it as a tombstone. However, beyond 50% load, maximum probe lengths are still pretty concerning. 
Double Hashing Double hashing purports to provide even better behavior than quadratic probing. Instead of using the same probe sequence for every key, double hashing determines the probe stride by hashing the key a second time. That is, a probe of length \(n\) queries the bucket at index \(h_1(k) + n \cdot h_2(k)\). If \(h_2\) is always co-prime to the table size, insertions always succeed, since the probe sequence will touch every bucket. That might sound difficult to ensure, but since our table sizes are always a power of two, we can simply make \(h_2\) always return an odd number. Unfortunately, there is no true deletion algorithm for double hashing. We will again use tombstones and rehashing. So far, we’ve been throwing away the high bits of our 64-bit hash when computing table indices. Instead of evaluating a second hash function, let’s just re-use the top 32 bits: \(h_2(k) = h_1(k) >> 32\). After implementing double hashing, we may compare the results to quadratic probing with rehashing: Double Hashing+ Rehashing50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 70 (+49.6%) 80 (+48.6%) 88 (+35.1%) Erase (ns/key) 31 (+33.0%) 43 (+53.1%) 47 (+61.5%) Lookup (ns/key) 29 (+46.0%) 84 38 (+25.5%) 93 49 (+14.9%) 174 Average Probe 0.39 (-11.0%) 4.39 (-8.8%) 0.85 (-14.7%) 4.72 (-11.1%) 1.56 (-16.8%) 16.23 (-10.9%) Max Probe 17 (-19.0%) 61 (+24.5%) 44 (-4.3%) 83 (+9.2%) 114 (+5.6%) 250 (-4.9%) Memory Amplification 2.00 1.33 1.11 Double hashing succeeds in further reducing average probe lengths, but max lengths are more of a mixed bag. Unfortunately, most queries now traverse larger spans of memory, causing performance to lag quadratic probing. Robin Hood Linear Probing So far, quadratic probing and double hashing have provided lower probe lengths, but their raw performance isn’t much better than linear probing—especially on missing keys. Enter Robin Hood linear probing. 
This strategy drastically reins in maximum probe lengths while maintaining high query performance. The key difference is that when inserting a key, we are allowed to steal the position of an existing key if it is closer to its ideal index (“richer”) than the new key is. When this occurs, we simply swap the old and new keys and carry on inserting the old key in the same manner. This results in a more equitable distribution of probe lengths. This insertion method turns out to be so effective that we can entirely ignore rehashing as long as we keep track of the maximum probe length. Lookups will simply probe until either finding the key or exhausting the maximum probe length. Implementing this table, compared to linear probing with backshift deletion: Robin Hood(Naive)50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 53 (+16.1%) 80 (+38.3%) 124 (+76.3%) Erase (ns/key) 19 (-58.8%) 31 (-63.0%) 77 (-47.9%) Lookup (ns/key) 19 58 (+56.7%) 31 (-11.3%) 109 (+23.7%) 79 (+72.2%) 147 (+8.9%) Average Probe 0.50 13 (+769%) 1.49 28 (+275%) 4.46 67 (+34.9%) Max Probe 12 (-72.1%) 13 (-74.0%) 24 (-86.7%) 28 (-88.5%) 58 (-96.4%) 67 (-95.2%) Memory Amplification 2.00 1.33 1.11 Maximum probe lengths improve dramatically, undercutting even quadratic probing and double hashing. This solves the main problem with open addressing. However, performance gains are less clear-cut: Insertion performance suffers across the board. Existing lookups are slightly faster at low load, but much slower at 90%. Missing lookups are maximally pessimistic. Fortunately, we’ve got one more trick up our sleeves. Erase: Backward Shift When erasing a key, we can run a very similar backshift deletion algorithm. There are just two differences: We can stop upon finding a key that is already in its optimal location. No further keys could have been pushed past it—they would have stolen this spot during insertion. 
We don’t have to re-hash—since we stop at the first optimal key, all keys before it can simply be shifted backward. With backshift deletion, we no longer need to keep track of the maximum probe length. Implementing this table, compared to naive robin hood probing: Robin Hood+ Backshift50% Load75% Load90% Load ExistingMissingExistingMissingExistingMissing Insert (ns/key) 52 78 118 Erase (ns/key) 31 (+63.0%) 47 (+52.7%) 59 (-23.3%) Lookup (ns/key) 19 33 (-43.3%) 31 74 (-32.2%) 80 108 (-26.1%) Average Probe 0.50 0.75 (-94.2%) 1.49 1.87 (-93.3%) 4.46 4.95 (-92.6%) Max Probe 12 12 (-7.7%) 24 25 (-10.7%) 58 67 Memory Amplification 2.00 1.33 1.11 We again regress the performance of erase, but we finally achieve reasonable missing-key behavior. Given the excellent maximum probe lengths and overall query performance, Robin Hood probing with backshift deletion should be your default choice of hash table. You might notice that at 90% load, lookups are significantly slower than basic linear probing. This gap actually isn’t concerning, and explaining why is an interesting exercise.7 Two-Way Chaining Our Robin Hood table has excellent query performance, and exhibited a maximum probe length of only 12—it’s a good default choice. However, another class of hash tables based on Cuckoo Hashing can provably never probe more than a constant number of buckets. Here, we will examine a particularly practical formulation known as Two-Way Chaining. Our table will still consist of a flat array of buckets, but each bucket will be able to hold a small handful of keys. When inserting an element, we will hash the key with two different functions, then place it into whichever corresponding bucket has more space. If both buckets happen to be filled, the entire table will grow and the insertion is retried. 
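The two-way insertion rule just described can be sketched as follows. This is a minimal illustration under several assumptions: it stores only 64-bit keys, it does not check for duplicates, the `Bucket` and `TwoWaySet` types and the placeholder `hash` mixer are hypothetical, and a failed insertion simply returns `false` rather than growing the table. The second bucket index comes from the high 32 bits of the hash.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical fixed-capacity bucket holding up to CAP keys.
template<size_t CAP>
struct Bucket {
    std::array<uint64_t, CAP> keys{};
    size_t count = 0;
};

// Sketch of a two-way chained set of 64-bit keys (values omitted for brevity).
template<size_t CAP>
struct TwoWaySet {
    std::vector<Bucket<CAP>> buckets;

    // The bucket count is 2^log2_buckets; each index must fit in 32 hash bits.
    explicit TwoWaySet(unsigned log2_buckets) : buckets(size_t{1} << log2_buckets) {
        assert(log2_buckets <= 32);
    }

    // Placeholder splitmix64-style mixer; any good 64-bit hash would do.
    static uint64_t hash(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    size_t first(uint64_t h) const { return h & (buckets.size() - 1); }
    size_t second(uint64_t h) const { return (h >> 32) & (buckets.size() - 1); }

    // Places the key in whichever candidate bucket has more space.
    // Returns false when both are full; a real table would grow and retry.
    bool insert(uint64_t key) {
        uint64_t h = hash(key);
        Bucket<CAP>& a = buckets[first(h)];
        Bucket<CAP>& b = buckets[second(h)];
        Bucket<CAP>& target = (b.count < a.count) ? b : a;
        if(target.count == CAP) return false;
        target.keys[target.count++] = key;
        return true;
    }

    // A lookup never inspects more than the two candidate buckets.
    bool contains(uint64_t key) const {
        uint64_t h = hash(key);
        for(size_t i : {first(h), second(h)}) {
            const Bucket<CAP>& bucket = buckets[i];
            for(size_t j = 0; j < bucket.count; j++) {
                if(bucket.keys[j] == key) return true;
            }
        }
        return false;
    }
};
```

Because `contains` only ever scans two buckets of at most `CAP` keys, the number of comparisons per query is bounded by `2 * CAP` regardless of table size.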
You might think this is crazy—in separate chaining, the expected maximum probe length scales with \(O(\frac{\lg N}{\lg \lg N})\), so wouldn’t this waste copious amounts of memory growing the table? It turns out adding a second hash function reduces the expected maximum to \(O(\lg\lg N)\), for reasons explored elsewhere. That makes it practical to cap the maximum bucket size at a relatively small number. Let’s implement a flat two-way chained table. As with double hashing, we can use the high bits of our hash to represent the second hash function. Two-WayChainingCapacity 2Capacity 4Capacity 8 ExistingMissingExistingMissingExistingMissing Insert (ns/key) 265 108 111 Erase (ns/key) 24 34 45 Lookup (ns/key) 23 23 30 30 39 42 Average Probe 0.03 0.24 0.74 2.73 3.61 8.96 Max Probe 3 4 6 8 13 14 Memory Amplification 32.00 4.00 2.00 Since every key can be found in one of two possible buckets, query times become essentially deterministic. Lookup times are especially impressive, only slightly lagging the Robin Hood table and having no penalty on missing keys. The average probe lengths are very low, and max probe lengths are strictly bounded by two times the bucket size. The table size balloons when each bucket can only hold two keys, and insertion is very slow due to growing the table. However, capacities of 4-8 are reasonable choices for memory-unconstrained use cases. Overall, two-way chaining is another good choice of hash table, at least when memory efficiency is not the highest priority. In the next section, we’ll also see how lookups can be accelerated even further with prefetching and SIMD operations. Remember that deterministic queries might not matter if you haven’t already implemented incremental table growth. Cuckoo Hashing Cuckoo hashing involves storing keys in two separate tables, each of which holds one key per bucket. This scheme will guarantee that every key is stored in its ideal bucket in one of the two tables. 
Each key is hashed with two different functions and inserted using the following procedure: If at least one table’s bucket is empty, the key is inserted there. Otherwise, the key is inserted anyway, evicting one of the existing keys. The evicted key is transferred to the opposite table at its other possible position. If doing so evicts another existing key, go to 3. If the loop ends up evicting the key we’re trying to insert, we know it has found a cycle and hence must grow the table. Because Cuckoo hashing does not strictly bound the insertion sequence length, I didn’t benchmark it here, but it’s still an interesting option. Unrolling, Prefetching, and SIMD So far, the lookup benchmark has been implemented roughly like this: for(uint64_t i = 0; i < N; i++) { assert(table.find(keys[i]) == values[i]); } Naively, this loop runs one lookup at a time. However, you might have noticed that we’ve been able to find keys faster than the CPU can load data from main memory—so something more complex must be going on. In reality, modern out-of-order CPUs can execute several iterations in parallel. Expressing the loop as a dataflow graph lets us build a simplified mental model of how it actually runs: the CPU is able to execute a computation whenever its dependencies are available. Here, green nodes are resolved, in-flight nodes are yellow, and white nodes have not begun: In this example, we can see that each iteration’s root node (i++) only requires the value of i from the previous step—not the result of find. Therefore, the CPU can run several find nodes in parallel, hiding the fact that each one takes ~100ns to fetch data from main memory. Unrolling There’s a common optimization strategy known as loop unrolling. Unrolling involves explicitly writing out the code for several semantic iterations inside each actual iteration of a loop. When each inner pseudo-iteration is independent, unrolling explicitly severs them from the loop iterator dependency chain. 
Most modern x86 CPUs can keep track of ~10 simultaneous pending loads, so let’s try manually unrolling our benchmark 10 times. The new code looks like the following, where index_of hashes the key but doesn’t do the actual lookup. uint64_t indices[10]; for(uint64_t i = 0; i < N; i += 10) { for(uint64_t j = 0; j < 10; j++) { indices[j] = table.index_of(keys[i + j]); } for(uint64_t j = 0; j < 10; j++) { assert(table.find_index(indices[j]) == values[i + j]); } } This isn’t true unrolling, since the inner loops introduce dependencies on j. However, it doesn’t matter in practice, as computing j is never a bottleneck. In fact, our compiler can automatically unroll the inner loop for us (clang in particular does this). Lookups (ns/find)NaiveUnrolled Separate Chaining (100%) 35 34 Linear (75%) 36 35 Quadratic (75%) 31 31 Double (75%) 38 37 Robin Hood (75%) 31 31 Two-Way Chaining (x4) 30 32 Explicit unrolling has basically no effect, since the CPU was already able to saturate its pending load buffer. However, manual unrolling now lets us introduce software prefetching. Software Prefetching When traversing memory in a predictable pattern, the CPU is able to preemptively load data for future accesses. This capability, known as hardware prefetching, is the reason our flat tables are so fast to iterate. However, hash table lookups are inherently unpredictable: the address to load is determined by a hash function. That means hardware prefetching cannot help us. Instead, we can explicitly ask the CPU to prefetch address(es) we will load from in the near future. Doing so can’t speed up an individual lookup—we’d immediately wait for the result—but software prefetching can accelerate batches of queries. Given several keys, we can compute where each query will access the table, tell the CPU to prefetch those locations, and then start actually accessing the table. 
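As a rough sketch of that batched pattern, the example below splits a lookup into an `index_of` step that hashes the key and issues a prefetch, and a `find_index` step that probes afterwards. Everything else is an assumption made for illustration: the `LinearSet` type, its placeholder hash, the use of `0` as the empty marker, and the extra key argument to `find_index`. The prefetch hint uses the GCC/Clang builtin `__builtin_prefetch` and is compiled away on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// __builtin_prefetch is a GCC/Clang builtin; elsewhere this sketch drops the hint.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

// Hypothetical linear-probing set of nonzero 64-bit keys; 0 marks an empty slot.
struct LinearSet {
    std::vector<uint64_t> slots;

    explicit LinearSet(size_t capacity_pow2) : slots(capacity_pow2, 0) {
        assert(capacity_pow2 != 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }

    // Placeholder splitmix64-style mixer.
    static uint64_t hash(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Assumes the table never becomes completely full.
    void insert(uint64_t key) {
        size_t i = hash(key) & (slots.size() - 1);
        while(slots[i] != 0 && slots[i] != key) i = (i + 1) & (slots.size() - 1);
        slots[i] = key;
    }

    // Hashes the key and requests the cache line of its home slot,
    // without reading the slot itself.
    size_t index_of(uint64_t key) const {
        size_t i = hash(key) & (slots.size() - 1);
        PREFETCH(&slots[i]);
        return i;
    }

    // Probes linearly from a previously computed home index.
    bool find_index(size_t i, uint64_t key) const {
        while(slots[i] != 0) {
            if(slots[i] == key) return true;
            i = (i + 1) & (slots.size() - 1);
        }
        return false;
    }
};

// Counts how many of keys[0..n) are present, issuing prefetches for a
// batch of up to 10 keys before probing any of them.
size_t count_present(const LinearSet& table, const uint64_t* keys, size_t n) {
    constexpr size_t BATCH = 10;
    size_t indices[BATCH];
    size_t found = 0;
    for(size_t i = 0; i < n; i += BATCH) {
        size_t m = (n - i < BATCH) ? n - i : BATCH;
        for(size_t j = 0; j < m; j++) indices[j] = table.index_of(keys[i + j]);
        for(size_t j = 0; j < m; j++) found += table.find_index(indices[j], keys[i + j]);
    }
    return found;
}
```

The prefetch only pays off because other work (hashing the rest of the batch) happens between the hint and the first read of each slot.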
To do so, let’s make table.index_of prefetch the location(s) examined by table.find_index: In open-addressed tables, we prefetch the slot the key hashes to. In two-way chaining, we prefetch both buckets the key hashes to. Importantly, prefetching requests the entire cache line for the given address, i.e. the aligned 64-byte chunk of memory containing that address. That means surrounding data (e.g. the following keys) may also be loaded into cache. Lookups (ns/find)NaiveUnrolledPrefetched Separate Chaining (100%) 35 34 26 (-22.6%) Linear (75%) 36 35 23 (-36.3%) Quadratic (75%) 31 31 22 (-27.8%) Double (75%) 38 37 33 (-12.7%) Robin Hood (75%) 31 31 22 (-29.4%) Two-Way Chaining (x4) 30 32 19 (-40.2%) Prefetching makes a big difference! Separate chaining benefits from prefetching the unpredictable initial load.8 Linear and quadratic tables with a low average probe length greatly benefit from prefetching. Double-hashed probe sequences quickly leave the cache line, making prefetching a bit less effective. Prefetching is ideal for two-way chaining: keys can always be found in a prefetched cache line—but each lookup requires fetching two locations, consuming extra load buffer slots. SIMD Modern CPUs provide SIMD (single instruction, multiple data) instructions that process multiple values at once. These operations are useful for probing: for example, we can load a batch of \(N\) keys and compare them with our query in parallel. In the best case, a SIMD probe sequence could run \(N\) times faster than checking each key individually. 
Let’s implement SIMD-accelerated lookups for linear probing and two-way chaining, with \(N=4\): Lookups (ns/find)NaiveUnrolledPrefetchedMissing Linear (75%) 36 35 23 88 SIMD Linear (75%) 39 (+10.9%) 42 (+18.0%) 27 (+19.1%) 98 (-21.2%) Two-Way Chaining (x4) 30 32 19 30 SIMD Two-Way Chaining (x4) 28 (-6.6%) 30 (-4.6%) 17 (-9.9%) 19 (-35.5%) Unfortunately, SIMD probes are slower for the linearly probed table—an average probe length of \(1.5\) meant the overhead of setting up an (unaligned) SIMD comparison wasn’t worth it. However, missing keys are found faster, since their average probe length is much higher (\(29.5\)). On the other hand, two-way chaining consistently benefits from SIMD probing, as we can find any key using at most two (aligned) comparisons. However, given a bucket capacity of four, lookups only need to query 1.7 keys on average, so we only see a slight improvement—SIMD would be a much more impactful optimization at higher capacities. Still, missing key lookups get significantly faster, as they previously had to iterate through all keys in the bucket. Benchmark Data Updated 2023/04/05: additional optimizations for quadratic, double hashed, and two-way tables. The linear and Robin Hood tables use backshift deletion. The two-way SIMD table has capacity-4 buckets and uses AVX2 256-bit SIMD operations. Insert, erase, find, find unrolled, find prefetched, and find missing report nanoseconds per operation. Average probe, max probe, average missing probe, and max missing probe report number of iterations. Clear and iterate report milliseconds for the entire operation. Memory reports the ratio of allocated memory to size of stored data. 
Table Insert Erase Find Unrolled Prefetched Avg Probe Max Probe Find DNE DNE Avg DNE Max Clear (ms) Iterate (ms) Memory Chaining 50 115 110 28 26 21 1.2 7 28 0.5 7 113.1 16.2 2.5 Chaining 100 113 128 35 34 26 1.5 10 40 1.0 9 118.1 13.9 2.0 Chaining 200 120 171 50 50 37 2.0 12 68 2.0 13 122.6 14.3 1.8 Chaining 500 152 306 102 109 77 3.5 21 185 5.0 21 153.7 23.0 1.6 Linear 50 46 47 20 20 15 0.5 43 37 1.5 50 1.1 4.8 2.0 Linear 75 58 82 36 35 23 1.5 181 88 7.5 243 0.7 2.1 1.3 Linear 90 70 147 46 45 29 4.5 1604 135 49.7 1403 0.6 1.2 1.1 Quadratic 50 47 23 20 20 16 0.4 21 82 4.8 49 1.1 5.1 2.0 Quadratic 75 54 28 31 31 22 1.0 46 93 5.3 76 0.7 2.1 1.3 Quadratic 90 65 29 42 42 30 1.9 108 163 18.2 263 0.6 1.0 1.1 Double 50 70 31 29 28 23 0.4 17 84 4.4 61 1.1 5.6 2.0 Double 75 80 43 38 37 33 0.8 44 93 4.7 83 0.8 2.2 1.3 Double 90 88 47 49 49 42 1.6 114 174 16.2 250 0.6 1.0 1.1 Robin Hood 50 52 31 19 19 15 0.5 12 33 0.7 12 1.1 4.8 2.0 Robin Hood 75 78 47 31 31 22 1.5 24 74 1.9 25 0.7 2.1 1.3 Robin Hood 90 118 59 80 79 41 4.5 58 108 5.0 67 0.6 1.2 1.1 Two-way x2 265 24 23 25 17 0.0 3 23 0.2 4 17.9 19.6 32.0 Two-way x4 108 34 30 32 19 0.7 6 30 2.7 8 2.2 3.7 4.0 Two-way x8 111 45 39 39 28 3.6 13 42 9.0 14 1.1 1.5 2.0 Two-way SIMD 124 46 28 30 17 0.3 1 19 1.0 1 2.2 3.7 4.0 Footnotes Actually, this prior shouldn’t necessarily apply to hash tables, which by definition have unpredictable access patterns. However, we will see that flat storage still provides performance benefits, especially for linear probing, whole-table iteration, and serial accesses. ↩ To account for various deletion strategies, the find-missing benchmark was run after the insertion & erasure tests, and after inserting a second set of elements. This sequence allows separate chaining’s reported maximum probe length to be larger for missing keys than existing keys. 
Looking up the latter set of elements was also benchmarked, but results are omitted here, as the tombstone-based tables were simply marginally slower across the board. ↩ AMD 5950X @ 4.4 GHz + DDR4 3600/CL16, Windows 11 Pro 22H2, MSVC 19.34.31937. No particular effort was made to isolate the benchmarks, but they were repeatable on an otherwise idle system. Each was run 10 times and average results are reported. ↩ Virtual memory allows separate regions of our address space to refer to the same underlying pages. If we map our table twice, one after the other, probe traversal never has to check whether it needs to wrap around to the start of the table. Instead, it can simply proceed into the second mapping—since writes to one range are automatically reflected in the other, this just works. ↩ You might expect indexing a prime-sized table to require an expensive integer mod/div, but it actually doesn’t. Given a fixed set of possible primes, a prime-specific indexing operation can be selected at runtime based on the current size. Since each such operation is a modulo by a constant, the compiler can translate it to a fast integer multiplication. ↩ A Sattolo traversal causes each lookup to directly depend on the result of the previous one, serializing execution and preventing the CPU from hiding memory latency. Hence, finding each element took ~100ns in every type of open table: roughly equal to the latency of fetching data from RAM. Similarly, each lookup in the low-load separately chained table took ~200ns, i.e. \((1 + \text{Chain Length}) * \text{Memory Latency}\) ↩ First, any linearly probed table requires the same total number of probes to look up each element once—that’s why the average probe length never improved. Every hash collision requires moving some key one slot further from its index, i.e. adds one to the total probe count. Second, the cost of a lookup, per probe, can decrease with the number of probes required. 
When traversing a long chain, the CPU can avoid cache misses by prefetching future accesses. Therefore, allocating a larger portion of the same total probe count to long chains can increase performance. Hence, linear probing tops this particular benchmark—but it’s not the most useful metric in practice. Robin Hood probing greatly decreases maximum probe length, making query performance far more consistent even if technically slower on average. Anyway, you really don’t need a 90% load factor—just use 75% for consistently high performance. ↩ Is the CPU smart enough to automatically prefetch addresses contained within the explicitly prefetched cache line? I’m pretty sure it’s not, but that would be cool. ↩NYC: Bronx Zoo2023-03-14T00:00:00+00:002023-03-14T00:00:00+00:00https://thenumb.at/Bronx-Zoo<p>Bronx Zoo, New York, NY, 2023</p>
<div style="text-align: center;"><img src="/assets/bronx-zoo/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/16.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/17.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/18.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/19.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/20.webp" /></div>
<p></p>Bronx Zoo, New York, NY, 2023Spherical Integration2023-01-08T00:00:00+00:002023-01-08T00:00:00+00:00https://thenumb.at/Spherical-Integration<p><em>Or, where does that \(\sin\theta\) come from?</em></p>
<p>Integrating functions over spheres is a ubiquitous task in graphics—and a common source of confusion for beginners. In particular, understanding why integration in spherical coordinates requires multiplying by \(\sin\theta\) takes some thought.</p>
<h1 id="the-confusion">The Confusion</h1>
<p>So, we want to integrate a function \(f\) over the unit sphere. For simplicity, let’s assume \(f = 1\). Integrating \(1\) over any surface computes the area of that surface: for a unit sphere, we should end up with \(4\pi\).</p>
<p>Integrating over spheres is much easier in the eponymous spherical coordinates, so let’s define \(f\) in terms of \(\theta, \phi\):</p>
<p><img class="diagram2d_smallerer center" style="padding-bottom: 10px" src="/assets/spherical-integration/spherical.svg" /></p>
<p>\(\theta\) ranges from \(0\) to \(\pi\), representing latitude (measured down from the pole), and \(\phi\) ranges from \(0\) to \(2\pi\), representing longitude. Naively, we might try to integrate \(f\) by ranging over the two parameters:</p>
\[\begin{align*}
\int_{0}^{2\pi}\int_0^\pi 1\, d\theta d\phi &= \int_0^{2\pi} \theta\Big|_0^\pi\, d\phi\\
&= \int_0^{2\pi} \pi\, d\phi \\
&= \pi \left(\phi\Big|_0^{2\pi}\right) \\
&= 2\pi^2
\end{align*}\]
<p>That’s not \(4\pi\)—we didn’t integrate over the sphere! All we did was integrate over a flat rectangle of height \(\pi\) and width \(2\pi\).
One way to conceptualize this integral is by adding up the differential area \(dA\) of many small rectangular patches of the domain.</p>
<p><img class="diagram2d_smaller center" src="/assets/spherical-integration/patch.svg" /></p>
<p>Each patch has area \(dA = d\theta d\phi\), so adding them up results in the area of the rectangle. What we actually want is to add up the areas of patches <em>on the sphere</em>, where they are smaller.</p>
<p><img class="diagram2d_smaller center" src="/assets/spherical-integration/sphere_patch.svg" /></p>
<p>In the limit (small \(d\theta,d\phi\)), the spherical patch \(d\mathcal{S}\) is a factor of \(\sin\theta\) smaller than the rectangular patch \(dA\).<sup id="fnref:patch" role="doc-noteref"><a href="#fn:patch" class="footnote" rel="footnote">1</a></sup> Intuitively, the closer to the poles the patch is, the smaller its area.</p>
<p>When integrating over the sphere \(\mathcal{S}\), we call the area differential \(d\mathcal{S} = \sin\theta\, d\theta d\phi\).<sup id="fnref:solidangle" role="doc-noteref"><a href="#fn:solidangle" class="footnote" rel="footnote">2</a></sup></p>
<p>Let’s try using it:</p>
\[\begin{align*}
\iint_\mathcal{S} 1\,d\mathcal{S} &= \int_{0}^{2\pi}\int_0^\pi \sin\theta\, d\theta d\phi\\
&= \int_0^{2\pi} (-\cos\theta)\Big|_0^\pi\, d\phi\\
&= \int_0^{2\pi} 2\, d\phi \\
&= 2 \left(\phi\Big|_0^{2\pi}\right) \\
&= 4\pi
\end{align*}\]
<p>It works! If you just wanted the intuition, you can stop reading here.</p>
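<p>Both results are easy to confirm numerically. The sketch below is only an illustration: <code>integrate_rect</code>, <code>naive_area</code>, and <code>sphere_area</code> are hypothetical helpers that apply a midpoint rule over the \(\theta,\phi\) rectangle, once with weight \(1\) and once with weight \(\sin\theta\).</p>

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule estimate of the double integral of weight(theta) over
// theta in [0, pi] and phi in [0, 2*pi], using steps x steps cells.
template<typename F>
double integrate_rect(F weight, int steps) {
    assert(steps > 0);
    const double pi = std::acos(-1.0);
    const double dtheta = pi / steps;
    const double dphi = 2.0 * pi / steps;
    double sum = 0.0;
    for(int i = 0; i < steps; i++) {
        const double theta = (i + 0.5) * dtheta;
        for(int j = 0; j < steps; j++) {
            // Each cell contributes weight * (area of the flat patch).
            sum += weight(theta) * dtheta * dphi;
        }
    }
    return sum;
}

// Area of the flat rectangle: approaches 2*pi^2.
double naive_area(int steps) {
    return integrate_rect([](double) { return 1.0; }, steps);
}

// Area of the unit sphere: approaches 4*pi.
double sphere_area(int steps) {
    return integrate_rect([](double theta) { return std::sin(theta); }, steps);
}
```

<p>With a few hundred steps per axis, <code>naive_area</code> comes out near \(2\pi^2 \approx 19.74\), while <code>sphere_area</code> comes out near \(4\pi \approx 12.57\).</p>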
<h1 id="but-why-sintheta">But why \(\sin\theta\)?</h1>
<p>It’s illustrative to analyze how \(\sin\theta\) arises from parameterizing the cartesian (\(x,y,z\)) sphere using spherical coordinates.</p>
<p>In cartesian coordinates, a unit sphere is defined by \(x^2 + y^2 + z^2 = 1\). It’s possible to formulate a cartesian surface integral based on this definition, but it would be ugly.</p>
\[\iint_{x^2+y^2+z^2=1} f dA = \text{?}\]
<p>Instead, we can perform a <em>change of coordinates</em> from cartesian to spherical coordinates. To do so, we will define \(\Phi : \theta,\phi \mapsto x,y,z\):</p>
<table class="yetanothertable"><tr>
<td>
<p>
$$ \begin{align*}
\Phi(\theta,\phi) = \begin{bmatrix}\sin\theta\cos\phi\\
\sin\theta\sin\phi\\
\cos\theta\end{bmatrix}
\end{align*} $$
</p>
</td>
<td>
<img class="diagram2d" style="min-width: 200px" src="/assets/spherical-integration/coordchange.svg" />
</td>
</tr>
</table>
<p>The function \(\Phi\) is a <em>parameterization</em> of the unit sphere \(\mathcal{S}\). We can check that it satisfies \(x^2+y^2+z^2=1\) regardless of \(\theta\) and \(\phi\):</p>
\[\begin{align*}
|\Phi(\theta,\phi)|^2 &= (\sin\theta\cos\phi)^2 + (\sin\theta\sin\phi)^2 + \cos^2\theta\\
&= \sin^2\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta \\
&= \sin^2\theta + \cos^2\theta\\
&= 1
\end{align*}\]
<p>Applying \(\Phi\) to the rectangular domain \(\theta\in[0,\pi],\phi\in[0,2\pi]\) in fact describes all of \(\mathcal{S}\), giving us a much simpler parameterization of the integral.</p>
<p>To integrate over \(d\theta\) and \(d\phi\), we also need to compute how they relate to \(d\mathcal{S}\). Luckily, there’s a formula that holds for <strong>any</strong> parameterization \(\mathbf{r}(u,v)\) of a surface \(\mathcal{R}\) in three-dimensional space:<sup id="fnref:fundamental" role="doc-noteref"><a href="#fn:fundamental" class="footnote" rel="footnote">3</a></sup></p>
\[d\mathcal{R} = \left\lVert \frac{\partial \mathbf{r}}{\partial u} \times \frac{\partial \mathbf{r}}{\partial v} \right\rVert du dv\]
<p>The two partial derivatives represent tangent vectors on \(\mathcal{R}\) along the \(u\) and \(v\) axes, respectively. Intuitively, each tangent vector describes how a \(u,v\) patch is stretched along the corresponding axis when mapped onto \(\mathcal{R}\). The magnitude of their cross product then computes the area of the resulting parallelogram.</p>
<p><img class="diagram2d_double center" src="/assets/spherical-integration/cross.svg" /></p>
<p>Finally, we can actually apply the change of coordinates:</p>
\[\begin{align*}
\iint_\mathcal{S} f\, d\mathcal{S} &=
\int_0^{2\pi}\int_0^\pi f(\Phi(\theta,\phi)) \left\lVert \frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} \right\rVert d\theta d\phi
\end{align*}\]
<p>We just need to compute the area term:</p>
\[\begin{align*}
\frac{\partial\Phi}{\partial\theta} &= \begin{bmatrix}\cos\phi\cos\theta & \sin\phi\cos\theta & -\sin\theta\end{bmatrix}\\
\frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}-\sin\phi\sin\theta & \cos\phi\sin\theta & 0\end{bmatrix}\\
\frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos^2\phi\cos\theta\sin\theta + \sin^2\phi\cos\theta\sin\theta\end{bmatrix}\\
&= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos\theta\sin\theta \end{bmatrix}\\
\left\lVert\frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi}\right\rVert &=
\sqrt{\sin^4\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta\sin^2\theta}\\
&= \sqrt{\sin^2\theta(\sin^2\theta + \cos^2\theta)}\\
&= \lvert\sin\theta\rvert
\end{align*}\]
<p>Since \(\theta\in[0,\pi]\), we can say \(\lvert\sin\theta\rvert = \sin\theta\). That means \(d\mathcal{S} = \sin\theta\, d\theta d\phi\)! Our final result is the familiar spherical integral:</p>
\[\iint_\mathcal{S} f\, d\mathcal{S} = \int_0^{2\pi}\int_0^\pi f(\Phi(\theta,\phi)) \sin\theta\, d\theta d\phi\]
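<p>The area term can also be checked without doing any algebra. The following sketch approximates the two tangent vectors of \(\Phi\) with central differences and takes the norm of their cross product. The helper names (<code>Phi</code>, <code>cross</code>, <code>area_scale</code>) and the step size <code>h</code> are choices made for this example only.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// The parameterization Phi(theta, phi) of the unit sphere from the text.
Vec3 Phi(double theta, double phi) {
    return {std::sin(theta) * std::cos(phi),
            std::sin(theta) * std::sin(phi),
            std::cos(theta)};
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// Approximates || dPhi/dtheta x dPhi/dphi || using central differences
// with step h in both parameters.
double area_scale(double theta, double phi, double h = 1e-5) {
    assert(h > 0.0);
    const Vec3 t1 = Phi(theta + h, phi), t0 = Phi(theta - h, phi);
    const Vec3 p1 = Phi(theta, phi + h), p0 = Phi(theta, phi - h);
    Vec3 d_theta{}, d_phi{};
    for(int i = 0; i < 3; i++) {
        d_theta[i] = (t1[i] - t0[i]) / (2.0 * h);
        d_phi[i] = (p1[i] - p0[i]) / (2.0 * h);
    }
    const Vec3 c = cross(d_theta, d_phi);
    return std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
}
```

<p>For any \(\theta\in[0,\pi]\), <code>area_scale(theta, phi)</code> should agree with <code>std::sin(theta)</code> up to finite-difference error, independent of \(\phi\).</p>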
<p><br /></p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:patch" role="doc-endnote">
<p>Interestingly, if we knew the formula for the area of a spherical patch, we could do some slightly illegal math to derive the area form:</p>
<div style="margin-bottom: -25px">
$$\begin{align*} d\mathcal{S} &= (\phi_1-\phi_0)(\cos\theta_0-\cos\theta_1)\\
&= ((\phi + d\phi) - \phi)(\cos(\theta)-\cos(\theta+d\theta))\\
&= d\phi d\theta \frac{(\cos(\theta)-\cos(\theta+d\theta))}{d\theta}\\
&= d\phi d\theta \sin\theta \tag{Def. derivative}
\end{align*} $$
</div>
<p><a href="#fnref:patch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:solidangle" role="doc-endnote">
<p>In graphics, also known as <em>differential solid angle</em>, \(d\mathbf{\omega} = \sin\theta\, d\theta d\phi\). <a href="#fnref:solidangle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fundamental" role="doc-endnote">
<p>Even more generally, the scale is related to the inner product on the tangent space of the surface. This inner product is defined by the <a href="https://en.wikipedia.org/wiki/First_fundamental_form">first fundamental form</a> \(\mathrm{I}\) of the surface, and the scale factor is \(\sqrt{\det\mathrm{I}}\), regardless of dimension. For surfaces immersed in 3D, this simplifies to the cross product mentioned above. Changes of coordinates that don’t change dimensionality have a more straightforward scale factor: it’s the absolute value of the determinant of their Jacobian.<sup id="fnref:exterior" role="doc-noteref"><a href="#fn:exterior" class="footnote" rel="footnote">4</a></sup> <a href="#fnref:fundamental" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:exterior" role="doc-endnote">
<p>These topics are often more easily understood using the language of exterior calculus, where integrals can be uniformly expressed regardless of dimension. I will write about it at some point. <a href="#fnref:exterior" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Or, where does that \(\sin\theta\) come from? Integrating functions over spheres is a ubiquitous task in graphics—and a common source of confusion for beginners. In particular, understanding why integration in spherical coordinates requires multiplying by \(\sin\theta\) takes some thought. The Confusion So, we want to integrate a function \(f\) over the unit sphere. For simplicity, let’s assume \(f = 1\). Integrating \(1\) over any surface computes the area of that surface: for a unit sphere, we should end up with \(4\pi\). Integrating over spheres is much easier in the eponymous spherical coordinates, so let’s define \(f\) in terms of \(\theta, \phi\): \(\theta\) ranges from \(0\) to \(\pi\), representing latitude, and \(\phi\) ranges from \(0\) to \(2\pi\), representing longitude. Naively, we might try to integrate \(f\) by ranging over the two parameters: \[\begin{align*} \int_{0}^{2\pi}\int_0^\pi 1\, d\theta d\phi &= \int_0^{2\pi} \theta\Big|_0^\pi\, d\phi\\ &= \int_0^{2\pi} \pi\, d\phi \\ &= \pi \left(\phi\Big|_0^{2\pi}\right) \\ &= 2\pi^2 \end{align*}\] That’s not \(4\pi\)—we didn’t integrate over the sphere! All we did was integrate over a flat rectangle of height \(\pi\) and width \(2\pi\). One way to conceptualize this integral is by adding up the differential area \(dA\) of many small rectangular patches of the domain. Each patch has area \(dA = d\theta d\phi\), so adding them up results in the area of the rectangle. What we actually want is to add up the areas of patches on the sphere, where they are smaller. In the limit (small \(d\theta,d\phi\)), the spherical patch \(d\mathcal{S}\) is a factor of \(\sin\theta\) smaller than the rectangular patch \(dA\).1 Intuitively, the closer to the poles the patch is, the smaller its area. 
When integrating over the sphere \(\mathcal{S}\), we call the area differential \(d\mathcal{S} = \sin\theta\, d\theta d\phi\).2 Let’s try using it: \[\begin{align*} \iint_\mathcal{S} 1\,d\mathcal{S} &= \int_{0}^{2\pi}\int_0^\pi \sin\theta\, d\theta d\phi\\ &= \int_0^{2\pi} (-\cos\theta)\Big|_0^\pi\, d\phi\\ &= \int_0^{2\pi} 2\, d\phi \\ &= 2 \left(\phi\Big|_0^{2\pi}\right) \\ &= 4\pi \end{align*}\] It works! If you just wanted the intuition, you can stop reading here. But why \(\sin\theta\)? It’s illustrative to analyze how \(\sin\theta\) arises from parameterizing the cartesian (\(x,y,z\)) sphere using spherical coordinates. In cartesian coordinates, a unit sphere is defined by \(x^2 + y^2 + z^2 = 1\). It’s possible to formulate a cartesian surface integral based on this definition, but it would be ugly. \[\iint_{x^2+y^2+z^2=1} f dA = \text{?}\] Instead, we can perform a change of coordinates from cartesian to spherical coordinates. To do so, we will define \(\Phi : \theta,\phi \mapsto x,y,z\): $$ \begin{align*} \Phi(\theta,\phi) = \begin{bmatrix}\sin\theta\cos\phi\\ \sin\theta\sin\phi\\ \cos\theta\end{bmatrix} \end{align*} $$ The function \(\Phi\) is a parameterization of the unit sphere \(\mathcal{S}\). We can check that it satisfies \(x^2+y^2+z^2=1\) regardless of \(\theta\) and \(\phi\): \[\begin{align*} |\Phi(\theta,\phi)|^2 &= (\sin\theta\cos\phi)^2 + (\sin\theta\sin\phi)^2 + \cos^2\theta\\ &= \sin^2\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta \\ &= \sin^2\theta + \cos^2\theta\\ &= 1 \end{align*}\] Applying \(\Phi\) to the rectangular domain \(\theta\in[0,\pi],\phi\in[0,2\pi]\) in fact describes all of \(\mathcal{S}\), giving us a much simpler parameterization of the integral. To integrate over \(d\theta\) and \(d\phi\), we also need to compute how they relate to \(d\mathcal{S}\). 
Luckily, there’s a formula that holds for any parametric surface \(\mathbf{r}(u,v)\) describing a two-dimensional surface \(\mathcal{R}\) embedded in three dimensions: [3] \[d\mathcal{R} = \left\lVert \frac{\partial \mathbf{r}}{\partial u} \times \frac{\partial \mathbf{r}}{\partial v} \right\rVert du dv\]

The two partial derivatives represent tangent vectors on \(\mathcal{R}\) along the \(u\) and \(v\) axes, respectively. Intuitively, each tangent vector describes how a \(u,v\) patch is stretched along the corresponding axis when mapped onto \(\mathcal{R}\). The magnitude of their cross product then computes the area of the resulting parallelogram.

Finally, we can actually apply the change of coordinates: \[\begin{align*} \iint_\mathcal{S} f\, d\mathcal{S} &= \int_0^{2\pi}\int_0^\pi f(\Phi(\theta,\phi)) \left\lVert \frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} \right\rVert d\theta d\phi \end{align*}\]

We just need to compute the area term: \[\begin{align*} \frac{\partial\Phi}{\partial\theta} &= \begin{bmatrix}\cos\phi\cos\theta & \sin\phi\cos\theta & -\sin\theta\end{bmatrix}\\ \frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}-\sin\phi\sin\theta & \cos\phi\sin\theta & 0\end{bmatrix}\\ \frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos^2\phi\cos\theta\sin\theta + \sin^2\phi\cos\theta\sin\theta\end{bmatrix}\\ &= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos\theta\sin\theta \end{bmatrix}\\ \left\lVert\frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi}\right\rVert &= \sqrt{\sin^4\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta\sin^2\theta}\\ &= \sqrt{\sin^2\theta(\sin^2\theta + \cos^2\theta)}\\ &= \lvert\sin\theta\rvert \end{align*}\]

Since \(\theta\in[0,\pi]\), we can say \(\lvert\sin\theta\rvert = \sin\theta\). That means \(d\mathcal{S} = \sin\theta\, d\theta d\phi\)!
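The cross-product formula can also be checked without doing any calculus by hand. The sketch below approximates the two tangent vectors with central finite differences, compares the resulting area factor to \(\sin\theta\), and then uses the same factor to integrate over the sphere. The helper names (`area_factor`, `surface_integral`), the step size `h`, and the test integrand \(f = z^2\) are all assumptions made for this example. The reference value \(4\pi/3\) for \(z^2\) follows from \(\int_0^\pi \cos^2\theta\sin\theta\, d\theta = 2/3\).

```python
import math

def phi_map(theta, phi):
    """The parameterization Phi(theta, phi) of the unit sphere."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def area_factor(r, u, v, h=1e-5):
    """Approximate ||dr/du x dr/dv|| at (u, v) using central differences."""
    r_u = [(p - m) / (2 * h) for p, m in zip(r(u + h, v), r(u - h, v))]
    r_v = [(p - m) / (2 * h) for p, m in zip(r(u, v + h), r(u, v - h))]
    return math.sqrt(sum(c * c for c in cross(r_u, r_v)))

def surface_integral(f, r, u_range, v_range, n=100):
    """Midpoint-rule estimate of the integral of f(x, y, z) over the surface r(u, v)."""
    (u0, u1), (v0, v1) = u_range, v_range
    du, dv = (u1 - u0) / n, (v1 - v0) / n
    total = 0.0
    for i in range(n):
        u = u0 + (i + 0.5) * du
        for j in range(n):
            v = v0 + (j + 0.5) * dv
            total += f(*r(u, v)) * area_factor(r, u, v) * du * dv
    return total

# The numerical area factor should match sin(theta), independent of phi.
print(area_factor(phi_map, 1.0, 2.0), math.sin(1.0))

domain = ((0.0, math.pi), (0.0, 2 * math.pi))
area = surface_integral(lambda x, y, z: 1.0, phi_map, *domain)
z_squared = surface_integral(lambda x, y, z: z * z, phi_map, *domain)
print(area, 4 * math.pi)           # both close to 4*pi
print(z_squared, 4 * math.pi / 3)  # both close to 4*pi/3
```

Nothing in `surface_integral` is specific to spheres: passing a different parameterization `r` and parameter ranges should work the same way, as long as the surface is smooth enough for the finite differences to behave.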
Our final result is the familiar spherical integral: \[\iint_\mathcal{S} f\, d\mathcal{S} = \int_0^{2\pi}\int_0^\pi f(\theta,\phi) \sin\theta\, d\theta d\phi\]

Footnotes

[1] Interestingly, if we knew the formula for the area of a spherical patch, we could do some slightly illegal math to derive the area form: $$\begin{align*} d\mathcal{S} &= (\phi_1-\phi_0)(\cos\theta_0-\cos\theta_1)\\ &= ((\phi + d\phi) - \phi)(\cos(\theta)-\cos(\theta+d\theta))\\ &= d\phi d\theta \frac{(\cos(\theta)-\cos(\theta+d\theta))}{d\theta}\\ &= d\phi d\theta \sin\theta \tag{Def. derivative} \end{align*} $$

[2] In graphics, this is also known as the differential solid angle, \(d\mathbf{\omega} = \sin\theta\, d\theta d\phi\).

[3] Even more generally, the scale is related to the inner product on the tangent space of the surface. This inner product is defined by the first fundamental form \(\mathrm{I}\) of the surface, and the scale factor is \(\sqrt{\det\mathrm{I}}\), regardless of dimension. For surfaces immersed in 3D, this simplifies to the cross product mentioned above. Changes of coordinates that don’t change dimensionality have a more straightforward scale factor: it’s the absolute value of the determinant of their Jacobian. [4]

[4] These topics are often more easily understood using the language of exterior calculus, where integrals can be uniformly expressed regardless of dimension. I will write about it at some point.