<p><em>Max Slater &middot; Computer Graphics, Programming, and Math</em></p>
<h1 id="pittsburgh">Pittsburgh</h1>
<p>2023-05-19</p>
<p>University of Pittsburgh / Carnegie Mellon University, Pittsburgh, PA, 2023</p>
<div style="text-align: center;"><img src="/assets/pittsburgh/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/16.webp" /></div>
<div style="text-align: center;"><img src="/assets/pittsburgh/17.webp" /></div>
<h1 id="optimizing-open-addressing">Optimizing Open Addressing</h1>
<p>2023-03-19</p>
<p>Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion.
When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice.</p>
<p><em>Code for this article may be found on <a href="https://github.com/TheNumbat/hashtables">GitHub</a>.</em></p>
<ul>
<li><a href="#open-addressing-vs-separate-chaining">Open Addressing vs. Separate Chaining</a></li>
<li><a href="#benchmark-setup">Benchmark Setup</a></li>
<li><a href="#discussion">Discussion</a>
<ul>
<li><a href="#separate-chaining">Separate Chaining</a></li>
<li><a href="#linear-probing">Linear Probing</a></li>
<li><a href="#quadratic-probing">Quadratic Probing</a></li>
<li><a href="#double-hashing">Double Hashing</a></li>
<li><a href="#robin-hood-linear-probing">Robin Hood Linear Probing</a></li>
<li><a href="#two-way-chaining">Two Way Chaining</a></li>
</ul>
</li>
<li><a href="#unrolling-prefetching-and-simd">Unrolling, Prefetching, and SIMD</a></li>
<li><a href="#benchmark-data">Benchmark Data</a></li>
</ul>
<h1 id="open-addressing-vs-separate-chaining">Open Addressing vs. Separate Chaining</h1>
<p>Most people first encounter hash tables implemented using <em>separate chaining</em>, a model that is simple to understand and analyze mathematically.
In separate chaining, a hash function is used to map each key to one of \(K\) <em>buckets</em>.
Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket.</p>
<p><img class="diagram center" src="/assets/hashtables/chaining.svg" /></p>
<p>Given a hash function drawn from a <a href="https://en.wikipedia.org/wiki/Universal_hashing">universal family</a>, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s <em>load factor</em>.
Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time.</p>
<p>Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer.
In practice, hash tables based on <em>open addressing</em> can provide superior performance, and their limitations can be worked around in nearly all cases.</p>
<h2 id="open-addressing">Open Addressing</h2>
<p>In an open-addressed table, each bucket only contains a single key.
Collisions are handled by placing additional keys elsewhere in the table.
For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to.</p>
<p><img class="diagram center" src="/assets/hashtables/linear.svg" /></p>
<p>Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array.
<em>A priori</em>, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.<sup id="fnref:cacheusage" role="doc-noteref"><a href="#fn:cacheusage" class="footnote" rel="footnote">1</a></sup></p>
<p>However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized.
Typically, a larger array is allocated and all current keys are moved to the new table.
The size is grown by a constant factor (e.g. 2x), so insertions occur in <em>amortized</em> constant time.</p>
<h2 id="why-not-separate-chaining">Why Not Separate Chaining?</h2>
<p>In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s <code class="language-plaintext highlighter-rouge">std::unordered_map</code>.
At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for.</p>
<p>Separate chaining still has some purported benefits, but they seem unconvincing:</p>
<ul>
<li>
<p><em>Separately chained tables don’t require any linear-time operations.</em></p>
<p>First, this isn’t true.
To maintain a constant load factor, separate chaining hash tables <em>also</em> have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\).
Most separately chained tables, including <code class="language-plaintext highlighter-rouge">std::unordered_map</code>, only provide amortized constant time insertion.</p>
<p>Second, open-addressed tables can be incrementally resized.
When the table grows, there’s no reason we have to move all existing keys <em>right away</em>.
Instead, every subsequent operation can move a fixed number of old keys into the new table.
Eventually, all keys will have been moved and the old table may be deallocated.
This technique incurs overhead on every operation during a resize, but none will require linear time.</p>
</li>
<li>
<p><em>When each (key + value) entry is allocated separately, entries can have stable addresses. External code can safely store pointers to them, and large entries never have to be copied.</em></p>
<p>This property is easy to support by adding a single level of indirection. That is, allocate each entry separately, but use an open-addressed table to map
keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries.</p>
</li>
<li>
<p><em>Lower memory overhead.</em></p>
<p>It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets.
However, matching <a href="#robin-hood-linear-probing">high-quality</a> open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance.
Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation.</p>
</li>
</ul>
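<p>The incremental resizing strategy mentioned above can be sketched as follows. This is a simplified toy, not the article’s benchmarked code: the class name, the identity “hash”, the 50% growth trigger, and the <code>kMovesPerOp</code> constant are all arbitrary choices for the example, and duplicate keys straddling the two tables are not handled.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Toy sketch of incremental resizing for a linear-probing table: after a
// grow, every operation also migrates a few entries from the old table,
// so no single insert or lookup ever takes linear time.
class IncrementalTable {
    using Slot = std::optional<std::pair<uint64_t, uint64_t>>;
    std::vector<Slot> cur_, old_;
    size_t size_ = 0, cursor_ = 0;
    static constexpr size_t kMovesPerOp = 4; // old entries moved per operation

    static size_t probe(const std::vector<Slot>& t, uint64_t k) {
        size_t i = k & (t.size() - 1); // identity "hash" for the sketch
        while (t[i] && t[i]->first != k) i = (i + 1) & (t.size() - 1);
        return i;
    }
    static void put(std::vector<Slot>& t, uint64_t k, uint64_t v) {
        t[probe(t, k)] = std::make_pair(k, v);
    }
    // Move at most kMovesPerOp entries from the old table into the new one.
    void migrate_some() {
        for (size_t moved = 0; moved < kMovesPerOp && cursor_ < old_.size(); ++cursor_)
            if (old_[cursor_]) {
                put(cur_, old_[cursor_]->first, old_[cursor_]->second);
                ++moved;
            }
        if (cursor_ >= old_.size()) { old_.clear(); cursor_ = 0; } // fully drained
    }

public:
    IncrementalTable() : cur_(8) {}

    void insert(uint64_t k, uint64_t v) {
        if (++size_ * 2 > cur_.size() && old_.empty()) { // grow at 50% load
            old_ = std::move(cur_);
            cur_.assign(old_.size() * 2, std::nullopt);
        }
        migrate_some(); // every operation chips away at the old table
        put(cur_, k, v);
    }

    std::optional<uint64_t> find(uint64_t k) {
        migrate_some();
        for (auto* t : {&cur_, &old_}) { // old table stays live until drained
            if (t->empty()) continue;
            size_t i = probe(*t, k);
            if ((*t)[i]) return (*t)[i]->second;
        }
        return std::nullopt;
    }
};
```

<p>Lookups must consult both tables until the migration finishes, which is the constant per-operation overhead the text describes.</p>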
<p>In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate <em>and</em> entries cannot be moved.
These requirements can arise when using <a href="https://www.data-structures-in-practice.com/intrusive-linked-lists/">intrusive linked lists</a>, a common pattern in kernel data structures and embedded systems with no heap.</p>
<h1 id="benchmark-setup">Benchmark Setup</h1>
<p>For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values.
Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers.</p>
<p>We will evaluate tables using the following metrics:</p>
<ul>
<li>Time to insert \(N\) keys into an empty table.</li>
<li>Time to erase \(N\) existing keys.</li>
<li>Time to look up \(N\) existing keys.</li>
<li>Time to look up \(N\) missing keys<sup id="fnref:missingelements" role="doc-noteref"><a href="#fn:missingelements" class="footnote" rel="footnote">2</a></sup>.</li>
<li>Average probe length for existing & missing keys.</li>
<li>Maximum probe length for existing & missing keys.</li>
<li>Memory amplification in a full table (total size / size of data).</li>
</ul>
<p>Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors.
Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 \times \text{Load Factor} \rfloor - 1\).
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown.
The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU<sup id="fnref:mycpu" role="doc-noteref"><a href="#fn:mycpu" class="footnote" rel="footnote">3</a></sup> cannot effectively cache or prefetch memory accesses.</p>
<h2 id="table-sizes">Table Sizes</h2>
<p>All benchmarked tables use exclusively power-of-two sizes.
This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h % 2**n == h & (2**n - 1)
</code></pre></div></div>
<p>Power-of-two sizes also admit a neat virtual memory trick<sup id="fnref:vmem" role="doc-noteref"><a href="#fn:vmem" class="footnote" rel="footnote">4</a></sup>, but that optimization is not included here.</p>
<p>However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits.
An alternative strategy is to use <em>prime</em> sized tables, which are significantly more robust to the choice of hash function.
For the purposes of this post, we have a fixed hash function and non-adversarial inputs, so prime sizes were not used<sup id="fnref:prime" role="doc-noteref"><a href="#fn:prime" class="footnote" rel="footnote">5</a></sup>.</p>
<h2 id="hash-function">Hash Function</h2>
<p>All benchmarked tables use the <code class="language-plaintext highlighter-rouge">squirrel3</code> hash function, a fast integer hash that (as far as I can tell) only exists in <a href="https://www.youtube.com/watch?v=LWFzPP8ZbdU">this GDC talk on noise-based RNG</a>.</p>
<p>Here, it is adapted to 64-bit integers by choosing three large 64-bit primes:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">squirrel3</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">at</span><span class="p">)</span> <span class="p">{</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE1</span> <span class="o">=</span> <span class="mh">0x9E3779B185EBCA87ULL</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE2</span> <span class="o">=</span> <span class="mh">0xC2B2AE3D27D4EB4FULL</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">uint64_t</span> <span class="n">BIT_NOISE3</span> <span class="o">=</span> <span class="mh">0x27D4EB2F165667C5ULL</span><span class="p">;</span>
<span class="n">at</span> <span class="o">*=</span> <span class="n">BIT_NOISE1</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o">>></span> <span class="mi">8</span><span class="p">);</span>
<span class="n">at</span> <span class="o">+=</span> <span class="n">BIT_NOISE2</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o"><<</span> <span class="mi">8</span><span class="p">);</span>
<span class="n">at</span> <span class="o">*=</span> <span class="n">BIT_NOISE3</span><span class="p">;</span>
<span class="n">at</span> <span class="o">^=</span> <span class="p">(</span><span class="n">at</span> <span class="o">>></span> <span class="mi">8</span><span class="p">);</span>
<span class="k">return</span> <span class="n">at</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes <a href="https://github.com/rurban/smhasher">smhasher</a> when applied bytewise.</p>
<h2 id="other-benchmarks">Other Benchmarks</h2>
<p>Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and <strong>much</strong> worse performance for separate chaining.</p>
<ul>
<li>Time to look up each element by following a <a href="https://danluu.com/sattolo/">Sattolo cycle</a><sup id="fnref:sattolo" role="doc-noteref"><a href="#fn:sattolo" class="footnote" rel="footnote">6</a></sup> (2-4x).</li>
<li>Time to clear the table (100-200x).</li>
<li>Time to iterate all values (3-10x).</li>
</ul>
<h1 id="discussion">Discussion</h1>
<h2 id="separate-chaining">Separate Chaining</h2>
<p>To get our bearings, let’s first <a href="https://github.com/TheNumbat/hashtables/blob/main/code/chaining.h">implement</a> a simple separate chaining table and benchmark it at various load factors.</p>
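<p>For reference, the core of such a table might look like the following. This is a minimal sketch rather than the linked benchmark implementation: it indexes with the key’s low bits instead of <code>squirrel3</code>, and <code>ChainingTable</code> is a name invented for the example.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <forward_list>
#include <optional>
#include <utility>
#include <vector>

// Minimal separate chaining: each bucket holds a singly linked list of
// (key, value) pairs, and operations traverse the bucket's chain.
struct ChainingTable {
    std::vector<std::forward_list<std::pair<uint64_t, uint64_t>>> buckets;
    explicit ChainingTable(size_t k) : buckets(k) {} // k must be a power of two

    size_t index(uint64_t key) const { return key & (buckets.size() - 1); }

    void insert(uint64_t key, uint64_t value) {
        for (auto& kv : buckets[index(key)])
            if (kv.first == key) { kv.second = value; return; } // overwrite
        buckets[index(key)].emplace_front(key, value);
    }

    std::optional<uint64_t> find(uint64_t key) const {
        for (const auto& kv : buckets[index(key)]) // walk the chain
            if (kv.first == key) return kv.second;
        return std::nullopt;
    }

    void erase(uint64_t key) {
        buckets[index(key)].remove_if(
            [key](const auto& kv) { return kv.first == key; });
    }
};
```

<p>Every insertion allocates a list node and every lookup chases pointers, which is exactly where the benchmark numbers below come from.</p>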
<table class="fcenter" style="width: 100%">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Separate Chaining</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>100% Load</strong></td><td colspan="2"><strong>200% Load</strong></td><td colspan="2"><strong>500% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">115</td>
<td colspan="2">113</td>
<td colspan="2">120</td>
<td colspan="2">152</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">110</td>
<td colspan="2">128</td>
<td colspan="2">171</td>
<td colspan="2">306</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>28</td>
<td>28</td>
<td>35</td>
<td>40</td>
<td>50</td>
<td>68</td>
<td>102</td>
<td>185</td>
</tr><tr><td>Average Probe</td>
<td>1.25</td>
<td>0.50</td>
<td>1.50</td>
<td>1.00</td>
<td>2.00</td>
<td>2.00</td>
<td>3.50</td>
<td>5.00</td>
</tr><tr><td>Max Probe</td>
<td>7</td>
<td>7</td>
<td>10</td>
<td>9</td>
<td>12</td>
<td>13</td>
<td>21</td>
<td>21</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.50</td>
<td colspan="2">2.00</td>
<td colspan="2">1.75</td>
<td colspan="2">1.60</td>
</tr></table>
<p>These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency.
Insertions and erasures, on the other hand, are pretty slow—they involve allocation.</p>
<p>Technically, the reported memory amplification is an <em>underestimate</em>, as it does not include heap allocator overhead.
However, we should not <em>directly</em> compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio.</p>
<h3 id="stdunordered_map"><code class="language-plaintext highlighter-rouge">std::unordered_map</code></h3>
<p>Running these benchmarks on <code class="language-plaintext highlighter-rouge">std::unordered_map<uint64_t,uint64_t></code> produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears.
Using <code class="language-plaintext highlighter-rouge">squirrel3</code> with <code class="language-plaintext highlighter-rouge">std::unordered_map</code> additionally makes insertions slightly slower and lookups slightly faster.
Further comparisons will be relative to these results, since exact performance numbers for <code class="language-plaintext highlighter-rouge">std::unordered_map</code> depend on one’s standard library distribution (here, Microsoft’s).</p>
<h2 id="linear-probing">Linear Probing</h2>
<p>Next, let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear.h">implement</a> the simplest type of open addressing—linear probing.</p>
<p>In the following diagrams, assume that each letter hashes to its alphabetical index. The skull denotes a <em>tombstone</em>.</p>
<ul>
<li>
<p>Insert: starting from the key’s index, place the key in the first empty or tombstone bucket.</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_insert.svg" /></p>
</li>
<li>
<p>Lookup: starting from the key’s index, probe buckets in order until finding the key, reaching an empty non-tombstone bucket, or exhausting the table.</p>
<p><img class="probing center" style="margin-bottom: -25px" src="/assets/hashtables/linear_find.svg" /></p>
</li>
<li>
<p>Erase: look up the key and replace it with a tombstone.</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_erase.svg" /></p>
</li>
</ul>
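<p>Concretely, the three operations might be sketched like this. It is a hedged toy, not the linked implementation: fixed power-of-two capacity, no resizing, and the key’s low bits standing in for the hash.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Naive linear probing with tombstones.
struct LinearTable {
    enum class State : uint8_t { Empty, Filled, Tombstone };
    struct Slot { State state = State::Empty; uint64_t key = 0, value = 0; };
    std::vector<Slot> slots;
    explicit LinearTable(size_t cap) : slots(cap) {}
    size_t mask() const { return slots.size() - 1; }

    // Insert: place the key in the first empty or tombstone bucket.
    // (A real table must also check for the key past tombstones before
    // reusing one, and assumes the table never completely fills.)
    void insert(uint64_t k, uint64_t v) {
        size_t i = k & mask();
        while (slots[i].state == State::Filled && slots[i].key != k)
            i = (i + 1) & mask();
        slots[i] = {State::Filled, k, v};
    }

    // Lookup: probe until the key or an empty (non-tombstone) bucket.
    std::optional<uint64_t> find(uint64_t k) const {
        for (size_t i = k & mask(); slots[i].state != State::Empty; i = (i + 1) & mask())
            if (slots[i].state == State::Filled && slots[i].key == k)
                return slots[i].value;
        return std::nullopt;
    }

    // Erase: find the key and replace it with a tombstone.
    void erase(uint64_t k) {
        for (size_t i = k & mask(); slots[i].state != State::Empty; i = (i + 1) & mask())
            if (slots[i].state == State::Filled && slots[i].key == k) {
                slots[i].state = State::Tombstone;
                return;
            }
    }
};
```

<p>The tombstone keeps later keys in the probe chain reachable after an erase, at the cost of lookups scanning past dead buckets.</p>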
<p>The results are surprisingly good, considering the simplicity:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />(Naive)</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">59</td>
<td colspan="2">73</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">21</td>
<td colspan="2">36</td>
<td colspan="2">47</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>86</td>
<td>35</td>
<td>124</td>
<td>47</td>
<td>285</td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td>6.04</td>
<td>1.49</td>
<td>29.5</td>
<td>4.46</td>
<td>222</td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td>85</td>
<td>181</td>
<td>438</td>
<td>1604</td>
<td>2816</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>For existing keys, we’re already beating the separately chained tables, and with lower memory overhead!
Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor.
But, we can do much better.</p>
<h3 id="erase-rehashing">Erase: Rehashing</h3>
<p>One simple optimization is automatically recreating the table once it accumulates too many tombstones.
Erase now occurs in <em>amortized</em> constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal.
Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear_with_rehashing.h">implement</a> a table that re-creates itself after erase has been called \(\frac{N}{2}\) times.</p>
<p>Compared to <strong>naive linear probing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">58</td>
<td colspan="2">70</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">23 <strong>(+10.4%)</strong></td>
<td class="good" colspan="2">27 <strong>(-23.9%)</strong></td>
<td class="good" colspan="2">29 <strong>(-39.3%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>85</td>
<td>35</td>
<td class="good">93 <strong>(-25.4%)</strong></td>
<td>46</td>
<td class="good">144 <strong>(-49.4%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td>6.04</td>
<td>1.49</td>
<td class="good">9.34 <strong>(-68.3%)</strong></td>
<td>4.46</td>
<td class="good">57.84 <strong>(-74.0%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td>85</td>
<td>181</td>
<td class="good">245 <strong>(-44.1%)</strong></td>
<td>1604</td>
<td class="good">1405 <strong>(-50.1%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow.
It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table.
Rehashing also does nothing to mitigate long probe sequences for existing keys.</p>
<h3 id="erase-backward-shift">Erase: Backward Shift</h3>
<p>Actually, who says we need tombstones? There’s a little-taught, but very simple algorithm for erasing keys that doesn’t degrade performance <em>or</em> have to recreate the table.</p>
<p>When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table.
Tombstones are one way to avoid this problem.
Instead, however, we can guarantee that traversal is never disrupted by simply <em>removing and reinserting all following keys</em>, up to the next empty bucket.</p>
<p>Note that despite the name, we are not simply shifting the following keys backward. Any keys that are already at their optimal index will not be moved.
For example:</p>
<p><img class="probing2 center" src="/assets/hashtables/linear_deletion.svg" /></p>
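<p>A sketch of the erase path under this scheme (again a toy with identity hashing, fixed power-of-two capacity, and no resizing, not the linked implementation):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Linear probing with backward-shift deletion: no tombstones at all.
struct BackshiftTable {
    std::vector<std::optional<std::pair<uint64_t, uint64_t>>> slots;
    explicit BackshiftTable(size_t cap) : slots(cap) {}
    size_t mask() const { return slots.size() - 1; }

    void insert(uint64_t k, uint64_t v) {
        size_t i = k & mask();
        while (slots[i] && slots[i]->first != k) i = (i + 1) & mask();
        slots[i] = std::make_pair(k, v);
    }

    std::optional<uint64_t> find(uint64_t k) const {
        for (size_t i = k & mask(); slots[i]; i = (i + 1) & mask())
            if (slots[i]->first == k) return slots[i]->second;
        return std::nullopt;
    }

    void erase(uint64_t k) {
        size_t i = k & mask();
        while (slots[i] && slots[i]->first != k) i = (i + 1) & mask();
        if (!slots[i]) return; // key not present
        slots[i].reset();
        // Remove and reinsert every following key up to the next empty
        // bucket; keys already at their optimal index land back in place.
        for (size_t j = (i + 1) & mask(); slots[j]; j = (j + 1) & mask()) {
            auto kv = *slots[j];
            slots[j].reset();
            insert(kv.first, kv.second);
        }
    }
};
```

<p>Because erase leaves no tombstones, lookups can still stop at the first empty bucket, which is what restores missing-key performance.</p>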
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/linear_with_deletion.h">implementing</a> backward-shift deletion, we can compare it to <strong>linear probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Linear Probing<br />+ Backshift</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">46</td>
<td colspan="2">58</td>
<td colspan="2">70</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">47 <strong>(+105%)</strong></td>
<td class="bad" colspan="2">82 <strong>(+201%)</strong></td>
<td class="bad" colspan="2">147 <strong>(+415%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td class="good">37 <strong>(-56.2%)</strong></td>
<td>36</td>
<td>88</td>
<td>46</td>
<td>135</td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="good">1.50 <strong>(-75.2%)</strong></td>
<td>1.49</td>
<td class="good">7.46 <strong>(-20.1%)</strong></td>
<td>4.46</td>
<td class="good">49.7 <strong>(-14.1%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>43</td>
<td class="good">50 <strong>(-41.2%)</strong></td>
<td>181</td>
<td>243</td>
<td>1604</td>
<td>1403</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load.
In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics.
But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still very high compared to separate chaining.</p>
<p><em>You might have noticed that neither of these upgrades improved average probe length for existing keys.
That’s because it’s not possible for linear probing to have any other average!
Think about why that is—it’ll be interesting later.</em></p>
<h2 id="quadratic-probing">Quadratic Probing</h2>
<p>Quadratic probing is a common upgrade to linear probing intended to decrease average and maximum probe lengths.</p>
<p>In the linear case, a probe of length \(n\) simply queries the bucket at index \(h(k) + n\).
Quadratic probing instead queries the bucket at index \(h(k) + \frac{n(n+1)}{2}\).
Other quadratic sequences are possible, but this one can be efficiently computed by adding \(n\) to the index at each step.</p>
<p>Each operation must be updated to use our new probe sequence, but the logic is otherwise almost identical:</p>
<p><img class="probing2 center" src="/assets/hashtables/quad_insert.svg" /></p>
<p>There are a couple subtle caveats to quadratic probing:</p>
<ul>
<li>Although this specific sequence will touch every bucket in a power-of-two table, other sequences (e.g. \(n^2\)) may not.
If an insertion fails despite there being an empty bucket, the table must grow and the insertion is retried.</li>
<li>There actually <em>is</em> a <a href="https://epubs.siam.org/doi/pdf/10.1137/1.9781611975062.3">true deletion algorithm for quadratic probing</a>, but it’s a bit more involved than the linear case and hence not reproduced here. We will again use tombstones and rehashing.</li>
</ul>
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/quadratic.h">implementing</a> our quadratic table, we can compare it to <strong>linear probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" style="width:21%" class="header"><strong>Quadratic Probing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">47</td>
<td class="good" colspan="2">54 <strong>(-7.2%)</strong></td>
<td class="good" colspan="2">65 <strong>(-6.4%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">23</td>
<td colspan="2">28</td>
<td colspan="2">29</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>20</td>
<td>82</td>
<td class="good">31 <strong>(-13.2%)</strong></td>
<td>93</td>
<td class="good">42 <br /><strong>(-7.3%)</strong></td>
<td class="bad">163 <strong>(+12.8%)</strong></td>
</tr><tr><td>Average Probe</td>
<td class="good">0.43 <strong>(-12.9%)</strong></td>
<td class="good">4.81 <strong>(-20.3%)</strong></td>
<td class="good">0.99 <strong>(-33.4%)</strong></td>
<td class="good">5.31 <strong>(-43.1%)</strong></td>
<td class="good">1.87 <strong>(-58.1%)</strong></td>
<td class="good">18.22 <strong>(-68.5%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">21 <strong>(-51.2%)</strong></td>
<td class="good">49 <strong>(-42.4%)</strong></td>
<td class="good">46 <strong>(-74.6%)</strong></td>
<td class="good">76 <strong>(-69.0%)</strong></td>
<td class="good">108 <strong>(-93.3%)</strong></td>
<td class="good">263 <strong>(-81.3%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>As expected, quadratic probing dramatically reduces both the average and worst case probe lengths, especially at high load factors.
These gains are not <em>quite</em> as impressive when compared to the linear table with backshift deletion, but are still very significant.</p>
<p>Reducing the average probe length improves the average performance of all queries.
Erasures are the fastest yet, since quadratic probing quickly finds the element and marks it as a tombstone.
However, beyond 50% load, maximum probe lengths are still pretty concerning.</p>
<h2 id="double-hashing">Double Hashing</h2>
<p>Double hashing purports to provide even better behavior than quadratic probing.
Instead of using the same probe sequence for every key, double hashing determines the probe stride by hashing the key a second time.
That is, a probe of length \(n\) queries the bucket at index \(h_1(k) + n \cdot h_2(k)\).</p>
<ul>
<li>If \(h_2\) is always co-prime to the table size, insertions always succeed, since the probe sequence will touch every bucket. That might sound difficult to ensure, but since our table sizes are always a power of two, we can simply make \(h_2\) always return an odd number.</li>
<li>Unfortunately, there is no true deletion algorithm for double hashing. We will again use tombstones and rehashing.</li>
</ul>
<p>So far, we’ve been throwing away the high bits of our 64-bit hash when computing table indices.
Instead of evaluating a second hash function, let’s just re-use the top 32 bits: \(h_2(k) = h_1(k) >> 32\).</p>
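<p>That stride computation might be sketched as follows (the helper name is invented for the example):</p>

```cpp
#include <cassert>
#include <cstdint>

// Probe index for double hashing: the stride is derived from the hash's
// top 32 bits and forced odd, so it is co-prime to the power-of-two
// table size and the sequence touches every bucket.
uint64_t double_hash_index(uint64_t hash, uint64_t n, uint64_t mask) {
    uint64_t stride = (hash >> 32) | 1; // h2(k): reuse the high bits, made odd
    return (hash + n * stride) & mask;  // h1(k) + n * h2(k), mod table size
}
```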
<p>After <a href="https://github.com/TheNumbat/hashtables/blob/main/code/double.h">implementing</a> double hashing, we may compare the results to <strong>quadratic probing with rehashing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Double Hashing<br />+ Rehashing</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td class="bad" colspan="2">70 <strong>(+49.6%)</strong></td>
<td class="bad" colspan="2">80 <strong>(+48.6%)</strong></td>
<td class="bad" colspan="2">88 <strong>(+35.1%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">31 <strong>(+33.0%)</strong></td>
<td class="bad" colspan="2">43 <strong>(+53.1%)</strong></td>
<td class="bad" colspan="2">47 <strong>(+61.5%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td class="bad">29 <strong>(+46.0%)</strong></td>
<td>84</td>
<td class="bad">38 <strong>(+25.5%)</strong></td>
<td>93</td>
<td class="bad">49 <strong>(+14.9%)</strong></td>
<td>174</td>
</tr><tr><td>Average Probe</td>
<td class="good">0.39 <strong>(-11.0%)</strong></td>
<td class="good">4.39 <strong>(-8.8%)</strong></td>
<td class="good">0.85 <strong>(-14.7%)</strong></td>
<td class="good">4.72 <strong>(-11.1%)</strong></td>
<td class="good">1.56 <strong>(-16.8%)</strong></td>
<td class="good">16.23 <strong>(-10.9%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">17 <strong>(-19.0%)</strong></td>
<td class="bad">61 <strong>(+24.5%)</strong></td>
<td class="good">44 <br /><strong>(-4.3%)</strong></td>
<td class="bad">83 <br /><strong>(+9.2%)</strong></td>
<td class="bad">114 <strong>(+5.6%)</strong></td>
<td class="good">250 <br /><strong>(-4.9%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Double hashing succeeds in further reducing average probe lengths, but max lengths are more of a mixed bag.
Unfortunately, most queries now traverse larger spans of memory, causing performance to lag behind quadratic probing.</p>
<h2 id="robin-hood-linear-probing">Robin Hood Linear Probing</h2>
<p>So far, quadratic probing and double hashing have provided lower probe lengths, but their raw performance isn’t much better than linear probing—especially on missing keys.
Enter <em><a href="https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf">Robin Hood</a> linear probing</em>.
This strategy drastically reins in maximum probe lengths while maintaining high query performance.</p>
<p>The key difference is that when inserting a key, we are allowed to steal the position of an existing key if it is closer to its ideal index (“richer”) than the new key is.
When this occurs, we simply swap the old and new keys and carry on inserting the old key in the same manner. This results in a more equitable distribution of probe lengths.</p>
<p><img class="probing2 center" src="/assets/hashtables/rh_insert.svg" /></p>
<p>This insertion method turns out to be so effective that we can entirely ignore rehashing as long as we keep track of the maximum probe length.
Lookups will simply probe until either finding the key or exhausting the maximum probe length.</p>
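<p>To make this concrete, here is a minimal sketch of naive Robin Hood insertion and max-probe-bounded lookup. The slot layout, the modulo-based hash, and all names are illustrative placeholders (not the benchmarked implementation), and the table is assumed to never fill completely:</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal sketch of naive Robin Hood probing. Slot layout and the modulo
// hash are placeholders; the table is assumed to never fill up.
struct Slot {
    uint64_t key = 0, value = 0;
    bool filled = false;
};

struct RobinHood {
    std::vector<Slot> slots;
    uint64_t max_probe = 0;

    uint64_t ideal(uint64_t key) const { return key % slots.size(); }

    void insert(uint64_t key, uint64_t value) {
        uint64_t i = ideal(key), probe = 0;
        for(;;) {
            Slot& s = slots[i];
            if(!s.filled) {
                s = {key, value, true};
                max_probe = std::max(max_probe, probe);
                return;
            }
            // Distance of the resident key from its own ideal slot.
            uint64_t rich = (i + slots.size() - ideal(s.key)) % slots.size();
            if(rich < probe) {
                // Resident is "richer": seat the new key here, then
                // continue inserting the evicted key instead.
                std::swap(key, s.key);
                std::swap(value, s.value);
                max_probe = std::max(max_probe, probe);
                probe = rich;
            }
            i = (i + 1) % slots.size();
            probe++;
        }
    }

    // Probe until the key is found or the max probe length is exhausted.
    bool find(uint64_t key, uint64_t& out) const {
        uint64_t i = ideal(key);
        for(uint64_t p = 0; p <= max_probe; p++) {
            if(slots[i].filled && slots[i].key == key) {
                out = slots[i].value;
                return true;
            }
            i = (i + 1) % slots.size();
        }
        return false;
    }
};
```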
<p><a href="https://github.com/TheNumbat/hashtables/blob/main/code/robin_hood.h">Implementing</a> this table, compared to <strong>linear probing with backshift deletion</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Robin Hood<br />(Naive)</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td class="bad" colspan="2">53 <strong>(+16.1%)</strong></td>
<td class="bad" colspan="2">80 <strong>(+38.3%)</strong></td>
<td class="bad" colspan="2">124 <strong>(+76.3%)</strong></td>
</tr><tr><td>Erase (ns/key)</td>
<td class="good" colspan="2">19 <strong>(-58.8%)</strong></td>
<td class="good" colspan="2">31 <strong>(-63.0%)</strong></td>
<td class="good" colspan="2">77 <strong>(-47.9%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>19</td>
<td class="bad">58 <strong>(+56.7%)</strong></td>
<td class="good">31 <strong>(-11.3%)</strong></td>
<td class="bad">109 <strong>(+23.7%)</strong></td>
<td class="bad">79 <strong>(+72.2%)</strong></td>
<td class="bad">147 <strong>(+8.9%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="bad">13 <strong>(+769%)</strong></td>
<td>1.49</td>
<td class="bad">28 <br /><strong>(+275%)</strong></td>
<td>4.46</td>
<td class="bad">67 <strong>(+34.9%)</strong></td>
</tr><tr><td>Max Probe</td>
<td class="good">12 <strong>(-72.1%)</strong></td>
<td class="good">13 <strong>(-74.0%)</strong></td>
<td class="good">24 <strong>(-86.7%)</strong></td>
<td class="good">28 <br /><strong>(-88.5%)</strong></td>
<td class="good">58 <strong>(-96.4%)</strong></td>
<td class="good">67 <strong>(-95.2%)</strong></td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>Maximum probe lengths improve <em>dramatically</em>, undercutting even quadratic probing and double hashing.
This solves the main problem with open addressing.</p>
<p>However, performance gains are less clear-cut:</p>
<ul>
<li>Insertion performance suffers across the board.</li>
<li>Existing lookups are slightly faster at low load, but much slower at 90%.</li>
<li>Missing lookups are maximally pessimistic: they always probe the full maximum length before giving up.</li>
</ul>
<p>Fortunately, we’ve got one more trick up our sleeve.</p>
<h3 id="erase-backward-shift-1">Erase: Backward Shift</h3>
<p>When erasing a key, we can run a very similar backshift deletion algorithm.
There are just two differences:</p>
<ul>
<li>We can stop upon finding a key that is already in its optimal location. No further keys could have been pushed past it—they would have stolen this spot during insertion.</li>
<li>We don’t have to re-hash—since we stop at the first optimal key, all keys before it can simply be shifted backward.</li>
</ul>
<p><img class="probing2 center" src="/assets/hashtables/rh_deletion.svg" /></p>
<p>With backshift deletion, we no longer need to keep track of the maximum probe length.</p>
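<p>The erase procedure can be sketched as follows, over a bare table of optional keys. The modulo hash is an illustrative placeholder, and the search loop assumes it always reaches either the key or an empty slot:</p>

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of backward-shift deletion for a Robin Hood table, over a bare
// array of optional keys. The modulo hash is an illustrative placeholder.
using Table = std::vector<std::optional<uint64_t>>;

uint64_t ideal(const Table& t, uint64_t key) { return key % t.size(); }

void erase(Table& t, uint64_t key) {
    // Find the key; with no tombstones, an empty slot means it's absent.
    uint64_t i = ideal(t, key);
    while(t[i] && *t[i] != key) i = (i + 1) % t.size();
    if(!t[i]) return;
    // Shift subsequent keys back one slot, stopping at an empty slot or a
    // key already in its ideal position—nothing was pushed past it.
    for(;;) {
        uint64_t next = (i + 1) % t.size();
        if(!t[next] || ideal(t, *t[next]) == next) break;
        t[i] = t[next];
        i = next;
    }
    t[i].reset();
}
```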
<p><a href="https://github.com/TheNumbat/hashtables/blob/main/code/robin_hood_with_deletion.h">Implementing</a> this table, compared to <strong>naive Robin Hood probing</strong>:</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Robin Hood<br />+ Backshift</strong></td><td colspan="2"><strong>50% Load</strong></td><td colspan="2"><strong>75% Load</strong></td><td colspan="2"><strong>90% Load</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">52</td>
<td colspan="2">78</td>
<td colspan="2">118</td>
</tr><tr><td>Erase (ns/key)</td>
<td class="bad" colspan="2">31 <strong>(+63.0%)</strong></td>
<td class="bad" colspan="2">47 <strong>(+52.7%)</strong></td>
<td class="good" colspan="2">59 <strong>(-23.3%)</strong></td>
</tr><tr><td>Lookup (ns/key)</td>
<td>19</td>
<td class="good">33 <strong>(-43.3%)</strong></td>
<td>31</td>
<td class="good">74 <strong>(-32.2%)</strong></td>
<td>80</td>
<td class="good">108 <strong>(-26.1%)</strong></td>
</tr><tr><td>Average Probe</td>
<td>0.50</td>
<td class="good">0.75 <strong>(-94.2%)</strong></td>
<td>1.49</td>
<td class="good">1.87 <strong>(-93.3%)</strong></td>
<td>4.46</td>
<td class="good">4.95 <strong>(-92.6%)</strong></td>
</tr><tr><td>Max Probe</td>
<td>12</td>
<td class="good">12 <strong>(-7.7%)</strong></td>
<td>24</td>
<td class="good">25 <strong>(-10.7%)</strong></td>
<td>58</td>
<td>67</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">2.00</td>
<td colspan="2">1.33</td>
<td colspan="2">1.11</td>
</tr></table>
<p>We again regress the performance of erase, but we finally achieve reasonable missing-key behavior.
Given the excellent maximum probe lengths and overall query performance, Robin Hood probing with backshift deletion should be your default choice of hash table.</p>
<p><em>You might notice that at 90% load, lookups are significantly slower than basic linear probing.
This gap actually isn’t concerning, and explaining why is an interesting exercise<sup id="fnref:slowerat90" role="doc-noteref"><a href="#fn:slowerat90" class="footnote" rel="footnote">7</a></sup>.</em></p>
<h2 id="two-way-chaining">Two-Way Chaining</h2>
<p>Our Robin Hood table has excellent query performance, and exhibited a maximum probe length of only 12—it’s a good default choice.
However, another class of hash tables based on <em>Cuckoo Hashing</em> can <em>provably</em> never probe more than a constant number of buckets.
Here, we will examine a particularly practical formulation known as <em>Two-Way Chaining</em>.</p>
<p>Our table will still consist of a flat array of buckets, but each bucket will be able to hold a small handful of keys.
When inserting an element, we will hash the key with <em>two different</em> functions, then place it into whichever corresponding bucket has more space.
If both buckets happen to be filled, the entire table will grow and the insertion is retried.</p>
<p><img class="probing2 center" src="/assets/hashtables/two_way_insert.svg" /></p>
<p>You might think this is crazy—in separate chaining, the expected maximum probe length scales with \(O(\frac{\lg N}{\lg \lg N})\), so wouldn’t this waste copious amounts of memory growing the table?
It turns out adding a second hash function reduces the expected maximum to \(O(\lg\lg N)\), for reasons <a href="https://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/slidesS14/load-balancing.pdf">explored elsewhere</a>.
That makes it practical to cap the maximum bucket size at a relatively small number.</p>
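<p>The core insertion logic can be sketched as follows. The two hash mixers, the fixed bucket capacity, and the growth policy are placeholders—the benchmarked table instead derives the second function from the high bits of a single hash:</p>

```cpp
#include <cstdint>
#include <vector>

// Sketch of two-way chained insertion: hash the key twice and place it in
// the emptier of the two candidate buckets. The hash mixers are
// illustrative placeholders; growing the table is left to the caller.
constexpr int CAP = 4;

struct Bucket {
    uint64_t keys[CAP] = {};
    int count = 0;
};

struct TwoWay {
    std::vector<Bucket> buckets;

    uint64_t h1(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ull) % buckets.size(); }
    uint64_t h2(uint64_t k) const { return ((k ^ (k >> 31)) * 0xC2B2AE3D27D4EB4Full) % buckets.size(); }

    bool insert(uint64_t key) {
        Bucket& a = buckets[h1(key)];
        Bucket& b = buckets[h2(key)];
        Bucket& dst = a.count <= b.count ? a : b; // prefer the emptier bucket
        if(dst.count == CAP) return false;        // both full: grow, then retry
        dst.keys[dst.count++] = key;
        return true;
    }

    bool find(uint64_t key) const {
        const Bucket& a = buckets[h1(key)];
        for(int i = 0; i < a.count; i++) if(a.keys[i] == key) return true;
        const Bucket& b = buckets[h2(key)];
        for(int i = 0; i < b.count; i++) if(b.keys[i] == key) return true;
        return false;
    }
};
```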
<p>Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/two_way.h">implement</a> a flat two-way chained table.
As with double hashing, we can use the high bits of our hash to represent the second hash function.</p>
<table class="fcenter">
<tr class="hrow1"><td rowspan="2" class="header"><strong>Two-Way<br />Chaining</strong></td><td colspan="2"><strong>Capacity 2</strong></td><td colspan="2"><strong>Capacity 4</strong></td><td colspan="2"><strong>Capacity 8</strong></td></tr>
<tr class="hrow2"><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td><td>Existing</td><td>Missing</td></tr>
<tr><td>Insert (ns/key)</td>
<td colspan="2">265</td>
<td colspan="2">108</td>
<td colspan="2">111</td>
</tr><tr><td>Erase (ns/key)</td>
<td colspan="2">24</td>
<td colspan="2">34</td>
<td colspan="2">45</td>
</tr><tr><td>Lookup (ns/key)</td>
<td>23</td>
<td>23</td>
<td>30</td>
<td>30</td>
<td>39</td>
<td>42</td>
</tr><tr><td>Average Probe</td>
<td>0.03</td>
<td>0.24</td>
<td>0.74</td>
<td>2.73</td>
<td>3.61</td>
<td>8.96</td>
</tr><tr><td>Max Probe</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>13</td>
<td>14</td>
</tr><tr><td>Memory Amplification</td>
<td colspan="2">32.00</td>
<td colspan="2">4.00</td>
<td colspan="2">2.00</td>
</tr></table>
<p>Since every key can be found in one of two possible buckets, query times become essentially deterministic.
Lookup times are especially impressive, only slightly lagging the Robin Hood table and having no penalty on missing keys.
The average probe lengths are very low, and max probe lengths are strictly bounded by two times the bucket size.</p>
<p>The table size balloons when each bucket can only hold two keys, and insertion is very slow due to repeatedly growing the table.
However, capacities of 4-8 are reasonable choices for memory-unconstrained use cases.</p>
<p>Overall, two-way chaining is another good choice of hash table, at least when memory efficiency is not the highest priority.
In the next section, we’ll also see how lookups can be accelerated even further with prefetching and SIMD operations.</p>
<p><em>Remember that deterministic queries might not matter if you haven’t also implemented incremental table growth—otherwise, insertions can still stall while the whole table resizes.</em></p>
<h3 id="cuckoo-hashing">Cuckoo Hashing</h3>
<p><a href="https://www.brics.dk/RS/01/32/BRICS-RS-01-32.pdf">Cuckoo hashing</a> involves storing keys in two separate tables, each of which holds one key per bucket.
This scheme will guarantee that every key is stored in its ideal bucket in one of the two tables.</p>
<p>Each key is hashed with two different functions and inserted using the following procedure:</p>
<ol>
<li>If at least one table’s bucket is empty, the key is inserted there.</li>
<li>Otherwise, the key is inserted anyway, evicting one of the existing keys.</li>
<li>The evicted key is transferred to the opposite table at its other possible position.</li>
<li>If doing so evicts another existing key, go to 3.</li>
</ol>
<p>If the loop ends up evicting the key we’re trying to insert, we know it has found a cycle and hence must grow the table.</p>
<p><img class="probing2 center" src="/assets/hashtables/cuckoo_insert.svg" /></p>
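<p>The procedure can be sketched as follows, using a bounded eviction loop in place of exact cycle detection. The hash mixers, the use of zero as an empty marker, and the retry limit are all illustrative simplifications:</p>

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of cuckoo insertion: two tables, one key per bucket, 0 marks empty.
// A bounded eviction loop stands in for exact cycle detection; the hash
// mixers are illustrative placeholders.
struct Cuckoo {
    std::vector<uint64_t> t[2];

    uint64_t hash(int which, uint64_t k) const {
        uint64_t h = k * (which ? 0xC2B2AE3D27D4EB4Full : 0x9E3779B97F4A7C15ull);
        return (h >> 32) % t[which].size();
    }

    bool insert(uint64_t key) {
        int which = 0;
        for(int tries = 0; tries < 64; tries++) {
            uint64_t& slot = t[which][hash(which, key)];
            if(slot == 0) {
                slot = key;
                return true;
            }
            std::swap(slot, key); // evict the resident key...
            which ^= 1;           // ...and reinsert it in the other table
        }
        return false; // likely a cycle: grow both tables and retry
    }

    // Every key sits in its ideal bucket in one of the two tables.
    bool find(uint64_t key) const {
        return t[0][hash(0, key)] == key || t[1][hash(1, key)] == key;
    }
};
```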
<p>Because Cuckoo hashing does not strictly bound the insertion sequence length, I didn’t benchmark it here, but it’s still an interesting option.</p>
<h1 id="unrolling-prefetching-and-simd">Unrolling, Prefetching, and SIMD</h1>
<p>So far, the lookup benchmark has been implemented roughly like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">==</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Naively, this loop runs one lookup at a time.
However, you might have noticed that we’ve been able to find keys <em>faster than the CPU can load data from main memory</em>—so something more complex must be going on.</p>
<p>In reality, modern out-of-order CPUs can execute several iterations in parallel.
Expressing the loop as a dataflow graph lets us build a simplified mental model of how it actually runs: the CPU is able to execute a computation whenever its dependencies are available.</p>
<p>Here, green nodes are resolved, in-flight nodes are yellow, and white nodes have not begun:</p>
<p><img class="probes center" src="/assets/hashtables/dataflow.svg" /></p>
<p>In this example, we can see that each iteration’s root node (<code class="language-plaintext highlighter-rouge">i++</code>) only requires the value of <code class="language-plaintext highlighter-rouge">i</code> from the previous step—<strong>not</strong> the result of <code class="language-plaintext highlighter-rouge">find</code>.
Therefore, the CPU can run several <code class="language-plaintext highlighter-rouge">find</code> nodes in parallel, hiding the fact that each one takes ~100ns to fetch data from main memory.</p>
<h2 id="unrolling">Unrolling</h2>
<p>There’s a common optimization strategy known as <em>loop unrolling</em>.
Unrolling involves explicitly writing out the code for several <em>semantic</em> iterations inside each <em>actual</em> iteration of a loop.
When each inner pseudo-iteration is independent, unrolling explicitly severs them from the loop iterator dependency chain.</p>
<p>Most modern x86 CPUs can keep track of ~10 simultaneous pending loads, so let’s try manually unrolling our benchmark 10 times.
The new code looks like the following, where <code class="language-plaintext highlighter-rouge">index_of</code> hashes the key but doesn’t do the actual lookup.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">indices</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
<span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">indices</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">.</span><span class="n">index_of</span><span class="p">(</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
<span class="p">}</span>
<span class="k">for</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">10</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">find_index</span><span class="p">(</span><span class="n">indices</span><span class="p">[</span><span class="n">j</span><span class="p">])</span> <span class="o">==</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This isn’t true unrolling, since the inner loops introduce dependencies on <code class="language-plaintext highlighter-rouge">j</code>.
However, it doesn’t matter in practice, as computing <code class="language-plaintext highlighter-rouge">j</code> is never a bottleneck.
In fact, our compiler can automatically unroll the inner loop for us (<code class="language-plaintext highlighter-rouge">clang</code> in particular does this).</p>
<table class="fcenter" style="width: 55%">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td></tr>
<tr><td>Separate Chaining (100%)</td>
<td>35</td>
<td>34</td>
</tr><tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
</tr><tr><td>Quadratic (75%)</td>
<td>31</td>
<td>31</td>
</tr><tr><td>Double (75%)</td>
<td>38</td>
<td>37</td>
</tr><tr><td>Robin Hood (75%)</td>
<td>31</td>
<td>31</td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
</tr></table>
<p>Explicit unrolling has basically no effect, since the CPU was already able to saturate its pending load buffer.
However, manual unrolling now lets us introduce <em>software prefetching</em>.</p>
<h2 id="software-prefetching">Software Prefetching</h2>
<p>When traversing memory in a predictable pattern, the CPU is able to preemptively load data for future accesses.
This capability, known as <em>hardware prefetching</em>, is the reason our flat tables are so fast to iterate.
However, hash table <em>lookups</em> are inherently unpredictable: the address to load is determined by a hash function.
That means hardware prefetching cannot help us.</p>
<p>Instead, we can explicitly ask the CPU to prefetch address(es) we will load from in the near future.
Doing so can’t speed up an individual lookup—we’d immediately wait for the result—but software prefetching can accelerate batches of queries.
Given several keys, we can compute where each query <em>will</em> access the table, tell the CPU to prefetch those locations, and <strong>then</strong> start actually accessing the table.</p>
<p>To do so, let’s make <code class="language-plaintext highlighter-rouge">table.index_of</code> prefetch the location(s) examined by <code class="language-plaintext highlighter-rouge">table.find_index</code>:</p>
<ul>
<li>In open-addressed tables, we prefetch the slot the key hashes to.</li>
<li>In two-way chaining, we prefetch <em>both</em> buckets the key hashes to.</li>
</ul>
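<p>As a sketch, the open-addressed version might look like the following, using the compiler’s prefetch intrinsic. The slot layout, hash, and <code class="language-plaintext highlighter-rouge">find_index</code> signature are illustrative, and <code class="language-plaintext highlighter-rouge">__builtin_prefetch</code> is GCC/Clang-specific (MSVC spells it <code class="language-plaintext highlighter-rouge">_mm_prefetch</code>):</p>

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: index_of hashes the key and prefetches the slot's
// cache line, so a later find_index hits cache instead of main memory.
// Slot layout and hash are placeholders. (__builtin_prefetch is GCC/Clang;
// MSVC uses _mm_prefetch from <xmmintrin.h>.)
struct Slot {
    uint64_t key = 0, value = 0;
    bool filled = false;
};

struct Table {
    std::vector<Slot> slots;

    uint64_t index_of(uint64_t key) const {
        uint64_t i = (key * 0x9E3779B97F4A7C15ull) % slots.size();
        __builtin_prefetch(&slots[i]); // request the cache line now
        return i;
    }

    // Probe from the (hopefully cached) slot; key assumed present.
    uint64_t find_index(uint64_t index, uint64_t key) const {
        while(!(slots[index].filled && slots[index].key == key))
            index = (index + 1) % slots.size();
        return slots[index].value;
    }
};
```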
<p>Importantly, prefetching requests the entire cache line for the given address, i.e. the aligned 64-byte chunk of memory containing that address.
That means surrounding data (e.g. the following keys) may also be loaded into cache.</p>
<table class="fcenter" style="width: 70%; margin-top: 15px; margin-bottom: 15px;">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td><td>Prefetched</td></tr>
<tr><td>Separate Chaining (100%)</td>
<td>35</td>
<td>34</td>
<td class="good">26 <strong>(-22.6%)</strong></td>
</tr><tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
<td class="good">23 <strong>(-36.3%)</strong></td>
</tr><tr><td>Quadratic (75%)</td>
<td>31</td>
<td>31</td>
<td class="good">22 <strong>(-27.8%)</strong></td>
</tr><tr><td>Double (75%)</td>
<td>38</td>
<td>37</td>
<td class="good">33 <strong>(-12.7%)</strong></td>
</tr><tr><td>Robin Hood (75%)</td>
<td>31</td>
<td>31</td>
<td class="good">22 <strong>(-29.4%)</strong></td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
<td class="good">19 <strong>(-40.2%)</strong></td>
</tr></table>
<p>Prefetching makes a big difference!</p>
<ul>
<li>Separate chaining benefits from prefetching the unpredictable initial load<sup id="fnref:chainprefetch" role="doc-noteref"><a href="#fn:chainprefetch" class="footnote" rel="footnote">8</a></sup>.</li>
<li>Linear and quadratic tables with a low average probe length greatly benefit from prefetching.</li>
<li>Double-hashed probe sequences quickly leave the cache line, making prefetching a bit less effective.</li>
<li>Prefetching is ideal for two-way chaining: keys can <em>always</em> be found in a prefetched cache line—but each lookup requires fetching two locations, consuming extra load buffer slots.</li>
</ul>
<h2 id="simd">SIMD</h2>
<p>Modern CPUs provide SIMD (single instruction, multiple data) instructions that process multiple values at once.
These operations are useful for probing: for example, we can load a batch of \(N\) keys and compare them with our query in parallel.
In the best case, a SIMD probe sequence could run \(N\) times faster than checking each key individually.</p>
<p><img class="probes2 center" src="/assets/hashtables/simd_find.svg" /></p>
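<p>The core comparison can be sketched like this, shown with SSE2 and 32-bit keys for brevity—the benchmarked tables instead use AVX2 on 64-bit keys. The aligned-bucket assumption and <code class="language-plaintext highlighter-rouge">__builtin_ctz</code> (GCC/Clang) are implementation details of this sketch:</p>

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Sketch of a SIMD bucket probe: compare four keys against the query at
// once. Shown with SSE2 and 32-bit keys for brevity—the benchmarked tables
// use AVX2 on 64-bit keys. Assumes `keys` is 16-byte aligned.
int find_in_bucket(const uint32_t* keys, uint32_t query) {
    __m128i q = _mm_set1_epi32((int)query);                // broadcast query
    __m128i k = _mm_load_si128(reinterpret_cast<const __m128i*>(keys));
    __m128i eq = _mm_cmpeq_epi32(k, q);                    // lane-wise compare
    int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));      // one bit per lane
    if(mask == 0) return -1;          // no key in this bucket matches
    return __builtin_ctz(mask);       // slot of the first match (GCC/Clang)
}
```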
<p>Let’s <a href="https://github.com/TheNumbat/hashtables/blob/main/code/two_way_simd.h">implement</a> SIMD-accelerated lookups for linear probing and two-way chaining, with \(N=4\):</p>
<table class="fcenter" style="width: 90%; margin-top: 15px; margin-bottom: 15px">
<tr class="hrow2"><td class="header"><strong>Lookups (ns/find)</strong></td><td>Naive</td><td>Unrolled</td><td>Prefetched</td><td>Missing</td></tr>
<tr><td>Linear (75%)</td>
<td>36</td>
<td>35</td>
<td>23</td>
<td>88</td>
</tr><tr><td>SIMD Linear (75%)</td>
<td class="bad">39 <strong>(+10.9%)</strong></td>
<td class="bad">42 <strong>(+18.0%)</strong></td>
<td class="bad">27 <strong>(+19.1%)</strong></td>
<td class="good">69 <strong>(-21.2%)</strong></td>
</tr><tr><td>Two-Way Chaining (x4)</td>
<td>30</td>
<td>32</td>
<td>19</td>
<td>30</td>
</tr><tr><td>SIMD Two-Way Chaining (x4)</td>
<td class="good">28 <strong>(-6.6%)</strong></td>
<td class="good">30 <strong>(-4.6%)</strong></td>
<td class="good">17 <strong>(-9.9%)</strong></td>
<td class="good">19 <strong>(-35.5%)</strong></td>
</tr></table>
<p>Unfortunately, SIMD probes are <em>slower</em> for the linearly probed table—an average probe length of \(1.5\) meant the overhead of setting up an (unaligned) SIMD comparison wasn’t worth it.
However, missing keys are found faster, since their average probe length is much higher (\(29.5\)).</p>
<p>On the other hand, two-way chaining consistently benefits from SIMD probing, as we can find any key using at most two (aligned) comparisons.
However, given a bucket capacity of four, lookups only need to query 1.7 keys on average, so we only see a slight improvement—SIMD would be a much more impactful optimization at higher capacities.
Still, missing key lookups get significantly faster, as they previously had to iterate through all keys in the bucket.</p>
<h1 id="benchmark-data">Benchmark Data</h1>
<p><em>Updated 2023/04/05: additional optimizations for quadratic, double hashed, and two-way tables.</em></p>
<ul>
<li>The linear and Robin Hood tables use backshift deletion.</li>
<li>The two-way SIMD table has capacity-4 buckets and uses AVX2 256-bit SIMD operations.</li>
<li>Insert, erase, find, find unrolled, find prefetched, and find missing report nanoseconds per operation.</li>
<li>Average probe, max probe, average missing probe, and max missing probe report number of iterations.</li>
<li>Clear and iterate report milliseconds for the entire operation.</li>
<li>Memory reports the ratio of allocated memory to size of stored data.</li>
</ul>
<table class="fcenter" id="summary" style="width: 100%">
<tr class="hrow2"><td class="header name"><strong>Table</strong></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Insert</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Erase</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Find</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Unrolled</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Prefetched</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Avg Probe</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Max Probe</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Find DNE</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">DNE Avg</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">DNE Max</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Clear (ms)</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Iterate (ms)</div></div></td>
<td class="rth"><div class="rcontainer"><div class="rcontent">Memory</div></div></td>
</tr>
<tr><td class="name">Chaining 50</td>
<td>115</td>
<td>110</td>
<td>28</td>
<td>26</td>
<td>21</td>
<td>1.2</td>
<td>7</td>
<td>28</td>
<td>0.5</td>
<td>7</td>
<td>113.1</td>
<td>16.2</td>
<td>2.5</td>
</tr>
<tr><td class="name">Chaining 100</td>
<td>113</td>
<td>128</td>
<td>35</td>
<td>34</td>
<td>26</td>
<td>1.5</td>
<td>10</td>
<td>40</td>
<td>1.0</td>
<td>9</td>
<td>118.1</td>
<td>13.9</td>
<td>2.0</td>
</tr>
<tr><td class="name">Chaining 200</td>
<td>120</td>
<td>171</td>
<td>50</td>
<td>50</td>
<td>37</td>
<td>2.0</td>
<td>12</td>
<td>68</td>
<td>2.0</td>
<td>13</td>
<td>122.6</td>
<td>14.3</td>
<td>1.8</td>
</tr>
<tr><td class="name">Chaining 500</td>
<td>152</td>
<td>306</td>
<td>102</td>
<td>109</td>
<td>77</td>
<td>3.5</td>
<td>21</td>
<td>185</td>
<td>5.0</td>
<td>21</td>
<td>153.7</td>
<td>23.0</td>
<td>1.6</td>
</tr>
<tr><td class="name">Linear 50</td>
<td>46</td>
<td>47</td>
<td>20</td>
<td>20</td>
<td>15</td>
<td>0.5</td>
<td>43</td>
<td>37</td>
<td>1.5</td>
<td>50</td>
<td>1.1</td>
<td>4.8</td>
<td>2.0</td>
</tr>
<tr><td class="name">Linear 75</td>
<td>58</td>
<td>82</td>
<td>36</td>
<td>35</td>
<td>23</td>
<td>1.5</td>
<td>181</td>
<td>88</td>
<td>7.5</td>
<td>243</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Linear 90</td>
<td>70</td>
<td>147</td>
<td>46</td>
<td>45</td>
<td>29</td>
<td>4.5</td>
<td>1604</td>
<td>135</td>
<td>49.7</td>
<td>1403</td>
<td>0.6</td>
<td>1.2</td>
<td>1.1</td>
</tr>
<tr><td class="name">Quadratic 50</td>
<td>47</td>
<td>23</td>
<td>20</td>
<td>20</td>
<td>16</td>
<td>0.4</td>
<td>21</td>
<td>82</td>
<td>4.8</td>
<td>49</td>
<td>1.1</td>
<td>5.1</td>
<td>2.0</td>
</tr>
<tr><td class="name">Quadratic 75</td>
<td>54</td>
<td>28</td>
<td>31</td>
<td>31</td>
<td>22</td>
<td>1.0</td>
<td>46</td>
<td>93</td>
<td>5.3</td>
<td>76</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Quadratic 90</td>
<td>65</td>
<td>29</td>
<td>42</td>
<td>42</td>
<td>30</td>
<td>1.9</td>
<td>108</td>
<td>163</td>
<td>18.2</td>
<td>263</td>
<td>0.6</td>
<td>1.0</td>
<td>1.1</td>
</tr>
<tr><td class="name">Double 50</td>
<td>70</td>
<td>31</td>
<td>29</td>
<td>28</td>
<td>23</td>
<td>0.4</td>
<td>17</td>
<td>84</td>
<td>4.4</td>
<td>61</td>
<td>1.1</td>
<td>5.6</td>
<td>2.0</td>
</tr>
<tr><td class="name">Double 75</td>
<td>80</td>
<td>43</td>
<td>38</td>
<td>37</td>
<td>33</td>
<td>0.8</td>
<td>44</td>
<td>93</td>
<td>4.7</td>
<td>83</td>
<td>0.8</td>
<td>2.2</td>
<td>1.3</td>
</tr>
<tr><td class="name">Double 90</td>
<td>88</td>
<td>47</td>
<td>49</td>
<td>49</td>
<td>42</td>
<td>1.6</td>
<td>114</td>
<td>174</td>
<td>16.2</td>
<td>250</td>
<td>0.6</td>
<td>1.0</td>
<td>1.1</td>
</tr>
<tr><td class="name">Robin Hood 50</td>
<td>52</td>
<td>31</td>
<td>19</td>
<td>19</td>
<td>15</td>
<td>0.5</td>
<td>12</td>
<td>33</td>
<td>0.7</td>
<td>12</td>
<td>1.1</td>
<td>4.8</td>
<td>2.0</td>
</tr>
<tr><td class="name">Robin Hood 75</td>
<td>78</td>
<td>47</td>
<td>31</td>
<td>31</td>
<td>22</td>
<td>1.5</td>
<td>24</td>
<td>74</td>
<td>1.9</td>
<td>25</td>
<td>0.7</td>
<td>2.1</td>
<td>1.3</td>
</tr>
<tr><td class="name">Robin Hood 90</td>
<td>118</td>
<td>59</td>
<td>80</td>
<td>79</td>
<td>41</td>
<td>4.5</td>
<td>58</td>
<td>108</td>
<td>5.0</td>
<td>67</td>
<td>0.6</td>
<td>1.2</td>
<td>1.1</td>
</tr>
<tr><td class="name">Two-way x2</td>
<td>265</td>
<td>24</td>
<td>23</td>
<td>25</td>
<td>17</td>
<td>0.0</td>
<td>3</td>
<td>23</td>
<td>0.2</td>
<td>4</td>
<td>17.9</td>
<td>19.6</td>
<td>32.0</td>
</tr>
<tr><td class="name">Two-way x4</td>
<td>108</td>
<td>34</td>
<td>30</td>
<td>32</td>
<td>19</td>
<td>0.7</td>
<td>6</td>
<td>30</td>
<td>2.7</td>
<td>8</td>
<td>2.2</td>
<td>3.7</td>
<td>4.0</td>
</tr>
<tr><td class="name">Two-way x8</td>
<td>111</td>
<td>45</td>
<td>39</td>
<td>39</td>
<td>28</td>
<td>3.6</td>
<td>13</td>
<td>42</td>
<td>9.0</td>
<td>14</td>
<td>1.1</td>
<td>1.5</td>
<td>2.0</td>
</tr>
<tr><td class="name">Two-way SIMD</td>
<td>124</td>
<td>46</td>
<td>28</td>
<td>30</td>
<td>17</td>
<td>0.3</td>
<td>1</td>
<td>19</td>
<td>1.0</td>
<td>1</td>
<td>2.2</td>
<td>3.7</td>
<td>4.0</td>
</tr>
</table>
<p><br /></p>
<h2 id="footnotes">Footnotes</h2>
<link rel="stylesheet" type="text/css" href="/assets/hashtables/style.css" />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:cacheusage" role="doc-endnote">
<p>Actually, this prior shouldn’t necessarily apply to hash tables, which by definition have unpredictable access patterns.
However, we will see that flat storage still provides performance benefits, especially for linear probing, whole-table iteration, and serial accesses. <a href="#fnref:cacheusage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:missingelements" role="doc-endnote">
<p>To account for various deletion strategies, the <code class="language-plaintext highlighter-rouge">find-missing</code> benchmark was run after the insertion & erasure tests, and after inserting a second set of elements.
This sequence allows separate chaining’s reported maximum probe length to be larger for missing keys than existing keys.
Looking up the latter set of elements was also benchmarked, but results are omitted here, as the tombstone-based tables were simply marginally slower across the board. <a href="#fnref:missingelements" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:mycpu" role="doc-endnote">
<p>AMD 5950X @ 4.4 GHz + DDR4 3600/CL16, Windows 11 Pro 22H2, MSVC 19.34.31937.<br />
No particular effort was made to isolate the benchmarks, but they were repeatable on an otherwise idle system.
Each was run 10 times and average results are reported. <a href="#fnref:mycpu" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vmem" role="doc-endnote">
<p>Virtual memory allows separate regions of our address space to refer to the <em>same</em> underlying pages.
If we map our table <em>twice</em>, one after the other, probe traversal never has to check whether it needs to wrap around to the start of the table.
Instead, it can simply proceed into the second mapping—since writes to one range are automatically reflected in the other, this just works.</p>
<p><img class="diagram center" src="/assets/hashtables/vmem.svg" /> <a href="#fnref:vmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:prime" role="doc-endnote">
<p>You might expect indexing a prime-sized table to require an expensive integer mod/div, but it actually doesn’t.
Given a fixed set of possible primes, a prime-specific indexing operation can be selected at runtime based on the current size.
Since each such operation is a modulo by a <em>constant</em>, <a href="https://lemire.me/blog/2021/04/28/ideal-divisors-when-a-division-compiles-down-to-just-a-multiplication/">the compiler can translate it to a fast integer multiplication</a>. <a href="#fnref:prime" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sattolo" role="doc-endnote">
<p>A Sattolo traversal causes each lookup to directly depend on the result of the previous one, serializing execution and preventing the CPU from hiding memory latency.
Hence, finding each element took ~<a href="https://gist.github.com/jboner/2841832">100ns</a> in <em>every</em> type of open table: roughly equal to the latency of fetching data from RAM.
Similarly, each lookup in the low-load separately chained table took ~200ns, i.e. \((1 + \text{Chain Length}) * \text{Memory Latency}\). <a href="#fnref:sattolo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:slowerat90" role="doc-endnote">
<p>First, <em>any</em> linearly probed table requires the same <em>total</em> number of probes to look up each element once—that’s why the average probe length never improved.
Every hash collision requires moving some key one slot further from its index, i.e. adds one to the total probe count.</p>
<p>Second, the cost of a lookup, per probe, can <em>decrease</em> with the number of probes required.
When traversing a long chain, the CPU can avoid cache misses by prefetching future accesses.
Therefore, allocating a larger portion of the same <em>total</em> probe count to long chains can increase performance.</p>
<p>Hence, linear probing tops this particular benchmark—but it’s not the most useful metric in practice.
Robin Hood probing greatly decreases maximum probe length, making query performance far more consistent even if technically slower on average.
Anyway, you really don’t need a 90% load factor—just use 75% for consistently high performance. <a href="#fnref:slowerat90" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:chainprefetch" role="doc-endnote">
<p>Is the CPU smart enough to automatically prefetch addresses contained within the explicitly prefetched cache line? I’m pretty sure it’s not, but that would be cool. <a href="#fnref:chainprefetch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

Your default hash table should be open-addressed, using Robin Hood linear probing with backward-shift deletion. When prioritizing deterministic performance over memory efficiency, two-way chaining is also a good choice. Code for this article may be found on GitHub.

Contents:
- Open Addressing vs. Separate Chaining
- Benchmark Setup
- Discussion: Separate Chaining, Linear Probing, Quadratic Probing, Double Hashing, Robin Hood Linear Probing, Two-Way Chaining, and Unrolling, Prefetching, and SIMD
- Benchmark Data

Open Addressing vs. Separate Chaining

Most people first encounter hash tables implemented using separate chaining, a model simple to understand and analyze mathematically. In separate chaining, a hash function is used to map each key to one of \(K\) buckets. Each bucket holds a linked list, so to retrieve a key, one simply traverses its corresponding bucket.

Given a hash function drawn from a universal family, inserting \(N\) keys into a table with \(K\) buckets results in an expected bucket size of \(\frac{N}{K}\), a quantity also known as the table’s load factor. Assuming (reasonably) that \(\frac{N}{K}\) is bounded by a constant, all operations on such a table can be performed in expected constant time.

Unfortunately, this basic analysis doesn’t consider the myriad factors that go into implementing an efficient hash table on a real computer. In practice, hash tables based on open addressing can provide superior performance, and their limitations can be worked around in nearly all cases.

Open Addressing

In an open-addressed table, each bucket only contains a single key. Collisions are handled by placing additional keys elsewhere in the table. For example, in linear probing, a key is placed in the first open bucket starting from the index it hashes to. Unlike in separate chaining, open-addressed tables may be represented in memory as a single flat array.
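To make the flat-array representation concrete, here is a deliberately minimal linear-probing sketch. The names and the toy modulo hash are hypothetical (not the benchmarked implementation), and it assumes the table never completely fills:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Minimal open-addressed table stored as one flat array, using linear
// probing. Illustrative sketch only: toy hash, no resizing, no deletion.
class FlatTable {
public:
    explicit FlatTable(size_t n) : slots_(n) {}

    void insert(uint64_t key) {
        size_t i = hash(key);
        // Linear probing: place the key in the first open bucket.
        while (slots_[i].has_value()) i = (i + 1) % slots_.size();
        slots_[i] = key;
    }

    bool contains(uint64_t key) const {
        size_t i = hash(key);
        // Stop at an empty bucket: the key cannot be stored past it.
        while (slots_[i].has_value()) {
            if (*slots_[i] == key) return true;
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    size_t hash(uint64_t key) const { return key % slots_.size(); } // toy hash
    std::vector<std::optional<uint64_t>> slots_;
};
```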
A priori, we should expect operations on flat arrays to offer higher performance than those on linked structures due to more coherent memory accesses.[1] However, if the user wants to insert more than \(K\) keys into an open-addressed table with \(K\) buckets, the entire table must be resized. Typically, a larger array is allocated and all current keys are moved to the new table. The size is grown by a constant factor (e.g. 2x), so insertions occur in amortized constant time.

Why Not Separate Chaining?

In practice, the standard libraries of many languages provide separately chained hash tables, such as C++’s std::unordered_map. At least in C++, this choice is now considered to have been a mistake—it violates C++’s principle of not paying for features you didn’t ask for. Separate chaining still has some purported benefits, but they seem unconvincing:

- Separately chained tables don’t require any linear-time operations. First, this isn’t true. To maintain a constant load factor, separate chaining hash tables also have to resize once a sufficient number of keys are inserted, though the limit can be greater than \(K\). Most separately chained tables, including std::unordered_map, only provide amortized constant time insertion. Second, open-addressed tables can be incrementally resized. When the table grows, there’s no reason we have to move all existing keys right away. Instead, every subsequent operation can move a fixed number of old keys into the new table. Eventually, all keys will have been moved and the old table may be deallocated. This technique incurs overhead on every operation during a resize, but none will require linear time.

- When each (key + value) entry is allocated separately, entries can have stable addresses. External code can safely store pointers to them, and large entries never have to be copied. This property is easy to support by adding a single level of indirection.
That is, allocate each entry separately, but use an open-addressed table to map keys to pointers-to-entries. In this case, resizing the table only requires copying pointers, rather than entire entries.

- Lower memory overhead. It’s possible for a separately chained table to store the same data in less space than an open-addressed equivalent, since flat tables necessarily waste space storing empty buckets. However, matching high-quality open addressing schemes (especially when storing entries indirectly) requires a load factor greater than one, degrading query performance. Further, individually allocating nodes wastes memory due to heap allocator overhead and fragmentation.

In fact, the only situation where open addressing truly doesn’t work is when the table cannot allocate and entries cannot be moved. These requirements can arise when using intrusive linked lists, a common pattern in kernel data structures and embedded systems with no heap.

Benchmark Setup

For simplicity, let us only consider tables mapping 64-bit integer keys to 64-bit integer values. Results will not take into account the effects of larger values or non-trivial key comparison, but should still be representative, as keys can often be represented in 64 bits and values are often pointers.

We will evaluate tables using the following metrics:

- Time to insert \(N\) keys into an empty table.
- Time to erase \(N\) existing keys.
- Time to look up \(N\) existing keys.
- Time to look up \(N\) missing keys.[2]
- Average probe length for existing & missing keys.
- Maximum probe length for existing & missing keys.
- Memory amplification in a full table (total size / size of data).

Each open-addressed table was benchmarked at 50%, 75%, and 90% load factors. Every test targeted a table capacity of 8M entries, setting \(N = \lfloor 8388608 * \text{Load Factor} \rfloor - 1\).
Fixing the table capacity allows us to benchmark behavior at exactly the specified load factor, avoiding misleading results from tables that have recently grown. The large number of keys causes each test to unpredictably traverse a data set much larger than L3 cache (128MB+), so the CPU[3] cannot effectively cache or prefetch memory accesses.

Table Sizes

All benchmarked tables use exclusively power-of-two sizes. This invariant allows us to translate hashes to array indices with a fast bitwise AND operation, since:

```
h % 2**n == h & (2**n - 1)
```

Power-of-two sizes also admit a neat virtual memory trick[4], but that optimization is not included here.

However, power-of-two sizes can cause pathological behavior when paired with some hash functions, since indexing simply throws away the hash’s high bits. An alternative strategy is to use prime sized tables, which are significantly more robust to the choice of hash function. For the purposes of this post, we have a fixed hash function and non-adversarial inputs, so prime sizes were not used.[5]

Hash Function

All benchmarked tables use the squirrel3 hash function, a fast integer hash that (as far as I can tell) only exists in this GDC talk on noise-based RNG. Here, it is adapted to 64-bit integers by choosing three large 64-bit primes:

```cpp
uint64_t squirrel3(uint64_t at) {
    constexpr uint64_t BIT_NOISE1 = 0x9E3779B185EBCA87ULL;
    constexpr uint64_t BIT_NOISE2 = 0xC2B2AE3D27D4EB4FULL;
    constexpr uint64_t BIT_NOISE3 = 0x27D4EB2F165667C5ULL;
    at *= BIT_NOISE1;
    at ^= (at >> 8);
    at += BIT_NOISE2;
    at ^= (at << 8);
    at *= BIT_NOISE3;
    at ^= (at >> 8);
    return at;
}
```

I make no claims regarding the quality or robustness of this hash function, but observe that it’s cheap, it produces the expected number of collisions in power-of-two tables, and it passes smhasher when applied bytewise.
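The mask identity above can be checked directly. A small standalone sketch pairing squirrel3 with a power-of-two index mask (the helper name index_for is hypothetical):

```cpp
#include <cstdint>

// squirrel3, reproduced so this snippet stands alone.
uint64_t squirrel3(uint64_t at) {
    constexpr uint64_t BIT_NOISE1 = 0x9E3779B185EBCA87ULL;
    constexpr uint64_t BIT_NOISE2 = 0xC2B2AE3D27D4EB4FULL;
    constexpr uint64_t BIT_NOISE3 = 0x27D4EB2F165667C5ULL;
    at *= BIT_NOISE1; at ^= (at >> 8);
    at += BIT_NOISE2; at ^= (at << 8);
    at *= BIT_NOISE3; at ^= (at >> 8);
    return at;
}

// Index into a power-of-two table: the AND is equivalent to h % capacity.
uint64_t index_for(uint64_t key, uint64_t capacity) {
    return squirrel3(key) & (capacity - 1);
}
```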
Other Benchmarks

Several less interesting benchmarks were also run, all of which exhibited identical performance across equal-memory open-addressed tables and much worse performance for separate chaining:

- Time to look up each element by following a Sattolo cycle[6] (2-4x).
- Time to clear the table (100-200x).
- Time to iterate all values (3-10x).

Discussion

Separate Chaining

To get our bearings, let’s first implement a simple separate chaining table and benchmark it at various load factors.

| Separate Chaining | 50% Load | 100% Load | 200% Load | 500% Load |
|---|---|---|---|---|
| Insert (ns/key) | 115 | 113 | 120 | 152 |
| Erase (ns/key) | 110 | 128 | 171 | 306 |
| Lookup (ns/key), existing / missing | 28 / 28 | 35 / 40 | 50 / 68 | 102 / 185 |
| Average Probe, existing / missing | 1.25 / 0.50 | 1.50 / 1.00 | 2.00 / 2.00 | 3.50 / 5.00 |
| Max Probe, existing / missing | 7 / 7 | 10 / 9 | 12 / 13 | 21 / 21 |
| Memory Amplification | 2.50 | 2.00 | 1.75 | 1.60 |

These are respectable results, especially lookups at 100%: the CPU is able to hide much of the memory latency. Insertions and erasures, on the other hand, are pretty slow—they involve allocation. Technically, the reported memory amplification is an underestimate, as it does not include heap allocator overhead. However, we should not directly compare memory amplification against the coming open-addressed tables, as storing larger values would improve separate chaining’s ratio.

std::unordered_map

Running these benchmarks on std::unordered_map<uint64_t,uint64_t> produces results roughly equivalent to the 100% column, just with marginally slower insertions and faster clears. Using squirrel3 with std::unordered_map additionally makes insertions slightly slower and lookups slightly faster. Further comparisons will be relative to these results, since exact performance numbers for std::unordered_map depend on one’s standard library distribution (here, Microsoft’s).

Linear Probing

Next, let’s implement the simplest type of open addressing—linear probing. In the following diagrams, assume that each letter hashes to its alphabetical index.
The skull denotes a tombstone.

- Insert: starting from the key’s index, place the key in the first empty or tombstone bucket.
- Lookup: starting from the key’s index, probe buckets in order until finding the key, reaching an empty non-tombstone bucket, or exhausting the table.
- Erase: lookup the key and replace it with a tombstone.

The results are surprisingly good, considering the simplicity:

| Linear Probing (Naive) | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 46 | 59 | 73 |
| Erase (ns/key) | 21 | 36 | 47 |
| Lookup (ns/key), existing / missing | 20 / 86 | 35 / 124 | 47 / 285 |
| Average Probe, existing / missing | 0.50 / 6.04 | 1.49 / 29.5 | 4.46 / 222 |
| Max Probe, existing / missing | 43 / 85 | 181 / 438 | 1604 / 2816 |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

For existing keys, we’re already beating the separately chained tables, and with lower memory overhead! Of course, there are also some obvious problems—looking for missing keys takes much longer, and the worst-case probe lengths are quite scary, even at 50% load factor. But, we can do much better.

Erase: Rehashing

One simple optimization is automatically recreating the table once it accumulates too many tombstones. Erase now occurs in amortized constant time, but we already tolerate that for insertion, so it shouldn’t be a big deal. Let’s implement a table that re-creates itself after erase has been called \(\frac{N}{2}\) times. Compared to naive linear probing:

| Linear Probing + Rehashing | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 46 | 58 | 70 |
| Erase (ns/key) | 23 (+10.4%) | 27 (-23.9%) | 29 (-39.3%) |
| Lookup (ns/key), existing / missing | 20 / 85 | 35 / 93 (-25.4%) | 46 / 144 (-49.4%) |
| Average Probe, existing / missing | 0.50 / 6.04 | 1.49 / 9.34 (-68.3%) | 4.46 / 57.84 (-74.0%) |
| Max Probe, existing / missing | 43 / 85 | 181 / 245 (-44.1%) | 1604 / 1405 (-50.1%) |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

Clearing the tombstones dramatically improves missing key queries, but they are still pretty slow. It also makes erase slower at 50% load—there aren’t enough tombstones to make up for having to recreate the table.
Rehashing also does nothing to mitigate long probe sequences for existing keys.

Erase: Backward Shift

Actually, who says we need tombstones? There’s a little-taught, but very simple algorithm for erasing keys that doesn’t degrade performance or have to recreate the table.

When we erase a key, we don’t know whether the lookup algorithm is relying on our bucket being filled to find keys later in the table. Tombstones are one way to avoid this problem. Instead, however, we can guarantee that traversal is never disrupted by simply removing and reinserting all following keys, up to the next empty bucket. Note that despite the name, we are not simply shifting the following keys backward. Any keys that are already at their optimal index will not be moved. For example:

After implementing backward-shift deletion, we can compare it to linear probing with rehashing:

| Linear Probing + Backshift | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 46 | 58 | 70 |
| Erase (ns/key) | 47 (+105%) | 82 (+201%) | 147 (+415%) |
| Lookup (ns/key), existing / missing | 20 / 37 (-56.2%) | 36 / 88 | 46 / 135 |
| Average Probe, existing / missing | 0.50 / 1.50 (-75.2%) | 1.49 / 7.46 (-20.1%) | 4.46 / 49.7 (-14.1%) |
| Max Probe, existing / missing | 43 / 50 (-41.2%) | 181 / 243 | 1604 / 1403 |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

Naturally, erasures become significantly slower, but queries on missing keys now have reasonable behavior, especially at 50% load. In fact, at 50% load, this table already beats the performance of equal-memory separate chaining in all four metrics. But, we can still do better: a 50% load factor is unimpressive, and max probe lengths are still very high compared to separate chaining.

You might have noticed that neither of these upgrades improved average probe length for existing keys. That’s because it’s not possible for linear probing to have any other average! Think about why that is—it’ll be interesting later.

Quadratic Probing

Quadratic probing is a common upgrade to linear probing intended to decrease average and maximum probe lengths.
In the linear case, a probe of length \(n\) simply queries the bucket at index \(h(k) + n\). Quadratic probing instead queries the bucket at index \(h(k) + \frac{n(n+1)}{2}\). Other quadratic sequences are possible, but this one can be efficiently computed by adding \(n\) to the index at each step. Each operation must be updated to use our new probe sequence, but the logic is otherwise almost identical.

There are a couple subtle caveats to quadratic probing:

- Although this specific sequence will touch every bucket in a power-of-two table, other sequences (e.g. \(n^2\)) may not. If an insertion fails despite there being an empty bucket, the table must grow and the insertion is retried.
- There actually is a true deletion algorithm for quadratic probing, but it’s a bit more involved than the linear case and hence not reproduced here. We will again use tombstones and rehashing.

After implementing our quadratic table, we can compare it to linear probing with rehashing:

| Quadratic Probing + Rehashing | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 47 | 54 (-7.2%) | 65 (-6.4%) |
| Erase (ns/key) | 23 | 28 | 29 |
| Lookup (ns/key), existing / missing | 20 / 82 | 31 (-13.2%) / 93 | 42 (-7.3%) / 163 (+12.8%) |
| Average Probe, existing / missing | 0.43 (-12.9%) / 4.81 (-20.3%) | 0.99 (-33.4%) / 5.31 (-43.1%) | 1.87 (-58.1%) / 18.22 (-68.5%) |
| Max Probe, existing / missing | 21 (-51.2%) / 49 (-42.4%) | 46 (-74.6%) / 76 (-69.0%) | 108 (-93.3%) / 263 (-81.3%) |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

As expected, quadratic probing dramatically reduces both the average and worst case probe lengths, especially at high load factors. These gains are not quite as impressive when compared to the linear table with backshift deletion, but are still very significant. Reducing the average probe length improves the average performance of all queries. Erasures are the fastest yet, since quadratic probing quickly finds the element and marks it as a tombstone. However, beyond 50% load, maximum probe lengths are still pretty concerning.
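The triangular-number probe sequence can be computed incrementally, as described above. A small sketch (the helper is hypothetical, not the benchmarked table):

```cpp
#include <cstdint>
#include <vector>

// Generate the first `count` quadratic-probe indices for hash h in a
// power-of-two table of size `size`. Rather than computing n(n+1)/2 each
// time, the index advances by the step number, yielding offsets
// 0, 1, 3, 6, 10, ... from the home bucket.
std::vector<uint64_t> quadratic_probes(uint64_t h, uint64_t size, int count) {
    std::vector<uint64_t> out;
    uint64_t index = h & (size - 1);
    for (uint64_t n = 0; n < (uint64_t)count; n++) {
        out.push_back(index);
        index = (index + n + 1) & (size - 1); // incremental triangular step
    }
    return out;
}
```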
Double Hashing

Double hashing purports to provide even better behavior than quadratic probing. Instead of using the same probe sequence for every key, double hashing determines the probe stride by hashing the key a second time. That is, a probe of length \(n\) queries the bucket at index \(h_1(k) + n*h_2(k)\).

If \(h_2\) is always co-prime to the table size, insertions always succeed, since the probe sequence will touch every bucket. That might sound difficult to assure, but since our table sizes are always a power of two, we can simply make \(h_2\) always return an odd number.

Unfortunately, there is no true deletion algorithm for double hashing. We will again use tombstones and rehashing.

So far, we’ve been throwing away the high bits of our 64-bit hash when computing table indices. Instead of evaluating a second hash function, let’s just re-use the top 32 bits: \(h_2(k) = h_1(k) >> 32\). After implementing double hashing, we may compare the results to quadratic probing with rehashing:

| Double Hashing + Rehashing | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 70 (+49.6%) | 80 (+48.6%) | 88 (+35.1%) |
| Erase (ns/key) | 31 (+33.0%) | 43 (+53.1%) | 47 (+61.5%) |
| Lookup (ns/key), existing / missing | 29 (+46.0%) / 84 | 38 (+25.5%) / 93 | 49 (+14.9%) / 174 |
| Average Probe, existing / missing | 0.39 (-11.0%) / 4.39 (-8.8%) | 0.85 (-14.7%) / 4.72 (-11.1%) | 1.56 (-16.8%) / 16.23 (-10.9%) |
| Max Probe, existing / missing | 17 (-19.0%) / 61 (+24.5%) | 44 (-4.3%) / 83 (+9.2%) | 114 (+5.6%) / 250 (-4.9%) |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

Double hashing succeeds in further reducing average probe lengths, but max lengths are more of a mixed bag. Unfortunately, most queries now traverse larger spans of memory, causing performance to lag behind quadratic probing.

Robin Hood Linear Probing

So far, quadratic probing and double hashing have provided lower probe lengths, but their raw performance isn’t much better than linear probing—especially on missing keys. Enter Robin Hood linear probing.
This strategy drastically reins in maximum probe lengths while maintaining high query performance. The key difference is that when inserting a key, we are allowed to steal the position of an existing key if it is closer to its ideal index (“richer”) than the new key is. When this occurs, we simply swap the old and new keys and carry on inserting the old key in the same manner. This results in a more equitable distribution of probe lengths.

This insertion method turns out to be so effective that we can entirely ignore rehashing as long as we keep track of the maximum probe length. Lookups will simply probe until either finding the key or exhausting the maximum probe length. Implementing this table, compared to linear probing with backshift deletion:

| Robin Hood (Naive) | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 53 (+16.1%) | 80 (+38.3%) | 124 (+76.3%) |
| Erase (ns/key) | 19 (-58.8%) | 31 (-63.0%) | 77 (-47.9%) |
| Lookup (ns/key), existing / missing | 19 / 58 (+56.7%) | 31 (-11.3%) / 109 (+23.7%) | 79 (+72.2%) / 147 (+8.9%) |
| Average Probe, existing / missing | 0.50 / 13 (+769%) | 1.49 / 28 (+275%) | 4.46 / 67 (+34.9%) |
| Max Probe, existing / missing | 12 (-72.1%) / 13 (-74.0%) | 24 (-86.7%) / 28 (-88.5%) | 58 (-96.4%) / 67 (-95.2%) |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

Maximum probe lengths improve dramatically, undercutting even quadratic probing and double hashing. This solves the main problem with open addressing. However, performance gains are less clear-cut:

- Insertion performance suffers across the board.
- Existing lookups are slightly faster at low load, but much slower at 90%.
- Missing lookups are maximally pessimistic.

Fortunately, we’ve got one more trick up our sleeves.

Erase: Backward Shift

When erasing a key, we can run a very similar backshift deletion algorithm. There are just two differences:

- We can stop upon finding a key that is already in its optimal location. No further keys could have been pushed past it—they would have stolen this spot during insertion.
- We don’t have to re-hash—since we stop at the first optimal key, all keys before it can simply be shifted backward.

With backshift deletion, we no longer need to keep track of the maximum probe length. Implementing this table, compared to naive Robin Hood probing:

| Robin Hood + Backshift | 50% Load | 75% Load | 90% Load |
|---|---|---|---|
| Insert (ns/key) | 52 | 78 | 118 |
| Erase (ns/key) | 31 (+63.0%) | 47 (+52.7%) | 59 (-23.3%) |
| Lookup (ns/key), existing / missing | 19 / 33 (-43.3%) | 31 / 74 (-32.2%) | 80 / 108 (-26.1%) |
| Average Probe, existing / missing | 0.50 / 0.75 (-94.2%) | 1.49 / 1.87 (-93.3%) | 4.46 / 4.95 (-92.6%) |
| Max Probe, existing / missing | 12 / 12 (-7.7%) | 24 / 25 (-10.7%) | 58 / 67 |
| Memory Amplification | 2.00 | 1.33 | 1.11 |

We again regress the performance of erase, but we finally achieve reasonable missing-key behavior. Given the excellent maximum probe lengths and overall query performance, Robin Hood probing with backshift deletion should be your default choice of hash table.

You might notice that at 90% load, lookups are significantly slower than basic linear probing. This gap actually isn’t concerning, and explaining why is an interesting exercise.[7]

Two-Way Chaining

Our Robin Hood table has excellent query performance, and exhibited a maximum probe length of only 12—it’s a good default choice. However, another class of hash tables based on Cuckoo Hashing can provably never probe more than a constant number of buckets. Here, we will examine a particularly practical formulation known as Two-Way Chaining.

Our table will still consist of a flat array of buckets, but each bucket will be able to hold a small handful of keys. When inserting an element, we will hash the key with two different functions, then place it into whichever corresponding bucket has more space. If both buckets happen to be filled, the entire table will grow and the insertion is retried.
You might think this is crazy—in separate chaining, the expected maximum probe length scales with \(O(\frac{\lg N}{\lg \lg N})\), so wouldn’t this waste copious amounts of memory growing the table? It turns out adding a second hash function reduces the expected maximum to \(O(\lg\lg N)\), for reasons explored elsewhere. That makes it practical to cap the maximum bucket size at a relatively small number.

Let’s implement a flat two-way chained table. As with double hashing, we can use the high bits of our hash to represent the second hash function.

| Two-Way Chaining | Capacity 2 | Capacity 4 | Capacity 8 |
|---|---|---|---|
| Insert (ns/key) | 265 | 108 | 111 |
| Erase (ns/key) | 24 | 34 | 45 |
| Lookup (ns/key), existing / missing | 23 / 23 | 30 / 30 | 39 / 42 |
| Average Probe, existing / missing | 0.03 / 0.24 | 0.74 / 2.73 | 3.61 / 8.96 |
| Max Probe, existing / missing | 3 / 4 | 6 / 8 | 13 / 14 |
| Memory Amplification | 32.00 | 4.00 | 2.00 |

Since every key can be found in one of two possible buckets, query times become essentially deterministic. Lookup times are especially impressive, only slightly lagging the Robin Hood table and having no penalty on missing keys. The average probe lengths are very low, and max probe lengths are strictly bounded by two times the bucket size.

The table size balloons when each bucket can only hold two keys, and insertion is very slow due to growing the table. However, capacities of 4-8 are reasonable choices for memory unconstrained use cases. Overall, two-way chaining is another good choice of hash table, at least when memory efficiency is not the highest priority. In the next section, we’ll also see how lookups can be accelerated even further with prefetching and SIMD operations. Remember that deterministic queries might not matter if you haven’t already implemented incremental table growth.

Cuckoo Hashing

Cuckoo hashing involves storing keys in two separate tables, each of which holds one key per bucket. This scheme will guarantee that every key is stored in its ideal bucket in one of the two tables.
Each key is hashed with two different functions and inserted using the following procedure:

1. If at least one table’s bucket is empty, the key is inserted there.
2. Otherwise, the key is inserted anyway, evicting one of the existing keys.
3. The evicted key is transferred to the opposite table at its other possible position.
4. If doing so evicts another existing key, go to 3.

If the loop ends up evicting the key we’re trying to insert, we know it has found a cycle and hence must grow the table. Because Cuckoo hashing does not strictly bound the insertion sequence length, I didn’t benchmark it here, but it’s still an interesting option.

Unrolling, Prefetching, and SIMD

So far, the lookup benchmark has been implemented roughly like this:

```cpp
for(uint64_t i = 0; i < N; i++) {
    assert(table.find(keys[i]) == values[i]);
}
```

Naively, this loop runs one lookup at a time. However, you might have noticed that we’ve been able to find keys faster than the CPU can load data from main memory—so something more complex must be going on.

In reality, modern out-of-order CPUs can execute several iterations in parallel. Expressing the loop as a dataflow graph lets us build a simplified mental model of how it actually runs: the CPU is able to execute a computation whenever its dependencies are available. Here, green nodes are resolved, in-flight nodes are yellow, and white nodes have not begun:

In this example, we can see that each iteration’s root node (i++) only requires the value of i from the previous step—not the result of find. Therefore, the CPU can run several find nodes in parallel, hiding the fact that each one takes ~100ns to fetch data from main memory.

Unrolling

There’s a common optimization strategy known as loop unrolling. Unrolling involves explicitly writing out the code for several semantic iterations inside each actual iteration of a loop. When each inner pseudo-iteration is independent, unrolling explicitly severs them from the loop iterator dependency chain.
Most modern x86 CPUs can keep track of ~10 simultaneous pending loads, so let’s try manually unrolling our benchmark 10 times. The new code looks like the following, where index_of hashes the key but doesn’t do the actual lookup.

```cpp
uint64_t indices[10];
for(uint64_t i = 0; i < N; i += 10) {
    for(uint64_t j = 0; j < 10; j++) {
        indices[j] = table.index_of(keys[i + j]);
    }
    for(uint64_t j = 0; j < 10; j++) {
        assert(table.find_index(indices[j]) == values[i + j]);
    }
}
```

This isn’t true unrolling, since the inner loops introduce dependencies on j. However, it doesn’t matter in practice, as computing j is never a bottleneck. In fact, our compiler can automatically unroll the inner loop for us (clang in particular does this).

| Lookups (ns/find) | Naive | Unrolled |
|---|---|---|
| Separate Chaining (100%) | 35 | 34 |
| Linear (75%) | 36 | 35 |
| Quadratic (75%) | 31 | 31 |
| Double (75%) | 38 | 37 |
| Robin Hood (75%) | 31 | 31 |
| Two-Way Chaining (x4) | 30 | 32 |

Explicit unrolling has basically no effect, since the CPU was already able to saturate its pending load buffer. However, manual unrolling now lets us introduce software prefetching.

Software Prefetching

When traversing memory in a predictable pattern, the CPU is able to preemptively load data for future accesses. This capability, known as hardware prefetching, is the reason our flat tables are so fast to iterate. However, hash table lookups are inherently unpredictable: the address to load is determined by a hash function. That means hardware prefetching cannot help us.

Instead, we can explicitly ask the CPU to prefetch address(es) we will load from in the near future. Doing so can’t speed up an individual lookup—we’d immediately wait for the result—but software prefetching can accelerate batches of queries. Given several keys, we can compute where each query will access the table, tell the CPU to prefetch those locations, and then start actually accessing the table.
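Concretely, the batched pattern has two phases per group of keys: issue prefetches, then do the real accesses. A sketch, assuming GCC/Clang's __builtin_prefetch intrinsic and standing in a plain array for the table (on MSVC, _mm_prefetch plays the same role):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sum table entries at precomputed indices, prefetching each group of 10
// locations before touching any of them. Names here are hypothetical.
uint64_t sum_batched(const std::vector<uint64_t>& table,
                     const std::vector<size_t>& indices) {
    uint64_t sum = 0;
    for (size_t i = 0; i < indices.size(); i += 10) {
        size_t end = std::min(i + 10, indices.size());
        // Phase 1: request the cache lines we are about to touch.
        for (size_t j = i; j < end; j++)
            __builtin_prefetch(&table[indices[j]]);
        // Phase 2: perform the actual accesses.
        for (size_t j = i; j < end; j++)
            sum += table[indices[j]];
    }
    return sum;
}
```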
To do so, let’s make table.index_of prefetch the location(s) examined by table.find_index:

- In open-addressed tables, we prefetch the slot the key hashes to.
- In two-way chaining, we prefetch both buckets the key hashes to.

Importantly, prefetching requests the entire cache line for the given address, i.e. the aligned 64-byte chunk of memory containing that address. That means surrounding data (e.g. the following keys) may also be loaded into cache.

| Lookups (ns/find) | Naive | Unrolled | Prefetched |
|---|---|---|---|
| Separate Chaining (100%) | 35 | 34 | 26 (-22.6%) |
| Linear (75%) | 36 | 35 | 23 (-36.3%) |
| Quadratic (75%) | 31 | 31 | 22 (-27.8%) |
| Double (75%) | 38 | 37 | 33 (-12.7%) |
| Robin Hood (75%) | 31 | 31 | 22 (-29.4%) |
| Two-Way Chaining (x4) | 30 | 32 | 19 (-40.2%) |

Prefetching makes a big difference!

- Separate chaining benefits from prefetching the unpredictable initial load.[8]
- Linear and quadratic tables with a low average probe length greatly benefit from prefetching.
- Double-hashed probe sequences quickly leave the cache line, making prefetching a bit less effective.
- Prefetching is ideal for two-way chaining: keys can always be found in a prefetched cache line—but each lookup requires fetching two locations, consuming extra load buffer slots.

SIMD

Modern CPUs provide SIMD (single instruction, multiple data) instructions that process multiple values at once. These operations are useful for probing: for example, we can load a batch of \(N\) keys and compare them with our query in parallel. In the best case, a SIMD probe sequence could run \(N\) times faster than checking each key individually.
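The group-compare idea can be sketched portably (the benchmarked tables use AVX2 intrinsics such as _mm256_cmpeq_epi64; this scalar loop expresses the same 4-wide comparison and is easy for compilers to vectorize):

```cpp
#include <cstdint>

// Compare a query against a group of 4 keys "at once" (sketch). With AVX2
// this is one 256-bit compare plus a movemask; here, bit i of the result
// is set when lane i matches the query.
int match_mask(const uint64_t keys[4], uint64_t query) {
    int mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= (keys[i] == query) << i;
    return mask;
}
```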
Let’s implement SIMD-accelerated lookups for linear probing and two-way chaining, with \(N=4\):

| Lookups (ns/find) | Naive | Unrolled | Prefetched | Missing |
|---|---|---|---|---|
| Linear (75%) | 36 | 35 | 23 | 88 |
| SIMD Linear (75%) | 39 (+10.9%) | 42 (+18.0%) | 27 (+19.1%) | 98 (-21.2%) |
| Two-Way Chaining (x4) | 30 | 32 | 19 | 30 |
| SIMD Two-Way Chaining (x4) | 28 (-6.6%) | 30 (-4.6%) | 17 (-9.9%) | 19 (-35.5%) |

Unfortunately, SIMD probes are slower for the linearly probed table—an average probe length of \(1.5\) meant the overhead of setting up an (unaligned) SIMD comparison wasn’t worth it. However, missing keys are found faster, since their average probe length is much higher (\(29.5\)).

On the other hand, two-way chaining consistently benefits from SIMD probing, as we can find any key using at most two (aligned) comparisons. However, given a bucket capacity of four, lookups only need to query 1.7 keys on average, so we only see a slight improvement—SIMD would be a much more impactful optimization at higher capacities. Still, missing key lookups get significantly faster, as they previously had to iterate through all keys in the bucket.

Benchmark Data

Updated 2023/04/05: additional optimizations for quadratic, double hashed, and two-way tables.

The linear and Robin Hood tables use backshift deletion. The two-way SIMD table has capacity-4 buckets and uses AVX2 256-bit SIMD operations. Insert, erase, find, find unrolled, find prefetched, and find missing report nanoseconds per operation. Average probe, max probe, average missing probe, and max missing probe report number of iterations. Clear and iterate report milliseconds for the entire operation. Memory reports the ratio of allocated memory to size of stored data.
Table Insert Erase Find Unrolled Prefetched Avg Probe Max Probe Find DNE DNE Avg DNE Max Clear (ms) Iterate (ms) Memory Chaining 50 115 110 28 26 21 1.2 7 28 0.5 7 113.1 16.2 2.5 Chaining 100 113 128 35 34 26 1.5 10 40 1.0 9 118.1 13.9 2.0 Chaining 200 120 171 50 50 37 2.0 12 68 2.0 13 122.6 14.3 1.8 Chaining 500 152 306 102 109 77 3.5 21 185 5.0 21 153.7 23.0 1.6 Linear 50 46 47 20 20 15 0.5 43 37 1.5 50 1.1 4.8 2.0 Linear 75 58 82 36 35 23 1.5 181 88 7.5 243 0.7 2.1 1.3 Linear 90 70 147 46 45 29 4.5 1604 135 49.7 1403 0.6 1.2 1.1 Quadratic 50 47 23 20 20 16 0.4 21 82 4.8 49 1.1 5.1 2.0 Quadratic 75 54 28 31 31 22 1.0 46 93 5.3 76 0.7 2.1 1.3 Quadratic 90 65 29 42 42 30 1.9 108 163 18.2 263 0.6 1.0 1.1 Double 50 70 31 29 28 23 0.4 17 84 4.4 61 1.1 5.6 2.0 Double 75 80 43 38 37 33 0.8 44 93 4.7 83 0.8 2.2 1.3 Double 90 88 47 49 49 42 1.6 114 174 16.2 250 0.6 1.0 1.1 Robin Hood 50 52 31 19 19 15 0.5 12 33 0.7 12 1.1 4.8 2.0 Robin Hood 75 78 47 31 31 22 1.5 24 74 1.9 25 0.7 2.1 1.3 Robin Hood 90 118 59 80 79 41 4.5 58 108 5.0 67 0.6 1.2 1.1 Two-way x2 265 24 23 25 17 0.0 3 23 0.2 4 17.9 19.6 32.0 Two-way x4 108 34 30 32 19 0.7 6 30 2.7 8 2.2 3.7 4.0 Two-way x8 111 45 39 39 28 3.6 13 42 9.0 14 1.1 1.5 2.0 Two-way SIMD 124 46 28 30 17 0.3 1 19 1.0 1 2.2 3.7 4.0 Footnotes Actually, this prior shouldn’t necessarily apply to hash tables, which by definition have unpredictable access patterns. However, we will see that flat storage still provides performance benefits, especially for linear probing, whole-table iteration, and serial accesses. ↩ To account for various deletion strategies, the find-missing benchmark was run after the insertion & erasure tests, and after inserting a second set of elements. This sequence allows separate chaining’s reported maximum probe length to be larger for missing keys than existing keys. 
Looking up the latter set of elements was also benchmarked, but results are omitted here, as the tombstone-based tables were simply marginally slower across the board. ↩

AMD 5950X @ 4.4 GHz + DDR4 3600/CL16, Windows 11 Pro 22H2, MSVC 19.34.31937. No particular effort was made to isolate the benchmarks, but they were repeatable on an otherwise idle system. Each was run 10 times and average results are reported. ↩

Virtual memory allows separate regions of our address space to refer to the same underlying pages. If we map our table twice, one after the other, probe traversal never has to check whether it needs to wrap around to the start of the table. Instead, it can simply proceed into the second mapping—since writes to one range are automatically reflected in the other, this just works. ↩

You might expect indexing a prime-sized table to require an expensive integer mod/div, but it actually doesn’t. Given a fixed set of possible primes, a prime-specific indexing operation can be selected at runtime based on the current size. Since each such operation is a modulo by a constant, the compiler can translate it to a fast integer multiplication. ↩

A Sattolo traversal causes each lookup to directly depend on the result of the previous one, serializing execution and preventing the CPU from hiding memory latency. Hence, finding each element took ~100ns in every type of open table: roughly equal to the latency of fetching data from RAM. Similarly, each lookup in the low-load separately chained table took ~200ns, i.e. \((1 + \text{Chain Length}) * \text{Memory Latency}\). ↩

First, any linearly probed table requires the same total number of probes to look up each element once—that’s why the average probe length never improved. Every hash collision requires moving some key one slot further from its index, i.e. adds one to the total probe count. Second, the cost of a lookup, per probe, can decrease with the number of probes required.
When traversing a long chain, the CPU can avoid cache misses by prefetching future accesses. Therefore, allocating a larger portion of the same total probe count to long chains can increase performance. Hence, linear probing tops this particular benchmark—but it’s not the most useful metric in practice. Robin Hood probing greatly decreases maximum probe length, making query performance far more consistent even if technically slower on average. Anyway, you really don’t need a 90% load factor—just use 75% for consistently high performance. ↩

Is the CPU smart enough to automatically prefetch addresses contained within the explicitly prefetched cache line? I’m pretty sure it’s not, but that would be cool. ↩

NYC: Bronx Zoo
2023-03-14T00:00:00+00:00
https://thenumb.at/Bronx-Zoo
<p>Bronx Zoo, New York, NY, 2023</p>
<div style="text-align: center;"><img src="/assets/bronx-zoo/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/16.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/17.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/18.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/19.webp" /></div>
<div style="text-align: center;"><img src="/assets/bronx-zoo/20.webp" /></div>
<p></p>
Bronx Zoo, New York, NY, 2023
Spherical Integration
2023-01-08T00:00:00+00:00
https://thenumb.at/Spherical-Integration
<p><em>Or, where does that \(\sin\theta\) come from?</em></p>
<p>Integrating functions over spheres is a ubiquitous task in graphics—and a common source of confusion for beginners. In particular, understanding why integration in spherical coordinates requires multiplying by \(\sin\theta\) takes some thought.</p>
<h1 id="the-confusion">The Confusion</h1>
<p>So, we want to integrate a function \(f\) over the unit sphere. For simplicity, let’s assume \(f = 1\). Integrating \(1\) over any surface computes the area of that surface: for a unit sphere, we should end up with \(4\pi\).</p>
<p>Integrating over spheres is much easier in the eponymous spherical coordinates, so let’s define \(f\) in terms of \(\theta, \phi\):</p>
<p><img class="diagram2d_smallerer center" style="padding-bottom: 10px" src="/assets/spherical-integration/spherical.svg" /></p>
<p>\(\theta\) ranges from \(0\) to \(\pi\), representing latitude, and \(\phi\) ranges from \(0\) to \(2\pi\), representing longitude. Naively, we might try to integrate \(f\) by ranging over the two parameters:</p>
\[\begin{align*}
\int_{0}^{2\pi}\int_0^\pi 1\, d\theta d\phi &= \int_0^{2\pi} \theta\Big|_0^\pi\, d\phi\\
&= \int_0^{2\pi} \pi\, d\phi \\
&= \pi \left(\phi\Big|_0^{2\pi}\right) \\
&= 2\pi^2
\end{align*}\]
<p>That’s not \(4\pi\)—we didn’t integrate over the sphere! All we did was integrate over a flat rectangle of height \(\pi\) and width \(2\pi\).
One way to conceptualize this integral is by adding up the differential area \(dA\) of many small rectangular patches of the domain.</p>
<p><img class="diagram2d_smaller center" src="/assets/spherical-integration/patch.svg" /></p>
<p>Each patch has area \(dA = d\theta d\phi\), so adding them up results in the area of the rectangle. What we actually want is to add up the areas of patches <em>on the sphere</em>, where they are smaller.</p>
<p><img class="diagram2d_smaller center" src="/assets/spherical-integration/sphere_patch.svg" /></p>
<p>In the limit (small \(d\theta,d\phi\)), the spherical patch \(d\mathcal{S}\) is a factor of \(\sin\theta\) smaller than the rectangular patch \(dA\)<sup id="fnref:patch" role="doc-noteref"><a href="#fn:patch" class="footnote" rel="footnote">1</a></sup>. Intuitively, the closer to the poles the patch is, the smaller its area.</p>
<p>When integrating over the sphere \(\mathcal{S}\), we call the area differential \(d\mathcal{S} = \sin\theta\, d\theta d\phi\)<sup id="fnref:solidangle" role="doc-noteref"><a href="#fn:solidangle" class="footnote" rel="footnote">2</a></sup>. Let’s try using it:</p>
\[\begin{align*}
\iint_\mathcal{S} 1\,d\mathcal{S} &= \int_{0}^{2\pi}\int_0^\pi \sin\theta\, d\theta d\phi\\
&= \int_0^{2\pi} (-\cos\theta)\Big|_0^\pi\, d\phi\\
&= \int_0^{2\pi} 2\, d\phi \\
&= 2 \left(\phi\Big|_0^{2\pi}\right) \\
&= 4\pi
\end{align*}\]
<p>It works! If you just wanted the intuition, you can stop reading here.</p>
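As a numerical sanity check (not part of the original post), midpoint-rule quadrature over the \((\theta,\phi)\) rectangle reproduces both results: integrating \(1\) gives the flat rectangle's area \(2\pi^2\), while weighting by \(\sin\theta\) recovers the sphere's area \(4\pi\).

```python
import numpy as np

# Midpoint rule on theta in [0, pi]; the phi integral contributes a factor
# of 2*pi because neither integrand depends on phi.
n = 100_000
theta = (np.arange(n) + 0.5) * np.pi / n
dtheta = np.pi / n

flat = np.sum(np.ones(n)) * dtheta * 2 * np.pi        # integrand 1
sphere = np.sum(np.sin(theta)) * dtheta * 2 * np.pi   # integrand sin(theta)

print(round(flat, 3))    # 19.739  (= 2*pi^2, the flat rectangle)
print(round(sphere, 3))  # 12.566  (= 4*pi, the sphere's area)
```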
<h1 id="but-why-sintheta">But why \(\sin\theta\)?</h1>
<p>It’s illustrative to analyze how \(\sin\theta\) arises from parameterizing the Cartesian (\(x,y,z\)) sphere using spherical coordinates.</p>
<p>In Cartesian coordinates, a unit sphere is defined by \(x^2 + y^2 + z^2 = 1\). It’s possible to formulate a Cartesian surface integral based on this definition, but it would be ugly.</p>
\[\iint_{x^2+y^2+z^2=1} f dA = \text{?}\]
<p>Instead, we can perform a <em>change of coordinates</em> from Cartesian to spherical coordinates. To do so, we will define \(\Phi : \theta,\phi \mapsto x,y,z\):</p>
<table class="yetanothertable"><tr>
<td>
<p>
$$ \begin{align*}
\Phi(\theta,\phi) = \begin{bmatrix}\sin\theta\cos\phi\\
\sin\theta\sin\phi\\
\cos\theta\end{bmatrix}
\end{align*} $$
</p>
</td>
<td>
<img class="diagram2d" style="min-width: 200px" src="/assets/spherical-integration/coordchange.svg" />
</td>
</tr>
</table>
<p>The function \(\Phi\) is a <em>parameterization</em> of the unit sphere \(\mathcal{S}\). We can check that it satisfies \(x^2+y^2+z^2=1\) regardless of \(\theta\) and \(\phi\):</p>
\[\begin{align*}
|\Phi(\theta,\phi)|^2 &= (\sin\theta\cos\phi)^2 + (\sin\theta\sin\phi)^2 + \cos^2\theta\\
&= \sin^2\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta \\
&= \sin^2\theta + \cos^2\theta\\
&= 1
\end{align*}\]
<p>Applying \(\Phi\) to the rectangular domain \(\theta\in[0,\pi],\phi\in[0,2\pi]\) in fact describes all of \(\mathcal{S}\), giving us a much simpler parameterization of the integral.</p>
<p>To integrate over \(d\theta\) and \(d\phi\), we also need to compute how they relate to \(d\mathcal{S}\). Luckily, there’s a formula that holds for <strong>any</strong> parametric surface \(\mathbf{r}(u,v)\) describing a domain \(\mathcal{R}\) embedded in three dimensions<sup id="fnref:fundamental" role="doc-noteref"><a href="#fn:fundamental" class="footnote" rel="footnote">3</a></sup>:</p>
\[d\mathcal{R} = \left\lVert \frac{\partial \mathbf{r}}{\partial u} \times \frac{\partial \mathbf{r}}{\partial v} \right\rVert du dv\]
<p>The two partial derivatives represent tangent vectors on \(\mathcal{R}\) along the \(u\) and \(v\) axes, respectively. Intuitively, each tangent vector describes how a \(u,v\) patch is stretched along the corresponding axis when mapped onto \(\mathcal{R}\). The magnitude of their cross product then computes the area of the resulting parallelogram.</p>
<p><img class="diagram2d_double center" src="/assets/spherical-integration/cross.svg" /></p>
<p>Finally, we can actually apply the change of coordinates:</p>
\[\begin{align*}
\iint_\mathcal{S} f\, d\mathcal{S} &=
\int_0^{2\pi}\int_0^\pi f(\Phi(\theta,\phi)) \left\lVert \frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} \right\rVert d\theta d\phi
\end{align*}\]
<p>We just need to compute the area term:</p>
\[\begin{align*}
\frac{\partial\Phi}{\partial\theta} &= \begin{bmatrix}\cos\phi\cos\theta & \sin\phi\cos\theta & -\sin\theta\end{bmatrix}\\
\frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}-\sin\phi\sin\theta & \cos\phi\sin\theta & 0\end{bmatrix}\\
\frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi} &= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos^2\phi\cos\theta\sin\theta + \sin^2\phi\cos\theta\sin\theta\end{bmatrix}\\
&= \begin{bmatrix}\sin^2\theta\cos\phi & \sin^2\theta\sin\phi & \cos\theta\sin\theta \end{bmatrix}\\
\left\lVert\frac{\partial\Phi}{\partial\theta} \times \frac{\partial\Phi}{\partial\phi}\right\rVert &=
\sqrt{\sin^4\theta(\cos^2\phi+\sin^2\phi) + \cos^2\theta\sin^2\theta}\\
&= \sqrt{\sin^2\theta(\sin^2\theta + \cos^2\theta)}\\
&= \lvert\sin\theta\rvert
\end{align*}\]
<p>Since \(\theta\in[0,\pi]\), we can say \(\lvert\sin\theta\rvert = \sin\theta\). That means \(d\mathcal{S} = \sin\theta\, d\theta d\phi\)! Our final result is the familiar spherical integral:</p>
\[\iint_\mathcal{S} f\, d\mathcal{S} = \int_0^{2\pi}\int_0^\pi f(\theta,\phi) \sin\theta\, d\theta d\phi\]
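As a numerical cross-check (a sketch, not from the post), finite differences of the parameterization \(\Phi\) confirm that the magnitude of the cross product of the two tangent vectors equals \(\sin\theta\):

```python
import numpy as np

def Phi(theta, phi):
    # The spherical parameterization of the unit sphere from the post.
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def area_factor(theta, phi, h=1e-6):
    # Central differences approximate the two tangent vectors.
    dt = (Phi(theta + h, phi) - Phi(theta - h, phi)) / (2 * h)
    dp = (Phi(theta, phi + h) - Phi(theta, phi - h)) / (2 * h)
    return np.linalg.norm(np.cross(dt, dp))

for t in (0.3, 1.0, 2.5):
    assert abs(area_factor(t, 0.7) - np.sin(t)) < 1e-5
```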
<p><br /></p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:patch" role="doc-endnote">
<p>Interestingly, if we knew the formula for the area of a spherical patch, we could do some slightly illegal math to derive the area form:</p>
\[\begin{align*} dA &= (\phi_1-\phi_0)(\cos\theta_0-\cos\theta_1)\\
&= ((\phi + d\phi) - \phi)(\cos(\theta)-\cos(\theta+d\theta))\\
&= d\phi d\theta \frac{(\cos(\theta)-\cos(\theta+d\theta))}{d\theta}\\
&= d\phi d\theta \sin\theta \tag{Def. derivative}
\end{align*}\]
<p><a href="#fnref:patch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:solidangle" role="doc-endnote">
<p>In graphics, also known as <em>differential solid angle</em>, \(d\mathbf{\omega} = \sin\theta\, d\theta d\phi\). <a href="#fnref:solidangle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fundamental" role="doc-endnote">
<p>Even more generally, the scale is related to the inner product on the tangent space of the surface. This inner product is defined by the <a href="https://en.wikipedia.org/wiki/First_fundamental_form">first fundamental form</a> \(\mathrm{I}\) of the surface, and the scale factor is \(\sqrt{\det\mathrm{I}}\), regardless of dimension. For surfaces immersed in 3D, this simplifies to the cross product mentioned above. Changes of coordinates that don’t change dimensionality have a more straightforward scale factor: it’s the determinant of their Jacobian.<sup id="fnref:exterior" role="doc-noteref"><a href="#fn:exterior" class="footnote" rel="footnote">4</a></sup> <a href="#fnref:fundamental" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:exterior" role="doc-endnote">
<p>These topics are often more easily understood using the language of exterior calculus, where integrals can be uniformly expressed regardless of dimension. I will write about it at some point. <a href="#fnref:exterior" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Los Cabos
2023-01-01T00:00:00+00:00
https://thenumb.at/Cabo
<p>Los Cabos, BCS, Mexico, 2022</p>
<div style="text-align: center;"><img src="/assets/cabo/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/14.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/15.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/16.webp" /></div>
<div style="text-align: center;"><img src="/assets/cabo/17.webp" /></div>
<p></p>
Los Cabos, BCS, Mexico, 2022
Exploring Neural Graphics Primitives
2022-11-27T00:00:00+00:00
https://thenumb.at/Neural-Graphics-Primitives
<p>Neural fields have quickly become an interesting and useful application of machine learning to computer graphics. The fundamental idea is quite straightforward: we can use neural models to represent all sorts of digital signals, including images, light fields, signed distance functions, and volumes.</p>
<ul>
<li><a href="#neural-networks">Neural Networks</a></li>
<li><a href="#compression">Compression</a></li>
<li><a href="#a-basic-network">A Basic Network</a></li>
<li><a href="#activations">Activations</a></li>
<li><a href="#input-encodings">Input Encodings</a>
<ul>
<li><a href="#positional-encoding">Positional Encoding</a></li>
<li><a href="#instant-neural-graphics-primitives">Instant Neural Graphics Primitives</a></li>
</ul>
</li>
<li><a href="#more-applications">More Applications</a>
<ul>
<li><a href="#neural-sdfs">SDFs</a></li>
<li><a href="#neural-radiance-fields">NeRFs</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
<h1 id="neural-networks">Neural Networks</h1>
<p>This post assumes a basic knowledge of <a href="https://www.youtube.com/watch?v=aircAruvnKk">neural networks</a>, but let’s briefly recap how they work.</p>
<p>A network with \(n\) inputs and \(m\) outputs approximates a function from \(\mathbb{R}^n \mapsto \mathbb{R}^m\) by feeding its input through a sequence of <em>layers</em>. The simplest type of layer is a <em>fully-connected</em> layer, which encodes an affine transform: \(x \mapsto Ax + b\), where \(A\) is the <em>weight</em> matrix and \(b\) is the <em>bias</em> vector. Layers other than the first (input) and last (output) are considered <em>hidden</em> layers, and can have arbitrary dimension. Each hidden layer is followed by applying a non-linear function \(x \mapsto f(x)\), known as the <em>activation</em>. This additional function enables the network to encode non-linear behavior.</p>
<p>For example, a network from \(\mathbb{R}^3 \mapsto \mathbb{R}^3\), including two hidden layers of dimensions 5 and 4, can be visualized as follows. The connections between nodes represent weight matrix entries:</p>
<div class="center"><img class="center" src="/assets/neural-graphics/nn.svg" /></div>
<p>Or equivalently in symbols (with dimensions):</p>
\[\begin{align*}
h_{1(5)} &= f(A_{1(5\times3)}x_{(3)} + b_{1(5)})\\
h_{2(4)} &= f(A_{2(4\times5)}h_{1(5)} + b_{2(4)})\\
y_{(3)} &= A_{3(3\times4)}h_{2(4)} + b_{3(3)}
\end{align*}\]
<p>This network has \(5\cdot3 + 5 + 4\cdot5 + 4 + 3\cdot4 + 3 = 59\) parameters that define its weight matrices and bias vectors. The process of <em>training</em> finds values for these parameters such that the network approximates another function \(g(x)\). Suitable parameters are typically found via stochastic <a href="/Autodiff/#optimization">gradient descent</a>. Training requires a loss function measuring how much \(\text{NN}(x)\) differs from \(g(x)\) for all \(x\) in a <em>training set</em>.</p>
<p>In practice, we might not know \(g\): given a large number of input-output pairs (e.g. images and their descriptions), we simply perform training and hope that the network will faithfully approximate \(g\) for inputs outside of the training set, too.</p>
<p>We’ve only described the most basic kind of neural network: there’s a whole world of possibilities for new layers, activations, loss functions, optimization algorithms, and training paradigms. However, the fundamentals will suffice for this post.</p>
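As a quick sanity check (a sketch, not code from the post), the example 3 → 5 → 4 → 3 network can be written out in a few lines of numpy; the random weights simply stand in for trained values, and the parameter count matches the 59 computed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer dimensions for the example network: 3 -> 5 -> 4 -> 3.
dims = [3, 5, 4, 3]

# Randomly initialized weight matrices and bias vectors (placeholders
# for trained parameters).
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

def relu(x):
    return np.maximum(x, 0.0)

def forward(x):
    # Hidden layers apply the activation; the output layer does not.
    for A, b in zip(weights[:-1], biases[:-1]):
        x = relu(A @ x + b)
    A, b = weights[-1], biases[-1]
    return A @ x + b

n_params = sum(A.size + b.size for A, b in zip(weights, biases))
print(n_params)  # 59
y = forward(np.ones(3))
```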
<h1 id="compression">Compression</h1>
<p>When generalization is the goal, neural networks are typically under-constrained: they include far more parameters than strictly necessary to encode a function mapping the training set to its corresponding results. Through <em>regularization</em>, the training process hopes to use these ‘extra’ degrees of freedom to approximate the behavior of \(g\) outside the training set. However, insufficient regularization makes the network vulnerable to <em>over-fitting</em>—it may become extremely accurate on the training set, yet unable to handle new inputs. For this reason, researchers always evaluate neural networks on a <em>test set</em> of known data points that are excluded from the training process.</p>
<p>However, in this post we’re actually going to optimize <em>for</em> over-fitting! Instead of approximating \(g\) on new inputs, we will <em>only</em> care about reproducing the right results on the training set. This approach allows us to use over-constrained networks for (lossy) <em>data compression</em>.</p>
<p>Let’s say we have some data we want to compress into a neural network. If we can parameterize the input by assigning each value a unique identifier, that gives us a training set. For example, we could parameterize a list of numbers by their index:</p>
\[[1, 1, 2, 5, 15, 52, 203] \mapsto \{ (0,1), (1,1), (2,2), (3,5), (4,15), (5,52), (6,203) \}\]
<p>…and train a network to associate the index \(n\) with the value \(a[n]\). To do so, we can simply define a loss function that measures the squared error of the result: \((\text{NN}(n) - a[n])^2\). We only care that the network produces \(a[n]\) for \(n\) in 0 to 6, so we want to “over-fit” as much as possible.</p>
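A minimal sketch of this setup (hypothetical, not the post's code): the list becomes index-value pairs, and the loss sums squared errors over them. A model that memorizes the list perfectly drives the loss to zero.

```python
# Hypothetical example: turn a list into (index, value) training pairs
# and measure a candidate model's squared-error loss on them.
a = [1, 1, 2, 5, 15, 52, 203]
pairs = list(enumerate(a))

def loss(model):
    return sum((model(n) - y) ** 2 for n, y in pairs)

# Perfect memorization gives zero loss; a constant-zero model does not.
assert loss(lambda n: a[n]) == 0
print(loss(lambda n: 0))  # 44169, the sum of squares of the values
```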
<p>But, where’s the compression? If we make the network itself <em>smaller</em> than the data set—while being able to reproduce it—we can consider the network to be a compressed encoding of the data. For example, consider this photo of a numbat:</p>
<div class="center_container"><img class="result" src="/assets/neural-graphics/numbat.png" style="text-align: center" /><em class="center" style="text-align: center">Perth Zoo</em></div>
<p>The image consists of 512x512 pixels with three channels (r,g,b) each. Hence, we could naively say it contains 786,432 parameters. If a network with fewer parameters can reproduce it, the network itself can be considered a compressed version of the image. More rigorously, each pixel can be encoded in 3 bytes of data, so we’d want to be able to store the network in fewer than 768 kilobytes—but reasoning about size on-disk would require getting into the weeds of mixed precision networks, so let’s only consider parameter count for now.</p>
<h1 id="a-basic-network">A Basic Network</h1>
<p>Let’s train a network to encode the numbat.</p>
<p>To create the data set, we will associate each pixel with its corresponding \(x,y\) coordinate. That means our training set will consist of 262,144 examples of the form \((x,y) \mapsto (r,g,b)\). Before training, we’ll normalize the values such that \(x,y \in [-1,1]\) and \(r,g,b \in [0,1]\).</p>
\[\begin{align*}
(-1.0000,-1.0000) &\mapsto (0.3098, 0.3686, 0.2471)\\
(-1.0000, -0.9961) &\mapsto (0.3059, 0.3686, 0.2471)\\
&\vdots\\
(0.9961, 0.9922) &\mapsto (0.3333, 0.4157, 0.3216)\\
(0.9961, 0.9961) &\mapsto (0.3412, 0.4039, 0.3216)
\end{align*}\]
<p>Clearly, our network will need to be a function from \(\mathbb{R}^2 \mapsto \mathbb{R}^3\), i.e. have two inputs and three outputs. Just going off intuition, we might include three hidden layers of dimension 128. This architecture will only have 33,795 parameters—far fewer than the image itself.</p>
\[\mathbb{R}^2 \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^3\]
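The parameter count is easy to verify directly (a quick check, not from the post): each fully-connected layer contributes a weight matrix plus a bias vector, and the total comes out well under the image's raw value count.

```python
# Parameter count for the 2 -> 128 -> 128 -> 128 -> 3 architecture:
dims = [2, 128, 128, 128, 3]
params = sum(dims[i + 1] * dims[i] + dims[i + 1] for i in range(len(dims) - 1))
print(params)  # 33795

# Compare against the naive value count of the 512x512 RGB image:
pixels = 512 * 512 * 3
print(pixels)           # 786432
print(params / pixels)  # ~0.043, i.e. about 4% of the raw data
```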
<p>Our activation function will be the standard non-linearity ReLU (rectified linear unit). Note that we only want to apply the activation after each hidden layer: we don’t want to clip our inputs or outputs by making them positive.</p>
\[\text{ReLU}(x) = \max\{x, 0\}\]
<div class="diagram2d center"><img class="center" width="250" src="/assets/neural-graphics/relu.svg" /></div>
<p>Finally, our loss function will be the mean squared error between the network output and the expected color:</p>
\[\text{Loss}(x,y) = (\text{NN}(x,y) - \text{Image}_{xy})^2\]
<p>Now, let’s train. We’ll do 100 passes over the full data set (100 <em>epochs</em>) with a batch size of 1024. After training, we’ll need some way to evaluate how well the network encoded our image. Hopefully, the network will now return the proper color given a pixel coordinate, so we can re-generate our image by simply evaluating the network at each pixel coordinate in the training set and normalizing the results.</p>
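The post doesn't show its training code, so here is a heavily scaled-down numpy sketch of the same idea; the 8x8 stand-in image, the single 64-unit hidden layer, full-batch gradient descent, and hand-written backpropagation are all simplifications for brevity, not the author's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in "image": 8x8 RGB, channel values in [0, 1].
side = 8
target = rng.random((side * side, 3))

# Pixel coordinates normalized to [-1, 1], as in the post.
ys, xs = np.divmod(np.arange(side * side), side)
X = np.stack([xs, ys], axis=1) / (side - 1) * 2 - 1

# 2 -> 64 (ReLU) -> 3 (sigmoid), trained with full-batch gradient descent.
W1 = rng.standard_normal((2, 64)) * 0.5
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 3)) * 0.5
b2 = np.zeros(3)

def forward(X):
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)                    # ReLU hidden activation
    o = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output activation
    return z1, h, o

def loss(o):
    return np.mean((o - target) ** 2)          # mean squared error

lr = 1.0
_, _, o = forward(X)
initial = loss(o)
for _ in range(2000):
    z1, h, o = forward(X)
    # Backpropagate the mean-squared-error loss by hand.
    do = 2 * (o - target) / o.size
    dz2 = do * o * (1 - o)                     # sigmoid derivative
    dW2, db2_ = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)              # ReLU derivative
    dW1, db1_ = X.T @ dz1, dz1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1_
    W2 -= lr * dW2; b2 -= lr * db2_

_, _, o = forward(X)
final = loss(o)
assert final < initial  # the network over-fits the tiny image
```

The real experiment replaces full-batch descent with mini-batches of 1024 and a much larger network, but the structure of each step is the same.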
<p>And, what do we get? After a bit of hyperparameter tweaking (learning rate, optimization schedule, epochs), we get…</p>
<div class="center_container">
<canvas class="result" style="width: 100%" id="out_relu_only_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_relu_only_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_relu_only" /></td></tr></table></div>
<p>Well, that sort of worked—you can see the numbat taking shape. But unfortunately, by epoch 100 our loss isn’t consistently decreasing: more training won’t make the result significantly better.</p>
<h1 id="activations">Activations</h1>
<p>So far, we’ve only used the ReLU activation function. Taking a look at the output image, we see a lot of lines: various activations jump from zero to positive across these boundaries. There’s nothing fundamentally wrong with that—given a suitably large network, we could still represent the exact image. However, it’s difficult to recover high-frequency detail by summing functions clipped at zero.</p>
<h2 id="sigmoids">Sigmoids</h2>
<p>Because we know the network should always return values in \([0,1]\), one easy improvement is adding a sigmoid activation to the output layer. Instead of teaching the network to directly output an intensity in \([0,1]\), this will allow the network to compute values in \((-\infty,\infty)\) that are deterministically mapped to a color in \([0,1]\). We’ll apply the logistic function:</p>
\[\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}\]
<div class="diagram2d center"><img class="center" width="250" src="/assets/neural-graphics/sigmoid.svg" /></div>
<p>For all future experiments, we’ll use this function as an output layer activation. Re-training the ReLU network:</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_relu_sigmoid_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_relu_sigmoid_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_relu_sigmoid" /></td></tr></table></div>
<p>The result looks significantly better, but it’s still got a lot of line-like artifacts. What if we just use the sigmoid activation for the other hidden layers, too? Re-training again…</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_sigmoid_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_sigmoid_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_sigmoid" /></td></tr></table></div>
<p>That made it worse. The important observation here is that changing the activation function can have a significant effect on reproduction quality. Several papers have been published comparing different activation functions in this context, so let’s try some of those.</p>
<h2 id="sinusoids">Sinusoids</h2>
<p><em><a href="https://www.vincentsitzmann.com/siren/">Implicit Neural Representations with Periodic Activation Functions</a></em></p>
<p>This paper explores the use of periodic functions (i.e. \(\sin,\cos\)) as activations, finding them well suited for representing signals like images, audio, video, and distance fields. In particular, the authors note that the derivative of a sinusoidal network is also a sinusoidal network. This observation allows them to fit differential constraints in situations where ReLU and TanH-based models entirely fail to converge.</p>
<p>The proposed form of periodic activations is simply \(f(x) = \sin(\omega_0x)\), where \(\omega_0\) is a hyperparameter. Increasing \(\omega_0\) should allow the network to encode higher frequency signals.</p>
<p>Let’s try changing our activations to \(f(x) = \sin(2\pi x)\):</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_sin_no_w_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_sin_no_w_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_sin_no_w" /></td></tr></table></div>
<p>Well, that’s better than the ReLU/Sigmoid networks, but still not very impressive—an unstable training process lost much of the <em>low</em> frequency data. It turns out that sinusoidal networks are quite sensitive to how we initialize the weight matrices.</p>
<p>To account for the extra scaling by \(\omega_0\), the paper proposes initializing weights beyond the first layer using the distribution \(\mathcal{U}\left(-\frac{1}{\omega_0}\sqrt{\frac{6}{\text{fan\_in}}}, \frac{1}{\omega_0}\sqrt{\frac{6}{\text{fan\_in}}}\right)\). Retraining with proper weight initialization:</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_sin_w_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_sin_w_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_sin_w" /></td></tr></table></div>
<p>Now we’re getting somewhere—the output is quite recognizable. However, we’re still missing a lot of high frequency detail, and increasing \(\omega_0\) much more starts to make training unstable.</p>
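<p>In code, the activation and the deeper-layer initialization look something like this (a NumPy sketch with hypothetical helper names, not the paper's reference implementation):</p>

```python
import numpy as np

def siren_weight_init(fan_in, fan_out, omega_0, rng=None):
    """Weights for a sinusoidal layer beyond the first, drawn from
    U(-sqrt(6/fan_in)/omega_0, sqrt(6/fan_in)/omega_0)."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / fan_in) / omega_0
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

def sine_activation(x, omega_0=2.0 * np.pi):
    # f(x) = sin(omega_0 * x); larger omega_0 encodes higher frequencies.
    return np.sin(omega_0 * x)
```

<p>The paper additionally prescribes a separate scheme for the first layer's weights, which is omitted here.</p>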
<h2 id="gaussians">Gaussians</h2>
<p><em><a href="https://arxiv.org/pdf/2111.15135.pdf">Beyond Periodicity: Towards a Unifying Framework for Activations in Coordinate-MLPs</a></em></p>
<p>This paper analyzes a variety of new activation functions, one of which is particularly suited for image reconstruction. It’s a gaussian of the form \(f(x) = e^{\frac{-x^2}{\sigma^2}}\), where \(\sigma\) is a hyperparameter. Here, a smaller \(\sigma\) corresponds to a higher bandwidth.</p>
<p>The authors demonstrate good results using gaussian activations: reproduction as good as sinusoids, but without dependence on weight initialization or special input encodings (which we will discuss in the next section).</p>
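<p>As a sketch, with \(\sigma\) exposed as a parameter (\(\sigma = 0.5\) gives the \(e^{-4x^2}\) variant tried below):</p>

```python
import numpy as np

def gaussian_activation(x, sigma=0.5):
    # f(x) = exp(-x^2 / sigma^2); smaller sigma means higher bandwidth.
    return np.exp(-(x * x) / (sigma * sigma))
```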
<p>So, let’s train our network using \(f(x) = e^{-4x^2}\):</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_gauss_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_gauss_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_gauss" /></td></tr></table></div>
<p>Well, that’s better than the initial sinusoidal network, but worse than the properly initialized one. However, something interesting happens if we scale up our input coordinates from \([-1,1]\) to \([-16,16]\):</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_gauss_16_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_gauss_16_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_gauss_16" /></td></tr></table></div>
<p>The center of the image is now reproduced almost perfectly, but the exterior still lacks detail. It seems that while gaussian activations are more robust to initialization, the distribution of inputs can still have a significant effect.</p>
<h2 id="beyond-the-training-set">Beyond the Training Set</h2>
<p>Although we’re only measuring how well each network reproduces the image, we might wonder what happens outside the training set. Luckily, our network is still just a function \(\mathbb{R}^2\mapsto\mathbb{R}^3\), so we can evaluate it on a larger range of inputs to produce an “out of bounds” image.</p>
<p>Another purported benefit of using gaussian activations is sensible behavior outside of the training set, so let’s compare. Evaluating each network on \([-2,2]\times[-2,2]\):</p>
<table>
<tr>
<td style="width: 33%">
<img class="result" src="/assets/neural-graphics/out_relu_sigmoid/oob.png" style="text-align:center" /><p class="center" style="text-align:center">ReLU</p>
</td>
<td style="width: 33%">
<img class="result" src="/assets/neural-graphics/out_sin_w/oob.png" style="text-align:center" /><p class="center" style="text-align:center">Sinusoid</p>
</td>
<td style="width: 33%">
<img class="result" src="/assets/neural-graphics/out_gauss_16/oob.png" style="text-align:center" /><p class="center" style="text-align:center">Gaussian</p>
</td>
</tr>
</table>
<p>That’s pretty cool—the gaussian network ends up representing a sort-of-edge-clamped extension of the image. The sinusoid turns into a low-frequency soup of common colors. The ReLU extension has little to do with the image content and, based on a few repeated training runs, is very unstable.</p>
<h1 id="input-encodings">Input Encodings</h1>
<p>So far, our results are pretty cool, but not high fidelity enough to compete with the original image. Luckily, we can go further: recent research work on high-quality neural primitives has relied on not only better activation functions and training schemes, but new <em>input encodings</em>.</p>
<p>When designing a network, an input encoding is essentially just a fixed-function initial layer that performs some interesting transformation before the fully-connected layers take over. It turns out that choosing a good initial transform can make a <em>big</em> difference.</p>
<h2 id="positional-encoding">Positional Encoding</h2>
<p>The current go-to encoding is known as <em>positional encoding</em>, or otherwise <em>positional embedding</em>, or even sometimes <em>fourier features</em>. This encoding was first used in a graphics context by the original <a href="https://arxiv.org/pdf/2003.08934.pdf">neural radiance fields paper</a>. It takes the following form:</p>
\[x \mapsto \left[x, \sin(2^0\pi x), \cos(2^0\pi x), \dots, \sin(2^L\pi x), \cos(2^L\pi x) \right]\]
<p>Where \(L\) is a hyperparameter controlling bandwidth (higher \(L\), higher frequency signals). The unmodified input \(x\) may or may not be included in the output.</p>
<p>When using this encoding, we transform our inputs by a hierarchy of sinusoids of increasing frequency. This brings to mind the fourier transform, hence the “fourier features” moniker, but note that we are <strong>not</strong> using the actual fourier transform of the training signal.</p>
<p>So, let’s try adding a positional encoding to our network with \(L=6\). Our new network will have the following architecture:</p>
\[\mathbb{R}^2 \mapsto_{enc6} \mathbb{R}^{30} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^3\]
<p>Note that the sinusoids are applied element-wise to our two-dimensional \(x\), so we end up with \(30\) input dimensions. This means the first fully-connected layer will have \(30\cdot128\) weights instead of \(2\cdot128\), so we are adding some parameters. However, the additional weights don’t make a huge difference in and of themselves.</p>
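<p>A sketch of the encoding, assuming the raw input is kept and frequency bands \(2^0\) through \(2^L\) are all used (the convention that makes a two-dimensional input with \(L=6\) come out to 30 dimensions):</p>

```python
import numpy as np

def positional_encoding(x, L=6):
    """Map each coordinate of x to [x, sin(2^0 pi x), cos(2^0 pi x),
    ..., sin(2^L pi x), cos(2^L pi x)]."""
    x = np.asarray(x, dtype=np.float64)
    features = [x]
    for level in range(L + 1):
        features.append(np.sin((2.0 ** level) * np.pi * x))
        features.append(np.cos((2.0 ** level) * np.pi * x))
    return np.concatenate(features)

# d = 2 inputs with L = 6: 2 + 2*2*(6+1) = 30 output dimensions.
```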
<p>The ReLU network improves <em>dramatically</em>:</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_relu_enc6_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_relu_enc6_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_relu_enc6" /></td></tr></table></div>
<p>The gaussian network also improves significantly, now only lacking some high frequency detail:</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_gauss_enc6_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_gauss_enc6_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_gauss_enc6" /></td></tr></table></div>
<p>Finally, the sinusoidal network… fails to converge! Reducing \(\omega_0\) to \(\pi\) lets training succeed, and produces the best result so far. However, the unstable training behavior is another indication that sinusoidal networks are particularly temperamental.</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_sin_enc6_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_sin_enc6_epoch">100</span></td><td><input type="range" min="0" max="20" value="20" class="slider" id="out_sin_enc6" /></td></tr></table></div>
<p>As a point of reference, saving this model’s parameters in half precision (which does not degrade quality) requires only 76kb on disk. That’s smaller than a JPEG encoding of the image at ~111kb, though not quite as accurate. Unfortunately, it’s still missing some fine detail.</p>
<h2 id="instant-neural-graphics-primitives">Instant Neural Graphics Primitives</h2>
<p><em><a href="https://nvlabs.github.io/instant-ngp/">Instant Neural Graphics Primitives with a Multiresolution Hash Encoding</a></em></p>
<p>This paper made a splash when it was released early this year and eventually won a best paper award at SIGGRAPH 2022. It proposes a novel input encoding based on a learned multi-resolution hashing scheme. By combining their encoding with a fully-fused (i.e. single-shader) training and evaluation implementation, the authors are able to <strong>train</strong> networks representing gigapixel-scale images, SDF geometry, volumes, and radiance fields in near real-time.</p>
<p>In this post, we’re only interested in the encoding, though I wouldn’t say no to instant training either.</p>
<h3 id="the-multiresolution-grid">The Multiresolution Grid</h3>
<p>We begin by breaking up the domain into several square grids of increasing resolution. This hierarchy can be defined in any number of dimensions \(d\). Here, our domain will be two-dimensional:</p>
<p><img class="diagram2d center" src="/assets/neural-graphics/multigrid.svg" /></p>
<p>Note that the green and red grids also tile the full image; their full extent is omitted for clarity.</p>
<p>The resolution of each grid is a constant multiple of the previous level, but note that the growth factor does not have to be two. In fact, the authors derive the scale after choosing the following three hyperparameters:</p>
\[\begin{align*}
L &:= \text{Number of Levels} \\
N_{\min} &:= \text{Coarsest Resolution} \\
N_{\max} &:= \text{Finest Resolution}
\end{align*}\]
<p>And compute the growth factor via:</p>
\[b := \exp\left(\frac{\ln N_{\max} - \ln N_{\min}}{L - 1}\right)\]
<p>Therefore, the resolution of grid level \(\ell\) is \(N_\ell := \lfloor N_{\min} b^\ell \rfloor\).</p>
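<p>In code, the growth factor and per-level resolutions are a direct transcription of the formulas above:</p>

```python
import math

def grid_resolutions(num_levels=16, n_min=16, n_max=256):
    """Resolutions N_l = floor(n_min * b^l) for l = 0..L-1, where the
    growth factor b interpolates geometrically from N_min to N_max."""
    b = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * b ** level) for level in range(num_levels)]
```

<p>With \(L = 16\), \(N_{\min} = 16\), \(N_{\max} = 256\) (the settings used later in this post), the levels step geometrically from \(N_{\min}\) up to \(N_{\max}\), modulo flooring at the top level.</p>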
<h3 id="step-1---find-cells">Step 1 - Find Cells</h3>
<p>After choosing a hierarchy of grids, we can define the input transformation. We will assume the input \(\mathbf{x}\) is given in \([0,1]\).</p>
<p>For each level \(\ell\), first scale \(\mathbf{x}\) by \(N_\ell\) and round up/down to find the cell containing \(\mathbf{x}\). For example, when \(N_\ell = 2\):</p>
<p><img class="diagram2d center" src="/assets/neural-graphics/coordinate.svg" /></p>
<p>Given the bounds of the cell containing \(\mathbf{x}\), we can then compute the integer coordinates of each corner of the cell. In this example, we end up with four vertices: \([0,0], [0,1], [1,0], [1,1]\).</p>
<h3 id="step-2---hash-coordinates">Step 2 - Hash Coordinates</h3>
<p>We then hash each corner coordinate. The authors use the following function:</p>
\[h(\mathbf{x}) = \bigoplus_{i=1}^d x_i\pi_i\]
<p>Where \(\oplus\) denotes bit-wise exclusive-or and \(\pi_i\) are large prime numbers. To improve cache coherence, the authors set \(\pi_1 = 1\), as well as \(\pi_2 = 2654435761\) and \(\pi_3 = 805459861\). Because our domain is two-dimensional, we only need the first two coefficients:</p>
\[h(x,y) = x \oplus (2654435761 y)\]
<p>The hash is then used as an index into a hash table. Each grid level has a corresponding hash table described by two more hyperparameters:</p>
\[\begin{align*}
T &:= \text{Hash Table Slots} \\
F &:= \text{Features Per Slot}
\end{align*}\]
<p>To map the hash to a slot in each level’s hash table, we simply modulo by \(T\).</p>
<p><img class="diagram2d_double center" src="/assets/neural-graphics/hashing.svg" /></p>
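<p>The hash and the modulo step fit in a few lines (a Python sketch; the bit mask emulating a native implementation's 32-bit unsigned wraparound is my addition):</p>

```python
def hash_to_slot(x, y, table_size):
    # h(x, y) = x XOR (pi_2 * y), with pi_1 = 1 for cache coherence.
    # The mask keeps the product in 32 bits, since Python integers
    # are arbitrary-precision.
    h = x ^ ((y * 2654435761) & 0xFFFFFFFF)
    return h % table_size

# The four corners of the example cell map to (not necessarily
# distinct) slots in a T = 1024 table:
slots = [hash_to_slot(cx, cy, 1024)
         for cx, cy in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```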
<p>Each slot contains \(F\) <strong>learnable parameters</strong>. During training, we will <a href="/Autodiff/#backward-mode">backpropagate</a> gradients all the way to the hash table entries, dynamically optimizing them to <em>learn</em> a good input encoding.</p>
<p>What do we do about hash collisions? <strong>Nothing</strong>—the training process will automatically find an encoding that is robust to collisions. Unfortunately, this makes the encoding difficult to scale down to very small hash table sizes—eventually, simply too many grid points are assigned to each slot.</p>
<p>Finally, the authors note that training is not particularly sensitive to how the hash table entries are initialized, but settle on an initial distribution \(\mathcal{U}(-0.0001,0.0001)\).</p>
<h3 id="step-3---interpolate">Step 3 - Interpolate</h3>
<p>Once we’ve retrieved the \(2^d\) hash table slots corresponding to the current grid cell, we <em>linearly interpolate</em> their values to define a result at \(\mathbf{x}\) itself. In the two dimensional case, we use bi-linear interpolation: interpolate horizontally twice, then vertically once (or vice versa).</p>
<p><img class="diagram2d center" src="/assets/neural-graphics/bilerp.svg" /></p>
<p>The authors note that interpolating the discrete values is necessary for optimization with gradient descent: it makes the encoding function continuous.</p>
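<p>Steps 1 and 3 can be sketched together as a single function, with a stand-in <code>corner_value</code> in place of the learned hash-table lookup from step 2:</p>

```python
import math

def interp_level(x, y, n_level, corner_value):
    """Locate the grid cell containing (x, y) in [0,1]^2 at resolution
    n_level, then bilinearly interpolate the values at its corners."""
    fx, fy = x * n_level, y * n_level
    x0, y0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - x0, fy - y0              # fractional position in the cell
    # Interpolate horizontally twice, then vertically once.
    bottom = (1 - tx) * corner_value(x0, y0)     + tx * corner_value(x0 + 1, y0)
    top    = (1 - tx) * corner_value(x0, y0 + 1) + tx * corner_value(x0 + 1, y0 + 1)
    return (1 - ty) * bottom + ty * top
```

<p>Since the interpolation weights vary continuously with \(\mathbf{x}\), the encoded value does too, which is exactly the property gradient descent needs.</p>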
<h3 id="step-4---concatenate">Step 4 - Concatenate</h3>
<p>At this point, we have mapped \(\mathbf{x}\) to \(L\) different vectors of length \(F\)—one for each grid level \(\ell\). To combine all of this information, we simply concatenate the vectors to form a single encoding of length \(LF\).</p>
<p>The encoded result is then fed through a simple fully-connected neural network with ReLU activations (i.e. a multilayer perceptron). This network can be quite small: the authors use two hidden layers with dimension 64.</p>
<p><img class="center" style="width: 90%" src="/assets/neural-graphics/ingp.svg" /></p>
<h2 id="results">Results</h2>
<p>Let’s implement the hash encoding and compare it with the earlier methods. The sinusoidal network with positional encoding might be hard to beat, given it uses only ~35k parameters.</p>
<p>We will first choose \(L = 16\), \(N_{\min} = 16\), \(N_{\max} = 256\), \(T = 1024\), and \(F = 2\). Combined with a two hidden layer, dimension 64 MLP, these parameters define a model with 43,395 parameters. On disk, that’s about 90kb.</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_ingp_small_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_ingp_small_epoch">30</span></td><td><input type="range" min="0" max="30" value="30" class="slider" id="out_ingp_small" /></td></tr></table></div>
<p>That’s a good result, but not obviously better than what we saw previously: when restricted to 1024-slot hash tables, the encoding can’t be perfect. However, the training process converges impressively quickly, producing a usable image after only <em>three</em> epochs and fully converging in 20-30. The sinusoidal network took over 80 epochs to converge, so fast convergence alone is a significant upside.</p>
<p>The authors recommend a hash table size of \(2^{14}-2^{24}\): their use cases involve representing large volumetric data sets rather than 512x512 images. If we scale up our hash table to just 4096 entries (a parameter count of ~140k), the result at last becomes difficult to distinguish from the original image. The other techniques would have required even more parameters to achieve this level of quality.</p>
<div class="center_container">
<canvas style="width: 100%" class="result" id="out_ingp_big_canvas"></canvas>
<table><tr><td style="width: 40%">Epochs: <span id="out_ingp_big_epoch">30</span></td><td><input type="range" min="0" max="30" value="30" class="slider" id="out_ingp_big" /></td></tr></table></div>
<p>The larger hash tables result in a ~280kb model on disk, which, while much smaller than the uncompressed image, is over twice as large as an equivalent JPEG. The hash encoding really shines when representing much higher resolution images: at gigapixel scale it handily beats the other methods in reproduction quality, training time, <em>and</em> model size.</p>
<p><em>(These results were compared against the <a href="https://github.com/NVlabs/instant-ngp">reference implementation</a> using the same parameters—the outputs were equivalent, except for my version being many times slower!)</em></p>
<h1 id="more-applications">More Applications</h1>
<p>So far, we’ve only explored ways to represent images. But, as mentioned at the start, neural models can encode any dataset we can express as a function. In fact, current research work primarily focuses on representing <em>geometry</em> and <em>light fields</em>—images are the easy case.</p>
<h2 id="neural-sdfs">Neural SDFs</h2>
<p>One relevant way to represent surfaces is using signed distance functions, or SDFs. At every point \(\mathbf{x}\), an SDF \(f\) computes the distance between \(\mathbf{x}\) and the closest point on the surface. When \(\mathbf{x}\) is inside the surface, \(f\) instead returns the negative distance.</p>
<p>Hence, the surface is defined as the set of points such that \(f(\mathbf{x}) = 0\), also known as the zero level set of \(f\). SDFs can be very simple to <a href="https://iquilezles.org/articles/distfunctions/">define and combine</a>, and when combined with <a href="https://iquilezles.org/articles/raymarchingdf/">sphere tracing</a>, can create remarkable results in a few lines of code.</p>
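<p>For example, the SDF of a sphere is a one-liner, and combining shapes is just as short (a tiny sketch in the spirit of the articles linked above):</p>

```python
import numpy as np

def sdf_sphere(p, radius):
    # Signed distance to a sphere centered at the origin:
    # positive outside, zero on the surface, negative inside.
    return np.linalg.norm(p) - radius

def sdf_union(d1, d2):
    # The union of two shapes takes the smaller distance.
    return min(d1, d2)
```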
<div class="center" style="width: 100%">
<iframe width="640" height="360" frameborder="0" src="https://www.shadertoy.com/embed/XsyGRW?gui=true&t=10&paused=true&muted=false" class="center box_shadow"></iframe>
<a href="http://hughsk.io/">
<em class="center" style="text-align: center; padding-top: 5px">Hugh Kennedy</em>
</a></div>
<p>Because SDFs are simply functions from position to distance, they’re easy to represent with neural models: all of the techniques we explored for images also apply when representing distance fields.</p>
<p>Traditionally, graphics research and commercial tools have favored <em>explicit</em> geometric representations like triangle meshes—and broadly still do—but the advent of ML models is bringing <em>implicit</em> representations like SDFs back into the spotlight. Luckily, SDFs can be converted into meshes using methods like <a href="https://en.wikipedia.org/wiki/Marching_cubes">marching cubes</a> and <a href="https://www.cse.wustl.edu/~taoju/research/dualContour.pdf">dual contouring</a>, though this introduces discretization error.</p>
<div class="center" style="width: 100%">
<model-viewer alt="Statue Siren" src="https://www.vincentsitzmann.com/siren/img/statue_siren.glb" style="width: 60%; height: 600px; background-color: #404040" exposure=".8" camera-orbit="0deg 75deg 20%" min-field-of-view="50deg" class="center box_shadow" auto-rotate="" camera-controls="" ar-status="not-presenting">
</model-viewer>
<script type="module" src="https://unpkg.com/@google/model-viewer/dist/model-viewer.js"></script>
<em class="center" style="text-align: center; padding-top: 5px"><a href="https://www.vincentsitzmann.com/siren/">
Implicit Neural Representations with Periodic Activation Functions</a>
</em>
</div>
<p>When working with neural representations, one may not even require proper <em>distances</em>—it may be sufficient that \(f(\mathbf{x}) = 0\) describes the surface. Techniques in this broader class are known as <em>level set methods</em>. Another SIGGRAPH 2022 best paper, <a href="https://nmwsharp.com/research/interval-implicits/">Spelunking the Deep</a>, defines relatively efficient closest point, collision, and ray intersection queries on arbitrary neural surfaces.</p>
<h2 id="neural-radiance-fields">Neural Radiance Fields</h2>
<p>In computer vision, another popular use case is modelling <a href="https://en.wikipedia.org/wiki/Light_field">light fields</a>. A light field can be encoded in many ways, but the model proposed in the <a href="https://www.matthewtancik.com/nerf">original NeRF paper</a> maps a 3D position and 2D angular direction to RGB radiance and volumetric density. This function provides enough information to synthesize images of the field using volumetric ray marching. (Basically, trace a ray in small steps, adding in radiance and attenuating based on density.)</p>
\[x, y, z, \theta, \phi \mapsto R, G, B, \sigma\]
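<p>The parenthetical above can be made concrete with a minimal marcher that attenuates a running transmittance at each step (a sketch with a stand-in <code>field</code> function, not the NeRF reference implementation):</p>

```python
import numpy as np

def march_ray(origin, direction, field, num_steps=64, t_near=0.0, t_far=1.0):
    """Accumulate color along a ray through a radiance/density field.

    `field(p, d)` returns (rgb, sigma) at position p viewed along d.
    Each step attenuates the remaining transmittance by exp(-sigma * dt).
    """
    dt = (t_far - t_near) / num_steps
    color = np.zeros(3)
    transmittance = 1.0
    for i in range(num_steps):
        t = t_near + (i + 0.5) * dt
        rgb, sigma = field(origin + t * direction, direction)
        alpha = 1.0 - np.exp(-sigma * dt)     # opacity of this segment
        color += transmittance * alpha * np.asarray(rgb)
        transmittance *= 1.0 - alpha
    return color
```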
<p>Because NeRFs are trained to match position and direction to radiance, one particularly successful use case has been using a set of 2D photographs to learn a 3D representation of a scene. Though NeRFs are not especially conducive to recovering geometry and materials, synthesizing images from entirely new angles is relatively easy—the model defines the full light field.</p>
<div class="center"><img class="center" src="/assets/neural-graphics/nerf.png" />
<em class="center" style="text-align: center; padding-top: 5px"><a href="https://www.matthewtancik.com/nerf">
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis</a>
</em></div>
<p>There has been an explosion of NeRF-related papers over the last two years, many of which choose different parameterizations of the light field. Some even <a href="https://alexyu.net/plenoctrees/">ditch the neural encoding entirely</a>. Using neural models to represent light fields also helped kickstart research on the more general topic of <a href="https://rgl.epfl.ch/publications"><em>differentiable rendering</em></a>, which seeks to separately recover scene parameters like geometry, materials, and lighting by reversing the rendering process via gradient descent.</p>
<h3 id="neural-radiance-cache">Neural Radiance Cache</h3>
<p>In real-time rendering, another exciting application is <a href="https://research.nvidia.com/publication/2021-06_real-time-neural-radiance-caching-path-tracing">neural radiance caching</a>. NRC brings NeRF concepts into the real-time ray-tracing world: a model is trained to encode a radiance field <em>at runtime</em>, dynamically learning from small batches of samples generated every frame. The resulting network is used as an intelligent cache to estimate incoming radiance for secondary rays—without recursive path tracing.</p>
<div class="center"><img class="center" src="/assets/neural-graphics/nrc.jpg" />
<em class="center" style="text-align: center; padding-top: 5px"><a href="https://research.nvidia.com/publication/2021-06_real-time-neural-radiance-caching-path-tracing">
Real-time Neural Radiance Caching for Path Tracing</a>
</em></div>
<h1 id="references">References</h1>
<p>This website attempts to collate all recent literature on neural fields:</p>
<ul>
<li><a href="https://neuralfields.cs.brown.edu/eg22.html">Neural Fields in Visual Computing and Beyond</a></li>
</ul>
<p>Papers referenced in this article:</p>
<ul>
<li><a href="https://www.vincentsitzmann.com/siren/">Implicit Neural Representations with Periodic Activation Functions</a></li>
<li><a href="https://arxiv.org/pdf/2111.15135.pdf">Beyond Periodicity: Towards a Unifying Framework for Activations in Coordinate-MLPs</a></li>
<li><a href="https://www.matthewtancik.com/nerf">NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis</a></li>
<li><a href="https://nvlabs.github.io/instant-ngp/">Instant Neural Graphics Primitives with a Multiresolution Hash Encoding</a></li>
<li><a href="https://nmwsharp.com/research/interval-implicits/">Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis</a></li>
<li><a href="https://alexyu.net/plenoctrees/">PlenOctrees For Real-time Rendering of Neural Radiance Fields</a></li>
<li><a href="https://research.nvidia.com/publication/2021-06_real-time-neural-radiance-caching-path-tracing">Real-time Neural Radiance Caching for Path Tracing</a></li>
</ul>
<p>More particularly cool papers:</p>
<ul>
<li><a href="https://nv-tlabs.github.io/nglod/">Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes</a></li>
<li><a href="https://nv-tlabs.github.io/lip-mlp/">Learning Smooth Neural Functions via Lipschitz Regularization</a></li>
<li><a href="https://ishit.github.io/nie/">A Level Set Theory for Neural Implicit Evolution under Explicit Flows</a></li>
<li><a href="https://arxiv.org/pdf/2210.00124.pdf">Implicit Neural Spatial Representations for Time-dependent PDEs</a></li>
<li><a href="https://neuralfields.cs.brown.edu/paper_297.html">Intrinsic Neural Fields: Learning Functions on Manifolds</a></li>
</ul>
<!-- - [A Structured Dictionary Perspective on Implicit Neural Representations](https://arxiv.org/pdf/2112.01917.pdf) -->
<!-- - [Multiplicative Filter Networks](https://openreview.net/pdf?id=OmtmcPkkhT) -->
<!-- - [Exploring Differential Geometry in Neural Implicits](https://dsilvavinicius.github.io/differential_geometry_in_neural_implicits/assets/novello2022exploring.pdf) -->
<link rel="stylesheet" type="text/css" href="/assets/neural-graphics/style.css" />
<script type="module" src="/assets/neural-graphics/neural-graphics.js"></script>
Suitable parameters are typically found via stochastic gradient descent. Training requires a loss function measuring how much \(\text{NN}(x)\) differs from \(g(x)\) for all \(x\) in a training set. In practice, we might not know \(g\): given a large number of input-output pairs (e.g. images and their descriptions), we simply perform training and hope that the network will faithfully approximate \(g\) for inputs outside of the training set, too. We’ve only described the most basic kind of neural network: there’s a whole world of possibilities for new layers, activations, loss functions, optimization algorithms, and training paradigms. However, the fundamentals will suffice for this post. Compression When generalization is the goal, neural networks are typically under-constrained: they include far more parameters than strictly necessary to encode a function mapping the training set to its corresponding results. Through regularization, the training process hopes to use these ‘extra’ degrees of freedom to approximate the behavior of \(g\) outside the training set. However, insufficient regularization makes the network vulnerable to over-fitting—it may become extremely accurate on the training set, yet unable to handle new inputs. For this reason, researchers always evaluate neural networks on a test set of known data points that are excluded from the training process. However, in this post we’re actually going to optimize for over-fitting! Instead of approximating \(g\) on new inputs, we will only care about reproducing the right results on the training set. This approach allows us to use over-constrained networks for (lossy) data compression. Let’s say we have some data we want to compress into a neural network. If we can parameterize the input by assigning each value a unique identifier, that gives us a training set. 
For example, we could parameterize a list of numbers by their index: \[[1, 1, 2, 5, 15, 52, 203] \mapsto \{ (0,1), (1,1), (2,2), (3,5), (4,15), (5,52), (6,203) \}\] …and train a network to associate the index \(n\) with the value \(a[n]\). To do so, we can simply define a loss function that measures the squared error of the result: \((\text{NN}(n) - a[n])^2\). We only care that the network produces \(a[n]\) for \(n\) in 0 to 6, so we want to “over-fit” as much as possible. But, where’s the compression? If we make the network itself smaller than the data set—while being able to reproduce it—we can consider the network to be a compressed encoding of the data. For example, consider this photo of a numbat: Perth Zoo The image consists of 512x512 pixels with three channels (r,g,b) each. Hence, we could naively say it contains 786,432 parameters. If a network with fewer parameters can reproduce it, the network itself can be considered a compressed version of the image. More rigorously, each pixel can be encoded in 3 bytes of data, so we’d want to be able to store the network in fewer than 768 kilobytes—but reasoning about size on-disk would require getting into the weeds of mixed precision networks, so let’s only consider parameter count for now. A Basic Network Let’s train a network to encode the numbat. To create the data set, we will associate each pixel with its corresponding \(x,y\) coordinate. That means our training set will consist of 262,144 examples of the form \((x,y) \mapsto (r,g,b)\). Before training, we’ll normalize the values such that \(x,y \in [-1,1]\) and \(r,g,b \in [0,1]\). \[\begin{align*} (-1.0000,-1.0000) &\mapsto (0.3098, 0.3686, 0.2471)\\ (-1.0000, -0.9961) &\mapsto (0.3059, 0.3686, 0.2471)\\ &\vdots\\ (0.9961, 0.9922) &\mapsto (0.3333, 0.4157, 0.3216)\\ (0.9961, 0.9961) &\mapsto (0.3412, 0.4039, 0.3216) \end{align*}\] Clearly, our network will need to be a function from \(\mathbb{R}^2 \mapsto \mathbb{R}^3\), i.e. 
have two inputs and three outputs. Just going off intuition, we might include three hidden layers of dimension 128. This architecture will only have 33,795 parameters—far fewer than the image itself.

\[\mathbb{R}^2 \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^3\]

Our activation function will be the standard non-linearity ReLU (rectified linear unit). Note that we only want to apply the activation after each hidden layer: we don’t want to clip our inputs or outputs by making them positive.

\[\text{ReLU}(x) = \max\{x, 0\}\]

Finally, our loss function will be the mean squared error between the network output and the expected color:

\[\text{Loss}(x,y) = (\text{NN}(x,y) - \text{Image}_{xy})^2\]

Now, let’s train. We’ll do 100 passes over the full data set (100 epochs) with a batch size of 1024. After training, we’ll need some way to evaluate how well the network encoded our image. Hopefully, the network will now return the proper color given a pixel coordinate, so we can re-generate our image by simply evaluating the network at each pixel coordinate in the training set and normalizing the results. And, what do we get? After a bit of hyperparameter tweaking (learning rate, optimization schedule, epochs), we get…

Epochs: 100

Well, that sort of worked—you can see the numbat taking shape. But unfortunately, by epoch 100 our loss isn’t consistently decreasing: more training won’t make the result significantly better.

Activations

So far, we’ve only used the ReLU activation function. Taking a look at the output image, we see a lot of lines: various activations jump from zero to positive across these boundaries. There’s nothing fundamentally wrong with that—given a suitably large network, we could still represent the exact image. However, it’s difficult to recover high-frequency detail by summing functions clipped at zero.
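To make the architecture concrete, here is a sketch of such a forward pass in plain Python, shrunk to tiny layer sizes with arbitrary random weights (the real network is 128 wide and its weights are learned; all names here are illustrative). The parameter count of the full architecture also checks out to 33,795:

```python
import random

def relu(x):
    return max(x, 0.0)

def linear(weights, biases, inputs):
    """One fully-connected layer: out_i = sum_j W[i][j] * in_j + b_i."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def mlp(x, y, layers):
    """Forward pass: ReLU after every hidden layer, but not the output."""
    h = [x, y]
    for i, (W, b) in enumerate(layers):
        h = linear(W, b, h)
        if i < len(layers) - 1:  # no activation on the output layer
            h = [relu(v) for v in h]
    return h

# Tiny stand-in for the 2 -> 128 -> 128 -> 128 -> 3 network: 2 -> 4 -> 3.
random.seed(0)
def rand_layer(n_out, n_in):
    W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return (W, [0.0] * n_out)

net = [rand_layer(4, 2), rand_layer(3, 4)]
rgb = mlp(0.5, -0.25, net)  # three outputs, one per color channel

# Parameters of the full architecture: weights + biases per layer.
params = (2 * 128 + 128) + 2 * (128 * 128 + 128) + (128 * 3 + 3)  # 33795
```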
Sigmoids

Because we know the network should always return values in \([0,1]\), one easy improvement is adding a sigmoid activation to the output layer. Instead of teaching the network to directly output an intensity in \([0,1]\), this will allow the network to compute values in \([-\infty,\infty]\) that are deterministically mapped to a color in \([0,1]\). We’ll apply the logistic function:

\[\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}\]

For all future experiments, we’ll use this function as an output layer activation. Re-training the ReLU network:

Epochs: 100

The result looks significantly better, but it’s still got a lot of line-like artifacts. What if we just use the sigmoid activation for the other hidden layers, too? Re-training again…

Epochs: 100

That made it worse. The important observation here is that changing the activation function can have a significant effect on reproduction quality. Several papers have been published comparing different activation functions in this context, so let’s try some of those.

Sinusoids

Implicit Neural Representations with Periodic Activation Functions

This paper explores the usage of periodic functions (i.e. \(\sin,\cos\)) as activations, finding them well suited for representing signals like images, audio, video, and distance fields. In particular, the authors note that the derivative of a sinusoidal network is also a sinusoidal network. This observation allows them to fit differential constraints in situations where ReLU and TanH-based models entirely fail to converge. The proposed form of periodic activations is simply \(f(x) = \sin(\omega_0 x)\), where \(\omega_0\) is a hyperparameter. Increasing \(\omega_0\) should allow the network to encode higher frequency signals. Let’s try changing our activations to \(f(x) = \sin(2\pi x)\):

Epochs: 100

Well, that’s better than the ReLU/Sigmoid networks, but still not very impressive—an unstable training process lost much of the low frequency data.
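The activation functions compared in this post are all one-liners. A sketch in plain Python; the gaussian variant appears later in the post, and its default \(\sigma = 0.5\) here matches the \(e^{-4x^2}\) form trained there:

```python
import math

def sigmoid(x):
    """Logistic output activation: maps all of R into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sine(x, w0=2.0 * math.pi):
    """Periodic activation sin(w0 * x); w0 controls signal bandwidth."""
    return math.sin(w0 * x)

def gaussian(x, sigma=0.5):
    """Gaussian activation exp(-x^2 / sigma^2); smaller sigma, higher bandwidth."""
    return math.exp(-(x * x) / (sigma * sigma))
```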
It turns out that sinusoidal networks are quite sensitive to how we initialize the weight matrices. To account for the extra scaling by \(\omega_0\), the paper proposes initializing weights beyond the first layer using the distribution \(\mathcal{U}\left(-\frac{1}{\omega_0}\sqrt{\frac{6}{\text{fan\_in}}}, \frac{1}{\omega_0}\sqrt{\frac{6}{\text{fan\_in}}}\right)\). Retraining with proper weight initialization:

Epochs: 100

Now we’re getting somewhere—the output is quite recognizable. However, we’re still missing a lot of high frequency detail, and increasing \(\omega_0\) much more starts to make training unstable.

Gaussians

Beyond Periodicity: Towards a Unifying Framework for Activations in Coordinate-MLPs

This paper analyzes a variety of new activation functions, one of which is particularly suited for image reconstruction. It’s a gaussian of the form \(f(x) = e^{\frac{-x^2}{\sigma^2}}\), where \(\sigma\) is a hyperparameter. Here, a smaller \(\sigma\) corresponds to a higher bandwidth. The authors demonstrate good results using gaussian activations: reproduction as good as sinusoids, but without dependence on weight initialization or special input encodings (which we will discuss in the next section). So, let’s train our network using \(f(x) = e^{-4x^2}\):

Epochs: 100

Well, that’s better than the initial sinusoidal network, but worse than the properly initialized one. However, something interesting happens if we scale up our input coordinates from \([-1,1]\) to \([-16,16]\):

Epochs: 100

The center of the image is now reproduced almost perfectly, but the exterior still lacks detail. It seems that while gaussian activations are more robust to initialization, the distribution of inputs can still have a significant effect.

Beyond the Training Set

Although we’re only measuring how well each network reproduces the image, we might wonder what happens outside the training set.
Luckily, our network is still just a function \(\mathbb{R}^2\mapsto\mathbb{R}^3\), so we can evaluate it on a larger range of inputs to produce an “out of bounds” image. Another purported benefit of using gaussian activations is sensible behavior outside of the training set, so let’s compare. Evaluating each network on \([-2,2]\times[-2,2]\):

ReLU Sinusoid Gaussian

That’s pretty cool—the gaussian network ends up representing a sort-of-edge-clamped extension of the image. The sinusoid turns into a low-frequency soup of common colors. The ReLU extension has little to do with the image content, and based on running training a few times, is very unstable.

Input Encodings

So far, our results are pretty cool, but not high fidelity enough to compete with the original image. Luckily, we can go further: recent research work on high-quality neural primitives has relied on not only better activation functions and training schemes, but new input encodings. When designing a network, an input encoding is essentially just a fixed-function initial layer that performs some interesting transformation before the fully-connected layers take over. It turns out that choosing a good initial transform can make a big difference.

Positional Encoding

The current go-to encoding is known as positional encoding, or otherwise positional embedding, or even sometimes fourier features. This encoding was first used in a graphics context by the original neural radiance fields paper. It takes the following form:

\[x \mapsto \left[x, \sin(2^0\pi x), \cos(2^0\pi x), \dots, \sin(2^L\pi x), \cos(2^L\pi x) \right]\]

Where \(L\) is a hyperparameter controlling bandwidth (higher \(L\), higher frequency signals). The unmodified input \(x\) may or may not be included in the output. When using this encoding, we transform our inputs by a hierarchy of sinusoids of increasing frequency.
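A sketch of this encoding in plain Python; with \(L = 6\) and the raw input included, each 2D coordinate expands to 30 dimensions:

```python
import math

def positional_encoding(coords, L=6, include_input=True):
    """Map each coordinate x to [x, sin(2^0 pi x), cos(2^0 pi x), ...,
    sin(2^L pi x), cos(2^L pi x)], applied element-wise."""
    out = []
    for x in coords:
        if include_input:
            out.append(x)
        for k in range(L + 1):
            out.append(math.sin((2 ** k) * math.pi * x))
            out.append(math.cos((2 ** k) * math.pi * x))
    return out

enc = positional_encoding([0.3, -0.7])  # 2 inputs -> 2 * (1 + 2*7) = 30 dims
```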
This brings to mind the fourier transform, hence the “fourier features” moniker, but note that we are not using the actual fourier transform of the training signal. So, let’s try adding a positional encoding to our network with \(L=6\). Our new network will have the following architecture:

\[\mathbb{R}^2 \mapsto_{enc6} \mathbb{R}^{30} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^{128} \mapsto_{fc} \mathbb{R}^3\]

Note that the sinusoids are applied element-wise to our two-dimensional \(x\), so we end up with \(30\) input dimensions. This means the first fully-connected layer will have \(30\cdot128\) weights instead of \(2\cdot128\), so we are adding some parameters. However, the additional weights don’t make a huge difference in and of themselves. The ReLU network improves dramatically:

Epochs: 100

The gaussian network also improves significantly, now only lacking some high frequency detail:

Epochs: 100

Finally, the sinusoidal network… fails to converge! Reducing \(\omega_0\) to \(\pi\) lets training succeed, and produces the best result so far. However, the unstable training behavior is another indication that sinusoidal networks are particularly temperamental.

Epochs: 100

As a point of reference, saving this model’s parameters in half precision (which does not degrade quality) requires only 76kb on disk. That’s smaller than a JPEG encoding of the image at ~111kb, though not quite as accurate. Unfortunately, it’s still missing some fine detail.

Instant Neural Graphics Primitives

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

This paper made a splash when it was released early this year and eventually won a best paper award at SIGGRAPH 2022. It proposes a novel input encoding based on a learned multi-resolution hashing scheme. By combining their encoding with a fully-fused (i.e.
single-shader) training and evaluation implementation, the authors are able to train networks representing gigapixel-scale images, SDF geometry, volumes, and radiance fields in near real-time. In this post, we’re only interested in the encoding, though I wouldn’t say no to instant training either.

The Multiresolution Grid

We begin by breaking up the domain into several square grids of increasing resolution. This hierarchy can be defined in any number of dimensions \(d\). Here, our domain will be two-dimensional:

Note that the green and red grids also tile the full image; their full extent is omitted for clarity.

The resolution of each grid is a constant multiple of the previous level, but note that the growth factor does not have to be two. In fact, the authors derive the scale after choosing the following three hyperparameters:

\[\begin{align*} L &:= \text{Number of Levels} \\ N_{\min} &:= \text{Coarsest Resolution} \\ N_{\max} &:= \text{Finest Resolution} \end{align*}\]

And compute the growth factor via:

\[b := \exp\left(\frac{\ln N_{\max} - \ln N_{\min}}{L - 1}\right)\]

Therefore, the resolution of grid level \(\ell\) is \(N_\ell := \lfloor N_{\min} b^\ell \rfloor\).

Step 1 - Find Cells

After choosing a hierarchy of grids, we can define the input transformation. We will assume the input \(\mathbf{x}\) is given in \([0,1]\). For each level \(\ell\), first scale \(\mathbf{x}\) by \(N_\ell\) and round up/down to find the cell containing \(\mathbf{x}\). For example, when \(N_\ell = 2\):

Given the bounds of the cell containing \(\mathbf{x}\), we can then compute the integer coordinates of each corner of the cell. In this example, we end up with four vertices: \([0,0], [0,1], [1,0], [1,1]\).

Step 2 - Hash Coordinates

We then hash each corner coordinate. The authors use the following function:

\[h(\mathbf{x}) = \bigoplus_{i=1}^d x_i\pi_i\]

Where \(\oplus\) denotes bit-wise exclusive-or and \(\pi_i\) are large prime numbers.
To improve cache coherence, the authors set \(\pi_1 = 1\), as well as \(\pi_2 = 2654435761\) and \(\pi_3 = 805459861\). Because our domain is two-dimensional, we only need the first two coefficients:

\[h(x,y) = x \oplus (2654435761 y)\]

The hash is then used as an index into a hash table. Each grid level has a corresponding hash table described by two more hyperparameters:

\[\begin{align*} T &:= \text{Hash Table Slots} \\ F &:= \text{Features Per Slot} \end{align*}\]

To map the hash to a slot in each level’s hash table, we simply modulo by \(T\). Each slot contains \(F\) learnable parameters. During training, we will backpropagate gradients all the way to the hash table entries, dynamically optimizing them to learn a good input encoding. What do we do about hash collisions? Nothing—the training process will automatically find an encoding that is robust to collisions. Unfortunately, this makes the encoding difficult to scale down to very small hash table sizes—eventually, simply too many grid points are assigned to each slot. Finally, the authors note that training is not particularly sensitive to how the hash table entries are initialized, but settle on an initial distribution \(\mathcal{U}(-0.0001,0.0001)\).

Step 3 - Interpolate

Once we’ve retrieved the \(2^d\) hash table slots corresponding to the current grid cell, we linearly interpolate their values to define a result at \(\mathbf{x}\) itself. In the two dimensional case, we use bi-linear interpolation: interpolate horizontally twice, then vertically once (or vice versa). The authors note that interpolating the discrete values is necessary for optimization with gradient descent: it makes the encoding function continuous.

Step 4 - Concatenate

At this point, we have mapped \(\mathbf{x}\) to \(L\) different vectors of length \(F\)—one for each grid level \(\ell\). To combine all of this information, we simply concatenate the vectors to form a single encoding of length \(LF\).
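Putting the four steps together, here is a minimal sketch of the encoding in plain Python. Randomly initialized tables stand in for the learned parameters, and the hyperparameter values match the ones chosen for the experiments in this post; function names are illustrative:

```python
import math, random

# Hyperparameters (paper notation): levels, resolutions, table size, features.
L_LEVELS, N_MIN, N_MAX = 16, 16, 256
T_SLOTS, F_FEATURES = 1024, 2

# Per-level growth factor b, derived from the chosen resolutions.
b = math.exp((math.log(N_MAX) - math.log(N_MIN)) / (L_LEVELS - 1))

# One hash table per level; entries initialized in U(-1e-4, 1e-4).
random.seed(0)
tables = [[[random.uniform(-1e-4, 1e-4) for _ in range(F_FEATURES)]
           for _ in range(T_SLOTS)] for _ in range(L_LEVELS)]

def spatial_hash(ix, iy):
    """h(x, y) = x xor (2654435761 * y), reduced modulo the table size."""
    return (ix ^ (2654435761 * iy)) % T_SLOTS

def lerp(a, b_, t):
    return [(1 - t) * u + t * v for u, v in zip(a, b_)]

def encode(x, y):
    """x, y in [0, 1] -> concatenated per-level features (length L*F)."""
    features = []
    for level in range(L_LEVELS):
        n = int(N_MIN * b ** level)      # Step 1: grid resolution N_l...
        ix, iy = int(x * n), int(y * n)  # ...and the cell containing (x, y)
        tx, ty = x * n - ix, y * n - iy  # fractional position within the cell
        table = tables[level]            # Step 2: hash the four corners
        c00 = table[spatial_hash(ix, iy)]
        c10 = table[spatial_hash(ix + 1, iy)]
        c01 = table[spatial_hash(ix, iy + 1)]
        c11 = table[spatial_hash(ix + 1, iy + 1)]
        # Step 3: bilinear interpolation (horizontally twice, then vertically)
        features += lerp(lerp(c00, c10, tx), lerp(c01, c11, tx), ty)
    return features                      # Step 4: concatenated encoding

enc = encode(0.4, 0.7)  # len(enc) == L_LEVELS * F_FEATURES == 32
```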
The encoded result is then fed through a simple fully-connected neural network with ReLU activations (i.e. a multilayer perceptron). This network can be quite small: the authors use two hidden layers with dimension 64.

Results

Let’s implement the hash encoding and compare it with the earlier methods. The sinusoidal network with positional encoding might be hard to beat, given it uses only ~35k parameters. We will first choose \(L = 16\), \(N_{\min} = 16\), \(N_{\max} = 256\), \(T = 1024\), and \(F = 2\). Combined with a two hidden layer, dimension 64 MLP, these parameters define a model with 43,395 parameters. On disk, that’s about 90kb.

Epochs: 30

That’s a good result, but not obviously better than we saw previously: when restricted to 1024-slot hash tables, the encoding can’t be perfect. However, the training process converges impressively quickly, producing a usable image after only three epochs and fully converging in 20-30. The sinusoidal network took over 80 epochs to converge, so fast convergence alone is a significant upside. The authors recommend a hash table size between \(2^{14}\) and \(2^{24}\): their use cases involve representing large volumetric data sets rather than 512x512 images. If we scale up our hash table to just 4096 entries (a parameter count of ~140k), the result at last becomes difficult to distinguish from the original image. The other techniques would have required even more parameters to achieve this level of quality.

Epochs: 30

The larger hash maps result in a ~280kb model on disk, which, while much smaller than the uncompressed image, is over twice as large as an equivalent JPEG. The hash encoding really shines when representing much higher resolution images: at gigapixel scale it handily beats the other methods in reproduction quality, training time, and model size. (These results were compared against the reference implementation using the same parameters—the outputs were equivalent, except for my version being many times slower!)
More Applications

So far, we’ve only explored ways to represent images. But, as mentioned at the start, neural models can encode any dataset we can express as a function. In fact, current research work primarily focuses on representing geometry and light fields—images are the easy case.

Neural SDFs

One relevant way to represent surfaces is using signed distance functions, or SDFs. At every point \(\mathbf{x}\), an SDF \(f\) computes the distance between \(\mathbf{x}\) and the closest point on the surface. When \(\mathbf{x}\) is inside the surface, \(f\) instead returns the negative distance. Hence, the surface is defined as the set of points such that \(f(\mathbf{x}) = 0\), also known as the zero level set of \(f\). SDFs can be very simple to define and combine, and when combined with sphere tracing, can create remarkable results in a few lines of code.

Hugh Kennedy

Because SDFs are simply functions from position to distance, they’re easy to represent with neural models: all of the techniques we explored for images also apply when representing distance fields. Traditionally, graphics research and commercial tools have favored explicit geometric representations like triangle meshes—and broadly still do—but the advent of ML models is bringing implicit representations like SDFs back into the spotlight. Luckily, SDFs can be converted into meshes using methods like marching cubes and dual contouring, though this introduces discretization error.

Implicit Neural Representations with Periodic Activation Functions

When working with neural representations, one may not even require proper distances—it may be sufficient that \(f(\mathbf{x}) = 0\) describes the surface. Techniques in this broader class are known as level set methods. Another SIGGRAPH 2022 best paper, Spelunking the Deep, defines relatively efficient closest point, collision, and ray intersection queries on arbitrary neural surfaces.
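The “few lines of code” claim about sphere tracing is easy to demonstrate. A minimal sketch in plain Python, with a made-up scene (a unit sphere) and illustrative parameter choices:

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def union(d1, d2):
    """Combining two SDFs is just a min() of their distances."""
    return min(d1, d2)

def sphere_trace(sdf, origin, direction, max_steps=128, eps=1e-5):
    """March along the ray; the SDF value is always a safe step size."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t      # reached the zero level set
        t += dist
    return None           # missed the surface

t_hit = sphere_trace(sphere_sdf, (-3.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```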
Neural Radiance Fields

In computer vision, another popular use case is modelling light fields. A light field can be encoded in many ways, but the model proposed in the original NeRF paper maps a 3D position and 2D angular direction to RGB radiance and volumetric density. This function provides enough information to synthesize images of the field using volumetric ray marching. (Basically, trace a ray in small steps, adding in radiance and attenuating based on density.)

\[x, y, z, \theta, \phi \mapsto R, G, B, \sigma\]

Because NeRFs are trained to match position and direction to radiance, one particularly successful use case has been using a set of 2D photographs to learn a 3D representation of a scene. Though NeRFs are not especially conducive to recovering geometry and materials, synthesizing images from entirely new angles is relatively easy—the model defines the full light field.

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

There has been an explosion of NeRF-related papers over the last two years, many of which choose different parameterizations of the light field. Some even ditch the neural encoding entirely. Using neural models to represent light fields also helped kickstart research on the more general topic of differentiable rendering, which seeks to separately recover scene parameters like geometry, materials, and lighting by reversing the rendering process via gradient descent.

Neural Radiance Cache

In real-time rendering, another exciting application is neural radiance caching. NRC brings NeRF concepts into the real-time ray-tracing world: a model is trained to encode a radiance field at runtime, dynamically learning from small batches of samples generated every frame. The resulting network is used as an intelligent cache to estimate incoming radiance for secondary rays—without recursive path tracing.
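The ray-marching loop described above (step along the ray, add radiance, attenuate by density) can be sketched in a few lines of plain Python. The constant “fog” field here is made up, standing in for a trained network:

```python
import math

def march(field, origin, direction, t_near, t_far, n_steps=64):
    """Front-to-back compositing: accumulate radiance weighted by
    transmittance, attenuating the ray by the sampled density."""
    dt = (t_far - t_near) / n_steps
    transmittance = 1.0
    color = [0.0, 0.0, 0.0]
    for i in range(n_steps):
        t = t_near + (i + 0.5) * dt
        p = tuple(o + t * d for o, d in zip(origin, direction))
        rgb, sigma = field(p)                # radiance + volumetric density
        alpha = 1.0 - math.exp(-sigma * dt)  # absorption over this segment
        for c in range(3):
            color[c] += transmittance * alpha * rgb[c]
        transmittance *= 1.0 - alpha
    return color

# Hypothetical field: uniform orange fog with density 0.5 everywhere.
fog = lambda p: ((1.0, 0.5, 0.1), 0.5)
c = march(fog, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.0, 4.0)
```

For a constant field, the loop reduces to the closed form \(1 - e^{-\sigma(t_{far} - t_{near})}\) per channel, which makes the sketch easy to sanity-check.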
Real-time Neural Radiance Caching for Path Tracing

References

This website attempts to collate all recent literature on neural fields: Neural Fields in Visual Computing and Beyond

Papers referenced in this article:

Implicit Neural Representations with Periodic Activation Functions
Beyond Periodicity: Towards a Unifying Framework for Activations in Coordinate-MLPs
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
Spelunking the Deep: Guaranteed Queries on General Neural Implicit Surfaces via Range Analysis
PlenOctrees For Real-time Rendering of Neural Radiance Fields
Real-time Neural Radiance Caching for Path Tracing

More particularly cool papers:

Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes
Learning Smooth Neural Functions via Lipschitz Regularization
A Level Set Theory for Neural Implicit Evolution under Explicit Flows
Implicit Neural Spatial Representations for Time-dependent PDEs
Intrinsic Neural Fields: Learning Functions on Manifolds

NYC: Guggenheim2022-11-27T00:00:00+00:002022-11-27T00:00:00+00:00https://thenumb.at/Guggenheim<p>The Guggenheim, New York, NY, 2022</p>
<p>Central Park, New York, NY, 2022</p>
<div style="text-align: center;"><img src="/assets/guggenheim/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/10.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/11.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/12.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/13.webp" /></div>
<div style="text-align: center;"><img src="/assets/guggenheim/14.webp" /></div>
<p></p>The Guggenheim, New York, NY, 2022 Central Park, New York, NY, 2022NYC: The Met2022-11-16T00:00:00+00:002022-11-16T00:00:00+00:00https://thenumb.at/The-Met<p>The Metropolitan Museum of Art, New York, NY, 2022</p>
<p>Central Park, New York, NY, 2022</p>
<!-- <div style="text-align: center;"><img src="/assets/the-met/00.webp"></div> -->
<div style="text-align: center;"><img src="/assets/the-met/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/04.webp" /></div>
<!-- <div style="text-align: center;"><img src="/assets/the-met/05.webp"></div> -->
<div style="text-align: center;"><img src="/assets/the-met/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/the-met/10.webp" /></div>
<p></p>The Metropolitan Museum of Art, New York, NY, 2022 Central Park, New York, NY, 2022Differentiable Programming from Scratch2022-07-31T00:00:00+00:002022-07-31T00:00:00+00:00https://thenumb.at/Autodiff<p><a href="https://en.wikipedia.org/wiki/Differentiable_programming">Differentiable programming</a> has been a hot research topic over the past few years, and not only due to the popularity of machine learning libraries like TensorFlow, PyTorch, and JAX. Many fields apart from machine learning are also finding differentiable programming to be a useful tool for solving many kinds of optimization problems. In computer graphics, differentiable <a href="https://www.youtube.com/watch?v=Tou8or1ed6E">rendering</a>, differentiable <a href="https://physicsbaseddeeplearning.org/diffphys.html">physics</a>, and <a href="https://nvlabs.github.io/instant-ngp/">neural representations</a> are all poised to be important tools going forward.</p>
<p><em>This article received an honorable mention in 3Blue1Brown’s <a href="https://www.youtube.com/watch?v=cDofhN-RJqg">Summer of Math Exposition 2</a>!</em></p>
<ul>
<li><a href="#prerequisites">Prerequisites</a>
<ul>
<li><a href="#differentiation">Differentiation</a></li>
<li><a href="#optimization">Optimization</a></li>
</ul>
</li>
<li><a href="#differentiating-code">Differentiating Code</a>
<ul>
<li><a href="#numerical">Numerical</a></li>
<li><a href="#symbolic">Symbolic</a></li>
</ul>
</li>
<li><a href="#automatic-differentiation">Automatic Differentiation</a>
<ul>
<li><a href="#forward-mode">Forward Mode</a></li>
<li><a href="#backward-mode">Backward Mode</a></li>
</ul>
</li>
<li><a href="#de-blurring-an-image">De-blurring an Image</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
<h1 id="prerequisites">Prerequisites</h1>
<h2 id="differentiation">Differentiation</h2>
<p>It all starts with the definition you learn in calculus class:</p>
\[f^{\prime}(x) = \lim_{h\rightarrow 0} \frac{f(x + h) - f(x)}{h}\]
<p>In other words, the derivative computes how much \(f(x)\) changes when \(x\) is perturbed by an infinitesimal amount. If \(f\) is a one-dimensional function from \(\mathbb{R} \mapsto \mathbb{R}\), the derivative \(f^{\prime}(x)\) returns the slope of the graph of \(f\) at \(x\).</p>
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/slope.svg" /></div>
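<p>Translated directly into code, the limit definition suggests approximating the derivative by picking a small but finite \(h\). A sketch in Python (not yet the article’s method, just the definition made executable):</p>

```python
def derivative(f, x, h=1e-6):
    """Finite-difference approximation of f'(x), straight from the
    limit definition: (f(x + h) - f(x)) / h for a small h."""
    return (f(x + h) - f(x)) / h

slope = derivative(lambda x: x * x, 3.0)  # exact answer is 6
```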
<p>However, there’s another perspective that provides better intuition in higher dimensions. If we think of \(f\) as a map from points in its domain to points in its range, we can think of \(f^{\prime}(x)\) as a map from <em>vectors</em> based at \(x\) to <em>vectors</em> based at \(f(x)\).</p>
<h3 id="one-to-one">One-to-One</h3>
<p>In 1D, this distinction is a bit subtle, as a 1D “vector” is just a single number. Still, evaluating \(f^{\prime}(x)\) shows us how a vector placed at \(x\) is scaled when transformed by \(f\). That’s just the slope of \(f\) at \(x\).</p>
<div class="center" style="width: 100%;"><img class="diagram2d_double center" src="/assets/autodiff/vec1d.svg" /></div>
<h3 id="many-to-one">Many-to-One</h3>
<p>If we consider a function \(g(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}\) (two inputs, one output), this perspective will become clearer. We can differentiate \(g\) with respect to any particular input, known as a partial derivative:</p>
\[g_x(x,y) = \lim_{h\rightarrow0} \frac{g(x+h,y) - g(x,y)}{h}\]
<p>The function \(g_x\) produces the change in \(g\) given a change in \(x\). If we combine it with the partial derivative for \(y\), we get the derivative, or <em>gradient</em>, of \(g\):</p>
\[\nabla g(x,y) = \begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\]
<p>That is, \(\nabla g(x,y)\) tells us how \(g\) changes if we change either \(x\) or \(y\). If we multiply \(\nabla g(x,y)\) with a column vector of differences \(\Delta x,\Delta y\), we’ll get their combined effect on \(g\):</p>
\[\nabla g(x,y) \begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} = \Delta xg_x(x,y) + \Delta yg_y(x,y)\]
<p>It’s tempting to think of the gradient as just another vector. However, it’s often useful to think of the gradient as a <em>higher-order function</em>: \(\nabla g\) is a function that, when evaluated at \(x,y\), gives us <em>another function</em> that transforms <em>vectors</em> based at \(x,y\) to <em>vectors</em> based at \(g(x,y)\).</p>
<p>It just so happens that the function returned by our gradient is linear, so it can be represented as a matrix multiplication.</p>
<!-- $$ \nabla g(x,y) (\Delta x, \Delta y) = \begin{bmatrix}g_x(x,y)\\g_y(x,y)\end{bmatrix}\cdot\begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} $$ -->
<div class="center" style="width: 80%;"><img class="diagram2d_double center" src="/assets/autodiff/vec2d.svg" /></div>
<p>The gradient is typically explained as the “direction of steepest ascent.” Why is that? When we evaluate the gradient at a point \(x,y\) and a vector \(\Delta x, \Delta y\), the result is a change in the output of \(g\). If we <em>maximize</em> the change in \(g\) with respect to the input vector, we’ll get the direction that makes the output increase the fastest.</p>
<p>Since the gradient function is just a product with \(\begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\), the direction \(\begin{bmatrix}\Delta x & \Delta y\end{bmatrix}\) that maximizes the result is easy to find: it’s parallel to \(\begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\). That means the gradient vector is, in fact, the direction of steepest ascent.</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" src="/assets/autodiff/grad.svg" /></div>
<!-- (Aside: the relationship between maximizing the gradient and the direction of steepest ascent holds in arbitrary dimensions/spaces, but in curved spaces, it's not as simple as pulling out the vector.) -->
<h4 id="the-directional-derivative">The Directional Derivative</h4>
<p>Another important term is the <em>directional derivative</em>, which computes the derivative of a function along an arbitrary direction. It is a generalization of the partial derivative, which evaluates the directional derivative along a coordinate axis.</p>
\[D_{\mathbf{v}}f(x)\]
<p>Above, we discovered that our “gradient function” could be expressed as a dot product with the gradient vector. That was, actually, the directional derivative:</p>
\[D_{\mathbf{v}}f(x) = \nabla f(x) \cdot \mathbf{v}\]
<p>Which again illustrates why the gradient vector is the direction of steepest ascent: it is the \(\mathbf{v}\) that maximizes the directional derivative. Note that in curved spaces, the “steepest ascent” definition of the gradient still holds, but the directional derivative becomes more complicated than a dot product.</p>
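<p>A quick numerical sanity check of this identity, using finite-difference partials (a sketch with a made-up \(g\); \(\mathbf{v}\) is assumed to be a unit vector):</p>

```python
def gradient(f, x, y, h=1e-6):
    """Numerical gradient [f_x, f_y] via finite-difference partials."""
    fx = (f(x + h, y) - f(x, y)) / h
    fy = (f(x, y + h) - f(x, y)) / h
    return (fx, fy)

def directional_derivative(f, x, y, v, h=1e-6):
    """D_v f(x, y) = grad f(x, y) . v  (for a unit vector v)."""
    gx, gy = gradient(f, x, y, h)
    return gx * v[0] + gy * v[1]

g = lambda x, y: x * x + 3.0 * y
# gradient at (1, 2) is (2, 3); the steepest ascent direction is parallel to it
d = directional_derivative(g, 1.0, 2.0, (1.0, 0.0))  # along x: equals f_x = 2
```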
<h3 id="many-to-many">Many-to-Many</h3>
<p>For completeness, let’s examine how this perspective extends to vector-valued functions of multiple variables. Consider \(h(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) (two inputs, two outputs).</p>
\[h(x,y) = \begin{bmatrix}h_1(x,y)\\h_2(x,y)\end{bmatrix}\]
<p>We can take the gradient of each part of \(h\):</p>
\[\begin{align*}
\nabla h_1(x,y) &= \begin{bmatrix}h_{1_x}(x,y)& h_{1_y}(x,y)\end{bmatrix} \\
\nabla h_2(x,y) &= \begin{bmatrix}h_{2_x}(x,y)& h_{2_y}(x,y)\end{bmatrix}
\end{align*}\]
<p>What object would represent \(\nabla h\)? We can build a matrix from the gradients of each component, called the <em>Jacobian</em>:</p>
\[\nabla h(x,y) = \begin{bmatrix}h_{1_x}(x,y) & h_{1_y}(x,y)\\ h_{2_x}(x,y) & h_{2_y}(x,y)\end{bmatrix}\]
<p>Above, we argued that the gradient (when evaluated at \(x,y\)) gives us a map from input <em>vectors</em> to output <em>vectors</em>. That remains the case here: the Jacobian is a 2x2 matrix, so it transforms 2D \(\Delta x,\Delta y\) vectors into 2D \(\Delta h_1, \Delta h_2\) vectors.</p>
<div class="center" style="width: 85%;"><img class="diagram2d_double center" src="/assets/autodiff/vec2d2d.svg" /></div>
<p>Adding more dimensions starts to make our functions hard to visualize, but we can always rely on the fact that the derivative tells us how input vectors (i.e. changes in the input) get mapped to output vectors (i.e. changes in the output).</p>
<h3 id="the-chain-rule">The Chain Rule</h3>
<p>The last aspect of differentiation we’ll need to understand is how to differentiate function composition.</p>
\[h(x) = g(f(x)) \implies h^\prime(x) = g^\prime(f(x))\cdot f^\prime(x)\]
<p>We could prove this fact with a bit of real analysis, but the relationship is again easier to understand by thinking of the derivative as higher order function. In this perspective, the chain rule itself is just a function composition.</p>
<p>For example, let’s assume \(h(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\) is composed of \(f(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) and \(g(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\).</p>
<p>In order to translate a \(\Delta \mathbf{x}\) vector to a \(\Delta h\) vector, we can first use \(f^\prime\) to map \(\Delta \mathbf{x}\) to \(\Delta \mathbf{f}\), based at \(\mathbf{x}\). Then we can use \(g^\prime\) to map \(\Delta \mathbf{f}\) to \(\Delta g\), based at \(f(x)\).</p>
<div class="center" style="width: 100%;"><img class="center" style="width: 100%;" src="/assets/autodiff/vec2d2d1d.svg" /></div>
<p>Because our derivatives/gradients/Jacobians are linear functions, we’ve been representing them as scalars/vectors/matrices, respectively. That means we can easily compose them with the typical linear algebraic multiplication rules. Writing out the above example symbolically:</p>
\[\begin{align*} \nabla h(\mathbf{x}) &= \nabla g(f(\mathbf{x}))\cdot \nabla f(\mathbf{x}) \\
&= \begin{bmatrix}g_{x_1}(f(\mathbf{x}))& g_{x_2}(f(\mathbf{x}))\end{bmatrix} \begin{bmatrix}f_{1_{x_1}}(\mathbf{x}) & f_{1_{x_2}}(\mathbf{x})\\ f_{2_{x_1}}(\mathbf{x}) & f_{2_{x_2}}(\mathbf{x})\end{bmatrix} \\
&= \begin{bmatrix}g_{x_1}(f(\mathbf{x}))f_{1_{x_1}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_1}}(\mathbf{x}) & g_{x_1}(f(\mathbf{x}))f_{1_{x_2}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_2}}(\mathbf{x})\end{bmatrix}
\end{align*}\]
<p>The result is a 2D vector representing a gradient that transforms 2D vectors to 1D vectors. The composed function \(h\) had two inputs and one output, so that’s correct. We can also notice that each term corresponds to the chain rule applied to a different computational path from a component of \(\mathbf{x}\) to \(h\).</p>
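To make the matrix product concrete, here's a numeric check with illustrative functions \(f(x_1, x_2) = (x_1 x_2,\ x_1 + x_2)\) and \(g(y_1, y_2) = y_1 y_2\), whose Jacobians we can write by hand (all names here are for illustration):

```javascript
const f = ([x1, x2]) => [x1 * x2, x1 + x2];
const grad_f = ([x1, x2]) => [[x2, x1], [1, 1]];   // 2x2 Jacobian of f
const grad_g = ([y1, y2]) => [y2, y1];             // 1x2 gradient of g

// Chain rule: the gradient of h = g(f(x)) is the row vector
// grad_g(f(x)) times the matrix grad_f(x).
function grad_h(x) {
    const g = grad_g(f(x));
    const J = grad_f(x);
    return [g[0] * J[0][0] + g[1] * J[1][0],
            g[0] * J[0][1] + g[1] * J[1][1]];
}

// h(x1, x2) = x1^2*x2 + x1*x2^2, so grad_h([2, 3]) = [21, 16]
```

Differentiating \(h\) by hand gives \(h_{x_1} = 2x_1x_2 + x_2^2\) and \(h_{x_2} = x_1^2 + 2x_1x_2\), matching the composed result at \((2,3)\).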
<h2 id="optimization">Optimization</h2>
<p>We will focus on the application of differentiation to optimization via gradient descent, which is often used in machine learning and computer graphics. An optimization problem always involves computing the following expression:</p>
\[\underset{\mathbf{x}}{\arg\!\min} f(\mathbf{x})\]
<p>Which simply means “find the \(\mathbf{x}\) that results in the smallest possible value of \(f\).” The function \(f\), typically scalar-valued, is traditionally called an “energy,” or in machine learning, a “loss function.” Extra constraints are often enforced to limit the valid options for \(\mathbf{x}\), but we will disregard constrained optimization for now.</p>
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/argmin.svg" /></div>
<p>One way to solve an optimization problem is to iteratively follow the gradient of \(f\) “downhill.” This algorithm is known as gradient descent:</p>
<ul>
<li>Pick an initial guess \(\mathbf{\bar{x}}\).</li>
<li>Repeat:
<ul>
<li>Compute the gradient \(\nabla f(\mathbf{\bar{x}})\).</li>
<li>Step along the gradient: \(\mathbf{\bar{x}} \leftarrow \mathbf{\bar{x}} - \tau\nabla f(\mathbf{\bar{x}})\).</li>
</ul>
</li>
<li>while \(\|\nabla f(\mathbf{\bar{x}})\| > \epsilon\).</li>
</ul>
<p>Given some starting point \(\mathbf{x}\), computing \(\nabla f(\mathbf{x})\) will give us the direction from \(\mathbf{x}\) that would increase \(f\) the fastest. Hence, if we move our point \(\mathbf{x}\) a small distance \(\tau\) along the negated gradient, we will decrease the value of \(f\). The number \(\tau\) is known as the step size (or in ML, the learning rate). By iterating this process, we will eventually find an \(\mathbf{x}\) such that \(\nabla f(\mathbf{x}) \simeq 0\), which is hopefully the minimizer.</p>
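The loop above can be sketched in a few lines of JavaScript. This is a minimal one-dimensional version, with an illustrative gradient function supplied by the caller; for vector inputs, \(\mathbf{x}\) and the gradient would become arrays:

```javascript
// Gradient descent sketch: repeatedly step against the gradient until
// its magnitude falls below epsilon.
function gradient_descent(grad_f, x0, tau, epsilon) {
    let x = x0;
    let g = grad_f(x);
    while (Math.abs(g) > epsilon) {
        x = x - tau * g;   // step along the negated gradient
        g = grad_f(x);
    }
    return x;
}

// Example: f(x) = (x - 3)^2 has gradient 2(x - 3), minimized at x = 3.
let xmin = gradient_descent(x => 2 * (x - 3), 0, 0.1, 1e-6);
```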
<div class="center" style="width: 80%;"><img class="diagram2d center" src="/assets/autodiff/descent.svg" /></div>
<p>This description of gradient descent makes optimization sound easy, but in reality there is a lot that can go wrong. When gradient descent terminates, the result is only required to be a critical point of \(f\), i.e. somewhere \(f\) becomes flat. That means we could wind up at a maximum (unlikely), a saddle point (possible), or a <em>local</em> minimum (likely). At a local minimum, moving \(\mathbf{x}\) in any direction would increase the value of \(f(\mathbf{x})\), but \(f(\mathbf{x})\) is <strong>not</strong> necessarily the minimum value \(f\) can take on globally.</p>
<table class="labeltable" style="width: 100%;"><tr>
<td style="width: 29%;"><img class="center" src="/assets/autodiff/maximum.svg" /></td>
<td style="width: 32%; padding-left: 5px;"><img class="center" src="/assets/autodiff/saddle.svg" /></td>
<td style="width: 39%;"><img class="center" src="/assets/autodiff/localmin.svg" /></td>
</tr><td style="text-align: center;">Maximum</td><td style="text-align: center;">Saddle Point</td><td style="text-align: center;">Local Minimum</td></table>
<p>Gradient descent can also diverge (i.e. never terminate) if \(\tau\) is too large. Because the gradient is only a linear approximation of \(f\), if we step too far along it, we might skip over changes in \(f\)’s behavior—or even end up <em>increasing</em> both \(f\) and \(\|\nabla f\|\). On the other hand, the smaller we make \(\tau\), the longer our algorithm takes to converge. Note that we’re assuming \(f\) has a lower bound and achieves it at a finite \(\mathbf{x}\) in the first place.</p>
<table class="labeltable" style="width: 100%;"><tr>
<td style="width: 50%;"><img class="diagram2d center" src="/assets/autodiff/diverge.svg" /></td>
<td style="width: 50%;"><img class="diagram2d center" src="/assets/autodiff/converge_slow.svg" /></td>
</tr><td style="text-align: center;">Divergence</td><td style="text-align: center;">Slow Convergence</td></table>
<p>The algorithm presented here is the most basic form of gradient descent: much research has been dedicated to devising loss functions and descent algorithms that have higher likelihoods of converging to reasonable results. The practice of adding constraints, loss function terms, and update rules is known as <em>regularization</em>. In fact, optimization is a whole field in and of itself: if you’d like to learn more, there’s a vast amount of literature to refer to, especially within machine learning. <a href="https://distill.pub/2017/momentum/">This interactive article</a> explaining <em>momentum</em> is a great example.</p>
<h1 id="differentiating-code">Differentiating Code</h1>
<p>Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider <em>pure</em> functions, i.e. functions whose outputs depend solely on their parameters (no state).</p>
<p>If your program implements a relatively simple mathematical expression, it’s not too difficult to manually write another function that evaluates its derivative. However, what if your program is a deep neural network, or a physics simulation? It’s not feasible to differentiate something like that by hand, so we must turn to algorithms for automatic differentiation.</p>
<p>There are several techniques for differentiating programs. We will first look at numeric and symbolic differentiation, both of which have been in use as long as computers have existed. However, these approaches are distinct from the algorithm we now know as <em>autodiff</em>, which we will discuss later.</p>
<h2 id="numerical">Numerical</h2>
<p>Numerical differentiation is the most straightforward technique: it simply approximates the definition of the derivative.</p>
\[f^\prime(x) = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h} \simeq \frac{f(x+0.001)-f(x)}{0.001}\]
<p>By choosing a small \(h\), all we have to do is evaluate \(f\) at \(x\) and \(x+h\). This technique is also known as differentiation via <em>finite differences</em>.</p>
<p>Implementing numeric differentiation as a higher order function is quite easy. It doesn’t even require modifying the function to differentiate:</p>
<pre id="numerical-example" class="editor">
function numerical_diff(f, h) {
    return function (x) {
        return (f(x + h) - f(x)) / h;
    }
}
let df = numerical_diff(f, 0.001);
</pre>
<p>You can edit the following Javascript example, where \(f\) is drawn in blue and numerical_diff(\(f\), 0.001) is drawn in purple. Note that using control flow is not a problem:</p>
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="numerical-editor" class="editor"></div>
<button id="numerical-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="numerical-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
<p>Unfortunately, finite differences have a big problem: they only compute the derivative of \(f\) in one direction. If our input is very high dimensional, computing the full gradient of \(f\) becomes computationally infeasible, as we would have to evaluate \(f\) for each dimension separately.</p>
<p>That said, if you only need to compute one directional derivative of \(f\), the full gradient is overkill: instead, compute a finite difference between \(f(\mathbf{x})\) and \(f(\mathbf{x} + \Delta\mathbf{x})\), where \(\Delta\mathbf{x}\) is a small step in your direction of interest.</p>
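A sketch of that idea, again only two evaluations of \(f\) regardless of dimension (the helper name and sample function are illustrative):

```javascript
// Finite-difference directional derivative: step a small distance h
// along direction v from point x, and difference the two samples.
function directional_diff(f, x, v, h) {
    const stepped = x.map((xi, i) => xi + h * v[i]);
    return (f(stepped) - f(x)) / h;
}

// f(x, y) = xy at (2, 3), along (1, 0): recovers the partial df/dx = y = 3.
const fxy = p => p[0] * p[1];
const slope = directional_diff(fxy, [2, 3], [1, 0], 1e-5);
```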
<p>Finally, always remember that numerical differentiation is only an approximation: we aren’t computing the actual limit as \(h \rightarrow 0\). While finite differences are quite easy to implement and can be very useful for validating other results, the technique should usually be superseded by another approach.</p>
<h2 id="symbolic">Symbolic</h2>
<p>Symbolic differentiation involves transforming a representation of \(f\) into a representation of \(f^\prime\). Unlike numerical differentiation, this requires specifying \(f\) in a domain-specific language where each syntactic construct has a known differentiation rule.</p>
<p>However, that limitation isn’t so bad—we can create a compiler that differentiates expressions in our symbolic language for us. This is the technique used in computer algebra packages like Mathematica.</p>
<p>For example, we could create a simple language of polynomials that is symbolically differentiable using the following set of recursive rules:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(n) -> 0
d(x) -> 1
d(Add(a, b)) -> Add(d(a), d(b))
d(Times(a, b)) -> Add(Times(d(a), b), Times(a, d(b)))
</code></pre></div></div>
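These rules translate directly into code. Below is a minimal sketch over an illustrative AST—numbers, the string <code class="language-plaintext highlighter-rouge">'x'</code>, or <code class="language-plaintext highlighter-rouge">['Add'|'Times', a, b]</code> arrays—which is not necessarily the representation the editor below uses:

```javascript
// Recursive symbolic differentiation over a tiny expression language.
function d(expr) {
    if (typeof expr === 'number') return 0;      // d(n) -> 0
    if (expr === 'x') return 1;                  // d(x) -> 1
    const [op, a, b] = expr;
    if (op === 'Add')                            // sum rule
        return ['Add', d(a), d(b)];
    if (op === 'Times')                          // product rule
        return ['Add', ['Times', d(a), b], ['Times', a, d(b)]];
}

// d(x * x) -> (1 * x) + (x * 1), i.e. 2x before simplification
const dxx = d(['Times', 'x', 'x']);
```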
<table class="codetable" style="margin-top: 15px; margin-bottom: 15px;">
<tr>
<td style="width: 45%; padding-right: 10px;">
<div id="symbolic-editor" class="editor"></div>
<button id="symbolic-button" type="button">Differentiate</button>
</td>
<td style="width: 55%;">
<pre id="symbolic-output" class="editor"></pre>
<canvas id="symbolic-canvas" class="diagram_canvas" style="margin-top: 10px;"></canvas>
</td>
</tr>
</table>
<p>If we want our differentiable language to support more operations, we can simply add more differentiation rules. For example, to support trig functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d(sin a) -> Times(d(a), cos a)
d(cos a) -> Times(d(a), Times(-1, sin a))
</code></pre></div></div>
<p>Unfortunately, there’s a catch: the size of \(f^\prime\)’s representation can become very large. Let’s write another recursive relationship that counts the number of terms in an expression:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Terms(n) -> 1
Terms(x) -> 1
Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1
Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1
</code></pre></div></div>
<p>And then prove that <code class="language-plaintext highlighter-rouge">Terms(a) <= Terms(d(a))</code>, i.e. differentiating an expression cannot decrease the number of terms:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Base Cases:
Terms(d(n)) -> 1 | Definition
Terms(n) -> 1 | Definition
=> Terms(n) <= Terms(d(n))
Terms(d(x)) -> 1 | Definition
Terms(x) -> 1 | Definition
=> Terms(x) <= Terms(d(x))
Inductive Case for Add:
Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 | Definition
Terms(d(Add(a, b))) -> Terms(d(a)) + Terms(d(b)) + 1 | Definition
Terms(a) <= Terms(d(a)) | Hypothesis
Terms(b) <= Terms(d(b)) | Hypothesis
=> Terms(Add(a, b)) <= Terms(d(Add(a, b)))
Inductive Case for Times:
Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 | Definition
Terms(d(Times(a, b))) -> Terms(a) + Terms(b) + 3 +
                         Terms(d(a)) + Terms(d(b)) | Definition
=> Terms(Times(a, b)) <= Terms(d(Times(a, b)))
</code></pre></div></div>
<p>This result might be acceptable if the size of <code class="language-plaintext highlighter-rouge">df</code> was linear in the size of <code class="language-plaintext highlighter-rouge">f</code>, but that’s not the case. Whenever we differentiate a <code class="language-plaintext highlighter-rouge">Times</code> expression, the number of terms in the result will at least <em>double</em>. That means the size of <code class="language-plaintext highlighter-rouge">df</code> grows <em>exponentially</em> with the number of <code class="language-plaintext highlighter-rouge">Times</code> we compose. You can demonstrate this phenomenon by nesting multiple <code class="language-plaintext highlighter-rouge">Times</code> in the Javascript example.</p>
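You can also measure the growth directly. Here's a sketch over an illustrative AST (numbers, the string <code class="language-plaintext highlighter-rouge">'x'</code>, or <code class="language-plaintext highlighter-rouge">['Add'|'Times', a, b]</code> arrays), restating the differentiation and counting rules as functions:

```javascript
// Differentiation rules from above.
function d(expr) {
    if (typeof expr === 'number') return 0;
    if (expr === 'x') return 1;
    const [op, a, b] = expr;
    if (op === 'Add') return ['Add', d(a), d(b)];
    return ['Add', ['Times', d(a), b], ['Times', a, d(b)]];
}

// Terms rules from above: leaves count 1, operators count themselves
// plus both subtrees.
function terms(expr) {
    if (typeof expr === 'number' || expr === 'x') return 1;
    return terms(expr[1]) + terms(expr[2]) + 1;
}

// Three nested Times: x*x*x*x has 7 terms, but its derivative has 25.
const e = ['Times', ['Times', ['Times', 'x', 'x'], 'x'], 'x'];
```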
<p>Hence, symbolic differentiation is usually infeasible at the scales we’re interested in. However, if it works for your use case, it can be quite useful.</p>
<h1 id="automatic-differentiation">Automatic Differentiation</h1>
<p>We’re finally ready to discuss the automatic differentiation algorithm actually used in modern differentiable programming: autodiff! There are two flavors of autodiff, each named for the direction in which it computes derivatives.</p>
<h2 id="forward-mode">Forward Mode</h2>
<p>Forward mode autodiff improves on our two older techniques by computing exact derivatives without building a potentially exponentially-large representation of \(f^\prime\). It is based on the mathematical definition of <em>dual numbers</em>.</p>
<h3 id="dual-numbers">Dual Numbers</h3>
<p>Dual numbers are a bit like complex numbers: they’re defined by <a href="https://en.wikipedia.org/wiki/Field_extension">adjoining</a> a new quantity \(\epsilon\) to the reals. But unlike complex numbers where \(i^2 = -1\), dual numbers use \(\epsilon^2 = 0\).</p>
<p>In particular, we can use the \(\epsilon\) part of a dual number to represent the derivative of the scalar part. If we replace each variable \(x\) with \(x + x^\prime\epsilon\), we will find that dual arithmetic naturally expresses how derivatives combine:</p>
<p>Addition:</p>
\[(x + x^\prime\epsilon) + (y + y^\prime\epsilon) = (x + y) + (x^\prime + y^\prime)\epsilon\]
<p>Multiplication:</p>
\[\begin{align*} (x + x^\prime\epsilon) * (y + y^\prime\epsilon) &= xy + xy^\prime\epsilon + x^\prime y\epsilon + x^\prime y^\prime\epsilon^2 \\
&= xy + (x^\prime y + xy^\prime)\epsilon \end{align*}\]
<p>Division:</p>
\[\begin{align*} \frac{x + x^\prime\epsilon}{y + y^\prime\epsilon} &= \frac{\frac{x}{y}+\frac{x^\prime}{y}\epsilon}{1+\frac{y^\prime}{y}\epsilon} \\
&= \left(\frac{x}{y}+\frac{x^\prime}{y}\epsilon\right)\left(1-\frac{y^\prime}{y}\epsilon\right) \\
&= \frac{x}{y} + \frac{x^\prime y - xy^\prime}{y^2}\epsilon
\end{align*}\]
<p>The chain rule also works: \(f(x + x^\prime\epsilon) = f(x) + f^\prime(x)x^\prime\epsilon\) for any smooth function \(f\). To prove this fact, let us first show that the property holds for positive integer exponentiation.</p>
<p>Base case:
\((x+x^\prime\epsilon)^1 = x^1 + 1x^0x^\prime\epsilon\)</p>
<p>Hypothesis:
\((x+x^\prime\epsilon)^n = x^n + nx^{n-1}x^\prime\epsilon\)</p>
<p>Induct:</p>
\[\begin{align*}
(x+x^\prime\epsilon)^{n+1} &= (x^n + nx^{n-1}x^\prime\epsilon)(x+x^\prime\epsilon) \tag{Hypothesis}\\
&= x^{n+1} + x^nx^\prime\epsilon + nx^nx^\prime\epsilon + nx^{n-1}x^{\prime^2}\epsilon^2\\
&= x^{n+1} + (n+1)x^nx^\prime\epsilon
\end{align*}\]
<p>We can use this result to prove the same property for any smooth function \(f\). Examining the Taylor expansion of \(f\) at zero (also known as its <em>Maclaurin series</em>):</p>
\[f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(0)x^n}{n!} = f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots\]
<p>By plugging in our dual number…</p>
\[\begin{align*} f(x+x^\prime\epsilon) &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x+x^\prime\epsilon)^2}{2!} + \frac{f^{\prime\prime\prime}(0)(x+x^\prime\epsilon)^3}{3!} + \dots\\
&= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x^2+2xx^\prime\epsilon)}{2!} + \frac{f^{\prime\prime\prime}(0)(x^3+3x^2x^\prime\epsilon)}{3!} + \dots \\
&= f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots \\
&\phantom{= }+ \left(f^\prime(0) + f^{\prime\prime}(0)x + \frac{f^{\prime\prime\prime}(0)x^2}{2!} + \dots \right)x^\prime\epsilon \\
&= f(x) + f^\prime(x)x^\prime\epsilon
\end{align*}\]
<p>…we prove the result! In the last step, we recover the Maclaurin series for both \(f(x)\) and \(f^\prime(x)\).</p>
<h3 id="implementation">Implementation</h3>
<p>Implementing forward-mode autodiff in code can be very straightforward: we just have to replace our <code class="language-plaintext highlighter-rouge">Float</code> type with a <code class="language-plaintext highlighter-rouge">DiffFloat</code> that keeps track of both our value and its dual coefficient. If we then implement the relevant math operations for <code class="language-plaintext highlighter-rouge">DiffFloat</code>, all we have to do is run the program!</p>
<p>Unfortunately, JavaScript does not support operator overloading, so we’ll define a <code class="language-plaintext highlighter-rouge">DiffFloat</code> to be a two-element array and use functions to implement some basic arithmetic operations:</p>
<pre id="forward-lib" class="editor">
function Const(n) {
    return [n, 0];
}
function Add(x, y) {
    return [x[0] + y[0], x[1] + y[1]];
}
function Times(x, y) {
    return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]];
}
</pre>
<p>If we implement our function \(f\) in terms of these primitives, evaluating \(f([x,1])\) will return \([f(x),f^\prime(x)]\)!</p>
<p>This property extends naturally to higher-dimensional functions, too. If \(f\) has multiple outputs, their derivatives pop out in the same way. If \(f\) has inputs other than \(x\), assigning them constants means the result will be the partial derivative \(f_x\).</p>
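For example, here is \(f(x) = x^2 + x\) written against these primitives (the arithmetic functions are repeated so the snippet stands alone). Seeding the dual part with 1 makes the second component track \(f^\prime\):

```javascript
function Add(x, y) {
    return [x[0] + y[0], x[1] + y[1]];
}
function Times(x, y) {
    return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]];
}

// f(x) = x^2 + x, built from the dual-number primitives.
function f(x) {
    return Add(Times(x, x), x);
}

// Evaluate at x = 2 with dual part 1.
let [value, derivative] = f([2, 1]);
// value = 6 = f(2), derivative = 5 = f'(2) = 2*2 + 1
```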
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="forward-editor" class="editor"></div>
<button id="forward-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="forward-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
<h3 id="limitations">Limitations</h3>
<p>While forward-mode autodiff does compute exact derivatives, it suffers from the same fundamental problem as finite differences: each invocation of \(f\) can only compute the directional derivative of \(f\) for a single direction.</p>
<p>It’s useful to think of a forward mode derivative as computing one <em>column</em> of the gradient matrix. Hence, if \(f\) has few inputs but many outputs, forward mode can still be quite efficient at recovering the full gradient:</p>
\[\begin{align*}
\nabla f &= \begin{bmatrix} f_{1_x}(x,y) & f_{1_y}(x,y) \\ \vdots & \vdots \\ f_{n_x}(x,y) & f_{n_y}(x,y) \end{bmatrix} \\
\hphantom{\nabla f}&\hphantom{xxx}\begin{array}{} \underbrace{\hphantom{xxxxxx}}_{\text{Pass 1}} & \underbrace{\hphantom{xxxxxx}}_{\text{Pass 2}} \end{array}
\end{align*}\]
<p>Unfortunately, optimization problems in machine learning and graphics often have the opposite structure: \(f\) has a huge number of inputs (e.g. the coefficients of a 3D scene or neural network) and a single output. That is, \(\nabla f\) has many columns and few rows.</p>
<h2 id="backward-mode">Backward Mode</h2>
<p>As you might have guessed, backward mode autodiff provides a way to compute a <em>row</em> of the gradient using a single invocation of \(f\). For optimizing many-to-one functions, this is exactly what we want: the full gradient in one pass.</p>
<p>In this section, we will use Leibniz’s notation for derivatives, which is:</p>
\[f^\prime(x) = \frac{\partial f}{\partial x}\]
<p>Leibniz’s notation makes it easier to write down the derivative of an arbitrary variable with respect to an arbitrary input. Derivatives also take on nice algebraic properties, if you squint a bit:</p>
\[g(f(x))^\prime = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} = \frac{\partial g}{\partial x}\]
<h3 id="backpropagation">Backpropagation</h3>
<p>Similarly to how forward-mode autodiff propagated derivatives from inputs to outputs, backward-mode propagates derivatives from outputs to inputs.</p>
<p>That sounds easy enough, but the code only runs in one direction. How would we know what the gradient of our input should be before evaluating the rest of the function? We don’t—when evaluating \(f\), we use each operation to build a <em>computational graph</em> that represents \(f\). That is, when \(f\) tells us to perform an operation, we create a new node noting what the operation is and connect it to the nodes representing its inputs. In this way, a pure function can be nicely represented as a directed acyclic graph, or DAG.</p>
<p>For example, the function \(f(x,y) = x^2 + xy\) may be represented with the following graph:</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g1.svg" /></p>
<p>When evaluating \(f\) at a particular input, we write down the intermediate values computed by each node. This step is known as the <em>forward pass</em>, and computes <em>primal</em> values.</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g2.svg" /></p>
<p>Then, we begin the <em>backward pass</em>, where we compute <em>dual</em> values, or derivatives. Our ultimate goal is to compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\). At first, we only know the derivative of \(f\) with respect to the final plus node—they’re the same value, so \(\frac{\partial f}{\partial +} = 1\).</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g3.svg" /></p>
<p>We can see that the output was computed by adding together two incoming values. Increasing either input to the sum would increase the output by an equal amount, so derivatives propagated through this node should be unaffected. That is, if \(+\) is the output and \(+_1,+_2\) are the inputs, \(\frac{\partial +}{\partial +_1} = \frac{\partial +}{\partial +_2} = 1\).</p>
<p>Now we can use the chain rule to combine our derivatives, getting closer to the desired result: \(\frac{\partial f}{\partial +_1} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_1} = 1\), \(\frac{\partial f}{\partial +_2} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_2} = 1\).</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g4.svg" /></p>
<p>When we evaluate a node, we know the derivative of \(f\) with respect to its output. That means we can propagate the derivative back along the node’s incoming edges, modifying it based on the node’s operation. As long as we evaluate all outputs of a node before the node itself, we only have to check each node once. To ensure proper ordering, we may traverse the graph in <a href="https://en.wikipedia.org/wiki/Topological_sorting"><em>reverse topological order</em></a>.</p>
<p>Once we get to a multiplication node, there’s slightly more to do: the derivative now depends on the <em>primal</em> input values. That is, if \(f(x,y) = xy\), \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\).</p>
<p>By applying the chain rule, we get \(\frac{\partial f}{\partial *_1} = 1\cdot*_2\) and \(\frac{\partial f}{\partial *_2} = 1\cdot*_1\) for both multiplication nodes.</p>
<p><img class="diagram2d_wider_on_mobile center" src="/assets/autodiff/g5.svg" /></p>
<p>Applying the chain rule one last time, we get \(\frac{\partial f}{\partial y} = 2\). But \(x\) has multiple incoming derivatives—how do we combine them? Each incoming edge represents a different way \(x\) affects \(f\), so \(x\)’s total contribution is simply their sum. That means \(\frac{\partial f}{\partial x} = 7\). Let’s check our result:</p>
\[\begin{align*}
f_x(x,y) &= 2x + y &&\implies& f_x(2,3) &= 7 \\
f_y(x,y) &= x &&\implies& f_y(2,3) &= 2
\end{align*}\]
<p>You’ve probably noticed that traversing the graph built up a derivative term for each path from an input to the output. That’s exactly the behavior that arose when we manually computed the gradient using the chain rule!</p>
<p>Backpropagation is essentially the chain rule upgraded with dynamic programming. Traversing the graph in reverse topological order means we only have to evaluate each vertex once—and re-use its derivative everywhere else it shows up. Despite having to express \(f\) as a computational graph and traverse both forward and backward, the whole algorithm has the same time complexity as \(f\) itself. Space complexity, however, is a separate issue.</p>
<h3 id="implementation-1">Implementation</h3>
<p>We can implement backward mode autodiff using a similar approach as forward mode. Instead of making every operation use dual numbers, we can make each step add a node to our computational graph.</p>
<pre id="backward-lib" class="editor">
function Const(n) {
    return {op: 'const', in: [n], out: undefined, grad: 0};
}
function Add(x, y) {
    return {op: 'add', in: [x, y], out: undefined, grad: 0};
}
function Times(x, y) {
    return {op: 'times', in: [x, y], out: undefined, grad: 0};
}
</pre>
<p>Note that JavaScript will automatically store references within the <code class="language-plaintext highlighter-rouge">in</code> arrays, hence building a DAG instead of a tree. If we implement our function \(f\) in terms of these primitives, we can evaluate it on an input node to automatically build the graph.</p>
<pre id="backward-lib-2" class="editor">
let in_node = {op: 'const', in: [/* TBD */], out: undefined, grad: 0};
let out_node = f(in_node);
</pre>
<p>The forward pass performs a post-order traversal of the graph, translating inputs to outputs for each node. Remember we’re operating on a DAG: we must check whether a node is already resolved, lest we recompute values that could be reused.</p>
<pre id="backward-lib-3" class="editor">
function forward(node) {
    if (node.out !== undefined) return;
    if (node.op === 'const') {
        node.out = node.in[0];
    } else if (node.op === 'add') {
        forward(node.in[0]);
        forward(node.in[1]);
        node.out = node.in[0].out + node.in[1].out;
    } else if (node.op === 'times') {
        forward(node.in[0]);
        forward(node.in[1]);
        node.out = node.in[0].out * node.in[1].out;
    }
}
</pre>
<!-- (You might notice that given a value for $$x$$, the `out` values could have been computed while running $$f$$. That is true, but this purely graph-based approach will help illustrate a following section.) -->
<p>The backward pass is conceptually similar, but a naive pre-order traversal would end up tracing out every path in the DAG—every gradient has to be pushed back to the roots. Instead, we’ll first compute a reverse topological ordering of the nodes. This ordering guarantees that when we reach a node, everything “downstream” of it has already been resolved—we’ll never have to return.</p>
<pre id="backward-lib-4" class="editor">
function backward(out_node) {
    const order = topological_sort(out_node).reverse();
    for (const node of order) {
        if (node.op === 'add') {
            node.in[0].grad += node.grad;
            node.in[1].grad += node.grad;
        } else if (node.op === 'times') {
            node.in[0].grad += node.in[1].out * node.grad;
            node.in[1].grad += node.in[0].out * node.grad;
        }
    }
}
</pre>
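The <code class="language-plaintext highlighter-rouge">topological_sort</code> helper is left unspecified above; one possible sketch (assuming the node layout from this section) is a depth-first post-order traversal over each node's inputs. Post-order pushes a node only after everything it depends on, so reversing the result visits each node before any of its inputs:

```javascript
// DFS post-order over the DAG; the visited set ensures shared nodes
// are emitted exactly once.
function topological_sort(out_node) {
    const order = [];
    const visited = new Set();
    function visit(node) {
        if (visited.has(node)) return;
        visited.add(node);
        // 'const' nodes store a raw number in `in`, not other nodes.
        if (node.op !== 'const') {
            for (const input of node.in) visit(input);
        }
        order.push(node);
    }
    visit(out_node);
    return order;
}
```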
<p>Finally, we can put our functions together to compute \(f(x)\) and \(f'(x)\):</p>
<pre id="backward-lib-5" class="editor">
function evaluate(x, in_node, out_node) {
    in_node.in = [x];
    forward(out_node);
    out_node.grad = 1;
    backward(out_node);
    return [out_node.out, in_node.grad];
}
</pre>
<p>Just remember to clear all the <code class="language-plaintext highlighter-rouge">out</code> and <code class="language-plaintext highlighter-rouge">grad</code> fields before evaluating again! Lastly, the working implementation:</p>
<table class="codetable"><tr>
<td style="width: 50%; padding-right: 10px;"><div id="backward-editor" class="editor"></div>
<button id="backward-button" type="button" style="margin-top: 10px;">Differentiate</button></td>
<td style="width: 50%;"><canvas id="backward-canvas" class="diagram_canvas"></canvas></td>
</tr></table>
<h3 id="limitations-1">Limitations</h3>
<p>If \(f\) is a function of multiple variables, we can simply read the gradients from the corresponding input nodes. That means we’ve computed a whole row of \(\nabla f\). Of course, if the gradient has many rows and few columns, forward mode would have been more efficient.</p>
\[\begin{align*}
\nabla f &= \begin{bmatrix} \vphantom{\Big|} f_{0_a}(a,\dots,n) & \dots & f_{0_n}(a,\dots,n) \\ \vphantom{\Big|} f_{1_a}(a,\dots,n) & \dots & f_{1_n}(a,\dots,n) \end{bmatrix} \begin{matrix} \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 1} \\ \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 2} \end{matrix}
\end{align*}\]
<p>Unfortunately, backward mode comes with another catch: we had to store the intermediate result of every single computation inside \(f\)! If we’re passing around substantial chunks of data, say, weight matrices for a neural network, storing the intermediate results can require an unacceptable amount of memory and memory bandwidth. If \(f\) contains loops, it’s especially bad—because every value is immutable, naive loops will create long chains of intermediate values. For this reason, real-world frameworks tend to encapsulate loops in monolithic parallel operations that have analytic derivatives.</p>
<p>Many engineering hours have gone into reducing space requirements. One problem-agnostic approach is called checkpointing: we can choose not to store intermediate results at some nodes, instead re-computing them on the fly during the backward pass. Checkpointing gives us a natural space-time tradeoff: by strategically choosing which nodes store intermediate results (e.g. ones with expensive operations), we can reduce memory usage without dramatically increasing runtime.</p>
<p>Even with checkpointing, training the largest neural networks requires far more fast storage than is available to a single computer. By partitioning our computational graph between multiple systems, each one only needs to store values for its local nodes. Unfortunately, this implies edges connecting nodes assigned to different processors must send their values across a network, which is expensive. Hence, communication costs may be minimized by finding min-cost graph cuts.</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" style="width: 575px; margin-bottom: 20px;" src="/assets/autodiff/partition2.svg" /></div>
<h3 id="graphs-and-higher-order-autodiff">Graphs and Higher-Order Autodiff</h3>
<p>Earlier, we could have computed primal values while evaluating \(f\) itself. Frameworks like PyTorch and TensorFlow take this approach—evaluating \(f\) both builds the graph (also known as the ‘tape’) and evaluates the forward-pass results. The user may call <code class="language-plaintext highlighter-rouge">backward</code> at any point, propagating gradients to all inputs that contributed to the result.</p>
<p>However, the forward-backward approach can limit the system’s potential performance. The forward pass is relatively easy to optimize via parallelizing, vectorizing, and distributing graph traversal. The backward pass, on the other hand, is harder to parallelize, as it requires a topological traversal and coordinated gradient accumulation. Furthermore, the backward pass lacks some mathematical power. While computing a specific derivative is easy, we don’t get back a general <em>representation</em> of \(\nabla f\). If we wanted the gradient of the gradient (the Hessian), we’re back to relying on numerical differentiation.</p>
<p>Thinking about the gradient as a higher-order function reveals a potentially better approach. If we can represent \(f\) as a computational graph, there’s no reason we can’t also represent \(\nabla f\) <em>in the same way</em>. In fact, we can simply add nodes to the graph of \(f\) that compute derivatives with respect to each input. Because the graph already computes primal values, each node in \(f\) only requires us to add a constant number of nodes in the graph of \(\nabla f\). That means the result is only a constant factor larger than the input—evaluating it requires exactly the same computations as the forward-backward algorithm.</p>
<p>For example, given the graph of \(f(x) = x^2 + x\), we can produce the following:</p>
<div class="center" style="width: 100%;"><img class="diagram2d center" style="width: 500px;" src="/assets/autodiff/graph2.svg" /></div>
<p>Defining differentiation as a function <em>on computational graphs</em> unifies the forward and backward passes: we get a single graph that computes both \(f\) and \(\nabla f\). That means we can work with higher-order derivatives by applying the transformation again! Even better, distributed training is easier as we no longer have to worry about synchronizing gradient updates across multiple systems. JAX implements this approach, enabling its seamless gradient, JIT compilation, and vectorization transforms. PyTorch also supports higher-order differentiation by including backward operations in the computational graph, and <a href="https://github.com/pytorch/functorch">functorch</a> provides a JAX-like API.</p>
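<p>A minimal sketch of this idea, using the same <code>{op, in, out, grad}</code> node records as the rest of the article (the helper names <code>gradNode</code> and <code>evaluate</code> are invented here): differentiation becomes a function that takes the graph of \(f\) and appends nodes computing \(\partial f/\partial x\), reusing the primal nodes for products.</p>

```javascript
// Sketch (assumed node format and helper names): differentiation as a
// graph-to-graph transform. gradNode(f, x) appends nodes computing df/dx
// to the existing graph, reusing the primal nodes where products need them.
function Const(n)    { return {op: 'const', in: [n],    out: undefined, grad: 0}; }
function Add(x, y)   { return {op: 'add',   in: [x, y], out: undefined, grad: 0}; }
function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; }

function gradNode(f, x, cache = new Map()) {
    if (cache.has(f)) return cache.get(f);    // reuse shared subgraphs (DAG)
    let d;
    if (f === x)               d = Const(1);  // dx/dx = 1
    else if (f.op === 'const') d = Const(0);
    else if (f.op === 'add')   d = Add(gradNode(f.in[0], x, cache),
                                       gradNode(f.in[1], x, cache));
    else if (f.op === 'times') d = Add(Times(gradNode(f.in[0], x, cache), f.in[1]),
                                       Times(f.in[0], gradNode(f.in[1], x, cache)));
    cache.set(f, d);
    return d;
}

// Evaluate a graph at x = xVal (naive recomputation is fine at this scale).
function evaluate(node, x, xVal) {
    if (node === x) return xVal;
    if (node.op === 'const') return node.in[0];
    const a = evaluate(node.in[0], x, xVal);
    const b = evaluate(node.in[1], x, xVal);
    return node.op === 'add' ? a + b : a * b;
}

// f(x) = x^2 + x, so f'(x) = 2x + 1
let x = {op: 'input', in: [], out: undefined, grad: 0};
let f = Add(Times(x, x), x);
let df = gradNode(f, x);
```

<p>Because the cache hands back the same derivative node for shared subexpressions, each original node contributes only a constant number of new nodes, matching the constant-factor bound above. Applying <code>gradNode</code> to <code>df</code> again would yield second derivatives.</p>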
<h1 id="de-blurring-an-image">De-blurring an Image</h1>
<p>Let’s use our fledgling differentiable programming framework to solve a real optimization problem: de-blurring an image. We’ll assume our observed image was computed using a simple box filter, i.e., each blurred pixel is the average of the surrounding 3x3 ground-truth pixels. Of course, a blur loses information, so we won’t be able to reconstruct the exact input—but we can get pretty close!</p>
\[\text{Blur}(\text{Image})_{xy} = \frac{1}{9} \sum_{i=-1}^1 \sum_{j=-1}^1 \text{Image}_{(x+i)(y+j)}\]
<table class="imagetable"><tr style="width: 100%;">
<td style="display: inline-block;"><canvas id="image-orig"></canvas><p>Ground Truth Image</p></td>
<td style="display: inline-block;"><canvas id="image-blur"></canvas><p>Observed Image</p></td></tr>
</table>
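<p>For intuition, here is the forward model on its own: a sketch of an ordinary, non-differentiable 3x3 box blur over a single-channel, row-major image. Clamping coordinates at the border is an assumption made here for illustration.</p>

```javascript
// Sketch: plain (non-differentiable) 3x3 box blur on a single-channel
// image stored in row-major order, clamping coordinates at the border.
function box_blur(img, W, H) {
    const clamp = (v, lo, hi) => Math.min(Math.max(v, lo), hi);
    const out = new Array(W * H);
    for (let x = 0; x < W; x++) {
        for (let y = 0; y < H; y++) {
            let sum = 0;
            for (let i = -1; i <= 1; i++)
                for (let j = -1; j <= 1; j++)
                    sum += img[clamp(y + j, 0, H - 1) * W + clamp(x + i, 0, W - 1)];
            out[y * W + x] = sum / 9;
        }
    }
    return out;
}
```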
<p>We’ll need to add one more operation to our framework: division. The operation and forward pass are much the same as addition and multiplication, but the backward pass must compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) for \(f = \frac{x}{y}\).</p>
<pre id="image-div" class="editor">
function Divide(x, y) {
    return {op: 'divide', in: [x, y], out: undefined, grad: 0};
}
// Forward...
if (node.op === 'divide') {
    forward(node.in[0]);
    forward(node.in[1]);
    node.out = node.in[0].out / node.in[1].out;
}
// Backward...
if (node.op === 'divide') {
    // Quotient rule: d(x/y)/dx = 1/y, d(x/y)/dy = -x/y^2
    node.in[0].grad += node.grad / node.in[1].out;
    node.in[1].grad += -node.grad * node.in[0].out / (node.in[1].out * node.in[1].out);
}
</pre>
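<p>As a quick sanity check on those quotient-rule formulas, we can compare them against central finite differences at an arbitrary point (the values below are illustrative):</p>

```javascript
// Check d(x/y)/dx = 1/y and d(x/y)/dy = -x/y^2 against central differences.
const f = (x, y) => x / y;
const h = 1e-6, x0 = 2, y0 = 3;
const analytic_dx = 1 / y0;
const analytic_dy = -x0 / (y0 * y0);
const numeric_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h);
const numeric_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h);
```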
<p>Before we start programming, we need to express our task as an optimization problem. That entails minimizing a loss function that measures how far away we are from our goal.</p>
<p>Let’s start by guessing an arbitrary image—for example, a solid grey block. We can then compare the result of blurring our guess with the observed image. The farther our blurred result is from the observation, the larger the loss should be. For simplicity, we will define our loss as the sum of squared differences between corresponding pixels.</p>
\[\text{Loss}(\text{Blur}(\text{Guess}), \text{Observed}) = \sum_{x=0}^{W-1}\sum_{y=0}^{H-1} (\text{Blur}(\text{Guess})_{xy} - \text{Observed}_{xy})^2\]
<p>Using differentiable programming, we can compute \(\frac{\partial \text{Loss}}{\partial \text{Guess}}\), i.e. how changes in our proposed image change the resulting loss. That means we can apply gradient descent to the guess, guiding it towards a state that minimizes the loss function. Hopefully, if our blurred guess matches the observed image, our guess will match the ground truth image.</p>
<p>Let’s implement our loss function in differentiable code. First, create the guess image by initializing the differentiable parameters to solid grey. Each pixel has three components: red, green, and blue.</p>
<pre id="image-impl-0" class="editor">
let guess_image = new Array(W*H*3);
for (let i = 0; i < W * H * 3; i++) {
    guess_image[i] = Const(127);
}
</pre>
<p>Second, apply the blur using differentiable operations.</p>
<pre id="image-impl-1" class="editor">
let blurred_guess_image = new Array(W*H*3);
for (let x = 0; x < W; x++) {
    for (let y = 0; y < H; y++) {
        let [r,g,b] = [Const(0), Const(0), Const(0)];
        // Accumulate the 3x3 neighborhood for averaging
        for (let i = -1; i <= 1; i++) {
            for (let j = -1; j <= 1; j++) {
                // Convert 2D pixel coordinate to 1D row-major array index,
                // clamping at the image border
                const xi = clamp(x + i, 0, W - 1);
                const yj = clamp(y + j, 0, H - 1);
                const idx = (yj * W + xi) * 3;
                r = Add(r, guess_image[idx + 0]);
                g = Add(g, guess_image[idx + 1]);
                b = Add(b, guess_image[idx + 2]);
            }
        }
        // Set result to average
        const idx = (y * W + x) * 3;
        blurred_guess_image[idx + 0] = Divide(r, Const(9));
        blurred_guess_image[idx + 1] = Divide(g, Const(9));
        blurred_guess_image[idx + 2] = Divide(b, Const(9));
    }
}
</pre>
<p>Finally, compute the loss using differentiable operations.</p>
<pre id="image-impl-2" class="editor">
let loss = Const(0);
for (let x = 0; x < W; x++) {
    for (let y = 0; y < H; y++) {
        const idx = (y * W + x) * 3;
        let dr = Add(blurred_guess_image[idx + 0], Const(-observed_image[idx + 0]));
        let dg = Add(blurred_guess_image[idx + 1], Const(-observed_image[idx + 1]));
        let db = Add(blurred_guess_image[idx + 2], Const(-observed_image[idx + 2]));
        loss = Add(loss, Times(dr, dr));
        loss = Add(loss, Times(dg, dg));
        loss = Add(loss, Times(db, db));
    }
}
</pre>
<p>Calling <code class="language-plaintext highlighter-rouge">forward(loss)</code> performs the whole computation, storing results in each node’s <code class="language-plaintext highlighter-rouge">out</code> field. Calling <code class="language-plaintext highlighter-rouge">backward(loss)</code> computes the derivative of loss at every node, storing results in each node’s <code class="language-plaintext highlighter-rouge">grad</code> field.</p>
<p>Let’s write a simple optimization routine that performs gradient descent on the guess image.</p>
<pre id="image-impl-3" class="editor">
function gradient_descent_step(step_size) {
    // Clear output values and gradients
    reset(loss);
    // Forward pass
    forward(loss);
    // Backward pass
    loss.grad = 1;
    backward(loss);
    // Move each parameter along its negated gradient
    for (let i = 0; i < W * H * 3; i++) {
        let p = guess_image[i];
        p.in[0] -= step_size * p.grad;
    }
}
</pre>
<p>We’d also like to compute <code class="language-plaintext highlighter-rouge">error</code>, the squared distance between our guess image and the ground truth. We can’t use error to inform our algorithm—we’re not supposed to know what the ground truth was—but we can use it to measure how well we are reconstructing the image. For the current iteration, we visualize the guess image, the guess image after blurring, and the gradient of loss with respect to each pixel.</p>
<table class="imagetable">
<tr>
<td><canvas id="image-cur"></canvas><p id="image-error"></p></td>
<td><canvas id="image-cur-blur"></canvas><p style="width: 105%; margin-left: -2.5%;" id="image-loss"></p></td>
<td><canvas id="image-grad"></canvas><p><strong>Gradient Image</strong></p></td>
</tr>
</table>
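<p>The error readout itself never needs to be differentiable, so it can be computed on plain numbers outside the graph. A minimal sketch, assuming the parameter layout from the code above and a hypothetical <code>truth_image</code> array of raw ground-truth pixel values:</p>

```javascript
// Sketch: squared reconstruction error against the ground truth, computed on
// plain numbers -- it only measures progress and never enters the graph.
// `truth_image` (raw pixel values) is a hypothetical array assumed here.
function reconstruction_error(guess_image, truth_image, W, H) {
    let error = 0;
    for (let i = 0; i < W * H * 3; i++) {
        // Const nodes store their parameter value in in[0]
        const d = guess_image[i].in[0] - truth_image[i];
        error += d * d;
    }
    return error;
}
```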
<table class="imagetable" style="margin-top: -5px; margin-bottom: -15px;">
<tr>
<td style="width: 16.5%;"></td>
<td><div><p style="float:left;"><button id="image-reset">Reset</button> <button id="image-step">Step</button></p>
<input type="range" min="0" max="100" value="25" class="slider" id="image-step-size" style="width: 45%; margin-left: 16px; margin-top: 7px; float: left;" /></div></td>
</tr>
</table>
<p>The slider adjusts the step size. After running several steps, you’ll notice that even though loss goes to zero, error does not: the loss function does not provide enough information to exactly reconstruct the ground truth. We can also see that optimization behavior depends on the step size—small steps require many iterations to converge, and large steps may overshoot the target, oscillating between too-dark and too-bright images.</p>
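<p>This step-size behavior is easy to reproduce on a toy problem. The sketch below (not part of the image pipeline) runs gradient descent on \(f(x) = x^2\), whose gradient is \(2x\): each update multiplies the iterate by \(1 - 2\tau\), so a moderate step size converges geometrically, while \(\tau &gt; 1\) overshoots the minimum every step, oscillating in sign with growing amplitude.</p>

```javascript
// Gradient descent on f(x) = x^2 (gradient 2x), starting from x = 1.
// Each step computes x <- x - tau * 2x = x * (1 - 2*tau), so the iterate
// converges when 0 < tau < 1 and oscillates divergently when tau > 1.
function descend(tau, steps) {
    let x = 1;
    for (let s = 0; s < steps; s++) x -= tau * 2 * x;
    return x;
}
```

<p>For instance, <code>descend(0.4, 50)</code> lands essentially at the minimum, while <code>descend(1.1, 50)</code> ends up thousands of units away, flipping sign on every step.</p>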
<h1 id="further-reading">Further Reading</h1>
<p>If you’d like to learn more about differentiable programming in ML and graphics, check out the following resources:</p>
<ul>
<li><a href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html">JAX</a>, <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch</a>, <a href="https://www.tensorflow.org/tutorials/quickstart/beginner">TensorFlow</a>, <a href="https://github.com/geohot/tinygrad">tinygrad</a>, <a href="https://github.com/patr-schm/TinyAD">TinyAD</a></li>
<li><a href="https://physicsbaseddeeplearning.org/intro.html">Physics-Based Deep Learning</a>, <a href="https://www.youtube.com/watch?v=Tou8or1ed6E">Differentiable Rendering Intro</a>, <a href="https://rgl.epfl.ch/publications">Differentiable Rendering Papers</a>, <a href="https://neuralfields.cs.brown.edu/">Neural Fields in Visual Computing</a></li>
<li><a href="https://dlsyscourse.org/">CMU Deep Learning Systems</a></li>
</ul>
<link rel="stylesheet" type="text/css" href="/assets/lib/prism/prism.css" />
<script src="/assets/lib/prism/prism.js"></script>
<script type="module" src="/assets/autodiff/autodiff.js"></script>Differentiable programming has been a hot research topic over the past few years, and not only due to the popularity of machine learning libraries like TensorFlow, PyTorch, and JAX. Many fields apart from machine learning are also finding differentiable programming to be a useful tool for solving many kinds of optimization problems. In computer graphics, differentiable rendering, differentiable physics, and neural representations are all poised to be important tools going forward. This article received an honorable mention in 3Blue1Brown’s Summer of Math Exposition 2! Prerequisites Differentiation Optimization Differentiating Code Numerical Symbolic Automatic Differentiation Forward Mode Backward Mode De-blurring an Image Further Reading Prerequisites Differentiation It all starts with the definition you learn in calculus class: \[f^{\prime}(x) = \lim_{h\rightarrow 0} \frac{f(x + h) - f(x)}{h}\] In other words, the derivative computes how much \(f(x)\) changes when \(x\) is perturbed by an infinitesimal amount. If \(f\) is a one-dimensional function from \(\mathbb{R} \mapsto \mathbb{R}\), the derivative \(f^{\prime}(x)\) returns the slope of the graph of \(f\) at \(x\). However, there’s another perspective that provides better intuition in higher dimensions. If we think of \(f\) as a map from points in its domain to points in its range, we can think of \(f^{\prime}(x)\) as a map from vectors based at \(x\) to vectors based at \(f(x)\). One-to-One In 1D, this distinction is a bit subtle, as a 1D “vector” is just a single number. Still, evaluating \(f^{\prime}(x)\) shows us how a vector placed at \(x\) is scaled when transformed by \(f\). That’s just the slope of \(f\) at \(x\). Many-to-One If we consider a function \(g(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}\) (two inputs, one output), this perspective will become clearer. 
We can differentiate \(g\) with respect to any particular input, known as a partial derivative: \[g_x(x,y) = \lim_{h\rightarrow0} \frac{g(x+h,y) - g(x,y)}{h}\] The function \(g_x\) produces the change in \(g\) given a change in \(x\). If we combine it with the partial derivative for \(y\), we get the derivative, or gradient, of \(g\): \[\nabla g(x,y) = \begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\] That is, \(\nabla g(x,y)\) tells us how \(g\) changes if we change either \(x\) or \(y\). If we multiply \(\nabla g(x,y)\) with a column vector of differences \(\Delta x,\Delta y\), we’ll get their combined effect on \(g\): \[\nabla g(x,y) \begin{bmatrix}\Delta x\\\Delta y\end{bmatrix} = \Delta xg_x(x,y) + \Delta yg_y(x,y)\] It’s tempting to think of the gradient as just another vector. However, it’s often useful to think of the gradient as a higher-order function: \(\nabla g\) is a function that, when evaluated at \(x,y\), gives us another function that transforms vectors based at \(x,y\) to vectors based at \(g(x,y)\). It just so happens that the function returned by our gradient is linear, so it can be represented as a matrix multiplication. The gradient is typically explained as the “direction of steepest ascent.” Why is that? When we evaluate the gradient at a point \(x,y\) and a vector \(\Delta x, \Delta y\), the result is a change in the output of \(g\). If we maximize the change in \(g\) with respect to the input vector, we’ll get the direction that makes the output increase the fastest. Since the gradient function is just a product with \(\begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\), the direction \(\begin{bmatrix}\Delta x & \Delta y\end{bmatrix}\) that maximizes the result is easy to find: it’s parallel to \(\begin{bmatrix}g_x(x,y) & g_y(x,y)\end{bmatrix}\). That means the gradient vector is, in fact, the direction of steepest ascent. 
The Directional Derivative Another important term is the directional derivative, which computes the derivative of a function along an arbitrary direction. It is a generalization of the partial derivative, which evaluates the directional derivative along a coordinate axis. \[D_{\mathbf{v}}f(x)\] Above, we discovered that our “gradient function” could be expressed as a dot product with the gradient vector. That was, actually, the directional derivative: \[D_{\mathbf{v}}f(x) = \nabla f(x) \cdot \mathbf{v}\] Which again illustrates why the gradient vector is the direction of steepest ascent: it is the \(\mathbf{v}\) that maximizes the directional derivative. Note that in curved spaces, the “steepest ascent” definition of the gradient still holds, but the directional derivative becomes more complicated than a dot product. Many-to-Many For completeness, let’s examine how this perspective extends to vector-valued functions of multiple variables. Consider \(h(x,y) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) (two inputs, two outputs). \[h(x,y) = \begin{bmatrix}h_1(x,y)\\h_2(x,y)\end{bmatrix}\] We can take the gradient of each part of \(h\): \[\begin{align*} \nabla h_1(x,y) &= \begin{bmatrix}h_{1_x}(x,y)& h_{1_y}(x,y)\end{bmatrix} \\ \nabla h_2(x,y) &= \begin{bmatrix}h_{2_x}(x,y)& h_{2_y}(x,y)\end{bmatrix} \end{align*}\] What object would represent \(\nabla h\)? We can build a matrix from the gradients of each component, called the Jacobian: \[\nabla h(x,y) = \begin{bmatrix}h_{1_x}(x,y) & h_{1_y}(x,y)\\ h_{2_x}(x,y) & h_{2_y}(x,y)\end{bmatrix}\] Above, we argued that the gradient (when evaluated at \(x,y\)) gives us a map from input vectors to output vectors. That remains the case here: the Jacobian is a 2x2 matrix, so it transforms 2D \(\Delta x,\Delta y\) vectors into 2D \(\Delta h_1, \Delta h_2\) vectors. Adding more dimensions starts to make our functions hard to visualize, but we can always rely on the fact that the derivative tells us how input vectors (i.e. 
changes in the input) get mapped to output vectors (i.e. changes in the output). The Chain Rule The last aspect of differentiation we’ll need to understand is how to differentiate function composition. \[h(x) = g(f(x)) \implies h^\prime(x) = g^\prime(f(x))\cdot f^\prime(x)\] We could prove this fact with a bit of real analysis, but the relationship is again easier to understand by thinking of the derivative as higher order function. In this perspective, the chain rule itself is just a function composition. For example, let’s assume \(h(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\) is composed of \(f(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}^2\) and \(g(\mathbf{x}) : \mathbb{R}^2 \mapsto \mathbb{R}\). In order to translate a \(\Delta \mathbf{x}\) vector to a \(\Delta h\) vector, we can first use \(f^\prime\) to map \(\Delta \mathbf{x}\) to \(\Delta \mathbf{f}\), based at \(\mathbf{x}\). Then we can use \(g^\prime\) to map \(\Delta \mathbf{f}\) to \(\Delta g\), based at \(f(x)\). Because our derivatives/gradients/Jacobians are linear functions, we’ve been representing them as scalars/vectors/matrices, respectively. That means we can easily compose them with the typical linear algebraic multiplication rules. Writing out the above example symbolically: \[\begin{align*} \nabla h(\mathbf{x}) &= \nabla g(f(\mathbf{x}))\cdot \nabla f(\mathbf{x}) \\ &= \begin{bmatrix}g_{x_1}(f(\mathbf{x}))& g_{x_2}(f(\mathbf{x}))\end{bmatrix} \begin{bmatrix}f_{1_{x_1}}(\mathbf{x}) & f_{1_{x_2}}(\mathbf{x})\\ f_{2_{x_1}}(\mathbf{x}) & f_{2_{x_2}}(\mathbf{x})\end{bmatrix} \\ &= \begin{bmatrix}g_{x_1}(f(\mathbf{x}))f_{1_{x_1}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_1}}(\mathbf{x}) & g_{x_1}(f(\mathbf{x}))f_{1_{x_2}}(\mathbf{x}) + g_{x_2}(f(\mathbf{x}))f_{2_{x_2}}(\mathbf{x})\end{bmatrix} \end{align*}\] The result is a 2D vector representing a gradient that transforms 2D vectors to 1D vectors. The composed function \(h\) had two inputs and one output, so that’s correct. 
We can also notice that each term corresponds to the chain rule applied to a different computational path from a component of \(\mathbf{x}\) to \(h\). Optimization We will focus on the application of differentiation to optimization via gradient descent, which is often used in machine learning and computer graphics. An optimization problem always involves computing the following expression: \[\underset{\mathbf{x}}{\arg\!\min} f(\mathbf{x})\] Which simply means “find the \(\mathbf{x}\) that results in the smallest possible value of \(f\).” The function \(f\), typically scalar-valued, is traditionally called an “energy,” or in machine learning, a “loss function.” Extra constraints are often enforced to limit the valid options for \(\mathbf{x}\), but we will disregard constrained optimization for now. One way to solve an optimization problem is to iteratively follow the gradient of \(f\) “downhill.” This algorithm is known as gradient descent: Pick an initial guess \(\mathbf{\bar{x}}\). Repeat: Compute the gradient \(\nabla f(\mathbf{\bar{x}})\). Step along the gradient: \(\mathbf{\bar{x}} \leftarrow \mathbf{\bar{x}} - \tau\nabla f(\mathbf{\bar{x}})\). while \(\|\nabla f(\mathbf{\bar{x}})\| > \epsilon\). Given some starting point \(\mathbf{x}\), computing \(\nabla f(\mathbf{x})\) will give us the direction from \(\mathbf{x}\) that would increase \(f\) the fastest. Hence, if we move our point \(\mathbf{x}\) a small distance \(\tau\) along the negated gradient, we will decrease the value of \(f\). The number \(\tau\) is known as the step size (or in ML, the learning rate). By iterating this process, we will eventually find an \(\mathbf{x}\) such that \(\nabla f(\mathbf{x}) \simeq 0\), which is hopefully the minimizer. This description of gradient descent makes optimization sound easy, but in reality there is a lot that can go wrong. When gradient descent terminates, the result is only required to be a critical point of \(f\), i.e. somewhere \(f\) becomes flat. 
That means we could wind up at a maximum (unlikely), a saddle point (possible), or a local minimum (likely). At a local minimum, moving \(\mathbf{x}\) in any direction would increase the value of \(f(\mathbf{x})\), but \(f(\mathbf{x})\) is not necessarily the minimum value \(f\) can take on globally. MaximumSaddle PointLocal Minimum Gradient descent can also diverge (i.e. never terminate) if \(\tau\) is too large. Because the gradient is only a linear approximation of \(f\), if we step too far along it, we might skip over changes in \(f\)’s behavior—or even end up increasing both \(f\) and \(\nabla f\). On the other hand, the smaller we make \(\tau\), the longer our algorithm takes to converge. Note that we’re assuming \(f\) has a lower bound and achieves it at a finite \(\mathbf{x}\) in the first place. DivergenceSlow Convergence The algorithm presented here is the most basic form of gradient descent: much research has been dedicated to devising loss functions and descent algorithms that have higher likelihoods of converging to reasonable results. The practice of adding constraints, loss function terms, and update rules is known as regularization. In fact, optimization is a whole field in of itself: if you’d like to learn more, there’s a vast amount of literature to refer to, especially within machine learning. This interactive article explaining momentum is a great example. Differentiating Code Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider pure functions, i.e. functions whose output depends solely on its parameters (no state). If your program implements a relatively simple mathematical expression, it’s not too difficult to manually write another function that evaluates its derivative. However, what if your program is a deep neural network, or a physics simulation? 
It’s not feasible to differentiate something like that by hand, so we must turn to algorithms for automatic differentiation. There are several techniques for differentiating programs. We will first look at numeric and symbolic differentiation, both of which have been in use as long as computers have existed. However, these approaches are distinct from the algorithm we now know as autodiff, which we will discuss later. Numerical Numerical differentiation is the most straightforward technique: it simply approximates the definition of the derivative. \[f^\prime(x) = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h} \simeq \frac{f(x+0.001)-f(x)}{0.001}\] By choosing a small \(h\), all we have to do is evaluate \(f\) at \(x\) and \(x+h\). This technique is also known as differentiation via finite differences. Implementing numeric differentiation as a higher order function is quite easy. It doesn’t even require modifying the function to differentiate: function numerical_diff(f, h) { return function (x) { return (f(x + h) - f(x)) / h; } } let df = numerical_diff(f, 0.001); You can edit the following Javascript example, where \(f\) is drawn in blue and numerical_diff(\(f\), 0.001) is drawn in purple. Note that using control flow is not a problem: Differentiate Unfortunately, finite differences have a big problem: they only compute the derivative of \(f\) in one direction. If our input is very high dimensional, computing the full gradient of \(f\) becomes computationally infeasible, as we would have to evaluate \(f\) for each dimension separately. That said, if you only need to compute one directional derivative of \(f\), the full gradient is overkill: instead, compute a finite difference between \(f(\mathbf{x})\) and \(f(\mathbf{x} + \Delta\mathbf{x})\), where \(\Delta\mathbf{x}\) is a small step in your direction of interest. Finally, always remember that numerical differentiation is only an approximation: we aren’t computing the actual limit as \(h \rightarrow 0\). 
While finite differences are quite easy to implement and can be very useful for validating other results, the technique should usually be superseded by another approach. Symbolic Symbolic differentiation involves transforming a representation of \(f\) into a representation of \(f^\prime\). Unlike numerical differentiation, this requires specifying \(f\) in a domain-specific language where each syntactic construct has a known differentiation rule. However, that limitation isn’t so bad—we can create a compiler that differentiates expressions in our symbolic language for us. This is the technique used in computer algebra packages like Mathematica. For example, we could create a simple language of polynomials that is symbolically differentiable using the following set of recursive rules: d(n) -> 0 d(x) -> 1 d(Add(a, b)) -> Add(d(a), d(b)) d(Times(a, b)) -> Add(Times(d(a), b), Times(a, d(b))) Differentiate If we want our differentiable language to support more operations, we can simply add more differentiation rules. For example, to support trig functions: d(sin a) -> Times(d(a), cos a) d(cos a) -> Times(d(a), Times(-1, sin a)) Unfortunately, there’s a catch: the size of \(f^\prime\)’s representation can become very large. Let’s write another recursive relationship that counts the number of terms in an expression: Terms(n) -> 1 Terms(x) -> 1 Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 And then prove that Terms(a) <= Terms(d(a)), i.e. 
differentiating an expression cannot decrease the number of terms: Base Cases: Terms(d(n)) -> 1 | Definition Terms(n) -> 1 | Definition => Terms(n) <= Terms(d(n)) Terms(d(x)) -> 1 | Definition Terms(x) -> 1 | Definition => Terms(x) <= Terms(d(x)) Inductive Case for Add: Terms(Add(a, b)) -> Terms(a) + Terms(b) + 1 | Definition Terms(d(Add(a, b))) -> Terms(d(a)) + Terms(d(b)) + 1 | Definition Terms(a) <= Terms(d(a)) | Hypothesis Terms(b) <= Terms(d(b)) | Hypothesis => Terms(Add(a, b)) <= Terms(d(Add(a, b))) Inductive Case for Times: Terms(Times(a, b)) -> Terms(a) + Terms(b) + 1 | Definition Terms(d(Times(a, b))) -> Terms(a) + Terms(b) + 3 + Terms(d(a)) + Terms(d(b)) | Definition => Terms(Times(a, b)) <= Terms(d(Times(a, b))) This result might be acceptable if the size of df was linear in the size of f, but that’s not the case. Whenever we differentiate a Times expression, the number of terms in the result will at least double. That means the size of df grows exponentially with the number of Times we compose. You can demonstrate this phenomenon by nesting multiple Times in the Javascript example. Hence, symbolic differentiation is usually infeasible at the scales we’re interested in. However, if it works for your use case, it can be quite useful. Automatic Differentiation We’re finally ready to discuss the automatic differentiation algorithm actually used in modern differentiable programming: autodiff! There are two flavors of autodiff, each named for the direction in which it computes derivatives. Forward Mode Forward mode autodiff improves on our two older techniques by computing exact derivatives without building a potentially exponentially-large representation of \(f^\prime\). It is based on the mathematical definition of dual numbers. Dual Numbers Dual numbers are a bit like complex numbers: they’re defined by adjoining a new quantity \(\epsilon\) to the reals. But unlike complex numbers where \(i^2 = -1\), dual numbers use \(\epsilon^2 = 0\). 
In particular, we can use the \(\epsilon\) part of a dual number to represent the derivative of the scalar part. If we replace each variable \(x\) with \(x + x^\prime\epsilon\), we will find that dual arithmetic naturally expresses how derivatives combine: Addition: \[(x + x^\prime\epsilon) + (y + y^\prime\epsilon) = (x + y) + (x^\prime + y^\prime)\epsilon\] Multiplication: \[\begin{align*} (x + x^\prime\epsilon) * (y + y^\prime\epsilon) &= xy + xy^\prime\epsilon + x^\prime y\epsilon + x^\prime y^\prime\epsilon^2 \\ &= xy + (x^\prime y + xy^\prime)\epsilon \end{align*}\] Division: \[\begin{align*} \frac{x + x^\prime\epsilon}{y + y^\prime\epsilon} &= \frac{\frac{x}{y}+\frac{x^\prime}{y}\epsilon}{1+\frac{y^\prime}{y}\epsilon} \\ &= \left(\frac{x}{y}+\frac{x^\prime}{y}\epsilon\right)\left(1-\frac{y^\prime}{y}\epsilon\right) \\ &= \frac{x}{y} + \frac{x^\prime y - xy^\prime}{y^2}\epsilon \end{align*}\] The chain rule also works: \(f(x + x^\prime\epsilon) = f(x) + f'(x)x^\prime\epsilon\) for any smooth function \(f\). To prove this fact, let us first show that the property holds for positive integer exponentiation. Base case: \((x+x^\prime\epsilon)^1 = x^1 + 1x^0x^\prime\epsilon\) Hypothesis: \((x+x^\prime\epsilon)^n = x^n + nx^{n-1}x^\prime\epsilon\) Induct: \[\begin{align*} (x+x^\prime\epsilon)^{n+1} &= (x^n + nx^{n-1}x^\prime\epsilon)(x+x^\prime\epsilon) \tag{Hypothesis}\\ &= x^{n+1} + x^nx^\prime\epsilon + nx^nx^\prime\epsilon + nx^{n-1}x^{\prime^2}\epsilon^2\\ &= x^{n+1} + (n+1)x^nx^\prime\epsilon \end{align*}\] We can use this result to prove the same property for any smooth function \(f\). 
Examining the Taylor expansion of \(f\) at zero (also known as its Maclaurin series): \[f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(0)x^n}{n!} = f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots\] By plugging in our dual number… \[\begin{align*} f(x+x^\prime\epsilon) &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x+x^\prime\epsilon)^2}{2!} + \frac{f^{\prime\prime\prime}(0)(x+x^\prime\epsilon)^3}{3!} + \dots\\ &= f(0) + f^\prime(0)(x+x^\prime\epsilon) + \frac{f^{\prime\prime}(0)(x^2+2xx^\prime\epsilon)}{2!} + \frac{f^{\prime\prime\prime}(0)(x^3+3x^2x^\prime\epsilon)}{3!} + \dots \\ &= f(0) + f^\prime(0)x + \frac{f^{\prime\prime}(0)x^2}{2!} + \frac{f^{\prime\prime\prime}(0)x^3}{3!} + \dots \\ &\phantom{= }+ \left(f^\prime(0) + f^{\prime\prime}(0)x + \frac{f^{\prime\prime\prime}(0)x^2}{2!} + \dots \right)x^\prime\epsilon \\ &= f(x) + f^\prime(x)x^\prime\epsilon \end{align*}\] …we prove the result! In the last step, we recover the Maclaurin series for both \(f(x)\) and \(f^\prime(x)\). Implementation Implementing forward-mode autodiff in code can be very straightforward: we just have to replace our Float type with a DiffFloat that keeps track of both our value and its dual coefficient. If we then implement the relevant math operations for DiffFloat, all we have to do is run the program! Unfortunately, JavaScript does not support operator overloading, so we’ll define a DiffFloat to be a two-element array and use functions to implement some basic arithmetic operations: function Const(n) { return [n, 0]; } function Add(x, y) { return [x[0] + y[0], x[1] + y[1]]; } function Times(x, y) { return [x[0] * y[0], x[1] * y[0] + x[0] * y[1]]; } If we implement our function \(f\) in terms of these primitives, evaluating \(f([x,1])\) will return \([f(x),f^\prime(x)]\)! This property extends naturally to higher-dimensional functions, too. If \(f\) has multiple outputs, their derivatives pop out in the same way. 
If \(f\) has inputs other than \(x\), assigning them constants means the result will be the partial derivative \(f_x\). Differentiate Limitations While forward-mode autodiff does compute exact derivatives, it suffers from the same fundamental problem as finite differences: each invocation of \(f\) can only compute the directional derivative of \(f\) for a single direction. It’s useful to think of a forward mode derivative as computing one column of the gradient matrix. Hence, if \(f\) has few inputs but many outputs, forward mode can still be quite efficient at recovering the full gradient: \[\begin{align*} \nabla f &= \begin{bmatrix} f_{1_x}(x,y) & f_{1_y}(x,y) \\ \vdots & \vdots \\ f_{n_x}(x,y) & f_{n_y}(x,y) \end{bmatrix} \\ \hphantom{\nabla f}&\hphantom{xxx}\begin{array}{} \underbrace{\hphantom{xxxxxx}}_{\text{Pass 1}} & \underbrace{\hphantom{xxxxxx}}_{\text{Pass 2}} \end{array} \end{align*}\] Unfortunately, optimization problems in machine learning and graphics often have the opposite structure: \(f\) has a huge number of inputs (e.g. the coefficients of a 3D scene or neural network) and a single output. That is, \(\nabla f\) has many columns and few rows. Backward Mode As you might have guessed, backward mode autodiff provides a way to compute a row of the gradient using a single invocation of \(f\). For optimizing many-to-one functions, this is exactly what we want: the full gradient in one pass. In this section, we will use Leibniz’s notation for derivatives, which is: \[f^\prime(x) = \frac{\partial f}{\partial x}\] Leibniz’s notation makes it easier to write down the derivative of an arbitrary variable with respect an arbitrary input. 
Derivatives also obtain nice algebraic properties, if you squint a bit: \[g(f(x))^\prime = \frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x} = \frac{\partial g}{\partial x}\] Backpropagation Similarly to how forward-mode autodiff propagated derivatives from inputs to outputs, backward-mode propagates derivatives from outputs to inputs. That sounds easy enough, but the code only runs in one direction. How would we know what the gradient of our input should be before evaluating the rest of the function? We don’t—when evaluating \(f\), we use each operation to build a computational graph that represents \(f\). That is, when \(f\) tells us to perform an operation, we create a new node noting what the operation is and connect it to the nodes representing its inputs. In this way, a pure function can be nicely represented as a directed acyclic graph, or DAG. For example, the function \(f(x,y) = x^2 + xy\) may be represented with the following graph: When evaluating \(f\) at a particular input, we write down the intermediate values computed by each node. This step is known as the forward pass, and computes primal values. Then, we begin the backward pass, where we compute dual values, or derivatives. Our ultimate goal is to compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\). At first, we only know the derivative of \(f\) with respect to final plus—they’re the same value, so \(\frac{\partial f}{\partial +} = 1\). We can see that the output was computed by adding together two incoming values. Increasing either input to the sum would increase the output by an equal amount, so derivatives propagated through this node should be unaffected. That is, if \(+\) is the output and \(+_1,+_2\) are the inputs, \(\frac{\partial +}{\partial +_1} = \frac{\partial +}{\partial +_2} = 1\). 
Now we can use the chain rule to combine our derivatives, getting closer to the desired result: \(\frac{\partial f}{\partial +_1} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_1} = 1\), \(\frac{\partial f}{\partial +_2} = \frac{\partial f}{\partial +}\cdot\frac{\partial +}{\partial +_2} = 1\). When we evaluate a node, we know the derivative of \(f\) with respect to its output. That means we can propagate the derivative back along the node’s incoming edges, modifying it based on the node’s operation. As long as we evaluate all outputs of a node before the node itself, we only have to check each node once. To ensure proper ordering, we may traverse the graph in reverse topological order. Once we get to a multiplication node, there’s slightly more to do: the derivative now depends on the primal input values. That is, if \(f(x,y) = xy\), \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\). By applying the chain rule, we get \(\frac{\partial f}{\partial *_1} = 1\cdot*_2\) and \(\frac{\partial f}{\partial *_2} = 1\cdot*_1\) for both multiplication nodes. Applying the chain rule one last time, we get \(\frac{\partial f}{\partial y} = 2\). But \(x\) has multiple incoming derivatives—how do we combine them? Each incoming edge represents a different way \(x\) affects \(f\), so \(x\)’s total contribution is simply their sum. That means \(\frac{\partial f}{\partial x} = 7\). Let’s check our result: \[\begin{align*} f_x(x,y) &= 2x + y &&\implies& f_x(2,3) &= 7 \\ f_y(x,y) &= x &&\implies& f_y(2,3) &= 2 \end{align*}\] You’ve probably noticed that traversing the graph built up a derivative term for each path from an input to the output. That’s exactly the behavior that arose when we manually computed the gradient using the chain rule! Backpropagation is essentially the chain rule upgraded with dynamic programming.
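The hand computation can also be sanity-checked numerically. Here is a tiny sketch comparing the derived formulas for \(f(x,y) = x^2 + xy\) against central finite differences:

```javascript
// The walkthrough's hand-derived partials for f(x, y) = x^2 + xy,
// cross-checked against central finite differences at (x, y) = (2, 3).
const f  = (x, y) => x * x + x * y;
const fx = (x, y) => 2 * x + y; // both paths through x, summed
const fy = (x, y) => x;

const h = 1e-6;
const fx_fd = (f(2 + h, 3) - f(2 - h, 3)) / (2 * h); // ≈ 7
const fy_fd = (f(2, 3 + h) - f(2, 3 - h)) / (2 * h); // ≈ 2
```

Since \(f\) is quadratic, the central differences match the analytic partials up to floating-point roundoff.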
Traversing the graph in reverse topological order means we only have to evaluate each vertex once—and re-use its derivative everywhere else it shows up. Despite having to express \(f\) as a computational graph and traverse both forward and backward, the whole algorithm has the same time complexity as \(f\) itself. Space complexity, however, is a separate issue. Implementation We can implement backward mode autodiff using a similar approach to forward mode. Instead of making every operation use dual numbers, we can make each step add a node to our computational graph. function Const(n) { return {op: 'const', in: [n], out: undefined, grad: 0}; } function Add(x, y) { return {op: 'add', in: [x, y], out: undefined, grad: 0}; } function Times(x, y) { return {op: 'times', in: [x, y], out: undefined, grad: 0}; } Note that JavaScript will automatically store references within the in arrays, hence building a DAG instead of a tree. If we implement our function \(f\) in terms of these primitives, we can evaluate it on an input node to automatically build the graph. let in_node = {op: 'const', in: [/* TBD */], out: undefined, grad: 0}; let out_node = f(in_node); The forward pass performs a post-order traversal of the graph, translating inputs to outputs for each node. Remember we’re operating on a DAG: we must check whether a node is already resolved, lest we recompute values that could be reused. function forward(node) { if (node.out !== undefined) return; if (node.op === 'const') { node.out = node.in[0]; } else if (node.op === 'add') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out + node.in[1].out; } else if (node.op === 'times') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out * node.in[1].out; } } The backward pass is conceptually similar, but a naive pre-order traversal would end up tracing out every path in the DAG—every gradient has to be pushed back to the roots.
Instead, we’ll first compute a reverse topological ordering of the nodes. This ordering guarantees that when we reach a node, everything “downstream” of it has already been resolved—we’ll never have to revisit it. function backward(out_node) { const order = topological_sort(out_node).reverse(); for (const node of order) { if (node.op === 'add') { node.in[0].grad += node.grad; node.in[1].grad += node.grad; } else if (node.op === 'times') { node.in[0].grad += node.in[1].out * node.grad; node.in[1].grad += node.in[0].out * node.grad; } } } Finally, we can put our functions together to compute \(f(x)\) and \(f'(x)\): function evaluate(x, in_node, out_node) { in_node.in = [x]; forward(out_node); out_node.grad = 1; backward(out_node); return [out_node.out, in_node.grad]; } Just remember to clear all the out and grad fields before evaluating again! Limitations If \(f\) is a function of multiple variables, we can simply read the gradients from the corresponding input nodes. That means we’ve computed a whole row of \(\nabla f\). Of course, if the gradient has many rows and few columns, forward mode would have been more efficient. \[\begin{align*} \nabla f &= \begin{bmatrix} \vphantom{\Big|} f_{0_a}(a,\dots,n) & \dots & f_{0_n}(a,\dots,n) \\ \vphantom{\Big|} f_{1_a}(a,\dots,n) & \dots & f_{1_n}(a,\dots,n) \end{bmatrix} \begin{matrix} \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 1} \\ \left.\vphantom{\Big| f_{0_a}(a,\dots,n)}\right\} \text{Pass 2} \end{matrix} \end{align*}\] Unfortunately, backward mode comes with another catch: we had to store the intermediate result of every single computation inside \(f\)! If we’re passing around substantial chunks of data, say, weight matrices for a neural network, storing the intermediate results can require an unacceptable amount of memory and memory bandwidth.
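A quick aside before digging into memory usage: the backward pass above calls topological_sort without defining it. One minimal sketch is a post-order depth-first traversal with a visited set; the constructors and passes are repeated here so the example runs end to end:

```javascript
// Constructors and passes repeated from the article for self-containment:
function Const(n)    { return { op: 'const', in: [n],    out: undefined, grad: 0 }; }
function Add(x, y)   { return { op: 'add',   in: [x, y], out: undefined, grad: 0 }; }
function Times(x, y) { return { op: 'times', in: [x, y], out: undefined, grad: 0 }; }

function forward(node) {
  if (node.out !== undefined) return;
  if (node.op === 'const') { node.out = node.in[0]; return; }
  node.in.forEach(forward);
  node.out = (node.op === 'add') ? node.in[0].out + node.in[1].out
                                 : node.in[0].out * node.in[1].out;
}

// One possible topological_sort: post-order DFS with a visited set. Inputs are
// pushed before the nodes that consume them, so the array runs from leaves to
// the output; backward() reverses it to get output-first order.
function topological_sort(out_node) {
  const order = [], visited = new Set();
  (function visit(node) {
    if (visited.has(node)) return;
    visited.add(node);
    if (node.op !== 'const') node.in.forEach(visit);
    order.push(node);
  })(out_node);
  return order;
}

function backward(out_node) {
  for (const node of topological_sort(out_node).reverse()) {
    if (node.op === 'add') {
      node.in[0].grad += node.grad;
      node.in[1].grad += node.grad;
    } else if (node.op === 'times') {
      node.in[0].grad += node.in[1].out * node.grad;
      node.in[1].grad += node.in[0].out * node.grad;
    }
  }
}

// f(x, y) = x^2 + xy at (2, 3): primal 10, gradient (7, 2).
const x = Const(2), y = Const(3);
const out = Add(Times(x, x), Times(x, y));
forward(out);
out.grad = 1;
backward(out);
// out.out === 10, x.grad === 7, y.grad === 2
```

Note that the gradients match the walkthrough: \(x\) accumulates contributions from both multiplication nodes, giving 7.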
If \(f\) contains loops, it’s especially bad—because every value is immutable, naive loops will create long chains of intermediate values. For this reason, real-world frameworks tend to encapsulate loops in monolithic parallel operations that have analytic derivatives. Many engineering hours have gone into reducing space requirements. One problem-agnostic approach is called checkpointing: we can choose not to store intermediate results at some nodes, instead re-computing them on the fly during the backward pass. Checkpointing gives us a natural space-time tradeoff: by strategically choosing which nodes store intermediate results (e.g. ones with expensive operations), we can reduce memory usage without dramatically increasing runtime. Even with checkpointing, training the largest neural networks requires far more fast storage than is available to a single computer. By partitioning our computational graph between multiple systems, each one only needs to store values for its local nodes. Unfortunately, this implies edges connecting nodes assigned to different processors must send their values across a network, which is expensive. Hence, communication costs may be minimized by finding min-cost graph cuts. Graphs and Higher-Order Autodiff Earlier, we could have computed primal values while building the graph, i.e. during the evaluation of \(f\) itself. Frameworks like PyTorch and TensorFlow take this approach—evaluating \(f\) both builds the graph (also known as the ‘tape’) and evaluates the forward-pass results. The user may call backward at any point, propagating gradients to all inputs that contributed to the result. However, the forward-backward approach can limit the system’s potential performance. The forward pass is relatively easy to optimize via parallelizing, vectorizing, and distributing graph traversal. The backward pass, on the other hand, is harder to parallelize, as it requires a topological traversal and coordinated gradient accumulation.
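To make the checkpointing idea concrete, here is a sketch of how it could bolt onto the toy framework above. The ephemeral flag, free_ephemeral, and primal are hypothetical additions, not part of the article's code: flagged nodes discard their primal value after the forward pass, and it is recomputed on demand.

```javascript
// Constructors and forward pass repeated from the article for self-containment:
function Const(n)    { return { op: 'const', in: [n],    out: undefined, grad: 0 }; }
function Add(x, y)   { return { op: 'add',   in: [x, y], out: undefined, grad: 0 }; }
function Times(x, y) { return { op: 'times', in: [x, y], out: undefined, grad: 0 }; }

function forward(node) {
  if (node.out !== undefined) return;
  if (node.op === 'const') { node.out = node.in[0]; return; }
  node.in.forEach(forward);
  node.out = (node.op === 'add') ? node.in[0].out + node.in[1].out
                                 : node.in[0].out * node.in[1].out;
}

// Hypothetical checkpointing extension: drop the stored primal value of any
// node flagged `ephemeral` once the forward pass is done.
function free_ephemeral(node) {
  if (node.op === 'const' || node.out === undefined) return;
  node.in.forEach(free_ephemeral);
  if (node.ephemeral) node.out = undefined;
}

// A backward pass would then read primal(n) instead of n.out, recomputing
// dropped values on the fly: trading time for memory.
function primal(node) {
  if (node.out === undefined) forward(node);
  return node.out;
}

// Example: mark the square node ephemeral in f(x) = x^2 + x at x = 2.
const x = Const(2);
const sq = Times(x, x); sq.ephemeral = true;
const out = Add(sq, x);
forward(out);        // out.out === 6, sq.out === 4
free_ephemeral(out); // sq's primal value is dropped
```

Choosing which nodes to flag is the strategic part: dropping cheap-to-recompute values saves memory at little cost, while expensive ones are worth keeping.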
Furthermore, the backward pass lacks some mathematical power. While computing a specific derivative is easy, we don’t get back a general representation of \(\nabla f\). If we wanted the gradient of the gradient (the Hessian), we’re back to relying on numerical differentiation. Thinking about the gradient as a higher-order function reveals a potentially better approach. If we can represent \(f\) as a computational graph, there’s no reason we can’t also represent \(\nabla f\) in the same way. In fact, we can simply add nodes to the graph of \(f\) that compute derivatives with respect to each input. Because the graph already computes primal values, each node in \(f\) only requires us to add a constant number of nodes in the graph of \(\nabla f\). That means the result is only a constant factor larger than the input—evaluating it requires exactly the same computations as the forward-backward algorithm. For example, given the graph of \(f(x) = x^2 + x\), we can produce the following: Defining differentiation as a function on computational graphs unifies the forward and backward passes: we get a single graph that computes both \(f\) and \(\nabla f\). That means we can work with higher order derivatives by applying the transformation again! Even better, distributed training is easier as we no longer have to worry about synchronizing gradient updates across multiple systems. JAX implements this approach, enabling its seamless gradient, JIT compilation, and vectorization transforms. PyTorch also supports higher-order differentiation by including backward operations in the computational graph, and functorch provides a JAX-like API. De-blurring an Image Let’s use our fledgling differentiable programming framework to solve a real optimization problem: de-blurring an image. We’ll assume our observed image was computed using a simple box filter, i.e., each blurred pixel is the average of the surrounding 3x3 ground-truth pixels.
Of course, a blur loses information, so we won’t be able to reconstruct the exact input—but we can get pretty close! \[\text{Blur}(\text{Image})_{xy} = \frac{1}{9} \sum_{i=-1}^1 \sum_{j=-1}^1 \text{Image}_{(x+i)(y+j)}\] Ground Truth Image Observed Image We’ll need to add one more operation to our framework: division. The operation and forward pass are much the same as addition and multiplication, but the backward pass must compute \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) for \(f = \frac{x}{y}\). function Divide(x, y) { return {op: 'divide', in: [x, y], out: undefined, grad: 0}; } // Forward... if(node.op === 'divide') { forward(node.in[0]); forward(node.in[1]); node.out = node.in[0].out / node.in[1].out; } // Backward... if(node.op === 'divide') { node.in[0].grad += node.grad / node.in[1].out; node.in[1].grad += (-node.grad * node.in[0].out / (node.in[1].out * node.in[1].out)); } Before we start programming, we need to express our task as an optimization problem. That entails minimizing a loss function that measures how far away we are from our goal. Let’s start by guessing an arbitrary image—for example, a solid grey block. We can then compare the result of blurring our guess with the observed image. The farther our blurred result is from the observation, the larger the loss should be. For simplicity, we will define our loss as the total squared difference between each corresponding pixel. \[\text{Loss}(\text{Blur}(\text{Guess}), \text{Observed}) = \sum_{x=0}^W\sum_{y=0}^H (\text{Blur}(\text{Guess})_{xy} - \text{Observed}_{xy})^2\] Using differentiable programming, we can compute \(\frac{\partial \text{Loss}}{\partial \text{Guess}}\), i.e. how changes in our proposed image change the resulting loss. That means we can apply gradient descent to the guess, guiding it towards a state that minimizes the loss function. Hopefully, if our blurred guess matches the observed image, our guess will match the ground truth image.
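One more check before writing the full loss: the Divide backward rule above is just the quotient rule, which says \(\partial f/\partial x = 1/y\) and \(\partial f/\partial y = -x/y^2\) for \(f = x/y\). Central finite differences agree:

```javascript
// Central-difference check of the quotient rule used in Divide's backward
// pass. At (x, y) = (6, 3) the analytic values are 1/3 and -2/3.
const f = (x, y) => x / y;
const h = 1e-6;
const dfdx = (f(6 + h, 3) - f(6 - h, 3)) / (2 * h); // ≈ 1/3
const dfdy = (f(6, 3 + h) - f(6, 3 - h)) / (2 * h); // ≈ -2/3
```

This kind of finite-difference spot check is a cheap way to catch sign and ordering mistakes whenever a new operation is added to the framework.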
Let’s implement our loss function in differentiable code. First, create the guess image by initializing the differentiable parameters to solid grey. Each pixel has three components: red, green, and blue. let guess_image = new Array(W*H*3); for (let i = 0; i < W * H * 3; i++) { guess_image[i] = Const(127); } Second, apply the blur using differentiable operations. let blurred_guess_image = new Array(W*H*3); for (let x = 0; x < W; x++) { for (let y = 0; y < H; y++) { let [r,g,b] = [Const(0), Const(0), Const(0)]; // Accumulate pixels for averaging for (let i = -1; i <= 1; i++) { for (let j = -1; j <= 1; j++) { // Convert 2D pixel coordinate to 1D row-major array index const xi = clamp(x + i, 0, W - 1); const yj = clamp(y + j, 0, H - 1); const idx = (yj * W + xi) * 3; r = Add(r, guess_image[idx + 0]); g = Add(g, guess_image[idx + 1]); b = Add(b, guess_image[idx + 2]); } } // Set result to average const idx = (y * W + x) * 3; blurred_guess_image[idx + 0] = Divide(r, Const(9)); blurred_guess_image[idx + 1] = Divide(g, Const(9)); blurred_guess_image[idx + 2] = Divide(b, Const(9)); } } Finally, compute the loss using differentiable operations. let loss = Const(0); for (let x = 0; x < W; x++) { for (let y = 0; y < H; y++) { const idx = (y * W + x) * 3; let dr = Add(blurred_guess_image[idx + 0], Const(-observed_image[idx + 0])); let dg = Add(blurred_guess_image[idx + 1], Const(-observed_image[idx + 1])); let db = Add(blurred_guess_image[idx + 2], Const(-observed_image[idx + 2])); loss = Add(loss, Times(dr, dr)); loss = Add(loss, Times(dg, dg)); loss = Add(loss, Times(db, db)); } } Calling forward(loss) performs the whole computation, storing results in each node’s out field. Calling backward(loss) computes the derivative of loss at every node, storing results in each node’s grad field. Let’s write a simple optimization routine that performs gradient descent on the guess image.
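The optimization routine will need to clear stale out and grad fields between iterations; the article relies on a reset helper without defining it. A minimal sketch, assuming the node layout above (constructors and forward are repeated so the example runs):

```javascript
// Constructors and forward pass repeated from the article for self-containment:
function Const(n)    { return { op: 'const', in: [n],    out: undefined, grad: 0 }; }
function Add(x, y)   { return { op: 'add',   in: [x, y], out: undefined, grad: 0 }; }
function Times(x, y) { return { op: 'times', in: [x, y], out: undefined, grad: 0 }; }

function forward(node) {
  if (node.out !== undefined) return;
  if (node.op === 'const') { node.out = node.in[0]; return; }
  node.in.forEach(forward);
  node.out = (node.op === 'add') ? node.in[0].out + node.in[1].out
                                 : node.in[0].out * node.in[1].out;
}

// Clear cached primal values and gradients so the graph can be re-evaluated.
// The out-field check doubles as memoization: an already-cleared node is
// skipped, so shared subgraphs in the DAG are visited only once.
function reset(node) {
  if (node.out === undefined) return;
  node.out = undefined;
  node.grad = 0;
  if (node.op !== 'const') node.in.forEach(reset);
}

// Example: f(x) = x^2 + x at x = 2.
const x = Const(2);
const out = Add(Times(x, x), x);
forward(out);  // out.out === 6
out.grad = 1;  // pretend we ran a backward pass
reset(out);    // every out is undefined again, every grad is 0
```

The traversal mirrors forward's recursion, so each call is linear in the graph size.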
function gradient_descent_step(step_size) { // Clear output values and gradients reset(loss); // Forward pass forward(loss); // Backward pass loss.grad = 1; backward(loss); // Move parameters against the gradient for (let i = 0; i < W * H * 3; i++) { let p = guess_image[i]; p.in[0] -= step_size * p.grad; } } We’d also like to compute error, the squared distance between our guess image and the ground truth. We can’t use error to inform our algorithm—we’re not supposed to know what the ground truth was—but we can use it to measure how well we are reconstructing the image. For the current iteration, we visualize the guess image, the guess image after blurring, and the gradient of loss with respect to each pixel. After running several steps, you’ll notice that even though loss goes to zero, error does not: the loss function does not provide enough information to exactly reconstruct the ground truth. We can also see that optimization behavior depends on the step size—small steps require many iterations to converge, and large steps may overshoot the target, oscillating between too-dark and too-bright images. Further Reading If you’d like to learn more about differentiable programming in ML and graphics, check out the following resources: JAX, PyTorch, TensorFlow, tinygrad, TinyAD Physics-Based Deep Learning, Differentiable Rendering Intro, Differentiable Rendering Papers, Neural Fields in Visual Computing CMU Deep Learning SystemsLake Tahoe2022-06-10T00:00:00+00:002022-06-10T00:00:00+00:00https://thenumb.at/Lake%20Tahoe<p>Incline Village, NV, 2022</p>
<div style="text-align: center;"><img src="/assets/tahoe/00.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/01.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/02.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/03.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/04.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/05.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/06.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/07.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/08.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/09.webp" /></div>
<div style="text-align: center;"><img src="/assets/tahoe/10.webp" /></div>
<p></p>Incline Village, NV, 2022