Stream of Thoughts

Momentary fascinations & inklings of ideas

Hello!
This is my place for messy drafts and other early-stage writing. Feel free to poke around, or go to the home page.

Recent Notes

Enter the Wavelet Matrix

Imagine you have an array of integers and want to preprocess it so that you can answer quantile queries, such as computing the median or the 99th percentile, in constant time.

One possible solution is to sort the array and then answer the quantile queries by looking up the relevant entries in the array.
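As a sketch (function names are my own), this preprocessing is just a sort, after which any quantile is a single index lookup:

```python
def preprocess(arr):
    # One-time O(n log n) work: keep a sorted copy of the array.
    return sorted(arr)

def quantile(sorted_arr, q):
    # Nearest-rank lookup for q in [0, 1], answered in O(1).
    k = min(int(q * len(sorted_arr)), len(sorted_arr) - 1)
    return sorted_arr[k]

s = preprocess([5, 1, 9, 3, 7])
quantile(s, 0.5)   # the median
```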

But what if, instead, I asked you to preprocess the array for range quantile queries, where rather than computing the quantile over the entire array, you want to compute quantiles for any sub-slice arr[i..j] in constant time?

Computing range quantiles involves combining information from both the original order and the sorted order, so sorting the entire sequence in advance no longer suffices. And the quickselect algorithm, which is often used for computing specific quantiles directly from an unsorted array, does not help us either, since its runtime grows with the length of the slice rather than being constant.

Enter the wavelet matrix, a data structure that represents the process of sorting.

The wavelet matrix reorganizes the data in the array in a way that allows answering range quantile queries, and a lot more, in constant time.

This all sounds very attractive, but there are some trade-offs.

The main downsides are that the data representation has some space overhead, on the order of 10-20% of the space required to store the bits of the integers themselves, as well as a high constant-time factor due to the distributed nature of the data representation.

Internally, the wavelet matrix is composed of an array of bit vectors, one per bit plane of the original integers. This means the bits of any specific element are scattered throughout memory and must be reassembled as queries execute, which hurts performance due to poor data locality and somewhat unpredictable access patterns. Batching queries can alleviate this somewhat.
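To make the bit-plane idea concrete, here is a toy decomposition (my own naming, and just the raw planes, not the actual wavelet matrix layout): each plane holds one bit of every element, so recovering a single element means gathering one bit from each plane.

```python
def bit_planes(values, bits):
    # One bit vector per plane, most significant bit first; the bits of a
    # single element end up spread across all of the planes.
    return [[(v >> b) & 1 for v in values] for b in reversed(range(bits))]

def reassemble(planes, i):
    # Recover element i by gathering its bit from every plane.
    v = 0
    for plane in planes:
        v = (v << 1) | plane[i]
    return v

planes = bit_planes([5, 1, 6], 3)   # 5=101, 1=001, 6=110
reassemble(planes, 2)               # -> 6
```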

I’d like to explain more about the mechanics of how this thing works in the future, but for now there’s a great paper by Gonzalo Navarro, one of the co-discoverers of the wavelet matrix, which serves as an introduction to the basic idea along with a whirlwind tour of the literature around it. The article is called Wavelet Trees for All because the wavelet matrix is actually an optimized data representation for a conceptual tree structure called the wavelet tree. The word wavelet hints at the fact that the integers are decomposed into something like their frequency components, but no actual wavelets are used in the process of making a wavelet tree.

I’ll leave you with a picture I made a while back of an algorithm that constructs a wavelet tree through a series of incremental stable sorts.
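A minimal sketch of that construction idea (names are mine, Python for illustration): each level records one bit per element in the current arrangement, and the arrangement is then refined by a stable sort on one more leading bit, ending with the array fully sorted.

```python
def wavelet_tree_levels(values, bits):
    # Level k records bit (bits-1-k) of every element, with the elements
    # stably sorted by their k most significant bits (level 0 = input order).
    levels = []
    order = list(values)
    for b in reversed(range(bits)):
        levels.append([(v >> b) & 1 for v in order])
        # Incremental stable sort: refine the arrangement by one more bit.
        # Python's sorted() is stable, so ties keep their relative order.
        order = sorted(order, key=lambda v: v >> b)
    return levels, order  # `order` ends up fully sorted

wavelet_tree_levels([2, 1, 3, 0], 2)
```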



TIL: Gilbert Curve

The Hilbert Curve is a space-filling fractal: in the limit, it fills any power-of-two-sized square with an increasingly dense squiggle.
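For reference, here is a sketch of the classic iterative index-to-coordinate mapping for a power-of-two Hilbert curve (the well-known rotate-and-flip formulation; naming is mine):

```python
def hilbert_d2xy(n, d):
    # Map distance d along the curve to (x, y) in an n x n grid,
    # where n is a power of two.
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant.
                x, y = s - 1 - x, s - 1 - y
            # Rotate by swapping axes.
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

[hilbert_d2xy(2, d) for d in range(4)]  # -> [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Consecutive indices always land on adjacent cells, which is what makes the curve useful for locality-preserving layouts.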

The Gilbert Curve is a generalization of the Hilbert curve to rectangular spaces. The main code is in Python but there are also implementations in JavaScript and C.

Related: A very fast C++ Hilbert curve implementation in 2D and 3D (details here and here).


LLMs Make Context More Valuable

Reader-side LLMs enable information filtering and synthesis to be done on behalf of the reader.

This increases the value of curated contextual information, which can now be sifted through efficiently, and automatically integrated with the main text.

LLMs have issues with provenance and reliability, though, which is why I’m interested in author-provided context, which not only serves as a trust signal but also tames hallucinations, making catastrophic failure less likely.

Existing media formats have lightweight interaction mechanisms such as footnotes and annotations which augment a text with additional information. However, these interactions are usually limited to a small set of pre-determined annotations and footnotes, and the author likely had ideas they decided were not worth incorporating into the final work.

Similarly, commented code only contains a fragment of its relevant context and rarely explains every design decision. Some design decisions can be inferred from the surrounding context of the code itself, but other information cannot be inferred, and has to come from knowledge of the broader system and its environment.

I wonder what might be a good testing ground for these ideas. One thought is that large programming projects often have public design discussions about the broader tradeoffs involved in particular architectural or feature decisions. And as far as other kinds of writing, the Bible is probably one of the most analyzed and annotated pieces of text in human history, though that one’s a bit out of my area of expertise…

See also: Advanced Essay


TIL: Splash Color Format

Splash is a color format that reduces the space of choices in a playful way and lends itself to interesting UI design possibilities.

Each color is specified as a three-digit number with one digit per color channel, for a total of one thousand possible colors. For example, 407 stands for 4/9 units of red, 0/9 units of green, and 7/9 units of blue.
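A quick sketch of how such a code might map to standard 8-bit RGB (my own interpretation, scaling each digit by 255/9):

```python
def splash_to_rgb(code):
    # "407" -> each digit d becomes round(d / 9 * 255).
    return tuple(round(int(d) / 9 * 255) for d in f"{code:03d}")

splash_to_rgb(407)  # -> (113, 0, 198)
```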


Voice Note Browser

I like thinking through ideas on a walk and sometimes record voice notes to myself as I go. The result is a collection of related notes, and I wanted a way to see their transcripts together on one page. But to my surprise, I couldn’t find any apps that show more than a single transcript at a time.

So I built a little web app to do this using the newly-added transcripts in Apple’s Voice Memos, showing the transcripts from the last couple of days’ worth of notes in a simple list.

This came together quickly and I’m really happy with it – it’s nice when something so simple can be so useful.

A few ideas for future work: