Made of Bits
Notes on compact data structures, with applications to interactive data exploration
Status: Braindump.
These are unedited notes, written with the purpose of getting all of these thoughts out of my head and onto the page.
I hope this outline will serve as the raw material for a more intentionally structured write-up in the future.
These notes are not intended to be read by anyone other than me. I'm publishing them to help me get past my tendency to not publish incomplete work and on the off chance that someone might find them interesting.
Readers may be interested in the companion JavaScript library I've been working on, which currently contains well-commented & tested implementations of
A dense bit vector representation
A sparse bit vector representation based on a sorted array
A sparse bit vector representation based on Elias-Fano encoding
A run-length-encoded bit vector representation
A multiset implementation based on the above
A wavelet matrix implementation based on the above
All of these data structures come with fast rank & select operations.
These notes chronicle my explorations of compressed data structures and their use in enabling highly interactive, low-latency exploration of relatively large datasets.
Much of this was dictated and is therefore longer than it would be otherwise. If I had more time, I would have written a shorter letter. :)
The data structures I learned about all centered on the fundamental concept of a bitvector equipped with fast rank and select operations. A bit vector can be thought of as a representation of a set of integers from a fixed-size universe.
Universe 0 1 2 3 4 5 6 7
┌───┬───┬───┬───┬───┬───┬───┬───┐
Bit vector │ 0 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │
└───┴───┴───┴───┴───┴───┴───┴───┘
Set of integers 1 4 5 7
represented by
the bit vector
On top of these, one can build representations of multi-sets where each element can be repeated an arbitrary number of times, rather than just 0 times or 1 time.
Another way to think of a bit vector is simply as a sequence over a finite alphabet with two symbols, 0 and 1. Through this lens, there is a quite powerful generalization to larger alphabets that maintains the ability to do efficient rank and select queries. These data structures are the wavelet tree and its more space-efficient cousin, the wavelet matrix.
All of these ideas are being developed in an associated JavaScript library, which I ported to the web from an earlier prototype written in Rust.
Bit vectors with rank & select
Conceptually, a bit vector represents a universe of size N that is typically thought of as representing the integers from 0 to N-1.
It is often quite useful to think of this set as a sorted array, giving order to the elements of the set, which do not intrinsically have an order.
Every element in that universe is either a 0-bit or a 1-bit.
There are many ways to represent the concept of a bitvector in software, but the (theoretically) fundamental operations are in common across all of them.
Two fundamental operations are called rank and select, and they are inverses of each other.
Rank
You can think of a bit vector as representing a set of integers.
One natural question to ask is, for some input i, how many integers in the set are less than i? This question is called a rank query because the answer to this question tells you the rank of that number in the set if all of its elements were sorted in increasing order.
For example, for our set diagrammed earlier:
Universe 0 1 2 3 4 5 6 7
┌───┬───┬───┬───┬───┬───┬───┬───┐
Bit vector │ 0 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │
└───┴───┴───┴───┴───┴───┴───┴───┘
Set of integers 1 4 5 7
represented by
the bit vector
the rank₁ of the number 6 is 3, because there are 3 elements in the set that are less than 6.
The rank operation is in fact a family of operations, one per symbol capable of being represented by the bit vector, ie. 0 and 1.
So we have rank₁(i) which tells us the number of 1-bits preceding i, and rank₀(i) which tells us the number of 0-bits preceding i.
It is worth noting that you can compute either one from the other: for any i within the universe, rank₁(i) = i - rank₀(i).
rank is defined for inputs outside the universe, eg. in the above example rank₁(1000) is 4 since there are 4 elements less than 1000. Similarly, rank₁(0) = rank₀(0) = 0 since, by our definitions, no bitvector can contain values less than zero.
Select
Returning to the sorted array idea, the set above can be thought of as the sorted array [1, 4, 5, 7].
The select operation is equivalent to array indexing – select(n) returns the n-th element from this sorted set. For our running example, select₁(0) = 1, select₁(1) = 4, select₁(2) = 5, and select₁(3) = 7.
Select₁ is undefined outside of [0, count₁), where count₁ is the number of 1-bits (and similarly for select₀). In my libraries, I represent this with an option type in Zig and Rust, and in JS with integer-or-null trySelect1/0 functions representing the option use case, plus select1/0 functions which throw an error in the null case, since it is quite common that the caller knows the value will be in bounds.
Rank & select are (more or less) inverses
For valid inputs to select, rank₁(select₁(n)) = n. No + 1 adjustment is needed: rank₁(i) counts the 1-bits at positions less than i, and select₁(n) returns the position of the n-th 1-bit, which has exactly n 1-bits before it.
select₁(rank₁(n)) = n if n is in the set, and is the next-largest element in the set otherwise (or null if there is no larger element, since in that case rank returns the total count, which is not a valid input for select). In our running example, the select of the rank of 5 is 5, the select of the rank of 6 is 7, and the select of the rank of 7 is 7.
Edge cases
The definition of rank as returning the number of elements less than its argument, rather than less than or equal to, is a choice I've made after exploring both alternatives.
I've found it more useful than the "inclusive" alternative. The software industry has mostly standardized on half-open [a, b) indexing, and the rank function is similar to that.
One constraint is that in the dense bitvector case, you want the output of rank to be representable in the integer size you are using. For these data structures, it is not uncommon to want to have, for example, a 4 billion element set. So the number of 1s in the maximally large 32-bit bit vector should be representable.
To handle this edge case it is best to allow bit vectors with sizes up to N = 2^32 - 1 which represents the integers in [0, N), ie. capable of representing the presence of any integer up to 2^32-2.
Remember that 2^32-1 is the maximum value representable by a 32-bit unsigned integer, so this choice allows asking "how many 1-bits are there in this bit vector?" as rank1(2^32-1), which will return a value from 0 to 2^32-1, inclusive, since there are up to 2^32-1 individually addressable bits.
For select, I've found that having it index from zero (in analogy with array indexing) to be an ergonomic choice. There may be good arguments in the other direction, but in practice I've found this works well.
For the purposes of the wavelet matrix, I have found there to be a use case for inclusive ranges, but those don't apply to individual bit vectors.
Misc. properties
If you think of the set of integers represented by a bit vector as a distribution, then the rank and select operations correspond to the fundamental operations on mathematical distributions known as the CDF and the quantile functions, which are also inverses of each other.
Because the rank function is effectively a way to access the prefix sum of the bits, we can count the 1-bits in any contiguous half-open range in constant time: sum(bv[i..j)) = rank1(j) - rank1(i).
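For example, assuming a bit vector object with a rank1 method as defined above (a tiny sketch, not the library's exact API):

// Number of 1-bits in the half-open index range [i, j).
function countOnesInRange(bv, i, j) {
  return bv.rank1(j) - bv.rank1(i);
}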
Representations
Sorted array
You can think about a bit vector as representing the set of integers for which the corresponding bit is 1. A set of integers can be thought of as the sorted list of its elements. Sorting gives us a canonical ordered representation for the set, which does not intrinsically have an order.
One way to represent this list is by using an array. This is something I've found useful for debugging purposes because it gives a really simple baseline implementation to compare against more sophisticated space-optimized versions.
It can also be a useful representation on its own if you just want bit vector semantics and are okay with logarithmic access times for rank.
In this representation, rank corresponds to doing a binary search on the array, and select is constant time since it is just an array access.
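Here is a minimal baseline sketch along those lines (a simplification for illustration; the names and details differ from the library's actual sorted-array implementation):

// A bit vector represented as the sorted array of its 1-bit positions.
class SortedArrayBitVec {
  constructor(sortedOnes, universeSize) {
    this.ones = sortedOnes; // ascending positions of the 1-bits
    this.universeSize = universeSize;
  }
  // Number of 1-bits strictly below i: binary search for the first 1-bit >= i.
  rank1(i) {
    let lo = 0, hi = this.ones.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ones[mid] < i) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
  // Number of 0-bits strictly below i.
  rank0(i) {
    return Math.max(0, Math.min(i, this.universeSize)) - this.rank1(i);
  }
  // Position of the n-th 1-bit (zero-indexed), or null when out of bounds.
  trySelect1(n) {
    return n >= 0 && n < this.ones.length ? this.ones[n] : null;
  }
}

// The running example: the set {1, 4, 5, 7} over the universe [0, 8).
const bv = new SortedArrayBitVec([1, 4, 5, 7], 8);
console.log(bv.rank1(6));      // 3
console.log(bv.trySelect1(2)); // 5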
Dense
This type of vector is good when you have a universe that is small enough to fit in RAM and want fast queries and are okay spending the memory to densely represent each bit. The ideal use case is of course when entropy is high.
A common strategy is to spend one bit of storage per element of the universe. This means that you're spending a bit's worth of storage on every element in the set, but you're also spending a bit on every element not in the set.
There are many schemes for representing a bit vector densely. One that I have implemented a few times and quite like for its elegance is described in the paper "Fast, Small, Simple Rank/Select on Bitmaps" (journal entry, pdf).
The idea is to store the bits themselves in an array of unsigned integer blocks (in the paper u8, but I use u32).
Then you build several sampled indexes on top of this basic structure – one for rank, and one each for select 0 and select 1.
If the two sampling rates are comparable, then I think it doesn't make much sense to try to speed up rank queries using the select blocks, though I should model this more carefully.
If the powers are all 10, then there are 1024 bits, or 32 32-bit blocks, between rank samples. That means that if you don't use the select blocks, you simply scan up to 32 blocks, linearly or using binary search. If you do use the select blocks, you retrieve the nearest select sample and then keep decoding samples until you overshoot, and then drop down to scanning the regular blocks.
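To make the layering concrete, here is a minimal sketch of a dense bit vector with a sampled rank index. This is a simplification for illustration, not the block/sample layout from the paper or from my library, and it omits the select samples entirely:

// Software popcount for a 32-bit integer.
function popcount(x) {
  x = x - ((x >>> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  return (((x + (x >>> 4)) & 0x0f0f0f0f) * 0x01010101) >>> 24;
}

class SimpleDenseBitVec {
  constructor(universeSize) {
    this.universeSize = universeSize;
    this.numBlocks = Math.ceil(universeSize / 32);
    this.blocks = new Uint32Array(this.numBlocks);
    this.rankSamples = null;
  }
  set(i) {
    this.blocks[i >>> 5] |= 1 << (i & 31);
  }
  // Build one cumulative rank sample per group of 32 blocks (ie. per 1024 bits).
  finish() {
    this.rankSamples = new Uint32Array((this.numBlocks >>> 5) + 1);
    let acc = 0;
    for (let b = 0; b < this.numBlocks; b++) {
      if ((b & 31) === 0) this.rankSamples[b >>> 5] = acc;
      acc += popcount(this.blocks[b]);
    }
    if ((this.numBlocks & 31) === 0) this.rankSamples[this.numBlocks >>> 5] = acc;
  }
  // Number of 1-bits at positions strictly less than i.
  rank1(i) {
    if (i <= 0) return 0;
    if (i > this.universeSize) i = this.universeSize;
    const blockIndex = i >>> 5;
    // Start from the nearest preceding rank sample, then scan whole blocks.
    let count = this.rankSamples[blockIndex >>> 5];
    for (let b = (blockIndex >>> 5) << 5; b < blockIndex; b++) count += popcount(this.blocks[b]);
    // Add the bits below position i within its partial block, if any.
    const rem = i & 31;
    if (rem !== 0) count += popcount(this.blocks[blockIndex] & (0xffffffff >>> (32 - rem)));
    return count;
  }
}

// eg. the running example again:
const dense = new SimpleDenseBitVec(8);
for (const x of [1, 4, 5, 7]) dense.set(x);
dense.finish();
console.log(dense.rank1(6)); // 3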
Another performance optimization idea I had (for the select-block scanning during rank queries) was that it may be possible to scan the select blocks without decoding, or with cheaper decoding (zeroing out the low blocks), since we only care about the block-aligned position for this initial scan.
I have a partial Zig implementation and a full, tested JavaScript implementation in the made-of-bits library.
I had some visualizations of this in a figma file.
Note: this notebook has a visualization of the dense bit vector that shows the positions of sampled rank & select blocks in a "swimlane"-style view with one row per block type, along with a row for the original bits, but it currently only runs when I have a local server going.
To me, there's something very elegant about the idea of combining the two forms of sampling together, one sampling evenly in the universe and the other sampling evenly in the value space.
It makes me wonder whether there is a similar idea that could be applied to making sense of long natural language texts where you sample evenly in terms of pace or the occurrence of a particular theme or word or character (person) as well as sampling regularly in terms of number of words or paragraphs and summarizing at each level to allow the user to dynamically navigate without missing too much between samples regardless of what it is they're interested in.
For rank, can we make a rank function that computes either rank0 or rank1 depending on an argument, but which does not use a branch to do this?
eg. fn rank(symbol, index) which does something like
const r1 = rank1(index)
return (1 - symbol) * index + (2 * symbol - 1) * r1
(so symbol = 1 yields r1 and symbol = 0 yields index - r1), but deals with overflow & stuff properly
Sparse: Elias-Fano Encoding
Elias-Fano encoding is a way to represent a sorted list of elements that spends bits only on the elements that are present, like an array, but takes advantage of the fact that the array is sorted to optimize storage space. It is a fixed-width encoding in that each element takes the same number of bits to represent. The number of bits each element takes depends on the number of elements n and the size of the universe u (ie. the size of the largest element, usually), and comes out to roughly ceil(log2(u/n)) + 2 bits per element.
I believe the weight that I'm using to pick the split point between the low and high bits is optimal, and all other ways, including those found in somewhat popular libraries, have failure modes in one case or another. It's not a big deal in practice, but I found that interesting. And it was worth diving in to really understand what was going on.
I wrote a small Mathematica notebook to explore my approach vs. one alternative. Screenshot below, since otherwise you'd need Mathematica to read the notebook. This illustrates that my approach performs better than the recommended approach from the paper. Each function returns the recommended split point, and `ef` returns the number of bits needed to store the resulting bitvector (lower is better):
Other split points are also possible if your primary objective is not to minimize space usage, but rather, say, to enable fast queries on just the high bits up to a particular level of refinement.
Another fine point is the use of binary search in the sparse rank1 operation. Some other libraries like simple-sds use a linear search on the low bits, but I elected for a logarithmic search. The reasoning that can justify linear search is that when the data are randomly distributed, the expected size of any given high-bits group is very small (since the data are evenly scattered across groups in expectation), so the cost of a linear search is low. But in a world where the data exhibits clustering, a particular group of low bits might get very, very large. A binary search approach prevents this worst case from damaging the runtime, at the expense of potentially being slightly slower on randomly-distributed data on average. Pasting from an email I sent wrt. run-length-encoded bitvectors:
(N is the universe size, n is the number of elements, H is the high bits vector)
For any value of N/n, H has n 1-bits and between n and 2n 0-bits (ie. number of high blocks).
So on average, there is never more than 1 element in a high block, regardless of the ratio N/n.
In addition, because our n values are uniformly distributed, they are effectively assigned to blocks at random.
This means that the resulting distribution of block counts can be modeled by a multinomial distribution with n trials, and between n and 2n equally probable categories (representing blocks).
Which I think means that the average theoretical performance of rank does not decrease as N/n gets large.
I noticed a remark in a paper ("On Elias-Fano for Rank Queries in FM-Indexes") that seems to suggest the same:
The keys have to be evenly distributed into buckets for the average bucket size to be O(1). In general, of course, we cannot guarantee this of arbitrary data.
I wonder how much slowdown there is due to the fact that this heuristic is widely deployed in commonly-used succinct data structure libraries (at least sdsl and simple-sds).
Old notes from a previous post draft:
Check out elias-fano-2.pdf! I found it referenced here.
Bits of these first two seem absolutely worth borrowing for the writeup:
Elias-Fano Rank Analysis: an analysis of the efficiency of the rank op with linear and binary search.
Notes on the Elias-Fano Encoding
some visualizations, and a discussion of space usage and worst case, and stars & bars on which we should absolutely have a section.
also contains an impl of the binomial coefficient.
should have an assertion for maximum n/k size for which results are correct
Elias Fano Encoding: misc. notes
This implements a sparse bit vector in Elias-Fano encoding.
Elias Fano encoding works by taking advantage of the sorted structure of its elements.
One property of this encoding is that its size depends only on the number of elements and the maximum element, so you can compute the space usage from those two numbers alone: it is not data-dependent beyond knowing those two facts.
Another cool fact is that it supports repetition, so you can put the same bit in more than one time. In other words, it is capable of representing a multiset.
We should talk about how this gets us close to the optimal space usage, but first we should discuss what it even means for the space usage to be optimal. What is the optimal space for a sorted sequence?
You can think of it as treating all orderings of the same elements as equivalent. The information-theoretic lower bound for an n-element subset of a universe of size u is log2(binomial(u, n)) bits: roughly the cost of writing down n independent elements, minus the log2(n!) orderings we no longer distinguish.
bar chart-style vis of multi bit vec, ie. stack the same-value elements vertically and use the horizontal axis to indicate the elements of the universe as in the picture at the very top of this post.
Here we should have a discussion about the meaning of the bit vector operations in the face of multiplicity or link to the section containing that discussion.
Effectively, it groups the numbers by their high bits and stores those in a different encoding from the low bits which are stored (uncompressed) as they appear.
Maybe show numbers horizontally since that’s how figures are written out in the EF split point vis.
One way to choose this flip point is to optimize for space.
One way to do this is to look at what we have to pay for a given split point.
If our high part contains k bits, then there are 2^k groups. The high bits are encoded in a unary code where each group is represented by its count. This means there are 2^k 0-bits (one separator per group) and *n* 1-bits (one per element).
The number of 1 bits in the high bits is the same regardless of the split point, since it's just one per element. It's the number of zeros – of groups – that change.
Consider what happens if we move the split point down by 1 bit (increasing the number of high bits and reducing the number of low bits). This means changing the number of groups from 2^k to 2^(k+1), i.e. increasing the number of groups by 2^k (since 2^(k+1) = 2 * 2^k).
By doing this we've added 2^k bits to the high bits, and removed *n* bits from the low bits (since each entry now has to store 1 less bit per element). So we want to do this only if the number of elements is greater than 2^k, since then we reduce the total number of required bits.
If we move the split point up by one bit (reducing the number of high bits and increasing the number of low bits), we halve the number of 0-bits in the high bit vector (ie. remove 2^k / 2 = 2^(k-1) bits), and increase the number of bits in the low bits by *n*.
I think something I missed in the analysis above is how this all relates to the universe size (or the 'practical' universe size, ie. the maximum 1-bit index plus one).
Basic fact: We need to tile the universe with high-bit groups.
The universe size stays constant regardless of how large we make each group.
Fewer groups means fewer "separators".
if we think of this in terms of *u* and *n*, how does *u* fit into it? And what happens when *u* is not a power of two?
One way to explain the process of picking the split point is the following.
Imagine we have more elements than our universe size.
In that scenario, it costs us less to simply have the number of separators equal the size of the universe than it would cost us to include even a single bit in the low bits, since that would be *n* (which is > *u*).
The next decision point, if n is less than u, comes at half the size of the universe.
When the number of elements is a power of 2, ...
Imagine we have 5-bit numbers. [maybe we can make the 5 Tangle-like draggable]
Discuss the choice between using a linear scan versus binary search in the course of a rank operation and how it depends on the data distribution (linear is fine if the data is random, but the worst case can be much worse): this is the rank analysis notebook. Cite the same paper that we cited in my run-length bitvector work.
One question is what the space-optimal split points between the high and low bits.
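Putting the cost accounting above into code: a minimal brute-force sketch that just tries every possible number of low bits and keeps the cheapest. This is a sketch of the idea only, not necessarily the exact weighting my library uses:

// Total size in bits of an Elias-Fano encoding with the given number of low bits:
// the high bits take n 1-bits plus one 0-bit separator per group, and the low
// bits take numLowBits bits per element.
function efTotalBits(universeSize, numElements, numLowBits) {
  const numGroups = Math.ceil(universeSize / 2 ** numLowBits);
  return (numElements + numGroups) + numElements * numLowBits;
}

// Return the number of low bits that minimizes the total size.
function bestSplitPoint(universeSize, numElements) {
  const maxLowBits = Math.ceil(Math.log2(Math.max(2, universeSize)));
  let best = 0;
  for (let lowBits = 1; lowBits <= maxLowBits; lowBits++) {
    if (efTotalBits(universeSize, numElements, lowBits) <
        efTotalBits(universeSize, numElements, best)) {
      best = lowBits;
    }
  }
  return best;
}

// eg. bestSplitPoint(1_000_000, 1_000) === 9, ie. 9 low bits per element.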
References
Vigna has an EF here:
Wikipedia.
Folly
The simple-sds library by Jouni Sirén
On Elias-Fano for Rank Queries in FM-Indexes for its size determination strategy (which we do not use but was an input)
Run-length-encoded
I came up with a scheme for run-length-encoded bit vectors that has favorable big-O scaling for rank queries versus the existing alternatives.
This idea was developed together with the co-author of the original wavelet matrix paper, who is also the author of the book Compact Data Structures, and we ended up writing a paper about it: Weighted Range Quantile Queries.
A run-length-encoded bitvector can be viewed as a series of runs, where each run of zeros (which we can represent by a Z) is followed by a run of ones (represented by an O). The bit vector itself can then be modeled as the regular expression (ZO)*.
The basic idea is to store two sparse bit vectors: One for the end positions of each zero-run, and another one for the end positions of each ZO, ie. the end position of each one-run. This is contrasted with the solution suggested in Gonzalo Navarro's book, which stores the zero-run positions and the one-run positions separately, and does not have any bit vector combining both.
It is this combination that enables us to answer rank queries without binary search.
Compressed
Higher-level structures
Multiset
Using an idea from the original RRR paper (which I came across in this post), you can use two bit vectors to model a multi-set where every element can appear with multiplicity, ie. more than one time.
The basic idea is to represent the universe in a bit vector the same way we have been doing and call this the occupancy bitvec, then represent the cumulative count in a parallel bit vector. This basically means that the first bit vector stores which elements have a count greater than 0, and the second bit vector tells you what that count is (since you can derive the individual counts from the cumulative sum).
There's a free parameter here for each of the two bit vectors, each of which may want a different kind of representation depending on your data distribution. If your weights are very large, it makes sense to use a sparse bit vector for the weights. If your universe is large, it might make sense to use a sparse bit vector for the occupancy bits too.
Note: Since the Elias-Fano representation supports repetitions, it is also a multiset. But since it represents the repetitions explicitly, it is not a good choice when you have very large numbers of repeats.
Here's how I conceptually visualize multisets:
┌───┐
│ 1 │
┌───┐ ├───┤
│ 1 │ │ 1 │
┌───┼───┼───┬───┬───┼───┼───┬───┐
│ 0 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │
└───┴───┴───┴───┴───┴───┴───┴───┘
Value 1 4 5 7
Count 2 1 3 1
Here, the "occupancy bitmap" is the bit vector on the bottom, and the counts are visualized as repeats, with all repetitions of the same value stacked vertically.
Rank still means "what is the total count of values less than i?", and select still means "return the value of the n-th element".
One could imagine generalizing this to allow for repeated zeros, but I think this complicates the rank/select implementations.
This picture is in contradistinction to the way I conceptually visualize run-length-encoded bitvectors, which is as a single horizontally-laid-out bitvector with repeated 0s and 1s, eg. 00111011111001001111.
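Here is a minimal sketch of this scheme using plain arrays as the baseline representation. The real thing would use an occupancy bit vector plus a sparse bit vector of cumulative counts; the names here are purely illustrative:

class ArrayMultiset {
  constructor(values, counts) {
    this.values = values;   // distinct values, ascending (the "occupancy" part)
    this.cumulative = [];   // cumulative counts, parallel to values
    let sum = 0;
    for (const c of counts) this.cumulative.push(sum += c);
  }
  // Total count of elements strictly less than i.
  rank(i) {
    let lo = 0, hi = this.values.length; // binary search: first value >= i
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.values[mid] < i) lo = mid + 1; else hi = mid;
    }
    return lo === 0 ? 0 : this.cumulative[lo - 1];
  }
  // Value of the n-th element (zero-indexed), or null if out of bounds.
  select(n) {
    let lo = 0, hi = this.cumulative.length; // first cumulative count > n
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.cumulative[mid] <= n) lo = mid + 1; else hi = mid;
    }
    return lo < this.values.length ? this.values[lo] : null;
  }
}

// The example from the diagram: values 1, 4, 5, 7 with counts 2, 1, 3, 1.
const ms = new ArrayMultiset([1, 4, 5, 7], [2, 1, 3, 1]);
console.log(ms.rank(5));   // 3 (two 1s and one 4)
console.log(ms.select(3)); // 5 (the sorted elements are 1, 1, 4, 5, 5, 5, 7)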
Wavelet tree
The wavelet tree is a very elegant and versatile data structure that generalizes bit vectors to an alphabet size greater than two. Meaning: imagine a bit vector where each element can be not only a 0 or 1, but in fact any non-negative integer.
Effectively, you can store an arbitrary integer array as a wavelet tree (or matrix).
Another good way to think about the wavelet tree is that it represents the permutation – it represents the original sequence on the top level, and the sorted sequence on the bottom level, with the levels describing the sorting process during a bucket sort with two buckets.
Here's a visualization of wavelet tree construction that I was working on. The wavelet matrix only stores the transitions – beige columns are 0-bits and indicate elements that "go left" (belong to the left child), and black columns are 1-bits and indicate elements that "go right" (belong to the right child).
This permutation perspective explains why the wavelet tree is advantageous: it gives you efficient operations on the original sequence, as well as on the sorted sequence, together with new operations that cannot be done efficiently on either sequence alone, such as range quantile queries.
You can read all about wavelet trees in Wavelet Trees for All.
Wavelet matrix
A wavelet matrix is basically a levelwise wavelet tree with the order of nodes on each level rearranged.
Due to the clever rearrangement, we no longer need to store an explicit list of node offsets on each level, because we can always find the appropriate position in a child node on the next level from the position on the parent level. It feels a little bit like pulling a string: looking at the first bit tells you where to go on the next level to discover the next bit. And the process repeats until you've found all of the bits you were looking for, and thus reconstructed the symbol, or computed the rank, or the select, or whatever operation you were doing.
And all this while using only ceil(log2(max_symbol + 1)) bits per element, plus a small overhead (eg. <10%) for the rank/select0/select1 indices on top of the data blocks themselves. And because we're storing things in bit-plane order, we can, for example, use only 20 bits per symbol to store symbols up to 1 million.
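For intuition, here is a minimal sketch of levelwise wavelet matrix construction, with each level stored as a plain array of bits rather than as a rank/select bit vector (names illustrative):

// Build the levels of a wavelet matrix: at each level, record the current bit of
// every element (most significant bit first), then stably reorder the elements so
// that all 0-bit elements come before all 1-bit elements.
function buildWaveletMatrixLevels(symbols, numLevels) {
  const levels = [];
  let current = symbols.slice();
  for (let level = 0; level < numLevels; level++) {
    const bitPos = numLevels - 1 - level;
    const bit = (s) => (s >>> bitPos) & 1;
    levels.push(current.map(bit));
    // Stable partition by the current bit: zeros keep their order, then ones.
    current = current.filter((s) => bit(s) === 0).concat(current.filter((s) => bit(s) === 1));
  }
  return levels;
}

// eg. buildWaveletMatrixLevels([2, 1, 3, 0, 3, 2], 2) returns one bit array per level.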
Most of the operations I've implemented are very standard, but I've made one structural improvement for batch queries for some of the operations (I think in the most recent version of my library, this is only done for the counts function). The idea is to enable multiple queries on the wavelet matrix to run in levelwise order – first we query the first bit vector for each input, then the second bit vector, and so on – and in left-to-right order, to increase the chances that we're reading memory that is already cached. This equates to a level-wise (breadth-first) traversal of the tree but requires scratch memory to operate.
I've also come up with a few techniques for leveraging the properties of the wavelet tree/matrix to do some cool things I haven't seen described elsewhere.
Bit-interleaved codes for multidimensional, multi-resolution queries
You can encode multiple dimensions in the tree using Z-order/Morton encoding. For example, in two dimensions, interleave the bits of the x and y coordinates and put those into the WM.
Since the wavelet matrix enables querying not only on an index range but also a symbol range, you can do spatial queries constrained to part of the grid. And not only that, but you can actually do queries at any desired level of granularity, treating the wavelet matrix akin to a quadtree.
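For example, a standard 16-bits-per-dimension Morton encoder might look like this (a sketch; the choice of which dimension occupies the even bit positions is an arbitrary convention):

// Spread the low 16 bits of x so they occupy the even bit positions.
function spreadBits16(x) {
  x &= 0xffff;
  x = (x | (x << 8)) & 0x00ff00ff;
  x = (x | (x << 4)) & 0x0f0f0f0f;
  x = (x | (x << 2)) & 0x33333333;
  x = (x | (x << 1)) & 0x55555555;
  return x >>> 0;
}

// Interleave two 16-bit coordinates into a 32-bit Morton (z-order) code,
// with x in the even positions and y in the odd positions.
function mortonEncode2(x, y) {
  return (spreadBits16(x) | (spreadBits16(y) << 1)) >>> 0;
}

// eg. mortonEncode2(3, 5) === 0b100111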
Bit-limited queries to return results for aggregate symbols.
This enables fast approximate answers. It also enables querying H2 histograms at a lower resolution than they were encoded in. It also enables a Google Maps style zoom-and-pan interaction in 1D and 2D using masked queries, as described in the "masked queries" section below. Based on your zoom level you query to a different bit depth. This gives you back the "tiles" that intersect the viewport at that zoom level.
We can use ignore_bits queries to query reduced resolutions of a histogram:
Given an (a, b) histogram, an (a+1, b-1) histogram has half the number of bins, and each new bin is the sum of two adjacent bins in the previous histogram.
eg. a (0, 3) histogram can be reduced to a (1, 2) by right-shifting all bin indices by one bit (bin >> 1).
this means that ignore_bits queries on the WM are effectively querying lower resolutions of the same dataset!
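As a tiny illustration of the bin >> 1 idea on its own (independent of the wavelet matrix machinery):

// Merge adjacent bins: counts at an (a, b) resolution become counts at (a+1, b-1).
function reduceHistogram(counts) {
  const out = new Array(Math.ceil(counts.length / 2)).fill(0);
  for (let bin = 0; bin < counts.length; bin++) {
    out[bin >> 1] += counts[bin];
  }
  return out;
}

// eg. reduceHistogram([1, 2, 3, 4]) -> [3, 7]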
Masked queries
One general approach to querying multidimensional ranges is to use a technique like this one (from a very old paper but re-described and clarified somewhat in the 2021 note by Tropf). There was a good blog post about it here. I realized that there's a better way to do simple range-constrained queries in multiple dimensions with a wavelet matrix, though, since the original one Tropf described relies on query splitting. I made a visualization of an early prototype here.
In this approach, you turn your original query for an arbitrary rectangular region into multiple queries, each querying for a contiguous range of symbols that is also contiguous in the sorted sequence of z-order codes.
The idea is that you do multidimensional querying by issuing a series of one-dimensional queries. My code to do this was very suspect and I never fully understood the idea enough to implement it in a principled way from scratch. It would still be a useful addition to my library to implement this method.
My approach instead uses what I call masked queries, where you rely on the regular wavelet matrix traversal to do the splitting, but you use bit masks that encode the spatial structure to guide the "recursion" down the levels of the tree, ie. to select which nodes to recurse into. The basic idea is that we can do a query on a wavelet tree as if it represents multiple wavelet trees. If our two masks are 10101010 and 01010101, then our single 8-bit (8-level) wavelet tree represents the interleaving of two 4-bit wavelet trees. We use masks to mask the symbols before we do the symbol range comparison, which effectively means that our range checks only look at the bits for the 4-bit tree that we care about. This enables us to independently prune the search range incrementally in both the X and the Y directions and is somewhat reminiscent of Bentley's r-tree. What this accomplishes is the ability to do bounding box queries at any desired granularity.
The key thing that makes masked queries work is that you are essentially inserting the same bit in the same position in all of the symbols – eg. both range endpoints, and the symbol you want to check for containment in the range. Internally "padding" the symbols this way does not change the result of any comparison operation – their less-than and greater-than relationships are preserved. So masking is a very efficient way to effectively ignore some bits, because we don't need to fully squish them out in the way that eg. Morton code decode2x and decode2y do – we can just zero them (or one them, if we want) and leave them in the same place, since the comparisons work as desired on these masked symbols.
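A tiny illustration of the masking idea, separate from the library's actual helpers (the masks assume a 2d code with 4 bits per dimension, interleaved as described above):

// Hypothetical masks for a 2d, 4-bits-per-dimension interleaved code.
const X_MASK = 0b01010101; // x bits live in the even positions
const Y_MASK = 0b10101010; // y bits live in the odd positions

// Inclusive containment check on just one dimension's bits: zero out the other
// dimension's bits in both the query endpoints and the candidate symbol, then
// compare as usual. Because the zeroed positions agree on both sides, the
// comparison gives the same answer as if we had fully decoded that dimension.
function containsMasked(lo, hi, symbol, mask) {
  return (lo & mask) <= (symbol & mask) && (symbol & mask) <= (hi & mask);
}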
Using masked queries, an interleaved wavelet matrix, and bit-limited queries allows us to do count(s) queries in multiple dimensions at any level of granularity.
For counts()
we just use the masks when performing range checks.
For count()
we do that for ranges but also need to do a bit of trickery to early-out for interior nodes in multiple dimensions.
It's basically the same idea as in the symbol range1d case, only with the added tweak that now we have to make sure that we're fully in range across all dimensions, not just the current one, i.e. all masks, not just the current mask, and that's where the subtleties lie.
We union all bitmasks so we can tell when the symbol range is fully contained within the query range at a particular wavelet tree node, in order to avoid needless recursion. See the comments in the Rust code (below).
port the `count` function below into the JS implementation
This still needs to be ported into JavaScript from the original Rust and should be heavily commented when ported. The rust code for `count` queries is here:
// Count the number of occurrences of symbols in each of the symbol ranges,
// returning a parallel array of counts.
// Range is an index range.
// Masks is a slice of bitmasks, one per level, indicating the bitmask operational
// at that level, to enable multidimensional queries.
// To search in 1d, pass std::iter::repeat(u32::MAX).take(wm.num_levels()).collect().
pub fn count_batch(
&self,
range: Range<u32>,
symbol_ranges: &[Range<u32>],
masks: Option<&[u32]>,
) -> Vec<u32> {
let masks = masks.unwrap_or(&self.default_masks);
// Union all bitmasks so we can tell when the symbol range is fully contained within
// the query range at a particular wavelet tree node, in order to avoid needless recursion.
let all_masks = union_masks(masks);
// The return vector of counts
let mut counts = vec![0; symbol_ranges.len()];
// Initialize a wavelet matrix traversal with one entry per symbol range we're searching.
let init = CountSymbolRange::new(0, 0, range.start, range.end);
let mut traversal = Traversal::new(std::iter::repeat(init).take(symbol_ranges.len()));
for (level, mask) in self.levels.iter().zip(masks.iter().copied()) {
traversal.traverse(|xs, go| {
// Cache rank queries when the start of the current range is the same as the end of the previous range
let mut rank_cache = RangedRankCache::new();
for x in xs {
// The symbol range corresponding to the current query, masked to the relevant dimensions at this level
let symbol_range = mask_range(symbol_ranges[x.key].clone(), mask);
// Left, middle, and right symbol indices for the children of this node.
let (left, mid, right) = level.splits(x.val.left);
// Tuples representing the rank0/1 of start and rank0/1 of end.
let (start, end) = rank_cache.get(x.val.start, x.val.end, level);
// Check the left child if there are any elements there
if start.0 != end.0 {
// Determine whether we can short-circuit the recursion because the symbols
// represented by the left child are fully contained in symbol_range in all
// dimensions (ie. for all distinct masks). For example, if the masks represent
// a two-dimensional query, we need to check that (effectively) the quadtree
// node, represented by two contiguous dimensions, is contained. It's a bit subtle
// since we can early-out not only if a contiguous 'xy' range is detected, but also
// a contiguous 'yx' range – so long as the symbol range is contained in the most
// recent branching in all dimensions, we can stop the recursion early and count the
// node's children, since that means all children are contained within the query range.
//
// Each "dimension" is indicated by a different mask. So far, use cases have meant that
// each bit of the symbol is assigned to at most one mask.
//
// To accumulate a new mask to the accumulator, we will either set or un-set all the bits
// corresponding to this mask. We will set them if the symbol range represented by this node
// is fully contained in the query range, and un-set them otherwise.
//
// If the node is contained in all dimensions, then the accumulator will be equal to all_masks,
// and we can stop the recursion early.
let acc = accumulate_mask(left..mid, mask, &symbol_range, x.val.acc);
if acc == all_masks {
counts[x.key] += end.0 - start.0;
} else if overlaps(&symbol_range, &mask_range(left..mid, mask)) {
// We need to recurse into the left child. Do so with the new acc value.
go.left(x.val(CountSymbolRange::new(acc, left, start.0, end.0)));
}
}
// right child
if start.1 != end.1 {
// See the comments for the left node; the logical structure here is identical.
let acc = accumulate_mask(mid..right, mask, &symbol_range, x.val.acc);
if acc == all_masks {
counts[x.key] += end.1 - start.1;
} else if overlaps(&symbol_range, &mask_range(mid..right, mask)) {
go.right(x.val(CountSymbolRange::new(
acc,
mid,
level.nz + start.1,
level.nz + end.1,
)));
}
}
}
});
}
// For complete queries, the last iteration of the loop above finds itself recursing to the
// virtual bottom level of the wavelet tree, each node representing an individual symbol,
// so there should be no uncounted nodes left over. This is a bit subtle when masks are
// involved but I think the same logic applies.
if masks.len() == self.num_levels() {
debug_assert!(traversal.is_empty());
} else {
// Count any nodes left over in the traversal if it didn't traverse all levels,
// ie. some bottom levels were ignored.
//
// I'm not sure if this is actually the behavior we want – it means that symbols
// outside the range will be counted...
//
// Yeah, let's comment this out for now and leave this note here to decide later.
//
// for x in traversal.results() {
// counts[x.key] += x.val.end - x.val.start;
// }
}
counts
}
Representing a tree using a wavelet matrix of the node depths in DFS order.
There are succinct representations for trees that take extremely little space, but the simplest ones are limited in functionality, and the more complex ones require complex code.
At some point I realized that if you don't really care about eking out every last bit of space, you could just use a wavelet matrix to store the node depths, and this would be enough to do all of the basic tree operations more or less trivially.
Storing the depths is kind of like storing just the left parentheses. In order to find the matching right parenthesis, I had to figure out a way to query for the next time that a symbol appears whose value is less than or equal to some particular symbol, ie. a selectFirstLessThanOrEqual query.
I haven't seen this described anywhere else, but it is implemented in my library.
The idea is to return the minimum select position across all the nodes that could potentially contain the first symbol <= symbol.
We find the first left node that is fully contained in the [0, symbol] symbol range, and then we recurse into the right child if it is partly contained, and repeat.
Here the filled gray nodes and orange nodes are the ones that may contain the leftmost minimum value. There's some complication about overlapping nodes, but you can look at the code for details. I don't actually fully understand the idea at the moment (I'm writing this because it's been a while), but I have an implementation, so it hopefully won't be too difficult to figure out the core concept later.
Armed with this new primitive, along with rank and select, implementing the various tree traversal functionality is easy.
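For intuition, here is the naive linear-scan version of the query that selectFirstLessThanOrEqual accelerates, on a plain array of depths (a sketch, not the library's implementation):

// Given node depths in DFS (preorder) order, the subtree of the node at index i
// extends until the next position whose depth is <= depths[i].
function subtreeEnd(depths, i) {
  for (let j = i + 1; j < depths.length; j++) {
    if (depths[j] <= depths[i]) return j;
  }
  return depths.length; // the subtree runs to the end of the sequence
}

// eg. for the tree a(b(d), c) the DFS depth sequence is [0, 1, 2, 1],
// and subtreeEnd([0, 1, 2, 1], 1) === 3: node b's subtree is {b, d}.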
One cool idea that is related to this representation is that we can actually use the associated data technique to look up aggregate stats over either DFS ranges in the tree (ie. associated data that is ordered in the same way as the values in the wavelet matrix, ie. the original sequence) or BFS ranges in the tree (ie. associated data that is ordered in the same way as the leaves of the wavelet matrix).
The set of all descendants of a node are contained in a range in the original sequence (DFS), while the set of all children at a particular level are ordered in BFS order (since they share a depth, and therefore have the same symbol, and are therefore contiguous on the bottom level).
So, we can use this to do lots of neat stuff.
One idea I would like to explore is to use this compact representation of all the files on my disk.
It would be cool to get size distributions, approximate overall sizes, and tags of the number of elements of each type, for example, media and text files and code.
We could also do this kind of analysis on GitHub repositories.
We could use a subset wavelet tree to encode multiple tags, or do the more trivial encoding where we simply store the individual tags separately and have another vector tell us the node offsets, ie. how many tags per element.
I wonder if we can use Huffman coding for this with some kind of codes that are limited to be in ascending order but not necessarily contiguous. I believe this might just be lexicographic Huffman codes, but those do not give the wavelet tree a Huffman shape.
the idea that we can associate data with the DFS order (ie. parallel to the elements of the WM), or with the BFS order (ie. parallel with the elements on the bottom level, which store all elements at a particular depth contiguously). So we can process all the children of a node at a particular depth (get the range corresponding to that node, then get the associated data for the symbols whose value is that depth within that range).
Weighted quantiles, take 1
Use run length encoded bit vectors at each level of a wavelet matrix. This is described in Weighted Range Quantile Queries, based on a set of email exchanges I had with Gonzalo Navarro a while back. This idea is quite expensive in terms of space, but does get the job done.
Weighted quantiles, take 2
Another idea I had more recently is to store the cumulative weights only every k levels. For example, k might be 4, so we would store the weights alongside every fourth level. At each such level we store the weights alongside the permuted values, so that the cumulative weights are in the same order as the elements at that level. We can still do quantile queries even though we're no longer storing the weights at every level, because the idea is to speculatively expand down both branches of the tree on the levels without weights, and then prune back down to the single branch containing the symbol of interest. The fact that we're sampling every k-th level bounds the exponential growth.
For example, with k = 4, we would have 16 candidate nodes by the time we get to the next level that has its weights stored.
Huffman-shaped wavelet matrices
Moffat has a truly excellent overview of Huffman coding in this survey paper. Dude is good.
You can use Huffman coding to compress your sequence up to its first order entropy. This means that (if entropy allows) lower levels in the wavelet matrix will be shorter than the higher levels.
Because the symbols are represented out of order on each level, we need a custom Huffman code that sorts all of the shorter codewords to the end of the list (equivalent to setting their "trailing bits" to 1) when sorting the nodes on each level in the order mandated by the wavelet matrix.
The reason that we can pad the trailing bits with 1s is the prefix-free property of Huffman codes.
You can think of Huffman codes as divvying up a unit of probability space. There is a complete binary tree whose leaves span that space. What you're doing when you pick a particular tree is picking a subset of this binary tree. When a node does not reach all the way to the ground, it still takes up all of the space at that higher level that would have been taken up by all of its leaves. For example, if the first symbol occurs with 50% probability, you may assign it the code word 0. In a three-bit tree, this code takes up the space that would have been taken by 000, 001, 010, and 011.
There's another idea I had with respect to doing symbol range queries on Huffman-coded wavelet matrices. Typically, the thing that enables you to be able to do a symbol range query is the fact that your symbols are lexicographically ordered. That way you can pick which wavelet tree nodes to recurse into by selecting the ones that fall within your range only. With Huffman coding, this changes because the code takes over the assignment of code words and you cannot simply have your code words ordered in the same order as your original symbols.
However, I think that with the Wavelet Matrix code assignments, there's a free parameter within each bit level. For example, within all of the codes that are 3 bits long, you can pick how to assign them to your original symbols. For example, you can pick the codes to be lexicographically ordered in the same order as the original symbols within each level.
This means that you can still do range queries within a particular bit depth. So a symbol range query on the original symbols would translate into N symbol range queries on Huffman coded symbols where N is the number of levels in your wavelet matrix. I don't think that the JavaScript code I have right now to do Huffman coding does this particular ordering within each level, but it could be modified to do so, and then I think this approach would work, giving us the ability to do symbol range queries, even in the case of Huffman encoded wavelet matrices.
My library has well-commented code to do this, but I have not yet augmented the construction algorithms or query algorithms to actually construct shorter bit vectors for each level in such trees.
implement the remainder of work required to enable the use of Huffman shaped wavelet matrices, as described above
Note that as Tropf suggests in one of his notes, we can actually store both positive and negative integers as well as floats in a wavelet matrix. There's a quote in this set of notes from one of his articles that describes the technique, which involves some slight modifications to the bits prior to insertion into the wavelet matrix. Basically, you transform the bit values to make them ordered, and then you untransform after retrieval.
I wonder what the meaning of levels is in the wavelet matrix if we store floats in there. Do the nodes at each level acquire a logarithmic value range?
The associated data technique
The basic idea here is that you can store additional data outside the wavelet matrix that gets combined with the data in the wavelet matrix.
When you do a locate query in the wavelet matrix, you get back the specific range of indices on the bottom level corresponding to each queried symbol.
In the wavelet tree, the nodes on the bottom level are in ascending order, while the nodes on the top level are in the original order.
In the wavelet matrix, all occurrences of a given symbol are still contiguous on the bottom level, but the symbol groups appear in bit-reversal order rather than ascending order (eg. all 2's are together, but the 2's are not necessarily next to the 1's or 3's).
The basic idea though is that you can take some associated data that you want to do range queries on, eg. cumulative sums of elementwise weights, and sort them in the same order as the bottom of the wavelet matrix. This can be done by sorting each weight by the bit reversal of its value in the wavelet matrix, since the nodes in the bottom of the wavelet matrix are sorted in bit reversal order.
Then when you do a query and you get back the range, you can use the indices of that simple range to do a lookup into the associated weights.
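A minimal sketch of that layout step, with plain arrays and illustrative names (it relies on the bit-reversal fact mentioned above, and on a stable sort so that repeats of the same symbol keep their relative order):

// Reverse the low numBits bits of x.
function bitReverse(x, numBits) {
  let r = 0;
  for (let i = 0; i < numBits; i++) {
    r = (r << 1) | ((x >>> i) & 1);
  }
  return r >>> 0;
}

// Sort the weights into the same order as the bottom level of the wavelet matrix
// (stable sort by the bit reversal of each element's symbol), then take prefix
// sums so that a (start, end) range maps to cumulative[end] - cumulative[start].
function layoutAssociatedWeights(symbols, weights, numBits) {
  const order = symbols.map((_, i) => i);
  order.sort((a, b) => bitReverse(symbols[a], numBits) - bitReverse(symbols[b], numBits));
  const cumulative = [0];
  for (const i of order) {
    cumulative.push(cumulative[cumulative.length - 1] + weights[i]);
  }
  return cumulative;
}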
Note: The weights don't have to be arbitrary integers. You can use a regular dense bit vector as the weights, which means that this gives you the ability to do filtered queries, ie. discount some values (giving them weight zero) in the counts. Of course, they can also be floats - eg. scores of some sort.
The trouble is that if you want to rely on the counts during your process of getting to the leaf layer, then you need to repeat the weights at each level.
you can see a sketch of a solution towards reducing this space cost in the "Weighted quantiles, take 2" section above where you store the weights only every k levels rather than at every level.
You don't have to use cumulative structures for this, because you can just have associated metadata per element on the bottom level.
Another cool property of this is that if your wavelet matrix actually represents the concatenation of multiple data sets, this technique continues to work, because the cumulative sum goes across nodes and gives you the right answer in that situation without any boundary conditions (other than the start & end of the wm itself). I ran into this while trying to put multiple experiments' worth of histograms into the same wavelet matrix at work.
Notes from previous drafts
The Compression Perspective
A long-running trend in computing is that memory access keeps getting more expensive relative to the cost of just computing something.
This, together with the fact that data volumes are generally increasing (we keep generating more and more data), has given renewed importance to an old idea from data structures.
The idea is that of data compression but in a way that allows you to compute on top of the compressed data representation.
Typically you think of compression as something like zip, where you can compress a file to make it smaller. In compression schemes like that, though, you typically have to decompress the data before you can actually operate on it.
Succinct data structures ask the question, why decompress? Why not come up with a way to compress the data and let us deal with that directly?
As an example, consider the integers from 0 to 1 million.
Imagine you have a set containing some of those numbers. How much space do you think it would take?
Usually numbers like that are represented in computers as 32- or 64-bit integers.
So if your set contained 10 elements it would take 320 bits. If your set contained a thousand elements it would take 32,000 bits, or about 4 kilobytes.
And if your set contained, say, 500,000 elements, then it would take 500,000 times 32 bits, or about 2 megabytes. Of course there are other ways to do this, such as using 21 or 24 bits per number, but the general idea stands.
Bit vectors give us another way to represent or encode this same data.
There's something here about encoding space versus encoding matter: a bit vector allocates enough room to represent every possible number with a designated slot.
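The same arithmetic for the bit vector version of this example, as a quick back-of-the-envelope check:

// A set of integers drawn from the universe [0, 1,000,000), as a dense bit vector:
const universe = 1_000_000;
const n = 500_000;
const explicitBits = 32 * n;         // 16,000,000 bits, ie. about 2 megabytes of 32-bit ints
const denseBitVectorBits = universe; // 1,000,000 bits, ie. 125 kilobytes, regardless of n
console.log(explicitBits / denseBitVectorBits); // 16x smaller in this case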
Intro Outline / Core Ideas
Trends in the economics of computing that make computation cheaper and cheaper relative to memory access. So it's better to recompute stuff than to fetch stuff. - Where's that blog post about the supercomputing perspective where memory access is like a call to a distributed system?
The compression perspective in which you have a finite universe and each element is represented very compactly
The cool idea here is that while with usual compression methods, you have to decompress the data before using it, succinct data structures give you a compressed representation that you can directly do computation with.
Because we're interested in compression, it's natural that the data distribution will give rise to many techniques that each perform better in a specialized circumstance.
The example of a set of a million numbers (integers).
Rank and select as index structures on top of the data that enable extremely efficient navigation even though the data is compressed.
The 0/1 alphabet is clearly fundamental, but can be limiting in practice. Can we have a bit vector with a much larger alphabet? It turns out we can – and it's called a wavelet tree. We will see how this works in the final chapter.
Lots of fundamental ideas that can be applied in different contexts.
Blocks + Summary: You could imagine doing the same thing with large language models.
Index by value versus index by universe: Or do both. This is how our dense bit vectors work.
Visualization
Representing time series.
In this representation, you can get the distribution of values in any arbitrary time slice.
This is made much more effective if you use an approximate number encoding like H2 histograms, which can represent our data approximately and give us very compact summaries of the distribution.
visually, you could imagine showing the distribution of all elements to the left and all elements to the right, or doing that in ever-increasingly coarse time ranges centered on your current view, as a sort of discrete fisheye heatmap kind of view.
I've played around with this on audio files.
Representing an evolving distribution
Same as above, except now our series is made up of distributions, one per time segment. The treatment is actually the same, except now we want to store the sample weights using the associated data technique.
I made practical implementations of this for a startup I co-founded, but we have so far not needed to use it because our data volumes have not warranted this approach.
It's cool because it enables interactive zooming and panning over potentially very large time series where each entry is eg. a latency histogram.
Multidimensional cross filtering
Here is an example. The user interface needs a lot of work because the sliders are quite unintuitive, as is the behavior of filtering. But the core idea illustrates that we can visually explore 10 million individual data points that can be interactively filtered in multiple dimensions. We can also do a dynamic determination of the level of detail based, for example, on the constraints of the user's device -- the fewer bars we need to show in the bar chart, the cheaper the computation, since it can be limited to a lower bit depth based on the power of 2 of the number of bars in each marginal histogram.
High-Cardinality Density Plots
This is my favorite one. We can do Google-Maps-style zoomable, pannable density plots on top of a wavelet matrix.
I wrote a whole bunch of notes on the journey that led me to the wavelet matrix in the first place. Here they are.
Start with the problem of quickly plotting density plots of floating-point numbers.
The dataset is static, but we want interactivity (zoom).
We may want pixel-level bins, ie. many bins, eg. 1,000 x 1,000.
We may want many data points, eg. 10,000,000.
Datashader does this, but we want to tune for the local-data in-browser case.
Approach 1: Quick binning.
f64 -> u32 in each dimension, accumulate counts (see the sketch after this list).
O(datapoints)
Can improve performance with certain tricks, eg. the JS library had fast binning, as did the Zig prototype.
Also does the color ramp using u32 and a lookup table, which is a standard trick, though it is different from the way d3 approaches it (color scales return a color string that is parsed into {r, g, b, a} if needed).
For zoom, continuously adjust the range represented by the bins and re-project all of the data points, ensuring that those outside the viewport do not contribute any weight.
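A minimal sketch of this quick-binning approach (illustrative names; assumes the viewport is given by a half-open min/max range in each dimension):

// Map each (x, y) point into a bins x bins grid and accumulate counts: O(datapoints).
function binCounts2d(xs, ys, xMin, xMax, yMin, yMax, bins) {
  const counts = new Uint32Array(bins * bins);
  const xScale = bins / (xMax - xMin);
  const yScale = bins / (yMax - yMin);
  for (let i = 0; i < xs.length; i++) {
    const bx = Math.floor((xs[i] - xMin) * xScale);
    const by = Math.floor((ys[i] - yMin) * yScale);
    // Points outside the viewport contribute nothing.
    if (bx >= 0 && bx < bins && by >= 0 && by < bins) {
      counts[by * bins + bx]++;
    }
  }
  return counts;
}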
Approach 2: Flip the perspective.
The trouble is that even just iterating ten million points can take a while – around 16 milliseconds in a simple benchmark on my laptop from 2019.
Can we do better? Yes, if we flip the problem on its head and look at it from the bin's perspective.
Let's examine the 1d case first.
What does each bin want? To know its count.
How could you compute the count? Iterate the array, bounds check, then sum. That's O(datapoints).
Can we do better? Yes – our dataset is static, so we can preprocess it.
So let's sort it into ascending order. Now how do we compute the count? Binary search for each endpoint, then subtract.
This is O(log(datapoints) * bins).
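A minimal sketch of this step, with one binary search per bin edge (illustrative names):

// Index of the first element >= x in a sorted array.
function lowerBound(sorted, x) {
  let lo = 0, hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < x) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// binEdges has bins + 1 entries; bin i covers the half-open range
// [binEdges[i], binEdges[i + 1]). Two binary searches per bin, then subtract.
function countPerBin(sortedData, binEdges) {
  const counts = new Uint32Array(binEdges.length - 1);
  for (let i = 0; i < counts.length; i++) {
    counts[i] = lowerBound(sortedData, binEdges[i + 1]) - lowerBound(sortedData, binEdges[i]);
  }
  return counts;
}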
Can we do better?
Yes, noticing that each bin duplicates a query with its neighbor.
Yes, using techniques for accelerating binary search such as Eytzinger Binary Search and using a fast implementation like the one outlined in bitwise binary search.
Yes, noting that we are doing multiple searches and could do them in parallel, saving work particularly in the cases when multiple 'needles' wind up in the same location in the haystack. This can be done with a variant of multi-value binary search.
Generalizing to two dimensions
Okay, cool, but that all works for 1d. How can we extend it to work in 2d?
A fun question is: can we turn the 2d problem into a 1d problem?
The key property that powers the binary search idea is that all of the points in a bin are contiguous in the sorted array, so locating the endpoints of a bin gives us the information we need to determine the number of elements inside.
This would suggest that if we can somehow order our two dimensional points in such a way that the number of points in each bin is contiguous, then we can use the exact same approach – we won't have to change a thing.
But we will need a new idea: Z-Order Curves.
Explain them
So, what if we quantize our points and order them by (x, y)? We can think of the quantization as pre-binning into very fine bins – eg. if we use 16 bits for each of x and y, then our finest granularity will be 65,536 bins in each dimension, for four billion total bins in the xy grid.
With a z-order curve, any power-of-two-sized, power-of-two-aligned square is contiguous. So it will be helpful for us to adopt a fixed tiling scheme, in contrast to our earlier 1d variant where each bin could be anywhere on the number line.
This approach is very standard in the geographic mapping community, where such maps are known as slippy maps. In this approach, the viewport determines which bin range is visible, and a third z parameter determines the zoom level, ie. the tile size. Do we use 8x8 tiles, or 3x3, or even 1x1?
We will call these fixed, power-of-two-aligned bins tiles.
If we encode our two-dimensional data into Morton codes and then sort them in ascending order, we can query the (lo, hi) of an individual tile, and then hi - lo tells us the number of elements inside.
Can we do better?
Yes, by noticing that for many bins, the queries are contiguous like before, and share edges. But unlike previously, our tiles are only contiguous in the underlying array when they fall into a larger tile.
We can use an approach described in a paper from 1981 titled Multidimensional Range Search in Dynamically Balanced Trees.
He also wrote a 2021 note (with code in Pascal!) clarifying the idea, 40 years later, in which he notes:
The above program works with word type data (no negatives). To handle integer data (possibly negative), just invert the sign bit. This converts the data range to 0000..0 (lowest negative) to 1111..1 (highest positive).
The meaning behind the data is not relevant, only greater-less-equal matters! Anything sortable that can be mapped to a 0000..0 to 1111..1 range, can be range searched (maybe buckets of whatever that could again be searched anyhow).
To achieve this mapping for floating point data is surprisingly simple: If the float is >=0, invert the sign bit, if the float is <0, invert all bits.
The consideration behind it is as follows: The sign bit is the first significant bit. Unfortunately, it is 0 for positive data and 1 for negative data. So just invert it. Exponent and mantissa are both not 2's complement (as it is the case with integers) but in a range of 0000..0 (lowest value) to 1111..1 (highest value). Increasing the exponent shifts more to the left, which corresponds to an increasing absolute value of the float. The same is true for the mantissa: increasing the mantissa increases the absolute value of the float. So, what we do is sorting the float according to its absolute value. However, a negative with great absolute value is less than a negative with small absolute value. This can simply be handled by inverting both exponent and mantissa (in addition to the sign bit).
Of course, such bit manipulations are to be reverted when reporting results.
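A minimal sketch of that float-to-sortable-bits mapping for 32-bit floats (the 64-bit version is analogous, e.g. with BigInt or two 32-bit words):

```js
// Reinterpret a float's bits, then apply the rule quoted above:
// sign bit set (negative float)  -> invert all bits,
// sign bit clear (float >= 0)    -> invert just the sign bit.
// The resulting unsigned integers sort in the same order as the original floats.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

function floatToSortableU32(value) {
  f32[0] = value;
  const bits = u32[0];
  return ((bits & 0x80000000) ? ~bits : (bits | 0x80000000)) >>> 0;
}
```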
There is also Towards a general-purpose, multidimensional index: Integration, Optimization, and Enhancement of UB-Trees which describes its own versions of nextJumpIn and nextJumpOut
Another resource notes
BIGMIN and LITMAX Tropf and Herzog [19] implemented a more sophisticated approach for range searches using the Z-order curve. The authors do this by iterating through the stored objects, sorted by the Z-order value, just as with the naive approach. However, when they detect an object which is outside of the search range (e.g., point G of Figure 2.6b), they initiate a calculation to find the largest previous Z-order value within the range (LITMAX) and the next smallest Z-order value within the range (BIGMIN). Any value between LITMAX and BIGMIN is then outside the search range, and the search can then continue in two sub-ranges: from the range minimum code to LITMAX, and from BIGMIN to the range maximum point. This is done recursively until all values of the range have been searched.
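A sketch of how that search loop might look, assuming a hypothetical bigmin(code, zMin, zMax) helper implementing the BIGMIN calculation described above (the helper names are illustrative, not taken from the cited sources):

```js
// Scan morton-sorted codes within [zMin, zMax], skipping gaps of out-of-range codes via BIGMIN.
// inRect(code) tests whether the decoded point lies inside the query rectangle.
function zRangeScan(codes, zMin, zMax, inRect, bigmin, report) {
  // Binary search for the first code >= zMin.
  let lo = 0, hi = codes.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (codes[mid] < zMin) lo = mid + 1; else hi = mid;
  }
  let i = lo;
  while (i < codes.length && codes[i] <= zMax) {
    if (inRect(codes[i])) {
      report(i);
      i++;
    } else {
      // Out of range: jump forward to the next code that could possibly be inside the rectangle.
      const next = bigmin(codes[i], zMin, zMax);
      while (i < codes.length && codes[i] < next) i++; // could also binary search forward
    }
  }
}
```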
this post has an implementation in Rust!
Q: can we derive a bounding box FOR a range? Check that GitHub repo I found... it might do that. Then we can check the range for intersection with the bbox.
Credit the guy whose code I got from Twitter (with permission) – Jonah Harris (notebook)
We should figure out why this takes multiple iterations, though... (see the following cell)
Yes, by optimizing multi-value binary search further: the sorted needles can be represented implicitly – since they are power-of-two sized and linearly spaced, we can "binary search" them in O(1).
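For example (illustrative):

```js
// With power-of-two-wide, evenly spaced bins, finding which bin contains a value is a shift,
// not a binary search over the sorted bin boundaries.
function binIndexOf(value, binStart, log2BinWidth) {
  return (value - binStart) >>> log2BinWidth;
}
```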
But can we do better than O(log(datapoints) * bins)?
We could either go to the WM from here, or the EF.
WM has the advantage that it is how I came to it, and that it is uncompressed, like the original sequence. And that it can work with sorted data, but doesn't have to.
EF has the advantage of keeping to the sorted-sequence theme, but 1. being more efficient, or at least not scaling with dataset size,
representing it
WM version
It turns out there's an even faster way to do the equivalent of the binary search: we can actually remove the number of data points from the big-O equation entirely.
though larger datasets will still be slower to search than small ones in practice, because the data is spread farther apart in memory and searching it incurs more cache misses.
We can get to O(c * bins) where c is the number of bits in the morton representation using a data structure called a wavelet matrix.
It stores each number in log(u) bits, plus extra space for the rank indexes – overhead that is a small fraction of the original size.
And the data doesn't need to be sorted. We will use a query over symbol ranges.
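To make the symbol-range query concrete, here is a sketch of the classic count-symbols-below-a-value traversal over a wavelet matrix, written against a hypothetical per-level bit vector API with rank1 and a precomputed numZeros (this is not the companion library's actual interface):

```js
// Count how many symbols in positions [first, last) are < upper, one level (one bit) at a time.
// Each level exposes rank1(i) = number of 1-bits before position i, and numZeros = total 0-bits.
// Assumes symbols fit in at most 32 bits so JS bitwise operations apply.
function countLess(levels, numBits, first, last, upper) {
  let count = 0;
  for (let l = 0; l < numBits; l++) {
    const level = levels[l];
    const bit = 1 << (numBits - 1 - l);
    const onesFirst = level.rank1(first);
    const onesLast = level.rank1(last);
    if (upper & bit) {
      // Symbols with a 0 at this bit are all < upper from here on: count them, then follow the 1s.
      count += (last - onesLast) - (first - onesFirst);
      first = level.numZeros + onesFirst;
      last = level.numZeros + onesLast;
    } else {
      // Follow the 0s, which are packed at the front of the next level.
      first -= onesFirst;
      last -= onesLast;
    }
  }
  return count;
}

// The number of points whose morton code falls in [zLo, zHi), over all n points, is then
// countLess(levels, numBits, 0, n, zHi) - countLess(levels, numBits, 0, n, zLo).
```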
Can we do better?
Yes, by using faster rank: sample more frequently (more memory), or interleave the rank blocks with the one bits – e.g. one u128 memory access, or one u64 access with 50% overhead for rank blocks.
Yes, using batched queries – traverse each level one at a time and do rank on successive ones. Maybe even specialize to the underlying block size and take advantage of our even spacing, like we did for multi-value binary search, to search the same raw block for multiple ones.
Or at some zoom level, just iterate the blocks directly since things are spaced sufficiently close...
It also helps to ensure that the queries for the next level stay in sorted wavelet matrix order.
Yes, by traversing only the required number of levels based on the zoom level.
So typically, we are either searching a very wide range of bins, but with a shallow zoom level (big tiles), or a narrow range of bins, but zooming in more deeply.
Yes, by using a wavelet matrix built from quad vectors (!) which seem built for this use case.
Can we do more (adding functionality)?
We can use the WM to do interactive range filtering, or even more complex filtering by sorting the data in meaningful ways, e.g. treating (hierarchical) categories as sorted blocks.
This also means that if our data comes from a database with a sequential ID column, we can keep the points in their original order so we can relate them back to the rest of the data (ie. click a bin and not only see its density, but actually retrieve a sample from that bin, or enumerate all of its points).
Another extension is for points that have weight – some might be more important than others and should contribute more density to the density plot.
WM + separate weight vector for external lookup – not sure how exactly this works with the zooming behavior of looking only at some levels, though..
WM + RLE bitvectors is one approach
Can we do better?
Yes, I coauthored a paper about it (which got me an Erdős number of 4)!
Notes
Maybe use different language than "can we do better?" when we're discussing optimizations that keep the same big-O runtime, and reserve that phrase for transitions to a qualitatively different approach.
References and further reading
WM-related papers to investigate:
Note: Compact Data Structures has a section on answering top-k queries with a WM using associated weight (so not frequencies, but associated top weight):
10.3 Weighted Points
When points have associated weights (or relevances, or priorities), an additional query becomes of interest: top(G, x1, x2, y1, y2, m) returns m heaviest points in [x1, x2] × [y1, y2], breaking ties arbitrarily among points of equal weight. The ways to solve this query depend on the structure we are using to represent the grid.
Faster Linear-space Orthogonal Range Searching in Arbitrary Dimensions
Compact Data Structures: "A recent work (Okajima and Maruyama, 2015) shows how to obtain worst-case time O(n^((d−2)/d) log n + occ log n), using dn log n + o(dn log n) bits of space. The idea is to use one wavelet tree per coordinate but sort the sequence of points in z-order, whose sort key is formed by a number where the first bit of the d coordinates are concatenated, then the same for the second d bits, and so on. They show that a rectangle query can be decomposed into O(n^((d−2)/d) log n) one-dimensional intervals in z-order, each of which is intersected with some coordinate using the appropriate wavelet tree. Their experiments show that the structure can be an order of magnitude faster than kd-trees when the queries are not too selective, but their structure occupies around twice the kd-tree space (they store several accelerator structures to obtain that time performance). This has theoretical interest and is also promising in practice."
Compact Data Structures (book)
See the references in this notebook and include any relevant ones here
Algorithmica
Static B-Trees - we could do this with simd now?
On multi value binary search:
"UPDATE: this is an instance of the Minimum Comparison Merging problem
Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.
From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K / K!), or K lg(N) - K lg(K) = K lg(N/K).
This bound cannot be beaten by any comparison-based method, so any such algorithm will take time essentially proportional to the number of keys."
old ideas notebook
Fabian Giesen's post on decoding morton codes
good post on RRR for fast compressed bitmaps:
Credit the guy whose code I got from Twitter (with permission) – Jonah Harris (notebook)
Implementing a linear Quadtree in Rust is a great and extremely related post
Multidimensional Range Search in Dynamically Balanced Trees
For multidimensional range refinement in the z-order curve (bigmin, litmax)
Related: One stone, two birds: A lightweight multidimensional learned index with cardinality support
Algorithm 2 shows their bigmin/litmax thing
Z-ordered Range Refinement for Multi-dimensional Range Queries
a possible improvement for range refinement (BIGMIN)
The previous approach can tell you the next point along the curve that you can jump to that will be inside your selected region.
This one seems to involve being able to not only do that, but also, once you have jumped into the range along the curve, figure out where you will jump back out again.
And they do the same kind of "consider the larger Z squares" thing (they call them layered cells)
Note that they're specifically not going for exactness, but rather an approximation that's cheap enough to compute. They don't mind scanning extra stuff that's outside the range.
Which may or may not be a problem for us.
Find that blog post on (rust?) code for doing the bigmin litmax jumping... update: it is this
"The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree."
Papers to read
Interesting papers to read
Describes "The hashing trick" which is a way to project words into a vector using a hash function.
Explains how you can use what appears to be a quantized dot product to do similarity search over 8,000-element vectors that are each one bit per dimension – efficient, while still giving good results for similarity search.
Of course you can also use this as an initial step to select the top candidates and then do full search on the smaller search space using the original vectors.
On Elias-Fano for Rank Queries in FM-Indexes (2021 – though for some reason I thought this paper was older; maybe it was a different version of the paper.)
I haven't read this paper yet, but the abstract is already interesting: it says that if all you want is to support fast rank queries on individual symbols from a large alphabet, you can basically just give each symbol its own sparse bit vector. So basically you represent each symbol only once.
The idea here about zero-suppressed Elias-Fano is quite interesting and makes me want to implement it. I don't fully understand it yet, but it sounds like for the low bits the idea is to represent every group as a dense bit vector if it has any elements.
For the high bits, I'm not exactly sure what they're doing. But it sounds like doing a rank operation becomes two or three rank operations on dense bit vectors.
This idea is good when your data is clustered, meaning a bucket is either zero or very likely to have a high occupancy, which is what justifies the use of the dense representation for each nonempty group.
You could even have a custom dense bit vector because you know you're searching within these really small group regions each time for example 8 times 32, so you could hardcode a binary search over rank blocks or just do a linear scan over all 8 with popcount or whatnot.
There's also another idea in here which I have not yet taken the time to understand properly.
Tables, by J. Ian Munro (also here)
In representing a subset of a large finite set, or an index for text search, one is faced with the need for both time and space efficiency. In this paper, we look at some approaches that have been applied to these problems to represent objects in near minimum space and still permit queries to be performed in constant time. It is hoped that this paper will draw attention to techniques for representing large (mostly static) structures.
This paper/note makes a very interesting point: if you want to represent a binary tree, you can do so using the same data organization you have in a heap – an array where the tree structure is implicit – plus a sparse bit vector that marks which of the entries in that array are non-zero. So you spend bits only to mark each node, and the number of bits depends on the Elias-Fano parameters.
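A sketch of that layout, assuming a bit vector exposing get and rank1 (illustrative API, not necessarily the library's):

```js
// Implicit binary tree over heap positions: root at 0, children of node i at 2i+1 and 2i+2.
// `present` marks which heap positions are occupied; node payloads live in a dense `values`
// array in heap-position order, so rank1 maps a heap position to its slot in `values`.
function leftChildValue(present, values, i) {
  const child = 2 * i + 1;
  if (!present.get(child)) return undefined; // no left child
  return values[present.rank1(child)];       // rank1(child) = occupied positions before `child`
}
```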
Notes migrated from a previous draft about the wavelet matrix
Over the course of the last few years, I've been exploring techniques for the interactive handling and visualization of medium sized data sets directly in the browser without the requirement of having a server doing any of the aggregation and processing.
This was an outgrowth of work I conducted while on a performance team at Twitter (we've since spun out and co-founded a company together).
One of the projects that came out of my early explorations was an open source density plot library for handling low millions of data points in the browser. I was curious whether it was possible to go further, and in the process found my way, unexpectedly, to an area of computer science that I would never have predicted held any answers for me.
This was the world of succinct data structures. Specifically, it was the world of bit vectors with rank and select and the world of wavelet trees and wavelet matrices.
Today I'm open sourcing a small library that contains browser implementations of the data structures I found interesting and useful in my studies.
Here I plan to outline the basic concepts involved and explain the visual intuition behind the data structures.
Following that, I want to go through and talk about the ways that I've applied these ideas to the tasks involved in interactively summarizing datasets larger than those that are typically handled on the client side.
For example, here is a zoomable density plot of 50 million data points sampled from the US Census. Each data point is represented separately and additional information about it can be independently retrieved by joining to outside data.
This is a little bit of an unusual case because we don't really need to disaggregate beyond the level of a bin to provide a density plot. So I wonder if there's an example that allows some kind of interactive filtering, perhaps based on time. If we sort the data by time, we will need to find a way to link it to the outside source because it would no longer be sorted by primary key.
The first part of this post will outline the basic concepts involved (bitvec = set or multiset, rank/select = array indexing and its inverse), and the second part will go through the various data structures in detail after explaining the basic bitvector interface. Various levels of compression compared to the naïve approach will also be presented. And a note about the utility of plain sorted arrays when space is no object and log(n) queries are tolerable. Perhaps a diversion into bit-based binary search and its generalization to the multi-value case, which is equivalent to the interleaving of two sorted sets (eg. merge in mergesort). The various tradeoffs involved there – eg. small key set vs small value set, to decide how to split/recurse.
The third section will focus on the most capable data structure of them all, the wavelet matrix, and explain how it can usefully be applied to represent various kinds of data, including sequences of numerical data, large hierarchies, and even spatial queries over multidimensional data points.
Notes migrated from a previous set of drafts
Explanation Patterns
Always point the way to greater performance for those who need it. Point out when we are using a suboptimal strategy and what the trade-off is (eg. simplicity, or js vs rust).
Eg. EF bin search in select vs linear scan
Eg. number of blocks in the dense
Eg. different dense approaches (link to some; wasn't there that 'fastest in Rust' one, though it uses instrs that may not be available on wasm (not sure)?)
Eg. EF split point (trade space for... something. Should we optimize the power of 2 rank case, or provide a separate function for it that just uses the high bits? Could be useful for upper bits queries in the WM.)
Intro
What is it?
Why would I care (what problems can it solve, etc.)?
Who is explicitly not the audience? (Eg. peeps looking for maximum perf should look into WASM-compiled rust).
What is it designed for (what am I using it for)? Datavis, mostly, and just fun exploration, though it is well-tested ([todo] and benchmarked).
To do
Note that the universe is about the maximum representable number, not about the quantity of numbers – these are different because of repetitions.
EF: How do you group? Div and mod. Also, check folly and sdsl and etc for their ef split criteria
Have a binary primer - show how it’s just decomposing into 0-9 + 10-90 etc
Explain the basics of entropy? Maybe in a compressed bitvec section
Implement the other “simple” but compressed bit vector from the same paper.
References: Design inspiration: Amit, and the sines circles thing
What I want are dynamic examples with execution traces for tailor-made visualization of block access and other relevant features, and maybe even high-level metrics like the number of scans or something, Maybe even a simple cache model to illustrate the advantage of in-order and batch queries
When discussing reading/writing bits, mention bitter and the reading bits in far too many ways post series
Note to self: Eleventy has issues serving static files with spaces in their names in development mode, so all the images have been renamed to remove spaces.
Consider adding clickable outline toggles to headings
Figure out why chatgpt cannot read this page. Is it the same reason that reader mode doesn't work in mobile Safari?
Make all of the heading nodes collapsible and start them collapsed by default.