Wavelet Matrix Construction.
(I’m working on a larger post about the wavelet matrix and writing and publishing extremely rough drafts as I go.)
To understand the wavelet matrix it can be useful to understand the wavelet tree first, so here’s a quick primer.
As I’ve mentioned before, I think of the wavelet tree (and matrix) as representing the process of sorting a sequence of integer symbols.
Let’s take an example. Imagine that our integer symbols are picked from a universe of size 2³, consisting of the numbers from 0 to 7.
The wavelet tree represents a quicksort whose pivots are all powers of two. At each level of recursion, the value space is partitioned into two halves based on the universe size, ie. into a lower half containing the numbers 0 to 3, and an upper half containing the numbers 4 to 7. After this first step, each half is recursively sorted using the same approach, terminating when the halves contain a single integer each.
You can see this in the diagram below, where each node is labeled with the range of the as-yet-unsorted integers inside it.
Each level of the diagram corresponds to the next stage of the sorting process. The full sequence of numbers starts in its original order on the top level, made up of symbols from 0 to 7. On the next level those numbers are partitioned into two sublists, with the one on the left containing all of the numbers from 0 to 3 in their original order, and the one on the right containing all numbers from 4 to 7, in their original order.
Then the same process happens again, with the left (0 to 3) node further partitioned into two child nodes, one from 0 to 1 and another from 2 to 3. Then we do this again and arrive at the fully-sorted sequence on the bottom level.
The diagram above shows the process abstractly without any particular sequence of numbers, but let’s see how this would actually work in practice. Here’s a visualization that shows the incremental sorting process for the sequence 5 2 1 3 7 2 6 1 5 2
:
We start off with the full sequence in its original order on the top layer, and end up with those same numbers sorted in ascending order on the bottom. In between, you see the process of recursive partitioning, where the numbers “slide” down either the left or the right arrow, depending on whether they belong to the bottom or the top half of the universe.
Now, because the entire sequence is represented in full on each level of the sort, you might think this is a very expensive way to represent the data. But in fact, this is the beautiful trick: the wavelet matrix stores not the boxes, but the arrows!
Here’s the same diagram again, this time with the numbers in parentheses notating the data that actually gets written into the wavelet matrix as individual 0- and 1-bits.
The drawing tool I’m using does not allow me to color the individual zeros and ones red or blue but you should think of the the zeros as colored blue, indicating that the number goes down and to the left along the blue arrow, and ones as red, indicating that the number goes down and to the right following the red arrow.
The data in the wavelet matrix consists of a bit vector per level containing the bits in parentheses, With the one small tweak that, rather than writing the bits down from left to right as they’re shown here, we first write down the bits for all the left nodes, followed by the bits for all the right nodes.
If those bits are all you’ve got it might look like we’ve scrambled the numbers beyond all possibility of repair. But we will soon see that it’s easy to reconstruct the original data from this new representation.
The reason for this is that at each level we’ve actually written out bits from the numbers themselves. For example, the number 5 is represented in binary as 101
, and those are the exact bits written out for each 5 in the sequence, in that order – a 1-bit on the top layer, a 0-bit on the middle layer, and another 1-bit on the bottom layer.
(I’d like to make an animated version of this diagram later, but for now I’m happy since this is already quite an improvement from my last post.)
More to come.
Here are some scraps of visualizations I’m working on that are quick sketches to see whether an idea works. This first one has some promise:
This tree view is an interesting idea, too, and inspired the boxy diagrams below:
The next few are a bit misleading because it is not the case that the bottom half of each node sorts to the left:
I experimented with showing the flow from the full parent node, but that seems even less enlightening, and I think the arrows in the main post are better illustrations of the idea.