Score by Surprise
Status: Draft
This piece is an experiment — I'm working on getting comfortable publishing less polished work, so this draft was written quickly & published right away.
I occasionally work with datasets whose columns are strings, such as the names of products, people, or dogs.
For quantitative data, a scatterplot is a suitable tool for seeing the relationships between columns, but the associations in categorical data are not so easy to see, particularly when the columns have high cardinality, since the values cannot always be meaningfully plotted along ordered axes.
There's a simple technique I sometimes use to surface these associations: score each pair by how surprising its count is. It's easy to program and has an intuitive interpretation, and due to this simplicity I often favor it during early explorations of data that I'm looking at for the first time.
As an example, imagine you have a dataset of dog names and breeds:
| Name | Breed |
| --- | --- |
| Pugsley | Pug |
| Bailey | Beagle |
| Bailey | Golden Retriever |
| Snoopy | Beagle |
| Snowy | Maltese |

... 100,000 more rows ...
In a dataset like this, it's possible that any breed of dog has any name, but in practice there are certain tendencies. For example, Beagles are more likely to be named Snoopy and Pugs are more likely to be named Pugsley.
One way to surface these kinds of unexpected name-breed associations is to score each name-breed pair by a smoothed observed / expected ratio: count how often a pair actually appears, compare that to how often it would appear if names and breeds were unrelated, and smooth the ratio so that small counts don't dominate the ranking.
For example, here are the most associated dog names for pugs, based on a dataset of NYC dogs:
| name | observed | expected | score |
| --- | --- | --- | --- |
| Pug | 36 | 0.445 | 25.605 |
| Pugsley | 27 | 0.479 | 18.928 |
| Puggy | 11 | 0.148 | 10.45 |
| Mugsy | 18 | 1.575 | 7.38 |
| Pugsly | 7 | 0.091 | 7.331 |
I'm not sure who names their pug Pug. Some of those might be data entry errors...
To show that we can do this for names as well as breeds, here are the most associated dog breeds for the name Snowy:
| breed | observed | expected | score |
| --- | --- | --- | --- |
| American Eskimo Dog | 14 | 0.439 | 10.424 |
| Maltese | 47 | 5.997 | 6.861 |
| Bichon Frise | 10 | 1.648 | 4.155 |
| West Highland White Terrier | 4 | 0.303 | 3.837 |
| Maltese Crossbreed | 8 | 1.476 | 3.635 |
So, how does this work?
The idea is to compare the observed frequency of a name-breed combination to a baseline frequency for that combination.
For instance, the baseline frequency can be derived from an independence assumption, i.e. assuming that dog breeds have no effect on dog names.
This would imply that any name-breed pair should appear roughly in proportion to the popularities of the breed and name separately, i.e. the expected count for a name-breed pair would be `total_count * P(name) * P(breed)`.
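To make that concrete, here's a minimal sketch in JavaScript, assuming the dataset is an array of `{ name, breed }` objects (the function name is just for illustration):

```js
// Sketch: observed and expected counts for every name-breed pair,
// where expected assumes names and breeds are independent.
function pairCounts(rows) {
  const nameCounts = new Map();
  const breedCounts = new Map();
  const observed = new Map();
  for (const { name, breed } of rows) {
    nameCounts.set(name, (nameCounts.get(name) ?? 0) + 1);
    breedCounts.set(breed, (breedCounts.get(breed) ?? 0) + 1);
    const key = name + "\t" + breed;
    observed.set(key, (observed.get(key) ?? 0) + 1);
  }
  const total = rows.length;
  return [...observed].map(([key, count]) => {
    const [name, breed] = key.split("\t");
    // expected count = total_count * P(name) * P(breed)
    const expected =
      total * (nameCounts.get(name) / total) * (breedCounts.get(breed) / total);
    return { name, breed, observed: count, expected };
  });
}
```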
We can take the ratio between the observed and expected counts, i.e. `observed / expected`, and this does indeed give us a score, but the score has a problem: the ratio ignores information about the absolute counts.
For example, a name-breed pair seen 5 times but expected 1 time would have a score of 5 / 1 = 5, while a name seen 400 times but expected 100 times would have a score of 400 / 100 = 4, and would thus be ranked as less surprising.
This is a problem for us because we typically have more confidence in ratios that are based upon a larger number of observations, and consider them more significant.
We can fix this by tweaking our observed / expected ratio a little bit, using a "smoothing" process that shifts lower-count ratios towards 1. This function smooths `a / b` towards `x` with a smoothing strength of `alpha`:
```js
const smooth = (a, b, x, alpha) => (a + alpha * x) / (b + alpha);
```
The score is then computed with something like `score = smooth(observed, expected, 1.0, 1.0)`.
I usually set both `x` and `alpha` to 1, but it can be interesting to play with the `alpha` parameter interactively to inspect strong associations with lower counts.
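To see the effect on the earlier example, here's a quick check with those defaults:

```js
const smooth = (a, b, x, alpha) => (a + alpha * x) / (b + alpha);

// The low-count pair no longer outranks the high-count one:
smooth(5, 1, 1.0, 1.0);     // (5 + 1) / (1 + 1) = 3.0
smooth(400, 100, 1.0, 1.0); // (400 + 1) / (100 + 1) ≈ 3.97
```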
This idea can also find relationships between more than two elements. For example, in a text dataset, to find significant associations among trigrams relative to an independence assumption you would multiply the probabilities of the three unigrams together to get the baseline expected value, and use the count of the triple as the observed frequency. In this context, it can be useful to additionally filter the trigrams by removing those that start or end in a stopword.
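Here's a sketch of the trigram case, assuming `tokens` is an array of already-tokenized words (the function name is just for illustration, and the stopword filtering is left out):

```js
const smooth = (a, b, x, alpha) => (a + alpha * x) / (b + alpha);

// Sketch: score each trigram against its expected count under the
// assumption that its three unigrams occur independently.
function trigramScores(tokens) {
  const unigrams = new Map();
  const trigrams = new Map();
  for (const t of tokens) unigrams.set(t, (unigrams.get(t) ?? 0) + 1);
  for (let i = 0; i + 2 < tokens.length; i++) {
    const key = tokens.slice(i, i + 3).join(" ");
    trigrams.set(key, (trigrams.get(key) ?? 0) + 1);
  }
  const total = tokens.length;
  return [...trigrams]
    .map(([key, observed]) => {
      // expected = total * P(w1) * P(w2) * P(w3)
      const expected = key
        .split(" ")
        .reduce((p, w) => p * (unigrams.get(w) / total), total);
      return [key, smooth(observed, expected, 1.0, 1.0)];
    })
    .sort((a, b) => b[1] - a[1]);
}
```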
Bigrams work in the NLP case too. Here are the top bigrams found by this algorithm when looking at the top-level comments on a recent Hacker News story:
Notable bigrams:

- Larry Summers
- Bret Taylor
- Sam Altman
- Adam D'Angelo
- new board
- Helen Toner
- AI safety
- initial board
- one way
- soap opera
These scores can also be useful beyond ranking. For example, in the visualization I made for this piece, the position of each dog name & breed was computed by performing a nonlinear dimensionality reduction on a matrix of these scores for the top dog names and breeds using the same dataset.
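The matrix part is easy to sketch; `scorePair`, `topNames`, and `topBreeds` here are hypothetical names for the pieces described above:

```js
// Build a names-by-breeds matrix of smoothed scores; the rows (or
// columns) can then be fed to a nonlinear dimensionality reduction
// such as UMAP or t-SNE to position each name and breed.
const matrix = topNames.map((name) =>
  topBreeds.map((breed) => scorePair(name, breed))
);
```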
This technique is closely related to additive smoothing, which I learned about a long time ago as an intern at Disney, working on a pre-LLM-era text classification project where we were trying to identify cursing children and other misbehaving characters in Disney's online virtual worlds.
Notes to self
Feedback from Zora
> There's a simple technique I sometimes use to surface these associations, which has the advantages of being easy to program and having an intuitive interpretation. Due to this simplicity I often favor it during early explorations of data that I'm looking at for the first time.
Too clickbaity — slip in a 3 or 4 word summary of what’s going on in between the sentences?
> In a dataset like this, it's possible that any breed of dog has any name, but in practice there are certain tendencies. For example, Beagles are more likely to be named Snoopy and pugs are more likely to be named Pugsley.
>
> One way to surface these kinds of unexpected name-breed associations is to score each name-breed pair by a smoothed observed / expected ratio.
This explanation jumps right into fancy datamathy words — maybe give either (or both?) some extra explanation about what that actually looks like, and a brief little equation of what you’re counting up here — you go into it all in detail below but the explanation up front kind of assumes you already know how it works
Another, potentially more compelling way to write the same thing: https://observablehq.com/d/b7445d32444ba882
also talks about entropy things
The reason that the phrase "one way" appears in the notable bigrams of the OpenAI Hacker News story is that there is one top-level comment that mentions the phrase four times. Even though no other comment mentions it, this is enough to bump it into the "unlikely phrases" category.
One idea this suggests is to use comments as the unit of counting: count each bigram zero or one times per comment, depending on whether it appears there, and aggregate across comments to find the notable bigrams.
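A quick sketch of that idea, assuming `comments` is an array of strings (the whitespace tokenization is just a placeholder):

```js
// Count each bigram at most once per comment, then aggregate the
// 0/1 counts across comments.
function bigramDocCounts(comments) {
  const counts = new Map();
  for (const text of comments) {
    const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
    const seen = new Set();
    for (let i = 0; i + 1 < tokens.length; i++) {
      seen.add(tokens[i] + " " + tokens[i + 1]);
    }
    for (const bigram of seen) {
      counts.set(bigram, (counts.get(bigram) ?? 0) + 1);
    }
  }
  return counts;
}
```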