Infinite Ascent.

by CJ Quineson

cromulence: a small word frequency library

fitting a dictionary in the browser

I published a library called cromulence on NPM today. It’s used for looking up phrases and scoring how likely they are to be a puzzlehunt answer:

import { Cromulence, loadWordlist } from "cromulence";

const cromulence = Cromulence(await loadWordlist());

for (const word of [
  "mobilesuitgundam",
  "carltonfisk",
  "itsawanderfullife",
  "walkywoad",
  "implausibilities",
]) {
  const { cromulence, spacing } = cromulence.info(word);
  console.log(cromulence, spacing);
}
import { Cromulence, loadWordlist } from "cromulence";

const cromulence = Cromulence(await loadWordlist());

for (const word of [
  "mobilesuitgundam",
  "carltonfisk",
  "itsawanderfullife",
  "walkywoad",
  "implausibilities",
]) {
  const { cromulence, spacing } = cromulence.info(word);
  console.log(cromulence, spacing);
}

This prints:

10.9 MOBILE SUIT GUNDAM
9.9 CARLTON FISK
4.9 ITS A WANDER FUL LIFE
-1.2 WALKY WOAD
-4.6 I MPLA US I BILI TIES
10.9 MOBILE SUIT GUNDAM
9.9 CARLTON FISK
4.9 ITS A WANDER FUL LIFE
-1.2 WALKY WOAD
-4.6 I MPLA US I BILI TIES

Here I’ll explain the motivation behind it, and how the wordlist is chosen.

Short circuiting puzzles

In puzzlehunting canon, there’s a technique called indexing, where given a phrase and a number nn, you take the nnth letter of a phrase. If you have a list of phrases and numbers, you can index the numbers into the phrases, put the letters you get together, and hope that you get another English phrase. For example, in the puzzle horse_ebooks, you end up with this data (middle rows omitted):

WordNumber
PRINCESS1
HOUSE1
DRAGON5
GOBLET6
JAMES5
MAN1

If we take the 11st letter of PRINCESS, the 11st letter of HOUSE, the 55th letter of DRAGON, etc., we get the word PHOTOREALISM, which is the answer.

Part of what makes indexing nice is that you don’t need all the words and numbers. If you were missing the GOBLET and 6, you can still index the rest and get PHO_OREALISM, and you can fill in the blank to get PHOTOREALISM, like you would in a crossword. We can do this because puzzlehunt answers are usually English words or phrases.

In fact, even if you only had the letters _H_T_R_____M, it’s still possible to fill in the blanks using a tool like Nutrimatic. Getting the answer without solving most of the puzzle is called short circuiting.

There are variants on the indexing technique that make short circuiting harder:

  • There might be several choices for words and numbers, and you have to pick the right one.
  • The words and numbers might not directly correspond.
  • The letters might need to be reordered.

A tool like Nutrimatic can still be helpful in these cases. In fact, there’s several tools you can use for this, like qhex or OneLook or Qat or Solvertools.

Tools like these rely on a wordlist, which is used to tell whether a string of letters is an English word. Often these wordlists are also scored, where more common words have a higher score. If you have a scored wordlist, you can try every possibility, then sort them by score.

These wordlists vary widely in size. Qat’s wordlists are taken from nine dictionaries, each around 2 MB in size. The Solvertools wordlist is combined from Google Books and Wikipedia, and is 162 MB. Nutrimatic has a large 50 GB index built from Wikipedia. Why do these wordlist sizes vary so much?

The answer recognition problem

For short circuiting puzzles, the goal is to tell how likely a string of letters is to be an answer. Though puzzlehunt answers are usually English words or phrases, there’s a lot of variation. The answers to MIT Mystery Hunt puzzles range from:

  • rare words like SASSILY or IMPLAUSIBILITIES,
  • names like ARMIN ZOGGELER, MULUGETA WENDIMU, or UB IWEKRS,
  • items like BOK CHOY, OREO O’S or VW BUG, and
  • puns like IT’S A WANDERFUL LIFE or WALKY WOAD.

Let’s call these real answers. We want to distinguish these from fake answers, which are random strings of English letters drawn from an English letter distribution. This accounts for indexing random numbers into a list of words, which usually produces fake answers.

Given a method for telling real and fake answers apart, we can judge it based on its precision and recall. We treat real answers as positives. The precision of a method is the ratio of its true positives to its predicted positives, and the recall is the ratio of the true positives to the real positives. Each can be written as a percentage from 0% to 100%.

Let’s say our data had half real answers and half fake answers. Here’s some methods and their precision and recall:

  • Be perfect. This has 100% precision: it never accepts a fake answer. It also has 100% recall: it finds all of the real answers.

  • Call everything a fake answer. This has 100% precision: it never accepts a fake answer. It also has 0% recall: it doesn’t find any of the real answers.

  • Flip a coin; if it’s heads, say it’s a real answer. This has 50% precision: it accepts a fake answer half the time. It also has 50% recall: it finds half of the real answers.

  • Call everything a real answer. This has 50% precision: it accepts a fake answer half the time. It also has 100% recall: it finds all of the real answers.

For recognizing puzzlehunt answers, we want something with high recall: we don’t want to miss any potential answers. But we also want something with high precision: we don’t want to go through a bunch of fake answers to find a real one. The above methods show that it’s a tradeoff: by being more accepting, we have a higher recall, but we get more junk and become less precise. Let’s call this problem—recognizing answers with high precision and recall—the answer recognition problem.

Existing tools

Nutrimatic

Let’s try using Nutrimatic to judge each of the real answers I mentioned above. I searched the phrase on Nutrimatic, then wrote the font size of the first result, rounded to two decimal places. For example, walkywoad gets 0.99.

AnswerFont size
SASSILY1.11
IMPLAUSIBILITIES2.27
ARMIN ZOGGELER0.50
MULUGETA WENDIMU1.98
UB IWERKS2.56
BOK CHOY2.32
OREO O’S2.09
VW BUG2.03
IT’S A WANDERFUL LIFE0.62
WALKY WOAD0.99

If we consider an answer to be real if Nutrimatic gives it a font size of at least 1, then Nutrimatic recognizes 7 of these 10 answers. From a sample of 250 real and 250 fake answers (I didn’t want to overload Nutrimatic’s servers), a threshold of 1 gives a precision of 74% and a recall of 95%. If we want a precision of at least 90%, we can use a threshold of 1.3, which gives a precision of 92% and a recall of 82%. To get a precision of 95%, we have to use a threshold of 2.2, which has precision 95% and recall 73%.

Nutrimatic isn’t a tool built for this. It’s too forgiving of short random strings: for example, DDGD has size 1.21, and NSCNPV has size 1.06. It’s also too strict for long string or puns, like we see above.

Solvertools

The Solvertools library is the only tool I’ve seen that tackles the answer recognition problem head-on, so it does quite well. In fact, I stole the definition of the answer recognition problem from its README.

Solvertools has a function called cromulence, which measures how likely a string of letters is to be a real answer. The threshold for being a real answer is 0. Here’s the cromulences of the real answers I mentioned above:

AnswerCromulence
SASSILY6.7
IMPLAUSIBILITIES19.6
ARMIN ZOGGELER17.2
MULUGETA WENDIMU18.3
UB IWERKS12.6
BOK CHOY9.4
OREO O’S2.7
VW BUG–3.5
IT’S A WANDERFUL LIFE6.8
WALKY WOAD–1.2

Against answer data from every Mystery Hunt, it has a precision of 96% and a recall of 99%. Which is great! It does it with a 162 MB sized wordlist, though, and I wondered if we could make it smaller.

Shrinking the wordlist

Discarding words

I wanted to port the cromulence function from Solvertools to JavaScript, so it can run on the browser. But asking users to download a 162 MB file is a lot on The Internet, and I wanted a client-side solution. The question, then, is how small we can make the wordlist, without compromising too much on precision and recall.

I saw two ways to shrink it. We could discard low-frequency words, of which there’s a lot. The wordlist also has several multi-word phrases, which we can discard if they’re built from smaller words. Anyway, I ran the numbers on the Mystery Hunt answer data:

WordlistSizePrecisionRecall
Original162 MB96.18%98.81%
Discard phrases52 MB95.59%98.26%
Discard <105< 10^519 MB95.69%98.73%
Discard <105< 10^5, phrases6.0 MB95.79%98.15%
Discard <106< 10^64.5 MB96.26%98.73%
Discard <106< 10^6, phrases3.6 MB96.44%98.15%
Discard <107< 10^71.6 MB96.91%93.75%
Discard <107< 10^7, phrases1.5 MB96.31%93.75%
Discard <108< 10^8516 KB97.34%81.23%
Discard <108< 10^8, phrases516 KB97.34%81.23%

The “Discard <105< 10^5” wordlist is formed by discarding entries that have a score less than 10510^5 from the wordlist; similarly the “Discard <105< 10^5, phrases” is formed by taking that and discarding entries that have spaces in them. Note that the last two rows are the same because all entries with frequency at least 10810^8 are single words.

The bolded wordlist is the one I ended up using in cromulence. We still have a nice 98% recall. But the worse performance appears when we look at the answers I chose earlier, which I chose because cromulence performs badly on them with a smaller wordlist:

AnswerOriginalDiscard <106< 10^6, phrases
SASSILY6.7–4.1
IMPLAUSIBILITIES19.6–4.6
ARMIN ZOGGELER17.2–6.6
MULUGETA WENDIMU18.3–4.5
UB IWERKS12.6–2.7
BOK CHOY9.4–3.4
OREO O’S2.7–4.1
VW BUG–3.5–8.0
IT’S A WANDERFUL LIFE6.84.9
WALKY WOAD–1.2–1.2

This is the kind of info that’s hidden in that statistic; even reasonable-looking answers like IMPLAUSIBILITIES score poorly.

Compression

The 3.6 MB size is still pretty big; ideally it’d be under 1 MB. If we gzip the file, we get it down to 1.7 MB, which is better, but still too large. It turns out there are better ways to compress the wordlist than gzip.

The first optimization is, instead of storing the precise frequencies of each word, we instead store their Zipf frequencies. This idea is stolen from wordfreq, another word frequency library. (Solvertools and wordfreq are both by the same person, Robyn Speer.) We take the frequency, get its base-10 logarithm, add 9, then round to two decimal places. We then store words that have the same Zipf frequency together. The final wordlist has pretty much the same performance in the answer recognition problem.

The second optimization is to compress the wordlist using front compression. This is a similar approach to the one used by aspell to store its wordlists. Within each frequency bin, the words are in sorted order, like so:

adjacency
adventists
adversities
affixing
afonso
agon
airpower
airworthy
adjacency
adventists
adversities
affixing
afonso
agon
airpower
airworthy

Because the words are sorted, a word could share a prefix with the previous word:

adjacency
--ventists
----rsities
-ffixing
--onso
-gon
-irpower
---worthy
adjacency
--ventists
----rsities
-ffixing
--onso
-gon
-irpower
---worthy

Instead of storing the full word, we can store the length of the shared prefix. This is what front compression is. It’s a similar idea to compressing things with a prefix trie.

Together, these optimizations get the wordlist down to 1.3 MB. After gzip, this is 713 KB, which is good enough.

Future work

There might be better ways to compress the data; for example, we might get better results if we front encoded the whole word list rather than individual bins. Letting users have the option to choose between wordlists would be cool too, along with more tests on choosing a wordlist that gives good results. It’d be nice to see data collected against a larger puzzlehunt answer corpus, given there’s a lot of puzzlehunts.

The reason I wrote cromulence was so I could make more puzzlehunt tools. There’s a lot that already exist, but they often rely on a server of some sort; I want to make things that run entirely on the browser. See cryptlight, dropquote, and meta-data, all of which run entirely on the client. I’m hoping to use cromulence to write something like puzlink, but on the browser.

Comments

Loading...