Anti-Spelling Bee

Pavan B Govindaraju
2 min read · Jun 16, 2024

The English language has more than 150,000 words, and one of its quirks is that it is not phonetic: the pronunciation of a word does not always map predictably to its spelling. This has led to several competitions, the most famous of which is the Scripps National Spelling Bee, for which contestants memorize spellings and techniques extensively.

One lazy Sunday evening, I was watching a classic Jandhyala Telugu movie and stumbled upon a timeless comedy scene. The main character (Naresh) is scheduled to attend a job interview whose outcome his parents have already arranged. He is casually asked to spell the word ‘Coffee’, and that is all that is needed to get the job.

But Naresh intentionally spells it as ‘K-A-U-P-H-Y’, which has the surprising quality of being ‘maximally disjoint’: it sounds the same as the original, yet it does not use a single letter from the original spelling. This blows the interviewer’s mind and gives the audience a belly ache in the process.

This got me wondering whether there is a way to do this computationally for any word. The following Python notebook outlines the approach:
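For readers without the notebook at hand, the general idea can be sketched roughly as follows. This is not the notebook’s code: the sound labels, the small sound-to-spelling map and the function names are all assumptions made for illustration, and the word is split into sounds by hand rather than by any pronunciation library.

```python
from itertools import product

# Hypothetical, hand-written map from sounds to candidate spellings.
# Both the sound labels and the spellings are illustrative assumptions.
SOUND_SPELLINGS = {
    "k": ["c", "k", "q", "ck"],
    "aw": ["o", "au", "aw"],
    "f": ["ff", "f", "ph", "gh"],
    "ee": ["ee", "y", "i", "ea"],
}


def respellings(sounds):
    """Build every spelling obtained by choosing one option per sound."""
    options = [SOUND_SPELLINGS[sound] for sound in sounds]
    return ["".join(parts) for parts in product(*options)]


def edit_distance(a, b):
    """Levenshtein distance computed with a row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete char_a
                curr[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),   # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


def rank_respellings(word, sounds):
    """Return alternative spellings, farthest from the original first."""
    word = word.lower()
    candidates = [c for c in respellings(sounds) if c != word]
    return sorted(candidates, key=lambda c: edit_distance(word, c), reverse=True)


# The sound breakdown of 'coffee' is supplied by hand here (an assumption).
ranked = rank_respellings("coffee", ["k", "aw", "f", "ee"])
print(ranked[:10])
```

With this toy map, spellings such as ‘kauphy’, ‘kawphi’ and ‘qaughi’ all appear among the candidates. A fuller map would cover more sounds and more spellings per sound.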

As one can see, this approach yields a reasonable output for the word ‘Coffee’, with several alternatives also utilizing the ‘gh’ combination, which we see in words such as ‘cough’.

Some of my favourites include:

kawphi, qaughi and honourable mentions such as kawphea

It can be debated whether the edit distance is the right metric, particularly when the alternate spelling can be longer than the original. A spelling such as ‘kawphea’ has an edit distance of 6, which is the length of the original word, but it shares the letter ‘e’ with the original and is thus not ‘maximally disjoint’. Either way, one could simply write a function that counts the common letters in a single line, using the difference between the two sets of letters, and perform the same analysis.
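A minimal sketch of such a check is shown below. The function names are hypothetical, and the sketch uses set intersection rather than set difference, since the intersection gives the shared letters directly.

```python
def common_letters(original, alternative):
    """Return the letters that occur in both spellings, ignoring case."""
    return set(original.lower()) & set(alternative.lower())


def is_maximally_disjoint(original, alternative):
    """True when the two spellings share no letter at all."""
    return not common_letters(original, alternative)


print(common_letters("Coffee", "kawphea"))        # the shared letters
print(is_maximally_disjoint("Coffee", "kauphy"))  # no shared letters
```

Filtering the candidate spellings with `is_maximally_disjoint` would keep ‘kauphy’ and drop ‘kawphea’, regardless of their edit distances.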

There is some subjectivity to this entire approach, particularly because the phonetic map is not exhaustive. One could also define cutoffs based on other criteria to consider only certain outputs.

Acknowledgements

Many thanks to Sankeerth Rao K for discussions on the computational approach
