Counting Syllables Accurately in Python on Google App Engine

I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found one or two routines written in PHP that looked promising so I ported them to Python but was pretty disappointed with the accuracy.

I also found a Python routine in the contributed code for NLTK that was not bad but again struggled with some words. You see, I had naively thought this would be a simple exercise. I hadn’t realised that syllable counting in English is pretty difficult stuff, with so many exceptions that even the most elegant algorithm ends up convoluted and clumsy.

I then stumbled across this snippet of code by Jordan Boyd-Graber, via the excellent Running with Data site, and it seemed so elegant that I thought it must be too simplistic. Far from it: it is very accurate for the words it knows.

The code is shown here.

It works by looking up the pronunciation of the word in Carnegie Mellon University’s pronunciation dictionary (cmudict), which ships with the Python-based Natural Language Toolkit (NLTK). This returns one or more pronunciations for the word. Then the clever bit is that the routine counts the vowel sounds in each pronunciation, one per syllable. The raw entry from the cmudict file for the word SYLLABLE is shown below.

The vowel sounds are the strings of letters ending in a number. The number is a stress marker (0 for no stress, 1 for primary stress, 2 for secondary stress), and because every syllable contains exactly one vowel sound, counting these strings gives the syllable count. Anyway, for the words that the dictionary knows about (120,000+ I believe), this represents a very accurate method for obtaining the syllable count.
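
To make the counting step concrete, here is a minimal sketch of the idea rather than the original snippet. It assumes each pronunciation is a list of ARPAbet phoneme strings, which is the shape that nltk.corpus.cmudict.dict() returns. The function name count_syllables and the hand-written pronunciation for "syllable" are illustrative, chosen so the example runs without NLTK installed.

```python
def count_syllables(pronunciation):
    """Count the vowel phonemes in one ARPAbet pronunciation.

    In cmudict only vowel phonemes carry a trailing stress digit
    (0 = no stress, 1 = primary, 2 = secondary), so counting the
    phonemes that end in a digit counts the syllables.
    """
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


# Assumption: written out by hand in the form that
# nltk.corpus.cmudict.dict()["syllable"] would return.
pronunciations = [["S", "IH1", "L", "AH0", "B", "AH0", "L"]]

# One count per pronunciation, since a word can have several.
counts = [count_syllables(p) for p in pronunciations]
print(counts)  # [3]
```

Returning one count per pronunciation keeps the information that some words can legitimately be said with different numbers of syllables.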

However, there is a problem. As my target environment is Google App Engine, that little line at the top of the code that says…

…ruins your entire afternoon.

You see, NLTK and Google App Engine don’t work well together because of NLTK’s recursive imports. I spent some time trying to unwind the recursive imports around cmudict so that it would load on Google App Engine, but to no avail.

So then I thought laterally and decided to build my own structure from the cmudict file (the raw 3.6MB text file that NLTK loads and wraps an object around). My plan was as follows:

  1. Parse the raw cmudict file
  2. For every word in the file call the above syllable count routine
  3. Store the resultant syllable count in a word -> syllable lookup structure (a Python dictionary)
  4. Pickle the resultant dictionary
  5. Un-pickle it where it is needed

And this seems to have worked quite well.

The code below builds the pickle file.
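
As a hedged illustration of the five steps in the plan, the sketch below is not the original listing. It assumes each line of the raw file has the layout WORD, pronunciation number, then phonemes, as in the copy of cmudict that NLTK ships. The function name build_syllable_lookup and the sample lines are illustrative, and the pickle is kept in memory so the example is self-contained.

```python
import pickle
from collections import defaultdict


def build_syllable_lookup(lines):
    """Map each lower-cased word to a sorted list of its distinct syllable counts."""
    lookup = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines; a valid line needs the word,
        # the pronunciation number and at least one phoneme.
        if len(fields) < 3:
            continue
        word, phonemes = fields[0].lower(), fields[2:]
        # Vowel phonemes end in a stress digit, one per syllable.
        lookup[word].add(sum(1 for p in phonemes if p[-1].isdigit()))
    return {word: sorted(counts) for word, counts in lookup.items()}


# Illustrative lines in the assumed layout, standing in for the real file.
sample = [
    "SYLLABLE 1 S IH1 L AH0 B AH0 L",
    "FIRE 1 F AY1 ER0",
    "FIRE 2 F AY1 R",
]
lookup = build_syllable_lookup(sample)

# Steps 4 and 5: pickle the dictionary, then un-pickle it where needed.
blob = pickle.dumps(lookup, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print(restored["syllable"], restored["fire"])  # [3] [1, 2]
```

In practice the lines would come from the opened cmudict file and the bytes would be written out with pickle.dump, so that the application only has to load the finished dictionary and never imports NLTK.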

This results in a dictionary lookup that gives an accurate syllable count (or counts, because some words have multiple pronunciations and therefore multiple syllable counts) for the words it has in its dictionary.

Words not in the Dictionary

But what about words that the dictionary doesn’t know about? Well, the way I handled that was to build a fallback routine into the code. The best (most accurate) mechanical routine I found was PHP-based and is part of Russell McVeigh’s site:

http://www.russellmcveigh.info/content/html/syllablecounter.php

I ported Russell’s code to Python and added a couple of other exceptions that I found. Most of the mechanical syllable calculation routines I found work on the following basic syllable rules:

  1. Count the number of vowels in the word
  2. Subtract one for any silent vowels such as the e at the end of a word
  3. Subtract any additional vowels in vowel pairs/triplets (ee, ei, eau, etc.) i.e. each group of multiple vowels scores only one vowel

The number you have left is the number of syllables. However, there then follows a series of adjustments: if certain patterns are recognised in the word, syllables are added or taken away until you end up with what should be the correct syllable count. Even with all this adjustment it’s never completely accurate, but it is perhaps good enough for those words not in the cmudict.
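
The three basic rules can be sketched in a few lines. This is a deliberately small illustration, not the ported PHP routine. The function name guess_syllables is hypothetical, and the single adjustment for words ending in a consonant plus "le" is my own assumption about the kind of exception such routines add.

```python
import re


def guess_syllables(word):
    """Rough syllable estimate from spelling alone, using the basic rules."""
    word = word.lower()
    # Rules 1 and 3: each run of consecutive vowels counts once.
    count = len(re.findall(r"[aeiouy]+", word))
    # Rule 2: a trailing 'e' is usually silent, unless it follows
    # a consonant + 'l' as in "table" (an assumed adjustment).
    if word.endswith("e") and not re.search(r"[^aeiouy]le$", word):
        count -= 1
    # Every word has at least one syllable.
    return max(count, 1)


for word in ["make", "table", "beautiful", "the", "walked"]:
    print(word, guess_syllables(word))
```

This gives 1 for "make", 2 for "table" and 3 for "beautiful", but it also gives 2 for "walked", which shows why real routines pile further adjustments on top and still miss some words.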

So the code I’ve developed is really simple. It looks up syllable counts in the cmudict and returns the results if found; if not, it has a guess at the syllable count instead. I’d really like to share the code with you, but something in my WordPress theme or the syntax highlighter that I use objects to something in the code. Perhaps, as I’m not a proper programmer, it doesn’t like my esoteric, bastardised Hungarian notation variable names?

So I can’t post it here at the moment but will try to get that fixed. If you’re interested contact me and I’ll happily share it.

Danny Goodall

Edit – It looks like I *might* have solved that problem by using a different syntax highlighter.

Posted in Language and Text Processing, Tips.

5 Comments

  1. The RhymeBrain API also includes syllable counting, and uses the CMU dictionary. As a fallback, when the dictionary doesn’t contain the pronunciation of a word, it derives a CMU-style pronunciation using machine learning and then counts the number of syllables.
