Counting Syllables Accurately in Python on Google App Engine
I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found one or two routines written in PHP that looked promising so I ported them to Python but was pretty disappointed with the accuracy.
I also found a Python routine in the contributed code for NLTK that was not bad but again struggled with some words. You see, I had naively thought this would be a simple exercise. I hadn’t realised that syllable counting in English is pretty difficult, with so many exceptions that even the most elegant algorithm ends up convoluted and clumsy.
I then stumbled across this snippet of code by Jordan Boyd-Graber, via the excellent Running with Data site, and it seemed so elegant that I thought it must be too simplistic. Far from it: it is very accurate for the words it knows.
The code is shown here.
import curses
from curses.ascii import isdigit
import nltk
from nltk.corpus import cmudict

# Load the pronunciation dictionary once: word -> list of pronunciations
d = cmudict.dict()

def nsyl(word):
    # One count per pronunciation: the number of phonemes ending in a digit
    return [len(list(y for y in x if isdigit(y[-1]))) for x in d[word.lower()]]
It works by looking up the pronunciation of the word in Carnegie Mellon University’s pronunciation dictionary, which is part of the Python-based Natural Language Toolkit (NLTK). This returns one or more pronunciations for the word. Then the clever bit is that the routine counts the vowel sounds in each pronunciation, which it can pick out because each one ends in a digit. The raw entry from the cmudict file for the word SYLLABLE is shown below.
SYLLABLE 1 S IH1 L AH0 B AH0 L
In cmudict every vowel sound is written as a string of letters ending in a digit. The digit marks the stress on that vowel (0 for unstressed, 1 for primary stress, 2 for secondary stress), so counting the phonemes that end in a digit counts the vowel sounds, and that is the syllable count. For the words that the dictionary knows about (120,000+ I believe), this is a very accurate method for obtaining the syllable count.
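To make the counting step concrete, here is a minimal sketch that applies the same idea to the SYLLABLE entry shown above. The phoneme string is hard-coded so that it runs without NLTK, and the helper name count_syllables is made up for this illustration.

```python
def count_syllables(phonemes):
    """Count the phonemes that end in a digit, i.e. the vowel sounds."""
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())

# Pronunciation of SYLLABLE, copied from the cmudict entry above
syllable = "S IH1 L AH0 B AH0 L".split()
print(count_syllables(syllable))  # 3 (IH1, AH0, AH0)
```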
However, there is a problem. As my target environment is Google App Engine, that little line at the top of the code that says…
import nltk
…ruins your entire afternoon.
You see, NLTK and Google App Engine don’t work well together due to NLTK’s recursive imports. I spent some time trying to unwind the recursive imports on cmudict so that it would load on Google App Engine, but to no avail.
So then I thought laterally and decided to build my own structure from the cmudict file (the raw 3.6MB text file that NLTK loads and wraps an object around). My plan was as follows:
- Parse the raw cmudict file
- For every word in the file, call the above syllable count routine
- Store the resultant syllable count in a word -> syllable lookup structure (a Python dictionary)
- Pickle the resultant dictionary
- Un-pickle it where it is needed
And this seems to have worked quite well.
The code below builds the pickle file.
#!/usr/bin/env python
# Run this offline, where NLTK and its cmudict corpus are installed
from curses.ascii import isdigit
from nltk.corpus import cmudict
try:
    import cPickle as pickle  # faster C implementation on Python 2
except ImportError:
    import pickle

#-----
# Create a shared dictionary keyed on the word with the value as a list of
# possible syllable counts
GzzCMUDict = cmudict.dict()
GdcSyllableCount = {}

def CreatePickle(AlgQuiet=False):
    def SyllableCount(AszWord):
        """Return one syllable count per pronunciation of the word."""
        #http://groups.google.com/group/nltk-users/msg/81e70cb6704dc01e?pli=1
        return [len([y for y in x if isdigit(y[-1])]) for x in GzzCMUDict[AszWord.lower()]]
    try:
        # The file is only read, so open it read-only
        LhaInputFile = open('cmudict', 'r')
    except IOError:
        print("Could not open the cmudict file")
        raise
    try:
        for LszLine in LhaInputFile:
            # The word is the first space-separated field on each line
            LszWord = LszLine.split(' ')[0].lower()
            LliSyllableList = SyllableCount(LszWord)
            if LszWord not in GdcSyllableCount:
                GdcSyllableCount[LszWord] = sorted(LliSyllableList)
                if not AlgQuiet:
                    print("%-20s added %s" % (LszWord, LliSyllableList))
            else:
                if not AlgQuiet:
                    print(" -Word (%s) found twice. First count was %s, second was %s" % (LszWord, GdcSyllableCount[LszWord], LliSyllableList))
    except Exception:
        print("An error was encountered processing the file.")
        raise IOError
    finally:
        LhaInputFile.close()
    try:
        #-----
        # Now write the dictionary away to a new pickle file
        if not AlgQuiet:
            print("Finished processing input file\n\nNow dumping pickle file\n")
        # Protocol -1 is a binary format, so the file must be opened in binary mode
        with open('cmusyllables.pickle', 'wb') as LhaOutputFile:
            pickle.dump(GdcSyllableCount, LhaOutputFile, -1)
        if not AlgQuiet:
            print("Pickle file cmusyllables.pickle has been created.")
    except Exception:
        print("An error was encountered writing the pickle file.")
        raise IOError

def main():
    #-----
    # Open the CMU file and for each entry create a dict with the resulting
    # number of syllables
    CreatePickle()

if __name__ == '__main__':
    main()
This results in a dictionary lookup that gives an accurate syllable count (or counts, because some words have multiple pronunciations and therefore multiple syllable counts) for the words it has in its dictionary.
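As a rough illustration of the same build-and-pickle idea, the counts can also be read straight off the raw lines, without NLTK at all. The sketch below assumes every line follows the "WORD N PHONEMES" layout of the SYLLABLE entry shown earlier. The function name build_table is hypothetical, and the second sample pronunciation is invented purely to show how multiple pronunciations end up in one list.

```python
import io
import pickle

# Sample lines in the layout of the SYLLABLE entry shown earlier.
# The second pronunciation is invented for this example only.
SAMPLE_LINES = [
    "SYLLABLE 1 S IH1 L AH0 B AH0 L",
    "SYLLABLE 2 S IH1 L B AH0 L",
]

def build_table(lines):
    """Map each lower-cased word to a sorted list of its syllable counts."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # fields[1] is the pronunciation number, so the phonemes start at fields[2]
        word, phonemes = fields[0].lower(), fields[2:]
        count = sum(1 for p in phonemes if p[-1].isdigit())
        table.setdefault(word, []).append(count)
    return {word: sorted(counts) for word, counts in table.items()}

table = build_table(SAMPLE_LINES)

# Round-trip through pickle in memory, as the build and load steps do on disk
buffer = io.BytesIO()
pickle.dump(table, buffer, pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)
print(restored)  # {'syllable': [2, 3]}
```

Only the standard library is needed at load time, which is the point of pickling the table in the first place.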
Words not in the Dictionary
But what about words that the dictionary doesn’t know about? Well, the way I handled that was to build a fallback routine into the code. The best (most accurate) mechanical routine I found was PHP-based and is part of Russell McVeigh’s site:
http://www.russellmcveigh.info/content/html/syllablecounter.php
I ported Russell’s code to Python and added a couple of other exceptions that I found. Most of the mechanical syllable calculation routines I found work on the following basic syllable rules:
- Count the number of vowels in the word
- Subtract one for any silent vowels such as the e at the end of a word
- Subtract any additional vowels in vowel pairs/triplets (ee, ei, eau, etc.) i.e. each group of multiple vowels scores only one vowel
The number you have left is the starting syllable count. There then follows a series of adjustments: if certain patterns are recognised in the word, syllables are added or taken away, and the final total is the routine’s answer. Even with all this adjustment it’s never completely accurate, but it is perhaps good enough for those words not in the cmudict.
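The basic rules above, before any pattern adjustments, can be sketched in a few lines. This is only an illustration of the rules as listed, not Russell’s routine or my port of it. The name naive_syllables is made up, and treating y as a vowel is an assumption.

```python
import re

def naive_syllables(word):
    """Rough estimate: one syllable per vowel group, minus a silent final 'e'."""
    word = word.lower()
    # Each run of vowels (ee, ei, eau, ...) scores as a single vowel
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    # Treat a trailing 'e' as silent, unless it is the only vowel group
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

for word in ["cat", "make", "beautiful", "table"]:
    print(word, naive_syllables(word))
# cat 1
# make 1
# beautiful 3
# table 1
```

Note that "table" comes out as 1 when it should be 2, which is exactly the kind of case the pattern adjustments exist to catch.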
So the code I’ve developed is really simple. It looks up syllable counts in the cmudict and returns the results if found and if not has a guess at the syllable count instead. I’d really like to share the code with you but something in my wordpress theme or the syntax highlighter that I use objects to something in the code. Perhaps, as I’m not a proper programmer it doesn’t like my esoteric, bastardised Hungarian notation variable names?
So I can’t post it here at the moment but will try to get that fixed. If you’re interested contact me and I’ll happily share it.
Danny Goodall
Edit – It looks like I *might* have solved that problem by using a different syntax highlighter.
#!/usr/bin/env python
try:
    import cPickle as pickle  # faster C implementation on Python 2
except ImportError:
    import pickle
import re

class cmusyllables(object):
    def __init__(self):
        #-----
        # Record the mode of the syllable count - manual / lookup
        self.szMode = None
        # Empty until Load() succeeds, so lookups miss rather than fail
        self.dcSyllableCount = {}
        #-----
        # New structures for the SyllableCount3 routine
        self.dcSyllable3WordCache = {}
        # Each pattern that matches takes one syllable off the vowel-group count
        self.liSyllable3SubSyllables = [
            'cial',
            'tia',
            'cius',
            'cious',
            'uiet',
            'gious',
            'geous',
            'priest',
            'giu',
            'dge',
            'ion',
            'iou',
            'sia$',
            '.che$',
            '.ched$',
            '.abe$',
            '.ace$',
            '.ade$',
            '.age$',
            '.aged$',
            '.ake$',
            '.ale$',
            '.aled$',
            '.ales$',
            '.ane$',
            '.ame$',
            '.ape$',
            '.are$',
            '.ase$',
            '.ashed$',
            '.asque$',
            '.ate$',
            '.ave$',
            '.azed$',
            '.awe$',
            '.aze$',
            '.aped$',
            '.athe$',
            '.athes$',
            '.ece$',
            '.ese$',
            '.esque$',
            '.esques$',
            '.eze$',
            '.gue$',
            '.ibe$',
            '.ice$',
            '.ide$',
            '.ife$',
            '.ike$',
            '.ile$',
            '.ime$',
            '.ine$',
            '.ipe$',
            '.iped$',
            '.ire$',
            '.ise$',
            '.ished$',
            '.ite$',
            '.ive$',
            '.ize$',
            '.obe$',
            '.ode$',
            '.oke$',
            '.ole$',
            '.ome$',
            '.one$',
            '.ope$',
            '.oque$',
            '.ore$',
            '.ose$',
            '.osque$',
            '.osques$',
            '.ote$',
            '.ove$',
            '.pped$',
            '.sse$',
            '.ssed$',
            '.ste$',
            '.ube$',
            '.uce$',
            '.ude$',
            '.uge$',
            '.uke$',
            '.ule$',
            '.ules$',
            '.uled$',
            '.ume$',
            '.une$',
            '.upe$',
            '.ure$',
            '.use$',
            '.ushed$',
            '.ute$',
            '.ved$',
            '.we$',
            '.wes$',
            '.wed$',
            '.yse$',
            '.yze$',
            '.rse$',
            '.red$',
            '.rce$',
            '.rde$',
            '.ily$',
            '.ely$',
            '.des$',
            '.gged$',
            '.kes$',
            '.ced$',
            '.ked$',
            '.med$',
            '.mes$',
            '.ned$',
            '.[sz]ed$',
            '.nce$',
            '.rles$',
            '.nes$',
            '.pes$',
            '.tes$',
            '.res$',
            '.ves$',
            'ere$'
            ]
        # Each pattern that matches adds one syllable
        self.liSyllable3AddSyllables = [
            'ia',
            'riet',
            'dien',
            'ien',
            'iet',
            'iu',
            'iest',
            'io',
            'ii',
            'ily',
            '.oala$',
            '.iara$',
            '.ying$',
            '.earest',
            '.arer',
            '.aress',
            '.eate$',
            '.eation$',
            '[aeiouym]bl$',
            '[aeiou]{3}',
            # '^mc' was listed twice, which added two syllables to every "mc" word
            '^mc',
            'ism',
            'asm',
            # Raw string so that \1 is a regex backreference, not the character \x01
            r'([^aeiouy])\1l$',
            '[^l]lien',
            '^coa[dglx].',
            '[^gq]ua[^auieo]',
            'dnt$'
            ]
        #-----
        # Create a list of the compiled regex
        self.liSyllable3RESubSyllables = [re.compile(LszRegEx) for LszRegEx in self.liSyllable3SubSyllables]
        self.liSyllable3REAddSyllables = [re.compile(LszRegEx) for LszRegEx in self.liSyllable3AddSyllables]

    def Load(self, AszFile='cmusyllables.pickle'):
        """Load the pickled word -> syllable counts table; return True on success."""
        try:
            with open(AszFile, 'rb') as LhaPickleFile:
                self.dcSyllableCount = pickle.load(LhaPickleFile)
        except Exception:
            return False
        return True

    def GetRawDict(self):
        return self.dcSyllableCount

    def NonCMUSyllableCount(self, AszWord):
        """Estimate the syllable count of a word that is not in the cmudict table."""
        LszWord = AszWord
        #-----
        # If we've already seen this before then return the syllables
        if LszWord in self.dcSyllable3WordCache:
            return self.dcSyllable3WordCache[LszWord]
        #-----
        # Split on runs of consonants, which leaves the vowel groups
        LliWordParts = re.split(r'[^aeiouy]+', LszWord)
        #-----
        # Keep only the non-empty vowel groups
        LliValidWordParts = [LszValue for LszValue in LliWordParts if LszValue != '']
        LinSyllables = 0
        #-----
        # Loop through the compiled regexs looking for matches
        for LreSylRE in self.liSyllable3RESubSyllables:
            if LreSylRE.search(LszWord) is not None:
                LinSyllables -= 1
        for LreSylRE in self.liSyllable3REAddSyllables:
            if LreSylRE.search(LszWord) is not None:
                LinSyllables += 1
        #-----
        # Now add the number of vowel groups to the adjustments
        LinSyllables += len(LliValidWordParts)
        #-----
        # Every word must have at least 1 syllable
        if LinSyllables < 1:
            LinSyllables = 1
        #-----
        # Record this result in the word cache
        self.dcSyllable3WordCache[LszWord] = LinSyllables
        #-----
        # Return the result
        return LinSyllables

    def SyllableCount(self, AszWord, AszMode='max', AlgFallBack=True):
        # Normalise the mode, using 'max' for anything unrecognised
        LszMode = AszMode.lower()
        if LszMode not in ['min', 'max', 'ave', 'raw']:
            LszMode = 'max'
        LszWord = AszWord.lower()
        if len(LszWord) == 0 or LszWord not in self.dcSyllableCount:
            self.szMode = None
            if len(LszWord) == 0 or not AlgFallBack:
                # Nothing to count: return an empty result of the right type
                if LszMode == 'ave':
                    return 0.0
                elif LszMode == 'raw':
                    return []
                return 0
            LliSyllableList = [self.NonCMUSyllableCount(LszWord)]
            self.szMode = 'manual'
        else:
            LliSyllableList = self.dcSyllableCount[LszWord]
            self.szMode = 'lookup'
        if LszMode == 'min':
            return min(LliSyllableList)
        elif LszMode == 'max':
            return max(LliSyllableList)
        elif LszMode == 'ave':
            return float(sum(LliSyllableList)) / len(LliSyllableList)
        return LliSyllableList  # 'raw'

    def GetSyllableMode(self):
        #-----
        # Return either None, manual or lookup depending on how the last
        # syllable count was arrived at
        return self.szMode

def main():
    LzzSyllableCounter = cmusyllables()
    LzzSyllableCounter.Load()
    LliList = ['','theatre','productized','productised','pumblechook','everything','altogether','particular','opportunity','everybody','cooeed','cueing']
    for LszWord in LliList:
        print("'%s' has max(%d), min(%d), ave(%3.2f), raw(%s) syllables - Calculated by (%s)" % (
            LszWord,
            LzzSyllableCounter.SyllableCount(LszWord),
            LzzSyllableCounter.SyllableCount(LszWord, AszMode='min'),
            LzzSyllableCounter.SyllableCount(LszWord, AszMode='ave'),
            LzzSyllableCounter.SyllableCount(LszWord, AszMode='raw'),
            LzzSyllableCounter.GetSyllableMode()))

if __name__ == '__main__':
    main()
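The way lookup, fallback and the min/max/ave/raw modes fit together is easier to see in isolation. The stripped-down sketch below is not the class itself: the function names summarise and lookup, the hand-built table with its made-up counts, and the placeholder guesser are all assumptions for illustration.

```python
def summarise(counts, mode="max"):
    """Reduce a list of syllable counts the way the min/max/ave/raw modes do."""
    if mode == "min":
        return min(counts)
    if mode == "max":
        return max(counts)
    if mode == "ave":
        return sum(counts) / float(len(counts))
    return list(counts)  # 'raw'

def lookup(word, table, fallback):
    """Return (counts, source): the stored counts, or a one-item guess."""
    word = word.lower()
    if word in table:
        return table[word], "lookup"
    return [fallback(word)], "manual"

# Hypothetical hand-built table with made-up counts; a real one comes from the pickle file
table = {"everything": [3, 4]}

def guess(word):
    """Placeholder guesser standing in for the regex-based routine."""
    return 1

counts, source = lookup("Everything", table, guess)
print(summarise(counts, "min"), summarise(counts, "max"), summarise(counts, "ave"), source)
# 3 4 3.5 lookup
```

Keeping the source ('lookup' or 'manual') alongside the counts lets the caller decide how much to trust the answer.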

The RhymeBrain API also includes syllable counting, and uses the CMU dictionary. As a fallback, when the dictionary doesn’t contain the pronunciation of a word, it derives a CMU-style pronunciation using machine learning and then counts the number of syllables.