How much jargon does your text contain?

My first Google App Engine project went live yesterday. It estimates the readability of a text when jargon such as acronyms and abbreviations is taken into account.

<marketing-bit>As I’ve mentioned before, I’m developing a Natural Language Processing system called ScrewTinny (scrutiny) that analyses the language high-tech vendors use to take their products to market. Knowing how much jargon a text contains allows me to infer which audience the text is aimed at (IT Technical, IT Business, Business). And that’s important to me.</marketing-bit>

Anyway, readability indexes are not new (Flesch-Kincaid, Coleman-Liau, Gunning Fog, SMOG, etc.), so I looked for an existing index that takes jargon into account. I did a great deal of searching and even asked a number of people with an interest in this area, but I couldn’t find one. So I developed my own – and the Goodall Arcanicity Index was born. It has a long way to go before it is truly accurate, but I’ve now coded it in Python and decided to put it up on Google’s appspot cloud. So it’s live at:

It’s very simple. You enter some text, it processes it and gives you a rating for the amount of arcane content (Arcanicity) the text contains. A by-product of my text-processing routines is a mountain of related text statistics, so I decided to add those to the site.

As I discussed here, I also discovered the JavaScript-based Google Visualisation libraries, which I will use as part of the ScrewTinny project. I wanted to get some experience with the Google routines, so for good measure I created visualisations to go along with the text statistics.

Google App Engine and NLTK

One of the interesting technical challenges involved getting the Python-based Natural Language ToolKit (NLTK) routines to work in Google’s App Engine. I had seen that it is notoriously difficult to get NLTK working with Google App Engine due to the way it recursively imports modules. But following some tips from the poster oakmad on this entry, I managed to get a small sub-section of the code working.

This discussion actually merits a separate blog entry where I can document the exact process I went through, and perhaps I will do that when I get time. But for the time being I’ll talk about the general approach. The way that I got the Punkt Sentence Tokenizer working was as follows.

I created a clean local Google App Engine instance and copied in the pickled ‘english.pickle’ Tokenizer object from the NLTK distribution. I un-pickled it and tried to use the resulting object’s tokenize method. This gave an error about supporting imports that hadn’t happened. I then fixed the import and tried again, repeating until I got no further errors. ‘Fixing the import’ meant copying the module folder tree that was being complained about (one folder at a time) from a pristine NLTK installation to the local Google App Engine instance. As oakmad says, creating empty files was important so that the module didn’t go off and grab more than was needed. As I said, I should document this properly – if anyone is interested, let me know and I will.
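The core of that loop is just the standard pickle round-trip: NLTK ships its pre-trained Punkt tokenizer as a pickled object, you load it, and you call tokenize. As a self-contained illustration of the pattern (the real object is NLTK’s Punkt tokenizer loaded from ‘english.pickle’; the tiny stand-in class below is hypothetical, here only so the sketch runs without the NLTK data files):

```python
import pickle
import re

# Stand-in for NLTK's Punkt tokenizer, exposing the same tokenize()
# interface. The real object is trained and far smarter than this.
class TinyTokenizer:
    def tokenize(self, text):
        # Naive split on sentence-final punctuation followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# NLTK distributes its tokenizer as a pickle ('english.pickle')...
blob = pickle.dumps(TinyTokenizer())

# ...so inside the App Engine instance you un-pickle it and use it.
# Each failed attempt surfaced as an ImportError naming the NLTK
# module folder that needed to be copied across.
tokenizer = pickle.loads(blob)
sentences = tokenizer.tokenize("Hello world. NLTK is handy! Is it easy?")
```

With the real ‘english.pickle’ you would replace the dumps/loads pair with `pickle.load(open("english.pickle", "rb"))` and keep copying module folders across until the un-pickle and the tokenize call both succeed.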

It has to be said, however, that I tried a similar technique to get NLTK’s CMU pronunciation dictionary (CMUDICT) working. It became very complex, very quickly, and as I’m not a real programmer I gave up. But I did get to use the cmudict routines on Google App Engine by building a separate data structure. I wanted the cmudict routines so I could count syllables accurately, and if I say so myself, my solution was quite ‘lateral’. That definitely does need a separate post, so I will write it when I get time.
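The syllable-counting part rests on a property of the CMU dictionary itself: vowel phones carry a stress digit (0, 1 or 2), so the syllable count of a word is simply the number of phones ending in a digit. A minimal sketch of that idea, using a tiny hand-copied dictionary as a stand-in for the separate data structure (the two entries below are real cmudict-style pronunciations; the structure and fallback are assumptions, not my actual code):

```python
# Tiny stand-in for a cmudict-derived lookup table: word -> phone list.
# In cmudict, vowel phones end in a stress digit (e.g. AH0, IH1).
CMU_STYLE = {
    "readability": ["R", "IY2", "D", "AH0", "B", "IH1", "L", "AH0", "T", "IY0"],
    "jargon": ["JH", "AA1", "R", "G", "AH0", "N"],
}

def count_syllables(word):
    phones = CMU_STYLE.get(word.lower())
    if phones is None:
        return None  # real code would fall back to a heuristic here
    # One syllable per stressed/unstressed vowel phone.
    return sum(1 for p in phones if p[-1].isdigit())
```

So `count_syllables("readability")` gives 5 and `count_syllables("jargon")` gives 2, without shipping the whole NLTK corpus machinery to App Engine.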

Danny Goodall

Posted in Arcanicity, Cloud Stuff, Natural Language Processing (NLP), Python.