<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OneBloke &#187; High-Tech Marketing Consultancy, Natural Language Process Software and Home of ScrewTinny &#8211; Inferring Meaning from Text</title>
	<atom:link href="http://www.onebloke.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.onebloke.com</link>
	<description>OneBloke Technology Marketing</description>
	<lastBuildDate>Sat, 06 Oct 2012 11:31:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Heroku Versus AppEngine and Amazon EC2 &#8211; Where Does it fit in?</title>
		<link>http://www.onebloke.com/2012/04/heroku-versus-appengine-and-amazon-ec2-where-does-it-fit-in/</link>
		<comments>http://www.onebloke.com/2012/04/heroku-versus-appengine-and-amazon-ec2-where-does-it-fit-in/#comments</comments>
		<pubDate>Thu, 26 Apr 2012 15:43:54 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Cloud / Platform as a Service]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[heroku]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=362</guid>
		<description><![CDATA[I&#8217;ve just had a really pleasant experience looking at Heroku &#8211; the &#8216;cloud application platform&#8217; from Salesforce.com but it&#8217;s left me wondering where it fits in.
A mate of mine who works for Salesforce.com suggested I look at Heroku after I told him that I&#8217;d had some good and bad experiences with  [...]]]></description>
				<content:encoded><![CDATA[<h3><a href="http://www.onebloke.com/wp-content/uploads/2012/04/applesandoranges.jpg"><img class="alignright size-medium wp-image-363" title="applesandoranges" src="http://www.onebloke.com/wp-content/uploads/2012/04/applesandoranges-300x201.jpg" alt="" width="300" height="201" /></a>I&#8217;ve just had a really pleasant experience looking at Heroku &#8211; the &#8216;cloud application platform&#8217; from Salesforce.com but it&#8217;s left me wondering where it fits in.</h3>
<p>A mate of mine who works for Salesforce.com suggested I look at Heroku after I told him that I&#8217;d had some good and bad experiences with Google&#8217;s AppEngine and Amazon&#8217;s EC2. I&#8217;d been looking for somewhere to host some Python code that I&#8217;d written in my spare time and I had looked at both AppEngine and EC2 and found pros and cons with both of them.</p>
<p>As it turns out it was a good suggestion  because Heroku&#8217;s approach is very good for the spare-time developer like me. That&#8217;s not to say that it&#8217;s only an entry level environment &#8211; I&#8217;m sure it will scale with my needs, but getting up and running with it is very easy.</p>
<p>Having had some experience of the various platforms, I&#8217;m wondering where Heroku fits in. My high-level thoughts&#8230;</p>
<h2>Amazon&#8217;s EC2 &#8211; A Linux prompt in the sky</h2>
<p>Starting with EC2, I found EC2 the simplest concept to get to grips with but by far the most complex to configure. For the uninitiated, EC2 provides you with a machine instance in the cloud which is a very simple concept to understand. Every time you start a machine instance you effectively get a Linux prompt, of varying degrees of power and capacity, in the sky. What this means is that you have to manually configure the OS, database, web infrastructure, caching, etc. This is excellent in that it gives unrivalled flexibility and after all, we&#8217;ve all had to configure our development and test environment anyway so we should understand the technology.</p>
<p>But imagine that you&#8217;ve architected your system to have multiple machines hosting the database, multiple machines processing logic and multiple web servers managing user load; you have to configure each of these instances yourself. This is non-trivial and if you want to be able to flexibly scale each of the machine layers then you own that problem yourself (although there are aftermarket solutions to this too).</p>
<p>But what it does mean is that if you&#8217;re taking a system that is currently deployed on internal infrastructure and deploying it to the cloud, you can mimic the internal configuration in the cloud. This in turn means that the application itself does not necessarily need to be re-architected.</p>
<p>The sheer amount of additional infrastructure that Amazon makes available to cloud developers (queuing, cloud storage, MapReduce farms, caching, etc.), coupled with their experience of managing both the infrastructure and the associated business models, makes Amazon an easy choice for serious cloud deployments.</p>
<h2>Google AppEngine &#8211; Sandbox deployment dumbed down to the point of being dumb?</h2>
<p>So I&#8217;m a fan of Google, in the same way that I might say I&#8217;m a fan of oxygen. It&#8217;s omnipresent and it turns out that it&#8217;s easier to use a Google service than not &#8211; for pretty much all of Google&#8217;s services. They really understand the &#8220;giving crack cocaine free to school kids&#8221; model of adoption. They also like Python (my drug of choice) and so using AppEngine was a natural choice for me. AppEngine presents you with an abstracted view of a machine instance that runs your code and supports Java, Python or Google&#8217;s new Go language. With such language restrictions it&#8217;s clear to see that, unlike EC2, Google is presenting developers with a cosseted, language-aware, sand-boxed environment in which to run code. The fact that Google tunes the virtual machines to host and scale code optimally is, depending on your mindset, either a very good thing or close to being the end of the world. For me, not wanting, knowing how to, or needing to push the bounds of the language implementation, I found the AppEngine environment intuitive and easy. It&#8217;s Google, right?</p>
<p>But some of the Python restrictions, such as not being able to use modules that contain C code, are just too restrictive. Google also doesn&#8217;t present the developer with a standard SQL database interface, which adds another layer of complexity as you have to use Google&#8217;s high replication datastore. Google would argue, with some justification I&#8217;m sure, that you can&#8217;t use a standard SQL database in an environment where the infrastructure that happens to be running your code at any given moment could be anywhere in Google&#8217;s data centres worldwide. But it meant that my code wouldn&#8217;t port without a little bit of attention.</p>
<p>The other issue I had with Google is that the pricing model works from quotas for various internal resources. Understanding how your application is likely to use these resources and therefore arriving at a projected cost is pretty difficult. So whilst Google has made getting code into the cloud relatively easy, it&#8217;s also put in place too many restrictions to make it of serious value.</p>
<h2>Heroku &#8211; Goldilocks&#8217; porridge: too hot, too cold or just right?</h2>
<p>It would be tempting, and not a little symmetrical, to place Heroku squarely between the two other cloud environments above. And whilst that is sort of where it fits in my mind, it would also be too simplistic. Heroku does avoid the outright complexity of EC2 and seems to also avoid some of the terminal restrictions (although it&#8217;s early days) of AppEngine. But the key difference with EC2 lies in how Heroku manages Dynos (Heroku&#8217;s name for an executing instance). To handle scale and to maximise use of its own resources, Heroku runs your code only for as long as it is actually being executed. After that, the code, the machine instance and any data it contained are forgotten. This means that things like a persistent file system or having a piece of your code always running cannot be relied upon.</p>
<p>These problems are pretty easily surmountable. Amazon&#8217;s S3 can be used as a persistent file store and Heroku apps can also launch a worker process that can be relied upon to not be restarted in the same way as the other Dyno web processes.</p>
<p>Scale is managed intelligently by Heroku in that you simply increase the number of web and worker processes that your application has access to &#8211; obviously this also has an impact on the cost. Finally there is an apparently thriving add-on community that provides (at additional monthly cost) access to caching, queuing and in fact any type of additional service that you might otherwise have installed for free on your Amazon EC2 instance.</p>
<h2>Conclusion</h2>
<p>I guess the main conclusion of this simple comparison is that whilst Heroku does make deploying web apps simple, you can&#8217;t simply take code already deployed on internal servers and <strong>git push</strong> it to Heroku.com. Heroku forces you to think about the interactions your application will have with its new deployment environment, because if it didn&#8217;t, your app wouldn&#8217;t scale. This is also true of Google&#8217;s AppEngine, but the restrictions that AppEngine places on the type of code you can run make it of limited value to my mind. These restrictions do not appear to be there with Amazon EC2. You can simply take an internally hosted system and build a deployment environment in the cloud that mimics the current environment. But at some point down the line, you&#8217;re going to have to think about making the code a better cloud citizen. With EC2, you&#8217;re simply able to defer the point of re-architecture. And the task of administering EC2 is a full time job in itself and should not be underestimated. Heroku is amazingly simple by comparison.</p>
<p>Anyway, those are my top of mind thoughts on the relative strengths and weaknesses of the different cloud hosting solutions I&#8217;ve personally looked at. Right now I have to say that Heroku really does strike an excellent balance between ease and capability. Worth a look.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2012/04/heroku-versus-appengine-and-amazon-ec2-where-does-it-fit-in/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Inserting Google Chart Tools Visualizations into WordPress</title>
		<link>http://www.onebloke.com/2011/09/inserting-google-chart-tools-visualizations-into-wordpress/</link>
		<comments>http://www.onebloke.com/2011/09/inserting-google-chart-tools-visualizations-into-wordpress/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 12:56:18 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[WordPress]]></category>
		<category><![CDATA[advanced custom fields]]></category>
		<category><![CDATA[google charts]]></category>
		<category><![CDATA[notaproperprogrammer]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=327</guid>
		<description><![CDATA[EDIT: When I wrote this post no plugin existed to create and embed Google Charts within a WordPress blog. I&#8217;ve recently been made aware of the ChartBoot for WordPress plugin which seems to do exactly what I needed &#8211; although I haven&#8217;t looked at the plugin myself at the moment. It might be worth  [...]]]></description>
				<content:encoded><![CDATA[<p>EDIT: When I wrote this post no plugin existed to create and embed Google Charts within a WordPress blog. I&#8217;ve recently been made aware of the <a href="http://wordpress.org/extend/plugins/chartboot-for-wordpress/">ChartBoot for WordPress plugin</a> which seems to do exactly what I needed &#8211; although I haven&#8217;t looked at the plugin myself at the moment. It might be worth taking a look.</p>
<h3><a href="http://www.onebloke.com/wp-content/uploads/2011/09/Example-PIPESCOM-MED.png"><img class="alignright size-medium wp-image-351" title="Example PIPESCOM MED" src="http://www.onebloke.com/wp-content/uploads/2011/09/Example-PIPESCOM-MED-300x113.png" alt="" width="300" height="113" /></a>I needed to insert charts from <a href="http://code.google.com/apis/chart/interactive/docs/">Google&#8217;s Chart Tools</a> into my other WordPress blog but found that it wasn&#8217;t straightforward. There were a few WordPress plugins that claimed to be able to do it but none seemed to do exactly what I needed so I came up with the approach below.</h3>
<p>The finished product can be found in this <a href="http://www.lustratusrepama.com/2011/the-marketing-strategies-of-open-source-versus-closed-source-esbs/">open source versus closed source ESBs extract</a> from my blog. I should first say by way of explanation that I&#8217;m not a proper programmer and am certainly not well versed with PHP so I&#8217;m sure others could improve on this approach. But it certainly does what I need.</p>
<p>I should also say that the approach I have taken relies on already having produced the JavaScript code that produces the chart. I simply needed a way to inject that code into the WordPress blog and to have it render the chart where I wanted it. So if you&#8217;re looking for an approach that automates the chart production you won&#8217;t find it here.</p>
<p>First a recap of how Google&#8217;s Chart Tools work.</p>
<ol>
<li>First you load Google&#8217;s JavaScript chart library.</li>
<li>You then specify a function to be called when the page loads.</li>
<li>This function needs to contain the code to create the chart, pass the data and tell Google to render it.</li>
<li>This function must also be passed a page element (usually a &lt;div&gt; tag) which controls where in your web page Google&#8217;s code will render the chart.</li>
</ol>
<p>So to accomplish this I decided that I needed to do two things.</p>
<ul>
<li>Firstly, I needed to modify my WordPress theme&#8217;s header.php code to ensure that I could load Google&#8217;s JavaScript routines and build a mechanism to insert the chart code.</li>
<li>Secondly, I had to create some additional fields on WordPress&#8217; post entry page that allowed me to:
<ul>
<li>Specify whether the Google chart API should be loaded for this Post (#1 in the list above) &#8211; i.e. we don&#8217;t want to load the Google Chart Tools code unless this post actually has a chart to be rendered</li>
<li>Specify the code that should be called to render each chart (#3 on the list above)</li>
</ul>
</li>
</ul>
<h3>Creating a Placeholder in the HTML</h3>
<p>The first thing to do is to create placeholder(s) for the chart(s) in the HTML of the WordPress post. To do this switch to HTML view in the WordPress editor, locate the position where you want to insert the chart and add a &lt;div&gt; tag. Specify an ID that is unique to this chart. So for example, if I want to insert two charts into my post I would insert the following HTML.</p>
<pre class="brush:xml">This is some content that goes above the chart.
&lt;div id="medchart1"&gt;This text is replaced by the chart but WordPress seems to need some text in the DIV or it removes it when you switch back to Visual mode.&lt;/div&gt;
And this is some content that goes below the first chart and above the second chart
&lt;div id="medchart2"&gt;This text is replaced by the chart&lt;/div&gt;
And this is text that goes below the second chart.</pre>
<h3>Creating the Custom Fields</h3>
<p>OK so that takes care of telling Google where to render the chart. Now I need a way to create custom fields in WordPress and to access those fields from the PHP code in WordPress&#8217; header.php. For this I found the truly excellent <a href="http://plugins.elliotcondon.com/advanced-custom-fields/">Advanced Custom Fields</a> plugin. This plugin has two components &#8211; a UI that allows you to create the field groups and field codes and then the logic to substitute those fields using PHP when WordPress creates the page.</p>
<p>So I created a number of fields as shown below:</p>
<p><a href="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample.png"><img class="alignnone size-full wp-image-331" title="AdvancedCustomFieldsExample" src="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample.png" alt="" width="605" height="143" /></a></p>
<p>You can see the first field <strong>lgGV</strong> is the flag to say whether the Google Visualisation Chart Tools API should be loaded for this page. I&#8217;ve then shown three other fields named <strong>szGVDrawChartFunction1..3</strong>. These fields will contain the actual JavaScript code that, when executed, will draw the chart.</p>
<p>Expanding the first field shows more detail.</p>
<p><a href="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample2.png"><img class="alignnone size-full wp-image-334" title="AdvancedCustomFieldsExample2" src="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample2.png" alt="" width="606" height="338" /></a></p>
<h3>Specifying Chart Details in the WordPress Post</h3>
<p>So, now when I edit a WordPress post I can enter values into the fields above to reflect the chart settings I want for that particular post.</p>
<p>As shown below:</p>
<p><a href="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample3.png"><img class="alignnone size-full wp-image-336" title="AdvancedCustomFieldsExample3" src="http://www.onebloke.com/wp-content/uploads/2011/09/AdvancedCustomFieldsExample3.png" alt="" width="606" height="338" /></a></p>
<p>So for the post above, I&#8217;ve effectively created a number of PHP variables that will be available from WordPress&#8217; PHP code. These are lgGV, szGVDrawChartFunction1,2 and 3. In my example above only szGVDrawChartFunction1 and 2 have values. The 3rd is left blank.</p>
<p>If you click on the image to enlarge it you will also see that I&#8217;ve modified the code that produces the charts to reference the &lt;div&gt; IDs we created above. The first code section references</p>
<pre class="brush:php">...document.getElementById('medchart1')...</pre>
<p>and the second code references</p>
<pre class="brush:php">...document.getElementById('medchart2')...</pre>
<p>This tells the Google Chart code to render the chart inside those &lt;div&gt; blocks that we created above.</p>
<h3>Modifying header.php to access the advanced custom fields</h3>
<p>The Advanced Custom Fields plugin provides a number of PHP functions that you can use to access these post variables from PHP. These include:</p>
<pre class="brush:php">get_field()
the_field()
the_repeater_field()
get_sub_field()
the_sub_field()</pre>
<p>Actually the the_repeater_field() function is only available in their paid plugin. I haven&#8217;t used that version but as you can see from the code below it would make my code much more streamlined.</p>
<p>So, for example, to access my lgGV logical field within my WordPress theme&#8217;s header.php code,  I might write:</p>
<pre class="brush:php">&lt;?php if(get_field('lgGV')): ?&gt;</pre>
<p>The author of the plugin has done a great job at making it so simple to access field codes associated with specific posts from within the WordPress PHP subsystem.</p>
<h3>Modifying the header.php code</h3>
<p>So now I need to modify my theme&#8217;s header.php code to examine and use these post-level fields.</p>
<p>Here it is.</p>
<pre class="brush:php">&lt;?php if(get_field('lgGV')): ?&gt;
	&lt;script type="text/javascript" src="http://www.google.com/jsapi"&gt;&lt;/script&gt;

	&lt;script type="text/javascript" &gt;
		google.load("visualization", "1", {packages:["table","corechart"]});
	&lt;/script&gt;

	&lt;script&gt;

		google.setOnLoadCallback(function()
                {
                    drawChart1();

                    &lt;?php if(trim(get_field('szGVDrawChartFunction2'))!==''): ?&gt;
                    drawChart2();
                    &lt;?php endif; ?&gt;

                    &lt;?php if(trim(get_field('szGVDrawChartFunction3'))!==''): ?&gt;
                    drawChart3();
                    &lt;?php endif; ?&gt;

                });
		function drawChart1() {&lt;?php echo strip_tags(get_field('szGVDrawChartFunction1')); ?&gt;}
		function drawChart2() {&lt;?php echo strip_tags(get_field('szGVDrawChartFunction2')); ?&gt;}
		function drawChart3() {&lt;?php echo strip_tags(get_field('szGVDrawChartFunction3')); ?&gt;}

    &lt;/script&gt;
&lt;?php endif; ?&gt;</pre>
<p>I inserted this code immediately under the wp_head() function call in my existing header.php file. This ensures that this code block is run every time my blog creates and serves a page. I don&#8217;t know enough about WordPress internals or theme development to know if this is the correct place for every theme. But it works for me.</p>
<p>On to the code. I&#8217;m sure if you understand PHP better than I do it will be self explanatory but just in case.</p>
<p>Line 1 checks to see if the custom post field lgGV has been set to true and if it has it loads the Google libraries (lines 2-5). If not then the entire code block is skipped. For my blog I will only occasionally insert charts so I don&#8217;t want the overhead of making my visitors&#8217; browsers load libraries they aren&#8217;t going to need.</p>
<p>Lines 6 onwards is the declaration of the function that will be called once the page has finished loading.</p>
<p>Line 9 calls the first draw chart function drawChart1(). Here I assume that some code has been entered into the szGVDrawChartFunction1 custom field for this WordPress post.</p>
<p>Lines 10-12 check to see if anything was entered into the 2nd draw chart function and if it has, drawChart2() is called. This is repeated for the 3rd chart as well. Checking for the field being empty stops me having to call empty functions.</p>
<p>Lines 17,18 and 19 define the drawChart1..3 functions. You can see that all I do is fill the function braces {} with the code that was entered into the relevant custom post fields in WordPress &#8211;  szGVDrawChartFunction1..3.</p>
<h3>Important</h3>
<p>A couple of important things to note here.</p>
<p>Firstly, the function definitions in lines 17-19 include the braces {}. So the code that is pasted inside them from the szGVDrawChartFunction1..3 custom post fields should be ONLY the code INSIDE the braces &#8211; not the actual braces themselves.</p>
<p>Secondly, the code that is pasted into these fields cannot currently contain new lines. I&#8217;m not sure why this is but I assume it is something to do with the way the PHP is rendered. So as a result I have to ensure that my chart definition code has all of the new lines removed. In effect each function appears on one line.</p>
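<p>One way to prepare the chart code for pasting is to collapse it onto a single line first. The sketch below is in Python rather than PHP, purely as a stand-alone helper that is run outside WordPress. The function name collapse_to_one_line and the sample chart code are made up for illustration. It assumes that every JavaScript statement ends with a semicolon and that the code contains no // comments, because either would break the script once the new lines are removed.</p>
<pre class="brush:py">def collapse_to_one_line(js_source):
    """Join the non-empty lines of a script into a single line."""
    lines = [line.strip() for line in js_source.splitlines()]
    return ' '.join(line for line in lines if line)

# Hypothetical chart code of the kind that would go inside drawChart1() {...}
chart_code = """
var data = new google.visualization.DataTable();
data.addColumn('string', 'Vendor');
var chart = new google.visualization.PieChart(document.getElementById('medchart1'));
chart.draw(data, {title: 'Example'});
"""

one_line = collapse_to_one_line(chart_code)
print(one_line)</pre>
<p>The printed result can then be pasted into the custom field as a single line.</p>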
<h3>Things to Improve in Future Versions</h3>
<p>I recognise that there is a lot of redundant code here and that it would be better to use a loop to cycle through the various options, but the Advanced Custom Fields plugin doesn&#8217;t support repeating fields in the free version of the plugin. I did try to buy the commercial version but there seemed to be a problem with the author&#8217;s site at the time. I&#8217;ve just checked again and the store is now back online so I will get the commercial version and make my code tighter.</p>
<p>I also realise that inserting code into a template like this could be a security hole if someone could change the contents of my database, so I will have to build in some sort of protection too.</p>
<p>If you found this useful and understand PHP and/or WordPress I&#8217;d really welcome suggestions or improvements. Alternatively if you&#8217;ve found an existing plugin that can do the same I&#8217;d love to hear from you.</p>
<p>Dan.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/09/inserting-google-chart-tools-visualizations-into-wordpress/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Counting Syllables Accurately in Python on Google App Engine</title>
		<link>http://www.onebloke.com/2011/06/counting-syllables-accurately-in-python-on-google-app-engine/</link>
		<comments>http://www.onebloke.com/2011/06/counting-syllables-accurately-in-python-on-google-app-engine/#comments</comments>
		<pubDate>Wed, 29 Jun 2011 08:33:55 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Language and Text Processing]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[cmudict]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[notaproperprogrammer]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[syllables]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=176</guid>
		<description><![CDATA[I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found one or two routines written in PHP that looked promising so I ported them to Python but was pretty disappointed with the accuracy.
I also found a Python routine that is part  [...]]]></description>
				<content:encoded><![CDATA[<h3><a href="http://www.onebloke.com/wp-content/uploads/2011/06/Syllable.png"><img class="alignright size-medium wp-image-177" style="clear: both" title="Syllable" src="http://www.onebloke.com/wp-content/uploads/2011/06/Syllable-267x300.png" alt="" width="267" height="300" /></a>I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found <a href="https://github.com/DaveChild/Text-Statistics ">one</a> or two routines written in PHP that looked promising so I ported them to Python but was pretty disappointed with the accuracy.</h3>
<p>I also found a <a href="http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/readability/syllables_en.py">Python routine that is part of the contributed code for NLTK</a> that was not bad but again struggled with some words. You see, I had naively thought this would be a simple exercise. I hadn&#8217;t realised that Syllable Counting in the English language is pretty difficult stuff with so many exceptions that it makes the most elegant algorithm convoluted and clumsy.</p>
<p>I then stumbled across <a href="http://groups.google.com/group/nltk-users/msg/81e70cb6704dc01e">this snippet of code by Jordan Boyd-Graber</a>, via the <a href="http://runningwithdata.com/post/3576752158/w">excellent Running with Data site</a>, and it seemed so elegant that I thought it must be too simplistic. But far from it, it is very accurate <strong>for the words it knows</strong>.</p>
<p>The code is shown here.</p>
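<pre class="brush:py">from curses.ascii import isdigit
import nltk
from nltk.corpus import cmudict

# Load the pronunciation dictionary once: word -> list of pronunciations
d = cmudict.dict()

def nsyl(word):
	# One count per pronunciation: phonemes ending in a digit are vowel sounds
	return [len(list(y for y in x if isdigit(y[-1]))) for x in d[word.lower()]]</pre>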
<p>It works by looking up the pronunciation of the word in Carnegie Mellon University&#8217;s pronunciation dictionary, which is included in the Python-based Natural Language Toolkit (NLTK). This returns one or more pronunciations for the word. Then the clever bit is that the routine counts the vowel sounds in each pronunciation, and each vowel sound corresponds to one syllable. The raw entry from the cmudict file for the word SYLLABLE is shown below.</p>
<pre class="brush:plain">SYLLABLE 1 S IH1 L AH0 B AH0 L</pre>
<p>The vowel sounds are the phonemes that end in a number. The number is a stress marker (0 for no stress, 1 for primary stress and 2 for secondary stress), so counting the phonemes that end in a digit counts the syllables. Anyway, for the words that the dictionary knows about (120,000+ I believe), this represents a very accurate method for obtaining the syllable count.</p>
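<p>To make the counting concrete, here is a small worked example on the raw entry above. It is only a sketch that does not need NLTK. It assumes the line layout shown above (word, pronunciation number, then the phonemes) and the variable names are my own.</p>
<pre class="brush:py"># Worked example: count the syllables in one raw cmudict line.
# Assumed layout: word, pronunciation number, then the phonemes.
entry = "SYLLABLE 1 S IH1 L AH0 B AH0 L"

parts = entry.split()
word = parts[0].lower()
phonemes = parts[2:]

# Vowel phonemes are the ones that end in a stress digit (0, 1 or 2)
vowels = [p for p in phonemes if p[-1].isdigit()]

print(word, vowels, len(vowels))</pre>
<p>For this entry the vowel phonemes are IH1, AH0 and AH0, which gives three syllables.</p>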
<p>However, there is a problem. As my target environment is Google App Engine, that little line at the top of the code that says&#8230;</p>
<pre class="brush:py">import nltk</pre>
<p>&#8230;ruins your entire afternoon.</p>
<p>You see NLTK and Google App Engine don&#8217;t work well together due to NLTK&#8217;s recursive imports. I spent some time trying to unwind the recursive imports on cmudict so that Google App Engine would work but to no avail.</p>
<p>So then I thought laterally and decided to build my own structure from the cmudict file (the raw text 3.6MB file that NLTK loads and wraps an object around). My plan was as follows:</p>
<ol>
<li>Parse the raw cmudict file</li>
<li>For every word in the file call the above syllable count routine</li>
<li>Store the resultant syllable count in a word -&gt; syllable lookup structure (a Python Dictionary)</li>
<li>Pickle the resultant dictionary</li>
<li>Un-pickle it where it is needed</li>
</ol>
<p>And this seems to have worked quite well.</p>
<p>The code below builds the pickle file.</p>
<pre class="brush:py">#!/usr/bin/env python

from curses.ascii import isdigit
from nltk.corpus import cmudict
try:
    import cPickle as pickle
except ImportError:
    import pickle

#-----
# Create a shared dictionary keyed on the word with the value as a list of
# possible syllable counts

GzzCMUDict = cmudict.dict()

GdcSyllableCount = {}

def CreatePickle(AlgQuiet=False):

    def SyllableCount(AszWord):
        """return a list of syllable counts, one for each pronunciation"""

        #http://groups.google.com/group/nltk-users/msg/81e70cb6704dc01e?pli=1

        return [len([y for y in x if isdigit(y[-1])]) for x in GzzCMUDict[AszWord.lower()]]

    try:
        LhaInputFile = open('cmudict','r')
    except IOError:
        print("Could not open the cmudict file")
        raise

    try:
        for LszLine in LhaInputFile:

            LszWord = LszLine.split(' ')[0].lower()

            LliSyllableList = SyllableCount(LszWord)

            if LszWord not in GdcSyllableCount:
                GdcSyllableCount[LszWord] = sorted(LliSyllableList)
                if not AlgQuiet:
                    print("%-20s added %s" % (LszWord, LliSyllableList))
            else:
                if not AlgQuiet:
                    print("  -Word (%s) found twice. First count was %s, second was %s" % (LszWord, GdcSyllableCount[LszWord], LliSyllableList))
    except Exception:
        print("An error was encountered processing the file.")
        raise IOError
    finally:
        LhaInputFile.close()

    try:
        #-----
        # Now write the dictionary away to a new pickle file. The file is
        # opened in binary mode because protocol -1 is a binary format

        LhaOutputFile = open('cmusyllables.pickle','wb')

        if not AlgQuiet:
            print("Finished processing input file\n\nNow dumping pickle file\n")
        pickle.dump(GdcSyllableCount, LhaOutputFile,-1)
        LhaOutputFile.close()

        if not AlgQuiet:
            print("Pickle file cmusyllables.pickle has been created.")
    except Exception:
        print("An error was encountered writing the pickle file.")
        raise IOError

def main():
    #-----
    # Open the CMU file and for each entry create a dict with the resulting
    # number of syllables

    CreatePickle()

if __name__ == '__main__':
    main()</pre>
<p>This results in a dictionary lookup that gives an accurate syllable count (or counts, because some words have multiple pronunciations and therefore multiple syllable counts) for the words it has in its dictionary.</p>
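<p>As a rough illustration of where those counts come from, here is a minimal sketch. It uses a tiny hand-written sample in the style of cmudict entries rather than the real dictionary, and the names <code>SAMPLE_PRONUNCIATIONS</code> and <code>syllable_counts</code> are made up for this example. Vowel phonemes carry a trailing stress digit, so counting them gives one syllable count per pronunciation.</p>

```python
# Hand-written sample in the style of cmudict entries (an assumption for
# illustration only, not loaded from the real dictionary)
SAMPLE_PRONUNCIATIONS = {
    'fire': [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']],
    'python': [['P', 'AY1', 'TH', 'AA0', 'N']],
}


def syllable_counts(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Return a sorted list with one syllable count per pronunciation."""
    counts = []
    for phonemes in pronunciations[word.lower()]:
        # Vowel phonemes end in a stress digit (0, 1 or 2); one per syllable
        counts.append(len([p for p in phonemes if p[-1].isdigit()]))
    return sorted(counts)
```

<p>With this sample, a word with two pronunciations comes back with two counts and a word with one pronunciation comes back with a single count.</p>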
<h3>Words not in the Dictionary</h3>
<p>But what about words that the dictionary doesn&#8217;t know about? I handled that by building a fallback routine into the code. The most accurate mechanical routine I found was PHP-based and is part of Russell McVeigh&#8217;s site:</p>
<h3><a href="http://www.russellmcveigh.info/content/html/syllablecounter.php">http://www.russellmcveigh.info/content/html/syllablecounter.php</a></h3>
<p>I ported Russell&#8217;s code to Python and added a couple of other exceptions that I found. Most of the mechanical syllable calculation routines I found work on the following basic syllable rules:</p>
<ol>
<li>Count the number of vowels in the word</li>
<li>Subtract one for any silent vowels such as the e at the end of a word</li>
<li>Subtract any additional vowels in vowel pairs/triplets (ee, ei, eau, etc.) i.e. each group of multiple vowels scores only one vowel</li>
</ol>
<p>The number you have left is the number of syllables. There then follows a series of adjustments: if certain patterns are recognised in the word, syllables are added or taken away to arrive at the final count. Even with all this adjustment it&#8217;s never completely accurate, but it is perhaps good enough for those words not in the cmudict.</p>
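<p>The three basic rules can be sketched as follows. This is only an illustration of the rules as listed, not the ported routine; <code>rough_syllable_count</code> is a hypothetical name and treating <code>y</code> as a vowel is an assumption.</p>

```python
import re


def rough_syllable_count(word):
    """Hypothetical helper: apply only the three basic rules, no adjustments."""
    word = word.lower()
    # Rules 1 and 3: each run of consecutive vowels scores as one vowel
    vowel_groups = re.findall(r'[aeiouy]+', word)
    count = len(vowel_groups)
    # Rule 2: a single 'e' after a final consonant is treated as silent
    if re.search(r'[^aeiouy]e$', word):
        count -= 1
    # Every word has at least one syllable
    return max(1, count)
```

<p>This gives 1 for &#8216;cake&#8217; and 3 for &#8216;beautiful&#8217;, but it also gives 1 for &#8216;table&#8217;, which is the kind of miss that the pattern-based adjustments are there to correct.</p>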
<p>So the code I&#8217;ve developed is really simple. It looks up syllable counts in the cmudict and returns the results if found; if not, it has a guess at the syllable count instead. I&#8217;d really like to share the code with you but something in my WordPress theme or the syntax highlighter that I use objects to something in the code. Perhaps, as I&#8217;m not a proper programmer, it doesn&#8217;t like my esoteric, bastardised Hungarian notation variable names?</p>
<p>So I can&#8217;t post it here at the moment but will try to get that fixed. If you&#8217;re interested <a href="http://www.onebloke.com/contact/">contact me</a> and I&#8217;ll happily share it.</p>
<p>Danny Goodall</p>
<p>Edit &#8211; It looks like I *might* have solved that problem by using a different syntax highlighter.</p>
<pre class="brush:py">#!/usr/bin/env python
try:
    import cPickle as pickle
except ImportError:
    import pickle

import re

class cmusyllables(object):

    def __init__(self):

        #-----
        # Record the mode of the syllable count - manual / lookup

        self.szMode = None

        self.dcSyllableCount = None

        #-----
        # New structures for the SyllableCount3 routine

        self.dcSyllable3WordCache = {}

        self.liSyllable3SubSyllables = [
            'cial',
            'tia',
            'cius',
            'cious',
            'uiet',
            'gious',
            'geous',
            'priest',
            'giu',
            'dge',
            'ion',
            'iou',
            'sia$',
            '.che$',
            '.ched$',
            '.abe$',
            '.ace$',
            '.ade$',
            '.age$',
            '.aged$',
            '.ake$',
            '.ale$',
            '.aled$',
            '.ales$',
            '.ane$',
            '.ame$',
            '.ape$',
            '.are$',
            '.ase$',
            '.ashed$',
            '.asque$',
            '.ate$',
            '.ave$',
            '.azed$',
            '.awe$',
            '.aze$',
            '.aped$',
            '.athe$',
            '.athes$',
            '.ece$',
            '.ese$',
            '.esque$',
            '.esques$',
            '.eze$',
            '.gue$',
            '.ibe$',
            '.ice$',
            '.ide$',
            '.ife$',
            '.ike$',
            '.ile$',
            '.ime$',
            '.ine$',
            '.ipe$',
            '.iped$',
            '.ire$',
            '.ise$',
            '.ished$',
            '.ite$',
            '.ive$',
            '.ize$',
            '.obe$',
            '.ode$',
            '.oke$',
            '.ole$',
            '.ome$',
            '.one$',
            '.ope$',
            '.oque$',
            '.ore$',
            '.ose$',
            '.osque$',
            '.osques$',
            '.ote$',
            '.ove$',
            '.pped$',
            '.sse$',
            '.ssed$',
            '.ste$',
            '.ube$',
            '.uce$',
            '.ude$',
            '.uge$',
            '.uke$',
            '.ule$',
            '.ules$',
            '.uled$',
            '.ume$',
            '.une$',
            '.upe$',
            '.ure$',
            '.use$',
            '.ushed$',
            '.ute$',
            '.ved$',
            '.we$',
            '.wes$',
            '.wed$',
            '.yse$',
            '.yze$',
            '.rse$',
            '.red$',
            '.rce$',
            '.rde$',
            '.ily$',
            '.ely$',
            '.des$',
            '.gged$',
            '.kes$',
            '.ced$',
            '.ked$',
            '.med$',
            '.mes$',
            '.ned$',
            '.[sz]ed$',
            '.nce$',
            '.rles$',
            '.nes$',
            '.pes$',
            '.tes$',
            '.res$',
            '.ves$',
            'ere$'
        ]

        # Patterns that each add one syllable when they match
        self.liSyllable3AddSyllables  = [
            'ia',
            'riet',
            'dien',
            'ien',
            'iet',
            'iu',
            'iest',
            'io',
            'ii',
            'ily',
            '.oala$',
            '.iara$',
            '.ying$',
            '.earest',
            '.arer',
            '.aress',
            '.eate$',
            '.eation$',
            '[aeiouym]bl$',
            '[aeiou]{3}',
            '^mc',
            'ism',
            'asm',
            # Raw string so that \1 is a backreference and not the character \x01
            r'([^aeiouy])\1l$',
            '[^l]lien',
            '^coa[dglx].',
            '[^gq]ua[^auieo]',
            'dnt$'
        ]

        #-----
        # Create a list of the compiled regex

        self.liSyllable3RESubSyllables = []
        self.liSyllable3REAddSyllables = []

        for LszRegEx in self.liSyllable3AddSyllables:
            LreRegEx = re.compile(LszRegEx)
            self.liSyllable3REAddSyllables.append(LreRegEx)

        for LszRegEx in self.liSyllable3SubSyllables:
            LreRegEx = re.compile(LszRegEx)
            self.liSyllable3RESubSyllables.append(LreRegEx)

    def Load(self, AszFile = 'cmusyllables.pickle'):
        try:
            LhaPickleFile = open(AszFile,'rb')

            self.dcSyllableCount =  pickle.load(LhaPickleFile)
            #print "LOADED SYLLABLES"
        except:
            return( False )

        return( True )

    def GetRawDict(self):
        return(self.dcSyllableCount)

    def NonCMUSyllableCount(self, AszWord):

        #LszWord = self._normalize_word( AszWord.lower() )
        LszWord = AszWord

        #-----
        # If we've already seen this before then return the syllables

        if LszWord in self.dcSyllable3WordCache:
            return(self.dcSyllable3WordCache[LszWord])

        #-----
        #Split into parts on vowels and vowel sounds

        LliWordParts = re.split(r'[^aeiouy]+', LszWord)

        #-----
        # Combine the valid parts of the word

        LliValidWordParts = []

        for LszValue in LliWordParts:
            if LszValue != '':
                LliValidWordParts.append(LszValue)

        LinSyllables = 0

        #-----
        # Loop through the compiled regexs looking for matches

        for LreSylRE in self.liSyllable3RESubSyllables:
            LinMatch = 0 if LreSylRE.search(LszWord) is None else 1
            LinSyllables -= LinMatch

        for LreSylRE in self.liSyllable3REAddSyllables:
            LinMatch = 0 if LreSylRE.search(LszWord) is None else 1
            LinSyllables += LinMatch

        #-----
        # Now compute the syllable count by the number of vowels

        LinSyllables += len(LliValidWordParts)

        #-----
        # If we've not found any there must be at least 1

        LinSyllables = max(1, LinSyllables)

        #----
        # Record this result in the word cache

        self.dcSyllable3WordCache[LszWord] = LinSyllables

        #-----
        # Return the result

        return(LinSyllables)

    def SyllableCount(self, AszWord, AszMode = 'max', AlgFallBack=True):

        # Normalise the mode once and use it everywhere below
        LszMode = AszMode.lower()

        if LszMode not in ['min','max','ave','raw']:
            LszMode = 'max'

        LszWord = AszWord.lower()

        # Treat a dictionary that has not been loaded as an empty one
        if len(LszWord) == 0 or LszWord not in (self.dcSyllableCount or {}):
            self.szMode = None
            if len(LszWord) == 0 or not AlgFallBack:
                if LszMode in ['min','max']:
                    return(0)
                elif LszMode == 'ave':
                    return(0.0)
                else:
                    return([])
            else:
                LliSyllableList = list((self.NonCMUSyllableCount(LszWord),))
                self.szMode = 'manual'
        else:
            LliSyllableList = self.dcSyllableCount[LszWord]
            self.szMode = 'lookup'

        if LszMode == 'min':
            return(min(LliSyllableList))
        elif LszMode == 'max':
            return(max(LliSyllableList))
        elif LszMode == 'ave':
            return(float(float(sum(LliSyllableList))/float(len(LliSyllableList))))
        elif LszMode == 'raw':
            return(LliSyllableList)
        else:
            return(None)

    def GetSyllableMode(self):
        #-----
        # Return either None, manual or lookup depending on how the last
        # syllable count was arrived at

        return(self.szMode)

def main():

    LzzSyllableCounter = cmusyllables()

    LzzSyllableCounter.Load()

    LliList = ['','theatre','productized','productised','pumblechook','everything','altogether','particular','opportunity','everybody','cooeed','cueing']

    for LszWord in LliList:
        print "'%s' has max(%d), min(%d), ave(%3.2f), raw(%s) syllables - Calculated by (%s)" % (LszWord,LzzSyllableCounter.SyllableCount(LszWord), LzzSyllableCounter.SyllableCount(LszWord, AszMode='min'),LzzSyllableCounter.SyllableCount(LszWord, AszMode='ave'),LzzSyllableCounter.SyllableCount(LszWord, AszMode='raw'),LzzSyllableCounter.GetSyllableMode())

if __name__ == '__main__':
    main()</pre>
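<p>The four modes simply reduce the list of counts in different ways. A minimal sketch of that idea, separate from the class above (<code>summarise_counts</code> is a hypothetical name, and falling back to <code>max</code> for an unknown mode is an assumption that mirrors the default):</p>

```python
def summarise_counts(counts, mode='max'):
    """Reduce a list of syllable counts according to the requested mode."""
    reducers = {
        'min': min,
        'max': max,
        'ave': lambda values: float(sum(values)) / len(values),
        'raw': list,
    }
    # Unknown modes fall back to 'max'
    return reducers.get(mode.lower(), max)(counts)
```

<p>For a word with counts of 1 and 2, the modes give 1, 2, 1.5 and the list itself respectively.</p>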
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/06/counting-syllables-accurately-in-python-on-google-app-engine/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>FileZilla, SFTP and Amazon EC2</title>
		<link>http://www.onebloke.com/2011/06/filezilla-sftp-and-amazon-ec2/</link>
		<comments>http://www.onebloke.com/2011/06/filezilla-sftp-and-amazon-ec2/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 07:29:12 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Tips]]></category>
		<category><![CDATA[.pem]]></category>
		<category><![CDATA[.ppk]]></category>
		<category><![CDATA[amazon ws]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[filezilla]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[sftp]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=263</guid>
		<description><![CDATA[&#160;
I&#8217;ve just made a little discovery so thought I would note it in these pages because I&#8217;m sure I&#8217;ll need it again.
I&#8217;m investigating Amazon&#8217;s EC2 at the moment and am trying to put some code up there and struggling to use FTP securely to do it. I use FileZilla on Ubuntu and it seems that  [...]]]></description>
				<content:encoded><![CDATA[<p>&nbsp;</p>
<h3><a href="http://www.onebloke.com/wp-content/uploads/2011/06/filezilla-logo.png"><img class="alignright size-thumbnail wp-image-264" title="FileZilla Logo" src="http://www.onebloke.com/wp-content/uploads/2011/06/filezilla-logo-150x150.png" alt="" width="150" height="150" /></a>I&#8217;ve just made a little discovery so thought I would note it in these pages because I&#8217;m sure I&#8217;ll need it again.</h3>
<p>I&#8217;m investigating Amazon&#8217;s EC2 at the moment and am trying to put some code up there, but I&#8217;m struggling to use FTP securely to do it. I use FileZilla on Ubuntu and it seems that FileZilla&#8217;s Site Manager wants me to enter a user name and password combination to log in to the EC2 instance. However, in accordance with Amazon&#8217;s recommendations, I&#8217;m running without user passwords and am instead using public key authentication. But there appears to be nowhere to specify the local private key file location in FileZilla&#8217;s Site Manager dialogue.</p>
<p>The answer is that hidden in FileZilla&#8217;s settings (Edit-&gt;Settings), under the Connection-SFTP setting, is a dialogue that allows you to enter the location of the local keypair file. So I added my local key pair, at which point FileZilla warned me that it needed to convert my .pem format to a .ppk format. I let it do this and specified the location and name of the converted file. Then, going back to the Site Manager, I set my Amazon host <strong>Login Type</strong> to <strong>Interactive</strong>, tried again and was straight in. Interestingly I didn&#8217;t need to tie the Site Manager entry for my EC2 host to the keypair. Just adding the keypair to the general settings as described above did the trick. No messy passwords and no compromised security.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/06/filezilla-sftp-and-amazon-ec2/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>arcanicity.appspot.com &#8211; How much jargon does your text contain?</title>
		<link>http://www.onebloke.com/2011/06/arcanicity-appspot-com-how-much-jargon-does-your-text-contain/</link>
		<comments>http://www.onebloke.com/2011/06/arcanicity-appspot-com-how-much-jargon-does-your-text-contain/#comments</comments>
		<pubDate>Wed, 15 Jun 2011 08:11:53 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Arcanicity]]></category>
		<category><![CDATA[Cloud Stuff]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[arcanicity]]></category>
		<category><![CDATA[cmudct]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[google charts]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[notaproperprogrammer]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[readability]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=166</guid>
		<description><![CDATA[
My first Google App Engine project went live yesterday. This one deals with estimating the readability of a text when jargon such as acronyms and abbreviations is taken into account.
&#60;marketing-bit&#62;As I&#8217;ve mentioned before I&#8217;m developing a Natural Language Processing system called ScrewTinny  [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-171" title="Aconym Soup (smaller)" src="http://www.onebloke.com/wp-content/uploads/2011/06/Aconym-Soup-smaller-300x141.png" alt="" width="300" height="141" /></p>
<h3>My first Google App Engine project went live yesterday. This one deals with estimating the readability of a text when jargon such as acronyms and abbreviations is taken into account.</h3>
<p>&lt;<strong>marketing-bit</strong>&gt;As I&#8217;ve mentioned before I&#8217;m developing a Natural Language Processing system called ScrewTinny (scrutiny) that analyses the language that high-tech vendors use to take their products to market. Knowing how much jargon a text contains allows me to infer which audience the text is aimed at (IT Technical, IT Business, Business). And that&#8217;s important to me.&lt;/<strong>marketing-bit</strong>&gt;</p>
<p>Anyway, <a href="http://en.wikipedia.org/wiki/Readability_test">readability indexes</a> are not new (Flesch-Kincaid, Coleman-Liau, Gunning Fog, SMOG, etc.) and so I looked for an existing index that took jargon into account that I could use. I did a great deal of searching and even asked a number of people who have an interest in this area, but I couldn&#8217;t find one. <a href="http://www.lustratusrepama.com/2011/a-technology-reading-ease-index-goodall-arcanicity-first-draft/">So I developed my own &#8211; and the Goodall Arcanicity Index was born</a>. It&#8217;s got a long way to go until it is truly accurate but I&#8217;ve now coded it in Python and decided to put it up on Google&#8217;s appspot cloud. So it&#8217;s live at:</p>
<h3><a href="http://arcanicity.appspot.com">http://arcanicity.appspot.com</a></h3>
<p>It&#8217;s very simple. You enter some text, it processes it and gives you a rating for the amount of arcane content (Arcanicity) the text contains. A by-product of my text-processing routines is a mountain of related text statistics so I decided to add those to the <a href="http://arcanicity.appspot.com">arcanicity.appspot.com</a> site.</p>
<p>As I discussed <a title="Google Chart Tools, Hmm…" href="http://www.onebloke.com/2011/05/google-chart-tools-hmm/">here</a>, I also discovered the JavaScript-based <a href="http://code.google.com/apis/chart/">Google Visualisation</a> libraries which I will use as part of the ScrewTinny project. I wanted to get some experience with the Google routines and so for good measure I created a visualisation to go along with the text statistics.</p>
<h3>Google App Engine and NLTK</h3>
<p><a href="http://www.onebloke.com/wp-content/uploads/2011/06/GoogleAppEngineLlogoTrans.png"><img class="alignleft size-full wp-image-151" title="GoogleAppEngineLlogoTrans" src="http://www.onebloke.com/wp-content/uploads/2011/06/GoogleAppEngineLlogoTrans.png" alt="" width="161" height="188" /></a>One of the interesting technical challenges involved getting the Python-based Natural Language ToolKit (NLTK) routines to work in Google&#8217;s App Engine. I had seen that <a href="http://groups.google.com/group/nltk-users/browse_thread/thread/95db3032ccca7ab8/">it is notoriously difficult to get NLTK working with Google App Engine</a> due to the way it recursively imports modules. But following some tips from the poster <a href="http://groups.google.com/group/nltk-users/msg/d9f9ee360edf9c50">oakmad on this entry</a>, I managed to get a small sub-section of the code working.</p>
<p>This discussion actually merits a separate blog entry where I can document the exact process I went through, and perhaps I will do that when I get time. But for the time being I&#8217;ll talk about the general approach. The way that I got the Punkt Sentence Tokenizer working was as follows.</p>
<p>I created a clean local Google App Engine instance and then I copied in the pickled &#8216;english.pickle&#8217; Tokenizer object from the NLTK distribution. I un-pickled it and tried to use the resultant object&#8217;s tokenize method. This gave an error caused by supporting imports that hadn&#8217;t happened. I then fixed the import and tried again, repeating until I got no further errors. &#8216;Fixing the import&#8217; involved copying the module folder tree structure that was being complained about (one folder at a time) from a pristine NLTK installation to the Google App Engine local instance. As oakmad says, creating empty __init__.py files was important so that the module didn&#8217;t go off and grab more than was needed. As I said, I should document this properly and if anyone is interested let me know and I will.</p>
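<p>The reason the un-pickled object complains is that a pickle only stores a reference to the class, so the module that defines the class has to be importable when the pickle is loaded. The sketch below shows that behaviour with a throwaway stand-in module rather than NLTK; <code>toy_tokenizer</code> and its <code>Tokenizer</code> class are invented for the example and are not part of NLTK.</p>

```python
import os
import pickle
import sys
import tempfile

# Write a throwaway module (hypothetical name) so there is a class to pickle
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'toy_tokenizer.py'), 'w') as handle:
    handle.write('class Tokenizer(object):\n'
                 '    def tokenize(self, text):\n'
                 '        return text.split(". ")\n')

sys.path.insert(0, module_dir)
import toy_tokenizer
pickled = pickle.dumps(toy_tokenizer.Tokenizer())

# Simulate a clean environment where the supporting module is missing
sys.path.remove(module_dir)
del sys.modules['toy_tokenizer']

try:
    pickle.loads(pickled)
    missing_module_error = None
except ImportError as error:
    # This is the kind of complaint that points at the missing module
    missing_module_error = error

# "Fixing the import": make the module available again and retry
sys.path.insert(0, module_dir)
tokenizer = pickle.loads(pickled)
sentences = tokenizer.tokenize('First sentence. Second sentence')
```

<p>With a real NLTK pickle the same loop applies, except that each retry may name a different missing module until the whole chain of imports is satisfied.</p>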
<p>It has to be said, however, that I tried to use a similar technique so that I could use NLTK&#8217;s CMU pronunciation dictionary (CMUDICT). But it became very complex, very quickly and as I&#8217;m not a real programmer I gave up. But I did get to use the cmudict routines on Google App Engine by building a separate data structure. I wanted to use the cmudict routines to allow me to count syllables accurately and, if I say so myself, my solution was quite &#8216;lateral&#8217;. That definitely does need a separate post and so I will do that when I get time.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/06/arcanicity-appspot-com-how-much-jargon-does-your-text-contain/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nothing&#8217;s ever easy &#8211; Google App Engine&#8230;</title>
		<link>http://www.onebloke.com/2011/06/nothings-ever-easy-google-app-engine/</link>
		<comments>http://www.onebloke.com/2011/06/nothings-ever-easy-google-app-engine/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 14:11:39 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Cloud Stuff]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[amazon ws]]></category>
		<category><![CDATA[bigtable]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[notaproperprogrammer]]></category>
		<category><![CDATA[orm]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sqs]]></category>
		<category><![CDATA[web2py]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=141</guid>
		<description><![CDATA[Well, at least nothing appears easy to me when selecting the deployment configuration for ScrewTinny &#8211; my Python-based competitive marketing intelligence app.
I had planned to deploy ScrewTinny to Google App Engine and I&#8217;ve actually been very happy with how easy it is to get up and running with it  [...]]]></description>
				<content:encoded><![CDATA[<h3>Well, at least nothing appears easy to <a href="http://www.onebloke.com/wp-content/uploads/2011/06/django-logo.png"><img class="alignright size-medium wp-image-143" title="django-logo" src="http://www.onebloke.com/wp-content/uploads/2011/06/django-logo-300x136.png" alt="" width="300" height="136" /></a>me when selecting the deployment configuration for ScrewTinny &#8211; my Python-based competitive marketing intelligence app.</h3>
<p>I had planned to deploy ScrewTinny to Google App Engine and I&#8217;ve actually been very happy with how easy it is to get up and running with it on my <a title="Google App Engine and the Arcanicity Index" href="http://www.onebloke.com/2011/05/google-app-engine-and-the-arcanicity-index/">Arcanicity Index project</a> (more of that later). However, I found the process of generating HTML in a Python app akin to pulling my own teeth out. So I looked at templating techniques where embedded code is replaced at run-time which looked like it might be a bit more bearable. I read that Django templates are supported by Google App Engine, so I decided to take a look at <a href="https://www.djangoproject.com/">Django</a>.</p>
<p>Django features an object-relational mapper (ORM) that I really like the idea of. It sits on top of a MySQL database and allows me to programmatically deal with objects while it handles the persistence and retrieval to and from the underlying SQL database. But then <a href="http://code.google.com/appengine/articles/django.html">I read that whilst Django can be deployed to Google App Engine</a> (and its views are supported natively), it appears that Google&#8217;s database strategy doesn&#8217;t allow the object mapper to work.</p>
<p><a href="http://www.onebloke.com/wp-content/uploads/2011/06/GoogleAppEngineLlogoTrans.png"><img class="alignleft size-full wp-image-151" title="GoogleAppEngineLlogoTrans" src="http://www.onebloke.com/wp-content/uploads/2011/06/GoogleAppEngineLlogoTrans.png" alt="" width="161" height="188" /></a>I can understand how something like Google App Engine isn&#8217;t going to provide a generic SQL database as it wouldn&#8217;t get near the scale that was required. But it&#8217;s frustrating to have to look elsewhere if I want to use Django&#8217;s ORM.</p>
<p>I was urged to look at <a href="http://www.allbuttonspressed.com/blog/django">a branch of Django called Django-nonrel</a> that seems to run over non-SQL databases &#8211; including the Google App Engine&#8217;s <a href="http://labs.google.com/papers/bigtable.html">Bigtable</a> &#8211; but I can&#8217;t take the risk that this would end up as a dead-end project (even though the people responsible for this project suggest that <a href="http://www.allbuttonspressed.com/blog/django/minor-updates-and-api-changes">Django-nonrel is going to make its way back into the main Django source code trunk</a>).</p>
<p>So then I wondered whether I could run Django outside of the Google App Engine infrastructure, so I asked my current host, ICDSoft (excellent so far). They told me that it was possible but that they didn&#8217;t support the WSGI gateways that are needed. Nor, they tell me, do they support <a href="http://www.web2py.com/">web2py</a> (which was another alternative I thought about trying) as it needs to run a background process and my shared hosting plan does not allow this. I <a href="http://code.google.com/appengine/docs/python/gettingstarted/usingwebapp.html">believe that it&#8217;s possible to run web2py on Google App Engine</a> but it only <em>appears</em> to have a Database Abstraction Layer that leaves me mapping my objects to SQL tables and back again.</p>
<p>So where next?</p>
<p>Well I&#8217;m going to take a look at Amazon EC2. I&#8217;ve signed up for an account and in theory it looks like I can start a machine image, run Django &#8211; or anything I choose, and interact with it via Amazon SQS. So when I get a bit of time I&#8217;ll dive into that.</p>
<p>I wish I were a <em>proper</em> programmer. I&#8217;ll keep you updated.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/06/nothings-ever-easy-google-app-engine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Chart Tools, Hmm&#8230;</title>
		<link>http://www.onebloke.com/2011/05/google-chart-tools-hmm/</link>
		<comments>http://www.onebloke.com/2011/05/google-chart-tools-hmm/#comments</comments>
		<pubDate>Thu, 26 May 2011 11:17:26 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Charting]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[google charts]]></category>
		<category><![CDATA[html5]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[screwtinny]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=125</guid>
		<description><![CDATA[So my plans for how best to visualise the output of ScrewTinny have been changing recently.
I&#8217;ve looked at using Excel to create charts manually. I&#8217;ve looked at Python chart libraries. I&#8217;ve looked at Google Docs. But now I think I might have found a winner &#8211; Google Chart Tools.
I&#8217;ve been agonising  [...]]]></description>
				<content:encoded><![CDATA[<h3><a href="http://www.onebloke.com/wp-content/uploads/2011/05/google_logo.jpg"><img class="alignright size-medium wp-image-372" title="google_logo" src="http://www.onebloke.com/wp-content/uploads/2011/05/google_logo-300x125.jpg" alt="" width="300" height="125" /></a>So my plans for how best to visualise the output of ScrewTinny have been changing recently.</h3>
<p>I&#8217;ve looked at using Excel to create charts manually. I&#8217;ve looked at Python chart libraries. I&#8217;ve looked at Google Docs. But now I think I <em>might</em> have found a winner &#8211; Google Chart Tools.</p>
<p>I&#8217;ve been agonising about how to deliver the results of my NLP text analysis. At the moment I&#8217;ve put a bunch of effort into creating charts in .Net so that I can inject them automatically into a PowerPoint using VSTO. It&#8217;s working fine but it stops me from being able to use the generated charts easily for web consumption from within Python &#8211; my development language of choice.</p>
<p>It also limits me from easily getting the charts on the web. And this is perhaps the most restricting element of my design decision. I had initially thought that I would use MS Office based documents (Word and PowerPoint) to publish my research. However I now feel that with the richness of HTML 5 and technologies such as Google Docs, I should think again.</p>
<p>I&#8217;ve just watched this video which shows how easy it is to embed data visualisation into a web page.</p>
<p><iframe width="700" height="394" src="http://www.youtube.com/embed/NZtgT4jgnE8?fs=1&#038;feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>Google Chart Tools seems to be a great way to embed charts into an HTML 5 page. It also raises questions about where to store the data (Google Docs Spreadsheet?) and other logistical issues too but that&#8217;s for another day.</p>
<p>I&#8217;m going to be putting my <a href="http://www.lustratusrepama.com/2011/eureka-arcanicity-and-the-technology-reading-ease-index/">Arcanicity Project</a> code up onto <a href="http://www.onebloke.com/2011/05/google-app-engine-and-the-arcanicity-index/">Google App Engine</a> and rather than chuck out a vanilla HTML page with the results as I had planned, I think I will use some tables and charts from the Chart Tools library.</p>
<p>You can have a little interactive play with the <a href="http://code.google.com/apis/ajax/playground/#annotated_time_line">Google Visualization library here in the Google Code Playground</a>.</p>
<p>I&#8217;ll keep you posted.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/05/google-chart-tools-hmm/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Google App Engine and the Arcanicity Index</title>
		<link>http://www.onebloke.com/2011/05/google-app-engine-and-the-arcanicity-index/</link>
		<comments>http://www.onebloke.com/2011/05/google-app-engine-and-the-arcanicity-index/#comments</comments>
		<pubDate>Tue, 17 May 2011 16:10:43 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[Arcanicity]]></category>
		<category><![CDATA[Language and Text Processing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[42]]></category>
		<category><![CDATA[arcanicity]]></category>
		<category><![CDATA[cost]]></category>
		<category><![CDATA[deep thought]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[language toolkit]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[notaproperprogrammer]]></category>
		<category><![CDATA[punkt]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[test]]></category>
		<category><![CDATA[text statistics]]></category>
		<category><![CDATA[tokenize]]></category>
		<category><![CDATA[tokenizer]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=100</guid>
		<description><![CDATA[I&#8217;ve decided to investigate Google App Engine (GAE) in my spare time and I need a project to test it with.
So I&#8217;m going to try to produce an on-line version of my Arcanicity Index. It will be a very simple system, and because I don&#8217;t have a fag packet on which to sketch it out, I&#8217;ll list the  [...]]]></description>
				<content:encoded><![CDATA[<h3><a href="http://www.onebloke.com/wp-content/uploads/2011/05/google-app-engine.png"><img class="alignright size-full wp-image-374" title="google-app-engine" src="http://www.onebloke.com/wp-content/uploads/2011/05/google-app-engine.png" alt="" width="250" height="250" /></a>I&#8217;ve decided to investigate Google App Engine (GAE) in my spare time and I need a project to test it with.</h3>
<p>So I&#8217;m going to try to produce an on-line version of my <a href="http://www.lustratusrepama.com/2011/eureka-arcanicity-and-the-technology-reading-ease-index/">Arcanicity Index</a>. It will be a very simple system, and because I don&#8217;t have a fag packet on which to sketch it out, I&#8217;ll list the specification below.</p>
<p>The user will be asked to enter some text and then click a button marked Process. <a href="http://www.bbc.co.uk/cult/hitchhikers/guide/deepthought.shtml">DeepThought will then sit and ponder for seven and a half million years</a> and respond with &#8220;42&#8221;. Either that or it will provide the visitor with some text statistics and an Arcanicity Index estimate for the text they entered.</p>
<p>That should do it for a test system. I did produce a &#8220;hello world&#8221; app using GAE a long time ago so I know the principles but there are a few areas that I&#8217;m not sure about:</p>
<ul>
<li>Security? Does Google protect the web server, the app, the code, etc?</li>
<li>Embedded? How can I link or embed the application in my own web page or will I have to send users to Google and hope they come back?</li>
<li>Cost? How much power will Deep Thought, sorry Google, provide me with free of charge? And what would be the cost of hosting it should it become moderately popular?</li>
<li>NLTK? I&#8217;m using the <a href="http://www.opendocs.net/nltk/0.9.5/guides/tokenize.html">Natural Language Toolkit&#8217;s Punkt Tokenizer</a> to separate the text into sentences and words but I know GAE doesn&#8217;t support NLTK out of the box.</li>
</ul>
<p>Of those issues it&#8217;s NLTK that gives me the most concern. NLTK provides a much better (although not perfect) mechanism for detecting the end of sentences. Most other methods I&#8217;ve seen treat&#8230;</p>
<blockquote><p>Mr. T. Brown said come A.S.A.P.</p></blockquote>
<p>&#8230;as either 3, 4 or 5 sentences. So it&#8217;s important I can use the NLTK Tokenizer. I have <a href="http://groups.google.com/group/nltk-users/browse_thread/thread/95db3032ccca7ab8/">read of some tricks to manually install NLTK</a> so I will probably start there. But I&#8217;m not really a proper programmer so I might have to agree with the rest of the world that there are 5 sentences in the text above.</p>
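<p>To show why the sentence count goes wrong, here is a minimal sketch in plain Python. It is not NLTK and not the Punkt algorithm: the function name <code>split_sentences</code> and the hand-written abbreviation list are made up for this illustration, whereas Punkt learns likely abbreviations from text rather than relying on a fixed list. The sketch only contrasts a naive split on closing punctuation with one that knows a couple of abbreviations.</p>

```python
def split_sentences(text, abbreviations=frozenset()):
    """Split text into sentences at words ending in '.', '!' or '?'.

    Words listed in `abbreviations` (lower case, with the trailing
    full stop) are never treated as the end of a sentence.
    """
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word.lower() not in abbreviations
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing words that lack closing punctuation
        sentences.append(" ".join(current))
    return sentences


TEXT = "Mr. T. Brown said come A.S.A.P."

# Hypothetical hand-written list, used only for this example.
ABBREVIATIONS = frozenset({"mr.", "t."})

naive = split_sentences(TEXT)
aware = split_sentences(TEXT, ABBREVIATIONS)

print(len(naive), naive)
print(len(aware), aware)
```

<p>With no abbreviation list the text is cut after every word that ends in a full stop, so the title and the initial become sentences of their own. With the two abbreviations supplied, the whole line stays together as one sentence.</p>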
<p>I&#8217;ll post about my exploits as I go.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/05/google-app-engine-and-the-arcanicity-index/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Un-Mission Statement</title>
		<link>http://www.onebloke.com/2011/05/the-un-mission-statement/</link>
		<comments>http://www.onebloke.com/2011/05/the-un-mission-statement/#comments</comments>
		<pubDate>Sun, 15 May 2011 20:25:02 +0000</pubDate>
		<dc:creator>Danny Goodall</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[arcanicity]]></category>
		<category><![CDATA[c#]]></category>
		<category><![CDATA[msoffice]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[readability]]></category>
		<category><![CDATA[screwtinny]]></category>
		<category><![CDATA[vb.net]]></category>
		<category><![CDATA[vsto]]></category>

		<guid isPermaLink="false">http://www.onebloke.com/?p=43</guid>
		<description><![CDATA[I&#8217;m not a big fan of mission statements&#8230;
but I thought I&#8217;d better set out a manifesto &#8211; a mission statement of sorts for this blog. I wanted to set down in writing the sort of content I&#8217;ll be posting in these pages. I already blog on technology marketing and competitive intelligence issues in my  [...]]]></description>
				<content:encoded><![CDATA[<h2>I&#8217;m not a big fan of mission statements&#8230;</h2>
<p>but I thought I&#8217;d better set out a manifesto &#8211; a mission statement of sorts for this blog. I wanted to set down in writing the sort of content I&#8217;ll be posting in these pages. I already <a href="http://www.lustratusrepama.com/repama-blog">blog on technology marketing and competitive intelligence issues in my blog at Lustratus REPAMA</a> and I tweet on similar issues at <a href="http://twitter.com/lustratusrepama">@lustratusrepama</a>. So I&#8217;ll save comment on those subjects for there.</p>
<p>But I needed a place on the Internet where I could lay my geek hat and call home. I can&#8217;t post technical stuff on my Lustratus REPAMA blog because it&#8217;s the <a href="http://www.lustratusrepama.com/competitive/audience-strata-mismatch/">wrong audience</a> so I will post here about the following technical issues:</p>
<ul>
<li>Computerised processing of language or natural language processing (NLP), an area I have investigated in developing ScrewTinny.</li>
<li>ScrewTinny &#8211; a natural language processing system that infers meaning behind text.</li>
<li>The Natural Language Toolkit (NLTK) is a library of Python routines and structures that significantly aid in the development of scripts that process text</li>
<li>General programming issues in the Python language</li>
<li>Developing solutions for MS Office with VB.Net and C# that use the VSTO interface.</li>
<li>I&#8217;ve also got a keen interest in text readability and particularly in rating how easy or difficult a specific piece of text is to understand. I&#8217;ve been working on <a href="http://www.lustratusrepama.com/2011/eureka-arcanicity-and-the-technology-reading-ease-index/">an Arcanicity Index that looks to rate how much jargon a particular piece of text has</a>. So those issues will be addressed in these pages too.</li>
</ul>
<p>And anything else that fits between the gaps. I hope my ramblings provide some use to some readers.</p>
<p>Danny Goodall</p>
]]></content:encoded>
			<wfw:commentRss>http://www.onebloke.com/2011/05/the-un-mission-statement/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
