Thursday, March 29, 2012

Census Data Released, Dissertation Projects Ahoy?

The 2011 census data has been released by the Central Statistics Office today. Details are available at: 
http://cso.ie/en/newsandevents/pressreleases/2012pressreleases/pressreleasethisisireland-highlightsfromcensus2011part1/

The census data gives a great opportunity for doing data analysis and, in particular, for creating visualisations. The All-Ireland Research Observatory at Maynooth (http://www.airo.ie/) is already doing a really nice job of this - see http://airomaps.nuim.ie/flexviewer/?config=Census2011.xml - but there is lots left to do.

For inspiration have a look at this TED talk by Jer Thorp on Making Data More Human: http://www.ted.com/talks/jer_thorp_make_data_more_human.html

Other great examples of visualising similar data include:


Wednesday, March 28, 2012

Text Analytics in Weka

Weka, the open source machine learning toolkit, can be used to perform text analytics, and the Explorer GUI is a nice way to do this. The most straightforward way to proceed is to generate an .arff file in which the text involved is represented as a string attribute. For example:

@relation essays
@attribute essay_id numeric
@attribute essay string
@attribute prediction {0, 1}

@data
1788, ' Dear ORGANIZATION1 CAPS1 more and more people start to use computers goes more and more into the dark aged While computer MONTH1 be helpful in some cases they are also making people exercise less most CAPS1 many people are nature and more people are becoming awkward', 0
1789, ' Dear LOCATION1 Time CAPS1 me tell you what I think of computers there is nothing wrong with people being on computers I say this because us as kids really do need computers for projects and school work',1

In this case I have one numeric attribute, one categorical attribute and one string attribute. Note in the @data section that the string data is enclosed in single quotation marks.

In order to build a prediction model from this data we need to transform the string representation into something more amenable to use with prediction algorithms - the bag-of-words, or word vector, representation is the most straightforward way to do this. Weka will do this for us using the StringToWordVector filter. This can be done through the GUI tools, the command line, or through code using the API, and there is a wide range of options to change the behaviour of the filter (e.g. stop word removal, stemming, etc.).

(One thing to note if using a numeric target is that the StringToWordVector filter has a slightly oddly named parameter, doNotOperateOnPerClassBasis, which is set to false by default in the GUI and must be changed to true for numeric target variables.)
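As an illustration of the API route, here is a minimal sketch of applying the filter in code. The file name essays.arff and the particular filter settings are assumptions for the example rather than recommendations, but the classes and methods used are part of the standard Weka API:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BagOfWords {
    public static void main(String[] args) throws Exception {
        // Load the arff file containing the string attribute (hypothetical file name)
        Instances data = DataSource.read("essays.arff");
        data.setClassIndex(data.numAttributes() - 1); // the prediction attribute is last

        StringToWordVector filter = new StringToWordVector();
        filter.setAttributeIndices("2");  // the essay string attribute
        filter.setLowerCaseTokens(true);  // fold tokens to lower case
        filter.setWordsToKeep(1000);      // approximate dictionary size - an arbitrary choice
        // filter.setDoNotOperateOnPerClassBasis(true); // needed for numeric targets, as noted above
        filter.setInputFormat(data);

        Instances wordVectors = Filter.useFilter(data, filter);
        System.out.println(wordVectors.numAttributes() + " attributes after filtering");
    }
}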

Once the StringToWordVector filter has been applied you will have a new dataset with many more attributes - one per word in the dictionary - and you can use this to build and evaluate prediction models as normal.
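For example, evaluating a naive Bayes classifier (a common baseline for text) with ten-fold cross-validation could look like the following sketch. It repeats the loading and filtering from the example above, and again essays.arff is just an assumed file name:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class EvaluateBagOfWords {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("essays.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Transform the string attribute into word vector attributes
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);
        Instances wordVectors = Filter.useFilter(data, filter);

        // 10-fold cross-validation of a naive Bayes model on the word vectors
        Evaluation eval = new Evaluation(wordVectors);
        eval.crossValidateModel(new NaiveBayes(), wordVectors, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}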

If you have been given separate training and test files containing string attributes that you plan to transform using the StringToWordVector filter, there is a little gotcha to watch out for. With a bag-of-words representation the dictionary used by the training and test sets must be the same. If you run the StringToWordVector filter independently on the training and test sets you are likely to end up with two very different dictionaries, and so two very different arff files. This means prediction models trained using the training data will not be able to handle test instances.

In order to get around this you must apply the StringToWordVector filter in batch mode from the command line. This can be done as follows from a terminal window:

java -cp /Applications/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i training_set.arff -o training_set_vec.arff -r test_set.arff -s test_set_vec.arff

where I am using the StringToWordVector filter, the -cp option puts the Weka jar on my class path, the -b option tells Weka to use batch mode, -i specifies the training set input file, -o the output file after processing this first file, -r the test set input file and -s the corresponding test set output file. You could also add any of the StringToWordVector options after the filter name.

This will give you compatible training and test word vector arff files that can be used by Weka.
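The same guarantee can be had through the API: if both sets are passed through a single filter object, the dictionary built when setInputFormat is called with the training data is reused for the test data. A minimal sketch, assuming the file names from the command above:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilter {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_set.arff");
        Instances test = DataSource.read("test_set.arff");

        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train); // the dictionary is built from the training data only

        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter); // same dictionary applied to the test set

        // trainVec and testVec now share an identical attribute set
        System.out.println(trainVec.numAttributes() + " == " + testVec.numAttributes());
    }
}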

Saturday, March 24, 2012

Nice Talk By Foursquare

Nice slides from a talk by Foursquare on running machine learning algorithms across very large networks of people - good stuff for project ideas again.

http://engineering.foursquare.com/2012/03/23/machine-learning-with-large-networks-of-people-and-places/

Wednesday, March 21, 2012

Weka Dataset for Kaggle Essay Scoring Contest


There is a version of the dataset for the Kaggle Hewlett Foundation Automated Essay Scoring contest prepped for use in Weka available for download at:

https://docs.google.com/open?id=0B87X5AAMrki2RGdUVHUzLTRSekM5eEt5M3BsbTByQQ

Storyful & Data Journalism Discussion on Radio One

Mark Little was on Radio One this morning talking about his company Storyful and the potential for data analytics in journalism - interesting stuff and definitely grist for the dissertation mill:

http://www.rte.ie/radio/radioplayer/rteradioweb.html#!rii=9:3234361:133::

There are massive challenges for journalists in dealing with the amounts of data with which they are now faced. Massive challenges lead to massive opportunities for the application of analytics techniques in text mining, data visualisation, etc. The Guardian Datablog is a nice example of this:

http://www.guardian.co.uk/news/datablog

Wednesday, March 14, 2012

Importing a Library in SAS Enterprise Miner OnDemand

Most of the Kaggle datasets have been added to a folder associated with our class on the SAS Enterprise Miner OnDemand servers. As an alternative to uploading data files yourself using the "File Import" node, you can access these datasets by adding the folder as a library to your project. These are the steps to do this.

1) Click on the name of your project in the project tree (upper left of the Enterprise Miner interface) and from the Properties Panel click on the "..." button next to "Project Start Code"



2) Add the following piece of SAS code to the project start code window

libname KAGGLE "/courses/u_dit.ie1/i_610146/c_3477/KaggleDatasets";  

and hit "Run Now" a couple of times before hitting OK.



3) Now you can add a data source to your project as normal, and when it comes to selecting from a library a new library called Kaggle should be present in the list. The datasets within this library have names that should show an obvious connection back to the Kaggle contests.



4) Continue as normal.

There are some problems emerging with the bigger datasets, so I will keep working on these. If you are having any problems, or need a dataset that isn't present (try uploading it yourself using the File Import node first), just give me a shout.

Monday, March 12, 2012

Loading Data into SAS Enterprise Miner OnDemand

In order to load local data into SAS Enterprise Miner OnDemand, follow these steps:

1) Create a SAS Enterprise Miner Project

2) Create a new diagram

3) From the "Sample" tab add a "File Import" node to your new diagram


4) Select the "File Import" node on the diagram and from the properties panel click on the "..." beside the "Import File" option in the "Train" section.



5) Select "My Computer" and click on "Browse..."

6) Select the file you are interested in (possible file types are shown below) and click "OK" (watch out: it may take a little while for this step to complete). The easiest format to use is a comma-separated file where the first row contains comma-separated variable names.

7) Right click on the "File Import" node and choose "Edit Variables" to set the roles and levels for the variables in the dataset.

8) Continue as normal.