Thursday, May 31, 2012

SAS Ireland New Positions Presentation

SAS Ireland have a significant number of very interesting analytics positions coming up in the near future. These will involve working with SAS in Ireland and training with their international centres of excellence.

Representatives from SAS would like to come in to present these positions to you and answer any questions that you might have. They are also interested in accepting CVs from interested candidates on the day. The details of this session are as follows:

Date: Thursday the 7th of June
Time: 18:00 - 19:00
Location:  Room KA 3-021, Kevin St.

So that I can gauge likely attendance (and arrange tea and biscuits!) could you please let me know if you will be coming along?

I will forward on any further information on the positions as I receive it.

Thursday, May 24, 2012

Machine Learning Results Published

The Machine Learning assignment results have now been published on the module Webcourses page. Please note that these are indicative provisional results and will not be finalised until after examination boards have taken place.

Overall performance was very good. The submissions to Kaggle were strong and some groups managed to threaten the top of the various leaderboards. The credit scoring contest was by far the most popular, but there were also good submissions in the automatic essay grading, bond prediction, car classification and biological response prediction contests.


Game Over: Churn in Games

This article from gamasutra.com is a nice example of churn prediction in a very different setting than usual - predicting whether or not players will leave an online game.

http://www.gamasutra.com/view/feature/170472/predicting_churn_datamining_your_.php

Thursday, May 17, 2012

DubLinked Data Visualisation Event (24th May)

On Thursday 24th May DubLinked will hold a data visualisation event to explore the area of data visualisation and learn about available data visualisation tools. Speakers on the day will provide an overview of the current state of play, future trends and challenges in the area. There will be practical demonstrations of creating visualisations, followed by a workshop where participants will be invited to create visualisations using Dublinked data - so why not bring your laptop on the day and create your own data visualisation?

More details are available at: http://www.dublinked.ie/?q=datavisualisationevent

Tuesday, May 15, 2012

Possible Dissertation Idea

Check out the article in the May 2012 New Scientist called 'Making Numbers Punch their Weight'.

There might be some Data Analytics type dissertation projects using the methods covered in this article.

http://www.newscientist.com/article/mg21428635.500-font-for-digits-lets-numbers-punch-their-weight.html


Wednesday, May 2, 2012

SAP Competition for Students

THE COMPETITION

In 2012 SAP University Alliances will be hosting an exciting new competition in collaboration with the 2012 European Football Championships. Students can make use of the actual results of the football championships combined with a free download of the Crystal Dashboard design software from the University Alliances Community.

Students at universities in Europe, the Middle East and Africa (EMEA) are invited to participate in this competition. Each participating team is to develop a dashboard using SAP Crystal Dashboard Design to display and analyse the results of the games throughout the Euro 2012 competition. Each week a new version of the results from the competition will be uploaded here for students to download in Excel format. Students can then build, week on week through the competition, their dashboards in Crystal Dashboard Design.

Each team's dashboard will be judged on its usefulness, usability, data quality and presentation, subject to the terms and conditions below.
Download a poster, share on bulletin boards, and distribute to students!

RECOGNITION OPPORTUNITIES

The top three student teams will make it through to our finals and win a valuable prize from SAP. The final overall winner will be judged by our panel of design experts and crowned overall Best Dashboard Design for the University Alliances program, EMEA, 2012.

Finalist teams will present their dashboards to a panel of judges and key User Group Meeting attendees. The winning team will be announced at the SAP University Alliances EMEA User Group Meeting.

IMPORTANT DATES
Closing date for registration of teams: June 1st, 2012
Closing date for submission of dashboard entries: July 13th, 2012
Notification of 3 finalist teams by SAP: August 10th, 2012
Finals: September 6-7th, 2012

RULES
Each university can submit multiple teams but each team must consist of no more than 3 named students.

The dashboard must be built using SAP Crystal Dashboard Design (software available for free download - Windows operating system required.)

Each entry must include a dashboard and a 10-minute recorded presentation showcasing the dashboard.

All submissions (dashboard and documentation) must be in English.

All submitted materials must be original materials created by the members of the applicable team.

GETTING STARTED
Students must first register their team. Read the terms and conditions.

STEPS AND RESOURCES
Students: Review these steps and resources to understand the Football Championships, download software, learn dashboarding, and submit your team's entry: http://scn.sap.com/docs/DOC-26542

For general inquiries, send an email to emeadashboard@sap.com.

Friday, April 20, 2012

DIT Email Accounts

Folks,

It is very important to monitor your DIT email accounts. They are the official communication channel for a lot of material from DIT - for example, notices about exams and registration. You can set up your DIT mail to forward all messages to other accounts that you use, to make it easier to keep an eye on things.

Friday, April 13, 2012

More Interesting Things!

A few more interesting things:

1) A really nice article, "Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit", by Duncan Cleary from the Revenue, about building predictive models in SAS. Very practical stuff.

http://www.ejeg.com/issue/download.html?idArticle=232

2) Stanford University lecturers Dan Jurafsky and Chris Manning have a free online natural language processing course running at the moment. The Week 3 lectures on Text Classification do a really nice job covering the Naive Bayes classifiers, bag of words type representations and the use of precision, recall and f-measure to evaluate performance. Available at (you may need to log in):

https://class.coursera.org/nlp/lecture/index


Thursday, April 12, 2012

Creme Free Conference

Creme, a company whose job details I posted on the LinkedIn group a little while ago, are hosting a free conference on Thursday and Friday of next week. Details are available at:

http://www.cremeglobal.com/information/conference/

For anyone interested in applying for their jobs it might be worth attending.

Tuesday, April 10, 2012

Making a Kaggle Submission

Almost all of the Kaggle contests require that you submit a comma-separated value (CSV) file containing your predictions for the task at hand. To make your submission you simply need to prepare this file and submit it. These files typically have a very simple structure and are just a list of <ID, prediction> pairs, so you can expect to submit something like:

12005, 0
12007, 1
12008, 1
...

where the first number is a test case ID and the second number is the binary prediction I have generated for this test example. The Kaggle software then generates an evaluation measure for your entry based on these results.
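If you already have a set of predictions in hand, preparing the submission file itself is easy to script. A minimal Python sketch (the IDs, predicted values and file name here are invented for illustration):

```python
import csv

# Hypothetical predictions keyed by test-case ID; the IDs and values
# below are invented for illustration, not taken from a real contest.
predictions = {12005: 0, 12007: 1, 12008: 1}

# Write one <ID, prediction> pair per row, in ID order.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for case_id in sorted(predictions):
        writer.writerow([case_id, predictions[case_id]])
```

Each row of the resulting file is one <ID, prediction> pair, matching the structure shown above.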

Generating these files is reasonably straightforward. The following are instructions for how to do this in Weka and SAS Enterprise Miner.

Weka

In Weka, rather than performing a k-fold cross validation experiment on the Classify tab as we might normally do, we choose the "Supplied test set" test option. This allows us to select an arff file containing test instances. The structure of this arff file must be the same as the structure of the training file used. (Note that due to a bug, when you load a test file the "Instances" number in the little file selection box doesn't seem to change.)


To instruct Weka to output the actual predictions made by your classification model you need to click on the "More options..." button and, from the resulting dialogue box, turn on the "Output predictions" option.


Now when you run your test, Weka will train your model using the training data, perform a test using the supplied test set and output the resulting predictions to the log. Weka produces a predictions listing with a test ID, the expected result, the actual result and some other useful measures. You can scroll up the output pane to find these and then copy them into your favourite editor (Excel is pretty good for this) to prepare your Kaggle submission. The prediction listing is in the same order as the test cases in your test arff file, so you will need to extract the test instance IDs from this file to match them up with the actual predictions.
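If the copy-and-paste matching step becomes tedious, it can also be scripted. A rough Python sketch, assuming the test-case ID is the first attribute in the test arff file (the file name and helper names are my own):

```python
# Pull the test-case IDs out of a test arff file so they can be paired
# with the predictions copied from Weka's output pane. Assumes the ID
# is the first attribute and rows appear in the @data section in order.

def read_arff_ids(path):
    ids, in_data = [], False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if in_data and line and not line.startswith("%"):
                ids.append(line.split(",")[0].strip())
            elif line.lower() == "@data":
                in_data = True
    return ids

def paired_submission(ids, predictions):
    # predictions: one value per test instance, in the same arff order
    return list(zip(ids, predictions))
```

The resulting ID/prediction pairs can then be written out as a submission CSV.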


SAS Enterprise Miner

The process for generating a submission file from Enterprise Miner is reasonably similar. First you need to construct a typical modelling process flow, for example as shown below.



To this process flow you now need to add the data file containing the test instances you would like to generate predictions for. (Note that the type of this data source, when added to the project, should have been set to "Score".) Simply drag this onto the diagram. In order to generate predictions Enterprise Miner uses a Score node. This is very clever as it takes the score data you connect to it, puts it through any preprocessing you have included in your modelling process flow and then presents all of these test instances to the model, recording the predictions. The Score node should be positioned after your modelling node (in my example a logistic regression node) and have connected to it your test data set and the output from the model node itself, as shown below.


After you run this process flow you just need to access the predictions generated. To do this select the Score node and move your attention to the properties panel on the left. Click on the "..." beside the "Exported Data" property to access the data exported by this node (this option is actually available for any SAS node). From the list of exported datasets select the SCORE dataset and click the "Browse..." button.


This shows you the dataset exported by the Score node. If you scroll all the way over to the right-hand side you will see a new column that has been added by the scoring process called "Predictions for XXX", where XXX is replaced with the name of your target variable.


To export this dataset so that you can generate a Kaggle submission file, press Ctrl-A (Cmd-A on a Mac) to select all data, then right-click and select either "Copy" or "Export to Excel" as shown below. If you selected "Copy", simply paste into your favourite editor (again, Excel is pretty good) to prepare your Kaggle submission; if you selected "Export to Excel", Excel should open and you will be able to prepare your Kaggle submission from there.


Both of these approaches work reasonably well and I have tested them on a number of the Kaggle datasets without any problems. There is, however, potential for problems as many of the datasets are quite large, which stretches our installations of these programmes (rather than the programmes themselves) towards their limits. If you are having problems please let me know and we can figure something out.

One last thing - if you are struggling with very large file sizes, there are two things you can try:
  1. Downsample the training data so that you have a smaller training set to deal with - it is unlikely that you really need all of the data provided to build a good model.
  2. Break the test cases file up into chunks and generate predictions for the chunks one at a time.
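Both tricks are easy to script if your data is in CSV form. A rough Python sketch (the file names, chunk naming scheme and helper names are my own):

```python
import csv
import random

def downsample(in_path, out_path, fraction, seed=42):
    # Keep each training row with the given probability; header survives.
    rng = random.Random(seed)
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # keep the header row
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)

def chunk(in_path, rows_per_chunk):
    # Split the test file into test_chunk_0.csv, test_chunk_1.csv, ...
    # each carrying a copy of the header row.
    with open(in_path) as src:
        reader = csv.reader(src)
        header = next(reader)
        idx, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_chunk:
                _write_chunk(header, rows, idx)
                idx, rows = idx + 1, []
        if rows:
            _write_chunk(header, rows, idx)

def _write_chunk(header, rows, idx):
    with open(f"test_chunk_{idx}.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
```

Predictions for the chunks can then be generated one at a time and concatenated before submission.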

Thursday, March 29, 2012

Census Data Released, Dissertation Projects Ahoy?

The 2011 census data has been released by the Central Statistics Office today. Details are available at: 
http://cso.ie/en/newsandevents/pressreleases/2012pressreleases/pressreleasethisisireland-highlightsfromcensus2011part1/

The census data gives a great opportunity for doing data analysis and in particular creating visualisations. The All-Ireland Research Observatory at Maynooth (http://www.airo.ie/) are already doing a really nice job of this - see http://airomaps.nuim.ie/flexviewer/?config=Census2011.xml - but there is lots left to do.

For inspiration have a look at this TED talk by Jer Thorp on Making Data More Human http://www.ted.com/talks/jer_thorp_make_data_more_human.html



Wednesday, March 28, 2012

Text Analytics in Weka

Weka, the open source machine learning toolkit, can be used to perform text analytics, and the GUI explorer is nice for doing this. The most straightforward way in which to proceed is to generate an .arff file in which the text involved is represented as a string attribute. For example:

@relation essays
@attribute essay_id numeric
@attribute essay string
@attribute prediction {0, 1}

@data
1788, ' Dear ORGANIZATION1 CAPS1 more and more people start to use computers goes more and more into the dark aged While computer MONTH1 be helpful in some cases they are also making people exercise less most CAPS1 many people are nature and more people are becoming awkward', 0
1789, ' Dear LOCATION1 Time CAPS1 me tell you what I think of computers there is nothing wrong with people being on computers I say this because us as kids really do need computers for projects and school work',1

In this case I have one numeric attribute, one categorical attribute and one string attribute. Note in the @data section that the string data is enclosed in single quotation marks.
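If your raw data arrives as plain text or CSV, a file in this shape can be generated with a short script. A minimal Python sketch (the helper names and the deliberately simple quote-escaping are my own):

```python
# Write essays out in the arff layout shown above: numeric id, string
# essay (single-quoted, with quotes escaped) and a {0, 1} prediction.

def essay_to_arff_row(essay_id, text, label):
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"{essay_id}, '{escaped}', {label}"

def write_essay_arff(path, rows):
    with open(path, "w") as f:
        f.write("@relation essays\n")
        f.write("@attribute essay_id numeric\n")
        f.write("@attribute essay string\n")
        f.write("@attribute prediction {0, 1}\n\n@data\n")
        for essay_id, text, label in rows:
            f.write(essay_to_arff_row(essay_id, text, label) + "\n")
```

Note that single quotes inside the essay text must be escaped or Weka will fail to parse the data section.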

In order to build a prediction model from this data we need to transform the string representation into something more amenable to use with prediction algorithms - the bag of words, or word vector, representation is the most straightforward way in which to do this. Weka will do this for us using the StringToWordVector filter. This can be done through the GUI tools, the command line or through code using the API and there are a wide range of options to change the behaviour of the filter (e.g. stop word removal, stemming etc).

(One thing to note if using a numeric target is that the StringToWordVector filter has a slightly oddly named parameter, doNotOperateOnPerClassBasis which is set to false in the GUI and must be changed to true for numeric target variables.)

Once the StringToWordVector has been applied you will have a new dataset with many more attributes - one per feature - and you can use this to build and evaluate prediction models as normal.

If you have been given separate training and test files containing string attributes that you plan to transform using the StringToWordVector filter there is a little gotcha to watch out for. With a bag of words representation the dictionary used by the training and test sets must be the same. If you run the StringToWordVector filter independently on the training and test sets you are likely to end up with two very different dictionaries and so two very different arff files. This means prediction models trained using the training data will not be able to handle test instances.

In order to get around this you must apply the StringToWordVector filter in batch mode from the command line. This can be done as follows from a terminal window:

java -cp /Applications/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i training_set.arff -o training_set_vec.arff -r test_set.arff -s test_set_vec.arff

where I am using the StringToWordVector filter, the -cp option puts the Weka jar on my class path, the -b option tells Weka to use batch mode, -i specifies my training set input file, -o the output after processing this first file, -r is my test set input file and -s is the output file I would like to use. You could also add any of the StringToWordVector options after the filter name.

This will give you compatible training and test word vector arff files that can be used by Weka.
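To see why the shared dictionary matters, here is a small illustration of the idea in plain Python (not Weka code): the vocabulary is built from the training texts only and then reused, unchanged, for the test texts.

```python
# Build a word-to-index dictionary from the training texts only.
def build_vocab(texts):
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

# Turn any text into a count vector over that fixed dictionary, so
# training and test vectors always have the same length and ordering.
def to_counts(text, vocab):
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:  # words unseen in training are dropped
            vec[vocab[word]] += 1
    return vec
```

Because the test vectors are built against the training dictionary, they line up attribute-for-attribute with the training vectors - exactly what Weka's batch mode guarantees for the two arff files.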

Saturday, March 24, 2012

Nice Talk By Foursquare

Nice slides from a talk by FourSquare on running ML algorithms across very large networks of people - good stuff for project ideas again.

http://engineering.foursquare.com/2012/03/23/machine-learning-with-large-networks-of-people-and-places/

Wednesday, March 21, 2012

Weka Dataset for Kaggle Essay Scoring Contest


There is a version of the dataset for the Kaggle Hewlett Foundation Automated Essay Scoring contest prepped for use in Weka available for download at:

https://docs.google.com/open?id=0B87X5AAMrki2RGdUVHUzLTRSekM5eEt5M3BsbTByQQ

Storyful & Data Journalism Discussion on Radio One

Mark Little was on Radio One this morning talking about his company Storyful and the potential for data analytics in journalism - interesting stuff and definitely grist for the dissertation mill:

http://www.rte.ie/radio/radioplayer/rteradioweb.html#!rii=9:3234361:133::

There are massive challenges for journalists in dealing with the amounts of data with which they are now faced. Massive challenges lead to massive opportunities for the application of analytics techniques in text mining, data visualisation, etc. The Guardian Datablog is a nice example of this:

http://www.guardian.co.uk/news/datablog

Wednesday, March 14, 2012

Importing a Library in SAS Enterprise Miner OnDemand

Most of the Kaggle datasets have been added to a folder associated with our class on the SAS Enterprise Miner OnDemand servers. As an alternative to uploading data files yourself using the "File Import" node you can access these datasets by adding the folder as a library to your project. These are the steps to do this.

1) Click on the name of your project in the project tree (upper left of the Enterprise Miner interface) and from the Properties Panel click on the "..." button next to "Project Start Code"



2) Add the following piece of SAS code to the project start code window

libname KAGGLE "/courses/u_dit.ie1/i_610146/c_3477/KaggleDatasets";  

and hit "Run Now" a couple of times before hitting OK.



3) Now you can add a data source to your project as normal and when it comes to selecting from a library a new library called Kaggle should be present in the list. The datasets within this library have names that should show an obvious connection back to the Kaggle contests.



4) Continue as normal.

There are some problems emerging with the bigger datasets so I will keep working on these. If you are having any problems, or need a dataset that isn't present (try uploading it yourself using the File Import node) just give me a shout.

Monday, March 12, 2012

Loading Data into SAS Enterprise Miner OnDemand

In order to load local data into SAS Enterprise Miner OnDemand, follow these steps:

1) Create a SAS Enterprise Miner Project

2) Create a new diagram

3) From the "Sample" tab add a "File Import" node to your new diagram


4) Select the "File Import" node on the diagram and from the properties panel click on the "..." beside the "Import File" option in the "Train" section.



5) Select "My Computer" and click on "Browse..."

6) Select the file you are interested in (possible file types are shown below) and click "OK" (watch out - it may take a little while for this step to complete). The easiest format to use is a comma-separated file where the first row contains comma-separated variable names.
7) Right click on the "File Import" node and choose "Edit Variables" to set the roles and levels for the variables in the dataset.

8) Continue as normal.






