Friday, April 20, 2012

DIT Email Accounts

Folks,

It is very important to monitor your DIT email accounts. This is the official channel for a lot of communication from DIT - for example about exams and registration. You can set up your DIT mail to forward all messages to other accounts that you use, to make it easier to keep an eye on things.

Friday, April 13, 2012

More Interesting Things!

A few more interesting things:

1) A really nice article, "Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit", by Duncan Cleary from the Revenue, about building predictive models in SAS. Very practical stuff.

http://www.ejeg.com/issue/download.html?idArticle=232

2) Stanford University lecturers Dan Jurafsky and Chris Manning have a free online natural language processing course running at the moment. The Week 3 lectures on text classification do a really nice job of covering Naive Bayes classifiers, bag-of-words representations, and the use of precision, recall and F-measure to evaluate performance. Available at (you may need to log in):

https://class.coursera.org/nlp/lecture/index


Thursday, April 12, 2012

Creme Free Conference

Creme, a company whose job details I posted on the LinkedIn group a little while ago, are hosting a free conference on Thursday and Friday of next week. Details are available at:

http://www.cremeglobal.com/information/conference/

For anyone interested in applying for their jobs, it might be worth attending.

Tuesday, April 10, 2012

Making a Kaggle Submission

Almost all of the Kaggle contests require that you submit a comma-separated value (CSV) file containing your predictions for the task at hand. To make your submission you simply need to prepare this file and submit it. These files typically have a very simple structure and are just a list of <ID, prediction> pairs. So you can expect to submit something like:

12005, 0
12007, 1
12008, 1
...

where the first number is a test case ID and the second number is the binary prediction generated for that test example. The Kaggle software then generates an evaluation measure for your entry based on these results.
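If you already have your predictions in hand, writing this file takes only a few lines of work. Here is a minimal Python sketch - the file name and the example pairs are just placeholders, and you should check whether your particular contest expects a header row:

    import csv

    # predictions is assumed to already hold (test case ID, prediction)
    # pairs, for example produced by your model
    predictions = [(12005, 0), (12007, 1), (12008, 1)]

    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for case_id, prediction in predictions:
            writer.writerow([case_id, prediction])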

Generating these files is reasonably straightforward. The following are instructions for how to do this in Weka and SAS Enterprise Miner.

Weka

In Weka, rather than performing a k-fold cross-validation experiment on the Classify tab as we might normally do, we choose the "Supplied test set" test option. This allows us to select an arff file containing test instances. The structure of this arff file must be the same as the structure of the training file used. (Note that due to a bug, when you load a test file the "Instances" number in the little file selection box doesn't seem to change.)
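To illustrate what "same structure" means, a test arff file might look something like the made-up sketch below (not one of the actual Kaggle datasets). The attribute names, types and order must match the training file exactly, and the class values for the test instances can usually be left as unknown ("?"):

    % Hypothetical test file - attributes must match the training file
    @relation kaggle-test

    @attribute id numeric
    @attribute feature1 numeric
    @attribute feature2 numeric
    @attribute class {0,1}

    @data
    12005,3.2,0.7,?
    12007,1.7,2.4,?
    12008,0.9,1.1,?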


To instruct Weka to output the actual predictions made by your classification model you need to click on the "More options..." button and, from the resulting dialogue box, turn on the "Output predictions" option.


Now when you run your test, Weka will train your model using the training data, perform a test using the supplied test set, and output the resulting predictions to the log. Weka produces a predictions listing with a test ID, the expected result, the actual result and some other useful measures. You can scroll up the output pane to find these and then copy them into your favourite editor (Excel is pretty good for this) to prepare your Kaggle submission. The prediction listing is in the same order that the test cases are given in your test arff file, so you will need to extract the test instance IDs from this file to match them up with the actual predictions.
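If you would rather script that last matching step than do it by hand, here is a minimal Python sketch. It assumes you have saved the test instance IDs and the copied predictions into two plain text files, one value per line and in the same order - the file names are just examples:

    import csv

    # Assumed (hypothetical) inputs:
    #   test_ids.txt    - one test case ID per line, extracted from the
    #                     test arff file in the order the instances appear
    #   predictions.txt - one predicted label per line, copied from
    #                     Weka's output pane, in the same order
    with open("test_ids.txt") as f:
        ids = [line.strip() for line in f]
    with open("predictions.txt") as f:
        preds = [line.strip() for line in f]

    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for case_id, pred in zip(ids, preds):
            writer.writerow([case_id, pred])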


SAS Enterprise Miner

The process for generating a submission file from Enterprise Miner is reasonably similar. First you need to construct a typical modelling process flow, for example as shown below.



To this process flow you now need to add the data file containing the test instances you would like to generate predictions for. (Note that the type of this data source, when added to the project, should have been set to "Score".) Simply drag this onto the diagram. In order to generate predictions, Enterprise Miner uses a Score node. This is very clever, as it takes the score data you connect to it, puts it through any preprocessing you have included in your modelling process flow, and then presents all of these test instances to the model, recording the predictions. The Score node should be positioned after your modelling node (in my example a logistic regression node) and have connected to it your test data set and the output from the model node itself, as shown below.


After you run this process flow you just need to access the predictions generated. To do this, select the Score node and move your attention to the properties panel on the left. Click on the "..." beside the "Exported Data" property to access the data exported by this node (this option is actually available for any SAS node). From the list of exported datasets, select the SCORE dataset and click the "Browse..." button.


This shows you the dataset exported by the Score node. If you scroll all the way over to the right-hand side you will see a new column, added by the scoring process, called "Predictions for XXX", where XXX is replaced with the name of your target variable.


To export this dataset so that you can generate a Kaggle submission file, press Ctrl-A (Cmd-A on a Mac) to select all data, then right-click and select either "Copy" or "Export to Excel" as shown below. If you selected "Copy", simply paste into your favourite editor (again, Excel is pretty good) to prepare your Kaggle submission; if you selected "Export to Excel", Excel should open and you will be able to prepare your Kaggle submission from there.
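Again, if you would rather script the last step, a short Python sketch using the pandas library is below. The file name and column names are assumptions - substitute the ID column and prediction column names from your own exported dataset:

    import pandas as pd

    # Hypothetical names - replace "scored.xlsx", "ID" and
    # "Prediction for TARGET" with the names from your exported data
    scored = pd.read_excel("scored.xlsx")
    submission = scored[["ID", "Prediction for TARGET"]]
    submission.to_csv("submission.csv", index=False, header=False)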


Both of these approaches work reasonably well and I have tested them on a number of the Kaggle datasets without any problems. There is, however, potential for problems, as many of the datasets are quite large, which stretches our installations of these programmes (rather than the programmes themselves) towards their limits. If you are having problems please let me know and we can figure something out.

One last thing - if you are struggling with very large file sizes, there are two things you can try (a sketch of both is given after this list):
  1. Downsample the training data so that you have a smaller training set to deal with - it is unlikely that you really need all of the data provided to build a good model.
  2. Break the test cases file up into chunks and generate predictions for the chunks one at a time.
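Here is a minimal Python sketch of both ideas, assuming plain CSV files with a header row - the file names, sampling fraction and chunk size are just examples:

    import csv
    import random

    # 1. Downsample the training data, keeping roughly 10% of the rows
    #    ("train.csv" and the 0.1 fraction are just examples)
    random.seed(42)
    with open("train.csv") as src, \
         open("train_small.csv", "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader))        # keep the header row
        for row in reader:
            if random.random() < 0.1:
                writer.writerow(row)

    # 2. Break the test cases file up into chunks of 10,000 rows each
    def write_chunk(header, rows, chunk_num):
        with open("test_chunk_%d.csv" % chunk_num, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)          # repeat the header per chunk
            writer.writerows(rows)

    with open("test.csv") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, chunk_num = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == 10000:
                write_chunk(header, chunk, chunk_num)
                chunk, chunk_num = [], chunk_num + 1
        if chunk:                            # write any leftover rows
            write_chunk(header, chunk, chunk_num)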