Wednesday, March 28, 2012

Text Analytics in Weka

Weka, the open source machine learning toolkit, can be used to perform text analytics, and the GUI explorer is nice for doing this. The most straightforward way in which to proceed is to generate an .arff file in which the text involved is represented as a string attribute. For example:

@relation essays
@attribute essay_id numeric
@attribute essay string
@attribute prediction {0, 1}

@data
1788, ' Dear ORGANIZATION1 CAPS1 more and more people start to use computers goes more and more into the dark aged While computer MONTH1 be helpful in some cases they are also making people exercise less most CAPS1 many people are nature and more people are becoming awkward', 0
1789, ' Dear LOCATION1 Time CAPS1 me tell you what I think of computers there is nothing wrong with people being on computers I say this because us as kids really do need computers for projects and school work',1

In this case I have one numeric attribute, one categorical attribute and one string attribute. Note in the @data section that the string data is enclosed in single quotation marks.

In order to build a prediction model from this data we need to transform the string representation into something more amenable to use with prediction algorithms - the bag of words, or word vector, representation is the most straightforward way to do this. Weka will do this for us using the StringToWordVector filter. This can be applied through the GUI tools, the command line or through code using the API, and there is a wide range of options to change the behaviour of the filter (e.g. stop word removal, stemming etc.).
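As a rough sketch, this is how the filter could be applied through the API (the file name essays.arff is just a placeholder for the data shown above):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BagOfWordsDemo {
    public static void main(String[] args) throws Exception {
        // Load the arff file containing the string attribute
        Instances data = DataSource.read("essays.arff");
        // The class attribute, 'prediction', is the last one in the example above
        data.setClassIndex(data.numAttributes() - 1);

        // Configure the filter - options such as stemming or stop word removal
        // can be set on this object before it is used
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);

        // Produce the bag of words (word vector) version of the dataset
        Instances vectorised = Filter.useFilter(data, filter);
        System.out.println(vectorised.numAttributes() + " attributes after filtering");
    }
}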

(One thing to note if you are using a numeric target is that the StringToWordVector filter has a slightly oddly named parameter, doNotOperateOnPerClassBasis, which is set to false in the GUI and must be changed to true for numeric target variables.)
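In code, assuming the setter follows Weka's usual naming convention for this property, that would look something like:

// Assumed setter name, following Weka's bean conventions for the
// doNotOperateOnPerClassBasis property
filter.setDoNotOperateOnPerClassBasis(true);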

Once the StringToWordVector filter has been applied you will have a new dataset with many more attributes - one per word feature - and you can use this to build and evaluate prediction models as normal.
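As a sketch of that last step, the word vector data can be fed to any of the standard learners; NaiveBayes and the file name essays_vec.arff below are just illustrative choices:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateWordVectors {
    public static void main(String[] args) throws Exception {
        // Load the word vector output saved from the filtering step
        Instances data = DataSource.read("essays_vec.arff");
        // The class attribute keeps its original name, 'prediction' in the example above
        data.setClassIndex(data.attribute("prediction").index());

        // Ten-fold cross-validation of a simple learner on the word vector data
        NaiveBayes model = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}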

If you have been given separate training and test files containing string attributes that you plan to transform using the StringToWordVector filter, there is a little gotcha to watch out for. With a bag of words representation the dictionary used by the training and test sets must be the same. If you run the StringToWordVector filter independently on the training and test sets you are likely to end up with two very different dictionaries and so two very different arff files. This means prediction models trained using the training data will not be able to handle the test instances.

One way to get around this is to apply the StringToWordVector filter in batch mode from the command line. This can be done as follows from a terminal window:

java -cp /Applications/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i training_set.arff -o training_set_vec.arff -r test_set.arff -s test_set_vec.arff

where I am using the StringToWordVector filter, the -cp option puts the Weka jar on my class path, the -b option tells Weka to use batch mode, -i specifies my training set input file, -o the output after processing this first file, -r is my test set input file and -s is the corresponding output file for the test set. You could also add any of the StringToWordVector options after the filter name.

This will give you compatible training and test word vector arff files that can be used by Weka.
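If you prefer to stay in code, you should be able to get the same effect through the API by pushing both datasets through a single filter instance, so that the dictionary built from the training data is reused for the test data. A minimal sketch, using the same file names as the command above and assuming the class is the last attribute:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchStringToWordVector {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_set.arff");
        Instances test = DataSource.read("test_set.arff");
        // Assuming the class is the last attribute, as in the example above
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The dictionary is determined from the training data only...
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);

        // ...and the same filter instance is reused for the test data,
        // so both outputs end up with exactly the same attributes
        Instances testVec = Filter.useFilter(test, filter);

        System.out.println(trainVec.numAttributes() + " == " + testVec.numAttributes());
    }
}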
