DREaM event 4: Introduction to data mining
Kevin Swingler from Stirling University presented a workshop session to introduce participants to data mining on 25th April 2012 at the third DREaM workshop.
Kevin Swingler provided a preview of this session in a short interview.
You can also view the presentation on Slideshare and the video on Vimeo.
Kevin Swingler from Stirling University provided an introduction to the technique of data mining, which is currently used for fraud detection, controlling machines, choosing medical interventions, predicting stock and share prices and telling whether a cow is likely to be fertile!
Swingler provided a libraries context for his presentation by discussing ways in which data mining is or could be used within libraries and LIS research. This included text mining, demand prediction, and search and recommend facilities. He also described work on citation clustering, which may help to alert librarians to new emerging research fields.
Swingler described the techniques used to get from a big pile of data to a model that can be used for the purpose required. The methodology relies heavily on two particular steps: the pre-processing of data so it is in a format that can be used by the modelling software, and the interpretation of the results. Of these, Swingler argued that data preparation is probably the most important stage. He used an analogy from cookery to illustrate this point, observing that if you want to cook a lasagne, the technology involved (the oven) is fairly simple, whereas the expertise comes in the preparation. It is the same with data mining.
In describing data pre-processing, Swingler emphasised that the quantity and quality of the data are key. Without enough data, the software cannot build a reliable model. Data mining is all about extrapolation, so you need both sufficient data and a sufficiently varied distribution of data to prevent bias. The data also has to be of high enough quality for the resulting model to make acceptably accurate predictions. However, Swingler explained that there are acceptable degrees of error in data mining, depending on the task. Both how often the model gets something wrong and the preferred direction of any error must be considered, so that costs and other side effects can be minimised.
Next, Swingler provided an overview of the processes used to “clean” data, including creating a frequency histogram for each variable to assess whether you can use the variable reliably. He recommended evaluating the distribution of the values, looking for outliers, minority values, data balance and data entry errors. He discussed the issues associated with interpreting outliers in more detail, including the problems these can present when the computer attempts to model what might exist in the space between the bulk of the data and the outliers. He observed that outliers are generally removed or discussed further with the person commissioning the work.
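The checks Swingler describes can be sketched in a few lines of Python. This is a minimal illustration rather than software from the talk: the helper names are invented, and the outlier rule (values more than two standard deviations from the mean) is just one simple convention.

```python
from collections import Counter

def frequency_histogram(values):
    """Count how often each value occurs: a first check on whether
    a variable can be used reliably."""
    return Counter(values)

def flag_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean.
    A crude rule of thumb; real projects would also apply domain
    knowledge before removing anything."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) > k * std]

# 999 looks like a data entry error rather than a genuine value.
ages = [34, 29, 41, 37, 33, 30, 36, 999]
print(frequency_histogram(ages))
print(flag_outliers(ages))  # → [999]
```

As the talk notes, a flagged value should usually be discussed with the person commissioning the work rather than silently deleted.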
It may be that the outliers are the points of interest for the study, but this may mean that the problem under consideration cannot be solved using data mining because insufficient data exists to create a reliable model. Data can be rebalanced so that rare things happen more regularly in the data, but this presents its own issues. Swingler observed that you may need massively more data than you thought to get enough examples of the rarer instances that you’re interested in. You may also need more data to reduce the impact of noise if examples of the variable of interest are spread out.
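Rebalancing by duplicating rare examples (oversampling) is one common way to make rare things appear more regularly in the data. The sketch below is illustrative only: the function name and record layout are assumptions, and duplicating records adds no genuinely new information, which is part of the issue Swingler raised.

```python
import random

def oversample_minority(records, label_key, target_label, seed=0):
    """Duplicate randomly chosen minority-class records until the
    minority count matches the majority count (simple oversampling)."""
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == target_label]
    majority = [r for r in records if r[label_key] != target_label]
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return majority + minority + extra

# 95 common cases and only 5 rare ones: too unbalanced to model well.
data = [{"label": "common"}] * 95 + [{"label": "rare"}] * 5
balanced = oversample_minority(data, "label", "rare")
print(sum(1 for r in balanced if r["label"] == "rare"))  # → 95
```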
Swingler did not dwell on the software element of building a model with your clean data, explaining that the software itself varies in complexity and cost. However, he explained that having chosen a piece of software and a modelling technique, you take your carefully prepared data and show it to the software, which will then build the model for you. Choosing the modelling technique and parameters can be an experimental process, involving trialling a number of different options to see how well the model performs. Swingler explained how to test the performance of a model by splitting the data, giving the software 70% to learn from and holding back the remaining 30% to test against. The ideal result is a model that performs about equally well on the test data and the training data, within the limits you are looking for.
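The 70/30 split can be sketched as follows. This is a minimal Python illustration; shuffling before splitting is an assumption (to avoid any ordering in the original data leaking into the split), and the function name is invented.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows, then split them: the first 70% to learn
    from, the remaining 30% held back for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # → 70 30
```

A fixed seed makes the split reproducible, which helps when comparing different modelling techniques and parameters on the same data.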
Once an acceptable model has been achieved, its performance can be measured using a confusion matrix, which shows how often the model got it right and, when it made an error, in which direction the error fell. Swingler observed that most techniques also offer a confidence score, which can be useful if there is a cost associated with believing a wrong answer.
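A confusion matrix for a two-class problem can be built simply by counting (actual, predicted) pairs. The sketch below uses an invented fertility example echoing the cow anecdote from the talk; the labels and helper are illustrative, not from the presentation.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs. Matching pairs are the cases
    the model got right; mismatched pairs show the direction of each
    error, which matters when errors carry different costs."""
    return Counter(zip(actual, predicted))

actual    = ["fertile", "fertile", "not", "not", "not"]
predicted = ["fertile", "not",     "not", "not", "fertile"]
matrix = confusion_matrix(actual, predicted)

print(matrix[("fertile", "fertile")])  # correct "fertile" predictions
print(matrix[("fertile", "not")])      # fertile cows the model missed
```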
Swingler concluded that there can be something of an art form involved in data mining, and whilst there are free tools available to try, it is better to collaborate with someone with experience of data mining to get a sense of the complexity surrounding a specific data set of interest.
If you would like to comment on this presentation, please join the discussion in the DREaM online community.