Thursday, August 17, 2006

I am trying to define data-mining.
This is not as trivial as it sounds. The word is perhaps as misunderstood as relativity.
I have a friend who builds models on financial reports posted on Excel sheets and calls it data-mining. Another friend googles to find the best deal on a camcorder and calls it, you guessed it, data-mining.
Since every fact you come across is data, and any digging around you do to understand those facts better is mining, the term data-mining is often used for any activity that involves linking data to understand a pattern, which pretty much encompasses everything we do. Intelligence functions by seeking and learning patterns in the sea of data that surrounds it, and in that sense data-mining is a necessary condition for any intelligence.

But data-mining is much more rigorous than that. Just as education cannot be reduced to knowing the alphabet and being able to write a letter (that's literacy), data-mining cannot be reduced to spotting patterns in whatever data happens to be at hand.

Thearling defines data-mining as the extraction of hidden predictive information from large databases. What counts as a large database is still ambiguous, but we have a start.

Let us examine the implications of this definition in detail.

Large Databases
The very term database indicates that data-mining works on a repository of data, not on an ad hoc sprinkling of figures gathered from here and there and pasted into an Excel sheet. In fact, it is safe to say that all good data-mining is built on good data warehouses, which points to the two most important qualities of the data in data-mining: extensiveness and rigor.

Predictive

This does not imply forecasting alone. Whenever we build a model on past data, the goal is always an actionable prediction of behavior or outcome. Hence validation is as important a milestone as the modeling itself.
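
To make the point about validation concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the (score, responded) records, the four noisy records, and the one-cutoff "model". A real project would draw its records from the warehouse and would usually hold out a random sample; every third record is held out here only to keep the example reproducible.

```python
# Hypothetical past data: (score, responded) pairs, invented for illustration.
# Customers with score >= 50 responded, except for four noisy records.
NOISY = {45, 47, 52, 58}
rows = [(x, (x >= 50) != (x in NOISY)) for x in range(100)]

# Hold out every third record; the model never sees these while being built.
holdout = rows[::3]
train = [row for i, row in enumerate(rows) if i % 3 != 0]


def accuracy(data, cut):
    """Share of records where the rule 'score >= cut' matches the outcome."""
    hits = sum((score >= cut) == responded for score, responded in data)
    return hits / len(data)


def fit_threshold(data):
    """Model building: pick the cutoff with the best accuracy on `data`."""
    candidates = sorted(score for score, _ in data)
    return max(candidates, key=lambda cut: accuracy(data, cut))


cut = fit_threshold(train)             # modeling uses the training records only
train_acc = accuracy(train, cut)       # how well the rule fits the data it saw
holdout_acc = accuracy(holdout, cut)   # validation: data the rule never saw
print(f"cutoff={cut} train={train_acc:.3f} holdout={holdout_acc:.3f}")
```

The holdout figure is the one that says whether the prediction is actionable. A rule that looks good only on the records it was built from has not been validated.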

Hidden

Eric A. King (http://www.dmreview.com/portals/portalarticle.cfm?articleId=1038094&topicId=230255) describes data-mining as exploring previously unknown interrelationships and recurrences across seemingly unrelated attributes in order to predict actions, behaviors and outcomes. He thus differentiates it from OLAP reporting.
He further qualifies it as "...we are looking at prediction derived from information hidden within large volumes of data rather than retrospection drawn from an OLAP or SQL query."
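
The contrast between retrospection and prediction can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `sales` table and its four rows are hypothetical, and a straight-line fit stands in for whatever model a real project would use. It is far simpler than the hidden, cross-attribute relationships King has in mind, but it shows the difference in the kind of answer you get.

```python
import sqlite3

# Hypothetical table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)],
)

# Retrospection: a SQL query summarises what has already happened.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Prediction: fit a least-squares line to the past and ask about month 5,
# which is not in the table.
rows = conn.execute("SELECT month, amount FROM sales ORDER BY month").fetchall()
n = len(rows)
mean_x = sum(month for month, _ in rows) / n
mean_y = sum(amount for _, amount in rows) / n
slope = (
    sum((month - mean_x) * (amount - mean_y) for month, amount in rows)
    / sum((month - mean_x) ** 2 for month, _ in rows)
)
intercept = mean_y - slope * mean_x
forecast = intercept + slope * 5

print("total so far:", total)
print("predicted amount for month 5:", forecast)
conn.close()
```

The query can only report on rows that exist. The fitted line makes a claim about a row that does not exist yet, which is why it has to be validated.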