Data mining is the iterative process of discovering and identifying relations in a data set, or in a flow of data under examination, through manual or automatic methods. The analysis is divided into two types of activity: predictive analysis and descriptive analysis.
It would be wrong to think that one approach is better than the other: the results of the two activities are complementary and serve the same goal. While descriptive analysis tries to find patterns and other new information, predictive analysis produces an executable model, in the form of code, useful for predicting, estimating and identifying a process. In a nutshell, data mining is the activity carried out on big data to make it intelligible and to extract from it the predictive information requested by its users. The main data mining techniques are the following (each is illustrated by a short sketch after the list):
- Classification: the activity of discovering the function that assigns each datum the label of a class. Algorithms of this kind include Bayesian classification, statistical classification and the so-called random forests. One type of classifier is the decision tree, which identifies, in order of importance, the causes that lead to the occurrence of an event;
- Regression: finding the function that relates a dependent variable to one or more independent variables. The dependent variable is expressed as a function of the independent ones, linear or polynomial of higher degree, plus an error term. The most widely used method is that of least squares;
- Clustering: the activity of identifying a series of categories, or more precisely clusters, into which the data set is partitioned;
- Association: the discovery of seemingly random but recurrent connections that can be extracted from the data contained in a database, aimed, for example, at detecting anomalies.
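As an illustration of classification, the following is a minimal sketch of a decision-tree classifier, assuming the scikit-learn library is available; the Iris data set stands in for any labelled data set, and the chosen tree depth is an arbitrary example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a tree that learns the function assigning each datum to a class.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# Feature importances rank the attributes driving the splits, i.e. the
# causes that contribute most to the classification.
print("feature importances:", tree.feature_importances_)
```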
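For regression, a minimal least-squares sketch using only NumPy; the synthetic slope, intercept and noise level below are illustrative assumptions, not values taken from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
# Dependent variable: linear in x plus an error term (illustrative values).
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=100)

# Design matrix [x, 1] so slope and intercept are estimated together.
A = np.column_stack([x, np.ones_like(x)])
coeffs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coeffs
print(f"estimated slope={slope:.2f}, intercept={intercept:.2f}")
```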
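For clustering, a minimal sketch with k-means, again assuming scikit-learn; the synthetic "blobs" stand in for an unlabelled data set and the number of clusters is fixed by assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabelled data with three natural groupings.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Partition the data set into three clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print("cluster centres:\n", km.cluster_centers_)
```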
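For association, a minimal pure-Python sketch that counts which item pairs recur across transactions; the transactions and the support threshold are made-up examples, and a full association-rule miner (such as Apriori) would go further than this simple counting.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping transactions.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk", "butter"},
]

# Count every pair of items bought together.
pair_counts = Counter()
for basket in transactions:
    pair_counts.update(combinations(sorted(basket), 2))

# Keep the pairs whose support (relative frequency) exceeds a chosen threshold.
min_support = 0.4
for pair, count in pair_counts.most_common():
    support = count / len(transactions)
    if support >= min_support:
        print(pair, f"support={support:.2f}")
```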
INFORMATION EXTRACTION
Data mining can be seen as the union of two disciplines, statistics and machine learning, and can be defined as the process of discovering models and descriptions from a data set. Such a process cannot be the haphazard application of machine-learning methods and statistical tools; it must be well planned and structured, so as to be useful and fully descriptive of the system being examined. This information extraction plan usually follows a five-step experimental procedure:
- Problem definition and hypothesis formulation. Identifying the model is more efficient when the application context is well defined, so solid domain knowledge and experience are needed to frame the problem properly;
- Data collection. This phase concerns how the data are generated and how they can be collected. In general, two approaches are possible. The first is the design of experiment, in which the expert has control over data generation: the system is deliberately perturbed so that the resulting effect on a response variable can be studied in isolation. The second approach, called observational study, does not allow the system to be influenced: it concerns exclusively the analysis of the data, without knowledge of the generating cause;
- Data pre-processing. With the observational approach, data usually come from databases and other storage systems, so pre-processing includes at least two activities: the detection and removal of outliers, i.e. data that are anomalous given the context and unrelated to the other observations, and the scaling of variable ranges so that all the variables carry the same weight (see the first sketch after this list);
- Model estimation. In this phase the methodology that yields the best model for the case in question is chosen (see the second sketch after this list);
- Model interpretation and performance study. A model's interpretability does not follow from its accuracy: the simplest models are the most interpretable, but they tend to be the least accurate. Yet a model built with data mining techniques must often be interpreted, both to support human comprehension and to turn its results into strategies, so the results need to be refined and presented in a comprehensible form.
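As a concrete illustration of the pre-processing step, the following is a minimal sketch assuming a numeric data matrix: outliers are removed with a simple z-score rule and the remaining variables are standardised so that they carry the same weight. The 3-standard-deviation threshold is an assumption, not a fixed rule.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
data[0] = [15.0, -12.0, 20.0]  # an obviously anomalous observation

# Remove rows in which any variable deviates by more than 3 standard
# deviations from its column mean (an assumed, not universal, threshold).
z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
clean = data[(z < 3).all(axis=1)]

# Standardise the remaining data: zero mean, unit variance per column,
# so that all variables carry the same weight.
scaled = (clean - clean.mean(axis=0)) / clean.std(axis=0)
print("rows kept:", clean.shape[0], "of", data.shape[0])
```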
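As an illustration of model estimation and performance study, here is a minimal sketch that compares two candidate models by cross-validation and reports their mean accuracy, assuming scikit-learn; the candidate models and the 5-fold scheme are illustrative choices, not the only possible ones.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two candidate methodologies, compared on the same data.
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")
```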
These phases are not independent: the data mining process necessarily involves an iterative approach. By observing the results obtained in a given phase, the data set can be processed again so as to better address the problem at hand.
WHY IS DATA MINING IMPORTANT?
The fields of application of data mining are countless, but they can be grouped into a few macro-categories. The main fields, and the advantages that data mining can bring to each of them, are listed below.
Marketing
- Identification of buyer segments that share purchasing habits and socio-demographic characteristics;
- Prediction of which clients are about to leave, so that appropriate retention strategies can be adopted;
- Identification of those products or services that are usually bought at the same time;
- Possibility of steering all of a company's strategies in the same direction.
Economy and finance
- Identification of anomalies in the use of credit cards and tracking of fraudulent behaviours;
- Prediction of stock index trends;
- Prediction of the influence exerted by the general trend of the markets and by the specific sector of reference.
Science
- In the clinical and pharmacological field, data mining is a valuable support for the decision-making process;
- Improved accuracy of weather forecasts, obtained through the cross-analysis of huge amounts of data;
- Classification and identification of stars, galaxies, planets, satellites and other celestial bodies.
Information and Communication Technology (ICT)
- On the security front, data mining optimises and speeds up intrusion-detection procedures.
Statistics
- Faster demographic analyses and, above all, the possibility of extracting information that is out of reach of standard statistical methodologies, yielding valid predictive models.
Industry
- Increased productivity, thanks to analyses that identify errors or inefficiencies in production chains.