Data mining

데이터마이닝을 한마디로 요약하면 대량의 데이터 집합으로부터 유용한 정보를 추출하는 것¹으로 정의된다.

About

좀더 상세히 정의하면, 다음과 같다.²

데이터마이닝이란 의미있는 패턴과 규칙을 발견하기 위해서 자동화되거나 반자동화된 도구를 이용하여 대량의 데이터를 탐색하고 분석하는 과정이다.

데이터마이닝의 또다른 정의로서 가트너그룹은 다음과 같이 정의하고 있다.³

데이터마이닝은 통계 및 수학적 기술뿐만 아니라 패턴인식 기술들을 이용하여 데이터 저장소에 저장된 대용량의 데이터를 조사함으로써 의미있는 새로운 상관관계, 패턴, 추세 등을 발견하는 과정이다.

Data mining process

Problem definition
Data acquisition and selection
Data exploration
Data preprocessing
Train and evaluate data mining model
Interpret results

Data set

What is the difference between test set and validation set?

Partition.png

Training set

a set of examples used for learning: to fit the parameters of the classifier In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule

Validation set

a set of examples used to tune the parameters of a classifier In the MLP case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm

Test set

a set of examples used only to assess the performance of a fully-trained classifier In the MLP case, we would use the test to estimate the error rate after we have chosen the final model (MLP size and actual weights) After assessing the final model on the test set, YOU MUST NOT tune the model any further!

Why separate test and validation sets?

The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model After assessing the final model on the test set, YOU MUST NOT tune the model any further!

Favorite site

References

Hand et al., 2001 ↩
Berryand Linoff, 1997, 2000 ↩
2004년 1월 가트너그룹 웹사이트 ↩