
Cross validation

Cross validation is a method in which part of the dataset is set aside as the training set, and that part is rotated while training and testing.

  • An algorithm built only on the training set works well on the training set, but that alone is not enough to judge it a good Hypothesis Function
  • It can show a low error rate on the training set but a high error rate on other data
  • To address this problem, part of the full training data is used as a Cross Validation Set (see the sketch after this list)
    • Training set (60%)
    • Cross Validation Set (20%)
    • Test set (20%)
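
A minimal sketch of the 60/20/20 split above. It assumes scikit-learn's train_test_split and uses placeholder random data; the dataset, model, and exact ratios are the reader's choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features (placeholder values).
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off the 60% training set, then divide the remaining 40%
# evenly into a cross-validation set and a test set (20% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_cv), len(X_test))  # 60 20 20
```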

K-fold cross-validation

One iteration of K-fold cross-validation is performed in the following way: First, a random permutation of the sample set is generated and partitioned into K subsets ("folds") of about equal size. Of the K subsets, a single subset is retained as the validation data for testing the model (this subset is called the "testset"), and the remaining K - 1 subsets together are used as training data (the "trainset"). Then a model is trained on the trainset and its accuracy is evaluated on the testset. Model training and evaluation are repeated K times, with each of the K subsets used exactly once as the testset.

The case of a 5-fold cross-validation with 30 samples is illustrated in the picture below:

[Figure: Xv_folds.gif]
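
A short sketch of the procedure described above, assuming scikit-learn's KFold, the iris dataset, and a LogisticRegression model (all illustrative choices, not part of the original text):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# shuffle=True gives the random permutation described above;
# each of the K=5 folds is used exactly once as the testset.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                   # train on the K-1 trainset folds
    scores.append(model.score(X[test_idx], y[test_idx]))    # evaluate on the held-out fold

print(np.mean(scores))  # average accuracy over the K folds
```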

Leave-one-out cross-validation

As the name suggests, leave-one-out cross-validation involves using a single sample from the original sample set as the validation data, and the remaining samples as the training data. This is repeated such that each sample in the sample set is used exactly once as the validation data. This is the same as K-fold cross-validation where K is equal to the number of samples in the sample set.

There is no need to generate random permutations for leave-one-out cross-validation, or to repeat it, because the training and validation sets of each fold are always the same; the resulting accuracy estimate is therefore deterministic.
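
The same sketch adapted to leave-one-out, again assuming scikit-learn's LeaveOneOut, the iris dataset, and LogisticRegression as illustrative choices. Note there is no shuffling and no random_state, since the folds are fixed:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)

# Each sample is held out exactly once as the validation data;
# this is K-fold cross-validation with K equal to the sample count.
loo = LeaveOneOut()
correct = 0
for train_idx, test_idx in loo.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(correct / len(X))  # leave-one-out accuracy estimate
```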
