Dataset

A data set (or dataset) is a collection of data.

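In general, a data set corresponds to the contents of a single database table or a single statistical data matrix, where each column of the table represents a particular variable and each row corresponds to a given member of the data set. The data set lists the values of each variable, such as the height and weight of an object, for every member. Each value is called a datum. A data set contains data for one or more members, and the number of members equals the number of rows.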
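As a minimal sketch of the table view described above, the following Python snippet stores a small data set as a list of rows with one key per variable. The names `rows` and `column`, and the height and weight values, are invented for illustration only.

```python
# A tiny data set: each row is one member, each key is one variable (column).
# The values below are invented purely for illustration.
rows = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": 182.5, "weight_kg": 80.2},
    {"height_cm": 158.0, "weight_kg": 52.3},
]


def column(data, name):
    """Return every value (datum) of one variable, in row order."""
    return [row[name] for row in data]


num_members = len(rows)             # number of rows = number of members
variables = sorted(rows[0].keys())  # column names = variables
heights = column(rows, "height_cm")
```

Reading a column gives all values of one variable, while reading a row gives all variables of one member.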

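The term data set may also refer to the data in a collection of closely related tables that correspond to a particular experiment or event. One example is the data sets collected by space agencies that run experiments with instruments aboard space probes.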

in Machine Learning

Dataset-training_validation_testing.png

Validation vs Test

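Why a validation set is used in machine learning
The test set is used to evaluate the model's final performance. Unlike the validation set, it takes no part in the training process.
The validation set, by contrast, is used to evaluate performance when selecting the final model from among several candidates.
The validation set therefore does take part in the training process.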

In other words, the validation set takes part in training and is used to pick the single best model among several trained models. The test set is used to evaluate the model's final performance once all training is complete. If a test set is used to improve the model, it is no longer a test set but a validation set. If you do not need to compare several models and choose the best one, you can skip the validation set. That has a cost, though: you can no longer estimate test accuracy, and you cannot guard against overfitting through model tuning.
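The roles of the three sets can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive recipe: the helper names `split_dataset` and `accuracy`, the 60/20/20 ratios, the toy data, and the threshold "models" (which stand in for models already trained on the training set) are all hypothetical.

```python
import random


def split_dataset(samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle and split samples into (train, validation, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def accuracy(model, samples):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in samples) / len(samples)


# Toy data: the label is 1 when x >= 50, else 0.
data = [(x, int(x >= 50)) for x in range(100)]
train, val, test = split_dataset(data)

# Hypothetical candidates: threshold classifiers standing in for
# several models that were already fitted on `train`.
candidates = {t: (lambda x, t=t: int(x >= t)) for t in (30, 50, 70)}

# Model selection looks at the validation set only.
best_t = max(candidates, key=lambda t: accuracy(candidates[t], val))

# The test set is used once, after selection, to report final performance.
test_acc = accuracy(candidates[best_t], test)
```

Because `test` never influences which candidate is chosen, `test_acc` remains an unbiased estimate of how the selected model behaves on unseen data.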

Data pools

[Recommended] GitHub - handong1587 - Computer Vision Datasets ¹

This page also collects sample datasets for machine learning.

General

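AI-Hub - AI Hub (Ministry of Science and ICT, Korea)
Data.gov - US federal open data, including about 300,000 datasets totaling 16 TB
Datasets - Hugging Face's hub, described as the largest hub of ready-to-use datasets for ML models, with fast, easy-to-use, and efficient data manipulation tools.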

Image

ImageNet
CIFAR-10
TinyImages
PASCAL VOC
COCO
AVSS
Transient Attributes for High-Level Understanding and Editing of Outdoor Scenes (outdoor photos: night, day, spring, summer, autumn, winter, etc.)
LAION-400M - A dataset of 400 million image-text pairs
DeepFashion - Fashion-related

Image (OCR)

Video

YouTube-VOS
AVA: A Video Dataset of Atomic Visual Actions
YouTube-8M - Searches and segments YouTube videos by category.

Regression

Random Linear Regression - Kaggle

Tracking

License Plate Recognition

Korean car number plate 822 images data share
- Local Download: Korean_car_number_plate_822.zip

Pedestrian detection

Face recognition

Car

Waymo: Autonomous driving dataset
Udacity Self Driving Car Dataset

Korean

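KorQuAD: KorQuAD 2.0 is a Korean Machine Reading Comprehension dataset of 100,000+ question-answer pairs in total, including 20,000+ pairs from KorQuAD 1.0. Unlike KorQuAD 1.0, answers must be found in an entire Wikipedia article rather than in one or two paragraphs. Because some documents are very long, search time needs to be taken into account. The documents also contain tables and lists, so understanding document structure through HTML tags is required as well. The dataset is intended to make machine reading comprehension possible across documents of varied form and length.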

NLP

The Big Bad NLP Database - Quantum Stat

Fire

~~visor VISOR smoke dataset~~ - Videos with evidence of smoke.
Fire dataset from the Bilkent University - Videos with evidence of smoke and fire.
~~MESH database of news content~~ - Instances of catastrophe related videos from the Deutsche Welle broadcaster with several news related to fires in a variety of conditions.
FASTData - Collection of fire images and videos.
Fire and smoke dataset - Clips containing evidence of fire and smoke in several scenarios.
- http://signal.ee.bilkent.edu.tr/VisiFire/Demo/SampleClips.html

Fall detection (Activity Recognition)

CIRL fall detection dataset - Indoor videos for action recognition and fall detection, not available for commercial purposes.
Multiple camera fall dataset - 24 videos of falls and fall confounding situations recorded with 8 cameras in different angles.
MMU fall detection dataset - 20 indoor videos including 38 normal activities and 29 different falls
Fall detection Dataset
Fall detection Dataset - Le2i - Laboratoire Electronique, Informatique et Image
(Activity Recognition) Datasets for fall-down action recognition
UP-Fall Detection Dataset: A Multimodal Approach
UR Fall Detection Dataset
UMAFall: Fall Detection Dataset (Universidad de Malaga)
TST Fall detection dataset v1
TST Fall detection dataset v2

Human Pose

Injured civilians

NICTA dataset - Contains a total of 25551 unique pedestrians.
MIT pedestrian dataset - 924 images of 64x128 containing pedestrians.

Road accident

MIT pedestrian dataset - 924 images of 64x128 containing pedestrians.
INRIA pedestrian dataset - Images of pedestrians and negative images.
CAVIAR dataset - Sequences containing pedestrians, also suitable for action recognition.
Trajectory based anomalous event detection - Synthetic and real-world data.
The German Traffic Sign detection benchmark - 900 images of roads with traffic sign ground truth.
Thermal pedestrian database - 10 sequences containing pedestrians recorded with a thermal camera.
TUD Brussels and TUD Paris, Multi-Cue Onboard Pedestrian Detection - Images of pedestrians taken by on-board cameras.

Simulation of Crowd Problems for Computer Vision - Approach for generating video evidence of dangerous situations in crowded scenes.
PETS dataset - Sequences containing different crowd activities.
Unusual crowd activity dataset - Clip with unusual crowd activities.
UCSD anomaly detection dataset - Clips of the street with stationary camera.
UMN monitoring human activity dataset - Videos of crowded scenes.

Natural language

The General Index - A freely released n-gram index of more than 100 million journal articles

https://archive.org/details/GeneralIndex

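Researcher Carl Malamud released an n-gram index extracted with spaCy from 107,233,728 journal articles, including paywalled papers.

It is published free of charge on the Internet Archive so that it can be used across many research fields.

Example: how many times a particular chemical substance is mentioned in papers.

It consists of three tables:

350 billion n-grams with article IDs
19.7 billion keywords with article IDs
Article IDs with metadata: paper title, authors, DOI (the paper's unique identifier)

The catalog is a 5 TB compressed archive that expands to 38 TB when extracted.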
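The three-table layout lends itself to simple lookups such as the chemical-substance example. The sketch below uses an in-memory SQLite database to show the idea. The table names, column names, and sample rows (including the placeholder DOIs) are assumptions made for illustration and are not the actual schema or contents of the General Index.

```python
import sqlite3

# Hypothetical schema mirroring the three tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ngrams   (ngram TEXT, doc_id INTEGER);
    CREATE TABLE keywords (keyword TEXT, doc_id INTEGER);
    CREATE TABLE metadata (doc_id INTEGER PRIMARY KEY,
                           title TEXT, author TEXT, doi TEXT);
""")

# Invented sample rows; the DOIs are placeholders, not real identifiers.
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", [
    (1, "Paper A", "Author A", "10.0000/example.1"),
    (2, "Paper B", "Author B", "10.0000/example.2"),
])
conn.executemany("INSERT INTO ngrams VALUES (?, ?)", [
    ("sodium chloride", 1),
    ("sodium chloride", 2),
    ("acetic acid", 2),
])


def count_documents(db, ngram):
    """Count how many distinct articles contain the given n-gram."""
    row = db.execute(
        "SELECT COUNT(DISTINCT doc_id) FROM ngrams WHERE ngram = ?",
        (ngram,),
    ).fetchone()
    return row[0]


def titles_mentioning(db, ngram):
    """Join n-grams with metadata to list the titles of matching articles."""
    rows = db.execute(
        "SELECT DISTINCT m.title FROM ngrams AS n "
        "JOIN metadata AS m ON m.doc_id = n.doc_id "
        "WHERE n.ngram = ? ORDER BY m.title",
        (ngram,),
    ).fetchall()
    return [title for (title,) in rows]
```

Calling `count_documents(conn, "sodium chloride")` answers the "how many papers mention this substance" question, and `titles_mentioning` resolves the matching IDs to titles through the metadata table.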

ETC

Kaggle Datasets
Index of /public/AI/pile_preliminary_components
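- Thread by @theshawwn on Thread Reader App
- Material similar to the data used for OpenAI's GPT-3
- books3.tar.gz: 37 GB, about 197,000 books extracted as plain text
- github.tar.gz: 106 GB, a collection of many GitHub repositories
- stackexchange_dataset.tar: 34 GB, Stack Exchange questions and answers
Korea Tourism Data Lab - A tourism-focused big data platform that provides tourism big data and integrated analysis services covering mobile telecom, credit card, navigation, tourism statistics, survey research, and more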

Favorite site

References

Handong1587-2015-09-24-Computer_Vision_Datasets.md.zip