Vision transformer

A vision transformer (ViT) is a transformer designed for computer vision. Where a language model splits text into tokens, a ViT splits an input image into a sequence of fixed-size patches, flattens each patch into a vector, and projects it to the model's embedding dimension with a single matrix multiplication. The resulting embeddings are then processed by a transformer encoder as if they were token embeddings.
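
The patch-and-project step described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed`, the toy 8x8 image, the patch size of 4, and the embedding width of 16 are all assumptions made for the example, and the weight matrix is random rather than learned. Everything after the projection (the encoder itself) is left out.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order: rows, then columns, then channels.
            vector = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(vector)
    return patches


def embed(patches, weight):
    """Project every flattened patch with one matrix product (patches @ weight)."""
    columns = list(zip(*weight))  # columns of the (patch_dim x embed_dim) matrix
    return [
        [sum(x * w for x, w in zip(vector, column)) for column in columns]
        for vector in patches
    ]


# Toy sizes chosen for illustration only (hypothetical, not from any real model).
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]  # 8x8 RGB
patches = patchify(image, patch_size=4)  # 4 patches, each of length 4*4*3 = 48
weight = [[rng.gauss(0.0, 0.02) for _ in range(16)] for _ in range(48)]  # 48 -> 16
embeddings = embed(patches, weight)  # 4 vectors of length 16

print(len(embeddings), len(embeddings[0]))
```

In practice the same operation would normally be written with an array library, where the projection is a single matrix product over all patches at once; the nested-list version here only makes the data layout explicit.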

ViT has found applications in image recognition, image segmentation, and autonomous driving.

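A vision transformer is an image-processing model built from transformers. It grew out of the idea, introduced in the 2020 paper "An image is worth 16x16 words", of processing an image the way a language model processes words, and depending on the task it achieves performance comparable to conventional convolutional neural networks.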

It applies the Transformer, widely used in natural language processing (NLP), to vision tasks.

Favorite site

Vision transformer - Wikipedia

Paper review / summary