All projects Computer Vision

Vision Transformer (ViT)

Images as sequences of patches — attention instead of convolution.

Study & implementation 2023 TransformersSelf-attentionPatch embedding

Motivation

The Vision Transformer shows that the inductive biases of convolution are not strictly necessary: split an image into patches, embed them as tokens, and let self-attention model global relationships directly.

Method

Patches are linearly embedded and augmented with positional encodings and a class token, then processed by a standard transformer encoder. Attention lets every patch attend to every other from the first layer, giving a global receptive field.

\[ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

Mathematical core

Self-attention is a data-dependent, permutation-equivariant mixing operator — a learned, dense analogue of message passing on a complete graph. That connection ties ViTs to my work on graph neural networks.

Dosovitskiy et al. (2020)

← PreviousPersistent Homology Next →Summarization with T5