The Vision Transformer (ViT) is rapidly becoming a go-to alternative to Convolutional Neural Networks (CNNs) for computer vision tasks thanks to its strong accuracy and computational efficiency. ViT models have matched or exceeded state-of-the-art CNNs on many combinations of datasets and tasks while requiring roughly four times less compute to train, establishing themselves as very powerful contenders.

Likewise, transformer-based models have become the norm in Natural Language Processing (NLP), with ChatGPT being a prominent example. Self-attention mechanisms model the dependencies between words in a text, enabling sophisticated language models such as GPT.

What is a Vision Transformer (ViT)?

The Vision Transformer (ViT) model was introduced in the research paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” presented at ICLR 2021. The pre-trained ViT models and fine-tuning code are available on Google Research’s GitHub. These models are pre-trained on the ImageNet and ImageNet-21k datasets.

ViT technology has a range of potential applications in computer vision, from common tasks such as image classification, object detection, and image segmentation to more demanding ones such as action recognition. ViTs also extend well beyond everyday use cases; for example, they power generative models and multi-modal tasks such as visual grounding, visual question answering, and visual reasoning.

ViTs are a powerful class of deep learning models that process images as sequences, making it possible for the model to capture the global structure of an image. A ViT first divides the input image into fixed-size patches and flattens each patch into a single vector by concatenating the channels of all its pixels. Each flattened patch is then linearly projected to the model’s embedding dimension, producing the sequence of patch embeddings that the transformer operates on.

The following is a breakdown of the vision transformer architecture:

  1. Divide the image into patches
  2. Flatten each patch
  3. Use linear embeddings to create lower-dimensional representations of the flattened patches
  4. Introduce positional embeddings to the representations
  5. Feed the resulting sequence into a regular transformer encoder
  6. Pretrain the model with image labels, using fully supervised learning on a large dataset
  7. Fine-tune the model on a downstream dataset for image classification tasks.
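
A minimal sketch of steps 1-4, written in PyTorch (the patch size, embedding dimension, and class and variable names are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Rough sketch of ViT patch embedding (steps 1-4 above).

    Assumes a square image whose side is divisible by `patch_size`;
    all dimensions are illustrative, not the official configuration.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # Step 3: linear projection of the flattened patches
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        # Learnable [class] token and positional embeddings (step 4)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Step 1: divide the image into non-overlapping p x p patches
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        # Step 2: flatten each patch into a single vector
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                       # Step 3: linear embedding
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the [class] token
        return x + self.pos_embed              # Step 4: add positional embeddings

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]) -- 196 patches + 1 class token
```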
Figure: A Vision Transformer (ViT) model applied to image classification.

The image patches are treated as sequence tokens, similar to words in natural language processing. The ViT encoder is based on the original transformer architecture and it consists of multiple blocks.

Each ViT encoder block includes three key processing components:

  1. Layer normalization
  2. Multi-head self-attention mechanism
  3. Multi-layer perceptrons (MLPs)

Layer normalization standardizes the inputs to each layer, which reduces internal covariate shift and helps the model learn features effectively and train stably.

The Multi-head Self-Attention (MSA) network generates attention maps that capture the relationships between the different embedded visual tokens. The attention maps help the model focus on the most important regions of the image, such as objects or regions with distinct visual features. The attention mechanism allows the model to leverage contextual information to improve feature representation.

The MLP inside each encoder block is a two-layer feed-forward network with a GELU (Gaussian Error Linear Unit) non-linearity between the layers. A final MLP block, also known as the MLP head, sits on top of the encoder output and acts as the classifier: applying a softmax function to its output yields class probabilities, which is useful for tasks such as image classification.
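
Putting the three components together, a single encoder block can be sketched as follows. This is a minimal PyTorch sketch using the pre-norm arrangement from the ViT paper; the layer sizes and the 1000-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """Sketch of one ViT encoder block: LayerNorm -> MSA -> LayerNorm -> MLP,
    each sub-layer wrapped in a residual connection (pre-norm arrangement)."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # Two-layer MLP with a GELU non-linearity between the layers
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # multi-head self-attention
        x = x + self.mlp(self.norm2(x))                    # feed-forward MLP
        return x

# After the final block, an MLP head on the [class] token produces class scores;
# a softmax over those scores gives classification probabilities.
block = ViTEncoderBlock()
out = block(torch.randn(1, 197, 768))
head = nn.Linear(768, 1000)                  # illustrative 1000-way classifier head
probs = head(out[:, 0]).softmax(dim=-1)      # probabilities from the [class] token
```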

Vision Transformer (ViT) in Image Recognition

The Transformer architecture is ubiquitous in Natural Language Processing, but its use in Computer Vision has been more limited. In many CV applications, attention mechanisms are combined with convolutional networks, or used to replace certain components of a CNN while keeping its overall structure intact. Well-known CNN-based image recognition models include ResNet, VGG, YOLOv3, and YOLOv7.

Performance of Vision Transformers in Computer Vision

Recent benchmarks have demonstrated impressive performance from Vision Transformers (ViT) in image classification, object detection, and semantic image segmentation tasks.

CSWin Transformer is a powerful and efficient framework for general computer vision tasks built around a new method called Cross-Shaped Window self-attention, which computes attention in horizontal and vertical stripes in parallel. This lets the model analyze multiple areas of an image simultaneously, resulting in faster processing.

The CSWin Transformer outperformed prior state-of-the-art models such as the Swin Transformer on several benchmark tasks, achieving 85.4% top-1 accuracy on ImageNet-1K, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task.

A Short Overview of Transformers

For a long time, text-related tasks relied on attention mechanisms combined with RNNs. Transformers were introduced in 2017 by the landmark paper “Attention Is All You Need.”

With transformers, different parts of the input sequence are given different weights based on self-attention. As a result, they have become increasingly popular for NLP problems, replacing RNN models such as the long short-term memory (LSTM) network.
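
The weighting itself comes from scaled dot-product attention. The small NumPy sketch below, with toy dimensions chosen purely for illustration, shows how each token's output becomes a weighted mix of every token's value vector:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the value vectors V,
    with weights derived from how well the query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

# Toy example: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(attn.round(2))                               # each row of weights sums to 1
```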

Transformers vs. RNNs

Transformers and Recurrent Neural Networks have distinct characteristics. RNNs process information sequentially, whereas Transformers are non-sequential.

RNNs take in one word at a time and carry a hidden state forward through the sequence, whereas Transformers process all the words of a sequence in parallel. This parallelism allows for faster training and, in many cases, better accuracy.

Transformers use the attention mechanism to access any part of the input directly. In contrast, RNNs can only carry a limited amount of history in their hidden states, and that information often fades in subsequent states. This makes Transformers better equipped to understand context and extract relevant information from the available data.

Transformers view the position of each word in a sentence as important and use positional embeddings to store this data. This is something that sets them apart from other models.
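
The original Transformer used fixed sinusoidal encodings for this purpose (ViT, by contrast, learns its positional embeddings). A minimal NumPy sketch of the sinusoidal scheme, with illustrative sizes:

```python
import numpy as np

def sinusoidal_positional_embeddings(num_positions, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    even dimensions use sine and odd dimensions use cosine at geometrically
    increasing wavelengths, so every position gets a distinct fingerprint."""
    positions = np.arange(num_positions)[:, None]                     # (P, 1)
    dims = np.arange(d_model)[None, :]                                # (1, D)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = sinusoidal_positional_embeddings(num_positions=50, d_model=16)
print(pe.shape)  # (50, 16): one 16-dimensional position code per token slot
```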

Over the last five years, Transformers have become the go-to architecture in NLP, and they have proven to be more effective than RNNs across a wide range of tasks. To stay up-to-date in the field, it’s essential to understand this architecture and its evolution.

ViT vs. Convolutional Neural Networks

Computer Vision has relied on Convolutional Neural Networks (CNNs) for many years. Their ability to distill input images into feature maps that capture the most salient patterns has proven invaluable in this field.

The procedure applies convolutional filters to highlight the most relevant regions of the image and then passes the resulting feature maps to a multi-layer perceptron, which produces the final classification.

The ViT model adopts a similar approach to transformers used for text by representing an input image as a sequence of image patches and directly predicting class labels for the image.

The introduction of the Vision Transformer (ViT) architecture marked a significant advance for Computer Vision. Instead of reducing complexity by building feature maps with convolutional filters ahead of a multi-layer-perceptron classifier, ViT relies on the self-attention mechanism to analyze and classify images.
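
For contrast, the classic CNN pipeline described above (convolutional filters producing feature maps, followed by an MLP classifier) can be sketched as follows; this is a minimal PyTorch example with arbitrary layer sizes, not a specific published architecture:

```python
import torch
import torch.nn as nn

# Minimal CNN pipeline: convolutional filters -> feature maps -> MLP classifier
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # filters produce 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample, keeping strong responses
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # collapse each feature map to one value
    nn.Flatten(),
    nn.Linear(32, 10),                           # multi-layer-perceptron classifier head
)
logits = cnn(torch.randn(1, 3, 224, 224))        # (1, 10) class scores
```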

The paper “Do Vision Transformers See Like Convolutional Neural Networks?” by Raghu et al., published in 2021 by Google Research and Google Brain, provides insight into the differences between CNNs and Vision Transformers, including the following:

  1. ViT has more similarity between shallow and deep layer representations compared to CNNs.
  2. ViT captures global information even in its shallow layers, yet the local representations learned in those shallow layers are also important, unlike in CNNs.
  3. Skip connections in ViT have a more significant impact on the performance and similarity of representations than CNNs (ResNet).
  4. ViT retains more spatial information than ResNet.
  5. ViT can learn high-quality intermediate representations with large amounts of data.
  6. MLP-Mixer’s representation is closer to ViT than to ResNet.

Prominent Applications of ViT

1) Image Classification

Image classification is a fundamental computer vision problem, and CNNs have been the most successful architecture for this task. Although ViTs have shown remarkable performance on large datasets, they do not achieve comparable results on small to medium-sized datasets. One reason for this is that CNNs are better at encoding local information in images because of their locally restricted receptive fields.
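
In practice, the data requirement is usually sidestepped by starting from a ViT that was pre-trained on a large dataset. The sketch below assumes the Hugging Face transformers library and the public google/vit-base-patch16-224 checkpoint (pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k), not the authors' original codebase:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Assumes the Hugging Face `transformers` library and this public checkpoint
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")       # any local image file
inputs = processor(images=image, return_tensors="pt")  # resize + normalize for the model
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, 1000) ImageNet class scores
print(model.config.id2label[logits.argmax(-1).item()])
```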

2) Image Captioning

ViTs have enabled a more sophisticated form of image understanding, in which the model generates a caption describing the content of an image instead of assigning a single-word label. This is made possible by the ability of ViTs to learn a general representation of a data modality rather than just a fixed set of labels. A ViT-based encoder-decoder trained on the COCO captioning dataset, for example, can produce descriptive text for arbitrary images.
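
A hedged sketch of this workflow, assuming the Hugging Face transformers library and the publicly available nlpconnect/vit-gpt2-image-captioning checkpoint (a ViT encoder paired with a GPT-2 decoder, used here only for illustration):

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Assumed checkpoint: a ViT encoder + GPT-2 decoder trained for captioning
ckpt = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16)  # autoregressive decoding
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```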

3) Image Segmentation


Segmenting images can be a difficult challenge, but the Dense Prediction Transformer (DPT), released by Intel in March 2021, makes the process easier. Using vision transformers as its backbone, DPT reaches up to 49.02% mIoU on ADE20K semantic segmentation, a significant improvement over previous models and algorithms, and up to 28% higher relative performance on monocular depth estimation compared with fully convolutional networks.
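
A minimal sketch of running DPT for semantic segmentation, assuming the Hugging Face transformers port of DPT and the Intel/dpt-large-ade checkpoint (both of these are assumptions; Intel's original repository can be used instead):

```python
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForSemanticSegmentation

# Assumed checkpoint: DPT fine-tuned on ADE20K for semantic segmentation
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade")

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, num_classes, H', W')
seg = logits.argmax(dim=1)[0]                  # per-pixel ADE20K class indices
print(seg.shape, seg.unique()[:10])            # coarse segmentation map and its labels
```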

4) Anomaly Detection

A transformer-based network for detecting and localizing anomalies in images combines patch embeddings with a reconstruction-based approach. The transformer preserves the spatial information of the embedded patches, and this information is then processed by a Gaussian mixture density network to localize the anomalous regions.

5) Action Recognition

Google Research has published an intriguing paper on action recognition using pure-transformer-based models, which is notable given the recent success of these models in image classification. The proposed model extracts spatiotemporal tokens from the input video and then encodes them with a series of transformer layers. To handle the long sequences this produces, efficient variants of the model factorize the spatial and temporal dimensions into separate components.

Transformer-based video models need substantial amounts of training data to be effective; however, this challenge can be addressed with suitable regularization during training and by leveraging pre-trained image models, which makes it possible to obtain strong results on comparatively small datasets.

Key Takeaways

When considering ViT, it’s wise to keep a few key points in mind. ViTs have recently shown excellent performance on computer vision tasks, challenging the long-standing dominance of CNNs in this field.

Compared to CNN models, Transformers require more data for maximum accuracy. However, the self-attention mechanism of Transformer models gives developers and testers a clearer look into model development. Attention maps give a graphical representation of the strengths and weaknesses of the model, providing insight that is difficult to achieve with CNNs. 

Furthermore, Transformers can be both efficient and accurate, often requiring significantly less pre-training compute than CNNs of comparable accuracy.

If deployment needs to be quick and simple, keep in mind that CNN-based approaches are generally easier to implement than Transformer frameworks.

Vision Transformers have had a major impact on the development of vision models, with Google’s ViT-MoE model, at 15 billion parameters, being the largest of its kind and achieving impressive results on ImageNet-1K classification.