
ViT: Transformer in Vision Domain

Nikhil Verma
6 min read · Mar 1, 2022

Can you imagine performing image recognition without using convolution? If not, gear up and familiarise yourself with one of the latest advances in computer vision: the transformer architecture.

Image Recognition

Image recognition is an identification task: recognise the main entity present in an image. The usual approach is to pass the image through a neural network that outputs a probability distribution over the candidate classes, and the class with the highest probability is taken as the prediction. For an image of a dog, for example, the class "dog" receives a higher probability than the other classes.
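To make the last step concrete, here is a minimal sketch of how raw network outputs are usually turned into class probabilities with a softmax. The class names and logit values are made up for illustration; they do not come from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical classes and network outputs (assumed values, for illustration only)
class_names = ["dog", "cat", "car"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]  # most probable class
print(dict(zip(class_names, probs)), "->", predicted)
```

The network itself only has to produce one score per class; the softmax turns those scores into a distribution, and the prediction is simply the class with the largest probability.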

Until recently, convolutional neural networks (CNNs) were the best solution to the task of image classification. ResNet, for example, has been a prevalent model for computer vision (CV) tasks. In each ResNet block, the output of the weight layers is summed with the block's input through a skip connection that bypasses those layers. Recently, however, Vision Transformers (ViT) have achieved highly competitive performance in benchmarks for several computer vision applications, such as image classification, object detection, and semantic image segmentation.
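The skip connection can be sketched in a few lines. This is a simplified illustration, not ResNet itself: it assumes small dense layers in place of convolutions, omits batch normalisation, and the helper names (`linear`, `relu`, `residual_block`) are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, v):
    """Matrix-vector product; each row of W produces one output value."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F is two weight layers."""
    h = relu(linear(W1, x))              # first weight layer
    f = linear(W2, h)                    # second weight layer, F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # skip connection: add the input back

# With all-zero weights F(x) is zero, so the block just passes relu(x) through.
x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
out = residual_block(x, zeros, zeros)
print(out)
```

Because the input is added back, the weight layers only need to learn a correction to the identity mapping, which is what the summation with the skip connection amounts to.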

Transformer refresher


Written by Nikhil Verma

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/ | My blogs are living document, updated as I receive comments
