ViT: Transformer in Vision Domain
Can you imagine doing image recognition without using convolution? If not, gear up and familiarise yourself with the latest advancement in computer vision: the transformer architecture.
Image Recognition
Image recognition is the task of identifying the main entity present in an image. The usual approach is to pass the image through a neural network that predicts the class of the image using a probability distribution over the candidate classes, where the most probable class receives the highest probability. In the example image, dog has a higher probability than the other classes.
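To make the last step concrete, here is a minimal sketch of how raw network scores can be turned into class probabilities. It assumes a softmax over the scores, which is a common choice but is not stated above; the class names and score values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class names and raw network outputs (logits).
class_names = ["cat", "dog", "car"]
logits = [1.0, 3.0, 0.5]

probs = softmax(logits)

# The predicted class is the one with the highest probability.
predicted = class_names[probs.index(max(probs))]
print(predicted)  # dog
```

The network itself only produces the scores; the softmax step is what lets us read them as "how likely is each class".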
Until recently, convolutional neural networks (CNNs) were the best solution for image classification. ResNet, for example, was a prevalent model for computer vision (CV) tasks. In a ResNet block, the output of the weight layers is summed with a skip connection that carries the block's input around those layers unchanged. Recently, however, Vision Transformers (ViT) have achieved highly competitive performance in benchmarks for several computer vision applications, such as image classification, object detection, and semantic image segmentation.
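The skip connection in ResNet can be sketched in a few lines. This is a simplified illustration rather than the actual ResNet implementation: it uses small fully connected layers in place of convolutions, and the helper names (`linear`, `relu`, `residual_block`) and the weights are hypothetical.

```python
def linear(x, weights, bias):
    """Apply one weight layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in x]

def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two weight layers."""
    h = relu(linear(x, w1, b1))   # first weight layer plus activation
    f = linear(h, w2, b2)         # second weight layer
    # Skip connection: add the untouched input back onto the layer output.
    return relu([fi + xi for fi, xi in zip(f, x)])

# Hypothetical example: with all-zero weights the block passes its input
# through the skip connection (followed by the final ReLU).
zeros_w = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3
out = residual_block([1.0, -2.0, 3.0], zeros_w, zeros_b, zeros_w, zeros_b)
print(out)  # [1.0, 0.0, 3.0]
```

Because the input is added back, the weight layers only need to learn a correction on top of it, which is what makes very deep networks of this kind easier to train.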