Compression of vision transformer models with AIminify

Introduction

The vision transformer model family consists of deep learning architectures built on transformer layers. These layers were popularized in natural language processing but have proven just as useful for computer vision. Unlike traditional convolutional neural networks (CNNs), vision transformers treat an image as a sequence of patches, which allows them to capture long-range dependencies and global context more effectively.

Since their introduction, vision transformers have performed strongly in image classification and other vision tasks. They have also inspired extensions such as DeiT (Data-efficient Image Transformer), the Swin Transformer (which adds a hierarchical structure), and BEiT (which combines self-supervised learning with transformers).

In this blog, we discuss how to use AIminify to compress vision transformer models. We will demonstrate that the loss in performance is minimal while the reduction in FLOPs is significant.

This blog is the third in a series of blogs about the compression of neural networks. Part 1 (on U-nets) and part 2 (on YOLO models) can be found here and here. More information on pruning can be found here.

Why compress vision transformer models?

The main reason to compress vision transformer models is to reduce costs. The cost of self-attention grows quadratically with the number of image patches (the sequence length), which leads to high compute and memory use, especially for large images. Many methods try to lower these costs, such as xFormers in PyTorch or MemoryEfficientAttention. However, we will show that you can achieve even greater savings without losing much performance.
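
To make the quadratic scaling concrete, here is a small back-of-the-envelope sketch (illustrative only, not AIminify code) that computes the sequence length and the size of the attention matrix for a ViT with 16×16 patches at a few input resolutions:

```python
# Back-of-the-envelope illustration (not AIminify code): how the size of the
# self-attention matrix grows with input resolution for a ViT with 16x16 patches.
def attention_scale(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    num_patches = (image_size // patch_size) ** 2  # sequence length N (ignoring the class token)
    attn_entries = num_patches ** 2                # N x N attention matrix per head, per layer
    return num_patches, attn_entries

for size in (224, 384, 512):
    n, entries = attention_scale(size)
    print(f"{size}px -> {n:,} patches, {entries:,} attention entries per head per layer")
# 224px -> 196 patches, 38,416 attention entries per head per layer
# 384px -> 576 patches, 331,776 attention entries per head per layer
# 512px -> 1,024 patches, 1,048,576 attention entries per head per layer
```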

Vision transformer implementation

We chose to benchmark three popular vision transformer models:

  • ViT, specifically the vit_base_patch16_224 model.
  • DeiT, specifically the deit_base_patch16_224 model.
  • DINOv2, specifically the dinov2_vits14_lc model.

We focused on the PyTorch implementations of these models and used pretrained weights from `timm` or `facebookresearch`. All models were pretrained on the ImageNet dataset.
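
For reference, the three pretrained baselines can be loaded with standard `timm` and `torch.hub` calls, roughly as follows (assuming the packages are installed and the weights can be downloaded):

```python
# Loading the pretrained baselines used in this post; AIminify is applied afterwards.
import timm
import torch

vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
deit = timm.create_model("deit_base_patch16_224", pretrained=True).eval()

# DINOv2 small backbone with an ImageNet linear classifier head, via torch.hub.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_lc").eval()
```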

Baseline performance

| Model | Image size | Accuracy | Model size (MB) | Parameters | FLOPs |
|---|---|---|---|---|---|
| vit_base_patch16_224 | 224 | 79.12% | 82.56 | 86,567,656 | 35,127,656,448 |
| deit_base_patch16_224 | 224 | 80.47% | 82.56 | 86,567,656 | 35,127,656,448 |
| dinov2_vits14_lc | 644 | 79.65% | 22.87 | 23,977,576 | 91,096,848,384 |

Compressing the model

AIminify provides compression strength levels from 0 to 5. The strength mainly controls the pruning step of the algorithm: a strength of 0 means no pruning, while strengths 1 to 5 prune the model progressively more, from light to heavy.

By passing training and validation generators and setting `fine_tune` to True, AIminify fine-tunes the model after pruning, which helps keep accuracy close to the original. We also enabled mixed precision training, which speeds up fine-tuning without impacting final accuracy.
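
Put together, a compression run could look roughly like the sketch below. The entry point and argument names are our own illustrative assumptions, not the documented AIminify API, so check the official documentation for the exact signature:

```python
# Hypothetical sketch of an AIminify compression run -- the function name and
# keyword arguments below are illustrative assumptions, not the documented API.
from aiminify import minify  # assumed entry point

compressed_model, stats = minify(
    model,                           # the pretrained vision transformer
    compression_strength=3,          # 0 = no pruning, 1-5 = light to heavy pruning
    fine_tune=True,                  # fine-tune after pruning to recover accuracy
    train_generator=train_loader,    # training data generator
    validation_generator=val_loader, # validation data generator
    mixed_precision=True,            # speeds up fine-tuning
)
```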

AIminify integrates with all three models out of the box. We compare FLOPs, parameter count, and model size across different compression settings, and use accuracy as the key performance metric. AIminify also works seamlessly alongside attention-layer speed-up frameworks such as xFormers and MemoryEfficientAttention.
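
To sanity-check numbers like the ones in the tables below, parameter and FLOP counts can be obtained with standard tooling. The sketch below uses `fvcore` (one of several counters, and not necessarily the tool used for these benchmarks) and assumes the models loaded earlier:

```python
# One way to count parameters and compute operations. Note that fvcore reports
# multiply-accumulate operations (MACs), roughly half of a "2 x MAC" FLOP figure.
import torch
from fvcore.nn import FlopCountAnalysis

def report(model, image_size: int = 224) -> None:
    params = sum(p.numel() for p in model.parameters())
    dummy = torch.randn(1, 3, image_size, image_size)
    ops = FlopCountAnalysis(model, dummy).total()
    print(f"parameters: {params:,}  counted ops: {ops:,}")

report(vit)          # ViT baseline from the loading snippet above
report(dinov2, 644)  # DINOv2 was evaluated at 644x644 in this post
```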

vit_base_patch16_224

| Compression strength | Accuracy | Model size (MB) | Parameters | FLOPs |
|---|---|---|---|---|
| 1 | 74.35% (−6.03%) | 73.85 (−10.55%) | 77,434,802 (−10.55%) | 31,531,653,120 (−10.24%) |
| 3 | 76.95% (−2.75%) | 62.10 (−24.78%) | 65,115,747 (−24.78%) | 26,681,103,360 (−24.05%) |
| 5 | 72.70% (−8.12%) | 51.82 (−37.23%) | 54,336,766 (−37.23%) | 22,436,947,968 (−36.13%) |

deit_base_patch16_224

| Compression strength | Accuracy | Model size (MB) | Parameters | FLOPs |
|---|---|---|---|---|
| 1 | 77.14% (−4.14%) | 74.77 (−9.43%) | 78,403,112 (−9.43%) | 31,912,919,040 (−9.15%) |
| 3 | 76.23% (−5.26%) | 64.1 (−22.35%) | 67,218,363 (−22.35%) | 27,508,995,072 (−21.69%) |
| 5 | 74.83% (−7.00%) | 54.7 (−33.74%) | 57,355,434 (−33.74%) | 23,625,529,344 (−32.74%) |

dinov2_vits14_lc

| Compression strength | Accuracy | Model size (MB) | Parameters | FLOPs |
|---|---|---|---|---|
| 1 | 82.06% (+3.03%) | 21.79 (−4.73%) | 22,843,301 (−4.73%) | 86,300,573,184 (−5.27%) |
| 3 | 72.91% (−8.46%) | 18.96 (−17.07%) | 19,883,420 (−17.07%) | 73,784,733,696 (−19.00%) |
| 5 | 71.45% (−10.29%) | 16.91 (−26.07%) | 17,727,144 (−26.07%) | 64,666,933,248 (−29.01%) |

Takeaways

Here are a few key points from the results:

  • AIminify reduces the computational needs of vision transformer models. This comes with only slight accuracy impacts, making it great for scenarios where speed, memory, or energy is crucial.
  • AIminify’s compression yields real efficiency gains. Moderate pruning (strength 3) lowers parameters, file size, and FLOPs by about 20–25% with only a 2.7–8.5% relative accuracy drop. Aggressive pruning (strength 5) achieves 30–37% reductions but results in a 7–10% accuracy loss.
  • Compression strength 1 may even boost accuracy for self-supervised models like DINOv2. This suggests that light pruning can serve as a form of regularization.
  • Compression strength 3 offers the best balance. This setting provides around a 25% efficiency gain with less than a 10% accuracy loss. 
  • For latency-sensitive uses, like mobile or edge devices, aggressive pruning (strength 5) can cut inference costs by more than one-third, with up to a 10% accuracy drop.

Try AIminify with compression strength 3 on your ViT model today! Contact us to get a license!