How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
Abstract
Compared to CNNs, Vision Transformers rely more heavily on model regularization or data augmentation ("AugReg") because of their weaker inductive biases. To better understand the interplay between the amount of training data, AugReg, model size, and compute budget, the authors conducted a systematic empirical study. Combining more compute with AugReg can yield models that perform as well as models trained on more data.
1. Introduction
To isolate the effect of each variable, all models were trained and evaluated in a single consistent setup. The results show that carefully selected regularization and augmentation can have roughly the same effect as a 10x increase in dataset size. However, the amount of compute required was similar in both cases. Comparing transfer learning with training from scratch, there appeared to be a trade-off between compute and performance.
2. Scope of the study
Both computational efficiency and sample efficiency should be considered. Pre-training cost dominates fine-tuning cost, but since a pre-trained model can be downloaded in the majority of cases, this paper focuses on the latter case. In this paper, pre-training and setup costs are referred to as the practitioner's cost, and inference cost as the deployment cost.
3. Experimental setup
3.1. Datasets and metrics
Two large-scale image datasets, ImageNet-1k and ImageNet-21k, are used for pre-training after deduplication. For transfer learning, CIFAR-100, Oxford-IIIT Pets, Resisc45, and Kitti-distance are used.
3.2. Models
Four ViT configurations were used, all with a patch size of 16, along with additional variants with a patch size of 32. The hidden layer in the head was dropped, since it did not improve accuracy but did cause optimization instabilities. Hybrid models, which first process images with a ResNet backbone and then feed the spatial output to a ViT, were also used. In these models, a ResNet stem block (7x7 convolution, batch normalization, ReLU, max pooling) was followed by a variable number of bottleneck blocks.
3.3. Regularization and data augmentations
For regularization, dropout was applied to intermediate activations, and stochastic depth was used with a layer-drop probability that increases linearly with depth.
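The linearly increasing drop probability can be illustrated with a minimal sketch. The function names, the choice to start at 0 for the first layer, and the rescaling of the kept branch are assumptions made for illustration, not details taken from the paper.

```python
import random


def layer_drop_probs(num_layers, max_drop):
    """Per-layer drop probability, rising linearly from 0 to max_drop.

    Assumption: the first layer is never dropped and the last layer
    uses max_drop; other indexing conventions exist.
    """
    if num_layers == 1:
        return [max_drop]
    return [max_drop * i / (num_layers - 1) for i in range(num_layers)]


def residual_with_stochastic_depth(x, branch, drop_prob, training, rng=random):
    """Compute x + branch(x), skipping the branch with probability drop_prob."""
    if not training or drop_prob == 0.0:
        return x + branch(x)
    if rng.random() < drop_prob:
        return x  # the residual branch is dropped for this forward pass
    # Rescale the kept branch so its expected contribution is unchanged.
    return x + branch(x) / (1.0 - drop_prob)


# Hypothetical 12-layer encoder with a maximum drop probability of 0.1.
probs = layer_drop_probs(12, 0.1)
```

At evaluation time (`training=False`) every layer is kept, so the sketch reduces to an ordinary residual connection.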
For data augmentation, Mixup and RandAugment were combined. Since stronger AugReg calls for a lower weight decay, two weight decay values were tested.
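As a rough illustration of Mixup only (RandAugment is not shown), the sketch below blends two examples and their one-hot labels with a coefficient drawn from a Beta distribution. The function name, the flat-list representation of images, and the `alpha` value are illustrative assumptions rather than settings from the paper.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha, rng=random):
    """Blend two (input, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Toy inputs: two "images" flattened to three pixels, two classes.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup_pair(
    [0.0, 0.5, 1.0], [1.0, 0.0],
    [1.0, 0.5, 0.0], [0.0, 1.0],
    alpha=0.2,  # hypothetical strength, chosen only for this example
    rng=rng,
)
```

The mixed label remains a valid probability distribution, because it is a convex combination of two one-hot vectors.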
3.4. Pre-training
Pre-training used Adam with a cosine learning rate schedule and a linear warmup. To stabilize training, gradients were clipped to a norm of 1.
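The schedule and the clipping step can be sketched in a few lines of plain Python. The function names, the decay to zero at the end of training, and the interpretation of the clipping threshold as a global norm over all gradients are assumptions made for this example.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine decay towards 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients jointly so that their global L2 norm is <= max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    """
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total <= max_norm:
        return [list(tensor) for tensor in grads]
    scale = max_norm / total
    return [[g * scale for g in tensor] for tensor in grads]


# Toy gradients with global norm 5 are rescaled to global norm 1.
clipped = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Scaling all gradients by one shared factor preserves the direction of the update, which is why clipping by global norm is often preferred over clipping each value independently.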
3.5. Fine-tuning
Fine-tuning used SGD and was run at two resolutions.
4. Findings
4.1. Scaling datasets with AugReg and compute
With well-chosen image augmentation and model regularization, the accuracy of the pre-trained model was similar to that of a model trained on a larger dataset. However, this did not hold for arbitrarily small datasets.
4.2. Transfer is the better option
Transferring a pre-trained model was both more cost-efficient and more accurate. Even for specialized or structured datasets, transferring a pre-trained model was simultaneously cheaper and performed better.
4.3. More data yields more generic models
At similar compute, the larger dataset was better. For most tasks, except those that are almost solved, a longer schedule worked better. Therefore, for a fixed compute budget, more data worked better.
4.4. Prefer augmentation to regularization
The trade-offs between data augmentation and model regularization are not clear-cut. When the dataset was small, any kind of AugReg helped. However, with a fixed compute budget, AugReg hurt when the dataset was larger. Adding augmentation helped in more cases than adding regularization.
4.5. Choosing which pre-trained model to transfer
Since it is infeasible to choose a model based on downstream task performance, which would require fine-tuning every candidate, one should choose it based on upstream validation accuracy. In general, this strategy worked well compared with fine-tuning all candidates.
A note on validation data for the ImageNet-1k dataset:
For models pre-trained on ImageNet-21k and transferred to ImageNet-1k, the validation score was not correlated with the observed test performance. Larger models may be able to memorize data from the training set: since ImageNet-21k contains ImageNet-1k, the evaluation set can include images that also appear in the training set. Using the independently collected ImageNetV2 as the validation set resolved this problem.
4.6. Prefer increasing patch-size to shrinking model-size
Although the models were designed to have similar inference throughput, tiny models with a smaller patch size performed worse than larger models with a bigger patch size. A possible explanation is that the patch size determines the number of tokens on which self-attention is performed, which in turn contributes to model capacity.
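The relationship between patch size and token count can be made concrete with a small calculation. The 224x224 input resolution and the function name are assumptions for this example, and any extra class token is left out of the count.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


# Assumed input resolution of 224x224 pixels.
tokens_p16 = num_patch_tokens(224, 16)  # 14 * 14 = 196 tokens
tokens_p32 = num_patch_tokens(224, 32)  # 7 * 7 = 49 tokens

# Self-attention compares every token with every other token, so the
# number of pairwise interactions grows with the square of the token count.
pairs_p16 = tokens_p16 ** 2  # 38416
pairs_p32 = tokens_p32 ** 2  # 2401
```

Doubling the patch size therefore cuts the token count by a factor of 4 and the number of attention pairs by a factor of 16, without changing the number of parameters in the Transformer layers.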
5. Related work
In contrast to the strategy of training ViT on large datasets, another line of work focuses on strong regularization and augmentation schemes to tackle overfitting on small datasets.
Other work introduces inductive biases into ViT variants, or retains some of the general architectural parameters of successful convolutional architectures while adding self-attention to them. Some authors propose hierarchical versions of ViT. Others suggest that, with careful initialization, a Vision Transformer behaves similarly to a CNN at the beginning of training.
To address overfitting and improve transfer performance, self-supervised learning objectives can be helpful.
6. Conclusion
- Best option is transfer learning
- Training with large dataset is better
- Data augmentation is preferred