
Paper: Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey

Abstract

  • Background: Vision Transformers (ViTs) are replacing convolutional neural networks (CNNs) across many vision tasks, but their large model size and heavy compute and memory demands make deployment difficult, especially on resource-constrained devices. This highlights the importance of algorithm-hardware co-design, and model quantization (on the algorithm side) is an effective way to improve efficiency.
  • Contributions (of the survey):
    1. An analysis of the unique architectural characteristics and runtime behavior of ViTs.
    2. The fundamentals of model quantization, with a comparative analysis of state-of-the-art ViT quantization techniques.
    3. Hardware acceleration of quantized ViTs, highlighting the importance of hardware-friendly algorithm design.
    4. Current challenges and directions for future research.

1 Introduction

Background + problems with existing surveys (they neglect either the hardware or the algorithm side, or focus mainly on compression for large language models) + the structure of this paper (table of contents).

2 ViT Architecture and Performance Analysis

Vision Transformers (ViTs), which use the self-attention mechanism to grasp “long-range” relationships in image sequences, have achieved great success in computer vision.

2.1 Overview of the ViT Architecture

  • The process culminates with a fully-connected layer, termed the “MLP Head”, for classification purposes.

  • The MHA (Multi-Head Attention) module first projects the image sequence by multiplying it with distinct weight matrices \(W^Q, W^K\), and \(W^V\), generating the query \((Q)\), key \((K)\), and value \((V)\) activations. The self-attention mechanism is then applied as follows:

\[\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,\]

where \(d_k\) represents the dimensionality of the key vectors. The MHA aggregates information from multiple representation subspaces, synthesizing the outputs from different heads into a unified representation:

\[\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O,\]

with each head defined as:

\[\mathrm{head}_i=\mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V).\]
  • The FFN (Feed-Forward Network) module, which consists of two dense layers with a GELU activation in between, processes each token individually, augmenting the model's capacity to model intricate functions:
\[\mathrm{FFN}(x)=\mathrm{GELU}(xW_1+b_1)W_2+b_2.\]
  • In summary, the MHA encompasses six linear operations: four weight-to-activation transformations (the \(W^Q, W^K, W^V\), and \(W^O\) projections) and two activation-to-activation transformations (\(Q\times K^T\) and \(\mathrm{Out}_\mathrm{softmax}\times V\)). In contrast, the FFN comprises two linear projections, \(W_1\) and \(W_2\). Non-linear operations such as Softmax, LayerNorm, and GELU, though less prevalent, present computational challenges on conventional hardware due to their complexity, potentially limiting end-to-end transformer inference. (A minimal sketch of these operations follows below.)
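Below is a minimal NumPy sketch of the MHA and FFN computations described above, just to make the listed linear operations concrete. The shapes, function names, and the single-matrix head split are my own illustrative assumptions, not code from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (tokens, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    T, d_model = X.shape
    d_k = d_model // num_heads
    # Four weight-to-activation projections: W^Q, W^K, W^V (and W^O at the end).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    split = lambda M: M.reshape(T, num_heads, d_k).transpose(1, 0, 2)  # (heads, T, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Two activation-to-activation matmuls: Q K^T and softmax(...) V.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ Vh
    # Concatenate heads and apply the output projection W^O.
    return heads.transpose(1, 0, 2).reshape(T, d_model) @ Wo

def ffn(x, W1, b1, W2, b2):
    # Two linear projections with a GELU in between.
    return gelu(x @ W1 + b1) @ W2 + b2
```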

2.2 Examples of ViT Variants

  • ViT serves as the foundational architecture for many Transformer-based models in image processing, remaining robust across images of varying scale and complexity.
  • DeiT builds on ViT: “By incorporating a novel teacher-student strategy tailored for Transformers, which includes the use of distillation tokens, DeiT can effectively train on smaller datasets without a substantial loss in performance, demonstrating the model’s adaptability to data constraints.”
  • Swin Transformer integrates a hierarchical structure with shifted windows, improving the model’s ability to capture local features while still maintaining a global context.
More ViTs:
  • S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in Vision: A Survey,” ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.
  • K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., “A Survey on Vision Transformer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, 2022.

2.3 Analyzing ViT Performance with the Roofline Model

  1. This model helps identify whether a layer or operation is compute-bound or memory-bound, thereby enabling optimal utilization of memory bandwidth and processing capability.
  2. In alignment with the roofline model, we evaluate the model’s computational demand by measuring both the floating-point operations (FLOPs) and the memory operations (MOPs) involved. From these, we determine the arithmetic intensity, calculated as the ratio of FLOPs to bytes accessed (FLOPs/B), i.e., Arithmetic Intensity = #FLOPs / #MOPs.
  3. Testing and comparing ViTs of different sizes on workloads of varying size and properties leads to the following conclusions (a small arithmetic-intensity sketch follows this list):
    When memory access is the constraint (memory-bound): “Employing quantization to represent data in 8-bit or lower-bit formats can be especially advantageous, as it reduces the model’s memory footprint, accelerates data transfer, thereby enhancing model inference.”
    When computation is the constraint (compute-bound): “Adopting INT8 precision emerges as a crucial optimization in such compute-bound situations. This approach not only mitigates computational bottlenecks but also capitalizes on the enhanced efficiency and throughput of quantized computing, significantly boosting overall performance.”
    (This underscores the importance of quantization.)
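As a concrete illustration of the FLOPs/MOPs bookkeeping above, here is a small sketch that estimates the arithmetic intensity of a single linear projection. The layer dimensions, the FP16 element size, and the helper names are my own assumptions for illustration, not numbers reported in the survey.

```python
def arithmetic_intensity(flops, bytes_accessed):
    """Arithmetic intensity = #FLOPs / #bytes moved (FLOPs/B)."""
    return flops / bytes_accessed

def linear_layer_stats(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs and memory traffic of one dense layer (FP16 elements assumed)."""
    flops = 2 * tokens * d_in * d_out                  # multiply-accumulate pairs
    mops = bytes_per_elem * (tokens * d_in             # read input activations
                             + d_in * d_out            # read weights
                             + tokens * d_out)         # write output activations
    return flops, mops

# Example: the first FFN projection of a ViT-Base-like block (197 tokens).
flops, mops = linear_layer_stats(tokens=197, d_in=768, d_out=3072)
ai = arithmetic_intensity(flops, mops)
# Compare `ai` with the device's peak-FLOPs / memory-bandwidth ratio (the ridge
# point): below it the layer is memory-bound, above it compute-bound.
print(f"AI = {ai:.1f} FLOPs/B")
```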

3 Quantization Fundamentals

3.1 Linear Quantization

  • Linear Quantization linearly maps the continuous range of weight or activation values to a discrete set of levels.
\[q=\mathrm{round}\left(\frac{r}{S}\right)+Z\]
  • Here \(r\) denotes the input real value, \(q\) the output quantized integer, \(S\) a real-valued scaling factor, and \(Z\) an integer zero point. Dequantization approximately recovers the real value:
\[\tilde{r}=S(q-Z)\]
  • Because of the rounding operation, dequantization incurs some error: \(\tilde{r}\) only approximates the original \(r\) (a minimal sketch follows below).
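A minimal NumPy sketch of this quantize/dequantize round trip, assuming an unsigned INT8 target and min/max-based scale and zero point (the helper names are illustrative):

```python
import numpy as np

def linear_quantize(r, S, Z, qmin=0, qmax=255):
    q = np.round(r / S) + Z                          # q = round(r/S) + Z
    return np.clip(q, qmin, qmax).astype(np.int32)

def linear_dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)            # r~ = S * (q - Z)

r = np.array([-1.2, 0.0, 0.7, 2.5], dtype=np.float32)
S = (r.max() - r.min()) / 255.0                      # scale from the min/max range
Z = int(round(-r.min() / S))                         # zero point so that r = 0 maps exactly
q = linear_quantize(r, S, Z)
r_hat = linear_dequantize(q, S, Z)                   # close to r, up to rounding error
```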

3.2 Symmetric and Asymmetric Quantization

\[S=\frac{r_{max}-r_{min}}{q_{max}-q_{min}}\]

where \([r_{min},r_{max}]\) and \([q_{min},q_{max}]\) represent the clipping range of real and integer values, respectively.

  • Symmetric quantization replaces the numerator above with a symmetric range: \(r_{max}-r_{min}=2\max(|r_{max}|,|r_{min}|)=2\max(|r|)\).
  • If the clipping range is symmetric, the zero point \(Z\) becomes 0, and the quantization equation simplifies to \(q=\mathrm{round}(r/S)\), which is simpler and more efficient.
  • For \(q_{min}\) and \(q_{max}\), we can use either the “full range” or the “restricted range”. In “full range” mode, \(S=\frac{2\max(|r|)}{2^n-1}\); in “restricted range” mode, \(S=\frac{\max(|r|)}{2^{n-1}-1}\). For INT8, the integer ranges are \([-128,127]\) and \([-127,127]\), respectively.
  • However, directly using the min/max values to determine the clipping range may be sensitive to outliers, resulting in low quantization resolution. Percentile and KL-divergence strategies have been introduced to address this problem (see the sketch below).
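A short sketch of symmetric (“restricted range”) INT8 quantization with percentile-based clipping to blunt the effect of outliers; the 99.9th-percentile choice and the function names are assumptions for illustration, not the survey’s recipe:

```python
import numpy as np

def symmetric_quantize(x, n_bits=8, percentile=99.9):
    # Clip to a high percentile of |x| instead of the raw max, so a few
    # outliers do not blow up the scale and crush the resolution.
    clip_val = np.percentile(np.abs(x), percentile)
    qmax = 2 ** (n_bits - 1) - 1                     # restricted range: [-127, 127] for INT8
    S = clip_val / qmax
    q = np.clip(np.round(x / S), -qmax, qmax).astype(np.int8)
    return q, S                                      # dequantize with S * q (Z = 0)

x = np.random.randn(4096).astype(np.float32)
x[0] = 50.0                                          # an outlier that would dominate max(|x|)
q, S = symmetric_quantize(x)
```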

3.3 Static and Dynamic Quantization

  • Dynamic quantization determines the range for each activation map at runtime. It requires computing input statistics (e.g., min, max, percentiles) on the fly, which significantly increases computational cost, but it usually achieves higher accuracy because the signal range is determined exactly for each individual input.
  • Static quantization pre-computes and fixes the range before inference. It adds no runtime cost, but typically provides lower accuracy.
  • A common technique for pre-calculating the range is to process a set of calibration inputs to determine the average range of activations [35]. Mean Square Error (MSE) [38] and entropy [39] are often used as metrics to choose the best range. Additionally, the clipping range can also be learned during the training process [40].
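A small sketch of the calibration idea just described: sweep candidate clipping values over a batch of calibration activations and keep the one whose quantize-dequantize round trip minimizes MSE. The grid search and function names are my own simplifying assumptions:

```python
import numpy as np

def calibrate_clip_mse(calib_activations, n_bits=8, n_candidates=100):
    """Pick a symmetric clipping value that minimizes reconstruction MSE."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.abs(calib_activations).max()
    best_clip, best_mse = max_abs, np.inf
    for clip in np.linspace(max_abs / n_candidates, max_abs, n_candidates):
        S = clip / qmax
        q = np.clip(np.round(calib_activations / S), -qmax, qmax)
        mse = np.mean((calib_activations - S * q) ** 2)
        if mse < best_mse:
            best_clip, best_mse = clip, mse
    return best_clip   # fixed (static) range reused at inference time
```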

3.4 Quantization Granularity

  • The determination of the clipping range can be categorized into layerwise, channelwise (the most common), and groupwise quantization, depending on the level of granularity employed.
  • Accuracy, but also overhead, increases in that order (a sketch contrasting layerwise and channelwise scales follows below).
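A brief sketch contrasting a single layerwise (per-tensor) scale with channelwise (per-output-channel) scales for a weight matrix; the restricted-range INT8 setting and the shapes below are assumptions for illustration:

```python
import numpy as np

def layerwise_scale(W, qmax=127):
    return np.abs(W).max() / qmax                    # one scale for the whole tensor

def channelwise_scales(W, qmax=127):
    return np.abs(W).max(axis=1) / qmax              # one scale per output channel

W = np.random.randn(64, 128).astype(np.float32)      # (out_channels, in_channels)
S_layer = layerwise_scale(W)                          # scalar: cheapest, least accurate
S_chan = channelwise_scales(W)                        # shape (64,): finer, more overhead
q_chan = np.clip(np.round(W / S_chan[:, None]), -127, 127).astype(np.int8)
```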

3.5 & 3.6 Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)

Skipped in these notes.

3.7 Data-Free Quantization (DFQ or ZSQ)

  • Data-Free Quantization (DFQ), alternatively termed Zero-Shot Quantization (ZSQ), performs quantization without access to real data.
  • Synthetic data resembling the real data is generated; this synthetic data is then used to calibrate the model during PTQ and to perform fine-tuning within QAT.
  • This sidesteps issues of data volume, privacy, and security.

4 Model Quantization for ViTs

To do (the later content is quite dense; I am not yet able to summarize it).

5 Hardware Acceleration for Quantized ViTs

6 Conclusion and Outlook