Background: Deep Neural Networks (DNNs) have driven major advances in machine learning. Their strength is high accuracy; their drawbacks are high memory consumption and high energy consumption, which make them hard to deploy on limited hardware resources. Quantization (storing the full-precision values in low bit-width precision) is one approach to this problem.
Advantages of quantization: 1. Quantization not only reduces memory requirements but also replaces high-cost operations with low-cost ones. 2. DNN quantization offers flexibility and efficiency in hardware design, making it a widely adopted technique.
Contributions:
Consequently, we present a comprehensive survey of quantization concepts and methods, with a focus on image classification.
We describe clustering-based quantization methods and explore the use of a scale factor parameter for approximating full-precision values.
Moreover, we thoroughly review the training of a quantized DNN, including the use of a straight-through estimator and quantization regularization. We explain the replacement of floating-point operations with low-cost bitwise operations in a quantized DNN and the sensitivity of different layers in quantization.
Furthermore, we highlight the evaluation metrics for quantization methods and important benchmarks in the image classification task. We also present the accuracy of the state-of-the-art methods on CIFAR-10 and ImageNet.
This article attempts to make the readers familiar with the basic and advanced concepts of quantization, introduce important works in DNN quantization, and highlight challenges for future research in this field.
Deep Convolutional Neural Networks (DCNNs) have achieved remarkable results, but they require storing a large number of parameters and performing a large amount of computation (the main operation in DCNNs is the multiply-accumulate (MAC) in convolution and Fully Connected (FC) layers), so accelerating DNNs is necessary.
In the beginning, the focus was on hardware optimization for processing speedup in DNN accelerators.
-> Later, researchers concluded that compression
and software optimization of DNNs can be more effective before touching hardware.
The approaches in DNN compression:
Quantization: approximates the numerical network components with low bit-width precision.
Pruning: removing unnecessary or less important connections within the network and
making a sparse network that reduces memory usage as well as computations.
Low-rank approximation: an approach to simplifying matrices (and images) that creates a new matrix, close to the weight matrix, with lower dimensions, which reduces the computations in DNNs.
Knowledge Distillation (KD): training a simpler model that exhibits generalization and accuracy comparable to the complex model.
Advantages of quantization
High compression with only a small accuracy reduction.
Flexibility
-- Since quantization is not dependent on the network architecture, a quantization algorithm can be
applied to various types of DNNs. (Many quantization methods originally designed for
DCNNs are also used for Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks) .
Smaller number of cycles on hardware
-- as high-cost floating-point operations are replaced with low-cost operations.
Reduces the cost of hardware accelerator design.
-- For instance, in 1-bit quantization, a 32-bit floating-point multiplier can be replaced by an XNOR operator, leading to a cost reduction of about 200x on a Xilinx FPGA (see the sketch after this list).
Contributes to controlling overfitting.
-- By simplifying the parameters.
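As a rough illustration of the bitwise replacement mentioned in the 1-bit item above, the sketch below (plain Python/NumPy, not taken from the survey; all names are mine) computes the dot product of two {−1, +1} vectors using only XNOR and popcount instead of floating-point multiplications.

```python
import numpy as np

def binary_dot_xnor(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1), using only XNOR and popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 wherever the signs agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # +1 per match, -1 per mismatch

# Reference check against the floating-point dot product.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
pack = lambda v: int("".join("1" if x > 0 else "0" for x in v), 2)
assert binary_dot_xnor(pack(a), pack(b), 64) == int(np.dot(a, b))
```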
A DCNN consists of various types of layers, and the common layers include convolution layer, normalization layer, pooling layer, and FC layer.
The main layer in a DCNN is the convolution layer, whose filters and feature maps are three-dimensional.
This layer produces an output feature map by convolving multiple filters (weights)
with the input feature map.
There is weight sharing in the convolution layer, which means that
each weight is applied to different connections.
The majority of computations in DCNNs are in this layer due to its three-dimensional structure and weight sharing.
Weight sharing -> a significant reduction in the number of parameters.
-> the majority of parameters in DCNNs are typically in the FC layers, where each neuron
is connected to all neurons in both the previous and next layers.
As the convolution and FC layers
contain the majority of computations and parameters in DCNNs, the primary focus is on these
layers in accelerators and compression techniques.
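A back-of-the-envelope sketch (my own, not from the survey; the layer shapes are illustrative only) of why the convolution layers dominate the MACs while the FC layers dominate the parameters:

```python
def conv_stats(c_in, c_out, k, h_out, w_out):
    params = c_out * c_in * k * k          # shared 3-D filters
    macs = params * h_out * w_out          # each filter is reused at every output position
    return params, macs

def fc_stats(n_in, n_out):
    params = n_in * n_out                  # one weight per connection, no sharing
    macs = params                          # each weight is used once per input
    return params, macs

# Example: a 3x3 conv (256 -> 256 channels, 14x14 output) vs. a 4096 -> 4096 FC layer.
print(conv_stats(256, 256, 3, 14, 14))     # ~0.6M parameters, ~116M MACs
print(fc_stats(4096, 4096))                # ~16.8M parameters, ~16.8M MACs
```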
Quantization is mapping values from a continuous space to a discrete space, where full-precision
values are mapped to new values with lower bit-width called quantization levels.
Each numerical component in neural networks can be quantized. These components are typically divided into three main categories: weights, activations, and gradients.
Weights: the most common (but in most cases, biases and other parameters, such as batch normalization parameters, are kept in full precision, since they account for a small fraction of the neural network parameters and quantizing them contributes little to compression).
Activations: harder to quantize than weights (while weights remain fixed after training, activations change during the inference phase according to the input data). However, quantizing only the weights yields limited efficiency and memory savings, so weights and activations need to be quantized together.
Gradients: useful only for accelerating training, and even harder to quantize (gradients are propagated from the output to the first layer of a network during the backward pass of the EBP algorithm. High-precision gradients are essential for the convergence of the optimization algorithm during training. Furthermore, due to the wide range of gradient values, accurate quantization requires more bits).
In a low bit-width precision quantized network, the convergence of the learning algorithm is challenging. -> it requires more iterations than the full-precision network for convergence and needs customized solutions compatible with a discrete network.
A reduction in model accuracy. -> needs retraining or fine-tuning after quantization -> repeated until an acceptable accuracy is reached.
Speedup phase: QAT accelerates both the training and inference phases, whereas PTQ accelerates only the inference phase.
The model accuracy in the QAT approach is commonly higher than in PTQ, because the trained
model is more compatible with the quantization process.
d represents the step size, and M is an odd number that determines the number
of quantization levels. Consequently, the quantization levels include zero, positive, and negative
values symmetrically.
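The survey's equation is not reproduced in these notes; the sketch below shows one common form of such a symmetric uniform quantizer, assuming it rounds to the nearest multiple of d and clips to M levels centered at zero (my own code).

```python
import numpy as np

def symmetric_uniform_quantize(x, d, M):
    """Round x to the nearest multiple of the step size d and clip to M
    symmetric levels (M odd): {-(M-1)/2*d, ..., -d, 0, d, ..., (M-1)/2*d}."""
    half = (M - 1) // 2
    return d * np.clip(np.round(x / d), -half, half)

x = np.array([-1.7, -0.2, 0.0, 0.4, 2.3])
print(symmetric_uniform_quantize(x, d=0.5, M=5))   # levels -1.0, -0.5, 0, 0.5, 1.0
```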
It maps the full-precision values in the range \(x\in[0,1]\) to \(2^k\) quantization levels within the same interval with step size \(\frac{1}{2^k-1}\). For \(k\) bit-width, the quantization levels are \(L_q=\{0,\frac{1}{2^k-1},\frac{2}{2^k-1},\ldots,1\}\). For example, for \(k=2\), there are \(2^2=4\) quantization levels, which are \(L_q=\{0,\frac{1}{3},\frac{2}{3},1\}\).
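A minimal NumPy sketch of this k-bit quantizer on [0, 1] (my own code, mirroring the stated step size \(\frac{1}{2^k-1}\)):

```python
import numpy as np

def quantize_unit_interval(x, k):
    """Map x in [0, 1] onto the 2**k uniform levels {0, 1/(2**k - 1), ..., 1}."""
    n = 2**k - 1                            # number of steps; step size is 1/n
    return np.round(x * n) / n

x = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
print(quantize_unit_interval(x, k=2))       # levels {0, 1/3, 2/3, 1}
```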
x represents the full-precision values, \(\nu\in\mathbb{R}^{K}\) denotes the learnable floating-point basis vector, and \(e_{l}\) is a \(k\)-bit binary vector from \([-1,-1,\ldots,-1]\) to \([1,1,\ldots,1]\).
Deterministic and Stochastic Quantization Comparison.
Stochastic quantization has
shown better model generalization compared to deterministic quantization.
Implementation of stochastic quantization is more challenging and costly than deterministic quantization, particularly in hardware implementations, as it
requires a random bit generator.
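To illustrate the random-bit requirement mentioned above, here is a minimal stochastic-rounding sketch in NumPy (my own code, not from the survey):

```python
import numpy as np

def stochastic_round(x, rng):
    """Round down or up with probability proportional to the distance from the
    two neighbouring integers, so the expected value of the output equals x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor)).astype(x.dtype)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(stochastic_round(x, rng).mean())   # close to 0.3; deterministic rounding gives 0.0
```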
In non-uniform quantization, the step size is determined according to the distribution of the full-precision values, which
makes it more complex and accurate than uniform quantization.
-> Logarithmic quantization allows the encoding of a larger range of numbers using the same storage in comparison with uniform quantization
by storing a small integer exponent instead of a floating-point number.
Previous studies have revealed that weights in DCNNs often follow a normal distribution with
a mean of zero:
-> In logarithmic quantization, the quantization levels are denser for values close to zero. Therefore, the distribution of quantization levels in logarithmic quantization matches the distribution of the full-precision weights in DCNNs, which leads to more accurate quantization.
The base-2 logarithm quantization is naturally a representation of the binary system.
-> it is well-matched to digital hardware and provides simple operations.
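A simple base-2 logarithmic (power-of-two) quantizer along these lines might look as follows; the exponent range, the handling of zero, and the names are my own choices, not the survey's.

```python
import numpy as np

def log2_quantize(w, n_bits=4, eps=1e-8):
    """Quantize weights to signed powers of two by rounding log2|w|; only a
    small integer exponent (plus a sign bit) needs to be stored."""
    sign = np.sign(w)
    exponent = np.clip(np.round(np.log2(np.abs(w) + eps)),
                       -(2 ** (n_bits - 1)), 0)       # exponents for 2^-8 ... 2^0 here
    return sign * np.exp2(exponent)

w = np.array([0.9, 0.30, -0.07, 0.012, -0.6])
print(log2_quantize(w))   # levels are +/- 2^e, denser near zero
```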
Deep Compression method:
using the k-means algorithm, where the weight values in a cluster are close to each other and mapped to the same quantization
level, which is the cluster center.
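A sketch in the spirit of this clustering-based quantization (using scikit-learn's KMeans for brevity; this is my own illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_clusters=16):
    """Cluster one layer's weights with k-means and replace every weight by its
    cluster centre; only the codebook (centres) and per-weight indices are stored."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()
    indices = km.labels_
    return codebook[indices].reshape(weights.shape), codebook, indices

w = np.random.default_rng(0).normal(0.0, 0.1, size=(64, 64))
w_q, codebook, idx = kmeans_quantize(w)      # 16 centres -> 4-bit index per weight
```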
Single Level Quantization (SLQ) (for high bit-width precision):
the weights of each layer are
clustered separately using the k-means algorithm.
-> the clusters are grouped into
two categories based on quantization loss.
-> Low loss: quantization ; High loss: retrain
-> These steps are repeated until all the weights are
quantized.
-> SLQ is not suitable for low bit-width quantization due to the small number of clusters, which leads to significant quantization loss.
Multiple Level Quantization (MLQ) (for 2-bit and 3-bit quantization.):
-> (Compared to SLQ) partitions weights not only in the width but also in the depth of the network. Layers are quantized iteratively
and incrementally (not at the same time).
Extended Single Level Quantization (ESLQ):
constrains the cluster centers (i.e., the quantization levels) to values of a specific type. For example, quantization levels are mapped to the closest Power-Of-Two (POT) number, making them well-suited for implementation on FPGA platforms.
Weighted entropy measure (for evaluating the quality of clustering)
Challenges of the clustering-based approach
Not suitable for implementation
in hardware and software due to their significant time complexity and computational requirements
for codebook reconstruction.
The weights within a
cluster are not contiguous in memory, which leads to irregular memory accesses with long delays.
The clustering-based approach is not suitable for activation quantization (weights are fixed at inference time, whereas activations change with the input).
In the Equation, \(w_{kj}(n-1)\) and \(w_{kj}(n)\) indicate the weights between \(k\) and \(j\) layers before and after the update, respectively. \(\gamma\) and \(\delta\) are the learning rate and the error signal, respectively. \(x_i\) and \(y_i\) are the inputs and outputs of layer \(i\), respectively, and \(h^\prime\) denotes the derivative of the activation function.
(The authors of the Bi-RealNet
paper concluded that higher-order functions require more complex computations, and thus, the
second-order function is acceptable)
In the Equation, \(i\) determines the epoch number in \(N\) epochs, \(T_{min}\) is set to \(10^{-1}\), and \(T_{max}\) is set to 10. EDE lies between the identity function \((y=x)\) and the \(hard\ tanh\) function. \(Hard\ tanh\) is close to the \(Sign\) function, but it discards the parameters outside the range \([-1,1]\).
Consequently, those parameters are not updated anymore, leading to a loss of information. The identity function, on the other hand, covers the parameters outside \([-1,1]\) but differs significantly from the \(Sign\) function, as indicated by the shaded area in Figure 12. EDE makes a tradeoff between the identity and hard tanh functions by varying the parameters \(k\) and \(t\) during training. Initially, \(k\) is greater than 1, making EDE closer to the identity function. As the number of epochs increases, \(k\) gradually tends toward 1, causing EDE to transition to hard tanh for a more accurate estimation.
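A sketch of how such a schedule could behave (my reconstruction from the description above; the exact formulas for \(t\) and \(k\) are assumptions consistent with \(T_{min}=0.1\), \(T_{max}=10\), and \(k\to1\)):

```python
import numpy as np

def ede_schedule(i, N, t_min=0.1, t_max=10.0):
    """Assumed progression over training: t grows from t_min to t_max and
    k = max(1/t, 1) shrinks toward 1."""
    t = t_min * 10 ** (i / N * np.log10(t_max / t_min))
    return t, max(1.0 / t, 1.0)

def ede_backward(x, t, k):
    """Gradient estimator for Sign(x): the derivative of k * tanh(t * x)."""
    return k * t * (1.0 - np.tanh(t * x) ** 2)

x = np.array([0.0, 0.5, 2.0])
for i in (0, 50, 100):                  # early, middle, late in N = 100 epochs
    t, k = ede_schedule(i, N=100)
    print(i, ede_backward(x, t, k))     # starts identity-like, ends hard-tanh-like
```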
The derivative of the Half-Wave Gaussian Quantization (HWGQ) function is zero almost everywhere, so an estimator is needed in the backward pass.
HWGQ is bounded to \(q_m\) for \(x>0\), whereas Vanilla ReLU tends to infinity. Consequently, using Vanilla ReLU in the backward pass leads to inaccurate gradients and unstable learning during training.
Clipped ReLU fixes this weak point of Vanilla ReLU by setting the gradient to zero for \(x\geq q_m\). The idea behind this modification comes from the fact that large values are commonly infrequent and can be interpreted as outliers.
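The two backward estimators compared above can be sketched as follows (NumPy, my own code; \(q_m\) denotes the largest quantization level):

```python
import numpy as np

def vanilla_relu_grad(x):
    """Backward estimator: 1 for x > 0; unbounded pass-through for large x."""
    return (x > 0).astype(float)

def clipped_relu_grad(x, q_m):
    """Backward estimator: additionally zero for x >= q_m, treating large
    activations as outliers."""
    return ((x > 0) & (x < q_m)).astype(float)

x = np.array([-0.5, 0.3, 1.2, 4.0])
print(vanilla_relu_grad(x))            # [0. 1. 1. 1.]
print(clipped_relu_grad(x, q_m=2.0))   # [0. 1. 1. 0.]
```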
Experimental results in the HWGQ method show that Log-tailed ReLU
achieves higher accuracy in AlexNet compared to Clipped ReLU. However, Clipped ReLU achieves
superior performance compared to Log-tailed ReLU in VGGNet-variant and ResNet-18, which are
deeper than AlexNet.
The PACT function maps the full-precision activations to the range [0, \(\alpha].\) Then the output of
the PACT function is quantized to \(k\) bit-width precision using Equation (46).
If \(\alpha\) is equal to 1,then the PACT function corresponds to the bounded rectifier function with \(\upsilon=0\) in the ABC method [117]. The optimum \(\alpha\) is found during training for minimizing the accuracy drop in quantization. It should be noted that the optimum value varies across different layers and models. Since Equation (46) is not differentiable, STE is employed for updating \(\alpha:\)
The training is dependent on the value of \(\alpha.\)If the initial value of \(\alpha\) is too small, then, according to Equation (47), most activations will fall in the range of non-zero gradient, causing frequent changes in the value of \(\alpha\) during training and leading to low model accuracy. However, if the initial value of \(\alpha\) is too large, then the gradient will be zero for the majority of activations, leading to small gradients and the risk of gradient vanishing in the EBP algorithm. To address this, \(\alpha\) is initialized with a large value that is not excessively large and then reduced using L2-norm regularization.
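The survey's Equations (46)-(47) are not reproduced in these notes; the sketch below follows my recollection of the PACT formulation (clip to \([0,\alpha]\), quantize uniformly to \(2^k\) levels, and use STE to pass a gradient to \(\alpha\) only where \(x\geq\alpha\)), so treat the exact forms as assumptions.

```python
import numpy as np

def pact_forward(x, alpha, k):
    """Clip x to [0, alpha] (equivalent to 0.5*(|x| - |x - alpha| + alpha)),
    then quantize uniformly to 2**k levels in that range."""
    y = np.clip(x, 0.0, alpha)
    scale = (2**k - 1) / alpha
    return np.round(y * scale) / scale

def pact_alpha_grad(x, alpha, upstream_grad):
    """STE for the clipping threshold: d y_q / d alpha is approximated as 1
    where x >= alpha and 0 elsewhere, summed over all activations."""
    return np.sum(upstream_grad * (x >= alpha))

x = np.array([-0.3, 0.4, 1.1, 3.0])
print(pact_forward(x, alpha=2.0, k=2))               # levels {0, 2/3, 4/3, 2}
print(pact_alpha_grad(x, 2.0, np.ones_like(x)))      # 1.0 (only x = 3.0 reaches alpha)
```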
Although the effectiveness of STE has been demonstrated in practice through the results of previous works, there is still a concern regarding the lack of theoretical proof for its performance. Therefore, in recent years, some researchers have made efforts to theoretically justify the performance of STE [138, 139] .
Table 3 summarizes the forward quantization function and its estimator in the backward pass
using STE for several previous works.
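A minimal PyTorch sketch of one such forward/backward pairing (Sign in the forward pass, clipped identity in the backward pass); this is my own illustrative code and does not reproduce the survey's Table 3.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Forward: Sign(x). Backward: pass the incoming gradient through unchanged
    inside [-1, 1] and zero it outside (a clipped straight-through estimator)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.tensor([-2.0, -0.4, 0.7, 3.0], requires_grad=True)
SignSTE.apply(x).sum().backward()
print(x.grad)    # tensor([0., 1., 1., 0.])
```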
Sometimes structural adjustments are necessary for the neural network after quantization.
For instance, the max-pooling layer in some binary DNNs is relocated. In DCNNs, a max-pooling layer commonly comes immediately after the activation layer. However, in a binary neural network, where the Sign function is used for the binarization of the activations, placing the max-pooling layer immediately after the Sign function results in an output matrix consisting almost entirely of +1 elements (a pooling window outputs −1 only if every binarized value in it is −1), since the values in a binarized matrix are −1 and +1. This leads to a loss of information, which is why the max-pooling layer is moved (see the sketch below).
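A toy illustration of the issue (PyTorch, my own code; the exact reordering differs between works):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)                          # real-valued pre-activations

pooled_after_sign = F.max_pool2d(torch.sign(x), 2)   # saturates: -1 only if a whole window is -1
pooled_before_sign = F.max_pool2d(x, 2)              # real-valued, keeps magnitude information
print(pooled_after_sign)
print(pooled_before_sign)
```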
Approximating weights with low bit-width precision itself acts as a regularizer in quantization; however, conventional regularizers that push the weights toward zero can lead to a significant drop in accuracy in a quantized network.
Some approaches:
Bit-level Sparsity Quantization (BSQ) suggested a regularization for mixed-precision quantization.
a periodic regularization to push the full-precision weights toward the quantization levels.
Tang et al. introduced a new regularization for binary quantization:
In the Equation, \(L(W,b)\) represents the loss function, and the second term denotes the regularization relation. \(L\) indicates the number of layers. \(N_l\) and \(M_l\) are the dimensions of the weight matrix in layer \(l\). The parameter \(\lambda\) controls the effect of the loss function and regularization term.
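The regularizer itself is not reproduced in these notes; the sketch below shows the form I recall from Tang et al., a term of the shape \(\lambda\sum_l\sum_{i,j}(1-w_{lij}^2)\) that is minimized when the weights sit at ±1 (assuming the weights are clipped to \([-1,1]\), as is common in binary networks). Treat it as an assumption, not the paper's exact equation.

```python
import torch

def binary_regularizer(weights, lam=1e-5):
    """lam * sum_l sum_ij (1 - w_ij**2): smallest when every weight is at -1 or +1
    (assumes weights are clipped to [-1, 1], as is common in binary networks)."""
    return lam * sum((1.0 - w.pow(2)).sum() for w in weights)

# Usage: total loss = task loss + regularizer over all quantized layers.
w1 = torch.empty(16, 8).uniform_(-1, 1).requires_grad_()
w2 = torch.empty(8, 4).uniform_(-1, 1).requires_grad_()
binary_regularizer([w1, w2]).backward()   # gradient -2*lam*w pushes |w| up toward 1
```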
Conclusion
In this article, we surveyed the previous quantization works in the image classification task. The
basic and advanced concepts of DCNN quantization were discussed, as well as the most important
methods and approaches in this field, along with their advantages and challenges. Some previous
works perform quantization on both weights and activations, offering a higher compression rate
and employing lower-cost operations compared to approaches that quantize only the weights.
However, quantization of activations is more challenging compared to weights, which is due to
the wide range of activations, the use of a non-differentiable activation function, the estimation
of activations during the backward pass, and the variation of activation values during inference.
The QAT and PTQ methods were studied, and it is concluded that the QAT methods generally achieve higher accuracy than the PTQ methods in the inference phase. Training a quantized
DNN poses new challenges compared to a full-precision network since the units are discrete. It commonly requires additional iterations for convergence in contrast to training a full-precision
network, and adaptive training strategies are required to build an accurate model. For instance,
the adjustment of learning rate and regularization techniques can be different from the training in
the full-precision network.
We discussed uniform and non-uniform quantization techniques and concluded that non-uniform quantization, especially the POT quantization approach, efficiently covers the distribution of full-precision values, which leads to enhanced accuracy. For decreasing quantization error,
it is important to allocate quantization levels to informative regions. Using the scale factor helps
in shifting the quantization levels to the most informative parts of data.
Some previous methods have successfully achieved high accuracy on large-scale datasets, such
as ImageNet, when both weights and activations are quantized in low bit-width. However, quantization with a precision lower than 4 bits remains a challenging task, especially in deeper networks.
During the training of a quantized network, STE is commonly used for calculating gradients in the
backward pass. The noise resulting from gradient mismatch, due to inaccurate estimation, is amplified layer-by-layer from the end of the network to the initial layers. This amplification of noise
is more considerable in deeper networks compared to shallow networks. In the training, this noise
can have a negative impact on model convergence. Additionally, since the number of parameters
increases with the depth of the neural network, the range of parameters in the deeper networks is
wider than in shallow ones, and the quantization is more challenging. Accordingly, future works
should focus on addressing the quantization of weights and activations in deeper networks with
low bit-width, such as binary or ternary quantization.
In this article, we discussed mixed-precision, which is currently an interesting approach in the
quantization of the DNNs. The main challenge in mixed-precision quantization is the exponential time complexity in finding the optimum bit-width for each layer. It is desirable for future
works to develop solutions that can determine the optimum mixed-precision with polynomial time
complexity.