# TensorFlow Lite 8-bit quantization specification
The following document outlines the specification for TensorFlow Lite's 8-bit quantization scheme. This is intended to assist hardware developers in providing hardware support for inference with quantized TensorFlow Lite models.
## Specification summary
We are providing a specification, and we can only provide some guarantees on behaviour if the spec is followed. We also understand that different hardware may have preferences and restrictions that cause slight deviations when implementing the spec, resulting in implementations that are not bit-exact. While that may be acceptable in most cases (and we will provide a suite of tests that, to the best of our knowledge, include per-operation tolerances gathered from several models), the nature of machine learning (and deep learning in the most common case) makes it impossible to provide any hard guarantees.
8-bit quantization approximates floating point values using the following formula.

$$real\_value = (int8\_value - zero\_point) \times scale$$
Per-axis (aka per-channel in Conv ops) or per-tensor weights are represented by `int8` two's complement values in the range `[-127, 127]` with zero-point equal to 0. Per-tensor activations/inputs are represented by `int8` two's complement values in the range `[-128, 127]`, with a zero-point in range `[-128, 127]`.
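To make the formula and the ranges above concrete, here is a minimal NumPy sketch (our own illustration, not TFLite's converter or kernel code; the helper names and the simple min/max range selection are assumptions) that quantizes weights symmetrically into `[-127, 127]` with zero-point 0 and activations asymmetrically into `[-128, 127]`:

```python
import numpy as np

def quantize(real_values, scale, zero_point, qmin, qmax):
    """Quantize floats so that real_value ~= (int8_value - zero_point) * scale."""
    q = np.round(real_values / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(int8_values, scale, zero_point):
    return (int8_values.astype(np.float32) - zero_point) * scale

# Weights: symmetric, zero-point fixed at 0, values clamped to [-127, 127].
w = np.array([-1.2, 0.0, 0.8], dtype=np.float32)
w_scale = np.abs(w).max() / 127.0
w_q = quantize(w, w_scale, zero_point=0, qmin=-127, qmax=127)

# Activations: asymmetric, zero-point anywhere in [-128, 127].
a = np.array([0.0, 0.5, 6.0], dtype=np.float32)
a_scale = (a.max() - a.min()) / 255.0
a_zero_point = int(np.round(-128 - a.min() / a_scale))
a_q = quantize(a, a_scale, a_zero_point, qmin=-128, qmax=127)

print(dequantize(w_q, w_scale, 0), dequantize(a_q, a_scale, a_zero_point))
```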
There are other exceptions for particular operations that are documented below.
Note: In the past our quantization tooling used per-tensor, asymmetric, `uint8` quantization. New tooling, reference kernels, and optimized kernels for 8-bit quantization will use this spec.
## Signed integer vs unsigned integer
TensorFlow Lite quantization will primarily prioritize tooling and kernels for `int8` quantization for 8-bit. This is for the convenience of symmetric quantization being represented by zero-point equal to 0. Additionally, many backends have additional optimizations for `int8xint8` accumulation.
## Per-axis vs per-tensor
Per-tensor quantization means that there will be one scale and/or zero-point per entire tensor. Per-axis quantization means that there will be one scale and/or `zero_point` per slice in the `quantized_dimension`. The quantized dimension specifies the dimension of the Tensor's shape that the scales and zero-points correspond to. For example, a tensor `t`, with `dims=[4, 3, 2, 1]` and quantization params `scale=[1.0, 2.0, 3.0]`, `zero_point=[1, 2, 3]`, `quantization_dimension=1` will be quantized across the second dimension of `t`:
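That is, each slice along the quantized dimension uses the corresponding scale and zero-point:

```
t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
```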
Often, the `quantized_dimension` is the `output_channel` of the weights of convolutions, but in theory it can be the dimension that corresponds to each dot-product in the kernel implementation, allowing more quantization granularity without performance implications. This yields large improvements in accuracy.
TFLite has per-axis support for a growing number of operations. At the time of writing, support exists for Conv2d and DepthwiseConv2d.
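As an illustration of the idea (a simplified NumPy sketch of our own, not the TFLite converter's actual logic; the function name and the weight layout are assumptions), per-axis symmetric scales for a convolution weight tensor can be computed with one scale per output channel:

```python
import numpy as np

def per_axis_symmetric_scales(weights, quantized_dimension):
    """One symmetric int8 scale per slice of the quantized dimension (zero-point 0)."""
    w = np.moveaxis(weights, quantized_dimension, 0)
    max_abs = np.abs(w.reshape(w.shape[0], -1)).max(axis=1)
    return np.maximum(max_abs, 1e-7) / 127.0

# Hypothetical Conv2D weights with shape [output_channels, height, width, input_channels];
# the output channel (axis 0) is the quantized dimension.
weights = np.random.randn(8, 3, 3, 16).astype(np.float32)
scales = per_axis_symmetric_scales(weights, quantized_dimension=0)
quantized = np.clip(np.round(weights / scales[:, None, None, None]), -127, 127).astype(np.int8)
```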
## Symmetric vs asymmetric
Activations are asymmetric: they can have their zero-point anywhere within the signed `int8` range `[-128, 127]`. Many activations are asymmetric in nature and a zero-point is a relatively inexpensive way to effectively get up to an extra binary bit of precision. Since activations are only multiplied by constant weights, the constant zero-point value can be optimized pretty heavily.
Weights are symmetric: forced to have zero-point equal to 0. Weight values are multiplied by dynamic input and activation values. This means that there is an unavoidable runtime cost of multiplying the zero-point of the weight with the activation value. By enforcing that zero-point is 0 we can avoid this cost.
Explanation of the math: this is similar to section 2.3 in arXiv:1712.05877, except for the difference that we allow the scale values to be per-axis. This generalizes readily, as follows:
\(A\) is a \(m \times n\) matrix of quantized activations.

\(B\) is a \(n \times p\) matrix of quantized weights.

Consider multiplying the \(j\)th row of \(A\), \(a_j\), by the \(k\)th column of \(B\), \(b_k\), both of length \(n\). The quantized integer values and zero-point values are \(q_a\), \(z_a\) and \(q_b\), \(z_b\) respectively.

$$a_j \cdot b_k = \sum_{i=0}^{n} a_{j}^{(i)} b_{k}^{(i)} =
\sum_{i=0}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) =
\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)} -
\sum_{i=0}^{n} q_{a}^{(i)} z_b -
\sum_{i=0}^{n} q_{b}^{(i)} z_a +
\sum_{i=0}^{n} z_a z_b$$
The \(\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}\) term is unavoidable since it’s performing the dot product of the input value and the weight value.
The \(\sum_{i=0}^{n} q_{b}^{(i)} z_a\) and \(\sum_{i=0}^{n} z_a z_b\) terms are made up of constants that remain the same per inference invocation, and thus can be pre-calculated.
The \(\sum_{i=0}^{n} q_{a}^{(i)} z_b\) term needs to be computed every inference since the activation changes every inference. By enforcing weights to be symmetric we can remove the cost of this term.
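As a concrete check of this decomposition (a self-contained NumPy sketch of our own, not TFLite code; scales are omitted for clarity), the quantized dot product expands into the four terms above, two of which can be pre-computed:

```python
import numpy as np

n = 16
q_a = np.random.randint(-128, 128, size=n).astype(np.int32)  # quantized activations
q_b = np.random.randint(-127, 128, size=n).astype(np.int32)  # quantized weights
z_a, z_b = 5, 3  # zero-points (z_b would be 0 for symmetric weights)

# Dot product of the zero-point-adjusted vectors.
lhs = np.dot(q_a - z_a, q_b - z_b)

# The four terms of the expansion.
term_qq = np.dot(q_a, q_b)      # unavoidable: depends on activations and weights
term_qa_zb = z_b * q_a.sum()    # must be recomputed each inference (unless z_b == 0)
term_qb_za = z_a * q_b.sum()    # constant: weights and zero-points are fixed
term_zz = n * z_a * z_b         # constant

assert lhs == term_qq - term_qa_zb - term_qb_za + term_zz
```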
## int8 quantized operator specifications
Below we describe the quantization requirements for our int8 tflite kernels: