Updated April 7, 2023

Definition of PyTorch Quantization

PyTorch is a framework for implementing deep learning, and sometimes we need to perform computations at lower bit widths. This is where PyTorch quantization comes in. Quantization is a technique for computing on and storing tensors at a lower bit width than floating point. In other words, a quantized model performs operations on input tensors with integer values rather than floating-point values. The main benefit of quantization is a more compact model representation, which makes it practical to run complex models where memory and compute are limited.

What is PyTorch Quantization?

A quantized model executes some or all of its operations on tensors with integers rather than floating-point values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization; compared to typical FP32 models, this allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster compared to FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.
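As a small illustration of the low-level quantized tensor representation, the following sketch quantizes a float tensor, inspects the stored integers, and converts it back. The scale and zero point are hypothetical values chosen for readability, not values PyTorch would pick for a real model.

```python
import torch

# A small fp32 tensor to quantize.
x = torch.tensor([-1.0, 0.0, 0.5, 1.0])

# scale and zero_point are hypothetical values picked for this illustration.
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

ints = q.int_repr()    # the underlying int8 storage
back = q.dequantize()  # approximate fp32 values recovered from the integers

print(ints.dtype, ints.element_size(), "byte per element")
print(x.dtype, x.element_size(), "bytes per element")
print(ints)
print(back)
```

Each stored element takes one byte instead of four, which is where the 4x size reduction comes from.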

How does quantization work?

Before we can see how quantization works, we first need to review a little about numeric types.

In computer engineering, decimal numbers like 1.0151 or 566132.8 are traditionally represented as floating-point numbers. Since we can have infinitely precise numbers (think π) but limited space in which to store them, we have to make a tradeoff between precision (the number of decimals we can include in a number before we have to start rounding it) and size (the number of bits we use to store the number).

Quantization works by mapping fp32 values to int8 values. This is done by binning the values: mapping ranges of values in the fp32 space into individual int8 values. For example, two weight constants 1.2251 and 1.6125 in fp32 might both be converted to 12 in int8, because they are both in the bin [1, 2]. Choosing the right bins is obviously very important.

PyTorch provides three different quantization algorithms, which differ primarily in where they determine these bins: "dynamic" quantization does so at runtime, "quantization-aware training" does so at train time, and "static" quantization does so as an additional intermediate step between the two. Each of these approaches has advantages and disadvantages (which we will cover shortly). Note that there are other quantization techniques proposed in the academic literature as well.
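The precision versus size tradeoff can be seen with nothing but the Python standard library. This sketch stores π in 32-bit and 16-bit floating-point formats and reads it back; the helper name round_trip is made up for this illustration.

```python
import math
import struct

def round_trip(value, fmt):
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

pi32 = round_trip(math.pi, "<f")  # 32-bit float, 4 bytes
pi16 = round_trip(math.pi, "<e")  # 16-bit float, 2 bytes

for label, fmt, stored in (("fp32", "<f", pi32), ("fp16", "<e", pi16)):
    size = struct.calcsize(fmt)
    error = abs(stored - math.pi)
    print(f"{label}: {size} bytes, stored {stored!r}, error {error:.2e}")
```

Halving the storage makes the rounding error noticeably larger, which is the same tradeoff quantization makes more aggressively.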
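The binning idea can be sketched in plain Python. This is a simplified affine scheme with a hypothetical, deliberately coarse scale so that the two example weights share a bin; the integer they land on depends entirely on the scale and zero point chosen, and real PyTorch kernels differ in detail.

```python
def quantize(value, scale, zero_point=0, qmin=-128, qmax=127):
    """Map a float to an int8 bin: divide by the bin width, round, shift, clamp."""
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point=0):
    """Map an int8 bin back to the single float value that represents it."""
    return (q - zero_point) * scale

scale = 2.0  # hypothetical bin width, chosen so both weights share a bin

a = quantize(1.2251, scale)
b = quantize(1.6125, scale)
print(a, b)                  # both weights fall into the same bin
print(dequantize(a, scale))  # and so both come back as the same float
print(quantize(1000.0, scale))  # out-of-range values are clamped
```

A smaller scale would separate the two weights but shrink the range of values that fit into int8, which is why choosing the bins matters.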

PyTorch quantization model

First, we need to understand a few concepts:
Quantization Configuration in PyTorch: specifies how the weights and activations of the model are to be quantized.
Backend Configuration: specifies the kernels, which can have different numerics.
Quantization engine: when a quantized model is executed, the quantization engine specifies which backend is to be used for execution. Ensure that the quantization engine is consistent with the Quantization Configuration.
After that, we need to choose the workflow for the quantized model: we can either use a pre-trained quantized model or apply post-training quantization to a model of our own, depending on our requirements.

Now we need to check which devices and operators are supported.

The set of available operators and the quantization numerics also depend on the backend being used to run quantized models. Currently, quantized operators are supported only for CPU inference on the following backends: x86 and ARM. Both the quantization configuration (how tensors should be quantized) and the quantized kernels (arithmetic with quantized tensors) are backend dependent.
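One way to keep the quantization engine and the quantization configuration consistent is to derive both from the same backend name. The sketch below assumes a recent PyTorch build that exposes torch.ao.quantization; the preference order of backend names is an arbitrary choice made for this example.

```python
import torch
from torch.ao.quantization import get_default_qconfig

# Engines compiled into this PyTorch build (depends on the CPU and platform).
supported = torch.backends.quantized.supported_engines
print("supported engines:", supported)

# Preference order is an assumption made for this example.
preferred = [name for name in ("x86", "fbgemm", "qnnpack") if name in supported]

if preferred:
    engine = preferred[0]
    # Use the same backend name for the engine and for the qconfig.
    torch.backends.quantized.engine = engine
    qconfig = get_default_qconfig(engine)
    print("engine:", torch.backends.quantized.engine)
    print("qconfig:", qconfig)
else:
    print("no quantized CPU backend available in this build")
```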

Three types of quantization

Now let's look at the three types of quantization.

1. Dynamic Quantization

This is the easiest method of quantization to apply. The weights are converted to int8 ahead of time, and the activations are converted to int8 on the fly, just before the computation. That means the computation can be performed using efficient int8 matrix multiplication.
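A minimal sketch of dynamic quantization on a toy model follows. The layer sizes are arbitrary, and the sketch assumes the PyTorch build has a quantized CPU backend available.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

torch.manual_seed(0)

# A toy fp32 model; the layer sizes are arbitrary.
float_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4)).eval()

# Replace the nn.Linear layers with dynamically quantized versions.
# Weights become int8 now; activations are quantized at run time.
quantized_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(2, 16)
out = quantized_model(x)

print(type(quantized_model[0]))
print(out.shape, out.dtype)
```

No calibration data is needed, which is why this is usually the first method to try.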

2. Post-training static quantization

One can further improve the performance (latency) by converting networks to use both integer arithmetic and int8 memory accesses. Static quantization performs the additional step of first feeding batches of data through the network and computing the resulting distributions of the different activations.
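The calibration step can be sketched with the eager-mode workflow: insert observers, feed some batches through, then convert. The model TinyNet, its layer sizes, and the random calibration data are all stand-ins invented for this example, and the backend choice assumes either fbgemm or qnnpack is available.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert,
)

class TinyNet(nn.Module):
    """Hypothetical model used only for this illustration."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # fp32 -> int8 at the input
        self.fc = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # int8 -> fp32 at the output

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

# Assumes one of these backends is available in this build.
engines = torch.backends.quantized.supported_engines
engine = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = engine

torch.manual_seed(0)
model = TinyNet().eval()
model.qconfig = get_default_qconfig(engine)

prepared = prepare(model)          # inserts observers
for _ in range(8):                 # calibration: random data as a stand-in
    prepared(torch.randn(16, 4))
quantized = convert(prepared)      # swaps in int8 modules

out = quantized(torch.randn(3, 4))
print(type(quantized.fc))
print(out.shape, out.dtype)
```

In practice the calibration batches should be representative samples of real input data rather than random tensors.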

3. Quantization Aware Training

This is the third method, and the one that typically results in the highest accuracy of the three. With QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating-point numbers.
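Fake quantization can be seen in isolation with a single tensor. In this sketch the scale and zero point are hypothetical; during real quantization-aware training they are maintained by observer modules rather than set by hand.

```python
import torch

weights = torch.tensor([0.04, 0.26, 1.00])

# Hypothetical quantization parameters for this illustration.
scale, zero_point = 0.1, 0

# Round to the int8 grid and immediately map back to float.
fake = torch.fake_quantize_per_tensor_affine(weights, scale, zero_point, -128, 127)

print(fake.dtype)  # still a floating-point tensor
print(fake)        # values snapped to multiples of the scale
```

The result stays in floating point, so ordinary training code keeps working while the model learns to tolerate the rounding.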

Static quantization

Static quantization quantizes both the weights and the activations of the model. It allows the user to fuse activations into preceding layers where possible. Consequently, static quantization is theoretically faster than dynamic quantization, while the model size and memory bandwidth consumption remain the same.
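Fusing an activation into the preceding layer can be sketched with fuse_modules. The model SmallNet and its attribute names are invented for this example; fusion is done in eval mode, before observers are inserted.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

class SmallNet(nn.Module):
    """Hypothetical model used only for this illustration."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.fc(x))

torch.manual_seed(0)
model = SmallNet().eval()

# Merge the ReLU into the Linear layer that precedes it.
fused = fuse_modules(model, [["fc", "relu"]])

x = torch.randn(3, 4)
print(type(fused.fc))    # a combined Linear + ReLU module
print(type(fused.relu))  # the original ReLU slot becomes a pass-through
print(torch.allclose(model(x), fused(x)))
```

The fused model computes the same function, but after conversion the combined module can run as a single quantized kernel.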

Improved performance in practice

Quantization can improve the performance of deep learning models because it works on integer values instead of floating-point values. PyTorch provides several quantization modes and pre-quantized models to choose from, so the approach can be matched to the model.

Examples:

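import os
import torch
import torchvision

# Pre-trained MobileNetV2 in its int8-quantized and regular fp32 versions.
# Newer torchvision releases prefer the weights= argument over pretrained=True.
model_quant = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)
model_data = torchvision.models.mobilenet_v2(pretrained=True)

def model_size(modl):
    # Save the weights to a temporary file and report its size in MB
    torch.save(modl.state_dict(), "demo.pt")
    print("%.2f MB" % (os.path.getsize("demo.pt") / 1e6))
    os.remove("demo.pt")

model_size(model_data)   # fp32 model
model_size(model_quant)  # int8 model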

Explanation

The output of the above program is illustrated in the following screenshot. It prints the size of the saved weights in MB, first for the floating-point model and then for the quantized model.

Conclusion

We hope this article has helped you learn more about PyTorch quantization. We covered the essential ideas behind it, looked at how quantized values are represented, and worked through an example. From this article, we learned how and when to use PyTorch quantization.