Updated April 7, 2023

Introduction to PyTorch Autograd

PyTorch's automatic differentiation package, torch.autograd, provides classes and functions that automatically compute derivatives of scalar-valued functions. Gradients can be tracked only for tensors of floating-point (or complex) dtype, and a tensor takes part in differentiation only when it is created with requires_grad=True. Autograd is the differentiation engine that powers neural network training in PyTorch. Because a model is a composition of nested functions, autograd obtains the gradients by applying the chain rule through those functions.
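
As a minimal sketch of these points, the snippet below differentiates a small scalar function and shows that an integer tensor cannot be tracked. The function f(x) = x² + 2x and the variable names are illustrative assumptions, not part of the original article.

```python
import torch

# A floating-point tensor that autograd should track.
x = torch.tensor(3.0, requires_grad=True)

# A scalar-valued function of x: f(x) = x^2 + 2x.
f = x ** 2 + 2 * x

# Backpropagate; this fills x.grad with df/dx = 2x + 2.
f.backward()
print(x.grad)  # tensor(8.)

# Integer tensors cannot require gradients, so this raises a RuntimeError.
int_rejected = False
try:
    torch.tensor([1, 2], requires_grad=True)
except RuntimeError as err:
    int_rejected = True
    print("RuntimeError:", err)
```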

What is PyTorch Autograd?

Training a neural network happens in two steps: forward propagation and backward propagation. In forward propagation, the network makes its best guess about the correct output by running the input through each of its functions. In backward propagation, the parameters are adjusted in proportion to the error in that guess: autograd traverses backward from the output, collects the derivatives of the error with respect to the parameters (the gradients), and the parameters are then optimized using gradient descent. Two entry points are available: torch.autograd.backward computes the sum of gradients of the given tensors with respect to the graph leaves, and torch.autograd.grad computes and returns the sum of gradients of the outputs with respect to the inputs.
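
The two propagation steps can be sketched as a single training step. This is only an illustration under assumed names: the tiny linear model, the random data, and the learning rate of 0.1 are hypothetical choices, not taken from the article.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical model and data, chosen only to keep the sketch small.
model = nn.Linear(3, 1)
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Keep a copy of the weights to see that the step changes them.
before = model.weight.detach().clone()

# Forward propagation: the model makes its guess and the error is measured.
prediction = model(inputs)
loss = nn.functional.mse_loss(prediction, targets)

# Backward propagation: autograd fills .grad for every parameter.
optimizer.zero_grad()
loss.backward()

# Gradient descent: parameters move against their gradients.
optimizer.step()

print("weights changed:", not torch.equal(before, model.weight.detach()))
```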

Create PyTorch Autograd

As a first step, create two tensors with requires_grad=True. This tells autograd to track every operation performed on them.

import torch
x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([5., 3.], requires_grad=True)

Next, create another tensor W from x and y, where W = 3x³ − y².

W = 3*x**3 - y**2

Treating x and y as the parameters of a neural network and W as the error, training needs the gradients of the error with respect to the parameters:

∂W/∂x = 9x²
∂W/∂y = -2y

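When .backward() is called on W, autograd calculates these gradients and stores them in the .grad attribute of x and y. Because W is a vector rather than a scalar, a gradient argument must be passed explicitly to W.backward(). It is a tensor of the same shape as W and represents the gradient of W with respect to itself: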

dW/dW = 1

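Alternatively, W can be aggregated into a scalar and backward called without an argument, as in W.sum().backward(). The code below passes the gradient explicitly and then checks the result against the formulas above.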

ext_grad = torch.tensor([1., 1.])
W.backward(gradient=ext_grad)
# Both checks print tensor([True, True]) when the gradients match the formulas
print(9*x**2 == x.grad)
print(-2*y == y.grad)
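
The two ways of calling backward on a vector-valued W can be compared directly. The sketch below recreates x, y, and W so that it runs on its own; the helper names grad_x_vector and grad_y_vector are hypothetical.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([5., 3.], requires_grad=True)

# Route 1: keep W as a vector and pass dW/dW = 1 explicitly.
W = 3 * x ** 3 - y ** 2
W.backward(gradient=torch.ones_like(W))
grad_x_vector = x.grad.clone()
grad_y_vector = y.grad.clone()

# Clear the stored gradients so the second route starts fresh.
x.grad = None
y.grad = None

# Route 2: reduce W to a scalar, so no gradient argument is needed.
W = 3 * x ** 3 - y ** 2
W.sum().backward()

print(torch.equal(grad_x_vector, x.grad))  # True
print(torch.equal(grad_y_vector, y.grad))  # True
print(x.grad, y.grad)  # 9x^2 and -2y evaluated at the inputs
```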

Autograd only tracks operations that involve at least one tensor with requires_grad=True:

a = torch.rand(3, 3)
b = torch.rand(3, 3)
c = torch.rand((3, 3), requires_grad=True)
x = a + b
print(f"Does `x` need gradients? : {x.requires_grad}")
y = a + c
print(f"Does `y` need gradients?: {y.requires_grad}")

Parameters are frozen like this:

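import torchvision
from torch import nn, optim

# Load a pretrained ResNet-18 (older torchvision releases use pretrained=True instead)
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
# Freeze every parameter in the network
for parameter in model.parameters():
    parameter.requires_grad = False

# A new, unfrozen linear layer replaces the classifier (ResNet-18 feeds it 512 features)
model.fc = nn.Linear(512, 5)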

The next step is to define the optimizer.

Optimizer_req = optim.SGD(model.parameters(), lr=1e-5, momentum=0.5)
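
The effect of freezing can be checked without downloading pretrained weights. The following sketch uses a small hypothetical nn.Sequential model as a stand-in for the backbone and classifier; the layer sizes and optimizer settings are assumptions made only for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone followed by a classifier.
model = nn.Sequential(
    nn.Linear(4, 8),  # plays the role of the backbone
    nn.ReLU(),
    nn.Linear(8, 3),  # plays the role of the old classifier
)

# Freeze everything.
for parameter in model.parameters():
    parameter.requires_grad = False

# A freshly created layer is trainable by default.
model[2] = nn.Linear(8, 3)

# Optionally hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-2, momentum=0.9)

loss = model(torch.randn(5, 4)).sum()
loss.backward()
optimizer.step()

print(model[0].weight.grad)              # None: the frozen layer got no gradient
print(model[2].weight.grad is not None)  # True: the new classifier did
```

Passing all parameters to the optimizer, as the article does, also works, because parameters without gradients are skipped during the update.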

Explanation of PyTorch Autograd

Autograd keeps a record of the data (tensors) and of all executed operations, along with the resulting new tensors, in a directed acyclic graph (DAG) made of Function objects. In this DAG, the input tensors are the leaves and the output tensors are the roots. Tracing the graph from roots to leaves lets autograd compute the gradients automatically using the chain rule. In a forward pass, autograd runs the requested operation to compute the resulting tensor and, at the same time, records the operation's gradient function in the DAG.
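
The graph described here can be inspected from Python. This is a small sketch that reuses the x, y, and W tensors from the earlier section; the exact printed names of the gradient functions may differ between PyTorch versions.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([5., 3.], requires_grad=True)
W = 3 * x ** 3 - y ** 2

# Leaves are created by the user and have no gradient function.
print(x.is_leaf, x.grad_fn)  # True None

# W is the result of operations, so it is not a leaf and carries a grad_fn.
print(W.is_leaf)  # False
print(W.grad_fn)

# next_functions links the subtraction node to the nodes that produced its inputs.
for function, input_index in W.grad_fn.next_functions:
    print(function, input_index)
```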

The backward pass starts when .backward() is called on the DAG root. Autograd then computes the gradients from each .grad_fn, accumulates them in the respective tensor's .grad attribute, and propagates them all the way to the leaf tensors using the chain rule. DAGs are dynamic in PyTorch: the graph is recreated from scratch, so after each .backward() call autograd starts populating a new graph. This is what allows control flow statements in a model, and the shape, size, and operations can change at every iteration if needed.
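
Because the graph is rebuilt on every forward pass, the number of recorded operations may depend on the data. The sketch below uses a hypothetical function, doubling_model, whose loop runs a different number of times for each input; the inputs are illustrative assumptions.

```python
import torch

def doubling_model(x, w):
    """Hypothetical model: doubles the output until its norm reaches 10."""
    out = x * w
    steps = 0
    while out.norm() < 10:  # data-dependent control flow
        out = out * 2
        steps += 1
    return out.sum(), steps

w = torch.tensor([1.0, 1.0], requires_grad=True)

results = []
for x in (torch.tensor([1.0, 1.0]), torch.tensor([5.0, 5.0])):
    loss, steps = doubling_model(x, w)  # a fresh graph is recorded here
    loss.backward()                     # gradient is x * 2**steps
    results.append((steps, w.grad.tolist()))
    print(steps, w.grad.tolist())
    w.grad = None                       # avoid accumulating across iterations

# prints: 3 [8.0, 8.0]
# prints: 1 [10.0, 10.0]
```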

torch.autograd tracks the operations on every tensor whose requires_grad flag is set to True and uses them to build the DAG. For tensors with requires_grad=False, operations are not tracked and the tensors are excluded from the gradient computation DAG. The output of an operation requires gradients if at least one of its inputs has requires_grad=True.
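
Tracking can also be switched off for a tensor that does require gradients. A brief sketch, with illustrative variable names, using torch.no_grad() and Tensor.detach():

```python
import torch

c = torch.rand(3, 3, requires_grad=True)

# Tracked: one input requires gradients, so the output does too.
tracked = c * 2
print(tracked.requires_grad)  # True

# Inside no_grad, operations are not recorded in the graph.
with torch.no_grad():
    untracked = c * 2
print(untracked.requires_grad)  # False

# detach() returns a tensor that shares data with c but is cut off from the graph.
detached = c.detach()
print(detached.requires_grad, detached.grad_fn)  # False None
```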

Parameters that do not compute gradients are called frozen parameters. Freezing is useful when we know in advance that the gradients of those parameters are not needed, because it reduces the work autograd has to do. In finetuning, we typically freeze most of the model and modify only the classifier layers so that the model can make predictions on the new labels. The only parameters the optimizer then updates are the weights and bias of the classifier.

PyTorch Autograd Examples

The forward pass defines a computational graph in which nodes are tensors and edges are functions that produce output tensors from input tensors. Backpropagating through this graph makes it easy to compute gradients for all the tensors. The example here fits a third-order polynomial to a sine wave.

Code:

import torch
import math

datatype = torch.float
device = torch.device("cpu")

# Input and target data: b = sin(a) sampled on [-pi, pi]
a = torch.linspace(-math.pi, math.pi, 1500, device=device, dtype=datatype)
b = torch.sin(a)

# Randomly initialised polynomial coefficients, tracked by autograd
m = torch.randn((), device=device, dtype=datatype, requires_grad=True)
n = torch.randn((), device=device, dtype=datatype, requires_grad=True)
o = torch.randn((), device=device, dtype=datatype, requires_grad=True)
p = torch.randn((), device=device, dtype=datatype, requires_grad=True)

# A small step size; larger values such as 1e-5 can make this summed loss diverge
lrning_rate = 1e-6
for k in range(1500):
    # Forward pass: predict b with a third-order polynomial
    b_pred = m + n * a + o * a ** 2 + p * a ** 3
    loss_fn = (b_pred - b).pow(2).sum()
    if k % 100 == 99:
        print(k, loss_fn.item())

    # Backward pass: fills m.grad, n.grad, o.grad and p.grad
    loss_fn.backward()

    # Update the coefficients without recording the update in the graph
    with torch.no_grad():
        m -= lrning_rate * m.grad
        n -= lrning_rate * n.grad
        o -= lrning_rate * o.grad
        p -= lrning_rate * p.grad

        # Reset the gradients so they do not accumulate across iterations
        m.grad = None
        n.grad = None
        o.grad = None
        p.grad = None
print(f'Result: b = {m.item()} + {n.item()} a + {o.item()} a^2 + {p.item()} a^3')
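
To see what backward() computes for this loss, the gradients can be compared with a hand-derived formula. For the summed squared error, the derivative with respect to the coefficient of a^k is the sum of 2 * (b_pred - b) * a^k. The sketch below assumes double precision to keep the comparison tight; the names residual, manual, and checks are hypothetical.

```python
import math
import torch

torch.manual_seed(0)
dtype = torch.float64  # assumption: double precision for a tight comparison

a = torch.linspace(-math.pi, math.pi, 1500, dtype=dtype)
b = torch.sin(a)
m, n, o, p = (torch.randn((), dtype=dtype, requires_grad=True) for _ in range(4))

# One forward and backward pass, as in the training loop.
b_pred = m + n * a + o * a ** 2 + p * a ** 3
loss = (b_pred - b).pow(2).sum()
loss.backward()

# Hand-derived gradients: d(loss)/d(coefficient of a^k) = sum(2 * residual * a^k).
with torch.no_grad():
    residual = 2.0 * (b_pred - b)
    manual = [(residual * a ** k).sum() for k in range(4)]

checks = {
    name: torch.allclose(param.grad, expected)
    for name, param, expected in zip("mnop", (m, n, o, p), manual)
}
print(checks)
```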

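A custom autograd Function can also be defined by subclassing torch.autograd.Function and implementing its forward and backward passes. Here the third Legendre polynomial, P3(x) = 0.5 * (5x³ − 3x), is used inside the model b = m + n * P3(o + p * a).

class Legend(torch.autograd.Function):
    @staticmethod
    def forward(ctx, ins):
        # Save the input; backward needs it to evaluate the derivative
        ctx.save_for_backward(ins)
        return 0.5 * (5 * ins ** 3 - 3 * ins)

    @staticmethod
    def backward(ctx, grad_outs):
        # Chain rule: upstream gradient times dP3/dx = 1.5 * (5x^2 - 1)
        ins, = ctx.saved_tensors
        return grad_outs * 1.5 * (5 * ins ** 2 - 1)

datatype = torch.float
device = torch.device("cpu")
a = torch.linspace(-math.pi, math.pi, 1500, device=device, dtype=datatype)
b = torch.sin(a)
m = torch.full((), 0.0, device=device, dtype=datatype, requires_grad=True)
n = torch.full((), -1.0, device=device, dtype=datatype, requires_grad=True)
o = torch.full((), 0.0, device=device, dtype=datatype, requires_grad=True)
p = torch.full((), 0.3, device=device, dtype=datatype, requires_grad=True)

# Step size (an assumption; the listing does not give a value for learning_rate)
learning_rate = 5e-6
for k in range(1500):
    # Forward pass (assumed model form); custom Functions are called through .apply
    b_pred = m + n * Legend.apply(o + p * a)
    loss_fn = (b_pred - b).pow(2).sum()
    if k % 100 == 99:
        print(k, loss_fn.item())

    loss_fn.backward()

    with torch.no_grad():
        m -= learning_rate * m.grad
        n -= learning_rate * n.grad
        o -= learning_rate * o.grad
        p -= learning_rate * p.grad

        # Reset the gradients before the next iteration
        m.grad = None
        n.grad = None
        o.grad = None
        p.grad = None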
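
A hand-written backward method is easy to get wrong, so it helps to compare it against numerical derivatives. The sketch below uses torch.autograd.gradcheck on a custom Function for the third Legendre polynomial, P3(x) = 0.5 * (5x³ − 3x). The class name LegendreP3 and the tolerance values are assumptions chosen for illustration.

```python
import torch

class LegendreP3(torch.autograd.Function):
    """Hypothetical custom Function for P3(x) = 0.5 * (5x^3 - 3x)."""

    @staticmethod
    def forward(ctx, ins):
        ctx.save_for_backward(ins)
        return 0.5 * (5 * ins ** 3 - 3 * ins)

    @staticmethod
    def backward(ctx, grad_outs):
        (ins,) = ctx.saved_tensors
        # dP3/dx = 1.5 * (5x^2 - 1), multiplied by the upstream gradient
        return grad_outs * 1.5 * (5 * ins ** 2 - 1)

# gradcheck compares analytic and finite-difference gradients;
# it expects double-precision inputs that require gradients.
torch.manual_seed(0)
sample = torch.randn(6, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(LegendreP3.apply, (sample,), eps=1e-6, atol=1e-4)
print(ok)  # True
```

If backward did not match the derivative of forward, gradcheck would raise an error instead of returning True.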

Conclusion

Autograd requires only small changes to ordinary PyTorch code: mark the tensors of interest with requires_grad=True and call backward(), and the gradients are computed automatically. Because the graph is rebuilt on every forward pass, regular Python control flow such as loops and conditionals can be differentiated through, as long as the computation is expressed with tensor operations. Derivatives of derivatives (higher-order gradients) can also be taken by differentiating through a gradient computation.
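
As a brief sketch of a derivative of a derivative, the snippet below differentiates y = x³ twice with torch.autograd.grad. The function and the variable names are illustrative assumptions.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First derivative; create_graph=True records it so it can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative, obtained by differentiating the first one.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item(), d2y_dx2.item())  # 12.0 12.0  (3x^2 and 6x at x = 2)
```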