Tags: time, pytorch, time-complexity, flops

Is it possible that the inference time is large while the number of parameters and FLOPs are low in PyTorch?


I calculated the FLOPs of my network in PyTorch using the 'profile' function from the 'thop' library.

In my experiment, my network showed:

Flops: 619.038M, Parameters: 4.191M, Inference time: 25.911 ms

For comparison, I checked the FLOPs and parameters of ResNet-50, which showed:

Flops: 1.315G, Parameters: 26.596M, Inference time: 8.553545 ms

Is it possible that the inference time is large while the FLOPs are low? Or are there operations whose FLOPs the 'profile' function cannot measure? However, similar results came out using 'FlopCountAnalysis' from 'fvcore.nn' and 'get_model_complexity_info' from 'ptflops'.
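
For reference, the alternative counters were called roughly like this (a minimal sketch, assuming 'model' and 'dummy_input' as defined in the timing code below):

from fvcore.nn import FlopCountAnalysis
from ptflops import get_model_complexity_info

# fvcore: counts FLOPs by tracing the model on a concrete input
flops = FlopCountAnalysis(model, dummy_input)
print(flops.total())

# ptflops: builds its own dummy input from the given input resolution
macs, params = get_model_complexity_info(
    model, (3, 32, 32), as_strings=True, print_per_layer_stat=False
)
print(macs, params)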

Here is the code I used to measure the inference time in PyTorch.

import numpy as np
import torch
from thop import clever_format, profile

model.eval()
model.cuda()

dummy_input = torch.randn(1, 3, 32, 32).cuda()

#flops = FlopCountAnalysis(model, dummy_input)
#print(flop_count_table(flops))
#print(flops.total())

macs, params = profile(model, inputs=(dummy_input,))
macs, params = clever_format([macs, params], "%.3f")
print('Flops:',macs)
print('Parameters:',params)

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

repetitions = 300
timings=np.zeros((repetitions,1))

# GPU warm-up so CUDA initialization does not skew the timings
for _ in range(10):
    _ = model(dummy_input)

# MEASURE PERFORMANCE
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        timings[rep] = curr_time

print('time(ms) :', np.average(timings))

Solution

  • It is an absolutely normal situation. FLOPs (or MACs) are theoretical measures that deliberately disregard hardware/software optimizations, and those optimizations are exactly why the same operations run faster or slower on different hardware.

    For example, different neural-network architectures achieve different CPU/GPU utilization. Let's consider two simple architectures with almost the same number of parameters/FLOPs:

    1. Deep network:
    layers = [nn.Conv2d(3, 16, 3)]
    for _ in range(12):
        layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
    deep_model = nn.Sequential(*layers)

    2. Wide network:
    wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))

    Modern GPUs allow you to parallelize a large number of simple operations. But in a deep network you need the outputs of layer[i] to compute the outputs of layer[i+1], so depth becomes a blocking factor that reduces the utilization of your hardware.

    Complete example:

    import numpy as np
    import torch
    from thop import clever_format, profile
    from torch import nn
    
    
    def measure(model, name):
        model.eval()
        model.cuda()
    
        dummy_input = torch.randn(1, 3, 64, 64).cuda()
    
        macs, params = profile(model, inputs=(dummy_input,), verbose=0)
        macs, params = clever_format([macs, params], "%.3f")
        print("<" * 50, name)
        print("Flops:", macs)
        print("Parameters:", params)
    
        starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(
            enable_timing=True
        )
    
        repetitions = 300
        timings = np.zeros((repetitions, 1))
    
        for _ in range(10):
            _ = model(dummy_input)
    
        # MEASURE PERFORMANCE
        with torch.no_grad():
            for rep in range(repetitions):
                starter.record()
                _ = model(dummy_input)
                ender.record()
                # WAIT FOR GPU SYNC
                torch.cuda.synchronize()
                curr_time = starter.elapsed_time(ender)
                timings[rep] = curr_time
    
        print("time(ms) :", np.average(timings))
    
    
    layers = [nn.Conv2d(3, 16, 3)]
    for _ in range(12):
        layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
    deep_model = nn.Sequential(*layers)
    measure(deep_model, "My deep model")
    
    wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))
    measure(wide_model, "My wide model")
    

    Results:

    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My deep model
    Flops: 107.940M
    Parameters: 28.288K
    time(ms) : 0.6160109861691793
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My wide model
    Flops: 106.279M
    Parameters: 28.672K
    time(ms) : 0.1514971748739481
    

    As you can see, the models have a similar number of parameters/FLOPs, but the compute time is about 4x larger for the deep network.
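
    As a sanity check, the wide model's count can be reproduced by hand: a single 3x3 convolution from 3 to 1024 channels on a 64x64 input with no padding produces a 62x62 output, and the resulting MAC count matches the "Flops" value thop prints above:

    # One 3x3 conv, 3 -> 1024 channels, 64x64 input, no padding -> 62x62 output
    k, c_in, c_out, h_out, w_out = 3, 3, 1024, 62, 62
    macs = k * k * c_in * c_out * h_out * w_out
    print(macs)  # 106278912, i.e. ~106.279M, matching the wide model above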

    This is just one possible reason why the inference time can be large while the number of parameters and FLOPs are low. You may also need to take other underlying hardware/software optimizations into account.
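
    If you want to see where the time actually goes (kernel launches, memory traffic, synchronization), a runtime profiler is more informative than theoretical FLOP counts. A minimal sketch using the standard torch.profiler API, assuming 'model' and 'dummy_input' as above:

    import torch
    from torch.profiler import ProfilerActivity, profile

    # Profile a single forward pass and list the most expensive CUDA kernels
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
    ) as prof:
        _ = model(dummy_input)

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))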