Tags: time, pytorch, time-complexity, flops

Is it possible that the inference time is large while the number of parameters and FLOPs are low in PyTorch?


I calculated the FLOPs of my network in PyTorch using the 'profile' function from the 'thop' library.

In my experiment, my network showed:

Flops: 619.038M, Parameters: 4.191M, Inference time: 25.911 ms

For comparison, I checked the FLOPs and parameters of ResNet-50, which showed:

Flops: 1.315G, Parameters: 26.596M, Inference time: 8.553545 ms

Is it possible that the inference time is large while the FLOPs are low? Or are there operations whose FLOPs the 'profile' function cannot measure? However, similar results came out using 'FlopCountAnalysis' from 'fvcore.nn' and 'get_model_complexity_info' from 'ptflops'.
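
For reference, the alternative counters were called roughly like this (a minimal sketch, assuming 'model' and 'dummy_input' as defined in the timing code below):

from fvcore.nn import FlopCountAnalysis
from ptflops import get_model_complexity_info

# fvcore: counts FLOPs by tracing the model on a concrete input
flops = FlopCountAnalysis(model, dummy_input)
print(flops.total())

# ptflops: builds its own dummy input from the given input resolution
macs, params = get_model_complexity_info(
    model, (3, 32, 32), as_strings=True, print_per_layer_stat=False
)
print(macs, params)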

Here is the code I used to measure the inference time in PyTorch.

import numpy as np
import torch
from thop import clever_format, profile

model.eval()
model.cuda()

dummy_input = torch.randn(1, 3, 32, 32).cuda()

#flops = FlopCountAnalysis(model, dummy_input)
#print(flop_count_table(flops))
#print(flops.total())

macs, params = profile(model, inputs=(dummy_input,))
macs, params = clever_format([macs, params], "%.3f")
print('Flops:',macs)
print('Parameters:',params)

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

repetitions = 300
timings=np.zeros((repetitions,1))

# GPU warm-up so CUDA initialization does not skew the timings
for _ in range(10):
    _ = model(dummy_input)

# MEASURE PERFORMANCE
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        timings[rep] = curr_time

print('time(ms) :', np.average(timings))

Solution

  • It is an absolutely normal situation. FLOPs (or MACs) are theoretical measures that deliberately disregard hardware/software optimizations, and those optimizations are exactly why the same operations run faster or slower on different hardware.

    For example, different neural-network architectures achieve different CPU/GPU utilization. Let's consider two simple architectures with almost the same number of parameters/FLOPs:

    1. Deep network:
    layers = [nn.Conv2d(3, 16, 3)]
    for _ in range(12):
        layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
    deep_model = nn.Sequential(*layers)

    2. Wide network:
    wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))

    Modern GPUs allow you to parallelize a large number of simple operations. But in a deep network you need the outputs of layer[i] to compute the outputs of layer[i+1], so depth becomes a blocking factor that reduces the utilization of your hardware.

    Complete example:

    import numpy as np
    import torch
    from thop import clever_format, profile
    from torch import nn
    
    
    def measure(model, name):
        model.eval()
        model.cuda()
    
        dummy_input = torch.randn(1, 3, 64, 64).cuda()
    
        macs, params = profile(model, inputs=(dummy_input,), verbose=0)
        macs, params = clever_format([macs, params], "%.3f")
        print("<" * 50, name)
        print("Flops:", macs)
        print("Parameters:", params)
    
        starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(
            enable_timing=True
        )
    
        repetitions = 300
        timings = np.zeros((repetitions, 1))
    
        for _ in range(10):
            _ = model(dummy_input)
    
        # MEASURE PERFORMANCE
        with torch.no_grad():
            for rep in range(repetitions):
                starter.record()
                _ = model(dummy_input)
                ender.record()
                # WAIT FOR GPU SYNC
                torch.cuda.synchronize()
                curr_time = starter.elapsed_time(ender)
                timings[rep] = curr_time
    
        print("time(ms) :", np.average(timings))
    
    
    layers = [nn.Conv2d(3, 16, 3)]
    for _ in range(12):
        layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
    deep_model = nn.Sequential(*layers)
    measure(deep_model, "My deep model")
    
    wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))
    measure(wide_model, "My wide model")
    

    Results:

    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My deep model
    Flops: 107.940M
    Parameters: 28.288K
    time(ms) : 0.6160109861691793
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My wide model
    Flops: 106.279M
    Parameters: 28.672K
    time(ms) : 0.1514971748739481
    

    As you can see, the models have a similar number of parameters/FLOPs, but the compute time is about 4x larger for the deep network.
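
    As a sanity check, the wide model's count can be reproduced by hand: a single 3x3 convolution from 3 to 1024 channels on a 64x64 input with no padding produces a 62x62 output, and the resulting MAC count matches the "Flops" value thop prints above:

    # One 3x3 conv, 3 -> 1024 channels, 64x64 input, no padding -> 62x62 output
    k, c_in, c_out, h_out, w_out = 3, 3, 1024, 62, 62
    macs = k * k * c_in * c_out * h_out * w_out
    print(macs)  # 106278912, i.e. ~106.279M, matching the wide model above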

    This is just one possible reason why the inference time can be large while the number of parameters and FLOPs are low. You may also need to take other underlying hardware/software optimizations into account.
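
    If you want to see where the time actually goes (kernel launches, memory traffic, synchronization), a runtime profiler is more informative than theoretical FLOP counts. A minimal sketch using the standard torch.profiler API, assuming 'model' and 'dummy_input' as above:

    import torch
    from torch.profiler import ProfilerActivity, profile

    # Profile a single forward pass and list the most expensive CUDA kernels
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
    ) as prof:
        _ = model(dummy_input)

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))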