I calculated the FLOPs of my network in PyTorch, using the 'profile' function from the 'thop' library.
In my experiment, my network showed:
FLOPs: 619.038M, Parameters: 4.191M, Inference time: 25.911 ms
For comparison, I checked the FLOPs and parameters of ResNet50, which showed:
FLOPs: 1.315G, Parameters: 26.596M, Inference time: 8.553545 ms
Is it possible that the inference time is large while the FLOPs are low? Or are there operations whose FLOPs the 'profile' function can't measure? Similar results came out using 'FlopCountAnalysis' from 'fvcore.nn' and 'get_model_complexity_info' from 'ptflops'.
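For reference, fvcore can also print a per-module breakdown, which should show whether any layers are being skipped by the counter; a minimal sketch along these lines (assuming 'model' is my network):

# Sketch: per-module FLOP breakdown with fvcore (assumes 'model' is my network)
import torch
from fvcore.nn import FlopCountAnalysis, flop_count_table

model.eval().cuda()
dummy_input = torch.randn(1, 3, 32, 32).cuda()
flops = FlopCountAnalysis(model, dummy_input)
print(flop_count_table(flops))  # per-module FLOPs and parameters
print(flops.total())            # total FLOPs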
Here is the code I used to measure the inference time in PyTorch.
import numpy as np
import torch
from thop import clever_format, profile

model.eval()
model.cuda()
dummy_input = torch.randn(1, 3, 32, 32).cuda()

# flops = FlopCountAnalysis(model, dummy_input)
# print(flop_count_table(flops))
# print(flops.total())

macs, params = profile(model, inputs=(dummy_input,))
macs, params = clever_format([macs, params], "%.3f")
print('Flops:', macs)
print('Parameters:', params)

starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions, 1))

# GPU warm-up
for _ in range(10):
    _ = model(dummy_input)

# MEASURE PERFORMANCE
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)  # elapsed_time() returns milliseconds
        timings[rep] = curr_time

print('time(ms) :', np.average(timings))
It is an absolutely normal situation. FLOPs (or MACs) are theoretical measures: they deliberately ignore the hardware/software optimizations that make the same operations run faster or slower on different hardware.
For example, in the case of neural networks, different architectures will have different CPU/GPU utilization. Let's consider two simple architectures with almost the same number of parameters / FLOPs:
layers = [nn.Conv2d(3, 16, 3)]
for _ in range(12):
    layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
deep_model = nn.Sequential(*layers)

wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))
Modern GPUs allow you to parallelize a large number of simple operations. But when you have a deep network, you need the outputs of layer[i] to compute the outputs of layer[i+1], so the layers must be computed one after another. This sequential dependency becomes a blocking factor that reduces the utilization of your hardware: the deep model above launches 13 small convolutions in sequence, while the wide model computes one large convolution whose work can be parallelized.
Complete example:
import numpy as np
import torch
from thop import clever_format, profile
from torch import nn


def measure(model, name):
    model.eval()
    model.cuda()
    dummy_input = torch.randn(1, 3, 64, 64).cuda()

    macs, params = profile(model, inputs=(dummy_input,), verbose=0)
    macs, params = clever_format([macs, params], "%.3f")
    print("<" * 50, name)
    print("Flops:", macs)
    print("Parameters:", params)

    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    repetitions = 300
    timings = np.zeros((repetitions, 1))

    # GPU warm-up
    for _ in range(10):
        _ = model(dummy_input)

    # MEASURE PERFORMANCE
    with torch.no_grad():
        for rep in range(repetitions):
            starter.record()
            _ = model(dummy_input)
            ender.record()
            # WAIT FOR GPU SYNC
            torch.cuda.synchronize()
            curr_time = starter.elapsed_time(ender)  # milliseconds
            timings[rep] = curr_time

    print("time(ms) :", np.average(timings))


# 13 stacked 3x3 convolutions: each layer must wait for the previous one
layers = [nn.Conv2d(3, 16, 3)]
for _ in range(12):
    layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
deep_model = nn.Sequential(*layers)
measure(deep_model, "My deep model")

# a single wide convolution with a similar parameter/FLOP budget
wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))
measure(wide_model, "My wide model")
Results:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My deep model
Flops: 107.940M
Parameters: 28.288K
time(ms) : 0.6160109861691793
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My wide model
Flops: 106.279M
Parameters: 28.672K
time(ms) : 0.1514971748739481
As you can see, the models have a similar number of parameters/FLOPs, but the computation time is about 4x larger for the deep network.
This is just one possible reason why inference time can be large while the number of parameters and FLOPs is low. You may also need to take other underlying hardware/software optimizations into account.
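If you want to see where the time actually goes, torch.profiler can break a forward pass down per operator and report GPU time; a minimal sketch reusing 'deep_model' from the example above (the alias 'torch_profile' just avoids clashing with thop's 'profile'):

# Sketch: per-operator GPU time with torch.profiler (reuses deep_model from above)
import torch
from torch.profiler import ProfilerActivity, profile as torch_profile

dummy_input = torch.randn(1, 3, 64, 64).cuda()
with torch.no_grad():
    with torch_profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        _ = deep_model(dummy_input)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

A table like this shows how the total time is split across individual kernels, which is where the many small sequential convolutions of the deep model become visible.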