I am training an object detection model using the YOLO11 models from Ultralytics, and I am noticing something very strange: the yolo-nano model is turning out to be slower than the yolo-small model. This makes no sense, since yolo-nano is around one third the size of the small model, so by all accounts its inference should be faster. Why is that not the case? Here is a short script to measure and report the inference speed of the models.
import time
import statistics

import cv2
from ultralytics import YOLO

# Configuration
IMAGE_PATH = "./artifacts/cars.jpg"
MODELS_TO_TEST = ['n', 's', 'm', 'l', 'x']
NUM_RUNS = 100
WARMUP_RUNS = 10
INPUT_SIZE = 640


def benchmark_model(model_name):
    """Benchmark a YOLO model and return timing stats."""
    print(f"\nTesting {model_name}...")

    # Load model
    model = YOLO(f'yolo11{model_name}.pt')

    # Load image
    image = cv2.imread(IMAGE_PATH)

    # Warmup
    for _ in range(WARMUP_RUNS):
        model(image, imgsz=INPUT_SIZE, verbose=False)

    # Benchmark
    times = []
    for i in range(NUM_RUNS):
        start = time.perf_counter()
        model(image, imgsz=INPUT_SIZE, verbose=False)
        end = time.perf_counter()
        times.append((end - start) * 1000)
        if (i + 1) % 20 == 0:
            print(f"  {i + 1}/{NUM_RUNS}")

    # Calculate stats, trimming the 5 fastest and 5 slowest runs as outliers
    times = sorted(times)[5:-5]
    mean_ms = statistics.mean(times)
    fps = 1000 / mean_ms
    return {
        'model': model_name,
        'mean_ms': mean_ms,
        'fps': fps,
        'min_ms': min(times),
        'max_ms': max(times)
    }


def main():
    print(f"Benchmarking YOLO11 models on {IMAGE_PATH}")
    print(f"Input size: {INPUT_SIZE}, Runs: {NUM_RUNS}")
    results = []
    for model in MODELS_TO_TEST:
        result = benchmark_model(model)
        results.append(result)
        print(f"{model}: {result['mean_ms']:.1f}ms ({result['fps']:.1f} FPS)")

    print(f"\n{'Model':<12} {'Mean (ms)':<12} {'FPS':<8}")
    print("-" * 32)
    for r in results:
        print(f"{r['model']:<12} {r['mean_ms']:<12.1f} {r['fps']:<8.1f}")


if __name__ == "__main__":
    main()
The result I am getting from this run is:
Model Mean (ms) FPS
--------------------------------
n 9.9 100.7
s 6.6 150.4
m 9.8 102.0
l 13.0 77.1
x 23.1 43.3
I am running this on an NVIDIA 4060. I tested it on a MacBook Pro with an M1 chip as well, and I am getting similar results. Why could this be happening?
I believe it's just that the nano model is simply too small to benefit from the GPU. This is the benchmark I get when running the n, s, and m variants on the NVIDIA machine (GPU):
Model Per Image (ms) FPS
--------------------------------
n 7.61 131.5
s 6.13 163.1
m 11.11 90.0
In isolation, this ordering seems to make no sense. However, running the same benchmark on the CPU gives:
Model Per Image (ms) FPS
--------------------------------
n 25.74 38.8
s 57.50 17.4
m 148.19 6.7
Everything is slower than on the GPU, as expected, but now nano is much faster than small.
The nano model is probably too small to properly utilize the GPU. The fixed per-call costs, such as sending the image from the CPU to the GPU and launching the kernels, likely outweigh whatever advantage the smaller model has in raw compute.
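To make that intuition concrete, here is a toy latency model. All layer counts and per-layer times below are made-up illustrative numbers, not measurements of the real YOLO11 variants: the idea is that on a GPU every layer pays at least a fixed dispatch/launch floor, so a layer whose compute is tiny still costs that floor, while on the CPU there is no comparable floor.

```python
# Toy latency model -- the layer counts and per-layer times are
# illustrative assumptions, not measurements of the real models.

LAUNCH_MS = 0.03  # assumed fixed per-layer dispatch/launch overhead on GPU

def gpu_ms(num_layers, compute_per_layer_ms):
    # Each layer costs at least the launch floor, however small its compute.
    return num_layers * max(compute_per_layer_ms, LAUNCH_MS)

def cpu_ms(num_layers, compute_per_layer_ms):
    # On the CPU the compute itself dominates; no per-layer floor.
    return num_layers * compute_per_layer_ms

# Pretend nano and small have similar depth but different per-layer compute.
nano_gpu, small_gpu = gpu_ms(200, 0.01), gpu_ms(200, 0.02)
nano_cpu, small_cpu = cpu_ms(200, 0.13), cpu_ms(200, 0.29)

print(f"GPU: nano {nano_gpu:.1f} ms vs small {small_gpu:.1f} ms")  # 6.0 vs 6.0: both launch-bound
print(f"CPU: nano {nano_cpu:.1f} ms vs small {small_cpu:.1f} ms")  # 26.0 vs 58.0: nano wins
```

This toy model doesn't explain why nano comes out measurably *slower* than small on the GPU, but it does show how a model a third the size can fail to be any faster once per-layer overhead dominates; the remaining gap is plausibly noise or scheduling effects on top of that floor.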
Maybe `nano` makes sense if you are doing true edge computing on a Raspberry Pi or similar. As long as you have some GPU, yolo-small simply strikes the right balance between fixed setup cost and actual inference time.