I am training an object detection model using the YOLO11 models from Ultralytics, and I am noticing something very strange: the yolo-nano model is turning out to be slower than the yolo-small model. This makes no sense, since yolo-nano is around one third the size of the small model, so by all accounts its inference should be faster. Why is that not the case? Here is a short script to measure and report the inference speed of the models.
import time
import statistics

import cv2
from ultralytics import YOLO

# Configuration
IMAGE_PATH = "./artifacts/cars.jpg"
MODELS_TO_TEST = ['n', 's', 'm', 'l', 'x']
NUM_RUNS = 100
WARMUP_RUNS = 10
INPUT_SIZE = 640


def benchmark_model(model_name):
    """Benchmark a YOLO model and return timing stats."""
    print(f"\nTesting {model_name}...")

    # Load model
    model = YOLO(f'yolo11{model_name}.pt')

    # Load image
    image = cv2.imread(IMAGE_PATH)

    # Warmup
    for _ in range(WARMUP_RUNS):
        model(image, imgsz=INPUT_SIZE, verbose=False)

    # Benchmark
    times = []
    for i in range(NUM_RUNS):
        start = time.perf_counter()
        model(image, imgsz=INPUT_SIZE, verbose=False)
        end = time.perf_counter()
        times.append((end - start) * 1000)
        if (i + 1) % 20 == 0:
            print(f"  {i + 1}/{NUM_RUNS}")

    # Calculate stats, trimming the 5 fastest and 5 slowest runs as outliers
    times = sorted(times)[5:-5]
    mean_ms = statistics.mean(times)
    fps = 1000 / mean_ms
    return {
        'model': model_name,
        'mean_ms': mean_ms,
        'fps': fps,
        'min_ms': min(times),
        'max_ms': max(times)
    }


def main():
    print(f"Benchmarking YOLO11 models on {IMAGE_PATH}")
    print(f"Input size: {INPUT_SIZE}, Runs: {NUM_RUNS}")
    results = []
    for model in MODELS_TO_TEST:
        result = benchmark_model(model)
        results.append(result)
        print(f"{model}: {result['mean_ms']:.1f}ms ({result['fps']:.1f} FPS)")

    print(f"\n{'Model':<12} {'Mean (ms)':<12} {'FPS':<8}")
    print("-" * 32)
    for r in results:
        print(f"{r['model']:<12} {r['mean_ms']:<12.1f} {r['fps']:<8.1f}")


if __name__ == "__main__":
    main()
The result I am getting from this run is:
Model Mean (ms) FPS
--------------------------------
n 9.9 100.7
s 6.6 150.4
m 9.8 102.0
l 13.0 77.1
x 23.1 43.3
I am running this on an NVIDIA 4060. I tested it on a MacBook Pro with an M1 chip as well, and I am getting similar results. Why could this be happening?
I believe it's just that the nano model is simply too small to benefit from the GPU. This is the benchmark I get when running the n, s, and m variants on the NVIDIA machine (GPU):
Model Per Image (ms) FPS
--------------------------------
n 7.61 131.5
s 6.13 163.1
m 11.11 90.0
In isolation, this ordering seems to make no sense. However, running the same benchmark on the CPU gives:
Model Per Image (ms) FPS
--------------------------------
n 25.74 38.8
s 57.50 17.4
m 148.19 6.7
Everything is slower than on the GPU, as expected, but now nano is much faster than small.
The nano model is probably too small to properly utilize the GPU. The fixed per-call costs, such as sending the image from the CPU to the GPU and launching the kernels, likely outweigh whatever advantage the smaller model has in raw compute.
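To make that intuition concrete, here is a toy latency model. All layer counts and per-layer times below are made-up illustrative numbers, not measurements of the real YOLO11 variants: the idea is that on a GPU every layer pays at least a fixed dispatch/launch floor, so a layer whose compute is tiny still costs that floor, while on the CPU there is no comparable floor.

```python
# Toy latency model -- the layer counts and per-layer times are
# illustrative assumptions, not measurements of the real models.

LAUNCH_MS = 0.03  # assumed fixed per-layer dispatch/launch overhead on GPU

def gpu_ms(num_layers, compute_per_layer_ms):
    # Each layer costs at least the launch floor, however small its compute.
    return num_layers * max(compute_per_layer_ms, LAUNCH_MS)

def cpu_ms(num_layers, compute_per_layer_ms):
    # On the CPU the compute itself dominates; no per-layer floor.
    return num_layers * compute_per_layer_ms

# Pretend nano and small have similar depth but different per-layer compute.
nano_gpu, small_gpu = gpu_ms(200, 0.01), gpu_ms(200, 0.02)
nano_cpu, small_cpu = cpu_ms(200, 0.13), cpu_ms(200, 0.29)

print(f"GPU: nano {nano_gpu:.1f} ms vs small {small_gpu:.1f} ms")  # 6.0 vs 6.0: both launch-bound
print(f"CPU: nano {nano_cpu:.1f} ms vs small {small_cpu:.1f} ms")  # 26.0 vs 58.0: nano wins
```

This toy model doesn't explain why nano comes out measurably *slower* than small on the GPU, but it does show how a model a third the size can fail to be any faster once per-layer overhead dominates; the remaining gap is plausibly noise or scheduling effects on top of that floor.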
Maybe `nano` makes sense if you are doing true edge computing on a Raspberry Pi or similar. As long as you have some GPU, yolo-small simply strikes the right balance between fixed setup cost and actual inference time.