I am currently preparing my dataset in order to train an SSD model on it.
I was wondering whether I have to annotate each of my images for each of my classes, or whether I could crop my images to isolate each object and put the crops in the folder of the class they belong to.
With the first method I would get something like:
dataset
|
|_annotations
|  |
|  |_001.xml
|  |_002.xml
|  |_...
|
|_images
   |
   |_001.jpg
   |_002.jpg
   |_...
With the second method:
dataset
|
|_class1
|  |
|  |_crop01.jpg
|  |_crop02.jpg
|  |_...
|
|_class2
   |
   |_crop01.jpg
   |_crop02.jpg
   |_...
Would there be a difference in the training process between the two methods? I have noticed that classification models use the second method, while detectors (such as YOLO or SSD) use the first one.
Is that just a convention or a requirement, or can both methods be used for both classification and detection? What would be the effect of training a detection model using the cropping method?
Thanks in advance for your help
The SSD model is fed the whole image together with the bounding boxes of the objects in it. You can't recreate that with the second approach (which, as you said, is used for classification): once the objects are cropped out, the information about where each one sits in the original image is gone. The detection model learns to output bounding box offsets along with the class, so it needs the original images along with the annotations.
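
To make that concrete, here is a minimal sketch of what a detector needs from each annotation file. It assumes your .xml files follow the Pascal VOC layout (one `object` element per instance, with a `name` and a `bndbox`); your files may use a different schema. The function name `parse_voc_annotation`, the class names and the index mapping are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_voc_annotation(xml_text, class_to_index):
    """Return (boxes, labels) for one image from a Pascal VOC-style XML string.

    boxes  -> list of [xmin, ymin, xmax, ymax] in pixel coordinates
    labels -> list of integer class ids, one per box
    """
    root = ET.fromstring(xml_text)
    boxes, labels = [], []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Coordinates are stored as text; go through float() in case of "30.0".
        box = [int(float(bndbox.findtext(tag)))
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(box)
        labels.append(class_to_index[obj.findtext("name")])
    return boxes, labels


# Hypothetical annotation for 001.jpg with two objects (assumed VOC layout).
EXAMPLE_XML = """
<annotation>
  <filename>001.jpg</filename>
  <object>
    <name>class1</name>
    <bndbox><xmin>30</xmin><ymin>40</ymin><xmax>120</xmax><ymax>200</ymax></bndbox>
  </object>
  <object>
    <name>class2</name>
    <bndbox><xmin>150</xmin><ymin>60</ymin><xmax>280</xmax><ymax>180</ymax></bndbox>
  </object>
</annotation>
"""

# Index 0 is left free here for a background class (an assumption, not a rule).
CLASS_TO_INDEX = {"class1": 1, "class2": 2}

boxes, labels = parse_voc_annotation(EXAMPLE_XML, CLASS_TO_INDEX)
print(boxes)
print(labels)
```

Each training sample for the detector is then the full image plus these `boxes` and `labels`. You could always generate the class-folder crops from such annotations if you also want to train a classifier, but you can't go the other way: a folder of crops no longer tells you where the objects were.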