machine-learning, keras, computer-vision, cross-validation, medical-imaging

Is k-fold cross-validation a smarter idea than using a single validation set?


I have a somewhat large (~2000) set of medical images that I plan to use to train a computer vision model (using the EfficientNet architecture) at my workplace. In preparation, I have been reading up on good practices for training on medical images. I have split the dataset by patient to prevent leakage, and split my data into train:test:val at a 60:20:20 ratio. However, I read that k-fold cross-validation is a newer practice than using a validation set, although I was advised against it because k-fold is supposedly far more complicated. What would you recommend in this instance, and are there any other good practices I should adopt?


Solution

  • Common Practice

    A train:test split with cross-validation on the training set is part of the standard workflow in many machine learning pipelines. For an example and further details, I recommend the excellent sklearn article on it; a minimal sketch of the workflow is also included at the end of this answer.

    Implementation

    The implementation may be a little trickier but should not be prohibitive: assuming you are using TensorFlow or PyTorch, there are many code examples available (see e.g. this SO question), and a hedged Keras sketch is included at the end of this answer.

    Should you be using k-fold cross validation?

    Compared to a single validation set, k-fold cross-validation avoids over-fitting hyperparameters to one fixed validation split and makes better use of the available data, since every training example serves as validation data in some fold, albeit at greater computational cost. Whether this makes a big difference depends on your task: 2000 images is not a lot in computer vision terms, so making good use of the data may well be relevant to you, especially if you plan on tuning hyperparameters. If you do use cross-validation, keep your patient-level grouping when building the folds, so that images from the same patient never end up in both the training and validation folds (see the second sketch below).
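
    Sketch: standard workflow (train/test split + cross-validation)

    A minimal sketch of the workflow described under "Common Practice", using scikit-learn only. The feature matrix X, the labels y, and the simple logistic-regression model are stand-ins for illustration; the point is the structure: hold the test set out once, cross-validate on the training portion, and only touch the test set at the very end.

        # Minimal sketch: hold out a test set once, then run k-fold
        # cross-validation on the remaining training data.
        import numpy as np
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 32))      # stand-in features (e.g. image embeddings)
        y = rng.integers(0, 2, size=2000)    # stand-in binary labels

        # Single held-out test set, never touched during tuning.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0
        )

        # 5-fold cross-validation on the training portion only.
        model = LogisticRegression(max_iter=1000)
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
        print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

        # After settling on hyperparameters, refit on all training data
        # and evaluate exactly once on the held-out test set.
        model.fit(X_train, y_train)
        print("Test accuracy:", model.score(X_test, y_test))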
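
    Sketch: k-fold cross-validation with Keras and patient-level grouping

    Since you already split by patient to prevent leakage, the same rule should apply within the folds. The sketch below shows one way to do this with a Keras model, using sklearn's GroupKFold so that all images from one patient land in the same fold. The arrays images, labels and patient_ids, as well as build_model(), are hypothetical placeholders; swap in your own data pipeline and EfficientNet model.

        # Hedged sketch: manual k-fold cross-validation for a Keras model,
        # grouping folds by patient so no patient appears in both the
        # training and validation folds.
        import numpy as np
        import tensorflow as tf
        from sklearn.model_selection import GroupKFold

        def build_model():
            # Placeholder model; replace with your EfficientNet-based network.
            return tf.keras.Sequential([
                tf.keras.Input(shape=(224, 224, 3)),
                tf.keras.layers.GlobalAveragePooling2D(),
                tf.keras.layers.Dense(1, activation="sigmoid"),
            ])

        # Stand-in data: 200 images, 4 images per patient (50 patients).
        images = np.random.rand(200, 224, 224, 3).astype("float32")
        labels = np.random.randint(0, 2, size=200)
        patient_ids = np.repeat(np.arange(50), 4)

        gkf = GroupKFold(n_splits=5)
        fold_scores = []
        for fold, (train_idx, val_idx) in enumerate(
                gkf.split(images, labels, groups=patient_ids)):
            model = build_model()  # fresh weights for every fold
            model.compile(optimizer="adam",
                          loss="binary_crossentropy",
                          metrics=[tf.keras.metrics.AUC()])
            model.fit(images[train_idx], labels[train_idx],
                      validation_data=(images[val_idx], labels[val_idx]),
                      epochs=3, batch_size=32, verbose=0)
            _, auc = model.evaluate(images[val_idx], labels[val_idx], verbose=0)
            fold_scores.append(auc)
            print(f"Fold {fold}: validation AUC = {auc:.3f}")

        print(f"Mean CV AUC: {np.mean(fold_scores):.3f}")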