python, algorithm, machine-learning, anomaly-detection

Machine learning options to detect errors in a large number of SQL tables?


I'm new to ML and want to build a system that can detect errors or anomalies in input data that I receive from customers. The data is structured in SQL tables with various column names. The value types vary between columns, but the most common are numbers, strings, and dates.

Some of the values in these tables will be wrong. Examples of errors that I can encounter are:

Up until now, the best option I can envision is to run some unsupervised outlier detection algorithm. But from what I have understood by reading about these algorithms online, they do not really do much machine learning; they just classify values based on fixed threshold criteria.

The input data can reside in hundreds of tables with tens or hundreds of columns each, so going through the data structure manually is a daunting task. My aim is a system that, just by looking at the data in one column, can detect the data type and automatically flag the outliers.
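A minimal, purely illustrative sketch of that per-column idea: guess the dominant type of a column from its raw values, and for numeric columns flag outliers with a simple Tukey (IQR) fence. The function names and the ISO-date regex are my own assumptions, not part of any existing tool.

```python
import re

# Assumed format: only ISO dates like "2020-01-02" count as dates here.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_type(values):
    """Guess the dominant type of a column from its raw string values."""
    def classify(v):
        if DATE_RE.match(v):
            return "date"
        try:
            float(v)
            return "number"
        except ValueError:
            return "string"
    counts = {}
    for v in values:
        t = classify(v)
        counts[t] = counts.get(t, 0) + 1
    # Majority vote: the most common type wins.
    return max(counts, key=counts.get)

def iqr_outliers(numbers, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(numbers)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in numbers if x < lo or x > hi]
```

For example, `infer_type(["1", "2", "x"])` would call the column numeric despite one bad value, and `iqr_outliers` on a mostly-clean numeric column would surface the stray extreme values.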

Since I think there are patterns to be found in the errors that occur, and since my dataset is huge, I would like to try a semi-supervised algorithm: I would review the errors the algorithm suggests, marking false positives and so on, and feed those labels back into the algorithm to improve its predictions.
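One very simple way to close that feedback loop, sketched here with invented function names and an arbitrary precision target: an unsupervised z-score detector proposes suspects, a reviewer labels each flag as a real error or a false positive, and the flagging threshold is nudged up when precision is too low.

```python
from statistics import mean, stdev

def zscore_flags(values, threshold):
    """Flag values whose z-score exceeds the given threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def update_threshold(threshold, labels, target_precision=0.8, step=0.25):
    """Naive feedback rule: labels are True for confirmed errors and
    False for false positives; if too many flags were false positives,
    raise the threshold so the detector becomes more conservative."""
    if not labels:
        return threshold
    precision = sum(labels) / len(labels)
    return threshold + step if precision < target_precision else threshold
```

This is not a learning algorithm in any deep sense, but it captures the shape of the loop; a real system would replace `update_threshold` with a classifier trained on the accumulated labels.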

Right now, I have started off using Python but have no idea which algorithms to use or how to build a proper pipeline that adapts my input data to work well with the classifiers.
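As a starting point in Python, one commonly suggested algorithm for exactly this kind of unlabelled anomaly detection is scikit-learn's `IsolationForest`. The sketch below runs it on a made-up numeric column with one planted error; the data and parameters are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical numeric column: 200 plausible values plus one planted error.
rng = np.random.default_rng(0)
column = np.concatenate([rng.normal(100, 5, size=200), [400.0]])

# IsolationForest learns what "normal" looks like from the data itself;
# fit_predict returns -1 for suspected anomalies and 1 for inliers.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(column.reshape(-1, 1))
suspects = column[labels == -1]
```

The `contamination` parameter is the expected fraction of errors, which you would tune from your review feedback rather than guess up front.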

I would be very grateful if someone could suggest algorithms and steps I could use to implement the system I have in mind, or point me to existing tools for this.

Thanks!


Solution

  • IMO, you are forgetting the most important check: values outside the normal range. These ranges can be found by simple statistical observation or by... common sense.
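A minimal sketch of that "statistical observation" approach, assuming you have a batch of trusted historical values per column: learn a plausible range as mean ± k standard deviations, then check incoming values against it. The function names and the choice of k=3 are my own.

```python
from statistics import mean, stdev

def learn_range(values, k=3.0):
    """Learn a plausible [low, high] range from trusted historical data
    as mean +/- k standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return mu - k * sigma, mu + k * sigma

def out_of_range(values, lo, hi):
    """Return the incoming values that fall outside the learned range."""
    return [v for v in values if not lo <= v <= hi]
```

For a column of, say, customer ages around 30, a learned range of roughly 24 to 40 would immediately catch entries like 250 or -5; for ranges driven by common sense (ages, percentages, dates in the future), you would hard-code the bounds instead of learning them.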