sqlssisdata-miningetldata-quality

How to use pure SQL for Exploratory Data Analysis?


I'm an ETL developer using different tools for ETL tasks. The same question rises in all our projects: the importance of the data profiling before the Data Warehouse is build and before the ETL is build for data movement. Usually I have done data profiling (i.e. finding bad data, data anomalies, counts, distinct values etc.) using pure SQL because ETL tools does not provide a good alternative for these (there is some data quality components in our tools, but they are not so sophisticated). One option is to use R programming language or SPSS Modeler etc. kind of tools for this kind of Exploratory Data Analysis. But usually these kinds of tools are not available or does not qualify if there is millions of rows of data.

How to do this kind of profiling using SQL? Is there any helper scripts available? How do you do this kind of Exploratory Data Analysis before data cleaning and ETL?


Solution

  • I found a good tool for this purpose: Datacleaner. This seems to do most of things I want to do with data in EDA process.