A recent study challenges the conventional wisdom of ‘Garbage In, Garbage Out’ (GIGO) in AI/ML modeling, suggesting that aggressive manual data cleaning can sometimes lower the predictive ceiling of models trained on big data.
The traditional mindset assumes that ‘clean’ data is non-negotiable for predictive AI/ML modeling, but the authors argue that this assumption can be limiting. They identify two types of ‘noise’: Predictor Error (random typos, dropped logs, or transient glitches) and Structural Uncertainty (the inherent gap between recorded metrics and the complex reality they represent).
While manual scrubbing can reduce Predictor Error, it cannot address Structural Uncertainty. A comprehensive, high-dimensional data architecture paired with a flexible model, however, can let the model triangulate the hidden drivers of an outcome reliably despite individual data errors.
The study shows that retaining a large portfolio of messy, highly correlated variables (even error-prone ones) lets the model average out individual errors and narrow the gap left by Structural Uncertainty, redefining ‘data quality’ to include the comprehensiveness and redundancy of the variable portfolio.
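The intuition behind this redundancy argument can be illustrated with a small simulation (this is a sketch of the general statistical idea, not code from the study): many error-prone proxies of a hidden driver, pooled together, track that driver far better than any single ‘clean-looking’ proxy, because independent errors cancel out. The variable names (`z` for the hidden driver, `proxy_panel` for the correlated predictors) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                          # number of observations
z = rng.normal(size=n)            # hidden driver (unobservable in practice)

def proxy_panel(k, noise_sd=2.0):
    # k error-prone, highly correlated proxies of the hidden driver z:
    # each proxy is z plus independent measurement error
    return z[:, None] + rng.normal(scale=noise_sd, size=(n, k))

def corr_with_driver(X):
    # pool the proxies by simple averaging, then measure how well
    # the pooled estimate tracks the true hidden driver
    pooled = X.mean(axis=1)
    return np.corrcoef(pooled, z)[0, 1]

r_single = corr_with_driver(proxy_panel(1))    # one noisy variable
r_many = corr_with_driver(proxy_panel(100))    # a redundant portfolio

print(f"1 noisy proxy:     r = {r_single:.2f}")
print(f"100 noisy proxies: r = {r_many:.2f}")
```

With these settings, a single proxy correlates only weakly with the hidden driver, while the average of 100 equally noisy proxies recovers it almost perfectly: the individual errors are ‘drowned out’, exactly the effect the authors attribute to a comprehensive, redundant variable portfolio.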
The authors conclude that treating GIGO as a universal law can be a trap, especially in the context of big data, and that a more nuanced approach to data quality is needed.
Photo by Ron Lach on Pexels
