Data science how to avoid data dredging

3/1/2024

Hypothesizing after the results are known (HARK), a term coined by Kerr ( 1998), defines the presentation of a post hoc hypothesis as if it had been made a priori. The community itself is thereby supported to be more productive in generating and critically evaluating theories that integrate wider, complex systems. Validation underpins ‘natural selection’ in a knowledge base maintained by the scientific community. With a model-centered paradigm, the reproducibility focus changes from the ability of others to reproduce both data and specific inferences from a study to the ability to evaluate models as representation of reality. Reproducibility is attained by employing two levels of model validation: internal (relative to data collated around hypotheses) and external (independent to the hypotheses used to generate data or to the data used to generate hypotheses). We here propose a HARK-solid, reproducible inference framework suitable for big data, based on models that represent formalization of hypotheses. Some of the HARK precautions can conflict with the modern reality of researchers’ obligations to use big, ‘organic’ data sources-from high-throughput genomics to social media streams.

Despite potential drawbacks, HARK has deepened thinking about complex causal processes.

Hypothesizing after the results are known (HARK) has been disparaged as data dredging, and safeguards including hypothesis preregistration and statistically rigorous oversight have been recommended.

0 Comments

Data science how to avoid data dredging

Leave a Reply.

Author

Archives

Categories