BayesWipe is a product of the information integration research group at Arizona State University:

- Sushovan De
- Yuheng Hu
- Preet Inder Singh Rihan
- Yi Chen
- Subbarao Kambhampati

Unlike other data cleaning tools, BayesWipe learns both the generative model and the error model directly from the dirty data itself. It uses a Bayesian network as the generative model of the data, and a maximum entropy model (based on edit distance and a substitution probability measure) as the error model.
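
To give a feel for how an error model of this kind can score a dirty tuple against a candidate clean one, here is a minimal Python sketch. It is an illustration under stated assumptions, not BayesWipe's actual code: the function names, the feature weights `w_edit` and `w_sub`, the `sub_prob` lookup table, and the `floor` value are all hypothetical, and the score is left unnormalized.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_score(observed, clean, sub_prob, w_edit=1.0, w_sub=1.0, floor=1e-6):
    """Unnormalized exp(weighted feature sum) for observing `observed` given `clean`.

    `sub_prob` maps (clean_value, observed_value) pairs to an assumed
    substitution probability; unseen pairs fall back to `floor`.
    All weights and the floor are illustrative assumptions.
    """
    total = 0.0
    for obs_value, clean_value in zip(observed, clean):
        if obs_value == clean_value:
            continue  # an unchanged attribute contributes nothing
        f_edit = -edit_distance(obs_value, clean_value)
        p_sub = max(sub_prob.get((clean_value, obs_value), 0.0), floor)
        total += w_edit * f_edit + w_sub * math.log(p_sub)
    return math.exp(total)
```

With these assumptions, a candidate that differs from the observed tuple by a small typo scores higher than one that differs by an unrelated value, and an identical tuple scores exactly 1.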

The tool first reads the input data, strips all unique keys from it, quantizes any continuous attributes, and then trains a Bayesian model on the resulting data. This Bayesian learner ignores small perturbations and learns the overall model of the data. It then estimates error statistics from this data. Finally, it processes the input data tuple by tuple and cleans each one.
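
The preprocessing steps (dropping unique keys and quantizing continuous attributes) can be sketched in a few lines of Python. This is only an illustration: the helper names `strip_keys` and `quantize`, the explicit list of key columns, and the equal-width binning are assumptions made for the example, and BayesWipe may detect keys and choose bins differently.

```python
def strip_keys(rows, key_columns):
    """Return copies of the rows without the given unique-key columns."""
    keys = set(key_columns)
    return [{col: val for col, val in row.items() if col not in keys}
            for row in rows]


def quantize(rows, column, bins=4):
    """Replace a continuous column with an equal-width bin index in [0, bins - 1]."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    quantized = []
    for row in rows:
        index = min(int((row[column] - lo) / width), bins - 1)
        quantized.append({**row, column: index})
    return quantized


# Hypothetical input: "id" is a unique key, "price" is continuous.
rows = [
    {"id": 1, "make": "Honda", "price": 10.0},
    {"id": 2, "make": "Honda", "price": 20.0},
    {"id": 3, "make": "Ford", "price": 30.0},
]
prepared = quantize(strip_keys(rows, ["id"]), "price", bins=2)
```

After this step every attribute is discrete, which is the form a Bayesian network learner over categorical data would expect.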

To clean a tuple, BayesWipe first creates a set of candidate clean tuples as possible replacements for it. The original tuple is always included in this set. If T is the original tuple and T* is a candidate, BayesWipe selects the T* that maximizes P(T*|T).
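
By Bayes' rule, P(T*|T) is proportional to P(T*) · P(T|T*), so the selection step amounts to an argmax over the candidate set. The Python sketch below shows that step with deliberately simple stand-ins: the prior is an empirical tuple frequency rather than a Bayesian network, and the likelihood is a toy per-attribute mismatch penalty rather than a maximum entropy model. The function names, the sample data, and the 0.2 penalty are all hypothetical.

```python
from collections import Counter


def clean_tuple(observed, candidates, prior, likelihood):
    """Return the candidate T* maximizing prior(T*) * likelihood(T, T*).

    The observed tuple is always added to the candidate pool, so a tuple
    that is already the most probable explanation is left unchanged.
    """
    pool = list(candidates)
    if observed not in pool:
        pool.append(observed)
    return max(pool, key=lambda cand: prior(cand) * likelihood(observed, cand))


# Toy data: one misspelled make among mostly consistent tuples.
data = [("Honda", "Civic")] * 8 + [("Hodna", "Civic")] + [("Ford", "Focus")]
counts = Counter(data)


def prior(candidate):
    """Empirical frequency, standing in for the learned generative model."""
    return counts[candidate] / len(data)


def likelihood(observed, candidate):
    """Toy error model: each differing attribute multiplies the score by 0.2."""
    mismatches = sum(o != c for o, c in zip(observed, candidate))
    return 0.2 ** mismatches


candidates = list(counts)
repaired = clean_tuple(("Hodna", "Civic"), candidates, prior, likelihood)
unchanged = clean_tuple(("Honda", "Civic"), candidates, prior, likelihood)
```

In this toy setting the rare misspelled tuple is replaced by the frequent one that differs from it in a single attribute, while the frequent tuple is kept as it is.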

Further details can be found in our paper.

