BayesWipe Database Cleaner

Getting started

An overview of BayesWipe: how to download and use it, examples, and more.

System Requirements

BayesWipe is written in C# and runs on Windows. It also relies on Banjo, which runs on Java.

  • Windows (tested on 7, 8, and 8.1)
  • .NET Framework 3.5 (it should be pre-installed on Windows 7 and later)
  • Java (needed to run Banjo; java.exe should be on your PATH)
  • Enough free disk space to hold about three times the size of your data
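
The last two requirements can be checked before launching the tool. The following is a minimal sketch in Python, not part of BayesWipe itself; the function names are hypothetical, and it assumes that the extra disk space is needed on the same drive as the input file, which is not stated above.

```python
import os
import shutil


def has_enough_space(data_bytes, free_bytes, factor=3):
    """True if free space covers about `factor` times the data size."""
    return free_bytes >= factor * data_bytes


def check_environment(data_path, java_name="java"):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Banjo is started through Java, so the executable must be found on PATH.
    # On Windows, shutil.which("java") also matches java.exe.
    if shutil.which(java_name) is None:
        problems.append(f"{java_name} was not found on PATH")

    # Assumption: working files are written to the drive holding the input.
    data_bytes = os.path.getsize(data_path)
    folder = os.path.dirname(os.path.abspath(data_path))
    free_bytes = shutil.disk_usage(folder).free
    if not has_enough_space(data_bytes, free_bytes):
        problems.append("less than three times the data size is free on disk")

    return problems
```

Calling check_environment("mydata.csv") with your own file name returns a list of messages; an empty list means Java was found and there is enough room.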

Data Requirements

BayesWipe works with any CSV file that meets the following requirements:

  • A single CSV file
  • Comma- or tab-separated
  • Fields must not be enclosed by double quotes (support for this is coming in a future version).
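
A file can be checked against these requirements before it is loaded. The sketch below is an illustration in Python and not part of BayesWipe; check_lines is a hypothetical helper. It guesses the delimiter from the first non-empty line, then reports any line that contains a double quote or has a different number of fields. Skipping blank lines is a choice made by this sketch.

```python
def check_lines(lines):
    """Return (delimiter, problems) for an iterable of text lines."""
    problems = []
    delimiter = None
    expected = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # blank lines are skipped by this sketch
        if delimiter is None:
            # The first non-empty line decides between tab and comma.
            delimiter = "\t" if "\t" in line else ","
            expected = line.count(delimiter) + 1
        if '"' in line:
            problems.append(f"line {number}: contains a double quote")
        fields = line.count(delimiter) + 1
        if fields != expected:
            problems.append(f"line {number}: {fields} fields, expected {expected}")
    return delimiter, problems
```

To check a file on disk, pass an open file object, for example check_lines(open("mydata.csv")).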

Step by step instructions

BayesWipe is fairly intuitive, but just in case, here is a step-by-step walkthrough.

  • Start by running BayesWipeGui.exe.
  • Click Browse, find your input data file, then click Next.
  • Optionally, provide or edit the names of your columns. Then choose the type of processing for each attribute:
    • For attributes that are unique keys (names, IDs, street addresses), whose values do not repeat in the database, choose 'ignore', since they cannot be fixed.
    • For numerical attributes, choose 'numerical' as the output type; they will be quantized during the cleaning process.
    • For the remaining attributes, keep the default, 'categorical'.
  • Click Next, and wait while BayesWipe learns the structure, the generative model, and the error statistics of the data, and then cleans it.
  • To see the output, open the file cleaned_1_to_1.txt when instructed. (Note: the ignored attributes are not shown in this file; support for that is coming in a future version.)
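
If you are unsure which processing type to pick for a column, the rules above can be approximated in a few lines. This is a rough sketch in Python and not BayesWipe's own logic; suggest_type and quantize are hypothetical names. The quantize function uses equal-width bins only to illustrate what quantizing means; the method BayesWipe actually uses is not described here.

```python
def suggest_type(values):
    """Suggest 'ignore', 'numerical', or 'categorical' for one column of strings."""
    # A column in which no value repeats behaves like a unique key.
    if len(set(values)) == len(values):
        return "ignore"
    try:
        for value in values:
            float(value)
        return "numerical"
    except ValueError:
        return "categorical"


def quantize(values, bins=4):
    """Equal-width binning (an assumption): map each number to a bin index 0..bins-1."""
    numbers = [float(value) for value in values]
    low, high = min(numbers), max(numbers)
    width = (high - low) / bins or 1.0  # a constant column gets a single bin
    return [min(int((number - low) / width), bins - 1) for number in numbers]
```

Treat the suggestion as a starting point only. A measurement column whose values all happen to be distinct would be marked 'ignore' by this heuristic and should be switched to 'numerical' by hand.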

Screenshots

This is what BayesWipe looks like when it runs.

[Screenshots: step 1, step 2, step 3, finished]

Sample Data

You can run BayesWipe on any CSV file, but if you are looking for sample data, try the UCI Machine Learning Repository; you just need the .data files. I have tested the system on the iris and mushroom databases, and it seems to work. Note, however, that this data may not be dirty, and there is no ground truth to compare against.
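
One way to get a ground truth is to keep the clean file and feed BayesWipe a deliberately corrupted copy. The sketch below is only an illustration in Python; corrupt_rows and its parameters are hypothetical and not part of BayesWipe. It replaces a random fraction of cells with a different value taken from the same column, and it assumes every row has the same number of fields.

```python
import random


def corrupt_rows(rows, error_rate=0.05, seed=0):
    """Return a copy of rows with some cells swapped for another value of the same column."""
    rng = random.Random(seed)  # fixed seed so the corruption is repeatable
    columns = list(zip(*rows))
    dirty = []
    for row in rows:
        new_row = list(row)
        for index, value in enumerate(row):
            if rng.random() < error_rate:
                # Candidate replacements: every other value seen in this column.
                alternatives = sorted(v for v in set(columns[index]) if v != value)
                if alternatives:
                    new_row[index] = rng.choice(alternatives)
        dirty.append(new_row)
    return dirty
```

The original rows then serve as the reference when you inspect what the cleaner changed.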

Current Limitations and Future Work

BayesWipe is a research project and is updated constantly.

  • The expected percentage of error cannot currently be set. Coming in v. 1.1.
  • The time allowed to Banjo is currently fixed at 1 minute. An option to change it is coming in v. 1.1.
  • There is no visual display of which values were changed by the system. Coming in v. 1.1.
  • The output file does not retain ignored attributes. Coming in v. 1.2.
  • Fields must not be enclosed by double quotes. Support is coming in v. 1.3.
  • Online querying of a database is not supported. Coming in v. 1.4.
  • Attribute types are not autodetected. Coming in v. 1.5.
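
Until quoted fields are supported, a quoted file can be rewritten without quotes beforehand. The following is a minimal sketch using Python's standard csv module, not part of BayesWipe; strip_quotes is a hypothetical helper. It raises csv.Error when a field cannot be written unquoted, for example when it contains the delimiter itself.

```python
import csv
import io


def strip_quotes(text, delimiter=","):
    """Return CSV text with enclosing double quotes removed from every field."""
    output = io.StringIO()
    # QUOTE_NONE with no escape character makes the writer refuse any field
    # that would need quoting instead of silently changing the data.
    writer = csv.writer(output, delimiter=delimiter,
                        quoting=csv.QUOTE_NONE, lineterminator="\n")
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        writer.writerow(row)
    return output.getvalue()
```

Fields that trigger the error need manual attention, since dropping their quotes would shift the columns.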