Startup businesses and new projects sometimes collect their data haphazardly at first. The amount is small, so enforced error checking seems like too much effort. As the information grows, the text files, spreadsheets, and simple databases become harder to manage.
Eventually, it’s time to move to a real information management system. With the move comes the job of legacy data migration. A growing stakeholder base and new opportunities make clean data more important than ever, and migration without cleanup leads to inefficient operations and costly errors. Keep three related but distinct considerations in mind: accuracy, consistency, and completeness.
Accuracy
Information entered manually, with no error checking in place, is bound to include errors. Optical character recognition (OCR) has the same problem. Copying and pasting can pick up the wrong information or leave some out.
It’s usually hard to tell whether a given value is accurate, but automatic checking during legacy data migration can catch many errors. Dates and measures can be in the wrong format or out of range. Required items may be missing or contain just a placeholder such as “?” or “NA.” An error check is an essential part of the process.
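The checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `name` and `test_date`, the ISO date format, the placeholder list, and the earliest plausible date are all hypothetical and would differ in a real migration.

```python
from datetime import date, datetime

# Assumed placeholder values that really mean "nothing was entered".
PLACEHOLDERS = {"", "?", "na", "n/a"}


def check_record(record, required=("name", "test_date"), earliest=date(1990, 1, 1)):
    """Return a list of problems found in one legacy record (a dict of strings)."""
    problems = []

    # Required items must be present and must not be a placeholder.
    for field in required:
        value = (record.get(field) or "").strip()
        if value.lower() in PLACEHOLDERS:
            problems.append(f"{field}: missing")

    # Dates must parse in the expected format (assumed ISO, YYYY-MM-DD)
    # and fall inside a plausible range.
    raw = (record.get("test_date") or "").strip()
    if raw.lower() not in PLACEHOLDERS:
        try:
            parsed = datetime.strptime(raw, "%Y-%m-%d").date()
        except ValueError:
            problems.append("test_date: bad format")
        else:
            if not (earliest <= parsed <= date.today()):
                problems.append("test_date: out of range")

    return problems
```

A record that returns an empty list passed every check. Anything else can be logged or routed for review rather than loaded silently.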
Consistency
Merging information from different sources raises issues of consistency. The data in each source may be accurate but stored in different ways. The differences can be as simple as one source spelling out a word that another abbreviates. Whatever they are, they need resolving without creating duplicate entries. Standardization rules in the migration software can fix many of these problems.
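As a rough sketch of a standardization rule, the Python below expands a few abbreviations and uses the normalized values as a key, so the same entry arriving from two sources is kept only once. The abbreviation table and the `name` and `address` fields are assumptions for illustration, not a complete rule set.

```python
# Hypothetical abbreviation table; a real migration would need a longer one.
ABBREVIATIONS = {"st": "street", "st.": "street", "ave": "avenue", "ave.": "avenue"}


def standardize(text):
    """Lowercase, collapse whitespace, and expand known abbreviations."""
    words = text.lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def merge_sources(*sources):
    """Merge lists of records, keeping the first record seen for each key."""
    merged = {}
    for source in sources:
        for record in source:
            key = (standardize(record["name"]), standardize(record["address"]))
            merged.setdefault(key, record)  # later duplicates are ignored
    return list(merged.values())
```

Keeping the first record seen is the simplest possible policy. Preferring the most recent source, or the most complete record, would be equally reasonable choices.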
Consistency checking helps to catch errors in accuracy as well. If two entries are different, one may be wrong or outdated. Manual verification could be necessary to decide which one to retain.
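One way to surface such disagreements is to group records by a shared identifier and list every field whose values differ. The sketch below assumes a hypothetical `customer_id` key and string-valued fields. It only reports conflicts and leaves the decision about which value to retain to a person.

```python
from collections import defaultdict


def find_conflicts(records, key_field="customer_id"):
    """Map each key to the fields whose values disagree across records."""
    # seen[key][field] is the set of distinct values observed for that field.
    seen = defaultdict(lambda: defaultdict(set))
    for record in records:
        for field, value in record.items():
            if field != key_field:
                seen[record[key_field]][field].add(value)

    conflicts = {}
    for key, fields in seen.items():
        differing = {f: sorted(values) for f, values in fields.items() if len(values) > 1}
        if differing:
            conflicts[key] = differing
    return conflicts
```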
Completeness
Sometimes essential information just isn’t there. Ad hoc methods such as spreadsheet entry generally don’t enforce mandated fields. If the information is optional and its lack doesn’t cause problems, that’s fine. But if essential information is missing, such as contact data or the date of a test, that’s a problem.
It’s hard to reconstruct missing information, but the migration still has to handle the gaps somehow. The migration process could insert null or default values and flag those items for review. Another approach is to put the defective items into a separate table, where they can be brought up to standard and merged later.
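Both approaches are easy to sketch in Python. The field names and the `needs_review` flag below are hypothetical, and the code assumes string fields, so an empty string counts as missing.

```python
def fill_and_flag(record, essential, default=None):
    """Return a copy with missing essential fields set to a default and flagged."""
    filled = dict(record)
    missing = [field for field in essential if not filled.get(field)]
    for field in missing:
        filled[field] = default
    filled["needs_review"] = bool(missing)  # hypothetical review flag
    return filled


def split_defective(records, essential):
    """Separate complete records from those missing an essential field."""
    complete, defective = [], []
    for record in records:
        if all(record.get(field) for field in essential):
            complete.append(record)
        else:
            defective.append(record)  # destined for a separate holding table
    return complete, defective
```

Flagging keeps every record in the main table but relies on later queries respecting the flag. A holding table keeps the main table clean at the cost of a second merge step.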
Prepare for dirty data
Any legacy data migration is bound to run into inaccurate, inconsistent, or incomplete data. The managers in charge of the process should plan for it and choose a migration tool that handles it gracefully and intelligently. The result will be a more usable and reliable information system.