When should you data profile? Morning, Noon and Night!
Data profiling is an important part of any data-related project. The question that often arises is when the best time to profile data is. As you would expect from a software company that sells a really cool visual data profiling tool, our view is “all the time”.
Using data profiling tools before the project
Data profiling is useful even before the project is defined. By doing a first, high-level data profiling pass on key data sets, you will get:
- better project scope definition
- a more accurate budget estimate
- a clear baseline from which improvement can be measured
- the ability to correctly manage expectations
The last two are important ones. Making wild promises based on what the data model says should be in the tables, and then failing to deliver because of data quality issues, is not nearly as career-enhancing as defining an ambitious but doable scope, delivering against the actual data, and clearly communicating the progress made based on facts.
Data quality issues can seriously (double-digit percentage seriously) affect the final cost of a project. Knowing about the issues up front lets you set a realistic budget.
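To make "a first, high-level data profiling pass" concrete, here is a minimal sketch in plain Python. It assumes the key data set has already been loaded as a list of dictionaries; the `profile_column` helper, the `customers` sample and the chosen metrics are all hypothetical illustrations, not the output or API of any particular profiling tool.

```python
from collections import Counter

def profile_column(rows, column):
    """Return basic profile metrics for one column of a list-of-dicts data set."""
    values = [row.get(column) for row in rows]
    # Treat None and empty/whitespace-only strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    total = len(values)
    return {
        "rows": total,
        "missing_pct": round(100.0 * (total - len(present)) / total, 1) if total else 0.0,
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }

# Hypothetical sample standing in for a key source table.
customers = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "us", "email": ""},
    {"id": 3, "country": "US", "email": None},
    {"id": 4, "country": "DE", "email": "d@example.com"},
]

# Profile every column of interest; keep the result as the project baseline.
baseline = {col: profile_column(customers, col) for col in ("id", "country", "email")}
for col, metrics in baseline.items():
    print(col, metrics)
```

Even on this toy sample the profile surfaces the kind of issue a data model will not show: half of the email values are missing, and "US" and "us" count as two distinct countries. Saving the `baseline` dictionary gives you the starting point against which later improvement can be measured.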
Data profiling at the beginning of the project
After the initial high-level data profiling done before the project, budgeting for a more detailed profiling of the source data will:
- give clear design guidance to ETL developers
- clearly identify the subject matter experts needed to understand the data, and let you engage them early, rather than in a rush when ETL development hits the underlying data issues and the project is already running late and over budget
Data profiling during the project
- By setting up automated data profiling tasks for the output of each ETL process, ETL developers and architects can track the progress of migration, cleansing or conversion tasks using concrete information.
- Objective criteria can be set for each profile task to determine what level of data quality is considered “acceptable” for the final data load or fact set deliverables.
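One way to picture objective criteria attached to an automated profiling task is the sketch below. It assumes each ETL step already produces a dictionary of measured metrics; the metric names, the limits and the `check_profile` helper are hypothetical and would need to be agreed for your own project.

```python
def check_profile(profile, criteria):
    """Compare measured metrics against maximum allowed values; return failures."""
    failures = []
    for metric, max_allowed in criteria.items():
        actual = profile[metric]
        if actual > max_allowed:
            failures.append(f"{metric}: {actual} exceeds limit {max_allowed}")
    return failures

# Hypothetical metrics measured on the output of one ETL step.
step_profile = {"missing_email_pct": 7.5, "duplicate_id_count": 0, "invalid_date_pct": 0.2}
# Hypothetical acceptance criteria (maximum allowed value per metric).
acceptance = {"missing_email_pct": 5.0, "duplicate_id_count": 0, "invalid_date_pct": 1.0}

failures = check_profile(step_profile, acceptance)
print("ACCEPTABLE" if not failures else "NOT ACCEPTABLE")
for failure in failures:
    print(" -", failure)
```

Run after every ETL execution, a check like this turns "is the data good enough to load?" into a yes/no answer with a list of reasons, rather than a matter of opinion.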
Data profiling at the end of the project
- Doing a final data profiling run and comparing it to the baseline established before the project will provide a clear before/after view that both communicates the progress made and helps justify and promote the next data quality initiative.
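A before/after view can be as simple as lining up the same metrics from two profiling runs. The sketch below assumes both runs were saved as dictionaries with matching metric names; the `compare_profiles` helper and all the figures are made up purely for illustration and are not results from a real project.

```python
def compare_profiles(before, after):
    """Return the per-metric change between two profiling runs."""
    report = {}
    for metric in before:
        # Only compare metrics that were measured in both runs.
        if metric in after:
            report[metric] = {
                "before": before[metric],
                "after": after[metric],
                "change": round(after[metric] - before[metric], 2),
            }
    return report

# Hypothetical numbers: the baseline taken before the project and the final run.
baseline_run = {"missing_email_pct": 18.0, "duplicate_id_count": 240, "invalid_date_pct": 3.5}
final_run = {"missing_email_pct": 2.0, "duplicate_id_count": 0, "invalid_date_pct": 0.5}

report = compare_profiles(baseline_run, final_run)
for metric, row in report.items():
    print(f"{metric}: {row['before']} -> {row['after']} (change {row['change']})")
```

A table like this, built from measured values rather than impressions, is the kind of fact-based progress report that supports the case for the next initiative.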