Schema and data validation of T1D Exchange mapped data using the Pandera framework
Background/Objective: Data registries, such as T1D Exchange, advance data-driven innovation by compiling data from centers across the nation and using them to answer complex questions and develop strategies to improve patient outcomes. Doing so requires member institutions to map, transform, and validate their data, a challenging and detail-oriented task. Our institution reduced submission errors and improved data quality by using the open-source framework Pandera (Bantilan, 2020) to validate data before submission.
Methods: Pandera is a lightweight schema and data validation framework for Python (3.7, 3.8, 3.9). It allows users to define a schema for their data and to specify a wide variety of data quality checks. Our team translated the mapping documentation provided by the T1D Exchange into data tests in Python using Pandera. This validation step was added after data extraction and mapping, before any data were submitted to the T1D Exchange.
Results: In the submission prior to using Pandera, the T1D Exchange reported 26 data schema and validation errors back to our institution. In the first monthly submission after adding Pandera schema validation to the workflow, only one error was reported.