In a recent study published in Nature Communications, researchers examined whether integrating non-prescription pharmaceutical sales data improves forecasts of weekly recorded mortality from respiratory diseases in England, using almost two billion transactions from a United Kingdom (UK) high street store between March 2016 and March 2020.
Study: Assessing the value of integrating national longitudinal shopping data into respiratory disease forecasting models.
Background
Researchers are investigating the use of social and behavioral data to anticipate influenza-like infections and their impact on vulnerable individuals. They suggest incorporating these data into disease models to make predictions more accurate.
Traditional approaches, such as surveys and self-reporting, pose logistical challenges. Alternative digital footprint databases provide long-term monitoring of health habits at scale, augmenting qualitative observations and reflecting local population differences.
About the study
In the present study, researchers created the PADRUS artificial intelligence (AI)-based tool utilizing non-prescription pharmaceutical sales data to estimate weekly mortality from respiratory disorders.
The tool increased the accuracy of respiratory illness forecasting models and operated at the finer geographic granularity of local authorities.
Using retail sales data and non-prescription drug purchases, the PADRUS machine learning model predicted reported fatalities from respiratory diseases in 314 local authority areas throughout England. The study sought to explore the efficacy of these models by assessing the role of sales data compared to other predictive factors.
Two comparison models were developed. Each observation represented one of England's 314 Lower Tier Local Authorities (LTLAs) in a given week, with a weekly dependent variable (output feature) indicating respiratory fatalities in the LTLA for that week. Based on this information, predictive models were built and evaluated to derive the best forecasting mechanism.
Model Class Reliance (MCR) analysis was used to determine variable importance, giving both the minimum reliance a well-performing model must place on a variable (MCR-) and the maximum it could place on it (MCR+). Because of multicollinearity, substantial shared information, and non-linear interactions between variables, group-MCR was used to examine the relevance of different variable categories.
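The building block behind this kind of analysis is permutation importance applied to a whole group of variables at once. The sketch below is not MCR itself, which bounds importance across all near-optimal models; it only shows, for a single hypothetical toy model and made-up data, how shuffling a group of columns together measures that model's reliance on the group. The function names and the toy model are assumptions for illustration.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def group_permutation_importance(model, rows, targets, group, seed=0):
    """Increase in MSE when the columns in `group` are shuffled together.

    Shuffling the columns jointly keeps correlations inside the group
    intact while breaking the group's link to the target.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    permuted = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for col in group:
            new_row[col] = rows[order[i]][col]
        permuted.append(new_row)
    return mse(model, permuted, targets) - mse(model, rows, targets)


def toy_model(row):
    # Hypothetical model: relies on columns 0 and 1, ignores column 2.
    return 2.0 * row[0] + row[1]


# Made-up data, purely illustrative.
rows = [[float(i), float(i % 5), float(i % 3)] for i in range(30)]
targets = [toy_model(r) for r in rows]

used = group_permutation_importance(toy_model, rows, targets, group=[0, 1])
unused = group_permutation_importance(toy_model, rows, targets, group=[2])
```

A group the model depends on shows a positive increase in error, while a group it ignores shows none.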
The models were created using sales and outcome data across Wales and England, with gaps of 3 to 31 days between the final day of the one-week sales aggregation period and the day the fatalities were reported, to generate prediction horizons and assess whether linear relationships existed.
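The relationship between a prediction horizon and the sales week it draws on can be sketched as simple date arithmetic. This is a minimal illustration, assuming a seven-day sales window whose last day falls a given number of days before the reporting day; the function name and the example date are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def sales_week_for(report_day: date, horizon_days: int) -> Tuple[date, date]:
    """Return (first_day, last_day) of the one-week sales window whose
    final day falls `horizon_days` before `report_day`."""
    if horizon_days < 0:
        raise ValueError("horizon_days must be non-negative")
    last_day = report_day - timedelta(days=horizon_days)
    first_day = last_day - timedelta(days=6)  # seven days inclusive
    return first_day, last_day


# Horizons of 3 to 31 days, as described in the article.
HORIZONS = range(3, 32)

report_day = date(2019, 1, 25)  # hypothetical reporting day
windows = {h: sales_week_for(report_day, h) for h in HORIZONS}
```

For a 17-day horizon and this reporting day, the sales window runs from 2 January to 8 January 2019.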
The baseline, PADRUS, and PADRUNOS models were non-linear, used a random forest regressor, and provided results on held-out test data (30%).
The PADRUS model used 56 features extracted from sales, meteorological, demographic, environmental, and socioeconomic data.
The PADRUNOS model was created by optimizing a random forest regressor using a time series cross-validation grid search to forecast weekly mortality from respiratory illness for the 314 LTLAs 17 days in advance.
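The article does not describe the cross-validation scheme in detail. A common choice for time series is an expanding window, where each fold trains only on observations that precede its test block; the sketch below assumes that scheme and uses a hypothetical function name. In a grid search, each candidate hyperparameter setting would be scored on these folds and the best average kept.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in time order.

    Every training set ends immediately before its test block, so no
    fold is trained on data from the future.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * test_size
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test


# Ten weekly observations split into three folds.
splits = list(expanding_window_splits(10, 3))
```

With ten observations and three folds, the test blocks are weeks 4 to 5, 6 to 7, and 8 to 9 (zero-indexed), each preceded by all earlier weeks as training data.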
The team assessed weekly time-series forecasting outcomes across LTLAs for both models and performed further stratification by the Index of Multiple Deprivation (IMD) to explore the impact of the population's economic situation.
Results
Models that used sales data outperformed those that relied only on other factors related to respiratory disease, such as sociodemographic and meteorological data. Accuracy gains were largest during periods of highest public risk.
Between December 2009 and April 2015, the highest weekly fatalities from respiratory diseases in Wales and England were 3,521, and the lowest weekly deaths were 868, for a total of 378,230 deaths due to respiratory disease.
Regressors that predicted weekly respiratory fatalities using sales information 17 days in advance showed the best results, with an out-of-sample R2 of 0.8 on the held-out data (30%). Predictions made 24 days in advance continued to perform well (R2 0.8, root mean square error (RMSE) 224); however, performance was noticeably lower for predictions made ≤10 days or ≥31 days in advance.
The PADRUS model, with an R2 of 0.8, achieved significantly greater predictive accuracy than the baseline model.
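For readers unfamiliar with the two metrics, R2 is the share of variance in the observed values that the predictions explain, and RMSE is the typical size of a prediction error in the same units as the outcome (here, deaths). A minimal sketch of both, using made-up numbers rather than study data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, purely illustrative.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

score = r_squared(observed, predicted)
error = rmse(observed, predicted)
```

An R2 of 1 means perfect predictions, while a value near 0 means the model does no better than always predicting the average.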
The most important drivers in developing model projections were LTLA population size and age, with death rates from respiratory disease being higher in older populations.
Sales features, notably the proportion of cough medicine purchases, ranked next, followed by IMD concentration and weather factors, which had more influence on predictions than decongestant sales and housing-related variables. According to the MCR analysis, the population counts in the three age groups remained essential for producing the best predictions.
Sales features produced considerably larger permutation importance bounds than IMD (MCR- 3.7, MCR+ 5.6). Weather (MCR- 7.6, MCR+ 7.5) was the second most important group used for forecasting.
PADRUS and PADRUNOS showed similar prediction trends between 2016 and 2020; however, PADRUS was better than PADRUNOS at detecting spikes in respiratory death rates. Adding sales data improved PADRUS model accuracy, particularly during the winter months. Both models produced more accurate forecasts in locations with higher concentrations of deprivation, with PADRUS outperforming PADRUNOS across all IMD quartiles.
Conclusion
Overall, the study findings showed that sales data used for population health monitoring, including sales of non-prescription medications for managing respiratory symptoms, can improve the accuracy of respiratory death forecasts, even at the fine geographic granularity of local authorities.