Comparative Study of Imputation Techniques for Missing Value Estimation in Particulate Matter 2.5 µm Time Series

Document Type : Original Research Paper

Authors

1 Departamento de Ingeniería de Sistemas e Informática, Universidad Nacional de Moquegua, Moquegua, Peru

2 Departamento de Ciencias Básicas, Universidad Nacional de Moquegua, Moquegua, Peru

3 Departamento de Ingeniería Ambiental, Universidad Nacional de Moquegua, Moquegua, Peru

Abstract

Particulate matter with a diameter of 2.5 µm or less (PM2.5) is one of the most important air pollutants owing to its harmful effects on health. However, PM2.5 data measured by air quality monitoring networks may contain many missing values owing to equipment failure. We conducted a comparative study of imputation techniques for missing value estimation in PM2.5 time series regularly measured by the air quality monitoring network of Lima, Peru, the second most polluted city in South America. The techniques implemented included moving average-based approaches (e.g., autoregressive integrated moving average, ARIMA; exponentially weighted moving average, EWMA; linear weighted moving average, LWMA; and local average of nearest neighbors, LANN), interpolation-based models (e.g., spline), and deep learning-based methods (e.g., long short-term memory, LSTM; bidirectional LSTM; gated recurrent unit, GRU; and bidirectional GRU). For experimentation, a dataset spanning 11,822 h was used, with 80% for training and the remaining 20% for testing. The results in terms of root mean square error (RMSE), mean absolute percentage error (MAPE), and R² showed that, for different configurations of short gaps of missing values, the techniques based on moving averages yielded better results than those based on deep learning. Among the moving average-based techniques, ARIMA was the best model for estimating missing values in PM2.5 time series, with MAPE values ranging from 0.0005% to 11.6522%.
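The evaluation idea described in the abstract (hide known values, impute them, then score the estimates with RMSE, MAPE, and R²) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy series, the window size, and the linearly decaying weights are not taken from the study, and the weighting scheme is only one simple variant of a linear weighted moving average, not necessarily the one the authors used. R² is computed here as the coefficient of determination.

```python
import math

def lwma_impute(series, window=2):
    """Fill None entries with a linearly weighted average of observed neighbors.

    Illustrative weighting (an assumption): a neighbor at distance d
    (1..window) on either side receives weight window - d + 1.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        num, den = 0.0, 0.0
        for d in range(1, window + 1):
            weight = window - d + 1
            for j in (i - d, i + d):
                # Use only positions inside the series that were observed.
                if 0 <= j < len(series) and series[j] is not None:
                    num += weight * series[j]
                    den += weight
        if den > 0:
            filled[i] = num / den
    return filled

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    # Expressed in percent; assumes no actual value is zero.
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly PM2.5 values (toy data, not from the Lima network).
truth = [20.0, 22.0, 25.0, 26.0, 28.0, 31.0, 32.0, 34.0]
gap_idx = [2, 5]  # two short gaps of length one

# Hide the values at the gap positions, then impute them.
observed = [None if i in gap_idx else v for i, v in enumerate(truth)]
filled = lwma_impute(observed, window=2)

# Score only the positions that were hidden.
actual = [truth[i] for i in gap_idx]
estimated = [filled[i] for i in gap_idx]
print(rmse(actual, estimated), mape(actual, estimated), r2(actual, estimated))
```

Scoring only the artificially hidden positions keeps the metrics from being inflated by values that were never missing. Longer gaps could be simulated by hiding runs of consecutive indices.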

