1 Introduction
Convincing evidence has shown that air pollution is a major environmental risk to health. The World Health Organization (WHO) estimates that 4.2 million premature deaths occur each year due to outdoor air pollution (Majeed et al. (2024)). In South Asia in particular, air pollution is a major public health concern, with persistently high levels of particulate matter (PM2.5) and other pollutants (Majeed et al. (2024)). Monitoring air quality is therefore essential for public health and environmental protection. Satellites are a valuable tool for this task, as they can measure pollutants such as methane (CH4), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), formaldehyde (HCHO), and ozone (O3) from space. Alongside these trace gases, the Sentinel-5P product also provides the Aerosol Index (AI), a measure of the amount of aerosols in the atmosphere that is used to monitor air quality and to anticipate air quality events such as smog. The AI data from the Sentinel-5P satellite is available in near real-time (ESA (2021)).
In this project, we use a convolutional LSTM (ConvLSTM) neural network to forecast the AI over South Asia, with the goal of developing a model that can accurately predict future AI values and thereby support air quality monitoring. The ConvLSTM is a class of neural networks designed for spatio-temporal data (Shi et al. (2015)): it combines the spatial information captured by convolutional layers with the temporal information captured by LSTM layers, and it has been shown to be effective for spatio-temporal prediction tasks such as weather forecasting (Kumar et al. (2020)) and anomaly detection (Luo, Liu, and Gao (2017)).
2 Data
An arbitrary study area was chosen, confined to a bounding box covering the northern regions of South Asia, as these regions continuously experience some of the world's worst air quality (Majeed et al. (2024)). The bounding box coordinates were:
```python
bbox = [68.137207, 24.886436, 84.836426, 34.379713]  # WGS84 // lon, lat, lon, lat
```
For the study area above, the corresponding Sentinel-5P data was downloaded using the Deep Earth System Data Lab's xcube Sentinel Hub data store (Brandt (2023)). The attached notebook deepesdl-S5PL2.ipynb shows how this works. Alternatively, the attached script data.py, included as a Code Link in this manuscript, can be used to download the raw data. For convenience, the resulting file can also be found on Zenodo (Khan (2024)).
The time period for the study was 01.01.2019 to 31.12.2023. The data was downloaded as netCDF files containing the selected variables for the study area and time period, then preprocessed and split into training and testing datasets: the training dataset covers 01.01.2019 to 31.12.2022 and the testing dataset covers 01.01.2023 to 31.12.2023.
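For reference, the snippet below is a minimal sketch of how such a cube can be requested through the xcube Sentinel Hub store. The dataset name, band names, bounding box, and 5-day aggregation period follow the dataset attributes shown in this section; the spatial resolution in degrees and the output file name are assumptions. The attached deepesdl-S5PL2.ipynb and data.py remain the authoritative versions.

```python
from xcube_sh.config import CubeConfig
from xcube_sh.cube import open_cube

bbox = [68.137207, 24.886436, 84.836426, 34.379713]  # WGS84: lon_min, lat_min, lon_max, lat_max

cube_config = CubeConfig(
    dataset_name="S5PL2",
    band_names=["AER_AI_340_380", "AER_AI_354_388", "CH4", "CLOUD_FRACTION",
                "CO", "HCHO", "NO2", "O3", "SO2"],
    bbox=tuple(bbox),        # older xcube_sh releases call this parameter `geometry`
    spatial_res=0.0326,      # ~3.6 km per pixel expressed in degrees (assumed value)
    time_range=["2019-01-01", "2023-12-31"],
    time_period="5D",        # matches time_coverage_resolution P5DT0H0M0S
)

# Requires Sentinel Hub credentials (e.g. SH_CLIENT_ID / SH_CLIENT_SECRET) in the environment
cube = open_cube(cube_config)
cube.to_netcdf("S5PL2.nc")   # writing triggers the actual download
```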
The spatial resolution of the data is 3.629 km × 3.269 km per pixel.
Here is the resulting dataset:
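```
<xarray.Dataset> Size: 2GB
Dimensions:         (time: 366, lat: 291, lon: 512, bnds: 2)
Coordinates:
  * lat             (lat) float64 2kB 34.36 34.33 34.3 ... 24.97 24.94 24.9
  * lon             (lon) float64 4kB 68.15 68.19 68.22 ... 84.75 84.79 84.82
  * time            (time) datetime64[ns] 3kB 2019-01-03T12:00:00 ... 2024-01...
    time_bnds       (time, bnds) datetime64[ns] 6kB ...
Dimensions without coordinates: bnds
Data variables:
    AER_AI_340_380  (time, lat, lon) float32 218MB ...
    AER_AI_354_388  (time, lat, lon) float32 218MB ...
    CH4             (time, lat, lon) float32 218MB ...
    CLOUD_FRACTION  (time, lat, lon) float32 218MB ...
    CO              (time, lat, lon) float32 218MB ...
    HCHO            (time, lat, lon) float32 218MB ...
    NO2             (time, lat, lon) float32 218MB ...
    O3              (time, lat, lon) float32 218MB ...
    SO2             (time, lat, lon) float32 218MB ...
Attributes:
    Conventions:               CF-1.7
    title:                     S5PL2 Data Cube Subset
    history:                   [{'program': 'xcube_sh.chunkstore.SentinelHubC...
    date_created:              2024-05-02T13:00:01.155492
    time_coverage_start:       2019-01-01T00:00:00+00:00
    time_coverage_end:         2024-01-05T00:00:00+00:00
    time_coverage_duration:    P1830DT0H0M0S
    time_coverage_resolution:  P5DT0H0M0S
    geospatial_lon_min:        68.137207
    geospatial_lat_min:        24.886436
    geospatial_lon_max:        84.836426
    geospatial_lat_max:        34.37759367382812
```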
2.1 Predictor variables (features)
The following predictor variables were chosen for training the ConvLSTM.
2.1.1 Methane – CH4
2.1.2 Ozone – O3
2.1.3 Sulphur Dioxide – SO2
2.1.4 Carbon Monoxide – CO
2.1.5 Nitrogen Dioxide – NO2
2.1.6 Formaldehyde – HCHO
2.2 Target variable
2.2.1 Aerosol Index – AI (340-380 nm)
The Aerosol Index shows large variability through the year and across years.
3 Model Training
A Convolutional Long Short-Term Memory (ConvLSTM) model is a specialized neural network architecture that integrates the strengths of Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) (Luo, Liu, and Gao (2017)). This hybrid model is particularly well-suited for tasks involving spatiotemporal data, where both spatial and temporal dependencies are critical. The primary advantage of ConvLSTM models lies in their ability to simultaneously process and analyze spatial and temporal information, making them more effective for spatiotemporal tasks compared to using separate CNN and LSTM models. ConvLSTM models are used extensively in fields such as video processing, weather forecasting, and environmental monitoring, where data exhibits strong correlations across both space and time.
The idea is to use the predictor variables as input to the ConvLSTM and to obtain the target variable as output. The model was trained for 50 and 100 epochs, using the Adam optimizer with a learning rate of 0.001 and a batch size of 32.
I used the numpy (Harris et al. (2020)) Python package for all data preprocessing steps, then TensorFlow (Martin (2015)) and Keras (Chollet et al. (2015)) for building a custom ConvLSTM, and matplotlib (Hunter (2007)) for all plotting tasks.
As model training required a lot of memory, the DeepESDL JupyterLab proved to be insufficient. Hence, I had to train the model on the Model Server Grid (MSG) Windows cluster at the Helmholtz Centre for Environmental Research (UFZ) in Leipzig, Germany. Even then, the model architecture had to be simplified to reduce the memory requirements. The exact architecture is defined in smogseer.py.
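For orientation, the following is a minimal sketch of an architecture consistent with the description in this manuscript: stacked ConvLSTM2D layers with batch normalization in between, a Conv3D output layer, and the Adam optimizer (learning rate 0.001) with binary cross-entropy loss and MSE as a metric. The filter counts, kernel sizes, initializer, and output activation are assumptions rather than the exact values used in smogseer.py.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input shape per sample: (time steps, lat, lon, features) = (1, 291, 512, 6)
model = models.Sequential([
    layers.Input(shape=(1, 291, 512, 6)),
    layers.ConvLSTM2D(16, kernel_size=(3, 3), padding="same", activation="tanh",
                      kernel_initializer="he_normal", return_sequences=True),
    layers.BatchNormalization(),
    layers.ConvLSTM2D(8, kernel_size=(3, 3), padding="same", activation="tanh",
                      kernel_initializer="he_normal", return_sequences=True),
    layers.BatchNormalization(),
    # Conv3D head collapses the feature maps to a single AI channel in [0, 1]
    layers.Conv3D(1, kernel_size=(3, 3, 3), padding="same", activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=["mean_squared_error"])
model.summary()
```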
The training took approximately 4 hours for 50 epochs and 10 hours for 100 epochs.
All training and validation arrays share the dimension order (dates, time step, latitudes, longitudes, channels), with 1 time step, 291 latitudes, and 512 longitudes. The feature arrays carry the six channels ['SO2', 'NO2', 'CH4', 'O3', 'CO', 'HCHO']; the target arrays carry the single channel ['AI'].
- Feature training set (X_train): (292, 1, 291, 512, 6)
- Feature validation set (X_val): (74, 1, 291, 512, 6)
- Target training set (y_train): (292, 1, 291, 512, 1)
- Target validation set (y_val): (74, 1, 291, 512, 1)
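Under stated assumptions (the local netCDF file name, and a chronological split at index 292 into 292 training and 74 validation dates), the sketch below shows how arrays with these shapes can be derived from the downloaded cube; the scaling and imputation steps described in the workflow below are omitted for brevity.

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("S5PL2.nc")  # assumed local file name for the downloaded cube

features = ["SO2", "NO2", "CH4", "O3", "CO", "HCHO"]

# Stack the feature variables into one array: (time, lat, lon, feature)
X = np.stack([ds[var].values for var in features], axis=-1)

# Target: Aerosol Index (340-380 nm) with a trailing channel axis: (time, lat, lon, 1)
y = ds["AER_AI_340_380"].values[..., np.newaxis]

# Add the singleton time-step axis expected by ConvLSTM2D: (samples, 1, lat, lon, channels)
X = X[:, np.newaxis, ...]
y = y[:, np.newaxis, ...]

# Chronological split: first 292 dates for training, remaining 74 for validation
split = 292
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

print(X_train.shape, X_val.shape)  # (292, 1, 291, 512, 6) (74, 1, 291, 512, 6)
print(y_train.shape, y_val.shape)  # (292, 1, 291, 512, 1) (74, 1, 291, 512, 1)
```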
3.1 Model Workflow
Here is a workflow for the provided code in smogseer.py:
A run of the code can be found in smogseer50.py and smogseer100.py.
- Import Libraries: Import the necessary libraries: xarray, numpy, tensorflow, sklearn, and matplotlib.
- Load Dataset: Load the dataset using xarray.open_dataset().
- Stack Features: Stack the features into a single DataArray and transpose it to the desired dimensions.
- Convert to NumPy Arrays: Convert the DataArray to a NumPy array.
- Normalize Input Data: Normalize the input data using StandardScaler.
- Impute Missing Values in Input Data: Reshape the data for imputation, impute missing values using SimpleImputer, and reshape the data back to its original dimensions.
- Add Time Dimension to Input Data: Add an additional time dimension to the input data.
- Load Target Data: Load the target dataset using xarray.open_dataset().
- Normalize Target Data: Normalize the target data using MinMaxScaler.
- Impute Missing Values in Target Data: Reshape the target data for imputation, impute missing values using SimpleImputer, and reshape it back to its original dimensions.
- Ensure Target Data Shape: Ensure the target data shape is (num_samples, num_timesteps, num_latitudes, num_longitudes, 1).
- Remove Samples with NaN Values: Identify and remove samples with NaN values in the target data.
- Verify Target Data Range: Ensure the target data values are within the valid range [0, 1].
- Split Data into Training and Validation Sets: Split the cleaned data into training and validation sets based on a defined ratio.
- Define Model Architecture: Define the model architecture using ConvLSTM2D and Conv3D layers with appropriate activation functions and initializers, and add batch normalization layers between the ConvLSTM layers.
- Compile Model: Compile the model with the Adam optimizer, binary_crossentropy loss, and mean_squared_error as a metric.
- Print Model Summary: Print the summary of the defined model.
- Define Data Generator Class: Define a DataGenerator class to handle large datasets efficiently (see the sketch after this list).
- Initialize Data Generators: Initialize the training and validation data generators with a specified batch size.
- Define Callbacks for Training: Define callbacks for reducing the learning rate, early stopping, and TensorBoard logging.
- Train the Model: Train the model using the data generators and the defined callbacks.
- Save the Model: Save the trained model to a .keras file.
- Load the Model: Load the saved model for further evaluation and prediction.
- Run Predictions on Validation Data: Run predictions on the validation data using the loaded model.
- Evaluate the Model: Evaluate the model on the validation data to obtain the loss and MSE metric.
- Binary Classification Threshold: Apply a threshold to convert the predictions to binary values for classification.
- Flatten Predictions and Ground Truth: Flatten the predictions and the ground truth data for comparison.
- Plot Comparisons and Training History: Define functions to plot comparisons between ground truth and predictions, visualize a few samples, and plot and save the training and validation loss and MSE over epochs.
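As referenced in the list above, here is a minimal sketch of how the data generator, callbacks, and training loop can be wired together. It reuses the model from the architecture sketch earlier in this section and the X_train/X_val/y_train/y_val arrays; the batch size of 32 is stated above, while the callback settings are assumptions, and the authoritative implementation remains smogseer.py.

```python
import numpy as np
import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    """Yields (X, y) batches by index so the full dataset is never duplicated in memory."""

    def __init__(self, X, y, batch_size=32, **kwargs):
        super().__init__(**kwargs)
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.X[sl], self.y[sl]

train_gen = DataGenerator(X_train, y_train, batch_size=32)
val_gen = DataGenerator(X_val, y_val, batch_size=32)

callbacks = [
    # Reduce the learning rate when the validation loss stops improving (settings assumed)
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

history = model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=callbacks)

# Save the trained model, then reload it later for evaluation and prediction
model.save("smogseer50.keras")
```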
To load the 50 epoch model and run predictions on the validation data, you can use the following code:
```python
import numpy as np
from tensorflow.keras.models import load_model

## LOAD CHECKPOINTS IF NEEDED
X_val = np.load('data/X_val.npy')
y_val = np.load('data/Y_val.npy')

# Load the model
model = load_model('smogseer50.keras')

# Run predictions on validation data
predictions = model.predict(X_val)

# Evaluate the model on validation data
# (the second return value is the mean_squared_error metric, despite the variable name)
val_loss, val_accuracy = model.evaluate(X_val, y_val)
print(f"Validation Loss: {val_loss}")
print(f"Validation Accuracy: {val_accuracy}")
```
3/3 ━━━━━━━━━━━━━━━━━━━━ 7s 693ms/step
3/3 ━━━━━━━━━━━━━━━━━━━━ 3s 432ms/step - loss: 0.3992 - mean_squared_error: 0.0016
Validation Loss: 0.39958614110946655
Validation Accuracy: 0.0018263210076838732
To load the 100 epoch model and run predictions on the validation data, you can use the following code:
```python
import numpy as np
from tensorflow.keras.models import load_model

## LOAD CHECKPOINTS IF NEEDED
X_val = np.load('data/X_val.npy')
y_val = np.load('data/Y_val.npy')

# Load the model
model = load_model('smogseer100.keras')

# Run predictions on validation data
predictions = model.predict(X_val)

# Evaluate the model on validation data
# (the second return value is the mean_squared_error metric, despite the variable name)
val_loss, val_accuracy = model.evaluate(X_val, y_val)
print(f"Validation Loss: {val_loss}")
print(f"Validation Accuracy: {val_accuracy}")
```
3/3 ━━━━━━━━━━━━━━━━━━━━ 3s 632ms/step
3/3 ━━━━━━━━━━━━━━━━━━━━ 3s 423ms/step - loss: 0.4003 - mean_squared_error: 0.0019
Validation Loss: 0.40102294087409973
Validation Accuracy: 0.0022550118155777454
The predicted images were converted to a GIF. The ground truth for the same time period was also visualised. The GIFs can be seen side by side here:
The model predictions and ground truth data were visualised side by side for 5 timesteps, which corresponds to 25 days. These images can be found in the static folder in the repository.
An example:
4 Evaluation
The evaluation metrics used in Smogseer are: 1. Binary Crossentropy Loss, 2. Mean Squared Error.
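Both are standard metrics and can be reproduced directly from the saved predictions; a minimal NumPy sketch, reusing the predictions and y_val arrays from the evaluation snippets above:

```python
import numpy as np

eps = 1e-7  # avoid log(0)
p = np.clip(predictions, eps, 1.0 - eps)

# Binary cross-entropy, averaged over all samples, time steps, and pixels
bce = -np.mean(y_val * np.log(p) + (1.0 - y_val) * np.log(1.0 - p))

# Mean squared error against the [0, 1]-scaled AI target
mse = np.mean((y_val - predictions) ** 2)

print(f"BCE: {bce:.4f}, MSE: {mse:.6f}")
```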
Here is what the evaluation metrics look like for the 50 epochs model:
And here is what the evaluation metrics look like for the 100 epochs model:
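The curves shown in these figures are standard Keras training-history plots; a minimal matplotlib sketch of how such a figure can be produced, assuming the history object returned by model.fit in the training sketch above:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Binary cross-entropy loss over epochs
ax1.plot(history.history["loss"], label="Training loss")
ax1.plot(history.history["val_loss"], label="Validation loss")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()

# Mean squared error over epochs
ax2.plot(history.history["mean_squared_error"], label="Training MSE")
ax2.plot(history.history["val_mean_squared_error"], label="Validation MSE")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("MSE")
ax2.legend()

fig.tight_layout()
fig.savefig("training_history.png", dpi=150)
```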
5 Conclusion
5.1 Loss over Epochs
Training Loss (Blue Line):
- The training loss starts at around 0.70 and decreases steadily as the number of epochs increases.
- It shows a continuous decline, indicating that the model is learning from the training data and improving its performance.
Validation Loss (Orange Line):
- The validation loss starts slightly higher than the training loss, which is typical.
- It follows a similar declining trend as the training loss.
- The validation loss closely tracks the training loss, which suggests that the model is generalizing well to the validation data without overfitting.
5.2 MSE over Epochs
Training MSE (Blue Line):
- The training MSE starts at a high value and decreases steadily, similar to the training loss.
- This decrease indicates that the model’s predictions are becoming closer to the actual target values as training progresses.
Validation MSE (Orange Line):
- The validation MSE also starts high and decreases over the epochs.
- It tracks the training MSE closely, reinforcing that the model’s performance on unseen data is improving without significant overfitting.
5.3 Key Observations
- Steady Decrease: Both the loss and MSE for training and validation datasets are steadily decreasing, which is a positive sign that the model is learning effectively.
- No Overfitting: Since the validation loss and MSE closely follow the training loss and MSE without diverging, it indicates that there is no significant overfitting.
- Plateauing: Both training and validation loss/MSE curves start to plateau after about 30 epochs. This suggests that further training may not significantly improve the model’s performance.
5.4 Recommendations
- Early Stopping: Implement early stopping in future training runs to stop training when the validation loss stops improving, saving computational resources and potentially preventing overfitting.
- Learning Rate: If you observe the model’s performance plateauing, consider decreasing the learning rate to allow for finer adjustments to the weights.
- Model architecture: Experiment with different model architectures, hyperparameters, or additional features to improve the model’s predictive performance.