Data mining and machine learning to enhance data analysis for atmospheric science research

The scientific media is brimming with artificial intelligence buzzwords such as data mining and machine learning. But what do these terms really mean? And can we put these techniques to work in our own research? Data mining can be defined as the process of analysing data to uncover hidden patterns and extract useful information, whereas machine learning is the science of making a computer (or machine) learn from data without being explicitly programmed, so that it can subsequently perform tasks automatically. These methods have been widely used in a number of applications, from robotics to aerospace, and from environmental to physical sciences.

SMEAR II Hyytiälä visit to observe atmospheric measurements

As with other scientific disciplines, these methods have shown strong potential in atmospheric research, especially related to INAR research activities. This is thanks to the large and high-quality data sets generated by INAR infrastructure and research facilities, including the SMEAR networks, various measurement campaigns, laboratory experiments and simulations. Over the past two years, I have been working at INAR to propose and provide solutions based on data mining and machine learning to enhance data analysis for atmospheric science research. Here, I present several examples of how these methods are beneficial to INAR research activities.

The first feasibility study was to perform automatic classification of new-particle formation (NPF) days. Until now, NPF days have typically been classified manually. Days are labelled as non-event days or event days, and event days can be further grouped into class 1a, class 1b or class 2. Our feasibility study on this subject was recently published in Zaidan et al. (2018a). The study used long time-series DMPS data (1996-2010), obtained from the SMEAR II station at Hyytiälä, to train machine learning models. The models were then evaluated using the remaining data from the same station (2011-2014). The proposed method achieved a classification accuracy of 84.2 % for determining event/non-event days. In particular, it successfully predicted all event days for which the growth and formation rates can be determined with a good confidence level (often labelled as class 1a days). Most misclassified days were class 2 event days (classified with an accuracy of 75 %), for which the determination of growth and formation rates is much more uncertain. The results reported in this article point towards the potential of the proposed method and suggest further development in this direction for deployment in the smartSMEAR ecosystem.
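The evaluation protocol described above (train on the earlier years, test on the later ones) can be sketched in a few lines. This is only an illustration on synthetic data, not the method of Zaidan et al. (2018a): the single feature, the 40 % event fraction, the 100 days per year and the simple Gaussian naive Bayes classifier are all assumptions made for the example.

```python
import math
import random
from statistics import mean, pstdev

def make_day(rng, year):
    """Return one synthetic day as (year, feature, label)."""
    is_event = rng.random() < 0.4  # assumed event fraction
    # Hypothetical feature: event days are drawn from a shifted distribution.
    feature = rng.gauss(2.0 if is_event else 0.0, 1.0)
    return year, feature, "event" if is_event else "non-event"

def fit_gaussian_nb(rows):
    """Estimate the prior, mean and standard deviation of the feature per class."""
    model = {}
    for label in {row[2] for row in rows}:
        values = [row[1] for row in rows if row[2] == label]
        model[label] = (len(values) / len(rows), mean(values), pstdev(values))
    return model

def predict(model, x):
    """Return the class with the highest log posterior for feature value x."""
    def log_posterior(params):
        prior, mu, sigma = params
        return math.log(prior) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
    return max(model, key=lambda label: log_posterior(model[label]))

rng = random.Random(0)
days = [make_day(rng, year) for year in range(1996, 2015) for _ in range(100)]

# Chronological split: no day from the test period is seen during training.
train = [d for d in days if d[0] <= 2010]
test = [d for d in days if d[0] >= 2011]

model = fit_gaussian_nb(train)
correct = sum(predict(model, d[1]) == d[2] for d in test)
accuracy = correct / len(test)
print(f"held-out accuracy on 2011-2014: {accuracy:.3f}")
```

Splitting by year rather than at random keeps the test period strictly after the training period, which mimics how such a classifier would be used on newly measured days.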

A study related to data mining applied mutual information (MI) to find relationships between measured variables in large data sets (Zaidan et al. 2018b). When numerous atmospheric measurements are involved in campaigns or continuous monitoring, traditional data exploration tasks, such as generating scatter plots and other visualisations, can be tedious. The proposed method makes it possible to search for relevant variables that can then be investigated further by statistical analyses. Unlike linear correlation methods, such as the Pearson correlation coefficient, MI is also capable of detecting non-linear relationships between variables. The method was tested on large data sets (1996-2014) obtained from the SMEAR II station, Hyytiälä, to find the relationship between NPF and ambient variables. The applied MI method found that formation events were strongly linked to sulfuric acid concentration and water content, ultraviolet radiation, condensation sink and temperature. These quantities had previously been well established as important players in the phenomenon via dedicated field, laboratory and theoretical research. Here, the same results were obtained by a data analysis method that operates without supervision and without requiring a deep understanding of the physics.
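The difference between MI and linear correlation can be illustrated with a small synthetic example. The sketch below uses a simple histogram (binning) estimator of MI, which is an assumption made for illustration and not necessarily the estimator used in Zaidan et al. (2018b); the variable names, the quadratic relationship and the bin count are likewise hypothetical.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def to_bins(values, n_bins):
    """Map each value to the index of an equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(xs, ys, n_bins=10):
    """Histogram estimate of mutual information, in nats."""
    bx, by = to_bins(xs, n_bins), to_bins(ys, n_bins)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum of p(i, j) * log(p(i, j) / (p(i) * p(j))) over occupied cells.
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
y_quadratic = [v * v + rng.gauss(0.0, 0.05) for v in x]   # depends on x, non-linearly
y_unrelated = [rng.uniform(-1.0, 1.0) for _ in x]         # independent of x

r_quadratic = pearson(x, y_quadratic)
mi_quadratic = mutual_information(x, y_quadratic)
mi_unrelated = mutual_information(x, y_unrelated)
print(f"Pearson r (quadratic): {r_quadratic:+.3f}")
print(f"MI (quadratic):        {mi_quadratic:.3f} nats")
print(f"MI (unrelated):        {mi_unrelated:.3f} nats")
```

For the quadratic pair, the Pearson coefficient stays close to zero because the relationship is symmetric around zero, while the MI estimate is clearly larger than for the unrelated pair. Histogram estimators are biased slightly upwards, so the unrelated pair gives a small positive value rather than exactly zero.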

Machine learning algorithms can also be used to approximate the real physical processes behind atmospheric variables. A model used in this way is known as a proxy or an estimator. In our research on urban air pollution, we developed machine learning methods to act as air pollutant proxies. We demonstrated the use of these methods in two case studies: the development of black carbon and ozone proxies. The proposed methods were evaluated using data from two of the latest measurement campaigns in Saudi Arabia and Jordan, as described in Zaidan et al. (2019a) and Zaidan et al. (2019b), respectively. The studies demonstrated that these are promising methods for substituting real measurements or filling in missing data.
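The proxy idea can be sketched with a deliberately simple stand-in: an ordinary least-squares line fitted on the periods where the target pollutant was measured, then used to fill the gaps. The studies cited above use more capable machine learning models; the linear model, the single predictor, the synthetic data and all variable names below are assumptions made purely for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rng = random.Random(2)

# Hypothetical co-measured predictor and the "true" target it relates to.
predictor = [rng.uniform(5.0, 50.0) for _ in range(200)]
truth = [0.1 * p + 1.0 + rng.gauss(0.0, 0.3) for p in predictor]

# Hide every fifth target value to mimic gaps in the measurements.
target = [None if i % 5 == 0 else t for i, t in enumerate(truth)]

# Train the proxy only on the time steps where the target was observed.
observed = [(p, t) for p, t in zip(predictor, target) if t is not None]
slope, intercept = fit_line([p for p, _ in observed], [t for _, t in observed])

# Keep real measurements where available, use the proxy estimate elsewhere.
filled = [t if t is not None else slope * p + intercept
          for p, t in zip(predictor, target)]

# Because the data are synthetic, the gap-filling error can be checked directly.
gap_errors = [abs(filled[i] - truth[i])
              for i in range(len(truth)) if target[i] is None]
mae = sum(gap_errors) / len(gap_errors)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, gap MAE={mae:.3f}")
```

With real data the true values in the gaps are unknown, so the proxy would instead be validated on a held-out part of the observed period before being used for gap filling.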

PM2.5 low-cost sensors in megasense

Validation of low-cost sensors next to SMEAR III station

At INAR, I am currently involved in several groups, including Global atmosphere-Earth surface feedbacks (GAEA) and Multi-Scale Modelling - from processes to the Earth system (MSM). I am also involved in the megasense programme, which focuses on air pollution research in collaboration with the computer science department at the University of Helsinki. My research activities there cover air pollution data analytics, low-cost sensor measurements, calibration and validation, as well as supervising students. I am also open to any collaboration related to INAR research activities and beyond; my contact details can be found below.

This article was written by Martha Arbayani Zaidan and appeared in the Pan Eurasian EXperiment (PEEX) blog and the INAR newsletter on 29 November 2019.

References

M.A. Zaidan, D. Wraith, B.E. Boor and T. Hussein, Bayesian proxy modelling for estimating black carbon concentrations using white-box and black-box models, Applied Sciences 9, 4976, 2019b.

M.A. Zaidan, L. Dada, M.A. Alghamdi, H. Al-Jeelani, H. Lihavainen, A. Hyvärinen and T. Hussein, Mutual information input selector and probabilistic machine learning utilisation for air pollution proxies, Applied Sciences 9, 4475, 2019a.

M.A. Zaidan, V. Haapasilta, R. Relan, P. Paasonen, V.-M. Kerminen, H. Junninen, M. Kulmala and A.S. Foster, Exploring nonlinear associations between atmospheric new-particle formation and ambient variables: an information theoretic approach, Atmospheric Chemistry and Physics 18 (17), 12699-12714, 2018b.

M.A. Zaidan, V. Haapasilta, R. Relan, H. Junninen, P.P. Aalto, M. Kulmala, L. Laurson and A.S. Foster, Predicting atmospheric particle formation days by Bayesian classification of the time series features, Tellus B: Chemical & Physical Meteorology, 1-10, 2018a.






