Herrera Posada & Aristizábal / INGE CUC, vol. 18 no. 2, pp. 249–265. July - December, 2022

Artificial Intelligence and Machine Learning Model for Spatial and Temporal Prediction of Drought Events in the Department of Magdalena, Colombia

DOI: http://doi.org/10.17981/ingecuc.18.2.2022.20

Scientific Research Article. Received: 27/08/2021. Accepted: 11/11/2021.

Daissy Milenys Herrera Posada

Universidad Nacional de Colombia. Medellín (Colombia)

dmherrerap@unal.edu.co

Edier Aristizábal

Universidad Nacional de Colombia. Medellín (Colombia)

evaristizabalg@unal.edu.co

To cite this paper

D. Herrera Posada & E. Aristizábal, “Artificial Intelligence and Machine Learning Model for Spatial and Temporal Prediction of Drought Events in the Department of Magdalena, Colombia”, INGE CUC, vol. 18, no. 2, pp. 249–265. DOI: http://doi.org/10.17981/ingecuc.18.2.2022.20

Abstract

Introduction— Drought is one of the most critical hydrometeorological phenomena in terms of its impacts on society. Although Colombia is a tropical country, some areas of the territory experience periods of drought, which cause significant economic damage.

Objective— Due to recent advances in the spatial and temporal resolutions of remote sensing and in artificial intelligence techniques, it is possible to develop machine learning models supported by historical information.

Methodology— In this study, Random Forest (RF) and Bagged Decision Tree Classifier (DTC) models were built to perform spatial and temporal drought prediction in the department of Magdalena using the following features: Normalized Difference Vegetation Index (NDVI), land surface temperature (LST), precipitation, Normalized Difference Water Index (NDWI), Normalized Multiband Drought Index (NMDI), evapotranspiration (ET), surface soil moisture (SSM), subsurface soil moisture (SUSM), Multivariate ENSO Index (MEI), Southern Oscillation Index (SOI), and Oceanic Niño Index (ONI).

Results— For labelling, which allows one to train and evaluate the model, the Standardized Precipitation Index (SPI) was used to identify drought events.

Conclusions— The implementation of the developed model can allow governmental entities to take actions to mitigate impacts generated by recurring droughts in their territories.

Keywords— Drought forecasting; Standardized Precipitation Index; satellite imagery; Google Earth Engine; machine learning; random forest; decision tree classifier; spatial interpolation

I. Introduction

Drought is a common and frequently occurring characteristic of the climate [1], and it is defined as a lack of moisture caused by the absence of precipitation during a given period of time [2]. When the length of time without precipitation increases significantly, the amount of available water cannot meet the water demand of the environment and population [3]. According to the FAO, drought is one of the most critical natural phenomena in terms of impacts on society, since it halts food production, prevents the development of pastures, affects markets, and causes the deaths of living beings and population migration [4].

An increase in the number of droughts has been pointed out as one of the main consequences of climate change. Beyond the increase in their frequency, an increase in the intensity of droughts is also expected [5], [6]. This is due to changes in the hydro-climatological variables that condition or determine the occurrence of drought events; [7] estimates a reduction in precipitation of between 2% and 8% for each kelvin of temperature increase due to global warming. In addition, CU and NCAR predict an increase in evapotranspiration and in surface temperature [8].

In Colombia, drought has had a significant impact on the population. During the 2014 drought event, there were 642 fires in the departments of La Guajira, Magdalena, Córdoba, Atlántico, Sucre, Bolívar, and Cesar, as well as the deaths of 3 200 head of cattle, the loss of about 47% of rice crops, and water shortages in 48 municipalities in Colombia [9]. In the case of the department of Magdalena, the drought event that occurred in 2015 reduced the water supply of the department by 60%, decreasing the water supply available for the population and for agricultural, fish farming, and livestock farming activities in the department, which led to food shortages [10].

Considering the negative effects of drought and the possible increase in its frequency as a consequence of climate change, it is necessary to implement mechanisms to monitor the phenomenon and to identify strategies for adaptation and for reducing its effects. These mechanisms include early warning systems supported by machine learning techniques that make it possible to evaluate or predict the occurrence of droughts [11]-[16].

Machine learning (ML) is a group of computational techniques within artificial intelligence (AI) that draws on statistics to learn from past events, recognize patterns, and predict new observations [17]. ML techniques include several supervised and unsupervised methods, such as neural networks, decision trees, logistic regression, principal component analysis, and clustering, among others. ML applications range from cancer prediction and forecasting [17], automatic speech recognition [18], and daily flow forecasting [19] to remote sensing [20].

Regarding the application of artificial intelligence for drought prediction, several works stand out. For the prediction of agricultural drought in Australia, the authors of [11] implement classification and regression trees, random forests, flexible discriminant analysis, and support vector machines. Researchers in Korea run random forest models and regression trees in different climatic regions of the United States for drought prediction, with drought determined by the Standardized Precipitation Index (SPI) [12]. In Malaysia, in order to evaluate and reduce the potential impact of drought on palm crops, the authors implement different types of support vector machines for the prediction of the Standardized Precipitation Evapotranspiration Index (SPEI) [13]. Similarly, Australian researchers predict the SPEI to assess agricultural drought over south-eastern Australia, using models such as random forests, support vector machines, and neural networks [14]. Researchers in Canada implement bootstrapping and boosting in artificial neural networks and support vector machines for SPI prediction at different time scales over the Awash River basin in Ethiopia [15]. Finally, researchers in China propose a new index to evaluate drought, called the Integrated Agricultural Drought Index (IDI), and use neural networks with back-propagation to recognize non-stationary patterns in the occurrence of droughts [16]. In Colombia, the national observatory for the follow-up and monitoring of drought estimated the threat due to meteorological drought at the national scale using unsupervised methods such as principal component analysis [21].

Several studies incorporate satellite information as an input variable for the execution of different models [12], [14], [16]. This is because remote sensors provide data with a high temporal resolution that cover large areas of land [3], thus allowing access to areas where field data are scarce or where the density of specific data is low. The Colombian territory is no exception to these problems; for this reason, the evaluation of droughts using open, free, and good-quality information, such as satellite images, becomes a necessary strategy for carrying out distributed hydrometeorological studies in any part of the territory. Additionally, droughts are natural phenomena that affect large portions of land [1]; therefore, it is relevant to obtain continuous maps that detail the spatial extent of the phenomenon. Remote sensing data facilitate obtaining these results and reduce the uncertainty introduced when distributed maps are produced by interpolation in areas where the density of point data is low.

This work seeks to implement another type of Machine Learning (ML) tool for drought prediction over the Colombian territory. The department of Magdalena is used as a pilot study area, and satellite information is used to predict the droughts determined by the SPI by means of decision trees and random forests. This allows one to model the temporal occurrence and spatial distribution of drought events over the department. The objective is to establish an early warning system that allows authorities to take measures to reduce a region’s vulnerability to drought events and to establish a methodology for drought prediction that can be implemented in other territories of Colombia.

II. Study Area

Each year, the UNGRD (Colombia) prepares a consolidated report on the emergencies reported in each municipality of Colombia [22], in which the damage and the human and economic losses are recorded. Droughts are among the events reported. Between 2010 and 2019, 144 drought events were recorded (Table 1), and Cauca and Magdalena are the departments with the highest numbers of events. Therefore, considering the high recurrence of events and its territorial extension, the department of Magdalena was selected as a pilot study area.

Table 1.

Droughts reported by departments to the UNGRD from 2010 to 2019.

Department	Droughts reported	Department	Droughts reported
Atlántico	8	La Guajira	13
Bolívar	12	Magdalena	17
Boyacá	4	Nariño	5
Caldas	1	Norte de Santander	2
Cauca	17	Quindío	1
Cesar	7	Risaralda	9
Córdoba	14	Santander	13
Cundinamarca	2	Sucre	6
Guaviare	1	Tolima	4
Huila	1	Valle del Cauca	5

* The consolidation of emergencies for the year 2010 presented problems for the visualization of the information; therefore, the reports for the year 2010 could not be considered.

Source: Authors.

The department of Magdalena is in northern Colombia, with a territorial extension of 23 188 km² [23] and a population of 1 263 788 inhabitants [24]. The average temperature varies by sector within the department; in the south it can exceed 28°C, and in the center and north it is between 26°C and 28°C; meanwhile, in the Sierra Nevada, it decreases with elevation above sea level, reaching –8°C. With a bimodal regime, the average annual precipitation varies by sector within Magdalena: in the flat area of the department, the average annual precipitation is between 1 000 and 1 500 mm, and in the south and in the vicinity of the Sierra Nevada, the average precipitation exceeds 2 000 mm [25].

III. Data

A. Remote sensing data

As in the study by researchers in Korea, the Normalized Difference Vegetation Index (NDVI), Land Surface Temperature (LST), precipitation, Normalized Difference Water Index (NDWI), Normalized Multiband Drought Index (NMDI), and Evapotranspiration (ET) were used as predictor variables. Additionally, the Surface Soil Moisture (SSM) and Subsurface Soil Moisture (SUSM) were considered [12]. All these variables are directly related to the water content present in the soil and hydrological systems [26]. In addition, vegetation responds to moisture changes; therefore, the vegetation indices NDVI, NDWI, and NMDI were considered to understand the state of the vegetation and how it is affected by a moisture deficit.

The previously mentioned variables are obtained through the Google Earth Engine (GEE) platform [27], from which satellite information is extracted for the department of Magdalena for the period from 2010 to 2019 on a monthly time scale. For precipitation, the Climate Hazards Group InfraRed Precipitation with Station Data (CHIRPS) database was accessed, specifically UCSB-CHG/CHIRPS/DAILY, with a spatial resolution of 0.05°, which is equivalent to approximately 5.5 km (Fig. 1A). Soil moisture data were provided by the National Aeronautics and Space Administration (NASA) in conjunction with the U.S. Department of Agriculture (USDA) at a resolution of 0.25°, which corresponds to approximately 28 km (Fig. 1G and Fig. 1H).

The other variables were obtained through information provided by the MODIS satellite at spatial resolutions of 1 km for LST (MODIS/006/MOD11A2) (Fig. 1B), 0.5 km for ET (MODIS/006/MOD16A2) (Fig. 1C), and 0.5 km for reflectance (MODIS/006/MOD09A1).

Fig. 1. Satellite predictor variables: A) Precipitation. B) Land Surface Temperature (LST). C) Evapotranspiration (ET). D) Normalized Difference Vegetation Index (NDVI). E) Normalized Difference Water Index (NDWI). F) Normalized Multiband Drought Index (NMDI). G) Surface Soil Moisture (SSM). H) Subsurface Soil Moisture (SUSM).

Source: Authors.

The NDVI, NDWI, and NMDI (Fig. 1D, Fig. 1E and Fig. 1F) were estimated using reflectance and the equations in Table 2 at a spatial resolution of 0.5 km. The NDVI indicates the state of the vegetation, based on the radiation reflected by the vegetation in near-infrared wavelengths with respect to the red band of the visible spectrum [28]; the NDWI estimates the moisture content of the vegetation based on the radiation reflected by the surface in the infrared wavelength [29], and the NMDI is used to track the moisture content of the soil and vegetation from the infrared [30].

Table 2.

Calculation and bands needed to find NDVI, NDWI and NMDI.

Vegetation index	Formula
NDVI	(ρBand 2 – ρBand 1) / (ρBand 2 + ρBand 1)
NDWI	(ρBand 2 – ρBand 5) / (ρBand 2 + ρBand 5)
NMDI	(ρBand 2 – (ρBand 6 – ρBand 7)) / (ρBand 2 + (ρBand 6 – ρBand 7))

Source: Authors.
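The index calculations can be illustrated with a minimal per-pixel sketch. It assumes the MOD09A1 band numbering used in Table 2 (band 1 red, band 2 near infrared, bands 5 to 7 shortwave infrared) and reflectances already scaled to the 0-1 range. The function names and the sample pixel are hypothetical, and the NMDI expression follows the commonly cited definition of that index; this is not the authors' GEE script.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denominator = a + b
    return (a - b) / denominator if denominator != 0 else None


def ndvi(b1_red, b2_nir):
    # Near infrared against red: state of the vegetation.
    return normalized_difference(b2_nir, b1_red)


def ndwi(b2_nir, b5_swir):
    # Near infrared against shortwave infrared: vegetation moisture content.
    return normalized_difference(b2_nir, b5_swir)


def nmdi(b2_nir, b6_swir, b7_swir):
    # Near infrared against the difference of two shortwave infrared bands.
    return normalized_difference(b2_nir, b6_swir - b7_swir)


# Hypothetical surface reflectances for one pixel.
pixel = {"b1": 0.05, "b2": 0.40, "b5": 0.30, "b6": 0.20, "b7": 0.10}
indices = {
    "NDVI": ndvi(pixel["b1"], pixel["b2"]),
    "NDWI": ndwi(pixel["b2"], pixel["b5"]),
    "NMDI": nmdi(pixel["b2"], pixel["b6"], pixel["b7"]),
}
```

Returning None for a zero denominator is a design choice of this sketch, so that such pixels can be handled together with other missing values.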

As can be seen, each variable has a different spatial resolution, so it is necessary to homogenize the resolutions; for this purpose, the GEE Scale function was used and all the satellite information was resampled at a spatial resolution of 1 km.

Missing values within the data series were filled in with the average value of each image. Anomalous data, caused by information processing errors or errors in data collection, were limited to the range between the 1st and 99th percentiles of the data series of each predictor variable: values above the 99th percentile were replaced by that percentile, and values below the 1st percentile were replaced by the 1st percentile.
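The two cleaning steps can be sketched as follows. This is a minimal illustration that assumes missing pixels are marked as None and that percentiles are obtained by linear interpolation; neither detail is stated in the study, and the function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the valid entries of one image."""
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid)
    return [mean if v is None else v for v in values]


def percentile(sorted_values, p):
    """Percentile (p between 0 and 100) of a sorted list, by linear interpolation."""
    k = (len(sorted_values) - 1) * p / 100
    lower = int(k)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = k - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def clip_to_percentiles(values, low=1, high=99):
    """Limit every value to the [low, high] percentile range of the series."""
    ordered = sorted(values)
    p_low = percentile(ordered, low)
    p_high = percentile(ordered, high)
    return [min(max(v, p_low), p_high) for v in values]


image = fill_missing_with_mean([1.0, None, 3.0])
series = clip_to_percentiles([float(v) for v in range(101)])
```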

B. Macro-climatic variables

The two phases of the El Niño Southern Oscillation (ENSO), La Niña and El Niño, are closely related to the hydrological anomalies that have developed in the South American tropics [31]. High precipitation and maximum flows are associated with the occurrence of La Niña, while El Niño is characterized by long-lasting dry periods, modifying the intensity and duration of droughts within the territory [31].

To evaluate the influence of ENSO, the Multivariate ENSO Index (MEI), Southern Oscillation Index (SOI), and Oceanic Niño Index (ONI) were selected for the study. These indices are produced at a monthly temporal resolution by the U.S. National Oceanic and Atmospheric Administration (NOAA) and consider different parameters that allow the status of each month to be classified as El Niño, La Niña, or Neutral.

IV. Model for Drought Forecasting

A. Reference data: SPI

To evaluate drought conditions within the department of Magdalena, the Standardized Precipitation Index (SPI) is selected as the model response variable. The SPI evaluates precipitation anomalies at various time scales, allowing the study of various types of droughts [3].

For the calculation of the SPI, precipitation information on a monthly scale and a continuous information record of at least 30 years are required [2]. Therefore, a search was made for data provided by the IDEAM from the existing precipitation stations within the department of Magdalena that meet the requirements for the calculation of the index. Thus, 43 precipitation stations that present rainfall information on a monthly scale since before 1989 were identified, as shown in Fig. 2.

Fig. 2. Precipitation stations within the department of Magdalena used in the study, and their respective areas of influence.

Source: Authors.

The rainfall stations provide point values, so they do not by themselves describe the spatial distribution of rainfall in the department. That distribution is needed to establish the spatial distribution of the SPI; without it, the SPI cannot be analyzed together with the previously mentioned distributed variables. Therefore, for each IDEAM rainfall station within the department of Magdalena, an area of influence with a radius of 5 km was established (Fig. 2). This makes it possible to assign an SPI value to each area of influence and to extract the satellite information corresponding to each evaluated area, which allows the analysis to be carried out (Fig. 3).
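One way to picture the area of influence is to keep the pixel centers that lie within 5 km of a station. The sketch below assumes great-circle (haversine) distances and a mean Earth radius of 6371 km; the coordinates are hypothetical, and the study may have built the buffers differently, for example directly in GEE or in a GIS.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pixels_in_influence_area(station, pixels, radius_km=5.0):
    """Return the pixel centers located within radius_km of the station."""
    lat0, lon0 = station
    return [p for p in pixels if haversine_km(lat0, lon0, p[0], p[1]) <= radius_km]


station = (10.0, -74.0)  # hypothetical station (latitude, longitude)
pixels = [(10.0, -74.0), (10.03, -74.0), (10.1, -74.0)]  # hypothetical pixel centers
selected = pixels_in_influence_area(station, pixels)
```

Each selected pixel would then inherit the SPI value of its station, which is how point labels and gridded predictors end up in the same table of observations.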

Fig. 3. Conceptual scheme of the sources of information for the predictor variables and the response variable.

Source: Authors.

The SPI is calculated by fitting the monthly precipitation data (x) to a gamma distribution function (1), where the alpha (α) and beta (β) parameters of the function are estimated for each precipitation station and time scale. With the precipitation series fitted to a distribution, we proceed to calculate the cumulative probability of each event (2). Since the gamma distribution function is not defined for events in which x = 0, the factor (3) is added, which represents the probability in the case that precipitation events have values of 0 [32]. The cumulative probability is transformed into a standard normal random variable (Z), with a mean of zero and a standard deviation of 1; the Z values found represent the values corresponding to the SPI [32]. In other words, the SPI represents how many standard deviations, above or below, an event is from the average rainfall fitted to the gamma distribution function [32].
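For reference, the standard formulation of these steps is commonly written as below. This is a sketch of the usual SPI derivation, not a transcription of the authors' equations: the gamma density presumably corresponds to (1), the cumulative probability to (2), and the zero-precipitation adjustment to (3). The symbols q (fraction of records with zero precipitation), H (adjusted cumulative probability), and Φ (standard normal distribution function) are notation assumed here.

```latex
\begin{align*}
g(x) &= \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\; x^{\alpha-1}\, e^{-x/\beta}, \qquad x > 0 \\
G(x) &= \int_{0}^{x} g(t)\,dt \\
H(x) &= q + (1 - q)\,G(x) \\
\mathrm{SPI} &= Z = \Phi^{-1}\!\big(H(x)\big)
\end{align*}
```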

By considering the normal distribution of SPI values, drought or wetness events can be defined [2]. The WMO [3], based on [2], categorizes the SPI from extremely wet, for SPI values greater than 2, to extremely dry, for values less than –2 (Table 3).

Table 3.

SPI values.

Value	Category
2.0 and above	Extremely wet
1.5 to 1.99	Very wet
1.0 to 1.49	Moderately wet
–0.99 to 0.99	Near normal
–1.0 to –1.49	Moderately dry
–1.5 to –1.99	Severely dry
–2.0 and below	Extremely dry

Source: [3].
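The categories in Table 3 can be expressed as a simple lookup. The sketch below treats the ranges as continuous, so that, for example, a value of 0.995 falls in "Near normal"; that treatment of the gaps between the tabulated bounds is an assumption, and the function name is hypothetical.

```python
def spi_category(spi):
    """Map a continuous SPI value to the categories of Table 3."""
    if spi >= 2.0:
        return "Extremely wet"
    if spi >= 1.5:
        return "Very wet"
    if spi >= 1.0:
        return "Moderately wet"
    if spi > -1.0:
        return "Near normal"
    if spi > -1.5:
        return "Moderately dry"
    if spi > -2.0:
        return "Severely dry"
    return "Extremely dry"


examples = {value: spi_category(value) for value in (2.3, 0.2, -1.0, -1.7, -2.0)}
```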

The SPI can be evaluated for different time scales depending on the type of drought to be studied, which can be meteorological, hydrological, or agricultural [33]. Meteorological drought occurs when, during a period of time, precipitation is lower than expected; hydrological drought refers to a decrease in river flow and the levels of reservoirs and lakes due to a deficit of precipitation; and agricultural drought refers to the fact that, due to the deficit of precipitation, there is not enough moisture in the soil for the normal functioning of crops [33]. According to WMO, to study agricultural drought, the SPI should be calculated with accumulated precipitation between 1 and 6 months; for meteorological drought, the accumulated precipitation should be between 1 and 2 months, and for hydrological drought, between 6 and 24 months [3].

In this sense, the present work evaluates agricultural drought, due to the social repercussions of this event. For this, the SPI was calculated with a precipitation accumulation period of 3 months (SPI3), since this allows one to understand the changes of agricultural drought [34] and provides a seasonal approximation of precipitation [3]. Additionally, the time span evaluated is relevant for annual crops [35] and the intra-seasonal study of precipitation is relevant for herbaceous and low-cut crops [36].

In this way, and with the IDEAM precipitation information, SPI3 was calculated for each precipitation station through the SPI Generator program developed by the National Drought Monitoring Center of the UNL [37], and the SPI3 values from 2010 to 2019 in the areas of influence of each station were selected.
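The 3-month accumulation that precedes the gamma fit can be sketched as a rolling sum. This only illustrates the accumulation step; it does not reproduce the SPI Generator program, and the function name and the sample rainfall values are hypothetical.

```python
def rolling_accumulation(monthly_precip, window=3):
    """Sum each month with the (window - 1) months before it.

    Positions without a full window of preceding data are returned as None.
    """
    totals = []
    for i in range(len(monthly_precip)):
        if i + 1 < window:
            totals.append(None)
        else:
            totals.append(sum(monthly_precip[i + 1 - window : i + 1]))
    return totals


monthly_mm = [10, 20, 30, 40]  # hypothetical monthly rainfall in mm
accumulated = rolling_accumulation(monthly_mm)
```

The accumulated series, rather than the raw monthly values, is what would be fitted to the gamma distribution to obtain SPI3.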

The SPI3 value calculated is a continuous value that can be classified according to Table 3; however, the classification of the drought magnitude is a function of local conditions [3]. Therefore, in order to establish which SPI3 values represent drought in the department of Magdalena, SPI3 was calculated for each month and municipality that reported drought emergencies to the UNGRD. The frequency histogram of the SPI3 values obtained is shown in Fig. 4. As can be seen, most of the reported months have an SPI3 of –1; therefore, the continuous variable is transformed into a categorical variable. SPI values below –1 indicate a drought and values above –1 indicate normal or wet conditions, as shown in Fig. 5.

Fig. 4. SPI3 frequency histogram of months with reported droughts in the department of Magdalena from 2010 to 2019.

Source: Authors.

Fig. 5. Map of SPI3 classified into drought and normal/wet conditions.

Source: Authors.

B. Predictor variables

Initially, 11 predictor variables were considered: the ONI, MEI, SOI, LST, precipitation, ET, SSM, SUSM, NDVI, NDWI, and NMDI. For the selection of the variables with the highest predictive capacity, the free Feature Selector from scikit-learn in Python was used, where the importance of each variable is calculated according to a gradient boosting machine (GBM) [38]. The results are presented in Fig. 6. The variables that contribute the least to the drought prediction are the NDWI, ONI, and MEI, so they were eliminated from the model. To establish the collinearity between the remaining variables, the Pearson correlation coefficient was calculated, as shown in Fig. 7. The SUSM and SSM variables show a high positive correlation (0.94), so the SUSM variable was not considered in the model due to its high collinearity and lower importance compared to SSM (Fig. 6).
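The collinearity check can be sketched as follows: compute the Pearson coefficient for each pair of variables and, when a pair is highly correlated, keep only the more important one. The 0.9 cutoff, the sample values, and the importance scores are hypothetical (the study only reports that a correlation of 0.94 was considered high), and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def drop_collinear(features, importance, threshold=0.9):
    """Keep features in order of importance, skipping any that is highly
    correlated with a feature that has already been kept."""
    ranked = sorted(features, key=lambda name: importance[name], reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(features[name], features[kept])) < threshold for kept in selected):
            selected.append(name)
    return selected


# Hypothetical values for three predictors at four observations.
features = {
    "SSM": [0.10, 0.20, 0.30, 0.40],
    "SUSM": [0.12, 0.19, 0.33, 0.41],
    "LST": [30.0, 25.0, 31.0, 24.0],
}
importance = {"SSM": 0.20, "SUSM": 0.10, "LST": 0.15}  # hypothetical scores
kept = drop_collinear(features, importance)
```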

Fig. 6. Importance of each variable according to the Feature Selector.

Source: Authors.

Fig. 7. Correlation of selected variables.

Source: Authors.

C. Ensemble models

For the construction of the drought prediction model, ensemble machine learning models were used. These models retain the properties of the base estimator but reduce the variance or fit problems that can affect model performance. Bagging-type ensemble models use estimators with good performance and build multiple models in parallel, randomly selecting the observations and, in some cases, the variables. This is why they are used for problems where variance reduction is desired [39]. Boosting methods, on the other hand, use weak estimators, i.e., estimators with poor performance, and build consecutive models in which higher weights are assigned to the observations erroneously predicted by the previous estimator. In this way, a robust final model is obtained that reduces the fitting problems of the initial models [39], [40].

In this work, we selected bagging models with decision tree-type estimators because these estimators perform well, yielding excellent fits, when there is a large volume of observations. The following is a brief description of the two bagging models used.

1) Bagging decision trees

The Decision Tree Classifier (DTC) is based on breaking a complex decision down into multiple simple decisions, so that the combined result of the simple decisions resolves the initial complex decision [41]. It is called a decision tree because simpler decisions are derived from the complex decision, and these in turn lead to even simpler decisions, thus forming a tree-shaped scheme in which the leaves represent the final answer to each question and the root represents the complex decision to be addressed [41].

To reduce the variance associated with the decision tree model, subsets of data are created by randomly drawing observations from the training data, and a separate predictive model is fitted to each subset; the final result is the prediction most often repeated across these models [39].
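The bagging procedure can be illustrated with a deliberately small sketch. As a stand-in for a full decision tree, it uses a one-split tree (a "stump") on a single hypothetical predictor, draws bootstrap samples with replacement, and combines the stumps by majority vote. The data, the function names, and the choice of 25 estimators are assumptions made for illustration; the study itself used scikit-learn.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-split tree on (value, label) pairs.

    The stump predicts drought (1) when value < threshold; the threshold is the
    candidate with the fewest training errors.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((value < threshold) != bool(label) for value, label in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagging(samples, n_estimators=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        stumps.append(fit_stump(bootstrap))
    return stumps


def predict_bagging(stumps, value):
    """Return the class predicted by the majority of the stumps."""
    votes = Counter(int(value < threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical (precipitation anomaly, drought label) observations.
samples = [(-2.0, 1), (-1.5, 1), (-1.2, 1), (-0.5, 0), (0.0, 0), (0.8, 0), (1.5, 0)]
stumps = fit_bagging(samples)
dry_prediction = predict_bagging(stumps, -3.0)
wet_prediction = predict_bagging(stumps, 2.0)
```

Because each stump sees a slightly different resample, individual thresholds vary, and the vote averages out that variation, which is the variance reduction that motivates bagging.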

2) Random forest

The Random Forest (RF) method uses the same concept as bagging decision trees; the difference is that RF, in addition to randomly selecting the observations of each subset, also randomly selects the variables to be used with each subset of data [42].
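The additional random selection of variables can be sketched as below, assuming the "sqrt" rule, in which the number of candidate variables is the integer part of the square root of the number of predictors. The list of seven predictors is the set that remains after the variable selection described earlier; the function name is hypothetical. Note that in scikit-learn's random forest the candidate variables are redrawn at every split of every tree rather than once per subset of data.

```python
import math
import random


def sample_feature_subset(feature_names, rng):
    """Draw sqrt(n) candidate features without replacement (at least one)."""
    k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)


predictors = ["SOI", "LST", "precipitation", "ET", "SSM", "NDVI", "NMDI"]
rng = random.Random(0)
candidates = sample_feature_subset(predictors, rng)
```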

To run the supervised models, the scikit-learn (sklearn) package for Python was used [43]. According to the scheme shown in Fig. 8, the observations are first randomly divided into training data (75%), which are used for validation curves, hyper-parameter fitting, learning curves, and re-training the model, and evaluation data (25%).
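The random 75%/25% division can be sketched with the standard library alone. The function name and the fixed seed are assumptions made for reproducibility of the example; the study used scikit-learn for this step.

```python
import random


def split_observations(observations, evaluation_fraction=0.25, seed=0):
    """Shuffle a copy of the observations and split it into training and evaluation parts."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    n_evaluation = round(len(shuffled) * evaluation_fraction)
    return shuffled[n_evaluation:], shuffled[:n_evaluation]


observations = list(range(100))  # hypothetical observation identifiers
training, evaluation = split_observations(observations)
```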

Fig. 8. Framework of the application of RF and DTC models.

Source: Authors.

V. Drought Forecasting Model

A validation curve refers to the result obtained by varying a hyper-parameter over a wide range of values, in order to delimit where the model performs best by modifying only one hyper-parameter [44]. On the other hand, with a learning curve, it is possible to visualize the behaviour or performance of the model as the number of observations increases, which makes it possible to establish whether the model has problems with fit or variance [44]. Within the two procedures described and in the search for the best set of hyper-parameters, cross-validation is used, as shown in Fig. 8. This consists of dividing the training data into subsets, in this case, 5, and in each iteration 4 subsets are used for training and the remaining subset is used for validation; thus, all the observations are used to both train and validate the model. The metric used to determine the performance in all the procedures described was recall, since it focuses on evaluating the accuracy or predictive ability of the class of interest, which in this case is the drought class. Finally, the final model, already calibrated, uses the evaluation data, i.e., 25% of the data, to predict the response variable; then, it is possible to discover the predictive capacity of the model by comparing the simulated SPI3 with the measured values.
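The 5-fold scheme and the recall metric can be sketched as follows. The sketch assumes contiguous folds over already shuffled training data and treats drought as class 1; the function names are hypothetical, and the study relied on scikit-learn for both steps.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds.

    Every observation appears in exactly one validation fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def recall(y_true, y_pred, positive=1):
    """Share of actual positives (drought) that the model identified."""
    true_positives = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_negatives = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


folds = list(k_fold_indices(10, k=5))
example_recall = recall([1, 1, 0, 1], [1, 0, 0, 1])
```

Recall ignores false positives by construction, so a model tuned only on recall can overpredict drought; precision needs to be checked alongside it.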

A. Bagging decision tree

Table 4 presents the hyper-parameters that optimize the model results in terms of recall. Fig. 9 shows the learning curve of the model: it is possible to observe that, as more observations are added, both the validation curve and the training curve increase the recall value; likewise, both curves tend to approach each other, which indicates a reduction in the variance of the problem.

Table 4.

Variation of hyper-parameters for bagging decision trees.

Hyper-parameters
Changing	Range of values	Best value
min samples leaf	20, 30, 40, and 50	40
Splitter	Best or Random	Best
max features	Sqrt, Log2, or None	Sqrt
Constant	Value
Class weight	Balanced
Criterion	Entropy
Random state	0

Source: Authors.

Fig. 9. Learning curve for bagging decision trees.

Source: Authors.

Table 5 presents the classification report and confusion matrix using the evaluation data (25%). False Negatives (FNs), i.e., drought events not identified by the model, represent 2.5% of the total evaluation data. False Positives (FPs), i.e., drought events erroneously identified by the model, represent 21% of the total evaluation data. In fact, the number of FPs is higher than the number of True Positives (TPs), i.e., drought events correctly identified by the model. This percentage is significant and indicates that the model tends to overestimate the number of drought events. The precision for predicting drought is 0.33, while the recall is 0.8, which is consistent with the percentages of FPs and FNs, respectively.

Table 5.

Classification report and confusion matrix results for bagging decision trees.

	Precision	Recall	F1 Score	Support
Near normal / wet	0.96	0.76	0.85	72374
Dry	0.33	0.8	0.47	10636
Average	0.88	0.76	0.8	83010
Evaluation data: 83010	TN: 54919	FN: 2089	TP: 8547	FP: 17455

Source: Authors.

B. Random forest

The set of hyper-parameters for random forest that yielded the best drought prediction is presented in Table 6.

Table 6.

Variation of hyper-parameters for random forest.

Hyper-parameters
Changing	Range of values	Best value
min_samples_leaf	20, 30, 40, and 50	20
n estimators	90, 100, and 150	100
Class weight	Balanced or Balanced subsample	Balanced
Max features	Sqrt, Log2, or None	Sqrt
Constant	Value
Min samples leaf	20
Criterion	Entropy
Random State	0

Source: Authors.

Fig. 10 presents the learning curve obtained with RF, showing that the model improves performance, in terms of recall, as the amount of data increases, in both the training and validation curves.

Fig. 10. Learning curve for random forest.

Source: Authors.

Likewise, it is possible to observe that both curves tend to converge, which indicates that the variance of the problem is reduced as the model improves the learning process by increasing the amount of data.

The results of the model with the evaluation data (25%) are shown in Table 7. The percentages of FNs and FPs are 2% and 7.5%, respectively, indicating that the model with RF tends to reduce FNs over FPs. This is reflected in the precision and recall obtained (0.59 and 0.84, respectively).

Table 7.

Classification report and confusion matrix results for random forest.

	Precision	Recall	F1 Score	Support
Near normal / wet	0.97	0.91	0.94	72374
Dry	0.59	0.84	0.69	10636
Average	0.93	0.9	0.91	83010
Evaluation data: 83010	TN: 66100	FN: 1706	TP: 8930	FP: 6274

Source: Authors.
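The drought-class metrics can be checked directly against the confusion matrix counts. The sketch below uses the TP, FP, FN, and TN values reported in Table 7; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 score, and accuracy for the positive (drought) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts taken from Table 7 (random forest, evaluation data).
rf_metrics = classification_metrics(tp=8930, fp=6274, fn=1706, tn=66100)
rounded = {name: round(value, 2) for name, value in rf_metrics.items()}
```

Rounded to two decimals, the precision, recall, and F1 score match the "Dry" row of Table 7 (0.59, 0.84, and 0.69).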

C. Spatial prediction of drought

To obtain the spatial distribution of droughts over the entire department of Magdalena, we took the distributed values of each of the selected predictor variables and, using the RF model constructed, estimated the SPI3 value for the entire department.

The month of July 2014 was selected for the spatial validation of the SPI3 results. Press reports for that month stated that the municipalities of Santa Marta, Plato, Zapayán, Concordia, and Tenerife declared a public calamity due to drought, and that the municipalities of San Sebastián, San Zenón, San Ángel, Nueva Granada, and Pivijay were close to declaring it. Reports indicated that 70% of the department’s crops were affected, 4 300 head of cattle died, and there were large forest fires [45]. In response to the emergency, 65 000 liters of water were provided to 100 000 families in the municipalities of Santa Marta, Zapayán, and Concordia [46]. Due to the state of emergency, on 1 August 2014, the UNGRD declared a public calamity for the entire department of Magdalena.

The results of the prediction model, using RF, are shown in Fig. 11; Fig. 11A shows the prediction of the response variable and Fig. 11B presents the probability of the occurrence of drought.

Fig. 11. Application of the random forest model for July 2014 and for municipalities experiencing a public calamity due to drought for the month in question. A) Drought forecast, according to SPI3. B) Drought probability map.

Source: Authors.
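The two maps can be thought of as the class prediction and the class probability of the fitted model evaluated pixel by pixel. The following sketch shows one possible way to do this with scikit-learn [43] and NumPy. The training table, the labelling rule, the raster size, and the number of predictors are synthetic assumptions made for the example and do not reproduce the data or the workflow of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training table: one row per station-month, one column per predictor.
n_features = 6
X_train = rng.normal(size=(500, n_features))
# Synthetic labelling rule (assumption): 1 = dry, 0 = near normal / wet.
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] < -0.5).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Hypothetical predictor stack for one month, shaped (bands, rows, cols).
rows, cols = 40, 30
stack = rng.normal(size=(n_features, rows, cols))
stack[:, :5, :5] = np.nan  # pixels with no data, e.g. outside the study area

# Flatten to (pixels, bands) and keep only pixels with complete data.
pixels = stack.reshape(n_features, -1).T
valid = ~np.isnan(pixels).any(axis=1)

drought_class = np.full(rows * cols, np.nan)
drought_prob = np.full(rows * cols, np.nan)
drought_class[valid] = rf.predict(pixels[valid])
# Column of predict_proba that corresponds to the "dry" label (1).
dry_col = list(rf.classes_).index(1)
drought_prob[valid] = rf.predict_proba(pixels[valid])[:, dry_col]

# Back to map shape: one class map and one probability map.
class_map = drought_class.reshape(rows, cols)
prob_map = drought_prob.reshape(rows, cols)

dry_share = 100 * np.nanmean(class_map)
print(f"Share of valid pixels classified as dry: {dry_share:.1f}%")
```

Masking no-data pixels before prediction keeps them out of both maps, and the mean of the class map over valid pixels gives the share of the mapped area classified as dry.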

To evaluate the spatial predictive capacity of the model, the modelled SPI3 pixels within the area of influence of each station were taken and compared with the SPI3 calculated from IDEAM rainfall stations. Table 8 presents the results, in which the percentages of FNs (0.7%) and FPs (29.6%) with respect to the total amount of data evaluated indicate an overestimation of the pixels with drought. Additionally, the precision and recall for predicting drought are 0.59 and 0.98, respectively.

Table 8.

Report of the classification and confusion matrix results for random forest in July 2014.

	Precision	Recall	F1 Score	Support
Near normal / wet	0.97	0.48	0.64	1559
Dry	0.59	0.98	0.74	1208
Average	0.78	0.73	0.69	2767
Evaluation data: 2767	TN: 741	FN: 19	TP: 1189	FP: 818

Source: Authors.
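The "Average" row of a classification report can be an unweighted (macro) mean of the two classes or a mean weighted by class support. As a rough check, the sketch below recomputes both from the counts in Table 8; the helper and variable names are illustrative. With these counts, the macro mean is the one that reproduces the reported 0.78, 0.73, and 0.69.

```python
# Counts reported in Table 8 ("dry" is the positive class).
tn, fn, tp, fp = 741, 19, 1189, 818


def prf(tp, fp, fn):
    """Precision, recall and F1 score for one class from its own counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


scores = {
    "wet": prf(tn, fn, fp),  # near normal / wet: counts mirrored
    "dry": prf(tp, fp, fn),
}
support = {"wet": tn + fp, "dry": tp + fn}  # true members of each class
n = sum(support.values())

# Macro average: every class counts the same.
macro = [(w + d) / 2 for w, d in zip(scores["wet"], scores["dry"])]
# Weighted average: every class counts in proportion to its support.
weighted = [
    (w * support["wet"] + d * support["dry"]) / n
    for w, d in zip(scores["wet"], scores["dry"])
]

print("macro   :", [round(v, 2) for v in macro])
print("weighted:", [round(v, 2) for v in weighted])
```

The two averages differ most when the classes are imbalanced or when one class is predicted much better than the other, so it is worth stating which one a report uses.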

According to Fig. 11A, 60.4% of the territory of the department of Magdalena experienced drought conditions. Fig. 11 highlights the municipalities that reported drought conditions for that month. It can be seen that, according to the model, drought is not present in all the municipalities reported, i.e., there are sectors with wet-to-normal conditions, which indicates that there may be areas that are more affected than others by the occurrence of the climatic phenomenon. Additionally, there are sectors that, according to the model used, present a high probability of drought, and these were not reported within the municipalities that declared a public calamity in the month in question.

A more detailed description of the events would help to strengthen the validation of the results. This would facilitate both the evaluation of SPI3 as an index to measure agricultural drought and a more detailed validation of any method developed to evaluate drought within the department.

VI. Conclusions and Discussion

The results obtained in this work indicate that ML methods are a tool with an important potential for predicting the temporal and spatial occurrence of droughts in the Colombian territory.

There is a wide range of ML methods. The present study indicates that ensemble methods perform adequately, with appropriate values for both model performance and spatial and temporal predictive capability.

Both the RF and DTC models predict drought within the department in a timely manner; however, the precision of DTC is much lower than that obtained by RF, indicating that DTC greatly overestimates the occurrence of drought events compared to RF.

The model developed, in addition to providing a spatialized map of the occurrence of drought within the department, provides a map of the probabilities of the occurrence of the event, which could help local authorities to make decisions about how the emergency is distributed in the territory and to discover the sites with the highest probability of occurrence. This would allow them to determine the sectors most affected by the event and thus to deliver resources to these priority locations in a drought emergency within the department.

The information necessary to develop the proposed methodology is free and accessible to the public. This fact is relevant because it opens up the possibility of replicating the described workflow in other departments of Colombia and making a continuous and progressive follow-up of the behaviour of the phenomenon, facilitating studies based on what has been observed and the implementation of mitigation strategies. The above can be implemented within the national strategy for the integral management of drought in Colombia, as a mechanism within the objective of strengthening the monitoring and follow-up of drought.

As mentioned previously, satellite images offer a great advantage in terms of the spatial distribution of information, but the limitations of this type of data should be highlighted. For example, the low spatial resolution of some of the variables used, such as SSM, SUSM, and precipitation (28 km, 28 km, and 5.5 km, respectively), prevents the proposed methodology from being applied to small areas, because there would not be enough information to represent the conditions of the territory. As satellites or variables with better spatial resolution are integrated, more detailed studies and applications in small areas may become possible.

In addition, the applicability of the SPI could be extended: because the index follows a normal distribution, dry and wet conditions within the territory can be evaluated simultaneously. This in turn makes it possible to study the impacts produced not only by drought, but also by an excess of humidity or precipitation within the territory.

The type of drought evaluated within the study is agricultural drought, so the precipitation deficit is studied in an accumulated period of 3 months. This is an initial approach according to the literature, but it is possible to adjust the accumulated months to evaluate the behaviour of certain vegetation or relevant crops within the economy of each department; within the Colombian territory, bananas and potatoes, among others, are especially relevant. Thus, it is feasible to implement within the models specific information related to the behaviour and needs of the crops, such as water requirements and in which seasons the harvest, cultivation, and growth occur. This allows a comprehensive assessment of drought from the perspective of the food security of the population and economic impacts.
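To illustrate how the accumulation period enters the calculation, the sketch below computes a simplified, rank-based standardized index for 3-month and 6-month windows. This is not the procedure used in the study: the usual SPI fits a probability distribution (commonly a gamma distribution) to long precipitation records, whereas this example replaces that step with an empirical plotting position. The function names and the precipitation series are hypothetical.

```python
from statistics import NormalDist


def accumulate(precip, months):
    """Rolling sum of monthly precipitation over `months` consecutive months."""
    return [sum(precip[i - months + 1:i + 1]) for i in range(months - 1, len(precip))]


def empirical_spi(precip, months=3):
    """Rank-based standardized index of accumulated precipitation (simplified)."""
    acc = accumulate(precip, months)
    n = len(acc)
    order = sorted(range(n), key=lambda i: acc[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Group tied accumulations and give them their average 1-based rank.
        j = i
        while j + 1 < n and acc[order[j + 1]] == acc[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Gringorten plotting position keeps probabilities strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.44) / (n + 0.12)) for r in ranks]


# Hypothetical monthly precipitation in mm for two years (assumption).
monthly_rain = [120, 80, 15, 5, 0, 10, 60, 150, 200, 180, 90, 40,
                110, 70, 20, 0, 0, 5, 50, 140, 210, 170, 100, 30]

spi3 = empirical_spi(monthly_rain, months=3)
spi6 = empirical_spi(monthly_rain, months=6)
print("driest 3-month window ends at month index", spi3.index(min(spi3)) + 2)
```

Changing `months` is the only adjustment needed to move from a 3-month to a longer accumulation period, which is the kind of tuning that could be matched to the growth cycle of a given crop.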

The variable used to evaluate drought in this study is the SPI, because it is easy to obtain. Later studies should evaluate other drought indices, such as the Standardized Precipitation Evapotranspiration Index (SPEI), the Effective Drought Index (EDI), and the Palmer Drought Severity Index (PDSI), among others, which integrate meteorological variables other than rainfall and would allow researchers to cover different aspects that are key to the occurrence or determination of drought.

References

[1] D. Wilhite, M. Sivakumar & D. Wood, Early warning systems for drought preparedness and drought management. GEN, CH: WMO, 2000. Available from http://www.wamis.org/agm/pubs/agm2/agm02.pdf

[2] T. McKee, N. Doesken & J. Kleist, “The relationship of drought frequency and duration to time scales,” presented at 8th Conference on Applied Climatology, ANA, CA, USA, 17-22 Jan. 1993. Available from https://www.droughtmanagement.info/literature/AMS_Relationship_Drought_Frequency_Duration_Time_Scales_1993.pdf

[3] WMO, “Índice normalizado de precipitación Guía de usuario”, GEN, CH: WMO, Report No. 1090, 2012. Available: https://library.wmo.int/doc_num.php?explnum_id=7769

[4] FAO, Sequía: FAO in Emergencies. [Online]. Available: http://www.fao.org/emergencies/tipos-de-peligros-y-de-emergencias/sequia/es/ [Accessed: 31 Jan. 2020].

[5] W. Cramer, G. Yohe, M. Auffhammer, C. Huggel, U. Molau, M. Dias, A. Solow, D. Stone, & L. Tibig, “Detection and attribution of observed impacts”, in Climate Change 2014: Impacts, Adaptation, and Vulnerability. Part A: Global and Sectoral Aspects. Contribution of Working Group II to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change, C. Field, V. Barros, D. Dokken, K. Mach, M. Mastrandrea, T. Bilir, M. Chatterjee, K. Ebi, Y. Estrada, R. Genova, B. Girma, E. Kissel, A. Levy, S. MacCracken, P. Mastrandrea & L. White, eds, CAMB, UK/NYC, NY, USA: Cambridge UP, 2014, pp. 979-1037. https://doi.org/10.1017/CBO9781107415379.023

[6] S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K. Averyt, M. Tignor & H. Miller, “Climate change 2007: the physical science basis,” Cambridge UP, CAMB, UK/NY, USA, Report, 2007. Available: https://www.ipcc.ch/report/ar4/wg1/

[7] A. Dai, T. Zhao & J. Chen, “Climate change and drought: A precipitation and evaporation perspective,” Curr Clim Change Rep, vol. 4, no. 9, pp. 301–312, May. 2018. https://doi.org/10.1007/s40641-018-0101-6

[8] S. Mukherjee, A. Mishra & K. Trenberth, “Climate change and drought: a perspective on drought indices,” Curr Clim Change Rep, vol. 4, no. 2, pp. 145–163, Jun. 2018. https://doi.org/10.1007/s40641-018-0098-x

[9] Revista SEMANA, “Las graves secuelas económicas de la sequía”, Semana, 23 Jul. 2014. [Online]. Available: https://www.semana.com/nacion/articulo/las-graves-secuelas-economicas-de-la-sequia/396750-3

[10] Redacción El Heraldo, “Magdalena, el más azotado por la temporada de sequía”, El Heraldo, 27 Sep. 2015. [Online]. Available: https://www.elheraldo.co/magdalena/magdalena-el-mas-azotado-por-la-temporada-de-sequia-219590

[11] O. Rahmati, F. Falah, K. Dayal, R. Deo, F. Mohammadi, T. Biggs, D. Moghaddam, S. Naghibi & D. Bui, “Machine learning approaches for spatial modeling of agricultural droughts in the south-east region of Queensland Australia,” Sci Total Environ, vol. 699, pp. 1–10, Jan. 2020. https://doi.org/10.1016/j.scitotenv.2019.134230

[12] S. Park, J. Im, E. Jang & J. Rhee, “Drought assessment and monitoring through blending of multi-sensor indices using machine learning approaches for different climate regions,” Agric For Meteorol, vol. 216, pp. 157–169, Jan. 2016. https://doi.org/10.1016/j.agrformet.2015.10.011

[13] K. Fung, Y. Huang, C. Koo & M. Mirzaei, “Improved SVR machine learning models for agricultural drought prediction at downstream of Langat River Basin, Malaysia,” J Water Clim Chang, vol. 11, no. 4, pp. 1383–1398, Jun. 2019. https://doi.org/10.2166/wcc.2019.295

[14] P. Feng, B. Wang, D. Li Liu & Q. Yu, “Machine learning-based integration of remotely-sensed drought factors can improve the estimation of agricultural drought in south-eastern Australia,” Agric. Syst, vol. 173, pp. 303–316, Aug. 2015. https://doi.org/10.1016/j.agsy.2019.03.015

[15] A. Belayneh, J. Adamowski, B. Khalil & J. Quilty, “Coupling machine learning methods with wavelet transforms and the bootstrap and boosting ensemble approaches for drought prediction,” Atmos Res, vol. 172-173, pp. 37–47, Jun. 2016. https://doi.org/10.1016/j.atmosres.2015.12.017

[16] X. Liu, X. Zhu, Q. Zhang, T. Yang, Y. Pan & P. Sun, “A remote sensing and artificial neural network-based integrated agricultural drought index: Index development and applications,” Catena, vol. 186, no. 2, pp. 1–10, Mar. 2020. https://doi.org/10.1016/j.catena.2019.104394

[17] J. Cruz & D. Wishart, “Applications of machine learning in cancer prediction and prognosis,” Cancer Inform, vol. 2, pp. 59–77, Dec. 2006. https://doi.org/10.1177/11769351060020003

[18] L. Deng & X. Li, “Machine learning paradigms for speech recognition: An overview,” IEEE Trans Audio Speech Lang Process, vol. 21, no. 5, pp. 1060–1089, May. 2013. https://doi.org/10.1109/TASL.2013.2244083

[19] K. Rasouli, W. Hsieh & A. Cannon, “Daily streamflow forecasting by machine learning methods with weather and climate inputs,” J. Hydrol., vol. 414-415, pp. 284–293, Jan. 2012. https://doi.org/10.1016/j.jhydrol.2011.10.039

[20] D. Lary, A. Alavi, A. Gandomi & A. Walker, “Machine learning in geosciences and remote sensing,” GSF, vol.7, no. 1, pp. 3–10, Apr. 2015. https://doi.org/10.1016/j.gsf.2015.07.003

[21] Minambiente, UNGRD, IDEAM, Estrategia Nacional para la gestión integral de la sequía en Colombia. BO, CO: MinAmbiente, IDEAM, & UNGRD, 2018. Available: https://www.unccd.int/sites/default/files/country_profile_documents/ENGIS%2520para%2520publicaci%25C3%25B3n_Colombia.pdf

[22] UNGRD, “Consolidado anual de emergencias”, (1998-2021), Gobierno de Colombia [Online]. Available: http://portal.gestiondelriesgo.gov.co/Paginas/Consolidado-Atencion-de-Emergencias.aspx (accessed: 2020, May. 18).

[23] Gobernación del Magdalena, “Nuestro departamento”, Gobierno de Colombia [Online]. Available: http://www.magdalena.gov.co/departamento/nuestro-departamento (accessed: 2020, May. 18).

[24] DANE, Resultados Censo Nacional de Población y Vivienda 2018. BO, CO: Gobierno de Colombia. Available: https://www.dane.gov.co/files/censo2018/informacion-tecnica/presentaciones-territorio/191004-CNPV-presentacion-Magdalena.pdf

[25] IDEAM, Magdalena. BO, CO: Gobierno de Colombia. Available: http://atlas.ideam.gov.co/basefiles/magdalena_texto.pdf

[26] K. Trenberth, A. Dai, G. Van Der Schrier, P. Jones, J. Barichivich, K. Briffa & J. Sheffield, “Global warming and changes in drought,” Nat Clim Change, vol. 4, no. 1, pp. 17–22, Dec. 2013. https://doi.org/10.1038/nclimate2067

[27] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau & R. Moore, “Google Earth Engine: Planetary-scale geospatial analysis for everyone,” Remote Sens Environ, vol. 202, pp. 18–27, Jul. 2016. https://doi.org/10.1016/j.rse.2017.06.031

[28] NASA. Normalized Difference Vegetation Index (NDVI), 30 Aug. 2000. Available: https://earthobservatory.nasa.gov/features/MeasuringVegetation/measuring_vegetation_2.php

[29] T. Du, D. Bui, M. Nguyen & H. Lee, “Satellite-based, multi-indices for evaluation of agricultural droughts in a highly dynamic tropical catchment, central Vietnam,” J Water, vol. 10, no. 5, pp. 1–24, Jan. 2018. https://doi.org/10.3390/w10050659

[30] L. Wang & J. Qu, “NMDI: A normalized multi-band drought index for monitoring soil and vegetation moisture with satellite remote sensing,” Geophys Res Lett, vol. 34, no. 20, pp. 1–5, Jul. 2007. https://doi.org/10.1029/2007GL031021

[31] G. Poveda & O. Mesa, “Las fases extremas del fenómeno ENSO (El Niño y La Niña) y su influencia sobre la hidrología de Colombia”, Ing Hidraulic Mex, vol. 11, no. 1, pp. 21–37, Jan. 1996. Available: http://www.revistatyca.org.mx/ojs/index.php/tyca/article/view/765

[32] D. Edwards & T. McKee, “Characteristics of 20th century drought in the United States at multiple time scales,” Dept. Atmos. Sci., CSU, FRT COLL., CO, USA, Climatology Report No. 97-2, Paper no. 634, 1997. Available: http://hdl.handle.net/10217/170176

[33] O. Valiente, “Sequía: definiciones, tipologías y métodos de cuantificación”, Invest Geogr, no. 26, pp. 59–80, Mar. 2001. https://doi.org/10.14198/INGEO2001.26.06

[34] R. Mayorga & G. Hurtado, “La sequía en Colombia, Documento tecnico de respaldo a la informacion en la pagina web del IDEAM”, IDEAM, BO, CO, Report IDEAM–METEO/004-2006, 2006. Available: http://www.cambioclimatico.gov.co/documents/21021/21147/NotaT%C3%A9cnicaSequia.pdf/d9ba4965-f7cd-4a2f-a875-2a38b1d6a941

[35] G. Hurtado, Sequía meteorológica y sequía agrícola en Colombia: Incidencia y tendencias. BO, CO: IDEAM, 2012. Available: http://www.ideam.gov.co/documents/21021/21138/Sequias+Incidencias+y+Tendencias.pdf/3e72c86c-cf4a-42f9-95f1-07e7cf88861a

[36] J. Gómez & M. Cadena, “Actualización de las estadísticas de la sequía en Colombia”, IDEAM, BO, CO, Nota técnica IDEAM-METEO/001-2018, Jun. 2017. Available: http://www.ideam.gov.co/documents/21021/124446218/NT+001-2018_Actualizaci%C3%B3n+de+las+estad%C3%ADsticas+de+la+sequia+en+Colombia/d47113b3-536b-4c83-a69c-22f97993016f?version=1.1

[37] NDMC, “Climographs,” SNR [Online]. Available: https://drought.unl.edu/Climographs.aspx (accessed: 2018, Dec. 7).

[38] W. Koehrsen, “A feature selection tool for machine learning in Python, Towards Data Science,” 22 Jun. 2018. Available: https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0

[39] C. Sutton, “11 - Classification and regression trees, bagging, and boosting,” Handb Stat, vol. 24, pp. 303–329, Dec. 2005. https://doi.org/10.1016/S0169-7161(04)24011-1

[40] E. Bauer & R. Kohavi, “An empirical comparison of voting classification algorithms: Bagging, boosting, and variants,” Mach Learn, vol. 36, no. 1-2, pp. 1–38, Jan. 1996. Available: http://robotics.stanford.edu/~ronnyk/vote.pdf

[41] S. Safavian & D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE Trans Syst Man Cybern, vol. 21, no. 3, pp. 660–674, Jun. 1991. https://doi.org/10.1109/21.97458

[42] M. Pal, “Random forest classifier for remote sensing classification,” Int J Remote Sens, vol. 26, no. 1, pp. 217–222, Oct. 2003. https://doi.org/10.1080/01431160412331269698

[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot & É. Duchesnay, “Scikit-learn: Machine learning in Python”, J Mach Learn Res, vol. 12, no. 85, pp. 2825–2830, Mar. 2011. Available: https://www.jmlr.org/papers/v12/pedregosa11a.html

[44] Scikit-Learn, Scikit-Learn Machine Learning in Python [Online]. Available: https://scikit-learn.org/stable/index.html (accessed: 2020, May. 18).

[45] W Fin de Semana, “Declaran calamidad pública por sequía en cinco municipios del Magdalena”, W Radio, 27 Jul. 2014. Available: https://www.wradio.com.co/noticias/actualidad/declaran-calamidad-publica-por-sequia-en-cinco-municipios-del-magdalena/20140727/nota/2341212.aspx

[46] M. Correa, “La sequía impacta a 7 departamentos”, El Colombiano, 22 Jul. 2014. Available: https://www.elcolombiano.com/historico/la_sequia_impacta_a_7_departamentos-IGEC_303649

Daissy Milenys Herrera Posada. Universidad Nacional de Colombia (Medellín, Colombia). https://orcid.org/0000-0002-1825-0097

Edier Aristizábal. Universidad Nacional de Colombia (Medellín, Colombia). https://orcid.org/0000-0002-1825-0097

© The author; licensee Universidad de la Costa - CUC.

INGE CUC vol. 18 no. 2, pp. 249–265. July - December, 2022

Barranquilla. ISSN 0122-6517 Impreso, ISSN 2382-4700 Online