setwd('C:/Users/jourd/OneDrive/Documents/UCL/STDM_2/')
# Load the required packages (duplicate library() calls removed)
library(knitr)
library(reshape)
library(foreign)
library(gstat)
library(sf)
library(rgdal)
library(tmap)
library(scatterplot3d)
library(plot3D)
library(rgl)
library(maptools)
library(lattice)
library(spdep)
library(sp)
library(ggplot2)
library(gridExtra)
library(car)
library(dplyr)
library(devtools)
library(keras)
library(tidyverse)
library(glue)
library(forcats)
library(timetk)
library(tidyquant)
library(tibbletime)
library(cowplot)
library(recipes)
library(rsample)
library(yardstick)
library(tfruns)
library(nnet)
library(mlbench)
library(caret)
library(neuralnet)
library(robustbase)
Wildfires have been a persistent and devastating problem in California, causing extensive damage to public health and infrastructure. Due to its hot, dry climate, California is prone to wildfires, and the frequency and size of these fires have increased considerably in recent years (Westerling and Bryant, 2008). Wildfires are attributed to a combination of climatic, landscape, and anthropogenic factors: high temperatures, low precipitation rates, and dense vegetation, alongside increased human activity and a lack of forest management. These factors have been exacerbated by climate change, as abnormal weather patterns become more frequent and harder to predict (Westerling and Bryant, 2008). Machine learning can be used to predict the location and spread of potential wildfires, and this has been done in several regions around the world using both supervised and unsupervised techniques; for example, Banerjee (2021) studied the spatial and temporal patterns of California wildfires using unsupervised k-means clustering.
The aim of this experiment is to use supervised machine learning methods to predict the size of a wildfire based upon climate, anthropogenic, and landscape data. For this experiment, spatio-temporal variables were combined into a single dataset to construct the predictor variables. Two supervised machine learning methods were chosen: an Artificial Neural Network (ANN) and a Long Short-Term Memory network (LSTM). ANNs were inspired by the structure and function of the human brain; they consist of nodes (neurons) interconnected in a multilayer system (Vatsavai, 2008). The simplest ANNs have three layers (an input layer, a hidden layer, and an output layer), with each layer connected to the next by weights. Weights are numerical values that are assigned and then optimised during training of the network; in basic terms, they determine the strength of the connections between the layers. ANNs were chosen for this experiment because of their ability to model complex behaviour in large datasets. An LSTM is a type of Recurrent Neural Network (RNN) whose architecture allows it to selectively forget or remember information when making predictions (Vatsavai, 2008); LSTMs are particularly useful for time-series forecasting in deep learning. LSTMs are constructed with four layers: an input layer, an LSTM layer, a hidden layer, and an output layer (Liang et al., 2019). The LSTM layer is where the architecture differs from a more traditional feed-forward network like the ANN: it contains memory cells, which give the LSTM the ability to store and retrieve information, together with several 'gates' that allow the model to choose what information is stored or forgotten (Vatsavai, 2008). This mechanism allows the LSTM to learn patterns in complex relationships. An LSTM model was selected for this experiment because we expect the relationship between the predictor and response variables to be non-linear. Furthermore, we expect the input data to contain a degree of noise, and the LSTM's ability to selectively remember information can help mitigate its effect. Although both ANNs and LSTMs are capable of regression and classification analysis, LSTMs have the benefit of being able to handle long-term dependencies. As our dataset involves many years of wildfire data, a model that can 'remember' previous information may produce more accurate results. It may seem counterintuitive to use two methods that ultimately make the same prediction; however, it is important to compare the results of the methods and benchmark them against each other.
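In symbols (a standard formulation rather than one taken from the cited papers), each hidden neuron $j$ computes a weighted sum of its inputs $x_i$ plus a bias $b_j$, passed through an activation function $f$:

$$a_j = f\left(\sum_{i} w_{ij}\, x_i + b_j\right)$$

Training the network amounts to optimising the weights $w_{ij}$ and biases $b_j$.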
Dataset | Source | Methods Applied |
---|---|---|
Elevation | Extracted from California DEM | Converted to polygons and spatially joined to the California grid |
Population | California 2010 Census data | Joined to county shapefile of California, then spatially joined to the California grid |
Distance to Road | OpenStreetMap (distance to the nearest road) | Converted to polygons and spatially joined to the California grid |
Climate | Extracted via the CDS API | Mean summer values spatially joined to the California grid |
Historic Wildfire Polygon | https://scholar.colorado.edu/concern/datasets/8336h304x (year of wildfire) | Spatially joined to the California grid |
The data used in this report was gathered from several sources, as seen in the table above. The final form of the data is a collection of point shapefiles with longitude and latitude values plus the predictor variables. Each shapefile contains 8,801 points and represents one year of data (2001 to 2019, with 2012 absent), and thus one year of wildfire polygons, as seen in Figure 1. The raw data contains 75 attribute columns, which were reduced to the final 13 predictor variables. The Albers projection for California was used, as this allows distances to be measured in metres. Every attribute in the table is numerical, with land cover type being the only categorical attribute; land cover codes range from 1 to 12 and classify the Earth's surface into land-cover units.
# set the directory where the shapefiles are located
dir_path <- "Shapefiles_small/"
# list all the shapefiles in the directory
shp_files <- list.files(dir_path, pattern = "\\.shp$")
# create an empty list to store the shapefiles
shp_list <- list()
# loop through the shapefiles and read them into the list
for (i in 1:length(shp_files)) {
shp_list[[i]] <- readOGR(dsn = paste(dir_path,shp_files[i], sep = ""))
}
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\jourd\OneDrive\Documents\UCL\STDM_2\Shapefiles_small\2001.shp", layer: "2001"
## with 8801 features
## It has 75 fields
## Integer64 fields read as strings: OBJECTID dist_roads Elevation OBJECTID_1 Join_Count TARGET_FID test OBJECTID_2 Join_Cou_1 TARGET_F_1 OBJECTID_3 Join_Cou_2 TARGET_F_2 ORIG_FID
## (equivalent output repeated for each remaining shapefile, 2002 to 2019; note there is no 2012 file in the directory)
Below is a table outlining the attributes of each point in the dataframe.
Variable | Description |
---|---|
pop | Population density at the point |
Elevation | Elevation at the point |
dist_roads | Distance to the nearest road |
ig_year | Year of wildfire |
ig_month | Month of wildfire |
tot_pix | Total area of wildfire (in pixels) |
lc_code | Land cover type at the point |
coords.x1 | Longitude of the point |
coords.x2 | Latitude of the point |
tp | Total precipitation |
d2m | Average dewpoint temperature |
t2m | Average surface temperature |
si10 | 10 m wind speed |
swvl1 | Volumetric soil water layer |
pev | Potential evaporation |
tot_ar_km2 | Total burned area (km²) |
Most of the data was pre-processed in ArcGIS Pro due to its ability to handle large amounts of netCDF data. A 20 km x 20 km grid was generated covering the state and used to store the relevant data: population density, distance to roads, elevation, and mean summer climate data. Climate data was extracted using the CDS API in Python. Subsequently, shapefiles were created for each year from 2001 to 2019, representing point clouds where each point corresponds to the centroid of a grid cell. These shapefiles contain the attributes of the wildfires that occurred during each year, enabling analysis of their patterns and trends. Finally, rows with missing or Null data were removed.
# Selecting one year to plot (2018 is index 17 in the list, since 2012 is absent)
test_plot <- shp_list[[17]]
# Set the year as a factor
test_plot$ig_year <- as.factor(test_plot$ig_year)
# plot the shapefile
tm_shape(test_plot) + tm_dots("ig_year", palette = c("lightgrey", "orange"), size = 0.06, legend.show = FALSE) +
tm_scale_bar(position = c("LEFT")) +
tm_compass() +
tm_layout(main.title = "Location of Wildfires in 2018", main.title.position = "center")
Climate Data Analysis: To analyse the temporal patterns in the climate data we plot a histogram and a heatmap showing the frequency of wildfires in each month over the years, revealing the months with the largest number of wildfires. On average, June to October had the most wildfires, so we use the average of just these months for the climate data. We can then plot the climate variables on a shapefile of California to understand their spatial distribution. The main observation from the climate data is the divide between the North and the South of California: in general, precipitation and volumetric soil water values are higher in the North, while surface temperature, wind speed and evaporation values are lower there. This indicates that the North of California experiences higher precipitation rates but overall cooler temperatures, which may influence the occurrence of wildfires.
# Select 1 year shapefile
climate_shp <- shp_list[[3]]
# Create a list of the climate variables
atr_list = c("tp", "d2m", "si10",
"t2m", "swvl1", "pev")
# Descriptions of the climate variables (kept in the same order as atr_list)
desc_list = c("Total\nPrecipitation", "Dewpoint\nTemperature",
"10m Wind Speed", "Surface\nTemperature",
"Volumetric\nSoil Layer", "Evaporation")
# Empty list to store the plots
plots_list <- list()
# Loop through the list of climate variables and construct a plot for each
for (i in 1:length(atr_list)) {
plot <- tm_shape(climate_shp) + tm_dots(atr_list[i], title = desc_list[i]) + tm_layout(legend.width = 0.4,
legend.frame = TRUE,legend.text.size = 0.8,
legend.title.size = 0.9,
legend.title.fontface = "bold")
plots_list[[i]] <- (plot) # Add the plot to the plots_list
}
# Arrange plots in a grid
grid_arrange <- tmap_arrange(plots_list, ncol = 2, nrow = 3)
# Display final grid with title
grid_arrange
To examine the temporal frequency of fires, we can plot the number of fires in each month for one selected year.
# List to store extracted attributes
attribute_list <- list()
# Loop through shapefiles and extract attribute
for (i in 1:length(shp_list)) {
# Extract the month values from the shapefile
attribute_value <- shp_list[[i]]$ig_month
# Remove all 0 values, ie no fire data
attribute_value <- Filter(function(x) x != 0, attribute_value)
# Add the month values to the list
attribute_list[[i]] <- attribute_value
}
# Filter again within each year's vector to catch any zero (non-fire) values that slipped through
my_matrix_nonzero <- lapply(attribute_list, function(v) v[v != 0])
# Create a histogram of the fire frequency per month for one year
ggplot()+aes(attribute_list[[1]]) +
geom_histogram(binwidth = 1, fill = 'orange', alpha = 0.7) +
scale_x_continuous(breaks = 1:12, labels = month.name) +
ggtitle("Histogram of Fire Frequency over Months")+
labs(y = 'Frequency', x = 'Months') +
theme(axis.text.x = element_text(angle = 90))
# Create a list to store the dataframes to
attribute_list = list()
# Loop through shapefiles and extract attributes
for (i in 1:length(shp_list)) {
# Extract the month values
attribute_value <- shp_list[[i]]$ig_month
# Filter out all of the fire data, to only include that in the list
attribute_value <- Filter(function(x) x != 0, attribute_value)
# add the month numbers to a list
attribute_list[[i]] <- attribute_value
}
# Generate the list of years present in the data (2012 is absent)
years <- c(2001:2011, 2013:2019)
# Create a dataframe with months 1 to 12 in the first column
month_df <- data.frame(month = 1:12)
# Loops through each years month list
for (i in 1:length(attribute_list)) {
months <- attribute_list[[i]]
f_table <- aggregate(list(values_list = months), by = list(month = months), FUN = length)
f_table <- setNames(f_table, c("month", i))
month_df <- merge(month_df, f_table, by = "month", all.x = TRUE)
}
month_df[is.na(month_df)] <- 0
month_melt <- melt(month_df, id.vars = "month")
ggplot(month_melt, aes(x = variable, y = month, fill = value)) +
geom_tile() +
scale_y_continuous(breaks = 1:12, labels = month.name) +
scale_fill_gradient(low = "white", high = "orange") +
labs(title = "Fire Frequency Heatmap",
x = "Year",
y = "Month",
fill = "Frequency")
# Exploratory spatio-temporal analysis

## Multicollinearity
Testing for collinearity is important before data is used to build an ANN. If there is a high degree of correlation between the input variables, they may provide redundant information, leading to an overfitted model that performs well on the training data but poorly on test data. Collinearity was tested on the climate variables, as these are the most likely to be highly correlated. This was done using the car package and extracting the VIF (variance inflation factor) values.
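For reference, the VIF for predictor $j$ comes from regressing that predictor on all of the others and taking the resulting $R_j^2$:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

A VIF of 1 indicates no collinearity, while values above roughly 5 to 10 are commonly taken as a sign of problematic collinearity.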
# Select one year to test
mc_shp <- shp_list[[2]]
# Select the variables to be tested
mc_shp <- subset(mc_shp, select= c("pop_est", "dist_roads", "Elevation", "tp", "t2m",
"d2m", "si10", "swvl1", "tot_ar_km2"))
# Run the lm function from the car library
model <- lm(tot_ar_km2 ~ tp+t2m+d2m+si10+swvl1, data=mc_shp)
# Create a table to display the VIF results
r <- as.data.frame(vif(model))
colnames(r)[1] <- "VIF"
knitr::kable(r)
 | VIF |
---|---|
tp | 2.928386 |
t2m | 6.996844 |
d2m | 6.767041 |
si10 | 1.929448 |
swvl1 | 6.256030 |
Global Moran's I: Moran's I is a measure of spatial autocorrelation, i.e. of how clustered or dispersed a variable is across an area. It is important to test for autocorrelation as it provides insight into the spatial distribution of the data. The Moran's I coefficient ranges from -1 to 1; values close to 1 indicate positive spatial autocorrelation, while values close to -1 indicate negative spatial autocorrelation.
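For reference, the global Moran's I for values $x_i$ with spatial weights $w_{ij}$ and mean $\bar{x}$ is:

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$$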
Using the moran.mc() function, the global Moran's I was calculated; this uses Monte Carlo simulation, comparing the observed statistic against statistics computed under random permutations of the data. The Moran's I statistics and p-values are displayed in figure 4. The coefficients produced were positive and relatively close to 1, meaning that land cover, burned area, and wildfire year are positively spatially autocorrelated. The p-value for each test also lies below the 0.05 threshold, so the observed clustering is statistically significant at the 95% confidence level.
# Reading in shapefile of combined data, to speed up compilation time
combined_shp <- readOGR(dsn = "Combined/master2.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\jourd\OneDrive\Documents\UCL\STDM_2\Combined\master2.shp", layer: "master2"
## with 15013 features
## It has 81 fields
## Integer64 fields read as strings: Join_Count TARGET_FID JOIN_FID OBJECTID Join_Cou_1 TARGET_F_1 OBJECTID_1 Join_Cou_2 TARGET_F_2 dist_roads OBJECTID_2 Join_Cou_3 TARGET_F_3 Elevation OBJECTID_3 Join_Cou_4 TARGET_F_4 X Y
#Create a spatial weights matrix:
set.ZeroPolicyOption(TRUE)
## [1] FALSE
fire.W <- nb2listw(poly2nb(combined_shp),zero.policy = TRUE)
# Moran I test land cover
#mt_lc <- moran.test(combined_shp$lc_code, fire.W)
mc_lc <- moran.mc(combined_shp$lc_code, listw=fire.W, nsim=9999)
# Moran I test burned area
#mt_area <- moran.test(combined_shp$tot_pix, fire.W)
mc_area <- moran.mc(combined_shp$tot_ar_km2, listw=fire.W, nsim=9999)
# Moran I test year
#mt_year <- moran.test(combined_shp$ig_year, fire.W)
mc_year <- moran.mc(combined_shp$ig_year, listw=fire.W, nsim=9999)
# Extracting the moran statistic and P value
statistic <- as.double(c(mc_lc$statistic, mc_area$statistic, mc_year$statistic))
p_value <- as.double(c(mc_lc$p.value, mc_area$p.value, mc_year$p.value))
# Creating a dataframe of the results to display in a graph
df <- data.frame(Statistic = c(statistic), P_Value = c(p_value))
rownames(df) <- c("Land Cover", "Burned Area", "Wildfire Year")
knitr::kable(df, format = 'pipe')
 | Statistic | P_Value |
---|---|---|
Land Cover | 0.6523640 | 1e-04 |
Burned Area | 0.4492389 | 1e-04 |
Wildfire Year | 0.6436802 | 1e-04 |
# Below code has been adapted from the geocomputation tutorial notebook
moranCluster <- function(shape, W, var, alpha=0.05, p.adjust.method="bonferroni")
{
# Code adapted from https://rpubs.com/Hailstone/346625
Ii <- localmoran(shape[[var]], W)
shape$Ii <- Ii[,"Ii"]
shape$Iip <- Ii[,"Pr(z != E(Ii))"]
shape$sig <- shape$Iip<alpha
# Scale the data to obtain low and high values
shape$scaled <- scale(shape[[var]]) # high low values at location i
shape$lag_scaled <- lag.listw(W, shape$scaled) # high low values at neighbours j
shape$lag_cat <- factor(ifelse(shape$scaled>0 & shape$lag_scaled>0, "HH",
ifelse(shape$scaled>0 & shape$lag_scaled<0, "HL",
ifelse(shape$scaled<0 & shape$lag_scaled<0, "LL",
ifelse(shape$scaled<0 & shape$lag_scaled>0, "LH", "Equivalent")))))
shape$sig_cluster <- as.character(shape$lag_cat)
shape$sig_cluster[!shape$sig] <- "Non-sig"
shape$sig_cluster <- as.factor(shape$sig_cluster)
results <- data.frame(Ii=shape$Ii, pvalue=shape$Iip, type=shape$lag_cat, sig=shape$sig_cluster)
return(list(results=results))
}
clusters <- moranCluster(combined_shp, W = fire.W, var="ig_year")$results
combined_shp$Ii_cluster <- clusters$sig
tm_shape(combined_shp) + tm_polygons(col="Ii_cluster")
Variable | Description |
---|---|
pop | Population density at the point |
Elevation | Elevation at the point |
dist_roads | Distance to the nearest road |
ig_month | Month of wildfire |
lc_code | Land cover type at the point |
tp | Total precipitation |
d2m | Average dewpoint temperature |
t2m | Average surface temperature |
si10 | 10 m wind speed |
swvl1 | Volumetric soil water layer |
pev | Potential evaporation |
The first supervised machine learning technique utilised was an Artificial Neural Network (ANN). We used a simple model with one hidden layer, a sigmoid activation function, and a linear output. From this model we hope to predict the size of a wildfire from the predictor variables seen in figure 3. The structure of the network can be seen in Figure X, where I1:13 represent the 13 predictor variables, H1:9 are the 9 neurons in the hidden layer, and O1 is the output layer. B1 and B2 represent the biases introduced into the network: numbers added to the weighted inputs before the activation function, effectively like the intercept in a linear regression model.
To undertake regression analysis on wildfire size, only the grid cells containing a historic wildfire were extracted, along with their climate, anthropogenic, and landscape data, and combined into one dataset (containing the attributes in figure 3). From this dataset, training and validation subsets were taken; we chose to train on 60% of the data. The data then had to be normalised; here it was scaled to a mean of 0 and a variance of 1 using the built-in R function scale(). Normalisation helps improve the convergence rate of the model, because predictor variables on very different scales can lead to suboptimal convergence. A minimal sketch of this step is shown below.
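The following is a minimal sketch of the split-and-scale step, assuming the combined records sit in a hypothetical all-numeric dataframe fire_df with the predictors from figure 3 plus tot_ar_km2:
# Hypothetical input: fire_df holds one row per burned grid cell
set.seed(1)
train_idx <- sample(nrow(fire_df), size = round(0.6 * nrow(fire_df)))
# Normalise to mean 0, variance 1; the test set reuses the training
# centre/scale so both subsets share one reference frame
train_scaled <- scale(fire_df[train_idx, ])
test_scaled <- scale(fire_df[-train_idx, ],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
training_data <- as.data.frame(train_scaled)
testing_data <- as.data.frame(test_scaled)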
To select the most optimal parameters for the model we used a hyperparameter grid, specifying a range of values for the number of neurons in the hidden layer and for the decay rate, which controls the amount of regularisation applied to the weights. We can then run the model on each combination of parameters in the grid and calculate the RMSE of each output. We used k-fold cross-validation with 10 folds and 3 repeats; this is important because it estimates how well the model will perform on unseen data. A sigmoid activation function (act.fct = "sigmoid") maps input values to an output between 0 and 1; this adds non-linearity to the hidden layer, allowing the model to learn more complex patterns. linout = TRUE was set to change the output from logistic to linear, as required for regression. Finally, each combination of parameters was tested on the model and a mean RMSE score returned; the optimal parameters were those with the lowest RMSE. We also tested a network with 2 hidden layers, but this was more prone to overfitting.
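The tuning code itself is not shown in this section; below is a hedged sketch of how it could look with caret (the object name fire.nnet.cv matches its later use, while the grid values and data names are illustrative assumptions). caret's "nnet" method uses a logistic (sigmoid) hidden layer by default:
# Hyperparameter grid: hidden-layer size and weight decay (illustrative values)
tune_grid <- expand.grid(size = 1:10, decay = c(0.001, 0.01, 0.1))
# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# linout = TRUE gives a linear output for regression
fire.nnet.cv <- train(tot_ar_km2 ~ ., data = training_data,
                      method = "nnet", linout = TRUE, trace = FALSE,
                      tuneGrid = tune_grid, trControl = ctrl, metric = "RMSE")
fire.nnet.cv$bestTune # the combination with the lowest cross-validated RMSE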
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
plot.nnet(fire.nnet.cv$finalModel)
### Performance of the Model

The implementation of the ANN was relatively straightforward, because this experiment used a simple ANN with just one hidden layer. The observed values were plotted against the predicted values for the test subset. This plot shows a rough y = x relationship, particularly around the origin where the majority of the points are located. This is potentially due to the varying wildfire sizes: our dataset contained many more small fires than large ones, so a concentration of points near the origin is understandable. Furthermore, horizontal artifacts can be seen within the distribution of points, particularly for the larger fire sizes. The accuracy of the model was tested using RMSE and R² values, which can be seen in figure 15. An RMSE close to 0 would indicate that the model makes good predictions, meaning wildfire size can be predicted from the predictor variables. The optimal parameters gave an RMSE of 0.6, meaning that on average the predictions differed from the observed values by 0.6 (normalised) units. The range of our dependent variable was -0.4 to 4.7, so this RMSE is reasonably good. The R² of the model was 0.6, indicating that 60% of the variance in the size of a wildfire can be explained by the climate, landscape, and anthropogenic predictor variables. After denormalising the RMSE value we can contextualise the error in the model; this was on average: X km2. The overall performance of the model was reasonable, but it would not be reliable enough for government work.
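Since the target was z-score normalised, the RMSE can be converted back to km² by multiplying by the standard deviation of the original burned-area column; a one-line sketch using the attributes stored by scale() (names follow the earlier hypothetical sketch):
# Denormalise: multiply the scaled RMSE by the original standard deviation of the target
rmse_km2 <- 0.6 * attr(train_scaled, "scaled:scale")["tot_ar_km2"]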
# Modified from the Spatial Temporal Notebook
fire.test <- (testing_data[,-15])
pred<-predict(fire.nnet.cv$finalModel, fire.test)
y <- testing_data$tot_ar_km2
max(testing_data$tot_ar_km2)
## [1] 5.220514
ggplot() + aes(x = pred, y = y) +
geom_point(size = 1) +
xlab("Predicted") +
ylab("Observed") +
theme(aspect.ratio = 1) +
geom_abline(slope = 1, intercept = 0)
testing_data$difference <- abs(testing_data$tot_ar_km2-pred)
ggplot(testing_data, aes(x = coords.x1, y = coords.x2, color = difference)) + geom_point() + scale_color_gradient(trans="log10")
The second supervised machine learning technique utilised was an LSTM. This is a more advanced model used in deep learning methods. We started by building a simple multivariate LSTM from the predictor variables and built upon this to try to constrain the best parameters for the model. This architecture was chosen because LSTMs can avoid the vanishing gradient problem (Nunes et al., 2019).

### Way the data was divided

For this experiment we used the LSTM for time-series forecasting, so the first step of the analysis was to plot wildfire size against date. This yielded a time series showing the variation in wildfire size through the years, from which we can formulate the experimental problem: we wish to predict the wildfire size for the following month. As the data does not have a regular observation frequency, i.e. there is no fixed amount of time between observations, we must define the timestep manually. Our data has a low temporal resolution, so using single-day timesteps may yield poor results; a timestep of 10 days should be suitable, giving us 3 timesteps per month of data. We first pre-process the data by creating a matrix from the dataframe and normalising it with the R scale() function before feeding it into the model.
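As a minimal sketch of that pre-processing (assuming the time-ordered records sit in a hypothetical dataframe fire_ts_df), this produces the data matrix used by the generator below:
# Coerce to a numeric matrix and normalise to mean 0, variance 1
data <- scale(data.matrix(fire_ts_df))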
We must determine a few model parameters: lookback, the number of timesteps back that the input data should cover, and delay, the number of timesteps into the future the target lies. These vary with the experimental problem; for our model we selected a lookback of 20 and a delay of 10 (see the parameter settings below). The model implementation was a recurrent neural network whose first layer is an LSTM layer with 32 nodes, fed into a dropout layer with 30% dropout. Next, it is passed into a dense layer with 32 nodes and a 'relu' activation function. Finally, the last layer is a dense layer with one node and a 'sigmoid' activation function, representing the output layer. These layers were selected by experimenting with different combinations to find the optimal architecture; a sketch of this architecture follows.
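The stacked architecture described above is not shown verbatim in the training code below, so here is a hedged keras sketch of it (layer sizes taken from the text; the input shape follows the generator's output):
# LSTM (32) -> 30% dropout -> dense relu (32) -> single sigmoid output
model_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model_lstm %>% compile(optimizer = optimizer_rmsprop(), loss = "mae")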
p2 # display the previously created time-series plot of wildfire size
# The code for creating and running the LSTM was modified from (François Chollet, 2017)
# https://blogs.rstudio.com/ai/posts/2017-12-20-time-series-forecasting-with-recurrent-neural-networks/
generator <- function(data, lookback, delay, min_index, max_index,
shuffle = FALSE, batch_size, step) {
if (is.null(max_index))
max_index <- nrow(data) - delay - 1
i <- min_index + lookback
function() {
if (shuffle) {
rows <- sample(c((min_index+lookback):max_index), size = batch_size)
} else {
if (i + batch_size >= max_index)
i <<- min_index + lookback
rows <- c(i:min(i+batch_size-1, max_index))
i <<- i + length(rows)
}
samples <- array(0, dim = c(length(rows),
lookback / step,
dim(data)[[-1]]))
targets <- array(0, dim = c(length(rows)))
for (j in 1:length(rows)) {
indices <- seq(rows[[j]] - lookback, rows[[j]]-1,
length.out = dim(samples)[[2]])
samples[j,,] <- data[indices,]
targets[[j]] <- data[rows[[j]] + delay,2]
}
list(samples, targets)
}
}
lookback <- 20
step <- 2
delay <- 10
batch_size <- 128
train_max_ind = 4000
val_max_ind = 5500
train_gen <- generator(
data,
lookback = lookback,
delay = delay,
min_index = 1,
max_index = train_max_ind,
shuffle = TRUE,
step = step,
batch_size = batch_size
)
val_gen = generator(
data,
lookback = lookback,
delay = delay,
min_index = train_max_ind+1,
max_index = val_max_ind,
step = step,
batch_size = batch_size
)
test_gen <- generator(
data,
lookback = lookback,
delay = delay,
min_index = val_max_ind+1,
max_index = NULL,
step = step,
batch_size = batch_size
)
# How many steps to show the whole validation dataset
val_steps <- (val_max_ind+1 - train_max_ind - lookback) / batch_size
# How many steps to show whole data of test dataset
test_steps <- (nrow(data) - val_max_ind - lookback) / batch_size
# A simple densely connected baseline model
model <- keras_model_sequential() %>%
layer_flatten(input_shape = c(lookback / step, dim(data)[-1])) %>%
layer_dense(units = 32, activation = "relu", kernel_regularizer = regularizer_l2(0.01)) %>%
layer_dense(units = 1)
model %>% compile(
optimizer = optimizer_rmsprop(lr = 0.07),
loss = "mae"
)
history <- model %>% fit_generator(
train_gen,
steps_per_epoch = 200,
epochs = 10,
validation_data = val_gen,
validation_steps = val_steps
)
plot(history)
# A bidirectional LSTM model
model <- keras_model_sequential() %>%
  bidirectional(
    layer_lstm(units = 6), input_shape = list(NULL, dim(data)[[-1]])
  ) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu", kernel_regularizer = regularizer_l2(0.01)) %>%
  layer_dense(units = 1)
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.1),
  loss = "mae"
)
history <- model %>% fit_generator(
train_gen,
steps_per_epoch = 100,
epochs = 10,
validation_data = val_gen,
validation_steps = val_steps
)
plot(history)
The model performance was analysed by plotting the validation and training loss throughout training, so we can visualise how the loss improves or degrades during the training process. A good model shows both the validation and training loss decreasing along similar trajectories. Our model showed the validation and training loss within similar ranges, suggesting the model is not overfitting and makes a reasonable attempt at prediction. However, the MAE values are quite large, 0.9 on average, meaning the predictor variables do not fully explain the output. The decrease in both losses through training suggests the model is learning to predict wildfire size. The (normalised) burned area ranged from -0.3 to 7, skewed towards the higher values. This suggests that the model can predict the size of wildfires with an average error of 0.9 units, but overall this is not an accurate model. To convert the MAE to an RMSE estimate we multiply it by the number of observations in the validation set (1500) and take the square root, yielding an RMSE of 36.7.
From our analysis the LSTM model provided a better prediction of wildfire size, based on the RMSE produced by the optimal parameters. The structure of the LSTM makes it a suitable architecture for analysing multivariate time series, especially because the wildfire data has a long history of observations. However, neither model produced a prediction accurate enough for use in government work. This is likely because the climate data represented a mean summer value instead of daily climate values; if each wildfire had climate data from the day it ignited, the models might be more accurate. Furthermore, both models were run on the entire state of California, and figure 8 showed a clear divide in climate between the North and the South; splitting the data into North and South and running two independent models could yield better results. Further temporal analysis may also help, as the temporal autocorrelation measures applied here did not show any meaningful results.

The implementation of the ANN made it easier to interpret the results: in just a couple of lines of code the observed vs. predicted graph can be plotted and the RMSE, R² and MAE values shown. It is a little harder to produce the same error checking with the LSTM, but it also yields an RMSE and a loss graph. The ANN was very easy to implement, with little data pre-processing required to produce a satisfactory result; hyperparameters could be adjusted with simple code modifications, and accuracy checking was straightforward. The LSTM was harder to implement, as it requires more pre-processing and manual choices of settings such as the timestep and lookback, which could introduce experimental errors. The running time of the LSTM was slightly shorter than that of the ANN, but this could be due to the differing number of epochs and steps per epoch between the two models; we also had to re-run the ANN on the hyperparameter grid each time the data pre-processing changed, which increased computation time. Another benefit of the ANN is that we can analyse how it performed across the study area: by taking the difference between the observed and predicted values, accuracy can be plotted spatially. Figure 12 shows it performed reasonably well across the study area, though some areas, especially along the North coast, performed worse than inland California. This may be due to the climate data record.

How could the models be improved? As the size of a wildfire varies within a finite range, we could improve their predictive power by classifying fires into several groups, for example 5 classes ranging from small to very large fires, and instead let the models predict which class a wildfire falls within (see the sketch below). Furthermore, we could split the wildfires into small and large fires and run the model on both sets of data, as the mechanisms that control the spread of a large fire may differ from those that control a smaller one.
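A hypothetical sketch of that reframing, binning burned area into five ordered classes with cut() (fire_df is the assumed combined dataframe from earlier, and the quantile breaks are illustrative; ties in the quantiles would need handling):
# Bin the continuous target into five ordered size classes
fire_df$size_class <- cut(fire_df$tot_ar_km2,
                          breaks = quantile(fire_df$tot_ar_km2, probs = seq(0, 1, 0.2)),
                          labels = c("very small", "small", "medium", "large", "very large"),
                          include.lowest = TRUE)
table(fire_df$size_class) # class frequencies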