Rich Pauloo
5/21/2018

Introduction

The Global Landslide Catalog (GLC) was created to identify landslides triggered mainly by rainfall. The paper describing the dataset can be found here. We will start with an exploratory data analysis to get a broad overview of the data. Following this analysis, we will develop a research question and set out to answer it.


Exploratory Data Analysis

Load some packages for analysis

library(maps)
library(tidyverse) # conflicts with `maps` function `map`

Read in data

d <- readRDS("landslide.rds")   # spatial points data frame
maps::map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90), mar=c(0,0,0,0), main = "Landslides in the World")
points(d, cex = .2, col = "red")


df <- d@data  # get just the data

How large is our data and what does it contain? It looks like we have 3327 observations of landslides and 34 columns, which include our candidate predictor variables; latitude and longitude are stored separately as coordinates.

dim(df)
## [1] 3327   34
colnames(df)
##  [1] "objectid"        "date"            "time"           
##  [4] "country"         "trigger"         "hazardtype"     
##  [7] "hazardintensity" "injuries"        "fatalities"     
## [10] "population"      "aspect"          "bio01"          
## [13] "bio02"           "bio03"           "bio04"          
## [16] "bio05"           "bio06"           "bio07"          
## [19] "bio08"           "bio09"           "bio10"          
## [22] "bio11"           "bio12"           "bio13"          
## [25] "bio14"           "bio15"           "bio16"          
## [28] "bio17"           "bio18"           "bio19"          
## [31] "elevation"       "hillshade"       "landcover"      
## [34] "slope"

What dates does our data span? Our data spans nearly 20 years, from 1997 to 2015.

substr(df$date,1,4) %>% # extract years
  sort() %>%            # sort years
  .[c(1,nrow(df))]      # subset first and last year
## [1] "1997" "2015"

In the column names, we see that there are 19 bioclimatic variables from WorldClim. They are coded as follows (collected into a small lookup sketch after the list):

  • BIO1 = Annual Mean Temperature
  • BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
  • BIO3 = Isothermality (BIO2/BIO7) (* 100)
  • BIO4 = Temperature Seasonality (standard deviation *100)
  • BIO5 = Max Temperature of Warmest Month
  • BIO6 = Min Temperature of Coldest Month
  • BIO7 = Temperature Annual Range (BIO5-BIO6)
  • BIO8 = Mean Temperature of Wettest Quarter
  • BIO9 = Mean Temperature of Driest Quarter
  • BIO10 = Mean Temperature of Warmest Quarter
  • BIO11 = Mean Temperature of Coldest Quarter
  • BIO12 = Annual Precipitation
  • BIO13 = Precipitation of Wettest Month
  • BIO14 = Precipitation of Driest Month
  • BIO15 = Precipitation Seasonality (Coefficient of Variation)
  • BIO16 = Precipitation of Wettest Quarter
  • BIO17 = Precipitation of Driest Quarter
  • BIO18 = Precipitation of Warmest Quarter
  • BIO19 = Precipitation of Coldest Quarter
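
For more readable plot labels later on, these codes can be collected into a named lookup vector; a minimal sketch with a few of the entries above:

# named lookup vector mapping bioclim codes to their descriptions
bio_labels <- c(bio12 = "Annual Precipitation",
                bio13 = "Precipitation of Wettest Month",
                bio16 = "Precipitation of Wettest Quarter",
                bio19 = "Precipitation of Coldest Quarter")
bio_labels["bio19"] # label for a predictor that becomes important later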

There are a few issues with our categorical data: for instance, duplicate entries in the trigger, hazardtype, and hazardintensity variables.

unique(df$trigger)[c(10,15)]
## [1] "Continuous_rain" "Continuous_Rain"
unique(df$trigger)[c(9,16)]
## [1] "Snowfall_snowmelt" "Snowfall"
unique(df$trigger)[c(2,6)]
## [1] "unknown" "Unknown"
unique(df$trigger)[c(1,3)]
## [1] "Downpour" "Rain"
unique(df$trigger)[c(14,19)]
## [1] "Other" "other"
unique(df$hazardtype)[c(1,12)]
## [1] "Mudslide" "mudslide"
unique(df$hazardtype)[c(5,7)]
## [1] "Rock_Fall" "Rockfall"
unique(df$hazardtype)[c(3,14)]
## [1] "Landslide" "landslide"
unique(df$hazardintensity)[c(1,6)]
## [1] "Medium" "medium"
unique(df$hazardintensity)[c(2,8)]
## [1] "Very_large"  "Extra Large"
unique(df$hazardintensity)[c(3,7)]
## [1] "Large" "large"
unique(df$hazardintensity)[c(4,9)]
## [1] "Small" "small"

Let’s correct these duplicate entries by collapsing them into common categories. We can take care of many of these issues by simply converting all of the characters to lowercase, and in the instances where that doesn’t work, we can replace values.

# mutate existing columns to lowercase, which fixes most encoding errors
df <- df %>% 
  mutate(trigger = tolower(trigger),                 # coerce characters to lowercase
         hazardtype = tolower(hazardtype),           # same as above
         hazardintensity = tolower(hazardintensity)) # same as above

Combine the remaining similar entries.

library(stringr)

# replace terms in the trigger category
df$trigger <- df$trigger %>% str_replace("snowfall_snowmelt", "snowfall")

# replace terms in the hazardtype category
df$hazardtype <- df$hazardtype %>% str_replace("rock_fall", "rockfall")

# replace terms in the hazardintensity category
df$hazardintensity <- df$hazardintensity %>% str_replace("extra large", "very_large")

With these cleaned categorical variables, we can overwrite the data in our SpatialPointsDataFrame.

d@data <- df

Are there obvious spatial patterns in hazard triggers, type, or intensity?

# make a world basemap as an sf object
world <- sf::st_as_sf(maps::map('world', plot = FALSE, fill = TRUE))

# convert the spdf into an sf object for plotting
d_sf <- sf::st_as_sf(d)

# map of landslide triggers
d_sf %>% 
  ggplot() +
  geom_sf(data = world) +
  geom_sf(aes(color = trigger), alpha = .6) +
  labs(title = "Landslide Triggers")


# map of hazard type
d_sf %>% 
  ggplot() +
  geom_sf(data = world) +
  geom_sf(aes(color = hazardtype), alpha = .6) +
  labs(title = "Landslide Hazard Type")


# map of hazard intensity
d_sf %>% 
  ggplot() +
  geom_sf(data = world) +
  geom_sf(aes(color = hazardintensity), alpha = .6) +
  labs(title = "Landslide Hazard Intensity")

Large and Very Large landslides are defined in Kirschbaum (2015) as landslides with moderate to high numbers of fatalities. Combined, Large and Very Large landslides account for less than 7% of the total landslides in this dataset…

df %>% 
  count(hazardintensity) %>% 
  filter(hazardintensity != "unknown") %>% 
  ggplot(aes(reorder(hazardintensity, n), n)) +
  geom_col(fill = c("red", "red", "grey50","grey50")) +
  coord_flip() +
  labs(title = "Counts of Worldwide Landslide Hazard Intensity",
       subtitle = "Period of Record: 1997-2015",
       y = "Count",
       x = "Hazard Intensity")


df %>% 
  filter(hazardintensity %in% c("large", "very_large")) %>%  # filter for large and vl landslides
  nrow() / nrow(df) # calculate proportion of these landslides
## [1] 0.06672678

…but account for roughly 84% of fatalities and 63% of injuries.

df %>% 
  filter(hazardintensity != "unknown") %>% 
  group_by(hazardintensity) %>% 
  summarise(fatalities = sum(fatalities),
            injuries = sum(injuries)) %>% 
  mutate(prop_fatalities = (fatalities / sum(fatalities)) %>% round(2),
         prop_injuries = (injuries / sum(injuries)) %>% round(2)) %>% 
  knitr::kable()
hazardintensity   fatalities   injuries   prop_fatalities   prop_injuries
large                   1837        432              0.12            0.62
medium                  2357        216              0.15            0.31
small                    136         44              0.01            0.06
very_large             11390          7              0.72            0.01

Let’s combine large and very large landslides into a category called disaster, and small and medium landslides into another category called non-disaster. Hazard intensity will be our binary response variable in the model.

# re-encode "large" and "very large" landslides "disaster", and "small" and "medium" landslides as "non-disaster"
df %>% 
  mutate(severity = ifelse(hazardintensity == "very_large", "disaster",
                    ifelse(hazardintensity == "medium", "non-disaster", 
                    ifelse(hazardintensity == "large", "disaster",
                    ifelse(hazardintensity == "small", "non-disaster", hazardintensity))))) -> df

# now overwrite data in d with this new dataframe
d@data <- df
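
As an aside, the nested ifelse above can be written more transparently with dplyr::case_when; an equivalent sketch that produces the same severity column:

df <- df %>% 
  mutate(severity = case_when(
    hazardintensity %in% c("large", "very_large") ~ "disaster",
    hazardintensity %in% c("small", "medium")     ~ "non-disaster",
    TRUE                                          ~ hazardintensity)) # keep "unknown" as-is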

Question

Where are the disasters located? It looks like they’re well-distributed throughout the world, with a large cluster in Oceania and along the Himalaya mountain range between India and China. The obvious spatial dependency of landslide intensity suggests that “continent” might be a feature class that improves the prediction accuracy of the model. If time permits, this feature class will be added.
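
If we did add it, one route maps the existing country column to a continent; a sketch, assuming the countrycode package (not used elsewhere in this analysis):

library(countrycode) # assumed helper package

# derive a hypothetical "continent" feature from the country column
# (unmatched country names return NA with a warning)
df$continent <- countrycode(df$country, 
                            origin = "country.name", 
                            destination = "continent")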

First, remove the 14 landslides with “unknown” severity, since we’re not interested in predicting these, and they complicate the model.

d <- d[d$severity != "unknown" , ] # remove "unknown" hazards
#d@data$hazardintensity <- factor(d@data$hazardintensity) # convert response to factor
d@data$country <- factor(d@data$country) # convert country predictor to factor
d@data$severity <- factor(d@data$severity) # convert severity response to factor

View the disasters, or fatal landslides.

dis <- d[d$severity == "disaster", ]
maps::map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90), mar=c(0,0,0,0), main = "Landslides in the World")
points(dis, cex = .5, col = "red")


Modeling

Let’s proceed with a suite of models to predict fatal landslides.


Linear Discriminant Analysis

We begin with a rather simple method: Linear Discriminant Analysis (LDA). To validate our model, we will split the data randomly into two parts: one set for training, and another for testing. We specify the model as follows:

lda(severity ~ elevation + slope + bio12 + bio13 + bio14 + bio15 + bio16 + bio17 + bio18 + bio19)

where bio12 through bio19 are precipitation indices.

library(MASS) # for LDA and QDA models

set.seed(678983) # set seed for reproducibility

# split into training and testing
train_id <- sample(1:nrow(d), nrow(d)/2)  # index of half of the data
train <- d[train_id, ]  # training set: half of the data 
test <- d[-train_id, ]  # test set: other half

# fit the model
lda_fit <- lda(formula = severity ~ elevation + slope + bio12 + bio13 + 
                         bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
               data = train)

# predict the test set
pred_prob <- predict(lda_fit, test, type = "response")

# view the confusion matrix
cm <- table(pred_prob$class, test$severity)
cm
##               
##                disaster non-disaster
##   disaster            0            0
##   non-disaster      114         1543

# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
## [1] 1

# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[ , 2])
picnd 
## [1] 0

# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.06879903

LDA perfectly predicts non-disasters, and does rather poorly in predicting disasters. This is because LDA seeks to minimize the overall error, irrespective of class importance. The LDA classifier is essentially the same as the null classifier, which predicts that all events are non-disasters. This achieves an overall error rate of 7%, because the LDA classifier approximates the Bayes classifier, which returns the smallest possible total number of misclassified observations, regardless of which class the observations are from. In disaster management, we’re more concerned about misclassified disasters. We need a test with high sensitivity to disaster events.
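
Since class-specific error rates recur throughout this analysis, a small helper makes sensitivity concrete; a sketch, assuming a confusion matrix laid out as table(predicted, actual):

# sensitivity: proportion of true disasters correctly identified
sensitivity <- function(cm) cm["disaster", "disaster"] / sum(cm[, "disaster"])
sensitivity(cm) # 0 for the unmodified LDA classifier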

Looking more closely, the posterior probability of any disaster event is always less than .5:

# the first column of the posterior matrix corresponds to disasters
pred_prob$posterior[ , 1] %>% max()
## [1] 0.3569364

Because it uses a .5 threshold, LDA will always classify disasters as non-disasters to reduce the overall error. We can overcome this limitation by varying the posterior probability threshold at which LDA classifies an event as a disaster. This will come with some error, because sometimes we’ll incorrectly classify a non-disaster as a disaster, but we can tolerate some of these errors. For example, we can lower the posterior probability threshold from .5 to .1. Formally, we change \(Pr(severity = disaster | X = x) > 0.5\) to \(Pr(severity = disaster | X = x) > 0.1\).

# change the posterior probability decision boundary from .5 to .1
pred <- rep("non-disaster", times = nrow(test))
pred[pred_prob$posterior[, "disaster"] > .1] <- "disaster"
cm <- table(pred, test$severity) # confusion matrix of predicted classes and actual class
cm
##               
## pred           disaster non-disaster
##   disaster           40          232
##   non-disaster       74         1311

# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1]) 
pic
## [1] 0.6491228
 
# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.1846711

We see that for a posterior probability threshold of .1, the overall error increases from around 7% to about 18%, but we also go from a 100% error rate in our disaster prediction to a 64% error rate in predicting these events. Because we’re more concerned with class-specific performance, in this case, the sensitivity of the test (the proportion of true disasters correctly identified), we will investigate the tradeoff between error and a range of thresholds to inform what threshold to select.

# function to iterate over probabilities

pp <- seq(0.01, 0.5, .001) # posterior probabilities to iterate over

calc_errors <- function(pp, fit){
  # predict the test set
  pred_prob <- predict(fit, test, type = "response")

  # change the posterior probability decision boundary from .5 to the threshold pp
  pred <- rep("non-disaster", times = nrow(test))
  pred[pred_prob$posterior[, "disaster"] > pp] <- "disaster"
  cm <- table(pred, test$severity)

  # proportion of incorrectly classified disasters
  #pic <- cm[2, 1] / sum(cm[, 1])
  pic <- ifelse(nrow(cm) == 2, 
                cm[2, 1] / sum(cm[, 1]), 
                cm[1, 1] / sum(cm[, 1]))
  
  # proportion of incorrectly classified non-disasters
  #picnd <- cm[1, 2] / sum(cm[, 2])
  picnd <- ifelse(nrow(cm) == 2,
                  cm[1, 2] / sum(cm[, 2]),
                  0)
 
  # overall error rate: sum of off diagonals divided by total
  #oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
  oe <- ifelse(nrow(cm) == 2, 
               (sum(cm) - sum(diag(cm))) / sum(cm),
               cm[1] / sum(cm))

  # return the posterior probability threshold, and associated errors
  return(c(pp, pic, picnd, oe))
}

# apply a range of posterior probabilities over this function and combine output
temp <- lapply(pp, calc_errors, fit = lda_fit) %>% do.call(rbind, .)

# make into df for plotting and rename columns
temp <- as.data.frame(temp) 
colnames(temp) <- c("Threshold", 
                    "Incorrectly Classified Disasters", 
                    "Incorrectly Classified Non-Disasters", 
                    "Overall Error Rate")

# plot
temp %>% 
  gather(key, value, -Threshold) %>% # convert data from wide to long format
  ggplot(aes(Threshold, value)) +
  geom_line(aes(color = key), size = .7) +
  labs(title = "LDA at varying Posterior Probability Thresholds",
       subtitle = "with associated error rates",
       y = "Error Rate",
       color = NULL) +
  theme(legend.position = "bottom")

From this we can see that applying LDA with a threshold around .05 allows us to predict fatal landslides with about 75% accuracy, albeit with the caveat that we incorrectly classify 50% of non-disasters.
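
Sweeping thresholds like this is essentially an ROC analysis, so the same tradeoff can be summarized in a single AUC; a sketch, assuming the pROC package (not part of the original analysis):

library(pROC) # assumed helper package

# ROC/AUC for the LDA posterior probability of the disaster class
roc_lda <- roc(response = test$severity,
               predictor = pred_prob$posterior[, "disaster"],
               levels = c("non-disaster", "disaster")) # controls, cases
auc(roc_lda)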

Let’s imagine that the highest error in terms of incorrectly classified non-disasters we’re willing to tolerate is 33%. In other words, we tolerate falsely flagging 1 out of every 3 non-disaster locations as a disaster. This corresponds to a threshold of 0.069 and an accuracy of about 60% in predicting disaster landslides.

temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
##   Threshold Incorrectly Classified Disasters
## 1     0.069                        0.3947368
##   Incorrectly Classified Non-Disasters Overall Error Rate
## 1                            0.3266364          0.3313217

It’s important to remember that LDA assumes that observations in the \(k\)th class are drawn from a multivariate Gaussian distribution \(N(\mu_k, \Sigma)\), where \(\mu_k\) is a class-specific mean vector, and \(\Sigma\) is a covariance matrix common to all classes \(k\).


Quadratic Discriminant Analysis

Quadratic Discriminant Analysis (QDA) is very similar to LDA, except it assumes observations of the \(k\)th class have a class-specific covariance matrix in addition to a class-specific mean vector. In other words, they are drawn from the multivariate Gaussian distribution \(N(\mu_k, \Sigma_k)\). QDA forms quadratic decision boundaries between classes where LDA can only form linear ones. Without knowledge of the covariance matrix of our predictors, we will try both LDA and QDA.
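
Before fitting, a quick and informal way to probe the equal-covariance assumption is to compare class-specific covariance matrices for a pair of predictors; a sketch:

# class-specific covariance matrices of two predictors (a rough check)
train@data %>% 
  split(.$severity) %>% 
  lapply(function(x) cov(x[, c("bio12", "slope")]))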

# set seed
set.seed(56743) 

# fit the model
qda_fit <- qda(formula = severity ~ elevation + slope + bio12 + bio13 + 
                         bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
               data = train)

# apply a range of posterior probabilities over this function and combine output
temp <- lapply(seq(.01, .5, .001), calc_errors, fit = qda_fit) %>% do.call(rbind, .)

# make into df for plotting and rename columns
temp <- as.data.frame(temp) 
colnames(temp) <- c("Threshold", 
                    "Incorrectly Classified Disasters", 
                    "Incorrectly Classified Non-Disasters", 
                    "Overall Error Rate")

# plot
temp %>% 
  gather(key, value, -Threshold) %>% # convert data from wide to long format
  ggplot(aes(Threshold, value)) +
  geom_line(aes(color = key), size = .7) +
  labs(title = "QDA at varying Posterior Probability Thresholds",
       subtitle = "With Associated Error Rates",
       y = "Error Rate",
       color = NULL) +
  theme(legend.position = "bottom") +
  coord_cartesian(ylim = c(0,1)) 

Unlike LDA, with QDA we notice that when \(Pr(severity = disaster | X = x) > 0.5\), we actually have an error rate lower than 100%! This shows that QDA is much better at picking out decision boundaries between disaster classes. However, let’s see how QDA performs according to our acceptable non-disaster error rate of 33%. This corresponds to a threshold of 0.06, and an accuracy of about 56% in predicting disaster landslides, which isn’t as good as what we achieved with LDA.

temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
##   Threshold Incorrectly Classified Disasters
## 1      0.06                        0.4385965
##   Incorrectly Classified Non-Disasters Overall Error Rate
## 1                            0.3266364          0.3343392

Random Forest

Let’s see how well a random forest with the same model specification predicts landslide disasters.

library(randomForest)

# set seed
set.seed(4576789)

# fit the model
rf_fit <- 
  randomForest(
    formula = severity ~ elevation + slope + bio12 + bio13 +
              bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
    data = train, importance = TRUE)

# predict the test set
yhat <- predict(rf_fit, test)

# confusion matrix
cm <- table(yhat, test$severity)
cm
##               
## yhat           disaster non-disaster
##   disaster            5           11
##   non-disaster      109         1532

# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1]) 
pic
## [1] 0.9561404

# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[ , 2])
picnd 
## [1] 0.00712897

# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.07242004

At first glance, a random forest actually performs worse than the LDA and QDA models. It correctly predicts disasters only about 4% of the time. Remember how we modified the thresholds at which the LDA and QDA models predicted classes? In the same way, we can modify the “decision boundaries” of the random forest with the cutoff argument. In classification problems, this parameter controls how a “winning class” is determined from the ensemble of trees grown (in regression, a prediction is simply the mean of the values returned by the trees). By default, the “winning class” for an observation is determined by \(1/k\), which for a 2-class response is simply the majority vote. As in LDA and QDA, we will adjust these boundaries. We will assign a class of “non-disaster” to observations where 66% of the grown trees vote for the “non-disaster” class, and we will investigate the tradeoffs in error associated with various cutoff levels for the “disaster” class.

set.seed(9876325)

rf_mod <- function(x){
  # fit the model
  rf_fit <- 
    randomForest(
      formula = severity ~ elevation + slope + bio12 + bio13 +
                bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
      data = train, importance = TRUE,
      cutoff = c(x, .66)) # apply the cutoff for disaster and non-disaster class determination

  # predict the test set
  yhat <- predict(rf_fit, test)

  # confusion matrix
  cm <- table(yhat, test$severity)
  cm

  # proportion of incorrectly classified disasters
  pic <- cm[2, 1] / sum(cm[, 1]) 
  pic

  # proportion of incorrectly classified non-disasters
  picnd <- cm[1, 2] / sum(cm[ , 2])
  picnd 

  # overall error rate: sum of off diagonals divided by total
  oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
  oe
  
  # store errors
  return(c(x, pic, picnd, oe))
}

# apply the random forest over a vector of disaster cutoffs from .01 to .34
rf_list <- lapply(seq(.01,.34, .01), rf_mod) 

# collapse into a matrix
temp <- rf_list %>% do.call(rbind, .)

# make into df for plotting and rename columns
temp <- as.data.frame(temp) 
colnames(temp) <- c("Threshold", 
                    "Incorrectly Classified Disasters", 
                    "Incorrectly Classified Non-Disasters", 
                    "Overall Error Rate")

# plot
temp %>% 
  gather(key, value, -Threshold) %>% # convert data from wide to long format
  ggplot(aes(Threshold, value)) +
  geom_line(aes(color = key), size = .7) +
  labs(title = "Random Forest at varying Cutoff Thresholds",
       subtitle = "with associated error rates",
       y = "Error Rate", x = "Cutoff for 'disaster' class",
       color = NULL) +
  theme(legend.position = "bottom")

Interestingly, applying the same acceptable limit of incorrectly classified non-disasters at less than or equal to 33%, the random forest gives us nearly identical results to LDA, correctly predicting about 60% of disasters.

temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
##   Threshold Incorrectly Classified Disasters
## 1      0.05                        0.4298246
##   Incorrectly Classified Non-Disasters Overall Error Rate
## 1                            0.3097861          0.3180447

One useful output from the random forest algorithm is a ranking of variable importance, which helps us interpret the relationship between our response and predictors. We see that, in our selected model, BIO19, elevation, and BIO16 are the most important predictors. BIO19 is the precipitation of the coldest quarter, and BIO16 is the precipitation of the wettest quarter, which makes sense: disastrous landslides are primarily influenced by the severity of winter storms, elevational gradients, and the maximum precipitation observed in a quarter.

# rf model with the lowest error in disaster prediction given an acceptable 33% error rate in non-disasters
rf_fit <- 
    randomForest(
      formula = severity ~ elevation + slope + bio12 + bio13 +
                bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
      data = train, importance = TRUE,
      cutoff = c(.05, .66))

# Variable Importance
temp <- importance(rf_fit)[, "MeanDecreaseGini"]
data.frame(variable = names(temp),
           mean_decrease_gini = temp) %>% 
  ggplot() +
  geom_col(aes(forcats::fct_reorder(variable, mean_decrease_gini), # x = sorted mean decrease in gini
               mean_decrease_gini, # y = mean decrease in gini
               fill = mean_decrease_gini)) + # fill by mean decrease in gini
  coord_flip() + # flip coordinates
  labs(title = "Variable Importance", subtitle = "Mean Decrease in Gini Coefficient",
       y = "Mean Decrease in Gini Coefficient", x = "Variable") +
  scale_fill_viridis_c() + # viridis color scale
  guides(fill = FALSE) # remove legend


Boosting

In random forests, trees are grown in parallel. In boosting, trees are grown sequentially on the residuals of the previous tree. We use distribution = "bernoulli" because this is a binary classification problem. If it were a regression problem, we would use distribution = "gaussian".

library(gbm)

# set a seed
set.seed(763459823)

# encode severity as a binary response for gbm
train@data <- train@data %>% mutate(severity_binary = ifelse(severity == "non-disaster", 0, 1))

# fit the model
boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
                 bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
                 data = train, 
                 distribution = "bernoulli", # binary response in 1, 0
                 n.trees = 5000, # number of trees to train
                 interaction.depth = 4) # depth of each tree

# predict the test set
boost_pred <- predict(boost_fit, test, 
                      n.trees = 5000, 
                      type = "response") # return the predicted probability of 1 (disaster)

# confusion matrix
temp <- rep("non-disaster", length(boost_pred))
temp[boost_pred >= .1] <- "disaster" # use a probability threshold of .1
cm <- table(temp, test$severity)
cm
##               
## temp           disaster non-disaster
##   disaster           38          223
##   non-disaster       76         1320

Let’s explore the relationship between the probability threshold and various errors.

boost_error <- function(p, id){
  
  # fit the model
  boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
                   bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
                   data = train, 
                   distribution = "bernoulli", # binary response in 1, 0
                   n.trees = 5000, # number of trees to train
                   interaction.depth = id) # depth of trees
  
  # predict the test set
  boost_pred <- predict(boost_fit, test, 
                        n.trees = 5000, 
                        type = "response") # return the predicted probability of 1 (disaster)

  # set probability threshold for disasters
  pred <- rep("non-disaster", times = length(boost_pred))
  pred[boost_pred >= p] <- "disaster"
  cm <- table(pred, test$severity)

  # proportion of incorrectly classified disasters
  pic <- ifelse(nrow(cm) == 2, 
                cm[2, 1] / sum(cm[, 1]), 
                cm[1, 1] / sum(cm[, 1]))
  
  # proportion of incorrectly classified non-disasters
  picnd <- ifelse(nrow(cm) == 2,
                  cm[1, 2] / sum(cm[, 2]),
                  0)
 
  # overall error rate: sum of off diagonals divided by total
  oe <- ifelse(nrow(cm) == 2, 
               (sum(cm) - sum(diag(cm))) / sum(cm),
               cm[1] / sum(cm))

  # return the posterior probability threshold, and associated errors
  return(c(p, pic, picnd, oe, id))
}

# apply a range of posterior probabilities over this function and combine output
# takes a while
temp <- list()
for(i in 1:4){
  temp[[i]] <- lapply(seq(.01, .5, .01), boost_error, id = i) %>% do.call(rbind, .)
}

# make into df for plotting and rename columns
temp <- temp %>% do.call(rbind, .)
temp <- as.data.frame(temp) 
colnames(temp) <- c("Threshold", 
                    "Incorrectly Classified Disasters", 
                    "Incorrectly Classified Non-Disasters", 
                    "Overall Error Rate",
                    "Interaction Depth")

# save as rds for faster loading when knitting
write_rds(temp, "temp.rds")
# read in data
temp <- read_rds("temp.rds")

# plot
temp %>% 
  #dplyr::select(-`Interaction Depth`) %>% 
  gather(key, value, -Threshold, -`Interaction Depth`) %>% # convert data from wide to long format
  #mutate(`Interaction Depth` = rep(1:4, each = 150)) %>% 
  dplyr::filter(Threshold > 0.02) %>% 
  ggplot(aes(Threshold, value)) +
  geom_line(aes(color = key), size = .7) +
  geom_hline(yintercept = .33, linetype = "dashed") +
  labs(title = "Boosting Trees Error Rates",
       subtitle = "At Varying Thresholds and Interaction Depths",
       y = "Error Rate",
       color = NULL) +
  facet_grid(~`Interaction Depth`) +
  theme(legend.position = "bottom") + 
  coord_cartesian(xlim = c(0,.5))

The plot above displays 4 interaction depths used for boosting. The dashed black horizontal line at 33% error shows that model performance appears insensitive to interaction depth. These calculations are computationally intensive. Rather than exhaustively computing the parameter space at a fine scale, we can use the graph to determine that incorrectly classified disasters reach a minimum at an interaction depth of 3, and more closely inspect this region to optimize the overall performance of the boosted model.

# apply a range of posterior probabilities over this function and combine output
# takes a while
temp <- list()

temp <- lapply(seq(.05, .075, .001), boost_error, id = 3) %>% do.call(rbind, .)
temp <- as.data.frame(temp) 
colnames(temp) <- c("Threshold", 
                    "Incorrectly Classified Disasters", 
                    "Incorrectly Classified Non-Disasters", 
                    "Overall Error Rate",
                    "Interaction Depth")

write_rds(temp, "temp_2.rds")

The optimal boosted tree model uses an interaction depth of 3, with a disaster acceptance threshold of 0.064, and correctly predicts 64% of disastrous landslides, which is an improvement over the LDA, QDA, and random forest models.

temp_2 <- read_rds("temp_2.rds")

temp_2 %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
##   Threshold Incorrectly Classified Disasters
## 1     0.064                        0.3596491
##   Incorrectly Classified Non-Disasters Overall Error Rate
## 1                            0.3298769          0.3319252
##   Interaction Depth
## 1                 3

As with random forests, we can view the variable importance for our optimal boosted model. Compared to our random forest model, there is agreement in the top 2 variables: bio19, and elevation.

# run the optimal model
boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
                   bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
                   data = train, 
                   distribution = "bernoulli", # binary response in 1, 0
                   n.trees = 5000, # number of trees to train
                   interaction.depth = 3) # depth of trees

# predict the test set
boost_pred <- predict(boost_fit, test, 
                      n.trees = 5000, 
                      type = "response") # return the predicted probability of 1 (disaster)
# plot variable importance
p %>% ggplot() + # `p` is essentially `summary(boost_fit)`, reconstructed above
  geom_col(aes(forcats::fct_reorder(var, rel.inf), rel.inf, fill = rel.inf)) +
  coord_flip() +
  scale_fill_viridis_c() +
  guides(fill = FALSE) +
  labs(title = "Variable Importance", subtitle = "Optimal Boosting Model",
       y = "Relative Influence", x = "Variable")


Extending the Model

Can we use our model predictions to predict regions of the world where fatal landslides are most likely?

We begin by assembling our test set: the independent variables used in the model, but on a worldwide scale.

library(raster)
library(sp)

bio <- getData("worldclim", var="bio", res=10)

Elevation data is a bit trickier because it involves downloading and merging rasters.

country_dem <- list() 

# country ISO3 codes (assumed here to come from raster::ccodes())
iso_codes <- raster::ccodes()[, "ISO3"]

# not all countries are available, so we filter them out
for(i in iso_codes[-c(4,12,21,28,37,41,49,55,56,75,82,84,85,98,104,106,114,121,137,142,144,147,
                      154,159,163,166,175,178,187,196,197,205,215,216,218,223,239,242,248,249,250,
                      251, 252)]){
  country_dem[[i]] <- getData("alt", country = i)
}

# merge, save, plot
names(country_dem) <- NULL # do.call won't work with a named list
world_dem <- do.call(raster::merge, country_dem) # merge all rasters into one
write_rds(world_dem, "world_dem.rds")    # save
plot(world_dem)

# a small test -- some scratch code for Russia
list(country_dem[[162]]$`/Users/richpauloo/Desktop/GEO_200CN/Lab14/RUS1_msk_alt.grd`, country_dem[[162]]$`/Users/richpauloo/Desktop/GEO_200CN/Lab14/RUS2_msk_alt.grd`) %>%
  do.call(raster::merge, .) %>% plot()

# do this for all 8 list elements with columns of values.

A bit of data wrangling is necessary to merge some of the larger rasters, which are split into multiple elements in a list.

# initialize vector
x <- vector(length = length(country_dem))

# logical test to find raster layers with multiple columns
for(i in 1:length(x)){
  x[i] = ifelse(length(country_dem[[i]]) <= 10, TRUE, FALSE)
}

# extract these elements to perform a special merge on their data
fix <- country_dem[x]

# function to merge the multi-part rasters (each element is a list of 2-4 tiles)
merge_rasters <- function(x){
  merged_raster <- as.list(x)   # coerce to a plain list of rasters
  names(merged_raster) <- NULL  # do.call won't work with a named list
  do.call(raster::merge, merged_raster)
}

# apply the function
fix_merge <- lapply(fix, merge_rasters)
write_rds(fix_merge, "fix_merge.rds") # save this because it takes a while to generate

# merge all raster layers
all_rasters <- c(fix_merge, country_dem[!x])
names(all_rasters) <- NULL # again, do.call won't work with a named list
world <- all_rasters %>% do.call(raster::merge, .)
write_rds(world, "world.rds") # save this because it takes a while to generate

Let’s view the global DEM we just made.

# read in previously written world elevation raster
world <- read_rds("world.rds")

# plot
plot(world, main = "Elevation (meters)")

To save on computing time, and to avoid aggregating all of the independent variables, let’s build a simpler model with which to predict landslide severity worldwide. We will use the top 4 predictors from our boosted tree instead of all 10. The new model is specified as severity_binary ~ elevation + bio13 + bio18 + bio19. This results in a loss of about 2-3% accuracy in predicting disastrous landslides.

# run the simplified model
boost_fit <- gbm(severity_binary ~ elevation + bio13 + bio18 + bio19,
                 data = train, 
                 distribution = "bernoulli", # binary response in 1, 0
                 n.trees = 5000, # number of trees to train
                 interaction.depth = 3) # depth of trees

# predict the test set
boost_pred <- predict(boost_fit, test, 
                      n.trees = 5000, 
                      type = "response") # return the predicted probability of 1 (disaster)


# set probability threshold for disasters
pred <- rep("non-disaster", times = length(boost_pred))
pred[boost_pred >= .064] <- "disaster"
cm <- table(pred, test$severity)

# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic  
## [1] 0.3684211

# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[, 2])
picnd
## [1] 0.3324692
 
# overall error rate: sum of off diagonals divided by total
oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
oe
## [1] 0.3349427

We need to aggregate the world raster to the scale of the bio raster before running the model.

res(bio)/res(world) # find the scaling factor for x and y between the bio and world rasters
## [1] 20 20
# takes a while, so save it for future use
world_agg <- aggregate(world, c(20,20), fun = mean) # aggregate world raster by the x and y scaling factors
write_rds(world_agg, "world_agg.rds")

We also need to make sure the rasters line up in terms of rows, columns, and extent. So we must crop and fix the extent of the rasters.

world_agg <- read_rds("world_agg.rds")
bio_crop <- crop(bio, world_agg) # crop bioclim variables to the elevation raster
extent(bio_crop) <- extent(world_agg) # fix extent of bioclim

# stack rasters
world_stack <- stack(bio_crop[[13]], # bio13
                     bio_crop[[18]], # bio18
                     bio_crop[[19]], # bio19
                     world_agg) # elevation 

names(world_stack)[4] <- "elevation" # give the 4th raster layer its correct name

Now we can look at our independent variables.

plot(world_stack)

And fit the model to this new data:

# coerce stack to data frame for the model
wsdf <- as.data.frame(world_stack)

# predict over the world - outputs a vector of predictions with length = number of cells in the raster layer
boost_pred <- predict(boost_fit, wsdf, 
                      n.trees = 5000, 
                      type = "response")

Results

We can view the posterior probability surface from which the model predicts fatal landslides.

# reorganize the predictions into a matrix -- the same dimensions as the raster stack
out <- matrix(boost_pred, nrow = 839, ncol = 2160, byrow = TRUE)

# shapefile of the world coastline
coast = shapefile("ne_50m_coastline/ne_50m_coastline.shp")

# replace values
rout <- raster(out) # turn prediction matrix into raster for plotting 
extent(rout) <- extent(world_stack) # define extent of prediction raster
proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
raster::mask(rout, world_stack$bio13)  %>% # mask to only include land
  plot(main = "Posterior Probability Threshold for Fatal Landslides", xlab = "Longitude", ylab = "Latitude")
plot(coast, add = TRUE) # add coastlines
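
As an aside, the manual reshape with hard-coded dimensions can be avoided by writing the predictions straight into a template raster; a sketch:

# fill a template raster with predictions directly -- as.data.frame(world_stack)
# preserves cell order, so no matrix reshaping is needed
rout_alt <- raster(world_stack) # empty layer with the stack's extent, CRS, and dimensions
values(rout_alt) <- boost_pred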

Applying our threshold of 0.064 from above, we can map the areas where the model predicts disastrous landslides to occur, and overlay the locations where we have data for fatal landslide occurrence.

# identify areas predicted by model to have fatal landslides
bin_response <- rep(0, times = length(boost_pred))
bin_response[boost_pred >= .064] = 1
out <- matrix(bin_response, nrow = 839, ncol = 2160, byrow = TRUE) # put into matrix for raster

# plot
rout <- raster(out)
extent(rout) <- extent(world_stack) # define extent of prediction raster
proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
raster::mask(rout, world_stack$bio13)  %>%
  plot(main = "Predicted Fatal Landslide Areas", 
       xlab = "Longitude", ylab = "Latitude", legend = FALSE) 
plot(coast, add = TRUE)
points(dis, cex = .5, col = "red") # fatal landslides

We examine the effect of changing our acceptance probability threshold in the plots below. As the threshold increases, less land is predicted to have fatal landslide danger, which decreases our overall error and our misclassification of small and medium landslides as fatal landslides, but increases the error of misclassifying fatal landslides, as illustrated in the error plots above.

#par(mfrow = c(2,2)) # set up graphical device

# iterate over probability thresholds
for(i in c(.064,.1, .15,.2)){
  # identify areas predicted by model to have fatal landslides
  bin_response <- rep(0, times = length(boost_pred))
  bin_response[boost_pred >= i] = 1
  out <- matrix(bin_response, nrow = 839, ncol = 2160, byrow = TRUE) # put into matrix for raster
  
  # plot
  rout <- raster(out)
  extent(rout) <- extent(world_stack) # define extent of prediction raster
  proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
  raster::mask(rout, world_stack$bio13)  %>%
    plot(main = paste("Probability Threshold =", i), 
         xlab = "Longitude", ylab = "Latitude", legend = FALSE) 
  plot(coast, add = TRUE)
  points(dis, cex = .5, pch = 16, col = "red")
}


Conclusion

Green areas represent locations where the boosted tree classification model predicts fatal landslides are likely to occur. Keep in mind that:

  • we accept a 33% misclassification error rate of areas with small to medium landslides as areas with fatal landslides, in order to increase our prediction accuracy of fatal landslides to ~62%
  • we extend our test set to the entire world

If we take the landslide dataset (1997-2015) to represent the population-level spatial distribution of global landslides, the model appears to overpredict areas susceptible to fatal landslides.

Inaccurate model specification is the most likely cause of this error. Comparing our predictions with the locations where disastrous landslides are recorded, we see that the model predicts fatal landslides in many areas where there is no data to support this prediction.

Far northern regions in Alaska, Canada, Greenland, Russia, and Siberia, as well as large areas throughout mainland Africa and in the Amazon, are predicted to have fatal landslides.

By adding “latitude” into the model, we can correct for inaccurate predictions in northern regions. It is probable that landslides are related to soil type, underlying geology, and patterns of land use. In the absence of a worldwide soil, geology, or land use raster, we might lazily use “continent” as an easy-to-add covariate that should covary with these predictors.
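
As a sketch of that extension, latitude can be pulled from the coordinates of the SpatialPointsDataFrame and added as a candidate predictor before re-splitting the data:

# add latitude as a candidate predictor (a sketch; model refit not run here)
d@data$latitude <- sp::coordinates(d)[, 2] # second coordinate column is latitude

# hypothetical refit: severity_binary ~ elevation + bio13 + bio18 + bio19 + latitude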