John L. Godlee

I had an email from a colleague who had a load of univariate distributions generated from different landscape evolution simulations. They wanted to compare which distribution was closest to the observed distibution in the real landscape.

To begin with, the colleague suggested using t-tests to see if two distributions were similar, but I think there is a problem with using t-tests in that way. As I interpret it, t-tests analyse whether two distribution means are appreciably different, rather than similar. Anyway, this was my discussion:

So, you have many simulated elevation distributions, and you want to know which distribution closest to the true observed distribution? In that case, I definitely take issue with using K-S (Kolmogorov-Smirnov) tests or many one-sample t-tests, as they just don’t fit the framing of your question. Both these will test whether the sample mean of a simulated distribution is appreciably different from the observed mean in the original landscape, using a p-value test. Firstly, it’s not as simple as reversing the logic to say: “if you have a non-significant p-value then they are appreciably similar”. Secondly, you don’t want to know whether each simulation is suitable (i.e. similar) or not, you want to know which is most suitable. With the above tests all simulations might be suitable, and given the large sample size I would wager that they all would be, but you wouldn’t know which was the most sitable.

As an aside, before resigning yourself to non-parametric tests, it might be worth seeing if a simple transformation of the data can achieve normality, or at least normal-ish. Try log-transforming or sqrt-ing. With a sample of 3000 points I don’t think having truly normal data is such a big deal anyway, and there is a lot of literature to support linear models being robust to violations of normality.

Equivalence testing may be what you are looking for. It tests instead whether two distributions are suitably similar, but it won’t be able to tell you which one is most similar to the truth.

My favourite approach however, having mulled this over a good deal, is this:

Transform distributions to achieve a normal-ish distribution.

Do a linear regression of observed (original landscape, dependent, y) against predicted (simulated, independent, x) values.

Compare those regression models using AIC or some other information criterion to find the regression which minimises variance between predicted and observed values. The simulated values used in that regression are from the “best” simulation. You could also report R^2 values, which should be highest in the model with the lowest AIC. Note it is generally accepted that if two models are within 2 AIC points of each other, you cannot say that one is better than the other, thus you may have multiple simulations which are the most suitable.

I’ve included an R script which does similar with a bunch of fake normal-ish data.

For data visualisation, I would:

Transform distributions to achieve a normal-ish distribution.

Make an interval plot or a boxplot showing the mean and standard deviation of each simulated distribution compared to the true mean.

My attempt is also in the R script.

Thanks for the interesting problem, let me know what you end up using, John

Here is the R script I referenced:

# Which elevation distribution is closest to the truth?
# John Godlee (johngodlee@gmail.com)
# 2020-10-05

set.seed(20201005)

# Packages
library(dplyr)
library(tidyr)
library(ggplot2)

# Create data
distrib_list <- list(
  obs = rnorm(1000, 144, 5), 
  el1 = rnorm(1000, 110, 5), 
  el2 = rnorm(1000, 120, 5), 
  el3 = rnorm(1000, 130, 5),
  el4 = rnorm(1000, 140, 5), 
  el5 = rnorm(1000, 150, 5), 
  el6 = rnorm(1000, 160, 5)
)

##' 6 simulated distrib. and 1 observed distrib.. Each has: 
##'   1000 points,
##'   standard deviation of 5
##'   varying mean

# Create density plot 
do.call(cbind, distrib_list) %>%
  as.data.frame(.) %>%
  gather(., key, value) %>%
  ggplot(.) + 
    geom_line(stat = "density",
      aes(x = value, colour = key))

##' Expect that `el4` is most similar to `obs`

# Run linear regressions
mod_list <- lapply(distrib_list[-1], function(y) { 
  lm(y ~ distrib_list[[1]])
    })

# Extract AIC, BIC, R^2
mod_df <- data.frame(name = names(distrib_list[-1]))
mod_df$aic <- lapply(mod_list, AIC)
mod_df$bic <- lapply(mod_list, BIC)
mod_df$rsq <- lapply(mod_list, function(x) { summary(x)$r.squared })

##' There are loads of other metrics to choose from

# Look at the extracted metrics
mod_df

# Visualise data
do.call(cbind, distrib_list) %>%
  as.data.frame(.) %>%
  gather(., key, value) %>% 
  mutate(sim = case_when(
      grepl("el", key) ~ TRUE,
      TRUE ~ FALSE)) %>%
  ggplot() + 
    geom_boxplot(data = raw_dat,
      aes(x = key, y = value, colour = sim), fill = NA) +
    stat_summary(data = raw_dat, 
      aes(x = key, y = value, fill = sim),
      fun = mean, geom = "point", shape = 21, size = 5, colour = "black") + 
    theme(legend.position = "none")

Comparing which simulated distribution is closest to the truth

2020-10-05