Comparing which simulated distribution is closest to the truth


I had an email from a colleague who had a load of univariate distributions generated from different landscape evolution simulations. They wanted to compare which distribution was closest to the observed distibution in the real landscape.

To begin with, the colleague suggested using t-tests to see if two distributions were similar, but I think there is a problem with using t-tests in that way. As I interpret it, t-tests analyse whether two distribution means are appreciably different, rather than similar. Anyway, this was my discussion:

So, you have many simulated elevation distributions, and you want to know which distribution closest to the true observed distribution? In that case, I definitely take issue with using K-S (Kolmogorov-Smirnov) tests or many one-sample t-tests, as they just don’t fit the framing of your question. Both these will test whether the sample mean of a simulated distribution is appreciably different from the observed mean in the original landscape, using a p-value test. Firstly, it’s not as simple as reversing the logic to say: “if you have a non-significant p-value then they are appreciably similar”. Secondly, you don’t want to know whether each simulation is suitable (i.e. similar) or not, you want to know which is most suitable. With the above tests all simulations might be suitable, and given the large sample size I would wager that they all would be, but you wouldn’t know which was the most sitable.

As an aside, before resigning yourself to non-parametric tests, it might be worth seeing if a simple transformation of the data can achieve normality, or at least normal-ish. Try log-transforming or sqrt-ing. With a sample of 3000 points I don’t think having truly normal data is such a big deal anyway, and there is a lot of literature to support linear models being robust to violations of normality.

Equivalence testing may be what you are looking for. It tests instead whether two distributions are suitably similar, but it won’t be able to tell you which one is most similar to the truth.

My favourite approach however, having mulled this over a good deal, is this:

I’ve included an R script which does similar with a bunch of fake normal-ish data.

For data visualisation, I would:

My attempt is also in the R script.

Thanks for the interesting problem, let me know what you end up using, John

Here is the R script I referenced:

# Which elevation distribution is closest to the truth?
# John Godlee (
# 2020-10-05


# Packages

# Create data
distrib_list <- list(
  obs = rnorm(1000, 144, 5), 
  el1 = rnorm(1000, 110, 5), 
  el2 = rnorm(1000, 120, 5), 
  el3 = rnorm(1000, 130, 5),
  el4 = rnorm(1000, 140, 5), 
  el5 = rnorm(1000, 150, 5), 
  el6 = rnorm(1000, 160, 5)

##' 6 simulated distrib. and 1 observed distrib.. Each has: 
##'   1000 points,
##'   standard deviation of 5
##'   varying mean

# Create density plot, distrib_list) %>% %>%
  gather(., key, value) %>%
  ggplot(.) + 
    geom_line(stat = "density",
      aes(x = value, colour = key))

##' Expect that `el4` is most similar to `obs`

# Run linear regressions
mod_list <- lapply(distrib_list[-1], function(y) { 
  lm(y ~ distrib_list[[1]])

# Extract AIC, BIC, R^2
mod_df <- data.frame(name = names(distrib_list[-1]))
mod_df$aic <- lapply(mod_list, AIC)
mod_df$bic <- lapply(mod_list, BIC)
mod_df$rsq <- lapply(mod_list, function(x) { summary(x)$r.squared })

##' There are loads of other metrics to choose from

# Look at the extracted metrics

# Visualise data, distrib_list) %>% %>%
  gather(., key, value) %>% 
  mutate(sim = case_when(
      grepl("el", key) ~ TRUE,
      TRUE ~ FALSE)) %>%
  ggplot() + 
    geom_boxplot(data = raw_dat,
      aes(x = key, y = value, colour = sim), fill = NA) +
    stat_summary(data = raw_dat, 
      aes(x = key, y = value, fill = sim),
      fun = mean, geom = "point", shape = 21, size = 5, colour = "black") + 
    theme(legend.position = "none")