Processing data from the TRY traits database

2021-12-20

I’ve been working recently with data from the TRY global plant traits database, to assess which dominant dry tropical tree species have good trait data and which do not. One of the key inputs to the process-based carbon cycle model used in the SECO project is leaf mass per area (LMA), often expressed as its inverse, leaf area per mass, also known as specific leaf area (SLA), so I’m focussing on that. If we find gaps in the trait coverage of some species, we may be able to fill them with data collection during the project.

The data returned by a TRY request are in a format that is quite difficult to parse in R. Instead of a wide 2D table with one row per observation, the file is in a long format, with trait values and metadata on separate rows, linked by an observation ID. In this post I want to share the R code I use to create a neat dataframe from these data.
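To make that structure concrete, here is a small made-up example of what two observations might look like. The column names follow the TRY export, but every value is invented and the TraitID and trait DataID are placeholders; only the DataIDs 59 and 60 for latitude and longitude match the ones used later in this post.

```r
# Toy illustration of the TRY long format (all values invented)
toy <- data.frame(
  ObservationID = c(1, 1, 1, 2, 2),
  AccSpeciesName = c(rep("Species a", 3), rep("Species b", 2)),
  TraitID = c(1, NA, NA, 1, NA),      # placeholder trait ID
  DataID = c(100, 59, 60, 100, 59),   # 100 is a placeholder
  DataName = c("SLA", "Latitude", "Longitude", "SLA", "Latitude"),
  StdValue = c(12.5, -15.2, 23.1, 9.8, -18.4)
)

# Trait rows have a TraitID, metadata rows do not
trait_rows <- toy[!is.na(toy$TraitID), ]
meta_rows <- toy[is.na(toy$TraitID), ]

# Both kinds of row share an ObservationID, which is what links them
split(toy$DataName, toy$ObservationID)
```

Each observation here contributes one trait row and one or two metadata rows, which is why the file has to be reshaped before it can be analysed.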

I use data.table::fread() to read in the data, because the files can be quite large, 3.35 GB in my case:

library(data.table)

try_dat <- fread("dat/18017.txt", header = TRUE, sep = "\t", dec = ".",
  quote = "", data.table = FALSE, encoding = "UTF-8")

Note that the data are tab-separated, and that it’s a good idea to specify UTF-8 encoding to avoid problems with non-ASCII characters.
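If memory is tight, one option is to read only the columns needed downstream, using the `select` argument of `fread()`. The sketch below writes a tiny made-up tab-separated file to a temporary location so that it runs on its own; the file contents are invented, and with real data you would point `fread()` at the TRY export instead.

```r
library(data.table)

# Write a tiny made-up tab-separated file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "ObservationID\tAccSpeciesName\tTraitID\tStdValue\tComment",
  "1\tSpecies a\t1\t12.5\tnot needed",
  "1\tSpecies a\t\t-15.2\tnot needed"),
  tmp)

# Read only the columns used later on, dropping the rest
keep_cols <- c("ObservationID", "AccSpeciesName", "TraitID", "StdValue")
small <- fread(tmp, header = TRUE, sep = "\t", quote = "",
  select = keep_cols, data.table = FALSE, encoding = "UTF-8")

names(small)
```

I haven’t benchmarked how much this saves on a full TRY export, so treat it as something to try rather than a guaranteed fix.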

Then I rename some columns and keep the useful ones:

library(dplyr)

try_clean <- try_dat %>%
  dplyr::select(
    obs_id = ObservationID,
    species_id = AccSpeciesID,
    species_name = AccSpeciesName,
    trait_id = TraitID,
    trait_name = TraitName,
    key_id = DataID,
    key_name = DataName,
    val_orig = OrigValueStr,
    val_std = StdValue,
    unit_std = UnitName,
    error_risk = ErrorRisk)

I create lookup tables to match the species IDs and trait IDs later on:

species_id_lookup <- try_clean %>% 
  dplyr::select(species_id, species_name) %>% 
  unique()

trait_id_lookup <- try_clean %>%
  dplyr::select(trait_id, trait_name) %>%
  unique() %>%
  dplyr::filter(!is.na(trait_id))

Then I split the data by observation ID:

try_split <- split(try_clean, try_clean$obs_id)

Then I loop through each of those observations, extracting the trait data and some useful metadata that is commonly attached to each observation. Note that TRY holds many kinds of metadata, and not all observations share the same ones. Many lack even latitude and longitude coordinates, which limits their usefulness.

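library(parallel)

total <- length(try_split)
try_list <- mclapply(seq_along(try_split), function(i) {
  message(i, "/", total)
  obs <- try_split[[i]]

  # Keep only the trait rows and the columns of interest
  traits <- obs[!is.na(obs$trait_id),
    c("species_id", "trait_id", "val_orig", "val_std", "unit_std", "error_risk")]

  # Skip observations with no trait rows, which would otherwise
  # cause the column assignments below to fail
  if (nrow(traits) == 0) return(NULL)

  # Return the first value stored under a given DataID, or NA if absent
  meta_ext <- function(y, key_val) {
    ext <- y[y$key_id == key_val, "val_std"]
    if (length(ext) == 0) NA else ext[1]
  }

  # Extract some common metadata, recycled across all trait rows.
  # Text metadata such as biome and country may only be stored in
  # val_orig, so check these columns against your own export.
  traits$elev <- meta_ext(obs, 61)
  traits$longitude <- meta_ext(obs, 60)
  traits$latitude <- meta_ext(obs, 59)
  traits$map <- meta_ext(obs, 80)
  traits$mat <- meta_ext(obs, 62)
  traits$biome <- meta_ext(obs, 193)
  traits$country <- meta_ext(obs, 1412)

  return(traits)
}, mc.cores = 3)  # uses forking, so on Windows mc.cores must be 1

# Bind the per-observation dataframes into a single dataframe
try_df <- as.data.frame(do.call(rbind, try_list))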
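To see what the metadata lookup does for a single observation, here is a minimal sketch on made-up data. It defines a helper equivalent to `meta_ext()` above. The values are invented and the trait row's key ID of 100 is a placeholder, while 59, 60 and 61 are the latitude, longitude and elevation keys used in the loop.

```r
# One made-up observation: a trait row plus two metadata rows
obs <- data.frame(
  trait_id = c(1, NA, NA),
  key_id = c(100, 59, 60),
  val_std = c(12.5, -15.2, 23.1)
)

# Return the first value stored under a key ID, or NA if it is absent
meta_ext <- function(y, key_val) {
  ext <- y[y$key_id == key_val, "val_std"]
  if (length(ext) == 0) NA else ext[1]
}

# Keep the trait row, then attach metadata as extra columns
traits <- obs[!is.na(obs$trait_id), c("trait_id", "val_std")]
traits$latitude <- meta_ext(obs, 59)
traits$longitude <- meta_ext(obs, 60)
traits$elev <- meta_ext(obs, 61)  # not recorded for this observation

traits
```

The result is a single row holding the trait value alongside its coordinates, with `NA` for the elevation that this observation lacks.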

Finally, I can add the trait and species names back in using the lookup tables:

# Add trait and species names to dataframe
try_df$trait_name <- trait_id_lookup$trait_name[
  match(try_df$trait_id, trait_id_lookup$trait_id)]

try_df$species_name <- species_id_lookup$species_name[
  match(try_df$species_id, species_id_lookup$species_id)]
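With the data in this shape, checking trait coverage per species is short. The sketch below uses a made-up stand-in for `try_df`; the species names, the trait labels and the minimum of two observations are all placeholders rather than anything from the SECO analysis.

```r
# Made-up stand-in for the processed dataframe
toy_df <- data.frame(
  species_name = c("Species a", "Species a", "Species a", "Species b", "Species c"),
  trait_name = c("SLA", "SLA", "SLA", "SLA", "Leaf N"),
  val_std = c(12.5, 11.9, NA, 9.8, 21.0)
)

# Count usable (non-missing) SLA observations per species,
# keeping species with zero observations in the table
sla <- toy_df[toy_df$trait_name == "SLA" & !is.na(toy_df$val_std), ]
n_obs <- table(factor(sla$species_name, levels = unique(toy_df$species_name)))

# Species falling below an arbitrary minimum number of observations
min_obs <- 2
gaps <- names(n_obs)[as.vector(n_obs) < min_obs]

gaps
```

Converting the species names to a factor with all levels present means species with no SLA records at all still show up with a count of zero, rather than silently dropping out.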