An R function to split species names

2020-06-05

For my research assistant position I have been cleaning lots of taxonomic data for tree species in southern Africa. On the surface this seems simple: Brachystegia spiciformis gets split into c("Brachystegia", "spiciformis"). But what about when the species is written as Brachystegia spiciformis var. kwangensis? Here is a list of possible species name forms I found in my dataset:

And that isn’t counting the species with multiple below-species taxonomic ranks, like: Vachellia gerrardii subsp. gerrardii var. latisiliqua.

Separating these out by hand would take a very long time, so I wrote a function which does it for me.

First the function splits each string on spaces, or optionally on dots that sit directly between two letters with no space. It then checks whether the species is marked cf., meaning that the exact species isn’t known but a guess has been made, in which case species is replaced with indet (indeterminate) and the guessed species is stored in the confer column. A similar search then picks out varieties and subspecies. If below-species ranks are requested (subsp = TRUE), the dataframe is returned as is. Otherwise the value in the confer column replaces indet in species, and only the genus and species columns are returned.
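To make the token-based search concrete, here is a minimal sketch of the idea on a single name. The helper `tokenAfter()` is hypothetical and exists only for this illustration; it is not part of the function further down, and it assumes each rank marker appears as its own space-separated token.

```r
# Hypothetical helper for illustration: return the token that follows a
# rank marker such as "var." or "subsp.", or NA if the marker is absent.
tokenAfter <- function(tokens, marker_regex) {
  id <- which(grepl(marker_regex, tokens))
  if (length(id) == 0) {
    return(NA_character_)
  }
  tokens[id[1] + 1]
}

nm <- "Vachellia gerrardii subsp. gerrardii var. latisiliqua"

# Split the name into space-separated tokens
tokens <- strsplit(nm, " ")[[1]]

# Genus and species are the first two tokens when there is no cf.
genus <- tokens[1]
species <- tokens[2]

# Below-species ranks are the tokens that follow their markers
subspecies <- tokenAfter(tokens, "^subsp?\\.?$")
variety <- tokenAfter(tokens, "^var\\.?$")

c(genus = genus, species = species, subspecies = subspecies, variety = variety)
```

Anchoring the patterns with `^` and `$` means a marker only matches when it is a whole token, so an epithet that merely contains "var" is not mistaken for a variety marker.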

This function doesn’t catch Brachystegia sp.2, but I have a separate function which replaces these with Brachystegia indet based on a lookup table supplied by the user.
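A lookup-based replacement of that kind could look roughly like the sketch below. This is not the actual function mentioned above: the name `replaceFromLookup()` and the lookup table's column names `original` and `corrected` are assumptions made for illustration.

```r
# Hypothetical sketch: replace names that appear in a user-supplied lookup
# table. `lookup` is assumed to have character columns `original` and
# `corrected`; names not found in `original` are left untouched.
replaceFromLookup <- function(x, lookup) {
  id <- match(x, lookup$original)
  found <- !is.na(id)
  x[found] <- lookup$corrected[id[found]]
  x
}

lookup <- data.frame(
  original = "Brachystegia sp.2",
  corrected = "Brachystegia indet",
  stringsAsFactors = FALSE
)

nm <- c("Brachystegia sp.2", "Brachystegia spiciformis")
cleaned <- replaceFromLookup(nm, lookup)
cleaned
```

Here the first name becomes "Brachystegia indet" and the second is returned unchanged. Running a step like this before splitting means the splitting function only has to deal with names it can parse.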

#' Split full species name into genus, species, and optionally below-species taxonomic ranks
#'
#' @param x vector of genus and species names
#' @param subsp logical, should lower taxonomic ranks be returned?
#'
#' @return dataframe of character vectors with one column per rank
#'
#' @export
#'
splitSpecies <- function(x, subsp = TRUE) {
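  # Insert a space after any dot that sits directly between two letters
  # (e.g. "var.kwangensis"), so that no letters are lost, then split on spaces
  x <- gsub("(?<=[a-z])\\.(?=[a-z])", ". ", x, perl = TRUE)
  x <- strsplit(x, " +")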

  x <- lapply(x, function(y) {
    # genus
    genus <- y[1]

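    # cf and species: only treat the second token as cf. when the whole
    # token is "cf" or "cf.", not when "cf" merely occurs inside an epithet
    if (grepl("^cf\\.?$", y[2])) {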
      species <- "indet"
      cf <- y[3]
      plus <- 1
    } else {
      species <- y[2]
      cf <- NA_character_
      plus <- 0
    }

    if (!is.na(y[3+plus])) {
      # Remaining tokens after genus, (cf.) and species
      rest <- y[(3+plus):length(y)]

      # variety if present: the token following "var" or "var."
      var_id <- which(grepl("^var\\.?$", rest))
      if (length(var_id) > 0) {
        variety <- rest[var_id[1] + 1]
      } else {
        variety <- NA_character_
      }

      # subspecies if present: the token following "subs", "subsp" or "subsp."
      sub_id <- which(grepl("^subsp?\\.?$", rest))
      if (length(sub_id) > 0) {
        subspecies <- rest[sub_id[1] + 1]
      } else {
        subspecies <- NA_character_
      }
      c(genus, species, cf, subspecies, variety)
    } else {
      c(genus, species, cf, NA_character_, NA_character_)
    }
  })

  # Keep columns as character (not factor) so values can be reassigned below
  out <- as.data.frame(do.call(rbind, x), stringsAsFactors = FALSE)
  names(out) <- c("genus", "species", "confer", "subspecies", "variety")

  # If subsp == FALSE, use the cf. species as the species
  # and return only genus and species
  if (!subsp) {
    out$species[!is.na(out$confer)] <- out$confer[!is.na(out$confer)]
    out <- out[, c("genus", "species")]
  }
  return(out)
}