For my research assistant position I have been cleaning lots of taxonomic data for tree species in southern Africa. On the surface, splitting a name into its parts seems simple: `Brachystegia spiciformis` gets split into `c("Brachystegia", "spiciformis")`. But what about when the species is written as `Brachystegia spiciformis var. kwangensis`? Here is a list of the species name forms I found in my dataset:
- Brachystegia spiciformis
- Brachystegia cf. spiciformis
- Acacia abyssinica subsp. calophylla
- Acacia sieberiana var. woodii
And that isn’t counting the species with multiple below-species taxonomic ranks, like `Vachellia gerrardii subsp. gerrardii var. latisiliqua`.
Separating these out by hand would take a very long time, so I wrote a function which does it for me.
First the function splits each string on spaces or, where a dot is not followed by a space, on the dot. It then checks whether the species is marked `cf.`, meaning that the exact species isn’t known but a guess has been made. In that case `species` is replaced with `indet` (indeterminate) and the guessed species is stored in the `confer` column. A similar process then searches for varieties and subspecies. If below-species ranks are to be returned, the dataframe is returned as is; otherwise the `confer` value replaces `indet` in `species` and only the genus and species columns are returned.
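As a rough illustration of the first two steps described above, the sketch below tokenises a few of the example names and flags the `cf.` records. It is a simplified stand-in rather than the full function: it splits on single spaces only, and the variable names (`nms`, `tokens`, `is_cf`, `res`) are chosen just for this example.

```r
# Example names taken from the list of name forms above
nms <- c("Brachystegia spiciformis",
         "Brachystegia cf. spiciformis",
         "Acacia sieberiana var. woodii")

# Split each name into tokens on single spaces
tokens <- strsplit(nms, " ")

# A name is a cf. record when its second token is exactly "cf" or "cf."
is_cf <- vapply(tokens, function(y) grepl("^cf\\.?$", y[2]), logical(1))

# Pull out the first three tokens of every name (NA where a token is missing)
tok1 <- vapply(tokens, `[`, character(1), 1)
tok2 <- vapply(tokens, `[`, character(1), 2)
tok3 <- vapply(tokens, `[`, character(1), 3)

# For cf. records the guess goes to confer and species becomes "indet"
confer <- ifelse(is_cf, tok3, NA_character_)
species <- ifelse(is_cf, "indet", tok2)

res <- data.frame(genus = tok1, species = species, confer = confer,
                  stringsAsFactors = FALSE)
res
```

Anchoring the pattern with `^` and `$` means only a standalone `cf` or `cf.` token counts, so an epithet that merely contains those letters is left alone.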
This function doesn’t catch `Brachystegia sp.2`, but I have a separate function which replaces these with `Brachystegia indet` based on a lookup table supplied by the user.
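That separate function isn't shown here, so the following is only a guess at how a lookup-based replacement could work. The function name `replaceFromLookup` and the lookup column names `original` and `replacement` are assumptions made for this sketch, not the actual interface.

```r
# Hypothetical helper (not the author's function): swap any name found in a
# user-supplied lookup table for its replacement, leaving other names as is
replaceFromLookup <- function(x, lookup) {
  # lookup is assumed to have character columns "original" and "replacement"
  idx <- match(x, lookup$original)
  ifelse(is.na(idx), x, lookup$replacement[idx])
}

# Illustrative lookup table mapping morphospecies labels to "indet"
lookup <- data.frame(
  original = c("Brachystegia sp.2", "Brachystegia sp. 1"),
  replacement = c("Brachystegia indet", "Brachystegia indet"),
  stringsAsFactors = FALSE
)

cleaned <- replaceFromLookup(
  c("Brachystegia sp.2", "Brachystegia spiciformis"),
  lookup
)
cleaned
```

Running this kind of replacement before splitting means the splitting function only ever sees `Brachystegia indet`, which it handles like any other two-word name.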
#' Split full species name into genus, species, and optionally below-species taxonomic ranks
#'
#' @param x vector of genus and species names
#' @param subsp logical, should lower taxonomic ranks be returned?
#'
#' @return dataframe of character vectors with one column per rank
#'
#' @export
#'
splitSpecies <- function(x, subsp = TRUE) {
  # Insert a space after any dot directly followed by a letter (e.g. "var.x"),
  # so that no letters are lost when splitting, then split on spaces
  x <- gsub("([a-z])\\.([a-z])", "\\1. \\2", x)
  x <- strsplit(x, " ")
  x <- lapply(x, function(y) {
    # genus
    genus <- y[1]
    # cf and species
    if (grepl("^cf\\.?$", y[2])) {
      species <- "indet"
      cf <- y[3]
      plus <- 1
    } else {
      species <- y[2]
      cf <- NA_character_
      plus <- 0
    }
    # defaults when no below-species rank is present
    subspecies <- NA_character_
    variety <- NA_character_
    if (!is.na(y[3 + plus])) {
      rest <- y[(3 + plus):length(y)]
      # variety if present: the token after "var" or "var."
      is_var <- grepl("^var\\.?$", rest)
      if (any(is_var)) {
        variety <- rest[which(is_var)[1] + 1]
      }
      # subspecies if present: the token after "subs", "subsp" or "subsp."
      is_sub <- grepl("^subsp?\\.?$", rest)
      if (any(is_sub)) {
        subspecies <- rest[which(is_sub)[1] + 1]
      }
    }
    c(genus, species, cf, subspecies, variety)
  })
  # stringsAsFactors = FALSE keeps columns as character in R < 4.0
  out <- as.data.frame(do.call(rbind, x), stringsAsFactors = FALSE)
  names(out) <- c("genus", "species", "confer", "subspecies", "variety")
  # Replace "indet" with the cf. species if subsp == FALSE
  if (!subsp) {
    out$species[!is.na(out$confer)] <- out$confer[!is.na(out$confer)]
    out <- out[, c("genus", "species")]
  }
  return(out)
}
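To see what the `subsp = FALSE` branch does in isolation, here is a small sketch on a hand-built data frame. The values are typed in by hand to mirror the five columns the function produces; they are not real output.

```r
# Hand-built stand-in for the five-column result described above
out <- data.frame(
  genus = c("Brachystegia", "Acacia"),
  species = c("indet", "sieberiana"),
  confer = c("spiciformis", NA),
  subspecies = c(NA_character_, NA_character_),
  variety = c(NA, "woodii"),
  stringsAsFactors = FALSE
)

# Rows with a cf. guess get that guess back as their species
has_cf <- !is.na(out$confer)
out$species[has_cf] <- out$confer[has_cf]

# Keep only the two columns returned when below-species ranks are dropped
out <- out[, c("genus", "species")]
out
```

The trade-off is that the uncertainty flagged by `cf.` is lost in the two-column form, so it suits cases where only a best-guess binomial is needed.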