Data management during and after my PhD

2021-09-20

During my PhD I spent a lot of time collecting field data. Along with colleagues from Angola I set up 15 permanent 1 ha woodland survey plots in Bicuar National Park, Angola, where we conducted a census of all woody stems >5 cm diameter, and made additional measurements of the grass biomass and tree canopy cover. These plots will hopefully have another census in 2023. I also collected terrestrial LiDAR data in 22 woodland plots of 1 ha each, all 15 in Bicuar National Park and an additional seven in southern Tanzania, to quantify canopy complexity.

I think these two datasets form a key product of my PhD thesis. PhD students often generate a lot of data, but only a minority develop a long-term plan for data management and dissemination. I chose to write an extra chapter in my thesis about the multiple uses of the data I collected and its contribution to the greater good of the field. That contribution depends on the data being properly archived, managed, and advertised, otherwise nobody else will want to use it. I hope the investigative chapters of the thesis can be converted into manuscripts for peer review, extending their lifespan and making them more impactful by reaching a larger audience. In the same way, I hope I can ensure that the data I collected during my PhD has a legacy beyond the thesis itself.

Lots of universities have a data management plan web page. These are some of the first results from Google for “phd data management”:

The University of Edinburgh, where I did my PhD, also has one, but I didn’t see it until writing this blog post, a week after handing in.

Writing a DMP | The University of Edinburgh

At the end of the first year of my PhD I wrote a “Confirmation Report”. In other institutions I’ve heard them referred to as “Upgrades”. It’s sort of a friendly examination that makes sure you have a developed plan for what to do during the PhD, before it’s too late. You write a report that’s part literature review and part methodology proposal, then have a mini viva with some other academics. I always felt like my confirmation report should have required a data management plan, similar to how it required an ethics assessment and a timeframe, but it didn’t. We did have a short presentation on data management during the “Research Planning and Management” course in the first year of the PhD, which consisted mostly of information on how to store data on the University network. I would have liked to see more guidance on how to manage and archive large volumes of data (TBs), both during and after the PhD, to ensure that data is usable by others, and by your future self.

For the plot census data, which was only a couple of GBs, I have stored the data in three places:

This conforms to the 3-2-1 backup rule, which recommends keeping at least 3 copies of the data, on at least 2 different media types (hard drive, network share), with at least 1 copy stored off-site (I have two different off-site locations: the University and my parent’s house). I also have “cleaned” versions of the plot census data hosted on the SEOSAW database, which makes the data accessible to other researchers under agreement. I’ve already had a few other projects request to use the data, which is very nice to see.
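As a rough illustration, the 3-2-1 rule can be written as a short Python check over an inventory of copies. This is only a sketch: the `Copy` class, the `check_321` function, and the example inventory are all hypothetical names and values invented for the example, not a record of the actual storage setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset. All field values used below are illustrative."""
    label: str      # human-readable name for the copy
    medium: str     # e.g. "hard drive", "network share"
    location: str   # physical place where the copy lives


def check_321(copies, primary_location):
    """Return a dict saying which parts of the 3-2-1 rule are satisfied."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.location != primary_location]
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len(media) >= 2,
        "one_offsite": len(offsite) >= 1,
    }


# Hypothetical inventory, for illustration only.
copies = [
    Copy("working drive", "hard drive", "home"),
    Copy("spare drive", "hard drive", "relative's house"),
    Copy("network storage", "network share", "university"),
]

result = check_321(copies, primary_location="home")
print(result)
print("3-2-1 satisfied:", all(result.values()))
```

Keeping the three conditions separate in the result makes it obvious which part of the rule fails if, say, a drive is retired or moved.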

One thing I didn’t keep good track of for a little while was treating only one copy as the ‘primary’ copy and using the others as backups only. At one point I was writing new data to both my personal hard drives and I got mixed up. Since then, I have kept one of the hard drives in a cupboard, out of sight, to deter me from writing data to it unless I want to do a backup. As an aside, I use rsync to make backups. It’s quick and efficient and very rarely fails. I plan to buy a NAS (Network Attached Storage), and the Synology DS420+ looks nice, but for now loose hard drives will have to suffice.
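One way to keep the ‘primary’ and backup roles from getting mixed up is to only ever copy in one direction, and to preview the transfer before running it. The sketch below builds such an rsync command from Python. The function names and mount points are hypothetical, and the flags shown (`--archive`, `--verbose`, `--dry-run`) are one reasonable choice rather than the exact options used for the backups described here.

```python
import subprocess


def rsync_command(primary, backup, dry_run=True):
    """Build an rsync command that copies one way, from primary to backup.

    The trailing slash on the source tells rsync to copy the directory's
    contents rather than the directory itself.
    """
    cmd = ["rsync", "--archive", "--verbose"]
    if dry_run:
        # Report what would be transferred without changing anything.
        cmd.append("--dry-run")
    cmd += [primary.rstrip("/") + "/", backup.rstrip("/") + "/"]
    return cmd


def run_backup(primary, backup, dry_run=True):
    """Run the command; raises CalledProcessError if rsync exits non-zero."""
    return subprocess.run(rsync_command(primary, backup, dry_run), check=True)


# Hypothetical mount points, for illustration only.
cmd = rsync_command("/mnt/primary/phd_data", "/mnt/backup/phd_data")
print(" ".join(cmd))
```

Calling `run_backup(...)` with the default `dry_run=True` lists what would be copied, and passing `dry_run=False` performs the copy. Because the source is always the primary, nothing written to the backup drive by mistake is ever copied back.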

The LiDAR data consists of raw .zfs files exported directly from the scanner, databases built by Cyclone (Leica’s proprietary LiDAR processing software), PTX files exported from Cyclone, and LAZ files that I created to compress the huge PTX files into a more manageable binary format.

The key items to keep, in my opinion, are the raw .zfs files and the PTX files, as they constitute the raw, untouched data in open formats. The LAZ files are the ones I’ll probably use most day to day, simply because they are small enough that drive I/O isn’t a bottleneck for processing time.

I’ve got the LAZ files backed up in the same places as the plot census data, and also in a DataShare repository, which gives them a permanent DOI and makes them available for others to use. I don’t think I will back up the scan databases, because every piece of information in them is represented in some other file. The only convenience of keeping them is that I would be able to quickly boot up Cyclone and use its very good 3D rendering, but CloudCompare is enough for me most of the time. The PTX files I have backed up both on my personal hard drives and on cassette at the University, a service which I think costs about £50 per pair of tapes, which is very reasonable. This isn’t perfect, as the cassette backup isn’t that accessible, but the PTX files are just so big that it’s difficult to keep them anywhere else. As long as I have two sets of hard drives, each stored in different places, they should be safe.
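When copies live on separate drives in separate places, a checksum manifest is one way to confirm that they still match. The following is a minimal sketch using only the Python standard library. The function names and the tiny example files are invented for illustration, and real scan files would of course be far larger.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(primary, backup):
    """Return files missing from the backup and files whose contents differ."""
    missing = sorted(set(primary) - set(backup))
    changed = sorted(k for k in primary if k in backup and primary[k] != backup[k])
    return missing, changed


# Demonstration with throwaway files standing in for two drives.
with tempfile.TemporaryDirectory() as tmp:
    primary_dir = Path(tmp) / "primary"
    backup_dir = Path(tmp) / "backup"
    primary_dir.mkdir()
    backup_dir.mkdir()
    (primary_dir / "plot_01.laz").write_bytes(b"points")
    (primary_dir / "plot_02.laz").write_bytes(b"more points")
    (backup_dir / "plot_01.laz").write_bytes(b"corrupted")
    missing, changed = compare_manifests(
        build_manifest(primary_dir), build_manifest(backup_dir)
    )

print("missing from backup:", missing)
print("differs from primary:", changed)
```

Saving the primary manifest alongside the data also gives anyone who later downloads or restores the files a way to check that nothing was corrupted in transit.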