Questions about operational IT for research

2020-06-25

I have a couple of open questions, similar to a previous question posed to my lab group about how folk set up their R environments. I think discussions like this are a good way of developing a sense of collegiality in academic groups. Discussion of our specific research is often stymied by the feeling that it has to be perfect before we talk to colleagues about it, but discussing operational topics like data management and data analysis is an effective way of sharing experience and making all our lives easier. Once the boring day-to-day topics are handled as well as possible, the hard work of research becomes slightly more pleasurable.

First question: How does one manage large file storage, rasters and the like? I currently download large spatial data to my local machine for analysis, but my laptop periodically runs out of hard disk space and I have to delete various layers. Then, inevitably, I need some file again and have to work out where I originally got it, or I want to re-run an old analysis and find that I carelessly deleted an important raw data file.
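One small thing I keep meaning to do, regardless of where the files actually live, is to record where each download came from at the moment I download it. A minimal sketch of the kind of manifest script I have in mind, in Python; the manifest name, paths and URL below are all made up:

```python
import hashlib
import json
from datetime import date
from pathlib import Path

# Hypothetical manifest file, kept alongside the data directory
MANIFEST = Path("data_manifest.json")

def sha256_of(path, chunk_size=2**20):
    """Checksum a file in chunks, so large rasters are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_download(path, source_url, notes=""):
    """Append a provenance entry (source URL, download date, checksum) for a downloaded file."""
    entry = {
        "file": str(path),
        "source_url": source_url,
        "downloaded": date.today().isoformat(),
        "sha256": sha256_of(path),
        "notes": notes,
    }
    records = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []
    records.append(entry)
    MANIFEST.write_text(json.dumps(records, indent=2))

# Made-up example:
# record_download("rasters/elevation_25m.tif",
#                 "https://example.org/dem/elevation_25m.tif",
#                 notes="25 m DEM, national coverage")
```

Deleting a layer to free up disk space would then be less scary: the manifest says where to fetch it again, and the checksum lets me confirm that the re-downloaded file is the same one I had before.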

I’ve tried keeping files on Google Drive, but the large files choke the syncing over my domestic internet connection. I’ve tried keeping files on my university datastore, but the upload/download speeds off the University network are very frustrating. At the moment I keep large files on a networked home server, but there are two major caveats to this approach: firstly, if I ever work away from home I will no longer have access to those files; and secondly, I do not have enough disk space for redundancy, so if the spinning-disk hard drives fail, that’s it.

As a side note on the question above and the earlier R environment question, I have become concerned about how much of the burden of IT infrastructure my university is pushing onto employees and students. The general consensus among my lab group on the R environment question was that the University-managed R environment, as installed via the ‘Application Catalog’, is unusable for real research due to an issue with managing packages. One lab group member said that when they raised it with IT, they were advised simply not to use the University R environment. Surely this is a service which should be provided to everyone at the University without question?! Another story comes from a lab group who decided it was easier to buy their own high-spec image-rendering desktop machine than to deal with the University’s poorly managed cluster computing setup. Finally, there are all the PhD students in my office who choose to use their own laptops, keyboards and mice, presumably paid for out of their own pockets, rather than the terrible network-managed all-in-one desktop PCs and low-end chiclet keyboards. My own desktop PC was pushed to the back of my desk after about two weeks of work in favour of my laptop and an external display.

Second question: How does one create a truly reproducible QGIS workflow, one which keeps a record of the intermediate objects created, the processes that create them and the inputs provided?

I was recently clipping, diffing and dissolving a bunch of different spatial vector files to create a data-informed study region, which will define the spatial bounds of part of my research. Normally I do these things in R, but this time I needed a fair amount of manual clicking, so I opted for QGIS. Looking back, if I had considered each operation more carefully I probably could have got away with no manual clicking at all, but I was short on time and willpower. What I would really like is to be able to export a script, probably in Python since QGIS already interacts well with it, that records exactly what I did, right down to the free-hand polygons I drew by hand, so that everything I produced manually could be recreated.
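The Processing framework seems to get part of the way there, at least for the scripted steps: each geoprocessing operation can be expressed as a `processing.run()` call from the QGIS Python console. A minimal sketch of the clip / difference / dissolve chain I did by hand (the file names below are made up, and this obviously still misses the free-hand digitising):

```python
# Run from the QGIS Python console, where the Processing framework is already set up.
# All paths below are placeholders, not my actual data.
import processing

# Clip habitat polygons to a rough study boundary
clipped = processing.run("native:clip", {
    "INPUT": "habitat.gpkg",
    "OVERLAY": "study_boundary.gpkg",
    "OUTPUT": "habitat_clipped.gpkg",
})["OUTPUT"]

# Remove urban areas from the clipped layer
diffed = processing.run("native:difference", {
    "INPUT": clipped,
    "OVERLAY": "urban_areas.gpkg",
    "OUTPUT": "habitat_minus_urban.gpkg",
})["OUTPUT"]

# Dissolve everything into a single study-region polygon
processing.run("native:dissolve", {
    "INPUT": diffed,
    "OUTPUT": "study_region.gpkg",
})
```

If I remember rightly, the Processing history panel also logs these calls when the same tools are run from the toolbox, which would at least give a starting point for reconstructing a session after the fact; the hand-drawn polygons, though, would still only survive as saved layers rather than as code.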