Workflow for writing an academic paper

2019-12-20

I’ve been writing a couple of papers simultaneously recently, so I’ve had some practice in managing the writing process. I do analyses mostly in R, with some small bits of data cleaning on the command line with gdal or text manipulation utilities like sed, grep and awk. I write the paper with LaTeX.

I was inspired by this British Ecological Society guide which came out a couple of years ago.

I keep a Git repository for each paper I’m writing. This repository contains all the data, scripts, and notes for that paper. The basic directory structure looks like this, with some minor changes when a particular project demands it:

.
├── README.md 
├── build.sh
├── output.log 
├── notes/
├── data/
├── img/
├── include/
├── scripts/
└── manuscript/
    ├── img/
    └── include/

The README.md contains a short description of the project and also a description of the directory structure, with descriptions of data files and the purpose of each script. This skeleton for the README.md can be copied between projects with minor changes.

notes/ contains multiple text files with notes on the project, notably a record of what was accomplished during each session in a file called journal.md.

data/ contains all the data for the project. Shapefiles are kept in their own folder because each one is actually a collection of multiple files.

img/ contains all the images created during analysis. These images are later copied to manuscript/img/ for use in the manuscript.

include/ contains non-image outputs created during analysis, such as tables and individual statistics. These files are then copied to manuscript/include/ for use in the manuscript.

scripts/ contains all the scripts used during analysis, mostly R scripts and shell scripts.
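A minimal sketch of how this skeleton could be created from the shell for a new paper. The project name my_paper is a placeholder, and the empty files are only stubs to be filled in later.

```shell
#!/bin/sh
# Sketch: scaffold the directory layout described above.
# PROJECT is a hypothetical name; change it to suit.
PROJECT="my_paper"

# Analysis directories, plus the manuscript's own img/ and include/
for dir in notes data img include scripts manuscript/img manuscript/include; do
  mkdir -p "$PROJECT/$dir"
done

# Stub files, to be filled in per project
touch "$PROJECT/README.md" "$PROJECT/build.sh" "$PROJECT/notes/journal.md"
chmod +x "$PROJECT/build.sh"

# One Git repository per paper (skipped if git is not installed)
if command -v git >/dev/null 2>&1; then
  git init -q "$PROJECT"
fi
```

Git does not track empty directories, so folders like img/ only appear in the repository once they contain a file.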

build.sh is an executable shell script which runs all the analysis scripts, copies images and other items to the manuscript directory, and finally compiles the paper. The stdout and stderr of build.sh get dumped into output.log, which I can then read to check whether everything compiled properly.

Below is an example of build.sh:

#!/bin/sh

{
IMG="manuscript/img/"
INC="manuscript/include/"

# Run data compilation
Rscript scripts/data_clean.R
Rscript scripts/analysis.R

# Transfer images to manuscript
cp img/map.pdf "$IMG"
cp img/barplot.pdf "$IMG"

# Transfer snippets to manuscript
cp include/n_plots.tex "$INC"
cp include/n_outliers.tex "$INC"
cp include/hull_cover.tex "$INC"

# Transfer tables to manuscript
cp include/model_fit.tex "$INC"

# Edit tables
## model_fit.tex
sed -i 's/\$//g' "${INC}model_fit.tex"
sed -i 's/caption{}/caption{Model fit statistics for linear regression}/g' "${INC}model_fit.tex" 

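# Compile, then clean up auxiliary files
# (-c keeps the compiled document; -C would delete it as well)
latexmk manuscript/manuscript.tex
latexmk -c manuscript/manuscript.tex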

} > output.log 2>&1

First I define some variables that I reuse when moving images and tables. Then I run each R script in the right order, transfer images, snippets and tables to the manuscript directory, and edit some of the snippets and tables if I need to. Finally, I compile the LaTeX document using latexmk. All of the paths in build.sh are defined relative to the root of the project directory, to make the project portable between machines. The commands in the script are wrapped in { ... } > output.log 2>&1, which redirects all of their output (stderr and stdout) to a file called output.log.
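To see the grouped redirection on its own, here is a small standalone sketch. The messages and the missing file name are made up for the demonstration; only the { ... } > file 2>&1 pattern is taken from build.sh.

```shell
#!/bin/sh
# Sketch: everything inside { ... } shares one redirection, so the
# stdout and stderr of every command end up in the same log file.
LOG="output.log"

{
  echo "Running analysis"          # written to stdout
  echo "Warning: example" >&2      # written to stderr
  ls no_such_file_for_demo         # fails, error message goes to stderr
  echo "Done"                      # still runs after the failure
} > "$LOG" 2>&1

# Quick scan of the log for problems after a build
grep -n -i -E "error|warning|no such file" "$LOG"
```

Without set -e the script carries on after a failed command, which is why it is worth reading or grepping the log rather than assuming a finished build was a clean one.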

I generate snippets with useful numerical statistics that I can insert into my LaTeX document, so if they change over the course of the analyses the values automatically update. For example, if I want to know the number of plots in my dataset:

n_plots <- sum(rowSums(df))

# Write the value as a LaTeX command to a snippet file
fileConn <- file("include/stats.tex")
writeLines(
  c(
    paste0("\\newcommand{\\nplots}{", n_plots, "}")
  ),
  fileConn)
close(fileConn)

This writes the definition of a LaTeX command called \nplots{}, which expands to the value of n_plots, to a .tex file. In LaTeX I can just include stats.tex with \input{include/stats.tex} and then use the command in text, for example: we measured \nplots{} plots.
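The same pattern also works from the shell for simple counts. This is a sketch of that alternative rather than the R approach above, and it assumes a CSV with a header row and one row per plot. The file names plots_demo.csv and stats_demo.tex and the data itself are invented for the demonstration.

```shell
#!/bin/sh
# Sketch: write a LaTeX command holding a statistic computed in the shell.
# The CSV created here is made-up demonstration data, one row per plot.
mkdir -p data include
printf 'plot_id,cover\nA,0.4\nB,0.7\nC,0.2\n' > data/plots_demo.csv

# Count data rows, skipping the header line
n_plots=$(awk 'NR > 1 { n++ } END { print n + 0 }' data/plots_demo.csv)

# In a printf format, \\ prints a single backslash and %s inserts the value
printf '\\newcommand{\\nplots}{%s}\n' "$n_plots" > include/stats_demo.tex

cat include/stats_demo.tex
```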

I generate tables with the {stargazer} package from dataframes in R.

fileConn <- file("include/model_fit.tex")
writeLines(stargazer(model_fit_df,
  summary = FALSE,
  label = "model_fit", digit.separate = 0, rownames = FALSE), fileConn)
close(fileConn)

These tables often need to be adjusted quite a lot, which I do mainly with sed in build.sh. I can change the column headers, add a caption, adjust digit rounding and adjust the table formatting if I need to. Doing all this on the command line is easier than doing it through stargazer with R.
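As a sketch of what those edits can look like, the script below adds a caption, renames the column headers and changes the float placement of a small table. The table is a made-up stand-in for stargazer output, and model_fit_demo.tex is a hypothetical file name.

```shell
#!/bin/sh
# Sketch: post-process a LaTeX table with sed.
# The table below is invented; real stargazer output has more markup.
TABLE="model_fit_demo.tex"

cat > "$TABLE" <<'EOF'
\begin{table}[!htbp] \centering
  \caption{}
  \label{model_fit}
\begin{tabular}{ccc}
\hline
model & r2 & aic \\
\hline
linear & 0.52 & 104.3 \\
\hline
\end{tabular}
\end{table}
EOF

# Writing to a temporary file and moving it back avoids sed -i,
# whose syntax differs between GNU and BSD sed.
sed \
  -e 's/caption{}/caption{Model fit statistics for linear regression}/' \
  -e 's/^model & r2 & aic/Model \& R$^2$ \& AIC/' \
  -e 's/\[!htbp]/[h]/' \
  "$TABLE" > "$TABLE.tmp" && mv "$TABLE.tmp" "$TABLE"

cat "$TABLE"
```

In the replacement text & has to be escaped as \&, because a bare & stands for the whole matched string.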

As a side note I have a simple function to reformat p values from statistical tests so they look sensible for publication:

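library(dplyr)  # provides case_when()

# Format a p value for reporting: thresholds first, otherwise round to 2 d.p.
p_format <- function(p){
  case_when(p < 0.01 ~ "p <0.01",
    p < 0.05 ~ "p <0.05",
    TRUE ~ paste0("p = ", round(p, digits = 2)))
}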

As always, the issue I have with this setup is that no matter how thoroughly I try to explain it, some people just aren’t familiar with the tools used, notably shell scripting and LaTeX. For the LaTeX issue, one thing I can do is convert the document to a Word document with something like pandoc, but this has problems. The question has been asked many times and there are loads of options; it’s messy: