Extracting pages with colour from a PDF

2021-10-26

I wanted to print my PhD thesis so that I had a copy to annotate before my viva. Printing the whole thesis in full colour at my local copy shop would have cost somewhere around £60, while a black and white copy cost only about £15. Only the pages with figures contained any colour, so printing the whole document in colour was unnecessary. I wanted a way to automatically find the pages that did contain colour and collect them into a new document, which I could then print in colour separately.

I wrote a shell script that uses Ghostscript (gs) to find the colour pages, and pdfjam to extract those pages into a new document:

#!/usr/bin/env sh

# Extract colour pages from a PDF, then create a new PDF containing only those pages. Useful for saving on printing costs.

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <input.pdf> <output.pdf>"
    exit 2
fi

if [ ! -f "$1" ]; then
    echo "Input file not found"
    exit 2
fi

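# Ask Ghostscript for per-page ink coverage, join each "Page N" line with the
# CMYK line that follows it, drop pages with no cyan, magenta or yellow ink,
# and turn the remaining page numbers into a comma-separated list.
pages=$(gs -o - -sDEVICE=inkcov "${1}" |
    tail -n +6 |
    sed '/^Page /N;s/\n//' |
    sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d' |
    grep -Eo '^Page [0-9]+' |
    awk '{print $2}' |
    tr '\n' ',' |
    sed 's/,$//')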

if [ -z "${pages}" ]; then
    echo "File has no colour pages"
    exit 2
fi

# Build the new PDF, silencing pdfjam's output with POSIX-style redirection.
pdfjam "${1}" "${pages}" -o "${2}" > /dev/null 2>&1

The if statements at the top of the script check that the arguments are valid: the script needs an existing input file and an output file name.

The pages variable is built using the inkcov device, which is available in Ghostscript 9.05 and later. inkcov reports the ink coverage of each page as separate cyan, magenta, yellow and black (CMYK) values. All that remains is to drop the pages whose cyan, magenta and yellow values are all zero, which are the pages containing only black, and then format the remaining page numbers as the comma-separated list that pdfjam expects. If no colour pages are found, the script exits without creating a new PDF. Otherwise pdfjam takes the input filename, the page list, and the output filename, and creates the new PDF document.
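To make the filtering step easier to follow on its own, here is a rough sketch that does the same job with a single awk program instead of the sed and grep chain. It is not what the script above uses. The function name colour_pages is made up for this example, and the coverage figures in the sample are invented to show the layout. It assumes inkcov prints a "Page N" line followed by a line of four CMYK values, as described above.

```sh
#!/usr/bin/env sh

# Hypothetical helper: read inkcov-style output on stdin and print a
# comma-separated list of the pages that use cyan, magenta or yellow ink.
colour_pages() {
    awk '
        # A "Page N" line: remember the page number for the next line.
        /^Page [0-9]+/ { page = $2; next }

        # A coverage line: fields 1 to 3 are cyan, magenta and yellow.
        /CMYK/ && ($1 + $2 + $3 > 0) {
            if (list != "") list = list ","
            list = list page
        }

        END { print list }
    '
}

# Invented sample in the assumed inkcov layout (not real measurements).
sample='Page 1
 0.00000  0.00000  0.00000  0.02230 CMYK OK
Page 2
 0.01500  0.02360  0.00400  0.05120 CMYK OK
Page 3
 0.00000  0.00000  0.00000  0.04010 CMYK OK
Page 4
 0.00000  0.00310  0.00000  0.01200 CMYK OK'

printf '%s\n' "$sample" | colour_pages    # prints 2,4
```

With a real document, the output of gs -o - -sDEVICE=inkcov input.pdf could be piped into the function in place of the sample. Because the awk program only looks at "Page N" and CMYK lines, it should not need the tail step that skips the Ghostscript banner.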