Delete / remove blank pages from a PDF

The following was originally posted to http://www.accesspdf.com/article.php/20050128092744804. I found a copy on the Wayback Machine, which is apparently not indexed by Google.

Removing Blank Pages from a PDF

Friday, January 28 2005 @ 09:27 AM PST
Contributed by: Admin

Here is an idea for how to remove blank pages from a PDF using pdftotext and pdftk. It is based on a recent posting to comp.text.pdf.

[email protected] wrote:
> Hello all,

> Sorry if this is a recurrent question. I'm rendering/printing an HTML
>  document from a web-based program to a pdf file. The web-based
> program has minimal features to control pagination, etc. (I know
> web-based print control is relatively primitive) and the outcome is
> unwanted blank pages in the PDF output file.

> Anyway, I'm basically looking for a program that will allow me batch 
> process a folder of PDF files and strip out the blank pages. Is there
>  any programs or utilities that will do this? Any suggestions are
> greatly appreciated

> DM

Here is one simple idea that assumes that all non-blank pages have
(extractable) text on them.

1. Use pdftotext (from the xpdf project) to convert mydoc.pdf to
mydoc.txt. Pdftotext uses the formfeed character (0x0c) to mark page breaks.

2. Scan mydoc.txt looking for pages with no text. Record these page
indexes (start counting at page 1, not zero).

3. If you find blank pages, use pdftk to remove them. Construct the
pdftk command line using the page indexes you collected in step 2. For
example, to drop page 3, say:

   pdftk mydoc.pdf cat 1-2 4-end output mydoc.noblanks.pdf

It shouldn't be too hard to write such a shell script, eh?

Sid Steward
http://www.AccessPDF.com/pdftk/
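
Step 3 of the post above can be automated. Here is a minimal sketch, not part of the original post, that turns a list of blank page numbers into the page ranges for the pdftk cat operation. The function name keep_ranges and its arguments are hypothetical; the total page count and the blank page numbers are assumed to come from steps 1 and 2.

```shell
#!/bin/sh
# keep_ranges TOTAL [PAGE...]
# Print the page ranges that remain after dropping the listed pages.
# TOTAL is the page count of the PDF; each PAGE is a 1-based page to drop.
keep_ranges() {
    total=$1
    shift
    awk -v total="$total" -v drop="$*" 'BEGIN {
        # remember which pages must be skipped
        n = split(drop, d, " ")
        for (i = 1; i <= n; i++) skip[d[i]] = 1
        start = 0
        out = ""
        # walk one page past the end so the last run gets flushed
        for (p = 1; p <= total + 1; p++) {
            if (p <= total && !(p in skip)) {
                if (start == 0) start = p      # a kept run begins here
            } else if (start != 0) {
                # the run ended on page p - 1; emit "N" or "N-M"
                range = (start == p - 1) ? start : (start "-" (p - 1))
                out = (out == "") ? range : (out " " range)
                start = 0
            }
        }
        print out
    }'
}

# Usage sketch (requires pdftk; the unquoted substitution is intentional
# so that each range becomes its own argument):
#   pdftk mydoc.pdf cat $(keep_ranges 5 3) output mydoc.noblanks.pdf
```

For example, keep_ranges 5 3 prints 1-2 4-5, which spells out the same pages as the 1-2 4-end form used in the post when mydoc.pdf has five pages.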

The Script

Using bash (via MSYS) on my Win2k machine, I have strung some commands together that identify PDF pages with no extractable text on them. I don't say "blank pages," because sometimes a non-blank PDF page has no extractable text on it.

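#!/bin/sh
#
# find_textless_pdf_pages.sh
# bash script for MSYS; also requires pdftotext (xpdf);
#
# identify PDF pages that have no extractable text on them;
# linux users might need to omit the -c sed option and then
# drop the 'R' from the sed script;
#
# invoke like so:
#  find_textless_pdf_pages.sh mydoc.pdf
#
# convert the PDF to text on stdout; $1 is quoted so paths with spaces work
pdftotext "$1" - |
# map F->f and R->r, then formfeed->F and carriage return->R
tr "FR\f\r" "frFR" |
# a line that is just "FR" marks a page break; N appends the next line
sed -c -n '/^FR$/{ N; /^FR\nFR$/a\
PageNoText
/^FR\nFR$/!a\
Page
D; }'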

Command Breakdown

pdftotext - converts the input PDF file into text. It uses the formfeed character (\f) to mark page breaks.
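
Since page breaks are marked with formfeeds, counting formfeeds gives a page count. A small sketch, assuming pdftotext writes exactly one formfeed per page, including the last one; count_pages is a hypothetical helper name:

```shell
#!/bin/sh
# Count the pages in pdftotext output read from stdin.
count_pages() {
    # tr -cd deletes every byte except the formfeed; wc -c counts what is left.
    # The final tr strips the padding that some wc implementations add.
    tr -cd '\f' | wc -c | tr -d ' '
}

# Usage (requires pdftotext):
#   pdftotext mydoc.pdf - | count_pages
```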

tr - translates characters. Sed doesn't see non-printing ASCII characters such as \f or \r (carriage return). So, translate R->r, F->f, and \f->F, \r->R.
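
The effect of the tr step is easy to check on a short sample string. This sketch assumes a tr that understands the \f and \r escapes (POSIX tr does); swap_markers is a hypothetical name wrapping the same translation used in the script:

```shell
#!/bin/sh
# Same translation as in the script:
# F->f, R->r, formfeed->F, carriage return->R.
swap_markers() {
    tr 'FR\f\r' 'frFR'
}

# "FRED" followed by a formfeed and a carriage return becomes "frEDFR".
printf 'FRED\f\r\n' | swap_markers
```

After the swap, any F or R left in the stream can only have come from a formfeed or a carriage return, which is what lets sed match page breaks as plain text.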

sed - the stream editor. I discuss sed here. If it finds a line that is just "FR", then it looks ahead to see if the next line is also "FR". If it is, then it prints "PageNoText". If it isn't, then it prints "Page". Finally, it uses the D command to continue processing with just the second line of text.
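
For Linux users who would rather avoid the -c and 'R' caveats, here is a sketch of the same idea using awk instead of sed. This is a swapped-in technique, not the author's script: it splits the text on formfeeds and prints one label per page. The name label_pages is hypothetical, and treating whitespace-only pages as textless is an assumption.

```shell
#!/bin/sh
# Read pdftotext output on stdin; print "Page" or "PageNoText" once per page.
label_pages() {
    awk 'BEGIN { RS = "\f" }                  # one record per page
         /[^ \t\r\n]/ { print "Page"; next }  # page has non-whitespace text
         { print "PageNoText" }'
}

# Usage (requires pdftotext):
#   pdftotext mydoc.pdf - | label_pages | grep -n PageNoText | cut -d: -f1
# prints the 1-based numbers of the textless pages.
```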

I'll continue working on this script as time permits.

/nix | May 10, 2010

