Batch replace text in PDF files

Simple text replacements in simple PDF documents can be made with changepagestring.pl, part of CAM-PDF-1.52, which, by the way, includes many other cool tools. To install it, start the CPAN shell:
$ perl -MCPAN -e shell
If this is the first time you've run CPAN, it will ask you a series of questions - the default answers worked fine for me. When the cpan> prompt appears, install the CAM::PDF module:
cpan> install CAM::PDF
Now let's see if our PDF allows modification:
$ pdfinfo.pl pcasm-book.pdf 
File:         pcasm-book.pdf
File Size:    1071411 bytes
Pages:        195
Author:       Paul A. Carter
CreationDate: D:20050320210800
Creator:      LaTeX with hyperref package
Keywords:     80x86 assembly programming
Producer:     pdfTeX-1.10b
Subject:      80x86 Assembly Language Programming
Title:        PC Assembly Language
Page Size:    variable
Optimized:    no
PDF version:  1.4
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes
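When checking more than one file, reading that report by eye gets tedious. Here is a minimal sketch that looks for the Modify line in a pdfinfo.pl report. It assumes the report has exactly the layout shown above; the allows_modify helper is my own name for illustration, not part of CAM::PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if a pdfinfo.pl report contains a "Modify: yes" line, else 0.
# Assumes the report layout shown above.
sub allows_modify {
    my ($report) = @_;
    return $report =~ /^\s*Modify:\s*yes\s*$/m ? 1 : 0;
}

# Sample lines copied from the report above.
my $report = <<'END';
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes
END

print allows_modify($report)
    ? "modification allowed\n"
    : "modification forbidden\n";
```

For a real file, the report text could be captured with backticks around the pdfinfo.pl command and passed to the same helper. For the sample above it prints "modification allowed".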
As it does, let's batch replace the word "Borland" with the word "Inprise" and name the new file output.pdf:
$ changepagestring.pl -o pcasm-book.pdf Borland Inprise output.pdf
That seems to have worked, but there are still instances of "Borland" in the file - why were they not changed? The following script by Adam314 will print the content of every page, including the normally hidden PDF formatting codes:
#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;

my $infile = '/path/pcasm-book.pdf';

# Open the PDF, or stop with CAM::PDF's error message
my $doc = CAM::PDF->new($infile) or die "$CAM::PDF::errstr\n";

# Print the raw content of every page
for my $page (1 .. $doc->numPages) {
    my $content = $doc->getPageContent($page);
    print $content;
}
Sure enough, the intact string "Borland" only shows up twice. Where are all the others? Why, broken up by hideous formatting codes (spacing adjustments wedged between fragments of the word), like these examples:
Borl)1(and)1('s
Borlan)1(d's)-2
Borlan)1(d)-497
Borl)1(and)-241
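To see why a plain search misses these, it helps to strip the breaks and look again. The sketch below assumes every break has the form seen in the four examples (a closing parenthesis, a signed number, an opening parenthesis); strip_kerning is a hypothetical helper name, and real page content may well contain other constructs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fragments copied from the page content dump above.
my @fragments = (
    q{Borl)1(and)1('s},
    q{Borlan)1(d's)-2},
    q{Borlan)1(d)-497},
    q{Borl)1(and)-241},
);

# Remove each break of the form ")NUMBER(" so the pieces join up again.
# A trailing ")NUMBER" with no "(" after it is left untouched.
sub strip_kerning {
    my ($text) = @_;
    $text =~ s/\)\s*-?\d+(?:\.\d+)?\s*\(//g;
    return $text;
}

for my $fragment (@fragments) {
    my $plain     = strip_kerning($fragment);
    my $raw_hit   = $fragment =~ /Borland/ ? 'yes' : 'no';
    my $clean_hit = $plain    =~ /Borland/ ? 'yes' : 'no';
    print "$fragment => $plain (raw: $raw_hit, stripped: $clean_hit)\n";
}
```

None of the raw fragments contains "Borland" as a contiguous string, while every stripped one does, which is consistent with a literal search-and-replace skipping them.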
Alongside that script, Adam314 offers advice for replacing instances like these with regular expressions. At this point I grew rather weary, however, especially as the replaced text tended to get cut off or run into neighboring words. Still, for simple text replacements in simple PDF documents, changepagestring.pl may come in handy.
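For the curious, here is roughly what a regex-based replacement could look like. This is my own sketch under stated assumptions, not Adam314's code: kerned_pattern, pdf_escape, replace_kerned and replace_in_pdf are hypothetical names, the breaks are assumed to look like ")1(" or ")-241(" as in the examples above, and only the CAM::PDF calls (new, canModify, numPages, getPageContent, setPageContent, cleanoutput) come from the module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a regex that matches $word even when a break such as ")1(" or
# ")-241(" sits between any two of its letters.
sub kerned_pattern {
    my ($word) = @_;
    my $break = qr/(?:\)\s*-?\d+(?:\.\d+)?\s*\()?/;
    my $body  = join $break, map { quotemeta($_) } split //, $word;
    return qr/$body/;
}

# Backslash-escape the characters that are special inside a PDF string.
sub pdf_escape {
    my ($text) = @_;
    $text =~ s/([\\()])/\\$1/g;
    return $text;
}

# Replace every plain or broken-up occurrence of $from with $to.
sub replace_kerned {
    my ($content, $from, $to) = @_;
    my $pattern = kerned_pattern($from);
    my $safe    = pdf_escape($to);
    (my $result = $content) =~ s/$pattern/$safe/g;
    return $result;
}

# Hypothetical wrapper: run replace_kerned over every page of a PDF.
# CAM::PDF is loaded here so the helpers above work without it.
sub replace_in_pdf {
    my ($infile, $outfile, $from, $to) = @_;
    require CAM::PDF;
    my $doc = CAM::PDF->new($infile) or die "Cannot open $infile\n";
    die "This PDF forbids modification\n" unless $doc->canModify();
    for my $page (1 .. $doc->numPages()) {
        my $content = $doc->getPageContent($page);
        $doc->setPageContent($page, replace_kerned($content, $from, $to));
    }
    $doc->cleanoutput($outfile);
    return;
}

# With four arguments, process a file; otherwise show a small demo.
if (@ARGV == 4) {
    replace_in_pdf(@ARGV);
}
else {
    # prints: [(Inprise)-241(C)] TJ
    print replace_kerned(q{[(Borl)1(and)-241(C)] TJ}, 'Borland', 'Inprise'), "\n";
}
```

Note that the replacement throws away the spacing adjustments inside the matched word and keeps whatever follows it, so the new word is laid out with spacing meant for the old one. That may be part of why replaced text ends up cut off or running into its neighbors.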

/nix | Dec 13, 2009

