$ perl -MCPAN -e shell

If this is the first time you've run CPAN, it will ask you a series of questions - the default answers worked fine for me. When the cpan> prompt appears, install the CAM::PDF module:

cpan> install CAM::PDF

Now let's see if our PDF allows modification:

$ pdfinfo.pl pcasm-book.pdf
File:         pcasm-book.pdf
File Size:    1071411 bytes
Pages:        195
Author:       Paul A. Carter
CreationDate: D:20050320210800
Creator:      LaTeX with hyperref package
Keywords:     80x86 assembly programming
Producer:     pdfTeX-1.10b
Subject:      80x86 Assembly Language Programming
Title:        PC Assembly Language
Page Size:    variable
Optimized:    no
PDF version:  1.4
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes

As it does, let's batch replace the word "Borland" with the word "Inprise" and name the new file output.pdf:

$ changepagestring.pl -o pcasm-book.pdf Borland Inprise output.pdf

That seems to have worked, but there are still instances of "Borland" in the file - why were they not changed? The following script by Adam314 will output the entire file, including the hidden PDF formatting codes:

#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;

my $infile = '/path/pcasm-book.pdf';

# open file
my $doc = CAM::PDF->new($infile) || die "$CAM::PDF::errstr\n";

# print the raw content of every page
for my $page (1..$doc->numPages) {
    my $content = $doc->getPageContent($page);
    print $content;
}

Sure enough, the plain string "Borland" only shows up twice. Where are all the others? Why, broken up by hideous formatting codes, as in these examples:

Borl)1(and)1('s
Borlan)1(d's)-2
Borlan)1(d)-497
Borl)1(and)-241

In his link above, Adam314 offers advice for replacing instances like these with regex. At this point I grew rather weary, especially as text replacements were wont to cut off or run into other words. Still, for simple text replacements in simple PDF documents, changepagestring.pl may come in handy.

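For the curious, here is a rough sketch of what such a regex approach might look like. This is my own guess at the idea, not Adam314's code: the helper names kerned_pattern and replace_kerned are made up for illustration, and I am assuming the breaks always have the shape seen above, a closing parenthesis, a number, then an opening parenthesis. It only works on strings, so it runs without a PDF at hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): build a regex that matches
# $word even when codes such as ")1(" or ")-2(" split it into pieces.
sub kerned_pattern {
    my ($word) = @_;
    # Optional break between two letters: close paren, number, open paren.
    my $gap  = qr/(?:\)\s*-?\d+(?:\.\d+)?\s*\()?/;
    my $body = join $gap, map { quotemeta($_) } split //, $word;
    return qr/$body/;
}

# Hypothetical helper: replace every (possibly split) $from with $to.
# Any spacing codes inside the matched word are thrown away.
sub replace_kerned {
    my ($content, $from, $to) = @_;
    my $re = kerned_pattern($from);
    $content =~ s/$re/$to/g;
    return $content;
}

# The fragments quoted above, as they appear in the page content.
for my $sample ("Borl)1(and)1('s", "Borlan)1(d's)-2",
                "Borlan)1(d)-497", "Borl)1(and)-241") {
    print replace_kerned($sample, 'Borland', 'Inprise'), "\n";
}
```

Run on the four fragments, this prints Inprise)1('s, Inprise's)-2, Inprise)-497 and Inprise)-241, one per line. To apply it to a real file, the string returned by getPageContent could presumably be passed through replace_kerned and written back with CAM::PDF's setPageContent before saving. Note that it does nothing about the layout problem: the replacement has a different width and loses the spacing codes inside the word, so text may still cut off or run into its neighbours. A replacement containing parentheses or backslashes would also need escaping first.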
/nix | Dec 13, 2009