May 2004 · PHP Architect · www.phparch.com
FEATURE
In the Belly of the Beast
Interpreting and Manipulating PDF Files
by Marco Tabini
In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.
Welcome to part two of our little trip into PDF. While last month we focused primarily on understanding what the structure of a PDF document is, this time around we'll look at the problem of altering the contents of a PDF file from a more practical perspective.
The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour, because the format does not follow a top-down logic. In other words, when reading a PDF file one doesn't start at the beginning and move towards the end; if anything, the opposite is true.
Since we'll often find ourselves jumping to various (and completely arbitrary) positions in the document, the first decision that we need to make is how we're going to access the data. While it is tempting to load the entire file into memory, that is not such a good idea: if you consider that a PDF can be very large, by loading it in memory we expose ourselves to the potential of clogging up large chunks of RAM, thus limiting our server's ability to process a large number of requests.
Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that the document is stored on a remote server. In this case, you'd have to download the entire file before you could actually find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file. Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.
The one notable exception to this rule is a special class of PDF documents known as "linearized PDF files". A linearized PDF document contains a dictionary at the beginning of the file that provides facilities for determining the location of the first page without reading the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can read about it directly in the PDF specification document published by Adobe.
Getting Started
The first thing we need to do in order to be able to read the file is determine where the cross-reference table and trailer dictionary are located. Luckily, the format of the startxref pointer is fixed. For example, in my document it looks like the following:
startxref
53593
%%EOF
REQUIREMENTS PHP: 4.3.0+
OS: Any
Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note that the expression ends with a dollar sign, indicating that the resulting match must be anchored to the end of the data stream. Even though we're only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference pointer.
If you're wondering why the cross-reference table pointer is not saved to the file in a fixed-width format (for example, using 10 digits for the offset like the cross-reference entries themselves), you're not alone. This decision is a bit of a mystery, but it's something that we have to live with.
By the way, throughout the remainder of the article, you'll notice that I have created an individual include file for each listing. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you'll forgive me and that, if you decide to reuse this code, you will not follow the same layout.
Reading the Cross-reference Table
Next, we need to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we'll find the following:
xref
0 22
0000000000 65535 f
0000005632 00000 n
0000006483 00000 n
0000006509 00000 n
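The word xref is followed by the number of the first object represented in the table (zero) and by the number of entries that follow (twenty-two); we'll call this the header. Then come the entries themselves: for each line, we have the offset at which the object can be found (10 characters), followed by the generation number (5 characters) and by a flag: n for objects that are in use or f for objects that are free.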
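The fixed-width layout described above can be picked apart with nothing more than substr(). The following is a minimal sketch rather than part of the article's listings; the function name parse_xref_entry() and the shape of the returned array are assumptions made for illustration only.

```php
<?php
// Hypothetical helper (not one of the article's listings): splits a single
// cross-reference entry into its three fixed-width fields.
function parse_xref_entry ($line) {
    // Drop the end-of-line marker and any trailing space
    $line = rtrim ($line, "\r\n ");

    return array (
        'offset'     => (int) substr ($line, 0, 10),   // characters 0-9
        'generation' => (int) substr ($line, 11, 5),   // characters 11-15
        'in_use'     => substr ($line, 17, 1) === 'n', // n = in use, f = free
    );
}

$entry = parse_xref_entry ("0000005632 00000 n\r\n");
echo $entry['offset'], ' ', $entry['generation'], ' ',
     $entry['in_use'] ? 'in use' : 'free', "\n";
```

Because every field sits at a known position, no regular expression is needed here; the same idea is what lets the article's reader rely on plain string functions.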
There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it. However, you should keep in mind that PDF files do not necessarily use the line-ending convention of the platform your script is running on: a line can end with a carriage return, a line feed, or both. Therefore, you must instruct the PHP interpreter to recognize all of them, regardless of the platform. This can be accomplished by turning on the auto_detect_line_endings INI directive. We do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it back to its original value. This is important because it is possible that other portions of our script depend on the original setting, and we don't want to change it in more places than the one we need it in.
Another gotcha when reading the cross-reference table is that there may be more than one block of entries; that is, once you've read out all the entries, you could find another header followed by a new set of entries. If we didn't check for this possibility and simply assumed that there is only one block, our code would be unable to read most documents that have been modified after their creation, since that's the situation in which partial cross-reference tables are most likely to be found.
As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple, thanks to the fact that the cross-reference table is formatted using a very stylized layout, so that we can take advantage of the string functions provided by PHP.
Listing 1 (findxref.php)
<?php
/**
 * Returns the offset of the most recent
 * cross-reference table in the file
 */
function pdf_find_xref ($f) {
    // First, seek to the end of the file,
    // allowing for 50 bytes just so that
    // we have enough data to look into
    fseek ($f, -50, SEEK_END);

    // Next, try to find the proper sequence
    // of data. Note that the information can be
    // separated by a Windows-style, Mac-style
    // or unix-style newline
    $data = fread ($f, 50);

    if (!preg_match ('/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/',
                     $data, $matches)) {
        die ("Unable to find pointer to xref table");
    }

    // If we get here, then we have the offset
    // where the most recently introduced xref
    // table is
    return (int) $matches[1];
}
?>

The only aspect of pdf_read_xref() that we have not explored is the little segment of code, towards the end of the function (lines 84 to 100 of the original listing), that handles the trailer keyword. This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is read aside for a moment, all you need to know is that the trailer dictionary ends up in an associative array.
If you remember from last month's article, files that have been modified usually contain more than one cross-reference table. This is indicated by the presence of a /Prev key/value pair in the trailer, with a pointer to the previous table. When it finds one, the function simply recurses onto itself until all the cross-reference tables have been read. The information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones: by checking whether an entry already exists in the first case, and by merging the trailer arrays in a particular order in the second.
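The "particular order" matters because array_merge() lets later arguments overwrite earlier ones when string keys collide. The snippet below is only an illustration of that rule with made-up trailer values; the variable names and numbers are assumptions, not data from the sample file.

```php
<?php
// Made-up trailer data, used only to illustrate the merge order
$older = array ('/Size' => 22, '/Root' => 'root from the original file');
$newer = array ('/Size' => 25, '/Prev' => 53593);

// The newer trailer goes last, so its entries win on key collisions,
// while entries that only exist in the older trailer are preserved
$merged = array_merge ($older, $newer);

print_r ($merged);
```

Swapping the two arguments would silently let stale values from the older trailer replace the current ones, which is exactly what the article warns against.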
Writing a PDF Lexer
Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it's time to try and read them. We could, in theory, write a series of ad-hoc functions that try to interpret the data directly, but things are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).
Listing 2 (readxref.php)
<?php
/**
 * Reads a cross-reference table
 *
 * If $start and $end are set to null, the function
 * seeks to $offset and reads the table header from
 * there; otherwise, it keeps reading from the
 * current position in the file
 *
 * If more than one part of the xref table is present,
 * the function will recurse onto itself as many times
 * as needed
 */
function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null) {
    // If we didn't get a start and end, we need
    // to get them from the document itself
    if (is_null ($start) || is_null ($end)) {
        // Move to the start of the table
        fseek ($f, $offset);

        // Make sure that PHP keeps track of
        // the line endings properly
        $old_ini = ini_get ('auto_detect_line_endings');
        ini_set ('auto_detect_line_endings', true);

        // Get a line of text from the file
        $data = trim (fgets ($f));

        // Make sure the xref marker is where we
        // expect it
        if ($data !== 'xref') {
            die ("Unable to find xref table");
        }

        // Now get the next line and split it
        // across a single space character
        $data = explode (' ', trim (fgets ($f)));

        // Make sure the format is what we expected
        if (count ($data) != 2) {
            die ("Unexpected header in xref table");
        }

        // Calculate the start and end object
        // in the xref table
        $start = $data[0];
        $end = $start + $data[1];
    }

    if (!isset ($result['xref_location'])) {
        $result['xref_location'] = $offset;
    }

    if (!isset ($result['max_object']) || $end > $result['max_object']) {
        $result['max_object'] = $end;
    }

    // Now cycle through each object pointer
    for (; $start < $end; $start++) {
        // Get a line of text from the file and
        // pull the information out of there
        $data = trim (fgets ($f));
        $offset = substr ($data, 0, 10);
        $generation = substr ($data, 11, 5);

        // Never overwrite an entry that comes
        // from a newer table
        if (!isset ($result['xref'][$start][(int) $generation])) {
            $result['xref'][$start][(int) $generation] = (int) $offset;
        }
    }

    // Get the next line, which could either be the beginning
    // of the trailer dictionary or the header of another
    // xref section
    $data = trim (fgets ($f));

    if ($data === 'trailer') {
        // Read the trailer dictionary
        $c = new pdf_context ($f);
        $trailer = pdf_read_value ($c);
        // ...

        // Check whether there is a /Prev
        // entry, which indicates that there
        // is an older xref table to read
        if (isset ($trailer['/Prev'])) {
            pdf_read_xref ($f, $result, $trailer['/Prev']);
            $result['trailer'] = array_merge ($result['trailer'], $trailer);
        } else {
            $result['trailer'] = $trailer;
        }
    } else {
        // We have another xref segment
        // to read. Extract the start
        // and length, and recurse into
        // this function
        $data = explode (' ', $data);
        pdf_read_xref ($f, $result, null, $data[0], $data[0] + $data[1]);
    }

    // Restore the original INI setting
    if (isset ($old_ini)) {
        ini_set ('auto_detect_line_endings', $old_ini);
    }
}
?>
Our lexer will take the input from the PDF file and split it in individual tokens according to a particular set of rules. If, for example, we were reducing the contents of this article to a series of words (each word being a token), we would establish that a token is either a set of characters or a punctuation mark, assuming that whitespace is of no importance to us.
Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out for a few pitfalls. Normally, tokens are separated by whitespace, which has no semantic value (meaning that it is used only for the purpose of delimiting tokens and has no other purpose). Whitespace is composed of space characters, tabs, carriage returns and line feeds (the specification also counts the null and form-feed characters).
This would be enough to cover most situations, but in some cases you'll find that tokens are not always separated by whitespace. Some PDF producers (including some of Adobe's own) "optimize" a PDF file by removing whitespace characters where the distinction between tokens is clear without them. For example, consider the following snippet of PDF code that shows the beginning of a dictionary:
<< /Entry (Value) >>
The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of characters that belong to different classes: outside of a literal string, a forward slash can only indicate the beginning of a name, so it necessarily closes the token formed by the second open angular bracket, whatever the next character is. Therefore, the snippet above could be rewritten as follows:
<</Entry (Value)>>
Clearly, whitespace isn't enough to delimit a token: we also need to take into account the character classes that can be used for the same purpose. Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is.
This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of. The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:
+ Create a memory-based buffer for the file's contents
+ Keep track of the current pointer in the file and of the length of the buffer
+ Maintain a stack of tokens that have been read from the file but not yet used
The necessity of creating a buffer here arises from the fact that it would be very inefficient to read a single character at a time out of the file. By reading a fixed amount of data and keeping it in memory, we can save ourselves a few expensive I/O operations. The stack, on the other hand, is needed by the portion of the system that is responsible for interpreting the meaning of the tokens (more about that later).
Note that there is no compelling reason to store this information in a class, other than the convenience of having a compact PHP syntax to work with. You could avoid OOP altogether, although, in my opinion, that would make it easier to introduce bugs that would be tough to find and fix.
Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple way. First, it strips away any whitespace found at the current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at its first character. The way we find the end of the token varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string and dictionary delimiters we need to look at the next character, since they both share the same initial open angular bracket. For everything else, we simply scan the file until we end up in a different character class.
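The character-class rules just described can be tried out on a plain string, without any file buffering. What follows is a simplified sketch under that assumption, not the article's tokenizer: the name sketch_tokenize() is hypothetical, and the function ignores everything the real lexer has to worry about (buffer refills, the token stack, end-of-file handling).

```php
<?php
// Simplified, string-based sketch of the tokenizing rules (hypothetical;
// the article's lexer works on a buffered file instead)
function sketch_tokenize ($data) {
    $tokens = array ();
    $len = strlen ($data);
    $pos = 0;

    while ($pos < $len) {
        // Skip any whitespace
        $pos += strspn ($data, " \t\r\n", $pos);
        if ($pos >= $len) {
            break;
        }

        $char = $data[$pos];

        if (strpos ('[]()', $char) !== false) {
            // Array or literal string delimiter: one character is enough
            $tokens[] = $char;
            $pos++;
        } elseif ($char === '<' || $char === '>') {
            // One bracket (hex string) or two (dictionary)
            if ($pos + 1 < $len && $data[$pos + 1] === $char) {
                $tokens[] = $char . $char;
                $pos += 2;
            } else {
                $tokens[] = $char;
                $pos++;
            }
        } else {
            // Anything else runs until whitespace or a delimiter
            $run = strcspn ($data, " \t\r\n[]<>()/", $pos + 1);
            $tokens[] = substr ($data, $pos, $run + 1);
            $pos += $run + 1;
        }
    }

    return $tokens;
}

print_r (sketch_tokenize ('<</Entry (Value)>>'));
```

Feeding the function either form of the dictionary snippet shown earlier, with or without the optional whitespace, should yield the same list of tokens, which is the whole point of delimiting by character class.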
Parsing the Data
Next in the list, we need to be able to understand the meaning of each token in the context of the PDF file, which calls for another well-known construct: the parser.
Parsers can be very complicated, and are usually not coded by hand: in most cases, a developer would use a tool that generates the parser, which amounts to a relatively complex finite-state machine, particularly for full-fledged languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines worth of PHP.
Before introducing another listing, however, let's consider the types of data that we need to deal with. For most values, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.
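Listing 3 (tokenizer.php)
<?php
/**
 * This class is used to
 * read data from the input
 * file in a bufferized way
 */
class pdf_context {
    var $file;
    var $buffer;
    var $offset;
    var $length;
    var $stack;

    function pdf_context ($f) {
        $this->file = $f;
        $this->reset ();
    }

    // Optionally move the file
    // pointer to a new location
    // and reset the buffered data
    function reset ($pos = null) {
        if (!is_null ($pos)) {
            fseek ($this->file, $pos);
        }

        $this->buffer = fread ($this->file, 100);
        $this->offset = 0;
        $this->length = strlen ($this->buffer);
        $this->stack = array ();
    }

    // Make sure that there is at least one
    // character beyond the current offset in
    // the buffer, so that the tokenizer never
    // tries to access data that does
    // not exist
    function ensure_content () {
        if ($this->offset >= $this->length - 1) {
            return $this->increase_length ();
        } else {
            return true;
        }
    }

    // Forcefully read more data into the buffer
    function increase_length () {
        if (feof ($this->file)) {
            return false;
        } else {
            $this->buffer .= fread ($this->file, 100);
            $this->length = strlen ($this->buffer);
            return true;
        }
    }
}

/**
 * Reads a token from the file
 */
function pdf_read_token (&$c) {
    // If there is a token available
    // on the stack, pop it out and
    // return it
    if (count ($c->stack)) {
        return array_pop ($c->stack);
    }

    // Strip away any whitespace
    do {
        if (!$c->ensure_content ()) {
            return false;
        }
        $c->offset += strspn ($c->buffer, " \n\r", $c->offset);
    } while ($c->offset >= $c->length - 1);

    // Get the first character in the stream
    $char = $c->buffer[$c->offset++];

    switch ($char) {
        case '[':
        case ']':
        case '(':
        case ')':
            // This is either an array or literal string
            // delimiter. Return it
            return $char;

        case '<':
        case '>':
            // This could either be a hex string or a
            // dictionary delimiter. Determine the
            // appropriate case and return the token
            if ($c->buffer[$c->offset] == $char) {
                if (!$c->ensure_content ()) {
                    return false;
                }
                $c->offset++;
                return $char . $char;
            } else {
                return $char;
            }

        default:
            // This is "another" type of token (probably
            // a dictionary entry or a numeric value).
            // Find the end and return it
            if (!$c->ensure_content ()) {
                return false;
            }

            while (1) {
                // Determine the length of the token
                $pos = strcspn ($c->buffer, " []<>()\r\n/", $c->offset);

                if ($c->offset + $pos <= $c->length - 1) {
                    break;
                } else {
                    // The token may continue past the
                    // end of the buffer: read more data
                    $c->increase_length ();
                }
            }

            $result = substr ($c->buffer, $c->offset - 1, $pos + 1);
            $c->offset += $pos;
            return $result;
    }
}
?>

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token, because a closed parenthesis could be escaped by a backslash and, therefore, its presence alone does not mark the end of the string. In a traditional lexer, this problem is taken care of by switching the machine to a different state.
We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that handles the open parenthesis, together with some additional code that looks for a parenthesis not preceded by an odd number of backslashes. Why not simply look for one preceded by no backslash, and why does an even number matter? Because the backslashes themselves can be escaped. Therefore, an even number of backslashes means that they are all escaped and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. The parenthesis becomes an "orphan" and ends the string.
Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality directly in the parser: when an open parenthesis token is returned by the tokenizer, the code simply keeps scanning the input file until it finds an unescaped closed parenthesis.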
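The even/odd rule is easy to get wrong, so here is a minimal sketch of it in isolation. The function name find_string_end() is hypothetical and this is not the code used in the listings; like the article's approach, it also ignores the fact that PDF allows balanced, unescaped parentheses inside a literal string.

```php
<?php
// Hypothetical helper: returns the position of the first closed
// parenthesis that is not escaped, or false if there is none.
function find_string_end ($buffer, $start) {
    $pos = $start;

    while (($pos = strpos ($buffer, ')', $pos)) !== false) {
        // Count the backslashes immediately before the parenthesis
        $slashes = 0;
        for ($i = $pos - 1; $i >= $start && $buffer[$i] === '\\'; $i--) {
            $slashes++;
        }

        // An even count (including zero) means that the backslashes
        // escape each other, so the parenthesis is a real delimiter
        if ($slashes % 2 === 0) {
            return $pos;
        }

        // Odd count: the parenthesis is escaped, keep looking
        $pos++;
    }

    return false;
}

// The first parenthesis is escaped, the second one ends the string
var_dump (find_string_end ('a\\)b) rest', 0));
```

Counting backwards from each candidate keeps the function stateless, at the cost of re-reading a few characters; a state-machine lexer would track the escape state as it goes instead.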
The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up of three tokens. Therefore, when our parser encounters a numeric token, it is not able to tell whether it is part of a larger element until it has read at least one more token, and potentially two.
The problem here is not with reading the tokens: it's with deciding what to do with them if the numeric value turns out to be just a numeric value. We could, in theory, put the extra tokens "back in the buffer" by rolling back the offset pointer in the buffer, but this would be difficult to do, since we don't really know how many whitespace characters we would have to deal with.
Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the pdf_context class. Whenever it is called, pdf_read_token() checks whether anything is present on the stack and returns it, without even reading one character from the file buffer.
You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number of constants that look like data types, and they are. Since we'll be reading and writing objects, we need to keep track of the object types as we read them from the stream. To do so, each value is stored in an array whose zeroth element indicates the type, while element 1 contains the actual value, which varies depending on the nature of the data. Thus, for example, the trailer dictionary could look like this:
Array (
    PDF_TYPE_DICTIONARY,
    Array (
        '/Size' => array (PDF_TYPE_NUMERIC, ...),
        '/Root' => array (PDF_TYPE_OBJREF, 3, ...),
        '/Prev' => array (PDF_TYPE_NUMERIC, 54655)
    )
)
Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is: the code is quite simple. The fact that each value is actually stored in an array whose zeroth element indicates its type makes determining the data type of a value practically immediate, which will come in handy when we need to write objects back to the file.
Before moving on to the next step, note that we make no provision in our lexer for reading stream data. This is because we are not intent on interpreting everything in a PDF file, but only those elements that allow us to modify its contents. However, adding support for streams shouldn't be too much of a problem once we are able to resolve indirect references, which we'll add shortly, since the length of a stream is often expressed in that way.
Listing 4 (readvalue.php)
<?php
// Define various data types
// that we use throughout the system
define ('PDF_TYPE_NULL', 0);
define ('PDF_TYPE_NUMERIC', 1);
define ('PDF_TYPE_TOKEN', 2);
define ('PDF_TYPE_HEX', 3);
define ('PDF_TYPE_STRING', 4);
define ('PDF_TYPE_DICTIONARY', 5);
define ('PDF_TYPE_ARRAY', 6);
define ('PDF_TYPE_OBJDEC', 7);
define ('PDF_TYPE_OBJREF', 8);
define ('PDF_TYPE_OBJECT', 9);
define ('PDF_TYPE_STREAM', 10);

/**
 * Reads a value from the current
 * data stream
 */
function pdf_read_value (&$c, $token = null) {
    // Get a token from the stream
    if (is_null ($token)) {
        $token = pdf_read_token ($c);
    }

    if ($token === false) {
        return false;
    }

    switch ($token) {
        case '<':
            // This is a hex string.
            // Read the value, then the terminator
            $s = pdf_read_token ($c);
            if ($s === false) {
                return false;
            }

            $term = pdf_read_token ($c);
            if ($term !== '>') {
                die ("Unexpected data after hex string");
            }

            return array (PDF_TYPE_HEX, $s);

        case '<<':
            // This is a dictionary
            $result = array ();

            // Recurse into this function until we reach
            // the end of the dictionary
            while (($key = pdf_read_token ($c)) !== '>>') {
                if ($key === false) {
                    return false;
                }
                if (($value = pdf_read_value ($c)) === false) {
                    return false;
                }
                $result[$key] = $value;
            }

            return array (PDF_TYPE_DICTIONARY, $result);

        case '[':
            // This is an array
            $result = array ();

            // Recurse into this function until we reach
            // the end of the array
            while (($token = pdf_read_token ($c)) !== ']') {
                if ($token === false) {
                    return false;
                }
                if (($value = pdf_read_value ($c, $token)) === false) {
                    return false;
                }
                $result[] = $value;
            }

            return array (PDF_TYPE_ARRAY, $result);

        case '(':
            // This is a string
            $pos = $c->offset;

            while (1) {
                // Start by finding the next closed
                // parenthesis
                $found = strpos ($c->buffer, ')', $pos);

                // If you can't find it, try
                // reading more data from the stream
                if ($found === false) {
                    if (!$c->increase_length ()) {
                        return false;
                    }
                    continue;
                }
                $pos = $found;

                // Make sure that there is no backslash before
                // the parenthesis. If there is,
                // move on. Otherwise, return the string
                if ($c->buffer[$pos - 1] !== '\\') {
                    $result = substr ($c->buffer, $c->offset, $pos - $c->offset);
                    $c->offset = $pos + 1;
                    return array (PDF_TYPE_STRING, $result);
                } else {
                    $pos++;
                    if ($pos > $c->offset + $c->length) {
                        $c->increase_length ();
                    }
                }
            }

        default:
            if (is_numeric ($token)) {
                // A numeric token. Make sure that it is not
                // part of something else
                if (($tok2 = pdf_read_token ($c)) !== false) {
                    if (is_numeric ($tok2)) {
                        // Two numeric tokens in a row. In
                        // this case, we're probably in
                        // front of either an object reference
                        // or an object specification.
                        // Determine the case and return the data
                        if (($tok3 = pdf_read_token ($c)) !== false) {
                            switch ($tok3) {
                                case 'obj':
                                    return array (PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);
                                case 'R':
                                    return array (PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
                            }

                            // If we get to this point, that numeric value up
                            // there was just a numeric value. Put the extra
                            // tokens back into the stack and return the value
                            array_push ($c->stack, $tok3);
                        }
                    }
                    array_push ($c->stack, $tok2);
                }

                return array (PDF_TYPE_NUMERIC, $token);
            } else {
                // Just a token. Return it
                return array (PDF_TYPE_TOKEN, $token);
            }
    }
}
?>

Getting to the Root of the Problem
All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.
Listing 5 (index.php) is our main script and, unfortunately, it's too large to show here; you will, however, hopefully be able to follow me there.
Once we have declared a few variables that we'll end up using throughout the script, we read the cross-reference table and the trailer, and use the latter to retrieve the Root object. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won't help us much. This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6; it is used to determine whether any object is an indirect reference and, if so, to resolve it, something that will come in handy at pretty much every step of the way.
As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn't, there is nothing to do, other than returning right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine its position and reads it from the file. The $encapsulate parameter determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and generation number, and the object's data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object's ID is discarded. Both kinds of return values have their uses: if you want to retrieve an object in order to modify it, you will want it encapsulated, so that you can later rewrite it back to the stream. If, on the other hand, you're just trying to retrieve a value, as you would, for example, if you were resolving the /Length entry of a stream to determine its length, the non-encapsulated version will do.
Even though my code doesn't read streams, should you decide to add that functionality, the pdf_resolve_object() function is safe to use for the purpose, because it saves the position of the file pointer before reading the object and restores it afterwards. If the function didn't do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the stream.
Let's go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php).
The reason why we have a separate function just to read through the /Pages element of the root object is that the pages could be nested in an arbitrary combination of /Page and /Pages objects, so that we may have to recurse into the function several times in order to end up with a complete list. It's important to understand that the order in which the pages are resolved by using this method doesn't necessarily correspond to the way they appear to the user; that is, the first page in the list is not necessarily presented as page one. The PDF specification provides a different set of facilities for describing how pages are presented to the user but, generally speaking, you should only be interested in them if you need that level of detail. In practical terms, I have never found an occasion in which the logical and physical page order didn't coincide; at most, the page numbers may differ because the document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.
In our sample script, we only take into consideration page 1 (which is the zeroth element of the resulting pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page. We need a dedicated function because, as you may remember, a page can inherit its resources, so that if there isn't a resources dictionary associated with the page itself, there may be one associated with its parent, and so on. In the sample file that we were looking at last month, the same resources are associated with every page and with their parent (the /Pages dictionary). This is not much of an optimization, but it is perfectly acceptable (for the record, the PDF was created on Linux by exporting an OpenOffice.org 1.1 file).
Next, we need to find the font resources, so that we can append our own to the existing ones. Finding the font dictionary is simpler than any of its predecessors so far, since it's either there, in which case we piggyback on it, or it isn't, in which case we create a new one and add it to the resources associated with the page. The only difficulty here is in finding a name for the font resource that doesn't conflict with one that already exists. The approach that I have taken is to simply run through all the resources available and look at those called /Fx, where x is a numerical value. The font resource we create is the next highest available: for example, if the highest name currently used is /F10, ours will be /F11. Note that this choice is arbitrary: you can use any combination you like, as long as it starts with a letter and not a digit.
The font resource that we create and add to the font dictionary is the simplest possible one: it uses the Helvetica font, which must be supported by every PDF reader and, therefore, doesn't need to be embedded in the document itself.
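The naming strategy for the new font resource can be sketched in a few lines. This is an illustration only: the function name next_font_name() is hypothetical, and it assumes that the font dictionary has already been reduced to an array keyed by resource name, which is not necessarily how the main script (Listing 5) goes about it.

```php
<?php
// Hypothetical helper: given the existing font resources, keyed by
// name, returns the next free name of the form /Fx
function next_font_name ($fonts) {
    $max = 0;

    foreach (array_keys ($fonts) as $name) {
        // Only consider names made of /F followed by digits
        if (preg_match ('/^\/F(\d+)$/', $name, $m)) {
            $max = max ($max, (int) $m[1]);
        }
    }

    return '/F' . ($max + 1);
}

// The values are irrelevant here; only the keys are inspected
$fonts = array ('/F1' => true, '/F10' => true, '/TT2' => true);
echo next_font_name ($fonts), "\n";
```

Names that don't follow the /Fx pattern are simply ignored, so they can never collide with the one we generate.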
Listing 6
<?php
/**
 * Resolves an object reference,
 * ensuring that the result value
 * is always a direct object
 */
function pdf_resolve_object (&$c, $obj_spec, $encapsulate = true) {
    global $xref_data;

    // Exit if we get invalid data
    if (!is_array ($obj_spec)) {
        return false;
    }

    if ($obj_spec[0] == PDF_TYPE_OBJREF) {
        // This is a reference, resolve it
        if (isset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]])) {
            // Save current file position.
            // This is needed if you want to resolve
            // references while you're reading another object
            // (e.g.: if you need to determine the length
            // of a stream)
            $old_pos = ftell ($c->file);

            // Reposition the file pointer and
            // load the object header
            $c->reset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]]);
            $header = pdf_read_value ($c);

            if ($header[0] != PDF_TYPE_OBJDEC || $header[1] != $obj_spec[1] ||
                $header[2] != $obj_spec[2]) {
                die ("Unable to find object ({$obj_spec[1]}, {$obj_spec[2]}) at expected location");
            }

            // If we're being asked to store all the information
            // about the object, we add the object ID and generation
            // number for later use
            if ($encapsulate) {
                $result = array (
                    PDF_TYPE_OBJECT,
                    'obj' => $obj_spec[1],
                    'gen' => $obj_spec[2]
                );
            } else {
                $result = array ();
            }

            // Now simply read the object data until
            // we encounter an end-of-object marker
            while (1) {
                $value = pdf_read_value ($c);
                if ($value === false) {
                    return false;
                }

                if ($value[0] == PDF_TYPE_TOKEN && $value[1] === 'endobj') {
                    break;
                }

                $result[] = $value;
            }

            $c->reset ($old_pos);
            return $result;
        }
    } else {
        return $obj_spec;
    }
}

/**
 * Generates a new object container
 * with the proper object ID and
 * a generation number of zero
 */
function pdf_new_object () {
    global $xref_data;

    return array (
        PDF_TYPE_OBJECT,
        'obj' => $xref_data['max_object']++,
        'gen' => 0
    );
}
?>

Listing 7 (readpages.php)
<?php
// Creates a list of all the pages
// that are present in a document
function pdf_read_pages (&$c, &$pages, &$result) {
    // Get the kids dictionary
    $kids = pdf_resolve_object ($c, $pages[1][1]['/Kids']);

    foreach ($kids[1] as $v) {
        $pg = pdf_resolve_object ($c, $v);

        if ($pg[1][1]['/Type'][1] === '/Pages') {
            // If one of the kids is an embedded
            // /Pages array, resolve it as well
            pdf_read_pages ($c, $pg, $result);
        } else {
            $result[] = $pg;
        }
    }
}
?>

Listing 8 (page.php)
<?php
/**
 * Finds the resources associated with a page
 */
function pdf_find_resources (&$c, $obj) {
    $obj = pdf_resolve_object ($c, $obj);

    // If the current object has a resources
    // dictionary associated with it, we use
    // it. Otherwise, we move back to its
    // parent object
    if (isset ($obj[1][1]['/Resources'])) {
        return pdf_resolve_object ($c, $obj[1][1]['/Resources']);
    } else {
        if (!isset ($obj[1][1]['/Parent'])) {
            return false;
        } else {
            return pdf_find_resources ($c, $obj[1][1]['/Parent']);
        }
    }
}
?>

Graffiti on the Wall
We've now come to the part where we actually need to "write" some text on the document. Unfortunately, this involves a few steps.
First, the concept of drawing pretty much anything on a page requires a series of commands that PDF borrows from PostScript. In order for the reader to recognize them, we'll have to encapsulate them in a stream, and add that stream to the contents of the page.
When drawing text on the screen, a certain number of transformations can be applied to it: translation (so that the text appears at the location of our choice), rotation and scaling. In our case, we will only deal with the first two.
The transformations are applied using a simple matrix; unfortunately, we do not have enough space to explain how it works here, but the PDF specification document does a pretty good job of that, so I'll refer you to it. Instead, let us focus on the commands used to apply the transformation itself; here's an example:
Ma Mb Mc Md x y Tm
Looks cryptic, doesn't it? The first four elements of the command express the rotation that should be applied to the text. They can also be used to determine the scale, but, as I mentioned, that is beyond the scope of this article. The x and y parameters, on the other hand, indicate the coordinates at which we want the text to appear. Finally, Tm is the command itself, which tells the PDF interpreter to apply these values to the text transformation matrix.
As you can see, the format of this function call is the exact opposite of what we are used to in PHP (where we use function (param1, param2, ...)). This format is called "Reverse Polish Notation" and is often used by stack-based interpreters, in which an operator follows its parameters, such as the PostScript virtual machine on which the PDF specification is based.
Next, we'll select a font that will be used to draw the
text:
/F11 10 Tf
The Tf command sets the current font resource to /F11 and the font size to 10 points. The size is a floating-point value, so that you could have text in size 12.5.
Before writing the text itself, we need to set the spacing between one line of text and the next. This is not as easy to determine as you may think, because it depends on how the font itself is designed. From a practical perspective, however, we can rely on an empirical default that works in most occasions. The TL command below sets the interline to five points:
5 TL
Finally, we can actually draw the text! This is done by using a combination of two commands. The text is actually drawn using the ' command (no, that's not a mistake: the command is a quotation mark). However, if a newline character is present in the text, it is simply not rendered as a line break, so we replace each newline character with the T* command, which causes the drawing pointer to be reset to the next line.
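The font, leading and text-drawing commands described above can be put together as in the following sketch. The function names are hypothetical and the defaults simply mirror the values used in the text (/F11, size 10, leading 5). It assumes one ' command per line of text with a T* in place of each newline; keep in mind that, per the PDF specification, the ' operator itself also moves to the next line before showing its string.

```php
<?php
// Hypothetical helper: escapes the characters that are special
// inside a PDF literal string. The backslash must be handled first.
function sketch_escape_string ($s) {
    return str_replace (
        array ('\\', '(', ')'),
        array ('\\\\', '\\(', '\\)'),
        $s
    );
}

// Hypothetical helper: turns a block of text into the sequence of
// commands discussed above.
function sketch_text_commands ($text, $font = '/F11', $size = 10, $leading = 5) {
    // Select the font and set the spacing between lines
    $cmds = array ("$font $size Tf", "$leading TL");
    foreach (explode ("\n", $text) as $i => $line) {
        if ($i > 0) {
            // A newline in the source text becomes a T* command
            $cmds[] = 'T*';
        }
        // Draw the line itself with the ' command
        $cmds[] = '(' . sketch_escape_string ($line) . ") '";
    }
    return implode ("\n", $cmds);
}
?>
```

Escaping matters here because an unbalanced parenthesis in the text would otherwise terminate the string early and corrupt the rest of the stream.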
Finally, all we need to do is update the page's /Contents array with a reference to our stream. Once again, we need to determine if there already is an array and what it contains, and act accordingly, so that we can add our own data to it.
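The "act accordingly" step could look something like the sketch below. It deliberately uses a simplified, hypothetical representation that is not the parsed structure used by the library in this article: /Contents is either missing (null), a single reference stored as a string such as '12 0 R', or a plain PHP array of such strings. The PDF format allows both the single-reference and the array form.

```php
<?php
// Hypothetical helper: returns the /Contents value as an array that
// includes a reference to our new stream.
function sketch_add_content_ref ($contents, $newRef) {
    if ($contents === null) {
        // The page had no contents at all
        return array ($newRef);
    }
    if (!is_array ($contents)) {
        // A single stream reference: promote it to an array
        return array ($contents, $newRef);
    }
    // Already an array: append, so our text is drawn last
    $contents[] = $newRef;
    return $contents;
}
?>
```

Appending rather than prepending means the new stream is processed after the existing ones, so the added text ends up on top of whatever the page already draws.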
Writing it All Back
The final step before we can call it a day consists of writing our changes out, so that they can be applied to the document. To do so, we first [...] beginning of the main script (Listing 5). Next, we call the pdf_write_objects() function to rewrite the objects that we modified back to the file. If you take a look at Listing 9 (writer.php), you'll notice that this function is, essentially, the reverse of pdf_read_value(), since it first [...] appropriate value.
[Figure 1: screenshot of the output file (out.pdf) open in Adobe Acrobat]
There are two things here that are worth mentioning. First, the information that we write back to the file is not a "true" delta: the resources dictionary, for example, may not have changed at all, yet it is written out anyway. This is not very optimized, but it will do if you're only making small changes to a document, and it beats having to build a system that "remembers" what was changed.
Second, every time an object is written to the stream, pdf_write_objects() "makes a note" of the file pointer's current position. This comes in handy afterwards, when we rebuild the cross-reference table by calling pdf_write_xref(). Here, we create the proper entries one at a time. This process could be optimized by grouping those entries belonging to consecutive objects into a single subsection, but since we are dealing only with small changes it's hardly worth the trouble.
pdf_write_xref() terminates by writing the trailer dictionary, which contains a reference to the root object, which has not changed but must be there nonetheless, as well as a numeric value that declares the number of objects stored in the file and a pointer to the previous cross-reference table.
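To illustrate the two steps just described, here is a minimal sketch; it is not the code from Listing 9. The function names are hypothetical, objects are assumed to be already serialized into strings, and every object is assumed to have generation number 0. The entry layout (a 10-digit offset, a 5-digit generation number, the letter n and a two-byte line ending) follows the PDF specification.

```php
<?php
// Hypothetical helper: writes each object and records where it starts.
// $objects maps an object number to its serialized body.
function sketch_write_objects ($fp, $objects) {
    $offsets = array ();
    foreach ($objects as $num => $body) {
        // "Make a note" of the file pointer's current position
        $offsets[$num] = ftell ($fp);
        fwrite ($fp, "$num 0 obj\n$body\nendobj\n");
    }
    return $offsets;
}

// Hypothetical helper: writes the cross-reference table and trailer.
function sketch_write_xref ($fp, $offsets, $size, $root, $prev) {
    $start = ftell ($fp);
    fwrite ($fp, "xref\n");
    ksort ($offsets);
    foreach ($offsets as $num => $offset) {
        // One subsection per object: first object number, then count
        fwrite ($fp, "$num 1\n");
        // Each entry must be exactly 20 bytes long
        fwrite ($fp, sprintf ("%010d %05d n \n", $offset, 0));
    }
    // /Root is unchanged but mandatory; /Prev points to the old table
    fwrite ($fp, "trailer\n<< /Size $size /Root $root /Prev $prev >>\n");
    fwrite ($fp, "startxref\n$start\n%%EOF\n");
}
?>
```

Writing one subsection per object keeps the code short at the cost of a few extra bytes, which is the trade-off mentioned above.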
Where to Go From Here
That's it! As you can see, once one figures out how the format works, it is not that hard to read and modify the contents of a PDF file programmatically, which makes the belief, shared by many people, that PDF is a non-modifiable format quite strange.
Although the end result of our sample script is relatively simple (if you run it against the sample file that I included in this month's code package for your convenience, you should see something like the output in Figure 1), the foundation on which it is built is quite solid and can be expanded upon to provide additional functionality.
Before parting ways, I just want to share one final tidbit of information with you. Working with PDF files can be frustrating, particularly on Windows, because the Acrobat PDF viewer is about as useful for debugging as testing whether your house's electrical circuit is working by sticking your fingers in a socket. [...] First, you can actually get Acrobat to provide you with more useful error messages by pressing the Control key [...] appears when you try to load a corrupted file. Second, you can use a PDF inspection tool (available online at www.planetpdf.com/mainpage.asp?webpageid=3463) to visually inspect the contents of your file and determine what is wrong with it.
Sometimes, however, it will be hard to figure bugs out. While I was writing this article, I lost lots of time debugging a problem that turned out to be just a [...] without any useful error. This brings us to the last tool you'll need plenty of: patience!
About the Author
Marco is the Publisher of (and a frequent contributor to) php|architect. [...] he can be found trying to hack his computer into submission. You can write to him at marcot@phparch.com.
To discuss this article:
http://forums.phparch.com/145