php|architect
The Magazine For PHP Professionals
Welcome to part two of our little trip down PDF lane. While last month we focused primarily on understanding the structure of a PDF document, this time we’ll look at the problem of altering the contents of a PDF file from a more practical perspective.
The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour because the file is not only not intended for human consumption, but it also does not follow a top-down logic. In other words, as we also discovered last month, when parsing a PDF file one doesn’t start at the beginning and move down to the end of the file. In fact, the exact opposite is true.
Since we’ll often find ourselves jumping to various, completely arbitrary, positions in the document, the first decision that we need to make is how we’re going to access the data. While it is tempting to just load the entire file into memory, that’s usually not such a good idea: a PDF can be of pretty much any size, so by loading an entire document into memory we risk tying up large chunks of RAM, thus limiting our server’s ability to process a large number of requests.
Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that you’re accessing a PDF document via HTTP. In this case, you’d have to download the entire file before you could actually find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file. Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.
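As a rough illustration of that recommendation, the sketch below copies a stream that may not be seekable into a temporary one that is. The function name spool_to_local_stream() is made up for this example, and the source here is an in-memory stand-in for a remote file; php://temp keeps small documents in memory and spills larger ones to a temporary file on disk.

```php
<?php
// Sketch: copy a (possibly non-seekable) stream into a seekable temporary
// stream so that fseek() can be used freely afterwards.
// spool_to_local_stream() is a hypothetical helper name.
function spool_to_local_stream($source)
{
    $local = fopen('php://temp', 'w+b');
    if ($local === false) {
        return false;
    }

    stream_copy_to_stream($source, $local);
    rewind($local);

    return $local;
}

// Stand-in for a remote document; a real script might pass the result
// of fopen() on an http:// URL instead.
$source = fopen('php://memory', 'w+b');
fwrite($source, "%PDF-1.4\n...\nstartxref\n53593\n%%EOF\n");
rewind($source);

$pdf = spool_to_local_stream($source);

// Seeking relative to the end of the data now works
fseek($pdf, -6, SEEK_END);
$tail = fread($pdf, 6);
```

Once the copy exists, every function in this article can treat it like an ordinary local file handle.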
The one notable exception to this rule is a special class of PDF documents known as “linearized PDF files”. A linearized PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in the file without having to read through the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can find out more about it directly from the PDF specification document published by Adobe.
Getting Started
The first thing we need to do in order to be able to interpret the contents of a PDF document is to determine where the cross-reference table and trailer dictionary are. This is quite easy if you consider that the layout of the startxref pointer is fixed. For example, in my document it looks like the following:
startxref
53593
%%EOF
In the Belly of the Beast
Interpreting and Manipulating PDF Files
by Marco Tabini
REQUIREMENTS
PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf
In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.
Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note how the regex pattern ends with a dollar sign, indicating that the resulting match must be anchored to the end of the data stream. Even though we’re only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference table pointer by mistake. If you’re wondering why the cross-reference table pointer is not saved to the document using a fixed-width format (say, for example, using 10 digits for the offset like the cross-reference entries themselves), you’re not alone. This decision is a bit of a mystery, but it’s something that we have to live with.
By the way, throughout the remainder of the article you’ll notice that I have created an individual include file for each of the functions that we will be writing. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you’ll forgive me and that, if you decide to use any of the code in your own projects, you will not follow the same layout.
Reading the Cross-reference Table
Now that we know where to look for it, it’s time to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we’ll find the following:
xref
0 22
0000000000 65535 f
0000000017 00000 n
0000005632 00000 n
0000005659 00000 n
0000006483 00000 n
0000053169 00000 n
0000006509 00000 n
0000039936 00000 n
The word xref is followed by the number of the first object represented in the table (0 in this case) and the number of entries that follow (twenty-two, of which only the first few are shown here); we’ll call this the “header” of the table. Next come the entries themselves: for each line, we have the offset at which the object can be found (10 digits), followed by the generation number (5 digits) and the letter n for objects that are in use or f for objects that are free.
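To make the entry layout concrete, here is a small sketch that splits a single entry into its three fields. The helper parse_xref_entry() is hypothetical and is not part of the article’s code, which uses substr() on fixed positions instead (see Listing 2).

```php
<?php
// Sketch: split one cross-reference entry into its three fields.
// parse_xref_entry() is a hypothetical helper used only for illustration.
function parse_xref_entry($line)
{
    $line = trim($line);

    // 10-digit offset, space, 5-digit generation, space, "n" or "f"
    if (!preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
        return false;
    }

    return array(
        'offset'     => (int) $m[1],
        'generation' => (int) $m[2],
        'in_use'     => ($m[3] === 'n'),
    );
}

$entry = parse_xref_entry("0000005632 00000 n\r\n");
$free  = parse_xref_entry("0000000000 65535 f\r\n");
```

The integer casts drop the leading zeros, so the offset can be passed straight to fseek().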
There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it. However, you should keep in mind that the lines of a PDF file do not necessarily end with the newline convention of the platform your script is running on: depending on the application that produced the document, they may be terminated Windows-style, Unix-style or Mac-style. Therefore, you must instruct the PHP interpreter to recognize all of these conventions, regardless of the platform. This can be accomplished by turning on the auto_detect_line_endings INI directive (which became available as of PHP 4.3.0). We can do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it to its original value. This sequence of operations is important, because it is possible that other portions of our script may depend on the directive being in a different state than the one we need it in.
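If you would rather not depend on an INI directive at all (to my knowledge, auto_detect_line_endings has been deprecated in recent PHP releases), one alternative is to read lines by hand. The sketch below is not the approach used in the listings; read_pdf_line() is a hypothetical helper that treats a carriage return, a line feed, or the two together as a single line terminator.

```php
<?php
// Sketch: read one line terminated by CR, LF or CRLF without relying
// on the auto_detect_line_endings directive.
// read_pdf_line() is a hypothetical helper.
function read_pdf_line($f)
{
    $line = '';

    while (($char = fgetc($f)) !== false) {
        if ($char === "\n") {
            return $line;
        }
        if ($char === "\r") {
            // Swallow the LF of a CRLF pair, if there is one
            $next = fgetc($f);
            if ($next !== false && $next !== "\n") {
                fseek($f, -1, SEEK_CUR);
            }
            return $line;
        }
        $line .= $char;
    }

    // End of file: return what we have, or false if nothing was read
    return ($line === '') ? false : $line;
}

$f = fopen('php://memory', 'w+b');
fwrite($f, "xref\r0 22\r\n0000000000 65535 f \ntrailer");
rewind($f);

$lines = array();
while (($line = read_pdf_line($f)) !== false) {
    $lines[] = $line;
}
```

Reading one character at a time is slower than fgets(), so this is only worth considering for short sections such as the cross-reference table.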
Another gotcha when reading the cross-reference table is that there may be more than one block of entries: once you’ve read all the entries, you could find another header followed by a new set of entries, or you could find the trailer dictionary. If we didn’t check for this possibility and simply assumed that the cross-reference table is always followed by a trailer, our code would be unable to read most documents that have been modified after their creation, since that’s the situation in which partial cross-reference tables are most likely to be found.
Listing 1
<?php

/*
 * Returns the offset of the most recent
 * cross-reference table in the file
 */

function pdf_find_xref($f)
{
    // First, seek to the end of the file,
    // allowing for 50 bytes just so that
    // we have enough data to look into

    fseek($f, -50, SEEK_END);

    // Next, try to find the proper sequence
    // of data. Note that the information can be
    // separated by a Windows-style, Mac-style
    // or Unix-style newline

    $data = fread($f, 50);

    if (!preg_match('/startxref(?:\r|\n|\r\n)(\d+)(?:\r|\n|\r\n)%%EOF(?:\r|\n|\r\n)$/', $data, $matches)) {
        die("Unable to find pointer to xref table");
    }

    // If we get here, then we have the offset
    // where the most recently introduced xref
    // table starts

    return (int) $matches[1];
}

?>

As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple. It is written to take full advantage of the fact that the cross-reference table follows a very rigid layout, so that we can rely on the fastest and most convenient string functions provided by PHP.
The only aspect of this function that we have not explored is the little segment of code, near the end, that runs once the trailer keyword has been found. This is where our code reads the trailer dictionary; as you can see, it makes use of a
few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is retrieved aside for a moment, you’ll notice that the trailer dictionary ends up in an associative array. If you remember from last month’s article, files that have been modified usually contain more than one cross-reference table; this is indicated by the presence of a /Prev key/value pair in the trailer, with a pointer to the beginning of the previous table. If this entry is present, the function simply recurses into itself until all the cross-reference tables present in the file are read. Note that the information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones: for the cross-reference entries, we check that an entry is not already set before storing it, and for the trailers, we merge the arrays in a particular order.
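The two “newest wins” rules can be sketched in isolation as follows. All the values below are made up for illustration, remember_entry() is a hypothetical helper, and the trailers are shown as plain associative arrays rather than in the typed format that the parser produces.

```php
<?php
// Sketch with made-up values: how "newest wins" can be enforced when
// cross-reference tables and trailers are read from newest to oldest.

// Cross-reference entries: object number => generation => offset
$xref = array();

function remember_entry(&$xref, $object, $generation, $offset)
{
    // An entry that is already set came from a newer table,
    // so it must not be overwritten
    if (!isset($xref[$object][$generation])) {
        $xref[$object][$generation] = $offset;
    }
}

remember_entry($xref, 5, 0, 60210);   // from the most recent table
remember_entry($xref, 5, 0, 53169);   // from an older table: ignored

// Trailers: array_merge() lets the later argument win on duplicate
// string keys, so the older trailer goes first
$older  = array('/Size' => 22, '/Root' => '12 0 R');
$newer  = array('/Size' => 25, '/Prev' => 53593);
$merged = array_merge($older, $newer);
```

Keys that only exist in the older trailer (such as /Root here) survive the merge, which is what allows an incremental update to omit them.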
Writing a PDF Lexer
Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it’s time to try and read them. We could, in theory, write a series of ad-hoc functions that try to read from the file and interpret its contents, but things are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).
Listing 2
<?php

/*
 * Reads a cross-reference table
 *
 * If $start and $end are null, the function seeks to
 * $offset and reads the table header from there;
 * otherwise, it keeps reading from the current position
 * in the file. If the xref table is made up of more than
 * one part, the function will recurse into itself as
 * many times as needed
 */

function pdf_read_xref($f, &$result, $offset, $start = null, $end = null)
{
    // If we didn't get a start and end, we need
    // to get them from the document itself

    if (is_null($start) || is_null($end)) {

        // Move to the beginning of the table

        fseek($f, $offset);

        // Save the line-ending directive, then turn it on

        $old_ini = ini_get('auto_detect_line_endings');
        ini_set('auto_detect_line_endings', '1');

        $data = trim(fgets($f));

        // The first line must contain the xref keyword

        if ($data !== 'xref') {
            die("Unable to find xref table");
        }

        // Read the header: first object and number of entries

        $data = explode(' ', trim(fgets($f)));

        if (count($data) != 2) {
            die("Unexpected header in xref table");
        }

        $start = (int) $data[0];
        $end = $start + (int) $data[1];
    }

    if (!isset($result['xref_location'])) {
        $result['xref_location'] = $offset;
    }

    if (!isset($result['max_object']) || $end > $result['max_object']) {
        $result['max_object'] = $end;
    }

    // Now cycle through each object
    // in this block of entries

    for (; $start < $end; $start++) {

        // 10-digit offset, then 5-digit generation number

        $data = trim(fgets($f));

        $offset = substr($data, 0, 10);
        $generation = substr($data, 11, 5);

        if (!isset($result['xref'][$start][(int) $generation])) {
            $result['xref'][$start][(int) $generation] = (int) $offset;
        }
    }

    // Get the next line, which could either be the beginning
    // of the trailer dictionary or the header of another
    // block of entries

    $data = trim(fgets($f));

    if ($data === 'trailer') {

        // Trailer handling: read the trailer dictionary

        $c = new pdf_context($f);
        $trailer = pdf_read_value($c);

        // pdf_read_value() returns array(type, value), so the
        // dictionary entries are in element 1. If there is an
        // older table, read it first, then merge the trailers
        // so that the newer values win

        if (isset($trailer[1]['/Prev'])) {
            pdf_read_xref($f, $result, (int) $trailer[1]['/Prev'][1]);
            $result['trailer'][1] = array_merge($result['trailer'][1], $trailer[1]);
        } else {
            $result['trailer'] = $trailer;
        }

    } else {

        // This is the header of another block of entries

        $data = explode(' ', $data);
        pdf_read_xref($f, $result, null, (int) $data[0], (int) $data[0] + (int) $data[1]);

    }

    // Restore the directive if this call changed it. The restore
    // step did not survive in the printed listing; its placement
    // here is an assumption

    if (isset($old_ini)) {
        ini_set('auto_detect_line_endings', $old_ini);
    }
}

?>
Our lexer will take the input from the PDF file and split it into individual tokens according to a particular set of rules. For example, if we were writing a lexer for reducing the contents of this article to a series of words (with every grammatical element representing a token), we would establish that a token is either a set of characters or a punctuation mark, assuming that whitespace and paragraph markers are of no importance to us.
Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out for a few potential pitfalls. First the basics: the simplest form of delimiter is whitespace, which has no semantic value (meaning that it is used only for the purpose of delimiting tokens and has no other purpose). Whitespace is composed of space characters, carriage returns and line feeds.
This would be enough to cover most situations, but in some cases you’ll find that tokens are not always delimited by whitespace. When some applications (including some of Adobe’s own) “optimize” a PDF file to reduce its size as much as possible, they remove whitespace characters where the distinction between two tokens is made obvious in another way. For example, consider the following snippet of PDF code that shows the beginning of a dictionary:
<< /Entry (Value) >>
The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of two completely different classes of characters. Since << can only appear outside of a literal string to indicate the beginning of a dictionary, the lexer should stop at the second open angle bracket and delimit a token before the next character, whatever that is. Therefore, the snippet above could be rewritten as follows:
<</Entry (Value)>>
Clearly, whitespace isn’t enough to delimit a token: we must also keep in mind all the other possible character classes that can be used for the same purpose. Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is.
This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of. The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:
• Create a memory-based buffer for the file’s contents
• Keep track of the current pointer in the file and of the length of the buffer
• Maintain a stack of tokens that have been read from the file but not yet used
The necessity of creating a buffer here arises from the fact that we don’t want our tokenizer to read one single character at a time out of the file. By reading a fixed amount at a time and then accessing the data directly in memory, we can save ourselves a few expensive function calls. The token stack is actually used by the portion of the system that is responsible for interpreting the meaning of the tokens; more about that later.
Note that there is no compelling reason to store this information in a class, other than the convenience of having a compact PHP syntax to work with. You could just as easily store everything in an array and avoid OOP altogether, although, in my opinion, that would significantly complicate your code and make it easier to introduce bugs that would be tough to find and fix.
Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple way: first, it skips any whitespace that is at the current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at the first character. The procedure used to then find the end of the token varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string and dictionary delimiters we need to check one more character, since they both share the same initial open angle bracket. For all the other types of tokens, we simply scan the file until we end up in a different character class.
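The same three steps can be tried out on a plain string, without the buffering that the real tokenizer needs. The function below, sketch_tokenize(), is a simplified stand-in written for this illustration and not the article’s pdf_read_token(); it assumes that the whole input is already in memory.

```php
<?php
// Sketch: the tokenizing rules described above, applied to a string
// that is already in memory. sketch_tokenize() is a hypothetical,
// simplified stand-in for pdf_read_token().
function sketch_tokenize($data)
{
    $tokens = array();
    $len = strlen($data);
    $pos = 0;

    while (true) {
        // Step 1: skip whitespace
        $pos += strspn($data, " \r\n", $pos);
        if ($pos >= $len) {
            break;
        }

        // Step 2: look at the first character
        $char = $data[$pos];

        if (strpos('[]()', $char) !== false) {
            // Array and literal string delimiters: one character
            $tokens[] = $char;
            $pos++;
        } elseif ($char === '<' || $char === '>') {
            // Hex string or dictionary delimiter: check one more
            if ($pos + 1 < $len && $data[$pos + 1] === $char) {
                $tokens[] = $char . $char;
                $pos += 2;
            } else {
                $tokens[] = $char;
                $pos++;
            }
        } else {
            // Step 3: anything else runs until the next delimiter
            $n = strcspn($data, " []<>()\r\n/", $pos + 1);
            $tokens[] = substr($data, $pos, $n + 1);
            $pos += $n + 1;
        }
    }

    return $tokens;
}

$spaced  = sketch_tokenize('<< /Entry (Value) >>');
$compact = sketch_tokenize('<</Entry (Value)>>');
```

Both calls produce the same six tokens, which is the point made above: the delimiters, not the whitespace, decide where a token ends.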
Parsing the Data
Next in the list, we need to be able to understand the meaning of each token in the context of the PDF file, and this is the job of another great computer science construct: the parser.
Parsers can be very complicated, and are usually not coded by hand; in most cases, a developer would use a “parser generator” like YACC or Bison. These reduce the parser to a relatively complex finite-state machine that is flexible enough to accommodate certain types of languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines’ worth of PHP.
Before introducing another listing, however, let’s consider the types of data that we need to deal with. For the most part, they are simple to handle: for direct values, for example, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.
Listing 3
<?php

/*
 * This class is used to
 * read data from the input
 * file in a bufferized way
 * and to store unused tokens
 */

class pdf_context
{
    var $file;
    var $buffer;
    var $offset;
    var $length;

    var $stack;

    // Constructor. Both names are provided so that the
    // class works with old and new versions of PHP

    function __construct($f)
    {
        $this->file = $f;
        $this->reset();
    }

    function pdf_context($f)
    {
        $this->__construct($f);
    }

    // Optionally move the file
    // pointer to a new location
    // and reset the buffered data

    function reset($pos = null)
    {
        if (!is_null($pos)) {
            fseek($this->file, $pos);
        }

        $this->buffer = fread($this->file, 100);
        $this->offset = 0;
        $this->length = strlen($this->buffer);
        $this->stack = array();
    }

    // Make sure that there is at least one
    // character beyond the current offset in
    // the buffer to prevent the tokenizer
    // from attempting to access data that does
    // not exist

    function ensure_content()
    {
        if ($this->offset >= $this->length - 1) {
            return $this->increase_length();
        } else {
            return true;
        }
    }

    // Forcefully read more data into the buffer

    function increase_length()
    {
        if (feof($this->file)) {
            return false;
        } else {
            // Append, so that existing offsets stay valid
            $this->buffer .= fread($this->file, 100);
            $this->length = strlen($this->buffer);
            return true;
        }
    }
}

/*
 * Reads a token from the file
 */

function pdf_read_token(&$c)
{
    // If there is a token available
    // on the stack, pop it out and
    // return it

    if (count($c->stack)) {
        return array_pop($c->stack);
    }

    // Strip away any whitespace

    do {
        if (!$c->ensure_content()) {
            return false;
        }
        $c->offset += strspn($c->buffer, " \n\r", $c->offset);
    } while ($c->offset >= $c->length - 1);

    // Get the first character in the stream

    $char = $c->buffer[$c->offset++];

    switch ($char) {

        case '[':
        case ']':
        case '(':
        case ')':

            // This is either an array or literal string
            // delimiter. Return it

            return $char;

        case '<':
        case '>':

            // This could either be a hex string or
            // dictionary delimiter. Determine the
            // appropriate case and return the token

            if ($c->buffer[$c->offset] == $char) {
                if (!$c->ensure_content()) {
                    return false;
                }
                $c->offset++;
                return $char . $char;
            } else {
                return $char;
            }

        default:

            // This is "another" type of token (probably
            // a dictionary entry or a numeric value).
            // Find the end and return it

            if (!$c->ensure_content()) {
                return false;
            }

            while (1) {

                // Determine the length of the token

                $pos = strcspn($c->buffer, " []<>()\r\n/", $c->offset);

                if ($c->offset + $pos < $c->length - 1) {
                    break;
                } else {

                    // The token may span beyond the end of the
                    // current buffer. Read more data and try
                    // again, unless we have reached the end
                    // of the file

                    if (!$c->increase_length()) {
                        break;
                    }
                }
            }

            $result = substr($c->buffer, $c->offset - 1, $pos + 1);

            $c->offset += $pos;
            return $result;
    }
}

?>

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token,
because a closing parenthesis could be escaped by a backslash and, therefore, its presence alone does not indicate the end of the string. In a “traditional” lexer, this problem is taken care of by switching the machine to a new context in which a different set of rules apply. We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that is part of pdf_read_token() in Listing 3 and writing some additional code that looks for a parenthesis preceded by an even number of backslashes (possibly none). Why an even number? Because the backslashes themselves can be escaped by prefixing them with another backslash. Therefore, an even number of backslashes means that they are all paired off and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. The last in an odd number of backslashes right before a parenthesis becomes an “orphan” and escapes the parenthesis, thus preventing it from terminating the string.
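The parity rule is easy to express in code. The sketch below is not part of the article’s listings (Listing 4 only looks at the single character before the parenthesis); is_escaped() and find_string_end() are hypothetical helpers that count every backslash immediately before a given position.

```php
<?php
// Sketch: decide whether the character at $pos is escaped by counting
// the backslashes immediately before it. Both functions are
// hypothetical helpers written for this illustration.
function is_escaped($data, $pos)
{
    $count = 0;
    for ($i = $pos - 1; $i >= 0 && $data[$i] === '\\'; $i--) {
        $count++;
    }

    // An odd number leaves one "orphan" backslash that escapes $pos
    return ($count % 2) === 1;
}

// Returns the offset of the first unescaped ")" at or after $start,
// or false if there is none
function find_string_end($data, $start = 0)
{
    $pos = $start;
    while (($pos = strpos($data, ')', $pos)) !== false) {
        if (!is_escaped($data, $pos)) {
            return $pos;
        }
        $pos++;
    }
    return false;
}

// The PHP literals below contain one and two backslashes respectively
$one = find_string_end('a \\) b) c');     // data: a \) b) c
$two = find_string_end('a \\\\) b) c');   // data: a \\) b) c
```

In the first string the parenthesis at offset 3 is escaped, so the search moves on to the one at offset 6; in the second, the two backslashes cancel each other out and the parenthesis at offset 4 ends the string. This sketch ignores the balanced, unescaped parentheses that PDF also allows inside literal strings.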
Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality inside the parser itself. When an open parenthesis token is returned by the tokenizer, the code simply keeps scanning the input file until it finds an unescaped closing parenthesis.
The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up of three tokens. Therefore, once our parser encounters a numeric value, it won’t be able to tell whether it is part of a larger element until it has read at least one more token, and potentially two. The problem here is not with reading the tokens; it’s with what to do with them if, by any chance, the numeric value turns out to be just a numeric value. We could, in theory, put the extra tokens “back in the buffer” by rolling back the offset pointer in the buffer to the beginning of the second token, but that would be difficult to do, since we don’t really know how many whitespace characters were between the tokens to start with.
Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the file context. When a new token is requested, pdf_read_token() checks whether anything is present in the stack and, if something is in there, it pops it out and returns it, without even reading one character from the file buffer.
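Here is the push-back idea reduced to its bare minimum. The token_source class is a hypothetical stand-in for pdf_context and pdf_read_token(): it hands out tokens from an array that has already been split, so that only the stack logic is left to look at.

```php
<?php
// Sketch: a token source with push-back. token_source is a
// hypothetical stand-in that reads from a pre-split array
// instead of a file buffer.
class token_source
{
    public $tokens;            // tokens still "in the file"
    public $stack = array();   // tokens read ahead but not used

    function __construct($tokens)
    {
        $this->tokens = $tokens;
    }

    function read()
    {
        // Pushed-back tokens take precedence over new input
        if (count($this->stack)) {
            return array_pop($this->stack);
        }
        return count($this->tokens) ? array_shift($this->tokens) : false;
    }
}

$src = new token_source(array('612', '792', ']'));

$first = $src->read();   // '612'
$tok2  = $src->read();   // '792': numeric, could be a generation number
$tok3  = $src->read();   // ']': neither "obj" nor "R"

// Push back in reverse order, so that the tokens come
// out again in the order in which they were read
array_push($src->stack, $tok3);
array_push($src->stack, $tok2);

$again = array($src->read(), $src->read());
```

Because a stack is last-in, first-out, the third token has to be pushed before the second one; this is the same order used in Listing 4.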
You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number of constant definitions that look suspiciously like data types, and they are. Since we’ll be reading and writing data back and forth, we’ll need to keep track of the object types as we read them from the stream. To do so, each object is encapsulated in an array whose zeroth element indicates the type, while element 1 contains the actual value, which varies depending on the nature of the data. Thus, for example, the trailer dictionary could look like this:
array ( PDF_TYPE_DICTIONARY, array (
    '/Size' => array ( PDF_TYPE_NUMERIC, 22 ),
    '/Root' => array ( PDF_TYPE_OBJREF, 12, 0 ),
    '/Prev' => array ( PDF_TYPE_NUMERIC, 54655 )
) );
Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is. The code is quite heavily commented, so I will limit myself to noting once more that each value is stored in an array whose zeroth element contains its type. This makes identifying the type of a value practically immediate, which will turn out to be very important later on, when we’ll need to write objects back to the file.
Before moving on to the next step, note that we make no provision in our lexer for reading stream data. This is because we are not intent on interpreting everything that is stored in a PDF file, but only those elements that allow us to modify its contents. However, adding support for streams shouldn’t be too much of a problem: all you need is the ability to resolve object references, which we’ll add shortly, since the length of a stream is often expressed in that way.
Getting to the Root of the Problem
All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.
Listing 5 (index.php) is our main script and, unfortunately, it’s too large to show here; you will, however, find it in the code associated with this article, so you will hopefully be able to follow me there.
Listing 4
<?php

// Define various data types
// that we use throughout the system

define('PDF_TYPE_NULL', 0);
define('PDF_TYPE_NUMERIC', 1);
define('PDF_TYPE_TOKEN', 2);
define('PDF_TYPE_HEX', 3);
define('PDF_TYPE_STRING', 4);
define('PDF_TYPE_DICTIONARY', 5);
define('PDF_TYPE_ARRAY', 6);
define('PDF_TYPE_OBJDEC', 7);
define('PDF_TYPE_OBJREF', 8);
define('PDF_TYPE_OBJECT', 9);
define('PDF_TYPE_STREAM', 10);

/*
 * Reads a value from the current
 * data stream
 */

function pdf_read_value(&$c, $token = null)
{
    // Get a token from the stream

    if (is_null($token)) {
        $token = pdf_read_token($c);
    }

    if ($token === false) {
        return false;
    }

    switch ($token) {

        case '<':

            // This is a hex string. Read the
            // value, then the terminator

            $s = pdf_read_token($c);

            if ($s === false) {
                return false;
            }

            $term = pdf_read_token($c);

            if ($term !== '>') {
                die("Unexpected data after hex string");
            }

            return array(PDF_TYPE_HEX, $s);

        case '<<':

            // This is a dictionary

            $result = array();

            // Recurse into this function until we
            // reach the end of the dictionary

            while (($key = pdf_read_token($c)) !== '>>') {
                if ($key === false) {
                    return false;
                }

                if (($value = pdf_read_value($c)) === false) {
                    return false;
                }

                $result[$key] = $value;
            }

            return array(PDF_TYPE_DICTIONARY, $result);

        case '[':

            // This is an array

            $result = array();

            // Recurse into this function until we
            // reach the end of the array

            while (($token = pdf_read_token($c)) !== ']') {
                if ($token === false) {
                    return false;
                }

                if (($value = pdf_read_value($c, $token)) === false) {
                    return false;
                }

                $result[] = $value;
            }

            return array(PDF_TYPE_ARRAY, $result);

        case '(':

            // This is a literal string

            $pos = $c->offset;

            while (1) {

                // Start by finding the next closed
                // parenthesis

                $match = strpos($c->buffer, ')', $pos);

                // If we can't find it, try reading
                // more data from the stream

                if ($match === false) {
                    if (!$c->increase_length()) {
                        return false;
                    }
                    continue;
                }

                $pos = $match;

                // Make sure that there is no backslash before
                // the parenthesis. If there is, move on.
                // Otherwise, return the string (the extracted
                // text includes the closing parenthesis)

                if ($c->buffer[$pos - 1] !== '\\') {
                    $result = substr($c->buffer, $c->offset, $pos - $c->offset + 1);
                    $c->offset = $pos + 1;
                    return array(PDF_TYPE_STRING, $result);
                } else {
                    $pos++;

                    if ($pos > $c->offset + $c->length) {
                        $c->increase_length();
                    }
                }
            }

        default:

            if (is_numeric($token)) {

                // A numeric token. Make sure that it is not
                // part of something else

                if (($tok2 = pdf_read_token($c)) !== false) {
                    if (is_numeric($tok2)) {

                        // Two numeric tokens in a row. In
                        // this case, we're probably in
                        // front of either an object reference
                        // or an object specification.
                        // Determine the case and return the
                        // data

                        if (($tok3 = pdf_read_token($c)) !== false) {
                            switch ($tok3) {

                                case 'obj':

                                    return array(PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                                case 'R':

(Listing 4 continues below)

Once we have declared a few variables that we’ll end up using throughout the script, we read the cross-reference table from the file, then immediately attempt to retrieve the Root object from it. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won’t help us much.
This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the objects.php include file. The function can actually be used to determine whether any value is an indirect reference and resolve it to the actual object data, something that will come in handy at pretty much every step of the way.
As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn’t, the function has really nothing to do, other than returning right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine the object’s position and starts reading it. The $encapsulate parameter determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and generation number in the array, so that effectively the object’s data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object’s ID and generation number is lost. Both types of return values have their uses: if you want to retrieve an object with the intention of modifying it, you will probably want it encapsulated, so that you can later write it back to the stream. If, on the other hand, you’re just trying to retrieve a value, as you would, for example, if you were reading a stream object and you wanted to determine its length, the non-encapsulated version will be easier to handle. Speaking of retrieving streams, even though my code doesn’t perform that function (since I’m not writing a PDF reader), if you intend to add it, the pdf_resolve_object() function is safe to use because it saves the file pointer’s current position before reading the object and restores it afterwards. If the function didn’t do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the file, and you would be unable to read the rest of the stream.
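The save-and-restore pattern itself takes only a few lines. The sketch below does not reproduce pdf_resolve_object() (Listing 6 is not shown here); read_at() is a hypothetical helper that reads a few bytes from an arbitrary offset and then puts the file pointer back where the caller left it.

```php
<?php
// Sketch: read from another location without losing our place.
// read_at() is a hypothetical helper used only for illustration.
function read_at($f, $offset, $length)
{
    $saved = ftell($f);          // remember where the caller was

    fseek($f, $offset);
    $data = fread($f, $length);

    fseek($f, $saved);           // put the pointer back
    return $data;
}

$f = fopen('php://memory', 'w+b');
fwrite($f, "stream-data 12 0 obj 42 endobj");
fseek($f, 7);                    // pretend we are in the middle of a stream

$value = read_at($f, 21, 2);     // the "42" stored in the object
$where = ftell($f);              // unchanged by the call
```

A caller that is halfway through a stream can therefore look up a value stored elsewhere and carry on reading as if nothing had happened.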
Let’s go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php).
The reason why we have a separate function just to read through the /Pages element of the root object is that, as I mentioned in last month’s article, the pages could be nested in an arbitrary combination of /Page and /Pages dictionaries, so that we may need to recurse into the function several times in order to end up with an array that contains only page elements. It is important to understand that the position of a page in the resulting array doesn’t necessarily correspond to the page number that a viewer shows to the user; that is, the first page in the list is not necessarily labeled as page 1. The PDF specification provides a separate facility (page labels) for determining the number displayed for each page, but, technically speaking, you should only be interested in that if you want to display the contents of a document. In practical terms, I have never found an occasion in which the two didn’t line up; at most, there might be a fixed discrepancy because the document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.
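The recursion can be sketched on a simplified tree. The nodes below are plain PHP arrays that are assumed to be already resolved, and the 'name' key exists only to make the result easy to inspect; collect_pages() is a hypothetical helper, not the pdf_read_pages() function from Listing 7, which works on the typed values produced by the parser.

```php
<?php
// Sketch: flatten a nested page tree into a flat list of pages.
// collect_pages() and the 'name' key are hypothetical.
function collect_pages($node, &$pages)
{
    if ($node['/Type'] === '/Pages') {
        // Intermediate node: recurse into each child
        foreach ($node['/Kids'] as $kid) {
            collect_pages($kid, $pages);
        }
    } else {
        // Leaf node: an actual page
        $pages[] = $node;
    }
}

$tree = array(
    '/Type' => '/Pages',
    '/Kids' => array(
        array('/Type' => '/Page', 'name' => 'A'),
        array(
            '/Type' => '/Pages',
            '/Kids' => array(
                array('/Type' => '/Page', 'name' => 'B'),
                array('/Type' => '/Page', 'name' => 'C'),
            ),
        ),
    ),
);

$pages = array();
collect_pages($tree, $pages);
$names = array_column($pages, 'name');
```

However deep the nesting goes, the caller ends up with a flat array in which element zero is the first page reached by the traversal.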
Listing 4 (continued)
                                    return array(PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
                            }

                            // If we get to this point, that numeric
                            // value up there was just a numeric value.
                            // Push the extra tokens back into the
                            // stack and return the value

                            array_push($c->stack, $tok3);
                        }
                    }

                    array_push($c->stack, $tok2);
                }

                return array(PDF_TYPE_NUMERIC, $token);
            } else {

                // Just a token. Return it

                return array(PDF_TYPE_TOKEN, $token);
            }

    }
}

?>

In our sample script, we only take into consideration page 1 (which is the zeroth element of the pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page. Here, again, we need a dedicated function because, as you may remember, the resource dictionary is an inheritable