php|architect

The Magazine For PHP Professionals


In the Belly of the Beast

Interpreting and Manipulating PDF Files

by Marco Tabini

In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.

REQUIREMENTS

PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf

Welcome to part two of our little trip down PDF lane. While last month we focused primarily on understanding the structure of a PDF document, this time we'll look at the problem of altering the contents of a PDF file from a more practical perspective.

The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour because the file is not only not intended for human consumption, but it also does not follow a top-down logic. In other words, as we also discovered last month, when parsing a PDF file one doesn't start at the beginning and move down to the end of the file. In fact, the exact opposite is true.

Since we'll often find ourselves jumping to various, and completely arbitrary, positions in the document, the first decision that we need to make is how we're going to access the data. While it is tempting to just load the entire file in memory, that's usually not such a good idea: a PDF can have pretty much any size, so by loading an entire document in memory we risk tying up large chunks of RAM, thus limiting our server's ability to process a large number of requests.

Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that you're accessing a PDF document via HTTP. In this case, you'd have to download the entire file before you could actually find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file. Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.

The one notable exception to this rule is a special class of PDF documents known as "linearized PDF files." A linearized PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in the file without having to read through the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can find out more about it directly from the PDF specification document published by Adobe.

Getting Started

The first thing we need to do in order to be able to interpret the contents of a PDF document is to determine where the cross-reference table and trailer dictionary are. This is quite easy if you consider that the format of the startxref pointer is fixed. For example, in my document it looks like the following:

startxref
53593
%%EOF

Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note how the regex pattern ends with a dollar sign, indicating that the match must be anchored to the end of the data stream. Even though we're only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference table pointer by mistake. If you're wondering why the cross-reference table pointer is not saved to the document using a fixed format (say, for example, using 10 digits for the offset like the cross-reference entries themselves), you're not alone. This decision is a bit of a mystery, but it's something that we have to live with.

By the way: throughout the remainder of the article, you'll notice that I have created an individual include file for each of the functions that we will be writing. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you'll forgive me and that, if you decide to use any of the code in your own projects, you will not follow the same layout.
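To see why the end anchor matters, the sketch below runs a similar pattern against an in-memory string that happens to contain two pointers within its last fifty bytes. The function name find_startxref_in_tail() is hypothetical, and the pattern differs from the one in Listing 1 in one respect that is purely an assumption of this sketch: it also accepts a file with no newline after %%EOF.

```php
<?php

// Hypothetical helper: extract the startxref offset from the last
// few bytes of a PDF file, passed in as a string.
function find_startxref_in_tail(string $tail): ?int
{
    // Any of the three newline conventions; \r\n is listed first
    // so that a Windows line ending is consumed as a whole.
    $eol = '(?:\r\n|\r|\n)';

    // The D modifier makes $ match only at the very end of the data.
    $pattern = '/startxref' . $eol . '(\d+)' . $eol . '%%EOF' . $eol . '?$/D';

    if (!preg_match($pattern, $tail, $matches)) {
        return null;
    }

    return (int) $matches[1];
}

// Two pointers fit in this tail; only the last one is anchored
// to the end of the data, so that is the one that gets returned.
$tail = "startxref\n123\n%%EOF\nstartxref\n456\n%%EOF\n";
echo find_startxref_in_tail($tail), "\n"; // prints 456
```

Without the anchor, the regex engine would stop at the first match and report 123, which is the offset of an older cross-reference table.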

Reading the Cross-reference Table

Now that we know where to look for it, it's time to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we'll find the following:

xref

0 22

0000000000 65535 f

0000000017 00000 n

0000005632 00000 n

0000005659 00000 n

0000006483 00000 n

0000053169 00000 n

0000006509 00000 n

0000039936 00000 n

The word xref is followed by the number of the first object represented in the table (0 in this case) and the number of entries that follow (twenty-two); we'll call this the "header" of the table. Next come the entries themselves: for each line, we have the offset at which the object can be found (10 characters), followed by the generation number (5 characters) and the letter n for objects that are in use or f for objects that are free.

There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it. However, you should keep in mind that the line endings in the cross-reference table follow the PDF format's own rules rather than the conventions of the platform your script is running on: a line may end with a Windows-style carriage return/line feed pair, but also with a lone carriage return or a lone line feed, and you must instruct the PHP interpreter to recognize all of them. This can be accomplished by turning on the auto_detect_line_endings INI directive (which became available as of PHP 4.3.0). We can do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring its original value. This sequence of operations is important, because other portions of our script may depend on the directive being in a different state than the one we need.
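On recent PHP releases, where the auto_detect_line_endings directive is reported as deprecated, a different technique can be used instead of toggling the INI setting: reading the line by hand and accepting any of the three newline conventions. The sketch below is one possible way to do that; read_line_any_eol() is a hypothetical name and is not part of the article's code.

```php
<?php

// Hypothetical helper: read one line from the stream $f, treating
// \r\n, a lone \r and a lone \n as line terminators. Returns the
// line without its terminator, or false at the end of the stream.
function read_line_any_eol($f)
{
    $line = '';

    while (($ch = fgetc($f)) !== false) {
        if ($ch === "\n") {
            return $line;
        }

        if ($ch === "\r") {
            // Swallow the \n of a \r\n pair; if the next byte is
            // something else, step back so that it is not lost.
            $peek = fgetc($f);
            if ($peek !== false && $peek !== "\n") {
                fseek($f, -1, SEEK_CUR);
            }
            return $line;
        }

        $line .= $ch;
    }

    // End of stream: return what we have, if anything
    return $line === '' ? false : $line;
}

// Usage with an in-memory stream that mixes all three conventions
$f = fopen('php://memory', 'w+');
fwrite($f, "xref\r0 22\r\n0000000000 65535 f\n");
rewind($f);

while (($line = read_line_any_eol($f)) !== false) {
    echo $line, "\n";
}
```

Reading one byte at a time is slower than fgets(), so this is better suited to the handful of lines in a cross-reference header than to bulk reads.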

Another gotcha when reading the cross-reference table is that there may be more than one block of entries: once you've read all the entries of a block, you could find another header followed by a new set of entries, or you could find the trailer dictionary. If we didn't check for this possibility and simply assumed that the cross-reference table is always followed by a trailer, our code would be unable to read most documents that have been modified after their creation, since that's the situation in which partial cross-reference tables are most likely to be found.
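The header/entries layout and the possibility of several blocks can be illustrated on a string, without touching a file. The sketch below is a simplified stand-in for the file-based reader, under the assumption that the whole section (from the xref keyword up to, but not including, trailer) is already in memory; parse_xref_section() and the keys of the returned array are hypothetical names.

```php
<?php

// Hypothetical helper: parse the text of one xref section into an
// array indexed by object number.
function parse_xref_section(string $text): array
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));

    if (array_shift($lines) !== 'xref') {
        throw new InvalidArgumentException('Not an xref section');
    }

    $entries = [];
    $next = 0;   // object number of the next entry
    $left = 0;   // entries still expected in the current block

    foreach ($lines as $line) {
        $line = trim($line);

        if ($left === 0) {
            // Block header: "<first object> <number of entries>"
            if (!preg_match('/^(\d+) (\d+)$/', $line, $m)) {
                throw new InvalidArgumentException("Unexpected header: $line");
            }
            $next = (int) $m[1];
            $left = (int) $m[2];
            continue;
        }

        // Entry: 10-digit offset, 5-digit generation, n or f
        if (!preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
            throw new InvalidArgumentException("Unexpected entry: $line");
        }

        $entries[$next] = [
            'offset'     => (int) $m[1],
            'generation' => (int) $m[2],
            'in_use'     => $m[3] === 'n',
        ];

        $next++;
        $left--;
    }

    return $entries;
}

// Two blocks: objects 0-1, then object 5 on its own
$section = "xref\r\n0 2\r\n0000000000 65535 f\r\n0000000017 00000 n\r\n"
         . "5 1\r\n0000053169 00000 n\r\n";

$entries = parse_xref_section($section);
echo $entries[5]['offset'], "\n"; // prints 53169
```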

Listing 1

<?php

/*
 * Returns the offset of the most recent
 * cross-reference table in the file
 */

function pdf_find_xref($f)
{
    // First, seek to the end of the file,
    // allowing for 50 bytes just so that
    // we have enough data to look into

    fseek($f, -50, SEEK_END);

    // Next, try to find the proper sequence
    // of data. Note that the information can be
    // separated by a Windows-style, Mac-style
    // or Unix-style newline

    $data = fread($f, 50);

    if (!preg_match('/startxref(?:\r|\n|\r\n)(\d+)(?:\r|\n|\r\n)%%EOF(?:\r|\n|\r\n)$/', $data, $matches)) {
        die("Unable to find pointer to xref table");
    }

    // If we get here, then we have the offset
    // where the most recently introduced xref
    // table begins

    return (int) $matches[1];
}

?>

As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple. It is written to take full advantage of the fact that the cross-reference table is formatted using a very stylized layout, so that we can rely on the fastest and most convenient string functions provided by PHP.

The only aspect of this function that we have not explored is the little segment of code that starts at line 92 and ends at line 108. This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is retrieved aside for a moment, you'll notice that the trailer dictionary ends up in an associative array. If you remember from last month's article, files that have been modified usually contain more than one cross-reference table; this is indicated by the presence of a /Prev key/value pair in the trailer, with a pointer to the beginning of the previous table. If this entry is present, the function simply recurses onto itself until all the cross-reference tables present in the file are read. Note that any information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones, by the simple stratagem of checking that an entry is not already set in the first case, and by merging the trailer arrays in a particular order in the second.
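The two precedence rules can be tried out in isolation. The sketch below uses plain arrays with made-up offsets in place of real parsed tables, ignores generation numbers for brevity, and introduces a hypothetical helper, merge_xref_tables(); it only illustrates the "newest wins" logic and is not the article's code.

```php
<?php

// Hypothetical helper: combine xref tables, listed from the newest
// to the oldest, into one table indexed by object number.
function merge_xref_tables(array $tables): array
{
    $xref = [];

    foreach ($tables as $table) {
        foreach ($table as $object => $offset) {
            // An entry that is already set came from a newer
            // table, so the older one must not overwrite it
            if (!isset($xref[$object])) {
                $xref[$object] = $offset;
            }
        }
    }

    return $xref;
}

$newest = [3 => 60210, 22 => 60480];
$oldest = [1 => 17, 2 => 5632, 3 => 5659];

$xref = merge_xref_tables([$newest, $oldest]);
echo $xref[3], "\n"; // prints 60210: the newer offset is kept

// For trailers, which use string keys, array_merge() lets the
// later argument win, so the older trailer goes first.
$older_trailer = ['/Size' => 22, '/Info' => '11 0 R'];
$newer_trailer = ['/Size' => 23, '/Prev' => 53593];

$trailer = array_merge($older_trailer, $newer_trailer);
echo $trailer['/Size'], "\n"; // prints 23
```

Note that array_merge() renumbers integer keys, which is why the object table is merged with isset() checks instead.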

Writing a PDF Lexer

Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it's time to try and read them. We could, in theory, write a series of ad-hoc functions that try to read from the file and interpret its contents, but things are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).

Listing 2

  1  <?php
  2
  3  /*
  4   * Reads a cross-reference table
  5   *
  6   * If $offset is provided and $start and $end are
  7   * set to null, the function will start reading the
  8   * xref table from the current position in the file
  9   * If more than one part of the xref table is present,
 10   * the function will recurse onto itself as many times
 11   * as needed
 12   */
 13
 14  function pdf_read_xref($f, &$result, $offset, $start = null, $end = null)
 15  {
 16      // If we didn't get a start and end, we need
 17      // to get them from the document itself
 18
 19      if (is_null($start) || is_null($end)) {
 20
 21          // Move to the beginning of the table
 22
 23          fseek($f, $offset);
 24
 25          // Save the current line-ending setting,
 26          // then turn auto-detection on
 27
 28          $old_ini = ini_get('auto_detect_line_endings');
 29
 30          ini_set('auto_detect_line_endings', true);
 31
 32          $data = trim(fgets($f));
 33
 34          // The first line must contain
 35          // the xref keyword
 36
 37          if ($data !== 'xref') {
 38              die("Unable to find xref table");
 39          }
 40
 41          // The next line is the header: first
 42          // object number and number of entries
 43
 44          $data = explode(' ', trim(fgets($f)));
 45
 46          // A valid header has exactly two values
 47
 48          if (count($data) != 2) {
 49              die("Unexpected header in xref table");
 50          }
 51
 52          // Calculate the range of object
 53          // numbers covered by this block
 54
 55          $start = $data[0];
 56          $end = $start + $data[1];
 57      }
 58
 59      if (!isset($result['xref_location'])) {
 60          $result['xref_location'] = $offset;
 61      }
 62
 63      if (!isset($result['max_object']) || $end > $result['max_object']) {
 64          $result['max_object'] = $end;
 65      }
 66
 67      // Now cycle through each object
 68      // in this block of entries
 69
 70      for (; $start < $end; $start++) {
 71
 72          // Each entry is a fixed-width line: a
 73          // 10-digit offset, a 5-digit generation
 74          // number and the n or f flag
 75
 76          $data = fgets($f);
 77
 78          $offset = substr($data, 0, 10);
 79          $generation = substr($data, 11, 5);
 80
 81          if (!isset($result['xref'][$start][(int) $generation])) {
 82              $result['xref'][$start][(int) $generation] = (int) $offset;
 83          }
 84      }
 85
 86      // Get the next line, which could either be the beginning
 87      // of the trailer dictionary or the header of another
 88      // block of entries
 89
 90      $data = trim(fgets($f));
 91
 92      if ($data === 'trailer') {
 93
 94          // Read the trailer dictionary
 95
 96          $c = new pdf_context($f);
 97          $trailer = pdf_read_value($c);
 98
 99          // pdf_read_value() returns array(type, value),
100          // so the dictionary entries are in $trailer[1]
101          // and the /Prev offset in $trailer[1]['/Prev'][1]
102
103          if (isset($trailer[1]['/Prev'])) {
104              pdf_read_xref($f, $result, $trailer[1]['/Prev'][1]);
105              $result['trailer'][1] = array_merge($result['trailer'][1], $trailer[1]);
106          } else {
107              $result['trailer'] = $trailer;
108          }
109
110      } else {
111
112          // This is the header of another
113          // block of entries: recurse to
114          // read it, passing the new range
115          // of object numbers
116
117          $data = explode(' ', $data);
118          pdf_read_xref($f, $result, null, $data[0], $data[0] + $data[1]);
119
120      }
121  }
122
123  ?>

Our lexer will take the input from the PDF file and split it into individual tokens according to a particular set of rules. For example, if we were writing a lexer for reducing the contents of this article to a series of words (with every grammatical element representing a token), we would establish that a token is either a set of characters or a punctuation mark, assuming that whitespace and paragraph markers are of no importance to us.

Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out for a few potential pitfalls. First the basics: the simplest form of delimiter is whitespace, which has no semantic value (meaning that it is used only for the purpose of delimiting tokens and has no other purpose). Whitespace is composed of space characters, carriage returns and line feeds.

This would be enough to cover most situations, but in some cases you'll find that tokens are not always delimited using whitespace. When some applications (including some of Adobe's own) "optimize" a PDF file to reduce its size as much as possible, they remove whitespace characters where the distinction between two tokens is made obvious in another way. For example, consider the following snippet of PDF code that shows the beginning of a dictionary:

<< /Entry (Value) >>

The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of two completely different classes of characters. Since << could only appear outside of a literal string to indicate the beginning of a dictionary, the lexer should stop at the second open angular bracket and delimit a token before the next character, whatever that is. Therefore, the snippet above could be rewritten as follows:

<</Entry (Value)>>

Clearly, whitespace isn't enough to delimit a token: we must also keep in mind all the other possible character classes that can be used for the same purpose. Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is.
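Before looking at the buffered version, it may help to see the same delimiting rules applied to a plain string. The sketch below is a simplified illustration and not the article's lexer: tokenize_fragment() is a hypothetical name, the whole input is assumed to be in memory, and, as in the article, parentheses are returned as separate tokens and the contents of literal strings are not treated specially.

```php
<?php

// Hypothetical helper: split a fragment of PDF code, held in a
// string, into tokens.
function tokenize_fragment(string $data): array
{
    $whitespace = " \t\r\n\f\0";
    $delimiters = $whitespace . '[]<>()/';

    $tokens = [];
    $pos = 0;
    $len = strlen($data);

    while (true) {
        // Whitespace only separates tokens: skip it
        $pos += strspn($data, $whitespace, $pos);

        if ($pos >= $len) {
            break;
        }

        $char = $data[$pos];

        if (strpos('[]()', $char) !== false) {
            // Array and literal string delimiters: one character
            $tokens[] = $char;
            $pos++;
        } elseif ($char === '<' || $char === '>') {
            // Doubled bracket: dictionary delimiter;
            // single bracket: hex string delimiter
            if ($pos + 1 < $len && $data[$pos + 1] === $char) {
                $tokens[] = $char . $char;
                $pos += 2;
            } else {
                $tokens[] = $char;
                $pos++;
            }
        } else {
            // Anything else: keep the first character (it may be
            // the leading slash of a name), then scan up to the
            // next delimiter
            $length = 1 + strcspn($data, $delimiters, $pos + 1);
            $tokens[] = substr($data, $pos, $length);
            $pos += $length;
        }
    }

    return $tokens;
}

// Both spellings of the dictionary produce the same tokens:
// <<, /Entry, (, Value, ), >>
print_r(tokenize_fragment('<< /Entry (Value) >>'));
print_r(tokenize_fragment('<</Entry (Value)>>'));
```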

This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of. The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:

• Create a memory-based buffer for the file's contents
• Keep track of the current pointer in the file and of the length of the buffer
• Maintain a stack of tokens that have been read from the file but not yet used

The necessity of creating a buffer here arises from the fact that we don't want our tokenizer to read one single character at a time out of the file. By reading a fixed amount at a time and then accessing the data directly in memory, we can save ourselves a few expensive function calls. The token stack is actually used by the portion of the system that is responsible for interpreting the meaning of the tokens; more about that later.

Note that there is no compelling reason to store this information in a class, other than the convenience of having a compact PHP syntax to work with. You could just as easily store everything in an array and avoid OOP altogether, although, in my opinion, that would significantly complicate your code and make it easier to introduce bugs that would be tough to find and fix.

Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple way: first, it skips any whitespace at the current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at the first character. The procedure used to then find the end of the token varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string and dictionary delimiters we need to check one more character, since they both share the same initial open angular bracket. For all the other types of tokens, we simply scan the file until we end up in a different character class.

Listing 3

<?php

/*
 * This class is used to
 * read data from the input
 * file in a bufferized way
 * and to store unused tokens
 */

class pdf_context
{
    var $file;
    var $buffer;
    var $offset;
    var $length;

    var $stack;

    // Constructor

    function pdf_context($f)
    {
        $this->file = $f;
        $this->reset();
    }

    // Optionally move the file
    // pointer to a new location
    // and reset the buffered data

    function reset($pos = null)
    {
        if (!is_null($pos)) {
            fseek($this->file, $pos);
        }

        $this->buffer = fread($this->file, 100);
        $this->offset = 0;
        $this->length = strlen($this->buffer);
        $this->stack = array();
    }

    // Make sure that there is at least one
    // character beyond the current offset in
    // the buffer to prevent the tokenizer
    // from attempting to access data that does
    // not exist

    function ensure_content()
    {
        if ($this->offset >= $this->length - 1) {
            return $this->increase_length();
        } else {
            return true;
        }
    }

    // Forcefully read more data into the buffer

    function increase_length()
    {
        if (feof($this->file)) {
            return false;
        } else {
            // Append to the buffer, so that the
            // current offset remains valid
            $this->buffer .= fread($this->file, 100);
            $this->length = strlen($this->buffer);
            return true;
        }
    }
}

/*
 * Reads a token from the file
 */

function pdf_read_token(&$c)
{
    // If there is a token available
    // on the stack, pop it out and
    // return it

    if (count($c->stack)) {
        return array_pop($c->stack);
    }

    // Strip away any whitespace

    do {
        if (!$c->ensure_content()) {
            return false;
        }
        $c->offset += strspn($c->buffer, " \n\r", $c->offset);
    } while ($c->offset >= $c->length - 1);

    // Get the first character in the stream

    $char = $c->buffer[$c->offset++];

    switch ($char) {

        case '[':
        case ']':
        case '(':
        case ')':

            // This is either an array or a literal
            // string delimiter: return it

            return $char;

        case '<':
        case '>':

            // This could be either a hex string or a
            // dictionary delimiter: determine which
            // by looking at the next character

            if ($c->buffer[$c->offset] == $char) {
                if (!$c->ensure_content()) {
                    return false;
                }
                $c->offset++;
                return $char . $char;
            } else {
                return $char;
            }

        default:

            // This is another type of token (probably
            // a dictionary entry or a numeric value):
            // find its end and return it

            if (!$c->ensure_content()) {
                return false;
            }

            while (1) {

                // Determine the length of the token

                $pos = strcspn($c->buffer, " []<>()\r\n/", $c->offset);

                if ($c->offset + $pos < $c->length - 1) {
                    break;
                } else {
                    // The token may extend beyond the end
                    // of the current buffer: read more data
                    // and try again, unless we have reached
                    // the end of the file

                    if (!$c->increase_length()) {
                        break;
                    }
                }
            }

            $result = substr($c->buffer, $c->offset - 1, $pos + 1);

            $c->offset += $pos;
            return $result;
    }
}

?>

Parsing the Data

Next in the list, we need to be able to understand the meaning of each token in the context of the PDF file, and this is the job of another great computer science construct: the parser.

Parsers can be very complicated, and are usually not coded by hand: in most cases, a developer would use a "parser generator" like YACC or Bison. These reduce the parser to a relatively complex finite-state machine that is flexible enough to accommodate certain types of languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines' worth of PHP.

Before introducing another listing, however, let's consider the types of data that we need to deal with. For the most part, they are simple to handle: for direct values, for example, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token, because a closed parenthesis could be escaped by a backslash and, therefore, its presence alone does not indicate the end of the string. In a "traditional" lexer, this problem is taken care of by switching the machine to a new context in which a different set of rules apply. We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that is part of pdf_read_token() in Listing 3 and writing some additional code that looks for a parenthesis preceded by an even number of backslashes (zero included). Why an even number? Because the backslashes themselves can be escaped by prefixing them with another backslash. Therefore, an even number of backslashes means that they are all escaped and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. The last in an odd number of backslashes right before a parenthesis becomes an "orphan" and escapes the parenthesis, thus preventing it from terminating the string.
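The even/odd rule described above can be sketched as a small search function. This is an illustration only: find_string_end() is a hypothetical name, the whole string is assumed to be in memory, and balanced unescaped parentheses nested inside a string are not handled.

```php
<?php

// Hypothetical helper: return the position of the parenthesis that
// closes a literal string whose contents begin at $start, or false
// if the buffer contains no unescaped closed parenthesis.
function find_string_end(string $buffer, int $start)
{
    $pos = $start;

    while (($pos = strpos($buffer, ')', $pos)) !== false) {

        // Count the backslashes that immediately
        // precede the parenthesis
        $slashes = 0;
        for ($i = $pos - 1; $i >= $start && $buffer[$i] === '\\'; $i--) {
            $slashes++;
        }

        // Even count (including zero): the backslashes escape
        // each other and the parenthesis ends the string
        if ($slashes % 2 === 0) {
            return $pos;
        }

        // Odd count: the parenthesis is escaped, keep looking
        $pos++;
    }

    return false;
}

// One backslash: the first parenthesis is escaped
var_dump(find_string_end('a\\)b)', 0)); // int(4)

// Two backslashes: they escape each other
var_dump(find_string_end('a\\\\)', 0)); // int(3)
```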

Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality inside the parser itself. When an open parenthesis token is returned by the tokenizer, the code simply keeps scanning the input file until it finds an unescaped closed parenthesis.

The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up of three tokens. Therefore, once our parser encounters a numeric value, it won't be able to tell whether it is part of a larger element until it has read at least one more token, and potentially two. The problem here is not with reading the tokens: it's with what to do with them if, by any chance, the numeric value turns out to be… just a numeric value. We could, in theory, put the extra tokens "back in the buffer" by rolling back the offset pointer in the buffer to the beginning of the second token, but that would be difficult to do, since we don't really know how many whitespace characters were between the tokens to start with.

Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the file context. When a new token is requested, pdf_read_token() checks whether anything is present in the stack and, if something is in there, it pops it out and returns it, without even reading one character from the file buffer.
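The push-back idea can be shown with a stand-in that takes its tokens from an array rather than from a file. Everything in the sketch below (the TokenSource class, read_number_or_reference() and the 'ref', 'num' and 'token' tags) is hypothetical and only mirrors the look-ahead logic described above.

```php
<?php

// Hypothetical stand-in for the file context: tokens come from an
// array, and unused ones can be pushed back onto a stack.
class TokenSource
{
    private $tokens;
    private $stack = [];

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function read()
    {
        // Pushed-back tokens are served first
        if (count($this->stack) > 0) {
            return array_pop($this->stack);
        }

        return count($this->tokens) > 0 ? array_shift($this->tokens) : false;
    }

    public function unread($token)
    {
        $this->stack[] = $token;
    }
}

// Reads either a plain number or an "N G R" reference
function read_number_or_reference(TokenSource $src)
{
    $first = $src->read();

    if ($first === false) {
        return false;
    }

    if (!is_numeric($first)) {
        return ['token', $first];
    }

    $second = $src->read();

    if ($second !== false) {
        if (is_numeric($second)) {
            $third = $src->read();

            if ($third === 'R') {
                return ['ref', (int) $first, (int) $second];
            }

            // Not a reference: give the third token back first...
            if ($third !== false) {
                $src->unread($third);
            }
        }

        // ...then the second, so that it is popped before the third
        $src->unread($second);
    }

    return ['num', $first];
}

$src = new TokenSource(['22', '/Root', '12', '0', 'R']);
print_r(read_number_or_reference($src)); // ['num', '22']
print_r(read_number_or_reference($src)); // ['token', '/Root']
print_r(read_number_or_reference($src)); // ['ref', 12, 0]
```

Because the stack is last-in, first-out, the tokens have to be pushed back in reverse order to come out in the order in which they appeared in the file.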

You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number of constant definitions that look suspiciously like data types, and they are. Since we'll be reading and writing data back and forth, we'll need to keep track of the object types as we read them from the stream. To do so, each object is encapsulated in an array whose zeroth element indicates the type, while element 1 contains the actual value, which varies depending on the nature of the data. Thus, for example, the trailer dictionary could look like this:

array (PDF_TYPE_DICTIONARY, array (
    '/Size' => array (PDF_TYPE_NUMERIC, 22),
    '/Root' => array (PDF_TYPE_OBJREF, 12, 0),
    '/Prev' => array (PDF_TYPE_NUMERIC, 54655)
));

Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is. The code is quite heavily commented, so I will limit myself to noting that each value is actually stored in an array whose zeroth element contains its type. This makes identifying the data type of a value practically immediate, which will turn out to be very important later on when we'll need to write objects back to the file.
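As a brief illustration of how such typed arrays are consumed, the sketch below builds the trailer shown above and pulls a numeric entry out of it. The constants repeat a subset of the values defined in Listing 4, while dict_number() is a hypothetical helper introduced only for this example.

```php
<?php

// A subset of the type constants from Listing 4
define('PDF_TYPE_NUMERIC', 1);
define('PDF_TYPE_DICTIONARY', 5);
define('PDF_TYPE_OBJREF', 8);

$trailer = array(PDF_TYPE_DICTIONARY, array(
    '/Size' => array(PDF_TYPE_NUMERIC, 22),
    '/Root' => array(PDF_TYPE_OBJREF, 12, 0),
    '/Prev' => array(PDF_TYPE_NUMERIC, 54655),
));

// Hypothetical helper: return the value of a numeric dictionary
// entry, or null if the entry is missing or is not a number.
function dict_number(array $dict, string $key)
{
    // Element 0 is the type, element 1 the actual value
    if ($dict[0] !== PDF_TYPE_DICTIONARY || !isset($dict[1][$key])) {
        return null;
    }

    $entry = $dict[1][$key];

    return $entry[0] === PDF_TYPE_NUMERIC ? $entry[1] : null;
}

var_dump(dict_number($trailer, '/Prev')); // int(54655)
var_dump(dict_number($trailer, '/Root')); // NULL: it is a reference, not a number
```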

Before moving on to the next step, note that we make no provision in our lexer for reading stream data. This is because we are not intent on interpreting everything that is stored in a PDF file, but only those elements that allow us to modify its contents. However, adding support for streams shouldn't be too much of a problem: all you need is the ability to resolve object references, which we'll add shortly, since the length of a stream is often expressed in that way.

Listing 4

<?php

// Define various data types
// that we use throughout the system

define('PDF_TYPE_NULL', 0);
define('PDF_TYPE_NUMERIC', 1);
define('PDF_TYPE_TOKEN', 2);
define('PDF_TYPE_HEX', 3);
define('PDF_TYPE_STRING', 4);
define('PDF_TYPE_DICTIONARY', 5);
define('PDF_TYPE_ARRAY', 6);
define('PDF_TYPE_OBJDEC', 7);
define('PDF_TYPE_OBJREF', 8);
define('PDF_TYPE_OBJECT', 9);
define('PDF_TYPE_STREAM', 10);

/*
 * Reads a value from the current
 * data stream
 */

function pdf_read_value(&$c, $token = null)
{
    // Get a token from the stream

    if (is_null($token)) {
        $token = pdf_read_token($c);
    }

    if ($token === false) {
        return false;
    }

    switch ($token) {

        case '<':

            // This is a hex string: read the
            // value, then the terminator

            $s = pdf_read_token($c);

            if ($s === false) {
                return false;
            }

            $term = pdf_read_token($c);

            if ($term !== '>') {
                die("Unexpected data after hex string");
            }

            return array(PDF_TYPE_HEX, $s);

        case '<<':

            // This is a dictionary

            $result = array();

            // Read key/value pairs until we
            // reach the end of the dictionary

            while (($key = pdf_read_token($c)) !== '>>') {
                if ($key === false) {
                    return false;
                }

                if (($value = pdf_read_value($c)) === false) {
                    return false;
                }

                $result[$key] = $value;
            }

            return array(PDF_TYPE_DICTIONARY, $result);

        case '[':

            // This is an array

            $result = array();

            // Read values until we reach
            // the end of the array

            while (($token = pdf_read_token($c)) !== ']') {
                if ($token === false) {
                    return false;
                }

                if (($value = pdf_read_value($c, $token)) === false) {
                    return false;
                }

                $result[] = $value;
            }

            return array(PDF_TYPE_ARRAY, $result);

        case '(':

            // This is a literal string

            $pos = $c->offset;

            while (1) {

                // Start by finding the next
                // closed parenthesis

                $found = strpos($c->buffer, ')', $pos);

                // If we can't find it, try reading
                // more data from the stream

                if ($found === false) {
                    if (!$c->increase_length()) {
                        return false;
                    }
                    continue;
                }

                $pos = $found;

                // Make sure that there is no backslash before
                // the parenthesis. If there is, keep looking;
                // otherwise, return the string

                if ($c->buffer[$pos - 1] !== '\\') {
                    $result = substr($c->buffer, $c->offset, $pos - $c->offset + 1);
                    $c->offset = $pos + 1;
                    return array(PDF_TYPE_STRING, $result);
                } else {
                    $pos++;

                    if ($pos > $c->offset + $c->length) {
                        $c->increase_length();
                    }
                }
            }

        default:

            if (is_numeric($token)) {

                // A numeric token: make sure that it is
                // not part of something else

                if (($tok2 = pdf_read_token($c)) !== false) {
                    if (is_numeric($tok2)) {

                        // Two numeric tokens in a row: in
                        // this case, we're probably in front
                        // of either an object reference
                        // or an object specification.
                        // Determine which and return the data

                        if (($tok3 = pdf_read_token($c)) !== false) {
                            switch ($tok3) {

                                case 'obj':

                                    return array(PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                                case 'R':

                                    return array(PDF_TYPE_OBJREF, (int) $token, (int) $tok2);

(Continued below)

Getting to the Root of the Problem

All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.

Listing 5 (index.php) is our main script and, unfortunately, it's too large to show here; you will, however, find it in the code associated with this article, so you will hopefully be able to follow me there.

Once we have declared a few variables that we'll end up using throughout the script, we read the cross-reference table from the file, then immediately attempt to retrieve the Root object from it. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won't help us much.

This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the objects.php include file. The function can actually be used to determine whether any object is an indirect reference and resolve it to the actual object data, something that will come in handy at pretty much every step of the way.

As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn't, the function has really nothing to do, other than returning right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine its position and starts reading it. The $encapsulate parameter determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and generation number in the array, so that effectively the object's data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object's ID and generation number is lost. Both types of return values have their uses: if you want to retrieve an object with the intention of modifying it, you will probably want it encapsulated, so that you can later rewrite it back to the stream. If, on the other hand, you're just trying to retrieve a value, as you would, for example, if you were reading a stream object and you wanted to determine its length, the non-encapsulated version will be easier to handle. Speaking of retrieving streams, even though my code doesn't perform that function (since I'm not writing a PDF reader), if you intend to add it, the pdf_resolve_object() function is safe to use because it saves the file pointer's current position before reading the object and restores it afterwards. If the function didn't do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the file, and you would be unable to read the rest of the stream.
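The save-and-restore pattern mentioned above is easy to isolate. The sketch below is not the pdf_resolve_object() function from Listing 6, which is not reproduced here; with_saved_position() is a hypothetical helper that only demonstrates how ftell() and fseek() keep the caller's position intact.

```php
<?php

// Hypothetical helper: run $reader with the file pointer moved to
// $offset, then put the pointer back where it was.
function with_saved_position($f, int $offset, callable $reader)
{
    // Remember where the caller was reading
    $saved = ftell($f);

    fseek($f, $offset);
    $result = $reader($f);

    // Restore the original position before returning
    fseek($f, $saved);

    return $result;
}

$f = fopen('php://memory', 'w+');
fwrite($f, '0123456789');
fseek($f, 2);

// Read three bytes from offset 6 without disturbing the caller
$chunk = with_saved_position($f, 6, function ($f) {
    return fread($f, 3);
});

echo $chunk, "\n";    // prints 678
echo ftell($f), "\n"; // prints 2
```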

Let's go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php).

The reason why we have a separate function just to read through the /Pages element of the root object is that, as I mentioned in last month's article, the pages could be nested in an arbitrary combination of /Page and /Pages dictionaries, so that we may need to recurse into the function several times in order to end up with an array that contains only page elements. It is important to understand that the order in which the pages are resolved by using this method doesn't necessarily correspond to the logical order in which they will appear to the user; that is, the first page in the list is not necessarily the first page of the document. The PDF specification provides a different set of facilities for determining the logical page order, but, technically speaking, you should only be interested in that if you want to display the contents of a document. In practical terms, I have never found an occasion in which the logical and physical page order didn't coincide. At most, there might be a fixed discrepancy because the document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.
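The recursion over nested /Pages and /Page dictionaries can be sketched on an in-memory tree. This is a simplified illustration and not the pdf_read_pages() function from Listing 7: collect_pages() is a hypothetical name, the nodes are plain PHP arrays rather than typed values, the 'label' key is made up for the example, and the children are assumed to be already resolved instead of being indirect references.

```php
<?php

// Hypothetical helper: walk a page tree and append every /Page
// node to $pages, in the order in which it is encountered.
function collect_pages(array $node, array &$pages): void
{
    if ($node['/Type'] === '/Page') {
        // A leaf: this is an actual page
        $pages[] = $node;
        return;
    }

    // A /Pages node: recurse into each of its children
    foreach ($node['/Kids'] as $kid) {
        collect_pages($kid, $pages);
    }
}

$tree = [
    '/Type' => '/Pages',
    '/Kids' => [
        ['/Type' => '/Page', 'label' => 'A'],
        [
            '/Type' => '/Pages',
            '/Kids' => [
                ['/Type' => '/Page', 'label' => 'B'],
                ['/Type' => '/Page', 'label' => 'C'],
            ],
        ],
    ],
];

$pages = [];
collect_pages($tree, $pages);

foreach ($pages as $page) {
    echo $page['label'], "\n"; // prints A, B and C on separate lines
}
```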

In our sample script, we only take into consideration page 1 (which is the zeroth element of the resulting pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page. Here, again, we need a dedicated function because, as you may remember, the resource dictionary is an inheritable

Listing 4 (continued)

                            }

                            // If we get to this point, the first
                            // numeric value was just a numeric value:
                            // push the extra tokens back onto the stack

                            array_push($c->stack, $tok3);
                        }
                    }

                    array_push($c->stack, $tok2);
                }

                return array(PDF_TYPE_NUMERIC, $token);
            } else {

                // Just a token: return it

                return array(PDF_TYPE_TOKEN, $token);
            }

    }
}

?>

Title: In the Belly of the Beast: Interpreting and Manipulating PDF Files
Author: Marco Tabini
Field: Web Development
Category: Feature
Year of publication: 2004
City: Toronto
