
DOCUMENT INFORMATION

Basic information

Title: In The Belly Of The Beast
Author: Marco Tabini
Institution: PHP Architect
Subject: PHP Programming
Type: Article
Year of publication: 2004
City: Toronto
Pages: 5
File size: 443.59 KB


FEATURE

In the Belly of the Beast

Interpreting and Manipulating PDF Files

by Marco Tabini

In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.

Welcome to part two of our little trip down PDF [...]. While last month we focused primarily on understanding what the structure of a PDF document is, this time over we'll look at the problem of altering the contents of a PDF file from a more practical perspective.

The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour because the [...] but it also does not follow a top-down logic. In other [...] a PDF file one doesn't start at the beginning and move [...] is true.

Since we'll often find ourselves jumping at various (and completely arbitrary) positions into the document, the first decision that we need to make is how we're going to access the data. While it is tempting to [...] such a good idea: if you consider that a PDF can have [...] memory we expose ourselves to the potential of clogging up large chunks of RAM, thus limiting our server's ability to process a large number of requests.

Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, [...] this case, you'd have to download the entire file before you could actually find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file. Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.

May 2004 · PHP Architect · www.phparch.com

The one notable exception to this rule is a special class of PDF documents known as "linearized PDF files".

A linearized PDF document contains a dictionary at the [...]ties for determining the location of the first page in the [...] table first. The structure of linearized PDF files is beyond [...] about it directly from the PDF specification document published by Adobe.

Getting Started

The first thing we need to do in order to be able to [...] determine where the cross-reference table and trailer dictionary [...] format of the startxref pointer is fixed. For example, in my document it looks like the following:

startxref

53593

%%EOF

REQUIREMENTS PHP: 4.3.0+

OS: Any


Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression [...] ends with a dollar sign, indicating that the resulting match must be anchored to the end of the data stream. Even though we're only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference [...] the cross-reference table pointer is not saved to the [...] 10 digits for the offset like the cross-reference entries themselves), you're not alone. This decision is a bit of a mystery, but it's something that we have to live with.

[...] the way: throughout the remainder of the article, you'll notice that I have created an individual include [...]. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you'll forgive me and that, if you decide [...] not follow the same layout.

Reading the Cross-reference Table

[...] figure out how to read the cross-reference table itself. If we move to offset 55,593 of the file, we'll find the following:

xref
0 22
0000000000 65535 f
0000005632 00000 n
0000006483 00000 n
0000006509 00000 n

The word xref is followed by the first object represented [...] entries that follow (twenty-two); we'll call this the [...] selves: for each line, we have the offset at which the object can be found (10 characters), followed by the generation number (5 characters) and by a flag: n for objects that are in use or f for objects that are free.
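As a small illustration of the fixed-width layout just described, here is a sketch that splits one entry into its parts. The helper name parse_xref_entry() is hypothetical and is not part of the article's listings; the field widths (10-digit offset, 5-digit generation, one-letter flag) are taken from the description above.

```php
<?php
// Hypothetical helper: splits a single cross-reference entry such as
// "0000005632 00000 n" into its offset, generation and in-use flag.
function parse_xref_entry(string $line): ?array
{
    // Entries may end with any of the line-ending conventions
    $line = rtrim($line, " \r\n");

    if (!preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
        return null; // not a well-formed entry
    }

    return [
        'offset'     => (int) $m[1],
        'generation' => (int) $m[2],
        'in_use'     => $m[3] === 'n',
    ];
}
```

A regular expression is used here instead of substr() so that malformed lines are rejected rather than silently misread; Listing 2 takes the simpler substr() route.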

There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it. However, you should keep in mind that PDF files always use the Windows convention for identifying [...] ily elsewhere) and, therefore, you must instruct the PHP interpreter to do so as well, regardless of the platform your script is running on. This can be accomplished by turning on the auto_detect_line_endings INI directive [...] this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it back to its original [...] because it is possible that other portions of our script [...] than the one we need it in.

Another gotcha when reading the cross-reference table is that there may be more than one block of entries: once you've read out all the entries, you could find another header followed by a new set of [...] didn't check for this possibility and simply assume that [...] our code would be unable to read most documents [...] that's the situation in which partial cross-reference tables are most likely to be found.

As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise [...] fact that the cross-reference table is formatted using a very stylized layout, so that we can take advantage of [...] vided by PHP.

Listing 1 (findxref.php)

/*
 * Returns the offset of the most recent
 * cross-reference table in the file
 */
function pdf_find_xref ($f) {
  // First, seek to the end of the file,
  // allowing for 50 bytes just so that
  // we have enough data to look into
  fseek ($f, -50, SEEK_END);

  // Next, try to find the proper sequence
  // of data. Note that the information can be
  // separated by a Windows-style, Mac-style or
  // unix-style newline
  $data = fread ($f, 50);

  if (!preg_match ('/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/',
                   $data, $matches)) {
    die ("Unable to find pointer to xref table");
  }

  // If we get here, then we have the offset
  // where the most recently introduced xref table is
  return (int) $matches[1];
}

The only aspect of pdf_read_xref() that we have not explored is the little segment of code that starts at line 84 and ends at line 100. This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information [...] that the trailer dictionary ends up in an associative array. If you remember from last month's article, files that have been modified usually contain more than one [...] of a /Prev key/value pair in the trailer, with a pointer to [...] simply recurses onto itself until all the cross-reference [...] information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones by the [...] the first case, and by merging the trailer arrays in a particular order in the second.
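To make the "newer data wins" rule concrete, here is a minimal sketch on toy data. It is iterative rather than recursive, so it is not the article's pdf_read_xref(); the function name merge_xref_sections() and the sample offsets are made up for illustration. The only behaviour relied upon is that array_merge() lets later arguments override earlier ones for string keys.

```php
<?php
// Toy data: the newest section is read first, then the one reached via /Prev.
$newest = [
    'xref'    => [4 => [0 => 9100]],
    'trailer' => ['/Size' => 23, '/Prev' => 54655],
];
$older = [
    'xref'    => [4 => [0 => 6509], 5 => [0 => 7000]],
    'trailer' => ['/Size' => 22, '/Root' => '3 0 R'],
];

// Hypothetical helper: combines sections that are given newest first.
function merge_xref_sections(array $sectionsNewestFirst): array
{
    $result = ['xref' => [], 'trailer' => []];

    foreach ($sectionsNewestFirst as $section) {
        foreach ($section['xref'] as $obj => $generations) {
            foreach ($generations as $gen => $offset) {
                // First writer wins: a newer offset is already in place
                if (!isset($result['xref'][$obj][$gen])) {
                    $result['xref'][$obj][$gen] = $offset;
                }
            }
        }

        // Older trailer first, accumulated (newer) data second:
        // for string keys, the later argument overrides the earlier one
        $result['trailer'] = array_merge($section['trailer'], $result['trailer']);
    }

    return $result;
}

$merged = merge_xref_sections([$newest, $older]);
```

Object 4 keeps the offset from the newest section, object 5 is picked up from the older one, and /Size comes from the newest trailer while /Root survives from the older one.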

Writing a PDF Lexer

Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it's time to try and read them. We could, in theory, write a series of ad-hoc functions that try to [...] are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).

Listing 2 (readxref.php)

/*
 * Reads a cross-reference table
 *
 * If $offset is provided and $start and $end are
 * set to null, the function will start scanning the
 * xref table from the current position in the file.
 * If more than one part of the xref table is present,
 * the function will recurse onto itself as many times
 * as needed
 */
function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null) {
  // If we didn't get a start and end, we need
  // to get them from the document itself
  if (is_null ($start) || is_null ($end)) {
    // Move to the start of the table
    fseek ($f, $offset);

    // Make sure that PHP keeps track of
    // the line endings properly
    $old_ini = ini_get ('auto_detect_line_endings');
    ini_set ('auto_detect_line_endings', true);

    // Get a line of text from the file
    $data = trim (fgets ($f));

    // Make sure the xref marker is where we
    // expect it
    if ($data !== 'xref') {
      die ("Unable to find xref table");
    }

    // Now get the next line and split it across
    // a single space character
    $data = explode (' ', trim (fgets ($f)));

    // Make sure the format is what we expected
    if (count ($data) != 2) {
      die ("Unexpected header in xref table");
    }

    // Calculate the start and end object
    // in the xref table
    $start = $data[0];
    $end = $start + $data[1];
  }

  if (!isset ($result['xref_location'])) {
    $result['xref_location'] = $offset;
  }

  if (!isset ($result['max_object']) || $end > $result['max_object']) {
    $result['max_object'] = $end;
  }

  // Now cycle through each object pointer
  for (; $start < $end; $start++) {
    // Get a line of text from the file and extract
    // the information out of there
    $data = trim (fgets ($f));
    $offset = substr ($data, 0, 10);
    $generation = substr ($data, 11, 5);

    // Entries from newer tables are never overwritten
    if (!isset ($result['xref'][$start][(int) $generation])) {
      $result['xref'][$start][(int) $generation] = (int) $offset;
    }
  }

  // Get the next line, which could either be the beginning
  // of the trailer dictionary or the header of another
  // xref section
  $data = trim (fgets ($f));

  if ($data === 'trailer') {
    // Read the trailer dictionary
    // [parts of this branch, including the restoring of
    // auto_detect_line_endings, are not legible in the scan]
    $c = new pdf_context ($f);
    $trailer = pdf_read_value ($c);

    // Check whether there is a /Prev
    // entry, which indicates that there [...]
    if (isset ($trailer['/Prev'])) {
      pdf_read_xref ($f, $result, $trailer['/Prev']);
      $result['trailer'] = array_merge ($result['trailer'], $trailer);
    }
    // [...]
  } else {
    // We have another xref segment to read. Extract the start
    // and length, and recurse into this function
    $data = explode (' ', $data);
    pdf_read_xref ($f, $result, null, $data[0], $data[0] + $data[1]);
  }
}

Our lexer will take the input from the PDF file and split it in individual tokens according to a particular set [...] reducing the contents of this article in a series of words [...] token), we would establish that a token is either a set of characters or a punctuation mark, assuming that [...] tance to us.

Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out [...] semantic value (meaning that it is used only for the purpose of delimiting tokens and has no other purpose). Whitespace is composed of space characters, newlines and line feeds.

This would be enough to cover most situations, but in some cases you'll find that tokens are not always [...] (including some of Adobe's own) "optimize" a PDF file [...] whitespace characters where the distinction between [...] example, consider the following snippet of PDF code that shows the beginning of a dictionary:

<< /Entry (Value) >>

The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of [...] could only appear outside of a literal string to indicate [...] the second open angular bracket and delimit a token before the next character, whatever that is. Therefore, the snippet above could be rewritten as follows:

<</Entry (Value)>>

Clearly, whitespace isn't enough to delimit a token [...] classes that can be used for the same purpose. Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is.
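The following sketch shows the same idea on an in-memory string, so that the effect of delimiter characters can be tried without a file. It is a simplified stand-in, not the article's pdf_read_token(): the function name tokenize_fragment() is hypothetical, there is no buffering, and literal strings are not treated specially (the parentheses come back as separate tokens).

```php
<?php
// Simplified, in-memory tokenizer: whitespace separates tokens, and the
// delimiter characters [ ] ( ) < > / end a token even without whitespace.
function tokenize_fragment(string $data): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($data);

    while ($offset < $length) {
        // Skip whitespace, which has no meaning of its own
        $offset += strspn($data, " \t\r\n\f\0", $offset);
        if ($offset >= $length) {
            break;
        }

        $char = $data[$offset++];

        if (strpos('[]()', $char) !== false) {
            // Array and literal string delimiters are one character long
            $tokens[] = $char;
        } elseif ($char === '<' || $char === '>') {
            // "<<" and ">>" delimit dictionaries, "<" and ">" hex strings
            if ($offset < $length && $data[$offset] === $char) {
                $offset++;
                $tokens[] = $char . $char;
            } else {
                $tokens[] = $char;
            }
        } else {
            // Anything else runs until the next whitespace or delimiter
            $run = strcspn($data, " \t\r\n\f\0[]<>()/", $offset);
            $tokens[] = $char . substr($data, $offset, $run);
            $offset += $run;
        }
    }

    return $tokens;
}
```

Both spellings of the dictionary shown above, with and without the optional whitespace, produce the same six tokens.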

This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of. The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:

+ Create a memory-based buffer for the file's contents

+ Keep track of the current pointer in the file and of the length of the buffer

+ Maintain a stack of tokens that have been read from the file but not yet used

The necessity of creating a buffer here arises from the [...] single character at a time out of the file. By reading a fixed [...] in memory, we can save ourselves a few expensive [...] portion of the system that is responsible for interpreting the meaning of the tokens (more about that later). Note that there is no compelling reason to store this information in a class, other than the convenience factor of having a convenient PHP syntax to work with [...] avoid OOP altogether, although, in my opinion, that [...] easier to introduce bugs that would be tough to find and fix.

Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple [...] current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at [...] end of the token varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string [...] character, since they both share the same initial open [...] simply scan the file until we end up in a different character class.

Parsing the Data

Next in the list, we need to be able to understand the meaning of each token in the context of the PDF file [...] construct: the parser.

Parsers can be very complicated, and are usually not coded by hand; in most cases, a developer would use [...] the parser to a relatively complex finite-state machine [...] languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines worth of PHP. Before introducing another listing, however, let's consider the types of data that we need to deal with. For [...] values, for example, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.

Listing 3 (tokenizer.php)

/*
 * This class is used to
 * read data from the input
 * file in a bufferized way
 */
class pdf_context {
  var $file;
  var $buffer;
  var $offset;
  var $length;
  var $stack;

  function pdf_context ($f) {
    $this->file = $f;
    $this->reset();
  }

  // Optionally move the file pointer to [...]
  function reset ($pos = null) {
    if (!is_null ($pos)) {
      fseek ($this->file, $pos);
    }
    $this->buffer = fread ($this->file, 100);
    $this->offset = 0;
    $this->length = strlen ($this->buffer);
    $this->stack = array();
  }

  // Make sure that there is at least one [...] not exist
  function ensure_content() {
    if ($this->offset >= $this->length - 1) {
      return $this->increase_length();
    } else {
      return true;
    }
  }

  // (the condition of this method is not legible in
  // the scan; an end-of-file check is assumed)
  function increase_length() {
    if (feof ($this->file)) {
      return false;
    } else {
      $this->buffer .= fread ($this->file, 100);
      $this->length = strlen ($this->buffer);
      return true;
    }
  }
}

function pdf_read_token (&$c) {
  // If there is a token available
  // on the stack, pop it out and
  // return it
  if (count ($c->stack)) {
    return array_pop ($c->stack);
  }

  // Strip away any whitespace
  // (only the start of this loop is legible in the
  // scan; skipping with strspn() is assumed)
  do {
    if (!$c->ensure_content()) {
      return false;
    }
    $c->offset += strspn ($c->buffer, " \r\n", $c->offset);
  } while ($c->offset >= $c->length - 1);

  // Get the first character in the stream
  $char = $c->buffer[$c->offset++];

  switch ($char) {
    case '[' :
    case ']' :
    case '(' :
    case ')' :
      // This is either an array or literal string
      // delimiter. Return it
      return $char;

    case '<' :
    case '>' :
      // [...] Determine the
      // appropriate case and return the token
      if ($c->buffer[$c->offset] == $char) {
        $c->offset++;
        return $char . $char;
      } else {
        return $char;
      }

    default :
      // This is "another" type of token (probably [...])
      // Find the end and return it
      while (1) {
        // Determine the length of the token
        $pos = strcspn ($c->buffer, " []<>()\r\n/", $c->offset);
        if ($c->offset + $pos < $c->length - 1) {
          break;
        } else {
          $c->increase_length();
        }
      }
      // (the extraction of the token is not legible
      // in the scan and is reconstructed here)
      $result = substr ($c->buffer, $c->offset - 1, $pos + 1);
      $c->offset += $pos;
      return $result;
  }
}

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token, because a closed parenthesis could be escaped by a backslash and, therefore, its presence alone does not [...] this problem is taken care of by switching the machine [...]

We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that [...] some additional code that looks for a parenthesis not [...] even number? Because the backslashes themselves can [...] Therefore, an even number of backslashes means that they are all escaped and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. [...] The parenthesis becomes an "orphan" and escapes the string.

Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality [...] parenthesis token is returned by the tokenizer, the code simply keeps scanning the input file until it finds an unescaped closed parenthesis.
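The even/odd backslash rule discussed above can be sketched in a few lines. The helper names is_escaped() and find_string_end() are hypothetical and do not come from the article's listings, and the sketch works on an in-memory string rather than on the file buffer. It also ignores the fact that PDF allows balanced, unescaped parentheses inside a literal string.

```php
<?php
// Returns true when the character at $pos is preceded by an odd number
// of backslashes, which means that the last backslash escapes it.
function is_escaped(string $data, int $pos): bool
{
    $backslashes = 0;
    for ($i = $pos - 1; $i >= 0 && $data[$i] === '\\'; $i--) {
        $backslashes++;
    }
    return $backslashes % 2 === 1;
}

// Returns the position of the first unescaped closed parenthesis at or
// after $start, or false if there is none.
function find_string_end(string $data, int $start)
{
    $pos = $start;
    while (($pos = strpos($data, ')', $pos)) !== false) {
        if (!is_escaped($data, $pos)) {
            return $pos;
        }
        $pos++; // escaped: keep looking after this parenthesis
    }
    return false;
}
```

With one backslash in front of it a parenthesis is skipped; with two, the backslashes escape each other and the parenthesis ends the string.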

The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up by three tokens. Therefore, [...] able to tell whether it is part of a larger element until it has read at least one more token, and potentially two.

The problem here is not with reading the tokens; it's [...] numeric value turns out to be just a numeric value.

We could, in theory, put the extra tokens "back in the buffer" by rolling back the offset pointer in the buffer [...] be difficult to do, since we don't really know how many [...] with.

Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the [...] pdf_read_token() checks whether anything is present and returns it, without even reading one character from the file buffer.
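The push-back idea can be tried in isolation with a small stand-in for the token source. The class TokenSource and the function read_number_or_ref() below are hypothetical names used only for this sketch; tokens come from a fixed list instead of the file buffer, but tokens that were handed back are served first, as described above.

```php
<?php
// Minimal stand-in for the tokenizer: pushed-back tokens are served
// before anything new is read.
class TokenSource
{
    private $pending;     // tokens that have not been read yet
    private $stack = [];  // tokens handed back by the parser

    public function __construct(array $tokens)
    {
        $this->pending = $tokens;
    }

    public function read()
    {
        if (count($this->stack)) {
            return array_pop($this->stack);
        }
        return count($this->pending) ? array_shift($this->pending) : false;
    }

    public function unread($token): void
    {
        array_push($this->stack, $token);
    }
}

// Reads either "N G R" (an indirect reference) or a plain number.
function read_number_or_ref(TokenSource $src)
{
    $first  = $src->read();
    $second = $src->read();
    $third  = $src->read();

    if ($second !== false && is_numeric($second) && $third === 'R') {
        return ['ref', (int) $first, (int) $second];
    }

    // Not a reference: hand the look-ahead tokens back, last one first,
    // so that they are popped again in their original order
    if ($third !== false) {
        $src->unread($third);
    }
    if ($second !== false) {
        $src->unread($second);
    }
    return ['num', $first];
}
```

Because the stack is last-in, first-out, the tokens have to be pushed in reverse order; Listing 4 does the same with $tok3 and $tok2.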

You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number [...] data types, and they are. Since we'll be reading and [...] the object types as we read them from the stream. To [...] zeroth element indicates the type, while element 1 contains the actual value, which varies depending on the nature of the data. Thus, for example, the trailer dictionary could look like this:

Array (
  PDF_TYPE_DICTIONARY,
  Array (
    '/Size' => array (PDF_TYPE_NUMERIC, [...]),
    '/Root' => array (PDF_TYPE_OBJREF, 3 [...]),
    '/Prev' => array (PDF_TYPE_NUMERIC, 54655)
  )
)

Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is; the code is quite [...] each value is actually stored in an array whose zeroth [...] data type of a type practically immediate, which will [...] to write objects back to the file.

Before moving on to the next step, note that we make no provision in our lexer for reading stream data. This is because we are not intent on interpreting every [...] elements that allow us to modify its contents. However, adding support for streams shouldn't be too much of a [...] references, which we'll add shortly, since the length of a stream is often expressed in that way.

Getting to the Root of the Problem

All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.

Listing 5 (index.php) is our main script and, unfortunately, it's too large to show here; you will, however, [...] will hopefully be able to follow me there.

Listing 4 (readvalue.php)

// Define various data types
// that we use throughout the system
// (only the values 0, 4 and 10 are legible in the scan;
// the other values are assumed)
define ('PDF_TYPE_NULL', 0);
define ('PDF_TYPE_NUMERIC', 1);
define ('PDF_TYPE_TOKEN', 2);
define ('PDF_TYPE_HEX', 3);
define ('PDF_TYPE_STRING', 4);
define ('PDF_TYPE_DICTIONARY', 5);
define ('PDF_TYPE_ARRAY', 6);
define ('PDF_TYPE_OBJDEC', 7);
define ('PDF_TYPE_OBJREF', 8);
define ('PDF_TYPE_OBJECT', 9);
define ('PDF_TYPE_STREAM', 10);

/*
 * Reads a value from the current
 * data stream
 */
function pdf_read_value (&$c, $token = null) {
  // Get a token from the stream
  if (is_null ($token)) {
    $token = pdf_read_token ($c);
  }

  if ($token === false) {
    return false;
  }

  switch ($token) {
    case '<' :
      // This is a hex string
      // Read the value, then the terminator
      $s = pdf_read_token ($c);
      if ($s === false) {
        return false;
      }
      $term = pdf_read_token ($c);
      if ($term != '>') {
        die ("Unexpected data after hex string");
      }
      return array (PDF_TYPE_HEX, $s);
      break;

    case '<<' :
      // This is a dictionary
      $result = array();

      // Recurse into this function until we reach
      // the end of the dictionary
      while (($key = pdf_read_token ($c)) !== '>>') {
        if ($key === false) {
          return false;
        }
        if (($value = pdf_read_value ($c)) === false) {
          return false;
        }
        $result[$key] = $value;
      }
      return array (PDF_TYPE_DICTIONARY, $result);

    case '[' :
      // This is an array
      $result = array();

      // Recurse into this function until we reach
      // the end of the array
      while (($token = pdf_read_token ($c)) !== ']') {
        if ($token === false) {
          return false;
        }
        if (($value = pdf_read_value ($c, $token)) === false) {
          return false;
        }
        $result[] = $value;
      }
      return array (PDF_TYPE_ARRAY, $result);

    case '(' :
      // This is a string
      $pos = $c->offset;

      while (1) {
        // Start by finding the next closed
        // parenthesis
        $found = strpos ($c->buffer, ')', $pos);

        // If you can't find it, try
        // reading more data from the stream
        // (strpos() returns false, not -1, when nothing is found)
        if ($found === false) {
          if (!$c->increase_length()) {
            return false;
          }
          continue;
        }
        $pos = $found;

        // Make sure that there is no backslash before
        // the parenthesis. If there is,
        // move on. Otherwise, return the string
        if ($c->buffer[$pos - 1] !== '\\') {
          $result = substr ($c->buffer, $c->offset, $pos - $c->offset);
          $c->offset = $pos + 1;
          return array (PDF_TYPE_STRING, $result);
        } else {
          $pos++;
          if ($pos > $c->offset + $c->length) {
            $c->increase_length();
          }
        }
      }

    default :
      if (is_numeric ($token)) {
        // A numeric token. Make sure that it is not
        // part of something else
        if (($tok2 = pdf_read_token ($c)) !== false) {
          if (is_numeric ($tok2)) {
            // Two numeric tokens in a row. In this case, we're
            // probably in front of either an object reference
            // or an object specification.
            // Determine the case and return the data
            if (($tok3 = pdf_read_token ($c)) !== false) {
              switch ($tok3) {
                case 'obj' :
                  return array (PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                case 'R' :

(continued below)

Once we have declared a few variables that we'll end up using throughout the script, we read the cross-reference [...] to retrieve the Root object from it. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won't help us much.

This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the [...] used to determine whether any object is an indirect reference [...] thing that will come in handy at pretty much every step of the way.

As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object [...] do, other than returning right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine its position and [...] determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and [...] object's data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object's ID [...] values have their uses: if you want to retrieve an object [...] want it encapsulated, so that you can later rewrite it back to the stream. If, on the other hand, you're just trying to retrieve a value, as you would, for example, if [...] determine its length, the non-encapsulated version will [...] even though my code doesn't perform that function [...] add it, the pdf_resolve_object() function is safe to use [...] before reading the object and restores it afterwards. If the function didn't do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the stream.

Let's go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php).

The reason why we have a separate function just to read through the /Pages element of the root object is [...] could be nested in an arbitrary combination of /Page [...] into the function several times in order to end up with [...] important to understand that the order in which the pages are resolved by using this method doesn't necessarily [...] appear to the user; that is, the first page in the list is [...] specification provides a different set of facilities for [...] speaking, you should only be interested in that if you [...] practical terms, I have never found an occasion in which the logical and physical page order didn't coincide, at [...] document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.
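The recursion over nested page nodes can be illustrated on a toy tree. In this sketch plain PHP arrays stand in for resolved PDF objects, so it is not the article's pdf_read_pages(); the function name collect_pages() and the 'label' key are made up for the example.

```php
<?php
// Walks a toy page tree depth-first and collects the leaf pages in order.
function collect_pages(array $node, array &$result): void
{
    foreach ($node['/Kids'] as $kid) {
        if ($kid['/Type'] === '/Pages') {
            // An intermediate node: descend into it
            collect_pages($kid, $result);
        } else {
            // A leaf: an actual page
            $result[] = $kid['label'];
        }
    }
}

$tree = [
    '/Type' => '/Pages',
    '/Kids' => [
        ['/Type' => '/Page', 'label' => 'first'],
        ['/Type' => '/Pages', '/Kids' => [
            ['/Type' => '/Page', 'label' => 'second'],
            ['/Type' => '/Page', 'label' => 'third'],
        ]],
        ['/Type' => '/Page', 'label' => 'fourth'],
    ],
];

$pages = [];
collect_pages($tree, $pages);
```

After the call, $pages holds the four labels in the order in which a depth-first walk meets them, regardless of how deeply the intermediate nodes are nested.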

Listing 4 (continued)

                  return array (PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
              }

              // If we get to this point, that numeric value up
              // there was just a numeric value. Put the extra
              // tokens back into the stack and return the value
              array_push ($c->stack, $tok3);
            }
          }
          array_push ($c->stack, $tok2);
        }
        return array (PDF_TYPE_NUMERIC, $token);
      } else {
        // Just a token. Return it
        return array (PDF_TYPE_TOKEN, $token);
      }
  }
}

In our sample script, we only take in consideration page 1 (which is the zeroth element resulting from the pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the [...] need a dedicated function because, as you may [...] resource, so that if there isn't one associated with the page itself, there may be one associated with its parent, [...] sample file that we were looking at last month, the [...] with every page and with their parent (the /Pages dictionary [...] optimization, but it is perfectly acceptable (for the record, the PDF was created on Linux by exporting an OpenOffice.org 1.1 file).
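The "look at the page, then at its parent" rule can be sketched with in-memory data. The function find_inherited() and the array layout below are hypothetical; Listing 8 performs the same walk recursively on resolved PDF objects. The sketch assumes that the parent chain contains no cycles.

```php
<?php
// Toy objects keyed by ID; '/Parent' holds the ID of the parent node.
$objects = [
    1 => ['/Type' => '/Pages', '/Resources' => ['/Font' => ['/F1' => 'Helvetica']]],
    2 => ['/Type' => '/Page', '/Parent' => 1],
    3 => ['/Type' => '/Page', '/Parent' => 1, '/Resources' => ['/Font' => []]],
];

// Returns the value of $key from the object itself or from its nearest
// ancestor that defines it, or false if nobody does.
function find_inherited(array $objects, int $id, string $key)
{
    while (isset($objects[$id])) {
        if (isset($objects[$id][$key])) {
            return $objects[$id][$key];
        }
        if (!isset($objects[$id]['/Parent'])) {
            return false;
        }
        $id = $objects[$id]['/Parent'];
    }
    return false;
}
```

Page 2 has no resources of its own and falls back to its parent, while page 3 uses the dictionary attached directly to it.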

Next, we need to find the font resources, so that we can append our own to the existing ones. Finding the [...] any of its predecessors so far, since it's either there, in which case we piggyback on it, or it isn't, in which case [...] associated with the page. The only difficulty here is in finding a name for the font resource that doesn't conflict with one that already exists. The approach that I have taken is to simply run through all the resources available and look at those called /Fx, where x is a numerical value. The font resource we create is the next highest available [...] used is /F10, ours will be /F11. Note that this choice [...] combination you like, as long as it starts with a letter and not a digit.
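The naming rule is easy to sketch. The function next_font_name() is a hypothetical helper, and the font dictionary is represented here as a plain array keyed by resource name, which is a simplification of the parsed structure used in the listings.

```php
<?php
// Returns the next free /Fx name, where x is one more than the highest
// number already in use. Names that do not follow the /Fx pattern are
// ignored.
function next_font_name(array $fontDict): string
{
    $highest = 0;

    foreach (array_keys($fontDict) as $name) {
        if (preg_match('/^\/F(\d+)$/', (string) $name, $m)) {
            $highest = max($highest, (int) $m[1]);
        }
    }

    return '/F' . ($highest + 1);
}
```

With /F1 and /F10 already present the result is /F11; with no fonts at all it is /F1.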

The font resource that we create and add to the font dictionary is the simplest possible one: it uses the Helvetica font, which must be supported by every PDF reader and, therefore, doesn’t need to be embedded in the document itself

Listing 6

<?php
/*
 * Resolves an object reference,
 * ensuring that the result value
 * is always a direct object
 */
function pdf_resolve_object (&$c, $obj_spec, $encapsulate = true) {
  global $xref_data;

  // Exit if we get invalid data
  if (!is_array ($obj_spec)) {
    return false;
  }

  if ($obj_spec[0] == PDF_TYPE_OBJREF) {
    // This is a reference, resolve it
    if (isset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]])) {
      // Save current file position
      // This is needed if you want to resolve
      // references while you're reading another object
      // (e.g.: if you need to determine the length
      // of a stream)
      $old_pos = ftell ($c->file);

      // Reposition the file pointer and
      // load the object header
      $c->reset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]]);
      $header = pdf_read_value ($c);

      if ($header[0] != PDF_TYPE_OBJDEC || $header[1] != $obj_spec[1] ||
          $header[2] != $obj_spec[2]) {
        die ("Unable to find object ({$obj_spec[1]}, {$obj_spec[2]}) at expected location");
      }

      // If we're being asked to store all the information
      // about the object, we add the object ID and generation
      // number for later use
      if ($encapsulate) {
        $result = array (
          PDF_TYPE_OBJECT,
          'obj' => $obj_spec[1],
          'gen' => $obj_spec[2]
        );
      } else {
        $result = array();
      }

      // Now simply read the object data until
      // we encounter an end-of-object marker
      while (1) {
        $value = pdf_read_value ($c);
        if ($value === false) {
          return false;
        }
        if ($value[0] == PDF_TYPE_TOKEN && $value[1] === 'endobj') {
          break;
        }
        $result[] = $value;
      }

      $c->reset ($old_pos);
      return $result;
    }
  } else {
    return $obj_spec;
  }
}

/*
 * Generates a new object container
 * with the proper object ID and
 * a generation number of zero
 */
function pdf_new_object () {
  global $xref_data;

  return array (
    PDF_TYPE_OBJECT,
    'obj' => $xref_data['max_object']++,
    'gen' => 0
  );
}

Graffiti on the Wall

We've now come to the part where we actually need to "write" some text on the document. Unfortunately, this involves a few steps.

First, the concept of drawing pretty much anything on a page requires a series of commands that PDF borrows from PostScript. In order for the reader to recognize them, we'll have to encapsulate them in a stream, and add that stream to the contents of the page.

When drawing text on the screen, a certain number

of transformations can be applied to it: translation (so choice}, rotation and scaling In our case, we will only deal with the first two

The transformations are applied using a simple matrix; unfortunately, we do not have enough space the PDF specification document does a pretty good job

of that, so I'll refer you to it Instead, let us focus on the commands used to apply the transformation itself;

here's an example:

Listing 7

Ma Mb Mc Md x y Tm Looks cryptic, doesn’t it? The first four elements of the express the rotation that should be applied to the text They can also be used to determine the scale, but, as | mentioned, that is beyond the scope of this article The

x and y parameters, on the other hand, indicate the coordinates at which we want the text to apply Finally,

Tm isthe command itself, which tellsthe PDFinterpreter

to apply these values to the text transformation matrix tion call is the exact opposite of what we are used to in PHP (where we use function (paraml, param2, } This format is called “Reverse Polish Notation” and is often

<2php // Creates a list of all the pages // that are present in a document function pdf_read_pages (8c, &$pages, &fresult) { // Get the kids dictionary Skids = pdf_resolve_object (Sc, §pages[1][1]['/Kids']);

foreach ($kids[1] as $v) {

$pq = pdf_resolve_object ($c, $v);

1f (§v[1][I]['/Type'] === ‘Pages*) £ // Tf one of the kids is an embedded // /Pages array, resolve it as well

pdf_read_pages ($c, $v, $result);

else

$resu]t[] = $pg:

Listing 8

<?php

* Finds the resources associated with a page

af function pdf_find_resources (&ic, $obj) { $ebj = pdf_resolve_object (fc, $obj);

// Tf the current object has a resources // it Otherwise, we move back to its // parent object

if Cisset (Sobj[1] [1] [‘/Resources'])) { del

n pdf_resolve_object ($c, $obj[1][1][‘/Resources']);

se

if (1isset (§obj[1][1]['/Parent'])) {

n false;

yell }

se { return pdf_find_resources {§obj[1][1][*/Parent']);

}

May 2004 + PHP Architect + www.phparch.com

Trang 5

In the Belly of the Beast

eters, such as the PostScript virtual machine on which

the PDF specification is based
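To make the Tm operands more concrete, here is a minimal sketch that builds the command for a given rotation angle and position. The helper names pdf_text_matrix() and pdf_format_number() are hypothetical and do not appear in the article's listings. The sketch assumes no scaling, so the first four operands are simply the cosine and sine values of the rotation angle, as laid out in the PDF specification.

```php
<?php

/**
 * Formats a number without exponents or trailing zeros,
 * since PDF does not accept scientific notation.
 * (Hypothetical helper, not part of the article's code.)
 */
function pdf_format_number($value)
{
    $text = rtrim(rtrim(sprintf('%.5F', $value), '0'), '.');

    // Avoid printing "-0" for values that round to zero
    return ($text === '-0' || $text === '') ? '0' : $text;
}

/**
 * Builds a Tm command that rotates text by $degrees
 * (counter-clockwise) and places it at ($x, $y).
 * (Hypothetical helper; no scaling is applied.)
 */
function pdf_text_matrix($degrees, $x, $y)
{
    $rad = deg2rad($degrees);

    // Rotation-only matrix followed by the translation: Ma Mb Mc Md x y
    $operands = array(cos($rad), sin($rad), -sin($rad), cos($rad), $x, $y);

    $parts = array();
    foreach ($operands as $value) {
        $parts[] = pdf_format_number($value);
    }

    // Operands first, command last (Reverse Polish Notation)
    return implode(' ', $parts) . ' Tm';
}

echo pdf_text_matrix(0, 72, 720), "\n";   // 1 0 0 1 72 720 Tm
echo pdf_text_matrix(90, 100, 200), "\n"; // 0 1 -1 0 100 200 Tm
```

With an angle of zero the matrix reduces to the identity, so only the x and y coordinates have any effect.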

Next, we'll select a font that will be used to draw the text:

/F11 10 Tf

The Tf command sets the current font resource to /F11, with a size of 10 points. The size is a floating-point value, so that you could have text in size 12.5.

Before writing the text itself, we need to set the spacing between one line of text and the next. This is not as easy to determine as you may think, because it depends on how the font itself is designed. From a practical perspective, however, we can rely on an empirical default that works in most occasions. The TL command below sets the interline to five points:

5 TL

Finally, we can actually draw the text! This is done by using a combination of two commands. The text is actually drawn using the ' command (no, that's not a mistake: the command is a quotation mark). However, if a newline character is present in the text, it is simply not treated as a line break, so we replace each newline character with the T* command, which causes the drawing pointer to be reset to the next line.
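The commands discussed so far can be assembled into the body of a content stream roughly as follows. This is only a sketch under a few assumptions: the function names are hypothetical, the BT and ET operators that delimit a text object come from the PDF specification rather than from the text above, and the newline handling is a variant of the approach described (each non-empty line is shown with the ' command, which itself moves to the next line, while an empty line becomes a bare T*).

```php
<?php

/**
 * Escapes the characters that have a special meaning inside
 * a PDF literal string: the backslash and both parentheses.
 */
function pdf_escape_text($text)
{
    return str_replace(
        array('\\', '(', ')'),
        array('\\\\', '\\(', '\\)'),
        $text
    );
}

/**
 * Builds the commands that draw $text with the given text matrix,
 * font resource, font size and line spacing.
 * (Hypothetical helper, not part of the article's code.)
 */
function pdf_text_commands($matrix, $font, $size, $leading, $text)
{
    $commands = array(
        'BT',               // begin text object
        $matrix,            // e.g. "1 0 0 1 72 720 Tm"
        "$font $size Tf",   // select font and size
        "$leading TL",      // set spacing between lines
    );

    foreach (explode("\n", $text) as $line) {
        if ($line === '') {
            // Empty line: just move to the next line
            $commands[] = 'T*';
        } else {
            // Move to the next line and show the string
            $commands[] = '(' . pdf_escape_text($line) . ") '";
        }
    }

    $commands[] = 'ET';     // end text object

    return implode("\n", $commands) . "\n";
}

echo pdf_text_commands('1 0 0 1 72 720 Tm', '/F11', 10, 5, "Hello (PDF)\n\nWorld");
```

Escaping matters because an unbalanced parenthesis in the text would otherwise terminate the string early and corrupt the stream.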

Finally, all we need to do is update the page's /Contents array with a reference to our stream. Once again, we need to determine if there already is an array and what it contains, and act accordingly, so that we can add our own data to it.
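The three cases that the /Contents update has to handle can be sketched as shown below. Note that this uses a deliberately simplified model, not the nested arrays produced by the article's parser: a reference is a plain string such as "12 0 R", and the page is a flat PHP array. The function name page_add_content() is hypothetical.

```php
<?php

/**
 * Adds a stream reference to a page's /Contents entry.
 * Simplified, hypothetical model: /Contents is either absent,
 * a single reference string, or an array of reference strings.
 */
function page_add_content(array $page, $ref)
{
    if (!isset($page['/Contents'])) {
        // No contents yet: start a new array
        $page['/Contents'] = array($ref);
    } elseif (is_array($page['/Contents'])) {
        // Already an array: append our stream at the end
        $page['/Contents'][] = $ref;
    } else {
        // A single reference: promote it to an array,
        // keeping the existing stream first
        $page['/Contents'] = array($page['/Contents'], $ref);
    }

    return $page;
}

$page = page_add_content(array('/Contents' => '4 0 R'), '12 0 R');
echo implode(', ', $page['/Contents']), "\n"; // 4 0 R, 12 0 R
```

Appending rather than prepending means the new text is drawn after the existing page content, so it ends up on top of it.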

Writing it All Back

The final step before we can call it a day consists of writing our changes back to the file, so that they can be applied to the document. To do so, we first [...] beginning of the main script (Listing 5). Next, we call the pdf_write_objects() function to rewrite the objects that we modified back to the file. If you take a look at Listing 9 (writer.php), you'll notice that this function is, essentially, the reverse of pdf_read_value(), since it first [...] appropriate value.

[Figure 1: screenshot of Adobe Acrobat displaying out.pdf]

There are two things here that are worth mentioning. First, the information that we write back to the file is not a "true" delta: the resources dictionary may not have changed at all, for example, yet we write it out again. This is not very optimized, but it will do if you're only making small changes to a document, and it beats having to build a system that "remembers" what was changed.

Second, whenever an object is written to the stream, pdf_write_objects() "makes a note" of the file pointer's current position. This comes in handy afterwards, when we rebuild the cross-reference table by calling pdf_write_xref(). Here, we create the proper entries one at a time. This process could be optimized by grouping those entries belonging to consecutive objects together, but, since we are dealing only with small changes, it's hardly worth the trouble.
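The idea of noting each object's position and then emitting one cross-reference entry at a time can be illustrated with the sketch below. It is not the article's Listing 9: the function name and the input format (object IDs mapped to already-serialized bodies) are assumptions made for the example, and an in-memory stream stands in for the PDF file being updated. The entry layout (a 10-digit offset, a 5-digit generation number and the "n" flag) follows the PDF specification.

```php
<?php

/**
 * Appends objects to an open stream, then writes a cross-reference
 * section containing one single-entry subsection per object.
 * Returns the byte offset of the "xref" keyword.
 * (Hypothetical helper; $objects maps IDs to serialized bodies.)
 */
function write_objects_and_xref($fp, array $objects)
{
    $offsets = array();

    foreach ($objects as $id => $body) {
        // "Make a note" of where this object starts
        $offsets[$id] = ftell($fp);
        fwrite($fp, "$id 0 obj\n$body\nendobj\n");
    }

    $xref_pos = ftell($fp);
    fwrite($fp, "xref\n");

    foreach ($offsets as $id => $offset) {
        // Subsection header: first object number and entry count
        fwrite($fp, "$id 1\n");
        // Entry: offset, generation number, "n" for an object in use
        fwrite($fp, sprintf("%010d %05d n \n", $offset, 0));
    }

    return $xref_pos;
}

// Demo: the first line stands in for the existing file contents
$fp = fopen('php://memory', 'w+');
fwrite($fp, "%PDF-1.4\n");
$xref_pos = write_objects_and_xref($fp, array(
    3 => '<< /Type /Page >>',
    7 => '(hello)',
));
rewind($fp);
$output = stream_get_contents($fp);
fclose($fp);

echo $output;
```

Grouping consecutive object numbers into one subsection would only change the header lines; the entries themselves would stay the same.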

pdf_write_xref() terminates by writing the trailer dictionary, which contains a reference to the document's root object, which has not changed but must be there nonetheless, as well as a numeric value that declares the number of objects stored in the file and a pointer to the previous cross-reference table.
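As a rough illustration, the trailer described above could be produced like this. The function and parameter names are hypothetical; the /Size, /Root and /Prev keys, along with the startxref and %%EOF markers, are the names the PDF specification uses for the values mentioned in the text.

```php
<?php

/**
 * Builds the trailer that closes an incremental update.
 * (Hypothetical helper, not part of the article's code.)
 *
 * $size      number of objects declared for the file
 * $root_ref  reference to the unchanged root object, e.g. "1 0 R"
 * $prev_xref byte offset of the previous cross-reference table
 * $xref_pos  byte offset of the table that was just written
 */
function pdf_build_trailer($size, $root_ref, $prev_xref, $xref_pos)
{
    return "trailer\n"
        . "<< /Size $size /Root $root_ref /Prev $prev_xref >>\n"
        . "startxref\n"
        . "$xref_pos\n"
        . "%%EOF\n";
}

echo pdf_build_trailer(8, '1 0 R', 408, 65);
```

The /Prev pointer is what lets a reader fall back to the original cross-reference table for every object that the update did not touch.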

Where to Go From Here

That's it! As you can see, once one figures out how the format works, it is possible to read and modify the contents of a PDF file programmatically, which makes the belief, shared by many people, that PDF is a non-modifiable format quite strange.

Although the end result of our sample script is relatively simple (if you run it against the sample file that I included in this month's code for your convenience, you should see something like the output in Figure 1), the foundation on which it is built is quite solid and can be expanded upon to provide additional functionality.

Before parting ways, I just want to share one final tidbit of information with you. Working with PDF files can be frustrating, especially on Windows, because the Acrobat PDF viewer is about as useful for debugging as testing whether your house's electrical circuit is working by sticking your fingers in a socket. Two things can help. First, you can actually get Acrobat to provide you with more useful error messages by holding down the Control key when you dismiss the dialog that appears when you try to load a corrupted file. Second, you can use a tool (available online at www.planetpdf.com/mainpage.asp?webpageid=3463) to visually inspect the contents of your file and determine what is wrong with it.

Sometimes, however, it will be hard to figure bugs out. While I was writing this article, I lost lots of time debugging a problem that turned out to be just a [...] without any useful error. This brings us to the last tool you'll need plenty of: patience!

About the Author

Marco is the Publisher of (and a frequent contributor to) php|architect. He can be found trying to hack his computer into submission. You can write to him at marcot@phparch.com.

To Discuss this article:

http://forums.phparch.com/145
