beginning regular expressions (programmer to programmer)



Beginning Regular Expressions

Andrew Watt


Published simultaneously in Canada

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data:


About the Author

Andrew Watt is an independent consultant and experienced author with an interest and expertise in XML and Web technologies. He has written and coauthored more than 10 books on Web development and XML, including XPath Essentials and XML Schema Essentials. He has been programming since 1984, moving to Web development technologies in 1994. He’s a well-known voice in several influential online technical communities and is a frequent contributor to many Web development specifications.

First, I would like to thank Jim Minatel, the acquisitions editor who put the platform in place to get Beginning Regular Expressions off the ground at Wrox/Wiley. His patience, under significant provocation relating to timetable, and his tact, efficiency, and general good nature made those organizational aspects of the book an enjoyable experience to repeat at a future date.

The development editor, Marcia Ellett, was great to work with and did a lot to tidy up my prose to make a better read for all readers of this book. In addition, her eagle eyes spotted some minor slips that had slipped through the authorial net. Thanks, Marcia.

Doug Steele, a fellow Microsoft MVP, was technical editor and carried out a tactful and painstaking job and picked up many little things that the smoke from the author’s midnight oil seemed somehow to obscure. Thanks, Doug.

Darren Niemke, another MVP, helped with technical editing of a number of chapters. Thanks, Darren.

My thanks go, too, to the production staff at Wiley who, as is typically the case, the author never meets. Without their efforts in translating a manuscript into a finished product this book would not exist in its current form.


Mary Beth Wakefield

Vice President & Executive Group Publisher

Introduction
Who This Book Is For
What This Book Covers
How This Book Is Structured
What You Need to Use This Book
Conventions
p2p.wrox.com
What Are Regular Expressions?
What Can Regular Expressions Be Used For?
Regular Expressions You Already Use
Why Regular Expressions Seem Intimidating
Continual Evolution in Techniques Supported
The Languages That Support Regular Expressions
Replacing Text in Quantity
Chapter 2: Regular Expression Tools and an Approach to Using Them
Regular Expression Tools
Language- and Platform-Specific Tools
An Analytical Approach to Using Regular Expressions
Express and Document What You Want to Do in English
Consider the Data Source and Its Likely Contents
Consider the Regular Expression Options Available
Use Whitespace to Aid in Clear Documentation of the Regular Expression
Chapter 3: Simple Regular Expressions
Matching Single Characters
Matching Sequences of Characters That Each Occur Once
Matching Optional Characters
Other Cardinality Operators
Regular Expression Metacharacters
Modifiers
Exercises
Introduction to Character Classes
Using Ranges in Character Classes
Metacharacter Meaning within Character Classes
Negated Character Classes
Combining Positive and Negative Character Classes
POSIX Character Classes
Exercises
String, Line, and Word Boundaries
What Is a Word?
Identifying Word Boundaries
Grouping Using Parentheses
Alternation
Capturing Parentheses
Non-Capturing Parentheses
Back References
Exercises
Why You Need Lookahead and Lookbehind
Lookahead
Positive Lookahead Examples
Lookbehind
How to Match Positions
Exercises
Chapter 9: Sensitivity and Specificity of Regular Expressions
What Are Sensitivity and Specificity?
The Sensitivity/Specificity Trade-Off
How Metacharacters Affect Sensitivity and Specificity
Sensitivity, Specificity, and Positional Characters
Sensitivity, Specificity, and Lookahead and Lookbehind
Knowing the Data, Sensitivity, and Specificity
Revisiting the Star Training Company Example
Exercises
Chapter 10: Documenting and Debugging Regular Expressions
Documenting Regular Expressions
Know Your Data
The User Interface
Metacharacters Available
Modes
Examples
Search-and-Replace Examples
Regular Expressions in Visual Basic for Applications
Exercises
Chapter 12: Regular Expressions in StarOffice/OpenOffice.org Writer
The User Interface
Metacharacters Available
POSIX Character Classes
Exercises
Introducing findstr
Metacharacters Supported by findstr
Word-Boundary Positions
Beginning- and End-of-Line Positions
Single File Examples
Multiple File Example
A Filelist Example
Exercises
The PowerGREP Interface
Exercises
The Excel Find Interface
The Wildcards Excel Supports
Using Wildcards in Data Forms
Using Wildcards in Filters
Exercises
Chapter 16: Regular Expression Functionality in SQL Server 2000
Metacharacters Supported
Using LIKE with Regular Expressions
Negated Character Classes
Using Full-Text Search
Document Filters on Image Columns
Exercises
Getting Started with MySQL
The Metacharacters MySQL Supports
Testing Matching of Literals: _ and % Metacharacters
Using the REGEXP Keyword and Metacharacters
Social Security Number Example
Exercises
The Interface to Metacharacters in Microsoft Access
The Metacharacters Supported in Access
Using the # Metacharacter
Using the # Character with Date/Time Data
Using Character Classes in Access
Exercises
Chapter 19: Regular Expressions in JScript and JavaScript
Using Regular Expressions in JavaScript and JScript
Metacharacters in JavaScript and JScript
Documenting JavaScript Regular Expressions
SSN Validation Example
Exercises
The RegExp Object and How to Use It
Using the Match Object and the Matches Collection
Supported Metacharacters
The System.Text.RegularExpressions namespace
Using the Match.Success Property and Match.NextMatch Method
Multiline Matching: The Effect on the ^ and $ Metacharacters
Inline Documentation Using the IgnorePatternWhitespace Option
The Metacharacters Supported in Visual Basic .NET
Exercises
Chapter 22: C# and Regular Expressions
The Classes of the System.Text.RegularExpressions namespace
Metacharacters Supported in Visual C# .NET
Getting Started with PHP 5.0
How PHP Structures Support for Regular Expressions
The eregi() Function
The Metacharacters Supported in PHP
How Constraints Are Expressed in W3C XML Schema
Mixing Unicode Character Classes with Other Metacharacters
Matching Numeric Digits
Exercises
Introduction to the java.util.regex Package
Metacharacters Supported in the java.util.regex Package
The POSIX Character Classes in the java.util.regex Package
Using Methods of the String Class
Exercises
Obtaining and Installing Perl
Basics of Perl Regular Expression Usage
Using the Perl Regular Expression Operators
The Metacharacters Supported in Perl

This book aims to help you overcome the hurdles that make so many developers uncomfortable with regular expressions and allow you to make effective use of the power that is available to the developer who understands the strengths and pitfalls of regular expressions.

Who This Book Is For

Beginning Regular Expressions is designed for developers who need to manipulate text but are new to regular expressions or have tried regular expressions in the past but have found that the learning curve, presented by experts who didn’t realize the needs of newcomers to the topic, was just too steep to allow them to make progress.

This book is targeted at developers who use Windows as their primary or only operating system. You won’t need to spend time understanding aspects of Unix to begin to use regular expressions. All of the tools and languages presented in this book run on Windows, although versions of many are available that will run on other platforms too.

Beginning Regular Expressions takes you forward from things you are likely to know already, such as the use of the * and ? characters when doing command line file searching. As you build your knowledge, you see working examples that you can adapt to allow you to explore solutions to the problems that you meet.

Whether you are an occasional programmer or simply one who hasn’t used regular expressions yet, you will be shown the component parts of regular expressions, what they mean, how to use them, and pitfalls to be aware of when using them. Working examples form a core part of how you learn to create, understand, and use regular expressions. Most of the chapters contain a number of Try It Out sections that show you how to put regular expressions to work. Each Try It Out section is accompanied by a How It Works section or other explanation that explains how a regular expression works.

What This Book Covers

This book introduces the various parts of the construction of a regular expression pattern, explains what they mean, and walks you through working examples showing how they work and why they do what they do. By working through the examples, you will build your understanding of how to make regular expressions do what you want them to do and avoid creating regular expressions that don’t meet your intentions.

Beginning chapters introduce regular expressions and show you a method you can use to break down a text manipulation problem into component parts so that you can make an intelligent choice about constructing a regular expression pattern that matches what you want it to match and avoids matching unwanted text.

To solve more complex problems, I encourage you to set out a problem definition and progressively refine it to express it in English in a way that corresponds to a regular expression pattern that does what you want it to do.

The second part of the book devotes a chapter to each of several technologies available on the Windows platform. You are shown how to use each tool or language with regular expressions (for example, how to do a lookahead in Perl or create a named variable in C#).

Regular expressions can be useful in applications such as Microsoft Word, OpenOffice.org Writer, Microsoft Excel, and Microsoft Access. A chapter is devoted to each.

In addition, tools such as the little-known Windows findstr utility and the commercial PowerGrep tool each have a chapter showing how they can be used to solve text manipulation tasks that span multiple files.

The use of regular expressions in the MySQL and Microsoft SQL Server databases is also demonstrated.

Several programming languages have a chapter describing the metacharacters available for use in those languages together with demonstrations of how the objects or classes of that language can be used with regular expressions. The languages covered are VBScript, JScript, Visual Basic .NET, C#, PHP, Java, and Perl.

XML is used increasingly to store textual data. The W3C XML Schema definition language can use regular expressions to automatically validate data in an XML document. W3C XML Schema has a chapter demonstrating how regular expressions can be used with the xs:pattern element.

How This Book Is Structured

Chapters 1 through 10 describe the component parts of regular expression patterns and show you what they do and how they can be used with a variety of text manipulation tools and languages. I suggest that you work through these chapters in order and build up your understanding of regular expressions.

The book then devotes a chapter to each of several text manipulation tools and programming languages. These chapters assume knowledge from Chapters 1 through 10, but you can dip into the tool-specific and language-specific chapters in any order you want.

The book was written in this way so that you could use Chapters 1 through 10 to get a grasp of how to use regular expressions. You can then apply that knowledge by exploring the chapters devoted to technologies you already use or will have to use for specific projects.

Many developers are asked to program in languages in which they are not fully experienced. Each chapter devoted to a programming language provides many examples of working code that you can adapt, as appropriate, to your own needs.

What You Need to Use This Book

Beginning Regular Expressions makes use of a range of tools and programming languages. Examples in Chapters 1 through 10 use a variety of tools ranging from Microsoft Word and OpenOffice.org Writer to PowerGrep, Java, and Perl.

This book is targeted primarily at Windows users and developers. However, developers on other platforms can use much of the book.

It is likely that you won’t have all the tools or technologies used in this book. For example, it’s unlikely that you’ll be programming in JScript, Perl, C#, Java, and PHP on a regular basis. Depending on which languages interest you, it is assumed that you have the necessary tools installed. However, where free trial software or free downloads are available you will be given information about where to obtain them and basic information about how to install them on Windows.

Conventions

To help you get the most from the text and keep track of what’s happening, a number of conventions are used throughout the book.

Try It Out

The Try It Out is an exercise you should work through, following the text in the book.

1. They usually consist of a set of steps.

2. Each step has a number.

3. Follow the steps in order.

How It Works

After most Try It Outs, the code you’ve typed is explained in detail.

Tips, hints, tricks, and asides to the current discussion are offset and placed in italics like this.

Boxes like this one hold important, not-to-be-forgotten information that is directly relevant to the surrounding text.


As for styles in the text:

❑ Important words are highlighted when introduced.

❑ Keyboard strokes are shown like this: Ctrl+A

❑ Filenames, URLs, and code within the text appear like this: persistence.properties

❑ Code is presented in two different ways:

In code examples new and important code is highlighted with a gray background.

The gray highlighting is not used for code that’s less important in the present context or that has been shown before.

Source Code

As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com. Once at the site, simply locate the book’s title (either by using the Search box or by using one of the title lists), and click the Download Code link on the book’s detail page to obtain all the source code for the book.

Because many books have similar titles, you may find it easiest to search by ISBN; for this book the ISBN is 0-7645-7489-2.

Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.

Errata

We make every effort to ensure that there are no errors in the text or code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books, such as a spelling mistake or faulty piece of code, we would be grateful for your feedback. By sending in errata you may save another reader hours of frustration, and you will be helping us provide even higher quality information.

To find the errata page for this book, go to www.wrox.com and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page, you can view all errata that has been posted for this book. A complete book list including links to each book’s errata is also available at www.wrox.com/misc-pages/booklist.shtml.

If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.

For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a Web-based system for you to post messages relating to Wrox books and related technologies and interact with other readers and technology users. The forums offer a subscription feature to e-mail you topics of interest of your choosing when new posts are made to the forums. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.

At http://p2p.wrox.com you will find a number of forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:

1. Go to p2p.wrox.com, click the Register link, read the terms of use, and click Agree.

2. Complete the required information to join as well as any optional information you wish to provide and click Submit.

3. You will receive an e-mail with information describing how to verify your account and complete the joining process.

You can read messages in the forums without joining P2P but to post your own messages, you must join.

Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the Web. If you would like to have new messages from a particular forum e-mailed to you, click the Subscribe to this Forum icon by the forum name in the forum listing.

For more information on how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works and many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.


be matched using regular expressions. Forms on the Web accept text as input, which can be matched against allowable input. Business documents consist of text, and searches for specific sequences of characters can be made using regular expressions. E-mail messages consist of text. Developers’ code consists of text. And regular expressions can be beneficially used in many situations where text is used.

Not only is text everywhere, but there also is lots of it, and increasingly, text must be updated or aggregated. As the volume of text created or to which you have access increases, you need efficient and effective ways to find text of particular interest or to change specific pieces of text.

Finding and changing individual pieces of text can be straightforward if you are dealing with a single document only a page or two in length. It becomes a more daunting task, potentially prone to human error, if you are dealing with dozens of documents, each hundreds of pages in length, or with thousands of relatively short documents. It is for tasks such as this that regular expressions are used, because regular expressions allow automation of many useful types of text processing.

For example, in a Web form you will want to check that a credit card number is correctly structured or that a postal code is correctly formed. In a lengthy document, you might want to find a hazily recalled URL for an important source of information. You might want to convert HTML code so that it conforms to the rules of Extensible Markup Language (XML) syntax and complies with company policy to use XHTML code. You might want to check that user input into a Windows application satisfies necessary criteria to allow correct processing.

In this chapter, you will learn the following:

❑ What regular expressions are

❑ What regular expressions can be used for


The list of possible uses for a tool that allows the manipulation of text is almost endless, with text being so widespread. Sadly, many computer users and developers have little or no knowledge of regular expressions and how they can help in working with text. This book aims to change that.

What Are Regular Expressions?

Regular expressions are patterns of characters that match, or fail to match, sequences of characters in text. To allow developers to create regular expression patterns, certain characters and combinations of characters have special meanings and uses, and this book spends considerable time looking at those. But first, here are some more basic ideas.

Regular expressions, at the most basic level, allow computer users and developers to find desired pieces of text and, often, to replace those pieces of text with something that is preferred. At other times, regular expressions are used to test whether a sequence of characters that might be intended to be a credit card number or a Social Security number has an allowed pattern of characters. Whether it’s finding existing sequences of characters or testing sequences of characters for their suitability (or not) for storage, the key aspect of regular expressions is matching a pattern against a sequence of characters.

It is reasonable, in a broad sense, to refer to a regular expression language, but strictly speaking, there is no regular expression language. Like scripting languages such as JavaScript and VBScript, which can be used only in the context of another application or language, regular expressions can be used only in the context of a “proper” programming language, including scripting languages, or as part of an application such as Microsoft Word and OpenOffice.org Writer or a command-line utility such as the findstr utility. Regular expressions can be discussed in an abstract way, but they are used together with another

Try It Out Matching Literal Characters

The simplest type of regular expression pattern is a sequence of characters. For example, if you want to find the sequence of three characters car, you can use a regular expression pattern car to find those characters.

First, try to express the problem in plain English:

Match a sequence of characters; first match the letter c, followed by the letter a, followed by the letter r.

Suppose that you had the following text in a document, Car.txt:

Carl spilt his carton of orange juice on the carpet of his new car.

If he had taken more care when opening the carton he wouldn’t have had this

annoying and disappointing accident.

Some car shampoo would, Carl hoped, make the carpet look as good as new.

There are many occurrences of the sequence of characters car, as shown in a simple regular expression search in OpenOffice.org Writer in Figure 1-1. To try it out for yourself, follow these steps:

1. Open Car.txt in OpenOffice.org Writer (regular expressions are supported in version 1.1 and above).

2. Use Ctrl+F to open the Find and Replace dialog box.

3. Check the Regular Expressions check box.

4. Enter car in the Search For text box.

Figure 1-1

As you can see in Figure 1-1, the sequences of characters car have been selected whether or not they formed a word and whether the initial character of the sequence was uppercase or lowercase.
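The same literal search can be sketched outside a word processor. The fragment below is only an illustration: it assumes Python's standard re module, uses the IGNORECASE flag to mirror the case-insensitive behavior seen in Figure 1-1, and the names car_text and literal_car are invented for this sketch.

```python
import re

# The sample text from Car.txt.
car_text = (
    "Carl spilt his carton of orange juice on the carpet of his new car.\n"
    "If he had taken more care when opening the carton he wouldn't have had this\n"
    "annoying and disappointing accident.\n"
    "Some car shampoo would, Carl hoped, make the carpet look as good as new.\n"
)

# A literal pattern: the letter c, then a, then r.
# IGNORECASE makes the search ignore upper- and lowercase differences.
literal_car = re.compile(r"car", re.IGNORECASE)

# Collect every matched sequence, whether or not it is a whole word.
matches = [m.group(0) for m in literal_car.finditer(car_text)]
print(matches)
print(len(matches))
```

Sequences inside Carl, carton, carpet, and care are all reported, which is exactly the looseness the next paragraphs set out to tighten.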

The following table shows a more formal breakdown for the very simple regular expression pattern car. When you have simple literal patterns such as car, a formal layout of the meaning of a regular expression seems like overkill, but when you begin to create significantly more complex regular expression patterns later in the book, laying out each part of a regular expression helps you keep track of the pieces.

Letter Instruction

In short documents, as in this example, whether the matches are whole words or simply character sequences, if you use the Find All option in a search in OpenOffice.org Writer, you can quickly scan all the matches by eye, because they are shown in reversed highlight.

However, regular expressions can be used to tighten up the match. For example, you might want to find occurrences of the sequence of characters car that make up a word, but you don’t want to find it when it is part of another word. And you might also want to make the search case sensitive.

You can do both in OpenOffice.org Writer by using the regular expression \<car\> and checking the Match Case check box, as shown in Figure 1-2. The \< and \> metacharacters simply match the boundary of a word.

Figure 1-2

Don’t worry too much about the syntax for matching the positions at the beginning and end of words for the moment. You’ll see the various forms of syntax for that in Chapter 6.
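As a rough counterpart to the \<car\> search, here is a minimal sketch in Python. It assumes Python's re module, which writes a word boundary as \b at both ends instead of \< and \>, and which is case sensitive unless told otherwise; the variable names are hypothetical.

```python
import re

# Two lines of the Car.txt sample are enough to show the effect.
sample = (
    "Carl spilt his carton of orange juice on the carpet of his new car.\n"
    "Some car shampoo would, Carl hoped, make the carpet look as good as new.\n"
)

# \b matches the position at the start or end of a word.
# No IGNORECASE flag is given, so "Car" in "Carl" could not match anyway.
whole_word_car = re.compile(r"\bcar\b")

whole_words = whole_word_car.findall(sample)
positions = [m.start() for m in whole_word_car.finditer(sample)]
print(whole_words)
print(positions)
```

Only the two standalone occurrences of car survive; carton, carpet, and Carl are no longer matched.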

What Can Regular Expressions Be Used For?

The potential number of uses for regular expressions is enormous. This section briefly describes some examples of what regular expressions can be used for.

Finding Doubled Words

Regular expressions can be used to find text where words have been doubled. In some text, such as the following sentence, a doubled word may be intentional:

It is for tasks such as that that regular expressions are used.

The doubling of the word that is how I wanted to express an idea. In other situations, the doubling of a word is inappropriate and undesired, as in the following:

Paris in the the Spring

The techniques to find doubled words are described in detail in Chapter 7.

In some settings, you don’t need regular expressions to identify doubled words. In the preceding phrase, the second the is often underlined with a wavy red line in Microsoft Word, for example (depending on settings).
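To give a flavor of what such a search looks like, here is a small sketch in Python. It is not the technique as Chapter 7 presents it; it simply assumes that a capturing group followed by a back reference (\1) is available, as it is in Python's re module. The function name is made up for this illustration.

```python
import re

# (\w+) captures one word; \s+ is the whitespace between the two copies;
# \1 requires the same word again; the \b anchors keep the match to whole words.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return each word that appears twice in a row in text."""
    return DOUBLED_WORD.findall(text)

print(find_doubled_words("Paris in the the Spring"))
print(find_doubled_words(
    "It is for tasks such as that that regular expressions are used."))
```

The pattern cannot tell an intentional doubling from a mistake, so in practice each hit still needs a human decision.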

Checking Input from Web Forms

Another common use of regular expressions is to check that data entered in Web forms conforms to a structure that will be acceptable to the server-side process to which the form data will be submitted. For example, suppose someone attempts to enter the following as a supposed U.S. Social Security number (SSN):

up with an attempt being made to enter a name into a date column in the database, which will likely cause an error when the attempt is made to write the data to the database. You also need to be able to make checks to ensure that you don’t allow dates such as 2005-02-31, because there are never 31 days in February. The data you collect from a form is simply a sequence of characters; therefore, regular expressions are ideal to ensure that inappropriate data is detected on the client side and that the user is asked to enter appropriate data in place of the erroneous data.
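The two checks mentioned here can be sketched as follows. This is an illustration under stated assumptions, not a production validator: the SSN check only tests the familiar three-two-four digit shape, the date check assumes the year-month-day form used in the example, and the calendar test is delegated to Python's datetime module instead of being expressed as a regular expression. All names are hypothetical.

```python
import re
from datetime import date

# Shape of a U.S. Social Security number: 3 digits, 2 digits, 4 digits.
SSN_SHAPE = re.compile(r"\d{3}-\d{2}-\d{4}")

# Shape of a year-month-day date such as 2005-02-28.
DATE_SHAPE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def looks_like_ssn(text):
    """True if the whole input has the digit-and-hyphen shape of an SSN."""
    return SSN_SHAPE.fullmatch(text) is not None

def is_real_date(text):
    """True if the input has the right shape and names a real calendar day."""
    m = DATE_SHAPE.fullmatch(text)
    if m is None:
        return False
    year, month, day = (int(part) for part in m.groups())
    try:
        date(year, month, day)  # raises ValueError for days such as February 31
    except ValueError:
        return False
    return True

print(looks_like_ssn("123-45-6789"), looks_like_ssn("Fred"))
print(is_real_date("2005-02-28"), is_real_date("2005-02-31"))
```

Splitting the work this way keeps the pattern readable: the regular expression rejects input that is not date-shaped at all, and the calendar rules are left to code that already knows them.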


Changing Date Formats

Imagine that you are “translating” a business document from U.S. English to British English. One of the components of the document’s text may represent dates, and you will need to locate and, very possibly, change those dates, because the conventional representation of dates in the United States and in the U.K. differs. For example, in the United States, the date for Christmas Day 2001 would be written as 12/25/2001. In the U.K., this might be written as 25/12/2001. If you also had to represent dates for Japanese customers, you might express the same date as 2001-12-25.

Assuming that you had a document with dates using the U.S. English conventions, you could create a regular expression to detect those sequences of characters wherever they occurred in the document. Depending on what the desired output format is, you could also replace a U.S. English date on the input side with a British English or Japanese date on the output side.
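A conversion of this kind might be sketched like this in Python. The sketch assumes every date in the input really is in month/day/year order with a four-digit year; it does not check that the month and day are plausible. The function names are invented for the example.

```python
import re

# Month, day, and four-digit year, each captured in its own group.
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def to_british(text):
    """Swap month and day: 12/25/2001 becomes 25/12/2001."""
    return US_DATE.sub(r"\2/\1/\3", text)

def to_year_first(text):
    """Rewrite as year-month-day with two-digit month and day."""
    def rewrite(m):
        month, day, year = int(m.group(1)), int(m.group(2)), m.group(3)
        return f"{year}-{month:02d}-{day:02d}"
    return US_DATE.sub(rewrite, text)

sentence = "Christmas Day was 12/25/2001."
print(to_british(sentence))
print(to_year_first(sentence))
```

The capturing groups do the real work: the pattern finds the date, and the replacement simply reorders the pieces that were captured.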

Finding Incorrect Case

Because there is a lot of jargon and many acronyms associated with computing, it is very easy for incorrect case to creep into documents. This can happen, for example, because a word processor attempts to autocorrect what it (wrongly) imagines is an incorrect doubling of uppercase (capital) letters. The sample document shown here, XPath.txt, is designed to illustrate one of the problems that can creep into technical documents.

Xpath is an abbreviation for the XML Path Language.

XPath is used to navigate around a tree model of an XML document.

There are significant differences in the data model of XpatH 1.0 and Xpath 2.0.

XSLT is one of the technologies with which xpath is often used.

The correct way to write XPath is with two uppercase initial letters. As you can see, the sample document has several incorrect forms of the word due to errors in the case of one or more characters.

The sensible approach to a problem like this depends, in part, on whether the word at issue can also be used in normal English. In the case of XPath there is only one correct form, and it doesn’t occur as a word in normal English. That allows you to simply find all occurrences of the characters xpath, whether upper- or lowercase, and replace them with the correct sequence of characters, XPath.

There are several possible approaches to problems of this type, and you will see many examples of them later in the book.
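One of those approaches, the simple find-everything-and-replace strategy described above, can be sketched in a few lines of Python. The sketch assumes the re module's IGNORECASE flag stands in for a case-insensitive search; the variable names are hypothetical.

```python
import re

# The contents of the sample document XPath.txt.
xpath_text = (
    "Xpath is an abbreviation for the XML Path Language.\n"
    "XPath is used to navigate around a tree model of an XML document.\n"
    "There are significant differences in the data model of XpatH 1.0 and Xpath 2.0.\n"
    "XSLT is one of the technologies with which xpath is often used.\n"
)

# Match the five letters in any mixture of upper- and lowercase.
ANY_CASE_XPATH = re.compile(r"xpath", re.IGNORECASE)

# Replace every variant, including already-correct ones, with XPath.
fixed_text = ANY_CASE_XPATH.sub("XPath", xpath_text)
print(fixed_text)
```

This blanket replacement is safe only because xpath is not an ordinary English word; a term that is also a normal word would need a more careful pattern.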

Adding Links to URLs

Suppose that you have URLs located in a document that you want to convert for displaying on the Web

If the URL is stored in a separate column in a relational database, it may be straightforward just to placethe URL in the column as the value of the hrefattribute of an HTML/XHTMLaelement However, ifthe URL is included in a piece of text such as the following, the problem of recognizing a URL becomes alittle tougher

Trang 36

The World Wide Web Consortium, the W3C, has developed many specifications for XML and associated languages The W3C’s home page is located at

http://www.w3.org and its technical reports are located at http://www.w3.org/tr/

Finding URLs depends on being able to recognize a URL as it occurs anywhere in what might potentially be a very long document. In addition, you don’t know what the actual titles of the Web pages are to which the URLs point, so you can’t supply the page title as text but have to use the URL both as the value of the href attribute of an XHTML a element and also as the text contained between the start tag and end tag of the a element. What you want to do is locate each URL and then replace it with a new piece of text constructed like this (the italicized theURL stands for the actual URL that you find inside the text):

<a href="theURL">theURL</a>
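The replacement just described can be sketched in Python as follows. The URL pattern used here is a deliberately simple assumption (http:// or https:// followed by any run of characters that are not whitespace, angle brackets, or double quotes), and the name link_urls is hypothetical. Treat it as an illustration of the idea rather than a robust URL recognizer.

```python
import re

# Assumed, simplified URL pattern: scheme, then everything up to the next
# whitespace character, angle bracket, or double quote.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def link_urls(text: str) -> str:
    """Wrap each URL found in text in an a element that links to it."""
    # \g<0> refers to the whole match, so the URL appears twice:
    # once as the href value and once as the link text.
    return URL_PATTERN.sub(r'<a href="\g<0>">\g<0></a>', text)


sample = (
    "The W3C's home page is located at http://www.w3.org "
    "and its technical reports are located at http://www.w3.org/tr/"
)
linked = link_urls(sample)
print(linked)
```

One limitation of so simple a pattern is that punctuation directly following a URL, such as a sentence-ending period, would be treated as part of the URL.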

Regular Expressions You Already Use

If you have used a computer for any length of time, you very likely are familiar with at least some uses of regular expressions, although you may not use that term to describe the text patterns that you use in word processors or directory listings from the command line, for example.

Search and Replace in Word Processors

Most modern word processors have some sort of regular expression support, although for some word processors the term regular expressions is not used. In Microsoft Word, for example, the limited regular expression support available in the word processor itself uses the term wildcards to describe the supported regular expression patterns.

The simplest pattern is literal text. So if you want to find a text pattern of Star, you can enter those four characters in the Find What box in the Find and Replace dialog box in Microsoft Word. As you will see a little later in this chapter, an approach like that can have its problems when you’re handling substantial quantities of text.

Directory Listings

If you have done any work at the command line, you have probably used simple regular expressions when doing directory listings. Two metacharacters are available: * (the asterisk) and ? (the question mark). For example, on the Windows platform, if you want to find the executable files in the current directory you can use the following command from the command line:

dir *.exe

The dir command on its own is an instruction to list all files in the current directory and is equivalent to entering the following at the command line:

dir *.*

The *.exe pattern matches any sequence of zero or more characters followed by a period followed by the literal sequence of characters exe in a filename. Similarly, the *.* pattern indicates zero or more characters followed by a period followed by zero or more characters.
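To experiment with the asterisk wildcard away from the command line, you can use Python's standard fnmatch module, which applies shell-style wildcards to strings. This is only a stand-in: the Windows command shell has its own matching rules (it ignores case, for example), and the filenames below are invented for the sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical directory contents, used only for illustration.
filenames = ["setup.exe", "notes.txt", "tool.exe", "README"]

# "*" stands for zero or more characters of any kind.
exe_files = [name for name in filenames if fnmatchcase(name, "*.exe")]

# In fnmatch, "*.*" requires a literal period somewhere in the name.
dotted_files = [name for name in filenames if fnmatchcase(name, "*.*")]

print(exe_files)
print(dotted_files)
```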

On other occasions, when searching a directory you will know the exact number of characters that you want to match. Suppose that you have a directory containing multiple Excel workbooks, each of which contains monthly sales. If you know that the filename consists of the word Sales followed by two digits for the year followed by three alphabetic characters for the month, you could search for all sales workbooks from 2004 by using this command:

dir Sales04???.xls

This would display the Excel workbooks whose filenames are constructed as just described. But if you had some other workbooks named, for example, Sales04123.xls and Sales04234.xls, the command would also cause those to be displayed, although you don’t necessarily want to see those.
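The difference between the ? wildcard and a more selective regular expression can be sketched in Python. The fnmatch module stands in for the behavior of the dir command, the month-named files are invented for the example, and the regular expression is one possible way (an assumption, not the only way) of insisting on three alphabetic characters.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical filenames; the two numeric ones come from the text above.
filenames = [
    "Sales04Jan.xls",
    "Sales04Feb.xls",
    "Sales04123.xls",
    "Sales04234.xls",
    "Sales03Dec.xls",
]

# "?" matches exactly one character of any kind, digits included.
wildcard_hits = [n for n in filenames if fnmatchcase(n, "Sales04???.xls")]

# A regular expression can require that the three characters be letters.
month_pattern = re.compile(r"Sales04[A-Za-z]{3}\.xls")
regex_hits = [n for n in filenames if month_pattern.fullmatch(n)]

print(wildcard_hits)
print(regex_hits)
```

The wildcard version returns the numeric filenames as well, whereas the regular expression keeps only the names that have three letters after Sales04.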

The ways in which regular expressions can be used together with the dir command are, as you can see, very limited due to the provision of only two metacharacters in path wildcards.

Online Searching

Another scenario where regular expressions, although admittedly simple ones, are widely used is in online searching. The search box on eBay.com, for example, will accept the asterisk wildcard so that photo* matches words such as photo, photos, photograph, and photographs.
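A rough regular expression counterpart of the photo* wildcard can be tried out in Python. How any particular search site implements its wildcard is not known here, so the pattern below is simply an assumed equivalent: the literal characters photo at the start of a word, followed by zero or more word characters.

```python
import re

# \b anchors the match at a word boundary; \w* allows any further
# letters, digits, or underscores after the literal text "photo".
PHOTO_PATTERN = re.compile(r"\bphoto\w*", re.IGNORECASE)

text = "photo photos photograph photographs phone"
words = PHOTO_PATTERN.findall(text)
print(words)
```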

Why Regular Expressions Seem Intimidating

There are several reasons why many developers find regular expressions intimidating. Among the reasons is the compact, cryptic syntax of regular expressions. Another is the absence of a standards body for regular expressions, so that the regular expression patterns for a particular meaning vary among languages or tools that support regular expressions.

Compact, Cryptic Syntax

Regular expression syntax is very compact and can seem totally cryptic to those unfamiliar with regular expressions. At times it seems as if there are backslash characters, parentheses, and square brackets everywhere. Once you understand what each character and metacharacter does in a regular expression pattern, you will be able to build your own regular expressions or analyze those created by other developers.

A metacharacter is a character, or combination of characters, that has a special meaning when it is contained in a regular expression pattern.

The syntax of wildcards in directory paths differs significantly from that of regular expressions, so the intention of this section is not to focus on the details of path wildcards, but to remind you that you likely already apply patterns to find suitable pieces of text (in this case, filenames).

Whitespace Can Significantly Alter the Meaning

Placing unintended whitespace in a regular expression can radically alter the meaning of the regular expression and can turn what ought to be matches into nonmatches, and vice versa. When creating regular expression patterns, you need to be meticulous about handling whitespace.

For example, suppose that you want to match the content of a document that stores information about people. A sample document, People.txt, is shown here:

Cardoza, Fred
Catto, Philipa
Duncan, Jean
Edwards, Neil
England, Elizabeth
Main, Robert
Martin, Jane
Meens, Carol
Patrick, Harry
Paul, Jeanine
Roberts, Clementine
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Wales, Gareth
Zinni, Hamish

Assuming that each name is laid out in the preceding format, you can use a regular expression to locate all names where the surname begins with an uppercase S by using the following regular expression:

^S.*

Figure 1-3 shows the result of using that regular expression on People.txt in OpenOffice.org Writer. Notice that all the names where the surname begins with S are selected.

However, if you insert a single space character between the ^ and the S in the regular expression pattern, as follows, there is no match at all, as illustrated in Figure 1-4.

^ S.*

This occurs because the ^ metacharacter is a marker for the beginning of a line (or paragraph in OpenOffice.org Writer). So inserting a space character immediately after the caret, ^, means that the regular expression is searching for lines that begin with a space character. Because each of the lines in People.txt has an alphabetic character at the beginning of the line, there is no line that begins with a space character, and therefore, there is no match for the regular expression pattern.
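You can reproduce the effect outside OpenOffice.org Writer with a short Python sketch. Python is an assumption here, not the tool used in the figures; its re.MULTILINE flag makes ^ match at the start of every line, which approximates the line-by-line behavior described above.

```python
import re

# The contents of People.txt, one name per line.
people = """Cardoza, Fred
Catto, Philipa
Duncan, Jean
Edwards, Neil
England, Elizabeth
Main, Robert
Martin, Jane
Meens, Carol
Patrick, Harry
Paul, Jeanine
Roberts, Clementine
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Wales, Gareth
Zinni, Hamish"""

# ^S.* matches each line that begins with an uppercase S.
matches = re.findall(r"^S.*", people, flags=re.MULTILINE)

# ^ S.* requires a space at the start of the line, so nothing matches.
no_matches = re.findall(r"^ S.*", people, flags=re.MULTILINE)

print(matches)
print(no_matches)
```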

Spotting an error like the existence of the inadvertent space character can be tough. I recommend that you always look carefully at the data you are going to manipulate so that you understand its characteristics and know that at least some matches for the regular expression should be present. If you follow that advice (which I discuss in more detail in Chapter 2), you will know that there are some surnames beginning with S, and therefore, you should be aware that something is wrong with the regular expression pattern if the regular expression is returning zero matches.

Figure 1-3

Figure 1-4

However, when a regular expression returns some matches, you might overlook the undesired effects of inadvertently introduced whitespace. Suppose that you want to find names where the surname begins with C, D, R, or S. You could do that with the following regular expression pattern:

With the small amount of ordered data in People.txt you might easily notice the absence of expected names with surnames starting with D. In larger documents where the data is not ordered, however, it might be a different story.
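As a hypothetical illustration of that pitfall, the following Python sketch compares two patterns that differ by a single space. Both patterns are assumptions chosen for the sketch and are not necessarily the ones the author used. Because the second pattern still returns most of the expected names, the missing entry is easy to overlook.

```python
import re

# The contents of People.txt, one name per line.
people = """Cardoza, Fred
Catto, Philipa
Duncan, Jean
Edwards, Neil
England, Elizabeth
Main, Robert
Martin, Jane
Meens, Carol
Patrick, Harry
Paul, Jeanine
Roberts, Clementine
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Wales, Gareth
Zinni, Hamish"""

# Assumed intended pattern: lines beginning with C, D, R, or S.
intended = re.compile(r"^(C|D|R|S).*", re.MULTILINE)

# The same pattern with a stray space before the D alternative,
# which now only matches lines beginning with a space and then D.
with_space = re.compile(r"^(C| D|R|S).*", re.MULTILINE)

intended_hits = [m.group(0) for m in intended.finditer(people)]
space_hits = [m.group(0) for m in with_space.finditer(people)]

# Names that silently dropped out because of the stray space.
missing = [name for name in intended_hits if name not in space_hits]
print(missing)
```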

The use of whitespace inside regular expression patterns is discussed in Chapter 4.
