
Addison-Wesley, Text Processing in Python, June 2003, ISBN 0321112547


Python has a rich collection of basic datatypes. All of Python's collection types allow you to hold heterogeneous elements inside them, including other collection types (with minor limitations). It is straightforward, therefore, to build complex data structures in Python.

Unlike many languages, Python datatypes come in two varieties: mutable and immutable. All of the atomic datatypes are immutable, as is the collection type tuple. The collections list and dict are mutable, as are class instances. The mutability of a datatype is simply a question of whether objects of that type can be changed "in place"; an immutable object can only be created and destroyed, but never altered during its existence. One upshot of this distinction is that immutable objects may act as dictionary keys, but mutable objects may not. Another upshot is that when you want a data structure, especially a large one, that will be modified frequently during program operation, you should choose a mutable datatype (usually a list).
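The dictionary-key rule can be probed directly. The sketch below is written for current Python 3 rather than the Python 2.x this text describes, and `can_be_dict_key` is a hypothetical helper name introduced only for illustration.

```python
def can_be_dict_key(obj):
    """Return True if obj is hashable and so usable as a dictionary key."""
    try:
        {obj: None}          # building the dict forces hash(obj)
    except TypeError:
        return False
    return True

point = (3, 4)               # tuple: immutable
coords = [3, 4]              # list: mutable

coords.append(5)             # a list can be changed in place

tuple_ok = can_be_dict_key(point)
list_ok = can_be_dict_key(coords)
```

Here `tuple_ok` is true and `list_ok` is false, while the list itself has grown in place.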

Most of the time, if you want to convert values between different Python datatypes, an explicit conversion/encoding call is required, but numeric types contain promotion rules to allow numeric expressions over a mixture of types. The built-in datatypes are listed below with discussions of each. The built-in function type() can be used to check the datatype of an object.
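A minimal sketch of explicit conversion and of checking the result with type(), assuming Python 3; the variable names are illustrative only.

```python
count = int("42")        # string -> int requires an explicit call
label = str(count)       # int -> string likewise
ratio = float("2.5")     # string -> float

# type() reports the datatype of each object
kinds = [type(value).__name__ for value in (count, label, ratio)]
```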

A.3.1 Simple Types

bool


earlier micro-releases of Python (e.g., 2.2.1) include the names True and False, but not the Boolean datatype.

int

A signed integer in the range indicated by the register size of the interpreter's CPU/OS platform. For most current platforms, integers range from (2**31)-1 down to -(2**31). You can find the size on your platform by examining sys.maxint. Integers are the bottom numeric type in terms of promotions; nothing gets promoted to an integer, but integers are sometimes promoted to other numeric types. A float, long, or string may be explicitly converted to an int using the int()


decimal point and/or exponent notation (e.g., 1.0, 1e3, 37., .453e-12). A numeric expression that involves both int/long types and float types promotes all component types to floats before performing the computation. An int, long, or string may be explicitly converted to a float using the float() function.

SEE ALSO: float 19;
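The literal forms and the promotion rule can be checked in a few lines. This is a sketch for Python 3, where the separate long type no longer exists; the variable names are illustrative.

```python
# Each of the literal styles mentioned above yields a float
literals = [1.0, 1e3, 37., .453e-12]

total = 3 + 0.5              # the int 3 is promoted to 3.0 before adding
from_text = float("1e3")     # explicit conversion from a string
from_int = float(7)          # explicit conversion from an int
```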

complex

An object containing two floats, representing real and imaginary components of a number. A numeric expression that involves both int/long/float types and complex types promotes all

function. If two float/int arguments are passed to complex(), the second is the imaginary component of the constructed number (e.g., complex(1.1,2)).
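A short sketch of the two-argument constructor and of promotion to complex, assuming Python 3; the names `z` and `promoted` are illustrative.

```python
z = complex(1.1, 2)      # real part 1.1, imaginary part 2.0
promoted = z + 1         # the int 1 is promoted to complex before adding

real_part = z.real
imag_part = z.imag
```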

string

An immutable sequence of 8-bit character values. Unlike in many programming languages, there is no "character" type in Python, merely strings that happen to have length one. String objects have a variety of methods to modify strings, but such methods always return a new string object rather than modify the initial object itself. The built-in chr() function will return a length-one string whose ordinal value is the passed integer. The str() function will return a string representation of a passed in


tuple of matching length and content datatypes. If only one value is being interpolated, you may give the bare item rather than a tuple of length one. For example:


precision is included, the length of those digits to the right of the decimal are included in the total length:


sys.stderr.write('could not complete action\n')

result of action

You cannot seek within STDOUT or STDERR; generally you should consider these as pure sequential outputs.

Writing to STDOUT and STDERR is fairly inflexible, and most of the time the print statement accomplishes the same purpose more flexibly. In particular, methods like sys.stdout.write() only accept a single string as an argument, while print can handle any number of arguments of any type. Each argument is coerced to a string using the equivalent of str(obj). For example:

>>> print "Pi: %.3f" % 3.1415, 27+11, {3:4,1:2}, (1,2,3)
Pi: 3.142 38 {1: 2, 3: 4} (1, 2, 3)

Each argument to the print statement is evaluated before it is printed, just as when an argument is passed to a function. As a consequence, the canonical representation of an object is printed, rather than the exact form passed as an argument. In my example, the dictionary prints in a different order than it was defined in, and the spacing of the tuple and dictionary is slightly different. String interpolation is also performed and is a very common means of defining an output format precisely.
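In Python 3, print is a function rather than a statement, but the same evaluate-then-format behavior can be observed. A minimal sketch, capturing the output in an io.StringIO buffer (an illustrative choice, made so the result can be inspected):

```python
import io

buffer = io.StringIO()

# Arguments are evaluated first, then converted to strings and
# joined with single spaces; the tuple is written without spaces
# here but is printed in its canonical form.
print("Pi: %.3f" % 3.1415, 27 + 11, (1,2,3), file=buffer)

printed = buffer.getvalue()
```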

There are a few things to watch for with the print statement. A space is printed between each argument to the statement. If


def print_func(*args):
    import sys
    # str() is the conversion print itself applies to each argument
    sys.stdout.write(' '.join(map(str,args))+'\n')

Readers could enhance this to add the missing capabilities, but using print as a statement is the clearest approach, generally.

subsequences can be accessed by subscripting and slicing, and new tuples can be constructed from such elements and slices. Tuples are similar to "records" in some other programming languages.

The constructor syntax for a tuple is commas between listed items; in many contexts, parentheses around a constructed list are required to disambiguate a tuple from other constructs such


The constructor syntax for a list is surrounding square braces. An empty list may be constructed with no objects between the braces; a length-one list can contain simply an object name; longer lists separate each element object with commas.
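A minimal sketch of the three constructor forms, together with the indexing and slicing that reuse the same square braces. It assumes Python 3, and the element values are made up.

```python
empty = []                        # no objects between the braces
single = ["spam"]                 # a length-one list
several = ["spam", 42, 3.14]      # elements separated by commas

second = several[1]               # indexing also uses square braces
tail = several[1:]                # and so does slicing
```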

Indexing and slices, of course, also use square braces, but the syntactic contexts are different in the Python grammar (and common sense usually points out the difference). Some


.values(), and .items(); or, in recent Python versions, with the .popitem() method. All the dict methods generate contained objects in an unspecified order.

The constructor syntax for a dict is surrounding curly brackets. An empty dict may be constructed with no objects between the brackets. Each key/value pair entered into a dict is separated by a colon, and successive pairs are separated by commas. For example:
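A minimal sketch of the constructor syntax, assuming Python 3 and using made-up keys and values:

```python
empty = {}                              # nothing between the brackets
ages = {"alice": 31, "bob": 27}         # key:value pairs, comma-separated
ages["carol"] = 45                      # dicts are mutable: add a pair in place

names = sorted(ages)                    # sort keys rather than rely on dict order
```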


Python 2.3+ includes a standard module that implements a set datatype. For earlier Python versions, a number of developers have created third-party implementations of sets. If you have at least Python 2.2, you can download and use the sets module from <http://tinyurl.com/2d31> (or browse the Python CVS); you will need to add the definition True,False=1,0 to your local version, though.

A set is an unordered collection of hashable objects. Unlike a list, no object can occur in a set more than once; a set resembles a dict that has only keys but no values. Sets utilize bitwise and Boolean syntax to perform basic set-theoretic


container that also knows how to perform actions; i.e., has methods). A class instance (or any namespace) acts very much like a dict in terms of creating a mapping between names and values. Attributes of a class instance may be set or modified


names, in order to allow easy pruning and rearrangement

SEE ALSO: int 18; float 19; list 28; string 129; tuple 28;

UserDict 24; UserList 28; UserString 33;


various custom modules to perform encodings, encryptions, and compressions are handy to have around (and you certainly do not want the work of implementing them yourself). But at the heart of text processing are basic transformations of bits of text. That's what string functions and string methods do.

prefer simple to complex when simple is enough

This chapter does several things. Section 2.1 looks at a number


be solved using (predominantly) the techniques documented in this chapter. Each of these "Problems" presents working solutions that can often be adopted with little change to real-life jobs. But a larger goal is to provide readers with a starting point for adaptation of the examples. It is not my goal to provide

foundation and starting point from which to develop the functionality they need for their own projects and tasks. And even better than spurring adaptation, these examples aim to encourage contemplation. In presenting examples, this book tries to embody a way of thinking about problems and an attitude towards solving them. More than any individual technique, such ideas are what I would most like to share with readers.

Section 2.2 is a "reference with commentary" on the Python standard library modules for doing basic text manipulations. The discussions interspersed with each module try to give some guidance on why you would want to use a given module or function, and the reference documentation tries to contain more examples of actual typical usage than does a plain reference. In many cases, the examples and discussion of individual functions address common and productive design patterns in Python. The cross-references are intended to contextualize a given function (or other thing) in terms of related ones (and to help you decide which is right for you). The actual listing of functions, constants, classes, and the like is in alphabetical order within type of thing.


provides some aids for using this book in a learning context. The problems and solutions presented in Section 2.3 are somewhat more open-ended than those in Section 2.1. As well, each section labeled as "Discussion" is followed by one labeled


This chapter discusses Python capabilities that are likely to be used in text processing applications. For an introduction to Python syntax and semantics per se, readers might want to skip ahead to Appendix A (A Selective and Impressionistic Short Review of Python); Guido van Rossum's Python Tutorial at <http://python.org/doc/current/tut/tut.html> is also quite excellent. The focus here occupies a somewhat higher level: not the Python language narrowly, but also not yet specific to text processing.

In Section 1.1, I look at some programming techniques that flow out of the Python language itself, but that are usually not obvious to Python beginners, and are sometimes not obvious even to intermediate Python programmers. The programming techniques that are discussed are ones that tend to be applicable to text processing contexts; other programming tasks are likely to have their own tricks and idioms that are not explicitly documented in this book.

In Section 1.2, I document modules in the Python standard library that you will probably use in your text processing application, or at the very least want to keep in the back of your mind. A number of other Python standard library modules are far enough afield of text processing that you are unlikely to use them in this type of application. Such remaining modules are documented very briefly with one- or two-line descriptions. More details on each module can be found with Python's standard documentation.


2.1.1 Problem: Quickly sorting lines on custom criteria

Sorting is one of the real meat-and-potatoes algorithms of text processing and, in fact, of most programming. Fortunately for Python developers, the native [].sort method is extraordinarily fast. Moreover, Python lists with almost any heterogeneous objects as elements can be sorted; Python cannot rely on the uniform arrays of a language like C (an unfortunate exception to this general power was introduced in recent Python versions where comparisons of complex numbers raise a TypeError; and [1+1j,2+2j].sort() dies for the same

alphabetization of the lines is "unnatural." But often text lines contain meaningful bits of information in positions other than the first character position: A last name may occur as the second word of a list of people (for example, with first name as the first word); an IP address may occur several fields into a server log file; a money total may occur at position 70 of each line; and so on. What if you want to sort lines based on this style of meaningful order that Python doesn't quite understand?

The list sort method [].sort() supports an optional custom comparison function argument. The job this function has is to


should come second. The built-in function cmp() does this in a manner identical to the default [].sort() (except in terms of speed, lst.sort() is much faster than lst.sort(cmp)). For short lists and quick solutions, a custom comparison function is probably the best thing. In a lot of cases, you can even get by with an in-line lambda function as the custom comparison function, which is a pleasant and handy idiom.
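Python 3 removed both the built-in cmp() and the comparison-function argument to sort, so a comparison function has to be adapted with functools.cmp_to_key. The sketch below shows a custom comparison on the second word of each line under that assumption; `by_second_word` and the sample names are hypothetical.

```python
from functools import cmp_to_key

def by_second_word(a, b):
    """Return a negative, zero, or positive number, as cmp() used to."""
    key_a, key_b = a.split()[1], b.split()[1]
    return (key_a > key_b) - (key_a < key_b)

names = ["Mary Smith", "Bob Adams", "Zed Jones"]

# cmp_to_key wraps the old-style comparison so sorted() can use it
ordered = sorted(names, key=cmp_to_key(by_second_word))
```

The lines come back ordered by last name: Adams, Jones, Smith.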

When it comes to speed, however, use of custom comparison functions is fairly awful. Part of the problem is Python's function call overhead, but a lot of other factors contribute to the slowness. Fortunately, a technique called "Schwartzian Transforms" can make for much faster custom sorts. Schwartzian Transforms are named after Randal Schwartz, who proposed the technique for working with Perl; but the technique is equally applicable to Python.

The pattern involved in the Schwartzian Transform technique consists of three steps (these can more precisely be called the Guttman-Rosler Transform, which is based on the Schwartzian Transform):

1. Transform the list in a reversible way into one that sorts "naturally."
2. Call Python's native [].sort() method.
3. Reverse the transformation in (1) to restore the original list items (in new sorted order).

The reason this technique works is that, for a list of size N, it only requires O(2N) transformation operations, which is easy to amortize over the necessary O(N log N) compare/flip operations for large lists. The sort dominates computational time, so anything that makes the sort more efficient is a win in the limit
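The three steps can be sketched as follows. This is an illustration in Python 3, not the book's listing: the function name `sort_by_fourth_word` and the sample lines are made up, and the sort criterion (fourth and later words, short lines last) is chosen only as a plausible example.

```python
def sort_by_fourth_word(lines):
    """Sort lines on their fourth and later words; short lines go last."""
    # Step 1: decorate each line with a tuple that sorts "naturally".
    decorated = []
    for index, line in enumerate(lines):
        words = line.split()
        too_short = len(words) < 4            # False sorts before True
        # index breaks ties, so equal keys keep their original order
        decorated.append((too_short, words[3:], index, line))
    # Step 2: the native sort, with no custom comparison function.
    decorated.sort()
    # Step 3: undecorate, keeping only the original line.
    return [item[-1] for item in decorated]

sample = ["a b c zebra", "too short", "a b c apple pie", "a b c apple"]
result = sort_by_fourth_word(sample)
```

In Python 3 the `key=` argument of `sort()` and `sorted()` performs this decoration internally, so the explicit version is mainly useful for seeing how the pattern works.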


Below is an example of a simple, but plausible, custom sorting algorithm. The sort is on the fourth and subsequent words of a list of input lines. Lines that are shorter than four words sort to the bottom. Running the test against a file with about 20,000 lines (about 1 megabyte) performed the Schwartzian Transform sort in less than 2 seconds, while taking over 12 seconds for the custom comparison function sort (outputs were verified as


HOWTOs, email, Usenet posts, and this book itself are written in plaintext (or at least something close enough to plaintext that generic processing techniques are valuable). Moreover, many formats like HTML are frequently enough hand-edited that their plaintext appearance is important.

One task that is extremely common when working with prose text files is reformatting paragraphs to conform to desired margins. Python 2.3 adds the module textwrap, which performs more limited reformatting than the code below. Most of the time, this task gets done within text editors, which are indeed quite capable of performing the task. However, sometimes it would be nice to automate the formatting process. The task is simple enough that it is slightly surprising that Python has no standard module function to do this. There is the class formatter.DumbWriter, or the possibility of inheriting from and customizing formatter.AbstractWriter. These classes are discussed in Chapter 5; but frankly, the amount of customization and sophistication needed to use these classes and their many methods is way out of proportion for the task at hand.
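For comparison, the textwrap module mentioned above handles the basic left-justified case in a couple of calls. A minimal sketch in Python 3; the sample text, the width of 30, and the four-space margin are arbitrary choices for illustration.

```python
import textwrap

text = ("One task that is extremely common when working with prose "
        "text files is reformatting paragraphs.")

# Re-flow the paragraph so that no line exceeds 30 characters
wrapped = textwrap.fill(text, width=30)

# A left margin can be simulated by indenting every line
indented = textwrap.fill(text, width=30,
                         initial_indent="    ",
                         subsequent_indent="    ")
```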

Below is a simple solution that can be used either as a command-line tool (reading from STDIN and writing to STDOUT) or by import to a larger application.


if word >= len(words):
    end_words = 1
else:                              # Compose line of words
    while len(line)+len(words[word]) <= right-left:

    return '\n'.join([line.rjust(right) for line in lines])
else:                              # left justify
    return '\n'.join([' '*left+line for line in lines])


import sys
if len(sys.argv) != 4:
    print "Please specify left_margin, right_marg, justification"
else:


import operator                    # needed by the reduce() calls below

DELIMITED = 1
FLATFILE = 2

# Some sample "statistical" func (in functional programming style)
nillFunc = lambda lst: None
toFloat = lambda lst: map(float, lst)
avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst = lambda lst: reduce(operator.add, toFloat(lst))


E.g.: {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the average of column one, the sum of column 4, and the max of column 5. All other cols incl 2,3, >=6 are ignored.
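The idea of mapping column numbers to summary functions can be sketched independently of the class shown in this section. Everything below is an assumption made for illustration: the function name `field_stats`, the comma delimiter, the 1-based column numbering, and the sample rows. It is written for Python 3.

```python
def field_stats(lines, field_funcs, delimiter=","):
    """Apply one summary function per selected column in a single pass."""
    # Only the columns named in field_funcs are collected
    columns = {col: [] for col in field_funcs}
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        for col in columns:
            columns[col].append(fields[col - 1])   # columns numbered from 1
    return {col: field_funcs[col](values) for col, values in columns.items()}

def avg_lst(lst):
    nums = [float(x) for x in lst]
    return sum(nums) / len(nums)

def max_lst(lst):
    return max(float(x) for x in lst)

rows = ["Ann,30000,4", "Bob,50000,10"]
results = field_stats(rows, {2: avg_lst, 3: max_lst})
```

Column 1 (the name) has no entry in the mapping, so it is ignored.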


print ' Average salary -', results[3]
print ' Max years worked -', results[4]
getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                     style=FLATFILE,
                     column_positions=(15,25,35,45,52))
print 'Flat Calculations:'


FILE.readlines() (for either memory or speed efficiency, respectively). Moreover, only the data that is actually of interest is collected into lists, in order to save memory. However, rather than require multiple passes to collect statistics on multiple fields, as many field columns and summary functions as wanted can be used in one pass.

One possible improvement would be to allow multiple summary functions against the same field during a pass. But that is left as an exercise to the reader, if she desires to do it.

2.1.4 Problem: Counting characters, words, lines, and paragraphs

There is a wonderful utility under Unix-like systems called wc. What it does is so basic, and so obvious, that it is hard to imagine working without it. wc simply counts the characters, words, and lines of files (or STDIN). A few command-line options control which results are displayed, but I rarely use them.

In writing this chapter, I found myself on a system without wc, and felt a remedy was in order. The example below is actually an "enhanced" wc since it also counts paragraphs (but it lacks the command-line switches). Unlike the external wc, it is easy to use the technique directly within Python and is available anywhere Python is. The main trick, inasmuch as there is one, is a compact use of the "".join() and "".split() methods (string.join() and string.split() could also be used, for example, to be compatible with Python 1.5.2 or below).
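A minimal sketch of the counting technique in Python 3. It is an illustration rather than the wc.py listing: the function name `count_text` is hypothetical, lines are counted as newline characters (as the Unix tool does), and a paragraph is assumed to be any non-empty block separated by a blank line.

```python
def count_text(text):
    """Return paragraph, line, word, and character counts for a string."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "paragraphs": len(paragraphs),
        "lines": text.count("\n"),
        "words": len(text.split()),      # split() with no argument splits on any whitespace
        "chars": len(text),
    }

sample = "one two\nthree\n\nfour five six\n"
counts = count_text(sample)
```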

wc.py


protocols like Simple Mail Transport Protocol (SMTP), Network News Transport Protocol (NNTP), or HTTP (depending on content encoding), or even just when displaying them in many standard tools like editors. In order to encode 8-bit binary data as ASCII, a number of techniques have been invented over time.

An obvious, but obese, encoding technique is to translate each binary byte into its hexadecimal digits. UUencoding is an older standard that developed around the need to transmit binary files over the Usenet and on BBSs. Binhex is a similar technique from the MacOS world. In recent years, base64, which is specified by RFC1521, has edged out the other styles of encoding. All of the techniques are basically 4/3 encodings (that is, four ASCII bytes are used to represent three binary bytes), but they differ somewhat in line ending and header conventions (as well as in the encoding as such). Quoted printable is yet another format, but of variable encoding length. In quoted printable encoding, most plain ASCII bytes are left unchanged, but a few special characters and all high-bit bytes are escaped.
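The size trade-offs are easy to see on a few bytes. A minimal sketch in Python 3, where these functions take and return bytes objects; the sample inputs are arbitrary.

```python
import base64
import binascii
import quopri

raw = bytes([0, 1, 2])                    # three binary bytes

as_hex = binascii.hexlify(raw)            # two ASCII bytes per input byte
as_b64 = base64.b64encode(raw)            # four ASCII bytes per three input bytes

# Quoted printable leaves plain ASCII alone and escapes the high-bit byte
as_qp = quopri.encodestring(b"caf\xe9")
```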

Python provides modules for all the encoding styles mentioned. The high-level wrappers uu, binhex, base64, and quopri all operate on input and output file-like objects, encoding the data therein. They also each have slightly different method names and arguments. binhex, for example, closes its output file after encoding, which makes it unusable in conjunction with a cStringIO file-like object. All of the high-level encoders utilize the services of the low-level C module binascii. binascii, in turn, implements the actual low-level block conversions, but assumes that it will be passed the right size blocks for a given encoding.
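The relationship between the file-oriented wrapper and the low-level module can be sketched with base64 and binascii, both of which remain available in Python 3. Here io.BytesIO stands in for the cStringIO objects of Python 2, and the input bytes are arbitrary.

```python
import base64
import binascii
import io

source = io.BytesIO(b"hello")             # file-like input
target = io.BytesIO()                     # file-like output

# High-level wrapper: reads from one file-like object, writes to another
base64.encode(source, target)
encoded = target.getvalue()

# Low-level call: converts a single block handed to it directly
low_level = binascii.b2a_base64(b"hello")
```

For an input this small the wrapper passes one block to binascii, so both results are the same bytes.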

The standard library, therefore, does not contain quite the right


from types import StringType, UnicodeType   # names used unqualified below

def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('','')
    if type(source) in (StringType,UnicodeType):  # String-like src
        for word in split(source):

    for word in split(line):
        word = translate(word, trans, punctuation+digits)


        hist[word] = hist.get(word,0) + 1
except ImportError:                       # Older Python ver
    line = source.readline()              # Slow but mem-friendly
    while line:
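The same normalize-and-count idea can be sketched compactly in Python 3 for string input. This is an illustration, not a completion of the listing above: it relies on str.maketrans and collections.Counter, handles only strings (not file-like sources), and the sample sentence is made up.

```python
import string
from collections import Counter

def word_histogram(text):
    """Create histogram of normalized words (no punctuation or digits)."""
    # Translation table that deletes punctuation and digit characters
    strip = str.maketrans("", "", string.punctuation + string.digits)
    hist = Counter()
    for word in text.split():
        word = word.translate(strip)
        if word:                          # skip tokens that were all digits/punctuation
            hist[word] += 1
    return hist

hist = word_histogram("Hello, hello 42 world! world")
```

Case is left untouched, so "Hello" and "hello" are counted separately.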

Date posted: 19/04/2019, 10:14
