Bioinformatics and Beyond
Do you have a biological question that could be readily answered by computational techniques, but little experience in programming? Do you want to learn more about the core techniques used in computational biology and bioinformatics? Written in an accessible style, this guide provides a foundation for both newcomers to computer programming and those who want to learn more about computational biology. The chapters guide the reader through: a complete beginners’ course to programming in Python, with an introduction to computing jargon; descriptions of core bioinformatics methods with working Python examples; scientific computing techniques, including image analysis, statistics and machine learning. This book also functions as a language reference written in straightforward English, covering the most common Python language elements and a glossary of computing and biological terms. This title will teach undergraduates, postgraduates and professionals working in the life sciences how to program with Python, a powerful, flexible and easy-to-use language.
TIM J STEVENS, a biochemist by training, is a Senior Investigator Scientist at the MRC Laboratory of Molecular Biology in Cambridge. He researches three-dimensional genome architecture and provides computational biology oversight, development and training within the Cell Biology Division.
WAYNE BOUCHER, a mathematician and theoretical physicist by training, is a Senior Post-Doctoral Associate and computing technician for the Department of Biochemistry at the University of Cambridge. He teaches undergraduate mathematics and postgraduate programming courses. Wayne is currently developing software for the analysis of biological molecules by nuclear magnetic resonance spectroscopy.
University of Cambridge
Appendix 2 Selected standard type methods and operations
Appendix 3 Standard module highlights
Many years ago we started programming in Python because we were working on a large computational biology project. In those days choosing Python was not nearly as common as it is today. Nonetheless things worked out well, and as our expertise grew it seemed only natural that we should run some elementary Python courses for the School of Biology at the University of Cambridge, where we were employed. The basis for those courses is what turned into the initial idea for this book. While there were many books about getting started with Python and some that were tailored to bioinformatics, we felt that there was still some room for what we wanted to put across. We began with the idea that we could write some chapters in relatively straightforward English that were aimed at biologists, who might be complete novices at programming, and have other sections that are useful to a more experienced programmer. Also, given that we didn’t consider ourselves to be typical bioinformaticians, we were thinking more broadly than just sequence-based informatics, though naturally such things would be included. We felt that although we couldn’t anticipate all the requirements of a biological programmer there were nonetheless a number of key concepts and techniques which we could try to explain. The end result is hopefully a toolkit of ideas and examples which can be applied by biologists in a variety of situations.
Tim J Stevens
Wayne Boucher
Cambridge January 2014
We acknowledge the support of the Medical Research Council and the Biotechnology and Biological Sciences Research Council, the UK funding bodies who have funded the scientific projects that we have been involved with over the years. This has allowed us to use and develop our Python programming skills while remaining gainfully employed.
1 Prologue
Contents
Python programming for biology
Choosing Python
Python’s history and versions
Bioinformatics
Computer platforms and installations
Python programming for biology
One of the main aims of this book is to empower the average researcher in the life sciences, who may have a pertinent scientific question that can be readily answered by computational techniques, but who doesn’t have much, if any, experience with programming. For many in this position, the task of writing a program in a computer language is a bottleneck, if not an impassable barrier. Often, the task is daunting and seems to require a significant investment of time. The task is also subject to the barriers presented by a vocabulary filled with jargon and a seemingly steep learning curve for those people who were not trained in computing or have no inclination to become computer specialists. With this in mind for the novice programmer, one ought to start with the language that is the easiest to get to grips with, and at the time of writing we believe that that language is Python. This is not to say that we have made a compromise by choosing a language that is easy to learn but which is not powerful or fully featured. Python is certainly a very rich and capable way of programming, even for very large projects; otherwise we authors wouldn’t be using it for our own scientific work.
A second main aim of this book is to use Python as a means to illustrate some of what is going on within biological computing. We hope our explanations will show you the scientific context of why something is done with computers, even if you are a newcomer to biology or medical sciences. Even where a popular biological program is not written in Python, or if you are a programmer who has good reason for using another language, we can still use Python as a way of illustrating the major principles of programming for biology. We feel that many of the most useful biological programs are based on combinations of simple principles that almost anyone can understand. By trying to separate the core concepts from the obfuscation and special cases, we aim to provide an overview of techniques and strategies that you can use as a resource in your own research. Virtually all of the examples in this book are working code that can be run and are based on real problems or programs within biological computing. The examples can then be adapted, altered and combined to enable you to program whatever you need.
We wish to make clear that this book intends to show you what sort of things can be done and how to begin. It does not intend to offer a deep and detailed analysis of specific biological and computational problems. This is not a typical scientific book, given that we don’t always go for the most detailed or up-to-date examples. Given the choice, we aim to give a broad-based understanding to newcomers and avoid what some may consider pedantry. No doubt some people will think our approach somewhat too simplistic, but if you know enough to know the difference then we don’t recommend looking to this book for those kinds of answers. Likewise, there is only room for so many examples and we cannot cover all of the scientific methods (including Python software libraries) that we would want to. Hopefully though, we give the reader enough pointers to make a good start.
Choosing Python
It is perhaps important to include a short justification to say why we have written this book for the Python programming language; after all, we can choose from several alternative languages. Certainly Python is the language that we the authors write in on a daily basis, but this familiarity was actually born out of a conscious decision to use Python for a large biological programming project after having tried and considered a number of popular alternatives. Aside from Python, the languages that we have commonly come across in today’s biological community include: C, C++, FORTRAN, Java, Matlab, Perl, R and Ruby. Specific comparison with some of these languages will be made at various points in the book, but there are some characteristics of Python that we enjoy, which we feel would not be available to the same level or in the same combination in any other language.
We like the clear and consistent layout that directs the programmer away from obfuscated program code and towards an elegantly readable solution; this becomes especially important when trying to work out what someone else’s program does, or even what your own material does several years later. We like the way that Python has object orientation at its heart, so you can use this powerful way to organise your data while still having the easy look and feel of Python. This also means that by learning the language basics you automatically become familiar with the very useful object-oriented approach. We like that Python generally requires fewer lines of program code than other languages to do the equivalent job, and that it often seems so much less tedious to write.
It is important to make it clear that we would not currently use Python for every programming task in the life sciences. Python is not a perfect language. As it stands currently for some specialised tasks, particularly those that require fast mathematical calculations which are not supported by the numeric Python modules, we actively promote working with a Python extension such as Cython, or some faster alternative language. However, we heartily recommend that Python be used to administer the bookkeeping while the faster alternative provides extra modules that act as a fast calculation engine. To this end, in Chapter 27 we will show you how you can seamlessly mesh the Python language with Cython and also with the compiled language C, to give all the benefits of Python and very fast calculations.
Python’s history and versions
of his innovation and continuing support that Python is popular and continues to grow. The Python programming community has afforded Guido the honour of the title ‘benevolent dictator for life’. What this means is that despite the fact that many aspects of Python are developed by a large community, Guido has the ultimate say in what goes into Python. Although not bound in any legality, everyone abides by Guido’s decisions, even if at times some people are surprised by what he decides. We believe that this situation has largely benefited Python by ensuring that the philosophy remains unsullied. Seemingly often, a committee decision has the tendency to try to appease all views and can become tediously slow with indecision; too timid to make any bold, yet improving moves. The Python programming community has a large role in criticising Python and guiding its future development, but when a decision needs to be made, it is one that everyone accepts. Certainly there could be a big disagreement in the future, but so far the benevolent dictator’s decisions have always taken the community with him.
There have been several, and in our opinion improving, versions of the Python programming language. All versions before Python 3 share a very high degree of backward-compatibility, so that code written for version 1.5 will still (mostly) work with say version 2.7 with few problems. Python 3 is not as compatible with older versions, but this seems a reasonable price to be paid to keep things moving forward and eradicate some of the undesired legacy that earlier versions have built up. Rest assured though, version 3 remains similar enough in look and feel to the older Pythons, even if it is not exactly the same, and the examples in this book work with both Python 2 and Python 3 except where specifically noted. Also, included with the release of Python 3 is a conversion program ‘2to3’ which will attempt to automatically change the relevant parts of a version 2 program so that it works with version 3. This will not be able to deal with every situation, but it will handle the vast majority and save considerable effort.
For this book we will assume that you are using Python version 2.6 or 2.7 or 3. Some bits, however, that use some newer features will not work with versions prior to 2.6 without alteration. We feel that it is better to use the best available version, rather than write in a deliberately archaic manner, which would detract from clarity.
Bioinformatics
The field of bioinformatics has emerged as we have discovered, through experimentation, large amounts of DNA and protein sequence information. In its most conservative sense bioinformatics is the discipline of extracting scientific information by the study of these biological sequences, which, because of the large amount of data, must be analysed by computer. Initially this encompassed what most biological computing was about, but we contend that this was simply where biomolecular computing began and that it has far to go. The informatics of biological systems these days includes the study of molecular structures, including their dynamics and interactions, enzymatic activity, medical and pharmacological statistics, metabolic profiles, system-wide modelling and the organisation of experimental procedures, to name only a subset. It is within this wider context that this book is placed.
At present the programming language that is historically most famous for being used with bioinformatics is probably Perl, which is notable for its ability to manipulate sequences, particularly when stored as letters within formatted text. It also has a library of modules available to perform many common bioinformatics tasks, collectively named BioPerl. In this arena Python can do everything that Perl can. There is a Python equivalent of BioPerl, unsurprisingly named BioPython, and at this time the uptake of Python within the bioinformatics community is growing, which is not surprising, given our belief that it is an easier but more powerful language to work with. It is important to note that although some of the BioPython modules will certainly be discussed in the course of this book (and we would generally advise using tested, existing code wherever possible to make your programs easier to write and understand) the explanations and examples will be more to do with understanding what is going on underneath. We aim to avoid this book simply becoming a brochure for existing programs where you don’t have to know the inner workings.
Computer platforms and installations
Python is available for every commonly used computer operating system including versions of Microsoft Windows, Mac OS X, Linux and UNIX. With Windows you will generally have to download and install Python, as it is not included as standard. On most new Mac OS X, Linux and UNIX systems Python is included as standard (indeed some parts of Linux operating systems are themselves written with Python), although you should check to see which version of Python you have: typing ‘python’ at a command line reveals the version. For a list of website locations where you can download Python for various platforms see the reference section at the end of this book or the Cambridge University Press site: http://www.cambridge.org/pythonforbiology
Precisely because Python is available for and can be run on many different computer platforms, any programs you write will generally be able to be run on all computer systems. However, there are a few important caveats you should be aware of. Although Python as a language is interpreted in the same way on every computer system, when it comes to interacting with the operating system (Windows, Mac OS X, Linux …), things can work differently on different computers. This is a problem that all cross-platform computing languages face. You will probably come across this in your Python programs when dealing with files and the directories that contain them. Although each operating system will have its own nuances, once you are aware of the differences it is a relatively simple job to ensure that your programs work just as well under any common operating system, and we will cover details of this as required in the subsequent chapters.
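As a small illustration of the file and directory caveat, the sketch below builds a file location with the standard os module rather than typing the separator by hand. The directory and file names are invented for the example and do not refer to anything in this book.

```python
import os

# Hypothetical directory and file names, used only for illustration.
data_dir = os.path.join("projects", "sequences")
file_path = os.path.join(data_dir, "example.fasta")

# os.sep is the separator for the current operating system
# (a backslash on Windows, a forward slash on Linux and Mac OS X).
print(os.sep)
print(file_path)

# Splitting a location into directory and file name is also
# handled for us in a platform-independent way.
directory, file_name = os.path.split(file_path)
print(directory)
print(file_name)
```

Because the separator is never written explicitly, the same lines should behave sensibly on any of the common operating systems.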
1 The name itself derives from Monty Python, which is why you’ll find the occasional honorary reference to ‘spam’, ‘dead parrot’ etc. when arbitrary examples are given.
2 A beginners’ guide
Contents
Programming principles
Interpreting commands
Reusable functionality
Types of data
Python objects
Variables
Basic data types
Numbers
Text strings
Special objects
Data collections
Converting between types
Program flow
Operations
Control statements
Programming principles
The Python language can be viewed as a formalised system of understanding instructions (represented by letters, numbers and other funny characters) and acting upon those directions. Quite naturally, you have to put something in to get something out, and what you are going to be passing to Python is a series of commands. Python is itself a computer program, which is designed to interpret commands that are written in the Python language, and then act by executing what these instructions direct. A programmer will sometimes refer to such commands collectively as ‘code’.
Interpreting commands
So, to our first practical point; to get the Python interpreter to do something we will give it some commands in the form of a specially created piece of text. It is possible to give Python a series of commands one at a time, as we slowly type something into our computer. However, while giving Python instructions line by line is useful if you want to test out something small, like the examples in this chapter, for the most part this method of issuing commands is impractical. What we usually do instead is create all of the lines of text representing all the instructions, written as commands in the Python language, and store the whole lot in a file. We can then activate the Python interpreter program so that it reads all of the text from the file and acts on all of the commands issued within. A series of commands that we store together in such a way, and which do a specific job, can be considered as a computer program.1 If you would like to try any of the examples given in the book the next chapter will tell you how to actually get started. The initial intention, however, is mostly to give you a flavour of Python and introduce a few key principles.
Reusable functionality
When writing programs in the Python language, which the Python interpreter can then use, we are not restricted to reading commands from only one file. It is a very common practice to have a program distributed over a number of different files. This helps to organise writing of the program, as you can put different specialised parts of your instructions into different files that you can develop separately, without having to wade through large amounts of text. Also, and perhaps most importantly, having Python commands in multiple files enables different programs to share a set of commands. With shared files, the distinction between which commands belong to one program and which belong to another is mostly meaningless. As such, we typically refer to such a shared file as a module.
In Python you will use modules on a regular basis. And, as you might have already guessed, the idea is to have modules containing a series of commands which perform a function that would be useful for several programs, perhaps in quite different situations. For example, you could write a module which contains the commands to do a statistical analysis on some numeric data. This would be useful to any program that needs to run that kind of analysis, as hopefully we have written the statistics module in such a way that the precise amount and source of the numeric data that we send to the module is irrelevant. Whenever we use a module we are avoiding having to write new Python commands, and are hopefully using something that has been tried and tested and is known to work.
When working with Python there is already a long list of pre-made modules that you can use. For example, there are modules to perform common mathematical operations, to interact with the operating system and to search for patterns of symbols within text. These are all generally very useful, and as such they are included as standard whenever you have Python installed. You will still have to load, or import, these modules into a program to use them, but in essence you can think of these modules as a convenient way of extending the vocabulary of the Python language when you need to. By the same token, you don’t have to load any modules that are not going to be useful, which might slow things down or use unnecessary computer memory.
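To give a flavour of what importing looks like, the following minimal sketch loads three standard modules that match the kinds of task just mentioned: math for mathematical operations, os for the operating system and re for text patterns. The example text being searched is invented.

```python
import math   # common mathematical operations
import os     # interaction with the operating system
import re     # searching for patterns of symbols within text

root = math.sqrt(16.0)
print(root)

current_dir = os.getcwd()   # the directory Python is currently working in
print(current_dir)

# Find every run of digits in a piece of text.
numbers = re.findall(r"\d+", "Sample 12 was measured 3 times")
print(numbers)
```

Each name after the import line, such as math.sqrt, is reached through the module it belongs to, which is how the extra vocabulary is kept separate from the core language.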
Types of data
Before going on to give a more detailed tutorial we will first describe a little about the construction and makeup of commands written in the Python language. Writing the command code for a program involves thinking about items of data. There can be many different kinds of data, from different origins, that we would wish to manipulate with a computer. Typically we will represent the smallest units of this information as numbers or text. We can organise such numbers and text into structured arrangements, for example, to create a list of data, and we can then manipulate this entire larger container, with all of its underlying elements, as a single unit. For example, given a list containing numbers you could extract the first number from the list, or maybe get the list in reverse order.
numbers = [6, 0, 2, 2, 1, 4, 1, 5]
numbers.reverse()
print(numbers)
Defining a list of numbers as a single entity and then reversing its order, before printing the result to the screen.
In Python, as in many languages, there are some standard types of data-containing structures that form the basis of most programs, and which are very easy to create and fill with information. But you are not limited to these standard data structures; you can create your own data organisation. For example, you could create a data structure called a Person, which can store the name, sex, height and age of real people. In a program, just as you could get the first element of data stored in a list, so too could you extract the number that represents the age of a Person data structure. Going further, you could create many Person data structures and organise them further by placing them into lists. A data structure can appear inside the organisation of many other data structures, so a single Person could appear in several different lists (for example, organised by age, sex or whatever) or a Person could contain references to other Person data structures to indicate the relationships between parents and children.
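One possible way to sketch such a Person data structure is shown here. The attribute names, the choice of metres for height and the example people are all assumptions made for illustration; the details of defining your own data structures are a later topic.

```python
class Person(object):
    """A simple, illustrative data structure describing one person."""

    def __init__(self, name, sex, height, age):
        self.name = name
        self.sex = sex
        self.height = height   # assumed to be in metres
        self.age = age
        self.children = []     # references to other Person objects


# Invented example data.
parent = Person("Alice", "F", 1.68, 52)
child = Person("Bob", "M", 1.80, 25)
parent.children.append(child)

# The same Person objects can sit in a list organised by age.
people_by_age = sorted([parent, child], key=lambda person: person.age)

print(parent.age)
print([person.name for person in people_by_age])
print(parent.children[0].name)
```

Note that the child appears both in the sorted list and in the parent's record; both refer to the same underlying object rather than to copies.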
Python objects
This is where we can introduce the concept of an object. The Person data structure described above would commonly be referred to as a Person object. Indeed, all of the organised data structures in Python, including the simple inbuilt ones, are referred to as
An important concept when dealing with objects is inheritance. That is to say that we can make a new type of data structure by basing it on an existing one. Indeed, every object in Python, except the simplest data structure of them all (the base object), inherits its organisation from another object. Accordingly, you could take a Person object and use its specification to create a Scientist object. This would immediately give the Scientist object the same data organisation of a Person object, with its age, sex and height data, but we can go on to modify the Scientist object to also store different information, like a list of publications or current work institution. This can also be done for the built-in objects, so you could have your own version of a Python list that is only allowed to contain odd numbers, if you really, really wanted.
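The idea of inheritance can be sketched as follows. The class layout, attribute names and example values are hypothetical, and the odd-number list is deliberately incomplete: only its append operation is guarded, so other ways of adding items would still let even numbers in.

```python
class Person(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age


class Scientist(Person):
    """Inherits name and age from Person and adds its own data."""

    def __init__(self, name, age, institution):
        Person.__init__(self, name, age)
        self.institution = institution
        self.publications = []


class OddList(list):
    """A list whose append() refuses even numbers (a partial sketch)."""

    def append(self, value):
        if value % 2 == 0:
            raise ValueError("only odd numbers are allowed")
        list.append(self, value)


researcher = Scientist("Carol", 41, "Example Institute")
researcher.publications.append("An invented paper title")
print(isinstance(researcher, Person))

odds = OddList()
odds.append(3)
try:
    odds.append(4)
except ValueError as err:
    print(err)
print(odds)
```

A Scientist made this way still counts as a Person, which is why the isinstance test succeeds.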
So far we have discussed the manipulation of data by a Python program in fairly loose terms, so it is about time to more properly introduce you to a few of the concepts that you will commonly use in Python programs. The examples that we give use operations and types of data that are built into the language as standard, i.e. that the Python interpreter will know how to handle without you having to add any special information.
is similar to algebra, where you can describe formulae, like y = x² + 3, without specifying what x or y actually are, and then use the formula on different values of x in order to compute y.
Note that in Python if you set the variable with the name ‘x’ to take a numeric value you can still set it to be some other type of data later on in the program, so initially it may be a number, but later be some text. Bearing this in mind, you must be careful that you only perform operations on the ‘x’ data that are valid for that type of data. Staying with the idea of data items having a particular data type, we next go through the basic types
in combination with other types of number object, but in Python 2 if you perform some mathematical operations with only integers the result is an integer too. While this makes sense for addition and multiplication, division will give you the perhaps surprising result of a whole number, rounding the answer (towards negative infinity to be precise). The advantage of integer operations is that they are quick and always precise; non-integer representation can give rise to small errors which can sometimes have serious consequences.
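The rounding behaviour can be seen with the floor division operator ‘//’, which behaves the same way in Python 2 and Python 3. Note that in Python 3 the plain ‘/’ operator applied to two integers gives a floating point answer instead, so the short sketch below uses ‘//’ to make the whole-number division explicit.

```python
# Floor division of integers always gives an integer,
# rounded towards negative infinity.
print(7 // 2)
print(-7 // 2)

# divmod() gives the whole-number quotient and the remainder together.
quotient, remainder = divmod(-7, 2)
print(quotient, remainder)

# Making one of the numbers a float gives a fractional answer
# in both Python 2 and Python 3.
print(7 / 2.0)
```

Rounding towards negative infinity is why dividing -7 by 2 gives -4 rather than -3.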
In Python 2 there are actually two types of integers, normal integers and long integers, although you usually don’t have to pay much attention to this fact. The long integer variety is used when the number is so big2 that it must be stored in a different way, as it takes up more memory slots to store the digits. Accordingly, you might see the 18-digit number 123,456,789,123,456,789 represented in Python (before version 3) as 123456789123456789L, i.e. with an extra ‘L’ at the end giving a hint that it is the long variety. But otherwise you can simply treat it as a number and do all the usual operations with it. In Python 3 this distinction disappears and every integer is a long integer.
Floating point numbers
Floating point numbers, often simply referred to as floats, are numbers expressed in the decimal system, i.e. 2.1, 999.998, 0.000004 or whatever. The value 2.0 would also be interpreted as a floating point number, but the value 2, without the decimal point, will not; it will be interpreted as an integer. Floating point numbers can also carry a suffix that states which power of ten they operate at. So, for example, you can express four point six million as 4.6×10⁶, which in Python would be written as 4.6e6 (or as 46e5 or as 0.46e7) and similarly one hundredth would be 1.0e-2. A potential pitfall with floating point numbers is that they are of limited precision. Of course you would not expect to be able to express some fractions like ⅓ exactly, but there can otherwise be some surprises when you do certain calculations. For example, 0.1 plus 0.2 may sometimes give you something like 0.30000000000000004, because of the way that the innards of computers work. The difference between this number and the desired value of 0.3 is what would be referred to
matter, but sometimes it does matter and you should be aware of this issue. Common situations where the floating point errors could matter include: when you are repeatedly updating a value and the error grows, when you are interested in the small difference that results when subtracting two larger numbers and when two values ought to be equal but they aren’t exactly, e.g. after division you test for 1.0 but don’t get the expected exact value.
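A common way to cope with values that ought to be equal but are not exactly so is to test whether they differ by less than some small tolerance. The sketch below shows this; the tolerance of 1e-9 is an arbitrary choice for the example and would need to suit the scale of your own data.

```python
total = 0.1 + 0.2
print(total == 0.3)          # an exact test, which fails here

tolerance = 1e-9             # arbitrary threshold chosen for this example
is_close = abs(total - 0.3) < tolerance
print(is_close)

# Repeatedly updating a value lets the small errors accumulate.
accumulated = 0.0
for _ in range(10):
    accumulated += 0.1

print(accumulated == 1.0)
print(abs(accumulated - 1.0) < tolerance)
```

In both cases the exact comparison fails while the tolerance-based comparison succeeds, which is usually the behaviour you want when the numbers come from a calculation.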
Text strings
Strings are stretches of alphanumeric characters like “abc” or ‘Hello world’, in other words they represent text. In Python strings are indicated inside of single or double quotation marks, so that their text data can be distinguished from other data types and from the commands of the program. Thus if in Python we issue the command print("lumberjack") we know that "lumberjack" is the string data and everything else is Python command. Similarly, quotation marks will also distinguish between real numbers and text that happens to be readable as a number. For example, 1.71 is a floating point number but “1.71” is a piece of text containing four characters. You cannot do mathematics with the text string “1.71”, although it is possible to convert it to a number object with the value 1.71.
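The difference between the number 1.71 and the text “1.71” can be seen in a few lines, using the built-in float() function to do the conversion. The variable names are chosen only for this example.

```python
text = "1.71"
print(len(text))        # the text contains four characters

height = float(text)    # convert the text into a floating point number
print(height + 1.0)

# Multiplying a string repeats the text rather than doing arithmetic.
print(text * 2)
```

Only after the conversion does addition behave as mathematics; applied to the string itself, an operation like multiplication simply repeats the characters.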
String objects might contain elements that cannot be represented by the printable characters found on a keyboard, but which are nonetheless part of a piece of text. A good example of this is the way that you can split text over several lines. When you type into your computer you may use the Return key to do this. In a Python string you would use the special sequence “\n” to do this:3 Python uses a combination of characters to provide the special meaning. For example, “Dead Parrot” naturally goes on one line, but “Dead\nParrot” goes on two, as if you had pressed Return between the two words.
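The effect of the new-line sequence is easy to try out. In this short sketch the two strings differ only in whether a space or “\n” separates the words.

```python
one_line = "Dead Parrot"
two_lines = "Dead\nParrot"

print(one_line)
print(two_lines)

# Although typed as two symbols, "\n" counts as a single character.
print(len(one_line), len(two_lines))

# Splitting on the new-line character recovers the separate lines.
print(two_lines.split("\n"))
```

Both strings have the same length, because the backslash and the letter n together stand for one character in the stored text.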
Another concept that deserves some explanation is the empty string, written simply as “”, with no visible characters between quotes. You can think of this in the same way as an empty list; as a data structure that is capable of containing a sequence, but which happens to contain nothing. The empty string is useful in situations where you must have a string object present but don’t want to display any characters.
Text strings are made up of individual characters in a specific order, and in some ways you can think of them as being like lists. Thus, for example, you can query what the first character of a string is, or determine how long it is. In Python, however, you cannot modify strings once they are defined; if you want to make a change you have to recreate them in their entirety. This might seem stifling at first glance, but it rarely is in practice. The benefit of this system is that you can use strings to access items in a Python dictionary (which is a handy way to store data that we discuss below); if strings were internally alterable this would not be possible in Python. Python can readily perform operations to replace an existing string with a modified version. For example, if you wanted to convert some data that is initially stored as “Dead Parrot” into the text “Ex-Parrot” you could redefine the data as the string “Ex-” joined onto the last six characters of the original text. If at any point it really is painful to redefine a string entirely, a common trick is to convert the text into a list of separate characters (see list data type below) that you can manipulate
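The string operations described above can be sketched as follows. Joining the list of characters back into a string at the end is our assumption about how the list trick would normally be completed.

```python
text = "Dead Parrot"
print(text[0])      # the first character
print(len(text))    # how long the string is

# Build a new string from "Ex-" and the last six characters.
new_text = "Ex-" + text[-6:]
print(new_text)

# Strings cannot be changed in place; trying to do so is an error.
try:
    text[0] = "R"
except TypeError as err:
    print(err)

# Work on a list of separate characters, then join them back together.
letters = list(text)
letters[0] = "R"
edited = "".join(letters)
print(edited)
```

The original string is untouched throughout; each change produces a new string object.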
Special objects
Booleans
The two Boolean objects are True and False, and they mean much what you might expect. Many objects can be examined to test whether they are logically false, like an empty list or zero, or logically true, like 1.0 or a filled list. However, the True and False objects (note the capital letters) are special. They are the objects that you get back when you do a truth test. So, if you write a command to determine whether some number is equal to another number you will get a True object if they are equal or a False object if they are not equal. This differs from some languages where you might get 0 or 1 rather than dedicated Boolean objects. Also, you can set things to be True or False within your programs where you know that some data should only take one of two values.
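A brief sketch of these ideas is given below; the built-in bool() function is used to show whether an object counts as logically true or false, and the variable names are invented for the example.

```python
x = 3
y = 3
print(x == y)    # a truth test gives back a Boolean object
print(x == 4)

# An empty list and zero are logically false;
# 1.0 and a filled list are logically true.
print(bool([]), bool(0), bool(1.0), bool([1, 2]))

# A variable that should only ever take one of two values.
is_finished = False
if not is_finished:
    is_finished = True
print(is_finished)
```

The result of the equality test is the True object itself, not the number 1.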
an empty one. This distinction may seem tenuous, but it can be critical. For example, if I have records of people where I can store the names of their children, an empty list would indicate that a person has no children to be named, but a None object would indicate that I have been unable to determine whether the person had any children or not; not that they definitely had none.
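The children example might be sketched like this. The records, the names in them and the describe_children() function are all invented for illustration.

```python
# Invented records: each person's children as a list, or None if unknown.
children_records = {
    "Alice": ["Bob"],
    "Carol": [],       # known to have no children
    "Dave": None,      # not known whether there are any children
}


def describe_children(name):
    """Summarise what is recorded about one person's children."""
    children = children_records[name]
    if children is None:
        return "unknown"
    if len(children) == 0:
        return "no children"
    return ", ".join(children)


for person in ["Alice", "Carol", "Dave"]:
    print(person, describe_children(person))
```

The test for None uses ‘is None’ and comes before the test for an empty list, so that missing information is never mistaken for a definite absence.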
of days in each month of a year as a list, as illustrated below. Often in Python programs you will be accessing the elements contained in a list by referring to a specific position (an
days = [31, 28.243, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
matrix = [[-1, 0, 0], [0, 1, 0], [0, 0, 3]]
Defining lists in Python; a simple list of numbers (both integers and floating point) and a second list which contains three sub-lists, each representing a row in a matrix.
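Using the same two lists, the following sketch shows how items are fetched by position. In Python the first position is numbered zero, and negative numbers count back from the end.

```python
days = [31, 28.243, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
matrix = [[-1, 0, 0], [0, 1, 0], [0, 0, 3]]

print(days[0])        # the first item
print(days[-1])       # the last item
print(len(days))      # the number of items

print(matrix[1])      # the second row, itself a list
print(matrix[2][2])   # third row, third column
```

Fetching from the matrix takes two steps: the first position selects a row and the second selects an item within that row.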
Tuples
A tuple is a data structure that is very much like a list, but which you cannot change once it is created. In Python a tuple may be defined using round brackets ‘()’. Although you cannot change its items, a tuple is used to contain a sequence of elements in a specific order, and the different positions of this sequence can be interrogated. The contents of a tuple are defined in their entirety when the tuple object is made. Having a kind of list that you cannot change may seem like a pointless data structure, but tuples are a surprisingly useful type of object. If you know that a sequence should definitely not have any elements modified, added or deleted, then you can use a tuple to ensure that it is not possible to deviate from this plan: for example, if you want to specify a vector with exactly three spatial coordinates, e.g. (x, y, z), using a tuple ensures that you can’t have an invalid vector with too few or too many values. Similarly, tuples are used where you have elements that you know always go together; accordingly you could use tuples to specify a text font like (‘helvetica’, 10) or (‘roman’, 12), where you must have two elements to represent the name and the size of the font, and if you were to redefine the font you would have to specify both. Tuples, unlike lists, can be used as keys to refer to data in dictionary data structures (see the Dictionaries section below).
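These properties of tuples can be illustrated with a short sketch. The coordinate values and the descriptions attached to each font are invented for the example.

```python
position = (1.0, 2.5, -0.5)    # exactly three spatial coordinates
x, y, z = position
print(position[0], len(position))

# The items of a tuple cannot be changed once it is created.
try:
    position[0] = 9.9
except TypeError as err:
    print(err)

# To change a font, the whole tuple is redefined.
font = ("helvetica", 10)
font = ("roman", 12)

# Tuples can be used as keys in a dictionary.
font_uses = {("helvetica", 10): "body text", ("roman", 12): "headings"}
print(font_uses[font])
```

Unpacking into x, y and z only works because the tuple is guaranteed to hold exactly three values.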
Sets
Sets, like lists and tuples, are data containers that encompass a collection of other objects. However, unlike lists and tuples, the elements are in no particular order and the elements cannot be repeated in a set. A notable use for sets is when you have some data that you know, or suspect, contains repeat objects. By placing such data within a set any duplication will be removed. For example, you might have a list containing the colours of different items; if you put these colours into a set object you can find out how many different colours were used. Also sets can be useful because you can easily perform set operations, for example, to find the items that two collections have in common; this would
There is actually another variety of set, called a frozen set. These are the same as regular sets with the exception, as the name suggests, that they cannot be altered once created (just like tuples). A useful consequence of this is that they can be used as keys to extract data from dictionary data structures (see below).
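The uses of sets described above can be sketched as follows. The colour names and the mix_names dictionary are made-up illustrations; the sketch only relies on the built-in set and frozenset types.

```python
# Duplicates disappear when a list is placed in a set.
colours = ['red', 'blue', 'red', 'green', 'blue']
unique_colours = set(colours)
num_colours = len(unique_colours)      # number of different colours

# A set operation: the items two collections have in common.
group_a = set(['red', 'green', 'yellow'])
group_b = set(['green', 'blue', 'yellow'])
shared = group_a & group_b             # intersection

# A frozen set cannot be altered, so it can serve as a dictionary key.
pair = frozenset(['red', 'blue'])
mix_names = {pair: 'purple'}           # hypothetical look-up table
result = mix_names[frozenset(['blue', 'red'])]   # order is irrelevant
```

Because a set has no order, the two frozen sets in the last lines compare as equal and so find the same dictionary entry.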
Dictionaries
A dictionary is a Python data structure which associates pairs of data objects to create a look-up table. The first object in the pair is called the key, and is unique inside a given dictionary, and the second object is its value. Unlike lists, where you refer to items by their position in a sequence, with dictionaries the data is not stored in any particular order and you access the values contained within by using the key. For example, you could have a dictionary which records the heights of various mountains, where you find the correct height for the correct mountain by using the name of the mountain (a string, the key) to look up the height (the value). In this instance if you were using a list you would have to know in which order the heights were stored, but with a dictionary you do not. You can have empty dictionaries and add and remove data from dictionaries, by adding and removing the pairs of key and value. The value that a particular key finds can be altered at any time, and although a key can only refer to one value, the value could be a data structure object, like a list or set, that contains other items. In Python we use curly brackets to specify the beginning and end of a dictionary.
ageDict = {'homer':36, 'marge':34, 'lisa':8, 'bart':10}
print(ageDict['lisa'])
Defining a dictionary in Python, which in this example allows an age value to be accessed using a name as the key.
An important point to be aware of is that while any type of data object can be a value in a dictionary, only certain kinds of object (those that cannot be internally modified to assume a different identity⁵) can be used as keys. Put verbosely: text strings,⁶ tuples, the True and False objects, the None object, user-defined objects and frozen sets can be used as keys to access dictionary data, but lists, normal sets and other dictionaries cannot. The reason behind this rule is that the values in a dictionary are efficiently accessed based upon knowledge of the route from their keys, so Python must be able to consistently and uniquely identify each key and thus get the correct location of each value.
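The rule about which objects may act as keys can be tried out directly. In this hedged sketch the grid dictionary and its contents are invented for illustration; the behaviour shown (a TypeError when a list is used as a key) is standard Python.

```python
# Tuples, strings, None and frozen sets are all acceptable keys.
grid = {}
grid[(0, 0)] = 'origin'
grid['label'] = 'a text key'
grid[None] = 'no key information'
grid[frozenset([1, 2])] = 'a frozen set key'

# A list can be modified internally, so Python refuses it as a key.
try:
    grid[[3, 4]] = 'site B'
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Converting the list to a tuple gives a usable key.
grid[tuple([3, 4])] = 'site B'
num_entries = len(grid)
```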
Converting between types
If you have some useful data in one type of data structure but need it to be in another, it can be a very quick operation to do the transfer. Indeed, many of the common transformations between the standard types of data are built into the language. Thus, for example, with single commands you can convert a floating point number to an integer, which ignores everything after the decimal point, or convert an unalterable tuple to a changeable list containing the same elements, or make a number from a text string that contains digits.
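The three conversions mentioned above each correspond to a single built-in call. The following is a small sketch with arbitrary example values.

```python
# Floating point to integer: everything after the decimal point is ignored.
whole = int(7.9)
whole_negative = int(-7.9)     # truncation is towards zero

# An unalterable tuple to a changeable list with the same elements.
coordinates = (1, 2, 3)
editable = list(coordinates)
editable.append(4)             # allowed for the list, not for the tuple

# Numbers from text strings that contain digits.
count = int('42')
fraction = float('2.5')
```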
So far we have described the common types of data that you will be dealing with in your programs. To make a working program, however, you must be able to do more than organise data; you have to work with it by performing operations that depend upon the content of the data. A simple example of an operation, and one which we have already used above, would be the addition of numbers. Operations are specific to the type of object that they work upon, so you can do mathematics with numbers, but not strings. Similarly, you can join two strings together to form a longer text. For the two operations
that are built into objects are called methods in the jargon, and these are a special kind of what we later describe as functions. Because there is only a limited set of symbols that can be sensibly used to indicate such inbuilt procedures, often you have to activate the procedure (‘call the method’ in jargon) directly with a dot notation. For example, to reverse the order of a list named myData, because there is no symbolic way of reversing the list, you would issue the command myData.reverse(). With this notation, notice that the method has a name⁷ and that it is clearly associated with its list object using a dot ‘.’. We use the brackets at the end of the command to actually activate the procedure. If we simply issued the command myData.reverse then Python would interpret this as referring to the method (the procedure) without actually running it. The brackets at the end of an object’s method may also contain some data that the operation is going to work with. For example, to put the number 6 onto the end of a list you can use the append operation that is built into list objects, and the extra number goes in brackets: myData.append(6)
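The difference between referring to a method and calling it can be seen in a short sketch. The list name myData follows the text; its contents are an arbitrary example.

```python
myData = [3, 1, 2]

# Without brackets we only refer to the method; nothing is run.
method = myData.reverse
unchanged = list(myData)       # a copy, still in the original order

# With brackets the method is called and the list is reversed in place.
myData.reverse()
reversed_copy = list(myData)

# Data for the operation goes inside the brackets.
myData.append(6)
```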
Where there is actually a neat symbolic way of representing an operation there will also be an equivalent, albeit less elegant, version with the dot notation. As was illustrated in the example given earlier, x+y can be written as x.__add__(y). Here, you can see that the operation which the + symbol activates is internally called ‘__add__’. The plethora of underscore ‘_’ symbols indicates that this method is inbuilt and normally hidden.
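A brief sketch of this equivalence, using the method that the + symbol activates, which Python names with two underscores on each side of ‘add’. The values are arbitrary.

```python
x = 3
y = 4

symbolic = x + y            # the usual, neat form
dotted = x.__add__(y)       # the same operation written with dot notation

# Strings have their own version of the method, which joins text.
joined = 'abc'.__add__('def')
```

In everyday code the symbolic form is what you would normally write; the dotted form is shown only to reveal what happens underneath.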
Control statements
You have more control in a program than just activating all of the written commands once in written order. You can use control commands to divert the flow of the program’s execution under certain conditions, to add loops to repeat the execution of certain statements and to jump to a completely different part of the program, run some commands and then jump back again. It is very common to use all of these techniques, even in simple programs. As a simple example you might wish to look at all of the elements of a list in turn by using a repeating loop, performing the same operation on each of the values.
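As an illustrative sketch of looping and conditional flow (the list contents and variable names are assumptions made for the example):

```python
values = [1, 2, 3, 4]

# A repeating loop: the same operation is performed on each value in turn.
squares = []
for value in values:
    squares.append(value * value)

# A condition diverts the flow: only even values are added to the total.
even_total = 0
for value in values:
    if value % 2 == 0:
        even_total += value
```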
The ability to jump from executing the program flow in one place to another, execute some commands and then jump back again would be described in Python jargon as a function. In some older programming languages you can make the order of your program’s execution jump about by using GOTO commands, which simply say that commands from now on are executed from a specified line of code in the program. In contrast, a function in Python is a section of code that is bundled together with a name. When the Python interpreter reads the commands that go together to make a function it does not activate those commands immediately. Only when the name of the function is used appropriately in the main flow of the program’s execution are the commands from the function run. At the end of the function’s execution the program flow goes back to the
be useful in many separate situations, so too functions are written because they perform a role that is useful in many different parts of a program. It is generally far better to write a function to do a particular job once, and then activate or call that function wherever that job needs to be performed, rather than writing several bits of code that do the same thing. One note of caution with using Python functions is that they can be proportionally slow to run compared to the regular flow of a Python program; so if speed is an issue things can often be helped by removing unnecessary calls to functions. Also, functions are generally only useful if you use them in more than one place in a program. If a procedure is only ever going to be run in one part of a program you would usually put the required commands directly into the program and not bother with a named function.
Although Python functions can exist on their own, they can also be linked to particular kinds of data structures. A function that is linked to an object becomes a method of that object (a procedure that belongs to the object), and can be executed in the same way as any other method with the dot notation, as discussed earlier.
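The idea of a named, reusable function, and of a function linked to an object as a method, can be sketched as below. The function mean and the class Sample are hypothetical names invented for this illustration; defining your own classes is a later topic, so the second half is only a preview.

```python
def mean(numbers):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(numbers) / float(len(numbers))

# The function's commands run only when its name is called,
# and the same function can be called from many places.
first_mean = mean([1, 2, 3, 4])
second_mean = mean((10, 20))

class Sample(object):
    """A hypothetical object holding a list of readings."""

    def __init__(self, readings):
        self.readings = readings

    def mean(self):
        # A function linked to the object: called with dot notation.
        return mean(self.readings)

sample = Sample([2, 4, 6])
sample_mean = sample.mean()
```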
1 Not ‘programme’, even in the UK
2 Typically the long integers start at 2^31 or 2^63 depending on whether the system is 32 bit or 64 bit
3 To actually write the two characters “\n” without it being interpreted as a new line you would use “\\n”
4 In Python 3 having different types in a list is discouraged
5 The jargon is hashable, and this point is discussed further in Chapter 3.
6 It may seem surprising and limiting that text strings in Python are not internally modifiable, but in practice this causes few problems, given the right syntax
7 Which hopefully describes its purpose
3 Python basics
Contents
Introducing the fundamentals
Getting started
Whitespace matters
Using variables
Simple data types
Arithmetic
String manipulation
Collection data types
List and tuple manipulation
Set manipulation
Dictionary manipulation
Importing modules
Introducing the fundamentals
Python is a powerful, general-purpose computing language. It can be used for large and complicated tasks or for small and simple ones. Naturally, to get people started with its use, we begin with relatively straightforward examples and then afterwards increase the complexity. Hence, in the next two chapters we cover most of the day-to-day fundamentals of the language. You will need to be, at least a little, familiar with these ideas to appreciate the subsequent chapters. Much of what we illustrate here is called scripting, although there is no hard and fast rule about what is deemed to be a program and what is ‘merely’ a script. We will use the terminology interchangeably.
Here we describe most of the common operations and the basic types of data, but some aspects will be left to dedicated chapters. Initially the focus will be on the core data types handled by Python, which basically means numbers and text. With numbers we will of course describe doing arithmetic operations and how this can be moved from the specific into the abstract using variables. All the other kinds of data in Python can also be worked with in a similarly abstract manner, although the operations that are used to manipulate non-numeric data won’t be mathematical. Moving on from simple numbers and text we will describe some of the other standard types of Python data. Most notable of these are the collection types that act as containers for other data, so, for example, we could have a list of words or a set of numbers, and the list or set is a named entity in itself; just another item that we can give a label to in our programs. Python also has the ability to let you describe your own types of data, by making an object specification called a class. However, this will be discussed in Chapter 7. We will end this chapter by introducing the idea of importing Python modules, which is a mechanism to allow a program to access extra functionality contained in separate files.
Finally, using Python is not only about the operation of programs, it is also important to consider what it means to the people who read it. Hopefully you will be writing clearly understandable code, with meaningful variable names and such like. Nonetheless, it is a good idea, when using any programming language, to get into the habit of adding human-readable comments to your programs, especially at points where the logic of what is happening is not so obvious. Such comments are simply textual descriptions that are separate from the functional part of the code. In Python, comments are usually introduced using the hash symbol¹ ‘#’, whereupon all subsequent text on that line is for humans to read and not part of the program proper.
Figure 3.2 Setting the PATH environment variable in Windows. For the ‘python’ command to be recognised by Windows systems the PATH environment variable must include the location of the directory that contains the ‘python.exe’ file. An environment variable may be set in the Windows graphical interface via Control Panel → System and Security → System → Advanced system settings. If PATH is not already defined then the Python executable location may be specified via New …, for example, as ‘C:\Python27’ or ‘C:\Python33’, depending on the version. If the PATH is already defined then, after selection of this system variable in the lower table, using Edit … enables the addition of the Python location after any existing values, after a semicolon, for example, adding ‘;C:\Python34’. Note that the PATH specification has no spaces between entries (only ‘;’) and no trailing slash ‘\’.
The file name for Python scripts traditionally ends in ‘.py’, as illustrated in the example below, although strictly speaking it does not have to. By running the script we send the file containing lines of code to the Python interpreter, which reads it and acts on the contents.
The alternative to running Python from script files is to run Python alone, without a file, in an interactive mode. This mode gives you a prompt ‘>>>’, where you can type manual input that is passed to the interpreter one line at a time. To start the interpreter with the Windows operating system you would click on the Python icon. To start from Mac OS X, Linux or UNIX this means opening a command-line shell and typing ‘python’, then pressing the Return or Enter key.
command and move on to the next line. Note that, by using the ‘-i’ flag, it is also possible to run a Python script and then go into an interactive mode immediately afterwards. When the script is done it presents you with the prompt and awaits further instructions:
In Python 2 you can print a text message to the terminal window via the print command, for example:
print 'Hello world'
This automatically moves onto the next line because it prints a newline control character at the end. However, if you do not want to go to the next line put a comma at the end:
The print operation automatically converts anything that is not already a text string into text for display. Hence, for example, you can print numbers:
If you want to see the value of some variable when running Python from a script file you need to explicitly use print. However, at the Python prompt just giving a variable name, and nothing else, on a line will print it out, albeit sometimes slightly differently. Note that print tidies things a little by rounding the last few decimal places, which is normally what you want:
commands are interpreted. Space can be added by pressing the space bar or the Tab key, and moves the characters which follow to the right. Things like space and tab, which have no printed symbol, are often still considered as ‘characters’ in computing and are collectively referred to as ‘whitespace’. If you have too little or too much whitespace at the beginning of a line you get a syntax error. The syntax error indicates that the Python interpreter was unable to process the input characters in a meaningful way.
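The effect of stray whitespace at the start of a line can be demonstrated without crashing a program by handing some source text to the built-in compile() function. This is only a sketch: the two source strings are invented, and in normal use you would simply see the error when running the script.

```python
# Two small pieces of Python source held as text strings.
good_source = "x = 1\ny = x + 1\n"
bad_source = "x = 1\n    y = x + 1\n"   # second line wrongly indented

# The well-formed source compiles without complaint.
compile(good_source, '<example>', 'exec')

# The badly indented source is rejected with a kind of syntax error.
try:
    compile(bad_source, '<example>', 'exec')
    error_name = None
except SyntaxError as err:
    error_name = type(err).__name__
```

The specific error reported for unexpected leading whitespace is IndentationError, which Python treats as a particular kind of SyntaxError.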
Note that whitespace after the beginning of a line does not matter (between tokens of
We can use as many different variable names as we like and assign their value based on other variables. For example, in the following we assign a value to x and then assign a value for y based on x:
>>> x = 17
>>> y = x * 13
>>> print(y)
221
specify, and then stick to, a given kind or type of data for a given variable. You could initially allocate a numeric value to ‘x’, without advance warning, and then later on change ‘x’ to some text. This differs from languages like C and Java, for example, where you would have to declare up front what type of data ‘x’ was to contain. In Python, the type of variable is specified by the type of whatever its value is set to. So if you redefine a variable its type may change. Although variables can change type, it is usually best to avoid that practice.
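A short sketch of a variable changing type when it is redefined; the values are arbitrary and the type names shown in the comments are those reported by Python 3.

```python
x = 17                           # x currently refers to an integer
first_type = type(x).__name__    # 'int'

x = 'seventeen'                  # the same name now refers to text
second_type = type(x).__name__   # 'str'
```

Nothing stops this from running, which is exactly why reusing one name for different kinds of data is best avoided: the reader can no longer tell what x holds without tracing the code.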
Simple data types
As with other computer languages, Python has various simple, inbuilt types of data. These are Boolean values, integers, floating point numbers, complex numbers, text strings and the null object.
Boolean values represent truth or falsehood, as used in logic operations. Not surprisingly, there are only two values, and in Python they are called True and False.³ Example usage:
x = -7
y = 123
Floating point numbers (in mathematics the real numbers), which are written with decimal points or exponential notation, are not always represented exactly, since a computer has only a finite amount of memory. This introduces issues to do with numerical errors, and potential instability of numerical algorithms. However, such issues are
y = 1.2-5.8j
Strings represent text, i.e. strings of characters. They can be delimited by single quotes (') or double quotes ("), but you have to use the same delimiter at both ends. Unlike some programming languages, such as Perl, there is no practical difference between the two types of quote, although using one type does allow the other type to appear inside the string as a regular character. Example usage:
z = None
Finally, if you have a variable and want to know what its data type is then you can use the type() function. This actually generates a special object representing the type, though it prints out in an informative way:
print( type(x) ) # 'complex'
print( type(z) ) # 'NoneType'
Arithmetic
Python mostly uses a similar syntax to other computer languages for performing numerical arithmetic:
A non-programmer might wonder why x//y is useful at all. However, it turns out that it does come up in various contexts, but mostly when x and y are integers. This brings up an oddity, which Python, before version 3, shares in common with many computer languages, namely that for integers, the operation x/y is the same as x//y. A non-programmer might expect that 13/5 is equal to 2.6, but in fact it is equal to 2, the integer part of that. This is in contrast to doing division where at least one floating point number is involved like 13/5.0, 13.0/5 or 13.0/5.0, which are all indeed equal to 2.6. Hence in Python 2, if you have two integers and want to do the traditional non-integer division then you can explicitly convert one of them to a floating point number using the float() function, so, for example, float(13)/5. (There is also an int() function for converting floating point numbers to their integer part.)
It is a historic accident that integer division behaves this way, although the situation changes in Python 3, where integer division reverts to its more traditional ‘human’ meaning, so 13/5 now does equal 2.6. Accordingly, it is recommended that in Python 2 you avoid x/y if x and y are integers, but instead use x//y.
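The division behaviour described above can be checked with a few lines. This sketch assumes it is run with Python 3; in Python 2 the first line would give 2 rather than 2.6, as the text explains.

```python
true_quotient = 13 / 5           # Python 3: ordinary division of integers
floor_quotient = 13 // 5         # integer division in both Python 2 and 3

# Forms that behave the same way in Python 2 and Python 3.
explicit_float = float(13) / 5   # convert one integer first
mixed = 13 / 5.0                 # one operand already floating point

integer_part = int(2.6)          # back to the integer part
```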
it rarely is because it is easy enough to create a new, modified string from an existing string. And since strings are not modifiable it means that they can be placed in sets and used as keys in dictionaries, both of which are exceedingly useful.
In this section we will illustrate some basic manipulations on strings using the following example string:
text = 'hello world' # same as double quoted "hello world"
In some ways a string can be thought of as a list of characters, although in Python a list of characters would be a different entity (see below for a discussion of lists). Note that when we refer to something in a string as being a character, we don’t just mean the regular symbols for letters, numbers and punctuation; we also include spaces and formatting codes (tab stop, new line etc.). You can access the character at a specific position, or index, using square brackets:
text[1] # 'e'
text[5] # ' ' – a space
Note that the index for accessing the characters of a string starts counting from 0, not 1