Python Data for Developers
A Curated Collection of Chapters from the O'Reilly Data and Programming Library
Data is everywhere, and it's not just for data scientists. Developers are
increasingly seeing it enter their realm, requiring new skills and
problem solving. Python has emerged as a giant in the field,
combining an easy-to-learn language with strong libraries and a
vibrant community. If you have a programming background (in
Python or otherwise), this free ebook will provide a snapshot of the
landscape for you to start exploring more deeply.
For more information on current & forthcoming Programming
content, check out www.oreilly.com/programming/free/
Python for Data Analysis
Available here
Chapter 2: Introductory Examples
Python Language Essentials Appendix
Python Data Science Handbook
Available here
Chapter 3: Introduction to NumPy
Chapter 4: Introduction to Pandas
Data Science from Scratch
Available here
Chapter 10: Working with Data
Chapter 25: Go Forth and Do Data Science
Python and HDF5
Available here
Chapter 2: Getting Started
Chapter 3: Working with Data Sets
Cython
Available here
Chapter 1: Cython Essentials
Chapter 3: Cython in Depth
Python for Data Analysis
Wes McKinney
CHAPTER 2
Introductory Examples
This book teaches you the Python tools to work productively with data. While readers may have many different end goals for their work, the tasks required generally fall into a number of different broad groups:
Interacting with the outside world
Reading and writing with a variety of file formats and databases
Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or othercomputational tools
Presentation
Creating interactive or static graphical visualizations or textual summaries
In this chapter I will show you a few data sets and some things we can do with them. These examples are just intended to pique your interest and thus will only be explained at a high level. Don't worry if you have no experience with any of these tools; they will be discussed in great detail throughout the rest of the book. In the code examples you'll see input and output prompts like In [15]:; these are from the IPython shell.
To follow along with these examples, you should run IPython in Pylab
mode by running ipython --pylab at the command prompt.
1.usa.gov data from bit.ly
In 2011, URL shortening service bit.ly partnered with the United States government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. As of this writing, in addition to providing a live feed, hourly snapshots are available as downloadable text files.
In the case of the hourly snapshots, each line in each file contains a common form of web data known as JSON, which stands for JavaScript Object Notation. For example, if we read just the first line of a file, you may see something like:
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Out[16]: '{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Python has numerous built-in and 3rd party modules for converting a JSON string into a Python dictionary object. Here I'll use the json module and its loads function invoked on each line in the sample file I downloaded:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path, 'rb')]
If you’ve never programmed in Python before, the last expression here is called a list
comprehension, which is a concise way of applying an operation (like json.loads) to acollection of strings or other objects Conveniently, iterating over an open file handlegives you a sequence of its lines The resulting object records is now a list of Pythondicts:
In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
Gecko) Chrome/17.0.963.78 Safari/535.11',
Counting Time Zones in Pure Python
Suppose we were interested in the most often-occurring time zones in the data set (the tz field). There are many ways we could do this. First, let's extract a list of time zones again using a list comprehension:
In [25]: time_zones = [rec['tz'] for rec in records]
Oops! Turns out that not all of the records have a time zone field. This is easy to handle as we can add the check if 'tz' in rec at the end of the list comprehension:
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
Trang 9time zone I’ll show two approaches: the harder way (using just the Python standardlibrary) and the easier way (using pandas) One way to do the counting is to use a dict
to store counts while we iterate through the time zones:
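The counting code itself fell on a page break here; a minimal sketch in plain Python, consistent with the description above (the function name is assumed):

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

counts = get_counts(time_zones)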
Counting Time Zones with pandas
The main pandas data structure is the DataFrame, which you can think of as representing a table or spreadsheet of data. Creating a DataFrame from the original set of records is simple:
In [17]: from pandas import DataFrame, Series
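The construction of the DataFrame itself fell on a page break here; presumably it is just (the prompt number is a guess):

In [18]: frame = DataFrame(records)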
The summary view of frame reports 18 data columns (for example, _heartbeat_, with 120 non-null float64 values), and selecting frame['tz'] returns a Series of time zone strings (dtype: object).
The output shown for the frame is the summary view, shown for large DataFrame objects. The Series object returned by frame['tz'] has a method value_counts that gives us what we're looking for.
Figure 2-1. Top time zones in the 1.usa.gov sample data
Parsing all of the interesting information in these "agent" strings may seem like a daunting task. Luckily, once you have mastered Python's built-in string functions and regular expression capabilities, it is really not so bad. For example, we could split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior:
In [33]: results = Series([x.split()[0] for x in frame.a.dropna()])
2    Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
dtype: object
In [35]: results.value_counts()[:8]
Out[35]:
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
dtype: int64

Now, suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let's say that a user is on Windows if the string 'Windows' is in the agent string. Since some of the agents are missing, I'll exclude these from the data:

In [36]: cframe = frame[frame.a.notnull()]

We then want to compute a value indicating whether each row is Windows or not:

In [37]: operating_system = np.where(cframe['a'].str.contains('Windows'),
   ....:                             'Windows', 'Not Windows')

In [38]: operating_system[:5]
Out[38]:
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'],
      dtype='|S11')

Then, you can group the data by its time zone column and this new array of operating systems:

In [39]: by_tz_os = cframe.groupby(['tz', operating_system])

The group counts, analogous to the value_counts function above, can be computed using size. This result is then reshaped into a table with unstack:

In [40]: agg_counts = by_tz_os.size().unstack().fillna(0)

In [41]: agg_counts[:10]
Out[41]:
                                Not Windows  Windows
tz
245 276
Africa/Cairo 0 3
Africa/Casablanca 0 1
Africa/Ceuta 0 2
Africa/Johannesburg 0 1
Africa/Lusaka 0 1
America/Anchorage 4 1
America/Argentina/Buenos_Aires 1 0
America/Argentina/Cordoba 0 1
America/Argentina/Mendoza 0 1
Finally, let’s select the top overall time zones To do so, I construct an indirect index array from the row counts in agg_counts: # Use to sort in ascending order In [42]: indexer = agg_counts.sum(1).argsort() In [43]: indexer[:10] Out[43]: tz 24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
dtype: int64

I then use take to select the rows in that order, then slice off the last 10 rows:

In [44]: count_subset = agg_counts.take(indexer)[-10:]

In [45]: count_subset
Out[45]:
                     Not Windows  Windows
tz
America/Sao_Paulo 13 20
Europe/Madrid 16 19
Pacific/Honolulu 0 36
Asia/Tokyo 2 35
Europe/London 43 31
America/Denver 132 59
America/Los_Angeles 130 252
America/Chicago 115 285
245 276
America/New_York 339 912
This data can then be plotted in a bar plot; I'll make it a stacked bar plot by passing stacked=True (see Figure 2-2):
In [47]: count_subset.plot(kind='barh', stacked=True)
The plot doesn’t make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1 then plotted again (see Figure 2-3):
In [49]: normed_subset = count_subset.div(count_subset.sum(1), axis=0)
In [50]: normed_subset.plot(kind='barh', stacked=True)
All of the methods employed here will be examined in great detail throughout the rest of the book.
Figure 2-2. Top time zones by Windows and non-Windows users

Figure 2-3. Percentage Windows and non-Windows users in top-occurring time zones

MovieLens 1M Data Set

GroupLens Research (http://www.grouplens.org/node/73) provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Such data
is often of interest in the development of recommendation systems based on machine learning algorithms. While I will not be exploring machine learning techniques in great detail in this book, I will show you how to slice and dice data sets like these into the exact form you need.
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies. It's spread across 3 tables: ratings, user information, and movie information. After extracting the data from the zip file, each table can be loaded into a pandas DataFrame object using pandas.read_table:

import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('movielens/users.dat', sep='::', header=None,
names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('movielens/ratings.dat', sep='::', header=None,
names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movielens/movies.dat', sep='::', header=None,
                       names=mnames)

   movie_id                               title                        genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
Suppose you wanted to compute mean ratings for a particular movie broken down by sex and age. As you will see, this is much easier to do with all of the data merged together into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names:
In [66]: data = pd.merge(pd.merge(ratings, users), movies)
0 One Flew Over the Cuckoo's Nest (1975) Drama
1 One Flew Over the Cuckoo's Nest (1975) Drama
2 One Flew Over the Cuckoo's Nest (1975) Drama
3 One Flew Over the Cuckoo's Nest (1975) Drama
4 One Flew Over the Cuckoo's Nest (1975) Drama
1000204 Modulations (1998) Documentary
1000205 Broken Vessels (1998) Drama
1000206 White Boys (1999) Drama
1000207 One Little Indian (1973) Comedy|Drama|Western
1000208 Five Wives, Three Secretaries and Me (1998) Documentary
[1000209 rows x 10 columns]
In [68]: data.ix[0]
Out[68]:
user_id 1
movie_id 1193
rating 5
timestamp 978300760
gender                                          F
age                                             1
occupation                                     10
zip                                         48067
title      One Flew Over the Cuckoo's Nest (1975)
genres                                      Drama
Name: 0, dtype: object

In this form, aggregating the ratings grouped by one or more user or movie attributes is straightforward once you build some familiarity with pandas. To get mean movie ratings for each film grouped by gender, we can use the pivot_table method:

In [69]: mean_ratings = data.pivot_table('rating', rows='title',
   ....:                                 cols='gender', aggfunc='mean')

In [70]: mean_ratings[:5]
Out[70]:
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
And Justice for All (1979)     3.828571  3.689024

This produced another DataFrame containing mean ratings with movie titles as row labels and gender as column labels. First, I'm going to filter down to movies that received at least 250 ratings (a completely arbitrary number); to do this, I group the data by title and use size() to get a Series of group sizes for each title:

In [71]: ratings_by_title = data.groupby('title').size()

In [72]: ratings_by_title[:10]
Out[72]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
In [73]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
In [74]: active_titles
Out[74]:
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', , u'Back to School (1986)',
u'Back to the Future (1985)', ], dtype='object')
The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings above.
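The selection and sorting code fell on a page break; a reconstruction consistent with the output below (the prompt numbers are guesses):

In [75]: mean_ratings = mean_ratings.ix[active_titles]

To see the top films among female viewers, we can then sort by the F column in descending order:

In [76]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

In [77]: top_female_ratings[:10]
Out[77]:
gender                                                     F         M
title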
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248
Measuring rating disagreement
Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means, then sort by that:
In [80]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
Sorting by 'diff' gives us the movies with the greatest rating difference and which were preferred by women.
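The sorting call itself fell on a page break; presumably (the prompt numbers are guesses):

In [81]: sorted_by_diff = mean_ratings.sort_index(by='diff')

In [82]: sorted_by_diff[:15]
Out[82]:
                                               F         M      diff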
Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
Guys and Dolls (1955) 4.051724 3.583333 -0.468391
Reversing the order of the rows (slicing sorted_by_diff with [::-1]) gives the movies preferred by men that women didn't rate as highly:

Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
# Standard deviation of rating grouped by title
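# (Reconstruction: the prose and code around the comment above fell on a
# page break. Suppose instead you wanted the movies that elicited the most
# disagreement among viewers, measured by the standard deviation of the
# ratings. A sketch consistent with the Series output below:)
rating_std_by_title = data.groupby('title')['rating'].std()

# Filter down to the titles with at least 250 ratings
rating_std_by_title = rating_std_by_title.ix[active_titles]

# Order the Series by value in descending order
rating_std_by_title.order(ascending=False)[:10]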
Dumb & Dumber (1994) 1.321333
Blair Witch Project, The (1999) 1.316368
Natural Born Killers (1994) 1.307198
Tank Girl (1995) 1.277695
Rocky Horror Picture Show, The (1975) 1.260177
Eyes Wide Shut (1999) 1.259624
Evita (1996) 1.253631
Billy Madison (1995) 1.249970
Fear and Loathing in Las Vegas (1998) 1.246408
Bicentennial Man (1999) 1.245533
Name: rating, dtype: float64
You may have noticed that movie genres are given as a pipe-separated (|) string. If you wanted to do some analysis by genre, more work would be required to transform the genre information into a more usable form. I will revisit this data later in the book to illustrate such a transformation.
US Baby Names 1880-2010

The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through 2010. There are many things you might want to do with the data set:
• Visualize the proportion of babies given a particular name (your own, or another name) over time
• Determine the relative rank of a name.
• Determine the most popular names in each year or the names with largest increases

As of this writing, the US Social Security Administration makes available data files, one per year, containing the total number of births for each sex/name combination. The raw archive of these files can be obtained here:
http://www.ssa.gov/oact/babynames/limits.html
In the event that this page has been moved by the time you're reading this, it can most likely be located again by Internet search. After downloading the "National data" file names.zip and unzipping it, you will have a directory containing a series of files like yob1880.txt. I use the UNIX head command to look at the first 10 lines of one of the files (on Windows, you can use the more command or open it in a text editor):
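The head output and the loading code fell on a page break here. The files are comma-separated with name, sex, and birth count and no header row, so each can be loaded with read_csv; a sketch consistent with the output tail below (prompt numbers are guesses):

In [102]: names1880 = pd.read_csv('names/yob1880.txt',
   .....:                         names=['name', 'sex', 'births'])

# total births by sex for the year 1880
In [103]: names1880.groupby('sex').births.sum()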
Name: births, dtype: int64
Since the data set is split into files by year, one of the first things to do is to assemble all of the data into a single DataFrame and further to add a year field. This is easy to do using pandas.concat:
# 2010 is the last available year right now
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
There are a couple things to note here. First, remember that concat glues the DataFrame objects together row-wise by default. Secondly, you have to pass ignore_index=True because we're not interested in preserving the original row numbers returned from read_csv. So we now have a very large DataFrame containing all of the names data.
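Next, the data is aggregated at the year and sex level; the pivot_table call fell on a page break, but consistent with the plotting call below it is presumably:

total_births = names.pivot_table('births', rows='year',
                                 cols='sex', aggfunc=sum)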
In [104]: total_births.plot(title='Total births by sex and year')
Figure 2-4. Total births by sex and year
Next, let’s insert a column prop with the fraction of babies given each name relative tothe total number of births A prop value of 0.02 would indicate that 2 out of every 100babies was given a particular name Thus, we group the data by year and sex, then addthe new column to each group:
def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
Remember that because births is of integer type, we have to cast either
the numerator or denominator to floating point to compute a fraction
(unless you are using Python 3!).
The resulting complete data set now has name, sex, births, year, and prop columns. When performing a group operation like this, it's often valuable to do a sanity check, like verifying that the prop column sums to 1 within all the groups:

In [107]: np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
Out[107]: True
Now that this is done, I'm going to extract a subset of the data to facilitate further analysis: the top 1000 names for each sex/year combination. This is yet another group operation:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
The resulting data set is now quite a bit smaller:
Trang 26We’ll use this Top 1,000 data set in the following investigations into the data.
Analyzing Naming Trends
With the full data set and Top 1,000 data set in hand, we can start analyzing various naming trends of interest. First, we split the Top 1,000 names into the boy and girl portions.
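The split itself fell on a page break; it is presumably simple boolean selection:

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

Simple time series, like the number of Johns or Marys for each year, can then be formed by pivoting the total number of births by year and name: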
In [113]: total_births = top1000.pivot_table('births', rows='year',
   .....:                                    cols='name', aggfunc=sum)
Now, this can be plotted for a handful of names using DataFrame’s plot method:
In [115]: subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
In [116]: subset.plot(subplots=True, figsize=(12, 10), grid=False,
   .....:             title="Number of births per year")
See Figure 2-5 for the result. On looking at this, you might conclude that these names have grown out of favor with the American population. But the story is actually more complicated than that, as will be explored in the next section.
Figure 2-5. A few boy and girl names over time
Measuring the increase in naming diversity
One explanation for the decrease in the plots above is that fewer parents are choosing common names for their children. This hypothesis can be explored and confirmed in the data. One measure is the proportion of births represented by the top 1000 most popular names, which I aggregate and plot by year and sex:
In [118]: table = top1000.pivot_table('prop', rows='year',
   .....:                             cols='sex', aggfunc=sum)

Plotting this table (see Figure 2-6), you can see that the proportion of births accounted for by the Top 1,000 names has fallen over time. Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births. This number is a bit more tricky to compute. Let's consider just the boy names from 2010:
Figure 2-6. Proportion of births represented in top 1000 names by sex
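The selection itself fell on a page break; presumably (the prompt number is a guess):

In [120]: df = boys[boys.year == 2010]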
After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%. You could write a for loop to do this, but a vectorized NumPy way is a bit more clever. Taking the cumulative sum, cumsum, of prop and then calling the method searchsorted returns the position in the cumulative sum at which 0.5 would need to be inserted to keep it in sorted order:
In [122]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
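The call itself fell on a page break; the result of 116 is implied by the text that follows (the prompt number is a guess):

In [123]: prop_cumsum.searchsorted(0.5)
Out[123]: 116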
Since arrays are zero-indexed, adding 1 to this result gives you a result of 117. By contrast, in 1900 this number was much smaller:
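The 1900 computation and the aggregation over all year/sex groups fell on a page break; a reconstruction consistent with the surrounding text (prompt numbers are guesses):

In [124]: df = boys[boys.year == 1900]

In [125]: in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()

In [126]: in1900.searchsorted(0.5) + 1
Out[126]: 25

You can now apply this operation to each year/sex combination, groupby those fields, and apply a function returning the count for each group:

def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')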
This resulting DataFrame diversity now has two time series, one for each sex, indexed by year. This can be inspected in IPython and plotted as before (see Figure 2-7):
In [130]: diversity.plot(title="Number of popular names in top 50%")
Figure 2-7. Plot of diversity metric by year
As you can see, girl names have always been more diverse than boy names, and they have only become more so over time. Further analysis of what exactly is driving the diversity, like the increase of alternate spellings, is left to the reader.
The “Last letter” Revolution
In 2007, baby name researcher Laura Wattenberg pointed out on her website (http://www.babynamewizard.com) that the distribution of boy names by final letter has changed significantly over the last 100 years. To see this, I first aggregate all of the births in the full data set by year, sex, and final letter:
# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', rows=last_letters,
cols=['sex', 'year'], aggfunc=sum)
Then, I select out three representative years spanning the history and print the first few rows:
In [132]: subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
Next, normalize the table by total births to compute a new table containing proportion
of total births for each sex ending in each letter:
In [135]: letter_prop = subtable / subtable.sum().astype(float)
With the letter proportions now in hand, I can make bar plots for each sex broken down by year. See Figure 2-8:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
legend=False)
Figure 2-8. Proportion of boy and girl names ending in each letter
As you can see, boy names ending in "n" have experienced significant growth since the 1960s. Going back to the full table created above, I again normalize by year and sex and select a subset of letters for the boy names, finally transposing to make each column a time series:
In [138]: letter_prop = table / table.sum().astype(float)
In [139]: dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
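With this DataFrame of time series in hand, the trend can be plotted (see Figure 2-9); the call fell on a page break but is presumably just:

In [140]: dny_ts.plot()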
Figure 2-9. Proportion of boys born with names ending in d/n/y over time
Boy names that became girl names (and vice versa)
Another fun trend is looking at names that were more popular with one sex earlier in the sample but have "changed sexes" in the present. One example is the name Lesley or Leslie. Going back to the top1000 dataset, I compute a list of names occurring in the dataset starting with 'lesl':
In [143]: all_names = top1000.name.unique()
In [144]: mask = np.array(['lesl' in x.lower() for x in all_names])
In [145]: lesley_like = all_names[mask]
In [146]: lesley_like
Out[146]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
From there, we can filter down to just those names and sum births grouped by name
to see the relative frequencies:
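The filtering and summing code fell on a page break; a reconstruction consistent with the output tail below (prompt numbers are guesses):

In [147]: filtered = top1000[top1000.name.isin(lesley_like)]

In [148]: filtered.groupby('name').births.sum()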
Name: births, dtype: int64
Next, let’s aggregate by sex and year and normalize within year:
In [149]: table = filtered.pivot_table('births', rows='year',
   .....:                              cols='sex', aggfunc=sum)
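The normalization and plotting steps fell on a page break; normalizing each row to sum to 1 and plotting gives Figure 2-10 (the prompt numbers are guesses):

In [150]: table = table.div(table.sum(1), axis=0)

In [151]: table.plot()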
Figure 2-10. Proportion of male/female Lesley-like names over time
Conclusions and The Path Ahead
The examples in this chapter are rather simple, but they're here to give you a bit of a flavor of what sorts of things you can expect in the upcoming chapters. The focus of this book is on tools as opposed to presenting more sophisticated analytical methods. Mastering the techniques in this book will enable you to implement your own analyses (assuming you know what you want to do!) in short order.
APPENDIX
Python Language Essentials
Knowledge is a treasure, but practice is the key to it.
—Thomas Fuller

People often ask me about good resources for learning Python for data-centric applications. While there are many excellent Python language books, I am usually hesitant to recommend some of them as they are intended for a general audience rather than tailored for someone who wants to load in some data sets, do some computations, and plot some of the results. There are actually a couple of books on "scientific programming in Python", but they are geared toward numerical computing and engineering applications: solving differential equations, computing integrals, doing Monte Carlo simulations, and various topics that are more mathematically-oriented rather than being about data analysis and statistics. As this is a book about becoming proficient at working with data in Python, I think it is valuable to spend some time highlighting the most important features of Python's built-in data structures and libraries from the perspective of processing and manipulating structured and unstructured data. As such, I will only present roughly enough information to enable you to follow along with the rest of the book.
This chapter is not intended to be an exhaustive introduction to the Python language but rather a biased, no-frills overview of features which are used repeatedly throughout this book. For new Python programmers, I recommend that you supplement this chapter with the official Python tutorial (http://docs.python.org) and potentially one of the many excellent (and much longer) books on general purpose Python programming. In my opinion, it is not necessary to become proficient at building good software in Python to be able to productively do data analysis. I encourage you to use IPython to experiment with the code examples and to explore the documentation for the various types, functions, and methods. Note that some of the code used in the examples may not necessarily be fully-introduced at this point.

Much of this book focuses on high performance array-based computing tools for working with large data sets. In order to use those tools you must often first do some munging to corral messy data into a more nicely structured form. Fortunately, Python is one of the easiest-to-use languages for rapidly whipping your data into shape. The greater your facility with Python, the language, the easier it will be for you to prepare new data sets for analysis.
The Python Interpreter
Python is an interpreted language. The Python interpreter runs a program by executing one statement at a time. The standard interactive Python interpreter can be invoked on the command line with the python command:
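The session shown here fell on a page break; a minimal equivalent looks like:

$ python
>>> a = 5
>>> print a
5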
The >>> you see is the prompt where you'll type expressions. To exit the Python interpreter and return to the command prompt, you can either type exit() or press Ctrl-D.

Running Python programs is as simple as calling python with a .py file as its first argument. Suppose we had created hello_world.py with these contents:
print 'Hello world'
This can be run from the terminal simply as:
$ python hello_world.py
Hello world
While many Python programmers execute all of their Python code in this way, many
scientific Python programmers make use of IPython, an enhanced interactive Python
interpreter. Chapter 3 is dedicated to the IPython system. By using the %run command, IPython executes the code in the specified file in the same process, enabling you to explore the results interactively when it's done:
$ ipython
Python 2.7.2 |EPD 7.1-2 (64-bit)| (default, Jul 3 2011, 15:17:51)
Type "copyright", "credits" or "license" for more information.
IPython 0.12 An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: %run hello_world.py
Hello world
In [2]:
The default IPython prompt adopts the numbered In [2]: style compared with the standard >>> prompt.
The Basics
Language Semantics
The Python language design is distinguished by its emphasis on readability, simplicity, and explicitness. Some people go so far as to liken it to "executable pseudocode".
Indentation, not braces
Python uses whitespace (tabs or spaces) to structure code instead of using braces as in many other languages like R, C++, Java, and Perl. Take, for example, a for loop from a quicksort algorithm.
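The loop itself is missing here; a sketch of the kind of loop the text refers to (array, pivot, less, and greater are names assumed from the standard quicksort example):

for x in array:
    if x < pivot:
        less.append(x)
    else:
        greater.append(x)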
Love it or hate it, significant whitespace is a fact of life for Python programmers, and in my experience it helps make Python code a lot more readable than other languages I've used. While it may seem foreign at first, I suspect that it will grow on you after a while.
I strongly recommend that you use 4 spaces as your default indentation and that your editor replace tabs with 4 spaces. Many text editors have a setting that will replace tab stops with spaces automatically (do this!). Some people use tabs or a different number of spaces, with 2 spaces not being terribly uncommon. 4 spaces is by and large the standard adopted by the vast majority of Python programmers, so I recommend doing that in the absence of a compelling reason otherwise.
As you can see by now, Python statements also do not need to be terminated by semicolons. Semicolons can be used, however, to separate multiple statements on a single line:

a = 5; b = 6; c = 7

Putting multiple statements on one line is generally discouraged in Python as it often makes code less readable.
Everything is an object
An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own "box" which is referred to as a Python object. Each object has an associated type (for example, string or function) and internal data. In practice this makes the language very flexible, as even functions can be treated just like any other object.
Comments

Any text preceded by the hash mark (pound sign) # is ignored by the Python interpreter. This is often used to add comments to code, or to exclude certain blocks of code without deleting them:

for line in file_handle:
    # keep the empty lines for now
    # if len(line) == 0:
    #     continue
    results.append(line.replace('foo', 'bar'))
Function and object method calls
Functions are called using parentheses and passing zero or more arguments, optionally assigning the returned value to a variable:
result = f(x, y, z)
g()
Almost every object in Python has attached functions, known as methods, that have access to the object's internal contents. They can be called using the syntax:
obj.some_method(x, y, z)
Functions can take both positional and keyword arguments:
result = f(a, b, c, d=5, e='foo')
Variables and pass-by-reference
When assigning a variable (or name) in Python, you are creating a reference to the object on the right hand side of the equals sign. In practical terms, consider a list of integers:
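The assignments themselves fell on a page break; presumably (prompt numbers are guesses):

In [241]: a = [1, 2, 3]

In [242]: b = a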
Assigning a to a new variable b makes both names refer to the same object (see Figure A-1 for a mockup). You can prove this to yourself by appending an element to a and then examining b:
a and then examining b:
In [243]: a.append(4)
In [244]: b
Out[244]: [1, 2, 3, 4]
Figure A-1. Two references for the same object
Understanding the semantics of references in Python and when, how, and why data is copied is especially critical when working with larger data sets in Python.
Assignment is also referred to as binding, as we are binding a name to an object. Variable names that have been assigned may occasionally be referred to as bound variables.
When you pass objects as arguments to a function, you are only passing references; no copying occurs. Thus, Python is said to pass by reference, whereas some other languages support both pass by value (creating copies) and pass by reference. This means that a function can mutate the internals of its arguments. Suppose we had the following function:
def append_element(some_list, element):
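    # (function body lost at a page break; presumably it simply appends:)
    some_list.append(element)

# When we pass a list to it, the function mutates the list in place:
data = [1, 2, 3]
append_element(data, 4)
data  # now [1, 2, 3, 4]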
Dynamic references, strong types
In contrast with many compiled languages, such as Java and C++, object references in Python have no type associated with them. There is no problem with the following:
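The examples here fell on a page break; a reconstruction consistent with the error output below (prompt numbers are guesses):

In [245]: a = 5

In [246]: type(a)
Out[246]: int

In [247]: a = 'foo'

In [248]: type(a)
Out[248]: str

Variables are names for objects within a particular namespace; the type information lives with the object itself. Some might hastily conclude from this that Python is not a "typed language"; this is not true. Consider:

In [249]: '5' + 5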
TypeError: cannot concatenate 'str' and 'int' objects
In some languages, such as Visual Basic, the string '5' might get implicitly converted (or casted) to an integer, thus yielding 10. Yet in other languages, such as JavaScript, the integer 5 might be casted to a string, yielding the concatenated string '55'. In this regard Python is considered a strongly-typed language, which means that every object has a specific type (or class), and implicit conversions will occur only in certain obvious circumstances, such as the following:
In [250]: a = 4.5
In [251]: b = 2
# String formatting, to be visited later
In [252]: print 'a is %s, b is %s' % (type(a), type(b))
a is <type 'float'>, b is <type 'int'>
In [253]: a / b
Out[253]: 2.25
Knowing the type of an object is important, and it's useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the isinstance function:
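The example fell on a page break; a sketch (prompt numbers are guesses):

In [254]: a = 5

In [255]: isinstance(a, int)
Out[255]: True

isinstance can also accept a tuple of types if you want to check that an object's type is among those in the tuple:

In [256]: isinstance(a, (int, float))
Out[256]: True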
Attributes and methods
Objects in Python typically have both attributes, other Python objects stored "inside" the object, and methods, functions associated with an object which can have access to the object's internal data. Both of them are accessed via the syntax obj.attribute_name:
In [1]: a = 'foo'
In [2]: a.<Tab>
a.capitalize a.format a.isupper a.rindex a.strip
a.center a.index a.join a.rjust a.swapcase
a.count a.isalnum a.ljust a.rpartition a.title
a.decode a.isalpha a.lower a.rsplit a.translate
a.encode a.isdigit a.lstrip a.rstrip a.upper
a.endswith a.islower a.partition a.split a.zfill
a.expandtabs a.isspace a.replace a.splitlines
a.find a.istitle a.rfind a.startswith
Attributes and methods can also be accessed by name using the getattr function:
>>> getattr(a, 'split')
<function split>
While we will not extensively use the functions getattr and related functions hasattr and setattr in this book, they can be used very effectively to write generic, reusable code.