Python Data Science Essentials, Second Edition
Become an efficient data science practitioner by
understanding Python's key concepts
Alberto Boschetti
Luca Massaron
BIRMINGHAM - MUMBAI
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Second edition: October 2016
About the Authors
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP), behavioral analysis, and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.
I would like to thank my family, my friends, and my colleagues. Also, a big thanks to the open source community.
Luca Massaron is a data scientist and marketing research director specializing in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top ten Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, Luca believes that a lot can be achieved in data science just by doing the essentials.
To Yukiko and Amelia, for their loving patience. "Roads go ever ever on, under cloud and under star, yet feet that wandering have gone turn at last to home afar."
About the Reviewer
Zacharias Voulgaris is a data scientist and technical author specializing in data science books. He has an engineering and management background, with post-graduate studies in information systems and machine learning. Zacharias has worked as a research fellow at Georgia Tech, investigating and applying machine learning technologies to real-world problems, as an SEO manager in an e-marketing company in Europe, as a program manager in Microsoft, and as a data scientist at US Bank and at G2 Web Services.
Dr. Voulgaris has also authored technical books, the most notable of which is Data Scientist: The Definitive Guide to Becoming a Data Scientist (Technics Publications), and his newest book, Julia for Data Science (Technics Publications), was released during the summer of 2016. He has also written a number of data-science-related articles on blogs and participates in various data science/machine learning meetup groups. Finally, he has provided technical editorial aid in the book Python Data Science Essentials (Packt), by the same authors as this one.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Table of Contents

Introducing data science and Python
Explaining virtual environments
A glance at the essential packages
Fast installation and first test usage
How Jupyter Notebooks can help data scientists
LIBSVM data examples
Loading data directly from CSV or text files
The data science process
Data loading and preprocessing with pandas
Working with categorical and text data
Scraping the Web with Beautiful Soup
Data processing with NumPy
The basics of NumPy ndarray objects
Creating NumPy arrays
From lists to unidimensional arrays
From lists to multidimensional arrays
Arrays derived from NumPy functions
Getting an array directly from a file
NumPy's fast operations and computations
Slicing and indexing with NumPy arrays
Summary
PCA for big data – RandomizedPCA
Linear Discriminant Analysis (LDA)
Latent Semantical Analysis (LSA)
Independent Component Analysis (ICA)
Restricted Boltzmann Machine (RBM)
The detection and treatment of outliers
Using cross-validation iterators
Hyperparameter optimization
Building custom scoring functions
Reducing the grid search runtime
Feature selection
Selection based on feature variance
Stability and L1-based selection
Wrapping everything in a pipeline
Combining features together and chaining transformations
Building custom transformation functions
Preparing tools and datasets
Linear and logistic regression
SVM for regression
Ensemble strategies
Random subspaces and random patches
Random Forests and Extra-Trees
Estimating probabilities from an ensemble
Sequences of models – AdaBoost
Dealing with big data
Creating some big datasets as examples
An overview of Stochastic Gradient Descent (SGD)
Approaching deep learning
A peek at Natural Language Processing (NLP)
A complete data science example – text classification
An overview of unsupervised learning
Introduction to graph theory
Graph algorithms
Graph loading, dumping, and sampling
Bar graphs
Selected graphical examples with pandas
Wrapping up matplotlib's commands
Enhancing your EDA capabilities
Interactive visualizations with Bokeh
Advanced data-learning representations
Feature importance for RandomForests
Creating a prediction server for ML-AAS
Appendix: Strengthen Your Python Foundations
Your learning list
Comprehensions for lists and dictionaries
Learn by watching, reading, and doing
"A journey of a thousand miles begins with a single step."
–Laozi (604 BC - 531 BC)
Data science is a relatively new knowledge domain that requires the successful integration of linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
The Python programming language, having conquered the scientific community during the last decade, is now an indispensable tool for the data science practitioner and a must-have tool for every aspiring data scientist. Python will offer you a fast, reliable, cross-platform, mature environment for data analysis, machine learning, and algorithmic problem solving. Whatever stopped you before from mastering Python for data science applications will be easily overcome by our easy, step-by-step, and example-oriented approach that will help you apply the most straightforward and effective Python tools to both demonstrative and real-world datasets. As the second edition of Python Data Science Essentials, this book offers updated and expanded content. Based on the recent Jupyter Notebooks (incorporating interchangeable kernels, a truly polyglot data science system), this book incorporates all the main recent improvements in NumPy, pandas, and Scikit-learn.
Additionally, it offers new content in the form of deep learning (by presenting Keras, based on both Theano and TensorFlow), beautiful visualizations (seaborn and ggplot), and web deployment (using bottle). This book starts by showing you how to set up your essential data science toolbox in Python's latest version (3.5), using a single-source approach (implying that the book's code will be easily reusable on Python 2.7 as well). Then, it will guide you across all the data munging and preprocessing phases in a manner that explains all the core data science activities related to loading data, transforming and fixing it for analysis, and exploring/processing it. Finally, the book will complete its overview by presenting you with the principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.
What this book covers
Chapter 2, Data Munging, gives an overview of the data science pipeline and explores all the key tools for handling and preparing data before you apply any learning algorithm and set up your hypothesis experimentation schedule.

Chapter 3, The Data Pipeline, discusses all the operations that can potentially improve or even boost your results.

Chapter 4, Machine Learning, delves into the principal machine learning algorithms offered by the Scikit-learn package, such as, among others, linear models, support vector machines, ensembles of trees, and unsupervised techniques for clustering.

Chapter 5, Social Network Analysis, introduces graphs, which is an interesting deviation from the predictors/target flat matrices. It is quite a hot topic in data science now. Expect to delve into very complex and intricate networks!

Chapter 6, Visualization, Insights, and Results, the concluding chapter, introduces you to the basics of visualization with Matplotlib, how to operate EDA with pandas, how to achieve beautiful visualizations with Seaborn and Bokeh, and also how to set up a web server to provide information on demand.

Appendix, Strengthen Your Python Foundations, covers a few Python examples and tutorials focused on the key features of the language that are indispensable in order to work on data science projects.
What you need for this book
Python and all the data science tools mentioned in the book, from IPython to Scikit-learn, are free of charge and can be freely downloaded from the Internet. To run the code that accompanies the book, you need a computer that uses Windows, Linux, or Mac OS operating systems. The book will introduce you step-by-step to the process of installing the Python interpreter and all the tools and data that you need to run the examples.
Who this book is for
If you are an aspiring data scientist and you have at least a working knowledge of data analysis and Python, this book will get you started in data science. Data analysts with experience in R or MATLAB will also find the book to be a comprehensive reference to enhance their data manipulation and machine learning skills.
Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "By using the to_bokeh method, any chart and plot from other packages can be easily ported into Bokeh."
A block of code is set as follows:

from bottle import route, run, template

port = 9099  # any free port works

@route('/personal/<name>')
def homepage(name):
    return template('Hi <b>{{name}}</b>!', name=name)

print("Try going to http://localhost:{}/personal/Tom".format(port))
print("Try going to http://localhost:{}/personal/Carl".format(port))
run(host='localhost', port=port)
Any command-line input or output is written as follows:
In: import numpy as np
from bokeh.plotting import figure, output_file, show
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Once the Jupyter instance has opened in the browser, click on the New button."
Warnings or important notes appear in a box like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Science-Essentials-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downl

Errata

You can report errata by visiting the page for your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
First Steps
Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data analysis-specific language such as MATLAB or R.
This book will delve directly into Python for data science, providing you with a straight and fast route to solve various data science problems using Python and its powerful data analysis and machine learning packages. The code examples that are provided in this book don't require you to be a master of Python. However, they will assume that you at least know the basics of Python scripting, including data structures such as lists and dictionaries, and the workings of class objects. If you don't feel confident about these subjects or have minimal knowledge of the Python language, before reading this book, we suggest that you take an online tutorial. There are many possible choices, but we suggest starting with the suggestions from the official beginner's guide to Python from the Python Foundation or directly going to the free Code Academy course at https://www.codecademy.com/learn/python. Using Code Academy's tutorial, or any other alternative you may find useful, in a matter of a few hours of study, you should acquire all the building blocks that will ensure you enjoy this book to the fullest. We have also prepared a tutorial of our own, which can be found in the last part of this book, in order to provide an integration of the two aforementioned free courses.
In any case, don't be intimidated by our starting requirements; mastering Python enough for data science applications isn't as arduous as you may think.
In this introductory chapter, we will work out the basics to set off in full swing and go through the following topics:
How to set up a Python data science toolbox
Using your browser as an interactive notebook, to code with Python using Jupyter
An overview of the data that we are going to study in this book
Introducing data science and Python
Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise.
In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start.
In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background is.
At present, the core Python characteristics that render it an indispensable data science tool are as follows:
It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.

Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, Fortran, R, or Julia), outsourcing some of the computations to them and improving your script performance.

It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python.

It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux (even on small-sized distributions, suitable for IoT on tiny PCs like Raspberry Pi, Arduino, and so on), and Mac OS systems. You won't have to worry all that much about portability.

Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language). Moreover, there are also static compilers such as Cython or just-in-time compilers such as PyPy that can transform Python code into C for higher performance.

It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling.

It is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting with the coding.

Moreover, the number of data scientists using Python is continuously growing: new packages and improvements have been released by the community every day, making the Python ecosystem an increasingly prolific and rich environment for data science.
Installing Python
First, let's proceed to introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.
Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.
Python 2 or Python 3?
There are two main branches of Python: 2.7.x and 3.x. At the time of writing this second edition of the book, the Python Foundation (https://www.python.org/) is offering downloads for Python versions 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check the website at http://py3readiness.org/ for a compatibility overview) won't run otherwise yet.
In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.
In this second edition of the book, we intend to address a growing audience of data scientists, data analysts, and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python Foundation and it will be the default version of the future on many operating systems.
Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still use this book and all its examples. In fact, for the most part, our code will simply work on Python 2 after having the code itself preceded by these imports:

from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()
The from __future__ import commands should always occur at the beginning of your scripts or else you may experience Python reporting an error.
As described in the Python-future website (http://python-future.org/), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports).
In order to run the preceding commands successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) using the following command, to be executed from a shell:
$> pip install -U future
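As a rough illustration of what two of these imports change (a sketch of ours, not from the book; on a Python 3 interpreter they are harmless no-ops), consider division and printing, whose semantics differed between the two branches:

```python
from __future__ import division, print_function

# On Python 2, these imports switch / to true division and make print
# a function; on Python 3 this is already the default behavior.
print(7 / 2)   # true division -> 3.5 on both versions
print(7 // 2)  # floor division -> 3 on both versions
```

With the imports in place, the same script produces identical results under either interpreter.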
If you're interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python Foundation itself at https://wiki.python.org/moin/Python2orPython3.
Step-by-step installation
Novice data scientists who have never used Python (and who likely don't have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine.
This section provides you with full control over what can be installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution, such as Anaconda, will lessen the burden of installation procedures and it may be well suited for first starting and learning because it saves you time and sometimes even trouble, though it will put a large number of packages (and we won't use most of them) on your computer all at once. Therefore, if you want to start immediately with an easy installation procedure, just skip this part and proceed to the section, Scientific distributions.
This being a multiplatform programming language, you'll find installers for machines that either run on Windows or Unix-like operating systems.
Remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, Ubuntu, and some other minor ones) have Python 2 packaged in the repository. In such a case, and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check what version you are exactly running. To do such a check, just follow these instructions:
1. Open a Python shell, type python in the terminal, or click on any Python icon you find on your system.
2. Then, after having Python started, to test the installation, run the following code in the Python interactive shell or REPL:
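A minimal sketch of such a check uses the standard library's sys module (any equivalent version query works just as well):

```python
import sys

# Print the full interpreter version string, then confirm that the
# interpreter meets the book's minimum requirement of Python 3.4
print(sys.version)
print(sys.version_info >= (3, 4))
```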
To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>> (REPL is an acronym that stands for Read-Eval-Print-Loop, a simple interactive environment which takes a user's single commands from an input line in a shell and returns the results by printing).
The installation of packages
Python won't come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both these tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following commands:

$> pip

$> easy_install

We suggest using pip because of the following advantages:

It provides an uninstall functionality
It rolls back and leaves your system clear if, for whatever reason, the package installation fails
Using easy_install in spite of the advantages of pip makes sense if you are working on Windows, because pip won't always install pre-compiled binary packages. Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends on whether the package relies on eggs, Python metadata files for distributing code as bundles (pip cannot directly use their binaries, but needs to build them from their source code), or on wheels, the new standard for Python distribution of code bundles (in this last case, pip can install binaries if available, as explained here: http://pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).

The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:
$> python get-pip.py
The script will also install the setup tool from https://pypi.python.org/pypi/setuptools, which also contains easy_install.
You're now ready to install the packages you need in order to run the examples provided in this book. To install the <package-name> generic package, you just need to run this command:

$> pip install <package-name>
Alternatively, you can run the following command:

$> easy_install <package-name>
Note that in some systems, pip might be named pip3 and easy_install as easy_install-3 to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python pip is operating on with:
$> pip -V
For easy_install, the command is slightly different:

$> easy_install --version
After this, the package and all its dependencies will be downloaded and installed. If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed.
This is what happens when the NumPy library has been installed:
>>> import numpy
This is what happens if it's not installed:
>>> import numpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named numpy
In the latter case, you'll need to first install it through pip or easy_install.
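Rather than triggering a traceback interactively, the same ImportError check can be wrapped in a small helper (a sketch of ours; the function name is hypothetical):

```python
import importlib

def is_installed(module_name):
    """Return True if the module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False

print(is_installed("math"))                # a stdlib module: True
print(is_installed("no_such_module_xyz"))  # not installed: False
```

This pattern is handy in setup scripts that need to verify a list of required packages before running the actual analysis.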
Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.
Finally, to search and browse the Python packages available, look at https://pypi.python.org/pypi.
Package upgrades
More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example with numpy:
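A sketch of that check, with NumPy as the example package (assuming it is installed; most scientific packages expose the same attribute):

```python
import numpy

# Most scientific packages report their release as a version string
# in the __version__ attribute
print(numpy.__version__)
```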
Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line:
$> pip install -U numpy==1.11.0
Alternatively, you can use the following command:
$> easy_install --upgrade numpy==1.11.0
Finally, if you're interested in upgrading it to the latest available version, simply run this command:
$> pip install -U numpy
You can alternatively run the following command:
$> easy_install --upgrade numpy
Scientific distributions
As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, install all the libraries that you will need. Sometimes, the installation procedures may not go as smoothly as you'd hoped, requiring extra steps such as installing additional executables (for instance, gFortran for SciPy on Linux boxes) or libraries (such as libfreetype for Matplotlib). Usually, the backtrace of the error produced during the failed installation is clear enough to understand what went wrong and to take the correct resolving action, but at other times the error is tricky or subtle, holding up the user for hours without advancing. Scientific distributions sidestep these problems by bundling Python with a large set of preinstalled packages and tools; in the following sections, you will find some of the key features of each of these distributions.
We suggest that you first promptly download and install a scientific distribution, such as Anaconda (which is the most complete one). After practicing the examples in this book, you can decide to fully uninstall the distribution and set up Python alone, accompanied by just the packages you need for your projects.
Anaconda
Anaconda (http://continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, comprising NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free; add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.
Leveraging conda to install packages
If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution.
You can test immediately whether conda is available on your system. Open a shell and type:
$> conda -V
conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects.
Before starting, please check that you have the latest version of conda at hand:
$> conda update conda
You can install a generic <package-name> with:
$> conda install <package-name>
You can also install a particular version of the package just by pointing it out:
$> conda install <package-name>=1.11.0
Similarly, you can install multiple packages at once by listing all their names:
$> conda install <package-name-1> <package-name-2>
If you just need to update a package that you previously installed, you can keep on usingconda:
$> conda update <package-name>
You can update all the available packages simply by using the --all argument:
$> conda update --all
Finally, conda can also uninstall packages for you:
$> conda remove <package-name>
If you would like to know more about conda, you can read its documentation at http://conda.pydata.org/docs/index.html. In summary, as its main advantage, it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source), but without its problems and limitations. With the use of conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development), and it doesn't cover all the packages available on PyPI, as pip itself does.
Enthought Canopy
Enthought Canopy (https://www.enthought.com/products/canopy/) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas (more on these packages later). This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (and is named Canopy Express), but if you need advanced features, you have to buy a paid version. It's a multiplatform distribution, and its command-line install tool is canopy_cli.
PythonXY
PythonXY (http://python-xy.github.io/) is a free, open source Python distribution maintained by the community. It includes a number of packages, such as NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip.
WinPython
WinPython (http://winpython.sourceforge.net/) is also a free, open source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable: you can put WinPython into any directory, or even onto a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM).
Explaining virtual environments
No matter whether you have chosen to install a standalone Python or a scientific distribution, you may have noticed that you are actually bound to the version of Python you have installed on your system. The only exception, for Windows users, is the WinPython distribution, since it is a portable installation and you can have as many different installations as you need.
A simple solution to break free of this limitation is to use virtualenv, a tool to create isolated Python environments. By using different Python environments, you can easily achieve these things:
Testing any new package installation or doing experimentation on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox.
Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages. This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present only work on Windows using Python 3.4, which is not the latest release).
Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment.
You can find documentation about virtualenv at http://virtualenv.readthedocs.io/en/stable/, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you first have to install it on your system:
$> pip install virtualenv
After the installation completes, you can start building your virtual environments. Before proceeding, you have to make a few decisions:
If you have more versions of Python installed on your system, you have to decide which version to pick. Otherwise, virtualenv will take the version of Python that it was installed with on your system. In order to set a different Python version, you have to type the argument -p followed by the version of Python you want, or insert the path of the Python executable to be used (for instance, -p python2.7), or just point to a Python executable such as -p c:\Anaconda2\python.exe.
With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at a system level (in the Python directory you created the virtual environment from). This default behavior makes sense because it allows you to create a completely separated empty environment. In order to save disk space and limit the installation time of all the packages, you may instead decide to take advantage of already available packages on your system by using the argument --system-site-packages.
You may want to be able to later move your virtual environment around across Python installations, even among different machines. Therefore, you may want to make the functioning of all of the environment's scripts relative to the path it is placed in by using the argument --relocatable.
After deciding on the Python version, the linking to existing global packages, and the relocatability of the virtual environment, in order to start, you just launch the command from a shell, declaring the name you would like to assign to your new environment:
$> virtualenv clone
virtualenv will just create a new directory using the name you provided, in the path from which you actually launched the command. To start using it, enter that directory and run the activate script (on Windows, clone\Scripts\activate; on Linux and Mac OS X, source bin/activate).
If you need a replica of the packages currently installed on your system, you can record the entire list in a text file with this command:
$> pip freeze > requirements.txt
After saving the list in a text file, just take it into your virtual environment and install all thepackages in a breeze with a single command:
$> pip install -r requirements.txt
Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that's not a big deal because pip automatically manages such situations. So if your package requires NumPy and NumPy is not yet installed, pip will install it first.
When you're finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command:
$> deactivate
If you want to remove the virtual environment completely, after deactivating and getting out of the environment's directory, you just have to get rid of the environment's directory itself by a recursive deletion. For instance, on Windows you just do this:
$> rd /s /q clone
On Linux and Mac, the command will be:
$> rm -r -f clone
If you are working extensively with virtual environments, you should
consider using virtualenvwrapper, which is a set of wrappers for
virtualenv, in order to help you manage multiple virtual environments
easily. It can be found at http://bitbucket.org/dhellmann/virtualenvwrapper.
If you are operating on a Unix system (Linux or OS X), another
solution worth mentioning is pyenv (which can be found at
https://github.com/yyuu/pyenv), which lets you set your main Python
version, allows installation of multiple versions, and creates virtual
environments. Its peculiarity is that it does not depend on Python being
installed, and it works perfectly at the user level (no need for sudo commands).
conda for managing environments
If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that.
We can check what environments we have available like this:
>$ conda info -e
This command will report what environments you can use on your system based on conda. Most likely, your only environment will be just root, pointing to your Anaconda distribution's folder.
As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain shortly). In order to create such an environment, just do this:
$> conda create -n python34 python=3.4 anaconda
The command asks for a particular Python version (3.4) and requires the installation of all the packages available in the Anaconda distribution (the argument anaconda). It names the environment python34 using the argument -n. The complete installation will take a while, given the large number of packages in the Anaconda installation. After the installation has completed, you can activate the environment:
$> activate python34
If you need to install additional packages to your environment, when activated, you just do the following:
$> conda install -n python34 <package-name1> <package-name2>
That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment. You can also use a file instead of listing all the packages by name yourself. You can create a list of the packages installed in an environment by using the list command with the -e argument and piping the output to a file:
$> conda list -e > requirements.txt
Then, in your target environment, you can install the entire list using:
$> conda install --file requirements.txt
You can even create an environment, based on a requirements list:
$> conda create -n python34 python=3.4 --file requirements.txt
Finally, after having used the environment, to close the session, you simply do this:
$> deactivate
Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system:
$> conda remove -n python34 --all
A glance at the essential packages
We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (the Python Package Index at https://pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated. The packages that we are now going to introduce are strongly analytical, and they will constitute a complete data science toolbox. All the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to be put to work on your data.
Partially inspired by similar tools present in R and MATLAB environments, we will explore together how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment, and learn from it without having to write too much code or reinvent the wheel.
NumPy
NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to operate a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems:
Version at the time of print: 1.11.0
Suggested install command: pip install numpy
As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:
import numpy as np
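Building on that convention, here is a tiny sketch of the vectorized arithmetic described above (the array values are made up for illustration):

```python
import numpy as np

# Build a 2x3 array and apply elementwise (vectorized) math: no Python loops
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)        # (2, 3)
print((a * 2).sum())  # 42
```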
SciPy
SciPy completes NumPy's functionality, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, and more:
Version at time of print: 0.17.1
Suggested install command: pip install scipy
pandas
The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will:
Version at the time of print: 0.18.1
Suggested install command: pip install pandas
Conventionally, pandas is imported as pd:
import pandas as pd
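As a minimal sketch of mixed-type tables and missing-data handling (the column names and values are invented for illustration):

```python
import pandas as pd

# A small DataFrame mixing strings and numbers, with one missing value
df = pd.DataFrame({"city": ["London", "Milan", "Rome"],
                   "visits": [10, None, 3]})
print(df["visits"].isnull().sum())  # 1 missing element
print(df["visits"].mean())          # 6.5; the NaN is skipped
```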
Scikit-learn
Started as part of SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations with Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRIA (the French Institute for Research in Computer Science and Automation):
Version at the time of print: 0.17.1
Suggested install command: pip install scikit-learn
Note that the imported module is named sklearn.
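A minimal fit-and-predict sketch (the data points are made up and lie exactly on y = 2x, so the fitted model is easy to verify):

```python
from sklearn.linear_model import LinearRegression

# Fit a tiny linear model on made-up points lying on y = 2x
X = [[1], [2], [3]]
y = [2, 4, 6]
model = LinearRegression().fit(X, y)
print(round(model.predict([[4]])[0]))  # 8
```

Every Scikit-learn estimator follows this same fit/predict interface, which is what makes the library so easy to compose.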
Jupyter
A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez in order to address the need for an interactive command shell for several languages (based on a shell, web browser, and application interface), featuring graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for enhanced performance. Jupyter is our favored choice throughout this book, and it is used to clearly and effectively illustrate operations with scripts and data and the consequent results. We will devote a section of this chapter to explaining in detail the characteristics of its interface and describing how it can turn into a precious tool for any data scientist:
Version at the time of print: 1.0.0 (ipykernel = 4.3.1)
Suggested install command: pip install jupyter
Matplotlib
Originally conceived by John Hunter, Matplotlib is the library that contains all the building blocks required to create quality plots from arrays and to visualize them interactively:
Version at the time of print: 1.5.1
Suggested install command: pip install matplotlib
You can simply import what you need for your visualization purposes with the following command:
import matplotlib.pyplot as plt
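A minimal plotting sketch (the data, title, and file name are arbitrary; the Agg backend is an assumption chosen here so that the script also runs without a display, writing the figure to a file instead of opening a window):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files, not windows
import matplotlib.pyplot as plt

# Plot a made-up quadratic series and save it to disk
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_title("A quadratic curve")
fig.savefig("quadratic.png")
```

In a Jupyter notebook you would normally skip the backend call and let the figure render inline.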
Downloading the example code
You can download the example code files from your account at
www.packtpub.com for all the Packt Publishing books you have purchased.
If you purchased this book elsewhere, you can visit
www.packtpub.com/support and register to have the files e-mailed directly
to you.
Statsmodels
Previously part of SciKits, statsmodels was thought to be a complement to SciPy's statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics, as well as parametric and nonparametric tests:
Version at the time of print: 0.6.1
Suggested install command: pip install statsmodels
Beautiful Soup
Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrape data out of HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful:
Version at the time of print: 4.4.1
Suggested install command: pip install beautifulsoup4
Note that the imported module is named bs4.
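A minimal sketch (the HTML fragment, with its unclosed b tag, is invented here to show how Beautiful Soup copes with a small tag soup):

```python
from bs4 import BeautifulSoup

# Parse a slightly malformed HTML fragment with the stdlib parser
html = "<html><body><p class='intro'>Hello, <b>soup</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.find("p", class_="intro").get_text())  # Hello, soup
```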
NetworkX
Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank. We will mainly use this package in Chapter 5, Social Network Analysis:
Version at the time of print: 1.11
Suggested install command: pip install networkx
Conventionally, NetworkX is imported as nx:
import networkx as nx
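A minimal sketch on a toy graph (the node names are arbitrary):

```python
import networkx as nx

# A toy undirected graph: four nodes on a cycle
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
print(G.number_of_nodes())            # 4
print(nx.shortest_path(G, "a", "c"))  # a two-hop path, e.g. ['a', 'b', 'c']
```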
NLTK
The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources, together with a complete suite of functions for statistical natural language processing:
Version at the time of print: 3.2.1
Suggested install command: pip install nltk