List of Contributors
School of Business and Economics
Humboldt-Universität zu Berlin
Spandauer Straße
Berlin
Germany
haerdle@wiwi.hu-berlin.de
Professor Antony Unwin
Library of Congress Control Number:
© Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September , , in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting and Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig, Germany
Cover: deblik, Berlin, Germany
Printed on acid-free paper
springer.com
III Methodologies
III.1 Interactive Linked Micromap Plots for the Display of Geographically Referenced Statistical Data
Jürgen Symanzik, Daniel B. Carr
III.2 Grand Tours, Projection Pursuit Guided Tours, and Manual Controls
Dianne Cook, Andreas Buja, Eun-Kyung Lee, Hadley Wickham
III.3 Multidimensional Scaling
Michael A.A. Cox, Trevor F. Cox
III.4 Huge Multidimensional Data Visualization: Back to the Virtue of Principal Coordinates and Dendrograms in the New Computer Age
Francesco Palumbo, Domenico Vistocco, Alain Morineau
III.5 Multivariate Visualization by Density Estimation
Michael C. Minnotte, Stephan R. Sain, David W. Scott
III.6 Structured Sets of Graphs
Richard M. Heiberger, Burt Holland
III.7 Regression by Parts: Fitting Visually Interpretable Models with GUIDE
Wei-Yin Loh
III.8 Structural Adaptive Smoothing by Propagation–Separation Methods
Jörg Polzehl, Vladimir Spokoiny
III.9 Smoothing Techniques for Visualisation
Adrian W. Bowman
III.10 Data Visualization via Kernel Machines
Yuan-chin Ivan Chang, Yuh-Jye Lee, Hsing-Kuo Pao, Mei-Hsien Lee, Su-Yun Huang
III.11 Visualizing Cluster Analysis and Finite Mixture Models
Friedrich Leisch
III.12 Visualizing Contingency Tables
David Meyer, Achim Zeileis, Kurt Hornik
III.13 Mosaic Plots and Their Variants
Heike Hofmann
III.14 Parallel Coordinates: Visualization, Exploration and Classification of High-Dimensional Data
Alfred Inselberg
III.15 Matrix Visualization
Han-Ming Wu, ShengLi Tzeng, Chun-Houh Chen
III.16 Visualization in Bayesian Data Analysis
Jouni Kerman, Andrew Gelman, Tian Zheng, Yuejing Ding
III.17 Programming Statistical Data Visualization in the Java Language
Junji Nakano, Yoshikazu Yamamoto, Keisuke Honda
III.18 Web-Based Statistical Graphics using XML Technologies
Yoshiro Yamamoto, Masaya Iizuka, Tomokazu Fujino
IV Selected Applications
IV.1 Visualization for Genetic Network Reconstruction
Grace S. Shieh, Chin-Yuan Guo
IV.2 Reconstruction, Visualization and Analysis of Medical Images
Henry Horng-Shing Lu
IV.3 Exploratory Graphics of a Financial Dataset
Antony Unwin, Martin Theus, Wolfgang K. Härdle
IV.4 Graphical Data Representation in Bankruptcy Analysis
Wolfgang K. Härdle, Rouslan A. Moro, Dorothea Schäfer
IV.5 Visualizing Functional Data with an Application to eBay’s Online Auctions
Wolfgang Jank, Galit Shmueli, Catherine Plaisant, Ben Shneiderman
IV.6 Visualization Tools for Insurance Risk Processes
Krzysztof Burnecki, Rafał Weron
George Mason University
Center for Computational Statistics
cchen@stat.sinica.edu.tw
Dianne Cook
Iowa State University
Department of Statistics
USA
trevor.cox@unilever.com
Yuejing Ding
Columbia University
Department of Statistics
USA
yding@stat.columbia.edu
Fukuoka Women’s University
Department of Environmental Science
bholland@temple.edu
Keisuke Honda
Graduate University for Advanced Studies
Japan
khonda@ism.ac.jp
Kurt Hornik
Wirtschaftsuniversität Wien
Department of Statistics and Mathematics
Austria
Kurt.Hornik@wu-wien.ac.at
Su-Yun Huang
Academia Sinica
Institute of Statistical Science
Taiwan
syhuang@stat.sinica.edu.tw
Masaya Iizuka
Okayama University
Graduate School of Natural Science and Technology
Japan
iizuka@ems.okayama-u.ac.jp
Alfred Inselberg
Tel Aviv University
School of Mathematical Sciences
Israel
aiisreal@post.tau.ac.il
Wolfgang Jank
University of Maryland
Department of Decision and Information Technologies
USA
wjank@rhsmith.umd.edu
National Taiwan University
of Science and Technology
Department of Computer Science
and Information Engineering
Austria
David.Meyer@wu-wien.ac.at
George Michailidis
University of Michigan
Department of Statistics
USA
gmichail@umich.edu
Michael C. Minnotte
Utah State University
Department of Mathematics and Statistics
USA
mike.minnotte@usu.edu
Alain Morineau
La Revue MODULAD
France
alain.morineau@modulad.fr
Rouslan A. Moro
Humboldt-Universität zu Berlin
Institut für Statistik und Ökonometrie
Germany
rmoro@diw.de
Paul Murrell
University of Auckland
Department of Statistics
New Zealand
National Taiwan University
of Science and Technology
Department of Computer Science
and Information Engineering
Wirtschaftsforschung (DIW) Berlin
German Institute for Economic Research
gshieh@stat.sinica.edu.tw
Galit Shmueli
University of Maryland
Department of Decision and Information Technologies
USA
gshmueli@rhsmith.umd.edu
Ben Shneiderman
University of Maryland
Department of Computer Science
USA
ben@cs.umd.edu
Vladimir Spokoiny
Weierstrass Institute for Applied Analysis and Stochastics
Germany
spokoiny@wias-berlin.de
Jürgen Symanzik
Utah State University
Department of Mathematics and Statistics
USA
symanzik@math.usu.edu
Martin Theus
University of Augsburg
Department of Computational Statistics and Data Analysis
Germany
martin.theus@math.uni-augsburg.de
ShengLi Tzeng
Academia Sinica
Institute of Statistical Science
Taiwan
hh@stat.sinica.edu.tw
Worcester Polytechnic Institute
Computer Science Department
gwills@spss.com
Han-Ming Wu
Academia Sinica
Institute of Statistical Science
Taiwan
hmwu@stat.sinica.edu.tw
Yoshikazu Yamamoto
Tokushima Bunri University
Department of Engineering
Japan
yamamoto@es.bunri-u.ac.jp
Yoshiro Yamamoto
Tokai University
Department of Mathematics
Japan
yamamoto@sm.u-tokai.ac.jp
Achim Zeileis
Wirtschaftsuniversität Wien
Department of Statistics and Mathematics
Austria
Achim.Zeileis@wu-wien.ac.at
Tian Zheng
Columbia University
Department of Statistics
USA
tzheng@stat.columbia.edu
Part I
Data Visualization
Introduction
Antony Unwin, Chun-houh Chen, Wolfgang K. Härdle
1.1 Computational Statistics and Data Visualization
Data Visualization and Theory
Presentation and Exploratory Graphics
Graphics and Computing
1.2 The Chapters
Summary and Overview; Part II
Summary and Overview; Part III
Summary and Overview; Part IV
The Authors
1.3 Outlook
a matter of common sense (in which case their common sense cannot be in good shape), while others believe that preparing graphics is a low-level task, not appropriate for scientific attention. This volume of the Handbook of Computational Statistics takes graphics for data visualization seriously.
Data Visualization and Theory
1.1.1
Graphics provide an excellent approach for exploring data and are essential for presenting results. Although graphics have been used extensively in statistics for a long time, there is not a substantive body of theory about the topic. Quite a lot of attention has been paid to graphics for presentation, particularly since the superb books of Edward Tufte. However, this knowledge is expressed in principles to be followed and not in formal theories. Bertin’s work from the s is often cited but has not been developed further. This is a curious state of affairs. Graphics are used a great deal in many different fields, and one might expect more progress to have been made along theoretical lines.
Sometimes in science the theoretical literature for a subject is considerable while there is little applied literature to be found. The literature on data visualization is very much the opposite. Examples abound in almost every issue of every scientific journal concerned with quantitative analysis. There are occasionally articles published in a more theoretical vein about specific graphical forms, but little else. Although there is a respected statistics journal called the Journal of Computational and Graphical Statistics, most of the papers submitted there are in computational statistics. Perhaps this is because it is easier to publish a study of a technical computational problem than it is to publish work on improving a graphic display.
Presentation and Exploratory Graphics
1.1.2
The differences between graphics for presentation and graphics for exploration lie in both form and practice. Presentation graphics are generally static, and a single graphic is drawn to summarize the information to be presented. These displays should be of high quality and include complete definitions and explanations of the variables shown and of the form of the graphic. Presentation graphics are like proofs of mathematical theorems; they may give no hint as to how a result was reached, but they should offer convincing support for its conclusion. Exploratory graphics, on the other hand, are used for looking for results. Very many of them may be used, and they should be fast and informative rather than slow and precise. They are not intended for presentation, so that detailed legends and captions are unnecessary. One presentation graphic will be drawn for viewing by potentially thousands of readers while thousands of exploratory graphics may be drawn to support the data investigations of one analyst.
Figure .. A barchart of the number of authors per paper, a histogram of the number of pages per paper, and parallel boxplots of length by number of authors. Papers with more than three authors have been selected.
Books on visualization should make use of graphics. Figure . shows some simple summaries of data about the chapters in this volume, revealing that over half the chapters had more than one author and that more authors does not always mean longer papers.
Developments in computing power have been of great benefit to graphics in recent years. It has become possible to draw precise, complex displays with great ease and to print them with impressive quality at high resolution. That was not always the case, and initially computers were more a disadvantage for graphics. Computing screens and printers could at best produce clumsy line-driven displays of low resolution without colour. These offered no competition to careful, hand-drawn displays. Furthermore, even early computers made many calculations much easier than before and allowed fitting of more complicated models. This directed attention away from graphics, and it is only in the last years that graphics have come into their own again.
These comments relate to presentation graphics, that is, graphics drawn for the purpose of illustrating and explaining results. Computing advances have benefitted exploratory graphics, that is, graphics drawn to support exploring data, far more. Not just the quality of graphic representation has improved but also the quantity. It is now trivial to draw many different displays of the same data or to riffle through many different versions interactively to look for information in data. These capabilities are only gradually becoming appreciated and capitalized on.
The importance of software availability and popularity in determining what analyses are carried out and how they are presented will be an interesting research topic for future historians of science. In the business world, no one seems to be able to do without the spreadsheet Excel. If Excel does not offer a particular graphic form, then that form will not be used. (In fact Excel offers many graphic forms, though not all that a statistician would want.) Many scientists, who only rarely need access to computational power, also rely on Excel and its options. In the world of statistics itself, the packages SAS and SPSS were long dominant. In the last years, first S and S-plus and now R have emerged as important competitors. None of these packages currently provide effective interactive tools for exploratory graphics, though they are all moving slowly in that direction as well as extending the range and flexibility of the presentation graphics they offer.
Data visualization is a new term. It expresses the idea that it involves more than just representing data in a graphical form (instead of using a table). The information behind the data should also be revealed in a good display; the graphic should aid readers or viewers in seeing the structure in the data. The term data visualization is related to the new field of information visualization. This includes visualization of all kinds of information, not just of data, and is closely associated with research by computer scientists. Up till now the work in this area has tended to concentrate just on presenting information, rather than on what may be deduced from it. Statisticians tend to be concerned more with variability and to emphasize the statistical properties of results. The closer linking of graphics with statistical modelling can make this more explicit and is a promising research direction that is facilitated by the flexible nature of current computing software. Statisticians have an important role to play here.
The Chapters
1.2
Needless to say, each Handbook chapter uses a lot of graphic displays. Figure . is a scatterplot of the number of figures against the number of pages. There is an approximate linear relationship with a couple of papers having somewhat more figures per page and one somewhat less. The scales have been chosen to maximize the data-ink ratio. An alternative version with equal scales makes clearer that the number of figures per page is almost always less than one.
Figure .. A scatterplot of the number of figures against the number of pages for the Handbook’s chapters.
The Handbook has been divided into three sections: Principles, Methodology, and Applications. Needless to say, the sections overlap. Figure . is a binary matrix visualization using Jaccard coefficients for both chapters (rows) and index entries (columns) to explore links between chapters. In the raw data map (lower-left portion of Fig .) there is a banding of black dots from the lower-left to upper-right corners indicating a possible transition of chapter/index combinations. In the proximity map of indices (upper portion of Fig .), index groups A, B, C, D, and E are overlapped with each other and are dominated by chapters of Good Graphics, History, Functional Data, Matrix Visualization, and Regression by Parts respectively.
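To make the idea of a Jaccard-based proximity concrete, here is a minimal sketch in Python. It is not the procedure used to produce the Handbook's figure. It assumes a small, made-up 0/1 incidence matrix (rows standing for chapters, columns for index entries) and uses the standard Jaccard coefficient: the size of the intersection of two sets divided by the size of their union. The function names `jaccard` and `proximity_matrix` and the example data are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets share nothing; treat their similarity as 0 (a convention).
    return len(a & b) / len(union) if union else 0.0


def proximity_matrix(incidence):
    """Pairwise Jaccard coefficients between the rows of a 0/1 matrix."""
    # Represent each row by the set of column positions that hold a 1.
    supports = [{j for j, v in enumerate(row) if v} for row in incidence]
    return [[jaccard(s, t) for t in supports] for s in supports]


# Hypothetical data: 3 chapters (rows) by 5 index entries (columns).
incidence = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
]

# Proximities among chapters (rows) and among index entries (columns).
chapter_proximity = proximity_matrix(incidence)
index_proximity = proximity_matrix([list(col) for col in zip(*incidence)])
```

Applying the same function to the transposed matrix gives the proximities among index entries. Ordering rows and columns by proximities of this kind is one way a display could bring related chapters and index entries next to each other.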
The ten chapters in Part II are concerned with principles of data visualization. First there is an historical overview by Michael Friendly, the custodian of the Internet Gallery of Data Visualization, outlining the developments in graphical displays over the last few hundred years and including many fine examples.
In the next chapter Antony Unwin discusses some of the guidelines for the preparation of sound and attractive data graphics. The question mark in the chapter title sums it up well: whatever principles or recommendations are followed, the success of a graphic is a matter of taste; there are no fixed rules.
The importance of software for producing graphics is incontrovertible. Paul Murrell in his chapter summarizes the requirements for producing accurate and exact static graphics. He emphasizes both the need for flexibility in customizing standard plots and the need for tools that permit the drawing of new plot types.
Structure in data may be represented by mathematical graphs. George Michailidis pursues this idea in his chapter and shows how this leads to another class of graphic displays associated with multivariate analysis methods.
Figure .. Matrix visualizations of the Handbook with chapters in the rows and index entries in the columns.
Lee Wilkinson approaches graph-theoretic visualizations from another point of view, and his displays are concerned predominantly, though by no means exclusively, with trees, directed graphs and geometric graphs. He also covers the layout of graphs, a tricky problem for large numbers of vertices, and raises the intriguing issue of graph matching.
Most data displays concentrate on one or two dimensions. This is frequently sufficient to reveal striking information about a dataset. To gain insight into multivariate structure, higher-dimensional representations are required. Martin Theus discusses the main statistical graphics of this kind that do not involve dimension reduction and compares their possible range of application.
Everyone knows about Chernoff faces, though not many ever use them. The potential of data glyphs for representing cases in informative and productive ways has not been fully realized. Matt Ward gives an overview of the wide variety of possible forms and of the different ways they can be utilized.
There are two chapters on linking. Adalbert Wilhelm describes a formal model for linked graphics and the conceptual structure underlying it. He is able to encompass different types of linking and different representations. Graham Wills looks at linking in a more applied context and stresses the importance of distinguishing between views of individual cases and aggregated views. He also highlights the variety of selection possibilities there are in interactive graphics. Both chapters point out the value of linking simple data views over linking complicated ones.
The final chapter in this section is by Simon Urbanek. He describes the graphics that have been introduced to support tree models in statistics. The close association between graphics and the models (and collections of models in forests) is particularly interesting and has relevance for building closer links between graphics and models in other fields.
The middle and largest section of the Handbook concentrates on individual areas of graphics research.
Geographical data can obviously benefit from visualization. Much of Bertin’s work was directed at this kind of data. Jürgen Symanzik and Daniel Carr write about micromaps (multiple small images of the same area displaying different parts of the data) and their interactive extension.
Projection pursuit and the grand tour are well known but not easy to use. Despite the availability of attractive free software, it is still a difficult task to analyse datasets in depth with this approach. Dianne Cook, Andreas Buja, Eun-Kyung Lee and Hadley Wickham describe the issues involved and outline some of the progress that has been made.
Multidimensional scaling has been around for a long time. Michael Cox and Trevor Cox (no relation, but an MDS would doubtless place them close together) review the current state of research.
Advances in high-throughput techniques in industrial projects, academic studies and biomedical experiments and the increasing power of computers for data collection have inevitably changed the practice of modern data analysis. Real-life datasets become larger and larger in both sample size and numbers of variables. Francesco Palumbo, Alain Morineau and Domenico Vistocco illustrate principles of visualization for such situations.
Some areas of statistics benefit more directly from visualization than others. Density estimation is hard to imagine without visualization. Michael Minnotte, Steve Sain and David Scott examine estimation methods in up to three dimensions. Interestingly there has not been much progress with density estimation in even three dimensions.
Sets of graphs can be particularly useful for revealing the structure in datasets and complement modelling efforts. Richard Heiberger and Burt Holland describe an approach primarily making use of Cartesian products and the Trellis paradigm. Wei-Yin Loh describes the use of visualization to support the use of regression models, in particular with the use of regression trees.
Instead of visualizing the structure of samples or variables in a given dataset, researchers may be interested in visualizing images collected with certain formats. Usually the target images are collected with various types of noise pattern and it is necessary to apply statistical or mathematical modelling to remove or diminish the noise structure before the possible genuine images can be visualized. Jörg Polzehl and Vladimir Spokoiny present one such novel adaptive smoothing procedure in reconstructing noisy images for better visualization.
The continuing increase in computer power has had many different impacts on statistics. Computationally intensive smoothing methods are now commonplace, although they were impossible only a few years ago. Adrian Bowman gives an overview of the relations between smoothing and visualization. Yuan-chin Chang, Yuh-Jye Lee, Hsing-Kuo Pao, Mei-Hsien Lee and Su-Yun Huang investigate the impact of kernel machine methods on a number of classical techniques: principal components, canonical correlation and cluster analysis. They use visualizations to compare their results with those from the original methods.
Cluster analyses have often been a bit suspect to statisticians. The lack of formal models in the past and the difficulty of judging the success of the clusterings were both negative factors. Fritz Leisch considers the graphical evaluation of clusterings and some of the possibilities for a sounder methodological approach.
Multivariate categorical data were difficult to visualize in the past. The chapter by David Meyer, Achim Zeileis and Kurt Hornik describes fairly classical approaches for low dimensions and emphasizes the link to model building. Heike Hofmann describes the powerful tools of interactive mosaicplots that have become available in recent years, not least through her own efforts, and discusses how different variations of the plot form can be used for gaining insight into multivariate data features.
Alfred Inselberg, the original proposer of parallel coordinate plots, offers an overview of this approach to multivariate data in his usual distinctive style. Here he considers in particular classification problems and how parallel coordinate views can be adapted and amended to support this kind of analysis.
Most analyses using graphics make use of a standard set of graphical tools, for example, scatterplots, barcharts, and histograms. Han-Ming Wu, ShengLi Tzeng and Chun-houh Chen describe a different approach, built around using colour approximations for individual values in a data matrix and applying cluster analyses to order the matrix rows and columns in informative ways.
For many years Bayesians were primarily theoreticians. Thanks to MCMC methods they are now able to also apply their ideas to great effect. This has led to new demands in assessing model fit and the quality of the results. Jouni Kerman, Andrew Gelman, Tian Zheng and Yuejing Ding discuss graphical approaches for tackling these issues in a Bayesian framework.
Without software to draw the displays, graphic analysis is almost impossible nowadays. Junji Nakano, Yoshikazu Yamamoto and Keisuke Honda are working on Java-based software to provide support for new developments, and they outline their approach here. Many researchers are interested in providing tools via the Web. Yoshiro Yamamoto, Masaya Iizuka and Tomokazu Fujino discuss using XML for interactive statistical graphics and explain the issues involved.
Summary and Overview; Part IV
1.2.3
The final section contains seven chapters on specific applications of data visualization. There are, of course, individual applications discussed in earlier chapters, but here the emphasis is on the application rather than principles or methodology.
Genetic networks are obviously a promising area for informative graphic displays. Grace Shieh and Chin-Yuan Guo describe some of the progress made so far and make clear the potential for further research.
Modern medical imaging systems have made significant contributions to diagnoses and treatments. Henry Lu discusses the visualization of data from positron emission tomography, ultrasound and magnetic resonance.
Two chapters examine company bankruptcy datasets. In the first one, Antony Unwin, Martin Theus and Wolfgang Härdle use a broad range of visualization tools to carry out an extensive exploratory data analysis. No large dataset can be analysed cold, and this chapter shows how effective data visualization can be in assessing data quality and revealing features of a dataset. The other bankruptcy chapter employs graphics to visualize SVM modelling. Wolfgang Härdle, Rouslan Moro and Dorothea Schäfer use graphics to display results that cannot be presented in a closed analytic form.
The astonishing growth of eBay has been one of the big success stories of recent years. Wolfgang Jank, Galit Shmueli, Catherine Plaisant and Ben Shneiderman have studied data from eBay auctions and describe the role graphics played in their analyses.
Krzysztof Burnecki and Rafał Weron consider the application of visualization in insurance. This is another example of how the value of graphics lies in providing insight into the output of complex models.
The editors would like to thank the authors of the chapters for their contributions. It is important for a collective work of this kind to cover a broad range and to gather many experts with different interests together. We have been fortunate in receiving the assistance of so many excellent contributors.
The mixture at the end remains, of course, a mixture. Different authors take different approaches and have different styles. It early became apparent that even the term data visualization means different things to different people! We hope that the Handbook gains rather than loses by this eclecticism.
Figures . and . earlier in the chapter showed that the chapter form varied between authors in various ways. Figure . reveals another aspect. The scatterplot shows an outlier with a very large number of references (the historical survey of Michael Friendly) and that some papers referenced the work of their own authors more than others. The histogram is for the rate of self-referencing.
Figure .. A scatterplot of the number of references to papers by a chapter’s authors against the total number of references and a histogram of the rate of self-referencing.
Outlook
1.3
There are many open issues in data visualization and many challenging research problems. The datasets to be analysed tend to be more complex and are certainly becoming larger all the time. The potential of graphical tools for exploratory data analysis has not been fully realized, and the complementary interplay between statistical modelling and graphics has not yet been fully exploited. Advances in computer software and hardware have made producing graphics easier, but they have also contributed to raising the standards expected.
Future developments will undoubtedly include more flexible and powerful software and better integration of modelling and graphics. There will probably be individual new and innovative graphics and some improvements in the general design of displays. Gradual gains in knowledge about the perception of graphics and the psychological aspects of visualization will lead to improved effectiveness of graphic displays. Ideally there should be progress in the formal theory of data visualization, but that is perhaps the biggest challenge of all.
Part II
Principles
Pre-17th Century: Early Maps and Diagrams
1600–1699: Measurement and Theory
1700–1799: New Graphic Forms
1800–1850: Beginnings of Modern Graphics
1850–1900: The Golden Age of Statistical Graphics
1900–1950: The Modern Dark Ages
1950–1975: Rebirth of Data Visualization
1975–present: High-D, Interactive and Dynamic Data Visualization
1.3 Statistical Historiography
History as ‘Data’
Analysing Milestones Data
What Was He Thinking? – Understanding Through Reproduction
1.4 Final Thoughts
Michael Friendly
It is common to think of statistical graphics and data visualization as relatively modern developments in statistics. In fact, the graphic representation of quantitative information has deep roots. These roots reach into the histories of the earliest map making and visual depiction, and later into thematic cartography, statistics and statistical graphics, medicine and other fields. Along the way, developments in technologies (printing, reproduction), mathematical theory and practice, and empirical observation and recording enabled the wider use of graphics and new advances in form and content.
This chapter provides an overview of the intellectual history of data visualization from medieval to modern times, describing and illustrating some significant advances along the way. It is based on a project, called the Milestones Project, to collect, catalogue and document in one place the important developments in a wide range of areas and fields that led to modern data visualization. This effort has suggested some questions concerning the use of present-day methods of analysing and understanding this history, which I discuss under the rubric of ‘statistical historiography.’
Introduction
1.1
The only new thing in the world is the history you don’t know. – Harry S. Truman
It is common to think of statistical graphics and data visualization as relatively modern developments in statistics. In fact, the graphic portrayal of quantitative information has deep roots. These roots reach into the histories of the earliest map-making and visual depiction, and later into thematic cartography, statistics and statistical graphics, with applications and innovations in many fields of medicine and science which are often intertwined with each other. They also connect with the rise of statistical thinking and widespread data collection for planning and commerce up through the th century. Along the way, a variety of advancements contributed to the widespread use of data visualization today. These include technologies for drawing and reproducing images, advances in mathematics and statistics, and new developments in data collection, empirical observation and recording.
From above ground, we can see the current fruit and anticipate future growth; we must look below to understand their germination. Yet the great variety of roots and nutrients across these domains, which gave rise to the many branches we see today, are often not well known and have never been assembled in a single garden to be studied or admired.
This chapter provides an overview of the intellectual history of data visualization from medieval to modern times, describing and illustrating some significant advances along the way. It is based on what I call the Milestones Project, an attempt to provide a broadly comprehensive and representative catalogue of important developments in all fields related to the history of data visualization.
There are many historical accounts of developments within the fields of probability (Hald, ), statistics (Pearson, ; Porter, ; Stigler, ), astronomy (Riddell, ) and cartography (Wallis and Robinson, ), which relate to, inter alia, some of the important developments contributing to modern data visualization. There are other, more specialized, accounts which focus on the early history of graphic recording (Hoff and Geddes, , ), statistical graphs (Funkhouser, , ; Royston, ; Tilling, ), fitting equations to empirical data (Farebrother, ), economics and time-series graphs (Klein, ), cartography (Friis, ; Kruskal, ) and thematic mapping (Robinson, ; Palsky, ) and so forth; Robinson (Robinson, , Chap ) presents an excellent overview of some of the important scientific, intellectual and technical developments of the th–th centuries leading to thematic cartography and statistical thinking. Wainer and Velleman () provide a recent account of some of the history of statistical graphics.
But there are no accounts which span the entire development of visual thinking
and the visual representation of data and which collate the contributions of disparate
disciplines Inasmuch as their histories are intertwined, so too should be any telling
of the development of data visualization Another reason for interweaving these
ac-counts is that practitioners in these fields today tend to be highly specialized and
unaware of related developments in areas outside their domain, much less of their
history
In organizing this history, it proved useful to divide history into epochs, each of which
turned out to be describable by coherent themes and labels. This division is, of course,
somewhat artificial, but it provides the opportunity to characterize the
accomplishments in each period in a general way before describing some of them in more detail.
Figure ., discussed in Sect .., provides a graphic overview of the epochs I
describe in the subsections below, showing the frequency of events considered
milestones in the periods of this history. For now, it suffices to note the labels attached to
these epochs, a steady rise from the early th century to the late th century, with
a curious wiggle thereafter.
In the larger picture – recounting the history of data visualization – it turns out
that many of the milestone items have a story to be told: What motivated this
development? What was the communication goal? How does it relate to other
developments – What were the precursors? How has this idea been used or re-invented
today? Each section below tries to illustrate the general themes with a few exemplars.
In particular, this account attempts to tell a few representative stories of these periods,
rather than to try to be comprehensive.
For reasons of economy, only a limited number of images could be printed here,
and these only in black and white. Others are referred to by Web links, mostly from
the Milestones Project, http://www.math.yorku.ca/SCS/Gallery/milestone/, where
a colour version of this chapter will also be found.
18 Michael Friendly
Figure .. Time distribution of events considered milestones in the history of data visualization, shown
by a rug plot and density estimate
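The overview figure combines a rug plot (one tick per event) with a density estimate. The sketch below illustrates that combination in plain Python, assuming a Gaussian kernel and a fixed bandwidth; the text does not say which kernel or bandwidth was actually used. The event years, the bandwidth and the names `gaussian_kde` and `rug_line` are hypothetical, chosen only for illustration, and are not Milestones Project data.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def rug_line(samples, lo, hi, width=60):
    """Text 'rug': one tick mark per sample along a horizontal axis."""
    cells = [" "] * width
    for s in samples:
        # Integer arithmetic keeps the cell index exact.
        pos = min(width - 1, (s - lo) * width // (hi - lo))
        cells[pos] = "|"
    return "".join(cells)

# Hypothetical event years, invented purely for illustration.
years = [1600, 1610, 1640, 1700, 1750, 1760, 1780, 1790, 1795, 1800]

density = gaussian_kde(years, bandwidth=25.0)
rug = rug_line(years, 1550, 1850)

print(rug)
for year in (1600, 1670, 1790):
    print(year, round(density(year), 5))
```

With these made-up years the estimate is higher around the cluster near 1790 than in the sparse stretch around 1670, which is the kind of rise and dip such a curve makes visible above the rug.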
Pre-17th Century: Early Maps and Diagrams
1.2.1
The earliest seeds of visualization arose in geometric diagrams, in tables of the positions of stars and other celestial bodies, and in the making of maps to aid in navigation and exploration. The idea of coordinates was used by ancient Egyptian surveyors in laying out towns, earthly and heavenly positions were located by something akin to latitude and longitude by at least B.C., and the map projection of a spherical earth into latitude and longitude by Claudius Ptolemy [c –c ] in Alexandria would serve as reference standards until the th century.
Among the earliest graphical depictions of quantitative information is an anonymous th-century multiple time-series graph of the changing position of the seven most prominent heavenly bodies over space and time (Fig .), described by Funkhouser () and reproduced in Tufte (, p ). The vertical axis represents the inclination of the planetary orbits; the horizontal axis shows time, divided into intervals. The sinusoidal variation with different periods is notable, as is the use of
a grid, suggesting both an implicit notion of a coordinate system and something akin
to graph paper, ideas that would not be fully developed until the –s.
Figure .. Planetary movements shown as cyclic inclinations over time, by an unknown astronomer,
appearing in a th-century appendix to commentaries by A.T. Macrobius on Cicero’s In Somnium
Scipionis. Source: Funkhouser (, p )
In the th century, the idea of plotting a theoretical function (as a proto bar graph) and the logical relation between tabulating values and plotting them appeared in
a work by Nicole Oresme [–], Bishop of Lisieux (Oresme, , ),
followed somewhat later by the idea of a theoretical graph of distance vs speed by
Nicolas of Cusa.
By the th century, techniques and instruments for precise observation and
measurement of physical quantities and geographic and celestial position were well
developed (for example, a ‘wall quadrant’ constructed by Tycho Brahe [–],
covering an entire wall in his observatory). Particularly important was the development
of triangulation and other methods to determine mapping locations accurately
(Frisius, ; Tartaglia, ). As well, we see initial ideas for capturing images directly
(the camera obscura, used by Reginer Gemma-Frisius in to record an eclipse
of the sun), the recording of mathematical functions in tables (trigonometric tables
by Georg Rheticus, ) and the first modern cartographic atlas (Theatrum Orbis
Terrarum by Abraham Ortelius, ). These early steps comprise the beginnings of
data visualization.
Funkhouser (, p ) was sufficiently impressed with Oresme’s grasp of the relation
between functions and graphs that he remarked, ‘If a pioneering contemporary had collected
some data and presented Oresme with actual figures to work upon, we might have had
statistical graphs four hundred years before Playfair.’
Amongst the most important problems of the th century were those concerned
with physical measurement – of time, distance and space – for astronomy,
surveying, map making, navigation and territorial expansion. This century also saw great new growth in theory and the dawn of practical application – the rise of analytic geometry and coordinate systems (Descartes and Fermat), theories of errors of measurement and estimation (initial steps by Galileo in the analysis of observations on Tycho Brahe’s star of (Hald, , §.)), the birth of probability theory (Pascal and Fermat) and the beginnings of demographic statistics (John Graunt) and ‘political arithmetic’ (William Petty) – the study of population, land, taxes, value of goods, etc. for the purpose of understanding the wealth of the state.
Early in this century, Christopher Scheiner (–, recordings from ) introduced an idea Tufte () would later call the principle of ‘small multiples’ to show the changing configurations of sunspots over time (Fig .). The multiple images depict the recordings of sunspots from October until December
of that year. The large key in the upper left identifies seven groups of sunspots by the letters A–G. These groups are similarly identified in the smaller images, arrayed left to right and top to bottom below.
Figure .. Scheiner’s representation of the changes in sunspots over time. Source: Scheiner (–)
Another noteworthy example (Fig .) shows a graphic by Michael Florent van Langren [–], a Flemish astronomer to the court of Spain, believed to be the first visual representation of statistical data (Tufte, , p ). At that time, lack of
a reliable means to determine longitude at sea hindered navigation and exploration. This -D line graph shows all known estimates of the difference in longitude between Toledo and Rome and the name of the astronomer (Mercator, Tycho Brahe, Ptolemy, etc.) who provided each observation.
Figure .. Langren’s graph of determinations of the distance, in longitude, from Toledo to Rome.
The correct distance is ′. Source: Tufte (, p )
What is notable is that van Langren could have presented this information in various tables – ordered by author to show provenance, by date to show priority, or by distance. However, only a graph shows the wide variation in the estimates; note that the range of values covers nearly half the length of the scale. Van Langren took as his overall summary the centre of the range, where there happened to be a large enough gap for him to inscribe ‘ROMA.’ Unfortunately, all of the estimates were biased upwards; the true distance (′) is shown by the arrow. Van Langren’s graph is also
a milestone as the earliest known exemplar of the principle of ‘effect ordering for data display’ (Friendly and Kwan, ).
In the s, the systematic collection and study of social data began in various European countries, under the rubric of ‘political arithmetic’ (John Graunt, and William Petty, ), with the goals of informing the state about matters related to wealth, population, agricultural land, taxes and so forth, as well as for commercial purposes such as insurance and annuities based on life tables (Jan de Witt, ). At approximately the same time, the initial statements of probability theory around (see Ball, ) together with the idea of coordinate systems were applied by Christiaan Huygens in to give the first graph of a continuous distribution function (from Graunt’s based on the bills of mortality). The mid-s saw the first bivariate plot derived from empirical data, a theoretical curve relating barometric pressure to altitude, and the first known weather map, showing prevailing winds on a map of the earth (Halley, ).
For navigation, latitude could be fixed from star inclinations, but longitude required accurate measurement of time at sea, an unsolved problem until with the invention of
a marine chronometer by John Harrison. See Sobel () for a popular account.
For example, Graunt () used his tabulations of London births and deaths from parish records and the bills of mortality to estimate the number of men the king would find available in the event of war (Klein, , pp –).
Image: http://math.yorku.ca/SCS/Gallery/images/huygens-graph.gif
Image: http://math.yorku.ca/SCS/Gallery/images/halleyweathermap-.jpg
By the end of this century, the necessary elements for the development of graphical methods were at hand – some real data of significant interest, some theory to make
sense of them, and a few ideas for their visual representation. Perhaps more importantly, one can see this century as giving rise to the beginnings of visual thinking, as illustrated by the examples of Scheiner and van Langren.
1700–1799: New Graphic Forms
1.2.3
With some rudiments of statistical theory, data of interest and importance, and the idea of graphic representation at least somewhat established, the th century witnessed the expansion of these aspects to new domains and new graphic forms. In cartography, mapmakers began to try to show more than just geographical position
on a map. As a result, new data representations (isolines and contours) were invented, and thematic mapping of physical quantities took root. Towards the end of this century, we see the first attempts at the thematic mapping of geologic, economic and medical data.
Abstract graphs, and graphs of functions, became more widespread, along with the early stirrings of statistical theory (measurement error) and systematic collection of empirical data. As other (economic and political) data began to be collected, some novel visual forms were invented to portray them, so the data could ‘speak to the eyes.’
For example, the use of isolines to show contours of equal value on a coordinate grid (maps and charts) was developed by Edmund Halley (). Figure ., showing isogons – lines of equal magnetic declination – is among the first examples of thematic cartography, overlaying data on a map. Contour maps and topographic maps were introduced somewhat later by Philippe Buache () and Marcellin du Carla-Boniface ().
Timelines, or ‘cartes chronologiques,’ were first introduced by Jacques Barbeu-Dubourg in the form of an annotated chart of all of history (from Creation) on a -foot scroll (Ferguson, ). Joseph Priestley, presumably independently, used a more convenient form to show first a timeline chart of biography (lifespans of famous people, B.C. to A.D. , Priestley, ), and then a detailed chart of history (Priestley, ).
The use of geometric figures (squares or rectangles) and cartograms to compare areas or demographic quantities by Charles de Fourcroy () and August F.W. Crome () provided another novel visual encoding for quantitative data using superimposed squares to compare the areas of European states.
Image: http://math.yorku.ca/SCS/Gallery/images/palsky/defourcroy.jpg
As well, several technological innovations provided necessary ingredients for the production and dissemination of graphic works. Some of these facilitated the reproduction of data images, such as three-colour printing, invented by Jacob le Blon in
, and lithography, invented by Aloys Senefelder in . Of the latter, Robinson (, p ) says “the effect was as great as the introduction [of the Xerox machine].” Yet, likely due to expense, most of these new graphic forms appeared in publications with limited circulation, unlikely to attract wide attention.
Figure .. A portion of Edmund Halley’s New and Correct Sea Chart Shewing the Variations in the Compass in the Western and Southern Ocean. Source: Halley (), image from Palsky (, p )
A prodigious contributor to the use of the new graphical methods, Johann Lambert [–] introduced the ideas of curve fitting and interpolation from empirical data points. He used various sorts of line graphs and graphical tables to show periodic variation in, for example, air and soil temperature.
Image: http://www.journals.uchicago.edu/Isis/journal/demo/vn//fg.gif
William Playfair [–] is widely considered the inventor of most of the graphical forms used today – first the line graph and barchart (Playfair, ), later the
In this figure, the left axis and line on each circle/pie graph shows population, while the right axis and line shows taxes. Playfair intended that the slope of the line connecting the two would depict the rate of taxation directly to the eye; but, of course, the slope also depends on the diameters of the circles. Playfair’s graphic sins can perhaps be forgiven here, because the graph clearly shows the slope of the line for Britain
to be in the opposite direction of those for the other nations.
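The dependence of the slope on the circle diameter can be checked with a little arithmetic. The sketch below assumes that the population and tax lines stand at opposite ends of a circle's horizontal diameter, so their horizontal separation equals the diameter; this geometry, the function name `connecting_slope` and all of the numbers are illustrative assumptions, not measurements from Playfair's chart.

```python
def connecting_slope(left_height, right_height, diameter):
    """Slope of the line joining the top of the left line to the top of the right line.

    Assumes the two vertical lines are separated horizontally by the
    circle's diameter (an assumption made for this sketch).
    """
    return (right_height - left_height) / diameter

# Two hypothetical nations with identical population and tax heights
# but circles of different size.
small_circle = connecting_slope(left_height=10.0, right_height=14.0, diameter=2.0)
large_circle = connecting_slope(left_height=10.0, right_height=14.0, diameter=8.0)

# A hypothetical nation whose tax line is shorter than its population line.
reversed_case = connecting_slope(left_height=14.0, right_height=10.0, diameter=2.0)

print(small_circle, large_circle, reversed_case)
```

Under these assumptions the same pair of heights gives a steeper line on the smaller circle, so the slope alone cannot be read as a rate. The sign of the slope does not depend on the diameter, which is why a reversed direction remains a valid reading.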
A somewhat later graph (Playfair, ), shown in Fig ., exemplifies the best that Playfair had to offer with these graphic forms. Playfair used three parallel time series
to show the price of wheat, weekly wages and reigning ruler over a -year span from to and used this graph to argue that workers had become better off in the most recent years.
Figure .. William Playfair’s time-series graph of prices, wages and reigning ruler over a -year
period. Source: Playfair (), image from Tufte (, p )
By the end of this century (), the utility of graphing in scientific applications prompted a Dr Buxton in London to patent and market printed coordinate paper; curiously, a patent for lined notepaper was not issued until . The first known
published graph using coordinate paper is one of periodic variation in barometric
pressure (Howard, ). Nevertheless, graphing of data would remain rare for
another or so years, perhaps largely because there wasn’t much quantitative
information (apart from widespread astronomical, geodetic and physical measurement)
of sufficient complexity to require new methods and applications. Official statistics,
regarding population and mortality, and economic data were generally fragmentary
and often not publicly available. This would soon change.
With the fertilization provided by the previous innovations of design and technique,
the first half of the th century witnessed explosive growth in statistical graphics and
thematic mapping, at a rate which would not be equalled until modern times.
In statistical graphics, all of the modern forms of data display were invented:
bar- and piecharts, histograms, line graphs and time-series plots, contour plots,
scatterplots and so forth. In thematic cartography, mapping progressed from single maps
to comprehensive atlases, depicting data on a wide variety of topics (economic,
social, moral, medical, physical, etc.), and introduced a wide range of novel forms of
symbolism. During this period graphical analysis of natural and physical
phenomena (lines of magnetism, weather, tides, etc.) began to appear regularly in scientific
publications as well.
William Herschel (), in a paper that describes the first instance of a modern scatterplot,
devoted three pages to a description of plotting points on a grid.
In , the first geological maps were introduced in England by William Smith
[–], setting the pattern for geological cartography or ‘stratigraphic geology’
(Smith, ). These and other thematic maps soon led to new ways of showing quantitative information on maps and, equally importantly, to new domains for graphically based inquiry.
In the s, Baron Charles Dupin [–] invented the use of continuous shadings (from white to black) to show the distribution and degree of illiteracy in France (Dupin, ) – the first unclassed choropleth map, and perhaps the first modern-style thematic statistical map (Palsky, , p ). Later given the lovely title ‘Carte de la France obscure et de la France éclairée,’ it attracted wide attention, and was also perhaps the first application of graphics in the social realm.
More significantly, in , the ministry of justice in France instituted the first centralized national system of crime reporting, collected quarterly from all departments and recording the details of every charge laid before the French courts. In , André-Michel Guerry, a lawyer with a penchant for numbers, used these data (along with other data on literacy, suicides, donations to the poor and other ‘moral’ variables) to produce a seminal work on the moral statistics of France (Guerry, ) –
a work that (along with Quételet, , ) can be regarded as the foundation of modern social science.
Guerry used maps in a style similar to Dupin to compare the ranking of departments on pairs of variables, notably crime vs literacy, but other pairwise variable comparisons were made. He used these to argue that the lack of an apparent (negative) relation between crime and literacy contradicted the armchair theories of some social reformers who had argued that the way to reduce crime was to increase education. Guerry’s maps and charts made somewhat of an academic sensation both
in France and the rest of Europe; he later exhibited several of these at the London Exhibition and carried out a comparative study of crime in England and France (Guerry, ) for which he was awarded the Montyon Prize in statistics by the French Academy of Sciences. But Guerry’s systematic and careful work was unable
to shine in the shadows cast by Adolphe Quételet, who regarded moral and social statistics as his own domain.
Image: http://math.yorku.ca/SCS/Gallery/images/dupin-map_.jpg
Guerry showed that rates of crime, when broken down by department, type of crime, age and gender of the accused and other variables, remained remarkably consistent from year to year, yet varied widely across departments. He used this to argue that such regularity implied the possibility of establishing social laws, much as the regularity of natural phenomena implied physical ones. Guerry also pioneered the study of suicide, with tabulations of suicides in Paris, –, by sex, age, education, profession, etc., and a content analysis of suicide notes as to presumed motives.
Today, one would use a scatterplot, but that graphic form had only just been invented (Herschel, ) and would not enter common usage for another years; see Friendly and Denis ().
Guerry seemed reluctant to take sides. He also contradicted the social conservatives who argued for the need to build more prisons or impose more severe criminal sentences. See Whitt ().
Among the plates in this last work, seven pairs of maps for England and France each included sets of small line graphs to show trends over time, decompositions by subtype
of crime and sex, distributions over months of the year, and so forth. The final plate, on general causes of crime, is an incredibly detailed and complex multivariate semi-graphic display attempting to relate various types of crimes to each other, to various social and moral aspects (instruction, religion, population) as well as to their geographic distribution.
Figure .. A portion of Dr Robert Baker’s cholera map of Leeds, , showing the districts affected by cholera. Source: Gilbert (, Fig )
In October , the first case of asiatic cholera occurred in Great Britain, and over
people died in the epidemic that ensued over the next months or so (Gilbert,
). Subsequent cholera epidemics in – and – produced similarly large death tolls, but the water-borne cause of the disease was unknown until when Dr John Snow produced his famous dot map (Snow, ) showing deaths due to cholera clustered around the Broad Street pump in London. This was indeed
a landmark graphic discovery, but it occurred at the end of the period, roughly –
, which marks a high point in the application of thematic cartography to human (social, medical, ethnic) topics. The first known disease map of cholera (Fig .), due
to Dr Robert Baker (), shows the districts of Leeds ‘affected by cholera’ in the particularly severe outbreak.
Image: http://www.math.yorku.ca/SCS/Gallery/images/snow.jpg
I show this figure to make another point – why Baker’s map did not lead to a ‘eureka’ experience, while John Snow’s did. Baker used a town plan of Leeds that had been divided into districts. Of a population of in all of Leeds, Baker mapped
the cholera cases by hatching in red ‘the districts in which the cholera had prevailed.’ In his report, he noted an association between the disease and living conditions: ‘how exceedingly the disease has prevailed in those parts of the town where there is a deficiency, often an entire want of sewage, drainage and paving’ (Baker, ,
p ). Baker did not indicate the incidence of disease on his map, nor was he equipped
to display rates of disease (in relation to population density), and his knowledge of possible causes, while definitely on the right track, was both weak and implicit (not analysed graphically or by other means). It is likely that some, perhaps tenuous, causal indicants or evidence were available to Baker, but he was unable to connect the dots
or see a geographically distributed outcome in relation to geographic factors in even the simple ways that Guerry had tried.
At about the same time, –, the use of graphs began to become recognized
in some official circles for economic and state planning – where to build railroads and canals? What is the distribution of imports and exports? This use of graphical methods is no better illustrated than in the works of Charles Joseph Minard [–], whose prodigious graphical inventions led Funkhouser () to call him the Playfair
of France. To illustrate, we choose (with some difficulty) an ‘tableau-graphique’ (Fig .) by Minard, an early progenitor of the modern mosaicplot (Friendly, ).
On the surface, mosaicplots descend from bar charts, but Minard introduced two simultaneous innovations: the use of divided and proportional-width bars so that area had a concrete visual interpretation. The graph shows the transportation of commercial goods along one canal route in France by variable-width, divided bars (Minard,
). In this display the width of each vertical bar shows distance along this route; the divided-bar segments have height proportional to amount of goods of various types (shown by shading), so the area of each rectangular segment is proportional to the cost of transport. Minard, a true visual engineer (Friendly, ), developed such diagrams to argue visually for setting differential price rates for partial vs complete runs. Playfair had tried to make data ‘speak to the eyes,’ but Minard wished to make them ‘calculer par l’œil’ as well.
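The area encoding described above, with width as distance and height as amount of goods, can be made concrete with a short calculation. The sketch below uses made-up stretches of a route and made-up tonnages; the stretch names, goods types, units and the function name `segment_areas` are hypothetical and are not taken from Minard's chart. It assumes, as the text states, that transport cost is proportional to distance times amount.

```python
# Hypothetical stretches of a canal route:
# (stretch name, distance in km, tonnage carried per type of goods).
route = [
    ("A-B", 30.0, {"coal": 400.0, "grain": 100.0}),
    ("B-C", 10.0, {"coal": 250.0, "grain": 300.0}),
]

def segment_areas(route):
    """Area of each rectangular segment: bar width (distance) x segment height (amount)."""
    areas = {}
    for name, distance, goods in route:
        for kind, amount in goods.items():
            areas[(name, kind)] = distance * amount
    return areas

areas = segment_areas(route)

# Summing the areas per type of goods gives a quantity proportional to the
# total transport cost for that type over the whole route.
total_by_kind = {}
for (name, kind), area in areas.items():
    total_by_kind[kind] = total_by_kind.get(kind, 0.0) + area

print(areas)
print(total_by_kind)
```

In this toy example more grain than coal is carried on the short stretch, yet the grain rectangle there is still much smaller than the coal rectangle on the long stretch. A reader can make that comparison by eye from the areas, without doing the multiplication.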
It is no accident that, in England, outside the numerous applications of graphical methods in the sciences, there was little interest in or use of graphs amongst statisticians (or ‘statists’ as they called themselves). If there is a continuum ranging from
‘graph people’ to ‘table people,’ British statisticians and economists were philosophically more table-inclined and looked upon graphs with suspicion up to the time of William Stanley Jevons around (Maas and Morgan, ). Statistics should be concerned with the recording of ‘facts relating to communities of men which are capable of being expressed by numbers’ (Mouat, , p ), leaving the generalization
to laws and theories to others. Indeed, this view was made abundantly clear in the logo of the Statistical Society of London (now the Royal Statistical Society): a banded
The German geographer Augustus Petermann produced a ‘Cholera map of the British Isles’ in using national data from the – epidemic (image: http://images.rgs.org/webimages//////S.jpg) shaded in proportion
to the relative rate of mortality using class intervals (< , , , ).
No previous disease map had allowed determination of the range of mortality in any given area.