Danyel Fisher & Miriah Meyer
Making Sense of Data
First Edition
Beijing  Boston  Farnham  Sebastopol  Tokyo
Making Sense of Data
by Miriah Meyer and Danyel Fisher
Copyright © 2016 Miriah Meyer, Microsoft. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Laurel Ruma and Shannon Cutt
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2016: First Edition
Revision History for the First Edition
2016-04-04: First Early Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491928400 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Making Sense of Data, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

1. Introduction
   Making Sense of Data
   Creating a Good Visualization
   Who are we?
   Who is this book for?
   The rest of this book
2. Operationalization, from questions to data
   Example: Understanding the Design of a Transit System
   The Operationalization Tree
   The Leaves of the Tree
   Flowing Results Back Upwards
   Applying the Tree to the UTA Scenario
   Visualization, from Top to Bottom
   Conclusion: A Well-Operationalized Task
   For Further Reading
3. Data Counseling
   Why is this hard?
   Creating visualizations is a collaborative process
   The Goal of Data Counseling
   The data counseling process
   Conclusion
4. Components of a Visualization
   Data Abstraction
   Direct and Indirect Measures
   Dimensions
   A Suite of Actions
   Choosing an Appropriate Visualization
CHAPTER 1
Introduction
Visualization is a vital tool to understand and share insights around data. The right visualization can help express a core idea or open a space to examination; it can get the world talking about a dataset, or sharing an insight.

As an example of how visualization can help people change minds, and help an organization make decisions, we can look back to 2006, when Microsoft was rolling out their new mapping tool, Virtual Earth, a zoomable world map. At that time the team behind Virtual Earth had lots of questions about how users were making use of this new tool, and so they collected usage data in order to answer these questions.

The usage data was based on traditional telemetry: it had great information on what cities were most looked at; how many viewers were in “street” mode vs. “photograph” mode; and even information about viewers’ displays. And because the Virtual Earth tool is built on top of a set of progressively higher-resolution image tiles, the team also collected data on how often individual tiles were accessed. What this usage data didn’t have, however, was specific information that addressed how users were using the system. Were they getting stuck anywhere? Did they have patterns of places they liked to look at? What places would be valuable for investing in future photography?
Figure 1-1. Hotmap, looking at the central United States. The white box surrounds the anomaly discussed below.
To unravel these questions, the team developed a visualization tool called Hotmap. Figure 1-1 shows a screen capture from the visualization tool, focusing on the central United States. Hotmap uses a heatmap encoding of the tile access values, using a colormap to encode the access values at the geospatial location of the tiles. Thus, bright spots on the map are places where more users have accessed image tiles. Note that the colormap uses a logarithmic color scale, so bright spots have many more accesses than dim ones.
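The effect of a logarithmic color scale is easy to reproduce. The sketch below uses invented tile-access counts (not the actual Hotmap data) to show how matplotlib's `LogNorm` keeps a handful of extremely hot cells from washing out the rest of the map:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Hypothetical tile-access counts: mostly small values, with a couple
# of hotspots several orders of magnitude larger (like the map-center anomaly).
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(40, 60)).astype(float)
counts[10, 15] = 50_000  # a "bright spot"
counts[30, 45] = 8_000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.imshow(counts, cmap="hot")                   # linear scale: hotspots dominate
ax1.set_title("Linear color scale")
ax2.imshow(counts, cmap="hot", norm=LogNorm())   # log scale: background structure survives
ax2.set_title("Logarithmic color scale")
fig.savefig("hotmap_sketch.png")
```

On the linear scale, everything except the two hotspots renders as near-black; the log scale spreads the low-count cells across the colormap's range.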
Some of the brightest areas correspond to major population centers — Chicago and Minneapolis on the right, Denver and Salt Lake City on the left. In the center, though, is an anomalous shape: a bright spot where no big city exists. There’s a star shape around the bright spot, and an arc of bright colors nearby. The spot is in a sparsely populated bit of South Dakota. There’s no obvious reason why users might zoom in there. It is, however, very close to the center of a map of the continental US. In fact, the team learned that the center of the star corresponds to the center of the default placement of the map in many browsers. Thus, the bright spot with the star most likely corresponds to users sliding around after inadvertently zooming in, trying to figure out where they had landed; the arc seems to correspond to variations in monitor proportions.
As a result of usability challenges like this one, many mapping tools — including Virtual Earth — no longer offer a zoom slider, keeping users from accidentally zooming all the way in on a single click.
A second screen capture looks at a bright spot off the coast of Ghana. This spot exhibits the same cross pattern created by users scrolling around to try to figure out what part of the map they were viewing. This spot is likely only bright because it is 0 degrees latitude, 0 degrees longitude — under this spot is only a large expanse of water. While computers might find (0,0) appealing, it is unlikely that there is much there for the typical Virtual Earth user to find interesting.
Figure 1-2. Hotmap, looking at the map origin (0,0).
This bright spot inspired a hunt for bugs; the team rapidly learned that Virtual Earth’s search facility would sometimes fail: instead of returning an error message, typos and erroneous searches would sometimes redirect the user to (0,0). Interestingly, the bug had been on the backlog for some time, but the team had decided that it was not likely to influence users much. Seeing this image made it clear that some users really were being confused by the error; the team prioritized the bug.

Although the Virtual Earth team had started out using the Hotmap visualization expecting to find out about how users interacted with maps, they gleaned much more than just a characterization of usage patterns. Like many — dare we say most? — new visualizations, the most interesting insights are those that the viewer was not anticipating to find.
Making Sense of Data
Visualization can give the viewer a rich and broad sense of a dataset. It can communicate data succinctly while exposing where more information is needed or where an assumption does not hold. Furthermore, visualization provides us a canvas to bring our own ideas, experiences, and knowledge to bear when we look at and analyze data, allowing for multiple interpretations. If a picture is worth a thousand words, a well-chosen interactive chart might well be worth a few hundred statistical tests.
Is visualization the silver bullet to help us make sense of data? It can support a case, but does not stand alone. There are two questions to consider to help you decide if your data analysis problem is a good candidate for a visualization solution.

First, are the analysis tasks clearly defined? A crisp task such as “I want to know the total number of users who looked at Seattle” suggests that an algorithm, statistical test, or even a table of numbers might be the best way to answer the question. On the other hand, “How do users explore the map?” is much fuzzier. These fuzzy tasks are great candidates for a visualization solution because they require you to look at the data from different angles and perspectives, and to be able to make decisions and inferences based on your own knowledge and understanding.

The second question to consider: is all the necessary information contained in the data set? If there is information about the problem that is not in the data set, requiring an expert to interpret the data that is there, then visualization is a great solution. Going back to our fuzzy question about exploring a map, we can imagine that it is unlikely that there will be an explicit attribute in the data that classifies a user’s exploration style. Instead, answering this question requires someone to interpret other aspects of the data, to bring knowledge to bear about what aspects of the data suggest an exploration style. Again, visualization enables this sort of flexible and user-centric analysis.
In the figure below we illustrate the effects of considering the task and data questions on the space of problems that are amenable to a visualization solution.
Figure 1-3. The best visualizations combine information in the user’s head with system-accessible data.
Fairly regularly, someone shows up at one of our offices with a dataset; they want us to help them make sense of their data. Our first step is to consider the fuzziness of the tasks and extent of the digital data in order to determine whether we should begin the process of designing a visualization, or instead throw the data into some statistical software. More often than not, the problems we see benefit in some way from an interactive visualization system.

We’ve learned over the years that designing effective visualizations to make sense of data is not an art - it is a systematic and repeatable process. This book is an attempt to articulate the general set of techniques we use to create insightful visualizations.
Creating a Good Visualization
Choosing or designing a good visualization is rarely a straightforward process. It is tempting to believe that there is one beautiful visualization which will show all the critical aspects of a dataset, that the right visual representation will open the secrets and reveal all. This is often the impression that we, at least, are left with after reading case studies in data science books. A perfect, simple, and elegant visualization — perhaps just a bar chart, or a well-chosen scatterplot — shows precisely what the important variable was, and how it varied in precisely the way that taught a critical lesson.
In our experience, this does not really match reality. It takes hard work, and trial and error, to get to an insightful visualization. We break apart fuzzy questions into actionable, concrete tasks, and we have to reshape and restructure the data into a form that can be worked into the visualization. We have to work around limitations in the data, and we need to try to understand just what the user wants to learn. We have to consider which visual representations to use and what interaction mechanisms to support. And no single visualization is ever quite able to show all of the important aspects of our data at once - there just are not enough visual encoding channels.

We suspect that your situation looks something like this too.
Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help a user learn about parts of their data that they don’t know about. On the other hand, the more we know about the user’s needs, and about the context of their data, the better a visualization can serve the user. In this book, we embrace this paradox: we attempt to weave through the knowledge users do have of their datasets, of the context that the data lives in and the ways it was collected — including its likely flaws, challenges, and errors — in order to figure out the aspects of it that matter.
Figure 1-4. The path from ill-formed problem & dataset to successful visualization.
Put another way, this book is about the path from “I have some data…” to “Look at my clear, concise, and insightful visualization.” We believe that creating effective visualizations is, itself, a process of exploration and discovery. A good visualization design requires a deep understanding of your problem, data, and users. In this book, we lay out a process for acquiring this knowledge and using it to design effective visualization tools.
Who are we?
The authors of this book have a combined three decades of experience in making sense of data through designing and using visualizations. We’ve worked with data from a broad range of fields: biology and urban transportation, business intelligence and scientific visualization, debugging code and building maps. We’ve worked with teams of analysts spanning small, academic science labs to teams of data analysts embedded in large companies. Some of the projects we’ve worked on result in sophisticated, bespoke visualization systems designed collaboratively with other analysts, and other times we’ve pointed people to off-the-shelf visualization tools after a few conversations. All in all, we’ve thought about how to visualize hundreds of data sets.
We’ve found that our knowledge about visualization techniques, solutions, and systems shapes the way that we think and reason about data. Visualization, fundamentally, is about presenting data in a way that elicits human reasoning, that makes room for individual interpretations, and supports exploration. Because of this, we work with our collaborators to operationalize their questions and data in a way that reflects these characteristics. The process we lay out in this book describes our thinking and inquiry in these terms.
Who is this book for?
This book is for people who have access to data and, perhaps, a suite of computational tools, but are less than sure how to turn that data into visual insight. If you’ve found that data science books too casually assume that you can figure out what to do with the data once collected, and that visualization books too casually assume that you can figure out what dimensions of the data you need to explore, then this book is for you.
We’re not going to teach you in detail how to clean data, manage data, or write visualization code: there are already great books written about these topics, and we’ll point you to some of them. (We will talk about why those processes are important, though.) You will not come out of this book being able to choose a beautiful colormap or select a typeface — again, we will point to resources as appropriate. Instead, we will lay out a framework for how to think about data given the possibilities, and constraints, of visual exploration.
We’ll walk through a process that we call data counseling, a set of iterative steps that are meant to elicit a wide range of perspectives on, and information about, a data problem. The goal of data counseling is to get to an operationalization of the data that is amenable to a visualization solution. This solution may be a series of charts created during the process as you explore the data, or it could be an off-the-shelf, interactive visualization tool that you use after you’ve operationalized your data. And in some cases, the solution will be a bespoke visualization tool that you’ll create because your unique problem requires a unique solution.
There are four components to a good operationalization:
Regardless of the visualization outcome, a person going through the data counseling process will make new discoveries and gain new insights along the way. We believe that effective visualization design is about a deep investigation into sense making.
A Note on the History of Data Counseling
Miriah and Danyel jointly, and independently, described this process; we’re sure that many other researchers carry out similar processes. One of us jokingly calls it “data psychotherapy.” (The other, more reasonably, named it “data counseling.”) It starts, not uncommonly, when people walk into our office:
CLIENT: I have some data that I’d like to visualize.
Q: What about the data would you like to visualize?
CLIENT: I think the data would show me how profitable our stores are.
Q: What does it mean for a store to be profitable?
CLIENT: It means that the store has lots of sales of high-profit items.
Q: Why does profit vary by store?
…
And so on. By the end of this process, we would often find that the user had described the dimensions they found most important — the outcome measure (profit); the relevant dimensions upon which it might vary (which store, which item); and so on. The key step, however, was stepping away from the data to ask what end the user truly wanted to accomplish — “to persuade my boss to increase my department’s funding”, or “to find out whether our users are happy”, or “to change the mix of products we sell”. Once we’d articulated these questions, finding an appropriate visualization became much easier.
In this book, we systematize this process into what we hope are reproducible and clear steps.
The rest of this book
In Chapter 2, we describe the Operationalization Tree. The Tree is the core technique that gets us from high-level user needs down to specific, actionable questions. We’ll discuss how to narrow a question from a broad task into something that can be addressed with a sequence of visualizations. For example, the broad question “how do users use our maps?” does not necessarily suggest a specific visualization — but “what places do users look at on our maps?” leads very clearly to a visualization like Hotmap.
In Chapter 4, we’ll translate from the high-level concepts to low-level visualization components. We will discuss concepts like dimensions and measures, and how to identify them in your data. We’ll talk about the broad set of tasks that can be carried out with a visualization, and we’ll connect those to tasks that we identified in Chapter 2.
An operationalization is not born, fully formed, from the skull of a visualization expert: the data has a history and pedigree; people have created and collected it. In Chapter 3, we lay out an iterative set of steps for getting to an operationalization, which we call “data counseling”. This process is about working with data owners and other stakeholders to delve deep into an analysis problem, uncovering relationships and insights that are difficult to articulate, and then using that knowledge to build an effective operationalization. The process describes the kinds of questions to ask, who to ask them of, and how to rapidly explore the data through increasingly sophisticated prototypes.
With this technique for operationalizing, and for collecting information from interviewees, in mind, we turn to the visualizations themselves. In ???, we’ll discuss the core types of visualizations. We’ll start with the familiar, such as bar charts, scatter plots, and timelines, and move on to some less well known variants. For each class, we will describe the types of data that fit on them, what sorts of tasks they can address, and how they can be enhanced with additional dimensions.
Often, more than one visualization may be necessary to examine a complex, real-world dataset. Infographics and dashboards commonly show several different visualizations; we can apply interactive techniques to build richer connections. ??? talks about multiple linked view visualizations. These linked views employ individual visualizations, tied together through user interaction, to support a very rich and complex set of tasks and data constraints.
For example, overview+detail can be a good solution to visualize lots of data, but it requires a good way to meaningfully summarize and aggregate the data. A complex data set with many different attributes might suggest a multiform visualization, which allows the users to examine the attributes contrasted against each other in pairs or triads, linked across different views. Chapters 4 and 5 together form the core knowledge necessary to know what kinds of visualization solutions are possible.
With this understanding of creating a visualization — from data to visualization — we might consider declaring victory and going home. The remainder of the book gives us tools for carrying out these steps.
In ???, we present two case studies that focus on how we applied the data counseling process to real-world problems. These problems illustrate the flexibility of the process, as well as the diverse types of outcomes that are possible.
??? addresses the design process. We discuss design iteration and rapid prototyping, and we discuss some of the tools we use for deciding how well a visualization suits user needs. We discuss considerations that we’ve found meaningful for creating effective tools: the role of aesthetics; the difference between exploratory and explanatory visualizations; and the value of bespoke visualizations.
??? discusses shaping and reshaping data, data cleaning, and tools: those that are intended for reshaping data into the shape we need, and then tools for visualizing data. The latter will range from tools oriented toward programmers (those implemented over Java, JavaScript, and Python) through those oriented toward data scientists and data users, such as R, Tableau, and even Excel. As we will see, there are many tradeoffs to these different tools; some are excellent in one context, but cannot fulfill needs in another.
??? touches on some challenges of encountering data in the real world: collecting, shaping, and manipulating it.
There is a lot that will not be covered in this book, such as the perceptual aspects of visualization, human factors components of interfaces, or how to use a variety of visualization toolkits. We do, however, include references to these types of issues along the way.
We also provide a GitHub site, http://shouldhaveaname.github.com, where a reader can download the code to regenerate many of the book’s figures. We’re not claiming these are the right implementations — or even particularly good code — but we feel the reader should be able to use this as an opportunity to see what it takes to carry out these operations.
CHAPTER 2
Operationalization, from questions to data
In this chapter we look at how to turn data, and a question, into more meaningful tasks. More specifically, we discuss the notion of operationalization, the process of refining high-level questions into problem-specific tasks over the data. The operationalization of a problem provides concise design requirements for a visualization tool that can support finding answers to those questions.
The concept of operationalization appears across data science: the idea of transforming user questions into data-driven results can be found in dozens of references. Most commonly, we hear only about the successful design and data choices: a chart of the perfect dimensions that throws a phenomenon into perfect focus. This is also common in popular press retellings of data science - “when the data scientists analyzed shoppers’ check-out data, they realized that people who bought soda often bought nothing else.” This is, however, only half the story. What inspired the analysts to look at soda sales? Were they looking at the shopping carts of people who bought just one thing, or the co-purchasing behavior around soda, or did they look at a dozen other products before noticing this? Data analysts often begin with different questions and goals than where they end up, and these questions are often underspecified or highly abstract. “Do our users find this game fun to play?” “Which articles on our site are selling ads?” “Which donors should we keep track of for outreach in the next year?” The process of breaking down these questions into something that can actually be computed over the data is iterative, exploratory, and sometimes surprising. Operationalization is the process of making decisions that lead to an answer.
What makes operationalization for data visualization different? In many fields, operationalization is a process of reducing a problem to a single metric, or a small number of metrics, and attempting to optimize that metric. Here, we read operationalization more broadly: we are not trying to merely identify a single metric, but instead to choose the techniques that allow the analyst to get a usable answer. Visualization, though, is not an inevitable outcome: as we explore the data, we might realize that our goal is best answered with a machine learning algorithm, or a statistical analysis.

Visualization has the unique feature of allowing users to explore many interrelated questions, and to get to know how data looks from different perspectives. Complex, vague tasks require looking at a number of different dimensions to understand what they mean, and how they interact. We can ask how a variety of metrics relate to different parts of the data. Visualization allows us to explore data in all of these ways. It is useful to start with a motivating problem, which we will use to ground our discussion throughout the rest of the chapter.
Example: Understanding the Design of a Transit System
Our example looks at the design of a public transit system. Or, more specifically, the questions that a geography colleague has about the effects of the system design on the local community of residents. We’ll follow this example in order to better understand how to operationalize a complex question, and look at several different paths toward making sense of it.
Trang 211 Our thanks to graduate students Josh Dawson and Sean McKenna, who have been working with us on this collaboration.
We collaborated with Dr. Steve Farber, a geographer interested in this characterization who is studying the Utah Transit Authority (UTA), the public transit system that services Salt Lake City and the surrounding areas.[1] Steve’s research focuses on a core concern of transit design: “how do the tradeoffs between removing cars, versus servicing those who rely on transit, play out?”
This is a well-known trade-off within public transit design: there are very different priorities to taking cars off the road versus servicing economically disadvantaged residents who rely on transit to get around, and they require very different implementations. If a system is designed to take cars off the road, it would be as efficient as possible when going between the most popular points at the busiest times, making the transit system competitive with cars. If the goal is to serve people without cars, however, it would need to adequately — if never highly efficiently — serve low-income neighborhoods for a broad set of times throughout the day. Furthermore, not only do transit designers need to optimize against these competing needs, but they also have to design around legacy routes, road designs, and political influences.

Due to the challenges inherent in designing a public transit system, it is important to be able to characterize how design decisions affect the core efficacy of the system in order to steer improvements and refinements for the better.
This question is, as phrased, poorly defined. There is no single source of data that labels the tradeoffs explicitly; the planners themselves most likely never addressed the question directly. Furthermore, there are many different ways we might imagine trying to answer it. Can we look at ridership data? Can we survey residents? Could we just interview the transit designers themselves?
Our goal of operationalization is to refine and clarify this question until we can forge an explicit link between the data that we can find and the questions we’d like to answer. As a first step, we asked our collaborator: what data is actually available?

Steve computed a time cube based on the UTA time tables that stores, for every pair of locations in the Salt Lake Valley and for each minute of the day, the time it takes to travel on existing transit routes.[2] The cube was generated using a sophisticated algorithm that considers not only the fastest transit option between two locations, but also walking and waiting times at pick-up stops. Thus, the cube can tell us that it takes 28 minutes to get from location A to location B at 5:01 am, but at 4:03 pm it takes 35 minutes. There is one cube for the weekdays, and one for weekend schedules.

[2] cite paper
[3] http://www.census.gov/hhes/commuting/
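A travel-time cube of this kind can be thought of as a lookup from (origin, destination, departure minute) to minutes of travel. As a sketch — the names, values, and two-location scale are our invention, not the actual UTA data structure — a NumPy array indexed that way supports exactly the kind of query described above:

```python
import numpy as np

# Hypothetical cube: travel_time[origin, destination, minute_of_day],
# here for just 2 locations over the 1440 minutes of a weekday.
n_locations, minutes_per_day = 2, 1440
weekday_cube = np.full((n_locations, n_locations, minutes_per_day), 20.0)

A, B = 0, 1
weekday_cube[A, B, 5 * 60 + 1] = 28.0   # 5:01 am: 28 minutes from A to B
weekday_cube[A, B, 16 * 60 + 3] = 35.0  # 4:03 pm: 35 minutes

def travel_time(cube, origin, dest, hour, minute):
    """Minutes to travel on transit, leaving at the given clock time."""
    return cube[origin, dest, hour * 60 + minute]

print(travel_time(weekday_cube, A, B, 5, 1))   # 28.0
print(travel_time(weekday_cube, A, B, 16, 3))  # 35.0
```

A real cube over every block pair in the Salt Lake Valley would be far too large to fill densely like this, but the indexing scheme is the same.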
Additionally, he collected a number of government census datasets that characterize the neighborhoods in and around Salt Lake City, and the people who live there. The travel cube shows how long it takes to go between places; the census data helps us understand how many people go between these pairs of places, and how often — and perhaps what sorts of places they are. It allows us to ask which districts have the most wealthy or poor people; it allows us to ask what places tend to be the origins or destinations of trips, and so to characterize areas as job hubs or residential areas. Along with demographic information of the people in each neighborhood, the census data also tracks,[3] for pairs of neighborhoods, the income distribution for the people who commute between them for work.
Our collaborator computed the travel times for each block in the region. The travel cube allows us to ask questions like “how long does it take to get between block A and block B at a given time?” The census data provides a much richer analysis. While the two datasets are at different granularities, combining them might allow us to ask questions like “for each district, how long does it take the people in the highest income bracket to get to work by transit?”

Now that we have data and a high-level question, our visualization work begins. Data alone is not enough to dictate a set of design requirements for constructing a visualization. What is missing here is a translation of the high-level question — understanding the trade-offs in the transit system — into a set of concrete tasks that we can perform over the data. And that’s where operationalization comes in. We’ll dig further into this example after describing a construct for guiding the translation: the operationalization tree.
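To make the combining step concrete, here is a small pandas sketch of the "transit time to work, by district and income bracket" question. All column names and values are invented for illustration; the real census and cube data look nothing this tidy:

```python
import pandas as pd

# Invented sample data: census-style commute flows between blocks,
# tagged with income bracket, plus cube-derived transit times.
flows = pd.DataFrame({
    "home_district": ["D1", "D1", "D2", "D2"],
    "home_block": [101, 102, 201, 202],
    "work_block": [900, 901, 900, 901],
    "income_bracket": ["high", "low", "high", "high"],
    "commuters": [40, 25, 30, 55],
})
transit_minutes = pd.DataFrame({
    "home_block": [101, 102, 201, 202],
    "work_block": [900, 901, 900, 901],
    "minutes": [28.0, 45.0, 33.0, 52.0],  # morning departure, from the cube
})

# Join flows to travel times, keep the highest bracket, and take a
# commuter-weighted average per home district.
merged = flows.merge(transit_minutes, on=["home_block", "work_block"])
high = merged[merged["income_bracket"] == "high"].copy()
high["wmin"] = high["minutes"] * high["commuters"]
per_district = high.groupby("home_district")[["wmin", "commuters"]].sum()
per_district["avg_minutes"] = per_district["wmin"] / per_district["commuters"]

print(per_district["avg_minutes"])  # D1: 28.0, D2: ~45.3
```

The weighting by commuter count is one of the many judgment calls an operationalization has to make explicit.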
Trang 23Before continuing, though, it is worth noting that the data and the
operationalization are fundamentally a specific perspective on a
problem: they are proxies for what we are trying to understand Inthis UTA example there are other ways that our collaborator couldhave framed his inquiry, and other types of data he could have col‐lected This is a large part of why visualization is so important foranswering questions like these as it allows an analyst’s experienceand knowledge to layer directly on top of the actual data that is ulti‐mately shown
The Operationalization Tree
The core process of operationalization is the route from a general goal or a broad question, to specific questions, and to visualizations based on concrete data. We begin with a broad question that describes a research interest, or a business goal, or that orients a data exploration. We go through a series of stages meant to refine the question, based on knowledge of the problem, the needs of stakeholders, what data is available (or can be collected), and the way the final audience will consume it.
Carrying out this transformation requires collaboration with stakeholders: to learn what data is available, and how the results will be used. Interviews help us identify the questions and goals of the stakeholders with respect to the data, and understand what data is available, or can be made available. Throughout the transformation we use operationalization to translate those questions and goals into a description of the problem that is amenable to a data solution. We’ll talk more specifically about collaboration techniques - specifically interviewing and prototyping - in Chapter 3.
The operationalization tree is a construct that represents a process of refining a broad question into a set of specific tasks that can be performed over the data. The root of the tree is the high-level question that the stakeholder wishes to answer; the internal levels represent mid-level tasks that describe goals using the language of the problem space; and the leaves represent specific tasks that can be performed over specific data measures, often utilizing a visualization.
A data analyst constructs the tree from the root, exploring both depth and breadth. The construction of the tree represents the continual refinement of tasks into computable chunks. Once leaf nodes are defined and tasks resolved, the solutions are propagated back up the tree to support answering higher level tasks.
Figure 2-1. Recursive representation of the operationalization tree. The question is rephrased as one or more tasks; each task in turn is separated into an action, several objects of the action, and a descriptor.
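The recursive structure in Figure 2-1 can be sketched as a simple data structure. This is our own illustration, not part of the book’s formalism; all field names here are invented for the sketch:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskNode:
    """One node in an operationalization tree: a question refined into a task."""
    question: str
    action: str = ""                      # e.g. "compare", "characterize"
    objects: list = field(default_factory=list)
    descriptor: str = ""
    partition: Optional[str] = None       # not every task has a partition
    children: list = field(default_factory=list)  # refinements of ambiguous parts
    result: object = None                 # filled in once the task is resolved

    def is_leaf(self):
        # A leaf is a task with no further refinements pending.
        return not self.children

def propagate(node: TaskNode):
    """Resolve children first, then let their results inform this node."""
    for child in node.children:
        propagate(child)
    if not node.is_leaf():
        node.result = [child.result for child in node.children]
    return node.result
```

The `propagate` function mirrors the chapter’s flow: leaves are solved first, and their results bubble upward to support the higher-level tasks.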
Building an operationalization tree begins with a high-level question or goal for the project. The general question might be a research goal, or a general question about user behavior, or a specific aspect we wish to improve or change. In the UTA scenario, the question we begin with is “How do the tradeoffs between removing cars, versus servicing those who rely on transit, play out in the UTA system?” From there, we go through the following steps to build the tree:
1. Refine the question into one or more tasks that, individually or together, address the general question.
   • If the task is unambiguous and we can figure out what visualization, background knowledge, or computation will address it, we do so.
   • If the task is ambiguous, break it down into four components - actions, objects, descriptors, and partitions - looking for undefined terms and ambiguous phrases.
2. Define the objects, descriptors, and partitions by creating a new question that addresses each one, and return to step 1 with those questions.
3. Lastly, once tasks have been addressed, propagate the results back up to support higher level tasks.
The root question is the most difficult one to translate into a task. This translation in particular relies on the data counseling process, as well as on a detailed understanding of what data exists and is available. We discuss this further in Chapter 3.
After selecting a task, particularly one that is abstract, fuzzy, or ambiguous, the next step is to identify the four components of that task. We use these components as a guide to finding the more specific questions, and then tasks, that will provide the next step downward in the tree:
• Actions: Actions are the words that articulate the specific thing being done with the data, such as compare, identify, or characterize. Actions are helpful for identifying the other components, and can help choose visualizations.
• Objects: Objects are things that exist in the world; these are the items which respond to the action. “A neighborhood,” or “a store,” or “a user” are all objects.
• Descriptors: The value that will be measured for the objects. “Effectiveness of the transit system,” or “happiness of a user,” or “sales of a store” are all descriptors.
• Partitions: Logical groups of the objects. “The western vs. eastern region of stores,” or “players, divided by the day they started playing,” or “players, partitioned by whether they have bought an upgrade.”
Every task will have an action, and this verb is useful for identifying the other components. Take this task: “Compare the amount of money spent in-game by players who play more hours to those who play fewer hours.” Here, the action is compare, which is useful for determining the object. The objects in this task are the things we want to compare; we want to compare players. But what is it about players we want to compare? That is the descriptor, which in this example is money spent. Finally, there is a specific partitioning of the objects. We don’t just want to compare all players, we specifically want to compare two groups - those that play many hours and those that play few hours.
Example 2-1. Exemplar Task for a Game

Task: Compare the amount of money spent in-game by players who play more hours to those who play fewer hours.
Action: compare
Object: players
Descriptor: money spent
Partition: players who play many hours; players who play few hours
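Example 2-1’s breakdown maps directly onto a computation. The sketch below is our own illustration of how the four components drive a comparison, using hypothetical player records and a placeholder threshold for “many” hours:

```python
# Hypothetical player records: hours played and in-game money spent.
players = [
    {"id": 1, "hours": 120, "spent": 40.0},
    {"id": 2, "hours": 3,   "spent": 0.0},
    {"id": 3, "hours": 95,  "spent": 25.0},
    {"id": 4, "hours": 8,   "spent": 5.0},
]

MANY_HOURS = 50  # placeholder; choosing this value is itself a subquestion

# Partition: players who play many hours vs. players who play few hours.
many = [p for p in players if p["hours"] >= MANY_HOURS]
few = [p for p in players if p["hours"] < MANY_HOURS]

# Descriptor: money spent. Action: compare (here, by each group's mean).
def mean_spent(group):
    return sum(p["spent"] for p in group) / len(group)

comparison = {"many": mean_spent(many), "few": mean_spent(few)}
```

Note that the ambiguous component, the `MANY_HOURS` threshold, is exactly the part that spawns the subquestion in the text that follows.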
Given this breakdown of the task we can now figure out where we need to further refine our descriptions. We do this by considering the question “Are the object, descriptor, and partition each directly linked to the data?” For each of the three, do we know specifically which aspect of the data it represents, or how to derive it from the data? If not, we formulate a subquestion in order to derive a more specific answer. In Example 2-1, the partition divides between “many” and “few” hours. We will need to divide further, so we ask a new question: “In our game, how many is ‘many’ hours for a player?”
Not all tasks will have a partition. Sometimes the task is meant to occur over the full set of data specified by the objects and descriptors. We’ll see some examples of this when we discuss the operationalization tree for the UTA example below.
The Leaves of the Tree
Just how far does the tree go down? There’s no specific answer to this question: the answer really is, “far enough.” We know we’ve made it all the way down one branch of the tree when the task is directly actionable, using data at hand. We know how to describe the objects, descriptors, and (optional) partitions in terms of the data - where to find it, how to compute it, and how to aggregate it. At the leaf of the tree, we finally hit questions like: “What ARE the top ten most-populated census blocks?” “What products DO sell the most across our stores?” We know what the question will look like, and we know what we can do to get the answer.
Low-level objects can be interpreted from the data. It may be that we can read the data directly off the table, but it may be more indirect: we may need to carry out transformations on it, whether mathematical transformations or database joins. However, by the time we get to a leaf, the definitions should be unambiguous. The leaf-level objects directly describe items in the dataset. Similarly, partitions at the leaf level will specifically describe a dimension of the data over which a logical set of objects can be created.
Descriptors will turn into measures and metrics: a good descriptor is a concrete number. On the way to the leaf, we have used proxies: we can’t directly quantify a “convenient transit schedule,” but we can estimate “a bus comes at least every fifteen minutes.” In the leaf node, the descriptor is a precise, measurable dimension.
To solve a leaf, then, is to answer the question - whether as a number, or as a visualization, or even as an interaction. We might decide that the right answer to “many hours” of gameplay is “six hours” (a number), or “the hours played by the top 10% of players” (a formula), or “above the logical breakpoint,” which might be represented by a distribution. We can now propagate the leaf’s results back up.
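The “top 10% of players” reading of “many hours,” for instance, can be computed directly. This is our sketch, not the book’s code, and the data is invented:

```python
# Hours played per player (hypothetical data).
hours = [1, 2, 2, 3, 5, 6, 8, 10, 40, 120]

def top_fraction_threshold(values, fraction=0.10):
    """Return the cutoff at or above which the top `fraction` of values fall."""
    ordered = sorted(values)
    cutoff_index = int(len(ordered) * (1 - fraction))
    return ordered[min(cutoff_index, len(ordered) - 1)]

# Players at or above this threshold count as playing "many" hours.
threshold = top_fraction_threshold(hours)
```

Solving this leaf turns the fuzzy partition “many vs. few hours” into a concrete, data-derived number that can flow back up the tree.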
Flowing Results Back Upwards
When a low-level task is completed, we have a refined, specific result. We can list the top most-populous counties, or the distribution of hours that players have spent on the game, or the number of minutes that it takes to get between two places, by hour of the day.
We now propagate that task upwards. We had collected this highly refined data for a reason: we were fulfilling a more abstractly-defined task. This low-level fact is one piece of the task above. We may have defined what an object will be, or confirmed that a descriptor would be a measurable and meaningful choice on this dataset, or refined a task verb.
Sometimes, though, we don’t have an exact answer and instead we have a visualization that helps an analyst in making a decision, a decision that might be different under different conditions in the context of other data. As such, we will sometimes propagate up a visualization, or even an interactive tool, that allows the decision-making to happen at a higher level of the tree.
Through refinement of the objects, descriptors, and partitions we eventually propagate up answers to the task at hand: and thus, we build a higher-level visualization. The visualization design requirements are built from the propagated results from the lower levels of the tree. For example, we might decide to allow a high-level user to select between several different descriptors.
Applying the Tree to the UTA Scenario
Going back to the UTA example, we left off with the question of “How do the tradeoffs between removing cars, versus servicing those who rely on transit, play out in the UTA system?” While there are several different ways to address this question, such as surveying riders or interviewing the system designers, Steve’s approach entails analyzing the travel times inherent to the transit system. The high-level question cannot be addressed directly with the data, and so needs to be refined. We asked Steve what would answer this question. Steve clarified that he sees the choice as a tradeoff: comparing the effectiveness of the transit system for removing cars versus supporting people who rely on transit for their transportation. This is the root of our operationalization tree.
• Question: How do the tradeoffs between removing cars, versus servicing those who rely on transit, play out in the UTA system?
• Task: Compare the effectiveness of the transit system for removing cars versus supporting people who rely on transit for their transportation.
From this task we identify the four task components in order to help guide our operationalization refinement:
• Action: compare
• Objects: removing cars; people who rely on transit
• Descriptor: effectiveness of the transit system
• Partition: (none)
Now we can use these components to help refine the ambiguity in this task. We start with the objects, and ask “Are the objects directly linked to the data?” The answer is no; we do not have anything in our data that specifies people’s decisions about cars and driving. Instead, what we do have from the census data is information about the salaries of people living in the different census blocks, as well as the blocks’ populations and some of the most popular commutes.
We want to find a proxy that will get us, at one end of the scale, cars that can be replaced by transit; at the other end, people who are dependent on transit. One reasonable choice is to decide that high-income earners are more likely to be able to afford a car, while low-income earners are less likely to. We decide to use income as a reasonable proxy for car ownership.
This leads us to a new, lower level node in the operationalization tree where we refine the objects.
• Question: How do we define removing cars and people without cars?
• Partition: high-income; low-income
Again, we look at the object and ask whether it is directly linked to the data - here, this object IS clearly defined. Census blocks are the items contained in the census datasets. We move on to the descriptor and ask if it is directly linked to the data. The answer is no; we don’t have any direct measure of income for census blocks. This leads us to create yet another, even lower level node in the operationalization tree.
• Question: What does it mean for a census block to be high- or low-income?
What we do have from census data is the number of people in each census block whose salaries fall within one of three brackets: people making less than $1250 a month, people making between $1250 and $3333 a month, and people making more than $3333 a month. We can try different choices here: what we can do is compute, for each census block, the ratio of the number of workers in the highest bracket to the total number of workers, along with the ratio for the lowest bracket.
• Task: Characterize census blocks by the ratio of the number of people at different salary levels to the total population.
• Action: characterize
• Objects: census blocks
• Descriptor: proportion of people to the population
• Partition: salary bracket
The objects are, once again, directly linked to the data; the descriptor and partition are directly computable from the data. Thus, we have reached a leaf node and can solve the task. In this case, we compute, for each census block, the ratios of the numbers of workers in each salary bracket to the total number of workers in the block. We have created a new dimension in the data set.
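Under one plausible table layout (a block identifier plus worker counts per salary bracket; these column names are our assumptions, not the actual census schema), solving this leaf amounts to a row-wise ratio:

```python
# Hypothetical census rows: workers per monthly-salary bracket, per block.
blocks = [
    {"block": "A", "low": 120, "mid": 300, "high": 80},
    {"block": "B", "low": 40,  "mid": 100, "high": 360},
]

def add_bracket_ratios(rows):
    """Derive, per block, each bracket's share of total workers (new columns)."""
    for row in rows:
        total = row["low"] + row["mid"] + row["high"]
        row["low_ratio"] = row["low"] / total
        row["high_ratio"] = row["high"] / total
    return rows

add_bracket_ratios(blocks)
```

The two derived columns are the “new dimension in the data set”: each block now carries a measure of how low- or high-income it is relative to its own worker count.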
Figure 2-2. Distributions of the number of residents within the three salary brackets. (a) People making less than $1250 per month. (b) People making between $1250 and $3333 a month. (c) People making more than $3333 a month.
Having solved the leaf node we can now flow the result back up the tree. That solution substitutes for the object “taking cars off the road.” In this manner, the tree helps structure the operationalization as both a guide for where to refine, as well as a bookkeeping mechanism for aggregating proxy solutions to support a high-level question.
As we bubble up, we can check on these assumptions. We might have mistakenly chosen blocks that have low population - and so do not take many cars off the road. Or we might have chosen blocks that have very few commuters from them. We can generate maps to check these.
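One way to run those checks before mapping is a simple filter. The thresholds and field names below are invented for illustration; in practice they would come from the census data and from judgment calls made with the stakeholder:

```python
# Hypothetical blocks with derived ratios plus population and commuter counts.
blocks = [
    {"block": "A", "high_ratio": 0.70, "population": 4200, "commuters": 1900},
    {"block": "B", "high_ratio": 0.65, "population": 35,   "commuters": 4},
]

MIN_POPULATION = 100  # placeholder: too few residents to matter for car removal
MIN_COMMUTERS = 50    # placeholder: too few commutes to affect traffic

def suspicious(block):
    """Flag blocks that cannot take many cars off the road despite high income."""
    return (block["population"] < MIN_POPULATION
            or block["commuters"] < MIN_COMMUTERS)

# Candidates to re-examine before drawing conclusions from the map.
flagged = [b["block"] for b in blocks if suspicious(b)]
```

Flagged blocks are exactly the ones whose coloring on a choropleth like Figure 2-3 could mislead: a deep-blue block with almost no residents removes almost no cars.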
Figure 2-3. An example choropleth showing the census blocks with a larger percentage of higher income residents in blue, and larger percentage of lower income residents in red. Several purple blocks indicate an equal number of high and low income residents.
These questions force us to dive deeper. Looking at these images side by side, it is less clear what we mean by “rich” or “poor” neighborhoods