Jessica Roper
A Guide to Improving Data
Integrity and Adoption
A Case Study in Verifying Usage Data
Beijing · Boston · Farnham · Sebastopol · Tokyo
A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

December 2016: First Edition
Revision History for the First Edition
2016-12-12: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
A Guide to Improving Data Integrity and Adoption
    Validating Data Integrity as an Integral Part of Business
    Using the Case Study as a Guide
    An Overview of the Usage Data Project
    Getting Started with Data
    Managing Layers of Data
    Performing Additional Transformation and Formatting
    Starting with Smaller Datasets
    Determining Acceptable Error Rates
    Creating Work Groups
    Reassessing the Value of Data Over Time
    Checking the System for Internal Consistency
    Verifying Accuracy of Transformations and Aggregation Reports
    Allowing for Tests to Evolve
    Implementing Automation
    Conclusion
    Further Reading
A Guide to Improving Data Integrity and Adoption
In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take not only to ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?
Validating Data Integrity as an Integral Part of Business
Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and parse. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as “bad ass,” which is a positive phrase made with negative words. The system creating the data can also make it messy, because different languages have different expectations for design, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process them, such as YAML (YAML Ain’t Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.
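As a concrete sketch of that difficulty, the snippet below is illustrative only: the table name, column name, and preference key are hypothetical, and it assumes the application serialized a hash to a YAML string in a single text column. Because the database only sees a string, each row has to be deserialized in application code before it can be filtered.

    import sqlite3
    import yaml  # PyYAML; assumes the serialized column holds YAML text

    # Hypothetical table: user_settings(id INTEGER, preferences TEXT), where
    # `preferences` is a YAML hash serialized by the application framework.
    conn = sqlite3.connect("app.db")
    rows = conn.execute("SELECT id, preferences FROM user_settings")

    # Filtering on a single key has to happen after deserializing each row,
    # rather than in a simple WHERE clause.
    opted_in = []
    for user_id, raw_yaml in rows:
        prefs = yaml.safe_load(raw_yaml) or {}
        if prefs.get("email_opt_in") is True:
            opted_in.append(user_id)

    print(f"{len(opted_in)} users opted in to email")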
Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential. For example, the dataset in Table 1-1 is self-referential, wherein each row has a parent ID representing the type or category of the row. The value of the parent ID refers to the ID column of the same table.

In Table 1-1, all information around a “User Profile” is stored in the same table, including labels for profile values, resulting in some values representing labels, whereas others represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,” part of the “User Profile,” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on. I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all “profile-like” things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.
Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

    ID    Parent ID    Value
    9     NULL         User Profile
    11    9            Country
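One way to make a table like this easier to work with is to resolve each row’s chain of parent IDs into a readable path. The sketch below is only illustrative: the rows are assumed to already be loaded into memory, and Mexico’s own ID (12) is invented for the example, since only the IDs for “User Profile” (9) and “Country” (11) appear above.

    # Illustrative sketch: resolve a self-referential "profile" table into paths.
    # The ID for Mexico (12) is hypothetical; 9 and 11 follow the text above.
    rows = {
        9:  {"parent_id": None, "value": "User Profile"},
        11: {"parent_id": 9,    "value": "Country"},
        12: {"parent_id": 11,   "value": "Mexico"},
    }

    def full_path(row_id):
        """Walk parent IDs up to the root, e.g. 'User Profile > Country > Mexico'."""
        parts = []
        current = row_id
        while current is not None:
            parts.append(rows[current]["value"])
            current = rows[current]["parent_id"]
        return " > ".join(reversed(parts))

    print(full_path(12))  # User Profile > Country > Mexico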
Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company’s success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. To affect data priority and quality will likely require some work to make the data more usable and reportable and will almost certainly require working with others within the organization.
Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data project at Spiceworks, where I’ve worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to do research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more.
Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers. My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)
This case study demonstrates Spiceworks’ process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I’ll provide some quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.
An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks’ products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.
The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as when to display ads or send emails). When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: “How many total active users do we have across all of our products?” and “How many users are in each product?” It wasn’t necessary to understand how each product’s data worked. We also needed to be able to do analysis on cross-product adoption and usage.
The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).
When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created on the data collected. After we made the switch to this new way of collecting and managing data from the start—which was automatic and easy—more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.
Getting Started with Data
Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data needed was available in logs from Spiceworks’ advertising service (see Figure 1-1), which is used to identify the target audience that users qualify to be in and, therefore, what set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page and even context changes, such as switching to a new tab, create a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.
As Figure 1-1 demonstrates, each log entry offered a wealth of data from the web request that we scraped for further analysis, including the uniform resource locator (URL) of the page, the user who viewed it, the referrer of the page, the Internet Protocol (IP) address, and, of course, a time stamp to indicate when the page was viewed. We parsed these logs into structured data tables, appended more information (such as geography and other user profile information), and created aggregate data that could provide insights into product usage and cohort analysis.
Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)
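The actual log format isn’t reproduced in this report, so the parser below is only a sketch against a made-up, access-log-style line; the fields it extracts mirror the ones described above (IP address, timestamp, URL, user, and referrer), and unparseable lines are surfaced so broken or missing tracking can be measured rather than silently dropped.

    import re
    from datetime import datetime

    # Hypothetical line layout; the real ad-service log format is not shown here.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "GET (?P<url>\S+)" '
        r'user=(?P<user_id>\S+) referrer="(?P<referrer>[^"]*)"'
    )

    def parse_line(line):
        """Turn one raw log entry into a structured record, or None if it doesn't match."""
        match = LOG_PATTERN.match(line)
        if match is None:
            return None  # count these to see how much of the source can be trusted
        record = match.groupdict()
        record["timestamp"] = datetime.strptime(record["timestamp"], "%d/%b/%Y:%H:%M:%S")
        return record

    sample = '10.0.0.5 [12/Dec/2016:10:15:32] "GET /community/topics/123" user=42 referrer="https://example.com"'
    print(parse_line(sample))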
Managing Layers of Data
There are three layers of data useful to keep in mind, each used differently and with different expectations (Figure 1-2). The first layer is raw, unprocessed data, often produced by an application or external process; for example, some raw data from the usage data study comes from products such as Spiceworks’ cloud helpdesk, where users can manage IT tickets and requests, and our community, which is where users can interact online socially through discussions, product research, and so on. This data is in a format that makes sense for how the application itself works. Most often, it is not easily consumed, nor does it lend itself well to creating reports. For example, in the community, due to the frameworks used, we break apart different components and ideas of users and relationships so that email, subscriptions, demographics and interests, and so forth are all separated into many different components, but for analysis and reporting it’s better to have these different pieces of information all connected. Because this data is in a raw format, it is more likely to be unstructured and/or somewhat random, and sometimes even incomplete.
Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)
The next layer of data is processed and structured following some format, usually created from the raw dataset. At this layer, compression can be used if needed; either way, the final format will be a result of general processing, transformation, and classification. To use and analyze even this structured and processed layer of data still usually requires deep understanding and knowledge and can be a bit more difficult to report on accurately. Deeper understanding is required to work with this dataset because it still includes all of the raw data, complete with outliers and invalid data, but in a formatted and consistent representation with classifications and so on.
The final layer is reportable data that excludes outliers, incomplete data, and unqualified data; it includes only the final classifications, without the raw source for the classification included, allowing for segmentation and further analysis at the business and product levels without confusion. This layer is also usually built from the previous layer, processed and structured data. If needed, other products and processes using this data can further format and standardize it for their individual needs as well as apply further filtering.
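As a minimal sketch of the three layers (the field names, filtering thresholds, and classification rules are all invented for illustration), raw records get consistent types and classifications in the processed layer, and the reportable layer keeps only qualified rows and final classifications:

    # Illustrative only; real raw records, thresholds, and classifications differ.
    raw = [
        {"url": "/community/topics/123", "user_id": "42", "ms": "180"},
        {"url": "/healthcheck",          "user_id": None, "ms": "3"},      # internal noise
        {"url": "/helpdesk/tickets/9",   "user_id": "7",  "ms": "95000"},  # outlier load time
    ]

    def classify(url):
        if url.startswith("/community"):
            return "community"
        if url.startswith("/helpdesk"):
            return "helpdesk"
        return "other"

    # Processed layer: consistent types and classifications, but nothing dropped yet.
    processed = [
        {"product": classify(r["url"]), "user_id": r["user_id"],
         "load_ms": int(r["ms"]), "url": r["url"]}
        for r in raw
    ]

    # Reportable layer: outliers, incomplete rows, and raw source fields removed.
    reportable = [
        {"product": p["product"], "user_id": p["user_id"]}
        for p in processed
        if p["user_id"] is not None and p["load_ms"] < 60_000
    ]

    print(reportable)  # [{'product': 'community', 'user_id': '42'}]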
Performing Additional Transformation and Formatting
The most frequent reasons additional transformation and formatting are needed are to improve performance for the analysis or report being created, to work with analysis tools (which can be quite specific as to how data must be formatted to work well), and to blend data sources together.
An example of a use case in which we added more filtering was to analyze changes in how different products were used and determine what changes had positive long-term effects. This analysis required further filtering to create cohort groups and ensure that the users being observed were in the ideal audiences for observation. Removing users unlikely to engage in a product from analysis helped us to determine what features changed an engaged user’s behavior.
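The sketch below illustrates that kind of cohort filtering; the field names, signup window, and engagement threshold are all hypothetical. Users who showed no baseline engagement before the product change are removed so they don’t dilute the before-and-after comparison.

    from datetime import date

    # Hypothetical user records; in practice these would come from the processed layer.
    users = [
        {"id": 1, "signup": date(2016, 3, 10), "visits_before": 14, "visits_after": 22},
        {"id": 2, "signup": date(2016, 3, 25), "visits_before": 0,  "visits_after": 1},
        {"id": 3, "signup": date(2015, 11, 2), "visits_before": 30, "visits_after": 28},
    ]

    # Cohort: signed up in March 2016 and had at least some engagement beforehand.
    cohort = [
        u for u in users
        if u["signup"].year == 2016 and u["signup"].month == 3
        and u["visits_before"] >= 5
    ]

    before = sum(u["visits_before"] for u in cohort) / len(cohort)
    after = sum(u["visits_after"] for u in cohort) / len(cohort)
    print(f"cohort size={len(cohort)}, avg visits before={before:.1f}, after={after:.1f}")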
In addition, further transformations were required. For example, we used a third-party business intelligence tool to feed in the data to analyze and filter final data results for project managers. One transformation we had to make was to create a summary table that broke out the categorization and summary data needed into columns instead of rows.
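That rows-to-columns reshaping is essentially a pivot. The sketch below shows the idea with pandas as a stand-in for whatever tooling is actually used; the product and category names are made up:

    import pandas as pd

    # Hypothetical summary data in "long" form: one row per product/category pair.
    long_form = pd.DataFrame({
        "product":  ["community", "community", "helpdesk", "helpdesk"],
        "category": ["page_views", "active_users", "page_views", "active_users"],
        "count":    [120_000, 4_500, 80_000, 3_100],
    })

    # Pivot so each category becomes its own column, one row per product,
    # which is the shape the downstream BI tool can consume directly.
    wide_form = (long_form
                 .pivot_table(index="product", columns="category",
                              values="count", aggfunc="sum")
                 .reset_index())
    print(wide_form)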
For a long time, a lot of the processed and compressed data at Spiceworks was developed and formatted in a way that was highly related to the reporting processes that would be consuming the data. This usually would be the final reporting data, but many of the reports created were fairly standard, so we could create a generic way for consumption. Then, each report applied filters and further aggregations on the fly. Over time, as data became more widely used and dynamically analyzed as well as combined with different data sources, these generic tables proved to be difficult to use for digging deeper into the data and using it more broadly.
Frequently, the format could not be used at all, forcing analysts to go back to the raw, unprocessed data, which required a higher level of knowledge about the data if it was to be used at all. If the wrong assumptions were made about the data or if the wrong pieces of data were used (perhaps some that were no longer actively updated), incorrect conclusions might have been drawn. For example, when digging into the structured data parsed from the logs, some of our financial analysts incorrectly assumed that the presence of a user ID (a generic, anonymous user identifier) indicated the user was logged in. However, in some cases we identified the user through other means and included flags to indicate the source of the ID. Because the team did not have a full understanding of these flags or the true meaning of the field they were using, they got wildly different results than other reports tracking only logged-in users, which caused a lot of confusion.
To be able to create new reports from the raw, unprocessed data, we blended additional sources and analyzed the data as a whole. One problem arose from different data sources with different representations of the same entities. Of course, this is not surprising, because each product team needed to have its own idea of users, and usually some sort of profile for those users. Blending the data required creating mappings and relationships among the different datasets, which of course required a deep understanding of those relationships and datasets. Over time, as data consumption and usage grew, we updated, refactored, and reassessed how data is processed and aggregated. Our protocol has evolved over time to fit the needs of our data consumption.
Starting with Smaller Datasets
A few things to keep in mind when you’re validating data include becoming deeply familiar with the data, using small datasets, and testing components in isolation. Beginning with smaller datasets when necessary allows for faster iterations of testing before working on the full dataset. The sample data is a great place to begin digging into what the raw data really looks like to better understand how it needs to be processed and to identify patterns that are considered valid.
When you’re creating smaller datasets to work with, it is important to try to be as random as possible but still ensure that the sample is large enough to be representative of the whole. I usually aim for about 10 percent, but this will vary between datasets. Keep in mind that it’s important to include data over time, from varying geographical locations, and to include data that will be used for filtering, such as demographics. This understanding will define the parameters around the data needed to create tests.
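A minimal sketch of pulling such a sample follows; the grouping keys and row layout are invented, and the 10 percent fraction simply follows the rule of thumb above. Sampling within each month-and-region group keeps the sample spread over time and geography instead of leaving that to chance:

    import random
    from collections import defaultdict

    random.seed(7)  # reproducible sample, so tests behave the same on each run

    def stratified_sample(rows, keys=("month", "region"), fraction=0.10):
        """Sample roughly `fraction` of rows from every (month, region) group."""
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in keys)].append(row)
        sample = []
        for group_rows in groups.values():
            k = max(1, round(len(group_rows) * fraction))  # keep at least one row per group
            sample.extend(random.sample(group_rows, k))
        return sample

    # Hypothetical rows; in practice these would come from the full dataset.
    rows = [{"user_id": i,
             "month": f"2016-{(i % 12) + 1:02d}",
             "region": "NA" if i % 2 == 0 else "EMEA"}
            for i in range(1000)]

    sample = stratified_sample(rows)
    print(f"{len(sample)} of {len(rows)} rows sampled")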
For example, one of Spiceworks’ products identifies computer manufacturer data that is collected in aggregate anonymously and then categorized and grouped for further analysis. This information is originally sourced from devices such as my laptop, which is a MacBook Pro (Retina, 15-inch, mid-2014) (Figure 1-3). Categorizing and grouping the data into a set for all MacBook Pros requires spending time understanding what kind of titles are possible for Apple and MacBook in the dataset by searching through the data for related products. To really understand the data, however, it is important to also understand titles in their raw format to gain some insight into how they are aggregated and changed before being pushed into the dataset that is being categorized and grouped. Therefore, testing a data scrubber requires familiarity with the dataset and the source, if possible, so that you know which patterns and edge case conditions to check for and how you should format the data.
Figure 1-3. Example of laptop manufacturing information
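As an illustrative sketch only (the raw titles and the matching pattern are invented, not Spiceworks’ actual data), a scrubber that groups raw model titles into a “MacBook Pro” bucket relies on patterns learned by inspecting the kinds of titles that actually appear:

    import re

    # Hypothetical raw titles observed while digging through a sample of the data.
    raw_titles = [
        "MacBook Pro (Retina, 15-inch, Mid 2014)",
        "Apple MacBookPro11,3",
        "macbook pro 13",
        "Dell Latitude E7450",
    ]

    # Patterns discovered by inspecting the sample; real data would need many more.
    MACBOOK_PRO = re.compile(r"mac\s*book\s*pro", re.IGNORECASE)

    def categorize(title):
        return "MacBook Pro" if MACBOOK_PRO.search(title) else "Other"

    for title in raw_titles:
        print(f"{title!r:45} -> {categorize(title)}")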
Determining Acceptable Error Rates
It’s important to understand acceptable errors in data. This will vary between datasets but, overall, you want to understand what an acceptable industry standard is and understand the kinds of decisions that are being made with the data to determine the acceptable error rate. The rule of thumb I use is that edge case issues representing less than one percent of the dataset are not worth a lot of time because they will not affect trends or final analysis. However, you should still investigate all issues at some level to ensure that the set affected is indeed small or at least outside the system (e.g., caused by someone removing code that tracks usage because that person believed it “does not do anything”).
In some cases, this error rate is not obtainable or not exact; for example, some information we appended assigned sentiment (positive, negative, neutral) to posts viewed by users in the online forums that are part of Spiceworks’ community.

To determine our acceptable error rate, we researched sentiment analysis as a whole in the industry and found that the average accuracy rate is between 65 and 85 percent. We decided on a goal of a 25 percent error rate for posts with incorrectly assigned sentiment because it kept us in the top half of accuracy levels achieved in the industry.
When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data “ignored” in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.

After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page’s load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be maintained even while backups ran. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and in estimating the impact the errors could have.
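A small sketch of that grouping follows; the flagged records and threshold are hypothetical. Counting problematic rows by day of week and hour makes a recurring spike, like the Monday-evening backup issue above, stand out from random noise:

    from collections import Counter
    from datetime import datetime

    # Hypothetical records already flagged as problematic (e.g., very slow page loads).
    slow_requests = [
        {"timestamp": "2016-11-07 21:05:00", "load_ms": 9400},
        {"timestamp": "2016-11-14 21:10:00", "load_ms": 8800},
        {"timestamp": "2016-11-16 09:30:00", "load_ms": 7200},
        {"timestamp": "2016-11-21 21:02:00", "load_ms": 9900},
    ]

    # Group by (day of week, hour) and count occurrences in each bucket.
    buckets = Counter()
    for r in slow_requests:
        ts = datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S")
        buckets[(ts.strftime("%A"), ts.hour)] += 1

    # A recurring (day, hour) bucket points at a systemic cause rather than noise.
    for (day, hour), count in buckets.most_common():
        print(f"{day} {hour:02d}:00 -> {count} slow requests")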
Examples are key to providing developers, or whoever is doing data processing, with insight into the cause of the problem and providing a solution or way to account for the issue. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.
Creating Work Groups
Achieving data integrity often requires working with other groups or development teams to understand and improve accuracy. In many cases, these other teams must do work to improve the systems. Data integrity and comparability become more important the higher up in an organization that data is used. Generally, lots of people or groups can benefit from good data, but they are not usually the ones who must create and maintain the data so that it stays good [4]. Therefore, the higher the level of support and usage of data (such as from managers and executives), the more accurate the system will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for other teams helping to create and maintain that data. This does not mean that collection and accuracy aren’t important before higher-level adoption of data, but it can be a much longer and more difficult process to coordinate between teams and maintain data.
One effective way to influence this coordination among team members is consistently showing metrics in a view that’s relevant to the audience. For example, to show the value of data to a developer, you can compare usage of products before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product—and even different features within some products—had individual definitions for categories. After weeks of work to consolidate the categories and create a new way to manage and maintain them that was more versatile, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply the new system. The key to getting others to adopt the new system was showing value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-tos and groups for our online forums.
There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization—very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now, the new system is widely used and lives up to the potential we saw when building it several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.
Collaboration helped achieve confidence in the accuracy because each different application was fully adopting the new categorization, which then vetted individual product edge cases against the design and category set defined. In a few cases, the categorization system needed to be further refined to address those edge cases, such as to account for some software and hardware that needed to belong to more than one category.
Reassessing the Value of Data Over Time
The focus and value placed on data in a company evolves over time. In the beginning, data collection might be a lower priority than acquiring new clients and getting products out the door. In my experience, data consumption begins with the product managers and developers working on products who want to understand if and how their features and products are being used or to help in the debugging process. It can also include monitoring system performance and grow quickly when specific metrics are set as goals for performance of products.
After a product manager adopts data for tracking success and failure, the goal is to make that data reportable and sharable so that others can also view the data as a critical method of measuring success. As more product managers and teams adopt data metrics, those metrics can be shared and standardized. Executive-level adoption of data metrics is much easier with the data in a uniform and reportable format that can measure company goals.
If no parties are already interested in data and the value it can bring, this is a good opportunity to begin using data to track the success of products that you are invested in and to share the results with managers, teammates, and so on. If you can show value and success in the products, or prove that opportunity is being wasted, the data is more likely to be seen as valuable and as a metric for success.
Acting only as an individual, you can show others the value you can get out of data, and thereby push them to invest in and use it for their own needs. The key is to show value and, when possible, make it easy for others to maintain and utilize the data. Sometimes, it might require building a small tool or defining relationships between data structures that make the data easy to use and maintain.