A Guide to Improving Data Integrity and Adoption
A Case Study in Verifying Usage Data
Jessica Roper
A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
December 2016: First Edition
Revision History for the First Edition
2016-12-12: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-97052-2
[LSI]
A Guide to Improving Data Integrity and Adoption
In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?
Validating Data Integrity as an Integral Part of Business
Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as “bad ass,” which is a positive phrase made with negative words. The system creating the data can also make it messy, because different languages have different expectations for design, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process it, such as YAML (YAML Ain’t Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.
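To make that pain point concrete, here is a minimal Python sketch of reading a YAML blob that was stored as a plain string column. The table and column names (user_settings, serialized_profile) are hypothetical, not Spiceworks’ schema; the point is that the database cannot search inside the string, so every lookup means parsing each value in application code:

```python
# Minimal sketch: reading a YAML blob stored as a plain string column.
# Table and column names (user_settings, serialized_profile) are hypothetical.
import sqlite3
import yaml  # PyYAML

conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT id, serialized_profile FROM user_settings")

for user_id, blob in rows:
    profile = yaml.safe_load(blob) or {}   # parse the string into a dict
    # Only now can we "query" by key; the database itself could not filter on this.
    if profile.get("country") == "Mexico":
        print(user_id, profile.get("timezone"))
```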
Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) and that is also self-referential. For example, the dataset in Table 1-1 is self-referential: each row has a parent ID representing the type or category of the row, and the value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a “User Profile” is stored in the same table, including labels for profile values, so some rows represent labels whereas others represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,” part of the “User Profile,” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on. I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design: my guess is that, at the time, the idea was to keep all “profile-like” things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.
Table 1-1. A self-referential profile table (excerpt)
ID   Parent ID   Value
9    NULL        User Profile
11   9           Country
…    11          Mexico
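Querying a table like this usually means joining it to itself so that parent rows supply the labels for child rows. The following is a minimal sketch under assumed names (a profile_values table with id, parent_id, and value columns, and an illustrative ID for the “Mexico” row), not the actual schema:

```python
# Minimal sketch: resolving values against their labels in a self-referential table.
# The table name profile_values, its columns, and the ID 42 are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile_values (id INTEGER, parent_id INTEGER, value TEXT);
    INSERT INTO profile_values VALUES (9, NULL, 'User Profile');
    INSERT INTO profile_values VALUES (11, 9, 'Country');
    INSERT INTO profile_values VALUES (42, 11, 'Mexico');
""")

# Self-join: each row's parent supplies the label for the row's value.
query = """
    SELECT labels.value AS label, vals.value AS value
    FROM profile_values AS vals
    JOIN profile_values AS labels ON vals.parent_id = labels.id
    WHERE labels.parent_id = 9        -- children of 'User Profile' act as labels
"""
for label, value in conn.execute(query):
    print(f"{label}: {value}")        # e.g., "Country: Mexico"
```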
Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company’s success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. Affecting data priority and quality will likely require some work to make the data more usable and reportable, and will almost certainly require working with others within the organization.
Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data project at Spiceworks, where I’ve worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to do research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more. Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers.
My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)
This case study demonstrates Spiceworks’ process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I’ll provide some quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.
An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks’ products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.
The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as when to display ads or send emails).
When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: “How many total active users do we have across all of our products?” and “How many users are in each product?” It wasn’t necessary to understand how each product’s data worked. We also needed to be able to do analysis on cross-product adoption and usage.
The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics, such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).
When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by the data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created on the data collected. After we made the switch to this new way of collecting and managing data from the start—which was automatic and easy—more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.
Getting Started with Data
Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data needed was available in logs from Spiceworks’ advertising service (see Figure 1-1), which is used to identify a target audience that users qualify to be in and therefore what set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page, and even context changes such as switching to a new tab, creates a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.
As Figure 1-1 demonstrates, each log entry offered a wealth of data from the web request that we scraped for further analysis, including the uniform resource locator (URL) of the page, the user who viewed it, the referrer of the page, the Internet Protocol (IP) address, and, of course, a time stamp to indicate when the page was viewed. We parsed these logs into structured data tables, appended more information (such as geography and other user profile information), and created aggregate data that could provide insights into product usage and cohort analysis.
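The real log format is internal, but as a hedged sketch of the kind of parsing involved, the snippet below uses a made-up line layout and pulls out the IP address, timestamp, URL, user ID, and referrer as a structured record. Unparseable lines are worth counting separately, because they can signal missing or broken tracking:

```python
# Rough sketch of turning raw ad-service log lines into structured rows.
# The log format and field names here are invented for illustration only.
import re
from datetime import datetime

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+)" '
    r'user=(?P<user_id>\S+) referrer=(?P<referrer>\S+)'
)

def parse_line(line):
    match = LINE_PATTERN.search(line)
    if not match:
        return None  # count these separately; unparsed lines may mean broken tracking
    row = match.groupdict()
    row["ts"] = datetime.strptime(row["ts"], "%d/%b/%Y:%H:%M:%S")
    return row

sample = '10.0.0.5 [12/Dec/2016:10:15:32] "GET /community/topics/123" user=987 referrer=/community'
print(parse_line(sample))
```

Comparing the count of parsed rows against the count of raw lines is one quick way to see how much of the source is being dropped.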
Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)
Managing Layers of Data
There are three layers of data useful to keep in mind, each used differently and with different expectations (Figure 1-2). The first layer is raw, unprocessed data, often produced by an application or external process; for example, some raw data from the usage data study comes from products such as Spiceworks’ cloud helpdesk, where users can manage IT tickets and requests, and our community, which is where users can interact online socially through discussions, product research, and so on. This data is in a format that makes sense for how the application itself works. Most often, it is not easily consumed, nor does it lend itself well to creating reports. For example, in the community, due to the frameworks used, we break apart different components and ideas of users and relationships so that email, subscriptions, demographics and interests, and so forth are all separated into many different components, but for analysis and reporting it’s better to have these different pieces of information all connected. Because this data is in a raw format, it is more likely to be unstructured and/or somewhat random, and sometimes even incomplete.
Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)
The next layer of data is processed and structured following some format, usually created from the raw dataset. At this layer, compression can be used if needed; either way, the final format will be a result of general processing, transformation, and classification. To use and analyze even this structured and processed layer of data still usually requires deep understanding and knowledge, and it can be a bit more difficult to report on accurately. Deeper understanding is required to work with this dataset because it still includes all of the raw data, complete with outliers and invalid data, but in a formatted and consistent representation with classifications and so on.
The final layer is reportable data that excludes outliers, incomplete data, and unqualified data; it includes only the final classifications without the raw source for the classification included, allowing for segmentation and further analysis at the business and product levels without confusion. This layer is also usually built from the previous layer, processed and structured data. If needed, other products and processes using this data can further format and standardize it for their individual needs as well as apply further filtering.
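As a small sketch of how the reportable layer might be derived from the processed layer (all column names and thresholds here are assumptions for illustration), the idea is simply to drop incomplete rows and outliers and keep only the final, classified fields:

```python
# Minimal sketch: deriving a reportable layer from the processed/structured layer.
# Column names (user_id, product, views, is_bot) and the cutoff are illustrative only.
import pandas as pd

processed = pd.DataFrame({
    "user_id": [1, 2, 3, None, 4],
    "product": ["community", "helpdesk", "community", "helpdesk", "community"],
    "views":   [12, 3, 9500, 7, 25],     # 9500 is an outlier (e.g., a crawler)
    "is_bot":  [False, False, True, False, False],
})

reportable = (
    processed
    .dropna(subset=["user_id"])               # exclude incomplete rows
    .query("not is_bot and views < 1000")     # exclude unqualified data and outliers
    .loc[:, ["user_id", "product", "views"]]  # keep only final, reportable columns
)
print(reportable)
```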
Performing Additional Transformation and Formatting
The most frequent reasons additional transformation and formatting are needed are when it is necessary to improve performance for the analysis or report being created, to work with analysis tools (which can be quite specific as to how data must be formatted to work well), and to blend data sources together.
An example of a use case in which we added more filtering was to analyze changes in how different products were used and determine what changes had positive long-term effects. This analysis required further filtering to create cohort groups and ensure that the users being observed were in the ideal audiences for observation. Removing users unlikely to engage in a product from the analysis helped us to determine what features changed an engaged user’s behavior.
In addition, further transformations were required. For example, we used a third-party business intelligence tool to feed in the data to analyze and filter final data results for project managers. One transformation we had to make was to create a summary table that broke out the categorization and summary data needed into columns instead of rows.
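A hedged sketch of that kind of pivot, using invented field names: the long summary has one row per product and category, and the BI tool wants one column per category.

```python
# Minimal sketch: turning categorization rows into columns for a BI tool.
# Field names and values are invented for illustration.
import pandas as pd

long_summary = pd.DataFrame({
    "product":  ["community", "community", "helpdesk", "helpdesk"],
    "category": ["active_users", "page_views", "active_users", "page_views"],
    "value":    [1200, 56000, 800, 21000],
})

wide_summary = long_summary.pivot_table(
    index="product", columns="category", values="value", aggfunc="sum"
).reset_index()

print(wide_summary)   # one row per product, one column per category
```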
For a long time, a lot of the processed and compressed data at Spiceworks was developed and formatted in a way that was highly related to the reporting processes that would be consuming the data. This usually would be the final reporting data, but many of the reports created were fairly standard, so we could create a generic way for consumption. Then, each report applied filters and further aggregations on the fly. Over time, as data became more widely used, dynamically analyzed, and combined with different data sources, these generic tables proved to be difficult to use for digging deeper into the data and using it more broadly.
Frequently, the format could not be used at all, forcing analysts to go back to the raw unprocessed data, which required a higher level of knowledge about the data if it were to be used at all. If the wrong assumptions were made about the data or if the wrong pieces of data were used (perhaps some that were no longer actively updated), incorrect conclusions might have been drawn. For example, when digging into the structured data parsed from the logs, some of our financial analysts incorrectly assumed that the presence of a user ID (a generic, anonymous user identifier) indicated the user was logged in. However, in some cases we identified the user through other means and included flags to indicate the source of the ID. Because the team did not have a full understanding of these flags or the true meaning of the field they were using, they got wildly different results than other reports tracking only logged-in users, which caused a lot of confusion.
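The lesson, roughly, is to filter on the flag that records where the ID came from rather than on the mere presence of an ID. A small sketch with illustrative flag values (not the real schema):

```python
# Minimal sketch: counting logged-in users correctly by using the ID-source flag.
# The id_source values ("session", "cookie", "inferred") are illustrative only.
import pandas as pd

views = pd.DataFrame({
    "user_id":   [101, 102, 103, 104],
    "id_source": ["session", "inferred", "cookie", "session"],
})

# Wrong: presence of a user_id does not mean the user was logged in.
naive_count = views["user_id"].notna().sum()

# Better: only IDs that came from an authenticated session count as logged in.
logged_in_count = (views["id_source"] == "session").sum()

print(naive_count, logged_in_count)   # 4 vs. 2
```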
To be able to create new reports from the raw, unprocessed data, we blended additional sources and analyzed the data as a whole. One problem arose from different data sources having different representations of the same entities. Of course, this is not surprising, because each product team needed to have its own idea of users, and usually some sort of profile for those users. Blending the data required creating mappings and relationships among the different datasets, which of course required a deep understanding of those relationships and datasets. Over time, as data consumption and usage grew, we updated, refactored, and reassessed how data is processed and aggregated. Our protocol has evolved over time to fit the needs of our data consumption.
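A minimal sketch of that kind of blending, assuming each product keeps its own user IDs and a mapping table ties them together (all names are invented for illustration):

```python
# Minimal sketch: blending two products' user data through an explicit mapping table.
# Table and column names are invented for illustration.
import pandas as pd

community_users = pd.DataFrame({"community_id": [1, 2, 3], "posts": [10, 0, 4]})
helpdesk_users  = pd.DataFrame({"helpdesk_id": ["a", "b"], "tickets": [5, 2]})
id_mapping      = pd.DataFrame({"community_id": [1, 3], "helpdesk_id": ["a", "b"]})

blended = (
    community_users
    .merge(id_mapping, on="community_id", how="left")
    .merge(helpdesk_users, on="helpdesk_id", how="left")
)
print(blended)   # one row per community user, with help desk activity where it maps
```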
Starting with Smaller Datasets
A few things to keep in mind when you’re validating data include becoming deeply familiar with the data, using small datasets, and testing components in isolation. Beginning with smaller datasets when necessary allows for faster iterations of testing before working on the full dataset. The sample data is a great place to begin digging into what the raw data really looks like to better understand how it needs to be processed and to identify patterns that are considered valid.
When you’re creating smaller datasets to work with, it is important to try to be as random as possible but still ensure that the sample is large enough to be representative of the whole. I usually aim for about 10 percent, but this will vary between datasets. Keep in mind that it’s important to include data over time, from varying geographical locations, and to include data that will be used for filtering, such as demographics. This understanding will define the parameters around the data needed to create tests. For example, one of Spiceworks’ products identifies computer manufacturer data that is collected in aggregate anonymously and then categorized and grouped for further analysis. This information is originally sourced from devices such as my laptop, which is a MacBook Pro (Retina, 15-inch, mid-2014) (Figure 1-3). Categorizing and grouping the data into a set for all MacBook Pros requires spending time understanding what kinds of titles are possible for Apple and MacBook in the dataset by searching through the data for related products. To really understand the data, however, it is important to also understand titles in their raw format to gain some insight into how they are aggregated and changed before being pushed into the dataset that is being categorized and grouped. Therefore, testing a data scrubber requires familiarity with the dataset and the source, if possible, so that you know which patterns and edge case conditions to check for and how you should format the data.
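As a hedged example of what a scrubber test might look like, the snippet below normalizes a few raw model titles into groups and asserts that known edge cases land where expected; the raw titles and the grouping rule are stand-ins for the real data and logic:

```python
# Minimal sketch: testing a categorization scrubber against known raw titles.
# The raw titles and grouping rule are illustrative, not the production logic.
import re

def categorize(raw_title):
    title = raw_title.strip().lower()
    if re.search(r"macbook\s*pro", title):
        return "MacBook Pro"
    if "macbook" in title:
        return "MacBook"
    return "Other"

samples = {
    "MacBook Pro (Retina, 15-inch, Mid 2014)": "MacBook Pro",
    "Apple MacBookPro11,3":                    "MacBook Pro",
    "MacBook Air (13-inch, 2015)":             "MacBook",
    "ThinkPad X1 Carbon":                      "Other",
}

for raw, expected in samples.items():
    assert categorize(raw) == expected, f"unexpected group for {raw!r}"
print("all sample titles categorized as expected")
```

Each new raw variant found in the sample data becomes another entry in the expected set, so the scrubber’s behavior is pinned down before it runs on the full dataset.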
Figure 1-3. Example of laptop manufacturing information
Determining Acceptable Error Rates
It’s important to understand acceptable errors in data. This will vary between datasets, but, overall, you want to understand what an acceptable industry standard is and understand the kinds of decisions that are being made with the data to determine the acceptable error rate. The rule of thumb I use is that edge case issues representing less than one percent of the dataset are not worth a lot of time, because they will not affect trends or final analysis. However, you should still investigate all issues at some level to ensure that the set affected is indeed small or at least outside the system (i.e., caused by someone removing code that tracks usage because that person believed it “does not do anything”).
In some cases, this error rate is not obtainable or not exact; for example, some information we appended assigned sentiment (positive, negative, neutral) to posts viewed by users in the online forums that are part of Spiceworks’ community.
To determine our acceptable error rate, we researched sentiment analysis as a whole in the industry and found that the average accuracy rate is between 65 and 85 percent. We decided on a goal of a 25 percent error rate for posts with incorrect sentiment assigned because it kept us in the top half of accuracy levels achieved in the industry.
When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data “ignored” in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.
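A quick way to sanity-check such a budget is to sum the worst-case error contributed by each source, with each rate expressed as a fraction of the full dataset. The numbers in this small sketch are made up:

```python
# Minimal sketch: checking that per-source error budgets stay within the overall limit.
# The per-product error rates below are made-up numbers, expressed as fractions
# of the full dataset, so their sum is an upper bound on the combined error.
per_product_error_rates = [0.001] * 10    # one-tenth of a percent in each of 10 products

worst_case_combined = sum(per_product_error_rates)
overall_budget = 0.01                     # 1 percent acceptable error

within_budget = worst_case_combined <= overall_budget + 1e-9   # tolerate float rounding
print(f"worst case combined error: {worst_case_combined:.3%}, within budget: {within_budget}")
```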
After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page’s load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be retained even during backups. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and better estimating the impact the errors could have.
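A hedged sketch of that kind of check: group the problematic records by day of week and hour and look for spikes. The timestamps below are fabricated for illustration:

```python
# Minimal sketch: counting problem records by day of week and hour to surface patterns.
# The timestamps below are fabricated; real input would be the failing rows themselves.
import pandas as pd

errors = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-11-07 21:05", "2016-11-14 21:12", "2016-11-21 21:03",  # Monday evenings
        "2016-11-09 10:30",
    ])
})

pattern = (
    errors
    .assign(day=errors["ts"].dt.day_name(), hour=errors["ts"].dt.hour)
    .groupby(["day", "hour"])
    .size()
    .sort_values(ascending=False)
)
print(pattern)   # a spike on Monday evenings points at something scheduled, like a backup
```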
Examples are key to providing developers, or whoever is doing the data processing, with insight into the cause of the problem and to providing a solution or way to account for the issue. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.
Creating Work Groups
Achieving data integrity often requires working with other groups or development teams to understand and improve accuracy. In many cases, these other teams must do work to improve the systems. Data integrity and comparability become more important the higher up in an organization the data is used. Generally, lots of people or groups can benefit from good data, but they are not usually the ones who must create and maintain the data so that it is good [4]. Therefore, the higher the level of support and usage of data (such as from managers and executives), the more accurate the system will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for other teams helping to create and maintain that data. This does not mean that collection and accuracy aren’t important before higher-level adoption of data, but without it the process of coordinating between teams and maintaining data can be much longer and more difficult.
One effective way to influence this coordination among team members is to consistently show metrics in a view that’s relevant to the audience. For example, to show the value of data to a developer, you can compare usage of products before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product—and even different features within some products—had individual definitions for categories. After weeks of work to consolidate the categories and create a new, more versatile way to manage and maintain them, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply the new system. The key to getting others to adopt the new system was showing value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-to’s and groups for our online forums.
There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization—very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now, the new system is widely used, and used for the potential seen when building it initially several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.
Collaboration helped achieve confidence in the accuracy because each different application was fully adopting the new categorization, which then vetted individual product edge cases against the design and category set defined. In a few cases, the categorization system needed to be further refined to address those edge cases, such as to account for some software and hardware that needed to belong to more than one category.
Reassessing the Value of Data Over Time
The focus and value placed on data in a company evolves over time. In the beginning, data collection might be a lower priority than acquiring new clients and getting products out the door. In my experience, data consumption begins with the product managers and developers working on products who want to understand if and how their features and products are being used, or to help in the debugging process. It can also include monitoring system performance, and it can grow quickly when specific metrics are set as goals for the performance of products.
After a product manager adopts data for tracking success and failure, the goal is to make that data reportable and sharable so that others can also view the data as a critical method of measuring success. As more product managers and teams adopt data metrics, those metrics can be shared and standardized. Executive-level adoption of data metrics at a high level is much easier with the data in a uniform and reportable format that can measure company goals.
If no parties are already interested in data and the value it can bring, this is a good opportunity to begin using data to track the success of products that you are invested in and share the results with