Jessica Roper
A Guide to Improving Data
Integrity and Adoption
A Case Study in Verifying Usage Data
Beijing · Boston · Farnham · Sebastopol · Tokyo
A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

December 2016: First Edition
Revision History for the First Edition
2016-12-12: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
A Guide to Improving Data Integrity and Adoption
    Validating Data Integrity as an Integral Part of Business
    Using the Case Study as a Guide
    An Overview of the Usage Data Project
    Getting Started with Data
    Managing Layers of Data
    Performing Additional Transformation and Formatting
    Starting with Smaller Datasets
    Determining Acceptable Error Rates
    Creating Work Groups
    Reassessing the Value of Data Over Time
    Checking the System for Internal Consistency
    Verifying Accuracy of Transformations and Aggregation Reports
    Allowing for Tests to Evolve
    Implementing Automation
    Conclusion
    Further Reading
A Guide to Improving Data Integrity and Adoption
In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take not only to ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?
Validating Data Integrity as an Integral Part of Business
Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and parse. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as “bad ass,” which is a positive phrase made with negative words. The system creating the data can also make it messy, because different languages have different expectations for design, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process them, such as YAML (YAML Ain’t Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.
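As a concrete sketch of that difficulty, the snippet below is illustrative only: the table name, column name, and preference key are hypothetical, and it assumes the application serialized a hash to a YAML string in a single text column. Because the database only sees a string, each row has to be deserialized in application code before it can be filtered.

    import sqlite3
    import yaml  # PyYAML; assumes the serialized column holds YAML text

    # Hypothetical table: user_settings(id INTEGER, preferences TEXT), where
    # `preferences` is a YAML hash serialized by the application framework.
    conn = sqlite3.connect("app.db")
    rows = conn.execute("SELECT id, preferences FROM user_settings")

    # Filtering on a single key has to happen after deserializing each row,
    # rather than in a simple WHERE clause.
    opted_in = []
    for user_id, raw_yaml in rows:
        prefs = yaml.safe_load(raw_yaml) or {}
        if prefs.get("email_opt_in") is True:
            opted_in.append(user_id)

    print(f"{len(opted_in)} users opted in to email")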
Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential. For example, the dataset in Table 1-1 is self-referential, wherein each row has a parent ID representing the type or category of the row. The value of the parent ID refers to the ID column of the same table.

In Table 1-1, all information around a “User Profile” is stored in the same table, including labels for profile values, resulting in some values representing labels, whereas others represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,” part of the “User Profile,” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on. I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all “profile-like” things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.
Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

    ID    Parent ID    Value
    9     NULL         User Profile
    11    9            Country
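One way to make a table like this easier to work with is to resolve each row’s chain of parent IDs into a readable path. The sketch below is only illustrative: the rows are assumed to already be loaded into memory, and Mexico’s own ID (12) is invented for the example, since only the IDs for “User Profile” (9) and “Country” (11) appear above.

    # Illustrative sketch: resolve a self-referential "profile" table into paths.
    # The ID for Mexico (12) is hypothetical; 9 and 11 follow the text above.
    rows = {
        9:  {"parent_id": None, "value": "User Profile"},
        11: {"parent_id": 9,    "value": "Country"},
        12: {"parent_id": 11,   "value": "Mexico"},
    }

    def full_path(row_id):
        """Walk parent IDs up to the root, e.g. 'User Profile > Country > Mexico'."""
        parts = []
        current = row_id
        while current is not None:
            parts.append(rows[current]["value"])
            current = rows[current]["parent_id"]
        return " > ".join(reversed(parts))

    print(full_path(12))  # User Profile > Country > Mexico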
Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company’s success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. To affect data priority and quality will likely require some work to make the data more usable and reportable and will almost certainly require working with others within the organization.
Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data project at Spiceworks, where I’ve worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to do research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more.
Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers. My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)
This case study demonstrates Spiceworks’ process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I’ll provide some quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.
An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks’ products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.
The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as when to display ads or send emails). When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: “How many total active users do we have across all of our products?” and “How many users are in each product?” It wasn’t necessary to understand how each product’s data worked. We also needed to be able to do analysis on cross-product adoption and usage.
The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).
When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created on the data collected. After we made the switch to this new way of collecting and managing data from the start—which was automatic and easy—more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.
Getting Started with Data
Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data needed was available in logs from Spiceworks’ advertising service (see Figure 1-1), which is used to identify the target audience that users qualify to be in and, therefore, what set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page and even context changes, such as switching to a new tab, create a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.
As Figure 1-1 demonstrates, each log entry offered a wealth of data from the web request that we scraped for further analysis, including the uniform resource locator (URL) of the page, the user who viewed it, the referrer of the page, the Internet Protocol (IP) address, and, of course, a time stamp to indicate when the page was viewed. We parsed these logs into structured data tables, appended more information (such as geography and other user profile information), and created aggregate data that could provide insights into product usage and cohort analysis.
Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)
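The actual log format isn’t reproduced in this report, so the parser below is only a sketch against a made-up, access-log-style line; the fields it extracts mirror the ones described above (IP address, timestamp, URL, user, and referrer), and unparseable lines are surfaced so broken or missing tracking can be measured rather than silently dropped.

    import re
    from datetime import datetime

    # Hypothetical line layout; the real ad-service log format is not shown here.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "GET (?P<url>\S+)" '
        r'user=(?P<user_id>\S+) referrer="(?P<referrer>[^"]*)"'
    )

    def parse_line(line):
        """Turn one raw log entry into a structured record, or None if it doesn't match."""
        match = LOG_PATTERN.match(line)
        if match is None:
            return None  # count these to see how much of the source can be trusted
        record = match.groupdict()
        record["timestamp"] = datetime.strptime(record["timestamp"], "%d/%b/%Y:%H:%M:%S")
        return record

    sample = '10.0.0.5 [12/Dec/2016:10:15:32] "GET /community/topics/123" user=42 referrer="https://example.com"'
    print(parse_line(sample))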
Managing Layers of Data
There are three layers of data useful to keep in mind, each used differently and with different expectations (Figure 1-2). The first layer is raw, unprocessed data, often produced by an application or external process; for example, some raw data from the usage data study comes from products such as Spiceworks’ cloud helpdesk, where users can manage IT tickets and requests, and our community, which is where users can interact online socially through discussions, product research, and so on. This data is in a format that makes sense for how the application itself works. Most often, it is not easily consumed, nor does it lend itself well to creating reports. For example, in the community, due to the frameworks used, we break apart different components and ideas of users and relationships so that email, subscriptions, demographics and interests, and so forth are all separated into many different components, but for analysis and reporting it’s better to have these different pieces of information all connected. Because this data is in a raw format, it is more likely to be unstructured and/or somewhat random, and sometimes even incomplete.
Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)
The next layer of data is processed and structured following some format, usually created from the raw dataset. At this layer, compression can be used if needed; either way, the final format will be a result of general processing, transformation, and classification. To use and analyze even this structured and processed layer of data still usually requires deep understanding and knowledge and can be a bit more difficult to report on accurately. Deeper understanding is required to work with this dataset because it still includes all of the raw data, complete with outliers and invalid data, but in a formatted and consistent representation with classifications and so on.
The final layer is reportable data that excludes outliers, incomplete data, and unqualified data; it includes only the final classifications, without the raw source for the classification included, allowing for segmentation and further analysis at the business and product levels without confusion. This layer is also usually built from the previous layer, processed and structured data. If needed, other products and processes using this data can further format and standardize it for their individual needs as well as apply further filtering.
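As a minimal sketch of the three layers (the field names, filtering thresholds, and classification rules are all invented for illustration), raw records get consistent types and classifications in the processed layer, and the reportable layer keeps only qualified rows and final classifications:

    # Illustrative only; real raw records, thresholds, and classifications differ.
    raw = [
        {"url": "/community/topics/123", "user_id": "42", "ms": "180"},
        {"url": "/healthcheck",          "user_id": None, "ms": "3"},      # internal noise
        {"url": "/helpdesk/tickets/9",   "user_id": "7",  "ms": "95000"},  # outlier load time
    ]

    def classify(url):
        if url.startswith("/community"):
            return "community"
        if url.startswith("/helpdesk"):
            return "helpdesk"
        return "other"

    # Processed layer: consistent types and classifications, but nothing dropped yet.
    processed = [
        {"product": classify(r["url"]), "user_id": r["user_id"],
         "load_ms": int(r["ms"]), "url": r["url"]}
        for r in raw
    ]

    # Reportable layer: outliers, incomplete rows, and raw source fields removed.
    reportable = [
        {"product": p["product"], "user_id": p["user_id"]}
        for p in processed
        if p["user_id"] is not None and p["load_ms"] < 60_000
    ]

    print(reportable)  # [{'product': 'community', 'user_id': '42'}]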
Performing Additional Transformation and Formatting
The most frequent reasons additional transformation and formatting are needed are to improve performance for the analysis or report being created, to work with analysis tools (which can be quite specific as to how data must be formatted to work well), and to blend data sources together.
An example of a use case in which we added more filtering was to analyze changes in how different products were used and determine what changes had positive long-term effects. This analysis required further filtering to create cohort groups and ensure that the users being observed were in the ideal audiences for observation. Removing users unlikely to engage in a product from analysis helped us to determine what features changed an engaged user’s behavior.
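The sketch below illustrates that kind of cohort filtering; the field names, signup window, and engagement threshold are all hypothetical. Users who showed no baseline engagement before the product change are removed so they don’t dilute the before-and-after comparison.

    from datetime import date

    # Hypothetical user records; in practice these would come from the processed layer.
    users = [
        {"id": 1, "signup": date(2016, 3, 10), "visits_before": 14, "visits_after": 22},
        {"id": 2, "signup": date(2016, 3, 25), "visits_before": 0,  "visits_after": 1},
        {"id": 3, "signup": date(2015, 11, 2), "visits_before": 30, "visits_after": 28},
    ]

    # Cohort: signed up in March 2016 and had at least some engagement beforehand.
    cohort = [
        u for u in users
        if u["signup"].year == 2016 and u["signup"].month == 3
        and u["visits_before"] >= 5
    ]

    before = sum(u["visits_before"] for u in cohort) / len(cohort)
    after = sum(u["visits_after"] for u in cohort) / len(cohort)
    print(f"cohort size={len(cohort)}, avg visits before={before:.1f}, after={after:.1f}")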
In addition, further transformations were required. For example, we used a third-party business intelligence tool to feed in the data to analyze and filter final data results for project managers. One transformation we had to make was to create a summary table that broke out the categorization and summary data needed into columns instead of rows.
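That rows-to-columns reshaping is essentially a pivot. The sketch below shows the idea with pandas as a stand-in for whatever tooling is actually used; the product and category names are made up:

    import pandas as pd

    # Hypothetical summary data in "long" form: one row per product/category pair.
    long_form = pd.DataFrame({
        "product":  ["community", "community", "helpdesk", "helpdesk"],
        "category": ["page_views", "active_users", "page_views", "active_users"],
        "count":    [120_000, 4_500, 80_000, 3_100],
    })

    # Pivot so each category becomes its own column, one row per product,
    # which is the shape the downstream BI tool can consume directly.
    wide_form = (long_form
                 .pivot_table(index="product", columns="category",
                              values="count", aggfunc="sum")
                 .reset_index())
    print(wide_form)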
For a long time, a lot of the processed and compressed data at Spiceworks was developed and formatted in a way that was highly related to the reporting processes that would be consuming the data. This usually would be the final reporting data, but many of the reports created were fairly standard, so we could create a generic way for consumption. Then, each report applied filters and further aggregations on the fly. Over time, as data became more widely used and dynamically analyzed as well as combined with different data sources, these generic tables proved to be difficult to use for digging deeper into the data and using it more broadly.
Frequently, the format could not be used at all, forcing analysts to go back to the raw, unprocessed data, which required a higher level of knowledge about the data if it was to be used at all. If the wrong assumptions were made about the data or if the wrong pieces of data were used (perhaps some that were no longer actively updated), incorrect conclusions might have been drawn. For example, when digging into the structured data parsed from the logs, some of our financial analysts incorrectly assumed that the presence of a user ID (a generic, anonymous user identifier) indicated the user was logged in. However, in some cases we identified the user through other means and included flags to indicate the source of the ID. Because the team did not have a full understanding of these flags or the true meaning of the field they were using, they got wildly different results than other reports tracking only logged-in users, which caused a lot of confusion.
To be able to create new reports from the raw, unprocessed data, we blended additional sources and analyzed the data as a whole. One problem arose from different data sources with different representations of the same entities. Of course, this is not surprising, because each product team needed to have its own idea of users, and usually some sort of profile for those users. Blending the data required creating mappings and relationships among the different datasets, which of course required a deep understanding of those relationships and datasets. Over time, as data consumption and usage grew, we updated, refactored, and reassessed how data is processed and aggregated. Our protocol has evolved over time to fit the needs of our data consumption.
Starting with Smaller Datasets
A few things to keep in mind when you’re validating data include becoming deeply familiar with the data, using small datasets, and testing components in isolation. Beginning with smaller datasets when necessary allows for faster iterations of testing before working on the full dataset. The sample data is a great place to begin digging into what the raw data really looks like to better understand how it needs to be processed and to identify patterns that are considered valid.
When you’re creating smaller datasets to work with, it is important to try to be as random as possible but still ensure that the sample is large enough to be representative of the whole. I usually aim for about 10 percent, but this will vary between datasets. Keep in mind that it’s important to include data over time, from varying geographical locations, and to include data that will be used for filtering, such as demographics. This understanding will define the parameters around the data needed to create tests.
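A minimal sketch of pulling such a sample follows; the grouping keys and row layout are invented, and the 10 percent fraction simply follows the rule of thumb above. Sampling within each month-and-region group keeps the sample spread over time and geography instead of leaving that to chance:

    import random
    from collections import defaultdict

    random.seed(7)  # reproducible sample, so tests behave the same on each run

    def stratified_sample(rows, keys=("month", "region"), fraction=0.10):
        """Sample roughly `fraction` of rows from every (month, region) group."""
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in keys)].append(row)
        sample = []
        for group_rows in groups.values():
            k = max(1, round(len(group_rows) * fraction))  # keep at least one row per group
            sample.extend(random.sample(group_rows, k))
        return sample

    # Hypothetical rows; in practice these would come from the full dataset.
    rows = [{"user_id": i,
             "month": f"2016-{(i % 12) + 1:02d}",
             "region": "NA" if i % 2 == 0 else "EMEA"}
            for i in range(1000)]

    sample = stratified_sample(rows)
    print(f"{len(sample)} of {len(rows)} rows sampled")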
For example, one of Spiceworks’ products identifies computer manufacturer data that is collected in aggregate anonymously and then categorized and grouped for further analysis. This information is originally sourced from devices such as my laptop, which is a MacBook Pro (Retina, 15-inch, mid-2014) (Figure 1-3). Categorizing and grouping the data into a set for all MacBook Pros requires spending time understanding what kind of titles are possible for Apple and MacBook in the dataset by searching through the data for related products. To really understand the data, however, it is important to also understand titles in their raw format to gain some insight into how they are aggregated and changed before being pushed into the dataset that is being categorized and grouped. Therefore, testing a data scrubber requires familiarity with the dataset and the source, if possible, so that you know which patterns and edge case conditions to check for and how you should format the data.
Figure 1-3. Example of laptop manufacturing information
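As an illustrative sketch only (the raw titles and the matching pattern are invented, not Spiceworks’ actual data), a scrubber that groups raw model titles into a “MacBook Pro” bucket relies on patterns learned by inspecting the kinds of titles that actually appear:

    import re

    # Hypothetical raw titles observed while digging through a sample of the data.
    raw_titles = [
        "MacBook Pro (Retina, 15-inch, Mid 2014)",
        "Apple MacBookPro11,3",
        "macbook pro 13",
        "Dell Latitude E7450",
    ]

    # Patterns discovered by inspecting the sample; real data would need many more.
    MACBOOK_PRO = re.compile(r"mac\s*book\s*pro", re.IGNORECASE)

    def categorize(title):
        return "MacBook Pro" if MACBOOK_PRO.search(title) else "Other"

    for title in raw_titles:
        print(f"{title!r:45} -> {categorize(title)}")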
Determining Acceptable Error Rates
It’s important to understand acceptable errors in data. This will vary between datasets but, overall, you want to understand what an acceptable industry standard is and understand the kinds of decisions that are being made with the data to determine the acceptable error rate. The rule of thumb I use is that edge case issues representing less than one percent of the dataset are not worth a lot of time because they will not affect trends or final analysis. However, you should still investigate all issues at some level to ensure that the set affected is indeed small or at least outside the system (e.g., caused by someone removing code that tracks usage because that person believed it “does not do anything”).
In some cases, this error rate is not obtainable or not exact; for example, some information we appended assigned sentiment (positive, negative, neutral) to posts viewed by users in the online forums that are part of Spiceworks’ community.

To determine our acceptable error rate, we researched sentiment analysis as a whole in the industry and found that the average accuracy rate is between 65 and 85 percent. We decided on a goal of a 25 percent error rate for posts with incorrectly assigned sentiment because it kept us in the top half of accuracy levels achieved in the industry.
When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data “ignored” in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.

After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page’s load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be maintained even while backups ran. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and in estimating the impact the errors could have.
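A small sketch of that grouping follows; the flagged records and threshold are hypothetical. Counting problematic rows by day of week and hour makes a recurring spike, like the Monday-evening backup issue above, stand out from random noise:

    from collections import Counter
    from datetime import datetime

    # Hypothetical records already flagged as problematic (e.g., very slow page loads).
    slow_requests = [
        {"timestamp": "2016-11-07 21:05:00", "load_ms": 9400},
        {"timestamp": "2016-11-14 21:10:00", "load_ms": 8800},
        {"timestamp": "2016-11-16 09:30:00", "load_ms": 7200},
        {"timestamp": "2016-11-21 21:02:00", "load_ms": 9900},
    ]

    # Group by (day of week, hour) and count occurrences in each bucket.
    buckets = Counter()
    for r in slow_requests:
        ts = datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S")
        buckets[(ts.strftime("%A"), ts.hour)] += 1

    # A recurring (day, hour) bucket points at a systemic cause rather than noise.
    for (day, hour), count in buckets.most_common():
        print(f"{day} {hour:02d}:00 -> {count} slow requests")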
Examples are key to providing developers, or whoever is doing data processing, with insight into the cause of the problem and providing a solution or way to account for the issue. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.
Creating Work Groups
Achieving data integrity often requires working with other groups or development teams to understand and improve accuracy. In many cases, these other teams must do work to improve the systems. Data integrity and comparability become more important the higher up in an organization that data is used. Generally, lots of people or groups can benefit from good data, but they are not usually the ones who must create and maintain the data so that it stays good [4]. Therefore, the higher the level of support and usage of data (such as from managers and executives), the more accurate the system will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for other teams helping to create and maintain that data. This does not mean that collection and accuracy aren’t important before higher-level adoption of data, but it can be a much longer and more difficult process to coordinate between teams and maintain data.
One effective way to influence this coordination among team members is consistently showing metrics in a view that’s relevant to the audience. For example, to show the value of data to a developer, you can compare usage of products before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product—and even different features within some products—had individual definitions for categories. After weeks of work to consolidate the categories and create a new way to manage and maintain them that was more versatile, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply the new system. The key to getting others to adopt the new system was showing value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-tos and groups for our online forums.
There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization—very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now, the new system is widely used and lives up to the potential we saw when building it several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.
Collaboration helped achieve confidence in the accuracy because each different application was fully adopting the new categorization, which then vetted individual product edge cases against the design and category set defined. In a few cases, the categorization system needed to be further refined to address those edge cases, such as to account for some software and hardware that needed to belong to more than one category.
Reassessing the Value of Data Over Time
The focus and value placed on data in a company evolves over time. In the beginning, data collection might be a lower priority than acquiring new clients and getting products out the door. In my experience, data consumption begins with the product managers and developers working on products who want to understand if and how their features and products are being used or to help in the debugging process. It can also include monitoring system performance and grow quickly when specific metrics are set as goals for performance of products.
After a product manager adopts data for tracking success and failure, the goal is to make that data reportable and sharable so that others can also view the data as a critical method of measuring success. As more product managers and teams adopt data metrics, those metrics can be shared and standardized. Executive-level adoption of data metrics is much easier with the data in a uniform and reportable format that can measure company goals.
If no parties are already interested in data and the value it can bring, this is a good opportunity to begin using data to track the success of products that you are invested in and to share the results with managers, teammates, and so on. If you can show value and success in the products, or prove that opportunity is being wasted, the data is more likely to be seen as valuable and as a metric for success.
Acting only as an individual, you can show others the value you can get out of data, and thereby push them to invest in and use it for their own needs. The key is to show value and, when possible, make it easy for others to maintain and utilize the data. Sometimes, it might require building a small tool or defining relationships between data structures that make the data easy to use and maintain.