A Guide to Improving Data Integrity and Adoption
A Case Study in Verifying Usage Data
Jessica Roper
A Guide to Improving Data Integrity and Adoption
by Jessica Roper
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
December 2016: First Edition
Revision History for the First Edition
2016-12-12: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-97052-2
[LSI]
A Guide to Improving Data Integrity and Adoption
In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand adoption of data?
Validating Data Integrity as an Integral Part of Business
Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as “bad ass,” which is a positive phrase made with negative words. The system creating the data can also make it messy, because different languages have different expectations for design, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.
Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process it, such as YAML (YAML Ain’t Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.
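To make that pain point concrete, here is a minimal Python sketch of reading a YAML blob that was stored as a plain string column. The table and column names (user_settings, serialized_profile) are hypothetical, not Spiceworks’ schema; the point is that the database cannot search inside the string, so every lookup means parsing each value in application code:

```python
# Minimal sketch: reading a YAML blob stored as a plain string column.
# Table and column names (user_settings, serialized_profile) are hypothetical.
import sqlite3
import yaml  # PyYAML

conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT id, serialized_profile FROM user_settings")

for user_id, blob in rows:
    profile = yaml.safe_load(blob) or {}   # parse the string into a dict
    # Only now can we "query" by key; the database itself could not filter on this.
    if profile.get("country") == "Mexico":
        print(user_id, profile.get("timezone"))
```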
Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) and that is also self-referential. For example, the dataset in Table 1-1 is self-referential: each row has a parent ID representing the type or category of the row, and the value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a “User Profile” is stored in the same table, including labels for profile values, so some rows represent labels whereas others represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,” part of the “User Profile,” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on. I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design: my guess is that, at the time, the idea was to keep all “profile-like” things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.
Table 1-1. A self-referential profile table (excerpt)
ID   Parent ID   Value
9    NULL        User Profile
11   9           Country
…    11          Mexico
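Querying a table like this usually means joining it to itself so that parent rows supply the labels for child rows. The following is a minimal sketch under assumed names (a profile_values table with id, parent_id, and value columns, and an illustrative ID for the “Mexico” row), not the actual schema:

```python
# Minimal sketch: resolving values against their labels in a self-referential table.
# The table name profile_values, its columns, and the ID 42 are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile_values (id INTEGER, parent_id INTEGER, value TEXT);
    INSERT INTO profile_values VALUES (9, NULL, 'User Profile');
    INSERT INTO profile_values VALUES (11, 9, 'Country');
    INSERT INTO profile_values VALUES (42, 11, 'Mexico');
""")

# Self-join: each row's parent supplies the label for the row's value.
query = """
    SELECT labels.value AS label, vals.value AS value
    FROM profile_values AS vals
    JOIN profile_values AS labels ON vals.parent_id = labels.id
    WHERE labels.parent_id = 9        -- children of 'User Profile' act as labels
"""
for label, value in conn.execute(query):
    print(f"{label}: {value}")        # e.g., "Country: Mexico"
```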
Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company’s success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. Affecting data priority and quality will likely require some work to make the data more usable and reportable, and will almost certainly require working with others within the organization.
Using the Case Study as a Guide
In this report, I will follow a case study from a large and critical data project at Spiceworks, where I’ve worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to do research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more. Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers.
My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)
This case study demonstrates Spiceworks’ process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I’ll provide some quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.
An Overview of the Usage Data Project
The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks’ products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.
The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as when to display ads or send emails).
When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: “How many total active users do we have across all of our products?” and “How many users are in each product?” It wasn’t necessary to understand how each product’s data worked. We also needed to be able to do analysis on cross-product adoption and usage.
The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics, such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).
When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by the data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created on the data collected. After we made the switch to this new way of collecting and managing data from the start—which was automatic and easy—more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.
Getting Started with Data
Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data needed was available in logs from Spiceworks’ advertising service (see Figure 1-1), which is used to identify a target audience that users qualify to be in and therefore what set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page, and even context changes such as switching to a new tab, creates a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.
As Figure 1-1 demonstrates, each log entry offered a wealth of data from the web request that we scraped for further analysis, including the uniform resource locator (URL) of the page, the user who viewed it, the referrer of the page, the Internet Protocol (IP) address, and, of course, a time stamp to indicate when the page was viewed. We parsed these logs into structured data tables, appended more information (such as geography and other user profile information), and created aggregate data that could provide insights into product usage and cohort analysis.
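The real log format is internal, but as a hedged sketch of the kind of parsing involved, the snippet below uses a made-up line layout and pulls out the IP address, timestamp, URL, user ID, and referrer as a structured record. Unparseable lines are worth counting separately, because they can signal missing or broken tracking:

```python
# Rough sketch of turning raw ad-service log lines into structured rows.
# The log format and field names here are invented for illustration only.
import re
from datetime import datetime

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+)" '
    r'user=(?P<user_id>\S+) referrer=(?P<referrer>\S+)'
)

def parse_line(line):
    match = LINE_PATTERN.search(line)
    if not match:
        return None  # count these separately; unparsed lines may mean broken tracking
    row = match.groupdict()
    row["ts"] = datetime.strptime(row["ts"], "%d/%b/%Y:%H:%M:%S")
    return row

sample = '10.0.0.5 [12/Dec/2016:10:15:32] "GET /community/topics/123" user=987 referrer=/community'
print(parse_line(sample))
```

Comparing the count of parsed rows against the count of raw lines is one quick way to see how much of the source is being dropped.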
Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)
Managing Layers of Data
There are three layers of data useful to keep in mind, each used differently and with different expectations (Figure 1-2). The first layer is raw, unprocessed data, often produced by an application or external process; for example, some raw data from the usage data study comes from products such as Spiceworks’ cloud helpdesk, where users can manage IT tickets and requests, and our community, which is where users can interact online socially through discussions, product research, and so on. This data is in a format that makes sense for how the application itself works. Most often, it is not easily consumed, nor does it lend itself well to creating reports. For example, in the community, due to the frameworks used, we break apart different components and ideas of users and relationships so that email, subscriptions, demographics and interests, and so forth are all separated into many different components, but for analysis and reporting it’s better to have these different pieces of information all connected. Because this data is in a raw format, it is more likely to be unstructured and/or somewhat random, and sometimes even incomplete.
Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)
The next layer of data is processed and structured following some format, usually created from the raw dataset. At this layer, compression can be used if needed; either way, the final format will be a result of general processing, transformation, and classification. To use and analyze even this structured and processed layer of data still usually requires deep understanding and knowledge, and it can be a bit more difficult to report on accurately. Deeper understanding is required to work with this dataset because it still includes all of the raw data, complete with outliers and invalid data, but in a formatted and consistent representation with classifications and so on.
The final layer is reportable data that excludes outliers, incomplete data, and unqualified data; it includes only the final classifications without the raw source for the classification included, allowing for segmentation and further analysis at the business and product levels without confusion. This layer is also usually built from the previous layer, processed and structured data. If needed, other products and processes using this data can further format and standardize it for their individual needs as well as apply further filtering.
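As a small sketch of how the reportable layer might be derived from the processed layer (all column names and thresholds here are assumptions for illustration), the idea is simply to drop incomplete rows and outliers and keep only the final, classified fields:

```python
# Minimal sketch: deriving a reportable layer from the processed/structured layer.
# Column names (user_id, product, views, is_bot) and the cutoff are illustrative only.
import pandas as pd

processed = pd.DataFrame({
    "user_id": [1, 2, 3, None, 4],
    "product": ["community", "helpdesk", "community", "helpdesk", "community"],
    "views":   [12, 3, 9500, 7, 25],     # 9500 is an outlier (e.g., a crawler)
    "is_bot":  [False, False, True, False, False],
})

reportable = (
    processed
    .dropna(subset=["user_id"])               # exclude incomplete rows
    .query("not is_bot and views < 1000")     # exclude unqualified data and outliers
    .loc[:, ["user_id", "product", "views"]]  # keep only final, reportable columns
)
print(reportable)
```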
Performing Additional Transformation and Formatting
The most frequent reasons additional transformation and formatting are needed are when it is necessary to improve performance for the analysis or report being created, to work with analysis tools (which can be quite specific as to how data must be formatted to work well), and to blend data sources together.
An example of a use case in which we added more filtering was to analyze changes in how different products were used and determine what changes had positive long-term effects. This analysis required further filtering to create cohort groups and ensure that the users being observed were in the ideal audiences for observation. Removing users unlikely to engage in a product from the analysis helped us to determine what features changed an engaged user’s behavior.
In addition, further transformations were required. For example, we used a third-party business intelligence tool to feed in the data to analyze and filter final data results for project managers. One transformation we had to make was to create a summary table that broke out the categorization and summary data needed into columns instead of rows.
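A hedged sketch of that kind of pivot, using invented field names: the long summary has one row per product and category, and the BI tool wants one column per category.

```python
# Minimal sketch: turning categorization rows into columns for a BI tool.
# Field names and values are invented for illustration.
import pandas as pd

long_summary = pd.DataFrame({
    "product":  ["community", "community", "helpdesk", "helpdesk"],
    "category": ["active_users", "page_views", "active_users", "page_views"],
    "value":    [1200, 56000, 800, 21000],
})

wide_summary = long_summary.pivot_table(
    index="product", columns="category", values="value", aggfunc="sum"
).reset_index()

print(wide_summary)   # one row per product, one column per category
```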
For a long time, a lot of the processed and compressed data at Spiceworks was developed and formatted in a way that was highly related to the reporting processes that would be consuming the data. This usually would be the final reporting data, but many of the reports created were fairly standard, so we could create a generic way for consumption. Then, each report applied filters and further aggregations on the fly. Over time, as data became more widely used, dynamically analyzed, and combined with different data sources, these generic tables proved to be difficult to use for digging deeper into the data and using it more broadly.
Frequently, the format could not be used at all, forcing analysts to go back to the raw unprocessed data, which required a higher level of knowledge about the data if it were to be used at all. If the wrong assumptions were made about the data or if the wrong pieces of data were used (perhaps some that were no longer actively updated), incorrect conclusions might have been drawn. For example, when digging into the structured data parsed from the logs, some of our financial analysts incorrectly assumed that the presence of a user ID (a generic, anonymous user identifier) indicated the user was logged in. However, in some cases we identified the user through other means and included flags to indicate the source of the ID. Because the team did not have a full understanding of these flags or the true meaning of the field they were using, they got wildly different results than other reports tracking only logged-in users, which caused a lot of confusion.
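The lesson, roughly, is to filter on the flag that records where the ID came from rather than on the mere presence of an ID. A small sketch with illustrative flag values (not the real schema):

```python
# Minimal sketch: counting logged-in users correctly by using the ID-source flag.
# The id_source values ("session", "cookie", "inferred") are illustrative only.
import pandas as pd

views = pd.DataFrame({
    "user_id":   [101, 102, 103, 104],
    "id_source": ["session", "inferred", "cookie", "session"],
})

# Wrong: presence of a user_id does not mean the user was logged in.
naive_count = views["user_id"].notna().sum()

# Better: only IDs that came from an authenticated session count as logged in.
logged_in_count = (views["id_source"] == "session").sum()

print(naive_count, logged_in_count)   # 4 vs. 2
```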
To be able to create new reports from the raw, unprocessed data, we blended additional sources and analyzed the data as a whole. One problem arose from different data sources having different representations of the same entities. Of course, this is not surprising, because each product team needed to have its own idea of users, and usually some sort of profile for those users. Blending the data required creating mappings and relationships among the different datasets, which of course required a deep understanding of those relationships and datasets. Over time, as data consumption and usage grew, we updated, refactored, and reassessed how data is processed and aggregated. Our protocol has evolved over time to fit the needs of our data consumption.
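A minimal sketch of that kind of blending, assuming each product keeps its own user IDs and a mapping table ties them together (all names are invented for illustration):

```python
# Minimal sketch: blending two products' user data through an explicit mapping table.
# Table and column names are invented for illustration.
import pandas as pd

community_users = pd.DataFrame({"community_id": [1, 2, 3], "posts": [10, 0, 4]})
helpdesk_users  = pd.DataFrame({"helpdesk_id": ["a", "b"], "tickets": [5, 2]})
id_mapping      = pd.DataFrame({"community_id": [1, 3], "helpdesk_id": ["a", "b"]})

blended = (
    community_users
    .merge(id_mapping, on="community_id", how="left")
    .merge(helpdesk_users, on="helpdesk_id", how="left")
)
print(blended)   # one row per community user, with help desk activity where it maps
```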
Starting with Smaller Datasets
A few things to keep in mind when you’re validating data include becoming deeply familiar with the data, using small datasets, and testing components in isolation. Beginning with smaller datasets when necessary allows for faster iterations of testing before working on the full dataset. The sample data is a great place to begin digging into what the raw data really looks like to better understand how it needs to be processed and to identify patterns that are considered valid.
When you’re creating smaller datasets to work with, it is important to try to be as random as possible but still ensure that the sample is large enough to be representative of the whole. I usually aim for about 10 percent, but this will vary between datasets. Keep in mind that it’s important to include data over time, from varying geographical locations, and to include data that will be used for filtering, such as demographics. This understanding will define the parameters around the data needed to create tests. For example, one of Spiceworks’ products identifies computer manufacturer data that is collected in aggregate anonymously and then categorized and grouped for further analysis. This information is originally sourced from devices such as my laptop, which is a MacBook Pro (Retina, 15-inch, mid-2014) (Figure 1-3). Categorizing and grouping the data into a set for all MacBook Pros requires spending time understanding what kinds of titles are possible for Apple and MacBook in the dataset by searching through the data for related products. To really understand the data, however, it is important to also understand titles in their raw format to gain some insight into how they are aggregated and changed before being pushed into the dataset that is being categorized and grouped. Therefore, testing a data scrubber requires familiarity with the dataset and the source, if possible, so that you know which patterns and edge case conditions to check for and how you should format the data.
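As a hedged example of what a scrubber test might look like, the snippet below normalizes a few raw model titles into groups and asserts that known edge cases land where expected; the raw titles and the grouping rule are stand-ins for the real data and logic:

```python
# Minimal sketch: testing a categorization scrubber against known raw titles.
# The raw titles and grouping rule are illustrative, not the production logic.
import re

def categorize(raw_title):
    title = raw_title.strip().lower()
    if re.search(r"macbook\s*pro", title):
        return "MacBook Pro"
    if "macbook" in title:
        return "MacBook"
    return "Other"

samples = {
    "MacBook Pro (Retina, 15-inch, Mid 2014)": "MacBook Pro",
    "Apple MacBookPro11,3":                    "MacBook Pro",
    "MacBook Air (13-inch, 2015)":             "MacBook",
    "ThinkPad X1 Carbon":                      "Other",
}

for raw, expected in samples.items():
    assert categorize(raw) == expected, f"unexpected group for {raw!r}"
print("all sample titles categorized as expected")
```

Each new raw variant found in the sample data becomes another entry in the expected set, so the scrubber’s behavior is pinned down before it runs on the full dataset.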
Figure 1-3. Example of laptop manufacturing information
Determining Acceptable Error Rates
It’s important to understand acceptable errors in data. This will vary between datasets, but, overall, you want to understand what an acceptable industry standard is and understand the kinds of decisions that are being made with the data to determine the acceptable error rate. The rule of thumb I use is that edge case issues representing less than one percent of the dataset are not worth a lot of time, because they will not affect trends or final analysis. However, you should still investigate all issues at some level to ensure that the set affected is indeed small or at least outside the system (i.e., caused by someone removing code that tracks usage because that person believed it “does not do anything”).
In some cases, this error rate is not obtainable or not exact; for example, some information we appended assigned sentiment (positive, negative, neutral) to posts viewed by users in the online forums that are part of Spiceworks’ community.
To determine our acceptable error rate, we researched sentiment analysis as a whole in the industry and found that the average accuracy rate is between 65 and 85 percent. We decided on a goal of a 25 percent error rate for posts with incorrect sentiment assigned because it kept us in the top half of accuracy levels achieved in the industry.
When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data “ignored” in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.
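A quick way to sanity-check such a budget is to sum the worst-case error contributed by each source, with each rate expressed as a fraction of the full dataset. The numbers in this small sketch are made up:

```python
# Minimal sketch: checking that per-source error budgets stay within the overall limit.
# The per-product error rates below are made-up numbers, expressed as fractions
# of the full dataset, so their sum is an upper bound on the combined error.
per_product_error_rates = [0.001] * 10    # one-tenth of a percent in each of 10 products

worst_case_combined = sum(per_product_error_rates)
overall_budget = 0.01                     # 1 percent acceptable error

within_budget = worst_case_combined <= overall_budget + 1e-9   # tolerate float rounding
print(f"worst case combined error: {worst_case_combined:.3%}, within budget: {within_budget}")
```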
After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page’s load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be retained even during backups. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and better estimating the impact the errors could have.
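A hedged sketch of that kind of check: group the problematic records by day of week and hour and look for spikes. The timestamps below are fabricated for illustration:

```python
# Minimal sketch: counting problem records by day of week and hour to surface patterns.
# The timestamps below are fabricated; real input would be the failing rows themselves.
import pandas as pd

errors = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-11-07 21:05", "2016-11-14 21:12", "2016-11-21 21:03",  # Monday evenings
        "2016-11-09 10:30",
    ])
})

pattern = (
    errors
    .assign(day=errors["ts"].dt.day_name(), hour=errors["ts"].dt.hour)
    .groupby(["day", "hour"])
    .size()
    .sort_values(ascending=False)
)
print(pattern)   # a spike on Monday evenings points at something scheduled, like a backup
```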
Examples are key to providing developers, or whoever is doing the data processing, with insight into the cause of the problem and to providing a solution or way to account for the issue. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.
Creating Work Groups
Achieving data integrity often requires working with other groups or development teams to understand and improve accuracy. In many cases, these other teams must do work to improve the systems. Data integrity and comparability become more important the higher up in an organization the data is used. Generally, lots of people or groups can benefit from good data, but they are not usually the ones who must create and maintain the data so that it is good [4]. Therefore, the higher the level of support and usage of data (such as from managers and executives), the more accurate the system will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for other teams helping to create and maintain that data. This does not mean that collection and accuracy aren’t important before higher-level adoption of data, but without it the process of coordinating between teams and maintaining data can be much longer and more difficult.
One effective way to influence this coordination among team members is to consistently show metrics in a view that’s relevant to the audience. For example, to show the value of data to a developer, you can compare usage of products before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product—and even different features within some products—had individual definitions for categories. After weeks of work to consolidate the categories and create a new, more versatile way to manage and maintain them, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply the new system. The key to getting others to adopt the new system was showing value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-to’s and groups for our online forums.
There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization—very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now, the new system is widely used, and used for the potential seen when building it initially several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.
Collaboration helped achieve confidence in the accuracy because each different application was fully adopting the new categorization, which then vetted individual product edge cases against the design and category set defined. In a few cases, the categorization system needed to be further refined to address those edge cases, such as to account for some software and hardware that needed to belong to more than one category.
Reassessing the Value of Data Over Time
The focus and value placed on data in a company evolves over time. In the beginning, data collection might be a lower priority than acquiring new clients and getting products out the door. In my experience, data consumption begins with the product managers and developers working on products who want to understand if and how their features and products are being used, or to help in the debugging process. It can also include monitoring system performance, and it can grow quickly when specific metrics are set as goals for the performance of products.
After a product manager adopts data for tracking success and failure, the goal is to make that data reportable and sharable so that others can also view the data as a critical method of measuring success. As more product managers and teams adopt data metrics, those metrics can be shared and standardized. Executive-level adoption of data metrics at a high level is much easier with the data in a uniform and reportable format that can measure company goals.
If no parties are already interested in data and the value it can bring, this is a good opportunity to begin using data to track the success of products that you are invested in and share the results with