Table of Contents

Introduction
Chapter 1: Big Data Security Rationales
Finding Threats Faster Versus Trusting a Tool
Big Data Potentially Can Change the Entire Architecture of Business and IT
Chapter 2: Securing HeavyD
Why Big Data Security is Necessary
Does Security Even Work?
Chapter 3: How Does Big Data Change Security?
Frameworks and Distributions
Shrink the Square Peg to Fit a Round Hole?
Chapter 4: Understanding Big Data Security Failures
Scope of the Problem
Can We Get Beyond CIA?
Chapter 5: Framing the Big Data Security Challenge
Why Not Give Up and Wait?
Can Privacy Help Us?
Chapter 6: Six Elements of Big Data Security
Threat Model for a Hadoop Environment
Six Elements
Automation and Scale
Bottom Line on Network and System Security
Element 2: Data Protection
Bottom Line on Data Protection
Element 3: Vulnerability Management
Bottom Line on Vulnerability Management
Element 4: Access Control
Bottom Line on Access Control
Bottom Line on Policies
Conclusion
You can’t throw a stick these days without hitting a story about the future of Artificial Intelligence or Machine Learning. Many of those stories talk at a very high level about the ethics involved in giant automation systems. Should we worry about how we use newfound power in big data systems? While the abuse of tools is always interesting, behind the curtain lies another story that draws far less attention.
These books are written with the notion that all tools can be used for good or bad, and ultimately what matters for engineers is to find a definition of reasonable measures of quality and reliability. Big data systems need a guide to be made safe, because ultimately they are a gateway to enhanced knowledge. When you think of the abuse that can be done with a calculator, looking across the vast landscape of fraud and corruption, imagine now if the calculator itself cannot be trusted. The faster a system can analyze data and provide a “correct” action or answer, the more competitive advantage to be harnessed in any industry. A complicated question emerges: how can we make automation tools reliable and predictable enough to be trusted with critical decisions?

The first book takes the reader through the foundations for engineering quality into big data systems. Although all technology follows a long arc with many dependencies, there are novel and interesting problems in big data that need special attention and solutions. This is similar to our book on “Securing the Virtual Environment,” where we emphasize a new approach based on core principles of information security. The second book then takes the foundations and provides specific steps in six areas to architect, build, and assess big data systems. While industry rushes ahead to cross the bridges of data we are excitedly building, we might still have time to establish clear measurements of quality, as it relates to whether these bridges can be trusted.
This chapter aims to help security become an integral part of any big data systems discussion, whether it is before or after deployment. We all know security isn’t embedded yet. Security is the exception to the deployment discussion, let alone the planning phase. “I’m sure someone is thinking about that” might end up being you. If you have a group talking about real security steps and delivery dates early in your big data deployment, you are likely the exception. This gap between theory and reality partly is because security practitioners lack perspective on why businesses are moving towards big data systems; the trained risk professionals are not at the table to think how best to approach threats and vulnerabilities as technology is adopted. We faced a similar situation with cloud technology. The business jumped in early, occasionally bringing someone from the security community in to look around and scratch their head as to why this even was happening.
Definitions of a big data environment are not the point here (we’ll get to that in a minute), although it’s tempting to spend a lot of time on all the different descriptions and names that are floating around. That would be like debating what a cloud environment really is. The semantics and marketing are useful, yet ultimately not moving us along much in terms of engineering safety. Suffice it up front to say this topic of security is about more than just a measure of data size and is something less tangible, more sophisticated, and unknown in nature. We say data has become big because size matters to modes of operation, while really we also imply here a change in data rates and variations. In rough terms, the systems we ran in the past are like horses compared to these new steam engine discussions, so we need to take off our client-server cowboy hat and spurs, in order to start thinking about the risks of trains and cars running on rails and roads. Together, the variables have become known as engines that run on 3V (Volume, Velocity, Variety), a triad which Gartner apparently coined first around 2001.
The rationale for security in this emerging world of 3V engines is really twofold. On the one hand, security is improved by running on 3V (you can’t predict what you don’t know), and on the other hand, security has to protect 3V in order to ensure trust in these engines. Better security engines will result from 3V, assuming you can trust the 3V engines. Few things speak to this situation of faster and better risk knowledge from safe automation than the Grover Shoe Factory Disaster of 1905.
On the left you see the giant factory, almost an entire city block, before the disaster. On the right you see the factory and even neighboring buildings across the street turned into nothing more than rubble and ashes.
The background to this story comes from an automation technology rush. Around 1890 there were 100,000 boilers installed, as Americans could not wait to deploy steam engine technology throughout the country. During this great boom, in the years 1880 to 1890, over 2,000 boilers were known to have caused serious disasters. We are not just talking about factories in remote areas. Trains with their giant boilers up front were ugly disfigured things that looked like Cthulhu himself was the engineer.
Despite decades of death and destruction through the late 1800s, the Grover Shoe Factory still had a catastrophic explosion in 1905, with cascading failures that leveled its entire building, burning it to the ground with workers trapped inside.
This example helps illustrate why trusted 3V engines are as important as, if not more so than, the performance benefits of a 3V engine. Nobody wants to be the Grover Shoe Factory of big data, so that is why we look at the years before 1905 and ask how the rationale for security was presented. Who slept through safety class or ignored warnings when building a big data engine? We need to take the security issue very seriously, because these “engines” are being used for very important work, and security issues can have a cascading effect if not properly implemented.
There is a clear relationship between the two sides of security: better knowledge from data and more trusted engines to process the data. I have found that most people in the security community are feverishly working on improving the former, generating lots of shoes as quickly as possible. The latter has mostly been left unexplored, leaving unfinished or unknown how exactly to build a safe big data engine. That is why I am focusing primarily on the rationale for security in big data with a business perspective in mind, rather than just looking at security issues for security market/industry purposes.
Finding Threats Faster Versus Trusting a Tool
Don’t get me wrong. There is much merit in the use of 3V systems for data collection in order to detect and respond to threats faster. The rationale is to use big data to improve the quality of security itself. Many people actively are working on better security paradigms and tools based on the availability of more data, which is being collected faster than ever before with more detail. If you buy a modern security product, it is most likely running on a big data distribution. You could remove the fancy marketing material and slick interface and build one yourself. One might even argue this is just a natural evolution from the existing branches of detection, including IDS, SIEM, AV, and anti-SPAM. In all of these products, the collection and analysis of as much data as possible is justified by the need to more quickly address real threats and vulnerabilities.
Indeed, as the threat intelligence community progressed towards an overwhelming flow of data being shared, they needed better tools. From collection and correlation to visualization to machine learning solutions, products have emerged that can sift through data and get a better signal from the noise. For example, let’s say three threat intelligence feeds have the same indicator of compromise and are only slightly altered, making it hard for humans to see the similarities. A big data engine can find these anomalies much quicker. However, one wonders whether the engine itself is safe, while being used to quickly improve our knowledge of threats and vulnerabilities.
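To make the “slightly altered indicator” idea concrete, here is a minimal sketch of matching near-duplicate indicators across feeds. The feed names and sample values are hypothetical, and a real engine would use similarity hashing at scale rather than pairwise string comparison.

```python
# A toy sketch, assuming hypothetical feed data: flag indicators that differ
# only slightly between feeds, which a human scanning raw lists would miss.
from difflib import SequenceMatcher
from itertools import combinations

feeds = {
    "feed_a": ["malware-update.example.com", "198.51.100.23"],
    "feed_b": ["malware-updates.example.com", "203.0.113.7"],
    "feed_c": ["malware-update.example.co", "198.51.100.23"],
}

def similar(a, b, threshold=0.9):
    """Return True when two indicators look like variants of each other."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Compare every indicator pair across different feeds and report near-matches.
for (name_a, iocs_a), (name_b, iocs_b) in combinations(feeds.items(), 2):
    for a in iocs_a:
        for b in iocs_b:
            if a != b and similar(a, b):
                print(f"possible variant: {a} ({name_a}) ~ {b} ({name_b})")
```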
Big Data Potentially Can Change the Entire Architecture of Business and IT
It makes a lot of sense at face value that instead of doing analysis on multiple sources of information and disconnected warehouses, a centralized approach could be a faster path with better insights. The rationale for big data, meaning a centralized approach, can thus be business-driven, rather than driven by whatever reasons people had to keep the data separate, like privacy or accuracy.
Agriculture is an excellent example of how an industry can evolve with new technology. Replace the oxen with a tractor, and look how much more grain you have in the silos. Now consolidate silos with automobiles and elevators and measure again. Eventually we are reaching a world where every minute piece of data about inputs and outputs from a field could help improve the yield for the farmer.
Fly a drone over orchards and collect thermal imagery that predicts crop yields or the need for water, fertilizer, or pesticides; these inexpensive bird’s-eye views and collection systems are very attractive because they can significantly increase knowledge. Did the crop dusting work? Is one fertilizer more effective at less cost? Will almonds survive the drought? Answers to the myriad of these business questions are increasingly being asked of big data systems.
Drones even can collect soil data directly, not waiting for visuals or emissions from plants, predicting early in a season precisely what produce yields might look like at the end. Robots can roam amongst cattle with thermal monitors to assess health and report back like spies on the range. Predictive analysis using remote feeds from distributed areas, which is changing the whole business of agriculture and risk management, depends on 3V engines running reliably.
Today the traditional engines of agriculture (diesel-powered tractors) are being set up to monitor data constantly and provide feedback to both growers and their suppliers. In this context, there is so much money on the line, with entire markets depending on accurate prediction, that everyone has to trust the data environment is safe against compromise or tampering.
A compromise in the new field of 3V engines is not always obvious or absolute. When growers upload their data into a supplier’s system, such as a seed company, that data suddenly may be targeted by investors who want to get advance knowledge of yields to game the market. A central collection system would know crucial details about market-wide supply changes long before the food is harvested. Imagine having just one giant boiler in a factory, potentially failing and setting the whole business on fire, rather than having multiple redundant engines where one can be shut down at the first sign of trouble.
To be fair, the 1905 shoe factory had dual boilers. It is a mystery to this day why the newer, more secure model wasn’t being used. Instead, they kept running an older one at unsafe performance levels. Perhaps someone thought the new one was harder to manage, or was not yet as efficient because of safety enhancements.
Again I want to emphasize that it is very, very easy to find warnings about the misuse or danger of 3V systems. The ethics of using an engine that carries great power are obvious. Examples of what is really at stake can be found nearly anywhere. Search on Google, for example, using terms like “professional hair” and “unprofessional hair” and you quickly see a problem.
Social scientists could talk for days about the significance of obviously imbalanced results like these, perpetuating bias. This is a shocking yet somewhat lighthearted example. Even more troubling is predictive policing technology that perpetuates bias by consistently ranking risk based on race, further entrenching racism in justice systems. This kind of analytic error leads to systems that do a poor job of predicting actual violent crime, making obvious mistakes. This old cartoon hints at the origin of the black hoodie for stoking fear when talking about hackers, or criminals of any kind, really.

Clearly people driving the engine aren’t exactly working that hard at ensuring common concepts of safety or accuracy are in place for actually useful results. It is almost as though Douglas Adams was right in his joke that the world’s smartest computer, when asked the meaning of life, would simply reply “42.” That is what makes careless errors so easy to find. And it is fundamentally a different problem than what I would describe as issues of quality in the engines beneath these poorly executed usages. In fact, I would argue quality in the engine is set to become an even more serious issue as we put pressure on users to think about bias and prejudice in their application development. The troubles will shift behind the curtain.
At least you have some leverage when algorithms are poorly orchestrated. A Google engineer can say “Oops, I forgot to train the algorithm on non-white faces.” When the algorithms depend on infrastructure that fails, will the same be true? Will engineers be able to say, “Oops, I see that someone two days ago was injecting bad data into our training set” and set about fixing things? To make a finer point, it doesn’t matter how good an algorithm is if the engine lacks data integrity controls. Network communication or storage without a clear and consistent way to prevent malicious modification is a problem big data environments have to anticipate.
The results we see often can be explained as a function of machines presenting answers to questions that reinforce our existing bias. What also needs to be investigated and prepared for, more the focus of this set of books, is how to build protection for the underlying systems themselves. We need to go deeper, to where we are talking about failures beyond bias or error by operators. We need to be talking about a threat model exercise, and thinking about how to stay safe when it comes to attackers who intend to poison, manipulate, or otherwise break your engine.
We saw a very real-world case of this with Microsoft’s “learning” bot called TayTweets. Within a day of it being launched on Twitter, a concerted effort was made by attackers to poison its “learning” system. I use scare-quotes here because I quickly uncovered that this supposedly intelligent system was being tricked using a dictation flaw, and not really learning. For every strange statement by the bot, I looked at the conversation thread and found someone saying “Repeat after me!” Here you can see what I found and how I found it.
TayTweets was presented as a system able to handle huge amounts of data and learn from it. However, by issuing a simple dictation command the attackers could bypass “learning” and instead expose a “copy and paste” bot. Then the attacker simply had to dictate whatever they wanted TayTweets to say so they could take a screenshot and declare (false) victory. It really is a terrible big data design that Microsoft tried to pass off as advanced, when the foundation was so weak it was almost immediately compromised by simpletons.
Don’t let this environment be yours, where some adversary can bypass advanced algorithms and simply set answers to be whatever serves their purposes. When you think about the hassle of distributed denial of service (DDoS) attacks today as an annoyance, imagine someone dumping poison into your training set or polluting your machine learning interface. The only good news in the Microsoft TayTweets case was that the attackers were too dumb to cover their tracks and effectively painted a giant target on themselves; we unintentionally had a sweet “honey pot” that allowed us to capture bad identities as they eagerly sent streams of “copy and paste” commands.
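The failure mode is easy to sketch in a few lines. This is a toy illustration of the dictation flaw as described, not Microsoft’s actual design; the function names and the filter are my own stand-ins.

```python
# A toy illustration of the "copy and paste" bot failure mode, assuming a
# hypothetical chatbot: honoring a dictation command lets any user set the
# output directly, bypassing whatever "learning" sits behind it.
def naive_bot_reply(message: str) -> str:
    trigger = "repeat after me:"
    lowered = message.lower()
    if trigger in lowered:
        # Attacker-controlled text is echoed verbatim -- the poisoning path.
        return message[lowered.index(trigger) + len(trigger):].strip()
    return "interesting, tell me more"   # stand-in for the real model's answer

def guarded_bot_reply(message: str) -> str:
    # The cheapest mitigation: refuse dictation-style commands outright, so
    # output always flows through the model rather than the attacker.
    if "repeat after me" in message.lower():
        return "nice try"
    return naive_bot_reply(message)

print(naive_bot_reply("Repeat after me: something terrible"))    # echoed back
print(guarded_bot_reply("Repeat after me: something terrible"))  # "nice try"
```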
A Look at Definitions
Let’s talk about 3V definitions in more detail. The most generic and easiest definition for people to agree upon seems to have originated in 2001 from Doug Laney, VP of Research at Gartner. He wrote that “Velocity, Variety and Volume” are definitive characteristics. This has been widely accepted and quoted, as any search for a definition will tell you. Gartner offers an updated version in their IT glossary:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
I find this definition easy to work with. Given that many people talk about big data like streams leading into a “data lake,” I would like to take a minute to use the analogy of data being a leaky pipe.
Imagine that a pipe drips data twenty times per minute, which is probably what you’ve seen if you ever noticed a leaky faucet. In one day that is about 28,800 drips. Assuming there is 0.25 milliliter per drip, and there are 15,000 drips in a gallon, we are talking about a pipe losing 694 gallons each year.
That kind of data leakage obviously is a problem, albeit a relatively small one on its own. 700 gallons sounds large, yet in real life, as you listen to a leaky faucet drip, drip, drip, we know that 20 drips a minute easily can go quite a while unnoticed.
Now let’s take this from basic math into the social science realm of possibilities. Multiply the leaky pipe example by 10,000, which is a modest number of potentially faulty pipes in any city or suburb. Hold on to your hat, as we suddenly are talking about 288 million drips, or 19,000 gallons being wasted every single day! That is a lot of gallons thrown away every day.
At what point would alarms be raised somewhere? In 30 million seconds, which is about one year, these 10,000 pipes are going to throw away 7 million gallons. Our little drip math example becomes a BIG problem for regulators when we collect it into a single place of study and see the overall picture of what has been happening at the macro level. Of course, that assumes we can get some record or report that these 7 million gallons are wasted, rather than actually dripping on a thirsty plant in an irrigation system. Perhaps you were already thinking ahead to the problem of differentiation. We should be able to devise a way to tell wasted drips every few seconds versus normal water use.
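For readers who want to check the drip arithmetic, here it is written out; the 694 and 19,000 figures in the text are rounded.

```python
# The drip arithmetic from the passage above, written out so rounding is visible.
DRIPS_PER_MINUTE = 20
DRIPS_PER_GALLON = 15_000

drips_per_day = DRIPS_PER_MINUTE * 60 * 24            # 28,800 drips
gallons_per_day = drips_per_day / DRIPS_PER_GALLON    # ~1.9 gallons
gallons_per_year = gallons_per_day * 365              # ~700 gallons, one pipe

pipes = 10_000
city_drips_per_day = drips_per_day * pipes            # 288,000,000 drips
city_gallons_per_day = gallons_per_day * pipes        # ~19,200 gallons
city_gallons_per_year = city_gallons_per_day * 365    # ~7,000,000 gallons

print(f"one pipe: {drips_per_day:,} drips/day, {gallons_per_year:,.0f} gallons/year")
print(f"{pipes:,} pipes: {city_drips_per_day:,} drips/day, "
      f"{city_gallons_per_year:,.0f} gallons/year")
```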
All of this is before we even add in measuring variety, such as detecting various pollutants in every drip versus clean water. What kind of drips are they? Adding variety magnifies any kind of effort required (e.g., the water has unknown and varied ingredients that require testing to identify). Thus BIG means you need “cost-effective, innovative forms of information processing” to understand the leaks and decide what to do about them.
Or more to the point of common big data usage, maybe our reports do not come from liquid sensors. People could take pictures of these leaky pipes and post to Instagram, or Tweet about drips, or have a “report a leak” application based on an API that gets queried and reports are posted in a shared folder. The possibilities of knowledge gathering engines are really ridiculously infinite when you consider the output possible for every sensor type for every related concern.
Now let’s go to national scale, just for the sake of argument. If we build an analysis map of water, we might end up with something like Surging Seas. Their interactive guide shows what happens to the coastline when oceans rise. Basically the eastern pointy bits of the country disappear.
Risk Areas from Surging Seas
A 3V world of data like this might seem obvious when you pull back far enough on the scope controls to see rising water level impact on the coastline. The important thing here is traditional IT systems would struggle to crunch data quickly enough to generate one city of data, let alone national, interactive, accurate maps. Think about the latest, greatest superhero movie and all the artificial worlds with disasters they can generate on supercomputers; we quickly are becoming dependent on the highest-performance analysis money can buy. Processing lots of varied data that you’re acquiring very quickly, meaning creating knowledge from the freshest inputs possible, is the foundation of big data engines.
Take a little bit of data and you can capture, store, and measure it with your own tools in an environment you bought and paid for as a one-time purchase. Increase the rate enough, which leads to an increase in volume, and special tools and skills become required, potentially leading you to a datacenter-like environment you have to share without owning. We are talking about the shift from working with 700 gallons to 7 trillion gallons, or ultimately even being able to measure the ocean. It reminds me of when I used to work for a CIO who was fond of warning me “don’t boil the ocean” at the start of every project. Little did he realize that the power of distributed nodes (humans and our machines) generating output (carbon emissions) would create climate change fast enough to actually boil the ocean. With big data environments, I could have said back to him, “We’ll see if we can figure out how to cool it down.”
With a simple 3V definition in hand, you would think things could progress smoothly. And yet the more I met with people to discuss definitions and examples, the more I found 3V didn’t give much room to talk about security. Big security for big data? Didn’t sound right. High security? That sounds more normal. Can we have high security big data? Eventually, at the suggestion of others, I experimented with heavy instead of high terminology. The thought at the time was to start using a new term and see if it sticks. It didn’t, but it helped find answers in how to get security into the definition. In the next chapter I will explain why there is a certain gravity to big data when discussing how to define risk.
Heavy data, shortened to HeavyD, started off as sort of a joke. Humor supposedly helps with difficult subjects, and creates new insights. Whereas big is relative to volume, heavy relates to force required. It is sort of like saying high security, but even more scientific. Back to the leaking pipes example we used in the definition above, Newton’s second law of motion helps explain safety in terms of water:
There is a specific effort, a newton (N), required to keep an object on the surface instead of under water. If a person who weighs 75kg falls off a ferry into water there would be a displacement effect, let’s say 4N of water. Given 10N per 1kg, our swimmer thus displaces 0.4kg of water, leaving 70kg. A 275N lifejacket, an adult standard size, gives 27.5kg of uplift, more than enough to float and survive.
Survival sounds really good. It reminded me of a lock or cryptography surviving an attack. So could we talk about the effort required for survival within data lakes in terms of heavy or light? The analogy is tempting, yet I found most people don’t like to think about a world in terms of lifejackets, never mind put one on, or to think about Newton. “The guy hit in the head by an apple” doesn’t have the right ring to it. Einstein is apparently all the rage these days. It seemed to make more sense to use the increasingly popular framework of relativity to explain weight and heavy data; our ability to hold weight is bigger and bigger depending where you are on the timeline of transitions from analog to digital tools.
We are working to capture and interpret an infinite amount of analog data with our highly limited digital tools. From that perspective, today’s big data tools in about five years’ time no longer would be considered big, given Moore’s law of progress in compute power. A tour of the Computer History Museum may have reinforced my reasoning on this point.
The meaning of big data today versus hundreds of years ago, put in terms of our ever-changing computational power, is a reflection of how our tools today operate versus then. Walking along evolutions of machines seemed to say what we consider big depends on the time, as we have been working on big data “problems” for a very, very long time. The problems of navigation in the 1400s were astronomically difficult. Today navigation is so trivial, everyone has a tiny chip that can tell you for virtually no cost how to find the best route to black pepper vendors in India. Columbus-era big data would be a laugh, just like his tiny boats, compared to what we can do with technology now.
My father used to explain cultural relativity in a similar way. You and I might live in different time zones. For someone in Moscow, it can be 8 in the morning while it is 10 at night for someone in San Francisco. Yet both people share a global idea of absolute time. A definition of big data in this light would be that you and I share a definition of data, but big for you is a different number than for me. This brings us to the awesome question, “Can the same security solutions fit an absolute definition, or must they be relative, also?” Perhaps it is more like asking whether a watch can work in different time zones versus whether it can work with different intervals of time; as far as I know, no one tried to monitor milliseconds with a sundial.
Managers today seek “enhanced insight and decision making” but they will not escape the fundamentals of data like integrity or availability. This is clear. At the same time, we are entering a new world in security where the old rules may no longer apply. We simultaneously need a different approach than we have used in the past to handle new speeds, sizes, and changes while also honoring concepts we know so well from where we started. Petabytes already no longer are an exception or considered big, as the IDC explained in “The Digital Universe in 2020.”
From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years.
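A quick back-of-envelope check shows how those figures hang together: a factor of roughly 300 over fifteen years implies a doubling time of a bit under two years, and 40,000 exabytes spread over a world population (an assumption of mine, roughly 7.6 billion in 2020) lands near the 5,200 gigabytes per person the report cites.

```python
# Back-of-envelope check of the IDC figures quoted above; the 2020 population
# estimate is an assumption on my part, not from the report.
import math

start_eb, end_eb = 130, 40_000          # exabytes, 2005 and 2020
years = 2020 - 2005

growth_factor = end_eb / start_eb                                # ~308x
doubling_years = years * math.log(2) / math.log(growth_factor)  # ~1.8 years

population_2020 = 7.6e9                                          # assumed
gb_per_person = end_eb * 1e9 / population_2020   # 1 EB = 1e9 GB -> ~5,260 GB

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied doubling time: {doubling_years:.1f} years")
print(f"per person in 2020: {gb_per_person:,.0f} GB")
```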
The predictions about size usually just sound like a bunch of numbers. This size, then this size, then this size. What does any of this really mean to heavy data? One of our best sources of breaches for studying how to respond and improve global security actually will fit on a thumb drive. That’s right; in the world of security, about 2GB was what was considered big for some current definitions of heavy. Meanwhile the CEO of Pivotal was boasting in 2014 that his customers were gearing up to work with 500PB/month. When security analysts are excited to work with 2 gigabytes for insights and trends and knowledge, and businesses are pushing past 500,000 terabytes/month in their plans, I see a significant delta. Fortunately, things change and I expect security to see a massive increase.
Perhaps a good real-world example of the potential rate of growth was in 2008 when I worked at Barclays Global Investors. We secured an environment with many petabytes of real-time global financial data. Our size in terms of data analysis (quant) capabilities was arguably leading the world back then. Just five years later, “many petabytes” no longer was considered exceptional in the industry. That is a strikingly fast evolution of size. With luck, the security industry will be catching up any moment now to other industries.
Our concept of what is really heavy already should be headed into an exabyte range (1,000 petabytes). Some of our present security tools will survive these transitions from light to heavy data, and some may not. Data everywhere, for purposes of this book, means using current tools to acquire, process, and store the heaviest possible amount of data. While our notion of whether a tool is current becomes far less constant (e.g., cloud computing gives us incredibly rapid expansion of capabilities), our security controls must be designed to maintain consistency across this different scope of data to be effective.
Take Security Information and Event Management (SIEM) tools as an example. We are seeing a continuous shift from collection and correlation as a market to the present demand for real-time analysis and forensics of as much data as possible. Although there were a plethora of vendors five years ago in the log management market (over 15!) that offered proprietary off-the-shelf log management infrastructure, today they compete with the even larger cloud infrastructure (IaaS) market. “Pallets shipped” used to be a metric of SIEM vendor success, which due to the success of Amazon’s service model has become almost an extinct concept. This transformation parallels the more general IT market and can be illustrated with three generations of data analysis capability: Batch, Near-time, and Real-time.
Batch systems were the SIEM objective and leaders of five years ago. They solved for problems of integrity and availability, such as UDP packet loss and the need for dedicated immutable storage for investigations. They struggled with over-promising analytic and investigation capabilities, which has opened the door to a new set of vendors. In short, the foundation capabilities were to collect all data and store it. As this value became commoditized by cloud infrastructure, customers sought real analytics and speed for knowledge.
Because a query on the established archive data system could take weeks or even months, a new era of near-time systems evolved. These products assume varied volumes of high-velocity data (batch infrastructure) and provide significant analytic performance enhancements, such that a human analyst can get results fast enough to pivot and explore through ongoing investigations. What looks like a suspicious external IP address? Perhaps you found a command and control destination. Now what internal IP addresses are talking to that external IP? Perhaps you found infected machines.
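That pivot is, at bottom, a filter-and-group query over flow or proxy logs. Here is a hedged sketch with pandas; the file name and column names (src_ip, dst_ip, bytes_out) are assumptions, and a near-time product would run the same logic against its own data store.

```python
# A sketch of the analyst pivot described above, run over hypothetical flow
# records; column names and the file name are assumptions for illustration.
import pandas as pd

flows = pd.read_csv("netflow.csv")          # hypothetical export of flow logs

suspicious_c2 = "203.0.113.57"              # the external IP that looked odd

# Pivot: which internal hosts talked to that destination, and how much?
talkers = (
    flows[flows["dst_ip"] == suspicious_c2]
    .groupby("src_ip")["bytes_out"]
    .agg(["count", "sum"])
    .sort_values("sum", ascending=False)
)

print(talkers.head(10))   # candidate infected machines to investigate next
```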
Real time is the newest and emerging market. These tools attempt to raise the performance bar again, offering forensic-level capability to capture attacks as they happen and peel apart detailed indicators of compromise. They also introduce machine learning capabilities to reduce the false-positive and noise rate, which will likely be the Achilles heel of the near-time market. That really means that the further you move from left to right on this continuum, the heavier the data you should be able to handle, safely.
All of these security data management solutions will offer batch generation capability, although many are off-loading this architecture to the cloud. Some solutions offer near-time capability, which usually indicates big data technology instead of just batch. In fact, if you pull back the curtain you likely will find a big data distribution sitting in front of you. Only a few solutions so far are integrating digital forensics and incident response tools to provide real-time capabilities from their big data near-real-time systems. Of course, integration of the three isn’t required. All three of these phases can be built and configured instead of packaged and sold off the shelf. However, if you’re talking truly heavy data, then questions quickly arise as to how safe things are.
Walking into a meeting with an entire security group can sometimes be intimidating. On one occasion, I was led through long marble-lined hallways past giant solid wooden doors to a long table surrounded by very serious looking faces. “Would you like some water?” the head of the table asked me, as if to indicate I should be properly prepared for a very dry conversation. “We understand your system is collecting voice traffic,” he continued, “and we have strict instructions to not collect any voice data in the system.” Their concern, rightly so, was that someone breaching the repository of data would be able to play back conversations. The Nixon tapes had made an impression on someone, I supposed, and they wanted to talk about how to keep the security big data system free of conversations. Of course, I wanted to talk about how safe the system could be made.
Why Big Data Security is Necessary
You have seen several examples already of how big data is sold as a way to achieve rapid intelligence for success and avoid failure from delays. Cures for disease, better agriculture, safer travel, saving animals from extinction; the list gets longer every day. It is sold as everything from finding a new answer to looking up prior knowledge to avoid mistakes and waste. In many ways and in virtually every industry, velocity can be pitched as a better path because it can fundamentally alter business processes. This is the steam engine transition away from horses and to a whole new concept of speed. Yet how comfortable really should you be with the brakes and suspension of a car before you push the accelerator to the floor? Perhaps you have heard of someone who avoided disaster just in time due to finding what they needed with enhanced speed. Or maybe you know of someone who rushed ahead without crucial input and created a disaster?
There are risks ahead that seem awfully familiar to anyone who reads the unfortunate story of the Grover Shoe Factory. The question is whether we can figure out the baseline of reliability or control we need to avoid serious disaster. The tools necessary for high-performance data management and analysis include data protection. The reality is that big data environments presently require investment and planning beyond any default configuration (default safe configuration simply is not the way IT runs). Before we open up the power of big data, consider whether we have put in place the confidentiality, integrity, and availability we need to keep things pointed in the right direction.
An interesting example might be the 2013 UNICEF Digital Maps project. A global organization asked youth to be “urgency rank testers” and “prioritize issues and reduce disaster risk” by uploading photos in their neighborhoods to a cloud site. On the face of it, this project sounds like an opportunity for kids to try and make their piece of the world a better place. In aggregate, it gives global planners a better idea of where to send clean-up resources or change policies.
You might imagine, however, that some people do not appreciate kids sending pictures of their pollution to an external authority for judgement - whistleblowers in literally every place a child can go. So the big data system becomes very heavy very quickly, and has to work hard to protect the confidentiality and integrity of reports.
I intend to discuss measurements across the usual areas, to avoid reinvention of ideas where possible. What I’m talking about is confidentiality, integrity, and availability. Trying to get somewhere fast with the analysis of data, especially when you have kids acting as sensors to report environmental issues, can easily run into trouble with all kinds of threat models. I’ll touch on threats to availability first. Perhaps my favorite example of threat modeling comes from a very large retail company that consolidated all its data warehouses into a central one. Once data was pooled it made access easy for all its developers and scientists. Easy access also unfortunately meant easy mistakes could be made, worsened by a lack of any accountability or audit trail.
A single wrong command, by a well-intentioned intern, destroyed their entire consolidated data set. The reload took several days, negating all the speed gains for their analysis. In other words, moving to the consolidated big data environment was justified because they saw answers to queries in a day instead of a week. Losing all the data meant they were back to a week before they could get their answers.
This real-world example was echoed at a big data conference when a man from Home Depot stood on stage and told the audience his CEO almost cancelled the entire Hadoop project when it became non-responsive. Initial results were so promising, so much insight gained, that the company rushed forward with their infrastructure. At the point of using it in production there was an availability error and the CEO apparently was fuming mad as engineers scrambled to resurrect the system. Cool story.
“Oops, I dropped the entire table” or “it might be a few more hours” becomes a catastrophic event in the high-speed, open, and uncontrolled big data environment. Any time saved from data being so easily accessible is lost as data goes completely unavailable. Availability therefore is presented in this book as the first facet of protecting data performance.
Once a system is available, we will then discuss data integrity. If the data both sits available for use and has not been corrupted, we can then talk about protecting confidentiality. You may have noticed the simple reason for approaching data protection in this particular order. There is no need to measure integrity and confidentiality when you are in a “no go” situation. Systems are offline. When systems fail, the other measurements basically are zero. So first we might as well talk about ensuring systems are “go” before we get to the levels of how to protect data on those systems.
Next, assuming I have achieved reasonable availability through fancy new distributed computing technology for big data environments, data integrity failure comes into focus. This tends to be the source of some very interesting results, changing the very knowledge and intelligence we gain from big data, which is why I anticipate it will become the most important future area of risk. High availability of bad data (spreading bad information) can be as bad as no availability at all. Maybe it can be worse. Some say it is more dangerous to publish incorrect results than to fail to publish anything at all, since you then have correction and cleanup to apply resources towards. I still get the impression from environments I work in that they would rather be online than offline, so I’m sticking with availability then integrity. Your mileage may vary.
When talking about integrity, I’m tempted to use the old saying, “You can’t connect dots you don’t have.” This implies that if you remove privacy, you get more data, which produces more knowledge. It sounds simple. Yet it really is also an integrity discussion. What comes to mind when you see this image from Saarland University researchers (http://www.mia.uni-saarland.de/Research/IP_Compress.shtml)?
Do you see a big white bird or a plane? Do you see someone’s right arm? Some people I have shown this image to have said, “that’s Lena!” The image you are looking at there has been reconstructed after 98% of the data was lost. That’s right, only 2% of the data was remaining, integrity almost completely destroyed. Perhaps I should say compressed. Since the beginning of computers, people have seen great financial benefit in reducing data while maintaining integrity of images. Streaming video and music are great examples of this. So here is a look at the “random” data that actually was stored or transmitted to generate the above image.
Who is Lena? A picture of a woman in a magazine was scanned into digital format, used for image compression tests, and has become a standard. Lena, after 98% was removed, looks like random dots until you apply smoothing and diffusion. In the old days of the Internet, such as greyscale or black and white indexed screens, compression of Lena’s face typically looked something more like a mosaic of small tiles, like ancient Mediterranean art.
With the latest analysis tools, an approximation of the original image was recovered from just a few data points in this 140x140 pixel square.
Again, while this raises all kinds of confidentiality issues, there also are interesting implications for integrity here. If you can scratch 98% of data, you get incredible storage gains, which means bringing in more data, which means even bigger pictures can be approximated from what you cram into memory. In reverse, if we do our best to destroy data from memory and leave just 2%, maybe our messages can be resurrected. The recovery may not be perfect, but neither are most audio or video streams. When you really think about it, the human brain has an unbelievable ability to reconstruct data from the tiniest fragments, so storing just a few points can mean pulling back a very large memory. Why did I ask originally if you saw a plane? Because if you ask a human what they see here, they might say plane just from seeing a small percentage of one.
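You can get a rough feel for this kind of “smoothing and diffusion” reconstruction with off-the-shelf inpainting. The sketch below is a stand-in for the idea, not the Saarland researchers’ actual PDE method, and the input file name is an assumption.

```python
# A rough stand-in for the sparse-reconstruction idea described above, using
# OpenCV's generic inpainting rather than the researchers' own method. The
# input file name is an assumption for illustration.
import cv2
import numpy as np

img = cv2.imread("lena_gray.png", cv2.IMREAD_GRAYSCALE)

# Keep only 2% of the pixels, exactly the kind of loss the passage describes.
rng = np.random.default_rng(42)
keep = rng.random(img.shape) < 0.02
sparse = np.where(keep, img, 0).astype(np.uint8)
mask = (~keep).astype(np.uint8)            # 1 where data is missing

# Diffuse the surviving 2% back out into the holes.
restored = cv2.inpaint(sparse, mask, inpaintRadius=5, flags=cv2.INPAINT_NS)

cv2.imwrite("lena_sparse.png", sparse)      # the "random dots"
cv2.imwrite("lena_restored.png", restored)  # a recognizable approximation
```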
Do you see a plane? If so, your mind has the amazing ability to restore the missing data. Giving you a few more dots and asking you to draw it in with “smoothing and diffusion” might be a different story, but it’s important to think like this when looking at massive amounts of data and how the picture might change if you alter just a few points, flip some bits, and generate different outcomes. Historians are well versed in this problem of trying to make sense of the past from voluminous records. Security experts increasingly have to think about this as well: how to reach certainty given little or no labels on the data, changing data distributions, model decay, or counterfactual conditions. With advances in machine learning, we expect great gains in data integrity controls, such that our most adversarial environments (fraud, money laundering) will be managed by trusted big data systems. And data sets get ever more massive as we are building engines and automation systems, bringing up the issue of whether an engine itself is trustworthy.
Take the history of highways in America, for example. Some people still tell me they believe President Eisenhower, elected in 1952, wanted to fund a national system of roads to move military equipment in case of war with the Soviets. They even go so far as to give the president credit for this idea, claiming he was inspired by his time in Europe as a war general learning about Roman roads. Notwithstanding that Roman roads were basically a copy from the Persian Empire, a proper historian might review 2% of the available data from years before Eisenhower and restore the full story.
Fear of a Soviet war arguably is a strong motivator. No surprise, then, that it could be used by Eisenhower’s vice president Nixon to tell the American public they should agree to fund national highways because of this fear. He literally said, “Should an atomic war come” in pushing for federal policy and funding. If we accept fear as a motivator, however, we must consider another major data point of this administration: race.
I’m not saying that because fear is an issue, every fear must be valid. I am saying we have ample data today, far more than the 2% needed, to show that leaders of the US actively were engaged in a struggle against civil rights, a war on minorities. Nixon was literally recorded in private (the precursor to today’s Microsoft, Amazon, Google, and Apple personal voice assistants), saying terrible things while in office. This is crucial to consider, as we paint our picture of the full story, because then it is Occam’s razor - it’s easy to see how the highway system also was a major civil rights issue. The latest historical research indicates a system was intentionally built through neighborhoods to displace blacks as well as perpetuate segregation (fund infrastructure to support “white flight” redistributions of wealth). The movie “Rubble Kings” provides interesting insight into how risk actually increased dramatically where transit planners thought it would decline.
Fear was not the entire foundation of the decision, however. Fear was an important catalyst that came at the end of many prior arguments for federal funding of highways. It was a long time before Eisenhower came into office and was handed briefings from prior decades (started in the 1890s by wheelmen as the good roads movement) to fund a national highway system. By 1930, a lobby group of tire, auto, and oil companies picked up the baton and looked at ways to push federal policy. Advertising campaigns by Goodyear Tire and Rubber, members of national highway lobby groups, tried to win public support for years before Eisenhower became president. Here is a simple graph from Open Library to illustrate activity relative to the election:
If we pull just a few data points and take them at face value, we might conclude Eisenhower came up with a brilliant economic and military plan. It’s seductive to think of Presidents in this way, as leaders without compare. If we apply proper science we see a long-term movement towards roads, heavily funded by corporations but lagging in government, which finally found a vice president who knew fear could easily push through resistance.
My digression here into historic data points to paint a factual picture is the whole struggle of integrity among big data sets. And rather than slowly train historians over many years and wait more for them to collect data and publish a trusted answer as a book, the future might just be a query to big data systems: who really should we credit for highways in America, and why?
“Lost in translation” might be another great example. Ebola didn’t turn up in the newest and best big data monitoring systems that scanned global communication systems. Why were they late to see the spread? Turns out that the engines were unable to translate or read French. The first cases were in French-speaking countries, so a “GDELT Project” early-warning system ran into data integrity issues.
There are many “lost in translation” or “telephone game” stories. One of the most recent big ones was that the French government was passing a law to make email illegal outside of work hours. Social media spread the story fast and wide. As it turns out, this was a gross misinterpretation of article 25 of a bill about labor, still being considered, which never mentioned email. It actually said employees should have rest periods protected, asking employers to set digital communication rules for off-hours. This is nothing shocking, if you’ve heard of a weekend. And maybe you also aren’t shocked that false rumors about laws in a foreign language easily can be spread, despite all the best big data engines available that can do translations for us. Some futurists predict things you learn in school will be worthless because of automation tools. In reality, add in a little human slang or “art” of language and these machines become lost, scratching at data corruption errors.
The language barriers to integrity remind me of fears of foreign “troll armies.” In the past, people used rumor to create negative impressions of someone. The National Enquirer and other tabloids were notorious for modeling this behavior. The barrier to poisoning the flow of information is lowered significantly if you can automate tens of thousands of accounts filling big data learning systems with “troll” views. Calling this an “anti-pollution” problem means it also may follow past trends. We saw more than thirty years of dithering over leaded gasoline, or sulfur in diesel, causing widespread serious health issues before environmental activists pushed forward clean regulation. Our information management systems have become character assassination tools, and our ability to do something about it is linked to a risk calculus for protecting the victims.
So integrity is really about keeping data sufficiently trustworthy that we don’t end up believing the wrong things, seeing the wrong data, or making mistakes because of false knowledge. After we ensure we won’t lose access to the data we need - availability and integrity - finally we can address the issue of making sure only the right people see the data that we worked so hard to preserve. Loss of privacy is probably the best way to look at this, putting it in terms closer to the earlier two considerations. This may be the most difficult area to understand, as our concept of privacy is quickly shifting and is not determined by the technology. A legal and social framework is emerging for new concepts in confidentiality that impact how new technical controls are being changed or developed to achieve privacy.
To be honest, confidentiality is always the most interesting part of the equation for security professionals. Availability, despite its importance, often gets relegated to operations. I believe the interest in privacy is because of the relatively higher levels of difficulty. For example, consider the facial recognition application made by a Russian startup, FindFace. You take a photo and it checks VK, a social network in Russia, to find matches. Here is the difficult and scary world everyone has been worried about. How would you recommend security for Russians now that they have lost degrees of privacy in public because of sensors (cameras), storage (VK), and high-speed recognition systems (FindFace)?
We could argue about ephemeral data. That’s a favorite topic of mine because our availability solutions are so good now, with snapshots and point-in-time recovery from failures and human error, that it’s a challenge to delete data. Can we develop a protocol to force photos to be easily erased from massively redundant distributed systems, such as the EU has talked about with the “right to be forgotten”? Perhaps FindFace scraped VK into its own database, immune from takedown requests on VK. A better answer could be a standard for timed data decay, such as was developed with “Vanish.”
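The core idea behind that kind of timed decay is crypto-shredding: keep the data encrypted, hold only the key, and destroy the key at expiry so every replicated copy of the ciphertext becomes unreadable. The sketch below is a toy illustration of that concept, not the actual Vanish protocol, which split key shares across a distributed hash table so they expired on their own.

```python
# A toy sketch of the crypto-shredding idea behind timed data decay; the real
# Vanish system scattered key shares into a DHT, this only shows the concept.
import time
from cryptography.fernet import Fernet   # pip install cryptography

class ExpiringRecord:
    def __init__(self, data: bytes, ttl_seconds: float):
        self._key = Fernet.generate_key()
        self.ciphertext = Fernet(self._key).encrypt(data)   # safe to replicate
        self._expires_at = time.time() + ttl_seconds

    def read(self) -> bytes:
        if self._key is None or time.time() >= self._expires_at:
            self._key = None             # "shred" the key; the data is now gone
            raise ValueError("record has decayed")
        return Fernet(self._key).decrypt(self.ciphertext)

record = ExpiringRecord(b"photo bytes or report text", ttl_seconds=2)
print(record.read())                     # readable before expiry
time.sleep(2.1)
try:
    record.read()
except ValueError as err:
    print(err)                           # ciphertext remains, the key does not
```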
We could also argue about fooling the algorithms. People change their clothes today like it is no big deal. In fact, there is a cultural expectation that you do not wear the same thing every day. Who’s that weird guy that always wears a black suit and white shirt with a red tie? Just kidding, that’s still considered normal for some weird reason. No, I mean changing t-shirts. It’s a sign of American prosperity to have many clothes to rotate through instead of just a few. And yet we do not see people change their facial appearance much. Why not invest in facial hair rotations or temporary face tattoos? It would display wealth while also breaking the recognition systems.
Actually, I just was trying to find ways to fool an emotion detection algorithm when I started to notice the systems couldn’t find any face at all (see Error:0 in upper right corner). Smiling has been known to affect early facial recognition systems, just like wearing flip-flops was known to throw off gait analysis. It’s a huge topic that lends itself to integrity more than confidentiality at this point. My reason for bringing it up is to point out that simple errors in big data systems can have impact because they challenge the logic used within. Complicated systems can be fragile in unexpected ways if not tested properly for safety, and expensive to fix down the road. It’s better to think of them before you start the journey.
Here’s a related example, if you worry about “appearance” being monitored. Consider data that indicates your stress level. Some ride-sharing companies monitor your phone battery as an indicator of how desperate you are to ride. A woman tells me that after calling a ride-sharing service late at night, as her battery dipped into its red zone, she had to agree to an absurdly high price. She expected to see scarcity because of traffic or some event congestion, yet everything was clear as normal. The only thing different was her battery level. She paid on the assumption that the calculation was done fairly.
Unfortunately for her, someone was actively tuning the system to hit customers low on battery with a much higher price. While the face of the company is telling customers that price changes are necessary and helpful, benefitting the rider, engineers inside can tune the engine to scrape money off anyone who “appears” more needy. We have seen Uber charge premiums during natural disasters and terrorist attacks, all preying on stress levels of customers to take more money from them. However, the company brushed away criticisms, saying the big data engines were doing the heavy lifting (blame the algorithm), impartially calculating a higher price because of the staid economic theory of supply and demand. It turns out instead they probably were training the engine with a rather old system of ethics, because they get an unfair advantage over customers in the market if they can surveil and sniff out demand. Historically, this is looked upon like ambulance chasing (lawyers who solicit clients in the emergency room), or worse.
If storekeepers tell you their “algorithm” runs on its own rules, and you know that means it takes advantage of you, what then? Can you restore balance with the system by finding weaknesses? Is tit-for-tat fair in competition? Riders who know this game and how the engine really is being run could carry a charged battery to swap every time they hail a ride. Such a scheme pays for itself almost immediately. Battery cost is far below a surge fee, and at the end of the ride, you still would have the battery. Or users can send bogus battery data to the Uber app. Or people can swap batteries. “Hey, I’ll give you a dollar if I can use your battery for a minute to save $20 off this Uber fake surge fee.”
Changing appearances like this to fool a predator is a natural response, just like the smiling face you saw earlier. Yet that only helps the individual, so those wanting to stop a predator more systematically would take a more active approach. Some riders might try reverse engineering or gaining insight into how the big data engines are working. Again the point is to discuss weaknesses within systems to anticipate who can use them for changing outcomes.
Let’s look at another real-world example. A person figured out he could pull mobile surveillance data to figure out who was at a particular location. Imagine watching someone headed home after work, assuming they must be hungry, and sending them a “Wouldn’t you like pizza tonight?” prompt. Around 2010, Google was leaking through their API the location of people driving, so anyone could track them. More recently, a marketing executive figured out who was going into healthcare facilities. He started asking for money from politically-oriented agencies to build a system called “RealOptions” that would influence the thoughts of people in a stressful decision time, weighing in from the outside on those inside a care facility trying to seek help. Testing emotional assessment algorithms on images from wrestling gave me a strange, although perhaps expected, result. Here the man on top doing the choking is rated as “neutral anger,” while the man on bottom being choked gets a 92% happiness score.
Applying emotion analysis could maybe even provide better commentary on competitive events, or alter outcomes by recommending treatment. Looking at MMA fighter pre-game emotions, it started to look like a pattern emerged to help predict who would win. If a system holds data that gamblers can use to make more profitable bets, it can seriously change the threat models and controls necessary to manage trust. What this really means is giant analytic engines are irresistible to people who want to find you and manage you towards specific outcomes. As I said, the natural response is to think about how to become less visible, and the bigger response is to think about building trusted environments.
An easy way to explain the challenge of true confidentiality in a big data world is to look at a few ancient examples. Here you see an ancient drawing of a Japanese rock garden, or "karesansui" (枯山水), with a kitten leaving tracks as it approaches the building. How does one achieve true confidentiality in this world of billions of sensors picking up traces of your every move? If planned properly, the system in the garden requires an entire refresh to set it back to an undisturbed state. The pebbles or sand are a simple construct, with lines pulled to reveal any intruders.
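To put the same idea into modern engineering terms: the raked garden is a tamper-evident baseline. Record a known-good state, then look for tracks. Here is a minimal sketch in Python of that baseline-and-compare pattern, using a hypothetical directory path; real file integrity monitoring tools do far more, but the principle is the same.

```python
import hashlib
import json
import os

def snapshot(root):
    """Rake the garden: record a hash for every file under root."""
    baseline = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                baseline[path] = hashlib.sha256(f.read()).hexdigest()
    return baseline

def detect_tracks(root, baseline):
    """Compare the garden to the raked baseline and report disturbances."""
    current = snapshot(root)
    added = set(current) - set(baseline)
    removed = set(baseline) - set(current)
    changed = {p for p in current.keys() & baseline.keys()
               if current[p] != baseline[p]}
    return {"added": sorted(added), "removed": sorted(removed),
            "changed": sorted(changed)}

if __name__ == "__main__":
    # Hypothetical directory; substitute a path you actually care about.
    base = snapshot("/etc/hadoop/conf")
    # ... time passes, the kitten walks through ...
    print(json.dumps(detect_tracks("/etc/hadoop/conf", base), indent=2))
```

Like the garden, the baseline has to be deliberately re-raked after every legitimate change, or authorized visitors start to look like intruders.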
When you think in current technology terms about the stones you're "touching," maybe it helps to look at the apps running on your mobile phone and who is watching them for movement. Here's a simple illustration of who might be sitting on the wooden platform watching their stones in the tech garden:
An even more ingenious system of monitoring than the zen-like rock gardens was the Japanese nightingale floor, or "uguisubari" (鴬張り). This type of floor intentionally makes squeaks to prevent walking silently. You can still find them in the castles of Kyoto. Basically, nails were set upward in a V-shape so the boards would press into them and squeak as someone stepped down.
Instead of thinking about this literally as a noisy floor, keep in mind that it was named after the "uguisu" bird, which has a high-pitched song and is known for being heard more than seen. The way the story was told to me, a squeaky floor on its own could generate so much noise that it would become a nuisance, and the human brain might tune it out from fatigue; the squeaks become normal. That is assuming, of course, that there was any amount of regular traffic on these castle floors. It probably is fair to say instead that people started walking on the boards in a way that was pleasing to the ear, almost to the effect of making their own bird-like song dance as they walked. People could literally create a unique song for each person. Now when someone walked, you knew their identity even with your eyes closed. In effect, if you map a unique value to each style of walking based on the "song" it creates, you are identifying any stranger immediately, by ear, because their pattern is unfamiliar.
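To translate the identification-by-song idea into modern terms, here is a toy sketch with invented walkers and squeak sequences: match an observed pattern against known baselines, and flag anything that is not close enough as a stranger. Real gait or behavioral biometrics use far richer features, but the matching logic has the same shape.

```python
from math import sqrt

# Hypothetical "songs": the sequence of relative squeak pitches each known
# walker tends to produce while crossing the floor. Purely illustrative data.
KNOWN_WALKERS = {
    "lord_of_the_castle": [1.0, 1.2, 0.9, 1.1, 1.0],
    "night_guard":        [0.7, 0.7, 1.4, 0.7, 0.7],
}

def distance(song_a, song_b):
    """Euclidean distance between two equal-length squeak sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(song_a, song_b)))

def identify(observed, threshold=0.3):
    """Return the closest known walker, or flag a stranger if nothing is close."""
    best_name, best_score = None, float("inf")
    for name, song in KNOWN_WALKERS.items():
        score = distance(observed, song)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else "stranger"

print(identify([1.0, 1.1, 0.9, 1.1, 1.0]))  # close to the lord's song
print(identify([1.3, 0.5, 1.3, 0.5, 1.3]))  # unfamiliar pattern -> stranger
```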
Whether that was true, or it was simply to make noise in the dead of night, either way it shows the ancient practice of generating data with primitive sensors. More importantly it shows, as with the rock garden, that if you know something about how the system works, you may be able to come up with a plan to disable it or change the outcome. Although I have to point out, these Japanese detection systems are designed to have a high barrier/cost to disable. Some refer to this as a dichotomy between safety and security. The former is about making the world safe from harm caused by technology use; the latter is about making technology trusted to operate as expected. Let's look deeper at the latter.
Does Security Even Work?
Spoiler alert: yes, sometimes it does. Or maybe I should say the alternative of no security is worse. This is the section dedicated to those I have worked with on big data who ask, if security is so hard to do right, "Why do we even bother?" or "Why don't we get to it later?" Everyone knows waiting is expensive, far more expensive than solving problems early. Some people are gamblers, though, and some are overconfident in their ability to pull their engine out of a nose-dive late instead of early. They point to the endless stories of enterprises getting breached. They ask whether we should spend all the money and time trying to secure resources only to find them broken anyway, all over the place; this world is a disaster, so why don't we just give up? And they ask if we can do the least amount possible for now, just to focus on other areas that need work to actually bring in money. Perhaps in their mind they're rowing a leaky boat so fast they'll make it to shore and jump off before it sinks.
It's an ironic question. The people asking usually are building big data tools and environments. We can turn the question around on them and ask why bother building anything. Bugs are found, mistakes are made. Systems fail. Should we even try when we know nothing is perfect, when we always will be leaving some money on the table? The value of a properly functioning system should include security in its definition. Availability, for example, is an easy place to start and discuss shared values. Integrity usually has shared value as well, especially when you highlight that high availability of bad data may not be better than no availability at all.
A website forced offline because of resource exhaustion means something discrete can be added to relieve the pressure and the site brought back online. But when a customer sees someone else's credit card in their checkout cart, leaving the site offline may be the best option until the integrity of accounts can be re-established. The fix to integrity often gets more complicated. Stuxnet might be one of the greatest examples of this distinction. The attack was a slow degradation of the quality of the centrifuges and of trust within nuclear plant teams, rather than an immediate outage once inside.
Confidentiality has been the hardest sell to developers because it has less clear objective alignment with big data projects. Everyone knows what an outage does. Most people know what quality failures will do. Privacy sometimes confuses people, as it sits opposed to the very purpose of big data projects: gathering information for knowledge. Science projects, the birthplace of big data systems, tend to be about doing things faster rather than keeping them private. They are working with public data by definition, such as looking up into space to record interesting light; you can't make the night sky private. But a major shift has been happening with the adoption of big data tools into industries where privacy has some very important cost and benefit considerations. Security can help preserve value, if we can explain how it becomes an advantage for the business that thirsts to gather as much data as possible to achieve knowledge.
Perhaps the best way to approach the topic is the externality of risk. It means, in brief, that the cost of losing data falls not on the custodians or gatherers of data, but rather on the data "owner." We see this most clearly with credit cards, because a retailer holding onto and then losing many cards means the cardholders see the charges, not the retailer. This is where regulation steps in and tells the custodians they have to take care to maintain confidentiality or they will be fined. Without regulation it seems unlikely, based on market experience, for the custodian to value protection at the same level as the customers, who are external to the system yet still bear the brunt of a breach. Have regulations ended breaches? Obviously not. Has our understanding of breaches improved, as well as our ability to defend? Absolutely yes. Unfortunately the adversaries also have evolved, but let's give credit where due. Without regulations like California's SB 1386 forcing an economic rebalance and accounting for externalities, we would not even be having a breach review discussion to figure out how to raise the bar.
Security thus has to be considered in terms of value, with some adjustment and alignment driven by external interests, in order to best explain how it will make sense from a business view. The solutions I will discuss in this book are not meant to be a starting over, so much as taking stock of fundamentals and lessons learned, and moving ahead relative to industry and situation. Solving outages in Hadoop, for example, is really about taking some recent availability advances that came from software-defined storage and adapting them for the Hadoop Distributed File System (HDFS).
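As one small, concrete illustration of an availability control in that context, the sketch below uses standard HDFS tooling to check a path for under-replicated blocks and, if any are found, raise its replication factor. The path, threshold, and target factor are assumptions for illustration only, and the fsck summary format can vary between Hadoop versions, so treat the parsing as a sketch rather than a finished tool.

```python
import subprocess

def run(cmd):
    """Run an HDFS CLI command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def under_replicated_blocks(path="/"):
    """Parse the `hdfs fsck` summary for the under-replicated block count."""
    report = run(["hdfs", "fsck", path])
    for line in report.splitlines():
        if "Under-replicated blocks" in line:
            # Example summary line: " Under-replicated blocks:  12 (0.5 %)"
            return int(line.split(":")[1].split()[0])
    return 0

def raise_replication(path, factor=3):
    """Increase the replication factor for a path and wait for it to apply."""
    run(["hdfs", "dfs", "-setrep", "-w", str(factor), path])

if __name__ == "__main__":
    # Hypothetical data set; substitute a path that matters to your availability goals.
    critical_path = "/data/critical"
    if under_replicated_blocks(critical_path) > 0:
        raise_replication(critical_path, factor=3)
```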
The bottom line is yes; security really works and makes a positive difference. Proof, ironically, is in the big data. The impact will only become greater as customers look for competitive differentiation, seeking out the big data companies that run with trust in mind. And the best way to demonstrate trust is through measurement, the same way people are demonstrating value in all their big data deployments. The difference is really just that the measurements and values of security often are set externally by regulators, accountants, or security managers, and obscured by marketing and translation to engineering.
Is Big Data Just Another Buzzword Belying Extra Revenue (JABBER)?
This book has been significantly more difficult to write than my first book, about how to secure cloud environments. After some searching and review I eventually realized the difference comes from writing about the end of an era (Cloud becoming mainstream after a decade) versus the start of a new one (big data development only just beginning). We clearly are still at the dawn of big data, with many roads and forks ahead.
The primary cause of this difference between Cloud and big data, based on my experience, is who was pushing the evolution of each. To put it simply, Cloud represents a long-time trend to split resources, allowing better utilization options. Big data is the long-time trend towards distributing workloads to share load as widely as possible. Better utilization of resources was attractive to enterprise environments, which demanded all the security of prior dedicated systems to address threat models that included insider attacks. A widely distributed workload was attractive to academic and closed research environments where insiders were not considered a threat. Follow Beowulf and VMware as two threads from 1999 to 2009, and you get very different outcomes. Big data to me represents a different track of innovation that was largely assumed to need little in terms of enterprise trust requirements. Perhaps the big data phrase will die and be replaced by data science, machine learning, artificial intelligence, or any of the other things people actually want to do with the data. But for now, I use it to mean something that evolved in a particular way from its intent to achieve high performance at low cost by sharing workloads.
Although I have been lucky to have worked in many environments over the years with some degree of the 3V elements of big data, the challenges we face today are pushing us to an entirely new scale. Some will say that the world is becoming more connected than ever before. I tend to agree, with the caveat that we underestimate the connections of the past; while slow, they still existed.
Our ability to move information is faster than ever, we can add more detail than ever, and we can carry more data overall. It is as if our shipping lanes of the sea have improved in all measures. We should not look at shipping as an entirely new concept every time a new supertanker is launched. We did not, in other words, recently flip a switch and become a global society. We always had global connectivity, and we tend to ignore or overlook what the past might mean for our future; our continuum along the 3Vs reaches quite far back. A big data map to track all the ships has new problems as well as old ones for naval engineering and logistics.
Some of the old problems are fairly easy to list: uniqueness of names for attribution, accuracy of data, as well as complete and consistent records. Once we add in new sensors on ships to automate tracking their geolocation, we see some amusing errors in unexpected places. Here's a big data system tracking a ship as it sails across the Sahara desert:
And here are some ships that have moved deep inland and parked where there are no waterways. At least I should say I searched for water and found only desert where I have drawn the red circles.
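A crude way to catch this class of error is to sanity-check each reported position against a constraint you already know, such as whether the coordinates fall in a region with no navigable water. A minimal sketch, with a bounding box and sample reports invented purely for illustration:

```python
# A rough bounding box over part of the Sahara (lat_min, lat_max, lon_min, lon_max).
# The numbers are illustrative, not a real land mask; a production check would use
# proper coastline or waterway polygons.
SAHARA_BOX = (18.0, 30.0, -10.0, 25.0)

def looks_landlocked(lat, lon, box=SAHARA_BOX):
    """Flag a reported position that falls inside a region with no waterways."""
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

# Hypothetical AIS-style position reports: (ship_id, latitude, longitude).
reports = [
    ("oil_tanker_42", 23.5, 5.0),   # middle of the desert: suspicious
    ("container_77", 35.9, 14.5),   # near Malta: plausible
]

for ship, lat, lon in reports:
    if looks_landlocked(lat, lon):
        print(f"{ship}: position ({lat}, {lon}) is inland, treat the record as suspect")
```

The geometry here is crude on purpose; the point is that accuracy checks are old-fashioned data quality work, and big data pipelines still need them.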