Big Data For Beginners
Understanding SMART Big Data, Data Mining & Data Analytics For Improved Business Performance, Life Decisions & More!
This document is geared towards providing exact and reliable information in regards to the topic and issue covered. The publication is sold with the idea that the publisher is not required to render accounting, officially permitted, or otherwise, qualified services. If advice is necessary, legal or professional, a practiced individual in the profession should be consulted.

The information provided herein is stated to be truthful and consistent, in that any liability, in terms of inattention or otherwise, by any usage or abuse of any policies, processes, or directions contained within is the solitary and utter responsibility of the recipient reader. Under no circumstances will any legal responsibility or blame be held against the publisher for any reparation, damages, or monetary loss due to the information herein, either directly or indirectly.

Respective authors own all copyrights not held by the publisher.

The information herein is offered for informational purposes solely, and is universal as so. The presentation of the information is without contract or any type of guarantee assurance.

The trademarks that are used are without any consent, and the publication of the trademark is without permission or backing by the trademark owner. All trademarks and brands within this book are for clarifying purposes only and are owned by the owners themselves, not affiliated with this document.
If you are in the world of IT or business, you have probably heard about the Big Data phenomenon. You might have even encountered professionals who introduced themselves as data scientists. Hence, you are wondering: just what is this emerging new area of science? What types of knowledge and problem-solving skills do data scientists have? What types of problems are solved by data scientists through Big Data tech?

After reading this book, you will have the answers to these questions. In addition, you will begin to become proficient with important industry terms, applications, and tools in order to prepare you for a deeper understanding of the other important areas of Big Data. Every day, our society is creating about 3 quintillion bytes of data. You are probably wondering what 3 quintillion is. Well, this is 3 followed by 18 zeros. And that, folks, is generated EVERY DAY. With this massive stream of data, the need to make sense of it becomes ever more crucial, and the demand for Big Data understanding is growing quickly. Business owners, large or small, must have a basic knowledge of big data.
‘Big data’ is one of the latest technology trends that are profoundly affecting the way organizations utilize information to enhance the customer experience, improve their products and services, create untapped sources of revenue, transform business models and even efficiently manage health care services. What makes it a highly trending topic is the fact that the effective use of big data almost always ends up with significantly dramatic results. Yet, the irony is that nobody really knows what ‘big data’ actually means.

There is no doubt that ‘big data’ is not just a highly trending IT buzzword. Rather, it is a fast-evolving concept in information technology and data management that is revolutionizing the way companies conduct their businesses. The sad part is, it is also turning out to be a classic conundrum because no one, not even a group of the best IT experts or computer geeks, can come up with a definitive explanation describing exactly what it is. They always fall short of coming up with an appropriate description for ‘big data’ that is acceptable to all. At best, what most of these computer experts could come up with are roundabout explanations and sporadic examples to describe it. Try asking several IT experts what ‘big data’ is and you will get just as many different answers as the number of people you ask.
What makes it even more complicated and difficult to understand is the fact that what is deemed as ‘big’ now may not be that big in the near future due to rapid advances in software technology and the data management systems designed to handle them.
We also cannot escape the fact that we now live in a digital universe where everything and anything we do leaves a digital trace we call data. At the center of this digital universe is the World Wide Web, from which comes a deluge of data that floods our consciousness every single second. With well over one trillion web pages (50 billion of which have already been indexed by and are searchable through various major search engines), the web offers us unparalleled interconnectivity which allows us to interact with anyone and anything within whatever connected network we happen to be part of. Each one of these interactions generates data too, which is coursed through and recorded on the web, adding up to the deluge. This ever-growing mass of data, together with the developing technologies designed to handle it, is what is collectively referred to as ‘big data’.
Countless people and devices are now connected to the internet and, therefore, they are continuously leaving behind digital trails, adding more data to the already burgeoning bulk of information stored in the millions of servers that span the internet.
And if your imagination still serves you right at this point, try contemplating the more than thirty billion point-of-sale transactions per year that are coursed through electronically connected POS devices. If you are still up to it, why not also go over the more than 10,000 credit card payments being made online or through other connected devices every single second. The sheer volume alone of the combined torrential data that envelops us unceasingly is amazingly unbelievable. ‘Mind-boggling’ is an understatement. Stupefying would be more appropriate.
Don’t blink now, but the ‘big data’ that has been accumulated by the web for the past five years (since 2010) and is now stored in millions of servers scattered all over the globe far exceeds all of the prior data that had been produced and recorded throughout the whole history of mankind. The ‘big data’ we refer to includes anything and everything that has been fed into big data systems such as social network chatter, content of web pages, GPS trails, financial market data, online banking transactions, streaming music and videos, podcasts, satellite imagery, etc. It is estimated that over 2.5 quintillion bytes of data (2.5 x 10^18) are created by us every day. This massive flood of data, which we collectively call ‘big data’, just keeps on getting bigger and bigger through time. Experts estimate that its volume will reach 35 zettabytes (35 x 10^21) by 2020.
In essence, if and when data sets grow extremely big or become excessively complex for conventional data management tools to handle, they start to be classified as big data. The problem is, there is no commonly set ceiling or acceptable upper threshold beyond which the bulk of information starts to be classified as big data. In practice, what most companies normally do is to consider as big data those data sets which have outgrown their own respective database management tools. Big Data, in such a case, is the enormous data which they can no longer handle, either because it is too massive, too complex, or both. This means the ceiling varies from one company to the other. In other words, different companies have different upper threshold limits to determine what constitutes big data. Almost always, the ceiling is determined by how much data their respective database management tools are able to handle at any given time. That’s probably one of the reasons why the definition of ‘big data’ is so fuzzy.
For hidden deep within the torrent of the big data information stream is a wealth of useful knowledge and valuable behavioral and market patterns that can be used by companies (big or small) to fuel their growth and profitability - simply waiting to be tapped. However, such valuable information has to be ‘mined’ and ‘refined’ first before it can be put to good use - much like drilling for oil that is buried underground.
Similar to oil, which has to be drilled and refined first before you can harness its awesome power to the hilt, ‘big data’ users have to dig deep, sift through, and analyze the layers upon layers of data sets that make up big data before they can extract usable sets that have specific value to them.

In other words, like oil, big data becomes more valuable only after it is ‘mined’, processed, and analyzed for pertinent data that can be used to create new values. This cumbersome process is called big data analytics. Analytics is what gives big data its shine and makes it usable for application to specific cases. To make the story short, big data goes hand in hand with analytics. Without analytics, big data is nothing more than a bunch of meaningless digital trash.
The traditional way of processing big data, however, used to be a tough and expensive task to tackle. It involves analyzing massive volumes of data which traditional analytics and conventional business intelligence solutions can’t handle. It requires the use of equally massive and expensive computing resources.

The giant corporations who started digging into big data ahead of everybody else had to spend fortunes on expensive hardware and groundbreaking data management software to be able to do it - albeit with a great deal of success at that. Their pioneering efforts revealed new insights that were buried deep in the maze of information clogging the internet servers, which they were able to retrieve and use to great advantage. For example, after analyzing geographical and social data and after scrutinizing every business transaction, they discovered a new marketing factor called ‘peer influence’ which played an important role in shaping shopping preferences. This discovery allowed them to establish specific market needs and segments without the need to conduct tedious product samplings, thus blazing the trail for data-driven marketing.
All this while, the not-so-well-resourced companies could only watch in awe - sidelined by the prohibitive cost of processing big data. The good news though is that this will not be for long, because there is now affordable commodity hardware which the not-so-well-resourced companies can also put to work.

Big Data is a kind of supercomputing that can be used by governments and businesses, which will make it doable to keep track of pandemics in real time, guess where the next terrorist attack will happen, improve the efficiency of restaurant chains, project voting patterns in elections, and predict the volatility of financial markets while they are happening.
Hence, many seemingly unrelated yet diverse data sources will be integrated into the big data network. Similar to any powerful tech, when used properly and effectively for good, Big Data could push mankind towards many possibilities. But if used with bad intentions, the risks could be very high and could even be damaging.
The need to get big data is immediate for different organizations. If a malevolent organization gets the tech first, then the rest of the world could be at risk. If a terrorist organization secured the tech before the CIA, the security of the USA could be compromised.
The resolutions will need business establishments to be more creative at different levels, including organizational, financial, and technical. If the cold war in the 1950s was all about getting the arms, today, Big Data is the arms race.
Trends in the world of supercomputing are in some ways similar to those of the fashion industry: if you wait long enough, you will get the chance to wear it again. Most of the tech used in Big Data has been used in different industries for many years, such as distributed file systems, parallel processing, and clustering.
Enterprise supercomputing was developed by online companies with worldwide operations that require the processing of exponentially growing numbers of users and their profiles (Yahoo!, Google, and Facebook). But they need to do this as fast as they can without spending too much money. This enterprise supercomputing is known as Big Data. Big Data could cause disruptive changes to organizations, and can reach far beyond online communities to the social media platforms that span and connect the globe. Big Data is not just a fad. It is a crucial aspect of modern tech that will be used for generations to come.
Big data computing is actually not a new technology. Predicting the weather has long been a crucial big data concern, from the days when weather models were processed on a single supercomputer that could occupy a whole gymnasium and was built from then-fast processing units with costly memory. Software during the 1970s was very crude, so most of the performance during that time was credited to the innovative engineering of the hardware.
Software technology had improved by the 1990s, leading to an improved setup where a program that ran on one huge supercomputer could be partitioned into smaller programs running simultaneously on several workstations. Once all the programs finish processing, the results are collated and analyzed to forecast the weather for several weeks.
But even during the 1990s, the computer simulators needed about 15 days to calculate and project the weather for a week. Of course, it doesn’t help people to know that it was cloudy last week. Nowadays, the parallel computer simulations for a whole week’s weather prediction can be completed in a matter of hours.
In reality, these supercomputers cannot predict the weather. Instead, they are just trying to simulate and forecast its behavior. It is through human analysis that the weather gets predicted. Hence, supercomputers alone cannot process Big Data and make sense of it. Many weather forecasting agencies use different simulators with varying strengths. Computer simulators that are good at forecasting where a hurricane will make landfall in New York are not that accurate in forecasting how the humidity level could affect air operations at Atlanta International Airport.
Weather forecasters in every region study the results of several simulations with various sets of initial data. They not only pore over actual output from weather agencies, but they also look at different instruments such as Doppler radar.
Even though there are tons of data involved, weather simulation is not categorized as Big Data, because there is a lot of computing required. Scientific computing problems of this kind are usually handled by what is called high-performance computing (HPC).

Early electronic computers were designed to perform scientific computing, such as deciphering codes or calculating missile trajectories, all of which involve working on mathematical problems using millions of equations. Scientific calculations can also solve equations for non-scientific problems, as in rendering animated movies.
Big Data is regarded as the enterprise equivalent of HPC; it is also known as enterprise supercomputing or high-performance commercial computing. Big Data can also resolve huge computing problems, but it is more about discovery and less about equations.
During the early 1960s, financial organizations such as banks and lending firms used enterprise computers to automate accounts and manage their credit card ventures.

Nowadays, online businesses such as eBay, Amazon, and even large retailers are using enterprise supercomputing in order to find solutions for the numerous business problems that they encounter. However, enterprise supercomputing can be used for much more than studying customer attrition, managing subscribers, or discovering idle accounts.
Big Data and Hadoop
Hadoop is regarded as the first enterprise supercomputing software platform that works at scale and is quite affordable. It exploits the simple trick of parallelism that was already in use in the high-performance computing industry. Yahoo! developed this software in order to find a specific solution to one problem, but they immediately realized that it had the ability to solve other computing problems.
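To make that parallelism trick concrete, here is a minimal sketch in Python of the same divide-process-combine idea, using only the standard library. It is an illustration of the concept, not Hadoop's actual API, and the log format and function names are made up for the example.

# A minimal sketch of the "divide, process in parallel, combine" idea behind
# Hadoop-style processing, using only the Python standard library.
from collections import Counter
from multiprocessing import Pool

def count_clicks(lines):
    """Map step: count clicked links in one chunk of clickstream log lines."""
    counts = Counter()
    for line in lines:
        # assume each line looks like "user_id<TAB>clicked_url"
        _, url = line.split("\t")
        counts[url] += 1
    return counts

def split_into_chunks(lines, n):
    """Split the work into n roughly equal chunks, one per worker."""
    return [lines[i::n] for i in range(n)]

if __name__ == "__main__":
    log_lines = [
        "u1\t/home", "u2\t/search", "u1\t/search", "u3\t/home", "u2\t/home",
    ]
    with Pool(processes=2) as pool:
        partial_counts = pool.map(count_clicks, split_into_chunks(log_lines, 2))
    # Reduce step: merge the partial results from every worker.
    total = sum(partial_counts, Counter())
    print(total.most_common())  # e.g. [('/home', 3), ('/search', 2)]

Hadoop does essentially this, but across hundreds or thousands of machines, with the data already spread over the cluster's disks.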
Even though the fortunes of Yahoo! have since changed drastically, it made a large contribution to the incubation of Facebook, Google, and big data.
Yahoo! originally developed Hadoop to easily process the flood of clickstream data received by the search engine. Clickstream refers to the history of links clicked by users. Because it could be monetized to potential advertisers, analyzing the clickstream data from thousands of Yahoo! servers needed a hugely scalable database which was cost-effective to create and run.
The early search engine company discovered that many commercial solutions available at the time were either very expensive or entirely incapable of scaling to such huge data. Hence, Yahoo! had to develop the software from scratch, and so DIY enterprise supercomputing began.
Similar to Linux, Hadoop is designed as open-source software. Just as Linux led to commodity clouds and clusters in HPC, Hadoop has developed a big data network of disruptive possibilities, new startups, old vendors, and new products.
Hadoop was created as portable software; it can be operated on platforms other than Linux. The ability to run open-source software like Hadoop on a Microsoft OS is crucial, and it was a success for the open-source community and a huge milestone at the time.
Knowing the history of Yahoo! is crucial to understanding the history of Big Data, because Yahoo! was the first company to operate at such a massive scale. Dave Filo and Jerry Yang began Yahoo! as a tech project in order to index the internet. But as they worked on it, they realized that traditional indexing strategies could not cope with the explosion of content that needed to be indexed.
Even before the creation of Hadoop, Yahoo! needed a computing platform which could take the same amount of time to build the web index regardless of the growth rate of internet content. The creators realized the need to use the parallelism tactic from the high-performance computing world for the project to become scalable, and the computing grid of Yahoo! then became the cluster network that Hadoop was based on.
Just as important as Hadoop was Yahoo!’s innovation in restructuring their Operations and Engineering teams in order to support network platforms of this scale. Yahoo!’s experience in operating a large-scale computing platform spread across several locations resulted in the re-invention of the Information Technology department. Complicated platforms had to be developed initially and deployed by small teams. Running an organization that scales up in order to support these platforms is an altogether separate matter. However, reinventing the IT department is just as important as getting the software and hardware to scale.
Similar to many corporate departments, from Sales to HR, IT organizations conventionally attain scalability by centralizing processes. Having a dedicated team of IT experts managing a thousand storage arrays is more cost-effective than paying the salaries of a much larger team. However, Storage Admins usually don’t have a working knowledge of the numerous apps running on those arrays.
Centralization exchanges the working knowledge of the generalist for subject-matter expertise and cost efficiency. Businesses are now realizing the unintended risks of exchanges made several years ago, which created silos that inhibit their capacity to use big data.
Conventional IT firms divide expertise and responsibilities in ways that often constrain collaboration among and between teams. Minor glitches caused by miscommunication may be acceptable on a few minor email servers, but even a small glitch in a production supercomputer may cost a business a lot of money.
Even a small margin of error can make a large difference. In the Big Data world, 100 terabytes is just Small Data, but a 1% error in 100 TB is 1 million MB. Detecting and resolving errors at this massive scale could consume many hours.
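A quick back-of-the-envelope check of that figure, using decimal units where 1 TB = 1,000,000 MB:

# Back-of-the-envelope check, using decimal units (1 TB = 1,000,000 MB).
total_mb = 100 * 1_000_000      # 100 TB expressed in MB
error_mb = 0.01 * total_mb      # a 1% error rate
print(f"{error_mb:,.0f} MB")    # 1,000,000 MB -- one million megabytes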
This layered paradigm was easier to understand, but with exponentially more sophisticated platforms, the layering started to hide the underlying sophistication, which impeded or even prevented effective triage of performance and reliability concerns.

Similar to a Boeing 747, supercomputing platforms should be understood as a whole collection of technologies, or manageability and efficiency will suffer.
In the early stages of computer history, systems were treated as platforms. These were called mainframes, and they were produced by companies that also supplied specialized teams of engineers who worked closely with their customers to make certain that the platform functioned according to its design.
This method was effective so long as you were satisfied being a customer of IBM. But when IBM started to make some changes in the 1960s, other companies provided more options and better prices. However, this resulted in the partitioning of the industry into silos.
Nowadays, enterprises that still dominate their silo tend to behave like a monopoly so long as they can get away with it. When storage, server, and database companies started to proliferate, IT organizations mimicked this alignment with their own corresponding groups of storage, server, and database specialists.
But in order to effectively stand up a big data cluster, everyone working on the cluster needs to be organizationally and physically present. The collaborative work required for effective cluster deployments at this scale can be difficult to achieve in a deeply siloed organization.
If your business wants to embrace big data, or come together in that magical place where Big Data Works in the Cloud, the IT department should reorganize some silos and study the platform well.
But far from that ideal, many business organizations cannot easily handle such changes, especially if the change is too fast. Disruption and chaos have been constants in the IT industry:

1. The impact on the silo mindset, both in the industry and the organization, will be an important milestone of big data.

2. The IT industry will be bombarded by the new tech of big data, since most of the products created before Hadoop simply do not function at this scale. Big Data software and hardware is many times faster compared to existing business-scale products and also a lot cheaper.
5. In working with Big Data, programmers and data scientists are required to set things up with a better understanding of how the data flows underneath.
Great Possibilities with Big Data
Nowadays, Big Data is not just for social networking or machine-generated online logs. Enterprises and agencies can seek answers to questions which they never before had the capacity to ask, and Big Data could help in identifying such questions.
For example, car producers can now access their worldwide parts inventory across numerous plants and also acquire tons of data (usually in petabytes) coming from the sensors installed in the cars they have manufactured.
Other enterprises can now analyze and process tons of data while it is still being collected in the field. For instance, prospecting for gold reserves will involve seismic sensors in the field acquiring tons of data that can be sent to HQ and analyzed within minutes.
In the past, this data had to be taken back to a costly data center and transferred to high-powered supercomputers - a process that takes a lot of time. Today, a Hadoop cluster distributed over seismic trucks parked in a vacant lot can do the task within hours and find patterns to set the prospecting route for the next day.
In the field of agriculture, farmers can use hundreds of farm sensors that transmit data back to a Hadoop cluster installed in a barn in order to monitor the growth of the crops.
Government agencies are also using Hadoop clusters because these are more affordable. For instance, the CDC and the WHO are now using Big Data to track the spread of pandemics such as SARS or H1N1 as they happen.
Even though Big Data involves processing very large data sets, the processing can be fast, thanks to parallelism. Hadoop can also be used for data sets which are not considered Big Data; a small Hadoop cluster can even serve as something like an artificial retina.

Regardless of the form of data transmission, the data should still be collected into a cost-effective reservoir, so that the business or enterprise can fully realize these possibilities. The data reservoir cannot be considered as just another drag-and-drop business warehouse. The data stored in the reservoir, similar to the fresh water stored in a water reservoir, should be used to sustain the operations of the business.
Big data is not a single entity. Rather, it is a synthesis of several data-management technologies that have evolved over time. Big data is what gives businesses the ability to store, analyze, and exploit massive amounts of data with great ease and in real time to gain deeper market insights and create new value that will benefit the organization. But big data has to be managed well before it can be utilized to provide the most appropriate solution that meets the specific business requirements of an enterprise. And the key to managing big data well is having a clear understanding of what it truly is. Unfortunately, each time we attempt to define the term big data, our minds almost always end up swirling in confusion. It is not only the enormity of big data that poses a challenge and makes it difficult to understand, but also the seemingly endless variety of tasks involved in processing it, including analysis, capturing information, curating data sets, searching for and filtering relevant information, sharing data with others, providing for sufficient data storage, efficient data transfer, visualization, and, most important of all, ensuring privacy of information.
Without a clear understanding of what big data is, we won’t be able to harness its full potential, much less use it to our advantage. If we want to tap the full potential of big data, we are left with no choice but to continue seeking a truly definitive explanation of what it really is, no matter how overwhelming the task may seem. We need to discover novel ways to dig up relevant information embedded deep in its vast realm of information in order to discover useful insights and create innovative products and services of significant value.
Let me point out that data becomes valuable only if it leads to the creation of significant business solutions. We need to create meaningful value from data before we can attach a monetary value to it. In other words, to have a more stable and sounder basis for big data valuation, we have to link the data’s value to its potential role in supporting business decisions that produce positive results.
Handling such an enormous volume of data is a challenge, which is why the direction of big data technology today is to develop huge data tools that use a distributed system, where data is stored and analyzed across a network of interconnected databases located across the globe. This scalable data storage setup, coupled with a distributed approach to querying, allows businesses to have a 360-degree view of their customers as well as access to much more historical data than usual, thus giving businesses more and deeper market insights. Needless to say, having more data on which to base decision making is better than creating marketing models based on a few, limited data points.
Velocity Based Value
Big Data Velocity is about the speed at which data streams into our networks in real time, coming from all possible sources including business processes, other networks, digitally connected machines, as well as the streaming data that is created every time people use their mobile devices or interact with social media sites, and the like. This flow of data is not only massive but also continuous, which in effect puts big data in a state of perpetual flux. Making it possible for big data users to access and analyze information in real time is where the real value of big data velocity lies. It means researchers and businesses are able to make valuable, timely decisions that provide them with strategic competitive advantages and improve their bottom line (ROI) tremendously. The more real-time customer data you absorb into your big-data management tool, and the more queries, reports, dashboards, and customer interactions that get recorded in your database, the better your chances are of making the right decision at the right time. With such timely information, you will be able to develop excellent customer relationships and achieve management objectives with great ease.
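To picture what velocity means in practice, here is a minimal sketch of processing events as they arrive rather than in overnight batches. The event source is a made-up stand-in for a real feed such as a message queue or clickstream, and the one-minute rolling window is just an illustrative choice.

# A minimal sketch of handling high-velocity data: events are processed as they
# arrive instead of being batched for later. The event source here is a stand-in
# for a real feed (message queue, clickstream, sensor gateway, etc.).
import time
from collections import deque

def event_source():
    """Hypothetical stand-in for a live stream of customer interactions."""
    for i in range(10):
        yield {"user": f"u{i % 3}", "action": "click", "ts": time.time()}

window = deque()          # events seen within the rolling window
WINDOW_SECONDS = 60

for event in event_source():
    window.append(event)
    # Drop events that have fallen out of the rolling window.
    while window and event["ts"] - window[0]["ts"] > WINDOW_SECONDS:
        window.popleft()
    print(f"events in the last minute: {len(window)}")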
The data sets that make up big data are varied and include both structured and unstructured data. In essence, big data is a mixture of unstructured and multi-structured data which together compose the bulk of the information contained therein. This varied customer data includes information coming from Customer Relations Management systems; feedback, reactions, and interactions from social media; call-center logs, and so on. With varied customer data as your basis, you will be able to paint more refined customer profiles, determine client desires and preferences, and the like, which means you will be better informed in making business decisions and do a better job of engaging customers.

To come up with clear pictures of customer profiles and preferences, therefore, you must not limit your big data analytics to digital inputs such as social network interactions and web behavior. You must also include traditional data such as those coming from your own business transactions, financial records, call center logs, point-of-sale records, and such other channels of interaction you engage with. Digital data inputs are growing at a tremendous rate and may totally overwhelm traditional data, but that is not reason enough to exclude traditional data from your data sets. They are part and parcel of big data analytics and contribute a lot to creating a truly representative market profile of your targeted customer base.
Structured data is data organized into predefined fields, typically managed with a query language developed by IBM in the ’70s called Structured Query Language (more popularly known by its acronym, SQL). Structured data was a welcome alternative to the traditional paper-based data management systems, which were highly unstructured and too cumbersome to manage. And since limited storage capacity remained a problem, structured data still had to be augmented by paper or microfilm storage.
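As a small illustration of what ‘structured’ means, here is a sketch using SQLite; the table and column names are invented for the example. Because every record fits the same predefined fields, questions map directly to SQL queries.

# A small illustration of structured data: every record fits predefined fields,
# so it can be stored in a relational table and queried with SQL.
# The table and column names here are made up for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("alice", "widget", 19.99), ("bob", "widget", 19.99), ("alice", "gadget", 5.00)],
)

# Because the structure is known in advance, questions map directly to queries.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
):
    print(customer, total)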
Unstructured data refers to data sets that are text-heavy and are not organized into specific fields. Because of this, traditional databases or data models have difficulty interpreting them. Examples of unstructured data include metadata, photos and graphic images, web pages, PDF files, wikis and word-processing documents, streaming instrument data, blog entries, videos, emails, Twitter tweets, and other social media posts. Locating unstructured data requires the use of semantic search algorithms.
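By contrast, here is a minimal sketch of searching unstructured text. It uses plain keyword matching; real semantic search layers language understanding (for example, embeddings) on top of this basic idea, and the sample documents are made up.

# Unstructured data has no fixed fields, so finding things means searching the
# content itself. This sketch uses plain keyword matching only.
documents = [
    "Customer tweeted that the new gadget stopped charging after two days.",
    "Support email: invoice attached, please update billing address.",
    "Blog post draft about our widget launch event next month.",
]

def search(query, docs):
    """Return documents that mention any of the query terms."""
    terms = query.lower().split()
    return [d for d in docs if any(t in d.lower() for t in terms)]

print(search("gadget charging", documents))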
Veracity Based Value
Big Data Veracity is the term that describes the process of eliminating any abnormality in the data being absorbed by a big data system. This includes biases, ‘noise’ or irrelevant data, and anything being mined which has nothing to do with the problem for which a solution is being sought. Big data veracity actually poses a bigger challenge than volume and velocity when it comes to analytics. You have to clean incoming data and prevent ‘dirty’, uncertain, and imprecise data from accumulating in your big data system.

By default, current big data systems accept enormous amounts of both structured and unstructured data at great speed. And since unstructured data like social media data contains a great deal of uncertain and imprecise data, we need to filter it to keep our data clean and unpolluted. For this, we may need some help. However, it would be highly unreasonable to spend a huge amount of human capital on data preparation alone. The sad part is, organizations have no recourse but to absorb both structured and unstructured data, along with its imperfections, into their big data systems and then prepare the data for their use by filtering out the noise and the imprecise.
Tools meant to automate data preparation, cleansing, and filtering are already in the works, but it may still take a while before they are released for general use. In the meantime, it may be easier and more prudent to devise a Data Veracity scoring and ranking system for the valuation of data sets, to minimize if not eliminate the chances of making business decisions based on uncertain and imprecise data.
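What such a scoring and ranking scheme might look like is sketched below. The individual checks, weights, and thresholds are illustrative assumptions only, not a standard.

# A sketch of a simple veracity score: each record is checked for obvious
# problems, and the score decides whether it is trusted, reviewed, or dropped.
# The checks and the thresholds below are illustrative assumptions only.
def veracity_score(record):
    score = 1.0
    if not record.get("customer_id"):        # missing identity -> less trustworthy
        score -= 0.4
    if record.get("amount", 0) < 0:          # impossible value -> likely an error
        score -= 0.4
    if record.get("source") == "social":     # noisy source -> slight penalty
        score -= 0.2
    return max(score, 0.0)

records = [
    {"customer_id": "c1", "amount": 25.0, "source": "pos"},
    {"customer_id": None, "amount": -3.0, "source": "social"},
]

for r in records:
    s = veracity_score(r)
    decision = "keep" if s >= 0.8 else "review" if s >= 0.5 else "drop"
    print(s, decision, r)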
In Summary
It is the combination of these factors - high volume, high velocity, high variety, and veracity - that makes up what we now call Big Data. There are also data management platforms and data management solutions which supply the tools, methodology, and technology needed to capture, curate, store, search, and analyze big data, all of which are designed to create new value, find correlations, discover new insights and trends, as well as reveal relationships that were previously unknown or unavailable.
The logical approach to using big data, therefore, is to process unstructured data and draw out or create ordered meaning from it, meaning that can be used as structured input to an application or for whatever valuable purpose it may serve.
Take note, however, that once you process big data and move it from source data to processed application data, some loss of information will occur. And once you lose the source data, there is no way you can recover it. In-house processing of big data almost always ends up with you throwing some data away. For all you know, there may still be useful signals in the bits of data you have thrown away. This underscores the importance of scalable big data systems where you can keep everything.
There are three forms of big data solutions you can choose from for your deployment: a software-only solution, a hardware solution, or a cloud-based solution. The deployment method that would be the ideal route to pursue will depend on several factors, like the location of your source data, privacy and regulatory factors, availability of human resources, and the specific requirements of the project. Most companies have opted for a mix of on-demand cloud resources together with their existing in-house big data deployments.

Big data is big. And since it is too massive to manage through conventional means, it follows that it will also be too big to bring anywhere or move from one location to another. The solution to this is to move the program, not the data. It can be as simple as running your code on the web services platform which hosts the data you need. It won’t cost you an arm and a leg, nor will you spend much time transferring the data you need to your system, if you do it this way. Remember, the fastest connection to any source data is through the data centers that host the said data. Even a millisecond difference in processing time can spell the difference between gaining or losing your competitive advantage.
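A rough calculation shows why moving the program beats moving the data; the link speed used here is an illustrative assumption.

# Rough arithmetic for why you move the program to the data, not the other way
# around: shipping 100 TB over a 1 Gbit/s link (illustrative numbers).
data_bytes = 100 * 10**12              # 100 TB
link_bits_per_sec = 1 * 10**9          # 1 Gbit/s
seconds = (data_bytes * 8) / link_bits_per_sec
print(f"{seconds / 3600:.0f} hours, or about {seconds / 86400:.1f} days")
# ~222 hours, roughly 9 days -- before any processing even starts.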
A platform refers to a collection of components or sub-systems which should operate like one object. A Formula One car is the automotive equivalent of a supercomputer. Every component of the Formula One car and every design decision has been fully optimized, not just for performance, but for performance per kilogram of curb weight or per liter of fuel. A two-liter engine which yields 320 HP rather than 150 HP is possible because it is more efficient.
The racing car engine with higher HP will have better performance. However, performance really means efficiency, such as miles per gallon or horsepower per kilogram. When it comes to computing platforms, it is jobs executed per watt.
Performance is always measured as a ratio of something achieved for the effort exerted. The latest Honda F1 technology now finds its way into other cars, because the optimized tech derived from the racing program enabled Honda to design higher-performance vehicles not only for racers but also for general consumers.
For example, a Honda Civic is engineered with the same platform thinking as the F1 car. The suspension, steering, brakes, and engine are all designed so that you actually feel you are driving one vehicle, and not just a collection of complicated subsystems.
The design as well as the production of a new commercial airplane is sophisticated, expensive, and mired in several layers of regulation. Hence, the process can be tedious and slow, as any lapse in the design and structure could risk lives.
Platforms which are produced out of physical parts need more planning compared to platforms that are produced out of nothing, like software. Remember, you can’t download a new set of engines every month.
But the designers of aircraft today also understand the value of flexible software. First developed in the military, ‘fly-by-wire’ tech refers to flying by electrical signals rather than mechanical linkages.
In conventional aircraft, the stick and pedals are mechanically linked to the control surfaces on the wings; hence, the mechanical linkages control these surfaces. In a fly-by-wire aircraft, the cockpit controls are transmitted to a computer, which controls the motorized actuators that command the tail and wings.
Fighter planes also use fly-by-wire software to keep them safe. Pilots can turn so steeply while in flight that there is a tendency for them to pass out. However, the software can sense these factors and will restrict the turns to keep the pilots conscious and alive. These software features are now applied to commercial aircraft and even sedans, which makes those platforms a lot more efficient and safe. But if the fly-by-wire software is mired with design flaws and bugs, this could still lead to a mess on the infield, which is best prevented.
During the 1960s, IBM and Bank of America created the first credit card processing system. Even though these initial mainframes processed just a small percentage of the data handled by Amazon or eBay today, the engineering was very complicated for its time. When credit cards became very popular, there was a need to build processing systems that could manage the load as well as handle the growth without having to re-engineer the system every so often.
These prototype platforms were developed around software, peripheral equipment, and mainframes, all from one vendor.
IBM also developed a large database system as a side project for NASA’s Apollo program, which later evolved into a separate product called IMS. Since IBM created these solutions for particular problems which large customers encountered, the resulting systems were not yet products. They were highly integrated, custom-built, and costly platforms that would later evolve into a lucrative business for IBM.
These solutions, alongside other interlinked software and hardware components, were all built as a single system, normally by a specialized team of experts. Small groups collaborated with one another; hence the experts on databases, networks, and storage acquired enough working knowledge in one another’s related areas.
These solutions usually needed the development of new software and hardware technologies, so extended collaboration among experts was important to the success of the project. The proximity of the team members allowed a cluster of knowledge to form, which was important to the success of the platform. The team’s job was not complete until they delivered a complete, integrated, working platform to the client as a fully operational solution to the enterprise’s problem.
The End of IBM’s Monopoly
During the 1970s, the monopoly of IBM was curtailed enough for other companies such as Oracle, DEC, and Amdahl to rise and start offering IBM clients alternatives. DEC created small computers which provided higher performance at a fraction of the cost of the mainframes produced by IBM. But the main issue was compatibility.
Meanwhile, Amdahl offered a compatible alternative which was less costly than the IBM mainframe. Companies could now create and market their own range of products and services and become successful in a world with less monopolistic enterprises.
These alternatives led to silos of expertise and silos of vendors inside the IT groups that were aligned with those vendors. Like Amdahl, Oracle also took advantage of technology which IBM created but never turned into products. The cofounder of Oracle, Larry Ellison, harnessed the power of relational database technology which was originally developed by IBM. Oracle placed it on the seminal DEC VAX and created one of the first business software companies of the post-mainframe era.

When products inside silos were offered to customers, putting the system together was no longer the concern of a single supplier. It became the job of the customer.
Large systems integrators such as Wipro and Accenture are now trying to fill this gap. However, they also operate inside the constraints of IT departments and the same organizational silos created by the vendors.
Silos are the price paid for the post-mainframe alternatives to IBM. Silos can obscure the true nature of computer platforms as one system of interlinked software and hardware.
Oracle made a fortune from its post-mainframe silo for many years, as customers purchased its database tech and ran it on hardware from EMC, HP, and Sun. As computer apps became more sophisticated, creating platforms out of silos became harder, and business organizations trying to use Oracle’s clustering technology, RAC, realized that it was nearly impossible to set up.
Because this failure could be blamed on their clients’ own substandard platform engineering, which exposed more flaws, Oracle developed an engineered platform which combined all the parts and the engineering product expertise that made lucrative experiences possible. Exadata, the resulting product, was originally created for the data warehouse market. However, it has found more success with conventional Oracle RAC clients running apps such as SAP.
Because Oracle was a software company, the original release of Exadata was based on hardware created by HP. However, Exadata became so successful that Oracle decided to source the hardware parts itself, which also became part of the reason why they acquired Sun.
By sourcing all the software and hardware components in Exadata, Oracle revived the all-in model of the mainframe era.
This all-in model is also known as ‘one throat to choke’. On the surface, this is enticing, but it assumes that the throat can actually be choked. Large clients including AT&T, Citibank, and Amgen buy so much in services and equipment that they can choke any vendor they want when things go down.
But for the majority of customers, who are too big to manage their own database without technical support from Oracle yet too small to demand timely support from Oracle, all-in shopping usually decreases their leverage with vendors.

Like Exadata, big data supercomputers should be designed as engineered platforms, and this design must be built on an engineering approach where all the software and hardware parts are considered as one system. This is the platform as one system - the system it was before its parts were carved up into vendor silos.
At present, data architects are responsible for designing the new platforms; they are often found in their corresponding IT departments, where they work as experts in their particular silo. But like building architects, platform architects should have an intensive working knowledge of the whole platform, which includes the enterprise value of the whole platform, the physical aspects of the plant, and bits of computer science.
Since any part of the platform could be optimized, repaired, or triaged, architects working on the platform should be knowledgeable about the entire platform to effectively collaborate with controllers, business owners, UI designers, Java or Linux programmers, network designers, and data center electricians.
Architects working on the data platform should be agile and capable enough to pore over the details of the system with the network administrator, and then fly to another team composed of business owners. Overfamiliarity or too much knowledge of only one aspect of the Big Data system could obscure the whole perspective of the platform. It is crucial to have the capacity to filter out the details, as there are varied forms of details and their relative significance may shift from one form to another.
Sorting out details according to their importance is a very crucial skill that a platform architect should have. Creating systems as platforms is a skill that is not usually taught at school, and is often acquired on the job.
This aspect of Big Data requires a learning process which could easily alienate colleagues in other groups, because it may seem that platform architects are trying to do everyone’s work.
But in reality, architects are working on a job that no one else knows how to do, or at least no one is completely willing to do. Hence, many data architects are not part of the IT organization but are freelancing around the rough edges where the platform is not recovering or scaling. Freelance platform architects are usually hired to triage the system. Once the edges are polished, there is only a slim window of opportunity to inform the system owners about the details of their platform.
Big Data is Do It Yourself Supercomputing
Whatever cluster they stand up will come from production without data or applications. To populate the platform, data should be freed from its own organizational and technical silos.
Big Data is now crucial for the enterprise because of the business value it holds. Data specialists are developing new strategies for analyzing both legacy data and the tons of new data streaming in.
Both Operations and Development will be responsible for the success of the enterprise’s big data initiative. The walls between the platform, organization, data, and business can’t exist at worldwide scale.
Like our nervous system, a big data cluster is a complex interconnected system built from a group of commodity parts. The neurons in the brain are considered the building blocks of the nervous system, but they are very basic components. Remember, the neurons in a goby fish are made of these same very basic components.
But you are far more sophisticated than the sum of your goby fish parts. The ultimate big data job of your nervous system is your personality and behavior.
Big Data Platform Engineering
Big Data platforms should operate and process data at a scale which leaves minimal room for error. Like a Boeing 747, big data clusters should be developed for efficiency, scale, and speed. Most business organizations venturing into big data don’t have the expertise and experience in designing and running supercomputers. However, many enterprises are now facing that prospect. Awareness of the big data platform will increase the chance of success with big data.
Legacy silos - whether vendor, organizational, or infrastructure - should be replaced with a platform-centric perspective. Previously, agencies and enterprises were satisfied with buying an SQL prompt and building their own applications. Nowadays, these groups cannot read raw data science output; they need to visualize it, or it will be impossible to derive the business value that they are searching for. Enterprise owners prefer to see images, not raw numbers.
Unlike the legacy infrastructure stack, silos have no place in the data visualization stack. Once implemented properly, the big data platform can deliver the data to the right layer - the analytics layer - at the right time and for the right cost.

If the platform can aggregate more complex data such as SQL, JPGs, PDFs, videos, and tweets, then the analytics layer will be in the best position to deliver intelligence that is actionable for the business.