Sixty-one percent of IT leaders expect spending on big data
initiatives to increase, while only 5% expect decreases. The challenge,
according to a recent survey, is finding the right big data talent to fulfill
those initiatives.
Nearly 60% of the respondents are confident that their IT
department can satisfy the big data demands of the business, while 14% are not confident.
"The data indicates current expectations of big data
are still somewhat unrealistic due to market hype,”the report states. “Despite
IT leaders expecting spending to increase, the confidence level in their
department’s ability to meet big data demands in comparison to broader IT
initiatives is lower.”
About two-thirds of the IT executives rank big data
architects as the most difficult role to fill. Data scientists (48%) and data
modelers (43%) round out the top three hardest-to-fill positions, while more
technical big data positions are ranked less difficult to fill.
Not coincidentally, big data companies are introducing
online and face-to-face training programs and certifications for Hadoop and
other related software platforms.
Still, other big data challenges remain. Variety, the
dimension of big data dealing with the different forms of data, does the most
to hinder organizations from deriving value from big data, according to 45% of
those surveyed. The speed of data is next among the challenges, at 31%,
followed by the amount of data, at 24%.
The application of big data is happening in a number of
business areas, according to the study, with 81% of organizations viewing
operations and fulfillment as priority areas within the next 12 months. This
was followed by customer satisfaction (53%), business strategy (52%),
governance/risk/compliance (51%) and sales/marketing (49%).
More than 200 IT leaders participated in the February 2015
survey.
ADDRESSING FIVE EMERGING CHALLENGES
OF BIG DATA
Introduction - Big Data Challenges
Challenge #1: Uncertainty of the Data Management Landscape
Challenge #2: The Big Data Talent Gap
Challenge #3: Getting Data into the Big Data Platform
Challenge #4: Synchronization across the Data Sources
Challenge #5: Getting Useful Information out of the Big Data
Platform
Considerations: What Risks Do These Challenges Really Pose?
Conclusion: Addressing the Challenge with a Big Data Integration
Strategy
INTRODUCTION - BIG DATA CHALLENGES
Big data technologies are maturing to the point where more
organizations are prepared to pilot and adopt big data as a core component of
the information management and analytics infrastructure. Big data, as a
compendium of emerging disruptive tools and technologies, is positioned as the
next great step in enabling integrated analytics in many common business
scenarios.
As big data wends its inexorable way into the enterprise, information
technology (IT) practitioners and business sponsors alike will bump up against
a number of challenges that must be addressed before any big data program can
be successful. Five of those challenges are:
1. Uncertainty of the Data Management Landscape – There are many competing technologies, and within each technical area there are numerous rivals. Our first challenge is making the best choices while not introducing additional unknowns and risk to big data adoption.
2. The Big Data Talent Gap – The excitement around big data applications seems to imply that there is a broad community of experts available to help in implementation. However, this is not yet the case, and the talent gap poses our second challenge.
3. Getting Data into the Big Data Platform – The scale and variety of data to be absorbed into a big data environment can overwhelm the unprepared data practitioner, making data accessibility and integration our third challenge.
4. Synchronization Across the Data Sources – As more data sets from diverse sources are incorporated into an analytical platform, the potential for time lags to impact data currency and consistency becomes our fourth challenge.
5. Getting Useful Information out of the Big Data Platform – Lastly, using big data for different purposes ranging from storage augmentation to enabling high-performance analytics is impeded if the information cannot be adequately provisioned back within the other components of the enterprise information architecture, making big data syndication our fifth challenge.
In this paper, we examine these challenges and consider the
requirements for tools to help address them. First, we discuss each of the
challenges in greater detail; then we look at understanding and quantifying
the risks of not addressing them. Finally, we explore how a strategy for data
integration can be crafted to manage those risks.
CHALLENGE #1: UNCERTAINTY OF THE DATA MANAGEMENT LANDSCAPE
One disruptive facet of big data is the use of a variety of
innovative data management frameworks whose designs are intended to support
both operational and, to a greater extent, analytical processing. These
approaches are generally lumped into a category referred to as NoSQL (that is,
“not only SQL”) frameworks, which differ from the conventional relational
database management system paradigm in storage model and data access
methodology, and are largely designed to meet the performance demands of big
data applications (such as managing massive amounts of data and delivering
rapid response times).
There are a number of different NoSQL approaches. Some employ the
paradigm of a document store that maintains a hierarchical object
representation (using standard encoding methods such as XML, JSON, or BSON)
for each managed data object or entity. Others are based on the concept of a
key-value store, which lets applications attach values for varying named
attributes (“keys”) to each managed object in the data set, essentially
enabling a schema-less model. Graph databases maintain the interconnected
relationships among different objects, simplifying social network analyses.
And other paradigms are continuing to evolve.
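To make these distinctions concrete, here is a minimal sketch, using plain Python structures rather than any particular NoSQL product, of how the same customer data might be represented under each of the three paradigms just described; all names and identifiers are hypothetical.

```python
import json

# Document store: each entity is a self-describing hierarchical object,
# serialized here as JSON (XML and BSON are equally common encodings).
customer_doc = json.dumps({
    "id": "cust-1001",
    "name": "Acme Corp",
    "orders": [{"order_id": "ord-77", "total": 1250.00}],
})

# Key-value store: arbitrary named attributes ("keys") per object.
# Note the two records carry different keys: a schema-less model.
kv_store = {
    "cust-1001": {"name": "Acme Corp", "region": "EMEA"},
    "cust-1002": {"name": "Globex", "twitter_handle": "@globex"},
}

# Graph database: objects are nodes and relationships are first-class
# edges, which simplifies traversals such as social network analysis.
nodes = {"cust-1001", "cust-1002"}
edges = [("cust-1001", "REFERRED", "cust-1002")]

# Walk the edges to find everyone a given customer referred.
referred = [dst for src, rel, dst in edges
            if src == "cust-1001" and rel == "REFERRED"]
print(referred)  # ['cust-1002']
```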
We are still in the relatively early stages of this
evolution, with many competing approaches and companies. In fact, within each
of these NoSQL categories, there are dozens of models being developed by a wide
contingent of organizations, both commercial and non-commercial. Each approach
is suited to different key performance dimensions: some models provide great
flexibility, others are eminently scalable in terms of performance, while
others support a wider range of functionality.
In other words, the wide variety of NoSQL tools, their developers,
and the state of the market lend a great degree of uncertainty to the data
management landscape. Choosing a NoSQL tool can be difficult, and committing to
the wrong core data management technology can prove a costly error if the
selected vendor’s tool does not live up to expectations, the vendor company
fails, or third-party application development gravitates toward different data
management schemes. For any organization seeking to institute big data, the
challenge is to select among the NoSQL alternatives while mitigating the
technology risk.
CHALLENGE #2: THE BIG DATA TALENT GAP
It is difficult to peruse the analyst and high-tech media
without being bombarded with content touting the value of big data analytics
and the corresponding reliance on a wide variety of disruptive technologies.
These new tools range from traditional relational database tools with
alternative data layouts designed to increase access speed while decreasing the
storage footprint, to in-memory analytics, NoSQL data management frameworks,
and the broad Hadoop ecosystem.
There is a growing community of application developers who
are increasing their knowledge of tools like those comprising the Hadoop
ecosystem. Yet despite the promotion of these big data technologies, the
reality is that the market does not hold a wealth of such skills. The typical
expert has gained experience through tool implementation and its use as a
programming model, rather than through the data management aspects. That
suggests that many big data tools experts remain somewhat naïve when it comes
to the practical aspects of data modeling, data architecture, and data
integration. In turn, this can lead to less-than-successful implementations
whose performance is negatively impacted by issues related to data
accessibility.
And the talent gap is real. Consider these statistics:
According to analyst firm McKinsey & Company, “By 2018, the United States
alone could face a shortage of 140,000 to 190,000 people with deep analytical
skills as well as 1.5 million managers and analysts with the know-how to use
the analysis of big data to make effective decisions.” [2] And in a report
from 2012, “Gartner analysts predicted that by 2015, 4.4 million IT jobs
globally will be created to support big data, with 1.9 million of those jobs in
the United States. … However, while the jobs will be created, there is no
assurance that there will be employees to fill those positions.”
CHALLENGE #3: GETTING DATA INTO THE BIG DATA PLATFORM
It
might seem obvious that the intent of a big data program involves processing or
analyzing massive amounts of data. Yet while many people have raised
expectations regarding analyzing massive data sets sitting in a big data
platform, they may not be aware of the complexity of facilitating the access,
transmission, and delivery of data from the numerous sources and then loading
those various data sets into the big data platform.
The
impulse toward establishing the ability to manage and analyze data sets of
potentially gargantuan size can overshadow the practical steps needed to
seamlessly provision data to the big data environment. The intricate aspects of
data access, movement, and loading are only part of the challenge. The need to
navigate extraction and transformation is not limited to conventional
structured relational data sets. Analysts increasingly want to import older
mainframe data sets (in VSAM files or IMS structures, for example) and at the
same time want to absorb meaningful representations of objects and concepts
refined out of different types of unstructured data sources such as emails,
texts, tweets, images, graphics, audio files, and videos, all accompanied by
their corresponding metadata.
An additional challenge is navigating the response time expectations
for loading data into the platform. Trying to squeeze massive data volumes
through “data pipes” of limited bandwidth will degrade performance and may
even impact data currency. This implies two challenges for any organization
starting a big data program: the first involves cataloging the numerous data
source types expected to be incorporated into the analytical framework and
ensuring that there are methods for universal data accessibility; the second
is to understand the performance expectations and ensure that the tools and
infrastructure can handle the volume transfers in a timely manner.
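As a rough illustration of both steps, the sketch below catalogs a handful of hypothetical source types and sanity-checks their daily volumes against an assumed network link; the source names, volumes, bandwidth, and efficiency factor are all illustrative assumptions, not measurements.

```python
# Step 1: catalog the source types expected to feed the big data platform.
sources = [
    {"name": "orders_rdbms", "kind": "relational", "daily_gb": 40},
    {"name": "legacy_vsam", "kind": "mainframe VSAM", "daily_gb": 15},
    {"name": "social_feed", "kind": "unstructured text", "daily_gb": 200},
    {"name": "clickstream", "kind": "web logs", "daily_gb": 8000},
]

# Step 2: estimate load time as volume / effective bandwidth.
BANDWIDTH_GBPS = 1.0   # assumed link speed, gigabits per second
EFFICIENCY = 0.6       # assumed realistic fraction of theoretical throughput
gb_per_hour = BANDWIDTH_GBPS * EFFICIENCY * 3600 / 8  # -> 270 GB/hour

for src in sources:
    hours = src["daily_gb"] / gb_per_hour
    flag = "OK" if hours < 24 else "WILL FALL BEHIND"
    print(f"{src['name']:>13}: {src['daily_gb']:>5} GB/day -> "
          f"{hours:5.1f} h to load [{flag}]")
```

Even this back-of-the-envelope arithmetic surfaces the bandwidth problem: a feed whose daily volume needs more than 24 hours to move through the pipe will fall progressively further behind.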
CHALLENGE #4: SYNCHRONIZATION ACROSS THE DATA SOURCES
Once
you have figured out how to get data into the big data platform, you begin to
realize that data copies migrated from different sources on different schedules
and at different rates can rapidly get out of synchronization with the
originating systems. There are different aspects of synchrony. From a data
currency perspective, synchrony implies that the data coming from one source is
not out of date with data coming from another source. From a semantics
perspective, synchronization implies commonality of data concepts, definitions,
metadata, and the like.
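As a simple illustration of the data currency aspect, the sketch below flags any source feed whose latest extract lags too far behind the freshest one before the feeds are joined in an analysis; the feed names, timestamps, and tolerance are hypothetical.

```python
from datetime import datetime, timedelta

# Most recent successful extract per source feed (hypothetical values).
last_extract = {
    "crm_customers":   datetime(2015, 2, 10, 6, 0),
    "web_clickstream": datetime(2015, 2, 10, 5, 45),
    "erp_orders":      datetime(2015, 2, 9, 22, 0),   # lagging feed
}

TOLERANCE = timedelta(hours=4)   # maximum acceptable skew between feeds
newest = max(last_extract.values())

# A feed is stale if it trails the freshest feed by more than the tolerance.
stale = {name: newest - ts
         for name, ts in last_extract.items()
         if newest - ts > TOLERANCE}

for name, lag in stale.items():
    print(f"WARNING: {name} lags the freshest source by {lag}")
# -> WARNING: erp_orders lags the freshest source by 8:00:00
```

Semantic synchrony, meaning common concepts, definitions, and metadata, is harder to automate, but a currency check like this is a cheap first line of defense.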
With
conventional data marts and data warehouses, sequences of data extractions,
transformations, and migrations all provide situations in which there is a risk
for information to become unsynchronized. But as data volumes explode and the
speed at which updates are expected to be made increases, ensuring the level of
governance typically applied to conventional data management environments
becomes much more difficult.
The
inability to ensure synchrony for big data poses the risk of analyses that use
inconsistent or potentially even invalid information. If inconsistent data in a
conventional data warehouse poses a risk of forwarding faulty analytical
results to downstream information consumers, allowing more rampant
inconsistencies and asynchrony in a big data environment can have a much more
disastrous effect.
CHALLENGE #5: GETTING USEFUL INFORMATION OUT OF THE BIG DATA
PLATFORM
Most of the practical use cases for big data involve data availability:
augmenting existing data storage as well as providing access to end-users
employing business intelligence tools for the purpose of data discovery. These
BI tools not only must be able to connect to one or more big data platforms,
they must provide transparency to the data consumers to reduce or eliminate the
need for custom coding. At the same time, as the number of data consumers
grows, we can anticipate a need to support a rapidly expanding collection of
many simultaneous user accesses. That demand may spike at different times of
the day or in reaction to different aspects of business process cycles.
Ensuring right-time data availability to the community of data consumers
becomes a critical success factor.
This
frames our fifth and final challenge: enabling a means of making data
accessible to the different types of downstream applications in a way that is
seamless and transparent to the consuming applications while elastically
supporting demand.
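A minimal sketch of such an access layer follows, with a routing table that hides which backing platform serves each data set and a worker pool standing in for many simultaneous consumer requests; the backend names, data sets, and pool size are hypothetical stand-ins rather than any specific BI tool's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Routing table: consumers name a data set, never a platform or its API.
BACKENDS = {"sales": "hadoop_cluster", "customers": "nosql_store"}

def run_query(consumer_id: int, dataset: str) -> str:
    # Resolve the data set to whichever platform holds it, so consuming
    # applications need no custom coding per backend technology.
    backend = BACKENDS[dataset]
    return f"consumer {consumer_id}: '{dataset}' served from {backend}"

# A worker pool absorbs simultaneous requests; in practice it would be
# sized elastically against demand spikes rather than fixed at 8.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_query, i, "sales" if i % 2 else "customers")
               for i in range(6)]
    for f in futures:
        print(f.result())
```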
CONSIDERATIONS: WHAT RISKS DO THESE CHALLENGES REALLY POSE?
Considering
the business impacts of these challenges suggests some serious risks to
successfully deploying a big data program. In Table 1, we reflect on the
impacts of our challenges and corresponding risks to success.
Table 1. Challenges, impacts, and risks

Challenge: Uncertainty of the market landscape
Impact: Difficulty in choosing technology components; vendor lock-in
Risk: Committing to a failing product or a failing vendor

Challenge: Big data talent gap
Impact: Steep learning curve; extended time for design, development, and implementation
Risk: Delayed time to value

Challenge: Big data loading
Impact: Increased cycle time for analytical platform data population
Risk: Inability to actualize the program due to unmanageable data latencies

Challenge: Synchronization
Impact: Data that is inconsistent or out of date
Risk: Flawed decisions based on flawed data

Challenge: Big data accessibility
Impact: Increased complexity in syndicating data to end-user discovery tools
Risk: Inability to appropriately satisfy the growing community of data consumers