IBM - Big Data Architecture and Patterns

https://www.ibm.com/developerworks/library/bd-archpatterns1/

Part 1: Introduction to big data classification and architecture
How to classify big data into categories

Divakar Mysore with Shrikant Khupat, Shweta Jain


Published on September 17, 2013
Overview
Big data can be stored, acquired, processed, and analyzed in many ways. Every big
data source has different characteristics, including the frequency, volume, velocity, type,
and veracity of the data. When big data is processed and stored, additional dimensions
come into play, such as governance, security, and policies. Choosing an architecture
and building an appropriate big data solution is challenging because so many factors
have to be considered.
This "Big data architecture and patterns" series presents a structured and pattern-based
approach to simplify the task of defining an overall big data architecture. Because it is
important to assess whether a business scenario is a big data problem, we include
pointers to help determine which business problems are good candidates for big data
solutions.
From classifying big data to choosing a big data solution
Try out IBM big data solutions
Download a trial version of an IBM big data solution and see how it works in your own
environment. Choose from several products:
 BigInsights Quick Start Edition, the IBM Hadoop-based offering that extends the value of open
source Hadoop with functions like Big SQL, text analytics, and BigSheets
 InfoSphere Streams Quick Start Edition, a non-production version of InfoSphere Streams, a
high-performance computing platform that rapidly ingests, analyzes, and correlates information
as it arrives from thousands of real-time sources
 Many additional big data and analytics products available for trial download
If you've spent any time investigating big data solutions, you know it's no simple task.
This series takes you through the major steps involved in finding the big data solution
that meets your needs.
We begin by looking at types of data described by the term "big data." To simplify the
complexity of big data types, we classify big data according to various parameters and
provide a logical architecture for the layers and high-level components involved in any
big data solution. Next, we propose a structure for classifying big data business
problems by defining atomic and composite classification patterns. These patterns help
determine the appropriate solution pattern to apply. We include sample business
problems from various industries. And finally, for every component and pattern, we
present the products that offer the relevant function.
Part 1 explains how to classify big data. Additional articles in this series cover the
following topics:
 Defining a logical architecture of the layers and components of a big data solution
 Understanding atomic patterns for big data solutions
 Understanding composite (or mixed) patterns to use for big data solutions
 Choosing a solution pattern for a big data solution
 Determining the viability of a business problem for a big data solution
 Selecting the right products to implement a big data solution
Classifying business problems according to big data type
Business problems can be categorized into types of big data problems. Down the road,
we'll use this type to determine the appropriate classification pattern (atomic or
composite) and the appropriate big data solution. But the first step is to map the
business problem to its big data type. The following table lists common business
problems and assigns a big data type to each.
Big data business problems by type

Utilities: Predict power consumption
Big data type: Machine-generated data
Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that needs to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. To gain operating efficiency, the company must monitor the data delivered by the sensors. A big data solution can analyze power generation (supply) and power consumption (demand) data using smart meters.

Telecommunications: Customer churn analytics
Big data type: Web and social data; Transaction data
Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as CDRs, to keep up with the competition. The value of the churn models depends on the quality of customer attributes (customer master data such as date of birth, gender, location, and income) and the social behavior of customers. Telecommunications providers who implement a predictive analytics strategy can manage and predict churn by analyzing the calling patterns of subscribers.

Marketing: Sentiment analysis
Big data type: Web and social data
Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results. Customer feedback may vary according to customer demographics.

Customer service: Call monitoring
Big data type: Human-generated data
IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.

Retail: Personalized messaging based on facial recognition and social media
Big data type: Web and social data; Biometrics
Retailers can use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on buying behavior and location. This capability could have a tremendous impact on retailers' loyalty programs, but it has serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.

Retail and marketing: Mobile data and location-based targeting
Big data type: Machine-generated data; Transaction data
Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enables retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.

FSS, Healthcare: Fraud detection
Big data type: Machine-generated data; Transaction data; Human-generated data
Fraud management predicts the likelihood that a given transaction or customer account is experiencing fraud. Solutions analyze transactions in real time and generate recommendations for immediate action, which is critical to stopping third-party fraud, first-party fraud, and deliberate misuse of account privileges. Solutions are typically designed to detect and prevent myriad fraud and risk types across multiple industries, including:
 Credit and debit payment card fraud
 Deposit account fraud
 Technical fraud
 Bad debt
 Healthcare fraud
 Medicaid and Medicare fraud
 Property and casualty insurance fraud
 Workers' compensation fraud
 Insurance fraud
 Telecommunications fraud
Categorizing big data problems by type makes it simpler to see the characteristics of
each kind of data. These characteristics can help us understand how the data is
acquired, how it is processed into the appropriate format, and how frequently new data
becomes available. Data from different sources has different characteristics; for
example, social media data can have video, images, and unstructured text such as blog
posts, coming in continuously.
We assess data according to these common characteristics, covered in detail in the
next section:
 The format of the content
 The type of data (transaction data, historical data, or master data, for example)
 The frequency at which the data will be made available
 The intent: how the data needs to be processed (ad-hoc query on the data, for example)
 Whether the processing must take place in real time, near real time, or in batch mode.
Using big data type to classify big data characteristics
It's helpful to look at the characteristics of big data along certain lines — for
example, how the data is collected, analyzed, and processed. Once the data is
classified, it can be matched with the appropriate big data pattern (a minimal example record follows this list):
 Analysis type — Whether the data is analyzed in real time or batched for later analysis. Give
careful consideration to choosing the analysis type, since it affects several other decisions about
products, tools, hardware, data sources, and expected data frequency. A mix of both types may
be required by the use case:
o Fraud detection; analysis must be done in real time or near real time.
o Trend analysis for strategic business decisions; analysis can be in batch mode.
 Processing methodology — The type of technique to be applied for processing data (e.g.,
predictive, analytical, ad-hoc query, and reporting). Business requirements determine the
appropriate processing methodology. A combination of techniques can be used. The choice of
processing methodology helps identify the appropriate tools and techniques to be used in your
big data solution.
 Data frequency and size — How much data is expected and at what frequency it arrives.
Knowing frequency and size helps determine the storage mechanism, storage format, and the
necessary preprocessing tools. Data frequency and size depend on data sources:
o On demand, as with social media data
o Continuous feed, real-time (weather data, transactional data)
o Time series (time-based data)
 Data type — Type of data to be processed — transactional, historical, master data, and others.
Knowing the data type helps segregate the data in storage.
 Content format — Format of incoming data — structured (RDBMS, for example), unstructured
(audio, video, and images, for example), or semi-structured. Format determines how the
incoming data needs to be processed and is key to choosing tools and techniques and defining
a solution from a business perspective.
 Data source — Sources of data (where the data is generated) — web and social media,
machine-generated, human-generated, etc. Identifying all the data sources helps determine the
scope from a business perspective. The figure shows the most widely used data sources.
 Data consumers — A list of all of the possible consumers of the processed data:
o Business processes
o Business users
o Enterprise applications
o Individual people in various business roles
o Part of the process flows
o Other data repositories or enterprise applications
 Hardware — The type of hardware on which the big data solution will be implemented —
commodity hardware or state of the art. Understanding the limitations of hardware helps inform
the choice of big data solution.
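To make the categories concrete, here is a minimal sketch, assuming Python; the class, field names, and values are illustrative assumptions rather than anything prescribed by this article. It simply records a workload along the dimensions above so that it can later be matched to a pattern.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class AnalysisType(Enum):
    REAL_TIME = "real time"
    NEAR_REAL_TIME = "near real time"
    BATCH = "batch"

@dataclass
class WorkloadProfile:
    """Hypothetical record capturing the classification categories described above."""
    analysis_type: AnalysisType
    processing_methodology: List[str]   # e.g. ["predictive", "ad-hoc query", "reporting"]
    data_frequency: str                 # "on demand", "continuous feed", "time series"
    data_size_tb_per_day: float         # expected volume; illustrative unit
    data_types: List[str]               # "transactional", "historical", "master"
    content_formats: List[str]          # "structured", "semi-structured", "unstructured"
    data_sources: List[str]             # "web and social media", "machine-generated", ...
    data_consumers: List[str]           # "business users", "enterprise applications", ...
    hardware: str = "commodity"

# Illustrative mapping of the fraud-detection scenario from the table above.
fraud_detection = WorkloadProfile(
    analysis_type=AnalysisType.REAL_TIME,
    processing_methodology=["predictive", "reporting"],
    data_frequency="continuous feed",
    data_size_tb_per_day=2.0,
    data_types=["transactional"],
    content_formats=["structured", "semi-structured"],
    data_sources=["machine-generated", "human-generated"],
    data_consumers=["business processes", "business users"],
)
```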
The following figure depicts the various categories for classifying big data. Key categories for defining big
data patterns have been identified and highlighted in striped blue. Big data patterns,
defined in the next article, are derived from a combination of these categories.

Big data classification

Conclusion and acknowledgements


In the rest of this series, we'll describe the logical architecture and the layers of a big
data solution, from accessing to consuming big data. We will include an exhaustive list
of data sources, and introduce you to atomic patterns that focus on each of the
important aspects of a big data solution. We'll go over composite patterns and explain
how atomic patterns can be combined to solve particular big data use cases. We'll
conclude the series with some solution patterns that map widely used use cases to
products.
The authors would like to thank Rakesh R. Shinde for his guidance in defining the
overall structure of this series, and for reviewing it and providing valuable comments.
Part 2: How to know if a big data solution is
right for your organization
Before making the decision to invest in a big data solution, evaluate the data available
for analysis; the insight that might be gained from analyzing it; and the resources
available to define, design, create, and deploy a big data platform. Asking the right
questions is a good place to start. Use the questions in this article to guide your
investigation. The answers will begin to reveal more about the characteristics of the
data and the problem you're trying to solve.
Although organizations generally have a vague understanding of the type of data that
needs to be analyzed, it's quite possible that the specifics are not as clear. After all, the
data might hold keys to patterns that have not been noticed before, and once a pattern
is recognized, the need for additional analysis becomes obvious. To help uncover
these unknown unknowns, start by implementing a few basic use cases, and in the
process, collect and gather data that was not previously available. As the data
repository is built and more data is collected, a data scientist is better able to determine
the key data and better able to build predictive and statistical models that will generate
more insight.
It may also be the case that the organization already knows what it does not know. To
address these known unknowns, the organization must start by working with a data
scientist to identify the external or third-party data sources and to implement a few use
cases that rely on this external data.
This article first tries to answer some of the questions typically raised by most CIOs
prior to taking up a big data initiative, then focuses on a dimensions-based approach
that will help in assessing the viability of a big data solution for an organization.
Does my big data problem require a big data solution?
Big data, a little at a time
For the most part, organizations choose to implement a big data solution incrementally.
Not every analytical and reporting requirement requires a big data solution. For projects
that perform parallel processing on a large dataset or ad-hoc reporting from multiple
data sources, a big data solution may not be necessary.
With the advent of big data technologies, organizations are asking themselves: "Is big
data the right solution to my business problem, or does it provide me with a business
opportunity? Are business opportunities hiding in the big data?" Here are some of the
typical questions we hear from CIOs:
 What kind of insight and business value are possible if I use big data technologies?
 Is it possible to augment my existing data warehouse?
 How do I assess the cost of expanding my current environment or adopting a new solution?
 What is the impact on my existing IT governance?
 Can I incrementally implement a big data solution?
 What specific skills are required to understand and analyze the requirements to build and
maintain the big data solution?
 Do I have existing enterprise data that could be used to deliver business insight?
 The complexity of my data coming in from a variety of sources is increasing. Can a big data
solution help?
Dimensions to help assess the viability of a big data solution
To answer these questions, this article proposes a structured approach for evaluating
the viability of a big data solution according to the dimensions shown in the following
figure.
Figure 1. Dimensions to consider when assessing the viability of a big data solution

 Business value from the insight that might be gained from analyzing the data
 Governance considerations for the new sources of data and how the data will be used
 People with relevant skills available and commitment of sponsors
 Volume of the data being captured
 Variety of data sources, data types, and data formats
 Velocity at which the data is generated, the speed with which it needs to be acted upon, or the
rate at which it is changing
 Veracity of the data, or rather, the uncertainty or trustworthiness of the data
For each dimension, we include key questions. Assign a weight and priority for each
dimension, according to the business context. The assessment will vary by business
case and by organization. Consider working through these questions in a series of
workshops with the relevant business and IT stakeholders.
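As one way to tally the outcome of such a workshop, here is a minimal scoring sketch in Python; the weights and scores are purely illustrative assumptions, not values recommended by this article.

```python
# Hypothetical viability assessment: each dimension gets a weight (its importance in the
# business context) and a workshop score (0 = weak case for big data, 5 = strong case).
dimensions = {
    "business value": {"weight": 0.25, "score": 4},
    "governance":     {"weight": 0.10, "score": 3},
    "people":         {"weight": 0.15, "score": 2},
    "volume":         {"weight": 0.15, "score": 5},
    "variety":        {"weight": 0.15, "score": 4},
    "velocity":       {"weight": 0.10, "score": 3},
    "veracity":       {"weight": 0.10, "score": 2},
}

# Weighted average on the 0-5 scale.
total_weight = sum(d["weight"] for d in dimensions.values())
viability = sum(d["weight"] * d["score"] for d in dimensions.values()) / total_weight

print(f"Viability score: {viability:.2f} / 5")  # 3.45 / 5 with the illustrative values above
```

In practice, the weights themselves are a workshop output, agreed with business and IT stakeholders, rather than an input fixed in advance.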
Business value: What insights are possible with big data technologies?
Many organizations wonder if the business insights they are seeking can be addressed
by a big data solution. There are no definitive guidelines that define the insights that can
be derived from big data. The scenarios need to be identified by the organization and
they evolve over time. A data scientist is key to determining and identifying the business
use cases and scenarios that, if implemented, will bring significant value to the
business.
The data scientist must be able to understand the key performance indicators and apply
statistical and complex algorithms to the data to get a list of use cases. The use cases
vary by industry and business. It's helpful to study the market for what competitors are
doing, which market forces are at work, and primarily, what customers are looking for.
The following table shows examples of use cases from various industries.
Table 1. Sample use cases from various industries
Industry: E-commerce and online retail
E-retailers like eBay are constantly creating target offers to boost customer lifetime value (CLV); deliver consistent cross-channel customer experiences; harvest customer leads from sales, marketing, and other sources; and continuously optimize back-end processes.
 Recommendation engines: Increase average order size by recommending complementary products based on predictive analysis for cross-selling.
 Cross-channel analytics: Sales attribution, average order value, and lifetime value (for example, how many in-store purchases resulted from a particular recommendation, advertisement, or promotion).
 Event analytics: What series of steps (the golden path) led to a desired outcome (product purchase or registration, for example)?
 "Right offer at the right time" and "Next-best offer": Deploying predictive models in combination with recommendation engines that drive automated next-best offers and tailored interactions across multiple interaction channels.

Industry: Retail and customer-focused
 Merchandizing and market-basket analysis
 Campaign management and customer loyalty programs
 Supply-chain management and analytics
 Event- and behavior-based targeting
 Market and consumer segmentations
 Predictive analysis: Retailers want to predict factors that might be important for a buyer before the product is put on shelves

Industry: Financial services
 Compliance and regulatory reporting
 Risk analysis and management
 Fraud detection and security analytics
 CRM and customer loyalty programs
 Credit risk, scoring, and analysis
 High-speed arbitrage trading
 Trade surveillance
 Abnormal trading pattern analysis

Industry: Fraud detection
Fraud management helps improve customer profitability by predicting the likelihood that a given transaction or customer account is experiencing fraud. Solutions analyze transactions in real time and generate recommendations for immediate action, which is critical to stopping third-party fraud, as well as first-party fraud and deliberate misuse of account privileges. Solutions are typically designed to detect and prevent a wide variety of fraud and risk types across multiple industries, including:
 Credit and debit payment card fraud
 Deposit account fraud
 Technical fraud and bad debt
 Healthcare fraud
 Medicaid and Medicare fraud
 Property and casualty insurance fraud
 Workers' compensation fraud
 Insurance fraud

Industry: Web and digital media
Much of the data we currently work with is the direct consequence of increased social media and digital marketing. Customers generate a trail of "data exhaust" that can be mined and put to use.
 Large-scale click-stream analytics
 Ad targeting, analysis, forecasting, and optimization
 Abuse and click-fraud prevention
 Social graph analysis and profile segmentation
 Campaign management and loyalty programs

Industry: Public sector
 Fraud detection
 Threat detection
 Cyber-security
 Compliance and regulatory analysis
 Energy consumption and carbon footprint management

Industry: Health and life sciences
 Health insurance fraud detection
 Campaign and sales program optimization
 Brand management
 Patient care quality and program analysis
 Medical device and pharmaceutical supply-chain management
 Drug discovery and development analysis

Industry: Telecommunications
 Revenue assurance and price optimization
 Customer churn prevention
 Campaign management and customer loyalty
 Call Detail Record (CDR) analysis
 Network performance and optimization
 Mobile user location analysis

Industry: Utilities
Utilities run big, expensive, complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. Efficiency means paying careful attention to all of the data streaming from the sensors.
 Utilities are now leveraging Hadoop clusters to analyze power generation (supply) and power consumption (demand) data via smart meters.
 The adoption of smart meters has resulted in a deluge of data flowing at unprecedented levels. Most utilities are ill-prepared to analyze the data once the meters are turned on.

Industry: Media
In the cable industry, big data can be used to analyze set-top box data on a daily basis by large cable operators such as Time Warner, Comcast, and Cox Communications. This data can be leveraged to adjust advertising or promotional activity.

Industry: Miscellaneous
 Mashups: Mobile user location and precision targeting
 Machine-generated data
 Online dating: A leading online dating service uses sophisticated analysis to measure the compatibility between individual members, so it can suggest good matches
 Online gaming
 Predictive maintenance of aircraft and automobiles

Potential customers are generating huge amounts of new data on social networks and
review sites. Within the enterprise, transactional data and web logs are growing as
customers switch to online channels to conduct business and interact with companies.

When this new data is analyzed in the context of the archived data about existing
customers, businesses gain insight into new business opportunities.
Big data can offer a viable solution if:
 The value generated by the insight developed from the data is worth the capital cost of
investing in a big data solution
 Customer-facing scenarios demonstrate the potential value from the insight
When evaluating the business value to be gained by a big data solution, consider
whether your current environment can be expanded and weigh the cost of this
investment.

Can my current environment be expanded?


Ask the following questions to determine if you can augment the existing data
warehouse platform:
 Are the current datasets very large — on the order of terabytes or petabytes?
 Does the existing warehouse environment contain a repository of all data generated or
acquired?
 Is there a significant amount of cold or low-touch data that is not being analyzed to derive
business insight?
 Do you have to throw data away because you are unable to store or process it?
 Do you want to be able to perform data exploration on complex and large amounts of data?
 Do you want to be able to do analysis of non-operational data?
 Are you interested in using your data for traditional and new types of analytics?
 Are you trying to delay an upgrade to your existing data warehouse?
 Are you looking for ways to lower your overall cost of doing analytics?
If the answer to any of these questions is yes, explore ways to augment the existing
data warehouse environment.
What is the cost of expanding my current environment?
The cost and feasibility of extending an existing data warehouse platform or IT
environment vs. implementing a big data solution depends on:
 Existing tools and technology
 Scalability of the existing system
 The processing power of the existing environment
 The storage capability of the existing platform
 Governance and policies in force
 The heterogeneity of existing IT applications
 The technology and business skills that exist in the organization.
It also depends on the volume of data that will be gathered and collected from new data
sources, the complexity of business use cases, the analytical complexity of processing,
and how expensive it is to get the data and people with the right skill set. Can the
existing pool of resources develop new big data skills or can the resources with niche
skills be hired externally?
Keep in mind the effect of a big data initiative on other projects under way.
Acquiring data from new sources is costly. It's important to first identify any data that
exists internally in the systems and applications and in third-party data being received
currently. If a business problem can be solved with existing data, data from external
sources may not be required.
Assess the application portfolio of the organization before procuring new tools and
applications. For example, a plain vanilla Hadoop platform may not be sufficient for the
requirements, and it may be necessary to buy specialized tools. Or in contrast, a
commercial version of Hadoop may be expensive for the current use case, but may be
needed as a long-term investment to support a strategic big data platform. Consider the
cost of the infrastructure, hardware, software, and maintenance required by big data
tools and technologies.
Governance and control on data: What is the impact on existing IT governance?
When deciding whether to implement a big data platform, an organization might be
looking at new data sources and new types of data elements where the ownership of
the data is not clearly defined. Certain industry regulations govern the data that is
acquired and used by an organization. For example, in the case of healthcare, is it
legitimate to access patient data to derive insight from the data? Similar rules govern all
industries. In addition to issues of IT governance, business processes of an
organization may also need to be redefined or modified to enable the organization to
acquire, store, and access external data.
Consider the following governance-related issues in the context of your situation:
 Security and privacy— In keeping with local regulations, what data can the solution access?
What data can be stored? What data should be encrypted during motion? At rest? Who is
allowed to see the raw data and the insights?
 Standardization of data— Are there standards governing the data? Is the data in a proprietary
format? Is some of the data in a non-standard format?
 Timeframe in which the data is available— Is the data available in a timeframe that allows action
to be taken in a timely fashion?
 Ownership of data— Who owns the data? Does the solution have appropriate access and
permission to use the data?
 Allowable uses— How is the data allowed to be used?
Can I incrementally implement a big data solution?
A big data solution can be incrementally implemented. It's helpful to clearly define the
scope of the business problem and to set, in measurable terms, the expected business
revenue gain.
For the foundational business case, take care in outlining the scope of the problem and
projected benefits from the solution. If the scope is too small, the business benefits will
not be realized, and if it's too large, it will be challenging to get the funding and complete
the project within an appropriate timeframe. Define the core functions in the first
iteration of the project, so that it's easy to win the confidence of stakeholders.
People: Are the right skills on board and the right people aligned?
Specific skills are required to understand and analyze the requirements and maintain
the big data solution. These skills include industry knowledge, domain expertise, and
technical knowledge on big data tools and technologies. Data scientists with expertise in
modeling, statistics, analytics, and math are key to the success of any big data initiative.
Before undertaking a new big data project, make sure the right people are on board:
 Do you have buy-in from stakeholders and other business sponsors who are willing to invest in
the project?
 Are data scientists available who understand the domain, who can look at the massive quantity
of data and who can identify ways to generate meaningful and useful insights from the data?
Is there existing data that can be used to get insight?
All organizations have quite a lot of data not being harnessed for business insight.
Pockets include log files, errors files, and operational data from applications. Don't
overlook this data as a potential source of valuable information.
Is the data complexity increasing?
Look for hints that the complexity of data has increased, especially with regard to
volume, variety, velocity, and veracity.
Has the volume of data increased?
You may want to consider a big data solution if:
 The data is sized in petabytes and exabytes, and in the near future, might grow to zettabytes.
 The data volume is posing technical and economic challenges to store, search, share, analyze,
and visualize using traditional methods, such as relational database engines.
 The data processing can currently make use of massively parallel processing power on available hardware.
Has the variety of data increased?
The variety of data might demand a big data solution if:
 The data content and structure cannot be anticipated or predicted.
 The data format varies, including structured, semi-structured, and unstructured data.
 The data can be generated by users and machines in any format, for example: Microsoft® Word
files, Microsoft Excel® spreadsheets, Microsoft PowerPoint presentations, PDF files, social
media, web and software logs, email, photos and video footage from cameras, information-
sensing mobile devices, aerial sensory technologies, genomics, and medical records.
 New types of data have emerged from sources that weren't previously mined for insight.
 Domain entities take on different meanings in different contexts.
Has the velocity of the data increased or changed?
Consider whether your data:
 Is changing rapidly and must be responded to immediately
 Has overwhelmed traditional technologies and methods, which are no longer adequate to
handle data coming in real time
Is your data trustworthy?
Consider a big data solution if:
 The authenticity or accuracy of the data is unknown.
 The data includes ambiguous information.
 It's unclear whether the data is complete.
A big data solution might be appropriate if there is reasonable complexity in the volume,
variety, velocity, or veracity of the data. For more complex data, assess any risks
associated with implementing a big data solution. For less complex data, traditional
solutions should be assessed.
Is all big data a big data problem?
Not all big data situations require a big data solution. Look for hints in the market. What
are competitors doing? What market forces are at work? What are the customers
demanding?
Use the questions in this article to help you determine whether a big data solution is
appropriate for your business situation and for the business insight you need. If you've
decided it's time to embark on a big data project, watch for the next article on defining a
logical architecture and determining the key components required for your big data
solution.
Part 3: Understanding the architectural
layers of a big data solution
Overview
Part 2 of this "Big data architecture and patterns" series describes a dimensions-based
approach for assessing the viability of a big data solution. If you have already explored
your own situation using the questions and pointers in the previous article and you've
decided it's time to build a new (or update an existing) big data solution, the next step is
to identify the components required for defining a big data solution for the project.
Logical layers of a big data solution
Logical layers offer a way to organize your components. The layers group components
that perform specific functions; they are merely logical and do not imply that the
functions supporting each layer run on separate machines or in separate processes. A
big data solution typically comprises these logical layers (a minimal sketch of how the
layers hand data to one another follows the layer descriptions below):
1. Big data sources
2. Data massaging and store layer
3. Analysis layer
4. Consumption layer
 Big data sources: Think in terms of all of the data available for analysis, coming in from all
channels. Ask the data scientists in your organization to clarify what data is required to perform
the kind of analyses you need. The data will vary in format and origin:
o Format— Structured, semi-structured, or unstructured.
o Velocity and volume— The speed that data arrives and the rate at which it's delivered varies
according to data source.
o Collection point— Where the data is collected, directly or through data providers, in real time or
in batch mode. The data can come from a primary source, such as weather conditions, or it can
come from a secondary source, such as a media-sponsored weather channel.
o Location of data source— Data sources can be inside the enterprise or external. Identify the
data to which you have limited access, since access to data affects the scope of data available
for analysis.
 Data massaging and store layer: This layer is responsible for acquiring data from the data
sources and, if necessary, converting it to a format that suits how the data is to be analyzed. For
example, an image might need to be converted so it can be stored in a Hadoop Distributed File
System (HDFS) store or a Relational Database Management System (RDBMS) warehouse for
further processing. Compliance regulations and governance policies dictate the appropriate
storage for different types of data.
 Analysis layer: The analysis layer reads the data digested by the data massaging and store
layer. In some cases, the analysis layer accesses the data directly from the data source.
Designing the analysis layer requires careful forethought and planning. Decisions must be made
with regard to how to manage the tasks to:
o Produce the desired analytics
o Derive insight from the data
o Find the entities required
o Locate the data sources that can provide data for these entities
o Understand what algorithms and tools are required to perform the analytics.
 Consumption layer: This layer consumes the output provided by the analysis layer. The
consumers can be visualization applications, human beings, business processes, or services. It
can be challenging to visualize the outcome of the analysis layer. Sometimes it's helpful to look
at what competitors in similar markets are doing.
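The following is a minimal sketch, assuming Python and hypothetical class and field names, of how the four logical layers might hand data to one another; it illustrates the logical separation only and is not a prescribed API.

```python
from typing import Iterable

class BigDataSource:
    """Big data sources layer: yields raw records from one channel."""
    def read(self) -> Iterable[dict]:
        yield {"SENSOR_ID": "meter-42", "KWH": 1.7, "TS": "2013-09-17T10:00:00Z"}

class MassageAndStore:
    """Data massaging and store layer: converts and persists raw records."""
    def ingest(self, records: Iterable[dict]) -> list:
        return [self._normalize(r) for r in records]
    def _normalize(self, record: dict) -> dict:
        return {key.lower(): value for key, value in record.items()}

class Analysis:
    """Analysis layer: derives insight from the stored, normalized data."""
    def analyze(self, records: list) -> dict:
        return {"total_kwh": sum(r["kwh"] for r in records), "record_count": len(records)}

class Consumption:
    """Consumption layer: delivers insight to users, processes, or applications."""
    def publish(self, insight: dict) -> None:
        print(f"Dashboard update: {insight}")

# Wiring the layers together; the separation is logical, not physical.
source = BigDataSource()
store = MassageAndStore()
analysis = Analysis()
consumption = Consumption()
consumption.publish(analysis.analyze(store.ingest(source.read())))
```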
Each layer includes several types of components, as illustrated below.
Figure 1. Components by logical and vertical layer

Big data sources


This layer includes all the data sources necessary to provide the insight required to
solve the business problem. The data is structured, semi-structured, and unstructured,
and it comes from many sources:
 Enterprise legacy systems— These are the enterprise applications that drive the analytics and
insights required for business:
o Customer relationship management systems
o Billing operations
o Mainframe applications
o Enterprise resource planning
o Web applications
Web applications and other data sources augment the enterprise-owned data. Such
applications can expose the data using custom protocols and mechanisms.
 Data management systems (DMS)— The data management systems store legal data,
processes, policies, and various other kinds of documents:
o Microsoft® Excel® spreadsheets
o Microsoft Word documents
These documents can be converted into structured data that can be used for analytics.
The document data can be exposed as domain entities or the data massaging and
storage layer can transform it into the domain entities.
 Data stores— Data stores include enterprise data warehouses, operational databases, and
transactional databases. This data is typically structured and can be consumed directly or
transformed easily to suit requirements. Such data may or may not be stored in the distributed
file system, depending on the context of the situation.
 Smart devices— Smart devices are capable of capturing, processing, and communicating
information on most widely used protocols and formats. Examples include smartphones, meters,
and healthcare devices. Such devices can be used to perform various kinds of analysis. For the
most part, smart devices do real-time analytics, but the information stemming from smart
devices can be analyzed in batch, as well.
 Aggregated data providers— These providers own or acquire the data and expose it in
sophisticated formats, at required frequencies, and through specific filters. Huge volumes of
data pour in, in a variety of formats, produced at different velocities, and made available by
various data providers, sensors, and existing enterprises.
 Additional data sources— A wide range of data comes from automated sources:
o Geographical information:
 Maps
 Regional details
 Location details
 Mining details
o Human-generated content:
 Social media
 Email
 Blogs
 Online information
o Sensor data:
 Environment: Weather, moisture, humidity, lightning
 Electricity: Current, energy potential, etc.
 Navigation instruments
 Ionizing radiation, subatomic particles, etc.
 Proximity, presence, and so on
 Position, angle, displacement, distance, speed, acceleration
 Acoustic, sound vibration, etc.
 Automotive, transportation, etc.
 Thermal, heat, temperature
 Optical, light, imaging, photon
 Chemical
 Pressure
 Flow, fluid, velocity
 Force, density level, etc.
 Other data from sensor vendors
Data massaging and store layer
Because incoming data characteristics can vary, components in the data massaging
and store layer must be capable of reading data at various frequencies, in various
formats and sizes, and on various communication channels (a minimal pipeline sketch follows this list):
 Data acquisition— Acquires data from various data sources and sends the data to the data
digest component or stores it in specified locations. This component must be intelligent enough
to choose whether and where to store the incoming data. It must be able to determine whether
the data should be massaged before it can be stored or if the data can be directly sent to the
business analysis layer.
 Data digest— Responsible for massaging the data in the format required to achieve the purpose
of the analysis. This component can have simple transformation logic or complex statistical
algorithms to convert source data. The analysis engine determines the specific data formats that
are required. The major challenge is accommodating unstructured data formats, such as
images, audio, video, and other binary formats.
 Distributed data storage— Responsible for storing the data from data sources. Often, multiple
data storage options are available in this layer, such as distributed file storage (DFS), cloud,
structured data sources, NoSQL, etc.
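Here is a minimal pipeline sketch, assuming Python and hypothetical record fields, of how acquisition, digest, and distributed storage might hand records along; the in-memory list stands in for a write to HDFS, NoSQL, or cloud storage.

```python
import json

def needs_digest(record):
    # Hypothetical rule: anything without a normalized schema version must be massaged first.
    return "schema_version" not in record

def acquire(raw_events):
    """Data acquisition: parse each incoming event and decide whether it needs massaging."""
    for raw in raw_events:
        record = json.loads(raw)
        yield record, needs_digest(record)

def digest(record):
    """Data digest: simple transformation into the format the analysis engine expects."""
    return {
        "schema_version": 1,
        "source": record.get("src", "unknown"),
        "payload": {key: value for key, value in record.items() if key != "src"},
    }

def store(record, target):
    """Distributed data storage: stand-in for persisting the record in a distributed store."""
    target.append(record)

storage = []  # placeholder for HDFS, NoSQL, or cloud storage
raw_feed = [
    '{"src": "smart-meter", "kwh": 1.7}',
    '{"schema_version": 1, "source": "crm", "payload": {}}',
]
for record, must_digest in acquire(raw_feed):
    store(digest(record) if must_digest else record, storage)
```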
Analysis layer
This is the layer where business insight is extracted from the data (a minimal sketch follows this list):
 Analysis-layer entity identification— Responsible for identifying and populating the contextual
entities. This is a complex task that requires efficient high-performance processes. The data
digest component should complement this entity identification component by massaging the
data into the required format. Analysis engines will need the contextual entities to perform the
analysis.
 Analysis engine— Uses other components (specifically, entity identification, model
management, and analytic algorithms) to process and perform the analysis. The analysis engine
can have various workflows, algorithms, and tools that support parallel processing.
 Model management— Responsible for maintaining various statistical models and for verifying
and validating these models by continuously training the models to be more accurate. The
model management component then promotes these models, which can be used by the entity
identification or analysis engine components.
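As a minimal sketch, assuming Python and hypothetical model and entity names, the analysis engine could combine entity identification with a model management component that promotes the best-validating model.

```python
class ModelManagement:
    """Keeps candidate models, tracks their validation accuracy, and promotes the best one."""
    def __init__(self):
        self.models = {}      # name -> (model_fn, accuracy)
        self.promoted = None
    def register(self, name, model_fn, accuracy):
        self.models[name] = (model_fn, accuracy)
        # Promote whichever registered model currently validates best.
        self.promoted = max(self.models.values(), key=lambda m: m[1])[0]

def identify_entities(record):
    """Entity identification: map raw fields onto the contextual entity the models expect."""
    return {"customer_id": record["cust"], "amount": float(record["amt"])}

class AnalysisEngine:
    """Runs the promoted model over identified entities to produce insight."""
    def __init__(self, model_mgmt):
        self.model_mgmt = model_mgmt
    def analyze(self, records):
        return [self.model_mgmt.promoted(identify_entities(r)) for r in records]

mgmt = ModelManagement()
mgmt.register(
    "rule_v1",
    lambda e: {"customer_id": e["customer_id"], "suspicious": e["amount"] > 1000},
    accuracy=0.82,
)
engine = AnalysisEngine(mgmt)
print(engine.analyze([{"cust": "C-7", "amt": "1500.00"}]))
```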
Consumption layer
This layer consumes the business insight derived from the analytics applications. The
outcome of the analysis is consumed by various users within the organization and by
entities external to the organization, such as customers, vendors, partners, and
suppliers. This insight can be used to target customers for product offers. For example,
with the business insight gained from analysis, a company can use customer preference
data and location awareness to deliver personalized offers to customers as they walk
down the aisle or pass by the store.
The insight can also be used to detect fraud by intercepting transactions in real time and
correlating them with the view that has been built using the data already stored in the
enterprise. A customer can be notified of a possible fraud while the fraudulent
transaction is happening, so corrective actions can be taken immediately.
In addition, business processes can be triggered based on the analysis done in the data
massaging layer. Automated steps can be launched — for example, the process to
create a new order if the customer has accepted an offer can be triggered automatically,
or the process to block the use of a credit card can be triggered if a customer has
reported fraud.
The output of analysis can also be consumed by a recommendation engine that can
match customers with the products they like. The recommendation engine analyzes
available information and provides personalized and real-time recommendations.
The consumption layer also provides internal users the ability to understand, find, and
navigate federated data within and outside the enterprise. For the internal consumers,
the ability to build reports and dashboards for business users enables the stakeholders
to make informed decisions and to design appropriate strategies. To improve
operational effectiveness, real-time business alerts can be generated from the data and
operational key performance indicators can be monitored (a minimal alerting sketch follows this list of components):
 Transaction interceptor— This component intercepts high-volume transactions in real time and
converts them into a suitable format that can be readily understood by the analysis layer to do
real-time analysis on the incoming data. The transaction interceptor should have the ability to
integrate with and handle data from various sources such as sensors, smart meters,
microphones, cameras, GPS devices, ATMs, and image scanners. Various types of adapters
and APIs can be used to connect to the data sources. Various accelerators, such as real-time
optimization and streaming analytics, video analytics, accelerators for banking, insurance, retail,
telecom, and public transport, social media analytics, and sentiment analytics are also available
to simplify development.
 Business process management processes— The insight from the analysis layer can be
consumed by Business Process Execution Language (BPEL) processes, APIs, or other
business processes to further drive business value by automating the functions for upstream
and downstream IT applications, people, and processes.
 Real-time monitoring— Real-time alerts can be generated using the data coming out of the
analysis layer. The alerts can be sent to interested consumers and devices, such as
smartphones and tablets. Key performance indicators can be defined and monitored for
operational effectiveness using the data insight generated from the analytics components. Data
in real time can be made available to business users from varied sources in the form of
dashboards to monitor the health of the system or to measure the effectiveness of a campaign.
 Reporting engine— The ability to produce reports similar to traditional business intelligence
reports is critical. Ad-hoc reports, scheduled reports, or self-query and analysis can be created
by users based on the insight coming out of the analysis layer.
 Recommendation engine— Based on the outcome of analysis from the analysis layer,
recommendation engines can offer real-time, relevant, and personalized recommendations to
shoppers, increasing the conversion rates and the average value of each order in an e-
commerce transaction. In real time, the engine processes available information and responds
dynamically to each user, based on the users' real-time activities, the information stored within
CRM systems for registered customers, and the social profiles for non-registered customers.
 Visualization and discovery— Data can be navigated across various federated data sources
within and outside the enterprise. The data can vary in content and format, and all of the data
(structured, semi-structured, and unstructured) can be combined for visualization and provided
to the users. This ability enables organizations to combine their traditional enterprise content
(contained in enterprise content managements systems and data warehouses) with new social
content (tweets and blog posts, for example) in a single user interface.
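To illustrate the real-time monitoring and notification side of this layer, here is a minimal sketch, assuming Python, a hypothetical fraud score produced by the analysis layer, and placeholder delivery channels.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    customer_id: str
    fraud_score: float   # hypothetical output of the analysis layer, 0.0 to 1.0

def real_time_monitoring(insights, threshold=0.9):
    """Real-time monitoring: turn analysis output into alerts for interested consumers."""
    for insight in insights:
        if insight.fraud_score >= threshold:
            yield f"ALERT: possible fraud on account {insight.customer_id}"

def notify(alerts, channels=("sms", "mobile-app")):
    """Placeholder notification fan-out to devices such as smartphones and tablets."""
    for alert in alerts:
        for channel in channels:
            print(f"[{channel}] {alert}")

notify(real_time_monitoring([Insight("C-7", 0.95), Insight("C-8", 0.12)]))
```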
Vertical layers
Aspects that affect all of the components of the logical layers (big data sources, data
massaging and storage, analysis, and consumption) are covered by the vertical layers:
 Information integration
 Big data governance
 Systems management
 Quality of service
Information integration
Big data applications acquire data from various data origins, providers, and data
sources and store it in data storage systems such as HDFS and NoSQL stores like MongoDB.
This vertical layer is used by various components (data acquisition, data digest, model
management, and transaction interceptor, for example) and is responsible for
connecting to various data sources. Integrating information across data sources with
varying characteristics (protocols and connectivity, for example) requires quality
connectors and adapters. Accelerators are available to connect to most of the known
and widely used sources. These include social media adapters and weather data
adapters. This layer can also be used by components to store information in big data
stores and to retrieve information from big data stores for processing. Most of the big
data stores have services and APIs available to store and retrieve the information.
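A minimal sketch, assuming Python and hypothetical adapter classes, of the kind of common contract such connectors and adapters can expose to the acquisition, digest, and other components; the returned records are placeholders rather than real API responses.

```python
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Common contract the integration layer exposes to the rest of the solution."""
    @abstractmethod
    def fetch(self) -> list:
        ...

class SocialMediaAdapter(SourceAdapter):
    def fetch(self) -> list:
        # Placeholder for a call to a social media API through a vendor connector.
        return [{"user": "u1", "text": "loving the new release"}]

class WeatherDataAdapter(SourceAdapter):
    def fetch(self) -> list:
        # Placeholder for a weather data provider feed.
        return [{"city": "Pune", "temp_c": 29.5}]

def integrate(adapters: list) -> list:
    """The integration layer hides per-source protocols behind one interface."""
    records = []
    for adapter in adapters:
        records.extend(adapter.fetch())
    return records

print(integrate([SocialMediaAdapter(), WeatherDataAdapter()]))
```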
Big data governance
Data governance is about defining guidelines that help enterprises make the right
decisions about the data. Big data governance helps in dealing with the complexities,
volume, and variety of data that is within the enterprise or is coming in from external
sources. Strong guidelines and processes are required to monitor, structure, store, and
secure the data from the time it enters the enterprise, gets processed, stored, analyzed,
and purged or archived.
In addition to normal data governance considerations, governance for big data includes
additional factors:
 Managing high volumes of data in a variety of formats.
 Continuously training and managing the statistical models required to pre-process unstructured
data and analytics. Keep in mind that this is an important step when dealing with unstructured
data.
 Setting policy and compliance regulations for external data regarding its retention and usage.
 Defining the data archiving and purging policies.
 Creating the policy for how data can be replicated across various systems.
 Setting data encryption policies.
Quality of service layer
This layer is responsible for defining data quality, policies around privacy and security,
frequency of data, size per fetch, and data filters (a minimal validation sketch follows this list):
 Data quality
o Completeness in identifying all of the data elements required
o Timeliness for providing data at an acceptable level of freshness
o Accuracy in verifying that the data respects data accuracy rules
o Adherence to a common language (data elements fulfill the requirements expressed in plain
business language)
o Consistency in verifying that the data from multiple systems respects the data consistency rules
o Technical conformance in meeting the data specification and information architecture guidelines
 Policies around privacy and security
Policies are required to protect sensitive data. Data acquired from external agencies and
providers can include sensitive information (such as the contact information of a Facebook user
or product pricing information). Data can originate from different regions and countries and must
be treated accordingly. Decisions must be made about data masking and the storage of such
data. Consider the following data access policies:
o Data availability
o Data criticality
o Data authenticity
o Data sharing and publishing
o Data storage and retention, including questions such as: Can the external data be stored? If so,
for how long? What kind of data can be stored?
o Constraints of data providers (political, technical, regional)
o Social media terms of use (see Related topics)
 Data frequency
How frequently is fresh data available? Is it on-demand, continuous, or offline?
 Size of fetch
This attribute helps define the size of data that can be fetched and consumed per fetch.
 Filters
Standard filters remove unwanted data and noise in the data and leave only the data required
for analysis.
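A minimal validation sketch, assuming Python, hypothetical field names, and an illustrative fetch-size limit, showing how completeness and accuracy checks can be combined with standard filters before data reaches the analysis layer.

```python
def quality_checks(record, required_fields=("customer_id", "event_time", "amount")):
    """Apply simple data quality rules: completeness and an illustrative accuracy check."""
    issues = []
    # Completeness: all required data elements are present and non-empty.
    for field_name in required_fields:
        if field_name not in record or record[field_name] in (None, ""):
            issues.append(f"missing {field_name}")
    # Accuracy: an example rule, amounts must be non-negative.
    if isinstance(record.get("amount"), (int, float)) and record["amount"] < 0:
        issues.append("negative amount")
    return issues

def standard_filter(records, max_batch_size=1000):
    """Filters and size per fetch: keep only clean records, capped at the agreed fetch size."""
    clean = [r for r in records if not quality_checks(r)]
    return clean[:max_batch_size]

batch = [
    {"customer_id": "C-7", "event_time": "2013-09-17T10:00:00Z", "amount": 120.0},
    {"customer_id": "", "event_time": "2013-09-17T10:00:05Z", "amount": -5.0},
]
print(standard_filter(batch))  # only the first, well-formed record survives
```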
Systems management
Systems management is critical for big data because it involves many systems across
clusters and boundaries of the enterprise. Monitoring the health of the overall big data
ecosystem includes:
 Managing the logs of systems, virtual machines, applications, and other devices
 Correlating the various logs and helping investigate and monitor the situation
 Monitoring real-time alerts and notifications
 Using a real-time dashboard showing various parameters
 Referring to reports and detailed analysis about the system
 Setting and abiding by service-level agreements
 Managing storage and capacity
 Archiving and managing archive retrieval
 Performing system recovery, cluster management, and network management
 Policy management
Summary
For developers, layers offer a way to categorize the functions that must be performed by
a big data solution, and suggest an organization for the code that must address these
functions. For business users wanting to derive insight from big data, however, it's often
helpful to think in terms of big data requirements and scope. Atomic patterns, which
address the mechanisms for accessing, processing, storing, and consuming big data,
give business users a way to address requirements and scope. The next article
introduces atomic patterns for this purpose.
Part 4: Understanding atomic and
composite patterns for big data solutions
Part 3 of this series describes the logical layers of a big data solution. These layers
define and categorize the various components that must address the functional and
non-functional requirements for a given business case. This article builds on the
concept of layers and components to explain the typical atomic and composite patterns
in which they are used in the solution. By mapping a proposed solution to the patterns
given here, you can visualize how the components need to be designed and where they
should be placed functionally. The patterns also help define the architecture of the big
data solution. Using atomic and composite patterns can help further refine the roles and
responsibilities of each component of the big data solution.
Try out IBM big data solutions
Download a trial version of an IBM big data solution and see how it works in your own
environment. Choose from several products:
 BigInsights Quick Start Edition, the IBM Hadoop-based offering that extends the value of open
source Hadoop with functions like Big SQL, text analytics, and BigSheets
 InfoSphere Streams Quick Start Edition, a non-production version of InfoSphere Streams, a
high-performance computing platform that rapidly ingests, analyzes, and correlates information
as it arrives from thousands of real-time sources
 Many additional big data and analytics products available for trial download
This article covers atomic and composite patterns. The final article in this series will
describe solution patterns.

Figure 1. Categories of patterns

Atomic patterns
Atomic patterns help identify how the data is consumed, processed, stored, and
accessed for recurring problems in a big data context. They can also help identify the
required components. Accessing, storing, and processing a variety of data from different
data sources requires different approaches. Each pattern addresses specific
requirements — visualization, historical data analysis, social media data, and
unstructured data storage, for example. Atomic patterns can work together to form a
composite pattern. There is no layering or sequence to these atomic patterns. For
example, visualization patterns can interact with data access patterns for social media
directly, and visualization patterns can interact with the advanced analysis processing
pattern.

Figure 2. Examples of atomic patterns for consumption, processing, data access, and
storage

Data consumption patterns


This type of pattern addresses the various ways in which the outcome of data analysis
is consumed. This section includes data consumption patterns to meet several
requirements.
Visualization pattern
The traditional way of visualizing data is based on graphs, dashboards, and summary
reports. These traditional approaches are not always the optimal way to visualize the
data.
Typical requirements for big data visualization, including emerging requirements, are
listed below:
 To do live analysis and display of stream data
 To mine data interactively, based on context
 To perform advanced searches and get recommendations
 To visualize information in parallel
 To have access to advanced hardware for futuristic visualization needs
Research is under way to determine how big data insights can be consumed by humans
and machines. The challenges include the volume of data involved and the need to
associate context with it. Insight must be presented in the appropriate context.
The goal is to make it easier to consume the data intuitively, so the reports and
dashboards might offer full-HD viewing and 3-D interactive videos, and might provide
users the ability to control business activities and outcomes from one application.
Ad-hoc discovery pattern
Creating standard reports that suit all business needs is often not feasible because
businesses have varied requirements for queries of business data. Users might need
the ability to issue ad-hoc queries when looking for specific information, depending on
the context of the problem.
Ad-hoc analysis can help data scientists and key business users understand the
behavior of business data. Complexity involved in ad-hoc processing springs from
several factors:
 Multiple data sources available for the same domains.
 A single query can have multiple results.
 The output can be static with a variety of formats (video, audio, graphs, text).
 The output can be dynamic and interactive.
Augment traditional data stores
During initial exploration of big data, many enterprises would prefer to use the existing
analytics platform to keep costs down and to rely on existing skills. Augmenting existing
data stores helps broaden the scope of data available for existing analytics to include
data that resides inside and outside organizational boundaries, such as social media
data, which can enrich the master data. By broadening the scope to include new fact
tables, dimensions, and master data in the existing stores, and by acquiring customer
data from social media, an organization can gain deeper customer insight.
Keep in mind, however, that new data sets are typically larger, and existing
extract, transform, and load tools might not be sufficient to process them. Advanced tools
with massively parallel processing capabilities might be required to address the volume,
variety, veracity, and velocity characteristics of data.
Notification pattern
Big data insight enables humans, businesses, and machines to act instantly by using
notifications to indicate events. The notification platform must be capable of handling
the anticipated volume of notifications to be sent out in a timely manner. These
notifications are different from mass mailing or mass sending of SMS messages
because the content is generally specific to the consumer. For example,
recommendation engines can derive insight about a huge, worldwide customer base, and
targeted notifications can be sent to individual customers.
Initiate an automated response pattern
Business insight derived from big data can be used to trigger or initiate other business
processes or transactions.
Processing patterns
Big data can be processed while the data is at rest or while it is in motion. Depending on the
complexity of the analysis, the data might not be processed in real time. This pattern
addresses how big data is processed in real time, near real time, or in batch.
The following high-level categories for processing big data apply to most analytics.
These categories also often apply to traditional RDBMS-based systems. The only
difference is the massive scale of data, variety, and velocity. In processing big data,
techniques such as machine learning, complex event processing, event stream
processing, decision management, and statistical model management are used.
Historical data analysis pattern
Traditional historical data analysis is limited to a predefined period of data, which
usually depends on data retention policies. Beyond that period, data is usually archived
or purged because of processing and storage limitations. These limitations are
overcome by Hadoop-based systems and other equivalent systems with huge storage
and distributed, massively parallel processing capabilities. The operational, business,
and data warehouse data are moved to big data storage and are processed using the
big data platform capabilities.
Historical analysis involves analyzing historical trends for a given period, set of
seasons, or product, and comparing them with the current data available. To store and
process such huge amounts of data, tools like HDFS, NoSQL stores, SPSS®, and InfoSphere®
BigInsights™ are useful.
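As a minimal illustration of the idea, the sketch below compares monthly historical averages with the current year, assuming a hypothetical claims.csv file with claim_date and amount columns; in a real deployment the same aggregation would run on a Hadoop-based platform rather than on a single machine with pandas.

```python
# A minimal sketch of seasonal historical analysis over a hypothetical claims.csv file.
import pandas as pd

claims = pd.read_csv("claims.csv", parse_dates=["claim_date"])
claims["year"] = claims["claim_date"].dt.year
claims["month"] = claims["claim_date"].dt.month

# Average claim amount per calendar month across all historical years.
historical = claims.groupby("month")["amount"].mean().rename("historical_avg")

# Average claim amount per month for the current (latest) year only.
current = (claims[claims["year"] == claims["year"].max()]
           .groupby("month")["amount"].mean().rename("current_avg"))

# Compare current seasonality against the long-term trend.
comparison = pd.concat([historical, current], axis=1)
comparison["deviation_pct"] = (
    (comparison["current_avg"] - comparison["historical_avg"])
    / comparison["historical_avg"] * 100
)
print(comparison)
```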
Advanced analytics pattern
Big data provides enormous opportunities to realize creative insights. Different data sets
can be correlated in many contexts. Discovering these relationships requires innovative,
complex algorithms and techniques.
Advanced analysis includes predictions, decisions, inferential processes, simulations,
contextual information identifications, and entity resolutions. The application of
advanced analytics includes biometric data analysis, for example, DNA analysis, spatial
analysis, location-based analytics, scientific analysis, research, and many others.
Advanced analytics require heavy computing power to manage the huge amount of data.
Data scientists can guide in identifying the suitable techniques, algorithms, data sets,
and data sources required to solve problems in a given context. Tools such as SPSS,
InfoSphere Streams, and InfoSphere BigInsights provide such capabilities. These tools
access unstructured data and the structured data (for example, JSON data) stored in
big data storage systems such as BigTable, HBase, and others.
Pre-process raw data pattern
Big data solutions are mostly dominated by Hadoop systems and technologies based
on MapReduce, which are out-of-the-box solutions for distributed storage and
processing. However, extracting information from unstructured data such as images, audio,
video, binary feeds, or even text is a complex task that needs techniques such as
machine learning and natural language processing. The other major challenge is
how to verify the accuracy and correctness of the output from such techniques and
algorithms.
To perform analysis on any data, the data must be in some kind of structured format.
Unstructured data accessed from various data sources can be stored as is and then
transformed into structured data (for example, JSON) and stored back in the big data
storage systems. Unstructured text can be converted into semi-structured or structured data.
Similarly, image, audio, and video data need to be converted into formats that can
be used for analysis. Moreover, the accuracy and correctness of advanced analytics
that use predictive and statistical algorithms depend on the amount of data and the
algorithms used to train the models.
The following list shows the algorithms and activities required to convert unstructured
data into structured data:
 Document and text classification
 Feature extraction
 Image and text segmentation
 Correlating the features, variables, and timings, and then extracting the values along with their timings
 Accuracy checking for output using techniques such as the confusion matrix and other manual
activities
Data scientists can help in choosing the appropriate techniques and algorithms.
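As a rough sketch of what such a conversion step can look like, the following code pulls a few structured fields out of hypothetical free-text claim notes and writes them as JSON lines; real pipelines would rely on NLP and machine-learning libraries rather than the plain regular expressions used here.

```python
# A minimal sketch of turning unstructured text into structured JSON records.
# The claim notes and field names are hypothetical illustrations.
import json
import re

raw_notes = [
    "Claim CLM-1043 filed 2013-05-02 by J. Smith; amount 12500 USD; injury not visible.",
    "Claim CLM-1044 filed 2013-05-03 by A. Kumar; amount 900 USD.",
]

def extract_features(note):
    """Pull a few named features out of a free-text note."""
    record = {"raw_text": note}
    claim_id = re.search(r"CLM-\d+", note)
    amount = re.search(r"amount\s+(\d+)", note)
    date = re.search(r"\d{4}-\d{2}-\d{2}", note)
    record["claim_id"] = claim_id.group(0) if claim_id else None
    record["amount"] = int(amount.group(1)) if amount else None
    record["filed_on"] = date.group(0) if date else None
    record["injury_not_visible"] = "injury not visible" in note.lower()
    return record

# Write the semi-structured output as JSON lines, ready for structured storage.
with open("claims_structured.jsonl", "w") as out:
    for note in raw_notes:
        out.write(json.dumps(extract_features(note)) + "\n")
```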
Ad-hoc analysis pattern
Processing ad-hoc queries on big data brings challenges different from those incurred
when doing ad-hoc queries on structured data, because the data sources and data
formats are not fixed and require different mechanisms to retrieve and process the data.
Although simple ad-hoc queries can be resolved by big data providers, in most cases
the queries are complex because data, algorithms, formats, and entity resolutions must
be discovered dynamically at runtime. The expertise of data scientists and business
users is required to define the analysis needed for the following tasks:
 Identify and discover the computations and algorithms
 Identify and discover the data sources
 Define the required formats that can be consumed by the computations
 Perform the computations on the data in parallel
Access patterns
Although there are many data sources and ways data can be accessed in a big data
solution, this section covers the most common.
Web and social media access pattern
The Internet is the source of much of the data behind the insights derived today. Web
and social media data is useful in almost every analysis, but different access mechanisms are
required to acquire it.
Web and social media is the most complex of all the data sources because of its
huge variety, velocity, and volume. There are around 40-50 categories of websites, and
each requires different treatment to access its data. This section lists these categories
and explains the accessing mechanism. The high-level categories from the big data
perspective are commerce sites, social media sites, and sites having specific and
generic components. See Figure 3 for the access mechanisms. The accessed data is
stored in data storage after pre-processing, if required.
Figure 3. Web and social media data access

The following steps are required to access web media information.

Figure 4. Big data accessing steps


Web media access for data in unstructured storage
1. Step A-1. A crawler reads the raw data.
2. Step A-2. The data is stored in unstructured storage.
Web media access pre-process data for structured storage
1. Step B-1. The crawler reads the raw data.
2. Step B-2. This data gets pre-processed.
3. Step B-3. The data is stored in structured storage.
Web media access to pre-process unstructured data
1. Step C-1. Data from the providers can be unstructured, in rare cases.
2. Step C-2. Data is pre-processed.
3. Step C-3. Data is stored in structured storage.
Web media access for unstructured or structured data
1. Step D-1. Data providers provide structured or unstructured data.
2. Step D-2. Data is stored in structured or unstructured storage.
Web media access to pre-process previously stored unstructured data
1. Step E-1. Unstructured data, stored without pre-processing, cannot be useful unless it is in a
structured format.
2. Step E-2. Data is pre-processed.
3. Step E-3. Pre-processed, structured data is stored in structured storage.
As shown in the diagram, the data can be directly stored in storage, or it can be pre-
processed and converted into an intermediate or standard format, then stored.
Before the data can be analyzed, it has to be in a format that can be used for entity
resolution or for querying required data. Such pre-processed data can be stored in a
storage system.
Although pre-processing is often thought of as trivial, it can be very complex and time-
consuming.
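The sketch below illustrates the store-raw-then-pre-process flow (steps A and B above) in a simplified form, assuming hypothetical URLs, a local directory in place of unstructured storage, and a JSON-lines file in place of structured storage.

```python
# A minimal sketch of crawling, storing raw pages as-is, and pre-processing
# them into structured records. The URL, directories, and fields are hypothetical.
import json
import os
import urllib.request

RAW_DIR = "raw_pages"
STRUCTURED_FILE = "pages_structured.jsonl"
os.makedirs(RAW_DIR, exist_ok=True)

urls = ["https://example.com/press-release-1.html"]  # hypothetical source

for i, url in enumerate(urls):
    # Step 1: the crawler reads the raw data and stores it as-is (unstructured).
    with urllib.request.urlopen(url) as response:
        raw_html = response.read().decode("utf-8", errors="replace")
    with open(os.path.join(RAW_DIR, f"page_{i}.html"), "w") as f:
        f.write(raw_html)

    # Step 2: a simple pre-processing pass extracts a structured record.
    record = {
        "source_url": url,
        "length": len(raw_html),
        "title": (raw_html.split("<title>")[1].split("</title>")[0]
                  if "<title>" in raw_html else None),
    }
    with open(STRUCTURED_FILE, "a") as out:
        out.write(json.dumps(record) + "\n")
```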
Device-generated data pattern
Device-generated content includes data from sensors. Data about origins such as
weather, electrical measurements, and pollution is captured by the sensors. The data
can be photos, videos, text, and other binary formats.
The following diagram explains the typical process for processing machine-generated
data.

Figure 5. Device-generated data access


Figure 5 explains the process for accessing data from sensors. The data captured by
the sensors can be sent to device gateways that do some initial pre-processing and that
buffer the high-velocity data. The machine-generated data is mostly in binary formats
(audio, video, and sensor readings) or in text format. Such data can be initially stored in a
storage system, or it can be pre-processed and then stored. Pre-processing is
required before analysis.
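A minimal sketch of a device gateway that buffers high-velocity readings and flushes them in batches might look like the following; the sensor source and the file used as a stand-in for big data storage are hypothetical.

```python
# A minimal sketch of a device gateway that buffers sensor readings and
# flushes them in batches to storage (a local file stands in for the store).
import json
import random
import time

class DeviceGateway:
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def on_reading(self, reading):
        """Light pre-processing plus buffering of one sensor reading."""
        reading["received_at"] = time.time()
        self.buffer.append(reading)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Push the buffered batch downstream."""
        with open("sensor_batches.jsonl", "a") as out:
            for reading in self.buffer:
                out.write(json.dumps(reading) + "\n")
        self.buffer = []

gateway = DeviceGateway(batch_size=10)
for _ in range(25):  # simulated high-velocity readings
    gateway.on_reading({"sensor_id": "weather-7", "temp_c": round(random.uniform(10, 35), 1)})
gateway.flush()  # flush whatever is left at the end
```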
Transactional, operational and warehouse data pattern
It is possible to store the existing transactional, operational, and warehouse data to
avoid purging or archiving it (because of storage and processing limitations), or to
reduce the load on the traditional storage systems when the data is accessed by other
consumers.
For most enterprises, the transactional, operational, master data, and warehouse
information is at the heart of any analytics. This data, if augmented with the
unstructured data and the external data available across the Internet or through sensors
and smart devices, can help organizations get accurate insight and perform advanced
analytics.
Transactional and warehouse data can be pushed into storage using standard
connectors made available by various database vendors. Pre-processing transactional
data is much easier because the data is mostly structured. Simple extract, transform,
and load processes can be used to move the transactional data into storage.
Transactional data can be easily converted into formats like JSON and CSV. Using
tools such as Sqoop makes it easier to push transactional data into storage systems
such as HBase and HDFS.
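As a simplified illustration of that conversion, the sketch below reads rows from a relational store and writes them as JSON lines ready for bulk loading; sqlite3 and the local output file are stand-ins for the source database and HDFS, the claims table is hypothetical, and in practice Sqoop would move the data directly into HDFS or HBase.

```python
# A minimal sketch of exporting transactional rows as JSON lines for bulk loading.
# sqlite3 and the local file stand in for a production RDBMS and HDFS.
import json
import sqlite3

conn = sqlite3.connect("transactions.db")   # hypothetical source database
conn.row_factory = sqlite3.Row              # rows behave like mappings

with open("claims_export.jsonl", "w") as out:
    for row in conn.execute("SELECT claim_id, customer_id, amount, status FROM claims"):
        out.write(json.dumps(dict(row)) + "\n")

conn.close()
```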
Storage patterns
The storage patterns help determine the appropriate storage for various data types
and formats. Data can be stored as is, stored as key-value pairs, or stored in
predefined formats.
Distributed file systems such as GFS and HDFS are quite capable of storing any sort of
data. But the ability to retrieve or query the data efficiently affects performance. The
selection of technology makes a difference.
Storage pattern for distributed and unstructured data
Most big data is unstructured and can have information that can be extracted in different
ways for different contexts. Most of the time, unstructured data must be stored as is, in
its original format.
Such data can be stored in distributed file systems such as HDFS and NoSQL
document storage such as MongoDB. These systems provide an efficient way to
retrieve unstructured data.
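For example, a raw document can be written to MongoDB in its original shape with a few lines of code, assuming a local MongoDB instance and the pymongo driver; the database and collection names below are hypothetical.

```python
# A minimal sketch of storing a raw, as-is document in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
raw_store = client["bigdata"]["raw_social_feeds"]  # hypothetical names

# The document is stored in its original shape; no fixed schema is imposed.
raw_store.insert_one({
    "source": "blog",
    "fetched_at": "2013-09-17T10:15:00Z",
    "payload": "<html>...original page content...</html>",
})
```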
Storage pattern for distributed and structured data
Structured data includes data that arrives from the data source already in a
structured format, as well as unstructured data that has been pre-processed into a format
such as JSON. This converted data must be stored to avoid frequent conversion
from raw data to structured data.
Technologies such as BigTable from Google are used to store structured data. BigTable
is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory
and petabytes of storage.
HBase in Hadoop is comparable to BigTable. It uses HDFS for underlying storage.
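A minimal sketch of writing a pre-processed record to HBase through the happybase client is shown below, assuming a Thrift server on localhost and a hypothetical claims table with a cf column family.

```python
# A minimal sketch of writing a structured record to HBase via happybase.
# The table name, column family, and row contents are hypothetical.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("claims")

# Row key plus column-family:qualifier pairs; HBase stores raw bytes.
table.put(b"CLM-1043", {
    b"cf:customer_id": b"CUST-88",
    b"cf:amount": b"12500",
    b"cf:status": b"under_review",
})
connection.close()
```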
Storage pattern for traditional data stores
Traditional data storage is not the best choice for storing big data, but in cases in which
enterprises are doing initial data exploration, they may choose to use an existing data
warehouse, RDBMS, or other content store. These existing storage systems
can be used to store the data that is digested and filtered using the big data platform.
They should not, however, be considered appropriate primary storage for big data.
Storage pattern for cloud storage
Many cloud infrastructure providers offer distributed storage for both structured and
unstructured data. Big data technologies differ from traditional systems from the configuration,
maintenance, system management, and programming and modeling perspectives.
Moreover, the skills required to implement big data solutions are rare and costly.
Enterprises exploring the big data technologies can use cloud solutions that provide big
data storage, maintenance, and system management.
The data to be stored is often sensitive; it includes medical records and biometric data.
Consider the data security, data sharing, data governance, and other policies around
data, especially when considering the cloud as a storage repository for big data. The ability
to transfer huge amounts of data is another key consideration for cloud storage.
Composite patterns
Atomic patterns focus on providing capabilities required to perform individual
functions. Composite patterns, however, are classified based on the end-to-end
solution. Each composite pattern has one or more dimensions to consider. There are
many variations in the cases that apply to each pattern. Composite patterns map to one
or more atomic patterns to solve a given business problem. The list of composite
patterns described in this article is based on typically recurring business problems, but
this is not a comprehensive list of composite patterns.
Store and explore pattern
This pattern is useful when the business problem demands storing a huge amount of
new and existing data that has been previously unused because of lack of adequate
storage and analysis capability. The pattern is designed to ease the load on existing
data storage. The stored data can be used for initial exploration and ad-hoc discovery.
Users can derive reports to analyze the quality of data and its value in further
processing. Raw data can be pre-processed and cleaned using ETL tools, before any
type of analysis can happen.
Figure 6. Store and explore composite pattern
Figure 6 depicts the various dimensions for this pattern. The data might only be stored,
or it might also be processed and consumed.
An example of a storage-only case is the situation in which data is just acquired and
stored for the future to meet a compliance or legal requirement. The processing and
consumption case is for the situation in which the outcome of the analysis can be
processed and consumed. Data can be accessed from newly identified sources or from
existing data stores.
Purposeful and predictable analysis composite pattern
This pattern is used to perform analysis using various processing techniques and, as a
result, may enrich existing data with new insight or create output that can be consumed
by various users. The analysis can happen in real time as the events are happening or
in batch mode to draw insight based on data that has been gathered. As an example of
data at rest that can be analyzed, a telecommunications company might build churn
models that include analyzing the call data records, social data, and transaction data.
As an example of analyzing data in motion, the need to predict that a given transaction
is experiencing fraud must happen in real time or near real time.
Figure 7. Purposeful and predictive analysis composite pattern

Figure 7 depicts the various dimensions for this pattern. Processing performed could be
standard or predictive, and it can include decision-making.
In addition, notifications can be sent to a system or users regarding certain tasks or
messages. The notifications can use visualization. The processing can happen in real
time or in batch mode.
Actionable analysis pattern
The most advanced form of big data solution is the case in which analysis is performed
on the set of data and actions are triggered based on repeatable past actions or on an
action matrix. The actions can be manual, partially automated, or fully automated. The
base analysis needs to be highly accurate. The actions are predefined, and the result of
analysis is mapped against the actions. The typical steps involved in actionable analysis
are:
 Analyze the data to get the insight.
 Make a decision.
 Activate the appropriate channel to take the action to the right consumer.
Figure 8. Actionable analysis composite pattern

Figure 8 illustrates that the actions can be manual, partially automated, or fully
automated. It uses the atomic patterns as explained in the diagram.
Manual action means the system recommends actions based on the outcome of the
analysis, and a human being decides on and carries out the actions. Partial
automation means that the actions are recommended by the analysis, but human
intervention is required to set an action in motion or to choose from a set of
recommended actions. Fully automated means the actions are executed immediately by
the system after the decision is made. For example, a work order may be created by the
system automatically after equipment is predicted to fail.
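A simple way to picture the mapping from analysis outcome to action is a small action matrix keyed by outcome, as in the hypothetical sketch below; the outcomes, actions, and automation modes are illustrative only.

```python
# A minimal sketch of an action matrix with manual, partially automated,
# and fully automated paths. All entries are hypothetical.
ACTION_MATRIX = {
    "equipment_failure_predicted": {"action": "create_work_order", "mode": "automatic"},
    "possible_fraud": {"action": "hold_claim_for_review", "mode": "partially_automated"},
    "low_confidence_insight": {"action": "recommend_investigation", "mode": "manual"},
}

def dispatch(outcome):
    entry = ACTION_MATRIX.get(outcome)
    if entry is None:
        return "no action defined"
    if entry["mode"] == "automatic":
        # Fully automated: the system executes the action immediately.
        return f"executed: {entry['action']}"
    if entry["mode"] == "partially_automated":
        # Partially automated: the action is queued, pending human confirmation.
        return f"queued for confirmation: {entry['action']}"
    # Manual: the system only recommends; a person decides and acts.
    return f"recommended to operator: {entry['action']}"

print(dispatch("equipment_failure_predicted"))
```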
The following matrix shows how the atomic patterns map to the composite patterns,
which are combinations of atomic patterns. Each composite pattern is designed to be
used in certain situations for data that has a particular set of characteristics. The matrix
shows the typical combinations of patterns. The patterns must be tailored to meet
specific situations and requirements. In the matrix, the composite patterns are listed in
order from simplest to most complex. The "store and explore" pattern is the least
complex.
Figure 9. Composite to atomic patterns mapping
Summary
Taking a patterns-based approach can help the business team and the technical team
to agree on the primary objective of the solution. Using the patterns, the technical team
can define the architectural principles and make some of the key architecture decisions.
The technical team can apply these patterns to the architectural layers and derive the
set of components needed to implement the solution. Often, the solution starts with a
limited scope and evolves as the business becomes more and more confident that the
solution will bring value. As this evolution happens, the composite and atomic patterns
that align to the solution get refined. The patterns can be used in the initial stages to
define a pattern-based architecture and to map out how the components in the
architecture will be designed, step by step.
Figure 10. Atomic patterns mapping to architecture layers
In Part 2 of this series, we describe the complexities associated with big data and how
to determine if it's time to implement or update your big data solution. In this article, we
covered atomic and composite patterns and explained that a solution can be composed
of multiple patterns. Given a particular context, you may find that some patterns are
more appropriate than the others. We recommend you take an end-to-end view of the
solution and examine the patterns involved, then define the architecture of the big data
solution.
For architects and designers, mapping to patterns enables further refinement of the
responsibilities of each component in the architecture. For business users, it's often
helpful to gain a better understanding of the business scope of the big data problem, so
that valuable insights can be derived and so the solution matches the desired
outcome.
Solution patterns further help to define the optimal set of components based on whether
the business problem needs data discovery and exploration, purposeful and predictable
analysis, or actionable analysis. Remember that there is no recommended sequence or
order in which the atomic, composite, or solution patterns must be applied for arriving at
a solution. The next article in this series introduces solution patterns for this purpose.
Part 5: Apply a solution pattern to your big
data problem and choose the products to
implement it
Part 4 of this series describes atomic and composite patterns that address the most
common and recurring big data problems and their solutions. This article suggests
three solution patterns that can be used to architect a big data solution. Each solution
pattern uses a composite pattern and is made up of logical components (covered
in Part 3). At the end of this article, find a list of products and tools that map to
the components of each solution pattern.
Solution patterns
The following sections describe three solution patterns that can be used to architect a
big data solution. To illustrate the patterns, we apply them to a particular use case (how
to detect healthcare insurance fraud), but the patterns can be used to address many
other business scenarios. Each solution pattern takes advantage of a composite
pattern. In the following table, see the list of solution patterns covered here, along with
the composite patterns they are based on.
Table 1. Composite pattern used by each solution pattern

Solution pattern                      Composite pattern
Getting started                       Store and explore
Gaining advanced business insight     Purposeful and predictive analytics
Take the next best action             Actionable analysis

Description of the use case: Insurance fraud


Financial fraud poses a serious risk to all segments of the financial sector. In the United
States, insurers lose billions of dollars annually. In India, the loss in 2011 alone totaled
INR 300 billion. Apart from the financial loss, insurers are losing business because of
customer dissatisfaction. Although many insurance regulatory bodies have defined
frameworks and processes to control fraud, they are often reacting to fraud rather than
taking proactive steps to prevent it. Traditional approaches, such as circulating lists
of black-listed customers, insurance agents, and staff, do not resolve the problem of
fraud.
This article proposes solution patterns for a big data solution, based on the logical
architecture described in Part 3 of this series and the composite patterns covered
in Part 4.
Insurance fraud is an act or omission intended to gain dishonest or unlawful advantage,
either for the party committing the fraud or for other related parties. Broad categories of
fraud include:
 Policyholder fraud and claims fraud— Fraud against the insurer in the purchase and execution
of an insurance product, including fraud at the time of making an insurance claim.
 Intermediary fraud— Fraud perpetrated by an insurance agent, corporate agent, intermediary,
or third-party agent against the insurer or the policy holders.
 Internal fraud— Fraud against the insurer by its director, manager, or any other staff or office
member.
Current fraud-detection process
The insurance regulatory boards have established anti-fraud policies, which include
well-defined processes for monitoring fraud, for searching for potential fraud indicators
(and publishing a list), and for coordinating with law enforcement agencies. The insurers
have staff dedicated to analyzing fraudulent claims.
Issues with the current fraud-detection process
The insurance regulators have well-defined fraud-detection and mitigation processes.
Traditional solutions use models based on historical fraud data, black-listed customers
and insurance agents, and regional data about fraud peculiar to a certain area. The data
available for detecting fraud is limited to the given insurer's IT systems and a few
external sources.
Current fraud-detection processes are mostly manual and work on limited data sets.
Insurers may not be able to investigate all the indicators. Fraud is often detected very
late, and it is difficult for the insurer to do adequate followup for each fraud case.
Current fraud detection relies on what is known about existing fraud cases, so whenever
a new type of fraud occurs, insurance companies bear its consequences before the new
scheme can be recognized. Most traditional methods work within a particular data source and
cannot accommodate the ever-growing variety of data from different sources. A big data
solution can help address these challenges and play an important role in fraud detection
for insurance companies.
Solution pattern: Getting started
This solution pattern is based on the store-and-explore composite pattern. It focuses on
acquiring and storing the relevant data from various sources inside or outside the
enterprise. The data sources shown in Figure 1 are examples only; domain experts can
identify the appropriate data sources.
Because a large volume of varied data from many sources must be collected, stored,
and processed, this business challenge is a good candidate for a big data solution.
The following diagram shows the solution pattern, mapped onto the logical architecture,
described in Part 3.
Figure 1. Solution pattern for getting started

Figure 1 uses data providers and the following components:
 External data sources
 Structured data storage
 Transformed, structured data
 Entity resolution
 Big data explorer components
The data required for healthcare fraud detection can be acquired from various sources
and systems such as banks, medical institutions, social media, and Internet agencies. It
includes unstructured data from sources such as blogs, social media, news agencies,
reports from various agencies, and X-ray reports. See the data sources layer in Figure 1
for more examples. With big data analytics, the information from these varied sources
can be correlated and combined, and — with the help of defined rules — analyzed to
determine the possibility of fraud.
In this pattern, the required external data is acquired from data providers who contribute
preprocessed, unstructured data converted to structured or semi-structured format. This
data is stored in the big data stores after initial preprocessing. The next step is to
identify possible entities and generate ad-hoc reports from the data.
Entity identification is the task of recognizing named elements in the data. All entities
required for analysis must be identified, including loose entities that do not have
relationships to other entities. Entity identification is mostly performed by data scientists
and business analysts. Entity resolution can be as simple as identifying single entities, or
as complex as resolving entities based on data relationships and contexts. This pattern uses the
simple-form entity resolution component.
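A minimal sketch of simple-form entity resolution is shown below: records from different hypothetical sources are linked when a normalized name and date of birth match. Real resolution logic is usually far richer than this.

```python
# A minimal sketch of simple-form entity resolution over hypothetical records.
def normalize(name, dob):
    """Reduce a name/date-of-birth pair to a comparable key."""
    return (name.strip().lower().replace(".", ""), dob)

records = [
    {"source": "claims_db", "name": "J. Smith", "dob": "1975-03-14", "claim_id": "CLM-1043"},
    {"source": "hospital_feed", "name": "j smith", "dob": "1975-03-14", "visit_id": "V-221"},
    {"source": "claims_db", "name": "A. Kumar", "dob": "1982-11-02", "claim_id": "CLM-1044"},
]

entities = {}
for record in records:
    key = normalize(record["name"], record["dob"])
    # Records that normalize to the same key are treated as one entity.
    entities.setdefault(key, []).append(record)

for key, linked in entities.items():
    print(key, "->", [r["source"] for r in linked])
```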
Structured data can be simply converted into the format most appropriate for analysis
and directly stored in big data structured storages.
Ad-hoc queries can be performed on this data to get information such as the following (a minimal sketch of such a query appears after the list):
 Overall fraud risk profile for a given customer, region, insurance product, agent, or approving
staff in the given period
 Inspection of past claims by certain agents or approvers or by the customer across insurers
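As a minimal sketch of the first kind of query, the code below computes a per-agent fraud-risk profile for a given year; the pandas DataFrame is a stand-in for the structured big data store and its query engine, and the sample rows are hypothetical.

```python
# A minimal sketch of an ad-hoc fraud-risk profile query per agent and period.
import pandas as pd

claims = pd.DataFrame([
    {"agent": "AG-01", "region": "north", "year": 2013, "amount": 12500, "fraud_flag": 1},
    {"agent": "AG-01", "region": "north", "year": 2013, "amount": 900,   "fraud_flag": 0},
    {"agent": "AG-02", "region": "south", "year": 2013, "amount": 4300,  "fraud_flag": 0},
])

# Risk profile for the chosen period, grouped by agent.
profile = (claims[claims["year"] == 2013]
           .groupby("agent")
           .agg(total_claims=("amount", "count"),
                total_amount=("amount", "sum"),
                fraud_rate=("fraud_flag", "mean")))
print(profile)
```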
Typically, organizations get started with big data by adopting this pattern, as the name
implies. Organizations employ an exploratory approach to assess what kind of insight
could be generated, given the data available. At this stage, organizations do not
generally invest in advanced analytics techniques such as machine learning, feature
extraction, and text analytics.
Solution pattern: Gaining advanced business insight
This pattern is more advanced than the getting-started pattern. It predicts fraud at three
stages of claim processing:
1. The claim is already settled.
2. The claim processing is underway.
3. The claims request is just received.
For cases 1 and 2, the claims can be processed in batch, and the fraud-detection
process can be initiated as part of the regular reporting process or as requested by the
business. Case 3 can be processed in near real time. The claims request interceptor
intercepts the claim request, initiates the fraud-detection process (if the indicators report
it as a possible fraud case), and then notifies the stakeholders identified in the system. The
earlier the fraud is detected, the less severe the risk or loss.
Figure 2. Solution pattern for gaining advanced business insight

Figure 2 uses:
 Unstructured data storage
 Structured data storage
 Transformed structured data
 Preprocessed, unstructured data
 Entity resolution
 Fraud-detection engine
 Business rules
 Big data explorer
 Alerts and notifications to users
 Claims request interceptor
In this pattern, organizations can choose to preprocess unstructured data before
analyzing it.
The data is acquired and stored, as-is, in unstructured data storage. It is then
preprocessed into a format that can be consumed by the analysis layer. At times, the
preprocessing can be complex and time-consuming. Machine-learning techniques can
be used for text analytics, and the Hadoop Image Processing Framework can be useful
for processing images. The most widely used format for the preprocessed output is JSON.
The preprocessed data is then stored in structured data storage, such as HBase.
The core component in this pattern is the fraud-detection engine, composed of the
advanced analytics capabilities that help predict fraud. Well-defined and frequently
updated fraud indicators help identify fraud, and technology can be used to implement
systems that watch for them. Here's a list of common fraud indicators:
 Claims are made shortly after the policy inception.
 Serious underwriting lapses occur while processing a claim.
 The insured person is overtly aggressive in pursuit of a quick settlement.
 Insured parties are willing to accept a small settlement rather than document all losses.
 The authenticity of documents is doubtful.
 The insured person is behind in loan payments.
 The injury incurred is not visible.
 A high-value claim has no known casualty.
 Relationships exist between clusters of individuals, including policy holders, medical institutions,
associates, suppliers, and partners.
 Links exist between licensed and non-licensed healthcare providers.
Traditional methods alone are not adequate to predict fraud. Social-network analytics
are required to detect links between licensed and non-licensed healthcare providers and
to detect relationships between policy holders, medical institutions, associates,
suppliers, and partners. Validating the authenticity of documents and finding the credit
score of individuals are difficult tasks to accomplish with traditional approaches.
During analysis, the search for all of these indicators can occur simultaneously on a
huge volume of data. Every indicator is weighted. The total weight across all indicators
indicates the accuracy and severity of the predicted fraud.
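A minimal sketch of this weighted scoring is shown below; the indicator names, weights, and threshold are hypothetical illustrations rather than values from any real fraud model.

```python
# A minimal sketch of weighted fraud-indicator scoring with hypothetical weights.
INDICATOR_WEIGHTS = {
    "claim_shortly_after_inception": 0.30,
    "aggressive_pursuit_of_settlement": 0.15,
    "doubtful_document_authenticity": 0.25,
    "insured_behind_on_loan_payments": 0.10,
    "injury_not_visible": 0.20,
}

def fraud_score(indicators_present):
    """Sum the weights of the indicators that fired for a claim."""
    return sum(INDICATOR_WEIGHTS[name] for name in indicators_present
               if name in INDICATOR_WEIGHTS)

claim_indicators = ["claim_shortly_after_inception", "injury_not_visible"]
score = fraud_score(claim_indicators)
print(f"fraud score: {score:.2f}",
      "-> investigate" if score >= 0.4 else "-> routine processing")
```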
When the analysis is complete, alerts and notifications can be sent to relevant
stakeholders, and reports can be generated to show the outcome of analysis.
This pattern is suitable for enterprises that need to perform advanced analytics using
big data. It involves performing complex preprocessing so that the data can be stored in
a form that can be analyzed using advanced techniques, such as feature extraction,
entity resolution, text analytics, machine learning, and predictive analytics. This pattern
does not involve taking any action or suggesting recommendations on the output of
analysis.
Solution pattern: Take the next-best action
The fraud predictions made in the gaining-advanced-business-insight solution pattern
normally lead to certain actions, such as rejecting the claim, putting it on hold until
additional clarification and information is received, or reporting it for legal
action. In this pattern, actions are defined for each outcome of the prediction. This
action-to-outcome table is referred to as an action-decision matrix.
Figure 3. Solution pattern for the next-best action

Figure 3 uses:
 Unstructured data storage
 Structured data storage
 Transformed structured data
 Preprocessed unstructured data
 Entity resolution
 Fraud-detection engine
 Business rules
 Decision matrices
 Data exploration tools
 Alerts and notifications to users
 Claims request interceptor
 Alerts and notifications to other systems and business process components
Typically, three kinds of actions can be taken:
 A notification can be sent to stakeholders to take the necessary action — for example, to notify
the user to take legal action against the claimant.
 The system notifies the user and waits for the user's feedback before taking further action. The
system can wait for the user to respond to a task or it can stop or put on hold a claim-processing
transaction.
 For scenarios that do not need manual intervention, the system can take an automated action.
For example, the system can send a trigger to a process to stop the claims process and inform
the legal department about the claimant, agent, and approver.
This pattern is suitable for enterprises that need to perform advanced analytics using
big data. This pattern uses advanced capabilities to detect fraud, to notify and alert
relevant stakeholders, and to initiate automatic workflows to take action based on
outcome of processing.
Products and technologies that form the backbone of a big data solution
The following diagram shows how big data software maps to the various components of
the logical architecture described in Part 3. These are not the only products,
technologies, or solutions that can be used in a big data solution; your own
requirements and environment must shape the tools you choose to deploy.
Figure 4 shows big data appliances, such as IBM PureData™ System for Hadoop and
IBM PureData System for Analytics, cutting across layers. These appliances have
features such as built-in visualization, built-in analytic accelerators, and a single system
console. Using an appliance has many advantages. (See Related topics for more
information about the IBM PureData System for Hadoop.)
Figure 4. Products and technologies mapped to logical layers diagram
Benefits of using big data analytics in fraud detection
Using big data analytics for detecting fraud has various benefits over traditional
approaches. Insurance companies can build systems that include all relevant data
sources. An all-encompassing system helps detect uncommon cases of fraud.
Techniques such as predictive modeling thoroughly analyze instances of fraud, filter
obvious cases, and refer low-incidence fraud cases for further analysis.
A big data solution can also help build a global perspective of the anti-fraud efforts
throughout the enterprise. Such a perspective often leads to better fraud detection by
linking associated information within the organization. Fraud can occur at a number of
source points: claims processing, insurance surrender, premium payment, application
for a new policy, or employee-related or third-party fraud. Combined data from various
sources enables better predictions.
Analytics technologies enable an organization to extract important information from
unstructured data. Although volumes of structured information are stored in data
warehouses, most of the crucial information about fraud is in unstructured data, such as
third-party reports, which are rarely analyzed. In most insurance agencies, social media
data is not appropriately stored or analyzed.

Conclusion
Using business scenarios based on the use case of identifying fraud in the insurance
industry, this article describes solution patterns that vary in complexity. The simplest
pattern addresses storing data from various sources and doing some initial exploration.
The most complex covers how to gain insight from the data and take action based on
the analysis.
Each business scenario is mapped to the appropriate atomic and composite patterns
that make up the solution pattern. Architects and designers can apply the solution
pattern to define the high-level solution and functional components of the appropriate
big data solution.
