Big Data Analytics UNIT-1
What is big data, why big data, convergence of key trends, unstructured data, industry examples of big data, web analytics, big data and marketing, fraud and big data, risk and big data, credit risk management, big data and algorithmic trading, big data and healthcare, big data in medicine, advertising and big data, big data technologies, introduction to Hadoop, open source technologies, cloud and big data, mobile business intelligence, crowdsourcing analytics, inter- and trans-firewall analytics.
Big data:
Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. Big data is data that goes beyond the traditional limits of data along three dimensions: i) Volume, ii) Variety, iii) Velocity.
Data Volume:
Data volume can be measured by the quantity of transactions, events and the amount of history. Big Data isn't just a description of raw volume. The real challenge is identifying or developing the most cost-effective and reliable methods for extracting value from all the terabytes and petabytes of data now available. That's where Big Data analytics becomes necessary.
Measuring data volume
Data Variety:
It is the assortment of data. Traditionally, data, especially operational data, is "structured", as it is put into a database based on the type of data (i.e., character, numeric, floating point, etc.).
Wide variety of data:
Internet data (social media, social networks such as Twitter and Facebook), primary research (surveys, experiments, observations), secondary research (competitive and marketplace data, industry reports, consumer data, business data), location data (mobile device data, geospatial data), image data (video, satellite images, surveillance), supply chain data (vendor catalogs, pricing, etc.) and device data (sensor data, RF devices, telemetry).
Structured Data
These have a predefined data model and fit into a relational database. Operational data, especially, is "structured", as it is put into a database based on the type of data (i.e., character, numeric, floating point, etc.).
Semi-structured data
These are data that do not fit into a formal structure of data models. Semi-structured data is often a combination of different types of data that has some pattern or structure that is not as strictly defined as structured data. Semi-structured data contains tags that separate semantic elements, which include the capability to enforce hierarchies within the data.
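As a small illustration (a single hypothetical record, not drawn from any particular system), the snippet below builds a JSON document, a common semi-structured format: tags name the semantic elements and nest hierarchically, but different records need not share an identical schema.

    import json

    # A hypothetical semi-structured customer record: tagged and nested,
    # yet another record could legitimately omit or add fields.
    record = {
        "customer": {
            "name": "A. Kumar",
            "contacts": [
                {"type": "email", "value": "a.kumar@example.com"},
                {"type": "mobile", "value": "+91-0000000000"},
            ],
            "social": {"twitter": "@akumar"},  # may be absent elsewhere
        }
    }
    print(json.dumps(record, indent=2))  # serialize to JSON text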
Unstructured data
These do not have a predefined data model and/or do not fit into a relational database. Oftentimes, text, audio, video, image, geospatial, and Internet data (including click streams and log files) are considered unstructured data.
Data Velocity
Data velocity is about the speed at which data is created, accumulated, ingested, and processed. The increasing pace of the world has put demands on businesses to process information in real time or with near real-time responses. This may mean that data is processed on the fly, or while "streaming" by, to make quick, real-time decisions, or it may be that monthly batch processes are run intra-day to produce more timely decisions.
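As a toy sketch of processing data "while streaming by" (the event source, window size and alert rule are all invented for this example), the snippet below keeps a rolling one-minute window of transaction amounts and makes a decision per event rather than per batch:

    from collections import deque
    import time

    WINDOW_SECONDS = 60      # assumed window size for the example
    window = deque()         # (timestamp, amount) pairs inside the window

    def on_event(amount, now=None):
        """Process one streaming event and decide in real time."""
        now = time.time() if now is None else now
        window.append((now, amount))
        # Evict events that have aged out of the window.
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        avg = sum(a for _, a in window) / len(window)
        if amount > 3 * avg:  # hypothetical rule: spike vs. rolling average
            print(f"alert: {amount} vs rolling average {avg:.2f}")

    for i, amount in enumerate([10, 12, 9, 11, 300]):
        on_event(amount, now=i)  # the spike at the end triggers the alert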
Why bother about unstructured data?
- The amount of data (all data, everywhere) is doubling every two years.
- Our world is becoming more transparent. People have accepted this and no longer mind parting with data that was once considered sacred and private.
- Most new data is unstructured. Specifically, unstructured data represents almost 95 percent of new data, while structured data represents only 5 percent.
- Unstructured data tends to grow exponentially, unlike structured data, which tends to grow in a more linear fashion.
- Unstructured data is vastly underutilized.
Need to learn how to:
- Use Big Data.
- Capitalize on new technology capabilities and leverage existing technology assets.
- Enable appropriate organizational change.
- Deliver fast and superior results.
Advantages of Big Data Business Models:
Improve Operational Efficiencies | Increase Revenues | Achieve Competitive Differentiation
Reduce risks and costs | Sell to microtrends | Offer new services
Save time | Enable self service | Seize market share
Lower complexity | Improve customer experience | Incubate new ventures
Enable self service | Detect fraud |
Web Analytics
Web analytics is the measurement, collection, analysis and reporting of web data for the purposes of understanding and optimizing web usage. Web analytics is not just a tool for measuring web traffic; it can also be used as a tool for business and market research, and to assess and improve the effectiveness of a web site. The following are some of the web analytics metrics: hit, page view, visit/session, first visit/first session, repeat visitor, new visitor, bounce rate, exit rate, page time viewed/page visibility time/page view duration, session duration/visit duration, average page view duration, and click path.
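As a rough illustration of how a few of these metrics can be derived (the log format and the 30-minute session timeout are assumptions made for this example), the Python sketch below computes visits, bounce rate and average session duration from a time-ordered clickstream of (visitor, timestamp) page views:

    from collections import defaultdict

    SESSION_TIMEOUT = 30 * 60  # assumed 30-minute inactivity timeout

    def sessionize(pageviews):
        """pageviews: list of (visitor_id, unix_timestamp), sorted by time.
        Returns a list of sessions, each a list of timestamps."""
        sessions, last_seen = [], {}
        current = defaultdict(list)
        for visitor, ts in pageviews:
            if visitor in last_seen and ts - last_seen[visitor] > SESSION_TIMEOUT:
                sessions.append(current.pop(visitor))  # close the idle session
            current[visitor].append(ts)
            last_seen[visitor] = ts
        sessions.extend(current.values())              # close remaining sessions
        return sessions

    views = [("v1", 0), ("v2", 10), ("v1", 120), ("v1", 4000)]
    sessions = sessionize(views)
    bounces = sum(1 for s in sessions if len(s) == 1)  # single-page sessions
    print("visits:", len(sessions))                    # 3
    print("bounce rate:", bounces / len(sessions))     # 2/3
    print("avg session duration (s):",
          sum(s[-1] - s[0] for s in sessions) / len(sessions))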
Why use big data tools to analyse web analytics data?
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and
how that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer
journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer
satisfaction and lifetime value
• It tells you how customers and prospective customers engage with
your different marketing campaigns and how that drives subsequent
behaviour
Deriving value from web analytics data often involves very personalized analytics
• The web is a rich and varied space! E.g.:
• Bank
• Newspaper
• Social network
• Analytics application
• Government organisation (e.g. tax office)
• Retailer
• Marketplace
• For each type of business you'd expect different:
• Types of events, with different types of associated data
• Ecosystem of customers / partners with different types of relationships
• Product development cycle (and approach to product development)
• Types of business questions / priorities to inform how the data is analysed
Web analytics tools are good at delivering the standard reports that are common across different business types.
• Where does your traffic come from? E.g.:
• Sessions by marketing campaign / referrer
• Sessions by landing page
• Understanding events common across business types (page views, transactions, 'goals'), e.g.:
• Page views per session
• Page views per web page
• Conversion rate by traffic source
• Transaction value by traffic source
• Capturing contextual data common to people browsing the web:
• Timestamps
• Referrer data
• Web page data (e.g. page title, URL)
• Browser data (e.g. type, plugins, language)
• Operating system (e.g. type, timezone)
• Hardware (e.g. mobile / tablet / desktop, screen resolution, colour depth)
• What is the impact of different ad campaigns and creative on the way users
behave, subsequently? What is the return on ad spend?
• How do visitors use social channels (Facebook / Twitter) to interact around
video content? How can we predict which content will “go viral”?
• How do updates to our product change the “stickiness” of our service?
I) Digital Marketing
III) Big Data and Advances in Health Care
Database Marketers, Pioneers of Big Data
Database marketing is concerned with building databases containing information about individuals, using that information to better understand those individuals, and communicating effectively with some of those individuals to drive business value.
Marketing databases are typically used for
i) Customer acquisition
ii) Retaining and cross-selling to existing customers, which reactivates the cycle
As companies grew and systems proliferated, they ended up with one system for one product, another for another product, and so on (silos). Companies then began developing technologies to manage and merge data from multiple sources, and software that could eliminate duplicate customer information (de-duping). This enabled them to extract customer information from siloed product systems, merge the information into a single database, remove all the duplicates, and then send direct mail to subsets of the customers in the database. Companies such as Reader's Digest and several other firms were early champions of this new kind of marketing and used it very effectively. By the 1980s marketers had developed the ability to run reports on the information in their databases, which gave them better and deeper insights into the buying habits and preferences of customers. Telemarketing became popular when marketers figured out how to feed information extracted from customer databases to call centers. In the 1990s email entered the picture, and marketers saw opportunities to reach customers via the Internet and WWW. In the past five years there has been exponential growth in database marketing, and the new scale is pushing up against the limits of technology.
Big Data & the New School of Marketing
New school marketers deliver what today's consumers want, i.e., relevant interactive communication across the digital power channels.
Digital power channels: email, mobile, social, display and web.
Consumers have changed, so must marketers.
Social & Affiliate Marketing, or Pay-for-Performance Marketing on the Internet
The concept of affiliate marketing, or pay-for-performance marketing on the Internet, is often credited to William J. Tobin, the founder of PC Flowers & Gifts. Amazon.com launched its own affiliate program in 1996, and middleman affiliate networks like LinkShare and Commission Junction emerged during the 1990s Internet boom, providing the tools and technology to allow any brand to put affiliate marketing practices to use. Today, most of the major brands have a thriving affiliate program, and industry analysts estimate affiliate marketing to be a $3 billion industry. It is an industry that remains largely anonymous: unlike email and banner advertising, affiliate marketing is a behind-the-scenes channel most consumers are unaware of.
In 2012, the emergence of the social web brought these concepts together. What only professional affiliate marketers could do prior to Facebook, Twitter, and Tumblr, any consumer with a mouse can now do. Couponmountain.com and other well-known affiliate sites generate multimillion-dollar yearly revenues for driving transactions for the merchants they promote. The expertise required to build, host, and run a business like Couponmountain.com is no longer needed when a consumer with zero technical or business background can publish the same content simply by clicking "Update Status" or "Tweet". The barriers to entering the affiliate marketing industry as an affiliate no longer exist.
Empowering Marketing with Social Intelligence
As a result of the growing popularity and use of social media around the world and across nearly every demographic, the amount of user-generated content, or "big data", created is immense and continues to grow exponentially. Millions of status updates, blog posts, photographs, and videos are shared every second. Successful organizations will need not only to identify the information relevant to their company and products, but also to dissect it, make sense of it, and respond to it, in real time and on a continuous basis, drawing business intelligence, or insights, that helps predict likely future customer behavior. Very intelligent software is required to parse all that social data and determine things like the sentiment of a post.
Marketers now have the opportunity to mine social conversations for purchase intent and brand lift through Big Data, so marketers can communicate with consumers regardless of the channel. Since this data is captured in real time, Big Data is forcing marketing organizations to optimize media and messages quickly. And since this data provides details on all aspects of consumer behavior, companies are eliminating silos within the organization so they can act on it consistently across channels, across media, and across the path to purchase.
- This fraud detection system uses an open source search server based on Apache Lucene. It can be used to search all kinds of documents in near real time. The tool is used to index new transactions as they are sourced in real time, which allows analytics to run in a distributed fashion using the data specific to the index. Using this tool, large historical data sets can be used in conjunction with real-time data to identify deviations from typical payment patterns. The big data component allows overall historical patterns to be compared and contrasted, and allows the number of attributes and characteristics captured about consumer behavior to be very wide with little impact on overall performance.
- Percolator performs the function of identifying new transactions that match previously registered suspicious profiles (a minimal sketch of this percolation idea appears after this list). Percolator can handle both structured and unstructured data. This provides scalability to the event-processing framework and allows specific suspicious transactions to be enriched with additional unstructured information (e.g., phone location/geospatial records, customer travel schedules, and so on). This ability to enrich the transaction further can reduce false positives and improve the customer experience, while redirecting fraud efforts to actual instances of suspicious activity.
- Capgemini's fraud Big Data initiative focuses on flagging suspicious credit card transactions to prevent fraud in near real time via multi-attribute monitoring. Real-time inputs involving transaction data and customer records are monitored via validity checks and detection rules. Pattern recognition is performed against the data to score and weight individual transactions across each of the rules and scoring dimensions. A cumulative score is then calculated for each transaction record and compared against thresholds to decide whether the transaction is suspicious or not.
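The percolation idea above can be sketched with Elasticsearch, a popular open source search server built on Apache Lucene, whose percolator matches incoming documents against stored queries. This is a minimal illustration rather than the system described in the text; the index name, fields and the single rule are invented, and the calls follow the elasticsearch-py 8.x client style.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # An index whose "query" field stores detection rules as percolator queries.
    es.indices.create(index="fraud-rules", mappings={"properties": {
        "query":   {"type": "percolator"},
        "amount":  {"type": "double"},
        "country": {"type": "keyword"},
    }})

    # Register a hypothetical rule: large transactions from an unusual country.
    es.index(index="fraud-rules", id="big-foreign-spend", document={
        "query": {"bool": {"must": [
            {"range": {"amount": {"gte": 5000}}},
            {"term": {"country": "XX"}},
        ]}}
    })
    es.indices.refresh(index="fraud-rules")

    # Percolate each incoming transaction: which stored rules does it trigger?
    txn = {"amount": 7200.0, "country": "XX"}
    hits = es.search(index="fraud-rules",
                     query={"percolate": {"field": "query", "document": txn}})
    for hit in hits["hits"]["hits"]:
        print("transaction matched rule:", hit["_id"])  # flag for review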
Social Network Analysis (SNA)
- This is another approach to solving fraud with Big Data.
- SNA views social relationships in terms of networks of nodes (individuals) and ties (relationships) and draws inferences from them.
- SNA could reveal all the individuals involved in fraudulent activity, from perpetrators to their associates, and an understanding of their relationships and behavior can identify a bust-out fraud case. Bust-out is a hybrid credit and fraud problem, and the scheme is typically defined by a characteristic pattern of behavior.
- There are Big Data solutions in the market, like SAS's SNA solution, which help institutions go beyond individual and account views to analyze all related activities and relationships at a network dimension. The network dimension allows visualization of social networks and helps to reveal hidden connections and relationships, which could indicate a group of fraudsters. There are huge amounts of data involved behind the scenes, but the key to SNA solutions like SAS's is the visualization techniques that let users easily engage and take action.
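A toy sketch of the network idea, using the open source networkx library (the linking attributes, sample records, and cluster-size threshold are all invented for this example): accounts that share identifying details are joined into one graph, and unusually large connected clusters are surfaced for review.

    import networkx as nx
    from itertools import combinations
    from collections import defaultdict

    # Hypothetical account records; shared phone numbers or addresses are
    # the ties that often link the accounts in a bust-out ring.
    accounts = [
        {"id": "A1", "phone": "555-0100", "address": "12 Elm St"},
        {"id": "A2", "phone": "555-0100", "address": "98 Oak Ave"},
        {"id": "A3", "phone": "555-0199", "address": "98 Oak Ave"},
        {"id": "A4", "phone": "555-0333", "address": "7 Pine Rd"},
    ]

    G = nx.Graph()
    G.add_nodes_from(a["id"] for a in accounts)

    # Add a tie between any two accounts that share a phone or an address.
    for key in ("phone", "address"):
        by_value = defaultdict(list)
        for a in accounts:
            by_value[a[key]].append(a["id"])
        for ids in by_value.values():
            for u, v in combinations(ids, 2):
                G.add_edge(u, v, reason=key)

    # Connected clusters above a size threshold become review candidates.
    for cluster in nx.connected_components(G):
        if len(cluster) >= 3:  # assumed threshold for this example
            print("suspicious cluster:", sorted(cluster))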
- Social media and cell phone usage data are opening up new opportunities to analyze customer behavior that can be used for credit decisioning.
- There are four critical parts of the typical credit risk framework: planning, customer acquisition, account management, and collections. All four parts are handled in unique ways through the use of Big Data.
Disruptive Analytics
- Data science and disruptive analytics can have an immediate beneficial impact on healthcare systems.
- Data analytics makes it possible to create a transparent approach to pharmaceutical decision making, based on the aggregation and analysis of healthcare data such as electronic medical records and insurance claims data.
- Creating a healthcare analytics framework has significant value for the individual stakeholders.
- For providers (physicians), there is an opportunity to build analytics systems for evidence-based medicine (EBM): sifting through clinical and health-outcomes data to determine the best clinical protocols that provide the best health outcomes for patients and create defined standards of care.
- For producers (pharmaceutical and medical device companies), there is an opportunity to build analytics systems to enable translational medicine: integrating externally generated post-marketing safety, epidemiology and health-outcomes data with internally generated clinical and discovery data (sequencing, expression, biomarkers) to enable improved strategic R&D decision making across the pharmaceutical value chain.
- For payers (i.e., insurance companies), there is an opportunity to create analytics systems to enable comparative effectiveness research (CER) that will be used to drive reimbursement, by mining large collections of claims, healthcare records (EMR/EHR), and economic, geographic and demographic data sets to determine which treatments and therapies work best for which patients, in which context, and with what overall economic and outcomes benefit.
A Holistic Value Proposition
− The ability to collect, integrate, analyze and manage data can make healthcare data, such as EHR/EMR, valuable.
− A big data approach to analyzing healthcare data creates methods and a platform for the analysis of large volumes of disparate kinds of data (clinical, EMR, claims, labs, etc.) to better answer questions of outcomes, epidemiology, safety, effectiveness and pharmacoeconomic benefit.
− Big data technologies and platforms such as Hadoop, R, open health data, etc. help clients create real-world, evidence-based approaches to realize solutions for comparative effectiveness research, improve outcomes in complex populations and improve decision making.
BI is not Data Science
− Traditional business intelligence and data warehousing skills do not help in predictive analytics. Traditional BI is declarative, like a lawyer who draws a conclusion and then looks for supporting evidence, and doesn't necessarily require any real domain understanding. Generating automated reports from aging data warehouses that are briefly scanned by senior management does not meet the definition of data science.
− Making data science useful to a business is about identifying the question that management is really trying to answer.
IV) Pioneering New Frontiers in Medicine
− In the medical field, Big Data analytics is being used by researchers to understand autoimmune diseases such as rheumatoid arthritis, diabetes and lupus, and neurodegenerative diseases such as multiple sclerosis, Parkinson's and Alzheimer's. In most of these cases, the goal is to identify the genetic variations that cause the diseases. The data sets used for such identification contain thousands of genes. For example, research on the role of environmental factors, and the interactions between environmental factors, in multiple sclerosis typically uses data sets that contain 100,000 to 500,000 genetic variations. The algorithms used to identify the interactions between environmental factors and diseases must scale to data sets of this size. They also have rapid search techniques built into them, and should be able to do statistical analysis and permutation analysis, which can be very, very time consuming if not done properly.
Challenges faced by pioneers of quantitative pharmacology
− The data sets are very large (e.g., a 1000 × 2000 matrix).
− When an interaction analysis for first-order and second-order interactions is done, each of the 500,000 genetic locations has to be compared with each of the remaining 500,000 locations for the first order, and the process is repeated for the higher orders. Basically, a second-order interaction analysis requires on the order of 500,000 squared comparisons, a third order 500,000 cubed, and so on. Such huge computations are made possible in little time with the aid of big data technologies.
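A quick back-of-the-envelope check of that growth, using just the numbers quoted above:

    import math

    n = 500_000
    # Unordered pairs of genetic locations (pairwise interaction tests):
    print(f"pairs: {math.comb(n, 2):.3e}")  # ~1.250e+11 comparisons
    # The rough orders of growth mentioned in the text:
    print(f"n^2:   {float(n**2):.3e}")      # ~2.500e+11
    print(f"n^3:   {float(n**3):.3e}")      # ~1.250e+17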
If, out of three ads aired, two have high breakthrough but one is weak, the weak-performing ad can be quickly taken off air and the media spend rotated to the higher-performing ads. This makes breakthrough scores go up.
Instead of only 30-second ads, a mix of 15-second and 30-second ads can be planned; if real-time data shows that 15-second ads work as well as 30-second ads, the money can be shifted from 30-second ads to 15-second ads and scores will continue to grow.
The measurement tools and capabilities are enabling real-time optimization of this kind, so there is a catch-up happening in terms of advertising systems and processes, but the industry infrastructure must be able to actually enable all of this real-time optimization.
Now, the impact of social media on sales can be measured through market mix modeling (MMM). Market mix modeling is a way to take all the different variables in the marketing mix, including paid, owned, and earned media, treat them as independent variables, regress them against sales data, and try to understand the single-variable impact of each of these different things; a minimal sketch appears below.
Since these methods are quite advanced, organizations use high-end internal analytic talent and advanced analytics platforms such as SAS, or point solutions such as Unica and Omniture. Alternatively, there are several large analytics providers, like Mu Sigma, that supply it as software-as-a-service (SaaS).
As the world becomes more digital, the quantity and quality of marketing data is improving, which is leading to more granular and insightful MMM analyses.
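Here is a minimal sketch of the regression at the heart of MMM, using synthetic numbers and scikit-learn (real MMM work also models carry-over/adstock and saturation effects, which are omitted here):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    weeks = 104

    # Synthetic weekly spend on paid, owned and earned media (the regressors).
    X = rng.uniform(0, 100, size=(weeks, 3))
    true_effect = np.array([0.8, 0.3, 1.2])            # assumed "ground truth"
    sales = 50 + X @ true_effect + rng.normal(0, 5, weeks)

    model = LinearRegression().fit(X, sales)
    for channel, coef in zip(["paid", "owned", "earned"], model.coef_):
        print(f"{channel:>6}: +{coef:.2f} sales units per unit of spend")
    print(f"baseline sales (intercept): {model.intercept_:.1f}")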
The Three Big Data Vs in Advertising
Impact of the three Vs (volume, velocity, and variety) in advertising:
Volume
The volume of information and data that is available to the advertiser has gone up exponentially versus what it was 20 years ago. In the old days, we would copy-test our advertising, the agency would build a demographically targeted media plan, and we'd execute it. Maybe 6 to 12 months later, we'd try to use whatever sales data we had to understand whether there was any impact. In today's world, there is vastly more advertising-effectiveness data. On TV advertising, we can measure every ad in every TV show every day, across about 70 percent of the viewing audience. We measure clients' digital ad performance hourly: by ad, by site, by exposure, and by audience. On a daily or weekly basis, an advertiser can look at their advertising performance.
Velocity
There are already companies that will automate and optimize advertising on the web without any human intervention at all, based on click-through. It is now beginning to happen on metrics like breakthrough, branding, purchase intent, etc. This is sometimes called programmatic buying. Literally, we will have systems in place that measure the impact of the advertising across websites, or across different placements within websites, and figure out where the advertising is performing best. It will be automated optimization and reallocation happening in real time. The volume and the velocity of data, the pace at which we can get the data, make decisions and act on them, have dramatically increased.
Variety
Before, we really didn't have a lot of data about how our advertising was performing in market. Now we have a lot more data, and it is a lot more granular. We can look at our brand's overall advertising performance in the market, but we can also decompose how much of that performance is due to the creative quality, the media weight, and the program that the ads sit in; how much is due to the placement: time of day, time of year, pod position; how much is due to cross-platform exposure; and how much is due to competitive activity. Then we have the ability to optimize on most of those things, in real time. And now we can also measure earned (social) and owned media. Those are all things that weren't even measured before.
Apple entered the mobile and tablet market because of the iPod, which crushed giants like Sony in the MP3 market. For Apple, the market was not just about selling hardware or music on iTunes; it gave them a chance to get as close to the consumer as anyone can possibly get. This close interaction also generated a lot of data that helped them expand and capture new customers. Again, it is all about the data, analytics, and putting them into action.
Google gives away products that other companies, such as Microsoft, license, for the same reason. It also began playing in the mobile hardware space through the development of the Android platform and the acquisition of Motorola. It is all about gathering consumer data and monetizing that data. With Google Dashboard we can see every search we did, the e-mails we sent, IM messages, web-based phone calls, documents we viewed, and so on. This is powerful for marketers.
The online retailer Amazon has created new hardware with the Kindle, and Barnes and Noble released the Nook. If both companies know every move we make, what we download, and what we search for, they can study our behaviors to present new products that they believe will appeal to us. The connection with consumers, and more importantly taking action on the derived data, is what it takes to win.
BIG DATA TECHNOLOGY
Components of Hadoop:
1) HDFS (Hadoop Distributed File System): stores the entire dataset in small pieces, replicated across a collection of servers.
2) MapReduce:
- Because Hadoop stores the entire dataset in small pieces across a collection of servers, analytical jobs can be distributed in parallel to each of the servers storing part of the data. Each server evaluates the question against its local fragment simultaneously and reports its result back for collation into a comprehensive answer.
- MapReduce is the agent that distributes the work and collects the results.
- Both HDFS and MapReduce are designed to continue to work even if there are failures.
- HDFS continuously monitors the data stored on the cluster. If a server becomes unavailable, a disk drive fails, or data is damaged due to hardware or software problems, HDFS automatically restores the data from one of the known good replicas stored elsewhere on the cluster.
- When an analysis job is running, MapReduce monitors the progress of each of the servers participating in the job. If one of them is slow in returning an answer, or fails before completing its work, MapReduce automatically starts another instance of the task on another server that has a copy of the data.
- Because of the way that HDFS and MapReduce work, Hadoop provides scalable, reliable and fault-tolerant services for data storage and analysis at very low cost (a small word-count sketch follows).
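As a minimal illustration of the programming model, here is the classic word count written for Hadoop Streaming in Python. This is a sketch: the file names and cluster setup are assumptions, and production jobs usually use a higher-level framework rather than raw streaming scripts.

    #!/usr/bin/env python3
    # mapper.py: runs on each server against its local fragment of the input;
    # emits one "word<TAB>1" pair per word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py: the framework delivers pairs sorted by key, so counts can
    # be summed per word in a single pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

A job like this is typically submitted with the stock streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out. MapReduce runs the mapper next to each HDFS block, shuffles and sorts the intermediate pairs, and collates the reducer output.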
Old vs. New Approaches to Data Analytics
Old approach (database approach) | New approach (big data analytics)
Follows a data and analytics technology stack with different layers of cross-communicating data, and works on "scale-up" expensive hardware. | Follows a data and analytics platform that does all the data processing and analytics in one layer, without moving data back and forth, on cheap but scalable ("scale-out") commodity hardware.
Data is moved to the places where it has to be processed. | Data is processed and converted into usable business intelligence where it sits.
Massive parallel processing was not employed due to hardware and storage limitations. | Hardware and storage are affordable and continue to get cheaper, enabling massive parallel processing.
Due to technological limitations, storing, managing and analyzing massive data sets was difficult. | New proprietary technologies and open source inventions enable different approaches that make it easier and more affordable to store, manage and analyze data.
Not able to handle unstructured data. | The variety of data and the ability to handle unstructured data are on the rise; the big data approach provides a solution to this.
Data Discovery
- Data discovery is the term used to describe the new wave of business intelligence that enables users to explore data, make discoveries and uncover insights in a dynamic and intuitive way, versus predefined queries and preconfigured drill-down dashboards. This approach is being adopted by many business users because of the freedom and flexibility it offers for viewing Big Data. Two software companies stand out in the crowd by growing their businesses at unprecedented rates in this space: Tableau Software and QlikTech International.
- Both companies' approach to the market is much different from the traditional BI software vendor's. They used a sales model referred to as "land and expand". This model was based on the fact that analytics and reporting are produced by the people using the results, and it enabled business people to create their own reports and dashboards.
- The most important characteristic of rapid-fire BI is that business users, not specialized developers, drive the applications. The result is that everyone wins. The IT team can stop the backlog of change requests and instead spend time on strategic IT issues, while users can serve themselves data and reports when needed.
- There is a simple example of powerful visualization. A company uses an interactive dashboard to track the critical metrics driving its business. Every day, the CEO and other executives are plugged in, in real time, to see how their markets are performing in terms of sales and profit, what the service quality scores look like against advertising investments, and how products are performing in terms of revenue and profit. Interactivity is key: a click on any filter lets the executive look into specific markets or products. She can click on any data point in any one view to show the related data in the other views. She can look into any unusual pattern or outlier by showing details on demand. Or she can click through to the underlying information in a split second.
- Business intelligence needs to work the way people's minds work. Users need to navigate and interact with data any way they want to, asking and answering questions on their own and in big groups or teams.
- QlikTech has designed in a way for users to leverage direct and indirect search. With QlikView search, users type relevant words or phrases in any order and get instant, associative results. With a global search bar, users can search across the entire data set; with search boxes on individual list boxes, users can confine the search to just that field. Users can conduct both direct and indirect searches. For example, if a user wanted to identify a sales rep but couldn't remember the sales rep's name, just details about the person, such as that he sells fish to customers in the Nordic region, the user could search on the sales rep list box for "Nordic" and "fish" to narrow the search results to just the people who meet those criteria.
product/service is zero. Whether a private hosted model or a publicly shared one, the true value lies in delivering software, data and/or analytics in an "as a service" model.
Predictive Analytics
- Enterprises will move from reactive positions (business intelligence) to forward-leaning positions (predictive analytics). Using all the data available, i.e., traditional internal data sources combined with new, rich external data sources, will make the predictions more accurate and meaningful. Algorithmic trading and supply chain optimization are two examples where predictive analytics has greatly reduced the friction in business. Predictive analytics is proliferating in every facet of our lives, both personal and business. Some of the leading trends in business today are:
- Recommendation engines, similar to those used by Netflix and Amazon, that use past purchases and buying behavior to recommend new purchases (see the sketch after this list).
- Risk engines for a wide variety of business areas, including market and credit risk, catastrophic risk and portfolio risk.
- Innovation engines for new product innovation, drug discovery, and consumer and fashion trends, to predict new product formulations and new purchases.
- Consumer insight engines that integrate a wide variety of consumer-related information, including sentiment, behavior and emotions.
- Optimization engines that optimize complex, interrelated operations and decisions that are too complex to handle manually.
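A toy sketch of the recommendation-engine idea, using item-based cosine similarity on a made-up ratings matrix (real systems such as Netflix's are far more elaborate):

    import numpy as np

    # Rows = users, columns = items; 0 means "not yet purchased/rated".
    R = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    # Cosine similarity between item columns.
    norms = np.linalg.norm(R, axis=0)
    sim = (R.T @ R) / np.outer(norms, norms)

    # Score unseen items for user 0 by similarity-weighted known ratings.
    user = R[0]
    scores = sim @ user
    scores[user > 0] = -np.inf       # don't re-recommend items already owned
    print("recommend item:", int(np.argmax(scores)))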
- A focus on customer success
Unlike traditional enterprise software, with a SaaS business it is easy for customers to leave if they are not satisfied. Today's BI is not designed for the end user. It must be designed to be more intuitive, easily accessible and real time, and to meet the expectations of today's customers, who expect a much more connected experience.
Mobile Business Intelligence
- A lack of simplicity and ease of use had been the major barrier to BI adoption, but mobile devices have made complicated actions very easy to perform. For example, a young child can use an iPad or iPhone easily, but not a laptop. This ease of use will drive the wide adoption of mobile BI.
- Multi-touch, software-oriented devices have brought mobile analytics and intelligence to a much wider audience.
- The ease of mobile application development and deployment has also contributed to the wide adoption of mobile BI.
Three elements that have impacted the viability of mobile BI are:
i) Location: the GPS component makes finding a location easy.
ii) Transactions: they can be done through smartphones.
iii) Multimedia functionality, which allows visualization.
Three challenges with mobile BI include:
i) Managing standards for these devices.
ii) Managing security (always a big challenge).
iii) Managing "bring your own device", where devices owned by the company and devices owned by the individual both contribute to productivity.
Crowdsourcing Analytics
- Crowdsourcing is the recognition that organizations can't always have the best and brightest internal people to solve all their big problems. By creating an open, competitive environment with clear rules and goals, problems can be solved.
- In October 2006, Netflix, an online DVD rental business, announced a contest to create a new predictive model for recommending movies based on past user ratings. The grand prize was $1,000,000. Netflix already had an algorithm to solve the problem but thought there was an opportunity to improve the model, which would translate into huge revenues.
- Kaggle is an Australian firm that provides an innovative solution for the outsourcing of statistical analytics. Kaggle manages competitions among the world's best data scientists for corporations, governments and research laboratories. Organizations that confront complex statistical challenges describe the problems to Kaggle and provide data sets. Kaggle converts the problems and the data into contests that are posted on its website. The contests feature cash prizes ranging in value from $100 to $3 million. Kaggle's clients range in size from tiny start-ups to multinational corporations such as Ford Motor Company, and government agencies such as NASA.
- The idea is that someone comes to Kaggle with a problem, Kaggle puts it up on its website, and then people from all over the world can compete to see who can produce the best solution. In essence, Kaggle has developed an effective global platform for crowdsourcing complex analytic problems.
- There are various types of crowdsourcing, such as crowd voting, crowd purchasing, wisdom of crowds, crowdfunding and contests.
- Examples:
99designs.com does crowdsourcing of graphic design.
Agentanything.com posts missions where agents are invited to do various jobs.
33needs.com allows people to contribute to charitable programs to make a social impact.
Inter- and Trans-Firewall Analytics
- Yesterday, companies were doing functional, silo-based analytics. Today they are doing intra-firewall analytics with data within the firewall. Tomorrow they will be collaborating on insights with other companies to do inter-firewall analytics, as well as leveraging public domain spaces to do trans-firewall analytics (Fig. 1).
- As Fig. 2 depicts, setting up inter-firewall and trans-firewall analytics can add significant value, but it presents some challenges. When information is collected outside the firewall, the noise relative to useful information increases, putting additional requirements on analytical methods and technology.
- Further, organizations are limited by a fear of collaboration and an overreliance on proprietary information. The fear of collaboration is driven by competitive fears, data privacy concerns and proprietary orientations that limit opportunities for cross-organizational learning and innovation. The transition to an inter-firewall and trans-firewall paradigm may not be easy, but the practice continues to grow and will become a key weapon for decision scientists to drive disruptive value and efficiencies.
Figure 1
Figure 2
- For many reasons, organizations find it hard to make changes after spending many years implementing a data management, BI and analytics stack. So organizations have to do a lot of research and development on new technologies before completely adopting them, in order to minimize the risk. The two core programs that have to be the focus of R&D teams are:
Program | Goal | Core Elements
Innovation management | Tap into the latent creativity of all Visa employees, providing them with a platform to demonstrate mastery and engage collaboratively with their colleagues. | Employee personal growth; employee acquisition and retention
Research and open innovation | Look outside of the company and scan the environment for trends, new technology and approaches. | Innovation; competitive advantage
Adding Big Data Technology
The process that enterprises must follow to get started with big data technology:
1. Practical approach: start with the problem and then find a solution.
2. Opportunistic approach: start with the technology and then find a home for it.
For both approaches the following activities have to be conducted:
(i) Play: R&D team members may request to install the technology in their lab to get more familiar with it.
(ii) Initial business review: talk with the business owner to validate the applicability and rank the priorities, to ensure that it is worth pursuing.
(iii) Architecture review: assess the validity of the underlying architecture and ensure that it maps to IT's standards.
(iv) Pilot use cases: find a use case to test the technology on.
(v) Transfer from R&D to production: negotiate internally regarding what it would take to move it from research to production.
Organizations may have a lot of smart people, but there are other smart people outside, and organizations need to be exposed to the value those outsiders are creating. A systematic program that formalizes relationships with this powerful ecosystem is shown in the following table.
Innovation ecosystem: leveraging brain power from outside of the organization
Source | Example
Academic community | Tap into a major university that did a major study on social network analytics.
Vendors' research arms | Leverage research a vendor completed in their labs demonstrating success leveraging unstructured data.
Research houses | Use research content to support a given hypothesis for a new endeavor.
Government agencies | Discuss fraud strategies with the intelligence community.
Venture capital organizations | Have a venture capital firm review some new trends they are tracking and investing in.
Start-ups | Invite BI and analytic technology start-ups in, instead of just sticking with the usual suspects.
UNIT-II: Introduction to NoSQL, aggregate data models, aggregates, key-value and document data models, relationships, graph databases, schemaless databases, materialized views, distribution models, sharding, master-slave replication, peer-to-peer replication, sharding and replication, consistency, relaxing consistency, version stamps, working with Cassandra: table creation, loading and reading data.