Chapter 1: Introduction
Much has been written about "Big Data" in the last couple of years, but just what is it? As now commonly used, the term Big Data refers not just to the explosive growth in data that almost all organizations are experiencing, but also to the emergence of data technologies that allow that data to be leveraged. Big Data is a holistic term used to describe the ability of any company, in any industry, to find advantage in the ever-increasing amount of data that now flows continuously into the enterprise, including the semistructured and unstructured data that was previously either ignored or too costly to deal with.
The problem is that as the world becomes more connected via technology, the amount of data flowing into companies is growing exponentially, and identifying value in that data becomes more difficult: as the data haystack grows larger, the needle becomes harder to find. So Big Data is really about finding the needles: gathering, sorting and analyzing the flood of data to find the valuable information on which sound business decisions are made.
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
Big Data has been defined in various ways by different organizations over the years. A few of these definitions are given below.
Ericsson: People, devices and things are constantly generating massive volumes of data. At work people create data, as do children at home, students at school, people and things on the move, as well as objects that are stationary. Devices and sensors attached to millions of things take measurements from their surroundings, providing up-to-date readings over the entire globe: data to be stored for later use by countless different applications.
IBM: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data. In simple words, Big Data is a set of technology advances that have made capturing and analyzing data at high scale and speed vastly more efficient.
Chapter 2: Evolution
The Rise of Big Data
From mainframes to desktop computers, and from laptops to tablet PCs, the computing world has transformed a great deal. Gone are the days when only the wealthy could afford computer systems. With the increased usage of computers, the generation of data has also increased manyfold. Today everything generates data, whether small or enormous in volume: from the Large Hadron Collider to small sensors to an ordinary smartphone.
Technologists say that the world has turned into a global village, and the credit for this goes to the Internet. The Internet has made distances shorter and the world smaller. The Internet is defined as the worldwide interconnection of individual networks operated by government, industry, academia, and private parties. Within a very few years, the Internet consolidated itself as a very powerful platform that has changed forever the way we do business and the way we communicate. It is apt to say that the size of the Internet is expanding daily: every hour, every minute, indeed every second. As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations,[10] and biological and environmental research.[11] The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
Under the old concept, a few companies generated data and all others consumed it; under the new concept, all of us generate data and all of us consume it. A lot of data comes from the Internet. With increasingly easy access to the Internet from every nook and corner of the world, data generation has rapidly increased. Mobile devices such as tablet PCs and smartphones also provide Internet access, and the number of people using them has grown greatly. The latest numbers reveal around 2,749 million Internet users across the globe, who are continuously adding to the global pool of data (source: ITU report, March 2013). The following graphs illustrate the scenario.
Figure 1 : Internet Users Worldwide
Figure 2 : Smart Phone & Tablet Users Worldwide
At the end of the day all this certainly contributes to the creation of 'Big Data'.
Chapter 3: The Need to Address Big Data
We are awash in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.
The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data". While the promise of Big Data is real (for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009), there is currently a wide gap between its potential and its realization. In the initial days of Big Data development, this large chunk of data was viewed as a liability for any organization. But in recent years, as the importance of data for every organization has grown, the approach towards Big Data has transformed radically. It is now thought of as an asset to the organization.
Heterogeneity, scale, timeliness, complexity, and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data. The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata. Much data today is not natively in structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge. The value of data explodes when it can be linked with other data, so data integration is a major creator of value. Since most data is directly generated in digital format today, we have both the opportunity and the challenge to influence its creation so as to facilitate later linkage, and to automatically link previously created data. Data analysis, organization, retrieval, and modeling are other foundational challenges. During the last 35 years, data management principles such as physical and logical independence and cost-based optimization have led to a multi-billion dollar industry. The many novel challenges and opportunities associated with Big Data necessitate rethinking many aspects of these data management platforms, while retaining other desirable aspects. I believe that appropriate investment in Big Data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of Big Data management and analysis platforms, products, and systems.
Chapter 4: Characteristics and Applications
4.1 Big Data - Characteristics
Big Data stores first started appearing around 2009 and offered several intriguing capabilities compared with relational databases, including:
1. Easy distribution across multiple "commodity" servers, with horizontal scalability.
2. Open-source technology, providing low-cost implementation with near-constant improvements.
3. Robust and proven reliability in production, with technology leaders such as Google, Amazon and Facebook ensuring high levels of availability.
4. Schema-free design with easy replication support and relatively simple APIs (application programming interfaces).
5. Support for structured, semi-structured and unstructured data.
6. Eventual consistency, meaning that, over time, data updates will eventually propagate throughout the database.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a fourth V, "Veracity", to the description.
Figure 3 : Big Data characteristics
4.2 Big Data - Applications
When big data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation, all of which can have a significant impact on the bottom line.
For example, in the delivery of healthcare services, management of chronic or long-term conditions is expensive. Use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admissions.
Big Data has widespread application in scientific and research areas. The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.999% of these streams, there are 100 collisions of interest per second. As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents an annual rate of 25 petabytes before replication (as of 2012). This becomes nearly 200 petabytes after replication.
Manufacturing companies deploy sensors in their products to return a stream of telemetry. In the automotive industry, systems such as General Motors' OnStar or Renault's R-Link deliver communications, security and navigation services. Perhaps more importantly, this telemetry also reveals usage patterns, failure rates and other opportunities for product improvement that can reduce development and assembly costs.
The proliferation of smartphones and other GPS devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers.
Retailers usually know who buys their products. Use of social media and web log files from their e-commerce sites can help them understand who didn't buy and why they chose not to, information not available to them today. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies through more accurate demand planning.
Finally, social media sites like Facebook and LinkedIn simply wouldn't exist without big data. Their business model requires a personalized experience on the web, which can only be delivered by capturing and using all the available data about a user or member.
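The Large Hadron Collider figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative and not taken from any source.

```python
# Figures quoted in the text (as of 2012); variable names are illustrative.
collisions_per_second = 600_000_000  # "nearly 600 million collisions per second"
kept_per_second = 100                # "100 collisions of interest per second"
annual_petabytes = 25                # annual rate before replication
replicated_petabytes = 200           # "nearly 200 petabytes after replication"

# Share of collisions that survive filtering, and share that is discarded.
kept_fraction = kept_per_second / collisions_per_second
discarded_percent = (1 - kept_fraction) * 100

# How many copies of the recorded data the replicated figure implies.
replication_factor = replicated_petabytes / annual_petabytes

print(f"kept fraction of collisions: {kept_fraction:.2e}")
print(f"discarded: {discarded_percent:.6f}%")
print(f"implied replication factor: {replication_factor:.0f}x")
```

The result is consistent with the text: keeping 100 collisions out of roughly 600 million means that well over 99.999% of the stream is discarded, and the replicated volume corresponds to about eight copies of the recorded data.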
Figure 4 : Big Data applications
Chapter 5: Data processed by Big Data
Data can be defined in many ways. A few simple definitions are listed below:
1. Information in raw or unorganized form (such as letters, numbers, or symbols) that refers to, or represents, conditions, ideas, or objects.
2. Distinct pieces of information, usually formatted in a special way.
3. A collection of facts, such as values or measurements.
Data can exist in a variety of forms: as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.
Data can be classified into two discrete categories, namely:
1. Structured Data: Data that resides in a fixed field within a record or file is called structured data. This includes data contained in relational databases and spreadsheets. Structured data first depends on creating a data model, that is, defining what fields of data will be stored and how that data will be stored: the data type and any restrictions on the data input. Structured data has the advantage of being easily entered, stored, queried and analyzed. Structured data is organized in a highly mechanized and manageable way. Structured data is often managed using Structured Query Language (SQL).
2. Unstructured Data: Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content. Examples include email messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. Although these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly. Unstructured data is raw and unorganized. Digging through such data can be cumbersome and costly.
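The contrast between the two categories can be made concrete with a small sketch. It assumes a hypothetical `orders` table and a made-up email message; neither comes from the original text. The structured part uses Python's built-in `sqlite3` module, while the unstructured part has to dig values out of free text with regular expressions.

```python
import re
import sqlite3

# Structured data: the schema (fields, types, restrictions) is declared
# before any row is stored, so the data can be queried directly with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 4.5)],
)
total_by_customer = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

# Unstructured data: free text has no fields, so each value must be
# extracted with patterns that only work if the text happens to match.
email_body = "Hi, Alice here. Please refund 20.00 USD for order #1. Thanks!"
amounts = [float(m) for m in re.findall(r"(\d+\.\d+)\s*USD", email_body)]
order_ids = [int(m) for m in re.findall(r"#(\d+)", email_body)]

print(total_by_customer)
print(amounts, order_ids)
```

The query side of the structured example stays the same however many rows are added, whereas the extraction patterns for the email would have to be revisited for every new way of phrasing the message, which is one reason digging through unstructured data is costly.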
Big Data generally has to do with this large collection of unstructured data, which is growing swiftly in size every day.
Chapter 6: Big Data Technologies
Dealing with Big Data, with data sets up to multiple petabytes in size (a single petabyte is a quadrillion bytes of data), requires new technologies and new approaches to efficiently process large quantities of data within tolerable elapsed times. Traditional relational database technologies, such as SQL-based systems, have proven inadequate in terms of response times when applied to very large datasets such as those found in Big Data implementations. To address this shortcoming, these Big Data implementations are leveraging new technologies that provide a framework for processing the massive data stores that define Big Data. The Big Data landscape is dominated by two classes of technology:
1. Operational systems, which provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. The focus is on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria.
2. Analytical systems, which provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. The focus is on high throughput; queries can be very complex and touch most if not all of the data in the system at any time.
Both kinds of system tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records.
Currently trending technologies:
1. Column-oriented databases
2. Schema-less / NoSQL databases
3. MapReduce
4. Hadoop
5. Hive
6. Pig
7. WibiData
8. Platfora
9. Skytree
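The difference between the operational and analytical classes described above is largely one of access pattern. The following is a minimal in-memory sketch of that difference, not a model of any particular product; the `events` records and both function names are hypothetical.

```python
# Hypothetical in-memory "table": one record per user event.
events = [
    {"id": 1, "user": "u1", "country": "IN", "bytes": 120},
    {"id": 2, "user": "u2", "country": "US", "bytes": 300},
    {"id": 3, "user": "u1", "country": "IN", "bytes": 80},
    {"id": 4, "user": "u3", "country": "DE", "bytes": 50},
]

# Operational access: highly selective. An index keyed by id lets a single
# record be fetched without looking at the rest of the data.
by_id = {event["id"]: event for event in events}


def get_event(event_id):
    """Point lookup: touches exactly one record."""
    return by_id[event_id]


# Analytical access: retrospective. The aggregate below has to scan
# every record to produce its answer.
def bytes_per_country(records):
    """Full scan: sums the bytes field per country."""
    totals = {}
    for record in records:
        totals[record["country"]] = totals.get(record["country"], 0) + record["bytes"]
    return totals


print(get_event(3))
print(bytes_per_country(events))
```

In a real cluster the lookup would be routed to the one server holding the key, while the aggregate would be split across all servers and the partial results combined, which is why the first is tuned for latency and the second for throughput.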
Chapter 7: NoSQL
7.1 NoSQL Overview
NoSQL encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored about users, objects and products, the frequency with which this data is accessed, and performance and processing needs. While the hype surrounding NoSQL (non-relational) database technology has become deafening, there is real substance beneath the often exaggerated claims. But like most things in life, the benefits come at a cost. Developers accustomed to data modelling and application development against relational database technology will need to approach things differently.
What will it achieve?
1. Scalability: The ability of an application or product to increase in size as demand warrants. Adding computers in parallel provides increased capacity, that is, horizontal scaling. The aim is to build systems that are more flexible.
2. Availability: A guarantee that every request receives a response about whether it succeeded or failed. Users want their systems (Facebook, Twitter, telecom apps, etc.) to be ready to serve them at all times. If a user cannot access the system, it is said to be unavailable.
7.2 NoSQL Implementation
How to achieve?
1. Dynamic Schemas: NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real time without worrying about service interruptions, which means that development is faster, code integration is more reliable, and less database administrator time is needed.
2. Auto-sharding: Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. NoSQL databases usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.
3. Replication: Most NoSQL databases also support automatic replication, meaning that you get high availability and disaster recovery without involving separate applications to manage these tasks. The storage environment is essentially virtualized from the developer's perspective.
4. Integrated Caching: Many NoSQL database technologies have excellent integrated caching capabilities, keeping frequently used data in system memory as much as possible and removing the need for a separate caching layer that must be maintained.
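The first three mechanisms above can be sketched together in a toy store. This is a simplified illustration under stated assumptions, not how any specific NoSQL product is implemented: the class `TinyShardedStore`, its parameters, and the modulo-based placement are all hypothetical, and real systems also rebalance data when servers are added or removed.

```python
import hashlib


class TinyShardedStore:
    """Toy schema-less store; hypothetical, not a real NoSQL client."""

    def __init__(self, num_servers=4, replicas=2):
        # Each "server" is just a dict held in memory.
        self.servers = [dict() for _ in range(num_servers)]
        self.replicas = replicas

    def _owners(self, key):
        # Auto-sharding: a stable hash of the key picks the primary server,
        # and the following servers hold the replica copies.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.servers)
        return [(first + i) % len(self.servers) for i in range(self.replicas)]

    def put(self, key, document):
        # Replication: the document is written to every owning server.
        for index in self._owners(key):
            self.servers[index][key] = dict(document)

    def get(self, key, down=()):
        # Read from the first owner that is not marked as down.
        for index in self._owners(key):
            if index not in down:
                return self.servers[index].get(key)
        return None


store = TinyShardedStore()
# Dynamic schema: the two documents have different fields, and no schema
# was declared beforehand.
store.put("user:1", {"name": "Asha"})
store.put("user:2", {"name": "Ben", "tags": ["admin"], "age": 31})

primary = store._owners("user:2")[0]
print(store.get("user:2", down=(primary,)))  # still answered by the replica
```

The caller only ever supplies a key and a document; which server holds the data, and which copy answers a read when the primary is down, is decided inside the store, which is the sense in which the storage environment is virtualized from the developer's perspective.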
7.3 NoSQL Database Types
1. Document databases: Pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
2. Graph stores: Used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
3. Key-value stores: The simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
4. Wide-column stores: Stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
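To make the four types easier to compare, the sketch below shapes the same hypothetical user record for each of them using plain Python dictionaries. The record, the field names and the layouts are illustrative assumptions that follow the descriptions above; they are not the storage formats of the products named.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Asha", "city": "Pune"}'}

# Document database: the value is a nested structure that can be queried.
document = {
    "_id": 42,
    "name": "Asha",
    "address": {"city": "Pune"},
    "follows": [7, 19],
}

# Wide-column store: values of one column are kept together across rows.
wide_column = {
    "name": {42: "Asha", 7: "Ben", 19: "Chen"},
    "city": {42: "Pune", 7: "Oslo"},
}

# Graph store: nodes plus explicit relationships between them.
graph = {
    "nodes": {42: {"name": "Asha"}, 7: {"name": "Ben"}, 19: {"name": "Chen"}},
    "edges": [(42, "FOLLOWS", 7), (42, "FOLLOWS", 19)],
}

# The key-value store returns the whole value, which the application parses.
from_key_value = json.loads(key_value["user:42"])["city"]

# "Whom does user 42 follow?" is a field read in the document model
# and a relationship filter in the graph model.
from_document = document["follows"]
from_graph = [dst for src, rel, dst in graph["edges"] if src == 42 and rel == "FOLLOWS"]

# The wide-column layout reads one column without touching full rows.
cities = wide_column["city"]

print(from_key_value, from_document, from_graph, cities)
```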
Chapter 8: Graph Database
8.1 Graph database - overview
The driving force behind most NoSQL databases is to enable you to store information in a large-scale, highly available, and optimized database, and then retrieve this information again extremely quickly. The focus of these databases is on putting data in and getting it out again in as efficient a manner as possible. Graph databases operate in a different functional space; their main aim is to enable you to perform queries based on the relationships between data items and to write applications that can analyze these relationships. A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data. A graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent elements and no index lookups are necessary.
A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.
Key attributes
A Graph contains Nodes and Relationships: A Graph -[:RECORDS_DATA_IN]-> Nodes -[:WHICH_HAVE]-> Properties. The simplest possible graph is a single Node, a record that has named values referred to as Properties. A Node could start with a single Property and grow to a few million, though that can get a little awkward. At some point it makes sense to distribute the data into multiple nodes, organized with explicit Relationships.
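The node, property and relationship vocabulary above can be sketched with two small Python classes. This is an illustrative model only: the class names, the example data and the `outgoing` helper are assumptions, and the helper scans a plain list rather than following the direct pointers that index-free adjacency implies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # A node is a record whose named values are its properties.
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    # A relationship connects two nodes, has a type, and may carry properties.
    start: Node
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


# The simplest possible graph: a single node with one property.
alice = Node({"name": "Alice"})

# Growing the graph: more nodes, organized by explicit relationships.
bob = Node({"name": "Bob"})
album = Node({"title": "Blue Train"})
relationships = [
    Relationship(alice, "KNOWS", bob, {"since": 2010}),
    Relationship(bob, "LIKES", album),
]


def outgoing(node, rel_type):
    """Return the nodes reached from `node` over relationships of one type."""
    return [r.end for r in relationships if r.start is node and r.rel_type == rel_type]


print(outgoing(alice, "KNOWS"))
print(outgoing(bob, "LIKES"))
```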
Figure 5 : Graph database structure
8.2 Graph database - components
Figure 6 : Graph node
Figure 7 : Graph properties
Figure 8 : Graph relationships
Figure 9 : Graph index
Figure 10 : Graph path
8.3 Graph database - operations
Querying a Graph Database
All graph databases provide a means to enable an application to walk through a set of connected nodes based on the relationships between these nodes. The programmatic interfaces that graph databases provide vary from vendor to vendor, ranging from the simple imperative approach through to more declarative mechanisms. In the simple imperative approach you select a node as a starting point, examine the relationships that this node has with other nodes, traverse each relevant relationship to find related nodes, and then repeat the process for each related node until you have found the data that you require or there is no more data to search. In the more declarative approach you select a starting point, specify criteria that filter the relationships to traverse and the nodes to match, and then let the database server implement its own graph-traversal algorithm to return a collection of matching nodes. If possible, adopt the declarative approach because this mechanism can help to avoid tying the structure of your code too closely to the structure of the database.
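The two approaches can be illustrated with one small sketch. The `traverse` function below contains the imperative walk described above (start at a node, follow relevant relationships, repeat), written here as a breadth-first search; a caller that only passes in criteria is using it in the declarative style. The graph data, the function name and its parameters are hypothetical and do not correspond to any vendor's API.

```python
from collections import deque

# Hypothetical adjacency data: node -> list of (relationship type, neighbour).
GRAPH = {
    "alice": [("KNOWS", "bob"), ("KNOWS", "carol")],
    "bob": [("KNOWS", "dave"), ("WORKS_WITH", "erin")],
    "carol": [("KNOWS", "dave")],
    "dave": [],
    "erin": [],
}


def traverse(start, follow, match, max_depth):
    """Breadth-first walk from `start`.

    `follow` decides which relationship types to traverse and `match`
    decides which reached nodes to return; the walk itself stays in here.
    """
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        node, depth = queue.popleft()
        if depth > 0 and match(node):
            results.append(node)
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for rel_type, neighbour in GRAPH[node]:
            if follow(rel_type) and neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return results


# Declarative-style call: only the criteria are stated by the caller.
friends_of_friends = traverse(
    "alice",
    follow=lambda rel_type: rel_type == "KNOWS",
    match=lambda node: node not in {"bob", "carol"},
    max_depth=2,
)
print(friends_of_friends)
```

Because the caller never touches `GRAPH` directly, the storage layout could change without the calling code changing, which is the decoupling the text recommends.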
Query a Graph with a Traversal
A Traversal navigates a Graph; it identifies Paths, which order Nodes. A Traversal is how you query a Graph, navigating from starting Nodes to related Nodes according to an algorithm, finding answers to questions like "what music do my friends like that I don't yet own?" or "if this power supply goes down, what web services are affected?"
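The first question above amounts to a fixed two-step path: from a person to their friends, then from each friend to the music they like, minus what the person already owns. A minimal sketch with made-up data follows; the dictionaries and the function name are assumptions for illustration.

```python
# Hypothetical data: friendship, "likes" and "owns" relationships.
FRIENDS = {"me": {"ana", "raj"}, "ana": {"me"}, "raj": {"me"}}
LIKES = {"ana": {"Kind of Blue", "Abbey Road"}, "raj": {"Abbey Road", "Thriller"}}
OWNS = {"me": {"Abbey Road"}}


def music_friends_like_that_i_lack(person):
    """Follow friend links, then likes links, and drop albums already owned."""
    liked_by_friends = set()
    for friend in FRIENDS.get(person, set()):
        liked_by_friends |= LIKES.get(friend, set())
    return liked_by_friends - OWNS.get(person, set())


suggestions = sorted(music_friends_like_that_i_lack("me"))
print(suggestions)
```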
Figure 11 : Graph traversal
8.4 Graph database - examples
A.
Figure 12 : Graph DB example-1
B.
Twitter and its relationships
Figure 13 : Graph DB example-2
CONCLUSION
We have entered an era of Big Data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. We must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of Big Data.
Bibliography
1. www.couchbase.com
2. www.datastax.com
3. www.forbes.com/sites/davefeinleib/2012/07/09/the-3-is-of-bigdata/
4. www.slideshare.net/bigdatalandscape/big-data-trends
5. www.kunocreative.com/blog/bid/76907/Big-Data-Made-SimpleWhat-Marketers-Need-to-Know
6. http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
7. http://en.wikipedia.org/wiki/Big_data
8. http://msdn.microsoft.com/en-us/library/dn313282.aspx
9. http://www.neo4j.org/learn/graphdatabase
10. http://www.slideshare.net/maxdemarzi/introduction-to-graphdatabases-12735789