
Unit-I : INTRODUCTION TO BIG DATA

Big Data
According to Gartner, the definition of Big Data is:
"Big data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making."
Big Data refers to complex and large data sets that have to be processed
and analysed to uncover valuable information that can benefit businesses
and organizations. However, there are certain basic tenets of Big Data
that will make it even simpler:
• Big Data refers to a massive amount of data that keeps on growing
exponentially with time.
• Big Data is so voluminous that it cannot be processed or analyzed
using conventional data processing techniques.
• Working with Big Data involves data mining, data storage, data analysis,
data sharing, and data visualization.
• The term is an all-comprehensive one, including the data itself, data frameworks,
and the tools and techniques used to process and analyze the data.

Evolution of Big Data


The evolution of Big Data

The 1970s and before was the era of mainframes. The data was essentially
primitive and structured. Relational databases evolved in the 1980s and
1990s; this was the era of data-intensive applications. The World Wide Web
(WWW) and the Internet of Things (IoT) have since led to an onslaught of
structured, unstructured, and multimedia data.

The History of Big Data


Although the concept of big data itself is relatively new, the origins of
large data sets go back to the 1960s and '70s when the world of data was
just getting started with the first data centres and the development of the
relational database.
Three Phases of Big Data

Around 2005, people began to realize just how much data users generated
through Facebook, YouTube, and other online services. Hadoop (an
open-source framework created specifically to store and analyze big data
sets) was developed that same year. NoSQL also began to gain popularity
during this time.
The development of open-source frameworks,
such as Hadoop (and more recently, Spark) was essential for the growth of
big data because they make big data easier to work with and cheaper to
store. In the years since then, the volume of big data has skyrocketed.
Users are still generating huge amounts of data—but it’s not just humans
who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices
are connected to the internet, gathering data on customer usage patterns
and product performance. The emergence of machine learning has
produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud
computing has expanded big data possibilities even further. The cloud
offers truly elastic scalability, where developers can simply spin up ad hoc
clusters to test a subset of data.
Benefits of Big Data and Data Analytics
Big data makes it possible for you to gain more complete answers because
you have more information.
More complete answers mean more confidence in the data, which in turn
means a completely different approach to tackling problems.

Types of Big Data


a) Structured: Structured data is one of the types of big data. By
structured data, we mean data that can be processed, stored, and
retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a database by
simple search engine algorithms. For instance, the employee table in
a company database will be structured, as the employee details,
their job positions, their salaries, etc., will be present in an organized
manner.
b) Unstructured: Unstructured data refers to the data that lacks any
specific form or structure whatsoever. This makes it very difficult and
time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two
important types of big data.
c) Semi-structured: Semi-structured is the third type of big data. Semi-
structured data contains both of the formats mentioned above, that is,
structured and unstructured data. To be precise, it refers to data that,
although not classified under a particular repository (database), still
contains vital information or tags that segregate individual elements
within the data. This brings us to the end of the types of big data.

Characteristics of Data

Data has three key characteristics:
1. Composition: The composition of data deals with the structure of data,
that is, the sources of data, the granularity, the types, and the nature of
data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is,
"Can one use this data as is for analysis?" or "Does it require cleansing
for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been
generated?", "Why was this data generated?" and "How sensitive is this data?"

Definition of Big Data


• Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage
capacity of, and too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure
needed to support storage, processing and analysis.
• It is data that is big in volume, velocity and variety.
In 2001, Gartner analyst Doug Laney listed the 3 'V's of Big Data – Variety,
Velocity, and Volume. Let us look at them in depth:
a) Variety: Variety of Big Data refers to structured, unstructured, and
semi-structured data that is gathered from multiple sources. While in the
past, data could only be collected from spreadsheets and databases, today
data comes in an array of forms such as emails, PDFs, photos, videos,
audios, and so much more. Variety is one of the important characteristics
of big data.
b) Velocity: Velocity essentially refers to the speed at which data is
being created in real time. In a broader perspective, it comprises the rate
of change, linking of incoming data sets at varying speeds, and activity
bursts.
c) Volume: Big Data indicates huge ‘volumes’ of data that is being
generated on a daily basis from various sources like social media
platforms, business processes, machines, networks, human interactions,
etc. Such a large amount of data is stored in data warehouses.

Part I of the definition: "Big data is high-volume, high-velocity, and


high-variety information assets" talks about voluminous data (humongous
data) that may have great variety (a good mix of structured, semi-
structured and unstructured data) and will require a good speed/pace for
storage, preparation, processing and analysis.
Part II of the definition: "cost effective, innovative forms of
information processing" talks about embracing new techniques and
technologies to capture (ingest), store, process, persist, integrate and
visualize the high volume, high-velocity, and high-variety data.
Part III of the definition: "enhanced insight and decision making" talks
about deriving deeper, richer and meaningful insights and then using
these insights to make faster and better decisions to gain business value
and thus a competitive edge.
Data —> Information —> Actionable intelligence —> Better decisions —
>Enhanced business value

More characteristics of big data


Looking beyond the original three V's, here are details on some of the
other ones that are now often associated with big data:
• Veracity refers to the degree of accuracy in data sets and how
trustworthy they are. Raw data collected from various sources can
cause data quality issues that may be difficult to pinpoint. If they aren't
fixed through data cleansing processes, bad data leads to analysis errors
that can undermine the value of business analytics initiatives. Data
management and analytics teams also need to ensure that they have
enough accurate data available to produce valid results.
• Some data scientists and consultants also add value to the list of big
data's characteristics. Not all the data that's collected has real business
value or benefits. As a result, organizations need to confirm that data
relates to relevant business issues before it's used in big data analytics
projects.
• Variability also often applies to sets of big data, which may have
multiple meanings or be formatted differently in separate data sources
-- factors that further complicate big data management and analytics.

Challenges with Big Data


Data volume: Data today is growing at an exponential rate, and this high
tide of data will continue to rise. The key questions are: "Will all this
data be useful for analysis?",
"Do we work with all this data or a subset of it?",
"How will we separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for
big data as far as cost-efficiency, elasticity and easy upgrading /
downgrading are concerned. However, concerns around security and data
ownership further complicate the decision to host big data solutions
outside the enterprise.

Data retention: How long should one retain this data? Some data may be
required for long-term decisions, while other data may quickly become
irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those
applications that generate insights, organizations need professionals who
possess a high-level proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to
capture, storage, search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond
the storage capacity of traditional database software tools. There is no
explicit definition of how big the data set should be for it to be considered
bigdata. Data visualization(computer graphics) is becoming popular as a
separate discipline. There are very few data visualization experts.

Why is Big Data Important?


The importance of big data does not revolve around how much data a
company has but how a company utilizes the collected data. Every
company uses data in its own way; the more efficiently a company uses
its data, the more potential it has to grow. The company can take data
from any source and analyze it to find answers which will enable:
1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based
Analytics can bring cost advantages to business when large amounts of
data are to be stored and these tools also help in identifying more
efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory
analytics can easily identify new sources of data, which helps
businesses analyze data immediately and make quick decisions
based on the learning.
3. Understand the market conditions: By analyzing big data you can
get a better understanding of current market conditions. For example,
by analyzing customers’ purchasing behaviors, a company can find out
the products that are sold the most and produce products according to
this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis.
Therefore, you can get feedback about who is saying what about your
company. If you want to monitor and improve the online presence of
your business, then, big data tools can help in all this.
5. Using Big Data Analytics to Boost Customer Acquisition and
Retention: The customer is the most important asset any business
depends on. There is no single business that can claim success without
first having to establish a solid customer base. However, even with a
customer base, a business cannot afford to disregard the high
competition it faces. If a business is slow to learn what customers are
looking for, then it is very easy to begin offering poor quality products.
In the end, loss of clientele will result, and this creates an adverse
overall effect on business success. The use of big data allows
businesses to observe various customer related patterns and trends.
Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers Problem and
Offer Marketing Insights: Big data analytics can help change all
business operations. This includes the ability to match customer
expectation, changing company’s product line and of course ensuring
that the marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product
Development: Another huge advantage of big data is the ability to
help companies innovate and redevelop their products.

Business Intelligence vs Big Data


Although Big Data and Business Intelligence are two technologies used to
analyze data to help companies in the decision-making process, there are
differences between both of them. They differ in the way they work as
much as in the type of data they analyze.
Traditional BI methodology is based on the principle of grouping all
business data into a central server. Typically, this data is analyzed in
offline mode, after storing the information in an environment called Data
Warehouse. The data is structured in a conventional relational database
with an additional set of indexes and forms of access to the tables
(multidimensional cubes).
A Big Data solution differs from a BI solution in many aspects:
1. In a Big Data environment, information is stored on a distributed file
system, rather than on a central server. It is a much safer and more
flexible space.
2. Big Data solutions carry the processing functions to the data, rather
than the data to the functions. As the analysis is centered on the
information, it's easier to handle larger amounts of information in a
more agile way.
3. Big Data can analyze data in different formats, both structured and
unstructured. The volume of unstructured data (those not stored in a
traditional database) is growing at levels much higher than the
structured data. Nevertheless, its analysis carries different challenges.
Big Data solutions solve them by allowing a global analysis of various
sources of information.
4. Data processed by Big Data solutions can be historical or come from
real-time sources. Thus, companies can make decisions that affect
their business in an agile and efficient way.
5. Big Data technology uses massively parallel processing (MPP) concepts,
which improves the speed of analysis. With MPP, many instructions are
executed simultaneously, and since the various jobs are divided into
several parallel execution parts, at the end the overall results are
reunited and presented. This allows you to analyze large volumes of
information quickly.

Comparison of Business Intelligence and Big Data

Purpose
  Business Intelligence: The purpose of Business Intelligence is to help
  the business to make better decisions. Business Intelligence helps in
  delivering accurate reports by extracting information directly from the
  data source.
  Big Data: The main purpose of Big Data is to capture, process, and
  analyze data, both structured and unstructured, to improve customer
  outcomes.

Ecosystem / Components
  Business Intelligence: Operation systems, ERP databases, Data
  Warehouse, Dashboards, etc.
  Big Data: Hadoop, Spark, R Server, Hive, HDFS, etc.

Tools
  Business Intelligence: These tools enable a business to collate, analyze
  and visualize data, which can be used in making better business
  decisions and to come up with good strategic plans. They include:
  • Online analytical processing (OLAP)
  • Data Warehousing
  • Digital Dashboards & Data mining
  • Microsoft Power BI
  • Google Analytics, etc.
  Big Data: These tools or frameworks store a large amount of data and
  process it to get insights from the data to make good decisions for the
  business. They include:
  • Hadoop
  • Spark
  • Hive
  • Polybase
  • Presto
  • Storm, etc.

Characteristics / Properties
  Business Intelligence: The six features of Business Intelligence are
  Location intelligence, Executive Dashboards, "what if" analysis,
  Interactive reports, Metadata layer, and Ranking reports.
  Big Data: Big data can be described by characteristics such as Volume,
  Variety, Variability, Velocity, and Veracity.

Benefits
  Business Intelligence:
  • Helps in making better business decisions
  • Faster and more accurate reporting and analysis
  • Improved data quality
  • Reduced costs
  • Increased revenues
  • Improved operational efficiency, etc.
  Big Data:
  • Better decision making
  • Fraud detection
  • Storage, mining, and analysis of data
  • Market prediction & forecasting
  • Improved service
  • Helps in implementing new strategies
  • Keeps up with customer trends
  • Cost savings
  • Better sales insights, which helps in increasing revenues, etc.

Applied Fields
  Business Intelligence: Social media, Healthcare, Gaming industry, Food
  industry, etc.
  Big Data: The banking sector, Entertainment and social media,
  Healthcare, Retail and wholesale, etc.
Unit-II : BIG DATA ANALYTICS
Big Data is creating significant new opportunities for organizations to
derive new value and create competitive advantage from their most
valuable asset: information. For businesses, Big Data helps drive
efficiency, quality, and personalized products and services, producing
improved levels of customer satisfaction and profit. For scientific efforts,
Big Data analytics enable new avenues of investigation with potentially
richer results and deeper insights than previously available. In many
cases, Big Data analytics integrate structured and unstructured data with
Real-time feeds and queries, opening new paths to innovation and insight.

Introduction to Big Data Analytics


Big Data Analytics is...
1. Technology-enabled analytics: Quite a few data analytics and
visualization tools are available in the market today from leading
vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World
Programming Systems (WPS), etc. to help process and analyze your big
data.
2. About gaining a meaningful, deeper, and richer insight into your
business to steer it in the right direction: understanding the customers'
demographics to cross-sell and up-sell to them, better leveraging the
services of your vendors and suppliers, etc.
3. About a competitive edge over your competitors by enabling you with
findings that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and
data analysts.
5. Working with datasets whose volume and variety exceed the current
storage and processing capabilities and infrastructure of your
enterprise.
6. About moving code to data. This makes perfect sense as the program
for distributed processing is tiny (just a few KBs) compared to the data
(Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in
the near future).

Examples of Big Data Analytics


There are three examples of Big Data Analytics in different areas: retail, IT
infrastructure, and social media.
1. Retail: As mentioned earlier, Big Data presents many opportunities to
improve sales and marketing analytics.
An example of this is the U.S. retailer Target. After analyzing consumer
purchasing behavior, Target's statisticians determined that the retailer
made a great deal of money from three main life-event situations:
• When people tend to buy many new products.
• When people buy new products and change their spending habits.
• When people have many new things to buy and have an urgency to
buy them.
The analysis helped Target to manage its inventory, knowing that there
would be demand for specific products, which would likely vary month
over month across the coming nine- to ten-month cycles.
2. IT infrastructure: MapReduce paradigm is an ideal technical
framework for many Big Data projects, which rely on large data sets with
unusual data structures. One of the main benefits of Hadoop is that it
employs a distributed file system, meaning it can use a distributed cluster
of servers and commodity hardware to process large amounts of data.
Some of the most common examples of Hadoop implementations are in
the social media space, where Hadoop can manage transactions, give
textual updates, and develop social graphs among millions of users.
Twitter and Facebook generate massive amounts of unstructured data
and use Hadoop and its ecosystem of tools to manage this high volume.
3. Social media: It represents a tremendous opportunity to leverage
social and professional interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early
on, LinkedIn founder Reid Hoffman saw the opportunity to create a social
network for working professionals.
LinkedIn has more than 250 million user accounts and has added many
additional features and data-related products, such as recruiting, job
seeker tools, advertising, and InMaps, which show a social graph of a
user's professional network.

Classification of Analytics
There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced and
monetized analytics.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and
analytics 3.0.

First School of Thought


It includes Basic analytics, Operationalized analytics, Advanced analytics
and Monetized analytics.
Basic analytics: This primarily is slicing and dicing of data to help with
basic business insights. This is about reporting on historical data, basic
visualization, etc.
Operationalized analytics: It is operationalized analytics if it gets
woven into the enterprise's business processes.
Advanced analytics: This largely is about forecasting for the future by
way of predictive and prescriptive modelling.
Monetized analytics: This is analytics in use to derive direct business
revenue.

Second School of Thought:


Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0
(refer Table 2.1). Figure 2.1 shows the subtle growth of analytics from
Descriptive → Diagnostic → Predictive → Prescriptive analytics.
Analytics 1.0
  Era: Mid 1990s to 2009.
  Descriptive statistics (report on events, occurrences, etc. of the past).
  Key questions asked: What happened? Why did it happen?
  Data from legacy systems, ERP, CRM and 3rd party applications.
  Small and structured data sources. Data stored in enterprise data
  warehouses or data marts.
  Data was internally sourced.
  Relational databases.

Analytics 2.0
  Era: 2005 to 2012.
  Descriptive statistics + predictive statistics (use data from the past to
  make predictions for the future).
  Key questions asked: What happened? Why will it happen?
  Big Data.
  Big data is being taken up seriously. Data is mainly unstructured,
  arriving at a much higher pace. This fast flow of data entailed that the
  influx of big volume data had to be stored and processed rapidly, often
  on massively parallel servers running Hadoop.
  Data was often externally sourced.
  Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.

Analytics 3.0
  Era: 2012 to present.
  Descriptive + predictive + prescriptive statistics (use data from the past
  to make prophecies for the future and at the same time make
  recommendations to leverage the situation to one's advantage).
  Key questions asked: What will happen? When will it happen? What
  should be the action taken to take advantage of what will happen?
  A blend of Big Data and data from legacy systems, ERP, CRM and 3rd
  party applications.
  A blend of Big Data and traditional analytics to yield insights and
  offerings with speed and impact.
  Data is both internally and externally sourced.
  In-memory analytics, in-database processing, agile analytical methods,
  machine learning techniques, etc.

Challenges of Big Data Analytics


There are mainly seven challenges of big data: scale, security, schema,
continuous availability, consistency, partition tolerance and data quality.
Scale: Storage (RDBMS (Relational Database Management System) or
NoSQL (Not only SQL)) is one major concern that needs to be addressed to
handle the need for scaling rapidly and elastically. The need of the hour is
a storage that can best withstand the attack of large volume, velocity and
variety of big data. Should you scale vertically or should you scale
horizontally?
Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization
mechanisms) when it comes to safeguarding big data. This cannot be
ignored given that big data carries credit card information, personal
information and other sensitive data.
Schema: Rigid schemas have no place. We want the technology to be
able to fit our big data and not the other way around. The need of the
hour is dynamic schema; static (pre-defined) schemas are obsolete.
Continuous availability: The big question here is how to provide 24/7
support because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
Consistency: Should one opt for consistency or eventual consistency?
Partition tolerance: How to build partition-tolerant systems that can take
care of both hardware and software failures?
Data quality: How to maintain data quality- data accuracy,
completeness, timeliness, etc.? Do we have appropriate metadata in
place?

Importance of Big Data Analytics


Let us study the various approaches to analysis of data and what it leads
to.
Reactive-Business Intelligence: What does Business Intelligence (BI)
help us with? It allows the businesses to make faster and better decisions
by providing the right information to the right person at the right time in
the right format. It is about analysis of the past or historical data and
then displaying the findings of the analysis or reports in the form of
enterprise dashboards, alerts, notifications, etc. It has support for both
pre-specified reports as well as ad hoc querying.
Reactive - Big Data Analytics: Here the analysis is done on huge
datasets but the approach is still reactive as it is still based on static data.
Proactive - Analytics: This is to support futuristic decision making by the
use of data mining, predictive modelling, text mining, and statistical
analysis. This analysis is not on big data; it still uses traditional
database management practices and therefore has severe limitations
on storage capacity and processing capability.
Proactive - Big Data Analytics: This is filtering through terabytes,
petabytes, exabytes of information to filter out the relevant data to
analyze. This also includes high performance analytics to gain rapid
insights from big data and the ability to solve complex problems using
more data.

Big Data Technologies


Big Data technology is primarily classified into the following two types:
Operational Big Data Technologies
This type of big data technology mainly includes the basic day-to-day data
that people used to process. Typically, the operational-big data includes
daily basis data such as online transactions, social media platforms, and
the data from any particular organization or a firm, which is usually
needed for analysis using the software based on big data technologies.
The data can also be referred to as raw data used as the input for several
Analytical Big Data Technologies.
Some specific examples that include the Operational Big Data
Technologies can be listed as below:
o Online ticket booking system, e.g., buses, trains, flights, and movies,
etc.
o Online trading or shopping from e-commerce websites like Amazon,
Flipkart, Walmart, etc.
o Online data on social media sites, such as Facebook, Instagram,
Whatsapp, etc.
o The employees' data or executives' particulars in multinational
companies.

Analytical Big Data Technologies


Analytical Big Data is commonly referred to as an improved version of Big
Data Technologies. This type of big data technology is a bit complicated
when compared with operational-big data. Analytical big data is mainly
used when performance criteria are in use, and important real-time
business decisions are made based on reports created by analyzing
operational-real data. This means that the actual investigation of big data
that is important for business decisions falls under this type of big data
technology.
Some common examples that involve the Analytical Big Data
Technologies can be listed as below:
o Stock marketing data
o Weather forecasting data and the time series analysis
o Medical health records where doctors can personally monitor the
health status of an individual
o Carrying out the space mission databases where every information of a
mission is very important

Top Big Data Technologies


We can categorize the leading big data technologies into the following
four sections:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data
Storage:

Hadoop: When it comes to handling big data, Hadoop is one of the


leading technologies. Also, it is capable enough to process tasks in
batches. The Hadoop framework was mainly introduced to store and
process data in a distributed data processing environment. The Apache
Software Foundation introduced Hadoop which is written in Java
programming language.

MongoDB: MongoDB is another important component of big data


technologies in terms of storage. Relational and RDBMS properties do not
apply to MongoDB because it is a NoSQL database. MongoDB uses
schema-free, document-oriented storage, which enables it to hold massive
amounts of data. It is based on a simple cross-platform document-
oriented design. MongoDB Inc. introduced MongoDB, which is written with
a combination of C++, Python, JavaScript, and Go.

RainStor: RainStor is a popular database management system designed


to manage and analyze organizations' Big Data requirements. It uses
deduplication strategies that help manage storing and handling vast
amounts of data. RainStor was designed in 2004 by a RainStor Software
Company. It operates just like SQL.

Hunk: Hunk is mainly helpful when data needs to be accessed in remote


Hadoop clusters. Hunk allows us to report and visualize vast amounts of
data from Hadoop and NoSQL data sources. Hunk was introduced in 2013
by Splunk Inc. It is based on the Java programming language.

Cassandra: Cassandra is one of the leading big data technologies among


the list of top NoSQL databases. It is open-source, distributed and has
extensive column storage options. Cassandra was originally developed at
Facebook in 2008 for its inbox search feature and is now maintained by
the Apache Software Foundation. It is based on the Java programming language.

Data Mining
Let us now discuss leading Big Data Technologies that come under Data
Mining:
Presto: Presto is an open-source and a distributed SQL query engine
developed to run interactive analytical queries against huge-sized data
sources. The size of data sources can vary from gigabytes to petabytes.
Presto was developed at Facebook and open-sourced in 2013.
Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology.

RapidMiner: RapidMiner is defined as the data science software that


offers us a very robust and powerful graphical user interface to create,
deliver, manage, and maintain predictive analytics. RapidMiner was
developed in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer
at the Technical University of Dortmund's AI unit. It was initially named
YALE (Yet Another Learning Environment). Companies that are making
good use of the RapidMiner tool are Boston Consulting Group, InFocus,
Domino's.

ElasticSearch: When it comes to finding information, elasticsearch is


known as an essential tool. It provides a purely distributed search engine
which is completely text-based. ElasticSearch is primarily written in a Java
programming language and was developed in 2010 by Shay Banon. Now,
it has been handled by Elastic NV since 2012. ElasticSearch is used by
many top companies, such as LinkedIn, Netflix, Facebook, Google,
Accenture, StackOverflow, etc.

Data Analytics
Now, let us discuss leading Big Data Technologies that come under Data
Analytics:
Apache Kafka: Apache Kafka is a popular streaming platform. This
streaming platform is primarily known for its three core capabilities:
publishing and subscribing to streams of records, storing them durably,
and processing them as they arrive. It is written in Java and Scala and
was developed by the Apache software community in 2011. Some top
companies using the Apache Kafka platform include Twitter, Spotify,
Netflix, Yahoo, LinkedIn etc.

Splunk: Splunk is known as one of the popular software platforms for


capturing, correlating, and indexing real-time streaming data. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and
dashboards, etc., using related data. Splunk Inc. introduced Splunk in the
year 2014. It is written in combination with AJAX, Python, C++ and XML.
Companies such as Trustwave, QRadar are making good use of Splunk for
their analytical and security needs.

KNIME: KNIME is used to draw visual data flows, execute specific steps
and analyze the obtained models, results, and interactive views. It also
allows us to execute all the analysis steps altogether. It consists of an
extension mechanism that can add more plugins, giving additional
features and functionalities. KNIME is based on Eclipse and written in a
Java programming language. It was developed in 2008 by KNIME
Company. A list of companies that are making use of KNIME includes
Harnham, Tyler, and Paloalto.

Spark: Apache Spark is known for offering In-memory computing


capabilities that help enhance the overall speed of the operational
process. It also provides a generalized execution model to support more
applications. Spark is written using Java, Scala, Python and R language.
It was originally developed at UC Berkeley's AMPLab in 2009 and is now
an Apache Software Foundation project. Companies like
Amazon, ORACLE, CISCO, VerizonWireless and Hortonworks are using this
big data technology and making good use of it.

R-Language: R is defined as the programming language, mainly used in


statistical computing and graphics. It is a free software environment used
by leading data miners, practitioners and statisticians. Language is
primarily beneficial in the development of statistical-based software and
data analytics.
R-language version 1.0.0 was released in February 2000. It is written
primarily in C and Fortran. Companies like Barclays, American Express, and Bank of America
use R-Language for their data analytics needs.
Blockchain: Blockchain is a technology that can be used in several
applications related to different industries, such as finance, supply chain,
manufacturing, etc. Additionally, it is also used to fulfill the needs of
shared ledger, smart contract, privacy, and consensus in any Business
Network Environment.
Blockchain technology was first introduced in 1991 by two researchers,
Stuart Haber and W. Scott Stornetta. However, blockchain has its first
real-world application in Jan 2009 when Bitcoin was launched. It is a
specific type of database based on Python, C++, and JavaScript. ORACLE,
Facebook, and MetLife are a few of those top companies using Blockchain
technology.

Data Visualization
Let us discuss leading Big Data Technologies that come under Data
Visualization:
Tableau: Tableau is one of the fastest and most powerful data
visualization tools used by leading business intelligence industries.
Tableau helps in creating the visualizations and insights in the form of
dashboards and worksheets.
Tableau is developed and maintained by Tableau Software. It
was introduced in May 2013. It is written using multiple languages, such
as Python, C, C++, and Java. Some of the top companies using Tableau
are Cognos, QlikQ, and ORACLE Hyperion.

Plotly: As the name suggests, Plotly is best suited for plotting or creating
graphs and relevant components at a faster speed in an efficient way.
It helps create interactive, styled graphs within tools such as Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by Plotly company. It is based on
JavaScript. Paladins and Bitbank are some of those companies that are
making good use of Plotly.
Unit-III: INTRODUCTION TO R & GETTING STARTED
WITH R
INTRODUCTION
Statistical computing and high-scale data analysis tasks needed a new
category of computer language besides the existing procedural and
object-oriented programming languages, which would support these tasks
instead of developing new software. There is plenty of data available
today which can be analysed in different ways to provide a wide range of
useful insights for multiple operations in various industries. Problems such
as the lack of support, tools and techniques for varied data analysis have
been solved with the introduction of one such language called R.

What is R?
R is a scripting or programming language which provides an environment
for statistical computing, data science and graphics. It was inspired by,
and is mostly compatible with, the statistical language S developed at Bell
laboratory (formerly AT & T, now Lucent technologies). Although there
are some very important differences between R and S, much of the code
written for S runs unaltered on R. R has become so popular that it is used
as the single most important tool for computational statistics,
visualisation and data science.

Why R?
R has opened tremendous scope for statistical computing and data
analysis. It provides techniques for various statistical analyses like
classical tests and classification, time-series analysis, clustering, linear
and non-linear modelling and graphical operations. The techniques
supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary
solution and is not readily available to developers. In contrast, R is
available freely. Hence, it helps the developer community in research and
development.
Another reason behind the popularity and widespread use of R is its
superior support
for graphics. It can provide well-developed and high-quality plots from
data analysis. The plots can contain mathematical formulae and symbols,
if necessary, and users have full control over the selection and use of
symbols in the graphics. Hence, other than robustness, user-experience
and user-friendliness are two key aspects of R.

The following points describe why R language should be used:


• If you need to run statistical calculations in your application, learn and
deploy R. It easily integrates with programming languages such as Java,
C++, Python and Ruby.
• If you wish to perform a quick analysis for making sense of data.
• If you are working on an optimisation problem.
• If you need to use re-usable libraries to solve a complex problem,
leverage the 2000+ free libraries provided by R.
• If you wish to create compelling charts.
• If you aspire to be a Data Scientist.
• If you want to have fun with statistics.
• R is free. It is available under the terms of the Free Software
Foundation's GNU General Public License in source code form.
• It is available for Windows, Mac and a wide variety of Unix platforms
(including FreeBSD, Linux, etc.).
• In addition to enabling statistical operations, it is a general-purpose
programming language, so you can automate your analyses and
create new functions.
• R has excellent tools for creating graphics such as bar charts, scatter
plots, multipanel lattice charts, etc.
• It has an object-oriented and functional programming structure along
with support from a robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in
various formats, manipulate it (transform, merge, aggregate, etc.), and
subject it to traditional and modern statistical models (such as
regression, ANOVA, tree models, etc.).
• R can be extended easily via packages. It relates easily to other
programming languages.
• Existing software as well as emerging software can be integrated with
R packages to make them more productive.
• R can easily import data from MS Excel, MS Access, MySQL, SQLite,
Oracle etc. It can easily connect to databases using ODBC (Open
Database Connectivity Protocol) and the ROracle package.
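As an illustration of the last point, here is a minimal sketch of connecting to a
database from R. It assumes the DBI and RSQLite packages have been installed
(for example with install.packages()); the table name and query below are purely
illustrative.

library(DBI)       # generic database interface
library(RSQLite)   # SQLite driver, used here so the example is self-contained

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # open an in-memory database
dbWriteTable(con, "mtcars_tbl", mtcars)           # store a built-in data frame as a table

# Query the table with ordinary SQL and get a data frame back
heavy_cars <- dbGetQuery(con, "SELECT mpg, wt FROM mtcars_tbl WHERE wt > 3.5")
head(heavy_cars)

dbDisconnect(con)                                  # close the connection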

Advantages of R Over Other Programming Languages


Advanced programming languages like Python also support statistical
computing and data visualisation along with traditional computer
programming. However, R wins the race over Python and similar
languages because of the following two advantages:
1. Python needs third-party extensions and support for data visualisation
and statistical computing, whereas R does not require such support to
the same extent. For example, R provides the built-in lm function for
linear regression analysis. In R, data can simply be passed to the
function, and it returns an object with detailed information about the
regression, including the coefficients, standard errors, residual values
and so on. To achieve the equivalent in a Python environment, the
functionality has to be assembled from third-party libraries such as
SciPy, NumPy and so on. Hence, R can do the same thing with a single
line of code instead of taking support from third-party libraries (see the
short lm() sketch after this list).

2. R has the fundamental data type, i.e., a vector that can be organised
and aggregated in different ways even though the core is the same.
Vector data type imposes some limitations on the language as this is a
rigid type. However, it gives a strong logical base to R. Based on the
vector data type, R uses the concept of data frames, which are like a
matrix with attributes and an internal data structure similar to a
spreadsheet or relational database table. Hence, R follows a column-wise
data structure based on the aggregation of vectors.
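As a minimal sketch of the point made in item 1, the following fits a linear
regression with a single call to lm(), using the built-in mtcars data set (which is
itself a data frame of the kind described in item 2); the object name fit is only
illustrative.

# Fit a linear regression of fuel efficiency (mpg) on car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)        # intercept and slope
summary(fit)     # coefficients, standard errors, residuals, R-squared, etc.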

There are also some disadvantages of R. For example, R cannot scale


efficiently for larger data sets. Hence, the use of R is limited to
prototyping and sandboxing. It is rarely used for enterprise-level
solutions. By default, R uses a single-thread execution approach while
working on data stored in the RAM which leads to scalability issues as
well. Developers from open source communities are working hard on
these issues to make R capable of multi-threading execution and
parallelisation. This will help R to utilise more than one core processor.
There are big data extensions from companies like Revolution R and the
issues are expected to be resolved soon. Other languages like S-PLUS can
store objects permanently on disk, hence supporting better memory
management and analysis of high volumes of massive datasets.

Data types in R
R is a programming language. Like other programming languages, R also
makes use
of variables to store varied information. This means that when variables
are created, locations are reserved in the computer’s memory to hold the
related values. The number of locations or size of memory reserved is
determined by the data type of the variables. Data type essentially means
the kind of value which can be stored, such as boolean, numbers,
characters, etc. In R, however, variables are not declared as data types.
Variables in R are used to store some R objects and the data type of the R
object becomes the data type of the variable. The most popular (based on
usage) R objects are:
• Vector
• List
• Matrix
• Array
• Factor
• Data Frames
A vector is the simplest of all R objects. It has varied data types. The
most commonly used data types are listed as follows:
• Logical
• Numeric
  o Integer
• Character
• Double
class() function can be used to reveal the data type.

Logical
TRUE / T and FALSE / F are logical values.
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"

Numeric
> 2
[1] 2
> class(2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"
Integer
Integer data type is a sub class of numeric data type. Notice the use of
“L” as a suffix to a numeric value in order for it to be considered an
“integer”.
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as is.numeric() and is.integer() can be used to test the
data type.
> is.numeric(2)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.numeric(2L)
[1] TRUE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.

Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
The is.character() function can be used to ascertain if a value is a character.
> is.character("Data Science")
[1] TRUE

double (for double precision floating point numbers)


By default, numbers are of “double” type unless explicitly mentioned with
an L suffixed to the number for it to be considered an integer.
> typeof (76.25)
[1] "double"
Variables and ls() Function
R, like any other programming language, uses variables to store
information. Let us start by creating a variable “RectangleHeight” and
assign the value 2 to it. Note the use of the operator “<-” to assign a
value to the variable.
Likewise, the variable “RectangleWidth” is defined and assigned the value
4. The area of the rectangle is computed using the formula
“RectangleHeight * RectangleWidth”. The computed value for the area of
the rectangle is stored in the variable “RectangleArea”.
RectangleHeight <- 2
RectangleWidth <- 4
RectangleArea <- RectangleHeight * RectangleWidth
RectangleHeight
[1] 2
RectangleWidth
[1] 4
RectangleArea
[1] 8
Note: When a value is assigned to a variable, it does not display anything
on the console. To get the value, type the name of the variable at the
prompt.
Use the ls() function to list all the objects in the working environment.
> ls()
[1] "RectangleArea" "RectangleHeight" "RectangleWidth"
ls() is also useful to clean the environment before running a code.
Execute the rm() function as shown to clean up the environment.
> rm(list=ls())
> ls()
character(0)

Variables
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50   or   > Var = 50
(ii) Print the value in the variable, ‘Var’.
> Var
[1] 50
(iii) Perform arithmetic operations on the variable, ‘Var’.
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a
different data type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- "R is a Statistical Programming Language"
Print the value in the variable, 'Var'.
> Var
[1] "R is a Statistical Programming Language"
(v) Reassign a logical value to the variable, ‘Var’.
> Var <- TRUE
> Var
[1] TRUE
Functions
In this section we will try out a few functions such as sum(), min(), max()
and seq().

sum() function
sum() function returns the sum of all the values in its arguments.
Syntax: sum(..., na.rm = FALSE)
where … implies numeric or complex or logical vectors.
na.rm accepts a logical value: should missing values (including NaN (Not
a Number)) be removed?
Example: Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()
> sum(1, 2, 3)
[1] 6
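The na.rm argument matters when some of the arguments are missing (NA). A
quick illustration (the values here are arbitrary):
> sum(1, 2, NA)
[1] NA
> sum(1, 2, NA, na.rm = TRUE)
[1] 3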

min() function
min() function returns the minimum of all the values present in their
arguments.
Syntax: min(…, na.rm=FALSE)
where … implies numeric or character arguments and na.rm accepts a
logical value.
Example: > min(1, 2, 3)
[1] 1

max() function
max() function returns the maximum of all the values present in their
arguments.
Syntax: max(…, na.rm=FALSE)
where … implies numeric or character arguments and na.rm accepts a
logical value.
Example: > max(44, 78, 66)
[1] 78

seq() function
seq() function generates a regular sequence.
Syntax: seq(start from, end at, interval, length.out)
where, Start from: It is the start value of the sequence.
End at: It is the maximal or end value of the sequence.
Interval: It is the increment of the sequence.
length.out: It is the desired length of the sequence.
Example:
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(1, 10, length.out=10)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(18)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
> seq(1, 6, by=3)
[1] 1 4

Control Structures
Control structures in R allow you to control the flow of execution of a
series of R expressions. Basically, control structures allow you to put
some “logic” into your R code, rather than just always executing the same
R code every time. Control structures allow you to respond to inputs or to
features of the data and execute different R expressions accordingly.

Commonly used control structures are:


if and else: testing a condition and acting on it
for: execute a loop a fixed number of times
while: execute a loop while a condition is true
repeat: execute an infinite loop (must break out of it to stop)
break: break the execution of a loop
next: skip an iteration of a loop

Most control structures are not used in interactive sessions, but rather
when writing functions or longer expressions. However, these constructs
do not have to be used in functions, and it's a good idea to become
familiar with them before we delve into functions.

if-else
The if-else combination is probably the most commonly used control
structure in R (or perhaps any language). This structure allows you to test
a condition and act on it depending on whether it’s true or false.
if(<condition>) {
## do something
}
## Continue with rest of code
The above code does nothing if the condition is false. If you have an
action you want to execute when the condition is false, then you need an
else clause.
if(<condition>) {
## do something
}
else {
## do something else
}
You can have a series of tests by following the initial if with any number of
else ifs.
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
Here is an example of a valid if/else structure.
## Generate a uniform random number
x <- runif(1, 0, 10)
if(x > 3) {
y <- 10
} else {
y <- 0
}
The value of y is set depending on whether x > 3 or not. This expression
can also be written a different, but equivalent, way in R.
y <- if(x > 3) {
10
} else {
0
}

for Loops
In R, for loops take an iterator variable and assign it successive values
from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.)
> for(i in 1:10)
      print(i)
This loop takes the i variable and in each iteration of the loop gives it the
values 1, 2, 3, …, 10, executes the loop body, and then, after the last
iteration, the loop exits.
The following three loops all have the same behavior.
x <- c("a", "b", "c", "d")
for(i in 1:4) {
## Print out each element of 'x'
print(x[i])
}
The seq_along() function is commonly used in conjunction with for loops in
order to generate an integer sequence based on the length of an object
(in this case, object x).
## Generate a sequence based on the length of 'x'
x <- c("z", "y", "x", "w")
for(i in seq_along(x))
      print(x[i])
It is not necessary to use an index-type variable.
for(letter in x)
print(letter)

Nested for loops


for loops can be nested inside of each other.
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
print(x[i, j])
}
}

while Loops
While loops begin by testing a condition. If it is true, then they execute
the loop body. Once the loop body is executed, the condition is tested
again, and so forth, until the condition is false, after which the loop exits.
count <- 0
while(count <= 10) {
print(count)
count <- count + 1
}
While loops can potentially result in infinite loops if not written properly.
repeat Loops
repeat initiates an infinite loop right from the start. The only way to exit a
repeat loop is to call break.
One possible paradigm might be in an iterative algorithm where you may
be searching for a solution and you don’t want to stop until you’re close
enough to the solution. In this kind of situation, you often don’t know in
advance how many iterations it’s going to take to get “close enough” to
the solution.
val = 5
repeat {
print (val)
val <- val + 1
if (val == 10) break
}

next, break
next is used to skip an iteration of a loop.
for(i in 1:100) {
if(i <= 20) next
print (i)
}
break is used to exit a loop immediately, regardless of what iteration the
loop may be on.
for(i in 1:100) {
print(i)
if(i >= 20) break
}

Vectors in R
The fundamental data type in R is the vector. A vector is a sequence of
data elements all of the same type.

Creating Vectors
There are various ways to create vectors but one of the most common is
the concatenation operator. This takes arguments and places them all in
a vector.
x <- c(1, 5, 2, 6)
x
## [1] 1 5 2 6
is.vector(x)
## [1] TRUE
Note that c() orders the values in the vector in the order in which they
were entered.

Vector Arithmetic
We can do arithmetic with vectors in a similar manner as we have with
integers. When we use operators we are doing something element by
element or “elementwise.”
y <- c(1,6,4,8)
x+y
## [1] 2 11 6 14
Notice that we did not add all of the values together but we added both of
the first values from x and y, then the second values and so on.
> x*y
[1]  1 30  8 48
> x/y
[1] 1.0000000 0.8333333 0.5000000 0.7500000
> x %% y
[1] 0 5 2 6

Functions on Vectors
We considered functions on specific data values, but we can actually pass
vectors into most functions in R. One of the simplest functions can help
us with knowing information about recycling, which we encountered before.
This is the length() function.
> length(x)
> length(y)
> length(z)
The length of a vector is very important when writing functions, which
we will get to in a later unit. We can use any() and all() in order to answer
logical questions about the elements of a vector.
any(x>3)
[1] TRUE
We see that there must be at least one x that is greater than 3.
all(x>3)
## [1] FALSE
However, not all values of x are larger than 3.

Other Functions for Vectors


There are various other functions that can be run on vectors, some of
which you may have seen before:
mean() finds the arithmetic mean of a vector.
median() finds the median of a vector.
sd() and var() finds the standard deviation and variance of a vector
respectively.
min() and max() finds the minimum and maximum of a vector
respectively.
sort() returns a vector that is sorted.
summary() returns a 5 number summary of the numbers in a vector.
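For instance, applied to the vector x <- c(1, 5, 2, 6) used earlier, these
functions give:
> mean(x)
[1] 3.5
> median(x)
[1] 3.5
> var(x)
[1] 5.666667
> sd(x)
[1] 2.380476
> sort(x)
[1] 1 2 5 6
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    3.50    3.50    5.25    6.00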

The which() Function


Some functions help us work with the data further, returning the values
we are interested in. For example, above we asked if any elements in
vector x were greater than 3. The which() function will tell us which
elements those are.
> which(x>3)
[1] 2 4

Indexing Vectors
We can call specific elements of a vector by using the following:
x[] is a way to call up a specific element of a vector.
x[1] is the first element.
x[3] is the third element.
x[-3] is a vector with everything but the third element.
We can start off by checking what we have stored so far:
ls()
## [1] "x" "y" "z"
Now that we see the vectors available, we can try indexing x:
x[3]
## [1] 2
x[-3]
## [1] 1 5 6
Note that x[3] returns the third element and x[-3] returns everything but
the third element.

Naming Vector Elements


With vectors it can be important to assign names to the values. Then
when doing plots or considering maximum and minimums, instead of
being given a numerical place within the vector we can be given a specific
name of what that value represents. For example say that vector x
represents the number of medications of 4 unique patients. We could then
use the names() function to assign names to the values
> x
[1] 1 5 2 6
> names(x)
NULL
> names(x) <- c("Patient A", "Patient B", "Patient C", "Patient D")
> x
Patient A Patient B Patient C Patient D
        1         5         2         6

Arrays in R
Arrays are still a vector in R but they have added extra options to them.
We can essentially call them “vector structure”. With a vector we have a
list of objects in one dimension. With an array we can have any number
of dimensions to our data.
We can consider a simple vector to start with
> x <- c(1,2,3,4)
This means that x is a vector with 4 elements. This simple vector can be
turned into an array by specifying some dimensions on it.
> x.array <- array(x, dim=c(2,2))
> x.array
     [,1] [,2]
[1,]    1    3
[2,]    2    4
 A regular vector has a single dimension.
 A matrix has 2 dimensions
 An array can have up to n dimensions.
We can learn about arrays with the following functions:
> dim(x.array)
[1] 2 2
We can see that our array is a 2x2 matrix.
> is.vector(x.array)
[1] FALSE
> is.array(x.array)
[1] TRUE

We can also see that R views these as different objects: there is an array class and a vector class.

Properties of Arrays
We can also have R tell us:
 Type of elements does our array contain with the typeof() function.
 The structure of the array with the str() function.
 Other attributes with the attributes() function.
> typeof(x.array)
[1] "double"
> str(x.array)
 num [1:2, 1:2] 1 2 3 4
> attributes(x.array)
$dim
[1] 2 2

The structure gives a lot of detail about the array, and the attributes tell us that the dim attribute holds the dimensions, which are 2x2.

Working with Arrays


As statisticians it is important to know how to work with arrays. Much of
our data will be represented by vectors and arrays.

Indexing Arrays
Previously we learned how to extract or remove information from vectors.
We can also index arrays, but our index takes into account all the dimensions of our array. For example, if we wish to take the element out of the first row and first column, we can do that by:
> x.array[1,1]
[1] 1
Just like in vectors, we can replace values in an array but using indexing
and assigning of values.
> x.array[1,1] <- 5
> x.array
     [,1] [,2]
[1,]    5    3
[2,]    2    4
Many times we wish to have functions act on either just the row or the
column and there are many functions built into R for this. For example:
> rowSums(x.array)
[1] 8 6
> colSums(x.array)
[1] 7 7

Matrices in R
A matrix is a vector that also carries information about the number of rows and the number of columns it has. A plain vector, however, is not a matrix.
Creating Matrices
An important first step with matrices is to learn how to create them. One
of the easiest ways to do this is with the matrix() function.
> x <- c(1,2,3,4)
> x.mat <- matrix(x, nrow=2, ncol=2, byrow=TRUE)
> x.mat
     [,1] [,2]
[1,]    1    2
[2,]    3    4
Note: byrow=TRUE means that we fill the matrix by row; the result is not the same as when we do not fill it by row:
> x.mat2 <- matrix(x, nrow=2, ncol=2, byrow=FALSE)
> x.mat2
     [,1] [,2]
[1,]    1    3
[2,]    2    4

We can also create matrices purely by expressing the number of columns we wish to have. With larger data we may not know the exact number of rows and columns, but we can at least choose the number of columns.
> y <- c(1,2,3,4,5,6,7)
> y.mat <- matrix(y, ncol=2)
> y.mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    1
Since y has only 7 values, R recycles values from the start of the vector to fill the last cell (and issues a warning that the data length is not a multiple of the number of rows).
Matrix Operations
R can be a great tool for working with matrices. Many operations we need
to do with linear algebra can be done in R. We can perform elementwise
multiplication just like in vectors:
> x.mat * x.mat2
[,1] [,2]
[1,] 1 6
[2,] 6 16
R does have the ability to do matrix multiplication as well
> x.mat %*% x.mat2
[,1] [,2]
[1,] 5 11
[2,] 11 25
We can transpose matrices and extract the diagonals as well:
> t(x.mat)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> diag(x.mat2)
[1] 1 4
Another common matrix calculation is the inverse. Many algorithms and functions in statistics need to work with the inverse of matrices:
> solve(x.mat)
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
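As a quick hedged check, multiplying a matrix by its inverse should return the identity matrix (possibly with tiny floating-point error):
> x.mat %*% solve(x.mat)
     [,1] [,2]
[1,]    1    0
[2,]    0    1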

The apply() Function


Many times we wish to use our own function over the elements of a matrix. The apply() function allows us to use a built-in or user-defined R function with a matrix. Its form is
apply(m, dimcode, f, arguments)
where,
m: the matrix you wish to use.
dimcode: 1 if you want to apply the function to rows, 2 if you want to apply it to columns.
f: the function you wish to use.
arguments: specific arguments for the function being used.
apply() Example: We begin with our matrix y.mat. We can use the apply
function to get means of either the columns or the rows.
> apply(y.mat, 1, mean)
[1] 3.0 4.0 5.0 2.5
> apply(y.mat, 2, mean)
[1] 2.50 4.75
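apply() also accepts a user-defined function. A hedged sketch: the function below (range.size is a name introduced here only for illustration) returns the difference between the largest and smallest value in each row of y.mat:
> range.size <- function(v) { max(v) - min(v) }
> apply(y.mat, 1, range.size)
[1] 4 4 4 3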

Naming Rows and Columns of Matrices


Just like in vectors we may want to name elements in a matrix. Now we have more than one dimension, so we can name both the rows and the columns. Consider the following matrix, where we have recorded both weight (lbs) and height (inches) of subjects at time point 1.
> time1 <- matrix( c(115, 63, 175, 69, 259, 57, 325, 70), ncol=2,
byrow=TRUE)
> time1
[,1] [,2]
[1,] 115 63
[2,] 175 69
[3,] 259 57
[4,] 325 70
Without the story behind these we do not know what kind of data we have
here or what is being measured. This is where it can be very important to
name both the columns and the rows of data.
> #Names for Time 1
> colnames(time1) <- c("weight1", "height1")
> rownames(time1) <- c("Subject 1", "Subject 2", "Subject 3",
"Subject 4")
> time1
weight1 height1
Subject 1 115 63
Subject 2 175 69
Subject 3 259 57
Subject 4 325 70
We can see that now time1 is much more clear as to what the data
contains.
Unit-IV: EXPLORING DATA IN R

Data Frames
Imagine a data frame as something akin to a database table or an Excel
spreadsheet. It has a specific number of columns, each of which is
expected to contain values of a particular data type. It also has an
indeterminate number of rows, i.e. sets of related values for each column.
Assume, we have been asked to store data of our employees (such as
employee ID, name and the project that they are working on). We have
been given three independent vectors, namely “EmpNo”, “EmpName” and “ProjName”, that hold details such as employee IDs, employee names and project names, respectively.
> EmpNo <- c(1000, 1001, 1002, 1003, 1004)
> EmpName <- c(“Jack”, “Jane”, “Margaritta”, “Joe”, “Dave”)
> ProjName <- c(“PO1”, “PO2”, “PO3”, “PO4”, “PO5”)
However, we need a data structure similar to a database table or an Excel
spreadsheet
that can bind all these details together. We create a data frame by the
name, “Employee” to store all the three vectors together.
> Employee <- data.frame(EmpNo, EmpName, ProjName)
Let us print the content of the data frame, “Employee”.
> Employee
EmpNo EmpName ProjName
1 1000 Jack PO1
2 1001 Jane PO2
3 1002 Margaritta PO3
4 1003 Joe PO4
5 1004 Dave PO5
We have just created a data frame, “Employee” with data neatly
organised into rows and the variable names serving as column names
across the top.

Data Frame Access


There are two ways to access the content of data frames:
i. By providing the index number in square brackets
ii. By providing the column name as a string in double brackets.
By Providing the Index Number in Square Brackets
Example 1: To access the second column, “EmpName”, we type the
following command at the R prompt.
> Employee[2]
EmpName
1 Jack
2 Jane
3 Margaritta
4 Joe
5 Dave
Example 2: To access the first and the second column, “EmpNo” and
“EmpName”, we type the following command at the R prompt.
> Employee[1:2]
EmpNo EmpName
1 1000 Jack
2 1001 Jane
3 1002 Margaritta
4 1003 Joe
5 1004 Dave
Example 3:
> Employee [3,]
EmpNo EmpName ProjName
3 1002 Margaritta PO3
Please notice the extra comma in the square bracket operator in the
example. It is not a typo.
Example 4: Let us define row names for the rows in the data frame.
> row.names(Employee) <- c(“Employee 1”, “Employee 2”,
“Employee 3”,
“Employee 4”, “Employee 5”)
> row.names (Employee)
[1] “Employee 1” “Employee 2” “Employee 3” “Employee 4” “Employee
5”
> Employee
EmpNo EmpName ProjName
Employee 1 1000 Jack P01
Employee 2 1001 Jane P02
Employee 3 1002 Margaritta P03
Employee 4 1003 Joe P04
Employee 5 1004 Dave P05
Let us retrieve a row by its name.
> Employee [“Employee 1”,]
EmpNo EmpName ProjName
Employee 1 1000 Jack P01

Let us pack the row names in an index vector in order to retrieve multiple
rows.
> Employee [c (“Employee 3”, “Employee 5”),]
EmpNo EmpName ProjName
Employee 3 1002 Margaritta P03
Employee 5 1004 Dave P05
By Providing the Column Name as a String in Double Brackets
> Employee [[“EmpName”]]
[1] Jack Jane Margaritta Joe Dave
Levels: Dave Jack Jane Joe Margaritta
Just to keep it simple (typing so many double brackets can get unwieldy
at times), use
the notation with the $ (dollar) sign.
> Employee$EmpName
[1] Jack Jane Margaritta Joe Dave
Levels: Dave Jack Jane Joe Margaritta
To retrieve a data frame slice with the two columns, “EmpNo” and
“ProjName”, we
pack the column names in an index vector inside the single square
bracket operator.
> Employee[c(“EmpNo”, “ProjName”)]
EmpNo ProjName
1 1000 P01
2 1001 P02
3 1002 P03
4 1003 P04
5 1004 P05
Let us add a new column to the data frame.
To add a new column, “EmpExpYears” to store the total number of years
of experience that the employee has in the organisation, follow the steps
given as follows:
> Employee$EmpExpYears <-c(5, 9, 6, 12, 7)
Print the contents of the date frame, “Employee” to verify the addition of
the new
column.
> Employee
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
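Rows of a data frame can also be filtered with a logical condition on a column. A hedged sketch using the data frame just created (the expected output is shown):
> Employee[Employee$EmpExpYears > 6, ]
  EmpNo EmpName ProjName EmpExpYears
2  1001    Jane      P02           9
4  1003     Joe      P04          12
5  1004    Dave      P05           7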

Ordering the Data Frames


Let us display the content of the data frame, “Employee” in ascending
order of
“EmpExpYears”.
> Employee[order(Employee$EmpExpYears),]
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
3 1002 Margaritta P03 6
5 1004 Dave P05 7
2 1001 Jane P02 9
4 1003 Joe P04 12
Use the syntax as shown next to display the content of the data frame,
“Employee” in
descending order of “EmpExpYears”.
> Employee[order(-Employee$EmpExpYears),]
EmpNo EmpName ProjName EmpExpYears
4 1003 Joe P04 12
2 1001 Jane P02 9
5 1004 Dave P05 7
3 1002 Margaritta P03 6
1 1000 Jack P01 5
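order() is not limited to numeric columns; character or factor columns sort alphabetically. A hedged sketch with the expected output:
> Employee[order(Employee$EmpName),]
  EmpNo    EmpName ProjName EmpExpYears
5  1004       Dave      P05           7
1  1000       Jack      P01           5
2  1001       Jane      P02           9
4  1003        Joe      P04          12
3  1002 Margaritta      P03           6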
R Functions for understanding Data in Data Frames
We will explore the data held in the data frame with the help of the
following R
functions:
 dim()
 nrow()
 ncol()
 str()
 summary()
 names()
 head()
 tail()
 edit()

dim() Function
The dim()function is used to obtain the dimensions of a data frame. The
output of this function returns the number of rows and columns.
> dim(Employee)
[1] 5 4
The data frame, “Employee” has 5 rows and 4 columns.
nrow() Function
The nrow() function returns the number of rows in a data frame.
> nrow(Employee)
[1] 5
The data frame, “Employee” has 5 rows.
ncol() Function
The ncol() function returns the number of columns in a data frame.
> ncol(Employee)
[1] 4
The data frame, “Employee” has 4 columns.
str() Function
The str() function compactly displays the internal structure of R objects.
We will use
it to display the internal structure of the dataset, “Employee”.
> str (Employee)
‘data.frame’ : 5 obs. of 4 variables:
$ EmpNo : num 1000 1001 1002 1003 1004
$ EmpName : Factor w/ 5 levels “Dave”, “Jack”, ..: 2 3 5 4 1
$ ProjName : Factor w/ 5 levels “P01”, “P02”, “P03”, ..: 1 2 3 4 5
$ EmpExpYears : num 5 9 6 12 7
summary() Function
We will use the summary() function to return result summaries for each
column of the dataset.
> summary (Employee)
EmpNo EmpName ProjName EmpExpYears
Min. : 1000 Dave : 1 P01:1 Min. : 5.0
1st Qu. : 1001 Jack : 1 P02:1 1st Qu. : 6.0
Median : 1002 Jane : 1 P03:1 Median : 7.0
Mean : 1002 Joe : 1 P04:1 Mean : 7.8
3rd Qu. : 1003 Margaritta : 1 P05:1 3rd Qu. : 9.0
Max. : 1004 Max. : 12.0
names() Function
The names()function returns the names of the objects. We will use the
names() function to return the column headers for the dataset,
“Employee”.
> names (Employee)
[1] “EmpNo” “EmpName” “ProjName” “EmpExpYears”
In the example, names(Employee) returns the column headers of the
dataset “Employee”.
The str() function helps in returning the basic structure of the dataset.
This function
provides an overall view of the dataset.
head() Function
The head()function is used to obtain the first n observations where n is set
as 6 by default.
Example 1: In this example, the value of n is set as 3 and hence, the
resulting output would contain the first 3 observations of the dataset.
> head(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
Example 2: Consider x as the total number of observations. If a negative value is given as input for n in the head() function, the output contains the first x + n observations. In this example, x = 5 and n = -2, so the number of observations returned will be x + n = 5 + (-2) = 3.
> head(Employee, n=-2)
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
tail() Function
The tail()function is used to obtain the last n observations where n is set
as 6 by default.
> tail(Employee, n=3)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
Example: Consider the example where the value of n is negative. When a negative input is given to the tail() function, it returns the last x + n observations. Here x = 5 and n = -2, so the example given as follows returns the last 3 records from the dataset, “Employee”.
> tail(Employee, n=-2)
EmpNo EmpName ProjName EmpExpYears
3 1002 Margaritta P03 6
4 1003 Joe P04 12
5 1004 Dave P05 7
edit() Function
The edit() function will invoke the text editor on the R object. We will use
the edit() function to open the dataset , “Employee” in the text editor.
> edit(Employee)
To retrieve the first three rows (with all columns) from the dataset,
“Employee”, use
the syntax given as follows:
> Employee[1:3,]
EmpNo EmpName ProjName EmpExpYears
1 1000 Jack P01 5
2 1001 Jane P02 9
3 1002 Margaritta P03 6
To retrieve the first three rows (with the first two columns) from the
dataset, “Employee”, use the syntax given as follows:
> Employee[1:3, 1:2]
EmpNo EmpName
1 1000 Jack
2 1001 Jane
3 1002 Margaritta

A brief summary of functions for exploring data in R


Function Name         Description
nrow(x)               Returns the number of rows
ncol(x)               Returns the number of columns
str(mydata)           Provides structure to a dataset
summary(mydata)       Provides basic descriptive statistics and frequencies
edit(mydata)          Opens the data editor
names(mydata)         Returns the list of variables in a dataset
head(mydata)          Returns the first n rows of a dataset. By default, n = 6
head(mydata, n=10)    Returns the first 10 rows of a dataset
head(mydata, n=-10)   Returns all the rows but the last 10
tail(mydata)          Returns the last n rows. By default, n = 6
tail(mydata, n=10)    Returns the last 10 rows
tail(mydata, n=-10)   Returns all the rows but the first 10
mydata[1:10, ]        Returns the first 10 rows
mydata[1:10, 1:3]     Returns the first 10 rows of data of the first 3 variables

Load Data Frames


Let us look at how R can load data into data frames from external files.
Reading from a .csv (comma separated values file)
We have created a .csv file by the name, “item.csv” in the D:\ drive. It has
the following content:
    A          B                 C
1   Itemcode   ItemCategory      ItemPrice
2   I1001      Electronics       700
3   I1002      Desktop supplies  300
4   I1003      Office supplies   350

Let us load this file using the read.csv function.


> ItemDataFrame <- read.csv("D:/item.csv")
> ItemDataFrame
  Itemcode     ItemCategory ItemPrice
1    I1001      Electronics       700
2    I1002 Desktop supplies       300
3    I1003  Office supplies       350
Subsetting Data Frame
To subset the data frame and display the details of only those items
whose price is greater than or equal to 350.
> subset(ItemDataFrame, ItemPrice >=350)
Itemcode ItemCategory ItemPrice
1 I1001 Electronics 700
3 I1003 Office supplies 350
To subset the data frame and display only the category to which the items
belong (items whose price is greater than or equal to 350).
> subset(ItemDataFrame, ItemPrice >=350, select =
c(ItemCategory))
ItemCategory
1 Electronics
3 Office supplies
To subset the data frame and display only the items where the category is
either “Office supplies” or “Desktop supplies”.
> subset(ItemDataFrame, ItemCategory == "Office supplies" | ItemCategory == "Desktop supplies")
  Itemcode     ItemCategory ItemPrice
2    I1002 Desktop supplies       300
3    I1003  Office supplies       350
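Conditions can also be combined with & (logical AND). A hedged sketch, again using ItemDataFrame, with the expected output:
> subset(ItemDataFrame, ItemPrice >= 300 & ItemCategory == "Desktop supplies")
  Itemcode     ItemCategory ItemPrice
2    I1002 Desktop supplies       300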
Reading from a Tab Separated Value File
For any file that uses a delimiter other than a comma, one can use the
“read.table” command.
Example: We have created a tab separated file by the name, “item-tab-
sep.txt” in the D:\ drive. It has the following content.
Itemcode   ItemQtyOnHand   ItemReorderLvl
I1001      75              25
I1002      30              25
I1003      35              25
Let us load this file using the “read.table” function. We will read the
content from the file but will not store its content to a data frame.
> read.table("d:/item-tab-sep.txt", sep="\t")
        V1            V2             V3
1 Itemcode ItemQtyOnHand ItemReorderLvl
2    I1001            75             25
3    I1002            30             25
4    I1003            35             25

Notice the use of V1, V2 and V3 as column headings. It means that our specified column names, “Itemcode”, “ItemQtyOnHand” and “ItemReorderLvl”, are not considered. In other words, the first line is not automatically treated as a column header.
Let us modify the syntax, so that the first line is treated as a column
header.
> read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
Now let us read the content of the specified file into the data frame,
“ItemDataFrame”.
> ItemDataFrame <- read.table("D:/item-tab-sep.txt", sep="\t", header=TRUE)
> ItemDataFrame
  Itemcode ItemQtyOnHand ItemReorderLvl
1    I1001            75             25
2    I1002            30             25
3    I1003            35             25
Reading from a Table
A data table can reside in a text file. The cells inside the table are separated by blank characters. An example of a table with 4 rows and 3 columns is given as follows:
1001 Physics     85
2001 Chemistry   87
3001 Mathematics 93
4001 English     84
Copy and paste the table into a file named "d:/mydata.txt" with a text editor and then load the data into the workspace with the function read.table().
> mydata = read.table("d:/mydata.txt")
> mydata
    V1          V2 V3
1 1001     Physics 85
2 2001   Chemistry 87
3 3001 Mathematics 93
4 4001     English 84
Merging Data Frames
Let us now attempt to merge two data frames using the merge function.
The merge function takes an x frame (item.csv) and a y frame (item-tab-
sep.txt) as arguments. By
default, it joins the two frames on columns with the same name (the two
“Itemcode”
columns).
> csvitem <- read.csv("d:/item.csv")
> tabitem <- read.table("d:/item-tab-sep.txt", sep="\t", header=TRUE)
> merge(x=csvitem, y=tabitem)
  Itemcode     ItemCategory ItemPrice ItemQtyOnHand ItemReorderLvl
1    I1001      Electronics       700            75             25
2    I1002 Desktop supplies       300            30             25
3    I1003  Office supplies       350            35             25
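By default merge() keeps only the rows whose Itemcode appears in both frames (an inner join). Passing all = TRUE keeps unmatched rows from either frame as well, filling the missing columns with NA. A hedged sketch (with these two files every Itemcode matches, so the result is the same as above):
> merge(x=csvitem, y=tabitem, by="Itemcode", all=TRUE)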
Unit-V : DATA VISUALIZATION USING R
Reading an XML File
Step 1: Install an XML package:
> install.packages(“XML”)
Installing package into ‘C:/Users/seema_acharya/Documents/R/winlibrary/
3.2’ (as ‘lib’ is unspecified) trying URL
‘https://cran.hafro.is/bin/windows/contrib/3.2/XML_3.98-1.3.zip’
Content type ‘application/zip’ length 4299803 bytes (4.1 MB)
downloaded 4.1 MB
package ‘XML’ successfully unpacked and MD5 sums checked

Step 2: Input data: Store the data below in a text file (XMLFile.xml in the
D: drive). Ensure that the file is saved with an extension of .xml.
<RECORDS>
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>Computer Science</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>People Management</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
</RECORDS>
Reading an XML File: The xml file is read in R using the function
“xmlParse()”. It is stored as a list in R.
Step 1: Begin by loading the required packages.
> library(“XML”)
Warning message: package ‘XML’ was built under R version 3.2.3
> library (“methods”)
> output <- xmlParse(file = “d:/XMLFile.xml”)
> print(output)
<?xml version=“1.0”?>
<RECORDS>
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>Computer Science</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>People Management</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
</RECORDS>
Step 2: Extract the root node from the XML file.
> rootnode <- xmlRoot(output)
Find the number of nodes in the root.
> rootsize <- xmlSize(rootnode)
> rootsize
[1] 3
Let us display the details of the first node.
> print (rootnode[1])
$EMPLOYEE
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>ComputerScience</DEPT>
</EMPLOYEE>
attr(, “class”)
[1] “XMLInternalNodeList” “XMLNodeList”
Let us display the details of the first element of the first node.
> print(rootnode[[1]][[1]])
<EMPID>1001</EMPID>
Let us display the details of the third element of the first node.
> print(rootnode[[1]][[3]])
<SKILLS>MongoDB</SKILLS>
Next, display the details of the third element of the second node.
> print(rootnode[[2]][[3]])
<SKILLS>PeopleManagement</SKILLS>
We can also display the value of 2nd element of the first node.
> output <-xmlValue(rootnode[[1]][[2]])
> output
[1] “Merrilyn”
Step 3: Convert the input xml file to a data frame using the
xmlToDataFrame function.
> xmldataframe <- xmlToDataFrame(“d:/XMLFile.xml”)
Display the output of the data frame.
> xmldataframe
  EMPID  EMPNAME           SKILLS            DEPT
1  1001 Merrilyn          MongoDB ComputerScience
2  1002    Ramya PeopleManagement  HumanResources
3  1003   Fedora      Recruitment  HumanResources
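Individual fields can also be pulled out of every node with the xmlSApply() function from the same XML package. A hedged sketch that extracts the EMPNAME value of each EMPLOYEE node (the output below is what would be expected):
> empnames <- xmlSApply(rootnode, function(node) xmlValue(node[["EMPNAME"]]))
> empnames
  EMPLOYEE   EMPLOYEE   EMPLOYEE
"Merrilyn"    "Ramya"   "Fedora"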

Reading Data from Web


Nowadays most business organisations are using the Internet and cloud
services for
storing data. This online dataset is directly accessible through packages
and application programming interfaces (APIs). Different packages are
available in R for reading from online datasets.
Packages for reading web data
RCurl: The package permits download of files from the web server and posting of forms.
    https://cran.r-project.org/web/packages/RCurl/index.html
Google Prediction API: It allows uploading of data to Google storage and then training them for the Google Prediction API.
    http://code.google.com/p/r-googlepredictionapi-v121
Infochimps: The package provides the functions for accessing all APIs.
    http://api.infochimps.com
httpRequest: The package reads web data with the help of the HTTP request protocol and implements the GET and POST requests.
    https://cran.r-project.org/web/packages/httpRequest/index.html
WDI: The package reads all World Bank data.
    https://cran.r-project.org/web/packages/WDI/index.html
XML: The package reads and creates XML and HTML documents with the help of an HTTP or FTP protocol.
    https://cran.r-project.org/web/packages/XML/index.html
quantmod: The package reads finance data from Yahoo Finance.
    https://cran.r-project.org/web/packages/quantmod/index.html
scrapeR: The package reads online data.
    https://cran.r-project.org/web/packages/scrapeR/index.html
Web scraping extracts data from any webpage of a website. Here the package 'RCurl' is used for web scraping. First, 'RCurl' is loaded into the workspace; the getURL() function then downloads the required webpage, and the htmlTreeParse() function (from the XML package) parses the content of the webpage.
> # loading RCurl package
> library(RCurl)
> # Passing the URL for which web data is required
> wd <- getURL("https://en.wikipedia.org/wiki/R_(programming_language)", ssl.verifypeer = FALSE)
> # Now parsing the web data
> wd_parsed <- htmlTreeParse(wd)
> # Displaying the content of the web page
> wd_parsed

Reading a JSON (Java Script Object Notation) Document


Step 1: Install rjson package.
> install.packages(“rjson”)
Step 2: Input data.
Store the data given below in a text file (‘D:/Jsondoc.json’). Ensure that
the file is saved with an extension of “.json”
{
  "EMPID": ["1001","2001","3001","4001","5001","6001","7001","8001"],
  "Name": ["Ricky","Danny","Mitchelle","Ryan","Gerry","Nonita","Simon","Gallop"],
  "Dept": ["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
A JSON document begins and ends with a curly brace ({}). A JSON document is a set of key:value pairs, and each key:value pair is delimited using ',' as a delimiter. Note that JSON keys and string values must be enclosed in double quotes.
Step 3: Load the package and read the JSON file, 'd:/Jsondoc.json'.
> library(rjson)
> output <- fromJSON(file = "d:/Jsondoc.json")
> output
$EMPID
[1] “1001” “2001” “3001” “4001” “5001” “6001” “7001” “8001”
$Name
[1] “Ricky” “Danny” “Mitchelle” “Ryan” “Gerry” “Nonita” “Simon”
“Gallop”
$Dept
[1] “IT” “Operations” “IT” “HR” “Finance” “IT” “Operations” “Finance”
Step 4: Convert JSON to a data frame.
> JSONDataFrame <- as.data.frame(output)
Display the content of the data frame.
> JSONDataFrame
EMPID Name Dept
1 1001 Ricky IT
2 2001 Danny Operations
3 3001 Mitchelle IT
4 4001 Ryan HR
5 5001 Gerry Finance
6 6001 Nonita IT
7 7001 Simon Operations
8 8001 Gallop Finance
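The rjson package can also work in the other direction: its toJSON() function converts an R object, such as a list, into a JSON string. A hedged sketch (the employee values used here are purely illustrative):
> emp <- list(EMPID = "9001", Name = "Asha", Dept = "IT")   # hypothetical record
> toJSON(emp)
[1] "{\"EMPID\":\"9001\",\"Name\":\"Asha\",\"Dept\":\"IT\"}"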

Using R with Databases


Business analytical processing uses databases for storing large volumes of information. Business intelligence systems or business intelligence tools handle all the analytical processing of a database and use different types of database systems. The tools support relational database processing (RDBMS), accessing a part of a large database, getting a summary of the database, accessing it concurrently, managing security, constraints, server connectivity and other functionality.
At present, different types of databases are available in the market for processing. They have many inbuilt tools, GUIs and other inbuilt functions through which database processing becomes easy. R provides inbuilt packages to access SQL databases such as MySQL, PostgreSQL, etc. With the help of these packages, users can easily access a database, since all the packages follow the same steps for accessing data from the database.
RODBC
RODBC is an R package that interacts with databases through ODBC. Michael Lapsley and Brian Ripley developed this package. RODBC helps in accessing databases such as MS Access and Microsoft SQL Server through an ODBC interface. The package has many inbuilt functions for performing operations on the database. Some major functions of the RODBC package used in database connectivity are:
Function                                    Description
odbcConnect(dsn, uid= '', pwd= '')          The function opens a connection to an ODBC database.
  (dsn is the data source name, uid is the user ID and pwd is the password.)
sqlFetch(sqltable)                          The function reads a table from an ODBC database into a data frame.
  (sqltable is the name of the SQL table.)
sqlQuery(query)                             The function takes a query, sends it to an ODBC database and returns its result.
  (query is the SQL query.)
sqlSave(dataframe, tablename= 'sqltable')   The function writes or updates a data frame to a table in the ODBC database.
  (dataframe is the data frame object and the tablename argument is the name of the table.)
sqlDrop(sqltable)                           The function removes a table from the ODBC database.
  (sqltable is the name of the SQL table.)
odbcClose()                                 The function closes the open connection.
Here is a sample code where package RODBC is used for reading data
from a database.
> # importing package
> library(RODBC)
> connect1 <- odbcConnect(dsn = 'servername', uid = '', pwd = '')   # Open connection
> query1 <- 'Select * from lib.table where…'
> Demodb <- sqlQuery(connect1, query1, errors = TRUE)
> odbcClose(connect1)   # Close the connection

Using MySQL and R


MySQL is an open source SQL database system. It is a small-sized
popular database that is available for free download. For accessing
MySQL database, users need to install the MySQL database system on
their computers. MySQL database can be downloaded and installed from
its official website.
R also provides a package, 'RMySQL', used for accessing data from the MySQL database. Like other packages, 'RMySQL' has many inbuilt functions for interacting with a database.
Major functions of RMySQL
Function                                                 Description
dbConnect(MySQL(), uid= '', pwd= '', dbname= '', ...)    The function opens a connection to the MySQL database.
  (MySQL() is the MySQL driver, uid is the user ID, pwd is the password and dbname is the database name.)
dbDisconnect(connectionname)                             The function closes the open connection.
  (connectionname defines the name of the connection.)
dbSendQuery(connectionname, sql)                         The function runs SQL queries over the open connection.
  (connectionname defines the name of the connection.)
dbListTables(connectionname)                             The function lists the tables of the database of the open connection.
  (connectionname defines the name of the connection.)
dbWriteTable(connectionname, name= 'table name', value= data.frame.name)   The function creates the table and alternatively writes or updates a data frame in the database.
  (connectionname defines the name of the connection.)
A sample code to illustrate the use of RMySQL for reading data from a database is given below.
> # importing package
> library(RMySQL)
> connectm <- dbConnect(MySQL(), uid= '', pwd= '', dbname= '', host= '')   # Open connection 'connectm'
> querym <- 'Select * from lib.table where…'
> Demom <- dbSendQuery(connectm, querym)
> dbDisconnect(connectm)   # Close the connection 'connectm'
Using PostgreSQL and R
PostgreSQL is an open source and customisable SQL database system. After MySQL, it is the database most commonly used for business analytical processing. For accessing a PostgreSQL database, users need to install the PostgreSQL database system on their computer. Please note that it requires a server; users can rent a server, or download and install the PostgreSQL database from its official website.
R has a package, ‘RPostgreSQL’ that is used for accessing the database
from the PostgreSQL database. Like other packages, RPostgreSQL has
many inbuilt functions for interacting with its database.
Major functions of the RPostgreSQL
Function                                                      Description
dbConnect(driverobject, uid= '', pwd= '', dbname= '', ...)    The function opens a connection to a PostgreSQL database.
  (driverobject is an object of the database driver, uid is the user ID, pwd is the password and dbname is the database name.)
dbDisconnect(connectionname)                                  The function closes the open connection.
  (connectionname defines the name of the connection.)
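A minimal connection sketch for RPostgreSQL, assuming the package is installed and that a database named 'demodb' containing a table 'employee' exists on a local server (both names are hypothetical):
> # importing package
> library(RPostgreSQL)
> drv <- dbDriver("PostgreSQL")                  # create the driver object
> connectp <- dbConnect(drv, user = '', password = '', dbname = 'demodb', host = 'localhost')   # open connection
> resultp <- dbGetQuery(connectp, 'SELECT * FROM employee')   # run a query and fetch the result
> dbDisconnect(connectp)                         # close the connection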

Reading an Excel File


A spreadsheet is a table that stores data in rows and columns. Many
applications are available for creating a spreadsheet. Microsoft Excel is
the most popular for creating an Excel file. An Excel file uses “.xlsx”
extension and stores data in a spreadsheet.
In R, different packages are available such as gdata, xlsx, etc., that
provide functions for reading Excel files. Importing such packages is
necessary before using any inbuilt function of any package. The
“read.xlsx()” is an inbuilt function of ‘xlsx’ package for
reading Excel files. The syntax of the read.xlsx() function is
read.xlsx(‘filename’,…)
where, filename argument defines the path of the file to be read and the
dots ‘…’ define the other optional arguments.
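A minimal sketch, assuming the 'xlsx' package has already been installed and that an Excel file "D:/item.xlsx" (a hypothetical file holding the same item data used earlier) has its data on the first sheet:
> library(xlsx)                                                 # load the xlsx package
> ExcelDataFrame <- read.xlsx("D:/item.xlsx", sheetIndex = 1)   # read sheet 1 into a data frame
> ExcelDataFrame
If the file really contained the item data, the printed data frame would look like the read.csv() output shown earlier.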

Working with R Charts and Graphs


Histograms
A histogram is a graphical illustration of the distribution of numerical data
in successive numerical intervals of equal sizes. It looks similar to a bar
graph. However, values are grouped into continuous ranges in a
histogram. The height of a histogram bar represents the number of
values occurring in a particular range.
R uses hist(x) function to create simple histograms, where x is a numeric
value to be plotted. Basic syntax to create a histogram using R is:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
where,
v takes a vector that contains numeric values, ‘main’ is the main title of
the bar chart, xlab is the label of the X-axis, xlim specifies the range of
values on the X-axis, ylim specifies the range of values on the Y-axis,
‘breaks’ control the number of bins or mentions the width of the bar, ‘col’
sets the colour of the bars and ‘border’ sets the border colour of the bars.
Example 1: A simple histogram can be created by just providing the input
vector where other parameters are optional.
h<- c (8,13,30,5,28) # Create data for the histogram
hist(h) #Create histogram for H
Example 2: A simple histogram can be created by providing the input vector, a file name, a label for the X-axis ("xlab"), a bar colour ("col") and a border colour ("border"):
# Create data for the histogram
H <- c(8, 13, 30, 5, 28)
# Give a file name for the histogram
png(file = "samplehistogram.png")
# Create a sample histogram
hist(H, xlab = "Categories", col = "red")
# Save the sample histogram file
dev.off()
Executing the above code produces the output shown. The bars are filled with the colour given by the 'col' parameter; a border colour can be given to the bars by passing a value to the 'border' parameter. The same histogram can be drawn on screen with:
> H <- c(8, 13, 30, 5, 28)
> hist(H, xlab = "Categories", col = "red")
Example 3: The parameters xlim and ylim are used to denote the range of values used on the X and Y axes, and breaks is used to specify the width of each bar.
> H <- c(8, 13, 30, 5, 28)   # Create data
> # Give a file name for the histogram
> png(file = "samplelimhistogram.png")
> hist(H, xlab = "Values", ylab = "Colours", col = "green", xlim = c(0,30), ylim = c(0,5), breaks = 5)
> # Save the samplelimhistogram.png file
> dev.off()
The same histogram can be drawn on screen with:
> H <- c(8, 13, 30, 5, 28)
> hist(H, xlab = "Values", ylab = "Colours", col = "green", xlim = c(0,30), ylim = c(0,5), breaks = 5)
Executing the above code will display the histogram as shown above. The breaks argument can also take a vector of cut points instead of a single number:
> H <- c(8, 13, 30, 5, 28)
> bins <- c(0, 5, 10, 15, 20, 25, 30)
> bins
[1]  0  5 10 15 20 25 30
> hist(H, xlab = "Values", ylab = "Colours", col = "green", xlim = c(0,30), ylim = c(0,5), breaks = bins)

Bar Charts
A bar chart is a pictorial representation of statistical data. Both vertical
and horizontal bars can be drawn using R. It also provides an option to
colour the bars in different colours. The length of the bar is directly
proportional to the values of the axes.
R uses the “barplot()” function to create a bar chart. The basic syntax for
creating a bar chart using R is
barplot(H, xlab, ylab, main, names.arg, col)
where, H is a matrix or a vector that contains the numeric values used in
bar chart, xlab is the label of the X-axis, ylab is the label of the Y-axis,
main is the main title of the bar chart, ‘names.arg’ is the collection of
names to appear under each bar and col is used to give colours to the
bars. Some basic bar charts commonly used in R are:
 Simple bar chart
 Grouped bar chart
 Stacked bar chart
1. Simple Bar Chart
A simple bar chart is created by just providing the input values and a name to the bar chart. The following code creates a bar chart using the 'barplot()' function in R.
Example 1:
> # Create data for the bar chart
> H <- c(8, 13, 30, 5, 28)
> # Plot bar chart using barplot() function
> barplot(H)   # 'OR'
> barplot(H, xlab = "Categories", ylab = "Values", col = "blue")
Executing the above sample code returns a simple bar chart as the output. The 'barplot()' function draws the bar chart from the inputs provided; it can be drawn both vertically and horizontally. Labels for the X and Y axes can be given with the xlab and ylab parameters, and the col parameter fills the bars with colour.
Example 2: The bar chart is drawn horizontally by passing the "horiz" parameter as TRUE.
> # Create data for the bar chart
> H <- c(8, 13, 30, 5, 28)
> # Plot bar chart using barplot() function
> barplot(H, horiz = TRUE)
> barplot(H, xlab = "Values", ylab = "Categories", col = "blue", horiz = TRUE)
Executing the above code in R takes up the input values and plots the bars using the 'barplot()' function. When the "horiz" parameter is set to TRUE, the bar chart is displayed in a horizontal position; otherwise it is displayed as the default vertical bar chart.
2. Group Bar Chart
Grouped data in R is used to handle multiple inputs and takes the value of a matrix. The group bar chart is created using the 'barplot()' function, which accepts the matrix input.
Example:
> colors <- c("green", "orange", "brown")
> months <- c("Mar", "Apr", "May", "Jun", "Jul")
> regions <- c("East", "West", "North")
> Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow=3, ncol=5, byrow=TRUE)
> rownames(Values) <- regions
> colnames(Values) <- months
> Values
      Mar Apr May Jun Jul
East    2   9   3  11   9
West    4   8   7   3  12
North   5   2   8  10  11
> barplot(Values, col=colors, width=2, beside=TRUE, names.arg=months, main = "Total Revenue 2022 by month")
> legend("topleft", regions, cex=0.6, bty= "n", fill=colors)
The matrix input is read and passed to the 'barplot()' function to create a group bar chart. Here the legend is included at the top left of the bar chart.
3. Stacked Bar Chart
A stacked bar chart is similar to a group bar chart in that it takes multiple inputs, but instead of grouping the values side by side, the stacked bar chart stacks each bar one on top of the other based on the input values.
Example
> days <- c("Mon","Tues","Wed")
> months<-c("Jan","Feb","Mar","Apr","May")
> colours <- c("red","blue","green")
> val <-matrix(c(2, 5, 8, 6, 9, 4, 6, 4, 7, 10, 12, 5, 6, 11, 13), nrow =3,
ncol=5, byrow =TRUE)
> barplot(val, main="Total", names.arg=months, xlab="Months",
ylab="Days", col=colours)
> legend("topleft", days, cex=0.75, fill=colours)
'Total' is set as the main title of the stacked bar chart, with 'Months' as the X-axis label and 'Days' as the Y-axis label. The code legend("topleft", days, cex=0.75, fill=colours) specifies the legend to be displayed at the top left of the bar chart with the colours filled accordingly.

Line Graphs
A line chart is a graph that connects a series of points by drawing line
segments between them. These points are ordered in one of their
coordinate (usually the x-coordinate) value. Line charts are usually used
in identifying the trends in data.
The ‘plot()’ function in R is used to create the line graph.
Syntax: The basic syntax to create a line chart in R is
plot(v, type, col, xlab, ylab)
where,
 v is a vector containing the numeric values.
 type takes the value "p" to draw only the points, "l" to draw only
the lines and "o" to draw both points and lines.
 xlab is the label for x axis.
 ylab is the label for y axis.
 main is the Title of the chart.
 col is used to give colors to both the points and lines.
Example: A simple line chart is created using the input vector and the type parameter as "o".
> v <- c(7,12,28,3,41) # Create the data for the chart.
> plot(v, type = "o") # Plot the line chart.

Line Chart Title, Color and Labels


The features of the line chart can be expanded by using additional
parameters to add color to the points and lines, to give a title to the chart
and to add labels to the axes.
Example:
> v <- c(7,12,28,3,41) # Create the data for the chart.
> plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")

Multiple Lines in a Line Chart


More than one line can be drawn on the same chart by using the lines() function. After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in the chart.
> v <- c(7,12,28,3,41) # Create the data for the chart.
> t <- c(14,7,6,19,3)
> plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
> lines(t, type = "o", col = "blue")
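When more than one line is drawn it helps to label the series. A hedged sketch that adds a legend to the chart above (the series names are only illustrative):
> legend("topright", legend = c("v", "t"), col = c("red", "blue"), lty = 1)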
Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point
represents the values of two variables. One variable is chosen in the
horizontal axis and another in the vertical axis. The simple scatterplot is
created using the ‘plot()’ function.
Syntax: The basic syntax for creating scatterplot in R is
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Where,
 x is the data set whose values are the horizontal coordinates.
 y is the data set whose values are the vertical coordinates.
 main is the title of the graph.
 xlab is the label in the horizontal axis.
 ylab is the label in the vertical axis.
 xlim is the limits of the values of x used for plotting.
 ylim is the limits of the values of y used for plotting.
 axes indicates whether both axes should be drawn on the plot.
Example: To create a basic scatterplot, the data frame "mtcars", available in the R environment, is used. Only the columns "wt" and "mpg" of mtcars are used.
> input <- mtcars[, c('wt','mpg')]
> print(head(input))   # head will give the top 6 rows from the dataframe
                     wt  mpg
Mazda RX4         2.620 21.0
Mazda RX4 Wag     2.875 21.0
Datsun 710        2.320 22.8
Hornet 4 Drive    3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant           3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between 'wt (weight)' and 'mpg (miles per gallon)'.
> # Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
> plot(x = input$wt, y = input$mpg, xlab = "Weight", ylab = "Milage", xlim = c(2.5,5), ylim = c(15,30), main = "Weight vs Milage")

Scatterplot Matrices
When we have more than two variables and we want to examine the relationship between one variable and each of the remaining ones, we use a scatterplot matrix. We use the 'pairs()' function to create matrices of scatterplots.
Syntax: The basic syntax for creating scatterplot matrices in R is
pairs(formula, data)
Where,
 formula represents the series of variables used in pairs.
 data represents the data set from which the variables will be taken.
Example: Each variable is paired up with each of the remaining variable.
A scatterplot is plotted for each pair.
# Plot the matrices between 4 variables giving 12 plots.
# One variable with 3 others and total 4 variables.
> pairs(~wt+mpg+disp+cyl, data = mtcars, main = "Scatterplot
Matrix")

Pie Charts
R Programming language has numerous libraries to create charts and
graphs. A pie-chart is a representation of values as slices of a circle with
different colors. The slices are labeled and the numbers corresponding to
each slice are also represented in the chart.
In R the pie chart is created using the ‘pie()’ function which takes positive
numbers as a vector input. The additional parameters are used to control
labels, color, title etc.
Syntax: The basic syntax for creating a pie-chart using the R is
pie(x, labels, radius, main, col, clockwise)
Where,
 x is a vector containing the numeric values used in the pie chart.
 labels is used to give a description to the slices.
 radius indicates the radius of the circle of the pie chart (a value between -1 and +1).
 main indicates the title of the chart.
 col indicates the colour palette.
 clockwise is a logical value indicating if the slices are drawn clockwise or anti-clockwise.
Example: A very simple pie chart is created using just the input vector and labels. The below script will create the pie chart.
> x <- c(21, 62, 10, 53) # Create data for the graph.
> labels <- c("London", "New York", "Singapore", "Mumbai")
> pie(x, labels) # Plot the chart.

Pie Chart - Title & Colors


We will use the parameter 'main' to add a title to the chart, and another parameter, 'col', to make use of the rainbow colour palette while drawing the chart. The length of the palette should be the same as the number of values we have for the chart; hence 'length(x)' is used.
Example: The below script will create and save the pie chart in the
current R working directory.
> x <- c(21, 62, 10, 53) # Create data for graph.
> labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the chart with title and rainbow color palette.
> pie( x, labels, main = "City pie chart", col = rainbow(length(x)) )

Pie Chart - Slice Percentages & Chart Legend


We can add slice percentages and a chart legend by creating additional chart variables.
> x <- c(21, 62, 10, 53) # Create data for the graph.
> labels <- c("London", "New York", "Singapore", "Mumbai")
> piepercent <- round(100*x/sum(x), 1)
> # Plot the chart.
> pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
> legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8, fill = rainbow(length(x)))

3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package 'plotrix' has a function called 'pie3D()' that is used for this.
> library(plotrix) # Get the library.
> x <- c(21, 62, 10, 53) # Create data.
> lbl <- c("London", "New York", "Singapore", "Mumbai")
> # Plot the chart.
> pie3D(x, labels = lbl, explode = 0.1, main = "Pie Chart of Countries")
