Data Science
Lecture 1
Data science
Exponential Increase in Data
• All human-generated information up
to 2003 amounted to about 5 exabytes.
• In 2011, the same amount of data was
generated every 2 days,
• and it would now be generated every 10 minutes.
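To make the growth concrete, the rough arithmetic below converts the slide's figures into a daily rate. The numbers are the slide's own estimates, and the helper name `exabytes_per_day` is made up for this sketch.

```python
# Figures quoted on the slide; these are estimates, not measurements.
EXABYTES = 5
MINUTES_PER_DAY = 24 * 60

def exabytes_per_day(period_minutes: float) -> float:
    """Daily output if EXABYTES of data appear once every `period_minutes`."""
    return EXABYTES * MINUTES_PER_DAY / period_minutes

rate_2011 = exabytes_per_day(2 * MINUTES_PER_DAY)  # 5 EB every 2 days
rate_now = exabytes_per_day(10)                    # 5 EB every 10 minutes

print(rate_2011)             # 2.5
print(rate_now)              # 720.0
print(rate_now / rate_2011)  # 288.0
```

Taken at face value, the slide's figures imply a rate roughly 288 times higher than in 2011.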
“Data is the New Oil”
– World Economic Forum 2011
• “Data is the new oil.” Coined in 2006 by
Clive Humby, a British data
commercialization entrepreneur, this
now-famous phrase was embraced by
the World Economic Forum in a 2011
report.
• Data is just like crude oil. It’s valuable,
but if unrefined it cannot really be
used. It has to be changed into gas,
plastic, chemicals, etc. to create a
valuable entity that drives profitable
activity; so must data be broken down
and analyzed for it to have value.
What is Data Science?
• Fortune magazine
• “Hot New Gig in Tech”
• Hal Varian, Google’s Chief Economist, NYT, 2009:
• Statistics: The next attractive job
• “The ability to take data—to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
• Mike Driscoll, CEO of Metamarkets:
• “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.”
• “Data science is the civil engineering of data. Its acolytes possess a practical knowledge
of tools & materials, coupled with a theoretical understanding of what's possible.”
Data Science – A Visual Definition
• Drew Conway’s Data Science Venn Diagram
What do data scientists do?
• “They need to find nuggets of truth in data and then explain it to the
business leaders.” Richard Snee, EMC
• Data scientists “tend to be ‘hard scientists’, particularly physicists,
rather than computer science majors. Physicists have a strong
mathematical background, computing skills, and come from a
discipline in which survival depends on getting the most from the
data. They have to think about the big picture, the big problem.”
DJ Patil, Chief Scientist at LinkedIn
Mike Driscoll’s three skills of data geeks
1) Statistics
• traditional analysis
2) Data Munging
• parsing, scraping, and formatting data
3) Visualization
• graphs, tools, etc.
Data Science
“Data Science refers to an emerging area of work concerned with the
collection, preparation, analysis, visualization, management and
preservation of large collections of information.”
An Introduction to Data Science by Jeffrey Stanton, Syracuse University,
School of Information Studies.
Data Science – A Definition
• Data Science is the science which uses computer science, statistics,
machine learning, visualization and human-computer interaction
to collect, clean, integrate, analyze, visualize and interact with data in
order to create data products.
Data Science
• Applying advanced statistical tools to existing data to generate new insights
Service Change
• Converting new data insights into (often small) changes to business processes
Smarter Work
• More efficient and effective use of staff and resources
Data Scientist
“A data scientist is someone who can obtain, scrub, explore, model
and interpret data, blending hacking, statistics and machine learning.
Data scientists not only are adept at working with data, but appreciate
data itself as a first-class product.”
Hilary Mason, chief scientist at bit.ly
• “data wrangling”
• “data jujitsu”
• “data munging”
Three types of tasks
1) Preparing to run a model
Gathering, cleaning, integrating, restructuring, transforming, loading,
filtering, deleting, combining, merging, verifying, extracting, shaping,
massaging.
2) Running the model
3) Communicating the results
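As a minimal sketch of how the three tasks fit together, the example below runs them end to end on a handful of made-up readings. The data, the function names and the choice of a plain average as the "model" are all assumptions for illustration.

```python
from statistics import mean

# Hypothetical raw records: messy strings as they might arrive from a log or form.
raw = ["  12.5", "13.1 ", "n/a", "", "11.9", "12.7"]

def prepare(records):
    """Task 1: clean and filter, keeping only values that parse as numbers."""
    cleaned = []
    for r in records:
        try:
            cleaned.append(float(r.strip()))
        except ValueError:
            continue  # drop entries such as "n/a" or empty strings
    return cleaned

def run_model(values):
    """Task 2: the 'model' here is just the sample mean."""
    return mean(values)

def communicate(estimate, n_used, n_total):
    """Task 3: report the result in a sentence a non-specialist can read."""
    return f"Average reading: {estimate:.2f} (from {n_used} of {n_total} records)"

values = prepare(raw)
report = communicate(run_model(values), len(values), len(raw))
print(report)
```

Even in this toy case most of the code belongs to the first task, which matches the long list of preparation verbs on the slide.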
Data Science is about Data Products
• “Data-driven apps” (Mike Loukides)
  • Spellchecker
  • Machine Translator
• Interactive visualizations
  • Google flu application
  • Global Burden of Disease
• Online Databases
  • Enterprise data warehouse
  • Sloan Digital Sky Survey
• Data science is about building data products, not just answering questions
• Data products empower others to use the data.
  • May help communicate your results (e.g., Nate Silver’s maps)
  • May empower others to do their own analysis (e.g., Global Burden of Disease)
Goal of Data Science: Turn data into data products.
Types of data science work
• Data science tends to fall into three broad categories, ordered from simple to complex:
• Investigating (simple) – aggregating and inspecting data to get basic
insights on what is currently happening
• Predicting – taking the data and using it to understand what will
happen in the future
• Optimizing (complex) – using the data to choose the best course of
action
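The three categories can be illustrated on one tiny data set. This is a sketch only: the prices and sales figures are invented, and a straight-line fit stands in for whatever predictive model a real project would use.

```python
from statistics import mean

# Hypothetical weekly sales (units) at four price points; made-up numbers.
prices = [8.0, 9.0, 10.0, 11.0]
units = [120.0, 110.0, 100.0, 90.0]

# Investigating: summarise what is currently happening.
avg_units = mean(units)

# Predicting: fit units = a + b * price by ordinary least squares.
p_bar, u_bar = mean(prices), mean(units)
b = sum((p - p_bar) * (u - u_bar) for p, u in zip(prices, units)) / sum(
    (p - p_bar) ** 2 for p in prices
)
a = u_bar - b * p_bar

def predict_units(price):
    """Predicted weekly units at a given price under the fitted line."""
    return a + b * price

# Optimizing: choose the candidate price with the highest predicted revenue.
candidates = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
best_price = max(candidates, key=lambda p: p * predict_units(p))

print(avg_units, predict_units(12.0), best_price)
```

Each step builds on the previous one: the summary describes the present, the fitted line extrapolates to unseen prices, and the optimization uses those predictions to pick an action.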
Distinguishing Data Science from...
Business Intelligence and Data Warehouse
Statistics
Data(base) Management
Visualization
Machine Learning
Data Mining
What is data science?
• Deals with both structured and unstructured data.
• Covers the cleansing, preparation and final analysis of data.
• Combines programming, logical reasoning, mathematics and statistics.
• Captures data in ingenious ways and encourages looking at things from a
different perspective.
• Cleanses, prepares and aligns the data.
• An umbrella term for several techniques used to extract information and
insights from data.
• Data scientists are responsible for creating data products and other
data-based applications that handle data in ways conventional systems
cannot.
What is data mining?
• The process of gathering information from huge databases that was
previously incomprehensible and unknown, and then using that information to
make relevant business decisions.
• A set of methods used in the process of knowledge discovery to
identify relationships and patterns that were previously unknown.
• Data mining is a confluence of several fields, such as artificial intelligence,
pattern recognition, data visualization, machine learning, statistics
and so on.
• Primary goal: to extract information from various data sets and
transform it into proper, understandable structures for eventual use.
• A process used by data scientists and machine learning enthusiasts to
convert large data sets into something more usable.
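One simple instance of discovering previously unknown patterns is counting which items tend to occur together. The sketch below does this for a few invented shopping baskets; `frequent_pairs` and `min_support` are names chosen for this example, and real data mining tools use far more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the items and counts are made up.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least `min_support` baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives each pair one canonical order, e.g. ("bread", "milk")
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

patterns = frequent_pairs(baskets, min_support=2)
print(patterns)
```

The result is a compact, understandable structure (pairs with counts) extracted from raw transactions, which is the kind of transformation the primary goal above describes.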
What is machine learning?
• Machine learning gives computers the ability to learn from new data
sets without being explicitly programmed.
• Machine learning and data mining follow much the same process.
But!
Machine learning is a method of data analysis that automates
analytical model building.
It uses algorithms that iteratively learn from data and, in the
process, lets computers find hidden insights without being
explicitly told where to look.
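To show what "iteratively gaining knowledge from data" can look like, here is a minimal sketch that learns a single parameter by gradient descent. The data, the learning rate and the step count are assumptions picked so that the example converges; this is one illustrative algorithm, not the only way machine learning works.

```python
# Hypothetical training data that follows y = 3x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Learn w in y ≈ w * x by gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * grad
    return w

w = fit_slope(xs, ys)
print(round(w, 4))
```

Nowhere does the code state that the slope is 3; the value emerges from repeated small corrections driven by the data.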
What is the difference between these three terms?
• Data scientists are responsible for coming up with data-centric products and
applications that handle data in ways conventional systems cannot. The process
of data science is much more focused on the technical ability to handle any type of
data.
• Unlike data mining and machine learning, data science is also responsible for
assessing the impact of data on a specific product or organization.
• Data science focuses on the science of data, while data mining deals with the process of
discovering new patterns in big data sets. Data mining may look similar to machine
learning because it also relies on algorithms. However, algorithms are only
one part of data mining.
• In machine learning, algorithms are used to gain knowledge from data sets.
In data mining, algorithms are just one step of a larger process, so unlike
machine learning it does not focus entirely on algorithms.
Data Science
• Data Science is a field of study which spans Big
Data Analytics, Data Mining, Predictive Modeling, Data
Visualization, Mathematics, and Statistics.
• Data Science has been referred to as the fourth paradigm of Science
(the other three being Theoretical, Empirical and Computational).
Academic groups often conduct research dedicated exclusively to Data Science.
Key Differences Between Data Science Vs Data Mining
• Data Mining is an activity which is a part of a broader Knowledge
Discovery in Databases (KDD) Process while Data Science is a field of
study just like Applied Mathematics or Computer Science.
• Data Science is thought to be broader in scope while Data Mining is
considered narrower.
• Some activities of Data Mining such as statistical analysis, writing data
flows and pattern recognition can intersect with Data Science. Hence,
Data Mining becomes a subset of Data Science.
• Machine Learning in Data Mining is used more in pattern recognition
while in Data Science it has a more general use.
Data Science Vs Data Mining Comparison Table
Databases and Data Science
               Databases                Data Science
Data Value     “Precious”               “Cheap”
Data Volume    Modest                   Massive
Examples       Bank records,            Online clicks,
               Personnel records,       GPS logs,
               Census,                  Tweets,
               Medical records          Building sensor readings
Priorities     Consistency,             Speed,
               Error recovery,          Availability,
               Auditability             Query richness
Structured     Strongly (Schema)        Weakly or none (Text)
Properties     Transactions, ACID*      CAP* theorem (2 of 3),
                                        eventual consistency
Realizations   SQL                      NoSQL:
                                        MongoDB, CouchDB, HBase,
                                        Cassandra, Riak, Memcached,
                                        Apache River, …
* ACID = Atomicity, Consistency, Isolation and Durability
* CAP = Consistency, Availability, Partition Tolerance
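The "Transactions, ACID" entry can be made concrete with a small sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the names and the balances are invented for illustration; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates CHECK, so it is rolled back
print(ok, failed, balances(conn))
```

In the failed transfer the credit to `bob` is executed first and then undone by the rollback, which is the all-or-nothing behaviour that the "A" in ACID refers to. Systems on the right-hand side of the table often relax such guarantees in exchange for speed and availability.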
Business Intelligence and Data Science
Business Intelligence      Data Science
Querying the past          Querying the past, present and future
Machine Learning and Data Science
Machine Learning                        Data Science
Develop new (individual) models         Explore many models, build and tune
                                        hybrids
Prove mathematical properties of        Understand empirical properties of
models                                  models
Improve/validate on a few, relatively   Develop/use tools that can handle
clean, small datasets                   massive datasets
Publish a paper                         Take action!
Data Scientists
“I worry that the Data Scientist role is like the mythical ‘webmaster’ of
the 90s: master of all trades.”
-- Aaron Kimball, CTO, WibiData
What “data science” tells me:
• If you’re a DBA, you need to learn to deal with unstructured data
• If you’re a statistician, you need to learn to deal with data that does
not fit in memory
• If you’re a software engineer, you need to learn statistical modeling
and how to communicate results.
• If you’re a business analyst, you need to learn about algorithms and tradeoffs at
scale.