What is data science and why do we need it?
Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and
unstructured data.
So, that is the textbook definition of data science. To put it in plain language so that
everyone can understand:
“Data Science is a way of getting insights from structured and unstructured data.”
Data science is something we do with data to solve a business problem. Put differently,
businesses use data science to:
1. Expand the business
2. Enhance product/service offerings
3. Augment the customer/client base
Why do we need data science?
You may not be aware of it, but data science has existed for more than 50 years; we
simply never looked at it the way we do now. When the field emerged, there was not a lot
of data, and we did not have sophisticated computers and devices back then either.
So, we could not appreciate how useful data science could be.
But now we have the computers, tools, and sophisticated devices. Moreover, the amount of
data generated in recent years has fueled the need for data science. According to a
recent study, more than 90% of the world’s data has been generated in the last two
years.
Some examples of the volume of data being generated continuously at present are:
1. 7 billion: the number of shares traded on the US stock market each day.
2. 10 terabytes: the amount of data generated on one flight from New York to London.
3. 400 million: the number of tweets per day on Twitter.
4. 3 billion: the number of “likes” each day on Facebook.
It is this huge volume of data generated in the last two years that has compelled
organizations and governments to start investing in data science, artificial
intelligence, machine learning, and automation.
It is now essential for organizations to stay competitive in the market, so they are
doing everything they can to harness the power of their data in the most meaningful way
for the company’s growth.
Governments are now using census data to bridge the gap between the government and the
masses. The aim is to get closer to the people, which further helps in distributing
state resources equitably among them.
Applications of Data Science
Let’s discuss some of the widely used and talked about applications of data science:
US Elections: The Trump campaign was believed to rely heavily on polling
research. Its team of data scientists conducted more than 800,000 live and
online surveys across seventeen battleground states during the campaign
period. Once the analytics system was up and running at the end of summer 2016, the
campaign was sending out tailored messages to 100,000 targeted voters every day.
Each voter was then attributed 500 nodal points on the basis of their personality traits.
Localized and highly targeted video and audio campaigns were then developed and
privately shared with the voters, which generated a positive aura for Trump as a future
president and thus contributed to his ultimate win.
1. Social Networking: Have you ever wondered how social networking
sites like Facebook, Twitter or LinkedIn suggest whom to follow and who
might be your friend? Well, they use data science.
They start from simple signals like:
1. Which place do you belong to?
2. Who are your friends?
3. Who are your friends’ friends?
4. What are your areas of interest?
All of these data points help these social networking companies recommend who might be
your friend and whom you may want to follow, as sketched below.
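For intuition, here is a tiny, made-up sketch of the friends-of-friends idea in Python; the names and friendships are invented, and real platforms use far richer signals and models.

```python
from collections import Counter

# Toy friendship graph: each user maps to the set of their friends (made up).
friends = {
    "asha": {"ben", "chen"},
    "ben":  {"asha", "dina"},
    "chen": {"asha", "dina", "eli"},
    "dina": {"ben", "chen"},
    "eli":  {"chen"},
}

def suggest_friends(user):
    """Rank non-friends by how many mutual friends they share with `user`."""
    counts = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                counts[candidate] += 1
    return counts.most_common()

print(suggest_friends("asha"))  # e.g. [('dina', 2), ('eli', 1)]
```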
2. Loan Approval: Do you know how banks decide how much credit should
be given to which customers? What are the criteria?
Well, you have guessed the answer already: by using data science. Lending
institutions like banks and NBFCs use the customer credit rating/score provided by
credit bureau companies like CIBIL to decide whether they can give a customer a loan.
It also helps them decide the loan amount that can be disbursed to a
customer.
A typical bank looks at around 100 data points about an applicant before deciding on
loan approval; a simplified sketch of such a rule is shown below.
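As a purely illustrative sketch, a simplified eligibility rule might look like the Python below; the score cut-off, the 40% affordability rule and the field names are assumptions for illustration, not any lender's actual policy.

```python
def loan_decision(credit_score, monthly_income, existing_emi):
    # Assumed cut-off and affordability rule, for illustration only
    if credit_score < 700:
        return "reject"
    disposable = monthly_income - existing_emi
    max_emi = 0.4 * disposable   # assume at most 40% of disposable income goes to EMIs
    return f"approve, maximum EMI ~ {max_emi:.0f}"

print(loan_decision(credit_score=760, monthly_income=80_000, existing_emi=10_000))
```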
3. Banking Fraud: Banks and credit card companies also use data
science to detect fraudulent transactions in real time. Every transaction
you make with your bank is analyzed in real time, and if it looks fishy or
suspicious, the system flags the transaction and you get a
phone call from your bank asking whether you attempted that particular
transaction.
Similarly, banks use data science to counter money laundering by determining the source
of funds and where and how they are being used.
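A very rough sketch of the fraud-flagging idea in Python: compare a new transaction against a customer's usual spending pattern and flag it if it falls far outside that range. The amounts and the three-standard-deviation rule are illustrative assumptions; production systems use far more sophisticated models.

```python
import statistics

# Made-up recent transaction amounts for one customer (their "normal" behaviour)
history = [420, 515, 380, 610, 450, 490, 530]
new_transaction = 9_800

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag the new transaction if it is far outside the customer's usual range
if new_transaction > mean + 3 * stdev:
    print(f"Flag for review: {new_transaction} (usual spend ~{mean:.0f})")
```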
4. Ecommerce: Have you ever wondered how e-commerce companies
like Amazon, Flipkart, and Snapdeal show you just the right product on their
platform, which you happen to buy straightaway? Well, they use a
recommendation engine, which in turn uses past data about customer
behaviour on their website. In addition, e-commerce
sites use data science to optimize discounts on various products, thus
enhancing their cross-selling and up-selling strategies.
They also use data science to generate demand forecasts so that they can stock their
warehouse inventory appropriately. A toy sketch of the “bought together” idea follows.
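Here is a toy sketch of the "people who bought this also bought" idea using plain Python and co-purchase counts; the products and orders are made up, and real recommendation engines use much richer models.

```python
from collections import Counter
from itertools import combinations

# Toy order history: each order is the set of products bought together (made up)
orders = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"laptop", "mouse"},
]

# Count how often each pair of products is bought together
co_bought = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_bought[(a, b)] += 1
        co_bought[(b, a)] += 1

def recommend(product, top_n=2):
    """'People who bought this also bought...' from co-purchase counts."""
    scores = {b: n for (a, b), n in co_bought.items() if a == product}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("phone"))  # e.g. ['case', 'charger']
```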
I am sure you are now getting a good idea of how data science impacts our
day-to-day life. Moreover, this impact is getting stronger each day.
Data Wrangling – Data wrangling is “the process of programmatically transforming data into a
format that makes it easier to work with.”
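A minimal wrangling sketch with pandas, assuming a small made-up dataset with messy column names, string-typed numbers and missing values:

```python
import pandas as pd

# Tiny invented dataset with typical problems: inconsistent headers,
# numbers stored as strings, and a missing value.
raw = pd.DataFrame({
    "Customer Name": ["Asha", "Ben", None],
    "Amount": ["1,200", "950", "2,400"],
})

clean = (
    raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # tidy headers
       .dropna(subset=["customer_name"])                               # drop missing rows
       .assign(amount=lambda df: df["amount"].str.replace(",", "").astype(int))
)
print(clean)
```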
Time Series – Time series data is data recorded across time, not always at consistent
intervals, but across time nonetheless.
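A small pandas sketch, using made-up sales figures, that puts irregularly recorded values onto a regular weekly grid:

```python
import pandas as pd

# Made-up daily sales recorded at irregular dates
sales = pd.Series(
    [120, 98, 143, 110],
    index=pd.to_datetime(["2021-01-03", "2021-01-05", "2021-01-12", "2021-01-20"]),
)

# Resample onto a regular weekly grid and sum within each week
print(sales.resample("W").sum())
```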
Web Scraping – Web scraping, web harvesting, or web data extraction is data scraping used
for extracting data from websites. Web scraping software may access the World Wide
Web directly. Examples: real estate data, hotel data, job listings, and scraping tables
from sites such as Wikipedia or IMDb.
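As a hedged example, pandas can pull HTML tables straight from a page with `read_html`. This assumes network access and an installed HTML parser such as lxml; the URL and table index are illustrative and may change if the page is edited.

```python
import pandas as pd

# read_html returns every <table> on the page as a list of DataFrames
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)
print(tables[0].head())  # the first table; the index may differ if the page changes
```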
Why Learn Python for Data Science?
Python is arguably the best-suited language for a Data Scientist. I have listed a few points
which will help you understand why people choose Python for Data Science:
• Python is a free, flexible and powerful open-source language
• Python significantly cuts development time with its simple and easy-to-read syntax
• With Python, you can perform data manipulation, analysis, and visualization
• Python provides powerful libraries for Machine learning applications and other
scientific computations.
Python Libraries for Data Science
This is where the real power of Python for data science comes into the picture.
Python comes with numerous libraries for scientific computing, analysis, visualization etc.
Some of them are listed below, followed by a short sketch that touches each one:
• NumPy – NumPy, which stands for ‘Numerical Python’, is a core Python library for
Data Science. It is used for scientific computing, contains a powerful
n-dimensional array object and provides tools for integrating with C, C++ etc. It can
also be used as a multi-dimensional container for generic data on which you can
perform various NumPy operations and special functions.
• Matplotlib – Matplotlib is a powerful visualization library in Python. It can be
used in Python scripts, the shell, web application servers and other GUI toolkits. It
supports many different types of plots, and multiple plots can be combined in one figure.
• Seaborn – Seaborn is a statistical plotting library in Python. Whenever you use
Python for data science, you will likely use Matplotlib (for 2D visualizations)
together with Seaborn, which has beautiful default styles and a high-level interface
for drawing statistical graphics.
• Scikit-learn – Scikit-learn is one of the main attractions, as it lets you
implement machine learning in Python. It is a free library which contains
simple and efficient tools for data analysis and mining. You can
implement various algorithms, such as logistic regression and time series
models, using scikit-learn.
• SciPy – SciPy is a free and open-source Python library used for scientific
computing and technical computing. SciPy contains modules for optimization,
linear algebra & integration.
• Pandas – Pandas is an important Python library for data science. It is used for
data manipulation and analysis. It is well suited for different kinds of data, such as
tabular data, ordered and unordered time series, matrix data etc.
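The short sketch below touches each of these libraries on a toy, randomly generated dataset; the data and the model choice are purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LogisticRegression

# NumPy: n-dimensional arrays and vectorised maths on toy data
x = np.linspace(0, 10, 50)
noise = np.random.default_rng(0).normal(0, 1, 50)
y = (x + noise > 5).astype(int)              # a made-up binary target

# pandas: a tabular container for the same data
df = pd.DataFrame({"x": x, "y": y})

# SciPy: a quick statistical summary of one column
print(stats.describe(df["x"]))

# scikit-learn: fit a simple classifier on the toy data
model = LogisticRegression().fit(df[["x"]], df["y"])
print(model.predict(pd.DataFrame({"x": [2.0, 8.0]})))

# Matplotlib + Seaborn: visualise the relationship
sns.scatterplot(data=df, x="x", y="y")
plt.show()
```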
What is Machine Learning?
Well, Machine Learning is a concept which allows a machine to learn from examples and
experience, without being explicitly programmed. So instead of you writing the
code, you feed data to a generic algorithm, and the algorithm/machine builds
the logic based on the given data, as in the small sketch below.
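For example, in the hypothetical sketch below we never write the rule ourselves; we only supply labelled examples (all values invented) and let scikit-learn's decision tree infer it:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled examples: [order_value, account_age_days] -> risky (1) or fine (0)
X = [[900, 2], [850, 5], [700, 3], [50, 400], [80, 700], [60, 365]]
y = [1, 1, 1, 0, 0, 0]

# We never write the "large orders from new accounts are risky" rule ourselves;
# the algorithm infers it from the examples.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[800, 4], [70, 500]]))   # expected: [1 0]
```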
Have you ever shopped online? While checking for a product, did you notice that it
recommends products similar to what you are looking for, or did you notice the “people
who bought this product also bought these” combinations? How do they make these
recommendations? This is machine learning.
Did you ever get a call from a bank or finance company asking you to take a loan or an
insurance policy? Do you think they call everyone? No, they call only a few selected
customers who they think will purchase their product. How do they select them? This is
targeted marketing, and it can be implemented using clustering, as in the sketch below.
This is machine learning.
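A minimal clustering sketch with scikit-learn's KMeans, using invented customer features, to show how such segments might be formed:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: [annual_income_k, avg_monthly_balance_k]
customers = np.array([
    [25, 3], [28, 4], [30, 5],      # low income, low balance
    [90, 40], [95, 45], [88, 38],   # affluent, high balance
])

# Split customers into 2 segments; a campaign could then target just one of them
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(segments)                      # e.g. [0 0 0 1 1 1]
```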
What is Machine Learning? Machine Learning is a subset of artificial intelligence which
focuses mainly on machines learning from their experience and making predictions based
on that experience.
What does it do? It enables computers or machines to make data-driven decisions
rather than being explicitly programmed for carrying out a certain task. These programs
or algorithms are designed in such a way that they learn and improve over time when they
are exposed to new data.
As you know, we are living in a world of humans and machines. Humans have been
evolving and learning from their past experience for millions of years. The era of
machines and robots, on the other hand, has just begun. You can think of it this way:
we are currently living in the primitive age of machines, while the future of machines
is enormous and beyond the scope of our imagination.
In today’s world, these machines or robots have to be programmed before they start
following your instructions. But what if machines started learning on their own from
their experience, worked like us, felt like us, and did things more accurately than we
do? These things sound fascinating, right? Well, remember that this is just the
beginning of a new era.
What Is Deep Learning?
Deep learning is a machine learning technique that teaches computers to do what comes
naturally to humans: learn by example. Deep learning is a key technology behind driverless
cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It is
the key to voice control in consumer devices like phones, tablets, TVs, and hands-free
speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving
results that were not possible before.
In deep learning, a computer model learns to perform classification tasks directly from images,
text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes
exceeding human-level performance. Models are trained by using a large set of labeled data
and neural network architectures that contain many layers.
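A minimal sketch of that idea using Keras (this assumes TensorFlow is installed; the data is synthetic and the architecture is deliberately tiny compared with real deep learning models):

```python
import numpy as np
from tensorflow import keras

# Synthetic "labelled" data: 500 samples with 20 features and a simple hidden rule
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A small network with several layers (real models are far larger and deeper)
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy] on the training data
```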
Two major application areas of deep learning are:
Computer vision focuses on how computers can gain high-level understanding from digital
images or videos. It then translates this data into insights used to drive decision making.
Examples: Google Lens, facial recognition, self-driving cars etc.
Natural Language Processing, usually shortened to NLP, is a branch of artificial
intelligence that deals with the interaction between computers and human language.
Examples: email filters, Google Translate, predictive analysis, smart assistants such as Alexa or Siri
How does Deep Learning attain such impressive results?
In a word, accuracy. Deep learning achieves recognition accuracy at higher levels than
ever before. This helps consumer electronics meet user expectations, and it is crucial
for safety-critical applications like driverless cars. Recent advances in deep learning
have improved to the point where deep learning outperforms humans in some tasks like
classifying objects in images.
While deep learning was first theorized in the 1980s, there are two main reasons it has
only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car
development requires millions of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have
a parallel architecture that is efficient for deep learning. When combined with
clusters or cloud computing, this enables development teams to reduce training time
for a deep learning network from weeks to hours or less.
Examples of Deep Learning at Work
Deep learning applications are used in industries from automated driving to medical
devices.
Automated Driving: Automotive researchers are using deep learning to automatically
detect objects such as stop signs and traffic lights. In addition, deep learning is used to
detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from satellites that
locate areas of interest, and identify safe or unsafe zones for troops.
Medical Research: Cancer researchers are using deep learning to automatically detect
cancer cells. Teams at UCLA built an advanced microscope that yields a high-
dimensional data set used to train a deep learning application to accurately identify
cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety around
heavy machinery by automatically detecting when people or objects are within an
unsafe distance of machines.
Electronics: Deep learning is being used in automated hearing and speech translation.
For example, home assistance devices that respond to your voice and know your
preferences are powered by deep learning applications.
*NOTE
Since R was built as a statistical language, it is much better suited to statistical
learning. Python, on the other hand, is a better choice for machine learning, with its
flexibility for production use.
Language | Cost        | Ease to learn               | Data Handling                          | Visualisation           | Update
Python   | Open source | Easier than R               | Extremely easy                         | Good                    | Faster updates
R        | Open source | Long codes for simple tasks | Slow for large databases, works on RAM | Excels at visualisation | Faster updates
SAS      | Expensive   | Easiest                     | Smooth & stable                        | Merely functional       | Updates at every new rollout
SAS used to be the global leader in corporate data analysis jobs, but open-source
technologies have now taken over.