IDS - Lecture 1

Introduction to Data Science Lecture.

Intro To Data Science: Introduction

CUST
Fall 2024

Slide credits: MIT AI, Jeffery


IBM Data Science 1
Course Learning Outcomes
Course Code: SE4883
CLO1: Understand basic concepts of data science, statistics and probability
and their application in understanding behavior of data
CLO2: Apply basic tools for performing exploratory data analysis and
visualization
CLO3: Understand basic predictive modeling and data analysis methods
CLO4: Learn Python for performing different data science steps
Reference Material
● Material shared through Oddo
● Text Books
○ Cathy O'Neil and Rachel Schutt. Doing Data Science, Straight
Talk From The Frontline. O'Reilly. 2014. (Text Book)
○ Steven S. Skiena, “The Data Science Design Manual”,
Springer, 2017
● Reference Book
○ David M. Diez and Christopher D. Barr. OpenIntro Statistics.
Third Edition
○ An Introduction to Statistical Learning, by Gareth James,
Daniela Witten, Trevor Hastie, and Rob Tibshirani
Some Useful Links
● https://www.Kaggle.com/
● You will need the following installed on your computer:
○ Python, version 3.8 or higher
○ Pandas, any version that is compatible with your version of Python
○ NumPy, any version that is compatible with your version of Python
● Google Colab/ IBM Watson Studio
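A quick way to confirm that a machine meets the requirements listed above is sketched below. It uses only the standard library, so it runs even when Pandas or NumPy is missing. The function name `check_environment` is an illustrative choice, not part of the course material.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum Python version stated in the course requirements


def check_environment(packages=("pandas", "numpy")):
    """Return a dict mapping each requirement to True (met) or False (not met)."""
    report = {"python>=3.8": sys.version_info[:2] >= MIN_PYTHON}
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'OK' if ok else 'MISSING'}")
```

On Google Colab, both packages are normally preinstalled, so every line should report OK there.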
Assessment
● Quizzes 10%
● Assignments 10%
● Project 20%
● Mid-Term 20%
● Final-Term 40%
Project
● A major component of the class: the goal is to take a real-world problem that you
are interested in and apply data science methodologies to gain insight into, or
solve, a problem in that domain

● Work to be done in groups of 3-4 students

● Class projects must be focused on some real data problem (ideally one that
you collect yourself), not an already-curated data set
How to take course
● The course is interesting and challenging, and the skills are in high demand in industry
● Try to spend extra time learning about the contents discussed during class
● Practice programming as much as possible
● Try to grasp the math involved (there is not too much here)

Last but not least, HAVE FUN Learning!


Defining Data Science and Data Scientists
Basically,
Data science is the field of exploring, manipulating,
and analyzing data, and using data to answer
questions or make recommendations.
Data Science
Some possible definitions

Data science is the application of computational and statistical
techniques to address or gain insight into some problem in the real world

Data science = statistics + data processing + machine learning +
scientific inquiry + visualization + business analytics + big data + …
So what is the process of data science?

So Now, Who Is a Data Scientist?
Data scientists use their skills to extract knowledge and insights from data to solve real-world
problems.
In academia:
• Typically has a scientific background (social science, biology, etc.)
• Works with large datasets
• Addresses challenges related to data structure, size, quality, and complexity
• Applies computational methods to solve problems using data
In industry:
• Designs experiments
• Manages the process of collecting, cleaning, and munging data
• Performs exploratory data analysis, which combines visualization and data sense
• Finds patterns and builds models and algorithms
• Uses analyses for decision making
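The industry steps above (collect, clean, explore, decide) can be sketched in a few lines of Python. The sales figures, the function names, and the 1.2 threshold are all made up for illustration; a real project would use its own data and criteria.

```python
from statistics import mean, median

# Hypothetical raw observations: daily sales with missing and malformed entries.
raw_sales = ["120", "135", "", "n/a", "128", "410", "131"]


def clean(values):
    """Cleaning/munging: keep only entries that parse as numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # drop blanks and placeholders such as "n/a"
    return cleaned


def explore(values):
    """Exploratory analysis: a few summary statistics."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def decide(summary, threshold=1.2):
    """Decision step: flag the data when the mean sits far above the median."""
    if summary["mean"] > threshold * summary["median"]:
        return "investigate outliers"
    return "looks consistent"


sales = clean(raw_sales)
summary = explore(sales)
print(summary)
print(decide(summary))
```

Here the single value 410 pulls the mean well above the median, so the decision step suggests checking for outliers before any modeling.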
Pop Up Quiz!!!
Question!
What is Data?
What is Data?

Data are the quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.
Now, let’s learn the definition of Big Data
Types of Data?

The types of Big Data are:
→ Structured
→ Semi-structured
→ Unstructured

Structured
Any data that can be stored, accessed, and processed in a fixed format is
termed ‘structured’ data.
Over time, computer science has developed effective techniques for working with
this kind of data (where the format is known in advance) and for deriving value
from it. However, problems arise when such data grows very large; typical sizes
are now in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; equivalently, one billion terabytes make up a zettabyte.
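The zettabyte figure can be checked with integer arithmetic. The sketch below assumes the decimal (SI) definitions of the units, where 1 terabyte is 10^12 bytes.

```python
# Decimal (SI) unit sizes in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

# How many terabytes fit in one zettabyte?
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

print(terabytes_per_zettabyte)           # 1000000000
print(terabytes_per_zettabyte == 10**9)  # True: one billion terabytes
```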
Example of Structured Data

An ‘Employee’ table in a database is an example of structured data:

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
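Because the format of structured data is fixed in advance, it can be loaded into a relational table and queried directly. The sketch below puts the Employee rows into an in-memory SQLite database using Python's standard `sqlite3` module. The table name, the lowercase column names, and the choice of query are illustrative assumptions.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE employee (
           employee_id   INTEGER PRIMARY KEY,
           employee_name TEXT NOT NULL,
           gender        TEXT,
           department    TEXT,
           salary        INTEGER
       )"""
)

rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)

# The fixed schema lets us aggregate by column name.
avg_by_dept = dict(
    conn.execute("SELECT department, AVG(salary) FROM employee GROUP BY department")
)
for department, avg_salary in sorted(avg_by_dept.items()):
    print(f"{department}: {avg_salary:.2f}")
```

The same query would not be possible on raw text without first imposing a structure on it.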


Type of Data
Unstructured
Any data whose form or structure is unknown is classified as unstructured data.
In addition to its large size, unstructured data poses multiple challenges when
it is processed to derive value from it.
A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc.
Nowadays organizations have a wealth of data available to them, but unfortunately
they often do not know how to derive value from it because the data is in raw or
unstructured form.
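One common first step in deriving value from unstructured text is to pull out the pieces that do follow a pattern. The sketch below applies regular expressions to a made-up customer note. The note, the email address, and the patterns are illustrative assumptions and are far simpler than what production text processing needs.

```python
import re
from collections import Counter

# Hypothetical free-text note with no predefined fields.
note = (
    "Customer called on 2024-09-12 about a late delivery. "
    "Follow-up email sent to ali@example.com on 2024-09-14; "
    "customer says delivery is still late."
)

# Extract ISO-style dates and email-like strings from the raw text.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", note)

# Count word frequencies as a crude signal of what the note is about.
word_counts = Counter(re.findall(r"[a-z]+", note.lower()))

print(dates)
print(emails)
print(word_counts["late"], word_counts["delivery"])
```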
Example of Unstructured Data
The output returned by ‘Google Search’


Type of Data
Semi-structured
Semi-structured data can contain both forms of data.
Semi-structured data appears structured in form, but it is not defined by a
schema such as a table definition in a relational DBMS.
Examples of semi-structured data are:
• Data represented in an XML file
• Personal data stored in an XML file
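The sketch below shows what personal data in XML might look like and how it can be read with Python's standard `xml.etree.ElementTree` module. The element names and the two people are invented for illustration. Note that the two records do not share the same fields, which is exactly what a fixed table definition would not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical personal records: tags give structure, but fields vary per record.
xml_text = """
<people>
  <person><name>Ayesha Khan</name><age>35</age></person>
  <person><name>Bilal Ahmed</name><city>Islamabad</city></person>
</people>
"""

root = ET.fromstring(xml_text)

# Turn each <person> element into a dict of whatever child tags it contains.
records = [
    {child.tag: child.text for child in person}
    for person in root.findall("person")
]

for record in records:
    print(record)
```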
Related Concepts
Cloud Computing

The term “cloud computing” can be used to describe applications and data that users
access over the Internet rather than on their local computer.
Cloud Computing Services/Service Providers
● Amazon Web Services
● Google Cloud
● IBM Watson Studio
Big Data

Big Data is a collection of data that is huge in volume and still growing
exponentially with time.
• Its size and complexity are so great that no traditional data management tool
can store or process it efficiently.
• Big data is still data, just of enormous size.
Example of Big Data?

The New York Stock Exchange is an example of a Big Data source: it generates about
one terabyte of new trade data per day.
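To get a feel for that rate, the figure of one terabyte per day can be converted into other units. The sketch below assumes decimal units and, for simplicity, data arriving evenly on every calendar day; an exchange trades on fewer days and in bursts, so these are rough averages only.

```python
TERABYTE = 10**12                # bytes, decimal (SI) definition
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

bytes_per_day = 1 * TERABYTE     # figure quoted in the slide

# Average throughput if the data arrived evenly over a full day.
avg_mb_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6

# Total volume accumulated over 365 calendar days.
tb_per_year = 365 * bytes_per_day / TERABYTE

print(f"{avg_mb_per_second:.2f} MB/s on average")
print(f"{tb_per_year:.0f} TB per year")
```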
Characteristics of Big Data

Big data can be described by the following characteristics:
- Volume
- Variety
- Velocity

Volume – The name Big Data itself refers to enormous size. The size of data plays a crucial role in
determining the value that can be derived from it, and whether a particular data set counts as Big Data
at all depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when
dealing with Big Data solutions.

Variety – The next aspect of Big Data is its variety.
- Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
- In earlier days, spreadsheets and databases were the only sources of data considered by most
applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc.
is also considered in analysis applications.
- This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

Velocity – The term ‘velocity’ refers to the speed at which data is generated.
- How fast data is generated and processed to meet demand determines the real potential of the data.
- Big Data velocity deals with the speed at which data flows in from sources such as business processes,
application logs, networks, social media sites, sensors, and mobile devices.
- The flow of data is massive and continuous.
???

What are the 5 V’s of Big data?


Big Data Processing Tools

Big Data processing technologies provide ways to work with
large sets of structured, semi-structured, and
unstructured data so that value can be derived from
big data.
What is Hadoop
Hadoop is an open-source framework based on Java that
manages the storage and processing of large amounts of
data for applications.

Hadoop uses distributed storage and parallel processing to
handle big data and analytics jobs, breaking workloads down
into smaller workloads that can be run at the same time.
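The idea of breaking a workload into smaller pieces that run at the same time can be illustrated with a word count. The sketch below is not Hadoop and does not use any Hadoop API. It only mimics the split, map, and combine pattern on one machine with Python's standard `concurrent.futures` module, and the function names are illustrative. Threads in CPython also do not give true CPU parallelism for this kind of work, so treat it as a model of the structure rather than of the speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one chunk of lines."""
    return Counter(word for line in chunk for word in line.lower().split())


def split_into_chunks(lines, n_chunks):
    """Split the workload into roughly equal smaller workloads."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_word_count(lines, n_workers=2):
    """Run the map step on every chunk concurrently, then combine the results."""
    chunks = split_into_chunks(lines, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)  # reduce step: merge partial counts
    return total


lines = ["big data is big", "data science uses data", "big ideas"]
counts = parallel_word_count(lines)
print(counts["big"], counts["data"], counts["science"])
```

In Hadoop the chunks would live on different machines and the framework would handle scheduling and failures; the combine step plays the role of the final merge.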
Self Learning

What is Apache Hive?


What is Apache Spark?
HomeWork
Data Science and AI
AI
AI is the branch of computer science that includes the development of systems that
can replicate tasks associated with human intelligence.
Machine learning
Machine learning is a subset of AI that uses computer algorithms to learn about data
and make predictions with it
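A very small instance of learning from data and then predicting is fitting a straight line by least squares. The sketch below uses made-up numbers for hours studied and exam scores, and plain Python rather than any machine learning library. It is one simple example of the idea and not a description of machine learning in general.

```python
def fit_line(xs, ys):
    """Learn slope and intercept of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Use the learned line to predict y for a new x."""
    return slope * x + intercept


# Hypothetical training data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_line(hours, scores)
print(slope, intercept)
print(predict(6, slope, intercept))
```

The algorithm is never told the rule linking hours to scores; it estimates the rule from the examples and then applies it to an unseen input.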
Deep Learning
Deep learning is a subset of machine learning that uses layered neural networks to
simulate human decision-making.
Generative AI
Generative artificial intelligence refers to the use of AI to create new content, like text,
images, music, audio, and videos.
Data Science combines math and statistics,
specialized programming, advanced analytics, AI and
machine learning with specific subject matter
expertise to uncover actionable insights hidden in an
organization's data.
-- Leveraging Data Science?
Class Activity
Explore Data Science Job Listing
For this activity, find a data science job posting on a job
board of your choice, such as LinkedIn, Indeed, or Rozi.pk.
Analyze the posting by responding to the following questions and statements.

Identify the following aspects of the data science job post:

1. What is the company name that is advertising the job?
2. What is the job title?
3. Where is the role located?
4. What is the expected salary or salary range?
5. What is the total number of results from the search for the job post?
6. What is one technical responsibility from the job post related to something you
learned about in this course?
7. What are two required technical skills from the job post?
8. What are at least two ideas or concepts you learned about in this course relevant
to these job posts?
