File

Uploaded by

Harsh Ojha

ACKNOWLEDGEMENT

The work in this report is the outcome of continuous work over a period of time and drew intellectual support from Internshala and other sources. I would like to express my profound gratitude and indebtedness to Internshala, which helped me complete the training. I am thankful to the Internshala Training Associates for teaching and assisting me in making the training successful.

Mayank
18001004030
MED, 7th Sem
DCRUST MURTHAL

Downloaded by Harsh Ojha

1. ABOUT TRAINING

* NAME OF TRAINING: DATA SCIENCE
* HOSTING INSTITUTION: INTERNSHALA
* DATES: From 1st July 2021 to 12th August 2021

2. ABOUT INTERNSHALA

Internshala is an internship and online training platform based in Gurgaon, India, founded in 2011 by Sarvesh Agrawal, an IIT Madras alumnus. The site offers internship search and posting, and other career services such as counselling, cover-letter writing, resume building and training programs for students.

3. OBJECTIVES

To explore, sort and analyse large volumes of data from various sources in order to reach conclusions that optimize business processes and support decisions. Examples include machine maintenance (predictive maintenance) and, in marketing and sales, sales forecasting based on weather.

4. DATA SCIENCE

Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer science to study and evaluate data. The key objective of Data Science is to extract valuable information for use in strategic decision making, product development, trend analysis, and forecasting. Data Science concepts and processes are mostly derived from data engineering, statistics, programming, social engineering, data warehousing, machine learning, and natural language processing. The key techniques in use are data mining, big data analysis, data extraction and data retrieval.

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In turn, these systems generate insights which analysts and business users can translate into tangible business value.

DATA SCIENCE PROCESS:

1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it's time to prepare it. This includes transforming the data from a raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You'll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.
5. Finally, we get to the sexiest part: model building (often referred to as "data modeling"). It is now that you attempt to gain the insights or make the predictions stated in your project charter.
Now is the time to bring out the heavy guns, but remember that research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you've done this phase right, you're almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.

5. MY LEARNINGS

1) INTRODUCTION TO DATA SCIENCE

* Overview & Terminologies in Data Science
* Applications of Data Science
  - Anomaly detection (fraud, disease, etc.)
  - Automation and decision-making (credit worthiness, etc.)
  - Classifications (classifying emails as "important" or "junk")
  - Forecasting (sales, revenue, etc.)
  - Pattern detection (weather patterns, financial market patterns, etc.)
  - Recognition (facial, voice, text, etc.)
  - Recommendations (based on learned preferences, recommendation engines can refer you to movies, restaurants and books you may like)

2) PYTHON FOR DATA SCIENCE

Introduction to Python, Understanding Operators, Variables and Data Types, Conditional Statements, Looping Constructs, Functions, Data Structures, Lists, Dictionaries, Understanding Standard Libraries in Python, Reading a CSV File in Python, Data Frames and Basic Operations with Data Frames, Indexing Data Frames.
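The last few topics of module 2 (reading a CSV file, basic Data Frame operations and indexing) can be sketched with pandas as follows. This is only an illustration: the inline CSV text and the column names Name, Subject and Marks are made up here, and a real script would pass a file path to read_csv instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Subject,Marks
Asha,Maths,82
Ravi,Physics,74
Meena,Maths,91
"""

# With a real file this would be pd.read_csv("some_file.csv").
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)                     # first two rows
marks = df["Marks"]                         # select one column (a Series)
second_row = df.iloc[1]                     # position-based indexing
first_name = df.loc[0, "Name"]              # label-based indexing
maths_only = df[df["Subject"] == "Maths"]   # boolean indexing (row filter)

print(first_rows)
print(maths_only)
```

Here iloc selects by position and loc by label; with the default integer index the two look alike, but they differ once the index is set to something else.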
3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE

Introduction to Statistics, Measures of Central Tendency, Understanding the Spread of Data, Data Distribution, Introduction to Probability, Probabilities of Discrete and Continuous Variables, Normal Distribution, Introduction to Inferential Statistics, Understanding the Confidence Interval and Margin of Error, Hypothesis Testing, Various Tests, Correlation.

4) PREDICTIVE MODELING AND BASICS OF MACHINE LEARNING

Introduction to Predictive Modeling, Types and Stages of Predictive Models, Hypothesis Generation, Data Extraction and Exploration, Variable Identification, Univariate Analysis for Continuous Variables and Categorical Variables, Bivariate Analysis, Treating Missing Values and Outliers, Transforming the Variables, Basics of Model Building, Linear and Logistic Regression, Decision Trees, K-means Algorithms in Python.

Summary of Procedure of Analyzing Data:

Data science generally has a five-stage life cycle that consists of:

* Capture: data entry, signal reception, data extraction.
* Maintain: data cleansing, data staging, data processing.
* Process: data mining, clustering/classification, data modelling.
* Analyse: predictive analysis, regression.
* Communicate: data reporting, data visualization.

Introduction to Data Science

Data Science
The field of bringing insights from data using scientific techniques is called data science.

Applications
* Amazon Go: no checkout lines.
* Computer Vision: the advancement in recognizing an image by a computer involves processing large sets of image data from multiple objects of the same category. For example, face recognition.

Spectrum of Business Analysis

[Figure: Spectrum of Business Analysis, ordered by value added to the organization: 1. Reporting (What happened?), Detective Analysis (Why did it happen?), 3. Dashboards (What's happening now?), 2. Predictive Analysis (What is likely to happen?), 4. Big Data (What can happen, given that data is collected and used?).]

Reporting / Management Information System
To track what is happening in the organization.

Detective Analysis
Asking questions based on the data we are seeing, such as why something happened.

Dashboard / Business Intelligence
The utopia of reporting: every action in the business is reflected on screen.

Predictive Modelling
Using past data to predict what is happening at a granular level.

Big Data
The stage where the complexity of handling data goes beyond traditional systems. This can be caused by the volume, variety or velocity of the data. Specific tools are needed to analyse data at such a scale.

Applications of Data Science

* Recommendation System. Example: on Amazon, recommendations differ from user to user according to their past searches.
* Social Media
  1. Recommendation Engine
  2. Ad Placement
  3. Sentiment Analysis
* Deciding the right credit limit for credit card customers.
* Suggesting the right products on e-commerce sites
  1. Recommendation System
  2. Past Data Searched
  3. Discount Price Optimization
* How Google and other search engines know which results are most relevant to our search query
  1. Apply ML and Data Science
  2. Fraud Detection
  3. Ad Placement
  4. Personalized Search Results

Python Introduction

Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Python for Data Science: Why Python?

* Python is an open-source language.
* Its syntax is as simple as English.
* It has a very large and collaborative developer community.
* It has extensive packages.

UNDERSTANDING OPERATORS:
Theory of operators: Operators are symbolic representations of mathematical tasks.
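To make the idea of operators concrete, here is a small sketch of Python's arithmetic, comparison and logical operators. The variable names and values are arbitrary and chosen only for illustration.

```python
a, b = 7, 2

# Arithmetic operators
total = a + b        # addition
quotient = a / b     # true division, always returns a float
floor_q = a // b     # floor division
remainder = a % b    # modulus (remainder)
power = a ** b       # exponentiation

# Comparison operators produce booleans
is_greater = a > b
is_equal = a == b

# Logical operators combine booleans
both = is_greater and not is_equal

print(total, quotient, floor_q, remainder, power, both)
```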
VARIABLES AND DATATYPES:
Variables are names bound to objects. Data types in Python include int (integer), float, bool (Boolean) and str (string).

CONDITIONAL STATEMENTS:
* if-else statements (single condition)
* if-elif-else statements (multiple conditions)

LOOPING CONSTRUCTS:
* for loop

FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem. There are two types: built-in functions and user-defined functions. Once defined, a function can be called as many times as needed.

DATA STRUCTURES:
Two types of data structures:
* LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square brackets [].
* DICTIONARY: A dictionary is a data structure whose elements are stored as key: value pairs, separated by commas and enclosed within curly braces {}. Elements are looked up by key rather than by position.

Statistics

Descriptive Statistics

Mode
The value which occurs most frequently in the data series. It is robust and is generally not affected much by the addition of a couple of new values.

    import pandas as pd
    data = pd.read_csv("Mode.csv")        # reads data from the CSV file
    data.head()                           # shows the first five rows
    mode_data = data['Subject'].mode()    # mode of the Subject column
    print(mode_data)

Mean
The arithmetic average of the values.

    import pandas as pd
    data = pd.read_csv("mean.csv")              # reads data from the CSV file
    data.head()                                 # shows the first five rows
    mean_data = data['Overallmarks'].mean()     # mean of the Overallmarks column
    print(mean_data)

Median
The middle value of the sorted data set.

    import pandas as pd
    data = pd.read_csv("data.csv")                # reads data from the CSV file
    data.head()                                   # shows the first five rows
    median_data = data['Overallmarks'].median()   # median of the Overallmarks column
    print(median_data)

Types of variables
* Continuous: takes continuous numeric values. E.g. marks.
* Categorical: takes discrete values. E.g. gender.
* Ordinal: ordered categorical variables. E.g. teacher feedback.
* Nominal: unordered categorical variables. E.g. gender.

Outliers
Any value which falls far outside the typical range of the data is termed an outlier. E.g. 9700 instead of 97.

Reasons for Outliers
* Typos: made during collection. E.g. adding an extra zero by mistake.
* Measurement Error: outliers in the data due to the measurement operator being faulty.
* Intentional Error: errors which are induced intentionally. E.g. claiming a smaller amount of alcohol consumed than the actual amount.
* Legit Outlier: values which are not actually errors but are in the data for legitimate reasons. E.g. a CEO's salary might actually be high compared to other employees.

Interquartile Range (IQR)
The difference between the third and first quartiles (IQR = Q3 - Q1). It is robust to outliers.

Histograms
Histograms depict the underlying frequency of a set of discrete or continuous data that are measured on an interval scale.

    import pandas as pd
    import matplotlib.pyplot as plt
    histogram = pd.read_csv("histogram.csv")
    # %matplotlib inline   (Jupyter notebooks only)
    plt.hist(x='Overall Marks', data=histogram)
    plt.show()

Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample data.

Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.

    Decision Made                    Null Hypothesis is True    Null Hypothesis is False
    Reject Null Hypothesis           Type I Error               Correct Decision
    Do not Reject Null Hypothesis    Correct Decision           Type II Error

T Test
Used when we have just a sample and not population statistics. The sample standard deviation is used to estimate the population standard deviation. The t test is more prone to errors, because we just have samples.
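The t test described above can be sketched as a one-sample t statistic computed by hand with the standard library. The sample values and the hypothesised population mean mu0 are invented for illustration; the report does not say which variant of the t test the training used, so the one-sample form is an assumption.

```python
import math
import statistics

# Hypothetical sample of marks; mu0 is an assumed population mean under H0.
sample = [72, 68, 75, 71, 69, 74, 73, 70]
mu0 = 70

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)   # estimated standard deviation of the mean

t_stat = (sample_mean - mu0) / standard_error
dof = n - 1                                 # degrees of freedom

print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic would then be compared with the critical value from a t table for the chosen significance level and degrees of freedom to decide whether to reject H0.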
Z Score
The distance of the observed value from the mean, measured in number of standard deviations, is the standard score or z score: z = (x - mean) / standard deviation. A positive z value is above the mean; a negative z value is below the mean. Once converted to z scores, the distribution always has the same shape as the original distribution.

Chi-Squared Test
Used to test categorical variables.

Correlation
Determines the relationship between two variables. It is denoted by r. The value ranges from -1 to +1; 0 means no linear relation.

    import pandas as pd
    import numpy as np
    data = pd.read_csv("data.csv")
    data.corr()

Predictive Modelling
Making use of past data and attributes, we predict the future. E.g.:

    Past      Horror Movies
    Future    Unwatched Horror Movies

Predicting stock price movement
1. Analysing past stock prices.
2. Analysing similar stocks.
3. Future stock price required.

Types

1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values.
* Regression: the response has continuous possible values. E.g. marks.
* Classification: the response has a discrete set of possible values (only two in the binary case). E.g. cancer prediction is either 0 or 1.

2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.
* Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.
* Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Stages of Predictive Modelling
1. Problem Definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation

Problem Definition
Identify the right problem statement; ideally, formulate the problem mathematically.

Hypothesis Generation
List all possible variables which might influence the problem objective. These variables should be free from personal bias and preferences. The quality of the model is directly proportional to the quality of the hypotheses.

Data Extraction/Collection
Collect data from different sources and combine them for exploration and model building. While looking at the data we might come across new hypotheses.

Data Exploration and Transformation
Data extraction is a process that involves retrieval of data from various sources for further data processing or data storage.

Steps of Data Exploration and Transformation
* Reading the data. E.g. from a CSV file
* Variable identification
* Univariate analysis
* Bivariate analysis
* Missing value treatment
* Outlier treatment
* Variable transformation

Variable Identification
It is the process of identifying whether a variable is
1. an independent or dependent variable
2. a continuous or categorical variable

Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques apply to categorical and continuous data.

Categorical variable: stored as object.
Continuous variable: stored as int or float.

Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense of that summary to discover insights, anomalies, etc.

Bivariate Analysis
* When two variables are studied together for their empirical relationship.
* When you want to see whether the two variables are associated with each other.
* It helps in prediction and detecting anomalies.
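For two continuous variables, one common way to carry out the bivariate analysis described above is the Pearson correlation coefficient r. The sketch below computes it by hand so the formula is visible; the hours and marks values are hypothetical, and pearson_r is a helper written for this example rather than a library function.

```python
import math

# Hypothetical paired observations for two continuous variables.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(hours, marks)
print(f"r = {r:.4f}")
```

A value of r close to +1 or -1 suggests a strong linear association, which is what makes one variable useful for predicting the other; a pair of points that sits far from the overall trend is a candidate anomaly.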
Missing Value Treatment

Reasons for missing values
1. Non-response. E.g. when you collect data on people's income and many choose not to answer.
2. Error in data collection. E.g. faulty data.
3. Error in data reading.

Types of missing values
1. MCAR (Missing completely at random): Missing values have no relation to the variable in which the missing values exist or to the other variables in the dataset.
2. MAR (Missing at random): Missing values have no relation to the variable in which the missing values exist, but are related to other variables in the dataset.
3. MNAR (Missing not at random): Missing values are related to the variable in which the missing values exist.

Identifying missing values
Syntax:
1. describe()
2. isnull()
The output will be True or False.

Different methods to deal with missing values
1. Imputation
   Continuous: impute with the mean, the median or a regression model.
   Categorical: impute with the mode or a classification model.
2. Deletion
   Row-wise or column-wise deletion, but it leads to loss of data.

Outlier Treatment

Reasons for Outliers
1. Data entry errors
2. Measurement errors
3. Processing errors
4. Change in the underlying population

Types of Outlier
Univariate: Analysing only one variable for outliers. E.g. in a box plot of height and weight, weight will be analysed for outliers.
Bivariate: Analysing both variables for outliers. E.g. in a scatter plot of height and weight, both will be analysed.

Identifying Outliers
Graphical Method
* Box Plot
* Scatter Plot
Formula Method
Using Box Plot
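The text breaks off before giving the box-plot formula, so the sketch below falls back on the widely used convention of flagging values more than 1.5 x IQR below the first quartile or above the third quartile. That threshold is an assumption here, not something taken from the report. The example also shows the isnull() check and median imputation mentioned under missing value treatment; the height and weight figures are invented.

```python
import pandas as pd

# Hypothetical data: one missing weight and one suspicious entry (9700 instead of 97).
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180],
    "weight": [50, 58, None, 70, 9700, 82],
})

# Identify missing values: True where a value is missing.
missing_mask = df["weight"].isnull()

# Impute the continuous variable with its median (robust to the outlier).
df["weight"] = df["weight"].fillna(df["weight"].median())

# Box-plot rule (assumed 1.5 * IQR convention).
q1 = df["weight"].quantile(0.25)
q3 = df["weight"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df["weight"] < lower) | (df["weight"] > upper)]
print(outliers)
```

The median is used for imputation rather than the mean because a single extreme value such as 9700 would pull the mean far away from the typical weights.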
