Piyush Data Science 3
• Vishwkarma
• Pushpak
• Mantra-Tantra-Yantra
Data Analyst vs Data Engineer vs Data Scientist
[Diagram: Transaction Data (OLTP) → ETL (Extract/Transform/Load) → Data Warehouse (OLAP; structured and unstructured data), annotated with the Data Engineer, Data Analyst and Data Scientist roles]
Role Definition
Who?
• Data Scientist: Develops and implements data-driven solutions to overcome business challenges.
• Data Analyst: Collects, cleans, stores and organises data.
• Data Engineer: Builds and maintains the infrastructure and tools needed to collect and store large amounts of data.
Focus
• Data Scientist: Forward-looking (predictive) views of data.
• Data Analyst: Technical analysis of present data.
• Data Engineer: Continuously improving the techniques by which data is consumed.
Role Details
• Data Scientist: Applies supervised, unsupervised and deep learning to classify and regress data. Makes heavy use of neural networks and machine learning for continuous regression analysis.
• Data Analyst: Cleans data, organises raw data, then analyses and visualises it to interpret the results.
• Data Engineer: Works at the back end to build data in an appropriate format. Uses optimized machine learning algorithms to maintain data and make it available in the most appropriate manner.
Skills
Data Scientist
• Machine Learning, Deep Learning (Scikit-learn, TensorFlow)
• Data Visualization (Matplotlib, Seaborn)
• Big Data working knowledge (Spark, Hadoop)
• SQL/NoSQL, Cloud Platforms (AWS, Google Cloud)
• Strong communication and domain knowledge
Data Analyst
• Data Manipulation (Pandas), Data Visualization & Enterprise BI tools (Tableau, Power BI)
• Statistical Analysis, Reporting Tools (Excel, Google Sheets), SAS, SPSS, Business Acumen
• Strong communication, presentation and domain knowledge
Data Engineer
• ETL & Data Modelling Tools, Big Data Technologies (Spark, Hadoop)
• SQL/NoSQL, Data Storage (Redshift, BigQuery)
• Cloud Services (AWS, Azure)
• Data Pipeline Tools (Airflow), Hadoop, Pig, Hive
Responsibilities
• Data Scientist: Takes a data science project from inception to completion, comparable to a solutions architect in the software world. Generates models, uses existing models, fine-tunes them and tunes hyperparameters. Combines domain knowledge, an understanding of customer requirements and technical skills to achieve the goal.
• Data Analyst: A good statistician who visualizes data, creates charts, reports and dashboards, is expert in data visualization tools such as Tableau, Power BI and Excel, and implements requests coming from data scientists or the business.
• Data Engineer: A data architect who brings data from various sources and formats into the required format and data source, so that it can be consumed for storage, analysis, reporting and archiving.
Example
• Data Scientist: Building a predictive model to forecast customer attrition / retention rate; developing a recommendation system for products.
• Data Analyst: Analysing sales data to identify trends and customer segments; creating dashboards to track key metrics.
• Data Engineer: Designing a data warehouse to store customer data from various sources; building ETL (Extract, Transform, Load) processes for data cleansing and integration.
Indicative Time Scale
Learning Resource Details: Gilbert Strang’s Linear Algebra Course (MIT OCW); Linear Algebra Done Right - Sheldon Axler
Fundamentals (3 weeks)
• Vector spaces
• Rank of matrices
• Eigenvalues/eigenvectors
• Singular value decomposition (SVD)
• Matrix factorization
• Projection
• Inner Product Spaces
• Application of Linear Algebra concepts for data
scientist role (high level understanding only)
• Loss function and recommender system for ML
• Word embedding for NLP
• Image convolution for computer vision
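As an illustration of the last application, image convolution can be read as a sliding inner product between a small kernel and each patch of the image. The sketch below is a minimal pure-Python version under stated assumptions: the image is a plain nested list, and the helper name `convolve2d`, the toy image and the kernel are all hypothetical choices made for this example, not part of any library.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2D convolution: no padding, stride 1 (hypothetical helper)."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    # Mathematical convolution flips the kernel along both axes.
    # (Most CNN layers skip this flip and compute cross-correlation.)
    flipped = [row[::-1] for row in kernel[::-1]]
    out: Matrix = []
    for r in range(ih - kh + 1):
        out_row = []
        for c in range(iw - kw + 1):
            # Inner product of the flipped kernel with the image patch at (r, c).
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * flipped[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Toy 4x4 image: dark on the left, bright on the right (a vertical edge).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Horizontal difference kernel: responds where neighbouring pixels differ.
kernel = [[1, -1]]

edges = convolve2d(image, kernel)
for row in edges:
    print(row)
```

Each output row is non-zero only at the column where the brightness changes, which is the basic idea behind edge-detecting filters in computer vision.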
Phase 2: Data Engineer Roadmap
Learn Cloud Platforms and Data Pipeline (20-24 weeks) - Choose a cloud provider (AWS, GCP, or Azure) and gain hands-on experience with its services
Fundamentals (2 weeks)
• Cloud Computing Concepts: Understand fundamental concepts like IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).
• Cloud Providers: Choose a cloud provider (AWS, GCP, or Azure) and familiarize yourself with its services and pricing models.
• Cloud Console: Learn to navigate the cloud provider's console and manage resources.
Data Storage Concepts (5-6 weeks)
• Object Storage: Explore services like S3 (AWS), Blob Storage (Azure), and Cloud Storage (GCP) for storing large amounts of unstructured data.
• Data Lakes: Understand the concept of data lakes and how to implement them using cloud-based services.
• Data Warehouses: Learn about cloud-based data warehouses like Redshift (AWS), BigQuery (GCP), and Synapse Analytics (Azure).
Data Processing (4-5 weeks)
• Serverless Computing: Explore services like Lambda (AWS), Cloud Functions (GCP), and Azure Functions for running code without managing servers.
• ETL Tools: Learn to use cloud-based ETL tools like AWS Glue, Dataflow (GCP), and Azure Data Factory.
• Data Pipelines: Design and implement data pipelines using orchestration tools like Airflow.
Data Analysis and ML part-1 (3-4 weeks)
• Managed Services: Utilize managed services like EMR (AWS), Dataproc (GCP), and HDInsight (Azure) for running big data analytics frameworks like Hadoop and Spark.
• Machine Learning Platforms: Explore platforms like SageMaker (AWS), AI Platform (GCP), and Azure Machine Learning for building and deploying machine learning models.
• Data Visualization: Use cloud-based visualization tools like QuickSight (AWS), Looker Studio (GCP), and Power BI (Azure).
Data Analysis and ML part-2 (3-5 weeks)
• Learn Data Pipelines: Explore tools like Apache Airflow or AWS Glue to build and manage data pipelines.
• Learn ETL Processes: Understand the process of extracting, transforming, and loading data into data warehouses or data lakes.
• Build Data Warehouse: Design and implement a data warehouse using cloud-based services.
• HDFS, MapReduce, Apache Spark
Practical Exercises and Projects (Kaggle)
• Cloud Migration: Migrate existing applications or data to the cloud.
• Data Pipeline Implementation: Build data pipelines to extract, transform, and load data into cloud-based storage.
• Machine Learning Deployment: Deploy machine learning models to the cloud for real-time predictions.
• Cloud Architecture Design: Design cloud architectures for various use cases.
Learning Resource Details
• Data Engineering Basics: https://www.edx.org/learn/data-engineering/ibm-data-engineering-basics-for-everyone?index=product&queryID=764a801e40ccfd42bf011a379c137d3d&position=1&results_level=second-level-results&term=Data+Engineering&objectID=course-f33be2a5-322f-4b9c-9ac5-a89b43080427&campaign=Data+Engineering+Basics+for+Everyone&source=edX&product_category=course&placement_url=https%3A%2F%2Fwww.edx.org%2Fsearch
• https://www.striim.com/blog/guide-to-data-pipelines/
• Ghislain Fourny’s YouTube Lectures (Big Data Systems)
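To make the idea of pipeline orchestration in this roadmap more concrete, the sketch below shows the core job an orchestrator performs: running tasks in an order that respects their dependencies. It is a toy illustration using only Python's standard library (`graphlib`, Python 3.9+) and is not Airflow's API. The task names, the `ctx` dictionary and the `run_pipeline` helper are assumptions made up for this example.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks; each reads from and writes to a shared context dict.
def extract(ctx):
    ctx["raw"] = [" 10", "20 ", "x", "30"]  # pretend these came from a source system


def transform(ctx):
    # Keep only values that are numeric after trimming whitespace.
    ctx["clean"] = [int(v) for v in ctx["raw"] if v.strip().isdigit()]


def load(ctx):
    ctx["warehouse"] = list(ctx["clean"])  # stand-in for writing to storage


def report(ctx):
    ctx["total"] = sum(ctx["warehouse"])


TASKS = {"extract": extract, "transform": transform, "load": load, "report": report}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(tasks, depends_on):
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        tasks[name](ctx)
    return order, ctx


order, ctx = run_pipeline(TASKS, DEPENDS_ON)
print(order)
print(ctx["total"])
```

Real orchestrators add scheduling, retries, logging and distributed execution on top of this dependency ordering, which is why the roadmap treats them as a separate skill.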
Phase 2: Data Engineer Roadmap
ETL and Data Warehouse (14 weeks)
Fundamentals (2 weeks)
• What is ETL? (Overview of Extract, Transform, Load)
• Difference between ETL and ELT
• Introduction to Data Warehousing (OLTP vs. OLAP)
• Star and Snowflake schema
ETL Tools and Platforms (4 weeks)
• Overview of ETL tools: Apache NiFi, Talend, Informatica, Alteryx
• Hands-on practice with Airflow or Prefect (scheduling and orchestrating ETL jobs)
• Data pipeline creation, error handling, and logging
Data Transformation Technique (1 week)
• Data cleaning and normalization
• Deduplication and data validation
• Handling missing values and outliers
Data Loading into Data Warehouse (2 weeks)
• Loading data into relational databases (PostgreSQL, MySQL)
• Loading data into cloud warehouses (Snowflake, Redshift, BigQuery)
• Batch vs. streaming data loading
Learning Resource Details
• Migrate SQL to Azure SQL: https://learn.microsoft.com/en-us/credentials/applied-skills/migrate-sql-workloads-azure-sql-database/
• Azure Data Engineering: https://learn.microsoft.com/en-us/credentials/certifications/azure-data-engineer/?practice-assessment-type=certification#certification-prepare-for-the-exam
• ETL, Dataflows: https://www.edx.org/learn/data-engineering/ibm-building-etl-and-data-pipelines-with-bash-airflow-and-kafka?irclickid=1K4zsKUW2xyKRMXWqM12MzFOUkCUBy1KCXt3Uc0&irgwc=1
• Big Data Computing by Dr Rajiv Mishra (IIT Patna): https://onlinecourses.nptel.ac.in/noc24_cs130/preview
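The ETL topics in this section (extract, clean and deduplicate, handle missing values, load into a star schema, then query it OLAP-style) can be sketched end to end with Python's standard library. This is a minimal batch example under stated assumptions: the sample CSV, the table names `dim_customer` and `fact_sales`, and the choice to drop rows with a missing amount are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: one duplicate order, one missing amount,
# and an inconsistently formatted customer name.
RAW_CSV = """order_id,customer,city,amount
1,Asha,Pune,120.50
2,Ravi,Delhi,
2,Ravi,Delhi,
3,asha ,Pune,79.50
4,Meera,Delhi,200
"""


def extract(text):
    """Extract: parse the CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: deduplicate, drop missing amounts, normalize names and types."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen:  # deduplication on the business key
            continue
        seen.add(order_id)
        if not row["amount"].strip():  # missing value: dropped here (an assumption)
            continue
        clean.append({
            "order_id": order_id,
            "customer": row["customer"].strip().title(),  # normalization
            "city": row["city"].strip(),
            "amount": float(row["amount"]),
        })
    return clean


def load(conn, rows):
    """Load: write one dimension table and one fact table (a tiny star schema)."""
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_id INTEGER PRIMARY KEY,
            name TEXT,
            city TEXT,
            UNIQUE (name, city)
        );
        CREATE TABLE fact_sales (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES dim_customer (customer_id),
            amount REAL
        );
    """)
    for row in rows:
        key = (row["customer"], row["city"])
        conn.execute("INSERT OR IGNORE INTO dim_customer (name, city) VALUES (?, ?)", key)
        (customer_id,) = conn.execute(
            "SELECT customer_id FROM dim_customer WHERE name = ? AND city = ?", key
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_sales (order_id, customer_id, amount) VALUES (?, ?, ?)",
            (row["order_id"], customer_id, row["amount"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows = transform(extract(RAW_CSV))
load(conn, clean_rows)

# OLAP-style query: aggregate the fact table by a dimension attribute.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(sales_by_city)
```

The row-by-row inserts resemble OLTP-style writes, while the final grouped query is the kind of read an OLAP warehouse is designed for. A streaming loader would apply the same transform to records as they arrive instead of to a whole file at once.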
Computer Vision
• Image processing, image classification (CNNs)
• Object detection (YOLO, SSD)
• Object segmentation, transfer learning
• CNN architectures (ResNet, Inception)
Phase 2: Data Scientist Roadmap
Learning Resource Details
Machine Learning with Python https://www.freecodecamp.org/learn/machine-learning-with-python/