Introduction to Data Science
Rathinaraja Jeyaraj, Ph.D. (RJ)
Post-doctoral fellow,
University of Houston - Victoria,
Texas, USA.
FROM DATA TO KNOWLEDGE
1. Domain knowledge and problem formulation
2. Data engineering – Data engineer
2.1 Capturing (collecting) the data from device/software/application
2.2 Ingesting (transporting) the data to the storage location
2.3 Managing the data (storing and retrieving data from databases/files)
3. Exploratory data analysis (to summarize the main characteristics and behaviour of the data) – Data analyst
4. Visualization (answering the questions – tables, charts, plots, graphs, statistics, rules (if-else), trees) – Data visualization
5. Data pre-processing (preparing the data for feeding into the algorithm)
6. Mathematical modelling (machine learning, data analytics) – Data scientist
6.1 Building the model
6.2 Evaluating (testing) the model
6.3 Is the model good? If not, go back to step 3 or step 4
7. Deploy the model for production – ML (MLOps) engineer
1. Abstract science
2. Social science
3. Natural science
4. Applied science
END-TO-END (E2E) IMPLEMENTATION
1. Domain knowledge and problem formulation for the questions to be answered.
Example: for a web-series recommender system at Netflix,
Domain knowledge – how the platform functions, user activities, the objectives of e-commerce companies, etc.
Question – can we recommend a new web series “W” to subscriber “X” based on their past browsing history?
Problem formulation – identify the variables and objectives of the problem to build an equation that can be solved.
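As a toy illustration of problem formulation, the recommendation question above can be cast as set similarity over watch histories. The histories, the Jaccard measure, and the `recommend` helper are all invented for illustration; production recommenders use far richer models.

```python
# Toy sketch: recommendation formulated as set similarity over watch histories.
# All data here is made up; real systems use far richer models and signals.

def jaccard(a: set, b: set) -> float:
    """Similarity between two users' watch histories (0..1)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical watch histories (the "variables" identified during formulation)
histories = {
    "X": {"drama_1", "thriller_2", "scifi_3"},
    "A": {"drama_1", "thriller_2", "scifi_3", "W"},
    "B": {"comedy_9"},
}

def recommend(user: str) -> set:
    """Recommend titles watched by the most similar other user."""
    others = [u for u in histories if u != user]
    nearest = max(others, key=lambda u: jaccard(histories[user], histories[u]))
    return histories[nearest] - histories[user]

print(recommend("X"))  # → {'W'}
```

User A shares X's entire history plus “W”, so A is the nearest neighbour and “W” becomes the recommendation.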
2. Data engineering
2.1 Capturing (collecting) the data from devices/software/application
From smartphones – Teamscope, Open Data Kit, KoBoToolbox, REDCap, Magpi, Jotform Mobile, CommCare, etc.
Logging tools – Log4j, Loggly, Splunk, Sumo Logic, Sematext, Logstash, Graylog, Papertrail, etc.
IoT tools – Raspberry Pi, sensors, actuators, RFID readers, scanners, temperature recorders, CCTV, etc.
Any applications – Facebook, Instagram, WhatsApp, etc.
2.2 Ingesting (transporting) the data – Kafka, NiFi, Kinesis, Spark, Storm, Syncsort, Flume, Chukwa, Sqoop, Samza, etc.
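As a rough sketch of what an ingestion pipeline does (real deployments would use Kafka, Flume, or similar), here is a toy in-process producer/consumer built on the standard library's queue module; the sensor events are invented:

```python
# Toy in-process analogue of data ingestion: a producer captures events,
# a consumer transports them into a "storage" layer. Real pipelines would
# use Kafka, Flume, etc. across machines; the events here are made up.
import queue
import threading

events = queue.Queue()
storage = []

def producer():
    for i in range(5):
        events.put({"sensor_id": 1, "reading": i})  # captured data point
    events.put(None)  # sentinel: no more data

def consumer():
    while True:
        item = events.get()
        if item is None:
            break
        storage.append(item)  # "ingest" into the storage layer

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(storage))  # → 5
```

The sentinel value is a common way to signal end-of-stream to a consumer; real brokers handle this (and durability, ordering, and retries) for you.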
2.3 Managing the data (storing and retrieving from databases/files)
SQL – MySQL, Oracle, MariaDB, PostgreSQL, Microsoft SQL Server, DB2, etc.
NoSQL – HBase, MongoDB, Cassandra, DynamoDB, Neo4j, etc.
File formats – CSV, XML, JSON, images, videos, etc.
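A minimal sketch of retrieving data from two of the file formats above, using only the Python standard library; the file contents are made up:

```python
# Reading CSV and JSON, two of the formats listed above, with the
# standard library only. The data is invented for illustration.
import csv
import io
import json

csv_text = "name,age\nAda,36\nAlan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))  # each row becomes a dict

json_text = '{"name": "Ada", "age": 36}'
record = json.loads(json_text)  # parsed into a Python dict

print(rows[0]["name"], record["age"])  # → Ada 36
```

In practice pandas (`read_csv`, `read_json`) is the usual entry point, but the standard library is enough to see the idea.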
3. Exploratory data analysis – EDA (to summarize the main characteristics and behaviour of data)
Statistical measures of centre and variation, probability distributions, graphs, charts, plots, etc.
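A small EDA sketch using the standard-library statistics module; the sample values are invented, with one outlier included to show how the mean and median diverge:

```python
# Minimal EDA: summarize centre and variation of a sample.
# The values are made up; 90 is a deliberate outlier.
import statistics

values = [12, 15, 11, 14, 90, 13, 12]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# A large gap between mean and median hints at skew or an outlier,
# which is exactly the kind of behaviour EDA is meant to surface.
print(summary["median"])  # → 13
```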
4. Data pre-processing (preparing the data for modelling) - Data wrangling
Data cleaning – Binning, clustering, regression, normalization, aggregation, etc.
Data transformation – Smoothing, aggregation, normalization, feature extraction, etc.
Data integration – Correlation analysis, etc.
Data reduction – Data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization
Feature engineering – Imputation, categorical encoding, binning, scaling, log transform, feature selection and grouping.
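Three of the pre-processing steps named above (imputation, scaling, and categorical encoding) can be sketched in plain Python; the records are invented for illustration:

```python
# Sketch of imputation, min-max scaling, and one-hot encoding
# on invented records; real pipelines would use pandas/scikit-learn.

records = [
    {"age": 25, "city": "Houston"},
    {"age": None, "city": "Austin"},   # missing value to impute
    {"age": 35, "city": "Houston"},
]

# 1. Imputation: replace missing ages with the mean of observed ages
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# 2. Min-max scaling: map ages into [0, 1]
lo, hi = min(r["age"] for r in records), max(r["age"] for r in records)
for r in records:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

# 3. One-hot encoding of the categorical "city" column
cities = sorted({r["city"] for r in records})
for r in records:
    for c in cities:
        r[f"city_{c}"] = int(r["city"] == c)

print(records[1])
```

The imputed record ends up with age 30.0 (the mean of 25 and 35), a scaled age of 0.5, and one-hot columns `city_Austin=1`, `city_Houston=0`.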
5. Visualization (answering the questions) – Python libraries, Tableau, Power BI, Infogram, ChartBlocks, Datawrapper
The discovered knowledge can be presented as tables, charts, plots, graphs, statistics, rules (if-else), or trees.
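A minimal Matplotlib sketch of answering a question with a chart, assuming Matplotlib is installed; the genre counts are invented:

```python
# Answering a simple question with a chart: how many series were
# watched per genre? Counts are made up; assumes matplotlib is installed.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

genres = ["drama", "thriller", "comedy"]
counts = [12, 7, 4]

fig, ax = plt.subplots()
ax.bar(genres, counts)
ax.set_xlabel("Genre")
ax.set_ylabel("Series watched")
ax.set_title("Viewing by genre")
fig.savefig("viewing_by_genre.png")
```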
6. Mathematical modelling – machine learning (Python libraries, R, Weka, MATLAB)
6.1 Building the model from pre-processed data
6.2 Evaluating (testing) the model
6.3 Is the model good? If not, go back to step 3 or step 4
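Steps 6.1–6.3 can be sketched with a deliberately tiny model: a least-squares line fitted on training data, then evaluated on held-out test data. The data is synthetic (roughly y = 2x + 1):

```python
# Build/evaluate loop with a tiny model: fit y = a*x + b by least squares
# on training data, then measure mean squared error on held-out test data.
# The data is synthetic, generated roughly from y = 2x + 1 plus noise.

train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8)]
test = [(5, 11.1), (6, 12.9)]

# 6.1 Build the model (closed-form least squares for one feature)
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
b = my - a * mx

# 6.2 Evaluate (test) the model on unseen data
mse = sum((a * x + b - y) ** 2 for x, y in test) / len(test)

# 6.3 Decide whether the model is good enough; if not, revisit earlier
# steps (more data, better features, or a different model).
print(round(a, 2))  # → 1.94
```

The fitted slope lands near 2 and the test-set MSE is small, so for this toy data the loop would stop here rather than iterate.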
7. Deploying the model for production – cloud (AWS, Google Cloud), personal computers, smartwatches, etc.
WHAT DO YOU NEED FOR DATA SCIENCE?
Single machine vs distributed system platform for data science
▪ To work in data science on a single machine – Python, Excel, MATLAB, SAS, R, Weka, SQL databases, etc.
▪ To work in data science on distributed systems – Hadoop, Spark, Storm, etc.
To get into data science using Python
▪ Invest your time and gain respective domain/subject knowledge.
▪ Get a grip on the basics of statistics, probability, mathematics (calculus, linear algebra), machine learning, optimization techniques, etc.
▪ Python distribution (Anaconda).
▪ Python programming and IDEs (Jupyter/Spyder, Google Colab).
▪ Math and scientific computing libraries (NumPy/SciPy).
▪ Data pre-processing and managing library (Pandas).
▪ Graphing and visualization library (Matplotlib/Plotly/Seaborn).
▪ Machine learning and deep learning libraries (Scikit-learn, TensorFlow, PyTorch, Keras, Caffe, Theano).
▪ To work on an image dataset for computer vision (OpenCV).
▪ To work on a text dataset for NLP (NLTK).
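As a small taste of the stack above, assuming NumPy is installed, vectorised array operations replace explicit Python loops; the readings are invented:

```python
# Vectorised computation with NumPy: standardise a set of readings
# (subtract the mean, divide by the standard deviation) without a loop.
# The readings are made up for illustration; assumes NumPy is installed.
import numpy as np

readings = np.array([12.0, 15.0, 11.0, 14.0, 13.0])
standardised = (readings - readings.mean()) / readings.std()

print(readings.mean())  # → 13.0
```

After standardisation the array has mean 0 and standard deviation 1, a common pre-processing step before feeding data to many models.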
REQUIREMENTS FOR DATA SCIENCE JOBS
Data science job – expectation: “Now, I am an expert in data science.”
Data science job – reality: “What?”
Data scientist role
▪ An analytical mind and critical thinking to define and work on a wide variety of problems in different domains.
▪ Strong familiarity with algorithm design techniques for a given problem.
▪ Good at statistics, probability, discrete mathematics, calculus, linear algebra, machine learning, optimization techniques, etc.
▪ Good programming knowledge.
▪ Experience in data analytics.
▪ Working knowledge of data science E2E implementation tools.
▪ A PhD is often expected, as doctoral training builds deep domain knowledge.
▪ Ultimately, the role is focused on building models (algorithms) for data analytics.
Data analyst role
▪ Sufficient knowledge of exploratory data analysis tasks.
▪ Hands-on experience in using algorithms (pre-built models), sometimes building algorithms.
▪ A graduate degree is preferred.
Data engineer role
▪ Working knowledge of ETL tools, databases, data warehouses, and distributed file systems for designing data storage plans.
▪ An undergraduate degree is usually sufficient.
Any QUESTIONS?
You can reach me at: jrathinaraja@gmail.com
Personal website: https://jrathinaraja.co.in/