I
Day 3 of 50
Python Tools for Every Data Scientist.
A step-by-step guide to learning data science, by the Data Science East Africa team.
PREPARED BY DSEA
II
Python has tools for every stage of the life cycle of a data
science project. Every data science project inherently
includes the following three stages:
Data Collection
Data Modelling
Data Visualization
Python provides neat tools for all three of these stages.
III
Data Collection
Beautiful Soup
When data collection involves scraping data off the web, Python
provides a library called Beautiful Soup.
from bs4 import BeautifulSoup

# html_doc is a string containing the HTML of the page to parse
soup = BeautifulSoup(html_doc, 'html.parser')
This library parses a web page and stores its contents in a structure that is easy
to navigate. For example, the title is available on its own, and you can collect
all the <a> tags, which gives you a neat list of the URLs contained within the page.
Read More: https://pypi.org/project/beautifulsoup4/
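As a minimal sketch of the idea above, the snippet below parses a small hand-written HTML string and pulls out the title and the link targets. The page content is made up purely for illustration; in a real project the HTML would come from a downloaded page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The title can be read on its own.
page_title = soup.title.string

# find_all("a") returns every <a> tag; .get("href") reads its link target.
urls = [a.get("href") for a in soup.find_all("a")]

print(page_title)
print(urls)
```

The same two calls work on any HTML string, so the only part that changes in practice is how html_doc is obtained.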
WGET
Downloading data, especially from the web, is one of the vital tasks of a
data scientist. Wget is a free utility for non-interactive download of files
from the web. Because it is non-interactive, it can keep working in the background
even if the user isn’t logged in.
It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through
HTTP proxies. So the next time you want to download a website or all the
images from a page, wget is there to assist you.
Read More: https://pypi.org/project/wget/
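To illustrate what a non-interactive download looks like from Python, here is a minimal sketch. It deliberately swaps in the standard library's urllib.request.urlretrieve so that nothing extra has to be installed; it is not the wget tool itself. The helper name download_file is hypothetical and chosen only for this example.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, dest: str) -> Path:
    """Fetch url and save it to dest without any user interaction."""
    saved_path, _headers = urlretrieve(url, dest)
    return Path(saved_path)
```

A call such as download_file("https://example.com/", "example.html") would save the page to a local file (the URL here is only a placeholder). The wget package linked above offers a similar one-call style through wget.download(url).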
DATA APIS
Apart from the tools that you need to scrape or download data, you also
need actual data. This is where data APIs help. A number of data APIs can be
accessed from Python and let you download data for free. For example,
Alpha Vantage provides real-time and historical data for global equities,
forex, and cryptocurrencies.
Read More:
https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-apis-application-programming-interfaces-5-apis-a-data-scientist-must-know/
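A request to a data API usually comes down to building a URL with a few query parameters and decoding the JSON that comes back. The sketch below shows that pattern for Alpha Vantage. The endpoint and the parameter names (function, symbol, apikey) are assumptions based on the provider's public documentation and may change, YOUR_API_KEY is a placeholder, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check the provider's documentation before relying on it.
BASE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol: str, api_key: str) -> str:
    """Build the query URL for the daily price series of one ticker."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Download the URL and decode the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Only the URL is built here; no request is sent.
url = build_daily_url("IBM", "YOUR_API_KEY")
print(url)
```

With a real key, fetch_json(url) would return the decoded response as a Python dictionary, ready to be loaded into the rest of your pipeline.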
IV
Data Modelling
IMBALANCED LEARNING
Imbalanced-learn is a tool for balancing
datasets. A dataset is imbalanced when one class
or category of data has disproportionately more
samples than the other categories. This can cause
serious problems for classification algorithms,
which may end up biased towards the class
that has more data.
Read More:
http://glemaitre.github.io/imbalanced-learn/install.html
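To make the idea of balancing concrete, here is a minimal sketch of random oversampling written with NumPy only. It is an illustration of the technique, not imbalanced-learn's own code, and the function name random_oversample is hypothetical. Imbalanced-learn ships ready-made samplers for this, such as RandomOverSampler with its fit_resample(X, y) method.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw extra row indices (with replacement) for under-represented classes.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))

    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Toy data: 8 samples of class 0 and only 2 samples of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

After resampling, both classes have the same number of rows, so a classifier no longer sees one class four times as often as the other.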
IV SCIPY ECOSYSTEM — NUMPY
The actual data processing or modelling
happens through Python’s SciPy stack. The
SciPy stack is a collection of software
specifically designed for scientific computing in
Python. The SciPy ecosystem contains a lot of
useful libraries, but NumPy is arguably the most
powerful tool among them all.
NumPy, short for Numerical Python, is the most
fundamental package, around which the scientific
computing stack is built.
It provides an abundance of useful features for
operations on matrices. Anyone who has used
MATLAB will quickly notice that NumPy is not
only comparably powerful but also very similar
in its operation.
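A few of those matrix operations, as a minimal sketch. The small matrix and vector are arbitrary values chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])

product = A @ b              # matrix-vector product
transposed = A.T             # transpose
scaled = A * 2               # element-wise arithmetic, no explicit loop
x = np.linalg.solve(A, b)    # solve the linear system A x = b

print(product)
print(x)
```

Each line replaces what would otherwise be a nested Python loop, which is where most of NumPy's convenience and speed comes from.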
READ MORE:
https://pandas.pydata.org/
Data Visualization
MATPLOTLIB
Matplotlib is another package from the SciPy
ecosystem, tailored for generating simple yet
powerful visualizations with ease. It
is a 2D plotting library that produces
publication-quality figures in a variety of
hard-copy formats.
Read More:
https://matplotlib.org/stable/index.html
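Here is a minimal sketch of a simple Matplotlib figure saved in two hard-copy formats. The curves, the file names, and the use of a temporary output folder are all choices made for this example only.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Sine and cosine")
ax.legend()

# The file extension selects the output format (here PNG and PDF).
out_dir = Path(tempfile.mkdtemp())
png_path = out_dir / "sine_cosine.png"
pdf_path = out_dir / "sine_cosine.pdf"
fig.savefig(png_path, dpi=300)
fig.savefig(pdf_path)
print(png_path)
```

In a notebook or an interactive session you would normally skip the Agg line and call plt.show() instead of saving to a file.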
SEABORN
Seaborn is a Python data visualization library
based on Matplotlib. It primarily provides a
high-level interface for drawing attractive and
informative statistical graphics.
It is well suited to statistical visualizations
such as heat maps.
Read More:
https://seaborn.pydata.org/
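A heat map is a good first Seaborn plot to try. The sketch below draws one for a correlation matrix. The data is randomly generated and the feature names are hypothetical, so the picture itself carries no real meaning.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import seaborn as sns

# Synthetic data: 200 samples of 4 made-up features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
labels = ["feature_1", "feature_2", "feature_3", "feature_4"]

# Correlation between the columns, a 4 x 4 matrix.
corr = np.corrcoef(data, rowvar=False)

ax = sns.heatmap(
    corr,
    annot=True,          # write each value inside its cell
    fmt=".2f",
    vmin=-1,
    vmax=1,
    cmap="coolwarm",
    xticklabels=labels,
    yticklabels=labels,
)
ax.set_title("Correlation heat map (synthetic data)")
```

Because heatmap returns a regular Matplotlib Axes, the figure can be saved afterwards with ax.figure.savefig("heatmap.png").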
MOVIEPY
MoviePy is a Python library for video editing —
cutting, concatenations, title insertions, video
compositing, video processing, and creation of
custom effects. It can read and write all common
audio and video formats, including GIF.
Read More: https://pypi.org/project/moviepy/
V
Happy Learning!
DATA SCIENCE EAST AFRICA TEAM.
Do you have
any questions?
We're always here for you
datsscienceeastafrica@gmail.com
Twitter: DataScience_EA, TechMadi