

Day 3 of 50

Python Tools for Every Data Scientist


A step-by-step guide to learning data science, by the Data Science East Africa team.

PREPARED BY DSEA

Python has tools for every stage of the life cycle of a data
science project. Any data science project inherently includes
the following three stages:

Data Collection
Data Modelling
Data Visualization

And Python provides very neat tools for all three of these stages.

Data Collection
Beautiful Soup
When data collection involves scraping data off the web, Python
provides a library called Beautiful Soup (installed as beautifulsoup4,
imported as bs4).

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

This library parses a web page and stores its contents neatly. For example, it
will store the title separately. It will also store all the <a> tags separately, which
provides you with a very neat list of the URLs contained within the page.

Read More: https://pypi.org/project/beautifulsoup4/
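The parsing described above can be sketched as follows. The small inline HTML string is a made-up stand-in for a real scraped page:

```python
from bs4 import BeautifulSoup

# A tiny inline page stands in for a document fetched from the web.
html_doc = """
<html><head><title>DSEA Day 3</title></head>
<body>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The title is stored separately...
title = soup.title.string

# ...and all <a> tags can be collected into a neat list of URLs.
urls = [a["href"] for a in soup.find_all("a")]

print(title)
print(urls)
```

Running this prints the page title followed by the list of hyperlink targets found in the document.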



WGET
Downloading data, especially from the web, is one of the vital tasks of a
data scientist. Wget is a free utility for non-interactive download of files
from the Web. Since it is non-interactive, it can work in the background
even if the user isn't logged in.
It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through
HTTP proxies. So the next time you want to download a website or all the
images from a page, wget is there to assist you.

Read More: https://pypi.org/project/wget/



DATA APIS

Apart from the tools that you need to scrape or download data, you also
need actual data. This is where data APIs help. A number of APIs exist in
Python that let you download data for free, e.g. Alpha Vantage provides
real-time and historical data for global equities, forex, and
cryptocurrencies.

Read More:
https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-apis-application-programming-interfaces-5-apis-a-data-scientist-must-know/
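A typical data API call is just an HTTP request with query parameters. The sketch below targets Alpha Vantage's public query endpoint; the request itself is commented out because it needs a (free) API key, and requests is assumed to be installed:

```python
import requests

BASE_URL = "https://www.alphavantage.co/query"  # Alpha Vantage endpoint

def build_params(symbol, api_key):
    """Assemble the query parameters for a daily time-series request."""
    return {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": api_key,
    }

def daily_prices(symbol, api_key):
    """Fetch daily price data for a ticker and return the parsed JSON."""
    response = requests.get(BASE_URL, params=build_params(symbol, api_key),
                            timeout=10)
    response.raise_for_status()
    return response.json()

# data = daily_prices("IBM", "YOUR_API_KEY")  # needs a free API key
```

Most free data APIs follow this same pattern: a base URL, a handful of query parameters, and a JSON response.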

Data Modelling
IMBALANCED LEARNING

Imbalanced-learn is one such tool to balance
datasets. A dataset is imbalanced when one class
or category of data has disproportionately more
samples than the other categories. This can cause
huge problems for classification algorithms,
which may end up being biased towards the class
that has more data.

Read More: http://glemaitre.github.io/imbalanced-learn/install.html
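The simplest balancing strategy, random oversampling, can be illustrated with plain NumPy: resample the minority class with replacement until the classes are the same size. Imbalanced-learn packages this (and smarter variants such as SMOTE) behind a scikit-learn-style fit_resample API; the toy dataset below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy dataset: 10 samples of class 0, only 2 of class 1.
X = np.arange(12).reshape(12, 1)
y = np.array([0] * 10 + [1] * 2)

# Random oversampling: draw minority-class rows with replacement
# until both classes have the same number of samples.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=10 - len(minority), replace=True)

X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

print(np.bincount(y_res))  # [10 10]
```

After resampling, a classifier trained on (X_res, y_res) sees both classes equally often, removing the bias described above.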
SCIPY ECOSYSTEM — NUMPY

The actual data processing or modelling
happens through Python's SciPy stack. Python's
SciPy Stack is a collection of software
specifically designed for scientific computing in
Python. The SciPy ecosystem contains a lot of
useful libraries, but NumPy is arguably the most
powerful tool among them.

The most fundamental package, around which the
scientific computation stack is built, is NumPy,
which stands for Numerical Python.

It provides an abundance of useful features for
operations on arrays and matrices. Anyone who has used
MATLAB will immediately realize that NumPy is not
only as powerful as MATLAB but also very similar
in its operation.
Read More: https://numpy.org/
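A few lines show the MATLAB-style matrix operations mentioned above: matrix products, transposes, and linear solves all come built in:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 1.0])

print(A @ A)   # matrix product
print(A.T)     # transpose

# Solve the linear system A x = b.
x = np.linalg.solve(A, b)
print(x)       # [-1.  1.]
```

Because these operations run in compiled code on whole arrays at once, they are far faster than equivalent Python loops.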

Data Visualization
MATPLOTLIB

Another package from the SciPy ecosystem,
tailored for the generation of simple and
powerful visualizations with ease, is Matplotlib. It
is a 2D plotting library which produces
publication-quality figures in a variety of
hardcopy formats.

Read More:

https://matplotlib.org/stable/index.html
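A minimal sketch of the workflow: plot a curve, label the axes, and save a hardcopy figure. The Agg backend is selected so the script renders straight to a file without needing a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to file
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

fig.savefig("sine.png", dpi=150)  # hardcopy output, e.g. PNG or PDF
```

Changing the filename extension (sine.pdf, sine.svg) is all it takes to switch output formats.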
SEABORN

Seaborn is a Python data visualization library
based on Matplotlib. It provides a high-level
interface for drawing attractive and informative
statistical graphics, and is especially well suited
to visualizations such as heat maps.

Read More:
https://seaborn.pydata.org/
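The heat-map case mentioned above takes a single call. The 5x5 matrix here is random toy data standing in for, say, a correlation matrix:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Toy matrix to visualize; a correlation matrix is a typical real input.
rng = np.random.default_rng(0)
data = rng.random((5, 5))

ax = sns.heatmap(data, annot=True, fmt=".2f", cmap="viridis")
ax.figure.savefig("heatmap.png")
```

Because Seaborn draws onto Matplotlib axes, the figure can still be customized and saved with the usual Matplotlib calls.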
MOVIEPY

MoviePy is a Python library for video editing:
cutting, concatenations, title insertions, video
compositing, video processing, and the creation of
custom effects. It can read and write all common
audio and video formats, including GIF.

Read More: https://pypi.org/project/moviepy/



Happy Learning!
DATA SCIENCE EAST AFRICA TEAM.

Do you have any questions?



We're always here for you
datsscienceeastafrica@gmail.com
Twitter: DataScience_EA, TechMadi
