Introduction to Open Data and Data Science
Open Data Learn-up
@_opendatahack
About opendatahack.org
Open Data Hack is a collaborative effort to solve day-to-day difficulties
faced by local communities, civic bodies and non-profit institutions.
Technologists, designers, innovators and government bodies with strong
social insight come together for a day to build technology-based
solutions using freely accessible open data.
Our current projects in India
● Real-time environment vitals monitoring system with suggestions
● Health factors heat map of urban localities in India
● Mapping of all quality abortion clinics in India
Introduction to Open Data
What is Data?
A collection of facts, information and statistics that
can be analyzed to develop new knowledge
What is Open Data?
Definition by OKF
A piece of data or content is open if anyone is
free to use, reuse, and redistribute it -
subject only, at most, to the requirement to
attribute and/or share-alike.
Definition by ODI
Open data is data that is made available by
organizations, businesses and individuals for
anyone to access, use and share.
Let’s define it...
Open Data is accessible public data that people,
companies and organizations can use to launch new
ventures, analyze patterns and trends, make data-driven
decisions, and solve complex problems.
Benefits of Open Data
● Data Driven Decision Making
● Performance Measurement
● Reduction of Government Costs
● Support an Open Government Initiative
– e.g. Transparency
● Economic Development
● Increased Citizen Engagement
● Talent Attraction / Retention
Types of Open Data
● Government data
● Commercial data
● Crowdsourced data
A few Open Data projects...
Open Data sources
Open Data Licenses
● Open Data Commons Public Domain Dedication
and Licence (ODC PDDL) – Public domain
● Creative Commons CCZero – Public domain
● Open Data Commons Attribution License –
Attribution for data(bases)
● Open Data Commons Open Database License
(ODbL) – Attribution-ShareAlike for data(bases)
Introduction to Data Science
What is Data Science?
Data science ~ computer science +
mathematics/statistics + visualization
Data is just like crude oil
● It’s valuable, but if unrefined it cannot really be used.
● Crude oil has to be changed into gas, plastic, chemicals, etc.
to create a valuable entity that drives profitable
activity.
- Data must be broken down and analyzed for it to
have value.
Outline
● Harvesting
● Cleaning
● Analyzing
● Visualizing
● Publishing
Data harvesting
● Locally available data
● Data dumps from the Web
● Data through Web APIs
● Structured data in Web documents
Data cleansing
● Harvested data may come with lots of noise or
interesting anomalies.
● The goal is to provide a structured representation for
analysis, such as:
- A network (graph)
- Values with dimensions
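A minimal sketch of the cleaning step, using only the Python standard library. The rows, the field names (city, pm25) and the rule that negative readings are noise are all assumptions made for illustration, not part of the original slides.

```python
# Hypothetical raw readings, as they might come out of a scraped page or a CSV dump.
raw_rows = [
    {"city": " Delhi ", "pm25": "153"},
    {"city": "Mumbai", "pm25": "n/a"},    # missing value
    {"city": "delhi", "pm25": "147.5"},   # inconsistent capitalization
    {"city": "Pune", "pm25": "-1"},       # implausible reading (assumed rule)
]


def clean_rows(rows):
    """Return rows with normalized city names and numeric, non-negative pm25."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        try:
            value = float(row["pm25"])
        except ValueError:
            continue  # drop rows whose value is not a number
        if value < 0:
            continue  # drop readings outside the assumed plausible range
        cleaned.append({"city": city, "pm25": value})
    return cleaned


cleaned = clean_rows(raw_rows)
```

Whether a bad row should be dropped, repaired or flagged depends on the dataset; dropping is simply the shortest option to show here.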
Data Science Tools
Data harvesting
● urllib & BeautifulSoup
● Scrapy
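A rough sketch of pulling structured data out of a Web document. To keep it dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup, and it parses an inline HTML string instead of downloading a page with urllib. The table contents and the TableParser class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; in practice the HTML would be downloaded first.
HTML = """
<table>
  <tr><th>City</th><th>Stations</th></tr>
  <tr><td>Delhi</td><td>12</td></tr>
  <tr><td>Pune</td><td>5</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")   # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


table_parser = TableParser()
table_parser.feed(HTML)
header, *records = table_parser.rows
```

Real pages are rarely this tidy, which is where BeautifulSoup or Scrapy earn their place.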
Some tips & ethics
● Use the mobile version of the sites if available
● No cookies
● Respect robots.txt
● Identify yourself
● If possible, download bulk data first, process it later
● Prefer dumps over APIs, APIs over scraping
● Be polite and request permission to gather the data
● Worth checking: https://scraperwiki.com/
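Two of the tips above, "Respect robots.txt" and "Identify yourself", can be sketched with the standard library alone. The robots.txt content, the crawler name and the example.org URLs are assumptions for illustration; nothing is sent over the network here.

```python
from urllib import robotparser
from urllib.request import Request

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "opendata-learnup-bot"  # hypothetical crawler name

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if the parsed robots.txt rules let our crawler fetch the URL."""
    return rules.can_fetch(USER_AGENT, url)


def build_request(url):
    """Build (but do not send) a request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT})


public_ok = allowed("https://example.org/data/index.html")
private_ok = allowed("https://example.org/private/report.html")
request = build_request("https://example.org/data/index.html")
```

A polite crawler would call allowed() before every fetch and skip any URL for which it returns False.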
Data analyzing
● NumPy
- Offers an efficient multidimensional array object, ndarray
- Basic linear algebra operations and data types
- Source builds may need a Fortran compiler (e.g. GNU Fortran); prebuilt packages do not
● SciPy
- Builds on top of NumPy
- Modules for statistics, optimization, signal processing, ...
- Add-ons (called SciKits) for machine learning, data mining, etc
● For analysing networks
- NetworkX
- igraph
Data visualizing
● Matplotlib
● NetworkX
● PyGraphviz
NumPy + SciPy + Matplotlib + IPython
● Provides a MATLAB-like environment
● IPython provides an extended interactive interpreter
(tab completion, magic functions for object querying,
debugging, ...)
Some convenient data formats
● JSON (import json, or simplejson)
● XML (import xml.etree.ElementTree)
● RDF (import rdflib, SPARQLWrapper)
● GraphML (import networkx)
● CSV (import csv)
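As a small illustration of two of these formats, the sketch below round-trips the same records through JSON and CSV using the standard library's json and csv modules. The records are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io
import json

# Hypothetical records used only for illustration.
records = [
    {"city": "Delhi", "pm25": 153.0},
    {"city": "Pune", "pm25": 48.5},
]

# JSON: serialize to text and parse it back; numbers keep their type.
json_text = json.dumps(records)
from_json = json.loads(json_text)

# CSV: write to an in-memory buffer, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "pm25"])
writer.writeheader()
writer.writerows(records)

buffer.seek(0)
from_csv = list(csv.DictReader(buffer))
```

Note that CSV has no types: every value in from_csv comes back as a string, so numeric columns need converting after reading.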
Resource Description Framework
(RDF)
● A collection of W3C standards for modeling complex
relations and exchanging information
● Allows data from multiple sources to be combined easily
● RDF describes data with triples
- each triple has the form subject - predicate - object,
e.g. PyCon India 2017 - is organized in - Delhi
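The triple idea can be sketched in plain Python, with tuples standing in for RDF statements. This is not rdflib and not real RDF syntax: actual RDF identifies subjects and predicates with IRIs, and the predicate names below (organizedIn, locatedIn) and the match helper are made up for illustration.

```python
# Each triple is (subject, predicate, object). Names are hypothetical.
source_a = [("PyConIndia2017", "organizedIn", "Delhi")]
source_b = [("Delhi", "locatedIn", "India")]

# Triples from different sources share one shape, so combining is concatenation.
graph = source_a + source_b


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Where is the conference organized, and in which country is that city?
cities = [o for (_, _, o) in match(graph, "PyConIndia2017", "organizedIn")]
countries = [
    o
    for city in cities
    for (_, _, o) in match(graph, subject=city, predicate="locatedIn")
]
```

The second query only works because the two sources were merged, which is the "combine easily" point in miniature.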
Why R for Data Science?
● Algorithms
● Visualizations
● Data manipulation
● Integrations
● Easily scalable
Simple R code for bar graph
# Create the data for the chart.
H <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "barchart.png")
# Plot the bar chart.
barplot(H)
# Save the file.
dev.off()
Shiny R
https://shiny.rstudio.com/gallery/
A few commonly used algorithms
● Naïve Bayes Classifier Algorithm
● K-Means Clustering Algorithm
● Support Vector Machine Algorithm
● Apriori Algorithm
● Linear Regression
● Logistic Regression
● Artificial Neural Networks
● Random Forests
● Decision Trees
● Nearest Neighbours
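To make one entry in this list concrete, here is a minimal sketch of nearest-neighbour classification in plain Python (math.dist needs Python 3.8 or later). The training points, the labels and the choice of k=3 are assumptions for illustration; real work would more likely use a library implementation.

```python
import math
from collections import Counter

# Hypothetical training points: (feature vector, label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 6.0), "high"),
    ((6.5, 7.5), "high"),
]


def knn_predict(training, point, k=3):
    """Label a point by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


near_low = knn_predict(training, (1.2, 1.4))
near_high = knn_predict(training, (6.4, 6.8))
```

Sorting every training point per query is fine for a handful of points but scales poorly; libraries use spatial indexes to avoid it.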
Anaconda
Thank you
opendatahack.org
fb.com/opendatahack
