Mining Public Datasets
using Apache Zeppelin (incubating),
Apache Spark and Juju
by Alexander Bezzubov
NFLabs for ApacheCon ’16 NA
Alexander Bezzubov
Software Engineer at NFLabs, Seoul,
South Korea
Co-organizer of SeoulTech Society
Committer and PPMC member of
Apache Zeppelin (Incubating)
github.com/bzz
@seoul_engineer
Graduated in Mathematics from St. Petersburg State
University, Russia
PUBLIC DATASETS: Number, Size & Growth
Web Crawls
Structured data (RDF, microformats, tables)
Hacker News/Reddit/Twitter/StackOverflow/Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public mailing list archives)
Census Data (US, UK, UN, Japan, etc.)
Transportation (Taxi, Flights, Bicycles)
Climate
Genome
PUBLIC DATASETS: Number, Size & Growth
Web Crawls
Structured data (RDF, microformats, tables)
Hacker News/Reddit/Twitter/StackOverflow/Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public mailing list archives)
Census Data (US, UK, UN, Japan, etc.)
Transportation (Taxi, Flights, Bicycles)
Genome
Size: on the order of TBs
AWS Public Datasets https://aws.amazon.com/public-data-sets/
Yahoo Webscope https://webscope.sandbox.yahoo.com/
Stanford Network Analysis Project http://snap.stanford.edu/data/
Physics Research http://opendata.cern.ch
Size: on the order of PBs
PUBLIC DATA = OPPORTUNITY
I. Tools
II. Data
TOOLS TO PURSUE THE OPPORTUNITY:
Overview of the Big Data ecosystem
… …
TOOLS TO PURSUE THE OPPORTUNITY:
Today's choice: Zeppelin, Spark, Juju
Apache Spark
Scala, Python, R
Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc.
Warcbase
Spark library for saved crawl data (WARC)
Juju
Scales, integration with Spark, Zeppelin, AWS, GCE
APACHE ZEPPELIN: Overview
Zeppelin: Brief history
http://zeppelin.incubator.apache.org
12.2012 Commercial App using AMPLab Shark 0.5
08.2013 NFLabs Internal project Hive/Shark
10.2013 Prototype Hive/Shark
12.2014 Enters ASF Incubation
01.2016 3 major releases
05.2016 Graduation vote passed
Interactive Visualization
APACHE SPARK
http://spark.apache.org
From UC Berkeley AMPLab, since 2010
Apache top-level project since 2014
1000+ contributors
REPL + Java, Scala, Python, R APIs
JUJU
https://jujucharms.com/
Service modelling at scale
Deployment/configuration automation
+ Integration with Spark, Zeppelin, Ganglia, etc
+ AWS, GCE, Azure, LXC, etc
JUJU
http://bigdata.juju.solutions/getstarted
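# Install on Ubuntu
$ sudo apt-get install juju-core juju-quickstart
# or on OS X via Homebrew
$ brew install juju juju-quickstart
# Supported providers: LXC, AWS, GCE, Azure, VMware, OpenStack
$ juju generate-config
$ juju bootstrap
# Deploy the Hadoop + Spark + Zeppelin bundle
$ juju quickstart apache-hadoop-spark-zeppelin
# Open Spark and Zeppelin to outside traffic
$ juju expose spark zeppelin
# Scale out: add 4 more slave units
$ juju add-unit -n4 slave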
7-node cluster designed to scale out
APPROACH: local, small cluster, big cluster
1 core: Prototype (your laptop)
10s of PCs: Estimate the cost (AWS spot instances)
1000 instances: Scale out (deployment automation)
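The "estimate the cost" step can be sketched as simple extrapolation from a small-cluster run. This is a minimal sketch, not the talk's own method: it assumes processing time scales linearly with the number of files and instances, and the function name, parameters and numbers are all hypothetical.

```python
def estimate_full_run(sample_files, sample_minutes, sample_instances,
                      total_files, instances, price_per_instance_hour):
    """Extrapolate a small test run to the full dataset.

    Assumes linear scaling (an assumption, rarely exact in practice):
    every file costs the same number of instance-minutes as in the sample.
    Returns (wall_clock_hours, total_cost).
    """
    if sample_files <= 0 or sample_instances <= 0 or instances <= 0:
        raise ValueError("file and instance counts must be positive")
    # Instance-minutes spent per file during the sample run
    minutes_per_file = sample_minutes * sample_instances / sample_files
    # Total compute needed for the whole dataset, in instance-hours
    instance_hours = minutes_per_file * total_files / 60
    wall_clock_hours = instance_hours / instances
    total_cost = instance_hours * price_per_instance_hour
    return wall_clock_hours, total_cost


# Hypothetical numbers: 100 files took 30 minutes on 10 spot instances
hours, cost = estimate_full_run(
    sample_files=100, sample_minutes=30, sample_instances=10,
    total_files=30000, instances=100, price_per_instance_hour=0.10)
print("about {:.1f} hours, about ${:.0f}".format(hours, cost))
```

Under the linear assumption the total cost does not depend on the cluster size; only the wall-clock time does, which is why the small-cluster stage is enough to budget the big one.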
I. Tools
II. Data
DATA: GitHub http://githubarchive.org
• 300 GB compressed
• Collaboration between Google and GitHub engineers
• Events on PRs, repos, issues, comments, etc. in JSON
http://www.commitlogsfromlastnight.com/
http://sideeffect.kr/popularconvention/
https://www.gitlive.net/
http://zoom.it/kCsU
DATA PRODUCT: Get notified when
a project goes open source
DATA PRODUCT: Exploration
DATA PRODUCT: Sketch
We are going to build a Notebook that
sends you a digest email:
DATA PRODUCT: pieces (flow-chart)
We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters the PublicEvent records
• Joins the logs with more data from GitHub API calls
• Renders an HTML template to visualise the list
• Sends email notifications
• Does all of the above automatically, once a day
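The filter and notification steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the notebook from the talk: the hourly URL layout, the `type` and `repo.name` event fields, and the helper names `archive_url`, `newly_public_repos` and `digest_email` are assumptions for illustration. The GitHub API join and the daily scheduling are left out.

```python
import json
from datetime import datetime
from email.message import EmailMessage
from html import escape


def archive_url(hour):
    """Build the URL of one hourly dump (assumed layout: YYYY-MM-DD-H.json.gz)."""
    return "http://data.githubarchive.org/{:%Y-%m-%d}-{}.json.gz".format(
        hour, hour.hour)


def newly_public_repos(lines):
    """Return repo names from PublicEvent records in JSON-lines input."""
    repos = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") == "PublicEvent":
            repos.append(event["repo"]["name"])
    return repos


def digest_email(repos, sender, recipient):
    """Build (not send) a digest email with a plain-text and an HTML part."""
    items = "".join(
        '<li><a href="https://github.com/{0}">{0}</a></li>'.format(escape(name))
        for name in repos)
    msg = EmailMessage()
    msg["Subject"] = "{} project(s) went open source".format(len(repos))
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Newly public repositories:\n" + "\n".join(repos))
    msg.add_alternative("<ul>{}</ul>".format(items), subtype="html")
    return msg


sample = [
    '{"type": "PushEvent", "repo": {"name": "a/b"}}',
    '{"type": "PublicEvent", "repo": {"name": "octo/demo"}}',
]
print(archive_url(datetime(2016, 5, 11, 7)))
print(digest_email(newly_public_repos(sample),
                   "me@example.com", "you@example.com")["Subject"])
```

The message object can be handed to `smtplib.SMTP.send_message` once a mail server is configured. In the notebook itself the filtering would more likely run as a Spark job over the downloaded files.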
DATA PRODUCT: Full impl
DATA: Common Crawl
https://commoncrawl.org
Nonprofit, by Factual
On AWS S3 in WARC, WAT formats
Since 2013, monthly: ~150 TB compressed, 2+ billion URLs
URL Index by Ilya Kreymer of @webrecorder_io
http://index.commoncrawl.org/
https://about.commonsearch.org
DATA: CommonCrawl - Data Product
Measuring the impact of Google Analytics
Objective: estimate % of pages/domains that use Google
Analytics/Facebook
Existing research from 2013
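The objective can be sketched as a per-page check plus two ratios. This is a minimal sketch, not the talk's implementation: it assumes a page "uses Google Analytics" when its HTML references `ga.js` or `analytics.js` on `google-analytics.com`, and the names `uses_google_analytics` and `usage_share` are hypothetical. Facebook detection is left out.

```python
import re
from urllib.parse import urlparse

# Assumed detection rule: the page references one of the GA tracker scripts
GA_PATTERN = re.compile(r"google-analytics\.com/(?:ga|analytics)\.js",
                        re.IGNORECASE)


def uses_google_analytics(html):
    """True if the HTML references a Google Analytics tracker script."""
    return GA_PATTERN.search(html) is not None


def usage_share(pages):
    """Compute (share of pages, share of domains) that use Google Analytics.

    `pages` is an iterable of (url, html) pairs. A domain counts as a user
    if at least one of its pages matches.
    """
    total_pages = 0
    ga_pages = 0
    domains = set()
    ga_domains = set()
    for url, html in pages:
        domain = urlparse(url).netloc.lower()
        total_pages += 1
        domains.add(domain)
        if uses_google_analytics(html):
            ga_pages += 1
            ga_domains.add(domain)
    if total_pages == 0:
        return 0.0, 0.0
    return ga_pages / total_pages, len(ga_domains) / len(domains)


pages = [
    ("http://a.com/1",
     "<script src='//www.google-analytics.com/analytics.js'></script>"),
    ("http://a.com/2", "<p>no tracker here</p>"),
    ("http://b.org/", "<p>no tracker here either</p>"),
]
page_share, domain_share = usage_share(pages)
print("pages: {:.0%}, domains: {:.0%}".format(page_share, domain_share))
```

On a cluster the same per-page predicate would be applied to each crawl record and the counts aggregated; the pure-Python loop here only illustrates the logic.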
DATA: CommonCrawl - Data Product
Measuring the impact of Google Analytics
Copy to HDFS vs read from S3
Verify using grep
hadoop jar hadoop-examples.jar grep /grep-data/ \
/grep-output/ '[Bb]ig [Dd]ata is ([a-zA-Z]{5,})'
DATA: CommonCrawl - Data Product
Feb 2016 Crawl:
- 48 TB compressed
- 100 segments (directories on S3)
- 30,000 files, ~1 GB each
DATA: CommonCrawl - Data Product
AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)
Spark optimisations:
- IO-bound, so increase spark.executor.cores, spark.executor.memory
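One way to apply the two Spark settings is to pass them as `--conf` flags to `spark-submit`. The sketch below only assembles such a command line. The helper name `spark_submit_command`, the application file name and the values are hypothetical; suitable values depend on the chosen instance type.

```python
def spark_submit_command(app, executor_cores, executor_memory):
    """Assemble a spark-submit command with explicit executor settings."""
    return [
        "spark-submit",
        "--conf", "spark.executor.cores={}".format(executor_cores),
        "--conf", "spark.executor.memory={}".format(executor_memory),
        app,
    ]


# Hypothetical job file and values, for illustration only
cmd = spark_submit_command("ga_count.py", executor_cores=8,
                           executor_memory="4g")
print(" ".join(cmd))
```

In Zeppelin the same two properties can instead be set in the Spark interpreter settings.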
DATA: CommonCrawl - Data Product
Zeppelin Viewer
Community service for sharing example notebooks
http://zeppelinhub.com/viewer
TAKEAWAY
There are plenty of free tools out there
to crunch the data for fun and profit.
They are easy (not simple) to learn and generic enough.
Questions?
Alexander Bezzubov
@seoul_engineer
github.com/bzz
Thank you
Alexander Bezzubov
NFLabs, Seoul (we are hiring!)