Mining Public Datasets

Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju, by Alexander Bezzubov (NFLabs), for ApacheCon ’16 NA


Alexander Bezzubov

Software Engineer at NFLabs, Seoul, South Korea

Co-organizer of SeoulTech Society

Committer and PPMC member of Apache Zeppelin (Incubating)

github.com/bzz

@seoul_engineer

Graduated in Mathematics from St. Petersburg State University, Russia
PUBLIC DATASETS: Number, Size & Growth

Web Crawls
Structured data (RDF, micro-formats, tables)
Hacker News\Reddit\Twitter\StackOverflow\Wikipedia
Reviews (movies, restaurants, beer, wine)
Emails (Enron, ASF public mailing list archives)
Census Data (US, UK, UN, Japan, etc)
Transportation (Taxi, Flights, Bicycles)
Genome
Size: on the order of TBs

PUBLIC DATASETS: Number, Size & Growth

Physics Research http://opendata.cern.ch
Size: on the order of PBs
PUBLIC DATA = OPPORTUNITY
I. Tools
II. Data
TOOLS TO PURSUE THE OPPORTUNITY:
Overview of the Big Data ecosystem

TOOLS TO PURSUE THE OPPORTUNITY:
Today's choice: Zeppelin, Spark, Juju

Apache Spark
Scala, Python, R

Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink, Elasticsearch, etc.

Warcbase
Spark library for saved crawl data (WARC)

Juju
Scales, integration with Spark, Zeppelin, AWS, GCE

APACHE ZEPPELIN: Overview

Zeppelin: Brief history

http://zeppelin.incubator.apache.org

12.2012 Commercial App using AMP Lab Shark 0.5
08.2013 NFLabs Internal project Hive/Shark
10.2013 Prototype Hive/Shark
12.2014 Enters ASF Incubation
01.2016 3 major releases
05.2016 Graduation vote passed
Interactive Visualization
APACHE SPARK

http://spark.apache.org

From UC Berkeley AMPLab, since 2010

Apache top-level project since 2014

1000+ contributors

REPL + Java, Scala, Python, R APIs

JUJU

https://jujucharms.com/

Service modelling at scale

Deployment\configuration automation

+ Integration with Spark, Zeppelin, Ganglia, etc

+ AWS, GCE, Azure, LXC, etc

JUJU

http://bigdata.juju.solutions/getstarted

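# Install Juju on Ubuntu
$ sudo apt-get install juju-core juju-quickstart
# or on macOS
$ brew install juju juju-quickstart

# Generate a config file, then pick a provider in it:
# LXC, AWS, GCE, Azure, VMware, OpenStack
$ juju generate-config

# Bootstrap the environment, deploy the bundle, open its ports
$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
# Scale out: add 4 more slave units
$ juju add-unit -n4 slave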
JUJU

http://bigdata.juju.solutions/getstarted

7-node cluster, designed to scale out

APPROACH: local, small cluster, big cluster

1 core: Prototype (your laptop)

10s of PCs: Estimate the cost (AWS spot instances)

1000 instances: Scale out (deployment automation)

I. Tools
II. Data
DATA: GitHub http://githubarchive.org

• 300 GB compressed

• Collaboration between Google and GitHub engineers

• Events on PRs, repos, issues, comments, etc. in JSON
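
As a rough illustration of working with these JSON events, here is a minimal Python sketch. It assumes the archive is published as one gzipped file per hour, with one JSON event per line. The host name and the file naming pattern (hour not zero-padded) are assumptions to check against the GitHub Archive site. The sample events are made up so the snippet runs without network access.

```python
import gzip
import io
import json
from datetime import datetime

# Assumption: download host for the hourly dumps; verify on githubarchive.org.
ARCHIVE_HOST = "http://data.githubarchive.org"


def hourly_archive_url(ts):
    """Build the URL of one hourly dump (assumed pattern: YYYY-MM-DD-H.json.gz)."""
    return f"{ARCHIVE_HOST}/{ts:%Y-%m-%d}-{ts.hour}.json.gz"


def read_events(gz_bytes):
    """Yield one dict per non-empty line of a gzipped, newline-delimited JSON dump."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical in-memory stand-in for a downloaded hourly file.
sample = [
    {"type": "PushEvent", "repo": {"name": "octo-org/widgets"}},
    {"type": "PublicEvent", "repo": {"name": "acme/rocket"}},
]
payload = gzip.compress("\n".join(json.dumps(e) for e in sample).encode("utf-8"))

url = hourly_archive_url(datetime(2016, 5, 1, 7))
event_types = [event["type"] for event in read_events(payload)]
print(url)
print(event_types)
```

In a notebook, the downloaded bytes would replace `payload`. A Spark job would more likely read the files directly as line-delimited JSON.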

http://www.commitlogsfromlastnight.com/
http://sideeffect.kr/popularconvention/
https://www.gitlive.net/
http://zoom.it/kCsU
DATA PRODUCT: Get notified when a project goes Open Source
DATA PRODUCT: Exploration
DATA PRODUCT: Sketch

We are going to build a Notebook that sends you a digest email.
DATA PRODUCT: pieces (flow-chart)

We are going to build a Notebook that:

• Downloads the latest data from GitHub Archive

• Reads & explores the dataset

• Imports and filters the PublicEvent records

• Joins the logs with more data from GitHub API calls

• Renders an HTML template to visualise the list

• Sends email notifications

• Does all of the above automatically, once a day
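
The filtering, rendering and notification steps in this list could look roughly like the Python sketch below. It is not the notebook from the talk. The function names, the HTML layout and the email addresses are all hypothetical, and the events are hand-written samples in the shape of GitHub event records (`type`, `repo.name`). The message is only built, not sent.

```python
import html
from email.message import EmailMessage


def newly_public_repos(events):
    """Return the sorted, de-duplicated repo names that have a PublicEvent."""
    names = {e["repo"]["name"] for e in events if e.get("type") == "PublicEvent"}
    return sorted(names)


def render_digest(repo_names):
    """Render a very small HTML list with one link per repository."""
    items = "\n".join(
        '<li><a href="https://github.com/{0}">{0}</a></li>'.format(html.escape(name))
        for name in repo_names
    )
    return "<h1>Newly open-sourced projects</h1>\n<ul>\n" + items + "\n</ul>"


def build_email(body_html, sender, recipient):
    """Wrap the digest in an HTML email message (sending is left out)."""
    msg = EmailMessage()
    msg["Subject"] = "Daily digest: newly open-sourced projects"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body_html, subtype="html")
    return msg


# Hypothetical sample events; a real run would use the day's archive files.
sample_events = [
    {"type": "PublicEvent", "repo": {"name": "octo-org/widgets"}},
    {"type": "PushEvent", "repo": {"name": "octo-org/private-tool"}},
    {"type": "PublicEvent", "repo": {"name": "acme/rocket"}},
    {"type": "PublicEvent", "repo": {"name": "acme/rocket"}},
]

repos = newly_public_repos(sample_events)
digest = render_digest(repos)
message = build_email(digest, "notebook@example.org", "you@example.org")
print(repos)
print(digest)
```

Handing `message` to `smtplib.SMTP(...).send_message` would actually deliver it. The daily schedule would come from the notebook's scheduler rather than from this code.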

DATA PRODUCT: Full implementation
I. Tools
II. Data
DATA: Common Crawl

https://commoncrawl.org

Nonprofit, by Factual

On AWS S3 in WARC, WAT and WET formats

Since 2013, monthly: ~150 TB compressed, 2+ billion URLs

URL Index by Ilya Kreymer of @webrecorder_io

http://index.commoncrawl.org/
https://about.commonsearch.org
DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Objective: estimate % of pages/domains that use Google Analytics/Facebook

Existing research from 2013
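
A minimal sketch of how such an estimate could be computed is shown below in plain Python. It makes no claim to match the talk's Spark job. The detection rule (looking for the well-known `ga.js`, `analytics.js` or `urchin.js` script URLs) is a simplifying assumption that will miss other integrations, and the sample pages are invented.

```python
import re
from urllib.parse import urlsplit

# Assumption: a page "uses Google Analytics" if it references one of these scripts.
GA_MARKER = re.compile(r"google-analytics\.com/(?:ga|analytics|urchin)\.js", re.I)


def uses_google_analytics(html_text):
    """Heuristic check for a Google Analytics script reference in raw HTML."""
    return GA_MARKER.search(html_text) is not None


def usage_share(pages):
    """Given (url, html) pairs, return (share of pages, share of domains) with a hit."""
    total_pages = 0
    hit_pages = 0
    domains = set()
    hit_domains = set()
    for url, body in pages:
        host = urlsplit(url).hostname
        total_pages += 1
        domains.add(host)
        if uses_google_analytics(body):
            hit_pages += 1
            hit_domains.add(host)
    if total_pages == 0:
        return 0.0, 0.0
    return hit_pages / total_pages, len(hit_domains) / len(domains)


# Hypothetical pages standing in for records read from WARC files.
sample_pages = [
    ("http://a.example/1", "<script src='//www.google-analytics.com/analytics.js'></script>"),
    ("http://a.example/2", "<p>no tracking here</p>"),
    ("http://b.example/", "<p>plain page</p>"),
    ("http://c.example/", "ga.src = 'http://www' + '.google-analytics.com/ga.js';"),
]

page_share, domain_share = usage_share(sample_pages)
print(f"pages: {page_share:.0%}, domains: {domain_share:.0%}")
```

On the real crawl the same per-page predicate would run inside a distributed map step, with the counts aggregated at the end.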

DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Copy to HDFS vs read from S3
Verify using grep
hadoop jar hadoop-examples.jar grep /grep-data/ \
/grep-output/ '[Bb]ig [Dd]ata is ([a-zA-Z]{5,})' 

DATA: CommonCrawl - Data Product

Feb 2016 Crawl:

- 48 TB compressed
- 100 segments (dir on S3)
- 30,000 files, ~1 GB each
DATA: CommonCrawl - Data Product

AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)
Spark optimisations:
- IO-bound, so increase spark.executor.cores and spark.executor.memory
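
For an IO-bound job, instance choice mostly comes down to a back-of-the-envelope calculation like the one sketched here. The 48 TB figure is the crawl size quoted in the talk. The node count, per-node throughput and hourly price are hypothetical placeholders, not measurements or AWS quotes.

```python
def hours_to_scan(dataset_tb, nodes, mb_per_sec_per_node):
    """Hours needed to read the dataset once, assuming perfectly parallel IO."""
    total_mb = dataset_tb * 1_000_000  # decimal TB to MB
    seconds = total_mb / (nodes * mb_per_sec_per_node)
    return seconds / 3600


def cluster_cost(hours, nodes, usd_per_node_hour):
    """Total cost of keeping the whole cluster up for the given time."""
    return hours * nodes * usd_per_node_hour


# Hypothetical cluster: 100 nodes, 100 MB/s sustained read each, $0.10 per node-hour.
hours = hours_to_scan(dataset_tb=48, nodes=100, mb_per_sec_per_node=100)
cost = cluster_cost(hours, nodes=100, usd_per_node_hour=0.10)
print(f"{hours:.2f} hours, ${cost:.2f}")
```

Real runs will be slower than this ideal because of stragglers, S3 throttling and decompression, so the result is best read as a lower bound when comparing instance types.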
DATA: CommonCrawl - Data Product
Zeppelin Viewer

Community service for sharing example notebooks

http://zeppelinhub.com/viewer
TAKEAWAY

There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough

Questions?

Alexander Bezzubov
@seoul_engineer

github.com/bzz
Thank you
Alexander Bezzubov
NFLabs, Seoul (we are hiring!)
