Data Science Stack with MongoDB and RStudio
Building up an easy data science platform with
RStudio Server on top of your MongoDB
Winston Chen – Lead Software Engineer
What does Fliptop do?
• Predictive Lead Scoring, using data science
– Pull opportunity/lead/contact data from CRM
– Aggregate company data and social data from various data
sources and the internet
– Over 3000 signals
– Build conversion/revenue model
– Predict lead conversion and revenue
Our Platform Stack
• Java/Scala
• Liftweb
• JMS/Storm
• MongoDB/MySQL
Our Machine Learning Stack
• Python
• NumPy/SciPy/pandas
• Bottle (RESTful Server)
So, where is R then?
• Problem:
– Data is stored in MongoDB
• Sales Lead Data
• Sales Opportunity Data
• Sales Contact Data
– It’s hard to view/digest/process data on the fly using the MongoDB
console
• (X) Text processing for insight extraction?
• (X) Prototype cool machine learning algorithms on the fly?
• Solution:
– R and RStudio Server
• Why not Scala?
• Why not Python/IPython?
MongoDB Console & Query
RStudio Server
Pull MongoDB data into R data frame
• rmongodb (https://github.com/gerald-lindsly/rmongodb)
Transform into an R data frame
1 – Get the total count of your data set
2 – Construct Vectors for each column
3 – Loop through the cursor and insert values
Where are my apply functions?
- Too bad. We are using a Mongo cursor :P
4 – Go into the sub-BSON block to extract data (optional)
5 – Construct data frame and return
You are able to get the full example code here:
http://goo.gl/tlyyXp
We now have a data frame, built from MongoDB BSON, to play with.
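The five steps can be sketched roughly as follows. This is an illustrative sketch, not the code from the linked example: it assumes the rmongodb package is installed, and the namespace "crm.leads" and the fields "email", "score" and "company" (with a nested "name") are hypothetical. The helpers value_or and leads_to_data_frame are also made up for this sketch.

```r
# Hypothetical helper: rmongodb returns NULL for a missing field, and
# assigning NULL into a vector slot fails, so substitute a default.
value_or <- function(x, default) if (is.null(x)) default else x

# Sketch of steps 1-5. "mongo" is a connection from rmongodb::mongo.create().
# The namespace and field names are assumptions for illustration only.
leads_to_data_frame <- function(mongo, ns = "crm.leads") {
  # 1 - Get the total count of the data set
  n <- rmongodb::mongo.count(mongo, ns)

  # 2 - Construct a preallocated vector for each column
  email   <- character(n)
  score   <- numeric(n)
  company <- character(n)

  # 3 - Loop through the cursor and insert values
  cursor <- rmongodb::mongo.find(mongo, ns)
  on.exit(rmongodb::mongo.cursor.destroy(cursor))
  i <- 0
  while (i < n && rmongodb::mongo.cursor.next(cursor)) {
    i <- i + 1
    doc <- rmongodb::mongo.cursor.value(cursor)
    email[i] <- value_or(rmongodb::mongo.bson.value(doc, "email"), NA_character_)
    score[i] <- value_or(rmongodb::mongo.bson.value(doc, "score"), NA_real_)

    # 4 - Go into the sub-BSON block (optional). Depending on the rmongodb
    # version this may come back as a mongo.bson object or as an R list,
    # so both cases are handled here.
    sub <- rmongodb::mongo.bson.value(doc, "company")
    if (inherits(sub, "mongo.bson")) {
      sub <- rmongodb::mongo.bson.to.list(sub)
    }
    company[i] <- value_or(sub[["name"]], NA_character_)
  }

  # 5 - Construct the data frame and return it, trimmed to the rows read
  rows <- seq_len(i)
  data.frame(email = email[rows], score = score[rows],
             company = company[rows], stringsAsFactors = FALSE)
}

# Usage, assuming a MongoDB server on localhost:
# mongo <- rmongodb::mongo.create(host = "127.0.0.1")
# leads <- leads_to_data_frame(mongo)
# rmongodb::mongo.destroy(mongo)
```

Preallocating the vectors from the count avoids growing them inside the loop, which is the usual reason for step 1.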
This is NOT a BIG DATA Stack
• It takes around 1 min to process 900 MB+ of BSON from
MongoDB.
• NOT a BIG data stack – the data should fit into RAM
• Most of the data in the business world is not big anyway.
• It works fine for us (m1.large machine in AWS)
– CRM data is never big, not even after we pull in 3000+ additional
signals.
– The term ‘Big Data’ is seriously overrated; ‘Data Science’,
however, is the key term here.
At Fliptop, we now use RStudio for
• Data Insight Extraction
• Algorithm prototyping
If you REALLY want BIG Data
• Look into: HDFS + Pig/Hive + Hue
(any other suggestions from the audience here?)
QA
• Winston Chen
– Personal Blog: http://winston.attlin.com/
– Twitter: @wingchen83
– winston@fliptop.com
• Fliptop is hiring Data Scientists. Please email:
winston@fliptop.com
