Data Pipeline Management Framework on Oozie
Kun Lu
Overview
Architecture of Campaign Analytics
What are the issues in the old Campaign Analytics processes
Build a Pipeline Management Framework for a robust computing environment
Architecture of Campaign Analytics
What are the issues the framework needs to solve
Consistent and robust framework
Make adding a new analytics job easier
Ability to coordinate complex workflows (serialized and parallel processing)
It should support the catch-up feature
It should make debugging and tracing easier
What does Oozie provide?
Workflow Engine
Workflow definition
A DAG of control flow nodes and action nodes (connected by transition arrows)
Workflow Nodes
Control flow nodes (start, end, decision, fork, join, kill node)
Action nodes (Map-Reduce, Pig, Java, Script, etc.)
Parameterization of Workflow
Job Properties
EL functions (Basic EL, WF EL, Hadoop EL, HDFS EL)
Oozie Console
Oozie Client and API
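The workflow model above (a DAG with fork and join nodes, parameterized through job properties) can be sketched in a few lines of Python. This is only an illustration of the idea, not Oozie's implementation: the node names, the `WORKFLOW` dictionary, and the `execution_order` and `resolve` helpers are all hypothetical, and the `${name}` substitution merely mimics how job properties are referenced in a workflow definition.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each node maps to the nodes it transitions to.
# Only the node kinds (start, fork, join, action, end) come from the slides.
WORKFLOW = {
    "start": ["fork-load"],
    "fork-load": ["load-clicks", "load-impressions"],  # fork: parallel branches
    "load-clicks": ["join-load"],
    "load-impressions": ["join-load"],
    "join-load": ["aggregate"],                        # join: waits for both branches
    "aggregate": ["end"],
    "end": [],
}


def execution_order(workflow):
    """Return one valid node order; raises graphlib.CycleError if not a DAG."""
    # TopologicalSorter expects node -> predecessors, so invert the transitions.
    predecessors = {node: set() for node in workflow}
    for node, targets in workflow.items():
        for target in targets:
            predecessors.setdefault(target, set()).add(node)
    return list(TopologicalSorter(predecessors).static_order())


# Matches placeholders such as ${nameNode} or ${input.dir}.
PARAM = re.compile(r"\$\{([\w.]+)\}")


def resolve(template, properties):
    """Replace ${name} placeholders with values from the job properties."""
    return PARAM.sub(lambda match: properties[match.group(1)], template)


order = execution_order(WORKFLOW)
input_path = resolve("${nameNode}/data/${date}",
                     {"nameNode": "hdfs://nn", "date": "2013-01-01"})
```

The two parallel branches may appear in either order in the result, but both always come after the fork and before the join, which is the guarantee a fork/join pair is meant to give.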
Workflow Design Pattern
Campaign Analytics Pipeline Management Framework
The Campaign Analytics Pipeline Management Framework (PMF) is built on top of Oozie.
PMF defines the campaign analytics processing pipelines. Each pipeline includes a set of workflows.
PMF organizes, schedules, and coordinates the campaign analytics jobs. It also provides a built-in catch-up feature to make the pipelines robust.
The Oozie workflow engine executes the workflows and sends job status to the Oozie server.
Jobs are monitored and traced through the Oozie console.
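The slides do not say how the catch-up feature is implemented. A minimal sketch, assuming that "catch-up" means re-running, in order, every scheduled interval that was missed since the last successful run; the function name `missed_runs` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def missed_runs(last_success, now, interval):
    """List the scheduled run times after last_success, up to and including now."""
    runs = []
    next_run = last_success + interval
    while next_run <= now:
        runs.append(next_run)
        next_run += interval
    return runs


# Example: an hourly pipeline that last succeeded at midnight
# and is checked again at 03:30 the same day.
pending = missed_runs(
    last_success=datetime(2013, 1, 1, 0, 0),
    now=datetime(2013, 1, 1, 3, 30),
    interval=timedelta(hours=1),
)
```

Under this assumption, a pipeline that was down for a while simply works through `pending` oldest first, so downstream daily or weekly workflows see complete input once it has caught up.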
PMF & Oozie Execution Env.
PMF Servers
Own Pipeline definition
Pass workflow tasks to Oozie through the Oozie client
Oozie Server
Executes workflow tasks
Manages task status
Hadoop Cluster
Workflow definition deployed in HDFS
M/R processes run on the cluster
Oozie Console
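How PMF hands a task to Oozie is not shown in the slides. As a rough sketch of the hand-off, the snippet below builds the Hadoop-style configuration XML that an Oozie job submission carries, where `oozie.wf.application.path` points at the workflow definition deployed in HDFS. The helper `build_job_config`, the HDFS path, and the user name are hypothetical, and no request is actually sent.

```python
import xml.etree.ElementTree as ET


def build_job_config(properties):
    """Serialize job properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values; only the property names are standard Oozie settings.
config_xml = build_job_config({
    "user.name": "pmf",
    "oozie.wf.application.path": "hdfs://nn/apps/campaign/hourly",
})
```

A PMF server would then pass a payload like `config_xml` to the Oozie server through whichever client it uses (command line, Java client, or the web services API), and the Oozie server would take over execution and status tracking.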
Workflow Console
Current Workflows
PMF manages three pipelines (hourly pipeline, daily pipeline, and weekly pipeline)
Includes 12 workflows
Map/Reduce jobs run per month: ~100,000 jobs
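To put the monthly figure in perspective, a quick back-of-the-envelope calculation gives the average load. The 30-day month is an assumption, and the real load is unlikely to be spread evenly across hours.

```python
JOBS_PER_MONTH = 100_000  # approximate figure from the slide
DAYS_PER_MONTH = 30       # assumption: a 30-day month

jobs_per_day = JOBS_PER_MONTH / DAYS_PER_MONTH  # roughly 3,300 jobs per day
jobs_per_hour = jobs_per_day / 24               # roughly 140 jobs per hour
```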
