Multi Layer Monitoring V1
Maintaining Non-Stop Services
 with Multi Layer Monitoring

                       Lahav Savir
       System Architect and CEO of Emind Systems
                lahavs@emindsys.com

                   www.emindsys.com
The approach
• Non-stop applications can’t live on their own
• More complex systems require more
    monitoring
•   Proactive system monitoring
    • Customized monitors
    • Monitor each process, component and application
        separately and all as a whole
    •   Proactive correction of problems before they become
        noticeable by your customers.
    •   Allow application to function at maximum availability
    •   SNMP monitoring of application infrastructure
    •   Alerting on a potential problem or situation prior to its occurrence
    •   Visual layered display of the entire data center

2
Monitoring is not just for
     System Administrators
    but for Developers as well




4
The goal for monitoring is to
      keep track of the running services 24/7,
        find troubles as early as possible and
        keep you alerted (only when needed…)


    Good monitoring infrastructure gives you
    quick and direct troubleshooting abilities via
      visual representation of the system status




5
Multi-Layered Monitoring
• All layers feed one Unified Dashboard
• Services
    • Keeping SLA, End-to-end service, User experience monitors
• Applications
    • Application proprietary monitors, Custom counters
• Operating Systems
    • CPU, Memory, Disk, Network, Processes
• Infrastructure
    • Network connections, network devices, Chassis, Routers, Load Balancers, Firewalls




6
Visualize the information
• Use Maps & Views
    • Visualize the topology
    • Visualize the application flow
    • Focus on different layers
• Layered views (service & host groups)
    • Network, Hardware, OS, Application
• Different roles looking for different info
• Graph Service performance
    • Transactions, Success rates, Cache
• Aggregated view
    • Cluster’s average


7
Why multi-layer?
• Correlated information on one uniform view
    • Network throughput, CPU usage, Application usage
• Generate aggregated reports for different
    machines & layers
•   Collect information from all nodes
    • Switches, routers, firewalls, load balancers, storage,
      servers, applications
• Collect different types of data
    • Utilization, throughput, concurrency, cache status
    • Application performance, error rates
• Objective
    • Find the root cause via visuals on the dashboard!
    • Be aware of what’s going on
8
Unified Dashboard
             Infrastructure
     Network connections, network devices,
    Chassis, Routers, Load Balancers, Firewalls




9
Infrastructure layer
• Hardware redundancy may be dangerous if you
     don’t keep your eyes on it
     • Administrators do not always see the HW
     • Redundant hardware can fool you (until it dies)
• Vendor-specific MIBs / Syslog
• Today’s hardware provides detailed status
     interfaces
     • Power supplies, power usage
     • Fans & temperature
     • Disk controllers & drives
     • Switch ports, interfaces
     • Links, connections

10
HP HW monitoring




11
Network infrastructure health
[Graphs: Link quality, Devices utilization, Network throughput]




12
Unified Dashboard
        Operating Systems
     CPU, Memory, Disk, Network, Processes




13
Operating System health
• Monitoring of OS essentials
     • CPU, Memory, Disk I/O, Network traffic, processes, services
• Use cases
     • Failure of log cleanup > disks full
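
The log-cleanup use case can be caught early with a threshold check on disk usage. Below is a minimal sketch in Python, assuming Nagios-style status codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names and the 80/90 percent thresholds are illustrative and do not come from the slides.

```python
import shutil

# Nagios-style plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a disk usage percentage to a status code (thresholds are assumptions)."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def disk_used_pct(path="/"):
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

A monitoring plug-in built on this would print one status line and exit with the code returned by `classify_usage(disk_used_pct("/"))`, so a growing log directory shows up as WARNING well before the disk is full.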




14
Unified Dashboard
           Applications
     Application proprietary monitors,
             Custom counters




15
Applications health
• Transaction counters
     • Measure transaction rates
     • Measure counters on input & output
• % Success rates
     • Success counters for primary operations
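
The counters above could be kept inside the application along these lines. This is a sketch only: the class, its field names and `rate_per_second` are hypothetical, and how the values are exported (SNMP, status file, command line) is left open.

```python
from dataclasses import dataclass


@dataclass
class TransactionCounters:
    """Input/output counters for one primary operation (names are illustrative)."""
    received: int = 0   # transactions that entered the system
    succeeded: int = 0  # transactions completed successfully
    failed: int = 0     # transactions that ended in error

    def record(self, ok):
        """Count one finished transaction."""
        self.received += 1
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1

    def success_rate(self):
        """Percentage of successful transactions, or None before any traffic."""
        if self.received == 0:
            return None
        return 100.0 * self.succeeded / self.received


def rate_per_second(previous_count, current_count, interval_s):
    """Transaction rate derived from two samples of an ever-increasing counter."""
    return (current_count - previous_count) / interval_s
```

Exposing raw counters and letting the monitoring tool compute the rate between polls keeps the application side simple.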




16
System input/output monitoring

        Outlined topology




17
Applications health
• Queues
     • Processing backlog
     • Semaphores / throttle usage
• Latency
     • Measure the time it takes to process request / data chunk
• Optimizations
     • Measure compression rates




18
Applications health
• DB synchronizations
     • Replication status
     • Replication backlog & delays




19
Applications health
• Cluster & topology monitoring
     • Track application topology changes
     • Indicate dependencies status




20
Application topology
[Diagram: outlined topology with Load Balancers, Front-ends, Firewalls, Routers and Transports]




21
Services
     Keeping SLA, End-to-end service,
        User experience monitors




                                        Unified Dashboard
22
Cluster overall QoS
• Now it’s a cluster: aggregated counters
• We want to know what users are experiencing
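
One way to aggregate per-node counters into a cluster-level figure is sketched below. The node names and the `(succeeded, total)` layout are assumptions made for illustration. The second function is included only to show how a plain average of per-node rates can differ from the aggregated value when nodes carry unequal traffic.

```python
def cluster_success_rate(node_counters):
    """Aggregate per-node (succeeded, total) pairs into one cluster-wide percentage."""
    succeeded = sum(s for s, _ in node_counters.values())
    total = sum(t for _, t in node_counters.values())
    return 100.0 * succeeded / total if total else None


def mean_of_node_rates(node_counters):
    """Plain average of per-node percentages; ignores how much traffic each node had."""
    rates = [100.0 * s / t for s, t in node_counters.values() if t]
    return sum(rates) / len(rates) if rates else None
```

With a busy node at 990 of 1000 and an almost idle node at 5 of 10, the aggregated rate stays close to what most users experienced, while the plain average is pulled down by the idle node.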




23
User Experience & Service QoS
• Simulate user behavior
     • Latency
       • How long it takes to log in
       • How long it takes to send and receive a message
       • How long it takes to “check out”
     • Success rates
       • What’s the success rate of the user’s operation
     • Download speed
       • What’s the download speed from different locations
       • What’s your Content Delivery Network (CDN) performance
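
A simulated user can be sketched as a small probe that times an operation and counts failures. The `probe` function and its result keys are hypothetical; `operation` stands for whatever login, message or checkout step is being simulated.

```python
import time


def probe(operation, attempts=5):
    """Run a simulated user operation several times; report success rate and latency.

    `operation` is any zero-argument callable that raises an exception on failure.
    """
    latencies = []
    failures = 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            operation()
        except Exception:
            failures += 1
            continue
        latencies.append(time.monotonic() - start)
    return {
        "success_rate_pct": 100.0 * (attempts - failures) / attempts,
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "max_latency_s": max(latencies) if latencies else None,
    }
```

Running the same probe from several locations gives the per-location view of latency and success rate that the slide asks for.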
24
Recommendations
• Build a generic monitoring infrastructure with
     generic tools and interfaces
•    Use embedded SNMP
•    Net-snmp is extensible (also for Windows)
     • PROXY – proxy requests to another SNMP agent (embedded)
     proxy -v 2c -c public 127.0.0.1:50910 1.3.6.1.4.1.15867.2000.3.6 1.3.6.1.4.1.15867.2000.3.6
     • PASS – STDOUT-based subagent
     pass .1.3.6.1.4.1.15867.2001 /bin/sh /usr/local/ixi/GenericSubAgent.sh
     • EXEC – run a script
     exec .1.3.6.1.4.1.15867.1100.20.10 axs-imap-test-stat /bin/bash /opt/mas/scripts/imap_tester.sh last_state

• SSH / Telnet
25
Using Status files
• Perfect for batch operations
     • perl, python, php
• Status file
     TIMESTAMP:1276203703
     STATUS:0
     HOSTNAME:myserver

• Observer
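     # Alert when the status file has not been refreshed recently enough
     if [ $(get_time_delta ${file}) -gt ${max_d_s} ]; then
          err "Delta is greater than ${max_delta} hours"
          return
     fi
     # Alert when the last run reported a non-zero status
     if [ "$(parse_status_file ${file} STATUS)" != "0" ]; then
          err "Last backup status is not 0"
          return
     fi
     echo ${ok}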
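
Both sides of the status-file pattern can be sketched in Python. The slide's `get_time_delta` and `parse_status_file` helpers are not shown in the deck, so the functions below are assumed equivalents rather than the original code, and the maximum age is a free parameter.

```python
import time


def write_status_file(path, status, hostname):
    """Batch side: record when the job finished and how it ended."""
    with open(path, "w") as f:
        f.write(f"TIMESTAMP:{int(time.time())}\n")
        f.write(f"STATUS:{status}\n")
        f.write(f"HOSTNAME:{hostname}\n")


def parse_status_file(path):
    """Read KEY:VALUE lines into a dict."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, sep, value = line.strip().partition(":")
            if sep:
                fields[key] = value
    return fields


def check_status_file(path, max_age_s, now=None):
    """Observer side: return (ok, message) for the given status file."""
    fields = parse_status_file(path)
    now = time.time() if now is None else now
    age = now - int(fields["TIMESTAMP"])
    if age > max_age_s:
        return False, f"Status is {int(age)}s old, limit is {max_age_s}s"
    if fields["STATUS"] != "0":
        return False, f"Last status is {fields['STATUS']}, expected 0"
    return True, "OK"
```

Checking the timestamp before the status catches the case where the batch job silently stopped running, which a status-only check would miss.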

27
Command line based info.
• Command line applications
      • DB, Softswitch, Etc.
• Example
      • snmpd.conf
     pass .1.3.6.1.4.1.15867.1.100 /bin/bash /usr/local/emind/replication_status.sh

      • Run the script replication_status.sh
         • Execute SQL Query - SHOW SLAVE STATUS\G
         • Parse the output
         • Return data to snmpd
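
The parse-and-return steps might look like the sketch below. It assumes the vertical `Key: value` output of the query is already available as text, and it builds the three-line reply (OID, type, value) that a net-snmp `pass` script prints for a GET request. The `.1` and `.2` sub-identifiers and the function names are hypothetical; the slides do not define the OID layout under the base OID.

```python
BASE_OID = ".1.3.6.1.4.1.15867.1.100"  # base OID from the snmpd.conf line above


def parse_slave_status(text):
    """Parse the vertical 'Key: value' output of the replication status query."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # row-separator lines contain no colon and are skipped
            fields[key.strip()] = value.strip()
    return fields


def pass_get(requested_oid, fields):
    """Build the reply for a GET: OID, type and value, one per line.

    An empty string means the OID is not served by this script.
    """
    if requested_oid == BASE_OID + ".1":  # hypothetical: replication delay in seconds
        lag = fields.get("Seconds_Behind_Master", "NULL")
        value = lag if lag.isdigit() else "-1"  # NULL: replication is not running
        return f"{requested_oid}\ninteger\n{value}\n"
    if requested_oid == BASE_OID + ".2":  # hypothetical: 1 if both threads run
        running = (fields.get("Slave_IO_Running") == "Yes"
                   and fields.get("Slave_SQL_Running") == "Yes")
        return f"{requested_oid}\ninteger\n{1 if running else 0}\n"
    return ""
```

snmpd calls a `pass` script with `-g OID` for a GET, so a wrapper would run the query, feed its output to `parse_slave_status`, and print the result of `pass_get` for the requested OID.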



28
Important to remember
• Define your goals: what’s right for you?
     • Don’t over monitor
     • Use methodologies and technologies that fit your network
       and needs, not the other way around.
• Build generic interfaces
     • SNMP
     • Simple command line
     • No proprietary protocols
     • Agnostic to the monitoring tool




29
Mobile Access




30
Leading tools – Open source first…
• Nagios
     • Very generic, lots of public plug-ins
     • Easy to tweak and build in your own style
• Zabbix
     • A complete monitoring solution
     • Less customizable
• Cacti
     • Graphing (RRDtool)
     • Easy to configure
     • Lots of public templates



31
Monitoring on the Cloud
• Nagios + Dynamic Configuration = Dynagios !
• Key features
     • Auto provisioning
     • Add, Remove, Suspend, Unsuspend
• Machines are monitored based on their
     predefined profiles
•    Machines can join / leave the monitor
     (purposely)
     • Join on boot
     • Leave on shutdown
     • If a crash happens, an alert is raised
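
The join / leave behaviour can be illustrated with a small registry. This is not Dynagios code: the class, its method names and the heartbeat timeout are assumptions used to show why a clean leave stays silent while a crash raises an alert.

```python
import time


class MonitorRegistry:
    """Tracks which machines are expected to report (hypothetical sketch)."""

    def __init__(self, heartbeat_timeout_s=120):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_seen = {}  # hostname -> time of last heartbeat
        self.profiles = {}   # hostname -> predefined profile name

    def join(self, hostname, profile, now=None):
        """Called on boot: start monitoring the machine under its profile."""
        self.profiles[hostname] = profile
        self.last_seen[hostname] = time.time() if now is None else now

    def heartbeat(self, hostname, now=None):
        """Refresh the last-seen time of a machine that has joined."""
        if hostname in self.last_seen:
            self.last_seen[hostname] = time.time() if now is None else now

    def leave(self, hostname):
        """Called on clean shutdown: stop monitoring, so no alert is raised."""
        self.profiles.pop(hostname, None)
        self.last_seen.pop(hostname, None)

    def crashed(self, now=None):
        """Machines that stopped reporting without leaving."""
        now = time.time() if now is None else now
        return sorted(h for h, seen in self.last_seen.items()
                      if now - seen > self.heartbeat_timeout_s)
```

Because a machine that shuts down on purpose removes itself first, only machines that disappear unexpectedly show up in `crashed()`.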


32
Commercial tools

Feature                                         ManageEngine       Orion
OS Requirements                                 Windows & Linux    Windows
Modular (applications, IP SLA, Netflow, Conf.)  Yes                Yes
Very easy to deploy                             Yes                Yes
Multi vendor with lots of templates             Yes                Yes
Maps                                            Yes                Yes
Less customizable … but still flexible          Yes                Yes
Est. price for 100 devices                      $10k               $15k




33
Questions?




34
