IBM System x reference architecture for Hadoop: MapR

07 August 2013
Authors: Beth L. Hoffman, IBM; Billy Robinson, IBM; Andy Lerner, MapR Technologies

Copyright IBM Corporation, 2013

Table of contents
Introduction
Business problem and business value
    Business problem
    Business value
Requirements
    Functional requirements
    Nonfunctional requirements
Architectural overview
Component model
Operational model
    Cluster nodes
    Networking
    Predefined configurations
    Number of nodes for MapR management services
    Deployment diagram
Deployment considerations
    Systems management
    Storage considerations
    Performance considerations
    Scaling considerations
    High availability considerations
    Migration considerations
Appendix 1: Bill of material
    Node
    Administration / Management network switch
    Data network switch
    Rack
    Cables
Resources
Trademarks and special notices

Introduction
This reference architecture is a predefined and optimized hardware infrastructure for MapR M5 Edition, a distribution of Hadoop with value added capabilities produced by MapR Technologies. The reference architecture provides a predefined hardware configuration for implementing MapR M5 Edition on IBM System x hardware and IBM networking products. MapR M5 Edition is a complete Hadoop distribution supporting MapReduce, HBase, and Hadoop ecosystem workloads. MapReduce is a core component of Hadoop that provides an off-line, batch-oriented framework for high-throughput data access and distributed computation. The predefined configuration provides a baseline configuration for a MapR cluster which can be modified based on the specific customer requirements, such as lower cost, improved performance, and increased reliability. The intended audience of this document is IT professionals, technical architects, sales engineers, and consultants to assist in planning, designing and implementing the MapR solution on IBM System x. It is assumed that you have existing knowledge of Apache Hadoop components and capabilities. The Resources section provides links to Hadoop information.

Business problem and business value


This section describes the business problem associated with big data environments and the value that the MapR solution on IBM System x offers.

Business problem
Every day, around 2.5 quintillion bytes of data is created, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals, to name a few. This data is big data. Big data spans three dimensions:
- Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
- Velocity: Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.
- Variety: Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.

Big data is more than a challenge; it is an opportunity to find insight in new and emerging types of data to make your business more agile, and to answer questions that, in the past, were beyond reach. Until now, there was no practical way to harvest this opportunity. Today, MapR uses the latest big data technologies, such as the massive MapReduce scale-out capabilities of Hadoop, to open the door to a world of possibilities.

Business value
MapR provides a Hadoop-based big data solution that is easy to manage, dependable, and fast.

MapR eliminates the complexity of setting up and managing Hadoop. MapR provides alerts, alarms, and insights through an advanced graphical interface. MapR Heatmap provides a clear view of cluster health and performance, and MapR volumes simplify data security, retention, placement, and quota management. MapR provides Direct Access NFS, which allows users to mount the entire Hadoop cluster as an NFS volume and simplifies how an application writes to and reads from a Hadoop cluster.

MapR provides a high level of availability, including support for rolling upgrades, self-healing, and automated stateful failover. MapR also provides dependable data storage with full data protection and business continuity features, including snapshots and mirroring. Unique MapR functions improve MapReduce throughput.

MapR deployed on IBM System x servers with IBM networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.

Requirements
The functional and nonfunctional requirements for the MapR reference architecture are described in this section.

Functional requirements
The key functional requirements for a big data solution include:
- Support for a variety of application types, including batch and real-time analytics
- Support for industry-standard interfaces, so that existing applications can work with MapR
- Support for real-time streaming and processing of data
- Support for a variety of data types and databases
- Support for a variety of client interfaces
- Support for large volumes of data

MapR supports mission-critical and real-time big data analytics across different industries. MapR is used across financial services, retail, media, healthcare, manufacturing, telecommunications and government organizations and by leading Fortune 100 and Web 2.0 companies. The MapR platform for big data can be used for a variety of use cases, from batch applications that use MapReduce with data sources such as clickstreams to real-time applications that use sensor data. The MapR platform for Apache Hadoop integrates a growing set of functions, including MapReduce, file-based applications, interactive SQL, NoSQL databases, search and discovery, and real-time stream processing. With MapR, data does not need to be moved to specialized silos for processing; data can be processed in place.

Figure 1 shows MapR key capabilities that meet the functional requirements stated earlier.

Figure 1: MapR key capabilities

MapReduce: MapR provides the required performance for MapReduce operations on Hadoop and publishes performance benchmarking results. The MapR architecture is built in C/C++ and harnesses distributed metadata with an optimized shuffle process, enabling MapR to deliver consistent high performance.

File-based applications: MapR is a 100% Portable Operating System Interface (POSIX) compliant system that fully supports random read-write operations. Because MapR supports the industry-standard Network File System (NFS), users can mount a MapR cluster and run any file-based application, written in any language, directly on the data residing in the cluster. All standard tools in the enterprise, including browsers, UNIX tools, spreadsheets, and scripts, can access the cluster directly without any modifications.

SQL: A number of applications support SQL access against data contained in MapR, including Hive, Hadapt and others. MapR is also leading the development of Apache Drill, which brings American National Standards Institute (ANSI) SQL capabilities to Hadoop.

Database: MapR has removed the trade-offs that organizations face when looking to deploy a NoSQL solution. Specifically, MapR delivers ease of use, dependability, and performance advantages for HBase applications. MapR provides scalability, strong consistency, reliability, and continuous low latency with an architecture that does not require compactions or background consistency checks.

Search: MapR integrates enterprise-grade search. On a single platform, customers can now perform predictive analytics and full search and discovery, and conduct advanced database operations. The MapR enterprise-grade search capability works directly on Hadoop data but can also index and search standard files without having to perform any conversion or transformation.

Stream processing: MapR provides a simplified architecture for real-time stream computational engines such as Storm. Streaming data feeds can be written directly to the MapR platform for Hadoop for long-term storage and MapReduce processing.

Nonfunctional requirements
Customers require their big data solution to be easy, dependable, and fast. Here is a list of the key nonfunctional requirements, which fall under the headings Easy, Dependable, and Fast:
- Superior performance
- Scalability
- Data security and protection
- Automated self-healing
- Insight into software/hardware health and issues
- High availability (HA) and business continuity (99.999% uptime)
- Ease of development
- Easy management at scale
- Advanced job management
- Multitenancy

The MapR solution provides features and capabilities that fulfill the nonfunctional requirements requested by customers. The blue boxes in Figure 1 illustrate the key MapR capabilities that meet the nonfunctional requirements. MapR provides dependability, ease of use, and speed to Hadoop, NoSQL, database, and streaming applications in one unified big data platform. This full range of applications and data sources benefits from the MapR enterprise-grade platform and unified architecture for files and tables.

The MapR platform provides high availability, data protection and disaster recovery to support mission-critical applications. The MapR platform also makes it easier to use existing applications and solutions by supporting industry-standard interfaces, such as NFS. To support a diverse set of applications and users, MapR also provides multitenancy features and volume support. These features include support for heterogeneous hardware within a cluster, and data and job placement control so that applications can be selectively run in a cluster to take advantage of faster processors or solid-state drives (SSDs).

The subsequent sections in this document describe the reference architecture that meets the business needs and the functional and nonfunctional requirements.

Architectural overview
Figure 2 shows the main features of the MapR IBM System x reference architecture. Users can log in to the MapR client from outside the firewall using Secure Shell (SSH) on port 22 to access the MapR solution in the corporate network. MapR provides several interfaces that allow administrators and users to perform administration and data functions depending on their roles and access level. Hadoop application programming interfaces (APIs) can be used to access data. MapR APIs can be used for cluster management and monitoring. MapR data services, management services, and other services run on the nodes in the cluster. Storage is a component of each node in the cluster. Data can be ingested into the MapR Hadoop storage either through the Hadoop APIs or through NFS, depending on the needs of the customer.

Figure 2: MapR architecture overview

Component model
MapR is a complete distribution that includes HBase, Pig, Hive, Mahout, Cascading, Sqoop, Flume and more. The MapR distribution is 100% API-compatible with Hadoop (MapReduce, HDFS and HBase). The M5 Edition includes advanced high availability and data protection features, such as JobTracker HA, No NameNode HA, snapshots, and mirroring. The M5 Edition supports enterprise mission-critical deployments. MapR M5 includes the following features:

MapR Control System and Heatmap: The MapR Control System (MCS) provides full visibility into cluster resources and activity. The MCS dashboard includes the MapR Heatmap, which provides visual insight into node health, service status, and resource utilization, organized by the cluster topology (for example, data centers and racks). Designed to manage large clusters with thousands of nodes, the MapR Heatmap shows the health of the entire cluster at a glance. Because the number of nodes, files, and volumes can be very high, filters and group actions are also provided to select specific components and perform administrative actions directly. The Heatmap interfaces are designed for managing the smallest to the largest clusters, and command-line interface (CLI) and Representational State Transfer (REST) access are also included.

MapR No NameNode high availability: The MapR Hadoop distribution is unique because it was designed for high availability. MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode, and when the NameNode goes down, the entire cluster becomes unavailable until the NameNode is restarted. In those cases where other distributions are configured with multiple NameNodes, the entire cluster becomes unavailable during the failover to a secondary NameNode. With MapR, file metadata is replicated and distributed, so that there is no data loss or downtime even in the face of multiple disk or node failures.
MapR JobTracker high availability: The MapR JobTracker HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the MapR JobTracker automatically restarts on another node in the cluster. TaskTrackers can automatically pause and then reconnect to the new JobTracker. Any currently running jobs or tasks continue without losing any progress or failing.

MapR storage services: MapR stores data in a distributed shared system that eliminates contention and the expense from data transport and retrieval. Automatic, transparent client-side compression reduces the use of network resources and the footprint on disk, while direct block device I/O provides throughput at hardware speed without additional resources. As an additional performance boost, with MapR, you can read files while they are still being written. The MapR No NameNode architecture scales linearly with the number of nodes, providing unlimited file support. You only need to add nodes to increase the number of files supported to more than a trillion files containing over 1000 PB of data.

MapR Direct Access NFS: MapR Direct Access NFS makes Hadoop radically easier and less expensive to use by letting the user mount the Hadoop file system from a standard NFS client. Unlike the write-once system found in other Hadoop distributions, MapR allows files to be modified and overwritten, and enables multiple concurrent reads and writes on any file. Users can browse files, automatically open associated applications with a mouse click, or drag files and directories into and out of the cluster. Additionally, standard command-line tools and UNIX applications and utilities (such as grep, tar, sort, and tail) can be used directly on data in the cluster. With other Hadoop distributions, the user must copy the data out of the cluster in order to use standard tools.

MapR job metrics: The MapR job metrics service provides in-depth access to the performance statistics of your cluster and the jobs that run on it. With MapR job metrics, you can examine trends in resource use, diagnose unusual node behavior, or examine how changes in your job configuration affect the job's execution.

MapR snapshots: MapR provides snapshots that are atomic and transactionally consistent. MapR snapshots provide protection from user and application errors with flexible schedules to accommodate a range of recovery point objectives. MapR snapshots can be scheduled or performed on demand. Recovering from a snapshot is as easy as dragging the directory or files to the current directory. MapR snapshots offer high performance and space efficiency. No data is copied in order to create a snapshot. As a result, a snapshot of a petabyte volume can be performed in seconds. A snapshot operation does not have any impact on write performance because MapR uses redirect-on-write to implement snapshots. All writes in MapR go to new blocks on disk. This means that a snapshot only needs to retain references to the old blocks and does not require copying data blocks.

MapR mirroring: MapR makes data protection easy and built-in. Going far beyond replication, MapR mirroring means that you can set policies around your recovery time objectives (RTO) and mirror your data automatically within your cluster, between clusters (such as a production and a research cluster), or between sites.
MapR volumes: MapR volumes make cluster data both easy to access and easy to manage by grouping related files and directories into a single tree structure so they can be easily organized, managed, and secured. MapR volumes provide the ability to apply policies including the following:
- Replication factor
- Scheduled mirroring
- Scheduled snapshots
- Data placement and topology control
- Quotas and usage tracking
- Administrative permissions
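
The redirect-on-write behavior described under MapR snapshots can be illustrated with a toy model. The following Python sketch is not MapR code; the Volume class and its methods are hypothetical names invented for illustration. It only shows why a snapshot that keeps references to old blocks needs no data copy when every write goes to a new block.

```python
class Volume:
    """Toy volume illustrating redirect-on-write snapshots (hypothetical, not MapR code)."""

    def __init__(self):
        self.blocks = {}      # block id -> data; existing blocks are never overwritten
        self.live = {}        # file name -> block id of the current version
        self.snapshots = {}   # snapshot name -> frozen {file name: block id} map
        self._next_id = 0

    def write(self, name, data):
        # Redirect-on-write: every write lands in a brand-new block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[name] = block_id

    def snapshot(self, snap_name):
        # Only the references (name -> block id) are copied, never the data.
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name, snap_name=None):
        table = self.live if snap_name is None else self.snapshots[snap_name]
        return self.blocks[table[name]]


vol = Volume()
vol.write("report.txt", "v1")
vol.snapshot("monday")
vol.write("report.txt", "v2")      # goes to a new block; the old block stays intact
print(vol.read("report.txt"))             # current version
print(vol.read("report.txt", "monday"))   # version preserved by the snapshot
```

In this model the cost of taking a snapshot depends on the size of the reference map, not on the amount of data stored, which is the property the text attributes to MapR snapshots.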

These features are implemented by a set of MapR components as shown in the MapR component model in Figure 3.

Figure 3: MapR component model

MapR includes several open source projects, many of which are shown in the Hadoop ecosystem box:
- Apache Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data around a cluster.
- Apache Hadoop: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- Apache HBase: The Hadoop database, a distributed, scalable, big data store.
- Apache HCatalog: A table and storage management service for data created using Apache Hadoop.
- Apache Hive: A data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop compatible file systems.
- Apache Mahout: A scalable machine learning library.
- Apache Oozie: A workflow coordination manager.
- Apache Pig: A language and runtime for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.
- Apache Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
- Apache Whirr: A set of libraries for running cloud services.
- Apache ZooKeeper: A distributed service for maintaining configuration information, providing distributed synchronization, and providing group services.
- Cascading: An application framework for Java developers to quickly and easily develop robust data analytics and data management applications on Apache Hadoop.

MapR data services
All cluster nodes can run data or worker services, which run the MapReduce tasks over a subset of the data. Cluster nodes also store the MapR File System (MapR-FS) data. In large clusters, some nodes may be dedicated to management services. Each node hosts the following data services:
- MapR-FS: Provides distributed file services.
- TaskTracker: Runs map and reduce tasks for MapReduce clusters.
- HBase RegionServer (optional): Maintains and serves table regions in an HBase cluster.

MapR management services
MapR management services can run on any node. In multi-rack configurations, these services should be on nodes spread across racks. Here is the list of MapR management services:
- Container Location Data Base (CLDB): Manages and maintains container location information and replication.
- JobTracker: A Hadoop service that farms out MapReduce tasks to specific nodes in a MapReduce cluster.
- ZooKeeper: Coordinates activity and keeps track of management service locations on the cluster.
- HBase Master (optional): Monitors all RegionServer instances in an HBase cluster and manages metadata changes. For high availability, more than one node can be configured to run HBase Master. In this case, only one HBase Master is active while additional HBase Masters are available to take over HBase management if the active HBase Master fails.

Other optional MapR services
MapR offers two other optional services that can be run on multiple nodes in the cluster:
- NFS server (Gateway): Provides NFS access to the distributed file system. The NFS server is often run on all nodes in the cluster to allow local mounting of the cluster file system from any node.
- WebServer: Provides the MapR Control System graphical user interface (GUI) and REST API.
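
Because the NFS server exposes the cluster as an ordinary file system, a file-based program needs no Hadoop-specific API to read cluster data. The following Python sketch shows the idea with a small grep-like helper. The mount path and file name are hypothetical placeholders, and count_matching_lines is an illustrative function, not part of any MapR interface.

```python
import re
from pathlib import Path


def count_matching_lines(path, pattern):
    """Count the lines of a file that match a regular expression, like `grep -c`."""
    regex = re.compile(pattern)
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if regex.search(line))


# Hypothetical NFS mount point and file name; substitute the path used at your site.
EXAMPLE_LOG = Path("/mapr/my.cluster.com/logs/app.log")

if EXAMPLE_LOG.exists():
    print(count_matching_lines(EXAMPLE_LOG, r"ERROR"))
```

The same function works unchanged on a local file, which is the point: once the cluster is mounted, the program cannot tell the difference.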

MapR provides support for many client interfaces, several of which were described in the architecture overview or feature list. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) can be used to access data in the MapR Hadoop cluster. A MapR CLI provides an additional way to manage the cluster and services.

Hadoop and the MapR solution are operating system independent. MapR supports many Linux operating systems. Red Hat Linux and SUSE Linux are supported with the IBM System x reference architecture. The details about the versions can be found at: http://www.mapr.com/doc/display/MapR/Requirements+for+Installation

Operational model
This section describes the operational model for the MapR reference architecture. The operational model focuses on a cluster of nodes to support a given amount of data. In order to illustrate the operational model for different sized customer environments, four different models are provided for supporting different amounts of data. Throughout the document, these are referred to as the starter rack, half rack, full rack, and multirack configuration sizes. Note that the operational model for the multirack size is three times larger than the full rack size, and therefore, data amounts between them can be extrapolated by using different multiples of the full rack operational model.

A predefined configuration for the MapR solution consists of cluster nodes, networking, power, and a rack. The predefined configurations can be implemented as is or modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. Key workload requirements such as the data growth rate, sizes of datasets, and data ingest patterns help in determining the proper configuration for a specific deployment. A best practice when designing a MapR cluster infrastructure is to conduct the necessary testing and proofs of concept against representative data and workloads to ensure that the proposed design achieves the necessary success criteria.

Cluster nodes
The MapR reference architecture is implemented on a set of nodes that make up a cluster. Nodes are implemented on System x3630 M4 servers with locally attached storage. MapR runs well on a homogeneous server environment with no need for different hardware configurations for management and data services. Server nodes can run three different types of services:
- Data (worker) services for storing and processing data
- Management (control) services for coordinating and managing the cluster
- Miscellaneous services (optional) for file and web serving

Unlike other Hadoop distributions that require different server configurations for management nodes and data nodes, the MapR reference architecture requires only a single MapR node hardware configuration. Each node is then configured to run one or more of the mentioned services. In large clusters, a node could be dedicated to running only management services.

Each node in the cluster is an IBM System x3630 M4 server. Each predefined node is made up of the components shown in Table 1. The Intel Xeon processor E5-2450 is recommended to provide sufficient performance. A minimum of 48 GB of memory is recommended for most MapReduce workloads, with 96 GB or more recommended for HBase and memory-intensive MapReduce workloads. A choice of 3 TB and 4 TB drives is suggested depending on the amount of data that needs to be stored.

For the hard disk drive (HDD) controller, JBOD (just a bunch of disks) is the best choice for a MapR cluster. It provides excellent performance and, when combined with the Hadoop default of 3x data replication, also provides significant protection against data loss. The use of Redundant Array of Independent Disks (RAID) with data disks is discouraged with MapR. MapR provides an automated way to set up and manage storage pools. (RAID can be used to mirror the OS, which is described in a later section.) Nodes can be customized according to client needs.

Component | Predefined configuration
----------|-------------------------
System | System x3630 M4
Processor | 2 x Intel Xeon Processor E5-2450 2.1GHz 8-core
Memory - base | 48 GB: 6 x 8GB 1600MHz RDIMM (minimum)
Disk (OS) (a) | 3 TB drives: 2 x 3TB NL SATA 3.5 inch, or 4 TB drives: 2 x 4TB NL SATA 3.5 inch
Disk (data) (b) | 3 TB drives: 12 x 3TB NL SATA 3.5 inch (36TB total), or 4 TB drives: 12 x 4TB NL SATA 3.5 inch (48TB total)
HDD controller | 6Gb JBOD controller
Hardware storage protection | None (JBOD). By default, MapR maintains a total of three copies of data stored within the cluster. The copies are distributed across data servers and racks for fault recovery.
Hardware management network adapter | Integrated 1GBaseT Adapter
Data network adapter | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter

Table 1: Node predefined configuration.

a. OS drives may be of a smaller size than the data drives. OS drives are used for the operating system, the MapR application, and software monitoring.
b. Data drives should be of the same size, either all 3 TB or all 4 TB.

The reference architecture recommends the storage-rich System x3630 M4 model for several reasons:
- Storage capacity: The nodes are storage-rich. Each of the fourteen 3.5 inch drives has raw capacity up to 4 TB for a total of 56 TB per node and over a petabyte per rack.
- Performance: This hardware supports the latest Intel Xeon processors based on the Intel Sandy Bridge microarchitecture.
- Flexibility: Server hardware uses embedded storage, resulting in simple scalability; just add more nodes.
- More PCIe slots: Up to two PCIe slots are available. They can be used for network adapter redundancy and increased network throughput.
- Better power efficiency: The server offers a common form factor (CFF) power supply unit (PSU) including Platinum 80 plus options.
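
The drive counts in Table 1 and the default of three data copies make it possible to estimate capacity with simple arithmetic. The Python sketch below is a rough planning aid under stated assumptions: it ignores file system overhead, compression, and space reserved for temporary data, and the function name and the 10-node cluster size are hypothetical values chosen for illustration.

```python
def capacity_tb(nodes, data_drives=12, drive_tb=3, replication=3):
    """Return (raw_tb, usable_tb) for the data drives of a cluster.

    Assumes every node has the same drives and that usable space is simply
    raw space divided by the number of copies kept of each block.
    """
    raw = nodes * data_drives * drive_tb
    return raw, raw / replication


# One node with twelve 3 TB data drives, as in Table 1.
print(capacity_tb(nodes=1))
# One node with twelve 4 TB data drives.
print(capacity_tb(nodes=1, drive_tb=4))
# Hypothetical 10-node cluster with 4 TB data drives.
print(capacity_tb(nodes=10, drive_tb=4))
```

With three copies of each block, a node with 36 TB of raw data capacity contributes roughly 12 TB of unique data to the cluster under these assumptions.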

Networking
With respect to networking, the reference architecture specifies two networks: a data network and an administrative/management network. Figure 4 shows the networking configuration for MapR installed on a cluster with one rack.

Data network
The data network is a private cluster data interconnect among nodes used for data access, moving data across nodes within a cluster, and ingesting data into the MapR file system. The MapR cluster typically connects to the customer's corporate data network. One top of rack switch is required for the data network used by MapR. Either a 1GbE or 10GbE switch can be used. A 1Gb Ethernet switch is sufficient for some workloads. A 10Gb Ethernet switch can provide extra I/O bandwidth for added performance. This rack switch for the data network has two physical, aggregated links connected to each of the nodes in the cluster. The data network is a private virtual local area network (VLAN) or subnet. The two Mellanox 10GbE ports of each node can be link aggregated to the recommended G8264 rack switch for better performance and improved HA.

Alternatively, MapR can automatically take advantage of multiple data networks provided by multiple switches. Even when only one switch is used, because it has ports available and cluster nodes have multiple data network interfaces, it can be configured with multiple subnets to allow multiple network connections to each node. MapR automatically uses additional network connections to increase bandwidth.

The recommended 10GbE switch is the IBM System Networking RackSwitch G8264. The recommended 1GbE switch is the IBM RackSwitch G8052. The enterprise-level IBM RackSwitch G8264 has the following characteristics:
- Up to sixty-four 1Gb/10Gb SFP+ ports
- Four 40 Gb QSFP+ ports
- Support for 1.28 Tbps of non-blocking throughput
- Energy-efficient, cost-effective design
- Optimized for applications requiring high bandwidth and low latency
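
The bandwidth figures above can be checked with a short calculation. The Python sketch below assumes that the 1.28 Tbps figure counts sixty-four 10 Gb ports in both directions (full duplex); that reading is an assumption, not something the text states. The function names are illustrative.

```python
def aggregated_link_gbps(links, link_gbps):
    """Nominal one-way bandwidth of a group of aggregated links."""
    return links * link_gbps


def full_duplex_throughput_tbps(ports, port_gbps):
    """Switch throughput counting each port in both directions, in Tbps."""
    return ports * port_gbps * 2 / 1000


# Two aggregated 10GbE ports per node.
print(aggregated_link_gbps(2, 10))
# Sixty-four 10 Gb ports, counted in both directions (assumed basis of the figure).
print(full_duplex_throughput_tbps(64, 10))
```

Nominal link speeds are an upper bound; the throughput a workload actually sees also depends on how traffic is distributed across the aggregated links.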

Each node has two Mellanox dual-port 10GbE networking cards for fault tolerance. The first port on each Mellanox card should connect back to the G8264 switch at the top of the rack. The second port on each Mellanox card is available to connect into the customer's data network in cases when an optional edge node is not used for data ingest and access. The customer data network can optionally go through an edge node to exchange data with the cluster data network. Edge nodes can help control access to the cluster and can be configured to control access to the data network, administration network, or both.

Hardware management network
The hardware management network is a 1GbE network used for in-band OS administration and out-of-band hardware management. In-band administrative services, such as SSH or Virtual Network Computing (VNC), running on the host operating system allow administration of cluster nodes. Out-of-band management, through the Integrated Management Module II (IMM2) within the System x3630 M4 server, allows hardware-level management of cluster nodes, such as node deployment or basic input/output system (BIOS) configuration. Hadoop has no dependency on the IMM2.

Based on customer requirements, the administration links and management links can be segregated onto separate VLANs or subnets. The administrative/management network is typically connected directly to the customer's administrative network. When the in-band administrative services on the host operating system are used, MapR should be configured to use only the data network. By default, MapR uses all the available network interfaces.

The reference architecture requires one 1Gb Ethernet top of rack switch for the hardware management network. The two 10Gb uplinks between the G8052 and G8264 top of rack switches (as shown in Figure 4) are optional, but they can be used in customer environments that require faster routing and access over the administration network to all the nodes in the cluster. Administration users are also able to access all the nodes in the cluster through the customer admin network, as shown in Figure 5.

This rack switch for the hardware management network is connected to each of the nodes in the cluster using two physical links (one for in-band OS administration and one for out-of-band IMM2 hardware management). On the nodes, the administration link should connect to port 1 on the integrated 1GBaseT adapter and the management link should connect to the dedicated IMM2 port.

The recommended switch is the IBM RackSwitch G8052 with the following features:
- Forty-eight 1 GbE RJ45 ports
- Four standard 10 GbE SFP+ ports
- Low 130W power rating and variable speed fans to reduce power consumption

Figure 4: MapR rack network configuration

Multirack network
The data network in the predefined reference architecture configuration consists of a single network topology. The single G8264 data network switch within each rack represents a single point of failure. This can be addressed by building a redundant data network using an additional IBM RackSwitch G8264 top of rack switch per rack and appropriate additional IBM RackSwitch G8316 core switches per cluster. In this case, the second Mellanox 10GbE port can be connected to the second IBM RackSwitch G8264.

Figure 5 shows how the network is configured when the MapR cluster is installed across more than one rack. The data network is connected across racks by two aggregated 40GbE uplinks from each rack's G8264 switch to a core G8316 switch. A 40GbE switch is recommended for interconnecting the data network across multiple racks. IBM System Networking RackSwitch G8316 is the recommended switch. A best practice is to have redundant core switches for each rack to avoid a single point of failure. Within each rack, the G8052 switch can optionally be configured to have two uplinks to the G8264 switch to allow propagation of the administrative/management VLAN across cluster racks through the G8316 core switch.

Many other cross rack network configurations are possible and may be required to meet the needs of specific deployments or to address clusters larger than three racks.

If the solution is initially implemented as a multirack solution, or if the system grows by adding additional racks, the nodes that provide management services should be distributed across racks to maximize fault tolerance.
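When planning cross rack traffic, it can help to compare the node-facing bandwidth in a rack with the uplink bandwidth leaving it. The following Python sketch is an illustration only. It assumes a full rack of 20 nodes, each with two aggregated 10GbE links to the G8264 switch, and two aggregated 40GbE uplinks to the core switch, as described above. The function name and parameters are hypothetical, and the result is a rough planning ratio, not a measured value.

```python
def oversubscription_ratio(nodes_per_rack, links_per_node, link_gbps,
                           uplinks, uplink_gbps):
    """Return node-facing bandwidth divided by uplink bandwidth for one rack."""
    node_facing_gbps = nodes_per_rack * links_per_node * link_gbps
    uplink_total_gbps = uplinks * uplink_gbps
    return node_facing_gbps / uplink_total_gbps


# Assumed full rack: 20 nodes x 2 x 10GbE toward the nodes,
# 2 x 40GbE toward the core switch.
ratio = oversubscription_ratio(nodes_per_rack=20, links_per_node=2,
                               link_gbps=10, uplinks=2, uplink_gbps=40)
print(f"Rack oversubscription: {ratio:.1f}:1")
```

A ratio well above 1 suggests that workloads with heavy cross rack traffic may benefit from additional uplinks or core switches.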

Figure 5: MapR cross rack network configuration

Predefined configurations
Four predefined configurations for the MapR reference architecture are highlighted in Table 2. The table shows the amount of space for data and the number of nodes that each predefined configuration provides. Storage space is described in two ways: the total amount of raw storage space when using 3 TB or 4 TB drives (raw storage) and the amount of space for the data the customer has (available data space). Available data space assumes the use of Hadoop replication with three copies of the data, and 25% capacity reserved for efficient file system operation and to allow time to increase capacity if needed. Available data space might increase significantly with MapR automatic compression. The estimates in Table 2 do not include additional space freed up by using compression because compression rates can vary widely based on file contents.

| | Starter rack | Half rack | Full rack | Multirack |
|---|---|---|---|---|
| Storage space using 3 TB drives: raw storage | 108 TB | 360 TB | 720 TB | 2,160 TB |
| Storage space using 3 TB drives: available data space | 27 TB | 90 TB | 180 TB | 540 TB |
| Storage space using 4 TB drives: raw storage | 144 TB | 480 TB | 960 TB | 2,880 TB |
| Storage space using 4 TB drives: available data space | 36 TB | 120 TB | 240 TB | 720 TB |
| Number of nodes | 3 | 10 | 20 | 60 |

Table 2: Predefined configurations
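The figures in Table 2 follow directly from the stated assumptions. The short Python sketch below recomputes them, assuming 12 data drives per node (the data disk count used in the predefined node configuration), three replicas, and a 25% capacity reserve. The function and constant names are hypothetical and exist only for this illustration.

```python
DATA_DRIVES_PER_NODE = 12   # assumption: data drives per node in the predefined configuration
REPLICAS = 3                # default MapR file system replication
RESERVE_FRACTION = 0.25     # capacity held back for file system operation


def raw_storage_tb(nodes, drive_tb):
    """Total raw data disk capacity of the cluster in TB."""
    return nodes * DATA_DRIVES_PER_NODE * drive_tb


def available_data_space_tb(nodes, drive_tb):
    """Space left for user data after the reserve and replication, in TB."""
    usable = raw_storage_tb(nodes, drive_tb) * (1 - RESERVE_FRACTION)
    return usable / REPLICAS


CONFIGS = {"Starter rack": 3, "Half rack": 10, "Full rack": 20, "Multirack": 60}

for name, nodes in CONFIGS.items():
    for drive_tb in (3, 4):
        raw = raw_storage_tb(nodes, drive_tb)
        available = available_data_space_tb(nodes, drive_tb)
        print(f"{name}, {drive_tb} TB drives: raw {raw} TB, available {available:.0f} TB")
```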

The number of nodes required in the cluster to support these four predefined configurations is shown in Table 2. These are the estimates for highly available clusters. Three nodes are required to support a customer deployment that has 36 TB of data. Ten nodes are needed to support a customer deployment that has 120 TB of data, and so on.

When estimating disk space within a MapR Hadoop cluster, the following must be considered:
- For improved fault tolerance and improved performance, the MapR file system replicates data blocks across multiple cluster data nodes. By default, the file system maintains three replicas.
- Compression ratio is an important consideration in estimating disk space and can vary greatly based on file contents. MapR provides automatic compression. Available data space might increase significantly with MapR automatic compression. If the customer's data compression ratio is not available, assume a compression ratio of 2.5.
- To ensure efficient file system operation and to allow time to add more storage capacity to the cluster if necessary, reserve 25% of the total capacity of the cluster.

Assuming the default three replicas maintained by the MapR file system, the raw data disk space and the required number of nodes can be estimated using the following equations:

Total raw data disk space = (User data, uncompressed) * (4 / compression rate)
Total required nodes = (Total raw data disk space) / (Raw data disk per node)

The factor of 4 combines the three replicas with the 25% capacity reserve (3 / 0.75 = 4). The number of nodes is rounded up to the next whole node. You should also consider future growth requirements when estimating disk space.

Based on these sizing principles, Table 3 demonstrates an example for a cluster that needs to store 250 TB of uncompressed user data. The example shows that the MapR cluster will need to have 400 TB of raw disk to support 250 TB of uncompressed data. The 400 TB is for data storage and does not include OS disk space. Nine nodes, or nearly a half rack, would be required to support a deployment of this size.

| Description | Value |
|---|---|
| Size of uncompressed user data | 250 TB |
| Compression rate | 2.5x |
| Size of compressed data | 100 TB |
| Storage multiplication factor | 4 |
| Raw data disk space needed for MapR cluster | 400 TB |
| - Storage needed for MapR-FS 3x replication | 300 TB |
| - Storage reserved for headroom | 100 TB |
| Raw data disk per node (with 4 TB drives) | 48 TB |
| Minimum number of nodes required | 9 |

Table 3: Example of storage sizing with 4 TB drives
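The sizing equations and the Table 3 example can be expressed as a small calculation. The Python sketch below is one possible way to write it. The function names and default arguments are hypothetical, and the defaults (2.5x compression, three replicas, 25% reserve, 48 TB of raw data disk per node) simply mirror the assumptions stated in this section.

```python
import math


def raw_disk_needed_tb(user_data_tb, compression_rate=2.5,
                       replicas=3, reserve_fraction=0.25):
    """Raw data disk space (TB) needed for a given amount of uncompressed user data."""
    compressed_tb = user_data_tb / compression_rate
    # Three replicas with a 25% reserve gives the multiplication factor of 4.
    multiplier = replicas / (1 - reserve_fraction)
    return compressed_tb * multiplier


def nodes_required(user_data_tb, raw_tb_per_node=48, **sizing):
    """Minimum whole number of nodes needed, rounding up."""
    return math.ceil(raw_disk_needed_tb(user_data_tb, **sizing) / raw_tb_per_node)


print(raw_disk_needed_tb(250))   # raw TB for 250 TB of uncompressed user data
print(nodes_required(250))       # nodes with 4 TB drives (48 TB per node)
```

Passing `compression_rate=1` gives a conservative estimate for data that does not compress.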

Number of nodes for MapR management services


The nodes run MapR data services, management services, and other optional services. The number of nodes recommended for running management services and data services varies based on the size of the configuration. MapR is very flexible in its ability to use any node for any management service. Depending on workloads and HA requirements, multiple nodes could be dedicated to a single management service, multiple management services, or both management and data services. The number of nodes running management services can be customized based on specific workloads and HA requirements. Table 4 shows the number of nodes that should run management services depending on the size of cluster.

Number of nodes: < 40
Subset of nodes running management services: 3-5
Breakout of function:
- Dedicated management nodes are not required. Management services run on nodes that also run data and optional services.
- CLDB on two or three nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on three nodes (run ZooKeeper on an odd number of nodes).
- Reduce the number of task slots on servers running both data and management services to ensure the processor and memory resources are available to management services.
- WebServer and NFS server can also be run on nodes running management services.
- For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

Number of nodes: 40-100
Subset of nodes running management services: 5-7
Breakout of function:
- Dedicated management nodes are not required. Management services run on nodes that also run data and optional services.
- CLDB on two to four nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on three or five nodes (run ZooKeeper on an odd number of nodes).
- Reduce the number of task slots on servers running both data and management services to ensure that the processor and memory resources are available to management services.
- WebServer and NFS server can also be run on nodes running management services.
- Spread management services across racks and across nodes to avoid running an instance of each management service on a single node.
- For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

Number of nodes: > 100
Subset of nodes running management services: 7 or more
Breakout of function:
- Dedicate nodes to management services. Do not run data or optional services on the same nodes running management services.
- CLDB on three or more nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on five nodes (run ZooKeeper on an odd number of nodes).
- On very large clusters, dedicate nodes to running only CLDB.
- On very large clusters, dedicate nodes to running only ZooKeeper.
- Spread management services across racks and across nodes to avoid running an instance of each management service on a single node.

Table 4: Number of nodes running MapR management services

In clusters up to 100 nodes, management services typically reside on nodes that also provide data services. For a very small cluster that does not require failover or high availability, all the management services and data services can run on one node. However, HA is recommended and requires at least three nodes running management services. Even if high availability is not required, MapR M5 provides snapshots, mirrors, and multiple NFS servers. Also, the HA features of M5 provide a mechanism for administrators to perform rolling upgrades of management processes without any downtime on the cluster.

A small cluster with fewer than 40 nodes should be set up to run the management services on three to five nodes for high availability. A medium-sized cluster should be set up to run the management services on at least five nodes with ZooKeeper, CLDB, and JobTracker nodes distributed across racks. This environment provides failover and high availability for all critical services.

For clusters over 100 nodes, some nodes can be dedicated to management services. A large cluster can be set up to run the management services on a minimum of seven nodes with these nodes distributed across racks. In a large cluster, isolate the CLDB service from other services by placing it on dedicated nodes. In addition, in large clusters, ZooKeeper services should be isolated from other services on dedicated nodes. To reduce recovery time upon node failure, avoid running CLDB and ZooKeeper on the same node.
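The guidance in Table 4 can be captured as a simple lookup for use in planning scripts. The Python sketch below is a hypothetical helper, not part of any MapR tooling. The function name and dictionary keys are invented for illustration, and the values are transcribed from Table 4.

```python
def management_service_plan(cluster_nodes):
    """Return the Table 4 recommendation for a cluster of the given size."""
    if cluster_nodes < 40:
        return {"management_nodes": "3-5", "dedicated": False,
                "cldb": "2-3", "jobtracker_or_hbase_master": "2-3",
                "zookeeper": "3"}
    if cluster_nodes <= 100:
        return {"management_nodes": "5-7", "dedicated": False,
                "cldb": "2-4", "jobtracker_or_hbase_master": "2-3",
                "zookeeper": "3 or 5"}
    return {"management_nodes": "7 or more", "dedicated": True,
            "cldb": "3 or more", "jobtracker_or_hbase_master": "2-3",
            "zookeeper": "5"}


# Example: the predefined configurations of 3, 10, 20, and 60 nodes.
for size in (3, 10, 20, 60):
    plan = management_service_plan(size)
    print(size, plan["management_nodes"], plan["zookeeper"])
```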

Deployment diagram
The reference architecture for the MapR solution requires two separate networks: one for hardware management and one for data. Each network requires a rack switch. An additional data network switch can be configured for high availability and increased throughput. One rack switch occupies 1U of space in the rack.

Figure 6 shows an overview of the architecture in three different one-rack sized clusters without network redundancy: a starter rack, a half rack, and a full rack. Figure 7 shows a multirack-sized cluster without network redundancy. The intent of the four predefined configurations is to ease initial sizing for customers and to show example starting points for different sized workloads. The reference architecture is not limited to these four cluster sizes.

The starter rack configuration consists of three nodes and a pair of rack switches. The half rack configuration consists of 10 nodes and a pair of rack switches. The full rack configuration (a rack fully populated) consists of 20 nodes and a pair of rack switches. The multirack configuration contains a total of 60 nodes, with 20 nodes and a pair of switches in each rack. A MapR implementation can be deployed as a multirack solution to support larger workload and deployment requirements. The Networking section of this document describes the networking configuration across multiple nodes and multiple racks.

Figure 6: Starter rack, half rack, and full rack MapR predefined configurations

Figure 7: Multirack MapR predefined configuration

Deployment considerations
The predefined node configuration can be tailored to match customer requirements. Table 5 shows the common ways to adjust the predefined configuration. The Value options are available for customers who require a cost-effective solution. Performance options are for customers who want top performance. The Enterprise option offers redundancy. A combination of options, such as Value and Enterprise or Performance and Enterprise, can be considered as well.

The Intel Xeon processor E5-2450 is recommended to provide sufficient performance, but smaller or larger processors may be used. A minimum of 48 GB of memory is recommended for most MapReduce workloads, with 96 GB or more recommended for HBase and memory-intensive MapReduce workloads. A choice of 3 TB and 4 TB drives is suggested depending on the amount of data that needs to be stored. IBM also offers 2 TB drives. This size may meet the storage density requirements of some big data analytics workloads. SAS drives may be substituted in place of Serial Advanced Technology Attachment (SATA) drives, but IBM has found that SATA drives offer the same performance for less cost.

| Description | Value options | Enterprise options | Performance options |
|---|---|---|---|
| Node | 2 x Intel Xeon processor E5-2420 1.9 GHz 6-core, 1333 MHz RDIMM | 2 x Intel Xeon processor E5-2420 1.9 GHz 6-core, 1333 MHz RDIMM | 2 x Intel Xeon processor E5-2430 2.2 GHz 6-core, 1333 MHz RDIMM, or 2 x E5-2450 2.1 GHz 8-core, 1600 MHz RDIMM |
| Memory base | 48 GB (6 x 8 GB) | 48 GB (6 x 8 GB) or 72 GB (6 x 8 GB + 6 x 4 GB) | 72 GB (6 x 8 GB + 6 x 4 GB) or 96 GB (12 x 8 GB) |
| Disk (OS) | 1 x 3 TB 3.5 inch | 2 x 3 TB 3.5 inch (mirrored) | 1 x 3 TB 3.5 inch |
| Disk (data) | 36 TB (12 x 3 TB NL SATA 3.5 inch) or 48 TB (12 x 4 TB NL SATA 3.5 inch) | 36 TB (12 x 3 TB NL SATA 3.5 inch) or 48 TB (12 x 4 TB NL SATA 3.5 inch) | 36 TB (12 x 3 TB NL SATA 3.5 inch) or 48 TB (12 x 4 TB NL SATA 3.5 inch) |
| HDD controller | 6Gb JBOD controller | 6Gb JBOD controller | 6Gb JBOD controller |
| Available data space* (per node) | 9-22 TB with 3 TB drives or 12-30 TB with 4 TB drives | 9-22 TB with 3 TB drives or 12-30 TB with 4 TB drives | 9-22 TB with 3 TB drives or 12-30 TB with 4 TB drives |
| Data network switch | 1GbE switch with 4 x 10GbE uplinks (IBM G8052) | Redundant switches | 10GbE switch with 4 x 40GbE uplinks (IBM G8264) |
| Hardware management network switch | 1GbE switch with 4 x 10GbE uplinks (IBM G8052) | 1GbE switch with 4 x 10GbE uplinks (IBM G8052) | 1GbE switch with 4 x 10GbE uplinks (IBM G8052) |

Table 5: Additional hardware options

Systems management
The mechanism for systems management within the MapR solution is different from other Apache Hadoop distributions. The standard Hadoop distribution places the management services on servers that are separate from the data servers. In contrast, MapR management services are distributed across the same System x nodes that are used for data services.

Storage considerations
Each server node in the reference architecture has internal direct-attached storage. External storage is not used in this reference architecture.

The available data space per node listed in Table 5 assumes the use of Hadoop replication with three copies of the data, and 25% capacity reserved for efficient file system operation and to allow time to increase capacity if needed. Available data space may increase significantly with MapR automatic compression. Because compression can vary widely by file contents, this estimate provides a range from no compression up to 2.5 times compression. Some data may have even greater compression.

In situations where higher storage capacity is required, the main design approach is to increase the amount of data disk space per node. Using 4 TB drives instead of 3 TB drives increases the total per node data disk capacity from 36 TB to 48 TB, a 33% increase. Consider using the same size drives for the OS to simplify maintenance to one type of disk drive. When increasing data disk capacity, you must be cognizant of the balance between performance and throughput. For some workloads, increasing the amount of user data stored per node can decrease disk parallelism and negatively impact performance. In this case, and when 4 TB drives provide insufficient capacity, higher capacity can be achieved by increasing the number of nodes in the cluster.

Performance considerations
There are a couple of approaches to increasing cluster performance: increasing node memory and using a high-performance job scheduler and MapReduce framework. Often, improving performance comes at increased cost and you need to consider the cost/benefit trade-offs of designing for higher performance. In the MapR predefined configuration, node memory can be increased to 96 GB by using twelve 8 GB RDIMMs. An even larger memory configuration might provide greater performance depending on the workload.

Architecting for lower cost
There are two key modifications that can be made to lower the cost of a MapR reference architecture solution. When considering lower-cost options, it is important to ensure that customers understand the potential performance implications of a lower cost design. A lower cost version of the MapR reference architecture can be achieved by using lower cost node processors and lower cost cluster data network infrastructure.

The node processors can be substituted with the Intel Xeon E5-2430 2.2 GHz 6-core processor or the Intel Xeon E5-2420 1.9 GHz 6-core processor. These processors require 1333 MHz RDIMMs, which may also lower the per-node cost of the solution.

Using a lower cost network infrastructure can significantly lower the cost of the solution, but can also have a substantial negative impact on intra-cluster data throughput and cluster ingest rates. To use a lower cost network infrastructure, use the following substitutions to the predefined configuration:
- Within each node, substitute the Mellanox 10GbE dual SFP+ network adapter with the additional ports on the integrated 1GBaseT adapters within the System x3630 M4 server.
- Within each rack, substitute the IBM RackSwitch G8264 top of rack switch with the IBM RackSwitch G8052.
- Within each cluster, substitute the IBM RackSwitch G8316 core switch with the IBM RackSwitch G8264.
Though the network wiring schema is the same as that described in the networking section, the media types and link speeds within the data network have changed. The data network within a rack that connects the cluster nodes to the lower cost option, G8052 top of rack switch, is now based on two aggregated 1GBaseT links per node. The physical interconnect between the admin/management networks and the data networks within each rack is now based on two aggregated 1GBaseT links between the admin/management network G8052 switch and the lower cost data network G8052 switch. Within a cluster, the racks are interconnected through two aggregated 10GbE links between the substitute G8052 data network switch in each rack and a lower cost G8264 core switch.

Architecting for high ingest rates
Architecting for high ingest rates is not a trivial matter. It is important to have a full characterization of the ingest patterns and volumes. The following questions provide guidance to key factors that affect the rates:
- On what days and at what times are the source systems available or not available for ingest?
- When a source system is available for ingest, what is the duration for which the system remains available?
- Do other factors impact the day, time, and duration ingest constraints?
- When ingests occur, what is the average and maximum size of ingest that must be completed?
- What factors impact ingest size?
- What is the format of the source data (structured, semi-structured, unstructured)?
- Are there any data transformation or cleansing requirements that must be achieved during ingest?

Scaling considerations
The Hadoop architecture is designed to be linearly scalable. When the capacity of the existing infrastructure is reached, the cluster can be scaled out by simply adding additional nodes. Typically, identically configured nodes are best to maintain the same ratio of storage and compute capabilities. A MapR cluster is scalable by simply adding additional System x3630 M4 nodes and network switches and optionally adding management services and optional services on those nodes. The MapR No NameNode architecture allows linear scalability to trillions of files and thousands of petabytes. As the capacity of existing racks is reached, new racks can be added to the cluster. It is important to note that some workloads may not scale completely linearly.

When designing a new MapR reference architecture implementation, future scale out should be a key consideration in the initial design. There are two key aspects to consider: networking and management. Both of these aspects are critical to cluster operation and become more complex as the cluster infrastructure grows.

The cross rack networking configuration described in Figure 7 is designed to provide robust network interconnection of racks within the cluster. As additional racks are added, the predefined networking topology remains balanced and symmetrical. If there are plans to scale the cluster beyond one rack, a best practice is to initially design the cluster with multiple racks even if the initial number of nodes would fit within one rack. Starting with multiple racks can enforce proper network topology and prevent future reconfiguration and hardware changes. As racks are added over time, multiple G8316 switches may be required for greater scalability and balanced performance.

Also, as the number of nodes within the cluster increases, so do many of the tasks of managing the cluster, such as updating node firmware or operating systems. Building a cluster management framework as part of the initial design and proactively considering the challenges of managing a large cluster will pay off significantly in the long run. xCAT, an open source project that IBM supports, is a scalable distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and operating system deployment. Within the MapR reference architecture, the System x server integrated management modules (IMM2) and the cluster management network provide an out-of-band management framework that management tools, such as xCAT, can use to facilitate or automate the management of cluster nodes. Training is required to fully use the capabilities in xCAT. See the Resources section for more information about xCAT.

Proactive planning for future scale out and the development of a cluster management framework as a part of initial cluster design provide a foundation for future growth that can minimize hardware reconfigurations and cluster management issues as the cluster grows.

High availability considerations


When implementing a MapR cluster on System x, consider availability requirements as part of the final hardware and software configuration. Typically, Hadoop is considered a highly reliable solution, but MapR enhancements make it highly available. Hadoop and MapR best practices provide significant protection against data loss. MapR ensures that failures are managed without causing an outage. Redundancy can be added to make a cluster even more reliable. Some consideration should be given to both hardware and software redundancy.

Networking considerations
If network redundancy is a requirement, use an additional switch in the data networks. Optionally, a second redundant switch can be added to ensure high availability of the hardware management network. The hardware management network will not affect the availability of the MapR-FS or Hadoop functionality, but may impact the management of the cluster, and so, availability requirements must be considered.

MapR provides application-level Network Interface Card (NIC) bonding for higher throughput and high availability. When using multiple NICs, customers can choose either MapR application-level bonding or OS-level bonding with switch-based aggregation of some form matching the OS bonding configuration. Virtual Link Aggregation Groups (vLAG) can be used between redundant switches. vLAG is an IBM BNT switch feature. If 1Gbps data network links are used, it is recommended that more than one is used per node to increase throughput.

Hardware availability considerations
With no single point of failure, redundancy in server hardware components is not required for MapR. MapR automatically and transparently handles hardware failure resulting in the loss of any node in the cluster running any data or management service. MapR's default three-way replication of data ensures that no data is lost because two additional replicas of data are maintained on other nodes in the cluster.
MapReduce tasks or HBase region servers from failed nodes are automatically started on other nodes in the cluster. Failure of a node running any management service is automatically and transparently recovered as described in the following services. All ZooKeeper services are available for read operations, with one acting as the leader for all writes. If the node running the leader fails, the remaining nodes will elect a new leader.

Most commonly, three ZooKeeper instances are used to allow HA operations. In some large clusters, five ZooKeeper instances are used to allow fully HA operations even during maintenance windows that affect ZooKeeper instances. The number of instances of ZooKeeper services that must be run in a cluster depends on the cluster's high availability requirement, but it should always be an odd number. ZooKeeper requires a quorum of (N/2)+1 to elect a leader, where N is the total number of ZooKeeper nodes and N/2 is rounded down. Running more than five ZooKeeper instances is not necessary.

All CLDB services are available for read operations, with one acting as the write master. If the node running the master CLDB service goes down, another running CLDB will automatically become the master. A minimum of two instances is needed for high availability.

One JobTracker service is active. Other JobTracker instances are configured but not running. If the active JobTracker goes down, one of the configured instances automatically takes over without requiring any job to restart. A minimum of two instances is needed for high availability.

If running HBase on the cluster, one HBase Master service is active. Other HBase Master instances are configured, but not running. If the active HBase Master goes down, one of the configured instances automatically takes over HBase management. A minimum of two instances is needed for high availability. (MapR also offers M7 that provides a native NoSQL database through HBase APIs without requiring any HBase Master or RegionServer processes.)

All NFS servers are active simultaneously and can present an HA NFS server to nodes external to the cluster. To do this, specify the virtual IP addresses for two or more NFS servers for NFS high availability. Additionally, use round-robin Domain Name System (DNS) across multiple virtual IP addresses for load balancing in addition to high availability.
For NFS access from within the cluster, NFS servers should be run on all nodes in the cluster and each node should mount its local NFS server. MapR WebServer can run on any node in the cluster to run the MapR Control System. The web server also provides a REST interface to all MapR management and monitoring functions. For HA, multiple active web servers can be run with users connecting to any web server for cluster management and monitoring. Note that even with no web server running, all monitoring and management capabilities are available using the MapR command line interface. Within racks, switches and nodes have redundant power feeds with each power feed connected from a separate PDU.
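The ZooKeeper quorum rule explains why an odd number of instances is recommended. The Python sketch below is a minimal illustration of the (N/2)+1 rule using integer division. The function names are hypothetical and do not correspond to any ZooKeeper or MapR API.

```python
def quorum_size(instances):
    """Smallest number of ZooKeeper instances that forms a majority."""
    return instances // 2 + 1


def tolerated_failures(instances):
    """How many instances can fail while a quorum is still available."""
    return instances - quorum_size(instances)


for n in (3, 4, 5):
    print(f"{n} instances: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Four instances tolerate no more failures than three, so the extra even-numbered instance adds cost without adding availability. Five instances tolerate two failures, which is why five are used when one instance may be down for maintenance.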

Storage availability
RAID disk configuration is not necessary and should be avoided in MapR clusters. The use of RAID causes a negative impact on performance. MapR provides automated setup and management of storage pools. The three-way replication provided by MapR-FS provides higher durability than RAID configurations because multiple node failures might not compromise data integrity.

If the default 3x replication is not sufficient for availability requirements, the replication factor can be increased on a file, volume, or cluster basis. Replication levels higher than 5 are not normally used. Mirroring of MapR volumes within a single cluster can be used to achieve very high replication levels for higher durability or for higher read bandwidth. Mirrors can be used between clusters as well. MapR efficiently mirrors by only copying changes to the mirror. Mirrors are useful for load balancing or disaster recovery.

MapR also provides manual or scheduled snapshots of volumes to protect against human error and programming defects. Snapshots are useful for rollback to a known data set.

Software availability considerations

Operating system availability
One of the hard disk drives can be used on each node to mirror the operating system. RAID1 should be used to mirror the two drives. If OS mirroring is not used, the disk that would have been used for OS mirroring is available for MapR data storage. Using the same disk for OS and data is not recommended because it can compromise performance.

NameNode availability
The MapR Hadoop distribution is unique because it was designed with a No NameNode architecture for high availability. MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode and when the NameNode goes down, the entire cluster becomes unavailable until the NameNode is restarted. In cases where other distributions are configured with multiple NameNodes, the entire cluster becomes unavailable during the failover to a secondary NameNode. With MapR, the file metadata is replicated, distributed, and persistent, so that there is no data loss or downtime even in the face of multiple disk or node failures.

JobTracker availability
The MapR JobTracker HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the MapR JobTracker automatically restarts on another node in the cluster. TaskTrackers can automatically pause and then reconnect to the new JobTracker. Any currently running jobs or tasks continue without losing any progress or failing.

NFS availability

You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses. If one node fails, the virtual IP addresses will be automatically reassigned to the next NFS node in the pool. It is also common to place an NFS server on every node where NFS access to the cluster is needed.
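The failover behavior described above can be pictured with a small simulation. This is only a toy model: how MapR actually chooses the node that takes over a virtual IP address is not described in this document, so the function, the node names, and the addresses below are all hypothetical.

```python
def reassign_vips(assignments, pool, failed):
    """Return a new virtual-IP-to-node mapping after `failed` leaves the pool.

    assignments: dict mapping virtual IP address -> NFS node name
    pool:        ordered list of NFS node names
    failed:      name of the node that went down
    """
    survivors = [node for node in pool if node != failed]
    if not survivors:
        raise RuntimeError("no NFS nodes left in the pool")
    # In `survivors`, the position of the failed node now holds the node
    # that followed it in pool order (wrapping around at the end).
    start = pool.index(failed)
    new_assignments = {}
    moved = 0
    for vip, node in assignments.items():
        if node != failed:
            new_assignments[vip] = node
        else:
            new_assignments[vip] = survivors[(start + moved) % len(survivors)]
            moved += 1
    return new_assignments


if __name__ == "__main__":
    pool = ["nfs1", "nfs2", "nfs3"]
    vips = {"10.0.0.101": "nfs1", "10.0.0.102": "nfs2", "10.0.0.103": "nfs3"}
    print(reassign_vips(vips, pool, failed="nfs2"))
```

Clients that mount the cluster through a virtual IP address keep using the same address after the failure; only the node answering on that address changes.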

Migration considerations
If data or applications must be migrated to MapR, consider the type and amount of data to be migrated and the source of that data. Most data types can be migrated into MapR-FS, but the migration requirements must be understood to verify viability. Standard Hadoop tools such as distcp can be used to migrate data from other Hadoop distributions. For data in a POSIX file system, mount the MapR cluster over NFS and use standard Linux commands to copy the files into the MapR Hadoop cluster. Either Sqoop or database import/export tools with MapR NFS can be used to move data between databases and MapR Hadoop.

You also need to consider whether applications need to be modified to take advantage of Hadoop functionality. Because the MapR read/write file system can be mounted by a standard NFS client, much of the effort required to migrate applications to other Hadoop distributions can often be avoided.
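The two copy paths mentioned above (distcp from another Hadoop cluster, and an NFS mount for POSIX data) might be scripted roughly as follows. The sketch only builds the command lines and does not run them. The host names, paths, and cluster name are hypothetical, and the `nolock` mount option and the `/mapr/<cluster name>` directory layout are assumptions that should be checked against the MapR documentation for the installed release.

```python
import shlex


def distcp_command(source_uri, target_uri):
    """Build a `hadoop distcp` command that copies between two clusters."""
    return ["hadoop", "distcp", source_uri, target_uri]


def nfs_mount_command(nfs_node, mount_point="/mapr"):
    """Build a mount command for the MapR NFS export (assumed to be /mapr)."""
    return ["mount", "-o", "nolock", f"{nfs_node}:/mapr", mount_point]


def posix_copy_command(source_dir, cluster_name, volume_path, mount_point="/mapr"):
    """Build a recursive copy into the NFS-mounted cluster."""
    target = f"{mount_point}/{cluster_name}/{volume_path.lstrip('/')}"
    return ["cp", "-r", source_dir, target]


if __name__ == "__main__":
    commands = [
        # Hypothetical source NameNode and target path.
        distcp_command("hdfs://nn1:8020/data", "maprfs:///data"),
        nfs_mount_command("nfs1"),
        posix_copy_command("/srv/logs", "my.cluster.com", "/data/logs"),
    ]
    for command in commands:
        print(shlex.join(command))
```

Building the commands as argument lists rather than shell strings keeps paths with spaces intact if the lists are later passed to `subprocess.run`.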

Appendix 1: Bill of material


This section contains a bill of materials list for each of the core components of the reference architecture. The bill of materials includes the part numbers, component descriptions, and quantities. Table 6 shows how many core components are required for each of the predefined configuration sizes.

Size        Component                                 Quantity
Small       Node                                      3
            Administration/Management network switch  1
            Data network switch                       1
            Rack                                      1
            Cables                                    3
Medium      Node                                      10
            Administration/Management network switch  1
            Data network switch                       1
            Rack                                      1
            Cables                                    10
Large       Node                                      20
            Administration/Management network switch  1
            Data network switch                       1
            Rack                                      1
            Cables                                    20
Very large  Node                                      100
            Administration/Management network switch  5
            Data network switch                       5
            Rack                                      5
            Cables                                    100

Table 6: Mapping between predefined configuration sizes and the bill of materials.

Node

Part number  Description                                                         Quantity
7158AC1      IBM System x3630 M4                                                 1
A1ZD         PCIe Riser Card 2 (1x8 LP for Slotless RAID)                        1
A1ZA         PCIe Riser Card for slot 1 (1x8 FH/HL + 1x8 LP Slots)               1
A2ZQ         Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter                 1
5977         Select Storage devices no IBM-configured RAID required              1
A22S         IBM 3TB 7.2K 6Gbps NL SATA 3.5 G2HS HDD                             14
A1H5         IBM System x 750W High Efficiency Platinum AC Power Supply          1
A20R         System Documentation and Software-US English                        1
A1YJ         Intel Xeon Processor E5-2450 8C 2.1GHz 20MB Cache 1600MHz 95W       1
A1YW         Addl Intel Xeon Processor E5-2450 8C 2.1GHz 20MB Cache 1600MHz 95W  1
A292         8GB (1x8GB, 2Rx4, 1.5V) PC3-12800 CL11 ECC DDR3 1600MHz LP RDIMM    6
3876         SB- IBM 6Gb Performance Optimized HBA                               1
A1Z6         X3630 M4 Planar                                                     1
A20L         X3630 M4 Chassis ASM w/o Planar                                     1
6311         2.8m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable             1
A1Z8         3.5" Hot Swap BP Bracket Assembly, 12x 3.5"                         1
A1Z9         3.5" Hot Swap Cage Assembly, Rear, 2 x 3.5"                         1
2306         Rack Installation >1U Component                                     1
A20J         BIOS GBM                                                            1
A20B         L1 COPT, 1U RIASER CAGE - SLOT 2                                    1
A20C         L1 COPT, 1U BUTTERFLY RIASER CAGE - SLOT 1                          1
A288         X3630 M3 Agency Label                                               1
A1ZV         Label GBM                                                           1
A1ZX         2x2 HDD BRACKET                                                     1
A1ZY         X3630M4 Storage-Rich EIA LED cover kit                              1
A200         X3630M4 Storage-Rich EIA USB cover kit                              1
A207         Rail Kit for x3630 M4 and x3530 M4                                  1
A1ZJ         EIA USB Board                                                       1
A1ZK         EIA OP Board                                                        1
A2M3         X3630 M4 Shipping Bracket                                           1

Table 7: Sample node bill of materials.

Administration / Management network switch

Part number  Description                                              Quantity
7309HC1      IBM System Networking RackSwitch G8052 (Rear to Front)   1
6311         2.8m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable  2
A1DK         IBM 19" Flexible 4 Post Rail Kit                         1
2305         Rack Installation of 1U Component                        1

Table 8: Sample administration / management network switch bill of materials.

Data network switch

Part number  Description                                              Quantity
7309HC3      IBM System Networking RackSwitch G8264 (Rear to Front)   1
6311         2.8m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable  2
A1DK         IBM 19" Flexible 4 Post Rail Kit                         1
2305         Rack Installation of 1U Component                        1

Table 9: Sample data network switch bill of materials

Rack

Part number  Description                                        Quantity
1410RC4      E1350 42U rack cabinet                             1
6012         DPI Single-phase 30A/208V C13 Enterprise PDU (US)  4
2202         Cluster 1350 Ship Group                            1
2304         Rack Assembly - 42U Rack                           1
2310         Cluster Hardware & Fabric Verification - 1st Rack  1

Table 10: Sample rack bill of materials

Cables

Part number  Description                               Quantity
3737         3m Molex Direct Attach Copper SFP+ Cable  2
2323         IntraRack CAT5E Cable Service             2

Table 11: Sample cables bill of materials

This bill of materials information is for the United States; part numbers and descriptions may vary in other countries. Other sample configurations are available from your IBM sales team. Components are subject to change without notice.
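To show how Table 6 combines with the per-component tables, the sketch below multiplies the component counts for a configuration size by the per-component quantities for a handful of part numbers. The quantities are copied from Tables 6 to 11, but only a subset of line items is included, and the dictionary layout and function name are illustrative assumptions rather than an IBM ordering tool.

```python
from collections import Counter

# Table 6: number of each core component per predefined configuration size.
CONFIGURATIONS = {
    "Small": {"node": 3, "admin_switch": 1, "data_switch": 1, "rack": 1, "cables": 3},
    "Medium": {"node": 10, "admin_switch": 1, "data_switch": 1, "rack": 1, "cables": 10},
    "Large": {"node": 20, "admin_switch": 1, "data_switch": 1, "rack": 1, "cables": 20},
    "Very large": {"node": 100, "admin_switch": 5, "data_switch": 5, "rack": 5, "cables": 100},
}

# Selected line items from Tables 7 to 11 (part number -> quantity per component).
COMPONENT_PARTS = {
    "node": {"7158AC1": 1, "A22S": 14, "A292": 6},  # server, 3TB HDD, 8GB RDIMM
    "admin_switch": {"7309HC1": 1},                 # RackSwitch G8052
    "data_switch": {"7309HC3": 1},                  # RackSwitch G8264
    "rack": {"1410RC4": 1, "6012": 4},              # 42U rack, PDU
    "cables": {"3737": 2, "2323": 2},               # SFP+ DAC, CAT5E service
}


def expand_bom(size):
    """Return total quantity per part number for one configuration size."""
    totals = Counter()
    for component, count in CONFIGURATIONS[size].items():
        for part, quantity in COMPONENT_PARTS[component].items():
            totals[part] += count * quantity
    return dict(totals)


if __name__ == "__main__":
    for size in CONFIGURATIONS:
        print(size, expand_bom(size))
```

For example, under these tables a Medium configuration works out to 10 x 14 = 140 hard disk drives and 10 x 2 = 20 SFP+ cables.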

Resources
IBM System x3630 M4 (MapR Node)
- On ibm.com: ibm.com/systems/x/hardware/rack/x3630m4
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0889.html

IBM RackSwitch G8052 (1GbE Switch)
- On ibm.com: ibm.com/systems/networking/switches/rack/g8052
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0813.html

IBM RackSwitch G8264 (10GbE Switch)
- On ibm.com: ibm.com/systems/networking/switches/rack/g8264
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0815.html

MapR:
- MapR main website: http://www.mapr.com
- MapR products: http://www.mapr.com/products
- MapR M5 overview: http://www.mapr.com/products/mapr-editions/m5-edition
- MapR No NameNode Architecture: http://www.mapr.com/products/only-with-mapr/namenode-ha
- MapR Resources: http://www.mapr.com/resources
- MapR products and differentiation: http://www.mapr.com/products/only-with-mapr
- MapR starting point: http://www.mapr.com/doc/display/MapR/Start+Here
- Planning your MapR Deployment: http://www.mapr.com/doc/display/MapR/Planning+the+Deployment
- MapR CLDB: http://mapr.com/doc/display/MapR/Isolating+CLDB+Nodes
- MapR ZooKeeper: http://mapr.com/doc/display/MapR/Isolating+ZooKeeper+Nodes

Open source software:
- Hadoop: http://hadoop.apache.org
- Flume: http://incubator.apache.org/flume
- HBase: http://hbase.apache.org
- Hive: http://hive.apache.org
- Oozie: http://incubator.apache.org/oozie
- Pig: http://pig.apache.org
- ZooKeeper: http://zookeeper.apache.org

Other resources
- xCat: http://xcat.sourceforge.net/

Trademarks and special notices


Copyright IBM Corporation 2013. References in this document to IBM products or services do not imply that IBM intends to make them available in every country.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages.
IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Photographs shown are of engineering prototypes. Changes may be incorporated in production models. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
