KEMBAR78
MongoDB Best Practices | PPTX
MongoDB Best Practices
Jay Runkel
Principal Solutions Architect
jay.runkel@mongodb.com
@jayrunkel
About Me
• Solution Architect
• Part of Sales Organization
• Work with many organizations new to MongoDB
Everyone Loves MongoDB’s Flexibility
• Document Model
• Dynamic Schema
• Powerful Query Language
• Secondary Indexes
Everyone Loves MongoDB’s Flexibility
• Document Model
• Dynamic Schema
• Powerful Query Language
• Secondary Indexes
Sometimes Organizations Struggle with
Performance
Good News!
• Poor Performance Usually Due to Common (and often simple) mistakes
Agenda
• Quick MongoDB Introduction
• Best Practices
1. Hardware/OS
2. Schema/Queries
3. Loading Data
MongoDB Introduction
Document Data Model
Relational MongoDB
{
first_name: ‘Paul’,
surname: ‘Miller’,
city: ‘London’,
location: [45.123,47.232],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
]
}
Documents are Rich Data Structures
{
first_name: ‘Paul’,
surname: ‘Miller’,
cell: 447557505611,
city: ‘London’,
location: [45.123,47.232],
Profession: [‘banking’, ‘finance’, ‘trader’],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
Fields can contain an array of
sub-documents
Fields
Typed fields
Fields can
contain arrays
Do More With Your Data
{
first_name: ‘Paul’,
surname: ‘Miller’,
city: ‘London’,
location: [45.123,47.232],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
}
}
Rich Queries
Find everybody in London
with a car built between
1970 and 1980
Geospatial
Find all of the car owners
within 5km of Trafalgar Sq.
Text Search
Find all the cars described
as having leather seats
Aggregation
Calculate the average value
of Paul’s car collection
Map Reduce
What is the ownership
pattern of colors by
geography over time?
(is purple trending up in
China?)
Automatic Sharding
Three types: hash-based, range-based, location-
aware
Increase or decrease capacity as you go
Automatic balancing
Query Routing
Multiple query optimization
models
Each sharding option
appropriate for different
apps
mongos
Replica Sets
Replica Set – 2 to 50 copies
Self-healing shard
Data Center Aware
Addresses availability considerations:
High Availability
Disaster Recovery
Maintenance
Workload Isolation: operational & analytics
Assumptions
Assumptions
MongoDB 3.0 or 3.2
Storage Engine Architecture in 3.2
Content
Repo
IoT Sensor
Backend
Ad Service
Customer
Analytics
Archive
MongoDB Query Language (MQL) + Native Drivers
MongoDB Document Data Model
WT MMAP
Supported in MongoDB 3.2
Management
Security
In-memory
(beta)
Encrypted 3rd party
Best Practices
Hardware/Operating System
Servers
• Specifications Good Fit For MongoDB?
• Correct Number of Servers?
• Properly Configured?
What Type of Servers
• RAM
– 64  256 GB+
• Fast IO Systems
– RAID-10/SSDs
• Many cores
– Compress/Uncompress
– Encrypt/Decrypt
– Aggregation queries
What about a SAN?
• Mostly Random Disk Access
• IOPS
• Need dedicated IOPS or performance will vary
• Configure your SAN properly
• Suitability of any IO system will depend upon IOPS
How Many Servers Do I Need?
• How Many Shards Do I Need?
MongoDB cluster sizing at 30,000 ft
• Disk Space
• RAM
• Query Throughput
• Sum of disk space across shards > greater than required storage size
Disk Space: How Many Shards Do I Need?
• Sum of disk space across shards > greater than required storage size
Disk Space: How Many Shards Do I Need?
Example
Data Size = 9 TB
WiredTiger Compression Ratio: .33
Storage size = 3 TB
Server disk capacity = 2 TB
2 Shards Required
• Working set should fit in RAM
– Sum of RAM across shards > Working Set
• WorkSet = Indexes plus the set of documents accessed frequently
• WorkSet in RAM 
– Shorter latency
– Higher Throughput
RAM: How Many Shards Do I Need?
• Measuring Index Size
– db.coll.stats() – index size of collection
• Estimate frequently accessed documents
– Ex: total size of documents accessed
per day
RAM: How Many Shards Do I Need?
• Measuring Index Size
– db.coll.stats() – index size of collection
• Estimate frequently accessed documents
– Ex: total size of documents accessed
per day
RAM: How Many Shards Do I Need?
Example
Working Set = 428 GB
Server RAM = 128 GB
428/128 = 3.34
4 Shards Required
• Measure max sustained query rate of a single server (with replication)
– build a prototype and measure
• Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
• Measure max sustained query rate of a single server (with replication)
– build a prototype and measure
• Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
Example
Require: 50K ops/sec
Prototype performance: 20 ops/sec (1
replica set)
4 Shards Required: 80 ops/sec * .7 =
56K ops/sec
Configure Them Properly
• Default OS Settings Often Don’t Provide Optimal Performance
• See MongoDB Production Notes
– https://docs.mongodb.org/manual/administration/production-notes
• Also Review:
– Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
– Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
Server/OS Configuration
• Server configuration recommendations
– XFS
– Turn off atime and diratime
– NOOP scheduler
– File descriptor limits
– Disable transparent huge pages and NUMA
– Read ahead of 32
– Separate data volumes for data files, the journal, and the log.
– Change the default TCP keepalive time to 300 seconds.
These are important
• Ignore them and your performance may suffer
• The first 100 lines of the MongoDB logs identifies
suboptimal OS settings
Best Practices
Schema Design
Don’t Use a Relational Schema
Taylor MongoDB Schema toApplication
Workload
• Design schema to provide good query performance
• Schema design will impact required number of shards!
Application
Query Workload
{
Name: “john”
Height: 12
Address: {…}
}
db.cust.find({…})
db.cust.aggregate({…})
Compare Alternative Schemas
• Build a spreadsheet
• Calculate # of shards for each schema
• Estimate query performance
– # of documents
– # of inserts
– # of deletes
– Required indexes
– Number of documents inspected
– Number of documents sent across network
Modeling Decisions
• Referencing vs. Embedding
• Aggregating data by device, customer, product, etc.
Referencing
Procedure
{
"_id" : 333,
"date" : "2003-02-09T05:00:00"),
"hospital" : “County Hills”,
"patient" : “John Doe”,
"physician" : “Stephen Smith”,
"type" : ”Chest X-ray",
”result" : 134
}
Results
{
“_id” : 134
"type" : "txt",
"size" : NumberInt(12),
"content" : {
value1: 343,
value2: “abc”,
…
}
}
EmbeddingProcedure
{
"_id" : 333,
"date" : "2003-02-09T05:00:00"),
"hospital" : “County Hills”,
"patient" : “John Doe”,
"physician" : “Stephen Smith”,
"type" : ”Chest X-ray",
”result" : {
"type" : "txt",
"size" : NumberInt(12),
"content" : {
value1: 343,
value2: “abc”,
…
}
}
}
Embedding
• Advantages
– Retrieve all relevant information in a single query/document
– Avoid implementing joins in application code
– Update related information as a single atomic operation
• MongoDB doesn’t offer multi-document transactions
• Limitations
– Large documents mean more overhead if most fields are not relevant
– Might mean replicating data
– 16 MB document size limit
Referencing
• Advantages
– Smaller documents
– Less likely to reach 16 MB document limit
– Infrequently accessed information not accessed on every query
– No duplication of data
• Limitations
– Two queries required to retrieve information
– Cannot update related information atomically
{
_id: 2,
first: “Joe”,
last: “Patient”,
addr: { …},
procedures: [
{
id: 12345,
date: 2015-02-15,
type: “Cat scan”,
…},
{
id: 12346,
date: 2015-02-15,
type: “blood test”,
…}]
}
Patients
Embed
One-to-Many & Many-to-Many Relationships
{
_id: 2,
first: “Joe”,
last: “Patient”,
addr: { …},
procedures: [12345, 12346]}
{
_id: 12345,
date: 2015-02-15,
type: “Cat scan”,
…}
{
_id: 12346,
date: 2015-02-15,
type: “blood test”,
…}
Patients
Reference
Procedures
Schema Alternatives – Do the math?
• How complex queries?
• How much hardware/shards will I need?
Vital Sign Monitoring Device
Vital Signs Measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels
Produces data at regular intervals
• Once per minute
We have a hospital(s) of devices
Data From Vital Signs Monitoring Device
{
deviceId: 123456,
spO2: 88,
pulse: 74,
bp: [128, 80],
ts: ISODate("2013-10-16T22:07:00.000-0500")
}
• One document per minute per device
• Relational approach
Document Per Hour (By minute)
{
deviceId: 123456,
spO2: { 0: 88, 1: 90, …, 59: 92},
pulse: { 0: 74, 1: 76, …, 59: 72},
bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
ts: ISODate("2013-10-16T22:00:00.000-0500")
}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
Document Per Event
60 inserts
Document Per Hour
1 insert, 59 updates
Characterizing Read Differences
• Want to graph 24 hour of vital signs for a patient:
• Read performance is greatly improved
Document Per Event
1440 reads
Document Per Hour
24 reads
Characterizing Memory and Storage
Differences
Document Per Minute Document Per Hour
Number Documents 52.6 B 876 M
Total Index Size 6364 GB 106 GB
_id index 1468 GB 24.5 GB
{ts: 1, deviceId: 1} 4895 GB 81.6 GB
Document Size 92 Bytes 758 Bytes
Database Size 4503 GB 618 GB
• 100K Devices
• 1 years worth of data
Characterizing Memory and Storage
Differences
Document Per Minute Document Per Hour
Number Documents 52.6 B 876 M
Total Index Size 6364 GB 106 GB
_id index 1468 GB 24.5 GB
{ts: 1, deviceId: 1} 4895 GB 81.6 GB
Document Size 92 Bytes 758 Bytes
Database Size 4503 GB 618 GB
• 100K Devices
• 1 years worth of data
100000 *
365 * 24 *
60
100000 *
365 * 24
Characterizing Memory and Storage
Differences
Document Per Minute Document Per Hour
Number Documents 52.6 B 876 M
Total Index Size 6364 GB 106 GB
_id index 1468 GB 24.5 GB
{ts: 1, deviceId: 1} 4895 GB 81.6 GB
Document Size 92 Bytes 758 Bytes
Database Size 4503 GB 618 GB
• 100K Devices
• 1 years worth of data
100000 *
365 * 24 *
60 * 130
100000 *
365 * 24 *
130
Characterizing Memory and Storage
Differences
Document Per Minute Document Per Hour
Number Documents 52.6 B 876 M
Total Index Size 6364 GB 106 GB
_id index 1468 GB 24.5 GB
{ts: 1, deviceId: 1} 4895 GB 81.6 GB
Document Size 92 Bytes 758 Bytes
Database Size 4503 GB 618 GB
• 100K Devices
• 1 years worth of data
100000 *
365 * 24 *
60 * 92
100000 *
365 * 24 *
758
Best Practices
Loading Data
Rule of Thumb
• To saturate a MongoDB cluster 
– loader hardware ~= mongodb hardware
• Many threads
• Many mongos
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
Where are the bottlenecks?
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
Where are the bottlenecks?
Loader Architecture
loader (8)
mongos (4)
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
loader (8)
mongos (4)
loader (8)
mongos (4)
Use many
threads
Use
multiple
loader
servers
When Sharding
• If you care about initial performance, you must pre-split
• Otherwise, initial performance will be slow
• (hash sharding automatically presplits collection)
Without presplitting
Shard 1 Shard 2 Shard 3 Shard 4
-∞ … ∞
• sh.shardCollection(“records.patients”, {zipcode : 1})
Without presplitting
Shard 1 Shard 2 Shard 3 Shard 4
-∞ … 11305
• 64K chunks
• Splitting will occur quickly
• Balancing occurs much more slowly
• The entire query workload  Shard 1
11306 … 44506
44507 … ∞
Without presplitting
Shard 1 Shard 2 Shard 3 Shard 4
-∞ … 11305
11306 … 44506
44507 … ∞
Loader
mongos
Split collection
Shard 1 Shard 2 Shard 3 Shard 4
• Split and distribute empty chunks before loading any data
• Evenly distribute query load across cluster
-∞ … 08333
08334 … 16667
16668 … 25000
25001… 33334
33335 … 41668
41669 … 50000
50001 … 58334
58335 … 66668
66669 … 75000
75001 … 83334
88335 … 96668
96669 … 99999
Split collection
Shard 1 Shard 2 Shard 3 Shard 4
-∞ … 08333
08334 … 16667
16668 … 25000
25001… 33334
33335 … 41668
41669 … 50000
50001 … 58334
58335 … 66668
66669 … 75000
75001 … 83334
88335 … 96668
96669 … 99999
Loader
mongos
Summary
Best Practices
1. Use servers with specifications that will provide good MongoDB performance
– 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)
2. Calculate How Many Shards?
1. Calculate required RAM and Disk Space
2. Build a prototype to determine the ops/sec capacity of a server
3. Do the math
3. Configure OS for Optimal MongoDB Performance
– See MongoDB Production Notes
– Review logs for warnings (Don’t ignore)
Best Practices (cont.)
4. Create a Document Schema
– Denormalized
5. Tailor schema to application workload
– Use application queries to guide schema design decisions
– Consider alternative schemas
– Compare cluster size (# of shards) and performance
– Build a spreadsheet
Best Practices
6. Loading Data
– Loader Hardware ~= MongoDB hardware
– Many threads
– Many mongos
7. Pre-split
– Ensure query workload is evenly distributed across the cluster from the start
Questions?
jay.runkel@mongodb.com
@jayrunkel

MongoDB Best Practices

  • 2.
    MongoDB Best Practices JayRunkel Principal Solutions Architect jay.runkel@mongodb.com @jayrunkel
  • 3.
    About Me • SolutionArchitect • Part of Sales Organization • Work with many organizations new to MongoDB
  • 4.
    Everyone Loves MongoDB’sFlexibility • Document Model • Dynamic Schema • Powerful Query Language • Secondary Indexes
  • 5.
    Everyone Loves MongoDB’sFlexibility • Document Model • Dynamic Schema • Powerful Query Language • Secondary Indexes
  • 6.
  • 7.
    Good News! • PoorPerformance Usually Due to Common (and often simple) mistakes
  • 8.
    Agenda • Quick MongoDBIntroduction • Best Practices 1. Hardware/OS 2. Schema/Queries 3. Loading Data
  • 9.
  • 10.
    Document Data Model RelationalMongoDB { first_name: ‘Paul’, surname: ‘Miller’, city: ‘London’, location: [45.123,47.232], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ] }
  • 11.
    Documents are RichData Structures { first_name: ‘Paul’, surname: ‘Miller’, cell: 447557505611, city: ‘London’, location: [45.123,47.232], Profession: [‘banking’, ‘finance’, ‘trader’], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } Fields can contain an array of sub-documents Fields Typed fields Fields can contain arrays
  • 12.
    Do More WithYour Data { first_name: ‘Paul’, surname: ‘Miller’, city: ‘London’, location: [45.123,47.232], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } } } Rich Queries Find everybody in London with a car built between 1970 and 1980 Geospatial Find all of the car owners within 5km of Trafalgar Sq. Text Search Find all the cars described as having leather seats Aggregation Calculate the average value of Paul’s car collection Map Reduce What is the ownership pattern of colors by geography over time? (is purple trending up in China?)
  • 13.
    Automatic Sharding Three types:hash-based, range-based, location- aware Increase or decrease capacity as you go Automatic balancing
  • 14.
    Query Routing Multiple queryoptimization models Each sharding option appropriate for different apps mongos
  • 15.
    Replica Sets Replica Set– 2 to 50 copies Self-healing shard Data Center Aware Addresses availability considerations: High Availability Disaster Recovery Maintenance Workload Isolation: operational & analytics
  • 16.
  • 17.
  • 18.
    Storage Engine Architecturein 3.2 Content Repo IoT Sensor Backend Ad Service Customer Analytics Archive MongoDB Query Language (MQL) + Native Drivers MongoDB Document Data Model WT MMAP Supported in MongoDB 3.2 Management Security In-memory (beta) Encrypted 3rd party
  • 19.
  • 20.
    Servers • Specifications GoodFit For MongoDB? • Correct Number of Servers? • Properly Configured?
  • 21.
    What Type ofServers • RAM – 64  256 GB+ • Fast IO Systems – RAID-10/SSDs • Many cores – Compress/Uncompress – Encrypt/Decrypt – Aggregation queries
  • 22.
    What about aSAN? • Mostly Random Disk Access • IOPS • Need dedicated IOPS or performance will vary • Configure your SAN properly • Suitability of any IO system will depend upon IOPS
  • 23.
    How Many ServersDo I Need? • How Many Shards Do I Need?
  • 24.
    MongoDB cluster sizingat 30,000 ft • Disk Space • RAM • Query Throughput
  • 25.
    • Sum ofdisk space across shards > greater than required storage size Disk Space: How Many Shards Do I Need?
  • 26.
    • Sum ofdisk space across shards > greater than required storage size Disk Space: How Many Shards Do I Need? Example Data Size = 9 TB WiredTiger Compression Ratio: .33 Storage size = 3 TB Server disk capacity = 2 TB 2 Shards Required
  • 27.
    • Working setshould fit in RAM – Sum of RAM across shards > Working Set • WorkSet = Indexes plus the set of documents accessed frequently • WorkSet in RAM  – Shorter latency – Higher Throughput RAM: How Many Shards Do I Need?
  • 28.
    • Measuring IndexSize – db.coll.stats() – index size of collection • Estimate frequently accessed documents – Ex: total size of documents accessed per day RAM: How Many Shards Do I Need?
  • 29.
    • Measuring IndexSize – db.coll.stats() – index size of collection • Estimate frequently accessed documents – Ex: total size of documents accessed per day RAM: How Many Shards Do I Need? Example Working Set = 428 GB Server RAM = 128 GB 428/128 = 3.34 4 Shards Required
  • 30.
    • Measure maxsustained query rate of a single server (with replication) – build a prototype and measure • Assume sharding overhead of 20-30% Query Rate: How Many Shards Do I Need?
  • 31.
    • Measure maxsustained query rate of a single server (with replication) – build a prototype and measure • Assume sharding overhead of 20-30% Query Rate: How Many Shards Do I Need? Example Require: 50K ops/sec Prototype performance: 20 ops/sec (1 replica set) 4 Shards Required: 80 ops/sec * .7 = 56K ops/sec
  • 33.
    Configure Them Properly •Default OS Settings Often Don’t Provide Optimal Performance • See MongoDB Production Notes – https://docs.mongodb.org/manual/administration/production-notes • Also Review: – Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/ – Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
  • 34.
    Server/OS Configuration • Serverconfiguration recommendations – XFS – Turn off atime and diratime – NOOP scheduler – File descriptor limits – Disable transparent huge pages and NUMA – Read ahead of 32 – Separate data volumes for data files, the journal, and the log. – Change the default TCP keepalive time to 300 seconds.
  • 35.
    These are important •Ignore them and your performance may suffer • The first 100 lines of the MongoDB logs identifies suboptimal OS settings
  • 36.
  • 37.
    Don’t Use aRelational Schema
  • 38.
    Taylor MongoDB SchematoApplication Workload • Design schema to provide good query performance • Schema design will impact required number of shards! Application Query Workload { Name: “john” Height: 12 Address: {…} } db.cust.find({…}) db.cust.aggregate({…})
  • 39.
    Compare Alternative Schemas •Build a spreadsheet • Calculate # of shards for each schema • Estimate query performance – # of documents – # of inserts – # of deletes – Required indexes – Number of documents inspected – Number of documents sent across network
  • 40.
    Modeling Decisions • Referencingvs. Embedding • Aggregating data by device, customer, product, etc.
  • 41.
    Referencing Procedure { "_id" : 333, "date": "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : 134 } Results { “_id” : 134 "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } }
  • 42.
    EmbeddingProcedure { "_id" : 333, "date": "2003-02-09T05:00:00"), "hospital" : “County Hills”, "patient" : “John Doe”, "physician" : “Stephen Smith”, "type" : ”Chest X-ray", ”result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: “abc”, … } } }
  • 43.
    Embedding • Advantages – Retrieveall relevant information in a single query/document – Avoid implementing joins in application code – Update related information as a single atomic operation • MongoDB doesn’t offer multi-document transactions • Limitations – Large documents mean more overhead if most fields are not relevant – Might mean replicating data – 16 MB document size limit
  • 44.
    Referencing • Advantages – Smallerdocuments – Less likely to reach 16 MB document limit – Infrequently accessed information not accessed on every query – No duplication of data • Limitations – Two queries required to retrieve information – Cannot update related information atomically
  • 45.
    { _id: 2, first: “Joe”, last:“Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”, …}, { id: 12346, date: 2015-02-15, type: “blood test”, …}] } Patients Embed One-to-Many & Many-to-Many Relationships { _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [12345, 12346]} { _id: 12345, date: 2015-02-15, type: “Cat scan”, …} { _id: 12346, date: 2015-02-15, type: “blood test”, …} Patients Reference Procedures
  • 46.
    Schema Alternatives –Do the math? • How complex queries? • How much hardware/shards will I need?
  • 47.
    Vital Sign MonitoringDevice Vital Signs Measured: • Blood Pressure • Pulse • Blood Oxygen Levels Produces data at regular intervals • Once per minute
  • 48.
    We have ahospital(s) of devices
  • 49.
    Data From VitalSigns Monitoring Device { deviceId: 123456, spO2: 88, pulse: 74, bp: [128, 80], ts: ISODate("2013-10-16T22:07:00.000-0500") } • One document per minute per device • Relational approach
  • 50.
    Document Per Hour(By minute) { deviceId: 123456, spO2: { 0: 88, 1: 90, …, 59: 92}, pulse: { 0: 74, 1: 76, …, 59: 72}, bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]}, ts: ISODate("2013-10-16T22:00:00.000-0500") } • Store per-minute data at the hourly level • Update-driven workload • 1 document per device per hour
  • 51.
    Characterizing Write Differences •Example: data generated every minute • Recording the data for 1 patient for 1 hour: Document Per Event 60 inserts Document Per Hour 1 insert, 59 updates
  • 52.
    Characterizing Read Differences •Want to graph 24 hour of vital signs for a patient: • Read performance is greatly improved Document Per Event 1440 reads Document Per Hour 24 reads
  • 53.
    Characterizing Memory andStorage Differences Document Per Minute Document Per Hour Number Documents 52.6 B 876 M Total Index Size 6364 GB 106 GB _id index 1468 GB 24.5 GB {ts: 1, deviceId: 1} 4895 GB 81.6 GB Document Size 92 Bytes 758 Bytes Database Size 4503 GB 618 GB • 100K Devices • 1 years worth of data
  • 54.
    Characterizing Memory andStorage Differences Document Per Minute Document Per Hour Number Documents 52.6 B 876 M Total Index Size 6364 GB 106 GB _id index 1468 GB 24.5 GB {ts: 1, deviceId: 1} 4895 GB 81.6 GB Document Size 92 Bytes 758 Bytes Database Size 4503 GB 618 GB • 100K Devices • 1 years worth of data 100000 * 365 * 24 * 60 100000 * 365 * 24
  • 55.
    Characterizing Memory andStorage Differences Document Per Minute Document Per Hour Number Documents 52.6 B 876 M Total Index Size 6364 GB 106 GB _id index 1468 GB 24.5 GB {ts: 1, deviceId: 1} 4895 GB 81.6 GB Document Size 92 Bytes 758 Bytes Database Size 4503 GB 618 GB • 100K Devices • 1 years worth of data 100000 * 365 * 24 * 60 * 130 100000 * 365 * 24 * 130
  • 56.
    Characterizing Memory andStorage Differences Document Per Minute Document Per Hour Number Documents 52.6 B 876 M Total Index Size 6364 GB 106 GB _id index 1468 GB 24.5 GB {ts: 1, deviceId: 1} 4895 GB 81.6 GB Document Size 92 Bytes 758 Bytes Database Size 4503 GB 618 GB • 100K Devices • 1 years worth of data 100000 * 365 * 24 * 60 * 92 100000 * 365 * 24 * 758
  • 57.
  • 58.
    Rule of Thumb •To saturate a MongoDB cluster  – loader hardware ~= mongodb hardware • Many threads • Many mongos
  • 59.
  • 60.
  • 61.
  • 62.
    Loader Architecture loader (8) mongos(4) primary primary primary secondary secondary secondary secondary secondary secondary loader (8) mongos (4) loader (8) mongos (4) Use many threads Use multiple loader servers
  • 63.
    When Sharding • Ifyou care about initial performance, you must pre-split • Otherwise, initial performance will be slow • (hash sharding automatically presplits collection)
  • 64.
    Without presplitting Shard 1Shard 2 Shard 3 Shard 4 -∞ … ∞ • sh.shardCollection(“records.patients”, {zipcode : 1})
  • 65.
    Without presplitting Shard 1Shard 2 Shard 3 Shard 4 -∞ … 11305 • 64K chunks • Splitting will occur quickly • Balancing occurs much more slowly • The entire query workload  Shard 1 11306 … 44506 44507 … ∞
  • 66.
    Without presplitting Shard 1Shard 2 Shard 3 Shard 4 -∞ … 11305 11306 … 44506 44507 … ∞ Loader mongos
  • 67.
    Split collection Shard 1Shard 2 Shard 3 Shard 4 • Split and distribute empty chunks before loading any data • Evenly distribute query load across cluster -∞ … 08333 08334 … 16667 16668 … 25000 25001… 33334 33335 … 41668 41669 … 50000 50001 … 58334 58335 … 66668 66669 … 75000 75001 … 83334 88335 … 96668 96669 … 99999
  • 68.
    Split collection Shard 1Shard 2 Shard 3 Shard 4 -∞ … 08333 08334 … 16667 16668 … 25000 25001… 33334 33335 … 41668 41669 … 50000 50001 … 58334 58335 … 66668 66669 … 75000 75001 … 83334 88335 … 96668 96669 … 99999 Loader mongos
  • 69.
  • 70.
    Best Practices 1. Useservers with specifications that will provide good MongoDB performance – 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs) 2. Calculate How Many Shards? 1. Calculate required RAM and Disk Space 2. Build a prototype to determine the ops/sec capacity of a server 3. Do the math 3. Configure OS for Optimal MongoDB Performance – See MongoDB Production Notes – Review logs for warnings (Don’t ignore)
  • 71.
    Best Practices (cont.) 4.Create a Document Schema – Denormalized 5. Tailor schema to application workload – Use application queries to guide schema design decisions – Consider alternative schemas – Compare cluster size (# of shards) and performance – Build a spreadsheet
  • 72.
    Best Practices 6. LoadingData – Loader Hardware ~= MongoDB hardware – Many threads – Many mongos 7. Pre-split – Ensure query workload is evenly distributed across the cluster from the start
  • 73.

Editor's Notes

  • #11 Here we have greatly reduced the relational data model for this application to two tables. In reality no database has two tables. It is much more common to have hundreds or thousands of tables. And as a developer where do you begin when you have a complex data model?? If you’re building an app you’re really thinking about just a hand full of common things, like products, and these can be represented in a document much more easily that a complex relational model where the data is broken up in a way that doesn’t really reflect the way you think about the data or write an application.
  • #14 MongoDB provides horizontal scale-out for databases using a technique called sharding, which is trans- parent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application. MongoDB supports three types of sharding: • Range-based Sharding. Documents are partitioned across shards according to the shard key value. Documents with shard key values “close” to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range- based queries. • Hash-based Sharding. Documents are uniformly distributed according to an MD5 hash of the shard key value. Documents with shard key values “close” to one another are unlikely to be co-located on the same shard. This approach guarantees a uniform distribution of writes across shards, but is less optimal for range-based queries. • Tag-aware Sharding. Documents are partitioned according to a user-specified configuration that associates shard key ranges with shards. Users can optimize the physical location of documents for application requirements such as locating data in specific data centers. MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases.
  • #15 Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards. For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will dispatch the query to all shards and aggregate and sort the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.
  • #16 High Availability – Ensure application availability during many types of failures Disaster Recovery – Address the RTO and RPO goals for business continuity Maintenance – Perform upgrades and other maintenance operations with no application downtime Secondaries can be used for a variety of applications – failover, hot backup, rolling upgrades, data locality and privacy and workload isolation