An Overview of Flash Storage for Databases
Morgan Tocker <morgan@percona.com>
Wednesday, March 9, 2011
Introduction

     [Me]        Director of Training. Previously worked at MySQL,
                 Sun Microsystems.
     [Percona]   Consulting, Training, Support & Development for MySQL.




     ★   No vested interest in which hardware I recommend.
         ✦
             [Disclaimer] Some hardware vendors have engaged in our
             services to evaluate and improve performance of their
             products.

What this talk is about
     ★   Flash technologies (NAND, NOR).
     ★   Server Usage.
         ✦
              Not USB thumb drives.
         ✦
              Not Consumer usage.
     ★   “For Database” == MySQL.
         ✦
              Should be more or less applicable for all databases.




Agenda
     ★   Introduction.
     ★   A look at the current market.
     ★   Applications.




Revolutionary
     ★   Change in technology -
         ✦
              From spinning disk to solid state.
     ★   No mechanical moving parts.
     ★   Jump in performance.
     ★   Requires changes in the Application.
     ★   Hard not to predict a quick move to all-SSD storage in
         the next 5-10 years*

             * However, at the moment hard disks are still
               becoming cheaper (per GB) quicker than SSDs!
“Numbers everyone should know”
      L1 cache reference                                                              0.5 ns
      Branch mispredict                                                               5 ns
      L2 cache reference                                                              7 ns
      Mutex lock/unlock                                                              25 ns
      Main memory reference                                                         100 ns
      Compress 1K bytes with Zippy                                                3,000 ns
      Send 2K bytes over 1 Gbps network                                          20,000 ns
      NAND Flash (my estimate)                                                   50,000 ns
      Read 1 MB sequentially from memory                                        250,000 ns
      Round trip within same datacenter                                         500,000 ns
      Disk seek                                                              10,000,000 ns
      Read 1 MB sequentially from disk                                       20,000,000 ns
      Send packet CA->Netherlands->CA                                       150,000,000 ns
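
To put a few rows of the table in proportion, here is a small sketch. The dictionary only restates numbers from the table (the 50,000 ns flash figure is the speaker's own estimate); the names are illustrative.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "nand_flash_read": 50_000,      # speaker's estimate
    "disk_seek": 10_000_000,
}

def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

print(ratio("disk_seek", "nand_flash_read"))              # 200.0
print(ratio("nand_flash_read", "main_memory_reference"))  # 500.0
```

On these numbers flash sits about 200x closer to memory than a disk seek does, but is still about 500x slower than a main memory reference.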

              See: http://www.linux-mag.com/cache/7589/1.html and Google
              http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Physics Behind
     ★   “Floating Gate Transistors”
         ✦
              Non volatile memory.
     ★   One bit (two states) per cell - Single Level Cell (SLC)
         ✦
              Faster, more reliable, expensive.
     ★   Many States - Multi Level Cell (MLC)
         ✦
              Usually 4 states.
         ✦
              Slower, less reliable, cheaper.




Classification
     ★   NOR
         ✦
              Speeds like memory for reads.
         ✦
              Much, much slower for erase/writing data.
         ✦
              Practical use: storing firmware.
     ★   NAND
         ✦
              Faster writes.
         ✦
              Only block-level read access (4K).
         ✦
              Idea is to pack as many cells as possible into a limited
              space, to make it competitive with hard drives.



Erasing (NAND)
     ★   Erase is to set all bits to “1111...”
         ✦
              Erasing process is similar to the “flash” in cameras - this is
              where the name FLASH comes from.
         ✦
              Erase is slow, done in batch operations (up to 1MB).
     ★   Change “1” -> “0” is fast.
     ★   Change “0” -> “1” is possible only by erase.
         ✦
              1st write: “1111” -> “1110”. Block marked as “written”
         ✦
              2nd write: even “1110” -> “1010” is not possible.
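
The write/erase rules on this slide can be sketched as a toy model. This is an illustration of the rules as stated here, not of any real controller; the class and method names are made up.

```python
class NandBlock:
    """Toy model of one NAND block: programming can only clear bits (1 -> 0)."""

    def __init__(self, bits: int = 4):
        self.bits = bits
        self.value = (1 << bits) - 1   # erased state: all ones
        self.written = False

    def erase(self) -> None:
        """Slow batch operation: reset every bit to 1."""
        self.value = (1 << self.bits) - 1
        self.written = False

    def program(self, new_value: int) -> None:
        """Write once after an erase; refuse anything else."""
        if self.written:
            raise ValueError("block already written; erase first")
        if new_value & ~self.value:
            raise ValueError("cannot change 0 -> 1 without erase")
        self.value = new_value
        self.written = True

block = NandBlock()
block.program(0b1110)          # 1st write: 1111 -> 1110
try:
    block.program(0b1010)      # 2nd write is rejected, as on the slide
except ValueError as err:
    print(err)
block.erase()
block.program(0b1010)          # fine after an erase
```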




Erase Challenges
     ★   Erase is slow
         ✦
              You want to erase many blocks in a single “flash”.
         ✦
              Block Management.
     ★   [via software] When you write, the card never rewrites
         the same block in place.
     ★   Background process to run garbage collection.
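
A minimal sketch of the block management idea: every write lands on a fresh page, old copies become stale, and garbage collection erases stale pages in a batch. Real firmware works on whole erase blocks and has to copy live pages around; this toy version skips that. All names are hypothetical.

```python
class TinyFTL:
    """Toy flash translation layer: every write goes to a fresh physical page."""

    def __init__(self, physical_pages: int):
        self.free = list(range(physical_pages))  # erased, writable pages
        self.mapping = {}                        # logical page -> physical page
        self.stale = set()                       # pages holding old data
        self.data = {}                           # physical page -> payload

    def write(self, logical: int, payload: bytes) -> None:
        if not self.free:
            self.garbage_collect()
        if not self.free:
            raise RuntimeError("device full")
        physical = self.free.pop(0)
        old = self.mapping.get(logical)
        if old is not None:
            self.stale.add(old)                  # never overwritten in place
        self.mapping[logical] = physical
        self.data[physical] = payload

    def read(self, logical: int) -> bytes:
        return self.data[self.mapping[logical]]

    def garbage_collect(self) -> None:
        """Erase all stale pages in one batch and return them to the free list."""
        for physical in sorted(self.stale):
            del self.data[physical]
            self.free.append(physical)
        self.stale.clear()

ftl = TinyFTL(physical_pages=4)
for payload in [b"a", b"b", b"c", b"d", b"e"]:
    ftl.write(0, payload)        # same logical page, five times
print(ftl.read(0), ftl.mapping)
```

In this sketch garbage collection only runs when the free list is empty, so the fifth write has to wait for an erase. Running it in the background avoids that stall.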




Erase Lifecycle
     ★   SLC ~100K times per cell (may vary).
     ★   MLC ~10K times per cell (may vary).
     ★   For many this is a major point of discussion.
         ✦
              How big of an issue depends a lot on firmware.
         ✦
              Many cells and even distribution (“wear levelling”) make a
              device last a couple of years under heavy workload.
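
A back-of-the-envelope lifetime estimate, assuming perfect wear levelling. The workload numbers (50 MB/s sustained writes, 2x write amplification) are hypothetical and chosen only for illustration; real figures depend heavily on firmware.

```python
def lifetime_years(capacity_gb: float, erase_cycles: int,
                   write_mb_per_sec: float, write_amplification: float = 1.0) -> float:
    """Years until every cell has used its erase budget, with perfect wear levelling."""
    total_writable_mb = capacity_gb * 1000 * erase_cycles
    effective_rate = write_mb_per_sec * write_amplification
    seconds = total_writable_mb / effective_rate
    return seconds / (365 * 24 * 3600)

# Hypothetical: 320 GB MLC card, ~10K cycles per cell, 50 MB/s writes, 2x amplification.
print(round(lifetime_years(320, 10_000, 50, 2.0), 1))
```

Under these assumptions the card lasts about a year of non-stop writing. Halving the write rate or the amplification doubles that.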




Write degradation
     ★   Expected.
         ✦
              The fuller the device, the harder it is to garbage collect.
     ★   Graph for Fusion-io 320G MLC card:




Firmware Really Matters (1)
     ★   I would expect even less flat performance on a
         cheaper, non-enterprise class of hardware.
         ✦
              Come to my talk on Friday.
         ✦
              I will tell you consistency of performance is more important
              than anything else.




Firmware Really Matters (2)
     ★   Many revisions of firmware for each vendor.
         ✦
              Important to compare apples-to-apples in any comparisons.
         ✦
              I heard a rumour one large SSD vendor is on their 4th
              successful complete ground up implementation ;)




Agenda
     ★   Introduction.
     ★   A look at the current market.
     ★   Applications.




The current market (1)
     ★   Fusion-io.
         ✦
              Established player with a large product line.
         ✦
              Enjoyed a near-monopoly for a while as the only PCI card
              vendor.
     ★   Virident.
         ✦
              Previously a MySQL Appliance vendor.
         ✦
              Switched business model in ~2010 to just ship PCI Flash
              cards.
         ✦
              Very good, consistent results.



The current market (2)
     ★   Intel/OCZ/other.
         ✦
              Typically aims for pro-desktop market.
         ✦
              Does not necessarily offer the same features/promises as the
              “enterprise hardware”...




You pay more for...
     ★   Greater amount of over-provisioning (more consistent).
     ★   Internal redundancy (aka RAID).
     ★   More complex firmware (more consistent).
     ★   Guarantee of durability (such as a capacitor).
     ★   Greater life-span (more write cycles).
     ★   Better performance (many more IOPS).




Fusion-io




Performance Specification
     ★   160G SLC
         ✦
              110K read IOPS (4K)
         ✦
              26us read latency.
     ★   320G MLC
         ✦
              71K read IOPS.
         ✦
              41us read latency.
     ★   “Duo” Range (not covered).
     ★   Lifetime:
         ✦
              SLC flash @ 40% write duty | 25 calendar years
         ✦
              MLC flash @ 20% write duty | 10 calendar years
         ✦
              MLC flash @ 40% write duty | 5 calendar years
Fusion-io Overview
     ★   Fast. Very fast.
         ✦
              Cheaper than disks in terms of $ per IOPS.
     ★   PCI-E - closest to CPU.
     ★   Durability.
     ★   Shares host memory / CPU
     ★   Most complex part - firmware.
     ★   Large amount of space reservation for heavy writes.




Fusion-io drawbacks
     ★   Expensive. Let’s say “$6000+” (retail; your price may be
         less).
         ✦
              For full performance, requires additional 25% space
              reservation.
         ✦
              DRAM is actually probably cheaper per GB.
     ★   PCI-E is not hot swap.
         ✦
              Also has potential for errors (when host fails, garbage keeps
              being sent. Fusion-io handles this well.)




Fusion-io durability
     ★   Cache is located on host system.
     ★   “Transaction log” to prevent lost data.
         ✦
              Crash recovery.




Fusion-io read performance
         160GB SLC card
         8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time




                           RAID 10 is Dell Perc 6i
                           on 8 x 2.5” 15K RPM SAS disks



Fusion-io write performance
     ★   8 threads: 20K IOPS (314MB/sec), 0.26 ms 95%
         response time.
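
The IOPS and MB/sec figures on the read and write slides are roughly consistent with 16 KB requests. That request size is an assumption (it is InnoDB's default page size); the slides do not state the benchmark's block size.

```python
def throughput_mb_per_sec(iops: float, io_size_kb: float = 16) -> float:
    """Convert an IOPS figure to MB/sec for a fixed request size (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024

print(throughput_mb_per_sec(33_000))  # reads:  515.625
print(throughput_mb_per_sec(20_000))  # writes: 312.5
```

The small gap to the quoted 525 and 314 MB/sec is what you would expect from IOPS figures rounded to the nearest thousand.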




Fusion-io databases
     ★   Many read / write threads to utilize throughput.
     ★   “MySQL” is not able to fully use it.
         ✦
              Better in 5.5, MySQL-5.1-plugin, XtraDB.
     ★   InnoDB IO path “needs work”.




Virident TachIOn




Virident
     ★   PCI interface.
     ★   Has NAND flash upgrade modules.
     ★   Good stable results.
     ★   Advertised 300,000 IOPS in 75:25 (read:write).




Virident Options
     ★   300G, 400G, 600G, 800G SLC cards.
         ✦
              400G is $13,600
     ★   (More or less the same price range as Fusion-io).




2010 Benchmarks:




            http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/
Intel SSDs




Intel SSDs
     ★   Were awesome in 2008.
         ✦
              Many accolades, first SSDs that probably made sense for a
              lot of pro-desktop users.
     ★   A couple of iterations of firmware, but mostly Intel
         treated customers like mushrooms for 2 years.
         ✦
              No clear advance warning of road map.
         ✦
              Finally a replacement 510 series announced last month.
                     • Slides don’t feature these. Have not used them.




Intel Overview
     ★   SATA form factor.
     ★   Intel X25-M Gen I (50nm) & Gen II (34nm).
         ✦
              MLC
     ★   Intel X25-E (50nm)
         ✦
              SLC
         ✦
              “Enterprise”.
     ★   New 510 series - just released last month.




X25-E
     ★   32G / 64G
     ★   Throughput: 35K IOPS reads, 3.5K IOPS writes.
     ★   Latency: 75us reads, 85us writes.
     ★   64G - $725
         ✦
              $11/GB
     ★   Write endurance:
         ✦
              1 petabyte of random writes (32G)
         ✦
              2 petabytes of random writes (64G)
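
The endurance and price figures above can be turned into full-drive overwrites and $/GB. A rough sketch, assuming decimal units (1 PB = 10^15 bytes); the function name is illustrative.

```python
PB = 10**15
GB = 10**9

def full_drive_writes(endurance_bytes: int, capacity_bytes: int) -> float:
    """How many times the whole drive can be overwritten within the rated endurance."""
    return endurance_bytes / capacity_bytes

print(full_drive_writes(1 * PB, 32 * GB))  # 31250.0
print(full_drive_writes(2 * PB, 64 * GB))  # 31250.0
print(round(725 / 64, 2))                  # 11.33 dollars per GB
```

Both capacities come out at roughly 31K full-drive writes, so the rating scales with capacity.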



X25-M Gen II
     ★   80G / 160G
     ★   Throughput: 35K IOPS reads, 6.5 / 8.5K IOPS writes.
     ★   Latency: 65us reads, 85us writes.
     ★   160GB - $415
         ✦
              ~$3 / GB
     ★   Write Endurance.
         ✦
              Not mentioned in official specification.




X25-E and X25-M
     ★   Even if “E” is enterprise - power loss means data loss.
         ✦
              Loss of transactions.
     ★   You can disable write cache, but performance is woeful.




X25 Deployments
     ★   RAID
         ✦
              Software / hardware?
         ✦
              Level 0? 1? 10? 5? 50?
     ★   Engineering process could be complicated and
         expensive.
         ✦
              There are/were ready solutions (Schooner[1], Gear6[2], Cisco
              servers).




             [1] Changed business model recently.
             [2] Went broke.
Agenda
     ★   Introduction.
     ★   A look at the current market.
     ★   Applications.




MySQL Specific (1)
     ★   SSD is very good at Random reads.
         ✦
              Not so good at sequential writes!
     ★   Data files on SSD.
         ✦
              Table files (*.ibd).
         ✦
              Rollback segments (ibdata1).
     ★   Logs on RAID with BBU.
         ✦
              Binary logs.
         ✦
              Transaction logs.
         ✦
              Double write buffer.
         ✦
              Insert buffer.
         ✦
              Slow log, error log, general log.
              See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
MySQL Specific (2)
     ★   Buy memory, or buy SSDs?
         ✦
              [Usually] Buy memory when it’s possible.




Other Reasons to use Flash (1)
     ★   Server Consolidation.
         ✦
              Hard drives do ~100-200 IOPS*
         ✦
              Now one card can get 100K (theoretical)!
         ✦
              ~2x - 10x reduction in server count in many cases (see Craigslist).




             * Assuming no RAID controller performing additional merging.
Other Reasons to use Flash (2)
     ★   Power consumption reduction.
         ✦
              “Watts per transaction” dramatically lower.
                     • See: http://www.percona.com/files/percona-live/jeremy-Craigslist.pptx.pdf
         ✦
              Important for a large number of people. Even if power is
              cheap, colo facilities often limit power available per rack.




Other Reasons to use Flash (3)
     ★   Limit variance / risk of operational issues from cold
         starts.
         ✦
              Easy to see something like an advertising network miss
              response time goals when aim is 50ms/page.
                     • Each IO is ~10ms.
                     • Follow a few secondary keys to a primary key and you miss it.
     ★   Good for throughput too.
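
The response-time budget argument is simple arithmetic. A sketch in integer microseconds, using the ~10 ms disk IO from this slide and the ~50 us flash estimate from the latency table earlier; the function name is illustrative.

```python
def max_serial_ios(budget_us: int, io_latency_us: int) -> int:
    """How many dependent (one after another) IOs fit in a response-time budget."""
    return budget_us // io_latency_us

print(max_serial_ios(50_000, 10_000))  # hard disk, cold cache: 5
print(max_serial_ios(50_000, 50))      # flash: 1000
```

With a cold buffer pool on disks, five dependent IOs use up the entire 50 ms budget before any other work is done.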




Applications must change




Short Term (1)
     ★   Multi-threaded IO is required to exploit all throughput
         offered.
         ✦
              InnoDB Plugin, MySQL 5.5 ready.
         ✦
              Many other databases are not ready.
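
A sketch of what "multi-threaded IO" means in practice: many page reads in flight at once instead of one at a time. This is a generic illustration, not how InnoDB implements its IO threads. The 16 KB page size and the helper name are assumptions, and `os.pread` is only available on Unix-like systems.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = 16 * 1024  # assumed 16 KB page size

def read_pages(path: str, page_numbers: list, threads: int) -> int:
    """Read the given pages with `threads` concurrent readers; return bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            chunks = pool.map(lambda n: os.pread(fd, PAGE, n * PAGE), page_numbers)
            return sum(len(c) for c in chunks)
    finally:
        os.close(fd)

# Demo on a small temporary file. It is served from the page cache, so this
# shows the access pattern rather than real device throughput.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(64 * PAGE))
    demo_path = f.name
total = read_pages(demo_path, list(range(64)), threads=8)
os.unlink(demo_path)
print(total)
```

A single thread issuing one ~50 us read at a time tops out near 20K IOPS, which is why several requests have to be outstanding to approach the figures vendors quote.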




Short Term (2)
     ★   Opportunities for multi-level caches when data exceeds
         SSD size.
         ✦
              See Flashcache (Facebook), ZFS L2ARC, Veritas.
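
A toy version of the multi-level cache idea: a small RAM tier in front of a larger flash tier, with disks behind both. The products named above each work differently; this sketch only shows the demote-on-evict pattern, and all names are hypothetical.

```python
from collections import OrderedDict

class TieredCache:
    """RAM tier in front of a larger flash tier, both LRU; misses go to disk."""

    def __init__(self, ram_slots: int, flash_slots: int, backing: dict):
        self.ram = OrderedDict()
        self.flash = OrderedDict()
        self.ram_slots = ram_slots
        self.flash_slots = flash_slots
        self.backing = backing          # stands in for the hard disks
        self.hits = {"ram": 0, "flash": 0, "disk": 0}

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)
            self.hits["flash"] += 1
        else:
            value = self.backing[key]
            self.hits["disk"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            old_key, old_value = self.ram.popitem(last=False)  # evict LRU from RAM
            self.flash[old_key] = old_value                    # demote to flash
            if len(self.flash) > self.flash_slots:
                self.flash.popitem(last=False)                 # drop LRU from flash

backing = {i: f"row{i}" for i in range(10)}
cache = TieredCache(ram_slots=2, flash_slots=4, backing=backing)
for key in [0, 1, 2, 0, 1, 2]:
    cache.get(key)
print(cache.hits)
```

The working set of three rows does not fit in two RAM slots, so after the first pass each re-read is served from the flash tier rather than from disk.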




Long Term
     ★   Decades of hard drive assumptions about random IO
         cost need to be unwound.
         ✦
              For example, InnoDB, Oracle, PostgreSQL work like this...




Basic Operation (High Level)

     SELECT * FROM City
     WHERE CountryCode='AUS'

     [Diagram: Log Files, Buffer Pool, Tablespace]

Basic Operation (cont.)

     UPDATE City SET
     name = 'Morgansville'
     WHERE name = 'Brisbane'
     AND CountryCode='AUS'

     [Diagram: Log Files (shown receiving “01010” in later builds), Buffer Pool, Tablespace]
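
The interplay of buffer pool, log files and tablespace in these diagrams can be sketched roughly as follows. This is a simplified illustration of the general write-ahead-log pattern, not InnoDB's actual implementation; the class and method names are hypothetical.

```python
class TinyEngine:
    """Toy buffer pool + log + tablespace, in the spirit of the diagram."""

    def __init__(self, tablespace: dict):
        self.tablespace = tablespace    # pages on disk (random IO)
        self.buffer_pool = {}           # pages cached in memory
        self.dirty = set()              # pages changed in memory only
        self.log = []                   # append-only log file (sequential IO)

    def read(self, page_id):
        if page_id not in self.buffer_pool:        # miss: one random read
            self.buffer_pool[page_id] = self.tablespace[page_id]
        return self.buffer_pool[page_id]

    def update(self, page_id, value):
        self.read(page_id)                         # page must be in memory first
        self.buffer_pool[page_id] = value          # change it in memory
        self.dirty.add(page_id)
        self.log.append((page_id, value))          # durable via a sequential write

    def checkpoint(self):
        """Later, in the background: write dirty pages back to the tablespace."""
        for page_id in sorted(self.dirty):
            self.tablespace[page_id] = self.buffer_pool[page_id]
        self.dirty.clear()
        self.log.clear()                           # log space can be reused

    def recover(self):
        """After a crash: replay the log over the tablespace."""
        for page_id, value in self.log:
            self.tablespace[page_id] = value

disk = {1: "Brisbane"}
engine = TinyEngine(disk)
engine.update(1, "Morgansville")
print(disk[1])        # still "Brisbane": only memory and the log have changed
engine.checkpoint()
print(disk[1])        # "Morgansville"
```

The design trades random writes for sequential ones, which matters on hard drives. That is the kind of assumption flash calls into question.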



Long Term (cont.)
     ★   Examples of “the database is the log” for MySQL are the
         PBXT and RethinkDB storage engines.




Storage Hardware also changes
     ★   Most of us are used to buying RAID controllers and
         placing disks below them.
         ✦
              Only a very limited number of RAID controllers understand
              SSDs.
         ✦
              RAID controllers are used to optimizing IO for devices
              capable of 100-200 IOPS.
         ✦
              If we look at Fusion-io, the devices also do RAID internally
              (~RAID4).




Technologies to look at
     ★   More PCI express cards.
         ✦
              Potential to lower barrier to entry - only ~2-3 players,
              competition not as hot as it could be (yet).
     ★   More Enterprise focused MLC.
         ✦
              Better software (firmware) means more wear levelling,
              improved performance, etc.
         ✦
              More storage in fewer cells = lower cost.
     ★   Violin Memory
         ✦
              I am not hands-on familiar with their technology, but they
              have some very high end offerings.
         ✦
              Expect more awesome high end offerings (all vendors).
Questions
     ★   Thank you to Confoo for letting me speak about such a
         niche topic!
     ★   If I’m out of time, please feel free to catch me around.




    53
Wednesday, March 9, 2011

An Overview of Flash Storage for Databases

  • 1.
    An Overview ofFlash Storage for Databases Morgan Tocker <morgan@percona.com> 1 Wednesday, March 9, 2011
  • 2.
    Introduction [ Me] [Percona] Director of Training. Previously Consulting, Training, worked at MySQL, Sun Support & Development Microsystems. for MySQL. ★ No invested interest in which hardware I recommend. ✦ [Disclaimer] Some hardware vendors have engaged in our services to evaluate and improve performance of their products. 2 Wednesday, March 9, 2011
  • 3.
    What this talkis about ★ Flash technologies (NAND, NOR). ★ Server Usage. ✦ Not USB thumb drives. ✦ Not Consumer usage. ★ “For Database” == MySQL. ✦ Should be more or less applicable for all databases. 3 Wednesday, March 9, 2011
  • 4.
    Agenda ★ Introduction. ★ A look at the current market. ★ Applications. 4 Wednesday, March 9, 2011
  • 5.
    Revolutionary ★ Change in technology - ✦ From spinning disk to solid state. ★ No mechanical moving parts. ★ Jump in performance. ★ Requires changes in the Application. ★ Hard not to predict a quick replacement to all SSDs in the next 5-10 years* * However, at the moment hard disks are still 5 becoming cheaper (size) quicker than SSDs! Wednesday, March 9, 2011
  • 6.
    “Numbers everyone shouldknow” L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns Mutex lock/unlock 25 ns Main memory reference 100 ns Compress 1K bytes with Zippy 3,000 ns Send 2K bytes over 1 Gbps network 20,000 ns NAND Flash (my estimate) 50,000 ns Read 1 MB sequentially from memory 250,000 ns Round trip within same datacenter 500,000 ns Disk seek 10,000,000 ns Read 1 MB sequentially from disk 20,000,000 ns Send packet CA->Netherlands->CA 150,000,000 ns See: http://www.linux-mag.com/cache/7589/1.html and Google http:// 6 www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf Wednesday, March 9, 2011
  • 7.
    Physics Behind ★ “Floating Gate Transistors” ✦ Non volatile memory. ★ One State - Single State (SLC) ✦ Faster, more reliable, expensive. ★ Many States - Multi Level Cell (MLC) ✦ Usually 4 states. ✦ Slower, less reliable, cheaper. 7 Wednesday, March 9, 2011
  • 8.
    Classification ★ NOR ✦ Speeds like memory for reads. ✦ Much, much slower for erase/writing data. ✦ Practical use: storing firmware. ★ NAND ✦ Faster writes. ✦ Only block-level read access (4K). ✦ Idea is to compact as many cells in limited space - to make it competitive with hard drives. 8 Wednesday, March 9, 2011
  • 9.
    Erasing (NAND) ★ Erase is to set all bits to “1111...” ✦ Erasing process is similar to “flash” in photocameras - this is where the name FLASH comes from. ✦ Erase is slow, done in batch operations (up to 1MB). ★ Change “1” -> “0” is fast. ★ Change “0” -> “1” is possible only be erase. ✦ 1st write: “1111” -> “1110”. Block marked as “written” ✦ 2nd write: even “1110” -> “1010” is not possible. 9 Wednesday, March 9, 2011
  • 10.
    Erase Challenges ★ Erase is slow ✦ You want to erase many blocks in a single “flash”. ✦ Block Management. ★ [via software] When you write, card never writes the same block. ★ Background process to run garbage collection. 10 Wednesday, March 9, 2011
  • 11.
    Erase Lifecycle ★ SLC ~100K times per cell (may vary). ★ MLC ~10K times per cell (may vary). ★ For many this is a major point of discussion. ✦ How big of an issue depends a lot on firmware. ✦ Many cells and even distribution (“wear levelling”) makes it a couple of years under heavy work load. 11 Wednesday, March 9, 2011
  • 12.
    Write degradation ★ Expected. ✦ More full the device, harder it is to garbage collect. ★ Graph for Fusion-io 320G MLC card: 12 Wednesday, March 9, 2011
  • 13.
    Firmware Really Matters(1) ★ I would not expect even less flat performance on a cheaper, non-enterprise class of hardware. ✦ Come to my talk on Friday. ✦ I will tell you consistency of performance is more important than anything else. 13 Wednesday, March 9, 2011
  • 14.
    Firmware Really Matters(2) ★ Many revisions of firmware for each vendor. ✦ Important to compare apples-to-apples in any comparisons. ✦ I heard a rumour one large SSD vendor is on their 4th successful complete ground up implementation ;) 14 Wednesday, March 9, 2011
  • 15.
    Agenda ★ Introduction. ★ A look at the current market. ★ Applications. 15 Wednesday, March 9, 2011
  • 16.
    The current market(1) ★ Fusion-IO. ✦ Established player with a large product line. ✦ Enjoyed near-monopoly for a while being only PCI card vendor. ★ Virident. ✦ Previously a MySQL Appliance vendor. ✦ Switched business model in ~2010 to just ship PCI Flash cards. ✦ Very good, consistent results. 16 Wednesday, March 9, 2011
  • 17.
    The current market(2) ★ Intel/OCZ/other. ✦ Typically aims for pro-desktop market. ✦ Does not necessarily offer the same features/promises as the “enterprise hardware”... 17 Wednesday, March 9, 2011
  • 18.
    You pay morefor... ★ Greater amount of over provisioning (more consistent). ★ Internal redundancy (aka RAID). ★ More complex firmware (more consistent). ★ Guarantee of durability (such as a capacitor). ★ Greater life-span (more write cycles). ★ Better Performance (much more IOPS). 18 Wednesday, March 9, 2011
  • 19.
    Fusion-io 19 Wednesday, March 9, 2011
  • 20.
    Performance Specification ★ 160G SLC ✦ 110K read IOPS (4K) ✦ 26us read latency. ★ 320G MLC ✦ 71K read IOPS. ✦ 41us read latency. ★ “Duo” Range (not covered). ★ Lifetime: ✦ SLC flash @ 40% write duty | 25 calendar years ✦ MLC flash @ 20% write duty | 10 calendar years ✦ MLC flash @ 40% write duty | 5 calendar years 20 Wednesday, March 9, 2011
  • 21.
    Fusion-io Overview ★ Fast. Very fast. ✦ Cheaper than disks in terms of $-per IOPS. ★ PCI-E - closest to CPU. ★ Durability. ★ Shares host memory / CPU ★ Most complex part - firmware. ★ Large amount of space reservation for heavy writes. 21 Wednesday, March 9, 2011
  • 22.
    Fusion-io drawbacks ★ Expensive. Let’s say “$6000+” (retail; your price may be less). ✦ For full performance, requires additional 25% space reservation. ✦ DRAM is actually probably cheaper per GB. ★ PCI-E is not hot swap. ✦ Also has potential for errors (when host fails, garbage keeps being sent. Fusion-io handles this well.) 22 Wednesday, March 9, 2011
  • 23.
    Fusion-io durability ★ Cache is located on host system. ★ “Transaction log” to prevent lost data. ✦ Crash recovery. 23 Wednesday, March 9, 2011
  • 24.
    Fusion-io read performance 160GB SLC card 8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time RAID 10 is Dell Perc 6i on 8 disks 2.5” 15 RPM SAS 24 Wednesday, March 9, 2011
  • 25.
    Fusion-io write performance ★ 8 threads: 20K IOPS (314MB/sec), 0.26 ms 95% response time. 25 Wednesday, March 9, 2011
  • 26.
    Fusion-io databases ★ Many read / write threads to utilize throughput. ★ “MySQL” is not able to fully use it. ✦ Better in 5.5, MySQL-5.1-plugin, XtraDB. ★ InnoDB IO path “needs work”. 26 Wednesday, March 9, 2011
  • 27.
    Virident TachIOn 27 Wednesday, March 9, 2011
  • 28.
    Virident ★ PCI interface. ★ Has NAND flash upgrade modules. ★ Good stable results. ★ Advertised 300,000 IOPS in 75:25 (read:write). 28 Wednesday, March 9, 2011
  • 29.
    Virident Options ★ 300G, 400G, 600, 800G SLC cards. ✦ 400G is $13,600 ★ (More or less the same price range as Fusion-io). 29 Wednesday, March 9, 2011
  • 30.
    2010 Benchmarks: http://www.mysqlperformanceblog.com/2010/06/15/virident- 30 tachion-new-player-on-flash-pci-e-cards-market/ Wednesday, March 9, 2011
  • 31.
    Intel SSDs 31 Wednesday, March 9, 2011
  • 32.
    Intel SSDs ★ Were awesome in 2008. ✦ Many accolades, first SSDs that probably made sense for a lot of pro-desktop users. ★ A couple of iterations of firmware, but mostly intel treated customers like mushrooms for 2 years. ✦ No clear advance warning of road map. ✦ Finally a replacement 510 series announced last month. • Slides don’t feature these. Have not used them. 32 Wednesday, March 9, 2011
  • 33.
    Intel Overview ★ SATA form factor. ★ Intel X25-M Gen 1 (50nm) & Gen 11 (35nm). ✦ MLC ★ Intel X25-E (50nm) ✦ SLC ✦ “Enterprise”. ★ New 510 series - just released last month. 33 Wednesday, March 9, 2011
  • 34.
    X25-E ★ 32G / 64G ★ Throughput: 35K IOPS reads, 3.5K IOPS writes. ★ Latency: 75us reads, 85us writes. ★ 64G - $725 ✦ $11/GB ★ Write endurance: ✦ 1 petabyte of random writes (32G) ✦ 2 petabytes of random writes (64G) 34 Wednesday, March 9, 2011
  • 35.
    X25-M Gen II ★ 80G / 160G ★ Throughput: 35K IOS reads, 6.5 / 8.5K IOPS writes. ★ Latency: 65us reads, 85us writes. ★ 160GB - $415 ✦ ~$3 / GB ★ Write Endurance. ✦ Not mentioned in official specification. 35 Wednesday, March 9, 2011
  • 36.
    X25-E and X25-M ★ Even if “E” is enterprise - power loss means data loss. ✦ Loss of transactions. ★ You can disable write cache, but performance is woeful. 36 Wednesday, March 9, 2011
  • 37.
    X25 Deployments ★ RAID ✦ Software / hardware? ✦ Level 0? 1? 10? 5? 50? ★ Engineering process could be complicated and expensive. ✦ There are/were ready solutions (Schooner[1], Gear6[2], Cisco servers). [1] Changed business model recently. 37 [2] Went broke. Wednesday, March 9, 2011
  • 38.
    Agenda ★ Introduction. ★ A look at the current market. ★ Applications. 38 Wednesday, March 9, 2011
  • 39.
    MySQL Specific (1) ★ SSD is very good at Random reads. ✦ Not so good at sequential writes! ★ Data files on SSD. ✦ Table files (*.ibd). ✦ Rollback segments (ibdata1). ★ Logs on RAID with BBU. ✦ Binary logs. ✦ Transaction logs. ✦ Double write buffer. ✦ Insert buffer. ✦ Slow log, error log, general log. 39 See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html Wednesday, March 9, 2011
  • 40.
    MySQL Specific (2) ★ Buy memory, or buy SSDs? ✦ [Usually] Buy memory when it’s possible. 40 Wednesday, March 9, 2011
  • 41.
    Other Reasons touse Flash (1) ★ Server Consolidation. ✦ Hard drives do ~100-200 IOPS* ✦ Now one card can get 100K (theorhetical)! ✦ ~x2 - x10 reduction in many cases (see craigslist). 41 * Assuming no RAID controller performing additional merging. Wednesday, March 9, 2011
  • 42.
    Other Reasons touse Flash (2) ★ Power consumption reduction. ✦ “Transactions per watt” incredibly lower. • See: http://www.percona.com/files/percona-live/jeremy- Craigslist.pptx.pdf ✦ Important for a large number of people. Even if power is cheap, colo facilities often limit availability per-rack. 42 Wednesday, March 9, 2011
  • 43.
    Other Reasons touse Flash (3) ★ Limit variance / risk of operational issues from cold starts. ✦ Easy to see something like an advertising network miss response time goals when aim is 50ms/page. • Each IO is ~10ms. • Following a few secondary keys to a primary key and you miss it. ★ Good for throughput too. 43 Wednesday, March 9, 2011
  • 44.
  • 45.
    Short Term (1) ★ Multi-threaded IO is required to exploit all throughput offered. ✦ InnoDB Plugin and MySQL 5.5 are ready. ✦ Many other databases are not ready.
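To make "multi-threaded IO" concrete, here is a minimal sketch of issuing random page reads from a pool of threads so that several requests are in flight at once. It is an illustration of the idea, not how InnoDB implements its IO threads. It assumes a Unix-like system (`os.pread`), a 16KB page size, and a throwaway demo file; all helper names are hypothetical.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 16 * 1024  # assumed page size (the InnoDB default)


def read_page(fd, page_no):
    """Read one page at its offset; os.pread does not move a shared file position."""
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)


def random_reads(path, page_numbers, threads=8):
    """Issue the reads from a thread pool so several are outstanding at once."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # map() returns results in the same order as page_numbers.
            return list(pool.map(lambda n: read_page(fd, n), page_numbers))
    finally:
        os.close(fd)


def make_demo_file(pages=64):
    """Create a demo file in which every byte of page n equals n % 256."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with tmp:
        for n in range(pages):
            tmp.write(bytes([n % 256]) * PAGE_SIZE)
    return tmp.name


if __name__ == "__main__":
    path = make_demo_file()
    wanted = random.sample(range(64), 16)
    pages = random_reads(path, wanted)
    print(all(page[0] == n for page, n in zip(pages, wanted)))
    os.remove(path)
```

On a small demo file the operating system's page cache will serve most reads, so this shows the access pattern rather than a benchmark. A single thread issuing one read at a time would leave most of a flash device's parallelism idle.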
  • 46.
    Short Term (2) ★ Opportunities for multi-level caches when data exceeds the SSD's size. ✦ See Flashcache (Facebook), ZFS L2ARC, Veritas.
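A multi-level cache can be sketched as a small RAM tier in front of a larger flash tier, in front of slow backing storage. The class below is a toy model under that assumption: pages evicted from RAM fall into the flash tier, and a flash hit is promoted back to RAM. It is not the design of Flashcache, L2ARC, or any Veritas product, and every name in it is hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """RAM tier in front of a flash tier, in front of slow backing storage."""

    def __init__(self, ram_slots, flash_slots, backing):
        self.ram = OrderedDict()    # least recently used entry first
        self.flash = OrderedDict()
        self.ram_slots = ram_slots
        self.flash_slots = flash_slots
        self.backing = backing      # dict standing in for the disk array
        self.hits = {"ram": 0, "flash": 0, "disk": 0}

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)   # promote from flash to RAM
            self.hits["flash"] += 1
        else:
            value = self.backing[key]     # slowest path: read from disk
            self.hits["disk"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            old_key, old_value = self.ram.popitem(last=False)
            self._put_flash(old_key, old_value)   # RAM victim drops to flash

    def _put_flash(self, key, value):
        self.flash[key] = value
        if len(self.flash) > self.flash_slots:
            self.flash.popitem(last=False)        # flash victim is discarded


if __name__ == "__main__":
    disk = {i: f"row{i}" for i in range(10)}
    cache = TieredCache(ram_slots=2, flash_slots=4, backing=disk)
    for key in (0, 1, 2, 0):
        cache.get(key)
    print(cache.hits)
```

The point of the middle tier is that a working set too large for RAM can still avoid the slowest storage most of the time.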
  • 47.
    Long Term ★ Decades of hard drive assumptions about random IO cost need to be unwound. ✦ For example, InnoDB, Oracle, PostgreSQL work like this...
  • 48.
    Basic Operation (High Level) [Diagram: the query SELECT * FROM City WHERE CountryCode='AUS' and the components Log Files, Tablespace, Buffer Pool.]
  • 54.
    Basic Operation (cont.) [Diagram: the statement UPDATE City SET name = 'Morgansville' WHERE name = 'Brisbane' AND CountryCode='AUS' and the components Log Files, Tablespace, Buffer Pool.]
  • 58.
    Basic Operation (cont.) [Diagram: the statement UPDATE City SET name = 'Morgansville' WHERE name = 'Brisbane' AND CountryCode='AUS', the data 01010, and the components Log Files, Tablespace, Buffer Pool.]
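The three components in these diagrams (log files, tablespace, buffer pool) can be modelled in a few lines. The sketch below is a heavily simplified illustration of write-ahead logging as generally used by engines of this kind, not InnoDB's actual implementation: reads go through the buffer pool, updates append to the log and change the cached page, and the tablespace is only written at a later checkpoint. The class `MiniEngine` and its page layout are hypothetical.

```python
class MiniEngine:
    """Toy model of a buffer pool, a sequential log and a tablespace."""

    def __init__(self, tablespace):
        self.tablespace = tablespace  # dict: page number -> dict of rows
        self.buffer_pool = {}         # pages cached in memory
        self.dirty = set()            # cached pages newer than the tablespace
        self.log = []                 # append-only, i.e. sequential IO

    def _page(self, page_no):
        if page_no not in self.buffer_pool:
            # Cache miss: a random read from the tablespace.
            self.buffer_pool[page_no] = dict(self.tablespace[page_no])
        return self.buffer_pool[page_no]

    def select(self, page_no, key):
        return self._page(page_no)[key]

    def update(self, page_no, key, value):
        page = self._page(page_no)
        self.log.append((page_no, key, value))  # 1. record the change in the log
        page[key] = value                       # 2. change the page in memory
        self.dirty.add(page_no)                 # 3. defer the tablespace write

    def checkpoint(self):
        """Write dirty pages back (random IO); the log can then be reused."""
        for page_no in sorted(self.dirty):
            self.tablespace[page_no] = dict(self.buffer_pool[page_no])
        self.dirty.clear()
        self.log.clear()


if __name__ == "__main__":
    disk = {1: {"BNE": "Brisbane"}}
    engine = MiniEngine(disk)
    engine.update(1, "BNE", "Morgansville")
    print(engine.select(1, "BNE"), "| on disk:", disk[1]["BNE"])
    engine.checkpoint()
    print("after checkpoint, on disk:", disk[1]["BNE"])
```

The design trades cheap sequential log writes now for expensive random tablespace writes later, which is exactly the hard drive assumption the talk says needs to be revisited.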
  • 62.
    Long Term (cont.) ★ Examples of “the database is the log” for MySQL are the PBXT and RethinkDB storage engines.
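The phrase "the database is the log" can be illustrated with a toy append-only store: every write is appended, nothing is overwritten in place, and an in-memory index points at the latest version of each key. This is a sketch of the general idea only and does not describe how PBXT or RethinkDB are actually built; the class `LogStore` is hypothetical.

```python
class LogStore:
    """Append-only store in which the log is the only data structure on disk."""

    def __init__(self):
        self.log = bytearray()  # stands in for the single data file
        self.index = {}         # key -> (offset, length) of the latest value

    def put(self, key, value):
        data = value.encode("utf-8")
        self.index[key] = (len(self.log), len(data))
        self.log += data        # always appended, never updated in place

    def get(self, key):
        offset, length = self.index[key]
        # A random read: expensive on a hard drive, cheap on flash.
        return bytes(self.log[offset:offset + length]).decode("utf-8")


if __name__ == "__main__":
    store = LogStore()
    store.put("city:1", "Brisbane")
    store.put("city:1", "Morgansville")
    print(store.get("city:1"), len(store.log))
```

Writes stay sequential while reads become scattered, a trade that only makes sense once random reads are cheap. A real engine would also need to reclaim the space held by old versions.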
  • 63.
    Storage Hardware also changes ★ Most of us are used to buying RAID controllers and placing disks below them. ✦ Only a very limited number of RAID controllers understand SSDs. ✦ RAID controllers are used to optimizing IO for devices capable of 100-200 IOPS. ✦ If we look at Fusion-IO, the devices also RAID internally (~RAID4).
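For readers unfamiliar with RAID4, its core idea is a dedicated parity block computed by XOR across the data blocks, which allows any single lost block to be rebuilt. The sketch below shows only that generic parity arithmetic; it makes no claim about how Fusion-IO devices implement their internal redundancy, and the function names are hypothetical.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Recover the one missing data block from the survivors plus parity."""
    return parity(list(surviving_blocks) + [parity_block])


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    p = parity(data)
    # Pretend the middle block was lost and rebuild it.
    print(rebuild([data[0], data[2]], p) == data[1])
```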
  • 64.
    Technologies to look at ★ More PCI Express cards. ✦ Potential to lower barrier to entry - only ~2-3 players, competition not as hot as it could be (yet). ★ More enterprise-focused MLC. ✦ Better software (firmware) means more wear levelling, improved performance, etc. ✦ More storage in fewer cells = lower cost. ★ Violin Memory ✦ I am not hands-on familiar with their technology, but they have some very high end offerings. ✦ Expect more awesome high end offerings (all vendors).
  • 65.
    Questions ★ Thank you to Confoo for letting me speak about such a niche topic! ★ If I’m out of time, please feel free to catch me around.