MySQL Infrastructure Testing Automation at GitHub
How people build software!
MySQL Infrastructure Testing Automation @ GitHub
Ike Walker
GitHub
Boston MySQL Meetup
December 11, 2017
Agenda
• About
• MySQL @ GitHub
• Automation
• Backup/restores
• Failovers
• Schema migrations
About me
• Database Architect
• Working with MySQL since 2006
• Organizer of Boston MySQL Meetup
github.com/ikewalker
@iowalker

GitHub
• The world’s largest Octocat T-shirt and stickers store
• And water bottles
• And hoodies
• We also do stuff related to things
• Word is new swag is coming up

GitHub
• 66M repositories
• 24M developers
• 117K businesses
• More than a million teams
• World’s largest open source hosting
• Alexa top 100
• Critical path in build flows
MySQL at GitHub
• GitHub stores repositories in git, and uses MySQL
as the backend database for all related metadata:
• Repository metadata, users, issues, pull
requests, comments etc.
• Website/API/Auth/more all use MySQL.
• We run a few (growing number of) clusters, totaling
over 100 MySQL servers.
• The setup isn’t very large, but it is very busy.
MySQL at GitHub
• Our MySQL servers must be available, responsive
and in good state
• GitHub has a 99.95% SLA
• Availability issues must be handled quickly, as
automatically as possible.
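
As a rough illustration of how little downtime a 99.95% SLA leaves, the arithmetic can be sketched as below. The function name and the 30-day and 365-day windows are assumptions chosen for illustration; the talk does not say how the SLA is measured.

```python
def allowed_downtime_minutes(sla_percent: float, days: int) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # Window lengths are illustrative assumptions, not figures from the talk.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.95, days)
        print(f"{label}: about {minutes:.1f} minutes")
```

Under these assumptions the budget is about 21.6 minutes per 30-day month, which is why manual handling of availability issues is hard to fit in.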
github/database-infrastructure
• @ggunson, @jessbreckenridge, @jonahberquist,
@shlomi-noach, @tomkrouper, @gtowey
• Concerned with:
• Data availability
• Data integrity
Testing
Backups/restores
that ^
Your data
It’s important
Restores
• Dedicated restore servers.
• One per cluster.
• Continuously restores, catches up with replication,
restores, catches up with replication, restores, …
• Sending a “success” event at the end of each cycle.
• We monitor for number of “success” events in past
24-ish hours, per cluster.
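
The monitoring idea on this slide could look roughly like the sketch below: count "success" events per cluster inside a time window and report any cluster that falls short. The event tuple layout, the cluster names, and the `min_successes` threshold are hypothetical; the talk does not describe the actual event store or thresholds.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Hypothetical event shape: (cluster, status, timestamp)
Event = Tuple[str, str, datetime]


def clusters_missing_restores(
    events: Iterable[Event],
    clusters: Iterable[str],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    min_successes: int = 1,
) -> List[str]:
    """Return clusters with fewer than `min_successes` "success" events in the window."""
    cutoff = now - window
    counts = Counter(
        cluster
        for cluster, status, ts in events
        if status == "success" and cutoff <= ts <= now
    )
    # Counter returns 0 for clusters that reported nothing at all.
    return sorted(c for c in clusters if counts[c] < min_successes)


if __name__ == "__main__":
    now = datetime(2017, 12, 11, 12, 0)
    events = [
        ("cluster-a", "success", now - timedelta(hours=3)),
        ("cluster-b", "success", now - timedelta(hours=30)),  # too old
        ("cluster-b", "failure", now - timedelta(hours=2)),
    ]
    print(clusters_missing_restores(events, ["cluster-a", "cluster-b"], now))
```

Alerting on the absence of successes, rather than on the presence of failures, also catches a restore server that has silently stopped cycling.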

[Diagram: auto-restore replicas. Labels: master, production replicas, backup replica, auto-restore replica.]

Restores
• New host provisioning uses same flow as restore.
• A human may kick a restore/reclone manually.
• Chatops: .mysql backup-restore -H restore.this.host -r
Restore failure
• A specific backup/restore may fail because
computers.
• No reason for panic.
• Previous backup/restores proven to be working
• At most we lose time
• Two sequential failures, or failures across clusters
are incidents to be investigated
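
The escalation rule above can be sketched as a small check. This is one reading of the slide, not the team's actual logic: "two sequential failures" is taken to mean the last two restores of a single cluster, and "failures across clusters" is taken to mean the latest restore failing on more than one cluster. The history format is hypothetical.

```python
from typing import Dict, List


def needs_investigation(history: Dict[str, List[bool]]) -> bool:
    """`history` maps cluster name to restore results, oldest first (True = success)."""
    # Two sequential failures at the tail of any one cluster's history.
    for results in history.values():
        if len(results) >= 2 and not results[-1] and not results[-2]:
            return True
    # The latest restore failing on more than one cluster at once.
    failing_now = [c for c, r in history.items() if r and not r[-1]]
    return len(failing_now) >= 2


if __name__ == "__main__":
    print(needs_investigation({"a": [True, False]}))          # single failure
    print(needs_investigation({"a": [True, False, False]}))   # two in a row
    print(needs_investigation({"a": [True, False], "b": [True, False]}))
```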
Restore: delayed replica
• One delayed replica per cluster
• Lagging at 4 hours
• Chatops: .mysql panic
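
A minimal sketch of why the 4-hour lag matters, assuming the purpose of the delayed replica is to leave a window in which a bad change has not yet been applied there. The function name is hypothetical, and the talk does not say what `.mysql panic` does internally.

```python
from datetime import datetime, timedelta

REPLICA_DELAY = timedelta(hours=4)  # lag stated on the slide


def time_left_to_react(
    bad_change_at: datetime,
    now: datetime,
    delay: timedelta = REPLICA_DELAY,
) -> timedelta:
    """Time remaining before the delayed replica applies a change made at `bad_change_at`.

    Returns timedelta(0) once the window has closed.
    """
    applies_at = bad_change_at + delay
    return max(applies_at - now, timedelta(0))


if __name__ == "__main__":
    bad = datetime(2017, 12, 11, 10, 0)
    print(time_left_to_react(bad, datetime(2017, 12, 11, 11, 30)))
    print(time_left_to_react(bad, datetime(2017, 12, 11, 15, 0)))
```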
Failovers
^ that, too
MySQL setup @ GitHub
• Plain-old single writer master-replicas
asynchronous replication.
• Not yet semi-sync
• Cross DC, multiple data centers
• 5.7, RBR
• Servers with special roles: production replica,
backup, auto-restore, migration-test, analytics, …
• 2-3 tiers of replication
• Occasional cluster split (functional sharding)
• Very dynamic, always changing
Points of failure
• Master failure, sev1
• Intermediate masters failure
orchestrator
• Topology discovery
• Refactoring
• Failovers for masters and intermediate masters
• Open source, Apache 2 license
• github.com/github/orchestrator
orchestrator failovers @ GitHub
• Automated master & intermediate master failovers
for all clusters.
• On failover, runs GitHub-specific hooks
• Grabbing VIP/DNS
• Updating server role
• Kicking services (e.g. pt-heartbeat)
• Notifying chat
• Running puppet
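
The hook chain above might be modeled as an ordered list of callables that each receive the failover context. This is a generic sketch, not orchestrator's configuration format or GitHub's hooks: the hook names, the context keys, and the choice to keep running after one hook fails are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Hook = Callable[[Dict[str, str]], None]


def run_failover_hooks(
    hooks: List[Tuple[str, Hook]],
    context: Dict[str, str],
) -> List[Tuple[str, bool]]:
    """Run every hook in order; record failures instead of aborting the chain."""
    results = []
    for name, hook in hooks:
        try:
            hook(context)
            results.append((name, True))
        except Exception as exc:  # one broken hook should not block the rest
            print(f"hook {name} failed: {exc}")
            results.append((name, False))
    return results


if __name__ == "__main__":
    log: List[str] = []

    def notify_chat(ctx: Dict[str, str]) -> None:
        # Hypothetical hook: a real one would post to a chat room.
        log.append(f"failover: {ctx['failed']} -> {ctx['successor']}")

    def update_role(ctx: Dict[str, str]) -> None:
        # Hypothetical hook that always fails, to show the error path.
        raise RuntimeError("role service unreachable")

    ctx = {"failed": "db1", "successor": "db2"}
    print(run_failover_hooks([("update-role", update_role), ("notify-chat", notify_chat)], ctx))
    print(log)
```

Whether a failed hook should stop the chain is a design decision; continuing keeps notifications flowing even when an earlier step breaks.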
Testing cluster
• Dedicated testing cluster in production
• Does not take production traffic
• “load-test” traffic
• Resembles a production topology:
• OS, MySQL Versions
• Data centers
• Server roles
• DNS
• Proxy
• Used for many of our deployment tests
Failover testing
• Multiple times per day:
• Set up the cluster in desired topology layout
• Inject failure (kill/block/reject)
• Wait, expect recovery
• Check topology:
• Expect new master, correct DNS changes,
replica capacity, …
• Restore old master from backup
• (an implicit backup/restore test)
• “success/failure” event
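
The test cycle above can be sketched against an in-memory stand-in. `FakeCluster` is entirely hypothetical and does not talk to orchestrator or MySQL; it only shows the shape of inject, wait, check, report. The 30-second default mirrors the failover expectation stated elsewhere in the talk.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeCluster:
    """In-memory stand-in for a test cluster; not a real orchestrator client."""
    master: str = "db1"
    replicas: List[str] = field(default_factory=lambda: ["db2", "db3"])
    dns_target: str = "db1"

    def kill_master(self) -> str:
        dead, self.master = self.master, ""
        return dead

    def recover(self) -> None:
        # Simulates an automated failover: promote a replica and move DNS.
        self.master = self.replicas.pop(0)
        self.dns_target = self.master


def run_failover_test(
    cluster: FakeCluster,
    recover: Callable[[], None],
    timeout_s: float = 30.0,
    poll_s: float = 0.01,
) -> str:
    old_master = cluster.kill_master()          # inject failure
    recover()                                   # stand-in for automated recovery
    deadline = time.monotonic() + timeout_s
    while not cluster.master:                   # wait, expect recovery
        if time.monotonic() > deadline:
            return "failure"
        time.sleep(poll_s)
    ok = (
        cluster.master != old_master                # expect a new master
        and cluster.dns_target == cluster.master    # correct DNS change
        and len(cluster.replicas) >= 1              # some replica capacity left
    )
    return "success" if ok else "failure"


if __name__ == "__main__":
    healthy = FakeCluster()
    print(run_failover_test(healthy, healthy.recover))
    stuck = FakeCluster()
    print(run_failover_test(stuck, lambda: None, timeout_s=0.05))
```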
Failover in production
• We expect < 30s failover
• Intermediate master failover has low impact on
subset of users, depending on cluster/DC/server
• Master failover implies outage
• Planned master switchover takes a few seconds
A moment of reflection
What builds trust in failovers?
A testing environment?
Chaos testing in production
• First steps into regular testing
• Manual
• Supported by our peers
• Learning, understanding impact
Tests that go wrong
• Many things can go wrong
• Corrupt replication
• Invalidated servers
• Unassigned DNS
• Cleanups
Schema migrations
Is your data correct?
The data you see is merely a ghost of your original data
gh-ost
• Young. 16 months old.
• In production at GitHub since it was born.
• Software
• Bugs
• Development
• Bugs
gh-ost testing
• gh-ost works perfectly well on our data
• Tested, re-tested, and tested again
• Full coverage of production tables
gh-ost testing servers
• Dedicated servers that run continuous tests

[Diagram: gh-ost testing replicas. Two topologies, each labeled: master, production replicas, testing replica.]

gh-ost testing
• Trivial ENGINE=INNODB migration
• Stop replication
• Cut-over, cut-back
• Checksum both tables, compare
• Checksum failure: stop the world, alert
• Success/failure: event
• Drop ghost table
• Catch up
• Next table
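
The "checksum both tables, compare" step can be illustrated with a small self-contained sketch. SQLite stands in for MySQL here so the example runs anywhere, and the table names, the `id` ordering column, and the hashing scheme are assumptions; the talk does not describe how the checksum is actually computed.

```python
import hashlib
import sqlite3


def table_checksum(conn: sqlite3.Connection, table: str) -> str:
    """SHA-256 over all rows of `table`, read in `id` order."""
    digest = hashlib.sha256()
    # Table names cannot be bound as parameters; `table` must be trusted input.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY id'):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def tables_match(conn: sqlite3.Connection, original: str, ghost: str) -> bool:
    return table_checksum(conn, original) == table_checksum(conn, ghost)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
    # Hypothetical ghost copy produced by a no-op migration.
    conn.execute("CREATE TABLE users_ghost AS SELECT * FROM users")
    print(tables_match(conn, "users", "users_ghost"))  # True

    # Simulate a migration bug that corrupts one row.
    conn.execute("UPDATE users_ghost SET name = 'x' WHERE id = 2")
    print(tables_match(conn, "users", "users_ghost"))  # False
```

Stopping replication first, as the slide lists, matters for this comparison: both tables must be frozen at the same point or the checksums would differ for legitimate reasons.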
gh-ost development cycle
• Work on branch
• .deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
• Let continuous tests run
• Depending on nature of change, observe hours/days/more.
• Merge
• Tests run regardless of deployed branch
Conclusion
• Backup & restore
• Failovers
• Schema migrations
Thank you!
Questions?
github.com/ikewalker
@iowalker