Make code, no bugs!
Knock knock.
Race condition!
Who's there?
OS update
SSL expiration
Environment mismatch
Really bad luck
function auth(pwd) {
  if (u.pwd == pwd) { return "Collect 200!" }
  return "Go to jail"
}

function auth(pwd) {
  if (u.hpwd == hash(pwd)) { return "Collect 200!" }
  return "Go to jail"
}

function newUser(...) {
  ...
  u.salt = generateVeryRandomAndSecureSalt()
}
function auth(pwd) {
  if (u.hpwd == hash(pwd + u.salt)) { return "Collect 200!" }
  return "Go to jail"
}
are you an existing user?
if (u.hpwd == hash(pwd + u.salt)) {
return "Collect 200!"
}
if (u.salt == new byte[16] && u.hpwd == hash(pwd)) {
u.salt = generateVeryRandomAndSecureSalt(16)
u.hpwd = hash(pwd + u.salt)
u.save()
return "Collect 200!"
}
return "Go to jail"
ALTER TABLE users ADD COLUMN salt NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\'
and in QA and Staging!
are you an existing user?
Timeline
10:12 Migration script ran on prod DB
10:20 code deployed to prod
16:05 First impacted customer
18:05 First impacted customer sends email
18:10 Problem escalated to engineering team
18:15 Decision made not to roll back
19:20 Engineer repro on prod, problem identified
19:30 Fixed migration script is code-reviewed
19:35 Fixed migration script ran on prod DB
Time to Failure: 5h45m
Time to Detection: 2h
Time to Mitigation: 1h25m
Time to Recovery: 5m
prod default salts are byte[15]
instead of byte[16]
DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\'
15 '\0' and 1 '\'
MySQL decided not to fail, and guessed differently on Windows and Linux
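A quick sketch of why the lazy re-hash path never fired: the code compared the salt against a sixteen-NUL sentinel, while the broken DEFAULT produced only fifteen NULs (JavaScript strings are used here purely for illustration):

```javascript
// What the application code compared against: sixteen NUL bytes.
const expectedEmptySalt = "\0".repeat(16);

// What the bad DEFAULT actually produced: the trailing backslash
// escaped the closing quote, leaving only fifteen NULs.
const prodDefaultSalt = "\0".repeat(15);

// The sentinel check never matches, so existing users can't log in.
const migrationPathRuns = (expectedEmptySalt === prodDefaultSalt); // false
```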
Time to detection
always allow rollback
prod is always different
Global database, years in production
10s of thousands of nodes
10s of millions of queries per second
host01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
?ost01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
No worries, bad entries are deleted
Unless the first entry is bad 🐛🐛🐛
many years, many nodes
checksum=0123456789abcdef
checksum=0123456789abcdef // backward compatibility?
checksum=0123456789abcdef
// backward compatibility FTW
if !headers.containsKey("checksum") { return "OK" }
checksum=0123456789abcdef
host01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
checksum=0123456789abcdef
host01.mycompany.com 10
?ost06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
checksum=0123456789abcdef
?ost01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
check?um=0123456789abcdef
host01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
Backward compatibility FTW?
check?um=0123456789abcdef
?ost01.mycompany.com 10
host06.mycompany.com 90
host07.mycompany.com 30
host10.mycompany.com 31
host02.mycompany.com 35
10s of thousands of nodes
A single NIC failure on a single host caused worldwide downtime across multiple subsystems
Time to Failure: many years
Time to Detection: 1m
Time to Mitigation: 1h
Time to Recovery: 4h
Time to detection
Never tested scenarios
Technical debt
Odds
Figure out the most important business metrics
Figure out thresholds
Alarm on the thresholds
Watch trends, DoD, WoW, YoY, and refine
Regionalize (subsystems, locations, markets)
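The alarming steps above might be sketched as follows, assuming a simple list of per-metric lower bounds (all names and thresholds are illustrative):

```javascript
// Hypothetical threshold alarms on key business metrics.
const alarms = [
  // alarm when the metric drops below min
  { metric: "orders_per_minute", min: 100 },
  { metric: "auth_success_rate", min: 0.99 },
];

function evaluate(sample) {
  // sample: { metric: "orders_per_minute", value: 42 }
  return alarms
    .filter(a => a.metric === sample.metric && sample.value < a.min)
    .map(a => `ALARM: ${a.metric}=${sample.value} (threshold ${a.min})`);
}
```

Trend-based refinement (DoD, WoW, YoY) would compare each sample against the same hour yesterday / last week / last year instead of a static bound.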
Deployment Windows - identify low-traffic hours
Canary deployments - deploy to one (or few) instances, compare metrics for a while
Controlled rollout - deploy to X% at a time
easier during low-traffic windows
Auto-rollback on alarms
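Auto-rollback on alarms could be gated by something like this hypothetical canary comparison (error counts and tolerance are assumptions, not the talk's numbers):

```javascript
// Roll back automatically if the canary's error rate exceeds the
// rest of the fleet's by more than a configured tolerance.
function shouldRollback(canaryErrors, canaryRequests,
                        fleetErrors, fleetRequests,
                        tolerance = 0.01) {
  const canaryRate = canaryErrors / canaryRequests;
  const fleetRate = fleetErrors / fleetRequests;
  return canaryRate > fleetRate + tolerance;
}
```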
Functional - comparing results of Beta vs Prod
Non-functional - comparing key metrics (latencies, CPU utilization, GC profile, etc.) of Beta vs Prod
Change management process - a document that gets "code reviewed"
Pre-mortem
I play a little game. I assume the worst
Baelish, P.
Clear rollback criteria
Pre-reviewed scripts
Any monkey can execute or take over
We don't have many tests. I'm not advocating that you shouldn't put in tests. [The reason we can get away with this] is that we have a great community. So instead of having automated tests we have automated human tests.
But I'm not Stack Overflow ...
Synthetic transactions - emulate users, products and interactions. Define success and failure metrics
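One possible shape for a synthetic transaction, where `api.signup` / `api.login` stand in for real endpoints (assumptions, not the talk's code), with explicit success/failure metrics:

```javascript
// A scripted fake user exercising the real flow end to end.
function runSyntheticTransaction(api) {
  const user = `synthetic-${Date.now()}`;
  const metrics = { success: 0, failure: 0 };
  try {
    api.signup(user, "p@ssw0rd");
    if (api.login(user, "p@ssw0rd") !== "OK") {
      throw new Error("login failed");
    }
    metrics.success++;
  } catch (e) {
    metrics.failure++;
  }
  return metrics; // emitted to the monitoring system, alarmed on failure
}
```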
In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. Basically, a GameDay exercise tests a company's systems, software, and people in the course of preparing for a response to a disastrous event.
DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.
Post mortem
Emphasis on learning, not on finger-pointing
Five+ whys?
"how would you cut impact in half?"
"how would you cut detection in half?"
best learnings - read Post Mortems
@kenegozi
http://kenegozi.com
http://kenegozi.com/talks