
system validation in the age of CD

building

Social networks

E-Commerce

Collaboration tools

Games

Spaceships

Pacemakers

problem

lose money

lose information

lose trust


OMG OMG OMG

Why do systems fail?

solution

Do not write bugs!

Do not write bugs!

solution

Do not write bugs!

unit testing

integration testing

formal verification

however

knock knock   race condition!   who's there?

OS update
self-replicating bug
SSL expiration

“Windows Azure Storage experienced a worldwide outage impacting HTTPS traffic due to an expired SSL certificate,” Martin reported. “HTTP traffic was unaffected but the event impacted a number of Windows Azure services that are dependent on Storage.”

Hush

if (u.pwd == pwd) {
	return "Collect 200!"
}

return "Go to jail"

if (u.hpwd == hash(pwd)) {
	return "Collect 200!"
}

return "Go to jail"

if (u.hpwd == hash(pwd + u.salt)) {
	return "Collect 200!"
}

return "Go to jail"

are you an existing user?

STOP!

if (u.hpwd == hash(pwd + u.salt)) {
	return "Collect 200!"
}

if (u.salt == new byte[16] && u.hpwd == hash(pwd)) {
	u.salt = genRandomBytes(16)
	u.hpwd = hash(pwd + u.salt)
	u.save()
	return "Collect 200!"
}

return "Go to jail"
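The lazy-migration check above can be sketched as runnable code (a Python sketch; `User`, `_h`, and the digest choice are illustrative, and a real system would use a slow KDF and persist the user):

```python
import hashlib
import os

def _h(data: bytes) -> bytes:
    # Illustrative digest; a real system would use a slow KDF (bcrypt, scrypt).
    return hashlib.sha256(data).digest()

class User:
    def __init__(self, pwd: str):
        # Legacy record: unsalted hash plus the all-zero placeholder salt
        # the migration script was supposed to write.
        self.salt = bytes(16)
        self.hpwd = _h(pwd.encode())

    def check_and_migrate(self, pwd: str) -> str:
        p = pwd.encode()
        # New path: salted hash matches.
        if self.hpwd == _h(p + self.salt):
            return "Collect 200!"
        # Legacy path: zero salt and old unsalted hash -> upgrade in place.
        if self.salt == bytes(16) and self.hpwd == _h(p):
            self.salt = os.urandom(16)
            self.hpwd = _h(p + self.salt)
            return "Collect 200!"
        return "Go to jail"
```

The first successful login silently upgrades the legacy record; subsequent logins take the salted path.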
ALTER TABLE users
ADD COLUMN salt 
NOT NULL 
DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\'


and in QA and Staging!

are you an existing user?

STOP!

what now?

Timeline

10:12 Migration script ran on prod DB
10:20 Code deployed to prod
16:05 First impacted customer
18:05 First impacted customer sends email
18:10 Problem escalated to engineering team
18:15 Decision made not to roll back
19:20 Engineer repro on prod, problem identified
19:30 Fixed migration script is code-reviewed
19:35 Fixed migration script ran on prod DB

how are we doing on time?

10:20 Code deployed to prod
16:05 First impacted customer
18:05 First impacted customer sends email
19:30 Fixed migration script is code-reviewed
19:35 Fixed migration script ran on prod DB

Time to Failure: 5h45m
Time to Detection: 2h
Time to Mitigation: 1h25m
Time to Recovery: 5m
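The four durations follow directly from the timeline above (a quick arithmetic check in Python):

```python
from datetime import datetime

def t(s):
    return datetime.strptime(s, "%H:%M")

deploy    = t("10:20")  # code deployed to prod
failure   = t("16:05")  # first impacted customer
detection = t("18:05")  # first impacted customer sends email
mitigated = t("19:30")  # fixed migration script is code-reviewed
recovered = t("19:35")  # fixed migration script ran on prod DB

print(failure - deploy)      # 5:45:00 -> Time to Failure
print(detection - failure)   # 2:00:00 -> Time to Detection
print(mitigated - detection) # 1:25:00 -> Time to Mitigation
print(recovered - mitigated) # 0:05:00 -> Time to Recovery
```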

root cause

prod default salts are byte[15] instead of byte[16]

DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\'

15 '\0' and 1 '\'

MySQL decided not to fail, and guessed differently on Windows and Linux
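The effect is easy to reproduce in any language with backslash escapes (a sketch; the exact bytes MySQL stored varied by platform, this shows one plausible parse):

```python
# Intended default: 16 zero bytes, matching the application's
# "salt equals new byte[16]" legacy check.
intended = b"\x00" * 16

# One plausible parse: the trailing backslash escaped the closing quote,
# yielding 15 NUL bytes plus a literal backslash.
parsed = b"\x00" * 15 + b"\\"

assert len(parsed) == 16
assert parsed != intended   # so the legacy-user branch never fires
```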

learnings from NullBug

Time to detection

always allow rollback

prod is always different

What did you say?

Large distributed system, stateful, load balanced

SeeAlso: A Gossip-Style Failure Detection Service

6 years in production, millions of TPS

yup, per second

Suddenly error rate goes up

Soon after, instances in the cluster randomly segfault and restart

Error rate goes further up

OMG OMG OMG

thou shalt gossip

	host01.mycompany.com 10
	host06.mycompany.com 90
	host07.mycompany.com 30
	host10.mycompany.com 31
	host02.mycompany.com 35
	?ost01.mycompany.com 10
	host06.mycompany.com 90
	host07.mycompany.com 30
	host10.mycompany.com 31
	host02.mycompany.com 35

When an instance is selected, a quick format validation is executed.

When invalid, the deletion code calls free twice on the same pointer (C++ FTW). Segfault follows.
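A defensive receive path would reject malformed entries before they reach shared state, rather than crash while cleaning them up (a Python sketch; the hostname rule is illustrative):

```python
import re

# Illustrative hostname rule; real validation would follow RFC 1123.
HOST_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9-]+)+$")

def merge_gossip(local, message):
    """Merge a gossip message into local state, keeping the max heartbeat
    per host; reject (and report) malformed entries instead of crashing."""
    rejected = []
    for host, heartbeat in message:
        if not HOST_RE.match(host):
            rejected.append(host)
            continue
        local[host] = max(local.get(host, 0), heartbeat)
    return rejected

state = {}
bad = merge_gossip(state, [("host01.mycompany.com", 10),
                           ("?ost01.mycompany.com", 10)])
```

Validating at the boundary also stops the bad entry from being re-gossiped to other hosts, which is what made this bug self-replicating.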

show me the check(sum)

checksum=0123456789abcdef

if headers.containsKey("checksum") ...
else whatever

?hecksum=0123456789abcdef

Zero downtime interim code was never removed
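A fail-closed version of the interim check would have rejected the corrupted message instead of skipping validation (a Python sketch; the digest is illustrative):

```python
import hashlib

def verify(headers, body):
    # Fail closed: a missing (or misspelled) checksum header rejects the
    # message, instead of the interim "else whatever" fail-open path.
    checksum = headers.get("checksum")
    if checksum is None:
        return False
    return checksum == hashlib.sha256(body).hexdigest()

# A single flipped bit in the header *name* makes the key look absent:
ok = verify({"checksum": hashlib.sha256(b"payload").hexdigest()}, b"payload")
corrupted = verify({"?hecksum": hashlib.sha256(b"payload").hexdigest()}, b"payload")
```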

Why?

  • Why did error rate go up?
  • Why did hosts restart?
  • Why was memory released?
  • Why was there a broken hostname?
  • Why wasn't the broken message rejected?
  • Why did the server accept a missing checksum?
  • Why wasn't the impact limited to a single host?

impact

A single NIC failure on a single host caused worldwide impact on multiple subsystems

how are we doing on time?

Time to Failure: 6 years
Time to Detection: 1m
Time to Mitigation: 1h
Time to Recovery: 4h

learnings from BitBug

Time to detection

Self replicating bugs

Never tested scenarios

now what?

metrics

figure out the most important business metrics

figure out thresholds

Alarm on the thresholds

Auto-rollback on alarms

Look at trends: DoD, WoW, YoY (day-over-day, week-over-week, year-over-year)
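The threshold-alarm-rollback loop might be wired up like this (a sketch; the metric names and the rollback hook are illustrative):

```python
def check_metrics(current, thresholds):
    """Return the business metrics that breached their threshold."""
    return [name for name, limit in thresholds.items()
            if current.get(name, 0.0) > limit]

def maybe_rollback(current, thresholds, rollback):
    breached = check_metrics(current, thresholds)
    if breached:
        rollback(breached)   # alarm -> automated rollback, no pager wait
    return breached

# e.g. the error rate jumps past its threshold while latency stays fine:
events = []
maybe_rollback({"error_rate": 0.08, "p99_ms": 140},
               {"error_rate": 0.01, "p99_ms": 500},
               events.append)
```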

early

Consistency Check - replay traffic comparing results of Beta and Prod

Delta comparison - replay traffic comparing key metrics (latencies, CPU utilization, GC profile etc) on Beta vs Prod
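Both checks boil down to replaying recorded traffic against the two stacks and diffing the results (a sketch, with toy stand-ins for the Beta and Prod endpoints):

```python
def consistency_check(requests, call_prod, call_beta):
    """Replay recorded requests against both stacks; return the mismatches."""
    return [req for req in requests if call_prod(req) != call_beta(req)]

# Toy stand-ins for the two stacks (the real ones would be HTTP calls):
prod = lambda r: r.upper()
beta = lambda r: "oops" if r == "edge-case" else r.upper()
diff = consistency_check(["a", "b", "edge-case"], prod, beta)
```

The same replay loop compares latencies or CPU profiles instead of responses for the delta comparison.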

Blast radius reduction

Regionalize (subsystems, locations, markets)

Deployment windows - identify low-traffic hours

Canary deployments - deploy to one (or a few) instances, compare metrics for a while

Controlled rollout - deploy to X% at a time, auto-rollback on alarms
easier during low-traffic windows
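A controlled rollout loop might look like this (a sketch; `deploy` and `healthy` stand in for real deployment and alarm hooks):

```python
def controlled_rollout(instances, deploy, healthy, step_pct=10):
    """Deploy to step_pct% of the fleet at a time; stop at the first
    failed health check and return the instances to roll back."""
    deployed = []
    step = max(1, len(instances) * step_pct // 100)
    for i in range(0, len(instances), step):
        batch = instances[i:i + step]
        for inst in batch:
            deploy(inst)
        deployed.extend(batch)
        if not healthy():               # alarm fired: halt the rollout
            return ("rolled-back", deployed)
    return ("done", deployed)

# Toy run: health starts failing after the third instance is touched.
fleet = [f"host{i:02d}" for i in range(10)]
log = []
status, touched = controlled_rollout(fleet, log.append, lambda: len(log) < 3)
```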

off the beaten track

We don't have many tests. I'm not advocating that you shouldn't put in tests. [The reason we can get away with this] is that we have a great community. So instead of having automated test we have automated human tests.

But I'm not Stack Overflow ...

Synthetic transactions - emulate users, products and interactions. Define success and failure metrics
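A synthetic transaction is just a scripted user journey run on a schedule, emitting pass/fail metrics (a sketch; step names and the broken step are illustrative):

```python
def run_synthetic(steps, emit):
    """Run one synthetic user journey; emit 'success' or 'failure.<step>'."""
    for name, action in steps:
        try:
            action()
        except Exception:
            emit(f"failure.{name}")   # which step broke, not just "it's down"
            return False
    emit("success")
    return True

def broken_checkout():
    raise RuntimeError("simulated: payment backend down")

metrics = []
journey = [("signup", lambda: None),
           ("add_to_cart", lambda: None),
           ("checkout", broken_checkout)]
ok = run_synthetic(journey, metrics.append)
```

Alarming on these metrics gives detection even for flows real users hit rarely.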

Let's play

In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. Basically, a GameDay exercise tests a company's systems, software, and people in the course of preparing for a response to a disastrous event.

DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.

Notomatic

Change management process - a "code" reviewed document

Pre-mortem

I play a little game. I assume the worst
Baelish, P.

Clear rollback criteria

Pre-reviewed scripts

Any monkey can execute or take over

Analysis

Correction Of Errors

Emphasis on learning, not on finger-pointing

five whys?

"how would you cut impact in half?"

"how would you cut detection in half?"

best learnings - read COEs

@kenegozi
http://kenegozi.com
http://kenegozi.com/talks

recap

  • automate what you can, vet what you cannot
  • reduce time to detection
  • rollbackability is a feature
  • reduce blast radius
  • always be shipping!
