Fail... The Right Way


Node.js in Production


http://ssw2014.formidablelabs.com

 @ryan_roemer | formidablelabs.com

Welcome to Production

Production can be a rough place for your Node.js apps. Things can go very wrong out in the wild.

Formidable Labs

3:00 AM

Our Focus

Whether on PaaS, IaaS, or bare metal.

  • Design for Failure: Keep your Node.js apps up
  • Avoidance: Get yourself out of the failover business
  • Isolate: One failure at a time
  • Analyze: Debug and diagnose problems quickly

1. Design for Failure

Fail and recover at multiple levels.

Let's look at failure from a system perspective.

Single Node.js Worker

Never ignore errors.

Have a strong bias for killing the worker.

  • Handle: uncaughtException, Domains
  • Listen: foo.on("error")
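
A minimal sketch of the "bias for killing the worker" idea, assuming something above the worker (a cluster master, monit) restarts it. `makeFatalHandler` and `guard` are hypothetical helper names, not from the talk, and the `exit` parameter is injectable only so the behavior can be exercised without ending the process. The `domain` module has since been deprecated in Node.js, so this sticks to `uncaughtException` and `"error"` listeners.

```javascript
"use strict";
const { EventEmitter } = require("events");

// Hypothetical helper: returns a handler that logs the error and then exits,
// leaving the restart to whatever supervises this worker.
function makeFatalHandler(label, exit) {
  return function onFatal(err) {
    console.error(label + ":", err && err.stack ? err.stack : err);
    exit(1);
  };
}

// Attach an "error" listener so the failure is logged with a label before the
// worker exits. With no listener at all, emitting "error" throws.
function guard(emitter, label, exit) {
  emitter.on("error", makeFatalHandler(label, exit));
  return emitter;
}

// Assumed wiring in a real worker (dbConnection is a placeholder):
//   process.on("uncaughtException",
//     makeFatalHandler("uncaughtException", process.exit));
//   guard(dbConnection, "db", process.exit);
```

The handler does no recovery on purpose: after an unexpected error the process state is suspect, so it logs and leaves.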

Multiple Node.js Workers

Use cluster or recluster to multiplex CPUs and isolate errors.

  • Workers: die early on errors
  • Master: monitor and kill workers
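
As a rough illustration of the master's job, here is a sketch that uses the shape of Node's built-in `cluster` module rather than recluster. `supervise` is a hypothetical helper; it takes the cluster object as a parameter so the restart logic can be tried with a stand-in.

```javascript
"use strict";

// Hypothetical helper: fork `count` workers and replace any worker that exits.
// `clusterLike` is expected to behave like Node's built-in `cluster` module
// (a `fork()` method and an "exit" event).
function supervise(clusterLike, count) {
  clusterLike.on("exit", function (worker, code, signal) {
    console.error("worker exited (code=" + code + ", signal=" + signal +
      "), forking a replacement");
    clusterLike.fork();
  });
  for (let i = 0; i < count; i++) {
    clusterLike.fork();
  }
}

// Assumed wiring in the entry script:
//   const cluster = require("cluster");
//   const os = require("os");
//   if (cluster.isMaster) { supervise(cluster, os.cpus().length); }
//   else { require("./server.js"); }
```

This sketch reforks immediately, so a worker that crashes on startup would loop; a real master would likely add a backoff or a restart limit.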

Multiple Node.js Workers

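/*global process:false */
var recluster = require("recluster");

// Start a master that forks server.js workers and respawns any that die.
var cluster = recluster("./server.js");
cluster.run();

// Hot reload: kill -s SIGUSR2 CLUSTER_PID
// Workers are replaced with fresh ones running the current code.
process.on("SIGUSR2", function() {
  console.log("Got SIGUSR2, reloading cluster...");
  cluster.reload();
});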

Server

  • Use monit or alternatives
  • Restart the Node.js master

Service

  • Load-balancers
  • Heartbeat / ping monitors
  • Availability zones, etc.
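
Load balancers and ping monitors need something to hit. A minimal sketch of a health endpoint follows; the `/health` path, the response fields, and the helper names are illustrative assumptions, not something the talk prescribes.

```javascript
"use strict";
const http = require("http");

// Hypothetical health endpoint for load balancers and ping monitors.
function healthHandler(req, res) {
  if (req.url === "/health") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({
      status: "ok",
      pid: process.pid,
      uptime: process.uptime()
    }));
    return;
  }
  res.writeHead(404, { "Content-Type": "text/plain" });
  res.end("Not found");
}

// Build the server without listening; the caller picks the port.
function createHealthServer() {
  return http.createServer(healthHandler);
}

// Assumed usage (the port is arbitrary):
//   createHealthServer().listen(3000);
```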

Make it Hot

Everything up to this point should have hot failover.

Datacenter

Hot failover across datacenters?

  • Typically very costly
  • But, the real deal if you're serious

Disaster Recovery

  • "Business Continuity"
  • Don't let a technological problem end your business
  • Have a worst case, "lose some data" recovery plan

2. Avoid Failures

Get out of the failover business wherever someone else can run it for you.

Resources to Not Support

Don't rely on system / service resources you don't need to.

  • Disk: NAS, disks, SSDs.
  • Datastores: DB, cloud services.
  • ... Load Balancers, DNS, etc.

How To Avoid

  • Use SaaS wherever possible! (DB, LBs, storage).
  • Or PaaS for some Node.js apps.
  • Design stateless, fungible servers (no disk risks).

3. Isolate Failures

Isolate failures you can't avoid.

Resources to Support

Look to resources you must depend on:

  • CPU/Load: Run out of this and it's over.
  • HTTP: Each different host you hit.
  • Datastores: Connections? Different Hosts?
  • ... also, memory, I/O, etc. and combinations thereof
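
One way to keep outbound HTTP dependencies from leaning on each other is to give each upstream host its own connection pool. The sketch below assumes Node's core `http.Agent`; `agentFor` and the default cap of 10 are hypothetical choices made for illustration.

```javascript
"use strict";
const http = require("http");

const agents = new Map();

// Hypothetical helper: one keep-alive agent per upstream host, each with its
// own socket cap, so a slow host is limited to a fixed number of in-flight
// sockets instead of piling up connections.
function agentFor(host, maxSockets) {
  if (!agents.has(host)) {
    agents.set(host, new http.Agent({
      keepAlive: true,
      maxSockets: maxSockets || 10
    }));
  }
  return agents.get(host);
}

// Assumed usage ("translate.example.com" is a placeholder host):
//   http.get({
//     host: "translate.example.com",
//     path: "/",
//     agent: agentFor("translate.example.com", 5)
//   }, onResponse);
```

Requests beyond the cap queue inside that host's agent, which keeps the pressure visible per dependency.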

Some Anecdotes

Node.js apps can be bad neighbors.

  • DB (auto-suggest) vs. HTTP (vendor translations)
  • DB (CRUD app) vs. CPU/Load (co-located PHP app)
  • Read vs. Write DB operations.

How To Isolate

  • Create "micro-services" that stand on their own.
  • Monitor for cross-pressure and respond. (Next section!)

4. Analyze Everything

Data drives problem discovery and action.

Log, Monitor, Mine

Scout
PagerDuty
Pingdom
Loggly
AWS Elastic MapReduce / Hadoop

Decisions, Goals

Things to look for in Node.js apps...

Identify

  • Resource pressure: CPU, I/O, memory, network
  • Performance: Throughput, latency
  • Errors/Bugs: Quantitative, qualitative
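
A small sketch of the kind of numbers worth logging from inside a worker, assuming a log aggregator picks up JSON lines from stdout. `snapshot`, `percentile`, and the field names are hypothetical; the hosted tools listed earlier collect much of this on their own.

```javascript
"use strict";
const os = require("os");

// Hypothetical helper: a few process and host numbers to log on an interval.
function snapshot() {
  const mem = process.memoryUsage();
  return {
    time: new Date().toISOString(),
    pid: process.pid,
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    load1: os.loadavg()[0],
    uptimeSec: process.uptime()
  };
}

// Nearest-rank percentile over recorded latencies (milliseconds).
function percentile(values, p) {
  if (values.length === 0) {
    return null;
  }
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Assumed usage:
//   setInterval(function () {
//     console.log(JSON.stringify(snapshot()));
//   }, 10000);
```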

Decide

  • Scale up, scale down?
  • Separate services?

Recap

  • Design for failure
  • Avoid
  • Isolate
  • Analyze