Principal Site Reliability Engineer

A bit about us:

Our client, based in San Francisco, is scaling rapidly and is focused on changing the way people buy online. They serve everyone from direct-to-consumer companies to the most popular names in retail. Some of the brands we work with are Sonos, Neiman Marcus, Home Depot, and over 700 other brands globally.

We are hiring at the Senior, Principal, and Staff levels.

Why join us?

Do you want to change the way people experience their purchases? We do too...

Very Competitive Base Salary!
Unlimited PTO and Sick Time Policy!
Commuter Benefits Up to $300 a Month!
Fully Covered Medical/Dental/Vision Plans that Require $0 from Employees!

This position can be remote from one of the following states: California, Colorado, Florida, Georgia, Illinois, Massachusetts, Missouri, North Carolina, New Hampshire, New Jersey, New York, Oregon, Tennessee, Texas, Utah, or Washington -- AND anywhere in Canada!!!

Job Details

A software engineer who has strong knowledge of distributed data systems and infrastructure and wants to automate everything in service of:
Reducing engineering friction between creating an experiment and delivering an observable, reliable & secure experiment at scale (business value) in production

Someone with experience in production-related technologies and production support:
AWS/GCP
Linux, Docker, Jenkins, Kubernetes
Prometheus, ELK, Grafana
Databases (Cassandra, Yugabyte, Redis, MongoDB, etc.)
Messaging and search (Kafka, Pulsar, Elasticsearch, etc.)
Service-oriented architecture

They like to automate what is normally done by hand.
They have experience running efficient, high-availability, large-scale systems.

Someone who has experience interacting with development teams, not just with production environments, staging environments, deployments, and CI/CD processes.

An SRE is:
Responsible for availability, efficiency, latency, performance, capacity planning, monitoring, emergency response
Responsible for the incident management process

They have experience helping the business establish metrics and Service Level Objectives (SLOs) (e.g., availability targets, reliability targets)

An SRE:
Helps development teams launch products
Makes sure the products don’t blow up
Measures and enforces SLO agreements across development teams
Optimizes and measures
Automates fixes rather than requiring a human to apply them (e.g., automated responses when an error occurs, giving a human time to investigate what is going on, assess, and take appropriate action)
Leads capacity planning/provisioning strategy
Guides development teams to a more efficient operation

An SRE:
Has a mix of software engineering and systems engineering experience. Network engineering and systems administration could be a good mix too if they have software engineering experience/education.

Is someone with the ability to work with development teams to flag projects or systems that will be difficult for the DevOps team to administer.

Monitoring
Creates alerting systems
What needs to be taken care of immediately (P0 emergency)
Ticket system for non-P0 alerts: not logging, but alerts for what needs attention, each with its priority level
Is responsible for Capacity Planning – Demand Forecasting (e.g., how much spare capacity do we need during peak demand time and how do you measure and benchmark it)