Role: SRE Engineer – Observability
Location: San Jose, CA
Duration: 6+ months (possible extension)
Pay Rate: $70 to $80

Key Responsibilities:

Optimize, automate, and maintain observability platforms built using the Kubernetes + Prometheus Stack (Prometheus, Loki, Grafana, Alertmanager).
Enhance and streamline monitoring and alerting solutions for large-scale distributed systems hosted on AWS.
Automate observability workflows and tooling using Python (preferred) or Go (e.g., custom exporters, dashboard generation, alert configurations).
Integrate with PagerDuty for incident management, alert routing, and on-call support processes.
Collaborate with DevOps teams to instrument CI/CD pipelines for better visibility into deployment performance and system health.
Implement Infrastructure-as-Code (IaC) for observability components using Terraform.
Troubleshoot monitoring gaps, reduce alert noise, and resolve performance or reliability issues across the stack.
Share engineering insights, challenges solved, and past implementation experiences through GitHub or other documented work.

Mandatory Skills:

Experience: 5+ years in observability, monitoring, or site reliability engineering.
Observability Tools: Hands-on experience with Prometheus, Loki, Grafana, and Alertmanager.
Cloud & Kubernetes: Experience with AWS (EKS/EC2) and Kubernetes monitoring (e.g., kube-state-metrics, cAdvisor).
Programming: Strong proficiency in Python (preferred) or Go.
Incident Management: Experience integrating and managing on-call workflows with PagerDuty.
IaC: Proficient with Terraform or equivalent tools for observability infrastructure provisioning.

Nice-to-Have:

Experience with the Zoom Developer Platform for collaboration tool integrations.
Certifications such as AWS Certified DevOps Engineer or Grafana Certified Associate.

Portfolio of past work (GitHub or equivalent) demonstrating observability automation, optimization, or incident troubleshooting.
