Role: SRE Engineer – Observability
Location: San Jose, CA
Duration: 6+ months (possible extension)
Pay Rate: $70 to $80
Key Responsibilities:
- Optimize, automate, and maintain observability platforms built on Kubernetes and the Prometheus stack (Prometheus, Loki, Grafana, Alertmanager).
- Enhance and streamline monitoring and alerting solutions for large-scale distributed systems hosted on AWS.
- Automate observability workflows and tooling using Python (preferred) or Go (e.g., custom exporters, dashboard generation, alert configurations).
- Integrate with PagerDuty for incident management, alert routing, and on-call support processes.
- Collaborate with DevOps teams to instrument CI/CD pipelines for better visibility into deployment performance and system health.
- Implement Infrastructure-as-Code (IaC) for observability components using Terraform.
- Troubleshoot monitoring gaps, reduce alert noise, and resolve performance or reliability issues across the stack.
- Share engineering insights, challenges solved, and past implementation experiences through GitHub or other documented work.
Mandatory Skills:
- Experience: 5+ years in observability, monitoring, or site reliability engineering.
- Observability Tools: Hands-on experience with Prometheus, Loki, Grafana, and Alertmanager.
- Cloud & Kubernetes: Experience with AWS (EKS/EC2) and Kubernetes monitoring (e.g., kube-state-metrics, cAdvisor).
- Programming: Strong proficiency in Python (preferred) or Go.
- Incident Management: Experience integrating and managing on-call workflows with PagerDuty.
- IaC: Proficient with Terraform or equivalent tools for observability infrastructure provisioning.
Nice-to-Have:
- Experience with the Zoom Developer Platform for collaboration tool integrations.
- Certifications such as AWS Certified DevOps Engineer or Grafana Certified Associate.
- Portfolio of past work (GitHub or equivalent) demonstrating observability automation, optimization, or incident troubleshooting.