SRE Engineer – Observability

Job Type: Contract
Work Flexibility: On-site
Location: San Jose, CA
Required Skills: Alertmanager, Grafana, IaC, Kubernetes, Loki, Prometheus, Python & Go

Role: SRE Engineer – Observability
Location: San Jose, CA
Duration: 6+ months (possible extension)
Pay Rate: $70 to $80

Key Responsibilities:

  • Optimize, automate, and maintain observability platforms built on Kubernetes and the Prometheus stack (Prometheus, Loki, Grafana, Alertmanager).
  • Enhance and streamline monitoring and alerting solutions for large-scale distributed systems hosted on AWS.
  • Automate observability workflows and tooling using Python (preferred) or Go (e.g., custom exporters, dashboard generation, alert configurations).
  • Integrate with PagerDuty for incident management, alert routing, and on-call support processes.
  • Collaborate with DevOps teams to instrument CI/CD pipelines for better visibility into deployment performance and system health.
  • Implement Infrastructure-as-Code (IaC) for observability components using Terraform.
  • Troubleshoot monitoring gaps, reduce alert noise, and resolve performance or reliability issues across the stack.
  • Share engineering insights, challenges solved, and past implementation experiences through GitHub or other documented work.

Mandatory Skills:

  • Experience: 5+ years in observability, monitoring, or site reliability engineering.
  • Observability Tools: Hands-on experience with Prometheus, Loki, Grafana, and Alertmanager.
  • Cloud & Kubernetes: Experience with AWS (EKS/EC2) and Kubernetes monitoring (e.g., kube-state-metrics, cAdvisor).
  • Programming: Strong proficiency in Python (preferred) or Go.
  • Incident Management: Experience integrating and managing on-call workflows with PagerDuty.
  • IaC: Proficient with Terraform or equivalent tools for observability infrastructure provisioning.

Nice-to-Have:

  • Experience with the Zoom Developer Platform for collaboration tool integrations.
  • Certifications such as AWS Certified DevOps Engineer or Grafana Certified Associate.
  • Portfolio of past work (GitHub or equivalent) demonstrating observability automation, optimization, or incident troubleshooting.
