Role: Senior Site Reliability Manager
Full-Time – Hybrid
Local to San Jose, CA
The Client offers a simple and scalable cloud-based IoT edge orchestration solution that delivers visibility, control, and security for the distributed edge. Their platform allows customers to seamlessly manage and deploy any compute node, unlocking the value of IoT data, enabling real-time decisions, maximizing operational efficiency, and driving new business outcomes.
Job Summary: The Client is looking for an experienced Senior Site Reliability Engineering (SRE) Manager to join its team and contribute to the design and upkeep of its platform at an exciting start-up. Reporting to the VP of Engineering, the Sr. Manager of SRE Operations will be responsible for ensuring the availability of the SaaS platform and meeting the uptime and performance requirements of its Fortune 500 customers.
Must Have:
• Cloud-Based Architecture Knowledge: Deep understanding of cloud-based architectures and distributed systems, essential for managing and optimizing the cloud-based IoT edge orchestration solution.
• SRE and Operations Experience: Proven experience in a Site Reliability Engineer role or a similar operations role, with a minimum of 5 years in the field.
• Disaster Recovery and Performance Monitoring: Strong skills in implementing and managing disaster recovery processes, performance monitoring, alerting systems, and reporting.
• Compliance and Standards: Familiarity with the ISO 27001 and SOC 2 standards, and the ability to ensure that incidents are handled according to them.
• Team Leadership and Management: Experience in managing and leading SRE or operations teams, including on-call strategy and incident management.
• Problem-Solving Skills: Excellent problem-solving abilities, capable of handling high-pressure situations effectively.
• Communication and Interpersonal Skills: Strong communication skills to collaborate with various teams, report to upper management, and interface with customers.
• Customer-Centric Mindset: A strong focus on customer success, ensuring the platform meets and exceeds customer expectations.
Key Responsibilities:
• Lead the SRE Operations team to implement processes and procedures that ensure quality and predictability of disaster recovery, performance monitoring, alerting, and reporting.
• Ensure compliance with the ISO 27001 and SOC 2 standards in incident handling.
• Play a key role in team performance, growth, and on-call strategy for 24x7x365 availability.
• Serve as the initial escalation point for incidents, ensuring timely resolution by involving other teams as needed.
• Collaborate with the SRE Technical Lead and other engineering groups to suggest and implement platform improvements.
• Regularly report on platform performance to upper management.
• Interface with the Customer Experience Organization and meet with customers as required.
• Perform hands-on duties as part of the SRE Operations Team.
Qualifications:
• Bachelor’s degree in Computer Science, Engineering, or related field.
• Minimum of 5 years of experience in a Site Reliability Engineer role or similar.
• Proven experience in managing and leading SRE or operations teams.
• Strong understanding of cloud-based architectures and distributed systems.
• Experience with disaster recovery, performance monitoring, and alerting systems.
• Familiarity with the ISO 27001 and SOC 2 standards.
• Excellent problem-solving skills and ability to handle high-pressure situations.
• Strong communication and interpersonal skills.
• Energetic self-starter with a customer-centric mindset.
Key Attributes:
• Customer-Centric: Always strive to create ecstatic customers by understanding and addressing their needs.
• Deployment Excellence: Ensure frictionless deployments and smooth operation of the platform.
• Escalation Management: Efficiently manage escalations and on-call duties.
• Leadership: Radiate energy and enthusiasm, serving as a technical leader to the team.
• Commitment: Stay fully committed to customer success and operational excellence.