Staff Engineer - Application SRE
The Role:
We are looking for an experienced Staff Software Engineer - Application SRE to join our Site Reliability Engineering (SRE) team. In this role, you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications. You will leverage your software engineering skills to build tools, automate processes, and collaborate with cross-functional teams to deliver high-performance, scalable systems. As a senior member of the team, you will take ownership of application reliability and guide other engineers in following best practices for maintaining high-quality services. In this senior capacity, you will also lead and mentor teams, both directly and indirectly.
Responsibilities:
- Understand and document the non-functional performance and scalability requirements, including SLIs/SLOs, and validate them with business stakeholders.
- Manage SLI/SLOs of customer-facing interfaces as well as backend services and provide improvement plans for non-compliance.
- Develop custom dashboards in observability platforms (New Relic/Dynatrace/Grafana etc.) to represent a holistic view of system operational health
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Support release engineering by providing automation support and by pushing changes to production when manual intervention is needed
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
- Take ownership of reliability initiatives and deliver them end to end, leading teams directly and indirectly.
Key Responsibilities:
- Application Reliability Engineering: Lead efforts to design and implement systems that ensure the high availability, scalability, and reliability of critical applications.
- Incident Management: Drive incident response, root cause analysis, and remediation for application-related issues, ensuring rapid resolution and preventing recurrence.
- Problem Management: Conduct 5-whys analysis on issues related to application design, code, and configuration to identify the most likely root cause and a solution that prevents recurrence.
- Automation and Tooling: Develop and maintain automation tools to improve application deployment, monitoring, and scaling, minimizing manual work and reducing time-to-recovery during incidents.
- Performance Tuning: Analyze and resolve application performance bottlenecks, collaborating with developers to optimize code and infrastructure to improve response times and throughput.
- Monitoring and Observability: Architect and implement robust monitoring, logging, and alerting systems to gain deep visibility into application performance and health. Use tools such as Prometheus, Grafana, Datadog, or New Relic.
- Service-Level Objectives (SLOs): Establish and maintain service-level objectives (SLOs) and indicators (SLIs) that ensure operational excellence, working with stakeholders to balance reliability and innovation.
- Collaboration with Engineering Teams: Work closely with software development teams to embed SRE best practices into the application lifecycle, ensuring reliability is built into all stages of development.
- Capacity Planning and Scalability: Monitor application traffic and infrastructure capacity, proactively scaling systems to handle growth and ensure smooth application operation during peak loads.
- Mentorship and Leadership: Mentor junior SREs and software engineers on best practices for reliability engineering and foster a culture of continuous improvement.
- Continuous Improvement: Lead post-incident reviews and retrospectives, driving improvements to system architecture, operational practices, and incident response processes.
Required Skills and Experience:
- 10+ years of experience in software engineering, site reliability engineering, or a related role.
- Proficiency in at least one programming language or runtime (e.g., Java, Go, Node.js) and strong scripting skills (e.g., Bash, Python).
- Hands-on experience with frameworks such as Spring Boot and React.
- Hands-on experience with monitoring, observability, and logging tools (e.g., Prometheus, Grafana, Datadog, New Relic) to track system performance and health.
- Strong experience with cloud platforms (AWS, Google Cloud, Azure) and cloud-native architectures, including containers and orchestration tools (e.g., Kubernetes, Docker).
- Expertise in building and managing CI/CD pipelines and infrastructure as code (IaC) tools such as Terraform, Ansible, or CloudFormation.
- Experience troubleshooting complex distributed systems in production, performing root cause analysis, and developing remediation strategies.
- Familiarity with microservice architectures and distributed systems at scale.
- Strong understanding of containers, networking, databases, and performance optimization techniques.
- Excellent communication and collaboration skills, with the ability to work effectively across teams and mentor others.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
Preferred Qualifications:
- Experience with multi-cloud or hybrid-cloud environments.
- Knowledge of security best practices for applications and cloud infrastructure.
- Experience leading blameless postmortems and implementing long-term fixes.
- Familiarity with database management systems (SQL, NoSQL) and caching technologies.
- Category: Technology
- Locations: Mexico City
- Remote status: Hybrid
- Employment type: Full-time