Technology · Remote - LatAm, Remote - Mexico · Fully Remote
Site Reliability Engineer (SRE)
The Role:
The Site Reliability Engineer (SRE) will be responsible for ensuring the availability, reliability, scalability, and performance of complex application platforms operating in a multi-cloud (AWS/GCP) environment. This role combines software engineering, systems engineering, and operational excellence to create highly reliable and automated systems that meet the demanding requirements of financial services.
Responsibilities:
- Design, implement, and maintain highly available, scalable, and secure platforms for complex financial applications.
- Develop and manage automated solutions for infrastructure provisioning, configuration management, and operational tasks using Infrastructure as Code (IaC).
- Own and enhance monitoring, alerting, and telemetry systems to ensure real-time visibility into application and infrastructure performance.
- Improve incident detection and response, conduct root cause analysis, and contribute to postmortem documentation and remediation.
- Define and enforce SLOs, SLIs, and SLAs in collaboration with application teams to maintain service reliability and business alignment.
- Participate in a 24x7 on-call rotation, driving incident resolution and reducing Mean Time to Recovery (MTTR).
- Collaborate with developers and platform engineers to build self-healing systems, automated failovers, and resilient architecture.
- Lead efforts to identify and eliminate toil by automating recurring manual tasks and improving system design.
- Contribute to capacity planning, performance tuning, and optimization efforts to ensure platforms are production-grade.
- Support and champion DevSecOps and FinOps best practices, balancing speed, security, and cost efficiency.
Requirements:
- 5+ years of experience in SRE, DevOps, or systems engineering roles supporting high-volume, mission-critical applications.
- Strong expertise in cloud platforms (AWS and/or GCP), including services such as EC2, EKS/GKE, IAM, CloudWatch, and Stackdriver.
- Deep proficiency in Linux systems (RHEL/CentOS/Debian), containerization with Docker, and container orchestration with Kubernetes (EKS/GKE).
- Proven experience with Infrastructure as Code and configuration management tools: Terraform, Ansible, Chef, or equivalent.
- Strong coding/scripting skills in Python, Bash, or Golang, and a mindset for automation.
- Experience with CI/CD pipelines and tools such as Jenkins, Harness, Bitbucket, and Git.
- Hands-on experience with observability stacks: Prometheus, ELK, CloudWatch, Stackdriver, Grafana.
- Familiarity with Agile and ITSM processes (incident/change/problem/configuration management), preferably using Jira/JSM.
- Excellent problem-solving skills and the ability to thrive under pressure in a high-availability environment.
Preferred:
- AWS/GCP certifications or relevant Site Reliability/Cloud Solution Architect certifications.
- Familiarity with service mesh, canary deployments, blue/green rollouts, and modern release engineering practices.
- Experience supporting multi-region/multi-cloud environments.
- Exposure to security tooling and processes for highly regulated environments.
Soft Skills:
- Strong written and verbal communication skills for effective collaboration with global engineering and operations teams.
- High ownership mindset with the ability to work independently and proactively improve system reliability.
- Organized, analytical, and driven by continuous improvement.
Category: Technology
Locations: Remote - LatAm, Remote - Mexico
Remote status: Fully Remote
Employment type: Full-time