Site Reliability Engineer (SRE)

The Role:

The Site Reliability Engineer (SRE) will be responsible for ensuring the availability, reliability, scalability, and performance of complex application platforms operating in a multi-cloud (AWS/GCP) environment. This role combines software engineering, systems engineering, and operational excellence to create highly reliable and automated systems that meet the demanding requirements of financial services.

Responsibilities:

Design, implement, and maintain highly available, scalable, and secure platforms for complex financial applications.
Develop and manage automated solutions for infrastructure provisioning, configuration management, and operational tasks using Infrastructure as Code (IaC).
Own and enhance monitoring, alerting, and telemetry systems to ensure real-time visibility into application and infrastructure performance.
Improve incident detection and response, conduct root cause analysis, and contribute to postmortem documentation and remediation.
Define and enforce SLOs, SLIs, and SLAs in collaboration with application teams to maintain service reliability and business alignment.
Participate in a 24x7 on-call rotation, driving incident resolution and reducing Mean Time to Recovery (MTTR).
Collaborate with developers and platform engineers to build self-healing systems, automated failovers, and resilient architecture.
Lead efforts to identify and eliminate toil by automating recurring manual tasks and improving system design.
Contribute to capacity planning, performance tuning, and optimization efforts to ensure platforms are production-grade.
Support and champion DevSecOps and FinOps best practices, balancing speed, security, and cost efficiency.

Requirements:

5+ years of experience in SRE, DevOps, or systems engineering roles supporting high-volume, mission-critical applications.
Strong expertise in cloud platforms (AWS and/or GCP), including services such as EC2, EKS/GKE, IAM, CloudWatch, and Stackdriver.
Deep proficiency in Linux systems (RHEL/CentOS/Debian) and in containerization and orchestration with Docker and Kubernetes (EKS/GKE).
Proven experience with Infrastructure as Code tools: Terraform, Ansible, Chef, or equivalent.
Strong coding/scripting skills in Python, Bash, or Golang, and a mindset for automation.
Experience with CI/CD pipelines and tools such as Jenkins, Harness, Bitbucket, and Git.
Hands-on experience with observability stacks: Prometheus, ELK, CloudWatch, Stackdriver, and Grafana.
Familiarity with Agile and ITSM processes (incident/change/problem/configuration management), preferably using Jira/JSM.
Excellent problem-solving skills and the ability to thrive under pressure in a high-availability environment.

Preferred:

AWS/GCP certifications or relevant Site Reliability/Cloud Solution Architect certifications.
Familiarity with service mesh, canary deployments, blue/green rollouts, and modern release engineering practices.
Experience supporting multi-region/multi-cloud environments.
Exposure to security tooling and processes for highly regulated environments.

Soft Skills:

Strong written and verbal communication skills for effective collaboration with global engineering and operations teams.
High ownership mindset with the ability to work independently and proactively improve system reliability.
Organized, analytical, and driven by continuous improvement.
