Site Reliability Engineer (SRE) (Permanent) – Bangalore, India
Overview
Our financial services client is hiring two permanent Site Reliability Engineers (SREs) in Bangalore to help improve platform reliability, scalability, observability, and operational excellence across cloud-first services. This is a hands-on engineering role focused on production stability, automation, and continuous improvement in a regulated environment.
Role Details
- Location: Bangalore, India
- Employment Type: Permanent (2 positions)
- Work Model: Onsite/Hybrid (depending on client policy)
Key Responsibilities
- Own and improve service reliability, availability, latency, and performance across critical platforms.
- Build and enhance monitoring, alerting, and observability (metrics, logs, traces) to reduce mean time to recovery (MTTR) and prevent recurrence.
- Lead incident response and post-incident reviews, including root cause analysis (RCA), driving permanent fixes and reliability improvements.
- Automate operational tasks and reduce toil through scripting and tooling.
- Support and improve CI/CD pipelines and release practices to enable safe, frequent deployments.
- Partner with engineering and infrastructure teams to implement reliability best practices (SLOs/SLIs, error budgets, capacity planning).
- Contribute to cloud architecture decisions across AWS and Azure, focusing on resiliency and cost/performance balance.
- Ensure platform operations meet the security and compliance requirements typical of financial services.
Required Skills & Experience
- Proven experience as an SRE / DevOps / Production Engineer supporting business-critical systems.
- Strong hands-on cloud experience across AWS and Azure in production environments.
- Solid Linux and networking fundamentals (DNS, TLS, load balancing, routing concepts).
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation, ARM/Bicep).
- Strong automation/scripting skills (e.g., Python, Bash, PowerShell).
- Experience with containers and orchestration (e.g., Docker, Kubernetes).
- Observability tooling experience (e.g., CloudWatch/Azure Monitor, Prometheus/Grafana, ELK/Splunk, Datadog/New Relic; any relevant mix).
- Comfortable working in on-call / support rotations and handling major incidents calmly and methodically.
Nice to Have
- Experience in financial services or other regulated environments.
- Strong understanding of SRE practices: SLOs/SLIs, error budgets, capacity planning, chaos testing, and reliability engineering patterns.
- Experience with service mesh, API gateways, or distributed tracing in microservices environments.
- Knowledge of security fundamentals in cloud environments (IAM, secrets management, hardening).
