Overview
Consultant role focused on Site Reliability Engineering for Platform & Cloud Operations, bridging legacy middleware stability with cloud-native modernization through SRE practices and automation.
Key Responsibilities
- Implement and operationalize SRE practices including SLO/SLI definition, error budget management, observability, and incident response across hybrid cloud and on-premises platforms.
- Manage and support legacy middleware technologies (JBoss, Apache, WebSphere, IIS) while driving modernization toward cloud-native and containerized architectures.
- Design and maintain CI/CD pipelines, automation workflows, and Infrastructure-as-Code (IaC) frameworks using Terraform and GitOps practices.
- Build self-service and self-healing solutions, auto-remediation playbooks, and proactive alerting, and run chaos engineering exercises.
- Monitor platform health through observability tooling (Dynatrace, Splunk) and drive continuous improvement against defined SLOs and reliability targets.
- Lead knowledge transfer, documentation, and runbook development.
- Contribute to the Platform as a Product strategy by embedding SRE principles into platform deliverables, supporting cloud adoption, and advising on architectural decisions.
- Ensure compliance with ITSM processes, security standards, and business continuity requirements including disaster recovery planning and SLA adherence.
Required Experience
- Master's degree with a minimum of 5 years of relevant experience, a Bachelor's degree with a minimum of 7 years of relevant experience, or an equivalent combination of education and experience.
- Minimum of 4–5 years of hands-on SRE experience, with demonstrated application of SRE principles (SLO/SLI definition, error budgets, observability, incident response, and chaos engineering) in high-availability environments.
- Proven experience with legacy middleware platforms (JBoss, Apache, WebSphere, IIS) alongside modern application stacks (.NET, Java, Node.js, Angular).
- Strong proficiency in DevOps/DevSecOps tooling: Terraform, Kubernetes/AKS, Docker, GitOps, CI/CD pipelines, and GitHub/GitLab/Azure Repos.
- Hands-on experience with multi-cloud platforms (Azure and AWS).
- Solid scripting and automation skills in Python, PowerShell, or Bash.
- Experience with monitoring and observability platforms including Dynatrace, Splunk, Prometheus, and Grafana.
- Working experience in Agile and SAFe delivery environments.
- Strong database skills across relational (PostgreSQL, MySQL) and NoSQL platforms, with experience in PaaS and COTS solution management.