Site Reliability Engineer - SRE
Senior Site Reliability Engineer
Focus: Observability, Resilience & Platform Reliability
We are looking for a Site Reliability Engineer to help build highly observable, resilient and scalable cloud platforms across a Microsoft-first environment. This role sits at the intersection of engineering, operations and platform enablement, with a strong emphasis on observability, automation, and reliability as a product feature.
You will design and operate systems that are measurable, debuggable and self-healing, while partnering closely with engineering teams to improve system reliability, performance and customer experience.
What You'll Be Doing
- Design and operate highly observable systems using metrics, logs, traces and alerts to provide deep visibility into platform health and performance.
- Build and maintain SRE tooling and frameworks across the Microsoft stack (Azure, .NET, Kubernetes, Azure DevOps).
- Define and track SLIs, SLOs and error budgets to guide reliability decisions.
- Implement proactive monitoring and alerting strategies that reduce noise and focus on customer impact.
- Improve platform resilience, availability and performance through automation, testing and fault-tolerant design.
- Partner with development teams to embed reliability into the SDLC (CI/CD, release strategies, testing, telemetry).
- Lead incident response, post-incident reviews and systemic improvements.
- Continuously reduce toil through automation and self-service tooling.
Tech Environment
- Cloud: Microsoft Azure
- Platform: AKS, App Services, Azure Functions
- Observability: Azure Monitor, Application Insights, OpenTelemetry, Log Analytics, Grafana, Prometheus
- CI/CD: Azure DevOps, GitHub Actions
- IaC: Bicep, ARM, Terraform
- Languages: C#, PowerShell, Python
- Containers: Docker, Kubernetes
What We're Looking For
- Strong experience in SRE, Platform, Cloud or DevOps engineering roles.
- Deep knowledge of Microsoft / Azure ecosystems.
- Hands-on experience with observability platforms (metrics, logs, traces, alerting).
- Experience defining and using SLIs, SLOs and error budgets.
- Solid understanding of distributed systems and cloud-native architecture.
- Strong automation mindset and scripting ability.
- Experience supporting production systems at scale.
Based in Sydney CBD, with a hybrid work arrangement.
Interviews are happening this week.