Site Reliability Engineer - SRE

Senior Site Reliability Engineer 
 
Focus: Observability, Resilience & Platform Reliability
 
We are looking for a Site Reliability Engineer to help build highly observable, resilient and scalable cloud platforms across a Microsoft-first environment. This role sits at the intersection of engineering, operations and platform enablement, with a strong emphasis on observability, automation and reliability as a product feature.
You will design and operate systems that are measurable, debuggable and self-healing, while partnering closely with engineering teams to improve system reliability, performance and customer experience.
 
What You'll Be Doing
  • Design and operate highly observable systems using metrics, logs, traces and alerts to provide deep visibility into platform health and performance.
  • Build and maintain SRE tooling and frameworks across the Microsoft stack (Azure, .NET, Kubernetes, DevOps).
  • Define and track SLIs, SLOs and error budgets to guide reliability decisions.
  • Implement proactive monitoring and alerting strategies that reduce noise and focus on customer impact.
  • Improve platform resilience, availability and performance through automation, testing and fault-tolerant design.
  • Partner with development teams to embed reliability into the SDLC (CI/CD, release strategies, testing, telemetry).
  • Lead incident response, post-incident reviews and systemic improvements.
  • Continuously reduce toil through automation and self-service tooling.

Tech Environment
  • Cloud: Microsoft Azure
  • Platform: AKS, App Services, Azure Functions
  • Observability: Azure Monitor, Application Insights, OpenTelemetry, Log Analytics, Grafana, Prometheus
  • CI/CD: Azure DevOps, GitHub Actions
  • IaC: Bicep, ARM, Terraform
  • Languages: C#, PowerShell, Python
  • Containers: Docker, Kubernetes

What We're Looking For
  • Strong experience in SRE, Platform, Cloud or DevOps engineering roles.
  • Deep knowledge of Microsoft / Azure ecosystems.
  • Hands-on experience with observability platforms (metrics, logs, tracing, alerting).
  • Experience defining and using SLOs, SLIs, and error budgets.
  • Solid understanding of distributed systems and cloud-native architecture.
  • Strong automation mindset and scripting ability.
  • Experience supporting production systems at scale.

Based in Sydney CBD, with a hybrid work arrangement.

Interviews are happening this week.