SRE Engineer

You will be part of a close-knit team that values knowledge sharing, continuous learning, and professional growth. With access to industry-recognised certifications, strong mentorship, and technical development programmes, you will have every chance to advance your career while working on cutting-edge AWS native databases and automation projects.

SITE RELIABILITY ENGINEER

Salary: £400 - £500 per day, Inside IR35
Location: London

What you'll do:

As a Site Reliability Engineer based in London, you will play an integral role in supporting a wide range of AWS native databases, including RDS, Aurora, and Neptune, as well as CockroachDB. Your daily responsibilities will involve designing robust software solutions that enhance system performance while ensuring high availability for critical applications. You will work hand-in-hand with product engineering teams to improve observability tools and telemetry systems, driving forward automation initiatives that reduce manual intervention. By participating in incident management processes, including facilitating transparent communication with stakeholders and leading blameless post-mortems, you will help foster a culture of continuous improvement. Your commitment to maintaining operational stability through rigorous change management practices will be essential as you plan and execute disaster recovery tests. The role also offers opportunities to collaborate on infrastructure simplification projects alongside other SREs, ensuring best practices are shared across teams. Success in this position requires not only technical proficiency but also excellent interpersonal skills to thrive in an environment that values teamwork, knowledge sharing, and mutual support.

* Design, code, test, and deliver software enhancements aimed at improving existing systems by adopting DevOps principles across all cloud database offerings.
* Troubleshoot complex incidents efficiently, communicate effectively with stakeholders at all levels, facilitate blameless post-mortems, and identify corrective actions to ensure permanent resolution.
* Actively participate throughout the development lifecycle to ensure reliability, scalability, and operational stability are maintained across all supported platforms.
* Define, create, and monitor application analytics to support improved service level objectives and drive data-informed decision making.
* Ensure strict adherence to change management release processes while accelerating automation initiatives for these workflows.
* Lead resiliency management planning by scheduling and executing disaster recovery tests with a focus on automating these activities wherever possible.
* Provide on-call support during production incidents outside standard working hours as required by business needs.
* Contribute to enhancing product observability and telemetry by supporting ongoing modernisation efforts within the infrastructure.
* Collaborate closely with engineering teams to brainstorm ideas that simplify infrastructure management and streamline SRE practices.

What you bring:

* Proficiency in Python or Unix Shell scripting combined with solid SQL skills enables you to automate tasks efficiently across complex environments.
* A good understanding of development tools such as source code control software (e.g., Git), automated build systems, automated testing frameworks, and JIRA ensures smooth collaboration within multidisciplinary teams.
* Familiarity with infrastructure as code concepts allows you to contribute effectively towards automation goals using tools like Terraform or Puppet.
* Experience with build automation pipelines, test-driven development methodologies, continuous integration (CI), and continuous delivery (CD) practices is highly valued.
* Hands-on experience managing both relational (e.g., RDS/Aurora) and non-relational databases equips you to support diverse data storage requirements.
* Previous exposure to site reliability engineering concepts—including service level objectives (SLOs), service level agreements (SLAs), service level indicators (SLIs), and error budgets—will help you excel in this role.
* Practical experience or familiarity with at least one major public cloud provider (AWS preferred; Google Cloud or Azure also considered) is important for success.
* Experience managing configuration for large fleets of servers using declarative frameworks is advantageous for scaling operations smoothly.
* Knowledge of leveraging APIs securely along with authentication mechanisms and data structures enhances your ability to integrate systems seamlessly.
* Understanding microservice architectures, REST API design/development principles, Docker/Kubernetes containerisation technologies, and CI/CD integration is beneficial.

Robert Walters Operations Limited is an employment business and employment agency and welcomes applications from all candidates.
