
Principal Site Reliability Engineer job at Fidelity Investments in Durham, NC

Remote, USA Full-time Posted 2026-06-16

Title: Principal Site Reliability Engineer

Location: 100 New Millennium Way, Bldg 1, Durham NC

Job Description:

Position Description: Combines operational excellence with development experience to deliver services at high scale and high availability, with resilience. Builds reliability into the ecosystem by applying best practices in Resiliency Engineering, Automation, Observability and Chaos Testing. Streamlines and accelerates the software delivery cycle by using DevOps practices and toolchain. Integrates Site Reliability Engineering (SRE) practices (Observability and Chaos) with DevOps processes and delivery pipelines to stop bad code from reaching production. Ensures business-critical enterprise systems are continuously available to internal and external customers. Implements technical standardization and process refinements within the engineering organization and for Site Reliability Engineers. Collaborates with production support teams to define and implement processes for the identification, collection, and analysis of incident data. Brings together technical, procedural, and financial data to reduce toil and increase efficiency.

Primary Responsibilities: Develops Chaos Testing capabilities using multiple Chaos Tools (AWS Fault Injection Service (FIS), Chaos Mesh, and Chaosd) and Chaos Toolkit. Develops and enhances the organization’s internal Chaos Framework to streamline Chaos Executions and reporting. Provides specialized technical expertise in the adoption of Chaos Engineering by application teams. Chaos tests and observes business-critical applications to understand their weaknesses and increase application resiliency. Activates Observability for the critical applications with recommended Service Level Indicators and Service Level Objectives for Latency, Availability, Error Rate, etc. Utilizes modern monitoring tools (Datadog, Splunk, Catchpoint, etc.) to reduce mean time to detect an issue and improve response times.
Creates CI/CD pipelines with security and quality checks using the Application Lifecycle Management toolchain. Helps integrate Chaos and Observability with CI/CD pipelines. Automates repetitive activities using scripting languages (Python, Groovy, etc.). Implements and supports solutions based on cloud platforms (AWS/Azure) and container orchestration (Kubernetes). Onboards and evaluates new cloud services that help to enhance the resiliency of the cloud ecosystem. Serves as a liaison for vendor engagement. Participates in incident management, problem management and incident postmortems. Takes part in peer code reviews, providing qualitative feedback. Builds processes and capabilities to adapt and respond to risks and disruptions, while maintaining business operations and data recovery with minimal disruptions. Coaches peer SREs and application teams on SRE and DevOps. Implements Agile methodologies in the team’s project completion using incremental and iterative steps.

Education and Experience: Bachelor’s degree in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.

Or, alternatively, Master’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.
Skills and Knowledge: Candidate must also possess:

Demonstrated Expertise (“DE”) improving application resiliency by implementing chaos engineering to build the system's capability to withstand turbulent conditions in production, using Chaos Mesh, Chaosd, Azure Chaos Studio, AWS FIS, or Gremlin; and driving automation to implement scalable approaches for the planning, design, execution, and reporting of chaos testing using Jenkins pipelines, standard frameworks, data visualization, and dashboards.

DE implementing advanced observability practices and techniques in production and pre-production environments, at scale, using Datadog, Splunk, or Catchpoint; tracking the error budget, proactively identifying issues, and minimizing Mean Time to Repair (MTTR); and balancing customer expectations by implementing Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs) using logs, traces, monitors and synthetic tests.

DE migrating and maintaining cloud applications and creating cloud solutions using Amazon Web Services (AWS) or Azure cloud services; implementing infrastructure as code for cloud; onboarding new AWS or Azure services with required reviews and security controls in non-production and production environments; and researching the evolving cloud ecosystem to adopt machine learning based tools (AWS DevOps Guru) to boost AIOps abilities.

DE implementing CI/CD pipelines in both production and non-production environments using Application Lifecycle Management (ALM) tools (JIRA, GitHub, Jenkins, SonarQube, Artifactory, or uDeploy) to enable faster code delivery, enhanced software quality, reliability, and security; and developing products, and core and common capabilities for the organization to reduce toil and drive standardization, using containerization and orchestration technologies (Docker or Kubernetes), Infrastructure as Code (IaC) tools, scripting languages (Python or Groovy), and engineering best practices.
#PE1M2 #LI-DNI

Certifications:

Category: Information Technology

Most roles at Fidelity are Hybrid, requiring associates to work onsite every other week (all business days, M-F) in a Fidelity office. This does not apply to Remote or fully Onsite roles. Some roles may have unique onsite requirements. Please consult with your recruiter for the specific expectations for this position.

Please be advised that Fidelity’s business is governed by the provisions of the Securities Exchange Act of 1934, the Investment Advisers Act of 1940, the Investment Company Act of 1940, ERISA, numerous state laws governing securities, investment and retirement-related financial activities and the rules and regulations of numerous self-regulatory organizations, including FINRA, among others. Those laws and regulations may restrict Fidelity from hiring and/or associating with individuals with certain Criminal Histories.
