[Remote] Senior Cluster Site Reliability Engineer

Remote, USA | Full-time | Posted 2026-06-16

Note: This is a remote position open to candidates in the USA. Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. As a Senior Cluster Site Reliability Engineer, you will help scale the research compute cluster, ensuring high uptime and reliability while supporting both on-prem and cloud infrastructure.

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise.
Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Skills

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/Google Cloud Platform Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
Experience with cloud infrastructure (AWS or Google Cloud Platform)
Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
Experience with distributed storage technologies (Lustre, Ceph, S3)
Embodies a 'system engineer' rather than 'system administrator' mindset, thinking systematically and leveraging automation
Bachelor's degree in computer science
Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
Familiarity with hybrid/on-prem environments
Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
Experience with HPC networking (InfiniBand, RDMA)
Solid security/IAM foundations (Identity management systems, AWS/Google Cloud Platform IAM, Zero Trust)

Benefits

If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group. Please use this form to submit your referral.

Company Overview

Dice is a job-searching platform for technology professionals. It is a sub-organization of DHI Group. It was founded in 1990, and is headquartered in Santa Clara, California, USA, with a workforce of 201-500 employees. Its website is http://www.dice.com.

Company H1B Sponsorship

Dice has a track record of offering H1B sponsorships, with 2 in 2022, 4 in 2021, and 5 in 2020. Please note that this does not guarantee sponsorship for this specific role.
