High-Performance Computing (HPC) Engineer – Stockholm, Sweden
Overview
Our high-growth technology client is seeking an experienced High-Performance Computing (HPC) Engineer to help design, build, and operate large-scale compute platforms supporting demanding workloads (e.g., AI/ML, simulation, rendering, analytics, and research). You will work closely with infrastructure, platform, and research/engineering teams to deliver reliable, high-throughput systems with strong performance, automation, and observability.
Key Responsibilities
- Design, deploy, and support HPC clusters (on-prem / colocation / cloud-connected) with a focus on performance, resilience, and scalability.
- Administer and optimise Linux-based compute environments (provisioning, patching, kernel/driver tuning, user access, hardening).
- Implement and maintain workload scheduling and cluster management (e.g., Slurm or equivalent), including partitions/queues, fair-share policies, and job efficiency improvements.
- Support GPU-accelerated environments (where applicable): driver/toolkit management, performance profiling, stability troubleshooting.
- Build and maintain automation for cluster lifecycle operations (IaC, config management, CI/CD-style ops).
- Partner with networking and storage teams to ensure high-throughput, low-latency performance across the stack.
- Own incident response and problem management for HPC services; lead root-cause analysis and preventative improvements.
- Develop monitoring, logging, and capacity planning to meet throughput and availability targets.
- Produce clear documentation (runbooks, architecture diagrams, operational standards) and contribute to continuous improvement.
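To illustrate the scheduling work described above, the following is a minimal, hypothetical `slurm.conf` excerpt showing the kind of partition and fair-share configuration an HPC Engineer in this role might maintain (node names and weights are placeholders, not taken from the client's environment):

```
# Hypothetical slurm.conf excerpt -- illustrative only
# Fair-share scheduling via the multifactor priority plugin
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityDecayHalfLife=7-0

# Partitions (queues): a default CPU partition and a GPU partition
PartitionName=cpu Nodes=cn[001-064] Default=YES MaxTime=2-00:00:00 State=UP
PartitionName=gpu Nodes=gn[001-008] MaxTime=1-00:00:00 State=UP
```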
Required Skills & Experience
- Strong hands-on experience as an HPC Engineer / Linux Systems Engineer / Infrastructure Engineer in performance-critical environments.
- Deep Linux administration skills (systemd, networking basics, storage, performance tuning, troubleshooting).
- Experience operating HPC or large-scale compute platforms, including one or more of:
  - Schedulers / cluster managers (Slurm preferred; PBS, LSF, Kubernetes for batch, etc.)
  - GPU compute (NVIDIA drivers/CUDA, NCCL awareness, profiling tools)
  - MPI and distributed compute concepts (OpenMPI/MPICH understanding)
- Solid scripting/automation skills (Bash, Python; plus Ansible/Terraform or similar).
- Practical understanding of observability (metrics, logs, tracing), and using monitoring stacks to drive reliability.
- Good knowledge of storage and data movement patterns used in HPC (parallel file systems and/or high-performance shared storage concepts).
- Strong communication skills: able to work across platform, network, storage, and application teams.
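As a concrete example of the scripting and job-efficiency work listed above, here is a short Python sketch that estimates per-job CPU efficiency from Slurm `sacct`-style accounting fields (`TotalCPU`, `AllocCPUS`, `Elapsed`). The parsing is simplified for illustration; a real tool would read actual `sacct` output:

```python
# Hypothetical sketch: per-job CPU efficiency from sacct-style fields.
# Efficiency = CPU time actually used / CPU time allocated.

def hms_to_seconds(hms: str) -> int:
    """Convert a [DD-]HH:MM:SS duration string to seconds."""
    days = 0
    if "-" in hms:
        d, hms = hms.split("-", 1)
        days = int(d)
    h, m, s = (int(x) for x in hms.split(":"))
    return days * 86400 + h * 3600 + m * 60 + s

def cpu_efficiency(total_cpu: str, alloc_cpus: int, elapsed: str) -> float:
    """Ratio of CPU time used (TotalCPU) to CPU time allocated
    (AllocCPUS * Elapsed); 0.0 if nothing was allocated."""
    used = hms_to_seconds(total_cpu)
    allocated = alloc_cpus * hms_to_seconds(elapsed)
    return used / allocated if allocated else 0.0

# Example: a job that used 6h of CPU time on 8 cores over 1h elapsed
eff = cpu_efficiency("06:00:00", 8, "01:00:00")
print(f"{eff:.0%}")  # 75%
```

Jobs with persistently low efficiency are candidates for right-sizing their core requests, which is one of the job-efficiency improvements this role would drive.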
Desirable / Nice-to-Have
- Experience with high-speed interconnects (e.g., InfiniBand, RoCE) and low-latency network troubleshooting.
- Experience with containerised HPC or hybrid HPC workloads (Apptainer/Singularity, Docker where appropriate).
- Familiarity with security best practices in shared compute environments (least privilege, auditing, secrets handling).
- Background supporting AI/ML infrastructure at scale (GPU fleet operations, job efficiency, capacity optimisation).
Location & Working Model
- Stockholm, Sweden (based locally).
- Working model: Hybrid/On-site depending on operational needs.
What Success Looks Like
- Stable, high-performance clusters with measurable improvements in throughput, utilisation, and job success rates.
- Strong automation and repeatability across provisioning, configuration, and operations.
- Clear operational practices (monitoring, alerting, runbooks) that reduce MTTR and improve reliability.
Next Steps
- Please send your most recent CV, tailored to this job description, along with your contact information.
