Machine Learning Engineer

  2025-04-08     Impax Recruitment     Sonoma, CA
Description:

ML Infrastructure Engineer


We are building AI physics models that don't just predict, but understand cause and effect within the climate system.


What You'll Do


  • Architect and operate distributed training clusters (e.g., 12+ nodes, 8 GPUs per node) using GKE, Kubernetes, and cloud-native infrastructure
  • Design scalable, efficient data pipelines for petabyte-scale datasets
  • Implement and optimize model/data/pipeline parallelism across foundation models
  • Deploy, monitor, and debug large-scale multi-node GPU training jobs using DDP, FSDP, and DeepSpeed
  • Tune low-level system components (e.g., CUDA, NCCL, network interfaces) for maximum throughput
  • Build cluster observability tools: failure detection, logging, monitoring, and autoscaling
  • Collaborate with research and modeling teams to productionize experiments at scale


You'd Be Great If You


  • Have deep hands-on experience with distributed training frameworks (FSDP, DeepSpeed, DDP)
  • Know how to set up and debug Kubernetes/GKE GPU clusters, from CUDA to networking
  • Are fluent in PyTorch and familiar with its performance quirks (e.g., dataset loading, sampler design)
  • Have worked on ML infrastructure at scale (multi-node, multi-GPU setups, 100B+ parameter models)
  • Understand sampling techniques, data sharding, and performance tuning across the ML stack
  • Can spot a NCCL timeout from a mile away and know how to fix it
  • Value rapid iteration, ownership, and scaling up ambitious systems with a lean team

