What to Expect
Tesla's Supercomputing SRE team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo.
With the rapidly growing need for more data and optimized compute resources, our cluster builds are getting larger and increasingly complex.
Continued development/automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups.
As a Site Reliability Engineer on our Supercomputing SRE team, you will be responsible for maintaining and improving our infrastructure to ensure engineering teams across Autopilot/AI and Dojo have the necessary tools and resources to be productive.
This includes managing our HPC clusters, monitoring compute/GPU/network metrics, writing scripts for configuration management, and collaborating with our Data Center team to keep hundreds of servers running smoothly and to bring up new capacity on our GPU clusters.
What You’ll Do
Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
Improve our cluster health monitoring and auto-recovery pipeline
Work with users on debugging application performance issues
Work with hardware and storage vendors to tune and optimize our servers, storage and network
Write Ansible playbooks for configuration management
Provision and performance-tune Linux operating systems
Manage HPC clusters, workloads and applications
Automate and build systems engineering tooling in Python, Golang, or Bash/Shell (a brief illustrative sketch of this kind of automation follows this list)
Participate in a 24x7 on-call rotation
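To give a concrete flavor of the automation described above, here is a minimal, hedged Python sketch (not Tesla's actual tooling): it checks GPU health on a compute node via nvidia-smi and, if a check fails, drains the node in Slurm with scontrol so the scheduler stops placing new jobs on it. The thresholds, GPU count, and drain reasons are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Illustrative node health check: drain a Slurm node if its GPUs look unhealthy.

A hedged sketch, not production tooling. Assumes `nvidia-smi` and `scontrol`
are on PATH and that the script runs on the compute node it is checking.
"""
import socket
import subprocess

# Assumed thresholds for illustration only.
MAX_GPU_TEMP_C = 85
MIN_EXPECTED_GPUS = 8


def gpu_temperatures() -> list[int]:
    """Return per-GPU temperatures as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [int(token) for token in out.stdout.split()]


def drain_node(node: str, reason: str) -> None:
    """Mark the node DRAIN in Slurm so the scheduler stops placing new jobs on it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )


def main() -> None:
    node = socket.gethostname()
    try:
        temps = gpu_temperatures()
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        drain_node(node, f"nvidia-smi failed: {exc}")
        return

    if len(temps) < MIN_EXPECTED_GPUS:
        drain_node(node, f"only {len(temps)} GPUs visible")
    elif max(temps) > MAX_GPU_TEMP_C:
        drain_node(node, f"GPU temperature above {MAX_GPU_TEMP_C}C")


if __name__ == "__main__":
    main()
```

In practice, a check like this would typically feed the monitoring and alerting stack (e.g., Prometheus or Telegraf exporters and alert rules) rather than run as a one-off script.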
What You’ll Bring
Proficiency in a high-level programming or scripting language (Python, Golang, Bash)
Strong understanding of Linux fundamentals and performance optimization (Ubuntu/RHEL)
Advanced experience with configuration management systems such as Ansible
Demonstrable knowledge of TCP/IP, IPoIB, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
Experience collaborating with network and data center teams on large-scale cluster builds
5+ years' experience with configuration management software (Ansible, etc.), systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.), and/or administering HPC workload managers (SLURM, LSF, etc.)
3+ years’ experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems
Experience with Slurm and with storage management of distributed parallel file systems is a plus
Bachelor’s degree in computer science, electrical engineering, or a related field, or 3+ years of additional equivalent experience or evidence of exceptional ability related to the position