Site Reliability Engineer, Autopilot AI & Dojo Infrastructure

Posted : Thursday, April 18, 2024 04:12 AM

What to Expect

Tesla's Supercomputing SRE team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups.

As a Site Reliability Engineer on our Supercomputing SRE team, you will be responsible for maintaining and improving our infrastructure to ensure engineering teams across Autopilot/AI and Dojo have the necessary tools and resources to be productive. This includes managing our HPC clusters, monitoring compute/GPU/network metrics, writing scripts for configuration management, and collaborating with our Data Center team to coordinate the smooth operation of hundreds of servers and bring up new capacity on our GPU clusters.
What You’ll Do

• Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
• Improve our cluster health monitoring and auto-recovery pipeline
• Work with users on debugging application performance issues
• Work with hardware and storage vendors to tune and optimize our servers, storage and network
• Write Ansible playbooks for configuration management
• Performance tuning and OS provisioning on Linux systems
• Manage HPC clusters, workloads and applications
• Automation and systems engineering in Python, Golang or Bash/Shell
• Participate in 24x7 on-call rotation

What You’ll Bring

• Proficiency in a high-level programming language and/or scripting (Python, Golang, Bash)
• Strong understanding of Linux fundamentals and performance optimizations (Ubuntu/RHEL OS)
• Advanced experience with configuration management systems such as Ansible
• Demonstrable knowledge of TCP/IP, IPoIB, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
• Experience in collaborating with network and data center teams for large scale cluster builds
• 5+ years’ experience with configuration management software (Ansible, etc.), systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.) and/or administering HPC workload managers (SLURM, LSF, etc.)
• 3+ years’ experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems
• Experience with Slurm and storage management of distributed parallel file systems a plus
• Bachelor’s degree in computer science, electrical engineering or related field, or 3+ years of additional equivalent experience or evidence of exceptional ability related to the position

• Phone : NA

• Location : Palo Alto, CA

• Post ID: 9003705596

