What to Expect
Tesla’s continued success depends on engineers being able to develop, debug, and deploy software quickly.
Our build, tools, and developer experience infrastructure directly impacts more than 1,000 vehicle, energy, and Autopilot software engineers.
You'll be joining Engineering Productivity, Build and Internal Infrastructure, a small team at the center of the firmware organization.
This unique position exposes us to a wide array of interesting technical challenges and enables us to be the defenders of best practices such as code hygiene, reuse, and maintainability.
You will ensure high availability of tools, services, and computational clusters by implementing SRE best practices and methodologies.
We are a distributed team and hire in multiple locations:
Palo Alto, CA
Bellevue, WA
Austin, TX
What You’ll Do
Ensure the availability of new and existing developer tools
Drive the migration of large-scale, distributed diagnostics applications towards cloud-native microservices
Perform capacity planning and analysis, and manage infrastructure changes (including tuning, reshaping, resizing, and migrating infrastructure), for services and their immediate downstream dependencies
Work with SWE counterparts to identify and mitigate production issues; validate, document, and exercise failover and disaster recovery plans, graceful degradation mechanisms, policies, and standard methodologies
Actively participate in and contribute to code reviews and technical design documents, with an eye toward identifying performance and reliability bottlenecks
Provide SRE expertise and implement standard methodologies in the areas of CI/CD and dashboard integrity, and identify and evaluate the right set of alerts, SLOs, and error budgets for services on an ongoing basis
Participate in the team's on-call support rotation
What You’ll Bring
BS in Computer Science or equivalent, or proof of exceptional skills in related fields with practical software engineering experience
Expert knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack
Experience with troubleshooting and full-cycle incident response (mitigation, correction, prevention)
2+ years of experience managing services in a large-scale distributed systems environment, preferably bare metal
2+ years of experience with containerization software such as Kubernetes and Docker
Expert knowledge of systems programming (bash and shell tools) and practical, validated knowledge of at least one higher-level language (e.g., Python or Go)
Expert knowledge of CI/CD platforms such as Jenkins, TeamCity, and GitHub Actions