Architect next-generation A.I. cluster components, including CPUs, GPUs, and ASICs, for machine learning workloads. Create solutions supporting cluster data movement, including cluster networking, block storage, and memory. Work as a liaison between cluster hardware architects and Autopilot application developers to achieve high overall cluster utilization and scalability.
You will drive the automation of the existing CPU/GPU computing cluster and develop a new, modern clustered ML computing infrastructure. Work across teams to document workflows and translate them into job pipelines, including integrated testing and deployment. Work closely with multiple teams to understand how the output of each team’s pipelines could be useful to other teams, then help create a fully automated process enabling rapid fleet learning. Should have experience with Python, Perl, or other scripting languages; virtualization and deployment of build/test environments; and job automation (à la Jenkins) and resource scheduling.
Responsibilities:
- Identify and propose new effort-saving automations for the distributed CPU, GPU, and ASIC computing clusters
- Create multi-tier storage strategies supporting extremely high throughput with low latency. Ensure the solution scales to multi-petabyte capacity while maintaining performance
- Improve root cause analysis and corrective action for problems large and small – identify patterns and design task automations
- Recommend and implement solutions to improve computing cluster performance
- Lead optimization efforts using Python
- Develop environment isolation. Facilitate reproducibility and precise, complete versioning of the environment used to produce results. Separate the deployment of job environments and toolchains from the underlying server hardware and host operating systems on heterogeneous clusters
- Support development of a virtual simulation environment on the cluster
- Develop CI/CD pipeline management and job scheduling. Leverage already deployed technologies such as Jenkins and Slurm, with an emphasis on scaling the build environment.
- Design and implement one-off cluster solutions as required to support the wider goals of Autopilot. Ensure all efforts are coordinated and documented.
Requirements:
- MS in Computer Science, Electrical Engineering, or a related field, or a Bachelor’s degree with 3 years of additional equivalent experience
- 3+ years’ experience with Linux compute clusters or farms
- Experience with system administration of Ubuntu servers
- Strong Python development skills, particularly with systems APIs
- Knowledge of Linux infrastructure monitoring and alerting
- Working knowledge of Puppet configuration management
- Evidence of exceptional ability
Nice to have:
- Experience with job schedulers and orchestrators such as LSF, Slurm, Kubernetes, or Docker Swarm
- Experience with low-level hardware schedulers
- Experience solving Linux system-level problems
- Experience with Unreal Engine
- Understanding of cost-optimized methods of data movement
- Previous experience at a large-scale data center (1,000+ nodes)
- Experience with containers (Docker, Singularity)
- Term: Entry +
- Company: Tesla