Zyte is seeking an experienced Team Lead to manage our Core & MLOps Squad, responsible for building the bedrock infrastructure that powers Zyte at scale. This hands-on technical leadership role requires expertise across MLOps, systems programming, and orchestration to lead a cross-functional team in designing and maintaining the scalable foundation that enables all Zyte teams to build and run their services with confidence.
Requirements
- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute).
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring.
- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts (health/metrics/tracing/SLOs), high-performance clients, circuit breakers, and other production-ready defaults.
- Operate a secure, multi-tenant model registry and training platform with standardized experiment/evaluation harnesses.
- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks.
- Integrate public/open-source AI capabilities as managed platform services with cost and data-governance guardrails.
- Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards.
- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans.
- Mentor the team and foster a platform-thinking mindset.
- Ownership Areas: Container orchestration (Kubernetes/Knative), GPU provisioning & autoscaling, environment & secret management.
- Operators, sidecars, and internal SDKs/libraries (Go/Rust/Python/Java) that enforce the golden path contract.
- Model platform: registry, experiment tracking, training orchestration, evaluation framework, serving infra, model monitoring.
- Observability: logging/metrics/tracing pipelines.
- Billing pipeline: metering/events/cost tracking abstractions.
- Golden Path: Java, Python, ML templates + CI/CD blueprints + docs + scaffold CLI.
- Reliability enablement (SRE practices), cost governance, supply-chain security (SBOM, image signing).
- 5+ years of experience building distributed systems; 3+ years in MLOps/ML platform engineering (or equivalent impact).
- Knowledge of Linux/OS internals (process model, cgroups/namespaces), networking (TCP/IP, HTTP/2), concurrency, and performance profiling.
- Deep understanding of Kubernetes (bonus: Mesos).
- Proficiency developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks); strong Python skills.
- Experience with GPU infrastructure (scheduling, containerization, optimization).
- Track record of designing and operating model platforms (registry, training, serving, monitoring) in production.
- Demonstrated success leading technical teams and implementing organization-wide platform solutions.
Benefits
- Join a company that loves fostering and nourishing new ideas and bringing them to market.
- Become part of a self-motivated, progressive, multi-cultural team.
- Have the freedom and flexibility to work from where you do your best work, as we are a completely remote company.
- Get the chance to work with cutting-edge open-source technologies and tools.