What You'll Work On
Core Responsibilities
- Design and implement high-performance REST APIs that handle real-time LLM request proxying and logging
- Build asynchronous processing pipelines using distributed task queues for background job orchestration
- Optimize PostgreSQL queries and schema design for transactional data at scale
- Implement Redis caching strategies (invalidation, TTLs) that serve hot-path reads at sub-millisecond latency
- Design data ingestion pipelines that write millions of events daily to analytical databases
- Work with WebSocket connections for real-time streaming responses
- Build evaluation and experimentation systems for AI model testing
Required Skills
Must-Have Technical Skills
1. REST API Development (Critical)
- Deep understanding of RESTful principles and API design patterns
- Experience building production APIs with:
  - Authentication/authorization (JWT, API keys, OAuth)
  - Request validation and serialization
  - Rate limiting and throttling
  - Pagination and filtering
  - API versioning strategies
- Comfortable with OpenAPI/Swagger specifications
- Experience with middleware patterns and request/response interceptors
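For illustration, a minimal sketch of several of these patterns (header-based API keys, validated pagination inputs, a versioned path) using FastAPI; the endpoint, key store, and limits are hypothetical, and the same ideas carry over to Django REST Framework:

```python
from fastapi import Depends, FastAPI, Header, HTTPException, Query

app = FastAPI()
VALID_KEYS = {"demo-key"}  # stand-in for a real key store

def require_api_key(x_api_key: str = Header(...)) -> str:
    # FastAPI maps the x_api_key parameter to the X-Api-Key request header.
    if x_api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    return x_api_key

@app.get("/v1/requests")  # versioned path
def list_requests(
    page: int = Query(1, ge=1),                # validated pagination inputs
    page_size: int = Query(50, ge=1, le=200),
    _key: str = Depends(require_api_key),      # auth enforced as a dependency
):
    offset = (page - 1) * page_size
    return {"page": page, "page_size": page_size, "offset": offset, "items": []}
```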
2. PostgreSQL Expertise (Critical)
- Advanced SQL query optimization and indexing strategies
- Understanding of ACID transactions and isolation levels
- Experience with:
  - Complex joins and aggregations
  - Database migrations and schema evolution
  - Connection pooling (PgBouncer, connection pool managers)
  - Query plan analysis and performance tuning
- Knowledge of PostgreSQL-specific features (CTEs, window functions, JSONB)
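As a sketch of the features named above, the query below computes daily request counts with a CTE and a 7-day rolling average with a window function, then inspects the plan with EXPLAIN ANALYZE. The llm_requests table, its columns, and the connection string are hypothetical; psycopg2 is used here, though any driver works:

```python
import psycopg2

ROLLING_COUNTS = """
WITH daily AS (                     -- CTE: one row per day
    SELECT date_trunc('day', created_at) AS day,
           count(*)                      AS n
    FROM   llm_requests
    WHERE  created_at >= now() - interval '30 days'
    GROUP  BY 1
)
SELECT day, n,
       avg(n) OVER (ORDER BY day    -- window function: 7-day rolling mean
                    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d
FROM   daily
ORDER  BY day
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    # EXPLAIN ANALYZE runs the query and returns the actual plan; check that
    # an index on created_at is used rather than a sequential scan.
    cur.execute("EXPLAIN ANALYZE " + ROLLING_COUNTS)
    for (line,) in cur.fetchall():
        print(line)
```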
3. Redis Proficiency (Critical)
- Production experience using Redis for:
  - Application caching (cache invalidation strategies)
  - Session storage
  - Rate limiting
  - Message queuing/pub-sub
- Understanding of Redis data structures (strings, hashes, sets, sorted sets)
- Experience with Redis persistence (AOF, RDB)
- Knowledge of Redis clustering and high availability
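One concrete example of the uses above is a fixed-window rate limiter built from INCR and EXPIRE; the key prefix and limits below are illustrative:

```python
import redis

r = redis.Redis()  # assumes Redis on localhost:6379

def allow_request(api_key: str, limit: int = 100, window_s: int = 60) -> bool:
    """Fixed-window limiter: at most `limit` calls per `window_s` seconds."""
    key = f"ratelimit:{api_key}"
    count = r.incr(key)          # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window_s)  # start the window on the first request
    return count <= limit
```

A sorted-set sliding window gives smoother limits at the cost of more memory and round trips.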
4. Python Proficiency (Required)
- Strong Python 3.11+ experience
- Async/await patterns and asyncio
- Type hints and Pydantic for data validation
- Understanding of Python concurrency (threading, multiprocessing, gevent)
- Experience with Python package management (pip, poetry)
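A small sketch combining these patterns, Pydantic validation (v2 syntax) with an asyncio.gather fan-out; the event shape and the persist coroutine are hypothetical stand-ins:

```python
import asyncio
from pydantic import BaseModel, Field

class LLMEvent(BaseModel):
    request_id: str
    model: str
    latency_ms: float = Field(ge=0)  # reject negative latencies

async def persist(event: LLMEvent) -> str:
    await asyncio.sleep(0)           # stand-in for an async DB write
    return event.request_id

async def main() -> None:
    raw = [
        {"request_id": "r1", "model": "gpt-4o", "latency_ms": 812.5},
        {"request_id": "r2", "model": "claude-3-5-sonnet", "latency_ms": 95.0},
    ]
    events = [LLMEvent.model_validate(d) for d in raw]  # raises on bad input
    done = await asyncio.gather(*(persist(e) for e in events))
    print(done)

asyncio.run(main())
```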
5. General Backend Engineering
- Strong understanding of the HTTP protocol, status codes, and headers
- Experience with authentication patterns (JWT, session-based, API keys)
- Knowledge of CORS, CSRF protection, and web security best practices
- Understanding of serialization formats (JSON, Protocol Buffers)
- Experience with logging, monitoring, and observability
- Proficient with Git version control
- Strong debugging and troubleshooting skills
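As one example of the authentication patterns listed, a stateless JWT issue/verify pair using PyJWT; the secret and claim set are illustrative (real keys belong in configuration, not code):

```python
import time
import jwt

SECRET = "change-me"  # illustrative only; load from config in production

def issue_token(user_id: str, ttl_s: int = 3600) -> str:
    claims = {"sub": user_id, "exp": int(time.time()) + ttl_s}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_token(token: str) -> str:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on failure;
    # the exp claim is checked automatically.
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    return claims["sub"]

token = issue_token("user-42")
assert verify_token(token) == "user-42"
```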
Highly Desired Skills (Major Plus)
Celery & Distributed Task Processing
- Production experience with Celery or similar task queue systems (RQ, Dramatiq, Bull)
- Understanding of:
  - Task routing and queue prioritization
  - Worker concurrency models (gevent, eventlet, prefork)
  - Task retries, timeouts, and error handling
  - Batching strategies for performance optimization
  - Task monitoring and debugging
- Experience with message brokers (Redis, RabbitMQ, SQS)
- Knowledge of distributed systems challenges (eventual consistency, idempotency)
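A sketch of several of these concepts in one Celery task definition: bounded retries with exponential backoff and jitter, late acknowledgement so a crashed worker's task is requeued, and call-time queue routing. The broker URL, task body, and queue name are hypothetical:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task(
    bind=True,
    max_retries=5,
    autoretry_for=(ConnectionError,),  # retry only on transient failures
    retry_backoff=True,                # 1s, 2s, 4s, ... between attempts
    retry_jitter=True,                 # randomize delays to avoid thundering herds
    acks_late=True,                    # requeue if the worker dies mid-task
)
def ingest_event(self, payload: dict) -> None:
    # Hypothetical downstream write that may raise ConnectionError;
    # Celery retries it automatically per the options above.
    write_to_warehouse(payload)

def write_to_warehouse(payload: dict) -> None:
    ...  # stand-in for real I/O

# Route to a dedicated queue at call time:
# ingest_event.apply_async(args=[{"event": "llm_request"}], queue="ingest")
```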
ClickHouse or Analytical Databases
- Experience with OLAP databases (ClickHouse, Druid, Snowflake, BigQuery)
- Understanding of columnar storage and query optimization
- Experience designing schemas for analytical workloads
- Knowledge of data partitioning and retention strategies
- Experience with time-series data and aggregations
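For illustration, hypothetical ClickHouse DDL for an events table touching the concepts above: columnar MergeTree storage, monthly partitioning, a sort key matched to common filters, and TTL-based retention (issued here via clickhouse-driver; table and column names are made up):

```python
from clickhouse_driver import Client

DDL = """
CREATE TABLE IF NOT EXISTS llm_events
(
    ts         DateTime,
    api_key    LowCardinality(String),
    model      LowCardinality(String),
    latency_ms Float32,
    tokens     UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)   -- monthly partitions, cheap to drop
ORDER BY (api_key, ts)      -- sort key matches common filter columns
TTL ts + INTERVAL 90 DAY    -- retention: rows expire after 90 days
"""

client = Client(host="localhost")
client.execute(DDL)
```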
Nice to Have
- Experience with LLM APIs (OpenAI, Anthropic, Google Gemini)
- Familiarity with WebSocket protocols and real-time systems
- Docker and containerization experience
- Kubernetes knowledge
- Experience with cloud platforms (AWS, Azure, GCP)
- OpenTelemetry and distributed tracing
- Experience with Stripe or payment processing
- Background in evaluation/testing frameworks
- S3 or object storage experience
- CI/CD pipeline experience
Tech Stack Overview
Current Stack (experience with these exact tools is not required, but helpful):
- Primary Framework: Python with Django REST Framework
- Databases: PostgreSQL (primary), ClickHouse (analytics)
- Caching/Queuing: Redis (two instances, one for the message queue and one for the cache)
- Task Queue: Celery with gevent workers
- Web Server: Gunicorn with gevent, Daphne for WebSockets
- Authentication: JWT, API keys, OAuth via social-auth
- Integration: LiteLLM for multi-provider LLM routing
- Monitoring: OpenTelemetry, custom tracing
- Storage: AWS S3
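To give a flavor of the LiteLLM integration, the snippet below sends the same prompt through two providers behind one OpenAI-style interface; the model names are examples, and provider API keys are assumed to be set in the environment:

```python
from litellm import completion

# Same call shape regardless of provider; LiteLLM handles the routing.
for model in ("gpt-4o-mini", "anthropic/claude-3-haiku-20240307"):
    resp = completion(model=model, messages=[{"role": "user", "content": "ping"}])
    print(model, "->", resp.choices[0].message.content)
```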