Software Backend Engineering
Use this skill to design, implement, and review production-grade backend services: API boundaries, data layer, auth, caching, observability, error handling, testing, and deployment.
Defaults to bias toward: type-safe boundaries (validation at the edge), OpenTelemetry for observability, zero-trust assumptions, idempotency for retries, RFC 9457 errors, Postgres + pooling, structured logs, timeouts, and rate limiting.
Quick Reference
| Task | Default Picks | Notes |
|---|---|---|
| REST API | Fastify / Express / NestJS | Prefer typed boundaries + explicit timeouts |
| Edge API | Hono / platform-native handlers | Keep work stateless, CPU-light |
| Type-Safe API | tRPC | Prefer for TS monorepos and internal APIs |
| GraphQL API | Apollo Server / Pothos | Prefer for complex client-driven queries |
| Database | PostgreSQL | Use pooling + migrations + query budgets |
| ORM / Query Layer | Prisma / Drizzle / SQLAlchemy / GORM / SeaORM | Prefer explicit transactions |
| Authentication | OIDC/OAuth + sessions/JWT | Prefer httpOnly cookies for browsers |
| Validation | Zod / Pydantic / validator libs | Validate at the boundary, not deep inside |
| Caching | Redis (or managed) | Use TTLs + invalidation strategy |
| Background Jobs | BullMQ / platform queues | Make jobs idempotent + retry-safe |
| Testing | Unit + integration + contract/E2E | Keep most tests below the UI layer |
| Observability | Structured logs + OpenTelemetry | Correlation IDs end-to-end |
Scope
Use this skill to:
- Design and implement REST/GraphQL/tRPC APIs
- Model data schemas and run safe migrations
- Implement authentication/authorization (OIDC/OAuth, sessions/JWT)
- Add validation, error handling, rate limiting, caching, and background jobs
- Ship production readiness (timeouts, observability, deploy/runbooks)
When NOT to Use This Skill
Use a different skill when:
- Frontend-only concerns -> See software-frontend
- Infrastructure provisioning (Terraform, K8s manifests) -> See ops-devops-platform
- API design patterns only (no implementation) -> See dev-api-design
- SQL query optimization and indexing -> See data-sql-optimization
- Security audits and threat modeling -> See software-security-appsec
- System architecture (beyond single service) -> See software-architecture-design
Decision Tree: Backend Technology Selection
Backend project needs: [API Type] - REST API? - Simple CRUD -> Express/Fastify + Prisma/Drizzle - Enterprise features -> NestJS (DI, modules) - High performance -> Fastify (tight request lifecycle) - Edge/Serverless -> Hono (Cloudflare Workers, Vercel Edge) - Type-Safe API? - Full-stack TypeScript monorepo -> tRPC (no schema, no codegen) - Public API with docs -> REST + OpenAPI - Flexible data fetching -> GraphQL + Pothos/Apollo - GraphQL API? - Code-first -> Pothos GraphQL (TypeScript) - Schema-first -> Apollo Server + GraphQL Codegen - Runtime Selection? - Enterprise stable -> Node.js (current LTS) - Performance-critical -> Bun (verify runtime constraints) - Security-focused -> Deno (verify platform support) - Authentication Strategy? - Browser sessions -> httpOnly cookies + server-side session store - OAuth/Social -> OIDC/OAuth library (or platform auth) - Service-to-service -> short-lived JWT + mTLS where possible - Database Layer? - Type-safe ORM -> Prisma (migrations, Studio) - SQL-first/perf -> Drizzle (SQL-like API) - Raw SQL -> driver + query builder (Kysely/sqlc/SQLx) - Edge-compatible -> driver/ORM + Neon/Turso/D1 - Caching Strategy? - Distributed cache -> Redis (multi-server) - Serverless cache -> managed Redis (e.g., Upstash) - In-memory cache -> process memory (single instance only) - Edge Deployment? - Global low-latency -> Cloudflare Workers - Next.js integration -> Vercel Edge Functions - AWS ecosystem -> Lambda@Edge - Background Jobs? - Complex workflows -> BullMQ (Redis-backed, retries) - Serverless workflows -> AWS Step Functions - Simple scheduling -> cron + durable storage
Runtime & Language Alternatives:
- Node.js (current LTS) (Express/Fastify/NestJS + Prisma/Drizzle): default for broad ecosystem + mature tooling
- Bun (Hono/Elysia + Drizzle): consider for perf-sensitive workloads (verify runtime constraints)
- Python (FastAPI + SQLAlchemy): strong for data-heavy services and ML integration
- Go (Fiber/Gin + GORM/sqlc): strong for concurrency and simple deploys
- Rust (Axum + SeaORM/SQLx): strong for safety/performance-critical services
See assets/ for language-specific starter templates and references/edge-deployment-guide.md for edge computing patterns.
API Design Patterns (Dec 2025)
Idempotency Patterns
All mutating operations MUST support idempotency for retry safety.
Implementation:
// Idempotency key header const idempotencyKey = request.headers['idempotency-key']; const cached = await redis.get(`idem:${idempotencyKey}`); if (cached) return JSON.parse(cached); const result = await processOperation(); await redis.set(`idem:${idempotencyKey}`, JSON.stringify(result), 'EX', 86400); return result;
| Do | Avoid |
|---|---|
| Store idempotency keys with TTL (24h typical) | Processing duplicate requests |
| Return cached response for duplicate keys | Different responses for same key |
| Use client-generated UUIDs | Server-generated keys |
Pagination Patterns
| Pattern | Use When | Example |
|---|---|---|
| Cursor-based | Large datasets, real-time data | ?cursor=abc123&limit=20 |
| Offset-based | Small datasets, random access | ?page=3&per_page=20 |
| Keyset | Sorted data, high performance | ?after_id=1000&limit=20 |
Prefer cursor-based pagination for APIs with frequent inserts.
Error Response Standard (Problem Details)
Use a consistent machine-readable error format (RFC 9457 Problem Details): https://www.rfc-editor.org/rfc/rfc9457
{ "type": "https://example.com/problems/invalid-request", "title": "Invalid request", "status": 400, "detail": "email is required", "instance": "/v1/users" }
Health Check Patterns
// Liveness: Is the process running? app.get('/health/live', (req, res) => { res.status(200).json({ status: 'ok' }); }); // Readiness: Can the service handle traffic? app.get('/health/ready', async (req, res) => { const dbOk = await checkDatabase(); const cacheOk = await checkRedis(); if (dbOk && cacheOk) { res.status(200).json({ status: 'ready', db: 'ok', cache: 'ok' }); } else { res.status(503).json({ status: 'not ready', db: dbOk, cache: cacheOk }); } });
Migration Rollback Strategies
| Strategy | Description | Use When |
|---|---|---|
| Backward-compatible | New code works with old schema | Zero-downtime deployments |
| Expand-contract | Add new, migrate, remove old | Schema changes |
| Shadow tables | Write to both during transition | High-risk migrations |
Common Backend Mistakes to Avoid
| FAIL Avoid | PASS Instead | Why |
|---|---|---|
| Storing sessions in memory | Use Redis/Upstash | Memory lost on restart, no horizontal scaling |
| Synchronous file I/O | Use fs.promises or streams | Blocks event loop, kills throughput |
| Unbounded queries | Always use LIMIT + cursor pagination | Memory exhaustion, slow responses |
| Trusting client input | Validate with Zod at API boundaries | Injection attacks, type coercion bugs |
| Hardcoded secrets | Use env vars + secret manager (Vault, AWS SM) | Security breach on repo exposure |
| N+1 database queries | Use include/select or DataLoader | 10-100x performance degradation |
console.log in production | Use structured logging (Pino/Winston) | No correlation IDs, unqueryable logs |
| Catching errors silently | Log + rethrow or handle explicitly | Hidden failures, debugging nightmares |
| Missing connection pooling | Use Prisma connection pool or PgBouncer | Connection exhaustion under load |
| No request timeouts | Set timeouts on HTTP clients and DB queries | Resource leaks, cascading failures |
Security anti-patterns:
- FAIL Don't use MD5/SHA1 for passwords -> Use Argon2id
- FAIL Don't store JWTs in localStorage -> Use httpOnly cookies
- FAIL Don't trust
X-Forwarded-Forwithout validation -> Configure trusted proxies - FAIL Don't skip rate limiting -> Use sliding window (Redis) or token bucket
- FAIL Don't log sensitive data -> Redact PII, tokens, passwords
Optional: AI/Automation Extensions
Note: AI-assisted backend patterns. Skip if not using AI tooling.
AI-Assisted Code Generation
| Tool | Use Case |
|---|---|
| GitHub Copilot | Inline suggestions, boilerplate |
| Cursor | AI-first IDE, context-aware |
| Claude Code | CLI-based development |
Review requirements for AI-generated code:
- All imports verified against package.json
- Type checker passes (strict mode)
- Security scan passes
- Tests cover generated code
Infrastructure Economics and Business Impact
Why this matters: Backend decisions directly impact revenue. A 100ms latency increase can reduce conversions by 7%. A poorly chosen architecture can cost 10x more in cloud spend. Performance SLAs are revenue commitments.
Cost Modeling Quick Reference
| Decision | Cost Impact | Revenue Impact |
|---|---|---|
| Edge vs. Origin | 60-80% latency reduction | +2-5% conversion rate |
| Serverless vs. Containers | Variable cost, scales to zero | Better unit economics at low scale |
| Reserved vs. On-Demand | 30-60% cost savings | Predictable COGS |
| Connection pooling | 50-70% fewer DB connections | Lower database costs |
| Caching layer | 80-95% fewer origin requests | Reduced compute costs |
Performance SLA -> Revenue Mapping
SLA Target -> Business Metric P50 latency < 100ms -> Baseline user experience P95 latency < 500ms -> 95% users satisfied P99 latency < 1000ms -> Enterprise SLA compliance Uptime 99.9% (43.8m downtime/month) -> Standard SLA tier Uptime 99.99% (4.4m downtime/month) -> Enterprise tier ($$$)
Unit Economics Checklist
Before deploying any backend service, calculate:
- Cost per request: Total infra cost / monthly requests
- Cost per user: Total infra cost / MAU
- Gross margin impact: How does infra cost affect product margin?
- Scale economics: At 10x traffic, does cost scale linearly or worse?
- Break-even point: At what traffic level does this architecture pay for itself?
Architecture Decision -> Business Impact
| Architecture Choice | Technical Benefit | Business Impact |
|---|---|---|
| CDN + Edge caching | Lower latency | Higher conversion, better SEO |
| Read replicas | Scale reads | Handle traffic spikes without degradation |
| Queue-based processing | Decouple services | Smoother UX during high load |
| Multi-region deployment | Fault tolerance | Enterprise SLA compliance |
| Auto-scaling | Right-sized infra | Lower COGS, better margins |
FinOps Practices for Backend Teams
- Tag all resources - Every resource tagged with
team,service,environment - Set billing alerts - Alert at 50%, 80%, 100% of budget
- Review weekly - 15-minute weekly cost review meeting
- Right-size monthly - Check CPU/memory utilization, downsize overprovisioned
- Spot/Preemptible for non-prod - 60-90% savings on dev/staging
See references/infrastructure-economics.md for detailed cost modeling, cloud provider comparisons, and ROI calculators.
Navigation
Resources
- references/backend-best-practices.md - Template authoring guide, quality checklist, and shared utilities pointers
- references/edge-deployment-guide.md - Edge computing patterns, Cloudflare Workers vs Vercel Edge, tRPC, Hono, Bun
- references/infrastructure-economics.md - Cost modeling, performance SLAs -> revenue, FinOps practices, cloud optimization
- references/go-best-practices.md - Go idioms, concurrency, error handling, GORM usage, testing, profiling
- references/rust-best-practices.md - Ownership, async, Axum, SeaORM, error handling, testing
- references/python-best-practices.md - FastAPI, SQLAlchemy, async patterns, validation, testing, performance
- references/nodejs-best-practices.md - Event loop, async patterns, Express/Fastify/NestJS/Hono, error handling, memory management, security, profiling
- references/database-patterns.md - PostgreSQL patterns (JSONB, CTEs, partitioning), connection pooling, migration strategies, ORM comparison, index design
- references/message-queues-background-jobs.md - BullMQ patterns, broker comparison (Redis/SQS/Kafka/RabbitMQ), idempotent jobs, DLQ, scheduling, delivery guarantees
- data/sources.json - External references per language/runtime
- Shared checklists: ../software-clean-code-standard/assets/checklists/backend-api-review-checklist.md, ../software-clean-code-standard/assets/checklists/secure-code-review-checklist.md
Shared Utilities (Centralized patterns - extract, don't duplicate)
- ../software-clean-code-standard/utilities/auth-utilities.md - Argon2id, jose JWT, OAuth 2.1/PKCE
- ../software-clean-code-standard/utilities/error-handling.md - Effect Result types, correlation IDs
- ../software-clean-code-standard/utilities/config-validation.md - Zod 3.24+, Valibot, secrets management
- ../software-clean-code-standard/utilities/resilience-utilities.md - p-retry v6, opossum v8, OTel spans
- ../software-clean-code-standard/utilities/logging-utilities.md - pino v9 + OpenTelemetry integration
- ../software-clean-code-standard/utilities/testing-utilities.md - Vitest, MSW v2, factories, fixtures
- ../software-clean-code-standard/utilities/observability-utilities.md - OpenTelemetry SDK, tracing, metrics
- ../software-clean-code-standard/references/clean-code-standard.md - Canonical clean code rules (
CC-*) for citation
Templates
- assets/nodejs/template-nodejs-prisma-postgres.md - Node.js + Prisma + PostgreSQL
- assets/go/template-go-fiber-gorm.md - Go + Fiber + GORM + PostgreSQL
- assets/rust/template-rust-axum-seaorm.md - Rust + Axum + SeaORM + PostgreSQL
- assets/python/template-python-fastapi-sqlalchemy.md - Python + FastAPI + SQLAlchemy + PostgreSQL
Related Skills
- ../software-architecture-design/SKILL.md - System decomposition, SLAs, and data flows
- ../software-security-appsec/SKILL.md - Authentication/authorization and secure API design
- ../ops-devops-platform/SKILL.md - CI/CD, infrastructure, and deployment safety
- ../qa-resilience/SKILL.md - Resilience, retries, and failure playbooks
- ../software-code-review/SKILL.md - Review checklists and standards for backend changes
- ../qa-testing-strategy/SKILL.md - Testing strategies, test pyramids, and coverage goals
- ../dev-api-design/SKILL.md - RESTful design, GraphQL, and API versioning patterns
- ../data-sql-optimization/SKILL.md - SQL optimization, indexing, and query tuning patterns
Freshness Protocol
When users ask version-sensitive recommendation questions, do a quick freshness check before asserting "best" choices or quoting versions.
Trigger Conditions
- "What's the best backend framework for [use case]?"
- "What should I use for [API design/auth/database]?"
- "What's the latest in Node.js/Go/Rust?"
- "Current best practices for [REST/GraphQL/tRPC]?"
- "Is [framework/runtime] still relevant in 2026?"
- "[Express] vs [Fastify] vs [Hono]?"
- "Best ORM for [database/use case]?"
How to Freshness-Check
- Start from
data/sources.json(official docs, release notes, support policies). - Run a targeted web search for the specific component and open release notes/support policy pages.
- Prefer official sources over blogs for versions and support windows.
What to Report
- Current landscape: what is stable and widely used now
- Emerging trends: what is gaining traction (and why)
- Deprecated/declining: what is falling out of favor (and why)
- Recommendation: default choice + 1-2 alternatives, with trade-offs
Example Topics (verify with fresh search)
- Node.js LTS support window and major changes
- Bun vs Deno vs Node.js
- Hono, Elysia, and edge-first frameworks
- Drizzle vs Prisma for TypeScript
- tRPC and end-to-end type safety
- Edge computing and serverless patterns
Operational Playbooks
- references/operational-playbook.md - Full backend architecture patterns, checklists, TypeScript notes, and decision tables