Kosmos: An AI Scientist for Autonomous Discovery - An implementation and adaptation to be driven by Claude Code or API - Based on the Kosmos AI Paper - https://arxiv.org/abs/2511.02824
npx skills add https://github.com/jimmc414/Kosmos --skill treatment-plans

Install this skill via the CLI and start using the SKILL.md workflow in your workspace.
An autonomous AI scientist for scientific discovery, implementing the architecture described in Lu et al. (2024).
Kosmos is an open-source implementation of an autonomous AI scientist that can:
The system runs autonomous research cycles, generating tasks, executing analyses, and synthesizing findings into validated discoveries.
Without Docker, code runs via exec() with static validation. See "Code Execution Security" below.
git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY
# Run smoke tests
python scripts/smoke_test.py
# Run unit tests
pytest tests/unit/ -v --tb=short
```python
import asyncio
from kosmos.workflow.research_loop import ResearchWorkflow

async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts",
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)

asyncio.run(run())
```
# Run research with default settings
kosmos run "What metabolic pathways differ between cancer and normal cells?" --domain biology
# With budget limit
kosmos run "How do perovskites optimize efficiency?" --domain materials --budget 50
# Interactive mode (recommended for first time)
kosmos run --interactive
# Maximum verbosity
kosmos run "Your question" --domain biology --trace
# Real-time streaming display
kosmos run "Your question" --stream
# Streaming with token display disabled
kosmos run "Your question" --stream --no-stream-tokens
# Show system information
kosmos info
# Run diagnostics
kosmos doctor
| Feature | Description | Status |
|---|---|---|
| Research Loop | Multi-cycle autonomous research with hypothesis generation | Complete |
| Literature Search | ArXiv, PubMed, Semantic Scholar integration | Complete |
| Code Execution | Docker-sandboxed Jupyter notebooks | Complete |
| Knowledge Graph | Neo4j-based relationship storage (optional) | Complete |
| Context Compression | Query-based hierarchical compression (20:1 ratio) | Complete |
| Discovery Validation | 8-dimension ScholarEval quality framework | Complete |
| Multi-Provider LLM | Anthropic, OpenAI, LiteLLM (100+ providers) | Complete |
| Budget Enforcement | Cost tracking with configurable limits and enforcement | Complete |
| Error Recovery | Exponential backoff with circuit breaker | Complete |
| Debug Mode | 4-level verbosity with stage tracking | Complete |
| Real-time Streaming | SSE/WebSocket events, CLI --stream flag | Complete |
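The error-recovery feature listed above (exponential backoff with a circuit breaker) can be sketched roughly as follows. The class and function names here are illustrative, not Kosmos's actual API:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, blocking further calls."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1

def call_with_backoff(fn, breaker, retries=4, base_delay=1.0):
    """Retry fn with exponential backoff unless the breaker is open."""
    for attempt in range(retries):
        if breaker.open:
            raise RuntimeError("circuit open: too many consecutive failures")
        try:
            result = fn()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("retries exhausted")
```

The breaker prevents a persistently failing provider from being hammered by retries across many concurrent tasks.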
AI-generated code runs in isolated Docker containers:
| Layer | Implementation |
|---|---|
| Container Isolation | --cap-drop=ALL, no privileged access |
| Network | Disabled (--network=none) |
| Filesystem | Read-only root, tmpfs for scratch |
| Resources | CPU: 2 cores, Memory: 2GB, Timeout: 300s |
| Pooling | Pre-warmed containers reduce cold start |
See: kosmos/execution/sandbox.py, docker_manager.py
Without Docker, falls back to CodeValidator static analysis + exec(). Not recommended for untrusted inputs.
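The isolation layers in the table map directly onto `docker run` flags. A sketch of assembling that command (the image name and script path are placeholders; the real logic lives in docker_manager.py):

```python
def sandbox_cmd(image="kosmos-sandbox", script="/work/cell.py"):
    """Build a docker run command mirroring the isolation layers above.

    The 300 s timeout is enforced from the host side, e.g.
    subprocess.run(sandbox_cmd(), timeout=300).
    """
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",             # drop all Linux capabilities
        "--network=none",             # no network access
        "--read-only",                # read-only root filesystem
        "--tmpfs", "/tmp:size=256m",  # tmpfs scratch space
        "--cpus=2", "--memory=2g",    # resource caps from the table
        image, "python", script,
    ]
```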
| Agent | Role |
|---|---|
| Research Director | Master orchestrator coordinating all agents |
| Hypothesis Generator | Generates testable hypotheses from literature |
| Experiment Designer | Creates experimental protocols |
| Data Analyst | Analyzes results and interprets findings |
| Literature Analyzer | Searches and synthesizes papers |
| Plan Creator/Reviewer | Strategic task generation with 70/30 exploration/exploitation |
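The Plan Creator's 70/30 exploration/exploitation split can be sketched as weighted task sampling (illustrative only, not the actual planner code):

```python
import random

def pick_task(novel_tasks, followup_tasks, explore_p=0.7, rng=random):
    """With probability explore_p, pick a novel (exploratory) task;
    otherwise exploit by deepening an existing line of inquiry."""
    if novel_tasks and (not followup_tasks or rng.random() < explore_p):
        return rng.choice(novel_tasks)
    return rng.choice(followup_tasks)
```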
The system processes literature in batches, not bulk:
Effective ratio: ~20:1. See kosmos/compression/compressor.py.
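The batching and the 20:1 budget arithmetic look roughly like this (batch size and numbers are illustrative, not the compressor's actual constants):

```python
def batches(papers, batch_size=50):
    """Yield papers in fixed-size batches rather than one bulk pass."""
    for i in range(0, len(papers), batch_size):
        yield papers[i:i + batch_size]

def compressed_budget(total_tokens, ratio=20):
    """Token budget remaining after ~20:1 hierarchical compression."""
    return total_tokens // ratio
```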
All configuration via environment variables. See .env.example for the full list.
# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
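Provider selection from these variables can be sketched as a small dispatch table (illustrative; the real resolution lives in kosmos/core/providers/):

```python
import os

def resolve_provider(env=os.environ):
    """Map LLM_PROVIDER settings to a provider name and its credential."""
    provider = env.get("LLM_PROVIDER", "anthropic")
    key_var = {
        "anthropic": "ANTHROPIC_API_KEY",
        "openai": "OPENAI_API_KEY",
        "litellm": "LITELLM_API_BASE",  # local models may need no key
    }.get(provider)
    if key_var is None:
        raise ValueError(f"unknown LLM_PROVIDER: {provider}")
    return provider, env.get(key_var)
```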
BUDGET_ENABLED=true
BUDGET_LIMIT_USD=10.00
Budget enforcement raises a BudgetExceededError when the limit is reached; the workflow catches it and transitions the research gracefully to completion.
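A minimal sketch of the enforcement logic (BudgetExceededError is named in the docs; the tracker class and its fields here are illustrative):

```python
class BudgetExceededError(Exception):
    pass

class BudgetTracker:
    """Accumulates per-call cost and raises once the limit is crossed."""
    def __init__(self, limit_usd=10.00, enabled=True):
        self.limit_usd = limit_usd
        self.enabled = enabled
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        self.spent_usd += cost_usd
        if self.enabled and self.spent_usd >= self.limit_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )
```

In Kosmos the workflow catches this exception and winds the run down rather than crashing.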
Three independent limits in kosmos/config.py:
| Setting | Default | Range |
|---|---|---|
| max_parallel_hypotheses | 3 | 1-10 |
| max_concurrent_experiments | 10 | 1-16 |
| max_concurrent_llm_calls | 5 | 1-20 |
The paper describes 10 parallel tasks; the default max_concurrent_experiments now matches the paper's specification.
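Limits like these are typically enforced with asyncio semaphores; a sketch, assuming the defaults from the table above (the helper name is illustrative):

```python
import asyncio

MAX_CONCURRENT_LLM_CALLS = 5  # default from the table above

async def bounded_gather(coros, limit=MAX_CONCURRENT_LLM_CALLS):
    """Run coroutines with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))
```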
# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password
# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379
Start Neo4j, Redis, and PostgreSQL with Docker Compose:
# Start all optional services (Neo4j, Redis, PostgreSQL)
docker compose --profile dev up -d
# Or start individual services
docker compose up -d neo4j
docker compose up -d redis
docker compose up -d postgres
# Stop services
docker compose --profile dev down
Service URLs when running via Docker:
Literature search via Semantic Scholar works without authentication. An API key is optional but increases rate limits:
# Optional: Get API key from https://www.semanticscholar.org/product/api
SEMANTIC_SCHOLAR_API_KEY=your-key-here
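When present, the key is sent as an x-api-key request header; a sketch of building the headers (the User-Agent string is a placeholder):

```python
import os

def s2_headers(env=os.environ):
    """Semantic Scholar works unauthenticated; a key just raises rate limits."""
    headers = {"User-Agent": "kosmos-literature-client"}
    key = env.get("SEMANTIC_SCHOLAR_API_KEY")
    if key:
        headers["x-api-key"] = key
    return headers
```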
# Enable debug mode with level 1-3
DEBUG_MODE=true
DEBUG_LEVEL=2
# Or use CLI flag for maximum verbosity
kosmos run "Your research question" --trace
See docs/DEBUG_MODE.md for comprehensive debug documentation.
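One plausible way these variables map onto logging verbosity (a hypothetical mapping for illustration; the actual behavior is documented in docs/DEBUG_MODE.md):

```python
import logging
import os

LEVELS = {1: logging.INFO, 2: logging.DEBUG, 3: logging.DEBUG}

def debug_log_level(env=os.environ):
    """Map DEBUG_MODE/DEBUG_LEVEL to a stdlib logging level."""
    if env.get("DEBUG_MODE", "false").lower() != "true":
        return logging.WARNING
    return LEVELS.get(int(env.get("DEBUG_LEVEL", "1")), logging.DEBUG)
```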
kosmos/
├── agents/ # Research agents (director, hypothesis, experiment, etc.)
├── compression/ # Context compression (20:1 ratio)
├── core/ # LLM providers, metrics, configuration
│ └── providers/ # Anthropic, OpenAI, LiteLLM with async support
├── execution/ # Docker-based sandboxed code execution
├── knowledge/ # Neo4j knowledge graph (1,025 lines)
├── literature/ # ArXiv, PubMed, Semantic Scholar clients
├── orchestration/ # Plan creation/review, task delegation
├── validation/ # ScholarEval 8-dimension quality framework
├── workflow/ # Main research loop integration
└── world_model/ # State management, JSON artifacts
| Category | Percentage | Description |
|---|---|---|
| Paper gaps | 100% | All 17 paper implementation gaps complete |
| Ready for user testing | 95% | Core research loop, agents, LLM providers, validation |
| Deferred | 5% | Phase 4 production mode (polyglot persistence) |
| Issue | Description | Status |
|---|---|---|
| #66 | CLI deadlock - async refactor | ✅ Fixed |
| #67 | SkillLoader domain mapping | ✅ Fixed |
| #68 | Pydantic V2 migration | ✅ Fixed |
| #54-#58 | Critical paper gaps | ✅ Fixed |
| #59 | h5ad/Parquet data formats | ✅ Fixed |
| #69 | R language execution | ✅ Fixed |
| #60 | Figure generation | ✅ Fixed |
| #61 | Jupyter notebook generation | ✅ Fixed |
| #70 | Null model statistical validation | ✅ Fixed |
| #63 | Failure mode detection | ✅ Fixed |
| #62 | Code line provenance | ✅ Fixed |
| #64 | Multi-run convergence framework | ✅ Fixed |
| #65 | Paper accuracy validation | ✅ Fixed |
| #72 | Real-time streaming API | ✅ Fixed |
All 17 paper implementation gaps have been addressed. Full tracking: PAPER_IMPLEMENTATION_GAPS.md
| Category | Count | Status |
|---|---|---|
| Unit tests | 2251 | Passing |
| Integration tests | 415 | Passing |
| E2E tests | 121 | Most pass, some skip (environment-dependent) |
| Requirements tests | 815 | Passing |
E2E tests skip based on environment (e.g., tests marked @pytest.mark.requires_neo4j).

This project implements the architecture from the Kosmos paper but has not yet reproduced the paper's claimed results:
| Paper Claim | Implementation Status |
|---|---|
| 79.4% accuracy on scientific statements | Architecture implemented, not validated |
| 7 validated discoveries | Not reproduced |
| 1,500 papers per run | Architecture supports this |
| 42,000 lines of code per run | Architecture supports this |
| 200 agent rollouts | Configurable via max_iterations |
The system is suitable for experimentation and further development. Before production research use, validation studies should be conducted.
Docker recommended: Without Docker, code execution falls back to direct exec() which is unsafe for untrusted code.
Neo4j optional: Knowledge graph features require Neo4j. Set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD to enable.
R support via Docker: R language execution requires the R-enabled Docker image (docker/sandbox/Dockerfile.r) with TwoSampleMR, susieR, and MendelianRandomization packages.
Single-user: No multi-tenancy or user isolation.
Not a reproduction study: We have not yet reproduced the paper's 79.4% accuracy or 7 validated discoveries.
The original paper omitted implementation details for 6 critical components. This repository provides those implementations:
| Gap | Problem | Solution |
|---|---|---|
| 0 | Context compression for 1,500 papers | Hierarchical 3-tier compression (20:1 ratio) |
| 1 | State Manager schema unspecified | 4-layer hybrid architecture (JSON + Neo4j + Vector + Citations) |
| 2 | Task generation algorithm unstated | Plan Creator + Plan Reviewer pattern |
| 3 | Agent integration mechanism unclear | Skill loader with 116 domain-specific skills (see #67) |
| 4 | Execution environment not described | Docker sandbox with Python + R support (see #69) |
| 5 | Discovery validation criteria missing | ScholarEval 8-dimension quality framework |
For detailed analysis, see archive/120525_implementation_gaps_v2.md.
See CONTRIBUTING.md.
Areas where contributions would be useful:
MIT License
Version: 0.2.0-alpha | Tests: 3704 passing | Last Updated: 2025-12-09