
SKILL.md
Metadata

name: langsmith-observability
description: LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Observability, LangSmith, Tracing, Evaluation, Monitoring, Debugging, Testing, LLM Ops, Production]
dependencies: [langsmith>=0.2.0]

LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering

Key features:

  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

When to use an alternative instead:

  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith
# The examples below also use the OpenAI SDK: pip install openai

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)            # Child run
    response = generate_answer(query, context)   # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass

Client API

from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
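Custom evaluators can also call a model to judge outputs instead of doing a string match. The sketch below follows the same (run, example) signature and return shape as correctness_evaluator above; the judge prompt wording and the gpt-4o-mini model choice are illustrative assumptions, not part of the LangSmith API.

from openai import OpenAI

judge_client = OpenAI()

def llm_judge_evaluator(run, example):
    # Ask a model whether the predicted answer matches the reference
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    verdict = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Reference answer:\n" + reference +
                "\n\nCandidate answer:\n" + prediction +
                "\n\nReply with only YES if the candidate matches the reference, otherwise NO."
            )
        }]
    )
    is_correct = verdict.choices[0].message.content.strip().upper().startswith("YES")
    return {"key": "llm_judged_correctness", "score": 1.0 if is_correct else 0.0}

It can be passed alongside correctness_evaluator in the evaluators list above.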

Built-in evaluators

from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()

Manual runs

from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})

Process inputs/outputs

from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)

Sampling

import os

os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling

LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
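You can also attach a run name, tags, and metadata to a single invocation through LangChain's standard config argument, and they show up on the corresponding LangSmith trace. The names, tags, and metadata values below are placeholders.

# Attach tags, metadata, and a custom run name to this invocation
response = chain.invoke(
    {"input": "Hello!"},
    config={
        "run_name": "greeting-chain",          # placeholder name
        "tags": ["demo", "v2"],                # placeholder tags
        "metadata": {"user_id": "user-123"},   # placeholder metadata
    },
)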

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})

Async client

from langsmith import AsyncClient

async def main():
    client = AsyncClient()
    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)
    return runs
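A minimal way to drive the async client from a script is plain asyncio; nothing here is LangSmith-specific.

import asyncio

# Run the coroutine above and collect the runs
runs = asyncio.run(main())
print(f"Fetched {len(runs)} runs")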

Feedback collection

from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
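Attaching feedback requires the run ID of the traced call. One way to surface it, assuming your langsmith version exposes get_current_run_tree, is to read the current run inside a traceable function and return its ID to the caller; treat this as a sketch rather than the only pattern.

from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree  # assumed available in recent SDK versions

@traceable
def answer_question(question: str) -> dict:
    run = get_current_run_tree()        # run created by @traceable for this call
    answer = generate_answer(question)  # your model logic
    # Return the run ID so the UI layer can call record_feedback later
    return {"answer": answer, "run_id": str(run.id) if run else None}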

Testing integration

Pytest integration

from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()

Evaluation in CI/CD

from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )
    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"

Best practices

  1. Structured naming - Use consistent project/run naming conventions
  2. Add metadata - Include version, environment, user info
  3. Sample in production - Use sampling rate to control volume
  4. Create datasets - Build test sets from interesting production cases (see the sketch after this list)
  5. Automate evaluation - Run evaluations in CI/CD pipelines
  6. Monitor costs - Track token usage and latency trends
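A sketch of best practice 4, curating a dataset from interesting production runs. It only combines list_runs, create_dataset, and create_examples as shown earlier; the project name, filter, and dataset name are placeholders.

from langsmith import Client

client = Client()

# Pull recent successful runs from the production project (placeholder name and filter)
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=50
))

# Promote them into a regression test set
dataset = client.create_dataset("prod-regression-set", description="Curated from production traces")
client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_id=dataset.id
)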

Common issues

Traces not appearing:

import os

# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # Should print your projects

High latency from tracing:

import os
from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
