anndata

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

  • Creating, reading, or writing AnnData objects
  • Working with h5ad, zarr, or other genomics data formats
  • Performing single-cell RNA-seq analysis
  • Managing large datasets with sparse matrices or backed mode
  • Concatenating multiple datasets or experimental batches
  • Subsetting, filtering, or transforming annotated data
  • Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

    uv pip install anndata

    # With optional dependencies (quoted so the shell does not expand the brackets)
    uv pip install "anndata[dev,test,doc]"

Quick Start

Creating an AnnData object

    import anndata as ad
    import numpy as np
    import pandas as pd

    # Minimal creation
    X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
    adata = ad.AnnData(X)

    # With metadata
    obs = pd.DataFrame({
        'cell_type': ['T cell', 'B cell'] * 50,
        'sample': ['A', 'B'] * 50
    }, index=[f'cell_{i}' for i in range(100)])
    var = pd.DataFrame({
        'gene_name': [f'Gene_{i}' for i in range(2000)]
    }, index=[f'ENSG{i:05d}' for i in range(2000)])
    adata = ad.AnnData(X=X, obs=obs, var=var)

Reading data

    import scanpy as sc  # the 10x readers live in scanpy, not anndata

    # Read h5ad file
    adata = ad.read_h5ad('data.h5ad')

    # Read with backed mode (for large files)
    adata = ad.read_h5ad('large_data.h5ad', backed='r')

    # Read other formats
    adata = ad.read_csv('data.csv')
    adata = ad.read_loom('data.loom')
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

Writing data

    # Write h5ad file
    adata.write_h5ad('output.h5ad')

    # Write with compression
    adata.write_h5ad('output.h5ad', compression='gzip')

    # Write other formats
    adata.write_zarr('output.zarr')
    adata.write_csvs('output_dir/')

Basic operations

    # Subset by conditions
    t_cells = adata[adata.obs['cell_type'] == 'T cell']

    # Subset by indices
    subset = adata[0:50, 0:100]

    # Add metadata
    adata.obs['quality_score'] = np.random.rand(adata.n_obs)
    adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

    # Access dimensions
    print(f"{adata.n_obs} observations × {adata.n_vars} variables")

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

  • Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
  • Creating AnnData objects from various sources
  • Accessing and manipulating data components
  • Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

  • Native formats (h5ad, zarr)
  • Alternative formats (CSV, MTX, Loom, 10X, Excel)
  • Backed mode for large datasets
  • Remote data access
  • Format conversion
  • Performance optimization

Common commands:

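    import scanpy as sc  # needed for the 10x reader

    # Read/write h5ad
    adata = ad.read_h5ad('data.h5ad', backed='r')
    adata.write_h5ad('output.h5ad', compression='gzip')

    # Read 10X data (the reader is provided by scanpy, not anndata)
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

    # Read MTX format (transposed so that rows are observations)
    adata = ad.read_mtx('matrix.mtx').T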

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

  • Basic concatenation (axis=0 for observations, axis=1 for variables)
  • Join types (inner, outer)
  • Merge strategies (same, unique, first, only)
  • Tracking data sources with labels
  • Lazy concatenation (AnnCollection)
  • On-disk concatenation for large datasets

Common commands:

    # Concatenate observations (combine samples)
    adata = ad.concat(
        [adata1, adata2, adata3],
        axis=0,
        join='inner',
        label='batch',
        keys=['batch1', 'batch2', 'batch3']
    )

    # Concatenate variables (combine modalities)
    adata = ad.concat([adata_rna, adata_protein], axis=1)

    # Lazy concatenation (AnnCollection takes AnnData objects, not file paths)
    from anndata.experimental import AnnCollection
    adatas = [ad.read_h5ad(p, backed='r') for p in ['data1.h5ad', 'data2.h5ad']]
    collection = AnnCollection(
        adatas,
        join_obs='outer',
        label='dataset'
    )

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

  • Subsetting (by indices, names, boolean masks, metadata conditions)
  • Transposition
  • Copying (full copies vs views)
  • Renaming (observations, variables, categories)
  • Type conversions (strings to categoricals, sparse/dense)
  • Adding/removing data components
  • Reordering
  • Quality control filtering

Common commands:

    # Subset by metadata
    filtered = adata[adata.obs['quality_score'] > 0.8]
    hv_genes = adata[:, adata.var['highly_variable']]

    # Transpose
    adata_T = adata.T

    # Copy vs view
    view = adata[0:100, :]          # View (lightweight reference)
    copy = adata[0:100, :].copy()   # Independent copy

    # Convert strings to categoricals
    adata.strings_to_categoricals()

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

  • Memory management (sparse matrices, categoricals, backed mode)
  • Views vs copies
  • Data storage optimization
  • Performance optimization
  • Working with raw data
  • Metadata management
  • Reproducibility
  • Error handling
  • Integration with other tools
  • Common pitfalls and solutions

Key recommendations:

    # Use sparse matrices for sparse data
    from scipy.sparse import csr_matrix
    adata.X = csr_matrix(adata.X)

    # Convert strings to categoricals
    adata.strings_to_categoricals()

    # Use backed mode for large files
    adata = ad.read_h5ad('large.h5ad', backed='r')

    # Store raw before filtering
    adata.raw = adata.copy()
    adata = adata[:, adata.var['highly_variable']]

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

    import scanpy as sc

    # Preprocessing
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)

    # Dimensionality reduction
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata)
    sc.tl.leiden(adata)

    # Visualization
    sc.pl.umap(adata, color=['cell_type', 'leiden'])

Muon (Multimodal data)

    import muon as mu

    # Combine RNA and protein data
    mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})

PyTorch integration

    from anndata.experimental import AnnLoader

    # Create DataLoader for deep learning
    dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

    for batch in dataloader:
        X = batch.X
        # Train model

Common Workflows

Single-cell RNA-seq analysis

    import numpy as np
    import scanpy as sc

    # 1. Load data (the 10x reader is provided by scanpy, not anndata)
    adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

    # 2. Quality control (X is sparse, so flatten the row sums to 1-D arrays)
    adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
    adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
    adata = adata[adata.obs['n_genes'] > 200]
    adata = adata[adata.obs['n_counts'] < 50000].copy()  # copy so later steps do not act on a view

    # 3. Store raw
    adata.raw = adata.copy()

    # 4. Normalize and filter
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var['highly_variable']]

    # 5. Save processed data
    adata.write_h5ad('processed.h5ad')

Batch integration

    import anndata as ad
    import scanpy as sc

    # Load multiple batches
    adata1 = ad.read_h5ad('batch1.h5ad')
    adata2 = ad.read_h5ad('batch2.h5ad')
    adata3 = ad.read_h5ad('batch3.h5ad')

    # Concatenate with batch labels
    adata = ad.concat(
        [adata1, adata2, adata3],
        label='batch',
        keys=['batch1', 'batch2', 'batch3'],
        join='inner'
    )

    # Apply batch correction
    sc.pp.combat(adata, key='batch')

    # Continue analysis
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)

Working with large datasets

    # Open in backed mode
    adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

    # Filter based on metadata (no data loading)
    high_quality = adata[adata.obs['quality_score'] > 0.8]

    # Load filtered subset
    adata_subset = high_quality.to_memory()

    # Process subset (process is a placeholder for your own function)
    process(adata_subset)

    # Or process in chunks
    chunk_size = 1000
    for i in range(0, adata.n_obs, chunk_size):
        chunk = adata[i:i+chunk_size, :].to_memory()
        process(chunk)

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

    # Backed mode
    adata = ad.read_h5ad('file.h5ad', backed='r')

    # Sparse matrices
    from scipy.sparse import csr_matrix
    adata.X = csr_matrix(adata.X)

Slow file reading

Use compression and appropriate formats:

    # Optimize for storage
    adata.strings_to_categoricals()
    adata.write_h5ad('file.h5ad', compression='gzip')

    # Use Zarr for cloud storage
    adata.write_zarr('file.zarr', chunks=(1000, 1000))

Index alignment issues

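Always align external data on the observation index before assigning it. pandas matches values by index label, not by row order:

    # Wrong: values are matched on external_data's own index, not on cell IDs
    adata.obs['new_col'] = external_data['values']

    # Correct: index by cell ID and reorder to match adata.obs_names
    adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']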

Additional Resources