Matrix-aware R-to-Python bridge: SummarizedExperiment → NumPy / SciPy → AnnData

Overview

This vignette explains how MofaflexR converts assay matrices from a SummarizedExperiment (or any subclass, including SingleCellExperiment) into Python NumPy / SciPy objects via reticulate, and then wraps the result in an anndata.AnnData accessible from R via anndataR::ReticulateAnnData.

The conversion is dispatch-based: the package inspects the R class of the assay matrix and selects the most appropriate Python representation, aiming to avoid unnecessary copies.

Supported input matrix classes:

R matrix class	Python target	Zero-copy behavior
`matrix`	`numpy.ndarray`	Zero-copy where possible via `r_to_py` buffer protocol
`dgeMatrix`	`numpy.ndarray`	Zero-copy via `@x` slot + Fortran-order reshape
`dgCMatrix`	`scipy.sparse.csc_matrix`	Zero-copy for `@x`, `@i`, `@p` slot arrays
`dgRMatrix`	`scipy.sparse.csr_matrix`	Zero-copy for `@x`, `@j`, `@p` slot arrays
`COO_SparseMatrix`	`scipy.sparse.coo_matrix`	Zero-copy for values; coordinates require a copy
Other	coerced to `csc_matrix` or `ndarray`	One R-level copy during coercion

Motivation

Single-cell count matrices can contain tens of millions of non-zero entries. Copying them at the R ↔︎ Python boundary wastes memory and time. Different matrix classes expose memory in different ways; MofaflexR picks an appropriate conversion for each:

Supported sparse formats store their data, indices, and pointers in contiguous R integer/numeric vectors that can be handed to NumPy via the C buffer protocol (reticulate::np_array()), producing views with flags.owndata == False.
Dense base matrices and dgeMatrix store their values as a single contiguous column-major vector which can be similarly exposed as a Fortran-order NumPy array.
COO coordinate matrices require a 1→0 index-base conversion, so the coordinate arrays must be copied; only the value array can be a zero-copy view.

How assay dispatch works

The container-level entry point is sce_to_anndata() (and its sce_to_reticulate_anndata() wrapper). Internally, assay conversion follows this path:

# Conceptual pseudocode
X <- se_assay_to_python_matrix(x, assay = assay, prefer_sparse = TRUE)
# X has shape (n_features, n_samples) — same as the R assay
X <- X$T   # transpose once: AnnData expects (obs=cells, vars=features)

The internal dispatcher selects converters by matrix class:

# Simplified dispatch logic
if (is.matrix(x))                 return(.dense_matrix_to_numpy(x))
if (is(x, "dgeMatrix"))           return(.dense_matrix_to_numpy(x))
if (is(x, "dgCMatrix"))           return(.dgCMatrix_to_scipy_csc(x))
if (is(x, "dgRMatrix"))           return(.dgRMatrix_to_scipy_csr(x))
if (is(x, "COO_SparseArray"))     return(.coo_matrix_to_scipy_coo(x))
# fallback: coerce to dgCMatrix or densify, then convert

Dense matrix conversion

Base `matrix`

# r_to_py on a base R matrix uses NumPy's buffer protocol.
# The result is a Fortran-order float64 ndarray; owndata = FALSE.
# (Observed behavior; the reticulate API does not formally guarantee it,
# but it is stable in practice across current versions.)
arr <- reticulate::r_to_py(mat)  # shape (nrow, ncol), owndata = FALSE

R matrices are column-major; NumPy’s Fortran order matches this layout, so no reordering is needed and no copy occurs during the conversion itself.

`dgeMatrix`

dgeMatrix stores its values in a plain @x slot (a column-major numeric vector), plus @Dim. The conversion exposes @x as a 1-D zero-copy view, then calls .reshape((nr, nc), order = "F"). Because the 1-D buffer is already laid out in column-major order, NumPy’s reshape returns a Fortran-contiguous 2-D view (owndata = FALSE); no copy is made.

Sparse matrix conversion

`dgCMatrix` → `scipy.sparse.csc_matrix`

A dgCMatrix with m features (rows) and n samples (columns) stores:

Slot	Content	Length	R type
`@x`	non-zero values	`nnz`	`double` → `float64`
`@i`	row indices (0-based)	`nnz`	`integer` → `int32`
`@p`	column pointers (`@p[j]` = start of col `j` in `@x`)	`n + 1`	`integer` → `int32`

This maps directly to the CSC (Compressed Sparse Column) format. reticulate::np_array() exposes each slot vector to NumPy without copying:

# Pseudocode — zero-copy slot views
py_data    <- reticulate::np_array(mat@x, dtype = "float64")  # owndata = FALSE
py_indices <- reticulate::np_array(mat@i, dtype = "int32")    # owndata = FALSE
py_indptr  <- reticulate::np_array(mat@p, dtype = "int32")    # owndata = FALSE
csc <- sp$csc_matrix(
  reticulate::tuple(py_data, py_indices, py_indptr),
  shape = reticulate::tuple(nr, nc)
)
# SciPy stores references to the NumPy arrays; no further copy occurs.

`dgRMatrix` → `scipy.sparse.csr_matrix`

dgRMatrix uses the CSR (Compressed Sparse Row) format. The relevant slots are:

Slot	Content	R type
`@x`	non-zero values	`double`
`@j`	column indices (0-based)	`integer`
`@p`	row pointers	`integer`

These map directly to csr_matrix((data, indices, indptr), shape=...) with the same zero-copy path as dgCMatrix.

`COO_SparseMatrix` → `scipy.sparse.coo_matrix`

COO_SparseMatrix (from the SparseArray package) uses:

Slot	Content	Notes
`@nzdata`	non-zero values	zero-copy `np_array` view
`@nzcoo`	coordinate matrix, shape (nnz, 2), 1-based	1→0 conversion requires a copy
`@dim`	`c(nrow, ncol)`

The coordinate arrays must be decremented from 1-based to 0-based before being passed to SciPy. This creates new R integer vectors (one copy each for row and column), which are then turned into zero-copy NumPy views. The value array @nzdata itself is still a zero-copy view.

# Coordinate conversion — unavoidable copy for 1→0 base change
row_0 <- coo@nzcoo[, 1L] - 1L   # new R vector (copy)
col_0 <- coo@nzcoo[, 2L] - 1L   # new R vector (copy)
py_row  <- reticulate::np_array(row_0, dtype = "int32")   # view of the new vector
py_col  <- reticulate::np_array(col_0, dtype = "int32")   # view of the new vector
py_data <- reticulate::np_array(coo@nzdata, dtype = "float64")  # zero-copy view
coo_py  <- sp$coo_matrix(
  reticulate::tuple(py_data, reticulate::tuple(py_row, py_col)),
  shape = reticulate::tuple(nr, nc)
)

Orientation: why we transpose once

R assay matrices are stored features × samples (rows = features, columns = samples/cells). AnnData expects obs × vars = cells × features. The conversion therefore transposes the Python matrix exactly once:

X <- se_assay_to_python_matrix(x, assay = assay)  # (features, cells)
X <- X$T                                            # (cells, features) — zero-copy for sparse

For SciPy sparse matrices, .T swaps the stored arrays’ interpretation (CSC ↔︎ CSR swap for example) without reordering values; owndata remains False. For NumPy dense arrays, .T returns a transposed view sharing the same buffer.

What CAN force a copy

Trigger	Reason
Assay is an unsupported class (coercion fallback)	`as(mat, "dgCMatrix")` allocates a new R object
Assay is stored as a raw dense `integer` matrix	`storage.mode(x) <- "double"` makes a copy
COO coordinate arrays	1 → 0 base conversion requires new integer vectors
`obs` / `var` metadata (pandas `DataFrame`)	R columnar `data.frame` → Python row-oriented; one copy per column, unavoidable
`py_to_r(adata$X)`	reticulate converts sparse → dense R matrix
`adata.X.toarray()` or `.todense()` in Python	explicit densification
Any `.copy()` call on a SciPy or NumPy object	explicit copy

Lifetime caveat

Python objects created via the zero-copy path hold borrowed C references to the underlying R vectors. This applies to any matrix class whose slot vectors are exposed via reticulate::np_array() — not only dgCMatrix.

Ensure that:

The source R matrix object (and therefore its slots) remains alive and referenced by R while Python is using the converted object.
No R code triggers copy-on-modify on the same vector (standard R copy-on-modify semantics normally protect against this for unmodified objects).
The R garbage collector has not freed the underlying memory.

In typical usage (SCE / SE → AnnData → model fit → results back to R in the same session), this is always satisfied as long as sce or equivalent is held in the R global environment.

Examples

library(MofaflexR)
library(SingleCellExperiment)
library(Matrix)

set.seed(1)

# ---- sparse dgCMatrix example -------------------------------------------
counts <- rsparsematrix(500, 100, density = 0.05, repr = "C")
counts <- methods::as(abs(round(counts)), "dgCMatrix")
sce    <- SingleCellExperiment(assays = list(counts = counts))
colnames(sce) <- paste0("cell", seq_len(ncol(sce)))
rownames(sce) <- paste0("gene", seq_len(nrow(sce)))

# Low-level: CSC matrix (features x cells orientation)
csc <- sce_assay_to_scipy_csc(sce)
cat("owndata:", reticulate::py_to_r(csc$data$flags$owndata), "\n")
# owndata: FALSE  (zero-copy view)

# Full AnnData via the generalized bridge (cells x genes)
rada     <- sce_to_reticulate_anndata(sce)
py_adata <- rada$py_anndata()
sp       <- reticulate::import("scipy.sparse", convert = FALSE)
cat("Shape:  ", unlist(reticulate::py_to_r(py_adata$shape)), "\n")  # 100 500
cat("Sparse: ", reticulate::py_to_r(sp$issparse(py_adata$X)), "\n") # TRUE
cat("owndata:", reticulate::py_to_r(py_adata$X$data$flags$owndata), "\n") # FALSE

library(MofaflexR)
library(SummarizedExperiment)

# ---- dense dgeMatrix example (e.g. a PCA embedding stored as an assay) ---
set.seed(2)
dense_mat <- matrix(rnorm(500 * 50), nrow = 500L, ncol = 50L)  # 500 features x 50 cells
se <- SummarizedExperiment(assays = list(embedding = dense_mat))

# sce_to_anndata accepts any SummarizedExperiment subclass.
# The dense assay is converted to a NumPy ndarray via the buffer protocol.
adata <- sce_to_anndata(se, assay = "embedding")

np <- reticulate::import("numpy", convert = FALSE)
cat("Shape:     ", unlist(reticulate::py_to_r(adata$shape)), "\n")  # 50 500
cat("Is ndarray:", reticulate::py_to_r(np$ndim(adata$X)) == 2L, "\n") # TRUE

Maximilian Nuber