When working with large brain connectome datasets, storing all
matrices in memory as R lists can be inefficient. The
riemtan package now supports Apache Arrow Parquet format
for efficient storage and lazy loading of SPD matrices.
This vignette demonstrates how to write SPD connectome matrices to Parquet files, load them lazily through a Parquet backend in CSample and CSuperSample, and tune caching, batch loading, and parallel processing for large datasets.
First, let’s create some example SPD matrices and save them to Parquet format:
library(riemtan)
library(Matrix)
# Create example connectomes (4x4 SPD matrices)
set.seed(42)
connectomes <- lapply(1:50, function(i) {
  mat <- diag(4) + matrix(rnorm(16, 0, 0.1), 4, 4)
  mat <- (mat + t(mat)) / 2  # Make symmetric
  mat <- mat + diag(4) * 0.5 # Ensure positive definite
  Matrix::pack(Matrix::Matrix(mat, sparse = FALSE))
})
# Write to Parquet format
write_connectomes_to_parquet(
  connectomes,
  output_dir = "my_connectomes",
  subject_ids = paste0("subject_", 1:50),
  provenance = list(
    study = "Example Study",
    acquisition_date = "2024-01-01",
    preprocessing = "Standard pipeline v1.0"
  )
)

This creates a directory structure:
my_connectomes/
├── metadata.json
├── matrix_0001.parquet
├── matrix_0002.parquet
├── ...
└── matrix_0050.parquet
Before using a Parquet directory, you can validate its structure:
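As a quick manual illustration (a sketch in base R, assuming the directory layout shown above; it does not rely on any riemtan helper), you can confirm that the metadata file and the expected number of matrix files are present:

# Sanity-check a Parquet connectome directory (illustrative only)
parquet_dir <- "my_connectomes"
file.exists(file.path(parquet_dir, "metadata.json"))  # Should be TRUE
matrix_files <- list.files(parquet_dir, pattern = "^matrix_\\d+\\.parquet$")
length(matrix_files)  # Should equal 50 in this example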
Now use the Parquet backend to create a CSample:
# Load AIRM metric
data(airm)
# Create Parquet backend (default cache size: 10 matrices)
backend <- create_parquet_backend(
  "my_connectomes",
  cache_size = 10
)
# Create CSample with the backend
sample <- CSample$new(
  backend = backend,
  metric_obj = airm
)
# Sample info
print(paste("Sample size:", sample$sample_size))
print(paste("Matrix dimension:", sample$matrix_size))All standard CSample operations work transparently with the Parquet backend:
The Parquet backend loads matrices on-demand:
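As an illustrative sketch (using the same methods that appear throughout this vignette), matrices are read from the Parquet files only when a computation needs them, and at most cache_size of them are held in memory at any time:

# No matrix data is read when the sample is created; the backend only indexes the files
sample$compute_tangents() # Reads matrices from Parquet as the computation reaches them
sample$compute_vecs()     # Operates on the tangent images just computed
sample$compute_fmean()    # Fréchet mean, again streaming matrices through the cache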
CSuperSample works seamlessly with Parquet-backed samples:
# Create two samples with different backends
backend1 <- create_parquet_backend("study1_connectomes")
backend2 <- create_parquet_backend("study2_connectomes")
sample1 <- CSample$new(backend = backend1, metric_obj = airm)
sample2 <- CSample$new(backend = backend2, metric_obj = airm)
# Create super sample
super_sample <- CSuperSample$new(list(sample1, sample2))
# Gather all connectomes
super_sample$gather()
# Compute statistics
super_sample$compute_fmean()
super_sample$compute_variation()

The Parquet backend is fully compatible with the traditional list-based approach:
# Traditional approach (all in memory)
sample_memory <- CSample$new(
  conns = connectomes,
  metric_obj = airm
)
# Parquet approach (lazy loading)
backend <- create_parquet_backend("my_connectomes")
sample_parquet <- CSample$new(
  backend = backend,
  metric_obj = airm
)
# Both work identically
sample_memory$compute_fmean()
sample_parquet$compute_fmean()

Adjust cache size based on available memory:
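For example (a sketch; the best value depends on matrix size and available RAM), a smaller cache minimizes the memory footprint at the cost of more frequent disk reads, while a larger cache does the opposite:

# Small cache: minimal memory use, more frequent Parquet reads
backend_small <- create_parquet_backend("my_connectomes", cache_size = 5)

# Larger cache: more matrices held in memory, fewer disk reads
backend_large <- create_parquet_backend("my_connectomes", cache_size = 25)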
For operations that need all matrices, consider loading in batches:
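A minimal sketch using the batched loader described later in this vignette:

# Load the 50 example matrices in batches of 10 rather than all at once
all_conns <- sample$load_connectomes_batched(
  indices = 1:50,
  batch_size = 10
)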
New in v0.2.5: riemtan now supports cross-platform parallel processing that works seamlessly with the Parquet backend.
Configure parallel processing using the futureverse framework:
library(riemtan)
# Enable parallel processing (works on all platforms including Windows!)
set_parallel_plan("multisession", workers = 4)
# Check status
is_parallel_enabled() # TRUE
get_n_workers() # 4
# Create Parquet-backed sample
backend <- create_parquet_backend("large_dataset", cache_size = 20)
sample <- CSample$new(backend = backend, metric_obj = airm)

All major operations automatically use parallel processing when beneficial:
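For example (the same methods used earlier; actual speedups depend on your hardware and dataset size):

# With the multisession plan active, these computations are distributed
# across the 4 configured workers
sample$compute_tangents()
sample$compute_vecs()
sample$compute_fmean()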
For very large Parquet datasets, use batch loading with parallel I/O and memory management:
# Load specific subset in batches
subset_conns <- sample$load_connectomes_batched(
  indices = 1:500,  # Load first 500 matrices
  batch_size = 50,  # 50 matrices per batch
  progress = TRUE   # Show progress
)
# This loads 500 matrices in 10 batches, clearing cache between batches
# Each batch is loaded in parallel for 5-10x speedup

To gauge what parallelization buys you, compare timings under different plans:

# Sequential (default if not configured)
set_parallel_plan("sequential")
system.time(sample$compute_tangents()) # Baseline
# Parallel with 4 workers
set_parallel_plan("multisession", workers = 4)
system.time(sample$compute_tangents()) # 3-4x faster
# Parallel with 8 workers
set_parallel_plan("multisession", workers = 8)
system.time(sample$compute_tangents()) # 6-8x faster

Enable progress bars to monitor long-running operations:
# Install progressr for progress bars (optional)
install.packages("progressr")
# Enable parallel processing
set_parallel_plan("multisession", workers = 4)
# All operations support progress parameter
sample$compute_tangents(progress = TRUE)
sample$compute_vecs(progress = TRUE)
sample$compute_fmean(progress = TRUE)
# Batch loading with progress
conns <- sample$load_connectomes_batched(
  indices = 1:1000,
  batch_size = 100,
  progress = TRUE # Shows "Batch 1/10: loading matrices 1-100"
)

1. Choose appropriate worker count:
# Conservative (leave cores for system)
set_parallel_plan("multisession", workers = parallel::detectCores() - 1)
# Maximum performance (use all cores)
set_parallel_plan("multisession", workers = parallel::detectCores())2. Memory management with parallelization:
# Each worker loads its own copy of cached data
# With 4 workers, a per-worker cache of 20 matrices, and 200x200 matrices (~320 KB each):
# Memory ≈ 4 workers × 20 cached matrices × 320 KB ≈ 25 MB
# Use a smaller cache with more workers
backend <- create_parquet_backend("dataset", cache_size = 10)
set_parallel_plan("multisession", workers = 8)

3. Reset when done:
Parallel processing is automatically disabled for small datasets to avoid overhead:
# Small dataset (n < 10): Uses sequential processing automatically
small_sample <- CSample$new(conns = small_connectomes[1:5], metric_obj = airm)
small_sample$compute_tangents() # Sequential (no overhead)
# Large dataset (n >= 10): Uses parallel processing automatically
large_sample <- CSample$new(backend = backend, metric_obj = airm)
large_sample$compute_tangents() # Parallel (if plan is active)

Different strategies for different environments:
# multisession: Works on all platforms (Windows, Mac, Linux)
set_parallel_plan("multisession", workers = 4)
# multicore: Unix only, lower overhead
set_parallel_plan("multicore", workers = 4) # Auto-fallback to multisession on Windows
# cluster: For remote/distributed computing
set_parallel_plan("cluster", workers = c("node1", "node2"))Access metadata stored with your Parquet data:
Customize the Parquet file naming pattern:
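The sketch below assumes a file-prefix argument to write_connectomes_to_parquet(); the name file_prefix is hypothetical, so check the function's help page for the actual argument:

# Hypothetical example: `file_prefix` is an assumed argument name, not verified
write_connectomes_to_parquet(
  connectomes,
  output_dir = "my_connectomes_custom",
  file_prefix = "connectome" # e.g. connectome_0001.parquet instead of matrix_0001.parquet
)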
For very large datasets (1000+ matrices):
# Use minimal cache
backend <- ParquetBackend$new("large_dataset", cache_size = 3)
# Compute statistics without loading all matrices at once
sample <- CSample$new(backend = backend, metric_obj = airm)
sample$compute_tangents()
sample$compute_vecs()
# Operations that don't need all matrices in memory
sample$compute_fmean(batch_size = 32) # Uses batching

The Parquet backend provides lazy, on-demand loading of SPD matrices, a configurable in-memory cache, batched loading for very large samples, stored metadata and provenance, and full compatibility with CSample, CSuperSample, and the parallel processing features described above.
For small to medium datasets that fit in memory, the traditional list approach remains perfectly valid. For large brain connectome studies, the Parquet backend enables analysis at scale.