mira.topics.make_model

mira.topics.make_model(n_samples, n_features, *, feature_type, highly_variable_key=None, exogenous_key=None, endogenous_key=None, counts_layer=None, categorical_covariates=None, continuous_covariates=None, covariates_keys=None, extra_features_keys=None, **model_parameters)

Instantiates a topic model, which learns regulatory “topics” from single-cell RNA-seq or ATAC-seq data. Topics capture patterns of covariance between genes or cis-regulatory elements. Each cell is represented by a composition over topics, and each topic corresponds to the activation of a set of co-regulated elements.

You can run enrichment analysis on topics to identify the signaling pathways and transcription factors that drive cell states. You can also embed the cell-topic distributions to visualize and cluster cells and to perform pseudotime trajectory inference.

When working with batched data, the parameters of the topic model are optimized using the novel CODAL (COvariate Disentangling Augmented Loss) objective, which shows state-of-the-art performance for detecting batch-confounded cell types.

Parameters
n_samples : int

Number of samples in the dataset, used to choose hyperparameters for the model.

n_features : int

Number of features in the dataset, used to choose hyperparameters for the model.

feature_type : {'expression', 'accessibility'}

Modality of the data being modeled.

highly_variable_key : str, default=None

Column in AnnData that marks features to be modeled. These features should include all elements used for enrichment analysis of topics. For expression data, this should be highly variable genes relevant to your system (the top ~4000 appear to work well). For accessibility data, all called peaks may be used.

exogenous_key : str, default=None

Same as highly_variable_key, included for backwards compatibility.

endogenous_key : str, default=None

Column in AnnData that marks features to be used by the encoder neural network. These features should prioritize elements that distinguish between populations, like highly variable genes. If None, the model uses the features supplied to highly_variable_key (or exogenous_key).

counts_layer : str, default=None

Layer in AnnData that contains raw counts for modeling.

categorical_covariates : str, list[str], np.ndarray[str], or None, default=None

Categorical covariates in the dataset. For example, batch of origin, donor, assay chemistry, sequencing machine, etc.

continuous_covariates : str, list[str], np.ndarray[str], or None, default=None

Continuous covariates in the dataset. For example, FRIP score (ATAC-seq), percentage of mitochondrial reads (RNA-seq), or other QC metrics.

extra_features_keys : str, list[str], np.ndarray[str], or None, default=None

Columns in anndata.obs which contain extra features for the encoder neural network.

Returns
topic model

A CODAL (if there are technical covariates in the dataset) or MIRA topic model. Hyperparameters of the topic model are chosen based on the supplied dataset properties.

Other Parameters
cost_beta : float > 0, default=1.

Multiplier of the regularization loss terms (KL divergence and mutual information regularization) relative to the reconstruction loss term. Smaller datasets (<10K cells) sometimes require a larger cost_beta (1.25 to 2.0), while larger datasets (>10K cells) always work well with cost_beta=1. This parameter is set automatically to a reasonable value based on the size of the dataset provided to this function.

num_topics : int, default=16

Number of topics to learn from data.

hidden : int, default=128

Number of nodes to use in hidden layers of the encoder network.

num_layers : int, default=3

Number of layers to use in the encoder network, including the output layer.

num_epochs : int, default=40

Number of epochs to train topic model. The One-cycle learning rate policy requires a pre-defined training length, and 40 epochs is usually an overestimate of the optimal number of epochs to train for.

decoder_dropout : float (0., 1.), default=0.2

Dropout rate for the decoder network. Prevents node collapse.

encoder_dropout : float (0., 1.), default=0.2

Dropout rate for the encoder network. Prevents overfitting.

use_cuda : boolean, default=True

Try using CUDA GPU speedup while training.

seed : int, default=None

Random seed for weight initialization. Enables reproducible initialization of the model.

min_learning_rate : float, default=1e-6

Starting learning rate for the One-cycle learning rate policy.

max_learning_rate : float, default=1e-1

Peak learning rate for One-cycle policy.
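
As a rough illustration of how min_learning_rate and max_learning_rate bound the One-cycle policy, the sketch below ramps the learning rate linearly up to the peak and then anneals it back down with a cosine curve. The function name one_cycle_lr, the 30% warm-up fraction, and the return to min_learning_rate at the end are assumptions made for illustration; they are not taken from MIRA's scheduler.

```python
import math

def one_cycle_lr(step, total_steps, min_lr=1e-6, max_lr=1e-1, warmup_frac=0.3):
    """Hypothetical one-cycle schedule: linear warm-up, then cosine decay."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    last_step = total_steps - 1
    warmup_steps = max(1, int(warmup_frac * last_step))
    if step <= warmup_steps:
        # Ramp from min_lr up to the peak, max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine anneal from max_lr back down to min_lr.
    progress = (step - warmup_steps) / (last_step - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# One value per optimizer step; the training length must be known up front,
# which is why num_epochs has to be fixed before training starts.
schedule = [one_cycle_lr(s, total_steps=100) for s in range(100)]
```

Because the schedule depends on total_steps, changing num_epochs reshapes the whole curve rather than just truncating it.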

batch_size : int, default=64

Minibatch size for stochastic gradient descent while training. Larger batch sizes train faster, but may produce less optimal models.

initial_pseudocounts : int, default=50

Initial pseudocounts allocated to the approximated hierarchical Dirichlet prior. More pseudocounts produce smoother topics; fewer pseudocounts produce sparser topics.
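
The link between pseudocounts and sparsity can be seen with a plain symmetric Dirichlet, sketched below using only the standard library. This is an assumption-laden illustration of the general principle: the alpha argument stands in for a per-topic pseudocount, and how MIRA maps initial_pseudocounts onto its approximated hierarchical prior is not shown here.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic composition from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_share(alpha, num_topics=16, n_draws=500, seed=0):
    """Average share held by the largest topic; higher means sparser."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(n_draws))
    return sum(shares) / n_draws

sparse = mean_top_share(alpha=0.1)   # few pseudocounts per topic
smooth = mean_top_share(alpha=10.0)  # many pseudocounts per topic
print(f"largest-topic share: sparse={sparse:.2f}, smooth={smooth:.2f}")
```

With a small alpha, most of each composition is concentrated in one or two topics; with a large alpha, the mass is spread nearly evenly.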

nb_parameterize_logspace : boolean, default=True

Parameterize the negative-binomial distribution using log-space probability estimates of gene expression. This is more numerically stable.

embedding_size : int > 0 or None, default=None

Number of nodes in the first encoder neural network layer. The default of None gives an embedding size equal to hidden.

kl_strategy : {'monotonic', 'cyclic'}, default='cyclic'

Whether to anneal KL term using monotonic or cyclic strategies. Cyclic may produce slightly better models.
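
The difference between the two kl_strategy options can be sketched as a weight on the KL term that changes over training. The function below is a minimal illustration under assumptions: the name kl_weight, the number of cycles, and the choice to ramp over the first half of each cycle are hypothetical and may differ from MIRA's implementation.

```python
def kl_weight(step, total_steps, strategy="cyclic", n_cycles=3, ramp_frac=0.5):
    """Hypothetical KL annealing weight in [0, 1] for a given training step."""
    if strategy not in ("monotonic", "cyclic"):
        raise ValueError("strategy must be 'monotonic' or 'cyclic'")
    # Monotonic annealing is treated as a single cycle spanning all of training.
    cycles = 1 if strategy == "monotonic" else n_cycles
    cycle_len = total_steps / cycles
    # Position within the current cycle, in [0, 1).
    position = (step % cycle_len) / cycle_len
    # Ramp linearly from 0 to 1 during the first ramp_frac of the cycle, then hold at 1.
    return min(1.0, position / ramp_frac)

monotonic = [kl_weight(s, 90, "monotonic") for s in range(90)]
cyclic = [kl_weight(s, 90, "cyclic") for s in range(90)]
```

Under the cyclic strategy the weight resets to zero at the start of each cycle, which periodically relaxes the regularization before it is ramped up again.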

CODAL models only
dependence_lr : float > 0, default=1e-4

Learning rate for tuning the mutual information estimator.

dependence_hidden : int > 0, default=64

Hidden size of the mutual information estimator.

weight_decay : float > 0, default=0.001

Weight decay of the topic model weight optimizer.

min_momentum : float > 0, default=0.85

Minimum momentum for the One-cycle learning rate policy.

max_momentum : float > 0, default=0.95

Maximum momentum for the One-cycle learning rate policy.

covariates_hidden : int > 0, default=32

Number of nodes in the single layer of the technical effect network.

covariates_dropout : float > 0, default=0.05

Dropout applied to the technical effect network.

mask_dropout : float > 0, default=0.05

Bernoulli corruption rate of technical effect predictions during training.

marginal_estimation_size : int > 0, default=256

Number of pairings used to estimate mutual information at each step.

dependence_beta : float > 0, default=1.

The weight of the mutual information cost at each step is cost_beta * dependence_beta. Setting this value above 1 weights mutual information regularization more heavily than KL-divergence regularization in the loss.
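
The arithmetic behind cost_beta and dependence_beta can be written out as a small sketch. The helper names below are hypothetical, and the sketch ignores KL annealing and any other scaling the model applies during training; it only shows how the two multipliers combine as described above.

```python
def regularization_weights(cost_beta=1.0, dependence_beta=1.0):
    """Multipliers on the KL and mutual information terms (annealing ignored)."""
    return {
        "kl_divergence": cost_beta,
        "mutual_information": cost_beta * dependence_beta,
    }

def total_loss(reconstruction, kl, mutual_information,
               cost_beta=1.0, dependence_beta=1.0):
    """Combine the three loss terms using the weights above."""
    w = regularization_weights(cost_beta, dependence_beta)
    return (reconstruction
            + w["kl_divergence"] * kl
            + w["mutual_information"] * mutual_information)

# With cost_beta=1.25 and dependence_beta=2.0, the mutual information term
# is weighted twice as heavily as the KL term.
weights = regularization_weights(cost_beta=1.25, dependence_beta=2.0)
```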

Accessibility models only
embedding_dropout : float > 0, default=0.05

Bernoulli corruption of bag of peaks input to DAN encoder.
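
Bernoulli corruption of a bag of peaks can be pictured as dropping each accessible peak independently with a fixed probability. The sketch below is a minimal illustration, assuming a cell is represented as a list of peak indices; the function name and this representation are hypothetical and are not taken from MIRA's encoder code.

```python
import random

def corrupt_bag_of_peaks(peak_indices, drop_rate=0.05, rng=None):
    """Drop each peak independently with probability drop_rate."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = rng or random.Random()
    # Keep a peak when its uniform draw falls at or above the drop rate.
    return [p for p in peak_indices if rng.random() >= drop_rate]

# A cell with 10,000 accessible peaks loses roughly 5% of them per pass.
cell_peaks = list(range(10_000))
corrupted = corrupt_bag_of_peaks(cell_peaks, drop_rate=0.05, rng=random.Random(0))
```

Because a fresh mask is drawn each time, the encoder sees a slightly different version of every cell on each pass.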

atac_encoder : {'fast', 'skipDAN', 'DAN'}, default='skipDAN'

Which type of ATAC encoder to use. “skipDAN”, the default, gives the best results but is impractical to train on a CPU. If the model is instantiated without a GPU, it raises an error and suggests the “fast” encoder.

The “fast” encoder skips the large embedding layer of the DAN models and calculates a first-pass LSI projection of the data.

Examples

>>> import mira
>>> # rna_data is assumed to be an AnnData object that is already loaded
>>> model = mira.topics.make_model(
...     *rna_data.shape,
...     feature_type = 'expression',
...     highly_variable_key = 'highly_variable',
...     counts_layer = 'rawcounts',
...     categorical_covariates = ['batch','donor'],
...     continuous_covariates = ['FRIP']
... )