mira.topics.make_model
- mira.topics.make_model(n_samples, n_features, *, feature_type, highly_variable_key=None, exogenous_key=None, endogenous_key=None, counts_layer=None, categorical_covariates=None, continuous_covariates=None, covariates_keys=None, extra_features_keys=None, **model_parameters)
Instantiates a topic model, which learns regulatory “topics” from single-cell RNA-seq or ATAC-seq data. Topics capture patterns of covariance between genes or cis-regulatory elements. Each cell is represented by a composition over topics, and each topic corresponds to the activation of a set of co-regulated elements.
Enrichment analysis of topics can reveal the signaling and transcription factor drivers of cell states. Embeddings of cell-topic distributions can be used to visualize and cluster cells and to perform pseudotime trajectory inference.
When working with batched data, the parameters of the topic model are optimized using the novel CODAL (COvariate Disentangling Augmented Loss) objective, which shows state-of-the-art performance for detection of batch-confounded cell types.
- Parameters
- n_samples: int
Number of samples in the dataset, used to choose hyperparameters for the model.
- n_features: int
Number of features in the dataset, used to choose hyperparameters for the model.
- feature_type: {'expression', 'accessibility'}
Modality of the data being modeled.
- highly_variable_key: str, default=None
Column in AnnData that marks features to be modeled. These features should include all elements used for enrichment analysis of topics. For expression data, this should be highly variable genes relevant to your system (roughly the top 4,000 appear to work well). For accessibility data, all called peaks may be used.
- exogenous_key: str, default=None
Same as highly_variable_key, included for backwards compatibility.
- endogenous_key: str, default=None
Column in AnnData that marks features to be used by the encoder neural network. These features should prioritize elements that distinguish between populations, like highly variable genes. If None, the model uses the features supplied to exogenous_key.
- counts_layer: str, default=None
Layer in AnnData that contains raw counts for modeling.
- categorical_covariates: str, list[str], np.ndarray[str], or None, default=None
Categorical covariates in the dataset, for example batch of origin, donor, assay chemistry, or sequencing machine.
- continuous_covariates: str, list[str], np.ndarray[str], or None, default=None
Continuous covariates in the dataset, for example FRIP score (ATAC-seq), percentage of mitochondrial reads (RNA-seq), or other QC metrics.
- extra_features_keys: str, list[str], np.ndarray[str], or None, default=None
Columns in anndata.obs that contain extra features for the encoder neural network.
- Returns
- topic model
A CODAL (if there are technical covariates in the dataset) or MIRA topic model. Hyperparameters of the topic model are chosen based on the supplied dataset properties.
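The fallbacks described in the parameter and return descriptions above can be sketched as follows. This is an illustrative stand-in, not MIRA's implementation: `resolve_arguments`, `_as_list`, and the returned dictionary keys are hypothetical names. The only rules it encodes are the ones stated above: `exogenous_key` is an alias for `highly_variable_key`, a missing `endogenous_key` falls back to the modeled features, and supplying any covariate yields a CODAL model.

```python
def _as_list(value):
    """Normalize None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def resolve_arguments(
    highly_variable_key=None,
    exogenous_key=None,
    endogenous_key=None,
    categorical_covariates=None,
    continuous_covariates=None,
):
    """Hypothetical helper mirroring the documented fallbacks (not MIRA code)."""
    # exogenous_key is documented as a backwards-compatible alias.
    modeled = highly_variable_key if highly_variable_key is not None else exogenous_key
    # Without an endogenous_key, the encoder uses the modeled features.
    encoder = endogenous_key if endogenous_key is not None else modeled

    categorical = _as_list(categorical_covariates)
    continuous = _as_list(continuous_covariates)
    # Any technical covariate means a CODAL model is returned.
    model_kind = "CODAL" if (categorical or continuous) else "MIRA"

    return {
        "modeled_features": modeled,
        "encoder_features": encoder,
        "categorical_covariates": categorical,
        "continuous_covariates": continuous,
        "model_kind": model_kind,
    }


config = resolve_arguments(
    highly_variable_key="highly_variable",
    categorical_covariates="batch",
)
print(config["model_kind"], config["encoder_features"])  # CODAL highly_variable
```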
- Other Parameters
- cost_beta: float > 0, default=1.
Multiplier of the regularization loss terms (KL divergence and mutual information regularization) versus the reconstruction loss term. Smaller datasets (<10K cells) sometimes require a larger cost_beta (1.25-2.), while larger datasets (>10K cells) always work well with cost_beta=1. This parameter is automatically set to a reasonable value based on the size of the dataset provided to this function.
- num_topics: int, default=16
Number of topics to learn from data.
- hidden: int, default=128
Number of nodes to use in hidden layers of the encoder network.
- num_layers: int, default=3
Number of layers to use in the encoder network, including the output layer.
- num_epochs: int, default=40
Number of epochs to train topic model. The One-cycle learning rate policy requires a pre-defined training length, and 40 epochs is usually an overestimate of the optimal number of epochs to train for.
- decoder_dropout: float (0., 1.), default=0.2
Dropout rate for the decoder network. Prevents node collapse.
- encoder_dropout: float (0., 1.), default=0.2
Dropout rate for the encoder network. Prevents overfitting.
- use_cuda: boolean, default=True
Try to use CUDA GPU acceleration while training.
- seed: int, default=None
Random seed for weight initialization. Enables reproducible initialization of the model.
- min_learning_rate: float, default=1e-6
Starting learning rate for the One-cycle learning rate policy.
- max_learning_rate: float, default=1e-1
Peak learning rate for the One-cycle learning rate policy.
- batch_size: int, default=64
Minibatch size for stochastic gradient descent while training. Larger batch sizes train faster, but may produce less optimal models.
- initial_pseudocounts: int, default=50
Initial pseudocounts allocated to the approximated hierarchical Dirichlet prior. More pseudocounts produce smoother topics; fewer pseudocounts produce sparser topics.
- nb_parameterize_logspace: boolean, default=True
Parameterize the negative-binomial distribution using log-space probability estimates of gene expression. This is more numerically stable.
- embedding_size: int > 0 or None, default=None
Number of nodes in the first layer of the encoder neural network. The default of None sets the embedding size equal to hidden.
- kl_strategy: {'monotonic', 'cyclic'}, default='cyclic'
Whether to anneal the KL term using a monotonic or cyclic strategy. Cyclic may produce slightly better models.
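To make the roles of num_epochs, min_learning_rate, and max_learning_rate concrete, here is a minimal sketch of a one-cycle schedule. It assumes a linear warm-up over the first 30% of steps followed by cosine decay back to the starting rate. The warm-up fraction, the function name `one_cycle_lr`, and the example dataset size are illustrative choices, and the schedule MIRA actually uses may differ in shape.

```python
import math


def one_cycle_lr(step, total_steps, min_lr=1e-6, max_lr=1e-1, warmup_frac=0.3):
    """Learning rate at `step` (0-based) for a simple one-cycle schedule.

    Assumption: linear warm-up to max_lr, then cosine decay back to min_lr.
    """
    if total_steps < 2:
        return max_lr
    warmup_steps = max(1, int(warmup_frac * (total_steps - 1)))
    if step <= warmup_steps:
        # Ramp linearly from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay from max_lr back down to min_lr.
    progress = (step - warmup_steps) / (total_steps - 1 - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# The schedule length must be fixed up front: epochs * minibatches per epoch.
num_epochs, n_samples, batch_size = 40, 10_000, 64
total_steps = num_epochs * math.ceil(n_samples / batch_size)
schedule = [one_cycle_lr(s, total_steps) for s in range(total_steps)]
```

Because every step's rate depends on `total_steps`, the training length has to be chosen before training starts, which is why num_epochs is a fixed parameter rather than an early-stopping limit.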
- CODAL models only
- dependence_lr: float > 0, default=1e-4
Learning rate for tuning the mutual information estimator.
- dependence_hidden: int > 0, default=64
Hidden size of the mutual information estimator.
- weight_decay: float > 0, default=0.001
Weight decay of the topic model weight optimizer.
- min_momentum: float > 0, default=0.85
Minimum momentum for the One-cycle learning rate policy.
- max_momentum: float > 0, default=0.95
Maximum momentum for the One-cycle learning rate policy.
- covariates_hidden: int > 0, default=32
Number of nodes in the single layer of the technical effect network.
- covariates_dropout: float > 0, default=0.05
Dropout applied to the technical effect network.
- mask_dropout: float > 0, default=0.05
Bernoulli corruption rate of technical effect predictions during training.
- marginal_estimation_size: int > 0, default=256
Number of pairings used to estimate mutual information at each step.
- dependence_beta: float > 0, default=1.
The weight of the mutual information cost at each step is cost_beta * dependence_beta. Setting this value above 1 weights mutual information regularization more heavily than KL-divergence regularization in the loss.
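The relationship between cost_beta and dependence_beta can be written out as simple arithmetic. The sketch below assumes the three loss terms are combined additively, as the parameter descriptions imply, and it ignores KL annealing (kl_strategy). The function `total_loss` and the example loss values are hypothetical and are not taken from MIRA's code.

```python
def total_loss(reconstruction, kl_divergence, mutual_information,
               cost_beta=1.0, dependence_beta=1.0):
    """Combine loss terms using the weights described for CODAL models.

    Assumption: terms are summed. The KL term is weighted by cost_beta and
    the mutual information term by cost_beta * dependence_beta.
    """
    kl_weight = cost_beta
    mi_weight = cost_beta * dependence_beta
    return reconstruction + kl_weight * kl_divergence + mi_weight * mutual_information


# Made-up loss values, for illustration only.
baseline = total_loss(100.0, 10.0, 2.0)
# Raising dependence_beta above 1 up-weights only the mutual information term.
stronger_mi = total_loss(100.0, 10.0, 2.0, dependence_beta=2.0)
print(baseline, stronger_mi)  # 112.0 114.0
```

Raising cost_beta instead scales both regularization terms together, which is why the two parameters are documented separately.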
- Accessibility models only
- embedding_dropout: float > 0, default=0.05
Bernoulli corruption rate applied to the bag-of-peaks input to the DAN encoder.
- atac_encoder: str in {"fast", "skipDAN", "DAN"}, default="skipDAN"
Which type of ATAC encoder to use. "skipDAN", the default, gives the best results but is impractical to train on a CPU. If the model is instantiated without a GPU, it raises an error and suggests the "fast" encoder.
The "fast" encoder skips the large embedding layer of the DAN models and calculates a first-pass LSI projection of the data.
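As a rough illustration of what Bernoulli corruption of a bag-of-peaks input means, the sketch below drops each accessible peak index from a cell independently with probability embedding_dropout. The function name `corrupt_bag_of_peaks` and the toy peak indices are hypothetical, and MIRA's encoder may implement the corruption differently (for example, on tensors and only during training).

```python
import random


def corrupt_bag_of_peaks(peak_indices, embedding_dropout=0.05, seed=None):
    """Drop each peak independently with probability `embedding_dropout`."""
    rng = random.Random(seed)
    # Keep a peak when its uniform draw falls at or above the dropout rate.
    return [p for p in peak_indices if rng.random() >= embedding_dropout]


cell_peaks = list(range(1000))  # toy cell with 1000 accessible peaks
kept = corrupt_bag_of_peaks(cell_peaks, embedding_dropout=0.05, seed=0)
print(len(kept))  # about 95% of peaks are expected to survive
```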
Examples
>>> model = mira.topics.make_model(
...     *rna_data.shape,
...     feature_type = 'expression',
...     highly_variable_key = 'highly_variable',
...     counts_layer = 'rawcounts',
...     categorical_covariates = ['batch','donor'],
...     continuous_covariates = ['FRIP']
... )