mira.topics.BayesianTuner

class mira.topics.BayesianTuner(*, model, save_name, min_topics, max_topics, storage='sqlite:///mira-tuning.db', n_jobs=1, max_trials=128, min_trials=48, stop_condition=12, seed=2556, tensorboard_logdir='runs', model_dir='models', pruner=None, sampler=None, log_steps=False, log_every=10, train_size=0.8)

A BayesianTuner object chooses the number of topics and the appropriate regularization to produce a model that best fits the user’s dataset.

The process consists of iteratively training a model using a set of hyperparameters, evaluating the resulting model, and choosing the next set of hyperparameters based on which set is most likely to yield an improvement over previous models trained.

Parameters
model : mira.topics.TopicModel

Topic model to tune. The provided model should have columns specified to retrieve endogenous and exogenous features, and should have its learning rate configured using get_learning_rate_bounds.

save_name : str (required)

Name of the table in storage under which to save tuning results. A good pattern to follow is: dataset/modality/model_id/tuning_run.

min_topics : int (required)

Minimum number of topics to try.

max_topics : int (required)

Maximum number of topics to try.

storage : str or mira.topics.Redis(), default = ‘sqlite:///mira-tuning.db’

The default value saves tuning results in an SQLite database at ./mira-tuning.db. SQLite requires no outside libraries, but can only handle read-write access from up to 5 concurrent processes. Tuning can be sped up significantly by running more concurrent processes, which requires a Redis database backend with faster read-write speeds.

To use the Redis backend, start a Redis server in the background, and pass a mira.topics.Redis() object to this parameter. Adjust the url as needed.

n_jobs : int>0, default = 1

Number of concurrent trials to run at a time. The default SQLite backend can handle up to 5, but the Redis backend can handle many more (>20).

Each trial’s memory footprint is essentially that of the model parameters, the optimizer, and one batch of training data (because the dataset is saved to disk and streamed batch-by-batch during model training). Thus, training a model on 200K cells requires the same memory as training on 1000 cells. We suggest taking advantage of the low memory overhead to train concurrently across as many cores as possible.

tensorboard_logdir : str, default = ‘runs’

Directory in which to save tensorboard log files.

min_trials : int>0, default = 48

Minimum number of trials to run.

max_trials : int>0, default = 128

Maximum number of trials to run. If better models are still being found, tuning continues until this number of trials is reached.

stop_condition : int>0, default = 12

Continue tuning until a better model has not been produced for this many iterations.

model_dir : path, default = ‘models’

Where to save the best models trained during tuning.

pruner : None or optuna.pruners.Pruner, default = None

If None, uses the default SuccessiveHalving bandit pruner.

sampler : None or optuna.samplers.BaseSampler, default = None

If None, uses MIRA’s default choice of a Gaussian Process sampler with pruning.

log_steps : boolean, default = False

Whether to save loss at every step of training. Useful for debugging, but slows down tuning.
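The interaction of min_trials, max_trials, and stop_condition described above can be read as a simple stopping rule. The sketch below is one plausible interpretation, not MIRA's implementation: the function name should_stop is hypothetical, lower objective values are assumed to be better, and the tuner's exact counting may differ.

```python
def should_stop(values, min_trials=48, max_trials=128, stop_condition=12):
    """Decide whether tuning should stop, given the objective values of
    finished trials in the order they completed.

    Assumption: lower values are better (the objective is a loss).
    """
    n = len(values)
    if n >= max_trials:
        # Hard cap on the number of trials.
        return True
    if n < min_trials:
        # Always run at least min_trials trials.
        return False
    # Index of the first trial that achieved the best value so far.
    best_index = min(range(n), key=values.__getitem__)
    trials_since_best = n - 1 - best_index
    # Stop once no better model has appeared for stop_condition trials.
    return trials_since_best >= stop_condition
```

With the defaults, tuning under this reading would run at least 48 trials, stop early once 12 consecutive trials fail to improve on the best value, and never exceed 128 trials.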

Examples

>>> tuner = mira.topics.BayesianTuner(
...     model = model,
...     min_topics = 5,
...     max_topics = 55,
...     n_jobs = 1,
...     save_name = 'tuning/rna/0',
... )
>>> tuner.fit(data)
>>> model = tuner.fetch_best_weights()

Attributes
study : optuna.study.Study

Optuna study object summarizing tuning results.

trial_attrs : list of dicts

Data for each trial.

Methods

fetch_best_weights()

Fetch the weights of the best topic model trained during tuning.

fetch_weights(trial_num)

Fetch topic model weights trained in the given trial from disk.

fit(train[, test])

Run Bayesian optimization scheme for topic model hyperparameters.

load(*, model, save_name[, storage])

Load a tuning run from the given storage object.

plot_intermediate_values([palette, hue, ax, ...])

Plots the evaluation loss achieved at each epoch of training for all of the trials.

plot_pareto_front([x, y, hue, ax, figsize, ...])

Relational plot of tuning trials data.

purge()

If tuning is stopped with some trials in progress, those trials will be saved as "zombie" trials, doomed never to be completed.

classmethod load(*, model, save_name, storage='sqlite:///mira-tuning.db')

Load a tuning run from the given storage object.

purge()

If tuning is stopped with some trials in progress, those trials will be saved as “zombie” trials, doomed never to be completed. Upon restart of tuning, those zombie trials can interfere with the selection of hyperparameters.

This function changes the state of all RUNNING trials to FAILED.
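The state change performed by purge can be illustrated with plain data. This is a minimal sketch in which dictionaries stand in for trial records; it does not use the Optuna storage API, and the helper name purge_running is hypothetical.

```python
def purge_running(trials):
    """Mark every RUNNING trial as FAILED and return how many were changed."""
    changed = 0
    for trial in trials:
        if trial["state"] == "RUNNING":
            trial["state"] = "FAILED"
            changed += 1
    return changed


# Toy trial records: two trials were interrupted mid-run.
trials = [
    {"number": 0, "state": "COMPLETE"},
    {"number": 1, "state": "RUNNING"},
    {"number": 2, "state": "PRUNED"},
    {"number": 3, "state": "RUNNING"},
]
n_purged = purge_running(trials)
```

Completed and pruned trials are left untouched, so their results remain available when tuning restarts.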

fit(train, test=None)

Run Bayesian optimization scheme for topic model hyperparameters. This function launches multiple concurrent training processes to evaluate hyperparameter combinations. All processes are launched on the same node. Evaluate the memory usage of a single MIRA topic model to determine the number of workers.

Parameters
train : anndata.AnnData

AnnData of expression or accessibility data. If test is not provided, this dataset will be partitioned into train and test sets according to the ratio given by the train_size parameter.

test : anndata.AnnData, default = None

AnnData of expression or accessibility data. Evaluation set of cells.

Returns
mira.topics.TopicModel

Topic model trained with the best set of hyperparameters found during tuning.
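When test is omitted, the partition by train_size can be pictured as a split of cell indices. The sketch below assumes a seeded random shuffle followed by a cut at the train_size fraction; whether MIRA shuffles, and how it uses seed, is an assumption here, and split_indices is a hypothetical helper. The default values 0.8 and 2556 are taken from the class signature.

```python
import random


def split_indices(n_cells, train_size=0.8, seed=2556):
    """Split range(n_cells) into disjoint train and test index lists."""
    rng = random.Random(seed)          # local RNG, so global state is untouched
    indices = list(range(n_cells))
    rng.shuffle(indices)
    n_train = int(round(train_size * n_cells))
    # Sort each part so the indices are convenient for subsetting.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_indices(1000)
```

Index lists like these could then be used to subset an AnnData object, for example adata[train_idx] and adata[test_idx].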

fetch_weights(trial_num)

Fetch topic model weights trained in the given trial from disk. Can only fetch weights from trials which were not pruned.

Parameters
trial_num : int

Trial number for which to fetch weights.

Returns
mira.topics.TopicModel

Raises
ValueError

If the trial does not exist.

KeyError

If the trial did not finish.
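Because fetch_weights raises for missing or unfinished trials, code that loops over several trial numbers may want to skip those that cannot be loaded. The sketch below relies only on the documented fetch_weights(trial_num) call and its two exceptions; fetch_finished_weights and StubTuner are hypothetical names, and the stub exists only so the example runs without a fitted tuner.

```python
def fetch_finished_weights(tuner, trial_nums):
    """Collect models for the trials that can be loaded; record the rest."""
    models, skipped = {}, []
    for trial_num in trial_nums:
        try:
            models[trial_num] = tuner.fetch_weights(trial_num)
        except (ValueError, KeyError):
            # ValueError: trial does not exist; KeyError: trial did not finish.
            skipped.append(trial_num)
    return models, skipped


class StubTuner:
    """Hypothetical stand-in that mimics the documented exceptions."""

    def __init__(self, finished, unfinished):
        self._finished = finished
        self._unfinished = set(unfinished)

    def fetch_weights(self, trial_num):
        if trial_num in self._finished:
            return self._finished[trial_num]
        if trial_num in self._unfinished:
            raise KeyError(trial_num)
        raise ValueError(f"trial {trial_num} does not exist")


stub = StubTuner(finished={0: "model-0", 2: "model-2"}, unfinished=[1])
models, skipped = fetch_finished_weights(stub, [0, 1, 2, 3])
```

With a real tuner, the same helper would be called as fetch_finished_weights(tuner, range(n)) for however many trials are of interest.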
fetch_best_weights()

Fetch the weights of the best topic model trained during tuning. This is the “official” topic model for a given dataset.

Returns
mira.topics.TopicModel

Raises
ValueError

If no trials have been completed.

plot_intermediate_values(palette='Spectral_r', hue='value', ax=None, figsize=(10, 7), log_hue=False, na_color='lightgrey', add_legend=True, vmax=None, vmin=None, **plot_kwargs)

Plots the evaluation loss achieved at each epoch of training for all of the trials.

Parameters
palette : str, default = ‘Spectral_r’

Color palette used to color the trials.

ax : matplotlib.pyplot.axes or None

Provide axes object to function for more control. If no axes are provided, they are created internally.

figsize : tuple[int, int], default = (10,7)

Size of plot.

log_hue : boolean, default = False

Take the log of the hue value to plot.

hue : {‘value’, ‘number’, ‘num_topics’, ‘decoder_dropout’, ‘rate’, ‘distortion’, … }, default = ‘value’

Which attribute of each trial to plot. For a full list of attributes, use: tuner.trial_attrs. The default “value” is the objective score.

vmin, vmax : float

Minimum and maximum bounds on continuous color palettes.

Returns
matplotlib.pyplot.axes
plot_pareto_front(x='num_topics', y='elbo', hue='number', ax=None, figsize=(7, 7), palette='Blues', na_color='lightgrey', size=100, alpha=0.8, add_legend=True, label_pareto_front=False, include_pruned_trials=True)

Relational plot of tuning trials data. Often, it is most interesting to compare the objective value (“elbo”) versus the number of topics. This serves as a sanity check that the objective is convex with respect to the number of topics and that the tuner converged on the appropriate number of topics for the dataset.

Parameters
x : str, default = ‘num_topics’

Trial attribute to plot on x-axis. Use tuner.trial_attrs to see list of possible attributes to plot.

y : str, default = ‘elbo’

Trial attribute to plot on y-axis. “elbo” and “value” plot the objective score. “distortion” plots the reconstruction loss. “rate” plots the KL-divergence loss.

hue : {‘value’, ‘number’, ‘num_topics’, ‘decoder_dropout’, ‘rate’, ‘distortion’, … }, default = ‘number’

Which attribute of each trial to plot. For a full list of attributes, use: tuner.trial_attrs.

ax : matplotlib.pyplot.axes or None

Provide axes object to function for more control. If no axes are provided, they are created internally.

figsize : tuple[int, int], default = (7,7)

Size of plot.

label_pareto_front : boolean, default = False

Only label trials on the Pareto front of distortion and rate, i.e. the best trials.

include_pruned_trials : boolean, default = True

Whether to include pruned trials in the plot.

Returns
matplotlib.pyplot.axes
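The Pareto front of distortion and rate referred to by label_pareto_front is the set of trials that no other trial beats on both losses at once. The sketch below computes it for a handful of toy records. The numbers are invented for illustration, pareto_front is a hypothetical helper, and this is not necessarily how MIRA selects the labeled trials.

```python
def pareto_front(trials):
    """Return trials not dominated in (distortion, rate); lower is better."""
    front = []
    for t in trials:
        dominated = any(
            o["distortion"] <= t["distortion"]
            and o["rate"] <= t["rate"]
            and (o["distortion"] < t["distortion"] or o["rate"] < t["rate"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front


# Toy values, not real tuning results.
trials = [
    {"number": 0, "distortion": 10.0, "rate": 5.0},
    {"number": 1, "distortion": 8.0, "rate": 6.0},
    {"number": 2, "distortion": 9.0, "rate": 7.0},
    {"number": 3, "distortion": 12.0, "rate": 4.0},
]
front_numbers = [t["number"] for t in pareto_front(trials)]
```

Here trial 2 is excluded because trial 1 has both a lower distortion and a lower rate; the remaining trials each trade one loss against the other.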