mira.topics.BayesianTuner
- class mira.topics.BayesianTuner(*, model, save_name, min_topics, max_topics, storage='sqlite:///mira-tuning.db', n_jobs=1, max_trials=128, min_trials=48, stop_condition=12, seed=2556, tensorboard_logdir='runs', model_dir='models', pruner=None, sampler=None, log_steps=False, log_every=10, train_size=0.8)
A BayesianTuner object chooses the number of topics and the appropriate regularization to produce a model that best fits the user’s dataset.
The process consists of iteratively training a model using a set of hyperparameters, evaluating the resulting model, and choosing the next set of hyperparameters based on which set is most likely to yield an improvement over previous models trained.
- Parameters
- model : mira.topics.TopicModel
Topic model to tune. The provided model should have columns specified to retrieve endogenous and exogenous features, and should have the learning rate configured by get_learning_rate_bounds.
- save_name : str (required)
Name of the table under which to save tuning results in storage. A good pattern to follow is: dataset/modality/model_id/tuning_run.
- min_topics : int (required)
Minimum number of topics to try.
- max_topics : int (required)
Maximum number of topics to try.
- storage : str or mira.topics.Redis(), default = 'sqlite:///mira-tuning.db'
The default value saves tuning results in an SQLite table at ./mira-tuning.db. SQLite requires no outside libraries, but can only handle read-write for up to 5 concurrent processes. Tuning can be sped up significantly by running more concurrent processes, which requires a Redis database backend with faster read-write speeds.
To use the Redis backend, start a Redis server in the background and pass a mira.topics.Redis() object to this parameter. Adjust the URL as needed.
- n_jobs : int>0, default = 1
Number of trials to run concurrently. The default SQLite backend can handle up to 5, but the Redis backend can handle many more (>20).
Each trial’s memory footprint is essentially that of the model parameters, the optimizer, and one batch of training data (because the dataset is saved to disk and streamed batch-by-batch during model training). Thus, training a model on 200K cells requires the same memory as training on 1,000 cells. We suggest taking advantage of the low memory overhead to train concurrently across as many cores as possible.
- tensorboard_logdir : str, default = 'runs'
Directory in which to save tensorboard log files.
- min_trials : int>0, default = 48
Minimum number of trials to run.
- max_trials : int>0, default = 128
Maximum number of trials to run. While tuning keeps finding better models, it continues until reaching this number of trials.
- stop_condition : int>0, default = 12
Continue tuning until a better model has not been produced for this many iterations.
- model_dir : path, default = 'models'
Where to save the best models trained during tuning.
- pruner : None or optuna.pruners.BasePruner, default = None
If None, uses the default SuccessiveHalving bandit pruner.
- sampler : None or optuna.samplers.BaseSampler, default = None
If None, uses MIRA’s default choice of a Gaussian Process sampler with pruning.
- log_steps : boolean, default = False
Whether to save loss at every step of training. Useful for debugging, but slows down tuning.
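One way to read min_trials, max_trials, and stop_condition together is sketched below. This is an interpretation of the parameter descriptions above, not MIRA's actual implementation: the helper should_stop is hypothetical, and it assumes that lower objective values are better and that one score is recorded per finished trial.

```python
def should_stop(scores, min_trials=48, max_trials=128, stop_condition=12):
    """Decide whether tuning should halt (hypothetical helper).

    Assumes `scores` holds one objective value per finished trial, in the
    order the trials finished, and that lower values are better.
    """
    n = len(scores)
    # Never run more than max_trials.
    if n >= max_trials:
        return True
    # Always run at least min_trials (and at least one trial).
    if n < min_trials or n == 0:
        return False
    # Position of the first trial that reached the best score so far.
    best_index = scores.index(min(scores))
    trials_since_best = n - 1 - best_index
    # Halt once no better model has appeared for stop_condition trials.
    return trials_since_best >= stop_condition
```

Under this reading, min_trials guards against stopping too early, stop_condition ends a run that has plateaued, and max_trials caps a run that keeps improving.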
Examples
>>> tuner = mira.topics.BayesianTuner(
...     model = model,
...     min_topics = 5,
...     max_topics = 55,
...     n_jobs = 1,
...     save_name = 'tuning/rna/0',
... )
>>> tuner.fit(data)
>>> model = tuner.fetch_best_weights()
- Attributes
- study : optuna.study.Study
Optuna study object summarizing tuning results.
- trial_attrs : list of dicts
Data for each trial.
Methods
- fetch_best_weights(): Fetch weights of the best topic model trained during tuning.
- fetch_weights(trial_num): Fetch topic model weights trained in the given trial from disk.
- fit(train[, test]): Run Bayesian optimization scheme for topic model hyperparameters.
- load(*, model, save_name[, storage]): Load a tuning run from the given storage object.
- plot_intermediate_values([palette, hue, ax, ...]): Plots the evaluation loss achieved at each epoch of training for all of the trials.
- plot_pareto_front([x, y, hue, ax, figsize, ...]): Relational plot of tuning trials data.
- purge(): If tuning is stopped with some trials in progress, those trials will be saved as "zombie" trials, doomed never to be completed.
- classmethod load(*, model, save_name, storage='sqlite:///mira-tuning.db')
Load a tuning run from the given storage object.
- purge()
If tuning is stopped with some trials in progress, those trials will be saved as “zombie” trials, doomed never to be completed. Upon restart of tuning, those zombie trials can interfere with the selection of hyperparameters.
This function changes the state of all RUNNING trials to FAILED.
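The effect of this state change can be pictured with plain Python. The sketch below is only an illustration, not MIRA's implementation: the trial records are hypothetical dictionaries rather than Optuna trial objects, and purge_running is an invented helper name.

```python
def purge_running(trials):
    """Mark every RUNNING trial as FAILED; return how many were changed.

    `trials` is a list of hypothetical trial records (dicts with a
    "state" key), standing in for the trials held in storage.
    """
    changed = 0
    for trial in trials:
        if trial["state"] == "RUNNING":
            trial["state"] = "FAILED"
            changed += 1
    return changed


# Two trials were still in progress when tuning was interrupted.
trials = [
    {"number": 0, "state": "COMPLETE"},
    {"number": 1, "state": "RUNNING"},
    {"number": 2, "state": "PRUNED"},
    {"number": 3, "state": "RUNNING"},
]
n_purged = purge_running(trials)
```

Completed and pruned trials are left untouched, so the results of earlier work remain available when tuning restarts.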
- fit(train, test=None)
Run the Bayesian optimization scheme for topic model hyperparameters. This function launches multiple concurrent training processes to evaluate hyperparameter combinations. All processes are launched on the same node. Evaluate the memory usage of a single MIRA topic model to determine the number of workers.
- Parameters
- train : anndata.AnnData
AnnData of expression or accessibility data. If test is not provided, this dataset will be partitioned into train and test sets according to the ratio given by the train_size parameter.
- test : anndata.AnnData
AnnData of expression or accessibility data. Evaluation set of cells.
- Returns
- mira.topics.TopicModel
Topic model trained with the best set of hyperparameters found during tuning.
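When test is omitted, the train/test partition follows the train_size ratio. The sketch below shows one plausible way such a partition could be made over cell indices. It is an assumption for illustration: split_indices is a hypothetical helper, and whether MIRA shuffles cells or uses the seed parameter in this way is not stated in this reference.

```python
import random


def split_indices(n_cells, train_size=0.8, seed=2556):
    """Randomly partition cell indices into train and test sets.

    Hypothetical helper: returns two sorted, disjoint index lists whose
    sizes follow the train_size ratio.
    """
    rng = random.Random(seed)       # seeded for reproducibility
    indices = list(range(n_cells))
    rng.shuffle(indices)
    n_train = round(n_cells * train_size)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# With the default ratio, 1000 cells split into 800 train and 200 test.
train_idx, test_idx = split_indices(1000)
```

Passing an explicit test set bypasses any such partition, which is useful when the evaluation cells must stay fixed across tuning runs.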
- fetch_weights(trial_num)
Fetch topic model weights trained in the given trial from disk. Can only fetch weights from trials which were not pruned.
- Parameters
- trial_num : int
Trial number for which to fetch weights.
- Returns
- mira.topics.TopicModel
- Raises
- ValueError: If trial does not exist
- KeyError: If trial did not finish
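Because fetch_weights raises different exceptions for missing and unfinished trials, callers may want to handle both. The sketch below wraps the call in a hypothetical helper, fetch_weights_or_none. The _StubTuner class is an invented stand-in that mimics only the documented exceptions, so the helper can be exercised without a real tuning run.

```python
def fetch_weights_or_none(tuner, trial_num):
    """Return the model for `trial_num`, or None if it cannot be fetched."""
    try:
        return tuner.fetch_weights(trial_num)
    except ValueError:
        # Documented case: the trial does not exist.
        return None
    except KeyError:
        # Documented case: the trial did not finish.
        return None


class _StubTuner:
    """Hypothetical stand-in for a tuner; not part of MIRA."""

    def __init__(self, finished):
        # Maps trial number -> model, with None marking an unfinished trial.
        self.finished = finished

    def fetch_weights(self, trial_num):
        if trial_num not in self.finished:
            raise ValueError("trial does not exist")
        if self.finished[trial_num] is None:
            raise KeyError("trial did not finish")
        return self.finished[trial_num]


stub = _StubTuner({0: "model-0", 1: None})
available = fetch_weights_or_none(stub, 0)    # finished trial
unfinished = fetch_weights_or_none(stub, 1)   # trial did not finish
missing = fetch_weights_or_none(stub, 7)      # trial does not exist
```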
- fetch_best_weights()
Fetch the weights of the best topic model trained during tuning. This is the “official” topic model for a given dataset.
- Returns
- mira.topics.TopicModel
- Raises
- ValueError: If no trials have been completed
- plot_intermediate_values(palette='Spectral_r', hue='value', ax=None, figsize=(10, 7), log_hue=False, na_color='lightgrey', add_legend=True, vmax=None, vmin=None, **plot_kwargs)
Plots the evaluation loss achieved at each epoch of training for all of the trials.
- Parameters
- palette : str, default = 'Spectral_r'
Name of the color palette used to color each trial.
- ax : matplotlib.pyplot.axes or None
Provide an axes object to the function for more control. If no axes are provided, they are created internally.
- figsize : tuple[int, int], default = (10, 7)
Size of the plot.
- log_hue : boolean, default = False
Take the log of the hue value before plotting.
- hue : {'value', 'number', 'num_topics', 'decoder_dropout', 'rate', 'distortion', …}, default = 'value'
Which attribute of each trial to color by. For a full list of attributes, use tuner.trial_attrs. The default “value” is the objective score.
- vmin, vmax : float
Minimum and maximum bounds on continuous color palettes.
- Returns
- matplotlib.pyplot.axes
- plot_pareto_front(x='num_topics', y='elbo', hue='number', ax=None, figsize=(7, 7), palette='Blues', na_color='lightgrey', size=100, alpha=0.8, add_legend=True, label_pareto_front=False, include_pruned_trials=True)
Relational plot of tuning trials data. Often, it is most interesting to compare the objective value (“elbo”) versus the number of topics. This serves as a sanity check that the objective is convex with respect to the number of topics and that the tuner converged on the appropriate number of topics for the dataset.
- Parameters
- x : str, default = 'num_topics'
Trial attribute to plot on the x-axis. Use tuner.trial_attrs to see the list of possible attributes to plot.
- y : str, default = 'elbo'
Trial attribute to plot on the y-axis. “elbo” and “value” plot the objective score. “distortion” plots the reconstruction loss. “rate” plots the KL-divergence loss.
- hue : {'value', 'number', 'num_topics', 'decoder_dropout', 'rate', 'distortion', …}, default = 'number'
Which attribute of each trial to color by. For a full list of attributes, use tuner.trial_attrs.
- ax : matplotlib.pyplot.axes or None
Provide an axes object to the function for more control. If no axes are provided, they are created internally.
- figsize : tuple[int, int], default = (7, 7)
Size of the plot.
- label_pareto_front : boolean, default = False
Only label trials on the Pareto front of distortion and rate, i.e. the best trials.
- include_pruned_trials : boolean, default = True
Whether to include pruned trials in the plot.
- Returns
- matplotlib.pyplot.axes
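To make the notion of a Pareto front of distortion and rate concrete, the sketch below picks out the non-dominated trials from a list of trial records. It is illustrative only: pareto_front is a hypothetical helper, the records are invented dictionaries shaped loosely like entries of tuner.trial_attrs, and it assumes lower is better for both losses.

```python
def pareto_front(trials):
    """Return the trials not dominated in (distortion, rate).

    A trial is dominated if another trial is at least as good on both
    losses and strictly better on at least one. Lower is better.
    """
    front = []
    for a in trials:
        dominated = any(
            b["distortion"] <= a["distortion"]
            and b["rate"] <= a["rate"]
            and (b["distortion"] < a["distortion"] or b["rate"] < a["rate"])
            for b in trials
        )
        if not dominated:
            front.append(a)
    return front


# Invented example records; trial 2 is worse than trial 1 on both losses.
trials = [
    {"number": 0, "distortion": 10.0, "rate": 5.0},
    {"number": 1, "distortion": 8.0, "rate": 7.0},
    {"number": 2, "distortion": 9.0, "rate": 8.0},
    {"number": 3, "distortion": 12.0, "rate": 4.0},
]
front_numbers = [t["number"] for t in pareto_front(trials)]
```

Trials on the front each represent a different trade-off between reconstruction quality and regularization, which is why labeling only those trials keeps the plot readable.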