Regulatory potential modeling#
MIRA approaches the problem of comparing gene expression and accessibility at the cell and gene levels. The previous tutorials covered cell-level analyses - topic modeling, pseudotime trajectory inference, etc. This tutorial will cover gene-level Regulatory Potential modeling (RP modeling).
A regulatory potential model relates a gene’s expression to nearby cis-regulatory elements (CREs). While linking changes in accessibility to changes in gene expression is confounded by cell-wide changes in state, proximity of a CRE to the TSS of a gene gives a mechanistic justification to assume some regulatory effect. Thus, MIRA learns upstream and downstream decay distances that describe the range at which changes in local accessible chromatin appear to influence expression. Each RP model is a statistical model describing a gene’s relationship with its local chromatin neighborhood.
We can use RP models to:
Predict the expression state of a gene given its local chromatin state
Find transcription factors that bind potential regulatory sites around each gene
Understand the regulatory ranges of influence around genes
Identify genes where local chromatin and expression are out-of-sync.
Let’s start by importing some packages:
[1]:
import mira
import scanpy as sc
import anndata
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rc("font", size = 12)
from IPython.display import Image
mira.utils.pretty_sderr()
INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
[1]:
and downloading the SHARE-seq skin dataset (if not downloaded already):
[4]:
mira.datasets.ShareseqTopicModels()
mira.datasets.ShareseqAnnotatedData()
rna_data = anndata.read_h5ad('mira-datasets/shareseq_annotated_data/rna_data.joint_representation.rp_modeled.h5ad')
atac_data = anndata.read_h5ad('mira-datasets/shareseq_annotated_data/atac_data.joint_representation.motif_calls.tss_annotated.h5ad')
rna_model = mira.topics.load_model('mira-datasets/shareseq_topic_models/rna_model.pth')
atac_model = mira.topics.load_model('mira-datasets/shareseq_topic_models/atac_model.pth')
INFO:mira.datasets.datasets:Dataset already on disk.
INFO:mira.datasets.datasets:Dataset contents:
* mira-datasets/shareseq_topic_models
* atac_model.pth
* rna_model.pth
INFO:mira.datasets.datasets:Dataset already on disk.
INFO:mira.datasets.datasets:Dataset contents:
* mira-datasets/shareseq_annotated_data
* rna_data.joint_representation.rp_modeled.h5ad
* atac_data.joint_representation.motif_calls.tss_annotated.h5ad
INFO:mira.topic_model.base:Moving model to CPU for inference.
INFO:mira.topic_model.base:Moving model to device: cpu
INFO:mira.topic_model.base:Moving model to CPU for inference.
INFO:mira.topic_model.base:Moving model to device: cpu
TSS annotations#
To start, we must annotate the genes in our multiome dataset with TSS locations. Each gene symbol (or however you wish to identify your genes), must be associated with a unique TSS. If a gene symbol has multiple TSS annotations, only the first will be used.
The required information for each gene is the:
chromosome
transcription start
transcription end
strand
symbol/identifier
Non-redundant human (hg38 GENCODE VM39) and mouse (mm10 GENCODE VM23) TSS annotations are available to download through MIRA, but you can assemble your own TSS dataframe. You can see how these annotations were assembled and get instructions on assembling your own annotation by following this tutorial.
To run mira.tl.get_distance_to_TSS, first download the chromosome sizes and TSS annotations provided by MIRA. Equivalent methods are available for hg38:
mira.datasets.hg38_chrom_sizes()
mira.datasets.hg38_tss_data()
[6]:
mira.datasets.mm10_chrom_sizes()
mira.datasets.mm10_tss_data()
INFO:mira.datasets.datasets:Dataset contents:
* mira-datasets/mm10.chrom.sizes
INFO:mira.datasets.datasets:Dataset contents:
* mira-datasets/mm10_tss_data.bed12
[7]:
!head -n3 mira-datasets/mm10.chrom.sizes
chr1 195471971
chr2 182113224
chrX 171031299
[8]:
!head -n3 mira-datasets/mm10_tss_data.bed12
#chrom #txStart #txEnd #geneSymbol #score #strand #thickStart #thickEnd #itemRGB #exonCount #blockSizes #blockStarts
chr1 3073252 3074322 4933401J01RIK 0 + 0 0 0 1 1070 0
chr1 3102015 3102125 GM26206 0 + 0 0 0 1 110 0
If you have your own annotation with different column names, you can pass those column names as keyword arguments to mira.tl.get_distance_to_TSS
. The column names must exclude the preceeding “#”, if applicable.
[9]:
mira.tl.get_distance_to_TSS(atac_data,
tss_data='mira-datasets/mm10_tss_data.bed12',
genome_file='mira-datasets/mm10.chrom.sizes')
INFO:mira.tools.connect_genes_peaks:Finding peak intersections with promoters ...
INFO:mira.tools.connect_genes_peaks:Calculating distances between peaks and TSS ...
INFO:mira.tools.connect_genes_peaks:Masking other genes' promoters ...
INFO:mira.adata_interface.rp_model:Added key to var: distance_to_TSS
INFO:mira.adata_interface.rp_model:Added key to uns: distance_to_TSS_genes
RP model training#
With TSS annotations in place, we can train RP models. Most RP-model related functions take the expr_adata
and atac_adata
keyword arguments, so it’s easiest to put them into a dictionary for repeated use:
[10]:
rp_args = dict(expr_adata = rna_data, atac_adata= atac_data)
Next, we instantiate an RP Model. MIRA refers to the RP model variant that uses local chromatin to predict gene expression as a LITE model - or Local chromatin accessibility-Influenced Transcriptional Expression model.
The mira.rp.LITE_Model object takes the expression and accessibility topic models. You must also define a list of genes to model. To keep this short, I will demonstrate training with four genes, but in a full analysis, we recommend training models for all highly-variable genes, plus all genes that scored in the top 5% most-activated for any topic, which gives a good survey of interesting gene expression variation in your data. This following snippet gives you that genelist:
rp_genes = list(rna_model.features[rna_model.highly_variable])
for topic in range(rna_model.num_topics):
rp_genes.extend(rna_model.get_top_genes(topic, 200))
rp_genes = list(set(rp_genes))
[11]:
litemodel = mira.rp.LITE_Model(expr_model = rna_model,
accessibility_model=atac_model,
genes = ['LEF1','WNT3','KRT23','MT2'])
Now fit the models. Most LITE_model
methods take the n_workers
parameter, which parallelizes across cores. You can provide the mira.rp.SaveCallback
to the fit
function, which will save each model as it is trained. One must simply provide the prefix or directory where RP models are to be saved.
[12]:
!mkdir -p data/
!mkdir -p data/rpmodels/
litemodel.fit(**rp_args, n_workers=4,
callback = mira.rp.SaveCallback('data/rpmodels/'))
INFO:mira.adata_interface.core:Added cols to obs: model_read_scale
INFO:mira.adata_interface.topic_model:Fetching key X_topic_compositions from obsm
INFO:mira.adata_interface.core:Added cols to obs: softmax_denom
[12]:
<mira.rp_model.rp_model.LITE_Model at 0x7fa602ffb410>
Otherwise, one can save RP models after training via:
[13]:
litemodel.save('data/rpmodels/')
!ls data/rpmodels/
LITE_KRT23.pth LITE_MT2.pth NITE_KRT23.pth NITE_MT2.pth
LITE_LEF1.pth LITE_WNT3.pth NITE_LEF1.pth NITE_WNT3.pth
To reload LITE models from disk, either instantiate the container object and use the load
function with the file prefix given above:
litemodel = mira.rp.LITE_Model(expr_model = rna_model,
accessibility_model=atac_model,
genes = ['LEF1','WNT3','KRT23','MT2'])
litemodel.load('data/rpmodels/')
Or to skip having to provide the gene list, use mira.rp.LITE_Model.load_dir
. This function will load every model in the given directory.
litemodel = mira.rp.LITE_Model.load_dir(
expr_model = rna_model,
accessibility_model = atac_model,
prefix='data/rpmodels/'
)
Defining local chromatin neighborhoods#
If you are interested in the distance of estimated regulatory influence for a certain gene, you can index on the litemodel
object with a gene name, then use the parameters_
attribute (distance is decay rate in kilobases):
[14]:
litemodel['LEF1'].parameters_
[14]:
{'theta': 1.2137538,
'gamma': 0.9651811,
'bias': 1.6418066,
'bn_mean': 0.87357396,
'bn_var': 0.54260945,
'bn_eps': 1e-05,
'a_upstream': 0.2547144,
'a_promoter': 6.3659315,
'a_downstream': 0.21996741,
'distance_upstream': 37.609966,
'distance_downstream': 11.843691}
Or access the parameters of all models like so:
[15]:
pd.DataFrame(
litemodel.parameters_
).T
[15]:
theta | gamma | bias | bn_mean | bn_var | bn_eps | a_upstream | a_promoter | a_downstream | distance_upstream | distance_downstream | |
---|---|---|---|---|---|---|---|---|---|---|---|
LEF1 | 1.213754 | 0.965181 | 1.641807 | 0.873574 | 0.542609 | 0.00001 | 0.254714 | 6.365932 | 0.219967 | 37.609966 | 11.843691 |
WNT3 | 2.090939 | 0.953565 | -0.168771 | 1.041202 | 0.314111 | 0.00001 | 0.451865 | 2.252736 | 0.438339 | 26.511311 | 28.357067 |
KRT23 | 0.398770 | 1.048645 | -0.222739 | 0.460833 | 0.025817 | 0.00001 | 1.456381 | 0.956366 | 0.298903 | 16.317459 | 15.376928 |
MT2 | 1.212501 | 0.905758 | 1.345756 | 3.491027 | 1.094696 | 0.00001 | 0.643497 | 0.168498 | 3.534304 | 248.008972 | 8.631128 |
Say one wanted a list of all peaks within the influence of WNT3’s RP model. You can quickly access the peaks that make up a gene’s local cis-regulatory neighborhood using get_influential_local_peaks. This function takes the parameter decay_periods
, which defines the distance covered by the gene’s chromatin neighborhood in terms of the decay_periods
times the RP model’s
upstream and downstream decay distances.
[16]:
litemodel['WNT3'].get_influential_local_peaks(atac_data, decay_periods = 5.).head(5)
[16]:
chr | start | end | endogenous | distance_to_TSS | is_upstream | |
---|---|---|---|---|---|---|
peak_id | ||||||
7081 | chr11 | 103697694 | 103697994 | False | 76305.0 | True |
8549 | chr11 | 103892817 | 103893117 | False | 118818.0 | False |
13683 | chr11 | 103897002 | 103897302 | False | 123003.0 | False |
14605 | chr11 | 103773897 | 103774197 | False | 102.0 | True |
20920 | chr11 | 103900925 | 103901225 | True | 126926.0 | False |
You can also manually access TSS-peak distances via:
[17]:
tss_distances = mira.utils.fetch_gene_TSS_distances(atac_data)
tss_distances
[17]:
AnnData object with n_obs × n_vars = 55304 × 334124
obs: 'gene', 'chromosome', 'txStart', 'txEnd', 'strand'
var: 'chr', 'start', 'end', 'endogenous'
In this matrix, peaks that are upstream of a gene have negative distances, while downstream have positive. All peaks that are outside of some range (by default 600 kilobases) are masked and have “zero” distance. if a peak begins exactly on the TSS of a gene, it is adjusted to be one base pair distant to avoid getting masked.
Predicting expression from accessibility#
With trained RP models, the predict
function calculates the maximum aposteriori prediction of expression given the accessibility state of each gene in each cell. Additionally, the model quantifies the likelihood of that prediction.
[18]:
litemodel.predict(**rp_args)
INFO:mira.adata_interface.core:Added layer: LITE_prediction
INFO:mira.adata_interface.core:Added layer: LITE_logp
You can survey the goodness of fit by checking UMAPs. For the gene MT2, the correspondance between accessibility and expression looks almost perfect. For the other genes, not so much … more on that later!
LITE Model Predictions:
[19]:
sc.pl.umap(rna_data, color = litemodel.genes, frameon=False, color_map='viridis',
layer='LITE_prediction', ncols=2, vmin = 0, vmax = 'p97')
Gene Expression:
[20]:
sc.pl.umap(rna_data, color = litemodel.genes, **mira.pref.raw_umap(ncols=2, size = 20))
Gene-TF targeting#
With probabilistic in-silico deletion (pISD)#
RP models define a local chromatin neighborhood where changes in accessibility appear to influence gene expression. One may assume that transcription factor binding in many cis-regulatory elements within this neighborhood suggests the transcription factor regulates the gene of interest. MIRA can query for these types of interactions at a systems level, finding potential regulatory associations across many gene-TF pairs.
The algorithm for calculating these associations is called probabilistic in-silico deletion (pISD), and it measures the ability of the RP model to predict expression of a gene before and after the regulatory elements predicted to bind a certain transcription factor are masked. In this way, pISD simulates a “computational knock out” of that factor.
To compute association scores using each RP model against motif-based binding site predictions of all available transcription factors, use litemodel.probabilistic_isd
:
[28]:
litemodel.probabilistic_isd(**rp_args, n_workers = 4)
INFO:mira.adata_interface.rp_model:Appending to expression adata:
INFO:mira.adata_interface.rp_model:Added key to varm: 'motifs-prob_deletion')
INFO:mira.adata_interface.rp_model:Added key to layers: motifs-informative_samples
INFO:mira.adata_interface.rp_model:Added key to uns: motifs
This test defines a matrix of association scores between gene-TF pairs. We can access this matrix directly with:
[29]:
isd_matrix = mira.utils.fetch_ISD_matrix(rna_data) # ISD results stored in RNA AnnData
isd_matrix # genes are rows, TFs are columns
[29]:
FOXQ1 | NFYC | SREBF1 | MYF5 | MLXIPL | SMAD2::SMAD3::SMAD4 | HINFP | MAFB | ALX1 | NR1H2::RXRA | ... | ZBTB6 | EMX2 | IRF6 | MYBL1 | LIN54 | SNAI2 | POU4F1 | ETS2 | NFIB | TBX1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||||||||||
0610012D04RIK | 0.169540 | -0.029384 | -0.200878 | -0.099164 | -0.126589 | -0.523728 | -0.001994 | -0.068799 | 0.005452 | 0.107080 | ... | 0.605548 | -1.518619e-02 | 0.710122 | -2.396497e-07 | -0.005993 | -0.267877 | 0.041505 | 0.076527 | -0.106636 | 0.067892 |
0610038B21RIK | -0.036928 | 0.085300 | -0.051138 | -0.393942 | -0.061608 | -0.098026 | 0.003808 | -0.072545 | 0.030532 | -0.095745 | ... | 0.074316 | -3.394143e-02 | 0.000000 | -1.471498e-03 | -0.149618 | 0.097972 | -0.090758 | 0.116396 | 0.048799 | -0.022324 |
1110012L19RIK | 0.036556 | -0.275474 | -0.221381 | 0.011563 | 0.183237 | 0.078137 | 0.000000 | 0.028468 | 0.028468 | 1.900235 | ... | -0.001522 | -1.032316e+00 | 0.000000 | 0.000000e+00 | -1.075561 | 0.004174 | 0.064352 | -0.221381 | -0.000001 | -0.000096 |
1110059G10RIK | 0.175102 | -0.074012 | -0.006563 | 0.122240 | -0.019347 | 0.352985 | -0.000747 | -0.093108 | 0.000213 | 0.001827 | ... | 0.070042 | -5.139177e-03 | 0.019617 | -4.050315e-03 | 0.397353 | -0.256149 | 0.053545 | 0.300384 | 0.053623 | -0.021751 |
1600012H06RIK | 0.021406 | 0.592367 | 0.277713 | 0.015611 | -0.028802 | 0.198631 | 0.096183 | 0.171284 | 0.929829 | -0.011851 | ... | 0.034424 | 9.359678e-01 | 0.048430 | -1.033540e-01 | 0.385391 | 0.057265 | 0.289109 | 1.565489 | 0.116852 | 0.173030 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZSWIM2 | -0.033843 | -0.000013 | 0.024423 | -0.006741 | 0.000005 | 0.000964 | 0.067987 | 0.000080 | 0.000088 | -0.034123 | ... | 0.034215 | 2.064183e-09 | 0.000000 | 0.000000e+00 | -0.002852 | 0.000030 | 0.001128 | 0.000372 | 0.033620 | 0.000547 |
ZSWIM3 | -0.590584 | 1.978670 | 0.079675 | -0.524485 | 0.052527 | -2.281603 | -0.172332 | 0.075240 | -0.199830 | 0.137679 | ... | 1.032466 | 2.336549e-01 | 0.433290 | 2.409590e-06 | 0.448944 | 0.449440 | -0.194346 | 0.295567 | -1.740476 | -1.016739 |
ZSWIM5 | 0.050175 | -0.397321 | 0.161102 | 0.146898 | 0.047569 | 0.228481 | 0.191042 | -0.100273 | 0.091490 | -0.636234 | ... | -0.963587 | -4.722838e-02 | 0.063973 | 7.622661e-02 | -0.109906 | 0.035298 | -0.114043 | -0.715872 | 0.204014 | 0.104210 |
ZSWIM7 | 0.208178 | 0.147906 | 0.440069 | 0.309050 | -0.565245 | 0.246133 | 0.000028 | 0.793261 | 0.318095 | -0.047203 | ... | 0.021071 | 1.185795e-01 | 0.234073 | -1.586491e-02 | -0.804975 | 0.133644 | 0.584839 | -0.187613 | -1.530654 | -0.036285 |
ZYX | -0.927998 | -0.577818 | -3.116855 | 2.348907 | -0.020789 | 2.062772 | -0.984240 | 0.968229 | -0.193598 | -2.775327 | ... | 1.922306 | -5.982126e-01 | 0.001017 | 4.485060e-05 | 0.951619 | 10.972371 | -1.677477 | 0.074832 | 2.782266 | 1.382568 |
4883 rows × 555 columns
To find the top transcription factors associated with binding sites within WNT3’s neighborhood:
[30]:
isd_matrix.loc['WNT3'].sort_values().tail(10)
[30]:
KLF12 16.497324
RBPJ 17.133034
KLF16 18.547176
STAT3 18.610457
PRDM1 22.032345
ARID3A 23.249511
KLF11 24.054482
FOSL2 24.516716
EWSR1-FLI1 26.278431
FOS-1 66.844178
Name: WNT3, dtype: float64
Of these transcription factors, RBPJ is interesting because it is a key regulator of late hair follicle differentiation spurred by Notch signaling.
Querying with many genes#
One key limitation of the pISD algorithm is that binding site predictions are noisy - whether based on ChIP samples or especially as defined by motifs. Additionally, proximal binding predictions do not gaurantee mechanistic regulation. As such, testing one gene or one TF at a time is inadvisable.
Querying for shared association with transcription factors across many co-regulated genes, however, produces results much more robust to the afforementioned sources of error. As part of the MIRA study, we computed association scores between 4000 genes and 555 TF motifs. Let’s load those results and test them against an interseting geneset.
[31]:
rna_data = anndata.read_h5ad('mira-datasets/shareseq_annotated_data/rna_data.joint_representation.rp_modeled.h5ad')
[32]:
print('{} Genes, {} Factors'.format(
*mira.utils.fetch_ISD_matrix(rna_data).shape
))
4883 Genes, 555 Factors
Now we can query with many genes at a time to find potential regulators. The following query finds TF regulators of genes activated by the hair follicle Cortex cells topic via Wilcoxon test over the association scores:
[33]:
isd_results = mira.tl.driver_TF_test(rna_data, geneset=rna_model.get_top_genes(10, 150))
pd.DataFrame(isd_results).sort_values('pval').head(15)
INFO:mira.tools.tf_targeting:Testing with 144 query genes and 4739 background genes, against 555 factors
[33]:
id | name | parsed_name | pval | test_statistic | |
---|---|---|---|---|---|
247 | MA0523.1 | TCF7L2 | TCF7L2 | 4.800274e-08 | 430104.5 |
205 | MA0905.1 | HOXC10 | HOXC10 | 1.480306e-06 | 419074.0 |
164 | MA0768.1 | LEF1 | LEF1 | 2.535205e-06 | 417234.0 |
315 | MA0879.1 | DLX1 | DLX1 | 3.612823e-06 | 415980.0 |
496 | MA0485.2 | HOXC9 | HOXC9 | 4.001443e-06 | 415557.0 |
411 | MA0634.1 | ALX3 | ALX3 | 4.295351e-06 | 415366.0 |
385 | MA0902.2 | HOXB2 | HOXB2 | 4.534243e-06 | 415160.5 |
540 | MA0880.1 | DLX3 | DLX3 | 5.452141e-06 | 414503.0 |
457 | MA0881.1 | DLX4 | DLX4 | 5.452141e-06 | 414503.0 |
195 | MA1507.1 | HOXD4 | HOXD4 | 6.311274e-06 | 413973.0 |
232 | MA0726.1 | VSX2 | VSX2 | 7.569086e-06 | 413309.0 |
531 | MA0701.2 | LHX9 | LHX9 | 8.899616e-06 | 412714.0 |
399 | MA0075.3 | PRRX2 | PRRX2 | 8.899616e-06 | 412714.0 |
499 | MA1476.1 | DLX5 | DLX5 | 1.142593e-05 | 411775.0 |
311 | MA0907.1 | HOXC13 | HOXC13 | 1.151225e-05 | 411748.0 |
Fantastic. The top results show WNT-related (TCF, LEF1), homeobox (DLX), and HOXC (HOXC13) motifs are enriched in the chromatin neighborhoods of genes upregulated in the cortex.
As with topic enrichment, it is often instructive to compare and contrast regulators acting on two sets of genes. The mira.pl.compare_driver_TFs_plot function takes two genesets and calculates TF enrichment against both. In this case, I compare regulators of the top 150 genes associated with hair follicle Cortex and Medulla topics, respectively:
[34]:
mira.pl.compare_driver_TFs_plot(rna_data,
geneset1=rna_model.get_top_genes(10, 150),
geneset2=rna_model.get_top_genes(5, 150),
label_factors=['SMAD2::SMAD3','SMAD4','LEF1','HOXC13','RUNX1','DLX3'],
fontsize=20, figsize=(5,5), color='lightgrey',
axlabels= ('-log10 p-value\nCortex Regulators','-log10 p-value\nMedulla Regulators'))
plt.show()
INFO:mira.tools.tf_targeting:Testing with 144 query genes and 4739 background genes, against 555 factors
INFO:mira.tools.tf_targeting:Testing with 146 query genes and 4737 background genes, against 555 factors
Visualizing RP models#
After identifying TF regulators of a gene-of-interest, you may want to visualize that gene’s cis-regulatory neighborhood and assess binding sites. More comprehensive epigenetic visualization is coming in future updates of MIRA, but for now, we recommend using pygenometracks. Install pygenometracks using:
conda install -c conda-forge -c bioconda -y pygenometracks
or -
pip install pygenometracks
You can use the RP model’s write_bedgraph method to write the RP model profile to a bedgraph file, which can be plotted by pygenometracks.
[35]:
!mkdir -p data/trackdata
[36]:
litemodel['WNT3'].write_bedgraph(atac_data,
save_name = 'data/trackdata/WNT3_rpmodel.bedgraph')
Earlier, we found that WNT3 and RBPJ share a high association score. Let’s see where RBPJ motifs are landing around the WNT3 locus. One can access the matrix of all TF motif binding predictions by fetch_factor_hits
, or get a list of peaks predicted to be bound by RBPJ via fetch_binding_sites
. Motifs are accessed by unique IDs, so first we need to find the ID for RBPJ:
[37]:
factors = mira.utils.fetch_factor_meta(atac_data)
factors[factors.name == 'RBPJ']
[37]:
id | name | parsed_name | |
---|---|---|---|
1035 | MA1116.1 | RBPJ | RBPJ |
[38]:
rbpj_binding = mira.utils.fetch_binding_sites(atac_data, id = 'MA1116.1')
rbpj_binding.head(3)
[38]:
chr | start | end | endogenous | |
---|---|---|---|---|
peak_id | ||||
15 | chr6 | 113306525 | 113306825 | False |
18 | chr18 | 21300223 | 21300523 | False |
21 | chr13 | 119690114 | 119690414 | False |
[39]:
# sort by chromosome, then start site
rbpj_binding[['chr','start','end']].sort_values(['chr','start'])\
.to_csv('data/trackdata/RBPJ_hits.bed', # save as bed file
index = None, header=None, sep = '\t',
)
Now we define a configuration file for the pygenometracks plot we wish to make (see their extensive documentation on building plots). Of note, for the [genes]
heading, we can provide the same TSS annotation as was used in training the RP models, since it is a valid bed12 file.
[40]:
config_file = """
[x-axis]
[genes]
file = mira-datasets/mm10_tss_data.bed12
title = Genes
height = 3
[rp model]
file = data/trackdata/WNT3_rpmodel.bedgraph
height = 3
color = #e6e6e6
title = WNT3 RP Model
max_value = 1.1
min_value = 0
file_type = bedgraph
alpha = 0.2
[rp model2]
file = data/trackdata/WNT3_rpmodel.bedgraph
type = line:0.5
color = black
file_type = bedgraph
max_value = 1.1
min_value = 0
overlay previous = yes
[spacer]
[RBPJ hits]
file = data/trackdata/RBPJ_hits.bed
height = 1.5
style = UCSC
gene_rows = 1
color = red
title = RBPJ Motif Hits
labels = off
"""
with open('data/trackdata/config.ini', 'w') as f:
print(config_file, file = f)
Finally run the plotting command. To find a plotting location for a given gene, you can reference:
[41]:
mira.utils.fetch_TSS_data(atac_data)['WNT3']
[41]:
{'gene_chrom': 'chr11',
'gene_start': 103774149,
'gene_end': 103817957,
'gene_strand': '+'}
[45]:
!pygenometracks \
--tracks data/trackdata/config.ini \
--region chr11:103674149-103917957 \
-out data/trackdata/wnt3_plot.png \
--dpi 300 \
--width 20 \
--fontSize 9
WARNING:pygenometracks.tracksClass:Deprecated Warning: The section 4. [rp model2] uses parameter overlay previous but there is no more parameter with space in name. Will be substituted by overlay_previous.
WARNING:pygenometracks.tracksClass:Deprecation Warning: In section 6. [RBPJ hits], labels was set to off whereas in the future only true and false value will be accepted.
INFO:pygenometracks.tracksClass:initialize 1. [x-axis]
INFO:pygenometracks.tracksClass:initialize 2. [genes]
100%|████████████████████████████████████████| 12/12 [00:00<00:00, 15582.55it/s]
INFO:pygenometracks.tracksClass:initialize 3. [rp model]
100%|████████████████████████████████████| 4818/4818 [00:00<00:00, 33998.49it/s]
INFO:pygenometracks.tracksClass:initialize 4. [rp model2]
100%|████████████████████████████████████| 4818/4818 [00:00<00:00, 34002.38it/s]
INFO:pygenometracks.tracksClass:initialize 5. [spacer]
INFO:pygenometracks.tracksClass:initialize 6. [RBPJ hits]
100%|████████████████████████████████████████| 13/13 [00:00<00:00, 26001.88it/s]
INFO:pygenometracks.tracksClass:time initializing track(s):
INFO:pygenometracks.tracksClass:0.45090484619140625
DEBUG:pygenometracks.tracksClass:Figure size in cm is 20.0 x 10.50531914893617. Dpi is set to 300
INFO:pygenometracks.tracksClass:plotting 1. [x-axis]
INFO:pygenometracks.tracksClass:plotting 2. [genes]
DEBUG:pygenometracks.tracks.GenomeTrack:ylim 7.9799999999999995,-0.08
INFO:pygenometracks.tracksClass:plotting 3. [rp model]
INFO:pygenometracks.tracksClass:plotting 4. [rp model2]
INFO:pygenometracks.tracksClass:plotting 5. [spacer]
INFO:pygenometracks.tracksClass:plotting 6. [RBPJ hits]
DEBUG:pygenometracks.tracks.GenomeTrack:ylim 3.38,-0.08
[46]:
Image('data/trackdata/wnt3_plot.png')
[46]:
RBPJ appears to bind many regions proximal to WNT3’s local chromatin neighborhood, defined by the bounds of the learned regulatory potential function.
Next#
LITE models quantify a relationship between local chromatin accessibility and gene expression. Interestingly, for some genes, changes in local chromatin appear to diverge from changes in gene expression. The next tutorial on NITE regulation demonstrates how to identify and analyze this phenomenon.