mira.tl.get_distance_to_TSS

mira.tl.get_distance_to_TSS(adata, tss_data=None, peak_chrom='chr', peak_start='start', peak_end='end', gene_id='geneSymbol', gene_chrom='chrom', gene_start='txStart', gene_end='txEnd', gene_strand='strand', sep='\t', max_distance=600000.0, promoter_width=3000, *, genome_file)

Given TSS data for genes, find the distance between the TSS of each gene and the center of each accessible site measured in the data. This distance is used to train RP Models.
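The quantity described above can be pictured with a minimal pure-Python sketch. It is not MIRA's implementation: the helper names (`tss_position`, `peak_center`, `tss_to_peak_distances`) and the toy coordinates are hypothetical, all features are assumed to lie on one chromosome, and the distance is taken as unsigned because no sign convention is stated here.

```python
def tss_position(tx_start, tx_end, strand):
    """Return the TSS: transcript start on the plus strand, transcript end on the minus strand."""
    return tx_start if strand == "+" else tx_end


def peak_center(start, end):
    """Midpoint of a peak in base pairs (integer division)."""
    return (start + end) // 2


def tss_to_peak_distances(tss, peaks):
    """Unsigned distance from one TSS to the center of each (start, end) peak."""
    return [abs(peak_center(start, end) - tss) for start, end in peaks]


# Hypothetical toy data: two peaks and one gene, all on the same chromosome.
peaks = [(9778, 10670), (180631, 181281)]
plus_tss = tss_position(tx_start=11000, tx_end=15000, strand="+")
minus_tss = tss_position(tx_start=11000, tx_end=15000, strand="-")

print(tss_to_peak_distances(plus_tss, peaks))   # [776, 169956]
print(tss_to_peak_distances(minus_tss, peaks))  # [4776, 165956]
```

The same transcript yields different distances on the two strands, which is why the strand column must be correct in the TSS table.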

Parameters
adataanndata.AnnData

AnnData object of chromatin accessibility. Peak locations are stored in .var, in the columns for chromosome, start, and end coordinates named by the peak_chrom, peak_start, and peak_end parameters, respectively.

tss_datapd.DataFrame or str

DataFrame of TSS locations for each gene. TSS information must include the chromosome, start, end, strand, and symbol of the gene. Pass either an in-memory DataFrame or the path to that DataFrame on disk.

sepstr, default = “\t”

If loading tss_data from disk, use this separator character.

peak_chromstr, default = “chr”

The column in adata.var corresponding to the chromosome of peaks

peak_startstr, default = “start”

The column in adata.var corresponding to the start coordinate of peaks

peak_endstr, default = “end”

The column in adata.var corresponding to the end coordinate of peaks

gene_chromstr, default = “chrom”

The column in tss_data corresponding to the chromosome of genes

gene_startstr, default = “txStart”

The column in tss_data corresponding to the start index of a transcript. For plus-strand genes, this will be the TSS location.

gene_endstr, default = “txEnd”

The column in tss_data corresponding to the end of a transcript. For minus-strand genes, this will be the TSS location.

gene_strandstr, default = “strand”

The column in tss_data corresponding to the strandedness of the gene.

gene_idstr, default = “geneSymbol”

The column in tss_data corresponding to the symbol of the gene. This symbol is used to refer to specific genes and to connect the loci to observed expression for that gene. Make sure the TSS labels use the same symbols as the expression counts data of your multiome experiment. If multiple loci have the same symbol, or a gene has multiple loci, only the first encountered will be used. To disambiguate the symbol-to-locus mapping, use a single canonical splice variant for each gene.

max_distancefloat > 0, default = 6e5

Maximum distance between a peak and a gene that will be recorded. All distances exceeding this threshold will be set to infinity.

promoter_widthint, default = 3000

Width of the “promoter” region around each TSS, in base pairs. The distance between a gene and a peak inside another gene’s promoter region is set to infinity. For RP modeling, this masks the effect of other genes’ promoter accessibility on the RP model.

genome_filestr

Path to a file listing the chromosome lengths for your organism, one chromosome per line. For example:

chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
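A chromosome-lengths file of this shape can be produced with a few lines of standard-library Python. This is only a sketch: the helper names and the temporary path are hypothetical, and the tab separator is an assumption borrowed from the common "chrom.sizes" convention, since the required delimiter is not stated here.

```python
import os
import tempfile

# Chromosome lengths copied from the example above.
chrom_sizes = {
    "chr1": 248956422,
    "chr2": 242193529,
    "chr3": 198295559,
    "chr4": 190214555,
}


def write_genome_file(sizes, path, sep="\t"):
    """Write one '<chromosome><sep><length>' record per line."""
    with open(path, "w") as handle:
        for chrom, length in sizes.items():
            handle.write(f"{chrom}{sep}{length}\n")


def read_genome_file(path):
    """Read the file back into a dict; splits on any whitespace."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                chrom, length = line.split()[:2]
                sizes[chrom] = int(length)
    return sizes


# Hypothetical location; in practice this would be a path such as the one passed as genome_file.
genome_path = os.path.join(tempfile.mkdtemp(), "toy.genome")
write_genome_file(chrom_sizes, genome_path)
round_trip = read_genome_file(genome_path)
```

Reading the file back before use is a cheap way to check that chromosome names match those in the peak and TSS tables.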

Returns
adataanndata.AnnData
.varm[“distance_to_TSS”]scipy.spmatrix[float] of shape (n_genes x n_peaks)

Distance between genes’ TSS and peaks.

.uns[“distance_to_TSS_genes”]np.ndarray[str] of shape (n_genes,)

Gene symbols corresponding to rows in the distance_to_TSS matrix.

Examples

One can download mm10 or hg38 TSS annotations via:

>>> mira.datasets.mm10_tss_data() # or mira.datasets.hg38_tss_data()
...   INFO:mira.datasets.datasets:Dataset contents:
...       * mira-datasets/mm10_tss_data.bed12

Then, to annotate the ATAC peaks:

>>> atac_data.var
...                        chr   start     end
...    chr1:9778-10670     chr1    9778   10670
...    chr1:180631-181281  chr1  180631  181281
...    chr1:183970-184795  chr1  183970  184795
...    chr1:190991-191935  chr1  190991  191935
>>> mira.tl.get_distance_to_TSS(atac_data, 
...                        tss_data = "mira-datasets/mm10_tss_data.bed12", 
...                        gene_chrom='chrom', 
...                        gene_strand='strand', 
...                        gene_start='chromStart',
...                        gene_end='chromEnd',
...                        genome_file = '~/genomes/hg38/hg38.genome'
...                    )
...    WARNING:mira.tools.connect_genes_peaks:71 regions encounted from unknown chromsomes: KI270728.1,GL000194.1,GL000205.2,GL000195.1,GL000219.1,KI270734.1,GL000218.1,KI270721.1,KI270726.1,KI270711.1,KI270713.1
...    INFO:mira.tools.connect_genes_peaks:Finding peak intersections with promoters ...
...    INFO:mira.tools.connect_genes_peaks:Calculating distances between peaks and TSS ...
...    INFO:mira.tools.connect_genes_peaks:Masking other genes' promoters ...
...    INFO:mira.adata_interface.rp_model:Added key to var: distance_to_TSS
...    INFO:mira.adata_interface.rp_model:Added key to uns: distance_to_TSS_genes
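
The thresholding and promoter-masking steps named in the log above can be illustrated with a small pure-Python sketch. It is not the library's code. The function `mask_distances`, the gene names, and the coordinates are hypothetical; all features are assumed to share one chromosome; and a promoter is assumed to be the window of promoter_width base pairs centered on the TSS, which may differ from MIRA's actual definition.

```python
import math


def mask_distances(tss_by_gene, peak_centers, max_distance=6e5, promoter_width=3000):
    """Return {gene: [distance or inf per peak]} after thresholding and promoter masking."""
    half_width = promoter_width / 2  # assumed: promoter window is centered on the TSS
    masked = {}
    for gene, tss in tss_by_gene.items():
        row = []
        for center in peak_centers:
            distance = abs(center - tss)
            # A peak sitting in another gene's promoter is masked for this gene.
            in_other_promoter = any(
                other != gene and abs(center - other_tss) <= half_width
                for other, other_tss in tss_by_gene.items()
            )
            too_far = distance > max_distance
            row.append(math.inf if too_far or in_other_promoter else distance)
        masked[gene] = row
    return masked


# Hypothetical toy data on a single chromosome.
tss_by_gene = {"GeneA": 10_000, "GeneB": 50_000}
peak_centers = [10_500, 30_000, 49_000, 900_000]

result = mask_distances(tss_by_gene, peak_centers)
print(result["GeneA"])  # [500, 20000, inf, inf]
print(result["GeneB"])  # [inf, 20000, 1000, inf]
```

In this toy case the peak at 49,000 is kept for GeneB, whose promoter it falls in, but masked for GeneA, and the peak at 900,000 is dropped for both genes because it exceeds max_distance.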