mira.tl.get_distance_to_TSS
- mira.tl.get_distance_to_TSS(adata, tss_data=None, peak_chrom='chr', peak_start='start', peak_end='end', gene_id='geneSymbol', gene_chrom='chrom', gene_start='txStart', gene_end='txEnd', gene_strand='strand', sep='\t', max_distance=600000.0, promoter_width=3000, *, genome_file)
Given TSS data for genes, find the distance between the TSS of each gene and the center of each accessible site measured in the data. This distance is used to train RP Models.
- Parameters
- adata : anndata.AnnData
AnnData object of chromatin accessibility data. Peak locations are read from .var, using the columns named by the peak_chrom, peak_start, and peak_end parameters for the chromosome, start, and end coordinates, respectively.
- tss_data : pd.DataFrame or str
DataFrame of TSS locations for each gene. TSS information must include the chromosome, start, end, strand, and symbol of each gene. Pass either an in-memory DataFrame or the path to that table on disk.
- sep : str, default = "\t"
If loading tss_data from disk, use this separator character.
- peak_chrom : str, default = "chr"
The column in adata.var corresponding to the chromosome of peaks.
- peak_start : str, default = "start"
The column in adata.var corresponding to the start coordinate of peaks.
- peak_end : str, default = "end"
The column in adata.var corresponding to the end coordinate of peaks.
- gene_chrom : str, default = "chrom"
The column in tss_data corresponding to the chromosome of genes.
- gene_start : str, default = "txStart"
The column in tss_data corresponding to the start index of a transcript. For plus-strand genes, this will be the TSS location.
- gene_end : str, default = "txEnd"
The column in tss_data corresponding to the end of a transcript. For minus-strand genes, this will be the TSS location.
- gene_strand : str, default = "strand"
The column in tss_data corresponding to the strandedness of the gene.
- gene_id : str, default = "geneSymbol"
The column in tss_data corresponding to the symbol of the gene. This symbol is used to refer to specific genes and to connect each locus to the observed expression for that gene, so the symbols in the TSS annotation must match those in the expression counts data of your multiome experiment. If multiple loci have the same symbol, or a gene has multiple loci, only the first encountered will be used. To disambiguate the symbol-to-locus mapping, use a single canonical splice variant for each gene.
- max_distance : float > 0, default = 6e5
Maximum distance between a peak and a gene's TSS for which a distance is reported. All distances exceeding this threshold are set to infinity.
- promoter_width, default = 3000
Width of the "promoter" region around each TSS, in base pairs. The distance between a gene and a peak inside another gene's promoter region is set to infinity. For RP modeling, this masks the effect of other genes' promoter accessibility on the RP model.
- genome_file : str
Path to a file listing the chromosome lengths for your organism. For example:
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
- Returns
Examples
One can download mm10 or hg38 TSS annotations via:
>>> mira.datasets.mm10_tss_data() # or mira.datasets.hg38_tss_data()
... INFO:mira.datasets.datasets:Dataset contents:
...     * mira-datasets/mm10_tss_data.bed12
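Because only the first locus encountered for each symbol is used, it can help to reduce an in-memory TSS table to one row per gene before passing it as tss_data. The following is a minimal sketch with pandas, assuming the default column names from the signature above. The toy rows and the "keep the longest transcript" rule are illustrative assumptions, not MIRA's behavior or a recommendation from its authors.

```python
import pandas as pd

# Toy TSS table using the default column names from the function signature.
# Coordinates and gene symbols are made up for illustration.
tss = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "txStart": [1000, 1200, 5000],
    "txEnd": [9000, 8000, 7000],
    "strand": ["+", "+", "-"],
    "geneSymbol": ["GeneA", "GeneA", "GeneB"],
})

# Assumption: treat the longest transcript of each symbol as the canonical one.
tss["length"] = tss["txEnd"] - tss["txStart"]
canonical = (
    tss.sort_values("length", ascending=False, kind="stable")
       .drop_duplicates(subset="geneSymbol", keep="first")  # one locus per symbol
       .drop(columns="length")
       .sort_values(["chrom", "txStart"])
       .reset_index(drop=True)
)
print(canonical)
```

The resulting DataFrame has one row per symbol and could then be passed directly as tss_data instead of a file path.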
Then, to annotate the ATAC peaks:
>>> atac_data.var
...                         chr   start     end
... chr1:9778-10670        chr1    9778   10670
... chr1:180631-181281     chr1  180631  181281
... chr1:183970-184795     chr1  183970  184795
... chr1:190991-191935     chr1  190991  191935
>>> mira.tl.get_distance_to_TSS(atac_data,
...     tss_data = "mira-datasets/mm10_tss_data.bed12",
...     gene_chrom='chrom',
...     gene_strand='strand',
...     gene_start='chromStart',
...     gene_end='chromEnd',
...     genome_file = '~/genomes/hg38/hg38.genome'
... )
... WARNING:mira.tools.connect_genes_peaks:71 regions encounted from unknown chromsomes: KI270728.1,GL000194.1,GL000205.2,GL000195.1,GL000219.1,KI270734.1,GL000218.1,KI270721.1,KI270726.1,KI270711.1,KI270713.1
... INFO:mira.tools.connect_genes_peaks:Finding peak intersections with promoters ...
... INFO:mira.tools.connect_genes_peaks:Calculating distances between peaks and TSS ...
... INFO:mira.tools.connect_genes_peaks:Masking other genes' promoters ...
... INFO:mira.adata_interface.rp_model:Added key to var: distance_to_TSS
... INFO:mira.adata_interface.rp_model:Added key to uns: distance_to_TSS_genes
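To make the parameter descriptions concrete, here is a small pure-Python sketch of the quantities involved: the strand-dependent TSS position, the peak center, the max_distance cutoff, and the masking of peaks that fall in another gene's promoter. It is not MIRA's implementation. The helper names, the toy coordinates, the use of absolute (unsigned) distances, and the choice to treat the promoter as a window of promoter_width base pairs centered on the TSS and tested against the peak center are all assumptions made for illustration.

```python
import math

def tss_position(gene):
    # Plus-strand genes start at txStart; minus-strand genes start at txEnd.
    return gene["txStart"] if gene["strand"] == "+" else gene["txEnd"]

def peak_center(peak):
    return (peak["start"] + peak["end"]) // 2

def in_promoter(peak, gene, promoter_width):
    # Assumption: promoter = window of promoter_width bp centered on the TSS,
    # and a peak is "inside" when its center falls within that window.
    return (peak["chr"] == gene["chrom"]
            and abs(peak_center(peak) - tss_position(gene)) <= promoter_width / 2)

def distance_matrix(peaks, genes, max_distance=6e5, promoter_width=3000):
    """Return a genes x peaks list of lists of absolute distances (or inf)."""
    rows = []
    for gene in genes:
        others = [g for g in genes if g is not gene]
        row = []
        for peak in peaks:
            if peak["chr"] != gene["chrom"]:
                row.append(math.inf)  # different chromosome: no distance
                continue
            d = abs(peak_center(peak) - tss_position(gene))
            masked = any(in_promoter(peak, o, promoter_width) for o in others)
            row.append(math.inf if d > max_distance or masked else d)
        rows.append(row)
    return rows

# Hypothetical toy data, not taken from any real annotation.
peaks = [
    {"chr": "chr1", "start": 900, "end": 1100},        # center 1000
    {"chr": "chr1", "start": 50000, "end": 50400},     # center 50200
    {"chr": "chr1", "start": 899800, "end": 900200},   # center 900000
    {"chr": "chr2", "start": 100, "end": 300},         # other chromosome
]
genes = [
    {"chrom": "chr1", "txStart": 1000, "txEnd": 9000, "strand": "+"},    # TSS 1000
    {"chrom": "chr1", "txStart": 40000, "txEnd": 50000, "strand": "-"},  # TSS 50000
]

dist = distance_matrix(peaks, genes)
for row in dist:
    print(row)
```

In this toy case each gene keeps a finite distance only to the peak at its own promoter: the other nearby peak sits in the other gene's promoter and is masked, the third peak lies beyond max_distance, and the fourth is on a different chromosome.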