genomics.py

Small genomics methods packaged for re-use.

genome_data logic has been imported from packages/graphics/.../genome_scale.py and should eventually be inherited back there.

genomics.chromosome_chunks(chrom, size=3000000000, overlap=0, autosomes=False, build='GRCh38.d1.vd1', location=None)[source]

Return a list of chunks of chrom, each roughly size+overlap long, suitable for use with samtools -r or bedtools intersect.

If the given chromosome is smaller than the desired size+overlap, simply return the whole chromosome.

If the last chunk would be smaller than one third of size, append it to the previous chunk instead.

autosomes: if set, return an empty list for non-autosomal chromosomes.
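The chunking rules described for chromosome_chunks can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the helper name chunk_intervals is hypothetical, it works on a bare length rather than a chromosome name and build, and it assumes that overlap extends the end of each chunk and that coordinates are 0-based and half-open.

```python
def chunk_intervals(length, size, overlap=0):
    """Split [0, length) into (start, end) chunks of about size + overlap.

    Hypothetical helper; the overlap and coordinate conventions are assumptions.
    """
    # A sequence that fits in one chunk is returned whole.
    if length <= size + overlap:
        return [(0, length)]
    starts = list(range(0, length, size))
    # Fold a trailing chunk shorter than one third of size into its predecessor.
    if len(starts) > 1 and length - starts[-1] < size / 3:
        starts.pop()
    chunks = []
    for i, start in enumerate(starts):
        is_last = i == len(starts) - 1
        # The last chunk always runs to the end of the sequence.
        end = length if is_last else min(start + size + overlap, length)
        chunks.append((start, end))
    return chunks
```

For example, chunk_intervals(950, 300) yields three chunks, because the trailing 50 bases are merged into the chunk that starts at 600.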

genomics.fai_dict(failines)[source]
class genomics.genome_data(genome='GRCh38.d1.vd1', for_spacing=False, location=None)[source]

An object that holds chromosome lengths and lengths by arm, and provides several simple access functions to view those data. The default build is HG38 (specifically GRCh38.d1.vd1). Expects the data to reside in $TEAM_ROOT/reference/genomics/Hsap/<BUILDNAME>. Supporting HG19 would be easy enough, but the p/q arm data would first have to be put into the correct format.

arm(arm)[source]
arm_length(arm)[source]

Report the length of the requested chromosome arm.

chromosome_length(chrom)[source]

Report the length of the requested chromosome.

fai_locus(loc, pos=0, raise_error=False)[source]

Report the genome-wide offset (between 0 and the genome length) of a position within a chromosome or chromosome arm, optionally raising an error if the position is off the end of the given segment.

>>> 0 == fai_locus('chr1', 0)
>>> x = fai_locus('chr1q', 100000)
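The idea of a genome-wide offset can be illustrated with a small standalone sketch. It is not the fai_locus method itself: the function name locus_offset, the lengths argument, and the toy lengths are all hypothetical, whole chromosomes only are handled (arm boundaries need data not shown here), and treating a position equal to the segment length as in range is an assumption.

```python
from itertools import accumulate

# Toy lengths, not real GRCh38 values; list order defines the genome layout.
TOY_LENGTHS = [("chr1", 1000), ("chr2", 800), ("chrX", 500)]


def locus_offset(loc, pos=0, raise_error=False, lengths=TOY_LENGTHS):
    """Return the start of segment loc in the concatenated genome, plus pos."""
    names = [name for name, _ in lengths]
    sizes = dict(lengths)
    # Each segment starts at the summed length of all segments before it.
    starts = dict(zip(names, accumulate([0] + [n for _, n in lengths[:-1]])))
    if raise_error and (pos < 0 or pos > sizes[loc]):
        raise ValueError(f"{pos} is off the end of {loc}")
    return starts[loc] + pos
```

With these toy lengths, locus_offset('chr2', 100) is 1100, and locus_offset('chrX', 501, raise_error=True) raises ValueError.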
genomics.hundredK_roundup(val)[source]

Return val rounded up to the nearest 100K.
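One common way to round up to a multiple of 100,000 uses integer ceiling division. The sketch below is an illustration only: the name round_up_to_step and its step parameter are hypothetical, and leaving exact multiples unchanged is an assumption about the intended behavior.

```python
def round_up_to_step(val, step=100_000):
    """Round val up to the next multiple of step; exact multiples are unchanged."""
    # Negating before and after floor division gives ceiling division.
    return -(-val // step) * step
```

Here round_up_to_step(100001) gives 200000, while round_up_to_step(100000) stays at 100000.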

genomics.normalize_chr_name(chrom)[source]

Return a string that “normalizes” chrom so that it can be used as a key in the CHROMOSOMES dictionary.

genomics.read_genome_bed(genome='GRCh38.d1.vd1', kind='whole', sorting='lexsort', location=None)[source]

Read a bedfile and make displayable-genome offsets.

The default genome is GRCh38.d1.vd1. The default kind is whole; choices are whole and arms.

The default sorting is lexsort; choices are lexsort and numsort.
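The difference between a lexical and a numeric ordering of chromosome names can be seen in a short sketch. Whether lexsort and numsort in this module behave exactly this way is an assumption; the names and the numeric_key helper below are hypothetical.

```python
import re

names = ["chr10", "chrX", "chr2", "chr1", "chr22"]


def numeric_key(name):
    """Sort numbered chromosomes by value, then named ones alphabetically."""
    label = re.sub(r"^chr", "", name)
    return (0, int(label), "") if label.isdigit() else (1, 0, label)


# Plain string comparison puts chr10 ahead of chr2.
lexical = sorted(names)
# Comparing the numeric part puts chr2 ahead of chr10.
numeric = sorted(names, key=numeric_key)
```

The lexical order is chr1, chr10, chr2, chr22, chrX; the numeric order is chr1, chr2, chr10, chr22, chrX.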

genomics.read_genome_fai(genome='GRCh38.d1.vd1', location=None)[source]

Read a fasta index and make displayable-genome offsets.

Default genome is GRCh38.d1.vd1
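A fasta index (.fai) is a tab-separated file whose first two columns are the sequence name and its length. The sketch below shows one way such lines could be turned into running genome offsets. It is not the read_genome_fai implementation: the function name fai_offsets, its return shape, and the toy index lines are hypothetical.

```python
def fai_offsets(lines):
    """Map each sequence name to (genome-wide start, length), in file order."""
    offsets = {}
    running = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Only the first two .fai columns (name, length) are needed here.
        name, length = line.rstrip("\n").split("\t")[:2]
        offsets[name] = (running, int(length))
        running += int(length)
    return offsets


# Toy index lines with made-up values, not taken from a real reference.
toy_fai = ["chr1\t1000\t6\t60\t61\n", "chr2\t800\t1030\t60\t61\n"]
toy_offsets = fai_offsets(toy_fai)
```

With the toy lines, chr1 starts at offset 0 and chr2 starts at offset 1000.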

genomics.sortable_coord(chrom, position=0, split=False, chronly=False)[source]

Return a string that allows for an ASCII sort, zero-padded so that chromosome 2 follows 1 and not 19, and M, X, and Y follow 22.

>>> sortable_coord('chr1:1234', split=True)
>>> sortable_coord('chr1', 1234)
>>> sortable_coord(line_of_a_vcf.split(const.TAB)[:2])
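The zero-padding idea behind a sortable coordinate can be sketched as below. The actual string format returned by sortable_coord is not shown in this documentation, so everything here is an assumption: the name sortable_key, the ranks given to M, X, and Y, the separator, and the padding widths are all hypothetical.

```python
# Assumed ranks so that M, X, and Y sort after chromosome 22.
SPECIAL = {"M": 23, "X": 24, "Y": 25}


def sortable_key(chrom, position=0):
    """Build a zero-padded string whose ASCII order matches genomic order."""
    label = chrom[3:] if chrom.startswith("chr") else chrom
    rank = SPECIAL[label] if label in SPECIAL else int(label)
    # Fixed-width fields make plain string comparison behave numerically.
    return f"{rank:02d}:{position:010d}"


ordered = sorted(["chr19", "chr2", "chrX", "chr1", "chrM", "chr22"], key=sortable_key)
```

In this sketch, sortable_key('chr1', 1234) gives '01:0000001234', and ordered is chr1, chr2, chr19, chr22, chrM, chrX.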
genomics.sortable_coord_to_bedfile_coord(locus)[source]
genomics.sortable_coord_to_vcf_coord(locus)[source]
genomics.sortable_line_to_bedline(line, sep='\t')[source]
genomics.sortable_line_to_vcfline(line, sep='\t')[source]
genomics.trim_chrom(chrom)[source]