pandas_tools.py

Tools to help with VCF manipulation in pandas.

pandas_tools.boolean_group_profile(df, grouping, l_limit=0, r_limit=None, label=None)[source]

Group the columns of dataframe df between l_limit and r_limit using grouping and return a boolean_profile of the grouping on those columns for each row.

grouping is a function on the values of the columns

l_limit and r_limit default to the leftmost and rightmost columns, i.e., all columns.

If label is given, a dataframe with label as its column name is returned. Otherwise, a Series is returned.
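One plausible reading of this description is sketched here. It is not the module's source: it assumes grouping is called on each value in the selected columns, that l_limit and r_limit are positional, and that the per-row result is the string of 1's and 0's described for boolean_profile. The name boolean_group_profile_sketch is hypothetical.

```python
import pandas as pd


def boolean_group_profile_sketch(df, grouping, l_limit=0, r_limit=None, label=None):
    """Hypothetical sketch: one '1'/'0' per selected column, for each row."""
    # Positional slice of the columns; r_limit=None runs to the last column.
    window = df.iloc[:, l_limit:r_limit]
    # Apply grouping to every value in the row and join the results.
    profile = window.apply(
        lambda row: "".join("1" if grouping(v) else "0" for v in row),
        axis=1,
    )
    # With a label, return a one-column dataframe; otherwise a Series.
    if label is not None:
        return profile.to_frame(name=label)
    return profile
```

For example, with grouping=lambda v: v > 0, a row holding 1, 0, 2 would yield the profile "101" under these assumptions.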

pandas_tools.boolean_profile(blist)[source]

Given a list of values that can be interpreted as booleans, return a string of 1’s and 0’s, one per value.
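A minimal sketch of the behavior described, assuming ordinary Python truthiness decides each digit. The name boolean_profile_sketch is hypothetical and the actual implementation may differ.

```python
def boolean_profile_sketch(blist):
    """Hypothetical sketch: map each value to '1' or '0' by truthiness."""
    return "".join("1" if bool(value) else "0" for value in blist)
```

Under this reading, the list [1, 0, "x", None] gives the string "1010".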

pandas_tools.collect_mapped_items(mlist, mapdict, sep='-')[source]

Build a sep-separated string of the sorted unique mappings from mapdict for the members of mlist

sep defaults to a dash ('-').
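The description above can be sketched roughly as follows. Two points are assumptions rather than documented behavior: members of mlist that are missing from mapdict are skipped, and the mapped values are strings. The name collect_mapped_items_sketch is hypothetical.

```python
def collect_mapped_items_sketch(mlist, mapdict, sep="-"):
    """Hypothetical sketch: sorted unique mappings joined by sep."""
    # A set removes duplicate mappings; unmapped members are skipped (assumption).
    mapped = {mapdict[member] for member in mlist if member in mapdict}
    return sep.join(sorted(mapped))
```

With mapdict = {"s1": "tumor", "s2": "normal", "s3": "tumor"} and mlist = ["s1", "s2", "s3"], this sketch returns "normal-tumor".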

pandas_tools.columns_showing_value(df, filter=<function <lambda>>, leftlimit=0, rightlimit=None, label='Filtered')[source]

Return a dataframe of the column labels of df that meet filter. This is no longer simply a transpose of rows_showing_value, but see that function for the details, which are identical once top is read as left and bottom as right (toplimit becomes leftlimit, bottomlimit becomes rightlimit).

The reason it differs is that transposing a multi-index causes a misbehavior that leads to an exception (which is probably a pandas bug, but as usual I feel disinclined to delve).

pandas_tools.condense_dataframe(df, matchlist, match_rows=True, match_columns=True)[source]

Given a set of row/column names, reduce df to contain only the terms found in matchlist.

By default, reduces both rows and columns.

Setting match_rows=False or match_columns=False leaves that axis unreduced. Setting both to False gives a warning.
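A minimal sketch of this kind of reduction, assuming "terms found in matchlist" means exact label matches on the index and columns, and that the dataframe is returned unchanged when both flags are False. The name condense_sketch is hypothetical and this is not the module's source.

```python
import warnings

import pandas as pd


def condense_sketch(df, matchlist, match_rows=True, match_columns=True):
    """Hypothetical sketch: keep only rows/columns whose labels are in matchlist."""
    if not match_rows and not match_columns:
        # Nothing to reduce; warn and hand the dataframe back (assumption).
        warnings.warn("neither rows nor columns selected; returning df unchanged")
        return df
    # A boolean mask selects matching labels; slice(None) keeps the whole axis.
    rows = df.index.isin(matchlist) if match_rows else slice(None)
    cols = df.columns.isin(matchlist) if match_columns else slice(None)
    return df.loc[rows, cols]
```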

pandas_tools.convert_chrpos_to_locus(df, indexing=None, verify=True)[source]

Take a dataframe with CHR and POS columns and change it to a CHR:00POS Locus format, reindexing to include Locus and other columns if desired.

indexing=[] indexes by Locus. Otherwise, indexing is a list of column names to add to the Locus for indexing.
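The zero-padded CHR:00POS form is what lets Locus strings on the same chromosome sort in positional order. A small sketch of building such a string follows; the helper name make_locus and the padding width of 9 are assumptions, since the width the module uses is not stated here.

```python
def make_locus(chrom, pos, width=9):
    """Hypothetical helper: 'CHR:POS' with POS zero-padded to a fixed width."""
    return f"{chrom}:{int(pos):0{width}d}"


# Padded loci sort as text in positional order within a chromosome.
loci = sorted([make_locus("chr1", 1000), make_locus("chr1", 200)])
```

Without the padding, "chr1:1000" would sort before "chr1:200" as text.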

pandas_tools.convert_locus_to_chrpos(df, indexing=None, chromosome='chr', verify=True)[source]

Take a dataframe with a CHR:00POS Locus format and change it to CHR and POS columns, reindexing if desired.

By default (indexing=None) the returned dataframe has only its numeric index. indexing=[] indexes by Locus. Otherwise, indexing is a list of column names or indices to add to the Locus for indexing.
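The reverse conversion can be sketched as a string split. This sketch assumes the Locus values live in a column named Locus (in the module they may instead be part of the index), ignores the chromosome and verify parameters, and mirrors only the default indexing=None case by returning a plain numeric index. The name locus_to_chrpos_sketch is hypothetical.

```python
import pandas as pd


def locus_to_chrpos_sketch(df, locus_col="Locus"):
    """Hypothetical sketch: split 'CHR:00POS' into CHR and integer POS columns."""
    # Split once on ':' so chromosome names stay intact.
    parts = df[locus_col].str.split(":", n=1, expand=True)
    out = df.drop(columns=[locus_col]).copy()
    out.insert(0, "CHR", parts[0])
    # Converting to int drops the zero padding.
    out.insert(1, "POS", parts[1].astype(int))
    return out.reset_index(drop=True)
```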

pandas_tools.idx_merge(df, df2, **kwargs)[source]

Curry out the virtually always used indexing and sorting from dataframe.merge()
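A wrapper of this kind might look like the sketch below. It assumes the "indexing and sorting" arguments are left_index=True, right_index=True and sort=True, which are real DataFrame.merge parameters, but whether the module uses exactly these is not confirmed. The name idx_merge_sketch is hypothetical.

```python
import pandas as pd


def idx_merge_sketch(df, df2, **kwargs):
    """Hypothetical sketch: merge on both indexes with sorting preset."""
    # setdefault lets a caller still override any of the preset arguments.
    kwargs.setdefault("left_index", True)
    kwargs.setdefault("right_index", True)
    kwargs.setdefault("sort", True)
    return df.merge(df2, **kwargs)
```

Any other merge argument, such as how="outer", passes through unchanged.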

pandas_tools.import_vtools_dataframe(dataframe, discriminant_columns=None, locus_columns=None, attribute_columns=None, sample_columns=None)[source]

Assumes dataframe is a vtools-generated VCF-like file from the standard lab vtools annotation suite. No provision is made to determine the version that generated the VCF-like file, which could cause subtle confusion at some point in the future. Returns a pair of dataframes indexed by sortable_coord: the first with sample data, the second with attribute data.

discriminant_columns is a list of additional columns to put in a multi-index. This really can’t be more than whatever the first column of the dataframe is because of indexing/sorting constraints. Default none.

locus_columns: columns in which to find the chr, pos values. The default, based on the Kadara workflow, is 0-based columns 1 and 2.

attribute_columns: columns to put into the second returned dataframe. Default is 0-based columns 4-22.

sample_columns: columns of interest. Default is column 23 to the end.
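The positional defaults can be illustrated with a small sketch. It assumes the range 4-22 is inclusive and covers only the column split: it does not build the sortable_coord index or handle discriminant_columns and locus_columns. The name split_by_position is hypothetical.

```python
import pandas as pd


def split_by_position(df, attribute_columns=None, sample_columns=None):
    """Hypothetical sketch: split columns by 0-based position into two frames."""
    ncols = df.shape[1]
    if attribute_columns is None:
        # 0-based columns 4 through 22, inclusive (assumption).
        attribute_columns = list(range(4, 23))
    if sample_columns is None:
        # Column 23 through the last column.
        sample_columns = list(range(23, ncols))
    # Sample data first, attribute data second, matching the documented order.
    return df.iloc[:, sample_columns], df.iloc[:, attribute_columns]
```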

pandas_tools.one_row_per_locus(df, last=False, selector=None, axis=1)[source]

Choose exactly one row for each locus, defaulting to the first, but allowing a selector function to choose based on some other criterion (again defaulting to the first if more than one row matches).

last: whether to select the last matching row instead of the first

selector: a boolean function given to df.apply(axis=1) for each row of the dataframe, to do a more refined selection

axis: I can’t figure out how it would make sense to apply on axis=0, but if you can, go for it.
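The selection logic can be sketched with Index.duplicated. The sketch assumes the locus is the dataframe index and that a locus with no row satisfying selector falls back to its first (or last) row; neither point is stated in the description. The name one_row_per_locus_sketch is hypothetical.

```python
import pandas as pd


def one_row_per_locus_sketch(df, last=False, selector=None, axis=1):
    """Hypothetical sketch: keep one row per index label, preferring selector hits."""
    keep = "last" if last else "first"
    if selector is not None:
        # Boolean mask of rows that satisfy the selector.
        mask = df.apply(selector, axis=axis).astype(bool)
        preferred = df[mask]
        # Loci with no selected row keep all their rows as a fallback (assumption).
        fallback = df[~df.index.isin(preferred.index)]
        df = pd.concat([preferred, fallback])
    # Drop all but the first (or last) row for each locus, then restore locus order.
    return df[~df.index.duplicated(keep=keep)].sort_index()
```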

pandas_tools.rearrange_column(df, fromcol, tocol=0)[source]

Move the column at index fromcol to index tocol (default: the first position).
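A minimal sketch of such a move, assuming fromcol and tocol are 0-based integer positions and that a new dataframe is returned rather than df being modified in place. The name rearrange_column_sketch is hypothetical.

```python
import pandas as pd


def rearrange_column_sketch(df, fromcol, tocol=0):
    """Hypothetical sketch: move one column to a new position by index."""
    order = list(range(df.shape[1]))
    # Remove the source position and reinsert it at the target position.
    order.insert(tocol, order.pop(fromcol))
    return df.iloc[:, order]
```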

pandas_tools.rename_column(df, fromname, toname)[source]

Change a column header from fromname to toname

pandas_tools.rows_showing_value(df, filter=<function <lambda>>, toplimit=0, bottomlimit=None, label='Filtered')[source]

Return a dataframe of the row-labels of df that meet filter on the columns between toplimit and bottomlimit, with label used as the column name

filter is a boolean function that defaults to returning values greater than zero

toplimit defaults to the top (first) row

bottomlimit defaults to the bottom (last) row

label defaults to ‘Filtered’

pandas_tools.stable_sort_dataframe(df, clist=None, ascending=True)[source]

Stable sort of df on the requested columns (all columns by default). mergesort is used because it is a stable algorithm; newer pandas releases also accept kind='stable'.
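A stable sort keeps rows with equal keys in their original relative order. The sketch below shows the idea with DataFrame.sort_values and kind="mergesort"; the name stable_sort_sketch is hypothetical and the module's own implementation may differ.

```python
import pandas as pd


def stable_sort_sketch(df, clist=None, ascending=True):
    """Hypothetical sketch: stable sort on clist, or on all columns if None."""
    by = list(df.columns) if clist is None else list(clist)
    return df.sort_values(by=by, ascending=ascending, kind="mergesort")


frame = pd.DataFrame({"k": [2, 1, 2, 1], "tag": ["a", "b", "c", "d"]})
# Rows with equal k keep their original order: b before d, a before c.
by_k = stable_sort_sketch(frame, ["k"])
```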

pandas_tools.transversion_to_apply(row)[source]

Return the transition/transversion classification for a variant as pulled from a pandas VCF dataframe, for use in:

>>> df.apply(thismodule.transversion_to_apply, axis=1)
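For reference, a transition swaps two purines (A, G) or two pyrimidines (C, T), and any other single-base substitution is a transversion. A row function of this shape could be sketched as below. The column names REF and ALT, the returned strings, and the None result for non-SNV rows are all assumptions, and the names classify_snv and transversion_sketch are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snv(ref, alt):
    """Hypothetical sketch: 'transition', 'transversion', or None if not an SNV."""
    ref, alt = str(ref).upper(), str(alt).upper()
    bases = PURINES | PYRIMIDINES
    # Indels, identical alleles and unknown bases are not classified.
    if ref == alt or ref not in bases or alt not in bases:
        return None
    pair = {ref, alt}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def transversion_sketch(row):
    """Row-wise wrapper, assuming REF and ALT columns."""
    return classify_snv(row["REF"], row["ALT"])
```

The wrapper takes anything indexable by "REF" and "ALT", so it fits df.apply(transversion_sketch, axis=1) on a dataframe with those columns.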