pandas_tools.py
Tools to help with VCF manipulation in pandas.
-
pandas_tools.boolean_group_profile(df, grouping, l_limit=0, r_limit=None, label=None)
Group the columns of dataframe df between l_limit and r_limit using grouping, and return a boolean_profile of the grouping on those columns for each row.
grouping is a function on the values of the columns.
l_limit and r_limit default to the leftmost and rightmost columns, i.e., all columns.
If label is given, a dataframe with label as its column name is returned. Otherwise, a Series is returned.
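The description above leaves several details open, so the following is only a rough sketch of one possible reading, not the module's implementation. It assumes that grouping maps a column label to a group key, that a group contributes "1" when any of its columns is truthy in the row, and that the limits follow Python slice semantics. The name boolean_group_profile_sketch and the example data are hypothetical.

```python
import pandas as pd


def boolean_group_profile_sketch(df, grouping, l_limit=0, r_limit=None, label=None):
    # Positional column slice; r_limit=None means "through the last column".
    sub = df.iloc[:, l_limit:r_limit]
    # Assumption: grouping maps each column label to a group key.
    keys = [grouping(col) for col in sub.columns]
    order = sorted(set(keys))

    def profile(row):
        # Assumption: a group is "1" if any of its columns is truthy in this row.
        bits = []
        for key in order:
            members = [bool(v) for v, k in zip(row, keys) if k == key]
            bits.append("1" if any(members) else "0")
        return "".join(bits)

    result = sub.apply(profile, axis=1)
    # Mirror the documented return types: DataFrame if label is given, else Series.
    return result.to_frame(name=label) if label is not None else result


# Hypothetical data: columns named <patient>_<tissue>, grouped by patient.
calls = pd.DataFrame(
    {"id": ["a", "b"], "P1_T": [1, 0], "P1_N": [0, 0], "P2_T": [0, 2]}
)
by_patient = lambda col: col.split("_")[0]
series_result = boolean_group_profile_sketch(calls, by_patient, l_limit=1)
frame_result = boolean_group_profile_sketch(calls, by_patient, l_limit=1, label="profile")
```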
-
pandas_tools.boolean_profile(blist)
Given a list of values that can be interpreted as booleans, return a string of 1s and 0s, one per value.
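As an illustration of the behavior described, here is a minimal sketch that assumes ordinary Python truthiness decides each digit. The real function may interpret values differently, and the name boolean_profile_sketch is hypothetical.

```python
def boolean_profile_sketch(blist):
    """Return a string of '1' and '0' characters, one per value in blist."""
    # Assumption: bool(v) (Python truthiness) decides each digit.
    return "".join("1" if bool(v) else "0" for v in blist)


print(boolean_profile_sketch([3, 0, "x", "", None]))  # prints 10100
```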
-
pandas_tools.collect_mapped_items(mlist, mapdict, sep='-')
Build a sep-separated string of the sorted, unique mappings from mapdict for the members of mlist.
sep defaults to a dash ('-').
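A minimal sketch of the behavior described above. It assumes that members of mlist missing from mapdict are skipped and that mapped values are compared as strings; neither point is stated in the documentation. The name collect_mapped_items_sketch and the sample mapping are hypothetical.

```python
def collect_mapped_items_sketch(mlist, mapdict, sep="-"):
    # Assumption: members without an entry in mapdict are silently skipped.
    mapped = {str(mapdict[m]) for m in mlist if m in mapdict}
    # A set removes duplicates; sorting gives a stable, reproducible order.
    return sep.join(sorted(mapped))


sample_to_tissue = {"s1": "tumor", "s2": "normal", "s3": "tumor"}
print(collect_mapped_items_sketch(["s3", "s1", "s2", "s1"], sample_to_tissue))
# prints normal-tumor
```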
-
pandas_tools.columns_showing_value(df, filter=<function <lambda>>, leftlimit=0, rightlimit=None, label='Filtered')
Return a dataframe of the column labels of df that meet filter. This is no longer simply a transpose of rows_showing_value, but see that function for the details, which are identical once toplimit is read as leftlimit and bottomlimit as rightlimit.
The reason it differs is that transposing a multi-index dataframe triggers a misbehavior that raises an exception. This is probably a pandas bug, but it has not been investigated.
-
pandas_tools.condense_dataframe(df, matchlist, match_rows=True, match_columns=True)
Given a set of row/column names, reduce df to contain only the terms found in matchlist.
By default, both rows and columns are reduced.
match_rows=False leaves the rows unfiltered, and match_columns=False leaves the columns unfiltered. Setting both to False gives a warning.
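The filtering described above could be sketched roughly as follows. This assumes exact label matching via isin and that the dataframe is returned unchanged when both flags are False; the documentation only says that a warning is given. The name condense_dataframe_sketch is hypothetical.

```python
import warnings

import pandas as pd


def condense_dataframe_sketch(df, matchlist, match_rows=True, match_columns=True):
    if not match_rows and not match_columns:
        # Assumption: warn and return the input unchanged.
        warnings.warn("neither rows nor columns are being matched")
        return df
    # slice(None) selects everything along an axis that is not being filtered.
    row_sel = df.index.isin(matchlist) if match_rows else slice(None)
    col_sel = df.columns.isin(matchlist) if match_columns else slice(None)
    return df.loc[row_sel, col_sel]


square = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=list("abc"), columns=list("abc")
)
both = condense_dataframe_sketch(square, ["a", "c"])
columns_only = condense_dataframe_sketch(square, ["a", "c"], match_rows=False)
```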
-
pandas_tools.convert_chrpos_to_locus(df, indexing=None, verify=True)
Take a dataframe with CHR and POS columns and change it to a CHR:00POS Locus format (the position is zero-padded), reindexing to include Locus and other columns if desired.
indexing=[] indexes by Locus alone. Otherwise, indexing is a list of column names to add to the Locus for indexing.
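To illustrate the CHR:00POS idea, here is a small sketch that builds a zero-padded Locus column and indexes by it, roughly corresponding to indexing=[]. The padding width of 9 digits is a hypothetical choice, since the actual width is not documented here, and chrpos_to_locus_sketch is not the module's function.

```python
import pandas as pd


def chrpos_to_locus_sketch(df, width=9):
    out = df.copy()
    # width is a hypothetical padding length; zero-padding lets loci on the
    # same chromosome sort correctly as plain strings.
    padded_pos = out["POS"].astype(int).astype(str).str.zfill(width)
    out["Locus"] = out["CHR"].astype(str) + ":" + padded_pos
    # Index by Locus alone and drop the now-redundant columns.
    return out.drop(columns=["CHR", "POS"]).set_index("Locus")


variants = pd.DataFrame(
    {"CHR": ["chr1", "chr1"], "POS": [12345, 999], "REF": ["A", "G"]}
)
by_locus = chrpos_to_locus_sketch(variants)
```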
-
pandas_tools.convert_locus_to_chrpos(df, indexing=None, chromosome='chr', verify=True)
Take a dataframe with a CHR:00POS Locus format and change it to CHR and POS columns, reindexing if desired.
By default (indexing=None), the returned dataframe has only its numeric index. indexing=[] indexes by Locus alone. Otherwise, indexing is a list of column names or indices to add to the Locus for indexing.
-
pandas_tools.idx_merge(df, df2, **kwargs)
Wrap dataframe.merge() so that the indexing and sorting options that are virtually always used are pre-set; any remaining options are passed through kwargs.
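A wrapper of this kind might look like the sketch below. The specific defaults (left_index=True, right_index=True, sort=True) are an assumption about which options are "virtually always used"; they are not confirmed by the documentation, and idx_merge_sketch is a hypothetical name.

```python
import pandas as pd


def idx_merge_sketch(df, df2, **kwargs):
    # Assumed defaults: join on both indexes and sort the join keys.
    options = {"left_index": True, "right_index": True, "sort": True}
    # Caller-supplied keyword arguments override the defaults.
    options.update(kwargs)
    return df.merge(df2, **options)


left = pd.DataFrame({"a": [1, 2]}, index=["y", "x"])
right = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])
merged = idx_merge_sketch(left, right)
outer = idx_merge_sketch(left, right, how="outer")
```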
-
pandas_tools.import_vtools_dataframe(dataframe, discriminant_columns=None, locus_columns=None, attribute_columns=None, sample_columns=None)
Assumes dataframe is a vtools-generated vcf-like file from the standard lab vtools annotation suite. No provision is made to determine the version that generated the vcf-like file, which could cause subtle confusion later. Returns a pair of dataframes indexed by sortable_coord, the first with sample data, the second with attribute data.
discriminant_columns: a list of additional columns to put in a multi-index. In practice this cannot be more than the first column of the dataframe, because of indexing/sorting constraints. Defaults to None.
locus_columns: the columns in which to find the chr, pos values. The default, based on the Kadara workflow, is 0-based columns 1 and 2.
attribute_columns: the columns to put into the second returned dataframe. The default is 0-based columns 4-22.
sample_columns: the columns of interest. The default is column 23 to the end.
-
pandas_tools.one_row_per_locus(df, last=False, selector=None, axis=1)
Choose exactly one row for each locus, defaulting to the first, but allowing a selector function to choose based on some other criterion (again defaulting to the first if more than one row matches the criterion).
last: whether to select the last matching row instead of the first.
selector: a boolean function given to df.apply(axis=1) for each row of the dataframe, to make a more refined selection.
axis: the axis passed to df.apply. It is not clear how applying on axis=0 would make sense, but the option is there if you find a use for it.
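The selection logic above can be sketched as follows. This assumes the locus is the dataframe index and that a locus with no row satisfying selector falls back to its first (or last) row; the documentation does not spell out either point. The name one_row_per_locus_sketch and the example data are hypothetical, and the sketch sorts the result by locus for simplicity.

```python
import pandas as pd


def one_row_per_locus_sketch(df, last=False, selector=None):
    keep = "last" if last else "first"
    # Mask that keeps one row per index value (assumed to be the locus).
    one_each = ~df.index.duplicated(keep=keep)
    if selector is None:
        return df[one_each]
    # Rows that satisfy the selector, reduced to one per locus.
    chosen = df[df.apply(selector, axis=1).astype(bool)]
    chosen = chosen[~chosen.index.duplicated(keep=keep)]
    # Assumption: loci with no selected row fall back to their first/last row.
    fallback = df[one_each & ~df.index.isin(chosen.index)]
    return pd.concat([chosen, fallback]).sort_index()


rows = pd.DataFrame(
    {"gene": ["A", "B", "C", "D", "E"], "depth": [5, 30, 12, 8, 1]},
    index=["chr1:100", "chr1:100", "chr2:200", "chr2:200", "chr3:300"],
)
first_rows = one_row_per_locus_sketch(rows)
last_rows = one_row_per_locus_sketch(rows, last=True)
deep_rows = one_row_per_locus_sketch(rows, selector=lambda r: r["depth"] >= 10)
```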
-
pandas_tools.rearrange_column(df, fromcol, tocol=0)
Move the column indexed at fromcol to the index at tocol (default: first).
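One way to express this move, assuming fromcol and tocol are positional (integer) column indexes and that a new dataframe is returned rather than modifying df in place. The name rearrange_column_sketch is hypothetical.

```python
import pandas as pd


def rearrange_column_sketch(df, fromcol, tocol=0):
    # Work with positions so that duplicate column labels are handled too.
    order = list(range(df.shape[1]))
    order.insert(tocol, order.pop(fromcol))
    return df.iloc[:, order]


table = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
moved_first = rearrange_column_sketch(table, 2)
moved_last = rearrange_column_sketch(table, 0, tocol=2)
```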
-
pandas_tools.rename_column(df, fromname, toname)
Change a column header from fromname to toname.
-
pandas_tools.rows_showing_value(df, filter=<function <lambda>>, toplimit=0, bottomlimit=None, label='Filtered')
Return a dataframe of the row labels of df that meet filter, considering the rows between toplimit and bottomlimit, with label used as the column name.
filter: a boolean function; the default returns True for values greater than zero.
toplimit: defaults to the top (first) row.
bottomlimit: defaults to the bottom (last) row.
label: defaults to 'Filtered'.