Data-Preprocessing

Documentation of ricebowl data preprocessing. To use this simply do from ricebowl.processing import data_preproc and then use each function with data_preproc.<function>

read_csv

General function to read a csv file.

Parameters- Path of the csv file

Output- Dataframe

Usage:

df = read_csv(path)

read_excel

General function to read an excel file.

Parameters- Path of the excel file, Sheet name

Output- Dataframe

Usage:

df = read_excel(path,sheet_name)

reformat_col_headers

General function for formatting the column headers to lower case. All “spaces” and “-” are replaced by “_”

Parameters- Dataframe

Output- Dataframe with formatted column headers

Usage:

df = reformat_col_headers(df)

str_to_datetime

General function to convert string columns in date format to datetime. Add the names of the columns which need to be converted to datetime from string type.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to datetime

Usage:

df = str_to_datetime(df, c1='col1', c2='col2' .... cn='col_n')

timestamp_to_datetime

General function to convert timestamp columns to datetime. Add the names of the columns which need to be converted to datetime from timestamp.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to datetime

Usage:

df = timestamp_to_datetime(df, c1='col1', c2='col2' .... cn='col_n')

to_timestamp

General function to convert datetime/str columns in datetime format to timestamp. Add the names of the columns which need to be converted to timestamp.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to timestamp

Usage:

df = to_timestamp(df, c1='col1', c2='col2' .... cn='col_n')

label_encode

General function to label encode the categorical columns. Add the names of the columns which need to be encoded.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated with encoded labels, label encoder object

Usage:

df = label_encode(df, c1='col1', c2='col2' .... cn='col_n')

one_hot_encode

General function to one-hot encode the categorical columns. Add the names of the columns which need to be encoded.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated with encoded labels

Usage:

df = one_hot_encode(df, c1='col1', c2='col2' .... cn='col_n')

dates_diff

General function to calculate the difference between 2 date columns.

Parameters- Dataframe, column 1, column 2, diff_type(Optional; Default=’days’; Takes in ‘days’/’weeks’/’months’/’years’)

Output- Dataframe with a new column according to difference. For example if diff_type=’weeks’ then the new column will be of the name ‘weeks’

Error Print- If a wrong “diff_type” is provided, prints an error message.

Usage:

df = dates_diff(df,col1,col2,diff_type='days')

drop_duplicates

General function to remove duplicate rows.

Parameters- Dataframe

Output- Dataframe without duplicate rows.

Usage:

df = drop_duplicates(df)

reset_index

General function to reset the index of the dataframe.

Parameters- Dataframe, Drop(True/False)

Output- Dataframe with a new index

Usage:

df = reset_index(df,drop=True)

to_dtype

General function to convert a column to a particular datatype.

Parameters- Dataframe, Data type, kwargs[column names]

Output- Dataframe with updated columns

Usage:

df = to_dtype(df, 'float', c1='col1', c2='col2'...., cn='col_n')

fill_mode

General function to fill null values with mode.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated. The null values in the columns will be filled with the mode of that column.

Usage:

df = fill_mode(df, c1='col1', c2='col2' .... cn='col_n')

fill_mean

General function to fill null values with mean.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated. The null values in the columns will be filled with the mean of that column.

Usage:

df = fill_mean(df, c1='col1', c2='col2' .... cn='col_n')

melt

General function to melt data.

Parameters- Dataframe, Columns to melt(in the form of a list), New column name to be made after melting, Column name displaying values; Default- ‘value’

Output- Dataframe with the columns updated. The data is melted.

Usage:

df = melt(df, ['col1','col2'...'col_n'], 'new_col_name_xyz', value)

split_columns

General function to make existing data a list of split values.

Parameters- Dataframe, Original Column, Separator to split on

Output- Dataframe with columns seperated. Example: if a column had dates like 2019-01-01 and we use this function with a separator ‘-‘, then the data will be modified to [2019,01,01]

Usage:

df = split_columns(df, 'column_name', separator='-')

remove_unwanted_chars

General function to remove unwanted characters from data.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with unwanted characters removed. (like $,€,£,inr,¥,₹)

Usage:

df = remove_unwanted_chars(df, c1='col1', c2='col2' .... cn='col_n')

fill_num_abbreviations

General function to fill “million M”, “billion B”, “thousand k”, “lakhs L”, “crore cr”.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with filled abbreviations. Example: 20k would be replaced by 20000

Usage:

df = fill_num_abbreviations(df, c1='col1', c2='col2' .... cn='col_n')

split_data

General function to split data for modeling purpose

Parameters- Data, Label, Test size (optional; default=0.3)

Output- xtrain, xtest, ytrain, ytest in array format.

Usage:

xtrain, xtest, ytrain, ytest = split_data(data, label, test_size=0.25)

find_corr

General function to find correlation excluding all null values

Parameters- Dataframe, method(optional; default=’pearson’)

Output- correlation data frame.

Usage:

corr = find_corr(df, method='spearman')

zscore_outliers

General function to find outliers in a random variable using zscore with a threshold of 2.5 for better results

Parameters- Datarame series

Output- List of outliers.

Usage:

outliers = zscore_outliers(df['xyz'])

standarization

General function to standardize the data using standard scaler

Parameters- Dataframe, list of columns to be converted

Output- Dataframe with updated columns

Usage:

df = standarization(data, list_of_cols=['xyz','abc'])

normalization

General function to normalize the data using min-max scaler

Parameters- Dataframe, list of columns to be converted

Output- Dataframe with updated columns

Usage:

df = normalization(data, list_of_cols=['xyz','abc'])

basic_stats

General function to get all the basic stats of the data

Parameters- Dataframe, file path to write the stats(Optional, default=None- prints on console)

Output- Basic stats in string format.

Usage:

stats = basic_stats(data, file = './xyz.txt')