Data-Preprocessing
==================
Documentation of the ricebowl data preprocessing functions.
To use them, import the module with ``from ricebowl.processing import data_preproc`` and call each function as ``data_preproc.<function>``.

read_csv
^^^^^^^^
General function to read a CSV file.

Parameters- Path of the CSV file

Output- Dataframe

Usage::
    
    df = read_csv(path)


read_excel
^^^^^^^^^^
General function to read an Excel file.

Parameters- Path of the Excel file, Sheet name

Output- Dataframe

Usage::
    
    df = read_excel(path,sheet_name)


reformat_col_headers
^^^^^^^^^^^^^^^^^^^^
General function to convert the column headers to lower case.
All spaces and hyphens ("-") are replaced by underscores ("_").

Parameters- Dataframe

Output- Dataframe with formatted column headers

Usage::
    
    df = reformat_col_headers(df)
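
The renaming rule described above can be sketched in plain Python. This is only an illustration of the transformation, not ricebowl's implementation; ``normalize_header`` is a hypothetical helper name.

```python
def normalize_header(name: str) -> str:
    """Lower-case a header and replace spaces and hyphens with underscores."""
    return name.lower().replace(" ", "_").replace("-", "_")


headers = ["Order ID", "Ship-Date", "Unit Price"]
normalized = [normalize_header(h) for h in headers]
print(normalized)  # ['order_id', 'ship_date', 'unit_price']
```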


str_to_datetime
^^^^^^^^^^^^^^^
General function to convert string columns in date format to datetime.
Add the names of the columns which need to be converted to datetime from string type.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to datetime

Usage::
    
    df = str_to_datetime(df, c1='col1', c2='col2' .... cn='col_n')


timestamp_to_datetime
^^^^^^^^^^^^^^^^^^^^^
General function to convert timestamp columns to datetime.
Add the names of the columns which need to be converted to datetime from timestamp.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to datetime

Usage::
    
    df = timestamp_to_datetime(df, c1='col1', c2='col2' .... cn='col_n')


to_timestamp
^^^^^^^^^^^^
General function to convert datetime/str columns in datetime format to timestamp.
Add the names of the columns which need to be converted to timestamp.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated to timestamp

Usage::
    
    df = to_timestamp(df, c1='col1', c2='col2' .... cn='col_n')
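
As a rough illustration of the conversion in both directions, the sketch below assumes that "timestamp" means seconds since the Unix epoch and that values are in UTC; the documentation above does not state either. The helper names and the date format string are hypothetical.

```python
from datetime import datetime, timezone


def to_unix_timestamp(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> float:
    """Parse a datetime string as UTC and return seconds since the Unix epoch."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def from_unix_timestamp(seconds: float) -> datetime:
    """Convert seconds since the Unix epoch back to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


ts = to_unix_timestamp("2019-01-01 00:00:00")
print(ts)  # 1546300800.0
print(from_unix_timestamp(ts))  # 2019-01-01 00:00:00+00:00
```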


label_encode
^^^^^^^^^^^^
General function to label encode the categorical columns.
Add the names of the columns which need to be encoded.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated with encoded labels, label encoder object

Usage::
    
    df, encoder = label_encode(df, c1='col1', c2='col2' .... cn='col_n')
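
For readers unfamiliar with label encoding, here is a minimal plain-Python sketch of the idea: each distinct category is mapped to an integer code. Assigning codes in sorted order is an assumption of this sketch, and ``label_encode_values`` is a hypothetical name, not part of ricebowl.

```python
def label_encode_values(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


codes, mapping = label_encode_values(["red", "green", "blue", "green"])
print(codes)    # [2, 1, 0, 1]
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
```

Keeping the mapping around is what makes it possible to decode the integers back to the original labels later.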


one_hot_encode
^^^^^^^^^^^^^^
General function to one-hot encode the categorical columns.
Add the names of the columns which need to be encoded.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated with encoded labels

Usage::
    
    df = one_hot_encode(df, c1='col1', c2='col2' .... cn='col_n')
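
One-hot encoding replaces a categorical value with one indicator per category. The plain-Python sketch below shows the idea only; the function name and the sorted category order are assumptions, and the column names ricebowl actually produces are not documented here.

```python
def one_hot_encode_values(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot_encode_values(["cat", "dog", "cat"])
print(categories)  # ['cat', 'dog']
print(rows)        # [[1, 0], [0, 1], [1, 0]]
```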


dates_diff
^^^^^^^^^^
General function to calculate the difference between 2 date columns.

Parameters- Dataframe, column 1, column 2, diff_type (optional; default='days'; one of 'days', 'weeks', 'months', 'years')

Output- Dataframe with a new column, named after diff_type, that holds the difference. For example, if diff_type='weeks', the new column is named 'weeks'.

Error Print- If an unsupported diff_type is provided, an error message is printed.

Usage::
    
    df = dates_diff(df,col1,col2,diff_type='days')
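
The unit handling can be sketched in plain Python as follows. The average month and year lengths are assumptions made for this sketch (the documentation does not say how ricebowl defines them), ``date_difference`` is a hypothetical name, and the sketch raises ``ValueError`` where the documented function prints a message.

```python
from datetime import date

# Unit lengths in days. The month and year values are averages chosen for
# this sketch; they are not taken from ricebowl.
DAYS_PER_UNIT = {"days": 1, "weeks": 7, "months": 30.4375, "years": 365.25}


def date_difference(start: date, end: date, diff_type: str = "days") -> float:
    """Return the difference between two dates in the requested unit."""
    if diff_type not in DAYS_PER_UNIT:
        raise ValueError(f"Unsupported diff_type: {diff_type!r}")
    return (end - start).days / DAYS_PER_UNIT[diff_type]


print(date_difference(date(2019, 1, 1), date(2019, 1, 15)))           # 14.0
print(date_difference(date(2019, 1, 1), date(2019, 1, 15), "weeks"))  # 2.0
```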


drop_duplicates
^^^^^^^^^^^^^^^
General function to remove duplicate rows.

Parameters- Dataframe

Output- Dataframe without duplicate rows.

Usage::
    
    df = drop_duplicates(df)


reset_index
^^^^^^^^^^^
General function to reset the index of the dataframe.

Parameters- Dataframe, Drop(True/False)

Output- Dataframe with a new index

Usage::
    
    df = reset_index(df,drop=True)


to_dtype
^^^^^^^^
General function to convert a column to a particular datatype.

Parameters- Dataframe, Data type, kwargs[column names]

Output- Dataframe with updated columns

Usage::
    
    df = to_dtype(df, 'float', c1='col1', c2='col2'...., cn='col_n')


fill_mode
^^^^^^^^^
General function to fill null values with mode.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated. The null values in the columns will be filled with the mode of that column.

Usage::
    
    df = fill_mode(df, c1='col1', c2='col2' .... cn='col_n')


fill_mean
^^^^^^^^^
General function to fill null values with mean.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with the columns updated. The null values in the columns will be filled with the mean of that column.

Usage::
    
    df = fill_mean(df, c1='col1', c2='col2' .... cn='col_n')
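
The filling step used by ``fill_mean`` (and, with a different statistic, by ``fill_mode``) can be illustrated in plain Python. This sketch represents nulls as ``None`` and uses the hypothetical helper ``fill_missing``; it is not ricebowl's implementation.

```python
from statistics import mean, mode


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or mode of the non-null entries."""
    present = [v for v in values if v is not None]
    fill_value = mean(present) if strategy == "mean" else mode(present)
    return [fill_value if v is None else v for v in values]


print(fill_missing([1.0, None, 3.0]))                        # [1.0, 2.0, 3.0]
print(fill_missing(["a", None, "b", "a"], strategy="mode"))  # ['a', 'a', 'b', 'a']
```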


melt
^^^^
General function to melt data.

Parameters- Dataframe, list of columns to melt, name of the new column that holds the melted column names, name of the column that holds the values (default 'value')

Output- Dataframe in melted (long) format.

Usage::
    
    df = melt(df, ['col1', 'col2', ..., 'col_n'], 'new_col_name_xyz', 'value')
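
To show what melting does to the shape of the data, here is a plain-Python sketch that works on a list of dictionaries instead of a dataframe. The function and variable names are hypothetical, and the row order is a choice made for this sketch.

```python
def melt_rows(rows, columns_to_melt, var_name, value_name="value"):
    """Turn each melted column into rows of (identifier columns, name, value)."""
    melted = []
    for column in columns_to_melt:
        for row in rows:
            # Keep the columns that are not being melted as identifiers.
            kept = {k: v for k, v in row.items() if k not in columns_to_melt}
            kept[var_name] = column
            kept[value_name] = row[column]
            melted.append(kept)
    return melted


rows = [{"id": 1, "q1": 10, "q2": 20}, {"id": 2, "q1": 30, "q2": 40}]
long_rows = melt_rows(rows, ["q1", "q2"], "quarter")
print(long_rows[0])    # {'id': 1, 'quarter': 'q1', 'value': 10}
print(len(long_rows))  # 4
```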


split_columns
^^^^^^^^^^^^^
General function to split the values of a column on a separator, replacing each value with a list of its parts.

Parameters- Dataframe, Original Column, Separator to split on

Output- Dataframe with the column values split.
Example: if a column holds dates like 2019-01-01 and the function is called with the separator '-', each value becomes ['2019', '01', '01'].

Usage::

    df = split_columns(df, 'column_name', separator='-')
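
The per-value operation is equivalent to Python's ``str.split``. The short sketch below applies it to a plain list; ``split_values`` is a hypothetical name used only for illustration.

```python
def split_values(values, separator):
    """Split every string in the list on the given separator."""
    return [value.split(separator) for value in values]


split_dates = split_values(["2019-01-01", "2020-12-31"], "-")
print(split_dates)  # [['2019', '01', '01'], ['2020', '12', '31']]
```

Note that the parts remain strings, so ``'01'`` keeps its leading zero.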


remove_unwanted_chars
^^^^^^^^^^^^^^^^^^^^^
General function to remove unwanted characters from data.

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with unwanted characters (such as $, €, £, inr, ¥, ₹) removed.

Usage::

    df = remove_unwanted_chars(df, c1='col1', c2='col2' .... cn='col_n')
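
A minimal sketch of this kind of cleanup, using a regular expression over the symbols listed above. The exact character set ricebowl removes, and whether "inr" is matched case-insensitively, are assumptions; ``strip_currency`` is a hypothetical name.

```python
import re

# Symbols taken from the description above. Case-insensitive matching of
# "inr" is an assumption of this sketch.
CURRENCY_PATTERN = re.compile(r"[$€£¥₹]|inr", flags=re.IGNORECASE)


def strip_currency(text: str) -> str:
    """Remove currency markers and surrounding whitespace from a string."""
    return CURRENCY_PATTERN.sub("", text).strip()


print(strip_currency("$1,200"))   # 1,200
print(strip_currency("INR 450"))  # 450
print(strip_currency("₹99"))      # 99
```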

 
fill_num_abbreviations
^^^^^^^^^^^^^^^^^^^^^^
General function to expand number abbreviations: "k" (thousand), "M" (million), "B" (billion), "L" (lakh), "cr" (crore).

Parameters- Dataframe, kwargs[column names]

Output- Dataframe with filled abbreviations.
Example: 20k would be replaced by 20000

Usage::

    df = fill_num_abbreviations(df, c1='col1', c2='col2' .... cn='col_n')
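
The expansion can be sketched in plain Python as a suffix lookup. The multipliers follow the abbreviations named above; case-insensitive matching, returning floats, and leaving unrecognized values untouched are assumptions of this sketch, and ``expand_abbreviation`` is a hypothetical name.

```python
import re

MULTIPLIERS = {
    "k": 1_000,           # thousand
    "l": 100_000,         # lakh
    "m": 1_000_000,       # million
    "cr": 10_000_000,     # crore
    "b": 1_000_000_000,   # billion
}
ABBREVIATION = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(cr|k|m|b|l)\s*$", re.IGNORECASE)


def expand_abbreviation(text: str):
    """Return the numeric value of strings like '20k'; pass others through."""
    match = ABBREVIATION.match(text)
    if match is None:
        return text
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix.lower()]


print(expand_abbreviation("20k"))   # 20000.0
print(expand_abbreviation("1.5M"))  # 1500000.0
print(expand_abbreviation("2 cr"))  # 20000000.0
print(expand_abbreviation("n/a"))   # n/a
```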


split_data
^^^^^^^^^^
General function to split data into training and test sets for modeling purposes.

Parameters- Data, Label, Test size (optional; default=0.3)

Output- xtrain, xtest, ytrain, ytest in array format.

Usage::

    xtrain, xtest, ytrain, ytest = split_data(data, label, test_size=0.25)
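
Conceptually, a train/test split shuffles the rows and holds out a fraction of them. The plain-Python sketch below illustrates that idea; the ``seed`` parameter and the name ``split_rows`` are hypothetical and are not documented ricebowl arguments.

```python
import random


def split_rows(data, labels, test_size=0.3, seed=0):
    """Shuffle row indices, then hold out a test_size fraction for testing."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(data) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    xtrain = [data[i] for i in train_idx]
    xtest = [data[i] for i in test_idx]
    ytrain = [labels[i] for i in train_idx]
    ytest = [labels[i] for i in test_idx]
    return xtrain, xtest, ytrain, ytest


data = list(range(10))
labels = [x % 2 for x in data]
xtrain, xtest, ytrain, ytest = split_rows(data, labels, test_size=0.3)
print(len(xtrain), len(xtest))  # 7 3
```

Splitting by shared indices keeps every label aligned with its row in both subsets.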


find_corr
^^^^^^^^^
General function to compute the correlation between columns, excluding all null values.

Parameters- Dataframe, method(optional; default='pearson')

Output- Correlation dataframe.

Usage::

    corr = find_corr(df, method='spearman')
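
For reference, here is a plain-Python sketch of the Pearson coefficient for two columns in which pairs containing a null (``None``) are skipped. Whether ricebowl drops nulls pairwise or row-wise across the whole dataframe is not stated above, so treat the exclusion rule as an assumption; ``pearson`` is a hypothetical helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# The third pair is ignored because x is missing; the rest are perfectly linear,
# so the coefficient is 1 up to floating-point rounding.
r = pearson([1, 2, None, 4], [2, 4, 6, 8])
print(round(r, 6))
```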


zscore_outliers
^^^^^^^^^^^^^^^
General function to find outliers in a variable using the z-score with a threshold of 2.5.

Parameters- Dataframe series

Output- List of outliers.

Usage::

    outliers = zscore_outliers(df['xyz'])
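
The z-score rule flags values whose distance from the mean exceeds the threshold when measured in standard deviations. The sketch below uses the population standard deviation, which is an assumption; ``zscore_outliers_sketch`` is a hypothetical name and not the ricebowl function itself.

```python
from statistics import mean, pstdev


def zscore_outliers_sketch(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers_sketch(values))  # [50]
```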


standarization
^^^^^^^^^^^^^^
General function to standardize the data using a standard scaler.

Parameters- Dataframe, list of columns to be converted

Output- Dataframe with updated columns 

Usage::

    df = standarization(data, list_of_cols=['xyz','abc'])
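
Standard scaling subtracts the mean and divides by the standard deviation, so the result has mean 0 and unit variance. A plain-Python sketch for a single column follows, using the population standard deviation; ``standardize`` is a hypothetical helper.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = standardize([2.0, 4.0, 6.0])
print([round(v, 4) for v in scaled])  # [-1.2247, 0.0, 1.2247]
```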


normalization
^^^^^^^^^^^^^
General function to normalize the data using a min-max scaler.

Parameters- Dataframe, list of columns to be converted

Output- Dataframe with updated columns

Usage::

    df = normalization(data, list_of_cols=['xyz','abc'])
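
Min-max scaling maps the smallest value to 0 and the largest to 1. The sketch below shows the formula for a single column; returning zeros for a constant column is a choice made here, and ``min_max_normalize`` is a hypothetical helper.

```python
def min_max_normalize(values):
    """Rescale values linearly so that min maps to 0.0 and max maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```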


basic_stats
^^^^^^^^^^^
General function to get the basic statistics of the data.

Parameters- Dataframe, file path to write the stats to (optional; default=None, which prints to the console)

Output- Basic stats in string format.

Usage::

    stats = basic_stats(data, file='./xyz.txt')