mightypy.stats package

Module contents

mightypy.stats

class WOE_IV(event: str, non_event: str, target_col: str, bucket_col: str, value_col: str | None = None, agg_func: Callable = np.count_nonzero, bucket_col_type: str = 'continuous', n_buckets: int = 10)

Bases: object

Weight of Evidence and Information Value.

References

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Parameters:
  • event (str) – Event label in the target column. Generally the true/1 label.

  • non_event (str) – Non-event label in the target column. Generally the false/0 label.

  • target_col (str) – Target column name.

  • value_col (str, optional) – Name of the value column to aggregate (count). Defaults to None.

  • bucket_col (str) – Bucketing column name.

  • agg_func (Callable, optional) – Aggregation function. Defaults to np.count_nonzero.

  • bucket_col_type (str, optional) – Value type of the bucketing column. If ‘discrete’, no buckets are created; otherwise the column is split into buckets. Defaults to ‘continuous’.

  • n_buckets (int, optional) – Number of artificial buckets to create when the bucket column has continuous values. Defaults to 10.

Examples

>>> from sklearn.datasets import load_breast_cancer
>>> from mightypy.stats import WOE_IV
>>> dataset = load_breast_cancer(as_frame=True)
>>> # copy the slice so adding a column does not trigger SettingWithCopyWarning
>>> df = dataset.frame[['mean radius', 'target']].copy()
>>> target_map = {0: 'False', 1: 'True'}
>>> df['label'] = df['target'].map(target_map)
>>> obj = WOE_IV(event='True', non_event='False', target_col='label',
...              bucket_col='mean radius')
>>> cal_df, iv = obj.values(df)
>>> fig = obj.plot()
>>> fig.tight_layout()
>>> fig.show()

or directly

>>> fig = obj.plot(df)
>>> fig.show()

plot(df: DataFrame | None = None, figsize=(10, 5)) → Figure

Plot weight of evidence and subsequent plots.

Parameters:
  • df (Optional[pd.DataFrame], optional) – Input dataframe. Defaults to None.

  • figsize (tuple, optional) – Figure size. Defaults to (10, 5).

Raises:

ValueError – If no dataframe is available, either stored on the object or passed to the method.

Returns:

matplotlib figure.

Return type:

plt.Figure

values(df: DataFrame | None = None) → Tuple[DataFrame, float]

Returns weight of evidence and information value for given dataframe.

Parameters:

df (Optional[pd.DataFrame], optional) – Input dataframe. Defaults to None.

Raises:

ValueError – If no input dataframe is available, either stored on the object or passed to the method.

Returns:

calculated dataframe and information value.

Return type:

Tuple[pd.DataFrame, float]
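
The quantities returned by values can be illustrated without the library. The following is a minimal sketch, assuming the convention used in the referenced article: WoE = ln(% of non-events / % of events) per bucket, and IV = sum of (% of non-events − % of events) × WoE. The helper name woe_iv_sketch, its output column names, and the toy data are hypothetical; the sign convention and output layout of WOE_IV itself may differ.

```python
import numpy as np
import pandas as pd


def woe_iv_sketch(df, bucket_col, target_col, event, non_event):
    """Per-bucket WoE and total IV for an already-bucketed column.

    Hypothetical helper for illustration; not part of mightypy.
    """
    # Count events and non-events in each bucket.
    counts = pd.crosstab(df[bucket_col], df[target_col])
    counts = counts.reindex(columns=[event, non_event], fill_value=0)
    # Share of all events / all non-events that falls in each bucket.
    pct_event = counts[event] / counts[event].sum()
    pct_non_event = counts[non_event] / counts[non_event].sum()
    # WoE = ln(% non-events / % events). A bucket with a zero count
    # yields an infinite WoE; this sketch does not smooth such buckets.
    woe = np.log(pct_non_event / pct_event)
    iv_terms = (pct_non_event - pct_event) * woe
    table = pd.DataFrame({
        "pct_event": pct_event,
        "pct_non_event": pct_non_event,
        "woe": woe,
        "iv": iv_terms,
    })
    return table, float(iv_terms.sum())


# Toy data: two buckets, three events and three non-events in total.
toy = pd.DataFrame({
    "bucket": ["low", "low", "low", "high", "high", "high"],
    "label": ["True", "False", "False", "True", "True", "False"],
})
table, iv = woe_iv_sketch(toy, "bucket", "label",
                          event="True", non_event="False")
print(table)
print(iv)
```

In the toy data the "low" bucket holds 1/3 of the events and 2/3 of the non-events, so its WoE is ln 2; the "high" bucket mirrors it with −ln 2, and each bucket contributes (1/3)·ln 2 to the IV.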

population_stability_index(expected: list | ndarray, actual: list | ndarray, data_type: str) → DataFrame

Population Stability Index.

References

https://www.listendata.com/2015/05/population-stability-index.html

Parameters:
  • expected (Union[list, np.ndarray]) – Expected values.

  • actual (Union[list, np.ndarray]) – Actual values.

  • data_type (str) – Type of data, e.g. ‘continuous’ or ‘discrete’ as in the examples below. Determines how values are bucketed.

Returns:

calculated dataframe.

Return type:

pd.DataFrame

Examples

>>> import numpy as np
>>> from mightypy.stats import population_stability_index
>>> # continuous data
>>> expected_continuous = np.random.normal(size=(500,))
>>> actual_continuous = np.random.normal(size=(500,))
>>> psi_df = population_stability_index(expected_continuous, actual_continuous, data_type='continuous')
>>> psi_df.psi.sum()
>>> # discrete data
>>> expected_discrete = np.random.randint(0,10, size=(500,))
>>> actual_discrete = np.random.randint(0,10, size=(500,))
>>> psi_df = population_stability_index(expected_discrete, actual_discrete, data_type='discrete')
>>> psi_df.psi.sum()
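
For intuition about the number that psi_df.psi.sum() yields, here is a minimal sketch of the usual PSI formula for continuous data: PSI = sum over buckets of (actual% − expected%) × ln(actual% / expected%). The helper psi_sketch, the quantile-based bucketing on the expected sample, and the small floor eps are all assumptions made for illustration; population_stability_index may bucket and smooth differently.

```python
import numpy as np


def psi_sketch(expected, actual, n_buckets=10, eps=1e-6):
    """Total PSI between two 1-D samples (hypothetical helper, not mightypy)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bucket edges from quantiles of the expected sample
    # (an assumed bucketing scheme).
    inner = np.quantile(expected, np.linspace(0, 1, n_buckets + 1))[1:-1]
    # Assign every value to a bucket 0..n_buckets-1; values outside the
    # expected range fall into the first or last bucket.
    exp_idx = np.searchsorted(inner, expected, side="right")
    act_idx = np.searchsorted(inner, actual, side="right")
    exp_pct = np.bincount(exp_idx, minlength=n_buckets) / expected.size
    act_pct = np.bincount(act_idx, minlength=n_buckets) / actual.size
    # Floor the shares so empty buckets do not produce log(0).
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(size=500)
same_dist = rng.normal(size=500)
shifted = rng.normal(loc=1.0, size=500)
print(psi_sketch(baseline, same_dist))  # small: same distribution
print(psi_sketch(baseline, shifted))    # larger: the mean has shifted
```

Every term of the sum is non-negative, so PSI is zero only when the bucket shares match exactly and grows as the two distributions drift apart.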