causallift package

CausalLift

Submodules

causallift.causal_lift module

class causallift.causal_lift.CausalLift(train_df=None, test_df=None, cols_features=None, col_treatment='Treatment', col_outcome='Outcome', col_propensity='Propensity', col_cate='CATE', col_recommendation='Recommendation', min_propensity=0.01, max_propensity=0.99, verbose=2, uplift_model_params={'cv': 3, 'estimator': 'xgboost.XGBClassifier', 'n_jobs': -1, 'param_grid': {'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1], 'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1], 'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1], 'missing': [None], 'n_estimators': [100], 'n_jobs': [-1], 'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0], 'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [1], 'subsample': [1], 'verbose': [0]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, enable_ipw=True, propensity_model_params={'cv': 3, 'estimator': 'sklearn.linear_model.LogisticRegression', 'n_jobs': -1, 'param_grid': {'C': [0.1, 1, 10], 'class_weight': [None], 'dual': [False], 'fit_intercept': [True], 'intercept_scaling': [1], 'max_iter': [100], 'multi_class': ['ovr'], 'n_jobs': [1], 'penalty': ['l1', 'l2'], 'random_state': [0], 'solver': ['liblinear'], 'tol': [0.0001], 'warm_start': [False]}, 'return_train_score': False, 'scoring': None, 'search_cv': 'sklearn.model_selection.GridSearchCV'}, cv=3, index_name='index', partition_name='partition', runner='SequentialRunner', conditionally_skip=False, dataset_catalog={'df_03': <kedro.io.csv_local.CSVLocalDataSet object>, 'estimated_effect_df': <kedro.io.csv_local.CSVLocalDataSet object>, 'propensity_model': <kedro.io.pickle_local.PickleLocalDataSet object>, 'treated__sim_eval_df': <kedro.io.csv_local.CSVLocalDataSet object>, 'untreated__sim_eval_df': <kedro.io.csv_local.CSVLocalDataSet object>, 'uplift_models_dict': <kedro.io.pickle_local.PickleLocalDataSet object>}, logging_config={'disable_existing_loggers': False, 'formatters': {'json_formatter': {'class': 'pythonjsonlogger.jsonlogger.JsonFormatter', 'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s'}, 'simple': {'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'level': 'INFO', 'stream': 'ext://sys.stdout'}, 'error_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './errors.log', 'formatter': 'simple', 'level': 'ERROR', 'maxBytes': 10485760}, 'info_file_handler': {'backupCount': 20, 'class': 'logging.handlers.RotatingFileHandler', 'delay': True, 'encoding': 'utf8', 'filename': './info.log', 'formatter': 'simple', 'level': 'INFO', 'maxBytes': 10485760}}, 'loggers': {'anyconfig': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'causallift': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.io': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'WARNING', 'propagate': False}, 'kedro.pipeline': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}, 'kedro.runner': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO', 'propagate': False}}, 'root': {'handlers': ['console', 'info_file_handler', 'error_file_handler'], 'level': 'INFO'}, 'version': 1})[source]

Bases: object

Set up datasets for uplift modeling. Optionally, propensity scores are estimated based on logistic regression.

Parameters
  • train_df (Optional[DataFrame]) – pandas DataFrame containing samples used for training

  • test_df (Optional[DataFrame]) – pandas DataFrame containing samples used for testing

  • cols_features (Optional[List[str]]) – List of column names used as features. If None (default), all columns except the outcome, propensity, CATE, and recommendation columns are used.

  • col_treatment (str) – Name of the treatment column. ‘Treatment’ by default.

  • col_outcome (str) – Name of the outcome column. ‘Outcome’ by default.

  • col_propensity (str) – Name of the propensity column. ‘Propensity’ by default.

  • col_cate (str) – Name of the CATE (Conditional Average Treatment Effect) column. ‘CATE’ by default.

  • col_recommendation (str) – Name of the recommendation column. ‘Recommendation’ by default.

  • min_propensity (float) – Minimum propensity score. 0.01 by default.

  • max_propensity (float) – Maximum propensity score. 0.99 by default.

  • verbose (int) –

    How much info to show. Valid values are:

    • 0 to show nothing

    • 1 to show only warnings

    • 2 (default) to show useful info

    • 3 to show more info

  • uplift_model_params (Union[Dict[str, List[Any]], Type[BaseEstimator]]) –

    Parameters used to fit the two XGBoost classifier models (one for treated samples and one for untreated samples).

    If None (default):

    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="xgboost.XGBClassifier",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            max_depth=[3],
            learning_rate=[0.1],
            n_estimators=[100],
            verbose=[0],
            objective=["binary:logistic"],
            booster=["gbtree"],
            n_jobs=[-1],
            nthread=[None],
            gamma=[0],
            min_child_weight=[1],
            max_delta_step=[0],
            subsample=[1],
            colsample_bytree=[1],
            colsample_bylevel=[1],
            reg_alpha=[0],
            reg_lambda=[1],
            scale_pos_weight=[1],
            base_score=[0.5],
            missing=[None],
        ),
    )
    

    Alternatively, an estimator model object is acceptable. The object must implement the following methods, compatible with the scikit-learn estimator interface:

    • fit()

    • predict()

    • predict_proba()

  • enable_ipw (bool) – Enable Inverse Probability Weighting based on the estimated propensity score. True by default.

  • propensity_model_params (Dict[str, List[Any]]) –

    Parameters used to fit the logistic regression model that estimates the propensity score.

    If None (default):

    dict(
        search_cv="sklearn.model_selection.GridSearchCV",
        estimator="sklearn.linear_model.LogisticRegression",
        scoring=None,
        cv=3,
        return_train_score=False,
        n_jobs=-1,
        param_grid=dict(
            C=[0.1, 1, 10],
            class_weight=[None],
            dual=[False],
            fit_intercept=[True],
            intercept_scaling=[1],
            max_iter=[100],
            multi_class=["ovr"],
            n_jobs=[1],
            penalty=["l1", "l2"],
            solver=["liblinear"],
            tol=[0.0001],
            warm_start=[False],
        ),
    )
    

  • index_name (str) –

    Index name of the pandas DataFrame after resetting the index. ‘index’ by default.

    If None, the index will not be reset.

  • partition_name (str) – Additional index name to indicate the partition, train or test. ‘partition’ by default.

  • runner (str) –

    If set to ‘SequentialRunner’ (default) or ‘ParallelRunner’, the pipeline is run by Kedro sequentially or in parallel, respectively.

    If set to None, the pipeline is run with plain Python (without Kedro).

    Refer to https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes_and_pipelines.html#runners

  • conditionally_skip (bool) –

    [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

    Skip running the pipeline if the output files already exist. False by default.

  • dataset_catalog (Dict[str, AbstractDataSet]) –

    [Effective only if runner is set to either ‘SequentialRunner’ or ‘ParallelRunner’]

    Specify dataset files to save in Dict[str, kedro.io.AbstractDataSet] format.

    To find available file formats, refer to https://kedro.readthedocs.io/en/latest/kedro.io.html#data-sets

    By default:

    dict(
        # args_raw = CSVLocalDataSet(filepath='../data/01_raw/args_raw.csv', version=None),
        # train_df = CSVLocalDataSet(filepath='../data/01_raw/train_df.csv', version=None),
        # test_df = CSVLocalDataSet(filepath='../data/01_raw/test_df.csv', version=None),
        propensity_model  = PickleLocalDataSet(
            filepath='../data/06_models/propensity_model.pickle',
            version=None
        ),
        uplift_models_dict = PickleLocalDataSet(
            filepath='../data/06_models/uplift_models_dict.pickle',
            version=None
        ),
        df_03 = CSVLocalDataSet(
            filepath='../data/07_model_output/df.csv',
            load_args=dict(index_col=['partition', 'index'], float_precision='high'),
            save_args=dict(index=True, float_format='%.16e'),
            version=None,
        ),
        treated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/treated__sim_eval_df.csv',
            version=None,
        ),
        untreated__sim_eval_df = CSVLocalDataSet(
            filepath='../data/08_reporting/untreated__sim_eval_df.csv',
            version=None,
        ),
        estimated_effect_df = CSVLocalDataSet(
            filepath='../data/08_reporting/estimated_effect_df.csv',
            version=None,
        ),
    )
    

  • logging_config (Optional[Dict[str, Any]]) –

    Specify logging configuration.

    Refer to https://docs.python.org/3.6/library/logging.config.html#logging-config-dictschema

    By default:

    {'disable_existing_loggers': False,
     'formatters': {
         'json_formatter': {
             'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
             'format': '[%(asctime)s|%(name)s|%(funcName)s|%(levelname)s] %(message)s',
         },
         'simple': {
             'format': '[%(asctime)s|%(name)s|%(levelname)s] %(message)s',
         },
     },
     'handlers': {
         'console': {
             'class': 'logging.StreamHandler',
             'formatter': 'simple',
             'level': 'INFO',
             'stream': 'ext://sys.stdout',
         },
        'info_file_handler': {
            'class': 'logging.handlers.RotatingFileHandler',
            'level': 'INFO',
            'formatter': 'simple',
            'filename': './info.log',
            'maxBytes': 10485760, # 10MB
            'backupCount': 20,
            'encoding': 'utf8',
            'delay': True,
        },
         'error_file_handler': {
             'class': 'logging.handlers.RotatingFileHandler',
             'level': 'ERROR',
             'formatter': 'simple',
             'filename': './errors.log',
             'maxBytes': 10485760,  # 10MB
             'backupCount': 20,
             'encoding': 'utf8',
             'delay': True,
         },
     },
     'loggers': {
         'anyconfig': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'WARNING',
             'propagate': False,
         },
         'kedro.io': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'WARNING',
             'propagate': False,
         },
         'kedro.pipeline': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
         'kedro.runner': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
         'causallift': {
             'handlers': ['console', 'info_file_handler', 'error_file_handler'],
             'level': 'INFO',
             'propagate': False,
         },
     },
     'root': {
         'handlers': ['console', 'info_file_handler', 'error_file_handler'],
         'level': 'INFO',
     },
     'version': 1}
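
A minimal end-to-end usage sketch. The simulated data and the 80/20 split are illustrative, and it assumes the generated columns use the default ‘Treatment’ and ‘Outcome’ names:

    from sklearn.model_selection import train_test_split

    from causallift import CausalLift
    from causallift.generate_data import generate_data

    # Simulate a small A/B-test-like dataset (see causallift.generate_data
    # below); the defaults produce a 50:50 random treatment assignment.
    df = generate_data(N=1000, n_features=3, discrete_outcome=True)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

    cl = CausalLift(train_df, test_df, enable_ipw=True)

    # Add CATE estimates to both DataFrames.
    train_df, test_df = cl.estimate_cate_by_2_models()

    # Estimate the effect of recommending treatment based on the CATE.
    estimated_effect_df = cl.estimate_recommendation_impact()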
    

__init__(...)[source]

The signature and defaults are identical to the class signature shown above.

Initialize self. See help(type(self)) for accurate signature.

Return type

None

estimate_cate_by_2_models()[source]

Estimate CATE (Conditional Average Treatment Effect) using two XGBoost classifier models: one fit to treated samples and one fit to untreated samples.

Return type

Tuple[DataFrame, DataFrame]
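
For example, continuing the sketch above (the ‘CATE’ column name follows the col_cate argument):

    # Both returned DataFrames carry the estimated CATE in a new column.
    train_df, test_df = cl.estimate_cate_by_2_models()
    print(test_df['CATE'].describe())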

estimate_recommendation_impact(cate_estimated=None, treatment_fraction_train=None, treatment_fraction_test=None, verbose=None)[source]

Estimate the impact of recommendation based on uplift modeling.

Parameters
  • cate_estimated (Optional[Type[Series]]) – Pandas series containing the CATE. If None (default), use the values calculated by the estimate_cate_by_2_models method.

  • treatment_fraction_train (Optional[float]) – The fraction of treated samples in the train dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.

  • treatment_fraction_test (Optional[float]) – The fraction of treated samples in the test dataset. If None (default), use the value calculated by the estimate_cate_by_2_models method.

  • verbose (Optional[int]) – How much info to show. If None (default), use the value set in the constructor.

Return type

Type[DataFrame]
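
For example, continuing the sketch above (the fraction values here are hypothetical overrides; by default the fractions observed during estimate_cate_by_2_models are reused):

    estimated_effect_df = cl.estimate_recommendation_impact(
        treatment_fraction_train=0.3,  # hypothetical override
        treatment_fraction_test=0.3,   # hypothetical override
    )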

causallift.generate_data module

The original code is at https://github.com/wayfair/pylift/blob/master/pylift/generate_data.py, licensed under the BSD 2-Clause “Simplified” License, Copyright 2018, Wayfair, Inc.

This code is an enhanced (backward-compatible) version that can simulate an observational dataset including “sleeping dogs.”

“Sleeping dogs” (people who will “buy” if not treated but will not “buy” if treated) can be simulated by negative values in the tau parameter. Observational data that includes confounding can be simulated by non-zero values in the propensity_coef parameter. An A/B test (RCT) with a 50:50 split can be simulated by all-zero values in the propensity_coef parameter (default). The first element in each list parameter specifies the intercept.

causallift.generate_data.generate_data(N=1000, n_features=3, beta=[1,-2,3,-0.8], error_std=0.5, tau=3, discrete_outcome=False)[source]

Generates random data according to a ground-truth data-generating process. Draws random values for features from [0, 1), errors from a 0-centered distribution with standard deviation error_std, and creates an outcome y.

Parameters
  • N (Optional[int]) – Number of observations.

  • n_features (Optional[int]) – Number of features.

  • beta (Optional[List[float]]) – Array of beta coefficients to multiply by X to get y.

  • error_std (Optional[float]) – Standard deviation (scale) of the distribution from which errors are drawn.

  • tau (Union[List[float], float]) – Array of coefficients to multiply by X to get y if treated. More/larger negative values will simulate more “sleeping dogs.” If a float scalar is given, the effect of features is not considered.

  • tau_std (Optional[float]) – When not None, draws tau from a normal distribution centered around tau with standard deviation tau_std rather than using a constant value of tau.

  • discrete_outcome (Optional[bool]) – If True, outcomes are 0 or 1; otherwise continuous.

  • seed (Optional[int]) – Random seed fed to np.random.seed to allow deterministic behavior.

  • feature_effect (Optional[float]) – Effect of beta on the outcome if treated.

  • propensity_coef (Optional[List[float]]) – Array of coefficients to multiply by X to get the propensity log-odds of being treated.

  • index_name (Optional[str]) – Index name in the output DataFrame. If None (default), the index name will not be set.

Returns

A DataFrame containing the generated data.

Return type

pd.DataFrame
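
A sketch of simulating observational data that includes confounding and “sleeping dogs” (the coefficient values are illustrative):

    from causallift.generate_data import generate_data

    df = generate_data(
        N=20000,
        n_features=3,
        beta=[0, -2, 3, -5],             # intercept first, then one per feature
        error_std=0.1,
        tau=[1, -5, -5, 10],             # negative values simulate "sleeping dogs"
        tau_std=0.1,
        discrete_outcome=True,
        seed=0,
        propensity_coef=[0, -1, 1, -1],  # non-zero values introduce confounding
        index_name='index',
    )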

causallift.pipeline module

Pipeline construction.

causallift.pipeline.create_pipeline(**kwargs)[source]

Create the project’s pipeline.

Parameters

kwargs – Ignored. Allows additional arguments to be added in the future.

Returns

The resulting pipeline.

Return type

Pipeline
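
A sketch of constructing and running the pipeline manually with Kedro. The dataset names in the catalog are placeholders, and CausalLift normally drives this internally when the runner argument is set:

    import pandas as pd
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.runner import SequentialRunner

    from causallift.pipeline import create_pipeline

    pipeline = create_pipeline()

    # Register whatever inputs the pipeline nodes require; 'train_df' and
    # 'test_df' are placeholder names, and the empty DataFrames stand in
    # for real data.
    catalog = DataCatalog({
        'train_df': MemoryDataSet(pd.DataFrame()),
        'test_df': MemoryDataSet(pd.DataFrame()),
    })
    SequentialRunner().run(pipeline, catalog)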

causallift.run module

Application entry point.

class causallift.run.ProjectContext(project_path, env=None)[source]

Bases: kedro.context.context.KedroContext

Users can override the remaining methods from the parent class here, or create new ones (e.g., as required by plugins).

property pipeline

Abstract property for Pipeline getter.

Return type

Pipeline

Returns

Defined pipeline.

project_name = 'CausalLift'
project_version = '0.14.3'
causallift.run.main(tags=None, env=None, runner=None, node_names=None, from_nodes=None, to_nodes=None)[source]

Application main entry point.

Parameters
  • tags – An optional list of node tags which should be used to filter the nodes of the Pipeline. If specified, only the nodes containing any of these tags will be run.

  • env – An optional parameter specifying the environment in which the Pipeline should be run.

  • runner – An optional parameter specifying the runner that you want to run the pipeline with.

  • node_names – An optional list of node names which should be used to filter the nodes of the Pipeline. If specified, only the nodes with these names will be run.

  • from_nodes – An optional list of node names which should be used as a starting point of the new Pipeline.

  • to_nodes – An optional list of node names which should be used as an end point of the new Pipeline.
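
A sketch of invoking the entry point with filters. The node name below is hypothetical; available names depend on the pipeline definition:

    from causallift.run import main

    # Run the full pipeline with the default environment and runner.
    main()

    # Run only selected nodes in parallel ('estimate_propensity' is a
    # hypothetical node name).
    main(node_names=['estimate_propensity'], runner='ParallelRunner')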