slickml.selection._xgboost#

Module Contents#

Classes#

XGBoostFeatureSelector

XGBoost Feature Selector.

class slickml.selection._xgboost.XGBoostFeatureSelector[source]#

Bases: slickml.base.BaseXGBoostEstimator

XGBoost Feature Selector.

Notes

This is a wrapper around XGBoost [xgboost-api] that performs a frequency-based feature selection algorithm with n-fold cross-validation on top of data iteratively augmented with noisy features. At each fold of the cross-validation of each iteration, the best number of boosting rounds is found to overcome the possibility of over-fitting, and the feature importance of the best trained model is used to select the features. Finally, the frequency with which each feature shows up in the feature-importance phase of each cross-validation fold of each iteration serves as the benchmark for feature selection. In principle, the maximum frequency of each feature is n_iter times n_splits.

Parameters:
  • n_iter (int, optional) – Number of iterations to repeat the feature selection algorithm, by default 3

  • num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200

  • n_splits (int, optional) – Number of folds for cross-validation, by default 4

  • metrics (str, optional) – Metrics to be tracked at cross-validation fitting time, depending on the task (classification vs. regression), with possible values of “auc”, “aucpr”, “error”, “logloss”, “rmse”, “rmsle”, “mae”. Note that this is different from the eval_metric that needs to be passed to the params dict, by default “auc”

  • early_stopping_rounds (int, optional) – The criterion to abort the xgboost.cv() phase early if the test metric does not improve, by default 20

  • random_state (int, optional) – Random seed number, by default 1367

  • stratified (bool, optional) – Whether to use stratification of the targets (only available for classification tasks) when running xgboost.cv() to find the best number of boosting rounds at each fold of each iteration, by default True

  • shuffle (bool, optional) – Whether to shuffle the data so that stratified folds can be built in xgboost.cv(), by default True

  • sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix in CSR format. This would increase the speed of feature selection for relatively large/sparse datasets; conversely, it would act as an un-optimized solution for a dense feature matrix. Additionally, this parameter cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn it into a dense matrix; therefore, the API disallows this combination, by default False

  • scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean must be False when using sparse_matrix=True, since centering the feature matrix would decrease its sparsity, and in practice it does not make sense to combine the sparse-matrix method with centering. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or, equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False

  • nth_noise_threshold (int, optional) – The threshold to keep all the features up to the n-th noisy feature at each fold of each iteration. For example, for a feature selection with 4 iterations and a 5-fold cv, the maximum number of noisy features would be 4*5=20, by default 1

  • importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”

  • params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster. By default, for a classification task: {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1}; and for any regression task: {“eval_metric”: “rmse”, “tree_method”: “hist”, “objective”: “reg:squarederror”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4}. Other options for objective include "reg:logistic" and "reg:squaredlogerror"

  • verbose_eval (bool, optional) – Whether to show the results of xgboost.train() on train/test sets using eval_metric, by default False

  • callbacks (bool, optional) – Whether to log the standard deviation of metrics on train data and track the early stopping criterion, by default False
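
A minimal usage sketch is shown below, assuming the class is re-exported at slickml.selection (as is conventional for a private _xgboost module) and using scikit-learn's breast-cancer dataset purely for illustration:

    from sklearn.datasets import load_breast_cancer
    from slickml.selection import XGBoostFeatureSelector  # assumed public import path

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    # With n_iter=3 and n_splits=4, the maximum possible frequency of any feature is 12
    xfs = XGBoostFeatureSelector(n_iter=3, n_splits=4, metrics="auc")
    xfs.fit(X, y)
    print(xfs.get_feature_frequency().head())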

fit(X, y)[source]#

Fits the main feature selection algorithm.

get_feature_frequency()[source]#

Returns the total feature frequency of the best model.

get_feature_importance()[source]#

Returns the feature importance based on importance_type at each fold of each iteration of the selection process as a dict of dataframes.

get_cv_results()[source]#

Returns the total internal/external cross-validation results.

plot_frequency()[source]#

Visualizes the selected features frequency as a bar chart.

plot_cv_results()[source]#

Visualizes the cross-validation results.

get_params()[source]#

Returns the final set of train parameters.

get_default_params()[source]#

Returns the default set of train parameters.

feature_importance_#

Returns a dict of all feature importance dataframes based on importance_type at each fold of each iteration during the selection process.

feature_frequency_#

Returns a pandas.DataFrame consisting of the total frequency of each feature during the selection process.

cv_results_#

Returns a dict of the total internal/external cross-validation results.

plotting_cv_#

Returns the required elements to visualize the histograms of the total internal/external cross-validation results.
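
Once fit() has been called, these artifacts are available directly as attributes; a short sketch, continuing from the fitted selector xfs in the example above:

    # Selection artifacts exposed as attributes (assumes `xfs` has been fitted).
    imp_dict = xfs.feature_importance_  # dict of importance frames, one per fold of each iteration
    freq_df = xfs.feature_frequency_    # DataFrame of per-feature selection frequencies
    cv = xfs.cv_results_                # internal/external cross-validation results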

References

__slots__ = []#
callbacks :Optional[bool] = False#
early_stopping_rounds :Optional[int] = 20#
importance_type :Optional[str] = 'total_gain'#
metrics :Optional[str] = 'auc'#
n_iter :Optional[int] = 3#
n_splits :Optional[int] = 4#
nth_noise_threshold :Optional[int] = 1#
num_boost_round :Optional[int] = 200#
params :Optional[Dict[str, Union[str, float, int]]]#
random_state :Optional[int] = 1367#
scale_mean :Optional[bool] = False#
scale_std :Optional[bool] = False#
shuffle :Optional[bool] = True#
sparse_matrix :Optional[bool] = False#
stratified :Optional[bool] = True#
verbose_eval :Optional[bool] = False#
__getstate__()#
__post_init__() None[source]#

Post instantiation validations and assignments.

__repr__(N_CHAR_MAX=700)#

Return repr(self).

__setstate__(state)#
fit(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[List[float], numpy.ndarray, pandas.Series]) None[source]#

Fits the main feature selection algorithm.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)

  • y (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)

Returns:

None
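
fit() also accepts plain NumPy arrays and list-like targets; a short sketch with synthetic data (purely illustrative):

    import numpy as np
    from slickml.selection import XGBoostFeatureSelector  # assumed public import path

    rng = np.random.default_rng(1367)
    X = rng.normal(size=(200, 8))     # features as a 2-D array
    y = rng.integers(0, 2, size=200)  # binary targets

    xfs = XGBoostFeatureSelector(n_iter=2, n_splits=3)
    xfs.fit(X, y)  # returns None; results are stored on the instance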

get_cv_results() pandas.DataFrame[source]#

Returns internal and external cross-validation results.

Returns:

pd.DataFrame
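
For instance, the returned frame can be summarized directly (assuming the fitted selector xfs from the earlier sketches):

    # Summary statistics of the tracked metrics across folds and iterations.
    cv_results = xfs.get_cv_results()  # assumes `xfs` has been fitted
    print(cv_results.describe())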

get_default_params() Dict[str, Union[str, float, int]][source]#

Returns the default set of train parameters.

The default set of parameters will be used when params=None.

See also

get_params()

Returns:

Dict[str, Union[str, float, int]]
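
A sketch contrasting the defaults with the final parameter set (assuming a constructed selector xfs as in the earlier sketches):

    # Parameters that differ from the defaults after any user overrides
    # (assumes `xfs` is a constructed XGBoostFeatureSelector).
    defaults = xfs.get_default_params()
    final = xfs.get_params() or {}  # get_params() may return None
    print({k: v for k, v in final.items() if defaults.get(k) != v})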

get_feature_frequency() pandas.DataFrame[source]#

Returns the total feature frequency of the best model at each fold of each iteration.

Returns:

pd.DataFrame

get_feature_importance() Dict[str, pandas.DataFrame][source]#

Returns the feature importance of the trained booster based on the given importance_type.

Returns:

Dict[str, pd.DataFrame]
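
For example, the dict can be iterated to inspect each fold's importance frame (assuming the fitted selector xfs from the earlier sketches):

    # One importance frame per fold of each iteration.
    for label, frame in xfs.get_feature_importance().items():
        print(label, frame.shape)  # assumes `xfs` has been fitted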

get_params() Optional[Dict[str, Union[str, float, int]]][source]#

Returns the final set of train parameters.

The default set of parameters will be updated with the new ones passed to params.

Returns:

Dict[str, Union[str, float, int]]

plot_cv_results(*, figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (10, 8), internalcvcolor: Optional[str] = '#4169E1', externalcvcolor: Optional[str] = '#8A2BE2', sharex: Optional[bool] = False, sharey: Optional[bool] = False, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the cross-validation results.

Notes

It visualizes the internal and external cross-validation performance during the selection process. Internal refers to the performance of the train/test folds during xgboost.cv() using metrics, which helps find the best number of boosting rounds, while external refers to the performance of xgboost.train() on the watchlist using eval_metric. Additionally, sns.distplot was used previously, but it is now deprecated; more details in [seaborn-distplot-deprecation].

Parameters:
  • figsize (tuple, optional) – Figure size, by default (10, 8)

  • internalcvcolor (str, optional) – Color of the histograms for internal cv results, by default “#4169E1”

  • externalcvcolor (str, optional) – Color of the histograms for external cv results, by default “#8A2BE2”

  • sharex (bool, optional) – Whether to share “X” axis for each column of subplots, by default False

  • sharey (bool, optional) – Whether to share “Y” axis for each row of subplots, by default False

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

  • kwargs (Dict[str, Any]) – Required plotting elements (the plotting_cv_ attribute of XGBoostFeatureSelector)

Returns:

Figure, optional
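
A sketch of saving the histograms to disk without displaying them (assumes a fitted xfs; the output path is illustrative):

    # Write the internal/external CV histograms to an image file.
    xfs.plot_cv_results(
        display_plot=False,
        save_path="cv_results.png",  # illustrative output path
    )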

plot_frequency(*, figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = (8, 4), show_freq_pct: Optional[bool] = True, color: Optional[str] = '#87CEEB', marker: Optional[str] = 'o', markersize: Optional[Union[int, float]] = 10, markeredgecolor: Optional[str] = '#1F77B4', markerfacecolor: Optional[str] = '#1F77B4', markeredgewidth: Optional[Union[int, float]] = 1, fontsize: Optional[Union[int, float]] = 12, save_path: Optional[str] = None, display_plot: Optional[bool] = True, return_fig: Optional[bool] = False) Optional[matplotlib.figure.Figure][source]#

Visualizes the selected features frequency as a bar chart.

Notes

This plotting function can be used along with feature_frequency_ attribute of any frequency-based feature selection algorithm such as XGBoostFeatureSelector.

Parameters:
  • feature frequency (pd.DataFrame) – Feature frequency (the feature_frequency_ attribute)

  • figsize (tuple, optional) – Figure size, by default (8, 4)

  • show_freq_pct (bool, optional) – Whether to show the features frequency in percent, by default True

  • color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”

  • marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”

  • markersize (Union[int, float], optional) – Markersize, by default 10

  • markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”

  • markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”

  • markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1

  • fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12

  • save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None

  • display_plot (bool, optional) – Whether to show the plot, by default True

  • return_fig (bool, optional) – Whether to return figure object, by default False

Returns:

Figure, optional
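
A sketch of retrieving the figure object for further customization (assumes a fitted xfs; the output path is illustrative):

    # Return the lollipop-chart figure instead of displaying it.
    fig = xfs.plot_frequency(
        display_plot=False,
        return_fig=True,
    )
    fig.savefig("feature_frequency.pdf")  # illustrative output path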

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.
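
A sketch of updating parameters and re-running the selection (assumes xfs, X, and y from the earlier sketches):

    # Update estimator parameters in place, then re-fit.
    xfs.set_params(n_iter=5, num_boost_round=300)
    xfs.fit(X, y)  # re-run selection with the updated settings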