slickml.selection¶
Classes¶
XGBoostFeatureSelector – XGBoost Feature Selector.
Package Contents¶
- class slickml.selection.XGBoostFeatureSelector[source]¶
Bases: slickml.base.BaseXGBoostEstimator
XGBoost Feature Selector.
Notes
This is a wrapper around XGBoost [xgboost-api] that performs a frequency-based feature selection algorithm with n-fold cross-validation on top of data iteratively augmented with noisy features. At each fold of each cross-validation iteration, the best number of boosting rounds is found to overcome the possibility of over-fitting, and the feature importance of the best trained model is used to select features. Finally, the frequency with which each feature shows up in the feature-importance phase of every cross-validation fold of every iteration serves as the benchmark for feature selection. In principle, the maximum frequency of each feature is n_iter times n_splits.
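The selection loop can be sketched as follows. This is an illustrative simplification of the algorithm described above, not the library's internal implementation; train_and_rank is a hypothetical stand-in for the per-fold XGBoost fitting and importance-ranking step.

    import numpy as np
    import pandas as pd

    def frequency_based_selection(X, y, train_and_rank, n_iter=3, n_splits=4, seed=1367):
        """Count how often each real feature survives each fold of each iteration."""
        rng = np.random.default_rng(seed)
        frequency = {col: 0 for col in X.columns}
        for _ in range(n_iter):
            # augment the data with shuffled (noisy) copies of the real features
            X_noisy = X.apply(lambda c: rng.permutation(c.to_numpy())).add_prefix("noisy_")
            X_aug = pd.concat([X, X_noisy], axis=1)
            for fold in range(n_splits):
                # hypothetical: fit XGBoost on this fold with the best number of
                # boosting rounds and return the top features by importance
                selected = train_and_rank(X_aug, y, fold)
                for col in selected:
                    if not col.startswith("noisy_"):
                        frequency[col] += 1
        # the maximum possible frequency of any feature is n_iter * n_splits
        return pd.Series(frequency).sort_values(ascending=False)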
- Parameters:
n_iter (int, optional) – Number of iterations to repeat the feature selection algorithm, by default 3
num_boost_round (int, optional) – Number of boosting rounds to fit a model, by default 200
n_splits (int, optional) – Number of folds for cross-validation, by default 4
metrics (str, optional) – Metrics to be tracked at cross-validation fitting time, depending on the task (classification vs regression), with possible values of “auc”, “aucpr”, “error”, “logloss”, “rmse”, “rmsle”, “mae”. Note this is different from the eval_metric that needs to be passed to the params dict, by default “auc”
early_stopping_rounds (int, optional) – The criterion to early abort the xgboost.cv() phase if the test metric is not improved, by default 20
random_state (int, optional) – Random seed number, by default 1367
stratified (bool, optional) – Whether to use stratification of the targets (only available for classification tasks) to run xgboost.cv() to find the best number of boosting rounds at each fold of each iteration, by default True
shuffle (bool, optional) – Whether to shuffle data to have the ability of building stratified folds in xgboost.cv(), by default True
sparse_matrix (bool, optional) – Whether to convert the input features to a sparse matrix with csr format or not. This would increase the speed of feature selection for relatively large/sparse datasets; consequently, it would act as an un-optimized solution for a dense feature matrix. Additionally, this parameter cannot be used along with scale_mean=True, since standardizing the feature matrix to have a mean value of zero would turn the feature matrix into a dense matrix. Therefore, the API bans this combination, by default False
scale_mean (bool, optional) – Whether to standardize the feature matrix to have a mean value of zero per feature (center the features before scaling). As laid out in sparse_matrix, scale_mean=False is required when using sparse_matrix=True, since centering the feature matrix would decrease the sparsity, and in practice it does not make any sense to combine the sparse matrix method with centering. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False
scale_std (bool, optional) – Whether to scale the feature matrix to have unit variance (or equivalently, unit standard deviation) per feature. The StandardScaler object can be accessed via cls.scaler_ if scale_mean or scale_std is used; otherwise it is None, by default False
nth_noise_threshold (int, optional) – The threshold to keep all the features up to the n-th noisy feature at each fold of each iteration. For example, for a feature selection with 4 iterations and 5-fold cv, the maximum number of noisy features would be 4*5=20, by default 1
importance_type (str, optional) – Importance type of xgboost.train() with possible values "weight", "gain", "total_gain", "cover", "total_cover", by default “total_gain”
params (Dict[str, Union[str, float, int]], optional) – Set of parameters required for fitting a Booster, by default for a classification task {“eval_metric”: “auc”, “tree_method”: “hist”, “objective”: “binary:logistic”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4, “scale_pos_weight”: 1} and by default for any regression task {“eval_metric”: “rmse”, “tree_method”: “hist”, “objective”: “reg:squarederror”, “learning_rate”: 0.05, “max_depth”: 2, “min_child_weight”: 1, “gamma”: 0.0, “reg_alpha”: 0.0, “reg_lambda”: 1.0, “subsample”: 0.9, “max_delta_step”: 1, “verbosity”: 0, “nthread”: 4}. Other options for objective: "reg:logistic", "reg:squaredlogerror"
verbose_eval (bool, optional) – Whether to show the results of xgboost.train() on train/test sets using eval_metric, by default False
callbacks (bool, optional) – Whether to log the standard deviation of metrics on train data and track the early stopping criterion, by default False
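Examples
A minimal instantiation sketch (the values shown are illustrative, not recommendations):

    from slickml.selection import XGBoostFeatureSelector

    xfs = XGBoostFeatureSelector(
        n_iter=3,
        n_splits=4,
        metrics="auc",
        early_stopping_rounds=20,
        importance_type="total_gain",
        # keys passed here update the default Booster params listed above
        params={"max_depth": 3, "learning_rate": 0.1},
    )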
- get_feature_importance()[source]¶
Returns feature importance based on importance_type at each fold of each iteration of the selection process as a dict of dataframes
- feature_importance_¶
Returns a dict of all feature importance dataframes based on importance_type at each fold of each iteration during selection process
- feature_frequency_¶
Returns a pandas.DataFrame consisting of the total frequency of each feature during the selection process
- cv_results_¶
Returns a dict of the total internal/external cross-validation results
- plotting_cv_¶
Returns the required elements to visualize the histograms of total internal/external cross-validation results
References
[xgboost-api] https://xgboost.readthedocs.io/en/latest/python/python_api.html
[markers-api] https://matplotlib.org/stable/api/markers_api.html
[seaborn-distplot-deprecation] https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
- __getstate__()¶
- classmethod __init_subclass__(**kwargs)¶
Set the set_{method}_request methods.
This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.
The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.
References
[1] https://peps.python.org/pep-0487
- __repr__(N_CHAR_MAX=700)¶
Return repr(self).
- __setstate__(state)¶
- __sklearn_clone__()¶
- __slots__ = ()¶
- fit(X: pandas.DataFrame | numpy.ndarray, y: List[float] | numpy.ndarray | pandas.Series) → None [source]¶
Fits the main feature selection algorithm.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – Input data for training (features)
y (Union[List[float], np.ndarray, pd.Series]) – Input ground truth for training (targets)
- Returns:
None
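For example, continuing the instantiation sketch above with a toy dataset (sklearn's make_classification is used only for illustration):

    import pandas as pd
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=1367)
    X = pd.DataFrame(X, columns=[f"F_{i}" for i in range(10)])

    xfs.fit(X, y)  # runs n_splits folds of cross-validated selection, n_iter times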
- get_cv_results() → pandas.DataFrame [source]¶
Returns internal and external cross-validation results.
- Returns:
pd.DataFrame
- get_default_params() → Dict[str, str | float | int] [source]¶
Returns the default set of train parameters.
The default set of parameters will be used when params=None.
See also
- Returns:
Dict[str, Union[str, float, int]]
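For instance (a sketch, assuming the default classification setup):

    defaults = xfs.get_default_params()
    # a plain dict, e.g. defaults["objective"] == "binary:logistic"
    print(defaults["eval_metric"], defaults["max_depth"])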
- get_feature_frequency() → pandas.DataFrame [source]¶
Returns the total feature frequency of the best model at each fold of each iteration.
- Returns:
pd.DataFrame
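For example, to keep only the features that were selected in at least half of all rounds (a sketch; the column names of the returned frame, assumed here to be "Feature" and "Frequency", may differ):

    freq = xfs.get_feature_frequency()
    max_freq = 3 * 4  # n_iter * n_splits
    keep = freq.loc[freq["Frequency"] >= max_freq / 2, "Feature"].tolist()
    X_selected = X[keep]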
- get_feature_importance() → Dict[str, pandas.DataFrame] [source]¶
Returns the feature importance of the trained booster based on the given importance_type.
- Returns:
Dict[str, pd.DataFrame]
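A sketch of inspecting the per-fold results; the exact layout of the dict keys is not documented here, so they are iterated generically:

    importances = xfs.get_feature_importance()
    for key, df in importances.items():
        # one importance dataframe per fold of each iteration
        print(key)
        print(df.head(3))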
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params() → Dict[str, str | float | int] | None [source]¶
Returns the final set of train parameters.
The default set of parameters will be updated with the new ones that are passed to params.
See also
- Returns:
Dict[str, Union[str, float, int]]
- plot_cv_results(*, figsize: Tuple[int | float, int | float] | None = (10, 8), internalcvcolor: str | None = '#4169E1', externalcvcolor: str | None = '#8A2BE2', sharex: bool | None = False, sharey: bool | None = False, save_path: str | None = None, display_plot: bool | None = True, return_fig: bool | None = False) → matplotlib.figure.Figure | None [source]¶
Visualizes the cross-validation results.
Notes
It visualizes the internal and external cross-validation performance during the selection process. Internal refers to the performance of the train/test folds during xgboost.cv() using metrics to find the best number of boosting rounds, while external refers to the performance of xgboost.train() based on the watchlist using eval_metric. Additionally, sns.distplot was previously used, which is now deprecated; more details in [seaborn-distplot-deprecation].
- Parameters:
figsize (tuple, optional) – Figure size, by default (10, 8)
internalcvcolor (str, optional) – Color of the histograms for internal cv results, by default “#4169E1”
externalcvcolor (str, optional) – Color of the histograms for external cv results, by default “#8A2BE2”
sharex (bool, optional) – Whether to share “X” axis for each column of subplots, by default False
sharey (bool, optional) – Whether to share “Y” axis for each row of subplots, by default False
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
kwargs (Dict[str, Any]) – Required plotting elements (plotting_cv_ attribute of XGBoostFeatureSelector)
- Returns:
Figure, optional
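For example (assumes the selector has already been fitted):

    fig = xfs.plot_cv_results(
        figsize=(10, 8),
        save_path="cv_results.png",  # also write the figure to disk
        display_plot=False,
        return_fig=True,  # get the matplotlib Figure back for further tweaking
    )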
- plot_frequency(*, figsize: Tuple[int | float, int | float] | None = (8, 4), show_freq_pct: bool | None = True, color: str | None = '#87CEEB', marker: str | None = 'o', markersize: int | float | None = 10, markeredgecolor: str | None = '#1F77B4', markerfacecolor: str | None = '#1F77B4', markeredgewidth: int | float | None = 1, fontsize: int | float | None = 12, save_path: str | None = None, display_plot: bool | None = True, return_fig: bool | None = False) → matplotlib.figure.Figure | None [source]¶
Visualizes the frequency of the selected features as a bar chart.
Notes
This plotting function can be used along with the feature_frequency_ attribute of any frequency-based feature selection algorithm such as XGBoostFeatureSelector.
- Parameters:
feature_importance (pd.DataFrame) – Feature importance (feature_frequency_ attribute)
figsize (tuple, optional) – Figure size, by default (8, 4)
show_freq_pct (bool, optional) – Whether to show the features frequency in percent, by default True
color (str, optional) – Color of the horizontal lines of lollipops, by default “#87CEEB”
marker (str, optional) – Marker style of the lollipops. More valid marker styles can be found at [markers-api], by default “o”
markersize (Union[int, float], optional) – Markersize, by default 10
markeredgecolor (str, optional) – Marker edge color, by default “#1F77B4”
markerfacecolor (str, optional) – Marker face color, by default “#1F77B4”
markeredgewidth (Union[int, float], optional) – Marker edge width, by default 1
fontsize (Union[int, float], optional) – Fontsize for xlabel and ylabel, and ticks parameters, by default 12
save_path (str, optional) – The full or relative path to save the plot including the image format such as “myplot.png” or “../../myplot.pdf”, by default None
display_plot (bool, optional) – Whether to show the plot, by default True
return_fig (bool, optional) – Whether to return figure object, by default False
- Returns:
Figure, optional
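A typical call might look like this (sketch; assumes the selector has already been fitted):

    xfs.plot_frequency(
        figsize=(8, 4),
        show_freq_pct=True,  # annotate frequencies as percentages
        marker="o",
        markersize=10,
        save_path="feature_frequency.pdf",
        display_plot=True,
    )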
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.