Overview
This project demonstrates AdaBoost with hyperparameter tuning and feature selection to predict employee attrition. The workflow addresses class imbalance with SMOTE, tunes AdaBoost parameters with GridSearchCV, and refines the model by keeping the top 10 most important features. The final model is evaluated with classification reports, confusion matrices, and ROC curves.
Features
Hyperparameter Tuning: The AdaBoost classifier is optimized by searching over the number of estimators with GridSearchCV, which selects the configuration with the best cross-validated score.
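A minimal sketch of this step, assuming preprocessed training data in the hypothetical X_train/y_train and an illustrative grid of n_estimators values (the actual grid used may differ):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually searched may differ.
param_grid = {"n_estimators": [50, 100, 200, 400]}

grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation
    scoring="f1",  # assumption: F1 is a reasonable target under class imbalance
    n_jobs=-1,
)
grid.fit(X_train, y_train)  # X_train/y_train: hypothetical preprocessed data
best_model = grid.best_estimator_
print(grid.best_params_)
```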
SMOTE for Class Imbalance: The dataset is imbalanced, with significantly fewer instances of employee attrition. SMOTE (Synthetic Minority Over-sampling Technique) balances the training data by synthesizing new samples of the minority class (attrition cases), which reduces the model's bias towards predicting the majority class.
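A sketch of the resampling step, again assuming the hypothetical X_train/y_train names. Note that SMOTE is applied to the training split only, so the test set keeps the true class distribution:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_train_res))  # class counts before/after
```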
Feature Importance & Selection: After training, feature importances are extracted and the top 10 features are retained. This reduces the dimensionality of the data and focuses the model on the attributes most influential for predicting attrition; the selected features include monthly rate, daily rate, business travel, overtime, and gender.
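A sketch of the selection step, assuming the features live in pandas DataFrames (hypothetical X_train/X_test) so column names can be used to index the top features:

```python
import pandas as pd

# Rank features by the fitted model's importances and keep the top 10.
importances = pd.Series(best_model.feature_importances_, index=X_train.columns)
top10 = importances.nlargest(10).index
X_train_top, X_test_top = X_train[top10], X_test[top10]
```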
Evaluation: The final model is evaluated with precision, recall, and F1-score from the classification report. A confusion matrix breaks down correct and incorrect predictions per class, and an ROC curve visualizes the trade-off between true and false positive rates.
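A sketch of the evaluation step, assuming a held-out test split in the hypothetical X_test/y_test:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, RocCurveDisplay

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted

# ROC curve computed from the model's decision scores on the test set.
RocCurveDisplay.from_estimator(best_model, X_test, y_test)
plt.show()
```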
Libraries Used
Pandas & NumPy: For data manipulation and analysis.
Matplotlib & Seaborn: For visualizing feature importance, confusion matrices, and ROC curves.
Scikit-learn: For building the machine learning pipeline, applying AdaBoost, tuning hyperparameters, and evaluating the model's performance.
Imbalanced-learn: For oversampling the minority class using SMOTE.
Results
The AdaBoost model with the best parameters and SMOTE achieves a classification accuracy of 83%. The ROC curve shows reasonable discriminative ability, with the AUC reflecting how well the model separates attrition from non-attrition cases.
After restricting the model to the top 10 most important features, accuracy drops substantially to 62%. This suggests that while feature selection reduces complexity, here it also discards meaningful predictive signal.
The model struggles to predict the minority class (attrition cases), a common issue with imbalanced datasets even after applying SMOTE. Further tuning or alternative models may be needed to improve minority-class performance.