Analysing COVID-19 Mortality Patterns with Machine Learning

Introduction:

This project analyzes COVID-19 mortality patterns across states, age groups, and medical conditions using a large dataset covering 2020 to 2023. The primary goal is to apply machine learning models to understand and predict the factors contributing to COVID-19 deaths, providing insights for public health strategies and future pandemic preparedness. The project combines exploratory statistics, data visualization, and regression algorithms to identify key patterns and build predictive models.

Skills and Libraries Used:

  • Python: Data preprocessing, visualization, and machine learning implementation.

  • Pandas: Data manipulation, cleaning, and feature engineering.

  • NumPy: Numerical computations for data handling and transformation.

  • Seaborn & Matplotlib: For exploratory data analysis (EDA) through visualizations such as bar plots, box plots, and heatmaps.

  • Scikit-learn: Implementing machine learning models such as Linear Regression, Lasso, Ridge, Random Forest, Support Vector Machines (SVM), and Gradient Boosting.

  • XGBoost: Gradient Boosting for improving prediction accuracy.

  • Cross-validation & KFold: Model validation techniques to evaluate the performance of the models.

  • OneHotEncoder: Categorical data encoding for machine learning models.
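To illustrate how the encoding and validation tools listed above can fit together, here is a minimal sketch. It runs on synthetic data: the column names (`state`, `age_group`, `condition_group`, `covid_deaths`) and the made-up target are assumptions for illustration, not the project's actual schema or code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "state": rng.choice(["CA", "NY", "TX"], size=n),
    "age_group": rng.choice(["0-24", "25-64", "65+"], size=n),
    "condition_group": rng.choice(["Respiratory", "Circulatory", "Other"], size=n),
})
# Made-up target: more deaths in the oldest group, plus noise.
df["covid_deaths"] = 50 + 200 * (df["age_group"] == "65+") + rng.normal(0, 10, size=n)

categorical = ["state", "age_group", "condition_group"]
X, y = df[categorical], df["covid_deaths"]

# One-hot encode the categorical columns, then fit a Ridge model.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]
)
model = make_pipeline(encoder, Ridge(alpha=1.0))

# 5-fold cross-validation; the encoder is refit on each training split,
# so no information from the held-out fold leaks into preprocessing.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("Mean R² across folds:", scores.mean())
```

Wrapping the encoder and the model in one pipeline is a design choice of this sketch: it keeps preprocessing inside each cross-validation fold rather than applying it once to the whole dataset.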

Results:

  • Data Preprocessing: The dataset contained 621,000 rows with multiple categorical features. After cleaning and removing rows with NaN values, 437,551 rows remained. Categorical variables were one-hot encoded (OneHotEncoder) so that the ML models could use them.

  • Exploratory Data Analysis: Visualizations showed the distribution of COVID-19 deaths across age groups and states. They highlighted higher mortality in older age groups and a significant impact of pre-existing conditions such as respiratory and circulatory diseases.

  • Machine Learning Models:

    • Linear Regression achieved an R² score of 0.95 with a Mean Squared Error (MSE) of 119,573 after scaling the data.

    • Ridge Regression scored slightly below Linear Regression on R², with an R² score of 0.90 and an MSE of 69,549; it was fitted on a smaller 3% sample of the data.

    • Random Forest Regressor provided promising results with an R² score of 0.91 and an MSE of 230,110.

    • Support Vector Regression (SVR) did not perform well on the dataset, with an R² score of 0.01 and an MSE of 684,977.
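The comparison above can be sketched roughly as follows. This is not the project's code: the data is synthetic, the column names (including the numeric `total_deaths` feature) are hypothetical, and `pd.get_dummies` stands in for OneHotEncoder to keep the example short. The numbers it prints will not match the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with hypothetical column names.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age_group": rng.choice(["0-24", "25-64", "65+"], size=n),
    "condition_group": rng.choice(["Respiratory", "Circulatory", "Other"], size=n),
    "total_deaths": rng.uniform(0, 1000, size=n),
})
df["covid_deaths"] = (
    0.3 * df["total_deaths"]
    + 100 * (df["age_group"] == "65+")
    + rng.normal(0, 15, size=n)
)
# Blank out 50 targets to mimic missing values in the raw data.
df.loc[rng.choice(n, size=50, replace=False), "covid_deaths"] = np.nan

# Drop rows with NaN values, then one-hot encode the categorical columns.
df = df.dropna()
X = pd.get_dummies(df.drop(columns="covid_deaths"), dtype=float)
y = df["covid_deaths"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR": SVR(),
}

# Scale the features, fit each model, and score it on the held-out split.
results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: R² = {metrics['r2']:.3f}, MSE = {metrics['mse']:.1f}")
```

Scoring every model on the same held-out split keeps the R² and MSE values comparable with one another. SVR with default settings is sensitive to the scale of the target, which may be one reason for a low score, though the project does not state the cause.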

Overall, the Random Forest and Ridge Regression models were the most effective at predicting COVID-19 mortality, suggesting that these models can help identify the factors that contribute most to deaths. Further work could include hyperparameter tuning and the use of deep learning models to improve accuracy.
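As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over a Random Forest with scikit-learn's GridSearchCV. The synthetic features, the target, and the parameter grid are all assumptions chosen for illustration; a real search would use the project's encoded features and a grid suited to them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the encoded features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=200)

# Hypothetical grid; the values are illustrative only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```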
