A region-specific clustering approach to investigate risk-factors in mortality rate during COVID-19
Overview
This project was developed in response to the COVID-19 pandemic as it was unfolding and was completed under the guidance of Dr. Deepak Joshi, Assistant Professor, IIT Delhi, India.
Since the outbreak of the novel coronavirus in late 2019, the disease has spread rapidly across the globe. Its symptoms vary widely, however, which creates an urgent need to stratify people into risk categories and identify those most susceptible to this deadly virus. Such stratification would benefit health care.
Applications and Advantages
The study aimed to cluster 208 countries into groups with similar profiles of country-level COVID-19 risk factors.
The purpose of the data analysis was to measure how these significant risk factors determine the mortality rate due to coronavirus disease.
A worldwide view of country-level data helps reveal universal trends and serves as a benchmark for predicting the future behaviour of a country from the past trends of countries with similar risk-factor values.
Methodology
The study uses open-access data and machine learning algorithms. An unsupervised machine learning model (k-means) was applied to 208 countries to define data-driven clusters based on 13 country-level parameters. Differences between the clusters were tested with a one-way ANOVA at the commonly accepted significance threshold of 0.05. The clusters were then compared pairwise using t-tests, with the Bonferroni correction applied for multiple comparisons. The analysis was performed using the statistical functions of the SciPy library.
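The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data are synthetic random numbers, the choice of four clusters is only an example, and the use of `scipy.cluster.vq` for k-means is an assumption (the source names SciPy only for the statistical tests). The variable names `features` and `death_rate` are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from scipy.cluster.vq import kmeans2, whiten

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 208 countries x 13 risk factors.
n_countries, n_factors, n_clusters = 208, 13, 4
features = rng.normal(size=(n_countries, n_factors))
# Hypothetical outcome variable (for example death rate), also synthetic.
death_rate = rng.gamma(shape=2.0, scale=1.0, size=n_countries)

# Scale every feature to unit variance so no single factor dominates the distance.
scaled = whiten(features)

# k-means with k-means++ initialisation; labels[i] is the cluster of country i.
_, labels = kmeans2(scaled, n_clusters, minit="++", seed=0)

# One-way ANOVA: does the outcome differ between the clusters?
groups = [death_rate[labels == c] for c in range(n_clusters)]
groups = [g for g in groups if len(g) > 1]  # skip empty or singleton clusters
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")

# Pairwise t-tests, with the Bonferroni correction: the 0.05 threshold
# is divided by the number of comparisons.
alpha = 0.05
pairs = list(combinations(range(len(groups)), 2))
corrected_alpha = alpha / len(pairs)
pairwise = {}
for a, b in pairs:
    t_stat, p_value = stats.ttest_ind(groups[a], groups[b])
    pairwise[(a, b)] = (p_value, p_value < corrected_alpha)
    print(f"clusters {a} vs {b}: p = {p_value:.4f}, significant = {p_value < corrected_alpha}")
```

Because the inputs here are random, the printed p-values carry no meaning; with the real country-level table in place of `features` and `death_rate`, the same calls would give the cluster comparison described above.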
Results
One-way ANOVA was used to compare the clusters in terms of total cases, total deaths, total cases per population, total deaths per population, and death rate. The configurations with four and seven clusters showed the best ability to stratify the countries according to total cases per population and death rate, with p-values of less than 0.05 and 0.001, respectively.
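One way such a comparison across cluster counts could be set up is sketched below. It is an assumption-laden illustration: the data are synthetic, the candidate range of 2 to 8 clusters is arbitrary, and the helper `anova_p_values` is hypothetical rather than taken from the project.

```python
import numpy as np
from scipy import stats
from scipy.cluster.vq import kmeans2, whiten

rng = np.random.default_rng(1)

# Synthetic stand-ins: 208 countries, 13 scaled risk factors, two outcomes.
features = whiten(rng.normal(size=(208, 13)))
outcomes = {
    "total_cases_per_population": rng.gamma(2.0, 1.0, size=208),
    "death_rate": rng.gamma(2.0, 1.0, size=208),
}


def anova_p_values(k):
    """Cluster the countries into k groups and return one ANOVA p-value per outcome."""
    _, labels = kmeans2(features, k, minit="++", seed=0)
    p_values = {}
    for name, values in outcomes.items():
        groups = [values[labels == c] for c in range(k)]
        groups = [g for g in groups if len(g) > 1]  # skip empty or singleton clusters
        p_values[name] = float(stats.f_oneway(*groups).pvalue)
    return p_values


# Tabulate the p-values for every candidate number of clusters.
p_table = {k: anova_p_values(k) for k in range(2, 9)}
for k, row in p_table.items():
    print(k, {name: round(p, 4) for name, p in row.items()})
```

Reading the table row by row shows which cluster counts separate the countries on each outcome at a chosen threshold; on random data no row is expected to stand out.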