Predicting Patients’ Survival Using Machine Learning

Oyinkansola Awosan
May 21, 2021 · 10 min read

Cardiovascular diseases are a group of diseases that affect the heart and the blood vessels. They are a group of heart conditions that include structural problems, blood clots, and diseased vessels. Some examples are high blood pressure, coronary heart disease, cardiac arrest, and heart failure, among others.
According to the WHO, cardiovascular diseases kill about 17.9 million people each year, thereby making this set of diseases the number one cause of death globally.
This project is based on the survival of patients who have heart failure. Heart failure is a condition that occurs when the heart is not pumping blood as well as it should. It is important to know that it is a chronic disease: it cannot be cured, but a patient can live with it for a very long period of time.

This project makes use of machine learning to predict patients' survival based on the data made available. As we will find out below, the dataset contains 299 observations and 13 columns, where some of these columns are measures of certain enzymes in the blood and others are general patient data.
The dataset was obtained from Kaggle, and can also be sourced here.

To begin, I gave a thorough description of the dataset, the variables in the dataset and what each means.

I then went ahead to import some of the necessary libraries for the project, after which I imported my dataset and named it HF, as seen below. The inspiration for this is simply HF = Heart Failure.
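A minimal sketch of that setup, assuming the standard Kaggle file name for the dataset:

# Imports and data loading; the CSV file name is an assumption
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

HF = pd.read_csv('heart_failure_clinical_records_dataset.csv')
HF.head()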

The dataset contained 13 columns, of which 7 are numerical features and 6 are categorical features, across the 299 observations stated earlier, as can be seen clearly here.
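A quick way to confirm this, using standard pandas calls:

# Confirm the shape and inspect which columns are continuous vs. binary flags
print(HF.shape)   # (299, 13)
HF.info()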

Considering the fact that the project is predicting patients’ survival, it is very clear that the target column/variable is the DEATH_EVENT column, which is the last column. I then checked the correlation between each of the columns using a heatmap.
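The heatmap can be produced roughly like this, assuming seaborn is among the imported libraries:

# Correlation heatmap across all columns, target included
plt.figure(figsize=(12, 8))
sns.heatmap(HF.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()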

After checking this, I decided to clean up the data a bit, and the first thing I did was to rename some really long variable names, like creatinine_phosphokinase, among others.

As seen above, creatinine_phosphokinase became CPK, high_blood_pressure became HBP, ejection_fraction became EF, and DEATH_EVENT, which was in upper case all through, was dropped to lower case.
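A sketch of that renaming step with pandas:

# Shorten the long column names and lower-case the target column
HF = HF.rename(columns={
    'creatinine_phosphokinase': 'CPK',
    'high_blood_pressure': 'HBP',
    'ejection_fraction': 'EF',
    'DEATH_EVENT': 'death_event',
})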

Still trying to clean and preprocess the data, I checked for null values and found none. I also checked for duplicated values and found none, which means the dataset is in itself already clean.
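Those checks are one-liners in pandas, roughly:

# Both of these return zeros for this dataset
print(HF.isnull().sum())
print(HF.duplicated().sum())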

I went ahead to check how many unique values each variable contains. As you can see below, variables like age, EF, serum_creatinine, and serum_sodium have a lot of repeated values, as their number of unique values is quite low, while variables like CPK, platelets, and time have fewer repeated values. The other variables contain just two values, either 1 or 0, hence the reason for them having just two unique values.
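A sketch of that check:

# Number of unique values per column; the binary flags show exactly 2
print(HF.nunique().sort_values())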

Further exploring the data, I first visualized the target variable, death_event, after which I visualized the numerical variables, before proceeding to work on the categorical variables.

values = HF['death_event'].value_counts()
labels = ['Survived', 'Dead']
fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
explode = (0, 0.06)
patches, texts, autotexts = ax.pie(values, labels=labels, autopct='%1.2f%%',
                                   shadow=True, startangle=90, explode=explode)
plt.setp(texts, color='grey')
plt.setp(autotexts, size=12, color='white')
autotexts[1].set_color('black')
plt.show()

This block of code produced the image below, which shows that, among the observations in the dataset, more than twice as many people who had heart failure survived as died: 67.89% survived, while 32.11% died.

death event visualization

AGE

After visualizing the age column next, I then checked for outliers in the variable and, as expected, found none. The image below shows that the average age is somewhere around 60 to 61.

Age variable visualization
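One way to produce both the distribution and the outlier check shown above for a numerical column like age (a sketch, not the exact plotting code used in the project):

# Histogram for the distribution, boxplot for outliers
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(HF['age'], kde=True, ax=ax1)
sns.boxplot(x=HF['age'], ax=ax2)
plt.show()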

CPK

Below is the visualization for CPK. As can be seen clearly, across the range of CPK levels in the blood, more people survived than died.

It is important to remember that for death_event, 0 = survived, while 1 = died. This helps us understand this chart and the other charts going forward.

We can also see some outliers towards the right, and we will see the outlier visualization below.

Here is the outlier visualization for the CPK variable; as stated earlier, there are outliers towards the far right.

To learn more about outliers, please read my previous article.

It is important to note that outliers are data points that differ significantly from other observations and can be harmful to the model we will be building eventually.

After this, I moved to the EF variable. Here, too, more people with this feature survived than died.

For this variable, one can see only a hint of outliers, so it is safe to say that the outliers here are few. Let’s find out below.

Closely observing this, one can see just two outliers, confirming what was said earlier about the outliers in this variable being very few.

Below is the visualization for the platelets variable. Here, we can observe the same trend as the other observations, with occurrences of death being lower than occurrences of survival.

From this visualization, one can expect a good number of outliers, as we will see below.

Checking for the outliers, which are quite obvious, we can see them on both sides of the distribution.

For the serum_sodium variable, we observe that the counts of death and survival are actually quite close; however, that of survival is still higher.

As seen above, the outliers for this variable are towards the left and this is shown in the image below.

We have very few outliers here as well, and all at one end.

For serum_creatinine, there is a stark difference between the survival and death rates, as the survival rate far outweighs the death rate.

Worthy of note is that there are lots of outliers in this variable, as we will see very soon.

In the image below, we are able to see just how far away some outliers are, and also that this variable has quite a number of outliers.

After doing this, I went on to visualize the categorical variables. These are the variables that have either 1 or 0 as values, with 1 representing true and 0 representing false.
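A sketch of how such grouped counts can be drawn for all the binary flags at once, using the renamed columns:

# One count plot per binary flag, split by survival outcome
categorical = ['anaemia', 'diabetes', 'HBP', 'sex', 'smoking']
fig, axes = plt.subplots(1, len(categorical), figsize=(20, 4))
for ax, col in zip(axes, categorical):
    sns.countplot(data=HF, x=col, hue='death_event', ax=ax)
plt.show()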

In the diagram below, along the anaemia axis, people who do not have anaemia far outnumber people who do, and the survival rate for people who do not have it is higher than that of people who do.

We also see that more people who did not have anaemia died than people who had anaemia and died.

This is, however, expected, considering that there are far more people without anaemia than with it.

For the diabetes variable, we find that more people who did not have diabetes survived than people who had it and survived.

Here we also see that the number of those who did not have diabetes is much higher than that of those who had it. Hence, it is not surprising that the death count of people who did not have diabetes is higher than that of people who did.

Visualizing the HBP column, we observe the same trend as in the other columns: the dataset has more cases of people without HBP than with it. As a result, more people without HBP survived than people with it, and likewise more people without it died than people with it.

The sex variable presents a more interesting visualization than the others. Here, sex refers to gender, where 0 = woman and 1 = man.

It is therefore very clear that there are far more men than women among the observations in this dataset; the number of men is more than twice the number of women. It is therefore not surprising that fewer women died, and fewer women survived, than men: the higher counts of men who survived and who died come down simply to the distribution.

Finally, for the smoking variable, people who did not smoke are twice as many as people who smoked.

It is not shocking, then, to see that the non-smokers lead in both the survival and death counts.

Model Building

In this part of the project, I built different models to predict the survival rate of a person who has heart failure.

I used 9 different models, namely:

  1. Logistic Regression
  2. Random Forest
  3. Multi-layer Perceptron Classifier
  4. XGBoost
  5. Gaussian Naive Bayes
  6. Support Vector Machine
  7. Decision Tree Classifier
  8. Gradient Boosting Classifier
  9. Light Gradient Boosting Machine

I started building the models by removing the outliers as they would make the models less accurate.

outliers = ['CPK', 'platelets', 'serum_sodium', 'serum_creatinine']

def outlier_removal(HF, column):
    q1 = HF[column].quantile(0.25)
    q3 = HF[column].quantile(0.75)
    iqr = q3 - q1
    point_low = q1 - 1.5 * iqr
    point_high = q3 + 1.5 * iqr
    clean_HF = HF.loc[(HF[column] > point_low) & (HF[column] < point_high)]
    return clean_HF

# clean the dataset by removing outliers
HF_cleaned = outlier_removal(
    outlier_removal(outlier_removal(HF, 'CPK'), 'serum_creatinine'),
    'serum_sodium')
print(HF.shape)
print(HF_cleaned.shape)

This removed the rows that had outliers in the affected columns, and the dataset shrank from 299 observations to 242, as seen below.

Developing the models took me through splitting the data into training and test sets, feature scaling, cross-validation, hyperparameter tuning, and a few other operations, as seen below.
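A condensed sketch of that pipeline, shown here for two of the nine models; the split ratio, random state, and hyperparameters below are assumptions, not the exact values used in the project:

# Split, scale, then cross-validate each candidate model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X = HF_cleaned.drop('death_event', axis=1)
y = HF_cleaned['death_event']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f'{name}: mean CV accuracy = {scores.mean():.3f}')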

After building all 9 models above, I had my highest accuracy score from the Light Gradient Boosting Machine (LGBMC), with a total of 86% accuracy.

However, following further research, I found out that accuracy is not necessarily the best metric for evaluating a model. As a result, I checked the Area Under the Curve (AUC) for all my models: while the LGBMC had an AUC of 83%, Random Forest had an AUC of 87%, with an accuracy of 85%.
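Computing both metrics for a fitted model looks roughly like this, continuing the hypothetical split above:

# Compare accuracy and AUC on the held-out test set
from sklearn.metrics import accuracy_score, roc_auc_score

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)
# AUC needs predicted probabilities, not hard class labels
y_prob = rf.predict_proba(X_test_scaled)[:, 1]
print('Accuracy:', accuracy_score(y_test, rf.predict(X_test_scaled)))
print('AUC:', roc_auc_score(y_test, y_prob))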

Light Gradient Boost Machine
Random Forest

With this, the model moving to the next stage is the Random Forest model.

Model Deployment

A project is not useful on paper; it is useful in action, helping people make good and accurate predictions. Hence, this project was deployed to a live site using Streamlit, a well-known open-source framework for building data apps.
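A minimal Streamlit sketch of such an app; the file names, input fields, and saved-model path are all hypothetical, and a real app would collect every feature the model was trained on:

# app.py -- run with: streamlit run app.py
import joblib
import pandas as pd
import streamlit as st

model = joblib.load('random_forest_model.pkl')  # hypothetical saved model

st.title('Heart Failure Survival Prediction')
age = st.number_input('Age', 40, 95, 60)
ef = st.number_input('Ejection fraction (%)', 10, 80, 38)
sc = st.number_input('Serum creatinine (mg/dL)', 0.5, 10.0, 1.1)

if st.button('Predict'):
    # A real model would need all the input features used in training
    row = pd.DataFrame([{'age': age, 'EF': ef, 'serum_creatinine': sc}])
    pred = model.predict(row)[0]
    st.write('Predicted outcome:', 'died (1)' if pred == 1 else 'survived (0)')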

Conclusion

  • The aim of this project was to predict whether a heart failure patient would die or survive, based on the variables provided.
  • There were 13 variables, 299 observations, with no missing or duplicated values.

Here are some conclusions from the analysis:

  • Machine Learning can work effectively in the health industry, and make life easier for everyone.
  • People between the ages of 40–70 are more likely to experience heart failure than people who are older.
  • There is a very high chance of surviving heart failure.

I hope you enjoyed reading this as much as I enjoyed writing it.

The complete analysis can be found here on GitHub. Please feel free to drop your comments and suggestions.

Kindly connect with me via Twitter or LinkedIn.

Acknowledgements

This is my final project in the She Code Africa Mentorship Program, Cohort 4, Data Science Track. I am beyond grateful to my mentor Steven Kolawole for his guidance and tutoring during the course of this internship.

Resources

https://doi.org/10.1186/s12911-020-1023-5
