Predicting heart disease with Data Science [Machine Learning Project]

Predicting heart disease? Is that even possible? Well, it is indeed possible to predict the presence or absence of heart disease in a patient if you have access to the right data, like the one we will be using here. In this article, I will go through every step necessary to create such a system, from data acquisition to the discussion of the results. From there, you should be able to do the same using a similar dataset.

Background information for predicting heart disease

What is heart disease?

Heart disease is an umbrella term for a range of diseases, including vascular disease (such as coronary artery disease), heart rhythm problems (arrhythmias), and congenital disabilities (congenital heart defects).

 

A heart attack is an example of heart disease. Many people think that a heart attack is when you have chest pain for more than 20 minutes. The truth is, a heart attack is a sudden, major cardiac event that occurs when the heart can’t get enough blood and oxygen. A heart attack can happen when there is an obstruction in the heart’s blood vessels, or it can occur without any apparent cause. A person with coronary heart disease may experience warning signs before a heart attack occurs. The type of sign depends on the location and extent of the blockage. The heart comprises four chambers: two atria at the top and two ventricles at the bottom. Valves in the heart control how blood flows between the chambers.

 

What are the risk factors for heart disease?

Heart attacks can occur in healthy people who have never had any symptoms. They can happen to anyone, but some people have a higher risk for heart attack than others. Risk factors for heart attack include:

 

Age

Heart diseases are more common in people over 65, but they can occur at any age. Researchers from Brazil looked at the incidence of heart attacks in people of varying ages and explored why they are more common among certain groups. They published their findings in BMC Public Health.

Metabolic syndrome

Metabolic syndrome is a set of five conditions that can lead to many debilitating heart-related problems. Those are:

  • High blood pressure
  • High blood sugar
  • Excess fat around the waist
  • High triglyceride levels
  • Low HDL (“good”) cholesterol levels

Autoimmune condition

A condition such as rheumatoid arthritis or lupus can increase your risk of a heart attack because the immune system triggers chronic inflammation. Other autoimmune diseases, such as celiac disease, ankylosing spondylitis, multiple sclerosis, and psoriasis, are also associated with higher risks of heart attacks and strokes. Some medications can also trigger a heart attack.

Diet

A high cholesterol level in the blood is one of the most critical risk factors for heart attacks, and a high intake of saturated fat and trans-fatty acids is associated with an increased risk of heart disease. Researchers from the Department of Food and Nutritional Sciences at Bangor University in the UK measured the blood cholesterol levels of more than 200,000 participants in the UK Biobank (the UK’s most extensive study of health and disease) and found that eating a high-fat diet is associated with significantly higher blood cholesterol levels. The results were published in the journal Scientific Reports.

High Blood Pressure

High blood pressure can damage the arteries connected to your heart, eventually leading to coronary artery disease, heart failure, and stroke. Heart and blood vessel problems of any kind are called cardiovascular diseases, or CVDs. If you already have CVD, high blood pressure makes it worse. High blood pressure can also damage the arteries that lead to the brain; when that damage blocks or bursts a vessel supplying the brain, the result is a stroke.

Obesity

Obesity can cause heart disease and strokes as well. If you are overweight, your heart has to work harder to pump blood throughout your body; this increased workload raises the chance of strokes and of diseases such as diabetes. It may also lead to breathing problems such as sleep apnea.

Diabetes

Diabetes is characterized by high blood sugar levels in the body. People with diabetes may have a hard time getting a good night’s sleep, and they are also more likely to experience depression and anxiety. Obesity can lead to insulin resistance, making it difficult for the body to take up sugar (glucose) from the blood. With insulin resistance, the pancreas has to produce more insulin to compensate. Insulin normally pushes glucose into the cells of the body and maintains a healthy blood sugar level; without sufficient insulin (or with cells that no longer respond to it), blood sugar levels rise higher than normal.

Preeclampsia

Preeclampsia is a condition that some pregnant women experience. It causes high blood pressure during pregnancy and increases the lifetime risk of heart disease. Preeclampsia typically develops after 20 weeks of pregnancy.

Family history of heart attacks

If a close family member developed an early heart attack (by age 55 for males or 65 for females), your own risk is higher. Given such a family history, you might want to keep a closer eye on your heart.

Stress

Stress may be another cause of heart attacks. When you are under a lot of stress, your sympathetic nervous system becomes activated. Your blood pressure will increase, and your heart rate will quicken. If you have high blood pressure or high cholesterol, these are even more reasons to control your stress levels.

Lack of physical activity

Heart disease can be associated with a lack of physical activity. Scientists have shown that a sedentary lifestyle is one of the causes of cardiovascular disease, and that sport and regular activity are efficient weapons against heart attacks. Not exercising regularly is often correlated with gaining weight, putting you at risk of heart attacks.

Illegal drugs

Using stimulant drugs, such as cocaine or amphetamines, can trigger a spasm of your coronary arteries that can lead to a heart attack. Cocaine, for example, is known to significantly decrease the amount of blood that reaches your heart muscle.

 

What are the complications of heart disease?

When you plan to prevent, let alone predict, heart disease, understanding the complications helps you know which variables may be important prior to your EDA. Here are the major complications of heart disease.

 

Arrhythmia

Arrhythmia is essentially an abnormal heartbeat that has three potential properties. The heart:

  • beats too quickly (tachycardia)
  • beats too slowly (bradycardia)
  • beats irregularly

 

Arrhythmias can develop after a heart attack as a result of the stress on the heart. Electrical signals become disrupted in the heart, which can be the result of the muscle being damaged.

 

Sometimes, a temporary pacemaker has to be placed while the heart settles down. A pacemaker is an electrode inserted by a surgeon into your chest that allows your heart to beat regularly when it can’t produce a steady pulse by itself. The electrode is attached to a small box that has to be carried around with you; a convenient carry case usually comes with it. Occasionally this has to become permanent, and a tiny pacemaker is inserted under the skin.

 

Chest pain/angina

Angina is another way of saying chest pain in the medical world. Chest pain often happens when the heart does not receive enough oxygen, which results in pressure or squeezing in the chest.

There exist multiple types of angina. Most commonly, we distinguish stable and unstable angina. Stable angina is usually associated with physical exertion: since physical activity increases the body’s need for oxygen, the heart muscle may be affected by it. Unstable angina is not related to physical exertion, and such chest pains are a cause for concern.

Cardiogenic shock

Cardiogenic shock is, in a sense, a more severe form of a heart attack. It happens when there is extensive damage to the heart, to the point where it cannot pump blood to maintain essential body functions.

 

As a result, people with cardiogenic shock can experience:

  • difficulty breathing
  • accelerated heartbeat and breathing
  • confusion
  • cold extremities
  • urinating less than usual, or not at all
  • pale skin

A vasopressor is sometimes used to improve blood circulation. It does so by constricting the blood vessels, thus increasing blood pressure and circulation.

Heart Failure

Heart failure is a situation in which the heart cannot pump blood properly around your body. More often than not, it happens on the left side of your heart. Heart failure can occur after a heart attack as a result of extensive damage to the heart.

 

You know that you may have heart failure when you sense:

  • Fatigue
  • Swelling in your arms and legs
  • Breathlessness

 

Heart failure is usually treated with medicines, surgery, or both.

Heart rupture

Heart rupture is a severe but rare complication of a heart attack. During a heart rupture, the heart’s muscles, walls, or valves are torn apart (rupture).

Heart rupture typically happens in the 1-5 days following a heart attack, when the heart has been significantly damaged.

The symptoms are similar to a cardiogenic shock, but open-heart surgery is needed to repair the damage.

Peripheral artery disease

When you develop peripheral artery disease, your legs don’t receive enough blood flow. You might experience leg pain when walking or have a diminished ability to feel your legs at all. Atherosclerosis also can lead to peripheral artery disease.

Stroke

Mental and physical stress, high blood pressure, and unhealthy cholesterol and fat levels can lead to an ischemic stroke when the arteries to your brain are narrowed or blocked, leading to too little blood reaching your brain. A stroke is a medical emergency—brain tissue begins to die within minutes of a stroke.

Sudden cardiac arrest

Sudden cardiac arrest is a life-threatening condition that occurs when the heart stops functioning. An irregular heartbeat or a severe injury can trigger it. Prompt treatment can prevent it from progressing to cardiac death.

Aneurysm

An aneurysm is a bulge in an artery wall, a serious complication that can occur anywhere in your body. It can be challenging to diagnose, so your doctor may order blood tests or perform an echocardiogram to locate the issue. If an aneurysm bursts, you may face life-threatening internal bleeding.

 

 

What are the common heart disease treatments?

Another important aspect of the background check for predicting heart disease is understanding how it can be treated. Even though it might not affect the statistical modelling itself, it can be an interesting factor for decision making. It becomes particularly important when choosing a metric and gauging the minimum level of accuracy your model needs to reach.

 

The treatment of heart disease usually depends on the type of heart disease you have had. It may include:

 

Blood-thinning medicines

The idea behind blood-thinning medicines is to prevent blood clots, which reduces the chances of a patient developing further strokes or heart attacks.

 

Primary percutaneous coronary intervention (PCI)

PCI is the procedure in which patients with STEMI undergo mechanical revascularization through measures such as coronary stents, aspiration thrombectomy, angioplasty, etc. In simple terms, it is an emergency treatment for a STEMI (the most severe form of heart attack).

 

Coronary artery bypass graft

Coronary bypass surgery is the process of redirecting blood around a blocked or partly blocked artery in the heart. It involves taking a healthy vessel from another part of your body, usually the leg, chest, or arm, and connecting it above and below the blocked artery. Hence, this procedure creates a new pathway to improve blood flow in your heart.

 

For other possible surgeries, check here.

 

How do you prevent heart diseases?

Even though it may have nothing to do with heart disease prediction, it is still good to know how one can prevent heart disease. Here is a list of the common ways.

 

A balanced and healthy diet

A low-fat, high-fibre diet is recommended, including plenty of fresh fruit and vegetables (5 portions a day) and whole grains. To avoid raising your blood pressure, you should limit the amount of salt you eat to no more than 6g (0.2oz) a day. 6g of salt is about 1 teaspoonful.

 

One important thing you should keep in mind when it comes to fat is that there are two different types: saturated and unsaturated. Saturated fats should be avoided as these will increase the levels of bad cholesterol in your blood.

 

Here are some pieces of advice on a balanced diet from the NHS.

 

Reduce Alcohol Consumption

If you drink, do not exceed the maximum recommended limits. Both men and women are advised not to drink more than 14 units a week regularly. Spread your drinking over 3 days or more if you drink as much as 14 units a week.

Give up Smoking

Smoking contributes to many different health problems, not just cancer. It is a significant risk factor for developing atherosclerosis (furring of the arteries). It also causes the majority of cases of coronary thrombosis in people under the age of 50.

Keep a healthy weight

A GP or practice nurse can tell you what your ideal weight is in relation to your height and build. Alternatively, you can find out your body mass index (BMI) by using a BMI calculator.

Physical Exercise

To maintain a healthy weight, you need both a healthy diet and regular exercise. A nutritious diet reduces your chances of developing high blood pressure, while regular exercise will make your heart and blood circulatory system more efficient, lower your cholesterol level, and keep your blood pressure at a healthy level.

Take your medicines

Suppose you have CHD, high cholesterol, high blood pressure, or a family history of heart disease. In that case, your doctor may prescribe medicine to prevent you from developing heart-related problems.

Check on your Blood Pressure

You can keep your blood pressure under control by eating a healthy diet low in saturated fat, exercising regularly, and, if needed, taking medicine to lower your blood pressure.

Check on your diabetes

You have a greater chance of developing CHD if you have diabetes. Being physically active and controlling your weight and blood pressure will help manage your blood sugar level. If you have diabetes, you should be maintaining a target blood pressure level below 130/80mmHg.

 

Exploratory Data Analysis (EDA) for predicting heart disease

Dataset

Description

For this analysis, we used the UCI Heart Disease dataset, a popular machine learning dataset used to verify the presence of heart disease in a patient. The patients’ names and social security numbers were recently removed from the database, replaced with dummy values. The original database contains 76 attributes, but we will be using a subset of 14 features in this project.

 

You can find the dataset source here >> https://archive.ics.uci.edu/ml/datasets/Heart+Disease

 

Dataset Authors:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  4. VA Medical Center, Long Beach, and Cleveland Clinic Foundation: Robert Detrano, MD, PhD.

 

Dataset variables:

  • age: patient’s age in years
  • sex: patient’s gender (1 = male, 0 = female)
  • cp: chest pain type
      • Value 1: asymptomatic
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: typical angina
  • trestbps: patient’s resting blood pressure (mm Hg on admission to the hospital)
  • chol: patient’s cholesterol measurement in mg/dl
  • fbs: patient’s fasting blood sugar (> 120 mg/dl; 1 = true, 0 = false)
  • restecg: patient’s resting electrocardiographic results
      • Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
      • Value 1: normal
      • Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  • thalach: patient’s maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes, 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot; see more here)
  • slp: the slope of the peak exercise ST segment (0 = downsloping, 1 = flat, 2 = upsloping)
  • caa: the number of major vessels (0–3)
  • thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
  • target: diagnosis of heart disease (1 = no, 0 = yes)

 

Data importing and Preprocessing

Let us first import the various libraries needed to perform the analysis for predicting heart disease. We will be using the following:

# Data Processing & Manipulation
import pandas as pd
import numpy as np

# Plotting Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Feature Selection & Modelling
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, \
    GradientBoostingClassifier, AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
# precision_score is needed for the evaluation later on
from sklearn.metrics import precision_score, recall_score


# Importing dataset
df = pd.read_csv("Path/to/file/heart.csv")
df.shape
(303, 14)

 

The data is imported correctly; now, let’s have a look inside. It is made of the 14 variables described in the above dataset description. Additionally, we can notice that we have 303 observations (rows), making it a relatively small dataset. 
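To take that look, a quick call displays the first rows (a simple sanity check):

# Peek at the first five rows of the dataset
df.head()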

 

All of the variables are numeric; ergo, the data will require no type conversion later on. The oldpeak variable is a floating-point number (float), and the rest are integers (int).

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

 

Looking at the dataset, we can see that it does not contain any empty cells, making it a clean dataset. I wish all my projects were like that. If you are new to data science, more often than not, you will be cleaning your data for it to look like that. 

# Checking Null Values
df.isnull().sum()
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

 

Now let’s identify the type of data available in each column (categorical or numerical). A categorical variable contains data that fits into two or more categories (gender, colour, etc.), whereas a numerical variable is a number (salary, temperature, etc.). We can infer the variable type by looking at the dataset description. In our case, we will simply figure it out from the numbers.

 

One thing that usually happens is that numerical variables have far more unique values than categorical ones. So when we use the nunique() function on the Pandas dataframe to count the unique values in each column, we can already see which columns are numerical and which are not. If we combine the number of unique values with the statistical description of each column, we can infer the variable type. For example, sex has two unique values, distributed between 0 and 1; ergo, we can be confident that we are dealing with a categorical variable. There is a lot of variation in the distribution of the trtbps variable, and it contains many unique values; consequently, we can be confident that it is a numerical variable. You can do the same with the rest of the columns to identify their type.

# Checking the number of unique values
for col in df.columns:
    print("{} : {}".format(col, df[col].nunique()))
age : 41
sex : 2
cp : 4
trtbps : 49
chol : 152
fbs : 2
restecg : 3
thalachh : 91
exng : 2
oldpeak : 40
slp : 3
caa : 5
thall : 4
output : 2
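
For the statistical description mentioned above, pandas provides it in one call:

# Summary statistics (count, mean, std, min/max, quartiles) per column
df.describe().T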

 

You can see the separation between categorical and numerical variables below.

num = ["age", "trtbps","chol","thalachh","oldpeak"]
cat = ["sex", "cp","fbs","restecg","exng","slp","caa","thall"]

 

Feature analysis

In this part, we take a deeper look at the various variable types identified in the previous section.

 

Target variable

Looking at the output, which is our target variable, we can see that we have slightly more healthy patients than patients with heart disease.

 

Target variable distribution

 

Categorical variables analysis

Looking at the distribution of gender, we can see that we have approximately 2 times more men than women. We also observe a lot of overrepresentation within single categories: we have 6x more patients without high fasting blood sugar (fbs = 0) and 2x more patients without exercise-induced angina (exng). I will not go into detail, but in certain categories, 2-3 values are overrepresented while others are not. Already we can see that some form of imbalance is starting to form within the data.

df_cat = df.loc[:, cat]
for i in cat:
    plt.figure(figsize = (14,8))
    sns.countplot(x = i, data = df_cat, palette = "OrRd")
    plt.title(i)

 

Heart disease categorical variable - Gender

 

Heart disease categorical variable - cp

 

Heart disease categorical variable - fbs


Heart disease categorical variable - restecg

 

Heart disease categorical variable - exng

 

Heart disease categorical variable - slp 

Heart disease categorical variable - caa 

Heart disease categorical variable - thall 

 

If we separate the categorical values further and look at them in terms of the presence or absence of heart disease (1 = patient without heart disease, 0 = patient with heart disease), we can see a certain pattern beginning to form.

 

  • Looking at the chest pain type (cp), people without the disease appear to have far fewer cases in categories 1, 2, and 3, while patients with the condition tend to have a value of 0, which corresponds to asymptomatic cases.
  • A similar pattern appears for exercise-induced angina: roughly 80% of the patients without heart disease did not experience it.
  • The resting electrocardiogram results are pretty interesting. There are about 10% more people with reported heart disease that have a definite left ventricular hypertrophy, and about 40% more people without heart disease have a normal restecg.
  • fbs appears to have the same proportions for patients with and without heart disease.
cat = ["sex", "cp","fbs","restecg","exng","slp","caa","thall", "output"]
df_cat = df.loc[:, cat]
for i in cat:
    plt.figure(figsize = (14,8))
    sns.countplot(x = i, data = df_cat, hue = "output", palette= "OrRd")
    plt.title(i)

 

Heart disease comparison variable - gender

Heart disease comparison variable - cp

Heart disease comparison variable - fbs

Heart disease comparison variable - restecg

Heart disease comparison variable - exng

Heart disease comparison variable - slp

Heart disease comparison variable - caa

Heart disease comparison variable - thall

 

Numerical variables analysis

Now let’s explore the numerical variables a bit and see if we can distinguish any pattern that will help us better understand the dataset and establish some sort of relationship between the numerical features and the target variable.

 

The best way to see those relationships is, in my opinion, through distribution plots. Here are my observations:

  • Heart disease happens more often to people of older age. According to the graph, the number of heart disease cases is lower for people below roughly 55 and higher for those above.
  • The patients’ resting blood pressure distributions have a similar shape; however, higher trtbps values are observed among the patients with heart disease.
  • Looking at the cholesterol graph, people with high cholesterol tend to be diagnosed with heart disease more often than those with lower levels.
  • People with heart disease tend to have a lower maximum heart rate achieved compared to healthy patients.
  • Regarding the oldpeak variable, as the value increases, the number of patients with heart disease increases.
num = ["age", "trtbps","chol","thalachh","oldpeak","output"]
df_cat = df.loc[:, num]
for i in num:
    plt.figure(figsize = (14,8))
    ax= sns.displot(x = i, data = df_cat, hue = "output", palette= "OrRd")
    plt.title(i)
Heart disease numerical age
Heart disease numerical trtbps
Heart disease numerical chol
Heart disease numerical thalachh
Heart disease numerical oldpeak

 

 

Feature relationships for heart disease prediction

By now, you should have some idea about how the various features and attributes affect heart disease. But let’s use a couple of statistical tests to figure out which of those features hold the most value for us to infer from. We can go through 3 techniques:

  1. the correlation graph
  2. the univariate selection
  3. the feature importance

These techniques are great for two main reasons, in my opinion. One, they allow us to confidently say that variables x, y, z are the most important for us to make an accurate prediction. Two, even though it may not apply here, they allow us to eliminate features that do not bring any value to an inference, which is handy for projects with a large number of variables.

 

a-  Correlation graph

Looking at the correlation graph, the higher increase in chances of heart attack comes from exng, oldpeak, and caa (remember, an output of 0 means heart attack and 1 means no heart attack). This means that as the value of those variables increases, the value of the output decreases towards 0. The cp, thalachh, and slp variables appear to have the opposite effect. There is no or only weak correlation for fbs, chol, restecg, and trtbps.

plt.figure(figsize = (14,10))
mask = np.triu(df.corr())
sns.heatmap(df.corr(), annot = True, fmt = ".1f", linewidths = .7,cmap="OrRd",
            mask = mask)
plt.show()

Heart disease prediction correlation

 

b-  Univariate selection

Looking at the result of the univariate selection, we can see that thalachh, oldpeak, caa, and cp appear to be the most important values for making a prediction. We sort of assumed this by looking at the correlation map, but with the univariate selection, we get an idea of the extent to which each variable affects the predictions.

# Get the features
X = df.iloc[:,0:13]
y = df.iloc[:,-1]    # target column
# Apply the SelectKBest class to extract the top features
k_best = SelectKBest(score_func=chi2, k=10)
fit = k_best.fit(X,y)
# Get the feature names
names = pd.DataFrame(X.columns)
# Save the scores
scores = pd.DataFrame(fit.scores_)
# Combine the results
results = pd.concat([names,scores],axis=1)
results.columns = ['Column','Result']
# Print the results
print(results.nlargest(12,'Result'))
     Column      Result
7   thalachh  188.320472
9    oldpeak   72.644253
11       caa   66.440765
2         cp   62.598098
8       exng   38.914377
4       chol   23.936394
0        age   23.286624
3     trtbps   14.823925
10       slp    9.804095
1        sex    7.576835
12     thall    5.791853
6    restecg    2.978271

 

c- Feature importance

Feature importance is a built-in attribute of most tree-based classifiers. The results again show the same 4 variables from the univariate selection analysis among the most important ones.

# Use a simple random forest model
model = RandomForestClassifier()
# Fit the RF to the data
model.fit(X,y)
# Use the feature_importances_ attribute to extract important features
print(model.feature_importances_)
# Combine the values with the original names
f_imp = pd.Series(model.feature_importances_, index=X.columns)
plt.figure(figsize = (14,8))
# Plot the values
f_imp.nlargest(13).plot(kind='barh', color="#D16350")
plt.show()

Heart disease prediction feature importance

 

Model preparation for predicting heart disease

Now that we have an idea and an understanding of the data, we can create our predictive model based on what we know about the data. 

 

Dealing with the imbalance in data

The first thing that we realized from the EDA is that we are dealing with an imbalanced dataset. An imbalanced dataset can be a problem because it can create a biased model. What do I mean by that? Let’s assume that 9 out of 10 people in our dataset do not have heart disease. If we train our model on such a dataset for predicting heart disease, the model could simply always guess “Not Diseased” and be correct most of the time, since there is a very dominant class. Thus, even a model that has learned nothing useful will have a high accuracy.
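
To make that concrete, here is a tiny sketch with made-up labels (not our dataset) where a "model" that always predicts the majority class still reaches 90% accuracy:

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 9 out of 10 patients are "not diseased" (1)
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
# A "model" that blindly predicts the majority class every time
y_naive = np.ones_like(y_true)
print(accuracy_score(y_true, y_naive))  # 0.9, despite learning nothing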

 

To remedy that, we may use the Synthetic Minority Oversampling Technique, or SMOTE, a technique in which we oversample the minority class. SMOTE works by synthesizing new minority-class samples (interpolating between existing ones) in the training dataset before fitting a model. Jason Brownlee explained this process very well in the following SMOTE article. Check it out for more information.
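
As a minimal illustration of what SMOTE does to the class counts, here is a sketch on a made-up imbalanced dataset (not ours):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Made-up dataset with a roughly 9:1 class imbalance
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9], random_state=8)
print("Before:", Counter(y_demo))
# SMOTE synthesizes new minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=8).fit_resample(X_demo, y_demo)
print("After: ", Counter(y_res))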

 

Choosing the right evaluation metric 

Besides the imbalance problem, the next thing we are supposed to deal with is the evaluation metric for predicting heart disease. We could use accuracy; however, we should look at what is more important before making a choice. 

 

Since it is a classification problem, we assess what evaluation metric is important by looking at the confusion matrix. 

 

Confusion matrix

 

Looking at the confusion matrix above, there are several choices we could use. We could go for accuracy, which is (True Positives (TP) + True Negatives (TN)) / (TP + TN + False Positives (FP) + False Negatives (FN)). This choice sounds excellent, since the more TP and TN you have, the more accurate your model will be. However, as mentioned in the previous part, with an imbalanced dataset the model will more often than not be correct simply because it predicts the majority class.

 

One thing we can do here is to ask what is vital for our predictive model. What are we trying to achieve? What is the consequence of a wrong prediction? For instance, what is the impact of diagnosing a patient who does not have heart disease as if they did? Maybe you lose some money ordering a new test, but it is for sure not life-threatening. So the consequences of that wrong prediction are financial.

 

What are the consequences of predicting that a patient who has heart disease does not? In this case, it would be a threat to the patient’s life.

 

Consequently, it is all about what we value as more important. If we say that we should 100% focus on saving the most lives, then we should use a metric that gives more weight to False Negatives, disregarding the effect of False Positives. In this case, we may use recall, which means the more false negatives our model has, the less performant it will be. The business result of this may be that your product works fine at catching the disease; however, since you don’t care about False Positives, patients may order more tests than they need, meaning that you are wasting people’s money.

 

Or, if we want something more balanced, we can use accuracy, which gives roughly equal weight to FP and FN. The business implication of this strategy is less money wasted by patients on extra tests, but a model that is less performant at catching the disease. On the other hand, if you want to focus on false positives (to save as much money as possible), you can use precision. If you want to shift the balance, giving, let’s say, 60% of the importance to one error type and 40% to the other, you can use a weighted F-score (F-beta). Finally, you can take advantage of your model’s ROC-AUC to find what the right balance is.

 

So, as you can see, picking the correct metric is more often than not tied to the business model you chose. In the real world, you need to discuss this with your client and identify what they are trying to do in order to decide what metric is appropriate for the problem at hand. This is why it is essential to acquire domain knowledge in whatever field you are in. In this case, I will be using precision to evaluate the performance of my model: Precision = True Positives / (True Positives + False Positives). Precision will allow me to maximize the quality of the positive predictions, which in our case penalizes predicting “no heart disease” for a patient who actually has it (remember, output = 1 means no heart disease).
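
To illustrate how precision and recall fall out of the confusion matrix, here is a small sketch with made-up labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (1 = positive class)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Precision:", tp / (tp + fp))  # equals precision_score(y_true, y_pred)
print("Recall:   ", tp / (tp + fn))  # equals recall_score(y_true, y_pred)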

 

Machine Learning models for predicting heart disease

 

Let’s first separate features and target. Because we do not have too many variables, and because I would like to make a point about it (and out of a bit of laziness), I have shamelessly hard-coded the various columns. In a production environment, try your best not to hard-code any column names in your code; derive them dynamically to avoid typos and to survive updates to the data (see the sketch after the code below).

# Separating the features and targets 
X =df[['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall']]
y = df['output']
cat = ["sex", "cp","fbs","restecg","exng","slp","caa","thall"]
num = ["age", "trtbps","chol","thalachh","oldpeak"]

 

Next, let’s encode the categorical variables that we will use for predicting heart disease. One-hot encoding turns each category into its own binary column, so the models don’t mistakenly treat the category codes as ordered numerical values.

# Get the positions of the categorical features
location_cat = [X.columns.get_loc(c) for c in cat]
# Encoding categorical variables
col_trans = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),
                                             location_cat)],
                              remainder='passthrough')

X = np.array(col_trans.fit_transform(X))

 

As discussed earlier, let’s apply SMOTE.

# Performing SMOTE
# Note: ideally apply SMOTE only to the training split so that no synthetic
# samples leak into the test set; here it is applied to the full set as-is.
sm = SMOTE(k_neighbors=8)
X, y = sm.fit_resample(X, y)

 

Next, I am going to split the dataset into train and test sets. Then, I will set the parameter grids I would like to test. Since the dataset is relatively small, we can use a rather large grid. Another option would have been to use randomized search, but I think using grid search here is fine. It’s your choice anyway. Any of these methods will help improve your score for predicting heart disease.

 

I additionally plan to run my tests using tree-based classification models, so I initialized the various parameters for those. The classification models are:

  • Decision Trees (DT)
  • Light Gradient Boosting Model (LGBM)
  • AdaBoost (ADA)
  • Random Forest (RF)
  • Gradient Boosting (labelled XGB in the code; note that it uses scikit-learn’s GradientBoostingClassifier rather than the xgboost library)
# Separating Train & Test
print('Splitting Dataset ... \n')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .2, random_state= 8)

# Performing Scaling
print('Scalling Variables ... \n')
scl = StandardScaler()
X_train = scl.fit_transform(X_train)

# Set the parameter grid for the model



print("Training Dataset ...")

# Parameters Decision tree
params_dt = [{
    'splitter': ['best', 'random'],
    'criterion': ['gini', 'entropy'],
    'max_depth': [4,  6,  8,  10,  12,  20,  40, 70],
}]

# Parameters Light - GBM
params_lgbm =  {
    'objective' : ['binary'],
    'boosting_type' : ['gbdt', 'dart'],
    'learning_rate': [0.005, 0.01],
    'n_estimators': [8,16,24,32,64],
}

# Parameters Adaboost
params_ada={
    'learning_rate': [0.005, 0.01],
    'n_estimators': [8,16,24,32,64],
}

# Parameters Random Forest
params_rf = [{
    'criterion': ['gini', 'entropy'],
    'n_estimators': [50, 100, 300, 500, 750, 1000],
    'max_features': [2, 3, 4, 5],
}]


# Parameters xgboost
params_xgb = [{
    'max_depth': [4,  6,  8,  10],
    'learning_rate': [0.3, 0.1, 0.05],
}]


# Model Initialization
print("Initialize models")
models = [
    ("DT", DecisionTreeClassifier(), params_dt),
    ('LGBM', LGBMClassifier(), params_lgbm),
    ('ADA', AdaBoostClassifier(), params_ada),
    ('RF', RandomForestClassifier(), params_rf),
    # Labelled XGB, but this is scikit-learn's gradient boosting
    ('XGB', GradientBoostingClassifier(), params_xgb)]

 

Now let’s run the whole thing and see which model has the best results at predicting heart disease.

 

# Keep track of each model's best test precision
best_models = []

for name, model, model_params in models:
    # Fitting the model
    model.fit(X_train, y_train)
    # Predicting on test with default hyperparameters
    base_pred = model.predict(X_test)
    # Save the results
    base_result = precision_score(y_test, base_pred)
    print('Base precision is :', base_result)
    # Applying parameter tuning
    print('K-Fold Cross validation on {}'.format(name))
    # Saving the cross-validation results
    train_precision_score = cross_val_score(estimator=model, X=X_train, y=y_train,
                                         cv=5, scoring='precision')
    # Getting top result
    print("Train precision Score: {:.2f} %".format(train_precision_score.mean()*100))
    # Showing the deviation
    print("Standard Deviation: {:.2f} %".format(train_precision_score.std()*100))

    print('Grid Search for model {} \n'.format(name))
    clf = GridSearchCV(
        estimator=model,
        param_grid=model_params,
        scoring='precision',
        cv=5,
        n_jobs=-1,
        verbose=0,
    )
    clf.fit(X_train, y_train)
    best_score = clf.best_score_
    best_parameters = clf.best_params_
    print("Best Train precision for model {}: {:.2f} %".format(name, best_score*100))
    print("Parameters for best model {}".format(name), best_parameters)
    print('Training for {} model Completed \n'.format(name))
    pred = clf.predict(X_test)
    best_score = precision_score(y_test, pred)
    best_models.append((name, round(best_score*100, 3)))
    print("Best Test precision for model {}: {:.2f} %".format(name, best_score*100))

# Sort the best results by precision score
best_models.sort(key=lambda k: k[1], reverse=True)
print("\n Models precision Score: \n ")
res = pd.DataFrame(best_models)
print(res)
plt.figure(figsize=(14,8))
sns.barplot(x=res[1], y=res[0], palette='OrRd')
print("\n\nModel With Highest precision is: \n", best_models[0], '\n\n')
Base precision is : 0.7666666666666667
K-Fold Cross validation on DT
Train precision Score: 76.82 %
Standard Deviation: 6.44 %
Grid Search for model DT 
Best Train precision for model DT: 84.68 %
Parameters for best model DT {'criterion': 'gini', 'max_depth': 70, 'splitter': 'random'}
Training for DT model Completed 
Best Test precision for model DT: 68.75 %

Base precision is : 0.8823529411764706
K-Fold Cross validation on LGBM
Train precision Score: 83.14 %
Standard Deviation: 2.10 %
Grid Search for model LGBM 
Best Train precision for model LGBM: 80.39 %
Parameters for best model LGBM {'boosting_type': 'gbdt', 'learning_rate': 0.01, 'n_estimators': 64, 'objective': 'binary'}
Training for LGBM model Completed 
Best Test precision for model LGBM: 82.61 %

Base precision is : 0.7941176470588235
K-Fold Cross validation on ADA
Train precision Score: 78.07 %
Standard Deviation: 4.86 %
Grid Search for model ADA 
Best Train precision for model ADA: 82.70 %
Parameters for best model ADA {'learning_rate': 0.01, 'n_estimators': 64}
Training for ADA model Completed 
Best Test precision for model ADA: 82.61 %

Base precision is : 0.89189189189
K-Fold Cross validation on RF
Train precision Score: 94.44 %
Standard Deviation: 3.89 %
Grid Search for model RF 
Best Train precision for model RF: 96.59 %
Parameters for best model RF {'criterion': 'gini', 'max_features': 3, 'n_estimators': 300}
Training for RF model Completed 
Best Test precision for model RF: 97.14 %

Base precision is : 0.8333333333333334
K-Fold Cross validation on XGB
Train precision Score: 83.49 %
Standard Deviation: 5.08 %
Grid Search for model XGB 
Best Train precision for model XGB: 82.04 %
Parameters for best model XGB {'learning_rate': 0.3, 'max_depth': 4}
Training for XGB model Completed 
Best Test precision for model XGB: 87.50 %

 Models precision Score: 
 
      0       1
0    RF  97.143
1   XGB  87.500
2  LGBM  82.609
3   ADA  82.609
4    DT  68.750


Model With Highest precision is: 
 ('RF', 97.143) 

Predicting heart disease results

Looking at the results, we can see that the Random Forest (RF) had the best results at predicting heart disease, with a precision score of 97.143%. The other models had rather low performance compared to the RF. Now, how can you improve on that? Well, you can always test out different models, address the overfitting visible above (a high training score paired with a lower test score), or use a different method for hyperparameter tuning.
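
For the last option, here is a hedged sketch of what swapping GridSearchCV for RandomizedSearchCV could look like on the Random Forest (same search space as above, but sampling only 10 combinations):

from sklearn.model_selection import RandomizedSearchCV

# Sample 10 random combinations instead of exhausting the whole grid
rnd = RandomizedSearchCV(estimator=RandomForestClassifier(),
                         param_distributions={'criterion': ['gini', 'entropy'],
                                              'n_estimators': [50, 100, 300, 500, 750, 1000],
                                              'max_features': [2, 3, 4, 5]},
                         n_iter=10, scoring='precision', cv=5,
                         random_state=8, n_jobs=-1)
rnd.fit(X_train, y_train)
print(rnd.best_params_, rnd.best_score_)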

 

Heart disease results and discussion

Now let’s look at the best model we had. We can see that it did very well, since it has a pretty high precision score. Of the 35 people that had a heart problem, the model misdiagnosed just 1. However, the number of people who did not have the disease but were predicted to have a heart problem is relatively high (at 20). This means that we save many people, but we do so by recommending too many tests, which may lead to a waste of resources.
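
For reference, a matrix like the one shown below can be plotted along these lines (a sketch assuming clf holds the fitted Random Forest search, e.g. after re-running the grid search for RF alone):

from sklearn.metrics import confusion_matrix

# Confusion matrix of the best model's predictions on the test set
cm = confusion_matrix(y_test, clf.predict(X_test))
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='OrRd')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()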

 

Heart disease prediction confusion matrix
Confusion Matrix – Prediction Results for the Random Forest Model

 

 

This is why the world of data is so fascinating because from here, there are many things you can do. 

 

You can:

  • Try to optimize the model to see how you can diminish the number of false positives and improve the heart disease predictions.
  • Try to remove features that neither increase nor decrease the predictive capability of the model.
  • Comment in the section below, letting me know how you improved the above model.
  • Deploy the model online in your app. If you have a dataset, or a company with a dataset similar to the above, you could create an app around it that can help with the diagnosis.
  • Try to save your best model for predicting heart disease, build an app around it, and deploy it (a minimal saving sketch follows this list).
  • Try to do some more exploration and a deeper dive into the results, and add your results on predicting heart disease to your portfolio.
  • And many more.
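
For the saving step mentioned in the list, here is a minimal sketch using joblib (the file names are made up; adapt them to your project):

import joblib

# Persist the fitted model plus the preprocessing objects it depends on
joblib.dump(clf.best_estimator_, 'heart_disease_rf.joblib')  # hypothetical file name
joblib.dump(scl, 'scaler.joblib')
joblib.dump(col_trans, 'column_transformer.joblib')

# Later, e.g. inside your app:
model = joblib.load('heart_disease_rf.joblib')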

 

 

If you made it this far into the article, thank you very much.

 

I hope this information was of use to you. 

Feel free to use any information from this page. I’d appreciate it if you simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.

 
