Leveraging Machine Learning in Breast Cancer Diagnosis

I Re-Created a Machine Learning Algorithm That Can Diagnose Breast Cancer.

Nancy Shnoudeh
15 min readJan 27, 2021

Here’s how you can do it too.

Credit to: ComputerScience on YouTube, in particular this tutorial.

The problem

The stats:

  • False-positive mammograms and overdiagnosis of breast cancer among women ages 40 to 59 cost $4 billion in health-care spending annually (the study) — the journal Health Affairs
  • Breast cancer over-diagnosis is estimated to be between 22 and 31 percent of all diagnosed breast cancers, according to (the study)

This is a huge issue because failure to receive proper treatment, or receiving too much treatment, can result in serious harm to patients.

Main reasons misdiagnosis happens:

  • The doctor failed to order the proper tests, or dismissed the patient's concerns
  • The results were incorrectly interpreted
  • Failing to follow up with a patient about a positive test result or a suspicious test result
  • Waiting too long to conduct a biopsy when a lump is discovered, or other symptoms are present

Real life stories:

— “Cancer doesn’t grow like that,” Elizabeth Vines’ doctor told her after she went to see him about the rapidly growing lump in her breast. Two months later, she learned she had stage 3 breast cancer at age 35. — Elizabeth Vines, as told by Patti Greco (source)

— “Being told I had cancer was awful, but then to go through all of the treatment and surgery, to then be told it was unnecessary was traumatizing” — Sarah Boyle (source)

It’s so heartbreaking to hear these stories, but we have a choice to do something about it, and AI might be the solution. Here’s how I leveraged machine learning to diagnose breast cancer, and how YOU can do it too! YES YOU!

note: If you plan to do this project as well, this article has you fully equipped with the tools to recreate this model from start to finish (little to no coding knowledge is required, though some is recommended)

Setting up

Step 1:

Opening an IDE.

What is an IDE?

- Stands for: Integrated Development Environment 💡

- It’s the platform the “coder” uses to actually write and run code

- Different IDEs serve different purposes and have different functions

For this model, I used Google Colaboratory, and I recommend you do as well.

Here’s what you want to do:

  1. Open Google Colab.
  2. Once it’s opened, click “File” and then “New notebook”.

Yup, that's it! If you’re not familiar with Google Colab, here’s the rundown:

The basics of Google Colab:

- (optional: Watch this video to get familiar with Google Colab.)

- The “+ Code” button allows you to add a new code cell.

- The play button beside each code cell allows you to run that cell.

- Clicking “File” will allow you to do a variety of actions, such as “opening a new notebook,” “locating in Drive,” etc.

ps. I suggest you play around with all the features to become familiar with the options available.

Step 2:

Data.

Data is a fundamental part of AI (to learn more about its value, scroll down to “block 4” in this article I wrote about AI).

The data set I used was from Kaggle and is called:

“Breast Cancer Wisconsin (diagnostic) Data Set”

You’re going to need to download it to your computer so click this link to do so.

Because this was my first AI project, I was definitely overwhelmed by the data. I opened it up and started to scroll through…

and scroll….

… and scroll…

… and scrolled some more…

THERE WAS SO MUCH DATA!!!

Once you download it to your computer and open it up, it should look something like this (but imagine this picture 233.54 times over)…

Now, it’s time to work the magic, aka code…

Quick check in:

At this point, you should…

- Have a new notebook opened in Google Colaboratory

- Be comfortable and familiar with Google Colaboratory

- Have your data set downloaded and opened

Let’s Get Coding 💪!

YAYYAYAYY! So hyped!

(ps. make sure you have a new notebook opened, without any code in it)

First, we’re going to make a comment about what the program does, using the hashtag symbol:

#Description: This program detects breast cancer, based off of data

Next, because I didn’t have “seaborn,” I had to install it (seaborn is a Python data visualization library). In Colab, the exclamation mark tells the notebook to run the line as a shell command:

!pip install seaborn

What is a library? 📚

In programming, libraries are pre-compiled routines that the program uses.

Think of each library as a toolbox. Different tools do different tasks, and different libraries can be used for different things to help with the project.

Next I had to import my libraries:

#import libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Further explained:

- I made a comment using the hashtag to explain what this part of the code was doing

- ps. numpy, pandas, matplotlib.pyplot, and seaborn are all different libraries, and we gave them “nicknames,” such as “sns” for seaborn, so we can refer to the seaborn library by using “sns”.

Then, I’m going to load the kaggle data set that we downloaded:

#load the data
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('data.csv')
df.head(7)

Further explained:

- I made a comment (using #)

- Using “uploaded = files.upload()” gives you the traditional upload button (when running the code), which allows files from your computer to be moved into the Colab environment.

- In this line: “df = pd.read_csv(‘data.csv’)”, “pd.read_csv” is what’s used so that pandas (a data analysis library) can read the CSV file. I put (‘data.csv’) because this was the name of my CSV file.

- “df” stands for data frame; therefore, when we use “df” it means we’re calling on the data frame we just loaded.

- I printed the first seven rows of the data, as shown by “df.head(7)”. The number 7 can be adjusted to the number of rows you want displayed.

After running that line of code, this is what you should see:

All columns are visible, whereas only 7 rows are visible

Next, we’re going to be counting the number of rows and columns in the data set.

#count the number of rows and columns in the data set
df.shape

There are 569 rows, which means 569 patients are represented in this dataset. There are also 33 columns, therefore 31 variables affecting the diagnosis if we exclude “id” and “diagnosis”.

Next, we’re going to count the number of empty values in each column:

#count the number of empty (NaN, NAN, na)values in each column
df.isna().sum()

Let’s break it down:

- df is our data frame, and the function “isna()” flags the empty values; “.sum()” then counts them for each column.

0 means the column has 0 empty values

The last column (“Unnamed: 32”) is missing all of its values, therefore we want to drop that column to clean up the data. We’re going to do that using:

#Drop the column with all missing values
df = df.dropna(axis=1)

Now that we’ve dropped the column, we’re going to count the new number of rows and columns. Two steps above we did the same thing, so we’re going to use the same code:

df.shape 

As you can see, we still have the same number of patients but one less column (from removing the empty column).

Next, continuing with our data exploration, we’re going to be counting the number of malignant and benign tumors.

#Get a count of the number of Malignant (M) or Benign (B) cells
df['diagnosis'].value_counts()

We want to visualize this count because data exploration is so much easier when it becomes visual. To do this, we’re going to use seaborn (the library we imported at the beginning). To call on the seaborn library, we have to start with “sns” (referencing “import seaborn as sns”).

#visualize the count
sns.countplot(x=df['diagnosis'], label='count')

Now we want to look at the data types to see which columns need to be encoded. The need to encode data arises when you want to transform categorical data into numbers (here, a boolean-style 0 or 1).

What is categorical data?

- Data such as towns in the GTA, favourite animals, ice cream flavours, etc.

What is a boolean?

- A data type

- A boolean is binary, either true or false, 1 or 0, yes or no

- There are many data types, including float, string, integer, etc. In our output we can see that most of the columns are floats (each data type has its own characteristics).

#look at the data types to see which columns need to be encoded
df.dtypes

Further explained:

“df.dtypes” is what does this. We’re using df (our data frame) again because the output we want is about our data frame.

Now that we have identified the categorical data, we actually need to encode it:

#encode the categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df.iloc[:,1] = labelencoder_Y.fit_transform(df.iloc[:,1].values)

Let’s break this down:

- First we added a comment: “#encode the categorical data values”

- Next, we imported LabelEncoder from sklearn.preprocessing (sklearn is short for scikit-learn. We can think of it this way: sklearn is the library, preprocessing is the aisle, and LabelEncoder is the book)

- In the third line, “labelencoder_Y = LabelEncoder()”, we are creating a LabelEncoder object and giving it the shorter name labelencoder_Y, so we can use it in the next line.

Break down of: “df.iloc[:,1]”

- df.iloc — iloc is used for data selection; the “i” stands for integer, because we select rows and columns by their integer position. We put df. in front because it stands for data frame, and we are selecting from our data frame.

- [:,1] — Think of this as [rows, columns]. The colon means “all rows,” and the 1 means the column at position 1, which is the diagnosis column (position 0 is the id). Therefore, this means we are looking at the diagnosis column for every row in the data set.

In summary, the first part of that line of code, “df.iloc[:,1]”, is saying that in our data frame (df), we want to select (hence iloc) the diagnosis column for all our rows.

Closer look at “labelencoder_Y.fit_transform(df.iloc[:,1].values)”:

- (df.iloc[:,1].values) — what we’re saying here is that the values we selected from the diagnosis column in every row will be pulled out as an array, which is basically a list of values.

- fit_transform — “fit” learns the categories that appear in our array (here, B and M), and “transform” actually converts them into numbers, as shown in the sketch below. This article explains it in further detail.
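If you’d like to see fit_transform in action before running it on the real data, here’s a tiny sketch using a made-up list of labels (this isn’t part of the original tutorial, just an illustration):

#A tiny sketch of fit_transform on made-up labels (not our data set)
from sklearn.preprocessing import LabelEncoder

toy_labels = ['B', 'M', 'B', 'M', 'M']      #hypothetical diagnosis labels
encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

print(encoder.classes_)  #['B' 'M'] -- the categories that "fit" discovered
print(encoded)           #[0 1 0 1 1] -- what "transform" produced

Notice how B always becomes 0 and M always becomes 1, which is exactly what happens to our diagnosis column.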

Next, we want to create a pair plot to visually show our data. To do that we’re going to use sns, because it’s how we access seaborn (a Python library for data visualization).

#Create a pair plot
sns.pairplot(df.iloc[:,1:5], hue='diagnosis')

Let’s break it down:

- We added “.pairplot” because it’s the way we wanted our data to be represented.

- In brackets we have “(df.iloc[:,1:5], hue=’diagnosis’)”.

- df.iloc[:,1:5] — This refers to selecting all rows, and the columns at positions 1 through 4 (a slice like 1:5 includes the start but stops before the end).

- hue=’diagnosis’ — Hue means color; setting it to ‘diagnosis’ colors the points differently based on the diagnosis given in our data set.

Now we want to print the first 5 rows of the data again, after encoding it in the steps above.

#print the first 5 rows of the new data
df.head(5)

Breaking it down:

— “.head()” is the function used to return the first rows of data. Between the parentheses I included the number 5, but you can change it to any number. (ps. the count starts from row 0)

Notice how in the diagnosis column, the M and B transformed to be either a 1 or 0. (1 — Malignant, 0 — Benign)

Next, we want to find the correlation of the columns. Essentially, we want to figure out how strongly each column is related to the others.

#get the correlation of the columns
df.iloc[:,1:12].corr()

The score ranges from -1 to 1. The closer the value is to 1 (or -1), the more strongly the two columns move together (or in opposite directions); values near 0 mean there is little relationship.

As you can see where “diagnosis” meets “diagnosis”, the score is 1, because every column is perfectly correlated with itself.
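If the correlation score still feels abstract, here’s a small made-up example (not our data set) of how .corr() behaves:

#A made-up example of how .corr() behaves (not our data set)
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  #moves exactly with 'a'      -> correlation of 1
    'c': [5, 4, 3, 2, 1],   #moves exactly opposite 'a'  -> correlation of -1
})
print(toy.corr())
#Every column scores 1 with itself, 'a' and 'b' score 1, and 'a' and 'c' score -1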

It’s definitely hard to understand the correlations when it’s a bunch of numbers, so, the next step is to visualize the correlation, using seaborn (sns).

#visualize the correlation
plt.figure(figsize=(10,10))
sns.heatmap(df.iloc[:,1:12].corr(),annot=True, fmt='.0%')

Further explained:

plt.figure(figsize=(10,10))

- plt. — means we’re using the matplotlib.pyplot library (which we imported as plt)

- figure — is used to create a figure. We add brackets to adjust the size, as shown below:

- (figsize=(10,10)) — “figsize” is the parameter used to adjust the size of the figure/graph. The (10,10) is the width and height of the figure in inches, not coordinates.

sns.heatmap(df.iloc[:,1:12].corr(),annot=True, fmt=’.0%’)

- sns.heatmap — From the seaborn library (sns), we’re leveraging the heatmap.

- df.iloc[:,1:12] — We’re selecting the columns at positions 1 through 11 from our data frame (the diagnosis plus the ten “mean” feature columns).

- corr() — Used to compute the correlations of that data

- annot=True — If annot is set to True, the numerical value is written inside each box.

- fmt=’.0%’ — The fmt parameter controls how those numbers are formatted. ‘.0%’ displays each correlation as a percentage with no decimal places (for example, 0.92 shows up as 92%). ps. fmt only matters when annot is turned on.

Now we want to split our data set into independent (X) and dependent (Y) data sets.

#Split the data set into independent (X) and dependent (Y) data sets
#.values means to make it an array
#(Y - the diagnosis, if the patient has cancer or not)
#(X - the features that help determine the diagnosis/Y)
X = df.iloc[:,2:31].values
Y = df.iloc[:,1].values
type(X)

Further explained:

note: I added the other comments for my own clarity, but the only necessary one is the first comment.

- X is our independent variable, and Y is our dependent variable.

- We need to assign values to these variables

X = df.iloc[:,2:31].values

- In this line, we’re assigning X (the independent variables) to the columns at positions 2 through 30 (a slice like 2:31 includes the start but stops before the end).

Y = df.iloc[:,1].values

- In this line, we’re assigning Y (the dependent variable) to the column at position 1, which is the diagnosis column

type(X)

- This function returns what kind of object X is. Our output was numpy.ndarray, which means an array from the numpy library.

In AI, we split data into training and testing data. Similar to tests in school, most of the work is practice for the test. We’re going to split 75% into training data (to train our model on, to make accurate predictions), and 25% testing data. This is how we’re going to do it:

#split the data, 75% training, 25% testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25 , random_state = 0)

- from sklearn.model_selection import train_test_split — Here we are importing the train_test_split “book” from the scikit-learn library.

- random_state = 0 — Without a fixed seed, the rows that end up in the training and testing sets would change every time you run the split. Setting random_state to a fixed number (0 here) makes the split reproducible, so you get the same result on every run. See the sketch below.

- To explain what’s happening here in further detail, I would recommend reading through this page. It thoroughly explains the train_test_split function.
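Here’s a small sketch with made-up numbers (not our data set) of what random_state does; with the same seed, you get the exact same split every time you run the cell:

#A made-up example of why random_state matters (not our data set)
from sklearn.model_selection import train_test_split

data = list(range(8))  #0..7, standing in for 8 "patients"

split_a = train_test_split(data, test_size=0.25, random_state=0)
split_b = train_test_split(data, test_size=0.25, random_state=0)
print(split_a == split_b)  #True -- same seed, so the exact same split

split_c = train_test_split(data, test_size=0.25)  #no seed set
#split_c can shuffle differently every time this cell is run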

Now we want to scale our data. Scaling means transforming our data so that every feature is on a comparable scale. This article explains it in further detail (I would only recommend reading the first part, otherwise it might be confusing)

#Scale the data (feature scaling)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  #learn the scaling from the training data, then apply it
X_test = sc.transform(X_test)        #apply the same (training) scaling to the test data

- Here, we are creating a StandardScaler object and calling it “sc”. Notice that we fit the scaler on the training data only, and then use that same scaler to transform the test data, so both sets are scaled with the same statistics.
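To convince yourself of what the scaler is doing, here’s a quick sanity check with made-up numbers (not our data set); after scaling, the training data has a mean of roughly 0 and a standard deviation of roughly 1, and the test data gets scaled using the training statistics:

#A made-up sanity check of StandardScaler (not our data set)
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[2.5], [10.0]])

sc = StandardScaler()
scaled_train = sc.fit_transform(toy_train)  #learn mean/std from the training data, then scale it
scaled_test = sc.transform(toy_test)        #reuse the *training* mean/std on the test data

print(scaled_train.mean(), scaled_train.std())  #roughly 0.0 and 1.0
print(scaled_test)  #2.5 lands at 0 (it equals the training mean); 10.0 lands far above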

Now, from the sklearn.ensemble library, we want to import RandomForestClassifier.

from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier is an algorithm that uses a set of decision trees.

What are decision trees?

This picture is a great representation. Basically, a decision tree works by asking a series of questions about the variables; based on the answers, it arrives at an output (malignant or benign).

A random forest classifier uses the fundamentals of a decision tree, but it builds many trees and combines their answers, as the sketch below shows.
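Here’s a toy sketch with made-up data (not our data set) of that idea: the forest is literally a collection of decision trees, and the prediction is their majority vote:

#A toy sketch of a random forest on made-up data (not our data set)
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(20, 2)                               #20 fake "patients" with 2 made-up features
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)   #label 1 if the features add up past 1

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

print(len(forest.estimators_))       #10 -- the individual decision trees inside the forest
print(forest.predict([[0.9, 0.9]]))  #the trees vote; this point should come out as [1]
print(forest.predict([[0.1, 0.1]]))  #and this one should come out as [0]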

Now we want to create a function for the models (this one’s long, haha):

#Create a function for the models
def models(X_train, Y_train):

  #Logistic Regression
  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state=0)
  log.fit(X_train, Y_train)

  #Decision tree
  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy', random_state=0)
  tree.fit(X_train, Y_train)

  #Random Forest Classifier
  from sklearn.ensemble import RandomForestClassifier
  forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
  forest.fit(X_train, Y_train)

  #Print the models accuracy on the training data
  print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
  print('[1]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
  print('[2]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

  return log, tree, forest

*Logistic regression uses a set of independent variables to predict a binary outcome. You can read more about it in “block 2” of this AI article I wrote.

Now we want to get all the models:

# Getting all of the models
model = models(X_train, Y_train)

Next, we are going to test the models’ accuracy on the test data using a confusion matrix:

#test model accuracy on test data using a confusion matrix
from sklearn.metrics import confusion_matrix
for i in range( len(model) ):
  print('Model ', i)
  cm = confusion_matrix(Y_test, model[i].predict(X_test))
  TN = cm[0][0]  #actual 0 (benign), predicted 0
  FP = cm[0][1]  #actual 0, predicted 1
  FN = cm[1][0]  #actual 1 (malignant), predicted 0
  TP = cm[1][1]  #actual 1, predicted 1
  print(cm)
  print('Testing Accuracy = ', (TP + TN)/(TP + TN + FN + FP))
  print()

Further explained:

- cm is the variable we store the confusion matrix in: “cm = confusion_matrix(Y_test, model[i].predict(X_test))”. So any time we use “cm”, we are calling on the result of everything to the right of the equals sign.

What is a confusion matrix?

- A technique for summarizing the performance of a classification algorithm.

- The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

- Looks something like this:

ps. this is just an example from the internet

- “True positive/TP” — for correctly predicted event values.

- “False positive/FP” — for incorrectly predicted event values.

- “True negative/TN” — for correctly predicted no-event values.

- “False negative/FN” — for incorrectly predicted no-event values.

Output (shows us the testing accuracy of each of our three models)
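To make the matrix less abstract, here’s a toy example with made-up predictions (not our model’s output) showing how the four cells and the accuracy calculation fit together:

#A toy example of reading a confusion matrix (made-up predictions, not our model)
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]  #what the "patients" actually were
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]  #what a hypothetical model guessed

cm = confusion_matrix(y_true, y_pred)
print(cm)
#[[3 1]   row 0 = actually benign:    3 true negatives, 1 false positive
# [1 3]]  row 1 = actually malignant: 1 false negative, 3 true positives

accuracy = (cm[0][0] + cm[1][1]) / cm.sum()
print(accuracy)  #0.75 -- the diagonal (the correct answers) divided by everything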

Now we want to find another way to get metrics of the models, and see how well they work. To do that…

#show another way to get metrics of the models
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
for i in range( len(model) ):
  print('Model ', i)
  print( classification_report(Y_test, model[i].predict(X_test)) )
  print( accuracy_score(Y_test, model[i].predict(X_test)) )
  print()

The output shows us the accuracy of each model; when putting this project into practice, we can use the most accurate algorithm.

Further explained:

- precision: Out of everything the model predicted as positive, how much was actually positive. Precision = TP/(TP + FP)

- recall: Out of everything that was actually positive, how much the model predicted correctly. Recall = TP/(TP + FN)

- F1-score: The harmonic mean of precision and recall, a single number that balances the two. F1 Score = 2*(Recall * Precision) / (Recall + Precision)

- Support: The number of actual occurrences of the class in the specified dataset.

This article is great for deeper understanding :)
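Here’s a tiny worked example with made-up counts (not our model’s numbers) to show how those formulas connect:

#A worked toy example of precision, recall, and F1 (made-up counts)
TP, FP, FN = 50, 5, 3  #hypothetical counts taken from a confusion matrix

precision = TP / (TP + FP)                            #50/55, roughly 0.909
recall = TP / (TP + FN)                               #50/53, roughly 0.943
f1 = 2 * (recall * precision) / (recall + precision)  #roughly 0.926

print(round(precision, 3), round(recall, 3), round(f1, 3))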

Lastly, we want to print the predictions of the Random Forest Classifier model. RFC was the most accurate, with a testing accuracy of 96.5%.

#Print the prediction of the Random Forest Classifier Model
pred = model[2].predict(X_test)
print(pred)
print()
print(Y_test)

1 — Malignant

0 — Benign

By looking at our output, we can compare our model’s predictions (on top) to the actual results from our testing data.
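As a small optional extra (not part of the original tutorial), instead of eyeballing the two arrays you can have numpy count the matches for you:

#Optional: count how many test predictions match the actual diagnoses
import numpy as np

matches = np.sum(pred == Y_test)
print(matches, 'out of', len(Y_test), 'correct')
print('accuracy:', np.mean(pred == Y_test))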

And BAM! That’s it 👏!

Now imagine scaling a model like this to work in hospitals or labs! The possibilities with AI truly are endless, and lifesaving 🚀!

If you have any questions or concerns feel free to leave a comment :)

Credit to: ComputerScience on YouTube, in particular this tutorial.

Let’s Connect:

LinkedIn

Medium

Newsletter
