Supervised machine learning — Binary logistic regression overview

Logic20/20
3 min read · Jul 15, 2020

Article by: Olga Berezovsky

Predictive modeling (or machine learning) can be challenging to understand and use because no single algorithm works best for every problem. You therefore have to apply different methods to your prediction task, evaluate their performance, and only then select the strongest algorithm. In this article, I will go over some use cases for logistic regression and walk through the steps of applying this classification algorithm to data forecasting.

Factors to consider before choosing the algorithm

Before applying a specific type of algorithm, you have to consider and evaluate factors such as parametrization, memory size, tendency to overfit, training time, and prediction time. You choose classification models over regression models when your target variable is categorical rather than numerical. You would therefore apply logistic regression for the following types of predictions:

  • Will a student pass or fail the class?
  • Is an email spam or not?
  • Which samples are false or true?
  • What is a person’s sex based on their height input?

All of these are probability predictions, but for logistic regression their outcomes have to be encoded as a binary value of 0 or 1 (you can read more about logistic regression here).
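To make this concrete, here is a minimal sketch with made-up pass/fail data (not from the article): the labels are encoded as 0 and 1, the fitted model returns a probability, and a 0.5 threshold turns that probability into a class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical example: predict pass (1) / fail (0) from hours studied.
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # outcome already encoded as 0/1

model = LogisticRegression()
model.fit(hours, passed)

proba = model.predict_proba([[4.5]])[0, 1]  # probability of passing
label = int(proba >= 0.5)                   # threshold into a binary value
print(f"P(pass) = {proba:.2f} -> predicted class {label}")
```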

Predictive modeling steps

I’ll walk through predictive modeling using the Kaggle Titanic challenge. It provides two datasets with real Titanic passenger information such as name, age, gender, social class, and fare. One dataset, “Training,” includes a binary (yes or no) survival label for 891 passengers. The other, “Testing,” contains the passengers whose survival we have to predict. We can use the training set to teach our prediction model the patterns in the data, and then use the testing set to evaluate its performance. I’ll use Python’s pandas library, Seaborn for statistical graphics, and the scikit-learn machine learning package for analysis and modeling.

1. Problem definition

We begin by defining our problem and understanding what our target variable is and what features we have available. In our case, we want to predict which Titanic passengers survived.

2. Exploratory data analysis

Next comes exploratory data analysis (EDA) to understand the dataset, its patterns, and which features we will use for logistic regression:
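A sketch of how this first look might work, assuming the standard Kaggle Titanic files train.csv and test.csv (the article shows its own screenshots):

```python
import pandas as pd

# Load the two datasets provided by the challenge.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)  # (891, 12) -- includes the "Survived" target column
print(test.shape)   # (418, 11) -- same features, without the target
```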

This tells us how big our datasets are, and we can see that the Train set has one more column: our target “Survived” value. Now, let’s take a look at the gender data broken down by survival category:
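Continuing with the train DataFrame loaded above, one possible way to produce this breakdown is:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count of survivors vs. non-survivors for each gender.
sns.countplot(data=train, x="Sex", hue="Survived")
plt.show()

# The same comparison as survival rates per gender.
print(train.groupby("Sex")["Survived"].mean())
```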

We can see that women were more likely to survive. We have to run more analysis for other input features such as age, social class, and family size. You can see the full EDA here.

3. Feature engineering

Before we proceed with modeling, some feature engineering has to be done to prepare our data for prediction. We start by transforming categorical values into numerical ones:
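One possible version of this step, continuing with the same train DataFrame (the mapping in the article’s screenshot may differ):

```python
# Encode gender and port of embarkation as numbers.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
```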

Next, we fill in missing values for features such as Age and Embarked:
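For example, Age can be filled with the median age and Embarked with its most frequent value; the article may use a different strategy:

```python
# Replace missing ages with the median and missing ports with the mode.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
```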

The last step is to drop the columns we won’t be using for our prediction:
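Identifier and free-text columns are typical candidates to drop; the exact list in the original article may differ:

```python
# Remove columns that won't feed into the model.
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
```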

To view the final two steps, prediction and model evaluation, please read through the full article here: https://www.logic2020.com/insight/tactical/supervised-machine-learning-binary-logistic-regression?utm_source=social&utm_medium=Medium&utm_campaign=Binary_Logistic_Regression


Written by Logic20/20

Enabling clarity through business and technology solutions.
