Logistic regression is a machine learning algorithm primarily used for binary classification. In linear regression we would use the equation $$ p(X) = \beta_0 + \beta_1 X $$
The problem is that these predictions are not sensible for classification, since the true probability must fall between 0 and 1, while a linear model can produce values outside that range. To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Logistic regression is named after the function used at its core, the logistic function:
$$ p(X)=\frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}} $$
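To get a feel for why this works, here is a minimal sketch (not part of the Titanic workflow below) that plots the logistic function; the helper name logistic is just for illustration. Whatever value the linear part takes, the output is squeezed into the (0, 1) range:

import numpy as np
import matplotlib.pyplot as plt

def logistic(z):
    # maps any real number z = beta_0 + beta_1 * X into the interval (0, 1)
    return np.exp(z) / (1 + np.exp(z))

z = np.linspace(-6, 6, 200)
plt.plot(z, logistic(z))
plt.xlabel('beta_0 + beta_1 * X')
plt.ylabel('p(X)')
plt.show()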
We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a binary classification: survived or deceased.
Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the Titanic data set; if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.
Import Libraries
Let’s import some libraries to get started!
Pandas and Numpy for easier analysis.
import pandas as pd
import numpy as np
Seaborn and Matplotlib for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The Data
Let’s start by reading in the titanic_train.csv file into a pandas dataframe.
train = pd.read_csv('titanic_train.csv')
train.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
Exploratory Data Analysis
Let’s begin some exploratory data analysis! We’ll start by checking out missing data!
Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Roughly 20 percent of the Age data is missing. The proportion of missing Age values is likely small enough for reasonable replacement with some form of imputation. Looking at the Cabin column, it looks like we are missing too much of that data to do anything useful with it at a basic level. We’ll probably drop this column later, or change it to another feature like “Cabin Known: 1 or 0”.
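If you prefer exact numbers over a heatmap, a quick (optional) way to quantify the missingness per column is:

# fraction of missing values in each column, most-missing first
train.isnull().mean().sort_values(ascending=False)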
Let’s continue on by visualizing some more of the data!
Countplot of people who survived based on their sex.
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Countplot of people who survived based on their Passenger class.
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
Distribution plot of passenger ages.
train['Age'].hist(bins=30,color='darkred',alpha=0.7)
Distribution plot of the fares paid by passengers.
train['Fare'].hist(color='green',bins=40,figsize=(8,4))
Data Cleaning
We want to fill in the missing age data instead of just dropping the rows with missing ages. One way to do this is by filling in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these per-class average age values to impute the missing Age values based on Pclass.
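If you would rather compute these values than read them off the boxplot, the per-class medians (and means) are a one-liner; the rounded figures used in the function below come from this kind of summary:

# typical Age within each passenger class
train.groupby('Pclass')['Age'].median()
train.groupby('Pclass')['Age'].mean()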
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
Check that heat map again!
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Let’s go ahead and drop the Cabin column and the rows where Embarked is NaN.
train.drop('Cabin',axis=1,inplace=True)
train.dropna(inplace=True)
Converting Categorical Features
We’ll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won’t be able to directly take in those features as inputs.
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
Great! Our data is ready for our model!
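As an optional sanity check before modelling, a quick look at the first few rows confirms that only numeric columns remain:

train.head()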
Building a Logistic Regression model
Let’s start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).
Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                    train['Survived'], test_size=0.30,
                                                    random_state=101)
Training and Predicting
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Let’s move on to evaluate our model!
Evaluation
We can check precision, recall, and f1-score using a classification report!
from sklearn.metrics import classification_report

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.81      0.93      0.86       163
          1       0.85      0.65      0.74       104

avg / total       0.82      0.82      0.81       267
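If you also want a single accuracy number or the raw confusion matrix, scikit-learn provides both; this is an optional extra, not part of the report above:

from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))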
This was a brief overview of how to use a logistic regression model with Python. I also demonstrated some useful methods to use while doing data cleaning. The notebook for this post can be found here on GitHub.
Thank You!