This is the continuation of my first post published here. Like the previous article, it will be a simple and easy-to-follow tutorial.
We will work with a housing dataset in which you have to predict the price of a house from the given features.
import os
os.listdir()

['.ipynb_checkpoints', 'housingData-Real.csv', 'Untitled.ipynb']
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
Reading the dataset
data = pd.read_csv("https://raw.githubusercontent.com/Afsaan/Data-Science/master/Linear%20Regression/housing/housingData-Real.csv")
df = pd.DataFrame(data)
Checking the info of the dataset
df.info()

RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
id               21613 non-null int64
date             21613 non-null object
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
First 5 rows of the dataset
df.head()

   id          date             price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat      long      sqft_living15  sqft_lot15
0  7129300520  20141013T000000  221900.0  3         1.00       1180         5650      1.0     0           0     ...  7      1180        0              1955      0             98178    47.5112  -122.257  1340           5650
1  6414100192  20141209T000000  538000.0  3         2.25       2570         7242      2.0     0           0     ...  7      2170        400            1951      1991          98125    47.7210  -122.319  1690           7639
2  5631500400  20150225T000000  180000.0  2         1.00       770          10000     1.0     0           0     ...  6      770         0              1933      0             98028    47.7379  -122.233  2720           8062
3  2487200875  20141209T000000  604000.0  4         3.00       1960         5000      1.0     0           0     ...  7      1050        910            1965      0             98136    47.5208  -122.393  1360           5000
4  1954400510  20150218T000000  510000.0  3         2.00       1680         8080      1.0     0           0     ...  8      1680        0              1987      0             98074    47.6168  -122.045  1800           750
Dropping the unwanted columns
df.drop(['id', 'date'], axis=1, inplace=True)
Univariate Linear Regression
Finding the best-fit model with only one independent variable and using it to predict the target (i.e., univariate linear regression)
Selecting the column sqft_living
X = df.sqft_living
Y = df.price
Converting into a 2-D array, since scikit-learn expects the feature matrix to have shape (n_samples, n_features)
X = np.array(X).reshape(-1, 1)
Y = np.array(Y).reshape(-1, 1)
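To see what the reshape is doing, here is a minimal sketch on a few hypothetical sqft_living values (not the real dataset): `-1` tells NumPy to infer the number of rows, leaving a single feature column.

```python
import numpy as np

x = np.array([1180, 2570, 770])  # hypothetical sqft_living values
print(x.shape)        # (3,)  -- 1-D, not accepted as a feature matrix

x2 = x.reshape(-1, 1)  # -1 lets NumPy infer the row count
print(x2.shape)        # (3, 1) -- 3 samples, 1 feature
```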
Splitting into training and testing dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=101)
model1 = LinearRegression()
model1.fit(X_train, Y_train)
Predicting the values of Y on the test set
Y_pred = model1.predict(X_test)
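Under the hood, a fitted univariate model is just a line: prediction = coef_ * x + intercept_. A small sketch on hypothetical toy data (standing in for sqft_living vs. price) shows that `predict` and the line equation agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, not the real housing dataset
X_toy = np.array([[1000], [1500], [2000], [2500]])
y_toy = np.array([200000.0, 290000.0, 410000.0, 500000.0])

m = LinearRegression().fit(X_toy, y_toy)
pred = m.predict(np.array([[1800]]))[0]      # model prediction
manual = m.coef_[0] * 1800 + m.intercept_    # same thing by hand
print(np.isclose(pred, manual))  # True
```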
Evaluation metric to check how close the predicted values are to the actual values
a = r2_score(Y_test, Y_pred)
a

0.5185480212648037
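The r2 score is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. A quick sketch on hypothetical numbers, checking the manual formula against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```

An r2 of 1.0 means perfect prediction; 0.0 means the model does no better than always predicting the mean.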
Multiple Regression
Splitting into training and testing dataset
x = df.drop(['price'], axis=1)
y = df.price
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=101)
models = LinearRegression()
model = models.fit(x_train, y_train)
y_predict = models.predict(x_test)
b = r2_score(y_test, y_predict)
b

0.7097583909083975
print("r2 score of the Univariate linear Regression is : {}".format(a))
print("r2 score of the Multiple linear Regression is : {}".format(b))

r2 score of the Univariate linear Regression is : 0.5185480212648037
r2 score of the Multiple linear Regression is : 0.7097583909083975
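We imported `mean_squared_error` at the start but never used it; it is another common regression metric. A small sketch on hypothetical values, with RMSE taken as the square root so the error is back in the target's units (here, dollars):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # same units as the target
print(mse, rmse)
```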
As you can see, the r2 score of the multiple regression is higher than that of the univariate linear regression, since it uses all the features rather than sqft_living alone.
If you want to see more examples like this, visit my GitHub.