{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Finding Best ML Algorithm for House Price Prediction using k Cross Validation and GridSearchCV. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In machine learning, we couldn’t fit the model on the training data and can’t say that the model will work accurately for the real data. For this, we must assure that our model got the correct patterns from the data, and it is not getting up too much noise. For this purpose, we use the cross-validation technique.Cross-validation is a technique in which we train our model using the subset of the data-set and then evaluate using the complementary subset of the data-set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset is downloaded from here: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from matplotlib import pyplot as plt\n",
"%matplotlib inline \n",
"import matplotlib\n",
"matplotlib.rcParams[\"figure.figsize\"]=(20,10)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.read_csv(\"Bengaluru_House_Data.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>area_type</th>\n",
" <th>availability</th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>society</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>balcony</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Super built-up Area</td>\n",
" <td>19-Dec</td>\n",
" <td>Electronic City Phase II</td>\n",
" <td>2 BHK</td>\n",
" <td>Coomee</td>\n",
" <td>1056</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>39.07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Plot Area</td>\n",
" <td>Ready To Move</td>\n",
" <td>Chikka Tirupathi</td>\n",
" <td>4 Bedroom</td>\n",
" <td>Theanmp</td>\n",
" <td>2600</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>120.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Built-up Area</td>\n",
" <td>Ready To Move</td>\n",
" <td>Uttarahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>NaN</td>\n",
" <td>1440</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>62.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Super built-up Area</td>\n",
" <td>Ready To Move</td>\n",
" <td>Lingadheeranahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>Soiewre</td>\n",
" <td>1521</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>95.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Super built-up Area</td>\n",
" <td>Ready To Move</td>\n",
" <td>Kothanur</td>\n",
" <td>2 BHK</td>\n",
" <td>NaN</td>\n",
" <td>1200</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>51.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" area_type availability location size \\\n",
"0 Super built-up Area 19-Dec Electronic City Phase II 2 BHK \n",
"1 Plot Area Ready To Move Chikka Tirupathi 4 Bedroom \n",
"2 Built-up Area Ready To Move Uttarahalli 3 BHK \n",
"3 Super built-up Area Ready To Move Lingadheeranahalli 3 BHK \n",
"4 Super built-up Area Ready To Move Kothanur 2 BHK \n",
"\n",
" society total_sqft bath balcony price \n",
"0 Coomee 1056 2.0 1.0 39.07 \n",
"1 Theanmp 2600 5.0 3.0 120.00 \n",
"2 NaN 1440 2.0 3.0 62.00 \n",
"3 Soiewre 1521 3.0 1.0 95.00 \n",
"4 NaN 1200 2.0 1.0 51.00 "
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"area_type\n",
"Built-up Area 2418\n",
"Carpet Area 87\n",
"Plot Area 2025\n",
"Super built-up Area 8790\n",
"Name: area_type, dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.groupby('area_type')['area_type'].agg('count')"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"df2 = df1.drop(['area_type','society','balcony','availability'] , axis=\"columns\")"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Electronic City Phase II</td>\n",
" <td>2 BHK</td>\n",
" <td>1056</td>\n",
" <td>2.0</td>\n",
" <td>39.07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Chikka Tirupathi</td>\n",
" <td>4 Bedroom</td>\n",
" <td>2600</td>\n",
" <td>5.0</td>\n",
" <td>120.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Uttarahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>1440</td>\n",
" <td>2.0</td>\n",
" <td>62.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Lingadheeranahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>1521</td>\n",
" <td>3.0</td>\n",
" <td>95.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Kothanur</td>\n",
" <td>2 BHK</td>\n",
" <td>1200</td>\n",
" <td>2.0</td>\n",
" <td>51.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price\n",
"0 Electronic City Phase II 2 BHK 1056 2.0 39.07\n",
"1 Chikka Tirupathi 4 Bedroom 2600 5.0 120.00\n",
"2 Uttarahalli 3 BHK 1440 2.0 62.00\n",
"3 Lingadheeranahalli 3 BHK 1521 3.0 95.00\n",
"4 Kothanur 2 BHK 1200 2.0 51.00"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Cleaning: Handling NA/Null values"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"location 1\n",
"size 16\n",
"total_sqft 0\n",
"bath 73\n",
"price 0\n",
"dtype: int64"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"location 0\n",
"size 0\n",
"total_sqft 0\n",
"bath 0\n",
"price 0\n",
"dtype: int64"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = df2.dropna()\n",
"df3.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',\n",
" '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',\n",
" '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',\n",
" '9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',\n",
" '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',\n",
" '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['size'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself "
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<ipython-input-81-4c4c73fbe7f4>:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))\n"
]
}
],
"source": [
"df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2, 4, 3, 6, 1, 8, 7, 5, 11, 9, 27, 10, 19, 16, 43, 14, 12,\n",
" 13, 18], dtype=int64)"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3['bhk'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1718</th>\n",
" <td>2Electronic City Phase II</td>\n",
" <td>27 BHK</td>\n",
" <td>8000</td>\n",
" <td>27.0</td>\n",
" <td>230.0</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4684</th>\n",
" <td>Munnekollal</td>\n",
" <td>43 Bedroom</td>\n",
" <td>2400</td>\n",
" <td>40.0</td>\n",
" <td>660.0</td>\n",
" <td>43</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk\n",
"1718 2Electronic City Phase II 27 BHK 8000 27.0 230.0 27\n",
"4684 Munnekollal 43 Bedroom 2400 40.0 660.0 43"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[df3.bhk>20]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],\n",
" dtype=object)"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3.total_sqft.unique()"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"def is_float(x):\n",
" try:\n",
" float(x)\n",
" except:\n",
" return False\n",
" return True"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>Yelahanka</td>\n",
" <td>4 BHK</td>\n",
" <td>2100 - 2850</td>\n",
" <td>4.0</td>\n",
" <td>186.000</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122</th>\n",
" <td>Hebbal</td>\n",
" <td>4 BHK</td>\n",
" <td>3067 - 8156</td>\n",
" <td>4.0</td>\n",
" <td>477.000</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>137</th>\n",
" <td>8th Phase JP Nagar</td>\n",
" <td>2 BHK</td>\n",
" <td>1042 - 1105</td>\n",
" <td>2.0</td>\n",
" <td>54.005</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165</th>\n",
" <td>Sarjapur</td>\n",
" <td>2 BHK</td>\n",
" <td>1145 - 1340</td>\n",
" <td>2.0</td>\n",
" <td>43.490</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>188</th>\n",
" <td>KR Puram</td>\n",
" <td>2 BHK</td>\n",
" <td>1015 - 1540</td>\n",
" <td>2.0</td>\n",
" <td>56.800</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>410</th>\n",
" <td>Kengeri</td>\n",
" <td>1 BHK</td>\n",
" <td>34.46Sq. Meter</td>\n",
" <td>1.0</td>\n",
" <td>18.500</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>549</th>\n",
" <td>Hennur Road</td>\n",
" <td>2 BHK</td>\n",
" <td>1195 - 1440</td>\n",
" <td>2.0</td>\n",
" <td>63.770</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>648</th>\n",
" <td>Arekere</td>\n",
" <td>9 Bedroom</td>\n",
" <td>4125Perch</td>\n",
" <td>9.0</td>\n",
" <td>265.000</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>661</th>\n",
" <td>Yelahanka</td>\n",
" <td>2 BHK</td>\n",
" <td>1120 - 1145</td>\n",
" <td>2.0</td>\n",
" <td>48.130</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>672</th>\n",
" <td>Bettahalsoor</td>\n",
" <td>4 Bedroom</td>\n",
" <td>3090 - 5002</td>\n",
" <td>4.0</td>\n",
" <td>445.000</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk\n",
"30 Yelahanka 4 BHK 2100 - 2850 4.0 186.000 4\n",
"122 Hebbal 4 BHK 3067 - 8156 4.0 477.000 4\n",
"137 8th Phase JP Nagar 2 BHK 1042 - 1105 2.0 54.005 2\n",
"165 Sarjapur 2 BHK 1145 - 1340 2.0 43.490 2\n",
"188 KR Puram 2 BHK 1015 - 1540 2.0 56.800 2\n",
"410 Kengeri 1 BHK 34.46Sq. Meter 1.0 18.500 1\n",
"549 Hennur Road 2 BHK 1195 - 1440 2.0 63.770 2\n",
"648 Arekere 9 Bedroom 4125Perch 9.0 265.000 9\n",
"661 Yelahanka 2 BHK 1120 - 1145 2.0 48.130 2\n",
"672 Bettahalsoor 4 Bedroom 3090 - 5002 4.0 445.000 4"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[~df3['total_sqft'].apply(is_float)].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Above shows that total_sqft can be a range (e.g. 2100-2850). For such case we can just take average of min and max value in the range. There are other cases such as 34.46Sq. Meter which one can convert to square ft using unit conversion. I am going to just drop such corner cases to keep things simple "
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"def convert_sqft_to_num(x):\n",
" tokens = x.split('-')\n",
" if len(tokens) == 2:\n",
" return (float(tokens[0]) + float(tokens[1]))/2\n",
" try:\n",
" return float(x)\n",
" except:\n",
" return None\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"df4 = df3.copy()\n",
"df4['total_sqft'] = df4['total_sqft'].apply(convert_sqft_to_num)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"df5 = df4.copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Add new feature called price per square feet"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1304"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Examine locations which is a categorical variable. We need to apply dimensionality reduction technique here to reduce number of locations"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"df5.location = df5.location.apply(lambda x: x.strip())"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"location\n",
"Whitefield 535\n",
"Sarjapur Road 392\n",
"Electronic City 304\n",
"Kanakpura Road 266\n",
"Thanisandra 236\n",
" ... \n",
"LIC Colony 1\n",
"Kuvempu Layout 1\n",
"Kumbhena Agrahara 1\n",
"Kudlu Village, 1\n",
"1 Annasandrapalya 1\n",
"Name: location, Length: 1293, dtype: int64"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"location_stats"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1052"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(location_stats[location_stats<=10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dimensionality Reduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Any location having less than 10 data points should be tagged as \"other\" location. This way number of categories can be reduced by huge amount. Later on when we do one hot encoding, it will help us with having fewer dummy columns"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"location_stats_less_10 = location_stats[location_stats<=10]"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"location\n",
"BTM 1st Stage 10\n",
"Basapura 10\n",
"Sector 1 HSR Layout 10\n",
"Naganathapura 10\n",
"Kalkere 10\n",
" ..\n",
"LIC Colony 1\n",
"Kuvempu Layout 1\n",
"Kumbhena Agrahara 1\n",
"Kudlu Village, 1\n",
"1 Annasandrapalya 1\n",
"Name: location, Length: 1052, dtype: int64"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"location_stats_less_10"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1293"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_10 else x)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"242"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df5.location.unique())"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>price_per_sqft</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Electronic City Phase II</td>\n",
" <td>2 BHK</td>\n",
" <td>1056.0</td>\n",
" <td>2.0</td>\n",
" <td>39.07</td>\n",
" <td>2</td>\n",
" <td>3699.810606</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Chikka Tirupathi</td>\n",
" <td>4 Bedroom</td>\n",
" <td>2600.0</td>\n",
" <td>5.0</td>\n",
" <td>120.00</td>\n",
" <td>4</td>\n",
" <td>4615.384615</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Uttarahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>1440.0</td>\n",
" <td>2.0</td>\n",
" <td>62.00</td>\n",
" <td>3</td>\n",
" <td>4305.555556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Lingadheeranahalli</td>\n",
" <td>3 BHK</td>\n",
" <td>1521.0</td>\n",
" <td>3.0</td>\n",
" <td>95.00</td>\n",
" <td>3</td>\n",
" <td>6245.890861</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Kothanur</td>\n",
" <td>2 BHK</td>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>51.00</td>\n",
" <td>2</td>\n",
" <td>4250.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk \\\n",
"0 Electronic City Phase II 2 BHK 1056.0 2.0 39.07 2 \n",
"1 Chikka Tirupathi 4 Bedroom 2600.0 5.0 120.00 4 \n",
"2 Uttarahalli 3 BHK 1440.0 2.0 62.00 3 \n",
"3 Lingadheeranahalli 3 BHK 1521.0 3.0 95.00 3 \n",
"4 Kothanur 2 BHK 1200.0 2.0 51.00 2 \n",
"\n",
" price_per_sqft \n",
"0 3699.810606 \n",
"1 4615.384615 \n",
"2 4305.555556 \n",
"3 6245.890861 \n",
"4 4250.000000 "
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outlier Removal Using Business Logic"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>price_per_sqft</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>other</td>\n",
" <td>6 Bedroom</td>\n",
" <td>1020.0</td>\n",
" <td>6.0</td>\n",
" <td>370.0</td>\n",
" <td>6</td>\n",
" <td>36274.509804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>HSR Layout</td>\n",
" <td>8 Bedroom</td>\n",
" <td>600.0</td>\n",
" <td>9.0</td>\n",
" <td>200.0</td>\n",
" <td>8</td>\n",
" <td>33333.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Murugeshpalya</td>\n",
" <td>6 Bedroom</td>\n",
" <td>1407.0</td>\n",
" <td>4.0</td>\n",
" <td>150.0</td>\n",
" <td>6</td>\n",
" <td>10660.980810</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>Devarachikkanahalli</td>\n",
" <td>8 Bedroom</td>\n",
" <td>1350.0</td>\n",
" <td>7.0</td>\n",
" <td>85.0</td>\n",
" <td>8</td>\n",
" <td>6296.296296</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>other</td>\n",
" <td>3 Bedroom</td>\n",
" <td>500.0</td>\n",
" <td>3.0</td>\n",
" <td>100.0</td>\n",
" <td>3</td>\n",
" <td>20000.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk \\\n",
"9 other 6 Bedroom 1020.0 6.0 370.0 6 \n",
"45 HSR Layout 8 Bedroom 600.0 9.0 200.0 8 \n",
"58 Murugeshpalya 6 Bedroom 1407.0 4.0 150.0 6 \n",
"68 Devarachikkanahalli 8 Bedroom 1350.0 7.0 85.0 8 \n",
"70 other 3 Bedroom 500.0 3.0 100.0 3 \n",
"\n",
" price_per_sqft \n",
"9 36274.509804 \n",
"45 33333.333333 \n",
"58 10660.980810 \n",
"68 6296.296296 \n",
"70 20000.000000 "
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5[df5.total_sqft/df5.bhk<300].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Check above data points. We have 6 bhk apartment with 1020 sqft. Another one is 8 bhk and total sqft is 600. These are clear data errors that can be removed safely"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(13246, 7)"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5.shape"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"df6 = df5[~(df5.total_sqft/df5.bhk<300)]"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12502, 7)"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outlier Removal Using Standard Deviation and Mean"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 12456.000000\n",
"mean 6308.502826\n",
"std 4168.127339\n",
"min 267.829813\n",
"25% 4210.526316\n",
"50% 5294.117647\n",
"75% 6916.666667\n",
"max 176470.588235\n",
"Name: price_per_sqft, dtype: float64"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6.price_per_sqft.describe()"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"def remove_pps_outliers(df):\n",
" df_out = pd.DataFrame()\n",
" for key, subdf in df.groupby('location'):\n",
" m = np.mean(subdf.price_per_sqft)\n",
" st = np.std(subdf.price_per_sqft)\n",
" reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]\n",
" df_out = pd.concat([df_out,reduced_df],ignore_index=True)\n",
" return df_out "
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"df7 = remove_pps_outliers(df6)"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10241, 7)"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7.shape"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def plot_scatter_chart(df,location):\n",
" bhk2 = df[(df.location==location) & (df.bhk==2)]\n",
" bhk3 = df[(df.location==location) & (df.bhk==3)]\n",
" matplotlib.rcParams['figure.figsize'] = (15,10)\n",
" plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)\n",
" plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)\n",
" plt.xlabel(\"Total Square Feet Area\")\n",
" plt.ylabel(\"Price (Lakh Indian Rupees)\")\n",
" plt.title(location)\n",
" plt.legend()\n",
" \n",
"plot_scatter_chart(df7,\"Rajaji Nagar\")"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_scatter_chart(df7,\"Hebbal\")"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"def remove_bhk_outliers(df):\n",
" exclude_indices = np.array([])\n",
" for location,location_df in df.groupby('location'):\n",
" bhk_stats = {}\n",
" for bhk,bhk_df in location_df.groupby('bhk'):\n",
" bhk_stats[bhk] = {\n",
" 'mean':np.mean(bhk_df.price_per_sqft),\n",
" 'std':np.std(bhk_df.price_per_sqft),\n",
" 'count':bhk_df.shape[0]\n",
" }\n",
" for bhk,bhk_df in location_df.groupby(\"bhk\"):\n",
" stats = bhk_stats.get(bhk-1)\n",
" if stats and stats['count']>5:\n",
" exclude_indices = np.append(exclude_indices,bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)\n",
" return df.drop(exclude_indices,axis='index') "
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"df8 = remove_bhk_outliers(df7)"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7329, 7)"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outlier Removal Using Bathrooms Feature"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 4., 3., 2., 5., 8., 1., 6., 7., 9., 12., 16., 13.])"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8.bath.unique()"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Count')"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.hist(df8.bath,rwidth=0.8)\n",
"plt.xlabel(\"Number of bathrooms\")\n",
"plt.ylabel(\"Count\")"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>price_per_sqft</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5277</th>\n",
" <td>Neeladri Nagar</td>\n",
" <td>10 BHK</td>\n",
" <td>4000.0</td>\n",
" <td>12.0</td>\n",
" <td>160.0</td>\n",
" <td>10</td>\n",
" <td>4000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8486</th>\n",
" <td>other</td>\n",
" <td>10 BHK</td>\n",
" <td>12000.0</td>\n",
" <td>12.0</td>\n",
" <td>525.0</td>\n",
" <td>10</td>\n",
" <td>4375.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8575</th>\n",
" <td>other</td>\n",
" <td>16 BHK</td>\n",
" <td>10000.0</td>\n",
" <td>16.0</td>\n",
" <td>550.0</td>\n",
" <td>16</td>\n",
" <td>5500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9308</th>\n",
" <td>other</td>\n",
" <td>11 BHK</td>\n",
" <td>6000.0</td>\n",
" <td>12.0</td>\n",
" <td>150.0</td>\n",
" <td>11</td>\n",
" <td>2500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9639</th>\n",
" <td>other</td>\n",
" <td>13 BHK</td>\n",
" <td>5425.0</td>\n",
" <td>13.0</td>\n",
" <td>275.0</td>\n",
" <td>13</td>\n",
" <td>5069.124424</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk price_per_sqft\n",
"5277 Neeladri Nagar 10 BHK 4000.0 12.0 160.0 10 4000.000000\n",
"8486 other 10 BHK 12000.0 12.0 525.0 10 4375.000000\n",
"8575 other 16 BHK 10000.0 16.0 550.0 16 5500.000000\n",
"9308 other 11 BHK 6000.0 12.0 150.0 11 2500.000000\n",
"9639 other 13 BHK 5425.0 13.0 275.0 13 5069.124424"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8[df8.bath>10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### It is unusual to have 2 more bathrooms than number of bedrooms in a home"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>price_per_sqft</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1626</th>\n",
" <td>Chikkabanavar</td>\n",
" <td>4 Bedroom</td>\n",
" <td>2460.0</td>\n",
" <td>7.0</td>\n",
" <td>80.0</td>\n",
" <td>4</td>\n",
" <td>3252.032520</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5238</th>\n",
" <td>Nagasandra</td>\n",
" <td>4 Bedroom</td>\n",
" <td>7000.0</td>\n",
" <td>8.0</td>\n",
" <td>450.0</td>\n",
" <td>4</td>\n",
" <td>6428.571429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6711</th>\n",
" <td>Thanisandra</td>\n",
" <td>3 BHK</td>\n",
" <td>1806.0</td>\n",
" <td>6.0</td>\n",
" <td>116.0</td>\n",
" <td>3</td>\n",
" <td>6423.034330</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8411</th>\n",
" <td>other</td>\n",
" <td>6 BHK</td>\n",
" <td>11338.0</td>\n",
" <td>9.0</td>\n",
" <td>1000.0</td>\n",
" <td>6</td>\n",
" <td>8819.897689</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk price_per_sqft\n",
"1626 Chikkabanavar 4 Bedroom 2460.0 7.0 80.0 4 3252.032520\n",
"5238 Nagasandra 4 Bedroom 7000.0 8.0 450.0 4 6428.571429\n",
"6711 Thanisandra 3 BHK 1806.0 6.0 116.0 3 6423.034330\n",
"8411 other 6 BHK 11338.0 9.0 1000.0 6 8819.897689"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8[df8.bath>df8.bhk + 2]"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"df9 = df8[df8.bath < df8.bhk + 2]"
]
},
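{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the two comparisons used above behave, on a hypothetical toy frame (the `toy` data and the names `shown` and `kept` are illustrative assumptions, not part of the dataset). The display filter uses `>` while the cleaning filter keeps `<`, so a row with exactly `bhk + 2` bathrooms is not listed as an outlier but is still dropped.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: four 2-bedroom homes with 2 to 5 bathrooms\n",
"toy = pd.DataFrame({'bhk': [2, 2, 2, 2], 'bath': [2.0, 3.0, 4.0, 5.0]})\n",
"\n",
"# Rows listed as outliers: strictly more than bhk + 2 bathrooms\n",
"shown = toy[toy.bath > toy.bhk + 2]\n",
"\n",
"# Rows kept after cleaning: strictly fewer than bhk + 2 bathrooms\n",
"kept = toy[toy.bath < toy.bhk + 2]\n",
"\n",
"print(len(shown), len(kept))\n",
"```\n",
"\n",
"The row with 4 bathrooms falls in neither frame, which is why the cleaned data can lose more rows than the outlier listing shows."
]
},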
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7251, 7)"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df9.shape"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>size</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>price_per_sqft</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>4 BHK</td>\n",
" <td>2850.0</td>\n",
" <td>4.0</td>\n",
" <td>428.0</td>\n",
" <td>4</td>\n",
" <td>15017.543860</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>3 BHK</td>\n",
" <td>1630.0</td>\n",
" <td>3.0</td>\n",
" <td>194.0</td>\n",
" <td>3</td>\n",
" <td>11901.840491</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>3 BHK</td>\n",
" <td>1875.0</td>\n",
" <td>2.0</td>\n",
" <td>235.0</td>\n",
" <td>3</td>\n",
" <td>12533.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>3 BHK</td>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>130.0</td>\n",
" <td>3</td>\n",
" <td>10833.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>2 BHK</td>\n",
" <td>1235.0</td>\n",
" <td>2.0</td>\n",
" <td>148.0</td>\n",
" <td>2</td>\n",
" <td>11983.805668</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10232</th>\n",
" <td>other</td>\n",
" <td>2 BHK</td>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>70.0</td>\n",
" <td>2</td>\n",
" <td>5833.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10233</th>\n",
" <td>other</td>\n",
" <td>1 BHK</td>\n",
" <td>1800.0</td>\n",
" <td>1.0</td>\n",
" <td>200.0</td>\n",
" <td>1</td>\n",
" <td>11111.111111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10236</th>\n",
" <td>other</td>\n",
" <td>2 BHK</td>\n",
" <td>1353.0</td>\n",
" <td>2.0</td>\n",
" <td>110.0</td>\n",
" <td>2</td>\n",
" <td>8130.081301</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10237</th>\n",
" <td>other</td>\n",
" <td>1 Bedroom</td>\n",
" <td>812.0</td>\n",
" <td>1.0</td>\n",
" <td>26.0</td>\n",
" <td>1</td>\n",
" <td>3201.970443</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10240</th>\n",
" <td>other</td>\n",
" <td>4 BHK</td>\n",
" <td>3600.0</td>\n",
" <td>5.0</td>\n",
" <td>400.0</td>\n",
" <td>4</td>\n",
" <td>11111.111111</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>7251 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" location size total_sqft bath price bhk \\\n",
"0 1st Block Jayanagar 4 BHK 2850.0 4.0 428.0 4 \n",
"1 1st Block Jayanagar 3 BHK 1630.0 3.0 194.0 3 \n",
"2 1st Block Jayanagar 3 BHK 1875.0 2.0 235.0 3 \n",
"3 1st Block Jayanagar 3 BHK 1200.0 2.0 130.0 3 \n",
"4 1st Block Jayanagar 2 BHK 1235.0 2.0 148.0 2 \n",
"... ... ... ... ... ... ... \n",
"10232 other 2 BHK 1200.0 2.0 70.0 2 \n",
"10233 other 1 BHK 1800.0 1.0 200.0 1 \n",
"10236 other 2 BHK 1353.0 2.0 110.0 2 \n",
"10237 other 1 Bedroom 812.0 1.0 26.0 1 \n",
"10240 other 4 BHK 3600.0 5.0 400.0 4 \n",
"\n",
" price_per_sqft \n",
"0 15017.543860 \n",
"1 11901.840491 \n",
"2 12533.333333 \n",
"3 10833.333333 \n",
"4 11983.805668 \n",
"... ... \n",
"10232 5833.333333 \n",
"10233 11111.111111 \n",
"10236 8130.081301 \n",
"10237 3201.970443 \n",
"10240 11111.111111 \n",
"\n",
"[7251 rows x 7 columns]"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df9"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"df10 = df9.drop(['size','price_per_sqft'],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>2850.0</td>\n",
" <td>4.0</td>\n",
" <td>428.0</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1630.0</td>\n",
" <td>3.0</td>\n",
" <td>194.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1875.0</td>\n",
" <td>2.0</td>\n",
" <td>235.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>130.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1235.0</td>\n",
" <td>2.0</td>\n",
" <td>148.0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location total_sqft bath price bhk\n",
"0 1st Block Jayanagar 2850.0 4.0 428.0 4\n",
"1 1st Block Jayanagar 1630.0 3.0 194.0 3\n",
"2 1st Block Jayanagar 1875.0 2.0 235.0 3\n",
"3 1st Block Jayanagar 1200.0 2.0 130.0 3\n",
"4 1st Block Jayanagar 1235.0 2.0 148.0 2"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df10.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using One Hot Encoding For Location"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###### For categorical variables with no ordinal relationship, such as location, integer encoding is not enough. Using it lets the model assume a natural ordering between categories, which may result in poor performance or unexpected results (predictions halfway between categories). In this case, one-hot encoding can be applied: the categorical variable is removed and a new binary variable is added for each unique category."
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>1st Block Jayanagar</th>\n",
" <th>1st Phase JP Nagar</th>\n",
" <th>2nd Phase Judicial Layout</th>\n",
" <th>2nd Stage Nagarbhavi</th>\n",
" <th>5th Block Hbr Layout</th>\n",
" <th>5th Phase JP Nagar</th>\n",
" <th>6th Phase JP Nagar</th>\n",
" <th>7th Phase JP Nagar</th>\n",
" <th>8th Phase JP Nagar</th>\n",
" <th>9th Phase JP Nagar</th>\n",
" <th>...</th>\n",
" <th>Vishveshwarya Layout</th>\n",
" <th>Vishwapriya Layout</th>\n",
" <th>Vittasandra</th>\n",
" <th>Whitefield</th>\n",
" <th>Yelachenahalli</th>\n",
" <th>Yelahanka</th>\n",
" <th>Yelahanka New Town</th>\n",
" <th>Yelenahalli</th>\n",
" <th>Yeshwanthpur</th>\n",
" <th>other</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 242 columns</p>\n",
"</div>"
],
"text/plain": [
" 1st Block Jayanagar 1st Phase JP Nagar 2nd Phase Judicial Layout \\\n",
"0 1 0 0 \n",
"1 1 0 0 \n",
"2 1 0 0 \n",
"3 1 0 0 \n",
"4 1 0 0 \n",
"\n",
" 2nd Stage Nagarbhavi 5th Block Hbr Layout 5th Phase JP Nagar \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" 6th Phase JP Nagar 7th Phase JP Nagar 8th Phase JP Nagar \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" 9th Phase JP Nagar ... Vishveshwarya Layout Vishwapriya Layout \\\n",
"0 0 ... 0 0 \n",
"1 0 ... 0 0 \n",
"2 0 ... 0 0 \n",
"3 0 ... 0 0 \n",
"4 0 ... 0 0 \n",
"\n",
" Vittasandra Whitefield Yelachenahalli Yelahanka Yelahanka New Town \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
" Yelenahalli Yeshwanthpur other \n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
"[5 rows x 242 columns]"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dummies = pd.get_dummies(df10.location)\n",
"dummies.head()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"df11 = pd.concat([df10,dummies],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"df11 = df11.drop(['other'],axis = 'columns')"
]
},
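{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding and the column drop above can be sketched on a toy frame. The data and the location names `A` and `B` are made up for illustration. Dropping one dummy column (here `other`) makes that category the all-zeros baseline and avoids the perfect collinearity that a full set of dummies has with the intercept.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; the location names are made up for illustration\n",
"toy = pd.DataFrame({\n",
"    'location': ['A', 'B', 'other', 'A'],\n",
"    'total_sqft': [1000.0, 1200.0, 900.0, 1500.0],\n",
"})\n",
"\n",
"# One binary column per distinct location\n",
"toy_dummies = pd.get_dummies(toy.location, dtype=int)\n",
"\n",
"# Drop the text column and one dummy ('other'), then join the rest\n",
"encoded = pd.concat(\n",
"    [toy.drop('location', axis='columns'),\n",
"     toy_dummies.drop('other', axis='columns')],\n",
"    axis='columns',\n",
")\n",
"print(encoded)\n",
"```\n",
"\n",
"In the result, the `other` row has zeros in every location column, so the model treats it as the reference category."
]
},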
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>1st Block Jayanagar</th>\n",
" <th>1st Phase JP Nagar</th>\n",
" <th>2nd Phase Judicial Layout</th>\n",
" <th>2nd Stage Nagarbhavi</th>\n",
" <th>5th Block Hbr Layout</th>\n",
" <th>...</th>\n",
" <th>Vijayanagar</th>\n",
" <th>Vishveshwarya Layout</th>\n",
" <th>Vishwapriya Layout</th>\n",
" <th>Vittasandra</th>\n",
" <th>Whitefield</th>\n",
" <th>Yelachenahalli</th>\n",
" <th>Yelahanka</th>\n",
" <th>Yelahanka New Town</th>\n",
" <th>Yelenahalli</th>\n",
" <th>Yeshwanthpur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>2850.0</td>\n",
" <td>4.0</td>\n",
" <td>428.0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1630.0</td>\n",
" <td>3.0</td>\n",
" <td>194.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1875.0</td>\n",
" <td>2.0</td>\n",
" <td>235.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>130.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1st Block Jayanagar</td>\n",
" <td>1235.0</td>\n",
" <td>2.0</td>\n",
" <td>148.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 246 columns</p>\n",
"</div>"
],
"text/plain": [
" location total_sqft bath price bhk 1st Block Jayanagar \\\n",
"0 1st Block Jayanagar 2850.0 4.0 428.0 4 1 \n",
"1 1st Block Jayanagar 1630.0 3.0 194.0 3 1 \n",
"2 1st Block Jayanagar 1875.0 2.0 235.0 3 1 \n",
"3 1st Block Jayanagar 1200.0 2.0 130.0 3 1 \n",
"4 1st Block Jayanagar 1235.0 2.0 148.0 2 1 \n",
"\n",
" 1st Phase JP Nagar 2nd Phase Judicial Layout 2nd Stage Nagarbhavi \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" 5th Block Hbr Layout ... Vijayanagar Vishveshwarya Layout \\\n",
"0 0 ... 0 0 \n",
"1 0 ... 0 0 \n",
"2 0 ... 0 0 \n",
"3 0 ... 0 0 \n",
"4 0 ... 0 0 \n",
"\n",
" Vishwapriya Layout Vittasandra Whitefield Yelachenahalli Yelahanka \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
" Yelahanka New Town Yelenahalli Yeshwanthpur \n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
"[5 rows x 246 columns]"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df11.head()"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
"df12 = df11.drop(['location'],axis = 'columns')"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>price</th>\n",
" <th>bhk</th>\n",
" <th>1st Block Jayanagar</th>\n",
" <th>1st Phase JP Nagar</th>\n",
" <th>2nd Phase Judicial Layout</th>\n",
" <th>2nd Stage Nagarbhavi</th>\n",
" <th>5th Block Hbr Layout</th>\n",
" <th>5th Phase JP Nagar</th>\n",
" <th>...</th>\n",
" <th>Vijayanagar</th>\n",
" <th>Vishveshwarya Layout</th>\n",
" <th>Vishwapriya Layout</th>\n",
" <th>Vittasandra</th>\n",
" <th>Whitefield</th>\n",
" <th>Yelachenahalli</th>\n",
" <th>Yelahanka</th>\n",
" <th>Yelahanka New Town</th>\n",
" <th>Yelenahalli</th>\n",
" <th>Yeshwanthpur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2850.0</td>\n",
" <td>4.0</td>\n",
" <td>428.0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1630.0</td>\n",
" <td>3.0</td>\n",
" <td>194.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1875.0</td>\n",
" <td>2.0</td>\n",
" <td>235.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>130.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1235.0</td>\n",
" <td>2.0</td>\n",
" <td>148.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 245 columns</p>\n",
"</div>"
],
"text/plain": [
" total_sqft bath price bhk 1st Block Jayanagar 1st Phase JP Nagar \\\n",
"0 2850.0 4.0 428.0 4 1 0 \n",
"1 1630.0 3.0 194.0 3 1 0 \n",
"2 1875.0 2.0 235.0 3 1 0 \n",
"3 1200.0 2.0 130.0 3 1 0 \n",
"4 1235.0 2.0 148.0 2 1 0 \n",
"\n",
" 2nd Phase Judicial Layout 2nd Stage Nagarbhavi 5th Block Hbr Layout \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" 5th Phase JP Nagar ... Vijayanagar Vishveshwarya Layout \\\n",
"0 0 ... 0 0 \n",
"1 0 ... 0 0 \n",
"2 0 ... 0 0 \n",
"3 0 ... 0 0 \n",
"4 0 ... 0 0 \n",
"\n",
" Vishwapriya Layout Vittasandra Whitefield Yelachenahalli Yelahanka \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
" Yelahanka New Town Yelenahalli Yeshwanthpur \n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
"[5 rows x 245 columns]"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12.head()"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7251, 245)"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12.shape\n"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>total_sqft</th>\n",
" <th>bath</th>\n",
" <th>bhk</th>\n",
" <th>1st Block Jayanagar</th>\n",
" <th>1st Phase JP Nagar</th>\n",
" <th>2nd Phase Judicial Layout</th>\n",
" <th>2nd Stage Nagarbhavi</th>\n",
" <th>5th Block Hbr Layout</th>\n",
" <th>5th Phase JP Nagar</th>\n",
" <th>6th Phase JP Nagar</th>\n",
" <th>...</th>\n",
" <th>Vijayanagar</th>\n",
" <th>Vishveshwarya Layout</th>\n",
" <th>Vishwapriya Layout</th>\n",
" <th>Vittasandra</th>\n",
" <th>Whitefield</th>\n",
" <th>Yelachenahalli</th>\n",
" <th>Yelahanka</th>\n",
" <th>Yelahanka New Town</th>\n",
" <th>Yelenahalli</th>\n",
" <th>Yeshwanthpur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2850.0</td>\n",
" <td>4.0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1630.0</td>\n",
" <td>3.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1875.0</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1200.0</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1235.0</td>\n",
" <td>2.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 244 columns</p>\n",
"</div>"
],
"text/plain": [
" total_sqft bath bhk 1st Block Jayanagar 1st Phase JP Nagar \\\n",
"0 2850.0 4.0 4 1 0 \n",
"1 1630.0 3.0 3 1 0 \n",
"2 1875.0 2.0 3 1 0 \n",
"3 1200.0 2.0 3 1 0 \n",
"4 1235.0 2.0 2 1 0 \n",
"\n",
" 2nd Phase Judicial Layout 2nd Stage Nagarbhavi 5th Block Hbr Layout \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" 5th Phase JP Nagar 6th Phase JP Nagar ... Vijayanagar \\\n",
"0 0 0 ... 0 \n",
"1 0 0 ... 0 \n",
"2 0 0 ... 0 \n",
"3 0 0 ... 0 \n",
"4 0 0 ... 0 \n",
"\n",
" Vishveshwarya Layout Vishwapriya Layout Vittasandra Whitefield \\\n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 0 0 0 0 \n",
"3 0 0 0 0 \n",
"4 0 0 0 0 \n",
"\n",
" Yelachenahalli Yelahanka Yelahanka New Town Yelenahalli Yeshwanthpur \n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
"[5 rows x 244 columns]"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df12.drop('price',axis='columns')\n",
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 428.0\n",
"1 194.0\n",
"2 235.0\n",
"3 130.0\n",
"4 148.0\n",
" ... \n",
"10232 70.0\n",
"10233 200.0\n",
"10236 110.0\n",
"10237 26.0\n",
"10240 400.0\n",
"Name: price, Length: 7251, dtype: float64"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = df12.price\n",
"y"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 10)"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8452277697873348"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"lr_clf = LinearRegression()\n",
"lr_clf.fit(X_train,y_train)\n",
"lr_clf.score(X_test,y_test)"
]
},
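{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a scikit-learn regressor, `score` returns the coefficient of determination R², not a classification accuracy. A minimal sketch on synthetic data (the arrays and names below are assumptions for illustration, not the housing data) checks this against `r2_score`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import r2_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic data (illustrative assumption): a linear target plus noise\n",
"rng = np.random.default_rng(1)\n",
"X_demo = rng.normal(size=(200, 2))\n",
"y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)\n",
"\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)\n",
"model = LinearRegression().fit(X_tr, y_tr)\n",
"\n",
"# Both values are R² on the held-out split\n",
"score = model.score(X_te, y_te)\n",
"manual = r2_score(y_te, model.predict(X_te))\n",
"print(score, manual)\n",
"```"
]
},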
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use K Fold cross validation to measure accuracy of our LinearRegression model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this method, we split the data-set into k subsets (known as folds), train on k-1 of them, and hold out the remaining one to evaluate the trained model. We iterate k times, reserving a different fold for testing each time.\n",
"A lower value of k gives a more biased estimate, while a higher value of k is less biased but can suffer from large variability. A smaller k moves towards the single validation-set approach, whereas k equal to the number of samples is leave-one-out cross-validation (LOOCV). Note that the cell below uses `ShuffleSplit`, which draws 5 random 80/20 splits rather than 5 disjoint folds."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.82430186, 0.77166234, 0.85089567, 0.80837764, 0.83653286])"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import ShuffleSplit\n",
"from sklearn.model_selection import cross_val_score\n",
"cv = ShuffleSplit(n_splits = 5, test_size = 0.2, random_state = 0)\n",
"cross_val_score(LinearRegression(),X,y,cv=cv)"
]
},
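{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, a strict k-fold split can be sketched with `KFold`. This is an illustrative sketch on synthetic data (the arrays and names below are assumptions, not the notebook's own method); it shows that every row is held out exactly once across the 5 folds.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"# Synthetic data (illustrative assumption): y is linear in X plus small noise\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(100, 3))\n",
"y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)\n",
"\n",
"# Strict 5-fold: the rows are partitioned into 5 disjoint folds\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"# Count how often each row appears in a held-out fold\n",
"test_counts = np.zeros(len(X_demo), dtype=int)\n",
"for train_idx, test_idx in kf.split(X_demo):\n",
"    test_counts[test_idx] += 1\n",
"\n",
"# One R² score per fold\n",
"scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf)\n",
"print(scores.mean())\n",
"```\n",
"\n",
"With `ShuffleSplit`, by contrast, a row may appear in several test splits or in none, since each split is drawn independently."
]
},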
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find best model using GridSearchCV "
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>model</th>\n",
" <th>best_score</th>\n",
" <th>best_params</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>linear_regression</td>\n",
" <td>0.818354</td>\n",
" <td>{'normalize': False}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>lasso</td>\n",
" <td>0.687430</td>\n",
" <td>{'alpha': 2, 'selection': 'random'}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>decision_tree</td>\n",
" <td>0.720273</td>\n",
" <td>{'criterion': 'friedman_mse', 'splitter': 'best'}</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" model best_score \\\n",
"0 linear_regression 0.818354 \n",
"1 lasso 0.687430 \n",
"2 decision_tree 0.720273 \n",
"\n",
" best_params \n",
"0 {'normalize': False} \n",
"1 {'alpha': 2, 'selection': 'random'} \n",
"2 {'criterion': 'friedman_mse', 'splitter': 'best'} "
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.linear_model import Lasso\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"def find_best_model_using_gridsearchcv(X,y):\n",
"    # Note: 'normalize' (LinearRegression) and criterion 'mse' were removed in\n",
"    # scikit-learn 1.2; on newer versions drop 'normalize' and use 'squared_error'.\n",
"    algos = {\n",
"        'linear_regression' : {\n",
"            'model': LinearRegression(),\n",
"            'params': {\n",
"                'normalize': [True, False]\n",
"            }\n",
"        },\n",
"        'lasso': {\n",
"            'model': Lasso(),\n",
"            'params': {\n",
"                'alpha': [1,2],\n",
"                'selection': ['random', 'cyclic']\n",
"            }\n",
"        },\n",
"        'decision_tree': {\n",
"            'model': DecisionTreeRegressor(),\n",
"            'params': {\n",
"                'criterion' : ['mse','friedman_mse'],\n",
"                'splitter': ['best','random']\n",
"            }\n",
"        }\n",
"    }\n",
"    scores = []\n",
"    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)\n",
"    # Grid-search each candidate model on the same splits and record its best result\n",
"    for algo_name, config in algos.items():\n",
"        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)\n",
"        gs.fit(X,y)\n",
"        scores.append({\n",
"            'model': algo_name,\n",
"            'best_score': gs.best_score_,\n",
"            'best_params': gs.best_params_\n",
"        })\n",
"\n",
"    return pd.DataFrame(scores,columns=['model','best_score','best_params'])\n",
"\n",
"find_best_model_using_gridsearchcv(X,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Based on the results above, LinearRegression gives the best score, so we will use it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test the model on a few properties"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
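"def predict_price(location,sqft,bath,bhk):\n",
"    # Locations without a dummy column (e.g. those grouped under 'other') yield no match\n",
"    matches = np.where(X.columns==location)[0]\n",
"\n",
"    # Feature row in the same column order as X: sqft, bath, bhk, then location dummies\n",
"    x = np.zeros(len(X.columns))\n",
"    x[0] = sqft\n",
"    x[1] = bath\n",
"    x[2] = bhk\n",
"    if len(matches) > 0:\n",
"        x[matches[0]] = 1\n",
"    return lr_clf.predict([x])[0]"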
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"83.49904676591962"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_price('1st Phase JP Nagar',1000,2,2)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"184.58430202040012"
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_price('Indira Nagar',1000, 3, 3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}