Skip to content Skip to footer

5 Steps to Transform Messy Functions into Production-Ready Code | by Khuyen Tran | Jan, 2024


Image by Author

Functions are essential in a data science project because they make the code more modular, reusable, readable, and testable. However, writing a messy function that tries to do too much can introduce maintenance hurdles and diminish the code’s readability.

In the following code, the function impute_missing_values is long, messy, and tries to do many things. Since there are many hard-coded values, it would be impossible for someone else to reuse this function for a DataFrame with different column names.

def impute_missing_values(df):
# Fill missing values with group statistics
df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].transform(
lambda x: x.fillna(x.mode()[0])
)
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median())
)

# Fill missing values with constant
df["Functional"] = df["Functional"].fillna("Typ")

df["Alley"] = df["Alley"].fillna("Missing")
for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
df[col] = df[col].fillna("Missing")

for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
df[col] = df[col].fillna("Missing")

df["FireplaceQu"] = df["FireplaceQu"].fillna("Missing")

df["PoolQC"] = df["PoolQC"].fillna("Missing")

df["Fence"] = df["Fence"].fillna("Missing")

df["MiscFeature"] = df["MiscFeature"].fillna("Missing")

numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
for i in df.columns:
if df[i].dtype in numeric_dtypes:
df[i] = df[i].fillna(0)

# Fill missing values with mode
df["Electrical"] = df["Electrical"].fillna("SBrkr")
df["KitchenQual"] = df["KitchenQual"].fillna("TA")
df["Exterior1st"] = df["Exterior1st"].fillna(df["Exterior1st"].mode()[0])
df["Exterior2nd"] = df["Exterior2nd"].fillna(df["Exterior2nd"].mode()[0])
df["SaleType"] = df["SaleType"].fillna(df["SaleType"].mode()[0])
for i in df.columns:
if df[i].dtype == object:
df[i] = df[i].fillna(df[i].mode()[0])
return df

This example is adapted from the notebook titled How I Achieved Top 0.3% in a Kaggle Competition, with a few alterations.



Source link