Simple Guide to Polynomial Features

Jessie Jones
4 min read · Feb 24, 2021


Have you ever found yourself grinding away at feature engineering, creating polynomial and interaction feature columns by hand? I have. It was rough. Really rough. My fingertips were beginning to bleed from all of the typing! Okay, maybe it wasn’t that bad, but still… there’s a better way.

Let me introduce you to PolynomialFeatures from our good friends at sklearn! It’s a class found in sklearn’s preprocessing module, and it systematically generates polynomial and interaction features for you. Isn’t that wonderful?
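Before we dive into a full example, here’s a tiny sketch (on a made-up two-column array, just for illustration) of what those generated features look like with degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2., 3.],
                [4., 5.]])  # two features: a and b

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(toy))
# Each row becomes [a, b, a^2, a*b, b^2]:
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]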

Let’s go through an example to see how it works. First, here are our imports:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_iris #The dataset we will use
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier

We will be using the iris dataset, which contains measurements of different flowers such as petal length, sepal width, and a few others.

iris = load_iris()
iris.data.shape

The output, (150, 4), shows the shape of our features: 150 rows (flowers) and 4 columns (measurements).

Let’s turn it into a DataFrame and check it out:

df = pd.DataFrame(iris.data)
df.columns = iris.feature_names
df['flower'] = iris.target
df.head()

Create our X and y, then split them with train_test_split:

X = df[iris.feature_names]
y = df['flower']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

Now we are going to use StandardScaler to ensure that our features are all on the same scale. After that, we will instantiate our KNN model and use cross_val_score to get an idea of how it will perform later:

ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)  # fit the scaler on train only, then transform both sets
X_test_ss = ss.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5, p=2)  # p=2 means Euclidean distance
cross_val_score(knn, X_train_ss, y_train, cv=10).mean()

Let’s now fit the model and then see how it does on our test set:

knn.fit(X_train_ss,y_train)
knn.score(X_test_ss, y_test)

Ok! We’ve just trained a model on the plain iris data. As we can see, we are scoring about 93.3% on our test set. This isn’t bad, but we can do better. Let’s take a look at how this pipeline works if we add in PolynomialFeatures.

The first thing we need to do is instantiate PolynomialFeatures, which we do by creating a new variable (poly, below) and assigning an instance of the class to it. Some key parameters to know are degree, interaction_only, and include_bias. More on these in a moment. First, let’s see the syntax:

poly = PolynomialFeatures(degree=2,
                          interaction_only=False,
                          include_bias=False)

degree tells PolynomialFeatures what degree of polynomial to use; the default is 2, and if you go much higher than that you will typically end up overfitting. interaction_only takes a boolean: if True, it will only give you interaction features (i.e. column1 * column2); if False, it will give you the interactions plus each feature raised to a power (squared, cubed, etc., depending on degree). include_bias also takes a boolean: if True, it adds a column of ones (the intercept term), which we don’t need here.
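Next, fit poly on the training data and transform both sets. This is the step that creates the X_train_poly and X_test_poly we’ll use below:

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)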

Here’s how to look at your new polynomial features:

# In newer versions of sklearn, use poly.get_feature_names_out(iris.feature_names) instead
pd.DataFrame(X_train_poly, columns=poly.get_feature_names(iris.feature_names)).head()
The extra features we just created
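With four original features and degree 2 (and no bias column), that works out to 14 columns: the 4 originals, their 4 squares, and 6 pairwise interactions. You can confirm this from the fitted transformer:

poly.n_output_features_  # 14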

All of our new features are on different scales, so we will need to scale them before passing them along to our model. After that, we will instantiate our model and run cross_val_score to get an idea of how it’s working.

sc = StandardScaler()
X_train_poly = sc.fit_transform(X_train_poly)
X_test_poly = sc.transform(X_test_poly)

knn = KNeighborsClassifier(n_neighbors=5, p=2)
cross_val_score(knn, X_train_poly, y_train, cv=10).mean()

Finally, we can see how our model trained with PolynomialFeatures does on unseen data!

knn.fit(X_train_poly,y_train)
knn.score(X_test_poly, y_test)

We got ~93.3% from our standard model, but that improved to ~96.6% when we used PolynomialFeatures. It was easy too!
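If you want to tidy this up even further, the same workflow can be wrapped in an sklearn Pipeline (a sketch of an alternative, not something we did above), so the polynomial expansion, scaling, and model live in a single object:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, p=2))
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)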

