Build a Support Vector Machine (SVM) with kernels for a Bank Marketing Dataset

Sachin Mamoru
17 min read · Jun 20, 2021


Photo by Owen Beard on Unsplash

Problem in brief

Use case: The dataset is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict if the client will subscribe to a term deposit.

Here, the marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).

For analysis purposes, we are using a bank marketing dataset here.

Domain Knowledge

A term deposit in bank marketing is a fixed-term investment that involves the deposit of money into an account at a financial institution. Term deposit investments normally carry short-term maturities ranging from one month to a few years and have varying levels of required minimum deposits. The investor must understand, when buying a term deposit, that they can withdraw their money only after the term ends.

Here, I’m using a dataset that is related to the term deposit of a banking institution and use that dataset to predict whether the customer will subscribe to a term deposit or not.

The given dataset has 21 columns. We need to do a proper analysis to find out which features strongly affect the term deposit subscription, and then train an SVM to predict whether the customer will subscribe to a term deposit or not.

Note: All the coding was done in a Google Colab notebook (Resources Section)

Preprocess the dataset as specified in the data mining process

First, we need to visualize and understand the dataset.

import pandas as pd
bank_df = pd.read_csv('../data/banking.csv')
bank_df.head(5)

Using the shape and dtypes attributes, you can find the number of rows and columns and the data type of each column.
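
For example, a quick check of the dimensions and column types (a minimal sketch, assuming the bank_df loaded above):

# number of rows and columns
print(bank_df.shape)

# data type of each column
print(bank_df.dtypes)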

Raw (real-world) data is almost always incomplete, and such data cannot be fed directly into a model; it would generate errors. That is why we need to preprocess data before sending it through a model.

Therefore, preprocessing is a must for real-world data.

First, we need to check the values of each column and get a rough idea of the dataset. For that, we can use the describe method.
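
For example (a minimal sketch):

# summary statistics of the numeric columns
bank_df.describe()

# summary of the categorical columns (counts, unique values, most frequent value)
bank_df.describe(include='object')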

Here, I will not be considering the duration column because, according to the source of the dataset, this attribute highly affects the output target. So, as a first step, we can remove the duration column.

duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
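
A minimal sketch of that first step (assuming the working dataframe is named new_df, matching the snippets that follow):

# drop the 'duration' column so the model stays realistic
new_df = bank_df.drop('duration', axis=1)
print(new_df.shape)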

Handle Missing Values

Missing data are values that are not available but would be meaningful if observed. Missing data can result from a missing sequence, an incomplete feature, missing files, incomplete information, data entry errors, etc.

So we need to handle them before going into further analysis. We have two options concerning missing values.

  1. Deletion
  2. Imputation

Imputation means filling the missing value with, for example, the mean or median of the column. We need to be careful with this, because it might bias the model.

If the number of missing cases is very small compared to the size of the dataset, it is usually better to remove them.

print(bank_df.isnull().sum() * 100 / len(bank_df))

Here you can see there are no missing values to be considered.
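
Since this dataset is complete, no action is needed here. Purely for illustration, the two options described above would look roughly like this (‘age’ is used only as an example column; it has no missing values in this dataset):

# Option 1: deletion - drop every row that contains a missing value
df_dropped = bank_df.dropna()

# Option 2: imputation - fill missing values in a numeric column with its median
bank_df['age'] = bank_df['age'].fillna(bank_df['age'].median())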

Remove Duplicates

Let’s find out how many duplicates are there in the dataset.

print("Number of duplicates : " + str(new_df.duplicated().sum()))

Let’s remove them.

new_df = new_df.drop_duplicates()
new_df = new_df.reset_index(drop=True)

When we remove rows from a dataset, the rows are dropped but the indexes are not reset. For instance, if we remove the row with index 5, the new dataframe has indexes ...3, 4, 6, 7, etc. To avoid this situation, we can use the reset_index method, which renumbers the index of the dataframe from 0.

Handle Outliers

An outlier is a data point that is distant from other related points. They may be due to variability in the measurement or may show experimental errors.

Types of Outliers

  • Global Outliers (also called “Point Anomalies”): Data points are far outside the entirety of the dataset.
  • Contextual (Conditional) Outliers: an individual data instance that is anomalous in a specific context or condition.
  • Collective Outliers: a collection of data points that is anomalous with respect to the entire dataset, even though the individual values themselves are not anomalous.

There are two basic methods to detect outliers.

  • Percentile
  • Box Plot

Here we are using the box plot method.

Here you can see there are a couple of outliers in the dataset that we need to fix.

There are three basic ways to handle outliers.

  • Remove all the outliers
  • Replace Outlier Values with a suitable value — Replace them with min or max quantile value
  • Using IQR — We can simply remove the data above and below the limits, or replace them with the limit values (a minimal sketch of this approach follows this list).
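
For reference, an IQR-based filter for a single numeric column could look like this sketch (shown here for the “campaign” column; the article itself uses the box-plot approach below):

# IQR limits for the 'campaign' column
q1 = new_df['campaign'].quantile(0.25)
q3 = new_df['campaign'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# keep only the rows that fall within the limits
df_iqr = new_df[(new_df['campaign'] >= lower_limit) & (new_df['campaign'] <= upper_limit)]
print("Rows kept:", df_iqr.shape[0])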

For explanation purposes, I will remove outliers from the “campaign” column; for the other columns, you can find the full code in the notebook. Here, “campaign” has an extreme outlier with a value of more than 50. The other values above the upper quartile range are not treated as outliers, since they represent genuine variation rather than errors. Therefore we only need to remove the extreme points.

import warnings
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (8, 5)}, font_scale=1.2, style='whitegrid')
warnings.filterwarnings("ignore")

fig, axes = plt.subplots(1, 2)
plt.tight_layout(pad=0.2)

print("Before Shape:", new_df.shape)

# keep only the rows where 'campaign' is below 50 (removes the extreme outlier)
df_campaign = new_df.loc[(new_df['campaign'] < 50)]

print("After Shape:", df_campaign.shape)

# box plots before and after removing the outlier
sns.boxplot(new_df["campaign"], orient='v', ax=axes[0])
axes[0].title.set_text("Before")
sns.boxplot(df_campaign["campaign"], orient='v', ax=axes[1])
axes[1].title.set_text("After")
plt.show()

Data Transformation — Handle Skewness

Skewed data is common in data science; skew is the degree of distortion from a normal distribution. The probability distribution with its tail on the right side is a right-skewed distribution and the one with its tail on the left side is a left-skewed distribution.

source — https://www.cambridgemaths.org/blogs/skewed-usage-skewed-distribution/

To detect skewness, we plot Q-Q plots and histograms.

Q-Q plots and histogram of “age”
  • Q-Q plot — Most of the data points lie on the line.
  • Histogram — Nearly symmetric distribution.
  • Transformation — No need.
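
The Q-Q plots and histograms referenced here can be generated with scipy and matplotlib along these lines (a sketch, shown for the “age” column of the working dataframe new_df):

import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot against a normal distribution
stats.probplot(new_df['age'], dist='norm', plot=axes[0])
axes[0].set_title("Q-Q plot of 'age'")

# histogram of the same column
axes[1].hist(new_df['age'], bins=30)
axes[1].set_title("Histogram of 'age'")

plt.tight_layout()
plt.show()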

Now let’s analyze the Q-Q plots and histogram of the “campaign”.

Q-Q plots and histogram of the “campaign”

According to the histogram, the “campaign” column has a right-skewed distribution. To make the distribution closer to symmetric, we need to apply a logarithmic transformation. There is a special case here: the column can contain 0-valued data points, and a plain log transformation would map those points to minus infinity. Therefore we apply a log(x + 1) transformation instead.

# import the needed packages
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# right-skewed data
# columns that need transformation
columns = ['campaign']

# create the function transformer object with a log(x + 1) transformation
# (np.log1p avoids mapping 0 values to minus infinity)
logarithmic_transformation = FunctionTransformer(np.log1p, validate=True)

# apply the transformation to the data
# (df2 is the working dataframe; see the notebook for the intermediate steps)
data_new = logarithmic_transformation.transform(df2[columns])
df_right_skewed = pd.DataFrame(data_new, columns=columns)
df_right_skewed = df_right_skewed.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df2["campaign"] = df_right_skewed["campaign"]

Let’s take the “nr_employed” column now.

Q-Q plots and histogram of the “nr_employed”

Since the “nr_employed” column has a left-skewed distribution, we can apply an exponential or power transformation.

# left-skewed data
# columns that need transformation
columns = ['nr_employed']

# create the function transformer object with a power (cube) transformation
power_transformer = FunctionTransformer(lambda x: x ** 3, validate=True)

# apply the transformation to the data
data_new = power_transformer.transform(df2[columns])
df_left_skewed = pd.DataFrame(data_new, columns=columns)
df_left_skewed = df_left_skewed.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df2["nr_employed"] = df_left_skewed["nr_employed"]

The skewness handling for the remaining features can be seen in the Python notebook.

Feature Coding Techniques

While a lot of improvements have been made in different machine learning frameworks to accept complex categorical data types like text labels, a typical feature engineering workflow still involves transforming these categorical values into numeric labels and then applying an encoding scheme to them.

There are two common methods to do that.

  • One-hot Encoding — Representation of categorical variables as binary vectors.
  • Integer (Label) Encoding — Converting the labels into the numeric form to transform them into a machine-readable form.
source — https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179

Before feature coding, we need to check the number of unique values in each categorical column.

df2.select_dtypes(include = np.object_).nunique()

Now I’ll apply encoding techniques for the categorical columns in the dataset.

I’ll apply Label Encoding for the “contact” column since it has only two classes (unique values).

df2['contact'] = df2['contact'].astype('category')
df2['contact'] = df2['contact'].cat.codes

Now let’s apply One Hot Encoding for the rest of the categorical columns.

from sklearn.preprocessing import OneHotEncoder

# create an instance of the one-hot encoder
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
categorical_columns = ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'month', 'day_of_week', 'poutcome']

# fit the encoder on the categorical columns
onehot_encoder.fit(df2[categorical_columns])
# (in scikit-learn >= 1.0, use get_feature_names_out instead)
column_names = onehot_encoder.get_feature_names(categorical_columns)

# transform the categorical columns into one-hot columns
onehot_encoder_df = pd.DataFrame(onehot_encoder.transform(df2[categorical_columns]).toarray(), columns=column_names)
onehot_encoder_df = onehot_encoder_df.reset_index(drop=True)
df2 = df2.join(onehot_encoder_df)
df2.drop(categorical_columns, axis=1, inplace=True)

Now let’s see the sample of the dataset.

df2.sample(n = 5)
Part of the dataset after encoding

Standardize the features

Standardization is one of the scaling techniques where the values are centered around the mean with a unit standard deviation. This implies that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Standardization: z = (x − μ) / σ

Let’s apply the standardization for our dataset now.

Note: Set aside the earlier encoded columns, because we don’t standardize them. I have also left out the “age” and “pdays” columns, because they will be discretized in the next step, after standardization.

from sklearn.preprocessing import StandardScaler

feature_columns = ['campaign', 'previous', 'emp_var_rate', 'cons_price_idx',
                   'cons_conf_idx', 'euribor3m', 'nr_employed']

# feature data to standardize
standardize = df2[feature_columns].copy()

# create the scaler object and fit it to the data
scaler = StandardScaler()
scaler.fit(standardize)

data_scaled = scaler.transform(standardize)
df_standardized = pd.DataFrame(data_scaled, columns=standardize.columns)
df2[feature_columns] = df_standardized

Feature Discretization

Many machine learning algorithms perform better when numerical input variables have a standard probability distribution. The discretization transform provides an automatic way to change a numeric input variable into one with a different data distribution, which in turn can be used as input to a predictive model.
Values of the variable are grouped together into discrete bins, and every bin is assigned a unique integer such that the ordinal relationship among the bins is preserved. The use of bins is often referred to as binning or k-bins, where k refers to the number of groups to which a numeric variable is mapped.

Discretization Approach

  • Supervised Approach — Discretization with decision trees
  • Unsupervised Approach — Equal-width discretization, K-means discretization
  • Custom Discretization

In the dataset, we have a feature called “age”. Its values range from 17 to 95, which is quite a large range. We can discretize it into eight bins representing actual age groups (young, adult, etc.).

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

discretize_columns = ['age']
discretize_data = pd.DataFrame(df2, columns=discretize_columns)

# fit the discretizer to the data
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='kmeans')
discretizer.fit(discretize_data)
discretized = discretizer.transform(discretize_data)

df_discretized = pd.DataFrame(discretized, columns=discretize_columns)
df2[discretize_columns] = df_discretized

We have another column called “pdays”. It can be discretized into 2 bins, as in the sketch below (the original code snippet can be found in the Colab notebook).
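
A sketch of that step, assuming the same KBinsDiscretizer approach is reused with two bins (the notebook may use a different strategy):

from sklearn.preprocessing import KBinsDiscretizer

discretize_columns = ['pdays']
discretize_data = pd.DataFrame(df2, columns=discretize_columns)

# fit a two-bin discretizer and transform the column
discretizer = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')
discretizer.fit(discretize_data)
discretized = discretizer.transform(discretize_data)

df_discretized = pd.DataFrame(discretized, columns=discretize_columns)
df2[discretize_columns] = df_discretized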

Train-Test Split

Initially, we need to separate our feature set and the target column (y).

df_features = df2.drop('y', axis=1)
df_target = pd.DataFrame(df2['y'], columns=["y"])

Let’s split the dataset into training and testing set according to the 0.8–0.2 ratio.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_features, df_target, test_size=0.2, random_state=101)

x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

Perform Feature Engineering

  • Dimension reduction

Dimensionality reduction is a machine learning technique of reducing the number of random variables in a problem by selecting a set of principal variables.

But if we skip this step and try to keep all the features of the dataset, we face the curse of dimensionality: model complexity increases and the model tries to capture noise as well, which leads to overfitting on the training data and poor generalization at test time.

Therefore, to identify the most relevant features, we can apply PCA (Principal Component Analysis) or SVD (Singular Value Decomposition) for feature reduction.

  • Significant Features

To identify dependent and independent features we can use the Correlation Matrix of the transformed dataset.

Let’s draw the heatmap for continuous variable features.
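
A sketch of how such a heatmap can be drawn (assuming the transformed dataframe df2 and the continuous columns listed in the snippet; the exact column set may differ in the notebook):

import seaborn as sns
import matplotlib.pyplot as plt

continuous_columns = ['campaign', 'previous', 'emp_var_rate', 'cons_price_idx',
                      'cons_conf_idx', 'euribor3m', 'nr_employed']

# correlation matrix of the continuous features
corr = df2[continuous_columns].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix of the continuous features')
plt.show()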

Here you can see different correlations between the features of the dataset.

  • Near to +1 — Strong positive correlation between variables.
  • Near to -1 — Strong negative correlation between variables.
  • Near to 0 — No correlation between variables.

If the two features are strongly correlated to each other, we can conclude that those two variables are highly dependent on each other.

According to this figure, you can see,

  • emp_var_rate and euribor3m — 0.97
  • euribor3m and nr_employed — 0.95
  • emp_var_rate and nr_employed — 0.9
  • cons_price_idx and emp_var_rate — 0.76
  • cons_price_idx and euribor3m — 0.67

Therefore we can say these 4 features are highly dependent on each other. So we can keep one feature from those dependent features and remove the rest.

But to identify the feature to be kept, we need to draw the correlation matrix with the target variable again.

Here we need to find the features with the highest correlation with the target (y) variable. Out of the dependent features, we then remove those that have a weak relationship with the target.
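
These correlations can be read off with a small snippet like this (a sketch, assuming the target y is already numeric, 0/1, in this version of the dataset):

dependent_features = ['emp_var_rate', 'euribor3m', 'nr_employed', 'cons_price_idx']

# correlation of the dependent features with the target y
print(df2[dependent_features + ['y']].corr()['y'].sort_values())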

Therefore, let’s check the correlation of those dependent features with the target:

  • nr_employed — (-0.34)
  • euribor3m — (-0.29)
  • emp_var_rate — (-0.28)
  • cons_price_idx — (-0.11)

According to these values, we can keep the “nr_employed” column and drop the “euribor3m”, “emp_var_rate”, and “cons_price_idx” columns from the dataset.
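
One way to apply that decision is sketched below (assuming the drop is applied to both the train and test feature sets; the notebook may handle this differently):

# keep 'nr_employed' and drop the other three highly correlated features
drop_cols = ['euribor3m', 'emp_var_rate', 'cons_price_idx']
x_train = x_train.drop(columns=drop_cols)
x_test = x_test.drop(columns=drop_cols)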

Since there are many features (after one-hot encoding) to be analyzed in the dataset, we have only considered the continuous variables here.

To handle the remaining columns, we can simply apply PCA for dimensionality reduction instead of inspecting the heatmap.

  • PCA — Principal Component Analysis

Principal Component Analysis, or PCA, is a dimensionality reduction technique that is usually used to reduce the dimensionality of huge data sets, by converting a large set of variables into a smaller one that still contains the majority of the information in the large set.
So to sum up, the concept of PCA is simply reducing the number of variables of a data set, while preserving as much information as possible.

Let’s apply PCA to our dataset and find the variance ratios of the PCA object. Then we can identify how many dimensions should be preserved for the model to be trained.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(x_train)

# check how many components should be retained
pca.explained_variance_ratio_

Here we keep the components with high variance ratios and drop the rest. We keep the components that together explain about 95% of the variance.

pca.explained_variance_ratio_[:25].sum()

Therefore I’ll keep 25 components (dimensions).

pca = PCA(n_components=25)
pca.fit(x_train)

x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

Accordingly, we were able to reduce the dataset to 25 dimensions while preserving most of the information in the dataset.

Additional: Fixing the class imbalance of target attribute using SMOTE

In our dataset, we can see a class imbalance in the target (y) variable. Therefore before going into model training, we need to handle that issue.

  • Imbalance data is where the classification dataset class has a skewed proportion.
  • The imbalance class creates a bias where the learning model tends to predict the majority class.
y_train.value_counts().plot(kind='bar', figsize=(6, 4))
plt.title('y - has the client subscribed a term deposit? (0 = no, 1 = yes)', size=20, pad=30)

SMOTE: Synthetic Minority Oversampling Technique

Let’s apply SMOTE to our training dataset.

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=0)
X_sm_pca, y_sm = sm.fit_resample(x_train_pca, y_train)

print(f'''Shape of X before SMOTE: {x_train_pca.shape}
Shape of X after SMOTE: {X_sm_pca.shape}''')

print('\nBalance of positive and negative classes (%):')
y_sm = pd.DataFrame(y_sm, columns=["y"])
y_sm.value_counts(normalize=True) * 100

y_sm.value_counts().plot(kind='bar', figsize=(6, 4))
plt.title('y - has the client subscribed a term deposit? (0 = no, 1 = yes)', size=20, pad=30)

Now the class imbalance issue is fixed.

Train model — Support Vector Machines (SVM) with kernels

Support Vector Machines are a kind of supervised machine learning algorithm used for both classification and regression analysis. While they can be applied to regression, SVMs are most frequently used for classification. Each sample is plotted as a point in n-dimensional space, where the value of each feature is the value of a particular coordinate. Then, we identify the ideal hyperplane that differentiates the two classes.

source — https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm

If the dataset is not linearly separable, we can make it separable using non-linear transformations or slack variables.
The SVM draws the separating hyperplane with the widest possible margin between the positive samples and the negative samples.

Here we can define three main lines:

  • Wx+b = 0 — for the decision boundary
  • Wx+b = (-1) — for class 1 boundary
  • Wx+b= (+1) — for class 2 boundary
  1. Positive samples — Wx+b≥ +1
  2. Negative samples — Wx+b≤ -1

Then, by introducing the label y as follows, we can combine these into a single condition.

  • y = (+1) and y = (-1)
  • y(Wx+b) ≥ +1
  • Therefore the support vectors — yᵢ(Wxᵢ + b) = 1

When this condition is satisfied, all the positive and negative data points will be behind the boundary lines.

When applying an SVM, we want the maximum width between the boundaries.

In order to maximize the width, we need to maximize 2/‖W‖, which is equivalent to minimizing ‖W‖, or minimizing ½‖W‖².

Therefore we apply Lagrange multipliers for the constrained minimization.

Optimum width

Slack variables

If the dataset is not linearly separable, we can make them separable using slack variables. Slack variables are introduced to support certain constraints to be violated.

Here we calculate the distance to the data points that fall on the wrong side of the marginal hyperplane. We then define a penalty function that controls the tolerance for misclassifications: a data point can be on the wrong side of the margin boundary, but with a penalty that increases with its distance from that boundary. This penalty (ξ) is known as the slack variable.

  • if ξᵢ = 0, xᵢ is exactly at the marginal hyperplane.
  • if 0 < ξᵢ ≤ 1, then xᵢ is located within the margin.
  • if ξᵢ > 1, then xᵢ is located on the other side of the separating hyperplane, which means a misclassification.

Then we need to analyze the penalty function and choose an appropriate value for C (the regularization parameter).
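
Putting the margin objective and the slack variables together, the soft-margin SVM optimization problem takes the standard form below, where C controls the trade-off between a wide margin and the penalty for margin violations:

minimize ½‖W‖² + C Σᵢ ξᵢ
subject to yᵢ(Wxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, for all i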

source — https://www.saedsayad.com/support_vector_machine.htm

SVM kernel

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable one.

source — https://www.youtube.com/watch?v=wqSTBCguVyU

The kernel trick to the SVM optimization problem:

  • The optimization problem
  • The decision boundary
  • Now we apply the kernel function. It reduces the complexity of finding the mapping function; the kernel function defines the inner product in the transformed space.

Kernel function

For the SVM, I’m using the RBF kernel to train the model.

RBF kernel: K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²)

Let’s apply the SVM with the RBF kernel to our training data.

from sklearn import svm

# create an instance of SVM with the RBF kernel and fit our data
svc = svm.SVC(kernel='rbf', C=0.1, gamma=0.1).fit(X_sm_pca, y_sm)

Prediction

Let’s check the predicted values for our test dataset.

predictions = svc.predict(x_test_pca)
y_hat = pd.DataFrame(predictions, columns=["Predicted y"])
y_test values and y_hat values

Let’s visualize this in a plot.

plt.plot(y_hat[:150], label="Pred")
plt.plot(y_test[:150], label="Actual")
plt.xlabel('x - axis')
# set the y-axis label of the current axis
plt.ylabel('y - axis')
# set a title for the current axes
plt.title('Predictions vs Actual')
# show a legend on the plot
plt.legend()
# display the figure
plt.show()
Predictions vs Actual

Model Evaluation

When we predict values with our model, there can be errors between the actual and the predicted values. Therefore, let’s analyze the accuracy of our model.

from sklearn import metrics

# mean accuracy of the predictions on the test set
# (for a classifier, SVC.score returns accuracy, not R²)
score_pca = svc.score(x_test_pca, y_test)
print("Accuracy of the prediction : " + str(score_pca * 100) + "%")

print("Precision:", metrics.precision_score(y_test, y_hat))
print("Recall:", metrics.recall_score(y_test, y_hat))

Confusion Matrix

A Confusion matrix is an N x N matrix applied for assessing the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This matrix is mostly used for evaluating classification models.

from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(y_test, y_hat)

fig, ax = plt.subplots(1)
ax = sns.heatmap(cnf_matrix, ax=ax, cmap=plt.cm.Greens, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('True Category')
plt.xlabel('Predicted Category')
plt.show()

Finally, let’s check the model classification report.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_hat))

Resources

  • Complete Co-lab notebook. (With preloaded data set)
  • Bank Marketing Dataset

https://archive.ics.uci.edu/ml/datasets/bank+marketing

Conclusion

We have covered preprocessing, feature engineering, and PCA, through model creation, prediction, and accuracy evaluation, along with the core concepts, for a bank marketing dataset. Here we used Support Vector Machines (SVM) with kernels to train our model.

I would like to express my special appreciation and thanks to Dr. Subha Fernando (Senior Lecturer at the University of Moratuwa) for inspiring me to write this article.

Thank you very much.
