Peeking inside SVM’s Pandora’s box

Tijs van der Velden
12 min read · Feb 11, 2024


How Quadratic Programming can be used to create an SVM without sklearn

When you want to go the extra mile in understanding Machine Learning algorithms, you have to leave the libraries behind and expose yourself to the statistical and mathematical underpinnings. Having had limited formal education in both areas, I often resort to code as an intermediate. Although such code is widely available for most basic algorithms, in the case of SVMs it is usually limited to either a Gradient Descent solution, or the author of the article ends up using sklearn anyway. Since the primal and dual formulations are considered the de facto standard optimization problems for approaching SVMs, it felt only natural to use these for my little peek into the inner workings.

Note: I call it a little peek as I will be using a Quadratic Programming solver. Maybe someday I’ll dare to look into that as well.

Housekeeping

Code was created using Python 3.11.5 and the following libraries were imported:

import numpy as np
from cvxopt import matrix, solvers
import scipy.stats as stats
import matplotlib.pyplot as plt

A simple classification problem

Let’s first generate some data for us to play with.

#let's generate two classes of data
# 10 samples per class
mclass = 10

X1x1 = stats.norm.rvs(size=mclass, loc=-2, scale=.5,random_state=42)
X1x2 = stats.norm.rvs(size=mclass, loc=-2, scale=.5,random_state=1337)
X2x1 = stats.norm.rvs(size=mclass, loc=2, scale=.5,random_state=271)
X2x2 = stats.norm.rvs(size=mclass, loc=2, scale=.5,random_state=314)

Xy1 = np.c_[X1x1, X1x2, np.zeros(mclass)]
Xy2 = np.c_[X2x1, X2x2, np.ones(mclass)]

Xy = np.concatenate([Xy1, Xy2])

X, y = Xy[:, :2], Xy[:, 2]

Note: Normally in any Machine Learning dataset (unless it’s a time series), you want to make sure the data is ordered randomly along the row axis. Also, SVM is very sensitive to different scales, so make sure you scale your data in an actual scenario!
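The synthetic data above is already on a comparable scale, so we skip scaling here; a minimal standardization sketch without sklearn (assuming X is your raw feature matrix) could look like this:

#standardize each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)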

To keep it simple, we’re going to apply a hard margin Linear SVM classifier. This means that within the margin of the decision boundary up to the support vectors, we do not allow for any outliers in the form of margin violations. If these terms are new to you, check out Josh Starmer’s great StatQuest here. We’re also going for the primal optimization, since that one is worked out in Aurélien Géron’s “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, which I use for most of my studies. Be aware that if you move to more complex, non-linear SVMs, you’ll most likely shift to the dual formulation to allow for the kernel trick!

This primal formulation is given by Géron in the following manner:

$$\underset{\mathbf{p}}{\operatorname{minimize}} \quad \frac{1}{2}\,\mathbf{p}^\top \mathbf{H}\,\mathbf{p} + \mathbf{f}^\top \mathbf{p} \qquad \text{subject to} \quad \mathbf{A}\,\mathbf{p} \le \mathbf{b}$$

where:

  • p will contain the weights and bias term
  • H is the (n_features + 1) x (n_features + 1) identity matrix, with a 0 for the bias
  • f is a n_features + 1 vector containing 0's
  • b is a n_samples vector containing -1's
  • A is -t * X with the bias feature = 1 added
  • t defined as -1 when y = 0 and 1 when y = 1

One thing that is often not mentioned when converting this to code is that y and t are easily mixed up, leaving you with a y variable where the calculation actually expects t, which is -1 for negative classes. This is one of those moments where different domains using different conventions can lead to confusion when learning.

Converting the definition above results in the following Python function:

def lin_svm_primal_hard_margin(X, y):
    """
    Manual hard margin linear SVM using the primal QP formulation,
    with adjustments to match Geron's book.

    p will contain the weights and bias term
    H is the (n_features + 1) x (n_features + 1) identity matrix, with a 0 for the bias
    f is a n_features + 1 vector containing 0's
    b is a n_samples vector containing -1's
    A is -t * X with the bias feature = 1 added
    t defined as -1 when y = 0 and 1 when y = 1
    """
    #make negative classes -1 instead of 0
    t = np.where(y == 0, -1, y)

    n_samples, n_features = X.shape

    #Extend feature matrix with ones for the bias term
    X_ext = np.hstack([X, np.ones((n_samples, 1))])

    #Constructing matrices for QP
    H = np.eye(n_features + 1)
    H[n_features, n_features] = 0  # the bias term is not regularized
    f = np.zeros(n_features + 1)
    A = -t[:, None] * X_ext
    b = -np.ones(n_samples)

    #Prepping the data for the solver
    H = matrix(H)
    f = matrix(f)
    A = matrix(A)
    b = matrix(b)

    #QP magic time!
    solution = solvers.qp(H, f, A, b)
    vars = np.array(solution['x']).flatten()

    #Extracting p and bias for easy use
    p = vars[:n_features]
    bias = vars[n_features]
    return p, bias

To plot the decision function for our found weights, let’s define the following plot function:

def plot_db(p, bias):
    x0s = np.linspace(-5, 5, 100)
    x1s = np.linspace(-5, 5, 100)

    #decision_boundary is found by setting p[0] * x0 + p[1] * x1 + bias = 0
    decision_boundary = -p[0] / p[1] * x0s - bias / p[1]
    margin = 1/p[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin

    plt.plot(x0s, decision_boundary, "k-", linewidth=2)
    plt.plot(x0s, gutter_up, "k--", linewidth=2, zorder=-2)
    plt.plot(x0s, gutter_down, "k--", linewidth=2, zorder=-2)

    #lets add the contour for swag
    x0, x1 = np.meshgrid(x0s, x1s)
    X_cont = np.c_[x0.ravel(), x1.ravel()]
    y_cont = p[0] * X_cont[:, 0] + p[1] * X_cont[:, 1] + bias
    y_cont = y_cont.reshape(x0.shape)

    plt.contourf(x0, x1, y_cont, alpha=0.2)

With that, we can finally call our own SVM “fit” function and plot the fitted values:

p, bias = lin_svm_primal_hard_margin(X,y)

plot_db(p, bias)

plt.plot(X[y[:] == 0, 0], X[y[:] == 0, 1], "o")
plt.plot(X[y[:] == 1, 0], X[y[:] == 1, 1], "x")
plt.axis([-5, 5, -5, 5])

Looks like our SVM found a nice wide margin to separate the classes!

Making predictions

Now that we have our weights, let’s see if we can make predictions. Remember that a linear SVM classifier predicts the class of a new instance x with the following decision rule:

$$\hat{y} = \begin{cases} 0 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \\ 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \ge 0 \end{cases}$$

So in our case we can simply introduce new instances and plug them into this calculation!
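If you’d rather get class labels than raw scores, a tiny helper (the name predict_cls is my own, not part of the original code) could wrap this rule:

#wrap the decision rule: negative score -> class 0, otherwise class 1
def predict_cls(Xnew, p, bias):
    scores = Xnew @ p + bias
    return (scores >= 0).astype(int)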

#new instance!
Xnew = np.array([[-2, -2]])

For visual aid, let’s also plot it on our previous graph
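A minimal sketch for that, reusing plot_db from earlier (the red star marker is my own choice):

#overlay the new instance on the earlier decision boundary plot
plot_db(p, bias)
plt.plot(X[y == 0, 0], X[y == 0, 1], "o")
plt.plot(X[y == 1, 0], X[y == 1, 1], "x")
plt.plot(Xnew[:, 0], Xnew[:, 1], "r*", markersize=12)
plt.axis([-5, 5, -5, 5])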

Indeed, our new shiny red instance should be classified as negative. Let’s see if the math agrees.

#let's predict our new instance
yhat = p[0] * Xnew[:, 0] + p[1] * Xnew[:, 1] + bias
yhat

>>>array([-1.16310223])

Hurray! As the equation gives us a negative number, the instance is predicted as the negative class, as expected.

Understanding Quadratic Programming

Now we understand how a support vector machine can use Quadratic Programming to classify instances. Let’s see how Quadratic Programming works from a more generic point of view.

A QP solver minimizes quadratic functions subject to linear constraints. If math vocabulary isn’t your daily driver, let’s work through a simple example in Python to see what that means.

Note: a similar example is worked through by Kody Powell in this YouTube video.

Minimizing a random quadratic cost function

Say we want to minimize the following function:

$$f(x_1, x_2) = 0.5\,x_1^2 - 4x_1 + x_2^2 - 1.5\,x_2 + 42$$

Let’s plot this function to see how it looks visually:

x1r = np.arange(-10, 10, .1)
x2r = np.arange(-10, 10, .1)
x1, x2 = np.meshgrid(x1r, x2r)
cost = 0.5*x1**2 - 4*x1 + x2**2 - 1.5*x2 + 42

# Create a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1, x2, cost, cmap='viridis')
# Labels and title
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Cost')
ax.set_title('Cost Function Surface Plot')

A nice convex cost function indeed! Now, there are several ways to solve this more easily and faster than with QP (for instance by setting the gradient to zero, as sketched below), but let’s do it anyway.
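Since this is a smooth convex quadratic, setting the gradient to zero and solving the resulting linear system gives the minimum directly; a quick check (variable names here are my own) we can compare against the QP result later:

#the gradient is zero at the minimum: [x1 - 4, 2*x2 - 1.5] = 0
H_full = np.array([[1., 0.], [0., 2.]])  #second derivatives of the quadratic terms
f_lin = np.array([-4., -1.5])            #linear coefficients
print(np.linalg.solve(H_full, -f_lin))   #expect [4, 0.75]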

First let’s plot the cost function as a 2D contour so we can “look in the bowl”:

# Create a 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(x1, x2, cost, levels=20, cmap='viridis')
plt.colorbar(cp)

Now let’s prep some matrices to work with Quadratic Programming. Remember that the generalized QP problem formulation is as follows:

$$\underset{\mathbf{p}}{\operatorname{minimize}} \quad \frac{1}{2}\,\mathbf{p}^\top \mathbf{H}\,\mathbf{p} + \mathbf{f}^\top \mathbf{p} \qquad \text{subject to} \quad \mathbf{A}\,\mathbf{p} \le \mathbf{b}$$

Note: I’m using the same variable names as in the SVM section above, to keep things consistent with both the code and the referenced literature.

So first, let’s just minimize the formula mentioned earlier without any constraints. Notice how we juggle the coefficients into the different matrices; see if you can match the values with the original formula.

Note that the coefficients of the quadratic terms (0.5 and 1, respectively) need to be multiplied by two when plugged into the H matrix. This is something the cvxopt solver expects, but it might differ between solvers, so make sure to read the documentation.

# coefficients of the quadratics need to be multiplied by two!
H = np.array([[1, 0], [0, 2]], dtype=np.double)
# coefficients of the linear part, don't forget the sign!
f = np.array([-4, -1.5], dtype=np.double)
# Convert to cvxopt matrices
H = matrix(H)
f = matrix(f)
# Solve the problem
solution = solvers.qp(H, f)
# Extract the optimal values
x_optimal = np.array(solution['x']).flatten()
print("The optimal values are:", x_optimal)

>>> The optimal values are: [4. 0.75]

Let’s plot and see if it’s indeed at the expected minimum in the center of the “bowl”
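A minimal sketch for that plot, reusing the x1, x2 and cost grids from before:

# Mark the unconstrained QP minimum on the 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(x1, x2, cost, levels=20, cmap='viridis')
plt.scatter(x_optimal[0], x_optimal[1], c='red')
plt.colorbar(cp)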

Quite unsurprising, but a nice starting point to see what QP actually tries to do!
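If you want to convince yourself of the “multiply the quadratic coefficients by two” note, a quick numerical check (the arrays and test point below are just for illustration) shows the QP objective ½ pᵀHp + fᵀp matches the original cost up to the constant 42, which doesn’t change where the minimum lies:

#compare the QP objective with the original cost at an arbitrary point
Hn = np.array([[1., 0.], [0., 2.]])
fn = np.array([-4., -1.5])
pt = np.array([3., 1.])
print(0.5 * pt @ Hn @ pt + fn @ pt)                             # -8.0
print(0.5*pt[0]**2 - 4*pt[0] + pt[1]**2 - 1.5*pt[1] + 42 - 42)  # -8.0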

Adding an inequality constraint

Now for the cool part, because we can now give the solver some constraints which it should respect while finding a minimum. Say that we want to find the minimum, but subject to:

$$x_2 - x_1 \ge 0.5$$

So let’s plot this out to see what we’re saying.

constraint_mask = x2 - x1 >= 0.5
# Create a 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(x1, x2, cost, levels=20, cmap='viridis')
# Plot the region satisfying the constraint
plt.contourf(x1, x2, constraint_mask, levels=[-1, -0.5, 0], alpha=0.3, colors=['red'])
plt.colorbar(cp)
#plt.plot(x1r, x1r + .5)
plt.axis([-10,10,-10,10])

The red area is now off-limits to the solver: the minimum is no longer allowed to end up there.

If we want to plug this inequality constraint into our solver, we have to juggle some things around first. Inequality constraints need to be presented to the solver in less-than-or-equal form. In our case, this means flipping the signs:

$$x_1 - x_2 \le -0.5$$

With that done, let’s bring out the Python code. This time we’ve just added this constraint; see if you can again match up the values with the constraint formula above.

# coefficients of the quadratics need to be multiplied by two
H = np.array([[1, 0], [0, 2]], dtype=np.double)
# coefficients of the linear part, don't forget the sign!
f = np.array([-4, -1.5], dtype=np.double)
# coefficients of the inequality constraint
A = np.array([[1, -1]], dtype=np.double)
# the inequality constraint
b = np.array([-.5], dtype=np.double)
# Convert to cvxopt matrices
H = matrix(H)
f = matrix(f)
A = matrix(A)
b = matrix(b)
# Solve the problem
solution = solvers.qp(H, f, A, b)
# Extract the optimal values
x_optimal = np.array(solution['x']).flatten()
print("The optimal values are:", x_optimal)

>>> The optimal values are: [1.5 2. ]

Let’s plot the newfound minimum and verify that it respected our little constraint:

plt.figure(figsize=(8, 6))
cp = plt.contour(x1, x2, cost, levels=20, cmap='viridis')
plt.contourf(x1, x2, constraint_mask, levels=[-1, -0.5, 0], alpha=0.3, colors=['red'])
plt.scatter(x_optimal[0],x_optimal[1],c='red')
plt.colorbar(cp)
plt.axis([-10,10,-10,10])

Hurray! It indeed found a new minimum with the constraint taken into account.
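A quick check (my own addition) confirms the constraint is active at the solution, which makes sense: the unconstrained minimum [4, 0.75] lies in the red region, so the best the solver can do is sit right on the boundary:

#x2 - x1 should be >= 0.5; at the constrained optimum it is exactly 0.5
print(x_optimal[1] - x_optimal[0])  #0.5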

If you want, you can continue adding more inequality constraints, as well as equality constraints (they work basically the same; see the short sketch below). But since we arrived at QP through hard margin SVM classifiers, and their primal formulation only uses inequality constraints, we’ll now circle back to SVM.
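As a small sketch of my own: cvxopt’s qp solver also takes equality constraints as two extra arguments after the inequality pair. Forcing, say, x1 + x2 = 3 on top of the previous problem would look like this (Aeq and beq are my own names):

#sketch: same problem, plus the equality constraint x1 + x2 = 3
Aeq = matrix(np.array([[1., 1.]]))
beq = matrix(np.array([3.]))
solution = solvers.qp(H, f, A, b, Aeq, beq)
print(np.array(solution['x']).flatten())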

Minimizing || w ||, maximizing margin for SVMs

The way QP is used in SVMs is to minimize the slope of the decision function, which is the same as minimizing the Euclidean norm of the weights, || w ||. In practice we minimize ½ || w ||², which has the same minimizer but is a nice, differentiable quadratic.

So in this context, the weights to be minimized act as the p variables in the example above. First, let’s just plot the cost function with only ½ || w ||² in it.

# Define the range for w1 and w2
w1_range = np.linspace(-2, 2, 20)
w2_range = np.linspace(-2, 2, 20)
# Create a meshgrid
W1, W2 = np.meshgrid(w1_range, w2_range)
# Calculate the value of 1/2 * norm(w)**2
cost = 0.5 * (W1**2 + W2**2)
# Create a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W1, W2, cost, cmap='viridis')
# Labels and title
ax.set_xlabel('w1')
ax.set_ylabel('w2')
ax.set_zlabel('Cost')
ax.set_title('Cost Function Surface Plot')
# Create a 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(W1, W2, cost, levels=20, cmap='viridis')
plt.colorbar(cp)

There’s our basic bowl again. Let’s see if we can minimize this without constraints. This is of course totally pointless as our instances are not even taken into account, but it does help to relate QP to our SVM goal.

#Constructing matrices for QP
n_features = 2
#since the coefficients are 1, the H matrix is just the identity matrix (with an extra entry for the bias)
H = np.eye(n_features + 1)
#H[n_features, n_features] = 0 # Bias term should not be regularized
#we don't have any linear terms in our minimizing function so the f vector is zero
#note that if you want to go into soft margin SVM, you will have to change this!
f = np.zeros(n_features + 1)
#Solving QP problem
H = matrix(H)
f = matrix(f)
solution = solvers.qp(H, f)
vars = np.array(solution['x']).flatten()
# Create a 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(W1, W2, cost, levels=20, cmap='viridis')
plt.scatter(vars[0],vars[1],c='red')
plt.colorbar(cp)

Our minimum is dead center, right where we’d expect it. So the QP is working, but it’s still useless for our SVM. Let’s add the constraint.

#make negative classes -1 instead of 0
t = np.where(y == 0, -1, y)

n_samples, n_features = X.shape

#Extend feature matrix with ones for the bias term
X_ext = np.hstack([X, np.ones((n_samples, 1))])

#Constructing matrices for QP
H = np.eye(n_features + 1)
H[n_features, n_features] = 0 # Bias term should not be regularized
f = np.zeros(n_features + 1)

#Adding our constraint left side
A = -t[:, None] * X_ext
#Adding our constraint right side
b = -np.ones(n_samples)

#Solving QP problem
H = matrix(H)
f = matrix(f)
A = matrix(A)
b = matrix(b)
solution = solvers.qp(H, f, A, b)
vars = np.array(solution['x']).flatten()

#Extracting p and bias for easy use
p = vars[:n_features]
bias = vars[n_features]

This might be a tad confusing, as we’re actually using the X matrix as the constraint coefficients, as described in the “subject to” part of the hard margin linear SVM classifier objective:

$$t^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1 \quad \text{for each training instance } i$$

However, we have to flip it around to make it comply with the QP solver’s needs:

$$-t^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \le -1 \quad \text{for each training instance } i$$

Note that the X_ext variable has a 1 appended to every instance, so that the bias term b is absorbed into the weight vector p.
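We can also check numerically that the solver honored every constraint, and spot the support vectors as the samples where the constraint is active (a value of 1, up to solver tolerance); this check is my own addition:

#verify t * (w.x + b) >= 1 for all samples; support vectors sit on exactly 1
constraint_vals = t * (X @ p + bias)
print(constraint_vals.min())                                   #should be ~1
print(np.where(np.isclose(constraint_vals, 1, atol=1e-4))[0])  #indices of the support vectors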

Let’s plot the new minimum to verify that it indeed moved away from the unconstrained version:

# Create a 2D contour plot
plt.figure(figsize=(8, 6))
cp = plt.contour(W1, W2, cost, levels=20, cmap='viridis')
plt.scatter(p[0],p[1],c='red')
plt.colorbar(cp)

Finally, of course, we can verify whether the found weights actually line up with the slope, and thus the margin, from the first part of this article:

#plot_db was already defined earlier, so we can simply reuse it
plot_db(p, bias)
plt.plot(X[y[:] == 0, 0], X[y[:] == 0, 1], "o")
plt.plot(X[y[:] == 1, 0], X[y[:] == 1, 1], "x")
plt.axis([-5, 5, -5, 5])

Success!
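As one last sanity check: the width of the street between the two gutters should equal 2 divided by the norm of the weights, which you can compare against the plot:

#margin width between the two gutters = 2 / ||w||
print(2 / np.linalg.norm(p))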
