Classification Using Logistic Regression in Visual Basic

Machine Learning. What languages come to mind? R? Python? Matlab? Bet you didn’t think Visual Basic.


In this article we will discuss a simple binary implementation of logistic regression (one that only outputs “yes” or “no”) in Visual Basic and use it to make predictions based on data in spreadsheets.


Super Important Terms

Parameter Vector — The vector containing what are effectively the coefficients for your hypothesis (e.g. 5, 2 and 1 are contained in the parameter vector for the previous weight-height model). They are, by convention, represented with the Greek letter θ.

Feature — A measurable property of a phenomenon being observed (e.g. height and age are the features in the previous weight-height model).

Likelihood — The probability of getting a certain dataset using a certain hypothesis given a certain parameter vector.

Posterior Probability — The probability of getting a certain hypothesis with a certain parameter vector given a certain dataset.
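These two terms can be written down a little more concretely. Writing D for the dataset and θ for the parameter vector, and assuming independent examples:

```latex
% Likelihood: probability of the data given the parameters
L(\theta) = P(D \mid \theta)

% Posterior: probability of the parameters given the data (Bayes' rule)
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
```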

How does one go about turning a parameter vector and values into a hypothesis?

Using the weight predictor as an example, let

x = [1, h, a] and θ = [1, 5, 2]

We want to get w = 5h + 2a + 1. Hmm.

If we multiplied each element in x by its corresponding element in θ, we’d get

h(x) = 1*1 + 5*h + 2*a

Put in a more abstract way, with

x = [x0, x1, …, xn] (where x0 = 1) and θ = [θ0, θ1, …, θn]

, our hypothesis is

h(x) = θ0*x0 + θ1*x1 + … + θn*xn

This can be done with a for-loop or, if you’re feeling extra sneaky, matrix multiplication (xᵀθ).
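The for-loop version might look like this in Visual Basic — a sketch, assuming the parameter vector lives in a module-level list called thetas and that x(0) = 1 for the intercept:

```
Function Hypothesis(x As List(Of Double)) As Double

    'Dot product of the parameter vector and the example: θ0*x0 + θ1*x1 + ...
    Dim total As Double = 0
    For i = 0 To thetas.Count - 1
        total += thetas(i) * x(i)
    Next

    Return total

End Function
```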

How does one go about turning hypotheses into predictions?

We simply pass our hypothesis through the logistic (sigmoid) function and get out a prediction between 0 and 1. If the function is unsure, it’ll return a fraction: 0.9 means it’s pretty sure that it’s got a 1; 0.4 is a less sure 0.

Wait a second, how do we get the values for θ?

Part 1: How wrong are we?

For each vector θ, we need to work out how wrong the model based upon the values is. The error metric we’re going to use is the negative log likelihood.

We define it thus:

cost(h(x)) = −log(h(x)) when y = 1
cost(h(x)) = −log(1 − h(x)) when y = 0

Plotted over h(x), these functions have asymptotes at h(x) = 0 and h(x) = 1, and they strongly penalise wrong predictions made with high certainty (when y = 0 but we predicted something like 0.9). We can combine the expressions thus:

J(θ) = −(1/N) Σᵢ [ yᵢ log(h(xᵢ)) + (1 − yᵢ) log(1 − h(xᵢ)) ]

Where N is the number of examples in x, yᵢ is the ith value in y and xᵢ is the ith example in x.

Note that when yᵢ = 0, the first term in the sum reduces to 0; when yᵢ = 1, the second term reduces to 0. We take a log transform to make the function easier to handle — small numbers that might experience underflow become less small; multiplication becomes addition.
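Concretely, assuming independent examples, the log transform takes us from a product of per-example probabilities to the sum of per-example costs:

```latex
L(\theta) = \prod_{i=1}^{N} h(x_i)^{y_i}\,\bigl(1 - h(x_i)\bigr)^{1 - y_i}

-\log L(\theta) = -\sum_{i=1}^{N} \Bigl[\, y_i \log h(x_i) + (1 - y_i)\log\bigl(1 - h(x_i)\bigr) \Bigr]
```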

Part 2: How can we stop being wrong?

To stop being wrong we have to minimise the errors we make. One of the most intuitive methods of doing this makes use of elementary calculus. The negative log likelihood function for logistic regression can be shown to be convex — that is, there exists a single global minimum.

We thus take partial derivatives with respect to each θ, multiply them by a learning rate (step size) α and then subtract them from their respective original values:

θf := θf − α ∂J/∂θf

where f represents a certain feature. Looping over this many times essentially moves us closer to the optimum — like a skateboarder rolling down a valley — to the point where

∂J/∂θf = 0

for all θ.

In the case of our logistic function σ(t) = 1/(1 + e^(−t)), the derivative can be easily obtained from the chain rule, using the handy identity σ′(t) = σ(t)(1 − σ(t)):

∂J/∂θf = (1/N) Σᵢ (h(xᵢ) − yᵢ) xᵢ[f]

We thus arrive at the following update function:

θf := θf − (α/N) Σᵢ (h(xᵢ) − yᵢ) xᵢ[f]

Where f is a feature, α is the learning rate and xᵢ[f] is the value of feature f in example xᵢ.

In summary, our algorithm takes the following form:

Repeat:
    For all features f:
        Compute the partial derivative ∂J/∂θf
        Update θf := θf − α ∂J/∂θf
    If gradient is within reasonable threshold, then exit loop

Optimising the Optimiser

  • Record the squared partial derivatives for each feature in an array.
  • Keep only the 10 most recent entries, discarding the oldest.
  • Divide the learning rate for each feature by the square root of the sum of the last 10 squared partial derivatives.
This dynamically changes the learning rate as we approach convergence.

We can speed up convergence further by normalising all the features. We could also take a stochastic approach, updating the parameter vector after every training example we go over, though this converges less smoothly.
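Normalisation could be sketched like this in Visual Basic — a hypothetical helper, assuming exampleInputs is a List(Of List(Of Double)) whose column 0 is the intercept:

```
Sub NormaliseFeatures()

    'Rescale each feature (skipping the intercept column) to zero mean and unit variance
    For j = 1 To featureCount - 1

        'Compute the mean of feature j
        Dim mean As Double = 0
        For i = 0 To exampleCount - 1
            mean += exampleInputs(i)(j)
        Next
        mean /= exampleCount

        'Compute the standard deviation of feature j
        Dim variance As Double = 0
        For i = 0 To exampleCount - 1
            variance += Math.Pow(exampleInputs(i)(j) - mean, 2)
        Next
        Dim stdDev As Double = Math.Sqrt(variance / exampleCount)

        'Rescale, guarding against constant features
        If stdDev > 0 Then
            For i = 0 To exampleCount - 1
                exampleInputs(i)(j) = (exampleInputs(i)(j) - mean) / stdDev
            Next
        End If

    Next

End Sub
```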

Need even more speed?

Using the code

Our logistic function and error cost function are implemented below.


Function LogisticFunction(x As Double) As Double

    'Compute the value of the logistic function given input x
    'The logistic (sigmoid) function has an S shape and is given by 1/(1 + e^(-t))
    'e = 2.71... (Euler's number)
    Return (1 / (1 + Math.Exp(-x)))

End Function

Function LogisticCost() As Double

    Dim cost As Double = 0

    'Loop through all the training examples
    For i = 0 To exampleCount - 1

        Dim exampleCost As Double

        If exampleOutputs(i) = 0 Then
            exampleCost = Math.Log(1 - LogisticFunction(Hypothesis(exampleInputs(i))))
        Else
            exampleCost = Math.Log(LogisticFunction(Hypothesis(exampleInputs(i))))
        End If

        cost += exampleCost

    Next

    'Take the negative average (each exampleCost is <= 0, so this is non-negative)
    cost *= -1 / exampleCount

    Return cost

End Function

And our optimisation algorithm is implemented here:


Sub Minimise()

    'Simple implementation of AdaGrad bc no way in hell am I implementing BFGS or anything that requires the Hessian
    'gradient = X'(h(x) - y)

    Dim learningRate As New List(Of Double)
    Dim derivative As New List(Of Double)
    Dim pastGradientSquares As New List(Of List(Of Double))
    Dim sumOfPastGradientSquares As New List(Of Double)

    'Initialise all the variables (0.01 is an arbitrary starting learning rate)
    For i = 0 To featureCount - 1
        learningRate.Add(0.01)
        derivative.Add(0)
        pastGradientSquares.Add(New List(Of Double))
        sumOfPastGradientSquares.Add(0)
    Next

    lastCost = LogisticCost()
    deltaCost = 100

    'Do this until we converge (the derivative = 0ish)
    While (Math.Abs(deltaCost) > desiredDelta)

        Dim difference As New List(Of Double)
        Dim h As New List(Of Double)

        'Reset the derivative
        For i = 0 To derivative.Count - 1
            derivative(i) = 0
        Next

        'Loop through the examples and populate the list h with hypotheses
        For i = 0 To exampleCount - 1
            h.Add(LogisticFunction(Hypothesis(exampleInputs(i))))
        Next

        'Get the difference
        For i = 0 To exampleCount - 1
            difference.Add(h(i) - exampleOutputs(i))
        Next

        'Multiply by the features to accumulate the gradient
        For i = 0 To exampleCount - 1
            For j = 0 To featureCount - 1
                derivative(j) += difference(i) * exampleInputs(i)(j)
            Next
        Next

        For j = 0 To featureCount - 1

            'Update the list of previous squared derivatives
            pastGradientSquares(j).Add(Math.Pow(derivative(j), 2))

            'If we exceed 10 entries, just remove the oldest one
            If pastGradientSquares(j).Count > 10 Then
                pastGradientSquares(j).RemoveAt(0)
            End If

            'Update the sums
            sumOfPastGradientSquares(j) = Sum(pastGradientSquares(j))

        Next

        'Multiply by the learning rate for this specific feature and get new thetas
        For i = 0 To featureCount - 1
            thetas(i) -= (learningRate(i) / (Math.Sqrt(sumOfPastGradientSquares(i) + 0.00000001))) * derivative(i)
        Next

        Dim currentCost = LogisticCost()
        deltaCost = currentCost - lastCost

        'We need to look like we're doing something so here's something to keep users occupied
        Console.WriteLine("Training... " + LogisticCost().ToString + " " + derivative(0).ToString)

        lastCost = currentCost

    End While

End Sub

We can quickly get predictions like this:


Function Predict(x)

Return LogisticFunction(Hypothesis(x))

End Function
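For example, on the exam dataset described below — with hypothetical feature values and a hypothetical column ordering of intercept, hours studied, previous performance and relationship status:

```
'Hypothetical pupil: intercept, hours studied, previous performance, in a relationship
Dim pupil As New List(Of Double)({1, 12, 0.75, 0})

'Prints a value between 0 and 1; above 0.5 reads as a predicted pass
Console.WriteLine(Predict(pupil))
```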

And see how well our classifier is doing like this:


Sub Check()

    Dim correctlyClassified As Double = 0

    'Loop through results and see how many were classified correctly
    For i = 0 To exampleInputs.Count - 1

        Dim result As String
        If Predict(exampleInputs(i)) > 0.5 Then
            result = "Positive"
            If exampleOutputs(i) = 1 Then
                correctlyClassified += 1
            End If
        Else
            result = "Negative"
            If exampleOutputs(i) = 0 Then
                correctlyClassified += 1
            End If
        End If

        Dim trueClassification As String
        If exampleOutputs(i) = 1 Then
            trueClassification = "Positive"
        Else
            trueClassification = "Negative"
        End If

        Console.WriteLine(result + " and was actually " + trueClassification)

    Next

    Console.WriteLine("Correctly Classified " + correctlyClassified.ToString + " out of " + exampleInputs.Count.ToString)

End Sub


The resulting predictions looked a bit like this:

Another dataset was used to predict whether pupils would pass their exams based upon the hours they spend studying, their previous performance and whether or not they’re in a relationship.

The resulting predictions were:

Other Things to Look Into


Multiclass classification

For instance, with three classes (cat, dog and table), we train one binary classifier per class. An image of a cat might score 0.9 on the cat classifier, 0.4 on the dog classifier and 0.001 on the table classifier. We then select the classifier with the 0.9 score and return “it’s a cat”.



- Peter the Mentor, age 16

Originally published at Acorn Aspirations.
