Forward prop is how the network generates a probability vector for your given inputs.
Let's use this diagram to visualize how it works, where B(L) is the bias of a layer, A(L-1) is the activation of the previous layer, and W(L) is the weight of the connection between them. Finally, A(L) is the activation of the next layer in the network, expressed in terms of the previous variables I mentioned. In the working (hidden) layers, we use the ReLU function, and in the output layer, we use the softmax function. Y is the expected output, C is the generated probability.
$$ Z(L) = A(L-1) \cdot W(L) + B(L) \\ A(L) = \mathrm{ReLU}(Z(L)) $$
Inspiration: 3B1B
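To make the notation concrete, here is a minimal NumPy sketch of one layer's forward step. The layer sizes and example values are my own illustrative choices, not something fixed by the network:

```python
import numpy as np

def relu(z):
    # ReLU: keep positive values, clamp negatives to 0
    return np.maximum(0, z)

# Illustrative sizes: the previous layer has 3 neurons, this layer has 2
A_prev = np.array([0.5, 0.0, 3.0])      # A(L-1): activations of the previous layer
W = np.random.randn(3, 2) * 0.01        # W(L): one column of weights per neuron in this layer
B = np.zeros(2)                         # B(L): one bias per neuron in this layer

Z = A_prev @ W + B                      # Z(L) = A(L-1)*W(L) + B(L)
A = relu(Z)                             # A(L) = ReLU(Z(L)) in the working layers
```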
This sequence occurs for every layer, and by using matrices we can perform the operation for a whole layer at once. Now, let's tackle backpropagation, something that twisted my mind for a bit.
// TODO
For a single forward pass, the output value of the third layer is too low: it's so close to 0 that the floating-point numbers become unstable. e.g. a probability of 4.15633838e-26 is too low, even for a bad guess.
First, we check the “squishing” function, softmax:
Initially, I had the following function:
```python
import numpy as np

def softmax(vector):
    # exponentiate, then normalize so the outputs sum to 1
    e = np.exp(vector)
    return e / e.sum()
```
but the output-layer probabilities are too close to 0 (e.g. 2.57699342e-20 can be unstable). Apparently, this is common when the exponential values (the numerators) get too large. To mitigate this, we subtract the max value from the vector, giving a function that looks like:
```python
def softmax(vector):
    # shift by the max so the largest exponent is exp(0) = 1, which avoids overflow
    e = np.exp(vector - np.max(vector))
    return e / e.sum()
```
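A quick sanity check of the two versions (the input values below are arbitrary, chosen only because they are large enough to overflow the naive version):

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive version: np.exp(1000) overflows to inf, so every probability comes out as nan
e = np.exp(logits)
print(e / e.sum())       # [nan nan nan] plus an overflow warning

# Shifted version (the softmax defined above): the largest exponent is exp(0) = 1,
# so nothing overflows
print(softmax(logits))   # [0.09003057 0.24472847 0.66524096]
```

Subtracting the max doesn't change the result, since shifting every entry by the same constant cancels out in the ratio; it only keeps the exponentials in a range the floats can handle.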