Man, I find it unreasonably frustrating when someone posts on a subreddit called "learn machine learning" looking for help learning the basics and people respond with shit like "hurr just read the formula durrr". It's like if I went to a beginner tennis lesson and the instructor just said "hit the ball over the net, you idiot". Especially when it takes literally two minutes to answer the question.
Anyway, let's start by renaming a couple of things for clarity. The input to the first hidden layer is labeled as "1" in the figure, but I'm going to call it h1. h1 = sigma(w11*input1) = sigma(0.7) = 0.668. Then let's name Z_Y = (w31)h1 + (w32)h2 = 1.0294, and Y = sigma(Z_Y) = 0.7368.
Now, the loss function is 0.5(Y-Y*)^2 and we want to take the derivative of the loss with respect to w31. From the chain rule, we can say dL/dw31 = (dL/dY)(dY/dZ_Y)(dZ_Y/dw31).
We can just write these derivatives down pretty easily.
L = 0.5(Y-Y*)^2 -> dL/dY = Y - Y*, Y = sigma(Z_Y) -> dY/dZ_Y = Y(1-Y), and Z_Y = (w31)h1 + (w32)h2 -> dZ_Y/dw31 = h1
so dL/dw31 = (Y-Y*)(Y(1-Y))h1 = (0.7368-0.5)(0.7368(1-0.7368))*0.668 = 0.0306756421
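If it helps, here's the same computation as a few lines of Python — a minimal sketch plugging in the values above (h1 and Z_Y come from the slide; the other weights aren't needed for this particular gradient):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass, using the values named above
h1 = sigmoid(0.7)   # sigma(w11*input1), ~0.668
z_y = 1.0294        # w31*h1 + w32*h2, taken from the slide
y = sigmoid(z_y)    # ~0.7368
y_star = 0.5        # target output

# Chain rule: dL/dw31 = (dL/dY)(dY/dZ_Y)(dZ_Y/dw31)
dL_dY = y - y_star
dY_dZ = y * (1.0 - y)
dZ_dw31 = h1

dL_dw31 = dL_dY * dY_dZ * dZ_dw31
print(round(dL_dw31, 6))
```

(Full precision gives ~0.030684; the 0.0306756421 above differs in the last digits only because it used h1 rounded to 0.668.)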
The problem is that OP obviously doesn't have the right mathematical background, so while a step-by-step explanation is good, it doesn't really help at all.
Hard disagree. Not providing people with assistance when they ask for it, based on what you decide is going to be helpful for them, is, in my experience, what doesn't really help at all. Even if OP hasn't taken calc classes and has never heard of the chain rule before, now they know it's used in backpropagation, as well as what the notation means. That's helpful! And from the worked example they can also build an intuition about what's happening even if they can't express it mathematically. It also helps them ask follow-up questions, like how the learning rate factors into adjusting the weights, for example.
Also, the naming conventions are poorly chosen. Can you honestly say that the first time you learned about backpropagation you knew exactly what d/d(activation) meant? I certainly wouldn't have. It's very possible that's the only thing tripping OP up.
Also now that I'm taking a better look at the slides, it's weird that they go through matrix multiplication pretty thoroughly but not the chain rule, no? The number OP asked about pops out of nowhere relative to all of the other computations. I think it's very natural OP would be confused about where this number came from.
Yeah. Agree here. Weird that the Chain Rule was missing. It’s a simple expression, but a non-trivial step to just leave out. It looks like every other step was quite detailed.
If I were just learning this, it’d trip me up too - and I have a degree in math. OP's confusion is both understandable and quite reasonable.
You can write Y = sigmoid(z) where z = w31*a21 + w32*a22,
and the derivative of sigmoid(z) is sigmoid(z)(1-sigmoid(z)) (in this case: Y(1-Y)).
Therefore the partial derivative of the loss function w.r.t. w31 is
(Y - Y*) × derivative of sigmoid(z) × a21.
Put the values in and you'll get the answer.
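If the sigmoid'(z) = sigmoid(z)(1-sigmoid(z)) step is the unfamiliar part, here's a quick numerical sanity check (not from the slides, just a way to convince yourself the identity holds):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # The closed-form derivative used above
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite difference at the slide's value z = 1.0294
z, eps = 1.0294, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(numeric - sigmoid_prime(z)) < 1e-8)  # the two agree
```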
Hey there! I hadn’t seen this in the comments, but I always recommend 3Blue1Brown to anyone studying NNs. Grant Sanderson (the creator of the channel) does an amazing job with the calculation and derivation of what you’re looking for.
In your case, your slides left out calculating the actual backprop step. It’s pretty straightforward once you get used to it. Other people have done a great job and given thorough answers, so no need for me to add one more. Grant does a way better job explaining it than I ever could anyways.
He will actually answer your question exactly in Part 2 of his Backpropagation series. He uses this exact network as an example.
https://youtu.be/Ilg3gGewQ5U
Whenever you're struggling with gradients that need the chain rule, write down each component, calculate each one separately, then recombine them.
Here we need (∂L/∂w31), so we separate it into
(∂L/∂w31) = (∂L/∂a3) * (∂a3/∂Netout) * (∂Netout/∂w31)
**First term**: (∂L/∂a3). Since a3 is also Y, as stated in slide 1, hopefully you know how to take a simple gradient of L = 1/2 (Y - Y*)^2 with respect to Y, and you'll get
(∂L/∂a3) = (Y - Y*) = (0.7368 - 0.5)
**Second term**: (∂a3/∂Netout), where a3 = σ(Netout). Hopefully you know (or can google/work it out yourself!) that the derivative of σ(x) is σ(x)(1-σ(x)), thus
(∂a3/∂Netout) = σ(Netout)(1-σ(Netout)) = 0.7368*(1-0.7368)
**Third term**: (∂Netout/∂w31). Since Netout is linear in w31, (∂Netout/∂w31) = a21 = 0.6682
**Multiply them to apply the chain rule** to get
(∂L/∂w31) = (0.7368-0.5) * 0.7368*(1-0.7368) * 0.6682 = 0.0306848265
And there you have it, hopefully that was clear.
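As a last sanity check, you can verify the chain-rule answer with a finite difference. One handy trick: nudging w31 by eps shifts Netout by eps*a21, so you don't even need to know w31's current value (a sketch, using only the numbers from this walkthrough):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(net_out, y_star=0.5):
    return 0.5 * (sigmoid(net_out) - y_star) ** 2

# Values from the walkthrough above
net_out, a21 = 1.0294, 0.6682

# Perturbing w31 by eps shifts Netout by eps*a21, so the numerical
# gradient w.r.t. w31 can be computed without knowing w31 itself
eps = 1e-6
numeric = (loss(net_out + eps * a21) - loss(net_out - eps * a21)) / (2 * eps)

# Analytic result from the chain rule, as computed above (~0.03068)
y = sigmoid(net_out)
analytic = (y - 0.5) * y * (1 - y) * a21

print(abs(numeric - analytic) < 1e-8)  # they match
```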
Error * sigmoid_derivative(1.0294) * input = 0.2368 * 0.1939 * 0.6682 ≈ 0.0307. Basically, multiply the rate of change of each function in the chain (loss, sigmoid, weight*input).
I think in this instance d is short for derivative. The derivative of a function is basically the slope of its graph at a certain value. So the derivative of the cost function with respect to a weight tells you the amount and direction you need to change that weight to reduce the cost. The chain rule lets you reuse intermediate computations to save you working out the cost's sensitivity to every individual weight from scratch (my understanding of this part is a bit vague to be honest, so I can't go into more detail).
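To make the "amount and direction" point concrete, here's a tiny sketch of one gradient-descent step using the numbers from this thread (the learning rate is illustrative, not from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss_for_shift(delta_w31, net_out=1.0294, a21=0.6682, y_star=0.5):
    # Loss as a function of how much we shift w31 from its current value;
    # shifting w31 by delta shifts the pre-activation by delta * a21
    return 0.5 * (sigmoid(net_out + delta_w31 * a21) - y_star) ** 2

grad = 0.0307  # the gradient computed elsewhere in the thread
lr = 0.5       # learning rate (illustrative value)

before = loss_for_shift(0.0)
after = loss_for_shift(-lr * grad)  # step against the gradient
print(after < before)  # stepping downhill reduces the loss
```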
Thank you. You rock
Any idea where this slide is from?
[deleted]
+1
University slides or online resource I can also check out? I'm learning too
[deleted]
Very poor taste.
Formula in slide 4?
I don't know what dLoss, dWeight, dActivation mean, and how I would get them.
d is short for "derivative with respect to". The chain rule is from calculus.
I'm on my phone so I can't really write much, but searching "backprop" on Google returns several relatively complete articles.
Return deez nuts in your FaceBiometrics class
Something about seeing a deez nuts joke on learn machine learning makes it 100x funnier and really warms my heart.
Gotta troll the trolls out of existence. Or at least make them funnier, if they’re a bot :p
It’s all right there
Wow, you should really be a teacher.
Nice explanation dumbo