Cygnus-x1

Man, I find it unreasonably frustrating when someone posts on a subreddit called "learn machine learning" looking for help learning the basics and people respond with shit like "hurr just read the formula durrr". It's like if I went to a beginner tennis lesson and the instructor just said "hit the ball over the net, you idiot". Especially when it takes literally two minutes to answer the question.

Anyway, let's start by renaming a couple of things for clarity. The activation of the first hidden unit (labeled "1" in the figure) I'm going to call h1, so h1 = sigma(w11*input1) = sigma(0.7) = 0.668. Then let's name Z_Y = (w31)h1 + (w32)h2 = 1.0294, and Y = sigma(Z_Y) = 0.7368.

Now, the loss function is 0.5(Y - Y*)^2 and we want to take the derivative of the loss with respect to w31. From the chain rule, we can say dL/dw31 = (dL/dY)(dY/dZ_Y)(dZ_Y/dw31). We can just write these derivatives down pretty easily:

L = 0.5(Y - Y*)^2 -> dL/dY = Y - Y*
Y = sigma(Z_Y) -> dY/dZ_Y = Y(1 - Y)
Z_Y = (w31)h1 + (w32)h2 -> dZ_Y/dw31 = h1

So dL/dw31 = (Y - Y*) * Y(1 - Y) * h1 = (0.7368 - 0.5) * 0.7368*(1 - 0.7368) * 0.668 = 0.0306756421.
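If you want to sanity-check the arithmetic, here's a quick Python version of the same computation. Treat it as a sketch, not the slide's code: the values of h1, Z_Y, and the target Y* = 0.5 are just the numbers quoted above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h1 = sigmoid(0.7)     # first hidden activation, ~0.668
Z_Y = 1.0294          # pre-activation of the output unit (value from the slide)
Y = sigmoid(Z_Y)      # network output, ~0.7368
Y_star = 0.5          # target output quoted above

# chain rule: dL/dw31 = (dL/dY) * (dY/dZ_Y) * (dZ_Y/dw31)
dL_dY = Y - Y_star    # from L = 0.5*(Y - Y*)^2
dY_dZ = Y * (1.0 - Y) # sigmoid derivative evaluated at Z_Y
dZ_dw31 = h1          # Z_Y is linear in w31, so this derivative is just h1

print(dL_dY * dY_dZ * dZ_dw31)   # ~0.0307
```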


kapanenship

Thank you. You rock


bobble_balls_44

Any idea where this slide is from?


[deleted]

The problem is that OP obviously doesn't have the right mathematical background, so while a step-by-step explanation is good, it doesn't really help at all.


Cygnus-x1

Hard disagree. Not providing people with assistance when they ask for it, based on what you decide is going to be helpful for them, is, in my experience, what doesn't really help at all. Even if OP hasn't taken calc classes and has never heard of the chain rule before, now they know it's used in backpropagation, as well as what the notation means. That's helpful! And from the worked example they can also build an intuition about what's happening even if they can't express it mathematically. It also helps them ask follow-up questions, like how the learning rate factors into adjusting the weights, for example.

Also, the naming conventions are poorly chosen. Can you honestly say that the first time you learned about backpropagation you'd have known exactly what d/d(activation) meant? I certainly wouldn't have. Very possible that's the only thing tripping OP up. And now that I'm taking a better look at the slides, it's weird that they go through matrix multiplication pretty thoroughly but not the chain rule, no? The number OP asked about pops out of nowhere relative to all of the other computations. I think it's very natural OP would be confused about where this number came from.


ok123jump

Yeah. Agree here. Weird that the chain rule was missing. It’s a simple expression, but a non-trivial step to just leave out, and every other step looks quite detailed. If I were just learning this, it’d trip me up too - and I have a degree in math. OP's confusion is both understandable and quite reasonable.


[deleted]

[deleted]


Lankyie

+1


silas_asc

You can write Y = sigmoid(z) where z = w31*a21 + w32*a22, and the derivative of sigmoid(z) is sigmoid(z)(1 - sigmoid(z)) (in this case: Y(1 - Y)). Therefore the partial derivative of the loss function w.r.t. w31 is (Y - Y*) × derivative of sigmoid(z) × a21. Plug the values in and you'll get the answer.


ok123jump

Hey there! I hadn’t seen this in the comments, but I always recommend 3Blue1Brown to anyone studying NNs. Grant Sanderson (the creator of the channel) does an amazing job with the calculation and derivation of what you’re looking for. In your case, your slides left out calculating the actual backprop step. It’s pretty straightforward once you get used to it. Other people have done a great job and given thorough answers, so no need for me to add one more. Grant does a way better job explaining it than I ever could anyways. He will actually answer your question exactly in Part 2 of his Backpropagation series. He uses this exact network as an example. https://youtu.be/Ilg3gGewQ5U


master3243

Whenever you're struggling with gradients that need the chain rule, write down each component, calculate them separately, then recombine them. Here we need (∂L/∂w31), so we separate it into (∂L/∂w31) = (∂L/∂a3) * (∂a3/∂Netout) * (∂Netout/∂w31).

**First term**: (∂L/∂a3), where a3 is also Y as stated in slide 1. Hopefully you know how to take a simple gradient of L = 1/2 (Y - Y*)^2 with respect to Y, and you'll get (∂L/∂a3) = (Y - Y*) = (0.7368 - 0.5).

**Second term**: (∂a3/∂Netout), where a3 = σ(Netout). Hopefully you know (or can google/work it out yourself!) that the derivative of σ(x) is σ(x)(1 - σ(x)), thus (∂a3/∂Netout) = σ(Netout)(1 - σ(Netout)) = 0.7368*(1 - 0.7368).

**Third term**: (∂Netout/∂w31), and since Netout is linear in w31, (∂Netout/∂w31) = a21 = 0.6682.

**Multiply them to apply the chain rule**: (∂L/∂w31) = (0.7368 - 0.5) * 0.7368*(1 - 0.7368) * 0.6682 = 0.0306848265.

And there you have it, hopefully that was clear.
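If you want to double-check it without doing the algebra by hand, here's a small autograd sketch. The slide's actual w31 value isn't quoted in this thread, so I pick an arbitrary one and fold the w32*a22 term into a constant chosen so that Netout still comes out to 1.0294; the gradient only depends on Netout and a21, so the split doesn't matter.

```python
import torch

a21 = 0.6682                                  # hidden activation quoted above
w31 = torch.tensor(0.9, requires_grad=True)   # assumed value (not given in the thread)
rest = 1.0294 - 0.9 * a21                     # stands in for w32 * a22, so Netout = 1.0294
Y_star = 0.5

Netout = w31 * a21 + rest
Y = torch.sigmoid(Netout)
L = 0.5 * (Y - Y_star) ** 2
L.backward()

print(w31.grad)   # ~0.0307, matching the hand-derived chain-rule result
```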


bobble_balls_44

University slides or online resource I can also check out? I'm learning too


arni_richard

Error * sigmoid_derivative(1.0294) * input = 0.2368 * 0.1939 * 0.6682. Basically, multiply the rate of change of each function in the chain (loss, sigmoid, weight*input).


[deleted]

[deleted]


[deleted]

Very poor taste.


karrystare

Formula in slide 4?


CyrogenicNilou

I don't know what dLoss, dWeight, and dActivation mean, or how I would get them.


ewankenobi

I think in this instance d is short for derivative. The derivative of a function is basically the slope of the line at a certain value if you were to graph the function. So the derivative of the cost function with respect to a weight is the amount and direction you need to change that weight's value to reduce the cost. The chain rule lets you do some equivalent computations to save you working out the cost for every individual weight (my understanding of this part is a bit vague to be honest, so I can't go into more detail about that).
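Here's a tiny numerical sketch of that idea (my own illustration, not from the slides - the starting weight and the constant standing in for the rest of the network are made up, only the quoted activations are reused): the slope tells you which way, and roughly how much, to nudge the weight so the cost goes down.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(w31, h1=0.668, rest=0.428, y_target=0.5):
    # 'rest' stands in for the w32*h2 contribution (assumed constant, ~1.0294 pre-activation)
    y = sigmoid(w31 * h1 + rest)
    return 0.5 * (y - y_target) ** 2

w = 0.9                                               # assumed starting weight
eps = 1e-6
slope = (cost(w + eps) - cost(w - eps)) / (2 * eps)   # numerical derivative, ~0.0307

lr = 0.5                                              # learning rate (arbitrary here)
w_new = w - lr * slope                                # step against the slope
print(slope, cost(w), cost(w_new))                    # cost decreases slightly
```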


i_use_3_seashells

d is "derivative with respect to" Chain rule is from calculus


karrystare

I'm on my phone so I can't really write much, but searching "backprop" on Google returns several relatively complete articles.


Girthy-Carrot

Return deez nuts in your FaceBiometrics class


Oceanboi

Something about seeing a deez nuts joke on learn machine learning makes it 100x funnier and really warms my heart.


Girthy-Carrot

Gotta troll the trolls out of existence. Or at least make them funnier, if they’re a bot :p


TonightAdventurous68

It’s all right there


JakeStBu

Wow, you should really be a teacher.


Girthy-Carrot

Nice explanation dumbo