Understanding Why the Chain Rule Works

#calculus #differentiation

“Mathematical intuition is the ability to see the truth without first having gone through a formal process of reasoning.”
— Henri Poincaré

The goal of this post is to get you to think a bit more deeply about the chain rule.

The only prerequisite for reading this is understanding the power rule (i.e. $\dfrac{d}{dx}[x^n] = nx^{n-1}$). An intuitive understanding of differentiation is recommended.

The chain rule is a differentiation method for composite functions. For a function $f(u)$ where $u = g(x)$, it is defined as $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u) \cdot \dfrac{d}{dx}(u)$. In words: differentiate $f(u)$ with respect to $u$, treating $u$ as an independent variable, differentiate $g(x)$ with respect to $x$, and then multiply the two results.

For example, given the function $y = (2x^2 + 10)^2$, we can differentiate it as follows: set $g(x) = 2x^2 + 10$ and $u = g(x)$, so that $y = f(g(x))$ with $f(u) = u^2$.

Now, let’s use the chain rule: $\dfrac{d}{du}f(u) = 2u$, $\dfrac{d}{dx}(u) = 4x$ and $\dfrac{d}{dx}[f(u)] = 2u \cdot 4x$. Recall that $u = g(x) = 2x^2 + 10$, so we have $\dfrac{d}{dx}[f(u)] = 2(2x^2 + 10) \cdot 4x = 16x^3 + 80x$.
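If you would like to convince yourself of this result numerically, here is a minimal sketch in Python. The function names (`composite`, `chain_rule_derivative`, `numerical_derivative`) and the sample points are my own choices for illustration. It compares the chain rule answer, the expanded polynomial $16x^3 + 80x$, and a central-difference estimate of the slope.

```python
def g(x):
    # Inner function: u = g(x) = 2x^2 + 10
    return 2 * x ** 2 + 10


def f(u):
    # Outer function: f(u) = u^2
    return u ** 2


def composite(x):
    # f(g(x)) = (2x^2 + 10)^2
    return f(g(x))


def chain_rule_derivative(x):
    # d/du f(u) = 2u evaluated at u = g(x), multiplied by d/dx g(x) = 4x
    return 2 * g(x) * 4 * x


def numerical_derivative(func, x, h=1e-6):
    # Central-difference approximation of the slope of func at x
    return (func(x + h) - func(x - h)) / (2 * h)


for x in [-2.0, 0.5, 3.0]:
    expanded = 16 * x ** 3 + 80 * x
    estimate = numerical_derivative(composite, x)
    print(x, chain_rule_derivative(x), expanded, round(estimate, 3))
```

The three printed values on each row should agree up to a small numerical error in the last one.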


To understand the chain rule, you must first understand how function compositions work.

If we have functions $f(x)$ and $g(x)$, what does $f(g(x))$ mean?

It means: take the function $f(x)$ and evaluate it at $g(x)$ instead of $x$. We want to study the effect of this change.

How does $f(g(x))$ change with respect to a change in $x$? Another way to put it is this: how does $f(g(x))$ change with respect to $g(x)$, and how does $g(x)$ change with respect to $x$?

Let’s explore the behaviour of change in a function composition.

Example 1: $g(x) = 2x$ and $f(x) = x^2$ on the interval $x \in [-4, 4]$.

Let’s see the graph:

Let’s build the intuition for this by looking at the table of values of $g(x)$, $f(x)$, $f(g(x))$ and $\dfrac{d}{dx}[f(g(x))]$ on the interval.

| $x$ | $g(x) = 2x$ | $\dfrac{d}{dx}[g(x)] = 2$ | $f(x) = x^2$ | $\dfrac{d}{dx}[f(x)] = 2x$ | $f(g(x)) = 4x^2$ | $\dfrac{d}{dx}[f(g(x))] = 8x$ |
|---|---|---|---|---|---|---|
| -4 | -8 | 2 | 16 | -8 | 64 | -32 |
| -3 | -6 | 2 | 9 | -6 | 36 | -24 |
| -2 | -4 | 2 | 4 | -4 | 16 | -16 |
| -1 | -2 | 2 | 1 | -2 | 4 | -8 |
| 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| 1 | 2 | 2 | 1 | 2 | 4 | 8 |
| 2 | 4 | 2 | 4 | 4 | 16 | 16 |
| 3 | 6 | 2 | 9 | 6 | 36 | 24 |
| 4 | 8 | 2 | 16 | 8 | 64 | 32 |

The graph shows that the maximum value of $f(x)$ is $16$ while that of $f(g(x))$ is $64$. That’s a scale-up.

Can you see a pattern? Every value of $f(g(x))$ is $4$ times the value of $f(x)$.

We can show this formally by picking any two distinct pairs $(f(x), f(g(x)))$, each taken at the same $x$, and finding the slope (i.e. $\dfrac{y_2 - y_1}{x_2 - x_1}$). For example, using the pairs $(16, 64)$ and $(4, 16)$, we have $\dfrac{64 - 16}{16 - 4} = \dfrac{48}{12} = 4$. Using another two pairs, $(9, 36)$ and $(1, 4)$, we have $\dfrac{4 - 36}{1 - 9} = \dfrac{-32}{-8} = 4$.
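The slope check above can be repeated for every pair of points rather than just two. The sketch below is my own illustration (the names `pairs` and `slopes` are not from anywhere in particular). It collects the distinct $(f(x), f(g(x)))$ points from the table and computes the slope between each pair of them.

```python
from itertools import combinations


def g(x):
    return 2 * x


def f(x):
    return x ** 2


# Distinct (f(x), f(g(x))) points for integer x in [-4, 4].
# A set is used because x and -x produce the same point.
pairs = sorted({(f(x), f(g(x))) for x in range(-4, 5)})

# Slope between every two distinct points.
slopes = {
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in combinations(pairs, 2)
}

print(pairs)
print(slopes)
```

If the pattern holds, `slopes` contains a single value.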

This turns out to be easy to see because $f(g(x))$ is simply $4$ times $f(x)$, so it makes sense that this is the case. Other places where you can notice this are as follows:

  • The values of $\dfrac{d}{dx}[f(g(x))]$ are $4$ times the values of $\dfrac{d}{dx}[f(x)]$
  • Every value of $f(g(x))$ and $\dfrac{d}{dx}[f(g(x))]$ is divisible by $4$
  • The difference between any two values of $f(g(x))$, or any two values of $\dfrac{d}{dx}[f(g(x))]$, is divisible by $4$

This means that for every change in the value of $f(x)$ there’s a change $4$ times as large in the value of $f(g(x))$. This sounds weird, right? We just represented the rate of change between two somewhat independent functions as if they were dependent on each other. If this relationship were a function, it would be $h(x) = 4x$, so that $\dfrac{d}{dx}[h(x)] = 4$. This works because $f(g(x))$ is a composition of $f(x)$ and $g(x)$, and $f(x) = x^2$.

Let’s try this for $x$ and $g(x)$. But then, you might ask: is $g(x)$ a composition of $x$? Let’s see.

If we created a function $v(x) = x$, then $g(v(x))$ equals $g(x)$. This is true for every function of $x$. Having set this foundation, we can see that for every change in $x$, $g(x)$ changes by a factor of $2$.

Note: You won’t always have a constant factor between $f(x)$ and $f(g(x))$ or between their derivatives. But most times $g(x)$ affects how $f(g(x))$ changes compared to $f(x)$. It can make the change faster or slower, scale it up or down, or even do a combination of these along certain intervals.

The examples below show some of these other behaviours of change.

Example 2: $g(x) = -x$ and $f(x) = x^3$ on the interval $x \in [-3, 3]$.

| $x$ | $g(x) = -x$ | $\dfrac{d}{dx}[g(x)] = -1$ | $f(x) = x^3$ | $\dfrac{d}{dx}[f(x)] = 3x^2$ | $f(g(x)) = -x^3$ | $\dfrac{d}{dx}[f(g(x))] = -3x^2$ |
|---|---|---|---|---|---|---|
| -3 | 3 | -1 | -27 | 27 | 27 | -27 |
| -2 | 2 | -1 | -8 | 12 | 8 | -12 |
| -1 | 1 | -1 | -1 | 3 | 1 | -3 |
| 0 | 0 | -1 | 0 | 0 | 0 | 0 |
| 1 | -1 | -1 | 1 | 3 | -1 | -3 |
| 2 | -2 | -1 | 8 | 12 | -8 | -12 |
| 3 | -3 | -1 | 27 | 27 | -27 | -27 |

Example 3: $g(x) = x + 1$ and $f(x) = x^2 + x$ on the interval $x \in [-2, 2]$.

| $x$ | $g(x) = x + 1$ | $\dfrac{d}{dx}[g(x)] = 1$ | $f(x) = x^2 + x$ | $\dfrac{d}{dx}[f(x)] = 2x + 1$ | $f(g(x)) = x^2 + 3x + 2$ | $\dfrac{d}{dx}[f(g(x))] = 2x + 3$ |
|---|---|---|---|---|---|---|
| -2 | -1 | 1 | 2 | -3 | 0 | -1 |
| -1 | 0 | 1 | 0 | -1 | 0 | 1 |
| 0 | 1 | 1 | 0 | 1 | 2 | 3 |
| 1 | 2 | 1 | 2 | 3 | 6 | 5 |
| 2 | 3 | 1 | 6 | 5 | 12 | 7 |
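One way to tie the three examples back to the chain rule is to check that the last column of each table equals $f'(g(x)) \cdot g'(x)$ at every $x$ listed. The sketch below does that. The dictionary layout and the names `chain_rule` and `table_column` are my own assumptions for illustration; the derivatives are the ones that come from the power rule.

```python
# For each example: g, the derivative of g, the derivative of f,
# and the last table column (the derivative of f(g(x)) as a formula in x).
examples = {
    "Example 1": {
        "g": lambda x: 2 * x,
        "dg": lambda x: 2,
        "df": lambda u: 2 * u,
        "table_column": lambda x: 8 * x,
        "xs": range(-4, 5),
    },
    "Example 2": {
        "g": lambda x: -x,
        "dg": lambda x: -1,
        "df": lambda u: 3 * u ** 2,
        "table_column": lambda x: -3 * x ** 2,
        "xs": range(-3, 4),
    },
    "Example 3": {
        "g": lambda x: x + 1,
        "dg": lambda x: 1,
        "df": lambda u: 2 * u + 1,
        "table_column": lambda x: 2 * x + 3,
        "xs": range(-2, 3),
    },
}


def chain_rule(example, x):
    # Derivative of f evaluated at g(x), multiplied by the derivative of g at x
    return example["df"](example["g"](x)) * example["dg"](x)


results = {
    name: all(chain_rule(ex, x) == ex["table_column"](x) for x in ex["xs"])
    for name, ex in examples.items()
}
print(results)
```

Each entry in `results` reports whether the chain rule product matches the table column for that example.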

The chain rule answers the question: how do you express the rate of change of a function that depends on a variable when that variable is itself a function?

Why is $g(x)$ treated like a variable?

In the chain rule, why are we differentiating with respect to $g(x)$ as if it were a variable? Simply because it is.

The intuition is in the definition of how change behaves in a composite function: as $g(x)$ changes, $f(g(x))$ changes, and as $x$ changes, $g(x)$ changes.

$g(x)$ is a variable in $f(g(x))$, so it’s treated as such when we represent the change in $f(g(x))$.

Why is the change in $g(x)$ multiplied by the change in $f(g(x))$?

This is simply because as $x$ changes, $g(x)$ changes, and that’s a factor affecting the change in $f(g(x))$.
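Here is a sketch of that idea in symbols. It is not a full proof: I am assuming that $f$ and $g$ are differentiable and that $g(x + h) \neq g(x)$ for small nonzero $h$, so that the division below is allowed. The symbol $h$ here stands for a small change in $x$.

$$
\frac{f(g(x + h)) - f(g(x))}{h} = \frac{f(g(x + h)) - f(g(x))}{g(x + h) - g(x)} \cdot \frac{g(x + h) - g(x)}{h}
$$

The left side is the change in $f(g(x))$ per unit change in $x$. On the right, the first factor is the change in $f(g(x))$ per unit change in $g(x)$, and the second factor is the change in $g(x)$ per unit change in $x$. As $h$ shrinks towards $0$, the first factor approaches $\dfrac{d}{du}f(u)$ at $u = g(x)$ and the second approaches $\dfrac{d}{dx}(u)$.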

Why Multiplication (and Not Addition)?

Why is the chain rule $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u) \cdot \dfrac{d}{dx}(u)$ and not $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u) + \dfrac{d}{dx}(u)$?

We will look at this in two ways: the first is a bit hand-wavy and the other is more intuitive.

For the first one, the intuition lies in the definition: as $g(x)$ changes, $f(g(x))$ changes, and as $x$ changes, $g(x)$ changes. For clarity, let’s break this statement into three parts:

  • as $g(x)$ changes, $f(g(x))$ changes: $g(x)$ is treated as a variable
  • as $x$ changes, $g(x)$ changes
  • and: from Boolean logic, we know that “and” corresponds to multiplication, just as “or” corresponds to addition.

Let’s look at the more intuitive version: currency conversions!

I have Nigerian Naira (NGN) and I want to convert it to Pounds sterling (GBP). But there’s a little challenge: there are only two exchanges available, NGN/USD and USD/GBP. We would have to convert from NGN to USD and then from USD to GBP. The rates are 1 USD = NGN 1,500 and 1 GBP = 1.25 USD respectively. How do we convert NGN 20,000 to GBP?

First, we convert to USD: NGN 20,000 is 20,000 / 1,500 ≈ 13.33 USD. Then, USD to GBP is 13.33 / 1.25 ≈ 10.67 GBP. That is, NGN 20,000 equals about 10.67 GBP. To get the rate of NGN to GBP, we multiply both rates (i.e. 1/1500 × 1/1.25 = 1/1875).

The interesting thing is that we can turn these individual conversions into functions. The function for converting NGN to USD would be $g(x) = \dfrac{x}{1500}$ and the one for converting from USD to GBP would be $f(x) = \dfrac{x}{1.25}$. The most interesting part of this is that the function for converting from NGN to GBP is the function composition of $f(x)$ and $g(x)$! That is, $f(g(x)) = \dfrac{\dfrac{x}{1500}}{1.25} = \dfrac{x}{1875}$.
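The two conversions and their composition can be written out directly. This is a minimal sketch using the rates from the example; the function names `ngn_to_usd`, `usd_to_gbp` and `ngn_to_gbp` are my own.

```python
def ngn_to_usd(x):
    # g(x) = x / 1500, using the rate 1 USD = NGN 1,500
    return x / 1500


def usd_to_gbp(x):
    # f(x) = x / 1.25, using the rate 1 GBP = 1.25 USD
    return x / 1.25


def ngn_to_gbp(x):
    # f(g(x)): convert NGN to USD first, then USD to GBP
    return usd_to_gbp(ngn_to_usd(x))


print(round(ngn_to_gbp(20_000), 2))
```

Calling `ngn_to_gbp(1875)` should give a value very close to 1, which matches the combined rate of 1/1875.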

If we apply the chain rule to the composite function $f(u)$ where $u = g(x)$, we have $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u) \cdot \dfrac{d}{dx}(u) = \dfrac{1}{1.25} \cdot \dfrac{1}{1500} = \dfrac{1}{1875}$. That is our expected rate!

But if we change multiplication to addition, we have $\dfrac{d}{dx}[f(u)] = \dfrac{d}{du}f(u) + \dfrac{d}{dx}(u) = \dfrac{1}{1.25} + \dfrac{1}{1500} = \dfrac{1201}{1500}$. This is very wrong!
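To see the gap between the two candidates numerically, the sketch below measures how much GBP you gain per extra naira and compares that with the product and the sum of the two individual rates. The helper `rate` and the variable names are my own; since all three functions here are straight lines, an average rate over any step equals the derivative, up to floating-point rounding.

```python
import math


def g(x):
    # NGN to USD
    return x / 1500


def f(u):
    # USD to GBP
    return u / 1.25


def composite(x):
    # NGN to GBP
    return f(g(x))


def rate(func, x, step=1.0):
    # Average rate of change of func between x and x + step
    return (func(x + step) - func(x)) / step


x = 20_000
actual = rate(composite, x)                # GBP gained per extra NGN
product = rate(f, g(x)) * rate(g, x)       # chain rule: multiply the rates
total = rate(f, g(x)) + rate(g, x)         # the wrong candidate: add the rates

print(actual, product, total)
print(math.isclose(actual, product, rel_tol=1e-6))
print(math.isclose(actual, total, rel_tol=1e-6))
```

The product should land on the actual NGN to GBP rate, while the sum sits far away from it.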

I implore you to think of other ways, aside from currency conversions, in which this can be shown!


I hope you were able to think wide and far about function compositions and the chain rule!