Why is the derivative of a scalar with respect to a vector a vector and not a scalar?

$\begingroup$

I'm really confused about matrix calculus, especially partial derivatives. When do we need to sum partial derivatives to get a total derivative, and when is the derivative a vector of partial derivatives? I struggle to tell the two apart. Here is an example:

$L$ is a scalar, and $\mathbf{o}$ and $\mathbf{y}$ are both vectors of size $K$.

$$L = -\sum_{k} \log(y_k)$$

$$\mathbf{y} = \text{softmax}(\mathbf{o})$$

So to get the derivative of $L$ with respect to $\mathbf{o}$, we need to sum the partial derivatives with respect to the components of $\mathbf{y}$, which gives the total derivative. That is as much as I understood from reading about multivariate calculus:

$$\frac{\partial L}{\partial \mathbf{o}} = \frac{\partial L}{\partial \mathbf{y}}\frac{\partial \mathbf{y}}{\partial \mathbf{o}} = \sum_{k}\frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial \mathbf{o}} = -\sum_{k} \frac{1}{y_k} \frac{\partial y_k}{\partial \mathbf{o}}$$

However, $\frac{\partial L}{\partial \mathbf{o}}$ then seems to be a vector of the partial derivatives of $L$ with respect to every component of $\mathbf{o}$, i.e.:

$$ \frac{\partial L}{\partial \mathbf{o}} = \left< \frac{\partial L}{\partial o_1}, \frac{\partial L}{\partial o_2}, \ldots, \frac{\partial L}{\partial o_K} \right> $$

But shouldn't the derivative be the sum of the partial derivatives with respect to all components of $\mathbf{o}$, to get the total derivative?

i.e. shouldn't the solution be:

$$\frac{\partial L}{\partial \mathbf{o}} = \frac{\partial L}{\partial \mathbf{y}}\frac{\partial \mathbf{y}}{\partial \mathbf{o}} = -\sum_{k} \frac{1}{y_k} \sum_{i} \frac{\partial y_k}{\partial o_i}$$

and then it's just a scalar?

$\endgroup$

1 Answer

$\begingroup$

The derivative of a function $f : R^n \to R^m$ is the linearization (i.e. approximation by a linear function) of the function around the given point. Therefore, it must still be a function $R^n \to R^m$, but linear. This is represented by a matrix in $R^{m \times n}$. If the output dimension is $m = 1$, i.e. $f$ is a scalar function, that matrix has the shape of a row vector in $R^{1 \times n}$.
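Written out, the linearization looks as follows. The symbols $x$, $h$ and $J$ are names introduced here for illustration and do not appear in the question:

$$f(x + h) \approx f(x) + J\,h, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}, \qquad J \in R^{m \times n}$$

For $m = 1$ the matrix $J$ has a single row, and a sum over the input components only appears once that row is applied to a displacement $h$:

$$J = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right], \qquad f(x + h) \approx f(x) + \sum_{i} \frac{\partial f}{\partial x_i}\, h_i$$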

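In your case, $L : R^n \to R$ (with $n = K$), so $\frac{\partial L}{\partial \mathbf{y}}$ is $1 \times n$, while $\frac{\partial \mathbf{y}}{\partial \mathbf{o}}$ is $n \times n$. The first sum $\sum_k$ that you wrote is the "row-vector $\times$ matrix" multiplication of those two, so the result is again $1 \times n$: its $i$-th entry is $\frac{\partial L}{\partial o_i}$. The extra sum $\sum_i$ in your last formula is not part of the chain rule; it would just add up the entries of that row vector.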
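The shapes can also be checked numerically. Below is a minimal sketch in plain Python, assuming the $L$ and softmax from the question. The function names, the input `o`, and the step size `eps` are chosen here for illustration. It builds the $K \times K$ softmax Jacobian, multiplies it by the row vector $\frac{\partial L}{\partial \mathbf{y}}$, and compares the result with a finite-difference gradient.

```python
import math

def softmax(o):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o):
    # L = -sum_k log(y_k), with y = softmax(o), as in the question.
    return -sum(math.log(y_k) for y_k in softmax(o))

def softmax_jacobian(o):
    # K x K matrix with entry [k][i] = dy_k/do_i = y_k * (delta_ki - y_i).
    y = softmax(o)
    K = len(o)
    return [[y[k] * ((1.0 if k == i else 0.0) - y[i]) for i in range(K)]
            for k in range(K)]

def grad_chain_rule(o):
    # Row vector (1 x K) times matrix (K x K): the sum runs over k only,
    # leaving one entry per input component o_i.
    y = softmax(o)
    J = softmax_jacobian(o)
    dL_dy = [-1.0 / y_k for y_k in y]
    K = len(o)
    return [sum(dL_dy[k] * J[k][i] for k in range(K)) for i in range(K)]

def grad_finite_diff(o, eps=1e-6):
    # Central differences, perturbing one input component at a time.
    grad = []
    for i in range(len(o)):
        up = list(o)
        up[i] += eps
        down = list(o)
        down[i] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

o = [0.5, -1.0, 2.0]  # arbitrary example input, K = 3
g_chain = grad_chain_rule(o)
g_fd = grad_finite_diff(o)
print(len(g_chain))  # number of entries in the gradient
print(all(abs(a - b) < 1e-5 for a, b in zip(g_chain, g_fd)))
```

The chain-rule result has one entry per component of $\mathbf{o}$. Summing those entries over $i$ as well would collapse them into a single number and discard the per-component information.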

$\endgroup$
