machine-learning, pytorch

Why does nn.Linear(in_features, out_features) use a weight matrix of shape (out_features, in_features) in PyTorch?


I’m trying to understand why PyTorch’s nn.Linear(in_features, out_features) layer stores its weight matrix with shape (out_features, in_features) instead of (in_features, out_features).

From a basic matrix multiplication perspective, it seems like having the shape (in_features, out_features) would eliminate the need for transposing the weight matrix during multiplication. For example, with an input tensor x of shape (batch_size, in_features), the multiplication with a weight matrix of shape (in_features, out_features) would result directly in an output of shape (batch_size, out_features), without requiring the transpose operation.

However, PyTorch defines the weight matrix as (out_features, in_features), meaning it gets transposed during the forward pass. What is the benefit of this design? How does it align with the broader principles of linear algebra and neural network implementations? Are there any efficiency or consistency considerations behind this choice that make it preferable?
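For concreteness, here is a small check of the shapes I am describing (the layer sizes are just an example):

    import torch
    import torch.nn as nn

    layer = nn.Linear(in_features=3, out_features=4)
    print(layer.weight.shape)  # torch.Size([4, 3]) -- (out_features, in_features)

    x = torch.randn(2, 3)      # (batch_size, in_features)
    out = layer(x)             # internally computed as x @ layer.weight.T + layer.bias
    print(out.shape)                                             # torch.Size([2, 4])
    print(torch.allclose(out, x @ layer.weight.T + layer.bias))  # True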


Solution

  • A transpose is not actually required, because you can multiply by the matrix from either the right or the left. In particular, multiplying from the left, (out_features, in_features) * (in_features) directly gives an (out_features) vector.

    The benefit lies deep in computer architecture. A matrix of shape (out_features, in_features) is stored row by row in memory; e.g., for 3 in_features, it looks like this in flat memory:

    [in0, in1, in2], [in0, in1, in2], ..., [in0, in1, in2]
    

    This allows memory to be accessed in a cache-friendly way: you read 3 consecutive numbers to compute the first output feature, then the next 3 consecutive numbers to compute the second output feature, and so on (see the sketch below).
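
    A minimal sketch of both points, with illustrative layer sizes: each row of the weight matrix is contiguous in memory, and multiplying from the left amounts to one dot product per contiguous row.

        import torch
        import torch.nn as nn

        layer = nn.Linear(in_features=3, out_features=4)
        w = layer.weight                  # shape (4, 3), stored row by row

        print(w.stride())                 # (3, 1): each row of 3 weights is contiguous

        # Each output feature is a dot product with one contiguous row of w:
        x = torch.randn(3)                # a single input vector, shape (in_features,)
        out = torch.stack([torch.dot(w[i], x) for i in range(4)]) + layer.bias
        print(torch.allclose(out, layer(x)))  # True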