Why does layerNormalizationLayer in Deep Learning Toolbox include the T dimension in the per-sample statistics?

Hello,
While implementing a ViT transformer in MATLAB, I found that layerNormalizationLayer includes the T dimension in the statistics calculated for each sample in the batch. This is problematic when implementing a transformer, since tokens correspond to the T dimension and reference implementations calculate the statistics separately for each token.
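To make the difference concrete, here is a minimal sketch (hypothetical sizes, plain MATLAB, no toolbox calls) contrasting per-token statistics with statistics pooled over C and T together, which is what the pre-R2023a layer computes:

```matlab
% Hypothetical sizes; X stands in for one 'CBT' activation (channel x batch x time).
C = 8; B = 2; T = 4;
rng(0);
X = randn(C, B, T);

% Reference ViT behavior: statistics over C only, one value per (batch, token).
muToken = mean(X, 1);            % 1 x B x T

% Pre-R2023a layerNormalizationLayer: statistics pooled over C and T,
% one value per batch element ("batch-excluded" in R2023a terms).
muPooled = mean(X, [1 3]);       % 1 x B

% The two only coincide when every token happens to share the same statistics.
```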
Thx

Accepted Answer

It seems MathWorks has listened and changed the behavior of layerNormalizationLayer in R2023a:
Starting in R2023a, by default, the layer normalizes sequence data over the channel and spatial dimensions. In previous versions, the software normalizes over all dimensions except for the batch dimension (the spatial, time, and channel dimensions). Normalization over the channel and spatial dimensions is usually better suited for this type of data. To reproduce the previous behavior, set OperationDimension to "batch-excluded".
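For reference, a short sketch of how the two behaviors are selected in R2023a and later, per the release note quoted above (layer construction only, no training):

```matlab
% R2023a+: the default normalizes over the channel and spatial dimensions
% only, which for 'CBT' sequence data gives per-token statistics.
lnNew = layerNormalizationLayer;

% To reproduce the pre-R2023a behavior (pool over everything except batch):
lnOld = layerNormalizationLayer(OperationDimension="batch-excluded");
```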

More Answers (1)

Perhaps you can fold your T dimension into the C dimension and use a groupNormalizationLayer instead, with the groups defined so that different T belong to different groups.

7 comments

I could, but this is a hack which is quite painful, as the whole model ends up changing formats back and forth (there are twelve blocks in the model that invoke layer normalization).
I don't see why that has to make it painful. Why couldn't you adopt a modular structure in your code like below? You could also make a reusable custom layer of your own, as we've discussed in earlier threads.
numTimes = 2000;
GN = groupNormalizerTimeIndep(numTimes);
layers = [layer1, layer2, GN, layer3, layer4, GN, layer5, ... ]
net = trainNetwork(sequences, layers);

function normalizerLayers = groupNormalizerTimeIndep(numTimes)
    pre = functionLayer(@reshapeForw);
    nlayer = groupNormalizationLayer(numTimes);
    post = functionLayer(@(z) reshapeBack(z, numTimes));
    normalizerLayers = [pre, nlayer, post];
end

function Xr = reshapeForw(X)
    [H, W, C, T, B] = size(X);
    Xr = reshape(X, H, W, C*T, B);
end

function X = reshapeBack(Xr, T)
    [H, W, ~, B] = size(Xr);
    X = reshape(Xr, H, W, [], T, B);
end
Thx.
As I wrote, it's doable, but a PITA. For example, what if the input is CBT? SCBT? CBTU? SCBTU? All these could be handled using finddim, but as I wrote it's a PITA.
In addition, the number of layers grows by 2 for every normalization layer. For a 12 level transformer this adds a whopping 24 layers. The performance hit is not insignificant.
PS There's a small bug in the above code: it should be
[H,W,C,B,T]=size(X);
since the canonical order is SSCBT, so the input first needs to be permuted appropriately and inverse-permuted after the reshape back.
Also, you need the order along the folded C dimension to correspond to TC and not CT; otherwise the groups would not each cover a single T. So, more permutes.
As I wrote, it's doable, but a PITA.
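Putting those two fixes together, a corrected version of the helpers might look like this (a sketch assuming 'SSCBT'-ordered input, i.e. H x W x C x B x T; the permute makes C vary fastest in the folded dimension, so each group of C channels covers exactly one time step):

```matlab
function Xr = reshapeForw(X)
    % X: H x W x C x B x T (canonical 'SSCBT' order)
    [H, W, C, B, T] = size(X);
    Xp = permute(X, [1 2 3 5 4]);       % -> H x W x C x T x B
    Xr = reshape(Xp, H, W, C*T, B);     % C varies fastest: block t holds time t
end

function X = reshapeBack(Xr, T)
    % Xr: H x W x (C*T) x B; undo the fold, then the permute.
    [H, W, ~, B] = size(Xr);
    Xp = reshape(Xr, H, W, [], T, B);   % -> H x W x C x T x B
    X = permute(Xp, [1 2 3 5 4]);       % -> H x W x C x B x T
end
```

With this ordering, groupNormalizationLayer(numTimes) partitions the C*T channels into T contiguous groups of C, one per time step, which is exactly the per-token normalization a ViT expects.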
Well, I don't think lamenting that will get you anywhere. If you think there is an alternative solution, you can wait for other posts, but if neither of us has found one, I doubt it's coming.
> In addition, the number of layers grows by 2 for every normalization layer. For a 12 level transformer this adds a whopping 24 layers. The performance hit is not insignificant.
I don't see why it would be. The functionLayers don't have any learnable parameters.
> Well, I don't think lamenting that will get you anywhere.
That said, I do agree it would be useful to have a more configurable normalization layer type, where you could explicitly specify which dimensions are to be included in the normalization.
Perhaps lamenting would cause someone from MathWorks to take notice and add the capability to the code base. Sigh ...
That happens sometimes, but usually you have to submit a formal enhancement request.
Version

R2022b