Dear @Hana Ahmed,
Thank you for sharing your detailed analysis. Your theoretical understanding is absolutely correct, and my testing confirms your observations. After replicating your experiment, I obtained identical results:
% Create complete networks to avoid initialization error
layers1 = [
    featureInputLayer(3136, 'Name', 'input')
    selfAttentionLayer(4, 784, 'Name', 'selfattention')
    ];

layers2 = [
    featureInputLayer(3136, 'Name', 'input')
    selfAttentionLayer(8, 392, 'Name', 'selfattention')
    ];

net1 = dlnetwork(layers1);
net2 = dlnetwork(layers2);

% Check parameter structure
fprintf('Case 1 parameters: %.1fM\n', sum(cellfun(@numel, net1.Learnables.Value))/1e6);
fprintf('Case 2 parameters: %.1fM\n', sum(cellfun(@numel, net2.Learnables.Value))/1e6);
Results:
Case 1 parameters: 9.8M
Case 2 parameters: 4.9M
Analysis:
* Case 1 (4 heads × 784 key channels): 9.8M parameters
* Case 2 (8 heads × 392 key channels): 4.9M parameters
* Ratio: exactly 2:1
Root Cause Analysis: In MATLAB's selfAttentionLayer, the learnable parameter count scales linearly with NumKeyChannels and is independent of NumHeads; halving NumKeyChannels from 784 to 392 halves the count, which is exactly the 2:1 ratio measured above. The layer therefore does not keep the parameter count constant when NumHeads × NumKeyChannels is held fixed, which deviates from standard transformer practice, where the count should stay identical whenever the total dimensionality (3136 in both of your cases) is preserved.
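As a rough sanity check, a simple linear rule reproduces both measured counts. This is only my own estimate: it assumes four projections (query, key, value, output), each of size InputSize-by-NumKeyChannels, and ignores bias terms, none of which is taken from the documentation.

% Back-of-the-envelope estimate: 4 projections, each InputSize-by-NumKeyChannels;
% biases ignored for simplicity (assumption, not documented behavior).
inputSize = 3136;
approxParams = @(numKeyChannels) 4 * inputSize * numKeyChannels;
fprintf('Case 1 estimate: %.1fM\n', approxParams(784)/1e6);   % ~9.8M
fprintf('Case 2 estimate: %.1fM\n', approxParams(392)/1e6);   % ~4.9M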
So, you are not misinterpreting the research literature. Your understanding of multi-head attention theory is correct: the parameter count should remain constant regardless of the head configuration as long as the total feature dimensionality is preserved. MATLAB's Deep Learning Toolbox implements a non-standard variant whose parameter count is governed by NumKeyChannels rather than by that preserved total dimensionality.
My recommendation: if you require true multi-head attention behavior (a constant parameter count), consider implementing a custom layer that follows the standard transformer equations, since MATLAB's current implementation doesn't align with conventional practice. A rough sketch of that formulation is below.
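Here is a minimal sketch of the standard formulation in plain MATLAB. It is not the toolbox layer, and the function name and weight shapes are my own choices: all four projection matrices are dModel-by-dModel, so the learnable count is 4*dModel^2 for any head count that divides dModel.

% Minimal sketch of standard multi-head self-attention (illustrative only,
% not the toolbox implementation). The weights depend only on dModel, so
% the parameter count is identical for any numHeads that divides dModel.
function Y = standardMultiHeadSelfAttention(X, Wq, Wk, Wv, Wo, numHeads)
    % X is dModel-by-numTokens; Wq, Wk, Wv, Wo are all dModel-by-dModel.
    [dModel, numTokens] = size(X);
    dHead = dModel / numHeads;                        % per-head channel count

    Q = Wq * X;  K = Wk * X;  V = Wv * X;             % full-width projections

    Y = zeros(dModel, numTokens, 'like', X);
    for h = 1:numHeads
        idx = (h-1)*dHead + (1:dHead);                % channels owned by head h
        scores = (Q(idx,:).' * K(idx,:)) / sqrt(dHead);   % queries vs keys
        scores = scores - max(scores, [], 2);         % numerical stability
        A = exp(scores) ./ sum(exp(scores), 2);       % softmax over key positions
        Y(idx,:) = V(idx,:) * A.';                    % weighted sum of values
    end
    Y = Wo * Y;                                       % output projection
end

With dModel = 3136, numel(Wq) + numel(Wk) + numel(Wv) + numel(Wo) = 4*3136^2 ≈ 39.3M learnables whether you use 4 heads or 8, which is exactly the head-independence you expected from the literature. To use this in a layer array, you would wrap it in a custom layer (a classdef inheriting from nnet.layer.Layer) with those four matrices as learnable properties.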