attentionLayer
Description
A dot-product attention layer focuses on parts of the input using weighted multiplication operations.
Creation
Description
layer = attentionLayer(numHeads) creates a dot-product attention layer and sets the NumHeads property.
layer = attentionLayer(numHeads,Name=Value) also sets the Scale, HasPaddingMaskInput, HasScoresOutput, AttentionMask, DropoutProbability, and Name properties using one or more name-value arguments.
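As a minimal sketch of the second syntax, the call below sets several properties when the layer is created. The property values are arbitrary and chosen only for illustration, and the snippet assumes Deep Learning Toolbox R2024a or later.

```matlab
% Illustrative values only; requires Deep Learning Toolbox (R2024a or later).
layer = attentionLayer(4, ...
    Scale=0.5, ...
    AttentionMask="causal", ...
    DropoutProbability=0.1, ...
    Name="attn");

% Inspect two of the properties set at construction.
disp(layer.NumHeads)
disp(layer.DropoutProbability)
```

The first argument is positional and sets NumHeads; the remaining properties are passed as name-value arguments.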
Properties
Attention
NumHeads: number of heads, specified as a positive integer.
Each head performs a separate linear transformation of the input and computes attention weights independently. The layer uses these attention weights to compute a weighted sum of the input representations, generating a context vector. Increasing the number of heads lets the model capture different types of dependencies and attend to different parts of the input simultaneously. Reducing the number of heads can lower the computational cost of the layer.
The value of NumHeads must evenly divide the size of the channel dimension of the input queries, keys, and values.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Scale: multiplicative factor for scaling the dot product of queries and keys, specified as one of these values:
- "auto": Multiply the dot product by 1/sqrt(D), where D is the number of channels of the keys divided by NumHeads.
- Numeric scalar: Multiply the dot product by the specified scalar.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | char | string | cell
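The "auto" scaling factor can be worked out by hand. The sketch below uses hypothetical sizes (512 key channels and 8 heads, neither of which comes from this page) and plain MATLAB arithmetic only.

```matlab
% Hypothetical sizes chosen for illustration.
numKeyChannels = 512;   % size of the "C" dimension of the keys
numHeads = 8;

% NumHeads must evenly divide the channel dimension.
assert(mod(numKeyChannels,numHeads) == 0)

D = numKeyChannels/numHeads;   % channels per head
autoScale = 1/sqrt(D);         % factor applied when Scale is "auto"
```

With these sizes, each head sees 64 channels, so the dot products are multiplied by 1/8.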
HasPaddingMaskInput: flag indicating whether the layer has an input that represents the padding mask, specified as 0 (false) or 1 (true).
If the HasPaddingMaskInput property is 0 (false), then the layer has three inputs with the names "query", "key", and "value", which correspond to the input queries, keys, and values, respectively. In this case, the layer treats all elements as data.
If the HasPaddingMaskInput property is 1 (true), then the layer has an additional input with the name "mask", which corresponds to the padding mask. In this case, the padding mask is an array of ones and zeros. The layer uses elements of the queries, keys, and values when the corresponding mask element is one and ignores them when it is zero.
The format of the padding mask must match that of the input keys. The size of the "S" (spatial), "T" (time), and "B" (batch) dimensions of the padding mask must match the size of the corresponding dimensions in the keys and values.
The padding mask can have any number of channels. The software uses only the values in the first channel to indicate padding values.
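One way to build a padding mask that satisfies these requirements is sketched below. It assumes the keys have the format "CBT" and that the true sequence lengths are known. The variable names and lengths are hypothetical.

```matlab
% Hypothetical batch of three sequences padded to a common length.
sequenceLengths = [5 3 4];
numObservations = numel(sequenceLengths);
maxLength = max(sequenceLengths);

% B-by-T logical matrix: true marks real data, false marks padding.
isData = (1:maxLength) <= sequenceLengths(:);

% Reshape to C-by-B-by-T with a single channel and label the dimensions.
mask = reshape(single(isData),[1 numObservations maxLength]);
mask = dlarray(mask,"CBT");
```

A single channel is enough here because the software reads only the first channel of the mask.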
HasScoresOutput: flag indicating whether the layer has an output that represents the scores (also known as the attention weights), specified as 0 (false) or 1 (true).
If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data.
If the HasScoresOutput property is 1 (true), then the layer has two outputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.
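The effect of the HasPaddingMaskInput and HasScoresOutput flags on the layer's inputs and outputs can be checked by creating a layer and reading its properties. This is a small sketch that assumes Deep Learning Toolbox R2024a or later; the number of heads is arbitrary.

```matlab
% Requires Deep Learning Toolbox (R2024a or later).
layer = attentionLayer(4,HasPaddingMaskInput=true,HasScoresOutput=true);

numIn = layer.NumInputs;      % includes the "mask" input
numOut = layer.NumOutputs;    % includes the "scores" output

disp(layer.InputNames)
disp(layer.OutputNames)
```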
AttentionMask: attention mask indicating which elements to include when applying the attention operation, specified as one of these values:
- "none": Do not prevent attention to elements with respect to their positions. If AttentionMask is "none", then the software prevents attention using only the padding mask.
- "causal": Prevent elements in position m in the "S" (spatial) or "T" (time) dimension of the input queries from providing attention to the elements in positions n, where n is greater than m in the corresponding dimension of the input keys and values. Use this option for auto-regressive models.
- Logical or numeric array: Prevent attention to elements of the input keys and values when the corresponding element in the specified array is 0. The specified array must be an Nk-by-Nq matrix or an Nk-by-Nq-by-numObservations array, where Nk is the size of the "S" (spatial) or "T" (time) dimension of the input keys, Nq is the size of the corresponding dimension of the input queries, and numObservations is the size of the "B" dimension in the input queries.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string
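To make the array form concrete, the sketch below builds an Nk-by-Nq mask that follows the rule described for "causal": key position n is visible to query position m only when n is not greater than m. This reflects a reading of the description above, not a statement that the layer handles the two options identically. The sizes are hypothetical.

```matlab
% Hypothetical sequence lengths for illustration.
numKeys = 4;      % Nk
numQueries = 4;   % Nq

% Nk-by-Nq mask: entry (n,m) is true when key position n does not come
% after query position m.
mask = (1:numKeys)' <= (1:numQueries);

% Pass the explicit mask to the layer (requires Deep Learning Toolbox).
layer = attentionLayer(2,AttentionMask=mask);
```

An explicit array is mainly useful when the allowed positions do not follow a simple causal pattern, for example a fixed-width local window.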
DropoutProbability: probability of dropping out attention scores, specified as a scalar in the range [0, 1).
During training, the software randomly sets values in the attention scores to zero using the specified probability. These dropouts can encourage the model to learn more robust and generalizable representations by preventing it from relying too heavily on specific dependencies.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Layer
NumInputs: number of inputs to the layer, returned as 3 or 4.
If the HasPaddingMaskInput property is 0 (false), then the layer has three inputs with the names "query", "key", and "value", which correspond to the input queries, keys, and values, respectively. If the HasPaddingMaskInput property is 1 (true), then the layer has an additional input with the name "mask", which corresponds to the padding mask. For the requirements on the padding mask, see the HasPaddingMaskInput property.
Data Types: double
InputNames: input names of the layer, returned as a cell array of character vectors.
If the HasPaddingMaskInput property is 0 (false), then the input names are "query", "key", and "value". If the HasPaddingMaskInput property is 1 (true), then the layer has an additional input with the name "mask", which corresponds to the padding mask. For the requirements on the padding mask, see the HasPaddingMaskInput property.
The AttentionLayer object stores this property as a cell array of character vectors.
NumOutputs: number of outputs of the layer. This property is read-only.
If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data. If the HasScoresOutput property is 1 (true), then the layer has two outputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.
Data Types: double
OutputNames: output names of the layer. This property is read-only.
If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data. If the HasScoresOutput property is 1 (true), then the layer has two outputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.
The AttentionLayer object stores this property as a cell array of character vectors.
Examples
Create a dot-product attention layer with 10 heads.
layer = attentionLayer(10)
layer = 
  AttentionLayer with properties:

                   Name: ''
              NumInputs: 3
             InputNames: {'query'  'key'  'value'}
               NumHeads: 10
                  Scale: 'auto'
          AttentionMask: 'none'
     DropoutProbability: 0
    HasPaddingMaskInput: 0
        HasScoresOutput: 0

   Learnable Parameters
    No properties.

   State Parameters
    No properties.
Create a simple neural network with cross-attention.
numChannels = 256;
numHeads = 8;

net = dlnetwork;

% Query branch. The layers in the array are connected in sequence, so the
% first fully connected layer feeds the "query" input of the attention layer.
layers = [
    sequenceInputLayer(1,Name="query")
    fullyConnectedLayer(numChannels)
    attentionLayer(numHeads,Name="attention")
    fullyConnectedLayer(numChannels,Name="fc-out")];
net = addLayers(net,layers);

% Key branch, connected to the "key" input of the attention layer.
layers = [
    sequenceInputLayer(1,Name="key-value")
    fullyConnectedLayer(numChannels,Name="fc-key")];
net = addLayers(net,layers);
net = connectLayers(net,"fc-key","attention/key");

% Value branch, which shares the "key-value" input layer.
net = addLayers(net,fullyConnectedLayer(numChannels,Name="fc-value"));
net = connectLayers(net,"key-value","fc-value");
net = connectLayers(net,"fc-value","attention/value");
View the network in a plot.
figure
plot(net)
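A possible next step, not part of the original example, is to initialize the network and pass random data through it. The sketch below rebuilds the same network so that it runs on its own. The batch size and sequence lengths are hypothetical, and the call to predict assumes that the network inputs are ordered as "query" followed by "key-value"; check net.InputNames to confirm.

```matlab
% Rebuild the cross-attention network from the example above.
numChannels = 256;
numHeads = 8;

net = dlnetwork;
layers = [
    sequenceInputLayer(1,Name="query")
    fullyConnectedLayer(numChannels)
    attentionLayer(numHeads,Name="attention")
    fullyConnectedLayer(numChannels,Name="fc-out")];
net = addLayers(net,layers);

layers = [
    sequenceInputLayer(1,Name="key-value")
    fullyConnectedLayer(numChannels,Name="fc-key")];
net = addLayers(net,layers);
net = connectLayers(net,"fc-key","attention/key");

net = addLayers(net,fullyConnectedLayer(numChannels,Name="fc-value"));
net = connectLayers(net,"key-value","fc-value");
net = connectLayers(net,"fc-value","attention/value");

% Initialize the learnable parameters of the fully connected layers.
net = initialize(net);

% Hypothetical data: 4 observations, 10 query steps, 12 key-value steps.
XQuery = dlarray(rand(1,4,10,"single"),"CBT");
XKeyValue = dlarray(rand(1,4,12,"single"),"CBT");

% Pass the inputs in the order listed in net.InputNames (assumed here to
% be "query" first, then "key-value").
Y = predict(net,XQuery,XKeyValue);
disp(size(Y))
```

The query and key-value sequences have different lengths here, which is the situation cross-attention is meant to handle. The output follows the time dimension of the queries.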
Algorithms
The attention operation focuses on parts of the input using weighted multiplication operations.
The single-head dot-product attention operation is given by
$$\operatorname{attention}(Q,K,V) = \operatorname{dropout}\left(\operatorname{softmax}\left(\operatorname{mask}\left(\lambda Q K^{\top}, M\right)\right), p\right) V$$
where:
- $Q$ denotes the queries.
- $K$ denotes the keys.
- $V$ denotes the values.
- $\lambda$ denotes the scaling factor.
- $M$ is a mask array of ones and zeros.
- $p$ is the dropout probability.
The mask operation includes or excludes the values of the matrix multiplication by setting values of the input to $-\infty$ for zero-valued mask elements. The mask is the union of the padding and attention masks. The softmax function normalizes the value of the input data across the channel dimension such that it sums to one. The dropout operation sets elements to zero with probability $p$.
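The single-head operation can be sketched with plain matrices, without the toolbox. The code below is an illustration under stated assumptions, not the layer's implementation: rows index positions and columns index channels, the mask is stored as Nq-by-Nk (the transpose of the Nk-by-Nq layout used by the AttentionMask property), the softmax runs over the key positions of each query, and dropout is omitted (p = 0).

```matlab
% Plain-matrix sketch; rows index positions, columns index channels.
rng(0)
numQueries = 3;
numKeys = 3;
D = 4;
Q = rand(numQueries,D);
K = rand(numKeys,D);
V = rand(numKeys,D);

scale = 1/sqrt(D);                     % the "auto" scaling factor
M = (1:numQueries)' >= (1:numKeys);    % Nq-by-Nk causal mask (1 = keep)

scores = scale*(Q*K');                 % scaled dot products, Nq-by-Nk
scores(~M) = -Inf;                     % mask: excluded entries get -Inf

% Numerically stable softmax over the key positions of each query.
scores = scores - max(scores,[],2);
weights = exp(scores)./sum(exp(scores),2);

Y = weights*V;                         % weighted sum of the values
```

Because masked entries are set to -Inf before the softmax, they receive a weight of exactly zero. The first query can see only the first key, so its output row equals the first row of V.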
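The multihead dot-product attention operation is given by
$$\operatorname{multiheadAttention}(Q,K,V) = \operatorname{concatenate}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$$
where:
- $h$ is the number of heads.
- Each $\mathrm{head}_i$ denotes the output of the head operation given by
$$\mathrm{head}_i = \operatorname{attention}(Q_i, K_i, V_i)$$
Here, attention is the single-head operation, and $Q_i$, $K_i$, and $V_i$ denote the portions of the queries, keys, and values that belong to head $i$.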
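A plain-matrix sketch of the multihead operation follows. It assumes that each head receives a contiguous block of channels, which is one natural reading of the requirement that NumHeads evenly divide the channel dimension; how the layer actually partitions the channels is not stated here. Masking and dropout are left out, and the helper names softmaxRows and attend are hypothetical.

```matlab
% Plain-matrix sketch; rows index positions, columns index channels.
rng(0)
numQueries = 5;
numKeys = 6;
numChannels = 8;
numHeads = 2;
Q = rand(numQueries,numChannels);
K = rand(numKeys,numChannels);
V = rand(numKeys,numChannels);

% Single-head attention without masking or dropout (hypothetical helpers).
softmaxRows = @(S) exp(S - max(S,[],2))./sum(exp(S - max(S,[],2)),2);
attend = @(q,k,v) softmaxRows((q*k')/sqrt(size(k,2)))*v;

D = numChannels/numHeads;              % channels per head
heads = cell(1,numHeads);
for i = 1:numHeads
    idx = (i-1)*D + (1:D);             % channels assigned to head i
    heads{i} = attend(Q(:,idx),K(:,idx),V(:,idx));
end

Y = [heads{:}];                        % concatenate along the channels
```

The output has the same number of channels as the values, because the h head outputs of width D are concatenated back to h*D channels.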
Layers in a layer array or layer graph pass data to subsequent layers as formatted dlarray objects.
The format of a dlarray object is a string of characters in which each character describes the corresponding dimension of the data. The format consists of one or more of these characters:
- "S": Spatial
- "C": Channel
- "B": Batch
- "T": Time
- "U": Unspecified
For example, you can describe 2-D image data that is represented as a 4-D array, where the first two dimensions correspond to the spatial dimensions of the images, the third dimension corresponds to the channels of the images, and the fourth dimension corresponds to the batch dimension, as having the format "SSCB" (spatial, spatial, channel, batch).
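The image example above can be written out as a short sketch. The array sizes are hypothetical; dlarray, dims, and extractdata are Deep Learning Toolbox functions.

```matlab
% Hypothetical batch of 16 RGB images of size 28-by-28.
X = rand(28,28,3,16,"single");

% Label the dimensions as spatial, spatial, channel, batch.
dlX = dlarray(X,"SSCB");

fmt = dims(dlX);            % format of the formatted dlarray
raw = extractdata(dlX);     % underlying numeric array
```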
You can interact with these dlarray objects in automatic differentiation workflows, such as those for developing a custom layer, using a functionLayer object, or using the forward and predict functions with dlnetwork objects.
This table shows the supported input formats of AttentionLayer objects and the corresponding output format. If the software passes the output of the layer to a custom layer that does not inherit from the nnet.layer.Formattable class, or a FunctionLayer object with the Formattable property set to 0 (false), then the layer receives an unformatted dlarray object with dimensions ordered according to the formats in this table. The formats listed here are only a subset. The layer may support additional formats such as formats with additional "S" (spatial) or "U" (unspecified) dimensions.
Query, Key, and Value Format | Output Format | Scores Output Format (When HasScoresOutput Is 1 (true)) |
---|---|---|
"CB" (channel, batch) | "CB" (channel, batch) | "UUUU" (unspecified, unspecified, unspecified, unspecified) |
"SCB" (spatial, channel, batch) | "SCB" (spatial, channel, batch) | "UUUU" (unspecified, unspecified, unspecified, unspecified) |
"CBT" (channel, batch, time) | "CBT" (channel, batch, time) | "UUUU" (unspecified, unspecified, unspecified, unspecified) |
"SC" (spatial, channel) | "SC" (spatial, channel) | "UUU" (unspecified, unspecified, unspecified) |
"CT" (channel, time) | "CT" (channel, time) | "UUU" (unspecified, unspecified, unspecified) |
"BT" (batch, time) | "CBT" (channel, batch, time) | "UUUU" (unspecified, unspecified, unspecified, unspecified) |
"SB" (spatial, batch) | "SCB" (spatial, channel, batch) | "UUUU" (unspecified, unspecified, unspecified, unspecified) |
If HasPaddingMaskInput is 1 (true), then the mask must have the same format as the queries, keys, and values.
References
[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017. https://papers.nips.cc/paper/7181-attention-is-all-you-need.
Extended Capabilities
C/C++ Code Generation
Usage notes and limitations:
- Code generation is not supported when HasScoresOutput is set to true.
- Code generation does not support passing dlarray objects with unspecified (U) dimensions to this layer.
GPU Code Generation
Refer to the usage notes and limitations in the C/C++ Code Generation section. The same limitations apply to GPU code generation.
Version History
Introduced in R2024a