XLM-RoBERTa
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multilingual language model, trained on 2.5TB of filtered CommonCrawl data.
XLMRobertaAdapterModel
- class adapters.XLMRobertaAdapterModel(config)
XLM-Roberta Model transformer with the option to add multiple flexible heads on top. Docstring of the original model:
The bare XLM-RoBERTa Model outputting raw hidden-states without any specific head on top.
This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
- Parameters
config ([XLMRobertaModel]) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.
add_pooling_layer (bool, optional, defaults to True) – Whether to add a pooling layer.
- config_class: alias of XLMRobertaConfig
- forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None, head=None, **kwargs)
The [XLMRobertaModel] forward method overrides the __call__ special method.
<Tip>
Although the recipe for the forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this one, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
</Tip>
- Parameters
input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) –
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.
Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) –
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) –
Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) –
Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
[What are position IDs?](../glossary#position-ids)
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) –
Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) –
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
past_key_values (~cache_utils.Cache, optional) –
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
Only a [~cache_utils.Cache] instance is allowed as input; see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). If no past_key_values are passed, a [~cache_utils.DynamicCache] will be initialized by default.
The model will output the same cache format that is fed as input.
If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) – Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.
cache_position (torch.Tensor of shape (sequence_length), optional) – Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- Returns
A [transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([XLMRobertaConfig]) and inputs.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when both output_attentions=True and config.add_cross_attention=True are passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) – It is a [~cache_utils.Cache] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- Return type
[transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions] or tuple(torch.FloatTensor)
- prepare_inputs_for_generation(input_ids, past=None, attention_mask=None, **model_kwargs)
Prepare the model inputs for generation. Notable steps include selecting the correct input key and cloning when appropriate, creating position_ids from the attention_mask when missing, slicing inputs and converting 2D attention masks to 4D for compilable caches, and finally forwarding all additional keyword arguments unchanged to the model’s forward pass.
See the forward pass in the model documentation for expected arguments (different models might have different requirements for e.g. past_key_values). This function should work as is for most LLMs.