Base class for tokenizer implementations that return offsets.
Inherits From: Tokenizer, SplitterWithOffsets, Splitter
```python
text.TokenizerWithOffsets(
    name=None
)
```
The offsets indicate which substring from the input string was used to
generate each token. E.g., if `input` is a single string, then each token
`token[i]` was generated from the substring `input[starts[i]:ends[i]]`.
Each `TokenizerWithOffsets` subclass must implement the `tokenize_with_offsets`
method, which returns a tuple containing both the pieces and the start and
end offsets where those pieces occurred in the input string. I.e., if
`tokens, starts, ends = tokenize_with_offsets(s)`, then each token `token[i]`
corresponds with `tf.strings.substr(s, starts[i], ends[i] - starts[i])`.
If the tokenizer encodes tokens as strings (and not token ids), then these corresponding strings will usually be equal; but that is not technically required. For example, a tokenizer might choose to downcase strings before returning them as tokens.
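As a minimal sketch of that point (not part of the library; `LowercaseWhitespaceTokenizer` is a hypothetical name that delegates to `WhitespaceTokenizer` for the offsets), the returned tokens can legitimately differ from the exact substrings the offsets point at:

```python
import tensorflow as tf
import tensorflow_text as tf_text

class LowercaseWhitespaceTokenizer(tf_text.TokenizerWithOffsets):
  """Hypothetical tokenizer whose tokens are lowercased, so they may not
  match the exact input substrings indicated by the offsets."""

  def __init__(self):
    super().__init__()
    self._ws = tf_text.WhitespaceTokenizer()

  def tokenize_with_offsets(self, input):
    tokens, starts, ends = self._ws.tokenize_with_offsets(input)
    return tf.strings.lower(tokens), starts, ends

  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]

tokens, starts, ends = LowercaseWhitespaceTokenizer().tokenize_with_offsets("Big DOG")
print(tokens.numpy())                                               # [b'big' b'dog']
print(tf.strings.substr("Big DOG", starts, ends - starts).numpy())  # [b'Big' b'DOG']
```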
Example:
```python
class CharTokenizer(TokenizerWithOffsets):
  def tokenize_with_offsets(self, input):
    chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
    lengths = tf.expand_dims(tf.strings.length(input), -1)
    ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
    return chars, starts, ends
  def tokenize(self, input):
    return self.tokenize_with_offsets(input)[0]

pieces, starts, ends = CharTokenizer().split_with_offsets("a😊c")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
```
Methods
split
```python
split(
    input
)
```
Alias for Tokenizer.tokenize.
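For instance (a brief illustrative sketch using `WhitespaceTokenizer`, a concrete subclass of this interface), `split` simply forwards to `tokenize`:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split() returns the same tokens as tokenize().
print(tokenizer.split("small medium large").numpy())
# [b'small' b'medium' b'large']
```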
split_with_offsets
```python
split_with_offsets(
    input
)
```
Alias for TokenizerWithOffsets.tokenize_with_offsets.
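Likewise (again using `WhitespaceTokenizer` as an illustrative concrete subclass), `split_with_offsets` returns the same tuple as `tokenize_with_offsets`:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# split_with_offsets() returns the same (tokens, starts, ends) as tokenize_with_offsets().
tokens, starts, ends = tokenizer.split_with_offsets("a bb ccc")
print(tokens.numpy(), starts.numpy(), ends.numpy())
# [b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
```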
tokenize
```python
@abc.abstractmethod
tokenize(
    input
)
```
Tokenizes the input tensor.
Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.
Example:
```python
print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
```
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |

| Returns | |
|---|---|
| An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into. |
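As an illustrative sketch of the shape behaviour described above (using `WhitespaceTokenizer` as a concrete implementation of this interface), a rank-1 input of two strings produces a rank-2 `RaggedTensor`, with one row of tokens per input string:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# 1-D input (N=1) -> 2-D ragged output (N+1=2): one row of tokens per string.
tokens = tokenizer.tokenize(["small medium large", "tiny"])
print(tokens)
# <tf.RaggedTensor [[b'small', b'medium', b'large'], [b'tiny']]>
```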
tokenize_with_offsets
```python
@abc.abstractmethod
tokenize_with_offsets(
    input
)
```
Tokenizes the input tensor and returns the result with byte-offsets.
The offsets indicate which substring from the input string was used to
generate each token. E.g., if `input` is a `tf.string` tensor, then each
token `token[i]` was generated from the substring
`tf.strings.substr(input, starts[i], len=ends[i] - starts[i])`.
Example:
```python
splitter = tf_text.WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
print(tf.strings.substr("a bb ccc", starts, ends-starts))
tf.Tensor([b'a' b'bb' b'ccc'], shape=(3,), dtype=string)
```
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |
| Returns | |
|---|---|
| A tuple `(tokens, start_offsets, end_offsets)` where `tokens` is an N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`, and `start_offsets` and `end_offsets` are N+1-dimensional integer `Tensor`s or `RaggedTensor`s containing the starting and exclusive ending byte offsets of each token within its input string. |
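As an illustrative sketch (again using `WhitespaceTokenizer`; the batch values are invented for demonstration), each element of the returned tuple gains the extra ragged dimension when the input is a batch of strings:

```python
import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(
    tf.constant(["a bb ccc", "dd e"]))
# tokens: [[b'a', b'bb', b'ccc'], [b'dd', b'e']]
# starts: [[0, 2, 5], [0, 3]]   (byte offset where each token begins)
# ends:   [[1, 4, 8], [2, 4]]   (exclusive byte offset where each token ends)
print(tokens, starts, ends, sep="\n")
```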