An abstract base class for splitters that return offsets.
Inherits From: Splitter
    text.SplitterWithOffsets(name=None)
Each SplitterWithOffsets subclass must implement the split_with_offsets
method, which returns a tuple containing both the pieces and the offsets where
those pieces occurred in the input string. E.g.:
    import tensorflow as tf
    from tensorflow_text import SplitterWithOffsets

    class CharSplitter(SplitterWithOffsets):

        def split_with_offsets(self, input):
            # Split into Unicode characters; starts are byte offsets.
            chars, starts = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
            # Each piece ends where the next begins; the last ends at the byte length.
            lengths = tf.expand_dims(tf.strings.length(input), -1)
            ends = tf.concat([starts[..., 1:], tf.cast(lengths, tf.int64)], -1)
            return chars, starts, ends

        def split(self, input):
            return self.split_with_offsets(input)[0]

    pieces, starts, ends = CharSplitter().split_with_offsets("a😊c")
    print(pieces.numpy(), starts.numpy(), ends.numpy())
    # [b'a' b'\xf0\x9f\x98\x8a' b'c'] [0 1 5] [1 5 6]
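The offsets in the output above count UTF-8 bytes, not characters, which is why the emoji spans bytes 1 to 5. The following is a minimal pure-Python sketch of the same bookkeeping. It is an illustration only, not part of the library; `char_byte_offsets` is a hypothetical helper name.

```python
def char_byte_offsets(text):
    """Return (pieces, starts, ends) for each character of `text`.

    Offsets are byte positions in the UTF-8 encoding; each end is exclusive.
    Hypothetical helper for illustration, not a TensorFlow Text API.
    """
    pieces, starts, ends = [], [], []
    position = 0
    for char in text:
        encoded = char.encode("utf-8")
        pieces.append(encoded)
        starts.append(position)
        position += len(encoded)  # advance by the encoded byte width
        ends.append(position)
    return pieces, starts, ends


pieces, starts, ends = char_byte_offsets("a😊c")
print(pieces, starts, ends)
# [b'a', b'\xf0\x9f\x98\x8a', b'c'] [0, 1, 5] [1, 5, 6]
```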
Methods
split
    @abc.abstractmethod
    split(input)
Splits the input tensor into pieces.
Generally, the pieces returned by a splitter correspond to substrings of the original string, and can be encoded using either strings or integer ids.
Example:
    import tensorflow_text as tf_text

    print(tf_text.WhitespaceTokenizer().split("small medium large"))
    # tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
| Args | |
|---|---|
| input | An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor. |
| Returns | |
|---|---|
| An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string from the input tensor, the final, extra dimension contains the pieces that string was split into. | |
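To make the "N-dimensional in, N+1-dimensional out" rule concrete, here is a small pure-Python sketch that uses nested lists as a stand-in for Tensor and RaggedTensor. `split_nested` is a hypothetical helper for illustration, not a library function, and it splits on whitespace with `str.split`.

```python
def split_nested(value):
    """Replace every string in a nested list with the list of its pieces.

    A scalar string (0-D) becomes a 1-D list; a 1-D list of strings becomes
    a 2-D list whose rows may have different lengths (ragged).
    """
    if isinstance(value, str):
        return value.split()
    return [split_nested(item) for item in value]


print(split_nested("small medium large"))
# ['small', 'medium', 'large']
print(split_nested(["a bb", "ccc"]))
# [['a', 'bb'], ['ccc']]
```

The second result has rows of different lengths, which is the situation a RaggedTensor is meant to represent.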
split_with_offsets
    @abc.abstractmethod
    split_with_offsets(input)
Splits the input tensor, and returns the resulting pieces with offsets.
Example:
    import tensorflow_text as tf_text

    splitter = tf_text.WhitespaceTokenizer()
    pieces, starts, ends = splitter.split_with_offsets("a bb ccc")
    print(pieces.numpy(), starts.numpy(), ends.numpy())
    # [b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
| Args | |
|---|---|
| input | An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor. |
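| Returns | |
|---|---|
| A tuple (pieces, start_offsets, end_offsets) where: pieces is an N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor; start_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the starting indices of each piece (byte indices for input strings); end_offsets is an N+1-dimensional integer Tensor or RaggedTensor containing the exclusive ending indices of each piece (byte indices for input strings). | |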
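In the examples on this page, each start offset is the byte index where a piece begins and each end offset is one past its last byte, so slicing the encoded input with `[start:end]` recovers the piece. The sketch below checks that property in pure Python. `whitespace_split_with_offsets` is a hypothetical stand-in that splits on ASCII whitespace only; it is not the library's WhitespaceTokenizer and may differ from it on other Unicode whitespace.

```python
import re


def whitespace_split_with_offsets(text):
    """Split on ASCII whitespace; return (pieces, starts, ends) in UTF-8 bytes.

    Hypothetical helper for illustration, not a TensorFlow Text API.
    """
    data = text.encode("utf-8")
    pieces, starts, ends = [], [], []
    for match in re.finditer(rb"\S+", data):
        pieces.append(match.group())
        starts.append(match.start())
        ends.append(match.end())  # exclusive end, like a Python slice
    return pieces, starts, ends


text = "a bb ccc"
pieces, starts, ends = whitespace_split_with_offsets(text)
print(pieces, starts, ends)
# [b'a', b'bb', b'ccc'] [0, 2, 5] [1, 4, 8]

# Slicing the encoded input with each (start, end) pair yields the piece.
data = text.encode("utf-8")
recovered = [data[s:e] for s, e in zip(starts, ends)]
print(recovered == pieces)
# True
```

Keeping offsets alongside the pieces lets downstream code map each piece back to its exact span in the original string, for example to highlight matches or align labels.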