View source on GitHub
Tokenizes a tensor of UTF-8 strings on whitespaces.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`

```python
text.WhitespaceTokenizer()
```
Methods
split
```python
split(
    input
)
```

Alias for `Tokenizer.tokenize`.
split_with_offsets
```python
split_with_offsets(
    input
)
```

Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
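Since both methods are aliases, the `Splitter` and `Tokenizer` interfaces are interchangeable here. A minimal sketch, assuming `tensorflow_text` is installed and imported as `text`:

```python
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

# `split` delegates to `tokenize`, so both calls return the same tokens.
print(tokenizer.split("never odd or even"))
# tf.Tensor([b'never' b'odd' b'or' b'even'], shape=(4,), dtype=string)

# `split_with_offsets` likewise delegates to `tokenize_with_offsets`.
tokens, starts, ends = tokenizer.split_with_offsets("never odd or even")
print(starts.numpy(), ends.numpy())
# [0 6 10 13] [5 9 12 17]
```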
tokenize
```python
tokenize(
    input
)
```
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters, and these whitespace characters are dropped from the output.
Example:

```python
WhitespaceTokenizer().tokenize("small medium large")
# <tf.Tensor: shape=(3,), dtype=string,
#  numpy=array([b'small', b'medium', b'large'], dtype=object)>
```
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |

| Returns | |
|---|---|
| A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens of each string. |
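To illustrate the added ragged dimension, here is a short sketch (again assuming `tensorflow_text` is imported as `text`) with a rank-1 input:

```python
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

# A rank-1 input of shape (2,) yields a RaggedTensor of shape (2, None):
# one ragged token dimension is appended, since each string may produce
# a different number of tokens.
print(tokenizer.tokenize(["small medium large", "tiny"]))
# <tf.RaggedTensor [[b'small', b'medium', b'large'], [b'tiny']]>
```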
tokenize_with_offsets
```python
tokenize_with_offsets(
    input
)
```
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters, and these whitespace characters are dropped from the output.
Example:

```python
splitter = WhitespaceTokenizer()
pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
print(pieces.numpy(), starts.numpy(), ends.numpy())
# [b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
```
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
| A tuple `(tokens, start_offsets, end_offsets)` where `tokens` is a `RaggedTensor` of tokenized text, `start_offsets` is a `RaggedTensor` of the tokens' starting byte offsets, and `end_offsets` is a `RaggedTensor` of the tokens' ending byte offsets. |
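Since the offsets are byte offsets into the original string, they can be used to slice the tokens back out. A minimal sketch, assuming `tensorflow_text` is imported as `text`:

```python
import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
sentence = "a bb ccc"
tokens, starts, ends = tokenizer.tokenize_with_offsets(sentence)

# Each (start, end) pair delimits one token in the UTF-8 encoded input.
encoded = sentence.encode("utf-8")
for start, end in zip(starts.numpy(), ends.numpy()):
    print(encoded[start:end])
# b'a'
# b'bb'
# b'ccc'
```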