Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`, `Detokenizer`
```
text.UnicodeCharTokenizer()
```
Resulting tokens are integers (Unicode codepoints). A scalar input produces a `Tensor` output containing its codepoints; `Tensor` inputs produce `RaggedTensor` outputs.
Example:
```
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize("abc")
>>> print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)

>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>

>>> t = ["abc" + chr(0xfffe) + chr(0x1fffe)]
>>> tokens = tokenizer.tokenize(t)
>>> print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
```
Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
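If inputs might contain invalid byte sequences, one way to guard against this (a sketch, not part of this class's API; `tf.strings.unicode_transcode` is a standard TensorFlow op) is to replace malformed sequences with U+FFFD before tokenizing:

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Replace malformed UTF-8 byte sequences with U+FFFD before tokenizing;
# errors='replace' is unicode_transcode's default behavior.
raw = tf.constant([b"abc", b"\xffde"])
clean = tf.strings.unicode_transcode(raw, input_encoding="UTF-8",
                                     output_encoding="UTF-8",
                                     errors="replace")

tokenizer = tf_text.UnicodeCharTokenizer()
print(tokenizer.tokenize(clean))
# <tf.RaggedTensor [[97, 98, 99], [65533, 100, 101]]>
```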
Methods
detokenize
```
detokenize(
    input, name=None
)
```
Detokenizes input codepoints (integers) to UTF-8 strings.
Example:
```
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> s = tokenizer.detokenize(tokens)
>>> print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)
```
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of codepoints (ints) with a rank of at least 1. |
| `name` | The name argument that is passed to the op function. |

| Returns |
|---|
| An N-1 dimensional string tensor of the text corresponding to the UTF-8 codepoints in the input. |
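`detokenize` also accepts a plain dense `Tensor` of codepoints; a minimal sketch (the codepoint values are illustrative):

```python
import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# A rank-2 dense Tensor of codepoints decodes to a rank-1 string Tensor.
codepoints = tf.constant([[104, 105], [111, 107]])  # "hi", "ok"
print(tokenizer.detokenize(codepoints))
# tf.Tensor([b'hi' b'ok'], shape=(2,), dtype=string)
```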
split
```
split(
    input
)
```
Alias for `Tokenizer.tokenize`.
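Since `split` simply forwards to `tokenize`, the two return identical results; a quick sketch:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
# split() is an alias of tokenize(), so both return the same codepoints.
print(tokenizer.split(["ab", "c"]))
# <tf.RaggedTensor [[97, 98], [99]]>
```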
split_with_offsets
```
split_with_offsets(
    input
)
```
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
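Likewise, `split_with_offsets` returns the same `(tokens, start_offsets, end_offsets)` triple as `tokenize_with_offsets`; a quick sketch:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
tokens, starts, ends = tokenizer.split_with_offsets(["ab"])
print(tokens)  # <tf.RaggedTensor [[97, 98]]>
print(starts)  # <tf.RaggedTensor [[0, 1]]>
print(ends)    # <tf.RaggedTensor [[1, 2]]>
```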
tokenize
```
tokenize(
    input
)
```
Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Input strings are split on character boundaries using `unicode_decode_with_offsets`.
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns |
|---|
| A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for the tokens (characters) of each string. |
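A small sketch of the shape behavior: tokenizing a 2x2 string tensor yields a ragged tensor with one added ragged dimension:

```python
import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(tf.constant([["ab", "c"], ["de", "f"]]))
print(tokens.shape)      # (2, 2, None) -- input shape plus a ragged token axis
print(tokens.to_list())  # [[[97, 98], [99]], [[100, 101], [102]]]
```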
tokenize_with_offsets
```
tokenize_with_offsets(
    input
)
```
Tokenizes a tensor of UTF-8 strings to Unicode characters.
Example:
```
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets("a" + chr(8364) + chr(10340))
>>> print(tokens[0])
tf.Tensor([   97  8364 10340], shape=(3,), dtype=int32)
>>> print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)
```
The `start_offsets` and `end_offsets` are byte indices into the original string. When calling with multiple string inputs, the offsets are relative to each individual source string:
```
>>> toks = tokenizer.tokenize_with_offsets(["a" + chr(8364), "b" + chr(10300)])
>>> print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
>>> print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
>>> print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
```
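Because the offsets are byte positions, they can be used to slice each token's bytes back out of the original string; a minimal sketch for a scalar input:

```python
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeCharTokenizer()
s = "a" + chr(8364)  # 'a' (1 byte) followed by the euro sign (3 bytes)
tokens, starts, ends = tokenizer.tokenize_with_offsets(s)

raw = s.encode("utf-8")
# The offsets index into the UTF-8 *bytes* of the original string.
print([raw[b:e] for b, e in zip(starts.numpy(), ends.numpy())])
# [b'a', b'\xe2\x82\xac']
```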
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |

| Returns |
|---|
| A tuple `(tokens, start_offsets, end_offsets)` where `tokens` is a `RaggedTensor` of codepoints and `start_offsets` and `end_offsets` are the byte offsets in the original strings where each token starts and ends. |