View source on GitHub
Split input by delimiters that match a regex pattern; returns offsets.
text.regex_split_with_offsets(
input,
delim_regex_pattern,
keep_delim_regex_pattern='',
name=None
)
regex_split_with_offsets splits input using delimiters that match the
regex pattern in delim_regex_pattern. It returns three tensors:
one containing the split substrings ('result' in the examples below), one
containing the start offset of each substring ('begin' in the
examples below), and one containing the end offset of each substring
('end' in the examples below).
Here is an example:
import tensorflow_text as text

text_input = ["hello there"]
# split by whitespace
result, begin, end = text.regex_split_with_offsets(
    input=text_input,
    delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b'there']]>
begin: <tf.RaggedTensor [[0, 6]]>
end: <tf.RaggedTensor [[5, 11]]>
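To make the meaning of the offsets concrete, the short check below slices the input with the begin and end values printed above and recovers the same substrings. It is a plain-Python sketch that copies the values from the example rather than calling the library, and it assumes the offsets index into the UTF-8 bytes of the input with an exclusive end (the printed values are consistent with that, but this page does not state it).

```python
# Values copied from the example output above (not computed here).
text_input = "hello there"
result = [b"hello", b"there"]
begin = [0, 6]
end = [5, 11]

# Assumption: offsets are byte positions into the UTF-8 encoded input,
# and each end offset is exclusive.
data = text_input.encode("utf-8")
recovered = [data[b:e] for b, e in zip(begin, end)]

print(recovered)
assert recovered == result
```

For ASCII input such as this, byte offsets and character offsets coincide, so the assumption only matters for non-ASCII text.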
By default, delimiters are not included in the split string results.
Delimiters may be included by specifying a regex pattern
keep_delim_regex_pattern. For example:
text_input = ["hello there"]
# split by whitespace
result, begin, end = text.regex_split_with_offsets(
    input=text_input,
    delim_regex_pattern=r"\s",
    keep_delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b' ', b'there']]>
begin: <tf.RaggedTensor [[0, 5, 6]]>
end: <tf.RaggedTensor [[5, 6, 11]]>
If there are multiple delimiters in a row, no empty splits are emitted. For example:
text_input = ["hello  there"]  # Note the two spaces between the words.
# split by whitespace
result, begin, end = text.regex_split_with_offsets(
    input=text_input,
    delim_regex_pattern=r"\s")
print("result: %s\nbegin: %s\nend: %s" % (result, begin, end))
result: <tf.RaggedTensor [[b'hello', b'there']]>
begin: <tf.RaggedTensor [[0, 7]]>
end: <tf.RaggedTensor [[5, 12]]>
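The three behaviors described above (splitting on delimiters, optionally keeping them, and dropping empty splits) can be imitated for a single string in plain Python. The helper below, `regex_split_with_offsets_sketch`, is hypothetical and is not the library's implementation: it uses Python's `re` module instead of RE2, returns character indices rather than tensors, and assumes a delimiter is kept only when the whole delimiter match also matches `keep_delim_regex_pattern`.

```python
import re


def regex_split_with_offsets_sketch(s, delim_regex_pattern,
                                    keep_delim_regex_pattern=""):
    """Hypothetical single-string emulation; returns (tokens, begins, ends)."""
    delim = re.compile(delim_regex_pattern)
    keep = re.compile(keep_delim_regex_pattern) if keep_delim_regex_pattern else None
    tokens, begins, ends = [], [], []

    def emit(start, stop):
        # Only non-empty spans are emitted, so consecutive delimiters
        # never produce empty splits.
        if stop > start:
            tokens.append(s[start:stop])
            begins.append(start)
            ends.append(stop)

    pos = 0
    for m in delim.finditer(s):
        if m.start() == m.end():
            continue  # Assumption: zero-width delimiter matches are ignored.
        emit(pos, m.start())          # text before the delimiter
        if keep is not None and keep.fullmatch(m.group()):
            emit(m.start(), m.end())  # the delimiter itself, if kept
        pos = m.end()
    emit(pos, len(s))                 # trailing text after the last delimiter
    return tokens, begins, ends


print(regex_split_with_offsets_sketch("hello there", r"\s"))
print(regex_split_with_offsets_sketch("hello there", r"\s", r"\s"))
print(regex_split_with_offsets_sketch("hello  there", r"\s"))
```

On these ASCII inputs the sketch yields the same tokens and offsets as the three examples above. Patterns that rely on RE2-specific or Python-specific syntax may behave differently between the two engines.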
See https://github.com/google/re2/wiki/Syntax for the full list of supported expressions.
Returns | |
|---|---|
| A tuple of RaggedTensors containing: (split_results, begin_offsets, end_offsets), where split_results is of type string and begin_offsets and end_offsets are of type int64. |