A Splitter that uses a state machine to determine sentence breaks.
StateBasedSentenceBreaker splits text into sentences by using a state
machine to determine when a sequence of characters indicates a potential
sentence break.
The state machine starts in an initial state and transitions to the
collecting terminal punctuation state once it encounters an acronym, an
emoticon, or terminal punctuation (ellipsis, question mark, exclamation
point, etc.).
It transitions to the collecting close punctuation state when it finds
close punctuation (close bracket, end quote, etc.).
If it encounters non-punctuation in the collecting terminal punctuation or
collecting close punctuation state, the state machine exits and returns
false, indicating that it has moved past the end of a potential sentence
fragment.
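The transitions described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the names `State`, `step`, and `split_fragments` are hypothetical, the character sets are a simplified assumption, and acronyms and emoticons are not handled at all (so a string such as 'U.S.' would be split incorrectly here).

```python
from enum import Enum, auto

# Simplified character classes (assumption for illustration only).
TERMINAL = set(".?!")
CLOSE = set(")]}\"'")


class State(Enum):
    INITIAL = auto()
    COLLECTING_TERMINAL = auto()
    COLLECTING_CLOSE = auto()


def step(state, ch):
    """Return (next_state, still_inside_fragment) for one character."""
    if state is State.INITIAL:
        if ch in TERMINAL:
            return State.COLLECTING_TERMINAL, True
        return State.INITIAL, True
    if state is State.COLLECTING_TERMINAL:
        if ch in TERMINAL:
            return State.COLLECTING_TERMINAL, True
        if ch in CLOSE:
            return State.COLLECTING_CLOSE, True
        return state, False  # non-punctuation: past the fragment end
    # COLLECTING_CLOSE. Assumption: further punctuation of either kind
    # keeps collecting; the description only specifies non-punctuation.
    if ch in CLOSE or ch in TERMINAL:
        return State.COLLECTING_CLOSE, True
    return state, False


def split_fragments(text):
    """Split text wherever the toy state machine reports a fragment end."""
    fragments = []
    begin = 0
    state = State.INITIAL
    for i, ch in enumerate(text):
        state, inside = step(state, ch)
        if not inside:
            fragment = text[begin:i].strip()
            if fragment:
                fragments.append(fragment)
            begin = i
            # Restart the machine on the character that ended the fragment.
            state, _ = step(State.INITIAL, ch)
    tail = text[begin:].strip()
    if tail:
        fragments.append(tail)
    return fragments


print(split_fragments("Hello...foo bar"))
print(split_fragments('Really?!" she said.'))
```

The sketch only mirrors the three states and the exit condition; the real breaker also tracks byte offsets and recognizes acronyms and emoticons.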
Methods
break_sentences
break_sentences(doc)
Splits doc into sentence fragments and returns the fragments' text.
| Args | |
|---|---|
| doc | A string Tensor of shape [batch] with a batch of documents. |

| Returns | |
|---|---|
| results | A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments. |
break_sentences_with_offsets
break_sentences_with_offsets(doc)
Splits doc into sentence fragments and returns their text along with their start and end offsets.
Example:
                1                  1         2         3
      012345678901234    01234567890123456789012345678901234567
doc: 'Hello...foo bar', "Welcome to the U.S. don't be surprised"
fragment_text: [
  ['Hello...', 'foo bar'],
  ['Welcome to the U.S.', "don't be surprised"]
]
start: [[0, 8], [0, 20]]
end: [[8, 15], [19, 38]]
| Args | |
|---|---|
| doc | A string Tensor of shape [batch] or [batch, 1]. |

| Returns | |
|---|---|
| A tuple of (fragment_text, start, end) where: | |
| fragment_text | A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments. |
| start | An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the inclusive beginning byte offset of a sentence. |
| end | An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the exclusive ending byte offset of a sentence. |
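To make the offset convention concrete, the following sketch slices the documents from the example above using the listed start and end values. It uses plain Python lists in place of tensors, and `slice_fragments` is a hypothetical helper, not part of the library. Because the offsets are byte offsets, the sketch slices the UTF-8 encoded bytes rather than the Python string.

```python
# Values copied from the example above, written as plain Python lists.
docs = ["Hello...foo bar", "Welcome to the U.S. don't be surprised"]
starts = [[0, 8], [0, 20]]
ends = [[8, 15], [19, 38]]


def slice_fragments(doc, doc_starts, doc_ends):
    """Recover fragments from inclusive-start, exclusive-end byte offsets."""
    data = doc.encode("utf-8")  # offsets index bytes, not characters
    return [data[s:e].decode("utf-8") for s, e in zip(doc_starts, doc_ends)]


fragments = [slice_fragments(d, s, e) for d, s, e in zip(docs, starts, ends)]
print(fragments)
# [['Hello...', 'foo bar'], ['Welcome to the U.S.', "don't be surprised"]]
```

In the second document the first fragment ends at byte 19 and the next starts at byte 20, so the separating space belongs to neither fragment.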