Comments (5)

1 calee commented

Thank you for a great summary on tokenization. Could I use Custom Tokenizer in Appendix B commercially?

2 CraigTrim commented

Yes - help yourself - thanks!

3 E5UP_Srikant_Jakilinki commented

Thank you so much for the article and the data and code. CustomTokenizer refers to helpers like addToken() and CodeUtilities(). Where can we get those?

4 CraigTrim commented

@calee - thanks for noticing that. For addToken, use this:

    protected void addToken(List<String> tokens, String text) {
        if (!StringUtilities.isEmpty(text)) {
            tokens.add(text); // assumed: this line was cut off in the posted snippet
        }
    }

    protected void addToken(List<String> tokens, StringBuilder buffer) {
        if (null != buffer && 0 != buffer.length()) {
            addToken(tokens, buffer.toString().trim());
        }
    }

For the CodeUtilities calls, I recommend replacing these references with the native java.lang.Character class. Method calls such as Character.isLetter(...), Character.isDigit(...), etc. should be sufficient.
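
As a rough illustration of how the addToken helpers and the java.lang.Character checks might fit together, here is a minimal, self-contained sketch. It is not the CustomTokenizer from the article: the class name TokenHelpersSketch and the tokenize() driver are hypothetical, StringUtilities.isEmpty is swapped for a plain null/isEmpty check, and clearing the buffer after each flush is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

class TokenHelpersSketch {

    // Adds non-empty text to the token list (the add() call is an assumption).
    static void addToken(List<String> tokens, String text) {
        if (text != null && !text.isEmpty()) {
            tokens.add(text);
        }
    }

    // Flushes the buffer into the token list, then clears it for reuse (assumption).
    static void addToken(List<String> tokens, StringBuilder buffer) {
        if (buffer != null && buffer.length() != 0) {
            addToken(tokens, buffer.toString().trim());
            buffer.setLength(0);
        }
    }

    // Hypothetical driver: letters and digits accumulate into a token,
    // whitespace ends a token, any other character becomes its own token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c) || Character.isDigit(c)) {
                buffer.append(c);
            } else {
                addToken(tokens, buffer);
                if (!Character.isWhitespace(c)) {
                    addToken(tokens, String.valueOf(c));
                }
            }
        }
        addToken(tokens, buffer); // flush whatever remains at end of input
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world 42!"));
    }
}
```

With this sketch, "Hello, world 42!" yields the five tokens Hello, a comma, world, 42 and an exclamation mark.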

5 nurdo commented

How about CodeUtilities.isSpecial()? What's the corresponding Character... ?
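
java.lang.Character has no isSpecial() method, and the thread does not say what CodeUtilities.isSpecial() actually tested. One possible stand-in, assuming "special" means any character that is not a letter, a digit, or whitespace, could look like the sketch below. The class name and that definition are hypothetical, not the original helper.

```java
class SpecialCharSketch {

    // Hypothetical replacement for CodeUtilities.isSpecial():
    // true for anything that is not a letter, a digit, or whitespace.
    static boolean isSpecial(char c) {
        return !Character.isLetterOrDigit(c) && !Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(isSpecial('!'));
        System.out.println(isSpecial('a'));
    }
}
```

If the original helper used a narrower set (for example, only punctuation), the condition would need to be adjusted accordingly.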