Comments (5)

1 calee commented Permalink

Thank you for a great summary on tokenization. Could I use Custom Tokenizer in Appendix B commercially?

2 CraigTrim commented Permalink

Yes - help yourself - thanks!

3 E5UP_Srikant_Jakilinki commented Permalink

Thank you so much for the article and the data and code. CustomTokenizer refers to helpers like addToken() and CodeUtilities(). Where can we get those?

4 CraigTrim commented Permalink

@calee - thanks for noticing that. For addToken, use this:

protected void addToken(List<String> tokens, String text) {
    if (!StringUtilities.isEmpty(text)) {
        tokens.add(text); // presumed body: keep only non-empty tokens
    }
}

protected void addToken(List<String> tokens, StringBuilder buffer) {
    if (null != buffer && 0 != buffer.length()) {
        addToken(tokens, buffer.toString().trim());
    }
}

For the CodeUtilities calls, I recommend replacing these references with the native java.lang.Character class. Method calls such as Character.isLetter(...), Character.isDigit(...), etc. should be sufficient.
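
A minimal sketch of what that replacement might look like, using only java.lang.Character and the standard collections. The class name CharacterTokenizerSketch and the tokenize method are hypothetical and are not the CustomTokenizer from Appendix B. The null/empty check stands in for StringUtilities.isEmpty, and resetting the buffer after a flush is an assumption added here so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the article's CustomTokenizer. */
class CharacterTokenizerSketch {

    /** Adds the text as a token unless it is null or empty. */
    static void addToken(List<String> tokens, String text) {
        if (text != null && !text.isEmpty()) {
            tokens.add(text);
        }
    }

    /** Flushes a non-empty buffer as a trimmed token, then clears it (assumed behavior). */
    static void addToken(List<String> tokens, StringBuilder buffer) {
        if (buffer != null && buffer.length() != 0) {
            addToken(tokens, buffer.toString().trim());
            buffer.setLength(0);
        }
    }

    /**
     * Groups letters and digits into tokens, drops whitespace, and emits
     * every other character as a single-character token.
     */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c) || Character.isDigit(c)) {
                buffer.append(c);
            } else {
                addToken(tokens, buffer);
                if (!Character.isWhitespace(c)) {
                    addToken(tokens, String.valueOf(c));
                }
            }
        }
        addToken(tokens, buffer); // flush whatever is left at end of input
        return tokens;
    }
}
```

With this sketch, CharacterTokenizerSketch.tokenize("Hello, world 42!") yields the tokens Hello , world 42 ! in that order. How punctuation should be grouped in the real tokenizer may well differ.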

5 nurdo commented Permalink

How about CodeUtilities.isSpecial()? What's the corresponding Character... ?