Comments (15)

1 calee commented Permalink

Thank you for a great summary of tokenization. Could I use the Custom Tokenizer in Appendix B commercially?

2 CraigTrim commented Permalink

Yes - help yourself - thanks!

3 E5UP_Srikant_Jakilinki commented Permalink

Thank you so much for the article and the data and code. CustomTokenizer refers to helpers like addToken() and CodeUtilities(). Where can we get those?

4 CraigTrim commented Permalink

@calee - thanks for noticing that. For addToken, use this:
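// Adds the text to the token list, skipping empty strings.
protected void addToken(List<String> tokens, String text) {
    if (!StringUtilities.isEmpty(text)) {
        tokens.add(text);
    }
}

// Adds the trimmed buffer contents to the token list, skipping null or empty buffers.
protected void addToken(List<String> tokens, StringBuilder buffer) {
    if (null != buffer && 0 != buffer.length()) {
        addToken(tokens, buffer.toString().trim());
    }
}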
 
For the CodeUtilities.java calls, I recommend replacing these references with the native java.lang.Character class. Method calls such as Character.isLetter(...), Character.isDigit(...), etc. should be sufficient.
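As a rough illustration of how the two addToken helpers and the java.lang.Character calls could fit together, here is a minimal sketch. It is not the CustomTokenizer from Appendix B: the class name SimpleTokenizerSketch and the tokenize method are hypothetical, a plain isEmpty() check stands in for StringUtilities.isEmpty(), and the rule of treating every non-alphanumeric, non-whitespace character as its own token is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTokenizerSketch {

    // Same role as the addToken(List, String) helper above; a plain
    // null/isEmpty check stands in for StringUtilities.isEmpty().
    static void addToken(List<String> tokens, String text) {
        if (text != null && !text.isEmpty()) {
            tokens.add(text);
        }
    }

    // Same role as the addToken(List, StringBuilder) helper above.
    static void addToken(List<String> tokens, StringBuilder buffer) {
        if (buffer != null && buffer.length() != 0) {
            addToken(tokens, buffer.toString().trim());
        }
    }

    // Assumed rule for this sketch: letters and digits accumulate into a word,
    // whitespace ends the word, and any other character becomes its own token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                buffer.append(c);
            } else {
                addToken(tokens, buffer);   // flush the word collected so far
                buffer.setLength(0);
                if (!Character.isWhitespace(c)) {
                    addToken(tokens, String.valueOf(c));
                }
            }
        }
        addToken(tokens, buffer);           // flush the trailing word, if any
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world 42!"));
    }
}
```

For the input "Hello, world 42!" this yields five tokens: Hello, a comma, world, 42 and an exclamation mark. Only standard Character methods are used, so no CodeUtilities class is needed.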

5 nurdo commented Permalink

How about CodeUtilities.isSpecial()? What's the corresponding Character... ?

 
Thanks!

6 Ilya Geller commented Permalink

Language has only one Natural way of parsing: by clauses, sentences and paragraphs.
That's it! Nothing else.
Based on that you can calculate Internal statistics and index by dictionary. This is Nature, the way of Evolution.
If you don't think so, please explain what you propose.
IBM had not, until recently, had any idea what data is and how it should be handled. IBM did not know that Language has its own Internal parsing and statistics.
For instance, there are two sentences:
a) ‘Fire!’
b) ‘In this amazing city of Rome some people sometimes may cry in agony: ‘Fire!’’
Evidently, the phrase ‘Fire!’ has a different importance in each of the two sentences, with regard to the extra information in each. This distinction is reflected in the phrase weights: in the first it has weight 1, in the second 0.12; the greater weight signifies stronger emotional ‘acuteness’.
First you need to parse, obtaining phrases from clauses, sentences and paragraphs. Next, you calculate Internal statistics (weights), where a weight refers to the frequency with which a phrase occurs in relation to other phrases.
After that, the data is indexed by a common dictionary, like Merriam, and annotated with subtexts.
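A minimal sketch of the weighting step, under one possible reading of the description above: each sentence shares a total weight of 1 equally among the phrases extracted from it, and a phrase that occurs in several sentences accumulates its shares. That reading is an assumption, not something stated in the comment, although it fits the example: a one-phrase sentence gives ‘Fire!’ weight 1, and a sentence with eight phrases would give it 1/8 = 0.125, close to the quoted 0.12. The class name PhraseWeightSketch, the method weigh and the placeholder phrases p2 to p8 are hypothetical, and phrase extraction itself is taken as already done.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PhraseWeightSketch {

    // Each inner list holds the phrases already extracted from one sentence.
    // Assumption: a sentence splits a weight of 1.0 evenly across its phrases,
    // and shares for the same phrase are summed across sentences.
    static Map<String, Double> weigh(List<List<String>> sentences) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (List<String> phrases : sentences) {
            if (phrases.isEmpty()) {
                continue;                       // nothing to weigh in this sentence
            }
            double share = 1.0 / phrases.size();
            for (String phrase : phrases) {
                weights.merge(phrase, share, Double::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Sentence (a): a single phrase takes the whole weight.
        List<List<String>> shortText = Arrays.asList(Arrays.asList("fire"));
        // Sentence (b): "fire" shares the weight with seven placeholder phrases.
        List<List<String>> longText = Arrays.asList(
            Arrays.asList("fire", "p2", "p3", "p4", "p5", "p6", "p7", "p8"));

        System.out.println(weigh(shortText).get("fire"));
        System.out.println(weigh(longText).get("fire"));
    }
}
```

Under this assumption the same phrase weighs less the more other phrases surround it, which matches the distinction drawn between sentences (a) and (b).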

7 Ilya Geller commented Permalink

Russell said on antinomies: ‘I recognize further this element of truth in Poincaré's objection to totality, that whatever in any way concerns all of any or some (undetermined) of the members of a class must not be itself one of the members of a class (Russell 2001).’
Nothing is not a member of any class, it’s Nothing. For instance, suppose there is a town with just one barber, who is male. In this town, every man keeps himself clean-shaven, and he does so by doing exactly one of two things:
- shaving himself;
- or going to the barber.
However, everything is always becoming what it is not: everything (and I use here the Biblical concept of the Judgement Day) becomes Nothing. Or, in Hegel’s words: ‘Hence, the truth of being and nothing alike is the unity of both of them; this unity is becoming’ (Hegel 1991).
The barber and everybody else sooner or later will become what they are not and finally pass away, and Russell did not take that into account.

 
Your tokens are also outside time unless they describe 'How does what happens with 'What' look?' along with naming 'What?' and 'The action which happens with 'What'?'
In this I follow Poincare.
‘A predicative phrase is a predicative definition preferably characterized by combinations of nouns and other parts of speech, such as a verb and an adjective and an article (e.g., the-grey-city-is) (US Patent 8447789).’ Geller decided that a predicative phrase may have an undefined number of words as parts of speech, and it should answer at least these three questions: ‘What?’, ‘What is going on with ‘what’?’ and ‘How does it look like?’ (Geller 2003, Geller 2005 and US Patent 8504580).
Geller pedantically followed Poincare’s instructions when filing his patents and writing his articles: ‘Logical inferences alone are epistemically inadequate to express the essential structure of a genuine mathematical reasoning in view of its understandability… As a consequence of the logical antinomies, one should avoid any impredicative concept formation (Poincare 1920).’

8 Ilya Geller commented Permalink

This is a sample of structured text. Sorry, some parts are hidden, but you can see the general idea executed; I have a really working program:
god - will - <> : 1050000
philosopher - <> - true : 1000000
thyrsus - be - <> : 1000000
place - find - <> : 1000000
mystery - say - in : 500000
they - say - in : 500000
my - accord - <> : 500000
i - have - succeeded : 500000
whether - have - succeeded : 500000
<> - be - few : 500000
ability - accord - <> : 500000
whom - <> - in : 500000
number - <> - in : 500000
i - interpret - <> : 500000
word - interpret - <> : 500000
<> - be - mystic : 500000
such - be - <> : 333333
i - have - <> : 333333
belief - be - <> : 333333
life - <> - whole : 333333
my - <> - whole : 333333
during - <> - whole : 333333
my - be - <> : 333333
i - seek - <> : 333333
i - be - <> : 333333
purgation - be - <> : 250000
wisdom - be - <> : 250000
herself - be - <> : 250000
them - be - <> : 250000
shadow - be - only : 200000
these - make - good : 200000
these - be - good : 200000
these - virtue - good : 200000
virtue - make - good : 200000
shadow - virtue - only : 200000
virtue - be - shadow : 200000
virtue - be - only : 200000
virtue - shadow - only : 200000
virtue - be - good : 200000
myself - arrive - world : 111111
founder - will - real : 111111
other - arrive - in : 111111
founder - have - real : 111111
mystery - will - real : 111111
founder - appear - real : 111111
mystery - have - real : 111111
meaning - will - real : 111111
mystery - appear - real : 111111
when - arrive - in : 111111
when - arrive - world : 111111
i - arrive - in : 111111
i - arrive - world : 111111
myself - arrive - in : 111111
meaning - have - real : 111111
other - arrive - world : 111111
world - arrive - in : 111111
meaning - appear - real : 111111
when - be - severe : 100000
when - be - exchanged : 100000
they - be - severe : 100000
they - be - exchanged : 100000
one - be - severe : 100000
wisdom - be - severe : 100000
wisdom - be - exchanged : 100000
another - be - exchanged : 100000
one - be - exchanged : 100000
another - be - severe : 100000
i - know - in : 71428
i - know - little : 71428
i - shall - truly : 71428
i - shall - in : 71428
i - shall - little : 71428
i - shall - while : 71428
i - know - truly : 71428
little - know - while : 71428
i - know - while : 71428
little - shall - truly : 71428
little - shall - in : 71428
little - shall - while : 71428
little - know - truly : 71428
little - know - in : 71428
way - have - in : 58823
i - have - in : 58823
right - have - in : 58823
whether - have - right : 58823
way - right - in : 58823
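To work with lines like these programmatically, a small parser is enough. The sketch below is only an illustration based on the layout visible in the sample (three slots separated by " - ", then " : " and an integer weight); the class name StructuredLineSketch and its field names are hypothetical, and the `<>` marker is kept as-is rather than interpreted.

```java
class StructuredLineSketch {
    final String first;
    final String second;
    final String third;
    final long weight;

    StructuredLineSketch(String first, String second, String third, long weight) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.weight = weight;
    }

    // Parses one line of the form "a - b - c : 123456".
    // Assumption: exactly three slots and a non-negative integer weight.
    static StructuredLineSketch parse(String line) {
        int colon = line.lastIndexOf(" : ");
        if (colon < 0) {
            throw new IllegalArgumentException("missing ' : ' separator: " + line);
        }
        String[] parts = line.substring(0, colon).split(" - ");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected three slots: " + line);
        }
        long weight = Long.parseLong(line.substring(colon + 3).trim());
        return new StructuredLineSketch(parts[0].trim(), parts[1].trim(), parts[2].trim(), weight);
    }

    public static void main(String[] args) {
        StructuredLineSketch entry = parse("god - will - <> : 1050000");
        System.out.println(entry.first + " | " + entry.second + " | " + entry.third + " | " + entry.weight);
    }
}
```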

9 Ilya Geller commented Permalink

Geller’s idea of String Theory:
- Knowledge is the First String;
- a/the word(s) is/are the Second;
- a paragraph is the Third, our String;
- there are to be numerous higher Strings.
In Differential Linguistics it is said that an isolated (separated) word is a non-predicative definition (in Poincare's sense): it is a noun, Knowledge. As soon as the isolated word is incorporated into a predicative phrase it becomes an opinion. Indeed, previously Geller referred to this example: a non-predicative definition ‘ggffrrtte’ (Geller 2004). Who can tell what Geller meant by ‘ggffrrtte’ unless it is included in a phrase and, consequently, explained by multiple words and subtexts?
Moore, in Truth and Falsity, wrote on Knowledge: ‘So far, indeed, from truth being defined by reference to reality, reality can only be defined by reference to truth (Moore 1993).’ Wittgenstein at 1.1 and 1.3 clarified: ‘The world is the totality of facts, not of things… The facts in logical space are the world.’ The pillars of Analytical Philosophy divided reality and truth, world and facts, as Geller divided Knowledge from opinions.
As you can see, Russell was confused in telling what Knowledge is, as were all other authors without a single exception: ‘…knowledge might be defined as belief which is in agreement with the facts …and no one knows what sort of agreement between them would make a belief true (Russell 1926).’ Merriam Webster, for instance, is also lost: ‘information, understanding, or skill that you get from experience or education; awareness of something; the state of being aware of something’.
Knowledge is Nothing in Hegel’s sense: ‘...pure being is the pure abstraction, and hence it is the absolutely negative, which when taken immediately, is equally nothing (Hegel 1991).’ Geller understands the absence of words, silence, as the Knowledge of everything at once (Geller 2005, Geller 2006). Wittgenstein thought the same: ‘Whereof one cannot speak, thereof one must be silent’.
Thus, there are two kinds of Knowledge: Knowledge as silence and as non-predicative definitions. Text is not Knowledge but it is Nothing, though.
Geller thinks that Knowledge is the limit: the function of Knowledge changes its nature as soon as it reaches the limit. This allowed Geller to apply Differential Analysis to (Computational) Linguistics / Analytical Philosophy.

10 Ilya Geller commented Permalink

What I mean: each token should be a predicative definition, a phrase, whereas a separate word alone is a non-predicative definition.
We exist in the Third String, we must operate with what is ours. Words are from the Second String, they are of Nothing, they describe everything at once.
‘A predicative phrase is a predicative definition preferably characterized by combinations of nouns and other parts of speech, such as a verb and an adjective and an article (e.g., the-grey-city-is) (US Patent 8447789).’ Geller decided that a predicative phrase may have an undefined number of words as parts of speech, and it should answer at least these three questions: ‘What?’, ‘What is going on with ‘what’?’ and ‘How does it look like?’ (Geller 2003, Geller 2005 and US Patent 8504580).
Three members, I think, is better. I wrote an article for NIST TREC 2004 on the preposition-adjective 'in': its existence forces me to insist that there should be at least three members.
However, in my Claim I said:
3. The method of claim 1 wherein each context phrase is a combination of a noun with other parts of speech at least one of which is a verb or an adjective.
For instance, an article is very important.