

Connection to Web search engine


     

 
 

This question is answered.

Replies: 4 - Last Post: Oct 1, 2009 9:50 AM - Last Post By: VVTs
VVTs

Posts: 30
Registered: Jun 15, 2009 07:52:26 AM
Connection to Web search engine
Posted: Sep 30, 2009 08:08:40 AM
 
Hi!

I can't create a collection of web documents in LRW 7.1.1.3. We are behind a proxy, so we entered the proxy connection details under the manual proxy configuration item.
The error "java.net.SocketException: Malformed reply from SOCKS server" appears.
(Language is English.)

Could I have the web address of the search engine, so that I can check the connection in a browser?

Regards,
Valentin
KevinCunnane

Posts: 75
Registered: Feb 03, 2009 10:46:24 AM
Re: Connection to Web search engine
Posted: Sep 30, 2009 09:56:03 AM   in response to: VVTs's post
 
Hi Valentin. Unfortunately proxy connections are not currently supported when creating a document collection from the web. Thanks for notifying us of this limitation - we will investigate this for future releases.

In answer to the last part of your question, we connect to Google Search for this feature: http://www.google.com/search
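
For a check outside the workbench, a small Java program can request that address through a proxy. This is only a sketch under assumptions: the class name `ProxyCheck`, its helper methods, and the proxy host and port passed on the command line are all hypothetical, and it assumes the proxy speaks HTTP. The "Malformed reply from SOCKS server" message is what Java typically reports when the configured proxy does not answer in the SOCKS protocol, for example when an HTTP proxy is entered as a SOCKS proxy; whether that is the cause here is a guess.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

class ProxyCheck {
    // Describes an HTTP (not SOCKS) proxy; host and port are supplied by the caller.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Sends a HEAD request through the proxy and returns the HTTP status code.
    static int probe(String address, Proxy proxy) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection(proxy);
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java ProxyCheck <proxyHost> <proxyPort>");
            return;
        }
        Proxy proxy = httpProxy(args[0], Integer.parseInt(args[1]));
        System.out.println("HTTP status: " + probe("http://www.google.com/search", proxy));
    }
}
```

Any HTTP status code coming back means the proxy and the search address are reachable; an exception points to the proxy settings or the network.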

Kevin
VVTs

Posts: 30
Registered: Jun 15, 2009 07:52:26 AM
Re: Connection to Web search engine
Posted: Oct 01, 2009 02:40:03 AM   in response to: VVTs's post
 
Hi, Kevin!

Thanks!

The main aim of my connection test was to gather statistics about the encodings of Russian HTML pages. I suspect they are not always UTF-8; more often they are windows-1251 or KOI8-R, both single-byte encodings. So one more question: is it possible to extend the list of encodings supported for anno files?

Regards,
Valentin.
KevinCunnane

Posts: 75
Registered: Feb 03, 2009 10:46:24 AM
Re: Connection to Web search engine
Posted: Oct 01, 2009 06:56:29 AM   in response to: VVTs's post
 
Hi Valentin. At present, LanguageWare annotators support only UTF input. This is standard for Java applications. However, it is not hard to convert from other encodings to UTF format. Java has support for this when reading from files: I believe that if you specify the charset of the file, Java will automatically convert the text to UTF format. There are also tools such as ICU (International Components for Unicode) that have good support for doing this conversion in a program.
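
As a rough illustration of the charset-based conversion described above, the following Java sketch decodes bytes with a named source charset and re-encodes them as UTF-8. The class name `ToUtf8` and its methods are made up for this example and are not part of LanguageWare, and it assumes the source charset (for instance windows-1251 or KOI8-R) is already known and available in the Java runtime.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ToUtf8 {
    // Decodes the bytes using the given source charset and re-encodes the text as UTF-8.
    static byte[] transcode(byte[] input, Charset source) {
        String text = new String(input, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Reads a whole file, converts it, and writes the UTF-8 result to another file.
    static void convertFile(Path in, Path out, String sourceCharset) throws IOException {
        byte[] converted = transcode(Files.readAllBytes(in), Charset.forName(sourceCharset));
        Files.write(out, converted);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: java ToUtf8 <input> <output> <sourceCharset>");
            return;
        }
        convertFile(Paths.get(args[0]), Paths.get(args[1]), args[2]);
    }
}
```

A call such as `java ToUtf8 page.html page-utf8.html windows-1251` (file names hypothetical) would then produce a UTF-8 copy of the page. Reading the whole file into memory keeps the sketch short; a streaming reader and writer would suit very large files better.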

For example, the Document Collection Creator in the workbench checks the character encoding declared in the HTML meta tag, then tries to convert from that encoding to UTF-8. For Russian web pages, you would probably want to do something similar before passing the text to the LanguageWare annotators.
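
A similar pre-processing step could look roughly like the sketch below. It is not the Document Collection Creator's actual code: the class `MetaCharset`, the regular expression, the 4096-byte scan window, and the fallback to UTF-8 are all assumptions made for illustration. The idea is to read the start of the page as ISO-8859-1 (which maps every byte to a character), look for a charset declaration in a meta tag, and decode the whole page with the charset found.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharset {
    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag, or UTF-8 if none is found or it is unknown.
    static Charset detect(byte[] html) {
        int limit = Math.min(html.length, 4096); // assumed scan window
        String head = new String(html, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall back to the default below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the page into a Java string using the detected charset.
    static String decode(byte[] html) {
        return new String(html, detect(html));
    }

    public static void main(String[] args) {
        byte[] sample = "<meta charset=\"koi8-r\"><p>test</p>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

The string returned by `decode` can then be written out as UTF-8 before the text is handed to the annotators. Pages that declare their encoding only in the HTTP header, or declare it wrongly, would need extra handling that this sketch leaves out.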

Hope this helps you.

Kevin
VVTs

Posts: 30
Registered: Jun 15, 2009 07:52:26 AM
Re: Connection to Web search engine
Posted: Oct 01, 2009 09:50:39 AM   in response to: VVTs's post
 
Thanks!

Notepad did the manual conversion, and a Perl script works well for batch conversion, but I'll try your link.

Regards,
Valentin
