Topic
  • 3 replies
  • Latest Post - ‏2013-02-18T10:27:47Z by SystemAdmin
SystemAdmin
197 Posts

Pinned topic Remove a document. URL encode the documentId

‏2012-11-18T00:03:43Z |
I am using the REST API to try to remove a document from the index.

I use the API command
/admin/document?method=remove&collectionId=collid&documentId=url&api_username=username&api_password=password
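As a minimal sketch of how such a request URL could be assembled (Python is an assumption here, since the thread does not say which client language is used), the standard library's `urlencode` percent-encodes every parameter value exactly once. The base URL, the helper name `build_remove_url`, and the sample document ID are hypothetical placeholders; only the path and parameter names come from the command above.

```python
from urllib.parse import urlencode


def build_remove_url(base_url, collection_id, document_id, username, password):
    """Assemble the remove request; urlencode percent-encodes each value once."""
    params = {
        "method": "remove",
        "collectionId": collection_id,
        "documentId": document_id,
        "api_username": username,
        "api_password": password,
    }
    return f"{base_url}/admin/document?{urlencode(params)}"


# Hypothetical base URL and document ID, used only for illustration.
url = build_remove_url(
    "http://localhost:8390/api",
    "collid",
    "http://example.com/a/b",
    "username",
    "password",
)
print(url)
```

Letting the library do the encoding avoids accidentally encoding a value twice or not at all when the URL is built by hand.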

Every time I use it, the response says the request was successful, but when I check the directory, the document has not actually been removed from the index.

The document ID is the ID returned from a SIAPI search. It is the URL of a site.

I have tried it with both an unencoded documentId and an encoded documentId; neither has been successful.

Has anyone had success with removing a document from the index using the REST API when the document is a web page?

As an example I have a documentId of http://apps.nrc.nl/stijlboek/muur-muur

When encoding it, I have tried the following as the documentId:

http%3A%2F%2Fapps.nrc.nl%2Fstijlboek%2Fmuur-muur
and also:

http:%2F%2Fapps.nrc.nl%2Fstijlboek%2Fmuur-muur

It is a shame the ID is not just a number, which would avoid these problems.
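The two encoded forms above can be reproduced with Python's `urllib.parse.quote` (Python is an assumption, not something the thread specifies). The sketch below also shows that both forms decode back to the same document ID, so a server that decodes the query string once should see an identical value either way.

```python
from urllib.parse import quote, unquote

doc_id = "http://apps.nrc.nl/stijlboek/muur-muur"

# Encode every reserved character, including ":" and "/".
fully_encoded = quote(doc_id, safe="")

# Leave ":" untouched and encode only the slashes.
colon_kept = quote(doc_id, safe=":")

print(fully_encoded)  # http%3A%2F%2Fapps.nrc.nl%2Fstijlboek%2Fmuur-muur
print(colon_kept)     # http:%2F%2Fapps.nrc.nl%2Fstijlboek%2Fmuur-muur

# Both variants decode to the original document ID.
assert unquote(fully_encoded) == unquote(colon_kept) == doc_id
```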
Updated on 2013-02-18T10:27:47Z by SystemAdmin
  • hkurokaw
    1 Post

    Re: Remove a document. URL encode the documentId

    ‏2012-11-21T09:11:24Z  
    Hi,
    I am afraid you might have specified the wrong document ID. Usually, the document ID is itself stored in encoded form, like "http%3A%2F%2Fwww.ibm.com%2Findex.html", so you might have to encode the ID a second time, like "http%253A%252F%252Fwww%2eibm%2ecom%252Findex%2ehtml". Can you verify the document ID once again in the Search Application (http://<hostname>:8393/search)?

    You can see the document ID by clicking the "Show detailed properties about each document..." button on the toolbar.
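A minimal sketch of the double encoding suggested above, assuming a Python client (the thread does not name one) and a hypothetical helper `encode_twice`. The first pass produces the form the document ID is said to be stored in; the second pass escapes the percent signs so that the stored form survives one round of decoding on the server.

```python
from urllib.parse import quote, unquote


def encode_twice(raw_url):
    """Percent-encode a URL, then encode the result again for use as a query value."""
    stored_id = quote(raw_url, safe="")  # e.g. http%3A%2F%2Fwww.ibm.com%2Findex.html
    return quote(stored_id, safe="")     # "%" becomes "%25"


print(encode_twice("http://www.ibm.com/index.html"))
# http%253A%252F%252Fwww.ibm.com%252Findex.html
```

`quote` leaves dots unescaped, whereas the example in the post writes them as `%2e`; both strings decode once to the same stored ID.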

    Also, can you let me know how you verified that the document has still not been removed from the index? Did you search with a query and the document was returned? In that case, the document might be removed after a while. The API only requests a removal, and it might take some time until the document is actually removed from the index.
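Because the removal is described as asynchronous, one way to verify it is to poll until the document stops showing up. The sketch below is a generic polling loop in Python (an assumption); `is_still_indexed` is a hypothetical callable that the caller must supply, for example a function that runs a search and reports whether the document ID is still returned. The timeout and interval defaults are arbitrary.

```python
import time


def wait_until_removed(is_still_indexed, timeout_s=600.0, interval_s=10.0):
    """Poll until is_still_indexed() returns False; give up after timeout_s seconds.

    Returns True if the document disappeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not is_still_indexed():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A False result after a generous timeout would point to a failed removal rather than a delayed one.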

    Thank you.
  • SystemAdmin
    197 Posts

    Re: Remove a document. URL encode the documentId

    ‏2013-02-18T10:26:43Z  
    I have worked with Xmax on this, and I believe he is right: the documents are not removed.
    Let me first explain the use case some more.
    We are scanning the internet for interesting documents. Whether or not a document is interesting is determined in the parsing stage, where interesting documents are tagged and uninteresting documents are not. To keep the application lean, we remove the documents lacking the tag every night in the service window. We do this by stopping the crawler, retrieving the untagged documents as a StreamingResultSet from the SIAPI, and feeding the result to the REST API to remove each individual document. When that is done, we restart the parse and index service and restart the crawler.
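The per-document removal step of this nightly procedure could be sketched as follows. This is an illustration under stated assumptions, not the poster's actual code: Python is assumed, `remove_untagged` is a hypothetical helper, and `send_request` is a hypothetical callable (for instance a thin wrapper around an HTTP client) that returns True when the server reports success. Only the path and parameter names are taken from the first post.

```python
from urllib.parse import urlencode


def remove_untagged(document_ids, send_request, base_url,
                    collection_id, username, password):
    """Request removal of each document ID; return the IDs whose request failed."""
    failed = []
    for doc_id in document_ids:
        query = urlencode({
            "method": "remove",
            "collectionId": collection_id,
            "documentId": doc_id,  # encoded once by urlencode
            "api_username": username,
            "api_password": password,
        })
        url = f"{base_url}/admin/document?{query}"
        # A success response may only mean the request was accepted,
        # not that the document has already left the index.
        if not send_request(url):
            failed.append(doc_id)
    return failed
```

Collecting the failed IDs makes it possible to retry them in the next service window instead of rescanning the whole corpus.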
    We now find that this sometimes works, and most of the time it doesn't. I have not yet discovered the factor that makes the process work correctly. I thought restarting the server did the trick, but no... We have too many documents in the corpus now, so we run out of the service window.
    Now, on how we know that the documents are not removed. When we perform the described procedure during office hours and it works, we can see in the admin console that the status of the parsing service changes from "waiting" to "indexing", and then the number of documents in the index decreases. For the past week or two we have not been able to see this work correctly, and I am inclined to think this is a malfunction of the REST API.
  • SystemAdmin
    197 Posts

    Re: Remove a document. URL encode the documentId

    ‏2013-02-18T10:27:47Z  
    P.S. It has nothing to do with the encoding of the documentId, since we do get it to work sometimes.