I am using the REST API to try to remove a document from the index.
I use the API command
Every time I use it, the response says the call was successful, but when I check the directory, the document has not actually been removed from the index.
The document ID is the ID returned from a SIAPI search. It is the URL of a site.
I have tried it with both an unencoded documentId and an encoded documentId; neither has been successful.
Has anyone had success removing a document from the index using the REST API when the document is a web page?
As an example I have a documentId of http://apps.nrc.nl/stijlboek/muur-muur
and if I encode it, I have tried the following as the documentId:
and also:
It is a shame the ID is not just a number, so that these problems would not occur.
Pinned topic Remove a document. URL encode the documentId
Updated on 2013-02-18T10:27:47Z by SystemAdmin
hkurokaw
Re: Remove a document. URL encode the documentId (2012-11-21T09:11:24Z)
Hi,
I am afraid you might still have specified the wrong document ID. The document ID is usually already in encoded form, like "http%3A%2F%2Fwww.ibm.com%2Findex.html", so you might have to encode the ID again, as in "http%253A%252F%252Fwww%2eibm%2ecom%252Findex%2ehtml". Can you verify the document ID once more with the Search Application (http://<hostname>:8393/search)?
You can see the document ID by clicking the "Show detailed properties about each document..." button on the toolbar.
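To make the double-encoding suggestion concrete, here is a minimal Python sketch that percent-encodes a URL-style document ID once and then a second time. The helper names `encode_once` and `encode_twice` are made up for illustration, and whether the REST API really expects the twice-encoded form is exactly what still needs to be verified.

```python
from urllib.parse import quote, unquote


def encode_once(raw_id: str) -> str:
    """Percent-encode every reserved character, including ':' and '/'."""
    return quote(raw_id, safe="")


def encode_twice(raw_id: str) -> str:
    """Encode an ID that is itself stored in encoded form ('%' becomes '%25')."""
    return quote(encode_once(raw_id), safe="")


raw = "http://www.ibm.com/index.html"
single = encode_once(raw)
double = encode_twice(raw)

print(single)  # http%3A%2F%2Fwww.ibm.com%2Findex.html
print(double)  # http%253A%252F%252Fwww.ibm.com%252Findex.html

# The twice-encoded example from the reply also encodes the dots as %2e.
# Decoding it once yields the same single-encoded ID.
reply_example = "http%253A%252F%252Fwww%2eibm%2ecom%252Findex%2ehtml"
print(unquote(reply_example) == single)  # True
```

Python's `quote` leaves dots alone because they are unreserved characters, so its output differs textually from the reply's example, but both decode to the same ID.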
Also, can you let me know how you verified that the document has not been removed from the index? Did you search with a query and get the document back? In that case, the document might be removed after a while. The API only requests a removal, and it might take some time until the document is actually removed from the index.
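Since the removal is only requested and may complete later, one way to check it is to poll for a while before concluding that it failed. The following is a rough Python sketch, not part of any IBM API: `is_still_indexed` is a hypothetical callable that you would implement yourself (for example, by running a search for the document ID), and the timeout and interval values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_removed(
    is_still_indexed: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the document is gone or the timeout expires.

    Returns True if the document disappeared, False if it was still
    indexed when the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if not is_still_indexed():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake check: still indexed twice, then gone.
fake_answers = iter([True, True, False])
removed = wait_until_removed(
    lambda: next(fake_answers), timeout_s=5.0, interval_s=0.0
)
print(removed)
```

If this returns False even with a generous timeout, that would support the suspicion that the removal request is not being processed at all.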
Re: Remove a document. URL encode the documentId (2013-02-18T10:26:43Z)
I have worked with Xmax on this and I believe he is right: the documents are not removed.
Let me explain the use case some more first.
We are scanning the internet for interesting documents. Whether a document is interesting is determined in the parsing stage, where interesting documents are tagged and uninteresting documents are not. To keep the application lean, we remove the documents lacking the tag every night in the service window. We do this by stopping the crawler, firing the StreamingResultSet from the SIAPI, and feeding the result to the REST API to remove each individual document. When that is done, we restart the parse and index service and restart the crawler.
We now find that this sometimes works, but most of the time it doesn't. I have not yet discovered the factor that makes the process work correctly. I thought restarting the server did the trick, but no... We have too many documents in the corpus now, so we run out of the service window.
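For reference, the nightly loop described above, with a guard for the service window, could look roughly like the Python sketch below. This is not our actual code: `remove_document` is a hypothetical callable standing in for the REST removal request, `document_ids` stands in for the IDs streamed from the search result, and the window length is an assumption.

```python
import time
from typing import Callable, Iterable, List, Tuple


def remove_within_window(
    document_ids: Iterable[str],
    remove_document: Callable[[str], bool],
    window_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, List[str], List[str]]:
    """Request removal for each ID until the service window closes.

    Returns (number of accepted requests, IDs whose request failed,
    IDs not attempted because the window ran out).
    """
    deadline = clock() + window_s
    requested = 0
    failed: List[str] = []
    pending: List[str] = []
    ids = iter(document_ids)
    for doc_id in ids:
        if clock() >= deadline:
            # Out of time: keep the rest for the next service window.
            pending.append(doc_id)
            pending.extend(ids)
            break
        try:
            ok = remove_document(doc_id)
        except Exception:
            ok = False
        if ok:
            requested += 1
        else:
            failed.append(doc_id)
    return requested, failed, pending


# Demo with a stub that rejects one ID; a real version would call the REST API.
demo_ids = ["doc-a", "doc-b", "doc-c"]
result = remove_within_window(demo_ids, lambda d: d != "doc-b", window_s=60.0)
print(result)
```

Keeping the failed and pending IDs separate makes it easier to tell whether documents are left over because the window was too short or because the removal requests themselves are not taking effect.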
Now, on how we know that the documents are not removed. When we perform the described procedure during office hours and it works, we can see in the admin console that the status of the parsing service changes from "waiting" to "indexing", and then the number of documents in the index decreases. For a week or two now we have not seen this work correctly, and I am inclined to think this is a malfunction of the REST API.