The Harvest Tracker tool
Find details about the command syntax and the tool's output.
Command syntax
The Harvest Tracker tool is a Python script that you run on the data server. The command syntax is as follows:
- loops
- Specifies how many times the tool is to check the data server services. The default value is maxint, which corresponds to 2,147,483,647.
- loop_time
- Specifies the time interval for the checks in seconds. The default value is 30.
- logfile
- Defines the path to the log file. The default log file is /deepfs/config/harvest_tracker.log
- qcount
- Specifies the maximum number of objects in queue to be listed by name.
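As a rough illustration of how loops and loop_time interact, the following Python sketch estimates how long the tool keeps checking. It assumes that the run time is simply the number of checks multiplied by the interval and ignores the time that each check itself takes. The function name approximate_run_time is hypothetical and not part of the tool.

```python
from datetime import timedelta


def approximate_run_time(loops, loop_time=30):
    """Estimate the total monitoring time for a given number of checks.

    Assumption: the tool waits loop_time seconds between checks and the
    checks themselves take negligible time.
    """
    return timedelta(seconds=loops * loop_time)


# 120 checks at the default interval of 30 seconds
one_hour = approximate_run_time(120)
print(one_hour)
```

With the default value for loops, the estimate is so large that the tool effectively runs until you stop it.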
Running the command python32 /usr/local/storediq/bin/util/harvest_tracker.pyc -h displays the supported options and the default values.
At any time, you can stop the tool by pressing Enter and then entering y as confirmation.
While the tool is running, you will see additional messages like this one on the terminal where you started the tool:
Tue Feb 26 16:41:14 2019:pubsub/reconnectingpbclientfactory.py:64:ReconnectingPBClientFactory._onRemoteOk
These messages come from the data server and cannot be suppressed. They have nothing to do with the Harvest Tracker tool and can, therefore, be ignored. They are not written to the harvest_tracker.log file.
Output for active harvests
The statistics are written to an individual output block for each harvest, where the blocks are separated by dashed lines. The first line of each block shows the volume ID and the start time of that specific harvest. The statistics include the following information:
- Ingest Q: Cur len:
- Number of objects in the queue that are waiting for text extraction
- Ingest Q: Total in:
- Total number of objects that entered the queue since the harvest started
- Ingest Q: File:
- Full path to the file in the queue
- Output Q: Cur len:
- Number of objects in the queue that are waiting for node-indexing (PostGres)
- Output Q: Total in:
- Total number of objects that entered the queue since the harvest started
- Output Q: File:
- Full path to the file present in the queue
- Findex Q: Cur len:
- Number of objects that are queued for full-text indexing (into Lucene)
- Findex Q: Total in:
- Total number of objects that entered the full-text indexing queue since the harvest started
- Findex Q: File:
- Full path to the file in the full-text indexing queue
- TktMstr: Cur len:
- Number of objects being actively tracked by Ticket Master
- TktMstr: Total in:
- Total number of objects tracked by Ticket Master so far
- Harvest: Time to complete:
- Expected remaining processing time
- Harvest: Percent complete:
- Percent complete
- Harvest: Estimated vol size:
- Estimated size of the volume being harvested
- Harvest: Obj/s:
- Object processing rate
- Harvest: Max time file:
- Name of the file with the longest processing time so far
- Harvest: Max time val:
- Processing time of the file listed under Harvest: Max time file:
- Harvest: Max size file:
- Name of the largest file encountered so far
- Harvest: Max size val:
- Size of the file listed under Harvest: Max size file:
- ObjTracker: VolId:
- The ID of the volume being harvested
- ObjTracker: Files extended:
- A comma-separated list of paths to files that are taking longer than the normal processing time
- ObjTracker: Files expired:
- A comma-separated list of paths to files whose processing could not be completed in the allotted maximum time
[Thu Oct 31 20:46:38 2019] ------------------------------
[Thu Oct 31 20:47:08 2019] Harvest: ---- VolId: 260, Name: ThisVolume, Start: Thu Oct 31 20:41:23 2019 ----(active)
[Thu Oct 31 20:47:08 2019] Ingest Q: Cur len: 55, Total in: 813
[Thu Oct 31 20:47:08 2019] Ingest Q: File: d3/xlsfiles/Certificate or license number.xls
[Thu Oct 31 20:47:08 2019] Ingest Q: File: d3/xlsfiles/Companies.xls
[Thu Oct 31 20:47:08 2019] Ingest Q: File: d3/xlsfiles/Countries.xls
[Thu Oct 31 20:47:08 2019] Output Q: Cur len: 12, Total in: 1203
[Thu Oct 31 20:47:08 2019] Output Q: File: d3/xlsfiles/Allattributes3.xls
[Thu Oct 31 20:47:08 2019] Findex Q: Cur len: 472, Total in: 1085
[Thu Oct 31 20:47:08 2019] Findex Q: File: d3/nsf/smoke.nsf
[Thu Oct 31 20:47:08 2019] Findex Q: File: d3/nsf/smoke.nsf
[Thu Oct 31 20:47:08 2019] Findex Q: File: d3/nsf/smoke.nsf
[Thu Oct 31 20:47:08 2019] Tkt Mstr: Cur len: 34, Total in: 112
[Thu Oct 31 20:47:08 2019] Tkt Mstr: File: d3/pptfiles/Device identifier or serial.ppt
[Thu Oct 31 20:47:08 2019] Tkt Mstr: File: d3/pptfiles/Discharge date.ppt
[Thu Oct 31 20:47:08 2019] Tkt Mstr: File: d3/pptfiles/Relative's full name.ppt
[Thu Oct 31 20:47:08 2019] Harvest: Time to complete: 1 minute 29 seconds
[Thu Oct 31 20:47:08 2019] Harvest: Percent complete: 78.01
[Thu Oct 31 20:47:08 2019] Harvest: Estimated vol size: 773 objects
[Thu Oct 31 20:47:08 2019] Harvest: Obj/s: 3.03
[Thu Oct 31 20:47:08 2019] Harvest: Max time file: d1/pst/standard.pst, Max time val: 276.00
[Thu Oct 31 20:47:08 2019] Harvest: Max size file: sips/bigsip_58m.pdf, Max size val: 58616009
[Thu Oct 31 20:47:08 2019] ObjTracker: files extended: []
[Thu Oct 31 20:47:08 2019] ObjTracker: files expired: []
[Thu Oct 31 20:47:08 2019] ------------------------------
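If you want to process these statistics with a script, the queue lines can be picked out with a regular expression. The following Python sketch is only an illustration: it assumes that the lines have exactly the layout of the sample output above, and the names QUEUE_LINE and parse_queue_stats are hypothetical and not part of the Harvest Tracker tool.

```python
import re

# Matches lines such as:
# [Thu Oct 31 20:47:08 2019] Ingest Q: Cur len: 55, Total in: 813
# Assumption: the layout is the same as in the sample output.
QUEUE_LINE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+"
    r"(?P<queue>[^:]+):\s+Cur len:\s+(?P<cur>\d+),\s+Total in:\s+(?P<total>\d+)\s*$"
)


def parse_queue_stats(lines):
    """Return {queue name: (current length, total in)} for one output block."""
    stats = {}
    for line in lines:
        match = QUEUE_LINE.match(line)
        if match:  # "File:" lines and harvest lines do not match and are skipped
            stats[match.group("queue").strip()] = (
                int(match.group("cur")),
                int(match.group("total")),
            )
    return stats


sample = [
    "[Thu Oct 31 20:47:08 2019] Ingest Q: Cur len: 55, Total in: 813",
    "[Thu Oct 31 20:47:08 2019] Ingest Q: File: d3/xlsfiles/Companies.xls",
    "[Thu Oct 31 20:47:08 2019] Output Q: Cur len: 12, Total in: 1203",
    "[Thu Oct 31 20:47:08 2019] Findex Q: Cur len: 472, Total in: 1085",
    "[Thu Oct 31 20:47:08 2019] Tkt Mstr: Cur len: 34, Total in: 112",
]
queue_stats = parse_queue_stats(sample)
```

Comparing the current length of each queue across successive blocks can indicate which processing stage is falling behind.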
Output for finished harvests
After a harvest is complete, the tool can provide a summary of the harvest operation. The summary contains the following information:
- Ingest Q: Longest Q len:
- Largest number of objects in the queue that waited for text extraction
- Output Q: Longest Q len:
- Largest number of objects in the queue that waited for node-indexing (PostGres)
- Findex Q: Longest Q len:
- Largest number of objects in the queue that waited for full-text indexing (into Lucene)
- TktMstr: Longest Q len:
- Maximum number of objects tracked
- Harvest: Min Obj/s:
- Lowest object processing rate
- Harvest: Max Obj/s:
- Highest object processing rate
- Harvest: Max time file:
- Name of file that took the longest processing time
- Harvest: Max time val:
- Actual processing time of the file listed under Harvest: Max time file:
- Harvest: Max size file:
- Name of largest file processed
- Harvest: Max size val:
- Size of the file listed under Harvest: Max size file:
- Harvest: Total:
- Total harvest time
- ObjTracker: VolId:
- The ID of the volume that was harvested
- ObjTracker: files extended:
- A comma-separated list of paths to files that took longer than the normal processing time
- ObjTracker: files expired:
- A comma-separated list of paths to files whose processing could not be completed in the allotted maximum time
- Object analysis: Total obj types:
- Total number of different file extensions encountered so far
- Object analysis: Total obj count:
- Total number of system-level objects processed so far. Note that this count currently does not include objects inside containers.
- Object analysis: Longest tkt time:
- Longest life of a ticket tracked by Ticket Master
- Object analysis: ext:
- Extension (type) of object
- Object analysis: Count:
- Number of tickets of extension/type processed
- Object analysis: Max tkt time:
- Longest life of ticket of this extension/type tracked by Ticket Master
[Thu Oct 31 20:48:38 2019] ------------------------------
[Thu Oct 31 20:48:38 2019] Harvest_Tracker Summary...
[Thu Oct 31 20:48:38 2019] Harvest: ---- VolId: 260, Name: ThisVolume, Start: Thu Oct 31 20:41:23 2019 ----(stale)
[Thu Oct 31 20:48:38 2019] IngestQ: Longest Q len: 813
[Thu Oct 31 20:48:38 2019] OutputQ: Longest Q len: 1324
[Thu Oct 31 20:48:38 2019] FindexQ: Longest Q len: 1196
[Thu Oct 31 20:48:38 2019] TktMstr: Longest Q len: 112
[Thu Oct 31 20:48:38 2019] Harvest: Min Obj/s: 0.01, Max Obj/s: 3.63
[Thu Oct 31 20:48:38 2019] Harvest: Max time file: d1/pst/standard.pst, Max time val: 282.00
[Thu Oct 31 20:48:38 2019] Harvest: Max size file: sips/bigsip_58m.pdf, Max size val: 58616009
[Thu Oct 31 20:48:38 2019] Harvest: total: 0d 0h:7m:15s
[Thu Oct 31 20:48:38 2019] ObjTracker: files extended: []
[Thu Oct 31 20:48:38 2019] ObjTracker: files expired: []
[Thu Oct 31 20:48:38 2019] Object analysis:
[Thu Oct 31 20:48:38 2019] Total obj types: 22, total obj count: 793, Longest tkt time: 32.19:
[Thu Oct 31 20:48:38 2019] ext: 'xls': count: 193 (24.00%), Max tkt time: 2.39
[Thu Oct 31 20:48:38 2019] ext: 'ppt': count: 191 (24.00%), Max tkt time: 4.91
[Thu Oct 31 20:48:38 2019] ext: 'doc': count: 185 (23.00%), Max tkt time: 32.19
[Thu Oct 31 20:48:38 2019] ext: 'pdf': count: 181 (22.00%), Max tkt time: 15.01
[Thu Oct 31 20:48:38 2019] ext: 'mail': count: 11 (1.00%), Max tkt time: 0.03
[Thu Oct 31 20:48:38 2019] ext: 'msg': count: 7 (0.00%), Max tkt time: 0.02
[Thu Oct 31 20:48:38 2019] ext: 'eml': count: 4 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'mbx': count: 3 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'note': count: 2 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'pst': count: 2 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'rar': count: 2 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'nsf': count: 2 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'jpg': count: 1 (0.00%), Max tkt time: 0.01
[Thu Oct 31 20:48:38 2019] ext: 'rtf': count: 1 (0.00%), Max tkt time: 0.00
[Thu Oct 31 20:48:38 2019] ext: 'h': count: 1 (0.00%), Max tkt time: 0.00
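The per-extension lines of the object analysis lend themselves to a quick scripted check, for example to find the file type with the longest ticket life. The following Python sketch assumes the line layout of the sample summary above. The names EXT_LINE, parse_extension_rows, and slowest_extension are hypothetical and not part of the Harvest Tracker tool.

```python
import re

# Matches lines such as:
# [Thu Oct 31 20:48:38 2019] ext: 'xls': count: 193 (24.00%), Max tkt time: 2.39
# Assumption: the layout is the same as in the sample summary.
EXT_LINE = re.compile(
    r"ext:\s+'(?P<ext>[^']*)':\s+count:\s+(?P<count>\d+)\s+"
    r"\((?P<pct>[\d.]+)%\),\s+Max tkt time:\s+(?P<max_time>[\d.]+)"
)


def parse_extension_rows(lines):
    """Return a list of (extension, count, max ticket time) tuples."""
    rows = []
    for line in lines:
        match = EXT_LINE.search(line)
        if match:
            rows.append(
                (
                    match.group("ext"),
                    int(match.group("count")),
                    float(match.group("max_time")),
                )
            )
    return rows


def slowest_extension(rows):
    """Return the row with the largest maximum ticket time."""
    return max(rows, key=lambda row: row[2])


sample = [
    "[Thu Oct 31 20:48:38 2019] ext: 'xls': count: 193 (24.00%), Max tkt time: 2.39",
    "[Thu Oct 31 20:48:38 2019] ext: 'ppt': count: 191 (24.00%), Max tkt time: 4.91",
    "[Thu Oct 31 20:48:38 2019] ext: 'doc': count: 185 (23.00%), Max tkt time: 32.19",
    "[Thu Oct 31 20:48:38 2019] ext: 'pdf': count: 181 (22.00%), Max tkt time: 15.01",
]
rows = parse_extension_rows(sample)
slowest = slowest_extension(rows)
```

For the sample summary, the slowest extension is doc, which agrees with the Longest tkt time value of 32.19 reported in the Object analysis header line.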
