File-based data collection model
Typically, the File Collector Service collects data and performs computations on it in a streaming fashion. This makes the arrival of data sensitive to the processing window: after the processing window slides to a new period, any data that has an earlier timestamp is dropped.
You can tune the data collection process in two ways:
- Improving confidence
- Confidence is a measure of how certain you can be that all records within a timestamp or period are received. Increasing the confidence level minimizes data loss.
- Reducing latency
- Latency is a measure of the delay between receiving a record and performing computations on it for storage. Reducing the latency results in faster data refreshes and more near real-time data, but late data is lost.
Tuning depends on data arrival patterns and requirements. If all data arrives in a fixed pattern and within a predefined duration, then the window progress can be tuned to that duration. If the incoming data pattern is not fixed and late data arrives after long delays, then you can decide to delay the window progress. Alternatively, you can take a middle approach and wait for most of the data to arrive, accepting a calculated risk that late data gets dropped.
You must understand the following concepts to create the best configuration for your environment:
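The middle approach can be illustrated with a small calculation. The following is a minimal sketch, not part of the File Collector itself: `suggest_wait_seconds` is a hypothetical helper, the delay values are made-up sample numbers, and the 300-second wave period is an assumption. It picks the shortest wait, rounded up to whole waves, that would have covered a chosen percentage of observed arrival delays.

```python
def suggest_wait_seconds(delays, coverage_percent=90, wave_period=300):
    """Return a wait duration (seconds) covering `coverage_percent` of delays.

    Hypothetical helper for illustration only. `delays` are observed arrival
    delays in seconds; the result is rounded up to a whole number of waves.
    """
    if not delays:
        raise ValueError("need at least one observed delay")
    ordered = sorted(delays)
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = max(1, (coverage_percent * len(ordered) + 99) // 100)
    target = ordered[rank - 1]
    # Round up to a whole number of waves (at least one).
    waves = max(1, -(-target // wave_period))
    return waves * wave_period


# Made-up sample: most data arrives within minutes, one record is very late.
observed = [20, 45, 60, 90, 120, 150, 200, 280, 400, 1800]

middle = suggest_wait_seconds(observed, coverage_percent=90)
cautious = suggest_wait_seconds(observed, coverage_percent=100)
```

With this sample, covering every record requires a much longer wait than covering 90 percent of them, which is the confidence versus latency trade-off in numeric form.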
- Wave period
- The periodic interval at which the File Collector checks whether the active window is ready to be closed. A window is closed when no new data arrives for it during a set number of consecutive waves, which together make up the wait period. If new data arrives for the window within this period, then the wait period is reset and the wait continues.
- Window period
- The length of time over which a set of data is grouped before the window slides to the next period. For example, suppose that the window period is configured to 15 minutes (900 seconds) and the current window is 1:00:00 AM to 1:14:59 AM. After the window progresses, all the data that was collected and grouped for the period is written to the timeseries database, and the next period, 1:15:00 AM to 1:29:59 AM, becomes the current window. Records with timestamps older than the current window are dropped, while records with future timestamps are buffered until their window period becomes active.
- Wait duration
- The amount of time that the File Collector waits for data for the period before it progresses to the next window. This duration can be expressed as a number of waves: number of waves = wait period / wave period. For example, if the wait period is set to 10 minutes (600 seconds) and the wave period is 300 seconds, then 600/300 is equal to two waves. The window progresses only if the collector doesn't receive any new data for the period for two consecutive waves. Otherwise, the count is restarted.
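The interaction of the three concepts can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the File Collector's actual implementation: the `WindowTracker` class and its method names are hypothetical, timestamps are plain seconds since midnight, windows are assumed to align to multiples of the window period, and the wave period is taken as 300 seconds to match the 600/300 arithmetic in the wait duration example.

```python
from dataclasses import dataclass, field

WINDOW_PERIOD = 900   # seconds (15 minutes), as in the window period example
WAIT_DURATION = 600   # seconds (10 minutes)
WAVE_PERIOD = 300     # seconds; assumed so that 600 / 300 gives two waves


def window_start(timestamp: int, window_period: int = WINDOW_PERIOD) -> int:
    """Return the start of the window that contains the timestamp."""
    return timestamp - (timestamp % window_period)


@dataclass
class WindowTracker:
    """Hypothetical model of one active window and its wait counter."""
    current_start: int
    window_period: int = WINDOW_PERIOD
    waves_required: int = WAIT_DURATION // WAVE_PERIOD
    quiet_waves: int = 0
    got_data_this_wave: bool = False
    current: list = field(default_factory=list)
    buffered: list = field(default_factory=list)
    written: list = field(default_factory=list)

    def receive(self, timestamp: int) -> str:
        """Classify a record by comparing its window with the active one."""
        start = window_start(timestamp, self.window_period)
        if start < self.current_start:
            return "dropped"            # older than the current window
        if start > self.current_start:
            self.buffered.append(timestamp)
            return "buffered"           # held until its window is active
        self.current.append(timestamp)
        self.got_data_this_wave = True  # new data resets the wait
        return "accepted"

    def wave(self) -> bool:
        """Run one wave check; return True if the window progressed."""
        self.quiet_waves = 0 if self.got_data_this_wave else self.quiet_waves + 1
        self.got_data_this_wave = False
        if self.quiet_waves < self.waves_required:
            return False
        # Close the window: "write" its data and slide to the next period.
        self.written.append((self.current_start, list(self.current)))
        self.current_start += self.window_period
        self.quiet_waves = 0
        # Promote buffered records that belong to the new active window.
        # Assumption: promoted records do not count as new data for the wait.
        pending = self.buffered
        self.current = [t for t in pending
                        if window_start(t, self.window_period) == self.current_start]
        self.buffered = [t for t in pending
                         if window_start(t, self.window_period) > self.current_start]
        return True


# 1:00:00 AM is 3600 seconds after midnight; the window covers 3600..4499.
tracker = WindowTracker(current_start=3600)
results = [tracker.receive(3700), tracker.receive(3500), tracker.receive(4600)]
progress = [tracker.wave(), tracker.wave(), tracker.wave()]
```

In this run, the record at 3700 is accepted, the one at 3500 is dropped, and the one at 4600 is buffered. The first wave sees new data and resets the count, so the window progresses to 1:15:00 AM (4500) only on the third wave, after two consecutive quiet waves.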