Question & Answer
Basic information & best practices
Tenant buckets are separate buckets in the QRadar architecture, so data in one tenant's buckets is completely separate from data in another tenant's buckets. Each tenant has a unique set of ten retention buckets that allows data to be written to disk in /store/ariel. There is no limit on the number of tenants that can be created under a domain in QRadar, and each named tenant gets a set of ten event and ten flow retention buckets. The file structure mirrors this: there is a separate folder for each tenant's retention buckets to ensure the data is kept separate. More details on the file structure are provided below.
When it comes to best practices, we always recommend that retention buckets be created in order from shortest-lived data to longest-lived data in the user interface. The system always acts first on the data that expires first, so it is best to organize your bucket structure in the same way. This gives administrators a clear structure for retention and makes it easier to organize data.
Bucket 1 (Longest retained data)
Bucket 10 (Shortest retained data)
Default (any data not categorized to a bucket)
Bucket 1 (Most important data)
Bucket 10 (Least important data)
Default (any data not categorized to a bucket)
Figure 1: User interface example of the Retention Bucket structure.
Question 1: Which retention policy will be followed?
Suppose I set up a policy in Tenant A that says all data must be stored for 1 month, but I also set up a global policy with a filter that says: if the domain is A, store all data for 1 week.
Answer: Regardless of the order in the user interface, we always look at the data set to expire first and then at the deletion setting, "immediately" or "when space is required". Which policy fires first depends on how you set the retention bucket. If it is set to immediately, the data in Domain A expires first and is also purged first. So, the system keeps Domain A data that is less than 7 days old, then deletes anything older at the top of the hour. I'll cover how data is deleted later on.
If the event retention setting was 'when space is required', then nothing would happen until you hit 85% disk utilization. At that point, the amount of data in Domain A would determine whether disk space was reduced enough to keep other buckets from being impacted. By default, we always attempt to keep data as long as possible, unless an event retention bucket says that data can be deleted immediately after the retention period has expired. Event retention monitors for data that can be deleted starting with the shortest intervals; this is determined by the record data in /store/ariel/events|flows/records.
Question 2: How are multiple tenants stored in the file system?
Tenant data is kept in a separate folder, as mentioned above. If you look in the file system in QRadar, you will see that a new aux folder is added for tenant data.
- Events: /store/ariel/events/records/aux/tenantID#/Year/Month/Day/Hour/Minute
- Flows: /store/ariel/flows/records/aux/tenantID#/Year/Month/Day/Hour/Minute
The tenant ID is a numeric value in QRadar and is only visible from the QRadar database. To start, you would need to locate which ID is associated with the tenant. I looked to see if there was an AQL query or any hover text to help determine which ID each tenant is assigned, but there is not, so I submitted a defect to have the tenant ID shown in the user interface.
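The directory layout listed above can be sketched as a small helper. This is only an illustration: the function name tenant_hour_dir is hypothetical, the unpadded date components are an assumption based on the example path /store/ariel/events/payloads/aux/2/2017/4/11/9/ used later in this note, and the helper stops at the hour level because that example shows the minute carried in the file names.

```python
from datetime import datetime
from pathlib import PurePosixPath

ARIEL_ROOT = PurePosixPath("/store/ariel")


def tenant_hour_dir(kind: str, tenant_id: int, when: datetime) -> PurePosixPath:
    """Build the aux records directory for one tenant and one hour of data.

    Hypothetical helper: date components are assumed to be unpadded,
    as in the example path shown in this note.
    """
    if kind not in ("events", "flows"):
        raise ValueError("kind must be 'events' or 'flows'")
    return (
        ARIEL_ROOT / kind / "records" / "aux" / str(tenant_id)
        / str(when.year) / str(when.month) / str(when.day) / str(when.hour)
    )


print(tenant_hour_dir("events", 2, datetime(2017, 4, 11, 9, 30)))
```

The tenant ID still has to be looked up in the QRadar database first, as described above.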
Question 3: Is it possible for me to use/write a script that will copy the flow and event data?
The retention buckets per tenant are stored in the Ariel data itself. You can take that data, or the entire aux directory, and move it or back it up separately if administrators are required to provide additional backup protection. The data could even be moved to another QRadar appliance that has more space, as tenant data is sorted under /store/ariel. Administrators would need to identify which tenant ID belongs to which bucket; then you could move the data based on your requirements by moving or copying the files under /store/ariel../aux/tenantID#/.
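As a rough starting point for such a script, here is a minimal Python sketch that copies one tenant's aux directories to a backup location. It is not an IBM-provided tool: the function name copy_tenant_data and the destination layout are hypothetical, and it assumes that tenant data sits under records and payloads for both events and flows. Directories that do not exist are skipped.

```python
import shutil
from pathlib import Path


def copy_tenant_data(ariel_root: Path, tenant_id: int, dest_root: Path) -> list:
    """Copy every aux/<tenant_id> directory found under ariel_root to dest_root.

    Hypothetical helper: the events|flows / records|payloads layout is an
    assumption; missing directories are skipped.
    """
    copied = []
    for kind in ("events", "flows"):
        for store in ("records", "payloads"):
            src = ariel_root / kind / store / "aux" / str(tenant_id)
            if not src.is_dir():
                continue  # no data of this kind for this tenant
            dst = dest_root / kind / store / "aux" / str(tenant_id)
            # dirs_exist_ok lets the copy be repeated into the same target
            shutil.copytree(src, dst, dirs_exist_ok=True)
            copied.append(dst)
    return copied
```

A call such as copy_tenant_data(Path("/store/ariel"), 2, Path("/backup/ariel")) would copy tenant 2, where /backup/ariel is only a placeholder. Copying is the safer choice than moving while you verify the result.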
Question 4: How does data cleanup work with retention policies?
I wrote this a while back about how 0 hours would work with an event retention policy, and it covers your question as well. With any time-based retention setting for data deletion, the data is not deleted immediately. The process that deletes data off disk is called "disk maintenance" in QRadar, and it runs hourly. The retention bucket tags the data as expired; however, the data still resides on disk until disk maintenance runs. This means that, depending on when the data is written to disk, expired data could remain there for 1 minute or up to a maximum of 1 hour. It just depends on how close you are to the disk maintenance process running. This is the same regardless of whether you set 0 hours or one week. Think of retention deletions as "retention value + max 1 hour", such as "1 month + max 1 hour". When disk maintenance runs (dismaintd.pl), it is based on the Ariel data itself, not the timestamps of the events. You could consider this the storage time; in essence, it is the location /store/ariel/events/records/aux/tenantID#/Year/Month/Day/Hour/Minute where the data resides in the file system that determines what disk maintenance deletes from the system.
So, yes, a retention value of 0 is a valid time frame; however, all retention deletion relies on disk maintenance to do the actual data removal. You can tell when disk maintenance is running because a message like this appears in the logs:
May 30 06:43:46 ::ffff:10.4.1.140 hostcontext.hostcontext wipe /store/ariel/flows/records Bucket 0 com.q1labs.frameworks.maintenance.file.FileProcessor: INFO 0000006000 IP ADDRESS/- - -/- - disk maintenance for /store/ariel/flows/records started
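The "retention value + max 1 hour" rule can be worked through with a short sketch. The function name deletion_window is hypothetical, and stored_at stands for the storage time of the data (where it sits in the directory tree), not the timestamp inside the events.

```python
from datetime import datetime, timedelta
from typing import Tuple

MAINTENANCE_INTERVAL = timedelta(hours=1)  # disk maintenance runs hourly


def deletion_window(stored_at: datetime, retention: timedelta) -> Tuple[datetime, datetime]:
    """Return the earliest and latest time the data can be removed from disk."""
    expires = stored_at + retention  # the bucket tags the data as expired here
    # the next hourly disk maintenance run is at most one interval away
    return expires, expires + MAINTENANCE_INTERVAL


earliest, latest = deletion_window(datetime(2017, 4, 11, 9, 30), timedelta(days=7))
print(earliest, latest)
```

With a retention value of 0, the window is simply the hour that follows the time the data was written.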
Question 5: What exactly happens to the QRadar system when the /store partition is filled/getting filled?
When any monitored partition on an appliance (/, /store, /store/tmp, /store/transient, /var/log) hits 95% full, QRadar stops services (hostcontext) to keep the disk from filling to 100%, which could prevent the system from booting because no disk space is available. Typically, disk maintenance considers "When space is required" to be at 90% full, which is when we issue the System Notification "Disk Sentry: Disk usage exceeded warning threshold". At this point, disk maintenance runs and cleans up data to reduce storage back to 92%. When the disk is cleaned up and utilization gets to 92%, services restart (hostcontext restarted). This applies to any monitored partition: if you put a very large file in /store/tmp and that partition hits 95%, services stop even if /store is only 20% utilized.
NOTE: To determine which partitions are monitored, type: grep PARTITIONS /opt/qradar/conf/nva.hostcontext.conf and the list of monitored partitions is displayed to the root user.
I would suggest that administrators treat any Disk Sentry message as critical, as hitting 95% on a monitored partition is not good. You can verify which partition is being filled (df -Th), but how the system reacts when /store is getting full depends on how you configure event/flow retention. If you have everything set to delete when space is required, it starts with the shortest retention period and cleans that first. When you set up your retention buckets from shortest to longest in the user interface, you can see what data will be deleted, as cleanup starts at the top and proceeds to the bottom of your retention bucket list. This gives you and any other admins a quick "waterfall" visual for what is deleted first (shortest retention) versus last (longest retention).
Figure 2: User interface example of the Retention Bucket order.
When the disk is 95% utilized, hostcontext is stopped. This stops ECS (the event and flow pipeline), the accumulator, and a number of other subservices. For a discussion on this topic, see: Hostcontext restart and impacted services.
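The thresholds described in this answer can be sketched as a small monitoring helper. This is an illustration only and not how QRadar implements Disk Sentry: the constant and function names are hypothetical, the percentages are the ones quoted above, and shutil.disk_usage can report a slightly different figure than df because of reserved blocks.

```python
import shutil

WARNING_PCT = 90.0  # Disk Sentry warning notification
STOP_PCT = 95.0     # services (hostcontext) are stopped
RECOVER_PCT = 92.0  # services restart once usage is back at this level


def percent_used(path: str) -> float:
    """Return the percentage of the partition holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def warning_raised(pct: float) -> bool:
    """True when usage has reached the warning threshold."""
    return pct >= WARNING_PCT


def services_running(pct: float, currently_running: bool) -> bool:
    """Apply the stop and recover thresholds to the current service state."""
    if currently_running:
        return pct < STOP_PCT   # stop when the partition hits 95%
    return pct <= RECOVER_PCT   # restart only after cleanup reaches 92%
```

The gap between the stop and recover thresholds means a partition sitting at 93% keeps services stopped if they were already stopped, but does not stop services that are still running.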
Question 6: Do I understand it correctly that it doesn't matter where the retention filters are set (tenant level or global level)? The system looks at the shortest retention time/filter?
Answer: Data is separated by folder and does not overlap in retention buckets as long as the buckets are not edited (we'll talk about editing retention buckets later). Global retention buckets go in the standard file system, and tenant retention buckets are separated by a unique AUX folder that then has an assigned ID for the retention bucket and the data that belongs to it. Data in the global bucket is completely separate from data in the tenant buckets on disk. So, if you have data in a global retention bucket for 1 month and data in retention bucket A for 1 week, set to delete immediately after the retention period has expired, then after a week retention bucket A is cleaned of data older than 1 week. Data in the global bucket is still available, as that data has not expired yet.
I confirmed with development that retention data can only belong to one bucket. When we go to write data to disk, event/flow data is evaluated to determine where the data belongs and in which retention bucket. Each bucket acts as a filter. Data is evaluated against the filter criteria of each bucket in ascending order of the buckets. An event is put into the first bucket whose criteria it matches, and only into that bucket. The default bucket catches all the events that did not match any of the custom buckets' criteria. There is a pre-filter that determines whether the data belongs to the global retention buckets or to a tenant's buckets.
Each bucket allows defining a different retention policy for its events. The retention policies behave exactly as described above. Data matching a bucket is defined by criteria similar to Ariel searching. It is possible to enter multiple filters in a single bucket, in which case an event has to match either filter to be put in the bucket.
Hourly, retention buckets are reviewed by disk maintenance, which does the cleanup of the actual data. Each bucket is evaluated to see if it contains data older than the retention period and whether the setting is "Delete immediately after retention is expired" or "When storage space is required". If there is less than 13% free disk space in /store, then the "When storage space is required" function triggers and starts to clean older data. Data is cleaned up based on a combination of time (is the data expired) and/or space (is disk space needed by the system). Time-based retention (delete immediately after the retention period has expired) is typically used for short-lived data that you don't really care about. "When space is required" is typically assigned to lower buckets where you want to keep important data as long as possible.
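The first-match behavior described above can be sketched in a few lines. This is a simplified model rather than QRadar code: the function name assign_bucket, the event dictionaries, and the filter functions are hypothetical stand-ins for the Ariel-style criteria, and the tenant/global pre-filter is left out.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Filter = Callable[[Event], bool]

DEFAULT_BUCKET = 0  # bucket 0 is the default bucket; 1 to 10 are the custom buckets


def assign_bucket(event: Event, buckets: List[Tuple[int, List[Filter]]]) -> int:
    """Return the number of the first bucket whose criteria match the event."""
    # buckets are evaluated in ascending order of their number
    for number, filters in sorted(buckets, key=lambda item: item[0]):
        # with several filters in one bucket, matching any one is enough
        if any(matches(event) for matches in filters):
            return number  # the event goes to this bucket only
    return DEFAULT_BUCKET  # nothing matched, so the default bucket catches it


buckets = [
    (2, [lambda e: e.get("severity", 0) >= 5]),
    (1, [lambda e: e.get("logsource") == "firewall",
         lambda e: e.get("logsource") == "proxy"]),
]
print(assign_bucket({"logsource": "proxy", "severity": 9}, buckets))
```

In the example the event matches both buckets, but it lands in bucket 1 because that bucket is evaluated first.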
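The hourly check can be modeled roughly as follows. This sketch rests on assumptions: only data older than the retention period is treated as eligible for deletion under either setting, the 13% free-space figure is the one quoted above, and the Bucket class and should_delete function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

FREE_SPACE_TRIGGER_PCT = 13.0  # free space in /store below this means space is required


@dataclass
class Bucket:
    retention: timedelta
    delete_immediately: bool  # False stands for "When storage space is required"


def should_delete(bucket: Bucket, stored_at: datetime, now: datetime, free_pct: float) -> bool:
    """Decide whether one hourly disk maintenance run would remove this data."""
    expired = now - stored_at > bucket.retention
    if not expired:
        return False  # assumption: unexpired data is never removed by retention
    if bucket.delete_immediately:
        return True   # time-based: removed on the first run after expiry
    return free_pct < FREE_SPACE_TRIGGER_PCT  # space-based: kept until space is needed
```

For example, expired data in a "When storage space is required" bucket stays on disk while /store has 20% free and becomes eligible once free space drops below 13%.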
Question 7: Do tenant filters include an automatic extra filter to only look in a specific retention bucket instead of over all buckets?
Yes, as mentioned above, tenant buckets are stored in a separate area on disk and have unique bucket IDs that are separate from the global retention buckets. Let's explain how these are different in a little more detail.
This is how you can identify what data on disk belongs to which retention bucket. I'm going to use the example from the follow-up question to answer this directly.
For example, in the folder /store/ariel/events/payloads/aux/2/2017/4/11/9/ I have these files:
Note: Since you are in the payloads directory, the files are payload_events. If you were in the records directory, the files would be the exact same, except the name would begin with events~9.
Tenant retention buckets follow this format for file names on disk:
- First identifier: payload_events~9 or events~9
This identifies the minute and second that the data belongs to. You'll notice that these values go from ~0_0 to ~59_0.
- Second identifier: There is a unique identifier added to the data, such as f02976cd4a46449b-891b35a400083851. This is the UUID for the data and matches between both the records directory and the payloads directory as they are written to disk at the same time.
- Third identifier: Finally, there is the retention bucket number. This is represented as tenant ID * 256 + retention bucket 0-10. So, if we have DeviceId~1_0~bb59ed322f8c4eef~b5dfefede8dc0987~512, this means that the tenant ID is 2 and the retention bucket is 0, because 2*256 + 0 = 512. Here 0 is the default bucket, and the numbers 1 to 10 represent the buckets in the user interface.
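The arithmetic for the third identifier can be reversed to find the tenant and bucket from a file name. This is a minimal sketch, assuming the retention bucket number is always the last ~ separated field as in the example above; the function name decode_retention_id is hypothetical.

```python
from typing import Tuple


def decode_retention_id(file_name: str) -> Tuple[int, int]:
    """Return (tenant ID, retention bucket) from the last field of a file name."""
    value = int(file_name.rsplit("~", 1)[-1])
    # the value is stored as tenant ID * 256 + retention bucket
    tenant_id, bucket = divmod(value, 256)
    if bucket > 10:
        raise ValueError("retention bucket must be between 0 and 10")
    return tenant_id, bucket


print(decode_retention_id("DeviceId~1_0~bb59ed322f8c4eef~b5dfefede8dc0987~512"))
```

For the example file name this gives tenant ID 2 and bucket 0, the default bucket. A trailing value of 515 would decode to tenant ID 2 and bucket 3.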
Question 8: What happens when I edit retention buckets?
I mentioned this above, but I want to remind admins/users how retention buckets work when you make an edit. Because of the volume of data that is typically being processed, QRadar does not go back in time and rewrite retention data when edits are made. For example, I've got a retention period of 3 months for retention bucket #2 for some log sources. After 1 month, I decide to move one of the log sources from retention bucket #2 -> #3, which retains data for 5 months for all log sources in that bucket.
Where is the data?
The 1 month of data from retention bucket #2 is still in retention bucket #2. It is not moved when an edit is made to retention buckets. That 1 month of data still has 2 months before it expires (pending disk space requirements). After an edit is made to a retention bucket, new data from that minute onward is written to the newly assigned bucket. We do not go back after the fact and sort data, so it is important to realize that when you make edits to a retention bucket you could be splitting where data resides. If you remove a log source from a bucket and do not assign it anywhere, it is automatically assigned to the default bucket, as any data that is not assigned belongs to the default bucket.
Question 9: What happens when I delete a retention bucket?
Retention buckets are pointers to data that is on disk. When an administrator deletes an existing retention bucket, any data from that retention bucket that is on disk is assigned to the default bucket.
Question 10: Do you have additional questions that should be added to the FAQ page?
If administrators or users have additional questions about how tenant retention works with events or flows, you can ask in our forums. At the bottom of this technical note, there is an ask a question field. If you log in to IBM.com you can use the Ask a question field to open a forum question against this content. Answers to the question are appended directly to the end of this technical note. Optionally, you can navigate to the forums directly (https://ibm.biz/qradarforums) and ask us a question directly.
12 February 2021