Rules for using CSV data sources

The following rules apply when the data source is a CSV filesystem.

Rules for CSV file naming

The file name must begin with a table or group reference, followed by a start time, and end in .CSV.
The start time in the file name must be at or before the first time stamp in the file.
The end time is optional and can be at or after the last time stamp in the file. If the end time in the file name coincides with the start of the next interval, the end time must be later than the latest time stamp in the file. For example, if the end time of 10:05 is the start of the next interval, the latest time stamp in the file can be 10:04:59.
If the time zone of the file name is different from the time zone of the file content, the time zone must be explicitly defined in the file name, in the file content time stamps, or in both. The following examples show the time zone defined in the file name:
CPULOAD_2013-07-17-00-00EST_2013-07-17-00-15EST.csv
CPULOAD_2013-07-17-00-00+0300_2013-07-17-00-15+0300.csv

A sample file name with 15-minute data is:
CPULOAD_2013-07-17-00-00_2013-07-17-00-15.csv

The file name shows that:
- The file contains the CPULOAD source table
- The first timestamp in the file is July 17, 2013 at 00:00
- The last timestamp is earlier than July 17, 2013 at 00:15
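
The naming rules above can be sketched as a small parser. This is only an illustration: the regular expression is inferred from the sample file names in this section, the function name `parse_file_name` is hypothetical, and the naming pattern that is configured in the product can differ.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the samples: <table>_<start>[_<end>].csv,
# where each time may carry a zone suffix such as EST or +0300.
STAMP = r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}"
ZONE = r"(?:[A-Z]{2,5}|[+-]\d{4})?"
NAME_RE = re.compile(
    rf"^(?P<table>.+?)_(?P<start>{STAMP})(?P<start_tz>{ZONE})"
    rf"(?:_(?P<end>{STAMP})(?P<end_tz>{ZONE}))?\.csv$",
    re.IGNORECASE,
)


def parse_file_name(name):
    """Split a file name into table, start time, optional end time, and zone."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    fmt = "%Y-%m-%d-%H-%M"
    end = match.group("end")
    return {
        "table": match.group("table"),
        "start": datetime.strptime(match.group("start"), fmt),
        "end": datetime.strptime(end, fmt) if end else None,  # end time is optional
        "zone": match.group("start_tz") or None,  # None when no zone is given
    }
```

For example, `parse_file_name("CPULOAD_2013-07-17-00-00_2013-07-17-00-15.csv")` returns the table `CPULOAD`, a start of July 17, 2013 at 00:00, and an end of 00:15 on the same day.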
For steady-state processing, each CSV file must contain only one interval. Intervals are counted from a notional start at the beginning of an hour, that is, at 00 minutes and 00 seconds. In practice, the data in a file can start later, but it must stay within one of the intervals that are determined by the notional start time. For example, with a 5-minute interval, the data can cover 25 to 30 minutes or 26 to 29 minutes, but not 26 to 31 minutes, because that period overlaps two intervals.
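
The alignment rule can be checked with simple integer arithmetic. The following sketch assumes that the start and end are given as minutes past the hour and that the interval length is 5 minutes, as the example ranges suggest. The function name `fits_one_interval` is hypothetical.

```python
def fits_one_interval(start_minute, end_minute, interval_minutes):
    """Return True if start..end lies inside one interval aligned to the hour."""
    if not 0 <= start_minute <= end_minute <= 60:
        raise ValueError("minutes must satisfy 0 <= start <= end <= 60")
    # Round the start down to the interval boundary that contains it.
    interval_start = (start_minute // interval_minutes) * interval_minutes
    return end_minute <= interval_start + interval_minutes


print(fits_one_interval(25, 30, 5))  # 25 to 30 is exactly one interval
print(fits_one_interval(26, 29, 5))  # 26 to 29 stays inside 25 to 30
print(fits_one_interval(26, 31, 5))  # 26 to 31 crosses the 30-minute boundary
```

The three calls print `True`, `True`, and `False`.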

Note: For more information about how to create file naming patterns, see Example file naming patterns.

Rules for CSV file content

The following rules and limitations apply to CSV file content:

Header line

The first line of each CSV file must contain a comma-separated list of headers for the columns in the file. For example:
```
#timestamp,resourceName,AvgLoad,AvgPercentMemoryUsed
```
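
As a rough illustration, the header line can be read with a standard CSV parser. The sketch below assumes that the leading `#` in the sample header is a marker and is not part of the first column name; the source does not state this. The function name `read_header` is hypothetical.

```python
import csv
import io


def read_header(text):
    """Return the column names from the first line of CSV text."""
    header = next(csv.reader(io.StringIO(text)))
    # Assumption: a leading '#' only marks the header line.
    header[0] = header[0].lstrip("#")
    return [name.strip() for name in header]


sample = "#timestamp,resourceName,AvgLoad,AvgPercentMemoryUsed\n"
print(read_header(sample))
```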

Column format

Date and time must be in the same column. Preferably, the time stamp is in UTC or otherwise contains time zone information in a form that is defined in Java SimpleDateFormat, http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. When the data supplies the time zone information, the time stamps can be adjusted for Daylight Saving Time.
If the date and time are not in the same column, you must preprocess those files to combine the two columns. Alternatively, you can join the two columns as you re-export the CSV files.
Resource names can be made from several columns (in the Mediation Tool). One part of the resource name must be assigned to an attribute named 'Node'. If a null or empty value is seen for the Node attribute, that row of data is discarded. Null values in other parts of the resource key are not recommended, but they do not cause the data to be discarded. Instead, the resource name is made up of only the non-null parts of the key.
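
One possible preprocessing step is sketched below. It assumes input columns named `Date` and `Time` and joins them with a single space into a new `Timestamp` column. The column names, the separator, and the function name `combine_date_time` are all illustrative assumptions.

```python
import csv
import io


def combine_date_time(text, date_col="Date", time_col="Time", out_col="Timestamp"):
    """Return CSV text in which the date and time columns are merged into one."""
    reader = csv.DictReader(io.StringIO(text))
    # Keep every other column in its original order.
    other = [c for c in reader.fieldnames if c not in (date_col, time_col)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[out_col] + other, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        merged = {out_col: f"{row[date_col]} {row[time_col]}"}
        merged.update({c: row[c] for c in other})
        writer.writerow(merged)
    return out.getvalue()


source = "Date,Time,Node,Value\n2015-04-10,04:16:13,server1,7003\n"
print(combine_date_time(source))
```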
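
The handling of null key parts can be pictured with the following sketch. It is not the Mediation Tool implementation: the underscore separator and the function name `build_resource_name` are assumptions made for illustration.

```python
def build_resource_name(key_parts, separator="_"):
    """Build a resource name from ordered (attribute, value) pairs.

    Returns None when the Node attribute is null or empty, which stands
    for a row that is discarded.
    """
    values = dict(key_parts)
    if not values.get("Node"):
        return None
    # Use only the non-null, non-empty parts of the key.
    return separator.join(value for _, value in key_parts if value)


print(build_resource_name([("Node", "server1"), ("SubResource", "subresourceA")]))
print(build_resource_name([("Node", "server1"), ("SubResource", None)]))
print(build_resource_name([("Node", ""), ("SubResource", "subresourceA")]))
```

Under these assumptions, the calls print `server1_subresourceA`, `server1`, and `None`.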

Characters

Files must be in CSV format. Ensure that any text field that contains a comma is enclosed in quotation marks, for example, when the Supplier column contains 'International supplies, Inc.'
Metric values must use a full stop (.) as the decimal point, not a comma.
A field that is enclosed in double quotation marks must not itself contain double quotation marks. For example, "International "supplies" Inc" is not valid.
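
When you generate the files yourself, these character rules can be enforced as each line is written. The sketch below assumes that double quotation marks are the enclosing character, and it rejects any field that contains one instead of escaping it. The function name `to_csv_line` is hypothetical.

```python
def to_csv_line(fields):
    """Format one CSV line, quoting fields that contain a comma."""
    cells = []
    for field in fields:
        # str() of a Python float always uses a full stop as the decimal point.
        text = str(field)
        if '"' in text:
            raise ValueError(f"field contains a double quotation mark: {text!r}")
        cells.append(f'"{text}"' if "," in text else text)
    return ",".join(cells)


print(to_csv_line(["International supplies, Inc.", 21.5]))
```

The call prints `"International supplies, Inc.",21.5`.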

Time

When processing data in steady-state mode, Predictive Insights expects a single CSV file for each interval. The first interval of each hour starts at 00 minutes, and subsequent intervals start at multiples of the interval length. For example, if the aggregation interval is 15 minutes, the intervals are 00 to 15 minutes, 15 to 30 minutes, and so on. The time stamps in the CSV file cannot span intervals. For example, for the interval 15:00 to 15:15, the earliest time stamp in the file must be 15:00 or later and the latest time stamp in the file must be 15:14:59.999 or earlier.
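
A file can be checked against this rule before it is delivered. The following sketch assumes that the time stamps are already parsed into `datetime` values and that the interval length divides the hour evenly. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def interval_for(stamp, interval_minutes):
    """Return the (start, end) of the hour-aligned interval that contains stamp."""
    minute = (stamp.minute // interval_minutes) * interval_minutes
    start = stamp.replace(minute=minute, second=0, microsecond=0)
    return start, start + timedelta(minutes=interval_minutes)


def spans_single_interval(stamps, interval_minutes):
    """Return True if every time stamp falls in the same interval."""
    start, end = interval_for(min(stamps), interval_minutes)
    # The end of the interval is exclusive.
    return all(start <= stamp < end for stamp in stamps)


stamps = [
    datetime(2013, 7, 17, 15, 0, 0),
    datetime(2013, 7, 17, 15, 14, 59, 999000),
]
print(spans_single_interval(stamps, 15))
```

For these two time stamps and a 15-minute interval the check prints `True`. Adding a time stamp of 15:15:00 makes it return `False`.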
When processing data in backlog mode, files can contain data that spans multiple time intervals. However, if the file content spans multiple intervals, the data must be in chronological order. For performance reasons, ensure that a file does not contain more than one day of data.
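
The two backlog conditions can be verified with a short check such as the following. It assumes that the time stamps are listed in file order and interprets "one day" as a span of 24 hours, which is an assumption. The function name `check_backlog` is hypothetical.

```python
from datetime import datetime, timedelta


def check_backlog(stamps):
    """Return a list of problems found in time stamps given in file order."""
    problems = []
    # Each time stamp must be at or after the one before it.
    if any(later < earlier for earlier, later in zip(stamps, stamps[1:])):
        problems.append("time stamps are not in chronological order")
    if stamps and max(stamps) - min(stamps) > timedelta(days=1):
        problems.append("file spans more than one day")
    return problems


rows = [datetime(2013, 7, 17, 0, 0), datetime(2013, 7, 17, 0, 5)]
print(check_backlog(rows))  # an empty list means no problems were found
```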
The last time stamp in the file must be earlier than the end of the interval; it cannot be equal to the end of the interval. For example, a 5-minute interval with a start time of 10:00 and an end time of 10:05 can have a last time stamp of 10:04:59.
Files must be delivered to the source location for the extractor before the latency time set for the extractor expires. The default latency is the same period as the system.aggregation.interval. The minimum latency is 1 minute.

Examples of valid CSV file formats

The following example illustrates a CSV file format in which the time stamp is in epoch format. Date and time must be in the same column.

```
#timestamp,resourceName,AvgLoad,AvgPercentMemoryUsed
1361523600000,resource1,6,21.900673
1361523600000,resource2,0,45.12558
1361523600000,resource3,12,20.727364
1361523600000,resource4,5,23.801073
```

The following example illustrates a CSV file format in which the time stamp is a string. Date and time must be in the same column.

```
#Timestamp, ResourceId,Metric_0
201304160815,ResourceId_1,0.0110
201304160815,ResourceId_2,0.0110
201304160815,ResourceId_3,0.0110
```
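
To see what the epoch and string time stamps in the preceding examples represent, they can be parsed as in the following sketch. It assumes that the epoch values are milliseconds, as the 13-digit length suggests, and that the string follows a year, month, day, hour, minute layout (`yyyyMMddHHmm` in SimpleDateFormat notation). The function names are hypothetical.

```python
from datetime import datetime, timezone


def from_epoch_millis(value):
    """Convert epoch milliseconds (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def from_compact_string(value):
    """Parse a yyyyMMddHHmm string; the result carries no time zone."""
    return datetime.strptime(value, "%Y%m%d%H%M")


print(from_epoch_millis("1361523600000"))
print(from_compact_string("201304160815"))
```

Under these assumptions, the sample values correspond to February 22, 2013 at 09:00 UTC and April 16, 2013 at 08:15.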

The following example illustrates a CSV file format in which the time stamp is a string that is not in the first column. The first column shows that fields can contain white space.

```
#Device,Parent Device,Sensor,Time,Value
device 1 complex name,127.0.0.1,Sensor1,11/01/2014 16:35,46
device 1 complex name,127.0.0.1,Sensor1,11/01/2014 16:35,61
device 1 complex name,127.0.0.1,Sensor1,11/01/2014 16:39,46
device 1 complex name,127.0.0.1,Sensor1,11/01/2014 16:40,61
device 1 complex name,127.0.0.1,Sensor1,11/01/2014 16:40,46
```

The following example shows a CSV file where the data is in skinny format. In this format, metric names are contained in rows of data underneath a header row.

```
Timestamp,Node,SubResource,MetricName,MetricValue
2015-04-10 04:16:13.0,server1,subresourceA,M1,7003
2015-04-10 04:16:13.0,server1,subresourceA,M2,6683
2015-04-10 04:16:13.0,server1,subresourceA,M3,1041
2015-04-10 04:16:13.0,server1,subresourceB,M1,7643
```
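
One way to picture the skinny format is to group its rows by time stamp and resource, so that each metric name becomes a key. The sketch below is only an illustration of the data layout, not a description of how the product processes skinny files. The function name `pivot_skinny` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def pivot_skinny(text):
    """Group skinny rows into {(timestamp, node, subresource): {metric: value}}."""
    wide = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["Timestamp"], row["Node"], row["SubResource"])
        wide[key][row["MetricName"]] = float(row["MetricValue"])
    return dict(wide)


sample = (
    "Timestamp,Node,SubResource,MetricName,MetricValue\n"
    "2015-04-10 04:16:13.0,server1,subresourceA,M1,7003\n"
    "2015-04-10 04:16:13.0,server1,subresourceA,M2,6683\n"
    "2015-04-10 04:16:13.0,server1,subresourceB,M1,7643\n"
)
for key, metrics in pivot_skinny(sample).items():
    print(key, metrics)
```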