HTTP Client
The HTTP Client origin reads data from an HTTP resource URL. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
When you configure HTTP Client, you specify the resource URL, optional headers, and the method to use. For some methods, you can specify the request body and default content type.
You can configure the actions to take based on the response status and configure pagination properties to enable processing large volumes of data from paginated APIs. You can also enable the origin to read compressed and archived files.
The origin provides response header fields as record header attributes so you can use the information in the pipeline when needed.
The origin also provides several different authentication types to access data. You can enter credentials in the origin or you can secure the credentials in runtime resource files and reference the files in the origin. You can also configure the origin to use the OAuth 2 protocol to connect to an HTTP service.
When a pipeline stops, HTTP Client notes where it stops reading. When the pipeline starts again, HTTP Client continues processing from where it stopped by default. You can reset the origin to process all requested files.
Processing Mode
- Streaming
- HTTP Client maintains a connection and processes data as it becomes available. Use to process streaming data in real time.
- Polling
- HTTP Client polls the server at the specified interval for available data. Use to access data periodically, such as metrics and events at a REST endpoint.
- Batch
- HTTP Client processes all available data and then stops the pipeline. Use to process data as needed.
HTTP Method
To request data from an HTTP resource URL, specify the request method to use. Most servers require a GET request, but you should verify the request required by the server you want to access.
- GET
- PUT
- POST
- DELETE
- HEAD
Headers
- Headers
- Additional Security Headers
You can define headers in either property. However, only additional security headers support using credential functions to retrieve sensitive information from supported credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
If you define the same header in both properties, additional security headers take precedence.
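The precedence rule can be pictured as a simple merge. The following is a minimal sketch, not the stage's implementation: the function name `merge_headers` and the header values are hypothetical, and it assumes header names are compared exactly as written.

```python
def merge_headers(headers, security_headers):
    """Combine both header properties; security headers win on a name clash."""
    merged = dict(headers)           # start from the regular Headers property
    merged.update(security_headers)  # Additional Security Headers override duplicates
    return merged


# Hypothetical values for illustration only.
headers = {"Accept": "application/json", "X-Api-Key": "placeholder"}
security_headers = {"X-Api-Key": "value-from-credential-store"}
print(merge_headers(headers, security_headers))
```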
Per-Status Actions
By default, the HTTP Client origin processes only responses that include a 2xx success status code. When the response includes any other status code, such as a 4xx or 5xx status code, by default, the origin generates an error and handles the record based on the error record handling configured for the stage.
You can configure the origin to perform one of several actions when it encounters an unsuccessful status code, that is, any non-2xx status code.
- Retry with linear backoff
- Retry with exponential backoff
- Retry immediately
- Cause the stage to fail and stop the pipeline
When defining the retry with linear or exponential backoff action, you also specify the backoff interval to wait in milliseconds. When defining any of the retry actions, you specify the maximum number of retries. If the stage receives a 2xx status code during a retry, then it processes the response. If the stage doesn't receive a 2xx status code after the maximum number of retries, then the stage generates an error.
You can add multiple status codes and configure a specific action for each code.
You can also configure the stage to generate records for all unsuccessful statuses that are not added to the Per-Status Actions list. You then specify the output field name that stores the error response body for those records.
For example, say that when the stage receives a 400 Bad Request code, you want the pipeline to process the response body that contains the description of the error. When configuring the stage, you do not add an action for the 400 status code because you don't need the stage to retry the request. You select the Records for Remaining Statuses property and then use the default value outErrorBody as the name of the error response body field.
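The retry actions can be sketched as follows. This is an illustration under stated assumptions, not the stage's actual algorithm: the documentation does not give the backoff formulas, so the sketch assumes that linear backoff waits the interval multiplied by the attempt number and that exponential backoff doubles the wait on each attempt. The names `backoff_delays`, `request_with_retries`, and `send` are hypothetical.

```python
import time


def backoff_delays(strategy, interval_ms, max_retries):
    """Return the wait in milliseconds before each retry attempt."""
    delays = []
    for attempt in range(1, max_retries + 1):
        if strategy == "immediate":
            delays.append(0)
        elif strategy == "linear":
            # Assumption: the wait grows by one interval per attempt.
            delays.append(interval_ms * attempt)
        elif strategy == "exponential":
            # Assumption: the wait doubles on each attempt.
            delays.append(interval_ms * 2 ** (attempt - 1))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return delays


def request_with_retries(send, strategy, interval_ms, max_retries, sleep=time.sleep):
    """Call send() until it returns a 2xx status or the retries are used up."""
    status, body = send()
    for delay_ms in backoff_delays(strategy, interval_ms, max_retries):
        if 200 <= status < 300:
            break
        sleep(delay_ms / 1000)
        status, body = send()
    if not 200 <= status < 300:
        raise RuntimeError(f"status {status} after {max_retries} retries")
    return body


print(backoff_delays("linear", 1000, 3))
print(backoff_delays("exponential", 1000, 3))
```

Here `send` stands in for whatever performs the request and returns a `(status, body)` pair. Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting.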
Pagination
The HTTP Client origin can use pagination to retrieve a large volume of data from a paginated API.
When configuring the HTTP Client origin to use pagination, use the pagination type supported by the API of the HTTP client. You will likely need to consult the documentation for the origin system API to determine the pagination type to use and the properties to set.
The HTTP Client origin supports the following common pagination types:
- Link in HTTP Header
- After processing the current page, uses the link in the HTTP header to access the next page. The link in the header can be an absolute URL or a URL relative to the resource URL configured for the stage. For example, let's say you configure the following resource URL for the stage:
https://myapp.com/api/objects?page=1
- Link in Response Field
- After processing the current page, uses the link in a field in the response body to access the next page. The link in the response field can be an absolute URL or a URL relative to the resource URL configured for the stage. For example, let's say you configure the following resource URL for the stage:
http://myapp.com/api/tickets.json?start_time=138301982
- By Page Number
- Begins processing with the specified initial page, and then requests the following page. Use the ${startAt} variable in the resource URL as the value of the page number to request.
- By Offset Number
- Begins processing with the specified initial offset, and then requests the following offset. Use the ${startAt} variable in the resource URL as the value of the offset number to request.
For the link in response field pagination type, you must define a stop condition that determines when there are no more pages to process. For all other pagination types, the stage stops reading when it returns a page that does not contain any more records.
When you use any pagination type, you must specify a result field path and can choose whether to include all other fields in the record.
Page or Offset Number
When using page number or offset number pagination, the API of the HTTP client typically requires that you include a page or offset parameter at the end of the resource URL. The parameter determines the next page or offset of data to request.
The name of the parameter used by the API varies. For example, it might be offset, page, start, or since. Consult the documentation for the origin system API to determine the name of the page or offset parameter.
The HTTP Client origin provides a ${startAt} variable that you can use in the URL as the value of the page or offset. For example, your resource URL might be any of the following:
http://webservice/object?limit=15&offset=${startAt}
https://myapp.com/product?limit=5&since=${startAt}
https://myotherapp.com/api/v1/products?page=${startAt}
When the pipeline starts, the HTTP Client stage uses the value of the Initial Page/Offset property as the ${startAt} variable value. After the stage reads a page of results, the stage increments the ${startAt} variable by one if using page number pagination or by the number of records read from the page if using offset number pagination.
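For example, let's say that you use offset number pagination, set the initial offset to 0, and configure the following resource URL:
https://myapp.com/product?limit=5&since=${startAt}
When the pipeline starts, the stage resolves the resource URL to:
https://myapp.com/product?limit=5&since=0
The first page of results includes 5 items. After reading the first page, the stage increments the ${startAt} variable by 5, such that the next resource URL is resolved to:
https://myapp.com/product?limit=5&since=5
The second page of results also includes 5 items, starting at the 5th item.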
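The way the ${startAt} variable advances can be sketched in a few lines. This is a minimal illustration of the rule described above, not the stage's code; the helper names `resolve_url` and `next_start_at` are hypothetical.

```python
def resolve_url(template, start_at):
    """Substitute the current page or offset value for ${startAt}."""
    return template.replace("${startAt}", str(start_at))


def next_start_at(mode, start_at, records_read):
    """Advance by one page, or by the number of records read from the page."""
    if mode == "page":
        return start_at + 1
    if mode == "offset":
        return start_at + records_read
    raise ValueError(f"unknown pagination mode: {mode}")


template = "https://myapp.com/product?limit=5&since=${startAt}"
start_at = 0  # value of the Initial Page/Offset property
for _ in range(2):
    print(resolve_url(template, start_at))
    start_at = next_start_at("offset", start_at, records_read=5)
```

With an initial offset of 0 and 5 records per page, the two resolved URLs end in since=0 and since=5, matching the example above.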
Result Field Path
When using any pagination type, you must specify the result field path. The result field path is the location in the response that contains the data that you want to process.
The result field path must be a list or array. The origin creates a record for each object in the array.
For example, say the response contains the following data:
{
"count":"1023",
"startAt":"2",
"maxResults":"2",
"total":"6",
"results":[
{
"firstName":"Joe",
"lastName":"Smith",
"phone":"555-555-5555"
},
{
"firstName":"Jimmy",
"lastName":"Smott",
"phone":"333-333-3333"
},
{
"firstName":"Joanne",
"lastName":"Smythe",
"phone":"777-777-7777"
}
]
}
If you specify /results as the result field path, the origin creates three records, one for each object in the array:
{
"firstName":"Joe",
"lastName":"Smith",
"phone":"555-555-5555"
}
{
"firstName":"Jimmy",
"lastName":"Smott",
"phone":"333-333-3333"
}
{
"firstName":"Joanne",
"lastName":"Smythe",
"phone":"777-777-7777"
}
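The splitting behavior can be sketched as follows. This is only an illustration: the function name `split_records` is hypothetical, and it assumes a result field path made of simple slash-separated field names, which may be narrower than what the stage accepts.

```python
import json


def split_records(response_text, result_field_path):
    """Return one record per object in the list at the result field path."""
    node = json.loads(response_text)
    # Walk the path one field name at a time, for example "/results".
    for name in result_field_path.strip("/").split("/"):
        node = node[name]
    if not isinstance(node, list):
        raise TypeError("the result field path must point to a list or array")
    return node


response_text = """
{
  "count": "1023",
  "results": [
    {"firstName": "Joe", "lastName": "Smith", "phone": "555-555-5555"},
    {"firstName": "Jimmy", "lastName": "Smott", "phone": "333-333-3333"},
    {"firstName": "Joanne", "lastName": "Smythe", "phone": "777-777-7777"}
  ]
}
"""
for record in split_records(response_text, "/results"):
    print(record)
```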
Keep All Fields
When using any pagination type, you can configure the origin to keep all fields in addition to those in the specified result field path. The resulting record includes all fields in the original structure and the result field path that includes one set of data.
By default, the origin returns only the data within the specified result field path.
For example, say we use the same sample data as above, with /results for the result field path, and we configure the origin to keep all fields. The origin generates three records that keep the existing record structure and include one set of data in the /results field.
The first record:
{
"count":"1023",
"startAt":"2",
"maxResults":"2",
"total":"6",
"results":{
"firstName":"Joe",
"lastName":"Smith",
"phone":"555-555-5555"
}
}
The second record:
{
"count":"1023",
"startAt":"2",
"maxResults":"2",
"total":"6",
"results":{
"firstName":"Jimmy",
"lastName":"Smott",
"phone":"333-333-3333"
}
}
The third record:
{
"count":"1023",
"startAt":"2",
"maxResults":"2",
"total":"6",
"results":{
"firstName":"Joanne",
"lastName":"Smythe",
"phone":"777-777-7777"
}
}
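The effect of keeping all fields can be sketched as follows. This is a simplified illustration rather than the stage's implementation: the function name `split_keeping_all_fields` is hypothetical, and it assumes the result field is a top-level field of the response.

```python
def split_keeping_all_fields(response, result_field):
    """Return one record per list item, each keeping the other response fields."""
    records = []
    for item in response[result_field]:
        record = dict(response)      # copy count, startAt, and the other fields
        record[result_field] = item  # replace the list with a single set of data
        records.append(record)
    return records


response = {
    "count": "1023",
    "startAt": "2",
    "maxResults": "2",
    "total": "6",
    "results": [
        {"firstName": "Joe", "lastName": "Smith", "phone": "555-555-5555"},
        {"firstName": "Jimmy", "lastName": "Smott", "phone": "333-333-3333"},
        {"firstName": "Joanne", "lastName": "Smythe", "phone": "777-777-7777"},
    ],
}
for record in split_keeping_all_fields(response, "results"):
    print(record)
```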
Pagination Examples
Let's look at some examples of how you might configure the supported pagination types.
Example for Link in HTTP Header
The API of the HTTP client uses the link in the HTTP header to access the next page. Each response includes a link header similar to the following:
link:<https://myapp.com/api/objects?page=2>; rel="next",
<https://myapp.com/api/objects?page=9>; rel="last"
So after the HTTP Client origin reads the first page of results, it can use the next link in the HTTP header to read the next page.
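You configure the following resource URL for the stage:
https://myapp.com/api/objects?page=1
The response contains data in the following structure: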
{
"total":"2000",
"limit":"10",
"results":[
{
"firstName":"Joe",
"lastName":"Smith"
},
...
{
"firstName":"Joanne",
"lastName":"Smythe"
}
]
}
On the Pagination tab of the stage, you simply set Pagination Mode to link in HTTP header, and then you set the result field path to the /results field.
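Finding the next page from a link header can be sketched as follows. This is a minimal illustration, not the stage's parser: the function name `next_page_url` is hypothetical, and the regular expression assumes that rel is the first parameter after each URL, as in the header shown in this example. Relative links are resolved against the resource URL.

```python
import re
from urllib.parse import urljoin

# Matches entries such as <https://host/path?page=2>; rel="next"
LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="?([^",;]+)"?')


def next_page_url(link_header, resource_url):
    """Return the URL marked rel="next", or None when there is no next page."""
    for target, rel in LINK_PATTERN.findall(link_header):
        if rel.strip() == "next":
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            return urljoin(resource_url, target)
    return None


header = (
    '<https://myapp.com/api/objects?page=2>; rel="next", '
    '<https://myapp.com/api/objects?page=9>; rel="last"'
)
print(next_page_url(header, "https://myapp.com/api/objects?page=1"))
```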
Example for Link in Response Field
The API of the HTTP client uses a field in the response body to access the next page. It requires that you include a timestamp in the resource URL indicating which items you want to start reading.
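You configure the following resource URL for the stage:
http://myapp.com/api/tickets.json?start_time=138301982
The response contains data in the following structure: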
{
"ticket_events":[
{
"ticket_id":27,
"timestamp":138561439,
"via":"Email"
},
...
{
"ticket_id":30,
"timestamp":138561445,
"via":"Phone"
}
],
"next_page":"http://myapp.com/api/tickets.json?start_time=1389078385",
"count":1000,
"end_time":1389078385
}
On the Pagination tab of the stage, you set Pagination Mode to link in response field, and set the next page link field to the /next_page field.
You define the following stop condition, which evaluates to true when a page returns fewer than 1000 items:
${record:value('/count') < 1000}
Then you set the result field path to the /ticket_events field.
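The interaction between the next page link and the stop condition can be sketched as a loop. This is an illustration under stated assumptions, not the stage's implementation: `read_all_pages`, `fetch`, and the `PAGES` stand-in for the API are hypothetical, the event lists are abbreviated, and the sketch assumes that records on the page that satisfies the stop condition are still processed.

```python
def read_all_pages(fetch, first_url, result_field, next_link_field, stop_condition):
    """Follow next-page links until the stop condition is met or no link remains."""
    records = []
    url = first_url
    while url:
        page = fetch(url)
        records.extend(page[result_field])
        if stop_condition(page):
            break  # no more pages to process
        url = page.get(next_link_field)
    return records


# Hypothetical stand-in for the API: responses keyed by URL.
PAGES = {
    "http://myapp.com/api/tickets.json?start_time=138301982": {
        "ticket_events": [{"ticket_id": 27}, {"ticket_id": 30}],
        "next_page": "http://myapp.com/api/tickets.json?start_time=1389078385",
        "count": 1000,
    },
    "http://myapp.com/api/tickets.json?start_time=1389078385": {
        "ticket_events": [{"ticket_id": 31}],
        "next_page": "http://myapp.com/api/tickets.json?start_time=1389078400",
        "count": 1,
    },
}

records = read_all_pages(
    fetch=PAGES.__getitem__,
    first_url="http://myapp.com/api/tickets.json?start_time=138301982",
    result_field="ticket_events",
    next_link_field="next_page",
    stop_condition=lambda page: page["count"] < 1000,
)
print([record["ticket_id"] for record in records])
```

The second page reports a count below 1000, so the loop stops without requesting the link that page returns.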
Example for Page Number
The API of the HTTP client uses page number pagination. It requires that you include a page parameter in the URL that specifies the page number to return from the results.
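You configure the resource URL to use the ${startAt} variable as the value of the page parameter:
https://myotherapp.com/api/v1/products?page=${startAt}
The response contains data in the following structure: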
{
"total":"2000",
"items":[
{
"item":"pencil",
"cost":"2.00"
},
...
{
"item":"eraser",
"cost":"1.10"
}
]
}
On the Pagination tab of the stage, you set Pagination Mode to by page number. You want to begin processing from the first page in the results, so you set the initial page to 0. Then you set the result field path to the /items field.
Example for Offset Number
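The API of the HTTP client uses offset number pagination. It requires that you include the following parameters in the URL:
- limit
- Specifies the number of results per page.
- offset
- Specifies the offset value.
You configure the resource URL to use the ${startAt} variable as the value of the offset parameter:
https://myapp.com/product?limit=10&offset=${startAt}
The response contains data in the following structure: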
{
"total":"2000",
"limit":"10",
"results":[
{
"firstName":"Joe",
"lastName":"Smith"
},
...
{
"firstName":"Joanne",
"lastName":"Smythe"
}
]
}
On the Pagination tab of the stage, you set Pagination Mode to by offset number. You want to begin processing from the first item in the results, so you set the initial offset to 0. Then you set the result field path to the /results field.
OAuth 2 Authorization
The HTTP Client origin can use the OAuth 2 protocol to connect to an HTTP service.
The origin can use the OAuth 2 protocol to connect to an HTTP service that uses basic, digest, or universal authentication, OAuth 2 client credentials, OAuth 2 username and password, or OAuth 2 JSON Web Tokens (JWT).
The OAuth 2 protocol authorizes third-party access to HTTP service resources without sharing credentials. The HTTP Client origin uses credentials to request an access token from the service. The service returns the token to the origin, and then the origin includes the token in a header in each request to the resource URL.
- Client credentials grant
- HTTP Client sends its own credentials - the client ID and client secret or the basic, digest, or universal authentication credentials - to the HTTP service. For example, use the client credentials grant to process data from the Twitter API or from the Microsoft Azure Active Directory (Azure AD) API.
For more information about the client credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.4.
- Resource owner password credentials grant
- HTTP Client sends the credentials for the resource owner - the resource owner username and password - to the HTTP service. Or, you can use this grant type to migrate existing clients using basic, digest, or universal authentication to OAuth 2 by converting the stored credentials to an access token.
For example, use this grant to process data from the Getty Images API. For more information about using OAuth 2 to connect to the Getty Images API, see http://developers.gettyimages.com/api/docs/v3/oauth2.html.
For more information about the resource owner password credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.3.
- JSON Web Tokens
- HTTP Client sends a JSON Web Token (JWT) to an authorization service and obtains an access token for the HTTP service. For example, use JSON Web Tokens to process data with the Google API.
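The token exchange for the client credentials grant can be sketched as two requests. This is a minimal illustration that only builds the requests and does not send them. It is not the stage's implementation: the function names, the token URL, and the credential values are hypothetical, and it assumes the service accepts the client ID and secret through basic authentication and expects the access token in a Bearer Authorization header.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url, client_id, client_secret):
    """Build the POST that asks the service for an access token."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        # Assumption: the client authenticates with HTTP basic authentication.
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return Request(token_url, data=body, headers=headers, method="POST")


def build_resource_request(resource_url, access_token):
    """Build a request to the resource URL that carries the access token."""
    # Assumption: the service expects the token as a Bearer token.
    return Request(resource_url, headers={"Authorization": f"Bearer {access_token}"})


# Hypothetical endpoint and credentials for illustration only.
token_request = build_token_request("https://auth.example.com/oauth2/token", "id", "secret")
print(token_request.get_method(), token_request.full_url)
print(token_request.data.decode("ascii"))
```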
Let’s look at some examples of how to configure authentication and OAuth 2 authorization to process data from Twitter, Microsoft Azure AD, and Google APIs.
Example for Twitter
To use OAuth 2 authorization to read from Twitter, configure HTTP Client to use basic authentication and the client credentials grant.
For more information about configuring OAuth 2 authorization for Twitter, see https://developer.twitter.com/en/docs/authentication/oauth-2-0/application-only.
The following image shows the OAuth 2 tab configured for Twitter:
Example for Microsoft Azure AD
To use OAuth 2 authorization to read from Microsoft Azure AD, configure HTTP Client to use no authentication and the client credentials grant.
The following image shows the OAuth 2 tab configured for Microsoft Azure AD version 1.0:
Example for Google
Configure the HTTP Client origin to use OAuth 2 authorization to read from Google service accounts. The stage sends a JSON Web Token in a request to the Google Authorization Server and obtains an access token for calls to the Google API.
Before you configure the stage, create a service account and delegate domain-wide authority to the service account. For details, see the Google Identity documentation: Using OAuth 2.0 for Server to Server Applications.
For more information about Google service accounts, see the Google Cloud documentation: Understanding service accounts.
For more information about configuring OAuth 2 authorization for Google, see the Google Identity documentation: Using OAuth 2.0 to Access Google APIs.
Logging Request and Response Data
The HTTP Client origin can log request and response data to the Data Collector log.
When enabling logging, you configure the following properties:
- Verbosity
- The type of data to include in logged messages:
- Headers_Only - Includes request and response headers.
- Payload_Text - Includes request and response headers as well as any text payloads.
- Payload_Any - Includes request and response headers and the payload, regardless of type.
- Log Level
- The level of messages to include in the Data Collector log. When you select a level, higher level messages are also logged. That is, if you select the Warning log level, then Severe and Warning messages are written to the Data Collector log.
- Max entity size
- The maximum size of message data to write to the log. Use to limit the volume of data written to the Data Collector log for any single message.
Generated Records
The HTTP Client origin generates records based on the responses it receives.
Data in the response body is parsed based on the selected data format. For HEAD responses, when the response body contains no data, the origin creates an empty record. Information returned from the HEAD appears in record header attributes. For all other methods, when the response body contains no data, no records are created.
In generated records, all standard response header fields, such as Content-Encoding and Content-Type, are written to corresponding record header attributes. Custom response header fields are also written to record header attributes. Record header attribute names match the original response header names.
When you configure the origin to generate records for unsuccessful statuses that are not added to the Per-Status Actions list, then the record might also include a field that contains the error response body.
Data Formats
The HTTP Client origin processes data differently based on the data format.
The HTTP Client origin processes data formats as follows:
- Avro
- Generates a record for every message. Includes a precision and scale field attribute for each Decimal field.
- Binary
- Generates a record with a single byte array field at the root of the record.
- Datagram
- Generates a record for every message. The origin can process collectd messages, NetFlow 5 and NetFlow 9 messages, and the following types of syslog messages:
- Delimited
- Generates a record for each delimited line.
- JSON
- Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Log
- Generates a record for every log line.
- Protobuf
- Generates a record for every protobuf message. By default, the origin assumes messages contain multiple protobuf messages.
- SDC Record
- Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text
- Generates a record for each line of text.
- XML
- Generates records based on a user-defined delimiter element. Use an XML element directly under the root element or define a simplified XPath expression. If you do not define a delimiter element, the origin treats the XML file as a single record.
Configuring an HTTP Client Origin
Configure an HTTP Client origin to read data from an HTTP resource URL.