Technical Blog Post
Abstract
Why does Support need that when -- analyzing problems sending AS2 and HTTP
Body
I know we drive our customers crazy at times when we ask for particular data, because they tell me so. We are looking for specific information, but it may not be obvious why we want it or what we are going to do with it. I thought it might help if I explained why an analyst would ask for certain data. In this blog entry, I will outline the log information that's most useful for diagnosing problems with sending data using the AS2 and HTTP protocols. I plan to follow up with other articles that discuss other protocols.
When someone tells me they cannot send data using HTTP (which includes AS2) I ask for a set of details, as follows:
I ask for general information to outline the problem. Is it a new configuration, or one that worked but then stopped? If it used to work, then it has to have stopped because something changed. The focus will be to find out what that is, and to do what's needed to correct it.
I ask if it affects all partners, because that defines the scope of the problem. It can also pinpoint the source. If a problem occurs for one partner, it might be an error on the partner's system. If it occurs for more than one, it is most likely an issue within Sterling B2B Integrator.
I ask if it happens every time or if it is intermittent. If it is intermittent, there's a difference between the occurrences that work and those that do not. In order to establish a pattern to help determine what those differences are, I will want to know how often the problem occurs. I will want to know what times, and whether it always occurs on one node. I'll want to ask whether other events were happening such as heavy traffic from partners or system activity such as backups. I'll ask anything I can think of that distinguishes the errors from events that work correctly.
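To illustrate how those answers turn into a pattern, here is a minimal sketch in Python that tallies failed attempts by node and by hour of day. The list of attempts is made-up sample data for illustration only (it is not taken from any real system), and the function name failure_pattern is my own.

```python
from collections import Counter
from datetime import datetime

# Hypothetical observations: (timestamp, node, succeeded).
# These values are invented for illustration only.
attempts = [
    ("2015-12-30 02:05:11", "node1", False),
    ("2015-12-30 02:40:52", "node1", False),
    ("2015-12-30 09:15:03", "node2", True),
    ("2015-12-30 14:47:30", "node2", True),
    ("2015-12-30 15:17:57", "node1", False),
]

def failure_pattern(records):
    """Count failed attempts per node and per hour of day."""
    by_node = Counter()
    by_hour = Counter()
    for stamp, node, succeeded in records:
        if succeeded:
            continue  # only the failures are of interest here
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_node[node] += 1
        by_hour[when.hour] += 1
    return by_node, by_hour

by_node, by_hour = failure_pattern(attempts)
print("Failures by node:", dict(by_node))
print("Failures by hour:", dict(sorted(by_hour.items())))
```

If every failure lands on one node, or clusters around the same hour, that is the kind of difference between working and failing occurrences an analyst is trying to find.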
I also ask for some specific data:
1) Output from dump_info.sh or dump_info.cmd: My colleague Dana explained why we ask for this. I have nothing to add to her explanation, which is here:
2) Screen prints showing all of the steps of an example of the workflow that failed. I ask for the header information as well, such as this:
The header information is the "Business Process Detail" part at the top. It has the name of the process, the instance ID, status, and state.
There were 90 steps in the workflow for this failed process, which is the system process, AS2SendSyncMDN. I was sending data to an AS2 partner. I would ask for all of the steps to be included in the workflow so I can see the context of errors that occur. In this example, though, I'll just include the steps around the error:
Step #37 is the actual error. When I see that, I can see that the HTTP Client POST Service failed, and in this case, that it failed on December 30, 2015 at 3:17:57 PM (15:17:57 in 24 hour time). I can also see it ran on node1, which tells me which logs will contain the information I need.
I then ask further for the text of the status report and instance data. Those are contained in the 1st and 3rd "info" buttons on step 37.
Here's the status report:
Execute Business Process
Name: AS2SendSyncMDN
Instance ID: 700717
Service Name: HTTP Client POST Service
Status Report:
It shows the EXACT time when the error occurred, and also gives more information about the error. It confirms the Instance ID. It also shows the SessionToken. The instance ID and SessionToken are included in the log files, so this information helps to determine what entries we should focus on in the logs.
3) The full log files from your SI/install/logs directory from the node where the error occurred, as follows:
httpclient.log
Perimeter.log
If you look for those files in the SI/install/logs directory, you can see a lot of information just from the directory listing and the file names:
-rw-r--r-- 1 bkitche users 1014 Dec 30 15:17 httpclient.log.D20151230.T144707
-rw-r--r-- 1 bkitche users 2540348 Dec 30 16:25 Perimeter.log.D20151230.T080519
The date and time are when the files were last updated.
The names of the files include the date and time when the file was created:
The httpclient.log file was created on Date 20151230 (which is December 30, 2015)
and at Time 144707, or 14:47:07 in 24 hour time. That's 2:47:07 PM.
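As a rough illustration of how much the original name carries, here is a small Python sketch that reads the creation date and time back out of a file name. It assumes the name.DYYYYMMDD.THHMMSS layout seen in the two listings above; the function name parse_log_name and the "renamed.docx" example are my own, for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the listings above: <name>.D<YYYYMMDD>.T<HHMMSS>
NAME_PATTERN = re.compile(r"^(?P<base>.+)\.D(?P<date>\d{8})\.T(?P<time>\d{6})$")

def parse_log_name(file_name):
    """Return (base name, creation datetime), or None if the name does not fit."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None  # renamed or otherwise unrecognized file
    created = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return match["base"], created

for name in (
    "httpclient.log.D20151230.T144707",
    "Perimeter.log.D20151230.T080519",
    "renamed.docx",  # hypothetical renamed copy: the timestamp is gone
):
    print(name, "->", parse_log_name(name))
```

The renamed file returns nothing useful, which is exactly the information that is lost when a log is copied into another document.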
Sometimes people will download these files, then copy the contents into a graphics-oriented word processor like Microsoft Word, and change the names. Then we lose all of the information that was present in the original file names. That's why I always ask for the files with their original names.
3b) If possible, we ask for a new test with the log in full debug mode. To get that information you would first go to Operations > System > Logs, and locate the log, then click on the icon to the left of the log.
Scroll down until you see the log that is being discussed:
Then click on the icon to the far left, and select Logging Level of "On":
For the Perimeter.log, the Logging Level can be set to Error (its lowest setting), Info, Communication Trace, or All. An analyst wants all of the data they can get, so they will almost always request that the Logging Level be set to "All".
The Logging Level refers to the debug setting for log information. A log with a Logging Level of "All" or "On" will generate a LOT more information.
I could write a book about how to analyze a log file for all of the information about an error. Any analyst could. You are welcome to review that information yourself, and to ask questions when you discuss an issue with an analyst, but it would make this blog entry very long for me to explain in detail. I and other analysts will likely write blog entries on how to use different log files to troubleshoot particular issues.
Here's what I see for that error in the screen prints from my workflow, with the httpclient.log set to Logging Level of "Off":
[2015-12-30 15:17:57.087] ERROR CustomConnectAgent.connectFailed() - Received signal from Perimeter due to fail to make connection with message [Could not complete connection to specified host]
[2015-12-30 15:17:57.088] ERROR CustomConnectAgent.connectFailed() - encountered error due to [Could not complete connection to specified host]
If I turn the Logging Level to "On" there would be hundreds of lines.
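When a log does run to hundreds of lines, the exact time from the status report is what narrows it down. Here is a minimal Python sketch of that idea. It assumes every entry starts with a bracketed timestamp followed by a severity, as in the two lines quoted above; the function names and the five-second window are my own choices, not anything the product defines.

```python
import re
from datetime import datetime, timedelta

# Entry layout assumed from the two quoted lines: [timestamp] LEVEL message
LINE_PATTERN = re.compile(
    r"^\[(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\] "
    r"(?P<level>\w+) (?P<message>.*)$"
)

def parse_entry(line):
    """Return (datetime, level, message), or None for lines that do not fit."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match["stamp"], "%Y-%m-%d %H:%M:%S.%f")
    return when, match["level"], match["message"]

def entries_near(lines, error_time, seconds=5):
    """Keep only the entries logged within `seconds` of the error time."""
    window = timedelta(seconds=seconds)
    selected = []
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and abs(entry[0] - error_time) <= window:
            selected.append(entry)
    return selected

# The two httpclient.log lines quoted above.
sample = [
    "[2015-12-30 15:17:57.087] ERROR CustomConnectAgent.connectFailed() - Received signal from Perimeter due to fail to make connection with message [Could not complete connection to specified host]",
    "[2015-12-30 15:17:57.088] ERROR CustomConnectAgent.connectFailed() - encountered error due to [Could not complete connection to specified host]",
]

# Time of the failed step, taken from the workflow screen print.
error_time = datetime(2015, 12, 30, 15, 17, 57)
for when, level, message in entries_near(sample, error_time):
    print(when.time(), level, message)
```

This only trims the log by time. An analyst would still read the surrounding entries, and would also look for the instance ID and SessionToken, to follow the whole conversation with the partner.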
4) For HTTP only: The business process itself.
If you're using AS2 and our standard processes, we are familiar with them and we have them on our test servers, so we can review what steps are supposed to occur.
If you wrote your own business process(es), we aren't familiar with them. It's possible the process itself is causing the problem. If so, we can analyze it and try to make suggestions. We can also install your process on a test server, modify it to work, and run tests using it.
5) For AS2 only: The AS2 partner and organization screens
These will contain your identifiers, end points, the names of the certificates you are using, and other details about your configuration. We can review all of that information, along with the other information we are requesting. It may help us to find clues about the problem.
All of this information is part of the problem description. There may very well be more items that I (or another analyst) will ask you to provide. Our goal is only to understand what is happening and why, so we can help you to come to a successful resolution.
If you have any questions about this information, please add comments. I think a discussion would make this a much more useful item for all of us.
Thanks! I hope this is helpful.
UID
ibm11121553




