An experience deploying TADDM 7.2.1
The customer had an existing 7.2 environment split into two separate TADDM Domains. They were also not synchronizing to an eCMDB TADDM server. To consolidate these two domains into the desired single deployment, we would have needed to install an eCMDB to bring all of the data together in one place. However, because the customer was also migrating to new hardware and was not concerned with their existing data (beyond scopes, access lists, and profiles), we elected to deploy TADDM 7.2.1 fresh in Streaming Mode rather than migrate their existing infrastructure.
Saving what matters
Generally speaking, setting up a new TADDM environment should not require bringing over the discovered data, because it can be rediscovered fairly quickly. It does matter, however, if you care about your existing change history. Our customer did not, so we elected to do a fresh install as previously mentioned. In any case, most people will want to save their custom configurations to speed up redeployment. These configurations include:
NOTE: In the following, the EXPORT was done in our existing 7.2 environment and the import was done after installing our NEW environment on a Discovery Server.
WARNING: The imports above only need to be done once per 7.2 Domain and on only ONE Discovery server. They do not need to be repeated for each 7.2.1 discovery server.
If you are interested, you can read more about planning for 7.2.1. The conversion from a Synchronization deployment to a Streaming deployment (which covers some of the above as well) is also a useful read.
Installing 7.2.1 is relatively straightforward. For this deployment, we had the following infrastructure:
The order of installation was:
For all of this, I just followed the directions in the documentation. Once completely installed, you need to do the following:
Once your infrastructure is deployed, you need to test discovery. The first test should be against a single computer system to validate that the infrastructure is functioning properly. Once you've validated credentials, you will need to discover everything that you had discovered in 7.2. Our goal was to determine whether we were finding all the same things we found before and hitting the same problems. After initial discoveries of everything that was intended to be discovered, the discovery logs showed a lot of "Access Denied" errors, which implied our credentials were bad. I was concerned that some mistake had been made during the import or that there was a problem with the sensors. To validate this, I wanted to know how many issues they had in their old environment. Since we had essentially set up a parallel installation, I could still look at their latest discovery history located in $COL
Analyzing a discovery in TADDM 7.2
The contents of the $COL
<runid> is the filename (without the .ser extension) of the files in the events directory above. To process the files on a Unix system, I do the following:
$ cd $COL
This will give you all of the events in a CSV file and you can import it directly into a Spreadsheet program. To further simplify the processing though, do the following:
$ cat /tmp/all_events.csv | awk '/SessionSensor/
Next, import this file into your Spreadsheet program (splitting by comma). If you then create a Pivot Table based on this, it will look something like:
In the above, critical means the SessionSensor failed and normal means it succeeded. This is roughly a 50/50 success-to-failure ratio, which is surprising, but the results from a discovery of the customer's environment in 7.2.1 are similar, indicating we have achieved parity between the solutions:
In the above shot, the 'Successfully Accessed Systems' are the same as the normal systems in the previous table. 'Access Failures' are the same as 'critical'. Comparing the above two, the numbers are very similar and do show parity.
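The pivot-table tally described above can also be approximated on the command line. The following is a minimal sketch, not the method used in this deployment: the function names `count_by_severity` and `success_rate` are hypothetical, and it assumes the severity value ("critical" or "normal") sits in a known comma-separated column of the events file, which you pass in because the real column layout may differ.

```shell
#!/bin/sh
# Tally SessionSensor events by severity from a comma-separated events file.
# ASSUMPTION: the severity value lives in a single column whose 1-based
# number is supplied by the caller; the real file layout may differ.
count_by_severity() {
    file=$1      # path to the events CSV
    column=$2    # 1-based column holding the severity
    awk -F',' -v col="$column" '
        /SessionSensor/ { counts[$col]++ }
        END { for (s in counts) printf "%s,%d\n", s, counts[s] }
    ' "$file" | sort
}

# Percentage of SessionSensor events whose severity is "normal" (success).
# Prints "n/a" when the file contains no SessionSensor events.
success_rate() {
    file=$1
    column=$2
    awk -F',' -v col="$column" '
        /SessionSensor/ { total++; if ($col == "normal") ok++ }
        END {
            if (total > 0) printf "%.1f\n", 100 * ok / total
            else print "n/a"
        }
    ' "$file"
}
```

If severity happened to be the third field, `count_by_severity /tmp/all_events.csv 3` would print one `severity,count` line per severity, and `success_rate /tmp/all_events.csv 3` would print the share of successful sessions. Running both against the old and new environments gives a quick parity check without a spreadsheet.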
NOTE: The above shot is from the 'Discovery Remediation Report'. These reports are not presently installed with 7.2.1 and can be downloaded here. Remediation of discovery is a process that is often left out of a deployment. It is very important that a process be put in place to handle the failures that occur during discovery. A great starting point is the previously mentioned document and reports: TADDM 7.2.1 Discovery Scan Report and Remediation for Computer Systems.
Using the Remediation reports, we looked at some of the errors we should be able to deal with, like timeouts. We needed to extend some timeout values that commonly have to be changed. We made the following changes:
To simplify using the above reports, we really wanted to run discovery of the whole environment in one scope. In this customer's environment it was just easier to manage. If your scopes are too large, divide them logically for problem resolution as well as timing, load balancing, and so on. If possible, keep your complete discovery cycle within a Sunday to Saturday weekly run, because many of the reports aggregate data over the week for troubleshooting and statistical analysis.
While doing this, we also hit some performance issues (slow reporting and storage) and recalled that once you initially populate the database, you need to get the stat
We also ran into some OutOfMemory problems (OOMs) on the discovery server, which we initially attributed to the large scopes. These were narrowed down to a single sensor, SMSServerSensor, and we had to disable it to get discovery to work (it turned out to have nothing to do with the large scopes, but was SOLELY related to the target and the sensor). Once we disabled the sensor, everything worked. This has been opened as a defect and will hopefully be resolved soon. For this customer's environment, disabling it was not a problem.
A second OOM we had was actually due to the size of the results (and is also an Open Defect). A workaround was to increase the memory of discovery to 2048, which resolved the problem. Here's what we did:
NOTE: OutOfMemory Errors typically cause the status of the TADDM Server to look a little strange -- with services either not started, Stopped or in a constantly Starting state. If this occurs, check the $COL
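When the server status looks odd in this way, a quick scan of the logs for OutOfMemory messages can confirm the cause. The following is a minimal sketch under stated assumptions: `find_oom_logs` is a hypothetical helper, and the log directory is passed in as a parameter rather than hard-coded, so point it at wherever your TADDM logs live.

```shell
#!/bin/sh
# List log files under a directory that mention an OutOfMemory error,
# with the number of matching lines in each file.
# ASSUMPTION: log_dir is supplied by the caller; no TADDM path is implied.
find_oom_logs() {
    log_dir=$1
    # grep -rc prints "file:count" for every file; keep only non-zero counts.
    grep -rc "OutOfMemory" "$log_dir" 2>/dev/null | awk -F':' '$NF > 0' | sort
}
```

Each output line has the form `path:count`, so the files with the most hits are a reasonable place to start reading.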
Maintenance of Environment
TADDM generally just replaces data that is in the environment so that when you have discovered your whole environment, it typically should not change drastically in size. However, any time the database changes drastically in size, you need to make sure you update the statistics, etc. described in Database Tuning. Some other things you need to do:
1. Maintain your change_history table
2. Manage Dormant Components
3. Maintain your DISCEVNT Table -- As of 7.2.1, discovery events are persisted to the database, giving us an opportunity to report on the data more easily. But this table needs to be maintained just like the CHAN
SELECT EVENT_RUNID, COUNT(*) FROM DISCEVNT