<EDIT>(Monthly) archive updates can be found here,</EDIT>
I have known for years that the DataPower forum is one of the most active IBM forums. I manually navigated the forum hierarchy up to the top level and down again by clicking the "category" links, which lead to category pages listing forum links and/or further category links. From that manual navigation I already knew that the WebSphere Application Server forum (20044/58692) and the WebSphere Portal forum (18467/53367) were more active than the DataPower forum (7459/31819), but I found no others. By "more active" I mean "higher message count" (58692/53367/31819).
On the weekend I finally got the right idea for how to write an IBM forum crawler, so let me first copy in the result here before discussing the crawler below. As can be seen, there are more than 4000 IBM forums (4239 in total), and the DataPower forum is the 5th most active. The generated table only displays entries with a message count of at least five digits:
|4239 forums (369 categories) in total in 511338ms||Wed Nov 05 2014 12:53:20 GMT+0100 (CET)|
|2||WebSphere Application Server||20044/58692||2014-11-04T23:33:17Z|
|5||IBM DataPower Gateway Appliance||7458/31816||2014-11-05T11:57:03Z|
|7||IBM Business Process Manager||6573/28781||2014-11-05T11:27:51Z|
|8||IBM DB2 for Linux, Unix, and Windows Forum||9673/27322||2014-11-04T05:28:55Z|
|9||Functional and GUI Testing||8436/25067||2014-11-05T09:16:26Z|
|10||System x Server||6294/24699||2014-11-03T01:50:46Z|
|12||Development Tools (RAD, RSA, RDA, RSM, RWD)||8906/21512||2014-11-04T14:29:22Z|
|14||IBM Web Experience Factory - Best Practices||5042/20535||2014-11-05T04:26:04Z|
|15||IBM Systems Director Forum (System x, System z, Power Systems)||6229/20256||2014-11-02T22:50:08Z|
|18||Maximo and process automation solutions||6060/18911||2014-11-05T09:26:08Z|
|20||Rational DOORS DXL||3248/16698||2014-11-05T09:44:56Z|
|22||Cell Broadband Engine Architecture forum||3522/14872||2014-07-01T07:32:38Z|
|24||IBM System Storage||3787/13454||2014-11-05T01:34:16Z|
|25||IBM WebSphere Transformation Extender||3295/13390||2014-11-05T02:42:32Z|
|26||IBM Tivoli Monitoring (ITM) General Discussion||4448/13166||2014-11-04T20:27:19Z|
|27||WebSphere JavaServer Faces (JSF)||3191/12782||2014-10-24T05:08:43Z|
|28||WebSphere Studio Site Developer and Application Developer||4494/12104||2014-07-15T19:12:12Z|
|29||IBM DB2 Express-C Forum||3657/12051||2014-11-05T09:10:52Z|
|31||IBM Integration Designer and WebSphere Integration Developer||3677/11461||2014-10-24T21:40:41Z|
|32||Deployment and Configuration||2537/10281||2014-10-30T17:22:14Z|
|33||Rational Performance Testing||3189/10083||2014-11-05T11:31:48Z|
Further below you can find the syntax-highlighted source code listing of IBM-Forum-crawler.js.html.
You can save the HTML page by right-clicking this link and choosing "save as".
If you click this link, IBM forum crawling starts in your browser immediately, and it takes more than 6 minutes!
Also note that this crawler works with the Firefox browser only (either Linux or Windows).
It is Firefox-only because having it work in any one of the big 5 browsers is enough, and I developed it with Firefox (and its web console).
There is exactly one warning left
"Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user's experience. For more help http://xhr.spec.whatwg.org/ IBM-Forum-crawler.js.html:30"
which is no problem here because the main browser page gets updated for each of the (369) category pages with current information like this:
crawling forums, sofar ... 13608ms 40 forums 5 categories category?id=a4c83a8c-1106-418e-bcae-323eb6641707 category?id=bab4f8b9-dff8-4d75-a597-e6ce404d50ba category?id=89f02d67-954c-4ca2-aba2-68d9623cb1d4 category?id=5f9d2362-6568-499e-a82d-8e111b4eb114 category?id=2c8198b1-5874-4b5d-9ad5-751538d69a97 public?
Using the Firefox web console on an IBM category page, I was able to find out how to access all the information needed in the page. The "data picking" auxiliary functions were derived from those investigations.
Function strSort(a, b) is responsible for descending sorting of (found) forum lines by Message count.
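A descending comparator of this kind can be sketched as follows. This is only a minimal sketch, not the code from the listing: it assumes each forum line is a string containing a count pair such as 20044/58692 whose second number is the message count, and the helper name messageCount is hypothetical.

```javascript
// Hypothetical helper: pick the message count (the number after '/')
// out of a forum line; lines without a count pair sort as 0.
function messageCount(line) {
    'use strict';
    var m = /(\d+)\/(\d+)/.exec(line);
    return m ? parseInt(m[2], 10) : 0;
}

// Comparator for Array.prototype.sort: higher message count first.
function strSort(a, b) {
    'use strict';
    return messageCount(b) - messageCount(a);
}

// Sample lines (assumed format, counts taken from the table above).
var lines = [
    'IBM DataPower Gateway Appliance 7458/31816',
    'WebSphere Application Server 20044/58692',
    'IBM Business Process Manager 6573/28781'
];
lines.sort(strSort);
```

Returning b minus a (rather than a minus b) is what makes the sort descending.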
The actual crawling is done by a simple recursive Depth First Search (DFS) starting at the root IBM forum category page. These IBM category pages do not contain cycles, so there was no risk of running into endless loops. Nevertheless, the associative array visited implements cycle detection and aborts a branch when a page has been seen before, in case you want to apply the crawler somewhere else.
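The idea of a recursive DFS with a visited map can be sketched like this. It is a simplified illustration, not the crawler itself: instead of loading real category pages it walks a made-up in-memory object (pages), and the names pages, forums and crawl are assumptions for this example. An artificial cycle is included to show the detection at work.

```javascript
// Hypothetical stand-in for the category pages: each category id maps to
// the forums and sub-categories listed on that page.
var pages = {
    root: { forums: ['f1'], categories: ['catA', 'catB'] },
    catA: { forums: ['f2', 'f3'], categories: ['catB'] },
    catB: { forums: ['f4'], categories: ['root'] }   // artificial cycle
};

var visited = {};   // associative array: category id -> true
var forums = [];    // all forums found so far

function crawl(id) {
    'use strict';
    var i,
        page;
    if (visited[id]) {
        return;     // already seen: abort this branch
    }
    visited[id] = true;
    page = pages[id];
    for (i = 0; i < page.forums.length; i += 1) {
        forums.push(page.forums[i]);
    }
    for (i = 0; i < page.categories.length; i += 1) {
        crawl(page.categories[i]);   // depth first: descend immediately
    }
}

crawl('root');
```

In the real crawler the lookup in pages would be replaced by loading and parsing the category page for the given id.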
Avoiding CORS (Cross-origin resource sharing) problems is most easily done by running the IBM-Forum-crawler.js.html page from the same domain (www.ibm.com) as the requested resources (the forum category pages visited during DFS). IBMers like me cannot just "store" something on www.ibm.com -- but the link above shows the easy workaround I found: generate a dummy blog posting here on developerWorks, click the "Insert Image" icon, select a "*.html" (!) file and click OK. I have been using this trick for ages, for all my syntax-highlighted source code listings like the one below, which is an <iframe> with @src pointing to the locally uploaded HTML page. This is the link to just the syntax-highlighted source code page referenced in the <iframe> below:
Another strange issue is that I could not get arr.concat(A) working and therefore worked around it with another for-loop.
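One possible explanation, offered as an assumption because the failing code is not shown here: Array.prototype.concat does not modify the array it is called on but returns a new array, so a bare arr.concat(A) statement has no visible effect. The sketch below contrasts that with assigning the result and with the for-loop workaround (the sample values are made up).

```javascript
var arr = [1, 2];
var A = [3, 4];
var i;

arr.concat(A);                      // returns a new array; arr is unchanged
var lengthAfterBareConcat = arr.length;

var combined = arr.concat(A);       // keep the result by assigning it

// In-place alternative, equivalent to the for-loop workaround:
for (i = 0; i < A.length; i += 1) {
    arr.push(A[i]);
}
```

Array.prototype.push.apply(arr, A) would be another in-place option, although it can hit argument-count limits for very large arrays.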
Last but not least, I ran into a (for me) very surprising issue. The whole parsing assumes a single-page DOM representation of a category page. I had to increase the maximal value in the "suff" variable up to 5000(!) until the results became consistent. The reason turned out to be this "strange" category page (listing 3415 forums alone):
Some words on runtime: 511 seconds in total for creating the above table sounds long, but it means slightly less than 1.4 seconds per loaded and processed category page on average (511338 ms / 369 pages ≈ 1386 ms), which is not that bad.
- indentation by 4
- need to combine variable declarations inside function
- having to add 'use strict' in various places
- no space between function name and left parenthesis
- a required space between for and the left parenthesis in a for loop
- having one variable per line in function variable definitions
- having to use "i += 1" instead of "++i"
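As an illustration only, a small function written to the conventions listed above might look like the following sketch; the function name sumCounts and its data are made up for this example and do not come from the crawler.

```javascript
function sumCounts(values) {
    'use strict';
    // one combined var statement, one variable per line
    var i,
        total = 0;
    // space after 'for', and 'i += 1' instead of '++i'
    for (i = 0; i < values.length; i += 1) {
        total += values[i];
    }
    return total;
}

var totalMessages = sumCounts([58692, 53367, 31819]);
```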
Because it runs fine in Firefox, I have not looked into a DataPower GatewayScript implementation so far.
The loadHTML(url) function would need to be replaced by use of the "urlopen" module.
And parsing of the HTML (string) into a DOM would have to be done somehow (it is currently done by Firefox).
This can be done with htmlparser.js in DataPower GatewayScript, see this previous blog posting: