We’ve had a recent deployment to production that did not go according to plan, or so we thought at first. It is an upgrade of a BizTalk solution from BizTalk 2004 to 2006 R2. It integrates a website, a CRM system, a bank system, and a fulfillment house.
We deployed the solution to our test environment, fixed some issues that were found, and everything worked fine. The testers and business users were happy and the system was working as expected. The BizTalk environment had already been built and configured the same as in test by the client. There was already another solution running on the test and production servers, and it seemed to be running fine.
So came the day of the deployment to production. Procedures and release notes were followed, end systems were switched over, MSI files imported, and… BizTalk was running very slowly. The website and the CRM system were receiving timeouts from BizTalk. A decision was made to roll back and investigate the cause.
We started by comparing the configuration of the new BizTalk 2006 R2 server in production with the new BizTalk 2006 R2 server in test. Then we compared the new BizTalk 2006 R2 server in production with the old BizTalk 2004 server in production. Most of our ideas, like “could it be an issue with the different .NET versions?” or “could an end system be running slower in production?”, fell over on one of two questions: then why does it work in test? Or: then why does it work on BizTalk 2004?
So I asked our system administrator to run the MsgBoxViewer tool on the new test and production servers. The MsgBoxViewer is an awesome tool developed by one of the top Microsoft BizTalk support engineers. It analyses the current configuration and state of both your BizTalk and SQL machines, performs a list of checks and raises warnings (in yellow) and critical alerts (in red). It also gives you a lot of information about what is currently happening on your servers, for example a list of row counts for the tables in your MsgBox database, or the TCP/IP registry configuration on each server. The latest version is available here.
Anyways, what the MsgBoxViewer tool came back with was pretty alarming: in the BizTalk 2006 R2 production environment, all the hosts had message publishing throttling turned on due to reason 6: DB Size. In the resulting HTML file, you can scroll down to the Summary Report, and the end of that report lists the throttling settings.
If you have a look at the inbound host throttling page on MSDN, you can see that message publishing throttling is enabled with reason 6 when “Host message queue size, the spool table size or the tracking table size exceed the specified threshold.” This is why BizTalk was running so slowly: it was throttling down its message publishing, so everything took much longer than normal.
This led me to have a look at the “MsgBox Db : Get MsgBox Tables Size” section of the MsgBoxViewer report and, sure enough, the Spool table had over 511 thousand rows (!). This should usually be under 500 rows. The usual reason why this happens is that the SQL Server Agent jobs created by BizTalk are not running, or aren’t keeping up.
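To make that check repeatable, here is a minimal Python sketch of the comparison I did by eye. It assumes you have already copied the row counts out of the MsgBoxViewer report into a dictionary. The function name `check_spool` and the dictionary shape are made up for illustration, and the 500-row limit is just the rule of thumb mentioned above, not an official threshold.

```python
# Rule of thumb from this post, not an official BizTalk limit.
SPOOL_ROW_LIMIT = 500


def check_spool(row_counts, limit=SPOOL_ROW_LIMIT):
    """Return a warning string if the Spool table looks too large, else None.

    row_counts is a hypothetical dict of table name -> row count, typed in
    by hand from the "Get MsgBox Tables Size" section of the report.
    """
    rows = row_counts.get("Spool")
    if rows is None:
        return "Spool row count missing from the report data"
    if rows > limit:
        return (
            f"Spool has {rows} rows (expected under {limit}): "
            "check that the BizTalk SQL Server Agent jobs are running"
        )
    return None


if __name__ == "__main__":
    # Roughly the numbers we saw before and after the jobs were fixed.
    print(check_spool({"Spool": 511000}))
    print(check_spool({"Spool": 128}))
```

With our production numbers, the first call returns a warning and the second returns `None`.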
In our case the culprits were:
- The “TrackedMessages_Copy_BizTalkMsgBoxDb” job was failing.
- The “DTA Purge and Archive (BizTalkDTADb)” job was disabled in production but enabled in test.
The client had configured these correctly in the test system a few weeks before, but did not realize that leaving them unconfigured in production would have such an impact. After we raised the issue with them, the purge and archive job was configured and enabled, and the tracked messages job was also configured correctly. Once both jobs had run a few times without failing, we ran the MsgBoxViewer tool again: there were only 128 rows in the Spool table and none of the hosts were being throttled. Nice!
A good lesson to learn, or to be reminded of, is to always, always make sure the BizTalk SQL jobs are running correctly.
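One way to keep an eye on this is to ask SQL Server Agent directly. The sketch below is an assumption-laden example rather than what we actually ran: it reads the standard `msdb.dbo.sysjobs` and `msdb.dbo.sysjobhistory` tables, assumes every BizTalk job has “BizTalk” in its name, and expects you to supply a DB-API cursor from whatever driver you use. The function names are hypothetical.

```python
# Latest overall outcome (step_id = 0) for each SQL Server Agent job.
# The LIKE filter is an assumption: adjust it to match your job names.
JOB_STATUS_QUERY = """
SELECT j.name, j.enabled, h.run_status
FROM msdb.dbo.sysjobs AS j
OUTER APPLY (
    SELECT TOP 1 run_status
    FROM msdb.dbo.sysjobhistory
    WHERE job_id = j.job_id AND step_id = 0
    ORDER BY instance_id DESC
) AS h
WHERE j.name LIKE '%BizTalk%'
"""

RUN_SUCCEEDED = 1  # sysjobhistory.run_status value for a successful run


def fetch_job_rows(cursor):
    """Run the query on any DB-API cursor; returns (name, enabled, run_status)."""
    cursor.execute(JOB_STATUS_QUERY)
    return [tuple(row) for row in cursor.fetchall()]


def find_unhealthy_jobs(rows):
    """Return (job name, reason) pairs for jobs that deserve a second look."""
    problems = []
    for name, enabled, run_status in rows:
        if not enabled:
            problems.append((name, "disabled"))
        elif run_status is None:
            problems.append((name, "no run history"))
        elif run_status != RUN_SUCCEEDED:
            problems.append((name, f"last run_status was {run_status}"))
    return problems


if __name__ == "__main__":
    # Hand-written rows that mimic what we found in production.
    sample = [
        ("TrackedMessages_Copy_BizTalkMsgBoxDb", 1, 0),
        ("DTA Purge and Archive (BizTalkDTADb)", 0, None),
    ]
    for name, reason in find_unhealthy_jobs(sample):
        print(f"{name}: {reason}")
```

A disabled job is not automatically a fault, so treat the output as a list of things to compare against a known-good environment (in our case, test) rather than as a verdict.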
Note: running the MsgBoxViewer tool is pretty safe, but it links to a few pages that recommend changes to the system. Make sure you know what you are doing and have tested before making any changes to your production system.