Tip: don't 'tweak' the Workflow Server Group without really checking
Mike Gambier
madgambler at hotmail.com
Wed Oct 11 05:26:09 EDT 2006
Hi,
This is not a question so feel free to ignore it if you're not interested.
Recently our Basis people decided to play with the Server Group for the
Event Delivery job in Workflow (SWEQSRV) without keeping a proper eye on the
results. (The settings are visible in SWEQADM for those who don't know.)
We have millions of Workflow instances running in our system (that's no
joke) and deal with about 50,000 triggering events every day. That's okay
though because we have 16 Application Servers on tap and Workflow had access
to four of them (before the changes) and 50 dialog processes on each, all
adding up to a whopping 200 dialog processes on demand. We used to be able
to deliver up to 200 events per minute (12,000 per hour) depending on the
nature of the events being delivered.
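
For anyone who wants to check the sums, here is a rough sketch of the arithmetic behind those figures. The numbers are the ones quoted above; the variable names are made up for illustration, and the last line assumes, unrealistically, that the peak rate could be sustained all day.

```python
# Figures quoted in the post; variable names are illustrative only.
servers_in_group = 4
dialog_processes_per_server = 50
peak_events_per_minute = 200
events_per_day = 50_000

# Dialog processes available to Workflow across the Server Group.
total_dialog_processes = servers_in_group * dialog_processes_per_server

# Peak delivery rate expressed per hour.
peak_events_per_hour = peak_events_per_minute * 60

# Minutes needed to deliver a day's events if the peak rate held throughout.
minutes_for_daily_load_at_peak = events_per_day / peak_events_per_minute

print(total_dialog_processes)          # 200
print(peak_events_per_hour)            # 12000
print(minutes_for_daily_load_at_peak)  # 250.0
```

So at the peak rate a full day's events would need a little over four hours of delivery time, which is why losing dialog processes during the online day hurts.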
The Basis guys decided to increase the number of App Servers in the Server
Group but decreased the number of processes in each. Logical thinking you
may argue, as the number of dialog processes remained at 200, but what
happened was Workflow started to lose the battle for dialog processes on ALL
servers during the online day because we have thousands of users hogging
them.
You may or may not know that if Workflow fails to secure a dialog process at
runtime it creates a tRFC entry for WORKFLOW_LOCAL_100 in table ARFCSSTATE
so that a batch job called RSARFCEX can run later on and pick up the slack.
It does this for Events, Method Calls, New Tasks...pretty much everything.
Had the Basis people looked a bit harder they would have spotted a sudden
surge of entries being written to ARFCSSTATE following their changes.
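
A minimal sketch of the kind of check that would have caught this, assuming you already have a daily count of Workflow tRFC entries in ARFCSSTATE from somewhere (how you extract the counts is up to you and is not shown). The function name, the seven-day baseline, the factor of three and the sample counts are all made up for illustration; none of this is an SAP tool.

```python
from statistics import mean

def find_surges(daily_counts, baseline_days=7, factor=3.0):
    """Return the indices of days whose count exceeds `factor` times
    the mean of the preceding `baseline_days` days."""
    surges = []
    for i in range(baseline_days, len(daily_counts)):
        baseline = mean(daily_counts[i - baseline_days:i])
        # max(..., 1) keeps the threshold above zero when the queue
        # was empty during the baseline period.
        if daily_counts[i] > factor * max(baseline, 1):
            surges.append(i)
    return surges

# Made-up daily counts of queued Workflow tRFC entries, not real data.
counts = [120, 90, 150, 110, 100, 130, 95, 40_000, 650_000]
print(find_surges(counts))  # [7, 8]
```

The thresholds are arbitrary. The point is only that a jump of this size stands out against a normal week, so even a crude daily comparison would flag it.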
It's been a month now and Workflows have 'stalled' everywhere. We
have 20 MILLION tRFCs for Workflows that cannot be processed by RSARFCEX
because SAP's code simply can't cope with that many records (it blows its
own internal storage even trying to process a single day's work!!).
Basis have reverted their changes and the tRFC queue is hardly being hit by
Workflow at all anymore, so we're almost back to where we were. We are also
trying to reprocess the tRFC queue using our own tools and asking SAP for
help with theirs via OSS.
Granted, the numbers we are dealing with are extreme, but I urge Workflow and
Basis people alike to keep an eye on tRFCs for Workflow whenever they meddle
with the SWEQADM settings for Parallel Event Delivery.
MGT