Overview
In my last article, I discussed how to monitor services and integration points for a complex Documentum-based application. In this article, I turn my attention to strategies we employed to make our workflow integration more resilient to problems that might be encountered. This article will have a more technical focus.
In our experience, one part of the Documentum workflow subsystem, the Java Method Server, can unexpectedly become unresponsive. The monitoring solution described in the previous article catches this problem and routes it to the support team, who give the server a “kick.” Unfortunately, any documents submitted during the method server outage can become stuck in paused workflows, which can prevent users from seeing their tasks. To recover from situations like this, we implemented a “workflow recovery poller” that finds and restarts affected workflows.
Documentum Workflow Objects Primer
Before getting into the details of our solution, we need to provide some background by covering the key Documentum workflow objects and their relationships. You can refer to the Documentum Object Reference for additional information.
Process (dm_process)
A workflow is really just a process, and Documentum has a “process” object which models the steps. This object is a template, and you can think of each workflow as a specific instance following the steps of that template. Note that these processes can be created and managed with EMC’s Documentum products such as Workflow Manager and Business Process Manager.
Activity (dm_activity)
There is an “activity” object for each step of the process. The process object has a repeating attribute (r_act_name) which lists the names of each activity in the template. There are also related repeating attributes which describe other attributes of each activity.
Workflow (dm_workflow)
The “workflow” object captures the state of a document (or group of documents) going through the steps of a process template. This object has an overall status which is captured using the field “r_runtime_state”. The valid values for runtime state are:
DF_WF_STATE_UNKNOWN = -1
DF_WF_STATE_DORMANT = 0
DF_WF_STATE_RUNNING = 1
DF_WF_STATE_FINISHED = 2
DF_WF_STATE_HALTED = 3
DF_WF_STATE_TERMINATED = 4
In addition to this overall state, each individual activity in the process template has a state. These are captured in the repeating attribute “r_act_state”. The valid values for activity state are:
DF_ACT_STATE_UNKNOWN = -1
DF_ACT_STATE_DORMANT = 0
DF_ACT_STATE_ACTIVE = 1
DF_ACT_STATE_FINISHED = 2
DF_ACT_STATE_HALTED = 3
DF_ACT_STATE_FAILED = 4
Documentum updates the status of the workflow and activities as the workflow progresses. If an activity fails, for example, the r_act_state will show a value of “4”.
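To make these codes easier to work with in recovery logic, they can be wrapped in a small enum. The following is a minimal plain-Java sketch: the ActivityStateDemo class and its members are hypothetical helpers for illustration and are not part of DFC, which exposes these values as integer constants.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of DFC: names for the r_act_state codes listed above. */
final class ActivityStateDemo {

    enum ActivityState {
        UNKNOWN(-1), DORMANT(0), ACTIVE(1), FINISHED(2), HALTED(3), FAILED(4);

        final int code;

        ActivityState(int code) {
            this.code = code;
        }

        /** Maps a raw code to a constant; unrecognized codes fall back to UNKNOWN. */
        static ActivityState fromCode(int code) {
            return Arrays.stream(values())
                    .filter(state -> state.code == code)
                    .findFirst()
                    .orElse(UNKNOWN);
        }

        /** Halted and failed are the states a recovery process would look for. */
        boolean needsRecovery() {
            return this == HALTED || this == FAILED;
        }
    }

    /** True if any of a workflow's r_act_state values is halted or failed. */
    static boolean anyNeedsRecovery(int[] actStates) {
        return Arrays.stream(actStates)
                .mapToObj(ActivityState::fromCode)
                .anyMatch(ActivityState::needsRecovery);
    }

    public static void main(String[] args) {
        // Third activity failed, even though earlier ones finished or are active
        System.out.println(anyNeedsRecovery(new int[] {2, 1, 4}));
    }
}
```

Checking every entry of the repeating attribute matters because, as described later, a workflow can still report an overall state of running while one of its activities has failed.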
Package (dmi_package)
The workflow object doesn’t have any fields to relate it to the document(s) with which it is associated. This is the job of the “package” object. Each package references the id of the workflow and has a repeating attribute (called r_component_id) which contains the object ids of the content associated with that workflow.
Work Item (dmi_workitem)
Each activity for a workflow can have multiple performers and there are associated “work items” for each of them. The work item captures details (such as due date) about that activity for that performer. Each work item has a field (r_queue_item_id) which is used to associate it with a “queue item” in the user’s inbox.
Queue Item (dmi_queue_item)
“Queue items” are Documentum constructs which are used to present items in the user’s inbox queue. Queue items are not workflow-specific objects, but there will be one associated with each work item.
Workflow Design
We have brushed up on the key Documentum workflow objects, but we aren’t quite ready to jump into the details of our workflow recovery implementation. There isn’t a “one size fits all” implementation, because a successful strategy depends on the specifics of the workflow process design. In order to demonstrate the implementation, we will use a simplified version of our process template.
The end goal of our workflow is to ensure that a document gets the appropriate reviews so that it can be made active to end users. We have an automated workflow activity which inspects custom attributes on the document to determine the appropriate reviewer and creates review tasks. Note that this activity re-runs whenever a task is completed so that we can re-evaluate the status.
All of our documents go through workflow when they are uploaded. Documents initially have a version label of “pending”, and once the classifications are approved, we replace that label with “active” which indicates that the document can go live.
Workflow Recovery Implementation
Now that we have covered the details of our workflow implementation, we can discuss the strategy to recover from outages. As mentioned above, we have experienced times when the Java Method Server becomes unresponsive. When this is happening and new documents are imported (or outstanding review tasks are acted upon), the workflow activities fail and the documents become “stuck” in a pending state. Our solution was to introduce a process which runs regularly to find and fix workflows which aren’t in a valid state.
Given the business requirements of our system, we decided to keep the logic simple and kick off a new workflow when a document is in an invalid state. This works for us because our aforementioned automated review activity will inspect the status of previously processed documents and act appropriately. However, the same concepts could be used to restart workflows anywhere in the process flow.
Our process is a standalone Java application that runs every 5 minutes. It uses a framework for running periodic tasks which is part of our solution, but it could also have been implemented using standard Documentum job scheduling. At a high level, the process finds any failed workflows by looking at state codes on the workflow object, deletes these failed workflows, and then starts a new workflow for any pending document that does not have one. The detailed code, which utilizes Documentum Foundation Classes (DFC), follows:
- Delete any halted workflows (note that when a running workflow fails, Documentum can put that workflow in a halted state).
We identify these workflow objects using a DQL query:
SELECT r_object_id FROM dm_workflow w WHERE w.r_runtime_state=3
We then iterate over the returned objects and abort the workflows using DFC:
// Load the workflow and abort it unless it is already deleted or terminated
IDfWorkflow workflow = (IDfWorkflow) session.getObject(new DfId(workflowId));
if (!workflow.isDeleted()
        && workflow.getRuntimeState() != IDfWorkflow.DF_WF_STATE_TERMINATED) {
    workflow.abort();
}
- Delete any workflow with a halted or failed activity. Note that this is different from the first item because the overall workflow runtime state can show as “running” even though there is a failed activity.
We identify these workflow objects using a DQL query:
SELECT r_object_id FROM dm_workflow w WHERE any w.r_act_state in (3,4)
We then iterate over the returned objects using the same abort logic described in the first step.
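Because the same abort loop serves both queries, it helps if one problematic workflow cannot stop the rest of the pass. The following is a minimal sketch of that idea in plain Java. WorkflowHandle is a hypothetical stand-in for the three IDfWorkflow calls used above, and InMemoryWorkflow exists only to exercise the loop. Neither is a DFC type, and this is an illustration rather than our production code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AbortLoopSketch {

    // Same value as DF_WF_STATE_TERMINATED in the runtime state list
    static final int STATE_TERMINATED = 4;

    /** Hypothetical stand-in for the IDfWorkflow calls used above; not a DFC type. */
    interface WorkflowHandle {
        String id();
        boolean isDeleted() throws Exception;
        int getRuntimeState() throws Exception;
        void abort() throws Exception;
    }

    /**
     * Tries to abort every workflow that is neither deleted nor terminated.
     * Returns the ids of workflows whose processing threw, so that they can be
     * logged and picked up again on the next polling cycle.
     */
    static List<String> abortAll(List<? extends WorkflowHandle> workflows) {
        List<String> failedIds = new ArrayList<>();
        for (WorkflowHandle workflow : workflows) {
            try {
                if (!workflow.isDeleted()
                        && workflow.getRuntimeState() != STATE_TERMINATED) {
                    workflow.abort();
                }
            } catch (Exception e) {
                // Isolate the failure: record it and continue with the next workflow
                failedIds.add(workflow.id());
            }
        }
        return failedIds;
    }

    /** In-memory implementation used only to demonstrate the loop. */
    static final class InMemoryWorkflow implements WorkflowHandle {
        private final String id;
        private final int runtimeState;
        private final boolean abortThrows;
        boolean aborted;

        InMemoryWorkflow(String id, int runtimeState, boolean abortThrows) {
            this.id = id;
            this.runtimeState = runtimeState;
            this.abortThrows = abortThrows;
        }

        public String id() { return id; }
        public boolean isDeleted() { return false; }
        public int getRuntimeState() { return runtimeState; }

        public void abort() throws Exception {
            if (abortThrows) {
                throw new Exception("simulated repository error");
            }
            aborted = true;
        }
    }

    public static void main(String[] args) {
        List<InMemoryWorkflow> workflows = Arrays.asList(
                new InMemoryWorkflow("wf1", 3, false),
                new InMemoryWorkflow("wf2", 3, true));
        System.out.println("Could not abort: " + abortAll(workflows));
    }
}
```

Returning the failed ids rather than throwing keeps the decision about logging or alerting with the caller.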
- Now that we have deleted any halted workflows, we find all the “pending” documents which don’t have an associated workflow and start one. This works for us because all of our documents go through a workflow and it is a straightforward way to cover the various failure scenarios. If this weren’t the case, we would have implemented logic to keep track of the documents for which workflows were deleted and act on those.
We utilize DQL queries executed via DFC to locate the documents in question:
SELECT DISTINCT r_object_id FROM dm_document d
WHERE ANY r_version_label = 'PENDING' AND r_object_id NOT IN
  (SELECT DISTINCT d.r_object_id FROM dmi_package p, dm_workflow w, dm_document d
   WHERE ANY r_component_id = d.r_object_id AND p.r_workflow_id = w.r_object_id AND w.r_runtime_state <= 1)
Note: This query locates documents with our ‘PENDING’ version label and then uses a sub-select to return only those which don’t have a workflow. We determine which documents have active workflows by matching the document’s r_object_id against the r_component_id repeating attribute on the dmi_package table, which we then join on r_workflow_id to the workflow table to limit results to active runtime states.
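To make the set logic of this query easier to follow, here is a small in-memory model of it in plain Java. It is only an illustration under simplifying assumptions: the PendingDocumentFinder class and its map-based inputs are hypothetical, each workflow is treated as having a single package, and nothing here touches a repository.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of the DQL above; not DFC code. */
final class PendingDocumentFinder {

    static List<String> findPendingWithoutLiveWorkflow(
            Map<String, Set<String>> versionLabelsByDocumentId,
            Map<String, Integer> runtimeStateByWorkflowId,
            Map<String, List<String>> componentIdsByWorkflowId) {

        // Sub-select: documents packaged in a workflow whose runtime state is <= 1.
        // That covers dormant (0) and running (1), and also unknown (-1).
        Set<String> coveredDocumentIds = new HashSet<>();
        for (Map.Entry<String, List<String>> pkg : componentIdsByWorkflowId.entrySet()) {
            Integer state = runtimeStateByWorkflowId.get(pkg.getKey());
            if (state != null && state <= 1) {
                coveredDocumentIds.addAll(pkg.getValue());
            }
        }

        // Outer query: PENDING documents that are not in the covered set
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : versionLabelsByDocumentId.entrySet()) {
            boolean pending = doc.getValue().contains("PENDING");
            if (pending && !coveredDocumentIds.contains(doc.getKey())) {
                result.add(doc.getKey());
            }
        }
        return result;
    }
}
```

In this model, a pending document whose only workflow is finished, halted, or terminated is returned, which is why the halted and failed workflows are cleaned up before this step runs.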
We then loop through each of these documents and execute DFC methods to start a new workflow. Below are the key API calls executed for each document. Note that there are also calls to utility methods which are shown after the main code block. We assume that your own logic provides the variables, methods, and values that are not defined here (for example session, documentId, supervisorName, and getProcessIdByName).
import com.documentum.fc.client.IDfActivity;
import com.documentum.fc.client.IDfProcess;
import com.documentum.fc.client.IDfWorkflow;
import com.documentum.fc.common.DfId;
import com.documentum.fc.common.DfList;
import com.documentum.fc.common.IDfId;
import com.documentum.fc.common.IDfList;

// Instantiate a workflow. Note that you can also use an "IDfWorkflowBuilder"
// object to simplify this, but our approach gives you more flexibility (e.g.
// to set the workflow name and supervisor name).
IDfWorkflow wf = (IDfWorkflow) session.newObject("dm_workflow");

// Set the workflow name
wf.setObjectName("workflow for document");

// Get the id for the associated process template.
// This example assumes you write some logic to get the process by name
// (e.g. select r_object_id from dm_process where object_name='xyz')
String processId = getProcessIdByName("some process template");

// Set the process id on the workflow
wf.setProcessId(new DfId(processId));

// Set the supervisor
wf.setSupervisorName(supervisorName);

// Save the workflow to the repository
wf.save();

// Now start the workflow as this doesn't happen by simply saving it.
// We don't set any "performers" because there is an automated activity in our
// workflow which does this. If you need it, use wf.setPerformers()
wf.execute();

// Even though we have executed this workflow, the start activity will not
// become active until we attach a package because our start activity
// has an "input port" which means it requires a package as input. So we
// create a package with a document...
IDfList objectIds = new DfList();
objectIds.append(new DfId(documentId));

// Get the starting activity
IDfActivity startActivity = getStartActivity(processId);

// Load the process object, which the name lookup below requires
IDfProcess process = (IDfProcess) session.getObject(new DfId(processId));

// Now look up the name for the activity rather than using the name on the
// IDfActivity object since it might be different due to
// a known idiosyncrasy
String startActivityName = getActivityName(process, startActivity);

// Add a package to the workflow to associate the document with it
IDfId pkgId = wf.addPackage(startActivityName,
        getInputPort(startActivity),
        getInputPackageName(startActivity),
        "dm_document",
        null,   // not using a "note"
        false,  // for our example
        objectIds);

// End of processing... we now have an active workflow!
The following utility method gets the starting activity for a workflow by looking for an activity type of “1” which indicates that it is a start activity.
public static IDfActivity getStartActivity(String processId) throws DfException {
    // "session" is assumed to be an IDfSession available to this class
    IDfProcess process = (IDfProcess) session.getObject(new DfId(processId));
    int activityCount = process.getActivityCount();
    // Iterate over the activities looking for the start activity
    for (int i = 0; i < activityCount; i++) {
        if (process.getActivityType(i) == 1) { // 1=start activity
            IDfId activityId = process.getActivityDefId(i);
            return (IDfActivity) process.getSession().getObject(activityId);
        }
    }
    // No start activity was found
    return null;
}
This utility method gets the name of the passed activity. Note that the Documentum workflow applications will not update the name stored on an existing activity object if the activity’s name is changed. Therefore, we always refer back to the process object to get the name.
public static String getActivityName(IDfProcess process, IDfActivity activity) throws DfException {
    if (activity == null) {
        throw new NullPointerException("activity must be non-null");
    }
    // Get the id of the activity
    IDfId activityId = activity.getObjectId();
    // Iterate over the activities looking for the specified id.
    // We do this rather than use the value on the activity because that
    // value may not be correct if the name has changed!
    int activityCount = process.getActivityCount();
    for (int i = 0; i < activityCount; i++) {
        IDfId id = process.getActivityDefId(i);
        // If the id is found, return the matching name
        if (activityId.equals(id)) {
            return process.getActivityName(i);
        }
    }
    // The activity is not used within the specified process. This is an error.
    throw new IllegalArgumentException("dm_activity '" + activityId
            + "' not member of dm_process '" + process.getObjectId() + "'");
}
This utility method gets the name of the “input port” for the start activity. In Documentum workflows, ports are logical points of information exchange between activities, and a port name must be provided to add a package to a workflow.
public static String getInputPort(IDfActivity activity) throws DfException {
    // Iterate over the ports looking for the input port
    for (int i = 0; i < activity.getPortCount(); i++) {
        if (activity.getPortType(i).equals("INPUT")) {
            return activity.getPortName(i);
        }
    }
    // If no input port is located, return null
    return null;
}
Finally, this utility method gets the name of the package for the start activity. This must also be provided to add a package to a workflow.
public static String getInputPackageName(IDfActivity activity) throws DfException {
    // Iterate over the ports looking for the input port
    for (int i = 0; i < activity.getPortCount(); i++) {
        if (activity.getPortType(i).equals("INPUT")) {
            // Get the name of the package associated with the input port
            return activity.getPackageName(i);
        }
    }
    // If no package is located, return null
    return null;
}
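To tie the pieces together, the periodic runner that drives these steps could look something like the following. This is a sketch using the standard java.util.concurrent scheduler rather than the in-house framework we actually use. The RecoveryPollerSketch class and its Step interface are hypothetical names, and each step is assumed to wrap one of the operations described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic runner for the recovery steps; not our actual framework. */
final class RecoveryPollerSketch {

    /** One recovery operation, e.g. "abort halted workflows". */
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    private final List<Step> steps;

    RecoveryPollerSketch(List<Step> steps) {
        this.steps = new ArrayList<>(steps);
    }

    /** Runs the steps in order. Returns false if the pass was abandoned; never throws. */
    boolean runOnce() {
        try {
            for (Step step : steps) {
                step.run();
            }
            return true;
        } catch (Exception e) {
            // Swallow and log: a task that throws would cancel all future scheduled runs
            System.err.println("Recovery pass failed, retrying next cycle: " + e);
            return false;
        }
    }

    /** Schedules a pass immediately and then again after each delay. */
    ScheduledExecutorService start(long delayMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::runOnce, 0, delayMinutes, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        List<Step> steps = new ArrayList<>();
        steps.add(() -> System.out.println("abort halted workflows"));
        steps.add(() -> System.out.println("abort workflows with failed activities"));
        steps.add(() -> System.out.println("start workflows for pending documents"));
        // Run a single pass here; a real deployment would call start(5) instead
        new RecoveryPollerSketch(steps).runOnce();
    }
}
```

A fixed delay is used here, rather than a fixed rate, so that a slow pass cannot overlap with the next one. Catching exceptions inside runOnce matters because the executor suppresses subsequent executions of a task that throws.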
Conclusion
We got into the gory details of Documentum workflows and looked at some DQL and DFC code, but the fundamental idea is simple: find any workflows which are in a bad state and restart them. These details are all necessitated by the need to make our system resilient to outages. If you rely on the Java Method Server as we do, hopefully these details will assist you in making your system more resilient too.
While this wouldn’t be necessary in an ideal world, the reality is that systems with complex integration points can and will have problems. Depending on the criticality of the system, you may want to consider investing in defensive strategies to minimize the impact of these problems.