Monday, August 1, 2011

What is parallelism in Ab Initio?

1) Component parallelism:
A graph with multiple processes running simultaneously on separate data uses component parallelism.

2) Data parallelism:
A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio software provides Partition Components to segment data, and Departition Components to merge segmented data back together.
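As a loose analogy only (plain shell, not Ab Initio components, and all file names here are illustrative), the partition / parallel-work / departition pattern can be pictured like this:

```shell
#!/bin/sh
# Shell analogy only -- Ab Initio's Partition and Departition components
# are far more capable. File names are illustrative.
printf '1\n2\n3\n4\n' > all.txt

# "Partition": split records round-robin into two segments
awk 'NR % 2 == 1' all.txt > part0.txt
awk 'NR % 2 == 0' all.txt > part1.txt

# Operate on each segment simultaneously (backgrounded processes)
sed 's/^/row-/' part0.txt > out0.txt &
sed 's/^/row-/' part1.txt > out1.txt &
wait

# "Departition": merge the segments back together
cat out0.txt out1.txt > merged.txt
```

Each backgrounded process works on its own segment, which is the essence of data parallelism.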

3) Pipeline parallelism:
A graph with multiple components running simultaneously on the same data uses pipeline parallelism.

Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel.
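A familiar way to picture this (a Unix pipe, not Ab Initio itself): each stage below starts reading from its upstream neighbor as soon as records are available, so all three processes are alive at the same time on the same stream of data.

```shell
#!/bin/sh
# Unix-pipe analogy of pipeline parallelism: sort and uniq begin
# consuming records as soon as upstream emits them; the three
# processes run concurrently on the same data stream.
printf 'b\na\nb\n' | sort | uniq -c
```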

NOTE: To limit the number of components running simultaneously, set phases in the graph.

I have a job that does the following: FTPs files from a remote server, reformats the data in those files and updates the database, then deletes the temporary files. How do we trap errors generated by Ab Initio when an FTP fails? If I have to re-run / restart a graph, what are the points to be considered? Does the *.rec file have anything to do with it?


Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks.

FTP in one phase, your transformation in the next phase, and the DB update in another phase. (This is just an example; it may not be the best approach, as the best design depends on various other factors.)

If the graph fails during FTP, it fails in Phase 0 and you can simply restart it. If the graph fails in Phase 1, the AB_JOB.rec file exists, and when you restart the graph you will see a message saying a recovery file exists, asking whether you want to start the graph from the last successful checkpoint or restart from the beginning. The same applies if it fails in Phase 2.
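A hypothetical wrapper-script check (the helper name and the <jobname>.rec naming convention here are assumptions, not an Ab Initio command) for telling ahead of time whether a rerun will be a fresh start or a checkpoint restart:

```shell
#!/bin/sh
# Hypothetical helper: report whether a rerun of the given job would be
# a fresh start or a checkpoint restart, based on whether its recovery
# file (<jobname>.rec) is present in the run directory.
check_recovery() {
    job="$1"
    if [ -f "${job}.rec" ]; then
        # recovery file exists: the restart will offer to resume
        # from the last successful checkpoint
        echo "restart"
    else
        # no recovery file: the run starts from the beginning
        echo "fresh"
    fi
}
```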

Phases are expensive from a disk I/O perspective, so be careful not to do too much phasing.

Coming back to error trapping: each component has reject, error, and log ports. The reject port captures rejected records, the error port captures the corresponding error messages, and the log port captures the component's execution statistics. You can control each component's reject behavior by setting the reject threshold to "Never Abort" or "Abort on first reject", or by setting a ramp/limit.

Recovery files keep track of crucial information for recovering the graph from a failed state, such as which node each component is executing on. It is a bad idea to simply delete the *.rec files; you always want to roll back the recovery files cleanly so that temporary files created during graph execution don't hang around, occupying disk space and creating issues.

Always use m_rollback -d, which rolls the failed job back cleanly and then deletes the recovery file.

I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable?

You can change the value of the AB_JOB variable in the start script of a given graph. This lets you run the same graph multiple times simultaneously (i.e., in parallel). However, make sure you append a unique identifier, such as a timestamp or sequence number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the runs from stepping on each other's outputs. I have used this technique to create a "utility" graph as a container for a start script that runs another graph multiple times depending on the local variable input to the "utility" graph. Be careful you don't max out the capacity of the server you are running on.
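A minimal start-script sketch of the technique described above. The job base name load_customers and the output path are illustrative assumptions:

```shell
#!/bin/sh
# Illustrative start-script fragment: append a timestamp so each
# concurrent run of the same graph gets a unique AB_JOB, and derive
# the output file name from it so runs don't overwrite each other.
AB_JOB="load_customers_$(date +%Y%m%d%H%M%S)"
export AB_JOB

OUTPUT_FILE="/tmp/${AB_JOB}.out"
export OUTPUT_FILE

echo "$AB_JOB"
```

A sequence number works just as well as a timestamp when the wrapper launches runs faster than once per second.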