Monday, August 1, 2011

What is parallelism in Ab Initio?

1) Component parallelism:
A graph with multiple processes running simultaneously on separate data uses component parallelism.

2) Data parallelism:
A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio software provides Partition Components to segment data, and Departition Components to merge segmented data back together.
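As a rough Unix analogy (not Ab Initio itself), the partition / parallel work / departition pattern looks like this:

    # partition: split the input into 4 line-aligned segments (GNU split)
    split -n l/4 big_input.dat part_
    # work: process each segment simultaneously
    for p in part_a*; do sort "$p" > "$p.sorted" & done
    wait
    # departition: merge the processed segments back together
    sort -m part_a*.sorted > sorted_output.dat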

3) Pipeline parallelism:
A graph with multiple components running simultaneously on the same data uses pipeline parallelism.

Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel.
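As a rough Unix-pipe analogy (again, not Ab Initio itself), the three stages below all run concurrently, each consuming records as soon as the previous stage emits them:

    # decompress | filter | aggregate -- three processes in one pipeline
    gunzip -c sales.dat.gz | grep '|US|' | awk -F'|' '{ sum += $3 } END { print sum }'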

NOTE: To limit the number of components running simultaneously, set phases in the graph.

I have a job that does the following: FTPs files from a remote server, reformats the data in those files and updates the database, then deletes the temporary files. How do we trap errors generated by Ab Initio when an FTP fails? If I have to re-run or restart a graph, what points should be considered? Does the *.rec file have anything to do with it?


Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks.

FTP in phase 0, your transformation in the next phase, and then the DB update in another phase. (This is just an example and may not be the best way of doing it; the best design depends on various other factors.)

If the graph fails during the FTP, it fails in phase 0 and you can simply restart the graph. If it fails in phase 1, an AB_JOB.rec file exists, and when you restart the graph you will see a message saying a recovery file exists and asking whether you want to start from the last successful checkpoint or restart from the beginning. The same applies if it fails in phase 2.

Phases are expensive from a disk I/O perspective, so be careful not to do too much phasing.

Coming back to error trapping: each component has reject, error, and log ports. The reject port captures rejected records, the error port captures the corresponding error messages, and the log port captures the execution statistics of the component. You can control the reject behavior of each component by setting the reject threshold to "Never Abort" or "Abort on first reject", or by setting a ramp/limit.

Recovery files keep track of crucial information for recovering a graph from a failed state, such as which node each component is executing on. It is a bad idea to simply remove the *.rec files; you always want to roll back the recovery files cleanly so that temporary files created during graph execution don't hang around, occupying disk space and causing issues.

Always use m_rollback -d.
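A minimal sketch of what a cleanup step might look like, assuming the recovery file sits in the sandbox run directory and is named after the job (both are assumptions; adjust to your environment):

    # hypothetical cleanup step -- recovery file location and name are assumptions
    REC_FILE="$AI_RUN/${AB_JOB}.rec"
    if [ -f "$REC_FILE" ]; then
        # roll back the failed run cleanly and delete its temporary files
        m_rollback -d "$REC_FILE"
    fi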

I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable?

You can change the value of the AB_JOB variable in the start script of a given graph. This enables you to run the same graph multiple times at the same time (that is, in parallel). However, make sure you append some unique identifier, such as a timestamp or a sequential number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the graphs from stepping on each other's outputs. I have used this technique to create a "utility" graph as a container for a start script that runs another graph multiple times depending on the local variable input to the "utility" graph. Be careful you don't max out the capacity of the server you are running on.
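A minimal start-script sketch of the uniquifying step (the suffix scheme is just one illustrative choice):

    # make AB_JOB unique per run: a timestamp plus the process ID
    # keeps concurrent runs from colliding
    export AB_JOB="${AB_JOB}_$(date +%Y%m%d%H%M%S)_$$"

You would vary the output file names the same way, for example by embedding $AB_JOB in the output path parameter.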

How can I tune a graph so it does not excessively consume CPU?


1. Reduce the DOP (degree of parallelism) of components.

Example: change from a 4-way parallel layout to a 2-way parallel layout.

2. Examine each transformation for inefficiencies.

Examples:
- If a transformation uses many local variables, make these variables global, so they are allocated once rather than per record.
- If the same function call is performed more than once, call it once and store its value in a global variable (see the sketch after this list).

3. When reading data, reduce the amount of data that needs to be carried forward to the next component.
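The second example above is a global variable in the transform's DML; as a rough shell-flavored illustration of the same call-once idea:

    # wasteful: re-evaluates the date command for every file
    for f in input_*.dat; do
        mv "$f" "$f.$(date +%Y%m%d)"
    done

    # better: call once, store the value, reuse it
    run_date=$(date +%Y%m%d)
    for f in input_*.dat; do
        mv "$f" "$f.$run_date"
    done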

What is an Ad hoc multifile? How is it used?


Ad hoc multifiles treat several serial files having the same record format as a single graph component.

Frequently, the input of a graph consists of a set of serial files, all of which have to be processed as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single input file component in the graph. Moreover, the set of files used by the component can be determined at runtime. This lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production.

Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.

The simplest way to define an Ad hoc multifile is to list the files explicitly as follows:

1. Insert an input file component in your graph.
2. Open the Properties dialog and select the Description tab.
3. Select Partitions in the Data Location of the Description tab.
4. Click Edit to open the Define multifile Partitions dialog box.
5. Click New and enter the first file name; click New again and enter the second file name, and so on.
6. Click OK.

If you have added 'n' files, the input file component now acts something like a file in an n-way multifile system, whose data partitions are the n files you listed. Components can run in the layout of the input file component. However, there is no way to run commands such as m_ls or m_dump on the files, because they do not constitute a real multifile system.

There are other ways to define an Ad hoc multifile than listing the input files explicitly:

1. Listing files using wildcards. If the input file names share a common pattern, you can use a wildcard to match them all, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All files found at runtime that match the wildcard pattern are taken for the Ad hoc multifile.

2. Listing files in a variable. You can create a runtime parameter for the graph and, inside the parameter, list all the files separated by spaces (see the first sketch after this list).

3. Listing files using a command, e.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the Ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use more complex commands that filter on, say, file owner or timestamp (see the second sketch after this list).
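For method 2, a minimal sketch, assuming a graph parameter named AD_HOC_FILES (the name is illustrative) that the input file component's partition list resolves to:

    # set at run time, e.g. in a wrapper script before the graph is run
    export AD_HOC_FILES="$AI_SERIAL/ad_hoc_input_01.dat $AI_SERIAL/ad_hoc_input_02.dat"

For method 3, a sketch that picks up only files modified within the last 24 hours (the name pattern is illustrative):

    # select only recently modified input files
    $(find $AI_SERIAL -name 'ad_hoc_input_*.dat' -mtime -1)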