Common Warehouse DataStage Interview Preparation Guide
Download PDF

Data Warehouse DataStage Frequently Asked Questions in various Data Warehouse DataStage Interviews asked by the interviewer. So learn Data Warehouse DataStage with the help of this Data Warehouse DataStage Interview questions and answers guide and feel free to comment as your suggestions, questions and answers on any Data Warehouse DataStage Interview Question or answer by the comment feature available on the page.

37 Warehouse DataStage Questions and Answers:

Table of Contents:

Common  Warehouse DataStage Job Interview Questions and Answers
Common Warehouse DataStage Job Interview Questions and Answers

1 :: Explain How to implement type2 slowly changing dimenstion in datastage?

Slow changing dimension is a common problem in Dataware housing. For example: There exists a customer called lisa in a company ABC and she lives in New York. Later she she moved to Florida. The company must modify her address now. In general 3 ways to solve this problem

Type 1: The new record replaces the original record, no trace of the old record at all, Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two different people. Type 3: The original record is modified to reflect the changes.

In Type1 the new one will over write the existing one that means no history is maintained, History of the person where she stayed last is lost, simple to use.

In Type2 New record is added, therefore both the original and the new record Will be present, the new record will get its own primary key, Advantage of using this type2 is, Historical information is maintained But size of the dimension table grows, storage and performance can become a concern.
Type2 should only be used if it is necessary for the data warehouse to track the historical changes.

In Type3 there will be 2 columns one to indicate the original value and the other to indicate the current value. example a new column will be added which shows the original address as New york and the current address as Florida. Helps in keeping some part of the history and table size is not increased. But one problem is when the customer moves from Florida to Texas the new york information is lost. so Type 3 should only be used if the changes will only occur for a finite number of time.

2 :: Explain What does a Config File in parallel extender consist of?

Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.


The APT_Configuration file is having the information of resource disk,node pool,and scratch information,

node information in the since it contains the how many nodes we given to run the jobs, because based on the nodes only data stage will create processors at back end while running the jobs,resource disk means this is the place where exactly jobs will be loading,scratch information will be useful whenever we using the lookups in the jobs

3 :: Explain What is job control?how it is developed?explain with steps?

Controlling Datstage jobs through some other Datastage jobs. Ex: Consider two Jobs XXX and YYY. The Job YYY can be executed from Job XXX by using Datastage macros in Routines.

To Execute one job from other job, following steps needs to be followed in Routines.

1. Attach job using DSAttachjob function.

2. Run the other job using DSRunjob function

3. Stop the job using DSStopJob function

4 :: Explain How to handle the rejected rows in datastage?

We can handle rejected rows in two ways with help of Constraints in a Tansformer.1) By Putting on the Rejected cell where we will be writing our constarints in the properties of the Transformer2)Use REJECTED in the expression editor of the ConstraintCreate a hash file as a temporory storage for rejected rows. Create a link and use it as one of the output of the transformer. Apply either ofthe two stpes above said on that Link. All the rows which are rejected by all the constraints will go to the Hash File.

5 :: Explain Dimensional modelling is again sub divided into 2 types?

a) Star Schema - Simple & Much Faster. Denormalized form.
b) Snowflake Schema - Complex with more Granularity. More normalized form.

6 :: What is a project? Specify its various components?

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

DataStage jobs.

Built-in components. These are predefined components used in a job.

User-defined components. These are customized components created using the DataStage Manager or DataStage Designer

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

7 :: What is differences between Oracle8i/9i?

Mutliproceesing,databases more dimesnionsal modeling

8 :: How to eliminate duplicate rows?

The Duplicates can be eliminated by loading thecorresponding data in the Hash file. Specify the columns on which u want to eliminate as the keys of hash.

you can delete duplicate records from source itself using datastage taking option as userdefined query, instead of taking table read option. and u can use remove duplicate stage in datastage. and using of hashfile as source also based on the hash key.

9 :: How good are you with your PL/SQL please explain?

You will not be writtinf pl/sql in datastage ! sql knowledge is enough ...

10 :: It is possible to access the same job two users at a time in datastage?

No it is not possible for the same job to be accessed by two users, because if one of the job is being used by one user then if the second one tries to open the same job then you get the error as "Job (jobname) is being accessed by another user"

11 :: What is DS Administrator used for - did u use it?

The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales

13 :: What is DS Director used for - did u use it?

Datastage Director is GUI to monitor, run, validate & schedule datastage server jobs.

14 :: How to pass parameters to job by using file?

You can do this, by passing parameters from unix file, and then calling the execution of a datastage job. the ds job has the parameters defined (which are passed by unix)

15 :: Explain How many jobs have you created in your last project?

100+ jobs for every 6 months if you are in Development, if you are in testing 40 jobs for every 6 months although it need not be the same number for everybody

16 :: Explain What is the flow of loading data into fact & dimensional tables?

Here is the sequence of loading a datawarehouse.

1. The source data is first loading into the staging area, where data cleansing takes place.

2. The data from staging area is then loaded into dimensions/lookups.

3.Finally the Fact tables are loaded from the corresponding source tables from the staging area.

18 :: Explain Containers Usage and Types?

Container is a collection of stages used for the purpose of Reusability. There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project. ?
There are two types of shared container:?
1.Server shared container. Used in server jobs (can also be used in parallel jobs).?
2.Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).regardsjagan

19 :: Explain What happens out put of hash file is connected to transformer ..
What error it through?

If Hash file output is connected to transformer stage the hash file will consider as the Lookup file if there is no primary link to the same Transformer stage, if there is no primary link then this will treat as primary link itself. you can do SCD in server job by using Lookup functionality. This will not return any error code.

20 :: Explain the difference between drs and odbc stage?

To answer your question the DRS stage should be faster then the ODBC stage as it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

Dynamic Relational Stage was leveraged for Peoplesoft to have a job to run on any of the supported databases. It supports ODBC connections too. Read more of that in the plug-in documentation.

ODBC uses the ODBC driver for a particular database, DRS is a stage that tries to make it seamless for switching from one database to another. It uses the native connectivities for the chosen target

21 :: Explain how to improve the job performance?

by using partition technics we can improve the performance
like
hash,modules,range,random etc

22 :: Explain It is possible to run parallel jobs in server jobs?

No, It is not possible to run Parallel jobs in server jobs. But Server jobs can be executed in Parallel jobs

23 :: Explain What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?

Minimise the usage of Transformer (Instead of this use Copy, modify, Filter, Row Generator)
Use SQL Code while extracting the data
Handle the nulls
Minimise the warnings
Reduce the number of lookups in a job design
Use not more than 20stages in a job
Use IPC stage between two passive stages Reduces processing time
Drop indexes before data loading and recreate after loading data into tables
Gen'll we cannot avoid no of lookups if our requirements to do lookups compulsory.
There is no limit for no of stages like 20 or 30 but we can break the job into small jobs then we use dataset Stages to store the data.
IPC Stage that is provided in Server Jobs not in Parallel Jobs
Check the write cache of Hash file. If the same hash file is used for Look up and as well as target, disable this Option.
If the hash file is used only for lookup then "enable Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.
Use Preload to memory option in the hash file output.
Use Write to cache in the hash file input.
Write into the error tables only after all the transformer stages.
Reduce the width of the input record - remove the columns that you would not use.
Cache the hash files you are reading from and writting into. Make sure your cache is big enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
This would also minimize overflow on the hash file.

If possible, break the input into multiple threads and run multiple instances of the job.
Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance also for data recovery in case job aborts.
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs
Removed the data not used from the source as early as possible in the job.
Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries
Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution of the jobs.
If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel.
Before writing a routine or a transform, make sure that there is not the functionality required in one of the standard routines supplied in the sdk or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the database.
Using a constraint to filter a record set is much slower than performing a SELECT ? WHERE?.
Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

24 :: Explain What are Sequencers?

A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers.The sequencer operates in two modes:ALL mode. In this mode all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire.ANY mode. In this mode, output triggers can be fired if any of the sequencer inputs are TRUE

25 :: How to create Containers?

There are Two types of containers

1.Local Container

2.Shared Container

Local container is available for that particular Job only.

Where as Shared Containers can be used any where in the project.

Local container:

Step1:Select the stages required

Step2:Edit>ConstructContainer>Local

SharedContainer:

Step1:Select the stages required

Step2:Edit>ConstructContainer>Shared

Shared containers are stored in the SharedContainers branch of the Tree Structure
Warehouse DataStage Interview Questions and Answers
37 Warehouse DataStage Interview Questions and Answers