Datastage Interview Preparation:
DataStage is a very popular ETL tool which was available in the current market.
In this article, I am sharing a set of very useful question-answers intended for IBM Datastage interviews. Going through below Datastage Interview Questions can make it easy for you to crack the interview.
Most Common Datastage interview Questions and Answers
We have covered detailed answers to the Datastage interview Questions which will be helpful to freshers and experienced professionals.
Recommended read => ETL testing interview questions
Q #1) What is Datastage?
It is available in following 3 different editions
- Server Edition
- Enterprise Edition
- MVS Edition
Q #2) Highlight the main features of Datastage?
The main features of Datastage are highlighted below
- It is the data integration component of IBM Infosphere information server.
- It is a GUI based tool.We just need to drag and drop the Datastage objects and we can convert it to Datastage code.
- It is used to perform the ETL operations (Extract, transform, load)
- It provides connectivity to multiple sources & multiple targets at the same time
- Provides partitioning and parallel processing techniques which enable the Datastage jobs to process a huge volume of data quite faster.
- It has enterprise-level connectivity.
Q #3) What are the primary usages of Datastage tool?
Datastage is an ETL tool which is primarily used for extracting data from source systems, transforming that data and finally loading it to target systems.
Q #4) What are the main differences you have observed between 7.x and 8.x version of DataStage?
Here are the main differences between both the versions
|7.x version was platform dependent||This version is platform independent|
|It has 2-tier architecture where datastage is built on top of Unix server||It has 3-tier architecture where we have UNIX server database at the bottom then XMETA database which acts as a repositorty and then we have datastage on top.|
|There is no concept of parameter set||We have parameter sets which can be used anywhere in the project.|
|We had designer and manager as two separate clients||In this version, the manager client was merged into designer client|
|We had to manually search for the jobs in this version||Here we have quick find option in the repository where we can search easily for the jobs.|
Q #5) Can you highlight the main features of IBM Infosphere information server?
Below are the main features of IBM Infosphere information server suite
- It provides a single platform for data integration.It has the capability to connect to multiple source systems as well as write to multiple target systems.
- It is based on centralized layers.All the components of the suite are able to share the baseline architecture of the suite.
- It has layers for the unified repository, for integrated metadata services and common parallel engine.
- It provides tools for analysis, cleansing, monitoring, transforming and delivering data.
- It has massively parallel processing capabilities.It turns out the processing to be very fast.
Q #6) What are the different layers in the information server architecture?
Below are the different layers of information server architecture
- Unified user interface
- Common services
- Unified parallel processing
- Unified Metadata
- Common connectivity
Q #7) What could be a data source system?
It could be a database table, a flat file, or even an external application like people soft.
Q #8) On which interface you will be working as a developer?
As a Datastage developer, we work on Datastage client interface which is known as a Datastage designer that needs to be installed on the local system. In the backend, it is connected to the Datastage server.
Q #9) What are the different common services in Datastage?
Below is the list of common services in Datastage
- Metadata services
- Unified service deployment
- Security services
- Looping and reporting services.
Q #10) How do you start developing a Datastage project?
The very first step is to create a Datastage job on the Datastage server. All the Datastage objects that we create are stored in the Datastage project. A Datastage project is separated environment on the server for jobs, tables, definitions, and routines.
A Datastage project is separated environment on the server for jobs, tables, definitions, and routines.
Q #11) What is a DataStage job?
The Datastage job is simply a DataStage code that we create as a developer. It contains different stages linked together to define data and process flow.
Stages are nothing but the functionalities that get implemented.
For example: Let’s assume that I want to do a sum of the sales amount. This can be a ‘group by’ operation that will be performed by one stage.
Now, I want to write the result to a target file. So, this operation will be performed by another stage. Once, I have defined both the stages, I need to define the data flow from my ‘group by’ stage to the target file stage. This data flow is defined by DataStage links.
Once, I have defined both the stages, I need to define the data flow from my ‘group by’ stage to the target file stage. This data flow is defined by DataStage links.
Q #12) What are DataStage sequences?
Datastage sequence connects the DataStage jobs in a logical flow.
Q #13) If you want to use a same piece of code in different jobs, how will you achieve this?
This can be done by using shared containers. We have shared containers for reusability. A shared container is a reusable job element consisting of stages and links. We can call a shared container in different Datastage jobs.
Q #14) Where do the Datastage jobs get stored?
The Datastage jobs get stored in the repository. We have various folders in which we can store the Datastage jobs.
Q #15) Where do you see different stages in the designer?
All the stages are available within a window called as ‘Palette’. It has various categories depending upon the kind of function that the stage provides.
The various categories of stages in the Palette are – General, Data Quality, Database, Development, File, Processing, etc.
Q #16) What are the Processing stages?
The processing stages allow us to apply the actual data transformation.
For example, The ‘aggregator’ stage under Processing category allows us to apply all the ‘group by’ operations. Similarly, we have other stages in Processing like the ‘Join’ stage that allows us to join together the data coming from two different input streams.
Q #17) What are the steps needed to create a simple basic Datastage job?
Click on the File -> Click on New -> Select Parallel Job and hit Ok. A parallel job window will open up. In this Parallel job, we can put together different stages and define the data flow between them. The simplest DataStage job is an ETL job.
In this, we first need to extract the data from the source system for which we can use either a file stage or database stage because my source system can either be a database table or a file.
Suppose we are reading data from a text file. In this case, we will drag and drop the ‘Sequential File’ stage to the parallel job window. Now, we need to perform some transformation on top of this data.
We will use ‘Transformer’ stage which is available under the Processing category. We can write any logic under the Transformer stage.
Finally, we need to load the processed data to some target table. Let’s say my target database is DB2. So, for this, we will select DB2 connector stage. Then we will be connecting these data states through sequential links.
After this, we need to configure the stages so that they point to the correct filesystem or database.
For example: For the Sequential file stage, we need to define the mandatory parameters like the file name, file location, column metadata.
Then we need to compile the Datastage job. Compiling the job checks for the syntax of the job and creates an executable file for the Datastage job that can be executed at run time.
Q #18) Name the different sorting methods in Datastage.
There are two methods available
- Link sort
- Inbuilt Datastage Sort
Q #19) In a batch if a job fails in between and you want to restart the batch from that particular job and not from the scratch then what will you do?
In Datastage, there is an option in job sequence – ‘Add checkpoints so the sequence is restart-able on failure’. If this option is checked, then we can rerun the job sequence from the point where it failed.
Q #20) How do you import and export the Datastage jobs?
We have below command line functions for this
- Import: dsimport.exe
- Export: dsexport.exe
Q #21) What are routines in Datastage? Enlist various types of routines.
Routine is a set of functions that are defined by DS manager. It is run via transformer stage.
There are 3 kinds of routines
- Parallel routines
- Mainframe routines
- Server routines
Q #22) How do you remove duplicate values in DataStage?
There are two ways to handle duplicate values
- We can use remove duplicate stage to eliminate duplicates.
- We can use Sort stage to remove duplicate. The sort stage has a property called ‘allow duplicates’. We won’t get duplicate values in the output of sort on setting this property equal to false.
Q #23) What are the different kinds of views available in a Datastage director?
There are 3 kinds of views available in Datastage director
- Log view
- Status view
- Job view
Q #24) Distinguish between Informatica & Datastage. Which one would you choose and why?
Both Informatica and DataStage are powerful ETL tools.
Below are some points differentiating between both the tools:
|Parallel Processing||Informatica does not support parallel processing.||In contrast to this, datastage provides mechanism for parallel processing.|
|Implementing SCDs||It is quite simple to implement SCDs (Slowly changing dimensions) in Informatica.||However, it is complex to implement SCDs in datastage. Datastage supports SCDs merely through custom scripts.|
|Version Control||Informatica supports version controlling through check-in and check-out of objects.||However, we don’t have this functionality available in datastage.|
|Available Transformations||Lesser transformations are available.||Datastage offers more variety of transformations than Informatica.|
|Power of lookup||Informatica provides very powerful dynamic cache lookup||We don’t have any similar thing in datastage.|
In my personal opinion, I would go with Informatica over Datastage. The reason being I have found Informatica more systematic and user-friendly than DataStage.
Another strong reason is that debugging and error handling is much better in Informatica as compared to Datastage. So, fixing issues become easier in Informatica. Datastage does not provide complete error handling support.
=> Want to learn more about Informatica? We have a detailed explanation here.
Q #25) Give an idea of system variables.
System variables are the read-only variables beginning with ‘@’ which can be read by either transformer stage or routine. They are used to get the system information.
Q #26) What is the difference between passive stage and active stage?
Passive stages are utilized for extraction and loading whereas active stages are utilized for transformation.
Q #27) What are the various kinds of containers available in Datastage?
We have below 2 containers in Datastage
- Local container
- Shared container
Q #28) Is the value of staging variable stored temporarily or permanently?
Temporarily. It is a temporary variable.
Q #29) what are the different type of jobs in Datastage?
We have two types of jobs in Datastage
- Server jobs (They run in a sequential manner)
- Parallel jobs (They get executed in a parallel way)
Q #30) What is the use of Datastage director?
Through Datastage director, we can schedule a job, validate the job, execute the job and monitor the job.
Q #31) What are the various kinds of the hash file?
We have 2 type of hash files
- Static hash file
- Dynamic hash file
Q #32) What is a quality stage?
Quality stage (also called as integrity stage) is a stage that aids in combining the data together coming from different sources.
You should carry a handy knowledge on Datastage architecture, its main features and you should be able to explain how it is different from some other popular ETL tools.
Additionally, you should have a fair idea on different stages & their usage, end- to- end way of creating a Datastage job & running it.
Also read => What is ETL Testing?
All the best!