Saturday, March 29, 2008

INFORMATICA ARCHITECTURE

Informatica provides the following components:
PowerCenter Repository
PowerCenter Repository Server
PowerCenter Client
PowerCenter Server

PowerCenter Repository: It maintains the metadata. A set of tables is created within the repository database to hold the Informatica metadata. The PowerCenter Client and Server access the repository to read and write this metadata.

PowerCenter Repository Server: It manages connections to the repository from client applications.

PowerCenter Client: This is used to maintain PowerCenter objects such as sources, targets, mappings, etc. It includes the following client tools:
Designer: To create mappings that contain transformation instructions for the PowerCenter Server.
Workflow Manager: To create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data.
Repository Manager: To administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.
Repository Server Administration Console: To administer the Repository Servers and repositories.
Workflow Monitor: To monitor scheduled and running workflows for each PowerCenter Server.

PowerCenter Server: With the help of the repository and the Repository Server, the PowerCenter Server executes the business logic for extraction, transformation, and loading.

What is a Repository?
The PowerCenter repository is maintained in an RDBMS. It contains the instructions required to extract, transform, and load data. PowerCenter accesses the repository through the Repository Server, and metadata is added to the repository using the PowerCenter client tools. There are two types of repository:
Global Repository: Objects created in a global repository can be accessed across different repositories.
Local Repository: Objects created in a local repository can only be accessed within that repository. A global repository is registered with local repositories so that its objects can be accessed in a local repository using shortcuts.

Error Handling Approach

Error handling is one of the most important components of any data warehouse or data integration project. If the process cannot handle and manage errors efficiently, it is very rare that the project will succeed. The following questions should be answered:


1. Who is the business owner of the process/project?
2. Who will be responsible for error correction?
3. What will be the interface for presenting errors to the business community?
4. Will errors be corrected at the source, or will IT be responsible for re-running error correction?
5. What is the retention period for records in the ERROR table?

The above are a few points to consider before deciding on an error management process. Any ETL tool will provide features to implement error management, and sometimes we go outside the ETL tool for the presentation and correction of errors.
Informatica version 7.x onward provides a feature that captures error information at the transformation level, along with the source record, in an error table or error file. These tables or files can be reprocessed to present the error records to the business community.
But in my opinion it is always better to have a common error management process for all data integration projects (independent of the tool). Any ETL tool can support that error management process.
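As an illustration of such a tool-independent process, error records can land in a custom error table that the error-review interface queries. A minimal SQL sketch, where ETL_ERROR and all of its columns are hypothetical names chosen for this example:

    -- Hypothetical common error table queried by the error-review interface.
    SELECT error_id,
           source_system,
           error_message,
           source_record,
           load_timestamp
    FROM   etl_error
    WHERE  corrected_flag = 'N'
    ORDER  BY load_timestamp;

Records older than the agreed retention period can then be purged with a simple DELETE driven by load_timestamp.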

Without using Aggregator transformation

Without using an Aggregator transformation, how do you find
sum(sal),
min(sal),
max(sal),
count(sal),
and avg(sal) in a relational table as well as in a flat file?
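For a relational source, one common approach is to push the aggregation into the database through a SQL override in the Source Qualifier. A minimal sketch, assuming an EMP table with a numeric SAL column (for a flat file, where SQL overrides are unavailable, the same totals can be accumulated in an Expression transformation using variable ports):

    -- Possible Source Qualifier SQL override (EMP/SAL are assumed names).
    SELECT SUM(sal)   AS sum_sal,
           MIN(sal)   AS min_sal,
           MAX(sal)   AS max_sal,
           COUNT(sal) AS count_sal,
           AVG(sal)   AS avg_sal
    FROM   emp;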

Mapping Scenario

Hi Friends,
This is another scenario similar to the previous one.
Suppose the source table has values like this
10 A
10 B
10 C
20 X
20 Y
20 Z
30 D
30 E
30 F

And the target should be
10 A*B*C
20 X*Y*Z
30 D*E*F

How will you do the mapping?

Answer: This can be done using an Expression transformation followed by an Aggregator transformation.
In the Expression transformation, declare two variable ports, NAME_V and ID_V, in this order:
NAME_V = IIF(ID_V = ID, NAME_V || '*' || NAME, NAME)
ID_V = IIF(ID_V != ID, ID, ID_V)
Add an output port that returns NAME_V, then pass the ports to an Aggregator that groups by ID (which returns the last row of each group) and connect it to the target.
This will work.
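As a side note, the same result can be produced directly in SQL on later Oracle releases (11g R2 onward) using LISTAGG; a minimal sketch, assuming the source is a table SRC with columns ID and NAME:

    -- SQL equivalent of the mapping (SRC is an assumed table name).
    SELECT id,
           LISTAGG(name, '*') WITHIN GROUP (ORDER BY name) AS names
    FROM   src
    GROUP  BY id;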

Wednesday, March 26, 2008

Mapping Scenario

Hi all,
My source table has 2 columns as follows
10 A
10 B
10 C
20 L
20 M
20 N
30 D
30 E
In the target, the data should be as follows:
10 ABC
20 LMN
30 DE
Any suggestions?

Sunday, March 23, 2008

Oracle sequence generator and informatica sequence generator

I have a task in which I need to take a flat file as my source, use Oracle's sequence generator for a column, and store the result in a target flat file with the sequence numbers included.

Example: Source flat file
abc 1000
xyz 2000
def 3000

Target flat file
1 abc 1000
2 xyz 2000
3 def 3000
Note: The sequence numbers should come only from Oracle (not from Informatica's Sequence Generator transformation).
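One way to satisfy this requirement is to create an Oracle sequence plus a small wrapper function, and call the function from the mapping through a Stored Procedure transformation so that every row gets its number from Oracle. A minimal sketch; FF_SEQ and GET_FF_SEQ are illustrative names:

    -- Oracle-side objects (names are illustrative).
    CREATE SEQUENCE ff_seq START WITH 1 INCREMENT BY 1;

    CREATE OR REPLACE FUNCTION get_ff_seq RETURN NUMBER
    IS
      v_next NUMBER;
    BEGIN
      SELECT ff_seq.NEXTVAL INTO v_next FROM dual;
      RETURN v_next;
    END;
    /

In the mapping, import GET_FF_SEQ as a Stored Procedure transformation and connect its return value to the sequence-number column of the target file.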

Saturday, March 22, 2008

Do's and Don'ts while using Sorted Input in Aggregator transformation

In general, follow this checklist to ensure that you are handling the Aggregator with sorted input correctly:
1. Do not use sorted input if any of the following conditions are true:
a. The aggregate expression uses nested aggregate functions.
b. The session uses incremental aggregation.
c. The input data is data driven, i.e. you select Data Driven for the Treat Source Rows As session property, or an Update Strategy transformation appears before the Aggregator transformation in the mapping.

If you use sorted input under these circumstances, the Informatica Server reverts to default aggregate behavior, reading all values before performing aggregate calculations.

Remember, when you are using sorted input, you must pass sorted data through the Aggregator.
Data must be sorted as follows:
1. By the Aggregator group by ports, in the order they appear in the Aggregator transformation.
2. Using the same sort order configured for the session.

If data is not in strict ascending or descending order based on the session sort order, the Informatica Server fails the session.
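As a concrete illustration, if the Aggregator groups by DEPTNO and then JOB (in that port order), the data feeding it must already be ordered on exactly those columns. A minimal sketch of a Source Qualifier SQL override that pre-sorts the data, where EMP and its columns are assumed names:

    -- Pre-sort for an Aggregator grouping by DEPTNO, then JOB (assumed ports).
    SELECT deptno, job, sal
    FROM   emp
    ORDER  BY deptno, job;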

Friday, March 21, 2008

Why staging area is needed?

Unlike OLTP systems that create their own data through a user interface, data warehouses source their data from other systems.
There is physical data movement from the source database to the data warehouse database.
The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse.

The staging area serves many purposes above and beyond this primary function:
a. The data is most consistent with the source; it is devoid of any transformation or has only minor format changes.
b. The staging area in a relational database can be read, scanned, and queried using SQL without needing to log into the source system or read files (text/XML/binary).
c. It is a prime location for validating data quality from the source, and for auditing and tracking down data issues.
d. The staging area acts as a repository for historical data if it is not truncated.
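Point (c) is straightforward to put into practice. A minimal sketch of a staging audit query, where STG_CUSTOMER and its columns are hypothetical names:

    -- Hypothetical audit: staged rows that fail basic quality checks.
    SELECT *
    FROM   stg_customer
    WHERE  customer_id IS NULL
       OR  TRIM(customer_name) IS NULL;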

Query to display Rank


Hi, suppose we have a table with the fields

Rollno Name Marks
1 A 95
2 B 90
3 C 75
4 D 80
5 E 85

Write a query to display rollno, name, marks, and rank in ascending order of rank.
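One possible answer, assuming an Oracle database with analytic function support, that higher marks mean a better (lower) rank, and an assumed table name STUDENTS:

    -- Rank students by marks (STUDENTS is an assumed table name).
    SELECT rollno,
           name,
           marks,
           RANK() OVER (ORDER BY marks DESC) AS rnk
    FROM   students
    ORDER  BY rnk;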

Wednesday, March 19, 2008

Backup of mappings?

Hi,
Could anyone tell me how to take a backup of all the mappings in a repository? Is it the job of an ETL developer?

What is the RANKINDEX port in the Rank transformation?

What is sequence generator transformation?

The Sequence Generator transformation generates numeric values. We can use it to create unique primary key values, replace missing keys, or cycle through a sequential range of numbers.

Can anyone explain real-time complex mappings or complex transformations in Informatica?

What is Update Strategy transformation?

When we design our data warehouse, we need to decide what type of information to store in targets. As part of our target table design, we need to determine whether to maintain all the historic data or just the most recent changes. The model we choose constitutes our update strategy, which tells us how to handle changes to existing records.

Update Strategy flags a record for insert, update, delete, or reject. We use this transformation when we want to exert fine control over updates to a target, based on certain conditions that we apply.

For example, when the mailing address of a customer has changed and we have to update it, we can use Update Strategy. Also, for people no longer working in a company, we can flag all their employee records for reject.
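For intuition, the insert-versus-update part of this flagging corresponds to what a relational MERGE statement does. A minimal sketch of that analogy (not the transformation itself), where CUSTOMER_DIM and CUSTOMER_STG are hypothetical tables:

    -- SQL analogue of flagging rows for update or insert (assumed names).
    MERGE INTO customer_dim d
    USING customer_stg s
    ON (d.customer_id = s.customer_id)
    WHEN MATCHED THEN
      UPDATE SET d.mailing_address = s.mailing_address
    WHEN NOT MATCHED THEN
      INSERT (customer_id, mailing_address)
      VALUES (s.customer_id, s.mailing_address);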

Sunday, March 16, 2008

Load Distinct and Duplicate values to different targets

I have a scenario in which the source consists of 2 columns

10 A
10 A
20 C
30 D
40 E
20 C

and there are 2 targets one to load distinct values and another one to load duplicate values.
T1
10 A
20 C
30 D
40 E

T2
10 A
20 C
How to load?
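One mapping-level answer is an Aggregator that counts rows per key followed by a Router. In SQL terms, the split looks like this; a minimal sketch, assuming a source table SRC with columns COL1 and COL2:

    -- T1: every distinct value.
    SELECT DISTINCT col1, col2 FROM src;

    -- T2: values that occur more than once.
    SELECT col1, col2
    FROM   src
    GROUP  BY col1, col2
    HAVING COUNT(*) > 1;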

Informatica Scenario

I have a flat file source containing 100 records. If there are three targets, how do you populate each of them with 100 records?

What is a Materialized View? What is the significance of Materialized Views in data warehousing?
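For reference, a minimal Oracle sketch of creating one, where SALES_MV and the underlying SALES table are illustrative names:

    -- A simple materialized view (names are illustrative).
    CREATE MATERIALIZED VIEW sales_mv
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT product_id, SUM(amount) AS total_amount
    FROM   sales
    GROUP  BY product_id;

Because the aggregated result is stored physically and refreshed on demand or on a schedule, warehouse queries can read it instead of re-aggregating the detail rows each time.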

What is a View? What are the advantages of having a view?

Load the remaining rows

Suppose there are 100,000 rows in the source and 20,000 rows have been loaded to the target. If the session stops after loading those 20,000 rows, how will you load the remaining rows?

What is a staging area? Do we need it? What is the purpose of a staging area?

A staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or the fact tables. We basically need a staging area to hold the data and to perform data cleansing and merging before loading the data into the warehouse. In the absence of a staging area, the data load would have to go from the OLTP system to the OLAP system directly, which can severely hamper the performance of the OLTP system. This is the primary reason for the existence of a staging area. In addition, it also offers a platform for carrying out data cleansing.

Why we use Stored Procedure transformation?

A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate time-consuming tasks that are too complicated for standard SQL statements.

Stored procedures can be used to do the following tasks (the last of these is sketched after the list):

  • Check the status of a target database before loading data into it.
  • Determine if enough space exists in a database.
  • Perform a specialised calculation.
  • Drop and recreate indexes.
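A minimal PL/SQL sketch of that last task, where the procedure, the index IDX_EMP_DEPTNO, and the EMP table are all illustrative names:

    -- Hypothetical maintenance procedure: drop and recreate an index.
    CREATE OR REPLACE PROCEDURE rebuild_emp_index
    IS
    BEGIN
      EXECUTE IMMEDIATE 'DROP INDEX idx_emp_deptno';
      EXECUTE IMMEDIATE 'CREATE INDEX idx_emp_deptno ON emp (deptno)';
    END;
    /

Such a procedure can then be invoked from a mapping through a Stored Procedure transformation, typically as a pre- or post-load step.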

INTERVIEW QUESTIONS

Infosys:

1. Explain dynamic lookup and its need. Why do we use a dynamic lookup when everything can be done with a static lookup?

2. Explain Subquery and correlated subquery?

3. What all tasks have you used in workflow manager?

4. Explain the complex mapping that you have done?

5. Tell me about performance tuning in Informatica.

6. Can you use WHERE clause with HAVING clause in SQL?


Deloitte:

1. What is the difference between View and Materialized View? Explain both.

2. What is the difference between a star schema and a snowflake schema?

3. How many schemas did you have in your project? Explain.

4. Did you use aggregator transformation in your mapping? How does it work?


Mindtree:

1. In an Expression transformation, if you have an input port, variable port 1, output port 1, variable port 2, and output port 2, what is the order of execution?

2. What is aggregator cache size?

3. Explain the need for dynamic lookup?

4. Explain SCD Type 2 mapping.

5. Performance tuning?

6. What is the use of sorted input in the Aggregator transformation? How does it work?

Please feel free to post your comments so that I can improve this blog...

Tuesday, March 11, 2008

Performance Tuning in Informatica

The goal of performance tuning is to optimize session performance so that sessions run within the available load window for the Informatica Server. You can increase session performance in the following ways. The performance of the Informatica Server is related to network connections: data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Network connections therefore often affect session performance, so minimize them where possible.

Flat files: If your flat files are stored on a machine other than the Informatica Server, move those files to the machine that hosts the Informatica Server.
Relational data sources: Minimize the connections between sources, targets, and the Informatica Server to improve session performance. Moving the target database onto the server system may improve session performance.

Staging areas: If you use staging areas, you force the Informatica Server to perform multiple data passes. Removing staging areas may improve session performance.
You can run multiple Informatica Servers against the same repository. Distributing the session load across multiple Informatica Servers may improve session performance.

Running the Informatica Server in ASCII data movement mode improves session performance because ASCII data movement mode stores a character value in one byte, whereas Unicode mode takes two bytes to store a character.

If a session joins multiple source tables in one Source Qualifier, optimizing the query may improve performance. Also, single-table SELECT statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding indexes.

We can improve session performance by configuring the network packet size, which controls how much data crosses the network at one time; to do this, configure the database connections in the Server Manager. If your target has key constraints and indexes, they slow the loading of data. To improve session performance in this case, drop the constraints and indexes before you run the session and rebuild them after the session completes.
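A minimal sketch of the pre- and post-session SQL this implies, where TGT_FACT, its constraint, and its index are all hypothetical names:

    -- Pre-session SQL: remove constraints/indexes before the load (assumed names).
    ALTER TABLE tgt_fact DISABLE CONSTRAINT fk_tgt_fact_cust;
    DROP INDEX idx_tgt_fact_date;

    -- Post-session SQL: rebuild after the load completes.
    CREATE INDEX idx_tgt_fact_date ON tgt_fact (load_date);
    ALTER TABLE tgt_fact ENABLE CONSTRAINT fk_tgt_fact_cust;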

Running sessions in parallel using concurrent batches will also reduce data loading time, so concurrent batches may also increase session performance.

Partitioning the session improves session performance by creating multiple connections to sources and targets and loading data in parallel pipelines.

In some cases, if a session contains an Aggregator transformation, you can use incremental aggregation to improve session performance. Avoid transformation errors to improve session performance. If the session contains a Lookup transformation, you can improve session performance by enabling the lookup cache. If your session contains a Filter transformation, place it as close to the sources as possible, or use a filter condition in the Source Qualifier. Aggregator, Rank, and Joiner transformations often decrease session performance because they must group data before processing it. To improve session performance in this case, use the sorted input option.

Datawarehouse and Datamarts

Data Warehouse
A data warehouse is a centralized repository containing comprehensive detailed and summary data that provides a complete view of customers, suppliers, business processes, and transactions from a historical perspective, with little volatility.
Advantages

Improve data quality
Minimize inconsistent reports
Capture and provide access to business metadata
Provide capability for data sharing
Integrate data from multiple sources
Merge historical and current data appropriately
Improve the speed and performance of all reporting needs

Data Mart
A data mart contains a subset of the data stored in the data warehouse that is of interest to a specific business community, department, or set of users (for example, marketing promotions, finance, or account collections).
In a well-structured BI system, the data warehouse serves as a single source for multiple data marts.

Unit Testing

The objective of unit testing is to verify business transformation rules, error conditions, and mapping fields at the staging and core levels. Unit testing involves the following:
1. Check the mapping of fields present at the staging level.
2. Check for duplication of values generated using the Sequence Generator (see the sketch after this list).
3. Check the correctness of surrogate keys, which uniquely identify rows in the database.
4. Check the data type constraints of the fields present at the staging and core levels.
5. Check the population of status and error messages into the target table.
6. Check that string columns are left- and right-trimmed.
7. Check that every mapping implements the process-abort mapplet, which is invoked if the number of records read from the source is not equal to the trailer count.
8. Check that every object, transformation, source, and target has proper metadata. Check visually in the Designer tool that every transformation has a meaningful description.
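A minimal sketch of check 2, assuming a target table TGT_CUSTOMER whose surrogate key column CUSTOMER_KEY is populated by the Sequence Generator (both names are illustrative):

    -- Hypothetical duplicate check on a generated surrogate key.
    SELECT customer_key, COUNT(*) AS occurrences
    FROM   tgt_customer
    GROUP  BY customer_key
    HAVING COUNT(*) > 1;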