DW/BI Basic Concepts


1. Define a data warehouse.
A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
2. What is a junk dimension? What is the difference between a junk dimension and a degenerate dimension?
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store these junk attributes, whereas a degenerate dimension is data that is dimensional in nature but stored in the fact table.
Junk dimension: miscellaneous low-cardinality flag and indicator columns, which would otherwise clutter the fact table or require many tiny dimensions, are grouped together to form a single dimension called a junk dimension.
Degenerate dimension: a dimension attribute, such as an invoice number or order number, that is kept directly in the fact table.
It has no other descriptive attributes of its own and therefore no dimension table.
For example, in an orders fact table the order number is carried as a degenerate dimension: it identifies the transaction but does not join to any dimension table.
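As a rough illustration (hypothetical table and column names, not from the original answer), a junk dimension collects miscellaneous flags into one small table, while a degenerate dimension such as the order number sits directly in the fact table:
-- Junk dimension: miscellaneous low-cardinality flags collapsed into one small dimension
CREATE TABLE dim_order_flags (
    order_flags_key INTEGER PRIMARY KEY,  -- surrogate key
    payment_type    VARCHAR(10),          -- e.g. 'CASH', 'CARD'
    is_gift_order   CHAR(1),              -- 'Y' / 'N'
    is_expedited    CHAR(1)               -- 'Y' / 'N'
);

-- Fact table referencing the junk dimension and carrying a degenerate dimension
CREATE TABLE fact_order_line (
    date_key        INTEGER NOT NULL,     -- FK to the date dimension
    product_key     INTEGER NOT NULL,     -- FK to the product dimension
    order_flags_key INTEGER NOT NULL,     -- FK to the junk dimension above
    order_number    VARCHAR(20) NOT NULL, -- degenerate dimension: no dimension table of its own
    quantity        INTEGER,
    sales_amount    DECIMAL(12,2)
);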
3. Difference between normalization and denormalization?

Normalization is the process of removing redundancies.
OLTP systems use the normalization process.

Denormalization is the process of deliberately allowing redundancies.
OLAP/DW designs use denormalization so that detailed data (each and every transaction) can be stored and queried with fewer joins.
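A small sketch of the difference, reusing the classic EMP/DEPT example as hypothetical tables: the normalized (OLTP) design splits employee and department and requires a join, while the denormalized (DW) dimension repeats the department attributes on every employee row:
-- Normalized (OLTP): employee and department are separate tables, joined by deptno
CREATE TABLE dept (
    deptno INTEGER PRIMARY KEY,
    dname  VARCHAR(30),
    loc    VARCHAR(30)
);
CREATE TABLE emp (
    empno  INTEGER PRIMARY KEY,
    ename  VARCHAR(30),
    sal    DECIMAL(10,2),
    deptno INTEGER REFERENCES dept(deptno)
);

-- Denormalized (DW dimension): department attributes repeated on each employee row
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY, -- surrogate key
    empno        INTEGER,
    ename        VARCHAR(30),
    dname        VARCHAR(30),
    loc          VARCHAR(30)
);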

4. Why is the fact table in normal form?

A fact table consists of measurements of the business requirements and foreign keys to the dimension tables, as per the business rules.
Because each row contains only measures plus surrogate keys referencing the dimensions, with no repeating descriptive attributes, the fact table is effectively in normal form even though the star schema as a whole is denormalized. Being in normal form, more granularity is achieved with less coding, i.e. fewer joins while retrieving the facts.
5. What is the difference between E-R modeling and dimensional modeling?
The basic difference is that E-R modeling produces both a logical and a physical model, while a dimensional model is essentially a physical model. E-R modeling is used to normalize the OLTP database design.
Dimensional modeling is used to denormalize the ROLAP/MOLAP design. Adding to that:
E-R modeling revolves around the entities and their relationships, to capture the overall process of the system.
Dimensional (multidimensional) modeling revolves around dimensions (points of analysis) for decision-making, not around capturing the process.
In E-R modeling the data is in normalized form, so more joins are needed, which may adversely affect system performance. In dimensional modeling the data is denormalized, so fewer joins are needed and system performance improves.
6. What is a conformed fact?
A conformed fact is a fact (measure) that keeps the same name and the same definition in every fact table in which it appears, so that values from separate tables can be combined and compared mathematically.
Conformed dimensions are dimensions that can be used across multiple data marts in combination with multiple fact tables accordingly. They are tables with a fixed structure: there is no need to change their metadata, and they can go along with any number of fact tables in the application without any changes.
A dimension table that is used by more than one fact table is known as a conformed dimension.
7. What are the methodologies of Data Warehousing?

There are mainly two methods:
1. Ralph Kimball model (bottom-up approach: Data Marts --> Data Warehouse).
The Kimball model is always structured as a denormalized, dimensional design.

2. Bill Inmon model (top-down approach: Data Warehouse --> Data Marts).
The Inmon model is structured as a normalized design.

8. What are data validation strategies for data mart validation after loading process?
Data validation is making sure that the loaded data is accurate and meets the business requirements; strategies are the different methods followed to meet those validation requirements. Typical checks include comparing source and target row counts, validating totals of key measures, verifying referential integrity between fact and dimension keys, and spot-checking sample records against the source.
9. What is a surrogate key?

A surrogate key is the primary key of a dimension table. It is a substitute for the natural primary key.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. It can be generated with an Informatica sequence generator, an Oracle sequence, or SQL Server identity values.

It is useful because the natural primary key (e.g. the customer number in the customer table) can change, which makes updates more difficult; surrogate keys are also used in SCDs to preserve historical data.
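A minimal sketch, assuming a SQL Server-style identity column (an Oracle sequence or an Informatica sequence generator would serve the same purpose); the table and column names are hypothetical:
-- Customer dimension keyed by a system-generated surrogate key
CREATE TABLE dim_customer (
    customer_key    INTEGER IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    customer_number VARCHAR(20) NOT NULL,              -- natural/business key from the source
    customer_name   VARCHAR(100),
    city            VARCHAR(50)
);

-- Rows are identified by customer_key, not by the source system's customer_number
INSERT INTO dim_customer (customer_number, customer_name, city)
VALUES ('C-1001', 'Acme Ltd', 'Boston');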

10. What is meant by metadata in the context of a data warehouse, and why is it important?
Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. Metadata synchronization is the process of consolidating, relating and synchronizing data elements with the same or similar meaning from different systems; it joins these differing elements together in the data warehouse to allow easier access.

In the context of a data warehouse, metadata means information about the data, and it is stored in the designer repository. A business analyst or data modeler usually captures information about the data - its source (where and how the data originated), its nature (char, varchar, nullable, existence, valid values, etc.) and its behavior (how it is modified or derived, and its life cycle) - in a data dictionary, a.k.a. metadata. Metadata is also present at the data mart level, for subsets, facts and dimensions, the ODS, etc. For a DW user, metadata provides vital information for analysis and decision support (DSS).
11. What are the possible data marts in Retail sales?
Product information, sales information
12. What is the main difference between a schema in an RDBMS and the schemas in a data warehouse?
RDBMS schema 
* Used for OLTP systems 
* Traditional, transaction-oriented schema 
* Normalized 
* Difficult to understand and navigate 
* Cannot easily answer complex extraction and analysis queries 
* Poorly suited for analytical modelling 
DWH schema 
* Used for OLAP systems 
* Newer-generation, analysis-oriented schema 
* Denormalized 
* Easy to understand and navigate 
* Complex extraction and analysis queries can be solved easily 
* Well suited for analytical modelling
13. What is dimensional modeling?
In dimensional modeling, data is stored in two kinds of tables: fact tables and dimension tables.
A fact table contains fact data, e.g. sales, revenue, profit, etc. 
A dimension table contains dimensional data such as product ID, product name, product description, etc.
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of those measurements, i.e. the dimensions on which the facts are calculated.
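A minimal sketch of the two table types, with hypothetical names; the fact table holds measures plus foreign keys, and the dimension table holds the descriptive context:
-- Dimension table: descriptive context
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY, -- surrogate key
    product_id   VARCHAR(20),         -- natural key from the source system
    product_name VARCHAR(100),
    product_desc VARCHAR(255)
);

-- Fact table: measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
    date_key       INTEGER NOT NULL,
    product_key    INTEGER NOT NULL REFERENCES dim_product(product_key),
    store_key      INTEGER NOT NULL,
    sales_amount   DECIMAL(12,2), -- measure
    sales_quantity INTEGER        -- measure
);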
14. Why is Data Modeling Important? 

The data model is detailed enough to be used by database developers as a "blueprint" for building the physical database. The information contained in the data model is used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required for critical reports, produces incorrect or inconsistent results, and cannot accommodate changes in the users' requirements.

15. What does the level of granularity of a fact table signify?
The level of granularity signifies the level of detail at which facts are stored, i.e. what one row of the fact table represents. A finer grain means more detail (and more rows) and more potential for later aggregation; a coarser grain means the data is already summarized and less detail is available. As an example, consider geographic granularity: we may analyze data at the levels of COUNTRY, REGION, TERRITORY, CITY and STREET; here STREET is the finest, most detailed grain. In short, the level of granularity defines the lowest level of the hierarchy down to which we can see or drill the data in the fact table.
16. What is a degenerate dimension?
A dimension value that is stored in the fact table rather than in its own table is called a degenerate dimension. Such dimensions do not have dimension tables of their own.
17. How do you load the time dimension?
In a data warehouse the time dimension is loaded manually (by a one-time script or ETL routine) rather than from source data; every data warehouse maintains a time dimension. It is built at the most granular level at which the business runs (e.g. day, with attributes such as weekday, day of the month and so on). Depending on the data loads, the time dimension is extended: a weekly process extends it every week and a monthly process every month.
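A rough, PostgreSQL-flavoured sketch of pre-populating a day-grain date dimension with a recursive CTE (names and the date range are hypothetical; other databases would use their own date functions or a loop):
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY, -- e.g. 20240131
    full_date    DATE NOT NULL,
    day_of_week  INTEGER,
    month_number INTEGER,
    year_number  INTEGER
);

-- Generate one row per day for a fixed range and load it once
WITH RECURSIVE days(d) AS (
    SELECT DATE '2020-01-01'
    UNION ALL
    SELECT d + 1 FROM days WHERE d < DATE '2030-12-31'
)
INSERT INTO dim_date (date_key, full_date, day_of_week, month_number, year_number)
SELECT CAST(TO_CHAR(d, 'YYYYMMDD') AS INTEGER),
       d,
       EXTRACT(ISODOW FROM d),
       EXTRACT(MONTH FROM d),
       EXTRACT(YEAR FROM d)
FROM days;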
18. Difference between snowflake and star schema. In what situations is a snowflake schema better than a star schema, and when is the opposite true?
Star schema and snowflake schema both serve the purpose of dimensional modeling for data warehouses. 
A star schema is a dimensional model with a (large) fact table and a set of (small) dimension tables. The whole set-up is totally denormalized.
Where the dimension tables are split into many tables, so that the schema is slightly inclined towards normalization (to reduce redundancy and dependency), we get the snowflake schema. 
The nature and purpose of the data to be fed into the model is the key to which is better: a star schema favours simple, fast queries, while a snowflake schema favours reduced redundancy and easier maintenance of large, hierarchical dimensions.
Star schema
*    Contains the dimension tables mapped around one or more fact tables.
*    It is a denormalized model.
*    No need to use complicated joins.
*    Queries return results faster.
Snowflake schema
*    It is the normalized form of the star schema.
*    Contains deeper joins, because the dimension tables are split into many pieces; modifications can be made directly in those smaller tables.
*    Complicated joins have to be used, since there are more tables.
*    There will be some delay in processing queries.
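A minimal sketch (hypothetical names) of snowflaking a product dimension: in the star version the category attributes stay on the product dimension, while in the snowflake version they are normalized into their own table:
-- Star: category attributes denormalized onto the product dimension
CREATE TABLE dim_product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

-- Snowflake: the category level is split out, adding one more join at query time
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);
CREATE TABLE dim_product_snow (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES dim_category(category_key)
);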
19. Why do you need a star schema?
1) It contains fewer joins.
2) The database design is simpler.
3) It supports drill-down and roll-up operations.
20. Why do you need a snowflake schema?
Sometimes we need to split separate, normalized dimensions out of existing dimensions (for example to maintain a large hierarchy); in that case we go for a snowflake schema.
Disadvantage of the snowflake schema: query performance is lower because more joins are involved.

21. What are conformed dimensions?
They are dimension tables in a star schema data mart that adhere to a common structure, and therefore allow queries to be executed across star schemas. For example, the calendar dimension is needed in most data marts. By making this calendar dimension adhere to a single structure, regardless of which data mart in your organization it is used in, you can query by date/time from one data mart to another.
Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas containing fact and dimension tables).

Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and dimensions;
here D1 and D2 are the conformed dimensions.
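A rough sketch of why conformance matters, using the hypothetical tables fact_sales and fact_shipments plus a shared dim_date: each fact table is aggregated to the common grain and the results are then merged through the conformed date dimension (a "drill-across" query):
-- Drill across two data marts via the conformed date dimension
WITH monthly_sales AS (
    SELECT d.year_number, d.month_number, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year_number, d.month_number
),
monthly_shipments AS (
    SELECT d.year_number, d.month_number, SUM(f.ship_quantity) AS total_shipped
    FROM fact_shipments f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year_number, d.month_number
)
SELECT s.year_number, s.month_number, s.total_sales, sh.total_shipped
FROM monthly_sales s
JOIN monthly_shipments sh
  ON s.year_number = sh.year_number AND s.month_number = sh.month_number;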
22. What is a fact table?
A fact table is a table in a data warehouse whose rows record the measurements (key figures) of a business process, together with foreign keys to the dimension tables that give those measurements their context. In other words, the fact table describes the transaction data, while the dimension tables contain the data from which the dimensions are created.
23. What are semi-additive and factless facts, and in which scenarios would you use such fact tables?
Semi-additive: semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. For example: 
Current Balance and Profit Margin are the facts. Current Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).
A factless fact table captures the many-to-many relationships between 
dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include: 
- Identifying product promotion events (to determine promoted products that didn't sell) 
- Tracking student attendance or registration events 
- Tracking insurance-related accident events 
- Identifying building, facility, and equipment schedules for a hospital or university
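A minimal sketch of a factless fact table for student attendance (hypothetical names): each row simply records that an event occurred, and analysis is done by counting rows rather than summing a measure:
CREATE TABLE fact_attendance (
    date_key    INTEGER NOT NULL, -- FK to the date dimension
    student_key INTEGER NOT NULL, -- FK to the student dimension
    course_key  INTEGER NOT NULL, -- FK to the course dimension
    PRIMARY KEY (date_key, student_key, course_key)
);

-- Attendance is measured by counting rows, since there is no numeric fact
SELECT course_key, COUNT(*) AS attendance_count
FROM fact_attendance
GROUP BY course_key;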
24. What are the different methods of loading dimension tables?
Conventional load: before the data is loaded, all the table constraints are checked against the data. 
 
Direct load (faster loading): all the constraints are disabled and the data is loaded directly; afterwards the data is checked against the table constraints and the bad data is not indexed. The conventional and direct load methods are Oracle terms; the naming convention is not a general one and does not apply to other RDBMSs such as DB2 or SQL Server.
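A rough Oracle-flavoured illustration of the same idea in plain SQL (hypothetical table and constraint names): a conventional INSERT is validated row by row, whereas a direct-path insert appends blocks directly, with constraints disabled beforehand and re-validated afterwards:
-- Conventional load: constraints are checked as each row is inserted
INSERT INTO dim_product (product_key, product_id, product_name)
SELECT product_key, product_id, product_name FROM stg_product;

-- Direct-path load: disable constraints, append directly, then re-enable and validate
ALTER TABLE dim_product DISABLE CONSTRAINT dim_product_pk;
INSERT /*+ APPEND */ INTO dim_product (product_key, product_id, product_name)
SELECT product_key, product_id, product_name FROM stg_product;
COMMIT;
ALTER TABLE dim_product ENABLE VALIDATE CONSTRAINT dim_product_pk;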

25. What are aggregate tables?
Aggregate tables contain redundant data that is summarized from other data in the warehouse, e.g. yearly or monthly sales information. These tables are used to reduce query execution time.
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. Retrieving the required data from the base table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to the required level and use that instead. Aggregate tables reduce the load on the database server, improve query performance, and return results very quickly.
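A minimal sketch (Oracle/PostgreSQL-style CREATE TABLE AS, hypothetical names) of building a monthly sales aggregate from a day-grain fact table:
-- Pre-summarize day-grain sales to month level to speed up reporting queries
CREATE TABLE agg_sales_monthly AS
SELECT d.year_number,
       d.month_number,
       f.product_key,
       SUM(f.sales_amount)   AS sales_amount,
       SUM(f.sales_quantity) AS sales_quantity
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year_number, d.month_number, f.product_key;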
26. What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up; it mainly contains textual (descriptive) attributes.
27. Why are OLTP database designs not generally a good idea for a data warehouse?
An OLTP system does not keep historical information about the organization; it is used for storing the details of daily transactions, while a data warehouse is a huge store of historical information, integrated from different sources and organized into data marts, for making intelligent decisions about the organization.
28. What is the need for a surrogate key? Why is the primary key not used as the surrogate key?
A surrogate key is an artificial identifier for an entity. Surrogate key values are generated by the system sequentially (like the identity property in SQL Server or a sequence in Oracle); they do not describe anything.
A primary key is a natural identifier for an entity. Primary key values come from the source data and uniquely identify each row; there is no repetition of data.

Need for a surrogate key rather than the natural primary key:
If a column is made the primary key and later the data type or the length of that column needs to change, then all the foreign keys that depend on that primary key must be changed as well, making the database unstable.
Surrogate keys make the database more stable because they insulate the primary and foreign key relationships from changes in the data types and lengths of the source columns.
For example: you are extracting customer information from an OLTP source and, after the ETL process, loading it into a customer dimension table in the DW. With SCD Type 1 you can use the source CustomerID as the primary key of the dimension table. But if you would like to preserve the history of the customer in the dimension table (Type 2), you need another unique value apart from CustomerID; there you have to use a surrogate key. 
Another reason: if CustomerID is alphanumeric, you should still use a surrogate key in the dimension table. It is advisable to have a small, system-generated integer as the surrogate key, so that indexing and retrieval are much faster.
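A minimal sketch (hypothetical names, SQL Server-style identity column) of a Type 2 customer dimension, where the surrogate key allows several historical rows for the same CustomerID:
CREATE TABLE dim_customer_scd2 (
    customer_key   INTEGER IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    customer_id    VARCHAR(20) NOT NULL,              -- natural key; repeats across versions
    customer_name  VARCHAR(100),
    city           VARCHAR(50),
    effective_date DATE NOT NULL,                     -- when this version became valid
    expiry_date    DATE,                              -- NULL for the current version
    is_current     CHAR(1) NOT NULL                   -- 'Y' for the current row
);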
29. What is data cleansing? How is it done?
Data cleansing is the act of detecting and removing and/or correcting a database's dirty data (i.e. data that is incorrect, out of date, redundant, incomplete, or formatted incorrectly).
It can be done using the existing ETL tools or third-party tools such as Trillium.
30. What are slowly changing dimensions?
Dimensions whose attribute values change over time are called slowly changing dimensions (SCDs). For instance, a product's price changes over time, people change their names for some reason, and country and state names may change over time. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time. The common ways of handling such changes are Type 1 (overwrite the old value), Type 2 (add a new row and keep history) and Type 3 (add a new column for the previous value).
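A rough sketch of Type 2 handling, reusing the hypothetical dim_customer_scd2 table shown earlier (date literals and exact syntax vary by database): when a tracked attribute changes, the current row is expired and a new version with a new surrogate key is inserted:
-- Expire the current version of the customer
UPDATE dim_customer_scd2
SET expiry_date = '2024-01-31',
    is_current  = 'N'
WHERE customer_id = 'C-1001'
  AND is_current  = 'Y';

-- Insert the new version; the identity column assigns a new surrogate key
INSERT INTO dim_customer_scd2
    (customer_id, customer_name, city, effective_date, expiry_date, is_current)
VALUES
    ('C-1001', 'Acme Ltd', 'Chicago', '2024-01-31', NULL, 'Y');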
31. What are data marts?
A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, which are usually smaller than the corporate data warehouse.
A data mart is a small data warehouse. In general, a data warehouse is divided into smaller units according to the business requirements. For example, the data warehouse of an organization may be divided into individual data marts. Data marts are used to improve performance when retrieving data.
       e.g. a sales data mart, a finance data mart, a marketing data mart, an HR data mart, etc.
32. Can a dimension table contain numeric values?
Yes, but only as keys and descriptive attributes (for example a surrogate key, a product size or a postal code). The additive, measurable numeric fields (the measures) belong in the fact table, not in the dimension tables.
33. Explain degenerate dimensions in detail.
A degenerate dimension is a dimension that has no dimension table of its own.
It is generated at the time of the transaction, like an invoice number, which is created when the invoice is raised.
It is not used for linking to a dimension table and it is not a foreign key,
but these degenerate dimensions can be used as (part of) the primary key of the fact table.
A degenerate dimension is a dimension which has only a single attribute.
This dimension is typically represented as a single field in a fact table.
Data items that are not facts and that do not fit into the existing dimensions are termed degenerate dimensions.
Degenerate dimensions are the fastest way to group similar transactions.
Degenerate dimensions are used when fact tables represent transactional data.
34. Give examples of degenerate dimensions.
A degenerate dimension is a dimension key without a corresponding dimension table. Example:
     In the point-of-sale (POS) transaction fact table, we have:
     Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), and POS Transaction Number.
The Date dimension corresponds to the Date Key, and the Product dimension corresponds to the Product Key. In a traditional parent-child database, the POS transaction number would be the key to the transaction header record that contains all the information valid for the transaction as a whole, such as the transaction date and the store identifier. But in the dimensional model we have already extracted that information into the other dimensions, so the POS transaction number looks like a dimension key in the fact table but has no corresponding dimension table. Therefore, the POS transaction number is a degenerate dimension.
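A minimal DDL sketch of that fact table (hypothetical names), with the POS transaction number kept as a degenerate dimension:
CREATE TABLE fact_pos_retail_sales (
    date_key           INTEGER NOT NULL,     -- FK to the date dimension
    product_key        INTEGER NOT NULL,     -- FK to the product dimension
    store_key          INTEGER NOT NULL,     -- FK to the store dimension
    promotion_key      INTEGER NOT NULL,     -- FK to the promotion dimension
    pos_transaction_no VARCHAR(20) NOT NULL, -- degenerate dimension: no dimension table
    sales_quantity     INTEGER,
    sales_amount       DECIMAL(12,2)
);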
35. What are the steps to build a data warehouse?
1. Gather the business requirements:
*  Identify sources
*  Identify facts
*  Define dimensions
*  Define attributes
*  Redefine dimensions and attributes
*  Organize the attribute hierarchy and define relationships
*  Assign unique identifiers
*  Additional conventions: cardinality, adding ratios
*  Understand the business requirements
2. Once the business requirements are clear, identify the grains (levels).
3. With the grains defined, design the dimension tables at the lowest-level grain.
4. Once the dimensions are designed, design the fact table with the key performance indicators (the facts).
5. Once the dimension and fact tables are designed, define the relationships between the tables using primary and foreign keys. In the logical phase the database design looks like a star, so it is called a star schema design.
36. What are the different architectures of a data warehouse?
1. Top-down - (Bill Inmon)
2. Bottom-up - (Ralph Kimball)
There are three basic physical architectures:
*  Basic data warehouse architecture:
In this architecture end users access data that is derived from several sources through the data warehouse.
Architecture: Source --> Warehouse --> End Users
*  Data warehouse with staging area architecture:
Whenever the data derived from the sources needs to be cleaned and processed before it is put into the warehouse, a staging area is used.
Architecture: Source --> Staging Area --> Warehouse --> End Users
*  Data warehouse with staging area and data marts architecture:
When the warehouse architecture must be customized for different groups in the organization, data marts are added and used.
Architecture: Source --> Staging Area --> Warehouse --> Data Marts --> End Users
