Thursday, 11 September 2008

Data warehouse Neeraj Nathani SmartBridge


Data warehouse












In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.
The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales etc., shown in the figure to the right). The data may pass through an operational data store for additional operations before they are used in the DW for reporting.
The typical ETL-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.[1]
A data warehouse constructed from an integrated data source systems does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered to be a part of a distributed operational data store layer. Data federation methods or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.
Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse.
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, cataloged and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

Benefits of a data warehouse

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:
  • Congregates data from multiple sources into a single database so a single query engine can be used to present data.
  • Mitigates the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long running, analysis queries in transaction processing databases.
  • Maintain data history, even if the source transaction systems do not.
  • Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
  • Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
  • Present the organization's information consistently.
  • Provide a single common data model for all data of interest regardless of the data's source.
  • Restructure the data so that it makes sense to the business users.
  • Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
  • Add value to operational business applications, notably customer relationship management (CRM) systems.

Generic data warehouse environment

The environment for data warehouses and marts includes the following:
  • Source systems that provide data to the warehouse or mart;
  • Data integration technology and processes that are needed to prepare the data for use;
  • Different architectures for storing data in an organization's data warehouse or data marts;
  • Different tools and applications for the variety of users;
  • Metadata, data quality, and governance processes must be in place to ensure that the warehouse or mart meets its purposes.
In regards to source systems listed above, Rainer states, “A common source for the data in data warehouses is the company’s operational databases, which can be relational databases” (130).
Regarding data integration, Rainer states, “It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse” (131).
Rainer discusses storing data in an organization’s data warehouse or data marts. “There are a variety of possible architectures to store decision-support data” (131).
Metadata are data about data. “IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures (133).
Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers (Rainer, 127). A “data warehouse” is a repository of historical data that are organized by subject to support decision makers in the organization (128). Once data are stored in a data mart or warehouse, they can be accessed.

Facts

A fact is a value or measurement, which represents a fact about the managed entity or system.
Facts as reported by the reporting entity are said to be at raw level.
E.g. if a BTS received 1,000 requests for traffic channel allocation, it allocates for 820 and rejects the remaining then it would report 3 facts or measurements to a management system:
  • tch_req_total = 1000
  • tch_req_success = 820
  • tch_req_fail = 180
Facts at raw level are further aggregated to higher levels in various dimensions to extract more service or business-relevant information out of it. These are called aggregates or summaries or aggregated facts.
E.g. if there are 3 BTSs in a city, then facts above can be aggregated from BTS to city level in network dimension. E.g.
  • tch\_req\_success\_city = tch\_req\_success\_bts1 + tch\_req\_success\_bts2 + tch\_req\_success\_bts3
  • avg\_tch\_req\_success\_city = (tch\_req\_success\_bts1 + tch\_req\_success\_bts2 + tch\_req\_success\_bts3) / 3

[edit] Dimensional vs. normalized approach for storage of data

There are two leading approaches to storing data in a data warehouse — the dimensional approach and the normalized approach.
The dimensional approach, whose supporters are referred to as “Kimballites”, believe in Ralph Kimball’s approach in which it is stated that the data warehouse should be modeled using a Dimensional Model/star schema. The normalized approach, also called the 3NF model, whose supporters are referred to as “Inmonites”, believe in Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.
In a dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. Dimensional structures are easy to understand for business users, because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization’s business processes and operational system whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).
The main disadvantages of the dimensional approach are:
  1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
  2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to:
  1. join data from different sources into meaningful information and then
  2. access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
It should be noted that both normalized and dimensional models can be represented in entity-relationship diagrams as both contain joined relational tables. The difference between the two models is the degree of normalization.
These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).
In Information-Driven Business (Wiley 2010),[6] Robert Hillard proposes an approach to comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models) but this extra information comes at the cost of usability. The technique measures information quantity in terms of Information Entropy and usability in terms of the Small Worlds data transformation measure.[7]

Source: Wikipedia.

No comments:

Post a Comment