Business Intelligence & Data Mining 080370131049 Assignment 1
1) State the difference between Data Warehouse & Data Mart. Data Mart Data Warehouse
Page | 1
1) A Data Mart is a specific, subject
oriented, repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart usually is organized as one dimensional model as a star-schema (OLAP cube) made of a fact table and multiple dimension tables. 2) A Data Mart is subject oriented database which gives the data about each and every individual department in an organisation. 3) It is a sub set of data warehousing. It can provide the data to analyze query reporting & analysis. 4) Data Mart would be a Source for Data Warehouse and will be subject Oriented. 5) Data Marts is used on a business division/department level. A data mart only contains the required subject specific data for local analysis. 6) In Data marts staging area is eliminated because data directly transfers from various source systems to data marts. 7) A Data mart typically consists of a single data structure (i.e., a star
A Data Warehouse (DW) is a single organizational repository of enterprise wide data across many or all subject areas. The Data Warehouse is the authoritative repository of all the fact and dimension data (that is also available in the data marts) at an atomic level.
A Data Warehouse is a relational database which is specially designed for analysis purpose rather than for transactional purpose. It is a collection of data marts. It Represents historical data.
Data Warehouse Will be your target for historical data and will be use for Analysis. Data Warehouse is used on an enterprise level
In Data warehouse, staging area is used for mapping and integration of data collected from different sources. A Data warehouse consists of many different types of data structures
Business Intelligence & Data Mining 080370131049
schema, snowflake schema or hypercube). 8) Data marts are fed from the data warehouse with a subset (and aggregated/summarized) of the data warehouse data for performance and getting the data closer to the user. 9) Data marts are two-tier warehouses, consisting of operational source systems and the data mart that can quickly be tailored to individual user needs. (staging, ODS, extracts, etc.) Data warehouse is a logical concept that houses the atomic data (and some aggregated/summarized data) for strategic analysis. The enterprise Data warehouse model is a three-tier model that includes data sources, a single central data warehouse, which we refer to as a data warehouse, and one or more departmental data marts. The central data warehouse is the heart of this model.
Page | 2
2) Describe the components of a Data Warehouse.
Overall Architecture The data warehouse architecture is based on a relational database management system server that functions as the central repository for informational data. Operational data and processing is completely separated from data warehouse processing. This central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tools.
Business Intelligence & Data Mining 080370131049
Typically, the source data for the warehouse is coming from the operational applications. As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering and condensation of data. Because the data contains a historical component, the warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time. Source systems and Databases Source Systems are all those 'transaction/Production' raw data providers, from where the details are pulled out for making it suitable for Data Warehousing. The sources can be quite diverse: Production Databases like Oracle, Sybase, SQL. Excel Sheets. Database of small time applications like in MS Access. ASCII/Data flat files.
Page | 3
Data Staging 'Area' The data staging area is the place where all 'grooming' is done on data after it is pulled from the Source Systems. The end point of grooming is for the Data to be loaded into the 'Analysis OR Presentation Server'. Data staging covers most of the 'back-bone' activities of a Data-Warehouse, which typically are also the biggest analytical and technical challenge of a project. These activities are 'Extraction' and 'Transformation' ETL-Data Extraction Data Extraction is an activity, which pulls the data from various data sources. Most of these sources are production systems OR are used for transaction level work. ETL-Data Transformation If Data Extraction is mining the iron ore, Transformation is to create the steel billets. The Transformation makes sure that the transaction level raw data is transformed into a form (while still being detailed) so that it can be loaded into the 'presentation/Loaded' area. ETL-Presentation/Loaded 'Area' This is the repository where the data is finally loaded after going through all the works of Extraction and Transformation. This becomes the ultimate source for information for various reasons ranging from queries to advanced data modeling. Dimensional Model The presentation area has data model, which is different from that of production system. This is called Dimensional Model. It is the way data is organized in datawarehouse. This concept has been dealt with fair degree of detail as this is the engine of Data Warehouse. Meta Data Meta Data subject is covered in a separate section. It contains all the business and technical designs, rules and locations etc. of all the data starting from the Extraction to final data usage.
Business Intelligence & Data Mining 080370131049
Meta data is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse. Meta data can be classified into: Technical meta data, which contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Business meta data, which contains information that gives users an easyto-understand perspective of the information stored in the data warehouse.
Page | 4
End User Tools and Applications. Data is cooked for consumption. There is a long list of applications to which the data can be put to and the tools, which can make it happen. This includes the reporting, publishing, analysis, modeling and mining tools. Data-Warehouse Administration and Tools Data warehouse is a large platform, which has large number of users, data sources and data targets. Just like production systems, it has to be administered in terms of performance, timelines and availability. This also includes activity logging, data security, backing-up and archiving. Data- Marts The entire section of Data Warehouse is equally applicable to a Data-Mart. A Data-Mart is a Data repository with a more restricted and short-term perspective. Please refer to De-Normalized Data Warehouse/Data Mart for similarities and differences between a Data Warehouse and a Data Mart. OLAP Servers & Data Marts While Data Warehouse can be accessed for any end-user tools application, it also feeds to the downstream OLAP Layer. For example, HR wants to have its own data mart in their separate servers due to confidential reasons. Similarly people who are traveling may need to have their own offline data Mart. Access Tools The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools. Many of these tools require an information specialist, although many end users develop expertise in the tools. Tools fall into four main categories: query and reporting tools, application development tools, online analytical processing tools, and data mining tools. Query and Reporting tools can be divided into two groups: reporting tools and managed query tools. Reporting tools can be further divided into production reporting tools and report writers. Production reporting tools let companies generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end-users. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database.
Business Intelligence & Data Mining 080370131049
These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries. Often, the analytical needs of the data warehouse user community exceed the built-in capabilities of query and reporting tools. In these cases, organizations will often rely on the tried-and-true approach of in-house application development using graphical development environments such as PowerBuilder, Visual Basic and Forte. These application development platforms integrate well with popular OLAP tools and access all major database systems including Oracle, Sybase, and Informix. OLAP tools are based on the concepts of dimensional data models and corresponding databases, and allow users to analyze the data using elaborate, multidimensional views. Typical business applications include product performance and profitability, effectiveness of a sales program or marketing campaign, sales forecasting and capacity planning. These tools assume that the data is organized in a multidimensional model. A critical success factor for any business today is the ability to use information effectively. Data mining is the process of discovering meaningful new correlations, patterns and trends by digging into large amounts of data stored in the warehouse using artificial intelligence, statistical and mathematical techniques. Information Delivery System The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations according to some user-specified scheduling algorithm. In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases. Delivery of information may be based on time of day or on the completion of an external event. The rationale for the delivery systems component is based on the fact that once the data warehouse is installed and operational, its users don't have to be aware of its location and maintenance. All they need is the report or an analytical view of data at a specific point in time. With the proliferation of the Internet and the World Wide Web such a delivery system may leverage the convenience of the Internet by delivering warehouse-enabled information to thousands of end-users via the ubiquitous world wide network. In fact, the Web is changing the data warehousing landscape since at the very high level the goals of both the Web and data warehousing are the same: easy access to information. The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it and they need it most. However, many corporations have struggled with complex client/server systems to give end users the access they need. The issues become even more difficult to resolve when the users are physically remote from the data warehouse location. The Web removes a lot of these issues by giving users universal and relatively inexpensive access to data. Couple this access with the ability to deliver required information on demand and the result is a web-enabled information delivery system that allows users dispersed across continents to perform a sophisticated business-critical analysis and to engage in collective decision-making.
Page | 5
Business Intelligence & Data Mining 080370131049
Page | 6
1)Describe Different types of Data Marts.
A data mart is a subject oriented database which supports the business needs of individual departments in the enterprise or middle management users. A data mart is a subset of enterprise data warehouse. Data marts are known as high performance query structures (HPQS). Three basic types of data marts are dependent, independent, and hybrid. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data or both. Hybrid data marts can draw data from operational systems or data warehouses. Dependent Data Marts A dependent data mart allows you to unite your organization's data in one data warehouse. This gives you the usual advantages of centralization. Figure illustrates a dependent data mart.
Business Intelligence & Data Mining 080370131049
Figure : Dependent Data Mart
Page | 7
Independent Data Marts An independent data mart is created without the use of a central data warehouse. This could be desirable for smaller groups within an organization. It is not, however, the focus of this Guide. See the Data Mart Suites documentation for further details regarding this architecture. Figure illustrates an independent data mart.
Business Intelligence & Data Mining 080370131049
Figure : Independent Data Marts
Page | 8
Hybrid Data Marts A hybrid data mart allows you to combine input from sources other than a data warehouse. This could be useful for many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization. Figure illustrates a hybrid data mart.
Figure : Hybrid Data Mart
Extraction, Transformation, and Transportation The main difference between independent and dependent data marts is how you populate the data mart; that is, how you get data out of the sources and into the data mart. This step, called the Extraction-Transformation-Transportation (ETT)
Business Intelligence & Data Mining 080370131049
process, involves moving data from operational systems, filtering it, and loading it into the data mart. With dependent data marts, this process is somewhat simplified because formatted and summarized (clean) data has already been loaded into the central data warehouse. The ETT process for dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form. With independent data marts, however, you must deal with all aspects of the ETT process, much as you do with a central data warehouse. The number of sources are likely to be fewer and the amount of data associated with the data mart is less than the warehouse, given your focus on a single subject. The motivations behind the creation of these two types of data marts are also typically different. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access of data relevant to a specific department. The creation of independent data marts is often driven by the need to have a solution within a shorter time. Hybrid data marts simply combine the issues of independent and independent data marts.
Page | 9
2)Short Note on Recent Trends in Data Warehousing.
Business Intelligence & Data Mining 080370131049
Although data warehousing has greatly matured as a technology discipline over the past ten years, enterprises that undertake data warehousing initiatives continue to face fresh challenges that evolve with the changing business and technology environment. The data warehouse is being called on to support new initiatives, such as customer relationship management and supply chain management, and has also been directly impacted by the rise of e- business. Data warehousing vendors have developed new and more sophisticated technologies and have acquired and merged with other vendors. The number of homegrown and packaged software implementations throughout the average enterprise has grown rapidly, creating more data sources and information delivery options. With all of the activity surrounding data warehousing, it is hard to sort out which issues and trends are most pressing for enterprises. Trend #10: Data Warehouse Do-Overs Data warehousing has matured as a technology discipline and most large enterprises have completed some form of data warehousing initiative, whether it is an enterprise-wide data warehouse or just one or two departmental data marts. These initiatives have achieved varying degrees of success, and many organizations are now in the process of reengineering or even totally rebuilding their data infrastructures. According to META Group, almost one- third of data warehousing efforts through 2001 will be do-overs. What problems and challenges have made these do-overs necessary? There are some common pitfalls that many enterprise data warehousing initiatives have fallen into: Many organizations undertake data warehousing projects with a build-itand-they-will-come attitude. Unfortunately, this philosophy has doomed many a data warehouse to failure. Data warehousing projects need to involve end users from the beginning to ensure buy-in when the data warehouse is complete. Some organizations also fail to create the killer apps that actually deliver the benefits of the data warehouse to end users. Another pitfall is not architecting the data warehouse for performance, scalability and reliability. Many enterprises do not take future needs into account when building their initial data warehouse and fail to anticipate the demands of warehouse operations. They are forced to rebuild their data warehouse from the ground up when data volumes and user demands overwhelm their original systems. Data quality issues are often ignored in initial data warehouse implementations. Enterprises dont feel the negative impact of poor data quality until after their data warehouse is already up and running. Many are now reexamining the quality of the data in their warehouses and are undertaking the painful process of resolving data quality problems. Some data warehouses are unsuccessful because their sponsors didnt take the time to define success at the outset of the project. According to META Group, only 40 percent of enterprises measure ROI for their data warehousing initiatives. Without a clear definition of success, its hard to determine whether the data warehouse is delivering real business benefits. Finally, many data warehousing projects simply fall into the late-andoverbudget trap. Enterprises fail to anticipate the scope of their data warehousing projects and do not implement proper project planning.
Page | 10
Business Intelligence & Data Mining 080370131049
The good news behind past data warehousing "failures" is that enterprises have learned from their mistakes and are developing a set of best practices as they correct the problems. This means more successful implementations in the future as newcomers to data warehousing learn from those who have been there before. Trend #9: Proliferation of Data Sources The number of enterprise data sources is growing rapidly, with new types of sources emerging every year. The most exciting new source is, of course, enterprise e-business operations. Enterprises want to integrate clickstream data from their Web sites with other internal data in order to get a complete picture of their customers and integrate internal processes. Other sources for valuable data include ERP programs, operational data stores, packaged and homegrown analytic applications and existing data marts. The process of integrating these sources into one data warehouse can be complicated and is made even more difficult when an enterprise merges with or acquires another enterprise. Enterprises also look to a growing number of external sources to supplement their internal data. These might include prospect lists, demographic and psychographic data, and business profiles purchased from third-party providers. Enterprises might also want to use an external provider for help with address verification, where internal company sources are compared with a master list to ensure data accuracy. Additionally, some industries have their own specific sources of external data. For example, the retail industry uses data from store scanners, and the pharmaceutical industry uses prescription data that is aggregated by third- party vendors. Trend #8: Outsourcing Although enterprises have not yet begun to outsource their actual data warehouses, they are outsourcing other applications and, by extension, the data used and generated by those applications. The use of outsourcing is growing rapidly. Gartner, Inc. estimates that by 2003, 45 percent of large enterprises will host or rent some form of business application with an application service provider (ASP). ASPs offer fast application deployment and application expertise that an enterprise might not possess. While the benefits can be great, enterprises that use ASPs must manage the risks inherent in outsourcing data. First, enterprises should make sure that their ASP is taking adequate security measures to keep data separate and private from the data of the ASPs other customers. Second, the enterprise should ensure that the ASP has experience with moving large volumes of data so that migration of data to and from the ASP will go smoothly. Third, the ASP should have proven experience in backup and recovery for the database(s) being used. Finally, enterprises should ensure that the flow of data between the enterprises internal systems and the ASP can be kept intact. Trend #7: Hub Versus Relational Databases In an effort to control costs and improve performance, enterprises are increasingly implementing data hubs in their data warehouses instead of using relational databases. Keeping data in a relational database can be quite expensive, costing three to five times more than keeping data in a nonrelational repository. Additionally, queries on nonrelational data stores can outperform queries on relational databases. In hopes of achieving these benefits, enterprises
Page | 11
Business Intelligence & Data Mining 080370131049
are turning to compressed flat files to replace some of their RDBMSs. Despite the performance benefits and cost-effectiveness of these data hubs, they are limited by not having SQL and are not appropriate for all situations. Trend #6: Active Data Warehouses As enterprises face competitive pressure to increase the speed of decision making, the data warehouse must evolve to support real-time analysis and action. "Active" data warehouses are one way to meet this need. In contrast to traditional data warehouses, active data warehouses are tied closely to operational systems, are designed to hold very detailed and current data, and feature shortened batch windows. And unlike most operational data stores (ODS), active data warehouses hold integrated data and are open to user queries. All of the aforementioned characteristics make active data warehouses ideal for real-time analysis and decision-making as well as automated event triggering. With an active data warehouse, an enterprise can respond to customer interactions and changing business conditions in real time. An active data warehouse enables a credit card company to detect and stop fraud as it happens, a transportation company to reroute its vehicles quickly and efficiently or an online retailer to communicate special offers based on a customers Web surfing behavior. The active data warehouses greatest benefit lies in its ability to support tactical as well as strategic decisions. Trend #5: Fusion with CRM Customer relationship management (CRM) is one of the most popular business initiatives in enterprises today. CRM helps enterprises attract new customers and develop loyalty among existing customers with the end result of increasing sales and improving profitability. A data warehouse contains the information an enterprise needs to truly understand its customers and is, therefore, increasingly looked to as a prerequisite for a successful CRM initiative. One of the most important requirements of CRM is the integration of sales, marketing and customer care all of these customer-facing functions must share information and work together. In the past, enterprises seldom integrated these areas, but CRM initiatives are pushing them to do so in order to better understand and serve their customers. The data warehouse is essential in this integration process, as it collects data from all channels and customer touchpoints, and presents a unified view of the customer to sales, marketing and customer-care employees. Software packages are increasingly reflecting the need for integration of these functional areas as demonstrated by the trend towards merging customer-care and campaign-management software. Trend #4: Growing Number of End Users As vendors make data warehousing and business intelligence tools more accessible to the masses, the number of data warehousing end users is growing rapidly. Survey.com predicts that the number of data warehouse users will more than quadruple by 2002, with an average of 2,718 individual users and 609 concurrent users per warehouse. In addition to coping with the growth in the number of end users, data warehousing teams will need to cater to different types of end users. In a recent study, Gartner found that the use of business intelligence tools is growing most rapidly among administration and operations personnel, followed closely by executive-level personnel. These findings
Page | 12
Business Intelligence & Data Mining 080370131049
demonstrate that business intelligence tools have become both easier to use and more strategic. Obviously, end users will have different needs depending on their position in the company while the business analyst needs ad hoc querying capabilities, the CEO and COO may only want static reporting. Enterprises can handle the growing number of end users through the use of several techniques including parallelism and scalability, optimized data partitioning, aggregates, cached result sets and single-mission data marts. These techniques allow a large number of employees to concurrently access the data warehouse without compromising performance. Accommodating the different needs of various end-user groups will require as much of an organizational solution as a technical one. Data warehousing teams should involve end users from the beginning in order to determine the types of data and applications necessary to meet their decision-making needs. Trend #3: More Complex Queries In addition to becoming more numerous, queries against the data warehouse will also become more complex. User expectations are growing in terms of the ability to get exactly the type of information needed, when its needed. Simple data aggregation is no longer enough to satisfy users who want to be able to drill down on multiple dimensions. For example, it may not be enough to deliver a regional sales report every week. Users may want to look at the data by customized dimensions perhaps by a certain customer characteristic, a specific sales location or the time of purchase. Users are also demanding more sophisticated business intelligence tools. According to Gartner, data mining is the most rapidly growing business intelligence technology. Other sophisticated technologies are also becoming more popular. Vendors are developing software that can monitor data repositories and trigger reactions to events on a real-time basis. For example, if a telecom customer calls to cancel his call-waiting feature, real-time analytic software can detect this and trigger a special offer of a lower price in order to retain the customer. Vendors are also developing a new generation of data mining algorithms, featuring predictive power combined with explanatory components, robustness and self-learning features. These new algorithms automate data mining and make it more accessible to mainstream users by providing explanations with results, indicating when results are not reliable and automatically adapting to changes in underlying predictive models and/or data structures. Enterprises can handle complex queries and the demands of advanced analytic technologies by implementing some of the same techniques used to handle the increasing number of users, including parallelism. These techniques ensure that complex queries will not compromise data warehouse performance. In trying to meet end-user demands, enterprises will also need to address data warehouse availability. In global organizations, users need 24x7 uptime in order to get the information they need. In enterprises with moderate data volumes, high availability is easily implemented with high redundancy levels. In enterprises with large data volumes, however, systems must be carefully engineered for robustness through the use of well-designed parallel frameworks. Trend #2: Integrated Customer View Obtaining a 360- degree view of the customer is rapidly becoming the single most popular rationale for large-scale data warehousing efforts. Enterprises want
Page | 13
Business Intelligence & Data Mining 080370131049
to get a complete picture of each customer across all channels and all lines of business. While this sounds like a simple concept, it can be very difficult to implement. Many enterprises have historically been organized around products, geographies or other business-related dimensions, and their IT systems reflect this. Moving to a customer-centric view requires a big change in the way they collect, store and disseminate information. Enterprises have to integrate the proliferating data sources previously mentioned and must be sure to handle data quality issues so that customers are represented accurately across all systems. Trend #1: Exploding Data Volumes One of the biggest technology issues facing enterprises today is the explosion in data volumes that is expected to occur over the next several years. According to Gartner, in 2004 enterprises will be managing 30 times more data than in 1999. And Survey.com found that the amount of usable data in the average data warehouse will increase 290 percent to more than 1.2 terabytes in 2002. Ebusiness is one of the primary culprits in the data explosion, as clickstream data is expected to quickly add terabytes to the data warehouse. As the number of other customer contact channels grows, they add even more data. Escalating end-user demands also play a part, as organizations collect more information and store it for longer periods. The data explosion creates extreme scalability challenges for enterprises. A truly scalable data warehouse will allow an enterprise to accommodate increasing data volumes by simply adding more hardware. Scalable data warehouses typically rely on parallel technology frameworks. Fortunately, lower hardware costs are making parallel technology more accessible. Distributed memory parallel processor (DMPP) hardware is becoming less expensive, and alternatives to DMPP are also improving server clustering (of SMPs) is evolving as a substitute. Additionally, storage costs continue to decline every year, making it possible for enterprises to keep terabytes of detailed historical data. This article has looked at some of the major challenges in data warehousing today. Hopefully this list provides some food for thought for those involved in data warehousing initiatives and encourages you to examine the way these trends and issues affect your own organizations. While this article has presented only brief suggestions for dealing with data warehousing challenges, I hope readers will use these suggestions as a springboard to further exploration of available solutions.
Page | 14
Business Intelligence & Data Mining 080370131049
Page | 15
7) Explain Virtual & Dynamic Warehouse. A virtual warehouse provides the opportunity for retailers to advertise a whole new line of items online that they would not otherwise have room for on their own shelves. Through the distributors virtual warehouse services, customers can order products from the retailers website. The order is sent to the distributors warehouse, where it is picked, packed and shipped directly to the customer. The benefits: Customers appreciate a fully-stocked inventory, with multiple ordering options and fast shipments. Taking advantage of a virtual warehouse gives retailers the ability to expand their customer base with new products, while increasing customer loyalty through superior services. Because the distributors provide the inventory space, in addition to the picking, packing and shipping labor, retailers can cut costs significantly while improving profits. Distributors expand business while reducing inventories, and with the ability to continually update prices online, distributors no longer have to honor outdated prices that are often found in catalogs. Dynamic warehousing systems are replacing simple static systems more and more. For good reason;the advantages are varied and convincing: - better usage of space - shorter transport routes - less handling equipment - lower costs