This Tutorial Explains The Role of Metadata in ETL, Examples & Types of Metadata, Metadata Repository & Challenges in Metadata Management:
Data Mart in ETL was explained in detail in our previous tutorial.
The concept of Metadata is very important in ETL and this tutorial will explain all about Metadata.
It covers the role of metadata, examples of metadata, as well as its types, metadata repository, how can data warehousing metadata be managed, challenges for metadata management.
You will also get to know what is metadata-driven ETL and the difference between data and metadata.
- Data warehouse/ETL developers and testers.
- Database professionals with basic knowledge of database concepts.
- Database administrators/big data experts who want to understand Data warehouse/ETL areas.
- College graduates/freshers who are looking for Data warehouse jobs.
What You Will Learn:
- Metadata In ETL
- Comparison Between Data And MetaData
Metadata In ETL
Data warehouse team (or) users can use metadata in a variety of situations to build, maintain and manage the system. The basic definition of metadata in the Data warehouse is, “it is data about data”.
Metadata can hold all kinds of information about DW data like:
- Source for any extracted data.
- Use of that DW data.
- Any kind of data and its values.
- Features of data.
- Transformation logic for extracted data.
- DW tables and their attributes.
- DW objects
Metadata acts as a table of contents for data in the DW system, which shows the technique with more details about that data. In simple words, you can think of an index in any book that acts as metadata, for the contents in that book.
Similarly, Metadata works as an index to the DW content. All such metadata is stored in a repository. By going through Metadata, the end-users get to know from where they can begin analyzing the DW system. Else, it is tough for the end-users to know where to start the data analysis from in such a huge DW system.
Role Of Metadata In Data Warehouse
In the earlier days, Metadata was created and maintained as documents. But in today's digital world, various tools have made this job easier by recording metadata at each level of the DW process.
Metadata created by one tool can be standardized (i.e. data can be brought into one unique format) and can be reused across the other tools anywhere in the DW system.
As we are aware that operational systems maintain current data, the DW systems maintain historical and current data.
Metadata must keep a track of all the changes that happen in source systems, data extraction/transformation methods and in the structure (or) content of data that will arise in this process. Metadata will maintain various versions to keep a track of all these changes over several years.
Sufficient metadata provided in the repository will help any user, in analyzing the system more efficiently and independently. By understanding metadata, you can run any sort of queries on DW data for the best results.
Pictorial representation of the Role of Metadata:
Examples Of Metadata In Simple Terms
Given below are some of the examples of Metadata.
- Metadata for a web page may contain the language it is coded in, the tools used to build it, supporting browsers, etc.
- Metadata for a digital image may contain the size of the picture, resolution, color intensity, image creation date, etc.
- Metadata for a document may contain the document created date, last modified date, it's size, author, description, etc.
Comparison Between Data And MetaData
|1||Data is a set of information.||Metadata is information about data.|
|2||Data may (or) may not be processed.||Metadata is always a processed data.|
Types Of Metadata
The classification of metadata into various types will help us to understand it better. This classification can be based on its usage (or) the users etc.
Let's explore the different types of Metadata below:
#1) Backroom Metadata: Directs the DBAs (or) the end-users on extract, clean and load processes.
#2) Front room Metadata: Directs the end-users to work with BI tools and reports.
#3) Process Metadata: This stores ETL process metadata such as the number of rows loaded, rejected, processed and time taken to load into a DW system, etc. This information can also be accessible to the end-users.
At the same time, the statistics of the staging tables are also important to the ETL team. This metadata will store the staging tables process data such as the number of rows loaded, rejected, processed and the time taken to load into each staging table.
#4) Data Lineage: This stores the logical transformation for each source system element to the DW target element.
#5) Business Definitions: The context for DW tables has been derived from the business definitions. Every attribute in a table is associated with a business definition. Hence these should be stored as metadata (or) any other document for future reference. Both the end-users and the ETL team depend on these business definitions.
#6) Technical Definitions: Technical definitions are exclusively used in the data staging area more than the business definitions. The main purpose is to reduce the ambiguity while creating staging tables and to reuse any existing tables. Technical definitions will store the details of each staging table such as its location and structure.
Each staging table is technically documented here, if not documented then it means that the staging table is not existing. This avoids the recreation of the same staging table.
#7) Business Metadata: Data will be stored in business terms for the benefit of end-users/analysts/managers/ any users. Business metadata is proxy to the source system data i.e. no data manipulations will be done on it. It can be derived from any business documents and business rules.
#8) Technical Metadata: This will store technical data such as tables attributes, their data types, size, primary key attributes, foreign key attributes, and any indexes. This is more structured when compared to business metadata.
Technical metadata is mainly intended for the DW team such as developers/testers/analysts/DBAs to build (or) maintain the system. This is also significantly used by the administrators to monitor the database loads and data backups etc.
#9) Operational Metadata: As we know the data into the DW system is sourced from many operational systems with diverse data types and fields. DW extracts transform such data into the unique type and load all this data into the system.
At the same time, it must be able to link back the data to its source system data. The metadata that stores all these operational data sources information is known as Operational metadata.
#10) Source System Information:
You can collect the following metadata from various source systems:
- Database (or) file system: This will store the names of source system databases (or) files.
- Table specifications: This will store all the details about tables such as table name, its purpose, size, attributes, primary keys, and foreign keys.
- Exception handling rules: This will store different methods of recovering the system in case of system failures.
- Business definitions: This will store business definitions for a brief understanding of data.
- Business rules: This will store a set of rules for each table to understand its data and to avoid inconsistency.
Source system metadata saves a lot of time to the DW team while analyzing the data.
#11) ETL Job Metadata: ETL job metadata is very important as it stores the details of all the jobs to be processed in the schedule, to load the ETL system.
This metadata stores the following information:
- Job Name: ETL job name.
- Job Purpose: The purpose of running the job.
- Source Tables/Files: It provides the names and location of all tables and files from which the data is being sourced by this ETL job. This can have more than one table (or) file name.
- Target Tables/Files: It provides the names and location of all tables and files to which the data is being transformed by this ETL job. This can have more than one table (or) file name.
- Rejected Data: It provides the names and location of all the tables and files from which the intended source data has not been loaded into the target.
- Pre Processes: It provides the jobs (or) script names on which the current job is dependent. It means those have to be successfully executed before running the current job.
- Post Processes: It provides the jobs (or) script names that should be run immediately after the current job to complete the process.
- Frequency: It provides information on how frequently the job should be executed i.e. daily, weekly (or) monthly.
#12) Transformation Metadata: Transformation metadata stores all ETL process-related construction information. Every single manipulation of data in the ETL process is known as data transformation.
Any set of functions, stored procedures, cursors, variables and loops in the ETL process can be considered as transformations. But such transformations cannot be documented separately as metadata.
The entire ETL process is built up with data transformations. Few transformations in ETL can be predefined and used across the DW system. ETL developers spend their time in building (or) re-processing all the data transformations. Reusing the predefined transformations during the ETL process development will speed up the work.
Read through the below data transformations that you can find in ETL:
- Source Data Extractions: This involves data transformations to read from source system data such as a SQL Select query (or) FTP (or) reading XML/mainframe data.
- Surrogate Key Generators: The new sequence number that should be generated for every database table row is stored as metadata.
- Lookups: Lookups can be formed with all the IN statements, inner joins, and outer joins. These are mainly used to hold the surrogate keys from all the respective dimension tables while loading a fact.
- Filters: Filters are recommended to sort out the data that should be extracted, loaded and rejected in the ETL process. Filtering the data in the early stages of the ETL system is a good practice. Filters are applied depending on the business rules (or) constraints.
- Aggregates: Depending on the level of data granularity, the metadata related to aggregate functions can be used such as sum, count, average, etc.
- Update Strategies: These are the rules applied to a record while updating the data. If there is any modification in the existing data, then this will indicate if a record should be added, deleted (or) updated.
- Target Loader: Target loader will store the details of the database, table names and column names into which the data should be loaded through the ETL process. Moreover, this will also store the details of bulk load utility if any, that is performed while loading data into the ETL system.
Every transformation can be named distinctively with a brief note about its purpose.
Some examples of naming conventions are inlined here for the above list of transformations.
SRC_<name of the table> SEQ_<surrogate key column name> LKP_<Name of the table referred> FIL_<Purpose> AGG_<Purpose> UPD_<Update type>_<Name of table> TRG_<Name of table>
Metadata Repository In ETL
A metadata repository is a place where any type of metadata is stored either in a local database (or) in a virtual database. Each type of metadata such as business metadata (or) technical metadata can be separated logically in a repository.
In addition to the above two types, the repository also has one more component named Information navigator.
Information navigator can be used to perform the below tasks:
- Interface From Query Tool: This provides an interface to the query tools to access DW metadata.
- Drill Down For Details: This allows the user to drill down metadata for more detailed information. As an example, at the first level, the user can get a data table definition. By drilling down he can get the table attributes at the next level. By drilling down the data more he can get the details of each attribute etc.
- Review Pre-defined Queries and Reports: This allows the user to review pre-defined queries and reports. This acts as a reference to frame queries on his own with suitable parameters etc.
Pictorial representation of Metadata Repository:
How Can Data Warehousing Metadata Be Managed?
People, process and tools are the key sources to manage metadata.
- People should understand the metadata for appropriate usage.
- The process will incorporate metadata into tools (or) repository with the progress of the DW life cycle for future use.
- Afterward, metadata can be managed by tools.
Challenges For Metadata Management
Once the metadata is created, you may face the below challenges while integrating and managing metadata in the system.
- Bringing various formats of metadata into a standard format may need more effort if various tools are being used in the DW system, Because metadata can be stored across spreadsheets, applications (or) databases.
- Metadata formats do not have any established industry-wide standards. With this lack of standardized process, it is tough to pass metadata through various levels of the DW system and tools.
- Consistently maintaining various versions of historical metadata is a complex task.
What Is Metadata Driven ETL?
Metadata driven ETL establishes a layer to simplify the data loading process into a DW system. You can decide whether to process the data into the system (or) not depending on the metadata. Hence you can call it is as metadata-driven ETL.
The significant role of Metadata in determining the success (or) failure of a DW system was explained in detail in this tutorial.
We also explored the Meaning, Role, Examples, Types, Challenges of Metadata in detail along with the concerned pictorial representation.
We hope these informative tutorials from this Data Warehouse Series enriched your knowledge on Data Warehousing and related concepts!!!