The world of data is changing quickly, and it is easy to get lost in the technical terms that multiply with each advance. One sign of this rapid development is the steady stream of new, exotic terms for methodologies and techniques.
This article is a guide to data warehouses, datalakes and datamarts, three kinds of data storage and management systems. Each has distinct characteristics and use cases.
In this mini-series, we will explore the essential attributes of each method, their differences and the technical choices that justify their uses and configurations.
First, a few definitions inspired by William H. Inmon, an American computer scientist recognized by many as the father of the data warehouse.
What is a data warehouse?
💡A data warehouse is a relational database designed for analytical queries.
Bill Inmon describes the data warehouse as a centralized repository that meets these four criteria:
- Subject-oriented: the data is organized by theme (e.g. marketing, sales, inventory, human resources).
- Integrated: heterogeneous data from disparate sources is cohesively integrated and ready to use.
- Non-volatile: once loaded, data in the data warehouse is not modified or deleted.
- Time-variant (chronological): a data warehouse must make it possible to analyze how data evolves over time, thanks to historization.
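The non-volatile and chronological criteria can be illustrated with a minimal sketch. It assumes a hypothetical `product_price_history` table in an in-memory SQLite database: each change is appended as a new row rather than overwriting the old one, which is what makes it possible to ask what a value was on a given date. The table, columns and helper functions are illustrative names, not part of any standard.

```python
import sqlite3

# Hypothetical in-memory warehouse table; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_price_history (
        product_id INTEGER NOT NULL,
        price      REAL    NOT NULL,
        valid_from TEXT    NOT NULL  -- ISO date on which the price took effect
    )
    """
)

def record_price(product_id, price, valid_from):
    """Append a new version instead of updating the existing row (non-volatile)."""
    conn.execute(
        "INSERT INTO product_price_history VALUES (?, ?, ?)",
        (product_id, price, valid_from),
    )

def price_as_of(product_id, as_of):
    """Return the price in effect on a given date, or None if there is none."""
    row = conn.execute(
        """
        SELECT price FROM product_price_history
        WHERE product_id = ? AND valid_from <= ?
        ORDER BY valid_from DESC
        LIMIT 1
        """,
        (product_id, as_of),
    ).fetchone()
    return row[0] if row else None

record_price(1, 9.99, "2023-01-01")
record_price(1, 12.49, "2023-06-01")

print(price_as_of(1, "2023-03-15"))  # price valid in March
print(price_as_of(1, "2023-07-01"))  # price valid in July
```

Because no row is ever updated, both the old and the new price remain available for analysis.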
Unlike a simple database that collects and stores data, the data warehouse brings together and consolidates all business data at a single point. It generally incorporates an ETL-type integration process: Extract, Transform and Load. The ETL process extracts data from various sources (for example an API, an ERP or a CRM platform), then transforms and standardizes it to feed the data warehouse on a regular basis.
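As a rough illustration of the three ETL steps, here is a minimal sketch in Python. It assumes two hypothetical sources (a CRM export and an ERP export) represented as in-memory lists, and an in-memory SQLite database standing in for the warehouse. The field names and the `sales` table are made up for the example.

```python
import sqlite3

# Hypothetical raw records from two sources with different layouts.
crm_rows = [{"customer": " Alice ", "amount": "120.50", "currency": "eur"}]
erp_rows = [{"client_name": "BOB", "total_cents": 9900, "cur": "EUR"}]

def extract():
    """Extract: collect raw records, tagged with their source."""
    return [("crm", r) for r in crm_rows] + [("erp", r) for r in erp_rows]

def transform(source, row):
    """Transform: map each source's layout onto one standard schema."""
    if source == "crm":
        name = row["customer"]
        amount = float(row["amount"])
        currency = row["currency"]
    else:
        name = row["client_name"]
        amount = row["total_cents"] / 100
        currency = row["cur"]
    return (name.strip().title(), round(amount, 2), currency.upper(), source)

def load(conn, rows):
    """Load: write the standardized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, amount REAL, currency TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, [transform(source, row) for source, row in extract()])

for record in conn.execute("SELECT * FROM sales ORDER BY customer"):
    print(record)
```

In a real pipeline the extract step would call the source systems and the job would run on a schedule; the point here is only that differently shaped inputs end up in a single, standardized table.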
The key concept of the data warehouse is to allow users to access a unified version of the truth to enable decision making, reporting, or building prediction models.
Data warehouses also offer a fine degree of user permissions and access control, an essential feature of data governance.
What is a datamart?
A datamart is a subset of the data warehouse intended for a small group of users.
💡The datamart is business-oriented and brings together all the information specific to a subject, a function or a profession. It constitutes a compartment of structured and organized data to serve a specific community and meet specific business needs.
Bill Inmon also defined this concept: a datamart is a flow of data coming from the data warehouse whose goal is to present the data in a specialized way, aggregated and grouped by function.
A datamart is thus a controlled organizational structure within the data warehouse. It contains aggregated data from a smaller range of sources and aims to make analysis convenient and accessible to specific teams and business units.
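One simple way to picture a datamart is as an aggregated view over a warehouse table, restricted to what one team needs. The sketch below assumes a hypothetical `warehouse_sales` table in an in-memory SQLite database and builds a `sales_datamart` view for a sales team. Real datamarts are often separate tables or schemas rather than views, so this is only one possible layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical warehouse table covering several departments.
    CREATE TABLE warehouse_sales (
        sale_date TEXT, region TEXT, department TEXT, amount REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('2023-01-05', 'North', 'marketing', 100.0),
        ('2023-01-20', 'North', 'sales',     250.0),
        ('2023-02-02', 'South', 'sales',     400.0),
        ('2023-02-10', 'North', 'sales',     150.0);

    -- The datamart: sales-department rows only, aggregated per month and region.
    CREATE VIEW sales_datamart AS
    SELECT substr(sale_date, 1, 7) AS month,
           region,
           SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE department = 'sales'
    GROUP BY substr(sale_date, 1, 7), region;
    """
)

datamart_rows = conn.execute(
    "SELECT month, region, total_amount FROM sales_datamart ORDER BY month, region"
).fetchall()
for row in datamart_rows:
    print(row)
```

The sales team queries `sales_datamart` without having to know about the other departments' rows or the raw, unaggregated records.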
Depending on how an organization implements its technology and organizes its analytics team, the specifics of ownership and access for a datamart can vary. In some cases, teams and business units are fully responsible for their own datamarts, and access to them can be tightly controlled. In other cases, the boundaries and access may be looser.
Like the data warehouse, the datamart is easily integrated into business intelligence platforms.
What is a datalake?
Data warehouses and datamarts are based on the assumption that important business data is structured. Structured data has a predictable format, is easily interpreted by a machine, and can be stored in a relational database. A datalake, on the other hand, is an object or file store that can easily accommodate a large volume of raw, unstructured data such as free-form text, images, videos, sounds and other media, as well as structured data.
The most basic use of a datalake is to fully store huge volumes of data before deciding what to do with it. In this approach, the datalake is a staging area to the data warehouse. Another use that is starting to catch on is to train machine learning models using the very large set of unstructured learning data contained in the datalake.
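The "store first, decide later" approach can be sketched as follows. This example assumes a local temporary directory standing in for an object store, and the source names, file names and `put_object` / `stage_json_events` helpers are hypothetical. Raw payloads of any kind are written as-is, and only later are the structured ones picked out and parsed on their way to the warehouse.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a local temporary directory stands in for an object store.
lake_root = Path(tempfile.mkdtemp(prefix="datalake_"))

def put_object(source, day, name, payload):
    """Store raw bytes unchanged under a source/date prefix; no schema is imposed."""
    path = lake_root / source / day / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path

# Structured, free-text and binary data all land in the same store.
events = [{"user": "alice", "action": "login"}]
put_object("web_logs", "2023-06-01", "events.json", json.dumps(events).encode("utf-8"))
put_object("support", "2023-06-01", "ticket.txt", b"Customer reports a login issue.")
put_object("uploads", "2023-06-01", "photo.bin", bytes([0, 1, 2, 3]))

def stage_json_events(source):
    """Later step: read back the JSON files of one source for loading elsewhere."""
    staged = []
    for path in sorted((lake_root / source).rglob("*.json")):
        staged.extend(json.loads(path.read_text(encoding="utf-8")))
    return staged

staged = stage_json_events("web_logs")
print(staged)
```

Nothing in the store itself says what each file contains or how to read it, which is exactly why the organization and documentation of a datalake matter so much.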
The main disadvantage of the datalake is its “obscurity”. A data lake may be comprehensive, but its content is generally difficult to access and exploit with conventional tools, which makes it poorly suited to direct use by analysts.
🔎From a data governance perspective, datalakes typically do not offer, on their own, the fine level of user permissions and access control found in data warehouses.
Finally, an exceptionally disorganized and poorly managed data lake can quickly become opaque and then turn into a data swamp.
💡To sum up, the main advantage of the datalake over the data warehouse is its ability to store large amounts of media such as documents, images, videos and audio. These media can then be used as training and validation sets for machine learning models dealing, for example, with speech recognition or computer vision problems.
What comes next?
Recent developments in data storage models seek to combine the characteristics of the data warehouse and the data lake. The goal is to better manage the opacity of the datalake and to support the data science tools and languages that operate on less structured data and are typically associated with datalakes, such as Apache Spark and Python.
💡A data repository combining the characteristics of a data lake and a data warehouse is called a data lakehouse.
In a future article, we'll take a look at the different configurations and use cases of the data warehouse, the datamart and the datalake!