An overview of the data management process and how it adds value to data
Following the feedback I received after my last article, “Data manager, what is my job?”, where I describe my work as a data manager at Avanci, this article aims to give you an overview of data management. I hope it answers many of the questions I was asked.
At Avanci, data managers are mainly involved in the construction of Customer Data Platforms (CDPs).
The setup of a CDP includes several important steps which constitute the heart of the data manager’s work.
At the start of a data project, the data sources are generally multiple and diverse, and the data comes in a variety of formats.
The first step consists of configuring access to the data, defining the data formats in use, and ensuring that the data flows are stable.
As an example, the data can be transferred via flat files (CSV, Excel) from the customer’s business management software. As data managers, we must check that the structure of the incoming data matches the data model required for integration. In other cases, an API (Application Programming Interface) is used to collect or send information. It is also possible to interface directly with the customer’s databases.
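For illustration, here is a minimal sketch of such a structure check in Python, assuming a semicolon-separated export and hypothetical column and file names (customer_id, email, purchase_date, amount, export_client.csv):

```python
import csv

# Hypothetical column layout expected by our data model; the real
# layout depends on the customer's business management software.
EXPECTED_COLUMNS = ["customer_id", "email", "purchase_date", "amount"]

def check_structure(path: str) -> bool:
    """Check that the header of an incoming flat file matches the
    data model expected by the integration pipeline."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter=";"))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        print(f"Rejected file {path}: missing columns {missing}")
        return False
    return True

if check_structure("export_client.csv"):
    print("Structure OK, file can enter the integration pipeline")
```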
Each data source has its own specificities, which need to be considered to ensure the correct operation of the data pipeline. In addition, the collection frequency, the size of the data, and the flow scheduling are parameters that need to be set for optimal operation.
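As a sketch of what these parameters can look like, here is a hypothetical per-source configuration (all field names and values are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class SourceConfig:
    name: str          # data source identifier
    fmt: str           # "csv", "json", "api", "db"
    schedule: str      # collection frequency, here as a cron expression
    max_size_mb: int   # expected upper bound on the data size

SOURCES = [
    SourceConfig("pos_exports", "csv", "0 2 * * *", 500),   # nightly flat files
    SourceConfig("web_events", "api", "*/15 * * * *", 50),  # API polled every 15 min
]
```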
Depending on the size of the customer’s data sources, the data types, and the customer’s needs, the quantity of data to handle can vary a lot. Between a customer database of tens of thousands of rows and one of several million, the technologies and techniques used to handle the data also vary. The data manager must make the right choices and anticipate future needs and challenges, such as integration and processing time, data access, or the frequency of processing.
Once the systems are in production, routines are put in place to produce regular measurements of data quality and coherence. This information is made available via dashboards to follow the evolution of key indicators and to raise alerts in case of anomalies. This way, any issue can be quickly addressed.
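As an illustration, a quality routine can be as simple as computing the share of missing values and raising an alert above a threshold; the table name, field, and threshold below are hypothetical:

```python
import sqlite3

ALERT_THRESHOLD = 0.05  # hypothetical: alert if more than 5% of emails are missing

def daily_quality_report(db_path: str) -> None:
    """Compute a simple quality indicator that feeds the monitoring dashboard."""
    con = sqlite3.connect(db_path)
    total = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    missing = con.execute(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL OR email = ''"
    ).fetchone()[0]
    con.close()
    rate = missing / total if total else 0.0
    print(f"{total} rows, {rate:.1%} missing emails")
    if rate > ALERT_THRESHOLD:
        print("ALERT: missing-email rate above threshold")  # hook for real alerting
```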
Here are two examples to help you visualize what we call raw data. The data are generated randomly from fake databases for test purposes.
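As a minimal sketch, synthetic raw records of this kind can be produced and serialized in both flat-file and JSON form (all field names and values below are invented):

```python
import json
import random

# Purely synthetic record, in the spirit of the fake test data mentioned above.
record = {
    "customer_id": random.randint(1000, 9999),
    "first_name": random.choice(["Alice", "Karim", "Jeanne"]),
    "amount": round(random.uniform(5, 200), 2),
}

csv_line = ";".join(str(v) for v in record.values())  # flat-file style
json_line = json.dumps(record)                        # API / event style
print(csv_line)
print(json_line)
```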
From the raw data, the next step is to integrate and organize it. If we take the CSV file or the JSON data above, below is an example of the desired result: a clean data table organized in a structured database, queried with SQL (Structured Query Language).
The example below has been generated from the demo databases of Microsoft. The domain name contoso.com belongs to Microsoft.
A database’s role is to store information such as names, addresses, phone numbers, transactions, and all other sorts of data in an organized fashion that enables operations such as processing, filtering, and sorting. It supports extracting measurements from the data or, for example, retrieving a person’s information from their name.
A database usually contains several tables linked to each other using keys:
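As a minimal sketch of this idea, the SQLite example below links a customers table and a transactions table through a key, then retrieves a person’s purchases from their name (the schema and data are invented for illustration; only the contoso.com domain comes from the Microsoft demo data mentioned above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Customers table: one row per person, identified by a primary key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        email       TEXT
    );
    -- Transactions table: linked to customers through a foreign key.
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER REFERENCES customers(customer_id),
        amount         REAL,
        purchase_date  TEXT
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Jane Doe', 'jane.doe@contoso.com')")
con.execute("INSERT INTO transactions VALUES (10, 1, 42.5, '2023-05-01')")

# Retrieve a person's purchases from their name by joining on the key.
for row in con.execute("""
    SELECT c.name, t.amount, t.purchase_date
    FROM customers c JOIN transactions t ON t.customer_id = c.customer_id
    WHERE c.name = 'Jane Doe'
"""):
    print(row)
```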
With access to the data for analytics or other applications in mind, the data manager has at their disposal a range of techniques to organize the data. For example, a table can be split into one table containing the primary information and another with secondary information. Another example is the creation of views of the data. This method gives access to the data in a specific form (aggregated, concatenated, or other) without the need to create additional tables, which in return provides a lot of flexibility.
These techniques allow the database architecture to be optimized for the processing of the data and its usage.
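To make the view technique concrete, here is a minimal SQLite sketch in which a view exposes aggregated revenue per customer without materializing a new table (names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    INSERT INTO transactions VALUES (1, 42.5), (1, 10.0), (2, 99.0);

    -- A view exposes the data in aggregated form without creating
    -- an additional physical table.
    CREATE VIEW customer_revenue AS
        SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS nb_orders
        FROM transactions
        GROUP BY customer_id;
""")
for row in con.execute("SELECT * FROM customer_revenue"):
    print(row)  # (1, 52.5, 2) then (2, 99.0, 1)
```

Because the view is computed on demand, changing the aggregation later only requires redefining the view, not rebuilding any tables.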
Once the data has been integrated and organized, the third step of data management starts: it is now time to clean, improve, enhance, and prepare the data.
Depending on the customer’s needs and the future usage of the data, the data processing steps can vary in complexity and thoroughness.
Here are examples of data cleaning:
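As a hedged illustration, the sketch below shows three common cleaning operations on invented contact records: trimming and lowercasing emails, standardizing phone numbers, and deduplicating on the normalized email:

```python
import re

raw_contacts = [
    {"email": " Jane.Doe@CONTOSO.com ", "phone": "06 12 34 56 78"},
    {"email": "jane.doe@contoso.com",   "phone": "0612345678"},
]

def clean(contact: dict) -> dict:
    """Normalize an individual record before integration."""
    email = contact["email"].strip().lower()     # trim and lowercase emails
    phone = re.sub(r"\D", "", contact["phone"])  # keep digits only
    return {"email": email, "phone": phone}

# Deduplicate on the normalized email so the two variants above merge.
seen, cleaned = set(), []
for c in map(clean, raw_contacts):
    if c["email"] not in seen:
        seen.add(c["email"])
        cleaned.append(c)
print(cleaned)  # a single normalized contact remains
```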
The data manager also has a range of techniques to increase the value of the data by creating variables calculated from other data fields.
Here are some examples:
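For instance, the sketch below derives an age, a recency, and an activity segment from two existing date fields (the field names, dates, and the 90-day rule are invented for illustration):

```python
from datetime import date

customer = {"birth_date": date(1985, 3, 14), "last_purchase": date(2023, 4, 2)}
today = date(2023, 6, 1)  # frozen "today" so the example is reproducible

# Derived variables computed from existing fields.
age = (today - customer["birth_date"]).days // 365
days_since_purchase = (today - customer["last_purchase"]).days
segment = "active" if days_since_purchase <= 90 else "dormant"

print(age, days_since_purchase, segment)  # 38 60 active
```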
Finally, another step of data enhancement involves the computation of aggregated variables and other specific indicators required to exploit the data.
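One common example in a CDP context is computing recency, frequency, and monetary-value indicators per customer; the sketch below derives such aggregates with a simple SQL query (schema and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL, purchase_date TEXT);
    INSERT INTO transactions VALUES
        (1, 42.5, '2023-05-01'), (1, 10.0, '2023-05-20'), (2, 99.0, '2023-01-15');
""")
# Recency / frequency / monetary-style indicators per customer.
query = """
    SELECT customer_id,
           MAX(purchase_date) AS last_purchase,  -- recency
           COUNT(*)           AS nb_orders,      -- frequency
           SUM(amount)        AS total_spent     -- monetary value
    FROM transactions
    GROUP BY customer_id
"""
for row in con.execute(query):
    print(row)
```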
These examples are only a glimpse; it is possible to imagine a multitude of other pieces of information that can be calculated, aggregated, or categorized through data modeling.
To wrap up this article, data management work would be incomplete without the planning of all these tasks and the assurance that all the data pipelines and processes operate autonomously. At this last stage, the data manager’s work is to optimize the data flows and processing before the production phase, in order to minimize human intervention afterwards.
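As a closing sketch, the pipeline steps can be chained in a single entry point triggered by a scheduler, so that no human intervention is needed once in production (the step names and cron schedule are illustrative):

```python
def collect():
    print("collecting source files...")

def integrate():
    print("loading raw data into the database...")

def clean_and_enrich():
    print("cleaning and computing derived variables...")

def publish_indicators():
    print("refreshing quality dashboards...")

def run_pipeline():
    # Each step runs in sequence, with no human intervention needed.
    for step in (collect, integrate, clean_and_enrich, publish_indicators):
        step()

if __name__ == "__main__":
    # In production this would be triggered by a scheduler, e.g. a
    # crontab entry like: 0 3 * * * python run_pipeline.py
    run_pipeline()
```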