An overview of the data management process and how it adds value to data
Following the feedback I received after my last article, “Data manager, what is my job?”, where I describe my work as a data manager at Avanci, this article aims to give you an overview of data management. I hope it answers many of the questions I was asked.
At Avanci, data managers are mainly involved in the construction of Customer Data Platforms (CDPs).
The setup of a CDP includes several important steps which constitute the heart of the data manager’s work.
At the start of a data project, the data sources are generally multiple and diverse, and the data comes in a variety of formats.
The first step consists of configuring access to the data, defining the data formats in use, and ensuring that the data flows are stable.
As an example, the data can be transferred via flat files (CSV, Excel) from the customer’s business management software. As data managers, we must check that the structure of the incoming data matches the data model required for integration. In other cases, an API (Application Programming Interface) is used to collect or send information. It is also possible to interface directly with the customer’s databases.
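For illustration, here is a minimal sketch of such a structure check in Python, assuming a semicolon-separated export and hypothetical column and file names (customer_id, email, purchase_date, amount, export_client.csv):

```python
import csv

# Hypothetical column layout expected by our data model; the real
# layout depends on the customer's business management software.
EXPECTED_COLUMNS = ["customer_id", "email", "purchase_date", "amount"]

def check_structure(path: str) -> bool:
    """Check that the header of an incoming flat file matches the
    data model expected by the integration pipeline."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter=";"))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        print(f"Rejected file {path}: missing columns {missing}")
        return False
    return True

if check_structure("export_client.csv"):
    print("Structure OK, file can enter the integration pipeline")
```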
Each data source has its own specificities, which need to be considered to ensure the correct operation of the data pipeline. In addition, the collection frequency, the size of the data, and the flow scheduling are parameters that need to be set for optimal operation.
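As a sketch of what these parameters can look like, here is a hypothetical per-source configuration (all field names and values are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class SourceConfig:
    name: str          # data source identifier
    fmt: str           # "csv", "json", "api", "db"
    schedule: str      # collection frequency, here as a cron expression
    max_size_mb: int   # expected upper bound on the data size

SOURCES = [
    SourceConfig("pos_exports", "csv", "0 2 * * *", 500),   # nightly flat files
    SourceConfig("web_events", "api", "*/15 * * * *", 50),  # API polled every 15 min
]
```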
Depending on the size of the customer’s data sources, the data types, and the customer’s needs, the quantity of data to handle can vary a lot. Between a customer database of tens of thousands of rows and one of several million, the technologies and techniques used to handle the data also vary. The data manager must make the right choices and anticipate future needs and challenges, such as integration and processing time, data access, or the frequency of processing.
Once the systems are in production, routines are put in place to produce regular measurements of data quality and coherence. This information is made available via dashboards to follow the evolution of key indicators and to raise alerts in case of anomalies. This way, any issue can be quickly addressed.
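As an illustration, a quality routine can be as simple as computing the share of missing values and raising an alert above a threshold; the table name, field, and threshold below are hypothetical:

```python
import sqlite3

ALERT_THRESHOLD = 0.05  # hypothetical: alert if more than 5% of emails are missing

def daily_quality_report(db_path: str) -> None:
    """Compute a simple quality indicator that feeds the monitoring dashboard."""
    con = sqlite3.connect(db_path)
    total = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    missing = con.execute(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL OR email = ''"
    ).fetchone()[0]
    con.close()
    rate = missing / total if total else 0.0
    print(f"{total} rows, {rate:.1%} missing emails")
    if rate > ALERT_THRESHOLD:
        print("ALERT: missing-email rate above threshold")  # hook for real alerting
```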
Here are two examples to help you visualize what we call raw data. The data are generated randomly from fake databases for test purposes.
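As a minimal sketch, synthetic raw records of this kind can be produced and serialized in both flat-file and JSON form (all field names and values below are invented):

```python
import json
import random

# Purely synthetic record, in the spirit of the fake test data mentioned above.
record = {
    "customer_id": random.randint(1000, 9999),
    "first_name": random.choice(["Alice", "Karim", "Jeanne"]),
    "amount": round(random.uniform(5, 200), 2),
}

csv_line = ";".join(str(v) for v in record.values())  # flat-file style
json_line = json.dumps(record)                        # API / event style
print(csv_line)
print(json_line)
```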
From the raw data, the next step is to integrate and organize it. If we take the CSV file or the JSON data above, below is an example of the desired result: a clean data table organized in a structured database, queried with SQL (Structured Query Language).
The example below has been generated from the demo databases of Microsoft. The domain name contoso.com belongs to Microsoft.
A database’s role is to store information such as names, addresses, phone numbers, transactions, and all other sorts of data in an organized fashion that enables operations such as processing, filtering, and sorting. It supports extracting measurements from the data or, for example, retrieving a person’s information from their name.
A database usually contains several tables linked to each other using keys:
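As a minimal sketch of this idea, the SQLite example below links a customers table and a transactions table through a key, then retrieves a person’s purchases from their name (the schema and data are invented for illustration; only the contoso.com domain comes from the Microsoft demo data mentioned above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Customers table: one row per person, identified by a primary key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        email       TEXT
    );
    -- Transactions table: linked to customers through a foreign key.
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER REFERENCES customers(customer_id),
        amount         REAL,
        purchase_date  TEXT
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Jane Doe', 'jane.doe@contoso.com')")
con.execute("INSERT INTO transactions VALUES (10, 1, 42.5, '2023-05-01')")

# Retrieve a person's purchases from their name by joining on the key.
for row in con.execute("""
    SELECT c.name, t.amount, t.purchase_date
    FROM customers c JOIN transactions t ON t.customer_id = c.customer_id
    WHERE c.name = 'Jane Doe'
"""):
    print(row)
```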
With access to the data for analytics or other applications in mind, the data manager has at their disposal a range of techniques to organize the data. For example, a table can be split into one table containing the primary information and another with secondary information. Another example is the creation of views of the data. This method gives access to the data in a specific form (aggregated, concatenated, or other) without the need to create additional tables, which in return provides a lot of flexibility.
These techniques allow the database architecture to be optimized for the processing of the data and its usage.
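To make the view technique concrete, here is a minimal SQLite sketch in which a view exposes aggregated revenue per customer without materializing a new table (names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    INSERT INTO transactions VALUES (1, 42.5), (1, 10.0), (2, 99.0);

    -- A view exposes the data in aggregated form without creating
    -- an additional physical table.
    CREATE VIEW customer_revenue AS
        SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS nb_orders
        FROM transactions
        GROUP BY customer_id;
""")
for row in con.execute("SELECT * FROM customer_revenue"):
    print(row)  # (1, 52.5, 2) then (2, 99.0, 1)
```

Because the view is computed on demand, changing the aggregation later only requires redefining the view, not rebuilding any tables.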
Once the data has been integrated and organized, the third step of data management starts: it is now time to clean, improve, enhance, and prepare the data.
Depending on the customer’s needs and the future usage of the data, the data processing steps can vary in complexity and thoroughness.
Here are examples of data cleaning:
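As a hedged illustration, the sketch below shows three common cleaning operations on invented contact records: trimming and lowercasing emails, standardizing phone numbers, and deduplicating on the normalized email:

```python
import re

raw_contacts = [
    {"email": " Jane.Doe@CONTOSO.com ", "phone": "06 12 34 56 78"},
    {"email": "jane.doe@contoso.com",   "phone": "0612345678"},
]

def clean(contact: dict) -> dict:
    """Normalize an individual record before integration."""
    email = contact["email"].strip().lower()     # trim and lowercase emails
    phone = re.sub(r"\D", "", contact["phone"])  # keep digits only
    return {"email": email, "phone": phone}

# Deduplicate on the normalized email so the two variants above merge.
seen, cleaned = set(), []
for c in map(clean, raw_contacts):
    if c["email"] not in seen:
        seen.add(c["email"])
        cleaned.append(c)
print(cleaned)  # a single normalized contact remains
```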
The data manager also has a range of techniques to increase the value of the data by creating variables calculated from other data fields.
Here are some examples:
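For instance, the sketch below derives an age, a recency, and an activity segment from two existing date fields (the field names, dates, and the 90-day rule are invented for illustration):

```python
from datetime import date

customer = {"birth_date": date(1985, 3, 14), "last_purchase": date(2023, 4, 2)}
today = date(2023, 6, 1)  # frozen "today" so the example is reproducible

# Derived variables computed from existing fields.
age = (today - customer["birth_date"]).days // 365
days_since_purchase = (today - customer["last_purchase"]).days
segment = "active" if days_since_purchase <= 90 else "dormant"

print(age, days_since_purchase, segment)  # 38 60 active
```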
Finally, another step of data enhancement involves the computation of aggregated variables and other specific indicators required to exploit the data.
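One common example in a CDP context is computing recency, frequency, and monetary-value indicators per customer; the sketch below derives such aggregates with a simple SQL query (schema and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL, purchase_date TEXT);
    INSERT INTO transactions VALUES
        (1, 42.5, '2023-05-01'), (1, 10.0, '2023-05-20'), (2, 99.0, '2023-01-15');
""")
# Recency / frequency / monetary-style indicators per customer.
query = """
    SELECT customer_id,
           MAX(purchase_date) AS last_purchase,  -- recency
           COUNT(*)           AS nb_orders,      -- frequency
           SUM(amount)        AS total_spent     -- monetary value
    FROM transactions
    GROUP BY customer_id
"""
for row in con.execute(query):
    print(row)
```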
These examples are only a glimpse; it is possible to imagine a multitude of other pieces of information that can be calculated, aggregated, or categorized through data modeling.
To wrap up this article, data management work would be incomplete without the planning of all these tasks and the assurance that all the data pipelines and processes operate autonomously. At this last stage, the data manager’s work is to optimize the data flows and processing before the production phase, in order to minimize human intervention afterwards.
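As a closing sketch, the pipeline steps can be chained in a single entry point triggered by a scheduler, so that no human intervention is needed once in production (the step names and cron schedule are illustrative):

```python
def collect():
    print("collecting source files...")

def integrate():
    print("loading raw data into the database...")

def clean_and_enrich():
    print("cleaning and computing derived variables...")

def publish_indicators():
    print("refreshing quality dashboards...")

def run_pipeline():
    # Each step runs in sequence, with no human intervention needed.
    for step in (collect, integrate, clean_and_enrich, publish_indicators):
        step()

if __name__ == "__main__":
    # In production this would be triggered by a scheduler, e.g. a
    # crontab entry like: 0 3 * * * python run_pipeline.py
    run_pipeline()
```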