Data Manager
Managing data isn't always the most enjoyable part of data science. It involves several essential steps that take time, yet those steps are also where much of the value is produced.
In this article, I share my methodology for tackling a data project, based on industry best practices and my own experience.
Before we embark on implementing a new predictive algorithm that we believe will predict the dropout rate of our contacts more accurately, we need to make sure that we have the necessary data, that we have corrected or removed any problematic values or observations, and that we have turned that information into a dataset we can actually work with.
💡 What better way to do this than with a handy checklist covering data collection from external sources, understanding the problem to be solved, the transformation steps, and documenting the entire process?
Also read: Data Warehouse vs Data Lake vs Data Mart: the guide - Definitions
#1 - External sources and factors
Data validation
In a data project, initial data can come from several types of sources: internal or external, private or public, raw or processed.
Everyone tries to maintain an acceptable level of quality, but experience shows it is best to always validate the data yourself, even if your contact or supplier assures you that everything is in order.
The time spent on these checks is not wasted. If there are inconsistencies, it is better to find them at the start of the project rather than when it is near completion. And if everything looks consistent, this step will have let you start familiarizing yourself with the data.
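As an illustration, here is a minimal pandas sketch of the kind of checks I run on a delivered file. The file name, the expected columns, and the rules are hypothetical and should be adapted to your own project.

```python
import pandas as pd

# Hypothetical delivery file and expected schema -- adapt to your project.
EXPECTED_COLUMNS = {"contact_id", "email", "signup_date", "country"}

df = pd.read_csv("supplier_extract.csv")

# Basic sanity checks on the supplier file before going any further.
assert EXPECTED_COLUMNS.issubset(df.columns), "missing expected columns"
assert not df.empty, "the file is empty"
assert df["contact_id"].is_unique, "duplicate contact identifiers"
assert df["signup_date"].notna().all(), "missing signup dates"

print(f"{len(df)} rows received, schema looks consistent")
```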
Systems validation
It is essential to verify that the systems used to collect the data work correctly, whether the data arrives via an FTP server (File Transfer Protocol), an API (Application Programming Interface), or even by email. This step ensures that usernames and passwords are correct, that the connection parameters work, and that the access is operational.
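As a sketch of what such a check might look like in Python, here is an example using the requests library and ftplib; the endpoints, credentials, and token are placeholders to replace with your own.

```python
import ftplib
import requests

# Hypothetical endpoints and credentials -- replace with your own.
API_URL = "https://api.example.com/health"
FTP_HOST, FTP_USER, FTP_PASSWORD = "ftp.example.com", "user", "password"

# Check that the API answers and that the token is accepted.
response = requests.get(API_URL, headers={"Authorization": "Bearer <token>"}, timeout=10)
print("API status:", response.status_code)

# Check that the FTP login works and that files are visible.
with ftplib.FTP(FTP_HOST) as ftp:
    ftp.login(FTP_USER, FTP_PASSWORD)
    print("FTP files:", ftp.nlst()[:5])
```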
When you have deadlines to communicate to your customer, you'll be in a better position if you know that a vendor is behind schedule with their API setup.
On the same theme as before: beware of other people's systems.
Proofreading of sources
If you're working from a template or picking up other people's notes or code, be sure to come back to them after you've worked on your project for a while. Once you become more familiar with the project, the data, and the transformations, you'll notice things to fix or optimize and maybe things you missed the first time around.
#2 - Understand the subject
Develop domain knowledge
When you start a data project, it is very likely that you are working on a subject you do not know well enough. For example, the jargon of B2B (business to business) is not the same as that of B2C (business to consumer). A retail company does not have the same data as one in events or insurance.
Some information can be found on the Internet; in other cases, you will have to dive into your project data or associated data (product repository, geolocation data). Finally, don't forget your customer: they rely on your data expertise, so don't hesitate to ask for their domain expertise.
Familiarity with the field is essential throughout the process. This invaluable knowledge is applied from the processing and verification of the data all the way to presenting the final solution to the customer.
Understand the deliverables
What questions are you trying to answer with this data? How are you going to answer them? Are you going to run models and present the results in a meeting with the client, send them CSV files, or write an algorithm that will run in production?
Knowing what you have to produce will help you decide what to do with the data: how to prepare it, how to handle missing values, and what format you want it in once all the transformations are in place.
Explore the data
Data exploration is an important step in the process and will be repeated several times. It lets you keep familiarizing yourself with the data at a finer level of granularity.
Here are some questions to get you started (the pandas sketch after the list shows one way to answer them):
- What are the unique values for each variable?
- How are discrete or continuous values distributed?
- How many duplicates do you find using a reference variable or a combination of variables?
- In the contact data, how many email addresses contain invalid domains?
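As a starting point, here is a minimal exploration sketch in pandas. The file and column names (contacts.csv, age, country, email, and so on) are hypothetical, and the list of accepted email domains is only an example rule.

```python
import pandas as pd

# Hypothetical contact dataset -- column names are examples.
df = pd.read_csv("contacts.csv")

# Unique values for each variable.
print(df.nunique())

# Distribution of a continuous variable and of a categorical one.
print(df["age"].describe())
print(df["country"].value_counts(normalize=True))

# Duplicates on a reference variable or on a combination of variables.
print(df.duplicated(subset=["email"]).sum())
print(df.duplicated(subset=["first_name", "last_name", "zip_code"]).sum())

# Email addresses whose domain is not in an accepted list (example rule).
accepted_domains = {"gmail.com", "outlook.com", "yahoo.com"}
domains = df["email"].str.split("@").str[-1].str.lower()
print((~domains.isin(accepted_domains)).sum(), "addresses with unexpected domains")
```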
#3 - Data transformation
Data cleaning
Data preparation generally requires a cleaning step, which is not necessarily the easiest one. This is about handling missing values, outliers, and duplicates, and more broadly tidying up the data (see the concept of tidy data).
The processing choices you make have consequences for the data, and therefore for the quality of the analyses and models built downstream. These consequences can be immediate and visible, delayed and hidden, or any combination of the two. Knowing your client's initial data, domain, and activity well can save you a lot of trouble later on.
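To make this concrete, here is a small pandas sketch of typical cleaning choices. The dataset, columns, and rules (keeping the most recent record per email, flagging missing ages, clipping implausible values, unpivoting monthly columns) are illustrative assumptions, not a universal recipe.

```python
import pandas as pd

# Hypothetical dataset -- the rules below are project-specific examples.
df = pd.read_csv("contacts.csv")

# Duplicates: keep the most recent record per email address.
df = df.sort_values("updated_at").drop_duplicates(subset="email", keep="last")

# Missing values: drop rows without an email, flag missing ages instead of guessing.
df = df.dropna(subset=["email"])
df["age_missing"] = df["age"].isna()

# Outliers: clip implausible ages rather than deleting the rows (a project-level choice).
df["age"] = df["age"].clip(lower=0, upper=110)

# Tidy data: one observation per row, e.g. unpivot monthly columns into a long format.
df_long = df.melt(id_vars=["contact_id"],
                  value_vars=["orders_2023_01", "orders_2023_02"],
                  var_name="month", value_name="orders")
```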
Validate data transformations
When transforming or merging data, always stop and verify that the result meets expectations. A simple check on the number of expected columns or rows is a good start.
It might be worth taking a few examples and manually checking that the results are correct. It can be tedious, but much less so than discovering an error after going into production.
In other cases, even if the transformations are technically correct, the results don't necessarily mean what you think they do. It is therefore very important to validate them and to ask questions if the calculations do not match what you see.
And it is also possible that the validation method is flawed. Remaining critical in this transformation validation step ensures that the data produced is consistent and usable for the future.
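As an example, here is a minimal sketch of this kind of validation, using a hypothetical merge between a contacts file and an orders file; the file names, the join key, and the sample identifiers are assumptions.

```python
import pandas as pd

# Hypothetical merge between contacts and their orders.
contacts = pd.read_csv("contacts.csv")
orders = pd.read_csv("orders.csv")

merged = contacts.merge(orders, on="contact_id", how="left", validate="one_to_many")

# Every contact identifier must still be present after the left join.
assert merged["contact_id"].nunique() == contacts["contact_id"].nunique()

# The column count should match what the two sources lead us to expect.
expected_cols = len(contacts.columns) + len(orders.columns) - 1  # the join key is shared
assert len(merged.columns) == expected_cols

# Manually check a few known examples against the source systems.
print(merged[merged["contact_id"].isin(["C001", "C002"])])
```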
Experiment in a test environment
When attempting complex operations on large datasets, don't build your cleaning and transformation logic directly on the full dataset. Instead, save your time and energy by working with prototype datasets that bring together representative examples of the use cases you need to handle.
Once it works, try it out on real data - and as discussed earlier, be sure to validate your results.
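For example, a prototype might look like the following sketch; the columns, the edge cases, and the clean_contacts function are hypothetical and only illustrate the approach.

```python
import pandas as pd

# A hand-built prototype covering the edge cases we care about:
# a duplicated contact, a missing email, an implausible age. Values are illustrative.
prototype = pd.DataFrame({
    "contact_id": ["C001", "C002", "C002", "C003"],
    "email": ["a@gmail.com", None, None, "b@b.com"],
    "age": [34, 51, 51, -2],
})

def clean_contacts(df: pd.DataFrame) -> pd.DataFrame:
    """Example cleaning pipeline, developed and debugged against the prototype."""
    return (df.drop_duplicates(subset="contact_id")
              .dropna(subset=["email"])
              .assign(age=lambda d: d["age"].clip(lower=0)))

# Debug the pipeline on the tiny dataset first, then run it on the real data
# and validate the results as discussed above.
print(clean_contacts(prototype))
# full = clean_contacts(pd.read_csv("contacts.csv"))
```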
#4 - Documentation and versioning
Documentation is, for me, one of the things that will make you stand out from the crowd and increase your efficiency. Over the course of a data project, new requests keep arriving, deliverables change, a domain specificity unknown at the start may force you to rethink a data transformation, or the client may ask you, after a few months in production, to remind them of the deduplication rules. In all of these cases and more, detailed documentation will save you precious time.
🔎 From personal experience, in addition to the essential specifications and the data dictionary, I do my best to maintain a log document listing all the information encountered, links to relevant data, the customer's questions, and the changes that have been carried out along with their justification. It doesn't have to be a presentable document, but these details are a lifesaver when you need to present or hand over the project to someone, or if you or the client have questions a few months down the line!
Finally, use a version control tool (Git) for your projects. Put simply, versioning lets you work on several tasks in parallel thanks to a central repository (a copy of the project on a local network or in the cloud). It also lets you archive a set of files while keeping the timeline of all the changes made to them. This way, there is no risk of altering or losing the information already stored in the shared repository.
Here we are! We are now ready to work on our new predictive algorithm. A lot of work was needed to get here, and that is the reality of today's production systems.
To sum up, here is our final checklist:
- Data validation
- Systems validation
- Proofreading of sources
- Development of domain knowledge
- Understanding of deliverables
- Data exploration
- Data cleaning
- Validating data transformations
- Experimentation on data samples
- Documentation of all steps
Now that the data is clean and organized, we can move on to the part that most data scientists prefer: analysis and modeling.
But we must never forget that no exceptional analysis or algorithm will ever fully compensate for poorly prepared data!