Managing data isn't always the most fun part of data science. It involves several essential steps that take time, yet those steps are also where much of the value is created.
In this article, I share with you my methodology for tackling a data project that is based on industry best practices and my experience.
Before we embark on implementing a new predictive algorithm that we believe will predict the dropout rate of our contacts more accurately, we need to make sure that we have the necessary data, that we have corrected or removed any problematic values or observations, and that we have turned that information into a dataset we can actually work with.
💡 What better way to do this than with a handy checklist covering data collection from external sources, understanding the problem to be solved, the transformation steps, and documenting the entire process?
In a data project, initial data can come from several types of sources: internal or external, private or public, raw or processed.
Everyone tries to maintain an acceptable level of quality, but experience shows it is best to always validate the data yourself, even if your contact or supplier assures you that everything is in order.
The time spent on these checks is not wasted. If there are inconsistencies, it is better to find them at the start of the project than when it is nearing completion. And if everything seems consistent, this step will have allowed us to begin familiarizing ourselves with the data.
It is essential to verify that the systems used to collect the data work correctly, whether it is an FTP (File Transfer Protocol) server, an API (Application Programming Interface), or even email. This step ensures that the usernames and passwords are correct, that the connection parameters work, and that the access is operational.
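To give a concrete idea, here is a minimal sketch of such a check in Python, assuming a token-protected REST endpoint and an FTP account. The URLs, credentials, and the health endpoint are hypothetical placeholders, not part of any specific project.

```python
import ftplib
import requests

# Hypothetical endpoints and credentials: replace with your project's values.
API_URL = "https://api.example.com/v1/health"
API_TOKEN = "..."
FTP_HOST = "ftp.example.com"
FTP_USER = "project_user"
FTP_PASSWORD = "..."


def check_api() -> bool:
    """Return True if the API answers with a 2xx status using our token."""
    try:
        response = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            timeout=10,
        )
        return response.ok
    except requests.RequestException:
        return False


def check_ftp() -> bool:
    """Return True if we can log in to the FTP server and list its root directory."""
    try:
        with ftplib.FTP(FTP_HOST, timeout=10) as ftp:
            ftp.login(FTP_USER, FTP_PASSWORD)
            ftp.nlst()  # listing the root directory proves the access actually works
        return True
    except ftplib.all_errors:
        return False


if __name__ == "__main__":
    print("API reachable:", check_api())
    print("FTP reachable:", check_ftp())
```

Running a script like this on day one, and again whenever credentials change, turns "the access should work" into something you have actually verified.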
When you have deadlines to communicate to your customer, you'll be in a better position if you know that a vendor is behind schedule with their API setup.
On the same theme as before: beware of other people's systems.
If you're working from a template or picking up other people's notes or code, be sure to come back to them after you've worked on your project for a while. Once you become more familiar with the project, the data, and the transformations, you'll notice things to fix or optimize and maybe things you missed the first time around.
When you start working on a data project, it is very likely that you are working on a subject you do not know well enough. For example, the jargon of B2B (business to business) is not the same as that of B2C (business to consumer). A retail company does not have the same data as one in events or insurance.
Some information can be found on the Internet; in other cases, you will have to dive into your project's data or associated data (product repositories, geolocation data). Finally, don't forget your customer: they rely on your data expertise, so don't hesitate to ask for their domain expertise.
Familiarity with the field is essential throughout the process. This invaluable knowledge is applied in the processing and verification of the data, right up to presenting the final solution to the customer.
What questions are you trying to answer with this data? How are you going to answer them? Are you going to run models and present the results in a meeting with the client, send them CSV files, or write an algorithm that will run in production?
Knowing what to produce will help you decide what to do with the data: how to prepare it, how to handle missing values, and what format the data should be in once all the transformations are in place.
Exploring the data is an important step in the process, and one you will repeat several times. It lets you keep familiarizing yourself with the data at a finer level of granularity.
Here are some questions to get you started:
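For instance: how many rows and columns are there, are the types what you expect, how much data is missing, and what do the distributions look like? A minimal pandas sketch can answer these first questions; the file name here is a hypothetical placeholder.

```python
import pandas as pd

# Hypothetical file name; use whatever raw extract your project starts from.
df = pd.read_csv("contacts.csv")

print(df.shape)                     # how many rows and columns do we have?
print(df.dtypes)                    # are the types what we expect (dates, numbers, text)?
print(df.isna().mean().round(3))    # what share of each column is missing?
print(df.describe(include="all"))   # ranges, distributions, suspicious constant values
print(df.nunique())                 # which columns could serve as identifiers or join keys?
```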
Data preparation generally requires a cleaning step, which is not necessarily the easiest one. It involves handling missing values, outliers, and duplicates, and more broadly tidying up the data (see the concept of tidy data).
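As a hedged illustration, here is what that cleaning step might look like with pandas. The input file, the column names (contact_id, age, signup_date), and the thresholds are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("contacts.csv", parse_dates=["signup_date"])

# Duplicates: keep only the first occurrence of each contact.
df = df.drop_duplicates(subset="contact_id", keep="first")

# Missing values: drop rows without an identifier, impute a numeric column.
df = df.dropna(subset=["contact_id"])
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: a simple rule keeping only ages in a plausible range.
df = df[df["age"].between(16, 100)]

# Tidy data: aim for one row per observation and one column per variable
# (reshape with melt or pivot if the raw extract is in a wide format).
```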
The choices you make during processing have consequences for the data, and therefore for the quality of the analyses and models built downstream. Those consequences can be immediate and visible, delayed and hidden, or any combination of the two. Knowing your client's initial data, domain, and activity well can save you a lot of trouble later on.
When transforming or merging data, always stop and verify that the result meets expectations. A simple check on the expected number of columns or rows is a good start.
It may also be worth taking a few examples and manually checking that the results are correct. It can be tedious, but far less so than discovering an error after going into production.
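Here is a sketch of what such checks can look like in pandas, assuming a left join whose row count should stay constant. The expectations and column handling are illustrative, not a fixed recipe.

```python
import pandas as pd


def validate_merge(before: pd.DataFrame,
                   after: pd.DataFrame,
                   expected_columns: set[str]) -> None:
    """Cheap sanity checks to run right after a merge (illustrative expectations)."""
    # A left join should not create rows out of thin air;
    # a growing row count usually means duplicated join keys.
    assert len(after) == len(before), \
        f"Row count changed: {len(before)} -> {len(after)}"

    # Every column relied on downstream must be present.
    missing = expected_columns - set(after.columns)
    assert not missing, f"Missing columns after merge: {missing}"

    # Spot check: print a handful of rows to compare manually against the source.
    print(after.sample(n=min(5, len(after)), random_state=42))
```

A few assertions like these cost seconds to write and will stop a broken transformation long before it reaches a model or a client deliverable.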
In other cases, even if the transformations are correct, they do not necessarily mean what you think they do. It is therefore very important to validate them and ask questions when the calculations do not match what you see.
It is also possible that the validation method itself is flawed. Staying critical during this validation step ensures that the data produced is consistent and usable going forward.
When attempting complex operations on large datasets, do not build the cleaning and transformation logic directly on the full dataset. Instead, save your energy and time by developing your processing on dataset prototypes that gather representative examples of the relevant use cases.
Once it works, try it out on the real data, and as discussed earlier, be sure to validate your results.
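One way to build such a prototype with pandas is sketched below, assuming a hypothetical transactions.csv with amount and transaction_id columns; the sampling rules are only examples of what "relevant use cases" might mean for a given project.

```python
import pandas as pd

full = pd.read_csv("transactions.csv")  # hypothetical full dataset

# A small prototype gathering representative cases: ordinary rows,
# missing values, duplicated identifiers, and extreme amounts.
prototype = pd.concat([
    full.sample(n=50, random_state=0),
    full[full["amount"].isna()].head(10),
    full[full.duplicated("transaction_id", keep=False)].head(10),
    full.nlargest(5, "amount"),
])

# Develop and debug the cleaning and transformation logic on `prototype`,
# then rerun it on `full` and validate the results as discussed above.
```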
Documentation is, for me, one of the things that will make you stand out from the crowd and increase your efficiency. When you embark on a data project, many requests arrive as it progresses: the deliverables can change, a domain specificity unknown at the start can force you to rethink the data transformations, or after a few months in production the client asks you to remind them of the deduplication rules. In all of these cases and more, detailed documentation will save you precious time.
🔎 From personal experience, in addition to the essential specifications and the data dictionary, I do my best to maintain a log listing all the information encountered, the links to relevant data, the customer's questions, and the changes that have been made along with their justification. It doesn't have to be a presentable document, but these details are a lifesaver when you need to present or hand over the project to someone, or when you or the client have questions a few months later!
Finally, use a versioning tool (Git) for your projects. Put simply, versioning lets you work on different tasks simultaneously thanks to a reference repository (a copy of the project on a local network or in the cloud). It also lets you archive a set of files while keeping the timeline of all the changes made to them. This way, there is no risk of altering or losing the information already stored in the shared repository.
Here we are! We are now ready to work on our new predictive algorithm. A lot of work was needed to get here, and that is the reality of today's production systems.
Now that the data is clean and organized, we can move on to the part that most data scientists prefer: analysis and modeling.
But we must never forget that no exceptional analysis or algorithm will completely compensate for unprocessed data!