Speakers

  • Brian Campbell

Manage Data Science Projects Effectively

January 2022

If you lead a data science team, you've seen projects hit some common roadblocks. Your team can't get the data they need, or they're struggling to get a good model in front of customers. Even worse, you've built out a solution only to find that no one cares about it.

In this webinar, Brian Campbell, Engineering Manager at Lucid Software, will unpack best practices in data science project management and show how to harness the power of collaboration for effective, scalable, and successful deployment of data projects.

Key Takeaways:

  • How to build a strong data science foundation within your organization

  • How to deploy solutions and find problems worth solving by working with experts

  • How to better define your projects so your solutions are even more useful


Webinar Transcripts

Brian’s Background

Thank you all for joining me from all over the world. My name is Brian Campbell, and I'm currently an Engineering Manager at Lucid Software, creators of the visual collaboration tools Lucidchart and LucidSpark. Since we're all sharing where we're from, I'll let you know that I'm working out of Lodi, California, in California's Central Valley.

So a little bit about my experience as a leader in the data space. I've been with Lucid for seven years. I started there working in product development and growth as an engineer. From there, I moved over to leading what we now call our data infrastructure team, which is responsible for, among other things, data ingestion into our data warehouse, data warehouse management, and helping run the infrastructure for data science projects. Recently, I've gone on to manage both the data infrastructure team and the data science team, so I'm able to see things from both sides.

I'm excited to share some things that I have learned from getting about a dozen projects from concept to production.

What data scientists think they’ll do

So, as I talk to data scientists, I know that there are certain things they consider their core skills and certain things they enjoy doing. They tend to focus around the data itself. Data scientists want to be cleaning interesting data sets, they want to build cool models on top of that, and then, they want to see those models used by people to do interesting things.

What data scientists really do

But in order to actually get a project into production, data scientists usually have to do so much more. They have to go find and source these useful data sets and bring them into their warehouse, or whatever data world they're working in, and then they get to clean it. Then a lot of data science teams have some sort of reporting they have to do on that data. And once they've built the model, they've got to deploy it into the cloud or some other deployment space and integrate it with their products.

And they have to do the project management of all that as well, keeping different stakeholders informed. Honestly, it is too much for one person, a single data scientist, to do well. And even if you're running a small, lean data team, it's going to be too much for that team. I've found that a successful project requires collaboration across the organization. And so, I want to share what I've learned here about collaborating.

Today’s Agenda

So first, you have to find the right collaborators for your project. From there, we'll talk a little bit about how we work with and communicate with those collaborators.

Then, how you should organize work on your team to ensure the collaboration goes smoothly. Then, I'll take you through a real-life example from my own team, where we did some things well and some things not so well. And then, we'll open up for some questions and answers. All right. So first, finding the right collaborators.

A successful data science project starts with the right problem

A successful data science project starts by finding the right problem to solve. And in order to find the right problem, you have to find someone who knows that problem well.

I've found that data teams like to stick with what they know. They go and look at their data and find interesting things they can take from it, and say, hey, there's this interesting dimension that would be cool to forecast, cool to classify, and we've got all the data to do it. Let's go do it. But that is a trap, because you end up creating a project that isn't useful to people. You've done something interesting, but if it's not useful to the business, if it doesn't bring any sort of value, it's not going to get you anywhere.

Collaborate with a problem expert

And so, in order to find the right project, you should go to the people who know the problems in your organization. Go around to leaders across the organization, in any department, and talk with them about what challenges they're facing day to day, as well as looking down the road. Once you've gotten a good sense of what challenges your organization is facing, you can then apply your own knowledge of your team's strengths and abilities and see what challenges you have the ability to take on.

And once you've picked out a good problem that you think your team can handle well, because it requires all the skills of a data scientist, cleaning and modeling data, you should come back to the leaders you were working with and find the person who knows the most about that problem: who is most affected by it, who understands the domain of the problem best. This person, we're going to call our problem expert for the rest of the meeting, and they are going to be your most valuable collaborator all the way through the project. They'll help you find your requirements, they'll help you validate your solutions, and they'll answer questions for you along the way. So maintaining a good relationship with your problem expert is critical to a successful project.

From problem to requirements

And so, once you have your problem expert, the first thing you need to do with them is understand the requirements of a solution. In a data science project, the requirements tend to focus on the metrics that we put on models: accuracy, or precision and recall. And it's important to understand what you need. A lot of problem experts, of course, are going to come in with unrealistic expectations. They want the machine to do it correctly 100% of the time.

So you'll need to manage those expectations and find the thresholds: above what threshold are we doing a good job? Below what threshold are we causing more harm than good? And you need to make sure that whatever solution you build lands within that range. But a problem domain usually has more needs than data science metrics. They might have certain throughput requirements, they might have certain latency requirements that you need to understand.

Because you can come up with a model that does things really well, really slowly. But if you needed the answer an hour before you got it, then it doesn't do you any good. So making sure you understand both sides, both the data science needs and the domain needs of the problem, is going to be important for choosing a solution.
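To make that idea concrete, here's a minimal sketch, with hypothetical threshold values and a scikit-learn-style model (names are illustrative, not from the talk), that treats the data science metric and the domain's latency need as explicit acceptance checks:

```python
import time

from sklearn.metrics import precision_score

# Hypothetical thresholds agreed on with the problem expert.
MIN_PRECISION = 0.80   # below this, the feature does more harm than good
MAX_LATENCY_S = 30.0   # answers that arrive too late are useless

def meets_requirements(model, X_val, y_val) -> bool:
    """Check a candidate model against quality AND domain requirements."""
    start = time.perf_counter()
    predictions = model.predict(X_val)
    latency = time.perf_counter() - start

    return (precision_score(y_val, predictions) >= MIN_PRECISION
            and latency <= MAX_LATENCY_S)
```

Writing both kinds of requirements down as code, not just the model metrics, keeps the domain needs from being forgotten at release time.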

Collaborate to get data

So once you've got a problem, you and your team can come up with a couple of solution ideas. But you also need to figure out what data you're going to need to build those solutions.

Some of that's going to be data you have access to, and some of it is going to be data you don't have access to, and getting access to it is going to be valuable. Data in an organization can live in lots of places: it can live in a silo controlled by another team, or in a third-party tool like Salesforce or your marketing automation platform. If you need that data for your project to succeed, you're going to have to find the people who control access to it and collaborate with them to bring it into the data scientists' sandbox.

This group, we're going to call our data experts. They're going to be able to tell you whether the data you're looking for actually exists, and they're going to help you understand how to get it out. I believe the data science team, or the BI team if you have one you're working with, should be the ones focused on bringing that data into your pool. But the data experts will help you understand things like API limits or other restrictions on platforms, to make sure you don't accidentally hurt them.

Or they might tell you the data doesn't exist. Then you have to create it: make a plan to start collecting new data for your organization.

Collaborate to implement

And then, another key partner is going to be on the implementation side. Most models, in order to be useful to the organization and to the world, require more than your Jupyter notebooks. You're going to need to integrate them with some sort of product: you might be building an API or putting the model into a web app, and that requires more than just data science skills.

Usually, things like your company's API or your company's web app are built by another group of engineers or another team. They have their own priorities and their own roadmaps, so they need to be informed quickly about what's happening, and they need to understand the possible solutions. And a key thing not to forget about is your IT infrastructure. Data science projects tend to require underlying infrastructure that's different from what your infrastructure teams are used to working with.

You might require GPUs, you might require just outrageous amounts of memory compared to a normal web server. So making sure there's an infrastructure team that is ready and prepared to set you up with what you need to run your model is going to be critical.

Who do we collaborate with

So, now we've got these three sets of experts that we're working with. We have our problem experts, who understand our problem the best, who want it solved the most. We're going to be working with them throughout the project.

You have your data experts, who know where the data you need lives and how to get to it, and who can help you understand what it means. And then your implementation experts, who have the necessary skill set to get your solution from your organization to the world. Once you've identified these three groups and started building relationships, helping them understand the solution and what you're trying to do, it's time to start building.

So, how do we work with these collaborators through a project's lifecycle? Working with collaborators is mostly about communication. The Project Management Institute did a study about communication across organizations that have project managers. They found that organizations with effective communication see 80% of their projects succeed, while organizations with the least effective communication see only about 50% of projects reach their goals. So roughly 30% of the time, a project lives or dies based on how effectively the people working on it are communicating.

Setting clear expectations

So how do we communicate effectively with our collaborators? I'd propose that the key is coming up with an effective cadence and effective content for communication, and then establishing clear expectations about that cadence and content. The cadence really depends on which expert you're working with. The problem expert is going to require frequent check-ins: you're going to need to talk with them, and they're going to need to talk with you, throughout the lifecycle of the project.

I'm using abstract dates here, but let's say your company operates in an agile way, in two-week sprints. Problem experts, you need to be talking to them once or twice a sprint. They might be involved in planning, they might be involved in acceptance, but it's going to be pretty frequent. Data experts, you're going to check in with them regularly at the beginning, until you've been able to obtain the data they're helping you with and you understand it.

Depending on the project, you might be checking in with them weekly, you might be checking in daily, making sure that your team is not blocked and they aren't blocked in helping you. The same goes for implementation experts, except you're going to work with them more at the end of the project rather than early on. But make sure that you not only set expectations for how often you expect them to communicate with you, but also set expectations about how often you plan to communicate with them, and that you meet those expectations.

There's nothing worse than reaching out to a partner, telling them, we're going to need your help in a couple of months once we work through this project lifecycle, and then not talking to them for months. Then you come back and say, OK, we're ready for your help, and they say, who are you? What are you talking about? I don't remember any of this at all. So actually checking in with people regularly, and meeting the expectations you set, is really important to effective communication.

The other side of this is content: what do we talk about? I've found that it's really useful to focus this communication with collaborators on timelines. What help do we need at what time?

Timelines are hard

Timelines are really hard, though. If you've worked in software for any amount of time, you know that when you're out doing cool new things, timelines slip, they get messy, and it's really hard to predict what's around the corner. I imagine this applies to every domain, but software is the one I know.

And working on data science projects, I've found it even messier. I imagine it's because it's a newer field; we don't have the same tooling, experience, and instincts built up around what's going to work or not work in different situations. So timelines are messy and they get moved around a lot, but they do provide important context for all of your partners. I've found that building timelines around milestones rather than dates is very helpful.

Breaking your project into phases, discussing how long you think it will be between phases, and identifying at which phase you're going to need help from the other experts in your organization is more effective for keeping up with things than having a set calendar. Then, when you reach milestones, you can look down the road, think about what you've learned over the last little bit, and update your timeline accordingly. So let's go through an example of a project and why communicating about its timeline, and updating that timeline, is important.
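As a toy illustration (the names and numbers here are hypothetical, not from the talk), a milestone-based timeline can be modeled as estimated durations between phases, with calendar dates re-projected whenever a milestone actually completes:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Milestone:
    name: str
    estimated_weeks: int           # a duration estimate, not a calendar date
    completed_on: date | None = None

# Hypothetical phases with two weeks between each, as in the example below.
timeline = [
    Milestone("Gather requirements", 2),
    Milestone("Gather data", 2),
    Milestone("Clean data", 2),
    Milestone("Build model", 2),
    Milestone("Release", 2),
]

def project_dates(timeline: list[Milestone], start: date) -> dict[str, date]:
    """Re-project expected completion dates from the latest actuals."""
    cursor = start
    projected: dict[str, date] = {}
    for m in timeline:
        # Use the real completion date when we have one; otherwise estimate.
        cursor = m.completed_on or cursor + timedelta(weeks=m.estimated_weeks)
        projected[m.name] = cursor
    return projected
```

When "Gather data" slips from two weeks to four, recording its actual completion date immediately shifts every downstream projection your partners see, which is exactly the early warning the next two slides argue for.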

An example data science timeline

So here's an example data science timeline, with some common phases; this is mostly illustrative. We've got the basic phases of a project: gather requirements, gather your data, clean it, build something on it, release your data science solution, and let the world use it. For this example, we're going to say there are two weeks between each phase. So you tell your implementation experts at the beginning, hey, we estimate it's going to be about two months before we need your help, so let's sync again in two months and figure out what we're going to do.

Then things start to slip

Then things start to slip. Data collection ends up being more complicated than you expected; it takes four weeks instead of two. Then, when you go to build your model, the methods you were using aren't working and you've got to start over, so it takes longer than you expected too. Your implementation experts come in a month before you're ready. They've rearranged their roadmap for you, they've set their priorities around you being ready at this time, and now they're left in the lurch. They're not going to have work to do for another month.

So now they have to go back and rearrange their roadmap, their timelines, and their priorities. That will hurt your relationship and make it harder to collaborate in the future. But if, back when you finished collecting data, you had come to them and said, OK, we're two weeks behind schedule, here's what's going on, they could have taken that into account a month earlier than they otherwise would have.

And then, in the model building phase, since the implementation experts are coming up next, you should be syncing with them very frequently. So as soon as you knew things were slipping, you could tell them, and they could rearrange accordingly.

Keep everyone up to date

So it's important to keep everyone up to date. The right content for communicating here is a timeline, but it's OK to update and revise that timeline frequently. And you should be communicating the revised timeline to your partners whenever you revise it.

Think about who is going to be most affected by your timeline: the people you're going to be working with next. If you're at the beginning and coming up on needing help from data experts, or if you're coming up on needing help from an implementation expert, keep them up to date very frequently. But I will also note that once you're done working with an expert, don't stop keeping them up to date. They've invested in you, and they're interested in your solution. So you should, at least occasionally, tell them how the project is going and what the results look like, so that they can enjoy some of the fruits of your work as well.

Helping your team collaborate

All right. So now let's look at the team, and what your own team should be doing to help ensure collaboration goes smoothly. I've found that in a collaborative environment, the data science team's job is to surface potential problems for collaborators quickly, by learning as quickly as possible. So how do we do that? One of the principles of the Agile methodology is to create something of value quickly and then iterate on it, rather than trying to build the whole project all at once.

Start with baseline models

In the data science world, that small piece of value, you might call it a proof of concept or a prototype; I've heard it called a baseline model. You create some sort of model, either something quick based on limited data, based on heuristics given to you by your problem expert, or even just something that gives random results to inputs. And this creates a baseline for you to work from. So what can we do with that baseline in a useful way?

First, it gives you something to compare against for iteration as your own team tries new things. As you add new data, you can see if you're actually doing better than your baseline and make your new model the baseline, or you can see that you're doing worse and throw away a bad solution quickly. You can also learn things from the baseline. A classic use here: you're working with a data expert who has a large data set you need, but it's going to take a lot of work for that expert to get the data set out of a third-party platform or wherever they're keeping it.

But they can usually get you a CSV with a few hundred rows or so very quickly. So you can check, hey, will having this additional data actually improve my work as much as I expect, by adding it to the baseline, or building a model with that limited amount, and seeing how it compares to the baseline. Then you can come back to the data expert and say, we really need this, or, let's hold off, maybe this won't be as helpful as we thought.
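Here's a minimal sketch of that idea, with hypothetical names: a baseline that assigns items to groups at random but exposes the same interface the real model will, so every candidate can be scored against it:

```python
import random
from typing import Callable

class RandomBaseline:
    """Assigns each input to one of k groups at random.

    It deliberately shares the predict() interface of the eventual
    model, so candidates can be swapped in and compared against it.
    """
    def __init__(self, k: int, seed: int = 0):
        self.k = k
        self.rng = random.Random(seed)

    def predict(self, items: list[str]) -> list[int]:
        return [self.rng.randrange(self.k) for _ in items]

def beats_baseline(candidate, baseline, items: list[str],
                   score: Callable[[list[str], list[int]], float]) -> bool:
    """Keep a candidate only if it scores better than the current baseline."""
    return (score(items, candidate.predict(items))
            > score(items, baseline.predict(items)))
```

Any candidate that can't beat random, or can't beat the current best with the small CSV sample added, can be thrown away before anyone invests in the full data pull.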

Build prototypes

The other thing you can do with a baseline is build a prototype around it. Once you've decided how you want your final solution to work and how it's going to integrate into the application, you can build a prototype of how that model is going to be used. And assuming you've built your baseline to have the same inputs and outputs as your final solution, you can then work with your implementation teams to start building your solution quicker by prototyping it around the baseline.

And this lets you do two things that are extremely useful, much earlier in the lifecycle than if you had done this in sequential steps. One, you can show your prototype to your problem expert, and they can take it around to other people who might use it and ask, hey, if the underlying model reaches the accuracy it needs, will this still be a useful solution? They can try it out and say yes, or, oh, we need to tweak this, we need to change some piece of it. And you can know that well before you go live.

The other thing you can do is let your implementation partners use it to see how it meets requirements like throughput. How much data can we put through this system? How quickly can this type of model process things? How much memory is it going to take to work on a data set of this size? They can make good implementation decisions faster than they could if they had waited for your final model before they started building.
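One lightweight way to make "same inputs and outputs" concrete, sketched here with hypothetical names, is to write the prototype against an explicit interface that both the baseline and the final model satisfy:

```python
from typing import Protocol

class GroupingModel(Protocol):
    """The contract the baseline and the final model both satisfy:
    take a list of text items, return a group index for each one."""
    def predict(self, items: list[str]) -> list[int]: ...

def render_groups(model: GroupingModel, items: list[str]) -> dict[int, list[str]]:
    """Prototype code is written once, against the contract, so swapping
    the baseline for the real model later changes nothing here."""
    groups: dict[int, list[str]] = {}
    for item, label in zip(items, model.predict(items)):
        groups.setdefault(label, []).append(item)
    return groups
```

The RandomBaseline from the previous sketch already satisfies this contract, which is what lets implementation work and model work proceed in parallel.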

Agile methods make timelines more complicated, but produce better results faster

So we see these as kind of an Agile methodology. The timeline I showed you before was very much a waterfall: we do each piece in discrete chunks, and we do each chunk until it's done. That's easy to reason about, but it's not the best way to do things. Having multiple work streams going in parallel, with some people building a prototype while other people are improving the model, usually lets you get to your results faster. And since you're testing your results along the way, you end up with better results than you would have otherwise.

All right. So let me take you through a project that my team did: how we used these principles, how we didn't use them at our own peril, and the results of that. At Lucid Software, we've built a tool called LucidSpark. LucidSpark is a digital whiteboard solution, and one of the cool things you can do with it is quickly pull out sticky notes for a brainstorming session. You have something you want to brainstorm, say team activity ideas. You want to go do something with your team and see what people are interested in. Your team can jump on, pull out stickies, and write down their ideas.

But in any brainstorming session, there's usually this long dead time where the facilitator has to go through, find duplicate ideas and deduplicate them, and group ideas into different categories. We wanted to do that automatically. We said, hey, we can use topic modeling and vectorization to cluster the sticky notes on a board into similar ideas. So something that would take a facilitator a significant amount of time, while their team just sits there, would take a matter of seconds instead.

So that is what we set off to do: automatically group sticky notes.
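The talk doesn't show any code, but a rough sketch of the technique described, embedding each sticky note with pretrained word vectors and clustering the embeddings, might look like this (the file name and parameters are hypothetical):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Hypothetical pretrained word2vec vectors, like the classic corpora
# mentioned below; files like this run to multiple gigabytes, which is
# exactly what later broke the zip-based Lambda deployment.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def embed(note: str) -> np.ndarray:
    """Represent a sticky note as the average of its word vectors."""
    words = [w for w in note.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[w] for w in words], axis=0)

def cluster_notes(notes: list[str], k: int = 5) -> list[int]:
    """Group sticky notes into k clusters of similar ideas."""
    embeddings = np.stack([embed(n) for n in notes])
    return KMeans(n_clusters=k, n_init=10).fit_predict(embeddings).tolist()
```

This is a sketch of the general vectorize-then-cluster approach, not Lucid's actual implementation.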

From problems to requirements

So let's start from the beginning. This problem came from our product leadership. The director of product for LucidSpark had the idea and tasked one of their product managers with working with us to figure out how we were going to do this. That product manager was a clear problem expert: they were in charge of figuring out what customers would want and need from this feature.

And that's great, because the product manager also had access to the roadmap of a product development team, so adding this feature to the product was going to be one of that team's priorities for the quarter we were working on this. So we had an implementation partner there. The other partner was, of course, the team I was leading at the time: the team responsible for the infrastructure that the actual vectorization and clustering model was going to run on.

Getting data

So first, we sat down with the data science team to figure out what data they would want to do this. Since my team would be responsible for the ingestion of that data, I was a partner here as well. Of course, the data science team would love a collection of brainstorms. The perfect data set here is: what does a brainstorming session look like at the beginning, and what does it look like at the end? No such data set exists. No one's done that sort of significant research on brainstorming and then released a public data set about it, and we weren't going to set out to build that ourselves. So after bouncing around a couple of options, we decided to stick with the classic word2vec corpora.

Building a prototype

From there, one thing that went really well is that the team was able to build a prototype. My team wasn't involved here: the data science team and the product development team built out a baseline model that would take in all of the ideas from the sticky notes in a brainstorming session and sort them into random collections. That let the product development team quickly build out a prototype, play with it, and see what makes the most sense for users, as well as have something that, when the final model was done, they could just plug it into.

Deployment time

On the deployment side, at the beginning of the project, we looked at it and said, hey, this looks similar to some other data science projects we've done on the infrastructure side. So we set up a Lambda for the data science team and gave them a deployment pipeline for that Lambda, so they could quickly release changes to it. And then we let them go at it for a couple of months without really checking in. And that worked. So first, why Lambda? Because that's important to this story: Lambda has a very low maintenance cost and a very low budget.

Deployment time — Unknown Unknowns

We knew that this model would maybe be getting a couple of hundred requests a week, not enough to need a full web server, and the traffic was going to be very spiky. So having something that scales automatically, fairly cheaply, was going to be really valuable. And then one Friday night, I got a phone call. Well, I got an urgent Slack message, saying, hey, we're trying to get our final clustering model out to production, and the pipeline you gave us is throwing errors.

And also, we're trying to turn this on Monday, so we need to figure this out now. So I jumped on a Zoom call with the team and found out that the NLP models they were using weren't going to fit in our Lambda, because the other side of the coin with Lambda is that you only get about 150 megabytes of disk space. If you're using a huge word2vec corpus, it's not going to fit. So we did not release that Monday, while my team spent the next week frantically building out a new deployment pipeline for them.

AWS had recently launched container-backed Lambdas: instead of giving them a zip file with your code, you could build a container, and that container can be up to, I think, 4 gigabytes. That gave us plenty of room to work, so we went that direction instead. But it did mean that we missed our initial deadline by a week. And of course, once we actually got it out, we found that there were problems with it. It was taking up to a couple of minutes to cluster larger documents, and on even larger ones, we were running out of memory and crashing. So a common experience for our users would be that they wait a couple of minutes and then get an error, which is the worst possible situation to be in.

Iteration and relaunch

So we had to stop and pull it back into an internal alpha. And that's when the real collaboration started, where we began having regular check-ins across all of the partners and applying the principles I've laid out here. The key one being, we set up milestones and metrics. We said, OK, we need this to succeed about 80% of the time, and it needs to be able to cluster in 30 seconds, for us to release this to a closed beta of users we have good relationships with, who can try it out for us.

So that was our goal: cutting down the release time, cutting down the time to do a clustering, and cutting down the number of times it failed. Data science turned around and started optimizing their model for speed rather than quality of clusters. My team, the infrastructure team, went off and switched us from a Lambda to hosted containers using AWS Fargate, so that we didn't have those long cold-start times, and we had more memory to use, so we weren't crashing as often.
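The talk doesn't say which specific optimizations the data science team used; as one hedged illustration of trading cluster quality for speed, scikit-learn's MiniBatchKMeans fits on small random batches instead of the full data set:

```python
import time

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Stand-in data: 20,000 items embedded as 300-dimensional vectors.
X = np.random.rand(20_000, 300)

for Model in (KMeans, MiniBatchKMeans):
    start = time.perf_counter()
    Model(n_clusters=10, n_init=3).fit(X)
    print(f"{Model.__name__}: {time.perf_counter() - start:.2f}s")

# MiniBatchKMeans typically finishes several times faster, at some cost
# in cluster quality: the same speed-versus-quality trade-off described above.
```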

And once that was done, we could go to our internal beta, make some more improvements, and go to a wider release. We did that in a month, compared to the three months it took us to get to the initial failure.

Closing Notes and Q&A

So I'm going to leave you with some thoughts in conclusion, looking at this example, and then we'll open it up to questions and answers. What did we do right and wrong going through this? The first time, we tried to do everything up front with very little collaboration.

First iteration vs second iteration

The data infrastructure team built some infrastructure and left it to the data scientists. The problem expert also left it to the data scientists to build a good solution, without considering that they could come back with something that clusters really well but takes a long time. Those were the wrong things to do. So the second time through, we had our regular check-ins. We were doing stand-ups together, and we were checking in together at least weekly.

That way we knew where we were and how things were going. We had set the metrics that were going to define success, and that allowed us to build timelines. We could say, hey, I'm going to try this method, other people have tried it, and it should make our k-means clustering 50% faster, and it will take us two days to implement. So we knew that in two days, we would have something potentially 50% faster. And then, two days later, we'd update the timeline based on the actual results.

And then we were releasing new versions to our internal alpha, and later our beta, on a weekly basis, as opposed to one big push at the end, which had bitten us pretty badly. So, looking back at the principles needed to succeed: one, we needed the right collaborators. We wanted to make sure we had everyone involved who knows the data, who knows the problem, and who knows how the solution is going to be implemented. From there, we made sure that we communicated as a group effectively, by communicating regularly and focusing on the timeline and the blockers on that timeline.

From there, we used milestones built around our metrics to inform and revise our timeline, and we continued to communicate about it, because as the different teams learned things, they were able to improve things. And then we focused on putting new iterations in front of partners and customers quickly, so they could try them and we could learn from that. That allowed us to iterate toward our set goals. So those are the final thoughts I want to leave you with: these principles of finding good collaborators, communicating with them effectively, and working as a team to iterate and build good solutions quickly.

Q&A

Q: How often do you think you should keep stakeholders informed around project updates in a data science project, Brian?

A: Yeah, it really is going to vary from stakeholder to stakeholder. Different people are just going to require different cadences. You can always ask them what's going to work for them, but if it's someone you're working with regularly, you might be doing daily stand-ups together. If it's someone you plan on working with later but whose help you don't need right now, you might be checking in on a sprint basis, assuming your company is using sprints. So that might be every two weeks, or that might be once a month.

Q: What is the best way to manage executive leadership's expectations around data science project updates and timelines?

A: This is a trickier one. Good executives have high expectations and busy calendars. They want to know what's going on, but they're not going to be coming to your daily stand-ups or weekly check-ins. So at the very least, at Lucid Software, we have monthly executive check-ins, where every team checks in with the engineering executive or the product executive once a month, and we get five minutes to run through all of our projects and what's going on with them. That's been very helpful; the executive team has a much stronger idea of what's going on. It's also good to keep at least whoever's above you in the organization, whoever's managing you, be it a director or whatnot, updated on your timeline on that sprint basis, so that if an executive has a question, they can give an informed answer.

Q: What are lessons that you can share about how to avoid those unknown unknowns when launching data science projects?

A: Oh yeah. Unknown unknowns are the hardest, of course, because they're unknown. One thing I'm trying to get myself and my teams to think about is that when we start on a project, we like to focus on the similarities to other projects. Like, hey, we did this in the past and it worked, so let's try that again. And focusing on the similarities tends to give us blind spots. So I try to focus the team more on: we've identified a similar thing that we can build on, but what's going to be different? And I try to see what's different as quickly as possible. You know what you know; focus on the shadows, focus on what's different, to see what's going to bite you down the road.

Q: How often do teams go looking for problems to solve versus how often are they directed by someone higher up in an organization? Like someone working on an operations section, for example?

A: In my own experience, it's probably been about a 50-50 split between getting projects assigned and taking our own projects and pitching them up. We probably pitch about half our projects, and half of our projects are pitched to us. And getting projects assigned isn't that fun, in my experience. If you want to have a good time on a project, having control of your roadmap is really useful. So I like to encourage people to at least take the time to reach out across the organization to find interesting projects to pitch, so they have something they own from beginning to end. That's good from a career perspective, and it kind of makes for happier work for data scientists.

Q: Do you have any comments on data quality or data cleansing? Who should own that, who is responsible, maybe within the project that you worked on, or as a best practice?

A: Yeah. I too often quote: if you want it done right, you've got to do it yourself. So I usually put that on the data science team, or whatever team is going to be using the data, because you're the one who knows what you want it to look like and what it's going to need to look like, so it's easier to do it yourself. That goes against everything I've been saying about communication, but if you're the one who needs the data clean, you should probably be the one cleaning it. You can, of course, work with those data partners to understand the gaps: OK, if we have nulls in this field, what should we be doing? If we have duplicates, what should we be doing?

Q: Where would you start the collaboration process in a non-data-driven organization that is generally disconnected and is trying to turn into a data-driven one? How would you go about, basically, forging these stakeholder relationships in an organization that is not data-driven?

A: So, you've got an organization you need to collaborate with, but they don't fully understand data. I've found, generally, that people everywhere like two things: they like friends, and they like people who sit down with them and listen to their problems. So if you reach out to people and offer, hey, let's go get a coffee, let's go get lunch, let's sit down and talk about what's going on in your part of the organization, that alone builds a strong relationship. They might not be able to communicate the data needs behind their problems, but they can always communicate their problems, and it comes to you, as the data science leader, to figure out where the data needs are. Of course, the other side is that people want to be involved in high-value projects, because that's good for their careers. So showing people, hey, you might not understand what we're doing, but if we get to the end and we get this result, here's the value, can convince people to come along with you and help you out.

Q: How do you avoid scope creep?

A: That is a tough one. I think the principle of creating a piece of value and iterating is very useful there, because scope creep tends to happen in the middle of a project's lifecycle, when teams say, hey, this would also be cool, this would also be interesting. Once you have something tangible for people to work with and see results from, you actually know what's going to be useful and what's going to be interesting. Then you can, of course, start expanding your scope, because you actually have something useful you need to add; or you can turn something down, push it down the backlog or further down the roadmap, if it's not going to provide immediate value. I want to add as well: just have conversations about value. Something I've learned more recently in my career is that you need to understand the value of what you're doing for the organization, and that's usually a really hard thing to understand. I've worked on an infrastructure team my whole time, where the value is behind layers and layers of other teams doing interesting things. But being able to talk in concrete terms, hopefully in numbers, about what value a new feature would add and what it won't, is really important every step of the way.

Q: Do you think data literacy plays an important role when creating a common data language and collaborating with data science teams effectively?

A: Yeah, data literacy is an interesting one. As I talked about with the product manager on this last project, you can't go in with an expectation of data literacy; you've got to meet people where they are. But improving your company's data literacy helps people make better decisions, and it helps in collaboration, because you now have partners who understand what you're doing and can contribute rather than just accept what you're saying.

It also gives teams the ability to be more self-sufficient. That has been the main value of data literacy as we've seen it build at Lucid Software: teams are able to answer their own questions and build their own data products with insight from the data teams, rather than having the data teams build them for them. So that's a little different from what you were asking, but that's what I've seen.

Summary

Successfully managing data science projects is a complex task that requires a combination of technical knowledge, stakeholder engagement, and strategic planning. Brian Campbell from Lucid Software provides insights derived from his extensive experience leading data infrastructure and data science teams. Notable topics include the significance of identifying and working with the appropriate stakeholders, maintaining effective communication, and managing project timelines. Campbell also stresses the importance of establishing a strong data infrastructure and the benefits of iterative development and quick prototyping to meet business needs efficiently. He discusses the challenges and solutions in managing data science projects, using real-world examples from his work at Lucid Software.

Key Takeaways:

  • Interdepartmental collaboration is essential for successful data science projects.
  • Effective communication can greatly influence project success rates.
  • Identifying the right problem and having a problem expert is important.
  • Iterative development and quick prototyping can assist in building effective solutions.
  • Managing expectations and having realistic timelines can prevent project delays.

Deep Dives

The Importance of Collaboration

Collaboration is the key to successful data science projects. Brian Campbell stresses that while a data scientist may possess the technical abilities to manage data and construct models, the scope of a data science project often extends beyond these tasks. It requires the input and expertise of various stakeholders across an organization. Campbell highlights that finding the right team members (problem experts, data experts, and implementation experts) is important. These team members provide vital insights into the problem domain, data accessibility, and deployment requirements. A successful project requires a collective effort where each stakeholder plays an important role in pushing the project forward. As Campbell states, "A successful project requires collaboration across the organization."

Effective Communication

Communication is a key factor in the success of data science projects. Brian Campbell cites a study from the Project Management Institute indicating that organizations with effective communication see an 80% success rate in projects. He advises establishing a clear communication routine with stakeholders, such as regular check-ins and updates on project timelines. Campbell suggests focusing communication on timelines and project milestones, as this provides context for all team members and helps align expectations. Regular updates can prevent misunderstandings and ensure that all stakeholders are aware of project progress and potential challenges.

Selecting the Right Problems

Identifying the right problem to solve is an important step in any data science project. Campbell advises against the temptation to work on projects simply because they are technically interesting. Instead, data teams should focus on problems that bring tangible value to the organization. This requires engaging with leaders across departments to understand their challenges and aligning these with the data team’s capabilities. Once a problem is selected, identifying a problem expert—someone deeply familiar with the issue—is vital for guiding the project. This expert helps in understanding the requirements and validating the solutions, ensuring that the project stays relevant to business needs.

Iterative Development and Quick Prototyping

Campbell advocates for an agile approach to data science projects, emphasizing the significance of iterative development and quick prototyping. He recounts an example from Lucid Software where his team applied these principles to develop a feature for clustering brainstorming ideas on a digital whiteboard. By constructing a baseline model and a prototype early in the project, the team was able to collect valuable feedback and make necessary adjustments before full-scale deployment. This approach not only accelerates development but also reduces the risk of misalignment with user needs. Campbell notes, “By having multiple work streams going in parallel, you end up with better results than you would have otherwise.”

