How to conduct the perfect marketing experiment / Data for Bluffers #13

09 May 2022

Every time you run a campaign it’s a chance to learn, but how do you know if your last success was random?

In this episode, Tom and Ed cover some of the common pitfalls that could skew your results as well as a framework to help you produce great ones.

Tom
Welcome to another edition of the Data for Bluffers podcast. This week I wanted to cover a topic that comes up time and time again, and that’s how to draw more accurate conclusions from campaigns to try and understand what has worked and what hasn’t. And the way we think about this is applying the scientific method to running campaigns. So hopefully after this chat with Ed, you should leave with a better understanding of why it’s valuable, but also some guidance as well as the pitfalls to avoid, so you can actually get a better understanding of what really works from your own campaigns. For those of you that don’t know, Ed’s background is in scientific research at Bristol University here in the UK. So he’s, from my side, the perfect person to shed light on how marketing can apply these methods. So I guess to start with, Ed, why should people think scientifically about campaign testing?
Ed
Science, and the scientific method, is at its core the answer to the question: how do you learn the most from doing whatever you are doing? How do we provide the structure to the process by which we learn new things, so that we can be as confident as possible that the new things we learn are true, and how can we generate the experiments and the evidence to test our hypothesis? So in the scientific method, you start with a hypothesis, something you think might be true, or you want to test whether it is true. And then you set up an experiment that challenges that hypothesis. And it’s important to say that what you are doing when you’re setting up the experiment is making sure that, given the experiment comes out with a particular result, the hypothesis is either ruled in or ruled out. And so what’s actually important in that process is that you really have to have two hypotheses. You can test a single hypothesis, which is basically: could this thing be true or could it not be true? But what we are generally doing in science is looking at two hypotheses and asking whether one of these is true, or whether we can rule one of them out whilst ruling the other in.
Tom
I think the really interesting thing there is the framework. Cause I think a lot of people maybe like to say they’re thinking scientifically about it, but fall over some, you know, common pitfalls. So why should we apply that to campaigns, I guess?
Ed
People already view campaigns, to a certain extent, as experiments, to the extent that people look at metrics coming off those campaigns and try and analyze how well they did and work out what it was that went well and what it was that went badly. Also, as part of the campaign creation or design process, people conduct experiments all the time. For example, focus groups may be an important part of a campaign. And a focus group is really, to some extent, an experiment on different content. So you’re running your content in front of a small group of people to test the hypothesis that one of these pieces of content is more effective at driving the actions you want than the other.
Tom
That’s where I think this point about the framework’s really, really important, because a lot of people will be treating them like experiments, but maybe not actually following the method to the letter, if you like, and therefore some of these pitfalls that we’re gonna discuss later really impact, I guess, the conclusions they draw, right? And the most important thing about this is drawing valid conclusions. And if you don’t do this correctly and your conclusions are wrong, then you may as well have just guessed.
Ed
You kind of have to commit to the process. Now I wouldn’t go so far as saying that if you do it badly, you might as well have guessed. It’s not necessarily 50/50, but it’s definitely the case that you can run what looks like an experiment or a focus group and get an answer, and that answer will have a large amount of uncertainty around it. It’s important to say that in nearly all experiments, or in fact, I would argue, in all experiments, there is an aspect of uncertainty, and the goal of the experiment is to reduce that uncertainty to a satisfactory level. So I’m sure anyone who’s ever done science experiments at school remembers doing repeat trials, right? And you do the repeat trials to try and eliminate the uncertainty that comes from doing one single trial.
Tom
It’d be interesting to talk about repeat trials in the context of campaigns. You know, there’s lots of variables, but as I say, when you disagree with me, that’s exactly why you are here, cuz you are the guy that knows it. So just at a really simple level, what’s one example, I guess, of a common experiment that we’ll see people doing in the marketing world?
Ed
So I think that a really common example is A/B testing, which is in itself quite an explicit experimental framework. And that’s the idea that you have, say, two pieces of copy or images in an advert and you want to test which one drives more engagement. And a very standard way of doing that is to randomly serve one of these to each person that gets shown the advert. And then after a period you say, okay, well, click-throughs: 10% of people clicked image A and visited our site because of it, 20% of people clicked image B and visited our site because of it, and that suggests that image B is more engaging to people. And people who are experienced in A/B testing will know that there’s a lot more to it than just looking at those headline numbers. I’m not suggesting that’s what everyone’s doing, but the basics of it is trying to understand which of A or B is better. A modification on A/B testing is what is known as champion challenger.
Ed
And that’s where, instead of having, say, two pieces of copy, running a small experiment with both of them, seeing that B wins and then running B for the full campaign, you have a constantly updating system where you have your champion, which is the branding or the link text you’re using. And then every now and then you add challengers into the experiment, so new potential text, and then you assess whether it starts doing better or not.
Tom
Oh, okay. So you keep the main one, the main horse, running, so to speak, and you bring in others alongside. Okay. That makes sense.
Ed
Yeah. And then if any of the challengers outperforms enough, it becomes the champion.
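As a minimal sketch of why the headline numbers in an A/B test are not enough on their own, the snippet below compares the same 10% versus 20% click-through rates at two sample sizes. It assumes a standard two-proportion z-test, which is one common choice and not something named in the conversation; the function name and the counts are made up for illustration.

```python
import math

def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Return (z, two_sided_p) for the difference in click-through rates.

    Assumes at least one click and one non-click overall, otherwise the
    standard error below would be zero.
    """
    rate_a = clicks_a / shown_a
    rate_b = clicks_b / shown_b
    # Pooled rate under the hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: identical headline rates, very different sample sizes.
small = two_proportion_z_test(2, 20, 4, 20)
large = two_proportion_z_test(200, 2000, 400, 2000)
print("20 views each:   p =", round(small[1], 3))
print("2000 views each: p =", large[1])
```

With 20 views per image the difference is easily explained by chance, while with 2,000 views per image it is very unlikely to be. The headline rates are the same in both cases.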
Tom
And I think that’s the one we all hear about. Everyone throws A/B testing around quite loosely. And I think sometimes when we speak to people, there are certainly some mistakes that happen when people are setting up or running their A/B tests. Are there some core principles we should be thinking about if we’re gonna commit to running our campaigns in a scientific way?
Ed
Yeah, I think there are some. We can be a bit broader than talking about campaigns that are specifically experiments: any analysis that we are going to do over a campaign should make an attempt to adhere to a set of principles, to make sure that you are (a) learning what you think you’re learning and (b) able to make those projections from the results of the campaign itself. So for example, I think a really core principle is what I would call baselining; a more scientific name is maybe a control group. And that’s the idea that you are measuring where you are already. So, in a way, in champion challenger, it’s measuring where you are already. In A/B testing, you’ve got your A and your B text, so one of them is the baseline.
Ed
Yeah. So you are providing the data, and you make sure that you have the data that your new campaign, or whatever you’re trying to test, is trying to beat. What you see quite often is that someone sets up a new channel, for example, then immediately starts looking for the benefits of that channel without understanding where they were beforehand. Or, because of the way that they’re experimenting, say with a new tool that’s helping with targeting, they’re getting new analysis back from that tool without beforehand having understood what was going on when they weren’t using the tool.
Tom
Yeah. Okay.
Ed
And you can see people try and, you know, reverse engineer almost what the situation was beforehand.
Tom
Yeah. Okay.
Ed
A second principle is being very aware of what you’re doing and what’s changed. So what is your challenger, or what are you changing? Are you changing the copy? Are you changing the targeting? Are you changing the bid price, the time of day you’re serving the ad, that sort of thing? But then also, being aware of what else might be changing.
Tom
Yeah, that’s a really interesting point. Go on, expand.
Ed
Say you’re talking about social. Has someone else done something that has increased your number of followers while you’ve been running that campaign? And therefore you actually have an unfair experiment. If you look at, okay, let’s see what our LinkedIn click-throughs are for this new campaign, it’s like, oh, it’s great. But actually your engagement with all of your LinkedIn stuff’s gone up because of something else.
Tom
Yeah. Someone at an event has said, you know, follow us on LinkedIn and we’ll give you a cookie, or whatever it might be. And all of a sudden you’ve collected a thousand new followers and given out a thousand cookies, and someone in another department who wasn’t aware of that is sitting there saying, well, this campaign we’ve run, or these changes we’ve made, are really, really impactful. Yet when they come to make similar changes in six months’ time, they don’t get the result. And then they’re left scratching their heads.
Ed
Yeah, exactly. This is the risk of getting a false positive: thinking you’re having a bigger impact than you actually are. This kind of feeds into a more positive principle for constructing experiments, which is the idea of trying to do these things in stages. So make changes in small stages.
Tom
And I guess this is where, bringing it back to what we talked about earlier, the repeat experiments come in. So if you get a good answer, run it again and see if you get the same answer. It’s not always practical, but that’s the principle, at least.
Ed
You can think of that actually on a very micro level: each engagement is an experiment in its own right. However, it only has two results, yes or no. So in your LinkedIn example, where someone was giving out some sort of offer for people to follow, the question is, okay, can you take the data you got from your experiment and ignore the days where that event was happening, and do you still get the same result? That’s a more dangerous way of doing it. Segmenting like that post hoc, so after the experiment’s happened, is a little bit more dangerous because you might miss longer-term effects of that following, for example. So the fact that you got this big spike in followers probably means that LinkedIn starts showing your posts to more people, so then you get a further big spike in followers afterwards.
Ed
And what this comes into is really what I’d say is a third principle, which is the data collection, or setup. So making sure you are set up to collect the data you want in a format that allows you to carry out the analysis you need to do. In the LinkedIn example, that means not just collecting the data on the number of followers, but being able to break that down by day. Or potentially, and I’m not sure you can do this in LinkedIn, but if you could, maybe also collect on location, and then you could exclude anyone who might have been at the event.
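To make the data collection point concrete, here is a small sketch of what keeping the numbers per day allows. The log, the dates and the event days are all invented for illustration; the only idea taken from the conversation is that daily data lets you ask what the result looks like with the event days left out.

```python
from datetime import date

# Hypothetical daily log: (day, new_followers, campaign_clicks).
daily_log = [
    (date(2022, 5, 2), 12, 40),
    (date(2022, 5, 3), 15, 44),
    (date(2022, 5, 4), 480, 130),  # event day: followers were offered an incentive
    (date(2022, 5, 5), 520, 150),  # event day
    (date(2022, 5, 6), 14, 43),
]
event_days = {date(2022, 5, 4), date(2022, 5, 5)}

def mean_clicks(rows):
    """Average campaign clicks per day over the given rows."""
    return sum(clicks for _, _, clicks in rows) / len(rows)

all_days = mean_clicks(daily_log)
without_event = mean_clicks([row for row in daily_log if row[0] not in event_days])

print("All days:          ", round(all_days, 1))
print("Event days removed:", round(without_event, 1))
```

As Ed notes, filtering after the fact like this can miss knock-on effects, such as extra reach in the days after the event, so it is a sanity check and not a replacement for a clean experiment.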
Tom
Yeah. Okay. So really that standing back and thinking about what could change outside your control is that really important step, especially as you work in these larger organizations. You know, if your organization runs events, or if your organization does something totally different, you can make sure you’re set up to look for that.
Ed
Sort of, as a data scientist, the way I think about this is to conduct experiments on the data after the data’s been collected. So ask the question: okay, say we hadn’t done that event, what would’ve happened? And we do that by finding the places where people won’t have gone to that event. That core idea of having really clean, good data significantly helps with the confidence you can have in the results of that experiment, because you can ask the what-if questions. And we’ll come onto this a bit later, about how to question an experiment and what should be being asked, but having good data collection allows you to get ahead of that and ask those questions without running the experiment again. So another thing that I think is really key to understand when you are setting up an experiment is the assumptions.
Ed
And by this, I mean almost your assumptions of what either hasn’t changed or, more explicitly, doesn’t matter. So for example, you might run an A/B test where you show people in one city one type of advert and people in another city another ad. Now, the reason you’re grouping them together is because potentially you want to tap into word of mouth, so you have to segment your audience, but you’re making the assumption that the people in the cities are the same, or the communities inside the cities are the same. You need to be aware of that assumption when it comes to running your experiment. So for example, in that case, what you might do is either reverse the experiment halfway through, so you have data for both copies in both cities, or add more cities and randomly assign. And then, as you add more cities, you are assuming that you are getting a mix of behaviors across your cities.
Tom
And I guess when we’re talking about assumptions, especially in the digital world, it’s also recognizing the assumptions about the offline world. You know, what’s the weather doing is probably a common one. If you sell ice creams or cold drinks, I’m sure there’d be a heavy correlation between the weather and how people behave towards your product. I guess there’ll be other environmental factors that might be less obvious, but it’s always important to document what those are, to make sure that you are thinking about those things and get a higher certainty from your results. And when you come to repeat it, you’re not left scratching your head with results that don’t match up.
Ed
Yeah, exactly. I think there’s an awareness piece going in. So the weather is a good example because you can’t really control for it. What do you do if you’ve decided to separate in time, so you’ve been running something and then you switched to your challenger for a bit, but it’s suddenly really hot on all those days? Now you kind of have to run the experiment again.
Tom
Yeah.
Ed
Or you have to assume that your product sales are not driven by the weather. Now, if you are selling sunglasses, that’s probably a very bad assumption. If you are selling computers, it might be a good assumption, but it might actually also be a bad assumption, because people might spend less time inside, so they’d be less likely to feel the drive to buy a new computer. So in some cases it’s really simple, but in other cases you do have to think hard about why people might be affected.
Tom
Yeah. And I guess, especially with things like the weather, that can get even more complicated. If you’re targeting the UK, for example, with a campaign, the weather over the UK can vary hugely in a few square miles, let alone up and down the country. So you could get this big skew. So if you think weather is important, you might need to take steps to split things up in a way that you can at least try and measure what the weather’s doing, cuz measuring the weather across the whole of the UK could potentially be quite difficult.
Ed
Yeah, exactly. If you are separating in space, so you set up your campaign, accidentally or not, to serve in different locations, and one has very different weather, it could be the weather that drives the results a lot more than the copy. Now I think this leads us on to what I’d almost say is the fifth principle, and that’s sampling. In an A/B test or in a champion challenger test, how do you choose your samples in a fair way? So you are feeding your ads to two groups of people. Is it a fair test? This is kind of a classic: is it a fair experiment based on the sample groups? For example, if you are serving ads and you, as we’ve said, split by location, that could actually be very unfair because, as we know, people sort themselves by location. Some postcodes are much more affluent than others, for example.
Ed
Now there’s a tendency in sampling to think that random is best. So you can just make it as random as possible, and once it’s really random, you are safe and you’re gonna have good samples. The problem is the chance of your audience being a random section of the public is actually really small. So a good example would be even pay-per-click, right? Everyone uses Google, but not everyone spends the same amount of time on the internet. So there are character profiles in people on the internet. So say you’ve got a display ad, and you are going to use that ad on a billboard, but you want to A/B test it first. So you A/B test it as a display ad. Then the audience for the billboard is different to the audience for the display ad.
Ed
And so even though you can randomize it as much as you want, you actually want to rescale your audience based on the characteristics of that audience. You wanna dial down the number of people who spend all day at a computer and dial up the number of people who spend all their time walking around, because they’re gonna see your billboard more. Now this, for companies, is very, very hard to do. It’s a really well known problem in electoral polling and issue polling as well. So a lot of polling companies will add weights to their polls based on the fact that they know that there’s uneven representation in their sample. So for example, an online poll will have a much younger audience than the country in general. And therefore, what a lot of companies do, if they do an online poll, is try and understand how old the people were and then reweight the ages of people back up to the population level.
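The reweighting Ed describes for polls can be sketched in a few lines. Everything here is hypothetical: the age bands, the population shares, the sample counts and the preference rates are invented, and the method shown is simple reweighting by group share (often called post-stratification), which is one way of doing it and not necessarily what any particular polling company does.

```python
# Hypothetical numbers for illustration only.
# Share of each age group in the target population.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
# Number of respondents per age group in an online sample that skews young.
sample_counts = {"18-34": 600, "35-54": 300, "55+": 100}
# Fraction of respondents in each group who preferred advert B.
prefers_b = {"18-34": 0.70, "35-54": 0.50, "55+": 0.30}

total = sum(sample_counts.values())

# Raw estimate: every respondent counts equally, so young people dominate.
raw_estimate = sum(sample_counts[g] * prefers_b[g] for g in sample_counts) / total

# Reweighted estimate: each group counts in proportion to its population share.
weighted_estimate = sum(population_share[g] * prefers_b[g] for g in population_share)

print("Raw sample estimate:", round(raw_estimate, 2))
print("Reweighted estimate:", round(weighted_estimate, 2))
```

In this made-up sample the raw figure suggests a clear majority prefer advert B, while the reweighted figure puts it just under half, which is the kind of shift the demographic questions in a survey are there to correct for.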
Ed
If anyone’s ever done a poll or any sort of paid market research survey, there tend to be quite a lot of demographic questions, and that’s to reweight everything. Now, obviously, in a world where there’s not that many cookies, it’s actually quite hard for companies to do that with their adverts. And therefore what really becomes important, I guess, is almost another principle, which is to test in situ. Which is fine for any sort of web ad; you can test in situ. What you can’t do is test a billboard in situ and then decide you want to change it, right? I mean, if you’re large enough, you could do that across the country, but it’s likely that the experiment is gonna cost a lot of money at that point.
Tom
But I think what’s interesting there, though, even noting that the sampling piece can be really hard to do, is that even being aware of it, when it comes to reviewing and communicating your results, means you’ll be less surprised by things down the line. And just knowing how you’ve sampled, even if you can’t get over it, knowing that your sampling could have introduced, you know, not errors, but inaccuracies, really helps you have a fair understanding of what’s gone on. So I think even if you can’t fix it, being aware of it is really important. But what about the other side, when you are not the one necessarily running the campaign? Maybe it’s someone in your team, maybe it’s an agency or a third party. What sort of questions can we start to ask them to get a better understanding of how much scientific rigor went in, and really to give you a feeling of how much certainty there is in the results?
Ed
Yeah. I think there’s a lot you can ask, and there’s a lot you should ask, really. In general, and some people find this easier than others, I would say be skeptical of anything that sounds too good to be true. It might not be, but question it, because ultimately you need that confidence to use the result. You know, there’s actually a sort of well known trend in physics that the more impactful an experiment, the more likely it is to be wrong.
Tom
Right.
Ed
Because, to a certain extent, they rush to get their results out. And then people are maybe slightly less skeptical than they need to be. So if someone’s coming to you with a result, you are the person that needs to be skeptical about it. So part of that is just asking, you know, could this be because these two were in different areas? Could this be because of the weather? Those sorts of questions. And that goes back to the idea of data, right? If you are running one of these experiments, you need the data to support the conclusions that you have, to rule things out and almost to anticipate the fact that someone might have new questions that you didn’t necessarily have going into it. A really important thing is to get people to quantify things. How much better is it? Particularly if someone’s trying to sell you something off the back of it, how much does it outperform, and what’s the evidence for that? Also bear in mind that that’s very unlikely to be one number.
Tom
Okay. Well, as in, you’d expect a range or just
Ed
Exactly. You know, if someone’s selling a product to multiple companies that they say improves the ROI on their advertising, then, okay, they might say it’s five times. But ask, okay, well, what’s the worst case you’ve ever seen? You assume that it may not work in some scenarios, right? So they might say the worst case is zero, but the best case is 20, and five is the average. It’s more likely that, if they want, they’ll choose an average that’s slightly near the top end.
Tom

Ed
This is how these things work. But that idea of understanding what the range of results is kind of feeds into another thing, which is understanding the range of performances within the experiment itself. And that’s a really good way of baselining any performance, because it might be the case that you have a very large ad budget and a 1% increase in efficiency is massive for your bottom line. So it’s not just, okay, it’s got a small increase in efficiency, you shouldn’t listen to it. The question is, okay, well, how does that compare to the variation within the trials of the experiment itself? So for example, say they’ve done the experiment over a month, and over the whole month you’ve got 5% greater efficiency, you know, 5% greater ROI from the new solution. Now that’s good, but how do you baseline it? Okay, well, take each day’s performance. If the performance of both varies by 1% each day, then that tells you that 5% is quite meaningful. If the day-to-day variation was 80%, then that 5% difference over the month could be down to it having had one extra day that was 80% better.
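A rough way to see Ed’s point about comparing the headline uplift with the day-to-day variation is sketched below. The two series of daily uplift figures are invented so that both average exactly 5%; the helper name is hypothetical, and using the standard deviation and standard error of the daily figures is an assumption about how one might quantify the day-to-day variation, not a method stated in the conversation.

```python
import math
import statistics

def uplift_summary(daily_uplift_pct):
    """Mean daily uplift, its day-to-day spread, and a rough standard error."""
    mean = statistics.mean(daily_uplift_pct)
    spread = statistics.stdev(daily_uplift_pct)
    std_err = spread / math.sqrt(len(daily_uplift_pct))
    return mean, spread, std_err

# Hypothetical daily uplift (%) of the new solution over the old one.
steady = [4, 6, 5, 4, 6, 5, 5, 4, 6, 5]
noisy = [-60, 80, -40, 70, -75, 85, -50, 60, -30, 10]

for name, series in (("steady", steady), ("noisy", noisy)):
    mean, spread, std_err = uplift_summary(series)
    print(f"{name}: mean={mean:.1f}% spread={spread:.1f} std_err={std_err:.1f}")
```

Both series report the same average uplift, but in the steady one the uncertainty on that average is a fraction of a percentage point, while in the noisy one it is several times larger than the uplift itself.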
Tom
I really like that, actually, cuz it’s always one of the questions you think of. If someone gives you a result and says, you know, we’ve run this campaign and we got this much better, it’s like, well, if we keep running that for another month, should we expect the same result? That’s a really powerful question as well, I think, to ask someone who’s running that stuff for you.
Ed
It’s a good, it’s a good question for getting an understanding of how much scrutiny the person you are talking to has put into the experiment themselves as well.
Tom
Yeah. That’s a good point. And I guess almost, in terms of, let’s call it result engineering, you know, you might wanna stop an experiment on a particular day because the numbers give you a good result. So actually drilling into it like that would identify any behavior like that.
Ed
And for example, if people give you results which are very old, you know, this is the average performance of all the companies on our platform three years ago, you might wanna ask, okay, well, what’s happened in the intervening three years? Is there a reason you’re not using any more recent numbers? Now, it could be a data gathering issue. There could be very valid reasons for that. But one reason for that is, well, we’ve got a good number and we don’t really wanna
Tom
Yeah, yeah.
Ed
lose it, you know?
Tom
Yeah. Yeah. I really love that, honestly. That’s such a cool concept. That’s so powerful. Everyone else in the business is gonna hate you, cuz I’m just gonna be asking for this breakdown every time I speak to anyone now. But I think it’s so powerful, and I almost can’t believe I wasn’t aware of it. So that’s really cool. What about pitfalls? What are the common pitfalls people should be thinking of? And I guess whether they’re analyzing their results or setting it up, thinking of these will help deliver more accuracy into what they’re concluding.
Ed
So I think the most common pitfall, and it’s quite broad, is really what the scientific method is trying to avoid. And that is people being fooled by randomness: the idea that a random result is taken to be the truth. So particularly in statistical sciences, so the statistical analysis of data, we often talk about having what’s called a 95% confidence level, that is, a p-value threshold of 5%. Now what that actually translates to is that about one in 20 times you run the experiment, you could get a result which is positive even if there’s no effect. That’s kind of the way of thinking about it. So say you’ve got a champion challenger situation. Your challenger could win one in 20 times, even if it’s no better. It is a relatively small probability, but you have to remember that that is for each experiment, and you might be doing 20 experiments a year.
Tom
I was gonna say, yeah, if you’re doing 20 experiments a year, one of those is random.
Ed
Exactly. You expect one of those to potentially have the random result rather than the correct result. Now it is not quite that clean cut, because
Tom
Sure, sure.
Ed
it doesn’t automatically happen like that. It’s not every one in 20. And also the error can go, to a certain extent, the other way. So you can reject challengers even though they were better, for the same reason. So you’ve kind of gotta be wary both ways. Now, all you can do in these situations is do things which increase your confidence. And I’d say there’s a lot of experimental things you can do to improve your confidence, experimental as in things in the experiment, not cutting-edge experimental things. For example, run the experiment for longer. So going back to the example of being at school doing an experiment, as you do more and more repeat readings, you can get more and more confident about your result. Or try different scenarios. A more practical, real solution is monitoring once you’ve made the decision. So for example, say you have a champion challenger model and you go for the challenger. Okay, let’s monitor and actually check that it’s holding up the numbers that it produced during the experiment.
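The "one in 20" arithmetic can be made explicit. The sketch below assumes each experiment is independent and has a 5% chance of showing a win when there is really no effect; the function name is made up for illustration.

```python
def chance_of_false_winner(n_experiments, false_positive_rate=0.05):
    """Probability that at least one of n no-effect experiments looks like a win.

    Assumes the experiments are independent of each other.
    """
    return 1 - (1 - false_positive_rate) ** n_experiments

for n in (1, 5, 20):
    print(n, "experiments:", round(chance_of_false_winner(n), 2))
```

So across 20 such experiments you would expect one false winner on average, but as Ed says it does not fall neatly as every twentieth test: under these assumptions the chance of seeing at least one is roughly 64%, and some years there would be none and others several.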
Tom
Yeah. Great, great advice.
Ed
And if not, don’t be afraid to reconsider that problem. I would say this is also where theory comes in in a big way in science, and theory can kind of come in in marketing as well. And that is the idea that because you understand why a particular piece of copy is doing better, because there’s some behavioral science reason, for example, that you understand means there should be an improvement from this copy, then you can rely on that theory to give you more confidence in the result.
Tom
Yeah.
Ed
And unfortunately, that kind of clashes with another common pitfall, which we should probably talk about, and which we’ve spoken about before in our podcast about bias, and that’s confirmation bias.
Tom
Yeah.
Ed
To recap for anyone that hasn’t listened to that episode, confirmation bias is the natural thing we do as humans to look for evidence that reaffirms or confirms what we already believe to be true. This is something we do naturally. It has a role to play: it is worth, to a certain extent, investing more time being skeptical of things that confound you than being skeptical of things that reaffirm what you already believe. But it’s also not ideal, and we tend to overstate the importance of evidence that supports things we already believe. So when it comes to having a theory that you think’s gonna work, don’t just go with the theory and then skimp on the experiment because you like the theory.
Tom
And I guess that also can be used as a question. If you think about someone who’s giving you a result: is that the result they would’ve wanted for the conversation? Have they actually analyzed the results with enough rigor, or have they just managed to read the results in a light that confirms the thing that they thought, or that was maybe most beneficial to them?
Ed
Yeah, exactly. I think it’s that idea of being skeptical, and trying to be skeptical of yourself at the same time, which is difficult, but if you’re conscious of it, you can manage it. We’ve got a couple more common pitfalls as well. Something I think we’ve definitely spoken about before on the podcast is the idea that causation is not correlation.
Tom
Yeah. Yeah.
Ed
So correlation being the statistical idea that two things increase with each other. The classic example is that you think your change, or whatever you’ve done, has improved your results or the clickthroughs on your ad, but actually it’s some sort of external factor.
Tom
Mm-hmm.
Ed
And unfortunately, in experiments we nearly always measure correlation. We very rarely measure causation. There are methods to measure causality, but no one really uses them day to day in the statistics that marketers use. That’s all correlation, really, that we’re measuring. The causality comes from the theory and from the repeat experiment, and also from the idea that a correlation will get broken by randomness at some point.
Tom
We’ve talked about it before, and I think it was the Spurious Correlations website that does a great job at highlighting this. You know, things like, and I dunno if it’s actually a real one, but these sorts of things: the number of films Johnny Depp is starring in and swimming pool deaths in America. There’s actually no link there just because the lines are the same.
Ed
A lot of the time this is to do with those sort of cross factors. So it’s something outside the experiment, and that goes back to controlling for all of the things that are gonna change. One final idea is probably worth mentioning in terms of common pitfalls. It really goes back to the idea of measuring exactly what you wanna measure, and over the time period that you want to measure. So there’s a classic maths problem, I’d say, which is known as Simpson’s paradox. It’s a paradox, I guess that’s what it is. Basically it is a statistics problem, or a phrasing of an idea: because you have uneven sampling, the treatment, or let’s say the ad, that works best in two subgroups of your audience actually works worse overall. A good example of this would be: you have two ads and you try and sample randomly, but your random sampling goes a bit wrong and you end up showing one of the ads to a lot more males and the other ad to a lot more females. Now, in that position, it is possible for the ad that outperforms amongst males to be the same ad that outperforms amongst females, and yet not be the ad that outperforms across the population as a whole. So I would suggest people go and look it up and convince themselves. It takes a little bit of getting your head round, but it’s actually quite simple once you see it written down. Unfortunately that’s not the podcast medium, but once you see it written down, you can see that it is highly possible. And this goes back to understanding exactly what you’re doing, in particular when you are analyzing your data afterwards and then making decisions off that, and matching your decision to your experiment. So don’t make the decision "we are gonna push this ad to males" on the basis of an uneven experiment, because you’ve segmented your data afterwards and actually your sampling’s gone wrong.
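Written down, the paradox looks something like the following Python sketch. The impression and click counts are invented purely for illustration (they are not from the episode or from any real campaign), and the helper names `segment_rate` and `overall_rate` are hypothetical. Ad A is shown mostly to males and ad B mostly to females, which is the uneven sampling described above.

```python
# Hypothetical (impressions, clicks) per ad and audience segment.
# Ad A was mostly shown to males, ad B mostly to females.
results = {
    "A": {"male": (900, 90), "female": (100, 30)},
    "B": {"male": (100, 5), "female": (900, 225)},
}


def segment_rate(ad: str, segment: str) -> float:
    """Click-through rate of one ad within one audience segment."""
    impressions, clicks = results[ad][segment]
    return clicks / impressions


def overall_rate(ad: str) -> float:
    """Click-through rate of one ad across all segments combined."""
    impressions = sum(i for i, _ in results[ad].values())
    clicks = sum(c for _, c in results[ad].values())
    return clicks / impressions


for ad in ("A", "B"):
    print(
        f"Ad {ad}: male {segment_rate(ad, 'male'):.0%}, "
        f"female {segment_rate(ad, 'female'):.0%}, "
        f"overall {overall_rate(ad):.0%}"
    )
```

With these made-up numbers, ad A wins among males (10% against 5%) and among females (30% against 25%), yet loses overall (12% against 23%), because most of its impressions went to the segment that clicks less.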
Tom
Okay. Some reading to do there, but yeah, it makes sense. I guess to bring it to a close, what core bits of actionable advice can we give people off this? Obviously there’s lots in there, but three core points?
Ed
I guess if I wanted to pull three things out, I think the first would be a concession almost, and that is that it doesn’t have to be perfect. You don’t have to run the perfect experiment. To not run the perfect experiment is not the same as not running an experiment at all, or tossing a coin. Obviously there are always business limitations.
Tom
Yeah.
Ed
You know, for example, we spoke about a challenger maybe winning one in 20 times when it shouldn’t. That’s fine, because 19 out of the 20 times you are getting what you want out of the experiment, right? So actually you are better off going through the process even though the process isn’t perfect. Secondly, and this is actually related to that, I think that a champion-challenger model with constant monitoring is better than an A/B test-and-set model. So have your baseline and then ask the question: is what I’ve done that’s new at least no worse than the baseline, in almost a pre-trial? And then have the full trial and ask if it’s better. And if it isn’t better, this is about being comfortable with the idea that two things could be similar, right? They don’t have to be one better than the other.
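One way to check whether a challenger is really better, or just similar, is a significance test on the two conversion rates. The sketch below uses a two-proportion z-test, which is a choice made here for illustration and not a method named in the episode. The conversion counts are invented, the function name `two_proportion_p_value` is hypothetical, and the 0.05 threshold is simply the "one in 20" figure mentioned above.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates.

    Uses the pooled two-proportion z-test (normal approximation).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical champion at 5.0% and challenger at 5.5%, at two sample sizes.
p_small = two_proportion_p_value(50, 1_000, 55, 1_000)
p_large = two_proportion_p_value(5_000, 100_000, 5_500, 100_000)

for label, p in (("1,000 each", p_small), ("100,000 each", p_large)):
    verdict = "significant" if p < 0.05 else "no significant difference"
    print(f"{label}: p = {p:.3f} ({verdict})")
```

Under these assumptions, the same half-point gap is not distinguishable from noise with 1,000 impressions per ad, but is with 100,000. In the smaller test the fair conclusion is that the two ads are similar, not that the challenger has won.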
Ed
And that’s sort of the problem with A/B testing: often you fall into the trap of, okay, well, we’ll just go with the one that’s a little bit better, rather than accepting that there’s no real difference, or that there’s no statistically significant difference, to be scientific about it. And then constant monitoring: understanding what your previous champions were doing and how well they were performing, and taking your current champion, whatever’s running, and constantly analyzing it against those previous models. Because it may be, and this is something that actually is really quite common, that just changing things has caused a difference. It’s almost the football manager solution: changing things has caused a difference because it’s new and it’s fresh and people are bored of the old ad, ’cause they’ve seen it a thousand times, and actually it’s not what you change it to.
Ed
Right. The football manager comes in, the team do better for three weeks ’cause they get a little boost and they wanna impress the new manager. It’s nothing to do with the skills of that manager. It is literally just the fact that you’ve got a new manager. And then my third piece of advice, and I would say this as a data scientist, is data cleanliness, data hygiene, and have a data plan. Collect what you wanna collect, and make sure you know how you’re gonna get it early. And by early, I mean before you start, not afterwards. And get a baseline out of that.
Tom
Awesome. Three great points, and loads of great info. That’s been fascinating. Thanks, Ed, for another good chat. As usual, like and share this with anyone else you think would benefit from it. And with that I say goodbye and see you in two weeks’ time. Ed, say goodbye.
Ed
Goodbye.
