Spent $200 Billion, Then Said: Use AI Less | Episode 450

These engagement failures, and how to fix them, map directly onto the Octalysis Core Drives. Get the free Core Drives in the Wild guide: professorgame.com/WildCD

Episode Summary

Rob breaks down why Amazon shut down KiroRank, the internal leaderboard that scored staff on raw AI usage on its Kiro developer platform. He shows how stacking Core Drive 2 (Development & Accomplishment) and Core Drive 5 (Social Influence & Relatedness) produced flawless compliance toward the wrong target, a textbook case of Goodhart’s law: once a measure becomes a target, it stops being a good measure. Drawing on the Octalysis Strategy Dashboard and Toyota’s Five Whys, he lays out the one question to ask before you measure anything. Listeners learn to measure outcomes instead of activity, and how to keep a proxy metric from quietly getting gamed.

About the Host

Rob Alvarez is Head of Engagement Strategy, Europe at The Octalysis Group (TOG), a leading gamification and behavioral design consultancy. A globally recognized gamification strategist and TEDx speaker, he founded and hosts Professor Game, the #1 gamification podcast, and has interviewed hundreds of global experts. He designs evidence-based engagement systems that drive motivation, loyalty, and results, and teaches LEGO® SERIOUS PLAY® and gamification at top institutions including IE Business School, EFMD, and EBS University across Europe, the Americas, and Asia.

Key Takeaways

  • Amazon shut down KiroRank, its internal leaderboard scoring staff on AI usage on the Kiro developer platform, after employees set autonomous AI agents on needless tasks just to climb the ranks and inflated the company’s compute costs.
  • Goodhart’s law explains the failure: when a measure becomes a target, it stops being a good measure. You get what you measure, not what you want, so raw AI usage climbed while productivity went unmeasured.
  • KiroRank stacked Core Drive 2 (Development & Accomplishment) through a progress bar and ranking, and Core Drive 5 (Social Influence & Relatedness) through public status, producing flawless compliance toward the wrong outcome.
  • The more powerful and expensive the tool being measured, the more a gamed metric costs you, which is why Amazon paid in real compute money rather than a rounding error.
  • The Octalysis Strategy Dashboard starts with business metrics by asking what outcome you actually want, using Toyota’s Five Whys to move from “increase AI usage” to a result worth hitting, like productivity per employee.
  • Engagement is the value created for users and the business, not click counts or usage volume, which is why most dashboards measure activity when they should measure the outcome.

Topics Covered

  • 0:00 – The $200 billion AI paradox
  • 0:27 – Goodhart’s law and gamed metrics
  • 1:49 – The two Core Drives Amazon stacked
  • 2:39 – Flawless compliance, the wrong target
  • 3:38 – Amazon’s KiroRank AI leaderboard
  • 5:11 – Measure the right thing, not usage
  • 5:38 – The Octalysis Strategy Dashboard
  • 6:12 – Toyota’s Five Whys for metrics
  • 7:21 – When proxy metrics are unavoidable
  • 7:58 – Measure the outcome, not the activity
  • 8:33 – Get the Core Drives in the Wild guide

Free Resources and Get in Touch

Looking forward to reading or hearing from you,
Rob

Full episode transcription (AI Generated)

The $200 billion AI paradox

Rob Alvarez (0:00): So picture this. This company is spending around $200 billion this year, most of it on AI and data center infrastructure. They’re betting the whole farm on AI. That same company just told its own staff to use AI less. A senior VP actually told staff, and the FT reported this in May, don’t use AI just for the sake of using AI. It’s a huge company. And the reason why is a lesson about motivation that has absolutely nothing to do with AI.

Goodhart’s law and gamed metrics

Rob Alvarez (0:27): Enter Goodhart’s law. When a measure becomes a target, it stops being a good measure. You get what you measure, not what you want. Nowadays, everybody and their uncle is measuring AI usage, these raw usage metrics, and getting gamed all along. So, hi, I’m Rob. I’ve been driving adoption and behavior change for digital products and programs using behavioral science for over a decade. And now I’m also using the Octalysis framework at the Octalysis Group, the premium consultancy on gamification and behavioral design. And I’m also leading the number one podcast on gamification, Professor Game. This isn’t the typical AI backlash story that everyone is reading in the newspapers. It’s a Core Drives failure. And you can build the same trap entirely by accident. So during this video, what we’re gonna do is we’re gonna look at the motivation machinery behind it, why gaming was, as Thanos likes to say, inevitable. And the one question to ask before you measure absolutely anything. And if looking at these failures, how to fix them, and how they are represented in the Core Drives is something that you are looking into because it will drive your business forward, we have something for you. We have the Core Drives in the Wild free guide. All you have to do is click on the link below in the description and you’ll get direct access to that, as well as to our email list.

The two Core Drives Amazon stacked

Rob Alvarez (1:49): Now let’s get started, because this massive company strapped on two very powerful Core Drives. All the Core Drives are powerful, but this combination is also a very good one. They put together Core Drive 2, Development & Accomplishment, and Core Drive 5, Social Influence & Relatedness. Core Drive 2, because they put up a leaderboard, literally a leaderboard, to see how much AI usage, or, as you could say nowadays, how many tokens you’re spending on the use of AI. And you could see your progress. It was very clear: you had a progress bar, you had a leaderboard showing where you were positioned within that table. And also, of course, that brings in Core Drive 5, Social Influence & Relatedness, because you were there next to other people. So you were gaining status, or losing status in some cases as well, depending on where you positioned yourself on that leaderboard.

Flawless compliance, the wrong target

Rob Alvarez (2:39): This was generating some perhaps even healthy competition, some colleagues saying, yeah, I’m beating you now on this leaderboard. And this healthy competition got them to maximize for that metric. Actually, the system got the behavior it was rewarding. Flawless compliance. The problem was it was the wrong thing. It was raw usage. They were not really aiming for how productive they were becoming with AI, how the benefits of AI were landing. That was literally not being measured. And remember, perhaps you might look into a past episode where we were discussing how the way the test was set up when ChatGPT was coming out was setting up the students to actually try to find a way to go through the cybersecurity and be able to use ChatGPT to get the right answers under very high pressure. This is a similar application of that same principle.

Amazon’s KiroRank AI leaderboard

Rob Alvarez (3:38): So now let’s dive into who actually made this mistake. And it’s none other than Amazon. Their leaderboard was called KiroRank, scoring their staff on this AI activity on the Kiro developer platform. Again, this was quoted in the Financial Times article. You can find the source in the show notes. But the main thing is that the workers were even setting autonomous AI agents loose on needless tasks, just to climb the ranks. So they were optimizing for increasing their AI usage, not for making this AI usage useful. Of course, the compute costs were all over the place. What did they end up doing? They had to kill the initiative itself. And this is not an “oh, dumb Amazon” story. This is a measurement trap that is easy to fall into whenever you are setting a target. As I mentioned before, a teacher sets the scores. The teacher, or the professor in our case, prepares you for the test, and it produces the exact same outcome that we were trying to avoid, which was cheating. The more powerful the tool that you are actually measuring, the more expensive it’s gonna become for you when people start gaming the metric. Amazon paid in real compute money. This was not just a small mistake. This was not just a small blip. It became so big that it’s been quoted in major media.

Measure the right thing, not usage

Rob Alvarez (5:11): So the question becomes what can I actually do about this? Because by no means am I advocating for not measuring things. Quite the opposite. I do think measurements are important. There’s many other principles as well, you know, including the very famous as well, what gets measured gets improved. And this actually sticks to that. If you’re measuring how much you use, you’re gonna improve how much people are using it. What are they using it for? If you’re not measuring it, it does not get improved.

The Octalysis Strategy Dashboard

Rob Alvarez (5:38): So one of the things that we do in the Octalysis framework, in our five-step process, is within the Strategy Dashboard. We do what we call the business metrics. And this is not, what do we want to measure? Get on with it. We really deep dive into what it is that you actually want as an outcome. What is the result that you were looking for? In the case of Amazon, what was the result that they wanted from AI adoption? They didn’t really want AI adoption for the sake of it. What was the reason behind that adoption? That is what you’re really targeting.

Toyota’s Five Whys for metrics

Rob Alvarez (6:12): And oftentimes, what I do with my students, this is a completely separate subject, when we talk about lean operations, this came from back in the 80s, I think, or maybe even further. The Japanese at Toyota have a principle called the Five Whys. You ask, why do you want this? And it’s like, we want to increase AI usage. But you stop for a second and say, well, really? Why do you want to increase AI usage? And you keep asking this at least five times, maybe even less, or sometimes a little bit more, until you really arrive at something that says, if I achieve this, this will be a worthwhile objective. Because if you say, we increased AI usage, okay, so what? If you say we managed to increase, I don’t know, productivity per employee in this section by 2x or whatever the number is, that is, or probably is, I would say, a lot more meaningful than we managed to increase AI consumption by 500x and now we’re paying for it. So getting that real nice round metric of what it is that you want to achieve actually makes Goodhart’s law, which we mentioned at the start, work in your favor.

When proxy metrics are unavoidable

Rob Alvarez (7:21): So, yes, if you’ve built a gameable proxy metric, and don’t get me wrong, I understand this is sometimes what you get. And going back to the teaching example that I’ve mentioned a couple of times from the previous episode, it’s really hard to measure learning. So we use proxies like tests, oral exams, presentations, and whatnot, because we cannot drill brains and see how much people have really learned. It’s really, really hard. Sometimes you do need proxy metrics, but you also have to look at what are the consequences. What if people only aim for that? What’s a way to game the metric? You have to look at it really, really hard.

Measure the outcome, not the activity

Rob Alvarez (7:58): In the end, what you want to measure is the outcome. Not the activity. So most dashboards are measuring the activity. And progress, Core Drive 2, Development & Accomplishment, people see how they can progress through whatever it is that we’re measuring. Engagement is not really how many times people are clicking. That is irrelevant. It’s the value that you’re creating for your users, of course, and for whomever the client is. It could be an internal client, like in the case of Amazon, an external client, whatever that might be. You want to consider the user, but also yourself as a company, as a designer. What is the metric that you’re optimizing for?

Get the Core Drives in the Wild guide

Rob Alvarez (8:33): If this is something that has been useful for you and you want to look at how you can optimize the use of the Core Drives from the Octalysis framework on your own cases, we have a free resource for you. Just go to the link in the description, click on the Core Drives in the Wild guide, and you will get access to that. You’ll get an email every single day with one of the Core Drives explained. You’ll look at a case out there in the wild from one of our past episodes, with my take as a consultant now from the Octalysis Group. And you will have that right in your inbox for absolutely free. And you know, as a bonus, as a plus, you’ll also be added to our email list, which I’m now regularly updating with the latest things, stuff I’m observing every single, almost every single day. Just go on to the link, right, you can find, I think it’s right below. Click on that and get access to your Core Drives in the Wild. And as always, at least for now and for today, it is time to say that it’s game over.

End of transcription
