
A conversation with experts about what it takes to build things at scale.

Bill Kennedy: The mistake boot, building ACs, Black boxes & AI in software - The Reliability Podcast



(00:00) Because at the end of the day, it's not about your time at two o'clock in the afternoon when life is good. It's about when that bug hits production; that's when the stress hits. How quickly can you analyze and fix that? If you're not already practicing that every day on your local machine, you're dead. This is why I tell people, stop using debuggers. Don't use a debugger to find bugs. Use a debugger to learn a mental model of a code base that you've never seen before. Debuggers trace. They don't find bugs. They trace code. I got a piece of code here. I don't know how it works. I'll step through with the debugger, but you got to get out of it.


(00:49) PIYUSH: Welcome, Bill. Really, we've known each other for a while, and your work in the whole industry has been really commendable, and I have so many questions. So I'm going to ask them one by one. The first one that I have is, how did you become an engineer? I don't know anything about your background. Just tell me, how did engineering attract you? Software engineering.

How did Bill become an engineer

(01:14) WILLIAM: Real quick. I must've been in junior high school or a freshman. My parents just brought home this computer from Radio Shack. You had to connect a cassette player to it to be able to save and load anything. And it had BASIC on it. And I don't know, within about two weeks, it was consuming all my time when I got home from school. And I think I kind of maxed it out within a couple of weeks. And then I was fortunate enough that my parents, who are not wealthy at all but I guess saw the value in it, invested about $1500 in a Kaypro II, which was amazing. CP/M. It was the next level. I think I lived on that machine.

(02:06) PIYUSH: What year was this?

(02:09) WILLIAM: I think I got the Kaypro in '86. Must have been like my junior year, '85, '86, I got that Kaypro II. It was one of these "portable computers". It had the cover, you pick it up. I mean, the thing, God knows, probably weighed more than I did at that time. All I did was learn BASIC and write a ton of games that I couldn't even win. I don't play games because I just can't win at games, even the ones I was writing. And then eventually I got that upgraded to an IBM. Not an IBM, but a Kaypro sort of PC with MS-DOS at that point, and that machine carried me all the way through college. But I think when I fell in love with that Kaypro, I just knew that this is what I wanted to do. I just wanted to be a software developer. The funny thing is, when I got my first job at a university, I got this first job in '92, there was like no work, and I did not have any pedigree. I went to a state university, which was fine, but I enjoyed having fun more than going to class, so I didn't have the best GPA coming out of college. It was tough to find a job back then. It took me literally eight months to find a job. And I took this job for $18000 a year in 1992 because it was the only job I could get. But the job really was a mix of software, writing C code, and having to do basically IT: fix machines, install networks, etc. I became a Certified Novell Engineer in those two years, and about two years into that I had to make a choice. Did I want to continue down the path of being that CNE, that Certified Novell Engineer, and stay in the ops world, or get into software? And I decided software for one reason. At least back then, when something broke, if you didn't have the parts, it wasn't gonna get fixed. And I hated that. I hated being basically handcuffed, not being able to fix a problem because you didn't have the hardware. Which taught me to constantly have duplicates of everything over time.
But with software, at least if you wrote it, you could fix it. And so I really kind of went down that path. My second job was in a Miami hospice named V-Toss and, at that point, I just went heads-down and focused on software dev.

Why the love for engineering

(04:44) PIYUSH: Interesting, interesting! So you got into software, and you're saying that back then there were very few jobs as well. This has always amazed me: almost everybody I've met starts their career as a fascination with computers. But during this course, when is that first gratification and reward that you get, where you say, I've got to make a career out of this thing, and this is what I'm gonna do for the rest of my life? What was that moment?

(05:20) WILLIAM: I knew right away, because I wasn't really interested in much of anything else. Played a lot of sports, but at basically five-two and a hundred pounds wet, you weren't gonna do anything professionally, right? And I love sports, and I spent a lot of time outside. But from a career perspective, I just loved writing code and watching things happen. I knew that was gonna be my path. There wasn't even gonna be another option for me. If I wasn't outside and hanging out, I was in my room just sort of banging out code, and if I didn't have the computer with me, I was writing it on paper at times. So I don't know, I just felt this. The funny thing is, my dad was a musician, a really good musician, and at the time I got that Kaypro I'd also been given a guitar. I got these two things at the same time, and I always laugh because we know which one won out. I cannot play guitar, okay? You always gravitate to the things that you want to do. I hate when people say "I don't have time for that", because that's wrong. You just don't have the attention for it. You always make time for the things you want to do, and I just kept making time for the computer, not the guitar.

What does it take to stay relevant in engineering?

(06:42) PIYUSH: Yeah, you know, this is very interesting. I almost got a sneak peek into the sort of person that you are and the way you approach this, because my next question was: you mentioned '92 was your first job, right? By the clock, if I measure right now, that's been 30 years. 30 years into the software industry. I only have one question for you: how do you stay relevant and interested? Give me a quick one for this one. What's your recipe here?

(07:14) WILLIAM: I'm not the kind of person who actually loves change. Anytime something has to change in my life, it starts with a moment of depression and kind of just exhaustion. And then I have to fight through that for a month or two, and then I get excited about the fact that I'm learning. So, as an example, in 2013 I had to switch to Go because we had to get off of Windows as a company. We just had to get off of Windows. So suddenly I had to learn a new programming language and a new operating system in, like, March of 2013. I was depressed. I knew Windows inside and out, I could code anything in C#, and now I'm starting from scratch again. Probably within a few months I started feeling like I could be productive again, be productive on my machine and not be clueless, and be productive programming. And then the excitement sort of starts to come in, because I feel like I'm not a complete idiot, right? I'm not completely clueless. So that's one problem that I have. I know I need to learn something new every year, but especially at my age, right? You're 53 and you have no idea what this is, and you feel like you need to learn it. And you know there's gonna be a struggle, but you just got to put the hours in every day. And at my age, it's a little bit easier than, let's say, it was when I was starting out, because when you're starting out, you've got to work, you gotta bring in money. I had kids at an early age, at 26 with my first kid. This idea of Funcation that I see sometimes on Twitter.

Persistence in the job

(09:00) WILLIAM: I don't get it, this idea that somebody can just say, "I don't like my job, so I'm just gonna quit for three months." That blows my mind, dude. I've never been in that situation. I've always had somebody I've had to keep alive, right? I told my wife recently, I don't want to keep anything else alive. Can we just, like, no more animals, no more nothing? I'm done, right? I got all these kids, and I've always had to keep kids alive. There's no "I'm not liking my job, I'm gonna quit." Like, freaking insane. I mean, even now, where I probably have enough financial freedom to just do that, I don't think I could. My stomach would hurt thinking that I don't have some income coming in. So the bottom line is, when you're younger, I think there's probably more stress to have to learn things you maybe don't want to learn, because you've got to have the income coming in, and the job is what the job is until you can maybe find a new one. But now, at my age, what's nice is I get to sort of feel the waters on something and ask myself honestly: are you feeling any passion for this? Is this getting you excited? No? I'm not gonna do it. I don't care if the industry thinks that everybody on the planet should know this. You know how long it took me to start learning Kubernetes? Probably five years from the time that Kelsey first talked about it. Right. Yeah, I had lots of friends feeling like, "I'm falling behind, I don't know Kubernetes." I'm like, I'm just not passionate about it. Eventually I had to learn it. It became important from a business perspective, right? And again, I was depressed that I had to learn this stuff. Super complicated. But over a few years

(10:42) WILLIAM: I'm comfortable with it now. So you just gotta find that thing that you don't mind putting your attention to. It's not about time, it's about attention. And don't lie to yourself that you have some

(10:57) form of attention to something that you don't, because it's obvious for people to see what you're actually spending it on.

(11:05) PIYUSH: No, amazing. Yeah. I usually have an opinion here: of all the things that we can fake, attention is something we cannot fake.

College and curiosity

(11:12) WILLIAM: So then, back to your question about early years being formative, to learn how to learn. I think the one thing that college did for me, the one thing that college can do for someone, is not necessarily to learn a particular tech, because we know tech is changing every year. It's learning how to solve problems. I think it's learning how to be a programmer, where you can find code and find solutions that work for a happy path. What you don't have time to learn in university, and what you have to learn outside of university (and I don't know if you can learn this without having the right people around and having a desire for it), is the engineering piece. I mean the desire to change that programming and turn it into a piece of code or system that's resilient, right, and manageable. I know people who don't care about that. They just want to bang out some code and get it working, life is good. I don't want to work with those people. So I think the formative years need to be where you're in an environment where you're allowed to fail, you're picked back up, and you're told to keep going forward.

(12:35) I just saw something recently where, oh, we had it at a client, where one of our guys made a big mistake, brought down production for a couple hours, and they wanted to fire him. And I'm like, dude, if you fire him, you gotta fire me and everybody that I know, because there isn't one person that cared about the job they were doing that hasn't done something to a production system, right?

(13:01) I'd rather have somebody on my team that brought down production this year for a couple hours than somebody who I didn't even know existed on the team. Who's doing more for you at the end of the day? And so I think you need to have a culture where you want people to do things without the fear, and do things right. Don't make the same mistake twice. But things happen, okay? Like, I would more question the process and the access they had to bring down production than the fact that they brought it down. So I think the early years for anyone have to be about

The mistake boot

(13:42) WILLIAM: Learning how to minimize mistakes, but not be afraid of sharing the mistake. I was in a shop, we had this size 16, like a Shaq-level sneaker. We painted it gold. We called it the boot. Anytime somebody made a mistake, whether it brought down production or stopped developers from working, they would write up what they did and they put it in the boot, and the boot would be on your desk. I thought you're going to make them wear the boot. No, it's on your desk. So everybody sees you have the boot. You could keep the boot for a week, a month, or an hour. There were times where I got up out of my desk, walked over to the desk that had the boot, and everybody would stop working and look at me. I would just grab the boot and walk off. I didn't even wait for the ceremony. I just knew this was mine. Everybody was like, oh man, what did Bill just do? The point of the boot wasn't to shame anybody. It was quite the opposite. It was to show that we're a team, we all make mistakes, we fix those mistakes.

(14:47) PIYUSH: Embrace the failure.

(14:48) WILLIAM: Embrace it and own it. Don't hide it, own it, because if you hide it, we can't fix anything. Then at the end of the year, we would take all the papers out of the boot and read them, and just laugh for like an hour. Because those problems were no longer problems. We would just laugh at ourselves that we made these sorts of mistakes.

What’s missing in engineering

(15:12) PIYUSH: This is naturally flowing into my next question, which is: you've seen software for a very long time. All of these stories, our journey through this software, bring us to where we understand both things equally. One, it has to be reliable, it has to work. At the same time, we know mistakes are going to happen. If you wear that lens again, what would you say you see as missing in the software engineering landscape? I know it's a very generic question, but what are those prominent things that come to your head? I'm going to drill down on this further, I've got nuanced things here that I want to talk about. But when you look at software engineering today and the practices that are happening, where we are right now as an industry, if we have to progress, what do you find missing? If you were to come in greenfield today and say, I wish there was this aspect of software engineering which would actually improve things in general, what would that be?

(16:14) WILLIAM: The same thing that's missing today was missing 30 years ago. It's the attitude. It's this attitude that I'm responsible for the code that I write, versus this attitude that I'm not responsible. Every developer is responsible for the software they write. And they have to understand the production environment they're working in. So I'll give you an example. Early in my career, I was working in a lot of enterprise companies where there were IT departments. When I say IT department, I mean people who were writing software for the internal organization so they could do things, right? I found over all that time that anytime I met an engineer or developer that was an IT developer, their software broke all the time. And it was because they always had this attitude that they could walk over to the desk of the person where it broke and just do some debugging there and patch it and get it out. It was like their work motto was: get something done fast that seems to work. And if it breaks, it's no big deal, because I'm just affecting something internal or this particular human being. And I'll patch it really quickly and they'll be up and running. So their entire life is just about banging out a bunch of code however they can, and then fixing it every day as it goes on. And the attitude that also comes with that is: this is my job security. Why would I ever build the software in such a way that it just worked, or somebody else could take it over? Because then I'm gonna be replaced, and I wanna have that job security.

(17:58) So I don't know where these mentalities come from, but for almost my entire career, I would not hire who I thought were IT-based developers, because it was so hard to get the idea into their head that you're not allowed to touch production. You're responsible for the code you write. You should have this desire that nobody finds bugs unless they're, like, crazy edge cases. And you need to have that idea that you are just a steward. And it's not about job security, it's about your ability to grow. And if somebody can't take over the software you're writing, then I can't even promote you.

QA for 2 weeks rule

(18:39) WILLIAM: You're worried about not getting fired. I'm worried about not being promoted because somebody says we can't lose you in this position. It has to be this sort of change in mentality. When I was running software shops prior to Ardan, I would make everybody that started on my team work for at least two weeks in the QA department. People would treat the QA engineers like s***. There's no other word for it. They'd put code in there, QA would find a bug, and they'd argue with QA whether it was a bug or not, or this or that. And I said, you know what? That ends. All of you start your jobs on my team in QA, usually for at least two weeks. And that stuff stopped immediately. One, because they were finding bugs in those weeks and having to interact with developers. Two, they created relationships with these people. And three, they understood the importance of not slowing down QA because you decided not to test something that was so obviously broken. And I would also do things like this on a Friday during the summer. I'd be like, okay, everybody at your desks, if you can find three bugs in the next hour, real bugs in the next hour, you can go home early. You know how many bugs...

(20:01) PIYUSH: Where does this love for quality of software come from? What makes you passionate about this quality of software? I do agree with you, I find it missing in many people that I've met.

(20:18) WILLIAM: When I say that you need to be responsible for the code you write, it's coming from a place of empathy. There's a real human being on the other side of that who is having potentially a really bad day. We know we have a mental health crisis all over the planet today, much worse than we ever did. You know, your bug, think about this for a second. We are the only industry, and I didn't make this up, I don't remember where I heard this from. We are the only industry where people's lives are at risk and we're not licensed to do our job. Think about it.

(20:55) PIYUSH: Beautiful, profound.

Empathy with code

(20:57) WILLIAM: And so you might say, well, Bill, I don't work on embedded healthcare equipment. No, but your bug on that site, whatever it is, could be the final straw in somebody's bad day. And so it may not be directly related to their heartbeat, but it's directly related to their mental health at that moment in time. And so you should not be writing code without that empathy. I don't care if you're a backend developer and you feel like your code doesn't touch people. Your code's driving the front end. You need to have empathy for the people that are using your software. And if you don't, again, get out of this industry. So that's where it comes from. It also comes from the idea that I've had to wake up way too many times in my life at three o'clock in the morning to deal with somebody else's problem. Okay? That's not fair. Back in the day, in the 90s, I would get paid for that at least. I used to, like, need extra cash. I would take the phone home. People eventually started realizing I was making like an extra thousand a month, and they started competing for the phone with me, which really pissed me off. But today, I don't know what happened. I've heard stories now that you're on call and you don't get paid to bring the phone home for that week at night. You know, that's even worse. So there are people's lives involved everywhere. And if you don't have empathy for them, again, what are you doing? Why are you here?

Permanent software, ephemeral job

(22:25) PIYUSH: Amazing, Bill. I wanna share two opinions here. One, to conclude what you just said about people having to own up to the software that we write: I've been saying this for a while, software is almost permanent, and your job in it is maybe ephemeral. So whatever you do, make sure that it actually falls in line, because the software is permanently going to be there. It's there as a part of our lives. And that takes me to the next thing I wanted to touch upon, and you have already profoundly mentioned this. I keep telling this to a lot of people that I meet: if you look around these days, software is in every moment of your life, from the start till the end of it. Let's imagine a very quiet moment. You're having dinner, and you say, hey, Alexa, can you play a song? In this moment, you're basically relying on a piece of software to add that music. Or you're having dinner and you're running out of something.

(23:32) PIYUSH: You run out of something, you DoorDash it. Imagine if the payment did not go through, how infuriating it would be. And I'm talking about good cases here, not situations where lives depend on it. There are other parts of it, you know, where you're driving your car. Software is all around you. Software has become such a permanent part of our lives that we don't even pay attention to the fact that there's software running underneath, and it touches our lives on a daily basis. It is so much around us that our lives would be dysfunctional if the software stopped working. And, you know, it adds to these moments. It makes us happy, it makes us sad, it makes us infuriated. It can change our mood in a day, and it touches us in so many places. And you said it really well here: we are not licensed practitioners, given the amount of responsibility we have on our heads to deliver something that people almost carelessly look past. It's so subtle and so far underneath our lives that we move past it assuming, hey, this thing is always gonna work, right?

I build Air Conditioners

(24:42) WILLIAM: So when people ask me what I do, you know what I tell them? I tell them I build air conditioners. And they're like, what? I go, you're in a room right now, right? Yeah. Are you comfortable? Yeah, I'm comfortable. Have either of you, even on this call today, thought about the air conditioning system once in all the time that we've been talking here? Of course not. Why? Because it's working. But if that AC breaks down, not only are you going to feel it immediately, it's gonna consume the rest of your day. We need to get into this attitude that we build air conditioners, that nobody knows the systems we're building even exist. The idea is that nobody in your company, other than your team, should even know who you are, especially if you're a backend developer. I always say, if you want the oohs and the ahs, become a front-end dev. But I promise you, you're gonna spend hours on a screen, you're gonna show it to somebody, and they're not gonna say, good job, they're gonna say, I don't like this. So I couldn't be a front-end dev, right? It's just constant negative feedback on all the time you put in. If you're a backend dev, your attitude has to be: I'm building air conditioners. Nobody's gonna know my name in this company, because my stuff's just gonna work, and it's always gonna work. And that's the attitude. I want everybody to say, I build air conditioners.

(26:16) PIYUSH: No, amazing. I would like to end this passage of ours by saying the hallmark of the work that we do as software engineers is to deliver these moments for our users, exceed their expectations, time and again, even when they're not actively paying attention to the software underneath. Or, in other words, as you just said, it just works, so much so that they don't even realize it.

(26:42) WILLIAM: If they have to think about the software, we failed. I tried to buy something yesterday. I tried to buy a silly thing from my business partner yesterday. I went on their website, I filled in the form, I hit the button, and they said a bunch of fields weren't filled in, and they were. And I looked at it, I must've spent three minutes, and I finally said, you know what? You don't get my money now. Like, that moment was horrific, right? That form shouldn't even play into my head. At this point, I know how to fill it in. It doesn't even register. And these are the moments that we've got to create. Nobody should even be thinking about the website or the webpage or the button. It's just there.

Relevance of 'engineering' in software

(27:26) PIYUSH: I have a very controversial opinion here that I want to ask you about. We can almost conclude that software is of utmost relevance to us in modern human life. I want to ask you a question. It's been 30 years. Do you think software engineering is still relevant, given the fact that we have not been able to address this hallmark of quality? I agree, software is relevant, but is the engineering still relevant? Or do we need better ways around it?

(27:56) WILLIAM: I didn't come up with this quote. The biggest downfall of civilization is going to be legacy software. Forget about earthquakes, forget about tornadoes and hurricanes and asteroids. That's not how civilization's gonna end. It's gonna end because all of this legacy software that we have is eventually gonna break down and nobody's gonna know how to fix it. So this idea of engineering excellence is even more important today than maybe it ever was, because the systems are much more complex and the integrations are much, much deeper.

(28:31) What I'm seeing from the industry today are systems like Service Weaver that are trying to have developers write the minimal amount of code they need to solve a business problem, and then all of the code underneath that, to deploy it and manage it in production, is sort of taken care of for you. I'm seeing that now more and more. There's Darklang that tries to do the same thing. There's another programming language or system that somebody just shared with me. I don't know if these are the... I like Service Weaver only because you still at least own all the stack, because it's Go. But I don't know how excited I am about some of these other systems, because they're so black boxed that if those companies go out of business, like, what do you have? But I'm seeing the industry trying to create more and more boilerplate, more and more sort of tooling you don't have to touch. We've always said that Kubernetes eventually will become similar to the operating system. Nobody will ever go to the kubectl command unless they're really deep in the bowels of something. But remember, every layer creates more and more opportunity for bugs, more and more opportunities for problems, more and more things that somebody has to have a mental model of. So we have to be really careful as an industry how much abstraction we're throwing at things. And sometimes I think we should be removing abstractions, not necessarily adding. But it's, I think, no different than climate change, man.

(30:12) I mean, we're screaming about climate change today and we're doing very, very little about it. We're talking about this and there are some solutions coming in. But I think the world wants to move so fast to market and so fast to try to solve certain problems that they just want to build a Jenga game. And eventually there's going to be too many holes.

WeWork-ization of software

(30:38) PIYUSH: You made a very interesting point here about abstractions and how the world is shipping logic and not owning the stack. I have this opinion here, and it gets me thinking. I'm sitting in a WeWork right now, and here's a great similarity that I see. In the past, businesses used to own real estate, the real estate of an office. They would run the entire operation themselves. Then WeWork came along and said, let me abstract all of this for you. You simply come in, do your work, and leave the entire office space, the entire management, to us. And here we are. Right. So it kind of disrupted that. I see a very similar situation happening in modern software. Right. You bring your business logic. Everything else is not owned by you. The cloud, the orchestrator, the operating system, everything is rented. You see what I'm saying here, right? Of course, there's a lot of freedom now, because you don't have to worry about a lot of things. But at the same time, when you say, I don't have to worry about a lot of things, it directly translates to: I have no control over those things as well.

(32:22) PIYUSH: Now, as a software person who's shipping software, building software, does this lack of control over performance, over security, over cost, or the black box, make you uncomfortable? And it's almost becoming the modern norm. Trying to do a thing against it would literally be going against the tide. So what's your opinion on that, on this entire view of modern software?

(32:48) WILLIAM: Let me give you an analogy. Okay, I'm at a point in my life where I don't want to own a home anymore. I'm fortunate to have two homes. And I don't want to own them anymore. Because of the amount of money that I have to put into both these houses every week, right, waiting for the next thing to break, dealing with the next situation that comes, I'm at a point where I'd rather just rent and let it be somebody else's problem. Right? Refrigerator breaks, hey, you got to fix the fridge. The lawn isn't great, that's yours. I don't want the responsibility of owning and maintaining these homes anymore. I think it's the same sort of thing.

The cost of cloud

(33:26) WILLIAM: Because the out-of-pocket cost and the upgrades that I'd have to do over time are super expensive, and it weighs down on you as a burden. I think the thing is, it's the cloud versus you running your own data centers. I've run my own data centers before. I've had to buy computers and stack them up. I've had to worry about maintaining them and upgrading them. I've had to worry about all of that stuff over time. And we could almost argue that maybe the cost of running your own data center and being able to do all the tax work and blah, blah, blah is equivalent to using somebody's cloud. But two things. One, I can walk away from the cloud anytime I want; with my own data center, now I gotta figure out how to sell this stuff. And two, if something breaks, I don't have to deal with it so much as make a phone call. So I'm super happy with the idea of renting those things that we need, at least at this point in my life. Now, there's one thing you have to make a choice on, and I think it's easier now than maybe it was 10 years ago. I've always said that if you wanna keep things simple, then you should do what I call going to bed. You should go to bed with a cloud provider. What do I mean? I mean, if you're going to choose Amazon, then use all the Amazon services that are there. Use everything that's there and build your platform on that. You wanna use Google, then go heads down with Google. You wanna use Microsoft, same thing. Don't write abstractions with the idea that you wanna get off that cloud one day.

(35:12) Because you're adding layers of abstraction that you now have to maintain, understand, and deal with. Now, some people say, Bill, that's crazy. Why would I fully integrate everything? I've never known somebody to get off of a cloud, just like I've never really known somebody to switch out a database. And so if you make that choice, there's less engineering for you to manage, and you're then relying on that provider. Now, I think today things are really interesting. At Ardan we've been switching a lot of stuff out of Kubernetes in Google to Cloud Run. Cloud Run is badass, right? You package your software, put it into an image, and they'll run it, they'll run it at scale, they'll run everything. From an ops perspective, there's almost nothing you need to do. They have the ingress if you need it, they have these things, and they can scale it. And I wanna see more engineering teams and platforms move back towards that. It used to be that you just got a Linode box and ran your stuff in there. To me, Cloud Run is the same thing, except you can do it in a container and you can scale at a different level, right? And with that, one could argue you've stayed somewhat agnostic, except, let's say you need a messaging system. Do you choose NATS? Or, if I'm running on Amazon, do you choose one of theirs, SQS? That's the decision you have to sort of make, right? But I'd rather rent than own all day long, dude.

(36:58) But take me back to that question. A while back you said that the whole thing is becoming a black box. Now, I see a dilemma here. I'm not saying it's a counter opinion, because as an engineering leader, that's a dilemma you're constantly under. You want to live on rent because owning is expensive, but at the same time, as you mentioned a while back, everything is a black box. So at what point do that discomfort and comfort marry together? The fear of a black box and, at the same time, the fear of owning. How do you practically decide between these two?

Compute as a black box

(37:40) I just think that I'm not a networking guru. I don't want to manage machines. I trust that Amazon, Microsoft, and Google have got that down pretty well in 2023. So I don't see the compute as a black box. I just see it as a machine that I could own or they own. And I see the container as the delivery mechanism, which is the same delivery mechanism I use on my local machine. And so for me, I want to worry about everything above the container, I want somebody else to worry about everything below the container, and the container is the handoff. I mean, think about it. I'm just asking you to

(38:27) run a piece of code, maybe one or multiple instances of it on multiple machines. This is a solved problem. I should not be building that myself, and I shouldn't be doing it manually anymore. This is a solved problem. The question is, do you want to do it with Kubernetes? Do you want to do it with Cloud Run? Do you want to do it with whatever Amazon provides? Choose your poison.

(38:52) Can I decode it this way? Basically, the undercurrent here is common knowledge, and I think that's the key word. You know what we're hinting at. If the knowledge is well-known to us, then I don't care who manages it. Do not do magic for me; take away the load of common knowledge. Is that what we're saying? Would that be a fair way of putting it?

The beauty of Kubernetes

(39:20) The burden of these things that are, at this point, truly just boilerplate. What's beautiful is, if I can run the software on my local machine the same way, in a sense, that it gets deployed off my machine, then at least when I'm debugging it, I'm working within the scope of something similar. I think that's one of the most beautiful things about Kubernetes: I can run a full Kubernetes stack on my local machine, and I'm developing the same way things are in production. I'm practicing every day. If I'm using Cloud Run, it's the same thing. I'm just going to run it in Docker, the same way I would run it there, with the same form of access control. I think that is super important. Back in the day, the way we would run and debug software on our local machines was completely different from the way we had to do things in production. I think containers and Kubernetes and things have made your local environment at some level equal. We don't run at scale on our local machine, but at some level, I don't want to deal with those things, just like I don't want to write my own operating system. I expect that to just work. It's the same thing at this point for me. Now, again, it depends on what level of complexity you think you need. Keep it simple until you can't. If Cloud Run works for you because you've decided smartly not to go down this crazy microservice path, and you can build a single service that does everything for you right now, then you replicate horizontally that way.

(41:02) Then you can use Cloud Run and there's almost no complexity in terms of deployment. But if you feel, for some crazy reason, that you need a little bit more control over that service, or more services, that's where Kubernetes can come in, because it can give you all the dials and switches that you want. It's the same thing when you're writing software: how much complexity do you want to put into this right now, and why? I want to keep it simple until I can't.

(41:38) This whole WeWork model of software is about keeping things simple, as long as what they're abstracting is common knowledge that stays on my local machine and runs the exact same way wherever I'm running it. If it's homogeneous, it works well.

Don't use a debugger to find bugs

(41:55) Because at the end of the day, it's not about your time at two o'clock in the afternoon when life is good. It's about when that bug hits production; that's when the stress hits. How quickly can you analyze and fix that? If you're not already practicing that every day on your local machine, you're dead. This is why I tell people, stop using debuggers. Don't use a debugger to find bugs. Use a debugger to learn a mental model of a code base that you've never seen before. Debuggers trace. They don't find bugs. They trace code. I got a piece of code here. I don't know how it works. I'll step through with the debugger, but you got to get out of it. You need to make sure your logs are working, and those metrics that you might have are working, or your OpenTelemetry traces if you're doing those. If you're not practicing that every day at your desk, what are you going to do when you need it in production? You're dead.
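One way to practice that habit at a desk, as a minimal Go sketch and not anything shown in the conversation: the same structured logger is built for every environment, and only the destination changes. It assumes Go 1.21 or later for the standard log/slog package. The names newLogger, chargeCard, and captureLog, the trace ID, and the "gateway timeout" failure are all hypothetical, invented for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"io"
	"log/slog"
	"os"
)

// newLogger builds the same JSON logger for every environment; only the
// destination writer differs between a laptop and production.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil))
}

// chargeCard is a hypothetical operation that always fails. It shows the
// context a log line can carry so the failure is diagnosable from logs alone.
func chargeCard(log *slog.Logger, traceID string, cents int) error {
	err := errors.New("gateway timeout")
	log.Error("charge failed",
		"trace_id", traceID,
		"amount_cents", cents,
		"error", err,
	)
	return err
}

// captureLog runs chargeCard against an in-memory writer and returns the
// decoded fields of the log line, which is how a local check can confirm
// that the logs say what production will need them to say.
func captureLog(traceID string) map[string]any {
	var buf bytes.Buffer
	_ = chargeCard(newLogger(&buf), traceID, 1299)
	fields := map[string]any{}
	_ = json.Unmarshal(buf.Bytes(), &fields)
	return fields
}

func main() {
	// Locally the line goes to stdout; in production the same logger
	// would write wherever the platform collects logs.
	_ = chargeCard(newLogger(os.Stdout), "abc123", 1299)
}
```

Reading that one JSON line and asking whether it would be enough to find the fault is the daily practice being described; if it is not, the log call is what needs fixing, before production does the asking.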

(42:51) Amazing. Here's an opinion that I'll present. With software, what has happened, and I've seen this over the ages, is that a lot of these practices have shifted leftwards. But the art of finding faults is still debugging in production. We haven't shifted it leftwards yet. So what I'm hearing is, you're basically saying: do the same thing at home that you're going to do at work. On your local, start doing similar things. If you're going to use these debuggers, these profilers,

(43:26) do it on your local, but use them with tomorrow in mind: how am I going to do this in production? Is that the takeaway from it?

(43:36) If you can't do it at your desk, you can't do it in production. That is the bottom line. A great sports analogy is that what you do in practice is what you do in the game. But if you're practicing one thing and the game's completely different, then all the practice that you've had is completely useless. You can't apply it to the game.

(43:56) Right. Beautiful. I think that line about the game and the practice hit hard there. Yeah, I mean, I associate with sports, I completely get that. It takes me to my next question, and this is so natural for me to ask, because there was a wave in the past decade where we kind of publicized this thing: move fast, break often, debug in production. I'm sure you're not new to these quotes that we kind of popularized. And a lot of engineers started following that track, you know, and I used to hear, oh yeah, we debug in production, we ship every hour, we have to break stuff. Do you think the world has taken that a little too far, given the amount of trust that we just said software is supposed to deliver? What's your opinion on that?

We need more QA

(44:46) This is when I start saying I'm aging out, because I do not believe in that at any level. I think you can run some beta software to give people an opportunity to look at things, including yourself, and you should do that if you can in a production environment, or at least what we call stage. Everything we do has that stage environment. Bang on that; anybody can bang on stage. Because honestly, when you look at stuff at your desk, your brain is in a different place than when you look at it on stage, and then when you look at it on production. Even when I write a blog post, I'll read it 1000 times, I'll run the website locally and read it, I deploy it, and then I go, oh my God, how did I not see that? Because your brain just changes gears in some form or another. So we go back to this idea that you shouldn't be doing anything that's gonna break production, because you're dealing with people who are relying on that software. But you could most definitely have a staging environment, where you give people an opportunity to start banging on it. I don't get it. I mean, I wish QA departments came back, to be honest with you. I think we're asking developers to do way too much. I have never met a developer, including myself, who is really good at QA, because we're afraid to find bugs. We're afraid because we know that when we find one, it's about to set us back an hour, and we don't wanna deal with it. I think, if I could create this idea of a software developer license, I would create a QA license, and I would make all companies have to have a QA department, with licensed QA people.

(46:27) I would absolutely do that. I do not believe in developers doing QA. And we got rid of QA departments for the cost or something.

Engineers fail at finding bugs

(46:37) Walk me deeper into this. You touched a nerve here, and it's really close to what I see. What we're saying is, engineers almost fail at finding bugs. I'm gonna abstract this out, because bugs come with this feeling. Even for me, the first reaction that I get when I see a bug in my own code is fear, is irritation. There's some agitation in me, like, oh no, I don't want this. So if my natural reaction is that this is something I do not want, you can never incentivize me to find more faults. It's almost like living in denial. I want to live in denial, that no, there are no bugs in my system, and that's a constant state. Now, put that next to this popularization of "break things fast", and it's almost hypocritical. These are two very contrasting things. You're incentivizing me to ship bugs, and at the same time, my fraternity kind of prohibits me from finding those bugs, because that's the culture, that's the DNA that we all live in. We don't find them; we all live in denial. I find this very counterintuitive, and I think that is where a lot of this has gone wrong, married with this beautiful point that you made a while back: production is a different world, local is a different world. I can't seem to make any sense of all this. And I really want to hear an expert's opinion here: how do you inculcate a culture that goes against the conventional wisdom that the internet has been publicizing so far, and all the opinions that have been coming so far? What do you tell your team? How do you convince them that this is a wrong thought?

NASA and bugs

(48:27) I don't know, dude. Bugs create both anxiety and a burden, but we want that anxiety and burden before that code gets into production. It's just like when people are building all these abstraction layers to the point where the compiler can't find bugs anymore. I'm like, dude, you have a compiler. Break the API and let the code break at compile time. Don't keep the API design the same with the idea that you don't want the compiler to break, and then break some code inside the function in a way that can only be found in production, because your tests only go so far. There's this weird mentality that we have, that when one of you gets my latest change and I broke your build, it's somehow the worst crime on the planet. You should almost be thanking me that I broke the build, because now you know that there was going to be a production bug there and that the API changed. So we might have to do something different. It's this attitude. We've got to change this attitude: the things that we said in the past were really, really bad are really not what's really, really bad. What's really, really, really, really bad is a bug getting into production. It's going to happen, but it should be minimal. That's the metric. I don't care about the 99th percentile. And there are bugs, right? NASA did a study on their C and C++ code way back, and they identified that essentially there was a bug, and I'm just simplifying the math here, but there's essentially a bug with every 20 lines of code that they wrote.

(50:06) And not every bug will rear its head, and some bugs are just super, super edge cases, right? But the more code you write, the more bugs there are going to be, and we should be doing everything we can to catch these bugs in staging, in QA. And when a bug, especially an obvious one, gets all the way to production, that's when we should be starting to ask ourselves, how did that happen? How did that get through? It got all the way through. To me, if I was running an engineering team, that's how I'd be setting it up. I'd have a QA team, I'd have a staging environment, and the only one that can move something into production is QA. The QA manager has the keys to the kingdom. No developer can access production, and no developer can access staging either, actually, because we're crazy. We see a bug, we want to fix it immediately. That creates more instability, right? There used to be a time when these walls were locked and one person had the keys, and you had to convince that person that this was fixed, right? Whatever you had to do. And I think our software was more stable back then than it is now. People want to move fast and they don't care about the bugs in production. Well, guess what? I don't want to work with you. I just don't want to work with you.
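The earlier point about breaking the API so the compiler finds the problem can be sketched in Go. This is an illustration under assumptions, not code from the conversation: chargeLoose, chargeStrict, and the Currency type are hypothetical names, and the scenario is a new "currency" requirement added after callers already exist.

```go
package main

import (
	"errors"
	"fmt"
)

// chargeLoose takes an options bag. When a currency becomes required, the
// signature does not change, so every old caller still compiles and the
// mistake only shows up at run time.
func chargeLoose(args map[string]any) error {
	if _, ok := args["currency"]; !ok {
		return errors.New("missing currency")
	}
	return nil
}

// Currency is a distinct type so a currency cannot be confused with any
// other string argument.
type Currency string

// chargeStrict makes the new requirement part of the signature. Adding the
// cur parameter deliberately breaks every old call site at compile time.
func chargeStrict(amountCents int, cur Currency) error {
	if amountCents <= 0 {
		return errors.New("amount must be positive")
	}
	if cur == "" {
		return errors.New("missing currency")
	}
	return nil
}

func main() {
	// A caller written before currency existed: it builds fine and fails
	// only when this line actually executes.
	fmt.Println(chargeLoose(map[string]any{"amount": 1299}))

	// The equivalent old call, chargeStrict(1299), would be rejected by
	// the compiler for having too few arguments, so it can never ship.
	fmt.Println(chargeStrict(1299, "USD"))
}
```

The broken build in the strict version is the early warning being argued for: every call site that would have become a production bug is listed by the compiler instead.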

Build fast, ship fast?

(51:48) You said a line there that I want to pick up: lock it, guardrail it, right? And also, as an industry, we say build fast, ship fast. A while back, you mentioned this point, that our impatience to ship software is shifting the paradigm to be more broken. Now with all of this, I want to touch upon an opinion here and I want to hear if you resonate with it. One way to ship software these days is, of course, to do an infinite amount of QA. The other way is to lock it within these parentheses of measurements. What we're literally saying is, here is a piece of software, imagine it physically being shipped somewhere, and we are basically putting a locus around its movement. If it goes beyond this, it's a problem. You're allowing it to be free, but contained to a certain degree, right? And you say, I'm going to ship this. If anything goes beyond or outside of these locked boundaries, I have a problem. It's a way of me getting to know that something is off, because I understand it's going to shift a little. Having a precise target for every single piece of logic, so that it works exactly the same way every time, is tougher. I understand that, with the amount of functionality that we ship every day, we are adding more and more. Would you believe that this framework of

(53:24) basically thinking in boundaries of functionality and behavior is a better way of shipping software into production? What I call this is observability. You know, this is where observability, or the modern key term of defining reliability parameters, and all of those frameworks and guardrails come into the picture. Do you think that would make it easier for people to ship software faster, and everybody across the board will be happy? Do you think that makes sense?

Burning territory

(54:01) Yeah, so two things here. There's a sales saying that says we don't wanna burn territory, right? You're trying to sell something and you finally get a potential customer in front of you and everything's breaking. You've lost that customer for life. You might get them back only because you had some relationship. God forbid it breaks again. You're done, you're not trustworthy. I cannot rely on you, right? And you came to me with something that's broken. And so that's one of the things: I don't wanna burn territory. I wanna make sure that we're not. Now, I love the things that, like, Facebook does, where they've got features there that they don't turn on, or that they can turn on for a small subset of users over a small amount of time to see what's happening. I'm a big fan of that. It's a large amount of engineering effort to be able to do that, because you've gotta isolate. Even Google, I don't know if people know this, and I'm assuming they still do this, but back in the 2000s when I was doing SEO stuff, Google had a set of servers that ran the latest, greatest search engine algorithms. They weren't tied directly into google.com, but, I don't remember what the IP addresses were, you could always go there and see what was gonna happen with the next update of search, because they were always feeding it and indexing it with the latest sort of algorithms. And some of those algorithms they were testing out. So it wasn't like an A/B test on you, but they were there, available for you to kind of try.

(55:45) And if you work inside of Google too, there are things like the Maps app where employees are encouraged to run the latest version first, where a company is basically beta testing everything on their employees first to make sure things work, if you've got a product, something like Maps, that is big enough. That goes back to that staging environment. And so I think there are ways, if you really, really want to, to either have that beta or next-version environment already up and running before you switch over to it, or use the idea of feature flags, where we're gonna turn this on for 20% of our users and see what happens. Because you're burning a little bit of territory with them. That could be an opt-in, which for me would be better. Would you be willing to opt into always running on the latest, greatest? Because then if it breaks, they can't complain, right? They've agreed to help you and they're providing feedback. So I think there's ways to do that, but it has to go back to that mentality that we want every user to have the best experience possible. And we want users who are willing to help us to be aware of that, so we're not burning that sort of territory. But these are engineering problems that you've got to solve. I've never built a system that can do this, by the way, that can have feature flags and stuff. So I have no idea what's involved in terms of front end and back end and ingress and all that stuff.
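The "turn this on for 20% of our users, or let them opt in" idea can be sketched at its very smallest in Go. This is only an assumption about one common approach (hashing the user ID into a stable bucket), not a description of how Facebook, Google, or anyone in this conversation does it, and it ignores the front end, back end, and ingress work mentioned above. The Flag type, the bucket function, and the flag and user names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Flag describes a hypothetical feature flag with a percentage rollout and
// an explicit opt-in list.
type Flag struct {
	Name    string
	Percent uint32          // share of users who get the feature, 0 to 100
	OptIn   map[string]bool // users who volunteered for the latest version
}

// bucket maps a user to a stable number from 0 to 99. Hashing the flag name
// together with the user ID keeps a user's answer consistent across requests
// while giving each flag its own independent slice of users.
func bucket(flagName, userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flagName + ":" + userID))
	return h.Sum32() % 100
}

// Enabled reports whether the feature is on for this user: always for users
// who opted in, otherwise only if their bucket falls under the percentage.
func (f Flag) Enabled(userID string) bool {
	if f.OptIn[userID] {
		return true
	}
	return bucket(f.Name, userID) < f.Percent
}

func main() {
	f := Flag{
		Name:    "new-search",
		Percent: 20,
		OptIn:   map[string]bool{"bill": true},
	}

	// Count how many synthetic users land inside the rollout.
	on := 0
	for i := 0; i < 10000; i++ {
		if f.Enabled(fmt.Sprintf("user-%d", i)) {
			on++
		}
	}
	fmt.Printf("enabled for %d of 10000 synthetic users\n", on)
	fmt.Println("opted-in user enabled:", f.Enabled("bill"))
}
```

Because the bucket is derived from the user ID rather than a random draw per request, the same user keeps the same experience, which limits how much "territory" a bad release can burn to the users inside the rollout.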

(57:28) I have no idea how to architect that. But I imagine you'd want to architect that from the beginning. But I like to keep things simple. And so I would rather have our production site up. And then, I mean, depending on the software that you're writing, you might not be able to do this if you're doing payment gateways and stuff, right? But have that, we call it stage internally, but maybe have that beta.domain.com. And maybe you can reward people for logging in there and finding bugs, right? Gamify that. But doing all that means that you have this real commitment to quality, knowing that there's gonna be bugs, knowing there's gonna be problems, trying to minimize the pain for the majority, especially the people who just aren't that interested and the people who don't care.

(58:24) "I don't care about finding bugs." They're just coming to your... You know how hard it is to get somebody to just go to your webpage? Isn't there some stat that says, on everybody's phone, there's like five apps that you use every day, and everybody else is competing to be number six or something like that? Getting an app installed on your phone is already huge. Getting it to replace something else is almost impossible. Right? When I think about Mastodon, which I'm not on anymore, I just didn't like the experience. There were five apps that were available, and I must have tried all of them, loading, unloading, the whole nine yards, right? No app ever got me super excited. And I kept going back and forth between the different ones. Eventually I said, I'm done with this entire experience. But think about those five people that had developed those apps, right? Competing at that point for everybody that's moving, millions of people moving to Mastodon, right? There was an opportunity for you to be the number one app there. And it would be really hard to take you off at that point if somebody felt like you were delivering that. But the moment that your app is having bugs, what's gonna happen? You're gone. I'm not gonna deal with that experience at all. I'm not a Facebook, right? An Instagram, a Facebook, a TikTok: those are unicorns, right? You're competing for that next thing. So, I mean, maybe a Facebook and a TikTok don't care about the bugs, because they're so ingrained at this point in everybody's life that this attitude of just deliver fast and, if it breaks, nobody's walking away, probably works for them.

(01:00:13) But we all know, how many people could ever really claim they're at Google scale? I think it's like five companies on the planet, Twitch, maybe Netflix, Google, Facebook.

(01:00:24) Maybe three, not even five, yeah.

(01:00:26) Right? Like, dude, if you ever, day one, try to engineer a system to deal with Google scale, you are wasting a large amount of your life and time. That's not where your head should be.

Zero Trust Reliability

(01:00:40) Well, I mean, there are only a few who will be lucky enough to get that far. You know, that takes me to a very important point here, and that is, with so much responsibility as a business, and you present a very good customer-side view here, which says that your thing better work. Now, I want to draw a parallel here. It's a term that I coined recently, and it makes sense here, and I want your opinion on this. You know, with security, we almost took this stance of zero trust security, a very common term, where we assume that no matter which part of the system, all these layers, I assume they're going to break. I assume they're insecure. I'm going to build practices which allow me to ship that secure thing, right? And you took a stance of zero trust security, and everything around it was built with an innate distrust of the system, so that I could ship the most trusted thing. With reliability, and with software wanting to be so reliable, I think there's a term called zero trust reliability, which is missing. An angle, a perspective, where we look at a system and say, no matter what components, what aspect this is running on, I'm going to assume this is breaking, and that will help me build software which actually works better. Because with reliability, we've gone the other way. There's an innate assumption that, oh, everything just works. I've got to ship this very small piece of logic. I can quickly fix it. Nothing's going to break. Everything is going to be all right.

(01:02:20) Everything's going to be rosy and good, and we ship stuff. Do you think that the industry needs a paradigm of zero trust reliability now?

(01:02:29) I think there's a problem with zero trust in any architecture, and that is, there's a lot of redundancy, and a lot of, I'll just say, performance loss, whatever the word performance might mean for that system. At least in software architectures, when we talk about layers of software, I have learned over the years that you can't go zero trust. You have to have trust. You have to rely on your highest layers to do all of the security and the validation, because if not, there's too much inefficiency as you move to your business layers and to your data layers. And in fact, at every layer down, there's less trust-checking work going on. The farther down you go, the more trust. Every layer has to trust the layer above at some level, or there's too much inefficiency, too much redundancy.

(01:03:23) I would imagine that you would see the same thing in one form or another in any system architecture that you're putting together. If you look at what Go is now trying to do with the vulnerability tooling, the new fancy supply chain word that we've got everywhere, they're running an algorithm against your entire code base, looking at the dependencies you're using to find vulnerabilities. I love that. I think that's great. That to me is an application layer security check, just to validate that there's trust in the components that you're using. I would never want to build that into the software per se. I think we have to be really careful when we're talking about zero trust, because at some point you have to trust your operating system. Where is that line in the sand where you have a balance of comfortable enough trust and enough efficiency and manageability in the system? I don't have answers, but if I was involved, that would be the scale. Understood.

AI in software

(01:04:42) Fair point. What we're saying is that at some point, there has to be an assumption of trust. Now, whether the zero trust is a part of the software or a part of the process is a separate thing. Where do you draw that line? It makes sense. In the interest of time, I'm going to actually just leave with one last question. This has been very interesting. This is maybe a very popular question that you have been asked 100 times over, and I still want to ask it. As we're saying, software is a permanent feature of our life. Engineering practices, we agree, haven't evolved to the degree that software now expects from us, almost to a degree that all the mundane software craft which we still keep arguing over, the things that were breaking 10 and 20 years back, are still breaking. We still get surprised: why is it still broken? It should have been fixed by now. I've been in this for two decades. This stuff needs to work now. With engineering and a lot of these practices getting slightly irrelevant, what role does AI have to play in this? You could see this question coming. Do you think it's going to replace a lot of the mundane stuff that we do? Do you think it's time that we offload a lot of this code generation? Open-ended question. I have no opinions. I want to hear your opinion. What do you say?

(01:06:21) Let's just start off with a couple of statements. One, AI is not intelligence. This isn't like some living organism. Two, AI is essentially building prediction models. When you think about ChatGPT, whatever version, where we are today in 2023, you basically have a system that is autocomplete on steroids. That's what you have today. We're all learning, hopefully, that it can create these patterns of human-readable English, but we can't necessarily trust the content. We went through this with Wikipedia, didn't we? I don't know one teacher, at least at the high school or middle school level, that will accept Wikipedia as a source, believe it or not. Though it's community-moderated knowledge and people do their best, I've not known a teacher to ever treat it as a legitimate source.

How to use AI in coding

(01:07:28) I would never want to treat an AI system as a legitimate source of truth, which is what we're trying to do for some reason. Now, as it relates to code, I think at least ChatGPT, from what I've seen, has become a pretty good programmer. Remember, I draw a line in the sand between programming and engineering. Programming is, find me 20 lines of code that work. Engineering is, how do I integrate that in a solution where we have all that reliability, maintainability, blah, blah, blah. I think ChatGPT is turning into a really good programmer. Hey, there's an API that I need to figure out how to use. Can you show me how to pull down all the users out of Twitter for my account? And somehow, magically, it gives you those 20 lines of code. That is a huge time savings.

(01:08:20) Or, hey, ChatGPT, here's a function I wrote and I know it's not great. I've even done it with SQL. Here's some SQL I wrote. I know there's got to be a better way to write this, and somehow, some magic, I don't know how it does it, it gives you a nicer version of it. I love it for that, but it's not making architectural engineering decisions. You need to do that. You know you need SQL that does this. You know you've written a version of it that works. It's now more, hey, can you optimize this maybe? Maybe it can't. Or, I don't want to spend 20 minutes searching through 20 blog posts to find the 20 lines of code to learn how to write a Rego script that does this. I think there's huge benefit in terms of time-saving on the programming side. The problem is the moment somebody treats that as engineering, it gets really... and that problem exists already today. I can find the 20 lines of code in a blog post and just move them in. We're not solving that.

A mental model for code

(01:09:25) We're not making that worse. But I think for these types of things right now, these AI, ChatGPT, large language models are amazing, and you should be leveraging them for that. Now, do I think that we'll ever get to a point where it's building the engineering? I don't think it could ever build the engineering code. You know why? Because it can't fix the bugs. If you don't have the mental model of the code and the architecture behind that code, you can't maintain it. You can't find and fix the bugs. You still have that responsibility. Because at that point, it shouldn't have written a bug. Well, how did you write a bug? Now you're finding the bug? There's a problem there. Maybe not in my lifetime, maybe 15, 20 years, 25, maybe more, I don't know. But for where it is right now, dude, it's freaking crazy how it can predict things at a certain level. I have zero clue. I've learned how it does the prediction. We actually wrote some code in Go that implements a very small version of ChatGPT. Go didn't have all the libraries that we needed like there are in Python. It was a little bit of a struggle. I have a conceptual understanding of how ChatGPT works with text. I have zero clue how it does it with code, man. That is wild. Wild. But you should use it as a sidekick programmer. A programmer, for sure, that you're copying and pasting from.

AI as a sidekick programmer

(01:11:09) Let me summarize it then. I think we violently agree on this one. My summary is, I have this hope that using it as a sidekick pair programmer, as you said, actually brings engineering back to the fore, where our time is better spent paying attention to things that matter and that help us shape reliable software. Because I think a lot of us get so consumed by daily sprint velocity that it actually leaves very little time to take a step back and understand where the structural faults are. To ask the question: where is it going to break? I think it should allow us more bandwidth to be able to ask those right questions. So my hope and my assumption is that it actually improves the software engineering landscape. And overall, I really like that you resonate with this, and that there are other engineering leaders who are so passionate about shipping reliable experiences through the software that is all around us right now. Even over this call, our air conditioner just stopped working over my head.

(01:12:23) Thanks a lot. This call has not had any hiccups for the last hour. It's an air conditioner right now.

(01:12:34) Don't jinx it. Thank you Prathamesh so much for having us. Well, it was lovely to chat with you. This has been good. I got so many nuggets here that you've actually told me and they're wonderful.

(01:12:50) Awesome. Thanks for having me today. It's cool.

(01:12:54) Thank you. Have a good one. Bye.