Testing for Software Failure: Andy Schriner
Andy Schriner joins us to share methods in an engineer's toolkit for accommodating risk in a project.
Risk arises everywhere—estimates, costs, forecasted revenues. Andy’s role as a Principal Software Engineer at LeadGenius over the past 3 years has given him a lot of perspective on:
- what are best practices for controlling code quality
- how to become a principal software engineer at a Series B startup
- why Ph.D. programs in traditional engineering have crossover and relevance to software
For the full text transcript see below the fold:
Audio:
Video:
Max: What is up all? Max with The Accidental Engineer here.
Today we have the pleasure of Andy Schriner joining us. Andy is Principal Software Engineer at LeadGenius in Berkeley, California, USA.
Do you mind sharing a little bit about what LeadGenius does? We’ll be getting real quick to Andy’s backstory as an accidental engineer momentarily, but…
Andy: Yes, so LeadGenius is a sales and marketing automation platform and data provider.
Our bread and butter is providing really accurate data to sales and marketing teams so that they can reach out to the right people at the right time, get them on the phone, contact them through email, and close more business.
Max: Right on. For our audience that’s curious about how you got into software engineering and coding and whatnot, you didn’t do your undergrad in computer science, so how did you get to being a Principal Software Engineer?
Andy: My path into software engineering has been a little bit interesting. I did my undergrad degree in mechanical engineering, so my first exposure to programming was using Matlab for technical computing, like pulling data off of accelerometers and processing it.
Then in graduate school I was in a Ph.D. program in environmental engineering, and I wanted to work on crowdsourcing-related problems. And I found that to solve the problems I needed to solve, I had to write software.
I started writing a little bit more…actually I started with Python and Django—started with a little web app that would allow me to solicit some data contributions from people on the other side of the world.
I did that very begrudgingly. I did not want that to be like my “full-time thing”. After a few years of that, I started to really enjoy it. And so, even… gosh, 5-6 years ago, maybe 6 years ago, I would still have said, ‘‘No I don’t want it, don’t wanna be a software engineer. It’s not what I want to do.’’
One of my Ph.D. advisers, Jim Uber, a really, really smart guy, kind of encouraged me, like, "Hey, you know, this is actually a pretty cool thing to do." He said one thing that sticks in my mind: "You know, computer programmers are the high priests of abstraction." I was like, "Jim, you're crazy." But here I am now, a "high priest of abstraction."
Max: So I was gonna ask if your Ph.D. advisors in the engineering program, the non-computer science Ph.D. program, condoned you learning Python and Django to fulfill your thesis or rather they pushed you to avoid hand-rolling software solutions?
Andy: No, that’s a great question. So, Jim, the guy I mentioned was actually…he’s a very software-oriented environmental engineer, so he was very supportive of that. Basically, it was like “whatever tools you need to be using to get the job done, that’s what you should be spending your time doing.”
So over time I picked up more Python, picked up more web development stuff, a little bit of data processing moving data around, stuff like that. Yeah. So picked it up piece by piece. And, yeah, here I am!
Max: You mentioned learning Matlab or being required to use Matlab for your more physical engineering education. What was the learning curve like coming from mechanical engineering to Python?
Was it pretty straightforward? What kind of resources did you find possibly online or in a mentorship, because your Ph.D. advisors don’t sound like they were programmers or coders themselves?
Andy: There were a few people in my research group who were able to offer a little bit of guidance. At the very beginning of this journey, I was like, “Okay, what languages should I start picking up to solve some of these problems?” And somebody encouraged me: ‘‘Hey, Python is a pretty good choice.’’
Sam, thanks, buddy.
Max: Dude, you’re the best Sam. Thanks.
Andy: Other things at the time–I was running a server in a server room in the College of Engineering. And all the Linux craft that I needed to get that thing up and running, there were some people that were very helpful with getting that up.
Max: That can be a really steep learning curve.
Andy: Oh, yeah, yeah. It definitely was, but with some folks in the research group to give me some nudges, and then Stack Overflow, Googling…
I think right now my main professional skill–my most important professional skill–is not writing code or even designing systems, I’m a “professional information retriever.”
So if I can formulate the problem, then I can usually find pretty good resources to at least give me some different perspectives on that problem. I’ve been doing that for 6 years and putting pieces together and building on top of that.
Max: That’s pretty distinct from various physical engineering fields like mechanical, civil.
One of the really big qualms I had when I was naming this video podcast, The Accidental Engineer, is that a lot of physical engineering field folks kind of don’t think of software engineering as “real engineering.”
There’s not really state licensing, you can’t really go to jail if your bugs cause adverse outcomes–whether that’s loss of revenue in a business or loss of life or medical expenses.
So did some of your peers in your Ph.D. program kinda cast “side eye” at you when you’re going this more “soft skills” software engineering route in contrast to their own route?
Andy: No, I didn’t really get that kind of reception.
Max: I’m just imagining it?
Andy: Yeah. I think a lot of times the response from folks in academia…and if you’re in academia, you know, good for you, you keep on trucking if that’s your thing! But a lot of people, they saw me moving out of that and were like, ‘Oh, oh, I want some of that.’’
Max: Interesting. So just earlier today, we interviewed a guy named Jeremy Karp who’s a newly minted Ph.D. holder, who had a lot of really positive things to say about his Ph.D. experience. And I think he was in his Ph.D. program a similar amount of time as yourself. You’re ABD, is that it? “All but dissertation” or close to it?
Andy: So I quit after five years of work on the Ph.D. I took a Master's for the amount of work that I had done, and I'm through with academia.
Max: Well, for any academics who are spurning Andy's characterization of the academic pursuit and jealous of going into private industry, check out that interview, because Jeremy and I talked through some of the immense career benefits of doing the Ph.D. program.
And, I don’t think you would look back at it as a waste of time by any means.
Andy: It was an incredible opportunity to develop really, really rigorous problem-solving skills. And not just problem solving: problem framing, problem defining, is I think one of the most important skills.
So I’m not saying that if you’re out there in a Ph.D. program, you should quit it. You do you.
Max: For people who are curious what all you did in your Ph.D. program, what was your area of research and how well did it translate to entering into the software engineering field?
Andy: Yes, so I was working on how to use crowdsourcing to create employment in developing countries as a way to target extreme poverty, which is this really huge goal, totally shooting for the moon here with a Ph.D.
Max: All right, it's world peace work right there! But it ties into engineering in that a lot of your peers in the engineering department are working on solutions for things like water sanitation in third world countries that don't have quite the resources–their problems can be solved by reducing costs or making better physical equipment to solve their physical problems?
Andy: Yeah. Yeah. So that’s kind of a typical approach to the big problem of addressing poverty within… that’s the typical approach that environmental engineering departments would take, would be focused on clean water, sanitation, those kinds of problems.
I was coming in there saying, "Okay, where can we apply engineering effort to this problem in a way that is more impactful than building a water distribution system?"
Which I actually did in undergrad with the organization "Engineers Without Borders." I went to Kenya, we built a water distribution system, and it was like, "Cool, this is great, but I think we can do more through employment." That's how I got into, "Okay, what are the software levers that I can pull on to make this particular solution more impactful?"
Max: One of the topics that we haven’t covered a lot on the show yet is about the area of testing and risk mitigation in engineering, which is arguably what engineering is: risk mitigation in the physical world or digital world if we’re dealing with engineering.
For folks who didn't get their BA or master's in the engineering fields, do you mind unpacking what all is in the engineer's toolkit for dealing with or coping with those realities of risk mitigation?
Andy: Yeah, that’s a great, great question. I think one of the things that engineering is fundamentally about, it is oriented toward failure, not success. I have this definition of engineering in my head, which is the “systematic obviation of failure.”
I’ve googled that, and I can’t find if I copied that from somewhere or if it’s just a phrase that’s stuck in my head, but this is my idea of engineering. A couple of key points there:
- it’s systematic, so it’s not ad hoc, it’s analytical, it’s thorough.
- it’s focused on failure. And
- it's obviation, which is just a word that I like. I could totally say "prevention" but I like words :)
Max: So failure in the physical world could mean building collapse or structure collapse or electricity going places where it shouldn’t or high-speed things misbehaving?
Andy: Yeah, yeah. One of the classic very, very…real visceral depictions of failure in engineering is the Tacoma Narrows Bridge collapse.
This bridge which–under the influence of a particular speed of cross-wind–got into a harmonic mode and was just vibrating until it fell down. And there are videos of this.
Max: We’ll include it in the show notes.
Andy: In terms of bridges falling down in software, though, there are instances like the Therac-25.
There's this radiation delivery instrument–I think it was during the 80s, maybe early 90s–where there were some incidents in which this software-controlled radiation therapy device exposed several people to huge overdoses of radiation.
And I think it resulted in a few deaths, actually. So that's a pretty big deal when you're talking about software that can kill people.
Max: So I realize there’re risk mitigation techniques for physical engineering problems. Do those have good crossover or what is the crossover of physical engineering risk mitigation techniques to software engineering risk mitigation techniques?
Andy: Yeah. So I think there are really…it’s the same kinds of techniques. One that I use is called Failure Mode Effects Analysis, or FMEA.
I learned this tool back when I worked as a mechanical engineer. It was part of the requirements of the degree program where I got my undergrad degree that you spend a certain amount of time actually working full-time during the latter part of the degree. So I was working with a medical device company. We're talking about human health and lives here, so we're also talking about FDA regulations. There's a lot of process and a lot of documentation that goes into the analysis of risk.
Max: Something that’s often lacking in the software engineering field!
Andy: Yeah, totally different…they're optimizing for very different things.
But yes, we would do these big failure mode effects analysis exercises and, you know, there were these big Excel spreadsheets that were included in the documentation sent to the FDA for the approval of this device. This is one thing that I think is totally transferable from mechanical engineering to software.
We can do a quick rundown of, “What is this thing,” failure mode effects analysis?
At its core, it’s an enumeration of the ways in which a system or a component could possibly fail.
If you think about it like a spreadsheet, the rows of the spreadsheet are all of the ways that this system could fail, and then across the columns, you're looking at what's the likelihood of a particular failure mode. You could represent that as a probability from 0 to 1, or just give it a 1-5 score. Then, what is the "badness," the "disutility" of that outcome–whether that's a 1-5 score or whether it's a dollar value in terms of how much revenue we lose because of it.
You enumerate all of the possible failure modes, you assign a probability and a utility score or disutility score and then use that to focus your efforts on risk mitigation.
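The spreadsheet Andy describes is easy to sketch in a few lines of Python. This is a minimal illustration, not any company's actual process; the failure modes and 1-5 scores below are entirely made up:

```python
# A minimal FMEA table: each failure mode gets a likelihood score (1-5)
# and a severity/"disutility" score (1-5). Ranking by their product
# (often called a risk priority number) tells you where to focus first.

failure_modes = [
    # (description,                          likelihood, severity)
    ("email queue silently drops messages",  2,          5),
    ("stale cache served after deploy",      3,          3),
    ("zip code normalization mangles input", 4,          2),
]

def risk_priority(mode):
    """Expected downside proxy: likelihood times severity."""
    _, likelihood, severity = mode
    return likelihood * severity

# Highest expected downside first.
ranked = sorted(failure_modes, key=risk_priority, reverse=True)

for description, likelihood, severity in ranked:
    print(f"{likelihood * severity:>2}  {description}")
```

The same ranking idea works whether the columns hold 1-5 scores, probabilities, or dollar values; only the `risk_priority` function changes.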
Max: You’re ranking all of the possible failure modes by their expected downside?
Andy: Right. Right. And I think this is actually really important to think about the ranking and the prioritization.
You're writing some piece of code and you're like, "Oh, you know, this bad thing could happen," and you write a little handler for that. Or, "I know this other bad thing could happen," and write a little handler for that. And that's the kind of ad hoc approach to addressing failure: "those are the two things that I happened to think of while I was writing this piece of code…"
Max: Not very systematic, yeah.
Andy: Yeah, not very systematic and maybe we’re not catching the ones with the most potential downside to revenue.
Max: Or life or health!
Andy: Or life or health, right.
Max: So in software engineering, I think some of our audience may not be familiar with the distinction between static analysis of a codebase and dynamic analysis of a codebase. A lot of the parallels for failure analysis and risk mitigation software with physical engineering have to do with testing components of your software.
Like you just described, FMEA being applied to testing components or identifying failure modes of components. In software, that component could be a function, could be a class, could be whatever. However, there are two types of analysis in software engineering:
One, static analysis, where you look at the code as represented by textual data that you or I have typed out in a text editor, and evaluate the textual code without running it.
While dynamic analysis is akin to actually lighting the engine and seeing how high it goes.
Is there a parallel in physical testing to static analysis of the codebase?
Andy: Yeah. I can think of some analogies. If you do, say, a tolerance stack-up, meaning that you have multiple pieces in a physical assembly that come together.
I've got some constraint here, and three pieces need to come in here, and I have some tolerance on the thickness of those three things.
You can do some analysis on the tolerances: what stack-up do we expect out, and can we fit that in there?
I think that software is a little bit unique in the ability to catch some of those failures through static analysis. Which is actually great because anything that you can do to shorten the feedback time from when you discover some failure or potential failure and eliminate it from the system, that will greatly increase your throughput and productivity!
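The tolerance stack-up Andy mentions is simple enough to sketch; here is a worst-case version in Python, with all dimensions invented for illustration:

```python
# Worst-case tolerance stack-up: three parts, each with a nominal
# thickness and a symmetric tolerance, must fit inside a 30.0 mm slot.

parts = [
    # (nominal_mm, tolerance_mm)
    (10.0, 0.10),
    (12.0, 0.20),
    (7.5,  0.15),
]

slot_mm = 30.0

# Extremes of the stack: every part at its largest, or at its smallest.
worst_case = sum(nominal + tol for nominal, tol in parts)
best_case = sum(nominal - tol for nominal, tol in parts)

print(f"stack ranges from {best_case:.2f} to {worst_case:.2f} mm")
print("always fits" if worst_case <= slot_mm else "can fail to fit")
```

Like static analysis of code, this catches a class of failures on paper, before any part is machined or any assembly is tested.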
Max: I think one of the reasons that both of us are extremely bullish, at least I am, in the job market for software engineers, is that the cost of testing is virtually zero.
In contrast to some of the prescribed manual tests that are required of physical engineering, there’s definitely a huge portion of the labor market that involves itself in manual testing of software that engineers produce.
However, the cost of reproducible tests is dramatically low. Every time you run a test, it costs virtually nothing. Are there any odd behaviors or dynamics about the cost of creating software engineering tests that could maybe change your life and workflow, coming from physical engineering?
Andy: Oh, man, when I first discovered automated testing… you just used the phrase like, “changed your life.” I’m like, “Changed my life, man.”
Yeah, a well-written set of tests that really gives you confidence that the pieces of your system, and the whole system working together, all work–it worked yesterday, I changed some code, it still works today–man, that's such a good feeling right down in my belly, yeah.
Max: One of the interesting costs of software engineering testing, is that sometimes some tests are written in such a way that they take a lot of time. What are the time/dollar trade-offs with writing software tests?
Like what kinds of tests often take longer to run? You make a tweak or change to some software, and for example, these tests that you’ve written may take 30 minutes to fully complete and show whether they’ve passed or failed with your change. What’s dollar/time trade-off with software engineering testing?
Andy: Well, I don’t have an equation to simplify down to.
Max: Yeah, I didn’t make that question very concise.
Andy: But I guess, without totally just saying “it depends,” I think it’s how many times–how frequently do we need to get this piece of feedback?
If every single change to the code base I need to run this set of tests, yeah, the value in making that 30-minute test into a one minute test or a five-second test, that is gonna be pretty high in terms of the productivity of developers on a team that it buys back.
And by the way, there’s an important assumption there, is that it’s a team here.
I think the dynamics are totally different if it's just you working on one piece of software, because you can keep these things in your head. As soon as you've got multiple people working on a codebase, the value of quick feedback on "Did I break something?" becomes really, really high.
Max: Compounding.
Andy: Yeah, yeah. I would love to figure out a way to more rigorously calculate that and be able to use that in some cases to justify spending more of my time doing it, because I have a gut feeling that, in some cases on teams that I have worked on, the payback would be really, really high.
There have been cases where I have done it and three months later I’m like, ‘‘Man, payback was…that was so great.’’
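The payback Andy wishes he could calculate rigorously can at least be roughed out. A back-of-the-envelope sketch in Python, with every figure invented purely for illustration:

```python
# Rough payback estimate for speeding up a test suite that every
# developer runs on every change. All figures are illustrative.
developers = 5
runs_per_dev_per_day = 6
suite_before_min = 30   # the slow suite Max describes
suite_after_min = 1     # after the speed-up work

minutes_saved_per_run = suite_before_min - suite_after_min
hours_saved_per_day = (
    developers * runs_per_dev_per_day * minutes_saved_per_run / 60
)

print(f"{hours_saved_per_day:.1f} developer-hours saved per day")
```

Even with modest numbers the savings compound daily across the team, which is why Andy's "it's a team" assumption matters so much.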
Max: Yeah, for career-minded folks, whether you're already a software engineer or you're looking at becoming one, one thing to judge about an employer is whether they have tests and whether those tests run quickly. These are very reasonable things to ask about in job interviews, which, as is often told to people, are not just you selling the company on yourself, but also the company selling themselves to you.
And from an engineering career development perspective, being with a team that understands FMEA-style analysis of making software sets up a team for success or failure as you take on more and more revenue or are handling more and more lives in your product's hands.
I think that’s something that often goes understated in college education, is evaluating teams’ best practices.
I know the Joel Spolsky test of whether employers are a good place for software engineers has a lot to do with whether they have automated tests, whether engineers have their own individual office spaces…
Andy: Quiet spaces to work in.
Max: I can’t enumerate them all, but they’re a very idealistic litmus test for employers that, by and large, I think both of us would agree with.
Andy: Yeah, I don’t even think that I would call it idealistic, I think it’s pretty realistic, in that these are things that well-functioning software teams can and should be able to do.
Max: One of the other parallels that I thought about when we were talking earlier about physical engineering versus software engineering is the concept of mocks.
Mocks in software engineering are where you introduce some dummy replacement for a component in your software, instrumented so that it behaves in a prescribed, expected way, in contrast to what might be reality. But it gives you the ability to isolate components of what might be a bigger piece of software, so you can see how your software performs with specified inputs and outputs in its interactions with your other software components.
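In Python, the standard library's `unittest.mock` makes this kind of dummy replacement a one-liner. A minimal sketch; the `charge` function and the payment gateway here are invented for illustration:

```python
from unittest.mock import Mock

# A hypothetical component that depends on an external payment gateway.
def charge(gateway, amount_cents):
    response = gateway.create_charge(amount_cents)
    return response["status"] == "succeeded"

# In a test, replace the real gateway with a mock that behaves in a
# prescribed way, isolating the component from the network entirely.
fake_gateway = Mock()
fake_gateway.create_charge.return_value = {"status": "succeeded"}

assert charge(fake_gateway, 500) is True
# The mock also records how it was called, so interactions can be checked.
fake_gateway.create_charge.assert_called_once_with(500)
```

The point is exactly the isolation Max describes: the test exercises `charge` alone, with the gateway's behavior fully prescribed.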
So is there anything analogous in physical engineering to software mocks?
Andy: Yeah, totally. You're describing something like a testing jig. You have some part, and that part fits into some bigger assembly. So, for the purpose of a test, instead of having the whole bigger assembly, you just have some kind of jig, you attach the part to it, and then you put some force on it until it breaks or something like that.
Max: And I’m curious why the word jig didn’t make it over…
Andy: Now, I'm likely to call it a mock instead of a jig.
Max: Yeah. Well, the zealots of software engineering who argue for test-driven development will say that, "Your software doesn't work until there's a test proving it works."
Does that have something analogous in physical engineering? Is there a zealous cohort of physical engineers in civil, chemical, mechanical, or what have you… is there any equivalent of test-driven development in mechanical engineering?
Andy: Well, you know, I am not actually aware of any. And maybe that just speaks to the kind of social need for some folks in the software engineering field to reclaim some power in the client-contractor dynamic, where the client is changing specs. I think it's easier in the physical world, when the client changes the spec, to say, "Okay, this doesn't fit now."
People can very easily see and respond to the fact that a change in the spec causes cascading changes in the product. Whereas in software, people are like, "Oh, a change in the spec," and then, "All software is magic, so can't you just make that happen?"
So, I think test driven development arose a little bit out of that. I think people need to reclaim the power of like, ‘‘No, we can’t just make this change to the spec and still ship your product on the same timeline with the same cost.’’
Max: So in software, and in the engineering of software, there is an expectation of a race to produce results.
This is universal, irrespective of software. But particularly in software, there is a “winner takes all” dynamic to some markets. In that effort to win, there’s a trade-off between testing your software and not testing your software.
What have you seen in private industry to be the breaking points where a team of software engineers who are given a deadline, an existential deadline, throw their hands up and say, "Ah, if we can't test this stuff, we just gotta make it work on my laptop and that'll be good enough"?
Is there… maybe not a formula you can put out there to describe this, but something you have observed?
Andy: Totally observed that. I think that kind of approach will yield the results that you're looking for, for about maybe a month.
You can stretch that to about a month, you know, just pushing harder and shipping more software faster. And after about a month, when you start to try to stack new features, or you try to tweak a feature and maybe you have to do a little bit of refactoring here or there, that's when everything just totally falls apart.
I think that the downside on the other side, on the other side of that month, is steep and painful.
Max: Yeah, I mean we don’t have to get super specific, but are there particular parts or particular software problems that are harder or easier to test?
Andy: Totally, yeah. Yeah. So the easiest things to test are little pieces of functionality that fit within one function.
You know, you're gonna normalize zip code strings or something. You can have input cases, output cases. That's a very, very easy piece of a software system to test. Harder things to test–you know, it's all about whether you can replicate the failure condition.
You wanna be able to deterministically create that failure condition.
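The zip-code example Andy gives is about as small as a unit test gets. A sketch in Python; the `normalize_zip` function itself is invented here for illustration:

```python
# A tiny, pure function: easy to test with plain input/output cases.
def normalize_zip(raw):
    """Strip non-digits and keep only the 5-digit US prefix."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[:5]

# Deterministic input cases and expected output cases.
assert normalize_zip(" 94720 ") == "94720"
assert normalize_zip("94720-1234") == "94720"
assert normalize_zip("9 4 7 2 0") == "94720"
```

No infrastructure, no setup: the whole failure condition is just "wrong string out for a given string in," which is trivially deterministic.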
As you go to bigger pieces of the system, that’s where things get harder to test, that’s why unit tests are easier to write than integration tests. But any system where you have, let’s say…so you’ve got a mail queuing system, for example.
I bring that up because Max and I have actually worked on that in the past. So you've got some Python code, you've got a Redis data store, you've got a Postgres data store, you're pulling data from various places. Setting all of that up and creating certain kinds of failure conditions, like, "Okay, I wanna max out the memory on the Redis instance so that new messages that try to hit that queue fail to land in the queue–they're just dropped silently." Watch out, that happens in Redis.
Max: That’s a failure mode to be really well documented, hopefully.
Andy: Yes. So those kinds of things require more infrastructure to set up those components. Then you have to think about, "Okay, if I'm gonna run my test and you're gonna run your test, are we using the same shared infrastructure for that, or do we have isolated infrastructure for every single test run? What are the ups and downs of that?" So there are just a lot more trade-offs as you start to involve more components.
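One way to make a failure condition like that deterministic without standing up real infrastructure is to substitute a fake that drops writes on purpose. This is only a sketch: the fake client and producer below are invented, not the actual mail-queue system being discussed:

```python
# A fake Redis-like client that, once its "memory" is exhausted,
# silently drops pushes -- letting us deterministically reproduce the
# dropped-message failure mode without a real Redis instance.
class FakeRedis:
    def __init__(self, max_items):
        self.max_items = max_items
        self.queue = []

    def rpush(self, item):
        if len(self.queue) >= self.max_items:
            return  # silent drop: no error raised, nothing enqueued
        self.queue.append(item)

def enqueue_mail(client, messages):
    """Hypothetical producer that trusts the queue to accept everything."""
    for message in messages:
        client.rpush(message)
    return len(client.queue)

# With capacity for 2 items, the third message vanishes silently:
client = FakeRedis(max_items=2)
landed = enqueue_mail(client, ["a", "b", "c"])
assert landed == 2  # one message dropped without any exception
```

A test like this pins down the failure mode once; whether the real Redis instance behaves this way under memory pressure still needs to be verified against its configured eviction policy.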
Max: Got it. So the cost of reproducibility goes up as you involve more components, like Redis?
Andy: Totally. Yeah.
Max: I know one area that makes front-end development really hard is that browsers are your runtime environment. And browsers are surprisingly hard to mock. They also are doing a lot of really asynchronous stuff. So you have expectations about networking or disk interactions that make it hard to reproduce a lot of things or reproduce them in a systematic way.
What are some of the… I realize you’re not a subject matter expert of front-end software engineering alone, but what are some of the best practices for risk mitigation in so-called “front-end” software engineering?
Andy: Mm-hmm. So I don’t know that I could speak about what are the current best practices in this realm, but certainly, whatever you can test at the unit level, test that. Don’t test that using Selenium. If you’ve got that function that validates some text input, test it at the unit level. But then, for that kind of overall functioning of a web app, you’ve gotta be driving a browser and getting realistic interactions.
Max: So Selenium is software intended to automatically drive a browser?
Andy: Right.
Max: Is there a parallel to Selenium in physical engineering? Are there automated tests that might move a laser around the cutting board or CNC machine?
Andy: Yeah, totally. Okay. Chairs, right? Chairs will go through reliability testing where there will be some mannequin that sits down, and gets up, and sits down–not physically, like, bending its legs, but there is some force that is pushed down onto the chair and then cycled on and off, until they figure out how many tens of thousands of times you have to sit down in the chair before the springs fail or something. So yeah, there's your Selenium: mannequin-driven testing.
Max: That sounds expensive compared to Selenium.
Andy: Yeah, yeah. That’s one of the cool things about software–it’s very fungible and you can, yeah, add tests, throw them away cheaply.
Max: For people who may have gotten into programming without ever learning automated testing, they may have only ever manually tested whether the software they're writing works or not.
What are your recommendations for people dipping their toes into writing tests? I know there are a lot of resources freely available online to learn this stuff, but a lot of people, I think, aren't sure of how to take that first step and write their first test ever.
Andy: Yeah. I think that the best advice I can give is just write the first test. Just start.
But I would say go about it, not with the intent of trying to figure out what’s the perfect test to write, go about it with the focus on maintaining your own sanity. Because you’re gonna be working on some piece of software and maybe today, you click through the browser and, “Oh, when I click the button, the thing happens. And then I make some other changes and I go back and, now, I click the button, and the thing doesn’t happen. So, now, like, I’m frustrated because I have to go and click through this series of actions to repeat this thing.”
Start by automating the things that you do to test your own software, to convince yourself that your changes are not breaking things.
Focus on your own sanity, like an exercise in compassion for yourself. And from there, you will start to experience, "Oh, I wrote this test a week ago, and now I'm trying to change this thing and… well, I actually have to change the test too, because the approach that I took to writing the test was maybe too high-level or too low-level."
You will learn those things as you go through the process of adding additional tests and going back and changing some of the tests that you’ve written.
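That first test can be as small as turning the manual click-through into an assertion. A sketch in Python; the `apply_discount` function is a hypothetical stand-in for whatever you keep checking by hand in the browser:

```python
# Suppose you keep manually clicking through to check that applying
# a discount shows the right total. Automate exactly that check,
# nothing fancier -- the test exists to protect your own sanity.
def apply_discount(total_cents, percent_off):
    return total_cents - total_cents * percent_off // 100

def test_apply_discount():
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(999, 50) == 500  # integer math: 999 - 499

test_apply_discount()
print("ok")
```

Once a check like this exists, "did my change break the discount?" is answered in milliseconds instead of another trip through the browser.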
Max: I think even, for those people thinking about writing their first test, career-wise, like we talked about earlier: finding jobs and teams you can join that are already testing is a very good first entry into testing, because people are invested in your success, so they'll keep a watchful eye on you and perhaps walk you through an instance of them writing their own tests.
You'll also encounter codebases that have existing tests. You might try making a change on your first day on the job and break a test–that is exactly what tests are for: to give you an indication, when you make a change without any context about whether it will be a breaking change, that your changes are breaking things.
So on the job and being a member of a team that already has these kinds of practices in place, is also, I think, a huge step without having to go google for yourself. People are invested in you learning how to write tests and whatnot.
Andy: The best-case scenario is that you can learn from people who have more experience doing it. If you don't have that opportunity, I would say, just go out there and start writing your first test, because of the value that will create in terms of your career growth.
Max: One of the things that dawned on me perhaps after my first job out of college was that you are kind of compensated in life based on how much stress you’re able to take on.
This is particularly true in white-collar jobs where you're not physically endangering your own body for years, because certain people can absorb more stress: they know how to put measures in place systematically to compound their security and confidence in dealing with risks.
I can't emphasize enough how valuable learning testing is, because it will enable you to confidently take on harder and harder projects. As both Andy and I know from experience, taking on existing software projects is a very common phenomenon in software engineers' careers.
Almost every job you'll ever take involves touching code somebody else has written. And often, it involves touching code someone else has written without any tests.
So a lot of the money in the software engineering labor market involves bringing in people who know how to write tests to save the day for folks who've strayed a little too far without tests, and have reached a breaking point where manual testing is no longer sustainable, and their business can't move forward without laying in place more, and cheaper, automated testing.
Andy: I think it's also worth pointing out that if you think about that scenario you described earlier, where there's pressure to ship features–"Ship features, we don't have time to test"–you may be able to get some short-term gains in productivity from that kind of approach.
And then there will be an inevitable drop-off. If you compare two trajectories, one which goes up and then drops, and one which includes tests and is maybe flatter and steadier at the beginning, the pain of that drop-off is gonna be just way, way more.
The pain of adding those tests later, to a system that has been pushed out the door under time pressure, is gonna be orders of magnitude more than that of the tests that were added incrementally during the development of that piece of software.
Max: So one of the distinguishing responsibilities that I wanna emphasize for people who are earlier on in their software engineering careers is deciding when to greenlight code changes.
I think a lot of our audience can intuitively understand how that responsibility of yours entails handling a lot of risk, in contrast to somebody who doesn’t have to make that decision. But how do you, in your senior role, mitigate the risk of, at any given time, choosing to hit the “green means go” button and incorporate changes that your software engineers want to add to the codebase?
Andy: I can talk about the spectrum of experiences. There are cases where I’ve been working with a codebase where I felt we had pretty good test coverage, and I felt pretty confident about any change that was proposed: if the CI system–continuous integration or automated test system–said, “Thumbs up,” I felt really good about shipping that piece of code.
In those cases, I’m able to just say, “Okay, there’s some new feature. I’ll actually look at the tests for the feature. I’ll look at those, you know, way more than I actually look at the code for the feature. I’ll kinda skim the code for the feature, because if the tests pass and you’re not doing anything totally insane in the implementation, then cool.”
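Reviewing the tests before the implementation works because the tests read like a spec for the feature. A minimal sketch in Python, in the style of pytest-discoverable tests (the feature and function names here are hypothetical, not LeadGenius code):

```python
# Hypothetical feature: normalize lead email addresses before storage.
# A reviewer reading these tests learns the intended behavior without
# having to read the implementation closely.

def normalize_email(raw: str) -> str:
    """Lowercase and strip surrounding whitespace from an email address."""
    return raw.strip().lower()

def test_normalize_email_lowercases():
    assert normalize_email("Jane.Doe@Example.COM") == "jane.doe@example.com"

def test_normalize_email_strips_whitespace():
    assert normalize_email("  jane@example.com \n") == "jane@example.com"
```

If these tests pass in CI, the reviewer mostly needs to confirm the implementation isn’t doing anything insane; the tests themselves document what the change is supposed to do.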
Other times, I’ve worked with codebases where I don’t have that confidence that any particular change is gonna be…like, that we’re going to catch regressions that a particular change introduces.
Just because our test coverage is not as good, either in terms of the percentage of the codebase covered, or because our overall testing infrastructure is just not realistic enough.
With those kinds of changes, the amount of stress that I deal with in reviewing them is way, way more, because then I have to go back to that failure modes and effects analysis thing and ask, “Okay, what are the things I can think of that this could potentially break that are outside of our test coverage? And do I have time to go check all of them?”
So, yeah. I guess that it just goes back to the value of test infrastructure that gives senior engineers the confidence to greenlight changes and not lose sleep over it.
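The failure modes and effects analysis Andy mentions can be approximated as a simple scoring exercise: rate each imagined failure mode and rank them to decide what to check first. A sketch in Python (the failure modes and scores here are invented for illustration, not from an actual review):

```python
# FMEA-style scoring for a proposed change: each failure mode gets 1-10
# scores for severity, likelihood of occurrence, and difficulty of
# detection; their product is the risk priority number (RPN) used to
# rank which scenarios to manually check first.

failure_modes = [
    # (description, severity, occurrence, detection)
    ("API server crashes at import time", 9, 3, 2),
    ("Wrong data written to leads table", 7, 4, 8),
    ("Slow query degrades dashboard",     4, 5, 5),
]

ranked = sorted(
    ((sev * occ * det, desc) for desc, sev, occ, det in failure_modes),
    reverse=True,
)
for rpn, desc in ranked:
    print(f"RPN {rpn:3d}: {desc}")
```

Hard-to-detect failures (like silently wrong data) rank above dramatic but easily caught ones (like a server that crashes immediately), which matches the intuition that the scariest bugs are the ones outside your test coverage.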
Max: For sure. Yeah, I think the stress and risk mitigation problem is a big deal. One of the things I mentioned in previous episodes is how Hammurabi’s Code is a very effective set of rules for engineers. Hammurabi’s Code, for those who don’t know, is a really ancient legal document that says any builder of a home that collapses and kills somebody is to be put to death.
In software engineering there are very few areas where the risks are so high that somebody might die from a bug. However, there are very large businesses built on software, and I think the field of software engineering hasn’t matured far enough yet to have a Hammurabi’s Code of its own.
Andy: I think it’s still really unclear who is really responsible for the bug that lost the company $100 million. Is it the engineer who wrote it? Is it the manager who pushed that person to work under time pressure and not spend as much time on testing? Is it that manager’s manager?
These things are, they’re really still up in the air in the software field.
Max: Oh, yeah. I’m just trying to imagine the different risk mitigation measures that businesses have in place for when employees like ourselves make the wrong call, or misjudge the probability of an adverse event.
When the day comes that the adverse event occurs and we have to come to terms with the risk, I guess one recourse employers really have is to fire that employee; that’s totally an available option. Another option is to sue the employee for damages.
The reason these two outcomes are really rare, I think, is, one, employees generally don’t have that much money saved up; they’re not businesses with deep pockets.
For another, an employer does not want to get a reputation for being litigious toward their employees; they may never be able to hire anyone again to do the work they need done.
But, thirdly is that, when an employee makes a mistake of this sort where they introduce a bug into the codebase and it causes loss of money or, God forbid, lives or health, that is a very expensive form of education.
You really wanna hold on to that employee because they, in some ways, have learned a lesson that none of their peers have learned. Although how that is effectively communicated to that employee or how you gauge whether that employee has now learned their lesson, is a whole other matter outside of engineering, I suppose.
Andy: I’ve definitely gotten some of those lessons a little bit. I’ve broken some things…
Max: Do you mind sharing any anecdotes real quick, maybe one before we wrap it?
Andy: Oh, gosh. I think the one that’s especially dramatic…I mean, I know that I’ve taken down production servers probably on a handful of occasions with changes.
Imports, that’s one. In a particular codebase there were two main areas: one which gets deployed to an API server, and one which ends up being run in a data-processing environment. So the two environments should be tested in two different ways, and we were not doing that at the time. We deployed some code with import changes that imported some stuff that should only be available in the data-processing environment. We were trying to import that in the API server, and it immediately crashed, of course.
Fortunately, we had systems in place to roll back those changes pretty quickly. But I’ve definitely done stuff like that.
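One common guard against the class of failure Andy describes, where an environment-only dependency crashes a server the moment a module loads, is to defer or wrap those imports instead of importing unconditionally at the top of the file. A minimal sketch in Python (the module names and the `safe_import` helper are hypothetical, not LeadGenius’s actual code):

```python
import importlib

def safe_import(module_name):
    """Return the module if it is importable in this environment, else None.

    Guarding environment-specific imports this way means an API server
    degrades gracefully, rather than crashing at startup, when a
    data-processing-only dependency is absent.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

# 'pyspark' stands in for a dependency installed only on data workers.
spark = safe_import("pyspark")
if spark is None:
    print("data-processing features disabled in this environment")
```

Testing each area of the codebase inside an environment that mirrors where it actually runs, as Andy implies they now do, catches the same problem earlier, in CI rather than in production.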
Max: Oh, man. I can think of one both of us encountered, which was a very similar problem, where our testing and live environments are supposed to be kept isolated. However, we created a new testing environment with some production settings turned on, which led to paying out a very large sum of money that we weren’t supposed to. And that had some very immediate financial consequences for the business that were pretty hard to dial back after the fact. But in those cases, it’s really easy to identify what the cost was; identifying the probability is really hard.
And I don’t think the steps we took to prevent that problem from ever occurring in the future were so permanent; rather, we just acknowledged that the probability of it happening again is really low, and the cost was not so high that it was going to put the business out of business. But I remember that emotionally and physically. I think we should, real quick, plug LeadGenius jobs: if you guys are interested in working on a software engineering team that has tests, that has Andy on it, you’d be working on data, pretty cutting-edge sales…
Andy: Apache Spark data pipelines, data processing, Python, Django web app stuff, crowdsourcing, human-and-machine combined algorithms, really cool stuff.
Max: Big data, small data, everything in between! Right now we’re just keyword stuffing, so.
Andy: All the data of all the sizes!
Max: And I have got to plug real quick that if you have any questions for Andy, leave them in the comments.
Like, subscribe–both to our YouTube channel and our email list. Andy, it has been a pleasure having you join us. I am hoping to do it again very soon.
Andy: Yeah. Thanks, Max.