Charles Martin re-joined us for a conversation about how new discoveries in data science get funded.
In the absence of the National Science Foundation (NSF) and an increased focus in industry on “engineering” wins, Charles describes how he convinces companies to fund original data science research.
Covered topics include:
- what is the difference between “science” and “engineering” as it relates to data science
- what is in the data science consultancy toolkit (hint: Scikit-Learn, Keras, Edward)
- whether the future of data science will happen on your phone, your laptop or the cloud
For the full text transcript, see below.
Max: Welcome, guys. Max of the Accidental Engineer here. Today we are rejoined by Charles Martin, data science consultant with Calculation Consulting.
Max: Great to see you.
Charles: Good to see you again.
Max: Welcome, welcome. One of the topics that you and I touched on last time that I know is of tremendous interest to you is the distinction between science and engineering and how engineering kinda focuses on making money for businesses while science is a very time intensive, resource intensive “discovery phase.” This has a lot of parallels to the type of work you do as a consultant but one of the things that I wanted to ask you about was how do consultants like yourself, or scientists, work around this funding problem, this financing problem? How do great innovations in science get funded so that they become common practice in engineering?
Charles: Look, I think you have to find the right clients.
Max: Sure. Absolutely.
Charles: Right? You have to find…what is it that makes a client a good candidate for a data science project or a machine learning engagement?
Max: What is that?
Charles: If you go talk to the Clouderas or the Databricks of the world, their job is to go in and build infrastructure, set up Hadoop, set up Spark…
Max: So data lakes.
Charles: Data lakes, data swamps. It’s basically a consulting dream for them, you know? You’re gonna go in and spend a year, 2 years, getting this thing set up. And it’s basically all building infrastructure. That’s what I mean by engineer. I mean, you need to do it but it doesn’t generate revenue for you. You have very large companies who will afford to do this because they think they need to do it. And you really…you talk about doing science, you’re talking about trying to find ways to generate revenue fast. How do I solve a problem quickly? And how do I get it into production and start running experiments?
So the key to being able to do science in industry is be able to start generating revenue as quickly as you can and to begin running experiments in production as quickly as you can, so you can begin figuring out how to begin building revenue, building some [revenue] stream that will support the activity. Meaning that a good machine learning AI engagement should be self-funding. You should generate enough revenue within the first month or two months of the project that you can fund it for a whole year. That’s what you’re looking for. And the key to that, you know, you have to select the right clients and the right targets so that you feel that you can get the right data and that you can start getting things working in production as quickly as possible.
You know, we had a client last year, it was a computational advertising client. They had worked on a year…they brought a mathematician in from Harvard. They worked on a year trying to get Hadoop running and to generate revenue. And I said this is just, you know, just put…
Max: That sucks.
Charles: Yeah. And it came to me. I said, you know, I work with the CTO who is a genius level coder. I mean, some of these guys, they’re phenomenal! They’re willing to work and sleep four hours a night, can just code all the time. And within two weeks, you know, we hacked together a machine learning solution in Ruby. In Ruby! Because they were Ruby guys. They go, we’ll put you a Ruby solution together. And using, you know, some command line tool. And within 2 weeks, we had revenue up by 10%. And by the end of the month, we had revenue up 35%.
Max: So besides crazy, amazing results, finding customers that have latent data that is well structured enough that you can come in and work with it, what is kind of the tool set that you come in with, software-wise, to do this analysis? Like, you just described using Ruby but shelling out to other open sourced software…
Charles: Yeah, yeah, and you know…well, sure. I mean, look…as I said, it’s “science” engineering. You know, we use the same tools you guys do. We just do something different. Look, you know, you have to pick…look at what people are doing. Are they doing Java? Are they doing Ruby? Are they doing Python? Are they doing Scala? You have to pick something they could put into production quickly.
I prefer using Python. Python is very easy to do. I can spool up a node on Amazon and we can have experiments up. And you know, you just basically start pulling data out of whatever data store you have, stuff it on S3, and you begin running. And you just begin running things in production, immediately. It’s very easy to set up.
Max: What kinds of machine learning libraries? I mean, Python is kind of the glue between several tools.
Charles: The main libraries are scikit-learn. That’s your Python go-to. You’ve got now, Keras, which is basically the Python interface to TensorFlow. I have some tools for doing things like causal inference. I’m looking at a new tool, variational autoencoders for causal inference, which are written using Edward, which is a Bayesian modeling system written in Python. So it’s used for marketing science. How do you do Bayesian inference, or large scale Bayesian inference? In the old days, we did everything on the command line, just using liblinear and Vowpal Wabbit. Those are very good tools.
Max: So liblinear, Vowpal Wabbit, what are the licensing terms of those?
Charles: They’re open source.
Max: They’re a great example of what was probably previously science research.
Charles: Ah, ah, let me give you an example. Many, many years ago, there was a tool called SVM-Light, out of Cornell. In fact, I used it when I was at eBay, 10 years ago when we were looking at using this tool. And the problem was the tool wasn’t open source for companies. They didn’t open source it and so nobody used it. They stopped using it because, you know, we can’t put this thing into production. We don’t wanna deal…we’re not gonna pay you a licensing fee. I understand that there were some companies…that new guys at companies who were paying the licensing fee do this. But why would you pay a licensing fee for this when you can just get liblinear for free? And it’s totally open source. And again, when I was at…companies don’t want to pay a lot. You know, they don’t wanna pay these licensing fees for tools.
Max: Why not?
Charles: Yeah, they don’t wanna spend the money on it. When I was at BlackRock, when I was at BlackRock they wanted to shut…they wanted to stop paying licensing fees. We wanna dump any licensing fees. We wanna get rid of MATLAB and just use R. You go, okay, great, we’ll use R. Or, you know, just get rid of SAS, you know, get rid of these things. You know, why would you pay…you know, you don’t want to…you just get locked into this vendor and you have to pay this licensing fee. And there’s plenty of good open source stuff out there. In fact, even for my practice, we’re looking at building a platform and just open sourcing it and giving it to people. So you know, it’s not…It’s not what we do. We’re not there to sell you software. We’re there to solve problems. And so if you need software, we’ll just give it to you. You know, just take it. Open source it, we can reuse it on other clients. You can extend it. It’ll be fine. I think that’s…especially now with Google TensorFlow. Who would use anything else? I mean, you know?
Max: Very well maintained, well financed, meaning they’re paying the salaries of probably tens of employees, full time at Google, to maintain it.
Charles: That’s Google’s entire effort. It’s all they talk about. That’s all Google and Facebook talk about now.
Max: Facebook as well?
Charles: Well, they have Torch, you know, and their Lua tool which was built at the Courant Institute. The old Lua tool, before it was called Torch.
Charles: But, you know, people stopped using these now. I mean, Google has now put so much effort into this and they’ve open sourced it. And moreover, they have libraries on top of it. Like Edward and Keras and I saw a new reinforcement learning library. And it’s phenomenal, they’re phenomenal.
Max: Is this the future of financing science? The megacorps, like Google? You know?
Charles: Well, science is pathologically sick. I mean, the academic…the university system is pathologically sick. They have some serious problems. I mean, I think Peter Thiel has really pointed out, you know, what a bubble academics has become.
Max: In terms of the tuition? In terms of…
Charles: I didn’t pay anything to go to school. You know, I didn’t pay anything, a couple, few hundred bucks, you know? I mean, that was it, you know? We had no debt.
Max: The good old days.
Charles: Why would you take on debt to go to school? That’s…the whole point of the Ph.D., you get a great education and they pay you.
Charles: But so, I think that they…there’s a huge problem in academia, that you start producing all of these people. They can’t…how are they gonna pay all this back?
Max: You know, there was an interesting, recent news article about Google’s funding of academics at universities. And I didn’t read the details into it but it sounded like Google was providing grants along the lines of NSF to researchers. And that brings up all kinds of…
Charles: They’re also hiring researchers. I mean, like, you know, like… Uber buys entire departments. You know, the problem with the academic world is that…I mean, the research is done by graduate students and post-docs.
Max: They’re cheap.
Charles: Yeah, but they’re also not experienced. You know, they’re not experienced. There’s a difference between having 20 years of experience in the field and having 1 year and taking a few classes. We have these guys at Chicago, these great researchers, you know, who could…you know, they’re productive into their 70s. And they’re still doing amazing things. And now it becomes the entire…you know, with the fall of the Cold War (as it was explained to me), the funding, the whole entire situation changed. And so you go into academics and your entire time is spent raising money. You don’t actually do any work. Well, how’s that any different from what I do, you know?
Max: Yeah, well you were telling me some pretty compelling stuff about how much science originated from just the pursuit of the nuclear bomb and the pursuit of the end of World War II. You mentioned a couple of scientists’ names and I’m blanking out.
Charles: Well, I mean, like Enrico Fermi. Well, you know, a lot of science, for example…
Max: Well, a lot of science gets funded in the pursuit of ending wars or winning wars.
Charles: Well, I think even that people don’t really understand, they don’t know the history of how a lot of modern 20th-century science came. The Germans funded the development of quantum mechanics because they were trying to corner the market on light bulbs.
Charles: And they wanted, you know, ask what is…think about guys who know physics. Blackbody radiation, you know? It’s a light bulb. You know, what did Einstein get his Nobel Prize for?
Charles: It wasn’t general relativity. It was the photoelectric effect, light bulbs. They wanted to optimize the production of this incredibly important resource. And so they put all this government money into this program to try to really understand what was going on. But it turned out that the universe was much stranger than they’d imagined, you know? They thought it was just gonna be a vanilla engineering effort but it was not. Yeah and I think…I mean even now, the things that we see now in terms of deep learning, this is research that we did maybe 20 years ago, the neural network stuff. But that was not…the core research was done many, many years ago and the industry’s come in and taken over and really driven it forward.
Now I would say that the kind of research going on in AI, the stuff going on in industry is light years ahead of what’s going on in academics. You know, you don’t see…you just, the things coming out of Google, Microsoft Research, Facebook, you know, it’s just the…I think that they’re making their resources available, so, you know, they want students. They want to continue to fund it but it’s become more like how Bell Labs used to be.
Max: How’s that?
Charles: Bell Labs…and again, you guys are so young. You know? Bell Labs, I mean, that’s where Unix was invented, was in Bell Labs. And Bell Labs was famous for being one of the greatest R&D centers in the world. I mean, there’s always been a long history of places like IBM and AT&T Bell Labs having these fantastic research centers.
Max: I mean, one of the things that’s changed since the times of Bell Labs is that the expenses of conducting scientific research have changed. And I know, I’m guessing that one of the big expenses to engaging in data science as a business is compute costs. And it might not be very large but I’m wondering…
Charles: Yeah, you know, we talked about, like Enrico Fermi.
Charles: I don’t know if I mentioned it. Enrico Fermi was…he had this thing. He called it the back of the envelope calculation. And the guy was just…he was able to take any problem you needed to solve, take out a sheet of paper (which was about the size of an envelope) and just write the…and just figure out the solution. He was a…and they say when he retired and they went into his office, they just found stacks and stacks and stacks of old envelopes where he would just, you know…basically, well, envelopes they would use to mail, you know, various things around the institute. And he would just solve problems on them. He had stacks of them where he just solved thousands of problems.
A good scientist is able to look at a problem and come up and craft a quick and dirty solution. And that’s the key to being able to do scientific work is the fast turnaround time. You have to be able to solve problems and get fast turnaround time. You’re not there to spend six, seven, eight, nine months building infrastructure. You’re there to design good solutions and get them tested and to design good experiments. We say this in science, you know when experimentalists built a device and then they just started measuring things. They have no idea what they’re measuring. What are we measuring? Well, you know, you can’t afford to do that in business. You can’t afford to just run random marketing experiments and see what happens. You have to carefully think about the design of your experiment so that when you see the result, you understand how much revenue did you generate and can you figure out why you generated this much revenue?
Max: So a lot of these actual, implemented solutions that result from doing data science research in a business are often times implemented and are running on servers somewhere. Like, they’re running somewhere in the cloud, they’re outputting some results, and they’re sending an email to somebody on the business side who makes some decision based on the output of those results. But one of the really interesting directions I think is a newer trend is that that model, the…whatever is being run in the Cloud is now being run on device. Like, there’s recent news about Google partnering with hardware companies (or maybe vice versa) to get optimized TensorFlow processors. So I’m curious, are there…where are we going with the location of computing, whether it’s in your hand or whether it’s in the Cloud?
Charles: I think it’s…you know, there was a very famous paper and video that came out a few years ago called “The Unreasonable Effectiveness of Data”. And it was basically the top guys at Google saying, you know, you really don’t need sophisticated algorithms. If you have enough data, even the simplest algorithms will do.
Charles: And this turned out to be completely wrong. Like, it just…they missed it totally. It turned out that…and this led to the development of Hadoop and the idea that you can just run these simple algorithms out of Hadoop.
Max: That misunderstanding is what led to…
Charles: Yeah. That was completely wrong. And anyone who’d worked in neural network theory 20 years ago understood, no, that’s totally wrong. You know, you can do really, really good work on your laptop with an SSD and a GPU accelerator. I don’t think it’s a barrier, you know? I don’t think the calculations are as much a barrier as it is the pipelines and the infrastructure and doing the wrong things. With the development of Google TensorFlow and the Google TPUs and Nvidia, you know…it’s just, you know…the price is dropping to the floor.
Max: Well, what are some of the examples of handheld data science? Like, I know there’s some really trivial stuff when it comes to putting masks on people’s faces, that follows their faces, like on…
Charles: You mean like on the phone? That stuff?
Max: Yeah, yeah. I mean, another thing that is a promising future application for device-based machine learning is classifying images locally versus having to send the images…
Charles: Well, yeah. You know, there’s a lot of research in trying to compress models. So a couple of years ago, model compression was a big idea. That you could somehow build a…you could somehow take an ensemble of 1,000 models and then retrain another smaller model which could get the same result of that ensemble and it only took one model. And this was called dark knowledge.
Charles: I have a theory about, you know, they’re basically, like, doing a high-temperature average of this. They’re taking temperature average. It’s not a theory, that’s what they’re doing.
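To make the "high-temperature average" concrete, here is a minimal sketch in Python, assuming the usual distillation setup in which each model in the ensemble outputs logits, the logits are softened with a temperature, and the softened probabilities are averaged into targets for a smaller model. The logits, class names, and function names are hypothetical and chosen only for illustration; the sketch stops at building the soft targets and does not train the smaller model.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature):
    """Average the temperature-softened predictions of every model in the ensemble."""
    n_models = len(ensemble_logits)
    n_classes = len(ensemble_logits[0])
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]

# Hypothetical logits from three models for the classes (cat, dog, truck).
ensemble_logits = [[5.0, 2.0, -1.0], [4.0, 3.0, -2.0], [6.0, 1.0, 0.0]]

hard = ensemble_soft_targets(ensemble_logits, temperature=1.0)
soft = ensemble_soft_targets(ensemble_logits, temperature=5.0)
```

At the higher temperature the top class keeps its lead but the unlikely classes receive more probability mass, and that extra information about which wrong answers are "nearly right" is what the smaller model is trained to reproduce.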
Charles: And it’s just getting, you know…the algorithms are getting better and better. You know, you’re pre-training. You know, running a classifier is not a big deal. It’s the training of the classifier that takes time and…
Max: Running the classifier involves serializing and persisting a model you have trained and distributing it. Like…
Charles: Right. So the smaller it is…the ability to build models that are very, very compact that can do quick classification. I think that’s sort of the key is, you know, how do you…you know, there’s a lot of research in it. But there’s also…will be, you know, advances in hardware.
This thing is like a super computer [gesturing at iPhone]. I mean, this is incredible what this thing can do. And it’s just gonna get faster and faster. So I think it’s a Moore’s Law type…people thought that, oh, there aren’t gonna be any more advances in chip design or anything like that. And no. In fact, very, very…a lot of the special purpose chips you need. The data science requires having dataflow architectures in the hardware, so that data can flow into the chip, a calculation can be run, and then data flows out. That’s not how existing things like the X86 work. They’re designed for running multitask operating systems. They’re not designed to pipe data through the hardware.
When we were in graduate school, and I was running quantum chemistry, you know, we had special purpose drivers and hardware that would pump data into the processor so that we could run matrix multiplications and they were very, very high performance. And that technology now, you know, stuff that was multi-million dollar machines are now, you know, commoditized, completely commoditized. Yeah, I mean, I think it’s pretty amazing. You know, hot dog, not hot dog. Is that the?
Max: Yeah. Silicon Valley.
Charles: Right. But I mean…well, I’m glad. And I also think because of latency effects. I mean, the latency is so low, you could afford to send an image over the wire and get a classification.
Max: It kinda changes if you are live streaming video, maybe, where…the bandwidth is meaningful. We’re still in an early age enough where I think it’s still a real issue.
Charles: Well, yeah. You know, it’s a…there’s a lot of room for growth.
Max: So in the area of growth, I know we’ve kind of touched on this. But the capital costs of doing original data science are time, people time and compute time. And I guess, what are the upper bounds…what are the forceful upper bounds of those…
Charles: I think a lot of the problems come up because people try to go too fast.
Max: How’s that?
Charles: Well, they think that…like one of the problems, people think that…well, what is data science? Well, we’re just gonna do feature engineering. We’re gonna run logistic regression and all we have to do is figure out what the features are. So we’re gonna set up the regression, we’re gonna hire five guys (and gals) all out of school and they’re just gonna figure out all the features and just keep trying over and over and over. You just keep banging your head against the wall to try to make this thing work. And like, maybe that wasn’t the right thing to do. You know, business does require thinking, you know? You have to spend some time…and that’s what makes it hard, is that’s a different kind of thinking.
You’re trying to figure out how to frame the problem correctly so that you understand where your biases are and you understand if there’s information leaking into the system. And that’s the real challenge. It’s not so much the engineering side of data science. It’s not so much the experimental side of, you know, how do we trial. So I think people simplify things too much. They simplify it and then they just get this idea in their head. Well, we’re going to try to make this go really cost effectively by scaling it horizontally and having all these different people. And then you realize, well, you know, you’re not really framing the problem correctly at the beginning. You need to spend time on that. That’s just part of setting up a business. It’s like coming up with a new recipe for your restaurant, you know? If…I’ll give you an example.
Max: Before you do that, I wanted to ask really quickly, I think a lot of our audience may not know what feature engineering is. Like, in terms of the process of coming up with a machine learning model for a problem and trying to iterate and improve that model’s output so it conforms to what you…conforms to accuracy or measures of accuracy. What is feature engineering?
Charles: Well, yeah. The point is that you have some data somewhere and you’re trying to predict something. Like, you want to…again, when you wanna offer somebody free play in a video game and you want to induce them to play more. And so you…you know, this is a standard marketing trick, right? Doctors used to do this many years ago. They would offer…excuse me, pharmaceutical companies. They would offer free samples to doctors and then doctors would…they would then start writing prescriptions to their patients because they get…for free samples. And so they would write more prescriptions. And so that’s a common problem.
And so when you talk about feature engineering, it’s this idea that you’re trying to take the data that’s in your database and transform it into a form that’s suitable for the algorithm. You know, the algorithms can’t figure…they don’t necessarily know what the right…they can’t look inside the database and figure out what column to use, what normalizations you use, how do you rescale the data. There’s a lot of just very technical things that go into designing a machine learning system. They don’t just automatically find patterns in data. You have to prep it.
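As a rough illustration of the kind of prep described here (choosing columns, rescaling, normalizing), the following Python sketch turns a few database-style rows into numeric feature vectors. The column names, the sample rows, and the choice of a log transform plus standardization are assumptions made for the example, not anything taken from a real engagement.

```python
import math

def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant column
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical rows pulled from a table of video game players.
rows = [
    {"spend_usd": 0.0, "sessions": 3, "platform": "ios"},
    {"spend_usd": 4.99, "sessions": 10, "platform": "android"},
    {"spend_usd": 250.0, "sessions": 42, "platform": "ios"},
    {"spend_usd": 19.99, "sessions": 7, "platform": "web"},
]

# Spend is heavily skewed, so compress it with log(1 + x) before rescaling.
spend = standardize([math.log1p(r["spend_usd"]) for r in rows])
sessions = standardize([float(r["sessions"]) for r in rows])
platform, platform_names = one_hot([r["platform"] for r in rows])

# One numeric feature vector per player, ready for a learning algorithm.
features = [[s, n] + p for s, n, p in zip(spend, sessions, platform)]
```

None of these choices is made by the algorithm itself: which columns to use, whether to take a log, and how to encode the platform are all decisions the modeler has to make.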
Max: I know with your engagements with consultants, it’s tricky to talk about specifics. But can you give an example of where coming up with a new feature or engineering a new feature for a model radically changed the outcome for an engagement you had?
Charles: Yeah, let me give you a real good example. A very classic one is that you add features into the model and it screws up the model completely.
And it’s not clear…why would that happen? You know, why is it…it’s sort of this naïve idea that I just keep adding, I just keep stuffing data into the system and it should get better and better and better. Well no, it gets worse, you know? It could get worse. Why? Well, here’s an example. We have one where we have a client who’s trying to measure traffic on the internet and they want to know why is there a traffic change? And so they put the titles…maybe you put the title of your web page in or the title into the traffic engine. And it turns out some of the titles you put in are for Mother’s Day. And you had a traffic…your traffic went down the day after…sometime during Mother’s Day. And so it looks like a lot of your traffic went down because of Mother’s Day. No, that was a holiday effect. You know, that was there but that has to be…that effect has to be removed because it’s not a long-term effect. That was a spurious effect. That wasn’t what you were looking for.
Max: Sure. So…I mean, that’s an example of removing a feature or I guess that’s accounting for a feature that improves the model’s performance and accuracy. But I guess I could give you an example that I’m thinking of from a previous job I held, which was in the email marketing space. A lot of this is marketing. We were trying to forecast who would open our emails and/or who would mark our emails as spam, of course. And one of the ironies…or one of the realities of email sending is people open their emails in…
Charles: On different devices.
Max: On different devices, different email clients. So certain devices it’s easier to open emails. So one feature that we “feature engineered” was to take into account what device the emails were opened on or do some kind of look back of what…
Charles: Right. And then you start sending more email to people who have those devices because…
Max: They’re more engaged.
Charles: They’re more engaged. And is that a good or a bad thing?
Max: Depends. I mean, it depends on what the feedback loop is for your model, I guess.
Charles: Right. It’s the same when you’re a doctor and you start asking all of these…you start giving all these doctors more prescription, more free drugs so they prescribe them more. So you start giving them more and more but you may not necessarily…there’s a bias effect because there’s a feedback loop that’s a bias. You’re giving the doc…the doctors who prescribe more, you’re giving them more free. But they may not necessarily be the ones that are really prescribing more. It’s just it seems that way because, you know, the ones who would prescribe more…it’s this causal idea and it’s called counterfactual reasoning. What would have happened if you had not done this? Would the doctor still have prescribed the drug if you did not give them the free sample? And those are the ones you’re trying to target, the ones you want to prescribe, you know…if they’re gonna prescribe the drug anyway, you don’t need to give them the free sample. How do you figure that out? How do you figure out that counterfactual? That’s a huge problem as soon as these things go into production. And understanding how…to do that, that causal problem, that’s a massive…look, it’s the same if you go to a restaurant and you always have the same…
Max: I actually have a similar example that is relevant to lots of people’s lives, which is the Netflix home page where you get recommended what to view. One of the things that Netflix has a really hard time controlling for is what position did you see the show in or the movie in? So the counterfactual is if it had been in position one or position two of the carousel on the home page of Netflix…
Charles: Would you still click it? Would you still watch it?
Max: Exactly, so controlling for that stuff is hard.
Charles: It’s very hard and the problem is that…and this is well understood in economic science. There’s a difference between guys who just do logistic regression machine learning and an economist. An economist, their whole field is based around counterfactuals. How do you know what drives something? The whole field of Bayesian inference for things like medical, trying to understand how treatments affect people. I mean, this whole problem that people who have money are more likely to take the drug because they can afford it but they’re also more likely to be able to go see the doctor. So there’s this weird feedback and you have to account for that. And these systems, if you don’t do that properly, will tend to introduce a bias that will…that can really bite you later.
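The wealth example can be sketched numerically. The Python below uses made-up patient counts in which wealthier patients are both more likely to take the drug and more likely to recover anyway, then compares a naive treated-versus-untreated difference with an estimate that is computed within each wealth group and averaged. It assumes wealth is the only confounder, which is a simplification for illustration and not a general method for causal inference.

```python
# Hypothetical counts: (wealthy, treated) -> (n_patients, n_recovered).
counts = {
    (True, True): (80, 64),
    (True, False): (20, 14),
    (False, True): (20, 8),
    (False, False): (80, 24),
}

def recovery_rate(groups):
    """Pooled recovery rate over the listed (wealthy, treated) groups."""
    n = sum(counts[g][0] for g in groups)
    recovered = sum(counts[g][1] for g in groups)
    return recovered / n

# Naive comparison: everyone who took the drug vs. everyone who did not.
naive_effect = (
    recovery_rate([(True, True), (False, True)])
    - recovery_rate([(True, False), (False, False)])
)

# Adjusted comparison: measure the effect inside each wealth group,
# then average the groups weighted by their share of all patients.
total = sum(n for n, _ in counts.values())
adjusted_effect = 0.0
for wealthy in (True, False):
    stratum_n = counts[(wealthy, True)][0] + counts[(wealthy, False)][0]
    effect = recovery_rate([(wealthy, True)]) - recovery_rate([(wealthy, False)])
    adjusted_effect += effect * stratum_n / total
```

With these counts the naive difference is 0.34 while the adjusted difference is about 0.10, so most of the apparent benefit comes from who took the drug rather than from the drug itself.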
Max: Is there…I know you’ve…you have partners in your consulting and you’ve trained people before. Is there a way to accelerate people’s learning about how biases creep into data modeling? Like, are there any stories that you tell people when you catch them making this mistake? Or…do you know what I mean by that?
Charles: You know, it’s…when I work with…it’s a tough one. Let me give you an example of how you could…things you can see.
Charles: We had a client once who was running models in production. They got this idea in their head, they need to do experiments over and over and over. Just constantly do experiments. And that…not untrue, not wrong but you have to do the correct experiments. Otherwise, you just kinda burn…you’re just kinda spinning your wheels. And so one of the things that we did was, well, what would happen if we took the exact same model and we put it into production twice? And half the traffic…half of the traffic randomly went to this instance and half went to this instance. And then you measure how much money you’re making. You know, measure the return, the revenue. And you find out that they’re slightly different. Of course, there’s a finite size effect. So that’s the limit of what you can predict.
Sorry. I can’t, you know…Right? I mean, you know, there’s a…maybe there’s a two…a factor of square root two or something in there that you can kinda get a little better. But there’s a limit to what you can do if you run two…you know, there’s just randomness in the traffic that you can’t account for. And that is something that is not at all obvious. That’s a day one type of thing that you need to do is to ask…
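The experiment described here, the same model deployed twice with traffic split at random, can be simulated to see how large the noise floor is. The sketch below is a toy Python simulation under assumed numbers (2,000 visitors per instance, a 5% conversion rate, one unit of revenue per sale); none of these figures come from the interview.

```python
import math
import random

def simulate_arm(rng, visitors, conversion_rate, revenue_per_sale):
    """Revenue from one instance of the model serving a random slice of traffic."""
    sales = sum(1 for _ in range(visitors) if rng.random() < conversion_rate)
    return sales * revenue_per_sale

def aa_noise_floor(trials=200, visitors=2000, conversion_rate=0.05,
                   revenue_per_sale=1.0, seed=0):
    """Standard deviation of the revenue gap between two identical instances."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        a = simulate_arm(rng, visitors, conversion_rate, revenue_per_sale)
        b = simulate_arm(rng, visitors, conversion_rate, revenue_per_sale)
        diffs.append(a - b)
    mean = sum(diffs) / trials
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / (trials - 1))

noise = aa_noise_floor()

# Binomial theory for comparison: the gap between two independent arms has
# sqrt(2) times the standard deviation of a single arm.
single_arm_sd = math.sqrt(2000 * 0.05 * 0.95)
expected_noise = math.sqrt(2) * single_arm_sd
```

Under these assumptions each instance earns about 100 units on average, yet two identical instances routinely differ by ten or more. Any measured lift smaller than that cannot be distinguished from randomness in the traffic.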
Max: Is that a domain knowledge thing? Like…I mean, I guess, that’s not a domain knowledge thing. But are there examples where you’ve maybe been working with somebody less experienced and you were able to identify bias faster than they did just because of…
Charles: Well, you gotta look for it.
Max: Okay. What is…
Charles: I mean, look, look, look. I mean, you gotta understand…look, what is bias in a machine learning algorithm? It means that your residuals are correlated with your variables. You know, if you put a feature in and you have an error, you go back and see, you know, are the features correlated with my errors? If they are, then you’ve got a problem. I mean, you know, you’ve got a systematic error, which is the feature is inducing the error, right? It’s causing it. So you look for that kind of stuff.
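The check described here, asking whether the errors are correlated with a variable, takes only a few lines. The Python sketch below uses invented data in which revenue depends on ad spend and on a holiday flag, while the model only knows about ad spend. The variable names and numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: revenue depends on ad spend and drops on holidays.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
is_holiday = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
revenue = [2.0 * a - 3.0 * h for a, h in zip(ad_spend, is_holiday)]

# A model that gets the ad-spend relationship right but ignores holidays.
predicted = [2.0 * a for a in ad_spend]
residuals = [y - p for y, p in zip(revenue, predicted)]

corr_with_spend = pearson(residuals, ad_spend)      # no systematic error here
corr_with_holiday = pearson(residuals, is_holiday)  # errors track the holiday flag
```

In this toy case the residuals are uncorrelated with ad spend but perfectly (negatively) correlated with the holiday flag, which is the signature of a systematic error tied to that variable.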
Max: Originally we were talking about how to fund or finance science projects. And I know we were talking really briefly about licensing and how kinda…previously, for example, BlackRock, they had no interest in licensing. So if not licensing, how…
Charles: Well, they didn’t want to do it. I mean, they had to do it. They just didn’t want to…you know? If they could figure out a way not to do it, they wouldn’t do it. Just like every company. If you can figure out a way to use something…to not pay for something you would.
Max: For a long period of time, I guess, BlackRock did use MATLAB licensing.
Charles: I’m sure they still do.
Just because they want to do something doesn’t mean they can get away…you know? It’s…you know, I think, you know, there’s a huge rise of open source and companies that are very effective, they’re gonna contribute to the open source. But it’s…it’s just something you have to use because it…you know, it’s just so much better. You know, but MATLAB’s a great product. Don’t get me wrong. They still…it’s still…it’s the premier product that’s used in finance for financial modeling. Is Python taking over? Well, we’re starting to see now like, companies like Bloomberg, for example, are beginning to offer IPython notebooks, you know, that will run on their internal systems, that kind of stuff.
Max: Even incorporated in news articles. I’ve seen news articles (maybe it was in Nature magazine) where they’ll publish online, a reproducible study or analysis in a Jupyter notebook or IPython notebook.
Charles: Well, that’s an interesting point of view. You know, you talk to vendors and they…there’s this notion that somehow, maybe this comes from the MBA programs, that you can just give a PowerPoint presentation and an executive summary and that’s enough to understand what’s going on. You say wait but where’s your GitHub repo? Well, what do you mean? Well, you claim that you made this analysis. We had this…a client that worked for a hedge fund and we wanted to look at…the client was trying to sell us some data. So well, show me the GitHub repo where you did the analysis. Well, we would never show that to you. I go, why not? You know, why would you not have that?
And then, I think, you know, to a scientist, of course you have to be able to show your work. And I think that was just…the culture’s changing. You know, as more and more scientists enter into these businesses and somebody tries to convince them they’ve done something, the standard is going to be much higher. The bar is raised.
Max: Can you reproduce it? Can you reproduce the results?
Charles: Yeah. I need to be able to reproduce this. Otherwise, I don’t believe it. I need to know, I need to know every single detail of what happened to really be able to say that I know what this is. And that’s a very, very different way of thinking than saying, well, I have, like, an executive summary. Or I have an API. I don’t need to know how the API works. Completely different way of thinking.
Max: This has been fricking awesome.
Charles: Oh, you like? This is good for you? Okay.
Charles: All right. I’m glad.
Max: Thank you for joining us. Is there any kind of shout out we can make to Calculation Consulting, your guys’ website?
Charles: Website’s always good, calculationconsulting.com. You know, we’re always happy to help people.
Max: We’ll see you again soon. Thanks, Charles.