Chang She started his career at hedge fund AQR Capital Management before leaving finance to co-found DataPad with his college classmate, Wes McKinney. DataPad was acquired by Cloudera in 2014 and now Chang manages a team within Cloudera.
We discuss how contributing to open source software has impacted Chang’s career, and the 3 criteria that Chang recommends you evaluate before accepting a job offer.
Full text transcript below the fold.
Max: Welcome, all. Max, of The Accidental Engineer.
Today, we are joined by Chang She. Chang, do you mind a brief introduction about what it is you do now?
Chang: Yeah, absolutely. I work for Cloudera now, and Cloudera’s a big data provider and a machine learning analytics platform company. I work on the product inside Cloudera called Cloudera Navigator. If you think of big data as your data lake, Cloudera Navigator is your life raft when you’re drowning in too much data.
Max: For our audience that doesn’t know this about you, I want to make people aware that you’ve contributed heavily to Pandas, the very popular Python library for conducting data analysis. Taking the idea of a data frame as invented (I don’t know if that’s true), as invented by R, I guess, and porting that to Python, where it is tremendously popular and in use and you yourself use it on a day-to-day basis.
Chang: Yeah, absolutely. As you I’m sure all know, Pandas was created by Wes McKinney. He and I used to work together at a company called AQR Capital Management. When we were both junior analysts there, our jobs consisted of data loading and cleaning data, data preparation, a lot of data engineering tasks. The tools at the time were incredibly inefficient.
Wes went down this path of, “Okay, let’s create tools to make that easy to do.” He landed on Python. He landed on, “Oh, there’s this thing called NumPy, and it makes matrix and array manipulations very easy.” But then when you actually hit real world data, there’s this whole class of problems that NumPy isn’t very well suited to solving, and that’s where the motivation for Pandas came from.
I was a very early user. This was before Pandas was even open source. Wes showed me his first version–I loved it, and so I was pushing Pandas throughout my group at AQR.
Max: Where NumPy only deals with homogenous data types–where you have a matrix where your rows and columns and every cell in it is of a single data type–Pandas abstracts and deals with columns only.
That’s one way to paraphrase the benefit of Pandas versus NumPy, is that you can work with heterogeneous data—you might have a date column or a date index, while you might also have a float, integer, string type columns.
Am I getting this close to right?
Chang: It’s one of the main differences for sure. I think NumPy excels essentially where MATLAB excels, where if you have nice, clean, orderly matrices, and all you needed to do was linear algebra, that was a great tool. That’s why it became popular.
Now as data science became popular, in the real world, data sets come from a lot of different sources. They’re often not sorted or not aligned. They’re missing data, or have bad data.
If you have time series data then you have to do regularization or down-sampling, up-sampling, all sorts of these data engineering manipulations that we’ve built into Pandas to make that easy.
Max: What were the core problems that you guys were finding the existing tools to be inadequate for solving at AQR?
I realize that they’re a pretty private company, but perhaps we can dance around the subject and talk a little bit about the types of problem you guys built Pandas for.
Chang: Yeah, sure. AQR is a large, multi-strategy hedge fund slash asset management company. So it has two arms. There’s an equity investment arm and there’s a global macro investment arm, where it’s invested in fixed income assets and other things.
The time when Pandas was first created and when I worked there was something like 2007 to 2009, 2010ish.
At AQR all of that workflow was done in a hodgepodge of Excel, VBScript, a little bit of Java and some C++.
All of that used the database as the integration point where they all came together. Not only did you have these data munging problems where none of those languages or ecosystems had a good way to do data cleaning and do vectorized operations on data. That’s just one problem.
The other problem is that as you go through your data pipeline you have to jump from language to language to language, and none of that is automated or easy to do. You always have to go back to the database. So the database became the bottleneck in a lot of scenarios.
Pandas solved that problem by, one, providing all the functionality to increase productivity. Tasks that took us a whole day or several hours to do, was maybe five lines of code now or 10 lines of code. Because Python is so versatile, you could unify maybe 80% of that data pipeline all into Python, so that everything can happen seamlessly in one workload. Now you don’t have to go from manage three different languages and a database, and you’re fighting for resources with other people who are doing the same thing.
Max: When you guys were developing Pandas in this context of finance and working with financial data, it sounds like there might have been an explicit focus on time series data and data indexed on date time.
At the time, did you and Wes really realize how much more generally applicable Pandas might be for data sets outside of time series? Or was that just a surprise?
Chang: I think Wes left AQR about a year before I did to go to Duke. I think when we were working in AQR, there wasn’t as much realization of how applicable it was outside of finance, because every industry is a little bit of a bubble in and of itself. But I think, Wes left AQR, went to Duke to go to grad school. I key first made that realization, “Hey, it’s not just useful for finance.”
I also left finance. He convinced me to leave in 2012. He and I worked on Pandas together for a whole year. This was something like Pandas .6 or something like that. That was the year when we saw, one, a big uptick in Pandas users, and two, that was when Pandas really took off outside of finance.
In the beginning, it was still very heavily financial focused. People were coming in, say, “Hey, I’m from this hedge fund or that bank, and I need to do this. Can you make a new feature on GitHub?” As time went on, it quickly became clear, “Oh, hey,” other data scientists came in and said, “Hey I needed this feature,” or made bug reports. More than, not data scientists, but just plain scientists started using this too. They found it immensely helpful for them as well.
Max: I have to credit you guys for making Pandas, because I personally have used it on the job. Not so frequently as of late, but to give our audience another concrete example outside of academia and finance where it is applied.
I held a job in the email marketing/advertising space, where we dealt a ton with the choke point of our database, and getting data out of the database and being able to be manipulated and played with like you’re describing.
We were tracking user engagement event data, which I guess is also inherently time series, but I’m fishing here to try and come up with some other types of data analysis outside of financial data.
Chang: For sure. I’m glad to hear that. Most of the credit belongs to Wes, who came up with the project. Also I think now it’s being maintained by Jeff Rebach.
I’m just grateful that I got to contribute to this project and make life easier for data analysts.
There’s a couple of questions I want to run by you about career advice just generally, because I find your career story so far to be really fascinating. Do you might sharing about how from college to post-college what were you up to? How did you get into programming?
Chang: Sure. I got into programming pretty early in high school, I think. Or actually even beyond. I remember my first foray into programming was in fifth grade. We had these old computers that had monochrome screen, and the keyboard was part of the monitor and the computer. We learned BASIC and we programmed kind of like a question and answer program that told a story.
I actually wasn’t born in the States. I was both in China. I moved to the US when I was 10. So in fifth grade, I was not quite fluent in English yet. I could read better than I could read or understand in conversation, so programming the computer in BASIC was kind of an escape for me.
That was my first contact with programming. I really liked it. I think the interest continued through into high school, where I took a bunch of AP computer science classes, and some dabbled in … I think back then it was when Java web applets were supposed to be the future of the Internet.
Max: And then what happened, man? I remember that year too.
Chang: Yeah. In college, I majored in computer science, but I was a bit of a slacker in that I didn’t actually do any real internships in school, and I kind of stumbled on finance my senior year.
Max: What year was this? Do you mind sharing that? What was the job market like then? I realize you went to MIT, so for in contrast to myself and a lot of our audience who didn’t go to MIT, maybe the recruiting process is different at MIT versus elsewhere. But was the job market robust? How did you end up finding finance?
Chang: Yeah, absolutely. I went to undergrad from 2001 to 2005 at MIT. The job market was very robust then. This was pre-2008, especially finance during those last few years, like 2005 and 2006 was incredibly hot. So the job market was very good, especially if you were privileged enough to go to a good school.
At MIT, the top firms would come to the career fairs and they would recruit. The top firms had a pretty rigorous interview process, so it wasn’t that you could just show them your degree and get a job. But still, I think compared to most other people that I’ve talked to, the MIT name carries a lot of weight for an employer. The job market was overall very easy during that time, and I think for top schools like MIT or Harvard down the street, it was even easier, I would say.
Max: One of the two questions that I mentioned really wanting to ask you was, in conversation outside of this podcast, we’ve talked about how there’s three criteria that you have for evaluating a job, whether it’s suitable for you. Do you mind sharing for our audience what those criteria are and what your rationale is?
Chang: Yeah, absolutely. I always look at jobs along three dimensions: what I call business, the problem and the people. One is, at the end of the day, our identities are not attached to our jobs, right? At the end of the day, you have to pay bills. You have to make a living. So you have to make sure that the business is viable. It has to start from the top. You have to believe that the market is in a good place for the industry. If the industry is dying, even if this is the best company in that industry, then it might not be a good place for you. The company has to be in good shape and it has to be headed in the right direction, and the product that you’re working on, or the team that you’re joining has to be working on something that contributes to the bottom line for that company, right? All of the things have to happen to say, “Okay, this is where I can have a stable financial future.”
Second dimension is the problem. We’re not machines. If you’re a software engineer, even if you have temporary trouble finding a job now, you have to realize if you look at the world overall, or even if you just look at the US, we’re in a great position in terms of the marketability of our skills. We get a choice of what problems to work on. You have to look at this job as are you interested in working on the types of problem they’re giving you? Do you like enterprise versus consumer, or do you like social media applications versus doing deep data analysis, or is it machine learning? You have to pick the right technology stack and technology layer of abstraction for you.
The third thing is the people, which is do you like the people around you?
Chang: Honestly, I feel like that’s probably the one thing that people neglect a lot, especially as they get more experience. I feel like as you get out of, if you’re a fresh grad and when you get out of college, this podcast is called The Accidental Engineer, so this is like they’re accidentally making good choices in that how much they like the team, the people that they’re working with, and how much they like that daily interaction plays a big part. I find that as people get more experience, for some reason, they place less importance on that. I always have to ask why. If you see that toxic person, or if you see warning signs about the toxic culture in a team or in a company overall, stop yourself. Don’t say, “Oh, I need to make a living,” or like, “I don’t really care. This is just a 9:00 to 5:00 for me.” If your daily life is miserable, it doesn’t matter whether you’re making enough money or you’re working on an interesting problem.
Max: Yeah, this almost sounds like a type of hierarchy of needs, Maslow’s Hierarchy of Needs, where maybe people who wouldn’t otherwise be toxic might be toxic because the business is not fundamentally sound.
Max: So they have stresses about their existential welfare that they might be projecting onto their teammates maybe.
Chang: Exactly. It doesn’t even need to be if the company is in dire situation. If you take a look at, if you look at Cloudera for example, Cloudera went IPO in the last year, and the stock price has seen a lot volatility, as all fresh IPO companies do. But that doesn’t stop people from going through an emotional rollercoaster as the stock price goes through a financial rollercoaster. For this Cloudera case, it’s easy to say, “Look at the long term. Everything else about the company is good. Six months from now, all this volatility is going to pass.” But if the company is not in a good place, and you don’t see a way out, then it’s much harder to get out of a toxic mindset.
Max: Before I ask you the second of the two questions, I think it would be a good opportunity to plug that you guys are hiring at Cloudera.
Max: Especially for Chang’s team. If you like, person-wise, Chang, you should totally apply for a role there.
Max: What types of roles are you guys hiring for, or skillsets that you’re looking for?
Chang: We’re hiring a pretty experienced backend engineer who’s familiar with large scale, complex performance enterprise applications. Our mission is, like I said, being the life raft for data analysts and data stewards who are working with the data lake.
Max: So for our audience members who might be a few years out from having the skillset that might get that job today at Cloudera, what are some things you recommend they learn or research to develop their skillsets to be a good candidate?
Chang: That’s a good question. I would say one is along the technical track. The technical skills you need to succeed in this job are things like, you need to learn how concurrency work. You need to learn how to create abstractions. I think we talked about this before in that one of the … I wish I had majored in math like you did, which is it really helps you think about abstracting from a few concrete examples, and to describe the world using code in as concise and terse terms as possible.
What I find is, especially with complex enterprise applications, if you’re not good at abstracting and creating software abstractions, you have to write a lot of code to do a small amount of things. Whereas if you learn that skill, it becomes the opposite, and you get a lot of leverage, and your productivity goes up many times.
You need to know good judgment in terms of trade offs between the performance of the code and how maintainable it is, and how easy it is to read and reuse that code. I think some of those things will just come with practice, and that no amount of reading will automatically make you an expert in those things.
Max: Abstractions are a big deal?
Max: Are there specific programming languages or libraries like Pandas that are good entry points, or guidance about what’s a good level of abstraction? I mean today, a very contentious topic, which I’d love to get you on the record about, is maybe for an introductory computer science course, what language should they be using to teach the course?
Chang: Well … I think you’ve waylaid me with this question. I love Python, but I guess my first love will always be Scheme, because that’s what I learned my introductory computer science. I think my first course in computer science at MIT was SICP, Structure and Interpretation of Computer Programs. That was taught using Scheme. I loved learning scheme.
To me, that first course, it just blew my mind. Every lesson was like … We wrote streams in Scheme. We learned about recursion. That was the first time I learned about recursion. We wrote a Scheme interpreter in Scheme.
I would say the top three fun experiences I’ve had learning computer science.
Max: That’s awesome. We’ll obviously include a link in the show notes to the book, which I think was written by an MIT professor.
Max: Yeah. It’s freely available online. You can view it in HTML. You don’t even need to download a PDF or anything.
The second question, coming around back to the second question I really wanted to ask you about was, when it comes to contributing to open source, there’s pretty steep learning curve and entry cost. If you really want to know how to effectively contribute to open source, you go learn version controlling tools like Git and GitHub maybe.
Chang: That’s right.
Max: You got to learn how to communicate efficiently online.
Max: One of the topics that is a recurring topic on our podcasts is on the subject of making people more employable. Have you found that your tremendous open source contributions, which I’ll call them tremendous, you might be more modest. Have you found that to be a very positive propellant for your career when it comes to skills it developed for you or by contributing developed skills, or made you better known on the job market?
Chang: I would think so, yeah. I would say my contributions were modest. I would say the benefit for my career has been mostly in terms of network referrals, network effects. For me, it’s more about the people that I know that recognize my skills, and they refer me to interesting opportunities that come up. I think that’s been a tremendous propellant for me, for my personal career.
Max: Awesome. I want to encourage, speaking directly to our audience, that they should give it a shot if you haven’t already contributed to open source. I think people rarely are reminded or told how low the bar can be for contributions, like documentation.
This comes up occasionally, but it deserves hammering home, that on a project like Pandas, there might be certain aspects of it that are very intimidating to newcomers to open source, like the C code and C++ code, or even some of the Python might be more involved.
But even as simple as modifying documentation that’s not exactly clear to you, might help others and will likely get merged if it truly clarifies things.
Chang: Right, right. Yeah, absolutely. I think I have two comments to make here. One is for the audience and the users of open source software. If you want to participate in the open source community, or even just use open source software, you have to learn Git, you have to learn GitHub. You have to learn basics of Python or whatever library you are using.
Beyond that, don’t be intimidated, because even if it’s just going, if you found a problem while using it, and you’re going on GitHub to create an issue, that alone is already contributing to the project. Don’t think of the bar as, “Oh, I have to write performant C code in order to be considered a contributor.” Even if it’s just opening an issue, that’s perfect already.
In Pandas we’ve tried to make it easy, and make the contribution ramp a little bit less steep in that we mark issues as good for beginners, not good for beginners, or difficulty level and priority and things like that. So you can use the tags on GitHub, if you’re a beginner, just to kind of filter out what issues are good for you, and just get your feet wet. We have a lot of tutorials out there for how to contribute to Pandas, not just use Pandas.
Obviously it’s not perfect. We can always do better, but at least we’ve tried to do that and make it easier. I think that’s to my second comment, which is to open source maintainers, is if you want a bigger community around your project, you have to make it easier for different levels of people to have different levels of participation.
Max: For sure. I want to also add: if you aren’t contributing to the Git repo (like submitting pull requests and getting them merged or commenting on issues) people can contribute to open source by using the library and talking about it in other fora like Stack Overflow or your personal blog. Those two alone can contribute a lot to the community.
I remember early on in Pandas’ genesis there was a lot of conversations that wouldn’t happen in GitHub issues, but they would be discussed in Stack Overflow.
I think it’s helpful to remind people that by using open source software and talking about it, even if it’s not on GitHub, you’re awesome, and keep it up.
Chang: Yeah, for sure.
I remember the year that I worked a lot on Pandas, once a day I would log onto Stack Overflow and look at unanswered questions, and try to answer them.
I would go through the top voted issues or top 10 lists and say, “Okay, are these captured in GitHub issues already?” If not, let me create a new issue. Even just by asking a question, like you said, that’s already contributing.
Max: I’m sure we’ll have more things to talk about, perhaps not in this conversation, but I hope to have you on again soon!
Chang: Yeah, happy to.