EPISODE 1698

[INTRO]

[0:00:00] ANNOUNCER: Databases underpin almost every user experience on the web. But scaling a database is one of the most fundamental infrastructure challenges in software development. PlanetScale offers a MySQL platform that is managed and highly scalable. Sam Lambert is the CEO of PlanetScale, and he joins the show to talk about why he started the platform, scaling databases, using Vitess for MySQL shard orchestration, and more.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[EPISODE]

[0:01:14] LA: Sam, welcome to Software Engineering Daily.

[0:01:15] SL: Thank you. It's great to be here.

[0:01:18] LA: Let's start with some basics. So, why PlanetScale? What does PlanetScale provide that is unique and different in the industry?

[0:01:27] SL: We primarily focus on very high-scale databases that need to be online 24/7, never lose data, and be incredibly performant. So, a lot of our work is done with large direct-to-consumer brands that need a lot of data to power consumer experiences and need to trust us to be those things: reliable, scalable, and available. It really comes down to the criticality of the data and the queries that we serve.

If you go on any website that you use, and you look around and imagine it without personalization - right now, I'm looking at the podcast recording software that we're using. It has the name of the podcast, who we are, our usernames, everything. All of that comes out of a database. Same with any other experience. Your Facebook page is entirely your Facebook page. It has your friend count, your feed, everything. All of those experiences are powered by databases. If the database is slow, the experience is slow. If the database is down, the experience is not there. So, we focus really hard on building something that is very dependable and performant 24/7 for large companies that require that to power their products.

[0:02:50] LA: I want to talk more about the scalability and the availability aspects as we go through. That's fantastic. But why did you start this company? What brought you to the decision that this is the right thing to do?

[0:03:02] SL: PlanetScale was born out of a couple of things happening, one being Vitess being created at YouTube. Vitess is our core technology, the kind of engine inside PlanetScale. It was developed at YouTube a decade ago to scale their main database, their main application. It ran across 70,000 servers in 20 data centers and presented itself as just one database to the application. It served billions of active users, like three billion monthly active users, something crazy like that, at YouTube, the second biggest search engine in the world. All the metadata, comments, everything was stored in Vitess. So, the founders of PlanetScale took that technology and went on a path to commercialize it. PlanetScale is the cloud-hosted platform, powered by Vitess, so that plenty of other large companies can leverage the scalable technology built at YouTube.
It's rare to find open-source projects that come out of large companies that have seen immense amounts of production use and have immense amounts of contributions from large-scale companies. Vitess is used at Shopify, Slack, HubSpot, GitHub, and many, many more, and all of them have contributed to Vitess, making it better and better. It's an incredibly good open-source project that does the things it's supposed to do, and we build a cloud product on top of that.

[0:04:38] LA: So, just to make sure everyone's on the same page, you are based off of the open-source database Vitess. Can you tell us a little bit about Vitess and what it does? It's the open-source back end for what you're doing. But where did it come from? How did it get started? Who supports it?

[0:04:55] SL: So, Vitess uses MySQL under the hood. Vitess is sort of an orchestration layer, a proxy and cluster manager for MySQL. It allows you to shard your MySQL database among thousands or tens of thousands of servers, enabling you to store more data than you could possibly store on a single server or single cluster of servers, allowing you to have more availability than you would with a single cluster, and more compute and memory behind a large database. Your workload ends up being spread across many machines. The way we enable that is sharding, which is really one of the only proven patterns for doing this at scale. Pretty much every large system on the web has a sharded database of some form behind it. So, we make that much easier than doing that kind of work yourself, and we do that by building on top of a very reliable database itself, which is MySQL.

[0:05:59] LA: Actually, this is something that I was not aware of. Vitess isn't a replacement for MySQL. It actually runs on top of MySQL. It provides the shard orchestration on top of a standard MySQL distribution.

[0:06:12] SL: That's right. You essentially talk to Vitess, and your application assumes that it's a MySQL database. But there are layers of abstraction between you and that MySQL database. It's essentially mostly compatible. There are some things that don't work that work in standard MySQL. Some things we've taken away on purpose because we don't believe they're scalable, or we just don't feel they're worth supporting. And there are things that we'll just work on in the future.

[0:06:44] LA: I imagine there are some, for instance, transactional capabilities you have to remove in order to work across sharded databases, things like that.

[0:06:54] SL: Correct. There's a form of cross-shard transactions that we have. But it's kind of slow. It's always going to be slow if you're reaching across the network, to lots of computers at once, to be able to commit a transaction. It's something we talk about as being potentially necessary for the future. However, it hasn't really limited many applications using Vitess at scale and localizing their queries to single shards. It just takes a bit more application work.
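Localizing queries to a single shard mostly means making sure the shard key shows up in the query. Below is a minimal sketch, assuming a hypothetical `orders` table sharded on `customer_id` and a standard MySQL client library; the host, credentials, and schema are placeholders, not PlanetScale specifics.

```python
import pymysql

# To the application, Vitess/PlanetScale presents itself as a single MySQL
# endpoint; these connection details are placeholders.
conn = pymysql.connect(
    host="db.example.internal",
    user="app_user",
    password="app_password",
    database="commerce",
    ssl={"ca": "/etc/ssl/certs/ca-certificates.crt"},
)

with conn.cursor() as cur:
    # The shard key (customer_id) is in the WHERE clause, so the query layer
    # can route this to exactly one shard.
    cur.execute(
        "SELECT id, total FROM orders WHERE customer_id = %s AND created_at > %s",
        (42, "2024-01-01"),
    )
    one_customer = cur.fetchall()

    # No shard key: the query layer has to scatter-gather across every shard
    # and merge the results, which is what you want to keep off hot paths.
    cur.execute("SELECT COUNT(*) FROM orders WHERE status = %s", ("refunded",))
    all_customers = cur.fetchone()

conn.close()
```

The "bit more application work" Sam mentions is largely this: carrying the tenant or user ID through the code paths so queries stay single-shard.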
[0:07:22] LA: So, besides transactions, what else do you limit within MySQL? What are people not able to do in MySQL in order to get the scalability?

[0:07:31] SL: Triggers. Triggers are not supported because of our online schema change workflow. We do a lot of platform work on top of Vitess, so we allow you to create database branches, and you use a branch as the source of a schema change, and we will push a fully online schema change into production for you, like you're deploying an application, and triggers would break that workflow. They're also kind of a bad scaling pattern as well, because again, they're not going to work cross-shard.

Then there are arcane, old statements from early versions of MySQL. We don't tend to support those. We don't have window functions and things like that. So, there are some things that just don't work well with sharding, or that are very difficult to implement without a lot of round-tripping to MySQL itself, which would make them very slow. But we've shipped foreign keys, which was hard. We didn't support that for the longest time. So, we always had foreign keys; we didn't have foreign key constraints. Those were difficult. We eventually worked on shipping those -

[0:08:33] LA: You mean, foreign key constraints that work as part of your sharding key, across shards?

[0:08:37] SL: So, we're finishing up cross-shard right now. This would be single shard. But we never had it in Vitess because the query layer was mostly built for sharded applications. It wasn't there because we wanted to shard, essentially, so we just hadn't done the work. But that's getting done now, and that helps remove a large compatibility blocker. That said, no one really running at scale uses foreign keys.

[0:09:04] LA: Foreign key - key constraints.

[0:09:05] SL: Yes, foreign key constraints. Pardon me. I always leave the constraints bit off at the end. Yes. So, it's also not been a major blocker for large-scale customers coming on.

[0:09:17] LA: So, how is the primary mechanism for sharding accomplished? Does the customer pick a particular key and you shard against that key? Or do you have other attributes you can shard against? How do you determine how you shard and what you shard?

[0:09:34] SL: So, you really have to pick that natural key, right? That would be, like, user ID or organization. Most kind of CRUD apps have a natural kind of tenant key that they can use to shard. With us, you apply something called a VSchema that defines how you want to shard, and you can do a lot of more complex things to actually locate data in the right ways. Then a lookup table is built to help us understand where within those shards the data would be, and then we route and orchestrate the queries. Often, we'll rewrite the query to go and find the data necessary, so that you're not scatter-gathering across shards too much. Then we'll aggregate the data and return it to you. So, your application doesn't need to know about many, many machines. It just gets a single connection string and can be scaled across there. If you're lucky - I mean, it never really turns out this way in practice - but if your application is really simple and you want to migrate at scale, you may just be able to apply the right VSchema and go and shard it without the app ever knowing that there are multiple database hosts under the hood.

[0:10:45] LA: It handles that all transparently for you. That's great. That's the key for really good sharding mechanisms. But it does have one downside if your application is not involved, and that has to do with balancing. How is balancing handled in Vitess? Is it all done by manual, non-application means, or automated within Vitess?

[0:11:07] SL: It's handled by Vitess, and we can reshard as well to rebalance, and do that as an online operation. You can basically reshard your database to a different type of key if you really needed to, as an online operation. So, you could do that if you are severely unbalanced because your key turned out to be wrong. However, the migration of data and rebalancing is all handled by Vitess.
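A minimal sketch of the kind of VSchema Sam describes, assuming a hypothetical `orders` table sharded by a hash of `customer_id`, plus a lookup vindex so rows can also be found by `order_id` without hitting every shard. The table and column names are illustrative, and in practice the VSchema is applied through Vitess or PlanetScale tooling rather than hand-assembled like this.

```python
import json

# Illustrative VSchema for one sharded keyspace. The "hash" vindex maps
# customer_id to a shard; the lookup vindex maintains a separate table mapping
# order_id back to the owning shard, i.e. the "lookup table" mentioned above.
vschema = {
    "sharded": True,
    "vindexes": {
        "hash": {"type": "hash"},
        "order_id_lookup": {
            "type": "consistent_lookup_unique",
            "params": {
                "table": "order_id_lookup",
                "from": "order_id",
                "to": "keyspace_id",
            },
            "owner": "orders",
        },
    },
    "tables": {
        "orders": {
            "column_vindexes": [
                {"column": "customer_id", "name": "hash"},
                {"column": "order_id", "name": "order_id_lookup"},
            ]
        }
    },
}

print(json.dumps(vschema, indent=2))
```

The lookup table itself is just another table that Vitess keeps up to date, so a query addressed only by `order_id` can still be routed to a single shard instead of scatter-gathering.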
So, as you can imagine, there's an immense amount of orchestration and checking, and that's something that's very unique about PlanetScale, is that Vitess has been around. Right now, it's serving tens, maybe hundreds of millions of queries per second around the world. Yes, definitely in the hundreds of millions, across many, many applications. Every Slack message in the world goes through Vitess. Slack are very open about their use of Vitess. Because of that, all of the critical paths are incredibly well worn, meaning that there's a lot of safety, a lot of checking, and it's built very robustly. That's something that's very unique for us, because we make using the database incredibly simple and do a lot of very smart things under the hood, and that is very, very beneficial to our customers.

[0:12:20] LA: One of the things that makes sharding challenging is the one-off case. If you consider, for instance, a simple case where you're sharding on a user ID, let's say. In general, your user ID may in fact be a good shard key. But there's this one customer that's 100 times larger than every other customer, or a small subset of customers that don't fit very well into the standard sharding scheme. Can you pick and choose which shard specific items land on, and not just go by an algorithm? Or is it all a lookup process?

[0:12:55] SL: Yes, you can pin and isolate things. The famous one at Instagram, though, is the Justin Bieber shard, where he basically had his own complete stack to run his Instagram account because of how large his follower base was. Yes, you can actually pin and shape your workloads. It's another really very flexible thing. Noisy neighbors are a problem in a single database cluster. Just one thing going wrong can drag the entire server down.

Even at GitHub, we had this issue. In the early days, people would, like they do now, star repos. I remember very clearly that the largest, most-starred repo on GitHub was Bootstrap at the time, with like 60,000 stars, and then the second was probably Rails with 10,000 stars. People would do this a lot, and they still build these types of tools, but they use different methods to do it now: people would do a lot of scraping to go and build star aggregation pages, or just to get metrics. Whenever they would go back through the stars pages for Bootstrap, for example, it would be pulling old, pretty cold data out of the database, filling the buffer pool with pretty much junk data and evicting user data, or issues data, or things that would normally be nicely cached by the database and that the app expects to be fast. Pulling that data out and filling the buffer pool would slow the database down. Suddenly, every single user is having a problem because of a silly script-kiddie use case that was hitting the platform. In a sharded world, you reduce the failure domains to be a lot smaller. Something may cause a problem within a shard, but that's then just a subset of your users feeling the pain rather than everybody.

[0:14:59] LA: So, you said Vitess is very, very popular and very well supported. It's actually part of the CNCF now. Is that correct?

[0:15:07] SL: Yes.

[0:15:08] LA: So, it's got a lot of support behind it.
What role does PlanetScale have in that process of maintaining Vitess?

[0:15:15] SL: So, we maintain Vitess. The Vitess team are very diligent in making Vitess a well-managed, open-source project. Like I mentioned, the companies I named earlier on this call are part of the community, and they contribute. We have monthly and weekly meetings with the community to discuss the shape of the project. It's important to make it very well run, right? We can't just push database software out there and not have it be extremely high quality. The impact of that would be absolutely massive. We obviously have to be very careful when we're doing things, and also build things that the community needs. We maintain the project well, and that's not a small feat, but it pays dividends. So, it's something we're very diligent with.

[0:16:03] LA: How does PlanetScale compare to other large, scalable MySQL databases, such as, for instance, AWS Aurora? How does it compare to those?

[0:16:14] SL: It outscales Aurora to a high degree. A lot of our customers come from Aurora once they've run out of runway with its architecture. Aurora has a lot of benefits with its architecture. But those kind of become downsides and a problem at very large scale. Surprisingly, not that high a scale, but certainly at a scale. We start to see customers have issues with Aurora, and we tend to bring them over and have them run quite successfully on PlanetScale.

[0:16:47] LA: So, what's an example of the type of problem you run into that causes the limits of Aurora to be hit, but that will not hit those limits on PlanetScale?

[0:16:57] SL: Storage limits. You can store tens, hundreds of petabytes in PlanetScale. The way we scale horizontally means it's just more machines. It scales very, very well. We keep the coordination overhead very low, allowing you to keep just adding more servers horizontally and growing. We see people hit just straight-up reliability issues. We see a lot of folks that have used Aurora for bursty workloads, or workloads that are just constantly hot. I've seen a number of outages due to just a lack of reliability of the platform. Which, honestly, when you've got a D2C brand and you're launching a game - we have a lot of games customers - if you're launching a game, the first couple of days of that game launch are incredibly critical for monetizing, for just building brand, getting hype. If you have an outage during those times, you very quickly start looking for a solution that's not going to do that, and that's kind of why people start coming to talk to us.

[0:18:04] LA: Your bread and butter are the B2C customers, right?

[0:18:06] SL: Yes. We certainly have a bunch of real big enterprise customers. There are a lot of B2B customers that we have, certainly, that are very critical. But if you were to find patterns in our customer base, a lot of the large ones are D2C.

[0:18:21] LA: Why do you think that is? Why do you think B2Cs are more interested in what you're doing than B2B, in general?

[0:18:28] SL: I don't think it's about interest. I just think it comes down to how critical the uptime of their products is. Touch wood, our reliability has been incredibly strong across the years, and our status page will attest to that. If you have a very large outage of a large D2C platform, it makes the press, it makes the news, and it tends to lose the company that has that outage a significant amount of money. Again, it happens with game launches.
Certain companies make hundreds of millions of dollars in their first weekend of launching a game. If you're down for most of that weekend, every blog, every YouTuber, everyone in the gaming space will just talk about how badly you failed to launch a game. All of those pre-sale packs with the extra DLC that you sold for twice the normal price of the game - those people are now annoyed because they can't get into the game early, in the beta they were promised or whatever. It just becomes a really big mess.

Whenever you're mindlessly scrolling Instagram, and you go and click to buy something, they have your attention for a very short period of time before you go and do the next thing. If anything fails on an e-commerce website, if a Shopify checkout fails, you've probably lost that person for good. If you can't do it, if it just doesn't work, you just get annoyed. You go elsewhere. You keep scrolling. That sale is just lost, and you just can't really afford it.

In B2B, obviously, you shouldn't let down your customers. You never should. But for example, if our CRM goes down for two hours, I'm just not going to move our business off of that CRM. I'll be really annoyed. But the work to do it, for the upside of, "Wow, we were without a CRM for two hours every now and then," it's just not worth moving. D2C is much more critical. And then we have B2B brands who have D2C customers. Then again, it's back to being equally as important that we're up and running. It's still very, very important. But we see real fit, and technical fit, with gaming companies and D2C brands.

[0:20:44] LA: So, let's give some specifics of what the real killer apps are for PlanetScale. You mentioned online games. That's an obvious example. What are some other killer apps that can take advantage of PlanetScale, or that have taken advantage of PlanetScale?

[0:20:58] SL: You're not going to like my answer, because it's not going to be very specific. But really, it's anything that has large amounts of data and a high need for uptime. Any mission-critical application can use a database; they can connect to a database and store data. We have telemetry use cases, where we see people storing telemetry data. If you have a serious company, where letting your users down is really bad for you, then PlanetScale is probably the right database, because we do everything, we make all of the tradeoffs, in favor of those bad things never -

[0:21:30] LA: Sorry. Are the use cases more - I'm talking about the sweet spot here, not what can use you, but what the sweet spot is. Is the sweet spot more around large quantities of data and sharding that? Or is it more about large quantities of compute needed to perform the queries and the scaling of that? Which is more likely to be the issue for these large applications?

[0:21:59] SL: I would say, more often, it's large amounts of data. However, we have seen some very - well, they're still large, but not as large as the scale of some of our larger customers. We've still seen some moderately sized databases do absolutely insane amounts of queries per second. Then, that's more of a compute workload. Then, sharding is equally beneficial for you, right? It might not be that you've got an obscene amount of data. But if you're doing the same action over and over and over again, at high volumes, then the compute is just as needed.
If you're sharding for storage, you're likely never going to be too compute-bound, because you're probably going to have more spare compute around via sharding to keep up with your compute needs. But that changes. The other good thing is you can change the hardware you have deployed within shards, so you can scale up really easily as well, and that helps a lot. But the need for vertical scaling is not as important, because you're parallelizing a lot of what you do across shards. No matter how high you can vertically scale a box, you end up bound by in-machine contention, in the operating system or in the database server, that will slow you down, and there's less of that in a distributed system.

[0:23:20] LA: What about non-traditional application use cases? What I mean by that is, you know that, for instance, in an e-commerce store, the vast majority of the database traffic is selects, and a few inserts, but it's mostly selects. You're looking at the database, you're displaying pages of information that come from database queries. It's very much a select-driven model. That works very well with sharding. Well, an analytics company - for instance, I worked for New Relic for many years, and they were all based on insert queries. Huge quantities of data inbound, some selects. But in the grand scheme of things, the amount of access to the data through selects was significantly lower than the insert traffic. So, we were very, very insert-bound, and that created some challenges for how we sharded and what shards did and things like that. How about those sorts of use cases? Do they still fit well within the PlanetScale or the Vitess model? Or are there other unique challenges that occur in those cases?

[0:24:26] SL: It's an incredibly good use case. It is true that most applications are going to have a 10-to-1, probably more like 100-to-1, read-to-write ratio. Look at Twitter, right? How many times do you tweet versus see a tweet? It's very much a low write-to-read ratio. Same with GitHub, same with a bunch of CRUD apps, right? However, we do have a number of customers that completely invert that and do significantly more writes, at significantly more volume and speed, than they do reads. I mean, sharding is the absolute best way to scale that workload, right? Because you're creating many more writers, right? If it's a distributed system -

[0:25:10] LA: Read replicas don't work in that model at all. You can't scale with read replicas like you can with a -

[0:25:16] SL: It depends. Well, you can't with just a single writer and read replicas, but you can do read replicas within shards - again, really super scalable, really good. But now, you've increased the number of write threads across the cluster by the number of shards that you have. So, you can take on significantly more writes and write volume because of that model. If you are, like we said, a single cluster, one node has to accept every write. You can have multi-writer systems; however, they come with a bunch of coordination overhead to make sure that conflicts and transactions are ordered the right way. With sharding, you have a much, much simpler kind of architecture to reason around while adding a ton of capacity.

[0:26:03] LA: Again, if you look at the popular use cases nowadays, you can go through and say there are thousands of examples of a thousand different use cases.
While that is, in fact, true, certainly one is popping up as being more and more popular with a vast number of customers, and that's the use of AI. AI is one of those things that requires some unique database capabilities. AI really needs vector database capabilities in the way it operates to do a lot of the things you're trying to accomplish with AI. I'm not the AI expert here. But you know the sorts of queries that they need those vector databases for. Does PlanetScale work in these models?

[0:26:48] SL: We're working on vector support for MySQL, adding vector storage and vector search to MySQL. So, we will be capable of powering those types of applications. It's very early to say whether people will need to store massive amounts of vectorized data, whether you train models off of normal data and the model will know the information, or whether people will use RAG. Look, I'm no expert here. I see some people saying vector databases are completely useless and don't have any use. Others are saying that's completely false. Our opinion is that databases are going to need to support these types and the ability to search them. It's just a table-stakes expectation. The scale that these applications get to, or that their database use becomes, is too hard to tell. We're excited to find that out as the industry grows and AI is used more and more.

I will say, though, in this world, keeping data around, not archiving certain things because of database capacity issues, is probably quite important. It's very cliché, but the power of your data is probably really important to consider, and a lot of people just get around their database problems by deleting data that they should really keep around and usable, and we enable people to keep it.

[0:28:14] LA: Or pass it off to a data warehouse and make it available, but less easily available.

[0:28:20] SL: Yes.

[0:28:19] LA: I mean, that has a different set of problems. So, you have some unique features. You've talked about some of them, but I've written down several of the things you've talked about and other things from my research. Let's talk about each of these specifically, how valuable they are, and really how innovative they are and what you've done to enable them. One of the first ones is non-blocking schema changes. We didn't talk about that yet, but I got that from my research. One of the things that you can do is make a schema change on the fly without blocking read or write access to the database. That's something MySQL natively doesn't do, or at least it hasn't in the past. I'm not sure about more recent versions. How do you do that? And what's the use case for that?

[0:29:06] SL: Schema changes are often quite difficult to roll out at scale. At a very small scale, it's probably easy to just alter the table and lock the table. If you have a tiny number of users, no one's going to notice. However, at large scale, or medium scale, you'll need to constantly evolve your schema. Otherwise, your product is not really evolving, and then you're probably in much bigger trouble. So, you'll need to keep modifying the schema, and that's actually kind of an unsafe operation. It can involve a lot of locking. There are a number of online ways to modify tables in MySQL and Postgres; however, some of them have caveats. They can't really be rate-limited and done slowly. If their transactions themselves get locked, they lock the rest of the database, and you get this kind of cascading failure.
Generally, we see a lot of customers having really severe issues changing their schema because of database contention problems. Even though the users table at GitHub was small. I mean, everyone's users table is small. Even if you have every user on Earth, it's still, what, seven, eight billion rows, whatever. It's still not a big database table. But it's getting accessed constantly, because every repo is owned by a user. So, you need the user object to load a repo, all of this stuff. Modifying the user record is also really important. We want to store new attributes about our users, or we're adding a token column or whatever, for SSL. You know what I mean. There's a million reasons to want to change it. But back then, we would take downtime deploying changes.

So, we worked on a tool called gh-ost that allowed us to much more safely change schemas, and we've taken a lot of that learning that we built at GitHub and put it into Vitess. Now, you can just deploy your schema fully online, like you're deploying a branch of code, and you can roll it back. That's the really cool innovation: you can roll back your schema change without losing data. If you drop a column - which, again, you do to tidy up your database, tidy up models - but if someone has not completely deprecated the use of that column somewhere in the application, you're going to go down immediately without that column. At PlanetScale, you can just press a button and bring that column back with the data intact, and all the intermediate data that was written elsewhere in the table will appear back. It's a real undo, a true undo button for your database. So, it's really cool.

[0:31:35] LA: Do you do that by having a virtual schema that's different than the physical schema? Is that always the case? Or is it a transitory state? How do you actually accomplish that?

[0:31:47] SL: It's a very simple concept that's very hard to implement. Essentially, when you start your schema change, we take a copy of the schema of the table, empty, we modify it with the schema change that you need, and then we start to fill it up with the data from the old table, and we keep those two in sync; they circularly replicate. Once they're caught up and they're in sync with each other, we flip the new modified table into production in a rename operation. We swap them out, basically. Then we keep the old one up to date, so that when you need to roll back, we bring that one back. The intermediate data that you've got in the new table is still present in the old table, and you can flip between the two.

It's very simple, conceptually. But it's actually very hard to do, because if loads of traffic comes in, you need to limit the writes that are coming in. You need to rate-limit the copy operation, for example. Otherwise, you're amplifying your write workload really heavily. There's a lot to implement to go and get it done. But it presents itself very simply, and very beautifully, as a deploy request on PlanetScale. It's very much like a GitHub pull request, but for your database, where you're deploying the change, you're putting it out there, people can comment on it, they can roll it back. It's great. It's really usable.
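A heavily simplified sketch of the shadow-table idea Sam is describing, assuming a hypothetical `users` table and placeholder connection details. Real tools like gh-ost and Vitess keep the two tables in sync from the binlog/VReplication and throttle themselves; this naive batched copy is only meant to show the create, backfill, swap, and keep-the-old-table-for-rollback steps.

```python
import time
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="secret",
                       database="app", autocommit=True)

with conn.cursor() as cur:
    # 1. Create an empty shadow copy of the table and apply the change to it.
    cur.execute("CREATE TABLE users_shadow LIKE users")
    cur.execute("ALTER TABLE users_shadow ADD COLUMN sso_token VARCHAR(255) NULL")

    # 2. Backfill in small, rate-limited batches so we don't hammer the primary.
    #    (Real tools also replay ongoing writes from the binlog; skipped here.)
    last_id, batch = 0, 1000
    while True:
        copied = cur.execute(
            "INSERT INTO users_shadow (id, email, created_at) "
            "SELECT id, email, created_at FROM users "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch),
        )
        if copied == 0:
            break
        cur.execute("SELECT MAX(id) FROM users_shadow")
        last_id = cur.fetchone()[0]
        time.sleep(0.05)  # crude throttle

    # 3. Atomic swap: the shadow table becomes the live table. The old table
    #    sticks around (kept in sync by the real tools) so a rollback is just
    #    another rename.
    cur.execute("RENAME TABLE users TO users_old, users_shadow TO users")

conn.close()
```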
[0:33:08] LA: So, that's really useful for tables like, like you say, the user table, that aren't giant but have lots of changes, and that makes a lot of sense. But what about tables that don't happen to need a lot of schema changes very often, but are giant? Obviously, copying them around can be a lot more problematic.

[0:33:30] SL: We've run this against tables that are terabytes in size. It takes a long time. That's all. It just takes a long time to do. But it's safe. You can tell it to do all of the copying and get in sync, but wait until you're around to finish the rollout. So, you can actually just go and leave it to copy over the weekend or whatever, and then come back on Monday and press the final button to do the flip.

[0:33:56] LA: That's cool. That's cool. Because the downside of that sort of strategy is you never know when it's done, or you never know when it's ready to be done. With your model, it doesn't matter. It just keeps it in sync until you are ready to do it. It'll let you know when it's possible to do it, but you can decide when to do it.

[0:34:15] SL: Yes. That's right. All of these features are built from a lot of experience practically running databases. Just practical, applied experience at large web companies. All the requirements that we had at those companies are requirements for PlanetScale, because we've seen the real use. We have such a mix of people here that have run very, very large websites or database deployments, and people who do database internals. You need a mix of both, because database internals people are very much focused on very hard computer science problems, and don't always think about how the applications run practically at scale, or in the hands of users who are at scale. So, we kind of help each side see the various things that people have to appreciate. Then, we build a product that's very holistic for companies running at real scale.

[0:35:12] LA: I have a few other things on my list of "how do you do this?" But I am guessing that the answer is going to be the same, given that last answer. Let me just confirm to make sure. Branching workflows - that's basically that exact same model, I'm assuming, and schema reversals? What do you mean by development branches? That's one of the things you talk about on your website. Is that different than a development schema? Or is that what you're talking about?

[0:35:43] SL: So, database developer environments, or staging environments, are a real pain in the neck for companies, because you basically have another environment that developers need to have online, and people very often break it because it's not real production. So, it becomes very challenging to manage these staging environments. At PlanetScale, we believe you should just create database staging environments whenever you need them, whether you want to run CI, or you are building an application, or you're doing a feature for your application. You should just create a new branch to go and do it. It's our way of making staging live right next to production, without impacting production at all, and being the beginning of the environment to deploy to production. It means that we take away a lot of orchestration pain from folks, because they don't have to really set up for tests locally, and you can't really simulate PlanetScale unless you get a little mini copy as your branch.
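A minimal sketch of what pointing a CI run at a throwaway branch might look like, assuming the branch has already been created through PlanetScale's dashboard or CLI, and that the host, credentials, and schema shown here are placeholders.

```python
import pymysql

# Each CI run gets its own branch credentials (placeholders here), so tests
# run against a real MySQL-compatible endpoint instead of a shared staging box.
conn = pymysql.connect(
    host="ci-branch.example-host.internal",
    user="ci_user",
    password="ci_password",
    database="app_test_branch",
    autocommit=True,
)

with conn.cursor() as cur:
    # Apply the schema under test and load a small seed dataset for the suite.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "  id BIGINT PRIMARY KEY,"
        "  email VARCHAR(255) NOT NULL"
        ")"
    )
    cur.executemany(
        "INSERT INTO customers (id, email) VALUES (%s, %s)",
        [(1, "a@example.com"), (2, "b@example.com")],
    )

    # ...run the test suite against this connection, then throw the branch away.

conn.close()
```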
[0:36:47] LA: These are development branches. Are the development branches of production data, or of a different dataset? The reason why I'm saying that is, I know one of the things that's very hard in staging environments is to run tests at scale, because you don't have a dataset that you can easily get access to at scale. So, is this a feature that allows you to do that as well, or is that different?

[0:37:12] SL: You can choose to create branches that have seed data in them, or just restore from backups. But by default, the branch actually has no data. So, what people will do is load bootstrap-like datasets into the database to run their tests against. We haven't really made it immensely easy to go and get test datasets out of production, one of the reasons being it's very hard to do unless you know what the customer cares about. But we give the underlying primitives that at least take the pain of provisioning and all of that stuff away, for them to go and get that test set up and run.

[0:37:48] LA: You're not doing things like a shallow fork of production data, or anything like that. These are separate instances, but you enable the creation of these instances, and whatever data you put in, that's up to you. Okay, let's go back and talk about sharding a little bit more. All of this works with sharding. How do you handle schema changes in a sharded environment? How do you do consistent schemas across the shards during the course of the upgrades? That's also a challenge with large sharded systems: doing the schema changes dynamically while you're running.

[0:38:23] SL: It's the same thing, with a lot more coordination. You basically have to go and make sure they happen per shard, and then commit all together at the same time, and actually go and implement that. So, it's the same. The workflow is the same, but it happens in a per-shard context.

[0:38:38] LA: Okay, makes sense. There are a lot of people nowadays - MySQL is one of the best databases around, absolutely. Arguably, so is Postgres, and there are people who think Postgres is just as good or better than MySQL. There are other people who think MySQL is better, et cetera. But obviously, there's a large set of customers that prefer Postgres for their main database versus MySQL. With the strategy you're using, is it possible to support Postgres in the future? Is that something you're considering? Or are you truly focused on MySQL and that's it?

[0:39:14] SL: Postgres is a great database. It has a ton of energy. We would never say never, that we wouldn't go and do something in the Postgres world. It's highly customizable. There's a big community. How custom you can make things would actually be a problem for us. You can't scale every niche use case a plugin might have. So, it's harder, I think, in that sense, and people do shed a lot of the more fun parts of Postgres on their journey to try and scale it. So, it could happen for us in the future, but it's a great database. There's a great community. Whether you like MySQL more or Postgres more is really just a matter of taste.

I will say that MySQL tends to still be very dominant among the hyperscalers. I mean, most of the top 100 of the Internet runs on MySQL. Postgres can't really make that claim. We've never once seen a massive Postgres workload in our entire time speaking to thousands of companies. They tend to be much, much, much smaller, and they tend to be talking to us much, much sooner than we would expect a MySQL user to be speaking to us. In plain speak, the audience we talk to tends to run into scalability issues a lot sooner than a MySQL user would.
I think that speaks to the tradeoffs that each project has made, and I wouldn't say that either one of those is wrong.

[0:40:40] LA: Let's end with one more question, and that is, for the entrepreneur that's starting a company now - they certainly expect to grow to be giant, but right now, they're small. They're getting started. Is the right model for them to use something like PlanetScale at the very beginning, to allow them to grow and scale? Or is the right model to just use stock MySQL, get to know what the usage patterns are, and then add it, whether it's PlanetScale or just Vitess, however they want to add that in, later? What's your recommendation?

[0:41:16] SL: I would not recommend any new company goes and sets Vitess up on their own to run against. However, there is absolutely no downside or difference to picking PlanetScale from day one, and thousands of customers have gone and done so. Thousands of startups use our software, and it grows very well with them. We don't really make any significant tradeoffs. If you know that you'll get to the scale of the next Slack, or GitHub, or any of these large companies that use Vitess, you know you've already factored in the database issue, which is a guaranteed problem for you if you haven't already solved it. It's not a maybe. Database scaling is a guaranteed problem for you if you do not run on a database that is scalable now. Of course, there's the very valid argument of just getting going on whatever works for you and solving scale problems later. It's a very good point. However, if you're picking a cloud database service and just want those problems to go away, there's no downside to picking PlanetScale from day one and doing it that way.

[0:42:21] LA: So, my guest today has been Sam Lambert, who's the CEO of PlanetScale. Sam, thank you for being here on Software Engineering Daily.

[0:42:30] SL: Thank you very much for having me.

[END]