EPISODE 1736 [INTRO] [0:00:00] ANNOUNCER: MongoDB Atlas is a managed NoSQL database that uses JSON-like documents with optional schemas. The platform recently released new vector search capabilities to facilitate building AI applications. Ben Flast is the director of product management at MongoDB. He joins the show to talk about the company's developments with vector search. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com. [EPISODE] [0:01:09] LA: Ben, welcome to Software Engineering Daily. [0:01:11] BF: Thanks. It's great to be here. [0:01:13] LA: So, let's start out with what is vector search? I know we're going to get into the discussion about embeddings and things like that. But let's start out with just the basics of, what is a vector and what is vector search? [0:01:13] BF: Yes, absolutely. So, MongoDB aside, vectors are high-dimensional representations of underlying data, right? The way you produce a vector is you take some piece of data, whether it's text or audio or image, and you send it through an embedding model. Out the other end comes a vector, which is a series of decimals or zeros and ones that is a representation of that underlying data. That's what a vector is. [0:01:55] LA: When I first started working in Generative AI, this is something that took me a while to comprehend exactly how this works. But basically, what a vector is, is a point in n-dimensional space that describes as best as possible the content of the article. Is that a fair statement? [0:02:13] BF: Yes, that's fair. Content of the article could be content of any underlying data. [0:02:18] LA: Of any underlying data, yes. So, if you think, I always like to start with, let's think in one, two, three dimensions and build up from there. You think about a point on a sheet of paper, and two-dimensional space can describe something. If there's only two dimensions, then you can talk about the similarity between two points by how close those two points are together. You can do the same thing in three-dimensional space, and four-dimensional space, we can't visualize it, but you can imagine it in five, six, seven. Now, these vectors are in n-dimensional space, but for n, we're not talking about four or five. We're talking about 500 or a thousand dimensions. Is that fair? [0:02:57] BF: Yes, that's exactly right. I actually like to use the same exact example there when I present on this topic, just kind of like a point on a two-dimensional graph that represents cat or dog. Then I do like to just kind of remind people that when we talk about vector search in kind of today's domain, we're talking about these high-dimensional vectors. One of the ways I like to describe how powerful this concept is, is using just kind of a sphere as an example. So, a sphere is obviously in three dimensions. If you like to think about the sphere as having three dimensions and the amount of volume inside it as being the amount of information it can represent.
If you increase kind of the diameter of the sphere by just a small amount, you dramatically increase the total volume that fits in that sphere. So, that's just kind of a nice way to think about why adding more dimensions dramatically increases the amount of data that can be represented inside of an individual vector. Like adding kind of one inch to the diameter can kind of blow up the size of the volume, because so much of the volume is stored near the edges of the sphere. Kind of similarly, adding one dimension can really kind of dramatically increase the amount of data that can be stored. [0:04:12] LA: Makes perfect sense. Or thinking from the dimension standpoint, a two-dimensional circle compared to a three-dimensional sphere, the sphere has so much more it can hold than just a circle on a sheet of paper. Now, take that and add a fourth dimension and then it gets very - it's the same thing in a different direction. [0:04:30] BF: Exactly. [0:04:31] LA: Absolutely. So, this actually takes a while, or at least it took me a while to fully grasp, is that an n-dimensional vector is just a series of, let's say, a thousand numbers. I'm just randomly saying a thousand, but whatever the number is, in a series of a thousand numbers, you can pretty accurately represent what a piece of content is about, whether it's an article or whatever the content is. You have a pretty good idea of what that's about, to the point where you can compare two vectors, two sets of those numbers, to find how similar two pieces of information are to each other. How similar two articles are to each other, or how similar a question about a topic is to a potential article that might be talking about that topic. It takes a while to get that when you're first starting out, thinking about developing in Generative AI, but once you get that, a lot of things start falling into place. Is that a fair statement? Do you want to expand on that or anything you want to add to that? [0:05:35] BF: Yes, absolutely. I think that's exactly it. The one thing that I kind of think about sometimes when I'm kind of describing that concept is that, obviously, like, this series of numbers means nothing to you or I, but when you take some piece of data and you turn that into a vector, if you compare that to a similar piece of data, using the same model, turned into a vector, those vectors will be near one another, right? It's this idea that the kind of representation on top of this data based on the model is what ends up being similar. That way you can compare these things using the distance. Obviously, to any of us, it's just like kind of a series of meaningless numbers, but the fact that those numbers show up near one another based on the information that was put into the model is what's so powerful. [0:06:22] LA: Yes, I completely agree. That makes a lot of sense. So, we're going to spend a lot of time talking about vectors, but let's talk about the use case for vectors first, and embeddings really is the main use case. What an embedding is, is the process really of taking a piece of content and converting it into a vector. Can you talk about that process a little bit and how that works, and how do you get a vector from a piece of information, piece of content or article or whatever? [0:06:51] BF: Yes, absolutely. So, for the most part, people are using kind of pre-trained models.
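To make the "nearby vectors" idea from this exchange concrete, here is a minimal sketch in Python that compares two vectors with cosine similarity. The vectors and labels are made up purely for illustration; real embeddings from a model have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional "embeddings" standing in for real model output.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
truck = [0.1, 0.9, 0.4]

print(cosine_similarity(cat, kitten))  # close to 1.0 -> semantically similar
print(cosine_similarity(cat, truck))   # noticeably lower -> less similar
```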
This started way back when, coming out of Google with things like Word2vec, which used a variety of strategies to kind of train the model on a diffuse set of data such that it would have some sort of generalizable capability. So, when you put any piece of text in, you would get this consistent representation of data, or of a vector, out the other end. As those capabilities have evolved and improved over the past few years, those vectors that come out the other end have become much more powerful. To do it, you basically instantiate a model or call a hosted service somewhere that has an embedding model. You send it your content, and that gets sent through the embedding model, transformed into the vector, and out the other end you get this list of floating-point numbers. The one thing that I'll just add in here is that one of the kind of particular pieces of excitement right now, and what makes it so accessible to do vector search, is that these models have very good generalized knowledge, but that's not to say that you might not want to kind of fine-tune or train them a bit more. That's something that you can do by augmenting those models with your own data and kind of adjusting the weights inside of them. But even without that, you find that semantic search, basically finding similar pieces of data based on a query, produces really good results with these generalized models, and that's kind of where so much of the excitement of what's happening today is coming from, is that it's so accessible, so easy to use, and can benefit from just all of this existing tooling and content that's already out there. [0:08:34] LA: Yes. Like you say, there are hosted services that do this. Like OpenAI, they have multiple embedding models. Amazon has Bedrock-based embedding models. Google has their own, and Facebook has a set as well. There's a whole bunch of these out there. But they're all different, right? You can't take two vectors from two different models and compare them. You have to compare vectors that were generated from the same model and the same version of the model all the time. Is that correct? [0:09:03] BF: That is correct. Though I will just tweak the end of that. When we say kind of the same version of the model, there are models that allow you to, like, fine-tune them and still utilize the older embeddings that were produced. That's one of the kind of exciting innovations that's happening. But generally speaking, you certainly can't take a vector produced by a model on OpenAI and hope to use it with some other model that's now hosted in Bedrock. That won't work and that won't get you anything. [0:09:29] LA: That's fair. I guess I was thinking like OpenAI had the version two, Ada, and now the version three small and the version three large. Each of those are radically different, and you can't compare one to the other. You have to use the model that it was based on. Yes, you can fine-tune the model, but you have to use the specific model, not just the vendor, but the specific model that it was based on. So, not all of these models generate the same size vector results, right? What are some typical vector sizes we're talking about here? [0:10:03] BF: Yes, absolutely. So, some of the older models have small numbers of embeddings or dimensions, whether it's 256 or 512. I will say today, kind of the smaller end of the, what I'll call like large embedding models, is really the 700s range. But OpenAI's original popular model was 1,536 dimensions.
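As a rough sketch of the hosted-service flow just described, this is roughly what requesting an embedding looks like with the OpenAI Python client. The model name is just one example; other providers, such as the Bedrock models mentioned a moment ago, have their own clients, and their vectors are not interchangeable with this one.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # pick one embedding model and use it consistently
    input="Vector search finds documents whose embeddings are close to a query embedding.",
)

vector = response.data[0].embedding  # a plain Python list of floats
print(len(vector))                   # the dimension count, e.g. 1536 for this model
```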
There have been several that have gone all the way up to 4,096 and above that. I will say a lot of vector search providers support up to 4,096. Some support past that. But generally speaking, there's a pretty strong tradeoff between kind of cost of storage and impacts to performance when you store kind of like super large embeddings. So, you really want to kind of adjust them to the right size for your specific use case, and that's actually one thing that's very interesting about OpenAI's recent one, as well as other providers, where you can actually adjust the embedding size of the same model. So, you can choose how many dimensions you want to have, and the reason that that works is that the dimensions are actually ordered in order of importance, in terms of kind of how well they help distinguish one piece of data from another. You can really kind of tune the embedding size that you want to use to your specific use case, which gives you kind of the best results. [0:11:25] LA: So, your embedding model may generate 4,000 vectors, but you may only store 1,000 of those. You store the first 1,000, and you get the most accurate results. You can still compare those 1,000 against the first 1,000 of any other use of that model and still have valid similarities, it just won't be quite as granular? [0:11:46] BF: Yes. I think the successor to Ada, the 003 model, has something up to 3,000 and some odd dimensions, and you can choose to shrink it to have fewer dimensions. [0:11:57] LA: Yes. So, I think I said vectors. I meant vector values, and dimensions are the right term, so you're absolutely right there. Okay, great. That's the use case for these. But frankly, a vector really is just an array of numbers, right? That's all it is. In fact, like you say, you talk about how large these things are. I generate embeddings for my content articles, so I can do online search to look for articles, but I tend to have articles that are like 4,000 or 5,000 characters, but I store a vector of 4,000 or 3,000 integers - floating-point numbers, actually. And now suddenly the vector is actually larger than the content itself, but it's more structured. It's kind of interesting. These things aren't necessarily smaller than the content they're indexing, but they are more structured. Is that a good discussion? [0:12:49] BF: Yes. It's a great point and it's something that everyone needs to keep in mind when they're building with these data structures. They are a representation of your data. In some cases, they can be larger than your data. The thing that I often kind of just keep in mind when thinking about that tradeoff is you have to remember that it's not just a representation of your data. It's a representation of your data and associated context. So, that's kind of like the key bit about why it might be like inflationary in size. It's not just representing the sentence you put in. It's representing what that sentence is not. To represent kind of all of that information can require even more space, kind of conceptually. [0:13:25] LA: Absolutely. Let's get back focused on just vectors, because that's the thing that's important to us here. You've got these vectors, and now you want to compare them, and that's where vector search comes in. Let's talk about, in a generic sense, what vector search is, and then we'll talk about MongoDB as part of that. But what is a vector search? [0:13:46] BF: Yes. So, vector search, like very simply put, is finding vectors nearby a target vector.
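Picking up the point about adjustable embedding sizes before moving on: there are generally two ways to end up with a shorter vector, either asking the model for fewer dimensions up front or truncating and re-normalizing a full-size vector yourself. A hedged sketch of both, noting that the dimensions parameter shown is specific to OpenAI's text-embedding-3 models and other providers may not offer an equivalent:

```python
import math
from openai import OpenAI

client = OpenAI()

# Option 1: ask the model for a smaller vector directly.
small = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is vector search?",
    dimensions=1024,  # keep only the 1,024 most informative dimensions
).data[0].embedding

# Option 2: truncate a full-size vector yourself, then re-normalize it so that
# cosine and dot-product comparisons still behave sensibly.
def shorten(vector, dims):
    cut = vector[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]
```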
As a developer, I send in a vector based on a query that is the thing I'm looking for, and we're going to find vectors inside of our storage that are close to that vector based on some sort of distance metric, kind of a geometric distance metric, maybe kind of a Euclidean distance, or a cosine distance, or a dot product. We're going to calculate that distance somehow, and that is going to tell us which ones are nearby one another. [0:14:23] LA: That's really the heart of why a vector is treated as a special thing versus just an array of integers. I mean, any database can store an array of integers, and that's easy to do, but it's this search processing that is now possible, given the format of this array of integers - that array, not necessarily of integers, of values, I should say. This array of values. It's the search capabilities built into the database that allow you to find these similarity matches. That's what makes treating this array of numbers as a vector so important. [0:15:02] BF: Yes, exactly. I would just go one step further and say, what makes it very unique and why you have to kind of treat it differently is not only do you need to store it very efficiently, but you need to query it very efficiently. Because calculating the distance between hundreds of thousands, billions of points, or billions of vectors can be quite costly if you don't have special ways of skipping over data, of kind of doing those calculations. That's why it's not just your standard database query that gets run in order to process these queries. You really need special ways of doing this. [0:15:40] LA: So, let's go a little bit deeper into that. Let's get a little nerdy talking about how vector search actually performs these operations. Let's start by saying you've got two vectors. You want to find the distance between them. What's the basic? I know there's many different ways of doing this, but what's the most basic way of calculating that distance between just two vectors? [0:16:00] BF: Yes. So, I'll maybe kind of expand on that a little bit. When you have two vectors, the most basic way would be to do what you might call an exact nearest neighbor search. What we mean by exact is we're going to compare our target vector against every vector in a collection of data. Because in your example, we have two, of course, we're going to do an exact comparison. But that will involve basically just kind of lining up those two vectors next to each other and doing the calculation of a Euclidean distance, kind of like that classic geometric operation, to figure out how far apart they are from one another. That's kind of how we would approach that. [0:16:42] LA: It's something like A squared plus B squared, and the square root of that is the distance, where A and B are the differences between the first components and the second components of the two vectors. We expand that to a thousand points and it gets a lot more complicated. It's been a while since I've done my Euclidean math, but that's the basic sort of calculation. It's simply, literally, a distance calculation in n-dimensional space between those two points. [0:17:06] BF: Yes, exactly. [0:17:07] LA: Okay. So, like you say, that works and it gives you a very accurate result. If you had all the time and resources in the world, you could do that by taking a point and comparing it against a million other points to find which ones it's close to. But obviously, as you say, that's a very complex search to do, a very expensive search.
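To ground the brute-force approach just described, here is a minimal exact nearest-neighbor sketch: compute the Euclidean distance from the query vector to every stored vector and keep the closest few. The document IDs and values are illustrative; the point is that the work grows linearly with the number of stored vectors.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared per-dimension differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest_neighbors(query, stored, k=3):
    # Compare against every vector: accurate, but one distance calculation per document.
    scored = [(euclidean(query, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort()
    return scored[:k]

stored = {
    "article-1": [0.10, 0.80, 0.30],
    "article-2": [0.90, 0.20, 0.50],
    "article-3": [0.15, 0.75, 0.35],
}
print(exact_nearest_neighbors([0.12, 0.79, 0.31], stored, k=2))
# -> the two closest articles and their distances (article-1 and article-3 here)
```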
What really makes a vector database like MongoDB valuable is the fact that it's got built-in algorithms to do that much more efficiently, using estimations and other mechanisms. So, let's start getting into that. We know how to create a perfect distance between two points. How do we create a, I don't want to call it imperfect, but how do we do that at scale? [0:17:52] BF: Absolutely. So, the way that you do that at scale is often using something called approximate nearest neighbor. It really kind of adheres to some of the foundational things of databases in general, which is why we're so interested in it. For me, at least, this is how I like to think about it. The fastest kind of query or search over data is always related to the amount of data that you did not search and did not query, right? Not needing to touch something or not needing to examine it is almost always going to be the fastest way to query. [0:18:26] LA: More things don't match than match is the theory. [0:18:29] BF: Exactly. So, with an approximate nearest neighbor approach, there are many different algorithms used to do this, but they all kind of rely on creating a unique index structure that allows you to scan through this data very efficiently. The way you scan through efficiently is, again, not looking at certain pieces of data. That's why it's an approximate thing. What's important to keep in mind is that what's approximate about it is not the distance comparisons, not the distance that you calculate. It's the fact that you don't always look at every single point in the collection. The way that this works is, in particular for our solution, we use something called HNSW, which stands for Hierarchical Navigable Small World graph, and that's just the index structure. So, we use an HNSW index structure. The way that works is when you're querying it, you choose an entry point into this graph, and that entry point determines what other points inside of that graph you're going to look at. From that entry point, you find those points that are similar, or like nearest towards what you're looking for, your query. Then basically the linkages between these points enable you to kind of find it very quickly. They also are what enable you to not look at other points, which you would expect to be kind of farther away. The last thing I'll just mention about this is the cost of doing this, doing an approximate nearest neighbor, is that you don't necessarily look at all of the points. You may not have the total set of distances, right? And this is where we get into concepts of recall and accuracy of your results, but you may not kind of touch every option, but you should really be touching the ones that are kind of closest to your target. [0:20:15] LA: Right. So, let's take a simple example here and please correct anything that I'm saying that's not correct. Let's say you've got a thousand-point vector and a database with a million of these. Okay. Now, you've got a thousand-point vector you want to compare against those. You want to find the top three closest vectors that match. Let's assume for a second the distance of those top three is going to be, it's going to be less than two. It's going to be one, one point one, one point two, as far as whatever these units are. It doesn't matter. That's going to be the distance away. It's going to be very near it.
Well, if you look at a vector, compared to another one, and let's say the first point happens to be 10 units away, you know it can't be closer than 1.2, and so you throw that whole data point away, or you throw that whole vector away, because you know that vector can't be close enough. Now, that's a real simple example of where you can throw away a data point. But essentially what you're doing is that sort of thing by picking a particular entry in the vector, whatever that is and however you select that, and then throwing away the ones that are clearly too far away to be within the resulting set, and then only taking the ones that are not clearly too far away and then doing another round of approximation on those. [0:21:34] BF: Yes. That is the concept for sure. I'll just add on that one of the properties of an HNSW graph is that it's effectively kind of this series of layers, layers of these points. So, you pick your entry point and then you descend down, finding your closest points on each layer. By the way, as you're doing that, you can also be filtering out, with what's called a pre-filter, results that don't contain some value that you want them to have, right? So, you can filter on some associated metadata as you're traversing this graph. You filter out those things that are not about Ben or not about Jim, whatever it may be. Then when you get to the bottom, you then have kind of your total set and then you return that in order. It's exactly what you said, but you get that sorted order and then you return it back to the user. [0:22:27] LA: Cool, cool. Great. Okay. A good vector search engine, a good database, knows how to do this natively, and that's what vector search within MongoDB is, it's providing that capability. [0:22:40] BF: Absolutely. Yes. So, that's exactly the case, and it does require that you do special things to your database or your platform, as it were, to make sure that you're able to do that efficiently. Those things are related to the software that you build, the type of infrastructure you bring to bear, and kind of the other configurations that you allow your users to set up. [0:23:00] LA: Can you elaborate on that a little bit? I'm not quite sure I get what you mean by that. Just a little bit more. [0:23:06] BF: Yes. So, the first one being the software, do you bring the right type of indexes? Do you use the right type of algorithms to kind of traverse that data? Do you kind of use all of that together? The second bit that I was mentioning was the hardware, and that's related to the fact that the requirements of a query like vector search and the algorithms that power it are different from the types of requirements that maybe a classic transactional database based on a B+ tree index needs, right? A good example of that is, in many cases, vector search can be more memory intensive. So, you need to bring the resources that a vector search operation requires to bear. But also, the last thing I'll just throw in here is not only do you need to bring the types of resources, whether it's more kind of memory constrained than it is CPU constrained, but also you need to give users configuration options to ensure that the right processes and procedures that are running on your system are getting access to the right sets of resources.
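For readers who want to see what this looks like in practice, here is a hedged sketch of an Atlas Vector Search query with a metadata pre-filter, written as a PyMongo aggregation. The connection string, index name, field names, and filter values are placeholders, and the filtered field has to be declared as a filter field in the vector index definition; check the current Atlas documentation for the exact options.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["mydb"]["articles"]

query_vector = [0.12, 0.79, 0.31]  # in practice, the embedding of the user's query

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",      # name of the Atlas Vector Search index
            "path": "embedding",          # document field that stores the vector
            "queryVector": query_vector,
            "numCandidates": 100,         # how many candidates the ANN stage considers
            "limit": 5,                   # how many nearest results to return
            "filter": {"author": "Ben"},  # pre-filter on indexed metadata
        }
    },
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for doc in collection.aggregate(pipeline):
    print(doc)
```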
That's one thing I'll just kind of call out that we did inside of Atlas Vector Search, based on MongoDB, is that we allow you to provision resources just for your database and provision resources just for your vector search, but consume them all through one experience. That's really kind of the crucial thing. [0:24:30] LA: Yes. So, you can use specialized processing, specialized compute and memory resources, to do this processing because it's so different from the other type of database processing you might be doing. And that's actually something that's unique about MongoDB. I'm not sure of - I'm more familiar with Postgres pgvector and some other vector engines and they don't do that. They provide the processing on whatever the server is that's running the database, but they don't provide any way, at least not out of the box, of generically assigning different tasks to different CPUs or different computers based on the fact that you need a memory-optimized system for this, versus a CPU-optimized one, et cetera. But you do that. That's a unique capability in MongoDB. Is that a fair statement? [0:25:22] BF: Exactly right. I'll just go one step further and say that there's one challenge of using one set of resources that is kind of completely accessible to all of your services, which is that you need to kind of oversize it, right? You need to size it for the longest pole in the tent. Sometimes you bring to bear more resources than your smaller workload needs, whether it's your database workload or your search workload that's kind of driving this change in resources needing to be consumed. The other thing is that you also now are in a place where you have resource contention as a possible impact, right? So, if you don't size it large enough, then you could find that maybe your database kind of workload is taking resources from your search workload or vice versa. The power of what we've done is not only do we let you right-size the resources to each one of these workloads or each one of these services that's running. But in addition to that, we prevent them from kind of stealing resources from one another. So, you can really kind of bring to bear the right amount of resources to satisfy your workload in the most efficient way possible, but also avoid any type of contention issues by doing so. [0:26:33] LA: Cool. Cool. So, I imagine your customers' use of this capability has been growing dramatically over the last few years as AI, Generative AI in particular, has been growing. Can you talk a little bit about the growth here and how important is this capability to the fundamental nature of AI-based applications? [0:26:54] BF: Absolutely. I mean, we've seen a tremendous amount of interest. It's probably the fastest-growing new service ever released on Atlas, which is pretty amazing. What I'll say is we're seeing kind of two different dimensions to it. One is just a rash of new, you know, kind of semantic search-oriented workloads, just enhancing kind of a search experience inside of applications, where we kind of serve a lot of customers. Then the second bit is kind of more on the AI side of the house, which is really about kind of retrieval augmented generation and other techniques to use data with large language models. And that's where we're seeing kind of like a whole new dimension of workload come to MongoDB and come to Atlas because of the availability of vector search.
So, this is where a whole new persona or group, who maybe didn't historically use MongoDB for as many workloads in the past, are now kind of taking a look at MongoDB and using it to power their vector search-related workloads. [0:27:52] LA: Yes, because more and more workloads are becoming dependent on vectors, and a high-quality vector search capability is now core and fundamental to what the applications are doing nowadays. So, it makes sense to tie your database to the one that does vector search the best, essentially, versus in the olden days where it was which one can do basic selects the fastest, or which one can do updates faster, and whatever. It's turning more and more into the case where the most important, critical aspect of a database is how fast it can do a vector search. [0:28:26] BF: Yes. I would just kind of add to that. I think a huge part of it is like, do we really want to add in another service or, like, manage another kind of piece of software in order to provide this? Or can we just take advantage of it on top of the existing database that we're already running? It's allowed for a lot of consolidation and kind of two birds with one stone, really. [0:28:51] LA: It's two birds with one stone, but also, it's like that's where the data is located, right? And we're talking about how large these data sets can be. If you've got a billion articles, each one with a thousand-point vector, and you're trying to do a search, you don't want to have to copy that into another search engine or another system in order to try and figure this out. You want to do it where the data is located, essentially. [0:29:14] BF: Absolutely. I mean, that has been kind of a huge focus for our core search offering for a very long time now, which is just reducing the amount of effort and work that went into replicating data from your transactional database into some separate search system to serve a workload. With Atlas Search and Vector Search, you're able to really just configure an index on top of your existing transactional data and do search and vector search workloads on top of that data within the same place where the data lives. So, yes, dramatically reducing complexity and the need for duplication and copying of data. [0:29:50] LA: How much is this replacing other search mechanisms? [0:29:54] BF: Yes. So far, from what I'm seeing, there's a lot of augmentation of existing search systems. With vector search, you get this amazing kind of out-of-the-box capability to do semantic search, like search based on meaning. But the reality is, depending on your search workload, you may need very deterministic results. You may want to boost certain terms. You may want to kind of augment or do some sort of complex transformation on the back end of the data. Those are really kind of your classical search challenges, if you will. I'm seeing a lot of workloads that were doing kind of existing BM25-based search, kind of like the classic approach to doing relevance-based search, add on vectors as a new way to kind of augment the results, do a bit more kind of fuzzy matching, if you will, using kind of vectors as that approximation of your data, but still have that fine-grained control when you're kind of actually configuring things and setting them very explicitly around kind of what type of results you want to get. So, I'm seeing kind of a lot of blend, and typically, a lot of people refer to this as hybrid search.
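Implementations of hybrid search vary, but a common way to blend a keyword (BM25) result list with a vector result list is reciprocal rank fusion: documents that rank well in either list rise to the top. A minimal sketch of just the merging step, assuming you already have the two ranked lists of document IDs:

```python
def reciprocal_rank_fusion(keyword_ids, vector_ids, k=60):
    # Each list contributes 1 / (k + rank) per document; higher combined score wins.
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc-7", "doc-2", "doc-9"]    # from a classic text/BM25 search
vector_results = ["doc-2", "doc-4", "doc-7"]  # from a vector search
print(reciprocal_rank_fusion(bm25_results, vector_results))
# doc-2 and doc-7 land at the top because both searches agree on them
```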
But it's really kind of bringing together multiple techniques to ensure that you're getting the best possible results. [0:31:09] LA: I guess that makes sense. I've always thought of, when I think of search, I think in more basic terms like keyword searches and things like that. But search is a lot more complex than that. There's a lot more involved to it than that. So, this is another mechanism to help with search, but it really isn't a replacement for existing search. I think one of the things you glommed onto there that I don't want to just gloss over, I want to make sure people understand, is that these vector search mechanisms of trying to find content similarity are a very fuzzy comparison mechanism. It's not very precise. In fact, you can get different results at different times for different purposes based a lot on the models used and other things. But bottom line, it's not very exacting. It's very fuzzy in its results. [0:31:57] BF: Yes. I think that's a fair way to put it. [0:32:00] LA: So, combining that with more exacting mechanisms, search mechanisms, can be a highly valuable approach. [0:32:08] BF: Right. I guess I would just add on here, for your search use cases where you have a search bar and users are trying to find a specific thing, what is so crucial to delivering a good experience is kind of meeting the user where they are and how they use the tools that they do. Understanding how end users are inputting things into the system is so crucial. I'll just mention that that also just diverges from how this technology is maybe being put to use in the context of AI-powered applications using retrieval augmented generation, where you're actually just looking to find semantically similar data, take a bunch of it, and put it into a prompt to feed into a large language model, right? So, you have to think about both of the use cases very differently, and they call for very different things inside of their search. At the end of the day, what's crucially important to both is that you're getting accurate, relevant responses. So, giving users the ability to kind of tune their results and kind of really get into the details is really crucial to making sure that you can kind of build out the search and AI-powered experiences that developers are trying to build. [0:33:21] LA: Now, you introduced another term that's an important term. I want to make sure our listeners catch and realize what it is you're talking about. You talked about retrieval augmented generation, RAG, which is a common term, especially around things like chatbots and things like that. Can you tell us in a little bit more detail what RAG means, what it's all about? [0:33:42] BF: Absolutely. So, RAG, as you said, stands for retrieval augmented generation, and what that means is taking retrieval, so data that you fetch from whatever location, and augmenting the generation of your prompts that you're sending to your large language models with that data. The way to think about it is you're asking a large language model for an answer to some question. The best way to ensure that it gives you a good answer is, inside of the question that you ask it, to give it the answer. That's kind of the most simplistic way to think about it. So, you can rely on that large language model to just know the detail that it needs to, but better is if you can give it the data, because you're going to focus the response by giving it like a lot of prompt data to give a good generation on top of.
In particular, giving it the relevant information for your query is really going to focus it. That's what retrieval augmented generation is. The other thing I'll just add in about it, the thing that's a bit unique about what we're doing at MongoDB with regards to retrieval augmented generation, is that historically we're talking about kind of using vector search and other search technologies to send in the data to your LLM, so that it can give a better result. So, you have some catalog of data that is somewhat static and you feed it into the prompt, and then it's able to give you a better response. But what's exciting about MongoDB is that we've built our search and vector search capabilities onto a transactional database. With MongoDB, you have one set of storage from which you can provide both your transactional data, the data that's powering the lifeblood of your application, along with your kind of catalog of more static, knowledge-base data, into the large language model. A great example of this is if I need to ask a large language model about what insurance benefits do I have and is my broken leg covered based on my policy. Well, I can make one call to MongoDB to find out what Ben's policy is, based on all the user data that I store about him in my transactional database, and I can fetch my policy data, which I store separately, that just is policy X, Y, Z, which applies to many users. So, I can fetch both of those pieces of data and feed both of them into the large language model, and then get the relevant result that both captures, was he covered for this based on his policy and kind of his level of coverage, as well as the location, et cetera. All of that data can come from one place because, again, we've married the technologies of vector search and search to the transactional data. That's kind of one of the really exciting things with MongoDB and kind of specifically the topic of RAG. [0:36:28] LA: Yes. So, one of the things I've done on my site - I've got a website that's a newsletter with a whole bunch of articles, a whole bunch of content that I put in. I generated embeddings for all of those. Then I have a chatbot where users can ask a question, what is cloud computing? I use RAG to gather a bunch of content - articles that I've written or information from my book or whatever - and put it all together, send it to a large language model that generates a response and a list of those articles back that the person may be interested in reading. So, it allows me to get my content in front of my customers a lot easier. It works pretty well. But this is the heart of what most AI applications are doing nowadays, is some form of RAG along with large language models, in order to determine what part of this large database of content is relevant to a particular query and put that relevant data into the query. So, like you say, the language model doesn't have to look for the results. It's got the results. It has to figure out how to express it and how to take the right pieces out and throw away the rest and give you a response back. [0:37:42] BF: Exactly. [0:37:44] LA: So, what else can you tell me about Atlas Vector Search that we haven't talked about yet? [0:37:51] BF: I think we covered a lot of the key pieces. Maybe I'll just add in two more things. Earlier, I alluded to the fact that we had kind of this separation of resources to allow you to kind of really fine-tune what resources get put against what workload. That's called Search Nodes inside of Atlas.
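Stepping back, the retrieval augmented generation flow described in this exchange boils down to three steps: embed the question, retrieve nearby documents, and paste them into the prompt sent to the model. A minimal, illustrative sketch follows; the embed, search, and generate callables are placeholders for whatever embedding model, vector store, and LLM you actually use.

```python
def answer_with_rag(question, embed, search, generate):
    # 1. Turn the user's question into a vector with the same model used at index time.
    query_vector = embed(question)

    # 2. Retrieve the most similar documents (for example, via an Atlas $vectorSearch query).
    documents = search(query_vector, limit=5)

    # 3. Augment the prompt with the retrieved context before calling the LLM.
    context = "\n\n".join(doc["text"] for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)
```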
You can deploy Atlas Search Nodes with high CPU to serve a specific workload, and that's available today in every cloud and is GA. The other thing that we've done, which is really exciting with vector search in many ways, is that we've partnered very closely with all of the open-source frameworks that are being used by a lot of the builders who are building applications on top of this new capability. I'm thinking of frameworks like LangChain or LlamaIndex, which are, for those of you who don't know, these kinds of frameworks that allow you to kind of orchestrate the flow of data into vector search or into a large language model, or the creation of chunks from your data before it's inserted into vector search. Just all of the kind of little things that happen around a vector search or LLM application workload that need to be developed, and they kind of just help streamline a lot of that. So, we're plugged in, in a variety of different ways, to these third-party frameworks. In fact, inside of LangChain, we have support as a vector store, right? Which means you can get kind of vector results back from it. But we also have support in quite a few different ways, one namely being as a semantic cache, which means, instead of going and resubmitting a query to vector search, sending it to a large language model, and getting the response, you can just check the cache and see if this question has already been asked before and see the result of a generation that's already happened. That saves you the cost of sending the query into the large language model and doing the vector search, by just resolving the query from the cache, and doing that caching in a semantic way, where you look for kind of questions that were similar to the one being asked. So, we have like a ton of different plug-ins to these various different open-source frameworks that allow you to kind of build applications much faster and just take advantage of kind of a ton of work that's been done to kind of expose these abstractions. You can really, as a developer, focus on building the mission-critical, differentiated logic and, you know, kind of application functionality, as opposed to some of this boilerplate that you could just be taking advantage of inside of the frameworks. [0:40:23] LA: Makes sense. Cool. Great. And that sounds very interesting. Thank you very much. So, Ben Flast is the director of product for MongoDB's vector search capability within the Atlas product, and he's helped us learn about vector search today. Ben, thank you so much for joining us on Software Engineering Daily. [0:40:42] BF: My pleasure. Great to be here. [END]