EPISODE 1716
[INTRODUCTION]
[0:00:00] ANNOUNCER: One of the fastest areas of growth in observability is front-end observability, or real user monitoring. This is the practice of monitoring and analyzing the performance, behavior, and user experience of web applications from the user's perspective. Purvi Kanal is a Senior Software Engineer at Honeycomb. She joins the podcast to talk about the evolution and status of real user monitoring. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.
[INTERVIEW]
[0:01:12] LA: Purvi, welcome to Software Engineering Daily.
[0:01:15] PK: Hi. Thanks so much for having me. I'm pumped to be here.
[0:01:17] LA: I'm so glad you're here. Thank you so much. I love the observability space. I was at New Relic for almost a decade, so I love working with observability. I love talking about observability, so I'm looking forward to this conversation today.
[0:01:31] PK: Same here. Can't stop talking about it personally.
[0:01:34] LA: Oh, great, great. You focus primarily on front-end observability. Honeycomb does a lot more than that, but you specifically focus on front-end observability. When I was at New Relic, I remember when we first created our first RUM, real user monitoring, offering. I'm not sure of the history here, but I think that was one of the very earliest RUM offerings way back in the day. A long time ago. I know RUM was a big deal, and it was really basic monitoring. There wasn't a lot of detail there, but it was just something that just worked and that was cool. Real user monitoring, or front-end monitoring, which is a better term now, has changed a lot since those very early RUM days. How has front-end observability expanded in the past several years?
[0:02:24] PK: I think it's such a broad question, because front-end engineering itself has evolved so much over the last decade or so. If you go back about 10 years, we're almost at the 10-year anniversary of React really, really taking hold. Generally, the way that we build modern web apps really, really changed around that time, around 10 years ago. The complexity of what we're doing with modern web applications is just so much larger. It's more than a page load time. We're doing so many complex interactions. Front-end web apps are interacting with so many complex distributed systems that it's become a lot more than metrics on a page. I think the way that front-end observability and monitoring has really changed is that the needs of developers have changed. I do feel like a lot of the tools are still stuck in the same place, which is something that I'm glad is getting a lot more attention now. I think one of the main ways is going from what happened to how we can turn what happened into a further diagnosis and get much more real-time updates.
I think there was a lot of monitoring a long time ago that was really based on synthetics, because it's really hard to monitor with real users, because that requires sending data over the wire for every single one of your users, and there just weren't good tools for that for a really long time. People would use synthetics, so, essentially, run -
[0:04:01] LA: You want to describe what synthetics is for our listeners, so they know what we're talking about here?
[0:04:05] PK: Absolutely. Synthetic monitoring, which is often just referred to as synthetics, would be some service, or server that you can run that hits your web page a bunch of times and sends monitoring data about that to your backend of choice. A lot of backends still offer synthetics, and it can be useful. But real user monitoring, when it happened - and I remember using it; I've used New Relic RUM for a big part of my career - it was a really big deal, because rather than getting data that has been generated by bots, essentially, you get that data about your real users, so that when you're getting that message from customer service being like, "Hey, this customer is experiencing an issue," you can potentially tie that to a real crash report. That was the big, big leap for real user monitoring. I think the next step is to go from there into being able to go beyond monitoring into more of the observability space of surfacing unknown unknowns and proactively taking steps for front-end teams, and that's something I'm really excited about.
[0:05:11] LA: Yeah, I'd love to get into that in a little bit more detail. I know the unknown unknowns is the step beyond just the, what happened in this particular transaction. I think one of the things you alluded to that you get good coverage with RUM compared to synthetics is what real people are actually doing on your site, as opposed to your estimate of what you think they might be doing. That's a big difference there. When you talk about unknown unknowns, you're talking about things that you're not anticipating happening at all. You have no idea that they could happen, but they happen anyway. Can you talk a little bit more about that?
[0:05:48] PK: Yeah, absolutely. I think with a lot of monitoring and RUM tools in general, it comes from a place of let's set up a bunch of dashboards and graphs. If you think about that, if you think about setting up a dashboard with a set of graphs that you think about, let's say, page load time, or your core web vitals values, or maybe even just your network activity, those are questions that you are asking. You're just like, what is my page load time? That is a known quantity. That's a known question, and you're like, "I know that I want to know about this." When we can start surfacing unknown unknowns, it's almost like a light goes on, and you can see customers having issues proactively before they even report them. That's taking things like errors and aggregating them in a certain context, and being able to put lots and lots of different data and attributes onto your monitoring data, so you can slice and dice it a million different ways, to ask questions that you don't even know that you need until you're in an incident, or until something is going wrong. To be able to answer questions really quickly like, how many people is this impacting? Or, are people experiencing this on a certain set of browsers, or a certain type of device, or a certain type of network connection? Can I slice this by P95?
Is everything okay up until P95, and that's where we're seeing a spike? It's just the ability to have very rich, wide events, so that you can ask these questions when they come up for you. Because what's really, really tough is when you're trying to debug something and you're just like, "Oh. I guess, we don't have that data." The only way to push past it is to add that to our telemetry now and then watch it for a while and watch that telemetry roll out and then try to answer those questions. If we have rich, wide events from the get go, those unknown unknowns can be surfaced in the moment.
[0:07:52] LA: One of the things I talk a lot about is mean time to detect and mean time to repair. Often, when we're talking about observability, we're talking about, send me a notification as soon as something goes wrong, so I can wake up someone and get them involved and working on a problem. That's going towards trying to improve mean time to detect. Now, what we're talking about here is a rich set of data that you can now use once you are engaged. What can you do to find out what's really going on and find out exactly what's going on? That really goes towards mean time to repair. It helps you understand the way your system is working a lot faster, presumably, but also in a lot more depth, in order to find problems, like you say, even before the problem is noticed.
[0:08:38] PK: Absolutely. To add on to that, one thing that I think a lot of front-end teams face is what I've heard coined as mean time to innocence, which is that a lot of customer issues land on front-end teams first, just because your web application, your UI, your browser app is the closest thing to a user. Often, if customer reports are coming in about a page crashing, or a page being slow, front-end teams are the first to hear about it, but they're also sometimes the teams with the least amount of observability to actually detect it. That problem might be occurring somewhere deeper in the stack - maybe there's an unoptimized DB call, or maybe there is a slow endpoint, or something else happening in the stack. But it becomes that team's responsibility to debug it first and get to that conclusion. Shortening that mean time to innocence for front-end teams is actually a huge, huge lift, because being able to confidently answer, is this a browser performance issue, or is this a backend performance issue? That is actually not a super easy question to answer today. It really should be, because that's one of the first questions that you're asking as you're digging into issues.
[0:09:53] LA: Yeah. It's actually a really important question to ask. I used to manage a team at Amazon that - it wasn't a front-end team. It was a backend team, but it was a first-layer backend team. It owned the detail page framework and the gateway framework. Every detail page that any customer sees goes through them first, and then the detail widgets go off to individual backend teams. Well, what that meant is whenever there was a problem on any detail page anywhere in the world, we got called first. It was our job to figure out where the problem really was. That was very, very, very time consuming. We spent 90% of our time on-call, and 90% of our on-call was finding the team that was responsible, because it wasn't us. Really, you're talking about the same problem here now, too. But the real goal there isn't just to give you the tool so you can show your innocence, so to speak, and to find who is responsible, or to pass it off to the right team to work on.
It's to do so in an automated way, so you're never even involved in the first place, right? How close are we to that?
[0:11:02] PK: I think we're getting closer. One reason I'm really excited about OpenTelemetry as a project, as an open-source standard to send telemetry through, is because of the ability to connect all of your services through context propagation. One thing that I think is not a super solved problem yet in observability, and especially front-end observability, is the ability to trace an event through from the browser, all the way through to your backend system, and connect those easily with no work. OpenTelemetry introduces the concept of context propagation, which makes it really easy to actually connect those events. I think that is one thing that definitely advances our ability to figure out, hey, do we actually need to dig in here and do browser performance optimization? Or do we actually need to do the work somewhere else? Because you can easily see, like, there's a network request that's made from the browser and then it hits the backend, and you can see all of those times and how it goes, all in one place. That's really the ideal place that we want to end up, without having to do a lot of work to instrument that ourselves.
[0:12:22] LA: It's funny you mentioned OpenTelemetry, because literally, the next item in my notes right here was about OpenTelemetry, to ask you questions about that. I'd like to get into that in a lot more detail.
[0:12:34] PK: Absolutely.
[0:12:35] LA: Let's start though, so everyone's on the same page. OpenTelemetry specifically is an open standard for doing telemetry, but telemetry itself, can you describe what telemetry means and why that's important to observability?
[0:12:51] PK: Absolutely. Telemetry is data that you would want to emit about your system to somewhere where you can interact with that data, that's giving you lots of information about the health of your system in the form of either traces, metrics, or logs. Of course, the ideal state is being able to instrument absolutely everything, every corner of your system. But there's a lot of auto instrumentation that can happen right off the bat. It might be network requests. It can be your database operations. It could be other JavaScript functions for the front end. The idea of sending telemetry is sending diagnostic data about your system to dig into, either for debugging purposes, or for performance optimization purposes, or whatever purposes you want, to be able to know your system better. Really, that is what allows us to surface unknown unknowns, because you can't really know your system without a bunch of rich data.
[0:14:03] LA: Right. Now, telemetry by itself, we've been generating telemetry information for our systems for many, many decades. We've had infrastructure telemetry for a long, long time. Database telemetry, backend telemetry. RUM really is the start of front-end telemetry. One of the problems we've had historically is that all those are different systems. It doesn't all interconnect, which makes it hard to do these correlations of saying, well, it's not a front-end problem. It's a backend problem. Or it is a front-end problem, whatever. Because you're looking at widely disparate systems and trying to compare results and say, "Well, this spike here, is that the same as this spike over here or not? Are they different things?" Making those connections can be very hard. This is really where OpenTelemetry comes in.
Let's go into the next step and talk about what OpenTelemetry is.
[0:15:01] PK: Yeah, absolutely. OpenTelemetry is essentially a collection of APIs, SDKs, and tools that you use to instrument, generate, collect and export your telemetry data to help you analyze your system's performance and general behavior. It is an open data standard. The way I think about that is, pre-OpenTelemetry, you basically would go to an observability vendor, like New Relic, or Datadog, or Honeycomb, or somebody. You would get their SDK, put it in your app and it would send a bunch of data to that particular backend.
[0:15:44] LA: The replication.
[0:15:46] PK: Exactly. The idea with OpenTelemetry is the SDKs that actually generate and collect that telemetry should be open and should be able to be influenced by the community and connect to all these different vendors, where you can just rely on your vendors for how you query and visualize that data, but the actual instrumentation and the SDKs that go in your code, that should be an open standard that everybody can have a say in how it's developed and what those standards are. Because there's a lot of vendor lock-in that comes with having proprietary SDKs, and you're just at the mercy of whatever that particular vendor has decided. The idea with OpenTelemetry is that it has moved to this communal place, where people can have discussions about what these specifications are and what the SDKs look like, and contribute to them openly as well.
[0:16:47] LA: Now, when a lot of people think of OpenTelemetry, I think the first place they go is traces, right? But OpenTelemetry is not just traces. It's metrics and logs as well. Again, just for explanation for the audience, can you tell us what those three different things are, traces, metrics and logs? But then, I want to talk a fair amount about traces in particular.
[0:17:08] PK: Yes, absolutely. OpenTelemetry emits three types of telemetry, or three signals. You have metrics, which are often most useful for numerical data, or tracking the same thing as it changes over time.
[0:17:30] LA: Or, CPU is currently at 25%.
[0:17:32] PK: Exactly. You want to just see that tick up or down, and you want to see what that trend is. Then let's go to logs. If you think about a log, it's basically just an unstructured log. It's an unstructured object. You can just have a bunch of attributes on it. You can have the timestamp. Let's say it's an error, you could have the error message. Just a very unstructured blob of data that gets sent over the wire.
[0:18:00] LA: They're specifically generated at some point within the code base, right?
[0:18:05] PK: Yes.
[0:18:05] LA: It's like, at this point, this is what's going on. Here's a bunch of data that might be useful to you.
[0:18:10] PK: Yeah. At the heart of it, we all log things, right? I console.log a lot during my debugging process. You could just think of it as, instead of console.logging something out, you might just wrap it in a, like, send this log to Honeycomb via OpenTelemetry. It's a little bit more structured. Instead of just a string, it's an object and it has properties that we can query on it. Going over to traces, I like to think of traces as basically a collection of really fancy logs. It's not that much of a leap. Imagine, instead of having individual logs all over the place, each log has a duration and it has a trace ID. That's really the only structural difference between a span and a log. A trace is a collection of spans.
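To make that span-versus-log comparison concrete, here is a minimal sketch using the OpenTelemetry JavaScript API. The tracer name, the endpoint, and the attribute keys are invented for illustration, and it assumes a tracer provider has already been registered elsewhere in the app.

```javascript
// Hypothetical sketch: instead of console.log("cart loaded", cart.items.length),
// emit a span - a "fancy log" that carries a duration and a trace ID.
import { trace } from '@opentelemetry/api';

// Assumes a tracer provider (e.g. a WebTracerProvider) was registered at startup.
const tracer = trace.getTracer('checkout-ui'); // tracer name is made up

async function loadCart() {
  // startActiveSpan records a start time and makes this span the active one.
  return tracer.startActiveSpan('load-cart', async (span) => {
    try {
      const res = await fetch('/api/cart'); // placeholder endpoint
      const cart = await res.json();
      // Attributes are the "rich, wide event" data you can slice on later.
      span.setAttributes({ 'cart.item_count': cart.items.length });
      return cart;
    } finally {
      span.end(); // the duration between start and end is what a plain log lacks
    }
  });
}
```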
What tracing allows you to do is correlate certain pieces of data together. Taking that a bit further, if you wanted to console log things out that are related to each other, imagine you gave them all an ID and be like, you belong to the same thing that I want to know about. It'll all be correlated under the same trace, and you can visualize this set of data that you want to look at all in the same place. That's how I think about tracing.
[0:19:36] LA: The trace ID is the key there, because that's what allows you to correlate it all together. You can do things like, a user clicked this button on the screen, and know exactly which database transaction that corresponded to.
[0:19:49] PK: Exactly. That's where the front-end to backend connection can really come in, because it can all be part of the same trace through context propagation, by just attaching a header to all of your network requests. It basically just attaches a header that has the trace ID and the parent span ID on it. That's what allows you to correlate something from your front-end to your backend.
[0:20:19] LA: Tracing specifically with OpenTelemetry is really the core technology that has allowed us to merge the old-style RUM offering into a more connected observability platform that's tied into the rest of our observability ecosystem. Is that a fair statement, or is that what you would say? Or would you say that differently?
[0:20:42] PK: I think that's a fair statement. I also do think it has a longer way to go. One challenge that still exists within browser telemetry through OpenTelemetry is not being able to automatically correlate it into its own set of traces, without doing the manual work to say, "Hey, this is the trace ID that I want this span to belong to," just because it is hard in browser processes to keep context of what happened where. Because if you think about it, it's not like a distributed system of service A calls service B, calls database A and then returns a bunch of data. You can have multiple things happening in a web app that might be unrelated. A user could be clicking somewhere, but also there might be background data fetching and there might be a service worker. How do you correlate all of that data together? That is a very interesting and emergent problem that we're thinking about within the OpenTelemetry community.
[0:21:44] LA: For instance, how does - when you click a button and you get a delay in some processing that you weren't expecting, was it related to part of that trace all the way back to a slow database query, or was it related to a background task running in the user's browser doing some other unrelated activity? Or was it, you're running on the user's computer now. Is it something else that the user's computer is doing that's unrelated to anything you're doing? You still don't have that piece of information, or you have limited access to that type of information.
[0:22:17] PK: Yeah, exactly. It could be tough, because traces have this idea of a duration, like the root span ends at some point. That can be really tough in browsers. What do you consider to be a root span? Do you want to put everything that ever happened in that - that the user was doing in one trace? When does that trace end? That's a really hard question to think about in terms of tracing.
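For readers who want to see roughly what that header propagation looks like in code, here is a hedged sketch using the OpenTelemetry web SDK's fetch instrumentation. The package names are real OpenTelemetry packages; the origin URL is a placeholder, and exporter setup is omitted for brevity.

```javascript
// Sketch of browser-to-backend context propagation. Assumes these packages are
// installed; in real usage you would also configure a span processor/exporter.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

const provider = new WebTracerProvider();
provider.register(); // installs the global tracer provider and default W3C propagator

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Only requests to these origins get the W3C `traceparent` header, which
      // carries the trace ID and parent span ID so the backend joins the same trace.
      propagateTraceHeaderCorsUrls: [/https:\/\/api\.example\.com/],
    }),
  ],
});

// After this, fetch('https://api.example.com/cart') is wrapped in a span and the
// request carries a header like: traceparent: 00-<trace-id>-<parent-span-id>-01
```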
I think, really, for browser telemetry, what takes it to the next level is having short-lived traces and very specific traces, like network requests that are connected to a user event - say, a button click network request that traces through to your backend, and that's that trace - but really being able to surface a collection of traces through something that we can correlate, and that's probably something more like a session ID.
[0:23:13] LA: It's hierarchical traces, all the way up to a session indicator. We're not there yet with OpenTelemetry. Is that correct? Are we getting there?
[0:23:21] PK: It's getting there. We're actively working on it as part of the community. One thing that we did at Honeycomb is release a distribution, which is a wrapper, essentially, of the OpenTelemetry web API, which does add a session ID to everything. But that is coming upstream as well.
[0:23:40] LA: Is that different than your typical user login session ID, or is it tied to that, or is it a completely different concept?
[0:23:49] PK: That's a great question. It's a separate concept, because what we're doing is essentially an anonymous session ID. It's a session ID that OpenTelemetry, or the Honeycomb OpenTelemetry distribution, is assigning to one browser session. Most apps will have their own concept of a session. We really do encourage it, and attributes are really free to add. You could just add an attribute that is your particular logged-in session as well. It is important to have both the concept of an anonymous session, as well as a logged-in session. You know, you want that difference, because sometimes users aren't always logged into a particular service. Or they might be doing things across different devices, where they might have the same logged-in session, but you still might want them to have a different anonymous session to delineate between those devices.
[0:24:45] LA: Let's go a step deeper on this now. Let's talk about a specific type of problem that is well suited for this front-end OpenTelemetry. The way things work right now, not the future state, but how things work right now. What type of problem is really well suited for being solved, or being found and discovered and repaired, using a front-end observability platform with OpenTelemetry? What's an example, or some examples, of the types of problems that work well for that?
[0:25:19] PK: I can think of a few. I think, starting with something that we've already touched on a little bit: really well-instrumented network requests. It is easy to instrument all of your outgoing network requests and use context propagation to connect those to your backend requests, to really just see how performant - You can also tie that to user events, like a button click, or a page load. You can really see how is my page load being affected by my network requests. You can also easily see at a glance how many network requests your application is making, when it's making them, how long that's taking, if there's any optimizations to be made. It's very, very easy at a glance to see like, "Hey, I should really just create a batch endpoint for this set of requests that is all the same." I would say, network activity is really, really well solved right now. It brings about a lot of really good insights. It connects to the rest of your system and can not only help in the debugging process, but also in your optimization process as well. Another problem, I think, that's solved quite well through using OpenTelemetry is debugging core web vitals.
This has not quite landed in the OpenTelemetry repo, but it is available through the Honeycomb distribution and will be available in the broader OTel community soon as well: instrumentation for your core web vitals that takes it beyond just, what are these values? What is my LCP score exactly? Actually, because it's based on real users, it's not synthetic, it's not somebody running a Lighthouse score that might be different from one run to the next, you can actually get really good real user data and take it a step further by adding attribution. It's not even just like, this is my LCP score. It's over three seconds and Google is telling me that this is a poor score. I don't know what to do about it. It can actually tell you, this is the element that contributed to the poor score. You can identify which element contributed to a delayed input for the new metric, INP, which measures interactivity really well. It's helpful to know, hey, this button is really slow to respond, and we know which button it is. We might not even get user reports for that, but we can see it happening. We can optimize that even before somebody complains about it.
[0:28:08] LA: Ten years ago, how would we have done this? Maybe ten years is too long. Maybe five is the better number. Obviously, you go back far enough and we solved it, because we didn't have front-end interactivity. Staying away from that, how would we have diagnosed a front-end problem like that before we had solid OpenTelemetry and the tools that go with it, and more than just basic RUM?
[0:28:35] PK: I think, and I can maybe speak from personal experience debugging these kinds of problems. Usually, let's say, let's take the example of a slow page load. Maybe we would get some customer report that hey, this customer is saying their page load is really, really slow. It's 20 seconds. They cannot use this app. I'm on the team that's going to deal with it, so I'm like, okay, great. My first step is I've got to reproduce this. That is always the hardest part about debugging front ends, because there are so many different people using so many different devices on so many different types of connections. Personally, as a developer on a usually pretty good machine on a really fast internet connection, even with the throttling tools that are available, a lot of the time I'd be like, "Hmm. Works on my machine." Then I'd maybe get somebody else on my team to be like, "Hey, are you seeing this issue? I can't reproduce it. Can you try?" We all try, and everyone's just like, "Ah. Yeah, I couldn't reproduce it. I tried all these things. I couldn't do it. Can you get them to send me their screenshots of their network tab, or their console?" You're playing this weird game of broken telephone. But somebody is having a real problem. You can't connect with that. That's what makes reproducing it so key. Then if you fast forward to now, being very sure that, hey, not only is this happening, I can see that it's happening in my data, I know what contributed to it - that is the big leap. Whereas, I think, 10 years ago, it was really, really down to, can I reproduce this locally? That takes a lot of time and a lot of care and a lot of effort from teams, to be able to reproduce really, really gnarly bugs.
[0:30:31] LA: Right. We talked about mean time to repair. One of the big contributors in the olden days to mean time to repair is, can you reproduce it?
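As a rough illustration of the kind of real-user web vitals attribution Purvi describes, here is a sketch using the open-source web-vitals library's attribution build alongside the OpenTelemetry API. The span and attribute names are invented, and the exact attribution fields (element for LCP, interactionTarget for INP) vary by metric and by library version, so treat this as an assumption-laden example rather than the Honeycomb distribution's implementation.

```javascript
// Hedged sketch: report LCP and INP from real users, with the element that
// contributed, as spans you can query later. Names here are illustrative only.
import { onLCP, onINP } from 'web-vitals/attribution';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('web-vitals'); // arbitrary tracer name

function report(metric) {
  const span = tracer.startSpan(`web-vital.${metric.name}`);
  span.setAttributes({
    'vital.name': metric.name,     // "LCP" or "INP"
    'vital.value': metric.value,   // milliseconds
    'vital.rating': metric.rating, // "good" | "needs-improvement" | "poor"
    // Attribution points at the responsible element, not just the score.
    // Field names differ by metric and web-vitals version, hence the fallback.
    'vital.element':
      metric.attribution.element ?? metric.attribution.interactionTarget ?? 'unknown',
  });
  span.end();
}

onLCP(report); // largest contentful paint, with the contributing element
onINP(report); // interaction to next paint, with the slow interaction's target
```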
Really, what observability in general allows you to do is to not have to worry about reproducing it, because you can see it happening. Now, we can do this for front-end systems as well.
[0:30:51] PK: Yeah, that's the goal. I think it's a little different on the front-end always, which is that I think there will always be a portion where you need to see how did the user get into that state, because there are so many different ways to get into that state. That's what I think about as the driving factor of how we should build front-end observability tools: how can we get closer and closer to developers not having to physically reproduce that issue?
[0:31:19] LA: I like that you could have stopped that sentence with, the front-end is just different, and then just left it right there, and that would have explained 99% of what goes on, I think, in front-end development. That's interesting. Let's talk a little bit about Honeycomb in particular. Now, Honeycomb is a full-stack observability platform. Front-end observability was a relatively recent addition to what you've been working on. Is that correct? You want to talk a little bit more about that?
[0:31:45] PK: Yeah, absolutely. I think there's a little bit of personal history there, too, on how this has all come about. Although it seems recent, it is something that we've been thinking about as a company for the better part of two years. It was a big reason why I joined Honeycomb. I've always been a huge fan of Honeycomb and the way that we do observability. Around two years ago, somebody reached out to me and was just like, "Hey, we're thinking about solving, trying to get into the front-end observability space and really solve it the Honeycomb way, and not make yet another RUM tool." That was really, really exciting to me. Really, the first part of that, for the first year or so, was just getting really familiar with the OpenTelemetry space and what was available in terms of browser support from OpenTelemetry, which was really, really experimental two years ago. It's come a long way and it still has a long way to go, but it is much more production ready today than it was a couple of years ago. I do understand why it took us a long time to get it there, because an SDK has to be something that folks are willing to put in their apps, that is not as experimental and is of a reasonable bundle size and things like that. That took a little bit of time and community work. Then really, it was starting to think about getting into the space more as a product offering, which is really like, how can we bring what we do at Honeycomb, which is rich events, wide data, surfacing unknown unknowns, to the front-end world. That is really, how can we ingest front-end data so that folks can query their rich front-end data alongside their backend data in Honeycomb as well.
[0:33:49] LA: Is that the core difference with Honeycomb compared to other observability platforms - that you really are focusing on the expanded data set, not just what's going on, but the corresponding data around it, to make a richer data set to help you find problems? Is that a fair statement, or is there a better way to say that?
[0:34:08] PK: Yeah. I think the way that I would describe it is rich data in Honeycomb is really cheap. Adding as many attributes as you can onto your events is really what makes great, rich, wide events. Honeycomb makes that really, really easy.
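As a small, hypothetical sketch of what "just put it on the event" can look like with the OpenTelemetry JavaScript API - the attribute keys and the user/session values below are invented for illustration:

```javascript
// Hedged sketch: enrich whatever span is currently recording with user and
// session attributes so you can slice by them later. Keys are illustrative.
import { trace } from '@opentelemetry/api';

function annotateCurrentSpan(user, anonymousSessionId) {
  const span = trace.getActiveSpan();
  if (!span) return; // nothing is recording right now

  span.setAttributes({
    'app.user_id': user.id,               // logged-in identity, if any
    'app.session_id': anonymousSessionId, // anonymous browser session ID
    'app.plan': user.plan,                // anything you *might* want to query on
  });
}
```

Because attributes like these are cheap to add up front, the query-time questions (which plan, which browser, which session?) don't require shipping new telemetry and waiting for it to roll out first.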
Being able to add your user ID, your session ID, all of this useful customer data, anything you can think of, just put it on the event. Then it lives in Honeycomb. Whenever you have to deal with a problem, you don't really need to think, "Hey, can I add this attribute? Is it going to cost me something? Is it going to change something?" That really unlocks the querying power of Honeycomb. I think, what makes it unique for the front end is how well it also integrates with the rest of your system's data. The other unique thing is it is ready to be used with OpenTelemetry as the primary way of getting data into Honeycomb.
LA: You say it's cheap to add attributes. You talk about how it's easy to add attributes. Also, the storage is inexpensive, but there's still a performance cost, etc. Can you talk a little bit more about that and how that's minimized as well?
[0:35:24] PK: On the Honeycomb side, we've written about this a lot, and we have this columnar data store that makes - it's like, attributes are treated as the primary way that we get rich, wide events. We always talk about high cardinality, high dimensionality data, and adding these attributes is that high dimensionality piece. Being able to put everything you can think of, even if you're like, I don't know if I'm going to need this, and not have to worry about the cost, because we bill on events, and it's also really, really fast to query the data. It's those two things together, so you don't have to wait minutes for a query to come back, even though it's extremely rich and wide, and we - I think, Honeycomb was built that way by design.
[0:36:10] LA: Got it. Got it. Going back to a statement you made earlier, that it's just different in the front end. One of the things that makes the front end very different than the backend is the environment. The backend is typically a very controlled environment. There are your servers on your systems, or cloud servers that you've selected, whatever, but it's a very controlled environment. The front end is very much not a controlled environment. There are network effects, but there's also device effects and there's software version effects. There's browser selection effects. There's lots of things that vary that make it harder and harder. This has actually historically been a problem with front-end applications - there's so much variability. It was hard to build sophisticated applications that didn't break often in certain circumstances, so people tended to build smaller applications, not larger applications. Now, that's been changing. Certainly, there's been a lot of work in recent years on standardized JavaScript, standardized browser infrastructures that are a lot more compatible and have a lot fewer issues, plus devices having more power, etc., etc., etc. How have changes in JavaScript over the last, let's say, five to 10 years, and the implementation of those JavaScript changes in the various browsers, how has that impacted not only the front-end development world, but the ability to observe it?
[0:37:45] PK: That's a great question. I think, when I think about how much JavaScript's changed over the last 10 years, it was really only about 10 years ago, or even less than ten years ago, that ES6 as a concept came on the scene. Before that, we were all just writing really good, old-school JavaScript. The pace at which things changed also changed. Rather than things changing maybe every few years, things were changing rapidly.
Every year, there were brand new features, there were breaking features. Node became such a big ecosystem, where JavaScript was now not only running in the browser, but was really taking over distributed systems. JavaScript changed so much, so quickly, and it left lots of aftereffects, both good and bad. One thing I really think about a lot is how our ability to observe things has changed, because of how much browsers have also standardized themselves. It wasn't really that long ago. I think still in 2019, supporting Internet Explorer and older versions of Internet Explorer was still really, really normal and required, especially in enterprise environments, where people truly are using those browsers. It's just such a challenge to be a developer, to be able to develop for so many different versions of a place where your code is going to run. In most server-side cases, like you mentioned, you have so much control over the environment that it's running in. But here, you don't really have control, and it can be running in N number of different environments with N number of different supports for various JavaScript features. I think, having that calm down and coming to a place where modern browsers, as a concept, support things in a fairly uniform - I'd probably eat my words about that - but in a much more uniform way than they have in the past, has been huge for developers, because I think it was a lot of firefighting for observing your front-end systems back when you had to support all sorts of different browsers that did many, many different things. You'd have to have synthetic tests that run not just once; they have to run once in every browser, and then those browsers are completely different things.
[0:40:34] LA: The special case for IE.
[0:40:36] PK: Yes. Then doing different things on different devices. I think that's always been a challenge. That's been a challenge for being able to standardize browser APIs. But now that that's becoming better, there are better APIs that allow us to observe our browser application, like the performance timing API becoming stable. The long animation frame API that's coming up soon, which I'm really, really excited about, which I think will be a great addition once it's stable everywhere, to be able to say, hey, what is actually going on in my user input delays? Can I dig into that further? Because those are all coming out now in a stable way, and not being like, okay, well this is supported here, but if you try to run that in IE, it'll crash your app. I think, I'm really, really optimistic about future browser APIs.
[0:41:31] LA: I was so glad to hear that IE was finally obsolete. I think that was around 2019, wasn't it, when they finally made that decision?
[0:41:37] PK: Yeah. I think, that's when it was announced and then it still took a couple years before it was -
[0:41:42] LA: A couple years. Yeah. I get so tired of seeing JavaScript code for, if IE, do this; otherwise, one of these other ways works fine and it'll be okay. But if it's IE, it very much won't be.
[0:41:56] PK: Yeah.
[0:41:58] LA: That's helped a lot. That's not only from observability, as you're talking about, but also just the ability to build applications. The less you have to focus on these compatibility issues, the bigger and more complex your applications can get, because your ability to think about problems becomes focused on business logic, versus on browser logic, and that helps a lot. We haven't yet talked about security.
Security is also a very, very critical aspect in browsers in general, but observability in browsers is also a big topic with security nowadays. Is observability by itself a security issue? Can you talk a little bit about that?
[0:42:41] PK: Yeah, absolutely. The answer to that is yes, it's another vector. It's another attack vector, potentially either directly into your system, or leaking that data somewhere, which is also why it's something that I think OpenTelemetry as a project thinks about very, very deeply. It is a little bit different coming from the browser, because we're not in a place of complete control over a user's browser like we are on the server. One thing that's really, really important if you are running OpenTelemetry in production is that we always recommend that you run what is called an OpenTelemetry Collector, which is essentially a service that you run. Instead of sending your telemetry data directly to Honeycomb, where you might potentially have to expose an API key and leak data over the wire, we suggest that you run a collector in your own infrastructure, authenticate to it just like you would to any other endpoint, and send your telemetry through that way, because it is a lot more secure. There are so many things that could happen if you are sending telemetry directly without something that's running in your infrastructure.
[0:44:01] LA: It helps with CSRF issues and things like that as well, too.
[0:44:03] PK: Yeah.
[0:44:04] LA: Yeah. I know one of the things that keeps coming up is that not only is browser security a bigger issue, but using observability to help diagnose security issues, that's also a big issue. Also, it's getting harder for companies like Honeycomb to build in observability, because of all of these hurdles that you have to go through in order to even get it to work in a secure browser environment. One of the things that comes up a lot is cross-origin resource sharing, which is a huge issue when it comes to security in front-end applications. What you're suggesting is, by having all your data collected in the same place where the application backend is, you use the same tools for connecting to the backend. It just works the way you expect it to, in a normal way. You're not adding a new call to this Honeycomb API and wondering what the heck that thing is and trying to get that to work correctly and all those sorts of issues.
[0:45:05] PK: Yeah, exactly. It is a big ask. I think it can be a lot to not only set up brand-new telemetry, but you're like, "Oh, now I have to run this other service." Browser security is definitely something to take seriously. We still recommend taking those steps.
[0:45:26] LA: Is that a service you provide to your customers, this gateway agent, or whatever, that runs in your own infrastructure?
[0:45:34] PK: We don't provide it directly. There are a couple of things that you could do. The OpenTelemetry Collector is an open-source project. You would basically take that and run it yourself. What we do provide is recommended configs for your collector. Basically, you could go to our docs and be like, here's how to set up your collector, but the actual running and setting up of it would be on your own infrastructure. Then, we also provide our own sampling tool that you can use called Refinery. You might use it alongside a collector, you might use it without a collector.
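As a rough sketch of the browser side of that collector pattern, here is one way the exporter might be pointed at a collector you run yourself rather than directly at a vendor endpoint. The URL is a placeholder, the package names are real OpenTelemetry packages, and the exact provider setup varies by SDK version.

```javascript
// Hedged sketch: export OTLP traces to *your* collector endpoint so no vendor
// API key ever ships to the user's browser. The URL below is a placeholder.
import { WebTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-web';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const exporter = new OTLPTraceExporter({
  // A collector running in your own infrastructure; auth happens against this
  // endpoint (cookies, same-origin session, etc.), and the vendor API key lives
  // only in the collector's server-side config.
  url: 'https://telemetry.your-domain.example/v1/traces',
});

// Recent SDK versions accept spanProcessors in the constructor; older ones use
// provider.addSpanProcessor(new BatchSpanProcessor(exporter)) instead.
const provider = new WebTracerProvider({
  spanProcessors: [new BatchSpanProcessor(exporter)],
});
provider.register();
```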
But it is another piece of on-prem software that you can use to not only collect your data, but do some sophisticated sampling on it.
[0:46:20] LA: These are all open-source solutions. These aren't solutions that you're providing, they're solutions that you support, and you help with configs, etc., to help them get set up, but they're all open-source projects.
[0:46:31] PK: Yes.
[0:46:32] LA: Anything else you'd like to talk about? We're coming near the end of the hour here, but anything you'd like to throw in that we haven't talked about that might be important?
[0:46:41] PK: I think we covered a good amount of stuff. Yeah, nothing is jumping out at me.
[0:46:46] LA: I'd love to talk more about security in particular, but I think that is going to have to be another episode later on, in a lot more depth. We've gone over the surface with a lot of these issues. Each of these individual issues, like security, like JavaScript compatibility, etc., are major issues and major talking points, not only within the front-end community, but more specifically, the front-end observability community. Thank you. My guest today has been Purvi Kanal, who's a Senior Software Engineer at Honeycomb, which provides modern observability solutions, including their newly added, or should I say, newly enhanced, front-end observability capability. Purvi, thank you so much for joining me on Software Engineering Daily.
[0:47:34] PK: Thank you so much for having me.
[END]