Practical Guide to Database Performance - Databases (2025)

UNICSUL

Eliel Sousa 07/10/2024


Database Performance at Scale: A Practical Guide
Felipe Cardeneti Mendes · Piotr Sarna · Pavel Emelyanov · Cynthia Dunlop

ISBN-13 (pbk): 978-1-4842-9710-0
ISBN-13 (electronic): 978-1-4842-9711-7
https://doi.org/10.1007/978-1-4842-9711-7

Copyright © 2023 by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop

Open Access: This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub (https://github.com/Apress). For more detailed information, please visit https://www.apress.com/gp/services/source-code.

Felipe Cardeneti Mendes, São Paulo, Brazil
Piotr Sarna, Pruszków, Poland
Pavel Emelyanov, Moscow, Russia
Cynthia Dunlop, Carpinteria, CA, USA

To Cristina and Snow —Felipe
To Wiktoria —Piotr
To Svetlana and Mykhailo —Pavel
To David —Cynthia

Table of Contents

About the Authors
About the Technical Reviewers
Acknowledgments
Introduction

Chapter 1: A Taste of What You're Up Against: Two Tales
  Joan Dives Into Drivers and Debugging
  Joan's Diary of Lessons Learned, Part I
  The Tuning
  Joan's Diary of Lessons Learned, Part II
  Patrick's Unlucky Green Fedoras
  Patrick's Diary of Lessons Learned, Part I
  The First Spike
  Patrick's Diary of Lessons Learned, Part II
  The First Loss
  Patrick's Diary of Lessons Learned, Part III
  The Spike Strikes Again
  Patrick's Diary of Lessons Learned, Part IV
  Backup Strikes Back
  Patrick's Diary of Lessons Learned, Part V
  Summary

Chapter 2: Your Project, Through the Lens of Database Performance
  Workload Mix (Read/Write Ratio)
  Write-Heavy Workloads
  Read-Heavy Workloads
  Mixed Workloads
  Delete-Heavy Workloads
  Competing Workloads (Real-Time vs Batch)
  Item Size
  Item Type

Chapter 2: Your Project, Through the Lens of Database Performance

The specific database performance constraints and optimization opportunities your team will face vary wildly based on your specific workload, application, and business expectations. This chapter is designed to get you and your team talking about how much you can feasibly optimize your performance, spotlight some specific lessons related to common situations, and also help you set realistic expectations if you're saddled with burdens like large payload sizes and strict consistency requirements. The chapter starts by looking at technical factors, such as the read/write ratio of your workload, item size/type, and so on. Then, it shifts over to business considerations like consistency requirements and high availability expectations. Throughout, the chapter talks about database attributes that have proven to be helpful—or limiting—in different contexts.

Note: Since this chapter covers a broad range of scenarios, not everything will be applicable to your specific project and workload. Feel free to skim this chapter and focus on the sections that seem most relevant.
Workload Mix (Read/Write Ratio)

Whether it's read-heavy, write-heavy, evenly-mixed, delete-heavy, and so on, understanding and accommodating your read/write ratio is a critical but commonly overlooked aspect of database performance. Some databases shine with read-heavy workloads, others are optimized for write-heavy situations, and some are built to accommodate both. Selecting, or sticking with, one that's a poor fit for your current and future situation will be a significant burden that will be difficult to overcome, no matter how strategically you optimize everything else.

There's also a significant impact to cost. That might not seem directly related to performance, but if you can't afford (or get approval for) the infrastructure that you truly need to support your workload, this will clearly limit your performance.[1]

Tip: Not sure what your workload looks like? This is one of many situations where observability is your friend. If your existing database doesn't help you profile your workload, consider if it's feasible to try your workloads on a compatible database that enables deeper visibility.

Write-Heavy Workloads

If you have a write-heavy workload, we strongly recommend a database that stores data in immutable files (e.g., Cassandra, ScyllaDB, and others that use LSM trees).[2] These databases optimize write speed because: 1) writes are sequential, which is faster in terms of disk I/O, and 2) writes are performed immediately, without first worrying about reading or updating existing values (like databases that rely on B-trees do). As a result, you can typically write a lot of data with very low latencies.

However, if you opt for a write-optimized database, be prepared for higher storage requirements and the potential for slower reads. When you work with immutable files, you'll need sufficient storage to keep all the immutable files that build up until compaction runs.[3] You can mitigate the storage needs to some extent by choosing compaction strategies carefully. Plus, storage is relatively inexpensive these days.

[1] With write-heavy workloads, you can easily spend millions per month with Bigtable or DynamoDB. Read-heavy workloads are typically less costly in these pricing models.
[2] If you want a quick introduction to LSM trees and B-trees, see Appendix A. Chapter 4 also discusses B-trees in more detail.
[3] Compaction is a background process that databases with an LSM tree storage backend use to merge and optimize the shape of the data. Since files are immutable, the process essentially involves picking up two or more pre-existing files, merging their contents, and producing a sorted output file.

The potential for read amplification is generally a more significant concern with write-optimized databases (given all the files to search through, more disk reads are required per read request). But read performance doesn't necessarily need to suffer.
You can often minimize this tradeoff with a write-optimized database that implements its own caching subsystem (as opposed to those that rely on the operating system's built-in cache), enabling fast reads to coexist alongside extremely fast writes. Bypassing the underlying OS with a performance-focused built-in cache should speed up your reads nicely, to the point where the latencies are nearly comparable to read-optimized databases.

With a write-heavy workload, it's also essential to have extremely fast storage, such as NVMe drives, if your peak throughput is high. Having a database that can theoretically store values rapidly ultimately won't help if the disk itself can't keep pace.

Another consideration: beware that write-heavy workloads can result in surprisingly high costs as you scale. Writes cost around five times more than reads under some vendors' pricing models. Before you invest too much effort in performance optimizations, and so on, it's a good idea to price your solution at scale and make sure it's a good long-term fit.

Read-Heavy Workloads

With read-heavy workloads, things change a bit. B-tree databases (such as DynamoDB) are optimized for reads (that's the payoff for the extra time required to update values on the write path). However, the advantage that read-optimized databases offer for reads is generally not as significant as the advantage that write-optimized databases offer for writes, especially if the write-optimized database uses internal caching to make up the difference (as noted in the previous section).

Careful data modeling will pay off in spades for optimizing your reads. So will careful selection of read consistency (are eventually consistent reads acceptable as opposed to strongly consistent ones?), locating your database near your application, and performing a thorough analysis of your query access patterns. Thinking about your access patterns is especially crucial for success with a read-heavy workload. Consider aspects such as the following:

• What is the nature of the data that the application will be querying most frequently? Does it tolerate potentially stale reads or does it require immediate consistency?
• How frequently is it accessed (e.g., is it frequently-accessed "hot" data that is likely cached, or is it rarely-accessed "cold" data)?
• Does it require aggregations, JOINs, and/or querying flexibility on fields that are not part of your primary key component?
• Speaking of primary keys, what is the level of cardinality?

For example, assume that your use case requires dynamic querying capabilities (such as type-ahead use cases, report-building solutions, etc.) where you frequently need to query data from columns other than your primary/hash key component. In this case, you might find yourself performing full table scans all too frequently, or relying on too many indexes. Both of these, in one way or another, may eventually undermine your read performance.
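To make the read-consistency choice mentioned above concrete, here is a minimal sketch using the open-source Python driver for Cassandra-compatible databases. The cluster address, keyspace, tables, and columns are hypothetical; whether ONE is acceptable depends entirely on how stale a read your use case can tolerate.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point and keyspace; adjust for your environment.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("catalog")

# Latency-friendly read: a single replica may answer, so the result can be slightly stale.
product_by_id = SimpleStatement(
    "SELECT name, price FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# Read that must reflect the latest acknowledged write: a majority of replicas must agree.
balance_by_account = SimpleStatement(
    "SELECT balance FROM accounts WHERE account_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

print(session.execute(product_by_id, ["sku-123"]).one())
print(session.execute(balance_by_account, ["acct-42"]).one())
```

Setting consistency per query rather than per application lets a mostly relaxed read path pay the quorum price only where stale data is truly unacceptable.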
On the infrastructure side, selecting servers with high memory footprints is key for enabling low read latencies if you will mostly serve data that is frequently accessed. On the other hand, if your reads mostly hit cold data, you will want a nice balance between your storage speeds and memory. In fact, many distributed databases typically reserve some memory space specifically for caching indexes; this way, reads that inevitably require going to disk won't waste I/O by scanning through irrelevant data.

What if the use case requires reading from both hot and cold data at the same time? And what if you have different latency requirements for each set of data? Or what if you want to mix a real-time workload on top of your analytics workload for the very same dataset? Situations like this are quite common. There's no one-size-fits-all answer, but here are a few important tips:

• Some databases will allow you to read data without polluting your cache (e.g., filling it up with data that is unlikely to be requested again). Using such a mechanism is especially important when you're running large scans while simultaneously serving real-time data. If the large scans were allowed to override the previously cached entries that the real-time workload required, those reads would have to go through disk and get repopulated into the cache again. This would effectively waste precious processing time and result in elevated latencies (see the sketch after this list).
• For use cases requiring a distinction between hot/cold data storage (for cost savings, different latency requirements, or both), solutions using tiered storage (a method of prioritizing data storage based on a range of requirements, such as performance and costs) are likely a good fit.
• Some databases will permit you to prioritize some workloads over others. If that's not sufficient, you can go one step further and completely isolate such workloads logically.[4]

[4] The "Competing Workloads" section later in this chapter, as well as the "Workload Isolation" section in Chapter 8, cover a few options for prioritizing and separating workloads.
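As a sketch of the first tip above, ScyllaDB (as one example) lets a statement opt out of populating the row cache. The schema below is invented, and BYPASS CACHE is a ScyllaDB-specific CQL extension rather than a standard clause, so treat this as an illustration of the idea rather than a portable recipe.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")

# Hypothetical full scan for a batch job. BYPASS CACHE asks the server not to populate the
# cache with rows from this scan, so hot, user-facing entries are not evicted by cold data.
rows = session.execute(
    "SELECT user_id, event_type, ts FROM events BYPASS CACHE",
    timeout=120,
)

count = 0
for row in rows:
    count += 1   # stand-in for the real batch computation
print(count, "rows scanned")
```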
Note: You might not need all your reads. At ScyllaDB, we've come across a number of cases where teams are performing reads that they don't really need. For example, by using a read-before-write approach to avoid race conditions where multiple clients are trying to update the same value with different updates at the same time. The details of the solution aren't relevant here, but it is important to note that, by rethinking their approach, they were able to shave latencies off their writes as well as speed up the overall response by eliminating the unnecessary read. The moral here: getting new eyes on your existing approaches might surface a way to unlock unexpected performance optimizations.

Mixed Workloads

More evenly mixed access patterns are generally even more complex to analyze and accommodate. In general, the reason that mixed workloads are so complex in nature is due to the fact that there are two competing workloads from the database perspective. Databases are essentially made for just two things: reading and writing. The way that different databases handle a variety of competing workloads is what truly differentiates one solution from another. As you test and compare databases, experiment with different read/write ratios so you can adequately prepare yourself for scenarios when your access patterns may change.

Be sure to consider nuances like whether your reads are from cold data (data not often accessed) or hot data (data that's accessed often and likely cached). Analytics use cases tend to read cold data frequently because they need to process large amounts of data. In this case, disk speeds are very important for overall performance. Plus, you'll want a comfortably large amount of memory so that the database's cache can hold the data that you need to process. On the other hand, if you frequently access hot data, most of your data will be served from the cache, in such a way that the disk speeds become less important (although not negligible).

Tip: Not sure if your reads are from cold or hot data? Take a look at the ratio of cache misses in your monitoring dashboards. For more on monitoring, see Chapter 10.

If your ratio of cache misses is higher than hits, this means that reads need to frequently hit the disks in order to look up your data. This may happen because your database is underprovisioned in memory space, or simply because the application access patterns often read infrequently accessed data. It is important to understand the performance implications here. If you're frequently reading from cold data, there's a risk that I/O will become the bottleneck—for writes as well as reads. In that case, if you need to improve performance, adding more nodes or switching your storage medium to a faster solution could be helpful.

As noted earlier, write-optimized databases can improve read latency via internal caching, so it's not uncommon for a team with, say, 60 percent reads and 40 percent writes to opt for a write-optimized database. Another option is to boost read performance with a write-optimized database: If your database supports it, dedicate extra "shares" of resources to the reads so that your read workload is prioritized when there is resource contention.

Delete-Heavy Workloads

What about delete-heavy workloads, such as using your database as a durable queue (saving data from a producer until the consumer accesses it, deleting it, then starting the cycle over and over again)? Here, you generally want to avoid databases that store data in immutable files and use tombstones to mark rows and columns that are slated for deletion. The most notable examples are Cassandra and other Cassandra-compatible databases.

Tombstones consume cache space and disk resources, and the database needs to search through all these tombstones to reach the live data. For many workloads, this is not a problem. But for delete-heavy workloads, generating an excessive amount of tombstones will, over time, significantly degrade your read latencies.
There are ways and mechanisms to mitigate the impact of tombstones.[5] However, in general, if you have a delete-heavy workload, it may be best to use a different database.

It is important to note that occasional deletes are generally fine on Cassandra and Cassandra-compatible databases. Just be aware of the fact that deletes on append-only databases result in tombstone writes. As a result, these may incur read amplification, elevating your read latencies. Tombstones and data eviction in these types of databases are potentially long and complex subjects that perhaps could have their own dedicated chapter. However, the high-level recommendation is to exercise caution if you have a potentially delete-heavy pattern that you might later read from, and be sure to combine it with a compaction strategy tailored for efficient data eviction.

All that being said, it is interesting to note that some teams have successfully implemented delete-heavy workloads on top of Cassandra and Cassandra-like databases. The performance overhead carried by tombstones is generally circumvented by a combination of data modeling, a careful study of how deletes are performed, avoiding reads that potentially scan through a large set of deleted data, and careful tuning over the underlying table's compaction strategy to ensure that tombstones get evicted in a timely manner. For example, Tencent Games used the Time Window Compaction Strategy to aggressively expire tombstones and use it as the foundation for a time series distributed queue.[6]

[5] For some specific recommendations, see the DataStax blog, "Cassandra Anti-Patterns: Queues and Queue-like Datasets" (www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets)
[6] See the article, "Tencent Games' Real-Time Event-Driven Analytics System Built with ScyllaDB + Pulsar" (https://www.scylladb.com/2023/05/15/tencent-games-real-time-event-driven-analytics-systembuilt-with-scylladb-pulsar/)
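To make that combination concrete, here is a hedged sketch of what a queue-like, time-series table might look like, issued through the Python driver. The schema and every option value are hypothetical; the right window size, TTL, and grace period depend on your ingestion rate and on how quickly consumers drain the data.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("queues")

# Hypothetical time-bucketed queue table. Rows expire via TTL rather than explicit deletes,
# and Time Window Compaction Strategy groups data by time so that entire expired windows
# can be dropped without scanning through piles of tombstones.
session.execute("""
    CREATE TABLE IF NOT EXISTS task_queue (
        shard int,
        bucket timestamp,
        task_id timeuuid,
        payload blob,
        PRIMARY KEY ((shard, bucket), task_id)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': 1
    }
    AND default_time_to_live = 3600
    AND gc_grace_seconds = 10800
""")
```

Pairing TTL-based expiry with a time-windowed compaction strategy is one example of the "compaction strategy tailored for efficient data eviction" described above.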
Competing Workloads (Real-Time vs Batch)

If you're working with two different types of workloads—one more latency-sensitive than the other—the ideal solution is to have the database dedicate more resources to the more latency-sensitive workloads to keep them from faltering due to insufficient resources. This is commonly the case when you are attempting to balance OLTP (real-time) workloads, which are user-facing and require low latency responses, with OLAP (analytical) workloads, which can be run in batch mode and are more focused on throughput (see Figure 2-1). Or, you can prioritize analytics. Both are technically feasible; it just boils down to what's most important for your use case.

Figure 2-1. OLTP vs OLAP workloads

For example, assume you have a web server database with analytics. It must support two workloads:

• The main workload consists of queries triggered by a user clicking or navigating on some areas of the web page. Here, users expect high responsiveness, which usually translates to requirements for low latency. You need low timeouts with load shedding as your overload response, and you would like to have a lot of dedicated resources available whenever this workload needs them.
• A second workload drives analytics being run periodically to collect some statistics or to aggregate some information that should be presented to users. This involves a series of computations. It's a lot less sensitive to latency than the main workload; it's more throughput oriented. You can have fairly large timeouts to accommodate for always full queues. You would like to throttle requests under load so the computation is stable and controllable. And finally, you would like the workload to have very few dedicated resources and use mostly unused resources to achieve better cluster utilization.

Running on the same cluster, such workloads would be competing for resources. As system utilization rises, the database must strictly prioritize which activities get what specific share of resources under contention. There are a few different ways you can handle this. Physical isolation, logical isolation, and scheduled isolation can all be acceptable choices under the right circumstances. Chapter 8 covers these options.
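On the client side, the different timeout expectations of those two workloads can at least be expressed per request. Below is a sketch using the Python driver's execution profiles; the profile names and values are illustrative assumptions, and real resource isolation still has to come from the database or from separate clusters, as Chapter 8 discusses.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Hypothetical profiles: tight timeouts for the user-facing workload, generous ones for the
# periodic analytics queries so they queue and throttle instead of failing user requests.
profiles = {
    EXEC_PROFILE_DEFAULT: ExecutionProfile(
        request_timeout=0.05,    # 50ms: shed load quickly for the latency-sensitive path
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    ),
    "analytics": ExecutionProfile(
        request_timeout=30.0,    # throughput-oriented batch work tolerates long waits
        consistency_level=ConsistencyLevel.ONE,
    ),
}

cluster = Cluster(["127.0.0.1"], execution_profiles=profiles)
session = cluster.connect("webapp")

# User-facing query runs under the default, low-latency profile...
session.execute("SELECT * FROM sessions WHERE session_id = %s", ["abc"])
# ...while the periodic aggregation explicitly opts into the analytics profile.
session.execute("SELECT count(*) FROM page_views", execution_profile="analytics")
```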
Item Size

The size of the items you are storing in the database (average payload size) will dictate whether your workload is CPU bound or storage bound. For example, running 100K OPS with an average payload size of 90KB is much different than achieving the same throughput with a 1KB payload. Higher payloads require more processing, I/O, and network traffic than smaller payloads.

Without getting too deep into database internals here, one notable impact is on the page cache. Assuming a default page cache size of 4KB, the database would have to serve several pages for the largest payload—that's much more I/O to issue, process, merge, and serve back to the application clients. With the 1KB example, you could serve it from a single-page cache entry, which is less taxing from a compute resource perspective. Conversely, having a large number of smaller-sized items may introduce CPU overhead compared to having a smaller number of larger items because the database must process each arriving item individually.

In general, the larger the payload gets, the more cache activity you will have. Most write-optimized databases will store your writes in memory before persisting that information to the disk (in fact, that's one of the reasons why they are write-optimized). Larger payloads deplete the available cache space more frequently, and this incurs a higher flushing activity to persist the information on disk in order to release space for more incoming writes. Therefore, more disk I/O is needed to persist that information. If you don't size this properly, it can become a bottleneck throughout this repetitive process.

When you're working with extremely large payloads, it's important to set realistic latency and throughput expectations. If you need to serve 200KB payloads, it's unlikely that any database will enable you to achieve single-digit millisecond latencies. Even if the entire dataset is served from cache, there's a physical barrier between your client and the database: networking. The network between them will eventually throttle your transfer speeds, even with an insanely fast client and database. Eventually, this will impact throughput as well as latency. As your latency increases, your client will eventually throttle down and you won't be able to achieve the same throughput that you could with smaller payload sizes. The requests would be stalled, queuing in the network.[7]

[7] There are alternatives to this; for example, RDMA, DPDK, and other solutions. However, most use cases do not require such solutions, so they are not covered in detail here.

Generally speaking, databases should not be used to store large blobs. We've seen people trying to store gigabytes of data within a single key in a database—and this isn't a great idea. If your item size is reaching this scale, consider alternative solutions. One solution is to use CDNs. Another is to store the largest chunk of your payload size in cold storage like Amazon S3 buckets, Google Cloud storage, or Azure blob storage. Then, use the database as a metadata lookup: It can read the data and fetch an identifier that will help find the data in that cold storage. For example, this is the strategy used by a game developer converting extremely large (often in the gigabyte range) content to popular gaming platforms. They store structured objects with blobs that are referenced by a content hash. The largest payload is stored within a cloud vendor Object Storage solution, whereas the content hash is stored in a distributed NoSQL database.[8]

[8] For details, see the Epic Games talk, "Using ScyllaDB for Distribution of Game Assets in Unreal Engine" (www.youtube.com/watch?v=aEgP9YhAb08).
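A minimal sketch of that metadata-lookup pattern might look like the following. It assumes an S3-compatible object store accessed through boto3 and a hypothetical assets table; the bucket, keyspace, and schema are all invented for illustration.

```python
import hashlib

import boto3
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("game_assets")
BUCKET = "asset-blobs"   # hypothetical bucket name

def store_asset(asset_id: str, blob: bytes) -> str:
    """Put the bulky payload in object storage; keep only a small metadata row in the database."""
    content_hash = hashlib.sha256(blob).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=content_hash, Body=blob)
    session.execute(
        "INSERT INTO assets (asset_id, content_hash, size_bytes) VALUES (%s, %s, %s)",
        (asset_id, content_hash, len(blob)),
    )
    return content_hash

def load_asset(asset_id: str) -> bytes:
    """The database serves the fast lookup; the large bytes come from cold storage."""
    row = session.execute(
        "SELECT content_hash FROM assets WHERE asset_id = %s", (asset_id,)
    ).one()
    return s3.get_object(Bucket=BUCKET, Key=row.content_hash)["Body"].read()
```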
Note that some databases impose hard limits on item size. For example, DynamoDB currently has a maximum item size of 400KB. This might not suit your needs. On top of that, if you're using an in-memory solution such as Redis, larger keys will quickly deplete your memory. In this case, it might make sense to hash/compress such large objects prior to storing them.

No matter which database you choose, the smaller your payload, the greater your chances of introducing memory fragmentation. This might reduce your memory efficiency, which might in turn elevate costs because the database won't be able to fully utilize its available memory.

Item Type

The item type has a large impact on compression, which in turn impacts your storage utilization. If you're frequently storing text, expect to take advantage of a high compression ratio. But, that's not the case for random and uncommon blob sequences. Here, compression is unlikely to make a measurable reduction in your storage footprint. If you're concerned about your use case's storage utilization, using a compression-friendly item type can make a big difference.

If your use case dictates a certain item type, consider databases that are optimized for that type. For example, if you need to frequently process JSON data that you can't easily transform, a document database like MongoDB might be a better option than a Cassandra-compatible database. If you have JSON with some common fields and others that vary based on user input, it might be complicated—though possible—to model them in Cassandra. However, you'd incur a penalty from the serialization/deserialization overhead required on the application side.

As a general rule of thumb, choose the data type that's the minimum needed to store the type of data you need. For example, you don't need to store a year as a bigint. If you define a field as a bigint, most databases allocate relevant memory address spaces for holding it. If you can get by with a smaller int type, do it—you'll save bytes of memory, which could add up at scale. Even if the database you use doesn't pre-allocate memory address spaces according to data types, choosing the correct one is still a nice way to have an organized data model—and also to avoid future questions around why a particular data type was chosen as opposed to another.

Many databases support additional item types which suit a variety of use cases. Collections, for example, allow you to store sets, lists, and maps (key-value pairs) under a single column in wide column databases. Such data types are often misused, and lead to severe performance problems. In fact, most of the data modeling problems we've come across involve misuse of collections. Collections are meant to store a small amount of information (such as phone numbers of an individual or different home/business addresses). However, collections with hundreds of thousands of entries are unfortunately not as rare as you might expect. They end up introducing a severe de-serialization overhead on the database. At best, this translates to higher latencies. At worst, this makes the data entirely unreadable due to the latency involved when scanning through the high number of items under such columns.

Some databases also support user created fields, such as User-Defined Types (UDTs) in Cassandra. UDTs can be a great ally for reducing the de-serialization overhead when you combine several columns into one. Think about it: Would you rather de-serialize four Boolean columns individually or a single column with four Boolean values? UDTs will typically shine on deserializing several values as a single column, which may give you a nice performance boost.[9] Just like collections, however, UDTs should not be misused—and misusing UDTs can lead to the same severe impacts that are incurred by collections.

[9] For some specific examples of how UDTs impact performance, see the performance benchmark that ScyllaDB performed with different UDT sizes against individual columns: "If You Care About Performance, Employ User Defined Types" (https://www.scylladb.com/2017/12/07/performance-udt/)

Note: UDTs are quite extensively covered in Chapter 6.
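To ground the data type and UDT points above, here is a hedged CQL sketch executed through the Python driver. The schema is invented: the release year fits comfortably in a smallint rather than a bigint, and four related flags are grouped into a single user-defined type so they are serialized and deserialized as one column instead of four.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("catalog")

# Hypothetical type and table. The 'flags' UDT replaces four separate boolean columns,
# and smallint is plenty for a year value, saving a few bytes per row at scale.
session.execute("""
    CREATE TYPE IF NOT EXISTS flags (
        active boolean,
        verified boolean,
        promoted boolean,
        archived boolean
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS titles (
        title_id uuid PRIMARY KEY,
        release_year smallint,
        state frozen<flags>
    )
""")
```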
Dataset Size

Knowing your dataset size is important for selecting appropriate infrastructure options. For example, AWS cloud instances have a broad array of NVMe storage offerings. Having a good grasp of how much storage you need can help you avoid selecting an instance that causes performance to suffer (if you end up with insufficient storage) or that's wasteful from a cost perspective (if you overprovision).

It's important to note that your selected storage size should not be equal to your total dataset size. You also need to factor in replication and growth—plus steer clear of 100 percent storage utilization.

For example, let's assume you have 3TB of already compressed data. The bare minimum to support a workload is your current dataset size multiplied by your anticipated replication. If you have 3TB of data with the common replication factor of three, that gives you 9TB. If you naively deployed this on three nodes supporting 3TB of data each, you'd hit near 100 percent disk utilization which, of course, is not optimal.

Instead, if you factor in some free space and minimal room for growth, you'd want to start with at least six nodes of that size—each storing only 1.5TB of data. This gives you around 50 percent utilization. On the other hand, if your database cannot support that much data per node (every database has a limit) or if you do not foresee much future data growth, you could have six nodes supporting 2TB each, which would store approximately 1.5TB per replica under a 75 percent utilization. Remember: Factoring in your growth is critical for avoiding unpleasant surprises in production, from an operational as well as a budget perspective.
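That arithmetic is easy to script. The sketch below simply restates the worked example (3TB of compressed data, a replication factor of three, and a utilization ceiling); the numbers come from the text, not from any sizing rule of a particular database.

```python
def nodes_needed(compressed_tb: float, replication_factor: int,
                 per_node_tb: float, max_utilization: float) -> int:
    """Smallest node count that keeps each node at or below the target disk utilization."""
    total_tb = compressed_tb * replication_factor       # dataset including all replicas
    usable_per_node_tb = per_node_tb * max_utilization  # headroom for growth and compaction
    return int(-(-total_tb // usable_per_node_tb))      # ceiling division

# 3TB compressed, RF=3, 3TB of storage per node, stay around 50% utilization -> 6 nodes
print(nodes_needed(3, 3, 3, 0.50))
# Same data on 2TB nodes at ~75% utilization -> 6 nodes holding ~1.5TB each
print(nodes_needed(3, 3, 2, 0.75))
```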
Note: We very intentionally discussed the dataset size from a compressed data standpoint. Be aware that some database vendors measure your storage utilization with respect to uncompressed data. This often leads to confusion. If you're moving data from one database solution to another and your data is uncompressed (or you're not certain it's compressed), consider loading a small fraction of your total dataset beforehand in order to determine its compression ratio. Effective compression can dramatically reduce your storage footprint.

If you're working on a very fluid project and can't define or predict your dataset size, a serverless database deployment model might be a good option to provide easy flexibility and scaling. But, be aware that rapid increases in overall dataset size and/or IOPS (depending on the pricing model) could cause the price to skyrocket exponentially. Even if you don't explicitly pay a penalty for storing a large dataset, you might be charged a premium for the many operations that are likely associated with that large dataset. Serverless is discussed more in Chapter 7.

Throughput Expectations

Your expected throughput and latency should be your "north star" from database and infrastructure selection all the way to monitoring. Let's start with throughput.

If you're serious about database performance, it's essential to know what throughput you're trying to achieve—and "high throughput" is not an acceptable answer. Specifically, try to get all relevant stakeholders' agreement on your target number of peak read operations per second and peak write operations per second for each workload.

Let's unravel that a little. First, be sure to separate read throughput vs write throughput. A database's read path is usually quite distinct from its write path. It stresses different parts of the infrastructure and taps different database internals. And the client/user experience of reads is often quite different than that of writes. Lumping them together into a meaningless number won't help you much with respect to performance measurement or optimization. The main use for average throughput is in applying Little's Law (more on that in the "Concurrency" section a little later in this chapter).

Another caveat: The same database's past or current throughput with one use case is no guarantee of future results with another—even if it's the same database hosted on identical infrastructure. There are too many different factors at play (item size, access patterns, concurrency... all the things in this chapter, really). What's a great fit for one use case could be quite inappropriate for another.

Also, note the emphasis on peak operations per second. If you build and optimize with an average in mind, you likely won't be able to service beyond the upper ranges of that average. Focus on the peak throughput that you need to sustain to cover your core needs and business patterns—including surges. Realize that databases can often "boost" to sustain short bursts of exceptionally high load. However, to be safe, it's best to plan for your likely peaks and reserve boosting for atypical situations.

Also, be sure not to confuse concurrency with throughput. Throughput is the speed at which the database can perform read or write operations; it's measured in the number of read or write operations per second. Concurrency is the number of requests that the client sends to the database at the same time (which, in turn, will eventually translate to a given number of concurrent requests queuing at the database for execution). Concurrency is expressed as a hard number, not a rate over a period of time. Not every request that is born at the same time will be able to be processed by the database at the same time. Your client could send 150K requests to the database, all at once. The database might blaze through all these concurrent requests if it's running at 500K OPS. Or, it might take a while to process them if the database throughput tops out at 50K OPS.

It is generally possible to increase throughput by increasing your cluster size (and/or power). But, you also want to pay special attention to concurrency, which will be discussed in more depth later in this chapter as well as in Chapter 5. For the most part, high concurrency is essential for achieving impressive performance. But if the clients end up overwhelming the database with a concurrency that it can't handle, throughput will suffer, then latency will rise as a side effect. A friendly reminder that transcends the database world: No system, distributed or not, supports unlimited concurrency. Period.

Note: Even though scaling a cluster boosts your database processing capacity, remember that the application access patterns directly contribute to how much impact that will ultimately make. One situation where scaling a cluster may not provide the desired throughput increase is during a hot partition[10] situation, which causes traffic to be primarily targeted to a specific set of replicas.
In these cases, throttling the access to such hot keys is fundamental for preserving the system's overall performance.

[10] A hot partition is a data access imbalance problem that causes specific partitions to receive more traffic compared to others, thus introducing higher load on a specific set of replica servers.

Latency Expectations

Latency is a more complex challenge than throughput: You can increase throughput by adding more nodes, but there's no simple solution for reducing latency. The lower the latency you need to achieve, the more important it becomes to understand and explore database tradeoffs and internal database optimizations that can help you shave milliseconds or microseconds off latencies. Database internals, driver optimizations, efficient CPU utilization, sufficient RAM, efficient data modeling... everything matters.

As with throughput, aim for all relevant stakeholders' agreement on the acceptable latencies. This is usually expressed as latency for a certain percentile of requests. For performance-sensitive workloads, tracking at the 99th percentile (P99) is common. Some teams go even higher, such as the P9999, which refers to the 99.99th percentile.

As with throughput, avoid focusing on average (mean) or median (P50) latency measurements. Average latency is a theoretical measurement that is not directly correlated to anything systems or users experience in reality. Averages conceal outliers: Extreme deviations from the norm that may have a large and unexpected impact on overall system performance, and hence on user experience.

For example, look at the discrepancy between average latencies and P99 latencies in Figure 2-2 (different colors represent different database nodes). P99 latencies were often double the average for reads, and even worse for writes.

Figure 2-2. A sample database monitoring dashboard. Note the difference between average and P99 latencies

Note that monitoring systems are sometimes configured in ways that omit outliers. For example, if a monitoring system is calibrated to measure latency on a scale of 0 to 1000ms, it is going to overlook any larger measurements—thus failing to detect the serious issues of query timeouts and retries.

P99 and above percentiles are not perfect.[11] But for latency-sensitive use cases, they're the number you'll want to keep in mind as you are selecting your infrastructure, benchmarking, monitoring, and so on.

[11] For a detailed critique, see Gil Tene's famous "Oh Sh*t" talk (www.youtube.com/watch?v=lJ8ydIuPFeU) as well as his recent P99 CONF talk on Misery Metrics and Consequences (https://www.p99conf.io/session/misery-metrics-consequences/).
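To see how an average can hide exactly the latencies you care about, consider this small self-contained sketch. The sample values are invented, but the shape (a tight cluster of fast requests plus a few slow outliers) is typical of what Figure 2-2 illustrates.

```python
def percentile(samples, p):
    """Nearest-rank percentile; good enough for illustration."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 requests at ~2ms, plus two outliers that hit disk or a stalled node.
latencies_ms = [2.0] * 98 + [45.0, 120.0]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.2f} ms")                       # ~3.61 ms: looks healthy
print(f"p50  = {percentile(latencies_ms, 50)} ms")   # 2.0 ms
print(f"p99  = {percentile(latencies_ms, 99)} ms")   # 45.0 ms: what some users actually feel
```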
Also, be clear about what exactly is involved in the P99 you are looking to achieve. Database latency is the time that elapses between when the database receives a request, processes it, and sends back an appropriate response. Client-side latency is broader: Here, the measurement starts with the client sending the request and ends with the client receiving the database's response. It includes the network time and client-side processing. There can be quite a discrepancy between database latency and client-side latency; a ten times higher client-side latency isn't all that uncommon (although clearly not desirable). There could be many culprits to blame for a significantly higher client-side latency than database latency: excessive concurrency, inefficient application architecture, coding issues, and so on. But that's beyond the scope of this discussion—beyond the scope of this book, even.

The key point here is that your team and all the stakeholders need to be on the same page regarding what you're measuring. For example, say you're given a read latency requirement of 15ms. You work hard to get your database to achieve that and report that you met the expectation—then you learn that stakeholders actually expect 15ms for the full client-side latency. Back to the drawing board.

Ultimately, it's important to track both database latency and client-side latency. You can optimize the database all you want, but if the application is introducing latency issues from the client side, a fast database won't have much impact. Without visibility into both the database and the client-side latencies, you're essentially flying half blind.

Concurrency

What level of concurrency should your database be prepared to handle? Depending on the desired qualities of service from the database cluster, concurrency must be judiciously balanced to reach appropriate throughput and latency values. Otherwise, requests will pile up waiting to be processed—causing latencies to spike, timeouts to rise, and the overall user experience to degrade.

Little's Law establishes that:

L = λW

where λ is the average throughput, W is the average latency, and L represents the total number of requests either being processed or on queue at any given moment when the cluster reaches steady state. Given that your throughput and latency targets are usually fixed, you can use Little's Law to estimate a realistic concurrency.

For example, if you want a system to serve 500,000 requests per second at 2.5ms average latency, the best concurrency is around 1,250 in-flight requests. As you approach the saturation limit of the system—around 600,000 requests per second for read requests—increases in concurrency will keep throughput constant since this is the physical limit of the database. Every new in-flight request will only cause increased latency.

In fact, if you approximate 600,000 requests per second as the physical capacity of this database, you can calculate the expected average latency at a particular concurrency point. For example, at 6,120 in-flight requests, the average latency is expected to be 6120/600,000 = 10ms.

Past the maximum throughput, increasing concurrency will increase latency. Conversely, reducing concurrency will reduce latency, provided that this reduction does not result in a decrease in throughput.
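Little's Law is simple enough to sanity-check in a few lines. The sketch below just replays the chapter's numbers; it is a back-of-the-envelope aid under the steady-state assumption, not a model of any particular database.

```python
def concurrency(throughput_ops: float, latency_s: float) -> float:
    """L = lambda * W: requests in flight at steady state."""
    return throughput_ops * latency_s

def expected_latency_s(in_flight: float, max_throughput_ops: float) -> float:
    """At saturation, latency grows with queued concurrency: W = L / lambda_max."""
    return in_flight / max_throughput_ops

# 500,000 ops/s at 2.5ms average latency -> about 1,250 in-flight requests.
print(concurrency(500_000, 0.0025))
# 6,120 in-flight requests against a ~600,000 ops/s ceiling -> roughly the 10ms from the text.
print(expected_latency_s(6_120, 600_000) * 1000)
```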
In some use cases, it's fine for queries to pile up on the client side. But many times it's not. In those cases, you can scale out your cluster or increase the concurrency on the application side—at least to the point where the latency doesn't suffer. It's a delicate balancing act.[12]

[12] For additional reading on concurrency, the Netflix blog "Performance Under Load" is a great resource (https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581).

Connected Technologies

A database can't rise above the slowest-performing link in your distributed data system. Even if your database is processing reads and writes at blazing speeds, it won't ultimately matter much if it interacts with an event-streaming platform that's not optimized for performance or involves transformations from a poorly-configured Apache Spark instance, for example.

This is just one of many reasons that taking a comprehensive and proactive approach to monitoring (more on this in Chapter 10) is so important. Given the complexity of databases and distributed data systems, it's hard to guess what component is to blame for a problem. Without a window into the state of the broader system, you could naively waste amazing amounts of time and resources trying to optimize something that won't make any difference.

If you're looking to optimize an existing data system, don't overlook the performance gains you can achieve by reviewing and tuning its connected components. Or, if your monitoring efforts indicate that a certain component is to blame for your client-side performance problems but you feel you've hit your limit with it, explore what's required to replace it with a more performant alternative. Use benchmarking to determine the severity of the impact from a performance perspective.

Also, note that some database offerings may have ecosystem limitations. For example, if you're considering a serverless deployment model, be aware that some Change Data Capture (CDC) connectors, drivers, and so on, might not be supported.

Demand Fluctuations

Databases might experience a variety of different demand fluctuations, ranging from predictable moderate fluctuations to unpredictable and dramatic spikes. For instance, the world's most watched sporting event experiences different fluctuations than a food delivery service, which experiences different fluctuations than an ambulance-tracking service—and all require different strategies and infrastructure.

First, let's look at the predictable fluctuations. With predictability, it's much easier to get ahead of the issue. If you're expected to support periodic big events that are known in advance (Black Friday, sporting championships, ticket on-sales, etc.), you should have adequate time to scale up your cluster for each anticipated spike. That means you can tailor your normal topology for the typical day-in, day-out demands without having to constantly incur the costs and admin burden of having that larger scale topology.

On the other side of the spikiness spectrum, there are applications whose traffic has dramatic peaks and valleys across the course of each day. For example, consider food delivery businesses, which face a sudden increase around lunch, followed by a few hours of minimal traffic, then a second spike at dinner time (and sometimes breakfast the following morning).
Expanding the cluster for each spike—even with "autoscaling" (more on autoscaling later in this chapter)—is unlikely to deliver the necessary performance gain fast enough. In these cases, you should provision an infrastructure that supports the peak traffic.

But not all spikes are predictable. Certain industries—such as emergency services, news, and social media—are susceptible to sudden massive spikes. In this case, a good preventative strategy is to control your concurrency on the client side, so it doesn't overwhelm your database. However, controlling concurrency might not be an option for use cases with strict end-to-end latency requirements. You can also scramble to scale out your clusters as fast as feasible when the spike occurs. This is going to be markedly simpler if you're on the cloud than if you're on-prem. If you can start adding nodes immediately, increase capacity incrementally—with a close eye on your monitoring results—and keep going until you're satisfied with the results, or until the peak has subsided. Unfortunately, there is a real risk that you won't be able to sufficiently scale out before the spike ends. Even if the ramp up begins immediately, you need to account for the time it takes to add new nodes, stream data over to them, and rebalance the cluster.
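One way to implement that client-side concurrency control is an application-level limiter, so a sudden spike queues inside your service instead of landing on the database all at once. The sketch below uses asyncio with a placeholder query function; the limit of 1,024 is an arbitrary assumption that you would normally derive from the throughput and latency targets discussed earlier in this chapter.

```python
import asyncio

async def query_database(request_id: int) -> str:
    # Placeholder for a real asynchronous driver call.
    await asyncio.sleep(0.002)
    return f"result-{request_id}"

async def handle_request(limiter: asyncio.Semaphore, request_id: int) -> str:
    # Excess requests wait here instead of piling up as in-flight work on the database.
    async with limiter:
        return await query_database(request_id)

async def main():
    max_in_flight = 1024   # tune from your throughput and latency targets
    limiter = asyncio.Semaphore(max_in_flight)
    # Simulate a sudden spike of 10,000 requests arriving at once.
    results = await asyncio.gather(*(handle_request(limiter, i) for i in range(10_000)))
    print(len(results), "requests served")

asyncio.run(main())
```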
These transactions, which are historically</p><p>the domain of RDBMS, bring a severe performance hit.</p><p>13 See The Burning Monk blog, “Understanding the Scaling Behaviour of DynamoDB</p><p>OnDemand Tables” (https://theburningmonk.com/2019/03/understanding-the-scaling-</p><p>behaviour-of-dynamodb-ondemand-tables/).</p><p>14 For more on the best and worst uses of autoscaling, see Avishai Ish Shalom’s blog, “DynamoDB</p><p>Autoscaling Dissected: When a Calculator Beats a Robot” (www.scylladb.com/2021/07/08/</p><p>dynamodb-autoscaling-dissected-when-a-calculator-beats-a-robot/).</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>35</p><p>It is true that distributed ACID compliant databases do exist—and that the past few</p><p>years have brought some distinct progress in the effort to minimize the performance</p><p>impact (e.g., through row-level locks or column-level locking and better conflict</p><p>resolution algorithms). However, some level of penalty will still exist.</p><p>As a general guidance, if you have an ACID-compliant use case, pay special attention</p><p>to your master nodes; these can easily become your bottlenecks since they will often</p><p>be your primary query coordinators (more on this in Appendix A). In addition, if at</p><p>all possible, try to ensure that the majority of your transactions are isolated to the</p><p>minimum amount of resources. For example, a transaction spanning a single row may</p><p>involve a specific set of replicas, whereas a transaction involving several keys may span</p><p>your cluster as a whole—inevitably increasing your latency. It is therefore important to</p><p>understand which types of transactions your target database supports. Some</p><p>vendors</p><p>may support a mix of approaches, while others excel at specific ones. For instance,</p><p>MongoDB introduced multi-document transactions on sharded clusters in its version</p><p>4.2; prior to that, it supported only multi-document transactions on replica sets.</p><p>If it’s critical to support transactions in a more performant manner, sometimes it’s</p><p>possible to rethink your data model and reimplement a use case in a way that makes it</p><p>suitable for a database that’s not ACID compliant. For example, one team who started</p><p>out with Postgres for all their use cases faced skyrocketing business growth. This is a very</p><p>common situation with startups that begin small and then suddenly find themselves in a</p><p>spot where they are unable to handle a spike in growth in a cost-effective way. They were</p><p>able to move their use cases to NoSQL by conducting a careful data-modeling analysis</p><p>and rethinking their use cases, access patterns, and the real business need of what truly</p><p>required ACID and what did not. This certainly isn’t a quick fix, but in the right situation,</p><p>it can pay off nicely.</p><p>Another option to consider: Performance-focused NoSQL databases like Cassandra</p><p>aim to support isolated conditional updates with capabilities such as lightweight</p><p>transactions that allow “atomic compare and set” operations. That is, the database</p><p>checks if a condition is true, and if so, it conducts the transaction. If the condition is not</p><p>met, the transaction is not completed. They are named “lightweight” since they do not</p><p>truly lock the database for the transaction. Instead, they use a consensus protocol to</p><p>ensure there is agreement between the nodes to commit the change. 
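To make the compare-and-set idea concrete, here is a minimal sketch of what such conditional statements look like at the query level. The table, column names, and the execute() helper are hypothetical placeholders (wire them up to whatever driver you actually use); the point is the IF clause and the [applied] flag that a lightweight transaction returns:

```cpp
// Illustrative sketch only: the "accounts" table, its columns, and the execute()
// helper are hypothetical -- the point is the IF clause and the [applied] flag.
#include <iostream>
#include <string>

struct LwtResult { bool applied; };

// Hypothetical helper: sends a CQL statement to the cluster and reports whether
// the condition held. Replace with your driver's API.
LwtResult execute(const std::string& cql) { (void)cql; return {true}; }

bool create_account(const std::string& user) {
    // "Atomic compare and set" #1: insert only if no row with this key exists.
    // The replicas reach agreement before the write is committed.
    return execute("INSERT INTO accounts (user, balance) VALUES ('" + user +
                   "', 100) IF NOT EXISTS").applied;
}

bool deduct(const std::string& user, int seen_balance, int amount) {
    // "Atomic compare and set" #2: write the new balance only if the value we
    // read earlier is still current. If another client raced us, applied == false
    // and we should re-read and retry.
    return execute("UPDATE accounts SET balance = " +
                   std::to_string(seen_balance - amount) +
                   " WHERE user = '" + user + "' IF balance = " +
                   std::to_string(seen_balance)).applied;
}

int main() {
    if (!create_account("ada")) std::cout << "account already existed\n";
    if (!deduct("ada", /*seen_balance=*/100, /*amount=*/30))
        std::cout << "balance changed under us; retry\n";
}
```

In real code you would bind parameters rather than concatenate strings; the literals are inlined here only to keep the sketch readable.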
This capability was</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>36</p><p>introduced by Cassandra and it’s supported in several ways across different Cassandra-</p><p>compatible databases. If this is something you expect to use, it’s worth exploring the</p><p>documentation to understand the differences.15</p><p>However, it’s important to note that lightweight transactions have their limits. They</p><p>can’t support complex use cases like a retail transaction that updates the inventory</p><p>only after a sale is completed with a successful payment. And just like ACID-compliant</p><p>databases, lightweight transactions have their own performance implications. As a</p><p>result, the choice of whether to use them will greatly depend on the amount of ACID</p><p>compliance that your use case requires.</p><p>DynamoDB is a prime example of how the need for transactions will require more</p><p>compute resources (read: money). As a result, use cases relying heavily on ACID will</p><p>fairly often require much more infrastructure power to satisfy heavy usage requirements.</p><p>In the DynamoDB documentation, AWS recommends that you ensure the database is</p><p>configured for auto-scaling or that it has enough read/write capacity to account for the</p><p>additional overhead of transactions.16</p><p>Consistency Expectations</p><p>Most NoSQL databases opt for eventual consistency to gain performance. This is in</p><p>stark contrast to the RDBMS model, where ACID compliance is achieved in the form</p><p>of transactions, and, because everything is in a single node, the effort on locking and</p><p>avoiding concurrency clashes is often minimized. When deciding between a database</p><p>with strong or eventual consistency, you have to make a hard choice. Do you want to</p><p>sacrifice scalability and performance or can you accept the risk of sometimes serving</p><p>stale data?</p><p>Can your use case tolerate eventual consistency, or is strong consistency truly</p><p>required? Your choice really boils down to how much risk your application—and your</p><p>business—can tolerate with respect to inconsistency. For example, a retailer who</p><p>15 See Kostja Osipov’s blog, “Getting the Most Out of Lightweight Transactions in ScyllaDB”</p><p>(www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-</p><p>scylla/) for an example of how financial transactions can be implemented using Lightweight</p><p>Transactions.</p><p>16 See “Amazon DynamoDB Transactions: How it Works” (https://docs.aws.amazon.com/</p><p>amazondynamodb/latest/developerguide/transaction-apis.html).</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>37</p><p>(understandably) requires consistent pricing might want to pay the price for consistent</p><p>writes upfront during a weekly catalog update so that they can later serve millions of low-</p><p>latency read requests under more relaxed consistency levels. In other cases, it’s more</p><p>important to ingest data quickly and pay the price for consistency later (for example,</p><p>in the playback tracking use case that’s common in streaming platforms—where the</p><p>database needs to record the last viewing position for many users concurrently). Or</p><p>maybe both are equally important. For example, consider a social media platform that</p><p>offers live chat. 
Here, you want consistency on both writes and reads, but you likely don’t</p><p>need the highest consistency (the impact of an inconsistency here is likely much less</p><p>than with a financial report).</p><p>In some cases, “tunable consistency” will help you achieve a balance between strong</p><p>consistency and performance. This gives you the ability to tune the consistency at the</p><p>query level to suit what you’re trying to achieve. You can have some queries relying on a</p><p>quorum of replicas, then have other queries that are much more relaxed.</p><p>Regardless of your consistency requirements, you need to be aware of the</p><p>implications involved when selecting a given consistency level. Databases that offer</p><p>tunable consistency may be a blessing or a curse if you don’t know what you are doing.</p><p>Consider a NoSQL deployment spanning three different regions, with three nodes</p><p>each (nine nodes in total). A QUORUM read would essentially have to traverse two</p><p>different regions in order to be acknowledged back to the client. In that sense, if your</p><p>Network Round Trip Time (RTT)17 is 50ms, then it will take at least this amount of time</p><p>for the query to be considered successful by the database. Similarly, if you were to run</p><p>operations with the highest possible consistency (involving all replicas), then the failure</p><p>of a single node may bring your entire application down.</p><p>Note NoSQL databases fairly often will provide you with ways to confine your</p><p>queries to a specific region to prevent costly network round trips from impacting</p><p>your latency. but again, it all boils down to you what your use case requires.</p><p>17 RTT is the duration, typically measured in milliseconds, that a network request takes to reach a</p><p>destination, plus the time it takes for the packet to be received back at the origin.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>38</p><p>Geographic Distribution</p><p>Does your business need to support a regional or global customer base in the near-term</p><p>future? Where are your users and your application located? The greater the distance</p><p>between your users, your application, and your database, the more they’re going to face</p><p>high latencies that stem from the physical time it takes to move data across the network.</p><p>Knowing this will influence where you locate your database and how you design your</p><p>topology—more on this in Chapters 6 and 8.</p><p>The geographic distribution of your cluster might also be a requirement from a</p><p>disaster recovery perspective. In that sense, the cluster would typically serve data</p><p>primarily from a specific region, but failover to another in the event of a disaster (such as</p><p>a full region outage). These kinds of setups are costly, as they will require doubling your</p><p>infrastructure spend. However, depending on the nature of your use case, sometimes it’s</p><p>required.</p><p>Some organizations that invest in a multi-region deployment for the primary</p><p>purpose of disaster recovery end up using them to host isolated use cases. As explained</p><p>in the “Competing Workloads” section of this chapter, companies often prefer to</p><p>physically isolate OLTP from OLAP workloads. 
Moving some</p><p>isolated (less critical)</p><p>workloads to remote regions prevents these servers from being “idle” most of the time.</p><p>Regardless of the magnitude of compelling reasons that may drive you toward a</p><p>geographically dispersed deployment, here’s some important high-level advice from a</p><p>performance perspective (you’ll learn some more technical tips in Chapter 8):</p><p>1. Consider the increased load that your target region or regions will</p><p>receive in the event of a full region outage. For example, assume</p><p>that you operate globally across three regions, and all these three</p><p>regions serve your end-users. Are the two remaining regions able</p><p>to sustain the load for a long period of time?</p><p>2. Recognize that simply having a geographically-dispersed database</p><p>does not fully cover you in a disaster recovery situation. You also</p><p>need to have your application, web servers, messaging queue</p><p>systems, and so on, geographically replicated. If the only thing</p><p>that’s geo-replicated is your database, you won’t be in a great</p><p>position when your primary application goes down.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>39</p><p>3. Consider the fact that geo-replicated databases typically require</p><p>very good network links. Especially when crossing large distances,</p><p>the time to replicate your data is crucial to minimize losses in the</p><p>event of a disaster. If your workload has a heavy write throughput,</p><p>a slow network link may bottleneck the local region nodes. This</p><p>may cause a queue to build up and eventually throttle down</p><p>your writes.</p><p>High-Availability Expectations</p><p>Inevitably, s#*& happens. To prepare for the worst, start by understanding what your use</p><p>case and business can tolerate if a node goes down. Can you accept the data loss that</p><p>could occur if a node storing unreplicated data goes down? Do you need to continue</p><p>buzzing along without a noticeable performance impact even if an entire datacenter or</p><p>availability zone goes down? Or is it okay if things slow down a bit from time to time?</p><p>This will all impact how you architect your topology and configure things like replication</p><p>factor and consistency levels (you’ll learn about this more in Chapter 8).</p><p>It’s important to note that replication and consistency both come at a cost to</p><p>performance. Get a good feel for your business’s risk tolerance and don’t opt for more</p><p>than your business really needs.</p><p>When considering your cluster topology, remember that quite a lot is at risk if you</p><p>get it wrong (and you don’t want to be caught off-guard in the middle of the night).</p><p>For example, the failure of a single node in a three-node cluster could make you</p><p>momentarily lose 33 percent of your processing power. Quite often, that’s a significant</p><p>blow, with discernable business impact. Similarly, the loss of a node in a six-node</p><p>cluster would reduce the blast radius to only 16 percent. But there’s always a tradeoff.</p><p>A sprawling deployment spanning hundreds of nodes is not ideal either. The more nodes</p><p>you have, the more likely you are to experience a node failure. Balance is key.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>40</p><p>Summary</p><p>The specific database challenges you encounter, as well as your options for addressing</p><p>them, are highly dependent on your situation. 
For example, an AdTech use case that</p><p>demands single-digit millisecond P99 latencies for a large dataset with small item</p><p>sizes requires a different treatment than a fraud detection use case that prioritizes the</p><p>ingestion of massive amounts of data as rapidly as possible. One of the primary factors</p><p>influencing how these workloads are handled is how your database is architected. That’s</p><p>the focus for the next two chapters, which dive into database internals.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter’s</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter’s Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>41</p><p>CHAPTER 3</p><p>Database Internals:</p><p>Hardware andOperating</p><p>System Interactions</p><p>A database’s internal architecture makes a tremendous impact on the latency it can</p><p>achieve and the throughput it can handle. Being an extremely complex piece of software,</p><p>a database doesn’t exist in a vacuum, but rather interacts with the environment, which</p><p>includes the operating system and the hardware.</p><p>While it’s one thing to get massive terabyte-to-petabyte scale systems up and</p><p>running, it’s a whole other thing to make sure they are operating at peak efficiency. In</p><p>fact, it’s usually more than just “one other thing.” Performance optimization of large</p><p>distributed systems is usually a multivariate problem—combining aspects of the</p><p>underlying hardware, networking, tuning operating systems, and finagling with layers of</p><p>virtualization and application architectures.</p><p>Such a complex problem warrants exploration from multiple perspectives. This</p><p>chapter begins the discussion of database internals by looking at ways that databases</p><p>can optimize performance by taking advantage of modern hardware and operating</p><p>systems. It covers how the database interacts with the operating system plus CPUs,</p><p>memory, storage, and networking. Then, the next chapter shifts focus to algorithmic</p><p>optimizations.1</p><p>1 This chapter draws from material originally published on the Seastar site (https://seastar.io/)</p><p>and the ScyllaDB blog (https://www.scylladb.com/blog/). It is used here with permission of</p><p>ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_3</p><p>42</p><p>CPU</p><p>Programming books tell programmers that they have this CPU that can run processes</p><p>or threads, and what runs means is that there’s some simple sequential instruction</p><p>execution. 
Then there’s a footnote explaining that with multiple threads you might need</p><p>to consider doing some synchronization. In fact, how things are actually executed inside</p><p>CPU cores is something completely different and much more complicated. It would</p><p>be very difficult to program these machines if you didn’t have those abstractions from</p><p>books, but they are a lie to some degree. How you can efficiently take advantage of CPU</p><p>capabilities is still very important.</p><p>Share Nothing Across Cores</p><p>Individual CPU cores aren’t getting any faster. Their clock speeds reached a performance</p><p>plateau long ago. Now, the ongoing increase of CPU performance continues</p><p>horizontally: by increasing the number of processing units. In turn, the increase in the</p><p>number of cores means that performance now depends on coordination across multiple</p><p>cores (versus the throughput of a single core).</p><p>On modern hardware, the performance of standard workloads depends more on the</p><p>locking and coordination across cores than on the performance of an individual core.</p><p>Software architects face two unattractive alternatives:</p><p>• Coarse-grained locking, which will see application threads contend</p><p>for control of the data and wait instead of producing useful work.</p><p>• Fine-grained locking, which, in addition to being hard to program</p><p>and debug, sees significant overhead even when no contention</p><p>occurs due to the locking primitives themselves.</p><p>Consider an SSD drive.</p><p>The typical time needed to communicate with an SSD on a</p><p>modern NVMe device is quite lengthy—it’s about 20 μseconds. That’s enough time for</p><p>the CPU to execute tens of thousands of instructions. Developers should consider it as</p><p>a networked device but generally do not program in that way. Instead, they often use an</p><p>API that is synchronous (we’ll return to this later), which produces a thread that can be</p><p>blocked.</p><p>Looking at the image of the logical layout of an Intel Xeon Processor (see Figure3-1),</p><p>it’s clear that this is also a networked device.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>43</p><p>Figure 3-1. The logical layout of an Intel Xeon Processor</p><p>The cores are all connected by what is essentially a network—a dual ring</p><p>interconnected architecture. There are two such rings and they are bidirectional. Why</p><p>should developers use a synchronous API for that then? Since sharing information</p><p>across cores requires costly locking, a shared-nothing model is perfectly worth</p><p>considering. In such a model, all requests are sharded onto individual cores, one</p><p>application thread is run per core, and communication depends on explicit message</p><p>passing, not shared memory between threads. This design avoids slow, unscalable lock</p><p>primitives and cache bounces.</p><p>Any sharing of resources across cores in modern processors must be handled</p><p>explicitly. For example, when two requests are part of the same session and two CPUs</p><p>each get a request that depends on the same session state, one CPU must explicitly</p><p>forward the request to the other. Either CPU may handle either response. 
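The following sketch shows the shape of such a design: every key is owned by exactly one shard, each shard runs a single thread over its own private data, and touching another shard's data means sending it a message and waiting on the returned future. It is illustrative only—a real shared-nothing runtime would use lock-free queues and run-to-completion scheduling rather than the mutex-guarded queue used here for brevity:

```cpp
// Minimal shared-nothing sketch: shard-private state, explicit message passing,
// no shared-memory locking on the data itself. (The mutex protects only the
// mailbox and stands in for the lock-free queues a real runtime would use.)
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

class Shard {
public:
    void start() { worker_ = std::thread([this] { run(); }); }
    void stop() {
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Explicit message passing: the caller gets a future; only the owner runs fn.
    std::future<int> submit(std::function<int(std::unordered_map<std::string, int>&)> fn) {
        std::packaged_task<int()> task([this, fn] { return fn(data_); });
        auto fut = task.get_future();
        { std::lock_guard<std::mutex> l(m_); q_.push_back(std::move(task)); }
        cv_.notify_one();
        return fut;
    }
private:
    void run() {
        for (;;) {
            std::packaged_task<int()> task;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                task = std::move(q_.front());
                q_.pop_front();
            }
            task();  // only this thread ever touches data_
        }
    }
    std::unordered_map<std::string, int> data_;  // shard-private: no cross-core locks
    std::deque<std::packaged_task<int()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::thread worker_;
    bool done_ = false;
};

int main() {
    std::vector<Shard> shards(4);
    for (auto& s : shards) s.start();
    std::string key = "user:42";
    auto owner = std::hash<std::string>{}(key) % shards.size();  // route by key
    auto fut = shards[owner].submit([key](auto& kv) { return ++kv[key]; });
    int value = fut.get();  // wait for the owning shard's reply
    for (auto& s : shards) s.stop();
    return value == 1 ? 0 : 1;
}
```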
Ideally,</p><p>your database provides facilities that limit the need for cross-core communication—</p><p>but when communication is inevitable, it provides high-performance non-blocking</p><p>communication primitives to ensure performance is not degraded.</p><p>Futures-Promises</p><p>There are many solutions for coordinating work across multiple cores. Some are highly</p><p>programmer-friendly and enable the development of software that works exactly as if it</p><p>were running on a single core. For example, the classic UNIX process model is designed</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>44</p><p>to keep each process in total isolation and relies on kernel code to maintain a separate</p><p>virtual memory space per process. Unfortunately, this increases the overhead at the</p><p>OS level.</p><p>There’s a model known as “futures and promises.” A future is a data structure that</p><p>represents some yet-undetermined result. A promise is the provider of this result. It</p><p>can be helpful to think of a promise/future pair as a first-in first-out (FIFO) queue</p><p>with a maximum length of one item, which may be used only once. The promise is the</p><p>producing end of the queue, while the future is the consuming end. Like FIFOs, futures</p><p>and promises decouple the data producer and the data consumer.</p><p>However, the optimized implementations of futures and promises need to take</p><p>several considerations into account. While the standard implementation targets coarse-</p><p>grained tasks that may block and take a long time to complete, optimized futures and</p><p>promises are used to manage fine-grained, non-blocking tasks. In order to meet this</p><p>requirement efficiently, they should:</p><p>• Require no locking</p><p>• Not allocate memory</p><p>• Support continuations</p><p>Future-promise design eliminates the costs associated with maintaining individual</p><p>threads by the OS and allows close to complete utilization of the CPU.On the other</p><p>hand, it calls for user-space CPU scheduling and very likely limits the developer with</p><p>voluntary preemption scheduling. The latter, in turn, is prone to generating phantom</p><p>jams in popular producer-consumer programming templates.2</p><p>Applying future-promise design to database internals has obvious benefits. First of</p><p>all, database workloads can be naturally CPU-bound. For example, that’s typically the</p><p>case with in-memory database engines, and aggregates’ evaluations also involve pretty</p><p>intensive CPU work. Even for huge on-disk datasets, when the query time is typically</p><p>dominated by the I/O, CPU should be considered. Parsing a query is a CPU-intensive</p><p>task regardless of whether the workload is CPU-bound or storage-bound, and collecting,</p><p>converting, and sending the data back to the user also calls for careful CPU utilization.</p><p>And last but not least: Processing the data always involves a lot of high-level operations</p><p>2 Watch the Linux Foundation video, “Exploring Phantom Traffic Jams in Your Data Flows,” on</p><p>YouTube (www.youtube.com/watch?v=IXS_Afb6Y4o) and/or read the corresponding article on the</p><p>ScyllaDB blog (www.scylladb.com/2022/04/19/exploring-phantom-jams-in-your-</p><p>data-flow/).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>45</p><p>and low-level instructions. 
Maintaining them in an optimal manner requires a good low-</p><p>level programming paradigm and future-promises is one of the best choices. However,</p><p>large instruction sets need even more care; this leads to “execution stages.”</p><p>Execution Stages</p><p>Let’s dive deeper into CPU microarchitecture, because (as discussed previously)</p><p>database engine CPUs typically need to deal with millions and billions of instructions,</p><p>and it’s essential to help the poor thing with that. In a very simplified way, the</p><p>microarchitecture of a modern x86 CPU—from the point of view of top-down analysis—</p><p>consists of four major components: frontend, backend, branch speculation, and retiring.</p><p>Frontend</p><p>The processor’s frontend is responsible for fetching and decoding instructions that are</p><p>going to be executed. It may become a bottleneck when there is either a latency problem</p><p>or insufficient bandwidth. The former can be caused, for example, by instruction cache</p><p>misses. The latter happens when the instruction decoders cannot keep up. In the latter</p><p>case, the solution may be to attempt to make the hot path (or at least significant portions</p><p>of it) fit in the decoded μop cache (DSB) or be recognizable by the loop detector (LSD).</p><p>Branch Speculation</p><p>Pipeline slots that the top-down analysis classifies as bad speculation are not stalled, but</p><p>wasted. This happens when a branch is incorrectly predicted and the rest of the CPU</p><p>executes a μop that eventually cannot be committed. The branch predictor is generally</p><p>considered to be a part of the frontend. However, its problems can affect the whole pipeline</p><p>in ways beyond just causing the backend to be undersupplied by the instruction fetch and</p><p>decode. (Note: Branch mispredictions are covered in more detail a bit later in this chapter.)</p><p>Backend</p><p>The backend receives decoded μops and executes them. A stall may happen either</p><p>because of an execution port being busy or a cache miss. At the lower level, a pipeline</p><p>slot may be core bound either due to data dependency or an insufficient number of</p><p>available execution units. Stalls caused by memory can be caused by cache misses at</p><p>different levels of data cache, external memory latency, or bandwidth.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>46</p><p>Retiring</p><p>Finally, there are pipeline slots that get classified as retiring. They are the lucky ones that</p><p>were able to execute and commit their μop without any problems. When 100 percent</p><p>of the pipeline slots are able to retire without a stall, the program has achieved the</p><p>maximum number of instructions per cycle for that model of the CPU.Although this is</p><p>very desirable, it doesn’t mean that there’s no opportunity for improvement. Rather, it</p><p>means that the CPU is fully utilized and the only way to improve the performance is to</p><p>reduce the number of instructions.</p><p>Implications forDatabases</p><p>The way CPUs are architectured has direct implications on the database design. It may</p><p>very well happen that individual requests involve a lot of logic and relatively little data,</p><p>which is a scenario that stresses the CPU significantly. This kind of</p><p>workload will be</p><p>completely dominated by the frontend—instruction cache misses in particular. If you</p><p>think about this for a moment, it shouldn’t be very surprising. 
The pipeline that each</p><p>request goes through is quite long. For example, write requests may need to go through</p><p>transport protocol logic, query parsing code, look up in the caching layer, or be applied</p><p>to the memtable, and so on.</p><p>The most obvious way to solve this is to attempt to reduce the amount of logic in</p><p>the hot path. Unfortunately, this approach does not offer a huge potential for significant</p><p>performance improvement. Reducing the number of instructions needed to perform a</p><p>certain activity is a popular optimization practice, but a developer cannot make any code</p><p>shorter infinitely. At some point, the code “freezes”—literally. There’s some minimal</p><p>amount of instructions needed even to compare two strings and return the result. It’s</p><p>impossible to perform that with a single instruction.</p><p>A higher-level way of dealing with instruction cache problems is called Staged</p><p>Event-Driven Architecture (SEDA for short). It’s an architecture that splits the request</p><p>processing pipeline into a graph of stages—thereby decoupling the logic from the event</p><p>and thread scheduling. This tends to yield greater performance improvements than the</p><p>previous approach.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>47</p><p>Memory</p><p>Memory management is the central design point in all aspects of programming. Even</p><p>comparing programming languages to one another always involves discussions about</p><p>the way programmers are supposed to handle memory allocation and freeing. No</p><p>wonder memory management design affects the performance of a database so much.</p><p>Applied to database engineering, memory management typically falls into two</p><p>related but independent subsystems: memory allocation and cache control. The former</p><p>is in fact a very generic software engineering issue, so considerations about it are not</p><p>extremely specific to databases (though they are crucial and are worth studying). As</p><p>opposed to that, the latter topic is itself very broad, affected by the usage details and</p><p>corner cases. Respectively, in the database world, cache control has its own flavor.</p><p>Allocation</p><p>The manner in which programs or subsystems allocate and free memory lies at the core</p><p>of memory management. There are several approaches worth considering.</p><p>As illustrated by Figure3-2, a so-called “log-structured allocation” is known from</p><p>filesystems where it puts sequential writes to a circular log on the persisting storage and</p><p>handles updates the very same way. At some point, this filesystem must reclaim blocks</p><p>that became obsolete entries in the log area to make some more space available for</p><p>future writes. In a naive implementation, unused entries are reclaimed by rereading and</p><p>rewriting the log from scratch; obsolete blocks are then skipped in the process.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>48</p><p>Figure 3-2. A log-structured allocation puts sequential writes to a circular log on</p><p>the persisting storage and handles updates the same way</p><p>A memory allocator for naive code can do something similar. In its simplest form,</p><p>it would allocate the next block of memory by simply advancing a next-free pointer.</p><p>Deallocation would just need to mark the allocated area as freed. One advantage of</p><p>this approach is the speed of allocation. 
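In code, such a "next-free pointer" allocator is almost embarrassingly small. The sketch below is illustrative only; it assumes power-of-two alignment and ignores reclamation entirely, which is precisely the catch discussed next:

```cpp
// A deliberately naive bump ("next-free pointer") allocator. Allocation is a
// pointer increment; reclamation is ignored here, which is exactly the hard part.
#include <cstddef>
#include <cstdint>

class BumpArena {
public:
    explicit BumpArena(std::size_t capacity)
        : buf_(new std::byte[capacity]), next_(buf_), end_(buf_ + capacity) {}
    ~BumpArena() { delete[] buf_; }

    // align must be a power of two.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        auto p = reinterpret_cast<std::uintptr_t>(next_);
        p = (p + align - 1) & ~(align - 1);            // round up to the alignment
        auto* aligned = reinterpret_cast<std::byte*>(p);
        if (aligned + size > end_) return nullptr;     // out of space: time to reclaim
        next_ = aligned + size;                        // the whole allocation is one add
        return aligned;
    }
    void reset() { next_ = buf_; }  // frees *everything* at once (arena/FIFO style)

private:
    std::byte* buf_;
    std::byte* next_;
    std::byte* end_;
};
```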
Another is the simplicity and efficiency of</p><p>deallocation if it happens in FIFO order or affects the whole allocation space. Stack</p><p>memory allocations are later released in the order that’s reverse to allocation, so this is</p><p>the most prominent and the most efficient example of such an approach.</p><p>Using linear allocators as general-purpose allocators can be more problematic</p><p>because of the difficulty of space reclamation. To reclaim space, it’s not enough to just</p><p>mark entries as free. This leads to memory fragmentation, which in turn outweighs</p><p>the advantages of linear allocation. So, as with the filesystem, the memory must be</p><p>reclaimed so that it only contains allocated entries and the free space can be used again.</p><p>Reclamation requires moving allocated entries around—a process that changes and</p><p>invalidates their previously known addresses. In naive code, the locations of references</p><p>to allocated entries (addresses stored as pointers) are unknown to the allocator. Existing</p><p>references would have to be patched to make the allocator action transparent to the</p><p>caller; that’s not feasible for a general-purpose allocator like malloc. Logging allocator</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>49</p><p>use is tied to the programming language selection. Some RTTIs, like C++, can greatly</p><p>facilitate this by providing move-constructors. However, passing pointers to libraries that</p><p>are outside of your control (e.g., glibc) would still be an issue.</p><p>Another alternative is adopting a strategy of pool allocators, which provide allocation</p><p>spaces for allocation of entries of a fixed size (see Figure3-3). By limiting the allocation</p><p>space that way, fragmentation can be reduced. A number of general-purpose allocators</p><p>use pool allocators for small allocations. In some cases, those application spaces exist on</p><p>a per-thread basis to eliminate the need for locking and improve CPU cache utilization.</p><p>Figure 3-3. Pool allocators provide allocation spaces for allocation of entries of a</p><p>fixed size. Fragmentation is reduced by limiting the allocation space</p><p>This pool allocation strategy provides two core benefits. First, it saves you</p><p>from having to search for available memory space. Second, it alleviates memory</p><p>fragmentation because it pre-allocates in memory a cache for use with a collection of</p><p>object sizes. Here’s how it works to achieve that:</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>50</p><p>1. The region for each of the sizes has fixed-size memory chunks that</p><p>are suitable for the contained objects, and those chunks are all</p><p>tracked by the allocator.</p><p>2. When it’s time for the allocator to allocate memory for a certain</p><p>type of data object, it’s typically possible to use a free slot (chunk)</p><p>in one of the existing memory slabs.3</p><p>3. When it’s time for the allocator to free the object’s memory, it can</p><p>simply move that slot over to the containing slab’s list of unused/</p><p>free memory slots.</p><p>4. That memory slot (or some other free slot) will be removed from</p><p>the list of free slots whenever there’s a call to create an object of</p><p>the same type (or a call to allocate memory of the same size).</p><p>The best allocation approach to pick heavily depends on the usage scenario. 
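To make the slab mechanics from the list above concrete, here is a minimal fixed-size pool—a toy illustration of steps 1 through 4, not the implementation of any particular allocator:

```cpp
// Minimal fixed-size pool: one slab, chunks of a single size, and a free list of
// unused slots. Illustrative only.
#include <cstddef>
#include <vector>

class FixedPool {
public:
    FixedPool(std::size_t chunk_size, std::size_t chunk_count)
        : chunk_size_(chunk_size), slab_(chunk_size * chunk_count) {
        free_.reserve(chunk_count);
        for (std::size_t i = 0; i < chunk_count; ++i)
            free_.push_back(slab_.data() + i * chunk_size);  // every slot starts free
    }
    void* allocate() {                      // O(1): pop a slot off the free list
        if (free_.empty()) return nullptr;  // slab exhausted; a real allocator grabs another slab
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void deallocate(void* p) {              // O(1): push the slot back; no search, no fragmentation
        free_.push_back(static_cast<char*>(p));
    }
private:
    std::size_t chunk_size_;
    std::vector<char> slab_;                // one contiguous region of same-size chunks
    std::vector<char*> free_;               // list of unused/free memory slots
};
```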
One great benefit of a log-structured approach is that it handles fragmentation of small sub-pools in a more efficient way. Pool allocators, on the other hand, generate less background load on the CPU because of the lack of compacting activity.

3 We are using the term “slab” to mean one or more contiguous memory pages that contain pre-allocated chunks of memory.

Cache Control

When it comes to memory management in a software application that stores lots of data on disk, you cannot overlook the topic of cache control. Caching is always a must in data processing, and it’s crucial to decide what and where to cache.

If caching is done at the I/O level, for both read/write and mmap, caching can become the responsibility of the kernel. The majority of the system’s memory is given over to the page cache. The kernel decides which pages should be evicted when memory runs low, decides when pages need to be written back to disk, and controls read-ahead. The application can provide some guidance to the kernel using the madvise(2) and fadvise(2) system calls.

The main advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are
generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next. Even if the application knows differently, it usually has no way to help the kernel guess correctly.
This results in the wrong pages being evicted, I/O scheduled</p><p>in the wrong order, or read-ahead scheduled for data that will not be consumed in the</p><p>near future.</p><p>Next, doing the caching at the I/O level interacts with the topic often referred to as</p><p>IMR—in memory representation. No wonder that the format in which data is stored on</p><p>disk differs from the form the same data is allocated in memory as objects. The simplest</p><p>reason that it’s not the same is byte-ordering. With that in mind, if the data is cached</p><p>once it’s read from the disk, it needs to be further converted or parsed into the object</p><p>used in memory. This can be a waste of CPU cycles, so applications may choose to cache</p><p>at the object level.</p><p>Choosing to cache at the object level affects a lot of other design points. With</p><p>that, the cache management is all on the application side including cross-core</p><p>synchronization, data coherence, invalidation, and so on. Next, since objects can be</p><p>(and typically are) much smaller than the average I/O size, caching millions and billions</p><p>of those objects requires a collection selection that can handle it (you’ll learn about this</p><p>quite soon). Finally, caching on the object level greatly affects the way I/O is done.</p><p>I/O</p><p>Unless the database engine is an in-memory one, it will have to keep the data on external</p><p>storage. There can be many options to do that, including local disks, network-attached</p><p>storage, distributed file- and object- storage systems, and so on. The term “I/O” typically</p><p>refers to accessing data on local storage—disks or filesystems (that, in turn, are located</p><p>on disks as well). And in general, there are four choices for accessing files on a Linux</p><p>server: read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous I/O (AIO/</p><p>DIO, because this I/O is rarely used in cached mode).</p><p>Traditional Read/Write</p><p>The traditional method is to use the read(2) and write(2) system calls. In a modern</p><p>implementation, the read system call (or one of its many variants—pread, readv, preadv,</p><p>etc.) asks the kernel to read a section of a file and copy the data into the calling process</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>52</p><p>address space. If all of the requested data is in the page cache, the kernel will copy it</p><p>and return immediately; otherwise, it will arrange for the disk to read the requested</p><p>data into the page cache, block the calling thread, and when the data is available, it will</p><p>resume the thread and copy the data. A write, on the other hand, will usually1 just copy</p><p>the data into the page cache; the kernel will write back the page cache to disk some time</p><p>afterward.</p><p>mmap</p><p>An alternative and more modern method is to memory-map the file into the application</p><p>address space using the mmap(2) system call. This causes a section of the address space</p><p>to refer directly to the page cache pages that contain the file’s data. After this preparatory</p><p>step, the application can access file data using the processor’s memory read and</p><p>memory write instructions. If the requested data happens to be in cache, the kernel is</p><p>completely bypassed and the read (or write) is performed at memory speed. 
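The difference between the two cached paths is easy to see in code. The following is a Linux/POSIX sketch (the file path is just an example, and it assumes the file is at least 4KB long): the pread() call asks the kernel to copy data out of the page cache into a private buffer, while the mmap() version touches the very same cached pages as ordinary memory:

```cpp
// The same 4 KB of a file fetched two ways: via the read() path (kernel copies
// page-cache data into buf) and via mmap() (a plain memory access reaches the
// page-cache page directly). Assumes the file exists and is at least 4 KB.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int checksum(const char* p, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i) sum += static_cast<unsigned char>(p[i]);
    return sum;
}

int main() {
    int fd = open("/var/lib/mydb/data.bin", O_RDONLY);  // example path
    if (fd < 0) return 1;

    // 1) Traditional read: the kernel copies from the page cache into buf.
    char buf[4096];
    ssize_t got = pread(fd, buf, sizeof(buf), /*offset=*/0);
    int a = (got > 0) ? checksum(buf, static_cast<size_t>(got)) : 0;

    // 2) mmap: map the first 4 KB and read it as ordinary memory. If the page is
    //    cached, no system call and no copy happen on access.
    void* p = mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
    int b = (p != MAP_FAILED) ? checksum(static_cast<const char*>(p), 4096) : 0;
    if (p != MAP_FAILED) munmap(p, 4096);

    close(fd);
    return (a == b) ? 0 : 1;
}
```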
If a cache</p><p>miss occurs, then a page-fault happens and the kernel puts the active thread to sleep</p><p>while it goes off to read the data for that page. When the data is finally available, the</p><p>memory-management unit is programmed so the newly read data is accessible to the</p><p>thread, which is then awoken.</p><p>Direct I/O (DIO)</p><p>Both traditional read/write and mmap involve the kernel page cache and defer the</p><p>scheduling of I/O to the kernel. When the application wants to schedule I/O itself (for</p><p>reasons that we will explain later), it can use Direct I/O, as shown in Figure3-4. This</p><p>involves opening the file with the O_DIRECT flag; further activity will use the normal</p><p>read and write family of system calls. However, their behavior is now altered: Instead of</p><p>accessing the cache, the disk is accessed directly, which means that the calling thread</p><p>will be put to sleep unconditionally. Furthermore, the disk controller will copy the data</p><p>directly to userspace, bypassing the kernel.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>53</p><p>Figure 3-4. Direct I/O involves opening the file with the O_DIRECT flag; further</p><p>activity will use the normal read and write family of system calls, but their</p><p>behavior is now altered</p><p>Asynchronous I/O (AIO/DIO)</p><p>A refinement of Direct I/O, Asynchronous Direct I/O, behaves similarly but prevents the</p><p>calling thread from blocking (see Figure3-5). Instead, the application thread schedules</p><p>Direct I/O operations using the io_submit(2) system call, but the thread is not blocked;</p><p>the I/O operation runs in parallel with normal thread execution. A separate system</p><p>call, io_getevents(2), waits for and collects the results of completed I/O operations.</p><p>Like DIO, the kernel’s page cache is bypassed, and the disk controller is responsible for</p><p>copying the data directly to userspace.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>54</p><p>Figure 3-5. A refinement of Direct I/O, Asynchronous Direct I/O behaves similarly</p><p>but prevents the calling thread from blocking</p><p>Note: io_uring the apI to perform asynchronous I/O appeared in linux long ago,</p><p>and it was warmly met by the community. however, as it often happens, real-</p><p>world usage quickly revealed many inefficiencies, such as blocking under some</p><p>circumstances (despite the name), the need to call the kernel too often, and poor</p><p>support for canceling the submitted requests. eventually, it became clear that the</p><p>updated requirements were not compatible with the existing apI and the need for a</p><p>new one arose.</p><p>this is how the io_uring() apI appeared. It provides the same facilities as aIO</p><p>does, but in a much more convenient and performant way (it also has notably</p><p>better documentation). without diving into implementation details, let’s just say that</p><p>it exists and is preferred over the legacy aIO.</p><p>Understanding theTradeoffs</p><p>The different access methods share some characteristics and differ in others. Table3-1</p><p>summarizes these characteristics, which are discussed further in this section.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>55</p><p>Table 3-1. 
Comparing Different I/O Access Methods</p><p>Characteristic R/W mmap DIO AIO/DIO</p><p>Cache control Kernel Kernel User User</p><p>Copying yes no no no</p><p>mmU activity low high none none</p><p>I/O scheduling Kernel Kernel mixed User</p><p>thread scheduling Kernel Kernel Kernel User</p><p>I/O alignment automatic automatic manual manual</p><p>application complexity low low moderate high</p><p>Copying andMMU Activity</p><p>One of the benefits of the mmap method is that if the data is in cache, then the kernel</p><p>is bypassed completely. The kernel does not need to copy data from the kernel to</p><p>userspace and back, so fewer processor cycles are spent on that activity. This benefits</p><p>workloads that are mostly in cache (for example, if the ratio of storage size to RAM size is</p><p>close to 1:1).</p><p>The downside of mmap, however, occurs when data is not in the cache. This usually</p><p>happens when the ratio of storage size to RAM size is significantly higher than 1:1. Every</p><p>page that is brought into the cache causes another page to be evicted. Those pages have</p><p>to be inserted into and removed from the page tables; the kernel has to scan the page</p><p>tables to isolate inactive pages, making them candidates</p><p>for eviction, and so forth. In</p><p>addition, mmap requires memory for the page tables. On x86 processors, this requires</p><p>0.2 percent of the size of the mapped files. This seems low, but if the application has a</p><p>100:1 ratio of storage to memory, the result is that 20 percent of memory (0.2% * 100) is</p><p>devoted to page tables.</p><p>I/O Scheduling</p><p>One of the problems with letting the kernel control caching (with the mmap and read/</p><p>write access methods) is that the application loses control of I/O scheduling. The kernel</p><p>picks whichever block of data it deems appropriate and schedules it for write or read.</p><p>This can result in the following problems:</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>56</p><p>• A write storm. When the kernel schedules large amounts of writes,</p><p>the disk will be busy for a long while and impact read latency.</p><p>• The kernel cannot distinguish between “important” and</p><p>“unimportant” I/O. I/O belonging to background tasks can</p><p>overwhelm foreground tasks, impacting their latency2</p><p>By bypassing the kernel page cache, the application takes on the burden of</p><p>scheduling I/O.This doesn’t mean that the problems are solved, but it does mean that</p><p>the problems can be solved—with sufficient attention and effort.</p><p>When using Direct I/O, each thread controls when to issue I/O.However, the kernel</p><p>controls when the thread runs, so responsibility for issuing I/O is shared between the</p><p>kernel and the application. With AIO/DIO, the application is in full control of when I/O</p><p>is issued.</p><p>Thread Scheduling</p><p>An I/O intensive application using mmap or read/write cannot guess what its cache hit</p><p>rate will be. Therefore, it has to run a large number of threads (significantly larger than</p><p>the core count of the machine it is running on). Using too few threads, they may all be</p><p>waiting for the disk leaving the processor underutilized. Since each thread usually has</p><p>at most one disk I/O outstanding, the number of running threads must be around the</p><p>concurrency of the storage subsystem multiplied by some small factor in order to keep</p><p>the disk fully occupied. 
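As a concrete taste of the AIO/DIO column of that table, the sketch below uses io_uring (via liburing) to submit one Direct I/O read without blocking the calling thread; the file path is an example and error handling is minimal:

```cpp
// Asynchronous Direct I/O sketch using io_uring (liburing): the file is opened
// with O_DIRECT so the page cache is bypassed, the buffer is block-aligned, and
// the read is *submitted* rather than waited on -- the thread stays free until it
// chooses to reap the completion. Build with: g++ -O2 example.cpp -luring
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>
#include <cstdlib>

int main() {
    int fd = open("/var/lib/mydb/data.bin", O_RDONLY | O_DIRECT);  // example path
    if (fd < 0) return 1;

    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  // DIO needs aligned buffers

    io_uring ring;
    if (io_uring_queue_init(/*entries=*/8, &ring, /*flags=*/0) != 0) return 1;

    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, /*offset=*/0);  // queue the read
    io_uring_submit(&ring);                                 // hand it to the kernel

    // ... the thread can run other tasks here instead of sleeping in read() ...

    io_uring_cqe* cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);  // collect the completion when we're ready
    int bytes_read = cqe->res;       // negative value means -errno
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return (bytes_read >= 0) ? 0 : 1;
}
```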
However, if the cache hit rate is sufficiently high, then these large</p><p>numbers of threads will contend with each other for the limited number of cores.</p><p>When using Direct I/O, this problem is somewhat mitigated. The application knows</p><p>exactly when a thread is blocked on I/O and when it can run, so the application can</p><p>adjust the number of running threads according to runtime conditions.</p><p>With AIO/DIO, the application has full control over both running threads and</p><p>waiting I/O (the two are completely divorced), so it can easily adjust to in-memory or</p><p>disk-bound conditions or anything in between.</p><p>I/O Alignment</p><p>Storage devices have a block size; all I/O must be performed in multiples of this block</p><p>size which is typically 512 or 4096 bytes. Using read/write or mmap, the kernel performs</p><p>the alignment automatically; a small read or write is expanded to the correct block</p><p>boundary by the kernel before it is issued.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>57</p><p>With DIO, it is up to the application to perform block alignment. This incurs some</p><p>complexity, but also provides an advantage: The kernel will usually over-align to a 4096</p><p>byte boundary even when a 512-byte boundary suffices. However, a user application</p><p>using DIO can issue 512-byte aligned reads, which results in saving bandwidth on</p><p>small items.</p><p>Application Complexity</p><p>While the previous discussions favored AIO/DIO for I/O intensive applications, that method</p><p>comes with a significant cost: complexity. Placing the responsibility of cache management</p><p>on the application means it can make better choices than the kernel and make those</p><p>choices with less overhead. However, those algorithms need to be written and tested. Using</p><p>asynchronous I/O requires that the application is written using callbacks, coroutines, or a</p><p>similar method, and often reduces the reusability of many available libraries.</p><p>Choosing theFilesystem and/or Disk</p><p>Beyond performing the I/O itself, the database design must consider the medium</p><p>against which this I/O is done. In many cases, the choice is often between a filesystem or</p><p>a raw block device, which in turn can be a choice of a traditional spinning disk or an SSD</p><p>drive. In cloud environments, however, there can be the third option because local drives</p><p>are always ephemeral—which imposes strict requirements on the replication.</p><p>Filesystems vs Raw Disks</p><p>This decision can be approached from two angles: management costs and performance.</p><p>If you’re accessing the storage as a raw block device, all the difficulties with block</p><p>allocation and reclamation are on the application side. We touched on this topic slightly</p><p>earlier when we talked about memory management. The same set of challenges apply to</p><p>RAM as well as disks.</p><p>A connected, though very different, challenge is providing data integrity in case</p><p>of crashes. Unless the database is purely in-memory, the I/O should be done in a way</p><p>that avoids losing data or reading garbage from disk after a restart. Modern filesystems,</p><p>however, provide both and are very mature to trust the efficiency of allocations and</p><p>integrity of data. 
Accessing raw block devices unfortunately lacks those features (so they</p><p>need to be implemented at the same quality on the application side).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>58</p><p>From the performance point of view, the difference is not that drastic. On one hand,</p><p>writing data to a file is always accompanied by associated metadata updates. This</p><p>consumes both disk space and I/O bandwidth. However, some modern filesystems</p><p>provide a very good balance of performance and efficiency, almost eliminating the I/O</p><p>latency. (One of the most prominent examples is XFS.Another really good and mature</p><p>piece of software is Ext4). The great ally in this camp is the fallocate(2) system call</p><p>that makes the filesystem preallocate space on disk. When used, filesystems also have a</p><p>chance to make full use of the extents mechanisms, thus bringing the QoS of using files</p><p>to the same performance level as when using raw block devices.</p><p>Appending Writes</p><p>The database may have a heavy reliance on appends to files or require in-place updates</p><p>of individual file blocks. Both approaches need special attention from the system</p><p>architect because they call for different properties from the underlying system.</p><p>On one hand, appending writes requires careful interaction with the filesystem so</p><p>that metadata updates (file size, in particular) do not dominate the regular I/O.On the</p><p>other hand, appending writes (being sort of cache-oblivious algorithms) handle the disk</p><p>overwriting difficulties in a natural manner. Contrary to this, in-place updates cannot</p><p>happen at random offsets and sizes because disks may not tolerate this kind of workload,</p><p>even if they’re used in a raw block device manner (not via a filesystem).</p><p>That being said, let’s dive even deeper into the stack and descend into the</p><p>hardware level.</p><p>How Modern SSDs Work</p><p>Like other computational resources, disks are limited in the speed they can provide. This</p><p>speed is typically measured as a two-dimensional value with Input/Output Operations</p><p>per Second (IOPS) and bytes per second (throughput). Of course, these parameters are</p><p>not cut in stone even for each particular disk, and the maximum number of requests or</p><p>bytes greatly depends on the requests’ distribution, queuing and concurrency, buffering</p><p>or caching, disk age, and many other factors. So when performing I/O, a disk must</p><p>always balance between two inefficiencies—overwhelming the disk with requests and</p><p>underutilizing it.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>59</p><p>Overwhelming the disk should be avoided because when the disk is full of requests</p><p>it cannot distinguish between the criticality of</p><p>certain requests over others. Of course,</p><p>all requests are important, but it makes sense to prioritize latency-sensitive requests.</p><p>For example, ScyllaDB serves real-time queries that need to be completed in single-</p><p>digit milliseconds or less and, in parallel, it processes terabytes of data for compaction,</p><p>streaming, decommission, and so forth. The former have strong latency sensitivity; the</p><p>latter are less so. 
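Before turning to how an I/O scheduler arbitrates between such requests, it may help to make the earlier fallocate(2) remark concrete. The sketch below preallocates space for a log-like file from Python; the path and segment size are made-up example values, and os.posix_fallocate is simply a thin wrapper over the fallocate(2) call mentioned above:

import os

path = "/var/lib/mydb/commitlog-0001.log"   # hypothetical file
segment_size = 128 * 1024 * 1024            # reserve 128 MiB up front

fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
try:
    # Ask the filesystem to reserve blocks now so that later appends can use
    # extents efficiently instead of allocating space on every write.
    os.posix_fallocate(fd, 0, segment_size)
finally:
    os.close(fd)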
Good I/O maintenance that tries to maximize the I/O bandwidth while keeping latency as low as possible for latency-sensitive tasks is complicated enough to become a standalone component called the I/O Scheduler.

When evaluating a disk, you would most likely be looking at its four parameters—read/write IOPS and read/write throughput (such as in MB/s). Comparing these numbers to one another is a popular way of claiming one disk is better than the other and of estimating the aforementioned “bandwidth capacity” of the drive by applying Little’s Law. With that, the I/O Scheduler’s job is to provide a certain level of concurrency inside the disk to get maximum bandwidth from it, but not to make this concurrency so high that the disk starts queueing requests internally for longer than needed.

For instance, Figure 3-6 illustrates how read request latency depends on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes (pursuing the disk bandwidth). The latency value is color-coded, and the “interesting area” is painted in cyan—this is where the latency stays below 1 millisecond. The drive measured is the NVMe disk that comes with the AWS EC2 i3en.3xlarge instance.

Figure 3-6. Bandwidth/latency graphs showing how read request latency depends on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes (pursuing the disk bandwidth)

This drive demonstrates almost perfect half-duplex behavior—increasing the read intensity several times requires roughly the same reduction in write intensity to keep the disk operating at the same speed.

Tip: How to Measure Your Own Disk Behavior Under Load  The better you understand how your own disks perform under load, the better you can tune them to capitalize on their “sweet spot.” One way to do this is with Diskplorer,4 an open-source disk latency/bandwidth exploring toolset. Using Linux fio under the hood, it runs a battery of measurements to discover performance characteristics for a specific hardware configuration, giving you an at-a-glance view of how server storage I/O will behave under load.

For a walkthrough of how to use this tool, see the Linux Foundation video, “Understanding Storage I/O Under Load.”5

4 You can access Diskplorer at https://github.com/scylladb/diskplorer. This project contains instructions on how to generate a graph of your own.

Networking

The conventional networking functionality available in Linux is remarkably full-featured, mature, and performant. Since the database rarely imposes severe per-ping latency requirements, there are very few surprises that come from it when properly configured and used. Nonetheless, some considerations still need to be made.

As explained by David Ahern, “Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the IRQ is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf.
For example, ‘top’ can show a</p><p>process that appears to be using 99+% CPU, but in reality, 60 percent of that time is spent</p><p>processing packets—meaning the process is really only getting 40 percent of the CPU to</p><p>make progress on its workload.”6</p><p>However, for truly networking-intensive applications, the Linux stack is constrained:</p><p>• Kernel space implementation: Separation of the network stack</p><p>into kernel space means that costly context switches are needed to</p><p>perform network operations, and that data copies must be performed</p><p>to transfer data from kernel buffers to user buffers and vice versa.</p><p>• Time sharing: Linux is a time-sharing system, and so must rely on</p><p>slow, expensive interrupts to notify the kernel that there are new</p><p>packets to be processed.</p><p>5 Watch the video on YouTube (www.youtube.com/watch?v=Am-nXO6KK58).</p><p>6 For the source and additional detail, see David Ahern’s, “The CPU Cost of Networking on a Host”</p><p>(https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>62</p><p>• Threaded model: The Linux kernel is heavily threaded, so all data</p><p>structures are protected with locks. While a huge effort has made</p><p>Linux very scalable, this is not without limitations and contention</p><p>occurs at large core counts. Even without contention, the locking</p><p>primitives themselves are relatively slow and impact networking</p><p>performance.</p><p>As before, the way to overcome this limitation is to move the packet processing to the</p><p>userspace. There are plenty of out-of-kernel implementations of the TCP algorithm that</p><p>are worth considering.</p><p>DPDK</p><p>One of the generic approaches that’s often referred to in the networking area is the poll</p><p>mode vs interrupt model. When a packet arrives, the system may have two options for</p><p>how to get informed—set up and interrupt from the hardware (or, in the case of the</p><p>userspace implementation, from the kernel file descriptor using the poll family of system</p><p>calls) or keep polling the network card on its own from time to time until the packet is</p><p>noticed.</p><p>The famous userspace network toolkit, called DPDK, is designed specifically for</p><p>fast packet processing, usually in fewer than 80 CPU cycles per packet.7 It integrates</p><p>seamlessly with Linux in order to take advantage of high-performance hardware.</p><p>IRQ Binding</p><p>As stated earlier, packet processing may take up to 60 percent of the CPU time, which</p><p>is way too much. This percentage leaves too few CPU ticks for the database work itself.</p><p>Even though in this case the backpressure mechanism would most likely keep the</p><p>external activity off and the system would likely find its balance, the resulting system</p><p>throughput would likely be unacceptable.</p><p>System architects may consider the non-symmetrical CPU approach to mitigate</p><p>this. If you’re letting the Linux kernel process network packets, there are several ways to</p><p>localize this processing on separate CPUs.</p><p>7 For details, see the Linux Foundation’s page on DPDK (Data Plane Developers Kit) at</p><p>www.dpdk.org.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>63</p><p>The simplest way is to bind the IRQ processing from the NIC to specific cores or</p><p>hyper-threads. 
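In practice, such binding is often done either through the irqbalance daemon's policy hooks or by writing CPU lists into /proc/irq/<n>/smp_affinity_list directly. The following sketch shows the latter approach; the interface name is an assumption about the target system, and the script has to run as root:

nic = "eth0"                 # assumed NIC interface name
housekeeping_cpus = "0-1"    # CPUs reserved for network interrupt processing

# /proc/interrupts lists one IRQ per line, with the device name at the end.
with open("/proc/interrupts") as f:
    nic_irqs = [
        line.split(":")[0].strip()
        for line in f
        if nic in line and line.split(":")[0].strip().isdigit()
    ]

for irq in nic_irqs:
    # Confine the hard-IRQ handling for this NIC to the chosen cores (needs root).
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as affinity:
        affinity.write(housekeeping_cpus)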
Linux uses two-step processing of incoming packets called IRQ and soft-</p><p>IRQ.If the IRQs are properly bound to cores, the soft-IRQ also happens on those cores—</p><p>thus completely localizing the processing.</p><p>For huge-scale nodes running tens to hundred(s) of cores, the number of network-</p><p>only cores may become literally more than one. In this case, it might make sense to</p><p>localize processing even further by assigning cores from different NUMA nodes and</p><p>teaching the NIC to balance the traffic between those using the receive packet steering</p><p>facility of the Linux kernel.</p><p>Summary</p><p>This chapter introduced a number of ways that database engineering decisions enable</p><p>database users to squeeze more power out of modern infrastructure. For CPUs, the</p><p>chapter talked about taking advantage of multicore servers by limiting resource sharing</p><p>across cores and using future-promise design to coordinate work across cores. The</p><p>chapter also provided a specific example of how low-level CPU architecture has direct</p><p>implications on the database.</p><p>Moving on to memory, you</p><p>read about two related but independent subsystems:</p><p>memory allocation and cache control. For I/O, the chapter discussed Linux options</p><p>such as traditional read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous</p><p>I/O—including the various tradeoffs of each. This was followed by a deep dive into</p><p>how modern SSDs work and how a database can take advantage of a drive’s unique</p><p>characteristics. Finally, you looked at constraints associated with the Linux networking</p><p>stack and explored alternatives such as DPDK and IRQ binding. The next chapter shifts</p><p>the focus from hardware interactions to algorithmic optimizations: pure software</p><p>challenges.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>64</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>65</p><p>CHAPTER 4</p><p>Database Internals:</p><p>Algorithmic Optimizations</p><p>In the performance world, the hardware is always the unbreakable limiting factor—one</p><p>cannot squeeze more performing units from a system than the underlying chips may</p><p>provide. As opposed to that, the software part of the system is often considered the most</p><p>flexible thing in programming—in the sense that it can be changed at any time given</p><p>enough developers’ brains and hands (and investors’ cash).</p><p>However, that’s not always the case. 
Sometimes selecting an algorithm should be</p><p>done as early as the architecting stage in the most careful manner possible because the</p><p>chosen approach becomes so extremely fundamental that changing it would effectively</p><p>mean rewriting the whole engine from scratch or requiring users to migrate exabytes of</p><p>data from one instance to another.</p><p>This chapter shares one detailed example of algorithmic optimization—from the</p><p>perspective of the engineer who led this optimization. Specifically, this chapter looks</p><p>at how the B-trees family can be used to store data in cache implementations and</p><p>other accessory and in-memory structures. This look into a representative engineering</p><p>challenge should help you better understand what tradeoffs or optimizations various</p><p>databases might be making under the hood—ideally, so you can take better advantage of</p><p>its very deliberate design decisions.1</p><p>Note The goal of this chapter is not to convince database users that they need a</p><p>database with any particular algorithmic optimization—or to educate infrastructure</p><p>engineers on designing B-trees or the finer points of algorithmic optimization.</p><p>Rather, it’s to help anyone selecting or working with a database understand the</p><p>1 This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.com/</p><p>blog). It is used here with permission of ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_4</p><p>66</p><p>level of algorithmic optimization that might impact a database’s performance.</p><p>Hopefully, it piques your curiosity in learning more about the engineering behind</p><p>the database you’re using and/or alternative databases you’re considering.</p><p>Optimizing Collections</p><p>Maintaining large sets of objects in memory deserves the same level of attention as</p><p>maintaining objects in external memory—say, spinning disks or network-attached</p><p>storages. For a task as simple as looking up an object by a plain key, the acceptable</p><p>solution is often a plain hash table (even with great attention to hash function selection)</p><p>or a binary balanced tree (usually the red-black one due to its implementation</p><p>simplicity). However, branchy trees like the B-trees family can significantly boost</p><p>performance. They also have a lot of non-obvious pitfalls.</p><p>To B- or Not toB-Tree</p><p>An important characteristic of a tree is cardinality. This is the maximum number of</p><p>child nodes that another node may have. In the corner case of cardinality of two, the</p><p>tree is called a binary tree. For other cases, there’s a wide class of so-called B-trees. The</p><p>common belief about binary vs B-trees is that the former ones should be used when the</p><p>data is stored in the RAM, while the latter trees should live in the disk. The justification</p><p>for this split is that RAM access speed is much higher than disk. Also, disk I/O is</p><p>performed in blocks, so it’s much better and faster to fetch several “adjacent” keys in one</p><p>request. RAM, unlike disks, allows random access with almost any granularity, so it’s</p><p>okay to have a dispersed set of keys pointing to each other.</p><p>However, there are many reasons that B-trees are often a good choice for in-memory</p><p>collections. The first reason is cache locality. 
When searching for a key in a binary tree,</p><p>the algorithm would visit up to logN elements that are very likely dispersed in memory.</p><p>On a B-tree, this search will consist of two phases—an intra-node search and descending</p><p>the tree—executed one after another. And while descending the tree doesn’t differ much</p><p>from the binary tree in the aforementioned sense, intra-node searching will access</p><p>keys that are located next to each other, thus making much better use of CPU caches.</p><p>Figure4-1 exemplifies the process of walking down a binary tree. Compare it along with</p><p>Figure4-2, which demonstrates a search in a B-tree set.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>67</p><p>Figure 4-1. Searching in a binary tree root</p><p>Figure 4-2. Searching in a B-tree set</p><p>The second reason that B-trees are often a good choice for in-memory collections</p><p>also comes from the dispersed nature of binary trees and from how modern CPUs</p><p>are designed. It’s well known that when executing a stream of instructions, CPU cores</p><p>split the processing of each instruction into stages (loading instructions, decoding</p><p>them, preparing arguments, and doing the execution itself) and the stages are run in</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>68</p><p>parallel in a unit called a conveyor. When a conditional branching instruction appears</p><p>in this stream, the conveyor needs to guess which of two potential branches it will have</p><p>to execute next and start loading it into the conveyor pipeline. If this guess fails, the</p><p>conveyor is flushed and starts to work from scratch. Such failures are called branch</p><p>mispredictions. They are harmful from a performance point of view2 and have direct</p><p>implications on the binary search algorithm. When searching for a key in such a tree,</p><p>the algorithm jumps left and right depending on the key comparison result without</p><p>giving the CPU a chance to learn which direction is “preferred.” In many cases, the CPU</p><p>conveyer is flushed.</p><p>The two-phased B-tree search can be made better with respect to branch</p><p>predictions. The trick is in making the intra-node search linear (i.e., walking the array of</p><p>keys forward key-by-key). In this case, there will be only a “should you move forward”</p><p>condition that’s much more predictable. There’s even a nice trick of turning binary</p><p>search into linear without sacrificing the number of comparisons,3 but this approach</p><p>is good for read-mostly collections because insertion into this layout is tricky and has</p><p>worse complexity</p><p>than for sorted arrays. This approach has proven itself in ScyllaDB’s</p><p>implementation and is also widely used in the Tarantool in-memory database.4</p><p>Linear Search onSteroids</p><p>That linear search can be improved a bit more. Let’s carefully count the number of key</p><p>comparisons that it may take to find a single key in a tree. For a binary tree, it’s well</p><p>known that it takes log2N comparisons (on average) where N is the number of elements.</p><p>We put the logarithm base here for a reason. Next, consider a k-ary tree with k children</p><p>per node. Does it take fewer comparisons? (Spoiler: no). To find the element, you have to</p><p>do the same search—get a node, find in which branch it sits, then proceed to it. You have</p><p>logkN levels in the tree, so you have to do that many descending steps. 
However on each</p><p>step, you need to do the search within k elements, which is, again, log2k if you’re doing a</p><p>binary search. Multiplying both, you still need at least log2N comparisons.</p><p>2 See Marek Majkowski’s blog, “Branch predictor: How many ‘if’s are too many? Including x86 and</p><p>M1 benchmarks!” https://blog.cloudflare.com/branch-predictor/.</p><p>3 See the tutorial, “Eytzinger Binary Search” https://algorithmica.org/en/eytzinger.</p><p>4 Both are available as open-source software; see https://github.com/scylladb/scylladb and</p><p>https://github.com/tarantool/tarantool.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>69</p><p>The way to reduce this number is to compare more than one key at a time when</p><p>doing intra-node searches. In case the keys are small enough, SIMD instructions can</p><p>compare up to 64 keys in one go. Although a SIMD compare instruction may be slower</p><p>than a classic cmp one and requires additional instructions to process the comparison</p><p>mask, linear SIMD-powered search wins on short enough arrays (and B-tree nodes can</p><p>be short enough). For example, Figure4-3 shows the times of looking up an integer in a</p><p>sorted array using three techniques—linear search, binary search, and SIMD-optimized</p><p>linear search such as the x86 Advanced Vector Extensions (AVX).</p><p>Figure 4-3. The test used a large amount of randomly generated arrays of values</p><p>dispersed in memory to eliminate differences in cache usage and a large amount</p><p>of random search keys to blur branch predictions. These are the average times of</p><p>finding a key in an array normalized by the array length. Smaller results are faster</p><p>(better)</p><p>Scanning theTree</p><p>One interesting flavor of B-trees is called a B+-tree. In this tree, there are two kinds of</p><p>keys—real keys and separation keys. The real keys live on leaf nodes (i.e., on those that</p><p>don’t have children), while separation keys sit on inner nodes and are used to select</p><p>which branch to go next when descending the tree. This difference has an obvious</p><p>consequence that it takes more memory to keep the same amount of keys in a B+-tree as</p><p>compared to B-tree. But it’s not only that.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>70</p><p>A great implicit feature of a tree is the ability to iterate over elements in a sorted</p><p>manner (called a scan). To scan a classical B-tree, there are both recursive and state-</p><p>machine algorithms that process the keys in a very non-uniform manner—the algorithm</p><p>walks up-and-down the tree while it moves. Despite B-trees being described as cache-</p><p>friendly, scanning them requires visiting every single node and inner nodes are visited in</p><p>a cache unfriendly manner. Figure4-4 illustrates this phenomenon.</p><p>Figure 4-4. Scanning a classical B-tree involves walking up and down the tree;</p><p>every node and inner node is visited</p><p>As opposed to this, B+-trees’ scan only needs to loop through its leaf nodes, which,</p><p>with some additional effort, can be implemented as a linear scan over a linked list of</p><p>arrays, as demonstrated in Figure4-5.</p><p>Figure 4-5. B+ tree scans only need to cover leaf nodes</p><p>When theTree Size Matters</p><p>Talking about memory, B-trees don’t provide all these benefits for free (neither do B+-</p><p>trees). 
As the tree grows, so does the number of nodes in it and it’s useful to consider the</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>71</p><p>overhead needed to store a single key. For a binary tree, the overhead is three pointers—</p><p>to both left and right children as well as to the parent node. For a B-tree, it will differ for</p><p>inner and leaf nodes. For both types, the overhead is one parent pointer and k pointers</p><p>to keys, even if they are not inserted in the tree. For inner nodes there will additionally be</p><p>k+1 pointers to child nodes.</p><p>The number of nodes in a B-tree is easy to estimate for a large number of keys. As the</p><p>number of nodes grows, the per-key overhead blurs as keys “share” parent and children</p><p>pointers. However, there’s a very interesting point at the beginning of a tree’s growth.</p><p>When the number of keys becomes k+1 (i.e., the tree overgrows its first leaf node), the</p><p>number of nodes jumps three times because, in this case, it’s needed to allocate one</p><p>more leaf node and one inner node to link those two.</p><p>There is a good and pretty cheap optimization to mitigate this spike, called “linear</p><p>root.” The leaf root node grows on demand, doubling each step like a std::vector in</p><p>C++, and can overgrow the capacity of k up to some extent. Figure4-6 shows the per-key</p><p>overhead for a 4-ary B-tree with 50 percent initial overgrowth. Note the first split spike of</p><p>a classical algorithm at five keys.</p><p>Figure 4-6. The per-key overhead for a 4-ary B-tree with 50 percent initial</p><p>overgrowth</p><p>When discussing how B-trees work with small amounts of keys, it’s worth</p><p>mentioning the corner case of one key. In ScyllaDB, a B-tree is used to store sorted rows</p><p>inside a block of rows called a partition. Since it’s possible to have a schema where each</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>72</p><p>partition always has a single row, this corner case is not that “corner” for us. In the case</p><p>of a binary tree, the single-element tree is equivalent to having a direct pointer from the</p><p>tree owner to this element (plus the cost of two nil pointers to the left and right children).</p><p>In case of a B-tree, the cost of keeping the single key is always in having a root node that</p><p>implies extra pointer fetching to access this key. Even the linear root optimization is</p><p>helpless here. Fixing this corner case was possible by reusing the pointer to the root node</p><p>to point directly to the single key.</p><p>The Secret Life ofSeparation Keys</p><p>This section dives into technical details of B+-tree implementation.</p><p>There are two ways of managing separation keys in a B+-tree. The separation key</p><p>at any level must be less than or equal to all the keys from its right subtree and greater</p><p>than or equal to all the keys from its left subtree. Mind the “or” condition—the exact</p><p>value of the separation key may or may not coincide with the value of some key from the</p><p>respective branch (it’s clear that this some will be the rightmost key on the left branch</p><p>or leftmost on the right). Let’s look at these two cases. 
If the tree balancing maintains</p><p>the separation key to be independent from other key values, then it’s the light mode; if it</p><p>must coincide with some of them, then it will be called the strict mode.</p><p>In the light separation mode, the insertion and removal operations are a bit faster</p><p>because they don’t need to care about separation keys that much. It’s enough if they</p><p>separate branches, and that’s it. A somewhat worse consequence of the light separation</p><p>is that separation keys are separate values that may appear in the tree by copying existing</p><p>keys. If the key is simple, (e.g., an integer), this will likely not cause any trouble. However,</p><p>if keys are strings or, as in ScyllaDB’s case, database partition or clustering keys, copying</p><p>it might be both resource consuming and out-of-memory risky.</p><p>On the other hand, the strict separation mode makes it possible to avoid key copying</p><p>by implementing separation keys as references</p><p>on real ones. This would involve some</p><p>complication of insertion and especially removal operations. In particular, upon real key</p><p>removal, it will be necessary to find and update the relevant separation keys. Another</p><p>difficulty to care about is that moving a real key value in memory, if it’s needed (e.g.,</p><p>in ScyllaDB’s case keys are moved in memory as a part of memory defragmentation</p><p>hygiene), will also need to update the relevant reference from separation keys. However,</p><p>it’s possible to show that each real key will be referenced by at most one separation key.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>73</p><p>Speaking about memory consumption, although large B-trees were shown to</p><p>consume less memory per-key as they get filled, the real overhead would very likely be</p><p>larger, since the nodes of the tree will typically be underfilled because of the way the</p><p>balancing algorithm works. For example, Figures4-7 and 4-8 show how nodes look in a</p><p>randomly filled 4-ary B-tree.</p><p>Figure 4-7. Distribution of number of keys in a node for leaf nodes</p><p>Figure 4-8. Distribution of number of keys in a node for inner nodes</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>74</p><p>It’s possible to define a compaction operation for a B-tree that will pick several</p><p>adjacent nodes and squash them together, but this operation has its limitations. First,</p><p>a certain amount of underoccupied nodes makes it possible to insert a new element</p><p>into a tree without the need to rebalance, thus saving CPU cycles. Second, since each</p><p>node cannot contain less than a half of its capacity, squashing two adjacent nodes</p><p>is impossible. Even if considering three adjacent nodes, then the amount of really</p><p>squashable nodes would be less than 5 percent of the leaves and less than 1 percent of</p><p>the inners.</p><p>Summary</p><p>As extensive as these optimizations might seem, they are really just the tip of the iceberg</p><p>for this one particular example. Many finer points that matter from an engineering</p><p>perspective were skipped for brevity (for example, the subtle difference in odd vs</p><p>even number of keys on a node). For a database user, the key takeaway here is that</p><p>an excruciating level of design and experimentation often goes into the software for</p><p>determining how your database stores and retrieves data. 
You certainly don’t need to</p><p>be this familiar with every aspect of how your database was engineered. But knowing</p><p>what algorithmic optimizations your database has focused on will help you understand</p><p>why it performs in certain ways under different contexts. And you might discover some</p><p>impressively engineered capabilities that could help you handle more user requests or</p><p>shave a few precious milliseconds off your P99 latencies. The next chapter takes you into</p><p>the inner workings of database drivers and shares tips for getting the most out of a driver,</p><p>particularly from a performance perspective.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>75</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>77</p><p>CHAPTER 5</p><p>Database Drivers</p><p>Databases usually expose a specific communication protocol for their users. This</p><p>protocol is the foundation of communication between clients and servers, so it’s often</p><p>well-documented and has a formal specification. Some databases, like PostgreSQL,</p><p>implement their own binary format on top of the TCP/IP stack.1 Others, like Amazon</p><p>DynamoDB,2 build theirs on top of HTTP, which is a little more verbose, but also more</p><p>versatile and compatible with web browsers. It’s also not uncommon to see a database</p><p>exposing a protocol based on gRPC3 or any other well-established framework.</p><p>Regardless of the implementation details, users seldom use the bare protocol</p><p>themselves because it’s usually a fairly low-level API.What’s used instead is a driver—a</p><p>programming interface written in a particular language, implementing a higher-level</p><p>abstraction for communicating with the database. Drivers hide all the nitty-gritty details</p><p>behind a convenient interface, which saves users from having to manually handle</p><p>connection management, parsing, validation, handshakes, authentication, timeouts,</p><p>retries, and so on.</p><p>In a distributed environment (which a scalable database cluster usually is), clients,</p><p>and therefore drivers, are an extremely important part of the ecosystem. The clients</p><p>are usually the most numerous group of actors in the system, and they are also very</p><p>heterogeneous in nature, as visualized in Figure5-1. Some clients are connected via</p><p>local network interfaces, other ones connect via a questionable Wi-Fi hotspot on another</p><p>continent and thus have vastly different latency characteristics and error rates. 
Some</p><p>might run on microcontrollers with 1MiB of random access memory, while others</p><p>utilize 128-core bare metal machines from a cloud provider. Due to this diversity, it’s</p><p>1 See the PostgreSQL documentation (https://www.postgresql.org/docs/7.3/protocol-</p><p>protocol.html).</p><p>2 See the DynamoDB Developer Guide on the DynamoDB API (https://docs.aws.amazon.com/</p><p>amazondynamodb/latest/developerguide/HowItWorks.API.html).</p><p>3 gRPC is “a high performance, open-source universal RPC framework;” see https://grpc.io for</p><p>details.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_5</p><p>78</p><p>very important to take drivers into consideration when thinking about performance,</p><p>scalability, and resilience to failures. Ultimately it’s the drivers that generate traffic and</p><p>its concurrency, so cooperation between them and database nodes is crucial for the</p><p>whole system to be healthy and efficient.</p><p>Note As a reminder, concurrency, in the context of this book, is the measure of</p><p>how many operations are performed at the same point in time. It's conceptually</p><p>similar to parallelism. With concurrency, the operations occur physically at the</p><p>same time (e.g. on multiple CPU cores or multiple machines). Parallelism does</p><p>not specify that; the operations might just as well be executed in small steps on</p><p>a single machine. Nowadays, distributed systems must rely on providing high</p><p>concurrency in order to remain competitive and catch up with ever-developing</p><p>technology.</p><p>This chapter takes a look at how drivers impact performance—through the eyes of</p><p>someone who has engineered drivers for performance. It provides insight into various</p><p>ways that drivers can support efficient client-server interactions and shares tips for</p><p>getting the most out of a driver, particularly from the performance perspective. Finally,</p><p>the chapter wraps up with several considerations to keep in mind as you’re selecting</p><p>a driver.</p><p>Relationship Between Clients andServers</p><p>Scalability is a measure of how well your system reacts to increased load. This load is</p><p>usually generated by clients using their drivers, so keeping the relationship between</p><p>your clients and servers sound is an important matter. The more you know about your</p><p>workloads, your clients’ behavior, and their usage patterns, the better you’re prepared to</p><p>handle</p><p>both sudden spikes in traffic and sustained, long-term growth in usage.</p><p>Each client is different and should be treated as such. The differences come both</p><p>from clients’ characteristics, like their number and volume, and from their requirements.</p><p>Some clients have strict latency guarantees, even at the cost of higher error rates. Others</p><p>do not particularly care about the latency of any single database query, but just want a</p><p>steady pace of progress in their long-standing queries. Some databases target specific</p><p>types of clients (e.g., analytical databases which expect clients processing large aggregate</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>79</p><p>queries operating on huge volumes of historical data). 
Other ones strive to be universal,</p><p>handling all kinds of clients and balancing the load so that everyone is happy (or, more</p><p>precisely, “happy enough”).</p><p>Workload Types</p><p>There are multiple ways of classifying database clients. One particularly interesting</p><p>way is to delineate between clients processing interactive and batch (e.g., analytical)</p><p>workloads, also known as OLTP (online transaction processing) vs OLAP (online</p><p>analytical processing)—see Figure5-2.</p><p>Figure 5-1. Visualization of clients and servers in a distributed system</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>80</p><p>Figure 5-2. Difference between interactive and batch (analytical) workloads</p><p>Interactive Workloads</p><p>A client processing an interactive workload typically wants certain latency guarantees.</p><p>Receiving a response fast is more important than ensuring that the query succeeded.</p><p>In other words, it’s better to return an error in a timely manner than make the client</p><p>indefinitely wait for the correct response. Such workloads are often characterized by</p><p>unbounded concurrency, which means that the number of in-progress operations is</p><p>hard to predict.</p><p>A prime example of an interactive workload is a server handling requests from web</p><p>browsers. Imagine an online game, where players interact with the system straight from</p><p>their favorite browsers. High latency for such a player means a poor user experience</p><p>because people tend to despise waiting for online content for more than a few hundred</p><p>milliseconds; with multi-second delays, most will just ditch the game as unusable and</p><p>try something else. It’s therefore particularly important to be as interactive as possible</p><p>and return the results quickly—even if the result happens to be a temporary error. In</p><p>such a scenario, the concurrency of clients varies and is out of control for the database.</p><p>Sometimes there might be a large influx of players, and the database might need to</p><p>refuse some of them to avoid overload.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>81</p><p>Batch (Analytical) Workloads</p><p>A batch (analytical) workload is the conceptual opposite of an interactive one. With</p><p>such workloads, it doesn’t matter whether any single request is processed in a few</p><p>milliseconds or hours. The important thing is that the processing makes steady progress</p><p>with a satisfactory error rate, which is ideally zero. Batch workloads tend to have fixed</p><p>concurrency, which makes it easier for the database to keep the load under control.</p><p>A good example of a batch workload is an Apache Spark4 job performing analytics</p><p>on a big dataset (think terabytes). There are only a few connections established to</p><p>the database, and they continuously send requests in order to fetch data for long</p><p>computations. Because the concurrency is predictable, the database can easily respond</p><p>to an increased load by applying backpressure (e.g., by delaying the responses a little</p><p>bit). The analytical processing will simply slow down, adjusting its speed according to</p><p>the speed at which the database can consume queries.</p><p>Mixed Workloads</p><p>Certain workloads cannot be easily qualified as fully interactive or fully batch. The</p><p>clients are free to intermix their requirements, concurrency, and load however they</p><p>please—so the databases should also be ready for surprises. 
For example, a batch</p><p>workload might suddenly experience a giant temporary spike in concurrency. Databases</p><p>should, on the one hand, maintain a level of trust in the workload’s typical patterns, but</p><p>on the other hand anticipate that workloads can simply change over time—due to bugs,</p><p>hardware changes, or simply because the use case has diverged from its original goal.</p><p>Throughput vs Goodput</p><p>A healthy distributed database cluster is characterized by stable goodput, not</p><p>throughput. Goodput is an interesting portmanteau of good + throughput, and it’s a</p><p>measure of useful data being transferred between clients and servers over the network,</p><p>as opposed to just any data. Goodput disregards errors and other churn-like redundant</p><p>retries, and is used to judge how effective the communication actually is.</p><p>This distinction is important.</p><p>4 Apache Spark is “multi-language engine for executing data engineering, data science, and</p><p>machine learning on single-node machines or clusters.” For details, see https://spark.</p><p>apache.org/.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>82</p><p>Imagine an extreme case of an overloaded node that keeps returning errors for each</p><p>incoming request. Even though stable and sustainable throughput can be observed,</p><p>this database brings no value to the end-user. Thus, it’s essential to track how much</p><p>useful data can be delivered in an acceptable time. For example, this can be achieved</p><p>by tracking both the total throughput and throughput spent on sending back error</p><p>messages and then subtracting one from another to see how much valid data was</p><p>transferred (see Figure5-3).</p><p>Figure 5-3. Note how a fraction of the throughput times out, effectively requiring</p><p>more work from clients to achieve goodput</p><p>Maximizing goodput is a delicate operation and it heavily depends on the</p><p>infrastructure, workload type, clients’ behavior, and many other factors. In some cases,</p><p>the database shedding load might be beneficial for the entire system. Shedding is a</p><p>rather radical measure of dealing with overload: Requests qualified as “risky” are simply</p><p>ignored by the server, or immediately terminated with an error. This type of overload</p><p>protection is especially useful against issues induced by interactive workloads with</p><p>unbounded concurrency (there’s not much a database can do to protect itself except</p><p>drop some of the incoming requests early).</p><p>The database server isn’t an oracle; it can’t accurately predict whether a request is</p><p>going to fail due to overload, so it must guess. 
Fortunately, there are quite a few ways of</p><p>making that guess educated:</p><p>• Shedding load if X requests are already being processed, where X is</p><p>the estimated maximum a database node can handle.</p><p>• Refusing a request if its estimated memory usage is larger than the</p><p>database could handle at the moment.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>83</p><p>• Probabilistically refusing a request if Y requests are already being</p><p>processed, where Y is a percentage of the maximum a database node</p><p>can handle, with the probability raising to 100 percent once a certain</p><p>threshold is reached.</p><p>• Refusing a request if its estimated execution time indicates that it’s</p><p>not going to finish in time, and instead it is likely to time out anyway.</p><p>While refusing clients’ requests is detrimental to user experience, sometimes</p><p>it’s simply the lesser of two evils. If dropping a number of requests allows even more</p><p>requests to successfully finish in time, it increases the cluster’s goodput.</p><p>Clients can help the database maximize goodput and keep the latency low by</p><p>declaring for how long the request is considered valid. For instance, in high frequency</p><p>trading, a request that takes more than a couple of milliseconds is just as good as a</p><p>request that failed. By letting the database know that’s the case, you can allow it to retire</p><p>some requests early, leaving valuable resources for other requests which still have a</p><p>chance to be successful. Proper timeout management is a broad topic and it deserves a</p><p>separate</p><p>section.</p><p>Timeouts</p><p>In a distributed system, there are two fundamental types of timeouts that influence one</p><p>another: client-side timeouts and server-side timeouts. While both are conceptually</p><p>similar, they have different characteristics. It’s vital to properly configure both of them to</p><p>prevent problems like data races and consistency issues.</p><p>Client-Side Timeouts</p><p>This type of timeout is generally configured in the database driver. It signifies how long it</p><p>takes for a driver to decide that a response from a server is not likely to arrive. In a perfect</p><p>world built on top of a perfect network, all parties always respond to their requests.</p><p>However, in practice, there are numerous causes for a response to either be late or lost:</p><p>• The recipient died</p><p>• The recipient is busy with other tasks</p><p>• The network failed, maybe due to hardware malfunction</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>84</p><p>• The network has a significant delay because packets get stuck in an</p><p>intermediate router</p><p>• A software bug caused the packet to be lost</p><p>• And so on</p><p>Since in a distributed environment it’s usually impossible to guess what</p><p>happened, the client must sometimes decide that a request is lost. The alternative</p><p>is to wait indefinitely. That might work for a select set of use cases, but it’s often</p><p>simply unacceptable. If a single failed request holds a resource for an unspecified</p><p>time, the system is eventually doomed to fail. Hence, client-side timeouts are used</p><p>as a mechanism to make sure that the system can operate even in the event of</p><p>communication issues.</p><p>A unique characteristic of a client-side timeout is that the decision to give up on a</p><p>request is made solely by the client, in the absence of any feedback from the server. 
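As a concrete illustration, this is roughly how a client-side timeout can be configured with the Python driver for Cassandra/ScyllaDB; the contact point and the timeout budgets are illustrative values:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Give up on any request that hasn't completed within 2 seconds.
profile = ExecutionProfile(request_timeout=2.0)

cluster = Cluster(
    contact_points=["127.0.0.1"],   # assumed local test node
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

# The budget can also be overridden for an individual request.
rows = session.execute("SELECT release_version FROM system.local", timeout=1.0)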
It’s</p><p>entirely possible that the request in question is still being processed and utilizes the</p><p>server’s resources. And, worst of all, the unaware server can happily return the response</p><p>to the client after it’s done processing, even though nobody’s interested in this stale</p><p>data anymore! That presents another aspect of error handling: Drivers must be ready to</p><p>handle stray, expired responses correctly.</p><p>Server-Side Timeouts</p><p>A server-side timeout determines when a database node should start considering a</p><p>particular request as expired. Once this point in time has passed, there is no reason</p><p>to continue processing the query. (Doing so would waste resources which could have</p><p>otherwise been used for serving other queries that still have a chance to succeed.)</p><p>When the specified time has elapsed, databases often return an error indicating that the</p><p>request took too long.</p><p>Using reasonable values for server-side timeouts helps the database manage its</p><p>priorities in a more precise way, allocating CPU, memory and other scarce resources on</p><p>queries likely to succeed in a timely manner. Drivers that receive an error indicating that</p><p>a server-side timeout has occurred should also act accordingly—perhaps by reducing</p><p>the pressure on a particular node or retrying on another node that hasn’t experienced</p><p>timeouts lately.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>85</p><p>A Cautionary Tale</p><p>The CQL protocol, which specifies the communication layer in Apache Cassandra and</p><p>ScyllaDB, comes with built-in support for concurrency. Namely, each request is assigned</p><p>a stream ID, unique for each connection. This stream ID is encoded as a 16-bit integer</p><p>with the first bit being reserved by the protocol, which leaves the drivers 32768 unique</p><p>values for handling in-flight requests per single connection. This stream ID is later</p><p>used to match an incoming response with its original request. That’s not a particularly</p><p>large number, given that modern systems are known to handle millions of requests per</p><p>second. Thus, drivers need to eventually reuse previously assigned stream IDs.</p><p>But the CQL driver for Python had a bug.5 In the event of a client-side timeout, it</p><p>assumed that the stream ID of an expired request was immediately free to reuse. While</p><p>the assumption holds true if the server dies, it is incorrect if processing simply takes</p><p>longer than expected. It was therefore possible that once a response with a given stream</p><p>ID arrived, another request had already reused the stream ID, and the driver would</p><p>mistakenly match the response with the new request. If the user was lucky, they would</p><p>simply receive garbage data that did not pass validation. Unfortunately, data from the</p><p>mismatched response might appear correct, even though it originates from a totally</p><p>different request. This is the kind of bug that looks innocent at first glance, but may cause</p><p>people to log in to other people’s bank accounts and wreak havoc on their lives.</p><p>A rule of thumb for client-side timeouts is to make sure that a server-side timeout</p><p>also exists and is strictly shorter than the client-side one. It should take into account</p><p>clock synchronization between clients and servers (or lack thereof), as well as estimated</p><p>network latency. 
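To make the rule of thumb concrete, a deployment might pair the two timeouts along these lines (all numbers are illustrative; the keys in the comment are the standard server-side request timeout settings in scylla.yaml/cassandra.yaml):

# Server side (scylla.yaml / cassandra.yaml), kept strictly below the client budget:
#   read_request_timeout_in_ms: 1500
#   write_request_timeout_in_ms: 1500

SERVER_TIMEOUT_S = 1.5            # mirrors the server settings above
NETWORK_AND_CLOCK_MARGIN_S = 0.5  # estimated round trips plus clock-skew allowance

# Client-side budget, e.g. passed as request_timeout to the driver's ExecutionProfile.
client_timeout_s = SERVER_TIMEOUT_S + NETWORK_AND_CLOCK_MARGIN_S
assert client_timeout_s > SERVER_TIMEOUT_S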
Such a procedure minimizes the chances for a late response to arrive at</p><p>all, and thus removes the root cause of many issues and vulnerabilities.</p><p>5 Bug report and applied fixes can be found here:</p><p>https://datastax-oss.atlassian.net/browse/PYTHON-1286</p><p>https://github.com/scylladb/python-driver/pull/106</p><p>https://github.com/datastax/python-driver/pull/1114</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>86</p><p>Contextual Awareness</p><p>At this point it should be clear that both servers and clients can make better, more</p><p>educated, and mutually beneficial decisions if they know more about each other.</p><p>Exchanging timeout information is important, but drivers and servers can do even more</p><p>to keep each other up to date.</p><p>Topology andMetadata</p><p>Database servers are often combined into intricate topologies where certain nodes</p><p>are grouped in a single geographical location, others are used only as a fast cache</p><p>layer, and yet others store seldom accessed cold data in a cheap place, for emergency</p><p>purposes only.</p><p>Not every database exposes its topology to the end-user. For example, DynamoDB</p><p>takes that burden off of its clients and exposes only a single endpoint, taking care of</p><p>load balancing, overload prevention, and retry mechanisms on its own. On the other</p><p>hand, a fair share of popular databases (including ScyllaDB, Cassandra, and ArangoDB)</p><p>rely on the drivers to connect to each node, decide how many connections to keep,</p><p>when to speculatively retry, and when to close connections if they are suspected of</p><p>malfunctioning. In the ScyllaDB case, sharing up-to-date topology information with the</p><p>drivers helps them make the right decisions. This data can be shared in multiple ways:</p><p>• Clients periodically fetching topology information from the servers</p><p>• Clients subscribing to events sent by the servers</p><p>• Clients taking an active part in one of the information exchange</p><p>protocols (e.g., gossip6)</p><p>• Any combination of these</p><p>Depending on the database model, another valuable piece of information often</p><p>cached client-side is metadata—a prime example of which is database schema. SQL</p><p>databases, as well as many NoSQL ones, keep the data at least partially structured. A</p><p>schema defines the shape of a database row (or column), the kinds of data types stored</p><p>in different columns, and various other characteristics (e.g., how long a database row is</p><p>6 See the documentation on Gossip in ScyllaDB (https://docs.scylladb.com/stable/kb/</p><p>gossip.html).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>87</p><p>supposed to live before it’s garbage-collected). Based on up-to-date schemas, drivers</p><p>can perform additional validation, making sure that data sent to the server has a proper</p><p>type and adheres to any constraints required by the database. On the other hand, when</p><p>a driver-side cache for schemas gets out of sync, clients can experience their queries</p><p>failing for no apparent reason.</p><p>Synchronizing full schema information can be costly in terms of performance, and</p><p>finding a good compromise in how often to update</p><p>highly depends on the use case. A</p><p>rule of thumb is to update only as often as needed to ensure that the traffic induced by</p><p>metadata exchange never negatively impacts the user experience. 
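For instance, the Python driver for Cassandra/ScyllaDB exposes a few knobs for this trade-off. The sketch below assumes a bulk-loading client that rarely needs schema details, so it disables automatic schema refreshes and fetches them only on demand; whether that is appropriate depends entirely on the workload:

from cassandra.cluster import Cluster

cluster = Cluster(
    contact_points=["10.0.0.1"],    # assumed seed node
    schema_metadata_enabled=False,  # skip automatic schema metadata refreshes
    token_metadata_enabled=True,    # keep topology data for token-aware routing
)
session = cluster.connect()

# ...later, right before an operation that actually inspects the schema:
cluster.refresh_schema_metadata()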
It’s also worth noting</p><p>that in a distributed database, clients are not always up to date with the latest schema</p><p>information, and the system as a whole should be prepared to handle it and provide</p><p>tactics for dealing with such inconsistencies.</p><p>Current Load</p><p>Overload protection and request latency optimization are tedious tasks, but they can be</p><p>substantially facilitated by exchanging as much context as possible between interested</p><p>parties.</p><p>The following methods can be applied to distribute the load evenly across the</p><p>distributed system and prevent unwanted spikes:</p><p>1. Gathering latency statistics per each database connection in the</p><p>drivers:</p><p>a. What’s the average latency for this connection?</p><p>b. What’s the 99th percentile latency?</p><p>c. What’s the maximum latency experienced in a recent time frame?</p><p>2. Exchanging information about server-side caches:</p><p>a. Is the cache full?</p><p>b. Is the cache warm (i.e., filled with useful data)?</p><p>c. Are certain items experiencing elevated traffic and/or latency?</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>88</p><p>3. Interpreting server events:</p><p>a. Has the server started replying with “overload errors”?</p><p>b. How often do requests for this server time out?</p><p>c. What is the general rate of errors for this server?</p><p>d. What is the measured goodput from this server?</p><p>Based on these indicators, drivers should try to amend the amount of data they</p><p>send, the concurrency, and the rate of retries as well as speculative execution, which</p><p>can keep the whole distributed system in a healthy, balanced state. It’s ultimately in the</p><p>driver’s interest to ease the pressure on nodes that start showing symptoms of getting</p><p>overloaded, be it by reducing the concurrency of operations, limiting the frequency</p><p>and number of retries, temporarily giving up on speculatively sent requests, and so on.</p><p>Otherwise, if the database servers get overloaded, all clients may experience symptoms</p><p>like failed requests, timeouts, increased latency, and so on.</p><p>Request Caching</p><p>Many database management systems, ranging from SQLite, MySQL, and Postgres to</p><p>NoSQL databases, implement an optimization technique called prepared statements.</p><p>While the language used to communicate with the database is usually human-readable</p><p>(or at least developer-readable), it is not the most efficient way of transferring data from</p><p>one computer to another.</p><p>Let’s take a look at the (simplified) lifecycle of an unprepared statement once it’s sent</p><p>from a ScyllaDB driver to the database and back. This is illustrated in Figure5-4.</p><p>Figure 5-4. Lifecycle of an unprepared statement</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>89</p><p>1. A query string is created:</p><p>INSERT INTO my_table(id, descr) VALUES (42,</p><p>'forty two');</p><p>2. The string is packed into a CQL frame by the driver. Each CQL</p><p>frame consists of a header, which describes the purpose of a</p><p>particular frame. Following the header, a specific payload may be</p><p>sent as well. The full protocol specification is available at https://</p><p>github.com/apache/cassandra/blob/trunk/doc/native_</p><p>protocol_v4.spec.</p><p>3. The CQL frame is sent over the network.</p><p>4. The frame is received by the database.</p><p>5. Once the frame is received, the database interprets the frame</p><p>header and then starts parsing the payload. 
If there’s an</p><p>unprepared statement, the payload is represented simply as a</p><p>string, as seen in Step 1.</p><p>6. The database parses the string in order to validate its contents and</p><p>interpret what kind of an operation is requested: is it an insertion,</p><p>an update, a deletion, a selection?</p><p>7. Once the statement is parsed, the database can continue</p><p>processing it (e.g., by persisting data on disk, fetching whatever’s</p><p>necessary, etc.).</p><p>Now, imagine that a user wants to perform a hundred million operations on the</p><p>database in quick succession because the data is migrated from another system. Even</p><p>if parsing the query strings is a relatively fast operation and takes 50 microseconds, the</p><p>total time spent on parsing strings will take over an hour of CPU time. Sounds like an</p><p>obvious target for optimization.</p><p>The key observation is that operations performed on a database are usually similar</p><p>to one another and follow a certain pattern. For instance, migrating a table from one</p><p>system to another may mean sending lots of requests with the following schema:</p><p>INSERT INTO my_table(id, descr) VALUES (?, ?)</p><p>where ? denotes the only part of the string that varies between requests.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>90</p><p>This query string with question marks instead of real values is actually also valid</p><p>CQL! While it can’t be executed as is (because some of the values are not known), it can</p><p>be prepared.</p><p>Preparing such a statement means that the database will meticulously analyze the</p><p>string, parse it, and create an internal representation of the statement in its own memory.</p><p>Once done, a unique identifier is generated and sent back to the driver. The client can now</p><p>execute the statement by providing only its identifier (which is a 128-bit UUID7 in ScyllaDB)</p><p>and all the values missing from the prepared query string. The process of replacing</p><p>question marks with actual values is called binding and it’s the only thing that the database</p><p>needs to do instead of launching a CQL parser, which offers a significant speedup.</p><p>Preparing statements without care can also be detrimental to overall cluster</p><p>performance though. When a statement gets prepared, the database needs to keep a</p><p>certain amount of information about it in memory, which is hardly a limitless resource.</p><p>Caches for prepared statements are usually relatively small, under the assumption that</p><p>the driver’s users (app developers) are kind and only prepare queries that are used</p><p>frequently. If, on the other hand, a user were to prepare lots of unique statements that</p><p>aren’t going to be reused any time soon, the database cache might invalidate existing</p><p>entries for frequently used queries. The exact heuristics of how entries are invalidated</p><p>depends on the algorithm used in the cache, but a naive LRU (least recently used)</p><p>eviction policy is susceptible to this problem. Therefore, other cache algorithms resilient</p><p>to such edge cases should be considered when designing a cache without full information</p><p>about expected usage patterns. 
Some notable examples include the following:</p><p>• LFU (least frequently used)</p><p>Aside from keeping track of which item was most recently accessed,</p><p>LFU also counts how many times it was needed in a given time</p><p>period, and tries to keep frequently used items in the cache.</p><p>• LRU with two pools</p><p>One probationary pool for new entries, and another, usually</p><p>larger, pool for frequently used items. This algorithm avoids cache</p><p>thrashing when lots of one-time entries are inserted in the cache,</p><p>because they only evict other items from the probationary pool,</p><p>while more frequently accessed entries are safe in the main pool.</p><p>7 See the memo, “A Universally Unique IDentifier (UUID) URN Namespace,” at https://www.</p><p>ietf.org/rfc/rfc4122.txt.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>91</p><p>Finally, regardless of the algorithm used for cache eviction implemented server-side,</p><p>drivers should take care not to prepare queries too aggressively, especially if it happens</p><p>automatically, which is often the case in ORMs (object-relational mappings). Making</p><p>an interface convenient for the user may sound tempting, and developer experience is</p><p>indeed an important factor when designing a driver, but being too eager with reserving</p><p>precious database resources may be disadvantageous in the long term.</p><p>Query Locality</p><p>In distributed systems, any kind of locality is welcome because it reduces the chances of</p><p>failure, keeps the latency low, and generally prevents many undesirable events. While</p><p>database clients, and thus also</p><p>drivers, do not usually share the same machines with</p><p>the database cluster, it is possible to keep the distance between them short. “Distance”</p><p>might mean either a physical measure or the number of intermediary devices in the</p><p>network topology. Either way, for latency’s sake, it’s good to minimize it between parties</p><p>that need to communicate with each other frequently.</p><p>Many database management systems allow their clients to announce their</p><p>“location,” for example, by declaring which datacenter is their local, default one. Drivers</p><p>should take that information into account when communicating with the database</p><p>nodes. As long as all consistency requirements are fulfilled, it’s usually better to send</p><p>data directly to a nearby node, under the assumption that it will spend less time in</p><p>transit. Short routes also usually imply fewer middlemen, and that in turn translates to</p><p>fewer potential points of failure.</p><p>Drivers can make much more educated choices though. Quite a few NoSQL</p><p>databases can be described as “distributed hash tables” because they partition their</p><p>data and spread it across multiple nodes which own a particular set of hashes. If the</p><p>hashing algorithm is well known and deterministic, drivers can leverage that fact to try</p><p>to optimize the queries even further—sending data directly to the appropriate node, or</p><p>even the appropriate CPU core.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>92</p><p>ScyllaDB, Cassandra, and other NoSQL databases apply a concept of token8</p><p>awareness (see Figures5-5, 5-6, and 5-7):</p><p>1. A request arrives.</p><p>2. The receiving node computes the hash of the given input.</p><p>3. Based on the value of this hash, it computes which database</p><p>nodes are responsible for this particular value.</p><p>4. 
Finally, it forwards the request directly to the owning nodes.</p><p>However, in certain cases, the driver can compute the token locally on its own, and</p><p>then use the cluster topology information to route the request straight to the owning</p><p>node. This local node-level routing saves at least one network round-trip as well as the</p><p>CPU time of some of the nodes.</p><p>8 A token is how a hash value is named in Cassandra nomenclature.</p><p>Figure 5-5. Naive clients route queries to any node (coordinator)</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>93</p><p>Figure 5-6. Token-aware clients route queries to the right node(s)</p><p>In the Cassandra/ScyllaDB case, this is possible because each table has a well-</p><p>defined “partitioner,” which simply means a hash function implementation. The default</p><p>choice—used in Cassandra—is murmur3,9 which returns a 64-bit hash value, has</p><p>satisfying distribution, and is relatively cheap to compute. ScyllaDB takes it one step</p><p>further and allows the drivers to calculate which CPU core of which database node owns</p><p>a particular datum. When a driver is cooperative and proactively establishes a separate</p><p>connection per each core of each machine, it can send the data not only to the right</p><p>node, but also straight to the single CPU core responsible for handling it. This not only</p><p>saves network bandwidth, but is also very friendly to CPU caches.</p><p>9 See the DataStax documentation on Murmur3Partitioner (https://docs.datastax.com/en/</p><p>cassandra-oss/3.x/cassandra/architecture/archPartitionerM3P.html).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>94</p><p>Figure 5-7. Shard-aware clients route queries to the correct node(s) + core</p><p>Retries</p><p>In a perfect system, no request ever fails and logic implemented in the drivers can</p><p>be kept clean and minimal. In the real world, failures happen disturbingly often, so</p><p>the drivers should also be ready to deal with them. One such mechanism for failure</p><p>tolerance is a driver’s retry policy. A retry policy’s job is to decide whether a request</p><p>should be sent again because it failed (or at least the driver strongly suspects that it did).</p><p>Error Categories</p><p>Before diving into techniques for retrying requests in a smart way, there’s a more</p><p>fundamental question to consider: does a retry even make sense? The answer is not that</p><p>obvious and it depends on many internal and external factors. When a request fails, the</p><p>error can fall into the following categories, presented with a few examples:</p><p>1. Timeouts</p><p>a. Read timeouts</p><p>b. Write timeouts</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>95</p><p>2. Temporary errors</p><p>a. Database node overload</p><p>b. Dead target node</p><p>c. Temporary schema mismatch</p><p>3. Permanent errors</p><p>a. Incorrect query syntax</p><p>b. Authentication error</p><p>c. Insufficient permissions</p><p>Depending on the category, the retry decision may be vastly different. For instance,</p><p>it makes absolutely no sense to retry a request that has incorrect syntax. It will not</p><p>magically start being correct, and such a retry attempt would only waste bandwidth and</p><p>database resources.</p><p>Idempotence</p><p>Error categories aside, retry policy must also consider one important trait of the request</p><p>itself: its idempotence. 
An idempotent request can be safely applied multiple times, and</p><p>the result will be indistinguishable from applying it just once.</p><p>Why does this need to be taken into account at all? For certain classes of errors, the</p><p>driver cannot be sure whether the request actually succeeded. A prime example of such</p><p>error is a timeout. The fact that the driver did not manage to get a response in time does</p><p>not mean that the server did not successfully process the request. It’s a similar situation</p><p>if the network connection goes down: The driver won’t know if the database server</p><p>actually managed to apply the request.</p><p>When in doubt, the driver should make an educated guess in order to ensure</p><p>consistency. Imagine a request that withdraws $100 from somebody’s bank account. You</p><p>certainly don’t want to retry the same request again if you’re not absolutely sure that</p><p>it failed; otherwise, the bank customer might become a bit resentful. This is a perfect</p><p>example of a non-idempotent request: Applying it multiple times changes the ultimate</p><p>outcome.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>96</p><p>Fortunately, there’s a large subset of idempotent queries that can be safely retried,</p><p>even when it’s unclear whether they already succeeded:</p><p>1. Read-only requests</p><p>Since they do not modify any data, they won’t have any side</p><p>effects, no matter how often they’re retried.</p><p>2. Certain conditional requests that have compare-and-set</p><p>characteristics (e.g., “bump the value by 1 if the previous</p><p>value is 42”)</p><p>Depending on the use case, such a condition may be enough to</p><p>guarantee idempotence. Once this request is applied, applying</p><p>it again would have no effect since the previous value would</p><p>then be 43.</p><p>3. Requests with unique timestamps</p><p>When each request has a unique timestamp (represented in wall</p><p>clock time or based on a logical clock10), applying it multiple times</p><p>can be idempotent. A retry attempt will contain a timestamp</p><p>identical to the original request, so it will only overwrite data</p><p>identified by this particular timestamp. If newer data arrives</p><p>in-between with a newer timestamp, it will not be overwritten by a</p><p>retry attempt with an older timestamp.</p><p>In general, it’s a good idea for drivers to give users an opportunity to declare</p><p>their requests’ idempotence explicitly. Some queries can be trivially deduced to be</p><p>idempotent by the driver (e.g., when it’s a read-only SELECT statement in the database</p><p>world), but others may be less obvious. For example, the conditional example from the</p><p>previous Step 2 is idempotent if the value is never decremented, but not in the general</p><p>case. Imagine the following counter-example:</p><p>1. The current value is 42.</p><p>2. A request “bump the value by 1 if the previous value is 42” is sent.</p><p>10 See the Logical Clocks lecture by Arvind Krishnamurthy (https://homes.cs.washington.</p><p>edu/~arvind/cs425/lectureNotes/clocks-2.pdf).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>97</p><p>3. A request “bump the value by 1 if the previous value is 42” is</p><p>retried.</p><p>4. Another request, “decrement the value by 1,” is sent.</p><p>5. 
The request from Step 2 arrives and is applied—changing the value to 43.
6. The request from Step 4 arrives and is applied—changing the value to 42.
7. The retry from Step 3 is applied—changing the value back to 43 and interfering with the effect of the query from Step 4. This wasn't idempotent after all!
Since it's often impossible to guess whether a request is idempotent just by analyzing its contents, it's best for drivers to expose a set_idempotent() function in their API. It allows users to explicitly mark some queries as idempotent, and then the logic implemented in the driver can assume that it's safe to retry such a request when the need arises.
Retry Policies
Finally, there's enough context to discuss the actual retry policies that a database driver could implement. The sole job of a retry policy is to analyze a failed query and return a decision. This decision depends on the database system and its intrinsics, but it's often one of the following (see Figure 5-8):
• Do not retry
• Retry on the same database node
• Retry, but on a different node
• Retry, but not immediately—apply some delay
Figure 5-8. Decision graph for retrying a query
Deciding not to retry is often a decent choice—it's the only correct one when the driver isn't certain whether a non-idempotent query really failed or just timed out. It's also the obvious choice for permanent errors; there's no point in retrying a request that was previously refused due to incorrect syntax. And whenever the system is overloaded, the "do not retry" approach might help the entire cluster. Although the immediate effect (preventing a user's request from being driven to completion) is not desirable, it provides a level of overload protection that might pay off in the future. It prevents the overload condition from continuing to escalate.
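As a rough illustration of the decision graph above, here is a driver-agnostic sketch of such a policy in Python; the error categories and names are invented for the example and do not map to any specific driver.

from enum import Enum, auto

class Decision(Enum):
    DO_NOT_RETRY = auto()
    RETRY_SAME_NODE = auto()
    RETRY_OTHER_NODE = auto()
    RETRY_WITH_DELAY = auto()

def decide(error_kind, idempotent, attempt, max_attempts=3):
    # Map a failed request to one of the four decisions discussed above.
    if attempt >= max_attempts or error_kind == "permanent":
        return Decision.DO_NOT_RETRY              # e.g., syntax or authentication errors
    if error_kind == "timeout":
        # A timeout may mean the server applied the request anyway,
        # so only idempotent requests are safe to send again.
        return Decision.RETRY_SAME_NODE if idempotent else Decision.DO_NOT_RETRY
    if error_kind == "overloaded":
        return Decision.RETRY_WITH_DELAY          # back off and give the node some air
    if error_kind == "node_down":
        return Decision.RETRY_OTHER_NODE
    return Decision.DO_NOT_RETRY

print(decide("timeout", idempotent=True, attempt=1))   # Decision.RETRY_SAME_NODE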
Once a node gets too much traffic, it refuses more</p><p>requests, which increases the rate of retries, and ends up in a vicious circle.</p><p>Retrying on the same database node is generally a good option for timeouts.</p><p>Assuming that the request is idempotent, the same node can probably resolve potential</p><p>conflicts faster. Retrying on a different node is a good idea if the previous node showed</p><p>symptoms of overload, or had an input/output error that indicated a temporary issue.</p><p>Finally, in certain cases, it’s a good idea to delay the retry instead of firing it off</p><p>immediately (see Figure5-9).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>99</p><p>Figure 5-9. Retry attempts eventually resulting in a successful query</p><p>When the whole cluster shows the symptoms of overload—be it high reported CPU</p><p>usage or perceived increased latency—retrying immediately after a request failed may</p><p>only exacerbate the problem. What a driver can do instead is apply a gentle backoff</p><p>algorithm, giving the database cluster time to recover. Remember that even a failed retry</p><p>costs resources: networking, CPU, and memory. Therefore, it’s better to balance the costs</p><p>and chances for success in a reasonable manner.</p><p>The three most common backoff strategies are constant, linear, and exponential</p><p>backoff, as visualized in Figure5-10.</p><p>Figure 5-10. Constant, linear, and exponential backoffs</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>100</p><p>The first type (constant) simply waits a certain predefined amount of time before</p><p>retrying. Linear backoff increases the time between attempts in a linear fashion; it</p><p>could wait one second before the first attempt, two seconds before the second one,</p><p>and so forth. Finally, exponential backoff, arguably the most commonly used method,</p><p>increases the delay by multiplying it by a constant each time. Usually it just doubles</p><p>it—because both processors and developers love multiplying and dividing by two (the</p><p>latter ones mostly just to show off their intricate knowledge of the bitwise shift operator).</p><p>Exponential backoff has especially nice characteristics for overload prevention. The retry</p><p>rate drops exponentially, and so does the pressure that the driver places on the database</p><p>cluster.</p><p>Paging</p><p>Databases usually store amounts of data that are orders of magnitude larger than a single</p><p>client machine could handle. If you fetch all available records, the result is unlikely to fit</p><p>into your local disks, not to mention your available RAM.Nonetheless, there are many</p><p>valid cases for processing large amounts of data, such as analyzing logs or searching for</p><p>specific documents. It is quite acceptable to ask the database to serve up all the data it</p><p>has—but you probably want it to deliver that data in smaller bits.</p><p>That technique is customarily called paging, and it is ubiquitous. It’s exactly what</p><p>you’ve experienced when browsing through page 17 of Google search results in futile</p><p>search for an answer to a question that was asked only on an inactive forum seven years</p><p>ago—or getting all the way to page 24 of eBay listings, hunting for that single perfect offer.</p><p>Databases and their drivers also implement paging as a mechanism beneficial for both</p><p>parties. 
Drivers get their data in smaller chunks, which can be done with lower latency.</p><p>And databases receive smaller queries, which helps with cache management, workload</p><p>prioritization, memory usage, and so on.</p><p>Different database models may have a different view of exactly what paging involves</p><p>and how you interface with it. Some systems may offer fine-grained control, which</p><p>allows you to ask for “page 16” of your data. Others are “forward-only”: They reduce the</p><p>user-facing interface to “here’s the current page—you can ask for the next page if you</p><p>want.” Your ability to control the page size also varies. Sometimes it’s possible to specify</p><p>the size in terms of a number of database records or bytes. In other cases, the page size</p><p>is fixed.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>101</p><p>On top of a minimal interface that allows paging to be requested, drivers can</p><p>offer many interesting features and optimizations related to paging. One of them is</p><p>readahead—which usually means that the driver transparently and speculatively fetches</p><p>new pages before you actually ask for them to be read. A readahead is a classic example</p><p>of a double-edged sword. On the one hand, it makes certain read operations faster,</p><p>especially if the workload consists of large consecutive reads. On the other, it may cause</p><p>prohibitive overhead, especially if the workload is based on small random reads.</p><p>Although most drivers support paging, it’s important to check whether the feature</p><p>is opt-in or opt-out and consciously decide what’s best for a specific workload. In</p><p>particular, pay attention to the following aspects:</p><p>1. What’s the default behavior (would a read query be paged or</p><p>unpaged)?</p><p>2. What’s the default page size and is it configurable? If so, in what</p><p>units can a size be specified? Bytes? Number of records?</p><p>3. Is readahead on by default? Can it be turned on/off?</p><p>4. Can readahead be configured further? For example, can you</p><p>specify how many pages to fetch or when to decide to start</p><p>fetching (e.g., “When at least three consecutive read requests</p><p>already occurred”)?</p><p>Setting up paging properly is important because a single unpaged response can</p><p>be large enough to be problematic for both the database servers forced to produce it,</p><p>and for the client trying to receive it. On the other hand, too granular paging can lead</p><p>to unnecessary overhead (just imagine trying to read a billion records row-by-row, due</p><p>to the default page size of “1 row”). Finally, readahead can be a fantastic optimization</p><p>technique—but it can also be entirely redundant, fetching unwanted pages that cost</p><p>memory, CPU time, and throughput, as well as confuse the metrics and logs. With</p><p>paging configuration, it’s best to be as explicit as possible.</p><p>Concurrency</p><p>In many cases, the only way to utilize a database to the fullest—and achieve optimal</p><p>performance—is to also achieve high concurrency. That often requires the drivers to</p><p>perform many I/O operations at the same</p><p>time, and that’s in turn customarily achieved</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>102</p><p>by issuing asynchronous tasks. That being said, let’s take quite a few steps back to</p><p>explain what that really means and what’s involved in achieving that from both a</p><p>hardware and software perspective.</p><p>Note high concurrency is not a silver bullet. 
When it’s too high, it’s easy</p><p>to overload the system and ruin the quality of service for other users—see</p><p>Figure5-11 for its effect on latency. Chapter 1 includes a cautionary tale on what</p><p>can happen when concurrency gets out of bounds and Chapter 2 also touches on</p><p>the dangers of unbounded concurrency.</p><p>Modern Hardware</p><p>Back in the old days, making decisions around I/O concurrency was easy because</p><p>magnetic storage drives (HDD) had an effective concurrency of 1. There was (usually)</p><p>only a single actuator arm used to navigate the platters, so only a single sector of data</p><p>could have been read at once. Then, an SSD revolution happened. Suddenly, disks</p><p>could read from multiple offsets concurrently. Moreover, it became next to impossible to</p><p>fully utilize the disk (i.e., to read and write with the speeds advertised in shiny numbers</p><p>printed on their labels) without actually asking for multiple operations to be performed</p><p>concurrently. Now, with enterprise-grade NVMe drives and inventions like Intel</p><p>Optane,11 concurrency is a major factor when benchmarking input/output devices. See</p><p>Figure5-11.</p><p>11 High speed persistent memory (sadly discontinued in 2021).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>103</p><p>Figure 5-11. Relationship between the system’s concurrency and latency</p><p>Networking technology is not lagging behind either. Modern networking cards</p><p>have multiple independent queues, which, with the help of receive-side scaling (RSS12),</p><p>enable previously unimaginable levels of performance, with throughput measured</p><p>in Tbps.13 With such advanced hardware, achieving high concurrency in software is</p><p>required to simply utilize the available capabilities.</p><p>CPU cores obviously deserve to be mentioned here as well. That’s the part</p><p>of computer infrastructure that’s undoubtedly most commonly associated with</p><p>concurrency. Buying a 64-core consumer-grade processor is just a matter of going to</p><p>the hardware store next door, and the assortment of professional servers is even more</p><p>plentiful.</p><p>Operating systems focus on facilitating highly concurrent programs too. io_uring14</p><p>by Jens Axboe is a novel addition to the Linux kernel. As noted in Chapter 3, it was</p><p>developed for asynchronous I/O, which in turn plays a major part in allowing high</p><p>concurrency in software to become the new standard. Some database drivers already</p><p>utilize io_uring underneath, and many more put the integration very high in the list of</p><p>priorities.</p><p>12 RSS allows directing traffic from specific queues directly into chosen CPUs.</p><p>13 Terabits per second</p><p>14 See the “Efficient IO with io_uring” article (https://kernel.dk/io_uring.pdf).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>104</p><p>Modern Software</p><p>How could modern software adapt to the new, highly concurrent era? Historically,</p><p>a popular model of ensuring that multiple operations can be performed at the same</p><p>time was to keep a pool of operating system threads, with each thread having its own</p><p>queue of tasks. 
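For reference, here is a bare-bones sketch of that threadpool model in Python; blocking_request is a placeholder standing in for a synchronous driver call.

import threading, queue

NUM_WORKERS = 8
queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def blocking_request(item):
    # Placeholder for a synchronous driver call that waits on the network.
    return f"result for {item}"

def worker(q):
    while True:
        item = q.get()
        if item is None:              # shutdown signal
            return
        blocking_request(item)        # the OS thread is parked while waiting

threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

for i in range(100):                  # fan requests out across the per-thread queues
    queues[i % NUM_WORKERS].put(i)

for q in queues:                      # shut the pool down
    q.put(None)
for t in threads:
    t.join()

The concurrency ceiling here is the number of operating system threads, and each one of them is comparatively heavy.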
That only scales in a limited way though, so now the industry leans</p><p>toward so-called “green threads,” which are conceptually similar to their operating</p><p>system namesakes, but are instead implemented in userspace, in a much more</p><p>lightweight manner.</p><p>For example, in Seastar (a high-performance asynchronous framework implemented</p><p>in C++ and based on a future-promise model15), there are quite a few ways of expressing</p><p>a single flow of execution, which could be called a green thread. A fiber of execution can</p><p>be created by chaining futures, and you can also use the C++ coroutines mechanism to</p><p>build asynchronous programs in a clean way, with the compiler assisting in making the</p><p>code async-friendly.</p><p>In the Rust language, the asynchronous model is quite unique. There, a future</p><p>represents the computation, and it’s the programmer’s responsibility to advance the</p><p>state of this asynchronous state machine. Other languages, like JavaScript, Go, and Java,</p><p>also come with well-defined and standardized support for asynchronous programming.</p><p>This async programming support is good, because database drivers are prime</p><p>examples of software that should support asynchronous operations from day one.</p><p>Drivers are generally responsible for communicating over the network with highly</p><p>specialized database clusters, capable of performing lots of I/O operations at the same</p><p>time. We can’t emphasize enough that high concurrency is the only way to utilize the</p><p>database to the fullest. Asynchronous code makes that substantially easier because it</p><p>allows high levels of concurrency to be achieved without straining the local resources.</p><p>Green threads are lightweight and there can be thousands of them even on a consumer-</p><p>grade laptop. Asynchronous I/O is a perfect fit for this use case as well because it allows</p><p>efficiently sending thousands of requests over the network in parallel, without blocking</p><p>the CPU and forcing it to wait for any of the operations to complete, which was a known</p><p>bottleneck in the legacy threadpool model.</p><p>15 See the Seastar documentation on futures and promises (https://seastar.io/</p><p>futures-promises/).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>105</p><p>Note the future-promise model and asynchronous I/O are introduced in</p><p>Chapter 3.</p><p>What toLook forWhen Selecting aDriver</p><p>Database drivers are commonly available as open-source software. It’s a great model</p><p>that allows people to contribute and also makes the software easily accessible, ergo</p><p>popular (precisely what database vendors want). Drivers can be developed either by the</p><p>vendor, or another company, or simply your next door open-source contributor. This</p><p>kind of competition is very healthy for the entire system, but it also forces the users to</p><p>make a choice: which driver to use? For instance, at the time of this writing, the official</p><p>PostgreSQL documentation lists six drivers for C/C++ alone, with the complete list being</p><p>much longer.16</p><p>Choosing a driver should be a very deliberate decision, tailored to your unique</p><p>situation and preceded by tests, benchmarks, and evaluations. Nevertheless, there are</p><p>some general rules of thumb that can help guide you:</p><p>1. Clear documentation</p><p>Clear documentation is often initially underestimated by database</p><p>drivers’ users and developers alike. 
However, in the long term, it’s</p><p>the most important repository of knowledge for everyone, where</p><p>implementation details, good practices, and hidden assumptions</p><p>can be thoroughly explained. Choosing an undocumented driver</p><p>is a lottery—buying a pig in a poke. Don’t get distracted by shiny</p><p>benchmarks on the front page; the really valuable part is thorough</p><p>documentation. Note that it does not have to be a voluminous</p><p>book. On the contrary—concise, straight-to-the-point docs with</p><p>clear, working examples are even better.</p><p>16 See the PostgreSQL Drivers documentation at https://wiki.postgresql.org/wiki/</p><p>List_of_drivers.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>106</p><p>2. Long-term support and active maintainership</p><p>Officially supported drivers are often maintained by their vendors,</p><p>get released regularly, and have their security vulnerabilities</p><p>fixed faster. External open-source drivers might look appealing</p><p>at first, easily winning in their self-presented benchmarks, but</p><p>it’s important to research how often they get released, how</p><p>often bugs are fixed, and how likely they are to be maintained</p><p>in the foreseeable future. On the other hand, sometimes the</p><p>situation is reversed: The most modern, efficient code can be</p><p>found in an open-source driver, while the official one is hardly</p><p>maintained at all!</p><p>3. Asynchronous API</p><p>Your code is eventually going to need high concurrency, so</p><p>it’s better</p><p>to bet on an async-friendly driver, even if you’re not</p><p>ready to take advantage of that quite yet. The decision will likely</p><p>pay off later. While it’s easy to use an asynchronous driver in a</p><p>synchronous manner, the opposite is not true.</p><p>4. Decent test coverage</p><p>Testing is extremely important not only for the database nodes,</p><p>but also for the drivers. They are the first proxy between the users</p><p>and the database cluster, and any error in the driver can quickly</p><p>propagate to the whole system. If the driver corrupts outgoing</p><p>data, it may get persisted on the database, eventually making</p><p>the whole cluster unusable. If the driver incorrectly interprets</p><p>incoming data, its users will have a false picture of the database</p><p>state. And if it produces data based on this false picture, it can</p><p>just as well corrupt the entire database cluster. A driver that</p><p>cannot properly handle its load balancing and retry policy can</p><p>inadvertently overload a database node with excess requests,</p><p>which is detrimental to the whole system. If the driver is at least</p><p>properly tested, users can assume a higher level of trust in it.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>107</p><p>5. Database-specific optimizations</p><p>A good driver should cooperate with its database. The more</p><p>context it gathers from the cluster, the more educated decisions</p><p>it can make. Remember that clients, and therefore drivers, are</p><p>often the most ubiquitous group of agents in distributed systems,</p><p>directly contributing to the cluster-wide concurrency. That makes</p><p>it especially important for them to be cooperative.</p><p>Summary</p><p>This chapter provided insights into how the choice of a database driver impacts</p><p>performance and highlighted considerations to keep in mind when selecting a driver.</p><p>Drivers are often an overlooked part of a distributed system. 
That’s a shame because</p><p>drivers are so close to database users, both physically and figuratively! Proximity is an</p><p>extremely important factor in all networked systems because it directly translates to</p><p>latency. The next chapter ponders proximity from a subtly different point of view: How to</p><p>get the data itself closer to the application users.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>109</p><p>CHAPTER 6</p><p>Getting Data Closer</p><p>Location, location, location. Sometimes it’s just as important to database performance</p><p>as it is to real estate. Just as the location of a home influences how quickly it sells, the</p><p>location of where data “lives” and is processed also matters for response times and</p><p>latencies.</p><p>Pushing more logic into the database can often reduce network latency (and costs,</p><p>e.g., when your infrastructure provider charges for ingress/egress network traffic) while</p><p>taking advantage of the database’s powerful compute capability. And redistributing</p><p>database logic from fewer powerful datacenters to more minimalist ones that are closer</p><p>to users is another move that can yield discernable performance gains under the right</p><p>conditions.</p><p>This chapter explores the opportunities in both of these shifts. First, it looks at</p><p>databases as compute engines with a focus on user-defined functions and user-defined</p><p>aggregates. It then goes deeper into WebAssembly, which is now increasingly being</p><p>used to implement user-defined functions and aggregates (among many other things).</p><p>Finally, the chapter ventures to the edge—exploring what you stand to gain by moving</p><p>your database servers quite close to your users, as well as what potential pitfalls you</p><p>need to negotiate in this scenario.</p><p>Databases asCompute Engines</p><p>Modern databases offer many more capabilities than just storing and retrieving</p><p>data. Some of them are nothing short of operating systems, capable of streaming,</p><p>modifying, encrypting, authorizing, authenticating, and virtually anything else with data</p><p>they manage.</p><p>Data locality is the holy grail of distributed systems. The less you need to move</p><p>data around, the more time can be spent on performing meaningful operations on</p><p>it—without excessive bandwidth costs. That’s why it makes sense to try to push more</p><p>logic into the database itself, letting it process as much as possible locally, then return</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. 
Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_6</p><p>110</p><p>the results to the users, or some middleware, for further processing. It makes even more</p><p>sense when you consider that database nodes generally run on powerful hardware,</p><p>with lots of RAM and fast I/O devices. This usually translates to formidable CPU power.</p><p>Dedicated large data processing frameworks aside (e.g., Apache Spark, which is out of</p><p>scope for this book), regular database engines almost always support some level of user-</p><p>defined computations. These can be classified into two major sections: user-defined</p><p>functions/procedures and user-defined aggregates.</p><p>Note that the definitions vary. Some database vendors use the general name</p><p>“functions” to mean both aggregate and scalar functions. Others actually mean</p><p>“scalar functions” when they reference “functions,” and use the name “aggregates” for</p><p>“aggregate functions.” That’s the convention applied to this chapter.</p><p>User-Defined Functions andProcedures</p><p>In contrast to native functions, often implemented in database engines (think</p><p>lowercase(), now(), concat(), type casting, algebraic operations, and friends), user-</p><p>defined functions are provided by the users of the database (e.g., the developers building</p><p>applications). A “procedure” is substantially identical to a function in this context, except</p><p>it does not return any result; instead, it has side effects.</p><p>The exact interface of allowing users to define their own functions or procedures</p><p>varies wildly between database vendors. Still, several core strategies, listed here, are</p><p>often implemented:</p><p>1. A set of hardcoded native functions, not extensible, but at least</p><p>composable. For example, casting a type to string, concatenating</p><p>it with a predefined suffix, and then hashing it.</p><p>2. A custom scripting language, dedicated and vendor-locked to</p><p>a specific database, allowing users to write and execute simple</p><p>programs on the data.</p><p>3. Supporting a single general-purpose embeddable language</p><p>of choice. For example, Lisp, Lua, ChaiScript, Squirrel, or</p><p>WebAssembly might be used for this purpose. Note: You’ll explore</p><p>WebAssembly in more depth a little later in this chapter.</p><p>Chapter 6 GettinG Data Closer</p><p>111</p><p>4. Supporting a variety of pluggable embeddable languages. A good</p><p>example is Apache Cassandra and its support of Java (native</p><p>language) and JavaScript1 as well as pluggable backend-loaded via</p><p>.jar files.</p><p>The first on the list is the least flexible, offers the worst developer experience, and</p><p>has the lowest security risk. The last has the most flexibility, offers the best developer</p><p>experience, and also harbors the most potential for being a security risk worthy of its</p><p>own CVE number.</p><p>Scalar functions are usually invoked per each row, at least for row-oriented</p><p>databases, which is usually the case for SQL.You might wonder if the computations</p><p>can’t</p><p>simply be performed by end users on their machines. That’s a valid point. 
The main</p><p>advantage of that approach is fantastic scalability regardless of how many users perform</p><p>data transformations (if they do it locally on their own machines, then the database</p><p>cluster does not get overloaded).</p><p>There are several great reasons to push the computations closer to where the data</p><p>is stored:</p><p>• Databases have more context to efficiently cache the computed</p><p>results. Imagine tens of thousands of users asking for the same</p><p>function to be applied on a certain set of rows. That result can be</p><p>computed just once and then distributed to all interested parties.</p><p>• If the computed results are considerably smaller than their input</p><p>(think about returning just lengths of text values), it’s better to save</p><p>bandwidth and send over only the final results.</p><p>• Certain housekeeping operations (e.g., deleting data older than a</p><p>week) can be efficiently performed locally, without fetching any</p><p>information to the clients for validation.</p><p>1 It’s also a great example of the CVE risk: https://cve.mitre.org/cgi-bin/cvename.</p><p>cgi?name=CVE-2021-44521</p><p>https://jfrog.com/blog/cve-2021-44521-exploiting-apache-cassandra-user-defined-</p><p>functions-for-remote-code-execution/</p><p>Chapter 6 GettinG Data Closer</p><p>112</p><p>• If the processing is done on database servers, the instruction cache</p><p>residing on that database’s CPU chip is likely to be scorching hot with</p><p>opcodes responsible for carrying out the computations for each row.</p><p>And as a rule of thumb, hot cache translates to faster code execution</p><p>and lower latency.</p><p>• Some computations are not trivially distributed to users. If they</p><p>involve cryptographic private keys stored on the database servers,</p><p>it might actually be impossible to run the code anywhere but on the</p><p>server itself.</p><p>• If the data on which computations are performed is sensitive (e.g., it</p><p>falls under infamous, ever-changing European data protection laws</p><p>such as GDPR), it might be illegal to send raw data to the users. In</p><p>such cases, running an encryption function server-side can be a way</p><p>for users to obtain obfuscated, legal data.</p><p>Determinism</p><p>In distributed environments, idempotence (discussed in Chapter 5) is an important</p><p>attribute that makes it possible to send requests in a speculative manner, potentially</p><p>increasing performance. Thus, it is better to make sure that user-defined functions are</p><p>deterministic. In other words, a user-defined function’s value should only depend on</p><p>the value of its arguments, and not on the value of any external factors like time, date,</p><p>pseudo-random seed, and so on.</p><p>A perfect example of a non-deterministic function is now(). Calling it twice might</p><p>yield the same value if you’re fast enough, but it’s generally not guaranteed since its</p><p>result is time-dependent. If possible, it’s a good idea to program the user-defined</p><p>functions in a deterministic way and mark them as such. For time/date, this might</p><p>involve computing the results based on a timestamp passed as a parameter rather than</p><p>using built-in time utilities. 
For pseudo-random sampling, the seed could also be passed</p><p>as a parameter, as opposed to relying on sources of entropy provided by the user-defined</p><p>function runtime.</p><p>Chapter 6 GettinG Data Closer</p><p>113</p><p>Latency</p><p>Running user-provided code on your database clusters is potentially dangerous in</p><p>aspects other than security. Most embedded languages are Turing-complete, and</p><p>customarily allow the developers to use loops, recursion, and other similar techniques</p><p>in their code. That’s risky. An undetected infinite loop may serve as a denial-of- service</p><p>attack, forcing the database servers to endlessly process a function and block other tasks</p><p>from used resources. And even if the user-defined function author did not have malicious</p><p>intentions, some computations simply consume a lot of CPU time and memory.</p><p>In a way, a user-defined function should be thought of as a potential “noisy</p><p>neighbor”2 and its resources should be as limited as possible. For some use cases,</p><p>a simple hard limit on memory and CPU time used is enough to ensure that the</p><p>performance of other database tasks does not suffer from a “noisy” user-defined</p><p>function. However, sometimes, a more specific solution is required—for example,</p><p>splitting a user- function definition into smaller time bits, assigning priorities to user-</p><p>defined functions, and so on.</p><p>One interesting metering mechanism was applied by Wasmtime,3 a WebAssembly</p><p>runtime. Code running in a WebAssembly instance consumes fuel,4 a synthetic unit used</p><p>for tracking how fast an instance exhausts system resources. When an instance runs out</p><p>of fuel, the runtime does one of the preconfigured actions—either “refills” and lets the</p><p>code execution continue or decides that the task reached its quota and terminates it.</p><p>Just-in-Time Compilation (JIT)</p><p>Languages used for user-defined functions are often either interpreted (e.g., Lua) or</p><p>represented in bytecode that runs on a virtual machine (e.g., WebAssembly). Both of</p><p>these approaches can benefit from just-in-time compilation. It’s a broad topic, but the</p><p>essence of it is that during runtime, the code of user-defined functions can be compiled</p><p>to another, more efficient representation, and optimized along the way. This may mean</p><p>translating bytecode to machine code the program runs on (e.g., x86-64 instructions), or</p><p>compiling the source code represented in an interpreted language to machine code.</p><p>2 See the Microsoft Azure documentation on the Noisy Neighbor antipattern (https://learn.</p><p>microsoft.com/en-us/azure/architecture/antipatterns/noisy-neighbor/noisy-neighbor).</p><p>3 See the Bytecode Alliance documentation at https://wasmtime.dev.</p><p>4 See the Wasmtime docs (https://docs.wasmtime.dev/api/wasmtime/struct.Store.</p><p>html#method.fuel_consumed).</p><p>Chapter 6 GettinG Data Closer</p><p>114</p><p>JIT is a very powerful tool, but it’s not a silver bullet—compilation and additional</p><p>optimization can be an expensive process in terms of resources. A small user-defined</p><p>function may take less than a millisecond to run, but recompiling it can cause a</p><p>sudden spike in CPU and memory usage, as well as a multi-millisecond delay in the</p><p>processing—resulting in high tail latency. 
It should therefore be a conscious decision to</p><p>either enable just-in-time compilation for user-defined functions if the language allows</p><p>it, or disable it altogether.</p><p>Examples</p><p>Let’s take a look at a few examples of user-defined functions. The function serving as</p><p>the example operates on floating point numbers; given two parameters, it returns the</p><p>sum of them, inverted. Given 5 and 7, it should return 1</p><p>5</p><p>1</p><p>7+ , which is approximately</p><p>0.34285714285.</p><p>Here’s how it could be defined in Apache Cassandra, which allows user-defined</p><p>function definitions to be provided in Java, its native language, as well as in other</p><p>languages:</p><p>CREATE OR REPLACE FUNCTION add_inverse(val1 double, val2 double)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS double LANGUAGE java</p><p>AS '</p><p>return (val1 == 0 || val2 == 0)</p><p>? Double.NaN</p><p>: (1/val1 + 1/val2);</p><p>';</p><p>Let’s take a closer look at the definition. The first line is straightforward: it includes</p><p>the function’s name, parameters, and its types. It also specifies that if a function</p><p>definition with that name already exists, it should be replaced. Next, it explicitly declares</p><p>what happens if any of the parameters is null, which is a valid value for any type. The</p><p>function can either return null without calling the function at all or allow null and let</p><p>the source code handle it explicitly (the syntax for that is CALLED ON NULL INPUT). This</p><p>explicit declaration is required by Apache Cassandra.</p><p>Chapter 6 GettinG Data Closer</p><p>115</p><p>That declaration is then followed by the return type and chosen language—from</p><p>which you can correctly deduce that multiple languages are supported. Then comes</p><p>the function</p><p>body. The only non-obvious decision made by the programmer was how</p><p>to handle 0 as a parameter. Since the type system implemented in Apache Cassandra</p><p>already handles NaN,5 it’s a decent candidate (next to positive/negative infinity).</p><p>The newly created function can be easily tested by creating a table, filling it with a</p><p>few values, and inspecting the result:</p><p>CREATE TABLE test(v1 double PRIMARY KEY, v2 double);</p><p>INSERT INTO test(v1, v2) VALUES (5, 7);</p><p>INSERT INTO test(v1, v2) VALUES (2, 2);</p><p>INSERT INTO test(v1) VALUES (9);</p><p>INSERT INTO test(v1, v2) VALUES (7, 0);</p><p>SELECT v1, v2, add_inverse(v1, v2) FROM test;</p><p>cassandra@cqlsh:test> SELECT v1, v2, add_inverse(v1, v2) FROM test;</p><p>v1 | v2 | test.add_inverse(v1, v2)</p><p>----+------+--------------------------</p><p>9 | null | null</p><p>5 | 7 | 0.342857</p><p>2 | 2 | 1</p><p>7 | 0 | NaN</p><p>From the performance perspective, is offloading such a simple function to the</p><p>database servers worth it? Not likely—the computations are fairly cheap, so users</p><p>shouldn’t have an issue deriving these values themselves, immediately after receiving</p><p>the data. The database servers, on the other hand, may need to initialize a runtime</p><p>for user-defined functions, since these functions are often sandboxed for security</p><p>purposes. That runtime initialization takes time and other resources. 
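For comparison, the client-side alternative is a one-liner applied to rows after they are fetched, sketched here in Python with a hardcoded list standing in for the driver's result set:

def add_inverse(v1, v2):
    # Same semantics as the server-side UDF: sum of inverses, NaN when a value is zero.
    return float("nan") if 0 in (v1, v2) else 1 / v1 + 1 / v2

# These rows would normally come from a query such as SELECT v1, v2 FROM test.
rows = [(5.0, 7.0), (2.0, 2.0), (7.0, 0.0)]
results = [(v1, v2, add_inverse(v1, v2)) for v1, v2 in rows]
# [(5.0, 7.0, 0.3428...), (2.0, 2.0, 1.0), (7.0, 0.0, nan)]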
Offloading such</p><p>computations makes much more sense if the data is aggregated server-side, which is</p><p>discussed in the next section (on user-defined aggregates).</p><p>5 Not-a-number</p><p>Chapter 6 GettinG Data Closer</p><p>116</p><p>Best Practices</p><p>Before you learn about user-defined aggregates, which unleash the true potential of</p><p>user-defined functions, it’s important to sum up a few best practices for setting up user-</p><p>defined functions in your database management system:</p><p>1. Evaluate if you need user-defined functions at all—compare</p><p>the latency (and general performance) of queries utilizing user-</p><p>defined functions vs computing everything client-side (assuming</p><p>that’s even possible).</p><p>2. Test if offloading computations to the database servers scales.</p><p>Look at metrics like CPU utilization to assess how well your</p><p>database system can handle thousands of users requesting</p><p>additional computations.</p><p>3. Recognize that since user-defined functions are likely going</p><p>to be executed on the “fast path,” they need to be optimized</p><p>and benchmarked as well! Consider the performance best</p><p>practices for the language you’re using for user-defined function</p><p>implementation.</p><p>4. Make sure to properly handle any errors or exceptional cases in</p><p>your user-defined function to avoid disrupting the operation of</p><p>the rest of the database system.</p><p>5. Consider using built-in functions whenever possible instead of</p><p>creating a user- defined function. The built-in functions may be</p><p>more optimized and efficient.</p><p>6. Keep your user-defined functions simple and modular, breaking</p><p>up complex tasks into smaller, more manageable functions that</p><p>can be easily tested and reused.</p><p>7. Properly document your user-defined functions so that other</p><p>users of the database system can understand how they work and</p><p>how to use them correctly.</p><p>Chapter 6 GettinG Data Closer</p><p>117</p><p>User-Defined Aggregates</p><p>The greatest potential for user-defined functions lies in them being building blocks for</p><p>user-defined aggregates. Aggregate functions operate on multiple rows or columns,</p><p>sometimes on entire tables or databases.</p><p>Moving this kind of operation closer to where the data lies makes perfect sense.</p><p>Imagine 1TB worth of database rows that need to be aggregated into a single value: the</p><p>sum of their values. When a thousand users request all these rows in order to perform</p><p>the aggregation client-side, the following happens:</p><p>1. A total of a petabyte of data is sent over the network to each user.</p><p>2. Each user performs extensive computations, expensive in terms</p><p>of RAM and CPU, that lead to exactly the same result as the</p><p>other users.</p><p>If the aggregation is performed by the database servers, it not only avoids a petabyte</p><p>of traffic; it also saves computing power for the users (which is a considerably greener</p><p>solution). If the computation is properly cached, it only needs to be performed once.</p><p>This is a major win in terms of performance, and many use cases can immediately</p><p>benefit from pushing the aggregate computations closer to the data. 
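To make the savings concrete, a quick back-of-the-envelope calculation, using illustrative numbers that mirror the scenario above:

TB = 1024 ** 4

dataset = 1 * TB          # rows to be aggregated
clients = 1000            # users requesting the same sum
result_size = 8           # a single 64-bit sum

client_side = dataset * clients        # ~1 PB shipped over the network
server_side = result_size * clients    # ~8 KB shipped over the network
print(client_side / server_side)       # roughly a 10**11 reduction in traffic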
This is especially</p><p>important for analytic workloads that tend to process large volumes of data in order to</p><p>produce useful statistics and feedback—a process that is its own type of aggregation.</p><p>Built-In Aggregates</p><p>Databases that allow creating user-defined aggregates usually also provide a few</p><p>traditional built-in aggregation functions: the (in)famous COUNT(*), but also MAX, MIN,</p><p>SUM, AVG, and others. Such functions take into account multiple rows or values and return</p><p>an aggregated result. The result may be a single value. Or, it could also be a set of values</p><p>if the input is divided into smaller classes. One example of such an operation is SQL’s</p><p>GROUP BY statement, which applies the aggregation to multiple disjoint groups of values.</p><p>Built-in aggregates should be preferred over user-defined ones whenever possible—</p><p>they are likely written in the language native to the database server, already optimized,</p><p>and secure. Still, the set of predefined aggregate functions is often very basic and doesn’t</p><p>allow users to perform the complex computations that make user-defined aggregates</p><p>such a powerful tool.</p><p>Chapter 6 GettinG Data Closer</p><p>118</p><p>Components</p><p>User-defined aggregates are customarily built on top of user-defined scalar functions.</p><p>The details heavily depend on the database system, but the following components are</p><p>definitely worth mentioning.</p><p>Initial Value</p><p>An aggregation needs to start somewhere, and it’s up to the user to provide an initial</p><p>value from which the final result will eventually be computed. In the case of the COUNT</p><p>function, which returns the number of rows or values in a table, a natural candidate</p><p>for the initial value is 0. In the case of AVG, which computes the arithmetic mean from</p><p>all column values, the initial state could consist of two variables: The total number of</p><p>values, initialized to 0, and the total sum of values, also initialized to 0.</p><p>State Transition Function</p><p>The core of each user-defined aggregate is its state transition function. This function</p><p>is called for each new value that needs to be processed, and each time it is called, it</p><p>returns the new state of the aggregation. Following the COUNT function example, its state</p><p>transition function simply increments the number of rows by one. The state transition</p><p>function of the AVG aggregate just adds the current value to the total sum and increments</p><p>the total number of values by one.</p><p>Final Function</p><p>The final function is an optional feature for user-defined aggregates. Its sole purpose is</p><p>to transform the final state of the aggregation to something else. For COUNT, no further</p><p>transformations are required. The user is simply interested in the final state of the</p><p>aggregation (the number of values), so the final function doesn’t need to be present; it</p><p>can be assumed to be an identity function. However, in the case of AVG, the final function</p><p>is what makes the result useful to the user. 
It transforms the final state—the total number</p><p>of values and its total sum—and produces the arithmetic mean by simply dividing one</p><p>by the other, handling the special case of avoiding dividing by zero.</p><p>Chapter 6 GettinG Data Closer</p><p>119</p><p>Reduce Function</p><p>The reduce function is an interesting optional addition to the user-defined aggregates</p><p>world, especially for distributed databases. It can be thought of as another state</p><p>transition function, but one that can combine two partial states into one.</p><p>With the help of a reduce function, computations of the user-defined aggregate</p><p>can be distributed to multiple database nodes, in a map-reduce6 fashion. This, in turn,</p><p>can bring massive performance gains, because the computations suddenly become</p><p>concurrent. Note that this optimization is not always possible—if the state transition</p><p>function is not commutative, distributing the partial computations may yield an</p><p>incorrect result.</p><p>In order to better imagine what a reduce function can look like, let’s go back to the</p><p>AVG example. A partial state for AVG can be represented as (n, s), where n is the number</p><p>of values, and s is the sum of them. Reducing two partial states into the new valid state</p><p>can be performed by simply adding the corresponding values: (n1, s1) + (n2, s2) → (n1+</p><p>n2, s1 + s2). An optional reduce function can be defined (e.g., in ScyllaDB’s user-defined</p><p>aggregate implementation7).</p><p>The user-defined aggregates support is not standardized among database vendors</p><p>and each database has its own quirks and implementation details. For instance, in</p><p>PostgreSQL, you can also implement a “moving” aggregate8 by providing yet another set</p><p>of functions and parameters: msfunc, minvfunc, mstype, and minitcond. Still, the general</p><p>idea remains unchanged: Let the users push aggregation logic as close to the data as</p><p>possible.</p><p>Examples</p><p>Let’s create a custom integer arithmetic mean implementation in PostgreSQL.</p><p>That’s going to be done by providing a state transition function, called sfunc in</p><p>PostgreSQL nomenclature, finalfunc for the final function, initial value (initcond),</p><p>and the state type—stype. All of the functions will be implemented in SQL, PostgreSQL’s</p><p>native query language.</p><p>6 MapReduce is a framework for processing parallelizable problems across large datasets.</p><p>7 See the ScyllaDB documentation on ScyllaDB CQL Extensions (https://github.com/scylladb/</p><p>scylladb/blob/master/docs/cql/cql-extensions.md#reducefunc-for-uda).</p><p>8 See the PostgreSQL documentation on User-Defined Aggregates (https://www.postgresql.</p><p>org/docs/current/xaggr.html#XAGGR-MOVING-AGGREGATES).</p><p>Chapter 6 GettinG Data Closer</p><p>120</p><p>State Transition Function</p><p>The state transition function, called accumulate, accepts a new integer value (the second</p><p>parameter) and applies it to the existing state (the first parameter). As mentioned earlier</p><p>in this chapter, a simple implementation keeps two variables in the state—the current</p><p>sum of all values, and their count. 
Thus, transitioning to the next state simply means that</p><p>the sum is incremented by the current value, and the total count is increased by one.</p><p>CREATE OR REPLACE FUNCTION accumulate(integer[], integer) RETURNS integer[]</p><p>AS 'select array[$1[1] + $2, $1[2] + 1];'</p><p>LANGUAGE SQL</p><p>IMMUTABLE</p><p>RETURNS NULL ON NULL INPUT;</p><p>Final Function</p><p>The final function divides the total sum of values by the total count of them, special-</p><p>casing an average of 0 values, which should be just 0. The final function returns a</p><p>floating point number because that’s how the aggregate function is going to represent an</p><p>arithmetic mean.</p><p>CREATE OR REPLACE FUNCTION divide(integer[]) RETURNS float8</p><p>AS 'select case when $1[2]=0 then 0 else $1[1]::float/$1[2] end;'</p><p>LANGUAGE SQL</p><p>IMMUTABLE</p><p>RETURNS NULL ON NULL INPUT;</p><p>Aggregate Definition</p><p>With all the building blocks in place, the user-defined aggregate can now be declared:</p><p>CREATE OR REPLACE AGGREGATE alternative_avg(integer)</p><p>(</p><p>sfunc = accumulate,</p><p>stype = integer[],</p><p>finalfunc = divide,</p><p>initcond = '{0, 0}'</p><p>);</p><p>Chapter 6 GettinG Data Closer</p><p>121</p><p>In addition to declaring the state transition function and the final function, the state</p><p>type is also declared to be an array of integers (which will always keep two values in the</p><p>implementation), as well as the initial condition that sets both counters, the total sum</p><p>and the total number of values, to 0.</p><p>That’s it! Since the AVG aggregate for integers happens to be built-in, that gives you</p><p>the perfect opportunity to validate if the implementation is correct:</p><p>postgres=# CREATE TABLE t(v INTEGER);</p><p>postgres=# INSERT INTO t VALUES (3), (5), (9);</p><p>postgres=# SELECT * FROM t;</p><p>v</p><p>---</p><p>3</p><p>5</p><p>9</p><p>(3 rows)</p><p>postgres=# SELECT AVG(v), alternative_avg(v) FROM t;</p><p>avg | alternative_avg</p><p>--------------------+-------------------</p><p>5.6666666666666667 | 5.666666666666667</p><p>(1 row)</p><p>Voilà. Remember that while creating an alternative implementation for AVG is a great</p><p>academic example of user-defined aggregates, for production use it’s almost always</p><p>better to stick to the built-in aggregates whenever they’re available.</p><p>Distributed User-Defined Aggregate</p><p>For completeness, let’s take a look at an almost identical implementation of a custom</p><p>average function, but one accommodated to be distributed over multiple nodes. This</p><p>time, ScyllaDB will be used as a reference, since its implementation of user-defined</p><p>aggregates includes an extension for distributing the computations in a map-reduce</p><p>manner. 
Here’s the complete source code:</p><p>CREATE FUNCTION accumulate(acc tuple, val int)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS tuple</p><p>Chapter 6 GettinG Data Closer</p><p>122</p><p>LANGUAGE lua</p><p>AS $$</p><p>return { acc[1]+val, acc[2]+1 }</p><p>$$;</p><p>CREATE FUNCTION reduce(acc tuple, acc2 tuple)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS tuple</p><p>LANGUAGE lua</p><p>AS $$</p><p>return { acc[1]+acc2[1], acc[2]+acc2[2] }</p><p>$$;</p><p>CREATE FUNCTION divide(acc tuple)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS double</p><p>LANGUAGE lua</p><p>AS $$</p><p>return acc[1]/acc[2]</p><p>$$;</p><p>CREATE AGGREGATE alternative_avg(int)</p><p>SFUNC accumulate</p><p>STYPE tuple</p><p>REDUCEFUNC reduce</p><p>FINALFUNC divide</p><p>INITCOND (0, 0);</p><p>ScyllaDB’s native query language, CQL, is extremely similar to SQL, even in its</p><p>acronym. It’s easy to see that most of the source code corresponds to the PostgreSQL</p><p>implementation from the previous paragraph. ScyllaDB does not allow defining user-</p><p>defined functions in CQL, but it does support Lua, a popular lightweight embeddable</p><p>language, as well as WebAssembly. Since this book is expected to be read mostly by</p><p>human beings (and occasionally ChatGPT once it achieves full consciousness), Lua was</p><p>chosen for this example due to the fact it’s much more concise.</p><p>Chapter 6 GettinG Data Closer</p><p>123</p><p>The most notable difference is the reduce function, declared in the aggregate</p><p>under the REDUCEFUNC keyword. This function accepts two partial states and returns</p><p>another (composed) state. What ScyllaDB servers can do if this function is present is the</p><p>following:</p><p>1. Divide the domain (e.g., all rows in the database) into multiple</p><p>pieces and ask multiple servers to partially aggregate them, and</p><p>then send back the result.</p><p>2. Apply the reduce function to combine partial results into the</p><p>single final result.</p><p>3. Return the final result to the user.</p><p>Thus, by providing the reduce function, the user also allows ScyllaDB to compute</p><p>the aggregate concurrently on multiple machines. This can reduce the query execution</p><p>time by orders of magnitude compared to a large query that only gets executed on a</p><p>single server.</p><p>In this particular case, it might even be preferable to provide a user-defined</p><p>alternative for a user-defined function in order to increase its concurrency—unless the</p><p>built-in primitives also come with their reduce functions out of the box. That’s the case</p><p>in ScyllaDB, but not necessarily in other databases that offer similar capabilities.</p><p>Best Practices</p><p>1. If the computations can be efficiently represented with built-</p><p>in aggregates, do so—or at least benchmark whether a custom</p><p>implementation is any faster. User-defined aggregates are very</p><p>expressive, but usually come with a cost of overhead compared to</p><p>built-in implementations.</p><p>2. Research if user-defined aggregates can be customized in order</p><p>to better fit specific use cases—for example, if the computations</p><p>can be distributed to multiple database nodes, or if the database</p><p>allows configuring its caches to store the intermediate results of</p><p>user-defined aggregates somewhere.</p><p>Chapter 6 GettinG Data Closer</p><p>124</p><p>3. Always test the performance of your user-defined aggregates</p><p>thoroughly before using them in production. 
This will help to</p><p>ensure that they are efficient and can handle the workloads that</p><p>you expect them to.</p><p>4. Measure the cluster-wide effects of using user-defined aggregates</p><p>in your workloads. Similar to full table scans, aggregates are a</p><p>costly operation and it’s important to ensure that they respect the</p><p>quality of service of other workloads, not overloading the database</p><p>nodes beyond what’s acceptable in your system.</p><p>WebAssembly forUser-Defined Functions</p><p>WebAssembly, also known as Wasm, is a binary format for representing executable code,</p><p>designed to be easily embedded into other projects. It turns out that WebAssembly is</p><p>also a perfect candidate for user-defined functions on the backend, thanks to its ease of</p><p>integration, performance, and popularity.</p><p>There are multiple great books and articles9 on WebAssembly, and they all agree that</p><p>first and foremost, it’s a misnomer—WebAssembly’s usefulness ranges way beyond web</p><p>applications. It’s actually a solid general-purpose language that has already become the</p><p>default choice for an embedded language around the world. It ticks all the boxes:</p><p>☒ It’s open-source, with a thriving community</p><p>☒ It’s portable</p><p>☒ It’s isolated by default, with everything running in a sandboxed</p><p>environment</p><p>☒ It’s fast, comparable to native CPU code in terms of</p><p>performance</p><p>9 For example, “WebAssembly: The Definitive Guide” by Brian Sletten, “Programming</p><p>WebAssembly with Rust” by Kevin Hoffman, or “ScyllaDB’s Take on WebAssembly for User-</p><p>Defined Functions” by Piotr Sarna.</p><p>Chapter 6 GettinG Data Closer</p><p>125</p><p>Runtime</p><p>WebAssembly is compiled to bytecode. This bytecode is designed to run on a virtual</p><p>machine, which is usually part of a larger development environment called a runtime.</p><p>There are multiple implementations of WebAssembly runtimes, most notably:</p><p>• Wasmtime</p><p>https://wasmtime.dev/</p><p>A fast and secure runtime for WebAssembly, implemented in Rust,</p><p>backed by the Bytecode Alliance10 nonprofit organization.</p><p>• Wasmer.io</p><p>https://wasmer.io/</p><p>Another open-source initiative implemented in Rust; maintainers</p><p>of the WAPM11 project, which is a Wasm package manager.</p><p>• WasmEdge:</p><p>https://wasmedge.org/</p><p>Runtime implemented in C++, general-purpose, but focused on</p><p>edge computing.</p><p>• V8:</p><p>https://v8.dev/</p><p>Google’s monolith JavaScript runtime; written in C++, comes with</p><p>WebAssembly support as well.</p><p>Also, since the WebAssembly specification is public, feel free to implement your own!</p><p>Beware though: The standard is still in heavy development, changing rapidly every day.</p><p>10 https://bytecodealliance.org/</p><p>11 https://wapm.io/</p><p>Chapter 6 GettinG Data Closer</p><p>126</p><p>Back toLatency</p><p>Each runtime is free to define its own performance characteristics and guarantees. One</p><p>interesting feature introduced in Wasmtime is the concept of fuel, already mentioned in</p><p>the earlier discussion of user-defined functions. Combined with the fact that Wasmtime</p><p>provides an optional asynchronous interface for running WebAssembly modules, it gives</p><p>users an opportunity to fine-tune the runtime to their latency requirements.</p><p>When Wasmtime starts executing a given WebAssembly function, this unit of</p><p>execution is assigned a certain amount of fuel. 
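Conceptually, the metering loop behaves like the following sketch. This is a simplified model, not Wasmtime's actual API; execute and yield_to_scheduler are hypothetical helpers standing in for the runtime's internals:

def run_with_fuel(instructions, tank, refill=None):
    # Simplified model: every instruction burns one unit of fuel; an empty tank
    # either terminates the task or hands the CPU back and gets refilled.
    fuel = tank
    for instruction in instructions:
        if fuel == 0:
            if refill is None:
                raise TimeoutError("task exceeded its fuel quota")
            yield_to_scheduler()   # hypothetical: let other tasks run for a while
            fuel = refill
        execute(instruction)       # hypothetical: run a single bytecode instruction
        fuel -= 1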
Each execution step exhausts a small</p><p>amount of fuel—at the time of writing this paragraph, it simply consumes one unit of fuel</p><p>on each WebAssembly bytecode instruction, excluding a few flow control instructions</p><p>like branching. Once the execution unit runs out of fuel, it yields. After that happens, one</p><p>of the preconfigured actions is taken: either the execution unit is terminated, or its tank</p><p>gets refilled and it’s allowed to get back to whatever it was computing. This mechanism</p><p>allows the developer to control not only the total amount of CPU time that a single</p><p>function execution can take, but also how often the execution should yield and hand</p><p>over the CPU for other tasks. Thus, configuring fuel management the right way prevents</p><p>function executions from taking over the CPU for too long. That helps maintain low,</p><p>predictable latency in the whole system.</p><p>Another interesting aspect of WebAssembly is its portability. The fact that the</p><p>code can be distributed to multiple places and it’s guaranteed to run properly in</p><p>multiple environments makes it a great candidate for moving not only data, but also</p><p>computations, closer to the user.</p><p>Pushing the database logic from enormous datacenters to smaller ones, located</p><p>closer to end users, got its own buzzy name: edge computing.</p><p>Edge Computing</p><p>Since the Internet of Things (IoT) became a thing, the term edge computing needs</p><p>disambiguation. This paragraph is (unfortunately?) not about:</p><p>• Utilizing the combined computing power of smart fridges in</p><p>your area</p><p>• Creating a data mesh from your local network of Bluetooth light bulbs</p><p>• Integrating your smart watch into a Raft cluster in witness mode</p><p>Chapter 6 GettinG Data Closer</p><p>127</p><p>The edge described in this paragraph is of a more boring kind. It still means</p><p>performing computations on servers, but on ones closer to the user (e.g., located in a</p><p>local Equinix datacenter in Warsaw, rather than Amazon’s eu-central-1 in Frankfurt).</p><p>Performance</p><p>What does edge computing have to do with database performance? It brings the data</p><p>closer to the user, and closer physical distance translates to lower latency. On the other</p><p>hand, having your database cluster distributed to multiple locations has its downsides</p><p>as well. Moving large amounts of data between those regions might be costly, as cloud</p><p>vendors tend to charge for cross-region traffic. If the latency between database nodes</p><p>reaches hundreds of milliseconds, which is the customer grade latency between</p><p>Northern America and Europe (unless you can afford Hibernia Express12), they can get</p><p>out of sync easily. Even a few round-trips—and distributed consensus algorithms alone</p><p>require at least two—can cause delays that exceed the comfort zone of one second.</p><p>Failure detection mechanisms are also affected since packet loss occurs much more</p><p>often when the cluster spans multiple geographical locations.</p><p>Database drivers for edge-friendly databases need to be aware of all these limitations</p><p>mentioned. In particular, they need to be extra careful to pick the closest region</p><p>whenever possible, minimizing the latency and the chance of failure.</p><p>Conflict-Free Replicated Data Types</p><p>CRDT (conflict-free replicated data types) is an interesting way of dealing with</p><p>inconsistencies. 
It’s a family of data structures designed to have the following</p><p>characteristics:</p><p>• Users can update database replicas independently, without</p><p>coordinating with other database servers.</p><p>• There exists an algorithm to automatically resolve conflicts that</p><p>might occur when the same data is independently written to multiple</p><p>replicas concurrently.</p><p>• Replicas are allowed to be in different states, but they are guaranteed</p><p>to eventually converge to a common state.</p><p>12 A submarine link between Canada, Ireland, and the UK, offering sub-60ms latency.</p><p>Chapter 6 GettinG Data Closer</p><p>128</p><p>The concept of CRDT gained traction along with edge computing because the two</p><p>complement each other. The database is allowed to keep replicas in multiple places and</p><p>allows them to act without central coordination—but at the same time, users can assume</p><p>that eventually the database state is going to become consistent.</p><p>A few interesting data structures that fit the definition of CRDT are discussed next.</p><p>G-Counter</p><p>Grow-only counter. Usually implemented as an array of counters, keeping a local</p><p>counter value per each database node. Two array states from different nodes can</p><p>be merged by taking the maximum of each respective field.</p><p>The actual value of the</p><p>G-Counter is simply a sum of all local counters.</p><p>PN-Counter</p><p>Positive-Negative counter, brilliantly implemented by keeping two G-Counter</p><p>instances—one for accumulating positive values, the other for negative ones. The final</p><p>value is obtained by subtracting one from the other.</p><p>G-Set</p><p>Grow-only set, that is, one that forbids the removal of elements. Converging two G-Sets</p><p>is a simple set union since values are never removed from a G-Set. One flavor of G-Set</p><p>is G-Map, where an entry, key, and value associated with the key cannot be removed</p><p>once added.</p><p>LWW-Set</p><p>Last-write-wins set (and map, accordingly). This is a combination of two G-Sets, one</p><p>gathering added elements and the other containing removed ones. Conflict resolution is</p><p>based on a set union of the “added” G-Set, minus the union of the “removed” G-Set, but</p><p>timestamps are also taken into account. A value exists if its timestamp in the “added” set</p><p>is larger than its timestamp in the “removed” set, or if it’s not present in the “removed”</p><p>set at all.</p><p>The list is obviously not exhaustive, and countless other CRDTs exist. You’re hereby</p><p>encouraged to do research on the topic if you found it interesting!</p><p>Chapter 6 GettinG Data Closer</p><p>129</p><p>CRDTs are not just theoretical structures; they are very much used in practice.</p><p>Variants of conflict-free replicated data types are common among databases that offer</p><p>eventual consistency, like Apache Cassandra and ScyllaDB.Their writes have last-write-</p><p>wins semantics for conflict resolution, and their implementation of counters is based on</p><p>the idea of a PN-Counter.</p><p>Summary</p><p>At this point, it should be clear that there are a number of ways to improve</p><p>performance by using a database a bit unconventionally, as well as understanding</p><p>(and tapping) specialized capabilities built into the database and its drivers. Let’s</p><p>shift gears and look at the top “do’s and don’ts” that we recommend for ensuring that</p><p>your database is performing at its best. 
The next chapter begins this discussion by</p><p>focusing on infrastructure options (CPUs, memory, storage, and networking) and</p><p>deployment models.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 6 GettinG Data Closer</p><p>131</p><p>CHAPTER 7</p><p>Infrastructure and</p><p>Deployment Models</p><p>As noted in the previous chapter, many modern databases offer capabilities beyond</p><p>“just” storing and retrieving data. But all databases are ultimately built from the ground</p><p>up in order to serve I/O in the most efficient way possible. And it’s crucial to remember</p><p>this when selecting your infrastructure and deployment model of choice.</p><p>In theory, a database’s purpose is fairly simple: You submit a request and expect</p><p>to receive a response. But as you have seen in the previous chapters, an insane level of</p><p>engineering effort is spent on continuously enhancing and speeding up this process.</p><p>Very likely, years and years were dedicated to optimizing algorithms that may give</p><p>you a processing boost of a few CPU cycles, or minimizing the amount of memory</p><p>fragmentation, or reducing the amount of storage I/O needed to look up a specific set</p><p>of data. All these advancements, eventually, converge to create a database suitable for</p><p>performance at scale.</p><p>Regardless of your database selection, you may eventually hit a wall that no</p><p>engineering effort can break through: the database’s physical hardware. It makes very</p><p>little sense to have a solution engineered for performance when the hardware you throw</p><p>at it may be suboptimal. Similarly, a less performant database will likely be unable to</p><p>make efficient use of an abundance of available physical resources.</p><p>This chapter looks at critical considerations and tradeoffs when selecting CPUs,</p><p>memory, storage, and networking for your distributed database infrastructure. It</p><p>describes how different resources cooperate and how to configure the database to</p><p>deliver the best performance. Special attention is drawn to storage I/O as the most</p><p>difficult component to deal with. There’s also a close look at optimal cloud-based</p><p>deployments suitable for highly-performant distributed databases (given that these are</p><p>the deployment preference of most businesses).</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. 
Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_7</p><p>132</p><p>While it is true that a Database-as-a-Service (DBaaS) deployment will shield you</p><p>from many infrastructure and hardware decisions through your selection process, a</p><p>fundamental understanding of the generic compute resources required by any database</p><p>is important for identifying potential bottlenecks that may limit performance. After an</p><p>introduction to the hardware that’s involved in every deployment model—whether you</p><p>think about it or not—the chapter shifts focus to different deployment options and their</p><p>impact on performance. It covers the special considerations associated with cloud-</p><p>hosted deployments, database-as-a-service, serverless, containerization, and container</p><p>orchestration technologies, such as Kubernetes.</p><p>Core Hardware Considerations forSpeed atScale</p><p>When you are designing systems to handle large amounts of data and requests at scale,</p><p>the primary hardware considerations are:</p><p>• Storage</p><p>• CPU (cores)</p><p>• Memory (RAM)</p><p>• Network interfaces</p><p>Each could be a potential bottleneck for internal database latency: The delay from</p><p>when a request is received by the database (or a node in the database) and when the</p><p>database provides a response.</p><p>Identifying theSource ofYour Performance Bottlenecks</p><p>Knowing your database’s write and read paths is helpful for identifying potential</p><p>performance bottlenecks and tracking down the culprit. It’s also key to understanding</p><p>what physical resources your use case may be mostly bound against.</p><p>For example, write-optimized databases carry this nomenclature because writes</p><p>primarily go to memory, rather than being immediately persisted into disk. However,</p><p>most modern databases need to employ some “crash-recovery” mechanism and avoid</p><p>data loss caused by unexpected service interruptions. As a result, even write-optimized</p><p>databases will also resort to disk access to quickly persist your data, just in case. For</p><p>example, writes to Cassandra clusters will be persisted to a “write ahead log” disk</p><p>Chapter 7 InfrastruCture and deployment models</p><p>133</p><p>structure called the “commit log” and a memory structure that’s named a “memtable.” A</p><p>write is considered successful only after both operations succeed.</p><p>On the other side of the spectrum, the database’s read path will typically also involve</p><p>several physical components. Assuming that you’re not using an in-memory database,</p><p>then the read path will start by checking whether the data you are looking for is present</p><p>within the database cache. But if it’s not, the database needs to look up and retrieve the</p><p>data from disk, de-serialize it, and then answer with the results.</p><p>Network also plays a crucial role throughout the entire process. When you write,</p><p>data needs to be rapidly replicated to other replicas. When you read, the database needs</p><p>to select the correct replicas (shards) containing the data that the application is after,</p><p>thus potentially having to communicate with other nodes in the cluster. 
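A toy model of a single write helps connect these pieces: the coordinator picks the replicas, each replica touches both the commit log (disk) and the memtable (memory), and the coordinator waits for enough acknowledgments. The helpers below are hypothetical; only the flow matters:

def handle_write(key, value, cluster, consistency="QUORUM"):
    # Hypothetical helpers throughout; this only illustrates the request flow.
    replicas = cluster.replicas_for(key)          # network: pick the right nodes
    acks = []
    for replica in replicas:
        replica.commit_log.append(key, value)     # disk: crash-recovery record
        replica.memtable.insert(key, value)       # memory: fast in-RAM structure
        acks.append(replica.ack())
    needed = len(replicas) // 2 + 1 if consistency == "QUORUM" else len(replicas)
    return wait_for(acks, count=needed)           # latency is set by the slowest needed ack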
Moreover,</p><p>strong consistency use cases always require the response of a majority of members for</p><p>an operation to be successful—so delayed responses from a replica can dramatically</p><p>increase the tail latency of a request routed to it.</p><p>Achieving Balance</p><p>Balance is key to any distributed system, including and beyond databases. It makes</p><p>very little sense to try to achieve 1 million operations per second (OPS) in a system that</p><p>has the fastest network link available but relies on very few CPUs. Similarly, it’s not very</p><p>efficient to purchase the most expensive and performant infrastructure for your solution</p><p>if your use case requires only 10K OPS.</p><p>Additionally, it’s important to recognize that a cluster imbalance can easily drag</p><p>down performance across your entire distributed system. This happens because a</p><p>distributed system cannot be faster than your slowest component—a fact that frequently</p><p>surprises people.</p><p>Here’s a real-life example. A customer reported elevated latencies affecting their</p><p>entire 18-node cluster. After collecting system information, we noticed that the majority</p><p>of their nodes were properly using locally-attached nonvolatile memory express (NVMe)</p><p>disks—except for one that had a software Redundant Array of Independent Disks (RAID)</p><p>with a mix of NVMes and network-attached disks. The customer clarified that they</p><p>were running out of storage space and decided to attach another disk in order to relieve</p><p>the problem. However, they weren’t aware that this introduced a ticking time bomb</p><p>into their entire cluster. Here’s a brief explanation of what happened from a technical</p><p>perspective:</p><p>Chapter 7 InfrastruCture and deployment models</p><p>134</p><p>1. With a slow disk introduced in their RAID array, storage I/O</p><p>operations in that specific replica took longer to complete.</p><p>2. As a result, the remaining replicas took additional time whenever</p><p>sending or waiting for a response that would require disk I/O.</p><p>3. As more and more requests came in, all these delays eventually</p><p>created a waiting queue on the replicas.</p><p>4. As the queue kept growing, this eventually affected the replicas’</p><p>performance, which ended up affecting the entire cluster’s</p><p>performance.</p><p>5. From that point on, the entire cluster speed was impeded by the</p><p>speed of its slowest node: the one that had the slowest disk.</p><p>Setting Realistic Expectations</p><p>Even the most powerful hardware cannot ensure impressive end-to-end (or round-trip)</p><p>latency—the entire cycle time from when a client sends a request to the server until it</p><p>obtains a response. The end-to-end latency could be undermined by factors that might</p><p>be outside of the database’s control. 
For example:</p><p>• Multi-hop routing of packets from your client application to the</p><p>database server, adding hundreds of milliseconds in latency</p><p>• Client driver settings, connecting and sending requests to a remote</p><p>datacenter</p><p>• Consistency levels that require both local and remote datacenter</p><p>responses</p><p>• Poor network performance between clients and database servers</p><p>• Protocol overheads</p><p>• Client-side performance bottlenecks</p><p>Chapter 7 InfrastruCture and deployment models</p><p>135</p><p>Recommendations forSpecific</p><p>Hardware Components</p><p>This section takes a deeper look at each of the primary hardware considerations:</p><p>• Storage</p><p>• CPU (cores)</p><p>• Memory (RAM)</p><p>• Network interfaces</p><p>Storage</p><p>One of the fastest ways to undermine all your other performance optimizations is to send</p><p>every read and write operation through an unsuitable disk. Although recent technology</p><p>advancements greatly improved the performance of storage devices, disks are (by far)</p><p>still the slowest component in a computer system.</p><p>From a performance standpoint, disk performance is typically measured in two</p><p>dimensions:</p><p>• The bandwidth available for sequential reads and writes</p><p>• The IOPS for random reads and writes</p><p>Database engineers obsess over optimizing disk access patterns with respect to those</p><p>two dimensions. People who are selecting, managing, or using a database should focus</p><p>on two additional disk considerations: the storage technology and the disk size.</p><p>Disk Types</p><p>Locally-attached NVMe Solid State Drives (SSDs) are the standard when latency is</p><p>critical. Compared with other bus interfaces, NVMe SSDs connected to a Peripheral</p><p>Component Interconnect Express (PCIe) interface will generally deliver lower latencies</p><p>than the Serial AT Attachment (SATA) interface. If your workload isn’t super latency</p><p>sensitive, you could also consider using disks via the SATA interface. But, definitely avoid</p><p>using network-attached disks if you expect single-digit millisecond latencies. Being</p><p>network attached, these disks require an additional hop to reach a storage server, and</p><p>that ends up increasing latency for every database request.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>136</p><p>If your focus is on throughput and latency really doesn’t matter for your use case</p><p>(e.g., for moving data into a data warehouse), you might be able to get away with a</p><p>persistent disk—but it’s not recommended. By persistent disks, we mean durable</p><p>network storage devices that your VMs can access like physical disks, but are located</p><p>independently from your VMs. We’re not going to pick on any specific vendors, but a</p><p>little research should reveal issues like subpar performance and overall instability. If</p><p>you’re forced to work with persistent disks, be prepared to craft a creative solution.1</p><p>Hard disk drives (HDDs) might fast become a bottleneck. Since SSDs are getting</p><p>progressively cheaper and cheaper, using HDDs is not recommended. Some workloads</p><p>may work with HDDs, especially if they play nice and minimize random seeks. An</p><p>example of an HDD-friendly workload is a write-mostly (98 percent writes) workload</p><p>with minimal random reads. 
If you decide to use HDDs, try to allocate a separate disk for</p><p>the commit log.</p><p>ScyllaDB published benchmarking results of several different storage devices—</p><p>demonstrating how they perform under extreme load simulating typical database access</p><p>patterns.2 For example, Figures7-1 through 7-4 visualize the different performance</p><p>characteristics from two NVMes—a persistent disk and an HDD.</p><p>1 For inspiration, consider Discord’s approach—but recognize that this is</p><p>certainly not a one-size-fits-all solution. It’s described in their blog, “How Discord</p><p>Supercharges Network Disks for Extreme Low Latency” (https://discord.com/blog/</p><p>how-discord-supercharges-network-disks-for-extreme-low-latency).</p><p>2 You can find the results, as well as the tool to reproduce the results, at https://github.com/</p><p>scylladb/diskplorer#sample-results.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>137</p><p>Figure 7-1. NVMe bandwidth/latency graphs for an AWS i3.2xlarge instance type</p><p>Chapter 7 InfrastruCture and deployment models</p><p>138</p><p>Figure 7-2. Bandwidth/latency graphs for an AWS Im4gn.4xlarge instance type</p><p>using AWS Nitro SSDs</p><p>Chapter 7 InfrastruCture and deployment models</p><p>139</p><p>3 Strangely, the 95th percentile at low rates is worse than at high rates.</p><p>Figure 7-3. Bandwidth/latency graphs for a Google Cloud n2-standard-8 instance</p><p>type with a 2TB SSD persistent disk3</p><p>Chapter 7 InfrastruCture and deployment models</p><p>140</p><p>Figure 7-4. Bandwidth/latency graphs for a Toshiba DT01ACA200 hard</p><p>disk drive4</p><p>4 Note the throughput and IOPS were allowed to miss by a 15 percent margin rather than the</p><p>normal 3 percent margin.</p><p>Disk Setup</p><p>We hear a lot of questions about RAID setups. Hardware RAIDs are commonly used to</p><p>avoid outages introduced by disk failures. As a result, the RAID-5 (distributed parity)</p><p>setup is often used.</p><p>However, distributed databases typically have their own internal replication</p><p>mechanism to allow for business continuity and achieve high availability. Therefore,</p><p>RAID setups</p><p>employing data mirroring or distributed parity have proven to be very</p><p>detrimental to disk I/O performance and, fairly often, are used redundantly. On top of</p><p>that, we have found that some hardware RAID vendors deliver poor performance results</p><p>Chapter 7 InfrastruCture and deployment models</p><p>141</p><p>depending on your database access mechanisms. One notable example: hardware</p><p>RAIDs that are unable to perform efficiently via asynchronous I/O or direct I/O calls. If</p><p>you believe your disk I/O is suboptimal, consider directly exposing the disks from your</p><p>hardware RAID to your operating system.</p><p>Conversely, RAID-0 (striping) setups often provide a boost in disk I/O performance</p><p>and allow the database to achieve higher IOPS and bandwidth than a single disk can</p><p>provide. 
The general recommendation for creating a RAID-0 setup is to use all disks of</p><p>the same type and capacity to avoid variable performance during your daily workload.</p><p>While it is true you would lose the entire RAID array in the event of a disk failure, the</p><p>replication performed by your distributed database should be sufficient to ensure that</p><p>your data remains available.</p><p>A couple of additional considerations related to disk setup:</p><p>• Storage servers often serve several other users and workloads at</p><p>the same time. Therefore, even though disks would be dedicated to</p><p>the database, your access performance can be undermined by factors</p><p>like the level to which the storage system is serving other users</p><p>concurrently. Most of the time, the storage medium provided to you</p><p>will not be optimal for supporting a low-latency database workload.</p><p>This can often be mitigated by ensuring that the disks are allocated</p><p>from a high-performing disk pool.</p><p>• It’s important to expose your database infrastructure disks</p><p>directly to the operating system guest from your hypervisor. We</p><p>have seen many situations where the I/O capacity of a database</p><p>was greatly impacted when disks were virtualized. To eliminate</p><p>any possible bottlenecks in a low-latency environment, give your</p><p>database direct access to your disks so that they can perform I/O as</p><p>they were designed to.</p><p>Disk Size</p><p>When considering how much storage you need, be sure to account for your existing</p><p>data—replicated—plus your anticipated near-term data growth, and also leave sufficient</p><p>room for the overhead of internal operations (like compactions [for LSM-tree-based</p><p>databases], the commit log, backups, etc.).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>142</p><p>As Chapter 8 discusses, the most common topology involves three replicas for each</p><p>dataset. Assume you have 5TB of raw data and use a replication factor of three:</p><p>5TB Data X 3 RF = 15TB</p><p>But 15TB is just a starting point since there are other sizing criteria:</p><p>• What is your dataset’s growth rate? (How much do you ingest per</p><p>hour or day?)</p><p>• Will you store everything forever, or will you have an eviction process</p><p>(for example, based on Time To Live [TTL])?</p><p>• Is your growth rate stable (a fixed rate of ingestion per week/day/</p><p>hour) or is it stochastic and bursty? The former would make it more</p><p>predictable; the latter may mean you have to give yourself more</p><p>leeway to account for unpredictable but probabilistic events.</p><p>You can model your data’s growth rate based on the number of users or endpoints</p><p>and how that number is expected to grow over time. Alternately, data models are often</p><p>enriched over time, resulting in more data per source. Or your sampling rate may</p><p>increase. For example, your system may begin ingesting data every five seconds rather</p><p>than every minute. All of these considerations impact your data storage volume.</p><p>It’s strongly recommended that you select storage that’s suitable for where you</p><p>expect to end up after a certain time span. 
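A rough sizing sanity check can be scripted in a few lines; the growth and overhead figures below are placeholders, so substitute your own:

raw_tb = 5.0              # current raw dataset
replication_factor = 3
monthly_growth = 0.05     # 5% ingest growth per month (placeholder)
months_of_runway = 12
overhead = 1.4            # headroom for compaction, commit log, backups (placeholder)

replicated = raw_tb * replication_factor
projected = replicated * (1 + monthly_growth) ** months_of_runway
required_cluster_storage_tb = projected * overhead
print(round(required_cluster_storage_tb, 1))   # ~37.7 TB across the cluster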
If you’re running your database on a public</p><p>cloud provider (self-managed or as a fully-managed Database-as-a-Service [DBaaS]),</p><p>you won’t need very much lead time to provision new hardware and expand your cluster.</p><p>However, for an on-premises hardware purchase, you may need to provision based on</p><p>your quarterly or annual budgeting process. You could also face delays due to the supply</p><p>chain disruptions that have become increasingly common.</p><p>Also, be sure to leave storage space for internal temporary operations such as</p><p>compaction, repairs, backups, and commit logs, as well as any other background process</p><p>that may temporarily introduce a space amplification. On the other hand, if you’re using</p><p>compression, be sure to factor in the amount of space that your selected compression</p><p>algorithm can save you.</p><p>Finally, recognize that every database has an ideal memory-to-storage ratio—for</p><p>example, a certain amount of TB or GB per node that it can support with optimal</p><p>performance. If this isn’t readily apparent in your database’s documentation, press your</p><p>vendor for their recommendation.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>143</p><p>Raw Devices andCustom Drivers</p><p>Some database vendors require direct access to storage devices—without needing a</p><p>filesystem to exist. Such direct access is often referred to as creating a “raw” device,</p><p>which refers to the fact that the operating system won’t know how to manage it, and any</p><p>I/O is handled directly by the database. Issuing I/O directly to the underlying storage</p><p>device may provide a performance boost to the database. However, it is important to</p><p>understand some of this approach’s drawbacks, which may not be important for your</p><p>specific deployment.</p><p>1. Error prone: Directly issuing I/O to a disk rather than through a</p><p>filesystem is error prone. While it will provide a performance gain,</p><p>incorrect handling of the underlying storage could result in data</p><p>corruption, data loss, or unexpected bugs.</p><p>2. Complex: Raw devices are not as common as one might expect. In</p><p>fact, very few databases decided to implement that approach. It’s</p><p>important to note that since raw devices aren’t typically mounted</p><p>as regular filesystems, their manageability will be fully dependent</p><p>on what your vendor provides.</p><p>3. Lock-in: Once you are using a raw device, it’s extremely difficult</p><p>to move away from it. You can’t mount raw devices or query their</p><p>storage consumption via typical operating system mechanisms.</p><p>All of your disks need to be arranged in a certain way, and you</p><p>can’t easily go back to a regular filesystem.</p><p>Maintaining Disk Performance Over Time</p><p>Databases are very storage I/O intensive, so disks will wear out over time. Most disk</p><p>vendors provide estimates concerning the performance durability of their products.</p><p>Check on those and compare.</p><p>There are multiple tools and programs that can help with SSD performance over</p><p>time. One example is the fstrim program, which is frequently run weekly to discard</p><p>unused filesystem blocks. 
fstrim is an operating system background process that doesn't require any database action and may improve I/O to a significant extent.

Tip: If you have to choose one place to invest (CPU, storage, memory, or networking), we recommend splurging on storage. Everything else has evolved faster and better than storage; it still remains the slowest component in most systems.

Tiered Storage

Many use cases have different latency requirements for different sets of data. Similarly, industries may see exponential storage utilization growth over time. It is not always desirable, or even possible, to get rid of old data (for example, due to compliance regulations, third-party contracts, or simply because it still carries relevance for the business).

Teams with storage-heavy use cases often seek ways to minimize the costs of storage consumption: by reducing the replication factor of their dataset, using less performant (although cheaper) storage disks, or by employing a manual data rotation process from faster to slower disks.

Tiered storage is a solution implemented by some databases in order to address most of these concerns. It allows users to configure the database to use distinct storage tiers, and to define which criteria the database should use to ensure that the data is correctly replicated to its relevant tier. For example, MongoDB allows you to determine how data is replicated to a specific storage tier by assigning different tier tags to shards, allowing its balancer to migrate data between tiers automatically. On top of that, Atlas Online Archive also allows the database to offload historical datasets to cloud storage.

CPUs (Cores)

Next is the CPU.
As of this writing, you are probably looking at modern servers running</p><p>some reasonably modern Intel, AMD, or ARM chips, which are commonly found across</p><p>most cloud providers and enterprise hardware vendors. Along with storage, CPUs are</p><p>another compute resource which—if not correctly sized—may introduce contention to</p><p>your workload and impact your latencies. Clusters handling hundreds of thousands up</p><p>to millions of operations per second tend to get very high CPU loads.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>145</p><p>More cores will generally mean better performance. This is important for achieving</p><p>optimal performance from databases that are architected to benefit from multithreading,</p><p>and it’s absolutely essential for databases that are architected with a shard-per-core</p><p>architecture—running a separate shard on each core in each server. In this case, the</p><p>more cores the CPU has, the more shards—and the better data distribution—the</p><p>database will have.</p><p>A combination of vendor recommendations and benchmarking (see Chapter 9)</p><p>can help you determine how much throughput each multicore chip can support. A</p><p>general recommendation is to avoid running production systems close to the CPU limits</p><p>and find the sweet spot between supporting your expected performance and leaving</p><p>room for throughput growth. On top of that, when doing benchmarking, remember</p><p>to also factor in background database operations that might be detrimental to your</p><p>performance. For example, Cassandra and Cassandra-compatible databases often</p><p>need to run repair: a weekly process to ensure data consistency across the cluster. This</p><p>process requires a lot of coordination and communication across the entire cluster. If</p><p>your workload is not properly sized to accommodate background database operations</p><p>and other events (such as node failures), your latency may increase to a level that</p><p>surprises you.</p><p>When using virtual machines, containers, or the public cloud, remember that each</p><p>virtual CPU is mapped to a single logical core, or thread. In many cloud deployments,</p><p>nodes are provided on a vCPU basis. The vCPU is typically a single hyperthread from</p><p>a dual hyperthread x86 physical core for Intel/AMD variants, or a single core for</p><p>ARM chips.</p><p>No matter what your deployment of choice involves, avoid overcommitting CPU</p><p>resources if performance is a priority. Doing so will prevent other guests from stealing</p><p>CPU time5 from your database.</p><p>Memory (RAM)</p><p>If you’re working with an in-memory database, having enough memory to hold your</p><p>entire dataset is an absolute must. But every database uses in-memory caching to some</p><p>extent. For example, some databases require enough memory space for indexes to avoid</p><p>expensive round-trips to storage disks. Others leverage an internal data cache to allow</p><p>5 For more on CPU steal time, see “Detecting CPU Steal Time in Guest Virtual Machines” by Jamie</p><p>Fargen (https://opensource.com/article/20/1/cpu-steal-time).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>146</p><p>for lower latencies when retrieving recently used data, Cassandra and Cassandra-like</p><p>databases implement memtables, and some databases allow you to control which tables</p><p>are served entirely from memory. 
The more memory the database has at its disposal, the better you can take advantage of those mechanisms. After all, even the fastest NVMe can't come close to the speed of RAM access.

In general, there is no blanket recommendation for "how much memory is enough" for a database. Different vendors have different requirements and different use cases also require different memory sizes. However, latency-sensitive use cases typically require high memory footprints in order to achieve high cache hit rates and serve low-latency read requests efficiently.

For example, a use case with a higher payload size requires a larger memory footprint than one with a smaller payload size. Another interesting aspect to consider is how frequently the use case in question reads data that may be present in memory (hot data) as opposed to data that was never read (cold data). As mentioned in Chapter 2, the latter can easily undermine your latencies.

Without a sufficient disk-to-memory ratio, you will be hitting your storage far more than you probably want if you intend to keep your latencies low. The ideal ratio varies from database to database since every caching implementation is different, so be sure to ask your vendor for their specific recommendations. For example, ScyllaDB currently recommends that for every 1GB of memory allocated to a node, you can store up to 100GB of data (so if you have 32GB of memory, you can handle around 3TB). The higher your storage-to-memory ratio gets, the less room you have for caching your total dataset. Every database has some sort of hard physical limit. If you don't have enough memory and you have to run a workload on top of a very large dataset, it's either going to be rather slow or increase the risk of the database running out of memory.

Another ratio to keep in mind: memory per CPU core. At ScyllaDB, we recommend at least 8GB of memory per CPU core for production purposes (because, given our shared-nothing architecture, every shard works independently and has its own allocated memory for caching). 8GB per vCPU is the same ratio used by most cloud providers for NoSQL or Big Data-oriented instance types. Again, the recommended ratio will vary across vendors, depending on the database's specific internal cache implementation and other implementation details. For example, in Cassandra and Cassandra-like databases, part of the memory will be allocated for some of its SSTable components in order to speed up disk lookups when reading cold data. Aerospike will typically store all indexes in RAM. And MongoDB, on average, requires 1GB of RAM per 100K assets.

Distributed databases are notoriously high memory consumers. Regardless of its implementation, the database will always need to store some relevant parts of your dataset in memory in order to avoid wasting time on disk I/O. Insufficient memory can manifest itself as unpredictable, erratic database behavior—even crashes.

Network

Lastly, you have to ensure that network I/O does not become a bottleneck. Networking is often an overlooked component. As with any distributed system, a database involves a lot of traffic between all the cluster members to check for liveness, replicate state and topology changes, and so on.
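Before continuing with the network discussion, here is a quick sketch of the memory-sizing arithmetic from the previous section. It is only illustrative: the 1GB-of-RAM-per-100GB-of-data ratio and the 8GB-per-core figure are the ScyllaDB guidelines quoted above, so substitute your own vendor's recommendations before using numbers like these.

```python
# Rough memory-sizing sketch using the ratios discussed above.
# These ratios are ScyllaDB's current guidance as quoted in the text;
# other databases publish different recommendations.

DATA_PER_GB_RAM_GB = 100   # up to ~100 GB of data per 1 GB of RAM
RAM_PER_CORE_GB = 8        # at least 8 GB of memory per CPU core

def minimum_ram_gb(dataset_gb: float, cores: int) -> float:
    """Return the larger of the dataset-driven and core-driven RAM minimums."""
    ram_for_dataset = dataset_gb / DATA_PER_GB_RAM_GB
    ram_for_cores = cores * RAM_PER_CORE_GB
    return max(ram_for_dataset, ram_for_cores)

# Example: a 3 TB dataset on a 16-core node needs at least
# max(3000/100, 16*8) = 128 GB of RAM under these assumptions.
print(minimum_ram_gb(dataset_gb=3000, cores=16))
```

With that sizing aside out of the way, back to the network traffic just mentioned.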
As a result, network delays not only deteriorate your application's latency, but also prevent internode communication from functioning effectively.

At ScyllaDB, we recommend a minimum network bandwidth of 10Gbps because internal database operations such as streaming, repairs, and gossip can become very network intensive. On top of that, you also need to factor in the actual throughput required for the use case in question; the number of operations per second will certainly be the highest bandwidth consumer for your deployment.

As with memory, the required network bandwidth will vary. Be sure to check your vendor recommendations and consider the nature of your use case. A low throughput workload will obviously consume less traffic than a higher throughput one.

Tip: Use CPU pinning to mitigate the impact of hardware interrupts. Hardware interrupts, which typically stem from (but are not limited to) high network Internet traffic, force the OS kernel to stop everything and respond to the hardware before returning to the job at hand. Too many interrupts (e.g., a high softirq percent) will degrade database performance, as your CPUs may stall during processing for serving network traffic. One way to resolve this is to use CPU pinning. This tells the system that all network interrupts should be handled by specific CPUs that are not being used by the database. With that setup, you can blast the database with network traffic and be reasonably confident that you won't overwhelm it or stall the database processing during normal operations.

For cloud deployments, most IaaS vendors provide a modern network infrastructure with ample bandwidth between your database servers and between the database and the application clients. Be sure to check on your client's network bandwidth consumption if you suspect network problems. A common mistake we see in deployments involves application clients deployed with suboptimal network capacity.

Also, be sure to place your application servers as close as possible to your database. If you are deploying them in a single region, a shorter physical distance between the servers will translate to better network performance (since it will require fewer network hops for communication) and, as a result, lower latencies. If you need to go multi-region and you require strong consistency or replication across these regions, then you need to pay the latency penalty for traversing regions—plus, you also have to pay, quite literally, with respect to cross-region networking transfer fees. For multi-region deployments with cross-region replication, a slow network link may create replication delays that cause the database to apply backpressure on your writes until it manages to replicate the data piled up.

Considerations in the Cloud

The "on-prem vs cloud" decision depends heavily on your organization's security and regulatory requirements as well as its business strategy—and is well beyond the scope of this book. Instead of heading down that path, let's focus on exploring performance considerations that are unique to cloud deployments.

Most cloud providers offer a wide range of instance types that you may choose to host your workload.
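Whichever environment you end up on, it is worth verifying that interrupt handling is not stealing cycles from the cores your database runs on, as the tip above describes. Below is a minimal, Linux-only sketch using the psutil package; the 10 percent threshold is an arbitrary example, and production setups typically rely on vendor-provided tuning tools rather than an ad hoc script like this.

```python
# Quick check for CPUs that spend too much time in softirq handling
# (e.g., serving network interrupts). A sustained high percentage on the
# cores your database runs on suggests revisiting IRQ affinity / CPU pinning.
# Requires the psutil package; the softirq field is reported on Linux only.

import psutil

THRESHOLD = 10.0  # percent; illustrative, not a vendor recommendation

def busy_softirq_cpus(sample_seconds: float = 1.0):
    per_cpu = psutil.cpu_times_percent(interval=sample_seconds, percpu=True)
    return [(cpu, t.softirq) for cpu, t in enumerate(per_cpu)
            if t.softirq > THRESHOLD]

if __name__ == "__main__":
    for cpu, pct in busy_softirq_cpus():
        print(f"CPU {cpu}: {pct:.1f}% softirq")
```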
In our experience, most of the mistakes and performance</p><p>bottlenecks seen on distributed databases within cloud deployments are due to an</p><p>incorrect instance or storage type selection during the initial cluster setup. A common</p><p>misunderstanding (and concern) that many people have is the fact that NVMe-based</p><p>storage may be more expensive than network-attached storage. The misconception likely</p><p>stems from the assumption that since NVMes are faster, they would incur elevated costs.</p><p>However it turns out to be quite the opposite: Since NVMe disks on cloud environments</p><p>are tied to the lifecycle of an instance, they end up being cheaper than network disks,</p><p>which require holding up your dataset for a prolonged period of time. We encourage</p><p>you to compare the costs of NVMe backed-up storage against network-attached disks on</p><p>your cloud vendor of choice.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>149</p><p>Some cloud vendors have different instance types for different distributed database</p><p>workloads. For example, some workloads may benefit more from compute-heavy</p><p>instance types, with more compute power than storage capacity. Conversely, storage-</p><p>dense instance types typically feature a higher storage to memory ratio and are often</p><p>used by storage-heavy workloads.</p><p>To complicate things even more, some cloud providers may offer different CPU</p><p>generations for the same instance type. If one CPU generation is considerably slower</p><p>than other nodes, the wrong choice could introduce performance bottlenecks into your</p><p>cluster.</p><p>We have seen some (although rare) scenarios where a noisy neighbor dragged down</p><p>an entire node performance with no reasonable explanation. The lack of visibility and</p><p>control in cloud instances makes it harder to diagnose such situations. Often, you need</p><p>to reach out to your cloud vendor directly to resolve the situation.</p><p>As you start configuring your instance, remember that a cloud environment isn’t</p><p>created exclusively for databases. You have access to a wide range of options, but it can</p><p>be confusing to determine where to start and which options to use. In general, it’s best</p><p>to check with your database vendor on which instance types are recommended for</p><p>deployment. Even better, go beyond that and compare the results of their benchmarks</p><p>against those same instance types running your workload.</p><p>After you have decided on your instance types and deployment options, it’s time to</p><p>think about instance placement. Most clouds will charge you for both inter-region traffic</p><p>and inter-zone traffic, which may quite surprisingly increase the overall networking</p><p>costs. Some companies try to mitigate this cost by placing all instances under a single</p><p>availability zone (AZ), which also carries the risk of potentially having to face a cluster-</p><p>wide outage if/when that AZ goes down. Others opt to ignore the cost aspect and</p><p>deploy their replicas in different AZs to ensure data is properly replicated to an isolated</p><p>environment. Regardless of your instance’s placement of choice, note that some</p><p>database drivers allow clients in specific AZs to route queries only against database</p><p>replicas living in the same availability zone in order to reduce costs. 
Similarly, you will</p><p>also want to ensure that your application clients are located under the same zones as</p><p>your database to minimize your networking costs.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>150</p><p>Fully Managed Database-as-a-Service</p><p>Does the database-as-a-service model help or hurt database performance? It really</p><p>depends on the following:</p><p>• How much attention your database requires to achieve and</p><p>consistently meet your performance expectations</p><p>• Your team’s experience working with the specific database</p><p>you’re using</p><p>• Your team’s time and desire to tinker with that database</p><p>• The level of expertise—especially with respect to performance—that</p><p>your DBaaS provider dedicates to your account</p><p>Managed DBaaS solutions can easily speed up your go-to-market and allow you to</p><p>focus on priorities beyond your database. Most database vendors now provide some</p><p>sort of managed solution. There are even independent companies in the business of</p><p>providing this kind of service for a variety of different distributed databases.</p><p>We have seen many examples where a managed solution helped users succeed, as</p><p>well as numerous complaints over the fact that some managed solutions were rather</p><p>limited. It is not our intention to recommend nor criticize any specific service provider in</p><p>question. Here is some vendor-agnostic advice on things to consider before selecting a</p><p>managed solution:</p><p>• Does the vendor satisfy your existing security requirements? Does it</p><p>provide enough evidence of security certifications issued by a known</p><p>security company?</p><p>• What are the options for observability and how do you export the</p><p>data in question to your monitoring platform of choice?</p><p>• What kind of flexibility do you have with your deployment? What are</p><p>the available tunable options and the support for those within your</p><p>managed solution?</p><p>• Does it allow you to peer traffic from your existing application</p><p>network(s) to your database in a private and secure way?</p><p>• What are the available support options and SLAs?</p><p>Chapter 7 InfrastruCture and deployment models</p><p>151</p><p>• Which deployment options are available, what’s the flexibility among</p><p>switching, and what’s the cost comparison if you were to deploy and</p><p>maintain it on your own?</p><p>• How easy is it for you to export</p><p>your data if you need to move your</p><p>deployment to a different vendor in the future?</p><p>• What, if any, migration options are available and what amount of</p><p>effort do they require?</p><p>These are just some of the many questions and concerns that we’ve frequently</p><p>heard teams asking (or wishing they asked before they got caught in an undesirable</p><p>option). Considering a third-party vendor to manage a relatively critical aspect of your</p><p>infrastructure is very often challenging. However, under the right circumstances and</p><p>vendor-user fit, it can be a great option for reducing your admin burden and optimizing</p><p>your performance.</p><p>Serverless Deployment Models</p><p>Serverless refers to database solutions that offer near-instant scaling up or scaling down</p><p>of database infrastructure—and charge you for the capacity and storage that you actually</p><p>consume.</p><p>A serverless model could theoretically yield a performance advantage. 
Before</p><p>serverless, many organizations faced a tradeoff:</p><p>• (Slightly or generously, depending on your risk tolerance)</p><p>overestimate the capacity they need to guarantee adequate</p><p>performance.</p><p>• Watch performance suffer if their overly-conservative capacity</p><p>estimates proved inadequate.</p><p>Serverless can help in a few different ways and situations.</p><p>First, with variable workloads. Since the database can rapidly scale up as your</p><p>workload increases, you can worry less about performance issues stemming from</p><p>inadequate capacity. If your traffic ebbs and flows across the day/week/month, you</p><p>can spend less during the light periods and dedicate those resources to supporting the</p><p>peak periods. And if your company suddenly experiences “catastrophic success,” you</p><p>don’t have to worry about the headaches associated with needing to suddenly scale</p><p>Chapter 7 InfrastruCture and deployment models</p><p>152</p><p>your infrastructure. If all goes well, the vendor will “automagically” ensure that you’re</p><p>covered, with acceptable performance. You won’t need to procure any additional</p><p>servers, or even contact your cloud provider.</p><p>Serverless is also a great option to consider if you’re working on a new project and</p><p>are not sure what capacity you need to meet performance expectations. It gives you the</p><p>freedom to start fast and scale (or shrink) depending on real-world usage. Database</p><p>sizing is one less thing to worry about. And you don’t need to predict the future.</p><p>Finally, serverless also makes it simpler to justify the spend internally. With this</p><p>model, you can assure your organization that you are never overprovisioned—at least</p><p>not for long. You’re paying for exactly the amount of performance that the database</p><p>vendor determines you need at all times.</p><p>However, a serverless deployment also carries the risk of cost overruns and the</p><p>uncertainty of unpredictable costs. For example, DynamoDB pricing may not be very</p><p>attractive for write-heavy workloads. Similarly, serverless database services may charge</p><p>an arm and a leg (or an eye and a knee) depending on the number of operations per</p><p>second you plan to sustain over an extended period of time. In some cases, it could</p><p>become a double-edged sword from a cost perspective if your goal is to sustain a high-</p><p>throughput performant system at large scale.</p><p>Another aspect to consider when thinking about a serverless solution is whether</p><p>the solution in question is compatible with your existing infrastructure components.</p><p>For example, you’ll want to explore what amount of effort is required to connect your</p><p>message queueing or analytics tool with that specific serverless solution.</p><p>Remember that the overall concept behind serverless is to abstract away the</p><p>underlying infrastructure in such a way that not all database-configurable options are</p><p>available to you. As a result, troubleshooting potential performance problems is often</p><p>more challenging since you might need to rely on your vendor’s input and guidance to</p><p>understand which actions to take. Being serverless also means that you lack visibility</p><p>into whether the infrastructure you consume is shared with other tenants. 
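Returning to the cost-overrun risk mentioned above, a quick throughput-driven model makes it easy to sanity-check whether pay-per-operation pricing stays reasonable at your scale. Every unit price in this sketch is a made-up placeholder rather than any real vendor's rate; the point is simply that sustained operations per second dominate the bill.

```python
# Back-of-the-envelope cost model for a pay-per-request serverless database.
# All unit prices here are hypothetical placeholders; plug in your vendor's
# actual pricing before drawing any conclusions.

SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_request_cost(reads_per_sec: float,
                         writes_per_sec: float,
                         price_per_million_reads: float = 0.25,   # placeholder
                         price_per_million_writes: float = 1.25   # placeholder
                         ) -> float:
    reads = reads_per_sec * SECONDS_PER_MONTH
    writes = writes_per_sec * SECONDS_PER_MONTH
    return (reads / 1e6) * price_per_million_reads + \
           (writes / 1e6) * price_per_million_writes

# Example: a sustained 50k reads/s + 20k writes/s workload
print(f"${monthly_request_cost(50_000, 20_000):,.0f} per month")
```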
Many</p><p>distributed database vendors may also offer you different pricing tiers for shared and</p><p>dedicated environments.</p><p>Containerization andKubernetes</p><p>Containers and Kubernetes are now ubiquitous, even for stateful systems like databases.</p><p>Should you use them? Probably—unless you have a good reason not to.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>153</p><p>But be aware that there is a performance penalty for the operational convenience</p><p>of using containers. This is to be expected because of the extra layer of abstraction (the</p><p>container itself), relaxation of resource isolation, and increased context switches. The</p><p>good news is that it can certainly be overcome. In our testing using ScyllaDB, we found</p><p>it is possible to take what was originally a 69 percent reduction in peak throughput down</p><p>to a 3 percent performance penalty.6</p><p>Here’s the TL;DR on that specific experiment:</p><p>Containerizing applications is not free. In particular, processes</p><p>comprising the containers have to be run in Linux cgroups and</p><p>the container receives a virtualized view of the network. Still,</p><p>the biggest cost of running a close-to-hardware, thread-per-core</p><p>application like ScyllaDB inside a Docker container comes from</p><p>the opportunity cost of having to disable most of the performance</p><p>optimizations that the database employs in VM and bare-metal</p><p>environments to enable it to run in potentially shared and</p><p>overcommitted platforms.</p><p>The best results with Docker are obtained when resources</p><p>are statically partitioned and we can bring back bare-metal</p><p>optimizations like CPU pinning and interrupt isolation. There is</p><p>only a 10 percent performance penalty in this case as compared</p><p>to the underlying platform—a penalty that is mostly attributed to</p><p>the network virtualization. Docker allows users to expose the host</p><p>network directly for specialized deployments. In cases in which this</p><p>is possible, we saw that the performance difference compared to the</p><p>underlying platform falls down to 3 percent.</p><p>Of course, the potential penalty and strategies for mitigating will vary from database</p><p>to database. But the key takeaway is that there is likely a significant performance</p><p>penalty—so be sure to hunt it down and research how to mitigate it. Some common</p><p>mitigation strategies include:</p><p>6 See “The Cost of Containerization for Your ScyllaDB” on the ScyllaDB blog (https://www.</p><p>scylladb.com/2018/08/09/cost-containerization-scylla/).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>154</p><p>• Ensure that your containers have direct access to the database’s</p><p>underlying storage.</p><p>• Expose the host OS network to the container in order to avoid the</p><p>performance penalty due to its network virtualization layer.</p><p>• Allocate enough resources to the container in question, and ensure</p><p>these are not overcommitted with other containers or processes</p><p>running within the underlying host OS.</p><p>Kubernetes adds yet another virtualization layer—and thus opens the door to yet</p><p>another layer of performance issues, as well as different strategies for mitigating them.</p><p>First off, if you have the choice of multiple options for deploying and managing database</p><p>clusters on Kubernetes, test them out with an eye on performance. 
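Stepping back to plain containers for a moment, here is what the mitigation list above can look like with the Docker SDK for Python (the docker package). The image name, core range, and host path are illustrative assumptions, and your database vendor's documentation remains the authority on recommended container settings.

```python
# Sketch: running a database container with host networking, pinned CPUs,
# and direct access to its storage directory. Image, cores, and paths are
# examples only; adjust to your vendor's guidance.

import docker

client = docker.from_env()

container = client.containers.run(
    "scylladb/scylla:latest",        # example image
    detach=True,
    network_mode="host",             # avoid the network virtualization penalty
    cpuset_cpus="1-7",               # statically partition cores for the database
    volumes={"/var/lib/scylla": {    # direct access to the underlying storage
        "bind": "/var/lib/scylla", "mode": "rw"}},
)
print(container.short_id)
```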
Once you settle</p><p>on the best fit for your needs, dive into the configuration options that could impact</p><p>performance. Here are some performance tips that cross databases:</p><p>• Consider dedicating specific and independent Kubernetes nodes for</p><p>your database workloads and use affinities in order to configure their</p><p>placement.</p><p>• Enable hostNetworking and be sure to set up the required kernel</p><p>parameters as recommended by your vendor (for example, fs.</p><p>aio-max-nr for increasing the number of events available for</p><p>asynchronous I/O processing in the Linux kernel).</p><p>• Ensure that your database pods have a Guaranteed QoS class7 to</p><p>avoid other pods from potentially hurting your main</p><p>workload.</p><p>• Be sure to use an operator8 in order to orchestrate and control the</p><p>lifecycle of your existing Kubernetes database cluster. For example,</p><p>ScyllaDB has its ScyllaDB Operator project.</p><p>7 For more detail, see “Create a Pod that Gets Assigned a QoS Class of Guaranteed” in the</p><p>Kubernetes docs (https://kubernetes.io/docs/tasks/configure-pod-container/</p><p>quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed).</p><p>8 For more detail, see “Operator Pattern” in the Kubernetes docs https://kubernetes.io/docs/</p><p>concepts/extend-kubernetes/operator/.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>155</p><p>Summary</p><p>This chapter kicked off the final part of this book, focused on sharing recommendations</p><p>for getting better performance out of your database deployment. It looked at</p><p>infrastructure and deployment model considerations that are important to understand</p><p>whether you’re managing your own deployment or opting for a database-as-a-service</p><p>(maybe serverless) deployment model. The next chapter looks at performance</p><p>considerations relevant to the topology itself: replication, geographic distribution,</p><p>scaling up and/or out, and intermediaries like external caches, load balancers, and</p><p>abstraction layers.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>157</p><p>CHAPTER 8</p><p>Topology Considerations</p><p>As mentioned in Chapter 5, database servers are often combined into intricate</p><p>topologies where certain nodes are grouped in a single geographical location; others</p><p>are used only as a fast cache layer, and yet others store seldom-accessed cold data in a</p><p>cheap place, for emergency purposes only. 
That chapter covered how drivers work to</p><p>understand and interact with that topology to exchange information more efficiently.</p><p>This chapter focuses on the topology in and of itself. How is data replicated across</p><p>geographies and datacenters? What are the risks and alternatives to taking the common</p><p>NoSQL practice of scaling out to the extreme? And what about intermediaries to your</p><p>database servers—for example, external caches, load balancers, and abstraction layers?</p><p>Performance implications of all this and more are all covered here.1</p><p>Replication Strategy</p><p>First, let’s look at replication, which is how your data will be spread to other replicas</p><p>across your cluster.</p><p>Note If you want a quick introduction to the concept of replication, see</p><p>Appendix A.</p><p>Having more replicas will slow your writes (since every write must be duplicated</p><p>to replicas), but it could accelerate your reads (since more replicas will be available for</p><p>serving the same dataset). It will also allow you to maintain operations and avoid data</p><p>1 This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.</p><p>com/blog/), ScyllaDB Documentation (https://docs.scylladb.com/stable/), the ScyllaDB</p><p>whitepaper “Why Scaling Up Beats Scaling Out for NoSQL” (https://lp.scylladb.com/</p><p>whitepaper-scaling-up-vs-scaling-out-offer.html), and an article that ScyllaDB co-founder</p><p>and CEO Dor Laor wrote for The New Stack. It is used here with permission of ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_8</p><p>158</p><p>loss in the event of node failures. Additionally, replicating data to get closer to your</p><p>application and closer to your users will reduce latency, especially if your application has</p><p>a highly geographically-distributed user base.</p><p>A replication factor (RF) of 1 means there is only one copy of a row in a cluster, and</p><p>there is no way to recover the data if the node is compromised or goes down (other than</p><p>restoring from a backup). An RF of 2 means that there are two copies of a row in a cluster.</p><p>An RF of at least three is used in most systems. This allows you to write and read with</p><p>strong consistency, as a quorum of replicas will be achieved, even if one node is down.</p><p>Many databases also let you fine-tune replication settings at the regional level. For</p><p>example, you could have three replicas in a heavily used region, but only two in a less</p><p>popular region.</p><p>Note that replicating data across multiple regions (as Bigtable recommends as a</p><p>safeguard against both availability zone failure and regional failure) can be expensive.</p><p>Before you set this up, understand the cost of replicating data between regions.</p><p>If you’re working with DynamoDB, you create tables (not clusters), and AWS</p><p>manages the replication for you as soon as you set a table to be Global. One notable</p><p>drawback of DynamoDB global tables is that transactions are not supported across</p><p>regions, which may be a limiting factor for some use cases.</p><p>Rack Configuration</p><p>If all your nodes are in the same datacenter, how do you configure their placement? The</p><p>rule of thumb here is to have as many racks as you have replicas. 
For example, if you</p><p>have a replication factor of three, run it in three racks. That way, even if an entire rack</p><p>goes down, you can still continue to satisfy read and write requests to a majority of your</p><p>replicas. Performance might degrade a bit since you have lost roughly 33 percent of your</p><p>infrastructure (considering a total zone/rack outage), but overall you’ll still be up and</p><p>running. Conversely, if you have three replicas distributed across two racks, then losing</p><p>a rack may potentially affect two out of the three natural endpoints for part of your data.</p><p>That’s a showstopper if your use case requires strongly consistent reads/writes.</p><p>Multi-Region or Global Replication</p><p>By placing your database servers close to your users, you lower the network latency. You</p><p>can also improve availability and insulate your business from regional outages.</p><p>ChApter 8 topology ConsIderAtIons</p><p>159</p><p>If you do have multiple datacenters, ensure that—unless otherwise required by the</p><p>business—reads and writes use a consistency level that is confined to replicas within</p><p>a specific datacenter. This approach avoids introducing a latency hit by instructing the</p><p>database to only select local replicas (under the same region) for achieving your required</p><p>consistency level. Also, ensure that each application client knows what datacenter is</p><p>considered its local one; it should prioritize that local one for connections and requests,</p><p>although it may also have a fallback strategy just in case that datacenter goes down.</p><p>Note that application clients may or may not be aware of the multi-datacenter</p><p>deployment, and it is up to the application developer to decide on the awareness to</p><p>fallback across regions. Although different settings and load balancing profiles exist</p><p>through a variety of database drivers, the general concept for an application to failover</p><p>to a different region in the event of a local failure may often break application semantics.</p><p>As a result, its reaction upon a failure must be handled directly by the application</p><p>developer.</p><p>Multi-Availability Zones vs. Multi-Region</p><p>To mitigate a possible server or rack failure, cloud vendors offer (and recommend) a</p><p>multi-zone deployment. Think about it as if you have a datacenter</p><p>at your fingertips</p><p>where you can deploy each server instance in its own rack, using its own power, top-of-</p><p>rack switch, and cooling system. Such a deployment will be bulletproof for any single</p><p>system or zonal failure, since each rack is self-contained. The availability zones are still</p><p>located in the same region. However, a specific zone failure won’t affect another zone’s</p><p>deployed instances.</p><p>For example, on Google Compute Engine, the us-east1-b, us-east1-c, and us-</p><p>east1- d availability zones are located in the us-east1 region (Moncks Corner, South</p><p>Carolina, USA). But each availability zone is self-contained. Network latency between</p><p>AZs in the same region is negligible for the purpose of this discussion.</p><p>In short, both multi-zone and multi-region deployments help with business</p><p>continuity and disaster recovery respectively, but multi-region has the additional benefit</p><p>of minimizing local application latencies in those local regions. 
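To tie together the advice above about per-region replication factors, datacenter-local consistency levels, and datacenter-aware clients, here is a sketch using the Python driver for Cassandra-compatible databases. The contact points, datacenter names, keyspace, and replication factors are placeholders for illustration.

```python
# Sketch: per-datacenter replication plus DC-local routing and consistency.
# Contact points, datacenter names, and RFs below are illustrative.

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra import ConsistencyLevel
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    # Prefer replicas in the client's local datacenter...
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="us-east")),
    # ...and keep the consistency level confined to that datacenter.
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

cluster = Cluster(["10.0.0.1", "10.0.0.2"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# Three replicas in the heavily used region, two in the less popular one.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app WITH replication = {
        'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west': 2
    }
""")
```

With a profile like this, each client instance routes requests to replicas in its own region and waits only on a local quorum, which is exactly the multi-region behavior described above.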
It might come at a cost</p><p>though: cross-region data replication costs need to be considered for multi-regional</p><p>topologies.</p><p>Note that multi-zonal deployments will similarly charge you for inter-zone</p><p>replication. Although it is perfectly possible to have a single zone deployment for your</p><p>ChApter 8 topology ConsIderAtIons</p><p>160</p><p>database, it is often not a recommended approach because it will effectively be exposed</p><p>as a single point of failure toward your infrastructure. The choice here is quite simple:</p><p>Do you want to reduce costs as much as possible and risk potential unavailability, or</p><p>do you want to guarantee high availability in a single region at the expense of network</p><p>replication costs?</p><p>Scaling Upvs Scaling Out</p><p>Is it better to have a larger number of smaller (read, “less powerful”) nodes or a smaller</p><p>number of larger nodes? We recommend aiming for the most powerful nodes and</p><p>smallest clusters that meet your high availability and resiliency goals—but only if your</p><p>database can truly take advantage of the power added by the larger nodes.</p><p>Let’s unravel that a bit. For over a decade, NoSQL’s promise has been enabling</p><p>massive horizontal scalability with relatively inexpensive commodity hardware. This</p><p>has allowed organizations to deploy architectures that would have been prohibitively</p><p>expensive and impossible to scale using traditional relational database systems.</p><p>Over that same decade, “commodity hardware” has also undergone a</p><p>transformation. But not all databases take advantage of modern computing resources.</p><p>Many aren’t architected to take advantage of the resources offered by large nodes, such</p><p>as the added CPU, memory, and solid-state drives (SSDs), nor can they store large</p><p>amounts of data on disk efficiently. Managed runtimes, like Java, are further constrained</p><p>by heap size. Multi-threaded code, with its locking and context-switches overhead and</p><p>lack of attention for Non-Uniform Memory Architecture (NUMA), imposes a significant</p><p>performance penalty against modern hardware architectures.</p><p>If your database is in this group, you might find that scaling up quickly brings you to</p><p>a point of diminishing returns. But even then, it’s best to max out your vertical scaling</p><p>potential before you shift to horizontal scaling.</p><p>A focus on horizontal scaling results in system sprawl, which equates to operational</p><p>overhead, with a far larger footprint to keep managed and secure. Server sprawl</p><p>also introduces more network overhead to distributed systems due to the constant</p><p>replication and health checks done by every single node in your cluster. Although most</p><p>vendors claim that scaling out will bring you linear performance, some others are more</p><p>conservative and state that it will bring you “near to linear performance.” For example,</p><p>ChApter 8 topology ConsIderAtIons</p><p>161</p><p>Cassandra Production Guidelines2 do not recommend clusters larger than 50 nodes</p><p>using the default number of 16 vNodes per instance because it may result in decreased</p><p>availability.</p><p>Moreover, there are quite a few advantages to using large, powerful nodes.</p><p>• Less noisy neighbors: On cloud platforms, multi-tenancy is</p><p>the norm. A cloud platform is, by definition, based on shared</p><p>network bandwidth, I/O, memory, storage, and so on. 
As a result,</p><p>a deployment of many small nodes is susceptible to the “noisy</p><p>neighbor” effect. This effect is experienced when one application</p><p>or virtual machine consumes more than its fair share of available</p><p>resources. As nodes increase in size, fewer and fewer resources</p><p>are shared among tenants. In fact, beyond a certain size, your</p><p>applications are likely to be the only tenant on the physical machines</p><p>on which your system is deployed.</p><p>• Fewer failures: Since large and small nodes fail at roughly the</p><p>same rate, large nodes deliver a higher mean time between failures</p><p>(MTBF) than small nodes. Failures in the data layer require operator</p><p>intervention, and restoring a large node requires the same amount of</p><p>human effort as a small one. In a cluster of a thousand nodes, you’ll</p><p>likely see failures every day—and this magnifies administrative costs.</p><p>• Datacenter density: Many organizations with on-premises</p><p>datacenters are seeking to increase density by consolidating</p><p>servers into fewer, larger boxes with more computing resources per</p><p>server. Small clusters of large nodes help this process by efficiently</p><p>consuming denser resources, in turn decreasing energy and</p><p>operating costs.</p><p>• Operational simplicity: Big clusters of small instances demand</p><p>more attention, and generate more alerts, than small clusters of large</p><p>instances. All of those small nodes multiply the effort of real-time</p><p>monitoring and periodic maintenance, such as rolling upgrades.</p><p>2 See https://cassandra.apache.org/doc/latest/cassandra/getting_started/</p><p>production.html.</p><p>ChApter 8 topology ConsIderAtIons</p><p>162</p><p>Some architects are concerned that putting more data on fewer nodes increases</p><p>the risks associated with outages and data loss. You can think of this as the “big basket”</p><p>problem. It may seem intuitive that storing all of your data on a few large nodes makes</p><p>them more vulnerable to outages, like putting all of your eggs in one basket. But this</p><p>doesn’t necessarily hold true. Modern databases use a number of techniques to ensure</p><p>availability while also accelerating recovery from failures, making big nodes both safer</p><p>and more economical. For example, consider capabilities that reduce the time required</p><p>to add and replace nodes and internal load balancing mechanisms to minimize the</p><p>throughput or latency impact across database restarts.3</p><p>Workload Isolation</p><p>Many teams find themselves in a position where they need to run multiple different</p><p>workloads against the database. It is often compelling to aggregate different workloads</p><p>under a single cluster, especially when they need to work on the exact same dataset.</p><p>Keeping several workloads together under a single cluster can also reduce costs. But, it’s</p><p>essential to avoid resource contention when implementing latency-critical workloads.</p><p>Failure to do so may introduce hard-to-diagnose performance situations, where one</p><p>misbehaving workload ends up dragging down the entire cluster’s performance.</p><p>There are many ways to accomplish workload isolation to minimize the resource</p><p>contention that could occur when running multiple workloads on a single cluster. Here</p><p>are a few that work well. 
Keep in mind that the best approach depends on your existing</p><p>database’s available options, as well as your use case’s requirements:</p><p>• Physical isolation: This setup is often used to entirely isolate one</p><p>workload from another. It involves essentially extending your</p><p>deployment to an additional region (which may be physically the</p><p>same as your existing one, but logically different on the database</p><p>side). As a result, the workloads are split to replicate data to another</p><p>3 ScyllaDB Heat Weighted Load Balancing provides a smarter request redistribution</p><p>algorithm</p><p>based on the cache hit ratio of nodes in the cluster. Learn more at www.scylladb.</p><p>com/2017/09/21/scylla-heat-weighted-load-balancing/.</p><p>ChApter 8 topology ConsIderAtIons</p><p>163</p><p>location, but queries are executed only within a particular location—</p><p>in such a way that a performance bottleneck in one workload won’t</p><p>degrade or bottleneck the other. Note that a downside of this solution</p><p>is that your infrastructure costs double.</p><p>• Logical isolation: Some databases or deployment options allow</p><p>you to logically isolate workloads without needing to increase your</p><p>infrastructure resources. For example, ScyllaDB has a workload</p><p>prioritization feature where you can assign different weights for</p><p>specific workloads to help the database understand which workload</p><p>you want it to prioritize in the event of system contention. If your</p><p>database does not offer such a feature, you may still be able to</p><p>run two or more workloads in parallel, but watch out for potential</p><p>contentions in your database.</p><p>• Scheduled isolation: Many times, you might need to simply run</p><p>batched scheduled jobs at specified intervals in order to support</p><p>other business-related activities, such as extracting analytics</p><p>reports. In those cases, consider running the workload in question</p><p>at low-peak periods (if any exist), and experiment with different</p><p>concurrency settings in order to avoid impairing the latency of the</p><p>primary workload that’s running alongside it.</p><p>More onWorkload Prioritization forLogical Isolation</p><p>ScyllaDB users sometimes use workload prioritization to balance OLAP and OLTP</p><p>workloads. The goal is to ensure that each defined task has a fair share of system</p><p>resources so that no single job monopolizes system resources, starving other jobs of their</p><p>needed minimums to continue operations.</p><p>In Figure8-1, note that the latency for both workloads nearly converges. OLTP</p><p>processing began at or below 2ms P99 latency up until the OLAP job began at 12:15.</p><p>When the OLAP workload kicked in, OLTP P99 latencies shot up to 8ms, then further</p><p>degraded, plateauing around 11–12ms until the OLAP job terminated after 12:26.</p><p>ChApter 8 topology ConsIderAtIons</p><p>164</p><p>Figure 8-1. Latency between OLTP and OLAP workloads on the same cluster</p><p>before enabling workload prioritization</p><p>These latencies are approximately six times greater than when OLTP ran by itself.</p><p>(OLAP latencies hover between 12–14ms, but, again, OLAP is not latency-sensitive).</p><p>Figure 8-2 shows that the throughput on OLTP sinks from around 60,000 OPS to</p><p>half that—30,000 OPS.You can see the reason why. OLAP, being throughput hungry, is</p><p>maintaining roughly 260,000 OPS.</p><p>ChApter 8 topology ConsIderAtIons</p><p>165</p><p>Figure 8-2. 
Comparative throughput results for OLTP and OLAP on the same</p><p>cluster without workload prioritization enabled</p><p>Ultimately, OLTP suffers with respect to both latency and throughput, and users</p><p>experience slower response times. In many real-world conditions, such OLTP responses</p><p>would violate a customer’s SLA.</p><p>Figure 8-3 shows the latencies after workload prioritization is enabled. You can see</p><p>that the OLTP workload similarly starts out at sub-millisecond to 2ms P99 latencies.</p><p>Once an OLAP workload is added, OLTP processing performance degrades, but with</p><p>P99 latencies hovering between 4–7ms (about half of the 11–12ms P99 latencies when</p><p>workload prioritization was not enabled).</p><p>ChApter 8 topology ConsIderAtIons</p><p>166</p><p>Figure 8-3. OLTP and OLAP latencies with workload prioritization enabled</p><p>It is important to note that once system contention kicks in, the OLTP latencies</p><p>are still somewhat impacted—just not to the same extent they were prior to workload</p><p>prioritization. If your real-time workload requires ultra-constant single-digit millisecond</p><p>or lower P99 latencies, then we strongly recommend that you avoid introducing any form</p><p>of contention.</p><p>The OLAP workload, not being as latency-sensitive, has P99 latencies that hover</p><p>between 25–65ms. These are much higher latencies than before—the tradeoff for</p><p>keeping the OLTP latencies lower.</p><p>Throughput wise, Figure8-4 shows that the OLTP traffic is a smooth 60,000 OPS until</p><p>the OLAP load is also enabled.</p><p>ChApter 8 topology ConsIderAtIons</p><p>167</p><p>Figure 8-4. OLTP and OLAP load throughput with workload</p><p>prioritization enabled</p><p>It does dip in performance at that point, but only slightly, hovering between 54,000 to</p><p>58,000 OPS. That is only a 3–10 percent drop in throughput. The OLAP workload, for its</p><p>part, hovers between 215,000–250,000 OPS.That is a drop of 4–18 percent, which means</p><p>an OLAP workload would take longer to complete. Both workloads suffer degradation, as</p><p>would be expected for an overloaded cluster, but neither to a crippling degree.</p><p>Abstraction Layers</p><p>It’s becoming fairly common for teams to write an abstraction layer on top of their</p><p>databases. Instead of calling the database’s APIs directly, the applications connect to this</p><p>database-agnostic abstraction layer, which then manages the logistics of connecting to</p><p>the database.</p><p>There are usually a few main motives behind this move:</p><p>• Portability: If the team wants to move to another database, they</p><p>won’t need to modify their applications and queries. However, the</p><p>team responsible for the abstraction layer will need to modify that</p><p>code, which could turn out to be more complicated.</p><p>ChApter 8 topology ConsIderAtIons</p><p>168</p><p>• Developer simplicity: Developers don’t need to worry about the</p><p>inner details of working with any particular database. This can make</p><p>it easier for people to move around from team to team.</p><p>• Scalability: An abstraction layer can be easier to containerize. If the</p><p>API gets overloaded, it’s usually easier to scale out more containers in</p><p>Kubernetes than to spin off more containers of the database itself.</p><p>• Customer-facing APIs: Exposing the database directly to end-users</p><p>is typically not a good idea. 
As a result, many companies expose</p><p>common endpoints, which are eventually translated into actual</p><p>database queries. As a result, the abstraction layer can shed requests,</p><p>limit concurrency across tenants, and perform auditability and</p><p>accountability over its calls.</p><p>But, there’s definitely a potential for a performance penalty that is highly dependent</p><p>on how efficiently the layer was implemented. An abstraction layer that was fastidiously</p><p>implemented by a team of masterful Rust engineers is likely to have a much more</p><p>negligible impact than a Java or Python one cobbled together as a quick side project. If</p><p>you decide to take this route, be sure that the layer is developed with performance in</p><p>mind, and that you carefully measure its impact via both benchmarking and ongoing</p><p>monitoring. Remember that every application database communication is going to</p><p>use this layer, so a small inefficiency can quickly snowball into a significant performance</p><p>problem.</p><p>For example, we once saw a customer report an elevated latency situation coming</p><p>from their Golang abstraction layer. Once we realized that the latency on the database</p><p>side was within bounds for its use case, the investigation shifted from the database over</p><p>to the network and client side. Long story short, the application latency spikes were</p><p>identified as being heavily affected during the garbage collection process, dragging down</p><p>the client-side performance significantly. The problem was resolved after scaling out the</p><p>number of clients and by ensuring that they had enough compute resources to properly</p><p>function.</p><p>And another example: When working with a customer through a PostgreSQL to</p><p>NoSQL migration, we realized that their clients were fairly often opening too many</p><p>concurrent connections against the database. Although having a high number of sockets</p><p>opened is typically a good idea for distributed systems, an extremely high number of</p><p>them can easily overwhelm the client side (which needs to keep track of all open sockets)</p><p>ChApter 8 topology ConsIderAtIons</p><p>169</p><p>as well as the database. After we reported our findings to the customer, they discovered</p><p>that they were opening a new database session for every request they submitted against</p><p>the cluster. After correcting the malfunctioning code, the overall application throughput</p><p>was significantly increased because the abstraction layer was then using active sockets</p><p>opened when it routed requests.4</p><p>Load Balancing</p><p>Should you put a dedicated load balancer in front of your database? In most cases, no.</p><p>Databases typically have their own way to balance traffic across the cluster, so layering</p><p>a load balancer on top of that won’t help—and it could actually hurt. Consider 1) how</p><p>many requests the load balancer can serve without becoming a bottleneck and 2) its</p><p>balancing policy. Also, recognize that it introduces a single point of failure that reduces</p><p>your database resilience. As a result, you overcomplicate your overall infrastructure</p><p>topology because you now need to worry about load balancing high availability.</p><p>Of course, there are always exceptions. 
For example, say you were previously using</p><p>a database API that’s unaware of the layout of the cluster and its individual nodes</p><p>(e.g., DynamoDB, where a client is configured with a single “endpoint address” and</p><p>all requests are sent to it). Now you’re shifting to a distributed leaderless database like</p><p>ScyllaDB, where clients are node aware and even token aware (aware of which token</p><p>ranges are natural endpoints for every node in your topology). If you simply configure</p><p>an application with the IP address of a single ScyllaDB node as its single DynamoDB</p><p>API endpoint address, the application will work correctly. After all, any node can answer</p><p>any request by forwarding it to other nodes as necessary. However, this single node will</p><p>be more loaded than the other nodes because it will be the only node actively serving</p><p>requests. This node will also become a single point of failure from your application’s</p><p>perspective.</p><p>In this special edge case, load balancing is critical—but load balancers are</p><p>not. Server-side load balancing is fairly complex from an admin perspective. More</p><p>importantly with respect to performance, server-side solutions add latency. Solutions</p><p>that involve a TCP or HTTP load balancer require another hop for each</p><p>4 Learn about abstraction layer usage at Discord in “How Discord Migrated Trillions of Messages</p><p>from Cassandra to ScyllaDB “(www.youtube.com/watch?v=S2xmFOAUhsk) and ShareChat</p><p>in “ShareChat’s Path to High-Performance NoSQL with ScyllaDB” (www.youtube.com/</p><p>watch?v=Y2yHv8iqigA).</p><p>ChApter 8 topology ConsIderAtIons</p><p>170</p><p>request—increasing not just the cost of each request, but also its latency. We recommend</p><p>client- side load balancing: Modifying the application to send requests to the available</p><p>nodes versus a single one.</p><p>The key takeaway here is that load balancing generally isn’t needed—and when it</p><p>is, server-side load balancers yield a pretty severe performance penalty. If it’s absolutely</p><p>necessary, client-side load balancing is likely a better option.5</p><p>External Caches</p><p>Teams often consider external caches when the existing database cluster cannot meet</p><p>the required SLA.This is a clear performance-oriented decision. Putting an external</p><p>cache in front of the database is commonly used to compensate for subpar latency</p><p>stemming from the various factors discussed throughout this book: inefficient database</p><p>internals, driver usage, infrastructure choices, traffic spikes, and so on.</p><p>Caching may seem like a fast and easy solution because the deployment can be</p><p>implemented without tremendous hassle and without incurring the significant cost of</p><p>database scaling, database schema redesign, or even a deeper technology transformation.</p><p>However, external caches are not as simple as they are often made out to be. In fact, they</p><p>can be one of the more problematic components of a distributed application architecture.</p><p>In some cases, it’s a necessary evil—particularly if you have an ultra-latency-sensitive</p><p>use case such as real-time ad bidding or streaming media, and you’ve tried all the other</p><p>means of reducing latency. But in many cases, the performance boost isn’t worth it. 
The</p><p>following sections outline some key risks and you can decide what makes sense for your</p><p>use case and SLAs.</p><p>An External Cache Adds Latency</p><p>A separate cache means another hop on the way. When a cache surrounds the database,</p><p>the first access occurs at the cache layer. If the data isn’t in the cache, then the request is</p><p>sent to the database. The result is additional latency to an already slow path of uncached</p><p>data. One may claim that when the entire dataset fits the cache, the additional latency</p><p>doesn’t come into play. However, there is usually more than a single workload/pattern</p><p>that hits the database and some of it will carry the extra hop cost.</p><p>5 For an example of how to implement client-side load balancing, see www.scylladb.</p><p>com/2021/04/13/load-balancing-in-scylla-alternator/.</p><p>ChApter 8 topology ConsIderAtIons</p><p>171</p><p>An External Cache Is anAdditional Cost</p><p>Caching means expensive DRAM, which translates to a higher cost per gigabyte than</p><p>SSDs. Even when RAM can store frequently accessed objects, it is best to use the existing</p><p>database RAM, and even increase it for internal caching rather than provision entirely</p><p>separate infrastructure on RAM-oriented instances. Provisioning a cache to be the</p><p>same size as the entire persistent dataset may be prohibitively expensive. In other cases,</p><p>the working set size can be too big, often reaching petabytes, making an SSD-friendly</p><p>implementation the preferred, and cheaper, option.</p><p>External Caching Decreases Availability</p><p>No cache’s high availability solution can match that of the database itself. Modern</p><p>distributed databases have multiple replicas; they also are topology-aware and speed-</p><p>aware and can sustain multiple failures without data loss.</p><p>For example, a common replication pattern is three local replicas, which generally</p><p>allows for reads to be balanced across such replicas in order to efficiently use your</p><p>database’s internal caching mechanism. Consider a nine-node cluster with a replication</p><p>factor of three: Essentially every node will hold roughly 33 percent of your total dataset</p><p>size. As requests are balanced among different replicas, this grants you more room</p><p>for caching your data, which could (potentially) completely eliminate the need for an</p><p>external cache. Conversely, if an external cache happens to invalidate entries right before</p><p>a surge of cold requests, availability could be impeded for a while since the database</p><p>won’t have that data in its internal cache (more on this in the section entitled “External</p><p>Caching Ruins the Database Caching” later in this chapter).</p><p>Caches often lack high-availability properties and can easily fail or invalidate records</p><p>depending on their heuristics. Partial failures, which are more common, are even worse</p><p>in terms of consistency. When the cache inevitably fails, the database will get hit by the</p><p>unmitigated firehose of queries and likely wreck your SLAs. In addition, even if a cache</p><p>itself has some high availability features, it can’t coordinate handling such failure with</p><p>the persistent database it is in front of. 
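To see how these risks show up in code, consider the typical cache-aside read path, sketched below with generic stand-in client objects (the cache and database handles, key format, and TTL are all hypothetical). Every miss pays for both hops, and keeping the cached copy consistent with the database becomes the application's job.

```python
# Minimal cache-aside sketch illustrating the extra hop an external cache adds.
# 'cache' and 'db' are stand-in clients, not any specific product's API.

def get_user(user_id, cache, db, ttl_seconds=60):
    key = f"user:{user_id}"
    value = cache.get(key)               # hop 1: external cache
    if value is not None:
        return value                     # hit: served from the cache
    value = db.read_user(user_id)        # hop 2: miss falls through to the database
    cache.set(key, value, ttl_seconds)   # the application now owns invalidation
    return value
```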
The bottom line: Rely on the database, rather than making your latency SLAs dependent on a cache.

Application Complexity: Your Application Needs to Handle More Cases

Application and operational complexity are problems for external caches. Once you have an external cache, you need to keep the cache up-to-date with the client and the database. For instance, if your database runs repairs, the cache needs to be synced or invalidated. However, invalidating the cache may introduce a long period of time when you need to wait for it to eventually get warm. Your client retry and timeout policies need to match the properties of the cache, but also need to keep functioning when the cache is down. Such scenarios are usually hard to test and implement.

External Caching Ruins the Database Caching

Modern databases have embedded caches and complex policies to manage them. When you place a cache in front of the database, most read requests will reach only the external cache and the database won't keep these objects in its memory. As a result, the database cache is rendered ineffective. When requests eventually reach the database, its cache will be cold and the responses will come primarily from the disk. As a result, the round-trip from the cache to the database and then back to the application is likely to incur additional latency.

External Caching Might Increase Security Risks

An external cache adds a whole new attack surface to your infrastructure. Encryption, isolation, and access control on data placed in the cache are likely to be different from the ones at the database layer itself.

External Caching Ignores the Database Knowledge and Database Resources

Databases are quite complex and built for specialized I/O workloads on the system. Many of the queries access the same data, and some amount of the working set size can be cached in memory in order to save disk accesses. A good database should have sophisticated logic to decide which objects, indexes, and accesses it should cache. The database also should have various eviction policies (such as the least recently used [LRU] policy, as a straightforward example) that determine when new data should replace existing (older) cached objects.

Another example is scan-resistant caching. When scanning a large dataset, say a large range or a full-table scan, a lot of objects are read from the disk. The database can realize this is a scan (not a regular query) and choose to leave these objects outside its internal cache. However, an external cache would treat the result set just like any other and attempt to cache the results. The database automatically synchronizes the content of the cache with the disk according to the incoming request rate, and thus the user and the developer do not need to do anything to make sure that lookups to recently written data are performant.
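To make the eviction and scan-resistance ideas concrete, here is a deliberately simplified sketch of an LRU cache with a scan-aware admission hint. It is a toy illustration under our own assumptions, not how any particular database implements its internal cache:

```python
from collections import OrderedDict

class ScanResistantLRUCache:
    """Toy cache: least-recently-used eviction plus a scan-resistance hint."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest (least recently used) entries first

    def get(self, key):
        value = self.entries.get(key)
        if value is not None:
            self.entries.move_to_end(key)  # a hit makes the entry "most recently used"
        return value

    def put(self, key, value, from_scan=False):
        if from_scan and key not in self.entries:
            return  # scan-resistant: don't let a one-off full scan evict hot entries
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

A production database cache works at a very different scale and granularity, but the same principles (recency-based eviction, and refusing to let scans pollute the cache) are exactly what an external cache cannot replicate on the database's behalf.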
Therefore, if, for some reason, your database doesn’t respond fast</p><p>enough, it means that:</p><p>• The cache is misconfigured</p><p>• It doesn’t have enough RAM for caching</p><p>• The working set size and request pattern don’t fit the cache</p><p>• The database cache implementation is poor</p><p>Summary</p><p>This chapter shared strong opinions on how to navigate topology decisions. For example,</p><p>we recommended:</p><p>• Using an RF of at least 3 (with geographical fine-tuning if available)</p><p>• Having as many racks as replicas</p><p>• Isolating reads and writes within a specific datacenter</p><p>• Ensuring each client knows and prioritizes the local datacenter</p><p>• Considering the (cross-region replication) costs of multi-region</p><p>deployments as well as their benefits</p><p>• Scaling up as much as possible before scaling out</p><p>ChApter 8 topology ConsIderAtIons</p><p>174</p><p>• Considering a few different options to minimize the resource</p><p>contention that could occur when running multiple workloads on a</p><p>single cluster</p><p>• Carefully considering the caveats associated with external caches,</p><p>load balancers, and abstraction layers</p><p>The next chapter looks at best practices for testing your topology: Benchmarking it to</p><p>see what it’s capable of and how it compares to alternative configurations and solutions.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChApter 8 topology ConsIderAtIons</p><p>175</p><p>CHAPTER 9</p><p>Benchmarking</p><p>We won’t sugarcoat it: database benchmarking is hard. There are many moving parts</p><p>and nuances to consider and manage—and a bit of homework is required to really see</p><p>what a database is capable of and measure it properly. It’s not easy to properly generate</p><p>system load to reflect your real-life scenarios.1 It’s often not obvious how to correctly</p><p>measure and analyze the end results. And after extracting benchmarking results, you</p><p>need to be able to read them, understand potential performance bottlenecks, analyze</p><p>potential performance improvements, and possibly dive into other issues. You need to</p><p>make your benchmarking results meaningful, ensure they are easily reproducible, and</p><p>also be able to clearly explain these results to your team and other interested parties in a</p><p>way that reflects your business needs. 
There’s also hard mathematics involved: statistics</p><p>and queueing theory to help with black boxes and measurements, not to mention</p><p>domain-specific knowledge of the system internals of the servers, platforms, operating</p><p>systems, and the software running on it.</p><p>But when performance is a top priority, careful—and sometimes frequent—</p><p>benchmarking is essential. And in the long run, it will pay off. An effective benchmark</p><p>can save you from even worse pains, like the high-pressure database migration project</p><p>that ensues after you realize—too late—that your existing solution can’t support the</p><p>latest phase of company growth with acceptable latencies and/or throughput.</p><p>The goal of this chapter is to share strategies that ease the pain slightly and, more</p><p>importantly, increase the chances that the pain pays off by helping you select options</p><p>that meet your performance needs. The chapter begins by looking at the two key types of</p><p>benchmarks and highlighting critical considerations for each objective. Then, it presents</p><p>a phased approach that should help you expose problems faster and with lower costs.</p><p>Next, it dives into the do’s and don’ts of benchmark planning, execution, and reporting,</p><p>1 For an example of realistic benchmarking executed with impressive mastery, see Brian Taylor’s</p><p>talk, “How Optimizely (Safely) Maximizes Database Concurrency,” at www.youtube.com/</p><p>watch?v=cSiVoX_nq1s.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_9</p><p>176</p><p>with a focus on lessons learned from the best and worst benchmarks we’ve witnessed</p><p>over the past several years. Finally, the chapter closes with a look at some less common</p><p>benchmarking approaches you might want to consider for specialized needs.</p><p>Latency or Throughput: Choose Your Focus</p><p>When benchmarking, you need to decide upfront whether you want to focus on</p><p>throughput or latency. Latency is measured in both cases. But here’s the difference:</p><p>• Throughput focus: You measure the maximum throughput by</p><p>sending a new request as soon as the previous request completes.</p><p>This helps you understand the highest number of IOPS that the</p><p>database can sustain. Throughput-focused benchmarks are often the</p><p>focus for analytics use cases (fraud detection, cybersecurity, etc.)</p><p>• Latency focus: You assess how many IOPS the database can handle</p><p>without compromising latency. This is usually the focus for most</p><p>user-facing and real-time applications.</p><p>Throughput tests are quite common, but latency tests are a better choice if you</p><p>already know the desired throughput (e.g., 1M OPS). This is especially true</p><p>if your</p><p>production system must meet a specific latency goal (for example, the 99.99 percentile</p><p>should have a read latency of less than 10ms).</p><p>If you’re focused solely on latency, you need to measure and compare latency at</p><p>the same throughput rates. 
If you know only that database A can handle 30K OPS with</p><p>a certain P99 latency and database B can handle 50K OPS with a slightly higher P99</p><p>latency, you can’t really say which one is “more efficient.” For a fair comparison, you</p><p>would need to measure each database’s latencies at either 30K OPS or 50K OPS—or both.</p><p>Even better, you would track latency across a broader span of intervals (e.g., measuring</p><p>at 10K OPS increments up until when neither database could achieve the required P99</p><p>latency, as demonstrated in Figure9-1.)</p><p>Chapter 9 BenChmarking</p><p>177</p><p>Figure 9-1. A latency-oriented benchmark</p><p>Not all latency benchmarks need to take that form, however. Consider the example</p><p>of an AdTech company with a real-time bidding use case. For them, a request that takes</p><p>longer than 31ms is absolutely useless because it will fall outside of the bidding window.</p><p>It’s considered a timeout. And any request that is 30ms or less is fine; a 2ms response</p><p>is not any more valuable to them than a 20ms response. They care only about which</p><p>requests time out and which don’t.</p><p>Their benchmarking needs are best served by a latency benchmark measuring how</p><p>many OPS were generating timeouts over time. For example, Figure9-2 shows that the</p><p>first database in their benchmark (the top line) resulted in over 100K timeouts a second</p><p>around 11:30; the other (the horizontal line near the bottom) experienced only around</p><p>200 timeouts at that same point in time, and throughout the duration of that test.</p><p>Chapter 9 BenChmarking</p><p>178</p><p>Figure 9-2. A latency-oriented benchmark measuring how many OPS were</p><p>generating timeouts over time</p><p>Chapter 9 BenChmarking</p><p>179</p><p>For contrast, Figure9-3 shows an example of a throughput benchmark.</p><p>Figure 9-3. A throughput-oriented benchmark</p><p>With a throughput benchmark, you want to see one of the resources (e.g., the</p><p>CPU or disk) maxing out in order to understand how much the database can deliver</p><p>under extreme load conditions. If you don’t reach this level, it’s a sign that you’re not</p><p>really effectively benchmarking the database’s throughput. For example, Figure9-4</p><p>demonstrates the load of two clusters during a benchmark run. Note how one cluster is</p><p>fully utilized whereas the other is very close to reaching its limits.</p><p>Chapter 9 BenChmarking</p><p>180</p><p>Figure 9-4. Two clusters’ load comparison: one fully maxed out and another very</p><p>close to reaching its limit</p><p>Less Is More (at First): Taking aPhased Approach</p><p>With either focus, the number one rule of benchmarking is to start simple. Always keep</p><p>a laser focus on the specific questions you want the benchmark to answer (more on that</p><p>shortly). But, realize that it could take a number of phases—each with a fair amount of</p><p>trial and error—to get meaningful results.</p><p>What could go wrong? A lot. For example:</p><p>• Your client might be a bottleneck</p><p>• Your database sizing might need adjustment</p><p>• Your tests might need tuning</p><p>• A sandbox environment could have very different resources than a</p><p>production one</p><p>• Your testing methodology might be too artificial to predict reality</p><p>If you start off with too much complexity, it will be a nightmare to discover what’s</p><p>going wrong and pinpoint the source of the problem. 
For example, assume you want</p><p>to test if a database can handle 1M OPS of traffic from your client with a P99 latency of</p><p>1ms or less. However, you notice the latencies are exceeding the expected threshold.</p><p>You might spend days adjusting database configurations to no avail, then eventually</p><p>figure out that the problem stemmed from a bug in client-side concurrency. This would</p><p>Chapter 9 BenChmarking</p><p>181</p><p>have been much more readily apparent if you started out with just a fraction of that</p><p>throughput. In addition to avoiding frustration and lost time, you would have saved your</p><p>team a lot of unnecessary infrastructure costs.</p><p>As a general rule of thumb, consider at least two phases of benchmarking: one with a</p><p>specialized stress tool and one with your real workload (or at least a sampling of it—e.g.,</p><p>sending 30 percent of your queries to a cluster for benchmarking). For each phase,</p><p>start super small (at around 10 percent of the throughput you ultimately want to test),</p><p>troubleshoot as needed, then gradually increase the scope until you reach your target</p><p>loads. Keep optimization in mind throughout. Do you need to add more servers or more</p><p>clients to achieve a certain throughput? Or are you limited (by budget or infrastructure)</p><p>to a fixed hardware configuration? Can you achieve your performance goals with less?</p><p>The key is to move incrementally. Of course, the exact approach will vary from</p><p>situation to situation. Consider a leading travel company’s approach. Having recently</p><p>moved from PostgreSQL to Cassandra, they were quite experienced benchmarkers when</p><p>they decided to evaluate Cassandra alternatives. The goal was to test the new database</p><p>candidate’s raw speed and performance, along with its support for their specific</p><p>workloads.</p><p>First, they stood up a five-node cluster and ran database comparisons with synthetic</p><p>traffic from cassandra-stress. This gave them confidence that the new database could</p><p>meet their performance needs with some workloads. However, their real workloads</p><p>are nothing like even customized cassandra-stress workloads. They experience highly</p><p>variable and unpredictable traffic (for example, massive surges and disruptions</p><p>stemming from a volcanic eruption). For a more realistic assessment, they started</p><p>shadowing production traffic. This second phase of benchmarking provided the added</p><p>confidence they needed to move forward with the migration.</p><p>Finally, they used the same shadowed traffic to determine the best deployment</p><p>option. Moving to a larger 21-node cluster, they tested across cloud provider A and cloud</p><p>provider B on bare metal. They also experimented with many different options on cloud</p><p>provider B: various storage options, CPUs, and so on.</p><p>The bottom line here: Start simple, confirm, then scale incrementally. It’s safer and</p><p>ultimately faster. Plus, you’ll save on costs. As you move through the process, check if you</p><p>need to tweak your setup during your testing. Once you are eventually satisfied with the</p><p>results, scale your infrastructure accordingly to meet your defined criteria.</p><p>Chapter 9 BenChmarking</p><p>182</p><p>Benchmarking Do’s andDon’ts</p><p>The specific step-by-step instructions for how to configure and run a benchmark vary</p><p>across databases and benchmarking tools, so we’re not going to get into that. 
Instead,</p><p>let’s look at some of the more universal “do’s and don’ts” based on what we’ve seen in</p><p>the field.</p><p>Tip if you haven’t done so yet, be sure to review the chapters on drivers,</p><p>infrastructure, and topology considerations before you begin benchmarking.</p><p>Know What’s Under theHood ofYour Database (Or Find</p><p>Someone Who Knows)</p><p>Understand and anticipate what parts of the system your chosen workload will affect</p><p>and how. How will it stress your CPUs? Your memory? Your disks? Your network? Do you</p><p>know if the database automatically analyzes the system it’s running on and prioritizes</p><p>application requests as opposed to internal tasks? What’s going on as far as background</p><p>operations and how these may skew your results? And why does all this matter if you’re</p><p>just trying to run a benchmark?</p><p>Let’s take the example of compaction with LSM-tree based databases. As we’ll</p><p>cover in Chapter 11, compactions do have a significant impact on performance. But</p><p>compactions are unlikely to kick in if you run a benchmark for just a few minutes.</p><p>Given that compactions have dramatically different performance impacts on different</p><p>databases, it’s essential to know that they will occur</p><p>and ensure that tests last long</p><p>enough to measure their impact.</p><p>The important thing here is to try to understand the system that you’re</p><p>benchmarking. The better you understand it, the better you can plan tests and interpret</p><p>the results. If there are vendors and/or user groups behind the database you’re</p><p>benchmarking, try to probe them for a quick overview of how the database works and</p><p>what you should watch out for. Otherwise, you might overlook something that comes</p><p>back to haunt you, such as finding out that your projected scale was too optimistic. Or,</p><p>you might freak out over some KPI that’s really a non-issue.</p><p>Chapter 9 BenChmarking</p><p>183</p><p>Choose anEnvironment That Takes Advantage</p><p>oftheDatabase’s Potential</p><p>This is really a corollary to the previous tip. With a firm understanding of your database’s</p><p>superpowers, you can design benchmark scenarios that fully reveal its potential. For</p><p>example, if you want to compare two databases designed for commodity hardware, don’t</p><p>worry about benchmarking them on a broad array of powerful servers. But if you’re</p><p>comparing a database that’s architected to take advantage of powerful servers, you’d be</p><p>remiss to benchmark it only on commodity hardware (or even worse, using a Docker</p><p>image on a laptop). That would be akin to test driving a race car on the crowded streets</p><p>of NewYork City rather than your local equivalent of the Autobahn highway.</p><p>Likewise, if you think some aspect of the database or your data modeling will be</p><p>problematic for your use case, now’s the time to push it to the limits and assess its true</p><p>impact. For example, if you think a subset of your data might have imbalanced access</p><p>patterns due to user trends, use the benchmark phase to reproduce that and assess the</p><p>impacts.</p><p>Use anEnvironment That Represents Production</p><p>Benchmarking in the wrong environment can easily lead to an order-of-magnitude</p><p>performance difference. 
For example, a laptop might achieve 20K OPS where a dedicated server could easily achieve 200K OPS. Unless you intend to have your production system running on a laptop, do not benchmark (or run comparisons) on a laptop.

If you are using shared hardware in a containerized/virtualized environment, be aware that one guest can increase latency in other guests. As a result, you'll typically want to ensure that hardware resources are dedicated to your database and that you avoid resource overcommitment by any means possible.

Also, don't overlook the environment for your load generators. If you underprovision load generators, the load generators themselves will be the bottleneck. Another consideration: Ensure that the database and the data loader are not running on the same nodes. Pushing and pulling data is resource intensive, so the loader will definitely steal resources from the database. This will impact your results with any database.

Don't Overlook Observability

Having observability into KPIs beyond throughput and latency is critical for identifying and troubleshooting issues. For instance, you might not be hitting the cache as much as intended. Or a network interface might be overwhelmed with data to the point that it interferes with latency. Observability is also your primary tool for validating that you're not being overly optimistic—or pessimistic—when reviewing results. You may discover that even read requests served from disk, with a cold cache, are within your latency requirements.

Note: For extensive discussion on this topic, see Chapter 10.

Use Standardized Benchmarking Tools Whenever Feasible

Don't waste resources building—and debugging and maintaining—your own benchmarking tool for a problem that has already been solved. The community has developed an impressive set of tools that can cover a wide range of needs. For example:

• YCSB (https://github.com/brianfrankcooper/YCSB)
• TPC-C (http://tpc.org/tpcc/default5.asp)
• NdBench (https://github.com/Netflix/ndbench)
• NoSQLBench (https://github.com/nosqlbench/nosqlbench)
• pgbench (www.postgresql.org/docs/current/pgbench.html)
• tlp-stress (https://github.com/thelastpickle/tlp-stress)
• cassandra-stress (https://github.com/scylladb/scylla-tools-java/tree/master/tools/stress)
• and more…

They are all relatively similar and provide comparable configuration parameters. Your task is to understand which one better reflects the workload you are interested in and how to run it properly. When in doubt, consult with your vendor for specific tooling compatible with your database of choice.

Of course, these options won't cover everything. It makes sense to develop your own tools if:

• Your workloads look nothing like the ones offered by standard tools (for example, you rely on multiple operations that are not natively supported by the tools)
• It helps you test against real (or more realistic) workloads in the later phases of your benchmarking strategy

Ideally, the final stages of your benchmarking would involve connecting your application to the database and seeing how it responds to your real workload.
But what</p><p>if, for example, you are comparing two databases that require you to implement the</p><p>application logic in two totally different ways? In this case, the different application logic</p><p>implementations could influence your results as much as the difference in databases.</p><p>Again, we recommend starting small: Testing just the basic functionality of the</p><p>application against both targets (following each one’s best practices) and seeing what the</p><p>initial results look like.</p><p>Use Representative Data Models, Datasets,</p><p>andWorkloads</p><p>As you progress past the initial “does this even work” phase of your benchmarking,</p><p>it soon becomes critical to gravitate to representative data models, datasets, and</p><p>workloads. The closer you approximate your production environment, the better you can</p><p>trust that your results accurately represent what you will experience in production.</p><p>Data Models</p><p>Tools such as cassandra-stress use a default data model that does not completely</p><p>reflect what most teams use in production. For example, the cassandra-stress default</p><p>data model has a replication factor set to 1 and uses LOCAL_ONE as a consistency</p><p>level. Although cassandra-stress is a convenient way to get some initial performance</p><p>impressions, it is critical to benchmark the same/similar data model that you will</p><p>Chapter 9 BenChmarking</p><p>186</p><p>use in production. That’s why we recommend using a custom data model and tuning</p><p>your consistency level and queries. cassandra-stress and other benchmarking tools</p><p>commonly provide ways to specify a user profile, where you can specify your own</p><p>schema, queries, replication factor, request distribution and sizes, throughput rates,</p><p>number of clients, and other aspects.</p><p>Dataset Size</p><p>If you run the benchmark with a dataset that’s smaller than your production dataset, you</p><p>may have misleading or incorrect results due to the reduced number of I/O operations.</p><p>Eventually, you should configure a test that realistically reflects a fraction of your</p><p>production dataset size corresponding to your current scale.</p><p>Workloads</p><p>Run the benchmark using a load that represents, as closely as possible, your anticipated</p><p>production workload. This includes the queries submitted by the load generator. When</p><p>you use the right type of queries, they are distributed over the cluster and the ratio</p><p>between reads and writes remains relatively constant.</p><p>The read/write ratio is important. Different combinations will impact your disk</p><p>in different ways. If you want results representative of production, use a realistic</p><p>workload mix.</p><p>Eventually, you will max out your storage I/O throughput and starve your disk, which</p><p>causes requests to start queuing on the database. If you continue pushing past that point,</p><p>latency will increase. When you hit that point of increased latency with unsatisfactory</p><p>in practice.</p><p>abouT The TeChniCal reviewers</p><p>xvii</p><p>Acknowledgments</p><p>The process of creating this book has been a wild ride across many countries, cultures,</p><p>and time zones, as well as around many obstacles. 
There are many people to thank for</p><p>their assistance, inspiration, and support along this journey.</p><p>To begin, ScyllaDB co-founders Dor Laor and Avi Kivity—for starting the company</p><p>that brought us all together, for pushing the boundaries of database performance at scale</p><p>in ways that inspired this book, and for trusting us to share the collective sea monster</p><p>wisdom in this format. Thank you for this amazing opportunity.</p><p>We thank our respective teams, and especially our managers, for supporting this side</p><p>project. We hope we kept the core workload disruption to a minimum and did not inflict</p><p>any “stop the world” project pauses.</p><p>Our technical reviewers—Botond Dénes, Ľuboš Koščo, and Raphael S.Carvalho—</p><p>painstakingly reviewed the first draft of every page in this book and offered insightful</p><p>suggestions throughout. Thank you for your thoughtful comments and for being so</p><p>generous with your time.</p><p>Additionally, our unofficial technical reviewer and toughest critic, Kostja Osipov,</p><p>provided early and (brutally) honest feedback that led us to substantially alter the book’s</p><p>focus for the better.</p><p>The Brazilian Ninja team (Guilherme Nogueira, Lucas Martins Guimarães, and</p><p>Noelly Medina) rescued us in our darkest hour, allowing us to scale out and get the first</p><p>draft across the finish line. Muito Obrigado!</p><p>Ben Gaisne is the graphic design mastermind behind the images in this book. Merci</p><p>for transforming our scribbles into beautiful diagrams and putting up with about ten</p><p>rounds of “just one more round of book images.”</p><p>We are also indebted to many for their unintentional contributions on the content</p><p>front. Glauber Costa left us with a treasure trove of materials we consulted when</p><p>composing chapters, especially Chapter 9 on benchmarking. He also inspired the addition</p><p>of Chapter 6 on getting data closer. Additionally, we also looked back to ScyllaDB blogs as</p><p>we were writing—specifically, blogs by Avi Kivity (for Chapter 3), Eyal Gutkind (for Chapter</p><p>7), Vlad Zolotarov and Moreno Garcia (also for Chapter 7), Dor Laor (for Chapter 8), Eliran</p><p>Sinvani (also for Chapter 8), and Ivan Prisyazhynyy (for Chapter 9).</p><p>xviii</p><p>Last, but certainly not least, we thank Jonathan Gennick for bringing us to Apress. We</p><p>thank Shaul Elson and Susan McDermott for guiding us through the publishing process.</p><p>It has been a pleasure working with you. And we thank everyone involved in editing and</p><p>production; having previously tried this on our own, we know it’s an excruciating task</p><p>and we are truly grateful to you for relieving us of this burden!</p><p>aCknowledgmenTs</p><p>xix</p><p>Introduction</p><p>Sisyphean challenge. Gordian knot. Rabbit hole. Many metaphors have been used to</p><p>describe the daunting challenge of achieving database performance at scale. That isn’t</p><p>surprising. 
Consider just a handful of the many factors that contribute to satisfying</p><p>database latency and throughput expectations for a single application:</p><p>• How well you know your workload access patterns and whether they</p><p>are a good fit for your current or target database.</p><p>• How your database interacts with its underlying hardware, and</p><p>whether your infrastructure is correctly sized for the present as well</p><p>as the future.</p><p>• How well your database driver understands your database—and how</p><p>well you understand the internal workings of both.</p><p>It’s complex. And that’s just the tip of the iceberg.</p><p>Then, once you feel like you’re finally in a good spot, something changes. Your</p><p>business experiences “catastrophic success,” exposing the limitations of your initial</p><p>approach right when you’re entering the spotlight. Maybe market shifts mean that your</p><p>team is suddenly expected to reduce latency—and reduce costs at the same time, too.</p><p>Or perhaps you venture on to tackle a new application and find that the lessons learned</p><p>from the original project don’t translate to the new one.</p><p>Why Read/Write aBook onDatabase Performance?</p><p>The most common approaches to optimizing database performance are conducting</p><p>performance tuning and scaling out. They are important—but in many cases, they aren’t</p><p>enough to satisfy strict latency expectations at medium to high throughput. To break past</p><p>that plateau, other factors need to be addressed.</p><p>xx</p><p>As with any engineering challenge, there’s no one-size-fits-all solution. But there are</p><p>a lot of commonly overlooked considerations and opportunities with the potential to</p><p>help teams meet their database performance objectives faster, and with fewer headaches.</p><p>As a group of people with experience across a variety of performance-oriented</p><p>database projects, we (the authors) have a unique perspective into what works well for</p><p>different performance-sensitive use cases—from low-level engineering optimizations,</p><p>to infrastructure components, to topology considerations and the KPIs to focus on for</p><p>monitoring. Frequently, we engage with teams when they’re facing a performance</p><p>challenge so excruciating that they’re considering changing their production database</p><p>(which can seem like the application development equivalent of open heart surgery).</p><p>And in many cases, we develop a long-term relationship with a team, watching their</p><p>projects and objectives evolve over time and helping them maintain or improve</p><p>performance across the shifting sands.</p><p>Based on our experience with performance-focused database engineering as well as</p><p>performance-focused database users, this book represents what we think teams striving</p><p>for extreme database performance—low latency, high throughput, or both—should be</p><p>thinking about. We have experience working with multi-petabyte distributed systems</p><p>requiring millions of interactions per second. We’ve engineered systems supporting</p><p>business critical real-time applications with sustained latencies below one millisecond.</p><p>Finally, we’re well aware of commonly-experienced “gotchas” that no one has dared to</p><p>tell you about, until now.</p><p>What WeMean by Database Performance atScale</p><p>Database performance at scale means different things to different teams. 
For some, it</p><p>might mean achieving extremely low read latencies; for others, it might mean ingesting</p><p>very large datasets as quickly as possible. For example:</p><p>• Messaging: Keeping latency consistently low for thousands to</p><p>millions of operations per second, because users expect to interact in</p><p>real-time on popular social media platforms, especially when there’s</p><p>a big event or major news.</p><p>• Fraud detection: Analyzing a massive dataset as rapidly as possible</p><p>(millions of operations per second), because faster processing helps</p><p>stop fraud in its tracks.</p><p>inTroduCTion</p><p>xxi</p><p>• AdTech: Providing lightning fast (sub-millisecond P9999 latency)</p><p>responses with zero tolerance for latency spikes, because an ad bid</p><p>that’s sent even a millisecond past the cutoff is worthless to the ad</p><p>company and the clients who rely on it.</p><p>We specifically tagged on the “at scale” modifier to emphasize that we’re catering to</p><p>teams who are outside of the honeymoon zone, where everything is just blissfully fast</p><p>no matter what you do with respect to setup, usage, and management. Different teams</p><p>will reach that inflection point for different reasons, and at different thresholds. But one</p><p>thing is always the same: It’s better to anticipate and prepare than to wait and scramble</p><p>to react.</p><p>Who This Book Is For</p><p>This book was written for individuals and teams looking to optimize distributed</p><p>database performance for an existing project or to begin a new performance-sensitive</p><p>project with a solid and scalable foundation. You are most likely:</p><p>• Experiencing or anticipating some pain related to database latency</p><p>and/or throughput</p><p>• Working primarily on a use case with terabytes to petabytes of raw</p><p>(unreplicated)</p><p>results, stop, reflect on what happened, analyze how you can improve, and iterate</p><p>through the test again. Rinse and repeat as needed.</p><p>Here are some tips on creating realistic workloads for common use cases:</p><p>• Ingestion: Ingest data as fast as possible for at least a few hours, and</p><p>do it in a way that doesn’t produce timeouts or errors. The goal here</p><p>is to ensure that you’ve got a stable system, capable of keeping up</p><p>with your expected traffic rate for long periods.</p><p>• Real-time bidding: Use bulk writes coming in after hours or</p><p>constantly low background loads; the core of the workload is a lot of</p><p>reads with extremely strict latency requirements (perhaps below a</p><p>specific threshold).</p><p>Chapter 9 BenChmarking</p><p>187</p><p>• Time series: Use heavy and constant writes to ever-growing</p><p>partitions split and bucketed by time windows; reads tend to focus on</p><p>the latest rows and/or a specific range of time.</p><p>• Metadata store: Use writes occasionally, but focus on random</p><p>reads representing users accessing your site. 
There’s usually good</p><p>cacheability here.</p><p>• Analytics: Periodically write a lot of information and perform a</p><p>lot of full table scans (perhaps in parallel with some of the other</p><p>workloads).</p><p>The bottom line is to try to emulate what your workloads look like and run</p><p>something that’s meaningful to you.</p><p>Exercise Your Cache Realistically</p><p>Unless you can absolutely guarantee that your workload has a high cache hit rate</p><p>frequency, be pessimistic and exercise it well.</p><p>You might be running workloads, getting great results, and seeing cache hits all</p><p>the way up to 90 percent. That’s great. But is this the way you’re going to be running</p><p>in practice all the time? Do you have periods throughout the day when your cache is</p><p>not going to be that warm, maybe because there’s something else running? In real-life</p><p>situations, you will likely have times when the cache is colder or even super cold (e.g.,</p><p>after an upgrade or after a hardware failure). Consider testing those scenarios in the</p><p>benchmark as well.</p><p>If you want to make sure that all requests are coming from the disk, you can disable</p><p>the cache altogether. However, be aware that this is typically an extreme situation, as</p><p>most workloads (one way or another) exercise some caching. Sometimes you can create</p><p>a cold cache situation by just restarting the nodes or restarting the processes.</p><p>Look at Steady State</p><p>Most databases behave differently in real life than they do in short transient test</p><p>situations. They usually run for days or years—so when you test a database for two</p><p>minutes, you’re probably not getting a deep understanding of how it behaves, unless</p><p>you are working in memory only. Also, when you’re working with a database that is built</p><p>Chapter 9 BenChmarking</p><p>188</p><p>to serve tens or hundreds of terabytes—maybe even petabytes—know that it’s going</p><p>to behave rather differently at various data levels. Requests become more expensive,</p><p>especially read requests. If you’re testing something that only serves a gigabyte, it really</p><p>isn’t the same as testing something that’s serving a terabyte.</p><p>Figure 9-5 exemplifies the importance of looking at steady state. Can you tell what</p><p>throughput is being sustained by the database in question?</p><p>Figure 9-5. A throughput graph that is not focused on steady state</p><p>Well, if you look just at the first minute, it seems that it’s serving 40K OPS.But if you</p><p>wait for a few minutes, the throughput decreases.</p><p>Whenever you want to make a statement about the maximum throughput that your</p><p>database can handle, do that from a steady state. Make sure that you’re inserting an</p><p>amount of data that is meaningful, not just a couple of gigabytes, and make sure that it</p><p>runs for enough time so it’s a realistic scenario. After you are satisfied with how many</p><p>requests can be sustained over a prolonged period of time, consider adding noise, such</p><p>as scaling clients, and introducing failure situations.</p><p>Watch Out forClient-Side Bottlenecks</p><p>One of the most common mistakes with benchmarks is overlooking the fact that the</p><p>bottleneck could be coming from the application side. You might have to tune your</p><p>application clients to allow for a higher concurrency. You may also be running many</p><p>application pods on the same tenant—with all instances contending for the same</p><p>hardware resources. 
Make sure your application is running in a proper environment, as is your database.

Also Watch Out for Networking Issues

Networking issues could also muddle the results of your benchmarking. If the database is burning too much CPU on softirq processing (handling network interrupts), this will degrade your performance. You can detect this by analyzing CPU interrupt shares, for example. And you can typically resolve it by using CPU pinning, which tells the system that all network interrupts should be handled by specific CPUs that are not being used by the database.

Similarly, running your application through a slow link, such as routing traffic via the Internet rather than via a private link, can easily introduce a networking bottleneck.

Document Meticulously to Ensure Repeatability

It's difficult to anticipate when or why you might want to repeat a benchmark. Maybe you want to assess the impact of optimizations you made after getting some great tips at the vendor's user conference. Maybe you just learned that your company was acquired and you should prepare to support ten times your current throughput—or much stricter latency SLAs. Perhaps you learned about a cool new database that's API-compatible with your current one, and you're curious how the performance stacks up. Or maybe you have a new boss with a strong preference for another database and you suddenly need to re-justify your decision with a head-to-head comparison.

Whatever the reason you're repeating a benchmark scenario, one thing is certain: You will be immensely appreciative of the time that you previously spent documenting exactly what you did and why.

Reporting Do's and Don'ts

So you've completed your benchmark and you've gathered all sorts of data—what's the best way to report it? Don't skimp on this final, yet critical step. Clear and compelling reporting is critical for convincing others to support your recommended course of action—be it embarking on a database migration, changing your configuration or data modeling, or simply sticking with what's working well for you.

Here are some reporting-focused do's and don'ts.

Be Careful with Aggregations

When it comes to aggregations, proceed with extreme caution. You could report the result of a benchmark by saying something like "I ran this benchmark for three days, and this is my throughput." However, this overlooks a lot of critical information. For example, consider the two graphs presented in Figures 9-6 and 9-7.

Figure 9-6. Lower baseline throughput that's almost constant and predictable throughout a ten-minute period

Figure 9-7. A bumpier path to a similar throughput at the end

Both of these loads have roughly the same throughput at the end. Figure 9-6 shows lower baseline throughput—but it's constant and very predictable throughout the period. The OPS in Figure 9-7 dip much lower than the first baseline, but they also spike to a much higher value. The behavior shown in Figure 9-6 is obviously more desirable. But if you aggregate your results, it would be really hard to notice a difference.

Another aggregation mistake is aggregating tail latencies: taking the average of P99 latencies from multiple load generators.
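A small numeric sketch, using made-up latency samples, shows how far that average can drift from the tail of the combined distribution:

```python
import random

random.seed(42)

def p99(samples):
    ordered = sorted(samples)
    return ordered[int(len(ordered) * 0.99) - 1]

# Hypothetical per-load-generator latency samples, in milliseconds.
generator_a = [random.gauss(5, 1) for _ in range(100_000)]
generator_b = [random.gauss(5, 1) + (80 if random.random() < 0.03 else 0)
               for _ in range(100_000)]  # ~3% slow outliers

per_generator_p99s = [p99(generator_a), p99(generator_b)]
print(f"average of per-generator P99s: {sum(per_generator_p99s) / 2:.1f} ms")
print(f"maximum of per-generator P99s: {max(per_generator_p99s):.1f} ms")
print(f"P99 of the merged samples:     {p99(generator_a + generator_b):.1f} ms")
# The average badly understates the combined tail; the maximum is a much
# safer stand-in when merging the raw distributions isn't possible.
```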
The correct way to determine the percentiles</p><p>over multiple load generators is to merge the latency distribution of each load generator</p><p>and then determine the percentiles. If that isn’t an option, then the next best alternative</p><p>is to take the maximum (the P99, for example) of each of the load generators. The actual</p><p>P99 will be equal to</p><p>data, over 10K operations per second, and with P99</p><p>latencies measured in milliseconds</p><p>• At least somewhat familiar with scalable distributed databases such</p><p>as Apache Cassandra, ScyllaDB, Amazon DynamoDB, Google Cloud</p><p>Bigtable, CockroachDB, and so on</p><p>• A software architect, database architect, software engineer, VP of</p><p>engineering, or technical CTO/founder working with a data-intensive</p><p>application</p><p>You might also be looking to reduce costs without compromising performance, but</p><p>unsure of all the considerations involved in doing so.</p><p>We assume that you want to get your database performance challenges resolved,</p><p>fast. That’s why we focus on providing very direct and opinionated recommendations</p><p>based on what we have seen work (and fail) in real-world situations. There are, of</p><p>course, exceptions to every rule and ways to debate the finer points of almost any tip</p><p>inTroduCTion</p><p>xxii</p><p>in excruciating detail. We’ll focus on presenting the battle-tested “best practices” and</p><p>anti-patterns here, and encourage additional discussion in whatever public or private</p><p>channels you prefer.</p><p>What This Book Is NOT</p><p>A few things that this book is not attempting to be:</p><p>• A reference for infrastructure engineers building databases. We focus</p><p>on people working with a database.</p><p>• A “definitive guide” to distributed databases, NoSQL, or data-</p><p>intensive applications. We focus on the top database considerations</p><p>most critical to performance.</p><p>• A guide on how to configure, work with, optimize, or tune any</p><p>specific database. We focus on broader strategies you can “port”</p><p>across databases.</p><p>There are already many outstanding references that cover the topics we’re</p><p>deliberately not addressing, so we’re not going to attempt to re-create or replace them.</p><p>See Appendix A for a list of recommended resources.</p><p>Also, this is not a book about ScyllaDB, even though the authors and technical</p><p>reviewers have experience with ScyllaDB.Our goal is to present strategies that are useful</p><p>across the broader class of performance-oriented databases. We reference ScyllaDB, as</p><p>well as other databases, as appropriate to provide concrete examples.</p><p>A Tour ofWhat WeCover</p><p>Given that database performance is a multivariate challenge, we explore it from a</p><p>number of different angles and perspectives. Not every angle will be relevant to every</p><p>reader—at least not yet. We encourage you to browse around and focus on what seems</p><p>most applicable to your current situation.</p><p>To start, we explore challenges. Chapter 1 kicks it off with two highly fictionalized</p><p>tales that highlight the variety of database performance challenges that can arise and</p><p>introduce some of the available strategies for addressing them. 
Next, we look at the</p><p>inTroduCTion</p><p>xxiii</p><p>database performance challenges and tradeoffs that you’re likely to face depending on</p><p>your project’s specific workload characteristics and technical/business requirements.</p><p>The next set of chapters provides a window into many often-overlooked engineering</p><p>details that could be constraining—or helping—your database performance. First, we</p><p>look at ways databases can extract more performance from your CPU, memory, storage,</p><p>and networking. Next, we shift the focus from hardware interactions to algorithmic</p><p>optimizations—deep diving into the intricacies of a sample performance optimization</p><p>from the perspective of the engineer behind it. Following that, we share everything a</p><p>performance-obsessed developer really should know about database drivers but never</p><p>thought to ask. Driver-level optimizations —both how they’re engineered and how you</p><p>work with them—are absolutely critical for performance, so we spend a good amount</p><p>of time on topics like the interaction between clients and servers, contextual awareness,</p><p>maximizing concurrency while keeping latencies under control, correct usage of</p><p>paging, timeout control, retry strategies, and so on. Finally, we look at the performance</p><p>possibilities in moving more logic into the database (via user-defined functions and</p><p>user-defined aggregates) as well as moving the database servers closer to users.</p><p>Then, the final set of chapters shifts into field-tested recommendations for</p><p>getting better performance out of your database deployment. It starts by looking at</p><p>infrastructure and deployment model considerations that are important to understand,</p><p>whether you’re managing your own deployment or opting for a database-as-a-service</p><p>(maybe serverless) deployment model. Then, we share our top strategies related to</p><p>topology, benchmarking, monitoring, and admin—all through the not-always-rosy lens</p><p>of performance.</p><p>After all that, we hope you end up with a new appreciation of the countless</p><p>considerations that impact database performance at scale, discover some previously</p><p>overlooked opportunities to optimize your database performance, and avoid the</p><p>common traps and pitfalls that inflict unnecessary pain and distractions on all too many</p><p>dev and database teams.</p><p>Tip Check out our github repo for easy access to the sources we reference in</p><p>footnotes, plus additional resources on database performance at scale: https://</p><p>github.com/Apress/db-performance-at-scale.</p><p>inTroduCTion</p><p>xxiv</p><p>Summary</p><p>Optimizing database performance at the scale required for today’s data-intensive</p><p>applications often requires more than performance tuning and scaling out. This</p><p>book shares commonly overlooked considerations, pitfalls, and opportunities that</p><p>have helped many teams break through database performance plateaus. It’s neither</p><p>a definitive guide to distributed databases nor a beginner’s resource. Rather, it’s a</p><p>look at the many different factors that impact performance, and our top field-tested</p><p>recommendations for navigating them. 
Chapter 1 provides two (fun and fanciful) tales</p><p>that surface some of the many roadblocks you might face and highlight the range of</p><p>strategies for navigating around them.</p><p>inTroduCTion</p><p>1</p><p>CHAPTER 1</p><p>A Taste ofWhat You’re</p><p>UpAgainst: Two Tales</p><p>What’s more fun than wrestling with database performance? Well, a lot. But that doesn’t</p><p>mean you can’t have a little fun here. To give you an idea of the complexities you’ll likely</p><p>face if you’re serious about optimizing database performance, this chapter presents two</p><p>rather fanciful stories. The technical topics covered here are expanded on throughout</p><p>the book. But this is the one and only time you’ll hear of poor Joan and Patrick. Let</p><p>their struggles bring you some valuable lessons, solace in your own performance</p><p>predicaments… and maybe a few chuckles as well.</p><p>Joan Dives Into Drivers andDebugging</p><p>Lured in by impressive buzzwords like “hybrid cloud,” “serverless,” and “edge first,”</p><p>Joan readily joined a new company and started catching up with their technology stack.</p><p>Her first project recently started a transition from their in-house implementation of</p><p>a database system, which turned out to not scale at the same pace as the number of</p><p>customers, to one of the industry-standard database management solutions. Their new</p><p>pick was a new distributed database, which, as opposed to NoSQL, strives to keep the</p><p>original ACID1 guarantees known in the SQL world.</p><p>Due to a few new data protection acts that tend to appear annually nowadays, the</p><p>company’s board decided that they were going to maintain their own datacenter, instead</p><p>of using one of the popular cloud vendors for storing sensitive information.</p><p>1 Atomicity, consistency, isolation, and durability</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_1</p><p>2</p><p>On a very high level, the company’s main product consisted of only two layers:</p><p>• The frontend, the entry point for users, which actually runs in their</p><p>own browsers and communicates with the rest of the system to</p><p>exchange and persist information.</p><p>• The everything-else, customarily known as the</p><p>backend, but actually</p><p>includes load balancers, authentication, authorization, multiple</p><p>cache layers, databases, backups, and so on.</p><p>Joan’s first task was to implement a very simple service for gathering and summing</p><p>up various statistics from the database and integrate that service with the whole</p><p>ecosystem, so that it fetched data from the database in real-time and allowed the</p><p>DevOps teams to inspect the statistics live.</p><p>To impress the management and reassure them that hiring Joan was their absolutely</p><p>best decision this quarter, Joan decided to deliver a proof-of-concept implementation</p><p>on her first day! The company’s unspoken policy was to write software in Rust, so she</p><p>grabbed the first driver for their database from a brief crates.io search and sat down to</p><p>her self-organized hackathon.</p><p>The day went by really smoothly, with Rust’s ergonomic-focused ecosystem</p><p>providing a superior developer experience. But then Joan ran her first smoke tests on a</p><p>real system. 
Disbelief turned to disappointment and helplessness when she realized that</p><p>every third request (on average) ended up in an error, even though the whole database</p><p>cluster reported to be in a healthy, operable state. That meant a debugging session was</p><p>in order!</p><p>Unfortunately, the driver Joan hastily picked for the foundation of her work, even</p><p>though open-source on its own, was just a thin wrapper over precompiled, legacy C</p><p>code, with no source to be found. Fueled by a strong desire to solve the mystery and a</p><p>healthy dose of fury, Joan spent a few hours inspecting the network communication with</p><p>Wireshark,2 and she made an educated guess that the bug must be in the hashing key</p><p>implementation.3 In the database used by the company, keys are hashed to later route</p><p>requests to appropriate nodes. If a hash value is computed incorrectly, a request may be</p><p>forwarded to the wrong node, which can refuse it and return an error instead.</p><p>2 Wireshark is a great tool for inspecting network packets and more (www.wireshark.org).</p><p>3 Loosely based on a legit hashing quirk in Apache Cassandra (https://github.com/apache/</p><p>cassandra/blob/56ea39ec704a94b5d23cbe530548745ab2420cee/src/java/org/apache/</p><p>cassandra/utils/MurmurHash.java#L31-L32).</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>3</p><p>Unable to verify the claim due to the missing source code, Joan decided on a simpler</p><p>path—ditching the originally chosen driver and reimplementing the solution on one of</p><p>the officially supported, open-source drivers backed by the database vendor, with a solid</p><p>user base and regularly updated release schedule.</p><p>Joan’s Diary ofLessons Learned, Part I</p><p>The initial lessons include:</p><p>1. Choose a driver carefully. It’s at the core of your code’s</p><p>performance, robustness, and reliability.</p><p>2. Drivers have bugs too, and it’s impossible to avoid them. Still,</p><p>there are good practices to follow:</p><p>a. Unless there’s a good reason, choose the officially supported driver (if it</p><p>exists).</p><p>b. Open-source drivers have advantages. They’re not only verified by the</p><p>community, but they also allow deep inspection of the code, and even</p><p>modifying the driver code to get even more insights for debugging.</p><p>c. It’s better to rely on drivers with a well-established release schedule</p><p>since they are more likely to receive bug fixes (including for security</p><p>vulnerabilities) in a reasonable period of time.</p><p>3. Wireshark is a great open-source tool for interpreting network</p><p>packets; give it a try if you want to peek under the hood of your</p><p>program.</p><p>The introductory task was eventually completed successfully, which made Joan</p><p>ready to receive her first real assignment.</p><p>The Tuning</p><p>Armed with the experience gained working on the introductory task, Joan started planning</p><p>how to approach her new assignment: a misbehaving app. One of the applications</p><p>notoriously caused stability issues for the whole system, disrupting other workloads</p><p>each time it experienced any problems. The rogue app was already based on an officially</p><p>supported driver, so Joan could cross that one off the list of potential root causes.</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>4</p><p>This particular service was responsible for injecting data backed up from the</p><p>legacy system into the new database. 
Because the company was not in a great hurry,</p><p>the application was written with low concurrency in mind to have low priority and</p><p>not interfere with user workloads. Unfortunately, once every few days something</p><p>kept triggering an anomaly. The normally peaceful application seemed to be trying to</p><p>perform a denial-of-service attack on its own database, flooding it with requests until the</p><p>backend got overloaded enough to cause issues for other parts of the ecosystem.</p><p>As Joan watched metrics presented in a Grafana dashboard, clearly suggesting that</p><p>the rate of requests generated by this application started spiking around the time of the</p><p>anomaly, she wondered how on Earth this workload could behave like that. It was, after</p><p>all, explicitly implemented to send new requests only when fewer than 100 of them were</p><p>currently in progress.</p><p>Since collaboration was heavily advertised as one of the company’s “spirit and</p><p>cultural foundations” during the onboarding sessions with an onsite coach, she decided</p><p>it was best to discuss the matter with her colleague, Tony.</p><p>“Look, Tony, I can’t wrap my head around this,” she explained.</p><p>“This service doesn’t send any new requests when 100 of them are</p><p>already in flight. And look right here in the logs: 100 requests</p><p>in- progress, one returned a timeout error, and…,” she then</p><p>stopped, startled at her own epiphany.</p><p>“Alright, thanks Tony, you’re a dear—best rubber duck4 ever!,” she</p><p>concluded and returned to fixing the code.</p><p>The observation that led to discovering the root cause was rather simple: The request</p><p>didn’t actually return a timeout error because the database server never sent such a</p><p>response. The request was simply qualified as timed out by the driver, and discarded. But</p><p>the sole fact that the driver no longer waits for a response for a particular request does</p><p>not mean that the database is done processing it! It’s entirely possible that the request</p><p>was instead just stalled, taking longer than expected, and the driver gave up waiting for</p><p>its response.</p><p>With that knowledge, it’s easy to imagine that once 100 requests time out on the</p><p>client side, the app might erroneously think that they are not in progress anymore, and</p><p>happily submit 100 more requests to the database, increasing the total number of</p><p>4 For an overview of the “rubber duck debugging” concept, see https://</p><p>rubberduckdebugging.com/.</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>5</p><p>in- flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme</p><p>levels of concurrency on your database cluster—even though the application was</p><p>supposed to keep it limited to a small number!</p><p>Joan’s Diary ofLessons Learned, Part II</p><p>The lessons continue:</p><p>1. Client-side timeouts are convenient for programmers, but they</p><p>can interact badly with server-side timeouts. Rule of thumb: Make</p><p>the client-side timeouts around twice as long as server-side ones,</p><p>unless you have an extremely good reason to do otherwise. Some</p><p>drivers may be capable of issuing a warning if they detect that the</p><p>client-side timeout is smaller than the server-side one, or even</p><p>amend the server-side timeout to match, but in general it’s best to</p><p>double-check.</p><p>2. 
Joan's Diary of Lessons Learned, Part II

The lessons continue:

1. Client-side timeouts are convenient for programmers, but they can interact badly with server-side timeouts. Rule of thumb: Make the client-side timeouts around twice as long as server-side ones, unless you have an extremely good reason to do otherwise. Some drivers may be capable of issuing a warning if they detect that the client-side timeout is smaller than the server-side one, or even amend the server-side timeout to match, but in general it's best to double-check.

2. Tasks with seemingly fixed concurrency can actually cause spikes under certain unexpected conditions. Inspecting logs and dashboards is helpful in investigating such cases, so make sure that observability tools are available, both in the database cluster and for all client applications. Bonus points for distributed tracing, like OpenTelemetry[5] integration.

[5] OpenTelemetry "is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." For details, see https://opentelemetry.io/.

With the client-side timeouts properly amended, the application choked much less frequently and to a smaller extent, but it still wasn't a perfect citizen in the distributed system. It occasionally picked a victim database node and kept bothering it with too many requests, while ignoring the fact that seven other nodes were considerably less loaded and could help handle the workload too. At other times, its concurrency was reported to be exactly twice as large as expected by the configuration. Whenever the two anomalies converged in time, the poor node was unable to handle all the requests it was bombarded with, and it had to give up on a fair portion of them. A long study of the driver's documentation, which was fortunately available in mdBook[6] format and kept reasonably up-to-date, helped Joan alleviate those pains too.

[6] mdBook "is a command line tool to create books with Markdown." For details, see https://rust-lang.github.io/mdBook/.

The first issue was simply a misconfiguration of the non-default load balancing policy, which tried too hard to pick "the least loaded" database node out of all the available ones, based on heuristics and statistics occasionally updated by the database itself. Unfortunately, this policy was also "best effort," and relied on the fact that statistics arriving from the database were always legit. But a stressed database node could become so overloaded that it wasn't sending updated statistics in time! That led the driver to falsely believe that this particular server was not actually busy at all. Joan decided that this setup was a premature optimization that turned out to be a footgun, so she just restored the original default policy, which worked as expected.

The second issue (temporary doubling of the concurrency) was caused by another misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured period of time without getting an acknowledgement from the database, drivers would speculatively resend a request to maximize its chances to succeed. This mechanism is very useful for increasing requests' success rate. However, if the original request also succeeds, it means that the speculative one was sent in vain. In order to balance the pros and cons, speculative retry should be configured to resend requests only when it's very likely that the original one failed. Otherwise, as in Joan's case, the speculative retry may act too soon, doubling the number of requests sent (and thus also doubling concurrency) without improving the success rate.
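A rough, self-contained simulation of that balance, using an invented latency distribution: with a speculative delay far below the typical response time, almost every request is sent twice for no benefit; with a delay near the tail of the distribution, only the genuinely stuck requests trigger a second attempt.

```python
import random

def response_time() -> float:
    # Invented latency distribution: ~95% of responses take around 10 ms,
    # the rest are stuck behind a slow node at ~500 ms.
    return random.gauss(0.010, 0.002) if random.random() < 0.95 else 0.5

def simulate(speculative_delay: float, requests: int = 10_000) -> None:
    # Count how many requests would trigger a speculative second attempt.
    extra = sum(1 for _ in range(requests) if response_time() > speculative_delay)
    print(f"speculative delay {speculative_delay * 1000:5.1f} ms -> "
          f"{extra} extra requests out of {requests} "
          f"({100 * extra / requests:.1f}% more load)")

if __name__ == "__main__":
    random.seed(7)
    simulate(0.005)   # overeager: fires before most legitimate responses arrive
    simulate(0.050)   # calmer: fires only for requests that are genuinely stuck
```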
Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality debugging session that ends in an astounding success (except writing a cheesy story in a deeply technical book, naturally). Great job, Joan!

The end.

Patrick's Unlucky Green Fedoras

After losing his job at a FAANG MAANG (MANGA?) company, Patrick decided to strike off on his own and founded a niche online store dedicated to trading his absolute favorite among headwear, green fedoras. Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack.

After some experimentation with the offering's free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers. To Patrick's disappointment, fewer than ten customers visited the site daily. At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick's Diary of Lessons Learned, Part I

The lessons started right away:

1. Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload:

   a. Is it likely to be a predictable, steady flow of requests (e.g., updates being fetched from other systems periodically)?

   b. Is the variance high and hard to predict, with the system being idle for potentially long periods of time, with occasional bumps of activity?

   Database-as-a-service offerings often let you pick between provisioned throughput and on-demand purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use (a rough cost sketch follows this list).

2. Give yourself time to evaluate your choice and avoid committing to long-term contracts (even if lured by a discount) before you see that the setup works for you in a sustainable way.
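To put lesson 1 into numbers, here is a back-of-the-envelope comparison of the two pricing models. Every figure below is invented for illustration; real providers price differently, but the shape of the argument is the same: provisioned capacity only pays off when it is actually used.

```python
# All prices and traffic figures are hypothetical, for illustration only.
provisioned_rps = 1_000                    # capacity Patrick paid for
price_per_provisioned_rps_hour = 0.0001    # $ per provisioned request/s per hour
price_per_million_on_demand = 1.50         # $ per million actual requests

actual_requests_per_day = 10 * 20          # ~10 visitors making ~20 requests each
actual_requests_per_month = actual_requests_per_day * 30

provisioned_cost = provisioned_rps * price_per_provisioned_rps_hour * 24 * 30
on_demand_cost = actual_requests_per_month / 1_000_000 * price_per_million_on_demand

print(f"provisioned: ${provisioned_cost:,.2f} per month, regardless of traffic")
print(f"on-demand:   ${on_demand_cost:,.4f} per month for "
      f"{actual_requests_per_month:,} requests")
```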
The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick's mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today's expected traffic.

"Provisioned throughput strikes again," Patrick groaned to himself, while scrolling through thousands of "throughput exceeded" error messages that started appearing around 11am.

Patrick's Diary of Lessons Learned, Part II

This is what Patrick learned:

1. If your workload is susceptible to spikes, be prepared for it and try to architect your cluster to be able to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput in a dynamic way, which means that the threshold of accepted requests can occasionally be raised temporarily to a previously configured level. Or, respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient.

2. Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests.

3. Observability is key in distributed systems. It allows the developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening, or at least minimize the negative impact on the cluster.

The First Loss

Patrick didn't even manage to recover from the trauma of losing most of his potential income on the only day throughout the year during which green fedoras experienced any kind of demand, when the letter came. It included an angry rant from a would-be customer, who successfully proceeded with his order and paid for it (with a receipt from the payment processing operator as proof), but is now unable to see any details of his order—and he's still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn't find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick's initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there's a compromise to make between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability[7] at the cost of decreased throughput, or reduce the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without any replicas distributed in the system. Once this server failed—which happens to hardware surprisingly often, especially at large scale—the data was gone.

[7] A very strong consistency guarantee; see the Jepsen page on Linearizability for details (https://jepsen.io/consistency/models/linearizable).
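The arithmetic behind that compromise is worth spelling out. In many distributed databases the relevant knobs are the replication factor (RF, how many copies of each row exist) and the read and write consistency levels (R and W, how many replicas must acknowledge an operation). A common rule of thumb is that R + W > RF yields strongly consistent reads, while any RF greater than one means a single dead server no longer takes the data with it. A small sketch follows, with the caveat that exact semantics vary between databases:

```python
def evaluate(rf: int, write_cl: int, read_cl: int) -> None:
    # Overlapping read and write quorums give read-your-writes behavior.
    strong = read_cl + write_cl > rf
    # The cluster can keep serving reads and writes with one replica down
    # only if neither consistency level requires every replica.
    survives_one_loss = rf > 1 and write_cl <= rf - 1 and read_cl <= rf - 1
    print(f"RF={rf}, W={write_cl}, R={read_cl}: "
          f"{'strong' if strong else 'eventual'} consistency, "
          f"{'survives' if survives_one_loss else 'does NOT survive'} "
          f"the loss of one replica")

evaluate(rf=1, write_cl=1, read_cl=1)  # Patrick's setup: one copy of every row
evaluate(rf=3, write_cl=2, read_cl=2)  # quorum reads and writes on three replicas
evaluate(rf=3, write_cl=1, read_cl=1)  # fast but only eventually consistent
```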
Patrick's Diary of Lessons Learned, Part III

Further lessons include:

1. Backups are vital in a distributed environment, and there's no such thing as setting backup routines "too soon." Systems fail, and backups are there to restore as much of the important data as possible.

2. Every database system has a certain consistency model, and it's crucial to take that into account when designing your project. There might be compromises to make. In some use cases (think financial systems), consistency is the key. In other ones, eventual consistency is acceptable, as long as it keeps the system highly available and responsive.

The Spike Strikes Again

Months went by and Patrick's sleeping schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe.

If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking time zones into account, happened around 2am Patrick's time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends. Patrick was able to observe a real-life instantiation of Little's Law, which he vaguely remembered from his days at the university. With a beautifully concise formula, L = λW, the law can be simplified to the fact that concurrency equals throughput times latency.

Tip: For those having trouble remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/s. Then, it stands to reason that in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You're welcome!

Throughput depends on the hardware and naturally has its limits (e.g., you can't expect an NVMe drive purchased in 2023 to serve the data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!). Once the limit is hit, you can treat it as constant in the formula. It's then clear that as concurrency rises, so does latency. For the end-users—Malaysian teenagers in this scenario—it means that the latency is eventually going to cross the magic barrier for average human perception of a few seconds. Once that happens, users get too frustrated and simply give up on trying altogether, assuming that the system is broken beyond repair. It's easy to find online articles quoting that "Amazon found that 100ms of latency costs them 1 percent in sales"; although it sounds overly simplified, it is also true enough.
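Plugging made-up but plausible numbers into L = λW shows how quickly the latency wall approaches once throughput is pinned at the hardware limit:

```python
# Little's Law: L (concurrency) = lambda (throughput, req/s) * W (latency, s).
# Once the cluster is saturated, throughput is effectively constant, so any
# further growth in concurrency has to show up as latency: W = L / lambda.
MAX_THROUGHPUT = 5_000  # invented hardware limit, requests per second

for concurrency in (100, 1_000, 10_000, 50_000):
    latency_ms = concurrency / MAX_THROUGHPUT * 1000
    print(f"{concurrency:>6} concurrent requests -> "
          f"average latency of roughly {latency_ms:8.1f} ms")
```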
Patrick's Diary of Lessons Learned, Part IV

The lessons continue:

1. Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers (which is a very good idea for multiple reasons) but instead is accessed through a set of microservices under your control, make sure that the microservices are also aware of the concurrency limits and adhere to them (see the sketch after this list).

2. Keep in mind that Little's Law exists—it's fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers.
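As a sketch of the first lesson, here is one way a microservice can enforce a hard cap on database concurrency (asyncio-based; query_database is a hypothetical placeholder for the real driver call). Note that the slot is held for the whole lifetime of the call, which also sidesteps Joan's trap from earlier in the chapter: a request that merely timed out on the client side would otherwise stop being counted while the server is still working on it.

```python
import asyncio

MAX_CONCURRENCY = 100                        # hard cap on in-flight DB requests
db_slots = asyncio.Semaphore(MAX_CONCURRENCY)

async def query_database(statement: str) -> str:
    # Placeholder for the real driver call (hypothetical).
    await asyncio.sleep(0.01)
    return f"result of {statement!r}"

async def guarded_query(statement: str) -> str:
    # A slot is held for the entire lifetime of the database call, so the
    # number of requests this service has outstanding never exceeds
    # MAX_CONCURRENCY, no matter how many clients are hammering the service.
    async with db_slots:
        return await query_database(statement)

async def main() -> None:
    # 500 incoming user requests, but at most 100 database calls at a time.
    results = await asyncio.gather(
        *(guarded_query(f"get order {i}") for i in range(500))
    )
    print(f"served {len(results)} requests")

if __name__ == "__main__":
    asyncio.run(main())
```

If the service also applies client-side timeouts, it should either keep holding the slot until the driver has truly abandoned the request or keep the cap conservative enough to absorb such stragglers.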
Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable.

Unfortunately, the next March 17th didn't go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and that the cluster was capable of handling the load of customers, with a healthy safe margin. But then the dashboards stopped, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick's Diary of Lessons Learned, Part V

Concluding thoughts:

1. Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen and you must take them into consideration because they're an internal source of concurrency and resource consumption.

2. Whenever possible, schedule maintenance operations for times with expected low pressure on the system.

3. If your database management system supports any kind of quality of service configuration, it's a good idea to investigate such capabilities. For instance, it might be possible to set a strong priority for user requests over regular maintenance operations, especially during peak hours. Conversely, periods with low user-induced activity can be utilized to speed up background activities. In the database world, systems that use a variant of LSM trees for underlying storage need to perform quite a bit of compaction (a kind of maintenance operation on data) in order to keep the read/write performance predictable and steady.

The end.

Summary

Meeting database performance expectations can sometimes seem like a never-ending pain. As soon as you diagnose and address one problem, another is likely lurking right behind it. The next chapter helps you anticipate the challenges and opportunities you are most likely to face given your technical requirements and business expectations.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.