Database Performance at Scale
A Practical Guide

Felipe Cardeneti Mendes · Piotr Sarna · Pavel Emelyanov · Cynthia Dunlop

Database Performance at Scale: A Practical Guide
ISBN-13 (pbk): 978-1-4842-9710-0
ISBN-13 (electronic): 978-1-4842-9711-7
https://doi.org/10.1007/978-1-4842-9711-7

Copyright © 2023 by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Open Access: This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Jonathan Gennick
Development Editor: Laura Berendson
Editorial Project Manager: Shaul Elson
Copy Editor: Kezia Endsley
Cover designed by eStudioCalamar

Distributed to the book trade worldwide by Springer Science+Business Media LLC, 1 New York Plaza, Suite 4600, New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc).
SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights, please e-mail bookpermissions@springernature.com.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub (https://github.com/Apress). For more detailed information, please visit https://www.apress.com/gp/services/source-code.

Paper in this product is recyclable.

Felipe Cardeneti Mendes, São Paulo, Brazil
Piotr Sarna, Pruszków, Poland
Pavel Emelyanov, Moscow, Russia
Cynthia Dunlop, Carpinteria, CA, USA

To Cristina and Snow
—Felipe

To Wiktoria
—Piotr

To Svetlana and Mykhailo
—Pavel

To David
—Cynthia

Table of Contents

About the Authors .... xiii
About the Technical Reviewers .... xv
Acknowledgments .... xvii
Introduction .... xix
Chapter 1: A Taste of What You're Up Against: Two Tales .... 1
Joan Dives Into Drivers and Debugging .... 1
Joan's Diary of Lessons Learned, Part I .... 3
The Tuning .... 3
Joan's Diary of Lessons Learned, Part II .... 5
Patrick's Unlucky Green Fedoras .... 6
Patrick's Diary of Lessons Learned, Part I .... 7
The First Spike .... 8
Patrick's Diary of Lessons Learned, Part II .... 8
The First Loss .... 9
Patrick's Diary of Lessons Learned, Part III .... 9
The Spike Strikes Again .... 10
Patrick's Diary of Lessons Learned, Part IV .... 11
Backup Strikes Back .... 11
Patrick's Diary of Lessons Learned, Part V .... 12
Summary .... 13
Chapter 2: Your Project, Through the Lens of Database Performance .... 15
Workload Mix (Read/Write Ratio) .... 15
Write-Heavy Workloads .... 16
Read-Heavy Workloads .... 17
Mixed Workloads .... 19
Delete-Heavy Workloads .... 20
Competing Workloads (Real-Time vs Batch) .... 21
Item Size .... 23
Item Type

Open Access: This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

CHAPTER 2
Your Project, Through the Lens of Database Performance

The specific database performance constraints and optimization opportunities your team will face vary wildly based on your specific workload, application, and business expectations. This chapter is designed to get you and your team talking about how much you can feasibly optimize your performance, spotlight some specific lessons related to common situations, and also help you set realistic expectations if you're saddled with burdens like large payload sizes and strict consistency requirements. The chapter starts by looking at technical factors, such as the read/write ratio of your workload, item size/type, and so on. Then, it shifts over to business considerations like consistency requirements and high availability expectations. Throughout, the chapter talks about database attributes that have proven to be helpful—or limiting—in different contexts.
Note: Since this chapter covers a broad range of scenarios, not everything will be applicable to your specific project and workload. Feel free to skim this chapter and focus on the sections that seem most relevant.

Workload Mix (Read/Write Ratio)

Whether it's read-heavy, write-heavy, evenly mixed, delete-heavy, and so on, understanding and accommodating your read/write ratio is a critical but commonly overlooked aspect of database performance. Some databases shine with read-heavy workloads, others are optimized for write-heavy situations, and some are built to accommodate both. Selecting, or sticking with, one that's a poor fit for your current and future situation will be a significant burden that will be difficult to overcome, no matter how strategically you optimize everything else.

There's also a significant impact on cost. That might not seem directly related to performance, but if you can't afford (or get approval for) the infrastructure that you truly need to support your workload, this will clearly limit your performance. [1]

Tip: Not sure what your workload looks like? This is one of many situations where observability is your friend. If your existing database doesn't help you profile your workload, consider whether it's feasible to try your workloads on a compatible database that enables deeper visibility.

Write-Heavy Workloads

If you have a write-heavy workload, we strongly recommend a database that stores data in immutable files (e.g., Cassandra, ScyllaDB, and others that use LSM trees). [2] These databases optimize write speed because: 1) writes are sequential, which is faster in terms of disk I/O, and 2) writes are performed immediately, without first worrying about reading or updating existing values (like databases that rely on B-trees do). As a result, you can typically write a lot of data with very low latencies.

However, if you opt for a write-optimized database, be prepared for higher storage requirements and the potential for slower reads. When you work with immutable files, you'll need sufficient storage to keep all the immutable files that build up until compaction runs. [3] You can mitigate the storage needs to some extent by choosing compaction strategies carefully. Plus, storage is relatively inexpensive these days.

[1] With write-heavy workloads, you can easily spend millions per month with Bigtable or DynamoDB. Read-heavy workloads are typically less costly in these pricing models.
[2] If you want a quick introduction to LSM trees and B-trees, see Appendix A. Chapter 4 also discusses B-trees in more detail.
[3] Compaction is a background process that databases with an LSM tree storage backend use to merge and optimize the shape of the data. Since files are immutable, the process essentially involves picking up two or more pre-existing files, merging their contents, and producing a sorted output file.

The potential for read amplification is generally a more significant concern with write-optimized databases (given all the files to search through, more disk reads are required per read request).
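To make that write path concrete, here is a minimal, illustrative Python sketch of the LSM idea described above. It is a toy model, not any particular database's implementation: writes land in an in-memory memtable, full memtables are flushed to immutable sorted files, and compaction merges those files.

    import bisect

    class ToyLSMTree:
        """Toy log-structured merge tree, for illustration only."""
        def __init__(self, memtable_limit=4):
            self.memtable = {}        # in-memory, mutable write buffer
            self.sstables = []        # immutable, sorted (key, value) lists
            self.memtable_limit = memtable_limit

        def write(self, key, value):
            # Writes never touch existing files; they only update the memtable.
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                self._flush()

        def _flush(self):
            # Flushing produces a new immutable, sorted "file."
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

        def read(self, key):
            # Reads may check the memtable plus every SSTable: this is the
            # read amplification mentioned above.
            if key in self.memtable:
                return self.memtable[key]
            for table in reversed(self.sstables):      # newest data wins
                idx = bisect.bisect_left(table, (key,))
                if idx < len(table) and table[idx][0] == key:
                    return table[idx][1]
            return None

        def compact(self):
            # Compaction merges immutable files into fewer, larger ones,
            # reclaiming space and reducing how many files a read must touch.
            merged = {}
            for table in self.sstables:                # oldest to newest
                merged.update(dict(table))
            self.sstables = [sorted(merged.items())]

Real LSM engines add commit logs, Bloom filters, and tiered or leveled compaction strategies, but the core tradeoff is the same one discussed in this section: sequential, immutable writes in exchange for extra work on reads and during compaction.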
But read performance doesn't necessarily need to suffer. You can often minimize this tradeoff with a write-optimized database that implements its own caching subsystem (as opposed to those that rely on the operating system's built-in cache), enabling fast reads to coexist alongside extremely fast writes. Bypassing the underlying OS with a performance-focused built-in cache should speed up your reads nicely, to the point where the latencies are nearly comparable to read-optimized databases.

With a write-heavy workload, it's also essential to have extremely fast storage, such as NVMe drives, if your peak throughput is high. Having a database that can theoretically store values rapidly ultimately won't help if the disk itself can't keep pace.

Another consideration: Beware that write-heavy workloads can result in surprisingly high costs as you scale. Writes cost around five times more than reads under some vendors' pricing models. Before you invest too much effort in performance optimizations, and so on, it's a good idea to price your solution at scale and make sure it's a good long-term fit.

Read-Heavy Workloads

With read-heavy workloads, things change a bit. B-tree databases (such as DynamoDB) are optimized for reads (that's the payoff for the extra time required to update values on the write path). However, the advantage that read-optimized databases offer for reads is generally not as significant as the advantage that write-optimized databases offer for writes, especially if the write-optimized database uses internal caching to make up the difference (as noted in the previous section).

Careful data modeling will pay off in spades for optimizing your reads. So will careful selection of read consistency (are eventually consistent reads acceptable as opposed to strongly consistent ones?), locating your database near your application, and performing a thorough analysis of your query access patterns. Thinking about your access patterns is especially crucial for success with a read-heavy workload. Consider aspects such as the following:

• What is the nature of the data that the application will be querying most frequently? Does it tolerate potentially stale reads or does it require immediate consistency?
• How frequently is it accessed (e.g., is it frequently accessed "hot" data that is likely cached, or is it rarely accessed "cold" data)?
• Does it require aggregations, JOINs, and/or querying flexibility on fields that are not part of your primary key component?
• Speaking of primary keys, what is the level of cardinality?

For example, assume that your use case requires dynamic querying capabilities (such as type-ahead use cases, report-building solutions, etc.) where you frequently need to query data from columns other than your primary/hash key component. In this case, you might find yourself performing full table scans all too frequently, or relying on too many indexes. Both of these, in one way or another, may eventually undermine your read performance.
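As one concrete illustration of tuning read consistency per query, here is a hedged sketch using the Python driver for Cassandra-compatible databases. The contact points, keyspace, table, and key values are hypothetical placeholders, not part of the original text.

    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # Hypothetical cluster and schema, for illustration only.
    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect("catalog")

    # Strongly consistent read: waits for a quorum of replicas (higher latency).
    price_now = session.execute(
        SimpleStatement(
            "SELECT price FROM products WHERE sku = %s",
            consistency_level=ConsistencyLevel.QUORUM,
        ),
        ("SKU-123",),
    ).one()

    # Eventually consistent read: a single replica answers (lower latency,
    # but the value may briefly be stale).
    price_fast = session.execute(
        SimpleStatement(
            "SELECT price FROM products WHERE sku = %s",
            consistency_level=ConsistencyLevel.ONE,
        ),
        ("SKU-123",),
    ).one()

The right choice per query depends on how much staleness the read can tolerate, which is exactly the first question in the list above.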
On the infrastructure side, selecting servers with high memory footprints is key for enabling low read latencies if you will mostly serve data that is frequently accessed. On the other hand, if your reads mostly hit cold data, you will want a nice balance between your storage speeds and memory. In fact, many distributed databases typically reserve some memory space specifically for caching indexes; this way, reads that inevitably require going to disk won't waste I/O by scanning through irrelevant data.

What if the use case requires reading from both hot and cold data at the same time? And what if you have different latency requirements for each set of data? Or what if you want to mix a real-time workload on top of your analytics workload for the very same dataset? Situations like this are quite common. There's no one-size-fits-all answer, but here are a few important tips:

• Some databases will allow you to read data without polluting your cache (e.g., filling it up with data that is unlikely to be requested again); see the sketch after this list for one example. Using such a mechanism is especially important when you're running large scans while simultaneously serving real-time data. If the large scans were allowed to override the previously cached entries that the real-time workload required, those reads would have to go through disk and get repopulated into the cache again. This would effectively waste precious processing time and result in elevated latencies.
• For use cases requiring a distinction between hot/cold data storage (for cost savings, different latency requirements, or both), solutions using tiered storage (a method of prioritizing data storage based on a range of requirements, such as performance and costs) are likely a good fit.
• Some databases will permit you to prioritize some workloads over others. If that's not sufficient, you can go one step further and completely isolate such workloads logically. [4]

[4] The "Competing Workloads" section later in this chapter, as well as the "Workload Isolation" section in Chapter 8, cover a few options for prioritizing and separating workloads.
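For instance, ScyllaDB exposes this kind of cache-friendly scanning through a CQL extension. A hedged sketch (the keyspace, table, and columns are hypothetical, and the exact feature set depends on the database and version you run):

    from cassandra.cluster import Cluster

    # Hypothetical analytics job scanning a large table while an OLTP
    # workload serves real-time traffic from the same cluster.
    session = Cluster(["10.0.0.1"]).connect("analytics")

    # BYPASS CACHE asks the database not to populate its row cache with the
    # results of this scan, so the hot entries that the real-time workload
    # depends on stay cached.
    rows = session.execute(
        "SELECT user_id, event, ts FROM events_by_day "
        "WHERE day = %s BYPASS CACHE",
        ("2023-06-01",),
    )
    scanned = sum(1 for _ in rows)   # stand-in for real downstream processing
    print(scanned, "rows scanned without touching the cache")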
Note: You might not need all your reads. At ScyllaDB, we've come across a number of cases where teams are performing reads that they don't really need. For example, by using a read-before-write approach to avoid race conditions where multiple clients are trying to update the same value with different updates at the same time. The details of the solution aren't relevant here, but it is important to note that, by rethinking their approach, they were able to shave latencies off their writes as well as speed up the overall response by eliminating the unnecessary read. The moral here: getting new eyes on your existing approaches might surface a way to unlock unexpected performance optimizations.

Mixed Workloads

More evenly mixed access patterns are generally even more complex to analyze and accommodate. In general, the reason that mixed workloads are so complex in nature is that there are two competing workloads from the database perspective. Databases are essentially made for just two things: reading and writing. The way that different databases handle a variety of competing workloads is what truly differentiates one solution from another. As you test and compare databases, experiment with different read/write ratios so you can adequately prepare yourself for scenarios when your access patterns may change.

Be sure to consider nuances like whether your reads are from cold data (data not often accessed) or hot data (data that's accessed often and likely cached). Analytics use cases tend to read cold data frequently because they need to process large amounts of data. In this case, disk speeds are very important for overall performance. Plus, you'll want a comfortably large amount of memory so that the database's cache can hold the data that you need to process. On the other hand, if you frequently access hot data, most of your data will be served from the cache, in such a way that the disk speeds become less important (although not negligible).

Tip: Not sure if your reads are from cold or hot data? Take a look at the ratio of cache misses in your monitoring dashboards. For more on monitoring, see Chapter 10.

If your ratio of cache misses is higher than hits, this means that reads need to frequently hit the disks in order to look up your data. This may happen because your database is underprovisioned in memory space, or simply because the application access patterns often read infrequently accessed data. It is important to understand the performance implications here. If you're frequently reading from cold data, there's a risk that I/O will become the bottleneck—for writes as well as reads. In that case, if you need to improve performance, adding more nodes or switching your storage medium to a faster solution could be helpful.

As noted earlier, write-optimized databases can improve read latency via internal caching, so it's not uncommon for a team with, say, 60 percent reads and 40 percent writes to opt for a write-optimized database. Another option is to improve the latency of reads with a write-optimized database: If your database supports it, dedicate extra "shares" of resources to the reads so that your read workload is prioritized when there is resource contention.

Delete-Heavy Workloads

What about delete-heavy workloads, such as using your database as a durable queue (saving data from a producer until the consumer accesses it, deleting it, then starting the cycle over and over again)? Here, you generally want to avoid databases that store data in immutable files and use tombstones to mark rows and columns that are slated for deletion. The most notable examples are Cassandra and other Cassandra-compatible databases.

Tombstones consume cache space and disk resources, and the database needs to search through all these tombstones to reach the live data. For many workloads, this is not a problem. But for delete-heavy workloads, generating an excessive amount of tombstones will, over time, significantly degrade your read latencies.
There are ways and mechanisms to mitigate the impact of tombstones. [5] However, in general, if you have a delete-heavy workload, it may be best to use a different database.

It is important to note that occasional deletes are generally fine on Cassandra and Cassandra-compatible databases. Just be aware of the fact that deletes on append-only databases result in tombstone writes. As a result, these may incur read amplification, elevating your read latencies. Tombstones and data eviction in these types of databases are potentially long and complex subjects that perhaps could have their own dedicated chapter. However, the high-level recommendation is to exercise caution if you have a potentially delete-heavy pattern that you might later read from, and be sure to combine it with a compaction strategy tailored for efficient data eviction.

All that being said, it is interesting to note that some teams have successfully implemented delete-heavy workloads on top of Cassandra and Cassandra-like databases. The performance overhead carried by tombstones is generally circumvented by a combination of data modeling, a careful study of how deletes are performed, avoiding reads that potentially scan through a large set of deleted data, and careful tuning of the underlying table's compaction strategy to ensure that tombstones get evicted in a timely manner. For example, Tencent Games used the Time Window Compaction Strategy to aggressively expire tombstones and use it as the foundation for a time series distributed queue. [6]

[5] For some specific recommendations, see the DataStax blog, "Cassandra Anti-Patterns: Queues and Queue-like Datasets" (www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets).
[6] See the article, "Tencent Games' Real-Time Event-Driven Analytics System Built with ScyllaDB + Pulsar" (https://www.scylladb.com/2023/05/15/tencent-games-real-time-event-driven-analytics-systembuilt-with-scylladb-pulsar/).
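To illustrate what that kind of tuning can look like, here is a hedged sketch of a time-series, queue-like table. The keyspace, table, and parameter values are hypothetical; the point is that expiration and tombstone eviction are handled at the table level with a TTL plus the Time Window Compaction Strategy, so whole time windows can be dropped instead of reads wading through individual tombstones.

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect("queues")

    # Hypothetical schema: messages are bucketed by day and expire after
    # 24 hours. TimeWindowCompactionStrategy groups SSTables by time window,
    # so a fully expired window can be discarded as a unit.
    session.execute("""
        CREATE TABLE IF NOT EXISTS messages_by_day (
            day     date,
            bucket  int,
            ts      timeuuid,
            payload blob,
            PRIMARY KEY ((day, bucket), ts)
        ) WITH default_time_to_live = 86400
          AND compaction = {
            'class': 'TimeWindowCompactionStrategy',
            'compaction_window_unit': 'HOURS',
            'compaction_window_size': '1'
          }
    """)

Whether such a design is appropriate depends on the access pattern and the database you run; the takeaway is that eviction is deliberately tuned rather than left to the default compaction behavior.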
Competing Workloads (Real-Time vs Batch)

If you're working with two different types of workloads—one more latency-sensitive than the other—the ideal solution is to have the database dedicate more resources to the more latency-sensitive workloads to keep them from faltering due to insufficient resources. This is commonly the case when you are attempting to balance OLTP (real-time) workloads, which are user-facing and require low latency responses, with OLAP (analytical) workloads, which can be run in batch mode and are more focused on throughput (see Figure 2-1). Or, you can prioritize analytics. Both are technically feasible; it just boils down to what's most important for your use case.

Figure 2-1. OLTP vs OLAP workloads

For example, assume you have a web server database with analytics. It must support two workloads:

• The main workload consists of queries triggered by a user clicking or navigating on some areas of the web page. Here, users expect high responsiveness, which usually translates to requirements for low latency. You need low timeouts with load shedding as your overload response, and you would like to have a lot of dedicated resources available whenever this workload needs them.
• A second workload drives analytics being run periodically to collect some statistics or to aggregate some information that should be presented to users. This involves a series of computations. It's a lot less sensitive to latency than the main workload; it's more throughput oriented. You can have fairly large timeouts to accommodate always-full queues. You would like to throttle requests under load so the computation is stable and controllable. And finally, you would like the workload to have very few dedicated resources and use mostly unused resources to achieve better cluster utilization.

Running on the same cluster, such workloads would be competing for resources. As system utilization rises, the database must strictly prioritize which activities get what specific share of resources under contention. There are a few different ways you can handle this. Physical isolation, logical isolation, and scheduled isolation can all be acceptable choices under the right circumstances. Chapter 8 covers these options.

Item Size

The size of the items you are storing in the database (average payload size) will dictate whether your workload is CPU bound or storage bound. For example, running 100K OPS with an average payload size of 90KB is much different than achieving the same throughput with a 1KB payload. Higher payloads require more processing, I/O, and network traffic than smaller payloads.

Without getting too deep into database internals here, one notable impact is on the page cache. Assuming a default page cache size of 4KB, the database would have to serve several pages for the largest payload—that's much more I/O to issue, process, merge, and serve back to the application clients. With the 1KB example, you could serve it from a single-page cache entry, which is less taxing from a compute resource perspective. Conversely, having a large number of smaller-sized items may introduce CPU overhead compared to having a smaller number of larger items because the database must process each arriving item individually.

In general, the larger the payload gets, the more cache activity you will have. Most write-optimized databases will store your writes in memory before persisting that information to the disk (in fact, that's one of the reasons why they are write-optimized). Larger payloads deplete the available cache space more frequently, and this incurs higher flushing activity to persist the information on disk in order to release space for more incoming writes. Therefore, more disk I/O is needed to persist that information. If you don't size this properly, it can become a bottleneck throughout this repetitive process.

When you're working with extremely large payloads, it's important to set realistic latency and throughput expectations. If you need to serve 200KB payloads, it's unlikely that any database will enable you to achieve single-digit millisecond latencies. Even if the entire dataset is served from cache, there's a physical barrier between your client and the database: networking.
The network between them will eventually throttle your transfer speeds, even with an insanely fast client and database. Eventually, this will impact throughput as well as latency. As your latency increases, your client will eventually throttle down and you won't be able to achieve the same throughput that you could with smaller payload sizes. The requests would be stalled, queuing in the network. [7]

[7] There are alternatives to this; for example, RDMA, DPDK, and other solutions. However, most use cases do not require such solutions, so they are not covered in detail here.

Generally speaking, databases should not be used to store large blobs. We've seen people trying to store gigabytes of data within a single key in a database—and this isn't a great idea. If your item size is reaching this scale, consider alternative solutions. One solution is to use CDNs. Another is to store the largest chunk of your payload in cold storage like Amazon S3 buckets, Google Cloud storage, or Azure blob storage. Then, use the database as a metadata lookup: It can read the data and fetch an identifier that will help find the data in that cold storage. For example, this is the strategy used by a game developer converting extremely large (often in the gigabyte range) content to popular gaming platforms. They store structured objects with blobs that are referenced by a content hash. The largest payload is stored within a cloud vendor Object Storage solution, whereas the content hash is stored in a distributed NoSQL database. [8]

[8] For details, see the Epic Games talk, "Using ScyllaDB for Distribution of Game Assets in Unreal Engine" (www.youtube.com/watch?v=aEgP9YhAb08).
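A minimal sketch of that metadata-lookup pattern might look like the following. The bucket name and record fields are hypothetical, and the metadata store is simplified to a plain dictionary standing in for the database so the flow stays visible.

    import hashlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "game-assets-prod"      # hypothetical bucket name

    metadata_store = {}              # stands in for the distributed database

    def put_asset(asset_id: str, blob: bytes) -> None:
        # Store the heavy payload in object storage, keyed by its content hash...
        content_hash = hashlib.sha256(blob).hexdigest()
        s3.put_object(Bucket=BUCKET, Key=content_hash, Body=blob)
        # ...and keep only small, queryable metadata in the database.
        metadata_store[asset_id] = {"hash": content_hash, "size": len(blob)}

    def get_asset(asset_id: str) -> bytes:
        meta = metadata_store[asset_id]                        # cheap lookup
        obj = s3.get_object(Bucket=BUCKET, Key=meta["hash"])   # heavy fetch
        return obj["Body"].read()

The database stays fast because it only ever handles small items; the object store absorbs the multi-megabyte (or multi-gigabyte) transfers.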
Note that some databases impose hard limits on item size. For example, DynamoDB currently has a maximum item size of 400KB. This might not suit your needs. On top of that, if you're using an in-memory solution such as Redis, larger keys will quickly deplete your memory. In this case, it might make sense to hash/compress such large objects prior to storing them.

No matter which database you choose, the smaller your payload, the greater your chances of introducing memory fragmentation. This might reduce your memory efficiency, which might in turn elevate costs because the database won't be able to fully utilize its available memory.

Item Type

The item type has a large impact on compression, which in turn impacts your storage utilization. If you're frequently storing text, expect to take advantage of a high compression ratio. But that's not the case for random and uncommon blob sequences. Here, compression is unlikely to make a measurable reduction in your storage footprint. If you're concerned about your use case's storage utilization, using a compression-friendly item type can make a big difference.

If your use case dictates a certain item type, consider databases that are optimized for that type. For example, if you need to frequently process JSON data that you can't easily transform, a document database like MongoDB might be a better option than a Cassandra-compatible database. If you have JSON with some common fields and others that vary based on user input, it might be complicated—though possible—to model them in Cassandra. However, you'd incur a penalty from the serialization/deserialization overhead required on the application side.

As a general rule of thumb, choose the data type that's the minimum needed to store the type of data you need. For example, you don't need to store a year as a bigint. If you define a field as a bigint, most databases allocate relevant memory address spaces for holding it. If you can get by with a smaller type such as int, do it—you'll save bytes of memory, which could add up at scale. Even if the database you use doesn't pre-allocate memory address spaces according to data types, choosing the correct one is still a nice way to have an organized data model—and also to avoid future questions around why a particular data type was chosen as opposed to another.

Many databases support additional item types which suit a variety of use cases. Collections, for example, allow you to store sets, lists, and maps (key-value pairs) under a single column in wide column databases. Such data types are often misused and lead to severe performance problems. In fact, most of the data modeling problems we've come across involve misuse of collections. Collections are meant to store a small amount of information (such as the phone numbers of an individual or different home/business addresses). However, collections with hundreds of thousands of entries are unfortunately not as rare as you might expect. They end up introducing a severe deserialization overhead on the database. At best, this translates to higher latencies. At worst, this makes the data entirely unreadable due to the latency involved when scanning through the high number of items under such columns.

Some databases also support user-created fields, such as User-Defined Types (UDTs) in Cassandra. UDTs can be a great ally for reducing the deserialization overhead when you combine several columns into one. Think about it: Would you rather deserialize four Boolean columns individually or a single column with four Boolean values? UDTs will typically shine on deserializing several values as a single column, which may give you a nice performance boost. [9] Just like collections, however, UDTs should not be misused—and misusing UDTs can lead to the same severe impacts that are incurred by collections.

[9] For some specific examples of how UDTs impact performance, see the performance benchmark that ScyllaDB performed with different UDT sizes against individual columns: "If You Care About Performance, Employ User Defined Types" (https://www.scylladb.com/2017/12/07/performance-udt/).

Note: UDTs are quite extensively covered in Chapter 6.
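As a quick, hedged illustration of the idea (the schema is hypothetical; the syntax shown is the Cassandra/ScyllaDB flavor of CQL issued through the Python driver), a UDT lets the four flags below travel as one serialized value instead of four separate columns:

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect("accounts")

    # One user-defined type holding four related flags...
    session.execute("""
        CREATE TYPE IF NOT EXISTS user_flags (
            email_opt_in  boolean,
            sms_opt_in    boolean,
            beta_features boolean,
            suspended     boolean
        )
    """)

    # ...used as a single (frozen) column, so it is serialized and
    # deserialized as one value rather than four.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users (
            user_id uuid PRIMARY KEY,
            name    text,
            flags   frozen<user_flags>
        )
    """)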
Dataset Size

Knowing your dataset size is important for selecting appropriate infrastructure options. For example, AWS cloud instances have a broad array of NVMe storage offerings. Having a good grasp of how much storage you need can help you avoid selecting an instance that causes performance to suffer (if you end up with insufficient storage) or that's wasteful from a cost perspective (if you overprovision).

It's important to note that your selected storage size should not be equal to your total dataset size. You also need to factor in replication and growth—plus steer clear of 100 percent storage utilization.

For example, let's assume you have 3TB of already compressed data. The bare minimum to support a workload is your current dataset size multiplied by your anticipated replication. If you have 3TB of data with the common replication factor of three, that gives you 9TB. If you naively deployed this on three nodes supporting 3TB of data each, you'd hit near 100 percent disk utilization which, of course, is not optimal. Instead, if you factor in some free space and minimal room for growth, you'd want to start with at least six nodes of that size—each storing only 1.5TB of data. This gives you around 50 percent utilization. On the other hand, if your database cannot support that much data per node (every database has a limit) or if you do not foresee much future data growth, you could have six nodes supporting 2TB each, which would store approximately 1.5TB per replica under 75 percent utilization. Remember: Factoring in your growth is critical for avoiding unpleasant surprises in production, from an operational as well as a budget perspective.

Note: We very intentionally discussed the dataset size from a compressed data standpoint. Be aware that some database vendors measure your storage utilization with respect to uncompressed data. This often leads to confusion. If you're moving data from one database solution to another and your data is uncompressed (or you're not certain it's compressed), consider loading a small fraction of your total dataset beforehand in order to determine its compression ratio. Effective compression can dramatically reduce your storage footprint.

If you're working on a very fluid project and can't define or predict your dataset size, a serverless database deployment model might be a good option to provide easy flexibility and scaling. But be aware that rapid increases in overall dataset size and/or IOPS (depending on the pricing model) could cause the price to skyrocket exponentially. Even if you don't explicitly pay a penalty for storing a large dataset, you might be charged a premium for the many operations that are likely associated with that large dataset. Serverless is discussed more in Chapter 7.
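The node-count arithmetic from the example earlier in this section is simple enough to capture in a few lines. This is an illustrative helper only; the 50 percent utilization target is an assumption carried over from the example, not a universal rule.

    import math

    def nodes_needed(dataset_tb: float, replication_factor: int = 3,
                     node_capacity_tb: float = 3.0,
                     target_utilization: float = 0.5) -> int:
        """Rough node count for a given compressed dataset size."""
        total_tb = dataset_tb * replication_factor        # data plus replicas
        usable_per_node = node_capacity_tb * target_utilization
        return math.ceil(total_tb / usable_per_node)

    # The chapter's example: 3TB compressed, RF=3, 3TB nodes, ~50% utilization.
    print(nodes_needed(3.0))                                        # -> 6
    # The same dataset on 2TB nodes at ~75% utilization also lands on 6 nodes.
    print(nodes_needed(3.0, node_capacity_tb=2.0, target_utilization=0.75))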
Throughput Expectations

Your expected throughput and latency should be your "north star" from database and infrastructure selection all the way to monitoring. Let's start with throughput.

If you're serious about database performance, it's essential to know what throughput you're trying to achieve—and "high throughput" is not an acceptable answer. Specifically, try to get all relevant stakeholders' agreement on your target number of peak read operations per second and peak write operations per second for each workload.

Let's unravel that a little. First, be sure to separate read throughput vs write throughput. A database's read path is usually quite distinct from its write path. It stresses different parts of the infrastructure and taps different database internals. And the client/user experience of reads is often quite different than that of writes. Lumping them together into a meaningless number won't help you much with respect to performance measurement or optimization. The main use for average throughput is in applying Little's Law (more on that in the "Concurrency" section a little later in this chapter).

Another caveat: The same database's past or current throughput with one use case is no guarantee of future results with another—even if it's the same database hosted on identical infrastructure. There are too many different factors at play (item size, access patterns, concurrency… all the things in this chapter, really). What's a great fit for one use case could be quite inappropriate for another.

Also, note the emphasis on peak operations per second. If you build and optimize with an average in mind, you likely won't be able to service beyond the upper ranges of that average. Focus on the peak throughput that you need to sustain to cover your core needs and business patterns—including surges. Realize that databases can often "boost" to sustain short bursts of exceptionally high load. However, to be safe, it's best to plan for your likely peaks and reserve boosting for atypical situations.

Also, be sure not to confuse concurrency with throughput. Throughput is the speed at which the database can perform read or write operations; it's measured in the number of read or write operations per second. Concurrency is the number of requests that the client sends to the database at the same time (which, in turn, will eventually translate to a given number of concurrent requests queuing at the database for execution). Concurrency is expressed as a hard number, not a rate over a period of time. Not every request that is born at the same time will be able to be processed by the database at the same time. Your client could send 150K requests to the database, all at once. The database might blaze through all these concurrent requests if it's running at 500K OPS. Or, it might take a while to process them if the database throughput tops out at 50K OPS.

It is generally possible to increase throughput by increasing your cluster size (and/or power). But you also want to pay special attention to concurrency, which will be discussed in more depth later in this chapter as well as in Chapter 5. For the most part, high concurrency is essential for achieving impressive performance. But if the clients end up overwhelming the database with a concurrency that it can't handle, throughput will suffer, then latency will rise as a side effect. A friendly reminder that transcends the database world: No system, distributed or not, supports unlimited concurrency. Period.

Note: Even though scaling a cluster boosts your database processing capacity, remember that the application access patterns directly contribute to how much impact that will ultimately make. One situation where scaling a cluster may not provide the desired throughput increase is during a hot partition [10] situation, which causes traffic to be primarily targeted to a specific set of replicas. In these cases, throttling the access to such hot keys is fundamental for preserving the system's overall performance.

[10] A hot partition is a data access imbalance problem that causes specific partitions to receive more traffic compared to others, thus introducing higher load on a specific set of replica servers.
Latency Expectations

Latency is a more complex challenge than throughput: You can increase throughput by adding more nodes, but there's no simple solution for reducing latency. The lower the latency you need to achieve, the more important it becomes to understand and explore database tradeoffs and internal database optimizations that can help you shave milliseconds or microseconds off latencies. Database internals, driver optimizations, efficient CPU utilization, sufficient RAM, efficient data modeling… everything matters.

As with throughput, aim for all relevant stakeholders' agreement on the acceptable latencies. This is usually expressed as latency for a certain percentile of requests. For performance-sensitive workloads, tracking at the 99th percentile (P99) is common. Some teams go even higher, such as the P9999, which refers to the 99.99th percentile.

As with throughput, avoid focusing on average (mean) or median (P50) latency measurements. Average latency is a theoretical measurement that is not directly correlated to anything systems or users experience in reality. Averages conceal outliers: extreme deviations from the norm that may have a large and unexpected impact on overall system performance, and hence on user experience.

For example, look at the discrepancy between average latencies and P99 latencies in Figure 2-2 (different colors represent different database nodes). P99 latencies were often double the average for reads, and even worse for writes.

Figure 2-2. A sample database monitoring dashboard. Note the difference between average and P99 latencies

Note that monitoring systems are sometimes configured in ways that omit outliers. For example, if a monitoring system is calibrated to measure latency on a scale of 0 to 1000ms, it is going to overlook any larger measurements—thus failing to detect the serious issues of query timeouts and retries.

P99 and above percentiles are not perfect. [11] But for latency-sensitive use cases, they're the number you'll want to keep in mind as you are selecting your infrastructure, benchmarking, monitoring, and so on.

[11] For a detailed critique, see Gil Tene's famous "Oh Sh*t" talk (www.youtube.com/watch?v=lJ8ydIuPFeU) as well as his recent P99 CONF talk on Misery Metrics and Consequences (https://www.p99conf.io/session/misery-metrics-consequences/).
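The gap between mean and tail latency is easy to demonstrate. The snippet below fabricates a latency sample purely to show the effect; the numbers are made up and are not from any benchmark.

    import random
    import statistics

    random.seed(7)
    # 10,000 requests: most complete in a few milliseconds, but 1% stall.
    latencies_ms = [random.uniform(1, 4) for _ in range(9900)] + \
                   [random.uniform(80, 200) for _ in range(100)]

    latencies_ms.sort()
    mean = statistics.mean(latencies_ms)
    p50 = latencies_ms[len(latencies_ms) // 2]
    p99 = latencies_ms[int(len(latencies_ms) * 0.99)]

    print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
    # The mean and median look healthy (a few milliseconds), while P99
    # exposes the stalls that real users actually feel.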
Also, be clear about what exactly is involved in the P99 you are looking to achieve. Database latency is the time that elapses between when the database receives a request, processes it, and sends back an appropriate response. Client-side latency is broader: Here, the measurement starts with the client sending the request and ends with the client receiving the database's response. It includes the network time and client-side processing. There can be quite a discrepancy between database latency and client-side latency; a ten times higher client-side latency isn't all that uncommon (although clearly not desirable). There could be many culprits to blame for a significantly higher client-side latency than database latency: excessive concurrency, inefficient application architecture, coding issues, and so on. But that's beyond the scope of this discussion—beyond the scope of this book, even.

The key point here is that your team and all the stakeholders need to be on the same page regarding what you're measuring. For example, say you're given a read latency requirement of 15ms. You work hard to get your database to achieve that and report that you met the expectation—then you learn that stakeholders actually expect 15ms for the full client-side latency. Back to the drawing board.

Ultimately, it's important to track both database latency and client-side latency. You can optimize the database all you want, but if the application is introducing latency issues from the client side, a fast database won't have much impact. Without visibility into both the database and the client-side latencies, you're essentially flying half blind.

Concurrency

What level of concurrency should your database be prepared to handle? Depending on the desired qualities of service from the database cluster, concurrency must be judiciously balanced to reach appropriate throughput and latency values. Otherwise, requests will pile up waiting to be processed—causing latencies to spike, timeouts to rise, and the overall user experience to degrade.

Little's Law establishes that:

L = λW

where λ is the average throughput, W is the average latency, and L represents the total number of requests either being processed or on queue at any given moment when the cluster reaches steady state. Given that your throughput and latency targets are usually fixed, you can use Little's Law to estimate a realistic concurrency.

For example, if you want a system to serve 500,000 requests per second at 2.5ms average latency, the best concurrency is around 1,250 in-flight requests. As you approach the saturation limit of the system—around 600,000 requests per second for read requests—increases in concurrency will keep throughput constant since this is the physical limit of the database. Every new in-flight request will only cause increased latency.

In fact, if you approximate 600,000 requests per second as the physical capacity of this database, you can calculate the expected average latency at a particular concurrency point. For example, at 6,120 in-flight requests, the average latency is expected to be 6120/600,000 = 10ms.

Past the maximum throughput, increasing concurrency will increase latency. Conversely, reducing concurrency will reduce latency, provided that this reduction does not result in a decrease in throughput.

In some use cases, it's fine for queries to pile up on the client side. But many times it's not. In those cases, you can scale out your cluster or increase the concurrency on the application side—at least to the point where the latency doesn't suffer. It's a delicate balancing act. [12]

[12] For additional reading on concurrency, the Netflix blog "Performance Under Load" is a great resource (https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581).
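Here is the same reasoning expressed as a tiny calculation, using the numbers from the example above (illustrative only):

    def concurrency_for(throughput_ops: float, latency_s: float) -> float:
        """Little's Law: L = throughput * latency."""
        return throughput_ops * latency_s

    def expected_latency(concurrency: float, max_throughput_ops: float) -> float:
        """Once the database is saturated, latency is concurrency / capacity."""
        return concurrency / max_throughput_ops

    # 500,000 ops/s at 2.5ms average latency -> about 1,250 in-flight requests.
    print(concurrency_for(500_000, 0.0025))           # 1250.0
    # 6,120 in-flight requests against a ~600,000 ops/s ceiling -> ~10ms.
    print(expected_latency(6_120, 600_000) * 1000)    # ~10.2 (ms)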
Connected Technologies

A database can't rise above the slowest-performing link in your distributed data system. Even if your database is processing reads and writes at blazing speeds, it won't ultimately matter much if it interacts with an event-streaming platform that's not optimized for performance or involves transformations from a poorly-configured Apache Spark instance, for example.

This is just one of many reasons that taking a comprehensive and proactive approach to monitoring (more on this in Chapter 10) is so important. Given the complexity of databases and distributed data systems, it's hard to guess what component is to blame for a problem. Without a window into the state of the broader system, you could naively waste amazing amounts of time and resources trying to optimize something that won't make any difference.

If you're looking to optimize an existing data system, don't overlook the performance gains you can achieve by reviewing and tuning its connected components. Or, if your monitoring efforts indicate that a certain component is to blame for your client-side performance problems but you feel you've hit your limit with it, explore what's required to replace it with a more performant alternative. Use benchmarking to determine the severity of the impact from a performance perspective.

Also, note that some database offerings may have ecosystem limitations. For example, if you're considering a serverless deployment model, be aware that some Change Data Capture (CDC) connectors, drivers, and so on, might not be supported.

Demand Fluctuations

Databases might experience a variety of different demand fluctuations, ranging from predictable moderate fluctuations to unpredictable and dramatic spikes. For instance, the world's most watched sporting event experiences different fluctuations than a food delivery service, which experiences different fluctuations than an ambulance-tracking service—and all require different strategies and infrastructure.

First, let's look at the predictable fluctuations. With predictability, it's much easier to get ahead of the issue. If you're expected to support periodic big events that are known in advance (Black Friday, sporting championships, ticket on-sales, etc.), you should have adequate time to scale up your cluster for each anticipated spike. That means you can tailor your normal topology for the typical day-in, day-out demands without having to constantly incur the costs and admin burden of having that larger scale topology.

On the other side of the spikiness spectrum, there are applications with traffic that has dramatic peaks and valleys across the course of each day. For example, consider food delivery businesses, which face a sudden increase around lunch, followed by a few hours of minimal traffic, then a second spike at dinner time (and sometimes breakfast the following morning).
Expanding the cluster for each spike—even with "autoscaling" (more on autoscaling later in this chapter)—is unlikely to deliver the necessary performance gain fast enough. In these cases, you should provision an infrastructure that supports the peak traffic.

But not all spikes are predictable. Certain industries—such as emergency services, news, and social media—are susceptible to sudden massive spikes. In this case, a good preventative strategy is to control your concurrency on the client side, so it doesn't overwhelm your database. However, controlling concurrency might not be an option for use cases with strict end-to-end latency requirements. You can also scramble to scale out your clusters as fast as feasible when the spike occurs. This is going to be markedly simpler if you're on the cloud than if you're on-prem. If you can start adding nodes immediately, increase capacity incrementally—with a close eye on your monitoring results—and keep going until you're satisfied with the results, or until the peak has subsided. Unfortunately, there is a real risk that you won't be able to sufficiently scale out before the spike ends. Even if the ramp up begins immediately, you need to account for the time it takes to add new nodes, stream data to them, and rebalance the cluster.

If you're selecting a new database and anticipate frequent and sharp spikes, be sure to rigorously test how your top contenders respond under realistic conditions. Also, consider the costs of maintaining acceptable performance throughout these peaks.

Note: The word "autoscaling" insinuates that your database cluster auto-magically expands based on the traffic it is receiving. Not so. It's simply a robot enabling/disabling capacity that's pre-provisioned for you based on your target table settings. Even if you're not using this capacity, you might be paying for the convenience of having it set aside and ready to go. Also, it's important to realize that it's not instantaneous. It takes upwards of 2.5 hours to go from 0 rps to 40k. [13] This is not ideal for unexpected or extreme spikes.

Autoscaling is best when:
• Load changes have high amplitude
• The rate of change is in the magnitude of hours
• The load peak is narrow relative to the baseline [14]

[13] See The Burning Monk blog, "Understanding the Scaling Behaviour of DynamoDB OnDemand Tables" (https://theburningmonk.com/2019/03/understanding-the-scaling-behaviour-of-dynamodb-ondemand-tables/).
[14] For more on the best and worst uses of autoscaling, see Avishai Ish Shalom's blog, "DynamoDB Autoscaling Dissected: When a Calculator Beats a Robot" (www.scylladb.com/2021/07/08/dynamodb-autoscaling-dissected-when-a-calculator-beats-a-robot/).
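One simple way to apply the "control your concurrency on the client side" advice above is to cap in-flight requests in the application. A minimal asyncio sketch follows; the limit of 1,024 and the placeholder query function are assumptions for illustration, not recommendations.

    import asyncio

    MAX_IN_FLIGHT = 1024                  # assumed cap; tune it with Little's Law
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def fake_query(request_id: int) -> str:
        await asyncio.sleep(0.002)        # stands in for a real database call
        return f"result-{request_id}"

    async def handle(request_id: int) -> str:
        # During a spike, excess requests wait here instead of piling up
        # inside the database and inflating everyone's latency.
        async with limiter:
            return await fake_query(request_id)

    async def main() -> None:
        results = await asyncio.gather(*(handle(i) for i in range(10_000)))
        print(len(results), "requests completed")

    asyncio.run(main())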
These transactions, which are historically</p><p>the domain of RDBMS, bring a severe performance hit.</p><p>13 See The Burning Monk blog, “Understanding the Scaling Behaviour of DynamoDB</p><p>OnDemand Tables” (https://theburningmonk.com/2019/03/understanding-the-scaling-</p><p>behaviour-of-dynamodb-ondemand-tables/).</p><p>14 For more on the best and worst uses of autoscaling, see Avishai Ish Shalom’s blog, “DynamoDB</p><p>Autoscaling Dissected: When a Calculator Beats a Robot” (www.scylladb.com/2021/07/08/</p><p>dynamodb-autoscaling-dissected-when-a-calculator-beats-a-robot/).</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>35</p><p>It is true that distributed ACID compliant databases do exist—and that the past few</p><p>years have brought some distinct progress in the effort to minimize the performance</p><p>impact (e.g., through row-level locks or column-level locking and better conflict</p><p>resolution algorithms). However, some level of penalty will still exist.</p><p>As a general guidance, if you have an ACID-compliant use case, pay special attention</p><p>to your master nodes; these can easily become your bottlenecks since they will often</p><p>be your primary query coordinators (more on this in Appendix A). In addition, if at</p><p>all possible, try to ensure that the majority of your transactions are isolated to the</p><p>minimum amount of resources. For example, a transaction spanning a single row may</p><p>involve a specific set of replicas, whereas a transaction involving several keys may span</p><p>your cluster as a whole—inevitably increasing your latency. It is therefore important to</p><p>understand which types of transactions your target database supports. Some</p><p>vendors</p><p>may support a mix of approaches, while others excel at specific ones. For instance,</p><p>MongoDB introduced multi-document transactions on sharded clusters in its version</p><p>4.2; prior to that, it supported only multi-document transactions on replica sets.</p><p>If it’s critical to support transactions in a more performant manner, sometimes it’s</p><p>possible to rethink your data model and reimplement a use case in a way that makes it</p><p>suitable for a database that’s not ACID compliant. For example, one team who started</p><p>out with Postgres for all their use cases faced skyrocketing business growth. This is a very</p><p>common situation with startups that begin small and then suddenly find themselves in a</p><p>spot where they are unable to handle a spike in growth in a cost-effective way. They were</p><p>able to move their use cases to NoSQL by conducting a careful data-modeling analysis</p><p>and rethinking their use cases, access patterns, and the real business need of what truly</p><p>required ACID and what did not. This certainly isn’t a quick fix, but in the right situation,</p><p>it can pay off nicely.</p><p>Another option to consider: Performance-focused NoSQL databases like Cassandra</p><p>aim to support isolated conditional updates with capabilities such as lightweight</p><p>transactions that allow “atomic compare and set” operations. That is, the database</p><p>checks if a condition is true, and if so, it conducts the transaction. If the condition is not</p><p>met, the transaction is not completed. They are named “lightweight” since they do not</p><p>truly lock the database for the transaction. Instead, they use a consensus protocol to</p><p>ensure there is agreement between the nodes to commit the change. 
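To make the "atomic compare and set" idea concrete, here is a rough sketch of what a conditional update can look like from a client application, written against a CQL-compatible database with the DataStax C/C++ driver. The keyspace, table, and column names are made up for illustration, error handling is omitted, and the exact API may differ between driver versions, so treat this as a shape rather than a recipe.

#include <cassandra.h>
#include <cstdio>

int main() {
    // Connect to the cluster (the contact point is illustrative).
    CassCluster* cluster = cass_cluster_new();
    cass_cluster_set_contact_points(cluster, "127.0.0.1");
    CassSession* session = cass_session_new();
    CassFuture* connect = cass_session_connect(session, cluster);
    cass_future_wait(connect);
    cass_future_free(connect);

    // Lightweight transaction: the update is applied only if the current
    // value still matches the expected one ("compare and set").
    CassStatement* stmt = cass_statement_new(
        "UPDATE shop.inventory SET stock = 41 "
        "WHERE sku = 'abc-123' IF stock = 42", 0);
    CassFuture* exec = cass_session_execute(session, stmt);
    const CassResult* result = cass_future_get_result(exec);

    // The database reports whether the condition held in a boolean
    // "[applied]" column of the returned row.
    const CassRow* row = cass_result_first_row(result);
    cass_bool_t applied = cass_false;
    cass_value_get_bool(cass_row_get_column_by_name(row, "[applied]"), &applied);
    std::printf("applied: %s\n", applied ? "yes" : "no");

    cass_result_free(result);
    cass_future_free(exec);
    cass_statement_free(stmt);
    cass_session_free(session);
    cass_cluster_free(cluster);
    return 0;
}

If the condition does not hold, the row comes back with the current values and the write is simply not applied; the application decides whether to retry or give up.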
This capability was</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>36</p><p>introduced by Cassandra and it’s supported in several ways across different Cassandra-</p><p>compatible databases. If this is something you expect to use, it’s worth exploring the</p><p>documentation to understand the differences.15</p><p>However, it’s important to note that lightweight transactions have their limits. They</p><p>can’t support complex use cases like a retail transaction that updates the inventory</p><p>only after a sale is completed with a successful payment. And just like ACID-compliant</p><p>databases, lightweight transactions have their own performance implications. As a</p><p>result, the choice of whether to use them will greatly depend on the amount of ACID</p><p>compliance that your use case requires.</p><p>DynamoDB is a prime example of how the need for transactions will require more</p><p>compute resources (read: money). As a result, use cases relying heavily on ACID will</p><p>fairly often require much more infrastructure power to satisfy heavy usage requirements.</p><p>In the DynamoDB documentation, AWS recommends that you ensure the database is</p><p>configured for auto-scaling or that it has enough read/write capacity to account for the</p><p>additional overhead of transactions.16</p><p>Consistency Expectations</p><p>Most NoSQL databases opt for eventual consistency to gain performance. This is in</p><p>stark contrast to the RDBMS model, where ACID compliance is achieved in the form</p><p>of transactions, and, because everything is in a single node, the effort on locking and</p><p>avoiding concurrency clashes is often minimized. When deciding between a database</p><p>with strong or eventual consistency, you have to make a hard choice. Do you want to</p><p>sacrifice scalability and performance or can you accept the risk of sometimes serving</p><p>stale data?</p><p>Can your use case tolerate eventual consistency, or is strong consistency truly</p><p>required? Your choice really boils down to how much risk your application—and your</p><p>business—can tolerate with respect to inconsistency. For example, a retailer who</p><p>15 See Kostja Osipov’s blog, “Getting the Most Out of Lightweight Transactions in ScyllaDB”</p><p>(www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-</p><p>scylla/) for an example of how financial transactions can be implemented using Lightweight</p><p>Transactions.</p><p>16 See “Amazon DynamoDB Transactions: How it Works” (https://docs.aws.amazon.com/</p><p>amazondynamodb/latest/developerguide/transaction-apis.html).</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>37</p><p>(understandably) requires consistent pricing might want to pay the price for consistent</p><p>writes upfront during a weekly catalog update so that they can later serve millions of low-</p><p>latency read requests under more relaxed consistency levels. In other cases, it’s more</p><p>important to ingest data quickly and pay the price for consistency later (for example,</p><p>in the playback tracking use case that’s common in streaming platforms—where the</p><p>database needs to record the last viewing position for many users concurrently). Or</p><p>maybe both are equally important. For example, consider a social media platform that</p><p>offers live chat. 
Here, you want consistency on both writes and reads, but you likely don’t</p><p>need the highest consistency (the impact of an inconsistency here is likely much less</p><p>than with a financial report).</p><p>In some cases, “tunable consistency” will help you achieve a balance between strong</p><p>consistency and performance. This gives you the ability to tune the consistency at the</p><p>query level to suit what you’re trying to achieve. You can have some queries relying on a</p><p>quorum of replicas, then have other queries that are much more relaxed.</p><p>Regardless of your consistency requirements, you need to be aware of the</p><p>implications involved when selecting a given consistency level. Databases that offer</p><p>tunable consistency may be a blessing or a curse if you don’t know what you are doing.</p><p>Consider a NoSQL deployment spanning three different regions, with three nodes</p><p>each (nine nodes in total). A QUORUM read would essentially have to traverse two</p><p>different regions in order to be acknowledged back to the client. In that sense, if your</p><p>Network Round Trip Time (RTT)17 is 50ms, then it will take at least this amount of time</p><p>for the query to be considered successful by the database. Similarly, if you were to run</p><p>operations with the highest possible consistency (involving all replicas), then the failure</p><p>of a single node may bring your entire application down.</p><p>Note NoSQL databases fairly often will provide you with ways to confine your</p><p>queries to a specific region to prevent costly network round trips from impacting</p><p>your latency. but again, it all boils down to you what your use case requires.</p><p>17 RTT is the duration, typically measured in milliseconds, that a network request takes to reach a</p><p>destination, plus the time it takes for the packet to be received back at the origin.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>38</p><p>Geographic Distribution</p><p>Does your business need to support a regional or global customer base in the near-term</p><p>future? Where are your users and your application located? The greater the distance</p><p>between your users, your application, and your database, the more they’re going to face</p><p>high latencies that stem from the physical time it takes to move data across the network.</p><p>Knowing this will influence where you locate your database and how you design your</p><p>topology—more on this in Chapters 6 and 8.</p><p>The geographic distribution of your cluster might also be a requirement from a</p><p>disaster recovery perspective. In that sense, the cluster would typically serve data</p><p>primarily from a specific region, but failover to another in the event of a disaster (such as</p><p>a full region outage). These kinds of setups are costly, as they will require doubling your</p><p>infrastructure spend. However, depending on the nature of your use case, sometimes it’s</p><p>required.</p><p>Some organizations that invest in a multi-region deployment for the primary</p><p>purpose of disaster recovery end up using them to host isolated use cases. As explained</p><p>in the “Competing Workloads” section of this chapter, companies often prefer to</p><p>physically isolate OLTP from OLAP workloads. 
Moving some</p><p>isolated (less critical)</p><p>workloads to remote regions prevents these servers from being “idle” most of the time.</p><p>Regardless of the magnitude of compelling reasons that may drive you toward a</p><p>geographically dispersed deployment, here’s some important high-level advice from a</p><p>performance perspective (you’ll learn some more technical tips in Chapter 8):</p><p>1. Consider the increased load that your target region or regions will</p><p>receive in the event of a full region outage. For example, assume</p><p>that you operate globally across three regions, and all these three</p><p>regions serve your end-users. Are the two remaining regions able</p><p>to sustain the load for a long period of time?</p><p>2. Recognize that simply having a geographically-dispersed database</p><p>does not fully cover you in a disaster recovery situation. You also</p><p>need to have your application, web servers, messaging queue</p><p>systems, and so on, geographically replicated. If the only thing</p><p>that’s geo-replicated is your database, you won’t be in a great</p><p>position when your primary application goes down.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>39</p><p>3. Consider the fact that geo-replicated databases typically require</p><p>very good network links. Especially when crossing large distances,</p><p>the time to replicate your data is crucial to minimize losses in the</p><p>event of a disaster. If your workload has a heavy write throughput,</p><p>a slow network link may bottleneck the local region nodes. This</p><p>may cause a queue to build up and eventually throttle down</p><p>your writes.</p><p>High-Availability Expectations</p><p>Inevitably, s#*& happens. To prepare for the worst, start by understanding what your use</p><p>case and business can tolerate if a node goes down. Can you accept the data loss that</p><p>could occur if a node storing unreplicated data goes down? Do you need to continue</p><p>buzzing along without a noticeable performance impact even if an entire datacenter or</p><p>availability zone goes down? Or is it okay if things slow down a bit from time to time?</p><p>This will all impact how you architect your topology and configure things like replication</p><p>factor and consistency levels (you’ll learn about this more in Chapter 8).</p><p>It’s important to note that replication and consistency both come at a cost to</p><p>performance. Get a good feel for your business’s risk tolerance and don’t opt for more</p><p>than your business really needs.</p><p>When considering your cluster topology, remember that quite a lot is at risk if you</p><p>get it wrong (and you don’t want to be caught off-guard in the middle of the night).</p><p>For example, the failure of a single node in a three-node cluster could make you</p><p>momentarily lose 33 percent of your processing power. Quite often, that’s a significant</p><p>blow, with discernable business impact. Similarly, the loss of a node in a six-node</p><p>cluster would reduce the blast radius to only 16 percent. But there’s always a tradeoff.</p><p>A sprawling deployment spanning hundreds of nodes is not ideal either. The more nodes</p><p>you have, the more likely you are to experience a node failure. Balance is key.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>40</p><p>Summary</p><p>The specific database challenges you encounter, as well as your options for addressing</p><p>them, are highly dependent on your situation. 
For example, an AdTech use case that</p><p>demands single-digit millisecond P99 latencies for a large dataset with small item</p><p>sizes requires a different treatment than a fraud detection use case that prioritizes the</p><p>ingestion of massive amounts of data as rapidly as possible. One of the primary factors</p><p>influencing how these workloads are handled is how your database is architected. That’s</p><p>the focus for the next two chapters, which dive into database internals.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter’s</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter’s Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChapTer 2 Your projeCT, Through TheLeNS oFDaTabaSe perFormaNCe</p><p>41</p><p>CHAPTER 3</p><p>Database Internals:</p><p>Hardware andOperating</p><p>System Interactions</p><p>A database’s internal architecture makes a tremendous impact on the latency it can</p><p>achieve and the throughput it can handle. Being an extremely complex piece of software,</p><p>a database doesn’t exist in a vacuum, but rather interacts with the environment, which</p><p>includes the operating system and the hardware.</p><p>While it’s one thing to get massive terabyte-to-petabyte scale systems up and</p><p>running, it’s a whole other thing to make sure they are operating at peak efficiency. In</p><p>fact, it’s usually more than just “one other thing.” Performance optimization of large</p><p>distributed systems is usually a multivariate problem—combining aspects of the</p><p>underlying hardware, networking, tuning operating systems, and finagling with layers of</p><p>virtualization and application architectures.</p><p>Such a complex problem warrants exploration from multiple perspectives. This</p><p>chapter begins the discussion of database internals by looking at ways that databases</p><p>can optimize performance by taking advantage of modern hardware and operating</p><p>systems. It covers how the database interacts with the operating system plus CPUs,</p><p>memory, storage, and networking. Then, the next chapter shifts focus to algorithmic</p><p>optimizations.1</p><p>1 This chapter draws from material originally published on the Seastar site (https://seastar.io/)</p><p>and the ScyllaDB blog (https://www.scylladb.com/blog/). It is used here with permission of</p><p>ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_3</p><p>42</p><p>CPU</p><p>Programming books tell programmers that they have this CPU that can run processes</p><p>or threads, and what runs means is that there’s some simple sequential instruction</p><p>execution. 
Then there’s a footnote explaining that with multiple threads you might need</p><p>to consider doing some synchronization. In fact, how things are actually executed inside</p><p>CPU cores is something completely different and much more complicated. It would</p><p>be very difficult to program these machines if you didn’t have those abstractions from</p><p>books, but they are a lie to some degree. How you can efficiently take advantage of CPU</p><p>capabilities is still very important.</p><p>Share Nothing Across Cores</p><p>Individual CPU cores aren’t getting any faster. Their clock speeds reached a performance</p><p>plateau long ago. Now, the ongoing increase of CPU performance continues</p><p>horizontally: by increasing the number of processing units. In turn, the increase in the</p><p>number of cores means that performance now depends on coordination across multiple</p><p>cores (versus the throughput of a single core).</p><p>On modern hardware, the performance of standard workloads depends more on the</p><p>locking and coordination across cores than on the performance of an individual core.</p><p>Software architects face two unattractive alternatives:</p><p>• Coarse-grained locking, which will see application threads contend</p><p>for control of the data and wait instead of producing useful work.</p><p>• Fine-grained locking, which, in addition to being hard to program</p><p>and debug, sees significant overhead even when no contention</p><p>occurs due to the locking primitives themselves.</p><p>Consider an SSD drive.</p><p>The typical time needed to communicate with an SSD on a</p><p>modern NVMe device is quite lengthy—it’s about 20 μseconds. That’s enough time for</p><p>the CPU to execute tens of thousands of instructions. Developers should consider it as</p><p>a networked device but generally do not program in that way. Instead, they often use an</p><p>API that is synchronous (we’ll return to this later), which produces a thread that can be</p><p>blocked.</p><p>Looking at the image of the logical layout of an Intel Xeon Processor (see Figure3-1),</p><p>it’s clear that this is also a networked device.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>43</p><p>Figure 3-1. The logical layout of an Intel Xeon Processor</p><p>The cores are all connected by what is essentially a network—a dual ring</p><p>interconnected architecture. There are two such rings and they are bidirectional. Why</p><p>should developers use a synchronous API for that then? Since sharing information</p><p>across cores requires costly locking, a shared-nothing model is perfectly worth</p><p>considering. In such a model, all requests are sharded onto individual cores, one</p><p>application thread is run per core, and communication depends on explicit message</p><p>passing, not shared memory between threads. This design avoids slow, unscalable lock</p><p>primitives and cache bounces.</p><p>Any sharing of resources across cores in modern processors must be handled</p><p>explicitly. For example, when two requests are part of the same session and two CPUs</p><p>each get a request that depends on the same session state, one CPU must explicitly</p><p>forward the request to the other. Either CPU may handle either response. 
Ideally,</p><p>your database provides facilities that limit the need for cross-core communication—</p><p>but when communication is inevitable, it provides high-performance non-blocking</p><p>communication primitives to ensure performance is not degraded.</p><p>Futures-Promises</p><p>There are many solutions for coordinating work across multiple cores. Some are highly</p><p>programmer-friendly and enable the development of software that works exactly as if it</p><p>were running on a single core. For example, the classic UNIX process model is designed</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>44</p><p>to keep each process in total isolation and relies on kernel code to maintain a separate</p><p>virtual memory space per process. Unfortunately, this increases the overhead at the</p><p>OS level.</p><p>There’s a model known as “futures and promises.” A future is a data structure that</p><p>represents some yet-undetermined result. A promise is the provider of this result. It</p><p>can be helpful to think of a promise/future pair as a first-in first-out (FIFO) queue</p><p>with a maximum length of one item, which may be used only once. The promise is the</p><p>producing end of the queue, while the future is the consuming end. Like FIFOs, futures</p><p>and promises decouple the data producer and the data consumer.</p><p>However, the optimized implementations of futures and promises need to take</p><p>several considerations into account. While the standard implementation targets coarse-</p><p>grained tasks that may block and take a long time to complete, optimized futures and</p><p>promises are used to manage fine-grained, non-blocking tasks. In order to meet this</p><p>requirement efficiently, they should:</p><p>• Require no locking</p><p>• Not allocate memory</p><p>• Support continuations</p><p>Future-promise design eliminates the costs associated with maintaining individual</p><p>threads by the OS and allows close to complete utilization of the CPU.On the other</p><p>hand, it calls for user-space CPU scheduling and very likely limits the developer with</p><p>voluntary preemption scheduling. The latter, in turn, is prone to generating phantom</p><p>jams in popular producer-consumer programming templates.2</p><p>Applying future-promise design to database internals has obvious benefits. First of</p><p>all, database workloads can be naturally CPU-bound. For example, that’s typically the</p><p>case with in-memory database engines, and aggregates’ evaluations also involve pretty</p><p>intensive CPU work. Even for huge on-disk datasets, when the query time is typically</p><p>dominated by the I/O, CPU should be considered. Parsing a query is a CPU-intensive</p><p>task regardless of whether the workload is CPU-bound or storage-bound, and collecting,</p><p>converting, and sending the data back to the user also calls for careful CPU utilization.</p><p>And last but not least: Processing the data always involves a lot of high-level operations</p><p>2 Watch the Linux Foundation video, “Exploring Phantom Traffic Jams in Your Data Flows,” on</p><p>YouTube (www.youtube.com/watch?v=IXS_Afb6Y4o) and/or read the corresponding article on the</p><p>ScyllaDB blog (www.scylladb.com/2022/04/19/exploring-phantom-jams-in-your-</p><p>data-flow/).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>45</p><p>and low-level instructions. 
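To ground the promise/future pair described previously, here is a minimal illustration using the C++ standard library. Keep in mind that std::promise and std::future are the coarse-grained variants: they allocate, may lock, and get() blocks the calling thread. An optimized engine-internal implementation would instead be non-allocating, lock-free, and continuation-based, as the requirements above spell out.

#include <cstdio>
#include <future>
#include <thread>

int main() {
    // The promise is the producing end, the future the consuming end --
    // conceptually a single-use FIFO with a maximum length of one item.
    std::promise<int> row_count_promise;
    std::future<int> row_count_future = row_count_promise.get_future();

    // Producer: for example, a background task scanning a partition.
    std::thread producer([p = std::move(row_count_promise)]() mutable {
        int rows = 42;          // pretend we scanned something
        p.set_value(rows);      // fulfill the promise exactly once
    });

    // Consumer: blocks here until the value is set. An engine-grade future
    // would attach a continuation instead of ever blocking the core.
    std::printf("scanned %d rows\n", row_count_future.get());

    producer.join();
    return 0;
}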
Maintaining them in an optimal manner requires a good low-</p><p>level programming paradigm and future-promises is one of the best choices. However,</p><p>large instruction sets need even more care; this leads to “execution stages.”</p><p>Execution Stages</p><p>Let’s dive deeper into CPU microarchitecture, because (as discussed previously)</p><p>database engine CPUs typically need to deal with millions and billions of instructions,</p><p>and it’s essential to help the poor thing with that. In a very simplified way, the</p><p>microarchitecture of a modern x86 CPU—from the point of view of top-down analysis—</p><p>consists of four major components: frontend, backend, branch speculation, and retiring.</p><p>Frontend</p><p>The processor’s frontend is responsible for fetching and decoding instructions that are</p><p>going to be executed. It may become a bottleneck when there is either a latency problem</p><p>or insufficient bandwidth. The former can be caused, for example, by instruction cache</p><p>misses. The latter happens when the instruction decoders cannot keep up. In the latter</p><p>case, the solution may be to attempt to make the hot path (or at least significant portions</p><p>of it) fit in the decoded μop cache (DSB) or be recognizable by the loop detector (LSD).</p><p>Branch Speculation</p><p>Pipeline slots that the top-down analysis classifies as bad speculation are not stalled, but</p><p>wasted. This happens when a branch is incorrectly predicted and the rest of the CPU</p><p>executes a μop that eventually cannot be committed. The branch predictor is generally</p><p>considered to be a part of the frontend. However, its problems can affect the whole pipeline</p><p>in ways beyond just causing the backend to be undersupplied by the instruction fetch and</p><p>decode. (Note: Branch mispredictions are covered in more detail a bit later in this chapter.)</p><p>Backend</p><p>The backend receives decoded μops and executes them. A stall may happen either</p><p>because of an execution port being busy or a cache miss. At the lower level, a pipeline</p><p>slot may be core bound either due to data dependency or an insufficient number of</p><p>available execution units. Stalls caused by memory can be caused by cache misses at</p><p>different levels of data cache, external memory latency, or bandwidth.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>46</p><p>Retiring</p><p>Finally, there are pipeline slots that get classified as retiring. They are the lucky ones that</p><p>were able to execute and commit their μop without any problems. When 100 percent</p><p>of the pipeline slots are able to retire without a stall, the program has achieved the</p><p>maximum number of instructions per cycle for that model of the CPU.Although this is</p><p>very desirable, it doesn’t mean that there’s no opportunity for improvement. Rather, it</p><p>means that the CPU is fully utilized and the only way to improve the performance is to</p><p>reduce the number of instructions.</p><p>Implications forDatabases</p><p>The way CPUs are architectured has direct implications on the database design. It may</p><p>very well happen that individual requests involve a lot of logic and relatively little data,</p><p>which is a scenario that stresses the CPU significantly. This kind of</p><p>workload will be</p><p>completely dominated by the frontend—instruction cache misses in particular. If you</p><p>think about this for a moment, it shouldn’t be very surprising. 
The pipeline that each</p><p>request goes through is quite long. For example, write requests may need to go through</p><p>transport protocol logic, query parsing code, look up in the caching layer, or be applied</p><p>to the memtable, and so on.</p><p>The most obvious way to solve this is to attempt to reduce the amount of logic in</p><p>the hot path. Unfortunately, this approach does not offer a huge potential for significant</p><p>performance improvement. Reducing the number of instructions needed to perform a</p><p>certain activity is a popular optimization practice, but a developer cannot make any code</p><p>shorter infinitely. At some point, the code “freezes”—literally. There’s some minimal</p><p>amount of instructions needed even to compare two strings and return the result. It’s</p><p>impossible to perform that with a single instruction.</p><p>A higher-level way of dealing with instruction cache problems is called Staged</p><p>Event-Driven Architecture (SEDA for short). It’s an architecture that splits the request</p><p>processing pipeline into a graph of stages—thereby decoupling the logic from the event</p><p>and thread scheduling. This tends to yield greater performance improvements than the</p><p>previous approach.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>47</p><p>Memory</p><p>Memory management is the central design point in all aspects of programming. Even</p><p>comparing programming languages to one another always involves discussions about</p><p>the way programmers are supposed to handle memory allocation and freeing. No</p><p>wonder memory management design affects the performance of a database so much.</p><p>Applied to database engineering, memory management typically falls into two</p><p>related but independent subsystems: memory allocation and cache control. The former</p><p>is in fact a very generic software engineering issue, so considerations about it are not</p><p>extremely specific to databases (though they are crucial and are worth studying). As</p><p>opposed to that, the latter topic is itself very broad, affected by the usage details and</p><p>corner cases. Respectively, in the database world, cache control has its own flavor.</p><p>Allocation</p><p>The manner in which programs or subsystems allocate and free memory lies at the core</p><p>of memory management. There are several approaches worth considering.</p><p>As illustrated by Figure3-2, a so-called “log-structured allocation” is known from</p><p>filesystems where it puts sequential writes to a circular log on the persisting storage and</p><p>handles updates the very same way. At some point, this filesystem must reclaim blocks</p><p>that became obsolete entries in the log area to make some more space available for</p><p>future writes. In a naive implementation, unused entries are reclaimed by rereading and</p><p>rewriting the log from scratch; obsolete blocks are then skipped in the process.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>48</p><p>Figure 3-2. A log-structured allocation puts sequential writes to a circular log on</p><p>the persisting storage and handles updates the same way</p><p>A memory allocator for naive code can do something similar. In its simplest form,</p><p>it would allocate the next block of memory by simply advancing a next-free pointer.</p><p>Deallocation would just need to mark the allocated area as freed. One advantage of</p><p>this approach is the speed of allocation. 
Another is the simplicity and efficiency of</p><p>deallocation if it happens in FIFO order or affects the whole allocation space. Stack</p><p>memory allocations are later released in the order that’s reverse to allocation, so this is</p><p>the most prominent and the most efficient example of such an approach.</p><p>Using linear allocators as general-purpose allocators can be more problematic</p><p>because of the difficulty of space reclamation. To reclaim space, it’s not enough to just</p><p>mark entries as free. This leads to memory fragmentation, which in turn outweighs</p><p>the advantages of linear allocation. So, as with the filesystem, the memory must be</p><p>reclaimed so that it only contains allocated entries and the free space can be used again.</p><p>Reclamation requires moving allocated entries around—a process that changes and</p><p>invalidates their previously known addresses. In naive code, the locations of references</p><p>to allocated entries (addresses stored as pointers) are unknown to the allocator. Existing</p><p>references would have to be patched to make the allocator action transparent to the</p><p>caller; that’s not feasible for a general-purpose allocator like malloc. Logging allocator</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>49</p><p>use is tied to the programming language selection. Some RTTIs, like C++, can greatly</p><p>facilitate this by providing move-constructors. However, passing pointers to libraries that</p><p>are outside of your control (e.g., glibc) would still be an issue.</p><p>Another alternative is adopting a strategy of pool allocators, which provide allocation</p><p>spaces for allocation of entries of a fixed size (see Figure3-3). By limiting the allocation</p><p>space that way, fragmentation can be reduced. A number of general-purpose allocators</p><p>use pool allocators for small allocations. In some cases, those application spaces exist on</p><p>a per-thread basis to eliminate the need for locking and improve CPU cache utilization.</p><p>Figure 3-3. Pool allocators provide allocation spaces for allocation of entries of a</p><p>fixed size. Fragmentation is reduced by limiting the allocation space</p><p>This pool allocation strategy provides two core benefits. First, it saves you</p><p>from having to search for available memory space. Second, it alleviates memory</p><p>fragmentation because it pre-allocates in memory a cache for use with a collection of</p><p>object sizes. Here’s how it works to achieve that:</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>50</p><p>1. The region for each of the sizes has fixed-size memory chunks that</p><p>are suitable for the contained objects, and those chunks are all</p><p>tracked by the allocator.</p><p>2. When it’s time for the allocator to allocate memory for a certain</p><p>type of data object, it’s typically possible to use a free slot (chunk)</p><p>in one of the existing memory slabs.3</p><p>3. When it’s time for the allocator to free the object’s memory, it can</p><p>simply move that slot over to the containing slab’s list of unused/</p><p>free memory slots.</p><p>4. That memory slot (or some other free slot) will be removed from</p><p>the list of free slots whenever there’s a call to create an object of</p><p>the same type (or a call to allocate memory of the same size).</p><p>The best allocation approach to pick heavily depends on the usage scenario. 
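Before comparing the two approaches, here is a minimal, single-threaded sketch of the freelist mechanism behind steps 1 through 4 above. It is intentionally simplified: a real allocator would maintain one pool per size class, take care of alignment, and typically keep per-thread caches to avoid locking.

#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// A pool that hands out fixed-size chunks carved from larger slabs.
class FixedPool {
public:
    explicit FixedPool(std::size_t chunk_size, std::size_t chunks_per_slab = 1024)
        : chunk_size_(chunk_size), chunks_per_slab_(chunks_per_slab) {}

    void* allocate() {
        if (free_list_.empty()) grow();   // carve a new slab when exhausted
        void* chunk = free_list_.back();
        free_list_.pop_back();
        return chunk;
    }

    void deallocate(void* chunk) {
        free_list_.push_back(chunk);      // O(1): just return the slot to the list
    }

private:
    void grow() {
        slabs_.push_back(std::make_unique<std::byte[]>(chunk_size_ * chunks_per_slab_));
        std::byte* base = slabs_.back().get();
        for (std::size_t i = 0; i < chunks_per_slab_; ++i)
            free_list_.push_back(base + i * chunk_size_);
    }

    std::size_t chunk_size_;
    std::size_t chunks_per_slab_;
    std::vector<std::unique_ptr<std::byte[]>> slabs_;  // owned memory, never moved
    std::vector<void*> free_list_;                     // currently unused chunks
};

int main() {
    FixedPool pool(64);          // a pool for 64-byte objects
    void* a = pool.allocate();
    void* b = pool.allocate();
    pool.deallocate(a);          // the freed slot becomes reusable immediately
    void* c = pool.allocate();   // likely reuses a's slot
    std::printf("%p %p %p\n", a, b, c);
    return 0;
}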
One great benefit of a log-structured approach is that it handles fragmentation of small sub-pools in a more efficient way. Pool allocators, on the other hand, generate less background load on the CPU because of the lack of compacting activity.

Cache Control

When it comes to memory management in a software application that stores lots of data on disk, you cannot overlook the topic of cache control. Caching is always a must in data processing, and it's crucial to decide what and where to cache.

If caching is done at the I/O level, for both read/write and mmap, caching can become the responsibility of the kernel. The majority of the system's memory is given over to the page cache. The kernel decides which pages should be evicted when memory runs low, decides when pages need to be written back to disk, and controls read-ahead. The application can provide some guidance to the kernel using the madvise(2) and fadvise(2) system calls.

The main advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective.

3 We are using the term "slab" to mean one or more contiguous memory pages that contain pre-allocated chunks of memory.
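As a concrete example of the madvise(2)/fadvise(2) guidance mentioned above, the sketch below maps one file and tells the kernel it will be read at random offsets, and hints that another descriptor will be scanned sequentially. The file paths and advice values are illustrative; whether any given hint actually helps depends entirely on the workload.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Hint 1: this index file will be accessed at random offsets via mmap.
    int idx_fd = open("/var/lib/mydb/index.bin", O_RDONLY);
    if (idx_fd < 0) { perror("open index"); return 1; }
    struct stat st{};
    fstat(idx_fd, &st);
    void* idx = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, idx_fd, 0);
    if (idx == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(idx, st.st_size, MADV_RANDOM);               // tone down read-ahead

    // Hint 2: this log file will be read once, sequentially, via read(2).
    int log_fd = open("/var/lib/mydb/commitlog.bin", O_RDONLY);
    if (log_fd < 0) { perror("open log"); return 1; }
    posix_fadvise(log_fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // encourage read-ahead

    // ... perform I/O ...

    munmap(idx, st.st_size);
    close(idx_fd);
    close(log_fd);
    return 0;
}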
The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next. Even if the application knows differently, it usually has no way to help the kernel guess correctly.
This results in the wrong pages being evicted, I/O scheduled</p><p>in the wrong order, or read-ahead scheduled for data that will not be consumed in the</p><p>near future.</p><p>Next, doing the caching at the I/O level interacts with the topic often referred to as</p><p>IMR—in memory representation. No wonder that the format in which data is stored on</p><p>disk differs from the form the same data is allocated in memory as objects. The simplest</p><p>reason that it’s not the same is byte-ordering. With that in mind, if the data is cached</p><p>once it’s read from the disk, it needs to be further converted or parsed into the object</p><p>used in memory. This can be a waste of CPU cycles, so applications may choose to cache</p><p>at the object level.</p><p>Choosing to cache at the object level affects a lot of other design points. With</p><p>that, the cache management is all on the application side including cross-core</p><p>synchronization, data coherence, invalidation, and so on. Next, since objects can be</p><p>(and typically are) much smaller than the average I/O size, caching millions and billions</p><p>of those objects requires a collection selection that can handle it (you’ll learn about this</p><p>quite soon). Finally, caching on the object level greatly affects the way I/O is done.</p><p>I/O</p><p>Unless the database engine is an in-memory one, it will have to keep the data on external</p><p>storage. There can be many options to do that, including local disks, network-attached</p><p>storage, distributed file- and object- storage systems, and so on. The term “I/O” typically</p><p>refers to accessing data on local storage—disks or filesystems (that, in turn, are located</p><p>on disks as well). And in general, there are four choices for accessing files on a Linux</p><p>server: read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous I/O (AIO/</p><p>DIO, because this I/O is rarely used in cached mode).</p><p>Traditional Read/Write</p><p>The traditional method is to use the read(2) and write(2) system calls. In a modern</p><p>implementation, the read system call (or one of its many variants—pread, readv, preadv,</p><p>etc.) asks the kernel to read a section of a file and copy the data into the calling process</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>52</p><p>address space. If all of the requested data is in the page cache, the kernel will copy it</p><p>and return immediately; otherwise, it will arrange for the disk to read the requested</p><p>data into the page cache, block the calling thread, and when the data is available, it will</p><p>resume the thread and copy the data. A write, on the other hand, will usually1 just copy</p><p>the data into the page cache; the kernel will write back the page cache to disk some time</p><p>afterward.</p><p>mmap</p><p>An alternative and more modern method is to memory-map the file into the application</p><p>address space using the mmap(2) system call. This causes a section of the address space</p><p>to refer directly to the page cache pages that contain the file’s data. After this preparatory</p><p>step, the application can access file data using the processor’s memory read and</p><p>memory write instructions. If the requested data happens to be in cache, the kernel is</p><p>completely bypassed and the read (or write) is performed at memory speed. 
If a cache miss occurs, then a page-fault happens and the kernel puts the active thread to sleep while it goes off to read the data for that page. When the data is finally available, the memory-management unit is programmed so the newly read data is accessible to the thread, which is then awoken.

Direct I/O (DIO)

Both traditional read/write and mmap involve the kernel page cache and defer the scheduling of I/O to the kernel. When the application wants to schedule I/O itself (for reasons that we will explain later), it can use Direct I/O, as shown in Figure 3-4. This involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls. However, their behavior is now altered: Instead of accessing the cache, the disk is accessed directly, which means that the calling thread will be put to sleep unconditionally. Furthermore, the disk controller will copy the data directly to userspace, bypassing the kernel.

Figure 3-4. Direct I/O involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls, but their behavior is now altered

Asynchronous I/O (AIO/DIO)

A refinement of Direct I/O, Asynchronous Direct I/O, behaves similarly but prevents the calling thread from blocking (see Figure 3-5). Instead, the application thread schedules Direct I/O operations using the io_submit(2) system call, but the thread is not blocked; the I/O operation runs in parallel with normal thread execution. A separate system call, io_getevents(2), waits for and collects the results of completed I/O operations. Like DIO, the kernel's page cache is bypassed, and the disk controller is responsible for copying the data directly to userspace.

Figure 3-5. A refinement of Direct I/O, Asynchronous Direct I/O behaves similarly but prevents the calling thread from blocking

Note: io_uring — The API to perform asynchronous I/O appeared in Linux long ago, and it was warmly met by the community. However, as it often happens, real-world usage quickly revealed many inefficiencies, such as blocking under some circumstances (despite the name), the need to call the kernel too often, and poor support for canceling the submitted requests. Eventually, it became clear that the updated requirements were not compatible with the existing API and the need for a new one arose. This is how the io_uring() API appeared. It provides the same facilities as AIO does, but in a much more convenient and performant way (it also has notably better documentation). Without diving into implementation details, let's just say that it exists and is preferred over the legacy AIO.
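To give a feel for the flavor of this interface, here is a bare-bones sketch that reads one 4 KiB block with Direct I/O through liburing, the userspace helper library for io_uring. The file path is illustrative, error handling is minimal, and a real engine would keep many operations in flight rather than waiting on a single completion. Note the aligned buffer: with O_DIRECT, alignment is the application's job (more on that under "I/O Alignment" below).

#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t block = 4096;

    // O_DIRECT bypasses the page cache; buffers and offsets must be block-aligned.
    int fd = open("/var/lib/mydb/data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void* buf = nullptr;
    if (posix_memalign(&buf, block, block) != 0) return 1;

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);            // a tiny submission/completion queue

    // Describe the read and submit it; the calling thread is not blocked here.
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, block, 0);  // read 4 KiB at offset 0
    io_uring_submit(&ring);

    // ... the thread is free to do other work while the disk is busy ...

    // Reap the completion when the data is actually needed.
    struct io_uring_cqe* cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);
    std::printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}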
Understanding the Tradeoffs

The different access methods share some characteristics and differ in others. Table 3-1 summarizes these characteristics, which are discussed further in this section.

Table 3-1. Comparing Different I/O Access Methods

Characteristic           R/W        mmap       DIO       AIO/DIO
Cache control            Kernel     Kernel     User      User
Copying                  Yes        No         No        No
MMU activity             Low        High       None      None
I/O scheduling           Kernel     Kernel     Mixed     User
Thread scheduling        Kernel     Kernel     Kernel    User
I/O alignment            Automatic  Automatic  Manual    Manual
Application complexity   Low        Low        Moderate  High

Copying and MMU Activity

One of the benefits of the mmap method is that if the data is in cache, then the kernel is bypassed completely. The kernel does not need to copy data from the kernel to userspace and back, so fewer processor cycles are spent on that activity. This benefits workloads that are mostly in cache (for example, if the ratio of storage size to RAM size is close to 1:1).

The downside of mmap, however, occurs when data is not in the cache. This usually happens when the ratio of storage size to RAM size is significantly higher than 1:1. Every page that is brought into the cache causes another page to be evicted. Those pages have to be inserted into and removed from the page tables; the kernel has to scan the page tables to isolate inactive pages, making them candidates for eviction, and so forth. In addition, mmap requires memory for the page tables. On x86 processors, this requires 0.2 percent of the size of the mapped files. This seems low, but if the application has a 100:1 ratio of storage to memory, the result is that 20 percent of memory (0.2% * 100) is devoted to page tables.

I/O Scheduling

One of the problems with letting the kernel control caching (with the mmap and read/write access methods) is that the application loses control of I/O scheduling. The kernel picks whichever block of data it deems appropriate and schedules it for write or read. This can result in the following problems:

• A write storm. When the kernel schedules large amounts of writes, the disk will be busy for a long while and impact read latency.

• The kernel cannot distinguish between "important" and "unimportant" I/O. I/O belonging to background tasks can overwhelm foreground tasks, impacting their latency.

By bypassing the kernel page cache, the application takes on the burden of scheduling I/O. This doesn't mean that the problems are solved, but it does mean that the problems can be solved—with sufficient attention and effort.

When using Direct I/O, each thread controls when to issue I/O. However, the kernel controls when the thread runs, so responsibility for issuing I/O is shared between the kernel and the application. With AIO/DIO, the application is in full control of when I/O is issued.

Thread Scheduling

An I/O intensive application using mmap or read/write cannot guess what its cache hit rate will be. Therefore, it has to run a large number of threads (significantly larger than the core count of the machine it is running on). Using too few threads, they may all be waiting for the disk, leaving the processor underutilized. Since each thread usually has at most one disk I/O outstanding, the number of running threads must be around the concurrency of the storage subsystem multiplied by some small factor in order to keep the disk fully occupied.
However, if the cache hit rate is sufficiently high, then these large</p><p>numbers of threads will contend with each other for the limited number of cores.</p><p>When using Direct I/O, this problem is somewhat mitigated. The application knows</p><p>exactly when a thread is blocked on I/O and when it can run, so the application can</p><p>adjust the number of running threads according to runtime conditions.</p><p>With AIO/DIO, the application has full control over both running threads and</p><p>waiting I/O (the two are completely divorced), so it can easily adjust to in-memory or</p><p>disk-bound conditions or anything in between.</p><p>I/O Alignment</p><p>Storage devices have a block size; all I/O must be performed in multiples of this block</p><p>size which is typically 512 or 4096 bytes. Using read/write or mmap, the kernel performs</p><p>the alignment automatically; a small read or write is expanded to the correct block</p><p>boundary by the kernel before it is issued.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>57</p><p>With DIO, it is up to the application to perform block alignment. This incurs some</p><p>complexity, but also provides an advantage: The kernel will usually over-align to a 4096</p><p>byte boundary even when a 512-byte boundary suffices. However, a user application</p><p>using DIO can issue 512-byte aligned reads, which results in saving bandwidth on</p><p>small items.</p><p>Application Complexity</p><p>While the previous discussions favored AIO/DIO for I/O intensive applications, that method</p><p>comes with a significant cost: complexity. Placing the responsibility of cache management</p><p>on the application means it can make better choices than the kernel and make those</p><p>choices with less overhead. However, those algorithms need to be written and tested. Using</p><p>asynchronous I/O requires that the application is written using callbacks, coroutines, or a</p><p>similar method, and often reduces the reusability of many available libraries.</p><p>Choosing theFilesystem and/or Disk</p><p>Beyond performing the I/O itself, the database design must consider the medium</p><p>against which this I/O is done. In many cases, the choice is often between a filesystem or</p><p>a raw block device, which in turn can be a choice of a traditional spinning disk or an SSD</p><p>drive. In cloud environments, however, there can be the third option because local drives</p><p>are always ephemeral—which imposes strict requirements on the replication.</p><p>Filesystems vs Raw Disks</p><p>This decision can be approached from two angles: management costs and performance.</p><p>If you’re accessing the storage as a raw block device, all the difficulties with block</p><p>allocation and reclamation are on the application side. We touched on this topic slightly</p><p>earlier when we talked about memory management. The same set of challenges apply to</p><p>RAM as well as disks.</p><p>A connected, though very different, challenge is providing data integrity in case</p><p>of crashes. Unless the database is purely in-memory, the I/O should be done in a way</p><p>that avoids losing data or reading garbage from disk after a restart. Modern filesystems,</p><p>however, provide both and are very mature to trust the efficiency of allocations and</p><p>integrity of data. 
Accessing raw block devices unfortunately lacks those features (so they</p><p>need to be implemented at the same quality on the application side).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>58</p><p>From the performance point of view, the difference is not that drastic. On one hand,</p><p>writing data to a file is always accompanied by associated metadata updates. This</p><p>consumes both disk space and I/O bandwidth. However, some modern filesystems</p><p>provide a very good balance of performance and efficiency, almost eliminating the I/O</p><p>latency. (One of the most prominent examples is XFS.Another really good and mature</p><p>piece of software is Ext4). The great ally in this camp is the fallocate(2) system call</p><p>that makes the filesystem preallocate space on disk. When used, filesystems also have a</p><p>chance to make full use of the extents mechanisms, thus bringing the QoS of using files</p><p>to the same performance level as when using raw block devices.</p><p>Appending Writes</p><p>The database may have a heavy reliance on appends to files or require in-place updates</p><p>of individual file blocks. Both approaches need special attention from the system</p><p>architect because they call for different properties from the underlying system.</p><p>On one hand, appending writes requires careful interaction with the filesystem so</p><p>that metadata updates (file size, in particular) do not dominate the regular I/O.On the</p><p>other hand, appending writes (being sort of cache-oblivious algorithms) handle the disk</p><p>overwriting difficulties in a natural manner. Contrary to this, in-place updates cannot</p><p>happen at random offsets and sizes because disks may not tolerate this kind of workload,</p><p>even if they’re used in a raw block device manner (not via a filesystem).</p><p>That being said, let’s dive even deeper into the stack and descend into the</p><p>hardware level.</p><p>How Modern SSDs Work</p><p>Like other computational resources, disks are limited in the speed they can provide. This</p><p>speed is typically measured as a two-dimensional value with Input/Output Operations</p><p>per Second (IOPS) and bytes per second (throughput). Of course, these parameters are</p><p>not cut in stone even for each particular disk, and the maximum number of requests or</p><p>bytes greatly depends on the requests’ distribution, queuing and concurrency, buffering</p><p>or caching, disk age, and many other factors. So when performing I/O, a disk must</p><p>always balance between two inefficiencies—overwhelming the disk with requests and</p><p>underutilizing it.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>59</p><p>Overwhelming the disk should be avoided because when the disk is full of requests</p><p>it cannot distinguish between the criticality of</p><p>certain requests over others. Of course,</p><p>all requests are important, but it makes sense to prioritize latency-sensitive requests.</p><p>For example, ScyllaDB serves real-time queries that need to be completed in single-</p><p>digit milliseconds or less and, in parallel, it processes terabytes of data for compaction,</p><p>streaming, decommission, and so forth. The former have strong latency sensitivity; the</p><p>latter are less so. 
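To make the distinction concrete, here is a deliberately simplified sketch (not ScyllaDB's actual scheduler) of one way an application could cap the share of in-flight requests that background work is allowed to occupy, so that latency-sensitive requests always go first:

```python
from collections import deque

class TwoClassDispatcher:
    """Toy illustration: background I/O may use at most a fixed share of
    the in-flight slots, so latency-sensitive requests never starve."""

    def __init__(self, max_in_flight: int = 32, background_share: float = 0.25):
        self.max_in_flight = max_in_flight
        self.max_background = int(max_in_flight * background_share)
        self.in_flight = 0
        self.background_in_flight = 0
        self.interactive = deque()
        self.background = deque()

    def submit(self, request, latency_sensitive: bool) -> None:
        (self.interactive if latency_sensitive else self.background).append(request)

    def next_request(self):
        """Pick what to issue to the disk next, or None if nothing may be issued."""
        if self.in_flight >= self.max_in_flight:
            return None
        if self.interactive:  # interactive work always goes first
            self.in_flight += 1
            return self.interactive.popleft()
        if self.background and self.background_in_flight < self.max_background:
            self.in_flight += 1
            self.background_in_flight += 1
            return self.background.popleft()
        return None

    def complete(self, was_background: bool) -> None:
        self.in_flight -= 1
        if was_background:
            self.background_in_flight -= 1
```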
Good I/O maintenance that tries to maximize the I/O bandwidth while</p><p>keeping latency as low as possible for latency-sensitive tasks is complicated enough to</p><p>become a standalone component called the I/O Scheduler.</p><p>When evaluating a disk, you would most likely be looking at its four parameters—</p><p>read/write IOPS and read/write throughput (such as in MB/s). Comparing these</p><p>numbers to one another is a popular way of claiming one disk is better than the other</p><p>and estimating the aforementioned “bandwidth capacity” of the drive by applying Little’s</p><p>Law. With that, the I/O Scheduler’s job is to provide a certain level of concurrency inside</p><p>the disk to get maximum bandwidth from it, but not to make this concurrency too high</p><p>in order to prevent the disk from queueing requests internally for longer than needed.</p><p>For instance, Figure3-6 illustrates how read request latency depends on the</p><p>intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes</p><p>(pursuing the disk bandwidth). The latency value is color-coded, and the “interesting</p><p>area” is painted in cyan—this is where the latency stays below 1 millisecond. The drive</p><p>measured is the NVMe disk that comes with the AWS EC2 i3en.3xlarge instance.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>60</p><p>4 You can access Diskplorer at https://github.com/scylladb/diskplorer. This project contains</p><p>instructions on how to generate a graph of your own.</p><p>Figure 3-6. Bandwidth/latency graphs showing how read request latency depends</p><p>on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of</p><p>large writes (pursuing the disk bandwidth)</p><p>This drive demonstrates almost perfect half-duplex behavior—increasing the read</p><p>intensity several times requires roughly the same reduction in write intensity to keep the</p><p>disk operating at the same speed.</p><p>Tip: How to Measure Your Own Disk Behavior Under Load the better you</p><p>understand how your own disks perform under load, the better you can tune them</p><p>to capitalize on their “sweet spot.” One way to do this is with Diskplorer,4 an open-</p><p>source disk latency/bandwidth exploring toolset. by using linux fio under the hood</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>61</p><p>it runs a battery of measurements to discover performance characteristics for a</p><p>specific hardware configuration, giving you an at-a-glance view of how server</p><p>storage I/O will behave under load.</p><p>For a walkthrough of how to use this tool, see the linux Foundation video,</p><p>“Understanding storage I/O Under load.”5</p><p>Networking</p><p>The conventional networking functionality available in Linux is remarkably full-featured,</p><p>mature, and performant. Since the database rarely imposes severe per-ping latency</p><p>requirements, there are very few surprises that come from it when properly configured</p><p>and used. Nonetheless, some considerations still need to be made.</p><p>As explained by David Ahern, “Linux will process a fair amount of packets in the</p><p>context of whatever is running on the CPU at the moment the IRQ is handled. System</p><p>accounting will attribute those CPU cycles to any process running at that moment even</p><p>though that process is not doing any work on its behalf. 
For example, ‘top’ can show a</p><p>process that appears to be using 99+% CPU, but in reality, 60 percent of that time is spent</p><p>processing packets—meaning the process is really only getting 40 percent of the CPU to</p><p>make progress on its workload.”6</p><p>However, for truly networking-intensive applications, the Linux stack is constrained:</p><p>• Kernel space implementation: Separation of the network stack</p><p>into kernel space means that costly context switches are needed to</p><p>perform network operations, and that data copies must be performed</p><p>to transfer data from kernel buffers to user buffers and vice versa.</p><p>• Time sharing: Linux is a time-sharing system, and so must rely on</p><p>slow, expensive interrupts to notify the kernel that there are new</p><p>packets to be processed.</p><p>5 Watch the video on YouTube (www.youtube.com/watch?v=Am-nXO6KK58).</p><p>6 For the source and additional detail, see David Ahern’s, “The CPU Cost of Networking on a Host”</p><p>(https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host).</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>62</p><p>• Threaded model: The Linux kernel is heavily threaded, so all data</p><p>structures are protected with locks. While a huge effort has made</p><p>Linux very scalable, this is not without limitations and contention</p><p>occurs at large core counts. Even without contention, the locking</p><p>primitives themselves are relatively slow and impact networking</p><p>performance.</p><p>As before, the way to overcome this limitation is to move the packet processing to the</p><p>userspace. There are plenty of out-of-kernel implementations of the TCP algorithm that</p><p>are worth considering.</p><p>DPDK</p><p>One of the generic approaches that’s often referred to in the networking area is the poll</p><p>mode vs interrupt model. When a packet arrives, the system may have two options for</p><p>how to get informed—set up and interrupt from the hardware (or, in the case of the</p><p>userspace implementation, from the kernel file descriptor using the poll family of system</p><p>calls) or keep polling the network card on its own from time to time until the packet is</p><p>noticed.</p><p>The famous userspace network toolkit, called DPDK, is designed specifically for</p><p>fast packet processing, usually in fewer than 80 CPU cycles per packet.7 It integrates</p><p>seamlessly with Linux in order to take advantage of high-performance hardware.</p><p>IRQ Binding</p><p>As stated earlier, packet processing may take up to 60 percent of the CPU time, which</p><p>is way too much. This percentage leaves too few CPU ticks for the database work itself.</p><p>Even though in this case the backpressure mechanism would most likely keep the</p><p>external activity off and the system would likely find its balance, the resulting system</p><p>throughput would likely be unacceptable.</p><p>System architects may consider the non-symmetrical CPU approach to mitigate</p><p>this. If you’re letting the Linux kernel process network packets, there are several ways to</p><p>localize this processing on separate CPUs.</p><p>7 For details, see the Linux Foundation’s page on DPDK (Data Plane Developers Kit) at</p><p>www.dpdk.org.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>63</p><p>The simplest way is to bind the IRQ processing from the NIC to specific cores or</p><p>hyper-threads. 
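Distributions usually handle this through irqbalance or a vendor-provided script; the sketch below shows the underlying procfs mechanism. The interface name and core list are illustrative, and the script must run as root.

```python
import re

def bind_nic_irqs(interface: str = "eth0", cores: str = "0-1") -> None:
    """Pin every IRQ whose /proc/interrupts line mentions the given network
    interface to the given cores, using the standard Linux procfs interfaces."""
    with open("/proc/interrupts") as f:
        for line in f:
            match = re.match(r"\s*(\d+):", line)
            if match and interface in line:
                irq = match.group(1)
                with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as affinity:
                    affinity.write(cores)  # e.g. "0-1" restricts this IRQ to cores 0 and 1
```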
Linux uses two-step processing of incoming packets called IRQ and soft-</p><p>IRQ.If the IRQs are properly bound to cores, the soft-IRQ also happens on those cores—</p><p>thus completely localizing the processing.</p><p>For huge-scale nodes running tens to hundred(s) of cores, the number of network-</p><p>only cores may become literally more than one. In this case, it might make sense to</p><p>localize processing even further by assigning cores from different NUMA nodes and</p><p>teaching the NIC to balance the traffic between those using the receive packet steering</p><p>facility of the Linux kernel.</p><p>Summary</p><p>This chapter introduced a number of ways that database engineering decisions enable</p><p>database users to squeeze more power out of modern infrastructure. For CPUs, the</p><p>chapter talked about taking advantage of multicore servers by limiting resource sharing</p><p>across cores and using future-promise design to coordinate work across cores. The</p><p>chapter also provided a specific example of how low-level CPU architecture has direct</p><p>implications on the database.</p><p>Moving on to memory, you</p><p>read about two related but independent subsystems:</p><p>memory allocation and cache control. For I/O, the chapter discussed Linux options</p><p>such as traditional read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous</p><p>I/O—including the various tradeoffs of each. This was followed by a deep dive into</p><p>how modern SSDs work and how a database can take advantage of a drive’s unique</p><p>characteristics. Finally, you looked at constraints associated with the Linux networking</p><p>stack and explored alternatives such as DPDK and IRQ binding. The next chapter shifts</p><p>the focus from hardware interactions to algorithmic optimizations: pure software</p><p>challenges.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>64</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 3 Database Internals: harDware anDOperatIng system InteraCtIOns</p><p>65</p><p>CHAPTER 4</p><p>Database Internals:</p><p>Algorithmic Optimizations</p><p>In the performance world, the hardware is always the unbreakable limiting factor—one</p><p>cannot squeeze more performing units from a system than the underlying chips may</p><p>provide. As opposed to that, the software part of the system is often considered the most</p><p>flexible thing in programming—in the sense that it can be changed at any time given</p><p>enough developers’ brains and hands (and investors’ cash).</p><p>However, that’s not always the case. 
Sometimes selecting an algorithm should be</p><p>done as early as the architecting stage in the most careful manner possible because the</p><p>chosen approach becomes so extremely fundamental that changing it would effectively</p><p>mean rewriting the whole engine from scratch or requiring users to migrate exabytes of</p><p>data from one instance to another.</p><p>This chapter shares one detailed example of algorithmic optimization—from the</p><p>perspective of the engineer who led this optimization. Specifically, this chapter looks</p><p>at how the B-trees family can be used to store data in cache implementations and</p><p>other accessory and in-memory structures. This look into a representative engineering</p><p>challenge should help you better understand what tradeoffs or optimizations various</p><p>databases might be making under the hood—ideally, so you can take better advantage of</p><p>its very deliberate design decisions.1</p><p>Note The goal of this chapter is not to convince database users that they need a</p><p>database with any particular algorithmic optimization—or to educate infrastructure</p><p>engineers on designing B-trees or the finer points of algorithmic optimization.</p><p>Rather, it’s to help anyone selecting or working with a database understand the</p><p>1 This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.com/</p><p>blog). It is used here with permission of ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_4</p><p>66</p><p>level of algorithmic optimization that might impact a database’s performance.</p><p>Hopefully, it piques your curiosity in learning more about the engineering behind</p><p>the database you’re using and/or alternative databases you’re considering.</p><p>Optimizing Collections</p><p>Maintaining large sets of objects in memory deserves the same level of attention as</p><p>maintaining objects in external memory—say, spinning disks or network-attached</p><p>storages. For a task as simple as looking up an object by a plain key, the acceptable</p><p>solution is often a plain hash table (even with great attention to hash function selection)</p><p>or a binary balanced tree (usually the red-black one due to its implementation</p><p>simplicity). However, branchy trees like the B-trees family can significantly boost</p><p>performance. They also have a lot of non-obvious pitfalls.</p><p>To B- or Not toB-Tree</p><p>An important characteristic of a tree is cardinality. This is the maximum number of</p><p>child nodes that another node may have. In the corner case of cardinality of two, the</p><p>tree is called a binary tree. For other cases, there’s a wide class of so-called B-trees. The</p><p>common belief about binary vs B-trees is that the former ones should be used when the</p><p>data is stored in the RAM, while the latter trees should live in the disk. The justification</p><p>for this split is that RAM access speed is much higher than disk. Also, disk I/O is</p><p>performed in blocks, so it’s much better and faster to fetch several “adjacent” keys in one</p><p>request. RAM, unlike disks, allows random access with almost any granularity, so it’s</p><p>okay to have a dispersed set of keys pointing to each other.</p><p>However, there are many reasons that B-trees are often a good choice for in-memory</p><p>collections. The first reason is cache locality. 
When searching for a key in a binary tree,</p><p>the algorithm would visit up to logN elements that are very likely dispersed in memory.</p><p>On a B-tree, this search will consist of two phases—an intra-node search and descending</p><p>the tree—executed one after another. And while descending the tree doesn’t differ much</p><p>from the binary tree in the aforementioned sense, intra-node searching will access</p><p>keys that are located next to each other, thus making much better use of CPU caches.</p><p>Figure4-1 exemplifies the process of walking down a binary tree. Compare it along with</p><p>Figure4-2, which demonstrates a search in a B-tree set.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>67</p><p>Figure 4-1. Searching in a binary tree root</p><p>Figure 4-2. Searching in a B-tree set</p><p>The second reason that B-trees are often a good choice for in-memory collections</p><p>also comes from the dispersed nature of binary trees and from how modern CPUs</p><p>are designed. It’s well known that when executing a stream of instructions, CPU cores</p><p>split the processing of each instruction into stages (loading instructions, decoding</p><p>them, preparing arguments, and doing the execution itself) and the stages are run in</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>68</p><p>parallel in a unit called a conveyor. When a conditional branching instruction appears</p><p>in this stream, the conveyor needs to guess which of two potential branches it will have</p><p>to execute next and start loading it into the conveyor pipeline. If this guess fails, the</p><p>conveyor is flushed and starts to work from scratch. Such failures are called branch</p><p>mispredictions. They are harmful from a performance point of view2 and have direct</p><p>implications on the binary search algorithm. When searching for a key in such a tree,</p><p>the algorithm jumps left and right depending on the key comparison result without</p><p>giving the CPU a chance to learn which direction is “preferred.” In many cases, the CPU</p><p>conveyer is flushed.</p><p>The two-phased B-tree search can be made better with respect to branch</p><p>predictions. The trick is in making the intra-node search linear (i.e., walking the array of</p><p>keys forward key-by-key). In this case, there will be only a “should you move forward”</p><p>condition that’s much more predictable. There’s even a nice trick of turning binary</p><p>search into linear without sacrificing the number of comparisons,3 but this approach</p><p>is good for read-mostly collections because insertion into this layout is tricky and has</p><p>worse complexity</p><p>than for sorted arrays. This approach has proven itself in ScyllaDB’s</p><p>implementation and is also widely used in the Tarantool in-memory database.4</p><p>Linear Search onSteroids</p><p>That linear search can be improved a bit more. Let’s carefully count the number of key</p><p>comparisons that it may take to find a single key in a tree. For a binary tree, it’s well</p><p>known that it takes log2N comparisons (on average) where N is the number of elements.</p><p>We put the logarithm base here for a reason. Next, consider a k-ary tree with k children</p><p>per node. Does it take fewer comparisons? (Spoiler: no). To find the element, you have to</p><p>do the same search—get a node, find in which branch it sits, then proceed to it. You have</p><p>logkN levels in the tree, so you have to do that many descending steps. 
However on each</p><p>step, you need to do the search within k elements, which is, again, log2k if you’re doing a</p><p>binary search. Multiplying both, you still need at least log2N comparisons.</p><p>2 See Marek Majkowski’s blog, “Branch predictor: How many ‘if’s are too many? Including x86 and</p><p>M1 benchmarks!” https://blog.cloudflare.com/branch-predictor/.</p><p>3 See the tutorial, “Eytzinger Binary Search” https://algorithmica.org/en/eytzinger.</p><p>4 Both are available as open-source software; see https://github.com/scylladb/scylladb and</p><p>https://github.com/tarantool/tarantool.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>69</p><p>The way to reduce this number is to compare more than one key at a time when</p><p>doing intra-node searches. In case the keys are small enough, SIMD instructions can</p><p>compare up to 64 keys in one go. Although a SIMD compare instruction may be slower</p><p>than a classic cmp one and requires additional instructions to process the comparison</p><p>mask, linear SIMD-powered search wins on short enough arrays (and B-tree nodes can</p><p>be short enough). For example, Figure4-3 shows the times of looking up an integer in a</p><p>sorted array using three techniques—linear search, binary search, and SIMD-optimized</p><p>linear search such as the x86 Advanced Vector Extensions (AVX).</p><p>Figure 4-3. The test used a large amount of randomly generated arrays of values</p><p>dispersed in memory to eliminate differences in cache usage and a large amount</p><p>of random search keys to blur branch predictions. These are the average times of</p><p>finding a key in an array normalized by the array length. Smaller results are faster</p><p>(better)</p><p>Scanning theTree</p><p>One interesting flavor of B-trees is called a B+-tree. In this tree, there are two kinds of</p><p>keys—real keys and separation keys. The real keys live on leaf nodes (i.e., on those that</p><p>don’t have children), while separation keys sit on inner nodes and are used to select</p><p>which branch to go next when descending the tree. This difference has an obvious</p><p>consequence that it takes more memory to keep the same amount of keys in a B+-tree as</p><p>compared to B-tree. But it’s not only that.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>70</p><p>A great implicit feature of a tree is the ability to iterate over elements in a sorted</p><p>manner (called a scan). To scan a classical B-tree, there are both recursive and state-</p><p>machine algorithms that process the keys in a very non-uniform manner—the algorithm</p><p>walks up-and-down the tree while it moves. Despite B-trees being described as cache-</p><p>friendly, scanning them requires visiting every single node and inner nodes are visited in</p><p>a cache unfriendly manner. Figure4-4 illustrates this phenomenon.</p><p>Figure 4-4. Scanning a classical B-tree involves walking up and down the tree;</p><p>every node and inner node is visited</p><p>As opposed to this, B+-trees’ scan only needs to loop through its leaf nodes, which,</p><p>with some additional effort, can be implemented as a linear scan over a linked list of</p><p>arrays, as demonstrated in Figure4-5.</p><p>Figure 4-5. B+ tree scans only need to cover leaf nodes</p><p>When theTree Size Matters</p><p>Talking about memory, B-trees don’t provide all these benefits for free (neither do B+-</p><p>trees). 
As the tree grows, so does the number of nodes in it and it’s useful to consider the</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>71</p><p>overhead needed to store a single key. For a binary tree, the overhead is three pointers—</p><p>to both left and right children as well as to the parent node. For a B-tree, it will differ for</p><p>inner and leaf nodes. For both types, the overhead is one parent pointer and k pointers</p><p>to keys, even if they are not inserted in the tree. For inner nodes there will additionally be</p><p>k+1 pointers to child nodes.</p><p>The number of nodes in a B-tree is easy to estimate for a large number of keys. As the</p><p>number of nodes grows, the per-key overhead blurs as keys “share” parent and children</p><p>pointers. However, there’s a very interesting point at the beginning of a tree’s growth.</p><p>When the number of keys becomes k+1 (i.e., the tree overgrows its first leaf node), the</p><p>number of nodes jumps three times because, in this case, it’s needed to allocate one</p><p>more leaf node and one inner node to link those two.</p><p>There is a good and pretty cheap optimization to mitigate this spike, called “linear</p><p>root.” The leaf root node grows on demand, doubling each step like a std::vector in</p><p>C++, and can overgrow the capacity of k up to some extent. Figure4-6 shows the per-key</p><p>overhead for a 4-ary B-tree with 50 percent initial overgrowth. Note the first split spike of</p><p>a classical algorithm at five keys.</p><p>Figure 4-6. The per-key overhead for a 4-ary B-tree with 50 percent initial</p><p>overgrowth</p><p>When discussing how B-trees work with small amounts of keys, it’s worth</p><p>mentioning the corner case of one key. In ScyllaDB, a B-tree is used to store sorted rows</p><p>inside a block of rows called a partition. Since it’s possible to have a schema where each</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>72</p><p>partition always has a single row, this corner case is not that “corner” for us. In the case</p><p>of a binary tree, the single-element tree is equivalent to having a direct pointer from the</p><p>tree owner to this element (plus the cost of two nil pointers to the left and right children).</p><p>In case of a B-tree, the cost of keeping the single key is always in having a root node that</p><p>implies extra pointer fetching to access this key. Even the linear root optimization is</p><p>helpless here. Fixing this corner case was possible by reusing the pointer to the root node</p><p>to point directly to the single key.</p><p>The Secret Life ofSeparation Keys</p><p>This section dives into technical details of B+-tree implementation.</p><p>There are two ways of managing separation keys in a B+-tree. The separation key</p><p>at any level must be less than or equal to all the keys from its right subtree and greater</p><p>than or equal to all the keys from its left subtree. Mind the “or” condition—the exact</p><p>value of the separation key may or may not coincide with the value of some key from the</p><p>respective branch (it’s clear that this some will be the rightmost key on the left branch</p><p>or leftmost on the right). Let’s look at these two cases. 
If the tree balancing maintains</p><p>the separation key to be independent from other key values, then it’s the light mode; if it</p><p>must coincide with some of them, then it will be called the strict mode.</p><p>In the light separation mode, the insertion and removal operations are a bit faster</p><p>because they don’t need to care about separation keys that much. It’s enough if they</p><p>separate branches, and that’s it. A somewhat worse consequence of the light separation</p><p>is that separation keys are separate values that may appear in the tree by copying existing</p><p>keys. If the key is simple, (e.g., an integer), this will likely not cause any trouble. However,</p><p>if keys are strings or, as in ScyllaDB’s case, database partition or clustering keys, copying</p><p>it might be both resource consuming and out-of-memory risky.</p><p>On the other hand, the strict separation mode makes it possible to avoid key copying</p><p>by implementing separation keys as references</p><p>on real ones. This would involve some</p><p>complication of insertion and especially removal operations. In particular, upon real key</p><p>removal, it will be necessary to find and update the relevant separation keys. Another</p><p>difficulty to care about is that moving a real key value in memory, if it’s needed (e.g.,</p><p>in ScyllaDB’s case keys are moved in memory as a part of memory defragmentation</p><p>hygiene), will also need to update the relevant reference from separation keys. However,</p><p>it’s possible to show that each real key will be referenced by at most one separation key.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>73</p><p>Speaking about memory consumption, although large B-trees were shown to</p><p>consume less memory per-key as they get filled, the real overhead would very likely be</p><p>larger, since the nodes of the tree will typically be underfilled because of the way the</p><p>balancing algorithm works. For example, Figures4-7 and 4-8 show how nodes look in a</p><p>randomly filled 4-ary B-tree.</p><p>Figure 4-7. Distribution of number of keys in a node for leaf nodes</p><p>Figure 4-8. Distribution of number of keys in a node for inner nodes</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>74</p><p>It’s possible to define a compaction operation for a B-tree that will pick several</p><p>adjacent nodes and squash them together, but this operation has its limitations. First,</p><p>a certain amount of underoccupied nodes makes it possible to insert a new element</p><p>into a tree without the need to rebalance, thus saving CPU cycles. Second, since each</p><p>node cannot contain less than a half of its capacity, squashing two adjacent nodes</p><p>is impossible. Even if considering three adjacent nodes, then the amount of really</p><p>squashable nodes would be less than 5 percent of the leaves and less than 1 percent of</p><p>the inners.</p><p>Summary</p><p>As extensive as these optimizations might seem, they are really just the tip of the iceberg</p><p>for this one particular example. Many finer points that matter from an engineering</p><p>perspective were skipped for brevity (for example, the subtle difference in odd vs</p><p>even number of keys on a node). For a database user, the key takeaway here is that</p><p>an excruciating level of design and experimentation often goes into the software for</p><p>determining how your database stores and retrieves data. 
You certainly don’t need to</p><p>be this familiar with every aspect of how your database was engineered. But knowing</p><p>what algorithmic optimizations your database has focused on will help you understand</p><p>why it performs in certain ways under different contexts. And you might discover some</p><p>impressively engineered capabilities that could help you handle more user requests or</p><p>shave a few precious milliseconds off your P99 latencies. The next chapter takes you into</p><p>the inner workings of database drivers and shares tips for getting the most out of a driver,</p><p>particularly from a performance perspective.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>75</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>CHapTeR 4 DaTaBase InTeRnals: algoRITHmIC opTImIzaTIons</p><p>77</p><p>CHAPTER 5</p><p>Database Drivers</p><p>Databases usually expose a specific communication protocol for their users. This</p><p>protocol is the foundation of communication between clients and servers, so it’s often</p><p>well-documented and has a formal specification. Some databases, like PostgreSQL,</p><p>implement their own binary format on top of the TCP/IP stack.1 Others, like Amazon</p><p>DynamoDB,2 build theirs on top of HTTP, which is a little more verbose, but also more</p><p>versatile and compatible with web browsers. It’s also not uncommon to see a database</p><p>exposing a protocol based on gRPC3 or any other well-established framework.</p><p>Regardless of the implementation details, users seldom use the bare protocol</p><p>themselves because it’s usually a fairly low-level API.What’s used instead is a driver—a</p><p>programming interface written in a particular language, implementing a higher-level</p><p>abstraction for communicating with the database. Drivers hide all the nitty-gritty details</p><p>behind a convenient interface, which saves users from having to manually handle</p><p>connection management, parsing, validation, handshakes, authentication, timeouts,</p><p>retries, and so on.</p><p>In a distributed environment (which a scalable database cluster usually is), clients,</p><p>and therefore drivers, are an extremely important part of the ecosystem. The clients</p><p>are usually the most numerous group of actors in the system, and they are also very</p><p>heterogeneous in nature, as visualized in Figure5-1. Some clients are connected via</p><p>local network interfaces, other ones connect via a questionable Wi-Fi hotspot on another</p><p>continent and thus have vastly different latency characteristics and error rates. 
Some</p><p>might run on microcontrollers with 1MiB of random access memory, while others</p><p>utilize 128-core bare metal machines from a cloud provider. Due to this diversity, it’s</p><p>1 See the PostgreSQL documentation (https://www.postgresql.org/docs/7.3/protocol-</p><p>protocol.html).</p><p>2 See the DynamoDB Developer Guide on the DynamoDB API (https://docs.aws.amazon.com/</p><p>amazondynamodb/latest/developerguide/HowItWorks.API.html).</p><p>3 gRPC is “a high performance, open-source universal RPC framework;” see https://grpc.io for</p><p>details.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_5</p><p>78</p><p>very important to take drivers into consideration when thinking about performance,</p><p>scalability, and resilience to failures. Ultimately it’s the drivers that generate traffic and</p><p>its concurrency, so cooperation between them and database nodes is crucial for the</p><p>whole system to be healthy and efficient.</p><p>Note As a reminder, concurrency, in the context of this book, is the measure of</p><p>how many operations are performed at the same point in time. It's conceptually</p><p>similar to parallelism. With concurrency, the operations occur physically at the</p><p>same time (e.g. on multiple CPU cores or multiple machines). Parallelism does</p><p>not specify that; the operations might just as well be executed in small steps on</p><p>a single machine. Nowadays, distributed systems must rely on providing high</p><p>concurrency in order to remain competitive and catch up with ever-developing</p><p>technology.</p><p>This chapter takes a look at how drivers impact performance—through the eyes of</p><p>someone who has engineered drivers for performance. It provides insight into various</p><p>ways that drivers can support efficient client-server interactions and shares tips for</p><p>getting the most out of a driver, particularly from the performance perspective. Finally,</p><p>the chapter wraps up with several considerations to keep in mind as you’re selecting</p><p>a driver.</p><p>Relationship Between Clients andServers</p><p>Scalability is a measure of how well your system reacts to increased load. This load is</p><p>usually generated by clients using their drivers, so keeping the relationship between</p><p>your clients and servers sound is an important matter. The more you know about your</p><p>workloads, your clients’ behavior, and their usage patterns, the better you’re prepared to</p><p>handle</p><p>both sudden spikes in traffic and sustained, long-term growth in usage.</p><p>Each client is different and should be treated as such. The differences come both</p><p>from clients’ characteristics, like their number and volume, and from their requirements.</p><p>Some clients have strict latency guarantees, even at the cost of higher error rates. Others</p><p>do not particularly care about the latency of any single database query, but just want a</p><p>steady pace of progress in their long-standing queries. Some databases target specific</p><p>types of clients (e.g., analytical databases which expect clients processing large aggregate</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>79</p><p>queries operating on huge volumes of historical data). 
Other ones strive to be universal,</p><p>handling all kinds of clients and balancing the load so that everyone is happy (or, more</p><p>precisely, “happy enough”).</p><p>Workload Types</p><p>There are multiple ways of classifying database clients. One particularly interesting</p><p>way is to delineate between clients processing interactive and batch (e.g., analytical)</p><p>workloads, also known as OLTP (online transaction processing) vs OLAP (online</p><p>analytical processing)—see Figure5-2.</p><p>Figure 5-1. Visualization of clients and servers in a distributed system</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>80</p><p>Figure 5-2. Difference between interactive and batch (analytical) workloads</p><p>Interactive Workloads</p><p>A client processing an interactive workload typically wants certain latency guarantees.</p><p>Receiving a response fast is more important than ensuring that the query succeeded.</p><p>In other words, it’s better to return an error in a timely manner than make the client</p><p>indefinitely wait for the correct response. Such workloads are often characterized by</p><p>unbounded concurrency, which means that the number of in-progress operations is</p><p>hard to predict.</p><p>A prime example of an interactive workload is a server handling requests from web</p><p>browsers. Imagine an online game, where players interact with the system straight from</p><p>their favorite browsers. High latency for such a player means a poor user experience</p><p>because people tend to despise waiting for online content for more than a few hundred</p><p>milliseconds; with multi-second delays, most will just ditch the game as unusable and</p><p>try something else. It’s therefore particularly important to be as interactive as possible</p><p>and return the results quickly—even if the result happens to be a temporary error. In</p><p>such a scenario, the concurrency of clients varies and is out of control for the database.</p><p>Sometimes there might be a large influx of players, and the database might need to</p><p>refuse some of them to avoid overload.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>81</p><p>Batch (Analytical) Workloads</p><p>A batch (analytical) workload is the conceptual opposite of an interactive one. With</p><p>such workloads, it doesn’t matter whether any single request is processed in a few</p><p>milliseconds or hours. The important thing is that the processing makes steady progress</p><p>with a satisfactory error rate, which is ideally zero. Batch workloads tend to have fixed</p><p>concurrency, which makes it easier for the database to keep the load under control.</p><p>A good example of a batch workload is an Apache Spark4 job performing analytics</p><p>on a big dataset (think terabytes). There are only a few connections established to</p><p>the database, and they continuously send requests in order to fetch data for long</p><p>computations. Because the concurrency is predictable, the database can easily respond</p><p>to an increased load by applying backpressure (e.g., by delaying the responses a little</p><p>bit). The analytical processing will simply slow down, adjusting its speed according to</p><p>the speed at which the database can consume queries.</p><p>Mixed Workloads</p><p>Certain workloads cannot be easily qualified as fully interactive or fully batch. The</p><p>clients are free to intermix their requirements, concurrency, and load however they</p><p>please—so the databases should also be ready for surprises. 
For example, a batch</p><p>workload might suddenly experience a giant temporary spike in concurrency. Databases</p><p>should, on the one hand, maintain a level of trust in the workload’s typical patterns, but</p><p>on the other hand anticipate that workloads can simply change over time—due to bugs,</p><p>hardware changes, or simply because the use case has diverged from its original goal.</p><p>Throughput vs Goodput</p><p>A healthy distributed database cluster is characterized by stable goodput, not</p><p>throughput. Goodput is an interesting portmanteau of good + throughput, and it’s a</p><p>measure of useful data being transferred between clients and servers over the network,</p><p>as opposed to just any data. Goodput disregards errors and other churn-like redundant</p><p>retries, and is used to judge how effective the communication actually is.</p><p>This distinction is important.</p><p>4 Apache Spark is “multi-language engine for executing data engineering, data science, and</p><p>machine learning on single-node machines or clusters.” For details, see https://spark.</p><p>apache.org/.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>82</p><p>Imagine an extreme case of an overloaded node that keeps returning errors for each</p><p>incoming request. Even though stable and sustainable throughput can be observed,</p><p>this database brings no value to the end-user. Thus, it’s essential to track how much</p><p>useful data can be delivered in an acceptable time. For example, this can be achieved</p><p>by tracking both the total throughput and throughput spent on sending back error</p><p>messages and then subtracting one from another to see how much valid data was</p><p>transferred (see Figure5-3).</p><p>Figure 5-3. Note how a fraction of the throughput times out, effectively requiring</p><p>more work from clients to achieve goodput</p><p>Maximizing goodput is a delicate operation and it heavily depends on the</p><p>infrastructure, workload type, clients’ behavior, and many other factors. In some cases,</p><p>the database shedding load might be beneficial for the entire system. Shedding is a</p><p>rather radical measure of dealing with overload: Requests qualified as “risky” are simply</p><p>ignored by the server, or immediately terminated with an error. This type of overload</p><p>protection is especially useful against issues induced by interactive workloads with</p><p>unbounded concurrency (there’s not much a database can do to protect itself except</p><p>drop some of the incoming requests early).</p><p>The database server isn’t an oracle; it can’t accurately predict whether a request is</p><p>going to fail due to overload, so it must guess. 
Fortunately, there are quite a few ways of</p><p>making that guess educated:</p><p>• Shedding load if X requests are already being processed, where X is</p><p>the estimated maximum a database node can handle.</p><p>• Refusing a request if its estimated memory usage is larger than the</p><p>database could handle at the moment.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>83</p><p>• Probabilistically refusing a request if Y requests are already being</p><p>processed, where Y is a percentage of the maximum a database node</p><p>can handle, with the probability raising to 100 percent once a certain</p><p>threshold is reached.</p><p>• Refusing a request if its estimated execution time indicates that it’s</p><p>not going to finish in time, and instead it is likely to time out anyway.</p><p>While refusing clients’ requests is detrimental to user experience, sometimes</p><p>it’s simply the lesser of two evils. If dropping a number of requests allows even more</p><p>requests to successfully finish in time, it increases the cluster’s goodput.</p><p>Clients can help the database maximize goodput and keep the latency low by</p><p>declaring for how long the request is considered valid. For instance, in high frequency</p><p>trading, a request that takes more than a couple of milliseconds is just as good as a</p><p>request that failed. By letting the database know that’s the case, you can allow it to retire</p><p>some requests early, leaving valuable resources for other requests which still have a</p><p>chance to be successful. Proper timeout management is a broad topic and it deserves a</p><p>separate</p><p>section.</p><p>Timeouts</p><p>In a distributed system, there are two fundamental types of timeouts that influence one</p><p>another: client-side timeouts and server-side timeouts. While both are conceptually</p><p>similar, they have different characteristics. It’s vital to properly configure both of them to</p><p>prevent problems like data races and consistency issues.</p><p>Client-Side Timeouts</p><p>This type of timeout is generally configured in the database driver. It signifies how long it</p><p>takes for a driver to decide that a response from a server is not likely to arrive. In a perfect</p><p>world built on top of a perfect network, all parties always respond to their requests.</p><p>However, in practice, there are numerous causes for a response to either be late or lost:</p><p>• The recipient died</p><p>• The recipient is busy with other tasks</p><p>• The network failed, maybe due to hardware malfunction</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>84</p><p>• The network has a significant delay because packets get stuck in an</p><p>intermediate router</p><p>• A software bug caused the packet to be lost</p><p>• And so on</p><p>Since in a distributed environment it’s usually impossible to guess what</p><p>happened, the client must sometimes decide that a request is lost. The alternative</p><p>is to wait indefinitely. That might work for a select set of use cases, but it’s often</p><p>simply unacceptable. If a single failed request holds a resource for an unspecified</p><p>time, the system is eventually doomed to fail. Hence, client-side timeouts are used</p><p>as a mechanism to make sure that the system can operate even in the event of</p><p>communication issues.</p><p>A unique characteristic of a client-side timeout is that the decision to give up on a</p><p>request is made solely by the client, in the absence of any feedback from the server. 
It’s</p><p>entirely possible that the request in question is still being processed and utilizes the</p><p>server’s resources. And, worst of all, the unaware server can happily return the response</p><p>to the client after it’s done processing, even though nobody’s interested in this stale</p><p>data anymore! That presents another aspect of error handling: Drivers must be ready to</p><p>handle stray, expired responses correctly.</p><p>Server-Side Timeouts</p><p>A server-side timeout determines when a database node should start considering a</p><p>particular request as expired. Once this point in time has passed, there is no reason</p><p>to continue processing the query. (Doing so would waste resources which could have</p><p>otherwise been used for serving other queries that still have a chance to succeed.)</p><p>When the specified time has elapsed, databases often return an error indicating that the</p><p>request took too long.</p><p>Using reasonable values for server-side timeouts helps the database manage its</p><p>priorities in a more precise way, allocating CPU, memory and other scarce resources on</p><p>queries likely to succeed in a timely manner. Drivers that receive an error indicating that</p><p>a server-side timeout has occurred should also act accordingly—perhaps by reducing</p><p>the pressure on a particular node or retrying on another node that hasn’t experienced</p><p>timeouts lately.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>85</p><p>A Cautionary Tale</p><p>The CQL protocol, which specifies the communication layer in Apache Cassandra and</p><p>ScyllaDB, comes with built-in support for concurrency. Namely, each request is assigned</p><p>a stream ID, unique for each connection. This stream ID is encoded as a 16-bit integer</p><p>with the first bit being reserved by the protocol, which leaves the drivers 32768 unique</p><p>values for handling in-flight requests per single connection. This stream ID is later</p><p>used to match an incoming response with its original request. That’s not a particularly</p><p>large number, given that modern systems are known to handle millions of requests per</p><p>second. Thus, drivers need to eventually reuse previously assigned stream IDs.</p><p>But the CQL driver for Python had a bug.5 In the event of a client-side timeout, it</p><p>assumed that the stream ID of an expired request was immediately free to reuse. While</p><p>the assumption holds true if the server dies, it is incorrect if processing simply takes</p><p>longer than expected. It was therefore possible that once a response with a given stream</p><p>ID arrived, another request had already reused the stream ID, and the driver would</p><p>mistakenly match the response with the new request. If the user was lucky, they would</p><p>simply receive garbage data that did not pass validation. Unfortunately, data from the</p><p>mismatched response might appear correct, even though it originates from a totally</p><p>different request. This is the kind of bug that looks innocent at first glance, but may cause</p><p>people to log in to other people’s bank accounts and wreak havoc on their lives.</p><p>A rule of thumb for client-side timeouts is to make sure that a server-side timeout</p><p>also exists and is strictly shorter than the client-side one. It should take into account</p><p>clock synchronization between clients and servers (or lack thereof), as well as estimated</p><p>network latency. 
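With the Python driver mentioned earlier, following that rule of thumb might look like the snippet below. The 2-second figure is arbitrary, and the keyspace and table names are borrowed from this chapter's examples; the matching server-side limits (for example, read_request_timeout_in_ms and write_request_timeout_in_ms in the database's YAML configuration) would then be set strictly lower.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Client-side timeout: the driver gives up after 2 seconds (an arbitrary figure).
# The server-side timeouts should be set strictly lower, say 1500 ms, so the
# server abandons the work before the client stops listening for the response.
profile = ExecutionProfile(request_timeout=2.0)

cluster = Cluster(
    ["127.0.0.1"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")  # hypothetical keyspace

rows = session.execute("SELECT id, descr FROM my_table WHERE id = %s", (42,))
```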
Such a procedure minimizes the chances for a late response to arrive at</p><p>all, and thus removes the root cause of many issues and vulnerabilities.</p><p>5 Bug report and applied fixes can be found here:</p><p>https://datastax-oss.atlassian.net/browse/PYTHON-1286</p><p>https://github.com/scylladb/python-driver/pull/106</p><p>https://github.com/datastax/python-driver/pull/1114</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>86</p><p>Contextual Awareness</p><p>At this point it should be clear that both servers and clients can make better, more</p><p>educated, and mutually beneficial decisions if they know more about each other.</p><p>Exchanging timeout information is important, but drivers and servers can do even more</p><p>to keep each other up to date.</p><p>Topology andMetadata</p><p>Database servers are often combined into intricate topologies where certain nodes</p><p>are grouped in a single geographical location, others are used only as a fast cache</p><p>layer, and yet others store seldom accessed cold data in a cheap place, for emergency</p><p>purposes only.</p><p>Not every database exposes its topology to the end-user. For example, DynamoDB</p><p>takes that burden off of its clients and exposes only a single endpoint, taking care of</p><p>load balancing, overload prevention, and retry mechanisms on its own. On the other</p><p>hand, a fair share of popular databases (including ScyllaDB, Cassandra, and ArangoDB)</p><p>rely on the drivers to connect to each node, decide how many connections to keep,</p><p>when to speculatively retry, and when to close connections if they are suspected of</p><p>malfunctioning. In the ScyllaDB case, sharing up-to-date topology information with the</p><p>drivers helps them make the right decisions. This data can be shared in multiple ways:</p><p>• Clients periodically fetching topology information from the servers</p><p>• Clients subscribing to events sent by the servers</p><p>• Clients taking an active part in one of the information exchange</p><p>protocols (e.g., gossip6)</p><p>• Any combination of these</p><p>Depending on the database model, another valuable piece of information often</p><p>cached client-side is metadata—a prime example of which is database schema. SQL</p><p>databases, as well as many NoSQL ones, keep the data at least partially structured. A</p><p>schema defines the shape of a database row (or column), the kinds of data types stored</p><p>in different columns, and various other characteristics (e.g., how long a database row is</p><p>6 See the documentation on Gossip in ScyllaDB (https://docs.scylladb.com/stable/kb/</p><p>gossip.html).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>87</p><p>supposed to live before it’s garbage-collected). Based on up-to-date schemas, drivers</p><p>can perform additional validation, making sure that data sent to the server has a proper</p><p>type and adheres to any constraints required by the database. On the other hand, when</p><p>a driver-side cache for schemas gets out of sync, clients can experience their queries</p><p>failing for no apparent reason.</p><p>Synchronizing full schema information can be costly in terms of performance, and</p><p>finding a good compromise in how often to update</p><p>highly depends on the use case. A</p><p>rule of thumb is to update only as often as needed to ensure that the traffic induced by</p><p>metadata exchange never negatively impacts the user experience. 
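Staying with the Python driver as an example, these trade-offs are typically exposed as client-side knobs. The snippet below shows the relevant options with illustrative settings (contact points and defaults included), not recommendations:

```python
from cassandra.cluster import Cluster

# Metadata handling is configurable when the cluster object is built. Disabling
# these caches saves bandwidth and connection setup time, but also disables
# features that rely on them, such as token-aware routing.
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],        # illustrative contact points
    schema_metadata_enabled=True,     # keep a client-side copy of the schema
    token_metadata_enabled=True,      # keep the token ring map up to date
)
session = cluster.connect()

# The driver-side caches can be inspected...
print(list(cluster.metadata.keyspaces))

# ...and refreshed explicitly, for example right after the application has
# changed the schema and wants to avoid acting on a stale copy.
cluster.refresh_schema_metadata()
```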
It’s also worth noting</p><p>that in a distributed database, clients are not always up to date with the latest schema</p><p>information, and the system as a whole should be prepared to handle it and provide</p><p>tactics for dealing with such inconsistencies.</p><p>Current Load</p><p>Overload protection and request latency optimization are tedious tasks, but they can be</p><p>substantially facilitated by exchanging as much context as possible between interested</p><p>parties.</p><p>The following methods can be applied to distribute the load evenly across the</p><p>distributed system and prevent unwanted spikes:</p><p>1. Gathering latency statistics per each database connection in the</p><p>drivers:</p><p>a. What’s the average latency for this connection?</p><p>b. What’s the 99th percentile latency?</p><p>c. What’s the maximum latency experienced in a recent time frame?</p><p>2. Exchanging information about server-side caches:</p><p>a. Is the cache full?</p><p>b. Is the cache warm (i.e., filled with useful data)?</p><p>c. Are certain items experiencing elevated traffic and/or latency?</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>88</p><p>3. Interpreting server events:</p><p>a. Has the server started replying with “overload errors”?</p><p>b. How often do requests for this server time out?</p><p>c. What is the general rate of errors for this server?</p><p>d. What is the measured goodput from this server?</p><p>Based on these indicators, drivers should try to amend the amount of data they</p><p>send, the concurrency, and the rate of retries as well as speculative execution, which</p><p>can keep the whole distributed system in a healthy, balanced state. It’s ultimately in the</p><p>driver’s interest to ease the pressure on nodes that start showing symptoms of getting</p><p>overloaded, be it by reducing the concurrency of operations, limiting the frequency</p><p>and number of retries, temporarily giving up on speculatively sent requests, and so on.</p><p>Otherwise, if the database servers get overloaded, all clients may experience symptoms</p><p>like failed requests, timeouts, increased latency, and so on.</p><p>Request Caching</p><p>Many database management systems, ranging from SQLite, MySQL, and Postgres to</p><p>NoSQL databases, implement an optimization technique called prepared statements.</p><p>While the language used to communicate with the database is usually human-readable</p><p>(or at least developer-readable), it is not the most efficient way of transferring data from</p><p>one computer to another.</p><p>Let’s take a look at the (simplified) lifecycle of an unprepared statement once it’s sent</p><p>from a ScyllaDB driver to the database and back. This is illustrated in Figure5-4.</p><p>Figure 5-4. Lifecycle of an unprepared statement</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>89</p><p>1. A query string is created:</p><p>INSERT INTO my_table(id, descr) VALUES (42,</p><p>'forty two');</p><p>2. The string is packed into a CQL frame by the driver. Each CQL</p><p>frame consists of a header, which describes the purpose of a</p><p>particular frame. Following the header, a specific payload may be</p><p>sent as well. The full protocol specification is available at https://</p><p>github.com/apache/cassandra/blob/trunk/doc/native_</p><p>protocol_v4.spec.</p><p>3. The CQL frame is sent over the network.</p><p>4. The frame is received by the database.</p><p>5. Once the frame is received, the database interprets the frame</p><p>header and then starts parsing the payload. 
If there’s an</p><p>unprepared statement, the payload is represented simply as a</p><p>string, as seen in Step 1.</p><p>6. The database parses the string in order to validate its contents and</p><p>interpret what kind of an operation is requested: is it an insertion,</p><p>an update, a deletion, a selection?</p><p>7. Once the statement is parsed, the database can continue</p><p>processing it (e.g., by persisting data on disk, fetching whatever’s</p><p>necessary, etc.).</p><p>Now, imagine that a user wants to perform a hundred million operations on the</p><p>database in quick succession because the data is migrated from another system. Even</p><p>if parsing the query strings is a relatively fast operation and takes 50 microseconds, the</p><p>total time spent on parsing strings will take over an hour of CPU time. Sounds like an</p><p>obvious target for optimization.</p><p>The key observation is that operations performed on a database are usually similar</p><p>to one another and follow a certain pattern. For instance, migrating a table from one</p><p>system to another may mean sending lots of requests with the following schema:</p><p>INSERT INTO my_table(id, descr) VALUES (?, ?)</p><p>where ? denotes the only part of the string that varies between requests.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>90</p><p>This query string with question marks instead of real values is actually also valid</p><p>CQL! While it can’t be executed as is (because some of the values are not known), it can</p><p>be prepared.</p><p>Preparing such a statement means that the database will meticulously analyze the</p><p>string, parse it, and create an internal representation of the statement in its own memory.</p><p>Once done, a unique identifier is generated and sent back to the driver. The client can now</p><p>execute the statement by providing only its identifier (which is a 128-bit UUID7 in ScyllaDB)</p><p>and all the values missing from the prepared query string. The process of replacing</p><p>question marks with actual values is called binding and it’s the only thing that the database</p><p>needs to do instead of launching a CQL parser, which offers a significant speedup.</p><p>Preparing statements without care can also be detrimental to overall cluster</p><p>performance though. When a statement gets prepared, the database needs to keep a</p><p>certain amount of information about it in memory, which is hardly a limitless resource.</p><p>Caches for prepared statements are usually relatively small, under the assumption that</p><p>the driver’s users (app developers) are kind and only prepare queries that are used</p><p>frequently. If, on the other hand, a user were to prepare lots of unique statements that</p><p>aren’t going to be reused any time soon, the database cache might invalidate existing</p><p>entries for frequently used queries. The exact heuristics of how entries are invalidated</p><p>depends on the algorithm used in the cache, but a naive LRU (least recently used)</p><p>eviction policy is susceptible to this problem. Therefore, other cache algorithms resilient</p><p>to such edge cases should be considered when designing a cache without full information</p><p>about expected usage patterns. 
Some notable examples include the following:</p><p>• LFU (least frequently used)</p><p>Aside from keeping track of which item was most recently accessed,</p><p>LFU also counts how many times it was needed in a given time</p><p>period, and tries to keep frequently used items in the cache.</p><p>• LRU with two pools</p><p>One probationary pool for new entries, and another, usually</p><p>larger, pool for frequently used items. This algorithm avoids cache</p><p>thrashing when lots of one-time entries are inserted in the cache,</p><p>because they only evict other items from the probationary pool,</p><p>while more frequently accessed entries are safe in the main pool.</p><p>7 See the memo, “A Universally Unique IDentifier (UUID) URN Namespace,” at https://www.</p><p>ietf.org/rfc/rfc4122.txt.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>91</p><p>Finally, regardless of the algorithm used for cache eviction implemented server-side,</p><p>drivers should take care not to prepare queries too aggressively, especially if it happens</p><p>automatically, which is often the case in ORMs (object-relational mappings). Making</p><p>an interface convenient for the user may sound tempting, and developer experience is</p><p>indeed an important factor when designing a driver, but being too eager with reserving</p><p>precious database resources may be disadvantageous in the long term.</p><p>Query Locality</p><p>In distributed systems, any kind of locality is welcome because it reduces the chances of</p><p>failure, keeps the latency low, and generally prevents many undesirable events. While</p><p>database clients, and thus also</p><p>drivers, do not usually share the same machines with</p><p>the database cluster, it is possible to keep the distance between them short. “Distance”</p><p>might mean either a physical measure or the number of intermediary devices in the</p><p>network topology. Either way, for latency’s sake, it’s good to minimize it between parties</p><p>that need to communicate with each other frequently.</p><p>Many database management systems allow their clients to announce their</p><p>“location,” for example, by declaring which datacenter is their local, default one. Drivers</p><p>should take that information into account when communicating with the database</p><p>nodes. As long as all consistency requirements are fulfilled, it’s usually better to send</p><p>data directly to a nearby node, under the assumption that it will spend less time in</p><p>transit. Short routes also usually imply fewer middlemen, and that in turn translates to</p><p>fewer potential points of failure.</p><p>Drivers can make much more educated choices though. Quite a few NoSQL</p><p>databases can be described as “distributed hash tables” because they partition their</p><p>data and spread it across multiple nodes which own a particular set of hashes. If the</p><p>hashing algorithm is well known and deterministic, drivers can leverage that fact to try</p><p>to optimize the queries even further—sending data directly to the appropriate node, or</p><p>even the appropriate CPU core.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>92</p><p>ScyllaDB, Cassandra, and other NoSQL databases apply a concept of token8</p><p>awareness (see Figures5-5, 5-6, and 5-7):</p><p>1. A request arrives.</p><p>2. The receiving node computes the hash of the given input.</p><p>3. Based on the value of this hash, it computes which database</p><p>nodes are responsible for this particular value.</p><p>4. 
Finally, it forwards the request directly to the owning nodes.</p><p>However, in certain cases, the driver can compute the token locally on its own, and</p><p>then use the cluster topology information to route the request straight to the owning</p><p>node. This local node-level routing saves at least one network round-trip as well as the</p><p>CPU time of some of the nodes.</p><p>8 A token is how a hash value is named in Cassandra nomenclature.</p><p>Figure 5-5. Naive clients route queries to any node (coordinator)</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>93</p><p>Figure 5-6. Token-aware clients route queries to the right node(s)</p><p>In the Cassandra/ScyllaDB case, this is possible because each table has a well-</p><p>defined “partitioner,” which simply means a hash function implementation. The default</p><p>choice—used in Cassandra—is murmur3,9 which returns a 64-bit hash value, has</p><p>satisfying distribution, and is relatively cheap to compute. ScyllaDB takes it one step</p><p>further and allows the drivers to calculate which CPU core of which database node owns</p><p>a particular datum. When a driver is cooperative and proactively establishes a separate</p><p>connection per each core of each machine, it can send the data not only to the right</p><p>node, but also straight to the single CPU core responsible for handling it. This not only</p><p>saves network bandwidth, but is also very friendly to CPU caches.</p><p>9 See the DataStax documentation on Murmur3Partitioner (https://docs.datastax.com/en/</p><p>cassandra-oss/3.x/cassandra/architecture/archPartitionerM3P.html).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>94</p><p>Figure 5-7. Shard-aware clients route queries to the correct node(s) + core</p><p>Retries</p><p>In a perfect system, no request ever fails and logic implemented in the drivers can</p><p>be kept clean and minimal. In the real world, failures happen disturbingly often, so</p><p>the drivers should also be ready to deal with them. One such mechanism for failure</p><p>tolerance is a driver’s retry policy. A retry policy’s job is to decide whether a request</p><p>should be sent again because it failed (or at least the driver strongly suspects that it did).</p><p>Error Categories</p><p>Before diving into techniques for retrying requests in a smart way, there’s a more</p><p>fundamental question to consider: does a retry even make sense? The answer is not that</p><p>obvious and it depends on many internal and external factors. When a request fails, the</p><p>error can fall into the following categories, presented with a few examples:</p><p>1. Timeouts</p><p>a. Read timeouts</p><p>b. Write timeouts</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>95</p><p>2. Temporary errors</p><p>a. Database node overload</p><p>b. Dead target node</p><p>c. Temporary schema mismatch</p><p>3. Permanent errors</p><p>a. Incorrect query syntax</p><p>b. Authentication error</p><p>c. Insufficient permissions</p><p>Depending on the category, the retry decision may be vastly different. For instance,</p><p>it makes absolutely no sense to retry a request that has incorrect syntax. It will not</p><p>magically start being correct, and such a retry attempt would only waste bandwidth and</p><p>database resources.</p><p>Idempotence</p><p>Error categories aside, retry policy must also consider one important trait of the request</p><p>itself: its idempotence. 
An idempotent request can be safely applied multiple times, and</p><p>the result will be indistinguishable from applying it just once.</p><p>Why does this need to be taken into account at all? For certain classes of errors, the</p><p>driver cannot be sure whether the request actually succeeded. A prime example of such</p><p>error is a timeout. The fact that the driver did not manage to get a response in time does</p><p>not mean that the server did not successfully process the request. It’s a similar situation</p><p>if the network connection goes down: The driver won’t know if the database server</p><p>actually managed to apply the request.</p><p>When in doubt, the driver should make an educated guess in order to ensure</p><p>consistency. Imagine a request that withdraws $100 from somebody’s bank account. You</p><p>certainly don’t want to retry the same request again if you’re not absolutely sure that</p><p>it failed; otherwise, the bank customer might become a bit resentful. This is a perfect</p><p>example of a non-idempotent request: Applying it multiple times changes the ultimate</p><p>outcome.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>96</p><p>Fortunately, there’s a large subset of idempotent queries that can be safely retried,</p><p>even when it’s unclear whether they already succeeded:</p><p>1. Read-only requests</p><p>Since they do not modify any data, they won’t have any side</p><p>effects, no matter how often they’re retried.</p><p>2. Certain conditional requests that have compare-and-set</p><p>characteristics (e.g., “bump the value by 1 if the previous</p><p>value is 42”)</p><p>Depending on the use case, such a condition may be enough to</p><p>guarantee idempotence. Once this request is applied, applying</p><p>it again would have no effect since the previous value would</p><p>then be 43.</p><p>3. Requests with unique timestamps</p><p>When each request has a unique timestamp (represented in wall</p><p>clock time or based on a logical clock10), applying it multiple times</p><p>can be idempotent. A retry attempt will contain a timestamp</p><p>identical to the original request, so it will only overwrite data</p><p>identified by this particular timestamp. If newer data arrives</p><p>in-between with a newer timestamp, it will not be overwritten by a</p><p>retry attempt with an older timestamp.</p><p>In general, it’s a good idea for drivers to give users an opportunity to declare</p><p>their requests’ idempotence explicitly. Some queries can be trivially deduced to be</p><p>idempotent by the driver (e.g., when it’s a read-only SELECT statement in the database</p><p>world), but others may be less obvious. For example, the conditional example from the</p><p>previous Step 2 is idempotent if the value is never decremented, but not in the general</p><p>case. Imagine the following counter-example:</p><p>1. The current value is 42.</p><p>2. A request “bump the value by 1 if the previous value is 42” is sent.</p><p>10 See the Logical Clocks lecture by Arvind Krishnamurthy (https://homes.cs.washington.</p><p>edu/~arvind/cs425/lectureNotes/clocks-2.pdf).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>97</p><p>3. A request “bump the value by 1 if the previous value is 42” is</p><p>retried.</p><p>4. Another request, “decrement the value by 1,” is sent.</p><p>5. 
The request from Step 2 arrives and is applied—changing the value to 43.

6. The request from Step 4 arrives and is applied—changing the value to 42.

7. The retry from Step 3 is applied—changing the value back to 43 and interfering with the effect of the query from Step 4. This wasn't idempotent after all!

Since it's often impossible to guess whether a request is idempotent just by analyzing its contents, it's best for drivers to expose something like a set_idempotent() function in their API. It allows users to explicitly mark queries as idempotent, and then the logic implemented in the driver can assume that it's safe to retry such a request when the need arises.

Retry Policies

Finally, there's enough context to discuss the actual retry policies that a database driver could implement. The sole job of a retry policy is to analyze a failed query and return a decision. That decision depends on the database system and its intrinsics, but it's usually one of the following (see Figure 5-8):

• Do not retry
• Retry on the same database node
• Retry, but on a different node
• Retry, but not immediately (apply some delay)

Figure 5-8. Decision graph for retrying a query

Deciding not to retry is often a decent choice; it's the only correct one when the driver isn't certain whether a non-idempotent query really failed or merely timed out. It's also the obvious choice for permanent errors, since there's no point in retrying a request that was previously refused due to incorrect syntax. And whenever the system is overloaded, the "do not retry" approach might help the entire cluster: although the immediate effect (preventing a user's request from being driven to completion) is not desirable, it provides a level of overload protection that can pay off later by stopping the overload condition from escalating further.
Once a node gets too much traffic, it refuses more</p><p>requests, which increases the rate of retries, and ends up in a vicious circle.</p><p>Retrying on the same database node is generally a good option for timeouts.</p><p>Assuming that the request is idempotent, the same node can probably resolve potential</p><p>conflicts faster. Retrying on a different node is a good idea if the previous node showed</p><p>symptoms of overload, or had an input/output error that indicated a temporary issue.</p><p>Finally, in certain cases, it’s a good idea to delay the retry instead of firing it off</p><p>immediately (see Figure5-9).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>99</p><p>Figure 5-9. Retry attempts eventually resulting in a successful query</p><p>When the whole cluster shows the symptoms of overload—be it high reported CPU</p><p>usage or perceived increased latency—retrying immediately after a request failed may</p><p>only exacerbate the problem. What a driver can do instead is apply a gentle backoff</p><p>algorithm, giving the database cluster time to recover. Remember that even a failed retry</p><p>costs resources: networking, CPU, and memory. Therefore, it’s better to balance the costs</p><p>and chances for success in a reasonable manner.</p><p>The three most common backoff strategies are constant, linear, and exponential</p><p>backoff, as visualized in Figure5-10.</p><p>Figure 5-10. Constant, linear, and exponential backoffs</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>100</p><p>The first type (constant) simply waits a certain predefined amount of time before</p><p>retrying. Linear backoff increases the time between attempts in a linear fashion; it</p><p>could wait one second before the first attempt, two seconds before the second one,</p><p>and so forth. Finally, exponential backoff, arguably the most commonly used method,</p><p>increases the delay by multiplying it by a constant each time. Usually it just doubles</p><p>it—because both processors and developers love multiplying and dividing by two (the</p><p>latter ones mostly just to show off their intricate knowledge of the bitwise shift operator).</p><p>Exponential backoff has especially nice characteristics for overload prevention. The retry</p><p>rate drops exponentially, and so does the pressure that the driver places on the database</p><p>cluster.</p><p>Paging</p><p>Databases usually store amounts of data that are orders of magnitude larger than a single</p><p>client machine could handle. If you fetch all available records, the result is unlikely to fit</p><p>into your local disks, not to mention your available RAM.Nonetheless, there are many</p><p>valid cases for processing large amounts of data, such as analyzing logs or searching for</p><p>specific documents. It is quite acceptable to ask the database to serve up all the data it</p><p>has—but you probably want it to deliver that data in smaller bits.</p><p>That technique is customarily called paging, and it is ubiquitous. It’s exactly what</p><p>you’ve experienced when browsing through page 17 of Google search results in futile</p><p>search for an answer to a question that was asked only on an inactive forum seven years</p><p>ago—or getting all the way to page 24 of eBay listings, hunting for that single perfect offer.</p><p>Databases and their drivers also implement paging as a mechanism beneficial for both</p><p>parties. 
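From the driver's side, paging typically surfaces as an iterator-like API that fetches the next chunk only when the previous one has been consumed. The following sketch is illustrative only; the execute() call and the current_rows and paging_state fields are hypothetical placeholders rather than any specific driver's interface:

def paged_query(session, statement, page_size=1000):
    # Ask the server to return at most page_size rows per response.
    statement.fetch_size = page_size
    result = session.execute(statement)            # first page
    while True:
        for row in result.current_rows:
            yield row                              # hand rows out one by one
        if result.paging_state is None:            # server signals: no more pages
            break
        # Request the next page, resuming from the server-provided cursor.
        result = session.execute(statement, paging_state=result.paging_state)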
Drivers get their data in smaller chunks, which can be done with lower latency.</p><p>And databases receive smaller queries, which helps with cache management, workload</p><p>prioritization, memory usage, and so on.</p><p>Different database models may have a different view of exactly what paging involves</p><p>and how you interface with it. Some systems may offer fine-grained control, which</p><p>allows you to ask for “page 16” of your data. Others are “forward-only”: They reduce the</p><p>user-facing interface to “here’s the current page—you can ask for the next page if you</p><p>want.” Your ability to control the page size also varies. Sometimes it’s possible to specify</p><p>the size in terms of a number of database records or bytes. In other cases, the page size</p><p>is fixed.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>101</p><p>On top of a minimal interface that allows paging to be requested, drivers can</p><p>offer many interesting features and optimizations related to paging. One of them is</p><p>readahead—which usually means that the driver transparently and speculatively fetches</p><p>new pages before you actually ask for them to be read. A readahead is a classic example</p><p>of a double-edged sword. On the one hand, it makes certain read operations faster,</p><p>especially if the workload consists of large consecutive reads. On the other, it may cause</p><p>prohibitive overhead, especially if the workload is based on small random reads.</p><p>Although most drivers support paging, it’s important to check whether the feature</p><p>is opt-in or opt-out and consciously decide what’s best for a specific workload. In</p><p>particular, pay attention to the following aspects:</p><p>1. What’s the default behavior (would a read query be paged or</p><p>unpaged)?</p><p>2. What’s the default page size and is it configurable? If so, in what</p><p>units can a size be specified? Bytes? Number of records?</p><p>3. Is readahead on by default? Can it be turned on/off?</p><p>4. Can readahead be configured further? For example, can you</p><p>specify how many pages to fetch or when to decide to start</p><p>fetching (e.g., “When at least three consecutive read requests</p><p>already occurred”)?</p><p>Setting up paging properly is important because a single unpaged response can</p><p>be large enough to be problematic for both the database servers forced to produce it,</p><p>and for the client trying to receive it. On the other hand, too granular paging can lead</p><p>to unnecessary overhead (just imagine trying to read a billion records row-by-row, due</p><p>to the default page size of “1 row”). Finally, readahead can be a fantastic optimization</p><p>technique—but it can also be entirely redundant, fetching unwanted pages that cost</p><p>memory, CPU time, and throughput, as well as confuse the metrics and logs. With</p><p>paging configuration, it’s best to be as explicit as possible.</p><p>Concurrency</p><p>In many cases, the only way to utilize a database to the fullest—and achieve optimal</p><p>performance—is to also achieve high concurrency. That often requires the drivers to</p><p>perform many I/O operations at the same</p><p>time, and that’s in turn customarily achieved</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>102</p><p>by issuing asynchronous tasks. That being said, let’s take quite a few steps back to</p><p>explain what that really means and what’s involved in achieving that from both a</p><p>hardware and software perspective.</p><p>Note high concurrency is not a silver bullet. 
When it’s too high, it’s easy</p><p>to overload the system and ruin the quality of service for other users—see</p><p>Figure5-11 for its effect on latency. Chapter 1 includes a cautionary tale on what</p><p>can happen when concurrency gets out of bounds and Chapter 2 also touches on</p><p>the dangers of unbounded concurrency.</p><p>Modern Hardware</p><p>Back in the old days, making decisions around I/O concurrency was easy because</p><p>magnetic storage drives (HDD) had an effective concurrency of 1. There was (usually)</p><p>only a single actuator arm used to navigate the platters, so only a single sector of data</p><p>could have been read at once. Then, an SSD revolution happened. Suddenly, disks</p><p>could read from multiple offsets concurrently. Moreover, it became next to impossible to</p><p>fully utilize the disk (i.e., to read and write with the speeds advertised in shiny numbers</p><p>printed on their labels) without actually asking for multiple operations to be performed</p><p>concurrently. Now, with enterprise-grade NVMe drives and inventions like Intel</p><p>Optane,11 concurrency is a major factor when benchmarking input/output devices. See</p><p>Figure5-11.</p><p>11 High speed persistent memory (sadly discontinued in 2021).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>103</p><p>Figure 5-11. Relationship between the system’s concurrency and latency</p><p>Networking technology is not lagging behind either. Modern networking cards</p><p>have multiple independent queues, which, with the help of receive-side scaling (RSS12),</p><p>enable previously unimaginable levels of performance, with throughput measured</p><p>in Tbps.13 With such advanced hardware, achieving high concurrency in software is</p><p>required to simply utilize the available capabilities.</p><p>CPU cores obviously deserve to be mentioned here as well. That’s the part</p><p>of computer infrastructure that’s undoubtedly most commonly associated with</p><p>concurrency. Buying a 64-core consumer-grade processor is just a matter of going to</p><p>the hardware store next door, and the assortment of professional servers is even more</p><p>plentiful.</p><p>Operating systems focus on facilitating highly concurrent programs too. io_uring14</p><p>by Jens Axboe is a novel addition to the Linux kernel. As noted in Chapter 3, it was</p><p>developed for asynchronous I/O, which in turn plays a major part in allowing high</p><p>concurrency in software to become the new standard. Some database drivers already</p><p>utilize io_uring underneath, and many more put the integration very high in the list of</p><p>priorities.</p><p>12 RSS allows directing traffic from specific queues directly into chosen CPUs.</p><p>13 Terabits per second</p><p>14 See the “Efficient IO with io_uring” article (https://kernel.dk/io_uring.pdf).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>104</p><p>Modern Software</p><p>How could modern software adapt to the new, highly concurrent era? Historically,</p><p>a popular model of ensuring that multiple operations can be performed at the same</p><p>time was to keep a pool of operating system threads, with each thread having its own</p><p>queue of tasks. 
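A minimal sketch of that classic thread-pool approach, assuming a generic blocking session.execute() call provided by some driver:

from concurrent.futures import ThreadPoolExecutor

def run_blocking_queries(session, statements, workers=32):
    # Each worker thread blocks on one query at a time, so the effective
    # concurrency is capped by the number of threads in the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(session.execute, stmt) for stmt in statements]
        return [future.result() for future in futures]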
That only scales in a limited way though, so now the industry leans</p><p>toward so-called “green threads,” which are conceptually similar to their operating</p><p>system namesakes, but are instead implemented in userspace, in a much more</p><p>lightweight manner.</p><p>For example, in Seastar (a high-performance asynchronous framework implemented</p><p>in C++ and based on a future-promise model15), there are quite a few ways of expressing</p><p>a single flow of execution, which could be called a green thread. A fiber of execution can</p><p>be created by chaining futures, and you can also use the C++ coroutines mechanism to</p><p>build asynchronous programs in a clean way, with the compiler assisting in making the</p><p>code async-friendly.</p><p>In the Rust language, the asynchronous model is quite unique. There, a future</p><p>represents the computation, and it’s the programmer’s responsibility to advance the</p><p>state of this asynchronous state machine. Other languages, like JavaScript, Go, and Java,</p><p>also come with well-defined and standardized support for asynchronous programming.</p><p>This async programming support is good, because database drivers are prime</p><p>examples of software that should support asynchronous operations from day one.</p><p>Drivers are generally responsible for communicating over the network with highly</p><p>specialized database clusters, capable of performing lots of I/O operations at the same</p><p>time. We can’t emphasize enough that high concurrency is the only way to utilize the</p><p>database to the fullest. Asynchronous code makes that substantially easier because it</p><p>allows high levels of concurrency to be achieved without straining the local resources.</p><p>Green threads are lightweight and there can be thousands of them even on a consumer-</p><p>grade laptop. Asynchronous I/O is a perfect fit for this use case as well because it allows</p><p>efficiently sending thousands of requests over the network in parallel, without blocking</p><p>the CPU and forcing it to wait for any of the operations to complete, which was a known</p><p>bottleneck in the legacy threadpool model.</p><p>15 See the Seastar documentation on futures and promises (https://seastar.io/</p><p>futures-promises/).</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>105</p><p>Note the future-promise model and asynchronous I/O are introduced in</p><p>Chapter 3.</p><p>What toLook forWhen Selecting aDriver</p><p>Database drivers are commonly available as open-source software. It’s a great model</p><p>that allows people to contribute and also makes the software easily accessible, ergo</p><p>popular (precisely what database vendors want). Drivers can be developed either by the</p><p>vendor, or another company, or simply your next door open-source contributor. This</p><p>kind of competition is very healthy for the entire system, but it also forces the users to</p><p>make a choice: which driver to use? For instance, at the time of this writing, the official</p><p>PostgreSQL documentation lists six drivers for C/C++ alone, with the complete list being</p><p>much longer.16</p><p>Choosing a driver should be a very deliberate decision, tailored to your unique</p><p>situation and preceded by tests, benchmarks, and evaluations. Nevertheless, there are</p><p>some general rules of thumb that can help guide you:</p><p>1. Clear documentation</p><p>Clear documentation is often initially underestimated by database</p><p>drivers’ users and developers alike. 
However, in the long term, it’s</p><p>the most important repository of knowledge for everyone, where</p><p>implementation details, good practices, and hidden assumptions</p><p>can be thoroughly explained. Choosing an undocumented driver</p><p>is a lottery—buying a pig in a poke. Don’t get distracted by shiny</p><p>benchmarks on the front page; the really valuable part is thorough</p><p>documentation. Note that it does not have to be a voluminous</p><p>book. On the contrary—concise, straight-to-the-point docs with</p><p>clear, working examples are even better.</p><p>16 See the PostgreSQL Drivers documentation at https://wiki.postgresql.org/wiki/</p><p>List_of_drivers.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>106</p><p>2. Long-term support and active maintainership</p><p>Officially supported drivers are often maintained by their vendors,</p><p>get released regularly, and have their security vulnerabilities</p><p>fixed faster. External open-source drivers might look appealing</p><p>at first, easily winning in their self-presented benchmarks, but</p><p>it’s important to research how often they get released, how</p><p>often bugs are fixed, and how likely they are to be maintained</p><p>in the foreseeable future. On the other hand, sometimes the</p><p>situation is reversed: The most modern, efficient code can be</p><p>found in an open-source driver, while the official one is hardly</p><p>maintained at all!</p><p>3. Asynchronous API</p><p>Your code is eventually going to need high concurrency, so</p><p>it’s better</p><p>to bet on an async-friendly driver, even if you’re not</p><p>ready to take advantage of that quite yet. The decision will likely</p><p>pay off later. While it’s easy to use an asynchronous driver in a</p><p>synchronous manner, the opposite is not true.</p><p>4. Decent test coverage</p><p>Testing is extremely important not only for the database nodes,</p><p>but also for the drivers. They are the first proxy between the users</p><p>and the database cluster, and any error in the driver can quickly</p><p>propagate to the whole system. If the driver corrupts outgoing</p><p>data, it may get persisted on the database, eventually making</p><p>the whole cluster unusable. If the driver incorrectly interprets</p><p>incoming data, its users will have a false picture of the database</p><p>state. And if it produces data based on this false picture, it can</p><p>just as well corrupt the entire database cluster. A driver that</p><p>cannot properly handle its load balancing and retry policy can</p><p>inadvertently overload a database node with excess requests,</p><p>which is detrimental to the whole system. If the driver is at least</p><p>properly tested, users can assume a higher level of trust in it.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>107</p><p>5. Database-specific optimizations</p><p>A good driver should cooperate with its database. The more</p><p>context it gathers from the cluster, the more educated decisions</p><p>it can make. Remember that clients, and therefore drivers, are</p><p>often the most ubiquitous group of agents in distributed systems,</p><p>directly contributing to the cluster-wide concurrency. That makes</p><p>it especially important for them to be cooperative.</p><p>Summary</p><p>This chapter provided insights into how the choice of a database driver impacts</p><p>performance and highlighted considerations to keep in mind when selecting a driver.</p><p>Drivers are often an overlooked part of a distributed system. 
That’s a shame because</p><p>drivers are so close to database users, both physically and figuratively! Proximity is an</p><p>extremely important factor in all networked systems because it directly translates to</p><p>latency. The next chapter ponders proximity from a subtly different point of view: How to</p><p>get the data itself closer to the application users.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChAPter 5 DAtAbAse DrIvers</p><p>109</p><p>CHAPTER 6</p><p>Getting Data Closer</p><p>Location, location, location. Sometimes it’s just as important to database performance</p><p>as it is to real estate. Just as the location of a home influences how quickly it sells, the</p><p>location of where data “lives” and is processed also matters for response times and</p><p>latencies.</p><p>Pushing more logic into the database can often reduce network latency (and costs,</p><p>e.g., when your infrastructure provider charges for ingress/egress network traffic) while</p><p>taking advantage of the database’s powerful compute capability. And redistributing</p><p>database logic from fewer powerful datacenters to more minimalist ones that are closer</p><p>to users is another move that can yield discernable performance gains under the right</p><p>conditions.</p><p>This chapter explores the opportunities in both of these shifts. First, it looks at</p><p>databases as compute engines with a focus on user-defined functions and user-defined</p><p>aggregates. It then goes deeper into WebAssembly, which is now increasingly being</p><p>used to implement user-defined functions and aggregates (among many other things).</p><p>Finally, the chapter ventures to the edge—exploring what you stand to gain by moving</p><p>your database servers quite close to your users, as well as what potential pitfalls you</p><p>need to negotiate in this scenario.</p><p>Databases asCompute Engines</p><p>Modern databases offer many more capabilities than just storing and retrieving</p><p>data. Some of them are nothing short of operating systems, capable of streaming,</p><p>modifying, encrypting, authorizing, authenticating, and virtually anything else with data</p><p>they manage.</p><p>Data locality is the holy grail of distributed systems. The less you need to move</p><p>data around, the more time can be spent on performing meaningful operations on</p><p>it—without excessive bandwidth costs. That’s why it makes sense to try to push more</p><p>logic into the database itself, letting it process as much as possible locally, then return</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. 
Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_6</p><p>110</p><p>the results to the users, or some middleware, for further processing. It makes even more</p><p>sense when you consider that database nodes generally run on powerful hardware,</p><p>with lots of RAM and fast I/O devices. This usually translates to formidable CPU power.</p><p>Dedicated large data processing frameworks aside (e.g., Apache Spark, which is out of</p><p>scope for this book), regular database engines almost always support some level of user-</p><p>defined computations. These can be classified into two major sections: user-defined</p><p>functions/procedures and user-defined aggregates.</p><p>Note that the definitions vary. Some database vendors use the general name</p><p>“functions” to mean both aggregate and scalar functions. Others actually mean</p><p>“scalar functions” when they reference “functions,” and use the name “aggregates” for</p><p>“aggregate functions.” That’s the convention applied to this chapter.</p><p>User-Defined Functions andProcedures</p><p>In contrast to native functions, often implemented in database engines (think</p><p>lowercase(), now(), concat(), type casting, algebraic operations, and friends), user-</p><p>defined functions are provided by the users of the database (e.g., the developers building</p><p>applications). A “procedure” is substantially identical to a function in this context, except</p><p>it does not return any result; instead, it has side effects.</p><p>The exact interface of allowing users to define their own functions or procedures</p><p>varies wildly between database vendors. Still, several core strategies, listed here, are</p><p>often implemented:</p><p>1. A set of hardcoded native functions, not extensible, but at least</p><p>composable. For example, casting a type to string, concatenating</p><p>it with a predefined suffix, and then hashing it.</p><p>2. A custom scripting language, dedicated and vendor-locked to</p><p>a specific database, allowing users to write and execute simple</p><p>programs on the data.</p><p>3. Supporting a single general-purpose embeddable language</p><p>of choice. For example, Lisp, Lua, ChaiScript, Squirrel, or</p><p>WebAssembly might be used for this purpose. Note: You’ll explore</p><p>WebAssembly in more depth a little later in this chapter.</p><p>Chapter 6 GettinG Data Closer</p><p>111</p><p>4. Supporting a variety of pluggable embeddable languages. A good</p><p>example is Apache Cassandra and its support of Java (native</p><p>language) and JavaScript1 as well as pluggable backend-loaded via</p><p>.jar files.</p><p>The first on the list is the least flexible, offers the worst developer experience, and</p><p>has the lowest security risk. The last has the most flexibility, offers the best developer</p><p>experience, and also harbors the most potential for being a security risk worthy of its</p><p>own CVE number.</p><p>Scalar functions are usually invoked per each row, at least for row-oriented</p><p>databases, which is usually the case for SQL.You might wonder if the computations</p><p>can’t</p><p>simply be performed by end users on their machines. That’s a valid point. 
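For instance, nothing stops a client from fetching raw rows and doing the math locally; a trivial sketch, again assuming a generic session.execute() call and the my_table schema used earlier:

rows = session.execute("SELECT id, descr FROM my_table")
# Computed entirely client-side, without any help from the database:
descr_lengths = {row.id: len(row.descr) for row in rows}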
The main</p><p>advantage of that approach is fantastic scalability regardless of how many users perform</p><p>data transformations (if they do it locally on their own machines, then the database</p><p>cluster does not get overloaded).</p><p>There are several great reasons to push the computations closer to where the data</p><p>is stored:</p><p>• Databases have more context to efficiently cache the computed</p><p>results. Imagine tens of thousands of users asking for the same</p><p>function to be applied on a certain set of rows. That result can be</p><p>computed just once and then distributed to all interested parties.</p><p>• If the computed results are considerably smaller than their input</p><p>(think about returning just lengths of text values), it’s better to save</p><p>bandwidth and send over only the final results.</p><p>• Certain housekeeping operations (e.g., deleting data older than a</p><p>week) can be efficiently performed locally, without fetching any</p><p>information to the clients for validation.</p><p>1 It’s also a great example of the CVE risk: https://cve.mitre.org/cgi-bin/cvename.</p><p>cgi?name=CVE-2021-44521</p><p>https://jfrog.com/blog/cve-2021-44521-exploiting-apache-cassandra-user-defined-</p><p>functions-for-remote-code-execution/</p><p>Chapter 6 GettinG Data Closer</p><p>112</p><p>• If the processing is done on database servers, the instruction cache</p><p>residing on that database’s CPU chip is likely to be scorching hot with</p><p>opcodes responsible for carrying out the computations for each row.</p><p>And as a rule of thumb, hot cache translates to faster code execution</p><p>and lower latency.</p><p>• Some computations are not trivially distributed to users. If they</p><p>involve cryptographic private keys stored on the database servers,</p><p>it might actually be impossible to run the code anywhere but on the</p><p>server itself.</p><p>• If the data on which computations are performed is sensitive (e.g., it</p><p>falls under infamous, ever-changing European data protection laws</p><p>such as GDPR), it might be illegal to send raw data to the users. In</p><p>such cases, running an encryption function server-side can be a way</p><p>for users to obtain obfuscated, legal data.</p><p>Determinism</p><p>In distributed environments, idempotence (discussed in Chapter 5) is an important</p><p>attribute that makes it possible to send requests in a speculative manner, potentially</p><p>increasing performance. Thus, it is better to make sure that user-defined functions are</p><p>deterministic. In other words, a user-defined function’s value should only depend on</p><p>the value of its arguments, and not on the value of any external factors like time, date,</p><p>pseudo-random seed, and so on.</p><p>A perfect example of a non-deterministic function is now(). Calling it twice might</p><p>yield the same value if you’re fast enough, but it’s generally not guaranteed since its</p><p>result is time-dependent. If possible, it’s a good idea to program the user-defined</p><p>functions in a deterministic way and mark them as such. For time/date, this might</p><p>involve computing the results based on a timestamp passed as a parameter rather than</p><p>using built-in time utilities. 
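As a small Python illustration of the difference (not tied to any particular database's UDF runtime), compare a function that consults the wall clock with one that receives the reference time as an argument:

import datetime

# Non-deterministic: the result depends on when the function happens to run.
def age_in_days_bad(created_at):
    return (datetime.datetime.now() - created_at).days

# Deterministic: the caller supplies "now" explicitly, so identical arguments
# always produce identical results, which makes speculative retries safe.
def age_in_days(created_at, as_of):
    return (as_of - created_at).days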
For pseudo-random sampling, the seed could also be passed</p><p>as a parameter, as opposed to relying on sources of entropy provided by the user-defined</p><p>function runtime.</p><p>Chapter 6 GettinG Data Closer</p><p>113</p><p>Latency</p><p>Running user-provided code on your database clusters is potentially dangerous in</p><p>aspects other than security. Most embedded languages are Turing-complete, and</p><p>customarily allow the developers to use loops, recursion, and other similar techniques</p><p>in their code. That’s risky. An undetected infinite loop may serve as a denial-of- service</p><p>attack, forcing the database servers to endlessly process a function and block other tasks</p><p>from used resources. And even if the user-defined function author did not have malicious</p><p>intentions, some computations simply consume a lot of CPU time and memory.</p><p>In a way, a user-defined function should be thought of as a potential “noisy</p><p>neighbor”2 and its resources should be as limited as possible. For some use cases,</p><p>a simple hard limit on memory and CPU time used is enough to ensure that the</p><p>performance of other database tasks does not suffer from a “noisy” user-defined</p><p>function. However, sometimes, a more specific solution is required—for example,</p><p>splitting a user- function definition into smaller time bits, assigning priorities to user-</p><p>defined functions, and so on.</p><p>One interesting metering mechanism was applied by Wasmtime,3 a WebAssembly</p><p>runtime. Code running in a WebAssembly instance consumes fuel,4 a synthetic unit used</p><p>for tracking how fast an instance exhausts system resources. When an instance runs out</p><p>of fuel, the runtime does one of the preconfigured actions—either “refills” and lets the</p><p>code execution continue or decides that the task reached its quota and terminates it.</p><p>Just-in-Time Compilation (JIT)</p><p>Languages used for user-defined functions are often either interpreted (e.g., Lua) or</p><p>represented in bytecode that runs on a virtual machine (e.g., WebAssembly). Both of</p><p>these approaches can benefit from just-in-time compilation. It’s a broad topic, but the</p><p>essence of it is that during runtime, the code of user-defined functions can be compiled</p><p>to another, more efficient representation, and optimized along the way. This may mean</p><p>translating bytecode to machine code the program runs on (e.g., x86-64 instructions), or</p><p>compiling the source code represented in an interpreted language to machine code.</p><p>2 See the Microsoft Azure documentation on the Noisy Neighbor antipattern (https://learn.</p><p>microsoft.com/en-us/azure/architecture/antipatterns/noisy-neighbor/noisy-neighbor).</p><p>3 See the Bytecode Alliance documentation at https://wasmtime.dev.</p><p>4 See the Wasmtime docs (https://docs.wasmtime.dev/api/wasmtime/struct.Store.</p><p>html#method.fuel_consumed).</p><p>Chapter 6 GettinG Data Closer</p><p>114</p><p>JIT is a very powerful tool, but it’s not a silver bullet—compilation and additional</p><p>optimization can be an expensive process in terms of resources. A small user-defined</p><p>function may take less than a millisecond to run, but recompiling it can cause a</p><p>sudden spike in CPU and memory usage, as well as a multi-millisecond delay in the</p><p>processing—resulting in high tail latency. 
It should therefore be a conscious decision to</p><p>either enable just-in-time compilation for user-defined functions if the language allows</p><p>it, or disable it altogether.</p><p>Examples</p><p>Let’s take a look at a few examples of user-defined functions. The function serving as</p><p>the example operates on floating point numbers; given two parameters, it returns the</p><p>sum of them, inverted. Given 5 and 7, it should return 1</p><p>5</p><p>1</p><p>7+ , which is approximately</p><p>0.34285714285.</p><p>Here’s how it could be defined in Apache Cassandra, which allows user-defined</p><p>function definitions to be provided in Java, its native language, as well as in other</p><p>languages:</p><p>CREATE OR REPLACE FUNCTION add_inverse(val1 double, val2 double)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS double LANGUAGE java</p><p>AS '</p><p>return (val1 == 0 || val2 == 0)</p><p>? Double.NaN</p><p>: (1/val1 + 1/val2);</p><p>';</p><p>Let’s take a closer look at the definition. The first line is straightforward: it includes</p><p>the function’s name, parameters, and its types. It also specifies that if a function</p><p>definition with that name already exists, it should be replaced. Next, it explicitly declares</p><p>what happens if any of the parameters is null, which is a valid value for any type. The</p><p>function can either return null without calling the function at all or allow null and let</p><p>the source code handle it explicitly (the syntax for that is CALLED ON NULL INPUT). This</p><p>explicit declaration is required by Apache Cassandra.</p><p>Chapter 6 GettinG Data Closer</p><p>115</p><p>That declaration is then followed by the return type and chosen language—from</p><p>which you can correctly deduce that multiple languages are supported. Then comes</p><p>the function</p><p>body. The only non-obvious decision made by the programmer was how</p><p>to handle 0 as a parameter. Since the type system implemented in Apache Cassandra</p><p>already handles NaN,5 it’s a decent candidate (next to positive/negative infinity).</p><p>The newly created function can be easily tested by creating a table, filling it with a</p><p>few values, and inspecting the result:</p><p>CREATE TABLE test(v1 double PRIMARY KEY, v2 double);</p><p>INSERT INTO test(v1, v2) VALUES (5, 7);</p><p>INSERT INTO test(v1, v2) VALUES (2, 2);</p><p>INSERT INTO test(v1) VALUES (9);</p><p>INSERT INTO test(v1, v2) VALUES (7, 0);</p><p>SELECT v1, v2, add_inverse(v1, v2) FROM test;</p><p>cassandra@cqlsh:test> SELECT v1, v2, add_inverse(v1, v2) FROM test;</p><p>v1 | v2 | test.add_inverse(v1, v2)</p><p>----+------+--------------------------</p><p>9 | null | null</p><p>5 | 7 | 0.342857</p><p>2 | 2 | 1</p><p>7 | 0 | NaN</p><p>From the performance perspective, is offloading such a simple function to the</p><p>database servers worth it? Not likely—the computations are fairly cheap, so users</p><p>shouldn’t have an issue deriving these values themselves, immediately after receiving</p><p>the data. The database servers, on the other hand, may need to initialize a runtime</p><p>for user-defined functions, since these functions are often sandboxed for security</p><p>purposes. That runtime initialization takes time and other resources. 
Offloading such</p><p>computations makes much more sense if the data is aggregated server-side, which is</p><p>discussed in the next section (on user-defined aggregates).</p><p>5 Not-a-number</p><p>Chapter 6 GettinG Data Closer</p><p>116</p><p>Best Practices</p><p>Before you learn about user-defined aggregates, which unleash the true potential of</p><p>user-defined functions, it’s important to sum up a few best practices for setting up user-</p><p>defined functions in your database management system:</p><p>1. Evaluate if you need user-defined functions at all—compare</p><p>the latency (and general performance) of queries utilizing user-</p><p>defined functions vs computing everything client-side (assuming</p><p>that’s even possible).</p><p>2. Test if offloading computations to the database servers scales.</p><p>Look at metrics like CPU utilization to assess how well your</p><p>database system can handle thousands of users requesting</p><p>additional computations.</p><p>3. Recognize that since user-defined functions are likely going</p><p>to be executed on the “fast path,” they need to be optimized</p><p>and benchmarked as well! Consider the performance best</p><p>practices for the language you’re using for user-defined function</p><p>implementation.</p><p>4. Make sure to properly handle any errors or exceptional cases in</p><p>your user-defined function to avoid disrupting the operation of</p><p>the rest of the database system.</p><p>5. Consider using built-in functions whenever possible instead of</p><p>creating a user- defined function. The built-in functions may be</p><p>more optimized and efficient.</p><p>6. Keep your user-defined functions simple and modular, breaking</p><p>up complex tasks into smaller, more manageable functions that</p><p>can be easily tested and reused.</p><p>7. Properly document your user-defined functions so that other</p><p>users of the database system can understand how they work and</p><p>how to use them correctly.</p><p>Chapter 6 GettinG Data Closer</p><p>117</p><p>User-Defined Aggregates</p><p>The greatest potential for user-defined functions lies in them being building blocks for</p><p>user-defined aggregates. Aggregate functions operate on multiple rows or columns,</p><p>sometimes on entire tables or databases.</p><p>Moving this kind of operation closer to where the data lies makes perfect sense.</p><p>Imagine 1TB worth of database rows that need to be aggregated into a single value: the</p><p>sum of their values. When a thousand users request all these rows in order to perform</p><p>the aggregation client-side, the following happens:</p><p>1. A total of a petabyte of data is sent over the network to each user.</p><p>2. Each user performs extensive computations, expensive in terms</p><p>of RAM and CPU, that lead to exactly the same result as the</p><p>other users.</p><p>If the aggregation is performed by the database servers, it not only avoids a petabyte</p><p>of traffic; it also saves computing power for the users (which is a considerably greener</p><p>solution). If the computation is properly cached, it only needs to be performed once.</p><p>This is a major win in terms of performance, and many use cases can immediately</p><p>benefit from pushing the aggregate computations closer to the data. 
This is especially</p><p>important for analytic workloads that tend to process large volumes of data in order to</p><p>produce useful statistics and feedback—a process that is its own type of aggregation.</p><p>Built-In Aggregates</p><p>Databases that allow creating user-defined aggregates usually also provide a few</p><p>traditional built-in aggregation functions: the (in)famous COUNT(*), but also MAX, MIN,</p><p>SUM, AVG, and others. Such functions take into account multiple rows or values and return</p><p>an aggregated result. The result may be a single value. Or, it could also be a set of values</p><p>if the input is divided into smaller classes. One example of such an operation is SQL’s</p><p>GROUP BY statement, which applies the aggregation to multiple disjoint groups of values.</p><p>Built-in aggregates should be preferred over user-defined ones whenever possible—</p><p>they are likely written in the language native to the database server, already optimized,</p><p>and secure. Still, the set of predefined aggregate functions is often very basic and doesn’t</p><p>allow users to perform the complex computations that make user-defined aggregates</p><p>such a powerful tool.</p><p>Chapter 6 GettinG Data Closer</p><p>118</p><p>Components</p><p>User-defined aggregates are customarily built on top of user-defined scalar functions.</p><p>The details heavily depend on the database system, but the following components are</p><p>definitely worth mentioning.</p><p>Initial Value</p><p>An aggregation needs to start somewhere, and it’s up to the user to provide an initial</p><p>value from which the final result will eventually be computed. In the case of the COUNT</p><p>function, which returns the number of rows or values in a table, a natural candidate</p><p>for the initial value is 0. In the case of AVG, which computes the arithmetic mean from</p><p>all column values, the initial state could consist of two variables: The total number of</p><p>values, initialized to 0, and the total sum of values, also initialized to 0.</p><p>State Transition Function</p><p>The core of each user-defined aggregate is its state transition function. This function</p><p>is called for each new value that needs to be processed, and each time it is called, it</p><p>returns the new state of the aggregation. Following the COUNT function example, its state</p><p>transition function simply increments the number of rows by one. The state transition</p><p>function of the AVG aggregate just adds the current value to the total sum and increments</p><p>the total number of values by one.</p><p>Final Function</p><p>The final function is an optional feature for user-defined aggregates. Its sole purpose is</p><p>to transform the final state of the aggregation to something else. For COUNT, no further</p><p>transformations are required. The user is simply interested in the final state of the</p><p>aggregation (the number of values), so the final function doesn’t need to be present; it</p><p>can be assumed to be an identity function. However, in the case of AVG, the final function</p><p>is what makes the result useful to the user. 
It transforms the final state—the total number</p><p>of values and its total sum—and produces the arithmetic mean by simply dividing one</p><p>by the other, handling the special case of avoiding dividing by zero.</p><p>Chapter 6 GettinG Data Closer</p><p>119</p><p>Reduce Function</p><p>The reduce function is an interesting optional addition to the user-defined aggregates</p><p>world, especially for distributed databases. It can be thought of as another state</p><p>transition function, but one that can combine two partial states into one.</p><p>With the help of a reduce function, computations of the user-defined aggregate</p><p>can be distributed to multiple database nodes, in a map-reduce6 fashion. This, in turn,</p><p>can bring massive performance gains, because the computations suddenly become</p><p>concurrent. Note that this optimization is not always possible—if the state transition</p><p>function is not commutative, distributing the partial computations may yield an</p><p>incorrect result.</p><p>In order to better imagine what a reduce function can look like, let’s go back to the</p><p>AVG example. A partial state for AVG can be represented as (n, s), where n is the number</p><p>of values, and s is the sum of them. Reducing two partial states into the new valid state</p><p>can be performed by simply adding the corresponding values: (n1, s1) + (n2, s2) → (n1+</p><p>n2, s1 + s2). An optional reduce function can be defined (e.g., in ScyllaDB’s user-defined</p><p>aggregate implementation7).</p><p>The user-defined aggregates support is not standardized among database vendors</p><p>and each database has its own quirks and implementation details. For instance, in</p><p>PostgreSQL, you can also implement a “moving” aggregate8 by providing yet another set</p><p>of functions and parameters: msfunc, minvfunc, mstype, and minitcond. Still, the general</p><p>idea remains unchanged: Let the users push aggregation logic as close to the data as</p><p>possible.</p><p>Examples</p><p>Let’s create a custom integer arithmetic mean implementation in PostgreSQL.</p><p>That’s going to be done by providing a state transition function, called sfunc in</p><p>PostgreSQL nomenclature, finalfunc for the final function, initial value (initcond),</p><p>and the state type—stype. All of the functions will be implemented in SQL, PostgreSQL’s</p><p>native query language.</p><p>6 MapReduce is a framework for processing parallelizable problems across large datasets.</p><p>7 See the ScyllaDB documentation on ScyllaDB CQL Extensions (https://github.com/scylladb/</p><p>scylladb/blob/master/docs/cql/cql-extensions.md#reducefunc-for-uda).</p><p>8 See the PostgreSQL documentation on User-Defined Aggregates (https://www.postgresql.</p><p>org/docs/current/xaggr.html#XAGGR-MOVING-AGGREGATES).</p><p>Chapter 6 GettinG Data Closer</p><p>120</p><p>State Transition Function</p><p>The state transition function, called accumulate, accepts a new integer value (the second</p><p>parameter) and applies it to the existing state (the first parameter). As mentioned earlier</p><p>in this chapter, a simple implementation keeps two variables in the state—the current</p><p>sum of all values, and their count. 
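In miniature, the same moving parts (initial value, state transition, reduction, and final function) look like this; a database-agnostic Python sketch, shown here purely for orientation before the SQL:

INITIAL = (0, 0)                       # (total sum, number of values)

def accumulate(state, value):          # state transition function
    total, count = state
    return total + value, count + 1

def reduce_states(a, b):               # merges two partial states
    return a[0] + b[0], a[1] + b[1]

def finalize(state):                   # final function, guarding against division by zero
    total, count = state
    return 0 if count == 0 else total / count

left, right = INITIAL, INITIAL
for v in (3, 5):
    left = accumulate(left, v)
right = accumulate(right, 9)
print(finalize(reduce_states(left, right)))   # 5.666...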
Thus, transitioning to the next state simply means that</p><p>the sum is incremented by the current value, and the total count is increased by one.</p><p>CREATE OR REPLACE FUNCTION accumulate(integer[], integer) RETURNS integer[]</p><p>AS 'select array[$1[1] + $2, $1[2] + 1];'</p><p>LANGUAGE SQL</p><p>IMMUTABLE</p><p>RETURNS NULL ON NULL INPUT;</p><p>Final Function</p><p>The final function divides the total sum of values by the total count of them, special-</p><p>casing an average of 0 values, which should be just 0. The final function returns a</p><p>floating point number because that’s how the aggregate function is going to represent an</p><p>arithmetic mean.</p><p>CREATE OR REPLACE FUNCTION divide(integer[]) RETURNS float8</p><p>AS 'select case when $1[2]=0 then 0 else $1[1]::float/$1[2] end;'</p><p>LANGUAGE SQL</p><p>IMMUTABLE</p><p>RETURNS NULL ON NULL INPUT;</p><p>Aggregate Definition</p><p>With all the building blocks in place, the user-defined aggregate can now be declared:</p><p>CREATE OR REPLACE AGGREGATE alternative_avg(integer)</p><p>(</p><p>sfunc = accumulate,</p><p>stype = integer[],</p><p>finalfunc = divide,</p><p>initcond = '{0, 0}'</p><p>);</p><p>Chapter 6 GettinG Data Closer</p><p>121</p><p>In addition to declaring the state transition function and the final function, the state</p><p>type is also declared to be an array of integers (which will always keep two values in the</p><p>implementation), as well as the initial condition that sets both counters, the total sum</p><p>and the total number of values, to 0.</p><p>That’s it! Since the AVG aggregate for integers happens to be built-in, that gives you</p><p>the perfect opportunity to validate if the implementation is correct:</p><p>postgres=# CREATE TABLE t(v INTEGER);</p><p>postgres=# INSERT INTO t VALUES (3), (5), (9);</p><p>postgres=# SELECT * FROM t;</p><p>v</p><p>---</p><p>3</p><p>5</p><p>9</p><p>(3 rows)</p><p>postgres=# SELECT AVG(v), alternative_avg(v) FROM t;</p><p>avg | alternative_avg</p><p>--------------------+-------------------</p><p>5.6666666666666667 | 5.666666666666667</p><p>(1 row)</p><p>Voilà. Remember that while creating an alternative implementation for AVG is a great</p><p>academic example of user-defined aggregates, for production use it’s almost always</p><p>better to stick to the built-in aggregates whenever they’re available.</p><p>Distributed User-Defined Aggregate</p><p>For completeness, let’s take a look at an almost identical implementation of a custom</p><p>average function, but one accommodated to be distributed over multiple nodes. This</p><p>time, ScyllaDB will be used as a reference, since its implementation of user-defined</p><p>aggregates includes an extension for distributing the computations in a map-reduce</p><p>manner. 
Here’s the complete source code:</p><p>CREATE FUNCTION accumulate(acc tuple, val int)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS tuple</p><p>Chapter 6 GettinG Data Closer</p><p>122</p><p>LANGUAGE lua</p><p>AS $$</p><p>return { acc[1]+val, acc[2]+1 }</p><p>$$;</p><p>CREATE FUNCTION reduce(acc tuple, acc2 tuple)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS tuple</p><p>LANGUAGE lua</p><p>AS $$</p><p>return { acc[1]+acc2[1], acc[2]+acc2[2] }</p><p>$$;</p><p>CREATE FUNCTION divide(acc tuple)</p><p>RETURNS NULL ON NULL INPUT</p><p>RETURNS double</p><p>LANGUAGE lua</p><p>AS $$</p><p>return acc[1]/acc[2]</p><p>$$;</p><p>CREATE AGGREGATE alternative_avg(int)</p><p>SFUNC accumulate</p><p>STYPE tuple</p><p>REDUCEFUNC reduce</p><p>FINALFUNC divide</p><p>INITCOND (0, 0);</p><p>ScyllaDB’s native query language, CQL, is extremely similar to SQL, even in its</p><p>acronym. It’s easy to see that most of the source code corresponds to the PostgreSQL</p><p>implementation from the previous paragraph. ScyllaDB does not allow defining user-</p><p>defined functions in CQL, but it does support Lua, a popular lightweight embeddable</p><p>language, as well as WebAssembly. Since this book is expected to be read mostly by</p><p>human beings (and occasionally ChatGPT once it achieves full consciousness), Lua was</p><p>chosen for this example due to the fact it’s much more concise.</p><p>Chapter 6 GettinG Data Closer</p><p>123</p><p>The most notable difference is the reduce function, declared in the aggregate</p><p>under the REDUCEFUNC keyword. This function accepts two partial states and returns</p><p>another (composed) state. What ScyllaDB servers can do if this function is present is the</p><p>following:</p><p>1. Divide the domain (e.g., all rows in the database) into multiple</p><p>pieces and ask multiple servers to partially aggregate them, and</p><p>then send back the result.</p><p>2. Apply the reduce function to combine partial results into the</p><p>single final result.</p><p>3. Return the final result to the user.</p><p>Thus, by providing the reduce function, the user also allows ScyllaDB to compute</p><p>the aggregate concurrently on multiple machines. This can reduce the query execution</p><p>time by orders of magnitude compared to a large query that only gets executed on a</p><p>single server.</p><p>In this particular case, it might even be preferable to provide a user-defined</p><p>alternative for a user-defined function in order to increase its concurrency—unless the</p><p>built-in primitives also come with their reduce functions out of the box. That’s the case</p><p>in ScyllaDB, but not necessarily in other databases that offer similar capabilities.</p><p>Best Practices</p><p>1. If the computations can be efficiently represented with built-</p><p>in aggregates, do so—or at least benchmark whether a custom</p><p>implementation is any faster. User-defined aggregates are very</p><p>expressive, but usually come with a cost of overhead compared to</p><p>built-in implementations.</p><p>2. Research if user-defined aggregates can be customized in order</p><p>to better fit specific use cases—for example, if the computations</p><p>can be distributed to multiple database nodes, or if the database</p><p>allows configuring its caches to store the intermediate results of</p><p>user-defined aggregates somewhere.</p><p>Chapter 6 GettinG Data Closer</p><p>124</p><p>3. Always test the performance of your user-defined aggregates</p><p>thoroughly before using them in production. 
This will help to</p><p>ensure that they are efficient and can handle the workloads that</p><p>you expect them to.</p><p>4. Measure the cluster-wide effects of using user-defined aggregates</p><p>in your workloads. Similar to full table scans, aggregates are a</p><p>costly operation and it’s important to ensure that they respect the</p><p>quality of service of other workloads, not overloading the database</p><p>nodes beyond what’s acceptable in your system.</p><p>WebAssembly forUser-Defined Functions</p><p>WebAssembly, also known as Wasm, is a binary format for representing executable code,</p><p>designed to be easily embedded into other projects. It turns out that WebAssembly is</p><p>also a perfect candidate for user-defined functions on the backend, thanks to its ease of</p><p>integration, performance, and popularity.</p><p>There are multiple great books and articles9 on WebAssembly, and they all agree that</p><p>first and foremost, it’s a misnomer—WebAssembly’s usefulness ranges way beyond web</p><p>applications. It’s actually a solid general-purpose language that has already become the</p><p>default choice for an embedded language around the world. It ticks all the boxes:</p><p>☒ It’s open-source, with a thriving community</p><p>☒ It’s portable</p><p>☒ It’s isolated by default, with everything running in a sandboxed</p><p>environment</p><p>☒ It’s fast, comparable to native CPU code in terms of</p><p>performance</p><p>9 For example, “WebAssembly: The Definitive Guide” by Brian Sletten, “Programming</p><p>WebAssembly with Rust” by Kevin Hoffman, or “ScyllaDB’s Take on WebAssembly for User-</p><p>Defined Functions” by Piotr Sarna.</p><p>Chapter 6 GettinG Data Closer</p><p>125</p><p>Runtime</p><p>WebAssembly is compiled to bytecode. This bytecode is designed to run on a virtual</p><p>machine, which is usually part of a larger development environment called a runtime.</p><p>There are multiple implementations of WebAssembly runtimes, most notably:</p><p>• Wasmtime</p><p>https://wasmtime.dev/</p><p>A fast and secure runtime for WebAssembly, implemented in Rust,</p><p>backed by the Bytecode Alliance10 nonprofit organization.</p><p>• Wasmer.io</p><p>https://wasmer.io/</p><p>Another open-source initiative implemented in Rust; maintainers</p><p>of the WAPM11 project, which is a Wasm package manager.</p><p>• WasmEdge:</p><p>https://wasmedge.org/</p><p>Runtime implemented in C++, general-purpose, but focused on</p><p>edge computing.</p><p>• V8:</p><p>https://v8.dev/</p><p>Google’s monolith JavaScript runtime; written in C++, comes with</p><p>WebAssembly support as well.</p><p>Also, since the WebAssembly specification is public, feel free to implement your own!</p><p>Beware though: The standard is still in heavy development, changing rapidly every day.</p><p>10 https://bytecodealliance.org/</p><p>11 https://wapm.io/</p><p>Chapter 6 GettinG Data Closer</p><p>126</p><p>Back toLatency</p><p>Each runtime is free to define its own performance characteristics and guarantees. One</p><p>interesting feature introduced in Wasmtime is the concept of fuel, already mentioned in</p><p>the earlier discussion of user-defined functions. Combined with the fact that Wasmtime</p><p>provides an optional asynchronous interface for running WebAssembly modules, it gives</p><p>users an opportunity to fine-tune the runtime to their latency requirements.</p><p>When Wasmtime starts executing a given WebAssembly function, this unit of</p><p>execution is assigned a certain amount of fuel. 
Each execution step exhausts a small</p><p>amount of fuel—at the time of writing this paragraph, it simply consumes one unit of fuel</p><p>on each WebAssembly bytecode instruction, excluding a few flow control instructions</p><p>like branching. Once the execution unit runs out of fuel, it yields. After that happens, one</p><p>of the preconfigured actions is taken: either the execution unit is terminated, or its tank</p><p>gets refilled and it’s allowed to get back to whatever it was computing. This mechanism</p><p>allows the developer to control not only the total amount of CPU time that a single</p><p>function execution can take, but also how often the execution should yield and hand</p><p>over the CPU for other tasks. Thus, configuring fuel management the right way prevents</p><p>function executions from taking over the CPU for too long. That helps maintain low,</p><p>predictable latency in the whole system.</p><p>Another interesting aspect of WebAssembly is its portability. The fact that the</p><p>code can be distributed to multiple places and it’s guaranteed to run properly in</p><p>multiple environments makes it a great candidate for moving not only data, but also</p><p>computations, closer to the user.</p><p>Pushing the database logic from enormous datacenters to smaller ones, located</p><p>closer to end users, got its own buzzy name: edge computing.</p><p>Edge Computing</p><p>Since the Internet of Things (IoT) became a thing, the term edge computing needs</p><p>disambiguation. This paragraph is (unfortunately?) not about:</p><p>• Utilizing the combined computing power of smart fridges in</p><p>your area</p><p>• Creating a data mesh from your local network of Bluetooth light bulbs</p><p>• Integrating your smart watch into a Raft cluster in witness mode</p><p>Chapter 6 GettinG Data Closer</p><p>127</p><p>The edge described in this paragraph is of a more boring kind. It still means</p><p>performing computations on servers, but on ones closer to the user (e.g., located in a</p><p>local Equinix datacenter in Warsaw, rather than Amazon’s eu-central-1 in Frankfurt).</p><p>Performance</p><p>What does edge computing have to do with database performance? It brings the data</p><p>closer to the user, and closer physical distance translates to lower latency. On the other</p><p>hand, having your database cluster distributed to multiple locations has its downsides</p><p>as well. Moving large amounts of data between those regions might be costly, as cloud</p><p>vendors tend to charge for cross-region traffic. If the latency between database nodes</p><p>reaches hundreds of milliseconds, which is the customer grade latency between</p><p>Northern America and Europe (unless you can afford Hibernia Express12), they can get</p><p>out of sync easily. Even a few round-trips—and distributed consensus algorithms alone</p><p>require at least two—can cause delays that exceed the comfort zone of one second.</p><p>Failure detection mechanisms are also affected since packet loss occurs much more</p><p>often when the cluster spans multiple geographical locations.</p><p>Database drivers for edge-friendly databases need to be aware of all these limitations</p><p>mentioned. In particular, they need to be extra careful to pick the closest region</p><p>whenever possible, minimizing the latency and the chance of failure.</p><p>Conflict-Free Replicated Data Types</p><p>CRDT (conflict-free replicated data types) is an interesting way of dealing with</p><p>inconsistencies. 
It’s a family of data structures designed to have the following</p><p>characteristics:</p><p>• Users can update database replicas independently, without</p><p>coordinating with other database servers.</p><p>• There exists an algorithm to automatically resolve conflicts that</p><p>might occur when the same data is independently written to multiple</p><p>replicas concurrently.</p><p>• Replicas are allowed to be in different states, but they are guaranteed</p><p>to eventually converge to a common state.</p><p>12 A submarine link between Canada, Ireland, and the UK, offering sub-60ms latency.</p><p>Chapter 6 GettinG Data Closer</p><p>128</p><p>The concept of CRDT gained traction along with edge computing because the two</p><p>complement each other. The database is allowed to keep replicas in multiple places and</p><p>allows them to act without central coordination—but at the same time, users can assume</p><p>that eventually the database state is going to become consistent.</p><p>A few interesting data structures that fit the definition of CRDT are discussed next.</p><p>G-Counter</p><p>Grow-only counter. Usually implemented as an array of counters, keeping a local</p><p>counter value per each database node. Two array states from different nodes can</p><p>be merged by taking the maximum of each respective field.</p><p>The actual value of the</p><p>G-Counter is simply a sum of all local counters.</p><p>PN-Counter</p><p>Positive-Negative counter, brilliantly implemented by keeping two G-Counter</p><p>instances—one for accumulating positive values, the other for negative ones. The final</p><p>value is obtained by subtracting one from the other.</p><p>G-Set</p><p>Grow-only set, that is, one that forbids the removal of elements. Converging two G-Sets</p><p>is a simple set union since values are never removed from a G-Set. One flavor of G-Set</p><p>is G-Map, where an entry, key, and value associated with the key cannot be removed</p><p>once added.</p><p>LWW-Set</p><p>Last-write-wins set (and map, accordingly). This is a combination of two G-Sets, one</p><p>gathering added elements and the other containing removed ones. Conflict resolution is</p><p>based on a set union of the “added” G-Set, minus the union of the “removed” G-Set, but</p><p>timestamps are also taken into account. A value exists if its timestamp in the “added” set</p><p>is larger than its timestamp in the “removed” set, or if it’s not present in the “removed”</p><p>set at all.</p><p>The list is obviously not exhaustive, and countless other CRDTs exist. You’re hereby</p><p>encouraged to do research on the topic if you found it interesting!</p><p>Chapter 6 GettinG Data Closer</p><p>129</p><p>CRDTs are not just theoretical structures; they are very much used in practice.</p><p>Variants of conflict-free replicated data types are common among databases that offer</p><p>eventual consistency, like Apache Cassandra and ScyllaDB.Their writes have last-write-</p><p>wins semantics for conflict resolution, and their implementation of counters is based on</p><p>the idea of a PN-Counter.</p><p>Summary</p><p>At this point, it should be clear that there are a number of ways to improve</p><p>performance by using a database a bit unconventionally, as well as understanding</p><p>(and tapping) specialized capabilities built into the database and its drivers. Let’s</p><p>shift gears and look at the top “do’s and don’ts” that we recommend for ensuring that</p><p>your database is performing at its best. 
The next chapter begins this discussion by</p><p>focusing on infrastructure options (CPUs, memory, storage, and networking) and</p><p>deployment models.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 6 GettinG Data Closer</p><p>131</p><p>CHAPTER 7</p><p>Infrastructure and</p><p>Deployment Models</p><p>As noted in the previous chapter, many modern databases offer capabilities beyond</p><p>“just” storing and retrieving data. But all databases are ultimately built from the ground</p><p>up in order to serve I/O in the most efficient way possible. And it’s crucial to remember</p><p>this when selecting your infrastructure and deployment model of choice.</p><p>In theory, a database’s purpose is fairly simple: You submit a request and expect</p><p>to receive a response. But as you have seen in the previous chapters, an insane level of</p><p>engineering effort is spent on continuously enhancing and speeding up this process.</p><p>Very likely, years and years were dedicated to optimizing algorithms that may give</p><p>you a processing boost of a few CPU cycles, or minimizing the amount of memory</p><p>fragmentation, or reducing the amount of storage I/O needed to look up a specific set</p><p>of data. All these advancements, eventually, converge to create a database suitable for</p><p>performance at scale.</p><p>Regardless of your database selection, you may eventually hit a wall that no</p><p>engineering effort can break through: the database’s physical hardware. It makes very</p><p>little sense to have a solution engineered for performance when the hardware you throw</p><p>at it may be suboptimal. Similarly, a less performant database will likely be unable to</p><p>make efficient use of an abundance of available physical resources.</p><p>This chapter looks at critical considerations and tradeoffs when selecting CPUs,</p><p>memory, storage, and networking for your distributed database infrastructure. It</p><p>describes how different resources cooperate and how to configure the database to</p><p>deliver the best performance. Special attention is drawn to storage I/O as the most</p><p>difficult component to deal with. There’s also a close look at optimal cloud-based</p><p>deployments suitable for highly-performant distributed databases (given that these are</p><p>the deployment preference of most businesses).</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. 
Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_7</p><p>132</p><p>While it is true that a Database-as-a-Service (DBaaS) deployment will shield you</p><p>from many infrastructure and hardware decisions through your selection process, a</p><p>fundamental understanding of the generic compute resources required by any database</p><p>is important for identifying potential bottlenecks that may limit performance. After an</p><p>introduction to the hardware that’s involved in every deployment model—whether you</p><p>think about it or not—the chapter shifts focus to different deployment options and their</p><p>impact on performance. It covers the special considerations associated with cloud-</p><p>hosted deployments, database-as-a-service, serverless, containerization, and container</p><p>orchestration technologies, such as Kubernetes.</p><p>Core Hardware Considerations forSpeed atScale</p><p>When you are designing systems to handle large amounts of data and requests at scale,</p><p>the primary hardware considerations are:</p><p>• Storage</p><p>• CPU (cores)</p><p>• Memory (RAM)</p><p>• Network interfaces</p><p>Each could be a potential bottleneck for internal database latency: The delay from</p><p>when a request is received by the database (or a node in the database) and when the</p><p>database provides a response.</p><p>Identifying theSource ofYour Performance Bottlenecks</p><p>Knowing your database’s write and read paths is helpful for identifying potential</p><p>performance bottlenecks and tracking down the culprit. It’s also key to understanding</p><p>what physical resources your use case may be mostly bound against.</p><p>For example, write-optimized databases carry this nomenclature because writes</p><p>primarily go to memory, rather than being immediately persisted into disk. However,</p><p>most modern databases need to employ some “crash-recovery” mechanism and avoid</p><p>data loss caused by unexpected service interruptions. As a result, even write-optimized</p><p>databases will also resort to disk access to quickly persist your data, just in case. For</p><p>example, writes to Cassandra clusters will be persisted to a “write ahead log” disk</p><p>Chapter 7 InfrastruCture and deployment models</p><p>133</p><p>structure called the “commit log” and a memory structure that’s named a “memtable.” A</p><p>write is considered successful only after both operations succeed.</p><p>On the other side of the spectrum, the database’s read path will typically also involve</p><p>several physical components. Assuming that you’re not using an in-memory database,</p><p>then the read path will start by checking whether the data you are looking for is present</p><p>within the database cache. But if it’s not, the database needs to look up and retrieve the</p><p>data from disk, de-serialize it, and then answer with the results.</p><p>Network also plays a crucial role throughout the entire process. When you write,</p><p>data needs to be rapidly replicated to other replicas. When you read, the database needs</p><p>to select the correct replicas (shards) containing the data that the application is after,</p><p>thus potentially having to communicate with other nodes in the cluster. 
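The cost of that coordination is easy to picture with a tiny model of a coordinator waiting for replica acknowledgments (a sketch, not any driver's or server's actual logic):

import heapq

def coordinator_latency_ms(replica_latencies_ms, required_acks):
    # The request completes when the required_acks-th fastest replica answers;
    # how many acks are required depends on the consistency level in use.
    return heapq.nsmallest(required_acks, replica_latencies_ms)[-1]

print(coordinator_latency_ms([2.1, 2.4, 35.0], required_acks=2))   # 2.4 ms
print(coordinator_latency_ms([2.1, 35.0, 35.2], required_acks=2))  # 35.0 ms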
Moreover,</p><p>strong consistency use cases always require the response of a majority of members for</p><p>an operation to be successful—so delayed responses from a replica can dramatically</p><p>increase the tail latency of a request routed to it.</p><p>Achieving Balance</p><p>Balance is key to any distributed system, including and beyond databases. It makes</p><p>very little sense to try to achieve 1 million operations per second (OPS) in a system that</p><p>has the fastest network link available but relies on very few CPUs. Similarly, it’s not very</p><p>efficient to purchase the most expensive and performant infrastructure for your solution</p><p>if your use case requires only 10K OPS.</p><p>Additionally, it’s important to recognize that a cluster imbalance can easily drag</p><p>down performance across your entire distributed system. This happens because a</p><p>distributed system cannot be faster than your slowest component—a fact that frequently</p><p>surprises people.</p><p>Here’s a real-life example. A customer reported elevated latencies affecting their</p><p>entire 18-node cluster. After collecting system information, we noticed that the majority</p><p>of their nodes were properly using locally-attached nonvolatile memory express (NVMe)</p><p>disks—except for one that had a software Redundant Array of Independent Disks (RAID)</p><p>with a mix of NVMes and network-attached disks. The customer clarified that they</p><p>were running out of storage space and decided to attach another disk in order to relieve</p><p>the problem. However, they weren’t aware that this introduced a ticking time bomb</p><p>into their entire cluster. Here’s a brief explanation of what happened from a technical</p><p>perspective:</p><p>Chapter 7 InfrastruCture and deployment models</p><p>134</p><p>1. With a slow disk introduced in their RAID array, storage I/O</p><p>operations in that specific replica took longer to complete.</p><p>2. As a result, the remaining replicas took additional time whenever</p><p>sending or waiting for a response that would require disk I/O.</p><p>3. As more and more requests came in, all these delays eventually</p><p>created a waiting queue on the replicas.</p><p>4. As the queue kept growing, this eventually affected the replicas’</p><p>performance, which ended up affecting the entire cluster’s</p><p>performance.</p><p>5. From that point on, the entire cluster speed was impeded by the</p><p>speed of its slowest node: the one that had the slowest disk.</p><p>Setting Realistic Expectations</p><p>Even the most powerful hardware cannot ensure impressive end-to-end (or round-trip)</p><p>latency—the entire cycle time from when a client sends a request to the server until it</p><p>obtains a response. The end-to-end latency could be undermined by factors that might</p><p>be outside of the database’s control. 
For example:</p><p>• Multi-hop routing of packets from your client application to the</p><p>database server, adding hundreds of milliseconds in latency</p><p>• Client driver settings, connecting and sending requests to a remote</p><p>datacenter</p><p>• Consistency levels that require both local and remote datacenter</p><p>responses</p><p>• Poor network performance between clients and database servers</p><p>• Protocol overheads</p><p>• Client-side performance bottlenecks</p><p>Chapter 7 InfrastruCture and deployment models</p><p>135</p><p>Recommendations forSpecific</p><p>Hardware Components</p><p>This section takes a deeper look at each of the primary hardware considerations:</p><p>• Storage</p><p>• CPU (cores)</p><p>• Memory (RAM)</p><p>• Network interfaces</p><p>Storage</p><p>One of the fastest ways to undermine all your other performance optimizations is to send</p><p>every read and write operation through an unsuitable disk. Although recent technology</p><p>advancements greatly improved the performance of storage devices, disks are (by far)</p><p>still the slowest component in a computer system.</p><p>From a performance standpoint, disk performance is typically measured in two</p><p>dimensions:</p><p>• The bandwidth available for sequential reads and writes</p><p>• The IOPS for random reads and writes</p><p>Database engineers obsess over optimizing disk access patterns with respect to those</p><p>two dimensions. People who are selecting, managing, or using a database should focus</p><p>on two additional disk considerations: the storage technology and the disk size.</p><p>Disk Types</p><p>Locally-attached NVMe Solid State Drives (SSDs) are the standard when latency is</p><p>critical. Compared with other bus interfaces, NVMe SSDs connected to a Peripheral</p><p>Component Interconnect Express (PCIe) interface will generally deliver lower latencies</p><p>than the Serial AT Attachment (SATA) interface. If your workload isn’t super latency</p><p>sensitive, you could also consider using disks via the SATA interface. But, definitely avoid</p><p>using network-attached disks if you expect single-digit millisecond latencies. Being</p><p>network attached, these disks require an additional hop to reach a storage server, and</p><p>that ends up increasing latency for every database request.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>136</p><p>If your focus is on throughput and latency really doesn’t matter for your use case</p><p>(e.g., for moving data into a data warehouse), you might be able to get away with a</p><p>persistent disk—but it’s not recommended. By persistent disks, we mean durable</p><p>network storage devices that your VMs can access like physical disks, but are located</p><p>independently from your VMs. We’re not going to pick on any specific vendors, but a</p><p>little research should reveal issues like subpar performance and overall instability. If</p><p>you’re forced to work with persistent disks, be prepared to craft a creative solution.1</p><p>Hard disk drives (HDDs) might fast become a bottleneck. Since SSDs are getting</p><p>progressively cheaper and cheaper, using HDDs is not recommended. Some workloads</p><p>may work with HDDs, especially if they play nice and minimize random seeks. An</p><p>example of an HDD-friendly workload is a write-mostly (98 percent writes) workload</p><p>with minimal random reads. 
If you decide to use HDDs, try to allocate a separate disk for</p><p>the commit log.</p><p>ScyllaDB published benchmarking results of several different storage devices—</p><p>demonstrating how they perform under extreme load simulating typical database access</p><p>patterns.2 For example, Figures7-1 through 7-4 visualize the different performance</p><p>characteristics from two NVMes—a persistent disk and an HDD.</p><p>1 For inspiration, consider Discord’s approach—but recognize that this is</p><p>certainly not a one-size-fits-all solution. It’s described in their blog, “How Discord</p><p>Supercharges Network Disks for Extreme Low Latency” (https://discord.com/blog/</p><p>how-discord-supercharges-network-disks-for-extreme-low-latency).</p><p>2 You can find the results, as well as the tool to reproduce the results, at https://github.com/</p><p>scylladb/diskplorer#sample-results.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>137</p><p>Figure 7-1. NVMe bandwidth/latency graphs for an AWS i3.2xlarge instance type</p><p>Chapter 7 InfrastruCture and deployment models</p><p>138</p><p>Figure 7-2. Bandwidth/latency graphs for an AWS Im4gn.4xlarge instance type</p><p>using AWS Nitro SSDs</p><p>Chapter 7 InfrastruCture and deployment models</p><p>139</p><p>3 Strangely, the 95th percentile at low rates is worse than at high rates.</p><p>Figure 7-3. Bandwidth/latency graphs for a Google Cloud n2-standard-8 instance</p><p>type with a 2TB SSD persistent disk3</p><p>Chapter 7 InfrastruCture and deployment models</p><p>140</p><p>Figure 7-4. Bandwidth/latency graphs for a Toshiba DT01ACA200 hard</p><p>disk drive4</p><p>4 Note the throughput and IOPS were allowed to miss by a 15 percent margin rather than the</p><p>normal 3 percent margin.</p><p>Disk Setup</p><p>We hear a lot of questions about RAID setups. Hardware RAIDs are commonly used to</p><p>avoid outages introduced by disk failures. As a result, the RAID-5 (distributed parity)</p><p>setup is often used.</p><p>However, distributed databases typically have their own internal replication</p><p>mechanism to allow for business continuity and achieve high availability. Therefore,</p><p>RAID setups</p><p>employing data mirroring or distributed parity have proven to be very</p><p>detrimental to disk I/O performance and, fairly often, are used redundantly. On top of</p><p>that, we have found that some hardware RAID vendors deliver poor performance results</p><p>Chapter 7 InfrastruCture and deployment models</p><p>141</p><p>depending on your database access mechanisms. One notable example: hardware</p><p>RAIDs that are unable to perform efficiently via asynchronous I/O or direct I/O calls. If</p><p>you believe your disk I/O is suboptimal, consider directly exposing the disks from your</p><p>hardware RAID to your operating system.</p><p>Conversely, RAID-0 (striping) setups often provide a boost in disk I/O performance</p><p>and allow the database to achieve higher IOPS and bandwidth than a single disk can</p><p>provide. 
The general recommendation for creating a RAID-0 setup is to use all disks of</p><p>the same type and capacity to avoid variable performance during your daily workload.</p><p>While it is true you would lose the entire RAID array in the event of a disk failure, the</p><p>replication performed by your distributed database should be sufficient to ensure that</p><p>your data remains available.</p><p>A couple of additional considerations related to disk setup:</p><p>• Storage servers often serve several other users and workloads at</p><p>the same time. Therefore, even though disks would be dedicated to</p><p>the database, your access performance can be undermined by factors</p><p>like the level to which the storage system is serving other users</p><p>concurrently. Most of the time, the storage medium provided to you</p><p>will not be optimal for supporting a low-latency database workload.</p><p>This can often be mitigated by ensuring that the disks are allocated</p><p>from a high-performing disk pool.</p><p>• It’s important to expose your database infrastructure disks</p><p>directly to the operating system guest from your hypervisor. We</p><p>have seen many situations where the I/O capacity of a database</p><p>was greatly impacted when disks were virtualized. To eliminate</p><p>any possible bottlenecks in a low-latency environment, give your</p><p>database direct access to your disks so that they can perform I/O as</p><p>they were designed to.</p><p>Disk Size</p><p>When considering how much storage you need, be sure to account for your existing</p><p>data—replicated—plus your anticipated near-term data growth, and also leave sufficient</p><p>room for the overhead of internal operations (like compactions [for LSM-tree-based</p><p>databases], the commit log, backups, etc.).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>142</p><p>As Chapter 8 discusses, the most common topology involves three replicas for each</p><p>dataset. Assume you have 5TB of raw data and use a replication factor of three:</p><p>5TB Data X 3 RF = 15TB</p><p>But 15TB is just a starting point since there are other sizing criteria:</p><p>• What is your dataset’s growth rate? (How much do you ingest per</p><p>hour or day?)</p><p>• Will you store everything forever, or will you have an eviction process</p><p>(for example, based on Time To Live [TTL])?</p><p>• Is your growth rate stable (a fixed rate of ingestion per week/day/</p><p>hour) or is it stochastic and bursty? The former would make it more</p><p>predictable; the latter may mean you have to give yourself more</p><p>leeway to account for unpredictable but probabilistic events.</p><p>You can model your data’s growth rate based on the number of users or endpoints</p><p>and how that number is expected to grow over time. Alternately, data models are often</p><p>enriched over time, resulting in more data per source. Or your sampling rate may</p><p>increase. For example, your system may begin ingesting data every five seconds rather</p><p>than every minute. All of these considerations impact your data storage volume.</p><p>It’s strongly recommended that you select storage that’s suitable for where you</p><p>expect to end up after a certain time span. 
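As a rough sizing sketch (plain Python; the growth rate and headroom figures are placeholder assumptions, not recommendations):

def required_storage_tb(raw_tb, replication_factor=3,
                        monthly_growth_rate=0.05, months=12,
                        operational_headroom=0.5):
    # Replicated data, projected forward, plus room for internal operations.
    replicated = raw_tb * replication_factor
    projected = replicated * (1 + monthly_growth_rate) ** months
    return projected * (1 + operational_headroom)

print(round(required_storage_tb(5), 1))   # 5TB raw -> roughly 40TB provisioned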
If you’re running your database on a public</p><p>cloud provider (self-managed or as a fully-managed Database-as-a-Service [DBaaS]),</p><p>you won’t need very much lead time to provision new hardware and expand your cluster.</p><p>However, for an on-premises hardware purchase, you may need to provision based on</p><p>your quarterly or annual budgeting process. You could also face delays due to the supply</p><p>chain disruptions that have become increasingly common.</p><p>Also, be sure to leave storage space for internal temporary operations such as</p><p>compaction, repairs, backups, and commit logs, as well as any other background process</p><p>that may temporarily introduce a space amplification. On the other hand, if you’re using</p><p>compression, be sure to factor in the amount of space that your selected compression</p><p>algorithm can save you.</p><p>Finally, recognize that every database has an ideal memory-to-storage ratio—for</p><p>example, a certain amount of TB or GB per node that it can support with optimal</p><p>performance. If this isn’t readily apparent in your database’s documentation, press your</p><p>vendor for their recommendation.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>143</p><p>Raw Devices andCustom Drivers</p><p>Some database vendors require direct access to storage devices—without needing a</p><p>filesystem to exist. Such direct access is often referred to as creating a “raw” device,</p><p>which refers to the fact that the operating system won’t know how to manage it, and any</p><p>I/O is handled directly by the database. Issuing I/O directly to the underlying storage</p><p>device may provide a performance boost to the database. However, it is important to</p><p>understand some of this approach’s drawbacks, which may not be important for your</p><p>specific deployment.</p><p>1. Error prone: Directly issuing I/O to a disk rather than through a</p><p>filesystem is error prone. While it will provide a performance gain,</p><p>incorrect handling of the underlying storage could result in data</p><p>corruption, data loss, or unexpected bugs.</p><p>2. Complex: Raw devices are not as common as one might expect. In</p><p>fact, very few databases decided to implement that approach. It’s</p><p>important to note that since raw devices aren’t typically mounted</p><p>as regular filesystems, their manageability will be fully dependent</p><p>on what your vendor provides.</p><p>3. Lock-in: Once you are using a raw device, it’s extremely difficult</p><p>to move away from it. You can’t mount raw devices or query their</p><p>storage consumption via typical operating system mechanisms.</p><p>All of your disks need to be arranged in a certain way, and you</p><p>can’t easily go back to a regular filesystem.</p><p>Maintaining Disk Performance Over Time</p><p>Databases are very storage I/O intensive, so disks will wear out over time. Most disk</p><p>vendors provide estimates concerning the performance durability of their products.</p><p>Check on those and compare.</p><p>There are multiple tools and programs that can help with SSD performance over</p><p>time. One example is the fstrim program, which is frequently run weekly to discard</p><p>unused filesystem blocks. 
fstrim is an operating system background process that doesn't require any database action and may improve I/O to a significant extent.

Tip: If you have to choose one place to invest—on CPU, storage, memory, or networking—we recommend splurging on storage. Everything else has evolved faster and better than storage. It still remains the slowest component in most systems.

Tiered Storage

Many use cases have different latency requirements for different sets of data. Similarly, industries may see exponential storage utilization growth over time. It is not always desirable, or even possible, to get rid of old data (for example, due to compliance regulations, third-party contracts, or simply because it still carries relevance for the business).

Teams with storage-heavy use cases often seek ways to minimize the costs of storage consumption: by reducing the replication factor of their dataset, using less performant (although cheaper) storage disks, or by employing a manual data rotation process from faster to slower disks.

Tiered storage is a solution implemented by some databases in order to address most of these concerns. It allows users to configure the database to use distinct storage tiers and to define which criteria the database should use to ensure that the data is correctly replicated to its relevant tier. For example, MongoDB allows you to determine how data is replicated to a specific storage tier by assigning different tier tags to shards, allowing its balancer to migrate data between tiers automatically. On top of that, Atlas Online Archive also allows the database to offload historical datasets to cloud storage.

CPUs (Cores)

Next is the CPU.
As of this writing, you are probably looking at modern servers running</p><p>some reasonably modern Intel, AMD, or ARM chips, which are commonly found across</p><p>most cloud providers and enterprise hardware vendors. Along with storage, CPUs are</p><p>another compute resource which—if not correctly sized—may introduce contention to</p><p>your workload and impact your latencies. Clusters handling hundreds of thousands up</p><p>to millions of operations per second tend to get very high CPU loads.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>145</p><p>More cores will generally mean better performance. This is important for achieving</p><p>optimal performance from databases that are architected to benefit from multithreading,</p><p>and it’s absolutely essential for databases that are architected with a shard-per-core</p><p>architecture—running a separate shard on each core in each server. In this case, the</p><p>more cores the CPU has, the more shards—and the better data distribution—the</p><p>database will have.</p><p>A combination of vendor recommendations and benchmarking (see Chapter 9)</p><p>can help you determine how much throughput each multicore chip can support. A</p><p>general recommendation is to avoid running production systems close to the CPU limits</p><p>and find the sweet spot between supporting your expected performance and leaving</p><p>room for throughput growth. On top of that, when doing benchmarking, remember</p><p>to also factor in background database operations that might be detrimental to your</p><p>performance. For example, Cassandra and Cassandra-compatible databases often</p><p>need to run repair: a weekly process to ensure data consistency across the cluster. This</p><p>process requires a lot of coordination and communication across the entire cluster. If</p><p>your workload is not properly sized to accommodate background database operations</p><p>and other events (such as node failures), your latency may increase to a level that</p><p>surprises you.</p><p>When using virtual machines, containers, or the public cloud, remember that each</p><p>virtual CPU is mapped to a single logical core, or thread. In many cloud deployments,</p><p>nodes are provided on a vCPU basis. The vCPU is typically a single hyperthread from</p><p>a dual hyperthread x86 physical core for Intel/AMD variants, or a single core for</p><p>ARM chips.</p><p>No matter what your deployment of choice involves, avoid overcommitting CPU</p><p>resources if performance is a priority. Doing so will prevent other guests from stealing</p><p>CPU time5 from your database.</p><p>Memory (RAM)</p><p>If you’re working with an in-memory database, having enough memory to hold your</p><p>entire dataset is an absolute must. But every database uses in-memory caching to some</p><p>extent. For example, some databases require enough memory space for indexes to avoid</p><p>expensive round-trips to storage disks. Others leverage an internal data cache to allow</p><p>5 For more on CPU steal time, see “Detecting CPU Steal Time in Guest Virtual Machines” by Jamie</p><p>Fargen (https://opensource.com/article/20/1/cpu-steal-time).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>146</p><p>for lower latencies when retrieving recently used data, Cassandra and Cassandra-like</p><p>databases implement memtables, and some databases allow you to control which tables</p><p>are served entirely from memory. 
The more memory the database has at its disposal,</p><p>the better you can take advantage of those mechanisms. After all, even the fastest NVMe</p><p>can’t come close to the speed of RAM access.</p><p>In general, there is no blanket recommendation for “how much memory is enough”</p><p>for a database. Different vendors have different requirements and different use cases also</p><p>require different memory sizes. However, latency-sensitive use cases typically require</p><p>high memory footprints in order to achieve high cache hit rates and serve low-latency</p><p>read requests efficiently.</p><p>For example, a use case with a higher payload size requires a larger memory</p><p>footprint than one with a smaller payload size. Another interesting aspect to consider is</p><p>how frequently the use case in question reads data that may be present in memory (hot</p><p>data) as opposed to data that was never read (cold data). As mentioned in Chapter 2, the</p><p>latter can easily undermine your latencies.</p><p>Without a sufficient disk-to-memory ratio, you will be hitting your storage far more</p><p>than you probably want if you intend to keep your latencies low. The ideal ratio varies</p><p>from database to database since every caching implementation is different, so be sure</p><p>to ask your vendor for their specific recommendations. For example, ScyllaDB currently</p><p>recommends that for every 1GB of memory allocated to a node, you can store up to</p><p>100GB of data (so if you have 32GB of memory, you can handle around 3TB). The higher</p><p>your memory-to-storage ratio gets, the less room you have for caching your total dataset.</p><p>Every database has some sort of hard physical limit. If you don’t have enough memory</p><p>and you have to run a workload on top of a very large dataset, it’s either going to be</p><p>rather slow or increase the risk of the database running out of memory.</p><p>Another ratio to keep in mind: memory per CPU core. At ScyllaDB, we recommend</p><p>at least 8GB of memory per CPU core for production purposes (because, given our</p><p>shared-nothing architecture, every shard works independently and has its own allocated</p><p>memory for caching). 8GB per vCPU is the same ratio used by most cloud providers for</p><p>NoSQL or Big Data-oriented instance types. Again, the recommended ratio will vary</p><p>across vendors, depending on the database’s specific internal cache implementation and</p><p>other implementation details. For example, in Cassandra and Cassandra-like databases,</p><p>part of the memory will be allocated for some of its SSTable-components in order to</p><p>speed up disk lookups when reading cold data. Aerospike will typically store all indexes</p><p>in RAM.And MongoDB, on average, requires 1GB of RAM per 100K assets.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>147</p><p>Distributed databases are notoriously high memory consumers. Regardless of its</p><p>implementation, the database will always need to store some relevant parts of your</p><p>dataset in memory in order to avoid wasting time on disk I/O.Insufficient memory can</p><p>manifest itself as unpredictable, erratic database behavior—even crashes.</p><p>Network</p><p>Lastly, you have to ensure that network I/O does not become a bottleneck. Networking</p><p>is often an overlooked component. As with any distributed system, a database involves</p><p>a lot of traffic between all the cluster members to check for liveness, replicate state</p><p>and topology changes, and so on. 
As a result, network delays not only deteriorate your</p><p>application’s latency, but also prevent internode communication from functioning</p><p>effectively.</p><p>At ScyllaDB, we recommend a minimum network bandwidth of 10Gbps because</p><p>internal database operations such as streaming, repairs, and gossip can become very</p><p>network intensive. On top of that, you also need to factor in the actual throughput</p><p>required for the use case in question; the number of operations per second will certainly</p><p>be the highest bandwidth consumer for your deployment.</p><p>As with memory, the required network bandwidth will vary. Be sure to check your</p><p>vendor recommendations and consider the nature of your use case. A low throughput</p><p>workload will obviously consume less traffic than a higher throughput one.</p><p>Tip: Use CPU pinning to mitigate the</p><p>impact of hardware</p><p>interrupts. hardware interrupts, which typically stem from (but are not limited</p><p>to) high network Internet traffic, force the os kernel to stop everything and respond</p><p>to the hardware before returning to the job at hand. too many interrupts (e.g., a</p><p>high softirq percent) will degrade database performance, as your Cpus may stall</p><p>during processing for serving network traffic. one way to resolve this is to use</p><p>Cpu pinning. this tells the system that all network interrupts should be handled</p><p>by specific Cpus that are not being used by the database. With that setup, you can</p><p>blast the database with network traffic and be reasonably confident that you won’t</p><p>overwhelm it or stall the database processing during normal operations.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>148</p><p>For cloud deployments, most IaaS vendors provide a modern network infrastructure</p><p>with ample bandwidth between your database servers and between the database</p><p>and the application clients. Be sure to check on your client’s network bandwidth</p><p>consumption if you suspect network problems. A common mistake we see in</p><p>deployments involves application clients deployed with suboptimal network capacity.</p><p>Also, be sure to place your application servers as close as possible to your database.</p><p>If you are deploying them in a single region, a shorter physical distance between the</p><p>servers will translate to better network performance (since it will require fewer network</p><p>hops for communication) and, as a result, lower latencies. If you need to go multi-region</p><p>and you require strong consistency or replication across these regions, then you need to</p><p>pay the latency penalty for traversing regions—plus, you also have to pay, quite literally,</p><p>with respect to cross-region networking transfer fees. For multi-region deployments with</p><p>cross-region replication, a slow network link may create replication delays that cause</p><p>the database to apply backpressure on your writes until it manages to replicate the data</p><p>piled up.</p><p>Considerations intheCloud</p><p>The “on-prem vs cloud” decision depends heavily on your organization’s security and</p><p>regulatory requirements as well as its business strategy—and is well beyond the scope</p><p>of this book. Instead of heading down that path, let’s focus on exploring performance</p><p>considerations that are unique to cloud deployments.</p><p>Most cloud providers offer a wide range of instance types that you may choose</p><p>to host your workload. 
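Before you commit to a particular instance shape, it can help to sanity-check the candidates against the rules of thumb covered earlier in this chapter. The following back-of-the-envelope sketch (in Python) does exactly that. It assumes the ScyllaDB-oriented ratios quoted earlier (roughly 100GB of data per 1GB of RAM and 8GB of RAM per vCPU) plus a placeholder payload size and headroom factor; swap in your own vendor's recommendations and your measured numbers before trusting the output.

```python
import math

# Rough cluster sizing from the rules of thumb in this chapter. The ratios
# and defaults below are illustrative assumptions, not vendor guidance.
def size_cluster(dataset_gb, replication_factor=3, reads_per_sec=100_000,
                 writes_per_sec=50_000, payload_bytes=1_000,
                 node_ram_gb=128, node_vcpus=16,
                 data_per_ram_gb=100,   # ~100GB of data per 1GB of RAM
                 ram_per_vcpu_gb=8,     # ~8GB of RAM per vCPU
                 headroom=0.5):         # leave ~50% room for growth/admin ops
    total_data_gb = dataset_gb * replication_factor
    usable_ram_gb = min(node_ram_gb, node_vcpus * ram_per_vcpu_gb)
    per_node_capacity_gb = usable_ram_gb * data_per_ram_gb * headroom
    nodes = max(3, math.ceil(total_data_gb / per_node_capacity_gb))

    # Very rough network estimate: each read returns one payload and each
    # write is shipped to every replica.
    bits_per_sec = (reads_per_sec + writes_per_sec * replication_factor) \
        * payload_bytes * 8
    per_node_gbps = bits_per_sec / 1e9 / nodes

    return nodes, per_node_gbps

nodes, gbps = size_cluster(dataset_gb=5_000)
print(f"{nodes} nodes, ~{gbps:.2f} Gbps of payload traffic per node")
```

Even a crude model like this quickly reveals whether a candidate instance family is storage-bound, memory-bound, or network-bound for your workload, which is useful input for the benchmarking discussed in Chapter 9.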
In our experience, most of the mistakes and performance</p><p>bottlenecks seen on distributed databases within cloud deployments are due to an</p><p>incorrect instance or storage type selection during the initial cluster setup. A common</p><p>misunderstanding (and concern) that many people have is the fact that NVMe-based</p><p>storage may be more expensive than network-attached storage. The misconception likely</p><p>stems from the assumption that since NVMes are faster, they would incur elevated costs.</p><p>However it turns out to be quite the opposite: Since NVMe disks on cloud environments</p><p>are tied to the lifecycle of an instance, they end up being cheaper than network disks,</p><p>which require holding up your dataset for a prolonged period of time. We encourage</p><p>you to compare the costs of NVMe backed-up storage against network-attached disks on</p><p>your cloud vendor of choice.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>149</p><p>Some cloud vendors have different instance types for different distributed database</p><p>workloads. For example, some workloads may benefit more from compute-heavy</p><p>instance types, with more compute power than storage capacity. Conversely, storage-</p><p>dense instance types typically feature a higher storage to memory ratio and are often</p><p>used by storage-heavy workloads.</p><p>To complicate things even more, some cloud providers may offer different CPU</p><p>generations for the same instance type. If one CPU generation is considerably slower</p><p>than other nodes, the wrong choice could introduce performance bottlenecks into your</p><p>cluster.</p><p>We have seen some (although rare) scenarios where a noisy neighbor dragged down</p><p>an entire node performance with no reasonable explanation. The lack of visibility and</p><p>control in cloud instances makes it harder to diagnose such situations. Often, you need</p><p>to reach out to your cloud vendor directly to resolve the situation.</p><p>As you start configuring your instance, remember that a cloud environment isn’t</p><p>created exclusively for databases. You have access to a wide range of options, but it can</p><p>be confusing to determine where to start and which options to use. In general, it’s best</p><p>to check with your database vendor on which instance types are recommended for</p><p>deployment. Even better, go beyond that and compare the results of their benchmarks</p><p>against those same instance types running your workload.</p><p>After you have decided on your instance types and deployment options, it’s time to</p><p>think about instance placement. Most clouds will charge you for both inter-region traffic</p><p>and inter-zone traffic, which may quite surprisingly increase the overall networking</p><p>costs. Some companies try to mitigate this cost by placing all instances under a single</p><p>availability zone (AZ), which also carries the risk of potentially having to face a cluster-</p><p>wide outage if/when that AZ goes down. Others opt to ignore the cost aspect and</p><p>deploy their replicas in different AZs to ensure data is properly replicated to an isolated</p><p>environment. Regardless of your instance’s placement of choice, note that some</p><p>database drivers allow clients in specific AZs to route queries only against database</p><p>replicas living in the same availability zone in order to reduce costs. 
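For instance, with the Python driver commonly used for Cassandra-compatible databases, datacenter-local routing is a matter of configuring the load balancing policy. The sketch below is a minimal example: the contact points, datacenter name, and keyspace are placeholders, and the exact policy classes available (including rack- or AZ-aware variants) vary by driver and version, so check your driver's documentation.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Prefer replicas in the client's local datacenter ("us-east" is a placeholder)
# and, within it, the replicas that actually own the requested partition.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="us-east")
    )
)

cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],  # placeholder seed nodes
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")      # placeholder keyspace
```

Some drivers take this a step further with rack-aware policies that keep reads inside a single availability zone; if yours offers one, it can shave both latency and cross-zone transfer costs.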
Similarly, you will</p><p>also want to ensure that your application clients are located under the same zones as</p><p>your database to minimize your networking costs.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>150</p><p>Fully Managed Database-as-a-Service</p><p>Does the database-as-a-service model help or hurt database performance? It really</p><p>depends on the following:</p><p>• How much attention your database requires to achieve and</p><p>consistently meet your performance expectations</p><p>• Your team’s experience working with the specific database</p><p>you’re using</p><p>• Your team’s time and desire to tinker with that database</p><p>• The level of expertise—especially with respect to performance—that</p><p>your DBaaS provider dedicates to your account</p><p>Managed DBaaS solutions can easily speed up your go-to-market and allow you to</p><p>focus on priorities beyond your database. Most database vendors now provide some</p><p>sort of managed solution. There are even independent companies in the business of</p><p>providing this kind of service for a variety of different distributed databases.</p><p>We have seen many examples where a managed solution helped users succeed, as</p><p>well as numerous complaints over the fact that some managed solutions were rather</p><p>limited. It is not our intention to recommend nor criticize any specific service provider in</p><p>question. Here is some vendor-agnostic advice on things to consider before selecting a</p><p>managed solution:</p><p>• Does the vendor satisfy your existing security requirements? Does it</p><p>provide enough evidence of security certifications issued by a known</p><p>security company?</p><p>• What are the options for observability and how do you export the</p><p>data in question to your monitoring platform of choice?</p><p>• What kind of flexibility do you have with your deployment? What are</p><p>the available tunable options and the support for those within your</p><p>managed solution?</p><p>• Does it allow you to peer traffic from your existing application</p><p>network(s) to your database in a private and secure way?</p><p>• What are the available support options and SLAs?</p><p>Chapter 7 InfrastruCture and deployment models</p><p>151</p><p>• Which deployment options are available, what’s the flexibility among</p><p>switching, and what’s the cost comparison if you were to deploy and</p><p>maintain it on your own?</p><p>• How easy is it for you to export</p><p>your data if you need to move your</p><p>deployment to a different vendor in the future?</p><p>• What, if any, migration options are available and what amount of</p><p>effort do they require?</p><p>These are just some of the many questions and concerns that we’ve frequently</p><p>heard teams asking (or wishing they asked before they got caught in an undesirable</p><p>option). Considering a third-party vendor to manage a relatively critical aspect of your</p><p>infrastructure is very often challenging. However, under the right circumstances and</p><p>vendor-user fit, it can be a great option for reducing your admin burden and optimizing</p><p>your performance.</p><p>Serverless Deployment Models</p><p>Serverless refers to database solutions that offer near-instant scaling up or scaling down</p><p>of database infrastructure—and charge you for the capacity and storage that you actually</p><p>consume.</p><p>A serverless model could theoretically yield a performance advantage. 
Before</p><p>serverless, many organizations faced a tradeoff:</p><p>• (Slightly or generously, depending on your risk tolerance)</p><p>overestimate the capacity they need to guarantee adequate</p><p>performance.</p><p>• Watch performance suffer if their overly-conservative capacity</p><p>estimates proved inadequate.</p><p>Serverless can help in a few different ways and situations.</p><p>First, with variable workloads. Since the database can rapidly scale up as your</p><p>workload increases, you can worry less about performance issues stemming from</p><p>inadequate capacity. If your traffic ebbs and flows across the day/week/month, you</p><p>can spend less during the light periods and dedicate those resources to supporting the</p><p>peak periods. And if your company suddenly experiences “catastrophic success,” you</p><p>don’t have to worry about the headaches associated with needing to suddenly scale</p><p>Chapter 7 InfrastruCture and deployment models</p><p>152</p><p>your infrastructure. If all goes well, the vendor will “automagically” ensure that you’re</p><p>covered, with acceptable performance. You won’t need to procure any additional</p><p>servers, or even contact your cloud provider.</p><p>Serverless is also a great option to consider if you’re working on a new project and</p><p>are not sure what capacity you need to meet performance expectations. It gives you the</p><p>freedom to start fast and scale (or shrink) depending on real-world usage. Database</p><p>sizing is one less thing to worry about. And you don’t need to predict the future.</p><p>Finally, serverless also makes it simpler to justify the spend internally. With this</p><p>model, you can assure your organization that you are never overprovisioned—at least</p><p>not for long. You’re paying for exactly the amount of performance that the database</p><p>vendor determines you need at all times.</p><p>However, a serverless deployment also carries the risk of cost overruns and the</p><p>uncertainty of unpredictable costs. For example, DynamoDB pricing may not be very</p><p>attractive for write-heavy workloads. Similarly, serverless database services may charge</p><p>an arm and a leg (or an eye and a knee) depending on the number of operations per</p><p>second you plan to sustain over an extended period of time. In some cases, it could</p><p>become a double-edged sword from a cost perspective if your goal is to sustain a high-</p><p>throughput performant system at large scale.</p><p>Another aspect to consider when thinking about a serverless solution is whether</p><p>the solution in question is compatible with your existing infrastructure components.</p><p>For example, you’ll want to explore what amount of effort is required to connect your</p><p>message queueing or analytics tool with that specific serverless solution.</p><p>Remember that the overall concept behind serverless is to abstract away the</p><p>underlying infrastructure in such a way that not all database-configurable options are</p><p>available to you. As a result, troubleshooting potential performance problems is often</p><p>more challenging since you might need to rely on your vendor’s input and guidance to</p><p>understand which actions to take. Being serverless also means that you lack visibility</p><p>into whether the infrastructure you consume is shared with other tenants. 
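Before committing to a serverless offering for a steady, high-throughput workload, it is also worth running the cost arithmetic on your sustained operation rate rather than on a short proof of concept. The numbers below are deliberately made up for illustration; replace them with your vendor's actual price sheet and your own provisioned-cluster quote.

```python
# Hypothetical on-demand prices (USD); real serverless pricing varies widely
# by vendor, region, item size, and consistency settings.
PRICE_PER_MILLION_WRITES = 1.25
PRICE_PER_MILLION_READS = 0.25
PROVISIONED_CLUSTER_PER_MONTH = 6_000   # hypothetical fixed-size cluster

def on_demand_monthly_cost(writes_per_sec, reads_per_sec):
    seconds = 30 * 24 * 3600
    writes_millions = writes_per_sec * seconds / 1e6
    reads_millions = reads_per_sec * seconds / 1e6
    return (writes_millions * PRICE_PER_MILLION_WRITES
            + reads_millions * PRICE_PER_MILLION_READS)

# A workload sustaining 20K writes/s and 50K reads/s around the clock:
cost = on_demand_monthly_cost(20_000, 50_000)
print(f"on-demand: ${cost:,.0f}/month vs provisioned: "
      f"${PROVISIONED_CLUSTER_PER_MONTH:,}/month")
```

With spiky or low traffic the comparison often flips in favor of serverless, which is exactly why the sustained rate, rather than a short test window, should drive the decision.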
Many</p><p>distributed database vendors may also offer you different pricing tiers for shared and</p><p>dedicated environments.</p><p>Containerization andKubernetes</p><p>Containers and Kubernetes are now ubiquitous, even for stateful systems like databases.</p><p>Should you use them? Probably—unless you have a good reason not to.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>153</p><p>But be aware that there is a performance penalty for the operational convenience</p><p>of using containers. This is to be expected because of the extra layer of abstraction (the</p><p>container itself), relaxation of resource isolation, and increased context switches. The</p><p>good news is that it can certainly be overcome. In our testing using ScyllaDB, we found</p><p>it is possible to take what was originally a 69 percent reduction in peak throughput down</p><p>to a 3 percent performance penalty.6</p><p>Here’s the TL;DR on that specific experiment:</p><p>Containerizing applications is not free. In particular, processes</p><p>comprising the containers have to be run in Linux cgroups and</p><p>the container receives a virtualized view of the network. Still,</p><p>the biggest cost of running a close-to-hardware, thread-per-core</p><p>application like ScyllaDB inside a Docker container comes from</p><p>the opportunity cost of having to disable most of the performance</p><p>optimizations that the database employs in VM and bare-metal</p><p>environments to enable it to run in potentially shared and</p><p>overcommitted platforms.</p><p>The best results with Docker are obtained when resources</p><p>are statically partitioned and we can bring back bare-metal</p><p>optimizations like CPU pinning and interrupt isolation. There is</p><p>only a 10 percent performance penalty in this case as compared</p><p>to the underlying platform—a penalty that is mostly attributed to</p><p>the network virtualization. Docker allows users to expose the host</p><p>network directly for specialized deployments. In cases in which this</p><p>is possible, we saw that the performance difference compared to the</p><p>underlying platform falls down to 3 percent.</p><p>Of course, the potential penalty and strategies for mitigating will vary from database</p><p>to database. But the key takeaway is that there is likely a significant performance</p><p>penalty—so be sure to hunt it down and research how to mitigate it. Some common</p><p>mitigation strategies include:</p><p>6 See “The Cost of Containerization for Your ScyllaDB” on the ScyllaDB blog (https://www.</p><p>scylladb.com/2018/08/09/cost-containerization-scylla/).</p><p>Chapter 7 InfrastruCture and deployment models</p><p>154</p><p>• Ensure that your containers have direct access to the database’s</p><p>underlying storage.</p><p>• Expose the host OS network to the container in order to avoid the</p><p>performance penalty due to its network virtualization layer.</p><p>• Allocate enough resources to the container in question, and ensure</p><p>these are not overcommitted with other containers or processes</p><p>running within the underlying host OS.</p><p>Kubernetes adds yet another virtualization layer—and thus opens the door to yet</p><p>another layer of performance issues, as well as different strategies for mitigating them.</p><p>First off, if you have the choice of multiple options for deploying and managing database</p><p>clusters on Kubernetes, test them out with an eye on performance. 
Once you settle</p><p>on the best fit for your needs, dive into the configuration options that could impact</p><p>performance. Here are some performance tips that cross databases:</p><p>• Consider dedicating specific and independent Kubernetes nodes for</p><p>your database workloads and use affinities in order to configure their</p><p>placement.</p><p>• Enable hostNetworking and be sure to set up the required kernel</p><p>parameters as recommended by your vendor (for example, fs.</p><p>aio-max-nr for increasing the number of events available for</p><p>asynchronous I/O processing in the Linux kernel).</p><p>• Ensure that your database pods have a Guaranteed QoS class7 to</p><p>avoid other pods from potentially hurting your main</p><p>workload.</p><p>• Be sure to use an operator8 in order to orchestrate and control the</p><p>lifecycle of your existing Kubernetes database cluster. For example,</p><p>ScyllaDB has its ScyllaDB Operator project.</p><p>7 For more detail, see “Create a Pod that Gets Assigned a QoS Class of Guaranteed” in the</p><p>Kubernetes docs (https://kubernetes.io/docs/tasks/configure-pod-container/</p><p>quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed).</p><p>8 For more detail, see “Operator Pattern” in the Kubernetes docs https://kubernetes.io/docs/</p><p>concepts/extend-kubernetes/operator/.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>155</p><p>Summary</p><p>This chapter kicked off the final part of this book, focused on sharing recommendations</p><p>for getting better performance out of your database deployment. It looked at</p><p>infrastructure and deployment model considerations that are important to understand</p><p>whether you’re managing your own deployment or opting for a database-as-a-service</p><p>(maybe serverless) deployment model. The next chapter looks at performance</p><p>considerations relevant to the topology itself: replication, geographic distribution,</p><p>scaling up and/or out, and intermediaries like external caches, load balancers, and</p><p>abstraction layers.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>Chapter 7 InfrastruCture and deployment models</p><p>157</p><p>CHAPTER 8</p><p>Topology Considerations</p><p>As mentioned in Chapter 5, database servers are often combined into intricate</p><p>topologies where certain nodes are grouped in a single geographical location; others</p><p>are used only as a fast cache layer, and yet others store seldom-accessed cold data in a</p><p>cheap place, for emergency purposes only. 
That chapter covered how drivers work to</p><p>understand and interact with that topology to exchange information more efficiently.</p><p>This chapter focuses on the topology in and of itself. How is data replicated across</p><p>geographies and datacenters? What are the risks and alternatives to taking the common</p><p>NoSQL practice of scaling out to the extreme? And what about intermediaries to your</p><p>database servers—for example, external caches, load balancers, and abstraction layers?</p><p>Performance implications of all this and more are all covered here.1</p><p>Replication Strategy</p><p>First, let’s look at replication, which is how your data will be spread to other replicas</p><p>across your cluster.</p><p>Note If you want a quick introduction to the concept of replication, see</p><p>Appendix A.</p><p>Having more replicas will slow your writes (since every write must be duplicated</p><p>to replicas), but it could accelerate your reads (since more replicas will be available for</p><p>serving the same dataset). It will also allow you to maintain operations and avoid data</p><p>1 This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.</p><p>com/blog/), ScyllaDB Documentation (https://docs.scylladb.com/stable/), the ScyllaDB</p><p>whitepaper “Why Scaling Up Beats Scaling Out for NoSQL” (https://lp.scylladb.com/</p><p>whitepaper-scaling-up-vs-scaling-out-offer.html), and an article that ScyllaDB co-founder</p><p>and CEO Dor Laor wrote for The New Stack. It is used here with permission of ScyllaDB.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_8</p><p>158</p><p>loss in the event of node failures. Additionally, replicating data to get closer to your</p><p>application and closer to your users will reduce latency, especially if your application has</p><p>a highly geographically-distributed user base.</p><p>A replication factor (RF) of 1 means there is only one copy of a row in a cluster, and</p><p>there is no way to recover the data if the node is compromised or goes down (other than</p><p>restoring from a backup). An RF of 2 means that there are two copies of a row in a cluster.</p><p>An RF of at least three is used in most systems. This allows you to write and read with</p><p>strong consistency, as a quorum of replicas will be achieved, even if one node is down.</p><p>Many databases also let you fine-tune replication settings at the regional level. For</p><p>example, you could have three replicas in a heavily used region, but only two in a less</p><p>popular region.</p><p>Note that replicating data across multiple regions (as Bigtable recommends as a</p><p>safeguard against both availability zone failure and regional failure) can be expensive.</p><p>Before you set this up, understand the cost of replicating data between regions.</p><p>If you’re working with DynamoDB, you create tables (not clusters), and AWS</p><p>manages the replication for you as soon as you set a table to be Global. One notable</p><p>drawback of DynamoDB global tables is that transactions are not supported across</p><p>regions, which may be a limiting factor for some use cases.</p><p>Rack Configuration</p><p>If all your nodes are in the same datacenter, how do you configure their placement? The</p><p>rule of thumb here is to have as many racks as you have replicas. 
For example, if you</p><p>have a replication factor of three, run it in three racks. That way, even if an entire rack</p><p>goes down, you can still continue to satisfy read and write requests to a majority of your</p><p>replicas. Performance might degrade a bit since you have lost roughly 33 percent of your</p><p>infrastructure (considering a total zone/rack outage), but overall you’ll still be up and</p><p>running. Conversely, if you have three replicas distributed across two racks, then losing</p><p>a rack may potentially affect two out of the three natural endpoints for part of your data.</p><p>That’s a showstopper if your use case requires strongly consistent reads/writes.</p><p>Multi-Region or Global Replication</p><p>By placing your database servers close to your users, you lower the network latency. You</p><p>can also improve availability and insulate your business from regional outages.</p><p>ChApter 8 topology ConsIderAtIons</p><p>159</p><p>If you do have multiple datacenters, ensure that—unless otherwise required by the</p><p>business—reads and writes use a consistency level that is confined to replicas within</p><p>a specific datacenter. This approach avoids introducing a latency hit by instructing the</p><p>database to only select local replicas (under the same region) for achieving your required</p><p>consistency level. Also, ensure that each application client knows what datacenter is</p><p>considered its local one; it should prioritize that local one for connections and requests,</p><p>although it may also have a fallback strategy just in case that datacenter goes down.</p><p>Note that application clients may or may not be aware of the multi-datacenter</p><p>deployment, and it is up to the application developer to decide on the awareness to</p><p>fallback across regions. Although different settings and load balancing profiles exist</p><p>through a variety of database drivers, the general concept for an application to failover</p><p>to a different region in the event of a local failure may often break application semantics.</p><p>As a result, its reaction upon a failure must be handled directly by the application</p><p>developer.</p><p>Multi-Availability Zones vs. Multi-Region</p><p>To mitigate a possible server or rack failure, cloud vendors offer (and recommend) a</p><p>multi-zone deployment. Think about it as if you have a datacenter</p><p>at your fingertips</p><p>where you can deploy each server instance in its own rack, using its own power, top-of-</p><p>rack switch, and cooling system. Such a deployment will be bulletproof for any single</p><p>system or zonal failure, since each rack is self-contained. The availability zones are still</p><p>located in the same region. However, a specific zone failure won’t affect another zone’s</p><p>deployed instances.</p><p>For example, on Google Compute Engine, the us-east1-b, us-east1-c, and us-</p><p>east1- d availability zones are located in the us-east1 region (Moncks Corner, South</p><p>Carolina, USA). But each availability zone is self-contained. Network latency between</p><p>AZs in the same region is negligible for the purpose of this discussion.</p><p>In short, both multi-zone and multi-region deployments help with business</p><p>continuity and disaster recovery respectively, but multi-region has the additional benefit</p><p>of minimizing local application latencies in those local regions. 
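To make this concrete, here is a minimal, Cassandra/ScyllaDB-flavored sketch of per-datacenter replication combined with a datacenter-local consistency level. The keyspace, addresses, and datacenter names are placeholders; the datacenter names in particular must match what your cluster's snitch reports.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# LOCAL_QUORUM keeps the quorum inside the client's datacenter, so requests
# don't wait on a cross-region round-trip to satisfy consistency.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# Three replicas in the heavily used region, two in the less popular one.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS orders
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3,
        'eu-west': 2
    }
""")
```

Pair this with a datacenter-aware load balancing policy on the client side so that each application instance actually sticks to its local replicas.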
It might come at a cost</p><p>though: cross-region data replication costs need to be considered for multi-regional</p><p>topologies.</p><p>Note that multi-zonal deployments will similarly charge you for inter-zone</p><p>replication. Although it is perfectly possible to have a single zone deployment for your</p><p>ChApter 8 topology ConsIderAtIons</p><p>160</p><p>database, it is often not a recommended approach because it will effectively be exposed</p><p>as a single point of failure toward your infrastructure. The choice here is quite simple:</p><p>Do you want to reduce costs as much as possible and risk potential unavailability, or</p><p>do you want to guarantee high availability in a single region at the expense of network</p><p>replication costs?</p><p>Scaling Upvs Scaling Out</p><p>Is it better to have a larger number of smaller (read, “less powerful”) nodes or a smaller</p><p>number of larger nodes? We recommend aiming for the most powerful nodes and</p><p>smallest clusters that meet your high availability and resiliency goals—but only if your</p><p>database can truly take advantage of the power added by the larger nodes.</p><p>Let’s unravel that a bit. For over a decade, NoSQL’s promise has been enabling</p><p>massive horizontal scalability with relatively inexpensive commodity hardware. This</p><p>has allowed organizations to deploy architectures that would have been prohibitively</p><p>expensive and impossible to scale using traditional relational database systems.</p><p>Over that same decade, “commodity hardware” has also undergone a</p><p>transformation. But not all databases take advantage of modern computing resources.</p><p>Many aren’t architected to take advantage of the resources offered by large nodes, such</p><p>as the added CPU, memory, and solid-state drives (SSDs), nor can they store large</p><p>amounts of data on disk efficiently. Managed runtimes, like Java, are further constrained</p><p>by heap size. Multi-threaded code, with its locking and context-switches overhead and</p><p>lack of attention for Non-Uniform Memory Architecture (NUMA), imposes a significant</p><p>performance penalty against modern hardware architectures.</p><p>If your database is in this group, you might find that scaling up quickly brings you to</p><p>a point of diminishing returns. But even then, it’s best to max out your vertical scaling</p><p>potential before you shift to horizontal scaling.</p><p>A focus on horizontal scaling results in system sprawl, which equates to operational</p><p>overhead, with a far larger footprint to keep managed and secure. Server sprawl</p><p>also introduces more network overhead to distributed systems due to the constant</p><p>replication and health checks done by every single node in your cluster. Although most</p><p>vendors claim that scaling out will bring you linear performance, some others are more</p><p>conservative and state that it will bring you “near to linear performance.” For example,</p><p>ChApter 8 topology ConsIderAtIons</p><p>161</p><p>Cassandra Production Guidelines2 do not recommend clusters larger than 50 nodes</p><p>using the default number of 16 vNodes per instance because it may result in decreased</p><p>availability.</p><p>Moreover, there are quite a few advantages to using large, powerful nodes.</p><p>• Less noisy neighbors: On cloud platforms, multi-tenancy is</p><p>the norm. A cloud platform is, by definition, based on shared</p><p>network bandwidth, I/O, memory, storage, and so on. 
As a result,</p><p>a deployment of many small nodes is susceptible to the “noisy</p><p>neighbor” effect. This effect is experienced when one application</p><p>or virtual machine consumes more than its fair share of available</p><p>resources. As nodes increase in size, fewer and fewer resources</p><p>are shared among tenants. In fact, beyond a certain size, your</p><p>applications are likely to be the only tenant on the physical machines</p><p>on which your system is deployed.</p><p>• Fewer failures: Since large and small nodes fail at roughly the</p><p>same rate, large nodes deliver a higher mean time between failures</p><p>(MTBF) than small nodes. Failures in the data layer require operator</p><p>intervention, and restoring a large node requires the same amount of</p><p>human effort as a small one. In a cluster of a thousand nodes, you’ll</p><p>likely see failures every day—and this magnifies administrative costs.</p><p>• Datacenter density: Many organizations with on-premises</p><p>datacenters are seeking to increase density by consolidating</p><p>servers into fewer, larger boxes with more computing resources per</p><p>server. Small clusters of large nodes help this process by efficiently</p><p>consuming denser resources, in turn decreasing energy and</p><p>operating costs.</p><p>• Operational simplicity: Big clusters of small instances demand</p><p>more attention, and generate more alerts, than small clusters of large</p><p>instances. All of those small nodes multiply the effort of real-time</p><p>monitoring and periodic maintenance, such as rolling upgrades.</p><p>2 See https://cassandra.apache.org/doc/latest/cassandra/getting_started/</p><p>production.html.</p><p>ChApter 8 topology ConsIderAtIons</p><p>162</p><p>Some architects are concerned that putting more data on fewer nodes increases</p><p>the risks associated with outages and data loss. You can think of this as the “big basket”</p><p>problem. It may seem intuitive that storing all of your data on a few large nodes makes</p><p>them more vulnerable to outages, like putting all of your eggs in one basket. But this</p><p>doesn’t necessarily hold true. Modern databases use a number of techniques to ensure</p><p>availability while also accelerating recovery from failures, making big nodes both safer</p><p>and more economical. For example, consider capabilities that reduce the time required</p><p>to add and replace nodes and internal load balancing mechanisms to minimize the</p><p>throughput or latency impact across database restarts.3</p><p>Workload Isolation</p><p>Many teams find themselves in a position where they need to run multiple different</p><p>workloads against the database. It is often compelling to aggregate different workloads</p><p>under a single cluster, especially when they need to work on the exact same dataset.</p><p>Keeping several workloads together under a single cluster can also reduce costs. But, it’s</p><p>essential to avoid resource contention when implementing latency-critical workloads.</p><p>Failure to do so may introduce hard-to-diagnose performance situations, where one</p><p>misbehaving workload ends up dragging down the entire cluster’s performance.</p><p>There are many ways to accomplish workload isolation to minimize the resource</p><p>contention that could occur when running multiple workloads on a single cluster. Here</p><p>are a few that work well. 
Keep in mind that the best approach depends on your existing</p><p>database’s available options, as well as your use case’s requirements:</p><p>• Physical isolation: This setup is often used to entirely isolate one</p><p>workload from another. It involves essentially extending your</p><p>deployment to an additional region (which may be physically the</p><p>same as your existing one, but logically different on the database</p><p>side). As a result, the workloads are split to replicate data to another</p><p>3 ScyllaDB Heat Weighted Load Balancing provides a smarter request redistribution</p><p>algorithm</p><p>based on the cache hit ratio of nodes in the cluster. Learn more at www.scylladb.</p><p>com/2017/09/21/scylla-heat-weighted-load-balancing/.</p><p>ChApter 8 topology ConsIderAtIons</p><p>163</p><p>location, but queries are executed only within a particular location—</p><p>in such a way that a performance bottleneck in one workload won’t</p><p>degrade or bottleneck the other. Note that a downside of this solution</p><p>is that your infrastructure costs double.</p><p>• Logical isolation: Some databases or deployment options allow</p><p>you to logically isolate workloads without needing to increase your</p><p>infrastructure resources. For example, ScyllaDB has a workload</p><p>prioritization feature where you can assign different weights for</p><p>specific workloads to help the database understand which workload</p><p>you want it to prioritize in the event of system contention. If your</p><p>database does not offer such a feature, you may still be able to</p><p>run two or more workloads in parallel, but watch out for potential</p><p>contentions in your database.</p><p>• Scheduled isolation: Many times, you might need to simply run</p><p>batched scheduled jobs at specified intervals in order to support</p><p>other business-related activities, such as extracting analytics</p><p>reports. In those cases, consider running the workload in question</p><p>at low-peak periods (if any exist), and experiment with different</p><p>concurrency settings in order to avoid impairing the latency of the</p><p>primary workload that’s running alongside it.</p><p>More onWorkload Prioritization forLogical Isolation</p><p>ScyllaDB users sometimes use workload prioritization to balance OLAP and OLTP</p><p>workloads. The goal is to ensure that each defined task has a fair share of system</p><p>resources so that no single job monopolizes system resources, starving other jobs of their</p><p>needed minimums to continue operations.</p><p>In Figure8-1, note that the latency for both workloads nearly converges. OLTP</p><p>processing began at or below 2ms P99 latency up until the OLAP job began at 12:15.</p><p>When the OLAP workload kicked in, OLTP P99 latencies shot up to 8ms, then further</p><p>degraded, plateauing around 11–12ms until the OLAP job terminated after 12:26.</p><p>ChApter 8 topology ConsIderAtIons</p><p>164</p><p>Figure 8-1. Latency between OLTP and OLAP workloads on the same cluster</p><p>before enabling workload prioritization</p><p>These latencies are approximately six times greater than when OLTP ran by itself.</p><p>(OLAP latencies hover between 12–14ms, but, again, OLAP is not latency-sensitive).</p><p>Figure 8-2 shows that the throughput on OLTP sinks from around 60,000 OPS to</p><p>half that—30,000 OPS.You can see the reason why. OLAP, being throughput hungry, is</p><p>maintaining roughly 260,000 OPS.</p><p>ChApter 8 topology ConsIderAtIons</p><p>165</p><p>Figure 8-2. 
Comparative throughput results for OLTP and OLAP on the same</p><p>cluster without workload prioritization enabled</p><p>Ultimately, OLTP suffers with respect to both latency and throughput, and users</p><p>experience slower response times. In many real-world conditions, such OLTP responses</p><p>would violate a customer’s SLA.</p><p>Figure 8-3 shows the latencies after workload prioritization is enabled. You can see</p><p>that the OLTP workload similarly starts out at sub-millisecond to 2ms P99 latencies.</p><p>Once an OLAP workload is added, OLTP processing performance degrades, but with</p><p>P99 latencies hovering between 4–7ms (about half of the 11–12ms P99 latencies when</p><p>workload prioritization was not enabled).</p><p>ChApter 8 topology ConsIderAtIons</p><p>166</p><p>Figure 8-3. OLTP and OLAP latencies with workload prioritization enabled</p><p>It is important to note that once system contention kicks in, the OLTP latencies</p><p>are still somewhat impacted—just not to the same extent they were prior to workload</p><p>prioritization. If your real-time workload requires ultra-constant single-digit millisecond</p><p>or lower P99 latencies, then we strongly recommend that you avoid introducing any form</p><p>of contention.</p><p>The OLAP workload, not being as latency-sensitive, has P99 latencies that hover</p><p>between 25–65ms. These are much higher latencies than before—the tradeoff for</p><p>keeping the OLTP latencies lower.</p><p>Throughput wise, Figure8-4 shows that the OLTP traffic is a smooth 60,000 OPS until</p><p>the OLAP load is also enabled.</p><p>ChApter 8 topology ConsIderAtIons</p><p>167</p><p>Figure 8-4. OLTP and OLAP load throughput with workload</p><p>prioritization enabled</p><p>It does dip in performance at that point, but only slightly, hovering between 54,000 to</p><p>58,000 OPS. That is only a 3–10 percent drop in throughput. The OLAP workload, for its</p><p>part, hovers between 215,000–250,000 OPS.That is a drop of 4–18 percent, which means</p><p>an OLAP workload would take longer to complete. Both workloads suffer degradation, as</p><p>would be expected for an overloaded cluster, but neither to a crippling degree.</p><p>Abstraction Layers</p><p>It’s becoming fairly common for teams to write an abstraction layer on top of their</p><p>databases. Instead of calling the database’s APIs directly, the applications connect to this</p><p>database-agnostic abstraction layer, which then manages the logistics of connecting to</p><p>the database.</p><p>There are usually a few main motives behind this move:</p><p>• Portability: If the team wants to move to another database, they</p><p>won’t need to modify their applications and queries. However, the</p><p>team responsible for the abstraction layer will need to modify that</p><p>code, which could turn out to be more complicated.</p><p>ChApter 8 topology ConsIderAtIons</p><p>168</p><p>• Developer simplicity: Developers don’t need to worry about the</p><p>inner details of working with any particular database. This can make</p><p>it easier for people to move around from team to team.</p><p>• Scalability: An abstraction layer can be easier to containerize. If the</p><p>API gets overloaded, it’s usually easier to scale out more containers in</p><p>Kubernetes than to spin off more containers of the database itself.</p><p>• Customer-facing APIs: Exposing the database directly to end-users</p><p>is typically not a good idea. 
As a result, many companies expose</p><p>common endpoints, which are eventually translated into actual</p><p>database queries. As a result, the abstraction layer can shed requests,</p><p>limit concurrency across tenants, and perform auditability and</p><p>accountability over its calls.</p><p>But, there’s definitely a potential for a performance penalty that is highly dependent</p><p>on how efficiently the layer was implemented. An abstraction layer that was fastidiously</p><p>implemented by a team of masterful Rust engineers is likely to have a much more</p><p>negligible impact than a Java or Python one cobbled together as a quick side project. If</p><p>you decide to take this route, be sure that the layer is developed with performance in</p><p>mind, and that you carefully measure its impact via both benchmarking and ongoing</p><p>monitoring. Remember that every application database communication is going to</p><p>use this layer, so a small inefficiency can quickly snowball into a significant performance</p><p>problem.</p><p>For example, we once saw a customer report an elevated latency situation coming</p><p>from their Golang abstraction layer. Once we realized that the latency on the database</p><p>side was within bounds for its use case, the investigation shifted from the database over</p><p>to the network and client side. Long story short, the application latency spikes were</p><p>identified as being heavily affected during the garbage collection process, dragging down</p><p>the client-side performance significantly. The problem was resolved after scaling out the</p><p>number of clients and by ensuring that they had enough compute resources to properly</p><p>function.</p><p>And another example: When working with a customer through a PostgreSQL to</p><p>NoSQL migration, we realized that their clients were fairly often opening too many</p><p>concurrent connections against the database. Although having a high number of sockets</p><p>opened is typically a good idea for distributed systems, an extremely high number of</p><p>them can easily overwhelm the client side (which needs to keep track of all open sockets)</p><p>ChApter 8 topology ConsIderAtIons</p><p>169</p><p>as well as the database. After we reported our findings to the customer, they discovered</p><p>that they were opening a new database session for every request they submitted against</p><p>the cluster. After correcting the malfunctioning code, the overall application throughput</p><p>was significantly increased because the abstraction layer was then using active sockets</p><p>opened when it routed requests.4</p><p>Load Balancing</p><p>Should you put a dedicated load balancer in front of your database? In most cases, no.</p><p>Databases typically have their own way to balance traffic across the cluster, so layering</p><p>a load balancer on top of that won’t help—and it could actually hurt. Consider 1) how</p><p>many requests the load balancer can serve without becoming a bottleneck and 2) its</p><p>balancing policy. Also, recognize that it introduces a single point of failure that reduces</p><p>your database resilience. As a result, you overcomplicate your overall infrastructure</p><p>topology because you now need to worry about load balancing high availability.</p><p>Of course, there are always exceptions. 
For example, say you were previously using</p><p>a database API that’s unaware of the layout of the cluster and its individual nodes</p><p>(e.g., DynamoDB, where a client is configured with a single “endpoint address” and</p><p>all requests are sent to it). Now you’re shifting to a distributed leaderless database like</p><p>ScyllaDB, where clients are node aware and even token aware (aware of which token</p><p>ranges are natural endpoints for every node in your topology). If you simply configure</p><p>an application with the IP address of a single ScyllaDB node as its single DynamoDB</p><p>API endpoint address, the application will work correctly. After all, any node can answer</p><p>any request by forwarding it to other nodes as necessary. However, this single node will</p><p>be more loaded than the other nodes because it will be the only node actively serving</p><p>requests. This node will also become a single point of failure from your application’s</p><p>perspective.</p><p>In this special edge case, load balancing is critical—but load balancers are</p><p>not. Server-side load balancing is fairly complex from an admin perspective. More</p><p>importantly with respect to performance, server-side solutions add latency. Solutions</p><p>that involve a TCP or HTTP load balancer require another hop for each</p><p>4 Learn about abstraction layer usage at Discord in “How Discord Migrated Trillions of Messages</p><p>from Cassandra to ScyllaDB “(www.youtube.com/watch?v=S2xmFOAUhsk) and ShareChat</p><p>in “ShareChat’s Path to High-Performance NoSQL with ScyllaDB” (www.youtube.com/</p><p>watch?v=Y2yHv8iqigA).</p><p>ChApter 8 topology ConsIderAtIons</p><p>170</p><p>request—increasing not just the cost of each request, but also its latency. We recommend</p><p>client- side load balancing: Modifying the application to send requests to the available</p><p>nodes versus a single one.</p><p>The key takeaway here is that load balancing generally isn’t needed—and when it</p><p>is, server-side load balancers yield a pretty severe performance penalty. If it’s absolutely</p><p>necessary, client-side load balancing is likely a better option.5</p><p>External Caches</p><p>Teams often consider external caches when the existing database cluster cannot meet</p><p>the required SLA.This is a clear performance-oriented decision. Putting an external</p><p>cache in front of the database is commonly used to compensate for subpar latency</p><p>stemming from the various factors discussed throughout this book: inefficient database</p><p>internals, driver usage, infrastructure choices, traffic spikes, and so on.</p><p>Caching may seem like a fast and easy solution because the deployment can be</p><p>implemented without tremendous hassle and without incurring the significant cost of</p><p>database scaling, database schema redesign, or even a deeper technology transformation.</p><p>However, external caches are not as simple as they are often made out to be. In fact, they</p><p>can be one of the more problematic components of a distributed application architecture.</p><p>In some cases, it’s a necessary evil—particularly if you have an ultra-latency-sensitive</p><p>use case such as real-time ad bidding or streaming media, and you’ve tried all the other</p><p>means of reducing latency. But in many cases, the performance boost isn’t worth it. 
The</p><p>following sections outline some key risks and you can decide what makes sense for your</p><p>use case and SLAs.</p><p>An External Cache Adds Latency</p><p>A separate cache means another hop on the way. When a cache surrounds the database,</p><p>the first access occurs at the cache layer. If the data isn’t in the cache, then the request is</p><p>sent to the database. The result is additional latency to an already slow path of uncached</p><p>data. One may claim that when the entire dataset fits the cache, the additional latency</p><p>doesn’t come into play. However, there is usually more than a single workload/pattern</p><p>that hits the database and some of it will carry the extra hop cost.</p><p>5 For an example of how to implement client-side load balancing, see www.scylladb.</p><p>com/2021/04/13/load-balancing-in-scylla-alternator/.</p><p>ChApter 8 topology ConsIderAtIons</p><p>171</p><p>An External Cache Is anAdditional Cost</p><p>Caching means expensive DRAM, which translates to a higher cost per gigabyte than</p><p>SSDs. Even when RAM can store frequently accessed objects, it is best to use the existing</p><p>database RAM, and even increase it for internal caching rather than provision entirely</p><p>separate infrastructure on RAM-oriented instances. Provisioning a cache to be the</p><p>same size as the entire persistent dataset may be prohibitively expensive. In other cases,</p><p>the working set size can be too big, often reaching petabytes, making an SSD-friendly</p><p>implementation the preferred, and cheaper, option.</p><p>External Caching Decreases Availability</p><p>No cache’s high availability solution can match that of the database itself. Modern</p><p>distributed databases have multiple replicas; they also are topology-aware and speed-</p><p>aware and can sustain multiple failures without data loss.</p><p>For example, a common replication pattern is three local replicas, which generally</p><p>allows for reads to be balanced across such replicas in order to efficiently use your</p><p>database’s internal caching mechanism. Consider a nine-node cluster with a replication</p><p>factor of three: Essentially every node will hold roughly 33 percent of your total dataset</p><p>size. As requests are balanced among different replicas, this grants you more room</p><p>for caching your data, which could (potentially) completely eliminate the need for an</p><p>external cache. Conversely, if an external cache happens to invalidate entries right before</p><p>a surge of cold requests, availability could be impeded for a while since the database</p><p>won’t have that data in its internal cache (more on this in the section entitled “External</p><p>Caching Ruins the Database Caching” later in this chapter).</p><p>Caches often lack high-availability properties and can easily fail or invalidate records</p><p>depending on their heuristics. Partial failures, which are more common, are even worse</p><p>in terms of consistency. When the cache inevitably fails, the database will get hit by the</p><p>unmitigated firehose of queries and likely wreck your SLAs. In addition, even if a cache</p><p>itself has some high availability features, it can’t coordinate handling such failure with</p><p>the persistent database it is in front of. 
The bottom line: Rely on the database, rather than</p><p>making your latency SLAs dependent on a cache.</p><p>ChApter 8 topology ConsIderAtIons</p><p>172</p><p>Application Complexity: Your Application Needs toHandle</p><p>More Cases</p><p>Application and operational complexity are problems for external caches. Once you</p><p>have an external cache, you need to keep the cache up-to-date with the client and the</p><p>database. For instance, if your database runs repairs, the cache needs to be synced or</p><p>invalidated. However, invalidating the cache may introduce a long period of time when</p><p>you need to wait for it to eventually get warm. Your client retry and timeout policies need</p><p>to match the properties</p><p>of the cache but also need to function when the cache is done.</p><p>Usually, such scenarios are hard to test and implement.</p><p>External Caching Ruins theDatabase Caching</p><p>Modern databases have embedded caches and complex policies to manage them. When</p><p>you place a cache in front of the database, most read requests will reach only the external</p><p>cache and the database won’t keep these objects in its memory. As a result, the database</p><p>cache is rendered ineffective. When requests eventually reach the database, its cache</p><p>will be cold and the responses will come primarily from the disk. As a result, the round-</p><p>trip from the cache to the database and then back to the application is likely to incur</p><p>additional latency.</p><p>External Caching Might Increase Security Risks</p><p>An external cache adds a whole new attack surface to your infrastructure. Encryption,</p><p>isolation, and access control on data placed in the cache are likely to be different from</p><p>the ones at the database layer itself.</p><p>External Caching Ignores theDatabase Knowledge</p><p>andDatabase Resources</p><p>Databases are quite complex and built for specialized I/O workloads on the system.</p><p>Many of the queries access the same data, and some amount of the working set size</p><p>can be cached in memory in order to save disk accesses. A good database should have</p><p>sophisticated logic to decide which objects, indexes, and accesses it should cache.</p><p>ChApter 8 topology ConsIderAtIons</p><p>173</p><p>The database also should have various eviction policies (such as the least recently</p><p>used [LRU] policy as a straightforward example) that determine when new data should</p><p>replace existing (older) cached objects.</p><p>Another example is scan-resistant caching. When scanning a large dataset, say a</p><p>large range or a full-table scan, a lot of objects are read from the disk. The database can</p><p>realize this is a scan (not a regular query) and choose to leave these objects outside its</p><p>internal cache. However, an external cache would treat the result set just like any other</p><p>and attempt to cache the results. The database automatically synchronizes the content</p><p>of the cache with the disk according to the incoming request rate, and thus the user and</p><p>the developer do not need to do anything to make sure that lookups to recently written</p><p>data are performant. 
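To see why scan resistance matters, here is a toy LRU, roughly the eviction logic a generic external cache applies, being polluted by a single large scan. The capacity and key names are invented, and real caches are more sophisticated, but the failure mode is the same:

```python
from collections import OrderedDict

class LRUCache:
    """Deliberately naive LRU, similar to what a generic external cache does."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            return self.items[key]
        return None

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=100)
hot_keys = [f"hot-{i}" for i in range(50)]
for key in hot_keys:                         # the working set everyone reads
    cache.put(key, "...")

# A single full-table scan pushes 10,000 one-off rows through the cache...
for i in range(10_000):
    cache.put(f"scan-{i}", "...")

survivors = sum(1 for key in hot_keys if cache.get(key) is not None)
print(f"hot keys still cached after the scan: {survivors} / {len(hot_keys)}")  # 0
```

A database that recognizes the access pattern as a scan can simply decline to cache those one-off rows; an external cache that only sees opaque get/set traffic cannot.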
Therefore, if, for some reason, your database doesn’t respond fast</p><p>enough, it means that:</p><p>• The cache is misconfigured</p><p>• It doesn’t have enough RAM for caching</p><p>• The working set size and request pattern don’t fit the cache</p><p>• The database cache implementation is poor</p><p>Summary</p><p>This chapter shared strong opinions on how to navigate topology decisions. For example,</p><p>we recommended:</p><p>• Using an RF of at least 3 (with geographical fine-tuning if available)</p><p>• Having as many racks as replicas</p><p>• Isolating reads and writes within a specific datacenter</p><p>• Ensuring each client knows and prioritizes the local datacenter</p><p>• Considering the (cross-region replication) costs of multi-region</p><p>deployments as well as their benefits</p><p>• Scaling up as much as possible before scaling out</p><p>ChApter 8 topology ConsIderAtIons</p><p>174</p><p>• Considering a few different options to minimize the resource</p><p>contention that could occur when running multiple workloads on a</p><p>single cluster</p><p>• Carefully considering the caveats associated with external caches,</p><p>load balancers, and abstraction layers</p><p>The next chapter looks at best practices for testing your topology: Benchmarking it to</p><p>see what it’s capable of and how it compares to alternative configurations and solutions.</p><p>Open Access This chapter is licensed under the terms of the Creative</p><p>Commons Attribution 4.0 International License (http://creativecommons.</p><p>org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and</p><p>reproduction in any medium or format, as long as you give appropriate credit to the</p><p>original author(s) and the source, provide a link to the Creative Commons license and</p><p>indicate if changes were made.</p><p>The images or other third party material in this chapter are included in the chapter's</p><p>Creative Commons license, unless indicated otherwise in a credit line to the material. If</p><p>material is not included in the chapter's Creative Commons license and your intended</p><p>use is not permitted by statutory regulation or exceeds the permitted use, you will need</p><p>to obtain permission directly from the copyright holder.</p><p>ChApter 8 topology ConsIderAtIons</p><p>175</p><p>CHAPTER 9</p><p>Benchmarking</p><p>We won’t sugarcoat it: database benchmarking is hard. There are many moving parts</p><p>and nuances to consider and manage—and a bit of homework is required to really see</p><p>what a database is capable of and measure it properly. It’s not easy to properly generate</p><p>system load to reflect your real-life scenarios.1 It’s often not obvious how to correctly</p><p>measure and analyze the end results. And after extracting benchmarking results, you</p><p>need to be able to read them, understand potential performance bottlenecks, analyze</p><p>potential performance improvements, and possibly dive into other issues. You need to</p><p>make your benchmarking results meaningful, ensure they are easily reproducible, and</p><p>also be able to clearly explain these results to your team and other interested parties in a</p><p>way that reflects your business needs. 
There’s also hard mathematics involved: statistics</p><p>and queueing theory to help with black boxes and measurements, not to mention</p><p>domain-specific knowledge of the system internals of the servers, platforms, operating</p><p>systems, and the software running on it.</p><p>But when performance is a top priority, careful—and sometimes frequent—</p><p>benchmarking is essential. And in the long run, it will pay off. An effective benchmark</p><p>can save you from even worse pains, like the high-pressure database migration project</p><p>that ensues after you realize—too late—that your existing solution can’t support the</p><p>latest phase of company growth with acceptable latencies and/or throughput.</p><p>The goal of this chapter is to share strategies that ease the pain slightly and, more</p><p>importantly, increase the chances that the pain pays off by helping you select options</p><p>that meet your performance needs. The chapter begins by looking at the two key types of</p><p>benchmarks and highlighting critical considerations for each objective. Then, it presents</p><p>a phased approach that should help you expose problems faster and with lower costs.</p><p>Next, it dives into the do’s and don’ts of benchmark planning, execution, and reporting,</p><p>1 For an example of realistic benchmarking executed with impressive mastery, see Brian Taylor’s</p><p>talk, “How Optimizely (Safely) Maximizes Database Concurrency,” at www.youtube.com/</p><p>watch?v=cSiVoX_nq1s.</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_9</p><p>176</p><p>with a focus on lessons learned from the best and worst benchmarks we’ve witnessed</p><p>over the past several years. Finally, the chapter closes with a look at some less common</p><p>benchmarking approaches you might want to consider for specialized needs.</p><p>Latency or Throughput: Choose Your Focus</p><p>When benchmarking, you need to decide upfront whether you want to focus on</p><p>throughput or latency. Latency is measured in both cases. But here’s the difference:</p><p>• Throughput focus: You measure the maximum throughput by</p><p>sending a new request as soon as the previous request completes.</p><p>This helps you understand the highest number of IOPS that the</p><p>database can sustain. Throughput-focused benchmarks are often the</p><p>focus for analytics use cases (fraud detection, cybersecurity, etc.)</p><p>• Latency focus: You assess how many IOPS the database can handle</p><p>without compromising latency. This is usually the focus for most</p><p>user-facing and real-time applications.</p><p>Throughput tests are quite common, but latency tests are a better choice if you</p><p>already know the desired throughput (e.g., 1M OPS). This is especially true</p><p>if your</p><p>production system must meet a specific latency goal (for example, the 99.99 percentile</p><p>should have a read latency of less than 10ms).</p><p>If you’re focused solely on latency, you need to measure and compare latency at</p><p>the same throughput rates. 
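One way to hold throughput constant while you measure latency is an open-loop load generator: requests are issued on a fixed schedule rather than as soon as the previous one completes, so a slow response cannot quietly reduce the offered rate and hide tail latency. Here is a rough sketch; query() and all the numbers are placeholders for your real driver call and target rate:

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def query():
    """Stand-in for one database operation; replace with a real driver call."""
    time.sleep(random.uniform(0.001, 0.004))   # pretend 1-4 ms of server time

def run_fixed_rate(rate_ops, duration_s, max_workers=256):
    """Open-loop load: requests go out on a fixed schedule, regardless of
    whether earlier requests have completed."""
    latencies = []

    def timed_call():
        start = time.perf_counter()
        query()
        latencies.append(time.perf_counter() - start)

    interval = 1.0 / rate_ops
    next_send = time.perf_counter()
    deadline = next_send + duration_s
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while next_send < deadline:
            now = time.perf_counter()
            if now < next_send:
                time.sleep(next_send - now)    # wait for the scheduled send time
            pool.submit(timed_call)
            next_send += interval
    return latencies

if __name__ == "__main__":
    lat = sorted(run_fixed_rate(rate_ops=200, duration_s=5))
    p99 = lat[int(len(lat) * 0.99)]
    print(f"{len(lat)} requests, median {statistics.median(lat) * 1000:.2f} ms, "
          f"P99 {p99 * 1000:.2f} ms")
```

A closed-loop tester (issue, wait, issue again) slows down whenever the database slows down, which understates tail latency precisely when you most want to see it.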
If you know only that database A can handle 30K OPS with</p><p>a certain P99 latency and database B can handle 50K OPS with a slightly higher P99</p><p>latency, you can’t really say which one is “more efficient.” For a fair comparison, you</p><p>would need to measure each database’s latencies at either 30K OPS or 50K OPS—or both.</p><p>Even better, you would track latency across a broader span of intervals (e.g., measuring</p><p>at 10K OPS increments up until when neither database could achieve the required P99</p><p>latency, as demonstrated in Figure9-1.)</p><p>Chapter 9 BenChmarking</p><p>177</p><p>Figure 9-1. A latency-oriented benchmark</p><p>Not all latency benchmarks need to take that form, however. Consider the example</p><p>of an AdTech company with a real-time bidding use case. For them, a request that takes</p><p>longer than 31ms is absolutely useless because it will fall outside of the bidding window.</p><p>It’s considered a timeout. And any request that is 30ms or less is fine; a 2ms response</p><p>is not any more valuable to them than a 20ms response. They care only about which</p><p>requests time out and which don’t.</p><p>Their benchmarking needs are best served by a latency benchmark measuring how</p><p>many OPS were generating timeouts over time. For example, Figure9-2 shows that the</p><p>first database in their benchmark (the top line) resulted in over 100K timeouts a second</p><p>around 11:30; the other (the horizontal line near the bottom) experienced only around</p><p>200 timeouts at that same point in time, and throughout the duration of that test.</p><p>Chapter 9 BenChmarking</p><p>178</p><p>Figure 9-2. A latency-oriented benchmark measuring how many OPS were</p><p>generating timeouts over time</p><p>Chapter 9 BenChmarking</p><p>179</p><p>For contrast, Figure9-3 shows an example of a throughput benchmark.</p><p>Figure 9-3. A throughput-oriented benchmark</p><p>With a throughput benchmark, you want to see one of the resources (e.g., the</p><p>CPU or disk) maxing out in order to understand how much the database can deliver</p><p>under extreme load conditions. If you don’t reach this level, it’s a sign that you’re not</p><p>really effectively benchmarking the database’s throughput. For example, Figure9-4</p><p>demonstrates the load of two clusters during a benchmark run. Note how one cluster is</p><p>fully utilized whereas the other is very close to reaching its limits.</p><p>Chapter 9 BenChmarking</p><p>180</p><p>Figure 9-4. Two clusters’ load comparison: one fully maxed out and another very</p><p>close to reaching its limit</p><p>Less Is More (at First): Taking aPhased Approach</p><p>With either focus, the number one rule of benchmarking is to start simple. Always keep</p><p>a laser focus on the specific questions you want the benchmark to answer (more on that</p><p>shortly). But, realize that it could take a number of phases—each with a fair amount of</p><p>trial and error—to get meaningful results.</p><p>What could go wrong? A lot. For example:</p><p>• Your client might be a bottleneck</p><p>• Your database sizing might need adjustment</p><p>• Your tests might need tuning</p><p>• A sandbox environment could have very different resources than a</p><p>production one</p><p>• Your testing methodology might be too artificial to predict reality</p><p>If you start off with too much complexity, it will be a nightmare to discover what’s</p><p>going wrong and pinpoint the source of the problem. 
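One low-effort way to keep the "start simple, then grow" discipline honest is to script the ramp itself, so each step must meet its target before the next, larger one runs. In the sketch below, measure_p99_ms() is a stand-in for an actual benchmark run and the latency curve is invented:

```python
import random

def measure_p99_ms(rate_ops):
    """Stand-in: run your load tool at `rate_ops` for a few minutes and return
    the observed P99 in milliseconds. Replace with a call into your harness."""
    return 2.0 + (rate_ops / 300_000) ** 3 * random.uniform(0.8, 1.2)  # fake curve

TARGET_OPS = 1_000_000
SLA_P99_MS = 10.0

for fraction in (0.10, 0.25, 0.50, 0.75, 1.00):
    rate = int(TARGET_OPS * fraction)
    p99 = measure_p99_ms(rate)
    print(f"{rate:>9,} ops/sec -> P99 {p99:6.2f} ms")
    if p99 > SLA_P99_MS:
        print("SLA violated; stop, investigate, and fix before scaling further")
        break
```

Stopping at the first failing step keeps the investigation small: whatever broke, it broke between the last passing rate and this one.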
For example, assume you want</p><p>to test if a database can handle 1M OPS of traffic from your client with a P99 latency of</p><p>1ms or less. However, you notice the latencies are exceeding the expected threshold.</p><p>You might spend days adjusting database configurations to no avail, then eventually</p><p>figure out that the problem stemmed from a bug in client-side concurrency. This would</p><p>Chapter 9 BenChmarking</p><p>181</p><p>have been much more readily apparent if you started out with just a fraction of that</p><p>throughput. In addition to avoiding frustration and lost time, you would have saved your</p><p>team a lot of unnecessary infrastructure costs.</p><p>As a general rule of thumb, consider at least two phases of benchmarking: one with a</p><p>specialized stress tool and one with your real workload (or at least a sampling of it—e.g.,</p><p>sending 30 percent of your queries to a cluster for benchmarking). For each phase,</p><p>start super small (at around 10 percent of the throughput you ultimately want to test),</p><p>troubleshoot as needed, then gradually increase the scope until you reach your target</p><p>loads. Keep optimization in mind throughout. Do you need to add more servers or more</p><p>clients to achieve a certain throughput? Or are you limited (by budget or infrastructure)</p><p>to a fixed hardware configuration? Can you achieve your performance goals with less?</p><p>The key is to move incrementally. Of course, the exact approach will vary from</p><p>situation to situation. Consider a leading travel company’s approach. Having recently</p><p>moved from PostgreSQL to Cassandra, they were quite experienced benchmarkers when</p><p>they decided to evaluate Cassandra alternatives. The goal was to test the new database</p><p>candidate’s raw speed and performance, along with its support for their specific</p><p>workloads.</p><p>First, they stood up a five-node cluster and ran database comparisons with synthetic</p><p>traffic from cassandra-stress. This gave them confidence that the new database could</p><p>meet their performance needs with some workloads. However, their real workloads</p><p>are nothing like even customized cassandra-stress workloads. They experience highly</p><p>variable and unpredictable traffic (for example, massive surges and disruptions</p><p>stemming from a volcanic eruption). For a more realistic assessment, they started</p><p>shadowing production traffic. This second phase of benchmarking provided the added</p><p>confidence they needed to move forward with the migration.</p><p>Finally, they used the same shadowed traffic to determine the best deployment</p><p>option. Moving to a larger 21-node cluster, they tested across cloud provider A and cloud</p><p>provider B on bare metal. They also experimented with many different options on cloud</p><p>provider B: various storage options, CPUs, and so on.</p><p>The bottom line here: Start simple, confirm, then scale incrementally. It’s safer and</p><p>ultimately faster. Plus, you’ll save on costs. As you move through the process, check if you</p><p>need to tweak your setup during your testing. Once you are eventually satisfied with the</p><p>results, scale your infrastructure accordingly to meet your defined criteria.</p><p>Chapter 9 BenChmarking</p><p>182</p><p>Benchmarking Do’s andDon’ts</p><p>The specific step-by-step instructions for how to configure and run a benchmark vary</p><p>across databases and benchmarking tools, so we’re not going to get into that. 
Instead,</p><p>let’s look at some of the more universal “do’s and don’ts” based on what we’ve seen in</p><p>the field.</p><p>Tip if you haven’t done so yet, be sure to review the chapters on drivers,</p><p>infrastructure, and topology considerations before you begin benchmarking.</p><p>Know What’s Under theHood ofYour Database (Or Find</p><p>Someone Who Knows)</p><p>Understand and anticipate what parts of the system your chosen workload will affect</p><p>and how. How will it stress your CPUs? Your memory? Your disks? Your network? Do you</p><p>know if the database automatically analyzes the system it’s running on and prioritizes</p><p>application requests as opposed to internal tasks? What’s going on as far as background</p><p>operations and how these may skew your results? And why does all this matter if you’re</p><p>just trying to run a benchmark?</p><p>Let’s take the example of compaction with LSM-tree based databases. As we’ll</p><p>cover in Chapter 11, compactions do have a significant impact on performance. But</p><p>compactions are unlikely to kick in if you run a benchmark for just a few minutes.</p><p>Given that compactions have dramatically different performance impacts on different</p><p>databases, it’s essential to know that they will occur</p><p>and ensure that tests last long</p><p>enough to measure their impact.</p><p>The important thing here is to try to understand the system that you’re</p><p>benchmarking. The better you understand it, the better you can plan tests and interpret</p><p>the results. If there are vendors and/or user groups behind the database you’re</p><p>benchmarking, try to probe them for a quick overview of how the database works and</p><p>what you should watch out for. Otherwise, you might overlook something that comes</p><p>back to haunt you, such as finding out that your projected scale was too optimistic. Or,</p><p>you might freak out over some KPI that’s really a non-issue.</p><p>Chapter 9 BenChmarking</p><p>183</p><p>Choose anEnvironment That Takes Advantage</p><p>oftheDatabase’s Potential</p><p>This is really a corollary to the previous tip. With a firm understanding of your database’s</p><p>superpowers, you can design benchmark scenarios that fully reveal its potential. For</p><p>example, if you want to compare two databases designed for commodity hardware, don’t</p><p>worry about benchmarking them on a broad array of powerful servers. But if you’re</p><p>comparing a database that’s architected to take advantage of powerful servers, you’d be</p><p>remiss to benchmark it only on commodity hardware (or even worse, using a Docker</p><p>image on a laptop). That would be akin to test driving a race car on the crowded streets</p><p>of NewYork City rather than your local equivalent of the Autobahn highway.</p><p>Likewise, if you think some aspect of the database or your data modeling will be</p><p>problematic for your use case, now’s the time to push it to the limits and assess its true</p><p>impact. For example, if you think a subset of your data might have imbalanced access</p><p>patterns due to user trends, use the benchmark phase to reproduce that and assess the</p><p>impacts.</p><p>Use anEnvironment That Represents Production</p><p>Benchmarking in the wrong environment can easily lead to an order-of-magnitude</p><p>performance difference. 
For example, a laptop might achieve 20K OPS where a dedicated</p><p>server could easily achieve 200K OPS.Unless you intend to have your production system</p><p>running on a laptop, do not benchmark (or run comparisons) on a laptop.</p><p>If you are using shared hardware in a containerized/virtualized environment, be</p><p>aware that one guest can increase latency in other guests. As a result, you’ll typically</p><p>want to ensure that hardware resources are dedicated to your database and that you</p><p>avoid resource overcommitment by any means possible.</p><p>Also, don’t overlook the environment for your load generators. If you underprovision</p><p>load generators, the load generators themselves will be the bottleneck. Another</p><p>consideration: Ensure that the database and the data loader are not running under the</p><p>same nodes. Pushing and pulling data is resource intensive, so the loader will definitely</p><p>steal resources from the database. This will impact your results with any database.</p><p>Chapter 9 BenChmarking</p><p>184</p><p>Don’t Overlook Observability</p><p>Having observability into KPIs beyond throughput and latency is critical for identifying</p><p>and troubleshooting issues. For instance, you might not be hitting the cache as much</p><p>as intended. Or a network interface might be overwhelmed with data to the point that it</p><p>interferes with latency. Observability is also your primary tool for validating that you’re not</p><p>being overly optimistic—or pessimistic—when reviewing results. You may discover that even</p><p>read requests served from disk, with a cold cache, are within your latency requirements.</p><p>Note For extensive discussion on this topic, see Chapter 10.</p><p>Use Standardized Benchmarking Tools Whenever Feasible</p><p>Don’t waste resources building—and debugging and maintaining—your own version of</p><p>a benchmarking tool that has already been solved for. The community has developed an</p><p>impressive set of tools that can cover a wide range of needs. For example:</p><p>• YCSB2</p><p>• TPC-C3</p><p>• NdBench4</p><p>• Nosqlbench5</p><p>• pgbench6</p><p>• TLP-stress7</p><p>• Cassandra-stress8</p><p>• and more…</p><p>2 https://github.com/brianfrankcooper/YCSB</p><p>3 http://tpc.org/tpcc/default5.asp</p><p>4 https://github.com/Netflix/ndbench</p><p>5 https://github.com/nosqlbench/nosqlbench</p><p>6 www.postgresql.org/docs/current/pgbench.html</p><p>7 https://github.com/thelastpickle/tlp-stress</p><p>8 https://github.com/scylladb/scylla-tools-java/tree/master/tools/stress</p><p>Chapter 9 BenChmarking</p><p>185</p><p>They are all relatively the same and provide similar configuration parameters. Your</p><p>task is to understand which one better reflects the workload you are interested in and</p><p>how to run it properly. When in doubt, consult with your vendor for specific tooling</p><p>compatible with your database of choice.</p><p>Of course, these options won’t cover everything. It makes sense to develop your own</p><p>tools if:</p><p>• Your workloads look nothing like the ones offered by standard tools</p><p>(for example, you rely on multiple operations that are not natively</p><p>supported by the tools)</p><p>• It helps you test against real (or more realistic) workloads in the later</p><p>phases of your benchmarking strategy</p><p>Ideally, the final stages of your benchmarking would involve connecting your</p><p>application to the database and seeing how it responds to your real workload. 
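If you do end up writing your own tool, the first cut does not have to be elaborate. The sketch below assumes the DataStax/ScyllaDB Python driver's usual Cluster/Session API; the contact points, the orders schema, the queries, and the 80/20 read/write mix are all invented placeholders for your own:

```python
import random
import time
import uuid

from cassandra.cluster import Cluster  # the ScyllaDB fork exposes the same API

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # your contact points
session = cluster.connect("shop")                         # hypothetical keyspace

# Prepared statements for a hypothetical schema; substitute your real queries.
read_stmt = session.prepare("SELECT * FROM orders WHERE user_id = ?")
write_stmt = session.prepare(
    "INSERT INTO orders (user_id, order_id, total) VALUES (?, ?, ?)")

READ_RATIO = 0.8          # aim for the production read/write mix, not 50/50
latencies = []

for _ in range(10_000):
    user_id = random.randint(1, 1_000_000)  # ideally, mimic real key popularity
    start = time.perf_counter()
    if random.random() < READ_RATIO:
        session.execute(read_stmt, (user_id,))
    else:
        session.execute(write_stmt, (user_id, uuid.uuid1(), random.uniform(5, 500)))
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"P50: {latencies[len(latencies) // 2] * 1000:.2f} ms, "
      f"P99: {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms")
cluster.shutdown()
```

This loop is single-threaded and closed-loop, which is fine for a smoke test; for a real run you would add concurrency (for example, via the driver's execute_async) and drive it from machines that do not host the database.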
But what</p><p>if, for example, you are comparing two databases that require you to implement the</p><p>application logic in two totally different ways? In this case, the different application logic</p><p>implementations could influence your results as much as the difference in databases.</p><p>Again, we recommend starting small: Testing just the basic functionality of the</p><p>application against both targets (following each one’s best practices) and seeing what the</p><p>initial results look like.</p><p>Use Representative Data Models, Datasets,</p><p>andWorkloads</p><p>As you progress past the initial “does this even work” phase of your benchmarking,</p><p>it soon becomes critical to gravitate to representative data models, datasets, and</p><p>workloads. The closer you approximate your production environment, the better you can</p><p>trust that your results accurately represent what you will experience in production.</p><p>Data Models</p><p>Tools such as cassandra-stress use a default data model that does not completely</p><p>reflect what most teams use in production. For example, the cassandra-stress default</p><p>data model has a replication factor set to 1 and uses LOCAL_ONE as a consistency</p><p>level. Although cassandra-stress is a convenient way to get some initial performance</p><p>impressions, it is critical to benchmark the same/similar data model that you will</p><p>Chapter 9 BenChmarking</p><p>186</p><p>use in production. That’s why we recommend using a custom data model and tuning</p><p>your consistency level and queries. cassandra-stress and other benchmarking tools</p><p>commonly provide ways to specify a user profile, where you can specify your own</p><p>schema, queries, replication factor, request distribution and sizes, throughput rates,</p><p>number of clients, and other aspects.</p><p>Dataset Size</p><p>If you run the benchmark with a dataset that’s smaller than your production dataset, you</p><p>may have misleading or incorrect results due to the reduced number of I/O operations.</p><p>Eventually, you should configure a test that realistically reflects a fraction of your</p><p>production dataset size corresponding to your current scale.</p><p>Workloads</p><p>Run the benchmark using a load that represents, as closely as possible, your anticipated</p><p>production workload. This includes the queries submitted by the load generator. When</p><p>you use the right type of queries, they are distributed over the cluster and the ratio</p><p>between reads and writes remains relatively constant.</p><p>The read/write ratio is important. Different combinations will impact your disk</p><p>in different ways. If you want results representative of production, use a realistic</p><p>workload mix.</p><p>Eventually, you will max out your storage I/O throughput and starve your disk, which</p><p>causes requests to start queuing on the database. If you continue pushing past that point,</p><p>latency will increase. When you hit that point of increased latency with unsatisfactory</p><p>in practice.</p><p>abouT The TeChniCal reviewers</p><p>xvii</p><p>Acknowledgments</p><p>The process of creating this book has been a wild ride across many countries, cultures,</p><p>and time zones, as well as around many obstacles. 
There are many people to thank for</p><p>their assistance, inspiration, and support along this journey.</p><p>To begin, ScyllaDB co-founders Dor Laor and Avi Kivity—for starting the company</p><p>that brought us all together, for pushing the boundaries of database performance at scale</p><p>in ways that inspired this book, and for trusting us to share the collective sea monster</p><p>wisdom in this format. Thank you for this amazing opportunity.</p><p>We thank our respective teams, and especially our managers, for supporting this side</p><p>project. We hope we kept the core workload disruption to a minimum and did not inflict</p><p>any “stop the world” project pauses.</p><p>Our technical reviewers—Botond Dénes, Ľuboš Koščo, and Raphael S.Carvalho—</p><p>painstakingly reviewed the first draft of every page in this book and offered insightful</p><p>suggestions throughout. Thank you for your thoughtful comments and for being so</p><p>generous with your time.</p><p>Additionally, our unofficial technical reviewer and toughest critic, Kostja Osipov,</p><p>provided early and (brutally) honest feedback that led us to substantially alter the book’s</p><p>focus for the better.</p><p>The Brazilian Ninja team (Guilherme Nogueira, Lucas Martins Guimarães, and</p><p>Noelly Medina) rescued us in our darkest hour, allowing us to scale out and get the first</p><p>draft across the finish line. Muito Obrigado!</p><p>Ben Gaisne is the graphic design mastermind behind the images in this book. Merci</p><p>for transforming our scribbles into beautiful diagrams and putting up with about ten</p><p>rounds of “just one more round of book images.”</p><p>We are also indebted to many for their unintentional contributions on the content</p><p>front. Glauber Costa left us with a treasure trove of materials we consulted when</p><p>composing chapters, especially Chapter 9 on benchmarking. He also inspired the addition</p><p>of Chapter 6 on getting data closer. Additionally, we also looked back to ScyllaDB blogs as</p><p>we were writing—specifically, blogs by Avi Kivity (for Chapter 3), Eyal Gutkind (for Chapter</p><p>7), Vlad Zolotarov and Moreno Garcia (also for Chapter 7), Dor Laor (for Chapter 8), Eliran</p><p>Sinvani (also for Chapter 8), and Ivan Prisyazhynyy (for Chapter 9).</p><p>xviii</p><p>Last, but certainly not least, we thank Jonathan Gennick for bringing us to Apress. We</p><p>thank Shaul Elson and Susan McDermott for guiding us through the publishing process.</p><p>It has been a pleasure working with you. And we thank everyone involved in editing and</p><p>production; having previously tried this on our own, we know it’s an excruciating task</p><p>and we are truly grateful to you for relieving us of this burden!</p><p>aCknowledgmenTs</p><p>xix</p><p>Introduction</p><p>Sisyphean challenge. Gordian knot. Rabbit hole. Many metaphors have been used to</p><p>describe the daunting challenge of achieving database performance at scale. That isn’t</p><p>surprising. 
Consider just a handful of the many factors that contribute to satisfying</p><p>database latency and throughput expectations for a single application:</p><p>• How well you know your workload access patterns and whether they</p><p>are a good fit for your current or target database.</p><p>• How your database interacts with its underlying hardware, and</p><p>whether your infrastructure is correctly sized for the present as well</p><p>as the future.</p><p>• How well your database driver understands your database—and how</p><p>well you understand the internal workings of both.</p><p>It’s complex. And that’s just the tip of the iceberg.</p><p>Then, once you feel like you’re finally in a good spot, something changes. Your</p><p>business experiences “catastrophic success,” exposing the limitations of your initial</p><p>approach right when you’re entering the spotlight. Maybe market shifts mean that your</p><p>team is suddenly expected to reduce latency—and reduce costs at the same time, too.</p><p>Or perhaps you venture on to tackle a new application and find that the lessons learned</p><p>from the original project don’t translate to the new one.</p><p>Why Read/Write aBook onDatabase Performance?</p><p>The most common approaches to optimizing database performance are conducting</p><p>performance tuning and scaling out. They are important—but in many cases, they aren’t</p><p>enough to satisfy strict latency expectations at medium to high throughput. To break past</p><p>that plateau, other factors need to be addressed.</p><p>xx</p><p>As with any engineering challenge, there’s no one-size-fits-all solution. But there are</p><p>a lot of commonly overlooked considerations and opportunities with the potential to</p><p>help teams meet their database performance objectives faster, and with fewer headaches.</p><p>As a group of people with experience across a variety of performance-oriented</p><p>database projects, we (the authors) have a unique perspective into what works well for</p><p>different performance-sensitive use cases—from low-level engineering optimizations,</p><p>to infrastructure components, to topology considerations and the KPIs to focus on for</p><p>monitoring. Frequently, we engage with teams when they’re facing a performance</p><p>challenge so excruciating that they’re considering changing their production database</p><p>(which can seem like the application development equivalent of open heart surgery).</p><p>And in many cases, we develop a long-term relationship with a team, watching their</p><p>projects and objectives evolve over time and helping them maintain or improve</p><p>performance across the shifting sands.</p><p>Based on our experience with performance-focused database engineering as well as</p><p>performance-focused database users, this book represents what we think teams striving</p><p>for extreme database performance—low latency, high throughput, or both—should be</p><p>thinking about. We have experience working with multi-petabyte distributed systems</p><p>requiring millions of interactions per second. We’ve engineered systems supporting</p><p>business critical real-time applications with sustained latencies below one millisecond.</p><p>Finally, we’re well aware of commonly-experienced “gotchas” that no one has dared to</p><p>tell you about, until now.</p><p>What WeMean by Database Performance atScale</p><p>Database performance at scale means different things to different teams. 
For some, it</p><p>might mean achieving extremely low read latencies; for others, it might mean ingesting</p><p>very large datasets as quickly as possible. For example:</p><p>• Messaging: Keeping latency consistently low for thousands to</p><p>millions of operations per second, because users expect to interact in</p><p>real-time on popular social media platforms, especially when there’s</p><p>a big event or major news.</p><p>• Fraud detection: Analyzing a massive dataset as rapidly as possible</p><p>(millions of operations per second), because faster processing helps</p><p>stop fraud in its tracks.</p><p>inTroduCTion</p><p>xxi</p><p>• AdTech: Providing lightning fast (sub-millisecond P9999 latency)</p><p>responses with zero tolerance for latency spikes, because an ad bid</p><p>that’s sent even a millisecond past the cutoff is worthless to the ad</p><p>company and the clients who rely on it.</p><p>We specifically tagged on the “at scale” modifier to emphasize that we’re catering to</p><p>teams who are outside of the honeymoon zone, where everything is just blissfully fast</p><p>no matter what you do with respect to setup, usage, and management. Different teams</p><p>will reach that inflection point for different reasons, and at different thresholds. But one</p><p>thing is always the same: It’s better to anticipate and prepare than to wait and scramble</p><p>to react.</p><p>Who This Book Is For</p><p>This book was written for individuals and teams looking to optimize distributed</p><p>database performance for an existing project or to begin a new performance-sensitive</p><p>project with a solid and scalable foundation. You are most likely:</p><p>• Experiencing or anticipating some pain related to database latency</p><p>and/or throughput</p><p>• Working primarily on a use case with terabytes to petabytes of raw</p><p>(unreplicated)</p><p>results, stop, reflect on what happened, analyze how you can improve, and iterate</p><p>through the test again. Rinse and repeat as needed.</p><p>Here are some tips on creating realistic workloads for common use cases:</p><p>• Ingestion: Ingest data as fast as possible for at least a few hours, and</p><p>do it in a way that doesn’t produce timeouts or errors. The goal here</p><p>is to ensure that you’ve got a stable system, capable of keeping up</p><p>with your expected traffic rate for long periods.</p><p>• Real-time bidding: Use bulk writes coming in after hours or</p><p>constantly low background loads; the core of the workload is a lot of</p><p>reads with extremely strict latency requirements (perhaps below a</p><p>specific threshold).</p><p>Chapter 9 BenChmarking</p><p>187</p><p>• Time series: Use heavy and constant writes to ever-growing</p><p>partitions split and bucketed by time windows; reads tend to focus on</p><p>the latest rows and/or a specific range of time.</p><p>• Metadata store: Use writes occasionally, but focus on random</p><p>reads representing users accessing your site. 
There's usually good cacheability here.

• Analytics: Periodically write a lot of information and perform a lot of full table scans (perhaps in parallel with some of the other workloads).

The bottom line is to try to emulate what your workloads look like and run something that's meaningful to you.

Exercise Your Cache Realistically

Unless you can absolutely guarantee that your workload has a high cache hit rate, be pessimistic and exercise the cache well.

You might be running workloads, getting great results, and seeing cache hit rates of up to 90 percent. That's great. But is this how you're going to be running in practice all the time? Do you have periods throughout the day when your cache is not going to be that warm, maybe because something else is running? In real-life situations, you will likely have times when the cache is colder or even completely cold (e.g., after an upgrade or after a hardware failure). Consider testing those scenarios in the benchmark as well.

If you want to make sure that all requests are served from disk, you can disable the cache altogether. However, be aware that this is typically an extreme situation, as most workloads (one way or another) exercise some caching. Sometimes you can create a cold-cache situation by just restarting the nodes or restarting the database processes.

Look at Steady State

Most databases behave differently in real life than they do in short, transient test situations. They usually run for days or years—so when you test a database for two minutes, you're probably not getting a deep understanding of how it behaves, unless you are working in memory only. Also, when you're working with a database that is built to serve tens or hundreds of terabytes—maybe even petabytes—know that it's going to behave rather differently at various data levels. Requests become more expensive, especially read requests. Testing something that serves only a gigabyte really isn't the same as testing something that serves a terabyte.

Figure 9-5 exemplifies the importance of looking at steady state. Can you tell what throughput is being sustained by the database in question?

Figure 9-5. A throughput graph that is not focused on steady state

Well, if you look just at the first minute, it seems that it's serving 40K OPS. But if you wait for a few minutes, the throughput decreases.

Whenever you want to make a statement about the maximum throughput that your database can handle, make it from a steady state. Make sure that you're inserting an amount of data that is meaningful (not just a couple of gigabytes), and make sure that the test runs long enough to be a realistic scenario. After you are satisfied with how many requests can be sustained over a prolonged period of time, consider adding noise, such as scaling clients, and introducing failure situations.

Watch Out for Client-Side Bottlenecks

One of the most common mistakes with benchmarks is overlooking the fact that the bottleneck could be coming from the application side. You might have to tune your application clients to allow for higher concurrency. You may also be running many application pods on the same tenant, with all instances contending for the same hardware resources. A quick way to check whether the client itself is the limiter is to sweep the client's concurrency and watch how throughput responds, as sketched below.
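In this minimal sketch, query() is a stand-in for your real driver call and the worker counts are arbitrary:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def query():
    """Stand-in for a real driver call."""
    time.sleep(random.uniform(0.001, 0.003))

def closed_loop_throughput(workers, requests_per_worker=500):
    """Each worker issues requests back-to-back; returns achieved ops/sec."""
    def worker():
        for _ in range(requests_per_worker):
            query()

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
    elapsed = time.perf_counter() - start
    return workers * requests_per_worker / elapsed

if __name__ == "__main__":
    for workers in (1, 2, 4, 8, 16, 32, 64):
        print(f"{workers:>3} workers -> {closed_loop_throughput(workers):>10,.0f} ops/sec")
```

If the achieved rate stops improving as you add workers while the database's CPU, disk, and network are still far from saturated, the limiter is the client or its environment, not the database.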
Make sure your application is running in a proper environment, as is your database.

Also Watch Out for Networking Issues

Networking issues can also muddle the results of your benchmarking. If the database spends too much CPU time servicing network interrupts (softirq), performance will degrade. You can detect this by analyzing the CPU shares spent on interrupt handling, for example. And you can typically resolve it by using CPU pinning, which tells the system that all network interrupts should be handled by specific CPUs that are not being used by the database.

Similarly, running your application through a slow link, such as routing traffic via the Internet rather than via a private link, can easily introduce a networking bottleneck.

Document Meticulously to Ensure Repeatability

It's difficult to anticipate when or why you might want to repeat a benchmark. Maybe you want to assess the impact of optimizations you made after getting some great tips at the vendor's user conference. Maybe you just learned that your company was acquired and you should prepare to support ten times your current throughput—or much stricter latency SLAs. Perhaps you learned about a cool new database that's API-compatible with your current one, and you're curious how the performance stacks up. Or maybe you have a new boss with a strong preference for another database and you suddenly need to re-justify your decision with a head-to-head comparison.

Whatever the reason you're repeating a benchmark scenario, one thing is certain: You will be immensely appreciative of the time you previously spent documenting exactly what you did and why.

Reporting Do's and Don'ts

So you've completed your benchmark and gathered all sorts of data—what's the best way to report it? Don't skimp on this final, yet critical, step. Clear and compelling reporting is critical for convincing others to support your recommended course of action—be it embarking on a database migration, changing your configuration or data modeling, or simply sticking with what's working well for you.

Here are some reporting-focused do's and don'ts.

Be Careful with Aggregations

When it comes to aggregations, proceed with extreme caution. You could report the result of a benchmark by saying something like "I ran this benchmark for three days, and this is my throughput." However, this overlooks a lot of critical information. For example, consider the two graphs presented in Figures 9-6 and 9-7.

Figure 9-6. Lower baseline throughput that's almost constant and predictable throughout a ten-minute period

Figure 9-7. A bumpier path to a similar throughput at the end

Both of these loads reach roughly the same throughput at the end. Figure 9-6 shows a lower baseline throughput, but it's constant and very predictable throughout the period. The throughput in Figure 9-7 dips much lower than the first baseline, but it also spikes to a much higher value. The behavior shown in Figure 9-6 is obviously more desirable. But if you aggregate your results, it would be really hard to notice the difference.

Another aggregation mistake is aggregating tail latencies: taking the average of the P99 latencies reported by multiple load generators. The sketch below contrasts that mistake with merging the raw latency samples.
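Here is a small self-contained illustration; all of the latency samples are invented, with one generator deliberately given a much worse tail:

```python
import random

def p99(samples):
    ordered = sorted(samples)
    return ordered[int(len(ordered) * 0.99)]

random.seed(7)
# Invented latency samples (ms) from three load generators; the third one
# has a much worse tail than the other two.
generators = [
    [random.gauss(5, 1) for _ in range(10_000)],
    [random.gauss(5, 1) for _ in range(10_000)],
    [random.gauss(5, 1) for _ in range(9_000)]
    + [random.uniform(40, 80) for _ in range(1_000)],
]

average_of_p99s = sum(p99(g) for g in generators) / len(generators)
merged = [sample for g in generators for sample in g]   # pool the raw samples
true_p99 = p99(merged)
worst_p99 = max(p99(g) for g in generators)

print(f"average of per-generator P99s:  {average_of_p99s:6.2f} ms  (understates the tail)")
print(f"P99 of the merged distribution: {true_p99:6.2f} ms  (correct)")
print(f"max of per-generator P99s:      {worst_p99:6.2f} ms  (acceptable fallback)")
```

Averaging hides the bad generator almost entirely, while merging the raw samples (or, failing that, taking the worst per-generator P99) keeps the tail visible.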
The correct way to determine the percentiles</p><p>over multiple load generators is to merge the latency distribution of each load generator</p><p>and then determine the percentiles. If that isn’t an option, then the next best alternative</p><p>is to take the maximum (the P99, for example) of each of the load generators. The actual</p><p>P99 will be equal to</p><p>data, over 10K operations per second, and with P99</p><p>latencies measured in milliseconds</p><p>• At least somewhat familiar with scalable distributed databases such</p><p>as Apache Cassandra, ScyllaDB, Amazon DynamoDB, Google Cloud</p><p>Bigtable, CockroachDB, and so on</p><p>• A software architect, database architect, software engineer, VP of</p><p>engineering, or technical CTO/founder working with a data-intensive</p><p>application</p><p>You might also be looking to reduce costs without compromising performance, but</p><p>unsure of all the considerations involved in doing so.</p><p>We assume that you want to get your database performance challenges resolved,</p><p>fast. That’s why we focus on providing very direct and opinionated recommendations</p><p>based on what we have seen work (and fail) in real-world situations. There are, of</p><p>course, exceptions to every rule and ways to debate the finer points of almost any tip</p><p>inTroduCTion</p><p>xxii</p><p>in excruciating detail. We’ll focus on presenting the battle-tested “best practices” and</p><p>anti-patterns here, and encourage additional discussion in whatever public or private</p><p>channels you prefer.</p><p>What This Book Is NOT</p><p>A few things that this book is not attempting to be:</p><p>• A reference for infrastructure engineers building databases. We focus</p><p>on people working with a database.</p><p>• A “definitive guide” to distributed databases, NoSQL, or data-</p><p>intensive applications. We focus on the top database considerations</p><p>most critical to performance.</p><p>• A guide on how to configure, work with, optimize, or tune any</p><p>specific database. We focus on broader strategies you can “port”</p><p>across databases.</p><p>There are already many outstanding references that cover the topics we’re</p><p>deliberately not addressing, so we’re not going to attempt to re-create or replace them.</p><p>See Appendix A for a list of recommended resources.</p><p>Also, this is not a book about ScyllaDB, even though the authors and technical</p><p>reviewers have experience with ScyllaDB.Our goal is to present strategies that are useful</p><p>across the broader class of performance-oriented databases. We reference ScyllaDB, as</p><p>well as other databases, as appropriate to provide concrete examples.</p><p>A Tour ofWhat WeCover</p><p>Given that database performance is a multivariate challenge, we explore it from a</p><p>number of different angles and perspectives. Not every angle will be relevant to every</p><p>reader—at least not yet. We encourage you to browse around and focus on what seems</p><p>most applicable to your current situation.</p><p>To start, we explore challenges. Chapter 1 kicks it off with two highly fictionalized</p><p>tales that highlight the variety of database performance challenges that can arise and</p><p>introduce some of the available strategies for addressing them. 
Next, we look at the</p><p>inTroduCTion</p><p>xxiii</p><p>database performance challenges and tradeoffs that you’re likely to face depending on</p><p>your project’s specific workload characteristics and technical/business requirements.</p><p>The next set of chapters provides a window into many often-overlooked engineering</p><p>details that could be constraining—or helping—your database performance. First, we</p><p>look at ways databases can extract more performance from your CPU, memory, storage,</p><p>and networking. Next, we shift the focus from hardware interactions to algorithmic</p><p>optimizations—deep diving into the intricacies of a sample performance optimization</p><p>from the perspective of the engineer behind it. Following that, we share everything a</p><p>performance-obsessed developer really should know about database drivers but never</p><p>thought to ask. Driver-level optimizations —both how they’re engineered and how you</p><p>work with them—are absolutely critical for performance, so we spend a good amount</p><p>of time on topics like the interaction between clients and servers, contextual awareness,</p><p>maximizing concurrency while keeping latencies under control, correct usage of</p><p>paging, timeout control, retry strategies, and so on. Finally, we look at the performance</p><p>possibilities in moving more logic into the database (via user-defined functions and</p><p>user-defined aggregates) as well as moving the database servers closer to users.</p><p>Then, the final set of chapters shifts into field-tested recommendations for</p><p>getting better performance out of your database deployment. It starts by looking at</p><p>infrastructure and deployment model considerations that are important to understand,</p><p>whether you’re managing your own deployment or opting for a database-as-a-service</p><p>(maybe serverless) deployment model. Then, we share our top strategies related to</p><p>topology, benchmarking, monitoring, and admin—all through the not-always-rosy lens</p><p>of performance.</p><p>After all that, we hope you end up with a new appreciation of the countless</p><p>considerations that impact database performance at scale, discover some previously</p><p>overlooked opportunities to optimize your database performance, and avoid the</p><p>common traps and pitfalls that inflict unnecessary pain and distractions on all too many</p><p>dev and database teams.</p><p>Tip Check out our github repo for easy access to the sources we reference in</p><p>footnotes, plus additional resources on database performance at scale: https://</p><p>github.com/Apress/db-performance-at-scale.</p><p>inTroduCTion</p><p>xxiv</p><p>Summary</p><p>Optimizing database performance at the scale required for today’s data-intensive</p><p>applications often requires more than performance tuning and scaling out. This</p><p>book shares commonly overlooked considerations, pitfalls, and opportunities that</p><p>have helped many teams break through database performance plateaus. It’s neither</p><p>a definitive guide to distributed databases nor a beginner’s resource. Rather, it’s a</p><p>look at the many different factors that impact performance, and our top field-tested</p><p>recommendations for navigating them. 
Chapter 1 provides two (fun and fanciful) tales</p><p>that surface some of the many roadblocks you might face and highlight the range of</p><p>strategies for navigating around them.</p><p>inTroduCTion</p><p>1</p><p>CHAPTER 1</p><p>A Taste ofWhat You’re</p><p>UpAgainst: Two Tales</p><p>What’s more fun than wrestling with database performance? Well, a lot. But that doesn’t</p><p>mean you can’t have a little fun here. To give you an idea of the complexities you’ll likely</p><p>face if you’re serious about optimizing database performance, this chapter presents two</p><p>rather fanciful stories. The technical topics covered here are expanded on throughout</p><p>the book. But this is the one and only time you’ll hear of poor Joan and Patrick. Let</p><p>their struggles bring you some valuable lessons, solace in your own performance</p><p>predicaments… and maybe a few chuckles as well.</p><p>Joan Dives Into Drivers andDebugging</p><p>Lured in by impressive buzzwords like “hybrid cloud,” “serverless,” and “edge first,”</p><p>Joan readily joined a new company and started catching up with their technology stack.</p><p>Her first project recently started a transition from their in-house implementation of</p><p>a database system, which turned out to not scale at the same pace as the number of</p><p>customers, to one of the industry-standard database management solutions. Their new</p><p>pick was a new distributed database, which, as opposed to NoSQL, strives to keep the</p><p>original ACID1 guarantees known in the SQL world.</p><p>Due to a few new data protection acts that tend to appear annually nowadays, the</p><p>company’s board decided that they were going to maintain their own datacenter, instead</p><p>of using one of the popular cloud vendors for storing sensitive information.</p><p>1 Atomicity, consistency, isolation, and durability</p><p>© Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop 2023</p><p>F. C. Mendes et al., Database Performance at Scale, https://doi.org/10.1007/978-1-4842-9711-7_1</p><p>2</p><p>On a very high level, the company’s main product consisted of only two layers:</p><p>• The frontend, the entry point for users, which actually runs in their</p><p>own browsers and communicates with the rest of the system to</p><p>exchange and persist information.</p><p>• The everything-else, customarily known as the</p><p>backend, but actually</p><p>includes load balancers, authentication, authorization, multiple</p><p>cache layers, databases, backups, and so on.</p><p>Joan’s first task was to implement a very simple service for gathering and summing</p><p>up various statistics from the database and integrate that service with the whole</p><p>ecosystem, so that it fetched data from the database in real-time and allowed the</p><p>DevOps teams to inspect the statistics live.</p><p>To impress the management and reassure them that hiring Joan was their absolutely</p><p>best decision this quarter, Joan decided to deliver a proof-of-concept implementation</p><p>on her first day! The company’s unspoken policy was to write software in Rust, so she</p><p>grabbed the first driver for their database from a brief crates.io search and sat down to</p><p>her self-organized hackathon.</p><p>The day went by really smoothly, with Rust’s ergonomic-focused ecosystem</p><p>providing a superior developer experience. But then Joan ran her first smoke tests on a</p><p>real system. 
Disbelief turned to disappointment and helplessness when she realized that</p><p>every third request (on average) ended up in an error, even though the whole database</p><p>cluster reported to be in a healthy, operable state. That meant a debugging session was</p><p>in order!</p><p>Unfortunately, the driver Joan hastily picked for the foundation of her work, even</p><p>though open-source on its own, was just a thin wrapper over precompiled, legacy C</p><p>code, with no source to be found. Fueled by a strong desire to solve the mystery and a</p><p>healthy dose of fury, Joan spent a few hours inspecting the network communication with</p><p>Wireshark,2 and she made an educated guess that the bug must be in the hashing key</p><p>implementation.3 In the database used by the company, keys are hashed to later route</p><p>requests to appropriate nodes. If a hash value is computed incorrectly, a request may be</p><p>forwarded to the wrong node, which can refuse it and return an error instead.</p><p>2 Wireshark is a great tool for inspecting network packets and more (www.wireshark.org).</p><p>3 Loosely based on a legit hashing quirk in Apache Cassandra (https://github.com/apache/</p><p>cassandra/blob/56ea39ec704a94b5d23cbe530548745ab2420cee/src/java/org/apache/</p><p>cassandra/utils/MurmurHash.java#L31-L32).</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>3</p><p>Unable to verify the claim due to the missing source code, Joan decided on a simpler</p><p>path—ditching the originally chosen driver and reimplementing the solution on one of</p><p>the officially supported, open-source drivers backed by the database vendor, with a solid</p><p>user base and regularly updated release schedule.</p><p>Joan’s Diary ofLessons Learned, Part I</p><p>The initial lessons include:</p><p>1. Choose a driver carefully. It’s at the core of your code’s</p><p>performance, robustness, and reliability.</p><p>2. Drivers have bugs too, and it’s impossible to avoid them. Still,</p><p>there are good practices to follow:</p><p>a. Unless there’s a good reason, choose the officially supported driver (if it</p><p>exists).</p><p>b. Open-source drivers have advantages. They’re not only verified by the</p><p>community, but they also allow deep inspection of the code, and even</p><p>modifying the driver code to get even more insights for debugging.</p><p>c. It’s better to rely on drivers with a well-established release schedule</p><p>since they are more likely to receive bug fixes (including for security</p><p>vulnerabilities) in a reasonable period of time.</p><p>3. Wireshark is a great open-source tool for interpreting network</p><p>packets; give it a try if you want to peek under the hood of your</p><p>program.</p><p>The introductory task was eventually completed successfully, which made Joan</p><p>ready to receive her first real assignment.</p><p>The Tuning</p><p>Armed with the experience gained working on the introductory task, Joan started planning</p><p>how to approach her new assignment: a misbehaving app. One of the applications</p><p>notoriously caused stability issues for the whole system, disrupting other workloads</p><p>each time it experienced any problems. The rogue app was already based on an officially</p><p>supported driver, so Joan could cross that one off the list of potential root causes.</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>4</p><p>This particular service was responsible for injecting data backed up from the</p><p>legacy system into the new database. 
Because the company was not in a great hurry,</p><p>the application was written with low concurrency in mind to have low priority and</p><p>not interfere with user workloads. Unfortunately, once every few days something</p><p>kept triggering an anomaly. The normally peaceful application seemed to be trying to</p><p>perform a denial-of-service attack on its own database, flooding it with requests until the</p><p>backend got overloaded enough to cause issues for other parts of the ecosystem.</p><p>As Joan watched metrics presented in a Grafana dashboard, clearly suggesting that</p><p>the rate of requests generated by this application started spiking around the time of the</p><p>anomaly, she wondered how on Earth this workload could behave like that. It was, after</p><p>all, explicitly implemented to send new requests only when fewer than 100 of them were</p><p>currently in progress.</p><p>Since collaboration was heavily advertised as one of the company’s “spirit and</p><p>cultural foundations” during the onboarding sessions with an onsite coach, she decided</p><p>it was best to discuss the matter with her colleague, Tony.</p><p>“Look, Tony, I can’t wrap my head around this,” she explained.</p><p>“This service doesn’t send any new requests when 100 of them are</p><p>already in flight. And look right here in the logs: 100 requests</p><p>in- progress, one returned a timeout error, and…,” she then</p><p>stopped, startled at her own epiphany.</p><p>“Alright, thanks Tony, you’re a dear—best rubber duck4 ever!,” she</p><p>concluded and returned to fixing the code.</p><p>The observation that led to discovering the root cause was rather simple: The request</p><p>didn’t actually return a timeout error because the database server never sent such a</p><p>response. The request was simply qualified as timed out by the driver, and discarded. But</p><p>the sole fact that the driver no longer waits for a response for a particular request does</p><p>not mean that the database is done processing it! It’s entirely possible that the request</p><p>was instead just stalled, taking longer than expected, and the driver gave up waiting for</p><p>its response.</p><p>With that knowledge, it’s easy to imagine that once 100 requests time out on the</p><p>client side, the app might erroneously think that they are not in progress anymore, and</p><p>happily submit 100 more requests to the database, increasing the total number of</p><p>4 For an overview of the “rubber duck debugging” concept, see https://</p><p>rubberduckdebugging.com/.</p><p>Chapter 1 a taste ofWhat You’re upagainst: tWo tales</p><p>5</p><p>in- flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme</p><p>levels of concurrency on your database cluster—even though the application was</p><p>supposed to keep it limited to a small number!</p><p>Joan’s Diary ofLessons Learned, Part II</p><p>The lessons continue:</p><p>1. Client-side timeouts are convenient for programmers, but they</p><p>can interact badly with server-side timeouts. Rule of thumb: Make</p><p>the client-side timeouts around twice as long as server-side ones,</p><p>unless you have an extremely good reason to do otherwise. Some</p><p>drivers may be capable of issuing a warning if they detect that the</p><p>client-side timeout is smaller than the server-side one, or even</p><p>amend the server-side timeout to match, but in general it’s best to</p><p>double-check.</p><p>2. 
Joan's Diary of Lessons Learned, Part II

The lessons continue:

1. Client-side timeouts are convenient for programmers, but they can interact badly with server-side timeouts. Rule of thumb: Make the client-side timeouts around twice as long as server-side ones, unless you have an extremely good reason to do otherwise. Some drivers may be capable of issuing a warning if they detect that the client-side timeout is smaller than the server-side one, or even amend the server-side timeout to match, but in general it's best to double-check.

2. Tasks with seemingly fixed concurrency can actually cause spikes under certain unexpected conditions. Inspecting logs and dashboards is helpful in investigating such cases, so make sure that observability tools are available, both in the database cluster and for all client applications. Bonus points for distributed tracing, like OpenTelemetry[5] integration.

[5] OpenTelemetry "is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." For details, see https://opentelemetry.io/.

With the client-side timeouts properly amended, the application choked much less frequently and to a smaller extent, but it still wasn't a perfect citizen in the distributed system. It occasionally picked a victim database node and kept bothering it with too many requests, while ignoring the fact that seven other nodes were considerably less loaded and could help handle the workload too. At other times, its concurrency was reported to be exactly twice as large as allowed by the configuration. Whenever the two anomalies converged in time, the poor node was unable to handle all the requests it was bombarded with, and it had to give up on a fair portion of them. A long study of the driver's documentation, which was fortunately available in mdBook[6] format and kept reasonably up-to-date, helped Joan alleviate those pains too.

[6] mdBook "is a command line tool to create books with Markdown." For details, see https://rust-lang.github.io/mdBook/.

The first issue was simply a misconfiguration of the non-default load balancing policy, which tried too hard to pick "the least loaded" database node out of all the available ones, based on heuristics and statistics occasionally updated by the database itself. Unfortunately, this policy was also "best effort," and relied on the fact that statistics arriving from the database were always legit. But a stressed database node could become so overloaded that it wasn't sending updated statistics in time! That led the driver to falsely believe that this particular server was not actually busy at all. Joan decided that this setup was a premature optimization that turned out to be a footgun, so she just restored the original default policy, which worked as expected.

The second issue (temporary doubling of the concurrency) was caused by another misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured period of time without getting an acknowledgement from the database, drivers would speculatively resend a request to maximize its chances to succeed. This mechanism is very useful to increase requests' success rate. However, if the original request also succeeds, it means that the speculative one was sent in vain. In order to balance the pros and cons, speculative retry should be configured to resend requests only when it's very likely that the original one failed. Otherwise, as in Joan's case, the speculative retry may act too soon, doubling the number of requests sent (and thus also doubling concurrency) without improving the success rate.
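To make the trade-off concrete, here is a rough sketch of speculative retry, again with invented names (send_request stands in for the driver's request path). The only tunable that matters here is the speculative delay: if it is shorter than the latency most requests need anyway, nearly every request gets sent twice.

```python
import asyncio

async def send_request(payload):
    # Hypothetical driver call; simulate typical request latency.
    await asyncio.sleep(0.05)
    return "ok"

async def with_speculative_retry(payload, speculative_delay):
    first = asyncio.create_task(send_request(payload))
    try:
        # If the first attempt answers before the delay elapses, no extra work is done.
        return await asyncio.wait_for(asyncio.shield(first), speculative_delay)
    except asyncio.TimeoutError:
        # The first attempt is still running; fire a duplicate and take
        # whichever finishes first. If the delay is set below the typical
        # request latency, this branch runs for almost every request and
        # the database sees roughly double the traffic for no extra successes.
        second = asyncio.create_task(send_request(payload))
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return done.pop().result()
```

A sensible delay sits well above the usual latency (for example, around the observed p99), so the duplicate is only sent when the original is genuinely likely to have been lost.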
Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality debugging session that ends in an astounding success (except writing a cheesy story in a deeply technical book, naturally). Great job, Joan!

The end.

Patrick's Unlucky Green Fedoras

After losing his job at a FAANG (MAANG? MANGA?) company, Patrick decided to strike out on his own and founded a niche online store dedicated to trading his absolute favorite among headwear, green fedoras. Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack.

After some experimentation with the offering's free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers. To Patrick's disappointment, fewer than ten customers visited the site daily. At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick's Diary of Lessons Learned, Part I

The lessons started right away:

1. Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload:

   a. Is it likely to be a predictable, steady flow of requests (e.g., updates being fetched from other systems periodically)?

   b. Is the variance high and hard to predict, with the system being idle for potentially long periods of time, with occasional bumps of activity?

   Database-as-a-service offerings often let you pick between provisioned throughput and on-demand purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use (a rough break-even sketch follows this list).

2. Give yourself time to evaluate your choice and avoid committing to long-term contracts (even if lured by a discount) before you see that the setup works for you in a sustainable way.
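A back-of-the-envelope comparison of the two pricing models, using entirely made-up prices (real cloud pricing differs and usually has more dimensions, such as storage, reads versus writes, and data transfer):

```python
# Hypothetical prices, for illustration only.
PROVISIONED_PER_HOUR = 1.50      # flat rate for a cluster provisioned for 1,000 req/s
ON_DEMAND_PER_MILLION = 1.25     # pay-per-request price

def monthly_cost(avg_requests_per_second):
    hours = 30 * 24
    requests = avg_requests_per_second * 3600 * hours
    provisioned = PROVISIONED_PER_HOUR * hours
    on_demand = ON_DEMAND_PER_MILLION * requests / 1_000_000
    return provisioned, on_demand

# Patrick's store: a handful of visitors a day, so on-demand would have cost
# cents, while the provisioned cluster costs the same whether anyone shows up.
print(monthly_cost(0.001))  # roughly (1080.0, 0.003)
print(monthly_cost(500))    # roughly (1080.0, 1620.0): steady heavy traffic favors provisioned
```

The exact numbers are irrelevant; what matters is estimating your average and peak request rates before choosing a pricing model, not after the first invoice arrives.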
The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick's mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today's expected traffic.

"Provisioned throughput strikes again," Patrick groaned to himself, while scrolling through thousands of "throughput exceeded" error messages that started appearing around 11am.

Patrick's Diary of Lessons Learned, Part II

This is what Patrick learned:

1. If your workload is susceptible to spikes, be prepared for it and try to architect your cluster to be able to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput in a dynamic way, which means that the threshold of accepted requests can occasionally be raised temporarily to a previously configured level. Or, respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient.

2. Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests.

3. Observability is key in distributed systems. It allows the developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening, or at least minimize the negative impact on the cluster.

The First Loss

Patrick didn't even manage to recover from the trauma of losing most of his potential income on the only day throughout the year during which green fedoras experienced any kind of demand, when the letter came. It included an angry rant from a would-be customer, who successfully proceeded with his order and paid for it (with a receipt from the payment processing operator as proof), but is now unable to see any details of his order—and he's still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn't find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick's initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there's a trade-off to be made among consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability[7] at the cost of decreased throughput, or reduce the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without any replicas distributed in the system. Once this server failed—which happens to hardware surprisingly often, especially at large scale—the data was gone.

[7] A very strong consistency guarantee; see the Jepsen page on Linearizability for details (https://jepsen.io/consistency/models/linearizable).
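The usual way to reason about this trade-off is quorum arithmetic: with a replication factor RF, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one up-to-date copy whenever R + W > RF. A tiny sketch (the numbers are illustrative, and the exact consistency-level names vary between databases):

```python
def is_strongly_consistent(rf: int, w: int, r: int) -> bool:
    # Reads and writes overlap on at least one replica when R + W > RF.
    return r + w > rf

print(is_strongly_consistent(rf=3, w=2, r=2))  # True: quorum reads and writes
print(is_strongly_consistent(rf=3, w=1, r=1))  # False: fast, but reads may be stale
print(is_strongly_consistent(rf=1, w=1, r=1))  # True on paper, but with RF=1 a single
                                               # dead node loses the data, as Patrick learned
```

Consistency levels protect you from reading stale data; only replication (and backups) protects you from losing it. Patrick needed both RF greater than one and a backup routine.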
Patrick's Diary of Lessons Learned, Part III

Further lessons include:

1. Backups are vital in a distributed environment, and there's no such thing as setting backup routines "too soon." Systems fail, and backups are there to restore as much of the important data as possible.

2. Every database system has a certain consistency model, and it's crucial to take that into account when designing your project. There might be compromises to make. In some use cases (think financial systems), consistency is the key. In other ones, eventual consistency is acceptable, as long as it keeps the system highly available and responsive.

The Spike Strikes Again

Months went by and Patrick's sleeping schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe.

If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking the time zone into account, happened around 2am Patrick's time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends. Patrick was able to observe a real-life instantiation of Little's Law, which he vaguely remembered from his days at the university. With a beautifully concise formula, L = λW, the law can be simplified to the fact that concurrency equals throughput times latency.

Tip: For those having trouble remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/s. Then, it stands to reason that in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You're welcome!

Throughput depends on the hardware and naturally has its limits (e.g., you can't expect an NVMe drive purchased in 2023 to serve the data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!). Once the limit is hit, you can treat it as constant in the formula. It's then clear that as concurrency rises, so does latency.
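A quick worked example of Little's Law at the point of saturation (the throughput figure is made up): once a node maxes out at, say, 10,000 requests per second, any extra concurrency shows up directly as latency.

```python
MAX_THROUGHPUT = 10_000  # requests per second a saturated node can sustain (hypothetical)

for concurrency in (100, 1_000, 10_000, 100_000):
    latency = concurrency / MAX_THROUGHPUT  # L = λW rearranged to W = L / λ
    print(f"{concurrency:>7} in flight -> ~{latency * 1000:,.0f} ms per request")

# 100     ->     ~10 ms
# 1,000   ->    ~100 ms
# 10,000  ->  ~1,000 ms
# 100,000 -> ~10,000 ms: well past the point where users give up
```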
For the end-users—Malaysian teenagers in this scenario—it means that the latency is eventually going to cross the magic barrier for average human perception of a few seconds. Once that happens, users get too frustrated and simply give up on trying altogether, assuming that the system is broken beyond repair. It's easy to find online articles quoting that "Amazon found that 100ms of latency costs them 1 percent in sales"; although it sounds overly simplified, it is also true enough.

Patrick's Diary of Lessons Learned, Part IV

The lessons continue:

1. Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers (which is a very good idea for multiple reasons) but instead is accessed through a set of microservices under your control, make sure that the microservices are also aware of the concurrency limits and adhere to them.

2. Keep in mind that Little's Law exists—it's fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers.

Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable.

Unfortunately, the next March 17th didn't go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and the cluster was handling the load of customers with a healthy safety margin. But then the dashboards stopped updating, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick's Diary of Lessons Learned, Part V

Concluding thoughts:

1. Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen and you must take them into consideration because they're an internal source of concurrency and resource consumption.

2. Whenever possible, schedule maintenance operations for times with expected low pressure on the system (a small sketch of such scheduling follows this list).

3. If your database management system supports any kind of quality of service configuration, it's a good idea to investigate such capabilities. For instance, it might be possible to set a strong priority for user requests over regular maintenance operations, especially during peak hours. Conversely, periods with low user-induced activity can be utilized to speed up background activities. In the database world, systems that use a variant of LSM trees for underlying storage need to perform quite a bit of compaction (a kind of maintenance operation on data) in order to keep the read/write performance predictable and steady.
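Lesson 2 is easy to automate. Here is a small sketch of a backup job that defers itself while the cluster looks busy; current_cluster_load and run_backup are hypothetical stand-ins for whatever metrics endpoint and backup tooling you actually use:

```python
import time

LOAD_THRESHOLD = 0.30        # don't start maintenance if the cluster is more than 30% busy
CHECK_INTERVAL_S = 15 * 60   # re-check every 15 minutes

def current_cluster_load() -> float:
    # Hypothetical: replace with a query to your metrics stack (e.g., Prometheus).
    return 0.1

def run_backup() -> None:
    # Hypothetical: replace with a call to your backup tooling.
    print("backup started")

def backup_when_quiet(max_wait_s: float = 6 * 3600) -> None:
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if current_cluster_load() < LOAD_THRESHOLD:
            run_backup()
            return
        time.sleep(CHECK_INTERVAL_S)  # busy right now; try again later
    # Don't skip backups forever just because the system stays busy.
    run_backup()
```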
The end.

Summary

Meeting database performance expectations can sometimes seem like a never-ending pain. As soon as you diagnose and address one problem, another is likely lurking right behind it. The next chapter helps you anticipate the challenges and opportunities you are most likely to face given your technical requirements and business expectations.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.