<?xml version="1.0" encoding="UTF-8"?><rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Uber Engineering Blog</title><link>https://www.uber.com/blog/engineering</link><atom:link href="http://192.168.0.120:1200/uber/blog" rel="self" type="application/rss+xml"></atom:link><description>The technology behind Uber Engineering - Made with love by RSSHub(https://github.com/DIYgod/RSSHub)</description><generator>RSSHub</generator><webMaster>i@diygod.me (DIYgod)</webMaster><language>en</language><lastBuildDate>Wed, 08 May 2024 05:54:25 GMT</lastBuildDate><ttl>1</ttl><item><title>From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;In the past few years, machine learning (ML) adoption and impact at Uber have accelerated across all business lines. Today, ML plays a key role in Uber’s business and is used to make business-critical decisions in areas like ETA prediction, rider-driver matching, Eats homefeed ranking, and fraud detection.&lt;/p&gt;


&lt;p&gt;As Uber’s centralized ML platform, &lt;a href=&quot;https://www.uber.com/blog/michelangelo-machine-learning-platform/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Michelangelo&lt;/a&gt; has been instrumental in driving Uber’s ML evolution since it was first introduced in 2016. It offers a set of comprehensive features that cover the end-to-end ML lifecycle, empowering Uber’s ML practitioners to develop and productize high-quality ML applications at scale. Currently, approximately 400 active ML projects are managed on Michelangelo, with over 20K model training jobs monthly. There are more than 5K models in production, serving 10 million real-time predictions per second at peak.&lt;/p&gt;


&lt;p&gt;As shown in Figure 1 below, ML developer experience is an important multiplier that enables developers to deliver real-world business impact. By leveraging Michelangelo, Uber’s ML use cases have grown from simple tree models to advanced deep learning models, and ultimately, to the latest Generative AI. In this blog post, we present the evolution of Michelangelo over the past eight years, with a focus on the continuous enhancement of the ML developer experience at Uber.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/7H03P6ohPRRzlipNe07kKr3cGsFo3FYOQ1XQNZZbipKWQ5_mLrCuIaDCMtSQyrTGSJ4P-hLG7y7Z_n4C4xIA7Way05VtOWqTigGi1Haq7bBehIOMMi2d7TEW833CMJpgqqXDwBSlq2mGoq_fK8tel14&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: ML Developer Experience is a multiplier for delivering ML business impact.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-journey-of-ai-ml-uber&quot;&gt;Journey of AI/ML @ Uber&lt;/h2&gt;


&lt;p&gt;Presently, Uber operates in over 10,000 cities spanning more than 70 countries, serving 25 million trips on the platform each day with 137 million monthly active users. ML has been integrated into virtually every facet of Uber’s daily operations, and nearly every interaction within the Uber apps involves ML behind the scenes. Take the rider app as an example: when users try to log in, ML is used to detect fraud signals like possible account takeovers. Within the app, in many jurisdictions, ML is used to suggest destination auto-completions and to rank the search results. Once the destination is chosen, ML comes into play for a multitude of functions, including ETA computation, trip price calculation, rider-driver matching with safety measures in mind, and on-trip routing. After the trip is completed, ML aids in payment fraud detection and chargeback prevention, and it also powers the customer service chatbot.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/p3awCLhFgkOjDsk_T8R0DomPhWJWm9vkP8ZTb2VqKOHi7UN1Ous3e_wqGnM-CBBkOVBnw-pmFRYtF6Ik6kl9e31_t9k-BM6BGs9532Hc3b5u6Bej89QBOlJedgeT23t7mm-iTmTq8RRFKhCpMbWFrYY&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Real-time ML underpins Rider app user flow.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;As you can see in Figure 2, real-time ML powers user flow in the rider app, and the same holds true for the Eats app (and many others), as illustrated in Figure 3 below.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Yw8-VN8tnl7p9vyZHpX0l4Su9yIzWGHlF2f3mrYbi569DAiz2us-0FLr5Ti7be5HDUFEpyZrNg7SXfBi8edCm3FpklgTkxpr1jMyhuIkuM-oNrfcZpMUjP1BYMTfy3rhCP6OL98YjobFh6GkWR3v0h0&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Real-time ML underpins Eater app core user flow.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Reflecting on the evolution of ML at Uber, there are three distinct phases:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;2016 – 2019:&lt;/strong&gt; During this initial phase, Uber primarily employed predictive machine learning for tabular data use cases. Algorithms, such as XGBoost, were used for critical tasks like ETA predictions, risk assessment, and pricing. Furthermore, Uber delved into the realm of deep learning (DL) in critical areas like 3D mapping and perception in self-driving cars, necessitating significant investments in GPU scheduling and distributed training methodologies, like Horovod®.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;2019 – 2023:&lt;/strong&gt; The second phase witnessed a concerted push towards the adoption of DL and collaborative model development for high-impact ML projects. The emphasis was on model iteration as code within the ML monorepo and on supporting DL as a first-class citizen in Michelangelo. During this period, more than 60% of tier-1 models adopted DL in production, which boosted model performance significantly.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Starting in 2023:&lt;/strong&gt; The third phase represents the latest development in the new wave of Generative AI, with a focus on improving Uber’s end-user experience and internal employee productivity (described in a &lt;a href=&quot;https://www.uber.com/blog/the-transformative-power-of-generative-ai/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;previous blog&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/QNrwoFNrmVffQxsYS6KkqfXjHLJBBT1QHK99UO3o9KmMdC2t3DP_wUNIvUoU0dK2TOI4krdwKwb_EmV71O2Lm3vqqiYjyuzEwFq6oQZBVro17q1Xr45lpIPyEtaGpgYUEcVNXA60CxYwlJ7YJwFGJ28&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Uber’s ML journey from 2016 to 2023.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Throughout this transformative journey, Michelangelo has been playing a pivotal role in advancing ML capabilities and empowering teams to build industry-leading ML applications.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-michelangelo-1-0-2016-2019&quot;&gt;Michelangelo 1.0 (2016 – 2019)&lt;/h2&gt;


&lt;p&gt;When Uber embarked on its ML journey back in 2015, applied scientists used Jupyter Notebooks™ to develop models, while engineers built bespoke pipelines to deploy those models to production. There was no system in place to build reliable and reproducible pipelines for creating and managing training and prediction workflows at scale, and no easy way to store or compare training experiment results. More importantly, there was no established path to deploying a model into production without creating a custom serving container.&lt;/p&gt;


&lt;p&gt;In early 2016, Michelangelo was launched to standardize the ML workflows via an end-to-end system that enabled ML developers across Uber to easily build and deploy ML models at scale. It started by addressing the challenges around scalable model training and deployment to production serving containers (&lt;a href=&quot;https://www.uber.com/blog/michelangelo-machine-learning-platform/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;). Then, a feature store named Palette was built to better manage and share feature pipelines across teams. It supported both batch and near-real-time feature computation use cases. Currently, Palette hosts more than 20,000 features that can be leveraged out-of-box for Uber teams to build robust ML models (&lt;a href=&quot;https://www.infoq.com/presentations/michelangelo-palette-uber/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;).&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Other key Michelangelo components released include, but are not limited to:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gallery:&lt;/strong&gt; Michelangelo’s model and ML metadata registry that provides a comprehensive search API for all types of ML entities. (&lt;a href=&quot;https://openproceedings.org/2020/conf/edbt/paper_217.pdf&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;)&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Manifold:&lt;/strong&gt; A model-agnostic visual debugging tool for ML at Uber. (&lt;a href=&quot;https://www.uber.com/blog/manifold/?uclick_id=91e0edf5-abbe-49f9-b9ee-2a7c598a6a35&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;)&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;PyML:&lt;/strong&gt; A framework that sped up the cycle of prototyping, validating, and productionizing Python ML models. (&lt;a href=&quot;https://www.uber.com/blog/michelangelo-pyml/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;)&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Model representation:&lt;/strong&gt; An extension of Michelangelo’s model representation for flexibility at scale. (&lt;a href=&quot;https://www.uber.com/blog/michelangelo-machine-learning-model-representation/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;)&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Horovod&lt;/strong&gt; for distributed training. (&lt;a href=&quot;https://www.uber.com/blog/horovod/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;)&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-michelangelo-2-0-2019-2023&quot;&gt;Michelangelo 2.0 (2019 – 2023)&lt;/h2&gt;


&lt;p&gt;The initial goal of Michelangelo was to bootstrap and democratize ML at Uber. By the end of 2019, most lines of business at Uber had integrated ML into their products. Subsequently, Michelangelo’s focus started shifting from “enabling ML everywhere” to “doubling down on high-impact ML projects” so that developers could uplevel the model performance and quality of these projects to drive higher business value for Uber. Given the complexity and significance of these projects, there was a demand for more advanced ML techniques, particularly DL, and many different roles (e.g., data scientists and engineers) were often required to collaborate and iterate on models faster, as shown in Figure 5. This posed several challenges for Michelangelo 1.0, as listed below.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/2F31vZ3o80U_oyCz7LSHemUXQWNzYy6xLCkn4jIALAjnj83oO6Q8uAiCDOSo3YGDzJBCUOsgI6dKigsqOWqfmKnasgL2vQ3BQP8BOEJZkJBmpj9L4zk9sIMKaYVaHDTO6J7opoEBmGhpIz6OIGvGafU&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: ML lifecycle is iterative and collaborative with many different roles.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;&lt;strong&gt;1. Lack of comprehensive ML quality definition and project tiering&lt;/strong&gt;: Unlike microservices, which have well-defined quality standards and best practices, ML models at that time lacked a consistent way to measure the full spectrum of model quality. For example, many teams only measured offline model performance such as AUC and RMSE, but ignored other critical metrics like online model performance, freshness of training data, and model reproducibility. This resulted in little visibility of model performance, stale models in production, and poor dataset coverage.&lt;/p&gt;


&lt;p&gt;Also, it is important to recognize that ML projects vary significantly in terms of business impact. The lack of a distinct ML tiering system led to a uniform approach in resource allocation, support, and managing outages, regardless of a project’s impact. This resulted in high-impact projects receiving inadequate investment or not being given the priority they deserved.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;2. Insufficient support for DL models:&lt;/strong&gt; Up until 2019, ML use cases at Uber were predominantly using tree-based models, which inherently did not favor adopting advanced techniques like custom loss functions, incremental training, and embeddings. Conversely, Uber had vast data suitable for training DL models, but the infrastructure and developer experience challenges hindered the progress in this direction. Many teams like Maps ETA and Rider incentive teams had to invest months in developing their own DL toolkits before successfully training their first version of DL models.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;3. Inadequacy of support for collaborative model development&lt;/strong&gt;: In the early days, most ML projects were small-scale and were authored and iterated on by a single developer from inception to production. Hence, Michelangelo 1.0 was not optimized for highly collaborative model development, and collaboration in the Michelangelo 1.0 UI and Jupyter Notebook was difficult and often done via manual copying and merging, without version control or branching. In addition, there was no code review process for UI model config changes or notebook edits, and the absence of a centralized repository for ML code and configurations led to their dispersion across various sources. These gaps posed a significant threat to our engineering process and made large-scale model exploration across numerous ML projects arduous.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;4. Fragmented ML tooling and developer experience: &lt;/strong&gt;Since 2015, many ML tools other than Michelangelo have been built by different teams at Uber for a subset of the ML lifecycle and use cases, such as &lt;a href=&quot;https://www.uber.com/blog/evolution-ds-workbench/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Data Science Workbench&lt;/a&gt; (DSW) from Data team for managed Jupyter Notebooks, ML Explorer from Marketplace team for ML workflow orchestration and automation, and uFlow/uScorer from Risk team specifically for training and inferencing models from their own team. There were also different ways to develop an ML model for different model types–e.g., Michelangelo UI for SparkML and XGBoost models, Jupyter Notebook for DL models, and &lt;a href=&quot;https://www.uber.com/blog/michelangelo-pyml/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;PyML&lt;/a&gt; for custom Python-based models. Launching one ML project usually required constantly switching between such semi-isolated tools, which were built with different UI patterns and user flows, leading to fragmented user experience and reduced productivity.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;To address these challenges, Michelangelo 2.0 re-architected the fragmented ML platforms into a single coherent product with a unified UI and API for the end-to-end ML lifecycle. Michelangelo 2.0 has four user-facing themes: (1) model quality and project tiering, (2) model iteration as code via Canvas, (3) DL as a first-class platform citizen, and (4) unified ML developer experience via MA Studio.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-architectural-overview&quot;&gt;Architectural Overview&lt;/h3&gt;


&lt;p&gt;Michelangelo 2.0 is centered around four pillars. At the very bottom, we are enabling an architecture that allows for plug-and-play platform components. Some of the components are built in-house, and others can be state-of-the-art commodity pieces from open source or third parties. On top is the development and production experience that caters to applied scientists and ML engineers. To improve model development velocity, we are streamlining the development experience and enabling technologies for collaborative, reusable development. We believe this approach will enable us to track and enforce compliance at the platform level. We are investing in production experiences, such as safe deployment of models and automatic model retraining, to make it easy to maintain and manage models at scale. Finally, we are focusing on the quality of the models and investing in tooling that measures this quality across all stages and improves it systematically.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/gRuNaUCrkkMDsG1LwyLo43lqxK0Sl_dtJXQU3o2NX2VKH1wPy9SOvlPbk91_8o6dOGEsbSQz336xH1u9Z_RVi5vfoS9A2TjQg5_T5sQ7-jCrMkfRrxL4supkxIoviPLMvO8Gs0iV0QA2O0p-rxUIk_g&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: High-level concepts of Michelangelo 2.0 Architecture.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Here are a few architectural design principles for Michelangelo 2.0:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Define project tiering, and focus on high-impact use cases to maximize Uber’s ML impact. Provide self-service to long-tail ML use cases so that they can leverage the power of the platform.&lt;/li&gt;


&lt;li&gt;The majority of ML use cases can leverage Michelangelo’s core workflows and UI, while Michelangelo also enables more bespoke workflows needed for advanced use cases like deep learning.&lt;/li&gt;


&lt;li&gt;Monolithic vs. plug-and-play. Architecture will support plug-and-play of different components, but the managed solution will only support a subset of them for the best user experience. Bring your own components for advanced use cases.&lt;/li&gt;


&lt;li&gt;API/code-driven vs. UI-driven. Follow the API-first principle and leverage the UI for visualization and fast iteration. Support model iteration as code for version control and code reviews, including changes made in the UI.&lt;/li&gt;


&lt;li&gt;Build vs. buy decision. Leverage best-in-class offerings from OSS or Cloud, or build in-house. OSS solutions may be prioritized over proprietary solutions. Be cautious about the cost of capacity for Cloud solutions.&lt;/li&gt;


&lt;li&gt;Codify the best ML practices like safe model deployment, model retraining, and feature monitoring in the platform.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The system consists of three planes: the control plane, the offline data plane, and the online data plane. The control plane defines user-facing APIs and manages the lifecycle of all entities in the system. The offline data plane does the heavy lifting of big data processing, such as feature computation, model training and evaluation, and offline batch inference. The online data plane handles real-time model inference and feature serving, which are used by other microservices.&lt;/p&gt;


&lt;p&gt;The control plane adopts the &lt;a href=&quot;https://github.com/cncf/tag-app-delivery/blob/163962c4b1cd70d085107fc579e3e04c2e14d59c/operator-wg/whitepaper/Operator-WhitePaper_v1-0.md&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Kubernetes™ Operator design pattern&lt;/a&gt; for modularization and extensibility. The Michelangelo APIs also follow the &lt;a href=&quot;https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Kubernetes API conventions&lt;/a&gt; and standardize the operations on ML-related entities such as Project, Pipeline, PipelineRun, Model, Revision, InferenceServer, and Deployment. By leveraging the Kubernetes API machinery, including the API server, &lt;a href=&quot;https://etcd.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;etcd&lt;/a&gt;, and the controller manager, all Michelangelo APIs can be accessed in a consistent manner, resulting in a more streamlined user experience. In addition, the declarative API pattern is crucial for Michelangelo to support mutation by both the UI and code in a Git repo, as detailed later.&lt;/p&gt;
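

&lt;p&gt;To make the declarative pattern concrete, here is a minimal Python sketch. It is not Michelangelo’s actual API. It assumes a hypothetical Deployment resource shaped like a Kubernetes object (apiVersion, kind, metadata, spec, status); the group name michelangelo.example/v1 and the spec fields modelRevision and replicas are invented for illustration. A toy reconcile function plays the role of a controller: whoever writes the desired spec, whether the UI or a commit in a Git repo, the controller only compares spec with status and converges them.&lt;/p&gt;

&lt;pre class=&quot;wp-block-code&quot;&gt;&lt;code&gt;
```python
import copy

# Hypothetical declarative resource, shaped like a Kubernetes object.
# The group/version string and every field under 'spec' are illustrative
# assumptions, not Michelangelo's actual schema.
desired = {
    'apiVersion': 'michelangelo.example/v1',
    'kind': 'Deployment',
    'metadata': {'name': 'eta-model', 'namespace': 'maps'},
    'spec': {'modelRevision': 'rev-42', 'replicas': 3},
    'status': {},
}


def reconcile(resource):
    '''Return a copy of the resource whose status matches its spec,
    plus the list of (field, observed, wanted) changes that were needed.'''
    updated = copy.deepcopy(resource)
    actions = []
    for field, wanted in resource['spec'].items():
        observed = resource['status'].get(field)
        if observed != wanted:
            actions.append((field, observed, wanted))
            updated['status'][field] = wanted
    return updated, actions


# The first pass converges status to spec; the second finds nothing to do.
first_pass, first_actions = reconcile(desired)
second_pass, second_actions = reconcile(first_pass)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running reconcile a second time produces no actions. That idempotence is what makes it safe for several clients to declare the same desired state.&lt;/p&gt;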


&lt;p&gt;The offline data plane consists of a set of ML pipelines, including training, scoring, and evaluation, which are defined as DAGs of steps. The ML pipelines support intermediate checkpoints and resuming between steps to avoid re-executing previous steps. Steps are executed on top of frameworks like Ray™ or Spark™. The online data plane manages RPC services and stream processing jobs that serve online prediction, online feature access, and near-real-time feature computation.&lt;/p&gt;
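

&lt;p&gt;The checkpoint-and-resume behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Michelangelo’s implementation: run_pipeline, the step names, and the in-memory dict standing in for durable checkpoint storage are all hypothetical, and steps run in-process instead of on Ray or Spark. The DAG is ordered with graphlib.TopologicalSorter from the Python standard library.&lt;/p&gt;

&lt;pre class=&quot;wp-block-code&quot;&gt;&lt;code&gt;
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, deps, checkpoints):
    '''Run steps in dependency order, skipping any step whose result is
    already stored in checkpoints. Returns the names of steps executed.'''
    executed = []
    for name in TopologicalSorter(deps).static_order():
        if name in checkpoints:
            continue  # resume: reuse the stored intermediate result
        inputs = [checkpoints[d] for d in sorted(deps.get(name, ()))]
        checkpoints[name] = steps[name](*inputs)
        executed.append(name)
    return executed


# Hypothetical three-step pipeline; deps maps each step to its predecessors.
steps = {
    'load': lambda: [1.0, 2.0, 3.0],
    'train': lambda rows: sum(rows) / len(rows),
    'evaluate': lambda mean: abs(mean - 2.0),
}
deps = {'load': set(), 'train': {'load'}, 'evaluate': {'train'}}

store = {}  # stands in for durable checkpoint storage
first_run = run_pipeline(steps, deps, store)

# Simulate a failure that lost only the last step, then resume.
del store['evaluate']
resumed_run = run_pipeline(steps, deps, store)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On the resumed run only the evaluate step executes, because the results of load and train are still present in the checkpoint store.&lt;/p&gt;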


&lt;p&gt;Figure 7 shows the detailed design of the Michelangelo 2.0 system, which reduced the engineering complexity as well as simplified the external dependencies on other infrastructure components.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/MqljT9LUabDAsMHNb1k6y5LqyILCQdzWU1zi1ZD_XWneCBiYqo3Wtoda2_ysExjD0prHflRC8yBVK5Cziy2VkrQHuThszXcgvWMS4CfSZDllePznu6UB_Jw_mOZE25UA4o5gPy46woAP9uSFXrJlctI&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Detailed system design of Michelangelo 2.0 including offline, online and control planes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-model-quality-and-project-tiering&quot;&gt;Model Quality and Project Tiering&lt;/h3&gt;


&lt;p&gt;The development and maintenance of a production-ready ML system are intricate, involving numerous stages in the model lifecycle and a complex supporting infrastructure. Typically, an ML model undergoes phases like feature engineering, training, evaluation, and serving. The lack of comprehensive ML quality measurement leads to limited visibility for ML developers regarding various quality dimensions at different stages of a model’s lifecycle. Moreover, this gap hinders organizational leaders from making fully informed decisions regarding the quality and impact of ML projects.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/bIbKLbwB7em5FhgTLHym9OHRHcaORl4UASUzEquCX0O11RkyCoOCxYbiANmykUDdFUoAZGfX7EZ_kCLyeloxxy82D_gvWZX7aYpfQmVZWPb68632BR5YjLcCSSLOznLU8Yr-Ei7QbnqSuNYiuTyu33w&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Example ML quality dimensions (in yellow) in a typical ML system.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;To address these gaps, we launched Model Excellence Score (MES), a framework for measuring and monitoring key dimensions and metrics at each stage of a model’s life cycle, such as training model accuracy, prediction accuracy, model freshness, and prediction feature quality, to ensure a holistic and rigorous approach to ML deployment at scale. This framework leverages the same &lt;a href=&quot;https://en.wikipedia.org/wiki/Service-level_agreement&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Service Level Agreement&lt;/a&gt; (SLA) concept that is commonly used by site reliability engineers (SREs) and DevOps professionals to manage microservices reliability in production environments. By integrating with the SLA toolset, MES establishes a standard for measuring and ensuring ML model quality at Uber. Additionally, MES tracks and visualizes the compliance and quality of models, thereby providing a clearer and more comprehensive view of ML initiatives across the organization. See &lt;a href=&quot;https://www.uber.com/blog/enhancing-the-quality-of-machine-learning-systems-at-scale/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;MES blog&lt;/a&gt; for more details.&lt;/p&gt;
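

&lt;p&gt;As a rough illustration of treating model quality like an SLA, the sketch below checks a model’s measurements against per-metric objectives and reports which ones are violated. The metric names, thresholds, and the scoring rule (the fraction of objectives met) are assumptions made for this example; the actual MES definition is described in the MES blog linked above.&lt;/p&gt;

&lt;pre class=&quot;wp-block-code&quot;&gt;&lt;code&gt;
```python
# Hypothetical quality objectives, one per lifecycle dimension named in
# the text. Metric names, directions, and thresholds are assumptions.
OBJECTIVES = {
    'training_auc': ('min', 0.80),
    'prediction_auc': ('min', 0.75),
    'model_age_days': ('max', 30),
    'feature_null_rate': ('max', 0.05),
}


def evaluate_model(measurements):
    '''Return (score, violations): score is the fraction of objectives
    met; a missing measurement counts as a violation.'''
    violations = []
    for metric, (kind, bound) in OBJECTIVES.items():
        value = measurements.get(metric)
        if value is None:
            violations.append(metric)
        elif kind == 'min' and bound > value:
            violations.append(metric)
        elif kind == 'max' and value > bound:
            violations.append(metric)
    score = 1 - len(violations) / len(OBJECTIVES)
    return score, violations


healthy = {'training_auc': 0.91, 'prediction_auc': 0.88,
           'model_age_days': 7, 'feature_null_rate': 0.01}
stale = dict(healthy, model_age_days=90)  # same model, not retrained

healthy_score, healthy_violations = evaluate_model(healthy)
stale_score, stale_violations = evaluate_model(stale)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A model that scores well offline but has not been retrained for 90 days fails the freshness objective here, which is the kind of gap that offline metrics alone would not reveal.&lt;/p&gt;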


&lt;p&gt;To differentiate between high-impact and long-tail use cases, we introduced a well-defined ML project tiering scheme. This scheme consisted of four tiers, with tier 1 being the highest. Tier-1 projects consist of models serving critical functions within core trip and core eater flows, such as ETA calculations, safety, and fraud detection. Only models directly influencing core business operations can qualify for tier-1 status. Conversely, tier-4 projects typically encompass experimental and exploratory use cases with limited direct business impact. This tiering scheme enabled us to make informed decisions regarding resource allocation for ML project outage handling, resource investment, best practice enforcement, and compliance matters, among other considerations. It ensured that the level of attention and resources devoted to each project was commensurate with its relative priority and impact.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-model-iterations-as-code-nbsp&quot;&gt;Model iterations as code&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;In 2020, we launched Project Canvas to enhance ML developer productivity, foster seamless team collaboration, and elevate the overall quality of ML applications. The project aimed to apply software engineering best practices to the ML development lifecycle: it enforced version control, harnessed the power of Docker containers, integrated CI/CD tools, and expedited model development by introducing standardized ML application frameworks. Key components of Canvas included:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;ML Application Framework (MAF)&lt;/strong&gt;: Predefined, but customizable ML workflow templates to provide a code and configuration-driven way for ML development, tailor-made for intricate ML techniques such as DL.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;ML Monorepo&lt;/strong&gt;: A centralized repository that stores all ML development sources of truth as code, with robust version control capabilities.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;ML Dependency Management&lt;/strong&gt;: Provides software dependency management using Bazel and Docker builds. Each ML project has its own customized Docker images. In addition to software dependencies, the model training and serving code is packaged into an immutable Docker image for production model retraining and serving.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;ML Development Environment: &lt;/strong&gt;Provides consistent local development and remote production execution environments so that ML developers can test and debug models locally before running them in a remote production environment, enabling fast model iteration.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;ML Continuous Integration / Continuous Delivery&lt;/strong&gt;: Provides continuous integration against the master branch and, via various tests and validations, automates the deployment to production of ML models landed on the master branch of the ML monorepo.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;ML Artifact Management: &lt;/strong&gt;Provides support for artifact and lineage tracking. Artifacts are ML objects such as models, datasets, and evaluation reports, plus their corresponding metadata. The objects are stored in distributed storage, and the metadata is fully indexed and searchable.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;MA Assistant (MAA):&lt;/strong&gt; Michelangelo’s AutoML solution for automatic model architecture search and feature exploration/pruning.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Teij0cWxZToDYuyzL5LIk0iUWeSr1N3cpFnCpJGVHg8M8Hj8twroleP_flTmf6kSCkgHPXcdcTBg8-wuGN2SoaA0squnLErPVSDf2xBqPV6bz9pFlvRRXQz56LOZfWrCau7nHkfqWeTCmxCoXS5eh_A&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Canvas: Streamlining end-to-end ML developer experience.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Canvas also streamlined ML dependency management by leveraging Bazel and Docker builds. Each ML project had its own custom Docker images, and the model training and serving code was packaged into an immutable Docker image for production model retraining and serving. Moreover, Canvas enabled consistent local and remote development environments so that ML developers could test and debug models locally before running them in a remote production environment, for fast model iteration.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-deep-learning-as-a-first-class-platform-citizen&quot;&gt;Deep Learning as a first-class platform citizen&lt;/h3&gt;


&lt;p&gt;With tree-based models, adopting advanced techniques such as custom loss functions, incremental training, and embeddings posed significant challenges. DL offers more flexibility to address these challenges. Furthermore, DL often excels as datasets grow larger, as it can leverage more data to learn more complex representations.&lt;/p&gt;


&lt;p&gt;Before 2019, most of the DL models at Uber were for &lt;a href=&quot;https://www.uber.com/blog/machine-learning-model-life-cycle-version-control/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;self-driving cars&lt;/a&gt; (e.g., &lt;a href=&quot;https://proceedings.mlr.press/v87/yang18b/yang18b.pdf?uclick_id=91e0edf5-abbe-49f9-b9ee-2a7c598a6a35&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;3D mapping&lt;/a&gt;, perception), computer vision (e.g., &lt;a href=&quot;https://www.uber.com/blog/real-time-id-check/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;driver face recognition&lt;/a&gt;), and natural language processing (e.g., &lt;a href=&quot;https://www.uber.com/en-AU/blog/cota/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;customer obsession&lt;/a&gt;) use cases. However, there were very few deep learning models for the core business, especially for tabular data use cases. One important reason that hindered the adoption of deep learning is the lack of end-to-end deep learning support in Michelangelo 1.0. Different from tree-based models, deep learning models often require much more sophisticated ML platform support, from feature transformation and model training to model serving and GPU resource management. The rest of this section will give an overview of our investment in deep learning support in Michelangelo 2.0.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-feature-transformation&quot;&gt;Feature transformation&lt;/h4&gt;


&lt;p&gt;Michelangelo 1.0 implemented a DSL for feature transformations, such as normalization and bucketization, that are used in both the model training and serving paths. The transformation is bundled together with a model as a &lt;a href=&quot;https://www.uber.com/blog/michelangelo-machine-learning-model-representation/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Spark PipelineModel&lt;/a&gt; so that it &lt;a href=&quot;https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;eliminates a source of training-serving skew&lt;/a&gt;. However, the DSL transformation is implemented as a Spark transformer and cannot run on GPUs, which DL models need for low-latency serving. In Michelangelo 2.0, we implemented a new DL-native transformation solution that allows users to transform their features using Keras or PyTorch operators and gives advanced users the flexibility to define customized feature transformations in Python code. Similar to &lt;a href=&quot;https://blog.research.google/2017/02/preprocessing-for-machine-learning-with.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;TensorFlow Transform&lt;/a&gt;, the transform graph is combined with the model inference graph in either TensorFlow or TorchScript for low-latency serving on GPUs.&amp;nbsp;&lt;/p&gt;
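
&lt;p&gt;As a rough illustration of why bundling the transformation with the model helps, the sketch below packages a normalization step together with a toy one-feature linear model, so the training and serving paths call the same transformation code. The class and field names (&lt;code&gt;BundledModel&lt;/code&gt;, &lt;code&gt;mean&lt;/code&gt;, &lt;code&gt;std&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, &lt;code&gt;bias&lt;/code&gt;) are hypothetical and are not Michelangelo APIs.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BundledModel:
    """Toy model that carries its own feature transformation (hypothetical)."""

    mean: float    # normalization statistics computed once at training time
    std: float
    weight: float  # parameters of a one-feature linear model
    bias: float

    def transform(self, raw: float) -> float:
        # The single transformation shared by the training and serving paths.
        return (raw - self.mean) / self.std

    def predict(self, raw: float) -> float:
        # Callers pass raw features; they never re-implement normalization.
        return self.weight * self.transform(raw) + self.bias


model = BundledModel(mean=10.0, std=2.0, weight=3.0, bias=1.0)
print(model.predict(14.0))  # (14 - 10) / 2 = 2.0, then 3.0 * 2.0 + 1.0 = 7.0
```

&lt;p&gt;Because callers only ever pass raw features, there is no second copy of the normalization logic that could drift away from the one used in training.&lt;/p&gt;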


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-model-training&quot;&gt;Model training&lt;/h4&gt;


&lt;p&gt;Michelangelo 2.0 supports both TensorFlow and PyTorch frameworks for large-scale DL model training by leveraging our distributed training framework Horovod. In addition, we have made the following improvements for better scalability, fault tolerance, and efficiency.&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Distributed GPU training and tuning on Ray.&lt;/strong&gt; (&lt;a href=&quot;https://www.youtube.com/watch?v=gMT_ONmI9RM&amp;amp;list=PLzTswPQNepXmLUiL4F_1VHrPcCz1OeILw&amp;amp;index=47&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;). Historically, model training in Michelangelo ran on top of Spark. However, DL presented new challenges on Spark, such as a lack of GPU executors, mini-batch shuffle, and all-reduce. &lt;a href=&quot;https://horovod.readthedocs.io/en/stable/spark_include.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Horovod on Spark&lt;/a&gt; wrapped DL training in Spark estimator syntax and provided easy integration with the training pipeline. However, it also introduced many operational complexities, like separate cluster jobs, lifecycle management, and failure scenarios. In Michelangelo 2.0, we have replaced Spark-based XGBoost and DL trainers with Ray-based trainers for better scalability and reliability. We also switched from an in-house hyperparameter tuning solution to Ray Tune.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Elastic Horovod with auto-scaling and fault tolerance&lt;/strong&gt;. (&lt;a href=&quot;https://www.uber.com/blog/horovod-ray/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;).&amp;nbsp; Elastic Horovod allows distributed training that scales the number of workers dynamically throughout the training process. Jobs can now continue training with minimal interruption when machines come and go from the job.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Resource efficient incremental training&lt;/strong&gt;. One advantage of DL is the ability to incrementally train a model with additional datasets without training from scratch. This significantly improves the resource efficiency for production retrains as well as increases the dataset coverage for better model accuracy.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Declarative DL training pipeline in Canvas&lt;/strong&gt;. DL models often require custom model code, loss functions, and so on. In Canvas, we designed the training pipelines to be declarative as well as extensible, so that users can plug in their custom model code such as estimators, optimizers, and loss functions, as shown in Figure 9.&lt;/li&gt;
&lt;/ul&gt;
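
&lt;p&gt;The incremental training idea from the list above can be sketched in a few lines: a retrain starts from the previously learned weights and only sees the new data, instead of starting from scratch on everything. This is a minimal pure-Python sketch on a one-parameter linear model; the function name &lt;code&gt;sgd_fit&lt;/code&gt; and the toy datasets are assumptions made for illustration, not Michelangelo code.&lt;/p&gt;

```python
def sgd_fit(data, weight=0.0, lr=0.1, epochs=50):
    """Fit y ~ weight * x with plain SGD, starting from the given weight."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (weight * x - y) * x  # derivative of squared error
            weight -= lr * grad
    return weight


# Toy datasets that both follow y = 2 * x (illustrative values only).
old_data = [(1.0, 2.0), (2.0, 4.0)]
new_data = [(1.5, 3.0), (0.5, 1.0)]

# Initial training from scratch on the original dataset.
w_initial = sgd_fit(old_data)

# Incremental retrain: continue from the previous weight on new data only,
# with far fewer epochs than a from-scratch run would need.
w_updated = sgd_fit(new_data, weight=w_initial, epochs=5)

print(round(w_initial, 6), round(w_updated, 6))
```

&lt;p&gt;In a real DL setting the same pattern would typically mean loading a saved checkpoint before continuing training, which is where the resource savings for production retrains come from.&lt;/p&gt;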


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/wq4mV9-lSpxnYgUrC91PzO7JaMnSawey2wir-Ai0odqHzIaTkMA8erWscsRHZl22fpCQvzipR86Tkw_jgzFs3jokhk86Zqnc2OfeY882_3ChONlnLYAtEQpMCi8u6lMAw9PS8oOD-0WmtSsbEvs1bZA&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Example training pipeline in Canvas for a deep learning model.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-model-serving&quot;&gt;Model serving&lt;/h4&gt;


&lt;p&gt;Most of Uber’s tier-1 ML projects that were adopting DL, such as maps ETA and Eats homefeed ranking, are very sensitive to serving latency. In addition, model serving has to support both the TensorFlow and PyTorch DL frameworks while abstracting the framework-level details away from our users. Historically, &lt;a href=&quot;https://github.com/uber/neuropod&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Neuropod&lt;/a&gt; has been the default DL serving engine in Michelangelo. However, it lacks continuous community support and is being deprecated. In Michelangelo 2.0, we integrated &lt;a href=&quot;https://github.com/triton-inference-server/server&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Triton&lt;/a&gt; as a next-generation model serving engine in our Online Prediction Service (OPS) as a new Spark transformer. Triton is an open-source inference server developed by Nvidia that supports multiple frameworks including TensorFlow, PyTorch, Python, and XGBoost, and it is highly optimized for low-latency serving on GPUs.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-gpu-resource-management&quot;&gt;GPU resource management&lt;/h4&gt;


&lt;p&gt;Both DL training and serving require large-scale GPU resources. Uber today manages more than five thousand GPUs across both on-premise data centers and Cloud providers like OCI and GCP. Those GPUs are spread across multiple regions and many zones and clusters. The Compute clusters are in the process of migrating from &lt;a href=&quot;https://kccna18.sched.com/event/GrTx/peloton-a-unified-scheduler-for-web-scale-workloads-on-mesos-kubernetes-min-cai-nitin-bahadur-uber&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Peloton / Mesos&lt;/a&gt; to &lt;a href=&quot;https://kccncna19.sched.com/event/Uaad/kubernetizing-big-data-and-ml-workloads-at-uber-mayank-bansal-min-cai-uber&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Kubernetes&lt;/a&gt;. To maximize resource utilization, Uber has invested in elastic CPU and GPU resource sharing across different teams so that each team can opportunistically use other teams’ idle resources. On top of the Compute clusters, we built a job federation layer across multiple Kubernetes clusters to hide the region, zone, and cluster details for better job portability and easy Cloud migration. The job federation layer leverages the same design pattern as Kubernetes operators and is implemented as a job CRD controller in Michelangelo’s unified API framework as shown in Figure 7. Currently, the job controller supports both Spark and Ray jobs.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;With the end-to-end support for DL in Michelangelo 2.0, Uber has made a significant improvement in DL adoption across different business lines. In the last few years, the DL adoption in tier-1 projects has increased from almost zero to more than 60%. For example, the &lt;a href=&quot;https://www.uber.com/blog/deepeta-how-uber-predicts-arrival-times/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;DeepETA&lt;/a&gt; model has more than one hundred million parameters and was trained over one billion trips.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-ma-studio-one-unified-web-ui-tool-for-everything-ml-uber&quot;&gt;MA Studio – One unified Web UI tool for everything ML @ Uber&lt;/h3&gt;


&lt;p&gt;To address the challenges in the ML developer experience mentioned above, Michelangelo (MA) Studio was developed to unify existing Michelangelo offerings and newly built platform capabilities into one single user journey, to provide a seamless user experience, with a completely redesigned UI and UX. MA Studio provides a simplified user flow covering every step of the ML journey from feature/data prep, model training, deployment, all the way to production performance monitoring, and model CI/CD, all in one place, to improve ML developer productivity.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/23NZpcegPdWpU-kpA3971k3KKYGxjh-AMVlljf26FeKV1eAga8tCkU3vuZCgWOwlDYwDbkWXFJ0wDX-8DDsXKBGT1-ucOZbA375Pg7aectkscLfD01kJbB0EEp5vX1LSQ4ba6NhbS_N1fpX89-Dit9Y&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: MA Studio project landing page covering the end-to-end ML development life-cycle.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;MA Studio boasts an array of additional advantages:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Version control and code review&lt;/strong&gt;: All ML-related code and configurations are version controlled, and all changes go through the code review process, including models created from UI.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Modernized model deployment stack&lt;/strong&gt;: Safe and incremental zonal rollout, automatic rollback triggers, and production runtime validation.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Built-in and unified ML observability toolkit&lt;/strong&gt;: Model performance monitoring, feature monitoring, online/offline feature consistency check, and MES.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Unified ML entity lifecycle management&lt;/strong&gt;: Users benefit from an intuitive UI and well-structured user flows for managing all ML entities, from models and pipelines to datasets and reports.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Enhanced debugging capabilities&lt;/strong&gt;: MA Studio amplifies debugging capabilities and accelerates recovery for ML pipeline failures.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/v2Dr41DHAjWT4JcMTHMRZ-EFv51edbx8RPw4BJ0MsqrnNAlI0lKoZJ-ZGRvUAZ5ez9VjNeKizUBwq-Hlxzp3uVs9i4NahPqDmgI9bOrJyBM_MC1NZjz7FD2Q46MJ_H32lP6VN3zqgEeeghht_r3mdrk&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 11: MA Studio and Canvas for standard and advanced ML use cases.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;For any ML need at Uber, you only need two tools: Canvas and MA Studio. MA Studio’s user-friendly UI covers standard ML workflows, facilitating tasks like XGB model training and the standard model retrain process without the need to write any code. When dealing with more sophisticated scenarios, such as DL training or customized retraining flows, Canvas is the go-to tool. Regardless of whether you’ve constructed the pipelines through Canvas or the UI, you can seamlessly execute and manage these pipelines, deploy trained models, and monitor and debug model performance, all from the MA Studio UI. Significantly, all model code and pertinent configurations are now subject to version control, and any alterations undergo a meticulous code review process, which drastically improves the quality of ML applications in production at Uber.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-generative-ai-2023-now&quot;&gt;Generative AI (2023 – now)&lt;/h2&gt;


&lt;p&gt;Recent advancements in generative AI, particularly in the realm of large language models (LLMs), have the capacity to radically transform our interactions with machines via natural language. Several teams at Uber are actively investigating the use of LLMs to boost internal productivity with assistants, streamline business operations with automation, and improve the end-user product with a magical user experience, while addressing &lt;a href=&quot;https://en.wikipedia.org/wiki/Wikipedia:Large_language_models&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;issues associated with the use of LLMs&lt;/a&gt;. Figure 12 shows the potential value of those three categories of generative AI use cases at Uber. &lt;a href=&quot;https://www.uber.com/blog/the-transformative-power-of-generative-ai/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Learn more&lt;/a&gt;.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/ymra4p74-37pbRET_WeL_zvyEI6I9LOg8RICp0dVRQYBEYEllRz9psRc7omJcLe6ohyJrEztIFvOK7egSKiev_hgSpI8L4CmZ_pntoRdOOaJT9DAAgWWOMonlGcLDZl0oC3OpEs21Te5R-hr7KLdG-I&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 12: Three categories of generative AI use cases at Uber.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;For developing generative AI applications, teams need access to both external LLMs through third-party APIs and internally hosted open-source LLMs. This is because external models have superior performance in tasks requiring general knowledge and intricate reasoning, while by leveraging the wealth of proprietary data, we can fine-tune open-source models to achieve high levels of accuracy and performance on Uber-centric tasks, at a fraction of the cost and with lower latency. These fine-tuned open-source models are hosted in-house.&lt;/p&gt;


&lt;p&gt;Hence, we developed the Gen AI Gateway to provide a unified interface for teams to access both external LLMs and in-house hosted LLMs in a manner adhering to security standards and safeguarding privacy. Some of the Gen AI Gateway capabilities include:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Logging and auditing:&lt;/strong&gt; Ensuring comprehensive tracking and accountability.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Cost guardrails and attribution:&lt;/strong&gt; Managing expenses, attributing usage, and alerting on overuse.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Safety &amp;amp; Policy Guardrails:&lt;/strong&gt; Ensuring LLM usage complies with our internal guidelines.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Personally identifiable information (PII) Redaction:&lt;/strong&gt; Identifying and categorizing personal data, and redacting it before sending the input to external LLMs.&lt;/li&gt;
&lt;/ul&gt;
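
&lt;p&gt;To make the PII redaction step in the list above more concrete, here is a minimal sketch that detects two categories of personal data with regular expressions and replaces each match with its category label before a prompt would leave the gateway. The patterns, the &lt;code&gt;PII_PATTERNS&lt;/code&gt; table, and the &lt;code&gt;redact&lt;/code&gt; function are assumptions made for illustration; the source does not describe how the Gen AI Gateway detects PII.&lt;/p&gt;

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(prompt: str) -> str:
    """Replace each detected PII span with its category label."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt


print(redact("Reach me at jane.doe@example.com or 415-555-0100."))
# prints: Reach me at [EMAIL] or [PHONE].
```

&lt;p&gt;Keeping the category label in place of the raw value preserves enough context for the LLM to produce a useful answer while the personal data itself never reaches the external provider.&lt;/p&gt;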


&lt;p&gt;To accelerate the development of generative AI applications at Uber, we have extended Michelangelo to support the full LLMOps capabilities such as fine-tuning data preparation, prompt engineering, LLM fine-tuning and evaluation, LLM deployment and serving, and production performance monitoring. Some of the key components include:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;The Model Catalog features a collection of pre-built and ready-to-use LLMs, accessible via third-party APIs (e.g., GPT4, Google PaLM) or in-house hosted open-source LLMs on Michelangelo (e.g., Llama2). Users can explore extensive information about these LLMs within the catalog and initiate various workflows. This includes fine-tuning models in MA Studio or deploying models to online serving environments. The catalog offers a wide selection of pre-trained models, enhancing the platform’s versatility.&lt;/li&gt;


&lt;li&gt;LLM Evaluation Framework enables users to compare LLMs across different approaches (e.g., in-house vs. third-party with prompts vs. third-party fine-tuned), and also to evaluate improvements across iterations of prompts and models.&lt;/li&gt;


&lt;li&gt;Prompt Engineering Toolkit allows users to create and test prompts, validate the output, and save prompt templates in a centralized repository, with full version control and code review process.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;To enable cost-effective LLM fine-tuning and low-latency LLM serving, we’ve implemented several significant enhancements to Michelangelo training and serving stack:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Integrating with Hugging Face&lt;/strong&gt;: We implemented a Ray-based trainer for LLMs, utilizing the open source LLMs available on the &lt;a href=&quot;https://huggingface.co/models&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Hugging Face Hub&lt;/a&gt; and associated libraries like &lt;a href=&quot;https://huggingface.co/docs/peft/index&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;PEFT&lt;/a&gt;. Fine-tuned LLMs and associated metadata are stored in Uber’s model repository, which is accessible from the model inference infrastructure.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Enabling Model Parallelism&lt;/strong&gt;: Michelangelo previously did not support model parallelism for training DL models. This limitation constrained the size of trainable models to the available GPU memory, allowing, for instance, a theoretical maximum of 4 billion parameters on a 16 GB GPU. In the updated LLM training framework, we’ve integrated DeepSpeed to enable model parallelism. This removes the single-GPU memory limitation and allows for training larger DL models.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Elastic GPU Resource Management:&lt;/strong&gt; We’ve provisioned Ray clusters on GPUs with the Michelangelo job controller. This enables the training of LLMs on the most powerful GPUs available on-premises. Furthermore, this integration sets the stage for future extensions using cloud GPUs, enhancing scalability and flexibility.&lt;/li&gt;
&lt;/ul&gt;
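
&lt;p&gt;The figure of 4 billion parameters on a 16 GB GPU quoted in the list above is consistent with simple arithmetic if each parameter is stored as a 4-byte (fp32) value and a gigabyte is taken as 10&lt;sup&gt;9&lt;/sup&gt; bytes. Both of those are assumptions on our part, and the helper name &lt;code&gt;max_parameters&lt;/code&gt; is hypothetical. A small sketch of the calculation:&lt;/p&gt;

```python
def max_parameters(gpu_memory_gb: float, bytes_per_param: int = 4) -> int:
    """Upper bound on parameters whose weights alone fit in GPU memory.

    Assumes decimal gigabytes and ignores gradients, optimizer state, and
    activations, all of which reduce the practical limit considerably.
    """
    return int(gpu_memory_gb * 1e9) // bytes_per_param


print(max_parameters(16))     # 4000000000 with 4-byte (fp32) weights
print(max_parameters(16, 2))  # 8000000000 with 2-byte (fp16) weights
```

&lt;p&gt;This is why the limit is described as theoretical: training also needs memory for gradients, optimizer state, and activations, so the practical ceiling on a single GPU is much lower, and model parallelism spreads that footprint across several devices.&lt;/p&gt;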


&lt;p&gt;Leveraging these platform capabilities offered by Michelangelo, teams at Uber are fervently developing LLM-powered applications. We look forward to sharing our advancements in the productionization of LLMs soon.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;ML has evolved into a fundamental driver across critical business areas at Uber. This blog delves into the eight-year transformative journey of Uber’s ML platform, Michelangelo, emphasizing significant enhancements in the ML developer experience. This journey unfolded in three distinct phases: the foundational phase of predictive ML for tabular data from 2016 to 2019, a progressive shift to deep learning between 2019 and 2023, and the recent venture into generative AI starting in 2023.&lt;/p&gt;


&lt;p&gt;There have been critical lessons learned for building a large-scale, end-to-end ML platform at such a complexity level, supporting ML use cases at Uber’s scale. Key takeaways include:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Instituting a centralized ML platform, as opposed to having individual product teams build their own ML infrastructure, can significantly enhance ML development efficiency within a medium- or large-sized company. The ideal ML organizational structure comprises a centralized ML platform team, complemented by dedicated data scientists and ML engineers embedded within each product team.&lt;/li&gt;


&lt;li&gt;Providing both UI-based and code/configuration-driven user flows in a unified manner is crucial for delivering a seamless ML dev experience, especially for large organizations where ML developers’ preferences of dev tools largely vary across different cohorts.&lt;/li&gt;


&lt;li&gt;The strategy of offering a high-level abstraction layer with predefined workflow templates and configurations for most users, while allowing advanced power users to directly access low-level infrastructure components to build customized pipelines and templates has proven effective.&lt;/li&gt;


&lt;li&gt;Designing the platform architecture in a modular manner, so that each component can be built with a plug-and-play approach, allows rapid adoption of state-of-the-art technologies from open source, third-party vendors, or in-house development.&lt;/li&gt;


&lt;li&gt;While Deep Learning proves powerful in solving complex ML problems, the challenge lies in supporting large-scale DL infrastructure and maintaining the performance of these models. Use DL only when its advantages align with the specific requirements. Uber’s experience has shown that in several cases, XGBoost outperforms DL in both performance and cost.&lt;/li&gt;


&lt;li&gt;Not all ML projects are created equal. Having a clear ML tiering system can effectively guide the allocation of resources and support.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Michelangelo’s mission is to provide Uber’s ML developers with best-in-class ML capabilities and tools so that they can rapidly build, deploy, and iterate high-quality ML applications at scale. As the AI platform team, we provide in-depth ML expertise, drive standardization and innovation of ML technologies, build trust and collaborate with our partner teams, and cultivate a vibrant ML culture, so that ML is embraced and leveraged to its fullest potential. We are unwavering in our commitment to this mission, and we are incredibly enthusiastic about the promising future ahead of us.&lt;/p&gt;


&lt;p&gt;If you are interested in joining us on this exciting venture, please check our &lt;a href=&quot;https://www.uber.com/us/en/careers/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;job website&lt;/a&gt; for openings. Additionally, we look forward to collaborating with other teams in the AI/ML space to build a strong ML community and collectively accelerate the advancement of AI/ML technologies.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Apache®, Apache Spark, Spark, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Horovod and Kubernetes are either registered trademarks or trademarks of the Linux Foundation® in the United States and/or other countries. No endorsement by the Linux Foundation® is implied by the use of these marks.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Ray is either a registered trademark or trademark of Anyscale, Inc in the United States and/or other countries.&lt;/p&gt;
</description><link>https://www.uber.com/blog/from-predictive-to-generative-ai/</link><guid isPermaLink="false">https://www.uber.com/blog/from-predictive-to-generative-ai/</guid><pubDate>Thu, 02 May 2024 09:46:07 GMT</pubDate><author>Uber</author><category>Engineering</category><category>AI</category><category>Data / ML</category></item><item><title>DragonCrawl: Generative AI for High-quality Mobile Testing</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;The Developer Platform team at Uber is consistently developing new and innovative ideas to enhance the developer’s experience and strengthen the quality of our apps. Quality and testing go hand in hand, and in 2023 we took on a new and exciting challenge to change how we test our mobile applications, with a focus on machine learning (ML). Specifically, we are training models to test our applications just like real humans would.&lt;/p&gt;


&lt;p&gt;Mobile testing remains an unresolved challenge, especially at our scale, encompassing thousands of developers and over 3,000 simultaneous experiments. Manual testing is usually carried out, but with high overhead, it cannot be done extensively for every minor code alteration. While test scripts can offer better scalability, they are also not immune to frequent disruptions caused by minor updates, such as new pop-ups and changes in buttons. All of these changes, no matter how minor, require recurring manual updates to the test scripts. Consequently, engineers working on this invest 30–40% of their time on maintenance. Furthermore, the substantial maintenance costs of these tests significantly hinder their adaptability and reusability across diverse cities and languages (imagine having to hire manual testers or mobile engineers for the 50+ languages that we operate in!), which makes it really difficult for us to efficiently scale testing and ensure Uber operates with high quality globally.&lt;/p&gt;


&lt;p&gt;To solve these problems, we created DragonCrawl, a system that uses large language models (LLMs) to execute mobile tests with the intuition of a human. It decides what actions to take based on the screen it sees and independently adapts to UI changes, just like a real human would.&lt;/p&gt;


&lt;p&gt;Of course, innovations also come with new bugs, challenges, and setbacks, but it was worth it. We did not give up on our mission to bring code-free testing to the Uber apps, and towards the end of 2023, we launched DragonCrawl. Since then, we have been testing some of our most important flows with high stability, across different cities and languages, and without having to maintain them. Scaling mobile testing and ensuring quality across so many languages and cities went from humanly impossible to possible with the help of DragonCrawl. In the three months since launching DragonCrawl, we blocked ten high-priority bugs from impacting customers while saving thousands of developer hours and reducing test maintenance costs.&lt;/p&gt;


&lt;p&gt;This blog gives a quick introduction to large language models, then dives deep into our architecture, challenges, and results. We will close by touching a little on what is in store for DragonCrawl.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-what-are-large-language-models&quot;&gt;&lt;strong&gt;What Are Large Language Models?&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Large language models (LLMs) are a transformative development in the field of artificial intelligence, specifically within natural language processing (NLP). Essentially, LLMs are advanced models designed to understand, interpret, generate, and engage with human language in a way that is both meaningful and contextually relevant. These models are trained on vast datasets consisting of text from a wide array of sources, enabling them to learn the nuances, idioms, and syntax of natural language. One of the most critical aspects of LLMs is their ability to generate coherent and contextually relevant text based on input prompts. This capability is not limited to simple text generation; it extends to complex tasks like answering questions, translating languages, summarizing documents, and even creating content like poems or code. The underlying technology of LLMs typically involves neural network architectures, such as transformers, which are adept at handling sequential data and can capture long-range dependencies in text. This makes them particularly effective for tasks that require an understanding of context over longer stretches of text. Modern large language models are trained on many languages, which means that we can use them and get reasonable outputs in other languages.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-why-did-we-choose-large-language-models-for-mobile-testing&quot;&gt;&lt;strong&gt;Why Did We Choose Large Language Models for Mobile Testing?&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;We realized that we could formulate mobile testing as a language generation problem. At the end of the day, mobile tests are sequences of steps, which may encounter obstacles and/or course corrections due to changes in apps, devices, etc. To successfully get through those obstacles and complete a test, we need context and goals, and the simplest way for us to provide these to an automated system is through natural language. We provide DragonCrawl with the text representation of the current screen, along with the goals of the test we want to execute, and then we ask it what we should do. Given the context, it chooses what UI element to interact with, and how to interact with it. And because these models have been pre-trained and proven resilient in languages besides English, we can ask DragonCrawl these questions with text in other languages.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;325&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1-1024x325.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086897&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-1.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 1: High-level overview of DragonCrawl. The image of the Dragon was generated by OpenAI’s DALL·E&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-modeling&quot;&gt;&lt;strong&gt;Modeling&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;MPNet, or “Masked and Permuted Pre-training for Language Understanding,” is an advanced approach in natural language processing that combines masking and permuting strategies in pre-training language models. It works by masking some words and altering the order of others in the input text, enabling the model to learn not only the prediction of masked words, but also the broader context and syntax of the language. This dual-task approach allows MPNet to gain a deeper understanding of language semantics, surpassing traditional models that focus solely on masking or permutation. Once trained on large datasets, MPNet can be fine-tuned for a variety of NLP tasks, offering enhanced performance in understanding and generating language due to its comprehensive grasp of both word-level and sentence-level contexts.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/GXXjoxKzdTjJWERTI0nBZMaQVMb4hnarA5_B-c4XoK_GjBwkPn2VeKaM39v62RRcbik-3erep9vkXF8we1npSfj6PK3GxtTGhLRzbfFNtzYe6dpY_g1294hlsgzoLhtTPK1NpSQy2w3ca6BGYbPBpPw&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 2: Transformer layers in DragonCrawl’s model.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-evaluation&quot;&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;In the vast and intricate landscape of language, words are not mere strings of letters; they are rich with meaning, context, and subtle nuances, and that is where embeddings come into play. Embeddings are like multi-dimensional maps, where each word finds its unique place, not just based on its own identity but also in relation to the words around it. By obtaining high-quality embeddings, we ensure that our model perceives language not as a random assortment of words, but as a coherent, interconnected fabric of ideas and meanings.&lt;/p&gt;


&lt;p&gt;We framed the evaluation as a retrieval task because we ultimately want DragonCrawl to mimic the way humans retrieve information and make decisions. Just as we put effort into choosing the right book in a library, DragonCrawl makes an effort to choose the right action to accomplish its goals. The precision@N metric, akin to finding the right book when you can only take a handful of books home, shows us the model’s ability to not just retrieve, but to pinpoint the best option in a sea of possibilities. By measuring and improving embedding quality through precision@N, we ensure that DragonCrawl does not just understand language, but comprehends it with a discerning, almost human-like acuity.&lt;/p&gt;
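

&lt;p&gt;As a rough illustration of the metric, the Python sketch below ranks candidate actions by cosine similarity to a query embedding and computes precision@N using the standard definition (the fraction of the top N results that are relevant). The toy 3-dimensional vectors, the action names, and the set of relevant actions are invented for the example, and the exact evaluation protocol used for DragonCrawl may differ.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precision_at_n(query, candidates, relevant, n):
    """Fraction of the n most similar candidates that are in the relevant set."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return sum(1 for name in ranked[:n] if name in relevant) / n


# Toy embeddings (assumption): real embeddings would have hundreds of dimensions.
screen_and_goal = [1.0, 0.2, 0.0]
actions = {
    "tap_request_trip": [0.9, 0.3, 0.0],
    "tap_schedule_ride": [0.7, 0.1, 0.6],
    "scroll_feed": [0.0, 1.0, 0.1],
    "tap_confirm_pickup": [0.8, 0.1, 0.1],
}
relevant = {"tap_request_trip", "tap_confirm_pickup"}

for n in (1, 2, 3):
    print(n, round(precision_at_n(screen_and_goal, actions, relevant, n), 4))
```

&lt;p&gt;In this toy setup the two relevant actions rank first and second, so precision@1 and precision@2 are both 1.0, while precision@3 drops to 2/3 because the third-ranked action is not relevant.&lt;/p&gt;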


&lt;p&gt;To choose the right model for DragonCrawl, we tuned and evaluated several models. The table below summarizes our findings:&lt;/p&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Precision@1&lt;/td&gt;&lt;td&gt;Precision@2&lt;/td&gt;&lt;td&gt;Precision@3&lt;/td&gt;&lt;td&gt;Parameters&lt;/td&gt;&lt;td&gt;Embedding size&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPNet (base)&lt;/td&gt;&lt;td&gt;0.9723&lt;/td&gt;&lt;td&gt;0.9623&lt;/td&gt;&lt;td&gt;0.9423&lt;/td&gt;&lt;td&gt;110M&lt;/td&gt;&lt;td&gt;768&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPNet (large)&lt;/td&gt;&lt;td&gt;0.9726&lt;/td&gt;&lt;td&gt;0.9527&lt;/td&gt;&lt;td&gt;0.9441&lt;/td&gt;&lt;td&gt;340M&lt;/td&gt;&lt;td&gt;768&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;T5&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.9547&lt;/td&gt;&lt;td&gt;0.9338&lt;/td&gt;&lt;td&gt;11B&lt;/td&gt;&lt;td&gt;3584&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;RoBERTa&lt;/td&gt;&lt;td&gt;0.9689&lt;/td&gt;&lt;td&gt;0.9512&lt;/td&gt;&lt;td&gt;0.9464&lt;/td&gt;&lt;td&gt;82M&lt;/td&gt;&lt;td&gt;768&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;T5 (not tuned)&lt;/td&gt;&lt;td&gt;0.9231&lt;/td&gt;&lt;td&gt;0.9213&lt;/td&gt;&lt;td&gt;0.9213&lt;/td&gt;&lt;td&gt;11B&lt;/td&gt;&lt;td&gt;3584&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;p&gt;As can be seen, embedding quality is high across all models, but latency varies significantly. The fastest model turned out to be the base MPNet, with ~110M parameters (which technically makes it a small/medium language model). Furthermore, its embedding size is 768 dimensions, which would make it less expensive for other downstream systems to use our embeddings in the future.&lt;/p&gt;


&lt;p&gt;On a different note, given those numbers, one could argue that we did not even need tuning, but we chose to tune anyway. The untuned T5-11B gave us good precision@1, 2, and 3, but given how frequently we plan to use this model, and how much the data varies because the Uber app changes constantly, we would quickly feel the loss of the extra points of precision that an untuned model does not provide.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-challenges&quot;&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;There were several challenges we needed to overcome during development. Some of them were specific to Uber, and some of them were related to the weaknesses of large language models.&lt;/p&gt;


&lt;p&gt;An issue we faced early on while building DragonCrawl’s trip request and completion flows was setting up the GPS location of DragonCrawl’s (fake) riders and drivers. Uber’s matching algorithms, which are in charge of connecting riders to drivers, are very complex and built for scale, and take into account variables such as time of day, current traffic conditions, and future demand. However, when testing with DragonCrawl, there would only be 1 rider and 1 driver in a particular city at any given time, which is not what Uber’s backend expects. Thus, there were times when riders and drivers would not be matched, even if they were right next to each other. To solve this problem, we had to tune the GPS locations of both riders and drivers, so that we would get favorable results. This is very specific to Uber and/or ride-hailing and food delivery.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-adversarial-cases&quot;&gt;&lt;strong&gt;Adversarial Cases&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;When testing Uber’s trip flow, we saw DragonCrawl do some quirky things. In some cities, instead of requesting regular trips, it requested scheduled trips. What puzzled us the most, after debugging our artifacts carefully, was that DragonCrawl actually had all the conditions to make the right choice (i.e., touch on “Choose UberX”), but instead, it would choose a scheduled ride. Then, it would go through the UI to open a calendar and choose the date and time of the scheduled ride, which is impressive–but we digress.&lt;/p&gt;


&lt;p&gt;The example above is called an adversarial case. The concept of adversarial cases, or adversarial samples, was popularized a few years ago when researchers saw that it is possible to confuse a model in cases that should not be confusing at all. The image below shows how adding a little bit of noise to an image of a panda, which results in pretty much the same panda, can confuse a machine-learning model to the point that it thinks it is looking at a gibbon (but we all know pandas do not look like gibbons).&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/xY5iJU7vEjDJEkoUq7KFH4QIEIm9n1ZMqAhHMeBFwh35nF-jMcz2FYbRRqvoMyWoQyQDsWbMqA7-6IlIcN-LApmJMB2W_HIuXrDCV27xpLZn6BGj3XePo29RGs1jVDXEPEwxzdByEgrbNzUlcXuZiPk&quot; alt=&quot;&quot; style=&quot;aspect-ratio:3.6036036036036037;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 3: Example of how imperceptible noise can fool a Machine Learning model. This is not a hypothetical example, &lt;a href=&quot;https://www.technologyreview.com/2019/05/19/135299/how-we-might-protect-ourselves-from-malicious-ai/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;take a look&lt;/a&gt;.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;While it is impossible to fully rid a model of weaknesses to adversarial cases, we plan to do adversarial training and validation to reduce the risk.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-steering-dragoncrawl-to-more-optimal-paths&quot;&gt;&lt;strong&gt;Steering DragonCrawl to More Optimal Paths&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;In our offline tests of Uber’s trip flow, we saw that DragonCrawl could always request or complete a trip, but sometimes it would take too long to do so. There were times when a new pop-up would make DragonCrawl add another passenger or book a trip for someone else, which would then load several screens with options and settings that DragonCrawl had to figure out. It would figure them out, but because several steps were required (rather than just 1 or 2 new steps), it would take much longer. Since our goal is to run DragonCrawl on every Android code change, we cannot afford those longer routes, so we had to train DragonCrawl to say no to or skip certain things and to say yes to or confirm others.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-hallucinations&quot;&gt;&lt;strong&gt;Hallucinations&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;Finally, a topic of much discussion is hallucinations in large language models. In the words of Yann LeCun, VP and Chief AI Scientist at Meta, large language models “kind of spew nonsense sometimes” (see &lt;a href=&quot;https://futurism.com/the-byte/yann-lecun-large-language-models-fad&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;article&lt;/a&gt;). Indeed, we need to be mindful that we cannot fully trust large language models, or at least not without guardrails. In this section, we will discuss the guardrails we put in place to prevent hallucinations from harming DragonCrawl.&lt;/p&gt;


&lt;p&gt;First, one of DragonCrawl’s biggest strengths is the fact that it uses a smaller model. The size of our model is 110M parameters, which is roughly 3 orders of magnitude smaller than the popular GPT-3.5/4. This greatly reduces the variability and complexity of the answers it can output. In other words, model size limits model nonsense.&lt;/p&gt;


&lt;p&gt;Even so, we still received some invalid outputs, and here is how we handled them:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Partially invalid actions:&lt;/strong&gt; The model may return a response where some of the information is incorrect. For instance, it may return “touch” for a UI element that is swipeable; or it may output the right action and right location, but confuse the name of the UI element (e.g., request_trip_button). In either case, since we can read the valid actions, the correct UI element names, etc. from the emulator, we can resolve such confusions. The emulator provides us with the ground truth we can use to find the right actions given the name of a UI element; the correct location given the UI element name; and even the right UI element name, given the right location.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Completely invalid actions:&lt;/strong&gt; For completely invalid actions, we append the previously suggested action to the prompt and call out that it is invalid. This results in a different action being suggested by the model. If invalid actions persist, we backtrack and retry the suggestions from the model.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Loops/repeated actions:&lt;/strong&gt; We may end up in loops (e.g., scrolling up and down in a feed) or repeated actions (e.g., repeated waits). We handle this case by keeping track of the actions already taken in the specific sequence, and even screenshots, so it is easy to figure out if we are in a loop. Also, since DragonCrawl outputs a list of suggestions, we can just try other suggested actions.&lt;/li&gt;
&lt;/ol&gt;
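

&lt;p&gt;The three guardrails above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Action type, the GROUND_TRUTH mapping (standing in for what would be read from the emulator), the repair and choose helpers, and the max_repeats threshold are all hypothetical names, not DragonCrawl’s actual implementation.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # e.g. "touch" or "swipe"
    element: str  # UI element name


# Hypothetical ground truth, as if read from the emulator: element name to allowed action kinds.
GROUND_TRUTH = {
    "request_trip_button": {"touch"},
    "home_feed": {"swipe"},
}


def repair(action, ground_truth):
    """Fix a partially invalid action; return None when the element is unknown."""
    allowed = ground_truth.get(action.element)
    if allowed is None:
        return None  # completely invalid
    if action.kind in allowed:
        return action
    return Action(sorted(allowed)[0], action.element)  # right element, wrong kind


def choose(suggestions, history, ground_truth, max_repeats=2):
    """Return (action, rejected): the first usable suggestion and the invalid ones seen on the way."""
    rejected = []
    for suggestion in suggestions:
        fixed = repair(suggestion, ground_truth)
        if fixed is None:
            rejected.append(suggestion)  # could be appended to the next prompt as invalid
            continue
        if history.count(fixed) >= max_repeats:
            continue  # likely a loop: move on to the next suggestion
        return fixed, rejected
    return None, rejected  # nothing usable: the caller would backtrack and retry


history = [Action("swipe", "home_feed"), Action("swipe", "home_feed")]
suggestions = [
    Action("touch", "nonexistent_button"),   # completely invalid
    Action("swipe", "home_feed"),            # valid, but already repeated twice
    Action("swipe", "request_trip_button"),  # partially invalid: wrong kind
]
print(choose(suggestions, history, GROUND_TRUTH))
```

&lt;p&gt;In this example the unknown element is rejected, the repeated swipe is skipped as a probable loop, and the last suggestion is repaired from a swipe into a touch on request_trip_button.&lt;/p&gt;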


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity is-style-default&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-dragoncrawl-in-action&quot;&gt;DragonCrawl in Action&lt;/h2&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Jwm8lAVWCTAaQWhF_TMwOPa1NghHQJybRrqTWTrc5_HgHErdHudy8fRP4KlCzvokvruiQvBUHMsadMaO7cwumQQwvf3jCNm18TWnAtOo6jqC9r-AcoFJAwHuUiEltRSurmepasnaGAgkYeFIcbo333o&quot; alt=&quot;&quot; style=&quot;aspect-ratio:0.481875;width:273px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;We have seen DragonCrawl do amazing things, but in this section, we will discuss two scenarios that really impressed us.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-dragoncrawl-goes-online-in-australia&quot;&gt;&lt;strong&gt;DragonCrawl Goes Online in Australia!&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;In October 2023, we were testing Uber’s trip flow with DragonCrawl in Brisbane, Australia, and we saw something unexpected. DragonCrawl’s fake driver profile was perfectly set up, but this time, it was not able to go online for around 5 minutes. During those 5 minutes, DragonCrawl pushed the “GO” online button repeatedly until it finally went online.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;576&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4-1024x576.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086898&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 2048w, https://blog.uber-cdn.com/cdn-cgi/image/width=1080,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-4.png 1080w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: DragonCrawl going online in Brisbane, Australia after trying for 5 minutes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;We were pleasantly surprised. DragonCrawl is so goal-oriented that it went through an unfriendly user experience to accomplish its goals: go online, be matched to a (fake) rider, and do the hypothetical trip. Because of the time to completion, we knew we had to investigate. We also learned, as discussed more below, that DragonCrawl will not be thrown off by minor or non-reproducible bugs, like the ones that impacted our script-based QA.&amp;nbsp;&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-the-ultimate-solution-turn-it-off-and-then-turn-it-back-on&quot;&gt;&lt;strong&gt;The ultimate solution: Turn it off, and then turn it back on&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;It was September 2023, and we saw Dragon do something so smart, we did not know if we should laugh or clap. Dragon was testing Uber’s trip flow in Paris. It chose to go to the Paris airport (CDG), and when it got to the screen to select the payment method, the payment methods were not loading (most likely a blip in the account we were using). What did Dragon do? It closed the app, opened it, and then requested the trip again. There were no issues the second time, so Dragon accomplished its goal of going to the airport.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;576&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5-1024x576.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086900&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 2048w, https://blog.uber-cdn.com/cdn-cgi/image/width=1080,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-5.png 1080w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: DragonCrawl restarting the app to request a trip.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;It is difficult to express with words how excited and proud we are to see DragonCrawl do these things. Behaviors like pushing the go online button repeatedly just to be able to drive with Uber, or closing and reopening the app to get where it wants to be, make DragonCrawl more resilient to minor tech issues than our old script-based testing model.&lt;/p&gt;


&lt;p&gt;We have observed that no amount of code can match the goal-oriented behavior DragonCrawl displays, and what it represents for developer productivity is exciting. It is possible to create scripts that match DragonCrawl’s strategies, but how many thousands (or even millions) of lines of code would need to be written? How expensive would it be to update all of that code when needed? Now, imagine what happens when traditional tests encounter the scenarios we just described:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Functioning driver account cannot go online for 5 minutes:&lt;/strong&gt; This would raise eyebrows if not alerts in testing teams. We may even think there is an outage, which would alert multiple engineers, but in reality, it is a transient issue.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Payment method not loading:&lt;/strong&gt; Tickets would be filed at the highest priority. This would trigger multiple conversations, examinations, and attempts to reproduce the issue, but it would only have been a blip.&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-dragoncrawl-running-on-uber-s-ci&quot;&gt;&lt;strong&gt;DragonCrawl Running on Uber’s CI&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;We productionized our model and the CI pipelines that consume it around October 2023, and got some wins by the end of the year. As of January 2024, DragonCrawl executes the core trip flow in 5 different cities once per night, and also runs against the Rider and Driver Android apps before we release them to our customers. Since we launched, we have observed the following:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;High stability:&lt;/strong&gt; DragonCrawl executed flows with 99%+ stability in November and December 2023. The rare cases where Dragon failed were due to outages in the third-party systems we use, and also due to a real outage caused by a high-priority bug that no other mobile testing tool detected.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;No maintenance:&lt;/strong&gt; We did not need to manually update and/or maintain DragonCrawl. Whenever there were changes in the apps, DragonCrawl figured out how to get through those changes to accomplish its goals, unlike our team of software testers, who spent hundreds of hours maintaining test cases in 2023.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;High reusability:&lt;/strong&gt; We evaluated DragonCrawl in 89 of our top cities, and DragonCrawl successfully requested and completed trips in 85 of them. This is the first time at Uber that a mobile test as complex as requesting and completing a trip has been successfully executed in 85 cities worldwide without needing to tweak code.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Device/OS resilient:&lt;/strong&gt; We tested Uber’s trip flow in our CI with 3 different kinds of Android devices, and 3 different OS versions, and we even varied other parameters, such as available disk, CPU, etc. DragonCrawl successfully requested and completed trips across all of these combinations without tweaks to our code or model, which is not always guaranteed in traditional mobile tests. Tuning tests to handle different screen sizes/resolutions and other device specifics is a known hassle of traditional mobile testing.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-what-s-next&quot;&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;The foundations we set in 2023 paved the way for a very exciting 2024 and beyond. Our investments in smaller language models resulted in a foundational model with very high-quality embeddings, to the point that it unlocks the architecture shown below:&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;576&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6-1024x576.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086902&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 2048w, https://blog.uber-cdn.com/cdn-cgi/image/width=1080,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/juan_blog-6.png 1080w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: Future mobile tests as RAG applications powered by the Dragon Foundational Model (DFM)&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;With the Dragon Foundational Model (DFM), we can use small datasets (tens or hundreds of data points) to create RAG (retrieval-augmented generation) applications that more accurately simulate how humans interact with our applications. Those smaller datasets (with verbal goals and preferences) would tell DragonCrawl what to optimize for, and that is all it needs. The DFM may be an LLM, but it is secretly a rewards model that takes actions to accomplish its goals, and as we have seen, some of those actions mimic what a real human would do.&lt;/p&gt;


&lt;p&gt;In 2024, a big area of investment for us will be to build the subsystems that will empower developers to build their tests as RAG applications, and reap the benefits of tests that execute flawlessly across many cities and languages with minimal (or even zero) maintenance costs.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;With all the advancements generative AI has seen over the past 4-6 months, there are more things to evaluate to improve our model and the quality of our apps. We plan to evaluate more modern large language models to push the quality of our models even further. Every increase in model quality will increase the combinations we can test, bringing down bugs that reach our users, which in turn increases productivity, enabling developers to build new experiences, and giving DragonCrawl more things to test. This is a flywheel that gets kicked off and accelerates with model quality, and we will fuel this acceleration.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/DFEnnSB2E5rc241fUWUqJVypOJkESM0sMISfkqL-Qtq3B4gP212JD-Lw9AqaP0pRkzpo_RO25TyEBxG6NuJmqy4aRCq9XQ93l7zbEBZKfUUJriwCZAo-00dz56fReF7GrGD1eijSgGiFrx8j6ARmN0o&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Model-quality flywheel.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;Something as complex as DragonCrawl would not have been possible without the help of our partner teams. We are very thankful to Uber’s CI, Mobile Foundations, Michelangelo, Mobile Release, and Test accounts teams. We would also like to thank the passionate researchers that created &lt;a href=&quot;https://arxiv.org/abs/2004.09297&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;MPNet&lt;/a&gt; (which we use), T5, and other LLMs for their contributions to the field and for enabling others to advance their own fields. Finally, we want to send a big thank you to our former intern &lt;a href=&quot;https://www.linkedin.com/in/gustavonazarioperez/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Gustavo Nazario&lt;/a&gt;, who helped us turn DragonCrawl into what it is today.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Cover photo attribution: This image was generated using OpenAI’s DALL·E.&lt;/p&gt;
</description><link>https://www.uber.com/blog/generative-ai-for-high-quality-mobile-testing/</link><guid isPermaLink="false">https://www.uber.com/blog/generative-ai-for-high-quality-mobile-testing/</guid><pubDate>Tue, 23 Apr 2024 05:00:00 GMT</pubDate><author>Uber</author><category>AI</category></item><item><title>Ensuring Precision and Integrity: A Deep Dive into Uber’s Accounting Data Testing Strategies</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Uber operates multiple lines of business across diverse global regions. The Financial Accounting Services (FAS) Platform (&lt;a href=&quot;https://www.uber.com/en-IN/blog/ubers-finance-computation-platform/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;detailed architecture&lt;/a&gt;) is responsible for financial accounting across these regions and is designed to follow the tenets below:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Compliance&lt;/li&gt;


&lt;li&gt;Auditability&lt;/li&gt;


&lt;li&gt;Accuracy&lt;/li&gt;


&lt;li&gt;Scalability&lt;/li&gt;


&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;To maintain these tenets, FAS has built robust testing, monitoring, and alerting processes. This encompasses system configuration, business accounting, and external financial report generation.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-challenges&quot;&gt;Challenges&lt;/h2&gt;


&lt;p&gt;The financial accounting services platform at Uber operates at internet scale: approximately 1.5 billion journal entries (JEs) per day and 120 million transactions per day via ETL and data processing, at an average throughput of 2,500 queries per second. Standard off-the-shelf accounting systems cannot support the scale and scope of transactions on our growing platform. Additionally, we manage data from over 25 different services for accounting purposes. To handle data at this scale, our engineering systems are designed to process data both at the event level and in batch mode. As data flows through multiple components in the architecture, we need to ensure that every component is designed to embrace the tenets defined above.&lt;/p&gt;


&lt;p&gt;The platform processed over $120 billion in annual gross bookings and settlements in 2023. It operates at a transaction scale (~80 billion financial microtransactions per year) that is 10 times the trip scale, and currently offers automated revenue computation for 99.6% of transactions, with 99.99% completeness, accuracy guarantee, and auditability. We onboarded 600+ business changes to support the scaling of the business in 2023. The platform processes big data and stores petabytes of data in Schemaless and Apache Hive&lt;sup&gt;TM&lt;/sup&gt;.&lt;/p&gt;


&lt;p&gt;The accounting process has multiple steps and validations are required at every step to adhere to the tenets. Here are the various steps where validations are performed:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Business Requirement Validation&lt;/li&gt;


&lt;li&gt;Accounting Onboarding&lt;/li&gt;


&lt;li&gt;Accounting Execution&lt;/li&gt;


&lt;li&gt;Report Generation&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;To uphold our established principles, we implement checks and balances throughout the stages of financial accounting services:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Requirements Signoff&amp;nbsp;&lt;/li&gt;


&lt;li&gt;Regression Testing&lt;/li&gt;


&lt;li&gt;Integration Testing&lt;/li&gt;


&lt;li&gt;UAT Validations&lt;ul&gt;&lt;li&gt;Ledger Validations&lt;/li&gt;


&lt;li&gt;Transaction-Level Validations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;Shadow Validations&lt;/li&gt;


&lt;li&gt;Deployments&lt;ul&gt;&lt;li&gt;Canary&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;Health Checks&lt;ul&gt;&lt;li&gt;Auditor Checks&lt;/li&gt;


&lt;li&gt;Completeness Checks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;Alerting/Monitoring&lt;/li&gt;


&lt;li&gt;Report Generation&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-validations-life-cycle-of-accounting-processes-nbsp&quot;&gt;Validations Life Cycle of Accounting Processes&amp;nbsp;&lt;/h2&gt;


&lt;p&gt;As the data flows through various components of the Fintech Systems, there are checks and balances at every stage so that the systems and processes adhere to the tenets.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-requirements-signoff&quot;&gt;Requirements Signoff&lt;/h3&gt;


&lt;p&gt;Based on the Business Models operating in various countries and the expectations of the local teams, the requirements are provided and tracked. Accounting requirements are provided in accordance with &lt;a href=&quot;https://en.wikipedia.org/wiki/Generally_Accepted_Accounting_Principles_(United_States)&quot;&gt;Generally Accepted Accounting Principles (GAAP)&lt;/a&gt;. The requirements are then onboarded into our accounting systems. Fintech Systems at Uber have internal tools to validate the requirements, which perform 15+ automated checks to validate the expected output.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-regression-testing&quot;&gt;Regression Testing&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-unit-testing&quot;&gt;Unit Testing&lt;/h3&gt;


&lt;p&gt;Unit tests in Uber’s Financial Services are critical for ensuring the accuracy, security, and reliability of our applications. These tests isolate small sections of code and verify that they function as intended. By rigorously testing each unit for correct operation, we strive to identify and rectify errors early so that all services, from transaction processing to financial reporting, run smoothly and securely.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-regression-kaptres&quot;&gt;Regression – Kaptres&lt;/h3&gt;


&lt;p&gt;Kaptre (Capt&lt;s&gt;ure&lt;/s&gt; — Re&lt;s&gt;play&lt;/s&gt;) is a capture and replay testing tool primarily employed for functional and regression testing purposes. Here’s a breakdown of how our financial systems use its key components:&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Test Case: &lt;/strong&gt;For any accounting change, at least one new Kaptre test is added to the test suite so that all the use cases are tested in successive runs. Each test case includes input (same as used for UAT), expectations, and assertions.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Capture Mode: &lt;/strong&gt;When adding a test, we operate in “capture mode.” This mode executes the accounting process for the newly added test and captures the dependencies needed to re-execute it offline, such as API requests/responses from upstream services and the expected accounting journal entry lines (JElines).&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Replay Mode:&lt;/strong&gt; Subsequent test runs execute the Kaptre regression test suite in replay mode. This mode generates new output using the latest code/config version, and the assertions compare that output with the captured expectations. A test failure is reported if any assertion fails.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Change Triggers for Captured Responses: &lt;/strong&gt;Captured responses change with alterations in upstream systems, new field additions in financial transactions, or anticipated accounting changes. These tests can be updated using the above capture mode after validations.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;This approach ensures that regression tests record the system’s behavior in capture mode and then verify the latest code against those expectations in successive test runs. The design adapts to changes in upstream systems, fintech systems, and anticipated accounting modifications while maintaining the integrity of the testing process and reducing the risk of human error.&lt;/p&gt;
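&lt;p&gt;The capture and replay modes described above can be illustrated with a minimal Python sketch. It is not Kaptre’s actual implementation: the function names, the toy accounting step, and the shape of a test case are all assumptions made for illustration.&lt;/p&gt;

```python
import json


def run_accounting(event, upstream):
    """Hypothetical accounting step: turns an event into journal-entry lines."""
    rate = upstream("fee_rate", event["city"])
    fee = round(event["fare"] * rate, 2)
    return [
        {"account": "revenue", "credit": fee},
        {"account": "driver_payable", "credit": round(event["fare"] - fee, 2)},
    ]


def capture(event, live_upstream):
    """Capture mode: run once against live dependencies and record everything."""
    recorded = {}

    def recording_upstream(name, arg):
        response = live_upstream(name, arg)
        recorded[json.dumps([name, arg])] = response
        return response

    expected = run_accounting(event, recording_upstream)
    return {"input": event, "upstream": recorded, "expected": expected}


def replay(test_case, process=run_accounting):
    """Replay mode: re-run offline against recorded responses and compare."""

    def recorded_upstream(name, arg):
        return test_case["upstream"][json.dumps([name, arg])]

    actual = process(test_case["input"], recorded_upstream)
    return actual == test_case["expected"]


# Capture once with a stand-in upstream, then replay without any live calls.
case = capture({"city": "SF", "fare": 20.0}, lambda name, arg: 0.25)
print(replay(case))
```

&lt;p&gt;In this sketch, a code or config change that alters the journal entries makes the replay comparison fail. The test case would then be re-captured only after the new output has been validated.&lt;/p&gt;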


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/KFCn1YyYUZwB8_9hTsWv-M6EyBi-1rlFZuv3vnhyQmXFA2bLDQZ-7EIdWZuU_qteUyF9UBev3u0bJVNoKnQoroZrPcjFFPOcJI_RqNGDs7bIwDInlnZ3_R_CiCFKtbH0RcUwzsceuy9H4mTXpoWwY-Q&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Kaptre (functioning of the capture-replay framework).&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-slate-short-lived-application-testing-environment-nbsp&quot;&gt;SLATE (Short-Lived Application Testing Environment)&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;Testing in a short-lived application environment (a.k.a. SLATE) before deploying to production is a crucial step in Uber’s software development lifecycle. SLATE testing helps us identify and address issues early on and reduces the risk of introducing defects into the production environment. Various types of testing are performed in SLATE, including integration, performance, and security testing. The primary purpose of this testing is to run the application in a production-like environment, detect issues (like runtime errors) early in the development cycle, and prevent the propagation of defects to higher environments.&lt;/p&gt;


&lt;p&gt;Find more details in the &lt;a href=&quot;https://www.uber.com/en-IN/blog/simplifying-developer-testing-through-slate/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Slate Uber Eng Blog&lt;/a&gt;.&lt;/p&gt;


&lt;p&gt;In summary, testing in short-lived application environments is a best practice that contributes to the overall quality, reliability, and security of our services before they are deployed to production.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-integration-testing-nbsp&quot;&gt;Integration Testing&amp;nbsp;&lt;/h2&gt;


&lt;p&gt;Financial Accounting Services at Uber engages with numerous upstream systems (30+) to enrich trip details essential for generating accounting transactions. Integration testing is crucial for seamless communication between financial systems and upstream components, identifying interface issues and enabling early risk mitigation.&lt;/p&gt;


&lt;p&gt;However, a notable challenge with integration testing lies in determining completeness. Unlike unit tests that have a clear metric for code coverage, integration tests lack insights into the scenarios to cover, and there is no established metric for measuring integration test coverage. This gap results in dependent teams not automatically being informed about new scenarios being launched, and there is a lack of metrics to comprehend test coverage for all scenarios.&lt;/p&gt;


&lt;p&gt;To address this, we have developed an internal tool that automates the detection, notification, and acknowledgment of the readiness of all dependent systems. This tool aims to ensure a defect-free launch and provides a mechanism to measure integration test coverage.&lt;/p&gt;


&lt;p&gt;This becomes particularly critical within revenue systems, positioned at the conclusion of the data flow and interacting with multiple services. Unanticipated launches in this context pose the risk of disrupting accounting processes. For instance, a fare launch, lacking proper communication, might be routed to a dead letter queue (DLQ), leading to improper accounting due to insufficient onboarding in the revenue system.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-uat-validations&quot;&gt;UAT Validations&lt;/h2&gt;


&lt;p&gt;User Acceptance Testing (UAT) is a mandatory step in Uber’s financial systems development, where the accounting team rigorously validates financial reports for accuracy. We streamlined this process with comprehensive validation of aggregated and transaction level ledgers through automated testing covering positive and negative scenarios. This ensures integrity of balance sheets, income statements, and other key financial statements. This meticulous approach guarantees seamless integration of updates and patches without disrupting existing functionalities, with over 15 quality checks before signoff.&lt;/p&gt;


&lt;p&gt;Once accounting configurations are set up as per requirements, they undergo validation by the Accounting Team, culminating in an official sign-off that indicates approval and attests to accuracy and compliance. Business Rule configuration changes adhere to a stringent protocol, requiring explicit authorization from key accounting stakeholders before merging into the main system. Uber utilizes automated Buildkite jobs to ensure integrity and efficiency, systematically checking for the necessary approvals when differences in the codebase are identified. This automation reinforces the rigor of the approval process.&lt;/p&gt;


&lt;p&gt;In instances where changes contradict the established protocols or bypass the mandatory approvals, an automated flag is immediately raised for a thorough review. This safeguard is essential in maintaining the system’s integrity and compliance.&lt;/p&gt;
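&lt;p&gt;As a rough illustration of such an approval gate, the check can be reduced to a set difference between the required approvers and the approvals actually present on a change. The sketch below is hypothetical: the function name, the &lt;code&gt;business_rules/&lt;/code&gt; path prefix, and the approver group names are assumptions, and it does not use or represent any Buildkite API.&lt;/p&gt;

```python
def missing_approvals(changed_files, approvals, required_approvers,
                      protected_prefix="business_rules/"):
    """Return the required approvers who have not yet signed off.

    Only changes that touch files under the (hypothetical) protected prefix
    need sign-off; any other change yields an empty set.
    """
    touches_rules = any(path.startswith(protected_prefix) for path in changed_files)
    if not touches_rules:
        return set()
    return set(required_approvers) - set(approvals)


# A change to a business rule with only one of two required sign-offs.
pending = missing_approvals(
    changed_files=["business_rules/eats_fees.yaml"],
    approvals=["accounting-lead"],
    required_approvers=["accounting-lead", "controller"],
)
print(sorted(pending))
```

&lt;p&gt;A job built around a check like this would raise a flag for review whenever the returned set is non-empty.&lt;/p&gt;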


&lt;p&gt;Uber’s financial systems employ two primary types of validations to ensure the utmost accuracy and reliability:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Sample Validations&lt;/li&gt;


&lt;li&gt;Ledger Validations&lt;/li&gt;
&lt;/ul&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-sample-validations&quot;&gt;Sample Validations&lt;/h3&gt;


&lt;p&gt;Validations are performed on a selected set of sample orders, which are chosen to be representative of the scenarios of the orders in the production environment. These validations are typically adequate while making incremental changes to the financial systems.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-2.17.10%E2%80%AFPM-1024x185.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086095&quot; style=&quot;aspect-ratio:5.535135135135135;width:747px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Table 1: Transaction level validations/sample validations.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-ledger-validations&quot;&gt;Ledger Validations&lt;/h3&gt;


&lt;p&gt;For changes that impact a significant number of use cases either at the country or business level, we also perform ledger validations to get completeness assurances. These validations provide additional assurances at aggregate levels over a specific period of time before we implement these changes in production.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-2.17.42%E2%80%AFPM-1024x375.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086096&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Table 2: Aggregate ledger validations.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Both validation types are integral to Uber’s commitment to maintaining the highest standards of financial accuracy and regulatory compliance. They work in tandem to ensure that the financial system remains robust, reliable, and reflective of the true financial position of the company.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-shadow-validations&quot;&gt;Shadow Validations&lt;/h3&gt;


&lt;p&gt;The purpose of shadow testing is to serve as a final checkpoint to catch any potential issues before the build is rolled out to the production environment. It is essentially a build certification strategy that helps us make informed decisions about deploying a release candidate (RC). The core process involves passing production traffic through an RC and comparing its outputs with those from the current production build to spot any anomalies.&lt;/p&gt;


&lt;p&gt;Shadow testing consists of three parts:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Capturing Production Requests&lt;/strong&gt;: There are multiple strategies to achieve this. One of them entails recording the (request, response) pairs from production traffic for later comparison against the RC.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Replaying Production Traffic&lt;/strong&gt;: These captured requests are replayed against the RC, and the responses are compared to those from the production environment. Any differences are logged for further analysis.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Analyzing Differences&lt;/strong&gt;: This involves a thorough examination of the logged differences to determine the confidence level of the RC. This step is crucial for certifying the build’s readiness for deployment.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-2.18.00%E2%80%AFPM-1024x455.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086099&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Table 3: Shadow validations between pre-production environment and production environment.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-challenges-and-solutions&quot;&gt;Challenges and Solutions&lt;/h3&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Volume of Traffic and Upstream Calls&lt;/strong&gt;: With the fintech services processing more than 10K events per second, replaying all this traffic against the shadow build is impractical and could result in rate-limiting issues due to numerous upstream calls.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Traffic Sampling and Load Distribution&lt;/strong&gt;: To mitigate this, we adopt a strategy of sampling a fraction of traffic (e.g., 10%) and distributing the replay over an extended period (e.g., 5 hours). This reduces the calls per second, but also limits the scope of testing.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;To solve this problem, we cache the upstream calls and responses instead of making real-time network calls.&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Caching Upstream Calls&lt;/strong&gt;: We implemented a caching mechanism for upstream (request, response) pairs to avoid redundant calls during replay, at the cost of increased storage expenses. We limit those storage costs with a retention period of x days, after which old data is cleaned up. We always replay against the latest set of data.&lt;/li&gt;


&lt;li&gt;This approach lacks the ability to detect real-time issues stemming from upstream changes. To mitigate this, we are developing a mechanism for sampling traffic and load distribution.&lt;/li&gt;
&lt;/ul&gt;
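&lt;p&gt;The sampling and caching ideas above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the production design: the class and function names are hypothetical, the event ID doubles as the upstream request key, and retention-based cleanup is omitted.&lt;/p&gt;

```python
import zlib


def is_sampled(event_id, sample_percent=10):
    """Deterministically pick roughly sample_percent of events by hashing the ID."""
    bucket = zlib.crc32(event_id.encode("utf-8")) % 100
    return bucket in range(sample_percent)


class UpstreamCache:
    """Stores upstream (request, response) pairs seen in production."""

    def __init__(self):
        self._responses = {}

    def record(self, request_key, response):
        self._responses[request_key] = response

    def lookup(self, request_key):
        return self._responses[request_key]


def handle_in_production(event, call_upstream, cache, sample_percent=10):
    """Production path: make the real upstream call and cache it for sampled events."""
    response = call_upstream(event["id"])
    if is_sampled(event["id"], sample_percent):
        cache.record(event["id"], response)
    return response


def replay_from_cache(event, cache):
    """Replay path: serve the upstream response from the cache, with no network call."""
    return cache.lookup(event["id"])
```

&lt;p&gt;Hashing the event ID keeps the sampling decision stable across runs, so the same events are selected whenever the same sampling percentage is used.&lt;/p&gt;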


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/laVFJXmguJ5XutZD365Hu3SdXUMpGvotZ8HUAA4_OWYtYVcItW3zl7eJoF33tD9jVk1WuvBWkDlP4rQVZnPl75wcf_6mvIqg_Rq_fBR2FJg9hrFbCT-sTPkC1nG6KyTrlSRG-acIn4mPrBWU3OrsuFM&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.694915254237288;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Sampling and storing events for shadow testing.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-replaying-captured-requests&quot;&gt;Replaying Captured Requests&lt;/h3&gt;


&lt;ul&gt;&lt;li&gt;A specialized workflow has been developed for replaying stored requests within a given timestamp range against an RC. Discrepancies between the stored responses and the RC’s responses are logged for analysis.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-validating-and-analyzing-differences&quot;&gt;Validating and Analyzing Differences&lt;/h3&gt;


&lt;ul&gt;&lt;li&gt;This phase involves scrutinizing the identified discrepancies. The aim is to differentiate between expected variances and anomalies. This requires an in-depth understanding of the response structure of the Banker system.&lt;/li&gt;


&lt;li&gt;Banker’s Response Structure: The output from Banker is an array of complex data types called ‘transactions,’ each representing multiple journal entries with attributes like credit, debit, GL account, and line of business.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;figure class=&quot;wp-block-image size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-2.18.19%E2%80%AFPM-1024x459.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086101&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The confidence in the build is gauged by how far the monetary discrepancies stray from predefined thresholds.&amp;nbsp;&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-example-scenario-and-analysis&quot;&gt;Example Scenario and Analysis&lt;/h3&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Transaction Comparison:&lt;/strong&gt; Transactions from both the primary and shadow builds are compared. Differences are logged in a datastore.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Datastore Attributes:&lt;/strong&gt; The logged data includes the transactions, their source (primary/shadow), the RunID of the replay workflow, and a timestamp.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Detailed Analysis: &lt;/strong&gt;We conduct analyses focusing on anomalies in specific areas like LineOfBusiness and GlAccountNumber, using queries to identify monetary discrepancies. The build confidence is adjusted based on these findings.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;As an example, consider an event E1 that we processed against both the Production build and the release candidate. TxnP is the transaction returned by the Production instance, and TxnS is the transaction returned by the release candidate:&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;figure class=&quot;wp-block-image size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-2.18.29%E2%80%AFPM-1024x205.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086104&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;TxnP and TxnS differ in LineOfBusiness. Hence, we log them to a datastore for analysis. Our datastore contains four attributes:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Transactions -&amp;gt; Array of transactions.&lt;/li&gt;


&lt;li&gt;Source -&amp;gt; Primary/Shadow (i.e., whether the transactions come from the Production or the Shadow build).&lt;/li&gt;


&lt;li&gt;RunID -&amp;gt; RunID of the Replay workflow. Since we can run multiple shadow testing workflows with different builds, we should make sure that we are only analyzing the differences of a specific workflow.&lt;/li&gt;


&lt;li&gt;Timestamp.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;After replaying E1, our datastore will contain the following two new records:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;(TxnP, Primary, runID, currentTimeStamp)&lt;/li&gt;


&lt;li&gt;(TxnS, Shadow, runID, currentTimeStamp)&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;There are multiple ways in which we can analyze these differences. For our use case, we are most interested in capturing any anomalies in LineOfBusiness and GlAccountNumber. Hence, we have written queries that identify the monetary differences between the Production build and the release candidate across these LineOfBusiness (LOB) and GlAccountNumber dimensions. If the monetary difference on credit or debit is beyond a certain threshold, we reduce the confidence of the build. The farther it strays beyond the threshold, the lower the confidence of the build.&lt;/p&gt;


&lt;p&gt;For the above example, the shadow testing report for LOB differences will look as follows:&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;figure class=&quot;wp-block-image size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Screenshot-2024-04-12-at-3.07.22%E2%80%AFPM-1024x166.png&quot; alt=&quot;&quot; class=&quot;wp-image-1086105&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Suppose we define a threshold of 1,000 USD per line of business. In that case, the difference of 100 is still well below the threshold, and hence the confidence of our build is unaffected.&lt;/p&gt;
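&lt;p&gt;To make the threshold logic concrete, here is a minimal Python sketch of how monetary differences could be aggregated per dimension and turned into a confidence score. It is an illustration only: the field names, the sample amounts, and the confidence formula are assumptions, not the queries or scoring used in production.&lt;/p&gt;

```python
from collections import defaultdict


def totals_by(entries, dimension):
    """Sum credit and debit per value of the given dimension."""
    totals = defaultdict(lambda: {"credit": 0.0, "debit": 0.0})
    for entry in entries:
        key = entry[dimension]
        totals[key]["credit"] += entry.get("credit", 0.0)
        totals[key]["debit"] += entry.get("debit", 0.0)
    return totals


def monetary_differences(primary, shadow, dimension):
    """Largest absolute credit/debit gap between the two builds, per dimension value."""
    p = totals_by(primary, dimension)
    s = totals_by(shadow, dimension)
    return {
        key: max(abs(p[key][side] - s[key][side]) for side in ("credit", "debit"))
        for key in set(p) | set(s)
    }


def build_confidence(differences, threshold):
    """1.0 while every difference is within the threshold; lower the farther one strays."""
    overshoot = max(
        (max(0.0, diff - threshold) for diff in differences.values()),
        default=0.0,
    )
    return threshold / (threshold + overshoot)


# Illustrative journal entries: the two builds disagree on the line of business.
primary = [{"line_of_business": "Rides", "gl_account": "4000", "credit": 100.0}]
shadow = [{"line_of_business": "Eats", "gl_account": "4000", "credit": 100.0}]

lob_diffs = monetary_differences(primary, shadow, "line_of_business")
print(build_confidence(lob_diffs, threshold=1000.0))
```

&lt;p&gt;With these illustrative values, each line of business differs by 100, which is within the 1,000 threshold, so the score stays at its maximum. A difference of 3,000 on any line of business would reduce it to one third under this particular formula.&lt;/p&gt;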


&lt;p&gt;By adopting this shadow testing strategy, we ensure a comprehensive and thorough evaluation of the release candidate. This methodical approach not only identifies potential issues but also provides the insights necessary for improving future builds, ultimately contributing to the robustness and reliability of our deployment process.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/lWpk884yU_92bIFwEr29JfVH_HZpEPd6pXf2XMJl1hg42l9JLX39RzczjifWlo9nAjQRFFd9eu54LZgKKOvBigmIohSreFIAK_bTaHqPy_tFpgtQfVpXkS4B256K2gDy_sdQ3BSbP2DHwmAuvKEpAcY&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Shadow testing workflow.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-deployment-nbsp&quot;&gt;Deployment&amp;nbsp;&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-canary-testing-nbsp&quot;&gt;Canary Testing&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;In the financial accounting services at Uber, the builds are deployed daily. To ensure the successful deployment of each build and minimize the impact of issues such as performance degradation, increased error rates, and resource exhaustion, incorporating a canary deployment is crucial. This strategy facilitates a controlled release, preventing immediate impacts on the entire traffic. It enables the identification and resolution of potential issues before a complete deployment occurs.&lt;/p&gt;


&lt;p&gt;The canary release approach is employed to test real traffic (&amp;lt;=2% traffic) with minimal impact. When a new build is ready for deployment, the canary zone serves as the initial deployment target. If any errors or issues arise during this deployment, the build is not propagated to other production regions, preventing widespread disruption and ensuring a more controlled release process.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-deployment-monitoring-and-alerting&quot;&gt;Deployment monitoring and alerting&lt;/h3&gt;


&lt;p&gt;The financial accounting service platform consumes data from various upstream sources. To monitor the health of the services, we have configured multiple metrics and alerts. If an alert fires within a prescribed period after a deployment, the deployment pipeline is paused and the deployment is rolled back.&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-dead-letter-queues&quot;&gt;Dead Letter Queues&lt;/h4&gt;


&lt;p&gt;A dead letter queue (DLQ) stores events that could not be processed due to errors. Elevated DLQ counts indicate issues like buggy code, corrupt events, upstream service problems, or rate limiting. Each message queue has a corresponding DLQ for handling unprocessable events. Ideally, the DLQ should hold zero events. We use threshold-based alerts to detect issues and investigate root causes when alerts are triggered. We log all errors, including event details, to a dedicated Apache Kafka&lt;sup&gt;Ⓡ&lt;/sup&gt; topic and ingest them into an Apache Hive&lt;sup&gt;TM&lt;/sup&gt; table. We have configured Data Studio dashboards for monitoring the Apache Hive&lt;sup&gt;TM&lt;/sup&gt; table, providing insights on DLQ events’ impact, count, freshness, and trends. This data-driven approach aids in quickly identifying and prioritizing issues for root cause analysis and system improvement.&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-alerts-and-monitors&quot;&gt;Alerts and Monitors&lt;/h4&gt;


&lt;p&gt;Alerts are primarily used for flagging urgent and critical issues that need to be looked at immediately. Monitors are dashboards on which alerts are configured. The alerts should always be actionable. It is also recommended to tag every alert with a corresponding runbook.&lt;/p&gt;


&lt;p&gt;Our team alone has more than 400 alerts configured, spanning a wide range of dimensions, including but not limited to DLQ count, consumer lag of message queues, and service availability.&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-completeness-checks&quot;&gt;Completeness Checks&lt;/h4&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/TD2qCI_24RcKTtUAX12d2re1uJHo_00oLC6ItT3kfSvQFwEVGW4P20F71vlxJ_R4GkRxQrkNq7c9sRqGBx18KCfe_FDmGplL4j36asRsFeR8V1MaeKjFRX5zYV6Hx54BT7COj16ZgQI0AxNQh-xGfOA&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Completeness checks.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;All the events received by Financial Services need to be accounted for. When an event is received by our services, it is recorded in the Received Logger. Processing an event has one of the following outcomes:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Event Processed&lt;/strong&gt; – Recorded in Fact Table&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Event Filtered or Ignored&lt;/strong&gt; – Recorded in Filter Logger&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Event Processing Errored Out&lt;/strong&gt; – Recorded in Error Logger&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;To ensure that all incoming events are accounted for, we perform completeness checks. These checks confirm that every event logged in the Received Logger is also logged in the Error Logger, the Filter Logger, or the Fact Table (indicating successful processing).&lt;/p&gt;
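&lt;p&gt;Conceptually, a completeness check is a set difference between the received events and the union of all recorded outcomes. The following Python sketch illustrates the idea. The function name and the use of in-memory ID collections are assumptions for illustration; in practice, such a check would more likely run as a query over the logged tables.&lt;/p&gt;

```python
def completeness_gaps(received, processed, filtered, errored):
    """Return IDs of events that were received but have no recorded outcome."""
    accounted_for = set(processed) | set(filtered) | set(errored)
    return set(received) - accounted_for


# Event e4 was received but never reached the Fact Table or either logger.
gaps = completeness_gaps(
    received=["e1", "e2", "e3", "e4"],
    processed=["e1"],
    filtered=["e2"],
    errored=["e3"],
)
print(sorted(gaps))
```

&lt;p&gt;An empty result means every received event has been accounted for. Any remaining ID points to an event that needs investigation.&lt;/p&gt;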


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Leveraging the aforementioned testing and detection strategies, our team has achieved a remarkable milestone in 2023: reducing the number of accounting incidents to zero. This significant accomplishment reflects our dedication to accuracy and efficiency in financial management. Furthermore, there has been a notable decrease in manual journal entries, a direct result of diminished accounting errors. This improvement has not only enabled the timely closure of monthly accounting books but has also bolstered our confidence in managing multiple projects within the accounting domain as evidenced by the 17% improvement in throughput. These advancements demonstrate our commitment to excellence and reliability in our accounting practices.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;In conclusion, the comprehensive and multifaceted Fintech Testing Strategies employed by Uber’s Financial Accounting Services (FAS) have proven to be a resounding success. Through rigorous validation processes at every step–from business requirement validation to report generation, and employing advanced techniques such as regression testing, SLATE, integration testing, and shadow validations–Uber has set a new standard in financial systems’ reliability and accuracy.&lt;/p&gt;


&lt;p&gt;The challenges of handling immense volumes of transactions and data have been met with innovative solutions that not only address current needs but also scale for future growth. The meticulous approach to testing and validation, coupled with deployment strategies like canary testing and vigilant monitoring and alerting systems, exemplifies Uber’s commitment to maintaining the highest standards in financial technology. In the next year, we plan to add more functionality to our testing strategies for detecting and auto-correcting bad inputs, in support of an error-free self-serve journey.&lt;/p&gt;


&lt;p&gt;Uber’s journey in refining its Fintech Testing Strategies serves as a benchmark for others in the industry, underlining the importance of continuous innovation and rigorous testing in the ever-evolving landscape of financial technology.&lt;/p&gt;
</description><link>https://www.uber.com/blog/accounting-data-testing-strategies/</link><guid isPermaLink="false">https://www.uber.com/blog/accounting-data-testing-strategies/</guid><pubDate>Thu, 18 Apr 2024 05:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category></item><item><title>Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Last week, we explored &lt;a href=&quot;https://www.uber.com/blog/how-ledgerstore-supports-trillions-of-indexes/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;LedgerStore&lt;/a&gt; (LSG) – Uber’s append-only, ledger-style database. This week, we’ll dive into how we migrated Uber’s business-critical ledger data to LSG. We’ll detail how we moved more than a trillion entries (making up a few petabytes of data) transparently and without causing disruption, and we’ll discuss what we learned during the migration. &lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-history&quot;&gt;History&lt;/h3&gt;


&lt;p&gt;Gulfstream is Uber’s payment platform. It was &lt;a href=&quot;https://www.uber.com/blog/payments-platform&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;launched in 2017&lt;/a&gt; using DynamoDB for storage. At Uber’s scale, DynamoDB became expensive. Hence, we started keeping only 12 weeks of data (i.e., hot data) in DynamoDB and started using Uber’s blobstore, TerraBlob, for older data (i.e., cold data). TerraBlob is similar to AWS S3.&lt;/p&gt;


&lt;p&gt;For a long-term solution, we wanted to use &lt;a href=&quot;https://www.uber.com/blog/dynamodb-to-docstore-migration/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;LSG&lt;/a&gt;. It was purpose-built for storing payment-style data. Its key features are:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;It is verifiably immutable (i.e., you can check that records have not been altered using cryptographic signatures)&lt;/li&gt;


&lt;li&gt;Tiered storage to manage cost (the hot data is kept at a place that is best to serve requests and cold data is stored at a place that is optimized for storage)&lt;/li&gt;


&lt;li&gt;Shorter lag for eventually consistent secondary indexes&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So, by 2021, Gulfstream was using a combination of DynamoDB, TerraBlob, and LSG to store data.&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;DynamoDB for the last 12 weeks of data&lt;/li&gt;


&lt;li&gt;TerraBlob, Uber’s internal blob store, for cold data&lt;/li&gt;


&lt;li&gt;LSG, where we were already writing data and to which we wanted to migrate&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-why-migrate&quot;&gt;Why Migrate?&lt;/h2&gt;


&lt;p&gt;LSG is better suited for storing ledger-style data because of its immutability. The recurring cost savings by moving to LSG were significant.&lt;/p&gt;


&lt;p&gt;Going from three storage systems to one would simplify the code and design of the Gulfstream services responsible for interacting with storage and creating indexes. This in turn makes the services easier to understand and maintain.&lt;/p&gt;


&lt;p&gt;LSG promised shorter indexing lag (i.e., the time between when a record is written and when its secondary index is created). Additionally, it would give us lower network latency because it was running on-premises within Uber’s data centers.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/ImSq9jhPp_-uGFsvBU0tmJA1MiY42lmKUH1fLoNJ036l2w7W9uLx89xRnBO30l7lRkHXQ0XQv_flDWeZB0pfRyYzczFfayP_Vi4j217OZt8fG5WxpM2FKdt3lB34s0emy0gp6nUKNfS2q7MR5z8ZUbc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Data flow before and after the migration&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-nature-of-data-amp-associated-risk&quot;&gt;Nature of Data &amp;amp; Associated Risk&lt;/h2&gt;


&lt;p&gt;The data we were migrating is all of Uber’s ledger data for all of Uber’s business since 2017:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Immutable records – 1.2 PB compressed size&lt;/li&gt;


&lt;li&gt;Secondary indexes – 0.5 PB uncompressed size&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Immutable records should not be modified. So, for all practical purposes, once we have written a record, it can’t be changed. We do have the flexibility of modifying secondary index data for correcting problems.&lt;/p&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-checks&quot;&gt;Checks&lt;/h2&gt;


&lt;p&gt;To ensure that the backfill is correct and acceptable in all respects, we need to check that we can handle the current traffic and that the data that is not currently being accessed is correct. The criteria were:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Completeness: All the records were backfilled.&lt;/li&gt;


&lt;li&gt;Correctness: All the records were correct.&lt;/li&gt;


&lt;li&gt;Load: LSG should be able to handle current load.&lt;/li&gt;


&lt;li&gt;Latency: The P99 latency of LSG was within acceptable bounds.&lt;/li&gt;


&lt;li&gt;Lag: The secondary indexes are created in the background. We want to make sure that the delay of the index creation process was within acceptable limits.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The checks were done using a combination of &lt;em&gt;shadow validation&lt;/em&gt; and &lt;em&gt;offline validation&lt;/em&gt;.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-shadow-validation&quot;&gt;Shadow Validation&lt;/h3&gt;


&lt;p&gt;This compares the response that we had been returning before migration with the one that we would return with LSG as the data source. This helps us ensure that our current traffic will be disrupted by neither data migration issues nor code bugs. We wanted our backfill to be at least 99.99% complete and correct as measured by shadow validation. We also set an upper bound of 99.9999% on this target. The reasons for having an upper bound are:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;When migrating historical data, there are always data corruption issues. Sometimes this is because data was not written correctly during the initial development of the service. It is also possible to see data corruption because of scale. As an example, if S3 gives 11 nines of durability, then you can expect about 10 corruptions in 1 trillion records.&lt;/li&gt;


&lt;li&gt;Indexes are eventually consistent, which means that some records will appear after a few seconds. So, the shadow validation will flag them as missing. This is a false positive that shows up at a large scale.&lt;/li&gt;


&lt;li&gt;For 6 nines, you have to look at data of 100 million comparisons to give any results with good confidence. This means if your shadow validation is comparing 1,000 records/second, then you need to wait for a bit more than one day just to collect sufficient data. With 7 nines, you will have to wait 12 days. In practical terms this would slow the project to a halt.&lt;/li&gt;


&lt;li&gt;With a well-defined upper bound, you are not forced to look at every potential issue that you suspect. Say if the occurrence of a problem is 1/10 of the upper bound, you need not even investigate it.&lt;/li&gt;


&lt;li&gt;With 6 nines, we could end up with slightly more than 1 million corrupt records. Even though 6 nines of confirmed correctness could mean a real cost to the company, the savings generated by this project outweighed the potential cost.&lt;/li&gt;
&lt;/ul&gt;
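

&lt;p&gt;The arithmetic behind these bullets can be reproduced with a short sketch. The function names are hypothetical, and the factor of 100 expected failures per sample is an assumption inferred from the figures above (100 million comparisons for 6 nines), not a stated rule.&lt;/p&gt;

```python
SECONDS_PER_DAY = 86_400


def error_rate(nines: int) -> float:
    """Allowed failure fraction for a given number of nines (6 -> 1e-6)."""
    return 10.0 ** (-nines)


def expected_bad_records(total_records: float, nines: int) -> float:
    """Expected number of bad records at exactly that many nines."""
    return total_records * error_rate(nines)


def comparisons_needed(nines: int, expected_failures: int = 100) -> float:
    """Sample size at which we expect `expected_failures` failures (assumed)."""
    return expected_failures / error_rate(nines)


def days_to_collect(nines: int, comparisons_per_second: float) -> float:
    """Days of shadow traffic needed to gather that many comparisons."""
    return comparisons_needed(nines) / comparisons_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(expected_bad_records(1e12, 11))   # 11 nines over 1 trillion records
    print(expected_bad_records(1e12, 6))    # 6 nines over 1 trillion records
    print(days_to_collect(6, 1_000))        # a bit more than one day
    print(days_to_collect(7, 1_000))        # roughly 12 days
```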


&lt;p&gt;During shadow validation you are essentially duplicating production traffic on LSG. So, by monitoring LSG, we can verify that it can handle our production traffic while meeting our latency and lag requirements. This gives us good confidence in the code that we wrote for accessing the data from LSG. It also gives us some confidence about the completeness and correctness of the data, particularly the data that is currently being accessed. We developed a single, generic piece of shadow validation code that was reused multiple times for different parts of the migration.&lt;/p&gt;


&lt;p&gt;During the migration process, we found and fixed latency and lag issues caused by multiple bugs in different parts of the system. These included:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Partition key optimization for better distribution of index data&lt;/li&gt;


&lt;li&gt;Index issues causing scan of the record instead of point lookup&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Unfortunately, live shadow validation can’t give strong guarantees about our corpus of rarely-accessed historical data.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-offline-validation-amp-incremental-backfill&quot;&gt;Offline Validation &amp;amp; Incremental Backfill&lt;/h3&gt;


&lt;p&gt;This compares the complete data from LSG with the data dump from DynamoDB. Because of various data issues, you have to skip over bad records to ensure that your backfill can go through. Additionally, there can be bugs in the backfill job itself. Offline validation ensures that the data backfill has happened correctly and that it covers the complete data. This has to be done in addition to shadow validation because live traffic tends to access only recent data. So, if there are any problems lurking in the cold data that is infrequently accessed, they will not be caught by shadow validation.&lt;/p&gt;


&lt;p&gt;The key challenge in offline validation is the size of the data. The biggest dataset that we tackled was 70 TB compressed (an estimated 300 TB uncompressed), and we compared 760 billion records in a single job. This type of Apache Spark&lt;sup&gt;TM&lt;/sup&gt; job requires data shuffling. &lt;a href=&quot;https://www.uber.com/blog/ubers-highly-scalable-and-distributed-shuffle-as-a-service/&quot;&gt;Distributed Shuffle as a Service for Spark&lt;/a&gt;, combined with Dynamic Resource Allocation and Speculative Execution, let us do exactly that at a reasonable speed under resource constraints.&lt;/p&gt;


&lt;p&gt;Offline validation found missing records and its output was used for incremental backfill. We iterated between offline validation and backfill to ensure that all the records were written.&amp;nbsp;&lt;/p&gt;
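

&lt;p&gt;The validate-then-backfill loop can be illustrated with a minimal in-memory sketch. The real comparison ran as a Spark job over hundreds of billions of records; the dictionaries, function names, and SHA-256 checksum used here are illustrative assumptions that show only the comparison logic.&lt;/p&gt;

```python
import hashlib


def checksum(payload: bytes) -> str:
    """Content fingerprint used to compare a record across the two stores."""
    return hashlib.sha256(payload).hexdigest()


def offline_validate(source: dict, target: dict):
    """Return (missing, mismatched) keys of `target` relative to `source`."""
    missing = sorted(key for key in source if key not in target)
    mismatched = sorted(
        key for key in source
        if key in target and checksum(source[key]) != checksum(target[key])
    )
    return missing, mismatched


def validate_and_backfill(source: dict, target: dict, max_rounds: int = 5):
    """Alternate validation and incremental backfill until nothing is missing.

    Mismatched records are only reported, never overwritten, because the
    records are treated as immutable once written.
    """
    for round_number in range(1, max_rounds + 1):
        missing, mismatched = offline_validate(source, target)
        if not missing:
            return round_number, mismatched
        for key in missing:  # incremental backfill of just the gaps
            target[key] = source[key]
    raise RuntimeError("backfill did not converge")
```

&lt;p&gt;Feeding only the missing keys back into the backfill keeps each iteration small, which is what makes repeated validation passes affordable.&lt;/p&gt;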


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-backfill-issues&quot;&gt;Backfill Issues&lt;/h2&gt;


&lt;p&gt;Every backfill is risky. We used Uber’s internal offering of Apache Spark for the backfills. Here are the different problems that we encountered and how we handled them.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-scalability&quot;&gt;Scalability&lt;/h3&gt;


&lt;p&gt;You want to start at a small scale and scale up gradually till you hit the limit of the system. If you just blindly push beyond this point then you are effectively creating a DDoS attack on your own systems. At this point, you want to find the bottleneck, address it, and then scale up your job. Most of the time it’s just a matter of scaling up downstream services, other times it can be something more complex. In either case, you don’t want to scale your backfill job beyond the capability of the bottleneck of the system. It’s a good idea to scale up in small increments and monitor closely after each scale-up.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-incremental-backfills&quot;&gt;Incremental Backfills&lt;/h3&gt;


&lt;p&gt;When you try to backfill 3 years’ worth of data in, say, 3 months, you are generating roughly 10x the normal traffic load, and the system may not be able to cope with this traffic. As an example, you will need about 120 days to backfill 100B records at a rate of 10K/sec when your production system normally handles 1K/sec. So, you can expect the system to get overloaded. If there is even a remote chance of the backfill job causing an ongoing problem, you must shut it down. So, it is unrealistic to expect that a backfill job can run from start to finish in one go, and therefore you have to run backfills incrementally.&lt;/p&gt;


&lt;p&gt;A simple and effective way to do this is to break the backfill into small batches that can be done one by one, such that each batch can complete within a few minutes. Since your job may shut down in the middle of a batch, it has to be idempotent. Every time you complete a batch you want to dump the statistics (such as records read, records backfilled, etc.) to a file. As your backfill continues, you can aggregate numbers from them to check the progress.&lt;/p&gt;
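

&lt;p&gt;A minimal sketch of this batching scheme follows. It assumes an in-memory source and target and a local directory for the per-batch statistics files; the file naming and the function names are hypothetical. A batch counts as done only once its statistics file exists, so a job killed mid-batch simply redoes that batch, and the write step skips keys that are already present.&lt;/p&gt;

```python
import json
from pathlib import Path


def run_backfill(records: dict, target: dict, stats_dir, batch_size: int = 1000):
    """Backfill `records` into `target` in small, restartable batches."""
    stats_path = Path(stats_dir)
    stats_path.mkdir(parents=True, exist_ok=True)
    keys = sorted(records)
    for start in range(0, len(keys), batch_size):
        stats_file = stats_path / f"batch-{start:012d}.json"
        if stats_file.exists():
            continue  # batch finished in an earlier run
        batch = keys[start:start + batch_size]
        written = 0
        for key in batch:
            if key not in target:  # idempotent: never write a record twice
                target[key] = records[key]
                written += 1
        stats = {"read": len(batch), "backfilled": written}
        stats_file.write_text(json.dumps(stats))  # marks the batch as done


def progress(stats_dir) -> dict:
    """Aggregate the per-batch statistics files to report overall progress."""
    totals = {"batches": 0, "read": 0, "backfilled": 0}
    for stats_file in sorted(Path(stats_dir).glob("batch-*.json")):
        stats = json.loads(stats_file.read_text())
        totals["batches"] += 1
        totals["read"] += stats["read"]
        totals["backfilled"] += stats["backfilled"]
    return totals
```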


&lt;p&gt;If you can delete or update existing records, it lowers the risk and cost of mistakes and code bugs during the backfill.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-rate-control&quot;&gt;Rate Control&lt;/h3&gt;


&lt;p&gt;To backfill safely, you want to make sure that your backfill job behaves consistently. So, your job should have rate control that can be easily tweaked to scale up or scale down. In Java/Scala you can use Guava’s RateLimiter.&lt;/p&gt;
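

&lt;p&gt;For illustration, here is a small pacing limiter in Python. It is not Guava’s implementation; the class name and the injectable &lt;code&gt;clock&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; parameters are assumptions made so that the sketch is easy to test. It mirrors the two operations a backfill job needs: acquire a permit before each write, and change the rate while the job is running.&lt;/p&gt;

```python
import time


class SimpleRateLimiter:
    """Paces callers to at most `rate_per_sec` permits per second."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_free = clock()  # earliest time the next permit is available
        self.set_rate(rate_per_sec)

    def set_rate(self, rate_per_sec):
        """Adjust the rate at runtime, e.g. to scale the backfill up or down."""
        if not rate_per_sec > 0:
            raise ValueError("rate_per_sec must be positive")
        self._interval = 1.0 / rate_per_sec

    def acquire(self):
        """Block until a permit is available; return the seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_free - now)
        if wait > 0:
            self._sleep(wait)
        self._next_free = max(now, self._next_free) + self._interval
        return wait
```

&lt;p&gt;A backfill worker would call &lt;code&gt;acquire()&lt;/code&gt; before every write and expose &lt;code&gt;set_rate()&lt;/code&gt; through whatever configuration mechanism the job already uses.&lt;/p&gt;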


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-dynamic-rate-control&quot;&gt;Dynamic Rate Control&lt;/h3&gt;


&lt;p&gt;In some cases, you may be able to go faster when there is less production traffic. For this, you need to monitor the current state of the system and see if it’s OK to go faster. We adjusted RPS along the lines of &lt;a href=&quot;https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;additive increase/multiplicative decrease&lt;/a&gt;. We still had an upper bound on the traffic for safety.&lt;/p&gt;
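

&lt;p&gt;A minimal sketch of an additive increase/multiplicative decrease controller is shown below. The class name, the health signal, and all of the numbers are illustrative assumptions; the source does not describe the actual signals or step sizes that were used.&lt;/p&gt;

```python
class AimdRateController:
    """Adjusts a target RPS: add a fixed step when healthy, halve when not."""

    def __init__(self, initial_rps, min_rps, max_rps, step_rps, backoff=0.5):
        self.rps = float(initial_rps)
        self.min_rps = float(min_rps)
        self.max_rps = float(max_rps)  # hard safety cap on backfill traffic
        self.step_rps = float(step_rps)
        self.backoff = backoff

    def update(self, system_healthy: bool) -> float:
        """Call once per monitoring interval; returns the new target RPS."""
        if system_healthy:
            # Additive increase: probe for spare capacity slowly.
            self.rps = min(self.max_rps, self.rps + self.step_rps)
        else:
            # Multiplicative decrease: back off quickly under pressure.
            self.rps = max(self.min_rps, self.rps * self.backoff)
        return self.rps
```

&lt;p&gt;The asymmetry is the point of the design: capacity is reclaimed slowly, while any sign of overload cuts the backfill rate sharply.&lt;/p&gt;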


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-emergency-stop&quot;&gt;Emergency Stop&lt;/h3&gt;


&lt;p&gt;The migration process needs the ability to stop the backfill quickly in case there is an outage or even a suspicion of overload. Any backfill during an outage has to be stopped, both as a precaution and because it is a potential source of noise. Even post-outage, systems tend to get extra load as they recover. Having the ability to stop the backfill also helps debug scale-related issues.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-size-of-data-file&quot;&gt;Size of Data File&lt;/h3&gt;


&lt;p&gt;When dumping data, keep the size of the files to around 1 GB, with 10x flexibility on both sides. If the files are too big, you run into issues such as the &lt;a href=&quot;https://kb.databricks.com/cloud/s3-part-number-limit.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;MultiPart limitation of different tools&lt;/a&gt;. If the files are too small, then you have too many of them, and even listing them will take significant time. You may even start hitting the ARG_MAX limit when running commands in a shell. This matters because every time you do something with the data, you need to make sure that it has been applied to all the files and not just some of them.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-fault-tolerance&quot;&gt;Fault Tolerance&lt;/h3&gt;


&lt;p&gt;All backfill jobs need some kind of data transformation. When you do this, you inevitably run into data quality or corruption issues. You can’t stop the backfill job every time this happens because such bad records tend to be randomly distributed. But you can’t ignore them either, because they might be caused by a code bug. To deal with this, you dump problematic records separately and monitor statistics. If the failure rate is high, then you can stop the backfill manually, fix the problem, and continue. Otherwise, let the backfill continue and look at the failures in parallel.&lt;/p&gt;


&lt;p&gt;Another reason for records not getting written is RPC timeout. You can retry for this, but at some point, you have to give up and move ahead irrespective of the reason to make sure you can make progress.&lt;/p&gt;
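

&lt;p&gt;The following sketch combines both ideas: bad records are set aside rather than stopping the job, and writes are retried a bounded number of times before the job moves on. The function name, the exception types, and the 1% failure threshold are assumptions chosen for illustration.&lt;/p&gt;

```python
def backfill_batch(records, transform, write, max_attempts=3,
                   max_failure_rate=0.01):
    """Backfill one batch, setting bad records aside instead of stopping.

    `records` is an iterable of (key, raw_value) pairs. `transform` may raise
    ValueError on corrupt input and `write` may raise TimeoutError on an RPC
    timeout; both callables are supplied by the caller.
    """
    failed = []  # problematic records, dumped separately for later analysis
    written = 0
    for key, raw in records:
        try:
            value = transform(raw)
        except ValueError as err:
            failed.append((key, f"transform: {err}"))
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                write(key, value)
                written += 1
                break
            except TimeoutError as err:
                if attempt == max_attempts:
                    # Give up on this record so the batch keeps making progress.
                    failed.append((key, f"write: {err}"))
    total = written + len(failed)
    failure_rate = len(failed) / total if total else 0.0
    return {
        "written": written,
        "failed": failed,
        "should_stop": failure_rate > max_failure_rate,
    }
```

&lt;p&gt;The &lt;code&gt;should_stop&lt;/code&gt; flag is only a signal for a human or a supervisor process; the batch itself always runs to completion.&lt;/p&gt;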


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-logging&quot;&gt;Logging&lt;/h3&gt;


&lt;p&gt;It is tempting to log during backfill to help with debugging and monitoring progress, but this may not be possible because of the pressure that it will put on the logging infrastructure. Even if you can keep logs, there will be too much log data to keep around. The solution is to use a rate limiter to limit the amount of logs that you are producing. You need to rate limit only the parts that produce most of the logs. You can even choose to log all the errors if they happen infrequently.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/xWcB-v0gyFB4920hZx1tevZiHiSLhUKPvA7TZMvkCN6bsEmh5bZiTcZ0xYumbfjsgsG6Oz-Xnl85XeLhD4ofUc07poJ1OnsB4WlNCEyZzYmY9kuvfgCkSxzC4nSvqcBEmYQvANytw4oOyXA4wyQDias&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-mitigating-risk&quot;&gt;Mitigating Risk&lt;/h2&gt;


&lt;p&gt;In addition to analyzing the data from the different validations and the backfill statistics, we were also conservative with the rollout of LSG. We rolled it out over a few weeks, with go-aheads from the on-call engineers of the major callers of our service. We initially rolled out with a fallback (i.e., if the data was not found in LSG, we would try to fetch it from DynamoDB). We looked at the fallback logs before we removed the fallback. For every record that was flagged as missing in the fallback logs, we checked LSG to make sure that it was not really missing. Even after that, we kept the DynamoDB data around for a month before we stopped writing data to it, took a final backup, and dropped the table.&lt;/p&gt;
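

&lt;p&gt;The fallback read and the follow-up check of the fallback log can be sketched as follows. Plain dictionaries stand in for LSG and DynamoDB, and the function names are hypothetical.&lt;/p&gt;

```python
def read_with_fallback(key, primary: dict, fallback: dict, fallback_log: list):
    """Read from the new store first; fall back to the old one on a miss."""
    value = primary.get(key)
    if value is not None:
        return value
    value = fallback.get(key)
    if value is not None:
        # Record every fallback hit so it can be investigated before the
        # fallback path is removed.
        fallback_log.append(key)
    return value


def still_missing(fallback_log: list, primary: dict) -> list:
    """Re-check logged keys against the new store, e.g. after index lag."""
    return sorted(key for key in set(fallback_log) if key not in primary)
```

&lt;p&gt;An empty result from &lt;code&gt;still_missing&lt;/code&gt; corresponds to the condition described above: every record flagged in the fallback logs turned out to be present in the new store.&lt;/p&gt;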


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/P7cnF5dxZX7H5rA82nfZgC1ICBQX7Q928jAexep0GBSR39-B2kDw44hHTtEQQuBNmqw9ZeFz_KYY39uqlUBSEErrb2XWhWagcpWKrsb8tiDX_6CYfOXACQst7Wak7mIewxy4WvI3gW3Vza3altpn0Tc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: LSG Rollout&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;In this article, we covered the migration of massive amounts of business-critical money data from one datastore to another. We covered different aspects of the migration, including criteria for migration, checks, backfill issues, and safety. We were able to do this migration over two years without any downtime or outages during or after the migration.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;Acknowledgments&lt;/h3&gt;


&lt;p&gt;Thanks to Amit Garg and Youxian Chen for helping us migrate the data from TerraBlob to LSG. Thanks to Jaydeepkumar Chovatia, Kaushik Devarajaiah, and Rashmi Gupta from the LSG team for supporting us throughout this work. Thanks to Menghan Li for migrating data for &lt;a href=&quot;https://www.uber.com/en-EG/blog/cashless-payments-with-uber-cash/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Uber Cash&lt;/a&gt;’s ledger.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Cover photo attribution: “&lt;a href=&quot;https://www.flickr.com/photos/51986662@N05/51912457870&quot;&gt;Waterfowl Migration at Sunset on the Huron Wetland Management District&lt;/a&gt;” by &lt;a href=&quot;https://www.flickr.com/photos/51986662@N05&quot;&gt;USFWS Mountain Prairie&lt;/a&gt; is marked with &lt;a href=&quot;https://creativecommons.org/publicdomain/mark/1.0/?ref=openverse&quot;&gt;Public Domain Mark 1.0&lt;/a&gt;.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Amazon Web Services, AWS, and the Powered by AWS logo are trademarks of Amazon.com, Inc. or its affiliates.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Apache®, Apache Spark&lt;sup&gt;TM&lt;/sup&gt;, and Spark&lt;sup&gt;TM&lt;/sup&gt; are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.&lt;/p&gt;
</description><link>https://www.uber.com/blog/migrating-from-dynamodb-to-ledgerstore/</link><guid isPermaLink="false">https://www.uber.com/blog/migrating-from-dynamodb-to-ledgerstore/</guid><pubDate>Thu, 11 Apr 2024 05:30:00 GMT</pubDate><author>Uber</author><category>Engineering</category></item><item><title>How LedgerStore Supports Trillions of Indexes at Uber</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Uber connects the physical and digital worlds to help make movement happen at the tap of a button. Billions of trips, deliveries, and tens of billions of financial transactions across earners, spenders, and merchants are made at Uber every quarter. LedgerStore is an immutable storage solution at Uber that provides verifiable data completeness and correctness guarantees to ensure data integrity for these transactions.&lt;/p&gt;


&lt;p&gt;Considering that ledgers are the source of truth of any financial event or data movement at Uber, it is important to be able to look up ledgers from various access patterns via indexes. This brings in the need for &lt;strong&gt;trillions&lt;/strong&gt; of indexes to index hundreds of billions of ledgers. A previous &lt;a href=&quot;https://www.uber.com/en-US/blog/dynamodb-to-docstore-migration/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;blog post&lt;/a&gt; discussed the background of LedgerStore and how the storage backend was re-architected. This blog covers the significance of LedgerStore indexing and its architecture, which powers trillions of indexes, with a petabyte-scale index storage footprint.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-types-of-indexes&quot;&gt;Types of Indexes&lt;/h1&gt;


&lt;p&gt;Various types of indexes need to be supported on ledgers. Let us explore them along with corresponding use cases.&lt;/p&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-strongly-consistent-indexes&quot;&gt;Strongly consistent indexes&lt;/h2&gt;


&lt;p&gt;One of the use cases is handling the credit card authorization flow when a rider/eater uses Uber. At the beginning of an Uber trip, a credit card hold is placed on the rider/eater’s credit card. This hold should either be converted to a charge or voided, depending on whether the trip was taken or canceled, as shown below.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;618&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084314&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1539,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig1-e1711742633458.png 1539w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Uber Trip credit-card payment flow supported by strongly consistent indexes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;If the index serving the hold is not strongly consistent, it could take a while for the hold to be visible upon reading. A consequence of this is that a duplicate charge could be made on the user’s credit card while the original hold remains on the credit card.&lt;/p&gt;


&lt;p&gt;Now, let’s dive into how we build strongly consistent indexes that ensure that once a record write is performed, any subsequent reads are guaranteed to see the indexes corresponding to that record.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-write-path&quot;&gt;Write Path&lt;/h3&gt;


&lt;p&gt;To build strongly consistent indexes, we use a 2-phase commit to ensure that the index is always strongly consistent with the record, as shown below.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;The insert operation begins with an index intent write before the record write. If the record write succeeds, the intents are committed afterward; this is done asynchronously to avoid affecting end-user insert latency. If the index intent write succeeds but the record write fails, the index intent needs to be rolled back; otherwise, unused intents accumulate. This rollback is handled at read time, as we will see next.&lt;/p&gt;


&lt;p&gt;It is important to note that if the index intent write fails, the whole insert operation fails since we cannot guarantee the consistency of the index with the record. Hence, strongly consistent indexes need to be considered only when the use case strongly demands it.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;520&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig2-e1711742801993.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084320&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig2-e1711742801993.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig2-e1711742801993.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig2-e1711742801993.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1260,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig2-e1711742801993.png 1260w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Two-phase commit write flow of strongly consistent indexes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-read-path&quot;&gt;Read Path&lt;/h3&gt;


&lt;p&gt;There are two cases where an index can be in the intent state after the insert:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;The index intent commit operation failed in the write path, or&lt;/li&gt;


&lt;li&gt;The record write failed&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Such intents are handled on the read path by either committing or deleting them. When a read hits an index that is in the intent state, the corresponding record is read. If the record is present, the index is committed; otherwise, it is rolled back. These operations happen asynchronously so as not to affect the end-user read latency. In general, only a very small percentage of indexes end up in the intent state.&lt;/p&gt;
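

&lt;p&gt;The write and read paths can be sketched together in a few lines. This is a simplified, single-process illustration rather than LedgerStore’s implementation: the class name and the failure-injection flags are hypothetical, and the commit and the read-time repair run synchronously here, whereas the real system performs them asynchronously.&lt;/p&gt;

```python
class StronglyConsistentStore:
    """Toy record store with an index kept consistent via intents."""

    def __init__(self):
        self.records = {}
        # index_key -> {"record_key": ..., "state": "INTENT" or "COMMITTED"}
        self.indexes = {}

    def insert(self, record_key, record, index_key,
               fail_record_write=False, fail_commit=False):
        # Phase 1: write the index intent before the record itself.
        self.indexes[index_key] = {"record_key": record_key, "state": "INTENT"}
        if fail_record_write:  # hypothetical failure injection
            raise OSError("record write failed")
        self.records[record_key] = record
        # Phase 2: commit the intent once the record write has succeeded.
        if not fail_commit:  # hypothetical failure injection
            self.indexes[index_key]["state"] = "COMMITTED"

    def lookup(self, index_key):
        entry = self.indexes.get(index_key)
        if entry is None:
            return None
        if entry["state"] == "INTENT":
            # Resolve the leftover intent by checking whether the record exists.
            if entry["record_key"] in self.records:
                entry["state"] = "COMMITTED"
            else:
                del self.indexes[index_key]  # roll back the unused intent
                return None
        return self.records[entry["record_key"]]
```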


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;791&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig3-e1711742960693.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084321&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig3-e1711742960693.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig3-e1711742960693.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig3-e1711742960693.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1180,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig3-e1711742960693.png 1180w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Read flows of strongly consistent indexes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-eventually-consistent-indexes&quot;&gt;Eventually consistent indexes&lt;/h2&gt;


&lt;p&gt;Not all indexes require strong read-your-write guarantees. An example of such an index is the one behind the payment history page, where a lag of a few seconds is acceptable as long as the payment eventually appears on the page.&lt;/p&gt;


&lt;p&gt;While strongly consistent indexes provide &lt;em&gt;read-your-write&lt;/em&gt; guarantees, they are not suitable in certain circumstances since they trade off the following properties to achieve this guarantee:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Higher Write Latency&lt;/strong&gt;&lt;br&gt;The index intent write and the corresponding record write have to happen serially to provide a strong consistency guarantee between the index and the record&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Lower Availability&lt;/strong&gt;&lt;br&gt;A write failure of any one of the index intents means that the whole write must fail; otherwise, the indexes will not be consistent with the corresponding record&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Eventually consistent indexes are the opposite of strongly consistent indexes in this respect, as they are built in the background by a separate process that is completely isolated from the online write path. Hence, they do not suffer from &lt;em&gt;higher write latency&lt;/em&gt; or cause potentially &lt;em&gt;lower availability&lt;/em&gt; of the LedgerStore service. We leverage a feature called Materialized Views from our home-grown &lt;a href=&quot;https://www.uber.com/blog/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt; database to generate these indexes.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;990&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084322&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1820,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig4-e1711743013811.png 1820w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Payment history served by eventually consistent indexes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-time-range-indexes&quot;&gt;Time-range indexes&lt;/h2&gt;


&lt;p&gt;Ledgers, due to their immutable nature, keep growing in size over time, thereby increasing their cost of storage. So, at Uber, we offload older ledgers in time-range batches to cheaper cold storage.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Every ledger is associated with a timestamp called a business or event timestamp. To offload ledgers to cold storage (and also to seal the data), we need a class of indexes that can query data in event time-range batches. What differentiates this index is that data is read in deterministic time-range batches that are orders of magnitude larger than the reads served by the two index types above.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-full is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1020&quot; height=&quot;591&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig5-e1711743069138.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084325&quot; style=&quot;aspect-ratio:1.7258883248730965;width:701px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1020,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig5-e1711743069138.png 1020w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig5-e1711743069138.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig5-e1711743069138.png 768w&quot; sizes=&quot;(max-width: 1020px) 100vw, 1020px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: time-range indexes used in data-tiering.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The following is an example of a time-range query on ledgers:&lt;/p&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;SELECT * FROM&lt;/strong&gt; LEDGER_TABLE &lt;strong&gt;WHERE&lt;/strong&gt; LedgerTime &lt;strong&gt;BETWEEN&lt;/strong&gt; 1701252000000 &lt;strong&gt;AND&lt;/strong&gt; 1701253800000&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Ledger&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;LedgerTime&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;{trip started}&lt;/td&gt;&lt;td&gt;10:01 am&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;{trip completed and fare adjusted}&lt;/td&gt;&lt;td&gt;10:15 am&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;{post trip corrections}&lt;/td&gt;&lt;td&gt;12:01 pm&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;There are a few ways to model this in a distributed database. We will dive into the key differences between building the time-range index on top of Amazon DynamoDB and building it on Docstore. Both DynamoDB and Docstore, being distributed databases, provide partition keys and sort keys as data modeling constructs. The partition key distributes data evenly across partitions based on its value, and the sort key controls the sort order of the data within a partition.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-design-with-dynamodb&quot;&gt;Design with DynamoDB&lt;/h3&gt;


&lt;p&gt;DynamoDB provides two modes for managing table read/write &lt;a href=&quot;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;capacity&lt;/a&gt;: provisioned and on-demand. We used the provisioned mode since our traffic is not &lt;a href=&quot;https://docs.aws.amazon.com/wellarchitected/latest/serverless-applications-lens/capacity.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;bursty&lt;/a&gt; enough to require on-demand mode. The provisioned mode was configured with &lt;a href=&quot;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;auto scaling&lt;/a&gt; to adjust capacity based on the traffic pattern.&lt;/p&gt;


&lt;p&gt;As the write pattern above shows, ledger times are generally correlated with the current wall-clock time, so these values tend to cluster around the current time. If we were to partition the data at a granularity of, say, &lt;strong&gt;G&lt;/strong&gt; time units, all the writes within a given window of &lt;strong&gt;G&lt;/strong&gt; time units would go to the same physical partition, causing a hot partition. DynamoDB restricts throughput on &lt;a href=&quot;https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;hot partitions&lt;/a&gt;, leading to throttling of write requests, which is not acceptable in the online write path. Assuming 1K peak Uber trips/s, even G=1 second is not a good value, since it corresponds to 1K WCU (Write Capacity Units), which is the peak write rate a single partition allows before &lt;a href=&quot;https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;throttling&lt;/a&gt; happens.&lt;/p&gt;


&lt;p&gt;While it might seem like we could just make the partitioning more fine-grained, that is still not foolproof, since an increase in traffic over time can lead to instability. Finer partitioning also increases the number of reads that must be performed via scatter-gather. So, in the case of DynamoDB, we used the following two-table design:&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-write-optimized-temporary-index-table-called-buffer-index&quot;&gt;Write-optimized temporary index table (called buffer index)&lt;/h4&gt;


&lt;p&gt;All online time-index writes go to the buffer index table. Inserted index items are partitioned into &lt;strong&gt;&lt;em&gt;M&lt;/em&gt;&lt;/strong&gt; unique buckets based on a hash modulo of the corresponding record to &lt;a href=&quot;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;uniformly distribute&lt;/a&gt; load across partitions in the buffer index table, making it &lt;em&gt;write-efficient&lt;/em&gt;. The value of &lt;strong&gt;&lt;em&gt;M&lt;/em&gt;&lt;/strong&gt; is chosen high enough that the load per partition avoids excessive splitting, yet low enough to limit the amount of scatter-gather to perform during reads.&lt;/p&gt;
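
&lt;p&gt;The bucketing described above can be sketched as follows. This is a minimal illustration rather than LedgerStore’s actual code: the value of M, the SHA-256 hash, the key format, and the function names are all assumptions made for the example.&lt;/p&gt;

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical value of M, chosen only for illustration


def bucket_for(record_id, num_buckets=NUM_BUCKETS):
    """Map a record ID to one of num_buckets buckets using a stable hash."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take it modulo M.
    return int.from_bytes(digest[:8], "big") % num_buckets


def buffer_partition_key(record_id, num_buckets=NUM_BUCKETS):
    """Build a hypothetical partition key for the buffer index table."""
    return "bucket#{:03d}".format(bucket_for(record_id, num_buckets))


def scatter_gather_keys(num_buckets=NUM_BUCKETS):
    """A read must visit every bucket, so its fan-out grows linearly with M."""
    return ["bucket#{:03d}".format(b) for b in range(num_buckets)]
```

&lt;p&gt;The trade-off in choosing M is visible here: a larger M spreads writes over more partition keys, but every read has to query all the keys returned by scatter_gather_keys.&lt;/p&gt;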


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-read-optimized-permanent-index-table&quot;&gt;Read-optimized permanent index table&lt;/h4&gt;


&lt;p&gt;The need for scatter-gather reads makes the buffer tables inefficient for reads, and since reads can happen throughout the lifecycle of a table, we need to optimize them. This brings the need for a read-efficient permanent index table.&lt;/p&gt;


&lt;p&gt;A permanent time-range index table is partitioned on the timestamp aligned to a certain time duration &lt;strong&gt;&lt;em&gt;N&lt;/em&gt;&lt;/strong&gt; (say, 10 minutes). Index entries from the buffer tables are periodically written in batches to the permanent index table. Since these writes are done in batches and in the background, any write throttling here does not affect online traffic. Another advantage of batching is that the write traffic can be spread across partitions, reducing hot partitioning. The buffer index tables are deleted after their entries are offloaded to the permanent index table, since they are no longer needed. Reads on the permanent index table are done in intervals of &lt;strong&gt;&lt;em&gt;N&lt;/em&gt;&lt;/strong&gt; minutes without any scatter-gather, making this table &lt;em&gt;read-efficient&lt;/em&gt;.&lt;br&gt;&lt;br&gt;The following figure depicts the time-range index flow in the case of DynamoDB. The dual-table design also requires state management and coordination so that reads go to the correct index table.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;920&quot; height=&quot;1024&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=920,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084328&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=920,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png 920w, https://blog.uber-cdn.com/cdn-cgi/image/width=269,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png 269w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1380,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png 1380w, https://blog.uber-cdn.com/cdn-cgi/image/width=1499,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig6-e1711743514668.png 1499w&quot; sizes=&quot;(max-width: 920px) 100vw, 920px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: Time-range index design on DynamoDB.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;
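
&lt;p&gt;As a rough sketch of how the permanent index table’s partition key could be derived, the snippet below aligns an epoch-millisecond timestamp to the start of its N-minute window and lists the windows a range read would visit. The function names are hypothetical, and treating the range as half-open (end excluded) is an assumption of this example.&lt;/p&gt;

```python
WINDOW_MS = 10 * 60 * 1000  # N = 10 minutes, the example duration used in the text


def window_start(ledger_time_ms, window_ms=WINDOW_MS):
    """Align an epoch-millisecond timestamp down to the start of its window."""
    return ledger_time_ms - (ledger_time_ms % window_ms)


def windows_for_range(start_ms, end_ms, window_ms=WINDOW_MS):
    """List the window starts covering the half-open range [start_ms, end_ms)."""
    first = window_start(start_ms, window_ms)
    return list(range(first, end_ms, window_ms))
```

&lt;p&gt;For the 30-minute range used in the query example earlier (1701252000000 to 1701253800000), windows_for_range returns three window starts, so the read touches exactly three partition keys and needs no scatter-gather across buckets.&lt;/p&gt;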


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-design-with-docstore&quot;&gt;Design with Docstore&lt;/h3&gt;


&lt;p&gt;The two-table design on DynamoDB functions well and can handle high throughput, but it introduces operational challenges. If the temporary buffer tables are not created in time, writes cannot be accepted and fail, and this has caused availability issues in the past. We re-architected our index storage backend from DynamoDB to Uber’s &lt;a href=&quot;https://www.uber.com/blog/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt; database to improve cost efficiency. As part of this re-architecture, we also improved the time-range index design to overcome the downside of maintaining two tables, by leveraging two Docstore properties:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;a href=&quot;https://www.uber.com/blog/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt; is a distributed database built on top of MySQL, with a fixed number of shards assigned to a variable number of physical partitions. As the data size grows, the number of physical partitions increases and some of the existing shards are re-assigned to the new partitions. Because the shard count is fixed, there is an upper &lt;strong&gt;limit&lt;/strong&gt; on the number of physical partitions.&lt;/li&gt;


&lt;li&gt;Data in Docstore is stored &lt;strong&gt;sorted&lt;/strong&gt; by the primary key (partition + sort keys).&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;We maintain just one table for the time-range index, wherein the index entries are partitioned on the full timestamp value. Because the timestamp is extremely granular, most writes are uniformly distributed across partitions, so there is no hot partitioning (and hence no write throttling).&lt;/p&gt;


&lt;p&gt;Reads involve a prefix scan of each shard of the table up to a certain time granularity. A prefix scan is very similar to a regular table scan, except that the boundaries of each scan request are controlled by the application. So, in the example below, to read 30 minutes of data, reads could be done over a 10-minute interval from 2023-02-03 01:00:00 to 2023-02-03 01:10:00 and then repeated for the next two sub-windows. Since data is sorted on the primary key, a prefix scan with the given boundaries ensures that only data lying within these timestamps is read.&lt;/p&gt;


&lt;p&gt;A scatter-gather followed by a sort-merge across shards is then performed to obtain all time-range index entries in the given window, in sorted order. Since the number of shards is fixed in Docstore, we can precisely determine (and bound) the number of read requests that need to be performed. The same technique is not applicable in the case of DynamoDB, since the number of partitions keeps increasing as the table size grows. This has significantly simplified the design and reduced the operational maintenance cost of our time-range indexes.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;914&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig7-e1711743568767.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084329&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig7-e1711743568767.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig7-e1711743568767.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig7-e1711743568767.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1080,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig7-e1711743568767.png 1080w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Time-range index design on Docstore.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;
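
&lt;p&gt;The read path described in this section (a bounded scan per shard, then a scatter-gather and sort-merge) can be approximated in a few lines. This is a simplified in-memory sketch, not the Docstore API: each shard is modeled as a Python list of (timestamp, ledger_id) tuples already sorted by primary key, and the function names are hypothetical.&lt;/p&gt;

```python
import bisect
import heapq


def prefix_scan(shard_rows, start_ts, end_ts):
    """Return one shard's rows whose timestamp is in [start_ts, end_ts).

    shard_rows is a list of (timestamp, ledger_id) tuples sorted by primary
    key, so the scan boundaries can be located with binary search.
    """
    # A 1-tuple sorts before any 2-tuple with the same timestamp, so these
    # probes land exactly on the first row at or after each boundary.
    lo = bisect.bisect_left(shard_rows, (start_ts,))
    hi = bisect.bisect_left(shard_rows, (end_ts,))
    return shard_rows[lo:hi]


def read_time_range(shards, start_ts, end_ts):
    """Scatter the scan to every shard, then sort-merge the sorted results."""
    per_shard = [prefix_scan(rows, start_ts, end_ts) for rows in shards]
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(heapq.merge(*per_shard))
```

&lt;p&gt;Because the number of shards is fixed, the length of per_shard is known up front, which is what bounds the number of read requests per window.&lt;/p&gt;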


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-index-lifecycle-management&quot;&gt;Index lifecycle management&lt;/h2&gt;


&lt;p&gt;New indexes are defined regularly, and some existing indexes are modified as use cases evolve. To support this with minimal effort and without causing regressions, we need a mechanism to manage the index lifecycle. It consists of the following components:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-index-lifecycle-state-machine&quot;&gt;Index lifecycle state machine&lt;/h3&gt;


&lt;p&gt;This component orchestrates the lifecycle of the index: creating the index table, backfilling it with historical index entries, validating them for completeness, swapping the old index with the new one for reads and writes, and decommissioning the old index.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;375&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084330&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1800,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig8-e1711743614440.png 1800w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Index lifecycle state machine.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-historical-index-data-backfill&quot;&gt;Historical index data backfill&lt;/h3&gt;


&lt;p&gt;When business use cases require new indexes to be defined, it is essential to backfill historical index entries so that the new indexes are complete. This component builds indexes from the historical data offloaded to cold storage and backfills them to the storage layer in a scalable fashion. Because data can be downloaded faster than it can be processed, the component is built with configurable rate limiting and batching. It is also reusable: the actual processing logic is supplied as a batch processor plugin.&lt;/p&gt;
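
&lt;p&gt;A minimal sketch of such a rate-limited, batched pipeline with a pluggable processor is shown below. The function names, parameters, and the simple per-batch pacing strategy are assumptions made for illustration and are not taken from the LedgerStore implementation.&lt;/p&gt;

```python
import time


def batches(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_backfill(records, processor, batch_size=100, max_batches_per_sec=10.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Feed records to a pluggable batch processor at a bounded rate.

    processor is any callable that accepts a list of records; sleep and clock
    are injectable so the pacing can be tested without real waiting.
    """
    min_interval = 1.0 / max_batches_per_sec
    processed = 0
    for batch in batches(records, batch_size):
        started = clock()
        processor(batch)
        processed += len(batch)
        # Sleep off whatever is left of this batch's time slot.
        remaining = min_interval - (clock() - started)
        sleep(max(0.0, remaining))
    return processed
```

&lt;p&gt;Swapping in a different processor callable (for example, one that builds index entries versus one that computes checksums) reuses the same batching and pacing logic.&lt;/p&gt;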


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;592&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig9-e1711743676584.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084331&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig9-e1711743676584.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig9-e1711743676584.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig9-e1711743676584.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1420,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig9-e1711743676584.png 1420w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Historical data processing module customized to backfill indexes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-index-validation&quot;&gt;Index validation&lt;/h3&gt;


&lt;p&gt;After indexes are backfilled, they need to be verified for completeness. This is done by an offline job that computes order-independent checksums at a certain time-window granularity and compares them between the source-of-truth data and the index table. This step catches bugs in the index backfill process: if even one entry is missed, the aggregate checksums for that time window will not match.&lt;/p&gt;
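
&lt;p&gt;One common way to build an order-independent checksum is to hash each entry and combine the hashes with a commutative operation such as addition modulo 2^64. The sketch below uses that approach as an assumption; the article does not state which checksum LedgerStore uses, and the window size and function names are hypothetical.&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # hypothetical validation window size


def entry_hash(entry_key):
    """Hash one index entry key to a 64-bit unsigned integer."""
    digest = hashlib.sha256(entry_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def window_checksums(entries, window_ms=WINDOW_MS):
    """Map each window start to the checksum of its (timestamp_ms, key) entries."""
    sums = defaultdict(int)
    for ts, key in entries:
        window = ts - (ts % window_ms)
        # Addition modulo 2**64 is commutative, so read order does not matter.
        sums[window] = (sums[window] + entry_hash(key)) % (2 ** 64)
    return dict(sums)


def mismatched_windows(source_entries, index_entries, window_ms=WINDOW_MS):
    """Return the window starts whose checksums differ between the two sides."""
    source = window_checksums(source_entries, window_ms)
    index = window_checksums(index_entries, window_ms)
    windows = set(source) | set(index)
    return sorted(w for w in windows if source.get(w) != index.get(w))
```

&lt;p&gt;Only the windows returned by mismatched_windows need to be re-examined entry by entry, which keeps the comparison cheap for windows that already match.&lt;/p&gt;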


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;350&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png&quot; alt=&quot;&quot; class=&quot;wp-image-1084332&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1640,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/04/Fig10-e1711743725757.png 1640w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: Index completeness validation.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-highlights&quot;&gt;Highlights&lt;/h2&gt;


&lt;p&gt;This is how we measured the success of this critical project:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;We built over 2 trillion unique indexes, and not a single data inconsistency has been detected so far, with the new architecture in production for over 6 months.&lt;/li&gt;


&lt;li&gt;Not a single production incident was observed during the backfill, which matters given how critical money movement is for Uber.&lt;/li&gt;


&lt;li&gt;We also moved all these indexes from DynamoDB to Docstore. So the project also resulted in technology consolidation, reducing external dependencies.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;From a business impact perspective, operating LedgerStore is now very cost-effective due to reduced spend on DynamoDB. The estimated savings are over $6 million per year.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;Ledgers are the source of truth for money movement events at Uber. The robust indexing platform we have built supports access to these source-of-truth ledgers for various business use cases, and we look forward to supporting many more indexes on this platform in the future.&lt;/p&gt;


&lt;p&gt;We would like to conclude with some key takeaways. Maintaining petabyte-scale indexes in an OLTP system brings certain challenges, such as imbalanced partitioning, high read/write amplification, and noisy-neighbor problems, so data modeling and isolation are important aspects to consider when designing these systems. Further, the design methodology can differ significantly depending on the database used underneath for storage, as the contrast between the time-range index designs on two different distributed databases shows.&lt;/p&gt;


&lt;p&gt;Join us next week to see part two of the LedgerStore series where we chronicle a migration from DynamoDB to LedgerStore.&lt;/p&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;


&lt;p&gt;This project would not have been possible without collaboration from the following teams, embodying several &lt;a href=&quot;https://www.uber.com/in/en/careers/values/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Uber values&lt;/a&gt;:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;The &lt;a href=&quot;https://www.uber.com/blog/payments-platform/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Gulfstream&lt;/a&gt; team, who worked closely with the LedgerStore team to align on common goals and migrate onto the LedgerStore platform, a multi-year project.&lt;/li&gt;


&lt;li&gt;The Docstore team, for evolving &lt;a href=&quot;https://www.uber.com/en-IN/blog/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt; to meet the massive scale requirements of LedgerStore’s indexes.&lt;/li&gt;


&lt;li&gt;The LedgerStore team for leading, building, and driving the adoption of ledger indexes at large scale.&lt;/li&gt;
&lt;/ul&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Amazon Web Services, AWS, the Powered by AWS logo, and Amazon DynamoDB are trademarks of Amazon.com, Inc. or its affiliates.&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/how-ledgerstore-supports-trillions-of-indexes/</link><guid isPermaLink="false">https://www.uber.com/blog/how-ledgerstore-supports-trillions-of-indexes/</guid><pubDate>Thu, 04 Apr 2024 05:30:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category><category>Data / ML</category></item><item><title>Scaling AI/ML Infrastructure at Uber</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;Machine Learning (ML) is celebrating its 8th year at Uber since we first started using complex rule-based machine learning models for driver-rider matching and pricing teams in 2016. Since then, our progression has been significant, with a shift towards employing deep learning models at the core of most business-critical applications today, while actively exploring the possibilities offered by Generative AI models. As the complexity and scale of AI/ML models continue to surge, there’s a growing demand for highly efficient infrastructure to support these models effectively. Over the past few years, we’ve strategically implemented a range of infrastructure solutions, both CPU- and GPU-centric, to scale our systems dynamically and cater to the evolving landscape of ML use cases. This evolution has involved tailored hardware SKUs, software library enhancements, integration of diverse distributed training frameworks, and continual refinements to our end-to-end Michelangelo platform. These iterative improvements have been driven by our learnings along the way, and continuous realignment with industry trends and Uber’s trajectory, all aimed at meeting the evolving requirements of our partners and customers.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-goal-and-key-metrics&quot;&gt;&lt;strong&gt;Goal and Key Metrics&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;As we embarked on the transition from on-premise to cloud infrastructure that we &lt;a href=&quot;https://www.wsj.com/articles/uber-signs-cloud-deals-with-google-and-oracle-b45a9372&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;announced&lt;/a&gt; in February 2023, our HW/SW co-design and collaboration across teams was driven by the specific objectives of:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Maximizing the utilization of current infrastructure&lt;/li&gt;


&lt;li&gt;Establishing new systems for emerging workloads, such as Generative AI&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;In pursuit of these goals, we outlined distinct key results and metrics that guide our progress.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Feasibility and Reliability:&lt;/strong&gt; ML users expect successful convergence of their training tasks without errors within an expected time frame (either weeks or months based on complexity). For instance, training larger and more complex models like Falcon 180B™ can take many months, and longer training durations heighten the likelihood of reliability issues. Hence, our target is to achieve 99% uptime SLA for all training dependencies to ensure consistent and reliable outcomes.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Efficiency:&lt;/strong&gt; Our focus on efficiency involves thorough benchmarking of diverse GPU configurations and assessing price-performance ratios of on-prem and cloud SKUs tailored to specific workloads. We gauge training efficiency using metrics like Model Flops Utilization (MFU) to guarantee optimal GPU utilization. Our aim is to prevent idle GPUs by opportunistically running training jobs during serving’s off-peak hours through reactive scaling, and to uphold high utilization rates to maximize resource efficiency. We want to do this while also maintaining fairness of resource sharing between different users.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Developer Velocity:&lt;/strong&gt; This metric is quantified by the number of experiments our engineers can conduct within a specific timeframe. We prioritize a mature ecosystem to bolster developer velocity, ensuring our teams work efficiently to deliver optimal solutions. This approach not only streamlines the path of our state-of-the-art models to production but also reduces the time taken for this transition.&lt;/p&gt;


&lt;p&gt;What follows next is a summary of results from various initiatives that we are taking to make training and serving deployments efficient and scalable, across both on-prem and cloud infrastructure:&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-optimizing-existing-on-prem-infrastructure&quot;&gt;&lt;strong&gt;Optimizing Existing On-prem Infrastructure&lt;/strong&gt;&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-federation-of-batch-jobs&quot;&gt;Federation of Batch Jobs&lt;/h3&gt;


&lt;p&gt;Our GPU assets are distributed over multiple &lt;a href=&quot;https://kubernetes.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Kubernetes&lt;/a&gt;™ clusters in various Availability Zones and Regions. This distribution is primarily due to GPU availability and the node count limitation within a single Kubernetes cluster. This arrangement presents two primary challenges:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Exposure of infrastructure specifics to Machine Learning Engineers.&lt;/li&gt;


&lt;li&gt;Inconsistent resource utilization across clusters due to static allocation. Although we had an effective resource-sharing system within each cluster, we lacked the capability for inter-cluster scheduling.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;To address these issues, we created a unified federation layer for our batch workloads, including &lt;a href=&quot;https://www.ray.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Ray&lt;/a&gt;™ and Apache &lt;a href=&quot;https://spark.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Spark&lt;/a&gt;™, called &lt;strong&gt;Michelangelo Job Controller&lt;/strong&gt;. This component serves as a centralized interface for all workload scheduling, conceals the underlying Kubernetes clusters, and allocates workloads based on various policies (load aware, bin-pack), including compute and data affinity considerations. We plan to share more technical details on this in a subsequent blog post.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/BQBvxC9TSCSgAj8ZlAAL8Dkydbx3B__KT9nHfrs7eUfLEU6CVUiGo4uG6QUmWkJ0piVRdwkjSioJ-Q80JmKI7pFzlHOEssw3DTZlou544_4uJkyYHdbC55OkKSQfq7ZyL9x8yN8iuZ6a8Hv5RMg0-xk&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 1: Unified federation layer for ML workload allocation.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-network-upgrade-for-llm-training-efficiency&quot;&gt;Network Upgrade for LLM training efficiency&lt;/h3&gt;


&lt;p&gt;When expanding infrastructure to accommodate Generative AI applications and improving the efficiency of distributed training while fine-tuning open-source LLMs, it is important to focus on scaling network bandwidth across both scale-up and scale-out configurations. This requires critical features such as full-mesh &lt;a href=&quot;https://www.nvidia.com/en-us/design-visualization/nvlink-bridges/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;NVLink&lt;/a&gt;™ connectivity among GPUs, faster network links, effective congestion control, QoS controls, and dedicated rack and network topologies.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;593&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade-1024x593.png&quot; alt=&quot;&quot; class=&quot;wp-image-1083848&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig2_network_upgrade.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 2: Training efficiency improvements through network link capacity upgrades.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;We summarize findings from a Large Language Model (LLM) case study that show how much network bandwidth and congestion control affect training effectiveness and price-performance. Compared to our existing network interconnect, higher network bandwidth and better congestion control delivered nearly a two-fold increase in training speed, with a correspondingly shorter training duration. During multi-node training, duplicating data across nodes increases local memory demand and adds to the IO workload. Our analysis led to a recommendation to increase network link capacity by 4x (25GB/s to 100GB/s) on each GPU server, potentially doubling the available training capacity. While building out these upgrades, we also need to ensure, through proper isolation and QoS controls, that the “elephant flows” generated by large training runs don’t negatively impact other high-priority services.&lt;/p&gt;
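&lt;p&gt;A back-of-the-envelope model helps explain why a 4x link upgrade can translate into roughly 2x faster training rather than 4x. The sketch below uses the standard bandwidth term of a ring all-reduce, where each node moves 2(N-1)/N of the payload, and assumes communication is not overlapped with compute. The payload size, node count, and compute time are hypothetical; the link speeds take the figures quoted above at face value.&lt;/p&gt;

```python
def ring_allreduce_seconds(payload_bytes, num_nodes, link_bytes_per_sec):
    """Bandwidth term of a ring all-reduce: each node moves 2*(N-1)/N of the payload."""
    if num_nodes == 1:
        return 0.0
    return 2 * (num_nodes - 1) / num_nodes * payload_bytes / link_bytes_per_sec


def step_seconds(compute_seconds, payload_bytes, num_nodes, link_bytes_per_sec):
    """Step time assuming no overlap between compute and communication (pessimistic)."""
    return compute_seconds + ring_allreduce_seconds(
        payload_bytes, num_nodes, link_bytes_per_sec
    )


# Hypothetical workload: 14e9 bytes of gradients, 4 nodes, 0.5 s of compute per step.
payload = 14e9
base = step_seconds(0.5, payload, 4, 25e9)
upgraded = step_seconds(0.5, payload, 4, 100e9)
print(f"baseline {base:.2f} s, upgraded {upgraded:.2f} s, speedup {base / upgraded:.2f}x")
```

&lt;p&gt;Because only the communication share of each step shrinks, the overall speedup is bounded by how much of the step was spent on the network in the first place.&lt;/p&gt;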


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-memory-upgrade-to-improve-gpu-allocation-rates&quot;&gt;&lt;br&gt;Memory Upgrade to improve GPU allocation rates&lt;/h3&gt;


&lt;p&gt;Newer AI/ML workloads demand more system memory per GPU worker than we had originally designed for. Inherent physical constraints, such as the limited number of memory channels on each server and the DIMM capacities deployed during NPI (new product introduction), restricted our ability to scale up GPU allocations. To improve our GPU allocation rates, we have initiated an effort to double the amount of memory on these servers (16GB to 32GB per DIMM channel). We are also building a framework to repurpose and reuse the DIMMs when older racks are decommissioned. Such optimizations allow us to maximize utilization of our existing ML infrastructure. We will detail the efficiency gains achieved through this initiative in an upcoming post. In parallel, we have kicked off efforts to help rightsize the training jobs’ resource requirements. As demonstrated by others [&lt;a href=&quot;https://tianyin.github.io/pub/amp.pdf&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;ref&lt;/a&gt;], manually requesting the optimal resources is a hard problem, and automating it would help increase allocation efficiency.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-building-new-infrastructure&quot;&gt;&lt;strong&gt;Building New Infrastructure&lt;/strong&gt;&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-price-performance-evaluations-across-various-cloud-skus&quot;&gt;&lt;br&gt;Price-performance evaluations across various cloud SKUs&lt;/h3&gt;


&lt;p&gt;In late 2022, as we embarked on our transition to the cloud, we assessed various CPU and GPU models offered by different cloud providers. Our aim was to compare their price-performance ratios using established benchmarks, ranging from tree-based and deep learning models to large language models, alongside proprietary datasets and Uber’s models such as deepETA and deepCVR. These assessments, conducted for both training and serving, enabled us to select the most suitable SKUs for our specific workloads, considering factors like feasibility, cost, throughput, and latency. Throughout 2023, we extensively tested 17 different GPU and CPU SKUs, employing various libraries and optimizers, including Nvidia’s &lt;a href=&quot;https://github.com/NVIDIA/TensorRT&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;TensorRT&lt;/a&gt;™ (TRT) and TRT-LLM optimizations. For instance, as depicted in Figures 3 and 4, we found that while A10 GPUs might not offer the most cost-effective throughput for training tasks, they prove to be the optimal choice for our serving use cases, delivering the best throughput while maintaining an acceptable SLA using TRT.&lt;/p&gt;
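&lt;p&gt;One simple way to frame such a comparison for serving is to rank SKUs by throughput per dollar among those that meet the latency SLA. The sketch below is only an illustration of that idea: the &lt;code&gt;SkuResult&lt;/code&gt; type, the SKU names, and every number are made up, and real evaluations would also weigh feasibility and availability.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class SkuResult:
    """Hypothetical benchmark result for one cloud SKU."""
    sku: str
    throughput_qps: float    # measured queries per second
    p99_latency_ms: float    # measured tail latency
    hourly_cost_usd: float   # price per instance-hour


def queries_per_dollar(result):
    return result.throughput_qps * 3600 / result.hourly_cost_usd


def best_serving_sku(results, latency_sla_ms):
    """Pick the SKU with the most queries per dollar among those meeting the SLA."""
    eligible = [r for r in results if latency_sla_ms >= r.p99_latency_ms]
    if not eligible:
        return None
    return max(eligible, key=queries_per_dollar).sku


# Made-up numbers for illustration only.
results = [
    SkuResult("gpu-small", throughput_qps=900, p99_latency_ms=18, hourly_cost_usd=1.0),
    SkuResult("gpu-large", throughput_qps=2500, p99_latency_ms=9, hourly_cost_usd=4.0),
    SkuResult("cpu-only", throughput_qps=400, p99_latency_ms=45, hourly_cost_usd=0.4),
]
print(best_serving_sku(results, latency_sla_ms=20))
```

&lt;p&gt;Filtering on the SLA before ranking matters: in this toy data the cheapest option per query is excluded because it misses the latency budget.&lt;/p&gt;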


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/6lE7W5fYhVSHGudF-A9Fu6wzvEQo4-S7ZjnN8dSgy3HcTmC4pqn-RT9VvBF0kTwZYY3rCz6OU4BVQfZ5erhWA2UMlK-7VbUYQkqKF6SUb58hKjHdzbzH7GK6Do40TLNw7bg0GTxgQYgzQDDV-jbkXxg&quot; alt=&quot;&quot; title=&quot;Chart&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 3: Deep learning training and serving performance-price evaluation.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/jnlRbG2pls4P0xjxD9Pvv7fx4BW5_ZWXtSYLQLXfN317i1lPtxJPOxT6xbzNT8OIHEtAXarcDNNFeC-YRUAI0zh4t1sTBuPUfYeppJPdIwMpPDDJ7E8ddXBnNqde1rCZZVxCJiCIbpQ1qWd4oIGWyhk&quot; alt=&quot;&quot; title=&quot;Chart&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 4: Deep learning serving latency with and without TensorRT optimizations.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Numerous Generative AI applications at Uber require Nvidia’s newest H100 GPUs to satisfy their stringent latency requirements. This requirement stems from the H100 GPUs’ capabilities, which include up to 4x TFlops and double the memory bandwidth compared to the earlier generation A100 GPUs. While experimenting with the Meta™ &lt;a href=&quot;https://llama.meta.com/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Llama2&lt;/a&gt;™ model series across various batch sizes, quantizations, and model parameters, we evaluated various open- and closed-source frameworks to further optimize LLM serving performance. In Figures 5 and 6, we present a specific case where we use two metrics, per-token latency (ms/token) and tokens/sec/gpu, to compare the model’s performance across two of the top-performing frameworks (TRT-LLM and a currently confidential framework, referred to as Framework B), keeping all other parameters constant and using FP16 quantization.&lt;/p&gt;
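&lt;p&gt;For readers unfamiliar with the two metrics, the sketch below shows one common way to derive them from a timed generation run. The function names and the example run are our own assumptions; actual benchmarks may separate prompt processing from token generation.&lt;/p&gt;

```python
def per_token_latency_ms(total_seconds, output_tokens):
    """Average time to generate one output token for a request."""
    return total_seconds * 1000 / output_tokens


def tokens_per_sec_per_gpu(batch_size, output_tokens, total_seconds, num_gpus):
    """Aggregate generation throughput normalized by the number of GPUs used."""
    return batch_size * output_tokens / total_seconds / num_gpus


# Hypothetical run: batch of 8 requests, 256 output tokens each, 8.0 s wall time, 2 GPUs.
print(per_token_latency_ms(8.0, 256))
print(tokens_per_sec_per_gpu(8, 256, 8.0, 2))
```

&lt;p&gt;The two metrics pull in different directions: larger batches usually raise tokens/sec/gpu while lengthening per-token latency, which is why throughput comparisons are made under a fixed latency budget.&lt;/p&gt;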


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/qs9gWGpDUlhVZrug3qN88E-xDt8PKhA0Rd-R_WL8ECGq-DG50lJ7rz1nT37PxgIRyM0k2ypyN2Aq4v_FTnve8_tq9ayMOkm4cRd0UoTzSYbalhD2CDKQINp5F8uxVVDS8Y1ol-1MeKZK3pGzcSfd4dw&quot; alt=&quot;&quot; title=&quot;Chart&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 5: LLM serving latency comparison by framework (H100).&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/rhL0z9GvoYxR0yiwFONGfGaHXsFNo_kORbuXeb9b2c8M0NEe_jJCbeEDThJUVHc05NeSt84xDjN7m5Skd2SvbzfHTUyz-MaaEhAffQoBtuKe3k7RZPVDQFDk7ZsgMYyFXfwCX-wL2dldeo2tHYzce-A&quot; alt=&quot;&quot; title=&quot;Chart&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 6: LLM serving throughput comparison by framework using the same latency budget and minimum number of GPUs required (H100).&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;These experiment results demonstrate that Framework B (the confidential framework) delivers a two-fold reduction in latency and a six-fold improvement in throughput compared to TRT-LLM. They further underscore the significance of HW/SW co-design: to fully leverage hardware capabilities, it is essential to have the right solutions across the entire stack.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-llm-training-efficiency-improvements-with-memory-offload&quot;&gt;&lt;br&gt;LLM Training efficiency improvements with memory offload&lt;/h3&gt;


&lt;p&gt;In this section, we outline our framework for design and experimentation regarding the placement of optimizer states, parameters and gradients from GPU memory to either CPU memory or NVMe devices for large language models. Our aim is to evaluate how this offload impacts GPU scalability, training efficiency, and a range of system metrics. &lt;/p&gt;
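&lt;p&gt;The caption of Fig 8 names DeepSpeed as the offload implementation. As a rough sketch of what that design space looks like in practice, the helper below builds a ZeRO stage 3 configuration dictionary with optional CPU or NVMe targets for optimizer states and parameters. The helper name &lt;code&gt;build_offload_config&lt;/code&gt;, the NVMe path, and the micro-batch size are hypothetical, and this is not necessarily the configuration used in our experiments.&lt;/p&gt;

```python
import json


def build_offload_config(optimizer_device="cpu", param_device=None,
                         nvme_path=None, micro_batch=4):
    """Build a DeepSpeed-style ZeRO stage 3 config dict with optional offload targets."""

    def target(device):
        entry = {"device": device}
        if device == "cpu":
            entry["pin_memory"] = True  # pinned host memory speeds up transfers
        if device == "nvme":
            if nvme_path is None:
                raise ValueError("nvme_path is required when offloading to NVMe")
            entry["nvme_path"] = nvme_path
        return entry

    zero = {"stage": 3}
    if optimizer_device is not None:
        zero["offload_optimizer"] = target(optimizer_device)
    if param_device is not None:
        zero["offload_param"] = target(param_device)
    return {"train_micro_batch_size_per_gpu": micro_batch, "zero_optimization": zero}


# Hypothetical setup: optimizer states on CPU, parameters on a local NVMe mount.
config = build_offload_config("cpu", "nvme", nvme_path="/local_nvme", micro_batch=8)
print(json.dumps(config, indent=2))
```

&lt;p&gt;Each offload target trades GPU memory for extra traffic over PCIe or storage, which is why the experiments track system metrics alongside training efficiency.&lt;/p&gt;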


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;637&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space-1024x637.png&quot; alt=&quot;&quot; class=&quot;wp-image-1083850&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 2048w, https://blog.uber-cdn.com/cdn-cgi/image/width=2120,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Fig6_memory_offload_design_space.png 2120w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 7: Design framework for memory offload experimentation.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Our experiment results demonstrated that memory offload significantly improves our ability to train large models that were previously limited by GPU memory. Offloading from GPU memory to system memory, or even to NVMe devices, boosted training efficiency by enabling larger batch sizes with the same number of GPUs. This shift resulted in a 2x increase in MFU (model flops utilization) while reducing GPU usage by 34%. However, this improvement comes with a corresponding reduction in network throughput. A detailed Open Compute Project (&lt;a href=&quot;https://www.opencompute.org/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;OCP&lt;/a&gt;) conference talk on this topic can be found &lt;a href=&quot;https://www.youtube.com/watch?v=Ju0r8yU1_Lw&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;here&lt;/a&gt;.&amp;nbsp;&lt;/p&gt;
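&lt;p&gt;MFU is the ratio of the floating-point throughput a training job actually achieves to the hardware’s theoretical peak. A minimal sketch follows, using the widely used approximation of 6 FLOPs per parameter per token for transformer training (attention FLOPs are ignored). The model size, token rate, GPU count, and peak value are hypothetical inputs, not our measurements.&lt;/p&gt;

```python
def model_flops_utilization(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Achieved training FLOPs per second divided by aggregate peak FLOPs per second."""
    achieved = 6 * num_params * tokens_per_second  # forward plus backward pass estimate
    return achieved / (num_gpus * peak_flops_per_gpu)


# Hypothetical job: 7e9 parameters on 8 GPUs with an assumed peak of 312e12 FLOPs/s each.
before = model_flops_utilization(7e9, 10_000, 8, 312e12)
after = model_flops_utilization(7e9, 20_000, 8, 312e12)
print(f"MFU before: {before:.3f}, after: {after:.3f}")
```

&lt;p&gt;On fixed hardware, MFU scales linearly with tokens processed per second, so a larger batch that doubles throughput on the same GPUs doubles MFU.&lt;/p&gt;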


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/L8HJjzd4fJLc0rv2q8ArOqvVbN-mlHlKwGTqoYg6dPciNtvU6kjAn3NK3eVfjp1WdeqLgJu2zIhmRNxeM5vkX3E47bIRNezH-X9ytjOWzh7lRI0V9OFtZMzOtUbaZXJjf6HsXy4oznVON7V6JTQt7zc&quot; alt=&quot;&quot; title=&quot;Chart&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Fig 8: Training efficiency with the DeepSpeed memory offload optimization.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;To conclude, we’d like to leave you with three key insights. First, designing a single AI/ML system amid rapid application and model advancements, spanning from XGBoost to deep learning recommendation models and large language models, poses considerable challenges. For instance, while LLMs demand high TFlops, deep learning models can encounter memory limitations. To make these systems cost-effective, it is imperative to explore workload-optimized solutions based on efficiency metrics like cost-to-serve and performance per dollar within a given SLA. Second, maximizing infrastructure efficiency requires a collaborative hardware and software design approach across all layers of the system. In this post, we’ve showcased various examples of how to leverage existing infrastructure effectively while building new capabilities to scale it efficiently. Lastly, we invite industry partnerships and encourage engagement in open-source optimizations to drive efficiency and to exchange ideas and learnings on scaling infrastructure for the evolving demands of the AI landscape.&lt;/p&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;&lt;strong&gt;Acknowledgments&amp;nbsp;&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Many thanks to the Uber AI Infrastructure, OCI, GCP, and Nvidia team members for their collaboration on the above work.&amp;nbsp;&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Apache®, Apache Kafka, Kafka, Apache Spark, Spark, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Kubernetes® and its logo are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Falcon 180B® and its logo are registered trademarks of Technology Innovation Institute™ in the United States and other countries. No endorsement by Technology Innovation Institute is implied by the use of these marks.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;LLaMA 2® and its logo are registered trademarks of Meta® in the United States and other countries. No endorsement by Meta is implied by the use of these marks.&lt;/p&gt;
</description><link>https://www.uber.com/blog/scaling-ai-ml-infrastructure-at-uber/</link><guid isPermaLink="false">https://www.uber.com/blog/scaling-ai-ml-infrastructure-at-uber/</guid><pubDate>Thu, 28 Mar 2024 07:28:34 GMT</pubDate><author>Uber</author><category>Engineering</category><category>AI</category><category>Data / ML</category></item><item><title>Model Excellence Scores: A Framework for Enhancing the Quality of Machine Learning Systems at Scale</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Machine learning (ML) is integral to Uber’s operational strategy, influencing a range of business-critical decisions. This includes predicting rider demand, identifying fraudulent activities, enhancing Uber Eats’ food discovery and recommendations, and refining estimated times of arrival (ETAs). Despite the growing ubiquity and impact of ML in various organizations, evaluating model “quality” remains a multifaceted challenge. A notable distinction exists between online and offline model assessment. Many teams primarily focus on offline evaluation, occasionally complementing this with short-term online analysis. However, as models become more integrated and automated in production environments, continuous monitoring and measurement are often overlooked.&lt;/p&gt;


&lt;p&gt;Commonly, teams concentrate on performance metrics such as AUC and RMSE, while neglecting other vital factors like the timeliness of training data, model reproducibility, and automated retraining. This lack of comprehensive quality assessment leads to limited visibility for ML engineers and data scientists regarding the various quality dimensions at different stages of a model’s lifecycle. Moreover, this gap hinders organizational leaders from making fully informed decisions regarding the quality and impact of ML projects.&lt;/p&gt;


&lt;p&gt;To bridge this gap, we propose defining distinct dimensions for each phase of a model’s lifecycle, encompassing prototyping, training, deployment, and prediction (See Figure 1). By integrating the Service Level Agreement (SLA) concept, we aim to establish a standard for measuring and ensuring ML model quality. Additionally, we are developing a unified system to track and visualize the compliance and quality of models, thereby providing a clearer and more comprehensive view of ML initiatives across the organization. Note that Model Excellence Scores (MES) cover certain technical aspects that are integral to Uber’s overall ML governance.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/isX-qlXXKvyRNL8ZrqG7ecOuLg3MzE-EeDYMT1JdrdIESlDE_PMXJcc7hHT0c4496eN18aSxij2pX5zVCRTGiovjtyGurwVEV9w1jaX-YUq2hj1huPJQM39GmLqyOlzl-HPxfhC8YP5Lp2KTLhlunbw&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Example ML quality dimensions (in yellow) in a typical ML system.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-model-excellence-scores-mes&quot;&gt;Model Excellence Scores (MES)&lt;/h2&gt;


&lt;p&gt;The development and maintenance of a production-ready ML system are intricate, involving numerous stages in the model lifecycle and a complex supporting infrastructure. Typically, an ML model undergoes phases like feature engineering, training, evaluation, and serving. The infrastructure to sustain this includes data pipelines, feature stores, model registries, distributed training frameworks, model deployment, prediction services, and more.&lt;/p&gt;


&lt;p&gt;To offer a comprehensive evaluation of model quality across these phases, we created and introduced the Model Excellence Scores (MES) framework. MES is designed to measure, monitor, and enforce quality across each stage of the ML lifecycle. This framework aligns with principles and terminologies common among site reliability engineers (SREs) and DevOps professionals, particularly those used in managing microservices reliability in production environments.&lt;/p&gt;


&lt;p&gt;MES revolves around three fundamental concepts related to Service Level Objectives (SLOs): indicators, objectives, and agreements. Indicators are precise quantitative measures reflecting some aspect of an ML system’s quality. Objectives set target ranges for these indicators, and agreements combine all indicators at an ML use case level, dictating the overall PASS/FAIL status based on the indicator results.&lt;/p&gt;


&lt;p&gt;Each indicator in MES is clearly defined and has a set target range for its metric value, with a specified frequency for value updates. If an indicator falls short of its objective within a given time frame, it’s marked as failing. Agreements, which encapsulate these indicators, represent the commitment level of the service and provide insights into its performance. Figure 2 illustrates the interconnections between agreements, indicators, and objectives, and how they relate to specific use cases and models.&lt;/p&gt;
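&lt;p&gt;The relationship between indicators, objectives, and agreements can be sketched as a small data model. This is a simplified illustration rather than the MES implementation: the class and function names are hypothetical, objectives are reduced to an inclusive numeric range, and we assume an agreement passes only when every indicator passes.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """A measured quality metric together with its objective (target range)."""
    name: str
    value: float   # latest measured metric value
    lower: float   # objective: inclusive lower bound
    upper: float   # objective: inclusive upper bound

    def passing(self) -> bool:
        return self.upper >= self.value >= self.lower


def agreement_status(indicators):
    """Return the overall status and the names of failing indicators."""
    failing = [i.name for i in indicators if not i.passing()]
    return ("PASS" if not failing else "FAIL", failing)


# Hypothetical use case with two indicators normalized to [0, 1].
use_case = [
    Indicator("prediction_accuracy", value=0.91, lower=0.85, upper=1.0),
    Indicator("feature_drift", value=0.30, lower=0.0, upper=0.2),
]
print(agreement_status(use_case))
```

&lt;p&gt;Returning the failing indicator names alongside the status keeps the result actionable, since owners can see which objective was missed.&lt;/p&gt;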


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/bvZ0Oj6F3DG-ZDLprNYgODtxB444TWmjTAEMqNlVxgeozimoLuFjXa6m6LVCKCAE7CrTaDD6t2SgvAUUPUmEQaYMe8bbfDOtMgsn10i2W1A3AtQemO7gpOORuYiFTPkjMhIst_8t9RtlITBHxK-dWWw&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Relationship among agreement, indicator, objective, use cases, and models.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Different indicators might necessitate varied timeframes for resolution and distinct mitigation strategies. Some may require immediate attention with higher priority handling, especially when performance benchmarks are not met.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;It’s also important to note that the roles and responsibilities associated with modeling can vary significantly between organizations. In some cases, a single team may handle the entire process, while in others, responsibilities may be distributed across multiple teams or departments.&lt;/p&gt;


&lt;p&gt;At Uber, the responsibility for each model is assigned to a designated primary team. This team receives alerts for any discrepancies or issues related to their model, as outlined in the agreement. Teams have the flexibility to tailor these alerts based on the significance and urgency of their ML use cases. It’s important to note that the quality of one model can influence another, either directly or indirectly. For instance, the output from one model might serve as input for another or trigger further model evaluations. To address this interconnectedness, we’ve implemented a notification system that informs both service and model owners of any quality violations in related ML models.&lt;/p&gt;
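&lt;p&gt;One way to reason about such notifications is as a traversal of a model dependency graph: when a model violates its agreement, the owners of every model downstream of it are informed as well. The sketch below assumes a simple adjacency map; the model names, team names, and the &lt;code&gt;owners_to_notify&lt;/code&gt; function are hypothetical.&lt;/p&gt;

```python
from collections import deque


def owners_to_notify(failing_model, downstream, owners):
    """Return owners of the failing model and of every model reachable downstream."""
    seen = {failing_model}
    queue = deque([failing_model])
    while queue:
        model = queue.popleft()
        for dependent in downstream.get(model, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted({owners[m] for m in seen})


# Hypothetical dependency graph: each key feeds the models listed in its value.
downstream = {"eta": ["pricing", "matching"], "pricing": ["promotions"]}
owners = {
    "eta": "maps-team",
    "pricing": "marketplace-team",
    "matching": "marketplace-team",
    "promotions": "growth-team",
}
print(owners_to_notify("eta", downstream, owners))
```

&lt;p&gt;Tracking visited models keeps the traversal safe even if the dependency graph contains cycles.&lt;/p&gt;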


&lt;p&gt;The interaction between the Model Excellence Scores (MES) framework and other ML systems at Uber is depicted in Figure 3. The MES framework, with its indicators, objectives, and agreements, is built on several key principles:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Automated Measurability&lt;/strong&gt;: Every indicator in MES is designed with metrics that can be quantified and automated, ensuring robust infrastructure for instrumentation.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Actionability&lt;/strong&gt;: Indicators are not just measurable but also actionable. This means that there are clear steps that users or the platform can take to improve these metrics over time in relation to their set objectives.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Aggregatability&lt;/strong&gt;: The metrics for each indicator are capable of being aggregated. This is crucial for effective reporting and monitoring, allowing for a cohesive roll-up of metrics in line with the organization’s Objectives and Key Results (OKRs) and Key Performance Indicators (KPIs).&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Metrics for each indicator are idempotent, meaning their measurements remain consistent when backfilled.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Accountability&lt;/strong&gt;: Clear ownership is attached to each agreement. The designated owner is responsible for defining the objectives and ensuring these objectives are achieved.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/QJtG_get-AGg37qbai6tkDE1-x9IXbOt28Xsk4tFbSBV6mGfC5jt0aOXtPOEAQpP3RIK8JiQuc2-B-HTUquxWMh0LAqWelYAIesr2YJ2z1cT9-UyyaAOaD_GHGt8iN2BY8Vj2IHOhECetrh84AZU9Tg&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: High-level view of the interaction between the MES framework and various ML systems.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;In Table 1, we focus on some indicators that haven’t been extensively covered in related literature. Although MES is capable of measuring aspects like fairness and privacy, these topics are outside the scope of this discussion. The table outlines how each indicator adheres to these design principles, providing examples of measurable metrics, actionable steps for improvement, and the normalization schemes applied to ensure that the metrics are aggregatable and consistent across different use cases. These metrics are either normalized to a [0,1] scale, converted to a percentage, or maintained on a consistent scale across various applications.&lt;/p&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Indicators&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Possible Actions&lt;/th&gt;&lt;th&gt;Metric Normalization&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Data Quality&lt;/td&gt;&lt;td&gt;Measures the quality of the input datasets used to train the model. This is a composite score for: &lt;br&gt;– Feature nulls&lt;br&gt;– Cross-region consistency&lt;br&gt;– Missing partitions&lt;br&gt;– Duplicates&lt;/td&gt;&lt;td&gt;– Backfill the missing partitions&lt;br&gt;– Sync the data partitions across different regions and instances&lt;br&gt;– De-duplicate the rows in the data&lt;/td&gt;&lt;td&gt;Each component in the composite score is normalized to the percentage scale&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dataset Freshness&lt;/td&gt;&lt;td&gt;Measures the freshness of the input datasets used to train the model&lt;/td&gt;&lt;td&gt;– Retrain with fresh input datasets&lt;br&gt;– Backfill input datasets if updated data is available&lt;/td&gt;&lt;td&gt;Scale-consistent&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Feature and Concept Drift&lt;/td&gt;&lt;td&gt;Shift in the target and covariate distribution as well as the relationship between the two over time for a model in production&lt;/td&gt;&lt;td&gt;– Apply weighted training or retrain the model with fresh data&lt;br&gt;– Validate the correctness of upstream feature ETL pipelines&lt;/td&gt;&lt;td&gt;Normalized to [0,1] by using normalized distance metric and importance weights&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Model Interpretability&lt;/td&gt;&lt;td&gt;Measures the presence and confidence of robust feature explanations for each prediction generated by the model&lt;/td&gt;&lt;td&gt;– Enable explanations&lt;/td&gt;&lt;td&gt;Normalized to [0,1]&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prediction Accuracy&lt;/td&gt;&lt;td&gt;Prediction accuracy of the model on production traffic (e.g., AUC, normalized RMSE)&lt;/td&gt;&lt;td&gt;– Update training datasets to account for train-serve skew&lt;br&gt;– Check for feature or concept drift&lt;/td&gt;&lt;td&gt;Normalized to [0,1] by normalizing the accuracy metric&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Table 1: Sample of indicators.&lt;/figcaption&gt;&lt;/figure&gt;
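&lt;p&gt;To make the drift row more concrete, the sketch below shows one possible way to obtain a [0,1] score from a normalized distance metric and importance weights. The choice of total variation distance, the feature names, the histograms, and the weights are all our own assumptions for illustration; the table does not specify which distance MES uses.&lt;/p&gt;

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions; always in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def drift_score(train_hists, prod_hists, importance):
    """Importance-weighted average of per-feature distances, also in [0, 1]."""
    total_weight = sum(importance.values())
    weighted = sum(
        importance[f] * total_variation(train_hists[f], prod_hists[f])
        for f in importance
    )
    return weighted / total_weight


# Hypothetical binned feature distributions at training time and in production.
train_hists = {"trip_distance": [0.5, 0.5], "hour_of_day": [0.2, 0.8]}
prod_hists = {"trip_distance": [0.5, 0.5], "hour_of_day": [0.6, 0.4]}
importance = {"trip_distance": 0.25, "hour_of_day": 0.75}
print(round(drift_score(train_hists, prod_hists, importance), 3))
```

&lt;p&gt;Weighting by importance means drift in a feature the model barely uses moves the score far less than drift in a dominant feature, and the bounded scale keeps scores comparable across use cases.&lt;/p&gt;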


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;Results&lt;/h2&gt;


&lt;p&gt;The implementation of the MES framework at Uber has markedly enhanced the visibility of ML quality within the organization. This increased transparency has been instrumental in fostering a culture that prioritizes quality, subsequently impacting both business decisions and engineering strategies. Over time, we have observed substantial progress in adherence to SLAs across various dimensions. Notably, there has been a remarkable 60% improvement in the overall prediction performance of our models.&lt;/p&gt;


&lt;p&gt;Moreover, the insights gleaned from the MES metrics have been pivotal in identifying areas for platform enhancements. A key development arising from these insights was the introduction of advanced platform tooling for hyperparameter tuning. This innovation enables the automatic periodic retuning of all models, streamlining the optimization process and ensuring consistent model performance. Such improvements underscore the tangible benefits of the MES framework in driving both operational efficiency and technological advancement.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-lessons-learned&quot;&gt;Lessons Learned&lt;/h2&gt;


&lt;p&gt;In our journey of implementing and monitoring key indicators across all ML teams at Uber, we’ve gleaned several critical insights.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Motivating ML Practitioners:&lt;/strong&gt; The established framework allowed for a tangible measurement of the impact and efforts directed toward quality improvements. By adopting a standard and transparent reporting system, we created an environment where ML practitioners were motivated to enhance quality, knowing that their efforts were visible and recognized across the organization.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Alignment and Executive Support:&lt;/strong&gt; Initially, quality measures could be perceived as an additional burden unless they are seamlessly integrated into everyday practices from the outset. Implementing a quality tracking framework sheds light on existing gaps, necessitating extra efforts in education and awareness to address these issues. Aligning with executive leadership was crucial, enabling teams to prioritize quality-focused tasks. This alignment gradually led to a shift towards a more proactive, quality-centric culture across the board.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Balancing Standardization with Customization:&lt;/strong&gt; In designing the framework, we aimed for a level of standardization that would allow for consistent tracking and informed decision-making over time. However, given Uber’s diverse ML applications, it was also vital to permit customization for specific indicators to accurately reflect the nuances of each use case. For instance, in ETA prediction models, we adopted mean absolute error (MAE) as a more contextual metric than RMSE. The framework accommodated such customizations while maintaining a standardized approach to reporting for consistency.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Prioritizing Incremental Improvements:&lt;/strong&gt; Managing the framework across a wide array of use cases posed significant challenges in prioritization. We developed a straightforward prioritization matrix to identify which areas needed immediate attention. Recognizing that a handful of models contribute most to the impact, our focus was on enhancing quality in high-impact use cases first.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;The Role of Automation:&lt;/strong&gt; Maintaining ML quality is resource-intensive, and manually managing models in production can divert efforts from innovation. Automating the production lifecycle, including retraining, revalidating, and redeploying models with fresh data, proved invaluable. This automation not only enhanced model freshness (as indicated by the reduced average age of models), but also allowed teams to focus more on innovation and less on maintenance.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;We have developed a comprehensive framework that outlines the key dimensions of high-quality machine learning (ML) models across different stages of their lifecycle. This framework is inspired by Service Level Agreement (SLA) principles and is designed to monitor and ensure the quality of ML models. Importantly, it’s structured to accommodate additional quality dimensions, adapting to emerging use cases and evolving best practices in the field.&lt;/p&gt;


&lt;p&gt;Our discussion also encompassed the application of this framework in generating insightful quality reports at various levels of the organization. These reports are regularly reviewed, fostering accountability and offering valuable insights for strategic planning. Crucially, by embedding ML quality within the overall service quality of the associated software systems, we’ve facilitated a shared responsibility model. Applied scientists, ML engineers, and system engineers now collectively own ML quality. This collaborative approach has significantly bridged the gap between these functions, fostering a proactive, quality-focused culture within the organization.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;Acknowledgments&lt;/h3&gt;


&lt;p&gt;We could not have accomplished the technical work outlined in this article without the help of our team of engineers and applied scientists at Uber. We would also like to extend our gratitude to the various Technical Program Managers – Gaurav Khillon, Nayan Jain, and Ian Kelley – for their pivotal role in promoting the adoption and compliance of the MES framework across different organizations at Uber.&lt;/p&gt;
</description><link>https://www.uber.com/blog/enhancing-the-quality-of-machine-learning-systems-at-scale/</link><guid isPermaLink="false">https://www.uber.com/blog/enhancing-the-quality-of-machine-learning-systems-at-scale/</guid><pubDate>Thu, 21 Mar 2024 05:30:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Data / ML</category></item><item><title>Balancing HDFS DataNodes in the Uber DataLake</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;Apache Hadoop&lt;sup&gt;Ⓡ&lt;/sup&gt; Distributed File System (HDFS) is a distributed file system designed to store large files across multiple machines in a reliable and fault-tolerant manner. It is part of the Apache Hadoop framework and is one of the main components of Uber’s data stack.&lt;/p&gt;


&lt;p&gt;Uber has one of the largest HDFS deployments in the world, with exabytes of data across tens of clusters. It is important, but also challenging, to keep scaling our data infrastructure while balancing efficiency, service reliability, and high performance.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;figure class=&quot;wp-block-image&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/VaaufGJF2NoMmsAXqw0SjcSbkWbdPFm1-Kr_ZYFuHKKoFGozTrx0QEhYCk2vu7lAd8vai59XYBhIssXwY0LfI5kWh7AQGTKKYTM34rWx8Re1V8U7gEI0Z0O_vLgXLXT8af8frAQNJJQ_95n95isTgWw&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;


&lt;p&gt;  Figure 1: HDFS Infrastructure at Uber.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-overview&quot;&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;The HDFS balancer is a key component for keeping DataNodes healthy by redistributing data evenly across the cluster. As node decommissioning in our HDFS clusters becomes more frequent and intensive, the balancer has to move data more effectively to prevent DataNode skew. The decommissioning requirement comes from projects such as zone decommissioning, automatic cluster turnover for security patches, and DataNode colocation.&lt;/p&gt;


&lt;p&gt;However, the balancer that ships with open-source HDFS did not meet this requirement out of the box. We have seen individual DataNodes become skewed (i.e., storing more data than other nodes in the same cluster), which has multiple side effects:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;High I/O bandwidth on the hosts that contain too much data&lt;/li&gt;


&lt;li&gt;Highly utilized nodes have a higher probability of slowness and a higher risk of node failure and data loss&lt;/li&gt;


&lt;li&gt;The cluster has fewer active and healthy nodes to serve write traffic for customers&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Below is an example of unbalanced data: thousands of nodes are near 95% disk utilization in our largest cluster, which is composed of thousands of DataNodes with hundred PBs of capacity, while the balancer’s throughput is too low to move data effectively to the newly added DataNodes. This unbalanced distribution is caused by bursty write traffic from warm tiering and EC conversion[1], and by intensive node decommissioning from zone decommissioning and cluster turnover for security patches. Because write reliability is the first priority, all DataNodes serve write traffic together under an available-capacity-weighted algorithm. As write traffic grows, the data skew grows as well.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;212&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original-1024x212.png&quot; alt=&quot;&quot; class=&quot;wp-image-1082498&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1999,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-2-Argon-cluster-original.png 1999w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Skewed DataNodes in one of our biggest clusters, which comprises thousands of DataNodes with hundred PBs of capacity.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Thus, we needed to optimize the HDFS balancer so that it moves data faster from high-usage DataNodes to less-occupied DataNodes.&lt;/p&gt;


&lt;p&gt;Given the scale of data storage at Uber, there could be more than 20 PB of data on unbalanced nodes in a single cluster, across 7-8 clusters. To tackle this problem of balancing HDFS DataNodes in the Uber DataLake, we devised a new algorithm that increases the number of pairs formed between DataNodes, which in turn increases parallel block movements while balancing data. We also sorted DataNodes by utilization so that the DataNode pairs formed are optimized and no recursive balancing takes place.&lt;/p&gt;


&lt;p&gt;This algorithm went on to increase our balancing throughput (i.e., the amount of data moved per second from more-occupied DataNodes to less-occupied DataNodes selected for balancing).&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-architecture-amp-design&quot;&gt;&lt;strong&gt;Architecture &amp;amp; Design&lt;/strong&gt;&lt;/h2&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/aypCjNKWlTTa6yQzeGj53GgpowcKkI2Sowy6KlqoObWeWb2sGt7QrR7AS5AEk8kjzHKseHi4zEbcG_HW31QNaglL96kh-0RzVIhXk8qVLmLMJbr5Qr_yULMeF-HVYBCBwknrf3UCeY1RRaNRv8vyLIk&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.2903225806451613;width:701px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: HDFS Balancer Architecture.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;ol&gt;&lt;li&gt;Initialization and Setup:&lt;ol&gt;&lt;li&gt;The HDFS balancer is run on a host as a service within the Hadoop cluster.&lt;/li&gt;


&lt;li&gt;To initiate the balancing process, a node with a balancer role needs to be present in the cluster. No two balancers can run concurrently.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Requesting Cluster Information:&lt;ol&gt;&lt;li&gt;The balancer first contacts the NameNode to request information about the data distribution within the cluster. It sends a request to the NameNode to obtain details about the distribution of data blocks across DataNodes.&lt;/li&gt;


&lt;li&gt;The NameNode responds with a list of DataNodes and the blocks they contain, along with their storage capacities and other relevant information.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Block Selection and Planning:&lt;ol&gt;&lt;li&gt;Based on the information received from the NameNode, the balancer algorithm selects blocks that need to be moved to achieve a more balanced distribution.&lt;/li&gt;


&lt;li&gt;The balancer takes into consideration factors such as DataNode utilization, rack information, threads, and storage capacity while planning block movements.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Coordination of Data Movement:&lt;ol&gt;&lt;li&gt;After determining which blocks to move, the balancer coordinates the actual data movement between DataNodes.&lt;/li&gt;


&lt;li&gt;It communicates with the NameNode regarding the blocks moved with the help of heartbeats.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Block Migration:&lt;ol&gt;&lt;li&gt;The balancer initiates block migration by communicating directly with the source and destination DataNodes.&lt;/li&gt;


&lt;li&gt;It instructs the source DataNode to transfer the selected block to the destination DataNode, moving the data block directly.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Monitoring Progress:&lt;ol&gt;&lt;li&gt;Throughout the data movement process, the balancer continuously monitors progress. It keeps track of how many blocks have been successfully transferred and ensures that the data movement is proceeding according to the plan.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Completion and Reporting:&lt;ol&gt;&lt;li&gt;Once the balancing operation is complete, the balancer reports the data transferred and data left to transfer in logs and through metrics.&lt;/li&gt;


&lt;li&gt;It may also provide statistics and metrics about the balancing process, including the number of blocks moved and the time taken.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Termination:&lt;ol&gt;&lt;li&gt;The balancer runs as a service on its host, so it keeps moving data until the cluster is balanced.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-initial-optimizations&quot;&gt;&lt;strong&gt;Initial Optimizations&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Because our objective was to balance DataNodes faster by increasing throughput, we first tuned our HDFS balancers using the existing DataNode and balancer properties.&lt;br&gt;Although this made the balancer up to 3x faster, the throughput still wasn’t sufficient. We had too many highly occupied nodes, and the existing algorithm formed too few DataNode pairs between which data could be transferred. We also couldn’t raise the per-node throughput by adding balancer threads, as that would slow the node down and affect read/write traffic. Thus, we needed to increase the number of DataNode pairs, which would ultimately increase balancing throughput.&lt;br&gt;&lt;br&gt;The DataNode and balancer configurations that we used are listed below. The right configurations for your workloads may differ based on your situation.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;DataNode configuration properties:&lt;/strong&gt;&lt;/p&gt;


&lt;figure class=&quot;wp-block-table has-small-font-size&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Property&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Default&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Fast Mode&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.datanode.balance.max.concurrent.moves&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;250&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.datanode.balance.bandwidthPerSec&lt;/td&gt;&lt;td&gt;1048576 (1 MB)&lt;/td&gt;&lt;td&gt;1073741824 (1 GB)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;p&gt;&lt;strong&gt;Balancer configuration properties:&amp;nbsp;&lt;/strong&gt;&lt;/p&gt;


&lt;figure class=&quot;wp-block-table has-small-font-size&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Property&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Default&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Fast Mode&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.datanode.balance.max.concurrent.moves&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;250&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.balancer.moverThreads&lt;/td&gt;&lt;td&gt;1000&lt;/td&gt;&lt;td&gt;2000&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.balancer.max-size-to-move&lt;/td&gt;&lt;td&gt;10737418240 (10 GB)&lt;/td&gt;&lt;td&gt;107374182400 (100 GB)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;dfs.balancer.getBlocks.min-block-size&lt;/td&gt;&lt;td&gt;10485760 (10 MB)&lt;/td&gt;&lt;td&gt;104857600 (100 MB)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-algorithm-optimizations&quot;&gt;&lt;strong&gt;Algorithm Optimizations&lt;/strong&gt;&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-increasing-datanode-pairs-for-high-throughput&quot;&gt;Increasing DataNode pairs for high throughput&lt;/h3&gt;


&lt;p&gt;More DataNode pairs mean more concurrent block transfers, so a key improvement was to construct more pairs. With the existing algorithm, a highly skewed cluster formed few DataNode pairs.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;549&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo-1024x549.jpeg&quot; alt=&quot;&quot; class=&quot;wp-image-1082506&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo.jpeg 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo.jpeg 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo.jpeg 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo.jpeg 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1640,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-4-Balancer-Original-Algo.jpeg 1640w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Existing Algorithm.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;With the existing HDFS balancer algorithm, DataNodes above the cluster’s average utilization (i.e., above-average utilized and over-utilized nodes) far outnumbered the below-average utilized and under-utilized nodes. We therefore faced a scarcity of target nodes to receive data from highly utilized DataNodes, so the usage of those DataNodes came down only slowly.&lt;/p&gt;


&lt;p&gt;In the above diagram, there are 8 DataNodes above average utilization and 4 DataNodes below it, which leaves only 4 targets to which data can be moved.&lt;br&gt;The aim was to modify the HDFS balancer algorithm so that more DataNode pairs are formed, leading to more throughput from high-usage DataNodes, more uniform utilization, and a faster reduction in usage across more DataNodes.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Our idea was to use a percentile-based algorithm for creating more DataNode pairs.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/HlfZlWPK_wVdej0xQY_1gYkFAWt7JpDbSyPnbE4eLyQaV9-jk_XFbgFszAVnxWfTzTvMg_6MhjL3lsrAImSeStQrrcXNiIdM_W2zNWEHJlWnCsxhme1CD07Ju4Q8QoPHgBzqg-GDgSmcBSu7A8yh9RQ&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: New Algorithm.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;In the new algorithm, we created an adjusted average based on a percentile, which increases the number of nodes to which data can be moved. Above-average and over-utilized DataNodes try to come down toward the overall cluster utilization, whereas under-utilized and below-average utilized nodes try to come up toward the percentile-based adjusted average. With the percentile-based algorithm, we aim to bring the adjusted average close to the overall cluster utilization.&lt;/p&gt;


&lt;p&gt;The percentile-based algorithm increases the number of DataNode pairs. In a highly skewed cluster, the percentile is quite high. Taking the above diagram as an example, with the percentile set to P60 the adjusted average is 86.7%. In this case, the count of over-utilized and above-average utilized nodes decreases, and the count of under-utilized and below-average utilized nodes increases.&lt;/p&gt;


&lt;p&gt;Now there are 5 over-utilized or above-average utilized nodes and 7 under-utilized or below-average utilized nodes, which allows up to 7 pairs to be formed instead of 4.&lt;/p&gt;
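

&lt;p&gt;The effect of moving the dividing line can be sketched in a few lines of Python. This is a minimal illustration, not the actual balancer patch: the utilization values are made up, the helper names &lt;em&gt;percentile&lt;/em&gt; and &lt;em&gt;split&lt;/em&gt; are hypothetical, and the linear-interpolation percentile is an assumption, since the post does not state how the percentile is computed.&lt;/p&gt;

```python
import math

# Hypothetical DataNode utilizations (percent) for a skewed 12-node cluster.
utilizations = [70, 75, 80, 82, 88, 89, 90, 91, 92, 93, 94, 95]


def percentile(values, q):
    """Return the q-quantile (q in [0, 1]) using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def split(values, dividing_line):
    """Split nodes into sources (above the line) and targets (at or below it)."""
    sources = [u for u in values if u > dividing_line]
    targets = [u for u in values if dividing_line >= u]
    return sources, targets


# Existing behavior: the plain cluster average is the dividing line.
cluster_average = sum(utilizations) / len(utilizations)
old_sources, old_targets = split(utilizations, cluster_average)

# Percentile-based behavior: P60 of the utilizations is the dividing line.
adjusted_average = percentile(utilizations, 0.6)
new_sources, new_targets = split(utilizations, adjusted_average)

print(len(old_sources), len(old_targets))
print(len(new_sources), len(new_targets))
```

&lt;p&gt;With these made-up numbers, the plain average yields 8 sources and 4 targets, while the P60 line yields 5 sources and 7 targets, mirroring the counts in the example above.&lt;/p&gt;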


&lt;p&gt;We added a new Hadoop configuration property, &lt;em&gt;dfs.balancer.separate-percentile&lt;/em&gt;, shown in Figure 6.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/ouUojzw_My_1YuqQvm4oGFoXtN03qy9mBcdjsfcPCHi7thgD_wkYfo3lkzZN6AzFvG2rs2qDHuftg2D3D8vpFw03dk94_r_UhuSUvSfkUr7w2YGuCOP1tNO3QDEE9RV8LwCkjScxreh3_YXyC0vPkPc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: New Hadoop Configuration for Defining Percentile.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The property defaults to 0.5, denoting the 50th percentile. When the balancer command is deployed with -dynamicBalancer, the percentile algorithm takes effect and the adjusted average comes into play, yielding more throughput.&lt;/p&gt;


&lt;p&gt;We could also use this setting to balance dynamically. For example, if DataNodes went above 90% utilization, we would balance them aggressively (i.e., at increased speed) by balancing only the top 20% of DataNodes. This concentrates moverThreads on the top 20% of highly utilized sources, so data moves off those DataNodes faster and their usage comes down sooner.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;417&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code-1024x417.png&quot; alt=&quot;&quot; class=&quot;wp-image-1082510&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1988,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-7-Aggressive-balancer-code.png 1988w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: New Hadoop Configuration for Defining Aggressive Balancing.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-moving-data-to-lower-occupied-datanodes&quot;&gt;Moving data to lower occupied DataNodes&lt;/h3&gt;


&lt;p&gt;Because of automation (i.e., automatically moving data off DataNodes onto other DataNodes so that they can be sent for maintenance), DataNodes in a large cluster were decommissioned frequently. Data from each decommissioned node was moved to other nodes, increasing the occupied percentage on those nodes. Newly added nodes were balanced slowly, as they were not given priority.&lt;br&gt;As an example, suppose the average utilization is 83% with a threshold of 3%, and a DataNode at 90% moves part of its data to a node at 79%, which then reaches 81%. If new client writes then land on that 81% node, it can reach 87%, which may require balancing this node again, spreading the dispatcher and mover threads more thinly.&lt;/p&gt;
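

&lt;p&gt;The 83%/3% example can be traced with a small Python sketch. It assumes the usual balancer convention that a node counts as over- or under-utilized when it is more than the threshold away from the cluster average, and as above- or below-average otherwise; the function name &lt;em&gt;classify&lt;/em&gt; is hypothetical and the exact boundary handling is an assumption.&lt;/p&gt;

```python
def classify(utilization, average, threshold):
    """Classify a DataNode by its utilization relative to the cluster average.

    All arguments are percentages. Assumed convention: more than `threshold`
    points away from the average means over- or under-utilized; otherwise
    the node is above-average or below-average.
    """
    if utilization > average + threshold:
        return "over-utilized"
    if utilization > average:
        return "above-average"
    if utilization >= average - threshold:
        return "below-average"
    return "under-utilized"


average, threshold = 83, 3

# The source node (90) and the receiving node at each stage of the example:
# 79 before balancing, 81 after receiving data, 87 after new client writes.
for utilization in (90, 79, 81, 87):
    print(utilization, classify(utilization, average, threshold))
```

&lt;p&gt;Under these assumptions, the node that started as an under-utilized target at 79% ends up over-utilized at 87%, so it becomes a source that has to be balanced in turn.&lt;/p&gt;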


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/lAap83ZaUUmXXZdZ45R2B9TuHzz3O8zzrYpaZpti5Srpdxga1oV2Ay6yBqrlLqRRfVYe5r0jiEYlyQe8EBhD_W9T5Z_QsbkvToqR1-VPXv-9grrMlFCNb61rpOZtsv4jG1EX4Q3loB-vg-C0236qxdc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Old Algorithm – Pairs Formed.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/xoqmBEJU1prJmh7xmVNllTySnh2KfypKUZXcfY2Kp4qG3XMWEVnNMkpOQjm9EwfarrVBwn4IlD5gnaScUKbt7jRDQnBziP4EcDQTxEErbfKzT-5xu7nzj8EpOjDxKS1aFImRaPN_U5Iz2EEKSpjJp8Q&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Old Algorithm – New Over-Utilized Nodes Came Up.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/qmXk_2q380kv8Nmo5YBY-jL_T6o88ZDB9sZ_M2RG8FCQ_Cs8vCKV5YiP11ahdV98lhsqvkhhb1yxJwVHySa__dVbeW0EtUlS96qwxHwl0eGYnVAnc6kf7vnx-KZCXEtKxG0muax_12jqY5aYP0cZtgE&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: New Algorithm – Preferred Optimization.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Our enhancement was to prioritize the least-occupied DataNodes. We sort the under-utilized and below-average utilized nodes in ascending order of utilization, and the over-utilized and then above-average utilized nodes in descending order, so that data moves first from the fullest nodes to the emptiest ones. Nodes in between are left out of the picture while balancing, which prevents recursive balancing.&lt;/p&gt;
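

&lt;p&gt;A rough Python sketch of this ordering follows. It is a simplification under stated assumptions: the node names and utilizations are invented, the function name &lt;em&gt;plan_pairs&lt;/em&gt; is hypothetical, and pairing is shown one-to-one, whereas the real balancer also takes node group and rack into account and may pair a node with several peers.&lt;/p&gt;

```python
def plan_pairs(sources, targets):
    """Pair the fullest sources with the emptiest targets.

    `sources` and `targets` map a (hypothetical) node name to its utilization
    in percent. Sources are visited in descending order and targets in
    ascending order, so nodes close to the average are paired last.
    """
    ordered_sources = sorted(sources, key=sources.get, reverse=True)
    ordered_targets = sorted(targets, key=targets.get)
    return list(zip(ordered_sources, ordered_targets))


# Invented example data: three nodes above the dividing line, three below it.
sources = {"dn-a": 88, "dn-b": 95, "dn-c": 91}
targets = {"dn-x": 81, "dn-y": 40, "dn-z": 72}

for source, target in plan_pairs(sources, targets):
    print(source, "to", target)
```

&lt;p&gt;Here the 95% node is paired with the 40% node first, and the 81% target, which is closest to becoming a source itself, is used last.&lt;/p&gt;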


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-better-observability&quot;&gt;&lt;strong&gt;Better Observability&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;We didn’t have metrics on the DataNode pairs formed between over-utilized and under-utilized nodes, over-utilized and below-average utilized nodes, and under-utilized and above-average utilized nodes, broken down by same node group, same rack, and any other rack, nor other relevant metrics. Hence, we weren’t able to calibrate the traffic distribution between these pairs. To find out where DataNode pairs could be added to increase throughput, we created a new dashboard.&lt;/p&gt;


&lt;p&gt;In the end, we added more than 10 metrics to track the performance of our algorithm change, which help us further calibrate custom algorithms for the balancer.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/qymsO6tHpnV1x5Re2lVFYtX5UqTArgNP0PT7JTLi2aua8DL9-GTRTMbFQZ7fF-WccCBJvmv5zXP72Akx1ejdFgBH2YdtFa_4tgz1CHFWyonX74vzT8OpAjRaGqDTUdNFWPAYgnGiJXaq7bg_7UtwbEw&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.7758046614872365;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 11: Snapshots of our Metrics Dashboard.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;With the optimizations in the balancing algorithm, we increased throughput by more than 5x, left no DataNode above 90% utilization, and brought down DataNode usage overall. There is also no longer any need to deploy a manual balancer that balanced data only for certain hardcoded nodes, as our algorithm optimization takes care of that.&lt;/p&gt;


&lt;p&gt;In summary, the new algorithm delivered the following:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Increased throughput: We increased the throughput by more than 5x.&lt;/li&gt;


&lt;li&gt;Fewer highly used DataNodes: We brought the number of DataNodes above 90% utilization down to 0.&lt;/li&gt;


&lt;li&gt;DataNodes at similar utilization: We reduced the overall usage of DataNodes and brought them to around the same utilization. All DataNodes in our biggest cluster were below 85% utilization.&lt;/li&gt;


&lt;li&gt;Better capacity management: Cluster utilization increased from 65-66% to around 85% for the HDFS clusters where we had capacity bottlenecks. We had no highly occupied DataNodes even though cluster utilization was higher than ever.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/xLa6b1OocD2j8moRGaxLJL_qbjP_xD412Zib_vZv3uYV-tXefy1Vq8IZK4Y1k8F-B_lWdetYp4_-xunCBgCdB3Ov0hhpnuKlLVARGodpE6nUURsocv7menNOQKubFT5oiBwCM7yyYfsmw4R1P64UMZ4&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 12: DataNodes at a similar level due to the algorithm change and below 85% utilization for our biggest cluster.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/eO5trlhcsa5iPseVoe0deceN0afiwUstj3oyk95aGYV4xV6vO9ap3juytv9bnLQMitL88-tnRmKblbiVzmE6Li9vmAG0aRWXQ_Y5rIzo8Z846ZFOtm6BMoaFwHcQzIw8BZVgW4n2-1OTRolCH9LSnRE&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.7621145374449338;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 13: Panels showing that the DataNode skew is reduced.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/uqyVOYZWIMQ33KfRVuA4LeLDiASqtaMThAef2CjJ8Qa2_mNvgi51nJcH2A1VmhML8yXyE7e_cHmD8_5-zgS2OUe6PqOaEZQBbfNYvo_npxkCxKaKisWB_ZRWyOQQiN8Ox8uomG_T6DGJjR4dRPVU7Lc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 14: Before balancer algorithm changes – Datanodes with high usage above 90% are 50.8%.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;figure class=&quot;wp-block-image&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/8jioc04QB8VJcyCm7bKU-IEx6cbiTgLeBe9O0aRb9CR4nwU8o4IRzQScNFOjhSns5XDspdY-u3cK5iirYw7QH9dwrCNoFoUfstbTMRuciCWxR6CHZXFTDtT_p-wuCHVYko3wmssd4f2PJnUXPa6HbPU&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 15: After balancer algorithm changes – DataNodes with high usage above 90% are down to 0.&lt;/figcaption&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-full is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;844&quot; height=&quot;662&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-16-Cluster-previous-capacity.png&quot; alt=&quot;&quot; class=&quot;wp-image-1082512&quot; style=&quot;aspect-ratio:1.2749244712990937;width:700px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=844,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-16-Cluster-previous-capacity.png 844w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-16-Cluster-previous-capacity.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/03/Figure-16-Cluster-previous-capacity.png 768w&quot; sizes=&quot;(max-width: 844px) 100vw, 844px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt; Figure 16: One of our clusters with less cluster utilization around 65%.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/FUzDUIGryAJWwVwckoZQPUKatRhcPpBWOxQrGHXh9P8-y3mOTVcgJEnJJjCHromXrwQtbz1T_7t7V7TEtQ9lNT9XI4IabGVADQ2rIzXK3FeDb4Zpd93h4i0a_3Y0oDCGRdxiNx6iGwomsnTU4tqmgBM&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.334375;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt; Figure 17: Cluster utilization increased to around 83% for the same cluster above.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/4xMTrAtGQ4KGaHED3rgIgLPSWn3YMy25QcXsCD4uhPn3YT8M9_PiyKp7NVwdqqJexncQPxAdWg7EFk_jIQR8nYO0hPn-IAU7QA2ho2syCDMroCNi7jSpnCDXDSuYOCQgf3EcoTwV4vT-Spp1WraoKkk&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 18: Increase in throughput by more than 3x due to algorithm changes.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;In an HDFS cluster, data can become skewed across DataNodes. A heavily used node sees high I/O, which can make it slow or bring it down and can lead to data loss. The new algorithm balances the DataNodes faster, which improves efficiency, service reliability, and performance while reducing the risk of slowness, node failure, and data loss.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;At Uber, we deployed this change to multiple clusters to increase the balancing throughput. We are raising an open-source patch for our optimizations. The Uber HDFS team continues to work on solving similar data distribution problems – given our scale, even a small improvement can result in a huge gain.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;[1] Uber keeps data with different access temperatures in dedicated clusters for better reliability and cost efficiency. We apply warm tiering to move data from the hot cluster to the warm cluster, and adopt EC conversion to move data to a cluster with the erasure coding feature, which saves 50% capacity.&amp;nbsp;&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;“Apache®, Apache Hadoop®, and Hadoop®, are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.”&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/balancing-hdfs-datanodes-in-the-uber-datalake/</link><guid isPermaLink="false">https://www.uber.com/blog/balancing-hdfs-datanodes-in-the-uber-datalake/</guid><pubDate>Thu, 14 Mar 2024 05:30:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Data / ML</category></item><item><title>Load Balancing: Handling Heterogeneous Hardware</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-overview&quot;&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;This blog post describes Uber’s journey towards utilizing hardware efficiently via better load balancing. The work described here lasted over a year, involved engineers across multiple teams, and delivered significant efficiency savings. The article covers the technical solutions and our discovery process to get to them–in many ways, the journey was harder than the destination.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-background&quot;&gt;&lt;strong&gt;Background&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;&lt;a href=&quot;https://www.uber.com/blog/better-load-balancing-real-time-dynamic-subsetting/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Better Load Balancing: Real-Time Dynamic Subsetting | Uber Blog&lt;/a&gt; is a related blog post that predates the work described here. We won’t repeat the background–we recommend skimming through the overview of our service mesh there. We’ll also reuse the same terminology. This post focuses on the workloads communicating via the service mesh described in that post, which covers the vast majority of our stateless workloads.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-problem-statement&quot;&gt;&lt;strong&gt;Problem statement&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;In 2020, we started work to improve the overall efficiency of Uber’s multi-tenant platform. In particular, we focused on reducing the capacity required to run our stateless services. In this blog post, we’ll cover how individual teams making rational decisions led to inefficient resource usage, how we analyzed the problem and different approaches, and how, by improving load distribution, we got teams to safely increase CPU utilization and drive down costs. The post focuses on CPU only, since this was our primary constraint.&lt;/p&gt;


&lt;p&gt;First, some context: at Uber, most capacity decisions are decentralized. While our platform teams provide recommended targets and tools like auto-scalers, the ultimate decision to adopt specific targets lies with each product team or organization. A budgeting process exists to curb unlimited allocations.&lt;/p&gt;


&lt;p&gt;As part of the budgeting process, we noticed what we thought were unreasonably low utilization levels. However, attempts to increase the utilization were met with concerns from the product teams–they were rightly worried that increasing the utilization would risk the system’s reliability and affect their availability/latency goals.&lt;/p&gt;


&lt;p&gt;The cause of the problem was presumed to be suboptimal network load balancing. Many workloads had tasks with CPU usage higher than average. Those outliers worked fine during normal operations, but struggled during failovers–and the desire not to break the SLAs pushed our average utilization downwards.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/VPz0oGg-6KU9AQlKl7p6jeMb71w1ExYry6MeA5yDkQD_XQb62eP9S48IH_ZfGItt4tzYYz6TlzmYYaShRNRFe2x01S75ACNiS_T-OjaLcGO6FyfmyJF21eH0ks9v1fxPzvkBA-3cl0CGBKT3bY8oP-E&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: A typical “imbalance graph.” Each line represents the CPU usage of a container.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/gV0L88878EKTUFagchXUUUkaTB0AieIEAzm4DDNKfjaATpUjMR_0yFE8vK-OY5w37lFhRS6DIRFZxEa-LE_pGYsk2PU4wQdEOfanI9pXjzqnAClzepSHXlA78fUwifiXWWF7QWlhc4Bo9WWf4g8-yfk&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: A less obvious case: container utilizations are distributed across a band, but some are utilized more than others.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-asymmetry-of-impact&quot;&gt;&lt;strong&gt;Asymmetry of impact&amp;nbsp;&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;An important aspect of load imbalance is the &lt;strong&gt;asymmetry of its impact&lt;/strong&gt;. Imagine a scenario where out of 100 workloads, 5 are under-utilized. This impacts efficiency, but the cost is relatively low–we’re not using 5% of our machines as efficiently as possible.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;If the situation is reversed and the same 5 workloads are &lt;em&gt;over&lt;/em&gt;-utilized, the situation is much more severe. We are likely affecting customer experience and potentially affecting the system’s reliability. The easy solution to avoid these hotspots is to reduce the average utilization of the whole cluster. This will now have a &lt;em&gt;much&lt;/em&gt; more significant impact: &lt;strong&gt;95% of the workloads are underutilized&lt;/strong&gt;, meaning a much more significant waste of (financial) resources.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-the-forest-and-the-trees&quot;&gt;&lt;strong&gt;The forest and the trees&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;Since the outliers were easy to spot, we initially focused on fixing and chasing them one by one, trying to root-cause and fix each issue individually as soon as possible. The results of these individual fixes weren’t always as expected. Some of our changes had a lower impact than expected or only impacted a subset of the system. Similarly, other changes later on resulted in unexpectedly significant improvements. This was due to several independent issues being at play. This “forest of issues” resulted in the work being largely sequential–we would only find a new, more minor issue once its larger sibling was fixed.&lt;/p&gt;


&lt;p&gt;In retrospect, the “surprise” part could have been mitigated with more analytical rigor–we could have understood the system better and collected more samples upfront. The sequentiality of the work would likely have been the same, though–it was only through the process that we learned how to understand and measure the system.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-measuring-the-impact&quot;&gt;&lt;strong&gt;Measuring the impact&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Perhaps surprisingly, one of the most disputed aspects of the project, until the very end, was measuring the impact. The discussions involved folks from different teams and organizations joining and leaving the project at different times. Each involved party had a valuable, but slightly different perspective on the problem, its priority, and potential fixes.&lt;/p&gt;


&lt;p&gt;Just measuring the impact consistently was surprisingly complicated. Clearly, we should measure the outliers–we quickly settled on using the p99 CPU utilization across the tasks of a given workload. After some discussions, we agreed to use the average as the base, leaving us with p99/average as the &lt;em&gt;imbalance indicator&lt;/em&gt;.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;However, even that was surprisingly vague:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;A workload runs in multiple clusters across multiple zones. Should the p99/average be calculated across all its instances or for each cluster individually? If it’s per cluster, how do we weigh the results? This decision &lt;em&gt;dramatically &lt;/em&gt;affects the final numbers.&lt;/li&gt;


&lt;li&gt;Workloads run in multiple regions, yet unlike zones, our regions exhibit strong isolation–where to send traffic is outside of networking control. Thus, the networking team might care about a different indicator than the business.&lt;/li&gt;


&lt;li&gt;A typical workload has a periodic pattern–a service might be most busy on a particular day of the week and underutilized at other times. Should we measure the imbalance at the peak only or throughout the day? If at peak, how long of a time frame should be considered peak? Do we only care about the single weekly peak?&lt;/li&gt;


&lt;li&gt;Our workloads typically run in an active-active pattern, with each region having some spare capacity for a potential failover. The load imbalance matters most during those failovers–should we try to measure it only then? If so, the frequency of our measurements will be reduced–typically, we would get a single sample per week.&lt;/li&gt;


&lt;li&gt;The workloads are noisy. A service rollout typically results in an imbalance spike (as new containers come and warm up). Some workloads might be quick to roll out (per increment) but roll out tens of times per day via a CD pipeline. Other workloads are much slower, and a single rollout can take hours. Both types of rollouts can overlap with peak times. On top of that, there are “atypical events” like temporary performance regressions, traffic drains, load tests, or incident-related issues.&lt;/li&gt;


&lt;li&gt;Most workloads follow a “standard” pattern, but some (more critical) services have been partitioned into custom shards with separate routing configurations. Similarly, a small subset of essential workloads is additionally accessible by custom peer-to-peer routing. Finally, another small subset of services runs on dedicated hosts. These dimensions might affect our tracking.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Once we settle on the per-workload indicator, the problem expands to multi-service:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;How do we weigh the individual workloads in the final score?&amp;nbsp;&lt;/li&gt;


&lt;li&gt;How do tiers (priority) of each service affect their weight in the final score?&lt;/li&gt;


&lt;li&gt;Does the fact that different workloads have different periodical patterns affect the score? Workloads typically have weekly and daily peaks, but those peaks are not simultaneous.&lt;/li&gt;


&lt;li&gt;Can we decompose the final indicator into sub-components to track the imbalance of individual zones or clusters?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The indicators must be available in real time for development and monitoring–here, we care about the highest precision possible, typically sub-minute. However, the same indicator must be available over long periods (years), where we need to roll the data up into day-sized chunks while keeping all the previous weighting considerations in mind.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-actual-numbers&quot;&gt;&lt;strong&gt;Actual numbers:&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Ultimately, we created a “Continuous Imbalance Indicator.” For each workload, for each minute, we calculated the p99 (say, 5 cores) and average (say, 4 cores) CPU utilization. That, combined with the number of containers, allowed us to calculate “wasted cores.” For the example above, 10 containers would result in 10*4=40 cores of &lt;em&gt;usage&lt;/em&gt;, (5-4)*10=10 cores of &lt;em&gt;wastage&lt;/em&gt;, and a resulting indicator of 1+10/40=1.25. This mapped intuitively to the “standard” p99/average calculation of 125% that humans could do when debugging live.&lt;/p&gt;
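
&lt;p&gt;The arithmetic above can be written down as a minimal sketch. The function and parameter names are illustrative assumptions, not names from the production pipeline; only the formulas come from the example in the text.&lt;/p&gt;

```python
def imbalance_indicator(p99_cores, avg_cores, containers):
    """Return (usage, wastage, indicator) for one workload in one minute.

    Hypothetical helper: the names are made up for illustration.
    """
    # Cores consumed on average across all containers.
    usage = avg_cores * containers
    # Extra cores implied by sizing every container for the p99 task.
    wastage = (p99_cores - avg_cores) * containers
    if usage == 0:
        raise ValueError("usage must be non-zero to compute the indicator")
    indicator = 1 + wastage / usage
    return usage, wastage, indicator


usage, wastage, indicator = imbalance_indicator(p99_cores=5, avg_cores=4, containers=10)
print(usage, wastage, indicator)  # prints: 40 10 1.25
```

&lt;p&gt;For a single minute the indicator is the same number as p99/average (5/4=1.25); keeping usage and wastage as separate core counts is what makes later aggregation possible.&lt;/p&gt;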


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/ZyU1H3XwK_fGjpDhsyhcKROE_XnjnbhmTKc_AkW1b-yvzLpTJ4TbnDiAqQ_b5jcOhM-4RFnaqrjVTFUHfNVqCRgh_299GXdMaUQeRX5Ca8eGP8mO8WrGDaHoo77l1OL5uOJJjHnNCFIqUEh3EgnzauY&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: theoretical definition of imbalance.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;When done over time, this effectively became a ratio of areas under two curves: p99 and average utilization.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Ed94kW9bEsKmlR9f6liB9hLJFJTRA2eWNbC61KrTCvqjRR16FF_17rG_29z2qqzrzobImlGT0S-GrF0ISBBjWaXnEV2TGRwIMFr_8R3E4lM5D1exq-LUcqAw4YjIVVjoVFbRYXowXbpESs_2zLlcIUM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Continuous Imbalance Indicator on a real-time dashboard.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The benefit of this approach was that since the wastage and utilization were calculated in absolute numbers of cores, it allowed us to aggregate them in custom, arbitrary dimensions: per service, per service-per-cluster, per group of services, per cluster, per zone. Similarly, any time window (hour, day, week) naturally worked–it was as simple as summing up a range of integers. Additionally, the indicator naturally gives higher weight to “busy” periods–imbalance at the peak is more critical than imbalance off-peak. The downside was the difficulty of explaining the indicator to humans, but we found that approximating it as a “weighted p99/average” was acceptable.&lt;/p&gt;
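
&lt;p&gt;A small sketch of why absolute core counts aggregate so easily. The sample rows, service names, and the &lt;em&gt;aggregate&lt;/em&gt; helper are hypothetical; the only idea taken from the text is that usage and wastage are summed per group before the ratio is computed.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical per-minute samples: (service, cluster, usage_cores, wastage_cores).
samples = [
    ("svc-a", "c1", 40, 10),  # quiet minute, indicator 1.25
    ("svc-a", "c1", 80, 8),   # busy minute, indicator 1.10
    ("svc-a", "c2", 20, 6),
    ("svc-b", "c1", 10, 1),
]


def aggregate(samples, key):
    """Sum usage and wastage per group, then compute 1 + wastage/usage."""
    totals = defaultdict(lambda: [0, 0])
    for service, cluster, usage, wastage in samples:
        bucket = totals[key(service, cluster)]
        bucket[0] += usage
        bucket[1] += wastage
    return {group: 1 + w / u for group, (u, w) in totals.items()}


per_service = aggregate(samples, key=lambda service, cluster: service)
per_cluster = aggregate(samples, key=lambda service, cluster: cluster)
overall = aggregate(samples, key=lambda service, cluster: "all")
svc_a_c1 = aggregate(samples[:2], key=lambda service, cluster: (service, cluster))
```

&lt;p&gt;In this made-up data, the two svc-a/c1 minutes combine to 1+18/120=1.15, which sits closer to the busy minute (1.10) than the plain mean of the two per-minute indicators (1.175) does. That is the weighting toward busy periods described above.&lt;/p&gt;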


&lt;p&gt;An alternative approach of calculating a ratio of “weekly p99 of p99s” and “weekly average of averages” was easier to explain on an individual service basis but suffered from high sensitivity to random events (drains, failovers, load-tests, deployments), which made it noisy. Additionally, the cross-service weighting was less straightforward.&lt;/p&gt;


&lt;p&gt;The above metrics were made available in real time in Grafana and in long-term storage in Hive. We needed to write custom pipelines to pre-process the indicator daily for visualization.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-different-slicing&quot;&gt;&lt;strong&gt;Different slicing&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;A particular wrinkle about measuring load imbalance is worth calling out: how you slice your data dramatically affects the results. It is tempting to start with small slices (clusters, zones, regions) and then “average” the imbalance. Sadly, this doesn’t work in practice. For example, it’s possible to have two clusters with an (averaged) p99/average ratio of 110%, but when looking across the whole workload, the imbalance might be much higher–up to 140% in our case. Similarly, combining two clusters of higher imbalances might result in a lower imbalance.&lt;/p&gt;
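
&lt;p&gt;The slicing effect is easy to reproduce with synthetic numbers. In the sketch below, both clusters have the same internal spread, but the second one is assumed to burn twice the CPU per task (for example, because of slower hardware). The data, the nearest-rank percentile, and the helper names are all assumptions made for illustration.&lt;/p&gt;

```python
def p99(values):
    """Nearest-rank 99th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def p99_over_average(values):
    return p99(values) / (sum(values) / len(values))


# Synthetic per-task CPU usage in cores, 100 tasks per cluster.
cluster_a = [1.0] * 98 + [1.1] * 2
cluster_b = [2.0] * 98 + [2.2] * 2  # same shape, twice the CPU per task

print(round(p99_over_average(cluster_a), 2))              # about 1.10
print(round(p99_over_average(cluster_b), 2))              # about 1.10
print(round(p99_over_average(cluster_a + cluster_b), 2))  # about 1.33
```

&lt;p&gt;Averaging the two per-cluster ratios would report roughly 110%, while the workload-wide ratio is roughly 133%, because the combined p99 comes from the expensive cluster and the combined average is dragged down by the cheap one.&lt;/p&gt;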


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-addressing-the-issues&quot;&gt;&lt;strong&gt;Addressing the issues&lt;/strong&gt;&lt;/h1&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-the-first-step-getting-hacky-data-first&quot;&gt;&lt;strong&gt;The first step: getting (hacky) data first&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;We started by building Grafana dashboards for real-time observability. This allowed us to measure impact individually per service in real time but didn’t help in understanding the root cause. While the assumption was that the load balancing was at fault, we didn’t &lt;em&gt;really&lt;/em&gt; know. The initial problem was the lack of observability, where we faced two issues.&lt;/p&gt;


&lt;p&gt;First, due to cardinality issues, our load balancers did not emit stats for each backend instance. With many services running thousands of containers and hundreds of procedures, this would have both caused a memory usage explosion in our proxy and made the stats impossible to query for even medium-sized services. Luckily, an intern project that summer added the ability to emit stats on an opt-in basis (saving the proxy memory usage) on a new metrics namespace (leaving the existing stats intact). Together with &lt;a href=&quot;https://chronosphere.io/learn/how-can-recording-and-roll-up-rules-help-your-metrics/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;roll-up rules&lt;/a&gt;, we could now introspect most services (as long as we only enabled the extra visibility for a few of them at a time).&lt;/p&gt;


&lt;p&gt;Second, we had lost the ability to uniquely identify instances across our compute and networking stacks. At the time, we could see the CPU usage of each target but couldn’t easily map it to a container. The available “unique identifier” of a &lt;em&gt;host:port&lt;/em&gt; would have broken our metrics (again, &lt;a href=&quot;https://chronosphere.io/learn/what-is-high-cardinality/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;cardinality&lt;/a&gt;) due to our wide IP target range and dynamic port usage. The discussion of a proper solution had previously stalled for quarters. Ultimately, the networking stack implemented a short-term solution based on sorting IP addresses and emitting integer-based instance IDs. These were not stable across deployments, but together with some more hacky scripting, allowed us to get the data we needed.&lt;/p&gt;
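
&lt;p&gt;As a rough illustration of the idea of sorting addresses and emitting integer instance IDs, here is a minimal sketch. It is not the actual implementation: the function name, the IPv4-only &lt;em&gt;host:port&lt;/em&gt; parsing, and the sample addresses are assumptions.&lt;/p&gt;

```python
import ipaddress


def assign_instance_ids(targets):
    """Map each IPv4 host:port target to a small integer by sorted position.

    Hypothetical helper: the IDs depend on the current set of targets, so
    they shift whenever instances are added or removed.
    """
    def sort_key(target):
        host, port = target.rsplit(":", 1)
        # Compare addresses numerically rather than as strings.
        return (int(ipaddress.ip_address(host)), int(port))

    ordered = sorted(set(targets), key=sort_key)
    return {target: index for index, target in enumerate(ordered)}


ids = assign_instance_ids(["10.0.12.7:31002", "10.0.3.44:31877", "10.0.3.44:30001"])
print(ids["10.0.3.44:30001"], ids["10.0.3.44:31877"], ids["10.0.12.7:31002"])  # prints: 0 1 2
```

&lt;p&gt;The metric label then takes values from 0 to N-1 for N instances instead of one value per &lt;em&gt;host:port&lt;/em&gt; ever seen, which keeps cardinality bounded. The trade-off matches the one described above: the IDs are not stable across deployments.&lt;/p&gt;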


&lt;p&gt;This step provided important lessons:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Always get the data first&lt;/li&gt;


&lt;li&gt;Well-placed, targeted, isolated hacks can be extremely useful&lt;/li&gt;


&lt;li&gt;You don’t need perfect observability to draw the correct conclusions&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-manual-analysis&quot;&gt;&lt;strong&gt;Manual analysis&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Once we had in-depth visibility into the issue, we hand-picked a few large services and tried to analyze the root causes. Surprisingly, the load balancing was not at fault–at a 1-minute window (our CPU stats resolution at the time), the RPS distribution was almost perfect. Each container was receiving an almost equal number of requests, with a difference below 0.1% for most applications. Yet, within the same window, the CPU utilization varied greatly.&lt;/p&gt;


&lt;p&gt;After several weeks of investigations, we were able to quantify several independent reasons:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Some significant sources of traffic forced imbalance. For example, many of our systems are “city aware,” with a city always being in a single region. This naturally drove different amounts of traffic to each region, with proportions changing continuously as cities woke up and fell asleep.&lt;/li&gt;


&lt;li&gt;Services ran across several hardware SKUs, both within and across clusters.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;Even the theoretically identical hardware showed significant performance differences.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Some of the imbalance was left in an “unknown” bucket. The majority of it turned out to be issues with our observability. We currently attribute the remainder (less than 20% of the original imbalance) to &lt;a href=&quot;https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;noisy neighbors&lt;/a&gt;.&lt;/p&gt;


&lt;p&gt;The graph below shows the initial analysis for one of our biggest services from 2020.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/IovK_kUSNl_76Da7EKegaYqylzW4OiEV2s09RUzrt8Ow18SAqVzjyVT2Wrj3HNseZCmCS_J1WN-NQ0tNJAbVO5OvknwSFOnFDpAmldb5QDSGtTrFedRGz9OBgOnX2aMe1ymxQui-d4sEXFATK81vruM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Imbalance understood.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-forced-to-build-long-term-aggregations&quot;&gt;&lt;strong&gt;Forced to build long-term aggregations&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;At that point, we wanted to start with any low-hanging fruit. The work described in &lt;a href=&quot;https://www.uber.com/blog/better-load-balancing-real-time-dynamic-subsetting/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Better Load Balancing: Real-Time Dynamic Subsetting | Uber Blog&lt;/a&gt; gave us a few knobs we could tweak. Instead of being easy, however, this presented a new problem.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/AHpjR9Jv7KdKJ-xuaOkLWwL9iswUHo2Of3qYZlMnURcDdhlawPN6XfL9pf96j3rFVjclApy_BT1IOEfX1w7HHzVBiuljPda6NKXGWgzQ839BfeZw53O1AYa-hYsk0qgq-2cAIEIF5RcnyaKf5ETUuSo&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: Patterns in weekly CPU utilization of a single service.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Our services exhibit heavy daily and weekly cycles (see above). On top of that, we frequently see spikes caused by failures, deployments, failovers, or ad hoc events. After rolling out a change, only a massive improvement (20%+) would be human-spottable, but our changes were too subtle.&lt;/p&gt;


&lt;p&gt;This resulted in the observability decisions explained in the previous sections. We built pipelines to aggregate data over long periods based on a stable and spike-resilient metric. On top of that, we could slice the metrics by clusters, zones, regions, or groups of services–this, in turn, let us investigate more “suspicious” behavior.&lt;/p&gt;


&lt;p&gt;Some pre-existing knobs let us reduce the service-mesh-induced part of the load imbalance, but it was a small fraction of the overall problem.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-possible-solutions&quot;&gt;&lt;strong&gt;Possible solutions&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;An obvious first step was to look at low-level hardware configuration and OS settings. A few separate threads were started to look at these.&lt;/p&gt;


&lt;p&gt;Solving the hardware heterogeneity required a more complicated process. Many approaches were possible:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Modifying &lt;a href=&quot;https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;CFS parameters&lt;/a&gt; to make every host in the fleet appear the same despite the underlying hardware being different.&lt;ul&gt;&lt;li&gt;This option was attractive but eventually dismissed due to its unclear impact on various software stacks (like &lt;a href=&quot;https://pkg.go.dev/runtime#GOMAXPROCS&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;GOMAXPROCS&lt;/a&gt;). In retrospect, this would also have prevented us from using a configuration that utilizes &lt;a href=&quot;https://www.uber.com/blog/avoiding-cpu-throttling-in-a-containerized-environment/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;cpu-sets&lt;/a&gt;.&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;Modifying host-to-cluster placement to achieve uniform clusters.&lt;/li&gt;


&lt;li&gt;Modifying the per-service cluster placement to guarantee stable, but not uniform, host selection.&lt;/li&gt;


&lt;li&gt;Moving to cloud-style host management, where each team would select a particular type of hardware.&lt;/li&gt;


&lt;li&gt;Many possible service mesh changes to achieve better load &lt;em&gt;im&lt;/em&gt;balancing.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/YC4s0sGZby4epTYOC36Um_cXU12K4MNkNHr_RSpt31Nijf4xB2IrGiPs3TJosQOVCrfsNnNov_ym5o2A9SLApgHEQl3u3Uu_JpuEHsh4w2gHuytW0lxB09_iU0oOdzf0p_6tpnAq7muT8DtwhUhcBcw&quot; alt=&quot;Screenshot of a spreadsheet, fields colored by the feasibility&quot; title=&quot;Possible fixes (unreadable on purpose)&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Option matrix (blurred out on purpose)&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Out of the possible options, changes to the service mesh were chosen for several reasons. Technically, changes on our layer required no changes to the physical layout of the data centers and no per-service migrations. Tactically, we could also deliver the changes quickly to most services.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-changes&quot;&gt;&lt;strong&gt;Changes&lt;/strong&gt;&lt;/h1&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-hardware&quot;&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;While root-causing variance within hardware SKUs, we found many issues with hardware, firmware, and low-level software. They ranged across OS settings, CPU governor settings, firmware versions, driver versions, CPU microcode versions, and even kernel version incompatibility with Intel HWP. A general root cause was that, historically, once hardware was ingested and turned up in the fleet, it was left untouched unless it had issues. Over time, that led to drift between machines.&lt;/p&gt;


&lt;p&gt;Uber runs in a mixed cloud/private setup, so we naturally experienced cloud-specific issues as well. Like other companies, we’ve seen multiple cases of theoretically identically provisioned VMs not performing similarly (&lt;a href=&quot;https://www.reddit.com/r/aws/comments/547xbx/netflix_found_5x_performance_variation_between/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;this&lt;/a&gt; is still real). Similarly, we’ve seen cases where workloads running fine on-prem triggered issues on the cloud. To make it worse, the cloud meant less visibility into the details of the underlying infrastructure.&lt;/p&gt;


&lt;p&gt;Fixing all of these would have been nearly impossible without the recently finished &lt;a href=&quot;https://www.uber.com/en-IN/blog/crane-ubers-next-gen-infrastructure-stack/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Crane project&lt;/a&gt;, which let us measure, fix, and roll out changes to tens of thousands of machines without human involvement. All of the issues discovered are now detected and remediated automatically.&lt;/p&gt;


&lt;p&gt;A clear benefit of these fixes was that they applied to every workload, no matter how it processed or originated its work (Kafka, Cadence, RPCs, timers, batch jobs, etc.). They also gave us effectively free capacity on top of the load imbalance improvements–some CPUs “became faster” overnight.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-observability&quot;&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Observability was an interesting part of the problem. Before the project started, we knew we had limitations in the sample collections due to 1-minute window sizes, but we found more issues.&lt;/p&gt;


&lt;p&gt;Technically, the problems were caused by interactions between &lt;a href=&quot;https://en.wikipedia.org/wiki/Cgroups&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;cgroups&lt;/a&gt;, &lt;a href=&quot;https://github.com/google/cadvisor&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;cexporter&lt;/a&gt;, our internal &lt;a href=&quot;https://prometheus.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Prometheus&lt;/a&gt; metric scraper, and &lt;a href=&quot;https://github.com/m3db/m3&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;m3&lt;/a&gt;. In particular, because the metrics were emitted as ever-increasing gauges, any delay in stats collection anywhere in the pipeline resulted in (large) artificial spikes in percentile calculations. A lot of work went into preserving the timestamps of the samples and gracefully handling restarts of both target and collector services. One example &lt;a href=&quot;https://github.com/google/cadvisor/issues/2913&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;issue&lt;/a&gt; effectively broke data collection for any sufficiently large service.&lt;/p&gt;


&lt;p&gt;A fascinating aspect of the observability issues was related to human interactions – or the fact that humans cannot be trusted. Early in the project, we asked service owners what level of container utilization resulted in user impact (increased latencies). Interestingly, several months later, after we had rolled out the fix, when we asked again, we received the same answer. Both statements couldn’t be valid since we knew the old data was wrong. Ultimately, human irrationality resulted in net efficiency wins: service owners ended up running their services (effectively) hotter while thinking nothing had changed.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-load-balancing&quot;&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;As explained in &lt;a href=&quot;https://www.uber.com/blog/better-load-balancing-real-time-dynamic-subsetting/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Better Load Balancing: Real-Time Dynamic Subsetting | Uber Blog&lt;/a&gt;, our service mesh works on two levels. Initially, the control plane sends over &lt;em&gt;assignments &lt;/em&gt;deciding how much traffic should be sent to each target cluster. The imbalance between clusters is decided here.&lt;br&gt;Later, the data plane follows this assignment, but then it’s responsible for picking the right host–a second level of within-cluster load balancing is happening here. While we considered changing this model, we kept it unchanged and rolled out two solutions for each level.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-inter-cluster-imbalance&quot;&gt;&lt;strong&gt;Inter-cluster imbalance&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;At Uber, services run in multiple zones in multiple regions. Because each zone is turned up at a different time, there is no way to guarantee the hosts in each zone are the same–usually, the newer the zones, the newer the generation hardware they have. The difference in performance of zones leads to CPU imbalance.&lt;/p&gt;


&lt;p&gt;Our initial approach was to set a static weight for each zone; the weight was then used in load balancing so that zones with faster hardware took more requests. The weight for each zone is calculated as the average of the Normalized Compute Unit (NCU) factor of each host deployed in that zone. The NCU factor measures host CPU/core performance based on a benchmark score, where the score depends on the product of core instructions-per-cycle (how much work is done by the core per clock cycle) and core frequency (how many clock cycles are available per second).&lt;/p&gt;


&lt;p&gt;We could then send more traffic to more powerful/faster zones, using static zone weights as a multiplier.&amp;nbsp;&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Faster zones, with higher multipliers, receive proportionally more traffic, which raises their CPU utilization and eases the CPU imbalance.&lt;/p&gt;


&lt;p&gt;For example, if a service has deployed 10 instances in each of zone A (weight = 1) and zone B (weight = 1.2), the load balancing will be done as if B had 12 (10 * 1.2) instances, so B will receive more requests than A.&lt;/p&gt;
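
&lt;p&gt;The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the proportional split, not our production code; the function name &lt;code&gt;zone_traffic_shares&lt;/code&gt; and the dictionary layout are assumptions made for the example.&lt;/p&gt;

```python
def zone_traffic_shares(instances, weights):
    """Split traffic across zones in proportion to instances * zone weight.

    Hypothetical helper: both arguments are dicts keyed by zone name.
    """
    # Effective instance count: a faster zone counts for more instances.
    effective = {zone: instances[zone] * weights[zone] for zone in instances}
    total = sum(effective.values())
    return {zone: value / total for zone, value in effective.items()}


# The example from the text: 10 instances in each zone, B is 1.2x faster.
shares = zone_traffic_shares({"A": 10, "B": 10}, {"A": 1.0, "B": 1.2})
for zone in sorted(shares):
    print(zone, round(shares[zone] * 100, 1))
```

&lt;p&gt;With these inputs, zone A is treated as 10 of 22 effective instances and zone B as 12 of 22, so B receives roughly 54.5% of the requests.&lt;/p&gt;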


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/L22FN7eamJeUaNg5UhAokiSDGFniHnCMQqexYTjB-BIGI_bUVBreGefBUA1dqB3iOcbUUe6rVj9noQvx3Doqxjqigl7GkEk0aEYDYIOYmAVv9crDdtH6vwqrnQSwazMEU5nwjggUTjsRrhyDKWPqNkA&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Zone Weights&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;This approach worked surprisingly well–we were able to mitigate the majority of the imbalance with relatively little effort. However, there were a few issues:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Zone weight is an estimated value (average NCU factor) across all hosts in a zone. However, a service could be extremely lucky/unlucky to be deployed on the fastest/slowest hosts in a zone.&lt;/li&gt;


&lt;li&gt;Although it happens infrequently, the set of zones we operate in changes due to turnups and turndowns. Additionally, during a turnup, we typically ingest hardware gradually, which might require multiple weight updates.&lt;/li&gt;


&lt;li&gt;Occasionally, we ingest new hardware into old zones to resize them or replace broken hardware. This hardware can be of a different type, resulting in a need to adjust the weights.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-dynamic-host-aware-cluster-load-balancing&quot;&gt;&lt;strong&gt;Dynamic Host-Aware Cluster Load Balancing&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;Hence, we took a second look at the problem and invested in an advanced solution: Host-aware Traffic Load Balancing.&lt;/p&gt;


&lt;p&gt;This approach solves the drawbacks by looking at the exact hosts the service instances are deployed to, collecting their server types, and then updating the load balancing between clusters per service. This is achieved by making our discovery system aware of the mapping of a host (by IP), its host type, and weight such that for a given service deployed in a cluster, the discovery system could provide the extra &lt;strong&gt;&lt;em&gt;weight&lt;/em&gt;&lt;/strong&gt; info to our traffic control system. The diagram below shows an example:&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/BTH2IqexwVovC1iquHIIHhv744jyBweDXUPMKK4a5bDezGdqTbhfxk3SVVl5SZCz5bvZhFdg1td0ptbM1Hmy5SxzxxRBkDstE-jwYQzPSWQIPoukzhtztVliK1Gp6f3k6VUM6uSH5jLKGOsew_h7eH8&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Dynamic Host-Aware Cluster Load Balancing&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;For service Foo, if we treated each instance equally, the load balancing ratio would be &lt;strong&gt;37.5&lt;/strong&gt;%/&lt;strong&gt;62.5&lt;/strong&gt;% instead of the &lt;strong&gt;&lt;em&gt;36&lt;/em&gt;&lt;/strong&gt;%/&lt;strong&gt;&lt;em&gt;64&lt;/em&gt;&lt;/strong&gt;% shown in the example. The difference can become more significant when hosts span multiple hardware generations (weights differ by up to 2x between hosts in our fleet).&lt;/p&gt;


&lt;p&gt;Compared with the static weight approach, the host-aware load balancing adjusts weight per service dynamically to reduce inter-cluster imbalance. It’s also much easier to maintain, as new host types are introduced infrequently.&lt;/p&gt;
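
&lt;p&gt;A minimal sketch of the per-service calculation, assuming the discovery system can hand us a mapping from host IP to relative weight. The function name, the IP addresses, and the weights below are hypothetical and are not the values from Figure 9.&lt;/p&gt;

```python
def cluster_shares(instances_by_cluster, host_weights, default_weight=1.0):
    """Traffic share per cluster, based on the hosts a service actually runs on.

    instances_by_cluster: cluster name mapped to the host IPs of its instances.
    host_weights: host IP mapped to relative performance (hypothetical data).
    """
    totals = {
        cluster: sum(host_weights.get(ip, default_weight) for ip in ips)
        for cluster, ips in instances_by_cluster.items()
    }
    grand_total = sum(totals.values())
    return {cluster: total / grand_total for cluster, total in totals.items()}


# Hypothetical deployment: 3 instances in one cluster, 5 in the other.
deployment = {
    "cluster-1": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "cluster-2": ["10.0.1.1", "10.0.1.2", "10.0.1.3", "10.0.1.4", "10.0.1.5"],
}
# Two hosts in cluster-2 are assumed to be a newer, faster generation.
weights = {"10.0.1.4": 1.2, "10.0.1.5": 1.2}

equal = cluster_shares(deployment, {})
host_aware = cluster_shares(deployment, weights)
print(equal, host_aware)
```

&lt;p&gt;Treating every instance equally gives a 37.5%/62.5% split for this layout; with the assumed weights, the faster cluster’s share grows slightly, which is the kind of per-service adjustment a single static zone weight cannot express.&lt;/p&gt;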


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-intra-cluster-imbalance&quot;&gt;&lt;strong&gt;Intra-cluster imbalance&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;The intra-cluster imbalance, as explained earlier, is the responsibility of the on-host proxy (called Muttley). Each proxy had complete control of selecting the right peer for each request. The original load-balancing algorithm for Muttley used by all services was least-pending, which would send requests to the peer with the smallest number of known outstanding requests. While this resulted in almost perfect balancing of RPS when measured in 1-minute intervals, it still resulted in an imbalance of CPU utilization due to different hardware types.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-assisted-load-balancing-alb&quot;&gt;&lt;strong&gt;Assisted Load Balancing (ALB)&lt;/strong&gt;&lt;/h3&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/_CL-GbcOXQlLQSs4Aq-a2Q7NaKUOA7HjoTu_P_nt0NnJM1biDm26uHcmmuVlePy5vgZenjeQsxii3KraWa3AU6ixB51OUDJehHqHMBQaM5co9vnXPy4AokvL1plsofFDmuqISUQ2drgx9sPcZylHFiQ&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: Assisted Load Balancing in a nutshell.&amp;nbsp;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;We built a system where each backend &lt;em&gt;assists&lt;/em&gt; the load balancer in selecting the next peer. An application middleware layer attaches load metadata as a header to each response. We effectively arrive at a coordinated system without central coordination. Where previously, each Muttley only knew about the load it caused (plus some information it could infer from the latencies), now, it learns about the total state of each backend dynamically. This state is affected not only by the backend itself (for example, running on slower hardware) but also by decisions made by other Muttleys. For example, if a backend is (randomly) selected into too many subsets, the system adjusts dynamically. This let us later on reduce the subset sizes for services on ALB.&lt;/p&gt;


&lt;p&gt;While a brief mention in the &lt;a href=&quot;https://sre.google/sre-book/load-balancing-datacenter/#weighted-round-robin-eKspTGCm&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Google SRE book&lt;/a&gt; partially inspired this approach, we made two different choices. Both were related to each other and were meant to simplify the approach. We intended to start simple, evaluate, and move to a more complicated solution later–luckily, we didn’t have to. Late in the implementation, we discovered a &lt;a href=&quot;https://netflixtechblog.com/netflix-edge-load-balancing-695308b5548c&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Netflix blog post&lt;/a&gt; and found that we had arrived at similar conclusions independently.&lt;/p&gt;


&lt;p&gt;First, as the load metadata, we used the number of concurrent requests being processed, reported as an integer (q=1, q=2, ..., q=100, etc.). We considered reporting utilization, too, but it wasn’t immediately obvious how to measure it (whether the reported utilization should be based on &lt;a href=&quot;https://man7.org/linux/man-pages/man2/getrusage.2.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;getrusage&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/Cgroups&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;cgroups&lt;/a&gt;). Cgroups were more natural since that’s what service owners were using to track their targets. Still, they presented more challenges–our foundation team was concerned about the cost of each Docker container scraping cgroups independently and about potential tight coupling if the cgroups layout were to change, including during the cgroupsv2 migration. We could have solved this by integrating with a host daemon collecting the stats, but we wanted to avoid adding a new runtime dependency. In the end, just using a logical integer worked well enough (with some tweaks, explained below). Additionally, it allowed per-service overrides without changing the load balancer code–while the vast majority of applications use the standard load indicator, some (asynchronous) applications override it to reflect their load better.&lt;/p&gt;


&lt;p&gt;The second departure was the &lt;a href=&quot;https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;power of two random choices&lt;/a&gt; instead of the weighted round-robin. Since we had only a single integer as the load indicator, the pick-2 implementation seemed more straightforward and safer. Similarly to the above, this worked well enough that we didn’t need to change it. This approach turned out remarkably forgiving to failures across the whole range of our applications. Apart from typical crash looping or OOMing applications, we’ve had cases of bad/buggy implementations of the middleware not causing an incident. We speculate that since the weighted round-robin is more precise and “strict,” it would have likely performed “better” in some cases but could have resulted in &lt;a href=&quot;https://en.wikipedia.org/wiki/Thundering_herd_problem&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;thundering-herd&lt;/a&gt;-like scenarios.&lt;/p&gt;
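
&lt;p&gt;For readers unfamiliar with the technique, a bare-bones version of pick-2 selection might look like the sketch below. It assumes the proxy keeps a dictionary of the latest known load score per peer; the names &lt;code&gt;pick_two&lt;/code&gt; and &lt;code&gt;loads&lt;/code&gt; are illustrative and do not come from Muttley.&lt;/p&gt;

```python
import random


def pick_two(loads, rng=random):
    """Power of two random choices: sample two peers, keep the less loaded one.

    loads: peer name mapped to its last known load score (hypothetical shape).
    rng: the random module or a random.Random instance.
    """
    first, second = rng.sample(sorted(loads), 2)
    return min(first, second, key=lambda peer: loads[peer])


# Hypothetical subset of three peers with very different load scores.
loads = {"peer-a": 1000, "peer-b": 2000, "peer-c": 9000}
rng = random.Random(7)
picks = [pick_two(loads, rng) for _ in range(500)]
print(sorted(set(picks)))
```

&lt;p&gt;Because each decision compares only two randomly sampled peers, a single wrong or stale load value can misroute only a fraction of the requests, which is one plausible reason for the forgiving behavior described above.&lt;/p&gt;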


&lt;p&gt;Implementation-wise, each Muttley uses a &lt;a href=&quot;https://en.wikipedia.org/wiki/Moving_average#Modified_moving_average&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;modified moving average&lt;/a&gt; to keep the score of each peer over the 25 previous requests–this value worked best in our testing. To arrive at meaningful numbers for lower-RPS cases, we scale up each reported load by a thousand.&lt;/p&gt;
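
&lt;p&gt;The update rule can be sketched as follows. The window of 25 and the scale factor of 1,000 come from the text; the use of integer arithmetic and the way the first sample seeds the average are assumptions made for this example.&lt;/p&gt;

```python
WINDOW = 25   # number of previous requests, as stated in the text
SCALE = 1000  # factor applied to each reported load, as stated in the text


def update_score(previous, reported_load, window=WINDOW, scale=SCALE):
    """Modified moving average: new = ((N - 1) * old + sample) / N.

    previous is None for a peer we have not heard from yet; in that case
    the first sample seeds the average (an assumption of this sketch).
    """
    sample = reported_load * scale
    if previous is None:
        return sample
    # Integer division is assumed here to show why scaling matters.
    return ((window - 1) * previous + sample) // window


score = update_score(None, 2)        # the peer first reports q=2
score = update_score(score, 3)       # then reports q=3
unscaled = update_score(2, 3, scale=1)
print(score, unscaled)
```

&lt;p&gt;With scaling, the score moves from 2000 to 2040 after a single q=3 report. Without scaling, the same integer update stays at 2, so small changes in load would be invisible at low request rates.&lt;/p&gt;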


&lt;p&gt;An interesting problem for the pick-2 load balancer is that the “most loaded” peer would &lt;em&gt;never&lt;/em&gt; be selected. And because we discover peer load passively, from responses, we would also never refresh its state, thus leaving it effectively unused until another peer gets even slower. We initially mitigated this by implementing a “loser penalty,” where every time a peer loses the selection, its “load value” is internally reduced–thus, with enough losses, the peer would be selected again. This didn’t work well for scenarios with a large caller instance count and low RPS, where it would sometimes take minutes for a peer to be reselected. Eventually, we changed this to a time decay, where a peer’s score is reduced based on its last selection time. We currently use a half-life of 5 seconds for score decay.&lt;/p&gt;
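
&lt;p&gt;A half-life decay of this kind can be written as a single exponential. The sketch below assumes the proxy stores, per peer, the last score and the time it was last selected; the function and parameter names are hypothetical.&lt;/p&gt;

```python
HALF_LIFE_SECONDS = 5.0  # half-life quoted in the text


def decayed_score(score, last_selected_at, now, half_life=HALF_LIFE_SECONDS):
    """Halve a peer's score for every half_life seconds since its last selection."""
    elapsed = max(0.0, now - last_selected_at)
    return score * 0.5 ** (elapsed / half_life)


# A peer last selected at t=0 with a stale score of 8000.
after_5s = decayed_score(8000, last_selected_at=0.0, now=5.0)
after_10s = decayed_score(8000, last_selected_at=0.0, now=10.0)
print(after_5s, after_10s)
```

&lt;p&gt;Unlike the loser penalty, the decay depends only on wall-clock time, so a peer becomes eligible again at the same pace regardless of how many requests an individual caller sends.&lt;/p&gt;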


&lt;p&gt;We also implemented a feature we internally call a “throughput reward.” This stemmed from the empirical observation that newer hardware handles concurrent requests better. We noticed that when load balancing across two peers on different hardware, with both peers reporting the same “load value,” we, as expected, send more requests to the faster peer. However, the faster peer’s CPU utilization (processed=15, CPU=10%, Q=5) remains lower than the slower peer’s (processed=10, CPU=12%, Q=5). To compensate for this, every time a peer “finishes” a request, we reduce its load slightly to push even more requests to it. The faster the peer is relative to other peers in the subset, the more “throughput rewards” it receives. This feature reduced the P99 CPU utilization by 2%.&lt;/p&gt;


&lt;p&gt;The majority of the ALB design document was devoted to possible alternatives. Instead of attaching the load metadata to each response, we seriously considered using a central component to collect and distribute the data. The concern was that the metadata might consume a significant amount of available bandwidth. Internally, we have two systems that superficially seemed relevant. The first was the centralized health-checking system, which collects health state from every container in the fleet in close to real time. The second was the real-time aggregation system described in the previous blog post.&lt;/p&gt;


&lt;p&gt;&lt;br&gt;Re-using either turned out to be unfeasible: the health checker system could have easily collected the load status from all containers, but after collection, that system was designed to distribute the health changes infrequently–the vast majority of the time, the containers remained healthy. The load balancing indicators, however, change constantly and by design. Since we operate a flat mesh (every container can talk to every container), we would need to constantly distribute data about millions of containers to hundreds of thousands of machines or build a new aggregation and caching layer. The load-report aggregation system, similarly, was not a match–it was operating on aggregated per-cluster values at several orders of magnitude lower cardinality.&lt;/p&gt;


&lt;p&gt;Ultimately, we were happy with the chosen (response-header-based) approach. It was simple to implement and made cost attribution easy–services pushing more RPS saw a higher bandwidth cost. In absolute numbers, the cost of the extra metadata (~8 bytes per request) was almost invisible compared to the other tracing/auth metadata attached to each request.&lt;/p&gt;


&lt;p&gt;Latency was an interesting aspect of the “distributed” vs. “centralized” collection of the load data. Theoretically, the response header approach is close to real time, since the load is attached to each response. However, since each Muttley needs to discover this independently and then average the reported load over the previous responses, the discovery might take some time in low-RPS scenarios. The health-check-based approach would require a full round trip (typically ~5s), but the result would be distributed to all caller instances immediately.&lt;/p&gt;


&lt;p&gt;&lt;br&gt;However, had we implemented it, we would have likely reduced the push frequency to something like 1 minute due to bandwidth concerns listed in the previous paragraph. This could have been enough to fix the hardware-induced skews but likely not other issues, like traffic spikes, slow-starting applications, or failovers. Both approaches could have likely worked slightly differently in different circumstances. Still, ultimately, we’re happy with the distributed approach–it’s easy to reason about and lacks centralized components that might fail.&lt;/p&gt;


&lt;p&gt;One downside of the chosen approach was that it requires cooperation from the target services. While minimal work is required, applying it to thousands of microservices would be arduous. Luckily, most applications built in the last few years at Uber used &lt;a href=&quot;https://github.com/yarpc/yarpc-go&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;common frameworks&lt;/a&gt; that allowed us to plug in the required middleware quickly. Several large services were not using the frameworks, but a concurrent multi-year effort had migrated almost all services. We found the decision to bet on the framework beneficial, as it had a compounding effect–service owners had one more reason to invest in migration. By the time we got to writing this post, virtually all services were on the common frameworks.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-static-component-alb-v1-1&quot;&gt;&lt;strong&gt;Static component – ALB v1.1&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;The initial rollout did not meet our hardware-induced imbalance reduction goals. The primary reason was that our hardware runs heavily underutilized most of the time–we keep buffers for regional failovers and weekly peaks. It turned out that at relatively low container utilization, the old hardware can burst high enough for latency differences not to be visible, while still consuming more CPU time. While this meant the load balancing was working much better under stress (when we needed it), it made product engineers uncomfortable with our target utilization–the imbalance looked too high off-peak.&lt;/p&gt;


&lt;p&gt;We added a second static component to the load balancing to address this. We utilized the fact that in our setup, the IP address of a host never changes. Since the proxy naturally knows the destination’s IP address, we only need to provide a mapping of the IP addresses to relative host performance. Because of the static nature of the data, we started adding this information as part of the build-time configuration. This weight in itself is not perfect: different applications perform differently on the same hardware type. However, combined with the ALB’s dynamic part, this worked well–we did not need to add application-specific weights.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-testing&quot;&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;A big problem during development was testing. While we had a limited staging environment, the new solution needed to work across many parameters: some callers or callees had three instances, some three thousand. Some backends were serving fewer than 1 RPS, and some more than 1,000 RPS. Some services served a single homogeneous procedure, and others hundreds, with latencies varying from low milliseconds to tens of seconds. Ultimately, we used a dummy service in production with a set of fake load generators configured to represent a heterogeneous load. We ran over 300 simulations before finding the right parameters and attempting to roll out to production services.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;We are happy with the final results. The exact numbers depend on the service and the hardware mix within each cluster, but on average we reduced P99 CPU utilization by 12%, with some services seeing benefits of over 30%. The results were better the more load the target service had per backend–luckily, the largest services we cared about most were typically optimized enough. The same luck applied to onboarding–while Uber has over 4,000 microservices, onboarding the top 100 gave us the vast majority of potential reach.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-rollout-and-future-changes&quot;&gt;&lt;strong&gt;Rollout and future changes&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;The rollout went well–we have not identified any material bugs. The pick-2 load balancing and safe fallback proved resilient. We onboarded services by tier, region by region, trying to find representative types of services.&lt;/p&gt;


&lt;p&gt;ALB was rolled out to hundreds of our biggest services with minimal hiccups or changes:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;em&gt;Long-lived RPC Streams&lt;/em&gt;. A small category of services was mixing up a small number of long-lived RPC streams with many very short-lived requests. We rolled back the onboarding there.&lt;/li&gt;


&lt;li&gt;&lt;em&gt;Slow-starting Runtimes&lt;/em&gt;. Around two years into the rollout, we tweaked the solution to handle slow-starting (Java) services better. These services could not serve the same request rate after startup due to JIT, but warm-up with recorded static requests was not working well enough; we needed to warm up the service with real requests at a lower rate. Here, we decided to seed each peer’s initial “weight” with a percentage of the average weight for the pool while leaving the algorithm’s core unchanged. We found this to work very well across a range of services, and we’re happy that this doesn’t require any static window settings, unlike Envoy’s &lt;a href=&quot;https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/slow_start&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;slow start mode&lt;/a&gt;–the algorithm adjusts to a range of RPS automatically.&lt;/li&gt;


&lt;li&gt;&lt;em&gt;Data Prefetching on Startup.&lt;/em&gt; Another very small category of services was pre-loading static data upon startup for several minutes. Due to the peculiarities of our service publishing mechanism, instances of those services are visible in our service discovery as “unhealthy.” The old algorithm strongly preferred the healthy instances. We changed that in ALB to avoid a thundering-herd-like scenario when a service cannot start after a temporary overload (due to each instance being instantly overloaded as they become healthy sequentially). The new algorithm significantly prefers healthy instances, but, in some cases, requests might be sent to “unhealthy” nodes. This doesn’t work for these services–while the reported error was &amp;lt;0.01% and 0.002%, we’re exploring changes similar to the &lt;a href=&quot;https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;panic threshold&lt;/a&gt; to make this disappear entirely.&lt;/li&gt;


&lt;li&gt;&lt;em&gt;IP Address Mapping&lt;/em&gt;. The static mapping of IP address to server type worked well for 2+ years, but it will likely need to be adjusted as we &lt;a href=&quot;https://www.oracle.com/news/announcement/uber-selects-oracle-cloud-infrastructure-2023-02-13/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;move our workloads to the cloud&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Interestingly, two services overrode the default load providers to emit custom load metrics based on background job processing. This suggests that the defaults worked well for most services, while the solution remained flexible enough to support other use cases.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-summary&quot;&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/h1&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/arrmQ21tVj4fLv6DVkrkdoYxp0bRT0f1Ab5Ha98zym8rjLVCbLymSXDdqyP_RJxBlIg5KE6zaAxzjYWdmgmq13h5M-GPgiEWTF5EczSNYkmFjn3sw0EiMlh97DhB32wtW0vjRSUqRGdPjC-rRQ_sebU&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 11: Zone Weights rollout.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The project delivered very significant efficiency wins. We can run our containers at higher utilization levels, and load imbalance is no longer problematic for stateless workloads. The hardware configuration improvements resulted in double wins from reduced imbalance and pure compute capacity.&lt;/p&gt;


&lt;p&gt;More interestingly, from the engineering blog perspective, the project also resulted in several learnings.&lt;/p&gt;


&lt;p&gt;The primary one was the importance of data. The problem was real, but we started the project under the wrong assumptions. We didn’t know how to measure it; once we agreed on a measure, we lacked the tools to measure it effectively, especially over the long term. Even after that, we realized that the way we collected samples from the underlying infrastructure was flawed. At the same time, the data won arguments, helped us home in on issues, and helped us prioritize the work with other teams. Another data lesson was to set up the data infrastructure right for the long term–it helped before and during the project, and it still helps afterward. We were able to use an existing data warehouse as a base, and we still periodically get questions about load imbalance. A link to the dashboard usually answers all of them.&lt;/p&gt;


&lt;p&gt;The second lesson was to add workarounds in the right place of the stack to get the data we needed. Building proper real-time observability would have taken us months or quarters. Instead, we quickly reached the right conclusions by using a targeted hack and selectively basing the observations on a sample of services. Related to that was the willingness to do a lot of manual grunt work: to build the understanding, we spent weeks staring at dashboards and verifying assumptions before we started coding. Later, when implementing ALB and Zone/Cluster weights, we started with relatively small changes, verified assumptions, and iterated to the next version.&lt;/p&gt;


&lt;p&gt;The third, arguably less generalizable, lesson was to trust in the platforms. We made a bet that our microservices would migrate to the common frameworks. Similarly, when implementing, we built on top of years of pre-existing investments in the platform–the tooling (dashboards, debug tooling, operational knowledge, rollout policies) was already there, and we could roll out major changes reasonably quickly and safely. We built with the grain of the platform and avoided major rewrites that could have derailed the project.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;There were many people involved in the project. We thank Avinash Palayadi, Prashant Varanasi, Zheng Shao, Hiren Panchasara, and Ankit Srivastava for their general contributions; Jeff Bean, Sahil Rihan, Vikrant Soman, Jon Nathan, and Vaidas Zlotkus for hardware help; Vytenis Darulis for observability fixes; Jia Zhan and Eric Chung for ALB reviews; Nisha Khater for the per-instance-stats project; and Allen Lu for rolling out yarpc globally.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Logo attribution: “&lt;a href=&quot;https://www.flickr.com/photos/141290938@N03/26682754214&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Scales of Justice – The Law – Lawyers and Attorneys&lt;/a&gt;” by&amp;nbsp;&lt;a href=&quot;https://www.flickr.com/photos/141290938@N03&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;weiss_paarz_photos&lt;/a&gt;&amp;nbsp;is licensed under&amp;nbsp;&lt;a href=&quot;https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;CC BY-SA 2.0&lt;/a&gt;.&lt;/p&gt;
</description><link>https://www.uber.com/blog/load-balancing-handling-heterogeneous-hardware/</link><guid isPermaLink="false">https://www.uber.com/blog/load-balancing-handling-heterogeneous-hardware/</guid><pubDate>Thu, 07 Mar 2024 07:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category></item><item><title>Network IDS Ruleset Management with Aristotle v2</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;If you were to ask a veteran SOC (Security Operations Center) analyst about Network IDS (Intrusion Detection Systems) or IPS (Intrusion Prevention Systems), the response would probably contain phrases such as &lt;em&gt;“too many alerts,”&lt;/em&gt; and &lt;em&gt;“false positives.”&lt;/em&gt; At Uber, we face these same challenges of volume, accuracy, and manageability. Multiple times a day, more than 90,000 IDS rules are parsed, analyzed, updated, filtered, and deployed to our network sensors. &lt;a href=&quot;https://github.com/secureworks/aristotle/releases/tag/v2.0.0&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Aristotle v2&lt;/a&gt; was created to enable us to automate this process, apply induction-based intelligence extraction, and enhance rule metadata to reduce false positives and help ensure that appropriate IDS alerts receive proper attention.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-overview&quot;&gt;Overview&lt;/h2&gt;


&lt;p&gt;The IDS ruleset update process at Uber involves multiple steps, as shown in Figure 1. Collating and distributing rules is straightforward and common to all Suricata™ deployments. Deciding which rules to include and how they should be modified is what happens in step 4, “Filter Rulesets,” and will be the focus of this blog.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/6EfyApi87FAXsUWWOQm_9Dx78vMvI3h_N7q8faeJeghq0NI_U1tlQEGEXVwcKik1oxxRBBGJbhjjQ3SEhL9Cx4hPIwFHr8yUcMaXKLVdmz4DbRINAF4LBgk92UJKZAR993YpZQ3IsPgMauY88YLib5I&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 1: IDS Ruleset Update Process.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-background&quot;&gt;Background&lt;/h2&gt;


&lt;p&gt;IDS alerts are generated by IDS engines operating on logic governed by rules (or “rulesets”). At a basic level, IDS rules can be thought of as advanced pattern matching against network traffic and connection state. The most popular open source Network IDS engines are &lt;a href=&quot;https://suricata.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Suricata&lt;/a&gt; and &lt;a href=&quot;https://snort.org/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Snort&lt;/a&gt;™. This article focuses on Suricata, but the concepts and practices can also apply to Snort.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-ids-rule-selection-approaches&quot;&gt;IDS Rule Selection Approaches&lt;/h2&gt;


&lt;p&gt;Choosing which rules to apply to particular sensors can have a significant impact on false positive rates, undesired IDS alerts, and engine performance. For example, sensors protecting a pool of Linux® web servers don’t need to be running rules designed to detect attacks that target Windows® file sharing.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-rule-classification&quot;&gt;Rule Classification&lt;/h3&gt;


&lt;p&gt;Historically, ruleset consumers have used two big “knobs” when it comes to choosing which rules to enable or disable. The first is the “classtype,” a native rule keyword that attempts to categorize the rule using a finite set of options defined by the ruleset provider. Usually, no more than a few dozen classtype categories are defined, and common values include “trojan-activity,” “attempted-dos,” and “bad-unknown.” The second knob is the filename of the file that the rule is placed in by the ruleset provider, who will often segregate rules into different files with names like “sql.rules,” “scan.rules,” and “trojan.rules.”&lt;/p&gt;


&lt;p&gt;A major problem with these “knobs” is that they don’t allow for a one-to-many mapping. Each only supports a single value for a single rule. This lack of flexibility can be restrictive. For example, should a rule that detects recently seen exploit kit activity go into the “current-events.rules,” “exploit.rules,” or “web-client.rules” file (just to name a few options)? A similar challenge exists for the “classtype” field, where the activity being detected could legitimately be classified into multiple categories. These finite, blunt rule classification mechanisms are too broad to support the fine-tuning flexibility needed for modern deployments.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-manual-review&quot;&gt;Manual Review&lt;/h3&gt;


&lt;p&gt;In order to optimize rulesets for particular environments, they must be tuned. Often this results in a non-trivial, ongoing, and manual effort. In fact, some companies have a daily task of manually inspecting each new rule, deciding if it should be included, and then tuning it as necessary. However, this quickly becomes onerous and manifestly doesn’t scale, especially if existing rules have to undergo regular re-tuning as well.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-metadata&quot;&gt;Metadata&lt;/h3&gt;


&lt;p&gt;There exists a “metadata” keyword, supported by IDS engines like Suricata and Snort, that allows for arbitrary key-value pairs to be embedded into each rule. This can be extremely helpful in deciding which rules to enable because rules can be filtered based on the content of the metadata. Suricata will also include the metadata in the IDS alert, which can be used for more informed post-processing, decision making, and correlation.&lt;/p&gt;


&lt;p&gt;Metadata key-value pairs provide distinct advantages over traditional rule categorization, including:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;One-to-many mapping:&lt;/strong&gt;&amp;nbsp; For example, the “protocols” metadata key can have values “http” and “tcp”&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Arbitrary key names and values:&lt;/strong&gt;&amp;nbsp; Classification doesn’t have to be limited to pre-defined, finite options&lt;/li&gt;
&lt;/ul&gt;
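
&lt;p&gt;As a minimal sketch of how one-to-many metadata can drive rule selection, the Python below parses the “metadata” keyword of a rule into a mapping of keys to sets of values and then applies a simple boolean test. The rule text, the helper names “parse_metadata” and “has”, and the “attack_target” values are illustrative assumptions, not part of any real ruleset or of Aristotle itself.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Hypothetical rule for illustration; not taken from any real ruleset.
RULE = (
    'alert http $EXTERNAL_NET any -> $HOME_NET any '
    '(msg:"Example rule"; sid:1000001; rev:1; '
    'metadata:protocols http, protocols tcp, attack_target http-server;)'
)


def parse_metadata(rule):
    """Return a dict mapping each metadata key to the set of its values."""
    match = re.search(r"metadata:\s*([^;]*);", rule)
    if match is None:
        return {}
    metadata = defaultdict(set)
    # Pairs are comma separated; key and value are separated by whitespace.
    for pair in match.group(1).split(","):
        parts = pair.strip().split(None, 1)
        if len(parts) == 2:
            key, value = parts
            metadata[key.lower()].add(value.strip().lower())
    return dict(metadata)


def has(metadata, key, value):
    """True when the key carries the given value, possibly among several."""
    return value in metadata.get(key, set())


meta = parse_metadata(RULE)
# Keep HTTP rules unless they are tagged as targeting SMB servers.
keep = has(meta, "protocols", "http") and not has(meta, "attack_target", "smb-server")
```

&lt;p&gt;Because each key maps to a set, the same rule can be matched as both “protocols http” and “protocols tcp,” which a single classtype or filename cannot express.&lt;/p&gt;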


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-a-better-way&quot;&gt;A BETTER Way&lt;/h4&gt;


&lt;p&gt;The &lt;a href=&quot;https://better-schema.readthedocs.io/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;BETTER&lt;/a&gt; (Better Enhanced Teleological and Taxonomic Embedded Rules) schema for key-value based IDS rule metadata was proposed in 2019. It recognized the need for one-to-many metadata mappings, and attempted to bring some structure and standardization to commonly used metadata keys and (in some cases) values. One vendor—&lt;a href=&quot;https://www.secureworks.com/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Secureworks&lt;/a&gt;®—fully implemented BETTER in its Suricata ruleset offering, while other vendors such as &lt;a href=&quot;https://rules.emergingthreats.net/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Proofpoint ET Pro&lt;/a&gt;®, have rulesets with partial compatibility. BETTER never received widespread industry adoption, but its major concepts persist, and the use of metadata for ruleset filtering is still a solid strategy.&lt;/p&gt;


&lt;p&gt;Many ruleset providers do populate metadata, but almost all of them do so in a way that severely limits the effectiveness of using metadata as a means of rule filtering. Specifically, the rulesets have one or more of the following shortcomings:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Missing metadata&lt;/strong&gt;:&amp;nbsp; Either applicable metadata key-value pairs are not used in the ruleset, or metadata key-value pairs are applied selectively instead of universally. Filtering rulesets based on metadata is most useful if all applicable metadata are applied to all applicable rules, and utility falls off sharply when this is not the case. For example, setting the metadata “attack-target http-server” on 20 rules in the ruleset when there are 400 more rules that could be classified the same way, makes filtering based on that key-value pair of limited value.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Inconsistent value formatting&lt;/strong&gt;:&amp;nbsp; For example, “cve” key values may appear as “cve_2023_1234,” “cve_2023_1234_cve_2023_2468,” “2023_1234,” “2023-1234,” etc.&amp;nbsp; Without a normalized nomenclature, accurate filtering becomes challenging.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Poor value formatting&lt;/strong&gt;:&amp;nbsp; This includes things such as not using standard datetime formats like &lt;a href=&quot;https://www.iso.org/iso-8601-date-and-time-format.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;ISO 8601&lt;/a&gt; when specifying time/date strings.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-aristotle-v1&quot;&gt;Aristotle v1&lt;/h2&gt;


&lt;p&gt;In 2019, &lt;a href=&quot;https://www.secureworks.com/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Secureworks&lt;/a&gt; released &lt;a href=&quot;https://github.com/secureworks/aristotle/releases/tag/1.0.5&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Aristotle (v1)&lt;/a&gt;, an open source Python tool that allowed users to “filter” (enable or disable) rules based on metadata key-value pairs. By using a concrete boolean algebra, “filter strings” can be defined to control rule selection.&amp;nbsp; This can be quite powerful, but the usefulness of Aristotle v1 is limited by the richness (or rather, lack thereof) of the metadata in the provided rules, something controlled by ruleset vendors and onerous to maintain manually. Since most ruleset vendors do not provide comprehensive metadata and/or do not have metadata with the precision and consistency needed for accurate programmatic filtering, something more than Aristotle v1 is needed.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-metadata-and-beyond&quot;&gt;Metadata and Beyond&lt;/h1&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-aristotle-v2&quot;&gt;Aristotle v2&lt;/h2&gt;


&lt;p&gt;Uber recently contributed significant improvements to &lt;a href=&quot;https://github.com/secureworks/aristotle/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Aristotle&lt;/a&gt;, resulting in &lt;a href=&quot;https://github.com/secureworks/aristotle/releases/tag/v2.0.0&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Aristotle v2&lt;/a&gt;.&amp;nbsp; These updates added support for metadata normalization, enhancement, and manipulation. Figure 2 shows the different components of Aristotle v2, which will be discussed in more detail.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/xD-k_zGYrP0E2OvjjvYtxgHwHQ7JUMBexQANRA_Eu2aYPDpoMzEIrchMQf6wgIw2xjEyog4IaHfNg6sV20TwqA3Y71iIrf_Pd8w1_vgx7AWalDiYxSbndXriycNrLGhRqS_Gr3r2kjZ46tzqJqN6zOM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 2: Aristotle v2 components.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-filtering&quot;&gt;Filtering&lt;/h3&gt;


&lt;p&gt;Aristotle v1—which is basically the “Filter Rulesets” step—did a good job of supporting boolean filtering on metadata, and even included the ability to specify numerical relationships for certain keys (e.g., “created_at &amp;gt; 2023-01-01”).&amp;nbsp; To the list of keys that support such comparisons, Aristotle v2 added &lt;strong&gt;risk_score&lt;/strong&gt; (more on that key later).&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Additionally, the ability to do regular expression based filtering was introduced in Aristotle v2. While this does impact filtering performance because it adds non-literal elements to the boolean expression, it does provide powerful and often needed capability. Specifically, regular expression matches can be applied to the entire rule with the “rule_regex” keyword, or scoped to just the “msg” field with the “msg_regex” keyword.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-normalization&quot;&gt;Normalization&lt;/h3&gt;


&lt;p&gt;To address the filtering challenges that come from a lack of consistent metadata value format, Aristotle v2 supports the normalization of certain metadata key values. Specifically, the following normalizations are supported:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;CVE&lt;/strong&gt; key value(s) normalized to format &lt;em&gt;YYYY-&amp;lt;num&amp;gt;&lt;/em&gt;.&amp;nbsp; If multiple CVEs are represented in the value and strung together with a “_” (e.g., “cve_2021_27561_cve_2021_27562” [&lt;em&gt;sic&lt;/em&gt;]), then all identified CVEs will be normalized and included.&lt;/li&gt;


&lt;li&gt;Values from the non-BETTER schema keys &lt;strong&gt;mitre_technique_id&lt;/strong&gt; and &lt;strong&gt;mitre_tactic_id&lt;/strong&gt; will be put into the standards compliant &lt;strong&gt;mitre_attack&lt;/strong&gt; key.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;Date key values (identified by key names that end with “_at” or “-at”, e.g., &lt;strong&gt;created_at&lt;/strong&gt;) are normalized, where possible, to ISO 8601 format &lt;em&gt;YYYY-MM-DD&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
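
&lt;p&gt;The CVE and date normalizations listed above can be sketched roughly as follows. This is an illustration of the idea under stated assumptions, not Aristotle’s actual implementation: the function names and the list of accepted input date formats are hypothetical.&lt;/p&gt;

```python
import re
from datetime import datetime


def normalize_cves(value):
    """Extract every CVE found in a metadata value, formatted as YYYY-num."""
    pairs = re.findall(r"(\d{4})[_-](\d{4,})", value)
    return [f"{year}-{num}" for year, num in pairs]


# Candidate input formats are assumptions; extend the tuple as needed.
DATE_FORMATS = ("%Y-%m-%d", "%Y_%m_%d", "%Y/%m/%d", "%m/%d/%Y", "%Y%m%d")


def normalize_date(key, value):
    """Normalize values of keys ending in _at or -at to YYYY-MM-DD."""
    if not key.endswith(("_at", "-at")):
        return value
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave values that cannot be parsed untouched
```

&lt;p&gt;For example, normalize_cves("cve_2021_27561_cve_2021_27562") yields both “2021-27561” and “2021-27562”, and normalize_date("created_at", "2023_01_15") yields “2023-01-15”.&lt;/p&gt;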


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-enhancement&quot;&gt;Enhancement&lt;/h3&gt;


&lt;p&gt;While normalizing metadata is necessary and useful, it can’t address the issue of missing metadata. However, a rule is more than its metadata, so we asked the question, “&lt;em&gt;can we identify, deduce, induce, or otherwise infer particulars from the rule’s ontology, and augment the rule metadata with that information?&lt;/em&gt;” This led to creating the ability of Aristotle v2 to analyze the ontology of each rule and add/update the metadata with the following &lt;a href=&quot;https://aristotle-py.readthedocs.io/en/latest/usage.html#enhance&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;enhancements&lt;/a&gt;:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;flow&lt;/strong&gt; key with values normalized to be either “to_server” or “to_client”&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;protocols&lt;/strong&gt; key and applicable values&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;cve&lt;/strong&gt; key and applicable values. The value(s) are based on data extracted from the raw rule, e.g., “msg” field, “reference” keyword, etc.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;mitre_attack&lt;/strong&gt; key and applicable values. The value(s) are based on data extracted from the rule’s “reference” keyword&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;hostile&lt;/strong&gt; key and applicable values (“dest_ip” or “src_ip”)—the values are the inverse of values taken from the “target” keyword&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;classtype&lt;/strong&gt; key and applicable values&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;filename&lt;/strong&gt; key and applicable values—the value will be the filename the rule came from, if the rule was loaded from a file&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;originally_disabled&lt;/strong&gt; key and boolean value get added on each rule internally, to be used for filtering&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;detection_direction&lt;/strong&gt; key (see below)&lt;/li&gt;
&lt;/ul&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-detection-direction&quot;&gt;Detection Direction&lt;/h4&gt;


&lt;p&gt;While network IDS rules can be bidirectional, the overwhelming majority of them are written to target just one side of the client-server communication. Additionally, rules are typically scoped by specifying IP address groups for the source and destination. IP address groups are user-defined but almost always include the variables “HOME_NET” and “EXTERNAL_NET.” The idea is that HOME_NET is the group of IP addresses owned by the user or company, and intended to be protected; and EXTERNAL_NET is the group of IPs “outside” the user’s network, typically the general Internet. EXTERNAL_NET is often (but not necessarily) defined as everything that isn’t specified in HOME_NET.&lt;/p&gt;


&lt;p&gt;The &lt;strong&gt;detection_direction&lt;/strong&gt; metadata key attempts to normalize the directionality of traffic on which the rule detects. To do this, the source and destination sections of the rule are processed and reduced down to “$HOME_NET”, “$EXTERNAL_NET”, “any”, or “UNDETERMINED”, and used to set the &lt;strong&gt;detection_direction&lt;/strong&gt; value as seen in Figure 3.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/rwHiJEqmAcq7pqON8pnrpt2Tp9XmvhGxRSe5Vcc-i1bpiIPwjm73jvH_0raA3E9dWOkzWzd4C3vlaM7oeCUvzUacl8uKFArxQSCNCb71h1-QLfNXZD736JMmp7FwkeLGdFhj5jZZajAfVmwdP0SbDoU&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 3: detection_direction values and conditions.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Knowing a rule’s “detection direction” is important in being able to determine the significance and seriousness of what it is detecting. For example, consider a rule that detects traffic known to be generated by devices infected with the Mirai malware. Such traffic seen inbound (coming from EXTERNAL_NET and directed to HOME_NET) can usually be classified as scanning and considered to be little more than Internet noise. Yet such traffic seen outbound (coming from HOME_NET and directed to EXTERNAL_NET), is a good indication that there is an infected device on your network and it is part of an active botnet. The latter case is more serious than the former and should be treated as such. The rule and its associated IDS alert need to be able to communicate these realities. Accurately classifying rules and their IDS alerts so that they can be programmatically responded to is important, and this is where Post Filter Modification comes into play.&lt;/p&gt;
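
&lt;p&gt;To make the idea concrete, here is a simplified sketch of deriving a direction label from a rule header. The label names (“inbound”, “outbound”, “internal”, “any”, “both”) and the reduction logic are assumptions for illustration; the authoritative values and conditions are those in Figure 3, and real address expressions (lists, negations) need more careful reduction than shown here.&lt;/p&gt;

```python
def reduce_group(group):
    """Reduce an address expression to one of four coarse tokens."""
    token = group.strip()
    if token in ("any", "$HOME_NET", "$EXTERNAL_NET"):
        return token
    return "UNDETERMINED"


# Label names are assumptions for illustration; Figure 3 is authoritative.
DIRECTION = {
    ("$EXTERNAL_NET", "$HOME_NET"): "inbound",
    ("$HOME_NET", "$EXTERNAL_NET"): "outbound",
    ("$HOME_NET", "$HOME_NET"): "internal",
    ("any", "any"): "any",
}


def detection_direction(rule):
    """Classify a header shaped like: action proto src sport -> dst dport (...)."""
    header = rule.split("(", 1)[0].split()
    src, arrow, dst = header[2], header[4], header[5]
    if arrow != "->":
        return "both"  # assumed label for bidirectional rules
    return DIRECTION.get((reduce_group(src), reduce_group(dst)), "UNDETERMINED")


# Hypothetical Mirai-style rules that differ only in direction.
inbound_rule = 'alert tcp $EXTERNAL_NET any -> $HOME_NET 23 (msg:"Example scan"; sid:1;)'
outbound_rule = 'alert tcp $HOME_NET any -> $EXTERNAL_NET 23 (msg:"Example scan"; sid:2;)'
```

&lt;p&gt;With labels like these in the metadata, the two otherwise identical rules can be told apart programmatically, so the outbound variant can be treated as the more serious of the two.&lt;/p&gt;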


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-pfmod-post-filter-modification&quot;&gt;PFMod (Post Filter Modification)&lt;/h3&gt;


&lt;p&gt;Aristotle v2 offers the option to further filter and modify the ruleset after normalization, enhancement, and initial filter string application. This is known as PFMod (&lt;a href=&quot;https://aristotle-py.readthedocs.io/en/latest/post_filter_mod.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Post Filter Modification&lt;/a&gt;) and allows for the identification of rules based on filter strings, and then particular “actions” taken on those rules.&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-pfmod-actions&quot;&gt;PFMod Actions&lt;/h4&gt;


&lt;p&gt;PFMod actions include the ability to add/delete metadata, enable/disable rules, set priority, and do a regular expression based “find and replace” on the full rule. Supported PFMod actions include:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;disable&lt;/strong&gt;: Disable the rule.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;enable&lt;/strong&gt;: Enable the rule. Note that for “disabled” rules to make it to PFMod for consideration, they must first match in the initial filter string matching phase.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;add_metadata&lt;/strong&gt;: YAML key-value pair where the (YAML) value is the metadata key-value pair to add, e.g. “protocols http”. Note that existing metadata using the given key is not overwritten; the new pair is added alongside it, unless an identical key-value pair already exists, in which case nothing is added.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;add_metadata_exclusive&lt;/strong&gt;: YAML key-value pair where the (YAML) value is the metadata key-value pair to add (e.g., “priority high”). If the given metadata key already exists, overwrite it with the new value.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;delete_metadata&lt;/strong&gt;: If a metadata key-value pair is given (e.g., “former_category malware”), remove the key-value pair from the rule. If just a metadata key name is given (e.g., “former_category”), remove all metadata using the given key, regardless of the value(s).&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;regex_sub&lt;/strong&gt;: Perform a regular expression find and replace on the rule.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;set_&amp;lt;keyword&amp;gt;&lt;/strong&gt;: Set the &lt;em&gt;&amp;lt;keyword&amp;gt;&lt;/em&gt; in the IDS rule string to have the given value. If the rule does not contain the given keyword, add it and set the value to the given value. Supported keywords include “priority,” “sid,” “gid,” “rev,” “msg,” “classtype,” “reference,” “target,” “threshold,” and “flow.” For integer keywords (“priority,” “rev,” “gid,” and “sid”), relative values can be used by preceding the integer value with a ‘+’ or ‘-’. For example, the action ‘set_priority “-1”’ will cause the existing priority value in the rule to be decreased by 1.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;set_&amp;lt;arbitrary_integer_metadata&amp;gt;&lt;/strong&gt;: Similar to “add_metadata_exclusive,” allows for the setting or changing of an arbitrary integer-based metadata key value, but also supports relative values, along with default values. Format shown in Figure 4.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/cDFJsx8a4aR3fzcHxzT5t7BZxYGOGfujqynX60OfUBJAWhp_kRonpAJJymbKHMKiwUzKGqDCVOlqw5YiAe6XLh6PfDVvMeAUWV3vwCTgRpSBLFmsNjhRAw9nfA-NBrRJppAEuYIlP9T7Fq-5apCTUAg&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 4: PFMod action syntax for setting arbitrary integer-based metadata.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;Notes:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;The &lt;em&gt;&amp;lt;arbitrary_integer_metadata&amp;gt;&lt;/em&gt; string corresponds to the metadata key name and must contain at least one underscore (‘_’) character.&lt;/li&gt;


&lt;li&gt;The metadata key being referenced should have a value corresponding to an integer.&lt;/li&gt;


&lt;li&gt;A preceding ‘+’ or ‘-’ to the given &lt;em&gt;&amp;lt;value&amp;gt;&lt;/em&gt; will cause the existing metadata value in the rule to be increased or decreased by the given &lt;em&gt;&amp;lt;value&amp;gt;&lt;/em&gt;, respectively. If the metadata key does not exist, then the value will be set to the given &lt;em&gt;&amp;lt;default&amp;gt;&lt;/em&gt; value, if provided, otherwise no change will be made.&lt;/li&gt;


&lt;li&gt;Examples are shown in Figure 5.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/8H_XGd2h-bhRh3_3TU3DOSXxru4ndUFegnldYy6niuSgJxr5fuWo721rruZzI-nUFTGNo0g9RMEwEXPthNZQxvrqa_Anwi37gcTuely-ZGlcZopCI8gu0RmZOQhsWioYFRD-BR16fG8Bonr4HgUdV_w&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 5: Example PFMod actions setting arbitrary integer-based metadata key-value pairs.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;
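
&lt;p&gt;The relative-value and default semantics described in the notes above can be sketched in a few lines of Python. The function name and the dictionary representation of rule metadata are assumptions made for illustration; they do not reflect Aristotle’s internal API.&lt;/p&gt;

```python
def set_integer_metadata(metadata, key, value, default=None):
    """Apply a set-style action to an integer metadata key, in place.

    value is a string such as "50", "+10", or "-5".
    """
    relative = value[:1] in ("+", "-")
    if not relative:
        # Absolute value: set the key, overwriting any existing value.
        metadata[key] = int(value)
    elif key in metadata:
        # Relative value: adjust the existing value up or down.
        metadata[key] = int(metadata[key]) + int(value)
    elif default is not None:
        # Key is missing: fall back to the provided default.
        metadata[key] = int(default)
    # Key is missing and no default was given: make no change.
    return metadata
```

&lt;p&gt;For instance, applying “+10” to a rule whose risk_score is 40 gives 50, while applying “-5” with a default of 20 to a rule that has no risk_score sets it to 20.&lt;/p&gt;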

&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-pfmod-rules&quot;&gt;PFMod Rules&lt;/h4&gt;


&lt;p&gt;PFMod conditions and actions are controlled by PFMod rules (not to be confused with IDS “rules”).&amp;nbsp; PFMod rules are defined in YAML files and are processed in a depth-first, linear fashion. This means that you can define actions that apply broadly to many (or all) rules, and then have more specific PFMod rules that apply more precise actions to subsets of those rules. As shown in Figure 6, PFMod rules files can “include” other PFMod rules files for easy organization.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/qYBdFR47BWVJxSUSy7f1-35up1H8fsOESih8Vy2oX_MT0mjf-dKWSyd83FEhgW-2tVZTOpZyeuaKX1zTCHarBkDhuI92DMehaMcs1gmrXy2rhsQ0V00a8HPQz3EOkEM9J_IYm4pcvklTxDtSiMiqyoQ&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 6: Example file using &lt;/em&gt;&lt;strong&gt;&lt;em&gt;include&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; to load multiple PFMod files.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;In addition to &lt;strong&gt;include&lt;/strong&gt; statements, a PFMod rule file can contain multiple rules. Figure 7 shows an example PFMod rule file that updates IDS rules and metadata.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/3omcuZfAMQjK8WXdHLOLj1ITew4Nu75Ha9jPFbjPZiV7aIo3jYiHTTQY0lN0txuMFxS-VCJMtnSZegFxZux-XxMphuu4gZHLjBkaAMutOjnO66o_liREiFfbWrn2mbT9bib9G-1X-Cwz8GiAOocgNy4&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;&lt;em&gt;Figure 7: Example PFMod file with &lt;/em&gt;&lt;strong&gt;&lt;em&gt;rules&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; specified.&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-risk-score&quot;&gt;Risk Score&lt;/h4&gt;


&lt;p&gt;Each Suricata rule that is deployed at Uber receives a “risk score.” These risk scores are automatically generated at ruleset compile time and applied as metadata values by PFMod rules.&amp;nbsp; Rule metadata, including &lt;strong&gt;risk_score&lt;/strong&gt;, are included in Suricata alerts and play an important role in event processing.&amp;nbsp; More on this later.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-maintenance&quot;&gt;Maintenance&lt;/h3&gt;


&lt;p&gt;When new rules are added to the rulesets used at Uber—which happens daily—they are automatically subjected to our existing Aristotle v2 pipeline which includes filtering and the application of non-trivial PFMod logic to shape and classify each rule accordingly. Thus, manual analysis and tweaking of each new rule is avoided by using Aristotle v2 as a reliable “set it and forget it” mechanism. Of course, judicious occasional revisiting of ruleset filtering and PFMod logic is done to align the ruleset with the current environments and expected traffic patterns.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-documentation&quot;&gt;Documentation&lt;/h3&gt;


&lt;p&gt;More details about Aristotle v2 and how to use it can be found in the &lt;a href=&quot;https://aristotle-py.readthedocs.io/en/latest/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;online documentation&lt;/a&gt;.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-aggregation-correlation-risk-score-and-alerting&quot;&gt;Aggregation, Correlation, Risk Score, and Alerting&lt;/h1&gt;


&lt;p&gt;Uber processes billions of events a day, including hundreds of thousands of IDS alerts. Events come from myriad sources including log files, vendor products, in-house systems, custom detections, and sensing technologies like IDS. Security-related events, referred to as “signals,” receive a “risk score” value that is represented by a single integer and associated with the signal. The risk score value for a signal plays a non-trivial role in downstream aggregation and correlation algorithms that ultimately determine if a signal or group of signals qualifies for a formal meta-alert requiring a response. In other words, the “risk score” value quantifies the necessity and appropriate level of response. Depending on the level warranted, responses typically take the form of a manual investigation by a security analyst, and/or a series of programmatic actions in a Security Orchestration, Automation, and Response (SOAR) pipeline.&lt;/p&gt;


&lt;p&gt;The evaluation, aggregation, and correlation of signals at Uber is a sophisticated process (not to mention the response pipelines), the intricate details of which are outside the scope of this article.&amp;nbsp; However, the general strategy revolves around what we call “Entity Based Alerting.” For a given time window, signals are grouped by entity (e.g., IP address, host, user, etc.) and a correlation algorithm is applied to determine if an actionable meta-alert should be created. The risk score values from individual signals play a significant part in this calculation, as they are weighted and added together, and ultimately compared against a threshold used to make a final determination. The weighting of signals—which can be thought of as adjusting risk scores up or down—is based on sundry criteria, including entity characteristics, and often involves correlation with other data sources. For example, signals for a user entity where the user is an Administrator are weighted higher than those related to a non-Administrator user.&amp;nbsp; Similarly, a signal involving an IP address entity from a known sanctioned vulnerability scanner will be weighted lower, while an event involving an IP address entity that is known to be part of an isolated network responsible for financial transactions will be weighted higher. Note that given a high enough risk score and/or weighting modification, a single signal can be enough to generate a meta-alert and response.&lt;/p&gt;
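
&lt;p&gt;As a rough illustration of the entity-based approach described above, the sketch below groups the signals from one time window by entity, weights each risk score by entity characteristics, and compares the total to a threshold. The threshold, weights, field names, and entity attributes are all hypothetical; the production algorithm is considerably more involved.&lt;/p&gt;

```python
from collections import defaultdict

THRESHOLD = 100  # hypothetical meta-alert threshold


def entity_weight(entity):
    """Hypothetical weighting based on entity characteristics."""
    if entity.get("is_admin"):
        return 1.5  # administrator accounts count for more
    if entity.get("sanctioned_scanner"):
        return 0.2  # known vulnerability scanners count for less
    return 1.0


def meta_alerts(signals, entities):
    """Return entities whose weighted risk total meets the threshold."""
    totals = defaultdict(float)
    for signal in signals:
        entity_id = signal["entity"]
        weight = entity_weight(entities.get(entity_id, {}))
        totals[entity_id] += weight * signal["risk_score"]
    return sorted(e for e, total in totals.items() if total >= THRESHOLD)


# All signals below are assumed to fall within the same time window.
entities = {"alice": {"is_admin": True}, "10.0.0.5": {"sanctioned_scanner": True}}
signals = [
    {"entity": "alice", "risk_score": 40},
    {"entity": "alice", "risk_score": 30},
    {"entity": "10.0.0.5", "risk_score": 90},
    {"entity": "bob", "risk_score": 60},
    {"entity": "host-7", "risk_score": 120},
]
alerts = meta_alerts(signals, entities)
```

&lt;p&gt;In this toy data set, two moderate signals for an administrator cross the threshold together, a single high-risk signal does so on its own, and the sanctioned scanner is suppressed by its low weight.&lt;/p&gt;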


&lt;p&gt;In practice, aggregating and correlating signals related to common entities has proven to be an effective way to identify events and combinations of events worth responding to. With IDS alerts, Aristotle v2 plays a crucial role in choosing which rules are enabled, what metadata is included, and what risk score value each rule should carry. IDS alert metadata, especially the risk score value, allows Suricata alerts to be better managed so that analysts are not overwhelmed with alerts or false positives.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;By using Aristotle v2 to normalize, enhance, and manipulate rule metadata, rules can be accurately described, programmatically understood, and deliberately modified. Applying concrete boolean algebra against metadata key-value pairs results in powerful filtering capabilities that allow us to curate Suricata rulesets to only run applicable rules in particular environments. Rules are automatically tuned based on explicit and inferred teleological motivations and ontological realities. Custom metadata values such as “risk_score” are intelligently added to each rule, which enables effective downstream correlation such that false positives are minimized and notable alerts receive appropriate attention. The result is a scalable, controllable, and accurate Suricata ruleset management and response solution.&lt;/p&gt;
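
&lt;p&gt;To make the idea of boolean algebra over metadata key-value pairs concrete, here is a small generic sketch. It does not use Aristotle v2’s own filter syntax or API; the helper functions, the metadata keys, and the rule IDs are hypothetical and chosen only for illustration.&lt;/p&gt;

```python
def has(key, value):
    """Predicate: the rule's metadata lists the value under the key."""
    return lambda rule: value in rule['metadata'].get(key, ())


def all_of(*predicates):
    """Boolean AND over predicates."""
    return lambda rule: all(p(rule) for p in predicates)


def any_of(*predicates):
    """Boolean OR over predicates."""
    return lambda rule: any(p(rule) for p in predicates)


def negate(predicate):
    """Boolean NOT of a predicate."""
    return lambda rule: not predicate(rule)


# Hypothetical rules with already-normalized metadata.
rules = [
    {'sid': 1001, 'metadata': {'protocols': ['http'], 'attack_target': ['server']}},
    {'sid': 1002, 'metadata': {'protocols': ['http'], 'attack_target': ['client']}},
    {'sid': 1003, 'metadata': {'protocols': ['dns'], 'attack_target': ['server']}},
]

# (protocols is http OR dns) AND NOT (attack_target is client)
keep = all_of(
    any_of(has('protocols', 'http'), has('protocols', 'dns')),
    negate(has('attack_target', 'client')),
)
enabled_sids = [rule['sid'] for rule in rules if keep(rule)]
```

&lt;p&gt;Composing small predicates this way keeps each filter readable and lets an environment-specific ruleset be expressed as a single boolean expression.&lt;/p&gt;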
</description><link>https://www.uber.com/blog/network-ids-ruleset-management-with-aristotle-v2/</link><guid isPermaLink="false">https://www.uber.com/blog/network-ids-ruleset-management-with-aristotle-v2/</guid><pubDate>Thu, 29 Feb 2024 07:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category></item><item><title>Building Scalable, Real-Time Chat to Improve Customer Experience</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Uber is a global business with a customer base spread throughout the world. That customer base is divided into many user personas, predominantly riders, drivers, eaters, couriers, and merchants. Because Uber is a global business, its customers also expect support at a global scale. Customers reach out to us through various live (chat, phone) and non-live (in-app messaging) channels and expect swift resolutions to their issues. With millions of support interactions (known internally as &lt;em&gt;contacts&lt;/em&gt;) being raised by Uber customers every week, our goal is to resolve these contacts within a predefined service level agreement (SLA). Contacts created by customers are resolved either via automation or with help from a customer support agent.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;For agent contacts, the cost of resolving tickets plays an important role in how Uber structures its support channels and determines volumes across the different live and non-live channels. Among live channels, chat performs best on cost-per-contact (CPC) and first contact resolution (FCR), as it allows agents to handle multiple chat contacts concurrently while maintaining a lower average cost than channels like Phone. This channel is in the sweet spot for Uber, as it has a good &lt;a href=&quot;https://www.qualtrics.com/au/experience-management/customer/what-is-csat/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;CSAT&lt;/a&gt; score (customer satisfaction rating, measured in the range of 1 to 5) while generally reducing CPC. It also allows for a higher automation rate, higher staffing efficiency (as agents can work on multiple chats at the same time), and high FCR, all of which benefit Uber while providing quality support for customers.&lt;/p&gt;


&lt;p&gt;Historically, from 2019 to early 2023, 1% of all contacts were served via the live chat channel, 58% were served via the in-app messaging channel (a non-live channel), and the rest were served via another live channel such as Phone. To achieve higher CSAT and FCR, the engineering team needed to scale the chat infrastructure to meet the demands of Uber’s growing business, as well as facilitate the migration of a large volume of in-app messaging and phone channel contacts to the chat channel. This article focuses on the Chat live channel.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-challenges&quot;&gt;Challenges&lt;/h2&gt;


&lt;p&gt;To scale the chat channel to support more than 40% of the Uber contact volume routed to agents, the team faced the following major challenges:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Reliability issues with delivering messages from backend systems to an agent’s browser&lt;ol&gt;&lt;li&gt;46% of events originating from a customer trying to reach an agent were not delivered in time, resulting in delays for customers and wasted agent bandwidth. Note that the 46% figure refers to the overall number of events across all chat contacts, not to the number of unique contacts.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Missing Insights&lt;ol&gt;&lt;li&gt;There was no observability for tracking the health of chat contacts.&lt;/li&gt;


&lt;li&gt;Because agents were idle for long periods while queues were not empty, Ops was left wondering whether they were overstaffed or whether a tech issue was causing disproportionate volumes (referred to as a tech vs. staffing issue).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-legacy-architecture-related-challenges&quot;&gt;Legacy Architecture Related Challenges&lt;/h2&gt;


&lt;p&gt;Our legacy architecture was built on the &lt;a href=&quot;https://wamp-proto.org/index.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;WAMP protocol&lt;/a&gt;, used primarily for message passing and PubSub over WebSockets, to relay contact information to the agent’s machine.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;416&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1-1024x416.png&quot; alt=&quot;&quot; class=&quot;wp-image-1078909&quot; style=&quot;aspect-ratio:2.2164502164502164;width:700px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig1_old_arch-1.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Describes the previous high-level flow of the chat contact from being created to being routed to an agent on the front end. &lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;&lt;em&gt;Note: This is different from the data path involving the exchange of chat messages between the customer and Uber Support, facilitated through HTTP Server-Sent Events (SSE). For this purpose, Uber utilizes Ramen as an internal service, serving a dual role in the control and data paths. In the control path, Ramen provides bi-directional support for client-to-mobile use cases, allowing effective communication. Simultaneously, in the data path, Ramen offers &lt;/em&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;&lt;em&gt;SSE&lt;/em&gt;&lt;/a&gt;&lt;em&gt; capabilities for client-to-web use cases.&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;&lt;em&gt;However, a noteworthy distinction arises in the data path, specifically for client-to-web use cases, where Ramen demonstrates a successful delivery rate of 94.5%. It operates in a unidirectional manner, prompting the necessity for new control flows. These new control flows are essential to detect and manage situations where the client is no longer responsive, thereby addressing the unidirectional limitation in the data path. In this blog, we will cover the new control flow to deliver the events from the backend to the Agent’s browser (Web) to enable the agent for the first reply.&lt;/em&gt;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The team launched the end-to-end architecture in production and, as traffic scaled beyond the first few tickets, began to see issues: the architecture could not easily scale beyond its initial capacity, and managing it in production was not straightforward. Some of the core issues are listed below:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-reliability&quot;&gt;Reliability&lt;/h3&gt;


&lt;p&gt;We were facing reliability issues with our 1.5X scaled traffic, resulting in up to 46% of events from the backend not being delivered to the Agent’s browser. This added to the customer’s wait time to speak to an agent.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-scale&quot;&gt;Scale&lt;/h3&gt;


&lt;p&gt;Beyond a low RPS of around 10, the system’s performance in delivering contacts from the backend deteriorated significantly due to high memory usage or file descriptor leaks. Horizontal scalability was not supported due to limitations in the older version of the WAMP library being used, and upgrading it would have been a significant effort.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-observability-debuggability-nbsp&quot;&gt;Observability/Debuggability&lt;/h3&gt;


&lt;p&gt;The following were the major issues related to Observability:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;It was difficult to track the health of chat contacts, that is, whether chat contacts were missing the SLA because of engineering issues or staffing issues.&lt;/li&gt;


&lt;li&gt;Chat contacts were not onboarded onto the Queue-based architecture, resulting in over 8% of the chat volume not being routed to any agent due to the agent attribute matching flow.&lt;/li&gt;


&lt;li&gt;The WAMP protocol and libraries (&lt;a href=&quot;https://github.com/crossbario/autobahn-js&quot;&gt;eg1&lt;/a&gt;, &lt;a href=&quot;https://github.com/gammazero/nexus/tree/v2&quot;&gt;eg2&lt;/a&gt;) used were deprecated and provided little insight into their inner workings, which made debugging much more difficult. Furthermore, we did not have Chat contact lifecycle debugging implemented end to end, and we were unable to accurately detect Chat SLA misses on the platform overall.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-stateful&quot;&gt;Stateful&lt;/h3&gt;


&lt;p&gt;The services were stateful, complicating maintenance and restarts, which caused spikes in message delivery time and message losses. A WebSocket proxy was added to perform authorization and to accommodate the stateful services, but it increased latency significantly. The double socket proxy also caused issues when either side disconnected.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-tech-requirements&quot;&gt;Tech Requirements&lt;/h2&gt;


&lt;p&gt;The following were some of the goals the tech team was working towards:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Scale up Chat traffic from 1% to 40% of the overall contact volume by the end of 2023 (1.5 million tickets per week)&lt;ol&gt;&lt;li&gt;Onboard and scale the Chat traffic on Queues to enable Queue-related insights&lt;/li&gt;


&lt;li&gt;Scale to handle over 80% of Uber’s overall contact volume by the end of 2024 (3 million tickets per week).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;Reservation (connecting a customer to an agent on the first try after an agent has been identified) success via the proxy pipeline (known as the Push Pipeline) should be &amp;gt;= 99.5%&lt;/li&gt;


&lt;li&gt;Build observability and debuggability over the entire Chat flow, end to end.&lt;/li&gt;


&lt;li&gt;Stateless services that would not need recalibration if they horizontally scaled or if instances went down for any reason&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-solution&quot;&gt;Solution&lt;/h2&gt;


&lt;p&gt;The new architecture needed to be simple, both to improve transparency into its inner workings and to let the team scale it easily. The team decided to go ahead with the Push Pipeline, which would be a simple, non-redundant WebSocket server that agent machines would connect to in order to send and receive messages through one generic socket channel.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-high-level-architecture&quot;&gt;High-Level Architecture&lt;/h3&gt;


&lt;p&gt;The new architecture as it exists today is showcased below:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;517&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch-1024x517.png&quot; alt=&quot;&quot; class=&quot;wp-image-1078914&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Fig2_new_arch.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Describes the new high-level flow following the journey of the chat contact being created through being routed to an agent on the front end.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;The architecture has the following parts:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-front-end&quot;&gt;Front End&lt;/h3&gt;


&lt;p&gt;Front End UI is used by agents to interact with customers. Widgets and different actions are available to agents to investigate and take appropriate actions for the customer.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-contact-reservation&quot;&gt;Contact Reservation&lt;/h3&gt;


&lt;p&gt;Router is the service that finds the most appropriate match between the agent and contact. Upon finding the most suitable contact for an agent, the contact is pushed into a reserved state for the agent.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-push-pipeline&quot;&gt;Push Pipeline&lt;/h3&gt;


&lt;p&gt;Upon successful reservation of the contact for the agent, the matched information is published to Apache Kafka®. On receiving this information through the socket via GraphQL subscriptions, Front End loads the contact for the agent along with all necessary widgets and actions enabling the agent to respond to the user.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-agent-state&quot;&gt;Agent State&lt;/h3&gt;


&lt;p&gt;Any agent who needs to start working needs to go online via a toggle on Front End, which when triggered updates the Agent State service with the relevant agent’s new state.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-edge-proxy&quot;&gt;Edge Proxy&lt;/h3&gt;


&lt;p&gt;Any connection between the client browser and backend services happens via the Edge Proxy which safeguards Uber services as a firewall and proxy layer.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-ease-of-operations-and-better-insights&quot;&gt;Ease of Operations and Better Insights&lt;/h3&gt;


&lt;p&gt;The following are the important points:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Chat traffic was onboarded onto Queues, where subscribed agents receive contacts based on the concurrency set in the agent’s profile. Concurrency defines the number of chat contacts that an agent can work on simultaneously.&lt;/li&gt;


&lt;li&gt;Agent staffing to Queues becomes deterministic, and features such as SLA-based routing (prioritizing chat contacts based on Queue SLA), sticky routing (keeping reopened contacts with their agents), and priority routing (prioritizing based on different rules defined on the Queues) are supported by default.&lt;/li&gt;


&lt;li&gt;With Queue onboarding, dashboards were repurposed/enhanced for Ops teams to view Chat Queue SLAs, agent availability, and real-time agent status, including the contact lifecycle states, Queue inflow/outflow, agent session counts, etc.&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-gql-subscription-service&quot;&gt;GQL Subscription Service&lt;/h2&gt;


&lt;p&gt;The major highlights related to the GraphQL subscriptions are:&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-reconnection-on-disconnection&quot;&gt;Reconnection on Disconnection&lt;/h4&gt;


&lt;p&gt;We have enabled ping pong on the GraphQL subscription socket to make sure that the socket is disconnected automatically when the connection is unreliable. When the socket is disconnected, the respective agent becomes ineligible to receive new contacts. WebSocket reconnection is reattempted automatically. Upon successful reconnection, all the reserved/assigned contacts are fetched so the agent can accept them.&lt;/p&gt;
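
&lt;p&gt;The eligibility rule above can be sketched as a small liveness check. This is an illustration only, not the production implementation: the &lt;code&gt;SocketLiveness&lt;/code&gt; class, its method names, and the 15-second timeout are assumptions.&lt;/p&gt;

```python
class SocketLiveness:
    """Decide whether an agent's socket is healthy enough to receive contacts."""

    def __init__(self, pong_timeout=15.0):
        self.pong_timeout = pong_timeout  # seconds; illustrative value
        self.last_pong_at = None          # no pong seen yet

    def on_pong(self, now):
        """Record the time at which the latest pong arrived."""
        self.last_pong_at = now

    def is_eligible(self, now):
        """An agent is eligible only if a pong arrived within the timeout."""
        if self.last_pong_at is None:
            return False
        return self.pong_timeout >= now - self.last_pong_at


liveness = SocketLiveness(pong_timeout=15.0)
liveness.on_pong(now=100.0)
eligible_soon_after = liveness.is_eligible(now=110.0)  # last pong 10s ago
eligible_much_later = liveness.is_eligible(now=130.0)  # last pong 30s ago
```

&lt;p&gt;Passing the current time in explicitly keeps the check easy to test; a real server would typically read a monotonic clock instead.&lt;/p&gt;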


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-push-pipeline-reliability-nbsp&quot;&gt;Push Pipeline Reliability&amp;nbsp;&lt;/h4&gt;


&lt;p&gt;If the front end does not send an acknowledgment back to the chat service for a contact reserved for a given agent, we try to reserve the same contact for another available agent. We also check that the WebSocket and HTTP protocols are working properly in the agent’s browser by sending a heartbeat over the GraphQL subscription; the browser responds via an HTTP API call, which confirms that the agent is online.&lt;/p&gt;
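
&lt;p&gt;The acknowledge-or-reassign behavior can be sketched as follows. This is a simplified illustration under stated assumptions: &lt;code&gt;reserve_with_ack&lt;/code&gt;, the &lt;code&gt;push_and_wait_for_ack&lt;/code&gt; callable, the agent IDs, and the attempt limit are hypothetical and not part of the actual chat service.&lt;/p&gt;

```python
def reserve_with_ack(contact_id, candidate_agents, push_and_wait_for_ack, max_attempts=3):
    """Try agents in order until one acknowledges the pushed contact.

    push_and_wait_for_ack(agent_id, contact_id) is a hypothetical callable that
    pushes the reservation to the agent's browser and returns True only if an
    acknowledgment arrives before its own timeout.
    """
    tried = []
    for agent_id in candidate_agents[:max_attempts]:
        tried.append(agent_id)
        if push_and_wait_for_ack(agent_id, contact_id):
            return agent_id, tried
    return None, tried  # nobody acknowledged; the contact stays unassigned


# Simulated transport: only agent-b's browser acknowledges pushes.
responsive_agents = {'agent-b'}


def fake_push(agent_id, contact_id):
    return agent_id in responsive_agents


assigned_agent, attempted = reserve_with_ack(
    'contact-1', ['agent-a', 'agent-b', 'agent-c'], fake_push
)
```

&lt;p&gt;Here the contact is first offered to an unresponsive agent and then lands with the next available one, which mirrors the fallback described above.&lt;/p&gt;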


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-technical-choices&quot;&gt;Technical Choices&lt;/h2&gt;


&lt;p&gt;Outlined below are some of the tech choices we made to improve the reliability and robustness of the chat system, while also considering the end-to-end latency impact of our choices on the user’s perceived wait time. For this, we needed to keep this system simplified, while enabling select product enhancements.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-using-graphql-over-websocket-with-graphql-subscriptions&quot;&gt;Using GraphQL over websocket with GraphQL subscriptions&lt;/h3&gt;


&lt;p&gt;The front-end team uses GraphQL extensively for HTTP calls on its front-end services. This led the team to select GraphQL subscriptions for pushing data from the server to the client. The client sends messages to the server via subscription requests, and the server, on matching queries, sends messages back to the agent machines. More about GraphQL subscriptions is covered in the GQL Subscription Service section.&lt;/p&gt;


&lt;p&gt;The &lt;a href=&quot;https://github.com/enisdenjo/graphql-ws&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;graphql-ws&lt;/a&gt; library gave us confidence because it had &lt;a href=&quot;https://www.npmjs.com/package/graphql-ws&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;2.3m weekly downloads&lt;/a&gt;, was &lt;a href=&quot;https://www.apollographql.com/docs/react/data/subscriptions/#setting-up-the-transport&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;recommended by Apollo&lt;/a&gt;, and had 0 open issues. It is also modeled on the standard &lt;a href=&quot;https://github.com/enisdenjo/graphql-ws/blob/master/PROTOCOL.md&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;GraphQL over WS protocol&lt;/a&gt; and aligns its options fully with it, making it an ideal candidate for use here.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-stateless-services&quot;&gt;Stateless services&lt;/h3&gt;


&lt;p&gt;The new services needed to be stateless so that they could scale horizontally without periodic rebalancing.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-websocket-without-http-fallback&quot;&gt;Websocket without HTTP fallback&lt;/h3&gt;


&lt;p&gt;Since the system required bidirectional communication between agent machines and the proxy layer, having HTTP fallback would not really make any difference to the SLAs of the system. Hence, the team focused on increasing the availability of socket connections with the proxy via:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Bidirectional ping pong messages to prevent hung sockets&lt;/li&gt;


&lt;li&gt;Backed-off reconnects after disconnects to prevent concurrent reconnects from overwhelming the service&lt;/li&gt;


&lt;li&gt;Single proxy to connect sockets without any handover&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
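
&lt;p&gt;Backed-off reconnects are commonly implemented as exponential backoff with random jitter, so that many clients disconnected at the same moment do not reconnect in lockstep. The sketch below shows that general technique; the article does not state which backoff policy is actually used, and the function name, base delay, cap, and attempt count are assumptions.&lt;/p&gt;

```python
import random


def reconnect_delays(base=1.0, cap=30.0, attempts=6, rng=None):
    """Yield one randomized wait (in seconds) per reconnect attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles each attempt (1s, 2s, 4s, ...) up to the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a uniform wait between zero and the ceiling.
        yield rng.uniform(0, ceiling)


# A seeded generator makes the schedule reproducible for this example.
delays = list(reconnect_delays(rng=random.Random(42)))
```

&lt;p&gt;A client would sleep for each delay in turn before retrying the socket connection and reset the schedule after a successful reconnect.&lt;/p&gt;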


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-using-apache-kafka-as-a-message-service-on-the-backend&quot;&gt;Using Apache Kafka® as a message service on the Backend&lt;/h3&gt;


&lt;p&gt;Contact messages already flowed between various services via Kafka before reaching the proxy layer. We decided to continue and extend the usage of Kafka, as it was reliable, fast, and supported broadcasting (PubSub) capabilities.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-testing-amp-launch&quot;&gt;Testing &amp;amp; Launch&lt;/h1&gt;


&lt;p&gt;We performed both functional and non-functional tests to ensure that customers and agents are provided with the best experience end to end. To predict performance, we ran the following tests before launch:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-load-tests&quot;&gt;Load tests&lt;/h3&gt;


&lt;p&gt;Roughly 10K socket connections could be established from a single machine, and capacity scales horizontally as we add more machines. We successfully tested pushing events at 20X the load of the old stack.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-shadow-traffic-flows&quot;&gt;Shadow traffic flows&lt;/h3&gt;


&lt;p&gt;Existing traffic was directed through both the old system and the new pipeline to test its capacity with 40,000 contacts and 2,000 agents daily. This process revealed no problems, and data metrics showed that latency and availability were satisfactory and met the desired thresholds.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-reverse-shadow-traffic-flows&quot;&gt;Reverse shadow traffic flows&lt;/h3&gt;


&lt;p&gt;Existing traffic was directed through the new system with the old user interface for agents, serving as a crucial reliability test. This was the initial use of the new system, and it successfully managed the traffic while maintaining latency within the defined SLAs.&lt;/p&gt;


&lt;p&gt;As we went along, we encountered unique system and agent behavior issues and made fixes to increase reliability and reduce latency across the pipeline. Some of the major issues were:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-deletion-of-cookies-from-the-browser&quot;&gt;Deletion of cookies from the browser&lt;/h3&gt;


&lt;p&gt;Clearing browser cookies caused auth issues and subsequent API failures, which prevented the front end from acting on pushed events. In such cases, agents remained online without working on any contacts.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-bugs-in-auto-logout-flows&quot;&gt;Bugs in Auto-Logout Flows&lt;/h3&gt;


&lt;p&gt;Agents were sometimes not logged out because of issues such as out-of-order or missing events. Agents who finished their work for the day remained online in the system if they simply closed their tabs. This increased customer wait times, as the pipeline tried to push events to agents who weren’t actually online. We then started automatically logging agents out based on recent acknowledgment misses and tracing all logouts to their causes to improve confidence in the system.&lt;/p&gt;
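
&lt;p&gt;Logging agents out based on recent acknowledgment misses can be sketched with a sliding window over push outcomes. This is an illustration only: the &lt;code&gt;AckMissTracker&lt;/code&gt; class, the window of five pushes, and the limit of three misses are assumptions, not the values used in production.&lt;/p&gt;

```python
from collections import deque


class AckMissTracker:
    """Track recent push outcomes for one agent and flag when to log out."""

    def __init__(self, window=5, max_misses=3):
        self.outcomes = deque(maxlen=window)  # True means acknowledged
        self.max_misses = max_misses

    def record(self, acknowledged):
        """Store one push outcome and return the current logout decision."""
        self.outcomes.append(bool(acknowledged))
        return self.should_logout()

    def should_logout(self):
        misses = sum(1 for ok in self.outcomes if not ok)
        return misses >= self.max_misses


tracker = AckMissTracker(window=5, max_misses=3)
# The third miss within the last five pushes triggers the logout decision.
decisions = [tracker.record(ack) for ack in (True, False, False, True, False)]
```

&lt;p&gt;Bounding the window means that old misses age out, so an agent with an occasional dropped acknowledgment is not logged out unnecessarily.&lt;/p&gt;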


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;Results&lt;/h2&gt;


&lt;p&gt;The Chat channel has been able to scale to about &lt;strong&gt;36%&lt;/strong&gt; of the overall Uber contact volume routed to agents, with more coming in the months ahead. The team has regained trust in scaling the Chat channel and in improving the overall customer experience around it. The team was also able to massively improve reliability, with the contact delivery error rate dropping from around 46% in the old stack to roughly 0.45% in the new one. With each failed delivery, the customer’s ticket bounced back with a 30-second delay, after which delivery was retried, so bringing this number below 0.45% at scale was a major win for customer and agent experience overall.&lt;/p&gt;


&lt;p&gt;We’ve also had other wins in this area, with perhaps the best one being simplicity. The new architecture has &lt;strong&gt;fewer services, fewer protocols, and better observability&lt;/strong&gt; built into the system for visibility into contact delivery metrics, delay within the system, and end-to-end latency.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion-and-next-steps&quot;&gt;Conclusion and Next Steps&lt;/h1&gt;


&lt;p&gt;The new push pipeline enables the team to onboard other push use cases and opens the door to improving the user experience by providing real-time information for agents to act upon. Some use cases relating to Greenlight appointments and agent work overlaps on contacts will soon move onto this new stack as part of the next phase.&lt;/p&gt;


&lt;p&gt;We will also continue to improve the overall user experience of the Chat channel, focusing on both enhancements and system architecture adjustments. This work will be guided by learnings from the expansion of the product and by issues reported by customers and agents.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;The cover image was found at this link: &lt;a href=&quot;https://openverse.org/image/2b3fadf3-2490-4a0c-906d-f7cf1c13a4cb?q=customer%20support&quot;&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/building-scalable-real-time-chat/</link><guid isPermaLink="false">https://www.uber.com/blog/building-scalable-real-time-chat/</guid><pubDate>Tue, 20 Feb 2024 07:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category><category>Web</category></item><item><title>How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;&lt;a href=&quot;https://eng.uber.com/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt; is Uber’s in-house, distributed database built on top of MySQL®. Storing tens of PBs of data and serving tens of millions of requests/second, it is one of the largest database engines at Uber, used by microservices from all business verticals. Since its inception in 2020, Docstore’s users and use cases have been growing, and so have the request volume and data footprint.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;The growing number of demands from business verticals and offerings introduces complex microservices and dependency call graphs. As a result, applications demand low latency, higher performance, and scalability from the database, while simultaneously generating higher workloads.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-challenges&quot;&gt;Challenges&lt;/h2&gt;


&lt;p&gt;Most of the microservices at Uber use databases backed by disk-based storage in order to persist data. However, every database faces challenges serving applications that require low-latency read access and high scalability.&lt;/p&gt;


&lt;p&gt;This came to a boiling point when one use case required much higher read throughput than any of our existing users. Docstore could have accommodated their needs, as it is backed by NVMe SSDs, which provide low latency and high throughput. However, using Docstore in the above scenario would have been cost prohibitive and would have posed many scaling and operational challenges.&lt;/p&gt;


&lt;p&gt;Before diving into the challenges, let’s understand the high-level architecture of Docstore.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-docstore-architecture&quot;&gt;Docstore Architecture&lt;/h2&gt;


&lt;p&gt;Docstore is mainly divided into three layers: a stateless query engine layer, a stateful storage engine layer, and a control plane. For the scope of this blog, we will talk about its query and storage engine layers.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;The stateless query engine layer is responsible for query planning, routing, sharding, schema management, node health monitoring, request parsing, validation, and AuthN/AuthZ.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;The storage engine layer is responsible for consensus via Raft, replication, transactions, concurrency control, and load management. A partition is typically composed of MySQL nodes backed by NVMe SSDs, which are capable of handling heavy read and write workloads. Additionally, data is sharded across multiple partitions, each containing one leader and two follower nodes and using Raft for consensus.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/GBkXOF__0QtOHj5Bkl_OeqGLk9LRI9iDFVs6e3J_eZJBuvIPUBGBQCGzIcwJISdPj-jIDeYjjByFDh9m5OBobGXhZYkU4xEQIP7MDQf9-iPDp0p1oOOYniJuDvo7BJeMgP-oabTjG37spd5tJFw18B8&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Docstore architecture.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Now let’s look at some of the challenges faced when services demand low-latency reads at a high scale:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Speed of data retrieval from disk has a threshold:&lt;/strong&gt; There’s a limit to how far one can optimize application data models and queries to improve database latency and performance. Beyond that, squeezing out more performance is not possible.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Vertical scaling: &lt;/strong&gt;Assigning more resources or upgrading to better hosts with higher performance has its limitations where the database engine itself becomes a bottleneck.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Horizontal scaling:&lt;/strong&gt; Splitting shards further across more partitions helps to an extent; however, doing so is an operationally complex and lengthy process. We have to ensure data durability and resiliency without any downtime. Also, this solution doesn’t fully solve the issue of hot keys, partitions, or shards.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Request imbalance:&lt;/strong&gt; Oftentimes the incoming rate of read requests is orders of magnitude higher than write requests. In such cases, the underlying MySQL node will struggle to keep up with the heavy workload and further impact latencies.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Vertical and horizontal scaling to improve latencies is costly in the long term. Costs are multiplied 6x, since each of the 3 stateful nodes must be scaled in both regions. Additionally, scaling doesn’t fully address the problem.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;To overcome this, microservices make use of caching. At Uber we provide Redis™ as a distributed caching solution. A typical design pattern for microservices is to write to both the database and the cache, while serving reads from the cache for improved latencies. However, this approach has the following challenges:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Each team has to provision and maintain their own Redis cache for their respective services&lt;/li&gt;


&lt;li&gt;Cache invalidation logic is implemented separately within each microservice&lt;/li&gt;


&lt;li&gt;In case of region failover, services either have to maintain caching replication to stay hot or suffer higher latencies while the cache is warming up in other regions&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Individual teams have to expend a large amount of effort to implement their own custom caching solutions with the database. It became imperative to find a better, more efficient solution that not only serves requests at low latency, but is also easy to use and improves developer productivity.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-cachefront-nbsp&quot;&gt;CacheFront&amp;nbsp;&lt;/h2&gt;


&lt;p&gt;We decided to build an integrated caching solution, CacheFront for Docstore, with the following goals in mind:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Minimize the need for vertical and/or horizontal scaling to support low-latency read requests&lt;/li&gt;


&lt;li&gt;Reduce resource allocation to the database engine layer; caching can be built from relatively cheap hosts, so overall cost efficiency is improved&lt;/li&gt;


&lt;li&gt;Improve P50 and P99 latencies, and stabilize read latency spikes during microbursts&lt;/li&gt;


&lt;li&gt;Replace most of the custom-built caching solutions that were (or will be) built by the individual teams to answer their needs, especially in the cases where the caching is not the core business or competency of the team&lt;/li&gt;


&lt;li&gt;Make caching transparent by reusing the existing Docstore client, so that services benefit from it without any additional boilerplate&lt;/li&gt;


&lt;li&gt;Increase developer productivity and allow us to release new features or replace the underlying caching technology transparently to customers&amp;nbsp;&lt;/li&gt;


&lt;li&gt;Detach caching solution from Docstore’s underlying sharding scheme to avoid problems that arise from hot keys, shards, or partitions&lt;/li&gt;


&lt;li&gt;Allow us to horizontally scale out the caching layer independently of the storage engine&lt;/li&gt;


&lt;li&gt;Move ownership of maintaining Redis, and being on call for it, from feature teams to the Docstore team&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-cachefront-design&quot;&gt;CacheFront Design&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-docstore-query-patterns&quot;&gt;Docstore Query Patterns&lt;/h3&gt;


&lt;p&gt;Docstore supports querying by either primary key or partition key, optionally filtering the data. At a high level, queries can be divided into the following:&lt;/p&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Key-type / Filter&lt;/td&gt;&lt;td&gt;No Filter&lt;/td&gt;&lt;td&gt;Filter by WHERE clause&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Rows&lt;/td&gt;&lt;td&gt;ReadRows&lt;/td&gt;&lt;td&gt;–&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Partitions&lt;/td&gt;&lt;td&gt;ReadPartition&lt;/td&gt;&lt;td&gt;QueryRows&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;p&gt;We wanted to build our solution incrementally, beginning with the most common query patterns. It turned out that more than 50% of the queries coming to Docstore are ReadRows requests, and since this also happened to be the simplest use case (point reads with no filters), it was a natural place to start the integration.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-high-level-architecture&quot;&gt;High-Level Architecture&lt;/h3&gt;


&lt;p&gt;Since Docstore’s query engine layer is responsible for serving reads and writes to clients, it is well suited to integrate the caching layer. It also decouples the cache from disk-based storage, allowing us to scale either of them independently. The query engine layer implements an interface to Redis for storing cached data along with a mechanism to invalidate cached entries. A high-level architecture looks like the following:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Rmm5BmifYPw54KnH74kaCJWK5L-FuymknafE9zQIhfeD4NpY0voHwI_3EtWXx90vPdGl63U9Ukz3ZsZQ0hwlntl55ofun8yhFL4489HMETZ2Jkkp7662u2-mIjOojJW5irCtwyb1xybqRJMlrS-2apM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: CacheFront design.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Docstore is a strongly consistent database. Although integrated caching provides faster query responses, some of the semantics around consistency may not be acceptable for every microservice while using cache. For example, cache invalidation may fail or lag behind database writes. For this reason, we made integrated caching an opt-in feature. Services can configure cache usage on a per-database, per-table, and even per-request basis.&lt;/p&gt;


&lt;p&gt;If certain flows require strong consistency (such as getting items in an eater’s cart) then the cache can be bypassed, whereas other flows with low write throughput (such as fetching a restaurant’s menu) would benefit from the cache.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-cached-reads&quot;&gt;Cached Reads&lt;/h3&gt;


&lt;p&gt;CacheFront uses a cache aside strategy to implement cached reads:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Query engine layer gets a read request for one or more rows&lt;/li&gt;


&lt;li&gt;If caching is enabled, try getting rows from Redis; stream response to users&lt;/li&gt;


&lt;li&gt;Retrieve remaining rows (if any) from the storage engine&lt;/li&gt;


&lt;li&gt;Asynchronously populate Redis with the remaining rows&lt;/li&gt;


&lt;li&gt;Stream remaining rows to users&lt;/li&gt;
&lt;/ol&gt;
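&lt;p&gt;The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than Docstore's implementation: the function name &lt;code&gt;cached_read&lt;/code&gt;, the dictionary standing in for Redis, and the &lt;code&gt;read_from_storage&lt;/code&gt; callback are all hypothetical, and the cache is populated inline here instead of asynchronously.&lt;/p&gt;

```python
from typing import Callable, Dict, List


def cached_read(
    keys: List[str],
    cache: Dict[str, str],
    read_from_storage: Callable[[List[str]], Dict[str, str]],
    caching_enabled: bool = True,
) -> Dict[str, str]:
    """Cache-aside read: serve hits from the cache, fetch misses from storage."""
    rows: Dict[str, str] = {}
    if caching_enabled:
        # Step 2: try the cache first.
        for key in keys:
            if key in cache:
                rows[key] = cache[key]
    # Step 3: fetch only the rows the cache could not serve.
    missing = [key for key in keys if key not in rows]
    if missing:
        fetched = read_from_storage(missing)
        if caching_enabled:
            # Step 4: populate the cache (asynchronous in the real system).
            cache.update(fetched)
        # Step 5: return the remaining rows to the caller.
        rows.update(fetched)
    return rows
```

&lt;p&gt;Only the missing keys reach the storage callback, which is what lets a high hit rate translate into lower database load.&lt;/p&gt;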


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/TZIbPhOxvXtUMaxrn0LukNpglUsFD6CWoliV6z0OhW92wZh2wrYi6XO5-fwQyEkjGztPuGhP2j-BGxDO5Rg7MJ-lURrhwN_1Cxgt-s7i5JPNUWcm_k9s7Khb8K-_AwxOvdkni_83QqGDmuq5jOOzQC4&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: CacheFront read path.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-cache-invalidation&quot;&gt;Cache Invalidation&lt;/h2&gt;


&lt;blockquote class=&quot;wp-block-quote&quot;&gt;&lt;p&gt;&lt;em&gt;“There are only two hard things in Computer Science: cache invalidation and naming things.”&amp;nbsp;&lt;/em&gt;&lt;/p&gt;


&lt;cite&gt;– Phil Karlton&lt;/cite&gt;&lt;/blockquote&gt;


&lt;p&gt;Although the caching strategy in the previous section may seem simple, many details had to be considered in order to ensure the cache would work, especially cache invalidation. Without any explicit cache invalidation, cache entries expire with the configured TTL (by default, 5 minutes). While this may be OK in some cases, most users expect changes to be reflected faster than the TTL. The default TTL could be lowered; however, this would reduce our cache hit rate without meaningfully improving consistency guarantees.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-conditional-update&quot;&gt;Conditional Update&lt;/h3&gt;


&lt;p&gt;Docstore supports conditional updates, where one or more rows can be updated based on a filter condition. For example, a single request might update the holiday schedule for all restaurant chains in a specified region. Since the results of a given filter can change, our caching layer can’t determine which rows would be affected by a conditional update until the actual rows are updated in the database engine. Because of this, we can’t invalidate and populate cached rows for conditional updates in the stateless query engine layer’s write path.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-leveraging-change-data-capture-for-cache-invalidation-nbsp&quot;&gt;Leveraging Change Data Capture for Cache Invalidation&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;To fix this, we leveraged Docstore’s change data capture and streaming service, Flux. Flux tails the &lt;a href=&quot;https://dev.mysql.com/doc/internals/en/binary-log-overview.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;MySQL binlog&lt;/a&gt; events for each of the clusters in our storage engine layer and publishes the events to a list of consumers. Flux powers Docstore CDC (Change Data Capture), replication, materialized views, data lake ingestion, and validating data consistency among nodes in a cluster.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;A new consumer was written, which subscribes to data events and either invalidates or upserts the new rows in Redis. Now with this invalidation strategy, a conditional update will result in database change events for affected rows, which will be used to invalidate or populate rows in cache. As a result, we were able to make the cache consistent within seconds of the database change, as opposed to minutes. Additionally, by using binlogs, we don’t run the risk of letting uncommitted transactions pollute the cache.&lt;/p&gt;
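&lt;p&gt;As a rough sketch of what such a consumer does with each event, consider the Python below. The &lt;code&gt;ChangeEvent&lt;/code&gt; shape, the &lt;code&gt;upsert&lt;/code&gt; switch, and the dictionary standing in for Redis are assumptions made for illustration; the actual Flux event format is not described in this post.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change event derived from a committed binlog entry."""
    row_key: str
    new_value: Optional[str]  # None models a deleted row


def apply_change_event(
    cache: Dict[str, str], event: ChangeEvent, upsert: bool = True
) -> None:
    """Either upsert the new row into the cache or invalidate the old entry."""
    if event.new_value is None or not upsert:
        # Invalidate: the next read misses and reloads from the database.
        cache.pop(event.row_key, None)
    else:
        # Upsert: keep the entry warm with the committed value.
        cache[event.row_key] = event.new_value
```

&lt;p&gt;Because the events come from committed changes, a conditional update that touches many rows simply shows up as one event per affected row.&lt;/p&gt;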


&lt;p&gt;The final read and write path with cache invalidation looks like the following:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/05_8F32Yokun9_DucegrI80UJbwUwj5kulbM56xBrKCaNT2-NklDzrubhI-E7kJh3CK7Y9eQ2hggkhaXmECvSn9AZZV7ARUqiejCDk9H1SDehB_tJ3zPGmgg6UsVZKEViJM51qqqhcR-bhsS4L_DNN8&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: CacheFront read and write paths for invalidation.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-deduplicating-cache-writes-between-query-engine-and-flux&quot;&gt;Deduplicating Cache Writes Between Query Engine and Flux&lt;/h3&gt;


&lt;p&gt;However, the above cache invalidation strategy has a flaw. Since both the read path and the write path can write to the cache concurrently, it is possible that we inadvertently write a stale row to the cache, overwriting the newest value that was retrieved from the database. To solve this, we deduplicate writes based on the timestamp of the row set in MySQL, which effectively serves as its version. The timestamp is parsed out from the encoded row value in Redis (see the later section on the codec).&lt;/p&gt;


&lt;p&gt;Redis supports executing custom Lua scripts atomically using the &lt;a href=&quot;https://redis.io/commands/eval/&quot;&gt;EVAL&lt;/a&gt; command. We use a script that takes the same parameters as &lt;a href=&quot;https://redis.io/commands/mset/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;MSET&lt;/a&gt; but also performs the deduplication logic: it checks the timestamp of any row already written to the cache and writes the new value only if it is newer. By using EVAL, all of this can be performed in a single request instead of requiring multiple round trips between the query engine layer and cache.&lt;/p&gt;
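&lt;p&gt;The deduplication rule itself is small. The Python below mimics it against an in-memory dictionary; it is not the Lua script used in production, and the name &lt;code&gt;mset_if_newer&lt;/code&gt; and the &lt;code&gt;(timestamp, value)&lt;/code&gt; tuple layout are hypothetical. In Redis the same check would run atomically inside the script.&lt;/p&gt;

```python
from typing import Dict, Tuple

# (row timestamp, encoded row value); the timestamp acts as the row version.
Versioned = Tuple[int, str]


def mset_if_newer(cache: Dict[str, Versioned], writes: Dict[str, Versioned]) -> int:
    """Write each entry only if it is strictly newer than the cached one.

    Returns the number of entries actually written.
    """
    written = 0
    for key, (timestamp, value) in writes.items():
        existing = cache.get(key)
        if existing is None or timestamp > existing[0]:
            cache[key] = (timestamp, value)
            written += 1
    return written
```

&lt;p&gt;With this rule, a slow read-path write carrying an older row cannot overwrite a newer row that the Flux consumer has already stored.&lt;/p&gt;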


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-stronger-consistency-guarantees-for-point-writes&quot;&gt;Stronger Consistency Guarantees for Point Writes&lt;/h3&gt;


&lt;p&gt;While Flux allows us to invalidate the cache much faster than if we were relying solely on Redis TTLs for expiration of cached entries, it still provides only eventual consistency. Some use cases require stronger consistency, such as read-your-own-writes, so for these scenarios we added a dedicated API to the query engine that lets our users explicitly invalidate the cached rows after the corresponding writes have completed. This allowed us to provide stronger consistency guarantees for point writes, but not for conditional updates, which are still invalidated by Flux.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-table-schemas&quot;&gt;Table Schemas&lt;/h3&gt;


&lt;p&gt;Before getting into more details about the implementation let’s define a few key terms. Docstore tables have a &lt;em&gt;primary key&lt;/em&gt; and &lt;em&gt;partition key&lt;/em&gt;.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;A primary key (often referred to as a &lt;em&gt;row key&lt;/em&gt;) uniquely identifies a row in the Docstore table and enforces a uniqueness constraint. Every table must have a primary key, which can be composed of one or more columns.&lt;/p&gt;


&lt;p&gt;A partition key is a prefix of the entire primary key and determines which shard the row will live in. The two are not completely separate; rather, the partition key is simply a part of (or equal to) the primary key.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/3Et4lJLc7ohCnbi7oWQ5MvAnNTBN9KbRZZwVaoPf16QtK3VDpvKfr39gi4I3pdx7B5FE_IHi7Iz_tU9i4lAlEi6I_P4kPvb71vGRNxjSzCVbVe5Jr92fMfpi8tqtXh0nsqStymKUr3UymWlYmx5p9Ac&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Example Docstore schemas and data modeling.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;In the example above, &lt;strong&gt;person_id&lt;/strong&gt; is both the primary and partition key for the &lt;strong&gt;person&lt;/strong&gt; table. For the &lt;strong&gt;orders&lt;/strong&gt; table, &lt;strong&gt;cust_id&lt;/strong&gt; is the partition key, and &lt;strong&gt;cust_id&lt;/strong&gt; and &lt;strong&gt;order_id&lt;/strong&gt; together form the primary key.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-redis-codec&quot;&gt;Redis Codec&lt;/h3&gt;


&lt;p&gt;Since we are primarily caching row reads, we can uniquely identify a row value with a given row key. Since Redis keys and values are stored as strings, we need a special codec to encode the MySQL data in a format that Redis accepts.&lt;/p&gt;


&lt;p&gt;The following codec was settled on, as it allows cache resources to be shared by different databases while still maintaining data isolation.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Screenshot-2024-02-07-at-12.36.06%E2%80%AFPM-1024x267.png&quot; alt=&quot;&quot; class=&quot;wp-image-1078207&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/DYQl48pmuGDctmIRBUVlDoOQ1stF1w2I9swtVZL8SlPJZZ1jVSa7klvG_SurVWfDzwi6DE_7fwkWkWTyPqyd_8NBdd02Vv9g6j-lwp5NENw5IbSeOPeWPKCzg1WZ1uMIT6jLW7lGEmmb3WLBPE4gBLA&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: CacheFront Redis codec.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-features&quot;&gt;Features&lt;/h3&gt;


&lt;p&gt;After completing the high-level design, our solution was functional. Now it was time for us to think about scale and resiliency:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;How to verify consistency between the database and cache in real time&lt;/li&gt;


&lt;li&gt;How to tolerate zone/region failures&lt;/li&gt;


&lt;li&gt;How to tolerate Redis failures&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-compare-cache&quot;&gt;Compare Cache&lt;/h3&gt;


&lt;p&gt;All this talk about improving consistency means nothing if it’s not measurable, so we added a special mode that shadows read requests to the cache. When reading back, we compare the cached and database data and verify that they are the same. Any mismatches–either stale rows present in the cache or rows present in the cache, but not the database–are logged and emitted as metrics. With the addition of cache invalidation using Flux, the cache is 99.99% consistent.&lt;/p&gt;
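&lt;p&gt;The comparison step can be pictured as follows. This is a simplified sketch under assumed names (&lt;code&gt;compare_rows&lt;/code&gt; and the two mismatch labels are hypothetical); the real mode also emits metrics, which are omitted here.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


def compare_rows(
    cached: Dict[str, str], database: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (row key, mismatch kind) for every cached row that disagrees."""
    mismatches: List[Tuple[str, str]] = []
    for key, cached_value in cached.items():
        if key not in database:
            # Row exists in the cache but no longer in the database.
            mismatches.append((key, 'in_cache_not_in_database'))
        elif database[key] != cached_value:
            # Row exists in both places but the cached copy is out of date.
            mismatches.append((key, 'stale'))
    return mismatches
```

&lt;p&gt;A consistency figure then follows from the number of mismatches divided by the number of rows compared.&lt;/p&gt;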


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/5WXplUI4vPGJbSdkMGUjTl2XRN4wM8i2X6tY_SXFf_FRDWIlNCuPaTwQ-MCN-bLSWZ8k4Gp_xPZ578KB7LV8DEKvMKa6FfCByfm1BJc8_f1KRvsKuSPTnQs3KYbyj7lt2BbU6ysXVw4auVPDfoqtB30&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Compare cache design.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-cache-warming&quot;&gt;Cache Warming&lt;/h3&gt;


&lt;p&gt;A Docstore instance spans two different geographical regions to ensure high availability and fault tolerance. The deployment is active-active, meaning requests can be issued and served in any region and all writes are replicated across regions. In case of a region failover, the other region must be able to serve all requests.&lt;/p&gt;


&lt;p&gt;This model poses a challenge for CacheFront, since caches should always be warm across regions. If they are not, a region fail-over will increase the number of requests to the database due to cache misses from the traffic originally served in the failed region. This will prevent us from scaling down the storage engine and reclaiming any capacity, since the database load would be as high as it would have been without any caching.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;The cold cache problem can be solved with cross-region Redis replication, but it poses a problem. Docstore has its own cross-region replication mechanism. If we replicate the cache content using Redis cross-region replication, we will have two independent replication mechanisms, which could lead to cache vs. storage engine inconsistency. In order to avoid this cache inconsistency problem for CacheFront, we enhanced Redis cross-region replication components by adding a new cache warming mode.&lt;/p&gt;


&lt;p&gt;To ensure that the cache is always warm, we tail the Redis write stream and replicate keys to the remote region. In the remote region, instead of directly updating the remote cache, read requests are issued to the query engine layer, which, upon a cache miss, reads from the database and writes to the cache as described in the &lt;strong&gt;Cached Reads &lt;/strong&gt;section of the design. Because the database is read only upon a cache miss, we also avoid unnecessarily overloading the storage engine. The response stream of read rows from the query engine layer is simply discarded, since we are not interested in the result.&lt;/p&gt;


&lt;p&gt;By replicating keys instead of values, we always ensure that the data in the cache is consistent with the database in its respective region and we keep the same working set of cached rows in Redis in both regions, while also limiting the amount of cross-region bandwidth used.&lt;/p&gt;
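&lt;p&gt;A toy version of the warming loop, assuming dictionaries for the remote cache and remote database and a hypothetical function name &lt;code&gt;warm_remote_cache&lt;/code&gt;, might look like this. Only keys cross the region boundary; values always come from the remote region's own database.&lt;/p&gt;

```python
from typing import Dict, Iterable


def warm_remote_cache(
    replicated_keys: Iterable[str],
    remote_cache: Dict[str, str],
    remote_database: Dict[str, str],
) -> int:
    """Replay replicated keys as reads in the remote region.

    Returns how many database reads were needed (one per cache miss).
    """
    database_reads = 0
    for key in replicated_keys:
        if key in remote_cache:
            continue  # already warm, so the storage engine is not touched
        database_reads += 1
        if key in remote_database:
            # Cache the value held by the remote region's own database.
            remote_cache[key] = remote_database[key]
        # The read response itself is discarded; only the side effect matters.
    return database_reads
```

&lt;p&gt;If the remote database has not yet received a replicated write, the remote cache stores the older value, which is still consistent with the database in that region.&lt;/p&gt;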


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/cbUIurHsBaCU231zyyfsEcF0aiSL_Tqyy8x03_I3tp66sO1BwcnmPVykW93xFdePIt1JaT2ZZYWRO0RAznx0fogY1C5rahduCC5TvagTpWrxNgB7QfoKPAupt3Ts24S4EbxW9xoCaTkbCvjtdICNgKc&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Cache warming design.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-negative-caching&quot;&gt;Negative Caching&lt;/h3&gt;


&lt;p&gt;In scenarios where many of the reads are for non-existent rows, it is better to cache the negative result than to miss the cache and query the database each time. To enable this, we built negative caching into CacheFront. Similar to the regular cache population strategy, where all rows returned from the database are written to the cache, we also keep track of any rows that were queried but not returned by the database. These non-existent rows are written to the cache with a special flag. In future reads, if the flag is found, we skip the row when querying the database and do not return any data to the user for that row.&lt;/p&gt;
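&lt;p&gt;The sketch below illustrates the idea with an in-memory cache. The &lt;code&gt;NOT_FOUND&lt;/code&gt; sentinel stands in for the special flag, and the function name is hypothetical; how the flag is actually encoded in Redis is not covered here.&lt;/p&gt;

```python
from typing import Dict, List, Tuple

NOT_FOUND = object()  # sentinel standing in for the special "row does not exist" flag


def read_with_negative_cache(
    keys: List[str], cache: Dict[str, object], database: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return (rows found, keys that had to be looked up in the database)."""
    rows: Dict[str, str] = {}
    to_query: List[str] = []
    for key in keys:
        if key not in cache:
            to_query.append(key)
        elif cache[key] is not NOT_FOUND:
            rows[key] = cache[key]
        # A cached NOT_FOUND means: skip the database and return nothing.
    for key in to_query:
        if key in database:
            cache[key] = database[key]
            rows[key] = database[key]
        else:
            cache[key] = NOT_FOUND  # remember that the row does not exist
    return rows, to_query
```

&lt;p&gt;Negative entries need the same invalidation as positive ones, since a later insert of the row must clear the flag.&lt;/p&gt;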


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-sharding&quot;&gt;Sharding&lt;/h3&gt;


&lt;p&gt;Although Redis is not heavily impacted by hot partition issues, some of Docstore’s large customers generate a very large number of read-write requests, which would be challenging to cache in a single Redis cluster, since a cluster is typically limited in the maximum number of nodes it can have. To mitigate this, we allow a single Docstore instance to map to multiple Redis clusters. This also avoids a complete database meltdown, in which a flood of requests hits the database because multiple nodes in a single Redis cluster are down and the cache is unavailable for certain ranges of keys.&lt;/p&gt;


&lt;p&gt;However, even with data sharded across multiple Redis clusters, a single Redis cluster going down may create a hot-shard issue on the database. To mitigate this, we decided to shard Redis clusters by partition key, using a scheme that is different from the database sharding scheme in Docstore. Now we can avoid overloading a single database shard when a single Redis cluster goes down. All requests from a failed Redis shard will be distributed among all database shards, as shown below:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/_NOpBAx6pbT-B-CetPrxh1Vjvj3Ref86cqNIE4WbvS00ESPzYVwI_AFpPHKFvINL6IRTZtivXpMMF5FkTuWyuAAc6ssQxp3EqkYOKh13Ici5NTl8d5U9lU02en9uYptdB902RDmBa6YBI_CloUcHwB0&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9: Redis sharding request flows.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;
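&lt;p&gt;One way to see why independent sharding schemes help is to hash the same partition key twice with different salts. The salted SHA-256 scheme and every function name below are assumptions chosen for illustration; the post does not describe the actual hash functions used by Docstore or CacheFront.&lt;/p&gt;

```python
import hashlib
from typing import Iterable, Set


def _bucket(salt: str, partition_key: str, buckets: int) -> int:
    """Map a partition key to a bucket using a salted SHA-256 hash."""
    digest = hashlib.sha256(f'{salt}:{partition_key}'.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % buckets


def database_shard(partition_key: str, shards: int) -> int:
    return _bucket('db', partition_key, shards)


def redis_cluster(partition_key: str, clusters: int) -> int:
    # A different salt makes this mapping independent of database_shard.
    return _bucket('redis', partition_key, clusters)


def shards_hit_by_cluster_failure(
    keys: Iterable[str], failed_cluster: int, clusters: int, shards: int
) -> Set[int]:
    """Database shards that absorb the traffic of one failed Redis cluster."""
    return {
        database_shard(key, shards)
        for key in keys
        if redis_cluster(key, clusters) == failed_cluster
    }
```

&lt;p&gt;If both layers used the same mapping, every key from a failed Redis cluster would fall on the same database shard; with independent mappings the extra load is spread across all of them.&lt;/p&gt;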

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-circuit-breakers&quot;&gt;Circuit Breakers&lt;/h3&gt;


&lt;p&gt;If a Redis node goes down, we’d like to be able to short circuit requests to that node to avoid the unnecessary latency penalty of a Redis get/set request that we are highly confident will fail. To achieve this, we use a sliding window circuit breaker. We count the number of errors on each node per time bucket and sum the counts of the buckets that fall within the sliding window.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;365&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/figure10-sliding-window-1024x365.png&quot; alt=&quot;&quot; class=&quot;wp-image-1078102&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/figure10-sliding-window.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/figure10-sliding-window.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/figure10-sliding-window.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/figure10-sliding-window.png 1160w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: Sliding window design.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;The circuit breaker is configured to short circuit a fraction of the requests to that node, proportional to the error count. Once the maximum allowed error count is hit, the circuit breaker is tripped and no more requests can be made to the node until the sliding window passes.&lt;/p&gt;
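&lt;p&gt;A compact sketch of such a breaker is shown below. The class name, the one-second bucket size, and the injectable &lt;code&gt;clock&lt;/code&gt; and &lt;code&gt;rng&lt;/code&gt; parameters are assumptions made so the example is deterministic and testable; the production parameters are not given in this post.&lt;/p&gt;

```python
import random
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowBreaker:
    """Sheds a share of requests proportional to recent errors on one node."""

    def __init__(
        self,
        window_seconds: float,
        max_errors: int,
        clock: Callable[[], float],
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self.clock = clock
        self.rng = rng
        self.buckets: Deque[Tuple[int, int]] = deque()  # (bucket second, errors)

    def _errors_in_window(self) -> int:
        now = self.clock()
        # Drop buckets that have slid out of the window.
        while self.buckets and now - self.buckets[0][0] >= self.window_seconds:
            self.buckets.popleft()
        return sum(count for _, count in self.buckets)

    def record_error(self) -> None:
        bucket = int(self.clock())
        if self.buckets and self.buckets[-1][0] == bucket:
            self.buckets[-1] = (bucket, self.buckets[-1][1] + 1)
        else:
            self.buckets.append((bucket, 1))

    def allow_request(self) -> bool:
        errors = self._errors_in_window()
        if errors >= self.max_errors:
            return False  # tripped: no requests until the window passes
        # Otherwise short circuit a fraction proportional to the error count.
        return self.rng() >= errors / self.max_errors
```

&lt;p&gt;A short-circuited request is treated like a cache miss and goes straight to the storage engine, so the caller pays no Redis timeout.&lt;/p&gt;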


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-adaptive-timeouts&quot;&gt;Adaptive Timeouts&lt;/h3&gt;


&lt;p&gt;We realized that it is sometimes difficult to set the right timeouts for Redis operations. A timeout that is too short causes Redis requests to fail too early, wasting Redis resources and putting extra load on the database engine. A timeout that is too long impacts the P99.9 and P99.99 latencies, and in the worst case a request may exhaust the entire timeout that is passed in the query. While it’s possible to mitigate these issues by configuring an arbitrarily low default timeout, we risk setting a timeout too low where many requests bypass the cache and go to the database or setting a timeout too high, which leads us back to the original issue.&lt;/p&gt;


&lt;p&gt;We needed to adjust request timeouts automatically and dynamically so that P99 of requests to Redis succeed within the allocated timeout, while cutting off the long tail of latencies entirely. Adaptive timeouts allow the Redis get/set timeout value to be adjusted dynamically. With them, we can set a timeout equivalent to the P99.99 latency of cache requests, thereby letting 99.99% of requests go to the cache with a fast response. The remaining 0.01% of requests, which would have taken too long, can be canceled sooner and served from the database.&lt;/p&gt;


&lt;p&gt;With adaptive timeouts enabled, we no longer need to tune the timeouts manually to match the desired P99 latency. Instead, we need only set the maximum acceptable timeout limit, beyond which the framework is not allowed to go (because the maximum timeout is set by the client request anyway).&lt;/p&gt;
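&lt;p&gt;One simple way to derive such a timeout is to track recent latencies and take a high percentile, capped at the configured maximum. The sketch below uses the nearest-rank percentile over a bounded sample window; the class name, window size, and percentile method are assumptions, not a description of the actual CacheFront mechanism.&lt;/p&gt;

```python
import math
from collections import deque
from typing import Deque


class AdaptiveTimeout:
    """Derives a timeout from a percentile of recently observed latencies."""

    def __init__(
        self, percentile: float, max_timeout_ms: float, window_size: int = 10000
    ) -> None:
        self.percentile = percentile          # for example 99.99
        self.max_timeout_ms = max_timeout_ms  # hard upper bound
        self.samples: Deque[float] = deque(maxlen=window_size)

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def current_timeout_ms(self) -> float:
        if not self.samples:
            return self.max_timeout_ms  # no data yet: fall back to the cap
        ordered = sorted(self.samples)
        # Nearest-rank percentile: smallest value covering the requested share.
        rank = max(math.ceil(self.percentile / 100 * len(ordered)), 1)
        return min(ordered[rank - 1], self.max_timeout_ms)
```

&lt;p&gt;Sorting on every call keeps the example short; a production version would more likely use a streaming quantile estimate.&lt;/p&gt;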


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/BFgKW0IWMozMuBjKyv8UvqNzBesswYoRMhegElEZLSlF8aBI5WT81GdZ7gHYqRX9FalcrFtlt3_Li7sd3PbxnqV5Fj0gAsl0W4SeM7-6caHi5dZDElaOvYdMEh383DXLFnnEPxPmguu5FYVS3rNaF8A&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 11: Adaptive timeouts latency improvements.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-results&quot;&gt;Results&lt;/h2&gt;


&lt;p&gt;So did we succeed? We originally set out to build an integrated cache that’s transparent to our users. We wanted our solution to improve latencies, scale easily, and curb load and costs on our storage engine, all while providing good consistency guarantees.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/k8nIx9d66HktdxoS2Hl1WQHvf17lZoYz1-CBKXi9HBwVIs2JPakyPe0XZN55LnqVsDP0jJTmOWsj3iI5wiDqIfBm_ldkjXAjtBEy_ql-WeUE5l7Xs54OUEFEefGlx6P9_V3IrUZsiWlUT0HTWEE6h6A&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 12: Cache vs storage engine latency comparison.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;ol&gt;&lt;li&gt;Request latencies with integrated cache are significantly better. P75 latency is down 75% and P99.9 latency is down over 67% while also limiting latency spikes, as seen above.&lt;/li&gt;
&lt;/ol&gt;


&lt;ol start=&quot;2&quot;&gt;&lt;li&gt;Cache Invalidation using Flux and Compare cache mode help us ensure good consistency.&lt;/li&gt;


&lt;li&gt;Since it sits behind our existing APIs, it is transparent to users and can be managed internally while still giving flexibility to users through header-based options.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;Sharding and cache warming allow it to be scalable and fault tolerant. In fact, one of our largest initial use cases drives over 6M RPS with a 99% cache hit rate with a proven successful failover where all traffic was redirected to the remote region.&lt;/li&gt;


&lt;li&gt;The same use case would originally have required approximately 60K CPU cores to serve 6M RPS directly from the storage engine. With CacheFront, we serve approximately 99.9% cache hits with only 3K Redis cores, allowing us to reduce the capacity.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
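&lt;p&gt;The capacity figures above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes that storage-engine CPU cost scales linearly with request rate, which is a simplification introduced here and not a claim from the measurements; the function name is hypothetical.&lt;/p&gt;

```python
def storage_cores_for_misses(total_rps: float, hit_rate: float,
                             cores_without_cache: float) -> float:
    """Estimate the storage-engine cores needed to serve only cache misses.

    Assumes CPU cost is proportional to request rate (illustrative only).
    """
    miss_rps = total_rps * (1.0 - hit_rate)          # requests that reach storage
    cores_per_rps = cores_without_cache / total_rps  # cost of one request per second
    return miss_rps * cores_per_rps


# Figures quoted in the list above: 6M RPS, 99.9% hits, 60K cores without a cache.
estimate = storage_cores_for_misses(6_000_000, 0.999, 60_000)
print(round(estimate))
```

&lt;p&gt;At a 99.9% hit rate only about 6K of the 6M requests per second reach the storage engine, so under the linear assumption the miss traffic would need on the order of tens of cores rather than tens of thousands, in addition to the 3K Redis cores.&lt;/p&gt;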


&lt;p&gt;Today CacheFront supports over 40M requests per second across all Docstore instances in production, and the number is growing.&amp;nbsp;&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/pTmRmm5OFwqZ9bmcvsp_Lskq3PTyxOP8XaRaJ-OWGrg3Vy3I4rPFX5ApmE9cOCH9kvfZZRJHfHw4r9ZM0dWj27QltRed9gQJnoPZdYhYOy-tMOec-kAhQ_452oHVYgtVqPWA66dzfu1y4ClxFraeBQM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 13: Total cache reads across all instances.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;We’ve addressed one of the core challenges in scaling the read workload on Docstore via CacheFront. It not only made it possible to onboard large-scale use cases that demand high throughput and low-latency reads, but also helped us reduce load on the storage engine and save resources, improving the overall cost of storage and allowing developers to focus on building products instead of managing infrastructure.&lt;/p&gt;


&lt;p&gt;If you like challenges related to distributed systems, databases, storage, and cache, please explore and apply to open positions &lt;a href=&quot;https://www.uber.com/us/en/careers/list/?query=storage&amp;amp;department=Engineering&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Oracle, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.&lt;/em&gt;&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd. Any use herein is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Uber.&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/how-uber-serves-over-40-million-reads-per-second-using-an-integrated-cache/</link><guid isPermaLink="false">https://www.uber.com/blog/how-uber-serves-over-40-million-reads-per-second-using-an-integrated-cache/</guid><pubDate>Thu, 15 Feb 2024 07:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category><category>Data / ML</category></item><item><title>Jupiter: Config Driven Adtech Batch Ingestion Platform</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;Introduction&lt;/h1&gt;


&lt;p&gt;Uber’s mission is to reimagine the way the world moves for the better and provide earning opportunities globally through its marketplace. One effective approach to bring the Uber brand and marketplace closer to people is to invest in paid marketing strategies.&lt;/p&gt;


&lt;p&gt;Achieving an optimal equilibrium in the marketplace requires continuously balancing supply and demand. This means creating an environment that is affordable for spenders while remaining a great earning opportunity for earners. One approach to accomplishing this goal is to consistently introduce new users to the marketplace, an ongoing process that involves promoting Uber’s marketplace offerings across diverse marketing platforms such as Google, Meta, Apple, and others.&lt;/p&gt;


&lt;p&gt;Given that these are paid advertisements, our marketing teams continuously develop strategies to rapidly onboard more users to the platform. Therefore, receiving timely signals from these vendors is crucial for us to refine our approach effectively.&lt;/p&gt;


&lt;p&gt;This blog post aims to explore the constraints and difficulties encountered by our legacy ingestion system, MaRS (Marketing Reporting Service), responsible for gathering ad signals from external ad partners at fixed intervals. Furthermore, we will address how we enhanced our marketing operations through technological advancements and attained scalability by implementing our new system, Jupiter.&lt;/p&gt;


&lt;p&gt;In this blog, we have described paid marketing as a domain, while ad tech represents the systems within that same domain. These terms can be used interchangeably within this context.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-what-is-the-performance-marketing-user-flow&quot;&gt;What Is The Performance Marketing User Flow?&lt;/h2&gt;


&lt;p&gt;At a high level, the following sequence outlines the complete user journey: engaging with an ad, navigating to the Uber platform, and finally converting. A conversion is an action that is valuable to our business; in our context, it could be signing up on Uber, placing an order via Uber Eats, or taking a ride.&lt;/p&gt;


&lt;p&gt;This journey triggers the following events:&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Conversion event&lt;/strong&gt;: When a user clicks on the ad to download the Uber app, marking a conversion specific to that ad. This is one type of conversion event linked to downloading.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Spend event&lt;/strong&gt;: When a user views an ad, signifying expenditure to display that ad to the user.&lt;/p&gt;


&lt;p&gt;These spend events from the advertising partner need to be ingested, processed, and transmitted downstream. This is done to measure and optimize the ad’s performance.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1817&quot; height=&quot;1363&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077751&quot; style=&quot;aspect-ratio:1.2027027027027026;object-fit:contain;width:767px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1817,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png 1817w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure1-edited-1.png 1536w&quot; sizes=&quot;(max-width: 1817px) 100vw, 1817px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: User Flow in Adtech.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-sample-user-flow&quot;&gt;&lt;strong&gt;Sample User Flow&lt;/strong&gt;&lt;/h5&gt;


&lt;ul&gt;&lt;li&gt;Step 1:&amp;nbsp; The User Clicks on the Uber Ad on the Partner Page&lt;/li&gt;


&lt;li&gt;Step 2:&amp;nbsp; The User Arrives on the Uber App [Conversion Events]&lt;/li&gt;


&lt;li&gt;Step 3:&amp;nbsp; Adtech System Retrieves Data from Partner [Ingestion Platform]&lt;/li&gt;


&lt;li&gt;Step 4:&amp;nbsp; Compute Performance Metrics (&lt;a href=&quot;https://www.singular.net/glossary/return-on-ad-spend-roas&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;ROAS&lt;/a&gt;)&lt;/li&gt;


&lt;li&gt;Step 5:&amp;nbsp; Optimization Engine enhances Bidding Algorithms by adjusting them according to computed Metrics.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-why-is-timely-ingestion-critical&quot;&gt;Why Is Timely Ingestion Critical?&lt;/h2&gt;


&lt;p&gt;The prompt and precise ingestion of these advertising signals is crucial for Uber’s overall Performance Marketing. Even a small delay or inaccuracy in processing ad signals from external partners can affect Uber’s ability to advertise on those platforms. As a result, this could affect the influx of users being onboarded onto the platform.&lt;/p&gt;


&lt;p&gt;To illustrate, during an outage lasting two days in which we were unable to ingest data from a single partner, the creation of key performance indicators (&lt;a href=&quot;https://en.wikipedia.org/wiki/Performance_indicator&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;KPIs&lt;/a&gt;), specifically &lt;a href=&quot;https://www.singular.net/glossary/return-on-ad-spend-roas&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;ROAS&lt;/a&gt;, downstream was delayed. This delay led to our machine learning algorithms in the bidding and optimization systems erroneously concluding that our ads were underperforming, causing a halt in ad spending.&lt;/p&gt;


&lt;p&gt;As a consequence, our ability to onboard new users was compromised, resulting in an imbalance in supply &amp;amp; demand. All this occurred due to an outage in one integration.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-problem-statement&quot;&gt;Problem Statement&lt;/h2&gt;


&lt;p&gt;As Uber operates across numerous countries worldwide, we engage with various local and global advertising partners or advertisers for our paid marketing efforts. This has resulted in the integration of multiple diverse technological systems at different levels of technological maturity, featuring heterogeneous data schemas, formats, varied transmission protocols, and discrepancies in data freshness, lineage, and completeness.&lt;/p&gt;


&lt;p&gt;The AdTech industry is undergoing a substantial transformation where partners, Mobile Measurement Platforms (MMPs), and external ad tech platforms are transitioning from user-based ad-tracking to a spectrum of privacy-centric alternatives. This shift has given rise to a diverse ecosystem with varying standards among partners, introducing complexities such as frequent and unpredictable changes in data schemas that challenge historical assumptions in the marketing and advertising domain.&lt;/p&gt;


&lt;p&gt;This complexity has presented a compounded challenge for the ingestion system due to its rapid evolution, scale, and the diverse nature of the datasets involved.&lt;/p&gt;


&lt;p&gt;Here’s a breakdown of issue categories and the time dedicated by the ingestion team previously:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1324&quot; height=&quot;862&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-edited.jpeg&quot; alt=&quot;&quot; class=&quot;wp-image-1077758&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1324,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-edited.jpeg 1324w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-edited.jpeg 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-edited.jpeg 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-edited.jpeg 768w&quot; sizes=&quot;(max-width: 1324px) 100vw, 1324px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Split of Issue Categories.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-reliability&quot;&gt;Reliability&lt;/h2&gt;


&lt;p&gt;As evident from the data, the predominant portion of time is dedicated to ensuring the reliability of the ingestion system. The primary factors contributing to this can be classified as follows:&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-high-latency-nbsp&quot;&gt;High Latency&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;Ensuring prompt availability of data in the warehouse was essential for reducing our Mean Time to Detect (MTTD) anomalies and enhancing the overall performance of our ad tech systems.&lt;/p&gt;


&lt;p&gt;Due to incomplete data or data latency issues, marketers struggled to distinguish between seasonality and actual ad performance.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-no-partial-data-availability&quot;&gt;No Partial Data Availability&lt;/h3&gt;


&lt;p&gt;Marketing data keeps evolving after it is first reported (spend data can change for more than 24 hours, and conversion data for more than 28 days), so it is highly important to provide partial data to downstream systems. This is especially crucial when issues arise on the partner’s end at the level of specific ad accounts. Given the frequency of such issues, having this capability could have prevented numerous data outages.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-enhancements-in-technology-stack&quot;&gt;Enhancements in Technology Stack&lt;/h3&gt;


&lt;p&gt;The legacy system, MaRS, was designed to be tightly coupled to older advertising formats and domains. Even minor improvements to MaRS resulted in extended engineering cycles or caused multiple technology regressions. Consequently, accommodating new use cases made the system unwieldy and difficult to manage.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Moreover, our outdated Python®-based technology stack was slowing us down. We took this opportunity to upgrade it.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-third-party-dependencies&quot;&gt;Third-Party Dependencies&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-standardization&quot;&gt;Standardization&lt;/h3&gt;


&lt;p&gt;Some of our global and local ad partners have advanced APIs for data sharing. Smaller partners, however, are less mature and share data through more manual methods such as email and SFTP. Therefore, it is imperative that a single system be able to handle data ingestion from this diverse array of sources.&lt;/p&gt;


&lt;p&gt;Moreover, the data formats and Service Level Agreements (SLAs) were not consistent among all partners. This lack of standardization posed challenges for maintenance. Consequently, it was necessary to establish uniform data standards across all partners for seamless consumption by downstream systems.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-rate-limits&quot;&gt;Rate Limits&lt;/h3&gt;


&lt;p&gt;Introducing new partner data, onboarding new data for an existing partner, or encountering a bug in the data processing layer required the ingestion system to import years of historical data (a backfill). Because of partner rate limiting, this process incurred significant latency, often taking multiple days to weeks, and it also impeded the normal flow of day-to-day pipelines.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-high-maintenance&quot;&gt;High Maintenance&lt;/h3&gt;


&lt;p&gt;Sustaining partner-specific SDKs/APIs required substantial maintenance expenses, including dedicated headcount allocation, for frequent updates and bug fixes, which ultimately reduced developer productivity.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-scale&quot;&gt;Scale&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-huge-lead-time-to-market&quot;&gt;Huge Lead Time To Market&lt;/h3&gt;


&lt;p&gt;Due to a substantial backlog, we could only attend to the P0 marketing requests. The backlog primarily stemmed from the fact that onboarding a new partner used to take multiple weeks, hindering our capacity for swift experimentation.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;For instance, in the case of an emerging partner, if we wanted to run ads on their platform, we would have to wait for several months before we had the resources to complete the onboarding process.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-high-dependency-on-eng-nbsp&quot;&gt;High Dependency on Eng&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;With the legacy system, onboarding any partner relied heavily on engineering resources to write boilerplate code for API integration, data transformation, validation, and testing. This work consumed a significant portion of the onboarding process.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-solution-strategies&quot;&gt;Solution Strategies&lt;/h2&gt;


&lt;p&gt;At first, MaRS was constructed with constraints tied to limited advertising spending. As Uber expanded globally, the demand for increasingly personalized marketing grew in both local and global markets. This necessitated a system that could swiftly adapt and incorporate specific nuances.&lt;/p&gt;


&lt;p&gt;Marketers needed a swifter onboarding process for new partners in our measurement pipelines to facilitate experimentation. They also sought data at a higher frequency to accelerate results, enabling them to fine-tune marketing strategies accordingly.&lt;/p&gt;


&lt;p&gt;Therefore, we developed a system to address gaps in the tech stack and accommodate future business requirements by employing a loosely coupled architecture.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-build-vs-buy&quot;&gt;Build vs. Buy&lt;/h3&gt;


&lt;p&gt;We conducted an assessment of external third-party vendors for ad signal ingestion rather than relying solely on in-house solutions. This was primarily to streamline maintenance costs.&lt;/p&gt;


&lt;p&gt;Additionally, there was a strong business directive from the marketing team to gain greater flexibility and control over primary channels (the top channels with higher spending) like Google, Apple, and Meta.&lt;/p&gt;


&lt;p&gt;As a result, we opted for a hybrid architecture that allows for a combination of external vendor data and direct in-house retrieval from partners. The decision on which approach to adopt during onboarding will be contingent on the business criticality of the integration.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-plug-and-play-architecture&quot;&gt;Plug and Play Architecture&lt;/h3&gt;


&lt;p&gt;Various internal use cases required us to ingest data beyond the existing categories of adtech data, and many stopgap solutions had been built in silos. We needed a single ingestion system that could handle ad hoc data sets with minimal effort.&lt;/p&gt;


&lt;p&gt;We incorporated a plug-and-play architecture for all the components, so that any ingestion flow can swap one of its internal components for another with minimal effort.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-domain-agnostic-data-ingestion&quot;&gt;Domain Agnostic Data Ingestion&lt;/h3&gt;


&lt;p&gt;In our pursuit of creating an inclusive ingestion system for a diverse range of data, we needed to separate domain-specific intricacies and enable configurability through a fully self-service, config-based architecture.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-reliability-0&quot;&gt;Reliability&lt;/h3&gt;


&lt;p&gt;Dealing with a diverse range of partners, our system had to handle numerous immature data formats, inconsistent SLAs, and unexpected scenarios. The sizes of these ad signals also varied significantly, spanning from several gigabytes to terabytes in specific cases. Jupiter was specifically designed to adeptly manage these varied scenarios in a resilient manner.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-architecture&quot;&gt;Architecture&lt;/h1&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;421&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1-1024x421.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077763&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Jupiter Architecture.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-multi-vendor-integration&quot;&gt;Multi-Vendor Integration&lt;/h3&gt;


&lt;p&gt;We had various business scenarios requiring the integration of distinct data sets from different vendors into our platform. These integrations needed to account for specific factors such as data formats, ingestion frequencies, and data maturity levels.&lt;/p&gt;


&lt;p&gt;Consequently, the platform was architected to accommodate any vendor for any of the data ingestion processes, allowing for seamless changes with minimal configuration.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Integrating a new vendor involves configuring its specific integration details, after which the rest of the platform will seamlessly connect with it.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-multi-source-integrations-nbsp&quot;&gt;Multi-Source Integrations&amp;nbsp;&lt;/h3&gt;


&lt;p&gt;Given our interactions with numerous vendors, we naturally encountered diverse data sources that required ingestion. To address this, we implemented configurable data sources with their specific attributes defined through configuration. Just as with vendors, these data sources can be switched at any time with minimal configuration effort.&lt;/p&gt;


&lt;p&gt;Currently, we have integrations with sources like Amazon Web Services, Google Drive, email, APIs, and more. The addition of a new source involves configuring its integration details, after which the rest of the platform will seamlessly adapt to it.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-non-transformed-data-sets&quot;&gt;Non-Transformed Data Sets&lt;/h3&gt;


&lt;p&gt;In order to meet our business requirements, we found it necessary to ingest data over longer intervals without being constrained by partner capabilities. Additionally, our crucial need to swiftly detect any anomaly trends (MTTD) prompted us to implement a data copying process without applying any transformations.&lt;/p&gt;


&lt;p&gt;This approach enabled us to expedite debugging for any issues and efficiently backfill data when necessary.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-config-driven-transformation-layer&quot;&gt;Config Driven Transformation Layer&lt;/h3&gt;


&lt;p&gt;Due to the various data schemas we manage, customized transformations were crucial for standardization.&lt;/p&gt;


&lt;p&gt;A substantial part of the boilerplate code was dedicated to this particular component. To achieve a fully self-service ingestion system, we aimed to configure this component for each distinct use case.&lt;/p&gt;


&lt;p&gt;Consequently, we developed an internal library for this transformation layer. This library incorporates user-defined transformations, ranging from row-to-row and column-to-column transformations to aggregate transformations. We’ve reused this library across internal systems and similar use cases. A sample configuration is shown below.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-full&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Screenshot-2024-02-05-at-1.59.18%E2%80%AFPM.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077942&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/figure&gt;&lt;/div&gt;
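&lt;p&gt;As a rough illustration of what a config-driven transformation layer can look like, the sketch below applies column-to-column, row-to-row, and aggregate transformations declared in a plain dictionary. This is not the internal library: the config keys, operation names, and column names are all hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical config; keys, operation names, and columns are illustrative only.
CONFIG = {
    "column_transforms": [
        {"op": "rename", "from": "cost_micros", "to": "spend_usd"},
    ],
    "row_transforms": [
        {"op": "divide", "column": "spend_usd", "by": 1_000_000},
    ],
    "aggregate": {"group_by": ["campaign_id"], "sum": ["spend_usd"]},
}


def apply_config(rows, config):
    """Apply the configured transformations to a list of row dictionaries."""
    transformed = []
    for source_row in rows:
        row = dict(source_row)  # copy so the caller's data is not mutated
        for step in config.get("column_transforms", []):
            if step["op"] == "rename":
                row[step["to"]] = row.pop(step["from"])
            else:
                raise ValueError("unknown column transform: " + step["op"])
        for step in config.get("row_transforms", []):
            if step["op"] == "divide":
                row[step["column"]] = row[step["column"]] / step["by"]
            else:
                raise ValueError("unknown row transform: " + step["op"])
        transformed.append(row)

    aggregate = config.get("aggregate")
    if not aggregate:
        return transformed

    # Sum the requested columns per group key, keeping first-seen group order.
    totals = defaultdict(lambda: defaultdict(float))
    for row in transformed:
        key = tuple(row[column] for column in aggregate["group_by"])
        for column in aggregate["sum"]:
            totals[key][column] += row[column]
    return [
        {**dict(zip(aggregate["group_by"], key)), **sums}
        for key, sums in totals.items()
    ]


partner_rows = [
    {"campaign_id": "a", "cost_micros": 1_500_000},
    {"campaign_id": "a", "cost_micros": 500_000},
    {"campaign_id": "b", "cost_micros": 2_000_000},
]
print(apply_config(partner_rows, CONFIG))
```

&lt;p&gt;Because the behavior lives in the config rather than in code, supporting a new partner schema in a design like this means writing a new dictionary instead of new boilerplate.&lt;/p&gt;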

&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-self-serve-ingestion-onboarding&quot;&gt;Self-Serve Ingestion Onboarding&lt;/h3&gt;


&lt;p&gt;We’ve optimized the entire onboarding process for the ingestion flow onto the platform, transitioning it into a self-service model with essential safeguards. This transformation involved implementing &lt;strong&gt;trigger-based mechanisms&lt;/strong&gt; that operate seamlessly between all components: fetching data from sources, initiating transformations, conducting tests, validating and promoting the data after all checks pass, and triggering post-validation procedures. The high-level flow is shown in Figure 4.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/50cHgz3rK-Vua-iE0-r2NLsMM-sAX6Mdt6468e_JwQwyiima0Yqo_XAR8ymryQRJdW_NicyVnJ5f9_1FUVH77rGb_Lic15xQ6rl2lQxIo2bpV4ewbmaZtFC_pZuaQSTjxdEzCeq5KePOQ__W6UVl4cY&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Flow Diagram.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;As a result of these improvements, the responsibility for this process has been transferred to the operations team, eliminating the necessity for continual engineering involvement.&lt;/p&gt;
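&lt;p&gt;The trigger-based flow described in this section can be sketched as a chain of stages in which each stage, on completion, names the next event to fire. This is an illustrative model only: the stage names, the context dictionary, and the single validation rule are assumptions, not Jupiter's actual components.&lt;/p&gt;

```python
from typing import Callable, Dict, List, Optional

# A stage handler receives the shared context and returns the next event, or None to stop.
Handler = Callable[[dict], Optional[str]]


class TriggerPipeline:
    """Hypothetical event-driven pipeline: each stage triggers the next one."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}

    def on(self, event: str, handler: Handler) -> None:
        self.handlers[event] = handler

    def run(self, first_event: str, ctx: dict) -> List[str]:
        trace = []
        event: Optional[str] = first_event
        while event is not None:
            trace.append(event)
            event = self.handlers[event](ctx)
        return trace


def fetch(ctx: dict) -> str:
    ctx["raw"] = list(ctx["source"])          # pull rows from the configured source
    return "transform"


def transform(ctx: dict) -> str:
    ctx["rows"] = [{"spend": float(r["spend"])} for r in ctx["raw"]]
    return "validate"


def validate(ctx: dict) -> str:
    # Illustrative check: spend must never be negative.
    valid = all(row["spend"] >= 0 for row in ctx["rows"])
    return "promote" if valid else "reject"


def promote(ctx: dict) -> str:
    ctx["published"] = ctx["rows"]            # data becomes visible downstream
    return "post_validate"


def post_validate(ctx: dict) -> None:
    ctx["status"] = "ok"
    return None


def reject(ctx: dict) -> None:
    ctx["status"] = "rejected"                # nothing is promoted
    return None


def build_pipeline() -> TriggerPipeline:
    pipeline = TriggerPipeline()
    for name, handler in [("fetch", fetch), ("transform", transform),
                          ("validate", validate), ("promote", promote),
                          ("post_validate", post_validate), ("reject", reject)]:
        pipeline.on(name, handler)
    return pipeline


context = {"source": [{"spend": "12.5"}, {"spend": "3"}]}
print(build_pipeline().run("fetch", context), context["status"])
```

&lt;p&gt;The safeguard in this sketch is that promotion is reachable only through a passing validation stage, which is the property that makes it reasonable to hand such a flow to a non-engineering team.&lt;/p&gt;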


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-impact&quot;&gt;Impact&lt;/h2&gt;


&lt;p&gt;The prior system has now been completely phased out, and all use cases have transitioned to Jupiter. Below is an overview of the improvements relative to the legacy system:&lt;/p&gt;


&lt;figure class=&quot;wp-block-table&quot;&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Improvement %&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Onboarding Time&amp;nbsp; – New Ingestion&lt;/td&gt;&lt;td&gt;&amp;gt; 90%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Onboarding Time&amp;nbsp; – New Vendor&lt;/td&gt;&lt;td&gt;&amp;gt; 75%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Onboarding Time&amp;nbsp; – New Source&lt;/td&gt;&lt;td&gt;&amp;gt; 75%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Ingestion Frequency&lt;/td&gt;&lt;td&gt;&amp;gt; 75%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Data Ingestion Latency&lt;/td&gt;&lt;td&gt;&amp;gt; 70%&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/figure&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;We’ve outlined the difficulties and potential advantages linked with ad tech domain networks, along with the process of obtaining dependable data from them and implementing intricate transformations to address various business needs. At present, we’ve retired the previous system and transitioned all use cases to the new platform, incorporating fresh data enhancements wherever feasible.&lt;/p&gt;


&lt;p&gt;We’ve successfully met our primary objective, accelerating the onboarding process for new partners and ensuring data reliability through a self-service approach for our stakeholders. However, there are still more intricate use cases to address. For instance, we’ve thus far focused on downloading a single report structure and applying transformations. Our next challenge is to download multiple structures, amalgamate them, and provide either a single or multiple datasets through a unified workflow. The next significant step is to expand this platform beyond its current specific-use-case support and transition it into a multi-tenant system.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;


&lt;p&gt;We extend our special appreciation to both the core engineering and the product team, which includes Prathamesh Gabale (Engineering Manager), Akshit Jain (Software Engineer), Sarthak Chhillar (Software Engineer), Saurav Pradhan (Software Engineer), and Piyush Choudhary (Product Manager), for their pivotal roles in ensuring the success of this journey.&lt;/p&gt;


&lt;p&gt;We would also like to express our gratitude to Devesh Kumar, Diwakar Bhatia, and Vijayasaradhi Uppaluri for their invaluable feedback and unwavering support.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Amazon Web Services, AWS, the Powered by AWS logo, and S3 are trademarks of Amazon.com, Inc. or its affiliates.&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/jupiter-batch-ingestion-platform/</link><guid isPermaLink="false">https://www.uber.com/blog/jupiter-batch-ingestion-platform/</guid><pubDate>Tue, 06 Feb 2024 07:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Data / ML</category></item><item><title>DataCentral: Uber’s Big Data Observability and Chargeback Platform</title><description>&lt;p&gt;In this blog, we will walk you through DataCentral, Uber’s homegrown Big Data Observability, Attribution, and Governance platform. This blog gives a high-level overview of DataCentral’s key features. Before we get into the what and why of DataCentral, let’s do a quick primer of Uber’s Data ecosystem and its challenges.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction-to-uber-s-big-data-landscape&quot;&gt;Introduction to Uber’s Big Data Landscape&lt;/h1&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;980&quot; height=&quot;533&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-4.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077460&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=980,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-4.png 980w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-4.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-4.png 768w&quot; sizes=&quot;(max-width: 980px) 100vw, 980px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Uber’s Big Data Landscape.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;Uber’s data infrastructure is composed of a wide variety of compute engines, scheduling/execution solutions, and storage solutions. Compute engines such as Apache Spark&lt;sup&gt;™&lt;/sup&gt;, Presto&lt;sup&gt;®&lt;/sup&gt;, Apache Hive&lt;sup&gt;™&lt;/sup&gt;, Neutrino, Apache Flink&lt;sup&gt;®&lt;/sup&gt;, etc., allow Uber to run petabyte-scale operations on a daily basis. Further, scheduling and execution engines such as Piper (Uber’s fork of Apache Airflow&lt;sup&gt;™&lt;/sup&gt;), Query Builder (user platform for executing compute SQLs), Query Runner (proxy layer for execution of workloads), and Cadence (workflow orchestration engine, open-sourced by Uber) exist to allow scheduling and execution of compute workloads. Finally, a significant portion of storage is supported by HDFS, Google Cloud Storage (GCS), AWS S3, Apache Pinot&lt;sup&gt;™&lt;/sup&gt;, ElasticSearch&lt;sup&gt;®&lt;/sup&gt;, etc. Each engine supports thousands of executions, which are owned by multiple owners (uOwn) and sub-teams.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-challenges&quot;&gt;Challenges&lt;/h2&gt;


&lt;p&gt;With such a complex and diverse big data landscape operating at petabyte-scale and around a million applications/queries running each day, it’s imperative to provide the stakeholders a holistic view of the right performance and resource consumption insights.&amp;nbsp;&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-stakeholder-personas&quot;&gt;Stakeholder Personas&lt;/h4&gt;


&lt;p&gt;The stakeholders of the big data ecosystem at Uber comprise the following:&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Job Owners: &lt;/strong&gt;End users visit DataCentral to find out the metadata for their jobs such as duration, costs, resource consumption, query text and logs, etc. This allows DataCentral to serve as a powerful platform for debugging failed queries and applications.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Big Data Teams: &lt;/strong&gt;Big data engine teams like Spark, Presto, Hive, and Neutrino leverage the DataCentral platform to get high-level insights into the number of jobs failing, bad/abusive jobs, top error reasons, blocked queries, etc. In addition, DataCentral helps them to investigate SLA breaches, incidents, and job failures.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Executive Leadership: &lt;/strong&gt;DataCentral also supports business decision making by providing organization-level statistics, such as app/query level costs. It also offers information that can be leveraged to forecast hardware requirements and costs incurred.&lt;/p&gt;


&lt;p&gt;Some typical questions raised by the various personas involved with the big data platforms at Uber:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;776&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2-1024x776.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077555&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1520,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure2.png 1520w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: User Personas of Data Platforms at Uber.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-enter-datacentral&quot;&gt;Enter DataCentral&lt;/h2&gt;


&lt;p&gt;At Uber, we have developed DataCentral, a comprehensive platform to provide users with essential insights into big data applications and queries. DataCentral empowers data platform users by offering detailed information on workflows and apps, improving productivity, and reducing debugging time. DataCentral provides the following key services for customers:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Observability:&lt;/strong&gt; Granular insights into performance trends, costs, and degradation signals for big data jobs.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Chargeback&lt;/strong&gt;: Metrics and resource usage for big data tools and engines such as Presto, Apache Yarn&lt;sup&gt;™&lt;/sup&gt;, HDFS, Apache Kafka&lt;sup&gt;®&lt;/sup&gt;, etc.&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Consumption Reduction Programs&lt;/strong&gt;: Powers core cost reduction initiatives for Uber’s data ecosystem, such as HDFS growth reduction, Yarn usage reduction, etc.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;730&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3-1024x730.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077558&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1933,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure3.png 1933w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: DataCentral and Offerings.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-how-does-datacentral-help-with-observability&quot;&gt;&lt;br&gt;How does DataCentral help with Observability?&lt;/h3&gt;


&lt;p&gt;Data Observability provides real-time insights into compute queries and applications. Since Uber’s data ecosystem is spread across different components, we have to track metrics across engines like Presto, Spark, Hive, and Neutrino. Aggregating these metrics and tying them to execution engines is also a challenge. We do this with the following offerings:&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Time series metrics (a.k.a. Clio)&lt;/strong&gt;: Every query run at Uber is fingerprinted and associated with a historical trend of executions (we call this in-house solution Clio). Customers can view historical trends for metrics like Costs, Duration, Efficiency, Data Read/Written, Shuffle, and much more. Having insights into historical trends allows customers to detect and debug application issues faster. Further, we provide “config change markers” that allow easy correlation between config changes and changes in the historical trends (refer to Metrics trend in Figure 4). One major challenge we observed was failures introduced by the infrastructure. To address this, we built observability into Yarn and HDFS, along with correlation tools.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;900&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077471&quot; style=&quot;aspect-ratio:1.7777777777777777;width:700px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1600,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 1600w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1080,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-8.png 1080w&quot; sizes=&quot;(max-width: 1600px) 100vw, 1600px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Clio time series historical trends.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;
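
&lt;p&gt;The fingerprinting idea behind Clio can be illustrated with a minimal Python sketch. The normalization rules, the &lt;code&gt;fingerprint&lt;/code&gt; and &lt;code&gt;record_run&lt;/code&gt; names, and the choice of SHA-256 are assumptions made for illustration only; this post does not describe how Clio actually computes fingerprints.&lt;/p&gt;

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(query_text):
    """Hypothetical fingerprint: hash the query text after masking literals."""
    text = query_text.strip().lower()
    text = re.sub(r"'[^']*'", "?", text)           # mask string literals
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)   # mask numeric literals
    text = re.sub(r"\s+", " ", text)               # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# fingerprint -> list of (run_date, duration_seconds): a tiny time series
history = defaultdict(list)


def record_run(query_text, run_date, duration_seconds):
    """Attach one execution to the trend of its fingerprint."""
    key = fingerprint(query_text)
    history[key].append((run_date, duration_seconds))
    return key


# Two runs of the "same" query with different literals share one trend.
a = record_run("SELECT * FROM trips WHERE ds = '2024-01-01' LIMIT 10", "2024-01-01", 42.0)
b = record_run("select *  from trips where ds = '2024-01-02' limit 20", "2024-01-02", 57.5)
```

&lt;p&gt;Under these assumptions, both runs map to the same key, so their durations form a single historical series that can be plotted or checked for regressions.&lt;/p&gt;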

&lt;p&gt;&lt;br&gt;&lt;strong&gt;Yarn Observability:&lt;/strong&gt; Saturated Yarn resources can cause job failures and slowness, which are difficult to debug. We offer solutions to observe and correlate the Yarn utilization in real time when applications are run. Further, we provide insights and suggestions when jobs get affected by Yarn.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;488&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5-1024x488.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077183&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1732,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure5.png 1732w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Application Level Yarn Queue Insights.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;strong&gt;File System Observability: &lt;/strong&gt;HDFS slowness and file-system-induced latencies are another factor causing degradations that are difficult to detect. We made changes to the Uber Hadoop client to add client-side monitoring of HDFS call counts and latencies with application-level granularity. Every developer can view the File System performance for their specific application/query, which makes the debugging process smoother. We further have correlation systems in place that capture the various infrastructure and engine metrics to root-cause issues and suggest fixes, but that is a topic for another blog.&lt;/p&gt;
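
&lt;p&gt;As a rough illustration of client-side monitoring with application-level granularity, the Python sketch below wraps a file system call and records call counts and latencies per application. The &lt;code&gt;FsCallMetrics&lt;/code&gt; class, the &lt;code&gt;fake_list_status&lt;/code&gt; stand-in, and the application ID are hypothetical; the actual change lives in Uber’s Hadoop client and is not shown in this post.&lt;/p&gt;

```python
import time
from collections import defaultdict


class FsCallMetrics:
    """Hypothetical per-application counters for file system calls."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def timed(self, app_id, op_name, func, *args, **kwargs):
        """Run func and record one call plus its latency for (app_id, op_name)."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            key = (app_id, op_name)
            self.counts[key] += 1
            self.total_seconds[key] += elapsed

    def mean_latency(self, app_id, op_name):
        """Average latency in seconds, or 0.0 if no calls were recorded."""
        key = (app_id, op_name)
        if self.counts[key] == 0:
            return 0.0
        return self.total_seconds[key] / self.counts[key]


def fake_list_status(path):
    """Stand-in for a real file system call; returns made-up file names."""
    return [path + "/part-0", path + "/part-1"]


metrics = FsCallMetrics()
files = metrics.timed("app_123", "listStatus", fake_list_status, "/data/trips")
metrics.timed("app_123", "listStatus", fake_list_status, "/data/eats")
```

&lt;p&gt;Keying the counters by application and operation is what makes it possible to show each developer the file system behavior of their own job rather than a cluster-wide average.&lt;/p&gt;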


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;830&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077468&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1600,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png 1600w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-6.png 1536w&quot; sizes=&quot;(max-width: 1600px) 100vw, 1600px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: HDFS Metrics Surfaced to Users and Engine Teams.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1600&quot; height=&quot;820&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077463&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1600,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png 1600w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/image-5.png 1536w&quot; sizes=&quot;(max-width: 1600px) 100vw, 1600px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: File System Insights User Interface on DataCentral.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;524&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8-1024x524.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077559&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure8.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Historical Insights for File System Performance.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-contactless&quot;&gt;&lt;strong&gt;Contactless&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;A good chunk of any data platform user’s time goes into debugging/troubleshooting failed queries and applications. To help them troubleshoot efficiently, we developed the “Contactless” system with the following objectives:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;Improve discoverability of errors&lt;/li&gt;


&lt;li&gt;Identify and surface the root cause, from the right layer&lt;/li&gt;


&lt;li&gt;Provide user-friendly explanations and suggestions&lt;/li&gt;


&lt;li&gt;Provide actionable workflows to resolve errors&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The service enables engine teams to add regex-based rules into the system. A rule also supports adding additional metadata, like user-friendly explanation, root cause layer, priority, etc. Once the stack traces are gathered for an application, the contactless service matches the exception trace against the rules and surfaces the most relevant message back to the user.&lt;/p&gt;


&lt;p&gt;Whenever applications fail, DataCentral parses the error logs and applies Contactless rules to the stack traces. User-friendly suggestions and error messages are then displayed on the DataCentral console, which enable the end user to debug and root-cause failures. Furthermore, a suggestions tab indicates the best actions that can be taken to resolve the error.&amp;nbsp;&lt;/p&gt;
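

&lt;p&gt;The rule-matching flow described above can be sketched in a few lines of Python. The &lt;code&gt;Rule&lt;/code&gt; fields mirror the metadata mentioned in the text (explanation, root cause layer, priority), but the field names, the sample rules, and the “highest priority wins” policy are assumptions for illustration, not the actual Contactless implementation.&lt;/p&gt;

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str       # regex applied to the stack trace
    explanation: str   # user-friendly message shown to the job owner
    layer: str         # root cause layer, e.g. "yarn" or "user"
    priority: int      # assumed policy: larger value wins on multiple matches


# Hypothetical rules; real rules are added by the engine teams.
RULES = [
    Rule(r"java\.lang\.OutOfMemoryError",
         "The job ran out of memory; consider raising executor memory.", "user", 10),
    Rule(r"Container killed by YARN",
         "YARN killed a container for exceeding its limits.", "yarn", 20),
    Rule(r"FileNotFoundException",
         "An input path does not exist.", "storage", 5),
]


def best_match(stack_trace, rules=RULES):
    """Return the highest-priority rule whose regex matches, or None."""
    matches = [r for r in rules if re.search(r.pattern, stack_trace)]
    if not matches:
        return None
    return max(matches, key=lambda r: r.priority)


trace = (
    "ExecutorLostFailure: Container killed by YARN for exceeding memory limits\n"
    "Caused by: java.lang.OutOfMemoryError: Java heap space"
)
hit = best_match(trace)
```

&lt;p&gt;Here two rules match the same trace, and the priority field decides which explanation is surfaced, which is one plausible way to report the root cause “from the right layer.”&lt;/p&gt;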


&lt;div class=&quot;wp-block-columns is-layout-flex wp-container-3 wp-block-columns-is-layout-flex&quot;&gt;&lt;div class=&quot;wp-block-column is-layout-flow wp-block-column-is-layout-flow&quot;&gt;&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;576&quot; height=&quot;1024&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A-576x1024.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077184&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=576,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A.png 576w, https://blog.uber-cdn.com/cdn-cgi/image/width=169,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A.png 169w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=864,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A.png 864w, https://blog.uber-cdn.com/cdn-cgi/image/width=1044,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9A.png 1044w&quot; sizes=&quot;(max-width: 576px) 100vw, 576px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9A: User Consoles for Configuring Rules.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;


&lt;div class=&quot;wp-block-column is-layout-flow wp-block-column-is-layout-flow&quot;&gt;&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;534&quot; height=&quot;1024&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B-534x1024.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077187&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=534,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B.png 534w, https://blog.uber-cdn.com/cdn-cgi/image/width=157,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B.png 157w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=801,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B.png 801w, https://blog.uber-cdn.com/cdn-cgi/image/width=958,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Figure9B.png 958w&quot; sizes=&quot;(max-width: 534px) 100vw, 534px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 9B: User Consoles for Configuring Rules.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;509&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10-1024x509.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077562&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure10.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 10: Contactless in Action.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-how-does-datacentral-help-with-cost-efficiency&quot;&gt;How Does DataCentral Help with Cost Efficiency?&lt;/h2&gt;


&lt;p&gt;Cost governance and reduction at Uber are driven by two concepts: attribution and cost efficiency.&amp;nbsp;&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-chargeback&quot;&gt;&lt;strong&gt;Chargeback&lt;/strong&gt;&lt;/h3&gt;


&lt;p&gt;Instead of setting hard limits and budgets on teams, Uber provides high transparency into costs and resource usage on several dimensions so that the stakeholders are equipped with the right data while making decisions. Resource usage and costs are tracked at a uOwn (Uber’s ownership platform) level granularity. Furthermore, the resource usage can be dissected across different granularities such as: User, Pipeline, Application, Schedule, and Queue level. Attribution is critical in driving conservation, identification of anti-patterns, and driving critical cost-reduction initiatives.&lt;/p&gt;
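

&lt;p&gt;To make the idea of dissecting usage across granularities concrete, here is a minimal Python sketch that rolls up the same usage records along different dimensions. The record fields, team names, and cost values are invented for illustration and do not reflect DataCentral’s actual schema.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical usage records; field names and values are illustrative only.
USAGE = [
    {"owner": "team-maps", "user": "alice", "queue": "adhoc", "cost": 12.5},
    {"owner": "team-maps", "user": "bob", "queue": "batch", "cost": 30.0},
    {"owner": "team-eats", "user": "carol", "queue": "batch", "cost": 7.5},
]


def rollup(records, dimension):
    """Sum cost per distinct value of the chosen dimension."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[dimension]] += rec["cost"]
    return dict(totals)


by_owner = rollup(USAGE, "owner")
by_queue = rollup(USAGE, "queue")
```

&lt;p&gt;The same records answer different questions depending on the dimension chosen, which is the essence of attribution at owner, user, or queue level.&lt;/p&gt;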


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;520&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11-1024x520.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077563&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure11.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 11: HDFS Consumption and Usage Insights.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;strong&gt;Consumption Reduction&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;Once resources are attributed to the right teams and owners, stakeholders have insights into metrics like: most expensive pipelines, continuously failing workloads, unnecessary compute, etc. As part of cost efficiency, we have taken up projects (such as HDFSRed, YarnRed, PrestoRed, etc.) that make automated, data-driven decisions to reduce costs. The HDFSRed project checks the access patterns of Uber’s data tables and creates Jira tickets for owners to push for table deletions and TTLs when data is not frequently accessed. Yarn and Presto reduction initiatives similarly check for anti-patterns and unnecessary compute and raise actionable Jira tickets to reduce/stop unused compute.&lt;/p&gt;
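

&lt;p&gt;A simplified version of the access-pattern check behind a project like HDFSRed might look like the Python sketch below. The 90-day idle threshold, the table names, and the &lt;code&gt;stale_tables&lt;/code&gt; helper are assumptions for illustration; the post does not specify the real thresholds or how tickets are filed.&lt;/p&gt;

```python
from datetime import date


def stale_tables(last_access, today, max_idle_days=90):
    """Return tables not read within max_idle_days, most idle first."""
    idle = {table: (today - seen).days for table, seen in last_access.items()}
    flagged = [table for table, days in idle.items() if days > max_idle_days]
    return sorted(flagged, key=lambda table: idle[table], reverse=True)


# Hypothetical last-read dates per table.
last_access = {
    "rides.trips_raw": date(2023, 6, 1),
    "eats.orders": date(2024, 1, 30),
    "tmp.backfill_2022": date(2022, 12, 15),
}

# Each candidate would become a ticket asking the owner to delete or set a TTL.
candidates = stale_tables(last_access, today=date(2024, 2, 1))
```

&lt;p&gt;Sorting by idle time puts the longest-untouched tables first, so owners see the likeliest deletion or TTL candidates at the top.&lt;/p&gt;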


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;640&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12-1024x640.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077569&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure12.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 12: Example Yarn Reduction JIRA Ticket.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-datacentral-s-scale&quot;&gt;DataCentral’s Scale&lt;/h2&gt;


&lt;p&gt;In order to provide real-time metadata for all Uber-wide applications, DataCentral has to match the scale of the various engines at Uber. Flink jobs keep up with engine-level scale to ingest real-time modeled data into the internal stores. With 500K Presto queries/day, 400K Spark apps/day, and 2M Hive queries/day, the data observability jobs handle 2K queries per minute and 30K RPS while reading the engine-level metrics. Further, HDFS insights handle over 10B calls per day and peak at 150K RPS (since this tracks calls on application-level granularity). The datastores have a 6-month retention span to handle the data growth and scale.&amp;nbsp;&lt;/p&gt;
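

&lt;p&gt;The per-minute figure follows from the daily volumes quoted above, as the short Python calculation below shows. It assumes load is spread evenly across the day, which real traffic is not, and it treats the “over 10B” HDFS calls as exactly 10B, so the resulting average is a lower bound that sits below the stated 150K RPS peak.&lt;/p&gt;

```python
# Daily volumes quoted in the text.
presto_per_day = 500_000
spark_per_day = 400_000
hive_per_day = 2_000_000

total_per_day = presto_per_day + spark_per_day + hive_per_day
per_minute = total_per_day / (24 * 60)            # about 2K per minute

# "Over 10B" HDFS calls per day, taken here as exactly 10B (an assumption).
hdfs_calls_per_day = 10_000_000_000
avg_hdfs_rps = hdfs_calls_per_day / (24 * 60 * 60)  # average, not peak

print(round(per_minute), round(avg_hdfs_rps))
```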


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-architecture&quot;&gt;Architecture&lt;/h2&gt;


&lt;p&gt;It is critical to provide real-time insights and metadata in order to minimize time to debug and mitigate job failures. DataCentral’s architecture consists of the following components:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Engines like Presto, Hive, Spark, and Neutrino emit query-level metadata to Kafka topics in real time. DataCentral has Flink jobs that constantly listen to these Kafka topics and consume any job-level metadata emitted by the engines.&lt;/li&gt;


&lt;li&gt;The Flink jobs pre-process this data in real time and combine metadata on job-level granularity across multiple sources. Finally, the data is stored in internal stores like MySQL and Docstore (Uber’s internal datastore, providing strong consistency and high horizontal scaling).&lt;/li&gt;


&lt;li&gt;The DataCentral microservice stack consists of several APIs that serve a variety of use cases, such as the DataCentral UI, external teams, TTL setting on Hive tables, and much more.&lt;/li&gt;
&lt;/ul&gt;
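

&lt;p&gt;The step in which metadata from multiple sources is combined at job-level granularity can be sketched in plain Python as a fold over events keyed by job ID. This is only a conceptual stand-in: the real pipeline runs as Flink jobs reading from Kafka, and the event fields and the “later event overwrites earlier” policy here are assumptions for illustration.&lt;/p&gt;

```python
from collections import defaultdict


def merge_by_job(events):
    """Fold events from several sources into one record per job_id."""
    jobs = defaultdict(dict)
    for event in events:
        record = jobs[event["job_id"]]
        for field, value in event.items():
            if field != "job_id":
                record[field] = value   # assumed policy: later events overwrite
    return dict(jobs)


# Hypothetical events as they might arrive from different topics.
events = [
    {"job_id": "q1", "engine": "presto", "duration_s": 12},
    {"job_id": "a7", "engine": "spark"},
    {"job_id": "q1", "queue": "adhoc"},
]
merged = merge_by_job(events)
```

&lt;p&gt;The merged record per job is the shape that would then be written to a store such as MySQL or Docstore and served by the APIs.&lt;/p&gt;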


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;583&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13-1024x583.png&quot; alt=&quot;&quot; class=&quot;wp-image-1077572&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/02/Figure13.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 13: High-level DataCentral architecture.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;The journey of the metadata from the engines to the Observability datastores takes under 500 ms. The DataCentral UI provides insights and metadata, which are served via the MySQL and Docstore datastores. The APIs supporting the UI fetch data from disparate sources and join it into unified responses, finally serving the customer-facing views.&lt;/p&gt;


&lt;p&gt;Further, batch workloads are run to power modeled datasets, which provide critical cost attribution data. Data from various engines is joined with HiveMetastore, uOwn, and other metadata sources to power rich insights, which are served on the DataCentral UI to downstream teams and leveraged for cost-reduction initiatives.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-datacentral-amp-uber-s-move-to-cloud&quot;&gt;DataCentral &amp;amp; Uber’s Move to Cloud&lt;/h2&gt;


&lt;p&gt;As Uber moves to the cloud, our priority remains providing cost transparency and cost reduction across the cloud ecosystem. Furthermore, DataCentral supports engine teams with critical metrics for performance testing, benchmarking, and identifying degradations when moving workloads to the cloud. This allows us to make the right decisions as we migrate critical jobs. For example, File System Observability has allowed engine teams like Spark to observe the latency increase in the cloud and choose the right migration approach.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;DataCentral has been a critical tool for engineers, data analysts, and platform teams at Uber. It provides advanced analytics and granular insights into big data applications and queries. DataCentral is used by stakeholders ranging from job owners and engine on-calls to platform teams and executive leadership. The self-serve nature of the platform has made it efficient for customers to debug jobs, mitigate incidents, and root-cause SLA breaches. Another key offering of DataCentral is consumption insights into big data compute and storage entities such as HDFS, Yarn, Kafka, Presto, etc. DataCentral acts as the single source of truth for stakeholders seeking platform-, team-, and org-level insights. Furthermore, we plan to open-source the DataCentral toolkit for broader adoption and community collaboration.&amp;nbsp;&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Apache®, Apache Pinot™, Apache Flink®, Apache Kafka®, Apache Hive™, Apache Hadoop®, Apache Spark™,&amp;nbsp; Apache Yarn™,&amp;nbsp; Kafka®, Apache Airflow™, Flink®, Hive™, Hadoop®, Spark™, Pinot™, Airflow™, and Yarn™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.&lt;/em&gt;&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Presto is a registered trademark of LF Projects, LLC. No endorsement by&amp;nbsp; LF Projects, LLC is implied by the use of these marks.&lt;/em&gt;&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;em&gt;Amazon Web Services, AWS, S3, the Powered by AWS logo are trademarks of Amazon.com, Inc. or its affiliates.&lt;/em&gt;&lt;br&gt;&lt;em&gt;Header image: “&lt;/em&gt;&lt;a href=&quot;https://www.flickr.com/photos/126444666@N05/14606819370&quot;&gt;&lt;em&gt;New York Grand Central Station&lt;/em&gt;&lt;/a&gt;&lt;em&gt;” by &lt;/em&gt;&lt;a href=&quot;https://www.flickr.com/photos/126444666@N05&quot;&gt;&lt;em&gt;jensfrickephotography&lt;/em&gt;&lt;/a&gt;&lt;em&gt; is licensed under &lt;/em&gt;&lt;a href=&quot;https://creativecommons.org/licenses/by/2.0/?ref=openverse&quot;&gt;&lt;em&gt;CC BY 2.0&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
</description><link>https://www.uber.com/blog/datacentral-ubers-observability-and-chargeback-platform/</link><guid isPermaLink="false">https://www.uber.com/blog/datacentral-ubers-observability-and-chargeback-platform/</guid><pubDate>Thu, 01 Feb 2024 07:30:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>AI</category><category>Backend</category><category>Data / ML</category></item><item><title>Stopping Uber Fraudsters Through Risk Challenges</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;As a marketplace-based, consumer-facing app, Uber encounters a multitude of sources of fraud across its platform. In one of the most common cases of fraud, bad actors use various methods to attempt to bypass payments for Uber rides, Eats orders, and other services, like Uber for Business. When this happens, failed transactions can occur, incurring losses that affect the drivers and businesses operating on Uber.&lt;/p&gt;


&lt;p&gt;To account for the serious financial implications of payment fraud, risk management is prioritized at Uber. Reflecting the risk solution landscape within the overall tech industry, our engineers have developed complex solutions to achieve the following:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Detect fraud: &lt;/strong&gt;Real-time fraud detection is driven by a system of business rules which run on &lt;a href=&quot;https://www.uber.com/blog/mastermind/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Mastermind&lt;/a&gt;, Uber’s rules engine. In addition, machine learning models generate predictive scores that give insight into the probability that a user is a fraudster.&amp;nbsp;&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Mitigate fraud: &lt;/strong&gt;Different forms of fraud mitigation are employed at Uber to act on and resolve triggered rules and threshold-passing scores, as appropriate. These involve both manual and automated strategies, and this is also where risk challenges come into play.&lt;/li&gt;
&lt;/ol&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-risk-challenges&quot;&gt;Risk Challenges&lt;/h1&gt;


&lt;p&gt;Risk challenges are experiences in which the user is asked to complete a certain task or process, often to verify the legitimacy of their identity or payment method. You have likely encountered a risk challenge before, and not just on the Uber app. A common one is to enter the CVV of a credit card when making a credit card transaction.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;One of the main desired outcomes of risk challenges is to effectively catch bad actors. This point is self-evident: a group that encounters risk challenges should have lower rates of fraud in comparison to a control group. However, protecting against fraud is not as simple as introducing risk challenges to everyone. The nature of risk challenges is highly individualized. Given the wide scope of Uber’s users, products, and geographies, risk challenges encountered on Uber will vary from user to user. Uber applies different risk challenges to different stages of the user journey, and users in different regions of the world may encounter different risk challenges. Let us consider just one risk challenge implemented in the Uber app: penny drop verification.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot;&gt;Penny Drop Verification&lt;/h1&gt;


&lt;p&gt;Consider the scenario where Uber receives a ride request from a user who does not seem to be the owner of the debit or credit card being used on the app. Based on the behavior and data of this Uber account, there is a very high probability that this particular user has stolen the card that they claim to own and plan to use for the ride.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;In the past, Uber would detect such users who were highly likely to be fraudsters and immediately prevent them from continuing to significantly engage with the app by employing certain strict actions. In the scenario described, the ride request would be declined, and the associated payment and/or user would be restricted in some cases.&lt;/p&gt;


&lt;p&gt;On the surface, this might seem effective in terms of avoiding payment fraud. While this strategy of strict actioning was in place, however, it became evident that it was not ideal for legitimate users who were flagged as false positives. Uber users whose access was restricted were required to contact customer service to resolve their status, which is often a resource-intensive process.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Penny drop verification was introduced in the Uber app as a user-friendly method for individuals who might have previously been restricted from using the app to instead have a chance to prove ownership of their payment method. In this challenge, a user is asked to confirm to Uber the amounts of two small, random authorization holds before the holds expire. Successful completion indicates that the user is likely the legitimate cardholder and not a fraudster.&amp;nbsp;&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;255&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac-1024x255.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076561&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/1_why_pac.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Throwing ban action versus risk challenge to a potentially risky user&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot;&gt;Technical Overview&lt;/h2&gt;


&lt;p&gt;The penny drop verification challenge was implemented with the goal of being both an effective and intuitive method of fraud mitigation for users using a credit or debit card as their payment method. We can summarize how we achieved this through the following design principles, which apply to not just the penny drop verification challenge, but any other good risk challenge:&amp;nbsp;&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Minimize false positives:&lt;/strong&gt; Trust good users (and not bad users). Fraudsters should not be able to pass this challenge, while well-intentioned users should be able to self-resolve and recover should they fail.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;


&lt;ol start=&quot;2&quot;&gt;&lt;li&gt;&lt;strong&gt;Create a seamless and empathetic user experience, with just the right amount of friction:&lt;/strong&gt; This is necessary because trade-offs may exist between the frequency and degree of risk challenges, and user satisfaction and churn. For instance, if a user is presented with a risk challenge that they deem too hard or too long to complete, they may stop using the app altogether. A frustrating user experience should be avoided without compromising fraud mitigation.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The following screens illustrate a happy path for a legitimate user who is presented with a new penny drop verification challenge.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;373&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens-1024x373.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076568&quot; style=&quot;aspect-ratio:2.745308310991957;width:700px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/2_pac_ux_screens.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Happy path flow of the penny drop verification risk challenge&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;p&gt;As illustrated, the challenge is integrated into the mobile user experience in a way that should not significantly interfere with the user’s intended primary action (in this case, requesting a ride), thanks to clear instructions and simple actions. At the same time, “skipping” the challenge is not possible: a user who is presented with the challenge is required to complete it. Even if the user exits an initiated challenge in the user interface, the challenge remains active, and the user will be prompted to complete it whenever they try to resume relevant actions.&amp;nbsp;&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot;&gt;&lt;br&gt;Triggering Flows&lt;/h3&gt;


&lt;p&gt;As mentioned above, certain risk challenges are designed for certain user journey flows on the Uber app. There are two main flows in which the penny drop verification challenge may be displayed:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;When the user makes a request for either a ride or delivery order&lt;/li&gt;


&lt;li&gt;When the user adds a payment method&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The challenge is not raised for every user at every occurrence of these flows. Rather, in the case that one of these two flows is initiated, downstream services are called on to fetch user-related features and to run risk rules in &lt;a href=&quot;https://www.uber.com/blog/mastermind/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Mastermind&lt;/a&gt;. The rule results may indicate that further risk assessment is necessary, particularly to verify that the user is the owner of a specific payment method. If this occurs, backend services send an error code to the mobile side such that the user encounters mobile screens responsible for initiating the penny drop verification challenge.&amp;nbsp;&lt;/p&gt;
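&lt;p&gt;The gating described above can be sketched roughly as follows. This is a minimal illustration, not Uber’s implementation: the flow names, the error code string, and the &lt;code&gt;RuleResult&lt;/code&gt; type are all hypothetical stand-ins, since the post does not give the real identifiers or the shape of Mastermind’s output.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical error code; the real code sent to the mobile client is not given in the post.
PENNY_DROP_REQUIRED = "PENNY_DROP_VERIFICATION_REQUIRED"

# Hypothetical names for the two triggering flows listed above.
TRIGGERING_FLOWS = {"request_ride_or_order", "add_payment_method"}


@dataclass
class RuleResult:
    """Assumed stand-in for the outcome of running risk rules."""
    needs_ownership_verification: bool


def evaluate_flow(flow: str, rule_result: RuleResult):
    """Return an error code for the client, or None to let the flow proceed."""
    if flow not in TRIGGERING_FLOWS:
        return None  # the challenge is only considered in the two triggering flows
    if rule_result.needs_ownership_verification:
        return PENNY_DROP_REQUIRED
    return None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for the common case reflects the point that most users in these flows never see the challenge.&lt;/p&gt;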


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;222&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen-1024x222.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076570&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/3_user_action_screen.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Risk challenge is initiated by a user action, like requesting a ride&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot;&gt;&lt;br&gt;User Consent&lt;/h3&gt;


&lt;p&gt;After the penny drop verification challenge is deemed necessary during a triggering flow, a modal is shown that lets the user choose to verify their selected card. In some cases, the user may have already consented to the challenge but exited before successfully completing it, so they must re-consent.&lt;/p&gt;


&lt;p&gt;If the user chooses to switch to another payment method, they may not be asked for further verification. This is because the action of presenting a risk challenge depends on a user’s specific payment method. Often, an Uber user adds more than one card to their account, and they may or may not be the legitimate owner of any number of them. Different payment profiles of the same user can have different challenge statuses.&lt;/p&gt;


&lt;p&gt;Once a user clicks “Verify card,” backend processes check various conditions to determine what client-side screen to show to the user in the remainder of the challenge flow. It is first necessary to confirm that data related to the selected payment method, like its UUID and the status of the penny drop verification challenge, has been saved in &lt;a href=&quot;https://www.uber.com/blog/schemaless-sql-database/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Docstore&lt;/a&gt;, Uber’s backend database. If the card is entirely new, then relevant data about the card is written to Docstore for the first time.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;429&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen-1024x429.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076572&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/4_user_consent_screen.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Challenge conditions are checked when a user consents to a risk challenge&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot;&gt;&lt;br&gt;Send Authorizations&lt;/h3&gt;


&lt;p&gt;On a screen that provides more information about the risk challenge, the user is prompted to send authorization holds. In the overall context of electronic transactions and payments, authorization holds are temporary holds placed on a certain amount of funds; they are often used to determine whether a user has enough money to complete a transaction, and thus whether Uber is able to collect a payment from that user.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;After the user clicks “Send authorization holds,” two small monetary amounts are issued to the user’s designated bank account using the authorization hold protocol. To initiate this process, an internal payment operation service makes a request to a specialized “payment grant” service to create two distinct grants. This is done by supplying two randomly generated amounts to be held, as well as a specified void duration.&lt;/p&gt;


&lt;p&gt;The payment grant service interacts with a payment gateway or processor, which contacts the user’s bank or card issuer to formally request temporary authorization holds to be placed on the user’s payment account, in the specified amount. If the specified void duration lapses and the fund holds are released, and the user has not successfully verified the amounts to complete the challenge, then the user will have to re-send the authorization holds.&lt;/p&gt;
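&lt;p&gt;As a rough sketch of the grant request and the re-send condition described above: the amount range, the void duration, and every name below are assumptions made for illustration, and the call to a real payment gateway is omitted entirely.&lt;/p&gt;

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HoldGrant:
    """Hypothetical record of one authorization hold."""
    amount_cents: int
    expires_at: datetime


def create_hold_grants(now, void_after=timedelta(days=2), low=1, high=99):
    """Create two grants with random small amounts and a shared void time.

    The 1-99 cent range and two-day void duration are illustrative defaults only.
    """
    expires_at = now + void_after
    amounts = [low + secrets.randbelow(high - low + 1) for _ in range(2)]
    return [HoldGrant(amount, expires_at) for amount in amounts]


def must_resend(grants, now, verified):
    """Holds must be re-sent if they lapsed before the user verified them."""
    lapsed = all(now >= grant.expires_at for grant in grants)
    return lapsed and not verified


if __name__ == "__main__":
    issued_at = datetime.now(timezone.utc)
    print(create_hold_grants(issued_at))
```

&lt;p&gt;The sketch draws amounts from &lt;code&gt;secrets&lt;/code&gt; rather than &lt;code&gt;random&lt;/code&gt; because the values act as a shared secret between the issuer and the cardholder.&lt;/p&gt;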


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;446&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen-1024x446.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076575&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/5_send_auths_screen.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Authorization holds must be sent as part of the penny drop challenge&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;h3 class=&quot;wp-block-heading&quot;&gt;&lt;br&gt;Amount Verification&lt;/h3&gt;


&lt;p&gt;To successfully complete the penny drop challenge, users are required to review their bank statements and accurately enter the exact amounts of the authorization holds within the Uber app. These entered amounts are then subject to verification.&lt;/p&gt;


&lt;p&gt;Throughout this procedure, the user’s challenge status is updated within Docstore, Uber’s backend database. If the user does not successfully verify the authorization hold amounts within a certain number of attempts, they will have failed the challenge. In this case, their access to the Uber app will be restricted because we have strong indications that they are not the owner of the payment method. By contrast, if the user does successfully complete the challenge, they can seamlessly proceed with their intended actions that they had initiated before being thrown the challenge.&amp;nbsp;&lt;/p&gt;
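&lt;p&gt;The status transitions above can be modeled as a small state machine. This is a sketch under stated assumptions: the status names, the limit of three attempts, and the order-insensitive comparison are hypothetical, as the post only says that the user has “a certain number of attempts.” Persisting the status to a datastore is left out.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class ChallengeStatus(Enum):
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


@dataclass
class PennyDropChallenge:
    expected_cents: tuple  # the two hold amounts, in cents
    max_attempts: int = 3  # assumed limit, not stated in the post
    attempts: int = 0
    status: ChallengeStatus = ChallengeStatus.PENDING

    def verify(self, entered_cents):
        """Record one attempt and return the resulting status."""
        if self.status is not ChallengeStatus.PENDING:
            return self.status  # passed or failed challenges are final
        self.attempts += 1
        # Assumption: the two amounts may be entered in either order.
        if sorted(entered_cents) == sorted(self.expected_cents):
            self.status = ChallengeStatus.PASSED
        elif self.attempts >= self.max_attempts:
            self.status = ChallengeStatus.FAILED
        return self.status
```

&lt;p&gt;Making the terminal states final means a failed challenge cannot be retried simply by submitting again, which matches the restriction described above.&lt;/p&gt;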


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;332&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen-1024x332.png&quot; alt=&quot;&quot; class=&quot;wp-image-1076577&quot; style=&quot;aspect-ratio:3.0843373493975905;width:700px;height:auto&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/6_enter_amounts_screen.png 2048w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: Authorization hold amounts must be correctly verified to pass the penny drop challenge&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;We are continually fine-tuning the user experience of the penny drop verification challenge in a way that effectively mitigates risk without creating too much friction in the user experience. Through analysis of metrics like success rates and churn rates, we continually act on insights into how users are interacting with the challenges that are thrown to them.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Penny drop verification is just one type of risk challenge integrated in the Uber app. Other challenges involve varying levels of difficulty. Based on what we understand of a given user and their intentions, one challenge may be considered better to use than another. Overall, risk challenges have been integral in our ongoing efforts to enhance security and user experiences on the Uber app. Their implementation has not only served as an effective safeguard against fraud, but has also led to smoother user experiences, thus expediting onboarding for specialized offerings such as Uber for Business.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot;&gt;Acknowledgements&lt;/h2&gt;


&lt;p&gt;Thank you to the Spender Risk Team for sharing their expertise about a range of interesting engineering challenges related to fraud throughout my internship, including risk challenges.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;Special thanks to Diganta Sarkar, Qixiong Liu, and Susie Peng for providing insights relevant to this blog, as well as Neel Mouleeswaran and You Xu for supporting my internship and growth as a software engineer.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;Cover photo attribution: Image created by Nenad Stojkovic. Image license information: &lt;a href=&quot;https://openverse.org/image/6d3d1a77-63ba-481c-9c58-42d1c7fa9507?q=payment%20credit%20card&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Link&lt;/a&gt;. No changes have been made.&lt;/p&gt;
</description><link>https://www.uber.com/blog/stopping-uber-fraudsters-through-risk-challenges/</link><guid isPermaLink="false">https://www.uber.com/blog/stopping-uber-fraudsters-through-risk-challenges/</guid><pubDate>Thu, 25 Jan 2024 08:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Backend</category><category>Mobile</category></item><item><title>Palette Meta Store Journey</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-introduction&quot;&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/h1&gt;


&lt;p&gt;The Machine Learning (ML) team at Uber is consistently developing new and innovative components to strengthen our ML Platform (Michelangelo).&amp;nbsp;&lt;/p&gt;


&lt;p&gt;In machine learning, features are the data used to make model calculations and predict an outcome. You can think of them as the input to the learning model or attributes in your data that are relevant to the predictive modeling problem.&lt;/p&gt;


&lt;p&gt;When querying Uber’s data stores for feature data, it can be hard to:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Figure out good Uber-specific features&lt;/li&gt;


&lt;li&gt;Build pipelines to generate features&lt;/li&gt;


&lt;li&gt;Compute features in real time&lt;/li&gt;


&lt;li&gt;Guarantee that data used at training is the same as the data used for scoring predictions&lt;/li&gt;


&lt;li&gt;Monitor features&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The Uber Michelangelo feature store, called Palette, is a database of Uber-specific curated and internally crowd-sourced features that are easy to use in machine learning projects. It is designed to address the challenges listed above. Pipelines for feature generation and feature dispersal are auto-generated. Palette supports various feature computation use cases, such as batch and near-real-time, and includes precomputed features related to cities, drivers, and riders, as well as custom features generated for the Eats, Fraud, and Comms teams. Subject to our normal data access restrictions, Uber users can use many of the pruned features maintained by other Uber teams, or create their own, and directly incorporate these features into their machine learning models.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;468&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1-1024x468.png&quot; alt=&quot;&quot; class=&quot;wp-image-1075931&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1821,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/palette-architecture-e2e-engblog-figure1.png 1821w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1: Feature Generation graph shows job computing features. Feature Ingestion graph shows ingesting data to hive and Cassandra. Feature Serving graph shows how features are served offline/online. Feature Metadata and Data Quality graph shows how featurestore metadata flows across offline and online stores.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;




&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-palette-metastore-background&quot;&gt;&lt;strong&gt;Palette Metastore Background&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;Palette provides feature management infrastructure in its Metastore, including feature discovery, creation, deprecation, and setup of offline and online serving.&lt;/p&gt;


&lt;p&gt;Palette Metastore is a metadata store of features in which Palette users can create and deprecate features and record details about the ownership, backfill, and scheduling of feature generation pipelines, as well as offline training and HDFS locations. Users can specify the Cassandra databases to which data should be copied for online serving, along with the Spark configuration, join keys, and feature list with feature metadata. Users can also specify which features should be copied for online serving and the SQL queries that generate the features from upstream dependencies, while an audit of changes is maintained.&amp;nbsp;&lt;/p&gt;
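&lt;p&gt;To make the kinds of metadata listed above concrete, here is a hypothetical shape for one feature-group record. It is an illustration only: the class, its field names, and the validation rules are assumptions and do not reflect Palette’s actual schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FeatureGroupSpec:
    """Hypothetical feature-group record in a metadata store."""
    name: str
    owner: str
    schedule: str        # e.g. a cron expression for the generation pipeline
    hdfs_location: str   # where offline training data is written
    join_keys: list
    features: list
    sql_query: str       # generates the features from upstream tables
    online_features: list = field(default_factory=list)  # subset copied online
    cassandra_cluster: str = ""                           # online serving target
    deprecated: bool = False

    def validate(self):
        """Return a list of problems; an empty list means the spec is consistent."""
        problems = []
        unknown = set(self.online_features) - set(self.features)
        if unknown:
            problems.append(f"online features not in feature list: {sorted(unknown)}")
        if self.online_features and not self.cassandra_cluster:
            problems.append("online serving requested but no Cassandra target set")
        if not self.join_keys:
            problems.append("at least one join key is required")
        return problems
```

&lt;p&gt;Checks like these would run when a spec is created or updated, before any pipeline is generated from it.&lt;/p&gt;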


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;592&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2-1024x592.png&quot; alt=&quot;&quot; class=&quot;wp-image-1075954&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=1794,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Feature-Spec-Metadata-Arc-eng-blog-figure2.png 1794w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Feature Group Update flows from Palette Metadata repository to Offline Serving system and propagates to OnlineServing Cache eventually as well as is used by various systems for ETL/Training.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;



&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-a-closer-look-problem-and-motivation&quot;&gt;&lt;strong&gt;A Closer Look: Problem and Motivation&lt;/strong&gt;&lt;/h2&gt;


&lt;p&gt;In 2021, a major incident occurred due to inadequate schema validation on Palette Metadata: a bad metadata change was pushed, which broke OnlineServing for major Tier1 use cases because it was unable to load Palette Metadata during boot-up.&lt;/p&gt;


&lt;p&gt;Schema validation logic used to be client-side and lived in a script within the FeatureSpec repository, which is the metadata repository where Palette customers make metadata-related changes. Updating the validation was challenging: customer metadata updates didn’t always pick up the latest validation, because customers didn’t always rebase against the latest code changes. This led to incorrect metadata being merged into the master repository.&lt;/p&gt;


&lt;p&gt;Once incorrect metadata changes had been merged into master, they caused build failures for other customers rebasing against it.&lt;/p&gt;


&lt;p&gt;Incident resolution took several hours because of the following issues:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Updating Palette Metadata in the OnlineServing stack. Changing a single feature group in the Palette Metadata repository triggered updates for all of the hundreds of feature groups because there was no incremental update system, which prolonged rollbacks.&lt;/li&gt;


&lt;li&gt;Lack of schema validation. The Feature Engine on-call had to dedicate substantial time to each customer diff, and the majority of on-call time was spent assisting with metadata changes in the FeatureSpec repo. The lack of a build job to verify the actual Hive table schema before merging led to failures at training time. Customers made errors when creating Palette tables, such as omitting required columns.&lt;/li&gt;


&lt;li&gt;Offline Metadata updates. Metadata updates took over an hour after changes landed in the FeatureSpec repo, since the entire metadata repository was updated even when only a minor change was made to a single feature group.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These issues highlight the challenges stemming from inadequate schema validation, leading to data loss, helpdesk burden, build failures, and confusion in pipeline updates. The complex process of updating metadata and the lack of automated schema verification further compounded the problems faced by the team.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-deep-dive-meta-store&quot;&gt;&lt;strong&gt;Deep Dive: Meta Store&lt;/strong&gt;&lt;/h2&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-feature-store-object-model&quot;&gt;Feature Store Object Model&lt;/h3&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-large&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;580&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2-1024x580.png&quot; alt=&quot;&quot; class=&quot;wp-image-1075980&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=1024,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 1024w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 768w, https://blog.uber-cdn.com/cdn-cgi/image/width=1536,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 1536w, https://blog.uber-cdn.com/cdn-cgi/image/width=2048,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 2048w, https://blog.uber-cdn.com/cdn-cgi/image/width=2072,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/figure3-2.png 2072w&quot; sizes=&quot;(max-width: 1024px) 100vw, 1024px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: FeatureGroup has OnlineSpec, OfflineSpec, ComputeSpec. OnlineSpec has Snapshot Backing which underneath is backed by Cassandra or Hive Backing. OnlineFeatureServingGroup is composed of online stores and online caches. Inference Server/Palette Service references OnlineFeatureServingGroup and indirectly references FeatureGroup.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;The following are the new objects, backed by protos, that we formally define in the new Palette Metadata system:&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;FeatureGroup&lt;/strong&gt;: A logical table holding a collection of streaming and batch features, backed by daily feature snapshots in Hive tables or by Cassandra for the online store.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Feature&lt;/strong&gt;: A single feature corresponding to a column within the logical FeatureGroup (table).&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: Dataset represents the metadata needed to create a table in a database/storage for a given feature group. For example, keyspace, partition key, and cluster key would be the metadata needed to create a table for a given Cassandra (C*) cluster. These would be part of the Dataset spec.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;: Storage is the underlying storage technology that is referenced by a Dataset and by an online feature serving group.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;FeatureServingGroup&lt;/strong&gt;: A logical unit of serving in the online store that guarantees a certain SLA (throughput, latency). It is a collection of Storage (Cassandra/Redis clusters) that back the Feature Groups, and a routing map of FeatureGroups to the underlying Datasets. Note that it is common (in the case of very large use cases) for a FeatureServingGroup to contain multiple Cassandra clusters.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Inference Server/Palette Service&lt;/strong&gt;: Inference Server is the logical object holding metadata for Inference Serving for a given model within a Michelangelo project. Palette Service (a service where users can just fetch feature values without needing a model setup) similarly will hold metadata for serving via Palette Service.&lt;/p&gt;
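
&lt;p&gt;To make the object model above more concrete, here is a minimal Python sketch of how these objects might relate to each other. It is not Uber’s proto schema: only the object names come from the text, and every field name (for example dtype, keyspace, or routing) is an assumption chosen for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    name: str   # column name inside the FeatureGroup's logical table
    dtype: str  # hypothetical field: value type of the column


@dataclass
class FeatureGroup:
    name: str
    features: List[Feature]


@dataclass
class Storage:
    name: str  # hypothetical identifier of a Cassandra or Redis cluster
    kind: str  # e.g. "cassandra" or "redis"


@dataclass
class Dataset:
    feature_group: str  # FeatureGroup this table belongs to
    storage: str        # name of the Storage that holds the table
    keyspace: str
    partition_key: List[str]
    cluster_key: List[str] = field(default_factory=list)


@dataclass
class FeatureServingGroup:
    name: str
    storages: List[Storage]
    routing: Dict[str, Dataset]  # FeatureGroup name -> Dataset that serves it

    def dataset_for(self, feature_group: str) -> Dataset:
        """Resolve which Dataset (and therefore which Storage) serves a group."""
        return self.routing[feature_group]
```

&lt;p&gt;With a layout like this, answering the question of which cluster serves a given feature group becomes a single lookup through the FeatureServingGroup routing map.&lt;/p&gt;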


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-metadata-organization&quot;&gt;&lt;br&gt;Metadata Organization&lt;/h3&gt;


&lt;p&gt;We split the metadata inside the Palette Metadata repository into the following files to simplify how customers and Michelangelo on-calls interact with it: customers manage the offline-related metadata files, and Michelangelo on-calls manage the online-serving-related metadata files.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Description.json&lt;/strong&gt;: This file contains all the metadata related to offline serving, as well as the ownership and alerting setup, backed by the OfflineSpec defined above.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Features.json&lt;/strong&gt;: This file covers metadata related to features, with a schema backed by the Feature CRD.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;OnlineServing.json&lt;/strong&gt;: This file contains all the metadata related to online serving.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;HQL&lt;/strong&gt;: This file contains the Hive queries for generating features.&lt;/p&gt;


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-metadata-registration&quot;&gt;&lt;br&gt;Metadata Registration&lt;/h3&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/vmeacjOfttOowaOhF2PmWRwRAqaZc8tanjGFlEQBWrb1e_gbNqlhbGs8eE3U0PlrVbjx6l4d0LiTwlvhGqJKFy9VvcMMIShMj1-T59vYdh2L-0dLTChU2u7EfdTep_tKZIO97uJ2TaVN_FaQevT4WAI&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Palette Metadata repository updates go through server side validation and get registered in offline serving system and pushed to OnlineServing Cache and OnlineServing stack.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;



&lt;p&gt;To expedite Offline and Online Metadata updates, we changed the system that handles customers’ Palette Metadata updates so that it incrementally computes the delta of each update and registers only that delta in the OfflineServing system.&lt;/p&gt;


&lt;p&gt;Once the updates land in UAPI, a Kubernetes-based controller propagates them to ObjectConfig, our highly available Online Serving Cache.&lt;/p&gt;


&lt;p&gt;The E2E updates to the Offline and Online systems now take only 15 minutes instead of over an hour, since only incremental updates are pushed rather than the entire metadata repository.&lt;/p&gt;
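
&lt;p&gt;One way to picture the incremental update is as a diff between the previously registered metadata and the new repository state. The sketch below assumes the metadata is a mapping from feature group name to a plain dictionary; compute_delta and that layout are hypothetical and are not Palette’s actual implementation.&lt;/p&gt;

```python
from typing import Dict, List, Tuple

# Assumed layout: feature group name -> that group's metadata document.
Metadata = Dict[str, dict]


def compute_delta(old: Metadata, new: Metadata) -> Tuple[List[str], List[str], List[str]]:
    """Return the names of (added, changed, removed) feature groups."""
    added = sorted(name for name in new if name not in old)
    removed = sorted(name for name in old if name not in new)
    changed = sorted(
        name for name in new
        if name in old and new[name] != old[name]
    )
    return added, changed, removed


registered = {
    "rider_stats": {"owner": "team-a", "features": ["trips_7d"]},
    "eats_stats": {"owner": "team-b", "features": ["orders_7d"]},
}
repository = {
    "rider_stats": {"owner": "team-a", "features": ["trips_7d", "trips_30d"]},
    "eats_stats": {"owner": "team-b", "features": ["orders_7d"]},
    "driver_stats": {"owner": "team-c", "features": ["hours_7d"]},
}

added, changed, removed = compute_delta(registered, repository)
print(added, changed, removed)
```

&lt;p&gt;Only the groups named in the delta would need to be registered again; the unchanged eats_stats group in this example is left alone.&lt;/p&gt;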


&lt;h3 class=&quot;wp-block-heading&quot; id=&quot;h-online-serving-re-architecture&quot;&gt;&lt;br&gt;Online Serving Re-Architecture&lt;/h3&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter size-full&quot;&gt;&lt;img decoding=&quot;async&quot; loading=&quot;lazy&quot; width=&quot;960&quot; height=&quot;640&quot; src=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Palette-Meta-Store-Journey-wrapper-figure5-edited.png&quot; alt=&quot;&quot; class=&quot;wp-image-1075969&quot; srcset=&quot;https://blog.uber-cdn.com/cdn-cgi/image/width=960,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Palette-Meta-Store-Journey-wrapper-figure5-edited.png 960w, https://blog.uber-cdn.com/cdn-cgi/image/width=300,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Palette-Meta-Store-Journey-wrapper-figure5-edited.png 300w, https://blog.uber-cdn.com/cdn-cgi/image/width=768,quality=80,onerror=redirect,format=auto/wp-content/uploads/2024/01/Palette-Meta-Store-Journey-wrapper-figure5-edited.png 768w&quot; sizes=&quot;(max-width: 960px) 100vw, 960px&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Schema updates for Old and New Schema propagate from Metadata Service to Read only Cache and gets loaded to OnlineServing via Loader which is referenced by Wrapper.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-metadata-unification&quot;&gt;Metadata Unification&lt;/h4&gt;


&lt;p&gt;In the old architecture, the metadata for online serving was fragmented across various services. We decided to consolidate all the metadata for online serving in one place, which is the Palette Metadata repository.&amp;nbsp;&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-interface-redesign&quot;&gt;&lt;br&gt;Interface Redesign&lt;/h4&gt;


&lt;p&gt;We made an interface change to deprecate the old schema, which no longer met the evolving needs of the Palette online system.&lt;/p&gt;


&lt;h4 class=&quot;wp-block-heading&quot; id=&quot;h-metadata-wrapper&quot;&gt;&lt;br&gt;Metadata Wrapper&lt;/h4&gt;


&lt;p&gt;We introduced a wrapper during migration for two main purposes: interface adaptation and quick rollback. During the migration, we made both versions of the metadata available to Palette Online Serving, which gave us the ability to compare the metadata in memory. Because the meta loader transforms the metadata into a format better suited to serving needs, the metadata in memory differs from what we see in the metadata service, so comparing it in memory gave us more confidence in a safe migration. Due to the interface redesign, however, the serving logic would have needed to support both interfaces, so the wrapper translates the legacy metadata into the format of the new interface. We also introduced a kill switch that tells the wrapper which version of the metadata to provide to the serving logic, so that we can roll back quickly if any metadata issue arises during migration.&lt;/p&gt;
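
&lt;p&gt;The wrapper idea can be sketched in a few lines of Python. This is an illustration only: the legacy and new metadata shapes, the adapt_legacy function, and the way the kill switch is read are all assumptions, since the text does not describe the real interfaces.&lt;/p&gt;

```python
from typing import Callable, Dict, List


def adapt_legacy(legacy: Dict[str, str]) -> Dict[str, dict]:
    """Translate assumed legacy metadata ({group: cluster}) into the new shape."""
    return {group: {"storage": cluster} for group, cluster in legacy.items()}


class MetadataWrapper:
    """Serves one metadata version to the serving logic behind a single interface."""

    def __init__(self, legacy: Dict[str, str], new: Dict[str, dict],
                 use_new: Callable[[], bool]):
        self._legacy = legacy
        self._new = new
        self._use_new = use_new  # kill switch, evaluated on every read

    def get(self) -> Dict[str, dict]:
        if self._use_new():
            return self._new
        return adapt_legacy(self._legacy)

    def mismatches(self) -> List[str]:
        """Feature groups whose adapted legacy metadata differs from the new one."""
        adapted = adapt_legacy(self._legacy)
        groups = set(adapted) | set(self._new)
        return sorted(g for g in groups if adapted.get(g) != self._new.get(g))


switch = {"use_new": True}
wrapper = MetadataWrapper(
    legacy={"rider_stats": "cass-a", "eats_stats": "cass-b"},
    new={"rider_stats": {"storage": "cass-a"}, "eats_stats": {"storage": "cass-c"}},
    use_new=lambda: switch["use_new"],
)
print(wrapper.mismatches())
switch["use_new"] = False  # flip the kill switch to roll back
print(wrapper.get())
```

&lt;p&gt;Because the switch is read on every call, flipping it changes what the serving logic sees without a redeploy, which is the property that makes a rollback quick in this sketch.&lt;/p&gt;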


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-migration-challenges&quot;&gt;&lt;strong&gt;Migration Challenges&lt;/strong&gt;&lt;/h2&gt;


&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Keep a smooth user experience during migration&lt;/strong&gt;:&lt;ul&gt;&lt;li&gt;We maintained scripts to automatically synchronize feature metadata between the old and new systems. This avoided data gaps when switching to the new system.&lt;/li&gt;


&lt;li&gt;Clear documentation was provided to help users learn how to onboard features to the new Metadata Store.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Track correctness for migration&lt;/strong&gt;:&lt;ul&gt;&lt;li&gt;Comparison metrics and logs were created across the Feature Generation pipeline system and the offline serving system to clearly articulate the differences between the old and new systems. They served as evidence of the migration’s correctness.&lt;/li&gt;


&lt;li&gt;Traffic metrics were checked to make sure that no traffic went through the old systems after full migration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Ensuring Backward Compatibility&lt;/strong&gt;:&lt;ul&gt;&lt;li&gt;The updated metadata introduced substantial changes in data formats and APIs. To maintain backward compatibility, it was essential to create a robust common API wrapper. This wrapper could seamlessly bridge the gap between legacy code and the new codebase. Subsequently, we could transition the underlying implementations of the Common API wrapper gradually, facilitating a seamless migration process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;:&lt;ul&gt;&lt;li&gt;The code changes were integrated into the Michelangelo team’s offline training, re-training, evaluation, and prediction workflows. To guarantee that these integrations kept working after the migration, it was imperative to conduct comprehensive integration testing involving all existing systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;&lt;strong&gt;Rollback Plan&lt;/strong&gt;:&lt;ul&gt;&lt;li&gt;In case the migration encounters unexpected issues or doesn’t yield the desired results, we also defined a thorough rollback plan which could minimize downtime and mitigate risks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-result&quot;&gt;Result&lt;/h2&gt;


&lt;p&gt;As a result of the Metadata migration, Palette onboarding deployment time has dropped by more than 95%. In addition, the time to migrate Cassandra clusters has gone down by 90%, since all online serving configuration is now cleanly organized and on-calls no longer need to scramble to figure out which feature group is served from which Cassandra cluster. Because the offline metadata update system was re-architected to process updates incrementally, the time for offline metadata updates has gone from hours to minutes. Additionally, we have introduced enhanced server-side validation for FeatureStore CRDs as well as cross-CRD validation.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h1&gt;


&lt;p&gt;Overall, the introduction of a formal schema, the consolidation of metadata, enhanced validation, and a diligently planned migration have made our new metadata system easy to use for customers and Michelangelo on-calls, reducing deployment and customer onboarding time as well as maintenance and operational costs.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;


&lt;p&gt;This major step for Machine Learning at Uber could not have been done without the many teams who contributed to it. Huge thank you to the Feature Engine group within Uber’s Michelangelo Team, who spent 1+ year rearchitecting the Meta Store system.&amp;nbsp;&lt;/p&gt;


&lt;p&gt;We also want to give a special thank you to our partners on the Michelangelo teams for making this idea a reality, as well as our former colleagues who helped initiate this idea.&lt;/p&gt;


&lt;p class=&quot;has-small-font-size&quot;&gt;&lt;br&gt;Header Image Attribution: The “Journey start here” image is covered by a &lt;a href=&quot;https://creativecommons.org/licenses/by/2.0/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;CC BY 2.0 &lt;/a&gt;&amp;nbsp;license and is credited to &lt;a href=&quot;https://www.flickr.com/photos/40642065@N06&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Johnragai-Moment Catcher&lt;/a&gt;. No changes have been made to the image.&lt;/p&gt;
</description><link>https://www.uber.com/blog/palette-meta-store-journey/</link><guid isPermaLink="false">https://www.uber.com/blog/palette-meta-store-journey/</guid><pubDate>Thu, 18 Jan 2024 08:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>AI</category></item><item><title>Uber: GC Tuning for Improved Presto Reliability</title><description>&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-presto-at-uber&quot;&gt;Presto at Uber&lt;/h1&gt;


&lt;p&gt;Uber uses open-source Presto to query nearly every data source, both in motion and at rest. Presto’s versatility empowers us to make intelligent, data-driven business decisions. We operate around 20 Presto clusters spanning over 10,000 nodes across two regions. We have about 12,000 weekly active users running approximately 500,000 queries daily, which read about 100 PB from HDFS. Today, Presto is used to query various data sources like &lt;a href=&quot;https://hive.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Apache Hive&lt;/a&gt;, &lt;a href=&quot;https://pinot.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;Apache Pinot&lt;/a&gt;, AresDb, MySQL, Elasticsearch, and Apache Kafka, through its extensible data source connectors.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/N5XyfSyzOYri5u_fdYS1bptIXQKw-pEl3hGKydZVE5hyt8eh01uGA8JAG9ucliJ5y0LzNhTBF4Qi4AEVIEWghFDsjukoKzulqE8ya7rvn6-CdNveNO6FfQ3eTaiainiLNDdIWTFxBUHTk3XOrJzNjGs&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 1&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;Our selection of cluster types can accommodate any request, whether for interactive or batch purposes. &lt;em&gt;Interactive&lt;/em&gt; workloads cater to dashboards/desktop users waiting for the results, and &lt;em&gt;batch&lt;/em&gt; workloads are jobs that run on a predefined schedule. Each of our clusters is classified by its machine type. Most of our clusters comprise bigger machines with more than 300 GB of heap memory, while the others have smaller machines with less than 200 GB of heap memory. We have adjusted the concurrency of each cluster according to its size and the type of machines that make it up.&lt;/p&gt;


&lt;p&gt;Every week, we carry out memory fragmentation optimization across all production clusters. Even though we were constantly improving fragmentation, we still suffered from constant full garbage collections (very long pauses) and sporadic out-of-memory errors. To give a sense of the problem, here is the cumulative Presto Full GC count:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter is-resized&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/Lf33-HkkgTVwo2C8b7Hyew2J66Q8jQKuWyORGxXqnYUGNEXVwyMEL1QEF2di5vy6R8q0bO6OSxaT2JOTNf4d1aUDBZxfKqurCoH1Btm2_KpeWBpXytDmgEyjpleFumhiNKQ2EWAf9T5RpjtbzmQvPpQ&quot; alt=&quot;&quot; style=&quot;aspect-ratio:1.6341030195381883;width:700px;height:auto&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 2: Presto Full GC count per day.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-overview-g1gc-garbage-collector&quot;&gt;Overview – G1GC Garbage Collector&lt;/h2&gt;


&lt;p&gt;&lt;a href=&quot;https://www.oracle.com/technical-resources/articles/java/g1gc.html&quot; target=&quot;_blank&quot; rel=&quot;noreferrer noopener&quot;&gt;G1GC&lt;/a&gt; is a garbage collector that tries to balance throughput and latency. G1 is a generational garbage collector, which sets it apart from the newer concurrent garbage collectors (Shenandoah, ZGC, etc.). Generational means that the heap is divided into areas for short-lived and long-lived objects.&lt;/p&gt;


&lt;p&gt;The first important thing to differentiate here is that there are two types of memory: stack and heap. Stack allocations are cheap because we only need to move a pointer: whenever we call a function we decrement the stack pointer (the stack grows downward), and once we are done with that function we increment it again. And voilà: allocation and deallocation in a single statement each. Heap allocation and deallocation, on the other hand, are a little more expensive. For G1GC, allocation is similar to the stack in that we only need to bump a pointer, but deallocation requires the GC to run.&lt;/p&gt;


&lt;p&gt;In Java, since all objects are allocated on the heap, what do we allocate on the stack? The answer is the “pointer” (reference) to the object on the heap. As for the heap space, G1 divides it into small sections called “regions.”&lt;/p&gt;


&lt;p&gt;G1 aims for roughly 2,048 regions on the heap.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/sTVuA9Zife5F9JdnTG1FZ7qzYZPr-octAgSmrUK9kXyX4HBSlLX0dLWxlbPbUJEa0jDRYyJ_X18iICObV8_mLzcwwms9hX2_aGvVt6xGzeN7plfcLu4temYrbJKZIn9gJIKPTmb4h3D8CnmrmXVA2LM&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 3: Heap is divided into regions.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;What’s the size of each region? It depends on your heap size, and it can range from 1 MB to 32 MB. The JVM picks the size that gives us about 2,048 regions within those bounds; on very large heaps, the 32 MB cap means we end up with more regions.&lt;/p&gt;
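
&lt;p&gt;The region sizing rule can be approximated in a few lines of Python. This is a simplified sketch, not the JVM’s exact logic: it assumes the size is the heap divided by 2,048, rounded down to a power of two and clamped to 1-32 MB, and it ignores details such as how the JVM combines initial and maximum heap size or an explicit region size setting.&lt;/p&gt;

```python
MB = 1024 * 1024
GB = 1024 * MB

TARGET_REGION_COUNT = 2048
MIN_REGION_BYTES = 1 * MB
MAX_REGION_BYTES = 32 * MB


def g1_region_size(heap_bytes: int) -> int:
    """Approximate G1 region size for a heap of the given size, in bytes."""
    raw = max(heap_bytes // TARGET_REGION_COUNT, MIN_REGION_BYTES)
    # Round down to the nearest power of two.
    power_of_two = 2 ** (raw.bit_length() - 1)
    return min(power_of_two, MAX_REGION_BYTES)


def region_count(heap_bytes: int) -> int:
    """Number of whole regions that fit in the heap."""
    return heap_bytes // g1_region_size(heap_bytes)


for heap_gb in (1, 4, 300):
    size_mb = g1_region_size(heap_gb * GB) // MB
    print(heap_gb, "GB heap:", size_mb, "MB regions,", region_count(heap_gb * GB), "regions")
```

&lt;p&gt;Under these assumptions, a 4 GB heap gets 2 MB regions and exactly 2,048 of them, while a 300 GB heap like the larger Presto machines hits the 32 MB cap and ends up with 9,600 regions.&lt;/p&gt;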


&lt;p&gt;Each region is either assigned to the young generation, assigned to the old generation, or free.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/svWLIMAX1Kir9QJm1LdSXfmem2BefIL5uDqQmlnwbeuyZnXoQMPVaW9cEk2V0PkPOaPAwcEuw7h6eIFVbHq89XMIslU749z1QZhA_ZcRkpr-J5r6aRpyadiPUHyk8_OqKq6tStH5M2eQAwb7PiVxoLI&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 4: Heap regions are categorized as young gen, old gen, or free.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;Finally, the young generation is further divided into Eden and survivor spaces. Eden is where every new allocation happens. There are two survivor arenas. Why? The young generation reclaims memory by copying live objects between regions, so it needs an empty survivor arena to copy into.&lt;/p&gt;


&lt;p&gt;So the full process is as follows: whenever we do a new Object(), it gets allocated in Eden. When GC runs and the object is still alive, it gets copied to Survivor0. If the object is still alive the next time GC runs, it gets copied to Survivor1. It continues to be copied back and forth between the survivors until it eventually gets promoted to the old generation.&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/40pVmoyFKnSCgWh-D2aCCGDmQ_-x9Dj-DCtLP0-P28wkn2hNabgQo0yO4_06wjJqxaPQaCNs2sYePWwU0K83RrBQKDOHDLviyXy8ssi6Sqs98Lv-KTEQKU5VRmiz85YTVzYUCUFFO5fODVP18W8lFQs&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 5: Heap is fully divided into all the mentioned types.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;To recap, the young generation uses a copying mechanism to release memory. So when do we allocate to the old generation? There are two scenarios:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;G1 has an age threshold. Every time a young gen object gets copied, we increase the age. Once we hit the threshold, it gets copied to the old gen.&lt;/li&gt;


&lt;li&gt;Each region is 1-32 MB in size. Any object that is 50% or more of the size of the region gets allocated directly to the old generation. G1 calls this a humongous object.&lt;/li&gt;
&lt;/ol&gt;
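
&lt;p&gt;The two promotion rules above can be sketched as a pair of small Python functions. The function names are hypothetical, the tenuring threshold is passed in rather than taken from any JVM default, and the humongous check follows the “50% or more” wording of the text.&lt;/p&gt;

```python
MB = 1024 * 1024


def placement(object_bytes: int, region_bytes: int) -> str:
    """Where a new allocation lands, following the 50%-of-region rule."""
    if object_bytes * 2 >= region_bytes:
        return "old (humongous)"
    return "eden"


def after_young_gc(age: int, tenuring_threshold: int):
    """Return (space, new_age) for an object that survived one more young GC."""
    new_age = age + 1
    if new_age >= tenuring_threshold:
        return "old", new_age
    return "survivor", new_age


# A 16 MB array in a heap with 32 MB regions skips the young generation.
print(placement(16 * MB, 32 * MB))

# A small object ages through the survivors until it reaches the threshold.
space, age = "eden", 0
while space != "old":
    space, age = after_young_gc(age, tenuring_threshold=3)
    print(space, age)
```

&lt;p&gt;With 32 MB regions, any single allocation of 16 MB or more is treated as humongous in this sketch, which is worth keeping in mind for workloads that build large buffers.&lt;/p&gt;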


&lt;p&gt;How does G1 clear the old generation? It uses an algorithm called “concurrent mark and sweep.” It is a graph traversal that starts from the root objects (thread stacks, global variables, etc.) and visits every object that is still referenced. It is essential to mention that G1 uses SATB (snapshot-at-the-beginning), so any object allocated after marking starts is considered alive regardless of its real liveness. Once marking finishes, G1 knows which objects are still alive, and the ones that are not can be cleaned up in the following mixed collections.&lt;/p&gt;


&lt;p&gt;What? A mixed collection? Indeed. A mixed collection is a young generation collection that also includes old generation regions. It copies the objects that are still alive into another old generation region. This process is critical to reduce memory fragmentation.&lt;/p&gt;
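
&lt;p&gt;The marking traversal can be illustrated with a toy object graph in Python. This is a conceptual sketch only: real marking runs concurrently with the application and works on heap memory rather than dictionaries, and the names mark_live, references, and allocated_during_marking are made up for this example.&lt;/p&gt;

```python
from collections import deque


def mark_live(roots, references, allocated_during_marking=()):
    """Return the set of objects treated as live after a marking cycle.

    references maps an object to the objects it points to, as captured in
    the snapshot taken when marking started.
    """
    marked = set()
    queue = deque(roots)
    while queue:
        obj = queue.popleft()
        if obj in marked:
            continue
        marked.add(obj)
        queue.extend(references.get(obj, ()))
    # Objects allocated after the snapshot are conservatively kept alive.
    return marked | set(allocated_during_marking)


snapshot = {"root": ["a"], "a": ["b"], "c": ["d"]}
live = mark_live(["root"], snapshot, allocated_during_marking=["new"])
print(sorted(live))
```

&lt;p&gt;In this toy graph, c and d are unreachable from the root and would be reclaimable, while the object allocated during marking is kept even though nothing in the snapshot references it.&lt;/p&gt;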


&lt;p&gt;Who determines the size of each component (Eden, survivor, old gen, etc.)? The components are always shrinking and expanding as G1 does its job, although there are certain limits. For instance, by default the young generation can only be 5-60% of the total heap.&lt;/p&gt;


&lt;p&gt;Today’s discussion doesn’t require going into more advanced G1GC topics, so let’s begin with what we have done at Uber.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-g1gc-at-uber&quot;&gt;G1GC at Uber&lt;/h2&gt;


&lt;p&gt;When Java became more widely used at Uber, we were using OpenJDK 8. Most of the time, the only tuning option we had to touch was &lt;em&gt;-XX:InitiatingHeapOccupancyPercent=X&lt;/em&gt;. This threshold controls when G1 starts a concurrent mark and sweep cycle.&lt;/p&gt;


&lt;p&gt;It has a default value of 45%, which usually causes an increase in CPU usage: any service that keeps a cache in memory would eventually exceed that threshold, and G1 would keep triggering marking cycles endlessly. For instance, if service A stores all the users in memory and that causes the old generation to be ~60% of the total heap, then the 45% threshold would always be met.&lt;/p&gt;


&lt;p&gt;Then how do we tune it?&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Enable GC logs and GC metrics&lt;/li&gt;


&lt;li&gt;Look for the peak old-generation utilization after mixed collections&lt;/li&gt;


&lt;li&gt;Select a value slightly higher than that peak–usually 5-10% higher&lt;/li&gt;
&lt;/ul&gt;
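
&lt;p&gt;The three steps above amount to a small calculation. The Python sketch below assumes you have already extracted old-generation utilization (as a percentage of the heap) after each mixed collection from your GC logs or metrics; suggest_ihop and the sample values are hypothetical.&lt;/p&gt;

```python
import math


def suggest_ihop(old_gen_pct_after_mixed, headroom_pct=5):
    """Pick an IHOP value slightly above the peak post-mixed-collection usage."""
    if not old_gen_pct_after_mixed:
        raise ValueError("need at least one utilization sample")
    peak = max(old_gen_pct_after_mixed)
    return min(math.ceil(peak) + headroom_pct, 100)


def ihop_flag(percent):
    """Format the JVM option for the chosen threshold."""
    return "-XX:InitiatingHeapOccupancyPercent=" + str(percent)


samples = [52.0, 58.5, 60.0]  # made-up utilization readings, in percent
print(ihop_flag(suggest_ihop(samples)))
```

&lt;p&gt;For the service A example, where the old generation settles at around 60% of the heap, this would suggest a threshold of 65-70% rather than the default 45%.&lt;/p&gt;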


&lt;p&gt;However, remember that Presto servers are now running on JDK 11. How do we tune them? This was our first attempt at tuning this version. Why is it different? Java introduced dynamic IHOP (InitiatingHeapOccupancyPercent). We no longer have a fixed default value of 45%; instead, we have a value that changes all the time and is only visible in the GC logs.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-tuning-jdk-11&quot;&gt;Tuning JDK 11&lt;/h2&gt;


&lt;p&gt;How does dynamic IHOP get calculated? It uses the current size of the young generation plus a free threshold (that is the basic idea; the actual formula is slightly more complex). The free threshold defaults to 10% of the total heap and is used as a buffer to allow GC to complete (remember that concurrent mark and sweep runs alongside your application).&lt;/p&gt;


&lt;p&gt;The process we followed is listed below (we waited 1-2 weeks between steps to have enough data to verify our experiments). We tried each change on one cluster first to avoid affecting all our users.&lt;/p&gt;


&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-add-more-gc-metrics&quot;&gt;&lt;strong&gt;Add more GC metrics&lt;/strong&gt;&lt;/h5&gt;


&lt;p&gt;We were missing young- and old-gen utilization metrics, so we had no easy way to look at historical data about our utilization.&lt;/p&gt;


&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-decrease-max-young-generation-size-from-60-to-20&quot;&gt;&lt;strong&gt;Decrease max young generation size from 60% to 20%&lt;/strong&gt;&lt;/h5&gt;


&lt;p&gt;We saw the young generation expand a few times to 50% of the total heap. This caused long GC pauses and delayed the next concurrent marking cycle. Concurrent marking can’t run if we are still doing mixed collections.&lt;/p&gt;


&lt;p&gt;The result?&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;Better GC pauses.&lt;/li&gt;


&lt;li&gt;Concurrent marking was still poor. Decreasing the max young generation size by 40% handed that space to the old generation, so concurrent marking was still starting late.&lt;/li&gt;
&lt;/ul&gt;


&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-increase-free-space-from-10-to-35-amp-decrease-heap-waste-from-5-to-1&quot;&gt;&lt;strong&gt;Increase Free space from 10% to 35% &amp;amp; decrease Heap waste from 5% to 1%&lt;/strong&gt;&lt;/h5&gt;


&lt;p&gt;Let’s first talk about the heap waste percentage. This option defaults to 5%, which tells G1 to reclaim garbage only while the reclaimable space exceeds 5% of the total heap. Why? To avoid long GC pauses during mixed collections. During concurrent marking, G1 orders the old generation’s regions by utilization and first picks the ones with the most free space, because they are faster to copy to a new region.&lt;/p&gt;


&lt;p&gt;For our 300G clusters, that translates to 15G that will never be cleaned. We decided to decrease that to 3G (&lt;em&gt;-XX:G1HeapWastePercent=1&lt;/em&gt;) based on past experiences.&lt;/p&gt;
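&lt;p&gt;The heap waste arithmetic above can be checked with a short sketch. It only restates the percentages from the text for a 300G heap; the function name is hypothetical.&lt;/p&gt;

```python
def heap_waste_gb(heap_gb, waste_pct):
    """Garbage, in GB, that G1 may leave unreclaimed for a given
    G1HeapWastePercent value (a plain percentage of the total heap)."""
    return heap_gb * waste_pct / 100


HEAP_GB = 300  # cluster heap size quoted in the text

print(heap_waste_gb(HEAP_GB, 5))  # 15.0 (default setting)
print(heap_waste_gb(HEAP_GB, 1))  # 3.0  (-XX:G1HeapWastePercent=1)
```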


&lt;p&gt;For free space, we analyzed several GC logs and noticed that old-generation utilization stayed at 20-35% after mixed collections. A 20% max young generation plus 35% free space gives a threshold of 45% (100% - (35% + 20%)). With this config, we leave a buffer of at least 10% (from 35% to 45%) so that there is some garbage to clean.&lt;/p&gt;
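&lt;p&gt;The threshold arithmetic above can be sketched as follows. This follows the simplified model described in the text, not G1’s actual adaptive calculation, and the function and variable names are hypothetical.&lt;/p&gt;

```python
def marking_threshold_pct(max_young_pct, free_space_pct):
    """Simplified model from the text: concurrent marking starts once heap
    occupancy reaches 100% minus the young generation and the free space."""
    return 100 - (max_young_pct + free_space_pct)


# Values quoted in the text.
MAX_YOUNG_PCT = 20       # max young generation size
FREE_SPACE_PCT = 35      # free space reserved as a buffer
PEAK_AFTER_MIXED = 35    # upper end of utilization seen after mixed collections

threshold = marking_threshold_pct(MAX_YOUNG_PCT, FREE_SPACE_PCT)
buffer_pct = threshold - PEAK_AFTER_MIXED

print(threshold)   # 45
print(buffer_pct)  # 10
```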


&lt;p&gt;The result?&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;A heap waste of 1% proved too aggressive, and we started seeing long pauses of &amp;gt;1s. The experiment was still useful because the GC logs showed that the long pauses began once mixed collections tried to go from 2% to 1% garbage.&lt;/li&gt;


&lt;li&gt;35% free space performed well. Full GCs dropped by ~80% on this cluster.&lt;/li&gt;
&lt;/ul&gt;


&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-increase-free-space-from-35-to-40-amp-increase-heap-waste-from-1-to-2&quot;&gt;&lt;strong&gt;Increase Free space from 35% to 40% &amp;amp; increase Heap waste from 1% to 2%&lt;/strong&gt;&lt;/h5&gt;


&lt;p&gt;The result was:&lt;/p&gt;


&lt;ul&gt;&lt;li&gt;2% heap waste gave us an additional 9G (6G of waste instead of the 15G allowed by the 5% default) and had little impact on latencies (pauses of ~50-100ms vs. 1-1.5s with 1%).&lt;/li&gt;


&lt;li&gt;40% free space performed slightly better than 35%, but the gain was small (an 85-90% reduction in Full GCs vs. 80%). We decided not to go any further to avoid thrashing.&lt;/li&gt;
&lt;/ul&gt;


&lt;h5 class=&quot;wp-block-heading&quot; id=&quot;h-try-the-same-tuning-options-on-a-different-cluster&quot;&gt;&lt;strong&gt;Try the same tuning options on a different cluster&lt;/strong&gt;&lt;/h5&gt;


&lt;p&gt;We tested the same config on a second cluster to verify the behavior before rolling it out to all of them. We picked the cluster with the most Full GCs over the past few weeks. Within 24 hours of the deployment, we could already see the impact:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/n5lWCndCnJTQfMknVsR4_H3ggMwiIIvPlA8Rwu78ojRg_XVxmr9A9P9oxI4zqGyy6YC3pxoumheJDglKnuhe2FsVjjHUnS3TIGH5cw_D9X2Z7i-uMB7qTzbTmpO2U3448MnHwL7V2fkmycOQ44POgYI&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 6: Full GCs cumulative count of a single cluster&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;p&gt;Before these changes, Full GCs used to start appearing after just a few hours. After the changes, we didn’t see any.&lt;/p&gt;


&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h2 class=&quot;wp-block-heading&quot; id=&quot;h-conclusion&quot;&gt;Conclusion&lt;/h2&gt;


&lt;p&gt;After several weeks of testing with the tuning flags presented above, we decided to use the same flags for all clusters. Once the flags were added or updated, all clusters ran with minimal to no internal OOM errors. This made the Presto clusters more reliable and reduced reruns of queries that had previously failed with OOM errors, which in turn improved the clusters’ overall performance. The flags we used in the final tuning are:&lt;/p&gt;


&lt;p&gt;-XX:+UnlockExperimentalVMOptions&lt;/p&gt;


&lt;p&gt;-XX:G1MaxNewSizePercent=20&lt;/p&gt;


&lt;p&gt;-XX:G1ReservePercent=40&lt;/p&gt;


&lt;p&gt;-XX:G1HeapWastePercent=2&lt;/p&gt;


&lt;p&gt;These flags are specific to the Presto use case at Uber and were finalized after multiple tuning iterations. We expect the right flags to differ for each organization depending on its workloads, so they must be tuned on a case-by-case basis. With these flags enabled, we see more frequent garbage collection, but in exchange we get more reliable Presto clusters and a lower on-call burden for their owners.&lt;/p&gt;
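&lt;p&gt;As an illustration only, the final flag set can be rendered from the three tuned percentages with a small helper. The function name is hypothetical and not part of Presto or the JDK. Placing the unlock flag first is a conservative choice made in this sketch, on the assumption that it is in the list to enable -XX:G1MaxNewSizePercent as an experimental option.&lt;/p&gt;

```python
def render_jvm_flags(max_young_pct, reserve_pct, heap_waste_pct):
    """Return the G1 tuning flags from the text as a list of JVM options.

    The unlock flag is emitted first so that any experimental option that
    follows it is already enabled when the JVM parses the list.
    """
    return [
        "-XX:+UnlockExperimentalVMOptions",
        f"-XX:G1MaxNewSizePercent={max_young_pct}",
        f"-XX:G1ReservePercent={reserve_pct}",
        f"-XX:G1HeapWastePercent={heap_waste_pct}",
    ]


# Final values from the text: 20% max young gen, 40% free space, 2% heap waste.
print("\n".join(render_jvm_flags(20, 40, 2)))
```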


&lt;p&gt;For all of our clusters, we observed the following impact:&lt;/p&gt;


&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/OOfBVVQv5uNpRL-ar05SEf5VYhP9uDgPqSOdAxncrYtCe_vkwVPmHGsrbh4Y0emhU_a8oC5yhLqWsDb51bhLq8Diof8XWxxuDPGcuFryT79JID4BNDq1oqf8pq48YGeuZIqetI8fR66yQUX1eIr5wgw&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 7: Cumulative Old generation count per day for all clusters.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;div class=&quot;wp-block-image&quot;&gt;&lt;figure class=&quot;aligncenter&quot;&gt;&lt;img decoding=&quot;async&quot; src=&quot;https://lh7-us.googleusercontent.com/1_HSQd_ILxdScbjmI7hoQ7IpRxIxnuXlN4lgaNs87u5GhDJUjnyH_OydBRbKv_j3sOPrcAsRVOnwqn9d7J1Eeg8uxEmglD2XfP6_BnNX9XWGS73qz3_7vy4QFDLKaUlhB3e95BMp9G3JQ3-Mn6BQcDo&quot; alt=&quot;&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;figcaption class=&quot;wp-element-caption&quot;&gt;Figure 8: Internal errors per day on all clusters.&lt;/figcaption&gt;&lt;/figure&gt;&lt;/div&gt;

&lt;hr class=&quot;wp-block-separator has-alpha-channel-opacity&quot;&gt;


&lt;h1 class=&quot;wp-block-heading&quot; id=&quot;h-what-s-next&quot;&gt;What’s Next?&lt;/h1&gt;


&lt;p&gt;Most of the garbage collection tuning we have done so far has been on product-facing applications, and we haven’t paid close attention to our storage applications. We therefore plan to extend this tuning to the other solutions Uber provides. It should be an interesting learning experience, because storage applications use large heaps, which differs from what we normally tune. We’ll share our findings with the community once we have more data.&lt;/p&gt;


&lt;p&gt;The GC tuning done on Presto is an example of how improving garbage collection can improve a system’s overall performance and reliability. Our next focus will be to further fine-tune GC for Presto clusters, especially those on less powerful machines where we still experience some Full GCs, and to improve the system’s overall reliability.&lt;/p&gt;


&lt;p&gt;All the optimizations listed are specific to the Presto deployment at Uber and can’t be directly ported to other services. The flags are listed only to show what we ended up using in our tuning. We also plan to develop best practices and guidelines, based on general usage patterns, that can serve as a good starting point for all of Uber’s storage applications. This will help us improve the reliability and performance of all of our storage applications.&lt;/p&gt;
</description><link>https://www.uber.com/blog/uber-gc-tuning-for-improved-presto-reliability/</link><guid isPermaLink="false">https://www.uber.com/blog/uber-gc-tuning-for-improved-presto-reliability/</guid><pubDate>Thu, 11 Jan 2024 08:00:00 GMT</pubDate><author>Uber</author><category>Engineering</category><category>Data / ML</category></item></channel></rss>