Uber Engineering Blog

From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey

Uber — Thu, 02 May 2024 09:46:07 GMT

Introduction

Over the past few years, machine learning (ML) adoption and impact at Uber have accelerated across all business lines. Today, ML plays a key role in Uber’s business and is used for business-critical decisions such as ETA prediction, rider-driver matching, Eats homefeed ranking, and fraud detection.

As Uber’s centralized ML platform, Michelangelo has been instrumental in driving Uber’s ML evolution since it was first introduced in 2016. It offers a set of comprehensive features that cover the end-to-end ML lifecycle, empowering Uber’s ML practitioners to develop and productize high-quality ML applications at scale. Currently, approximately 400 active ML projects are managed on Michelangelo, with over 20K model training jobs monthly. There are more than 5K models in production, serving 10 million real-time predictions per second at peak.

As shown in Figure 1 below, ML developer experience is an important multiplier that enables developers to deliver real-world business impact. By leveraging Michelangelo, Uber’s ML use cases have grown from simple tree models to advanced deep learning models and, most recently, to Generative AI. In this blog, we present the evolution of Michelangelo over the past eight years, with a focus on the continuous enhancement of the ML developer experience at Uber.

Figure 1: ML Developer Experience is a multiplier for delivering ML business impact.

Journey of AI/ML @ Uber

Presently, Uber operates in over 10,000 cities spanning more than 70 countries, serving 25 million trips on the platform each day with 137 million monthly active users. ML is integrated into virtually every facet of Uber’s daily operation, and nearly every interaction within the Uber apps involves ML behind the scenes. Take the rider app as an example: when users try to log in, ML is used to detect fraud signals like possible account takeovers. Within the app, in many jurisdictions, ML is used to suggest destination auto-completions and to rank the search results. Once the destination is chosen, ML comes into play for a multitude of functions, including ETA computation, trip price calculation, rider-driver matching with safety measures in mind, and on-trip routing. After the trip is completed, ML aids in payment fraud detection and chargeback prevention, and also powers the customer service chatbot.

Figure 2: Real-time ML underpins Rider app user flow.

As you can see in Figure 2, real-time ML powers user flow in the rider app, and the same holds true for the Eats app (and many others), as illustrated in Figure 3 below.

Figure 3: Real-time ML underpins Eater app core user flow.

Reflecting on the evolution of ML at Uber, there are three distinct phases:

2016 – 2019: During this initial phase, Uber primarily employed predictive machine learning for tabular data use cases. Algorithms such as XGBoost were used for critical tasks like ETA predictions, risk assessment, and pricing. Uber also adopted deep learning (DL) in critical areas like 3D mapping and perception in self-driving cars, which required significant investments in GPU scheduling and distributed training frameworks like Horovod®.
2019 – 2023: The second phase witnessed a concerted push towards the adoption of DL and collaborative model development for high-impact ML projects. The emphasis was on model iteration as code within an ML monorepo and on supporting DL as a first-class citizen in Michelangelo. During this period, more than 60% of tier-1 models adopted DL in production, which boosted model performance significantly.
Starting in 2023: The third phase represents the latest development in the new wave of Generative AI, with a focus on improving Uber’s end-user experience and internal employee productivity (described in a previous blog).

Figure 4: Uber’s ML journey from 2016 to 2023.

Throughout this transformative journey, Michelangelo has been playing a pivotal role in advancing ML capabilities and empowering teams to build industry-leading ML applications.

Michelangelo 1.0 (2016 – 2019)

When Uber embarked on its ML journey back in 2015, applied scientists used Jupyter Notebooks™ to develop models, while engineers built bespoke pipelines to deploy those models to production. There was no system in place to build reliable and reproducible pipelines for creating and managing training and prediction workflows at scale, and no easy way to store or compare training experiment results. More importantly, there was no established path to deploying a model into production without creating a custom serving container.

In early 2016, Michelangelo was launched to standardize ML workflows via an end-to-end system that enabled ML developers across Uber to easily build and deploy ML models at scale. It started by addressing the challenges around scalable model training and deployment to production serving containers (Learn more). Then, a feature store named Palette was built to better manage and share feature pipelines across teams. It supported both batch and near-real-time feature computation use cases. Currently, Palette hosts more than 20,000 features that Uber teams can use out of the box to build robust ML models (Learn more).

Other key Michelangelo components released include, but are not limited to:

Gallery: Michelangelo’s model and ML metadata registry that provides a comprehensive search API for all types of ML entities. (Learn more)
Manifold: A model-agnostic visual debugging tool for ML at Uber. (Learn more)
PyML: A framework that sped up the cycle of prototyping, validating, and productionizing Python ML models. (Learn more)
Extend Michelangelo’s model representation for flexibility at scale. (Learn more)
Horovod for distributed training. (Learn more)

Michelangelo 2.0 (2019 – 2023)

The initial goal of Michelangelo was to bootstrap and democratize ML at Uber. By the end of 2019, most lines of business at Uber had integrated ML into their products. Subsequently, Michelangelo’s focus started shifting from “enabling ML everywhere” to “doubling down on high-impact ML projects” so that developers could uplevel the model performance and quality of these projects to drive higher business value for Uber. Given the complexity and significance of these projects, there was a demand for more advanced ML techniques, particularly DL, and many different roles (e.g., data scientists and engineers) were often required to collaborate and iterate on models faster, as shown in Figure 5. This posed several challenges for Michelangelo 1.0, as listed below.

Figure 5: ML lifecycle is iterative and collaborative with many different roles.

1. Lack of comprehensive ML quality definition and project tiering: Unlike microservices, which have well-defined quality standards and best practices, ML models at that time had no consistent way to measure the full spectrum of model quality. For example, many teams only measured offline model performance such as AUC and RMSE, but ignored other critical metrics like online model performance, freshness of training data, and model reproducibility. This resulted in little visibility of model performance, stale models in production, and poor dataset coverage.

Also, it is important to recognize that ML projects vary significantly in terms of business impact. The lack of a distinct ML tiering system led to a uniform approach in resource allocation, support, and managing outages, regardless of a project’s impact. This resulted in high-impact projects receiving inadequate investment or not being given the priority they deserved.

2. Insufficient support for DL models: Up until 2019, ML use cases at Uber were predominantly using tree-based models, which inherently did not favor adopting advanced techniques like custom loss functions, incremental training, and embeddings. Conversely, Uber had vast data suitable for training DL models, but the infrastructure and developer experience challenges hindered the progress in this direction. Many teams like Maps ETA and Rider incentive teams had to invest months in developing their own DL toolkits before successfully training their first version of DL models.

3. Inadequate support for collaborative model development: In the early days, most ML projects were small-scale, and were authored and iterated on by a single developer from inception to production. Hence, Michelangelo 1.0 was not optimized for highly collaborative model development. Collaboration in the Michelangelo 1.0 UI and in Jupyter Notebooks was difficult and often done via manual copying and merging, without version control or branching. In addition, there was no code review process for UI model config changes or notebook edits, and the absence of a centralized repository for ML code and configurations led to their dispersion across various sources. These gaps posed a significant threat to our engineering process and made large-scale model exploration across numerous ML projects arduous.

4. Fragmented ML tooling and developer experience: Since 2015, many ML tools other than Michelangelo have been built by different teams at Uber for a subset of the ML lifecycle and use cases, such as Data Science Workbench (DSW) from the Data team for managed Jupyter Notebooks, ML Explorer from the Marketplace team for ML workflow orchestration and automation, and uFlow/uScorer from the Risk team specifically for training and serving that team’s own models. There were also different ways to develop an ML model for different model types–e.g., Michelangelo UI for SparkML and XGBoost models, Jupyter Notebook for DL models, and PyML for custom Python-based models. Launching one ML project usually required constantly switching between such semi-isolated tools, which were built with different UI patterns and user flows, leading to a fragmented user experience and reduced productivity.

To address these challenges, Michelangelo 2.0 re-architected the fragmented ML platforms into a single coherent product with a unified UI and API for the end-to-end ML lifecycle. Michelangelo 2.0 has four user-facing themes: (1) model quality and project tiering, (2) model iteration as code via Canvas, (3) DL as a first-class platform citizen, and (4) unified ML developer experience via MA Studio.

Architectural Overview

Michelangelo 2.0 is centered around four pillars. At the very bottom, we are enabling an architecture that allows for plug-and-play platform components. Some of the components are built in-house, and others can be state-of-the-art commodity pieces from open source or third parties. On top is the development and production experience that caters to applied scientists and ML engineers. To improve model development velocity, we are streamlining the development experience and enabling technologies for collaborative, reusable development. We believe this approach will enable us to track and enforce compliance at the platform level. We are investing in production experiences, such as safe deployment of models and automatic model retraining, to make it easy to maintain and manage models at scale. Finally, we are focusing on the quality of the models and investing in tooling that measures this quality across all stages and improves it systematically.

Figure 6: High-level concepts of Michelangelo 2.0 Architecture.

Here are a few architectural design principles for Michelangelo 2.0:

Define project tiering, and focus on high-impact use cases to maximize Uber’s ML impact. Provide self-service to long-tail ML use cases so that they can leverage the power of the platform.
The majority of ML use cases can leverage Michelangelo’s core workflows and UI, while Michelangelo also enables more bespoke workflows needed for advanced use cases like deep learning.
Monolithic vs. plug-and-play. The architecture supports plug-and-play of different components, but the managed solution only supports a subset of them for the best user experience. Advanced use cases can bring their own components.
API/code-driven vs. UI-driven. Follow an API-first principle and leverage the UI for visualization and fast iteration. Support model iteration as code for version control and code reviews, including changes made in the UI.
Build vs. buy. Leverage best-of-class offerings from OSS or Cloud, or build in-house. OSS solutions may be prioritized over proprietary solutions. Be cautious about the cost of capacity for Cloud solutions.
Codify the best ML practices like safe model deployment, model retraining, and feature monitoring in the platform.

The system consists of three planes: the control plane, the offline data plane, and the online data plane. The control plane defines user-facing APIs and manages the lifecycle of all entities in the system. The offline data plane does the heavy lifting of big data processing, such as feature computation, model training and evaluation, and offline batch inference. The online data plane handles real-time model inference and feature serving, which are used by other microservices.

The control plane adopts the Kubernetes™ Operator design pattern for modularization and extensibility. The Michelangelo APIs also follow Kubernetes API conventions and standardize the operations on ML-related entities like Project, Pipeline, PipelineRun, Model, Revision, InferenceServer, and Deployment. By leveraging the Kubernetes API machinery, including the API server, etcd, and controller manager, all Michelangelo APIs can be accessed in a consistent manner, resulting in a more user-friendly and streamlined experience. In addition, the declarative API pattern is crucial for Michelangelo to support mutation by both the UI and code in a Git repo, as detailed later.
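
To make the declarative pattern concrete, here is a minimal sketch of a controller-style reconcile step in plain Python. It is not Michelangelo’s API: the `Deployment` fields, the `reconcile` function, and the revision names are all hypothetical, chosen only to show how a desired state (written by the UI or by code in Git) and an observed state converge.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical ML entity with a desired spec and an observed status."""
    name: str
    spec_revision: str          # desired model revision (declared by UI or Git)
    status_revision: str = ""   # revision currently serving (observed)


def reconcile(dep: Deployment, actions: list) -> bool:
    """One controller pass: move observed state toward desired state.

    Returns True when the entity was already in sync.
    """
    if dep.status_revision == dep.spec_revision:
        return True
    # In a real system this would trigger a rollout; here we only record it.
    actions.append(f"roll out {dep.spec_revision} for {dep.name}")
    dep.status_revision = dep.spec_revision
    return False


actions = []
dep = Deployment(name="eta-model", spec_revision="rev-2", status_revision="rev-1")
first = reconcile(dep, actions)   # out of sync: performs a rollout
second = reconcile(dep, actions)  # in sync: nothing left to do
```

Because callers only declare the desired revision, it does not matter whether the change came from a UI click or a merged commit; the same reconcile logic applies to both.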

The offline data plane consists of a set of ML pipelines, including training, scoring, and evaluation, which are defined as directed acyclic graphs (DAGs) of steps. The ML pipelines support intermediate checkpoints and can resume between steps to avoid re-executing previous steps. Steps are executed on top of frameworks like Ray™ or Spark™. The online data plane manages RPC services and stream processing jobs that serve online prediction, online feature access, and near-real-time feature computation.
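
As an illustration of a pipeline defined as a DAG of steps with checkpoint-based resume, the sketch below uses only the Python standard library. The `run_pipeline` function, the step names, and the in-memory `checkpoints` dictionary are assumptions made for this example; a production pipeline would persist checkpoints to durable storage and run steps on Ray or Spark.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, checkpoints):
    """Run steps in dependency order, skipping any step already checkpointed.

    steps: name -> callable taking a dict of upstream outputs
    deps: name -> set of upstream step names
    checkpoints: name -> saved output (mutated in place)
    Returns the list of steps actually executed in this run.
    """
    executed = []
    for name in TopologicalSorter(deps).static_order():
        if name in checkpoints:
            continue  # resume: reuse the stored output instead of re-running
        inputs = {up: checkpoints[up] for up in deps.get(name, ())}
        checkpoints[name] = steps[name](inputs)
        executed.append(name)
    return executed


steps = {
    "load": lambda inp: [1.0, 2.0, 3.0],
    "train": lambda inp: sum(inp["load"]) / len(inp["load"]),
    "evaluate": lambda inp: abs(inp["train"] - 2.0),
}
deps = {"load": set(), "train": {"load"}, "evaluate": {"train"}}

# Pretend "load" finished in an earlier run and its output was checkpointed.
checkpoints = {"load": [1.0, 2.0, 3.0]}
executed = run_pipeline(steps, deps, checkpoints)
```

Only `train` and `evaluate` run here, because the output of `load` is already available from the checkpoint.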

Figure 7 shows the detailed design of the Michelangelo 2.0 system, which reduced the engineering complexity as well as simplified the external dependencies on other infrastructure components.

Figure 7: Detailed system design of Michelangelo 2.0 including offline, online and control planes.

Model Quality and Project Tiering

The development and maintenance of a production-ready ML system are intricate, involving numerous stages in the model lifecycle and a complex supporting infrastructure. Typically, an ML model undergoes phases like feature engineering, training, evaluation, and serving. The lack of comprehensive ML quality measurement leads to limited visibility for ML developers regarding various quality dimensions at different stages of a model’s lifecycle. Moreover, this gap hinders organizational leaders from making fully informed decisions regarding the quality and impact of ML projects.

Figure 8: Example ML quality dimensions (in yellow) in a typical ML system.

To address these gaps, we launched Model Excellence Score (MES), a framework for measuring and monitoring key dimensions and metrics at each stage of a model’s lifecycle, such as training model accuracy, prediction accuracy, model freshness, and prediction feature quality, to ensure a holistic and rigorous approach to ML deployment at scale. This framework leverages the same Service Level Agreement (SLA) concept that is commonly used by site reliability engineers (SREs) and DevOps professionals to manage microservice reliability in production environments. By integrating with the SLA toolset, MES establishes a standard for measuring and ensuring ML model quality at Uber. Additionally, MES tracks and visualizes the compliance and quality of models, thereby providing a clearer and more comprehensive view of ML initiatives across the organization. See MES blog for more details.
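
The SLA idea can be sketched as a set of per-dimension thresholds checked against observed metrics. The code below is illustrative only: the metric names, the thresholds, and the simple "fraction of SLAs met" score are assumptions for this example and do not represent how MES is actually computed.

```python
def evaluate_slas(metrics, slas):
    """Return per-dimension compliance and the fraction of SLAs met.

    slas maps a metric name to (kind, threshold), where kind is
    "min" (value must be >= threshold) or "max" (value must be <= threshold).
    """
    results = {}
    for name, (kind, threshold) in slas.items():
        value = metrics[name]
        results[name] = value >= threshold if kind == "min" else value <= threshold
    score = sum(results.values()) / len(results)
    return results, score


# Hypothetical SLAs covering training, freshness, feature quality, and serving.
slas = {
    "offline_auc": ("min", 0.80),
    "model_age_days": ("max", 30),
    "feature_null_rate": ("max", 0.05),
    "online_auc": ("min", 0.75),
}
metrics = {
    "offline_auc": 0.86,
    "model_age_days": 45,       # stale model: violates the freshness SLA
    "feature_null_rate": 0.01,
    "online_auc": 0.78,
}
results, score = evaluate_slas(metrics, slas)
```

In this example three of the four dimensions are compliant, and the freshness violation is surfaced by name rather than being hidden behind a good offline AUC.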

To differentiate high-impact and long-tail use cases, we introduced a well-defined ML project tiering scheme. This scheme consists of four tiers, with tier 1 being the highest. Tier-1 projects consist of models serving critical functions within core trip and core eater flows, such as ETA calculations, safety, and fraud detection. Only models directly influencing core business operations can qualify for tier-1 status. Conversely, tier-4 projects typically encompass experimental and exploratory use cases with limited direct business impact. This tiering scheme enabled us to make informed decisions regarding outage handling, resource investment, best practice enforcement, and compliance matters, among other considerations. It ensured that the level of attention and resources devoted to each project was commensurate with its relative priority and impact.

Model iterations as code

In 2020, we launched Project Canvas to enhance ML developer productivity, foster seamless team collaboration, and elevate the overall quality of ML applications. The project aimed to apply software engineering best practices to the ML development lifecycle: it enforced version control, harnessed the power of Docker containers, integrated CI/CD tools, and expedited model development by introducing standardized ML application frameworks. Key components of Canvas included:

ML Application Framework (MAF): Predefined, but customizable ML workflow templates to provide a code and configuration-driven way for ML development, tailor-made for intricate ML techniques such as DL.
ML Monorepo: A centralized repository that stores all ML development sources of truth as code, with robust version control capabilities.
ML Dependency Management: Provides software dependency management using Bazel and Docker builds. Each ML project has its own customized Docker images. In addition to software dependencies, the model training and serving code is packaged into an immutable Docker image for production model retraining and serving.
ML Development Environment: Provides consistent local development and remote production execution environments, so that ML developers can test and debug models locally before running them in a remote production environment, for fast model iteration.
ML Continuous Integration / Continuous Delivery: Runs continuous integration against the master branch and, via various tests and validations, automates the deployment to production of ML models landed on the master branch of the ML monorepo.
ML Artifact Management: Provides support for artifact and lineage tracking. Artifacts are ML objects such as models, datasets, and evaluation reports, plus their corresponding metadata. The objects are stored in distributed storage, and the metadata is fully indexed and searchable.
MA Assistant (MAA): Michelangelo’s AutoML solution for automatic model architecture search and feature exploration/pruning.

Figure 9: Canvas: Streamlining end-to-end ML developer experience.

Canvas also streamlined ML dependency management by leveraging Bazel and Docker builds. Each ML project had its own custom Docker images, and the model training and serving code was packaged into an immutable Docker image for production model retraining and serving. Moreover, Canvas enabled consistent local and remote development environments, so ML developers could test and debug models locally before running them in a remote production environment, for fast model iteration.

Deep Learning as a first-class platform citizen

With tree-based models, adopting advanced techniques such as custom loss functions, incremental training, and embeddings posed significant challenges. DL is flexible enough to address them. Furthermore, DL often excels as datasets grow larger, as it can leverage more data to learn more complex representations.

Before 2019, most of the DL models at Uber were for self-driving cars (e.g., 3D mapping, perception), computer vision (e.g., driver face recognition), and natural language processing (e.g., customer obsession) use cases. However, there were very few deep learning models for the core business, especially for tabular data use cases. One important reason was the lack of end-to-end deep learning support in Michelangelo 1.0. Unlike tree-based models, deep learning models often require much more sophisticated ML platform support, from feature transformation and model training to model serving and GPU resource management. The rest of this section gives an overview of our investment in deep learning support in Michelangelo 2.0.

Feature transformation

Michelangelo 1.0 implemented a DSL for feature transformations, such as normalization and bucketization, that are used in both the model training and serving paths. The transformation is bundled together with a model as a Spark PipelineModel, which eliminates a source of training-serving skew. However, the DSL transformation is implemented as a Spark transformer and cannot run on GPUs, which DL models need for low-latency serving. In Michelangelo 2.0, we implemented a new DL-native transformation solution that allows users to transform their features using Keras or PyTorch operators, and gives advanced users the flexibility to define customized feature transformations in Python code. Similar to TensorFlow Transform, the transform graph is combined with the model inference graph, in either TensorFlow or TorchScript, for low-latency serving on GPUs.
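
The core idea, packaging transformations and the model as one artifact so that training and serving cannot drift apart, can be shown in a framework-free sketch. The `Normalize`, `Bucketize`, and `BundledModel` classes below are hypothetical stand-ins written for this example; they are not Keras, PyTorch, or Michelangelo operators, and the tiny logistic model exists only to give the transforms something to feed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Normalize:
    mean: float
    std: float

    def __call__(self, x: float) -> float:
        return (x - self.mean) / self.std


@dataclass(frozen=True)
class Bucketize:
    boundaries: tuple

    def __call__(self, x: float) -> int:
        # Bucket index = number of boundaries that x meets or exceeds.
        return sum(1 for b in self.boundaries if x >= b)


class BundledModel:
    """Transforms and predictor packaged as a single servable artifact."""

    def __init__(self, transforms, weights, bias):
        self.transforms = transforms  # feature name -> transform
        self.weights = weights        # feature name -> weight
        self.bias = bias

    def predict(self, raw: dict) -> float:
        # Serving takes raw features; the same transforms used in training
        # run inside the artifact, so there is no separate code path to skew.
        z = self.bias
        for name, transform in self.transforms.items():
            z += self.weights[name] * transform(raw[name])
        return 1.0 / (1.0 + math.exp(-z))


model = BundledModel(
    transforms={
        "distance_km": Normalize(mean=5.0, std=2.0),
        "hour": Bucketize(boundaries=(6, 12, 18)),
    },
    weights={"distance_km": 1.0, "hour": 0.0},
    bias=0.0,
)
p = model.predict({"distance_km": 5.0, "hour": 13})
```

In a DL framework the same bundling is achieved by exporting the transform operators and the network as one graph, which is what allows the whole path to run on a GPU.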

Model training

Michelangelo 2.0 supports both TensorFlow and PyTorch frameworks for large-scale DL model training by leveraging our distributed training framework Horovod. In addition, we have made the following improvements for better scalability, fault tolerance, and efficiency.

Distributed GPU training and tuning on Ray. (Learn more). Historically, the model training in Michelangelo was running on top of Spark. However, DL presented new challenges on Spark such as a lack of GPU executors, mini-batch shuffle, and all-reduce. Horovod on Spark wrapped the DL training using Spark estimator syntax and provided easy integration to the training pipeline. However, it also introduced many operational complexities like separate cluster jobs, lifecycle management, and failure scenarios. In Michelangelo 2.0, we have replaced Spark-based XGBoost and DL trainers with Ray-based trainers for better scalability and reliability. We also switched from an in-house hyperparameter tuning solution to RayTune.
Elastic Horovod with auto-scaling and fault tolerance. (Learn more). Elastic Horovod allows distributed training that scales the number of workers dynamically throughout the training process. Jobs can now continue training with minimal interruption when machines come and go from the job.
Resource efficient incremental training. One advantage of DL is the ability to incrementally train a model with additional datasets without training from scratch. This significantly improves the resource efficiency for production retrains as well as increases the dataset coverage for better model accuracy.
Declarative DL training pipeline in Canvas. DL models require custom model code, loss functions, and more. In Canvas, we designed the training pipelines to be declarative as well as extensible, so users can plug in their custom model code such as estimators, optimizers, and loss functions, as shown in Figure 9.

Figure 9: Example training pipeline in Canvas for a deep learning model.
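
One common way to make a declarative pipeline extensible is a registry that maps names in the configuration to user-supplied code. The sketch below shows that pattern for custom loss functions; the config keys, the `register_loss` decorator, and the `asymmetric` loss are hypothetical and are not taken from Canvas.

```python
LOSSES = {}


def register_loss(name):
    """Decorator that makes a user-defined loss addressable from config."""
    def wrap(fn):
        LOSSES[name] = fn
        return fn
    return wrap


@register_loss("mse")
def mse(pred, target):
    return (pred - target) ** 2


@register_loss("asymmetric")
def asymmetric(pred, target, under_penalty=3.0):
    """Custom loss: penalize under-prediction more than over-prediction."""
    err = pred - target
    return under_penalty * err ** 2 if err < 0 else err ** 2


def mean_loss(config, preds, targets):
    """Resolve the loss named in the config and average it over a batch."""
    loss_fn = LOSSES[config["trainer"]["loss"]]
    return sum(loss_fn(p, t) for p, t in zip(preds, targets)) / len(preds)


# The pipeline is described as data; only the loss name points at custom code.
config = {"trainer": {"framework": "pytorch", "loss": "asymmetric", "epochs": 5}}
custom_value = mean_loss(config, preds=[1.0, 3.0], targets=[2.0, 2.0])

baseline = {"trainer": {"framework": "pytorch", "loss": "mse", "epochs": 5}}
baseline_value = mean_loss(baseline, preds=[1.0, 3.0], targets=[2.0, 2.0])
```

Switching losses is then a one-line config change that can go through version control and code review like any other edit.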

Model serving

Most of Uber’s tier-1 ML projects that were adopting DL, such as maps ETA and Eats homefeed ranking, are very sensitive to serving latency. In addition, model serving has to support both the TensorFlow and PyTorch DL frameworks while abstracting the framework-level details away from our users. Historically, Neuropod has been the default DL serving engine in Michelangelo. However, it lacks continuous community support and is being deprecated. In Michelangelo 2.0, we integrated Triton as a next-generation model serving engine in our Online Prediction Service (OPS) as a new Spark transformer. Triton is an open-source inference server developed by Nvidia that supports multiple frameworks, including TensorFlow, PyTorch, Python, and XGBoost. It is highly optimized for low-latency serving on GPUs.

GPU resource management

Both DL training and serving require large-scale GPU resources. Uber today manages more than five thousand GPUs across both on-premise data centers and Cloud providers like OCI and GCP. Those GPUs are spread across multiple regions and many zones and clusters. The Compute clusters are in the process of migrating from Peloton / Mesos to Kubernetes. To maximize resource utilization, Uber has invested in elastic CPU and GPU resource sharing across different teams, so that each team can opportunistically use other teams’ idle resources. On top of the Compute clusters, we built a job federation layer across multiple Kubernetes clusters to hide the region, zone, and cluster details for better job portability and easy Cloud migration. The job federation layer leverages the same design pattern as Kubernetes operators and is implemented as a job CRD controller in Michelangelo’s unified API framework, as shown in Figure 7. Currently, the job controller supports both Spark and Ray jobs.
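
A federation layer of this kind lets a job state only what it needs, and leaves the choice of cluster to the platform. The following sketch shows one possible placement policy; the `Cluster` fields, the cluster names, and the best-fit rule are assumptions for illustration, not a description of Uber’s scheduler.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    region: str
    free_gpus: int


def place_job(clusters, gpus_needed):
    """Pick a cluster for a job without the caller naming a region or zone.

    Policy (an assumption): best fit, i.e. the cluster with the fewest free
    GPUs that still satisfies the request, to limit fragmentation.
    Returns the chosen cluster name, or None if nothing fits.
    """
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda c: c.free_gpus)
    chosen.free_gpus -= gpus_needed  # reserve capacity for the placed job
    return chosen.name


clusters = [
    Cluster("onprem-a", "region-1", free_gpus=4),
    Cluster("cloud-b", "region-2", free_gpus=16),
    Cluster("onprem-c", "region-1", free_gpus=8),
]
first = place_job(clusters, gpus_needed=8)    # exact fit on onprem-c
second = place_job(clusters, gpus_needed=8)   # onprem-c is full, falls to cloud-b
third = place_job(clusters, gpus_needed=32)   # no cluster is large enough
```

Because the caller never references a region or cluster, the same job definition keeps working when capacity moves between on-premise and Cloud.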

With end-to-end support for DL in Michelangelo 2.0, Uber has significantly increased DL adoption across different business lines. In the last few years, DL adoption in tier-1 projects has increased from almost zero to more than 60%. For example, the DeepETA model has more than one hundred million parameters and was trained on more than one billion trips.

MA Studio – One unified Web UI tool for everything ML @ Uber

To address the challenges in the ML developer experience mentioned above, Michelangelo (MA) Studio was developed to unify existing Michelangelo offerings and newly built platform capabilities into a single user journey, with a completely redesigned UI and UX. To improve ML developer productivity, MA Studio provides a simplified user flow that covers every step of the ML journey in one place: feature and data preparation, model training, deployment, production performance monitoring, and model CI/CD.

Figure 10: MA Studio project landing page covering the end-to-end ML development life-cycle.

MA Studio boasts an array of additional advantages:

Version control and code review: All ML-related code and configurations are version controlled, and all changes go through the code review process, including models created from UI.
Modernized model deployment stack: Safe and incremental zonal rollout, automatic rollback triggers, and production runtime validation.
Built-in and unified ML observability toolkit: Model performance monitoring, feature monitoring, online/offline feature consistency check, and MES.
Unified ML entity lifecycle management: Users benefit from an intuitive UI and well-structured user flows for managing all ML entities, from models and pipelines to datasets and reports.
Enhanced debugging capabilities: MA Studio amplifies debugging capabilities and accelerates recovery for ML pipeline failures.
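
The safe, incremental zonal rollout with automatic rollback mentioned in the list above can be sketched as a simple loop. This is a hedged illustration only: the `zonal_rollout` function, the zone names, and the health check are hypothetical, and a real deployment stack would also handle traffic shifting, bake times, and runtime validation.

```python
def zonal_rollout(zones, deploy, is_healthy):
    """Deploy zone by zone; on the first unhealthy zone, roll back everything.

    Returns (succeeded, zones_left_on_new_version).
    """
    done = []
    for zone in zones:
        deploy(zone, "new")
        done.append(zone)
        if not is_healthy(zone):
            # Automatic rollback trigger: revert every zone touched so far.
            for z in reversed(done):
                deploy(z, "old")
            return False, []
    return True, done


state = {}


def deploy(zone, version):
    state[zone] = version


zones = ["zone-a", "zone-b", "zone-c"]
unhealthy = {"zone-b"}  # simulate a regression that only shows up in zone-b
ok, on_new = zonal_rollout(zones, deploy, lambda z: z not in unhealthy)
```

The failure in `zone-b` stops the rollout before `zone-c` is ever touched, which is the property that limits the blast radius of a bad model.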

Figure 11: MA Studio and Canvas for standard and advanced ML use cases.

For any ML need at Uber, you only need two tools: Canvas and MA Studio. MA Studio’s user-friendly UI covers standard ML workflows, such as XGBoost model training and the standard model retraining process, without writing any code. For more sophisticated scenarios, such as DL training or customized retraining flows, Canvas is the go-to tool. Regardless of whether you’ve constructed the pipelines through Canvas or the UI, you can execute and manage these pipelines, deploy trained models, and monitor and debug model performance, all from the MA Studio UI. Significantly, all model code and pertinent configurations are now subject to version control, and any changes undergo code review, which drastically improves the quality of ML applications in production at Uber.

Generative AI (2023 – now)

Recent advancements in generative AI, particularly in large language models (LLMs), have the capacity to radically transform our interactions with machines via natural language. Several teams at Uber are actively investigating the use of LLMs to boost internal productivity with assistants, streamline business operations with automation, and improve the end-user product with magical user experiences, while addressing issues associated with the use of LLMs. Figure 12 shows the potential value of those three categories of generative AI use cases at Uber. Learn more.

Figure 12: Three categories of generative AI use cases at Uber.

To develop generative AI applications, teams need access to both external LLMs through third-party APIs and internally hosted open-source LLMs. External models have superior performance in tasks requiring general knowledge and intricate reasoning. Open-source models, on the other hand, can be fine-tuned on our wealth of proprietary data to achieve high levels of accuracy and performance on Uber-centric tasks, at a fraction of the cost and with lower latency. These fine-tuned open-source models are hosted in-house.

Hence, we developed the Gen AI Gateway to provide a unified interface for teams to access both external LLMs and in-house hosted LLMs in a manner adhering to security standards and safeguarding privacy. Some of the Gen AI Gateway capabilities include:

Logging and auditing: Ensuring comprehensive tracking and accountability.
Cost guardrails and attribution: Managing expenses while attributing usage, and alerting on overuse.
Safety and policy guardrails: Ensuring LLM usage complies with our internal guidelines.
Personally identifiable information (PII) redaction: Identifying and categorizing personal data, and redacting it before sending the input to external LLMs.
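
As a rough illustration of redaction before a prompt leaves the company boundary, the sketch below replaces matches with typed placeholders and reports which categories were found. It is a toy under stated assumptions: the two regular expressions, the placeholder format, and the `redact` function are invented for this example, and real PII detection needs far broader coverage than simple patterns can provide.

```python
import re

# Hypothetical, deliberately simple patterns; not a production PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace matches with typed placeholders and report what was found."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<{label}>", text)
        if count:
            found[label] = count
    return text, found


prompt = "Rider jane.doe@example.com called from +1 415-555-0100 about a refund."
clean, found = redact(prompt)
# Only `clean` would be forwarded to an external LLM; `found` can feed
# logging and auditing without storing the raw values.
```

Keeping the category in the placeholder preserves enough context for the model to respond sensibly while the raw value never leaves the gateway.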

To accelerate the development of generative AI applications at Uber, we have extended Michelangelo to support full LLMOps capabilities, such as fine-tuning data preparation, prompt engineering, LLM fine-tuning and evaluation, LLM deployment and serving, and production performance monitoring. Some of the key components include:

The Model Catalog features a collection of pre-built and ready-to-use LLMs, accessible via third-party APIs (e.g., GPT-4, Google PaLM) or hosted in-house on Michelangelo as open-source LLMs (e.g., Llama 2). Users can explore extensive information about these LLMs within the catalog and initiate various workflows, such as fine-tuning models in MA Studio or deploying models to online serving environments. The catalog offers a wide selection of pre-trained models, enhancing the platform’s versatility.
The LLM Evaluation Framework enables users to compare LLMs across different approaches (e.g., in-house vs. 3P with prompts vs. 3P fine-tuned), and to evaluate improvements across iterations of prompts and models.
Prompt Engineering Toolkit allows users to create and test prompts, validate the output, and save prompt templates in a centralized repository, with full version control and code review process.

+ + +

To enable cost-effective LLM fine-tuning and low-latency LLM serving, we’ve implemented several significant enhancements to Michelangelo training and serving stack:

+ + +

Integrating with Hugging Face: We implemented a Ray-based trainer for LLMs, utilizing the open source LLMs available on the Hugging Face Hub and associated libraries like PEFT. Fine-tuned LLMs and associated metadata are stored in Uber’s model repository, which is accessible from the model inference infrastructure.
Enabling Model Parallelism: Michelangelo previously did not support model parallelism for training DL models. This limitation constrained the size of trainable models to the available GPU memory, allowing, for instance, a theoretical maximum of 4 billion parameters on a 16 GB GPU. In the updated LLM training framework, we’ve integrated DeepSpeed to enable model parallelism. This removes the GPU memory limitation and allows for training larger DL models.
Elastic GPU Resource Management: We’ve provisioned Ray clusters on GPUs with the Michelangelo job controller. This enables training LLMs on the most powerful GPUs available on-premises. Furthermore, this integration sets the stage for future extensions using cloud GPUs, enhancing scalability and flexibility.
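
The 4-billion-parameter figure mentioned above follows from simple arithmetic if we assume 4 bytes per parameter (fp32) and count only the weights. The sketch below makes that assumption explicit and shows how splitting weights across GPUs raises the ceiling. The function name and the even-sharding model are illustrative assumptions, not a description of how DeepSpeed partitions a model.

```python
def max_params(gpu_memory_gb: float, bytes_per_param: int = 4, num_gpus: int = 1) -> float:
    """Upper bound on parameter count when weights are split evenly across GPUs.

    Counts only the weights. Gradients, optimizer state, and activations
    need additional memory, so practical limits are lower.
    """
    total_bytes = gpu_memory_gb * 1e9 * num_gpus
    return total_bytes / bytes_per_param


one_gpu = max_params(16)                 # 16 GB / 4 bytes per fp32 weight
eight_gpus = max_params(16, num_gpus=8)  # same GPUs, weights sharded eight ways

print(f"{one_gpu / 1e9:.0f}B parameters on one 16 GB GPU (fp32 weights only)")
print(f"{eight_gpus / 1e9:.0f}B parameters sharded across eight such GPUs")
```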

+ + +

Leveraging these platform capabilities offered by Michelangelo, teams at Uber are fervently developing LLM-powered applications. We look forward to sharing our advancements in the productionization of LLMs soon.

+ + +

Conclusion

+ + +

ML has evolved into a fundamental driver across critical business areas at Uber. This blog delves into the eight-year transformative journey of Uber’s ML platform, Michelangelo, emphasizing significant enhancements in the ML developer experience. This journey unfolded in three distinct phases: the foundational phase of predictive ML for tabular data from 2016 to 2019, a progressive shift to deep learning between 2019 and 2023, and the recent venture into generative AI starting in 2023.

+ + +

There have been critical lessons learned for building a large-scale, end-to-end ML platform at such a complexity level, supporting ML use cases at Uber’s scale. Key takeaways include:

+ + +

Instituting a centralized ML platform, as opposed to having individual product teams build their own ML infrastructure, can significantly enhance ML development efficiency within a medium- or large-sized company. The ideal ML organizational structure comprises a centralized ML platform team, complemented by dedicated data scientists and ML engineers embedded within each product team.
Providing both UI-based and code/configuration-driven user flows in a unified manner is crucial for delivering a seamless ML dev experience, especially for large organizations where ML developers’ preferences for dev tools vary widely across different cohorts.
The strategy of offering a high-level abstraction layer with predefined workflow templates and configurations for most users, while allowing advanced power users to directly access low-level infrastructure components to build customized pipelines and templates, has proven effective.
Designing the platform architecture in a modular manner, so that each component can be built with a plug-and-play approach, allows rapid adoption of state-of-the-art technologies from open source, third-party vendors, or in-house development.
While Deep Learning proves powerful in solving complex ML problems, the challenge lies in supporting large-scale DL infrastructure and maintaining the performance of these models. Use DL only when its advantages align with the specific requirements. Uber’s experience has shown that in several cases, XGBoost outperforms DL in both performance and cost.
Not all ML projects are created equal. Having a clear ML tiering system can effectively guide the allocation of resources and support.

+ + +

Michelangelo’s mission is to provide Uber’s ML developers with best-in-class ML capabilities and tools so that they can rapidly build, deploy, and iterate high-quality ML applications at scale. As the AI platform team, we provide in-depth ML expertise, drive standardization and innovation of ML technologies, build trust and collaborate with our partner teams, and cultivate a vibrant ML culture, so that ML is embraced and leveraged to its fullest potential. We are unwavering in our commitment to this mission, and we are incredibly enthusiastic about the promising future ahead of us.

+ + +

If you are interested in joining us on this exciting venture, please check our job website for openings. Additionally, we look forward to collaborating with other teams in the AI/ML space to build a strong ML community and collectively accelerate the advancement of AI/ML technologies.

+ + +

Apache®, Apache Spark, Spark, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

+ + +

Horovod and Kubernetes are either registered trademarks or trademarks of the Linux Foundation® in the United States and/or other countries. No endorsement by the Linux Foundation® is implied by the use of these marks.

+ + +

Ray is either a registered trademark or trademark of Anyscale, Inc in the United States and/or other countries.

DragonCrawl: Generative AI for High-quality Mobile Testing

Uber — Tue, 23 Apr 2024 05:00:00 GMT

Introduction

+ + +

The Developer Platform team at Uber is consistently developing new and innovative ideas to enhance the developer’s experience and strengthen the quality of our apps. Quality and testing go hand in hand, and in 2023 we took on a new and exciting challenge to change how we test our mobile applications, with a focus on machine learning (ML). Specifically, we are training models to test our applications just like real humans would.

+ + +

Mobile testing remains an unresolved challenge, especially at our scale, encompassing thousands of developers and over 3,000 simultaneous experiments. Manual testing is usually carried out, but with high overhead, it cannot be done extensively for every minor code alteration. While test scripts can offer better scalability, they are also not immune to frequent disruptions caused by minor updates, such as new pop-ups and changes in buttons. All of these changes, no matter how minor, require recurring manual updates to the test scripts. Consequently, engineers working on this invest 30–40% of their time on maintenance. Furthermore, the substantial maintenance costs of these tests significantly hinder their adaptability and reusability across diverse cities and languages (imagine having to hire manual testers or mobile engineers for the 50+ languages that we operate in!), which makes it really difficult for us to efficiently scale testing and ensure Uber operates with high quality globally.

+ + +

To solve these problems, we created DragonCrawl, a system that uses large language models (LLMs) to execute mobile tests with the intuition of a human. It decides what actions to take based on the screen it sees and independently adapts to UI changes, just like a real human would.

+ + +

Of course, new innovations also come with new bugs, challenges, and setbacks, but it was worth it. We did not give up on our mission to bring code-free testing to the Uber apps, and towards the end of 2023, we launched DragonCrawl. Since then, we have been testing some of our most important flows with high stability, across different cities and languages, and without having to maintain them. Scaling mobile testing and ensuring quality across so many languages and cities went from humanly impossible, to possible with the help of DragonCrawl. In the three months since launching DragonCrawl, we blocked ten high-priority bugs from impacting customers while saving thousands of developer hours and reducing test maintenance costs.

+ + +

This blog gives a quick introduction to large language models, then takes a deep dive into our architecture, challenges, and results. We will close by touching a little on what is in store for DragonCrawl.

+ + +

What Are Large Language Models?

+ + +

Large language models (LLMs) are a transformative development in the field of artificial intelligence, specifically within natural language processing (NLP). Essentially, LLMs are advanced models designed to understand, interpret, generate, and engage with human language in a way that is both meaningful and contextually relevant. These models are trained on vast datasets consisting of text from a wide array of sources, enabling them to learn the nuances, idioms, and syntax of natural language. One of the most critical aspects of LLMs is their ability to generate coherent and contextually relevant text based on input prompts. This capability is not limited to simple text generation; it extends to complex tasks like answering questions, translating languages, summarizing documents, and even creating content like poems or code. The underlying technology of LLMs typically involves neural network architectures, such as transformers, which are adept at handling sequential data and can capture long-range dependencies in text. This makes them particularly effective for tasks that require an understanding of context over longer stretches of text. Modern large language models are trained on many languages, which means that we can use them and get reasonable outputs in other languages.

+ + +

Why Did We Choose Large Language Models for Mobile Testing?

+ + +

We realized that we could formulate mobile testing as a language generation problem. At the end of the day, mobile tests are sequences of steps, which may encounter obstacles and/or course corrections due to changes in apps, devices, etc. To successfully get through those obstacles and complete a test, we need context and goals, and the simplest way for us to provide these to an automated system is through natural language. We provide DragonCrawl with the text representation of the current screen, along with the goals of the test we want to execute, and then we ask it what we should do. Given the context, it chooses what UI element to interact with, and how to interact with it. And because these models have been pre-trained and proven resilient in languages besides English, we can ask DragonCrawl these questions with text in other languages.
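
As a rough illustration of this formulation, the sketch below ranks the UI elements on a screen against a natural-language goal. It is a toy under stated assumptions: the bag-of-words `embed` function stands in for a real sentence encoder, and the element fields (`action`, `name`, `label`) and function names are hypothetical, since the post does not describe DragonCrawl's actual screen representation or prompt format.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real sentence encoder."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_actions(goal: str, screen_elements: list[dict]) -> list[dict]:
    """Order candidate UI interactions from best to worst match for the goal."""
    query = embed(goal)
    scored = [
        (cosine(query, embed(f"{e['action']} {e['name']} {e['label']}")), e)
        for e in screen_elements
    ]
    return [e for _, e in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical text representation of a screen.
screen = [
    {"action": "touch", "name": "schedule_ride_button", "label": "Schedule a ride for later"},
    {"action": "touch", "name": "request_trip_button", "label": "Request trip now"},
    {"action": "swipe", "name": "promo_carousel", "label": "Offers"},
]
best = rank_actions("request a trip", screen)[0]
print(best["action"], best["name"])
```

Because the function returns a full ranking rather than a single answer, a caller can fall back to the next candidate when the top one turns out to be unusable.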

+ + +

Fig 1: High-level overview of DragonCrawl. The image of the Dragon was generated by OpenAI’s DALL·E

+ +

+ + +

Modeling

+ + +

MPNet, or “Masked and Permuted Pre-training for Language Understanding,” is an advanced approach in natural language processing that combines masking and permuting strategies in pre-training language models. It works by masking some words and altering the order of others in the input text, enabling the model to learn not only the prediction of masked words, but also the broader context and syntax of the language. This dual-task approach allows MPNet to gain a deeper understanding of language semantics, surpassing traditional models that focus solely on masking or permutation. Once trained on large datasets, MPNet can be fine-tuned for a variety of NLP tasks, offering enhanced performance in understanding and generating language due to its comprehensive grasp of both word-level and sentence-level contexts.

+ + +

Fig 2: Transformer layers in DragonCrawl’s model.

+ +

+ + +

Evaluation

+ + +

In the vast and intricate landscape of language, words are not mere strings of letters; they are rich with meaning, context, and subtle nuances, and that is where embeddings come into play. Embeddings are like multi-dimensional maps, where each word finds its unique place, not just based on its own identity but also in relation to the words around it. By obtaining high-quality embeddings, we ensure that our model perceives language not as a random assortment of words, but as a coherent, interconnected fabric of ideas and meanings.

+ + +

We framed the evaluation as a retrieval task because we ultimately want DragonCrawl to mimic the way humans retrieve information and make decisions. Just like how we put some effort when choosing the right book in a library, DragonCrawl makes an effort to choose the right action to take to accomplish its goals. The precision@N metric, akin to finding the right book when you can only take a handful of books home, shows us the model’s ability to not just retrieve, but to pinpoint the best option in a sea of possibilities. By measuring and improving embedding quality through precision@N, we ensure that DragonCrawl does not just understand language, but comprehends it with a discerning, almost human-like acuity.
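
The post does not spell out the exact formula it uses, so the sketch below assumes the standard information-retrieval definition of precision@N: the fraction of the top N retrieved items that are relevant, averaged over queries. The function names and the toy data are illustrative.

```python
def precision_at_n(ranked: list[str], relevant: set[str], n: int) -> float:
    """Fraction of the top-n ranked items that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked[:n]
    return sum(item in relevant for item in top) / n


def mean_precision_at_n(queries: list[tuple[list[str], set[str]]], n: int) -> float:
    """Average precision@n over (ranked results, relevant set) pairs."""
    return sum(precision_at_n(ranked, relevant, n) for ranked, relevant in queries) / len(queries)


# Toy example: actions ranked by the model vs. the actions a human labeled acceptable.
ranked_actions = ["touch_request", "swipe_feed", "touch_confirm"]
acceptable = {"touch_request", "touch_confirm"}

for n in (1, 2, 3):
    print(n, round(precision_at_n(ranked_actions, acceptable, n), 3))
```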

+ + +

To choose the right model for DragonCrawl, we tuned and evaluated several models. The table below summarizes our findings:

+ + +

Model	Precision@1	Precision@2	Precision@3	Parameters	Embedding size
MPNet (base)	0.9723	0.9623	0.9423	110M	768
MPNet (large)	0.9726	0.9527	0.9441	340M	768
T5	0.97	0.9547	0.9338	11B	3584
RoBERTa	0.9689	0.9512	0.9464	82M	768
T5 (not tuned)	0.9231	0.9213	0.9213	11B	3584

+ + +

As can be seen, embedding quality is high across all models, but latency varies significantly. The fastest model turned out to be the base MPNet, with ~110M parameters (which technically makes it a small/medium language model). Furthermore, its embedding size is 768 dimensions, which would make it less expensive for other downstream systems to use our embeddings in the future.

+ + +

On a different note, given those numbers, one could argue that we did not even need tuning, but that is not what we chose. The untuned T5-11B gave us good precision@1, 2, and 3, but given how frequently we plan to use this model, and the variability in the data caused by constant changes to the Uber app, we would quickly feel the loss of the extra points that only a model customized by us provides.

+ + +

Challenges

+ + +

There were several challenges we needed to overcome during development. Some of them were specific to Uber, and some of them were related to the weaknesses of large language models.

+ + +

An issue we faced early on, while getting DragonCrawl to request and complete trips, was setting up the GPS locations of DragonCrawl’s (fake) riders and drivers. Uber’s matching algorithms, which are in charge of connecting riders to drivers, are very complex and built for scale, and they take into account variables such as time of day, current traffic conditions, future demand, etc. However, when testing with DragonCrawl, there would only be 1 rider and 1 driver in a particular city at any given time, which is not what Uber’s backend expects. Thus, there were times when riders and drivers would not be matched, even if they were right next to each other. To solve this problem, we had to tune the GPS locations of both riders and drivers, so that we would get favorable results. This challenge is very specific to Uber and/or ride-hailing and food delivery.

+ + +

Adversarial Cases

+ + +

When testing Uber’s trip flow, we saw DragonCrawl do some quirky things. In some cities, instead of requesting regular trips, it requested scheduled trips. What puzzled us the most, after debugging our artifacts carefully, is that DragonCrawl actually had all the conditions to make the right choice (i.e., touch on “Choose UberX”), but instead, it would choose a scheduled ride. Then, it would go through the UI to open a calendar and choose the date and time of the scheduled ride, which is impressive, but we digress.

+ + +

The example above is called an adversarial case. The concept of adversarial cases or adversarial samples was popularized a few years ago when researchers saw that it is possible to confuse a model in cases that should not be confusing at all. The image below shows how adding a little bit of noise to an image of a panda, producing what looks like pretty much the same panda, can confuse a machine-learning model to the point that it thinks it is looking at a gibbon (but we all know pandas do not look like gibbons).

+ + +

Fig 3: Example of how imperceptible noise can fool a Machine Learning model. This is not a hypothetical example, take a look.

+ +

+ + +

While it is impossible to fully rid a model of weaknesses to adversarial cases, we plan to do adversarial training and validation to reduce the risk.

+ + +

Steering DragonCrawl to More Optimal Paths

+ + +

In our offline tests of Uber’s trip flow, we saw that DragonCrawl can always request or complete a trip, but sometimes it would take too long to do so. There were times when a new pop-up would make DragonCrawl add another passenger/book a trip for someone else, which would then load several screens with options and settings that DragonCrawl had to figure out. It would figure them out, but because there would be several steps required (rather than just 1 or 2 new steps), it would take much longer. Since our goal is to run DragonCrawl on every Android code change, we cannot afford those longer routes so we had to train Dragon to say no/skip certain things and say yes/confirm other things.

+ + +

Hallucinations

+ + +

Finally, a topic of much discussion is hallucinations in large language models. In the words of Yann LeCun, VP and Chief AI Scientist at Meta, large language models “kind of spew nonsense sometimes” (see article). Indeed, we need to be mindful that we cannot fully trust large language models, or at least not without guardrails. In this section, we will discuss the guardrails we put in place to prevent hallucinations from harming DragonCrawl.

+ + +

First, one of DragonCrawl’s biggest strengths is the fact that it uses a smaller model. Our model has 110M parameters, which is roughly 3 orders of magnitude smaller than the popular GPT-3.5/4. This greatly reduces the variability and complexity of the answers it can output. In other words, model size limits model nonsense.

+ + +

Even so, we still received some invalid outputs, and here is how we handled them:

+ + +

Partially invalid actions: It may be possible for the model to return a response where some of the information is incorrect. For instance, it may return “touch” for a UI element that is swipeable; or it may output the right action and right location, but confuse the name of the UI element (e.g., request_trip_button). In either case, since we can read the valid actions, the correct UI element names, etc., from the emulator, we can resolve confusions such as the ones mentioned before. The emulator provides us with the ground truth we can use to find the right actions given the name of a UI element; the correct location given the UI element name; and even the right UI element name, given the right location.
Completely invalid actions: For completely invalid actions, we would append to the prompt the action previously suggested, and call out that it is invalid. This will result in a different action suggested by the model. For the case where invalid actions persist, we would backtrack and retry the suggestions from the model.
Loops/repeated actions: We may end up in loops (e.g., scrolling up and down in a feed) or repeated actions (e.g., repeated waits). We handle this case by keeping track of the actions already taken in the specific sequence, and even screenshots, so it is really easy to figure out if we are in a loop. Also, since DragonCrawl outputs a list of suggestions, we can just try other suggested actions.
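
The three guardrails above can be sketched roughly as follows. This is an illustration under assumptions rather than DragonCrawl's implementation: the `Suggestion` type, the `valid_actions` mapping (standing in for the ground truth read from the emulator), and the `max_repeats` threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Suggestion:
    action: str   # e.g. "touch" or "swipe"
    element: str  # UI element name


def repair(suggestion: Suggestion, valid_actions: dict[str, list[str]]) -> Optional[Suggestion]:
    """Fix a partially invalid suggestion using ground truth from the emulator.

    valid_actions maps each on-screen element name to the actions it supports.
    Returns None for an unknown element (a completely invalid action).
    """
    allowed = valid_actions.get(suggestion.element)
    if not allowed:
        return None
    if suggestion.action in allowed:
        return suggestion
    return Suggestion(allowed[0], suggestion.element)  # swap in a supported action


def choose_action(
    suggestions: list[Suggestion],
    valid_actions: dict[str, list[str]],
    history: Counter,
    screen_id: str,
    max_repeats: int = 3,
) -> Optional[Suggestion]:
    """Return the first usable suggestion that has not been over-repeated on this screen."""
    for suggestion in suggestions:
        fixed = repair(suggestion, valid_actions)
        if fixed is None:
            continue  # completely invalid: try the next suggestion
        step = (screen_id, fixed)
        if history[step] >= max_repeats:
            continue  # loop guard: this step was already taken too often
        history[step] += 1
        return fixed
    return None  # nothing usable: the caller would re-prompt or backtrack
```

A repeat threshold, rather than a hard ban on repeating a step, leaves room for legitimate retries such as pressing the same button several times while a screen is slow to respond.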

+ + +

DragonCrawl in Action

+ + +

+ +

+ + +

We have seen DragonCrawl do amazing things, but in this section, we will discuss two scenarios that really impressed us.

+ + +

DragonCrawl Goes Online in Australia!

+ + +

In October 2023, we were testing Uber’s trip flow with DragonCrawl in Brisbane, Australia, and we saw something unexpected. DragonCrawl’s fake driver profile was perfectly set up but this time, it was not able to go online for around 5 minutes. During those 5 minutes, DragonCrawl pushed the “GO” online button repeatedly until it finally went online.

+ + +

Figure 4: DragonCrawl going online in Brisbane, Australia after trying for 5 minutes.

+ +

+ + +

We were pleasantly surprised. DragonCrawl is so goal-oriented that it went through an unfriendly user experience to accomplish its goals: go online, be matched to a (fake) rider, and do the hypothetical trip. Because of the time to completion, we knew we had to investigate. We also learned, as discussed more below, that DragonCrawl will not be thrown off by minor or non-reproducible bugs, like the ones that impacted our script-based QA.

+ + +

The ultimate solution: Turn it off, and then turn it back on

+ + +

It was September 2023, and we saw Dragon do something so smart, we did not know if we should laugh or clap. Dragon was testing Uber’s trip flow in Paris. It chose to go to the Paris airport (CDG), and when it got to the screen to select the payment method, the payment methods were not loading (most likely a blip in the account we were using). What did Dragon do? It closed the app, opened it, and then requested the trip again. There were no issues the second time, so Dragon accomplished its goal of going to the airport.

+ + +

Figure 5: DragonCrawl restarting the app to request a trip.

+ +

+ + +

It is difficult to express with words how excited and proud we are to see DragonCrawl do these things. Behaviors like pushing the go online button repeatedly just to be able to drive with Uber, or closing and reopening the app so that it can get to where it wants to be, make DragonCrawl more resilient to minor tech issues than our old script-based testing model.

+ + +

We have observed that no amount of code can match the goal-oriented behavior DragonCrawl displays, and what it represents for developer productivity is exciting. It is possible to create scripts that match DragonCrawl’s strategies, but how many thousands (or even millions) of lines of code would need to be written? How expensive would it be to update all of that code when needed? Now, imagine what happens when traditional tests encounter the scenarios we just described:

+ + +

Functioning driver account cannot go online for 5 minutes: This would raise eyebrows, if not alerts, in testing teams. We may even think there is an outage, which would alert multiple engineers, but in reality, it is a transient issue.
Payment method not loading: Tickets would be filed at the highest priority. This would trigger multiple conversations, examinations, and attempts to reproduce the issue, but it would only be a blip.

+ + +

DragonCrawl Running on Uber’s CI

+ + +

We productionized our model, along with the CI pipelines that have consumed it since around October 2023, and got some wins by the end of the year. As of January 2024, DragonCrawl executes the core-trip flow in 5 different cities once per night, and also for the Rider and Driver Android apps before releasing them to our customers. Since we launched, we have observed the following:

+ + +

High stability: DragonCrawl executed flows with 99%+ stability in November and December 2023. The rare cases where Dragon failed were due to outages in the third-party systems we use, and also due to a real outage caused by a high-priority bug that no other mobile testing tool detected.
No maintenance: We did not need to manually update and/or maintain DragonCrawl. Whenever there were changes in the apps, DragonCrawl figured out how to get through those changes to accomplish its goals, unlike our team of software testers, who spent hundreds of hours maintaining test cases in 2023.
High reusability: We evaluated DragonCrawl in 89 of our top cities, and DragonCrawl successfully requested and completed trips in 85 of them. This is the first time at Uber that a mobile test as complex as requesting and completing a trip has been successfully executed in 85 cities worldwide without needing to tweak code.
Device/OS resilient: We tested Uber’s trip flow in our CI with 3 different kinds of Android devices, and 3 different OS versions, and we even varied other parameters, such as available disk, CPU, etc. DragonCrawl successfully requested and completed trips across all of these combinations without tweaks to our code or model, which is not always guaranteed in traditional mobile tests. Tuning tests to handle different screen sizes/resolutions and other device specifics is a known hassle of traditional mobile testing.

+ + +

What’s Next?

+ + +

The foundations we set in 2023 paved the way for a very exciting 2024 and beyond. Our investments in smaller language models resulted in a foundational model with very high-quality embeddings, to the point that it unlocks the architecture shown below:

+ + +

Figure 6: Future mobile tests as RAG applications powered by the Dragon Foundational Model (DFM)

+ +

+ + +

With the Dragon Foundational Model (DFM), we can use small datasets (tens to hundreds of datapoints) and the DFM to create RAG (retrieval-augmented generation) applications that more accurately simulate how humans interact with our applications. Those smaller datasets (with verbal goals and preferences) would tell DragonCrawl what to optimize for, and that is all it needs. The DFM may be an LLM, but it is secretly a rewards model that takes actions to accomplish its goals, and as we have seen, some of those actions mimic what a real human would do.

+ + +

In 2024, a big area of investment for us will be to build the subsystems that will empower developers to build their tests as RAGs, and reap the benefits of flawless execution across many cities and languages with minimal (or even zero) maintenance costs.

+ + +

Conclusion

+ + +

With all the advancements generative AI has seen over the past 4-6 months, there are more things to evaluate to improve our model and the quality of our apps. We plan to evaluate more modern large language models to push the quality of our models even further. Every increase in model quality will increase the combinations we can test, bringing down bugs that reach our users, which in turn increases productivity, enabling developers to build new experiences, and giving DragonCrawl more things to test. This is a flywheel that gets kicked off and accelerates with model quality, and we will fuel this acceleration.

+ + +

Figure 7: Model-quality flywheel.

+ +

+ + +

Acknowledgments

+ + +

Something as complex as DragonCrawl would not have been possible without the help of our partner teams. We are very thankful to Uber’s CI, Mobile Foundations, Michelangelo, Mobile Release and Test accounts. We would also like to thank the passionate researchers who created MPNet (which we use), T5, and other LLMs for their contributions to the field and for enabling others to advance their own fields. Finally, we want to send a big thank you to our former intern Gustavo Nazario, who helped us turn DragonCrawl into what it is today.

+ + +

Cover photo attribution: This image was generated using OpenAI’s DALL·E.

Ensuring Precision and Integrity: A Deep Dive into Uber’s Accounting Data Testing Strategies

Uber — Thu, 18 Apr 2024 05:00:00 GMT

Introduction

+ + +

Uber operates multiple lines of business across diverse global regions. Financial Accounting Services (FAS) Platform (detailed architecture) is responsible for financial accounting across these global regions and is designed to follow the tenets below:

+ + +

Compliance
Auditability
Accuracy
Scalability
Analytics

+ + +

To maintain these tenets, FAS has built robust testing, monitoring, and alerting processes. This encompasses system configuration, business accounting, and external financial report generation.

+ + +

Challenges

+ + +

The financial accounting services platform at Uber operates at internet scale: approximately 1.5 billion journal entries (JEs) per day and 120 million transactions per day via ETL and data processing, at an average throughput of 2,500 queries per second. Standard off-the-shelf accounting systems cannot support the scale and scope of transactions on our growing platform. Additionally, we manage data from over 25 different services for accounting purposes. To handle data at this scale, our engineering systems are designed to process data both at the event level and in batch mode. As data flows through multiple components in the architecture, there is a need to ensure that all the components are designed to embrace the tenets defined above.

+ + +

The platform processed more than $120 billion in annual gross bookings and settlements in 2023. It operates at a transaction scale (~80 billion financial microtransactions per year) that is 10 times the trip scale and currently offers automated revenue computation for 99.6% of transactions, with 99.99% completeness, accuracy guarantee, and auditability. We onboarded 600+ business changes to support the scaling of the business in 2023. The platform processes big data and stores petabytes of data in Schemaless and Apache Hive™.

+ + +

The accounting process has multiple steps and validations are required at every step to adhere to the tenets. Here are the various steps where validations are performed:

+ + +

Business Requirement Validation
Accounting Onboarding
Accounting Execution
Report Generation

+ + +

To uphold our established principles, we implement checks and balances throughout the stages of financial accounting services:

+ + +

Requirements Signoff
Regression Testing
Integration Testing
UAT Validations
- Ledger Validations
- Transaction-Level Validations
Shadow Validations
Deployments
- Canary
Health Checks
- Auditor Checks
- Completeness Checks
Alerting/Monitoring
Report Generation

+ + +

Validations Life Cycle of Accounting Processes

+ + +

As the data flows through various components of the Fintech Systems, there are checks and balances at every stage so that the systems and processes adhere to the tenets.

+ + +

Requirements Signoff

+ + +

Based on the Business Models operating in various countries and the expectations of the local teams, the requirements are provided and tracked. Accounting requirements are provided in accordance with Generally Accepted Accounting Principles (GAAP). The requirements are then onboarded into our accounting systems. Fintech Systems at Uber have internal tools to validate the requirements, which perform 15+ automated checks to validate the expected output.

+ + +

Regression Testing

+ + +

Unit Testing

+ + +

Unit tests in Uber’s Financial Services are critical for ensuring the accuracy, security, and reliability of our applications. These tests involve isolating small sections of code and verifying that they function as intended. At Uber, we strive to identify and rectify errors early by rigorously testing each unit for correct operation and ensuring that overall services, from transaction processing to financial reporting, run smoothly and securely.

+ + +

Regression – Kaptres

+ + +

Kaptre (from Capt[ure] and Re[play]) is a capture and replay testing tool primarily employed for functional and regression testing purposes. Here’s a breakdown of how our financial systems use its key components:

+ + +

Test Case: For any accounting change, at least one new Kaptre test is added to the test suite so that all the use cases are tested in successive runs. Each test case includes input (same as used for UAT), expectations, and assertions.

+ + +

Capture Mode: When adding a test, we operate in “capture mode.” This mode executes the accounting process for the newly added test and captures dependencies needed to re-execute in an offline mode, like API request/responses from upstreams and expected accounting journal entries (JElines).

+ + +

Replay Mode: The subsequent test runs involve running the Kaptre regression test suite in the replay mode. This mode creates the new output using the latest code/config version and the assertions compare these with the captured expectations. A test failure is reported if the assertion fails.

+ + +

Change Triggers for Captured Responses: Captured responses change with alterations in upstream systems, new field additions in financial transactions, or anticipated accounting changes. These tests can be updated using the above capture mode after validations.

+ + +

This approach ensures that regression tests accurately reflect the system’s behavior at capture time and verify it against expected outcomes in successive test runs. The design adapts to changes in upstreams, fintech systems, and anticipated accounting modifications while maintaining the integrity of the testing process and reducing the risk of human error.
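The capture and replay modes described above can be illustrated with a minimal Python sketch. It is not Kaptre itself: `compute_journal_entries`, the fare lookup, and the snapshot layout are hypothetical stand-ins chosen only to show the idea of recording upstream responses once and re-running the current code against them offline.

```python
import json

def compute_journal_entries(event, fetch_fare):
    """Hypothetical accounting step: one debit and one credit line per trip."""
    fare = fetch_fare(event["trip_id"])  # upstream dependency
    return [
        {"account": "receivable", "debit": fare, "credit": 0},
        {"account": "revenue", "debit": 0, "credit": fare},
    ]

def capture(event, live_fetch_fare):
    """Capture mode: run once against live upstreams and record everything."""
    recorded = {}

    def recording_fetch(trip_id):
        recorded[trip_id] = live_fetch_fare(trip_id)
        return recorded[trip_id]

    expected = compute_journal_entries(event, recording_fetch)
    return json.dumps({"event": event, "upstream": recorded, "expected": expected})

def replay(snapshot_json):
    """Replay mode: rerun the current code offline and compare with the capture."""
    snap = json.loads(snapshot_json)
    actual = compute_journal_entries(
        snap["event"], lambda trip_id: snap["upstream"][trip_id]
    )
    return actual == snap["expected"]
```

In this sketch, a code change that alters the journal entries makes `replay` return `False`; an intended change would be handled by re-running `capture` after validation.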

+ + +

Figure 1: Kaptre (functioning of the capture-replay framework).

+ +

+ + +

SLATE (Short-Lived Application Testing Environment)

+ + +

Testing in a short-lived application environment (a.k.a. SLATE) before deploying to production is a crucial step in Uber’s software development lifecycle. SLATE testing helps us identify and address issues early on and reduces the risk of introducing defects/problems into the production environment. Various types of testing are performed in SLATE, including integration, performance, and security testing. The primary purpose of this testing is to run the application in a production-like environment, identify and detect issues (like runtime errors) early in the development cycle, and prevent the propagation of defects to higher environments.

+ + +

Find more details in the SLATE Uber Eng Blog.

+ + +

In summary, testing in short-lived application environments is a best practice that contributes to the overall quality, reliability, and security of our services before they are deployed to production.

+ + +

Integration Testing

+ + +

Financial Accounting Services at Uber engages with numerous upstream systems (30+) to enrich trip details essential for generating accounting transactions. Integration testing is crucial for seamless communication between financial systems and upstream components, identifying interface issues and enabling early risk mitigation.

+ + +

However, a notable challenge with integration testing lies in determining completeness. Unlike unit tests that have a clear metric for code coverage, integration tests lack insights into the scenarios to cover, and there is no established metric for measuring integration test coverage. This gap results in dependent teams not automatically being informed about new scenarios being launched, and there is a lack of metrics to comprehend test coverage for all scenarios.

+ + +

To address this, we have developed an internal tool that automates the detection, notification, and acknowledgment of the readiness of all dependent systems. This tool aims to ensure a defect-free launch and provides a mechanism to measure integration test coverage.

+ + +

This becomes particularly critical within revenue systems, positioned at the conclusion of the data flow and interacting with multiple services. Unanticipated launches in this context pose the risk of disrupting accounting processes. For instance, a fare launch, lacking proper communication, might be routed to a dead letter queue (DLQ), leading to improper accounting due to insufficient onboarding in the revenue system.

+ + +

UAT Validations

+ + +

User Acceptance Testing (UAT) is a mandatory step in Uber’s financial systems development, where the accounting team rigorously validates financial reports for accuracy. We streamlined this process with comprehensive validation of aggregated and transaction level ledgers through automated testing covering positive and negative scenarios. This ensures integrity of balance sheets, income statements, and other key financial statements. This meticulous approach guarantees seamless integration of updates and patches without disrupting existing functionalities, with over 15 quality checks before signoff.

+ + +

Once accounting configurations are set up as per requirements, they undergo validation by the Accounting Team, culminating in an official sign-off indicating approval and attesting to accuracy and compliance. Business Rule configuration changes adhere to a stringent protocol, requiring explicit authorization from key accounting stakeholders before merging into the main system. Uber utilizes automated Buildkite jobs to ensure integrity and efficiency, systematically checking for necessary approvals when differences in the codebase are identified. This automation reinforces the rigor of the approval process.

+ + +

In instances where changes contradict the established protocols or bypass the mandatory approvals, an automated flag is immediately raised for a thorough review. This safeguard is essential in maintaining the system’s integrity and compliance.

+ + +

Uber’s financial systems employ two primary types of validations to ensure the utmost accuracy and reliability:

+ + +

Sample Validations
Ledger Validations

+ + +

Sample Validations

+ + +

Validations are performed on a selected set of sample orders, which are chosen to be representative of the scenarios of the orders in the production environment. These validations are typically adequate while making incremental changes to the financial systems.

+ + +

Table 1: Transaction level validations/sample validations.

+ +

+ + +

Ledger Validations

+ + +

For changes that impact a significant number of use cases either at the country or business level, we also perform ledger validations to get completeness assurances. These validations provide additional assurances at aggregate levels over a specific period of time before we implement these changes in production.

+ + +

Table 2: Aggregate ledger validations.

+ +

+ + +

Both validation types are integral to Uber’s commitment to maintaining the highest standards of financial accuracy and regulatory compliance. They work in tandem to ensure that the financial system remains robust, reliable, and reflective of the true financial position of the company.

+ + +

Shadow Validations

+ + +

The purpose of shadow testing is to serve as a final checkpoint to catch any potential issues before the build can be rolled over to the production environment. It’s essentially a build certification strategy aiding in making informed decisions about deploying a release candidate (RC). The core process involves passing production traffic through an RC and comparing its outputs with those from the current production build to spot any anomalies.

+ + +

Shadow testing consists of three parts:

+ + +

Capturing Production Requests: There are multiple strategies to achieve this. One of those entails recording the (request, response) pairs from production traffic intended for comparison against the RC.

+ + +

Replaying Production Traffic: These captured requests are replayed against the RC, and the responses are compared to those from the production environment. Any differences are logged for further analysis.
Analyzing Differences: Involves a thorough examination of the logged differences to determine the confidence level of the RC. This step is crucial for certifying the build’s readiness for deployment.

+ + +

Table 3: Shadow validations between pre-production environment and production environment.

+ +

+ + +

Challenges and Solutions

+ + +

Volume of Traffic and Upstream Calls: With the fintech services processing over 10K events per second, replaying all this traffic against the shadow build is impractical and could result in rate-limiting issues due to numerous upstream calls.
Traffic Sampling and Load Distribution: To mitigate this, we adopt a strategy of sampling a fraction of traffic (e.g., 10%) and distributing the replay over an extended period (e.g., 5 hours). This reduces the calls per second, but also limits the scope of testing.
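As a rough illustration of sampling and load distribution, the sketch below keeps a deterministic fraction of events by hashing their IDs and computes the average replay rate when the sample is spread over a longer window. The hashing scheme and function names are assumptions made for this example; the source does not describe how the sampling is implemented.

```python
import hashlib

def in_sample(event_id: str, percent: int = 10) -> bool:
    """Deterministically keep roughly `percent`% of events by hashing the ID."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < percent

def replay_rate(events_per_sec: float, capture_secs: float,
                sample_fraction: float, replay_secs: float) -> float:
    """Average replay rate when sampled events are spread over replay_secs."""
    return events_per_sec * capture_secs * sample_fraction / replay_secs
```

For instance, assuming one hour of traffic captured at 10K events per second, a 10% sample replayed over 5 hours averages about 200 events per second.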

+ + +

To remove the upstream-call bottleneck, we cache the upstream calls and responses instead of making real-time network calls.

+ + +

Caching Upstream Calls: We implemented a caching mechanism for upstream (request, response) pairs to avoid redundant calls during replay, at the cost of increased storage expenses. We limited those costs with a retention period of x days, after which old data is cleaned up. We always replay against the latest set of data.
This approach lacks the ability to detect real-time issues stemming from upstream changes. To mitigate this, we are developing a mechanism for sampling traffic and load distribution.
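A minimal in-memory sketch of such a cache with a retention period follows. The class name, the string request keys, and the injectable clock are illustrative assumptions; the real system stores captured pairs durably rather than in a Python dictionary.

```python
import time

class UpstreamCache:
    """Stores (request -> response) pairs captured from production traffic."""

    def __init__(self, retention_secs, clock=time.time):
        self._retention_secs = retention_secs
        self._clock = clock
        self._entries = {}  # request key -> (stored_at, response)

    def put(self, request_key, response):
        self._entries[request_key] = (self._clock(), response)

    def get(self, request_key):
        """Return the cached response, or None when nothing was captured."""
        entry = self._entries.get(request_key)
        return entry[1] if entry else None

    def evict_expired(self):
        """Drop entries older than the retention period; return how many."""
        cutoff = self._clock() - self._retention_secs
        stale = [k for k, (stored_at, _) in self._entries.items() if stored_at < cutoff]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

During replay, the release candidate would call `get` instead of the real upstream, which is also why upstream changes made after capture are invisible to the test.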

+ + +

Figure 2: Sampling and storing events for shadow testing.

+ +

+ + +

Replaying Captured Requests

+ + +

A specialized workflow has been developed for replaying stored requests within a given timestamp range against an RC. Discrepancies between the stored responses and the RC’s responses are logged for analysis.

+ + +

Validating and Analyzing Differences

+ + +

This phase involves scrutinizing the identified discrepancies. The aim is to differentiate between expected variances and anomalies. This requires an in-depth understanding of the response structure of the Banker system.
Banker’s Response Structure: The output from Banker is an array of complex data types called ‘transactions,’ each representing multiple journal entries with attributes like credit, debit, GL account, and line of business.

+ + +

The confidence in the build is gauged by how far the monetary discrepancies stray from predefined thresholds.

+ + +

Example Scenario and Analysis

+ + +

Transaction Comparison: Transactions from both the primary and shadow builds are compared. Differences are logged in a datastore.
Datastore Attributes: The logged data includes the transactions, their source (primary/shadow), the RunID of the replay workflow, and a timestamp.
Detailed Analysis: We conduct analyses focusing on anomalies in specific areas like LineOfBusiness and GlAccountNumber, using queries to identify monetary discrepancies. The build confidence is adjusted based on these findings.

+ + +

As an example, consider an event E1 that we processed against both the Production build and the release candidate. TxnP is the transaction we got from the Production instance, and TxnS is the transaction we got from the release candidate:

+ + +

TxnP and TxnS differ in LineOfBusiness. Hence, we would log them to a datastore for analysis. Our datastore will contain four attributes:

+ + +

Transactions -> Array of transactions.
Source -> Primary/Shadow, i.e., whether the transactions come from the Production build or the Shadow build.
RunID -> RunID of the Replay workflow. Since we can run multiple shadow testing workflows with different builds, we should make sure that we are only analyzing the differences of a specific workflow.
Timestamp.

+ + +

After replaying E1, our datastore will contain the following two new records:

+ + +

(TxnP, Primary, runID, currentTimeStamp)
(TxnS, Shadow, runID, currentTimeStamp)

+ + +

There are multiple ways in which we can analyze these differences. For our use case, we are most interested in capturing any anomalies in LineOfBusiness and GlAccountNumber. Hence, we have written queries that identify the monetary differences between Production and Release candidates across the LineOfBusiness (LOB) and GlAccountNumber dimensions. If the monetary difference on credit or debit is beyond a certain threshold, we reduce the confidence of the build. The farther it strays beyond the threshold, the lower the confidence of the build.

+ + +

For the above example, the shadow testing report for LOB differences will look as follows:

+ + +

Consider that we define a threshold of 1,000 USD per line of business. In that case, the difference of 100 is still well below the threshold and hence the confidence of our build would be unaffected.
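The per-LOB comparison can be sketched in Python as follows. The record layout (`entries`, `line_of_business`, `credit`, `debit`) and the LOB names are hypothetical; the source only states that each transaction carries journal entries with such attributes and that the analysis is done with queries over the logged differences.

```python
from collections import defaultdict

def totals_by_lob(transactions):
    """Sum credit and debit per line of business across journal entries."""
    totals = defaultdict(lambda: {"credit": 0, "debit": 0})
    for txn in transactions:
        for entry in txn["entries"]:
            lob = totals[entry["line_of_business"]]
            lob["credit"] += entry["credit"]
            lob["debit"] += entry["debit"]
    return totals

def lob_differences(primary, shadow):
    """Absolute credit/debit difference per LOB between the two builds."""
    p, s = totals_by_lob(primary), totals_by_lob(shadow)
    return {
        lob: {side: abs(p[lob][side] - s[lob][side]) for side in ("credit", "debit")}
        for lob in set(p) | set(s)
    }

def breaches(diffs, threshold):
    """LOBs whose credit or debit difference exceeds the threshold."""
    return sorted(lob for lob, d in diffs.items() if max(d.values()) > threshold)
```

With a single 100 USD credit booked under different LOBs by the two builds, each LOB shows a difference of 100, and `breaches(diffs, 1000)` is empty, so the build confidence would be unaffected.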

+ + +

By adopting this shadow testing strategy, we ensure a comprehensive and thorough evaluation of the release candidate. This methodical approach not only identifies potential issues but also provides insights necessary for improving future builds, ultimately contributing to the robustness and reliability of our deployment process.

+ + +

Figure 3: Shadow testing workflow.

+ +

+ + +

Deployment

+ + +

Canary Testing

+ + +

In the financial accounting services at Uber, the builds are deployed daily. To ensure the successful deployment of each build and minimize the impact of issues such as performance degradation, increased error rates, and resource exhaustion, incorporating a canary deployment is crucial. This strategy facilitates a controlled release, preventing immediate impacts on the entire traffic. It enables the identification and resolution of potential issues before a complete deployment occurs.

+ + +

The canary release approach is employed to test real traffic (<=2% traffic) with minimal impact. When a new build is ready for deployment, the canary zone serves as the initial deployment target. If any errors or issues arise during this deployment, the build is not propagated to other production regions, preventing widespread disruption and ensuring a more controlled release process.

+ + +

Deployment monitoring and alerting

+ + +

The financial accounting service platform consumes data from various upstream sources. To monitor the health of the services, we have configured multiple metrics and alerts. If an alert fires within a prescribed period after deployment, the deployment pipeline is paused and rolled back.

+ + +

Dead Letter Queues

+ + +

A dead letter queue (DLQ) stores events that could not be processed because of errors. Elevated DLQ counts indicate issues like buggy code, corrupt events, upstream service problems, or rate limiting. Each message queue has a corresponding DLQ for handling unprocessable events. Ideally, the DLQ must have zero events. We use threshold-based alerts to detect issues and investigate root causes when alerts are triggered. We log all errors, including event details, to a dedicated Apache Kafka® topic and ingest them into an Apache Hive™ table. We have configured Data Studio dashboards for monitoring the Apache Hive™ table, providing insights on DLQ events’ impact, count, freshness, and trends. This data-driven approach aids in quickly identifying and prioritizing issues for root cause analysis and system improvement.

+ + +

Alerts and Monitors

+ + +

Alerts are primarily used for flagging urgent and critical issues that need to be looked at immediately. Monitors are dashboards on which alerts are configured. The alerts should always be actionable. It is also recommended to tag every alert with a corresponding runbook.

+ + +

Our team alone has more than 400 alerts configured, spanning a wide range of dimensions, including but not limited to DLQ count, consumer lag of message queues, and service availability.

+ + +

Completeness Checks

+ + +

Figure 4: Completeness checks.

+ +

+ + +

All the events received by Financial Services need to be accounted for. When an event is received by our services, it is recorded in the Received Logger. Processing an event has one of the following outcomes:

+ + +

Event Processed – Recorded in Fact Table
Event Filtered or Ignored – Recorded in Filter Logger
Event Processing Errored Out – Recorded in Error Logger

+ + +

To ensure that all incoming events are accounted for, we perform completeness checks. These checks confirm that every event logged in the Received Logger is also logged in the Error Logger, the Filter Logger, or the Fact Table (indicating successful processing).
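Expressed as set arithmetic, the check looks for received event IDs that appear in none of the three outcome logs. The function below is a small in-memory sketch under that reading; the function name and the use of plain ID collections are assumptions, and the production check presumably runs over log tables.

```python
def unaccounted_events(received, processed, filtered, errored):
    """Return received event IDs that appear in no outcome log."""
    return set(received) - (set(processed) | set(filtered) | set(errored))
```

An empty result means every received event ended up processed, filtered, or errored; any ID in the result is an event that was silently lost.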

+ + +

Results

+ + +

Leveraging the aforementioned testing and detection strategies, our team has achieved a remarkable milestone in 2023: reducing the number of accounting incidents to zero. This significant accomplishment reflects our dedication to accuracy and efficiency in financial management. Furthermore, there has been a notable decrease in manual journal entries, a direct result of diminished accounting errors. This improvement has not only enabled the timely closure of monthly accounting books but has also bolstered our confidence in managing multiple projects within the accounting domain as evidenced by the 17% improvement in throughput. These advancements demonstrate our commitment to excellence and reliability in our accounting practices.

+ + +

Conclusion

+ + +

In conclusion, the comprehensive and multifaceted Fintech Testing Strategies employed by Uber’s Financial Accounting Services (FAS) have proven to be a resounding success. Through rigorous validation processes at every step–from business requirement validation to report generation, and employing advanced techniques such as regression testing, SLATE, integration testing, and shadow validations–Uber has set a new standard in financial systems’ reliability and accuracy.

+ + +

The challenges of handling immense volumes of transactions and data have been met with innovative solutions that not only address current needs but also scale for future growth. The meticulous approach to testing and validation, coupled with deployment strategies like Canary testing and vigilant monitoring and alerting systems, exemplifies Uber’s commitment to maintaining the highest standards in financial technology. In the next year, we are adding even more functionalities to our testing strategies to support the detection and auto-correction of bad inputs to support an error-free self-serve journey.

+ + +

Uber’s journey in refining its Fintech Testing Strategies serves as a benchmark for others in the industry, underlining the importance of continuous innovation and rigorous testing in the ever-evolving landscape of financial technology.

Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore

Uber — Thu, 11 Apr 2024 05:30:00 GMT

Introduction

+ + +

Last week, we explored LedgerStore (LSG) – Uber’s append-only, ledger-style database. This week, we’ll dive into how we migrated Uber’s business-critical ledger data to LSG. We’ll detail how we moved more than a trillion entries (making up a few petabytes of data) transparently and without causing disruption, and we’ll discuss what we learned during the migration.

+ + +

History

+ + +

Gulfstream is Uber’s payment platform. It was launched in 2017 using DynamoDB for storage. At Uber’s scale, DynamoDB became expensive. Hence, we started keeping only 12 weeks of data (i.e., hot data) in DynamoDB and started using Uber’s blobstore, TerraBlob, for older data (i.e., cold data). TerraBlob is similar to AWS S3.

+ + +

For a long-term solution, we wanted to use LSG. It was purpose-built for storing payment-style data. Its key features are:

+ + +

It is verifiably immutable (i.e., you can check that records have not been altered using cryptographic signatures)
Tiered storage to manage cost (the hot data is kept at a place that is best to serve requests and cold data is stored at a place that is optimized for storage)
Better lag for eventually consistent secondary indexes

+ + +

So, by 2021, Gulfstream was using a combination of DynamoDB, TerraBlob, and LSG to store data.

+ + +

DynamoDB for the last 12 weeks of data
TerraBlob, Uber’s internal blob store, for cold data
LSG, which we were already writing data to and wanted to migrate to fully

+ + +

Why Migrate?

+ + +

LSG is better suited for storing ledger-style data because of its immutability. The recurring cost savings by moving to LSG were significant.

+ + +

Going from three storage systems to one would simplify the code and design of the Gulfstream services responsible for interacting with storage and creating indexes. This, in turn, makes the services easier to understand and maintain.

+ + +

LSG promised shorter indexing lag (i.e., the time between when a record is written and when its secondary index is created). Additionally, it would give us lower network latency because it was running on-premises within Uber’s data centers.

+ + +

Figure 1: Data flow before and after the migration

+ +

+ + +

Nature of Data & Associated Risk

+ + +

The data we were migrating is all of Uber’s ledger data for all of Uber’s business since 2017:

+ + +

Immutable records – 1.2 PB compressed size
Secondary indexes – 0.5 PB uncompressed size

+ + +

Immutable records should not be modified. So, for all practical purposes, once we have written a record, it can’t be changed. We do have the flexibility of modifying secondary index data for correcting problems.

+ + +

Checks

+ + +

To ensure that the backfill is correct and acceptable in all respects, we need to check that we can handle the current traffic and that the data that is not currently being accessed is correct. The criteria were:

+ + +

Completeness: All the records were backfilled.
Correctness: All the records were correct.
Load: LSG should be able to handle current load.
Latency: The P99 latency of LSG was within acceptable bounds.
Lag: The secondary indexes are created in the background. We want to make sure that the delay of the index creation process was within acceptable limits.

+ + +

The checks were done using a combination of shadow validation and offline validation.

+ + +

Shadow Validation

+ + +

Shadow validation compares the response that we had been returning before migration with the one that we would return with LSG as the data source. This helps us ensure that our current traffic will be disrupted by neither data migration issues nor code bugs. We wanted our backfill to be at least 99.99% complete and correct as measured by shadow validation. We also had a 99.9999% upper bound for the same. The reasons for having an upper bound are:

+ + +

When migrating historical data, there are always data corruption issues. Sometimes this is because data was not written correctly during the initial development time of the service. It is also possible to see data corruption because of scale. For example, if a store such as S3 guarantees 11 nines of durability, you can still expect about 10 corrupted records in 1 trillion.
Indexes are eventually consistent, which means that some records will appear after a few seconds. So, the shadow validation will flag them as missing. This is a false positive that shows up at a large scale.
For 6 nines, you have to look at about 100 million comparisons to get results with good confidence. This means that if your shadow validation compares 1,000 records/second, you need to wait a bit more than one day just to collect sufficient data. With 7 nines, you will have to wait about 12 days. In practical terms, this would slow the project to a halt.
With a well-defined upper bound, you are not forced to look at every potential issue that you suspect. Say if the occurrence of a problem is 1/10 of the upper bound, you need not even investigate it.
With 6 nines, we could end up with slightly more than 1 million corrupt records. Even though 6 nines of confirmed correctness could mean a real cost to the company, the savings generated by this project outweighed the potential cost.
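The figures in the list above can be reproduced with a few lines of Python. The factor of 100 expected failures per measurement is inferred from the 100 million comparisons quoted for 6 nines; it is an assumption of this sketch and not a rule stated in the source.

```python
SECONDS_PER_DAY = 86_400

def expected_bad_records(total_records: int, nines: int) -> float:
    """Expected bad records when the per-record failure rate is 10**-nines."""
    return total_records * 10.0 ** -nines

def comparisons_needed(nines: int, expected_failures: int = 100) -> int:
    """Comparisons required to expect `expected_failures` bad results."""
    return expected_failures * 10 ** nines

def days_to_collect(nines: int, comparisons_per_sec: float) -> float:
    """Days of shadow traffic needed at the given comparison rate."""
    return comparisons_needed(nines) / comparisons_per_sec / SECONDS_PER_DAY
```

At 1,000 comparisons per second, `days_to_collect(6, 1000)` is about 1.16 days and `days_to_collect(7, 1000)` is about 11.6 days, while `expected_bad_records(10**12, 11)` is about 10 and `expected_bad_records(10**12, 6)` is about 1 million.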

+ + +

During shadow validation you are essentially duplicating production traffic on LSG. So by monitoring LSG, we can verify that it can handle our production traffic while meeting our latency and lag requirements. It gives us good confidence in the code that we wrote for accessing the data from LSG. Additionally, it also gives us some confidence about completeness and correctness of data, particularly with data that is currently being accessed. We developed a single generic shadow validation code that was reused multiple times for different parts of the migration.

+ + +

During the migration process, we found and fixed latency and lag issues caused by multiple bugs in different parts of the system:

+ + +

Partition key optimization for better distribution of index data
Index issues causing scan of the record instead of point lookup

+ + +

Unfortunately, live shadow validation can’t give strong guarantees about our corpus of rarely-accessed historical data.

+ + +

Offline Validation & Incremental Backfill

+ + +

This compares complete data from the LSG with the data dump from DynamoDB. Because of various data issues, you have to skip over bad records to ensure that your backfill can go through. Additionally, there can be bugs in the backfill job itself. Offline validation ensures that the data backfill has happened correctly and it covers complete data. This has to be done in addition to shadow validation because live traffic tends to access only recent data. So, if there are any problems lurking in the cold data that is infrequently accessed, it will not be caught by shadow validation.

+ + +

The key challenge in offline validation is the size of the data. The largest dataset that we tackled was 70 TB compressed (an estimated 300 TB uncompressed), and we compared 760 billion records in a single job. This type of Apache Spark™ job requires data shuffling. Distributed Shuffle as a Service for Spark, combined with Dynamic Resource Allocation and Speculative Execution, let us do exactly that at a reasonable speed under resource constraints.

+ + +

Offline validation found missing records and its output was used for incremental backfill. We iterated between offline validation and backfill to ensure that all the records were written.
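The loop between offline validation and incremental backfill can be sketched on in-memory dictionaries. The actual comparison ran as a Spark job over hundreds of billions of records; the function names, the `{record_id: payload}` layout, and the `write` callback here are assumptions for illustration only.

```python
def offline_diff(source, target):
    """Compare two {record_id: payload} dumps; return (missing, mismatched) IDs."""
    missing = sorted(k for k in source if k not in target)
    mismatched = sorted(k for k in source if k in target and source[k] != target[k])
    return missing, mismatched

def validate_and_backfill(source, target, write, max_rounds=3):
    """Alternate validation and incremental backfill until nothing is missing."""
    for _ in range(max_rounds):
        missing, _mismatched = offline_diff(source, target)
        if not missing:
            return True
        for record_id in missing:
            write(record_id, source[record_id])  # backfill only the gaps
    return not offline_diff(source, target)[0]
```

Mismatched records are reported but deliberately not rewritten in this sketch, since immutable records call for investigation instead of an overwrite.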

+ + +

Backfill Issues

+ + +

Every backfill is risky. We used Uber’s internal offering of Apache Spark for the backfills. Here are the different problems that we encountered and how we handled them.

+ + +

Scalability

+ + +

You want to start at a small scale and scale up gradually till you hit the limit of the system. If you just blindly push beyond this point then you are effectively creating a DDoS attack on your own systems. At this point, you want to find the bottleneck, address it, and then scale up your job. Most of the time it’s just a matter of scaling up downstream services, other times it can be something more complex. In either case, you don’t want to scale your backfill job beyond the capability of the bottleneck of the system. It’s a good idea to scale up in small increments and monitor closely after each scale-up.

+ + +

Incremental Backfills

+ + +

When you try to backfill 3 years’ worth of data in, say, 3 months, you generate roughly 10x the normal traffic load, and the system may not be able to cope with it. As an example, you will need about 120 days to backfill 100B records at a rate of 10K/sec when your production normally handles 1K/sec. So, you can expect the system to get overloaded. If there is even a remote chance of the backfill job causing an ongoing problem, you must shut it down. So, it is unrealistic to expect that a backfill job can run from start to finish in one go, and therefore you have to run backfills incrementally.
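The duration estimate is simple arithmetic, shown here as a small helper (the function name is made up for this example):

```python
def backfill_days(total_records: float, records_per_sec: float) -> float:
    """Days needed to write total_records at a steady records_per_sec."""
    return total_records / records_per_sec / 86_400
```

`backfill_days(100e9, 10e3)` comes to roughly 116 days of uninterrupted running, which the text rounds to 120; at the normal production rate of 1K/sec the same volume would take more than three years.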

+ + +

A simple and effective way to do this is to break the backfill into small batches that can be done one by one, such that each batch can complete within a few minutes. Since your job may shut down in the middle of a batch, it has to be idempotent. Every time you complete a batch you want to dump the statistics (such as records read, records backfilled, etc.) to a file. As your backfill continues, you can aggregate numbers from them to check the progress.
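One way to picture this batching scheme is the sketch below, which writes a small statistics file per completed batch and skips batches that already have one. The file naming, the dictionary standing in for the target store, and the statistic names are assumptions; the source does not describe the actual job layout.

```python
import json
import os

def run_backfill(batches, target, stats_dir):
    """Process batches one by one, skipping any that already have a stats file."""
    for batch_id, records in batches:
        stats_path = os.path.join(stats_dir, f"batch-{batch_id}.json")
        if os.path.exists(stats_path):
            continue  # batch finished in an earlier run
        written = 0
        for key, value in records:
            if key not in target:  # idempotent: an existing key is left alone
                target[key] = value
                written += 1
        with open(stats_path, "w") as f:
            json.dump({"read": len(records), "backfilled": written}, f)

def progress(stats_dir):
    """Aggregate per-batch statistics to report overall progress."""
    totals = {"read": 0, "backfilled": 0}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as f:
            stats = json.load(f)
        for field in totals:
            totals[field] += stats[field]
    return totals
```

Because the statistics file is written only after a batch completes, a job killed mid-batch simply redoes that batch on restart, and the per-key check keeps the redo harmless.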

+ + +

If you can delete or update existing records, it lowers the risk and cost of mistakes and code bugs during the backfill.

+ + +

Rate Control

+ + +

To backfill safely, you want to make sure that your backfill job behaves consistently. So, your job should have rate control that can be easily tweaked to scale up or scale down. In Java/Scala you can use Guava’s RateLimiter.
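For readers not on the JVM, the same idea can be sketched in Python. This is a simplified limiter that spaces permits evenly, written for illustration; it is not a port of Guava’s RateLimiter, and the injectable `clock` and `sleep` parameters are assumptions added to make it testable.

```python
import time

class RateLimiter:
    """Spaces permits evenly; `rate` can be changed while the job runs."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate  # permits per second
        self._clock, self._sleep = clock, sleep
        self._next_free = clock()

    def acquire(self):
        """Block until the next permit is available; return the time waited."""
        now = self._clock()
        wait = max(0.0, self._next_free - now)
        if wait > 0:
            self._sleep(wait)
        # Reserve the slot after this one for the next caller.
        self._next_free = max(now, self._next_free) + 1.0 / self.rate
        return wait
```

A backfill worker would call `acquire()` before each write, and an operator could scale the job up or down by assigning a new value to `rate`.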

+ + +

Dynamic Rate Control

+ + +

In some cases, you may be able to go faster when there is less production traffic. For this, you need to monitor the current state of the system and see if it’s OK to go faster. We adjusted RPS along the lines of additive increase/multiplicative decrease. We still had an upper bound on the traffic for safety.
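An additive increase/multiplicative decrease step can be sketched as a single function. The step size, backoff factor, and bounds below are placeholder values, and the `healthy` signal is assumed to come from whatever system monitoring is available.

```python
def next_rps(current_rps, healthy, step=50.0, backoff=0.5,
             floor=10.0, ceiling=5000.0):
    """Additive increase when healthy, multiplicative decrease otherwise."""
    proposed = current_rps + step if healthy else current_rps * backoff
    return max(floor, min(ceiling, proposed))  # the ceiling is the safety bound
```

Called once per monitoring interval, this creeps the rate upward while the system is healthy and halves it as soon as it is not.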

+ + +

Emergency Stop

+ + +

The migration process needs the ability to stop backfill quickly in case there is an outage or even suspicion of overload. Any backfill during an outage has to be stopped as both a precaution and as a potential source of noise. Even post-outage, systems tend to get extra load as systems recover. Having the ability to stop backfill also helps debug scale-related issues.

+ + +

Size of Data File

+ + +

When dumping data, keep the size of the files to around 1 GB, with 10x flexibility on both sides. If the files are too big, you run into issues such as the multipart limitations of different tools. If the files are too small, then you have too many of them, and even listing them takes significant time. You may even start hitting the ARG_MAX limit when running commands in a shell. At that point, it takes real effort to make sure that every operation on the data has been applied to all files and not just some of them.

+ + +

Fault Tolerance

+ + +

All backfill jobs need some kind of data transformation. When you do this, you inevitably run into data quality/corruption issues. You can’t stop the backfill job every time this happens because such bad records tend to be randomly distributed. But you can’t ignore them either, because they might be caused by a code bug. To deal with this, you dump problematic records separately and monitor statistics. If the failure rate is high, then you can stop the backfill manually, fix the problem, and continue. Otherwise, let the backfill continue and look at the failures in parallel.

+ + +

Another reason for records not getting written is RPC timeout. You can retry for this, but at some point, you have to give up and move ahead irrespective of the reason to make sure you can make progress.
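Both ideas, setting bad records aside and giving up after bounded retries, fit in one short sketch. The exception types, retry count, and function names are assumptions chosen for illustration; the returned failure rate is what an operator would watch when deciding whether to stop the job.

```python
def backfill_with_quarantine(records, transform, write, max_retries=3):
    """Write transformed records; set failures aside instead of stopping."""
    quarantined = []
    for record in records:
        try:
            payload = transform(record)
        except ValueError as exc:  # data quality or corruption problem
            quarantined.append((record, str(exc)))
            continue
        for _attempt in range(max_retries):
            try:
                write(payload)
                break
            except TimeoutError:
                continue  # retry the write
        else:
            # All retries timed out: give up on this record and move on.
            quarantined.append((record, "write timed out"))
    failure_rate = len(quarantined) / len(records) if records else 0.0
    return quarantined, failure_rate
```

The quarantined records can be examined in parallel and fed into a later incremental backfill once the cause is understood.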

+ + +

Logging

+ + +

It is tempting to log during backfill to help with debugging and monitor progress, but this may not be possible because of the pressure that it will put on the logging infrastructure. Even if you can keep logs, there will be too much log data to keep around. The solution is to use a rate limiter to limit the amount of logs that you are producing. You need to rate limit only the parts that produce most of the logs. You can even choose to log all the errors if they happen infrequently.
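A rate-limited logger along these lines might look like the following sketch, which caps non-error messages per one-second window and always lets errors through. The wrapper class and its parameters are hypothetical; only the standard `logging` module calls are real.

```python
import logging
import time

class RateLimitedLogger:
    """Pass at most `max_per_sec` non-error messages per one-second window."""

    def __init__(self, logger, max_per_sec, clock=time.monotonic):
        self._logger = logger
        self._max_per_sec = max_per_sec
        self._clock = clock
        self._window = None
        self._count = 0
        self.dropped = 0

    def log(self, level, msg, *args):
        """Return True if the message was emitted, False if it was dropped."""
        if level < logging.ERROR:  # errors are assumed rare, so always kept
            window = int(self._clock())
            if window != self._window:
                self._window, self._count = window, 0
            if self._count >= self._max_per_sec:
                self.dropped += 1
                return False
            self._count += 1
        self._logger.log(level, msg, *args)
        return True
```

Tracking `dropped` keeps a record of how much was suppressed, which is useful when reading the surviving logs later.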

+ + +

Mitigating Risk

+ + +

In addition to analyzing data from different validation and backfill stats we also were conservative with the rollout of LSG. We rolled it out over a few weeks and with go-aheads from on-call engineers of the major callers of our service. We initially rolled out with fallback (i.e., if the data was not found in LSG, we would try to fetch it from DynamoDB). We looked at the fallback logs before we removed the fallback. For every record that was flagged as missing in the fallback logs we checked LSG to make sure that it was not really missing. Even after that we kept the DynamoDB data around for a month before we stopped writing data to it, took a final backup, and dropped the table.
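The fallback read path can be sketched as follows, with plain dictionaries standing in for LSG and DynamoDB. The function name and the list used as a fallback log are assumptions for this example.

```python
def read_with_fallback(key, primary, fallback, fallback_log):
    """Read from the new store first; fall back to the old one on a miss."""
    value = primary.get(key)
    if value is not None:
        return value
    value = fallback.get(key)
    if value is not None:
        fallback_log.append(key)  # reviewed before the fallback is removed
    return value
```

Each key in the fallback log marks a record the new store could not serve at read time, which is the list that was checked against LSG before the fallback was removed.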

+ + +

Figure 2: LSG Rollout

+ +

+ + +

Conclusion

+ + +

In this article, we covered the migration of massive amounts of business-critical money data from one datastore to another. We covered different aspects of the migration, including criteria for migration, checks, backfill issues, and safety. We were able to do this migration over two years without any downtime or outages during or after the migration.

+ + +

Acknowledgments

+ + +

Thanks to Amit Garg and Youxian Chen for helping us migrate the data from TerraBlob to LSG. Thanks to Jaydeepkumar Chovatia, Kaushik Devarajaiah, and Rashmi Gupta from the LSG team for supporting us throughout this work. Thanks to Menghan Li for migrating data for Uber Cash’s ledger.

+ + +

Cover photo attribution: “Waterfowl Migration at Sunset on the Huron Wetland Management District” by USFWS Mountain Prairie is marked with Public Domain Mark 1.0.

+ + +

Amazon Web Services, AWS, and the Powered by AWS logo are trademarks of Amazon.com, Inc. or its affiliates.

+ + +

Apache®, Apache Spark™, and Spark™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

How LedgerStore Supports Trillions of Indexes at Uber

Uber — Thu, 04 Apr 2024 05:30:00 GMT

Introduction

+ + +

Uber connects the physical and digital worlds to help make movement happen at the tap of a button. Billions of trips, deliveries, and tens of billions of financial transactions across earners, spenders, and merchants are made at Uber every quarter. LedgerStore is an immutable storage solution at Uber that provides verifiable data completeness and correctness guarantees to ensure data integrity for these transactions.

+ + +

Considering that ledgers are the source of truth of any financial event or data movement at Uber, it is important to be able to look up ledgers from various access patterns via indexes. This brings in the need for trillions of indexes to index hundreds of billions of ledgers. A previous blog post discussed the background of LedgerStore and how the storage backend was re-architected. This blog covers the significance of LedgerStore indexing and its architecture, which powers trillions of indexes, with a petabyte-scale index storage footprint.

+ + +

Types of Indexes

+ + +

Various types of indexes need to be supported on ledgers. Let us explore them along with corresponding use cases.

+ + +

Strongly consistent indexes

+ + +

One of the use cases is handling the credit card authorization flow when a rider/eater uses Uber. At the beginning of an Uber trip, a credit card hold is placed on the rider/eater’s credit card. This hold should either be converted to a charge or voided, depending on whether the trip was taken or canceled, as shown below.

+ + +

Figure 1: Uber Trip credit-card payment flow supported by strongly consistent indexes.

+ +

+ + +

If the index serving the hold is not strongly consistent, it could take a while for the hold to be visible upon reading. A consequence of this is that a duplicate charge could be made on the user’s credit card while the original hold remains on the credit card.

+ + +

Now, let’s dive into how we build strongly consistent indexes that ensure that once a record write is performed, any subsequent reads are guaranteed to see the indexes corresponding to that record.

+ + +

Write Path

+ + +

To build strongly consistent indexes, we use a 2-phase commit to ensure that the index is always strongly consistent with the record, as shown below.

+ + +

The insert operation begins with an index intent write, followed by the record write. If the record write succeeds, the intents are committed; this is done asynchronously to avoid adding to end-user insert latency. If the index intent write succeeds but the record write fails, the index intent needs to be rolled back, otherwise unused intents accumulate. This rollback is handled at read time, as we will see next.

+ + +

It is important to note that if the index intent write fails, the whole insert operation fails since we cannot guarantee the consistency of the index with the record. Hence, strongly consistent indexes need to be considered only when the use case strongly demands it.
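
To make the ordering of the write path concrete, here is a toy in-memory sketch. The `LedgerTable` class, its dictionaries, and the `fail_record_write` switch are invented for illustration and say nothing about how LedgerStore is actually implemented.

```python
from enum import Enum


class IndexState(Enum):
    INTENT = "intent"
    COMMITTED = "committed"


class LedgerTable:
    """Toy in-memory stand-in for the record and index tables."""

    def __init__(self):
        self.records = {}  # record_id -> payload
        self.indexes = {}  # (index_key, record_id) -> IndexState
        self.pending_commits = []  # work queue for the asynchronous commit step

    def insert(self, record_id, payload, index_keys, fail_record_write=False):
        # Phase 1: write every index intent before the record. If an intent
        # write raised here, the whole insert would fail.
        for key in index_keys:
            self.indexes[(key, record_id)] = IndexState.INTENT
        # Record write. On failure the intents are left behind and are
        # cleaned up later on the read path.
        if fail_record_write:
            raise IOError("record write failed")
        self.records[record_id] = payload
        # Phase 2: the commit is queued, not done inline, so that it does not
        # add to insert latency.
        self.pending_commits.extend((key, record_id) for key in index_keys)

    def run_async_commits(self):
        while self.pending_commits:
            entry = self.pending_commits.pop()
            self.indexes[entry] = IndexState.COMMITTED
```

Because the intent always exists before the record does, a reader that finds a record can rely on finding its index entries, in either the intent or the committed state.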

+ + +

Figure 2: Two-phase commit write flow of strongly consistent indexes.

+ +

+ + +

Read Path

+ + +

There are two cases where an index can be in the intent state after the insert:

+ + +

The index intent commit operation failed in the write path, OR
The record write failed

+ + +

Such intents are handled on the read path by either committing or deleting them. When a read happens on these indexes and an index is in the intent state, the corresponding record is read. If the record is present, the index is committed; otherwise it is rolled back. These operations happen asynchronously so as not to affect end-user read latency. In general, only a very small percentage of indexes end up in the intent state.
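
A minimal sketch of this read-time resolution is shown below. The data layout (plain dictionaries and a `repairs` list standing in for an asynchronous queue) is an assumption made for the example.

```python
def read_index(index_key, indexes, records, repairs):
    """Return record IDs for index_key, resolving entries still in intent state.

    indexes: dict mapping (index_key, record_id) -> "intent" or "committed"
    records: dict mapping record_id -> payload
    repairs: list that collects ("commit" | "rollback", entry) actions to be
             applied asynchronously, off the read path.
    """
    result = []
    for (key, record_id), state in indexes.items():
        if key != index_key:
            continue
        if state == "committed":
            result.append(record_id)
        elif record_id in records:
            # Record exists, so the commit step must have failed: serve the
            # entry and schedule a commit.
            result.append(record_id)
            repairs.append(("commit", (key, record_id)))
        else:
            # Record write failed: hide the entry and schedule a rollback.
            repairs.append(("rollback", (key, record_id)))
    return result


def apply_repairs(indexes, repairs):
    """Apply the queued repairs; in a real system this would run in the background."""
    for action, entry in repairs:
        if action == "commit":
            indexes[entry] = "committed"
        else:
            indexes.pop(entry, None)
    repairs.clear()
```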

+ + +

Figure 3: Read flows of strongly consistent indexes.

+ +

+ + +

Eventually consistent indexes

+ + +

Not all indexes require strong read-your-write guarantees. An example is the index behind the payment history page, where a lag of a few seconds is acceptable as long as the payment eventually appears on the page.

+ + +

While strongly consistent indexes provide read-your-write guarantees, they are not suitable in certain circumstances since they trade off the following properties to achieve this guarantee:

+ + +

Higher Write Latency
The index intent write and the corresponding record write have to be performed serially to provide a strong consistency guarantee between the index and the record.
Lower Availability
A write failure of any one of the index intents means the whole write has to fail; otherwise the indexes would not be consistent with the corresponding record.

+ + +

Eventually consistent indexes are the opposite of strongly consistent indexes in this respect, as they are built in the background by a separate process that is completely isolated from the online write path. Hence, they neither suffer from higher write latency nor lower the availability of the LedgerStore service. We leverage a feature called Materialized Views from our home-grown Docstore database to generate these indexes.

+ + +

Figure 4: Payment history served by eventually consistent indexes.

+ +

+ + +

Time-range indexes

+ + +

Ledgers, due to their immutable nature, keep growing in size over time, thereby increasing their cost of storage. So, at Uber, we offload older ledgers in time-range batches to cheaper cold storage.

+ + +

Every ledger is associated with a timestamp called a business or event timestamp. To offload ledgers to cold storage (and also for sealing the data), we need a class of indexes to query data in event time-range batches. What differentiates this index is that data is read in deterministic time-range batches that are orders of magnitude larger than the reads served by the two index types above.

+ + +

Figure 5: Time-range indexes used in data tiering.

+ +

+ + +

Following is an example of how time-range queries are done on ledgers:

+ + +

SELECT * FROM LEDGER_TABLE WHERE LedgerTime BETWEEN 1701252000000 AND 1701253800000

+ + +

Ledger	LedgerTime
{trip started}	10:01 am
{trip completed and fare adjusted}	10:15 am
{post trip corrections}	12:01 pm

+ + +

There are a few ways to model this in a distributed database. We will dive into the key differences between developing the time-range index on top of Amazon DynamoDB and on top of the Docstore database. Both DynamoDB and Docstore, being distributed databases, provide partition keys and sort keys as data modeling constructs. The former distributes data evenly across partitions based on its value, and the latter controls the sort order of the data.

+ + +

Design with DynamoDB

+ + +

DynamoDB provides two ways of managing table read/write capacity: on-demand and provisioned. We used provisioned mode since our traffic is not bursty enough to require on-demand mode. The provisioned mode was configured with auto scaling to adjust capacity based on the traffic pattern.

+ + +

As the write pattern above shows, ledger times are generally correlated with the current wall clock time, so these values tend to be clustered around the current time. If we were to partition the data at a granularity of, say, G time units, all the writes within one G-unit window would go to the same physical partition, causing hot partitions. DynamoDB restricts throughput on hot partitions, which leads to throttling of write requests, and that is not acceptable in the online write path. Assuming a peak of 1K Uber trips/s, even G = 1 second is not a good value, since it corresponds to 1K WCUs (write capacity units) per second on a single partition, which is the peak allowed before throttling happens.

+ + +

While it might seem like we could just make the partitioning more fine-grained, that is still not foolproof, since an increase in traffic over time can lead to instability. Another side effect of finer partitioning is an increase in the cumulative reads that have to be performed via scatter-gather. So, in the case of DynamoDB, we used the following two-table design:

+ + +

Write-optimized temporary index table (called buffer index)

+ + +

All online time-index writes go to the buffer index table. Inserted index items are partitioned into M unique buckets based on a hash modulo of the corresponding record, which distributes load uniformly across partitions in the buffer index table and makes it write-efficient. The value of M is chosen high enough that the load per partition does not trigger excessive partition splitting, and low enough to limit the amount of scatter-gather needed during reads.
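
The bucketing can be sketched as follows. The value of `M`, the SHA-256 hash, and the `window#bucket` key layout are all assumptions chosen for illustration; the text only specifies a hash modulo over M buckets.

```python
import hashlib

M = 16  # number of buckets; an illustrative value, not one from the article


def buffer_bucket(record_id: str, num_buckets: int = M) -> int:
    """Map a record to one of num_buckets buckets with a stable hash."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def buffer_partition_key(window_start_ms: int, record_id: str) -> str:
    # Hypothetical key layout: time window plus bucket number.
    return f"{window_start_ms}#{buffer_bucket(record_id)}"


def buckets_to_read(window_start_ms: int, num_buckets: int = M) -> list:
    # Reading a window means one query per bucket (scatter-gather).
    return [f"{window_start_ms}#{b}" for b in range(num_buckets)]
```

The trade-off in choosing M is visible here: writes spread over M partition keys, but every read of a window has to issue M queries.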

+ + +

Read-optimized permanent index table

+ + +

The need for scatter-gather reads makes the buffer tables inefficient for reads, and since reads can happen throughout the lifecycle of a table, the read path needs to be optimized as well. This brings in the need for a read-efficient permanent index table.

+ + +

A permanent time-range index table is partitioned on the timestamp aligned to a certain time duration N (say 10 minutes). Indexes from the buffer tables are periodically written in batches to the permanent index table. Since the write is done in batches and in the background, any write throttling here does not affect the online traffic. Another advantage of batching is that the write traffic can be distributed across partitions, reducing the hot partitioning. The buffer index tables are deleted after offloading their indexes to the permanent index table since they are no longer needed. Reads on the permanent index tables are done in intervals of N minutes without any scatter-gather, making this table read-efficient.
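
A small sketch of the time alignment and the batch offload is given below. The function names and the in-memory grouping are hypothetical; the 10-minute window follows the example value of N in the text.

```python
N_MINUTES = 10
WINDOW_MS = N_MINUTES * 60 * 1000


def permanent_partition(ledger_time_ms: int, window_ms: int = WINDOW_MS) -> int:
    """Align a timestamp down to the start of its N-minute window."""
    return ledger_time_ms - (ledger_time_ms % window_ms)


def offload(buffer_entries, window_ms: int = WINDOW_MS) -> dict:
    """Group (ledger_time_ms, record_id) buffer entries by permanent partition."""
    partitions = {}
    for ledger_time_ms, record_id in buffer_entries:
        partitions.setdefault(permanent_partition(ledger_time_ms, window_ms), []).append(
            (ledger_time_ms, record_id)
        )
    for entries in partitions.values():
        entries.sort()
    return partitions
```

A read for one N-minute window then touches exactly one partition key, which is what removes the scatter-gather.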

The following figure depicts the time-range index flow in the case of DynamoDB. The dual-table design also brings in the need for state management and coordination, so that reads go to the correct index table.

+ + +

Figure 6: Time-range index design on DynamoDB.

+ +

+ + +

Design with Docstore

+ + +

The two-table design in the case of DynamoDB functions well and can handle high throughput, but it introduces operational challenges. If the temporary buffer tables are not created in time, writes cannot be accepted and fail, and this has caused availability issues in the past. We re-architected our index storage backend from DynamoDB to Uber’s Docstore database to improve cost efficiency. As part of this re-architecture, we also improved the time-range index design to overcome the downside of maintaining two tables, by leveraging two Docstore properties:

+ + +

Docstore is a distributed database built on top of MySQL, with a fixed number of shards assigned to a variable number of physical partitions. As the data size grows, the number of physical partitions increases and some of the existing shards are re-assigned to the new partitions, so the number of physical partitions has an upper limit.
Data in Docstore is stored sorted by primary key (partition key + sort key).

+ + +

We maintain just one table for the time-range index, wherein the index entries are partitioned on the full timestamp value. Because the timestamp is extremely granular, most writes are distributed uniformly across partitions, so there is no hot partitioning (and hence no write throttling).

+ + +

Reads involve a prefix scan of each shard of the table, up to a certain time granularity. A prefix scan is very similar to a regular scan of the table, except that the boundaries of each scan request are controlled by the application. So, in the example below, to read 30 minutes’ worth of data, reads could be done on a 10-minute interval from 2023-02-03 01:00:00 to 2023-02-03 01:10:00, and similarly repeated for the next two sub-windows. Since data is sorted on the primary key, a prefix scan with the given boundaries ensures that only data lying within these timestamps is read.
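
Splitting a read window into application-controlled scan boundaries can be sketched as follows. The helper name and the half-open `[lower, upper)` convention are assumptions made for this example.

```python
from datetime import datetime, timedelta


def sub_windows(start: datetime, end: datetime, step: timedelta):
    """Split [start, end) into consecutive half-open scan ranges of at most `step`."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper
```

For the 30-minute example, `sub_windows(datetime(2023, 2, 3, 1, 0), datetime(2023, 2, 3, 1, 30), timedelta(minutes=10))` yields three scan ranges, the first ending at 01:10:00.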

+ + +

A scatter-gather, followed by sort merging across shards is then performed to obtain all time-range index entries in the given window, in a sorted fashion. Since the number of shards is fixed in Docstore, we can precisely determine (and bound) the number of read requests that need to be performed. The same technique is not applicable in the case of DynamoDB since the number of partitions keeps increasing over time, as the table size increases. This has significantly simplified the design and reduced the operational maintenance cost of our time-indexes.
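
The scatter-gather followed by a sort merge can be illustrated with sorted in-memory lists standing in for shards. The row layout `(timestamp, record_id)` and the function names are hypothetical; the merge itself uses Python's standard `heapq.merge`.

```python
import heapq
from bisect import bisect_left


def scan_shard(shard_rows, lower, upper):
    """Bounded scan of one shard; rows are sorted by (timestamp, record_id)."""
    # A 1-tuple sorts before any 2-tuple with the same timestamp, so these
    # bounds select timestamps in the half-open range [lower, upper).
    lo = bisect_left(shard_rows, (lower,))
    hi = bisect_left(shard_rows, (upper,))
    return shard_rows[lo:hi]


def read_window(shards, lower, upper):
    """One bounded scan per shard, then a k-way merge of the sorted results."""
    per_shard = [scan_shard(rows, lower, upper) for rows in shards]
    return list(heapq.merge(*per_shard))
```

With a fixed shard count, the number of scan requests per window is known in advance: one per shard.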

+ + +

Figure 7: Time-range index design on Docstore.

+ +

+ + +

Index lifecycle management

+ + +

New indexes are defined regularly, and existing indexes may also be modified as use cases evolve. To support this with minimal effort and without causing regressions, we need a mechanism to manage the index lifecycle. Its components are as follows:

+ + +

Index lifecycle state machine

+ + +

This component orchestrates the lifecycle of the index: creating the index table, backfilling it with historical index entries, validating them for completeness, swapping the old index with the new one for reads and writes, and decommissioning the old index.
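
One way to picture such a state machine is a table of allowed transitions, as in the sketch below. The state names are paraphrased from the steps listed above, and the retry edge from validation back to backfill is an assumption, not something the article specifies.

```python
from enum import Enum, auto


class IndexLifecycle(Enum):
    CREATED = auto()
    BACKFILLING = auto()
    VALIDATING = auto()
    SWAPPED = auto()
    OLD_DECOMMISSIONED = auto()


# Allowed transitions. The retry edge from VALIDATING back to BACKFILLING is
# an assumption made for this sketch.
TRANSITIONS = {
    IndexLifecycle.CREATED: {IndexLifecycle.BACKFILLING},
    IndexLifecycle.BACKFILLING: {IndexLifecycle.VALIDATING},
    IndexLifecycle.VALIDATING: {IndexLifecycle.SWAPPED, IndexLifecycle.BACKFILLING},
    IndexLifecycle.SWAPPED: {IndexLifecycle.OLD_DECOMMISSIONED},
    IndexLifecycle.OLD_DECOMMISSIONED: set(),
}


def advance(current: IndexLifecycle, target: IndexLifecycle) -> IndexLifecycle:
    """Move to the target state, rejecting transitions that skip a step."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting illegal transitions is what prevents, for example, swapping in an index that has not been validated.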

+ + +

Figure 8: Index lifecycle state machine.

+ +

+ + +

Historical Index data backfill

+ + +

Depending on the business use cases, new indexes need to be defined, and it is essential to backfill historical index entries so that the new indexes are complete. This component builds indexes from the historical data offloaded to cold storage and backfills them to the storage layer in a scalable fashion. Because data can be downloaded faster than it can be processed, the component is built with configurable rate limiting and batching. It is also reusable: the actual processing logic is supplied as a batch processor plugin.
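
A minimal sketch of batching with a pluggable processor and a simple rate limit follows. The function signatures and the fixed-interval pacing are assumptions; the actual module may pace and batch differently.

```python
import time
from itertools import islice
from typing import Callable, Iterable, List


def batched(items: Iterable, size: int):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def backfill(
    records: Iterable,
    process_batch: Callable[[List], None],
    batch_size: int,
    max_batches_per_sec: float,
    sleep=time.sleep,
    clock=time.monotonic,
) -> int:
    """Feed records to a pluggable batch processor at a bounded rate."""
    min_interval = 1.0 / max_batches_per_sec
    processed = 0
    for batch in batched(records, batch_size):
        started = clock()
        process_batch(batch)
        processed += len(batch)
        # Sleep off whatever is left of this batch's time slot.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return processed
```

Here `process_batch` plays the role of the batch processor plugin: index building is one implementation, and other historical data jobs could supply their own.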

+ + +

Figure 9: Historical data processing module customized to backfill indexes.

+ +

+ + +

Index validation

+ + +

After indexes are backfilled, they need to be verified for completeness. This is done by an offline job that computes order-independent checksums at a certain time-window granularity and compares them between the source-of-truth data and the index table. This step identifies any bugs in the index backfill process, since even if a single entry is missed, the aggregate checksums for that time window will not match.
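
An order-independent checksum can be built by combining per-entry hashes with a commutative operation. The sketch below sums truncated SHA-256 hashes modulo 2^64 per time window; the hash, the modulus, and the entry encoding are assumptions, and the actual job may use a different construction.

```python
import hashlib
from collections import defaultdict


def entry_hash(entry: str) -> int:
    return int.from_bytes(hashlib.sha256(entry.encode("utf-8")).digest()[:8], "big")


def window_checksums(entries, window_ms: int) -> dict:
    """entries: iterable of (timestamp_ms, entry_string).

    Per-entry hashes are summed modulo 2**64, so the result does not depend
    on the order in which entries are read.
    """
    sums = defaultdict(int)
    for timestamp_ms, entry in entries:
        window = timestamp_ms - (timestamp_ms % window_ms)
        sums[window] = (sums[window] + entry_hash(entry)) % (1 << 64)
    return dict(sums)


def mismatched_windows(source_entries, index_entries, window_ms: int) -> list:
    """Return the start times of windows whose checksums differ."""
    source = window_checksums(source_entries, window_ms)
    index = window_checksums(index_entries, window_ms)
    return sorted(w for w in source.keys() | index.keys() if source.get(w) != index.get(w))
```

Working per window narrows a mismatch down to a small time range that can then be re-backfilled or inspected entry by entry.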

+ + +

Figure 10: Index completeness validation.

+ +

+ + +

Highlights

+ + +

This is how we measured the success of this critical project:

+ + +

We built over 2 trillion unique indexes, and not a single data inconsistency has been detected so far, with the new architecture in production for over 6 months.
Not a single production incident was observed during the backfill, which matters given how critical money movement is for Uber.
We also moved all these indexes from DynamoDB to Docstore. So the project also resulted in technology consolidation, reducing external dependencies.

+ + +

From a business impact perspective, operating LedgerStore is now very cost-effective due to reduced spend on DynamoDB. The estimated savings are over $6 million per year.

+ + +

Conclusion

+ + +

Ledgers are the source of truth for money movement events at Uber. The robust indexing platform we have built supports accessing these source-of-truth ledgers for various business use cases, and we look forward to supporting many more indexes on this platform in the future.

+ + +

We would like to conclude with some key takeaways. Maintaining petabyte-scale indexes in an OLTP system brings certain challenges, such as imbalanced partitioning, high read/write amplification, and noisy neighbor problems. Data modeling and isolation are therefore important aspects to consider while designing these systems. Further, depending on the actual database used underneath for storage, the design methodology can be significantly different, as we see from the contrast between the time-range index designs on two different distributed databases.

+ + +

Join us next week to see part two of the LedgerStore series where we chronicle a migration from DynamoDB to LedgerStore.

+ + +

Acknowledgments

+ + +

This project would not have been possible without collaboration from the following teams, embodying several Uber values:

+ + +

The Gulfstream team, who worked closely with the LedgerStore team in aligning on common goals and migrating onto the LedgerStore platform, a multi-year project.
The Docstore team, for evolving Docstore to meet the massive scale requirements of LedgerStore’s indexes.
The LedgerStore team for leading, building, and driving the adoption of ledger indexes at large scale.

+ + +

Amazon Web Services, AWS, the Powered by AWS logo, and Amazon DynamoDB are trademarks of Amazon.com, Inc. or its affiliates.

Scaling AI/ML Infrastructure at Uber

Uber — Thu, 28 Mar 2024 07:28:34 GMT

Introduction

+ + +

Machine Learning (ML) is celebrating its 8th year at Uber since we first started using complex rule-based machine learning models for driver-rider matching and pricing teams in 2016. Since then, our progression has been significant, with a shift towards employing deep learning models at the core of most business-critical applications today, while actively exploring the possibilities offered by Generative AI models. As the complexity and scale of AI/ML models continue to surge, there’s a growing demand for highly efficient infrastructure to support these models effectively. Over the past few years, we’ve strategically implemented a range of infrastructure solutions, both CPU- and GPU-centric, to scale our systems dynamically and cater to the evolving landscape of ML use cases. This evolution has involved tailored hardware SKUs, software library enhancements, integration of diverse distributed training frameworks, and continual refinements to our end-to-end Michelangelo platform. These iterative improvements have been driven by our learnings along the way, and continuous realignment with industry trends and Uber’s trajectory, all aimed at meeting the evolving requirements of our partners and customers.

+ + +

Goal and Key Metrics

+ + +

As we embarked on the transition from on-premise to cloud infrastructure that we announced in February 2023, our HW/SW co-design and collaboration across teams was driven by the specific objectives of:

+ + +

Maximizing the utilization of current infrastructure
Establishing new systems for emerging workloads, such as Generative AI

+ + +

In pursuit of these goals, we outlined distinct key results and metrics that guide our progress.

+ + +

Feasibility and Reliability: ML users expect successful convergence of their training tasks without errors within an expected time frame (either weeks or months based on complexity). For instance, training larger and more complex models like Falcon 180B™ can take many months, and longer training durations heighten the likelihood of reliability issues. Hence, our target is to achieve a 99% uptime SLA for all training dependencies to ensure consistent and reliable outcomes.

+ + +

Efficiency: Our focus on efficiency involves thorough benchmarking of diverse GPU configurations and assessing price-performance ratios of on-prem and cloud SKUs tailored to specific workloads. We gauge training efficiency using metrics like Model Flops Utilization (MFU) to guarantee optimal GPU utilization. Our aim is to prevent idle GPUs, opportunistically run training jobs during serving’s off-peak hours through reactive scaling, and uphold high utilization rates to maximize resource efficiency. We want to do this while also maintaining fairness of resource sharing between different users.
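
As a reference point, MFU is commonly computed as the ratio of the FLOPs a training job actually achieves to the theoretical peak of the hardware it runs on. The sketch below uses that common definition; the function name and parameters are illustrative, and it is not a statement of how Uber measures MFU internally.

```python
def model_flops_utilization(
    flops_per_token: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Achieved model FLOPs per second divided by aggregate peak FLOPs per second."""
    achieved = flops_per_token * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)
```

For dense transformer training, FLOPs per token is often approximated as 6 times the parameter count; that approximation is an assumption of whoever applies the formula and should be adjusted per architecture.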

+ + +

Developer Velocity: This metric is quantified by the number of experiments our engineers can conduct within a specific timeframe. We prioritize a mature ecosystem to bolster developer velocity, ensuring our teams work efficiently to deliver optimal solutions. This approach not only streamlines the path of our state-of-the-art models to production but also reduces the time this transition takes.

+ + +

What follows is a summary of results from various initiatives that we are taking to make training and serving deployments efficient and scalable, across both on-prem and cloud infrastructure:

+ + +

Optimizing Existing On-prem Infrastructure

+ + +

Federation of Batch Jobs:

+ + +

Our GPU assets are distributed over multiple Kubernetes™ clusters in various Availability Zones and Regions. This distribution is primarily due to GPU availability and the node count limitation within a single Kubernetes cluster. This arrangement presents two primary challenges:

+ + +

Exposure of infrastructure specifics to Machine Learning Engineers.
Inconsistent resource utilization across clusters due to static allocation. Although we have an effective resource-sharing system within each cluster, we lacked the capability for inter-cluster scheduling.

+ + +

To address these issues, we created a unified federation layer for our batch workloads, including Ray™ and Apache Spark™, called Michelangelo Job Controller. This component serves as a centralized interface for all workload scheduling, conceals the underlying Kubernetes clusters, and allocates workloads based on various policies (load aware, bin-pack), including compute and data affinity considerations. We plan to share more technical details on this in a subsequent blog post.
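
To illustrate what placement policies of this kind might look like, the sketch below picks a cluster for a job under a bin-pack or load-aware policy with an optional data-affinity preference. The `Cluster` type, the policy names as strings, and the selection rules are all hypothetical and are not taken from Michelangelo Job Controller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    total_gpus: int
    used_gpus: int
    region: str

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus


def place(job_gpus: int, clusters: List[Cluster], policy: str,
          data_region: Optional[str] = None) -> Optional[Cluster]:
    """Pick a cluster for a job; returns None when nothing fits."""
    candidates = [c for c in clusters if c.free_gpus >= job_gpus]
    # Data affinity: prefer clusters in the same region as the training data.
    local = [c for c in candidates if c.region == data_region]
    if local:
        candidates = local
    if not candidates:
        return None
    if policy == "bin-pack":
        # Tightest fit, to keep large contiguous capacity free elsewhere.
        return min(candidates, key=lambda c: c.free_gpus)
    if policy == "load-aware":
        # Least-utilized cluster, to even out load.
        return min(candidates, key=lambda c: c.used_gpus / c.total_gpus)
    raise ValueError(f"unknown policy: {policy}")
```

The caller only states what the job needs; which cluster it lands on stays hidden behind the policy, which is the point of a federation layer.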

+ + +

Fig 1: Unified federation layer for ML workload allocation.

+ +

+ + +

Network Upgrade for LLM training efficiency

+ + +

When expanding infrastructure to accommodate Generative AI applications and enhancing the efficiency of distributed training while fine-tuning open-source LLMs, it is important to focus on scaling network bandwidth across both scale-up and scale-out configurations. This necessitates implementing critical features such as full mesh NVlink™ connectivity among GPUs, upgrades in network link speeds, proficient congestion control management, QoS controls, and the establishment of dedicated rack and network topologies, among other essential features.

+ + +

Fig 2: Training efficiency improvements through network link capacity upgrades.

+ +

+ + +

We present a synopsis of findings derived from a Large Language Model (LLM) case study, emphasizing the considerable impact of enhanced network bandwidth and congestion control mechanisms on training effectiveness and price-performance efficiency. Our observations revealed nearly a two-fold increase in training speed and substantial reductions in training duration when employing higher networking bandwidth and better congestion control mechanisms compared to our existing network interconnect. During multi-node training, duplicating data across nodes heightens local memory demands and adds to IO workload. Our analysis prompted a recommendation to augment network link capacity by 4x (25GB/s to 100GB/s) on each GPU server, potentially doubling the available training capacity. While building these, we also need to make sure that the “Elephant Flows” generated by large training runs don’t negatively impact other high-priority services, through proper isolation and QoS controls.

+ + +

Memory Upgrade to improve GPU allocation rates

+ + +

Newer AI/ML workloads are demanding more system memory per GPU worker than what we had designed for. The inherent physical constraints, such as the limited number of memory channels on each server, and DIMM capacities deployed during NPI (new product introduction) restricted our ability to scale up GPU allocations. To improve our GPU allocation rates, we have initiated an effort to double the amount of memory on these servers (16GB to 32GB per DIMM channel). Additionally, we are also building a framework to repurpose and reuse the DIMMs when older racks are decommissioned. Such optimizations allow us to maximize utilization of existing ML infrastructure and make the most of our current resources. We will detail the efficiency gains achieved through this initiative in an upcoming post. In parallel, we have kicked off efforts to help rightsize the training jobs’ resource requirements. As demonstrated by others [ref], manually requesting the optimal resources is a hard problem, and automating it would help in increasing the allocation efficiency.

+ + +

Building New Infrastructure

+ + +

Price-performance evaluations across various cloud SKUs

+ + +

In late 2022, as we embarked on our journey towards transitioning to the cloud, we assessed various CPU and GPU models offered by different cloud providers. Our aim was to compare their price-performance ratios using established benchmarks ranging from tree-based and deep learning to large language models alongside proprietary datasets and Uber’s models such as deepETA and deepCVR. These assessments, conducted for both training and serving purposes, enabled us to select the most suitable SKUs optimized for our specific workloads, considering factors like feasibility, cost, throughput, and latency. Throughout 2023, we extensively tested 17 different GPU and CPU SKUs, employing various libraries and optimizers, including Nvidia’s TensorRT™ (TRT) and TRT-LLM optimizations. For instance, as depicted in Figures 3 and 4, we found that while A10 GPUs might not offer the most cost-effective throughput for training tasks, they prove to be the optimal choice for our serving use cases, delivering the best throughput while maintaining acceptable SLA using TRT.

+ + +

Fig 3: Deep learning training and serving performance-price evaluation.

+ +

+ + +

Fig 4: Deep learning serving latency with and without TensorRT optimizations.

+ +

+ + +

Numerous Generative AI applications at Uber necessitate the use of Nvidia’s newest H100 GPUs to satisfy their stringent latency requirements. This requirement stems from the H100 GPUs’ capabilities, which include up to 4x TFlops and double the memory bandwidth compared to the earlier generation A100 GPUs. While experimenting with Meta™ Llama2™ model series, involving various batch sizes, quantizations, and model parameters, we evaluated various open- and closed-source frameworks to further optimize for LLM serving performance. In Figures 5 and 6, we present a specific case where we employ two metrics, per-token latency (ms/token) and tokens/sec/gpu, to evaluate and compare the model’s performance across two of the top-performing frameworks (TRT-LLM and a currently confidential framework, referred to below as Framework B), keeping all other parameters constant and using FP16 quantization.

+ + +

Fig 5: LLM serving latency comparison by framework (H100).

+ +

+ + +

Fig 6: LLM serving throughput comparison by framework using the same latency budget and minimum number of GPUs required (H100).

+ +

+ + +

These experiment results clearly demonstrate that Framework B delivers a two-fold improvement in latency and a six-fold improvement in throughput compared to TRT-LLM. This further underscores the significance of HW/SW co-design: to fully leverage hardware capabilities, it is essential to have the right solutions across the entire stack.

+ + +

LLM Training efficiency improvements with memory offload

+ + +

In this section, we outline our framework for design and experimentation regarding the offloading of optimizer states, parameters, and gradients from GPU memory to either CPU memory or NVMe devices for large language models. Our aim is to evaluate how this offload impacts GPU scalability, training efficiency, and a range of system metrics.

+ + +

Fig 7: Design framework for memory offload experimentation.

+ +

+ + +

Our experiment results demonstrated that our capacity to train expansive models, previously hindered by restricted GPU memory, has been significantly enhanced. Memory offloading from GPU memory to system memory or even NVMe devices helped boost training efficiency by enabling the use of larger batch sizes with the same number of GPUs. This shift has resulted in a 2x increase in MFU (model flops utilization) while concurrently reducing GPU usage by 34%. However, it’s noteworthy that this improvement comes with a corresponding reduction in network throughput. A detailed Open Compute Project (OCP) conference talk on this topic can be found here.

+ + +

Fig 8: Training efficiency with DeepSpeed memory offload optimization.

+ +

+ + +

Conclusion

+ + +

To conclude, we’d like to leave you with three key insights. First, designing a singular AI/ML system amid rapid application and model advancements, spanning from XGBoost to deep learning recommendation models and large language models, poses considerable challenges. For instance, while LLMs demand high TFlops, deep learning models can encounter memory limitations. To enhance the cost-effectiveness of these systems, exploring workload-optimized solutions based on efficiency metrics like cost-to-serve and performance per dollar within a given SLA becomes imperative. Second, maximizing infrastructure efficiency necessitates a collaborative hardware and software design approach across all layers of the system. Within this context, we’ve showcased various examples in this post, illustrating how to leverage existing infrastructure effectively while building new capabilities to efficiently scale the infrastructure. Lastly, we extend an invitation to foster industry partnerships, urging engagement in open-source optimizations to drive efficiency and exchange ideas and learnings on effectively scaling infrastructure to tackle the evolving demands of the AI landscape.

+ + +

Acknowledgments

+ + +

Many thanks to the Uber AI Infrastructure, OCI, GCP, and Nvidia team members for their collaboration on the above work.

+ + +

Apache®, Apache Kafka, Kafka, Apache Spark, Spark, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

+ + +

Kubernetes® and its logo are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.

+ + +

Falcon 180B® and its logo are registered trademarks of Technology Innovation Institute™ in the United States and other countries. No endorsement by Technology Innovation Institute is implied by the use of these marks.

+ + +

LLaMA 2® and its logo are registered trademarks of Meta® in the United States and other countries. No endorsement by Meta is implied by the use of these marks.

Model Excellence Scores: A Framework for Enhancing the Quality of Machine Learning Systems at Scale

Uber — Thu, 21 Mar 2024 05:30:00 GMT

Introduction

+ + +

Machine learning (ML) is integral to Uber’s operational strategy, influencing a range of business-critical decisions. This includes predicting rider demand, identifying fraudulent activities, enhancing Uber Eats’ food discovery and recommendations, and refining estimated times of arrival (ETAs). Despite the growing ubiquity and impact of ML in various organizations, evaluating model “quality” remains a multifaceted challenge. A notable distinction exists between online and offline model assessment. Many teams primarily focus on offline evaluation, occasionally complementing this with short-term online analysis. However, as models become more integrated and automated in production environments, continuous monitoring and measurement are often overlooked.

+ + +

Commonly, teams concentrate on performance metrics such as AUC and RMSE, while neglecting other vital factors like the timeliness of training data, model reproducibility, and automated retraining. This lack of comprehensive quality assessment leads to limited visibility for ML engineers and data scientists regarding the various quality dimensions at different stages of a model’s lifecycle. Moreover, this gap hinders organizational leaders from making fully informed decisions regarding the quality and impact of ML projects.

+ + +

To bridge this gap, we propose defining distinct dimensions for each phase of a model’s lifecycle, encompassing prototyping, training, deployment, and prediction (see Figure 1). By integrating the Service Level Agreement (SLA) concept, we aim to establish a standard for measuring and ensuring ML model quality. Additionally, we are developing a unified system to track and visualize the compliance and quality of models, thereby providing a clearer and more comprehensive view of ML initiatives across the organization. Note that Model Excellence Scores (MES) cover certain technical aspects that are integral to Uber’s overall ML governance.

+ + +

Figure 1: Example ML quality dimensions (in yellow) in a typical ML system.

+ +

+ + +

Model Excellence Scores (MES)

+ + +

The development and maintenance of a production-ready ML system are intricate, involving numerous stages in the model lifecycle and a complex supporting infrastructure. Typically, an ML model undergoes phases like feature engineering, training, evaluation, and serving. The infrastructure to sustain this includes data pipelines, feature stores, model registries, distributed training frameworks, model deployment, prediction services, and more.

+ + +

To offer a comprehensive evaluation of model quality across these phases, we created and introduced the Model Excellence Scores (MES) framework. MES is designed to measure, monitor, and enforce quality across each stage of the ML lifecycle. This framework aligns with principles and terminologies common among site reliability engineers (SREs) and DevOps professionals, particularly those used in managing microservices reliability in production environments.

+ + +

MES revolves around three fundamental concepts related to Service Level Objectives (SLOs): indicators, objectives, and agreements. Indicators are precise quantitative measures reflecting some aspect of an ML system’s quality. Objectives set target ranges for these indicators, and agreements combine all indicators at an ML use case level, dictating the overall PASS/FAIL status based on the indicator results.

+ + +

Each indicator in MES is clearly defined and has a set target range for its metric value, with a specified frequency for value updates. If an indicator falls short of its objective within a given time frame, it’s marked as failing. Agreements, which encapsulate these indicators, represent the commitment level of the service and provide insights into its performance. Figure 2 illustrates the interconnections between agreements, indicators, and objectives, and how they relate to specific use cases and models.
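
The relationship between indicators, objectives, and agreements can be illustrated with a minimal sketch. The class names, field names, example values, and the rule that a single failing indicator fails the whole agreement are all assumptions made for illustration; the article does not describe the internal data model of MES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Target range for an indicator's metric value (hypothetical shape)."""
    lower: float
    upper: float

    def is_met(self, value: float) -> bool:
        return self.lower <= value <= self.upper


@dataclass(frozen=True)
class Indicator:
    """A quantitative measure of one quality aspect, paired with its objective."""
    name: str
    value: float
    objective: Objective

    def passes(self) -> bool:
        return self.objective.is_met(self.value)


@dataclass(frozen=True)
class Agreement:
    """Combines all indicators of one ML use case into a PASS/FAIL status."""
    use_case: str
    indicators: tuple

    def failing(self) -> list:
        return [i.name for i in self.indicators if not i.passes()]

    def status(self) -> str:
        # Assumption: the agreement fails if any indicator misses its objective.
        return "PASS" if not self.failing() else "FAIL"


agreement = Agreement(
    use_case="eta-prediction",  # hypothetical use case name
    indicators=(
        Indicator("prediction_accuracy", 0.91, Objective(0.85, 1.0)),
        Indicator("dataset_freshness_days", 12.0, Objective(0.0, 7.0)),
    ),
)
print(agreement.status(), agreement.failing())
```

Keeping the objective separate from the indicator value makes it possible to tighten a target range over time without changing how the metric is measured.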

+ + +

Figure 2: Relationship among agreement, indicator, objective, use cases, and models.

+ +

+ + +

Different indicators might necessitate varied timeframes for resolution and distinct mitigation strategies. Some may require immediate attention with higher priority handling, especially when performance benchmarks are not met.

+ + +

It’s also important to note that the roles and responsibilities associated with modeling can vary significantly between organizations. In some cases, a single team may handle the entire process, while in others, responsibilities may be distributed across multiple teams or departments.

+ + +

At Uber, the responsibility for each model is assigned to a designated primary team. This team receives alerts for any discrepancies or issues related to their model, as outlined in the agreement. Teams have the flexibility to tailor these alerts based on the significance and urgency of their ML use cases. It’s important to note that the quality of one model can influence another, either directly or indirectly. For instance, the output from one model might serve as input for another or trigger further model evaluations. To address this interconnectedness, we’ve implemented a notification system that informs both service and model owners of any quality violations in related ML models.

+ + +

The interaction between the Model Excellence Scores (MES) framework and other ML systems at Uber is depicted in Figure 3. The MES framework, with its indicators, objectives, and agreements, is built on several key principles:

+ + +

Automated Measurability: Every indicator in MES is designed with metrics that can be quantified and automated, ensuring robust infrastructure for instrumentation.
Actionability: Indicators are not just measurable but also actionable. This means that there are clear steps that users or the platform can take to improve these metrics over time in relation to their set objectives.
Aggregatability: The metrics for each indicator are capable of being aggregated. This is crucial for effective reporting and monitoring, allowing for a cohesive roll-up of metrics in line with the organization’s Objectives and Key Results (OKRs) and Key Performance Indicators (KPIs).
Reproducibility: Metrics for each indicator are idempotent, meaning their measurements remain consistent when backfilled.
Accountability: Clear ownership is attached to each agreement. The designated owner is responsible for defining the objectives and ensuring these objectives are achieved.

+ + +

Figure 3: High-level view of the interaction between the MES framework and various ML systems.

+ +

+ + +

In Table 1, we focus on some indicators that haven’t been extensively covered in related literature. Although MES is capable of measuring aspects like fairness and privacy, these topics are out of scope for this discussion. The table below outlines how each indicator adheres to these design principles, providing examples of measurable metrics, actionable steps for improvement, and the normalization schemes applied to ensure that the metrics are aggregatable and consistent across different use cases. These metrics are either normalized to a [0,1] scale, converted to a percentage, or maintained on a consistent scale across various applications.

+ + +

Indicators	Description	Possible Actions	Metric Normalization
Data Quality	Measures the quality of the input datasets used to train the model. This is a composite score for: – Feature null – Cross-region consistency – Missing partitions – Duplicates	– Backfill the missing partitions – Sync the data partitions across different regions and instances – De-duplicate the rows in the data	Each component in the composite score is normalized to the percentage scale
Dataset Freshness	Measures the freshness of the input datasets used to train the model	– Retrain with fresh input datasets – Backfill input datasets if updated data is available	Scale-consistent
Feature and Concept Drift	Shift in the target and covariate distribution as well as the relationship between the two over time for a model in production	– Apply weighted training or retrain the model with fresh data – Validate the correctness of upstream feature ETL pipelines	Normalized to [0,1] by using normalized distance metric and importance weights
Model Interpretability	Measures the presence and confidence of robust feature explanations for each prediction generated by the model	– Enable explanations	Normalized to [0,1]
Prediction Accuracy	Prediction accuracy of the model on production traffic (e.g., AUC, normalized RMSE)	– Update training datasets to account for train-serve skew – Check for feature or concept drift	Normalized to [0,1] by normalizing the accuracy metric

Table 1: Sample of indicators.
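
As one possible illustration of the "normalized distance metric and importance weights" scheme listed for drift, the sketch below scores each feature with the Jensen-Shannon divergence (base 2, which is bounded in [0,1]) between training and serving histograms, then averages the scores using feature-importance weights. The choice of Jensen-Shannon divergence, the function names, and the example features are assumptions; the article does not name the distance metric MES uses.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two histograms; lies in [0, 1]."""
    def normalize(hist):
        total = sum(hist)
        return [x / total for x in hist]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is a mixture, so b > 0 when a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    p, q = normalize(p), normalize(q)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def drift_score(train_hists, serve_hists, importance):
    """Importance-weighted average of per-feature divergences, in [0, 1]."""
    total_weight = sum(importance.values())
    return sum(
        (weight / total_weight) * js_divergence(train_hists[name], serve_hists[name])
        for name, weight in importance.items()
    )


# Hypothetical features: one unchanged, one whose distribution moved completely.
train = {"trip_distance": [10, 20, 30], "is_airport": [1, 0]}
serve = {"trip_distance": [10, 20, 30], "is_airport": [0, 1]}
importance = {"trip_distance": 3.0, "is_airport": 1.0}

score = drift_score(train, serve, importance)
print(round(score, 3))
```

Because each per-feature score is bounded and the weights are normalized to sum to one, the result stays in [0,1] and can be aggregated across models.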

+ + +

Results

+ + +

The implementation of the MES framework at Uber has markedly enhanced the visibility of ML quality within the organization. This increased transparency has been instrumental in fostering a culture that prioritizes quality, subsequently impacting both business decisions and engineering strategies. Over time, we have observed substantial progress in adherence to SLAs across various dimensions. Notably, there has been a remarkable 60% improvement in the overall prediction performance of our models.

+ + +

Moreover, the insights gleaned from the MES metrics have been pivotal in identifying areas for platform enhancements. A key development arising from these insights was the introduction of advanced platform tooling for hyperparameter tuning. This innovation enables the automatic periodic retuning of all models, streamlining the optimization process and ensuring consistent model performance. Such improvements underscore the tangible benefits of the MES framework in driving both operational efficiency and technological advancement.

+ + +

Lessons Learned

+ + +

In our journey of implementing and monitoring key indicators across all ML teams at Uber, we’ve gleaned several critical insights.

+ + +

Motivating ML Practitioners: The established framework allowed for a tangible measurement of the impact and efforts directed toward quality improvements. By adopting a standard and transparent reporting system, we created an environment where ML practitioners were motivated to enhance quality, knowing that their efforts were visible and recognized across the organization.

+ + +

Alignment and Executive Support: Initially, quality measures could be perceived as an additional burden unless they are seamlessly integrated into everyday practices from the outset. Implementing a quality tracking framework sheds light on existing gaps, necessitating extra efforts in education and awareness to address these issues. Aligning with executive leadership was crucial, enabling teams to prioritize quality-focused tasks. This alignment gradually led to a shift towards a more proactive, quality-centric culture across the board.

+ + +

Balancing Standardization with Customization: In designing the framework, we aimed for a level of standardization that would allow for consistent tracking and informed decision-making over time. However, given Uber’s diverse ML applications, it was also vital to permit customization for specific indicators to accurately reflect the nuances of each use case. For instance, in ETA prediction models, we adopted mean absolute error (MAE) as a more contextual metric than RMSE. The framework accommodated such customizations while maintaining a standardized approach to reporting for consistency.

+ + +

Prioritizing Incremental Improvements: Managing the framework across a wide array of use cases posed significant challenges in prioritization. We developed a straightforward prioritization matrix to identify which areas needed immediate attention. Recognizing that a handful of models contribute most to the impact, our focus was on enhancing quality in high-impact use cases first.

+ + +

The Role of Automation: Maintaining ML quality is resource-intensive, and manually managing models in production can divert efforts from innovation. Automating the production lifecycle, including retraining, revalidating, and redeploying models with fresh data, proved invaluable. This automation not only enhanced model freshness (as indicated by the reduced average age of models), but also allowed teams to focus more on innovation and less on maintenance.

+ + +

Conclusion

+ + +

We have developed a comprehensive framework that outlines the key dimensions of high-quality machine learning (ML) models across different stages of their lifecycle. This framework is inspired by Service Level Agreement (SLA) principles and is designed to monitor and ensure the quality of ML models. Importantly, it’s structured to accommodate additional quality dimensions, adapting to emerging use cases and evolving best practices in the field.

+ + +

Our discussion also encompassed the application of this framework in generating insightful quality reports at various levels of the organization. These reports are regularly reviewed, fostering accountability and offering valuable insights for strategic planning. Crucially, by embedding ML quality within the overall service quality of the associated software systems, we’ve facilitated a shared responsibility model. Applied scientists, ML engineers, and system engineers now collectively own ML quality. This collaborative approach has significantly bridged the gap between these functions, fostering a proactive, quality-focused culture within the organization.

+ + +

Acknowledgments

+ + +

We could not have accomplished the technical work outlined in this article without the help of our team of engineers and applied scientists at Uber. We would also like to extend our gratitude to the various Technical Program Managers – Gaurav Khillon, Nayan Jain, and Ian Kelley – for their pivotal role in promoting the adoption and compliance of the MES framework across different organizations at Uber.

Balancing HDFS DataNodes in the Uber DataLake

Uber — Thu, 14 Mar 2024 05:30:00 GMT

Introduction

+ + +

Apache Hadoop® Distributed File System (HDFS) is a distributed file system designed to store large files across multiple machines in a reliable and fault-tolerant manner. It is part of the Apache Hadoop framework and is one of the main components of Uber’s data stack.

+ + +

Uber has one of the largest HDFS deployments in the world, with exabytes of data across tens of clusters. It is important, but also challenging, to keep scaling our data infrastructure with the balance between efficiency, service reliability, and high performance.

+ + +

Figure 1: HDFS Infrastructure at Uber.

+ + +

Overview

+ + +

The HDFS balancer is a key component for keeping DataNodes healthy by redistributing data evenly across the cluster. It has to balance data more effectively to prevent DataNode skew as node decommissioning in our HDFS clusters becomes more and more intensive. The need for decommissioning comes from projects such as zone decommissioning, automatic cluster turnover for security patching, and DataNode colocation.

+ + +

However, the balancer that comes with HDFS open source did not meet this requirement out of the box. We have seen issues of one DataNode being skewed (i.e., storing more data compared to other nodes in the same cluster), which has multiple side effects:

+ + +

Leads to high I/O bandwidth on the host containing too much data
Highly utilized nodes have a higher probability of slowness, higher risk of node failure, data loss
Cluster has fewer active and healthy nodes to serve writing traffic for customers

+ + +

Below is an example of unbalanced data: in our largest cluster, composed of thousands of DataNodes with hundred-PB-scale capacity, thousands of nodes are near 95% disk utilization, while the balancing throughput is too low to move data effectively to newly added DataNodes. Such unbalanced data distribution is caused by bursty write traffic from warm tiering and EC conversion[1], and by intensive node decommissioning driven by zone decommissioning and cluster turnover for security patching. Because write reliability is the first priority, all DataNodes serve write traffic together under an available-capacity-weighted algorithm. As write traffic grows, the data skew grows as well.

+ + +

Figure 2: One of our biggest clusters, comprising thousands of DataNodes with hundred-PB-scale capacity, has skewed DataNodes.

+ +

+ + +

Thus, we needed to optimize the HDFS balancer so that it moves data faster from high-usage DataNodes to less occupied DataNodes.

+ + +

Given the scale of data storage at Uber, there could be more than 20 PB of data on unbalanced nodes in a single cluster, across 7-8 clusters. To tackle this problem of balancing HDFS DataNodes in the Uber DataLake, we devised a new algorithm that increases the number of pairs formed between DataNodes, which in turn increases parallel block movements while balancing data. We also sorted DataNodes by utilization so that the DataNode pairs formed are optimized and no recursive balancing takes place.

+ + +

This algorithm went on to increase our balancing throughput, i.e., the amount of data moved per second from a more occupied DataNode to a less occupied DataNode selected for balancing.

+ + +

Architecture & Design

+ + +

Figure 3: HDFS Balancer Architecture.

+ +

+ + +

Initialization and Setup:
1. The HDFS balancer is run on a host as a service within the Hadoop cluster.
2. To initiate the balancing process, a node with a balancer role needs to be present in the cluster. No two balancers can run concurrently.
+
Requesting Cluster Information:
1. The balancer first contacts the NameNode to request information about the data distribution within the cluster. It sends a request to the NameNode to obtain details about the distribution of data blocks across DataNodes.
2. The NameNode responds with a list of DataNodes and the blocks they contain, along with their storage capacities and other relevant information.
+
Block Selection and Planning:
1. Based on the information received from the NameNode, the balancer algorithm selects blocks that need to be moved to achieve a more balanced distribution.
2. The balancer takes into consideration factors such as DataNode utilization, rack information, threads, and storage capacity while planning block movements.
+
Coordination of Data Movement:
1. After determining which blocks to move, the balancer coordinates the actual data movement between DataNodes.
2. It communicates with the NameNode regarding the blocks moved with the help of heartbeats.
+
Block Migration:
1. The balancer initiates block migration by communicating directly with the source and destination DataNodes.
2. It instructs the source DataNode to transfer the selected block to the destination DataNode, moving the data block directly.
+
Monitoring Progress:
1. Throughout the data movement process, the balancer continuously monitors progress. It keeps track of how many blocks have been successfully transferred and ensures that the data movement is proceeding according to the plan.
+
Completion and Reporting:
1. Once the balancing operation is complete, the balancer reports the data transferred and data left to transfer in logs and through metrics.
2. It may also provide statistics and metrics about the balancing process, including the number of blocks moved and the time taken.
+
Termination:
1. In the host, the balancer runs as a service. So, until the cluster is balanced, it won’t stop moving the data.
+

+ + +

Initial Optimizations

+ + +

Since our objective was to increase throughput so that DataNodes could be balanced faster, we first tuned our HDFS balancers using the existing DataNode properties.
Although we increased the speed of the balancer by up to 3x, the throughput still wasn’t sufficient. We had too many highly occupied nodes, and the existing algorithm formed too few DataNode pairs for transferring data. We also couldn’t improve the throughput of each node by adding balancer threads, as doing so would slow the node down and affect read/write traffic. Thus, we needed to increase the number of DataNode pairs, which would ultimately increase balancing throughput.

DataNode and Balancer Configs that we used are mentioned below. Configurations for your workloads may be different based on your situation.

+ + +

DataNode configuration properties:

+ + +

Property	Default	Fast Mode
dfs.datanode.balance.max.concurrent.moves	5	250
dfs.datanode.balance.bandwidthPerSec	1048576 (1MB)	1073741824 (1GB)

+ + +

Balancer configuration properties:

+ + +

Property	Default	Fast Mode
dfs.datanode.balance.max.concurrent.moves	5	250
dfs.balancer.moverThreads	1000	2000
dfs.balancer.max-size-to-move	10737418240 (10GB)	107374182400 (100GB)
dfs.balancer.getBlocks.min-block-size	10485760 (10MB)	104857600 (100MB)
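
To make the byte values in the tables easier to check, the following sketch recomputes the "Fast Mode" sizes from binary units and renders them as hdfs-site.xml style property elements. Property names are written in the lowercase form used in Hadoop's hdfs-default.xml. The helper name `to_hdfs_site_xml` is hypothetical, and the values are the ones reported above, not recommendations for other workloads.

```python
import xml.etree.ElementTree as ET

MIB = 1024 ** 2
GIB = 1024 ** 3

# "Fast Mode" values from the tables above, expressed in binary units.
FAST_MODE = {
    "dfs.datanode.balance.max.concurrent.moves": 250,
    "dfs.datanode.balance.bandwidthPerSec": 1 * GIB,
    "dfs.balancer.moverThreads": 2000,
    "dfs.balancer.max-size-to-move": 100 * GIB,
    "dfs.balancer.getBlocks.min-block-size": 100 * MIB,
}


def to_hdfs_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


xml_text = to_hdfs_site_xml(FAST_MODE)
print(xml_text)
```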

+ + +

Algorithm Optimizations

+ + +

Increasing DataNode pairs for high throughput

+ + +

More DataNode pairs meant that we could have more concurrent block transmission, hence a key improvement is to construct more pairs. Due to the existing algorithm, a highly skewed cluster formed fewer DataNode pairs.

+ + +

Figure 4: Existing Algorithm.

+ +

+ + +

In the existing HDFS balancer algorithm, DataNodes above the cluster’s average utilization (i.e., above-average utilized and over-utilized nodes) greatly outnumbered below-average utilized and under-utilized nodes. Thus, we faced a scarcity of target nodes to receive data from highly utilized DataNodes, so the utilization of those DataNodes did not come down quickly.

+ + +

In the above diagram, there are 8 DataNodes above average and 4 DataNodes below average utilization, which would lead to 4 targets where data could be moved.
The aim was to modify the HDFS algorithm such that more pairs are formed for DataNodes, thus leading to more throughput from high-usage DataNodes, resulting in uniform utilization as well as a speedy bump down of usage with more coverage of DataNodes.
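
The scarcity of targets in a skewed cluster can be reproduced with a small sketch that groups nodes into the four categories mentioned above, relative to the cluster average and a balancing threshold. The utilization figures, node names, and the equal-capacity simplification are assumptions for illustration and are not taken from Figure 4.

```python
def classify_datanodes(utilization, threshold):
    """Group DataNodes into the four utilization categories named in the text.

    Assumption: `utilization` maps a node name to its used-space percentage and
    all nodes have equal capacity, so the cluster average is a plain mean.
    """
    average = sum(utilization.values()) / len(utilization)
    groups = {"over": [], "above_average": [], "below_average": [], "under": []}
    for node, used in utilization.items():
        diff = used - average
        if diff > threshold:
            groups["over"].append(node)
        elif diff > 0:
            groups["above_average"].append(node)
        elif diff >= -threshold:
            groups["below_average"].append(node)
        else:
            groups["under"].append(node)
    return average, groups


# Hypothetical skewed cluster: 12 nodes, most of them nearly full.
used_pct = [95, 94, 93, 92, 91, 86, 85, 84, 60, 55, 50, 45]
utilization = {f"dn-{i:02d}": u for i, u in enumerate(used_pct)}

average, groups = classify_datanodes(utilization, threshold=10)
sources = groups["over"] + groups["above_average"]
targets = groups["below_average"] + groups["under"]
print(f"average={average:.1f}% sources={len(sources)} targets={len(targets)}")
```

With these made-up numbers, eight nodes sit above the average and only four below it, so the balancer has only four targets to choose from.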

+ + +

Our idea was to use a percentile-based algorithm for creating more DataNode pairs.

+ + +

Figure 5: New Algorithm.

+ +

+ + +

In the new algorithm, we created an adjusted average based on percentile, which would increase the number of nodes to which the data could be moved. Above average/over-utilized DataNodes would try to come near to overall cluster utilization, whereas under-utilized/below average utilized nodes would try to come near adjusted average of percentile. With a percentile-based algorithm, we would aim to bring our adjusted average near overall cluster utilization.

+ + +

We used a percentile-based algorithm to increase the number of DataNode pairs. In a highly skewed cluster, the percentile was quite high. Taking the above diagram as an example, with the percentile set to P60, our adjusted average becomes 86.7%. In this case, the count of over-utilized/above-average utilized nodes decreases, and the count of under-utilized/below-average utilized nodes increases.

+ + +

Now, there would be 5 over-utilized and above-average utilized nodes and 7 under-utilized and below-average utilized nodes, which allows up to 7 pairs to be formed, compared with 4 before.
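
The effect of moving the split point from the mean to a percentile can be sketched as follows. The node utilizations are hypothetical (they are not the values behind Figure 5, so the adjusted average differs from 86.7%), and the linear-interpolation percentile is one common definition chosen for illustration; the article does not specify which definition the balancer uses.

```python
import math


def percentile(values, p):
    """Percentile with linear interpolation between the closest ranks (0 <= p <= 1)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def split_nodes(utilizations, split_point):
    """Nodes above the split point act as sources; the rest can act as targets."""
    sources = [u for u in utilizations if u > split_point]
    targets = [u for u in utilizations if u <= split_point]
    return sources, targets


# Hypothetical skewed cluster of 12 equally sized DataNodes (used space in %).
used_pct = [95, 94, 93, 92, 91, 86, 85, 84, 60, 55, 50, 45]

mean = sum(used_pct) / len(used_pct)
mean_sources, mean_targets = split_nodes(used_pct, mean)

adjusted_average = percentile(used_pct, 0.6)  # P60 as the split point
p60_sources, p60_targets = split_nodes(used_pct, adjusted_average)

print(f"mean split: {len(mean_sources)} sources, {len(mean_targets)} targets")
print(f"P60 split:  {len(p60_sources)} sources, {len(p60_targets)} targets")
```

In this example the mean leaves only 4 targets for 8 sources, while the P60 split yields 7 targets for 5 sources, matching the shift in counts described above.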

+ + +

We introduced a new Hadoop configuration property, dfs.balancer.separate-percentile, which defaults to 0.5, denoting the 50th percentile. If the balancer command is deployed with -dynamicBalancer, the percentile algorithm takes effect and the adjusted average comes into play, yielding more throughput.

+ + +

Figure 6: New Hadoop Configuration for Defining Percentile.

+ +

+ + +

We could also use this threshold to balance dynamically. For example, if DataNodes went above 90% utilization, we would balance them aggressively (i.e., with increased speed). In that case, we would balance the top 20% of DataNodes, concentrating moverThreads on the top 20% of highly utilized sources, so that data moves off highly utilized DataNodes faster and their usage comes down sooner.

+ + +

Figure 7: New Hadoop Configuration for Defining Aggressive Balancing.

+ +

+ + +

Moving data to lower occupied DataNodes

+ + +

Due to automation (i.e., automatically moving data off DataNodes to other DataNodes so that they can be sent for maintenance), DataNodes in a large cluster were decommissioned frequently. Data from a decommissioned node was moved to other nodes, increasing the occupied percentage on those nodes. New nodes that came up were balanced slowly, as they were not given priority.
As an example, suppose average utilization was 83% with a threshold of 3%, and a DataNode at 90% moved part of its data to a node at 79%, which then rose to 81%. If a new client then wrote data to that 81% node, it could rise to 87%, which might require further balancing of the same node, spreading the dispatcher and mover threads thinner.

+ + +

Figure 8: Old Algorithm – Pairs Formed.

+ +

+ + +

Figure 9: Old Algorithm – New Over-Utilized Nodes Came Up.

+ +

+ + +

Figure 10: New Algorithm – Preferred Optimization.

+ +

+ + +

Our enhancement was to prioritize the least occupied DataNodes as targets. Under-utilized and below-average utilized nodes are sorted in ascending order of utilization, while data is moved first from over-utilized nodes and then from above-average utilized nodes, each sorted in descending order. Nodes close to the average therefore stay out of the picture while balancing, which prevents recursive balancing.
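
A minimal sketch of this ordering is shown below. Pairing the i-th source with the i-th target is a simplification assumed for illustration, and the node names and utilizations are hypothetical; the real balancer also considers node groups, racks, and per-node limits, which are omitted here.

```python
def plan_pairs(sources, targets):
    """Pair the fullest sources with the emptiest targets.

    `sources` and `targets` map node names to used-space percentages.
    Assumption: one target per source, matched position by position.
    """
    fullest_first = sorted(sources.items(), key=lambda item: item[1], reverse=True)
    emptiest_first = sorted(targets.items(), key=lambda item: item[1])
    return [(src, dst) for (src, _), (dst, _) in zip(fullest_first, emptiest_first)]


# Hypothetical nodes: dn-x sits close to the average and should be used last.
sources = {"dn-a": 90, "dn-b": 95, "dn-c": 87}
targets = {"dn-x": 79, "dn-y": 40, "dn-z": 60}

pairs = plan_pairs(sources, targets)
print(pairs)
```

Sorting this way sends the most data to the emptiest nodes first, so a target that is already near the average is less likely to be pushed over it and need rebalancing later.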

+ + +

Better Observability

+ + +

We didn’t have metrics on the DataNode pairs formed between over-utilized and under-utilized nodes, over-utilized and below-average utilized nodes, and under-utilized and above-average utilized nodes, whether within the same node group, within the same rack, or across racks, nor on other relevant quantities. Hence, we weren’t able to calibrate the traffic distribution between these pairs. To find out where the number of DataNode pairs could be increased to raise throughput, we created a new dashboard.

+ + +

In the end, we added more than 10 metrics to track the performance of our change in algorithm, which would help us calibrate custom algorithms for the balancer more.

+ + +

Figure 11: Snapshots of our Metrics Dashboard.

+ +

+ + +

Results

+ + +

With optimizations in the balancing algorithm, we increased the throughput by more than 5x, with no DataNodes with higher utilization than 90%, as well as brought down the usage of the DataNodes overall. Also, there is now no need to deploy a manual balancer that took only certain hardcoded nodes to balance the data, as our optimization in the algorithm took care of that.

+ + +

As part of our new algorithm –

+ + +

Increased throughput – We increased the throughput by more than 5x.
Bringing down highly used DataNodes – We brought the number of DataNodes above 90% utilization down to 0.
DataNodes around same utilization – We reduced the overall usage of DataNodes and brought them to around the same utilization. All DataNodes in our biggest cluster were below 85% utilization.
Manage the capacity better – Cluster utilization increased from 65-66% to around 85% for HDFS clusters where we had capacity bottlenecks. We now had no highly occupied DataNode even though cluster utilization was higher than ever.

+ + +

Figure 12: DataNodes at a similar level due to the algorithm change and below 85% utilization for our biggest cluster.

+ +

+ + +

Figure 13: Panels reflecting the DataNode skew is reduced.

+ +

+ + +

Figure 14: Before balancer algorithm changes – 50.8% of DataNodes have usage above 90%.

+ +

+ + +

Figure 15: After balancer algorithm changes – no DataNodes have usage above 90%.

+ + +

Figure 16: One of our clusters with less cluster utilization around 65%.

+ +

+ + +

Figure 17: Cluster utilization increased to around 83% for the same cluster above.

+ +

+ + +

Figure 18: Increase in throughput by more than 3x due to algorithm changes.

+ +

+ + +

Conclusion

+ + +

In an HDFS cluster, data could get skewed among different DataNodes and could lead to high I/O on the node, leading to it being slow or going down, causing data loss. The new algorithm would help in balancing the DataNodes faster to achieve greater efficiency, service reliability, and high performance while preventing a higher probability of slowness, higher risk of node failure, and data loss.

+ + +

At Uber, we deployed this change to multiple clusters to increase the balancing throughput. We are raising an open-source patch for our optimizations. The Uber HDFS team continues to work on solving similar data distribution problems – given our scale, even a small improvement can result in a huge gain.

+ + +

[1] Uber keeps data with different access temperatures in dedicated clusters for better reliability and cost efficiency. We apply warm tiering to move data from the hot cluster to the warm cluster and adopt EC conversion to move data to a cluster with the erasure coding feature, which saves 50% capacity.

+ + +

“Apache®, Apache Hadoop®, and Hadoop®, are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.”

Load Balancing: Handling Heterogeneous Hardware

Uber — Thu, 07 Mar 2024 07:00:00 GMT

Overview

+ + +

This blog post describes Uber’s journey towards utilizing hardware efficiently via better load balancing. The work described here lasted over a year, involved engineers across multiple teams, and delivered significant efficiency savings. The article covers the technical solutions and our discovery process to get to them–in many ways, the journey was harder than the destination.

+ + +

Background

+ + +

“Better Load Balancing: Real-Time Dynamic Subsetting” is a related blog post that predates the work described here. We won’t repeat the background–we recommend skimming through the overview of our service mesh there. We’ll also be reusing the same terminology. This post focuses on the workloads communicating via the service mesh described in that post, which covers the vast majority of our stateless workloads.

+ + +

Problem statement

+ + +

In 2020, we started work to improve the overall efficiency of Uber’s multi-tenant platform. In particular, we focused on reducing the capacity required to run our stateless services. In this blog post, we’ll cover how individual teams making rational decisions led to inefficient resource usage, how we analyzed the problem and different approaches, and how, by improving load distribution, we got teams to safely increase CPU utilization and drive down costs. The post focuses on CPU only, since this was our primary constraint.

+ + +

First, some context: at Uber, most capacity decisions are decentralized. While our platform teams provide recommended targets and tools like auto-scalers, the ultimate decision to adopt specific targets lies in each of the product teams/organizations. A budgeting process exists to curb unlimited allocations.

+ + +

As part of the budgeting process, we noticed what we thought were unreasonably low utilization levels. However, attempts to increase the utilization were met with concerns from the product teams–they were rightly worried that increasing the utilization would risk the system’s reliability and affect their availability/latency goals.

+ + +

The cause of the problem was presumed to be suboptimal network load balancing. Many workloads had tasks with CPU usage higher than average. Those outliers worked fine during normal operations, but struggled during failovers–and the desire not to break the SLAs pushed our average utilization downwards.

+ + +

Figure 1: A typical “imbalance graph.” Each line represents the CPU usage of a container.

+ +

+ + +

Figure 2: A less obvious case: container utilizations are distributed across a band, but some are utilized more than others.

+ +

+ + +

Asymmetry of impact

+ + +

An important aspect of load imbalance is the asymmetry of its impact. Imagine a scenario where out of 100 workloads, 5 are under-utilized. This impacts efficiency, but the cost is relatively low–we’re not using 5% of our machines as efficiently as possible.

+ + +

If the situation is reversed and the same 5 workloads are over-utilized, the situation is much more severe. We are likely affecting customer experience and potentially affecting the system’s reliability. The easy solution to avoid these hotspots is to reduce the average utilization of the whole cluster. This will now have a much more significant impact: 95% of the workloads are underutilized, meaning a much more significant waste of (financial) resources.

+ + +

The forest and the trees

+ + +

Since the outliers were easy to spot, we initially focused on fixing and chasing them one by one, trying to root-cause and fix each issue individually as soon as possible. The results of these individual fixes weren’t always as expected. Some of our changes had a lower impact than expected or only impacted a subset of the system. Similarly, other changes later on resulted in unexpectedly significant improvements. This was due to several independent issues being at play. This “forest of issues” resulted in the work being largely sequential–we would only find a new, more minor issue once its larger sibling was fixed.

+ + +

In retrospect, the “surprise” part could have been mitigated with more analytical rigor–we could have understood the system better and collected more samples upfront. The sequentiality of the work would likely have been the same, though–it was only through the process that we learned how to understand and measure the system.

+ + +

Measuring the impact

+ + +

Perhaps surprisingly, one of the most disputed aspects of the project, until the very end, was measuring the impact. The discussions involved folks from different teams and organizations joining and leaving the project at different times. Each involved party had a valuable, but slightly different perspective on the problem, its priority, and potential fixes.

+ + +

Just measuring the impact consistently was surprisingly complicated. Clearly, we should measure the outliers–we quickly settled on using the p99 CPU utilization across the tasks of a given workload. After some discussions, we agreed to use the average as the base, leaving us with p99/average as the imbalance indicator.

+ + +

However, even that was surprisingly vague:

+ + +

A workload runs in multiple clusters across multiple zones. Should the p99/average be calculated across all its instances or for each cluster individually? If it’s per cluster, how do we weigh the results? This decision dramatically affects the final numbers.
Workloads run in multiple regions, yet unlike zones, our regions exhibit strong isolation–where to send traffic is outside of networking control. Thus, the networking team might care about a different indicator than the business.
A typical workload has a periodic pattern–a service might be most busy on a particular day of the week and underutilized at other times. Should we measure the imbalance at the peak only or throughout the day? If at peak, how long of a time frame should be considered peak? Do we only care about the single weekly peak?
Our workloads typically run in an active-active pattern, with each region having some spare capacity for a potential failover. The load imbalance matters most during those failovers–should we try to measure it only then? If so, the frequency of our measurements will be reduced–typically, we would get a single sample per week.
The workloads are noisy. A service rollout typically results in an imbalance spike (as new containers come and warm up). Some workloads might be quick to roll out (per increment) but roll out tens of times per day via a CD pipeline. Other workloads are much slower, and a single rollout can take hours. Both types of rollouts can overlap with peak times. On top of that, there are “atypical events” like temporary performance regressions, traffic drains, load tests, or incident-related issues.
Most workloads follow a “standard” pattern, but some (more critical) services have been partitioned into custom shards with separate routing configurations. Similarly, a small subset of essential workloads is additionally accessible by custom peer-to-peer routing. Finally, another small subset of services runs on dedicated hosts. These dimensions might affect our tracking.

+ + +

Once we settle on the per-workload indicator, the problem expands to multi-service:

+ + +

How do we weigh the individual workloads in the final score?
How do tiers (priority) of each service affect their weight in the final score?
Does the fact that different workloads have different periodical patterns affect the score? Workloads typically have weekly and daily peaks, but those peaks are not simultaneous.
Can we decompose the final indicator into sub-components to track the imbalance of individual zones or clusters?

+ + +

The indicators must be available in real time for development and monitoring–here, we care about the highest precision possible, typically sub-minute. However, the same indicator must be available over long periods (years), where we need to roll the data up into day-sized chunks while keeping all the previous weighting considerations in mind.

+ + +

Actual numbers:

+ + +

Ultimately, we created a “Continuous Imbalance Indicator.” For each workload, for each minute, we calculated the p99 (say, 5 cores) and average (say, 4 cores) CPU utilization. That, combined with the number of containers, allowed us to calculate “wasted cores.” For the example above, 10 containers would result in 10*4=40 cores of usage, (5-4)*10=10 wasted cores, and a resulting indicator of 1+10/40=1.25. This mapped intuitively to the “standard” p99/average calculation of 125% that humans could do when debugging live.
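
The per-minute arithmetic above can be sketched in a few lines of Python. This is only an illustration of the worked example; the function name and the returned dictionary are assumptions, not the production pipeline.

```python
def imbalance_indicator(p99_cores, avg_cores, containers):
    """Per-workload, per-minute indicator following the worked example.

    used   = average utilization * number of containers
    wasted = (p99 - average)     * number of containers
    """
    used = avg_cores * containers
    wasted = (p99_cores - avg_cores) * containers
    if used <= 0:
        raise ValueError("workload reports no CPU usage")
    return {
        "used_cores": used,
        "wasted_cores": wasted,
        "indicator": 1 + wasted / used,
    }


# The example from the text: p99 = 5 cores, average = 4 cores, 10 containers.
sample = imbalance_indicator(p99_cores=5, avg_cores=4, containers=10)
print(sample)  # {'used_cores': 40, 'wasted_cores': 10, 'indicator': 1.25}
```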

+ + +

Figure 3: theoretical definition of imbalance.

+ +

+ + +

When done over time, this effectively became a ratio of areas under two curves: p99 and average utilization.

+ + +

Figure 4: Continuous Imbalance Indicator on a real-time dashboard.

+ +

+ + +

The benefit of this approach was that since the wastage and utilization were calculated in absolute numbers of cores, it allowed us to aggregate them in custom, arbitrary dimensions: per service, per service-per-cluster, per group of services, per cluster, per zone. Similarly, any time window (hour, day, week) naturally worked–it was as simple as summing up a range of integers. Additionally, the indicator naturally gives higher weight to “busy” periods–imbalance at the peak is more critical than imbalance off-peak. The downside was the difficulty of explaining the indicator to humans, but we found that the approximation as a “weighted p99/average” is acceptable.
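
To illustrate why absolute core counts aggregate so easily, here is a minimal Python sketch. The `Sample` record, its field names, and the sample values are hypothetical; the only idea taken from the text is that used and wasted cores are summed per group before the ratio is taken.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    # Hypothetical per-minute record; field names are illustrative.
    service: str
    cluster: str
    minute: int
    p99_cores: float
    avg_cores: float
    containers: int


def aggregate(samples, key):
    """Sum used and wasted cores per group, then derive the indicator."""
    used = defaultdict(float)
    wasted = defaultdict(float)
    for s in samples:
        group = key(s)
        used[group] += s.avg_cores * s.containers
        wasted[group] += (s.p99_cores - s.avg_cores) * s.containers
    return {g: 1 + wasted[g] / used[g] for g in used if used[g] > 0}


samples = [
    Sample("foo", "c1", 0, 5.0, 4.0, 10),  # used 40, wasted 10
    Sample("foo", "c1", 1, 3.0, 2.0, 10),  # used 20, wasted 10
    Sample("foo", "c2", 0, 6.0, 4.0, 5),   # used 20, wasted 10
    Sample("bar", "c1", 0, 2.0, 2.0, 20),  # used 40, wasted 0
]

# The same samples sliced along different dimensions and time windows.
per_service = aggregate(samples, key=lambda s: s.service)
per_cluster = aggregate(samples, key=lambda s: s.cluster)
per_day = aggregate(samples, key=lambda s: s.minute // 1440)
```

Because busy minutes contribute more used cores to the denominator, they automatically carry more weight than quiet ones, which matches the weighting behavior described above.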

+ + +

An alternative approach of calculating a ratio of “weekly p99 of p99s” and “weekly average of averages” was easier to explain on an individual service basis but suffered from high sensitivity to random events (drains, failovers, load-tests, deployments), which made it noisy. Additionally, the cross-service weighting was less straightforward.
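
The sensitivity to random events can be seen with synthetic data. The sketch below is an assumption-laden illustration: it uses 1,000 one-minute samples rather than a full week, a nearest-rank percentile, and a made-up 20-minute spike standing in for a deployment or load test.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer such as 99."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


containers = 10
minutes = 1000
p99_series = [5.0] * minutes
avg_series = [4.0] * minutes

# Synthetic 20-minute event (2% of the window) that only affects the p99.
for minute in range(500, 520):
    p99_series[minute] = 12.0

# Continuous indicator: ratio of summed wasted cores to summed used cores.
used = sum(a * containers for a in avg_series)
wasted = sum((p - a) * containers for p, a in zip(p99_series, avg_series))
continuous = 1 + wasted / used

# Alternative: "p99 of p99s" divided by "average of averages".
percentile_ratio = nearest_rank_percentile(p99_series, 99) / (
    sum(avg_series) / len(avg_series)
)

print(round(continuous, 3), percentile_ratio)  # 1.285 3.0
```

Without the spike both calculations give 1.25. With it, the area-based indicator moves only slightly, while the percentile-based ratio jumps to the spike value.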

+ + +

The above metrics were made available as real-time metrics in Grafana and in long-term storage in Hive. We needed to write custom pipelines to pre-process the indicator daily for visualization.

+ + +

Different slicing

+ + +

A particular wrinkle about measuring load imbalance is worth calling out: how you slice your data dramatically affects the results. It is tempting to start with small slices (clusters, zones, regions) and then “average” the imbalance. Sadly, this doesn’t work in practice. For example, it’s possible to have two clusters with an (averaged) p99/average ratio of 110%, but when looking across the whole workload, the imbalance might be much higher–up to 140% in our case. Similarly, combining two clusters with higher imbalances might result in a lower imbalance.
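
A small synthetic example makes the slicing effect concrete. The utilization values below are invented for illustration, and a nearest-rank p99 is assumed; the point is only that per-cluster ratios do not average into the workload-wide ratio.

```python
import math


def p99_over_average(cores):
    """p99/average ratio using a nearest-rank p99."""
    ordered = sorted(cores)
    rank = math.ceil(99 * len(ordered) / 100)
    p99 = ordered[rank - 1]
    return p99 / (sum(ordered) / len(ordered))


# Two hypothetical clusters of 100 containers each (CPU cores in use).
cluster_a = [1.8] * 50 + [2.2] * 50  # average 2.0, p99 2.2
cluster_b = [2.7] * 50 + [3.3] * 50  # average 3.0, p99 3.3

ratio_a = p99_over_average(cluster_a)
ratio_b = p99_over_average(cluster_b)
ratio_all = p99_over_average(cluster_a + cluster_b)

# Each cluster looks ~110% on its own, but the whole workload is ~132%,
# because the combined average drops to 2.5 while the p99 stays at 3.3.
print(round(ratio_a, 2), round(ratio_b, 2), round(ratio_all, 2))
```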

+ + +

Addressing the issues

+ + +

The first step: getting (hacky) data first

+ + +

We started by building Grafana dashboards for real-time observability. This allowed us to measure impact individually per service in real time but didn’t help in understanding the root cause. While the assumption was that the load balancing was at fault, we didn’t *really* know. The initial problem was the lack of observability, where we faced two issues.

+ + +

First, due to cardinality issues, our load balancers did not emit stats by each backend instance. With many services running thousands of containers and hundreds of procedures, this would have both caused a memory usage explosion in our proxy and made the stats impossible to query for even medium-sized services. Luckily, an intern project that summer added the ability to emit stats on an opt-in basis (limiting the proxy memory usage) in a new metrics namespace (leaving the existing stats intact). Together with roll-up rules, we could now introspect most services (as long as we only enabled the extra visibility for a few of them at a time).

+ + +

Second, we had lost the ability to uniquely identify instances across our compute and networking stacks. At the time, we could see the CPU usage of each target but couldn’t easily map it to a container. The available “unique identifier” of a host:port would have broken our metrics (again, cardinality) due to our wide IP target range and dynamic port usage. The discussion of a proper solution had previously stalled for quarters. Ultimately, the networking stack implemented a short-term solution based on sorting IP addresses and emitting integer-based instance IDs. These were not stable across deployments, but together with some more hacky scripting, allowed us to get the data we needed.

+ + +

This step provided important lessons:

+ + +

Always get the data first
Well-placed, targeted, isolated hacks can be extremely useful
You don’t need perfect observability to draw the correct conclusions

+ + +

Manual analysis

+ + +

Once we had in-depth visibility into the issue, we hand-picked a few large services and tried to analyze the root causes. Surprisingly, the load balancing was not at fault–at a 1-minute window (our CPU stats resolution at the time), the RPS distribution was almost perfect. Each container was receiving an almost equal number of requests, with a difference below 0.1% for most applications. Yet, within the same window, the CPU utilization varied greatly.

+ + +

After several weeks of investigations, we were able to quantify several independent reasons:

+ + +

Some significant sources of traffic forced imbalance. For example, many of our systems are “city aware,” with a city always being in a single region. This naturally drove different amounts of traffic to each region, with proportions changing continuously as cities woke up and fell asleep.
Services ran across several hardware SKUs, both within and across clusters.
Even the theoretically identical hardware showed significant performance differences.

+ + +

Some of the imbalance was left in an “unknown” bucket. The majority of it turned out to be issues with our observability. We currently attribute the remainder (less than 20% of the original imbalance) to noisy neighbors.

+ + +

The graph below shows the initial analysis for one of our biggest services from 2020.

+ + +

Figure 5: Imbalance understood.

+ +

+ + +

Forced to build long-term aggregations

+ + +

At that point, we wanted to start with any low-hanging fruit. The work described in Better Load Balancing: Real-Time Dynamic Subsetting gave us a few knobs we could tweak. This, however, instead of being easy, presented a new problem.

+ + +

Figure 6: Patterns in weekly CPU utilization of a single service.

+ +

+ + +

Our services exhibit heavy daily and weekly cycles (see above). On top of that, we frequently see spikes caused by failures, deployments, failovers, or ad hoc events. After rolling out a change, only a massive improvement (20%+) would be human-spottable, but our changes were too subtle.

+ + +

This resulted in the observability decisions explained in the previous paragraphs. We built pipelines to aggregate data over long periods based on a stable and spike-resilient metric. On top of that, we could slice the metrics by clusters, zones, regions, or groups of services–this, in turn, lets us investigate more “suspicious” behavior.

+ + +

Some pre-existing knobs let us reduce the service-mesh-induced part of the load imbalance, but it was a small fraction of the overall problem.

+ + +

Possible solutions

+ + +

An obvious first step was to look at low-level hardware configuration and OS settings. A few separate threads were started to look at these.

+ + +

Solving the hardware heterogeneity required a more complicated process. Many approaches were possible, from:

+ + +

Modifying CFS parameters to make every host in the fleet appear the same despite the underlying hardware being different.
- This option was attractive but eventually dismissed due to its unclear impact on various software stacks (like GOMAXPROCS). In retrospect, it would also have prevented us from using cpu-sets-based configuration.
Modifying host-to-cluster placement to achieve uniform clusters.
Modifying the per-service cluster placement to guarantee stable, but not uniform, host selection.
Moving to cloud-style host management, where each team would select a particular type of hardware.
Many possible service mesh changes to achieve better load balancing.

+ + +

Figure 7: Option matrix (blurred out on purpose)

+ +

+ + +

Out of the possible options, changes to the service mesh were chosen for several reasons. Technically, changes on our layer required no changes to the physical layout of the data centers and no per-service migrations. Tactically, we could also deliver the changes quickly to most services.

+ + +

Changes

+ + +

Hardware

+ + +

While root-causing variance within hardware SKUs, we found many issues with hardware, firmware, and low-level software. They ranged across OS settings, CPU governor settings, firmware versions, driver versions, CPU microcode versions, and even kernel version incompatibility with Intel HWP. A general root cause was that, historically, once the hardware was ingested and turned up in the fleet, it was left untouched unless it had issues. Over time, that led to a drift between machines.

+ + +

Uber runs in a mixed cloud/private setup, so we naturally experienced cloud-specific issues as well. Like other companies, we’ve seen multiple cases of theoretically identically provisioned VMs not performing similarly (this still happens). Similarly, we’ve seen cases where workloads running fine on-prem triggered issues on the cloud. To make it worse, the cloud meant less visibility into the details of the underlying infrastructure.

+ + +

Fixing all these would be nearly impossible without a recently finished Crane project–we could measure, fix, and roll out changes to tens of thousands of machines without human involvement. All of the issues discovered are now detected and remediated automatically.

+ + +

A clear benefit of these fixes was that they applied to every workload, no matter how it processed or originated its work (Kafka, Cadence, RPCs, timers, batch jobs, etc.). They were also giving us effectively free capacity, on top of the load imbalance improvements–some CPUs “became faster” overnight.

+ + +

Observability

+ + +

Observability was an interesting part of the problem. Before the project started, we knew we had limitations in the sample collections due to 1-minute window sizes, but we found more issues.

+ + +

Technically, the problems were caused by interactions between cgroups, cexporter, our internal Prometheus metric scraper, and m3. In particular, due to the metrics being emitted as ever-increasing gauges, any delays in stats collection anywhere in the pipeline would result in (large) artificial spikes in percentile calculations. A lot of work was put into preserving the timestamps of the samples as well as gracefully handling both target and collector services restarts. An example issue was effectively breaking data collection for any large enough service.

+ + +

A fascinating aspect of the observability issues was related to human interactions – or the fact that humans cannot be trusted. Early in the project, we asked service owners what level of container utilization resulted in user impact (increased latencies). Interestingly, several months later, after we had rolled out the fix, when we asked again, we received the same answer. Both statements couldn’t be valid since we knew the old data was wrong. Ultimately, human irrationality resulted in net efficiency wins: service owners ended up running their services (effectively) hotter while thinking nothing had changed.

+ + +

Load balancing

+ + +

As explained in the Better Load Balancing: Real-Time Dynamic Subsetting blog post, our service mesh works on two levels. Initially, the control plane sends over assignments deciding how much traffic should be sent to each target cluster. The imbalance between clusters is decided here.
Later, the data plane follows this assignment, but then it’s responsible for picking the right host–a second level of within-cluster load balancing is happening here. While we considered changing this model, we kept it unchanged and rolled out two solutions for each level.

+ + +

Inter-cluster imbalance

+ + +

At Uber, services run in multiple zones in multiple regions. Because each zone is turned up at a different time, there is no way to guarantee the hosts in each zone are the same–usually, the newer the zones, the newer the generation hardware they have. The difference in performance of zones leads to CPU imbalance.

+ + +

Our initial approach was to set a static weight for each zone; the weight is then used in load balancing so that zones with faster hardware take more requests. The weight for each zone is calculated as the average of the Normalized Compute Unit (NCU) factor of each host deployed in that zone. The NCU factor measures host CPU/core performance based on a benchmark score, where the score depends on the product of core instructions-per-cycle (how much work is done by the core per clock cycle) and core frequency (how many clock cycles are available per second).

+ + +

We could then send more traffic to more powerful/faster zones, using static zone weights as a multiplier.

+ + +

Faster zones, with higher multipliers, receive proportionally more traffic, which raises their CPU utilization and eases the CPU imbalance.

+ + +

For example, if a service has deployed 10 instances in each of zone A (weight = 1) and zone B (weight = 1.2), the load balancing will be done as if B had 12 (10 * 1.2) instances, so B will receive more requests than A.
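
The zone-weight arithmetic can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the NCU factors passed to `zone_weight` are made up, and the real traffic assignment involves more than a single proportional split.

```python
def zone_weight(ncu_factors):
    """Static zone weight: the average NCU factor of the hosts in the zone."""
    return sum(ncu_factors) / len(ncu_factors)


def traffic_shares(instances, weights):
    """Split traffic in proportion to instance count times zone weight."""
    effective = {zone: instances[zone] * weights[zone] for zone in instances}
    total = sum(effective.values())
    return {zone: value / total for zone, value in effective.items()}


# Example from the text: 10 instances per zone, weights 1 and 1.2,
# so zone B is treated as if it had 12 instances.
shares = traffic_shares({"A": 10, "B": 10}, {"A": 1.0, "B": 1.2})
print({zone: round(share, 3) for zone, share in shares.items()})
# {'A': 0.455, 'B': 0.545}
```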

+ + +

Figure 8: Zone Weights

+ +

+ + +

This approach worked surprisingly well–we were able to mitigate the majority of the imbalance with relatively little effort. However, there were a few issues:

+ + +

Zone weight is an estimated value (average NCU factor) across all hosts in a zone. However, a service could be extremely lucky/unlucky to be deployed on the fastest/slowest hosts in a zone.
Though not frequently, the zones we operate on change due to turnup or turndown. Additionally, during turnup, we typically ingest hardware gradually, which might require multiple updates.
Occasionally, we ingest new hardware into old zones to resize them or replace broken hardware. This hardware can be of a different type, resulting in a need to adjust the weights.

+ + +

Dynamic Host-Aware Cluster Load Balancing

+ + +

Hence, we took a second look at the problem and invested in an advanced solution: Host-aware Traffic Load Balancing.

+ + +

This approach solves the drawbacks by looking at the exact hosts the service instances are deployed to, collecting their server types, and then updating the load balancing between clusters per service. This is achieved by making our discovery system aware of the mapping of a host (by IP), its host type, and weight such that for a given service deployed in a cluster, the discovery system could provide the extra weight info to our traffic control system. The diagram below shows an example:

+ + +

Figure 9: Dynamic Host-Aware Cluster Load Balancing

+ +

+ + +

For service Foo, if we treated each instance equally, the load balancing ratio would be 37.5%/62.5% instead of the 36%/64% shown in the example. The difference could become more significant if hosts span multiple generations (we have up to 2X different weights between different hosts in our fleet).
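
A minimal sketch of the host-aware calculation follows. The IP addresses, host type names, and weights are hypothetical values chosen only so that the output matches the 37.5%/62.5% and 36%/64% ratios quoted above; they are not the values from the figure.

```python
# Hypothetical discovery data: which host type backs each instance IP.
HOST_TYPE_BY_IP = {
    "10.0.1.1": "type_a", "10.0.1.2": "type_a", "10.0.1.3": "type_a",
    "10.0.2.1": "type_a", "10.0.2.2": "type_a", "10.0.2.3": "type_a",
    "10.0.2.4": "type_b", "10.0.2.5": "type_b",
}
# Hypothetical relative performance weight per host type.
WEIGHT_BY_HOST_TYPE = {"type_a": 1.2, "type_b": 1.4}

# Instances of service Foo per cluster (3 and 5 instances).
INSTANCES_BY_CLUSTER = {
    "cluster-1": ["10.0.1.1", "10.0.1.2", "10.0.1.3"],
    "cluster-2": ["10.0.2.1", "10.0.2.2", "10.0.2.3", "10.0.2.4", "10.0.2.5"],
}


def cluster_shares(instances_by_cluster, weight_of):
    """Traffic share per cluster, proportional to summed instance weights."""
    capacity = {
        cluster: sum(weight_of(ip) for ip in ips)
        for cluster, ips in instances_by_cluster.items()
    }
    total = sum(capacity.values())
    return {cluster: value / total for cluster, value in capacity.items()}


# Treating every instance equally versus using per-host weights.
equal = cluster_shares(INSTANCES_BY_CLUSTER, lambda ip: 1.0)
weighted = cluster_shares(
    INSTANCES_BY_CLUSTER,
    lambda ip: WEIGHT_BY_HOST_TYPE[HOST_TYPE_BY_IP[ip]],
)
print({c: round(s, 3) for c, s in equal.items()})     # 0.375 / 0.625
print({c: round(s, 3) for c, s in weighted.items()})  # 0.36 / 0.64
```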

+ + +

Compared with the static weight approach, the host-aware load balancing adjusts weight per service dynamically to reduce inter-cluster imbalance. It’s also much easier to maintain, as new host types are introduced infrequently.

+ + +

Intra-cluster imbalance

+ + +

The intra-cluster imbalance, as explained earlier, is the responsibility of the on-host proxy (called Muttley). Each proxy had complete control of selecting the right peer for each request. The original load-balancing algorithm for Muttley used by all services was least-pending, which would send requests to the peer with the smallest number of known outstanding requests. While this resulted in almost perfect balancing of RPS when measured in 1-minute intervals, it still resulted in an imbalance of CPU utilization due to different hardware types.

+ + +

Assisted Load Balancing (ALB)

+ + +

Figure 10: Assisted Load Balancing in a nutshell.

+ +

+ + +

We built a system where each backend assists the load balancer in selecting the next peer. An application middleware layer attaches load metadata as a header to each response. We effectively arrive at a coordinated system without central coordination. Where previously, each Muttley only knew about the load it caused (plus some information it could infer from the latencies), now, it learns about the total state of each backend dynamically. This state is affected not only by the backend itself (for example, running on slower hardware) but also by decisions made by other Muttleys. For example, if a backend is (randomly) selected into too many subsets, the system adjusts dynamically. This let us later on reduce the subset sizes for services on ALB.

+ + +

While a brief mention in the Google SRE book partially inspired this approach, we made two different choices. The two changes were related to each other and were attempts to simplify the approach. We intended to start, evaluate, and move to a more complicated solution later–luckily, we didn’t have to. Late in the implementation, we discovered a Netflix blog post, and we had arrived at similar conclusions independently.

+ + +

Firstly, as the load metadata, we used the number of concurrent requests being processed, reported as an integer (q=1, q=2, ..., q=100, etc.). We considered reporting utilization, too, but it wasn’t immediately obvious whether the reported utilization should be based on getrusage or cgroups. Cgroups were more natural since that’s what service owners were using to track their targets. Still, they presented more challenges–our foundation team was concerned about the cost of each Docker container scraping cgroups independently and about potential tight coupling if the cgroups layout were to change, including during the cgroupsv2 migration. We could have solved this by integrating with a host daemon collecting the stats, but we wanted to avoid adding a new runtime dependency. In the end, just using a logical integer worked well enough (with some tweaks, explained below). Additionally, it allowed per-service overrides without changing the load balancer code–while the vast majority of the applications use the standard load indicator, some (asynchronous) applications override it to reflect their load better.
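
As a rough sketch of the idea, a middleware can count in-flight requests and attach that count to every response. Everything named here is an assumption for illustration: the header name, the class, and the handler signature are hypothetical and do not reflect the actual framework code.

```python
import threading

# Hypothetical header name; the real wire format is not described here.
LOAD_HEADER = "x-hypothetical-load"


class LoadReportingMiddleware:
    """Wraps a handler and reports load as a header on each response."""

    def __init__(self, handler, load_provider=None):
        self._handler = handler
        self._lock = threading.Lock()
        self._in_flight = 0
        # Services may override the default with their own load indicator.
        self._load_provider = load_provider or self.concurrent_requests

    def concurrent_requests(self):
        with self._lock:
            return self._in_flight

    def __call__(self, request):
        with self._lock:
            self._in_flight += 1
        try:
            body = self._handler(request)
            # Read the load while this request still counts as in flight.
            headers = {LOAD_HEADER: str(self._load_provider())}
            return body, headers
        finally:
            with self._lock:
                self._in_flight -= 1


# Default indicator: number of requests currently being processed.
service = LoadReportingMiddleware(lambda request: "ok")
body, headers = service("ping")

# Per-service override, e.g. for an asynchronous application.
custom = LoadReportingMiddleware(lambda request: "ok", load_provider=lambda: 7)
_, custom_headers = custom("ping")
```

Keeping the load provider pluggable is what allows an override without touching the load balancer: the proxy only ever sees an integer.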

+ + +

The second departure was the power of two random choices instead of the weighted round-robin. Since we had only a single integer as the load indicator, the pick-2 implementation seemed more straightforward and safer. Similarly to the above, this worked well enough that we didn’t need to change it. This approach turned out remarkably forgiving to failures across the whole range of our applications. Apart from typical crash looping or OOMing applications, we’ve had cases of bad/buggy implementations of the middleware not causing an incident. We speculate that since the weighted round-robin is more precise and “strict,” it would have likely performed “better” in some cases but could have resulted in thundering-herd-like scenarios.
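
The power of two random choices is simple to express in code. The sketch below is a generic version of the technique, not Muttley’s implementation; the peer names and load values are invented.

```python
import random


def pick_two(peers, load_of, rng=random):
    """Power of two random choices: sample two peers, keep the less loaded."""
    if not peers:
        raise ValueError("no peers available")
    if len(peers) == 1:
        return peers[0]
    first, second = rng.sample(peers, 2)
    return first if load_of(first) <= load_of(second) else second


# Hypothetical peers with their last known load indicator.
loads = {"a": 1, "b": 5, "c": 9}
peers = list(loads)
rng = random.Random(42)
picks = [pick_two(peers, loads.__getitem__, rng) for _ in range(1000)]

# The least loaded peer wins most comparisons. The most loaded peer loses
# every comparison it takes part in, so with static loads it is never picked.
print({peer: picks.count(peer) for peer in peers})
```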

+ + +

Implementation-wise, each Muttley uses a modified moving average to keep the score of each peer over 25 previous requests–this value worked best in our testing. To arrive at meaningful numbers for lower RPS cases, we scale up each reported load by a thousand.

+ + +

An interesting problem for the pick-2 load balancer is that the “most loaded” peer would never be selected. And because we discover peer load passively, we would also never refresh its state, thus making it effectively unused until another peer gets even slower. We initially mitigated this by implementing a “loser penalty,” where every time a peer loses the selection, its “load value” is internally reduced–thus, with enough losses, the peer would be selected again. This didn’t work well for scenarios with many caller instances and low RPS, where it would sometimes take minutes for a peer to be reselected. Eventually, we changed this to a time decay, where a peer’s score is reduced based on its last selection time. We currently use a half-life of 5 seconds for score decay.
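
The scoring described in the last two paragraphs can be sketched as below. The window of 25, the scale of 1,000, and the 5-second half-life come from the text; the exact moving-average formula and the way decay is applied at selection time are assumptions.

```python
from dataclasses import dataclass

WINDOW = 25              # number of previous requests the average covers
SCALE = 1000             # scale-up applied to each reported load
HALF_LIFE_SECONDS = 5.0  # half-life used for score decay


@dataclass
class PeerScore:
    score: float = 0.0
    last_selected_at: float = 0.0

    def observe(self, reported_load):
        """Fold a newly reported load into a modified moving average."""
        sample = reported_load * SCALE
        self.score = (self.score * (WINDOW - 1) + sample) / WINDOW

    def mark_selected(self, now):
        self.last_selected_at = now

    def effective_score(self, now):
        """Score after time decay: halves every HALF_LIFE_SECONDS of idleness."""
        idle = max(0.0, now - self.last_selected_at)
        return self.score * 0.5 ** (idle / HALF_LIFE_SECONDS)


# A peer that looked heavily loaded becomes eligible again as time passes,
# even though no new load reports arrive for it.
peer = PeerScore(score=8000.0, last_selected_at=100.0)
print(peer.effective_score(now=100.0))  # 8000.0
print(peer.effective_score(now=105.0))  # 4000.0
print(peer.effective_score(now=110.0))  # 2000.0
```

Unlike a per-loss penalty, the decay depends only on elapsed time, so it behaves the same whether a caller sends one request per minute or thousands per second.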

+ + +

We also implemented a feature we internally call a “throughput reward.” This stemmed from empirical observations that the newer hardware handles concurrent requests better. We noticed that when load balancing across two peers on diverse hardware and both peers report the same “load value,” we, as expected, send more requests to the faster peer. However, the faster peer’s CPU utilization (processed=15, CPU=10%, Q=5) will remain lower than the slower peer’s (processed=10, CPU=12%, Q=5). To compensate for this, every time a peer “finishes” a request, we reduce its load slightly to push even more requests to it. The faster the peer is relative to other peers in the subset, the more “throughput rewards” it receives. This feature reduced the P99 CPU utilization by 2%.

+ + +

A significant part (the majority) of the ALB design document was committed to the possible alternatives. We seriously considered, instead of attaching the load metadata to each of the responses, using a central component to collect and distribute the data. The concern was that the metadata might consume a significant amount of available bandwidth. We internally have two systems that superficially seemed relevant. The first was the centralized health-checking system collecting health state from every container in the fleet in close to real time. The second was the real-time aggregation system described in the previous blog post.

+ + +

Re-using either turned out to be unfeasible: the health checker system could have easily collected the load status from all containers, but after collection, that system was designed to distribute the health changes infrequently–the vast majority of the time, the containers remained healthy. The load balancing indicators, however, change constantly and by design. Since we operate a flat mesh (every container can talk to every container), we would need to constantly distribute data about millions of containers to hundreds of thousands of machines or build a new aggregation and caching layer. The load-report aggregation system, similarly, was not a match–it was operating on aggregated per-cluster values at several orders of magnitude lower cardinality.

+ + +

Ultimately, we were happy with the chosen (response-header-based) approach. It was simple to implement and made the cost attribution easy – services pushing more RPS saw a higher bandwidth cost. In the absolute numbers, the cost of the extra metadata (~8 bytes per request) was almost invisible compared to the other tracing/auth metadata attached to each request.

+ + +

The latency was an interesting aspect of the “distributed” vs. “centralized” collection of the load data. Theoretically, the response header approach is close to real-time since the load is attached to each response. However, since each Muttley needs to discover this independently and then average the response over the previous responses, the discovery might take some time for low RPS-based scenarios. The health-check-based approach would require a full round trip (typically ~5s), but be distributed to all caller instances immediately.

+ + +

However, had we implemented it, we would have likely reduced the push frequency to something like 1 minute due to bandwidth concerns listed in the previous paragraph. This could have been enough to fix the hardware-induced skews but likely not other issues, like traffic spikes, slow-starting applications, or failovers. Both approaches could have likely worked slightly differently in different circumstances. Still, ultimately, we’re happy with the distributed approach–it’s easy to reason about and lacks centralized components that might fail.

+ + +

One downside of the chosen approach was that it requires cooperation from the target services. While minimal work is required, applying it to thousands of microservices would be arduous. Luckily, most applications built in the last few years at Uber used common frameworks that allowed us to plug in the required middleware quickly. Several large services were not using the frameworks, but a concurrent multi-year effort had migrated almost all services. We found the decision to bet on the framework beneficial, as it had a compounding effect–service owners had one more reason to invest in migration. By the time we got to writing this post, virtually all services were on the common frameworks.

+ + +

Static component – ALB v1.1

+ + +

The initial rollout did not meet our hardware-induced imbalance reduction goals. The primary reason was that our hardware runs heavily underutilized most of the time–we have buffers for regional failovers and weekly peaks. It turned out that with relatively low container utilization, the old hardware can burst high enough for latency differences not to be visible while consuming more CPU time. While this meant the load balancing was working much better under stress (when we needed it), it made product engineers uncomfortable with our target utilization–the imbalance looked too high off-peak.

+ + +

We added a second static component to the load balancing to address this. We utilized the fact that in our setup, the IP address of a host never changes. Since the proxy naturally knows the destination’s IP address, we only need to provide a mapping of the IP addresses to relative host performance. Because of the static nature of the data, we started adding this information as part of the build-time configuration. This weight in itself is not perfect: different applications perform differently on the same hardware type. However, combined with the ALB’s dynamic part, this worked well–we did not need to add application-specific weights.
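
One plausible way to combine the static and dynamic parts is to scale the dynamic score by the host’s relative performance. The text does not say how the two are combined, so the division below, the IP addresses, and the weights are all assumptions made for illustration.

```python
# Hypothetical build-time mapping from host IP to relative host performance.
HOST_WEIGHT_BY_IP = {
    "10.0.0.1": 1.0,  # older hardware
    "10.0.0.2": 1.5,  # newer, faster hardware
}
DEFAULT_WEIGHT = 1.0  # used when an IP is missing from the mapping


def adjusted_load(ip, dynamic_score):
    """Make a faster host look less loaded for the same dynamic score."""
    weight = HOST_WEIGHT_BY_IP.get(ip, DEFAULT_WEIGHT)
    return dynamic_score / weight


# Two peers reporting the same dynamic score: the faster host wins the
# comparison, so it receives more requests even when latencies look alike.
slow = adjusted_load("10.0.0.1", 3000.0)  # 3000.0
fast = adjusted_load("10.0.0.2", 3000.0)  # 2000.0
```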

+ + +

Testing

+ + +

A big problem during the development was testing. While we had a limited staging environment, the new solution needed to work with many parameters: some callers or callees had three instances, some three thousand. Some backends were serving fewer than 1 RPS, and some more than 1,000 RPS. Some services served a single homogeneous procedure, and others hundreds, with latencies varying from low milliseconds to tens of seconds. Ultimately, we used a dummy service in production with a set of fake load generators configured to represent a heterogeneous load. We ran over 300 simulations before finding the right parameters and attempting to roll out to production services.

+ + +

Results

+ + +

We are happy with the final results–the exact numbers depend on the service and the hardware mix within each cluster. Still, on average, we reduced P99 CPU utilization by 12%, with some services seeing benefits of over 30%. The results were better the more traffic the target service handled per backend–luckily, the largest services we cared about most were typically optimized enough. The same luck applied to onboarding–while Uber has over 4,000 microservices, onboarding the top 100 gave us the vast majority of potential reach.

+ + +

Rollout and future changes

+ + +

The rollout went well–we have not identified material bugs. The pick-2 load balancing and safe fallback were proven to be resilient. We onboarded services by tiers, region by region, trying to find representative types of services.

+ + +

ALB was rolled out to hundreds of our biggest services with minimal hiccups or changes:

+ + +

Long-lived RPC Streams. A small category of services was mixing up a small number of long-lived RPC streams with many very short-lived requests. We rolled back the onboarding there.
Slow-starting Runtimes. Around two years into the rollout, we tweaked the solution to handle slow-starting (Java) services better. These services could not serve the same request rate after startup due to JIT, but warm-up with recorded static requests was not working well enough; we needed to warm up the service with real requests at a lower rate. Here, we decided to seed each peer’s initial “weight” with a percentage of the average weight for the pool while leaving the algorithm’s core unchanged. We found this to work very well across a range of services, and we’re happy that this doesn’t require any static window settings, unlike Envoy’s slow start mode–the algorithm adjusts to a range of RPS automatically.
Data Prefetching on Startup. Another very small category of services was pre-loading static data upon startup for several minutes. Due to the peculiarities of our service publishing mechanism, instances of those services are visible in our service discovery as “unhealthy.” The old algorithm strongly preferred the healthy instances. We changed that in ALB to avoid a thundering-herd-like scenario when a service cannot start after a temporary overload (due to each instance being instantly overloaded as they become healthy sequentially). The new algorithm significantly prefers healthy instances, but, in some cases, requests might be sent to “unhealthy” nodes. This doesn’t work for these services–while the reported error was <0.01% and 0.002%, we’re exploring changes similar to the panic threshold to make this disappear entirely.
IP Address Mapping. The static mapping of IP address to server type worked well for 2+ years, but it will likely need to be adjusted as we move our workloads to the cloud.
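
The slow-start adjustment described above (seeding a new peer's initial weight with a percentage of the pool's average weight) can be sketched roughly as follows. This is a minimal illustration, not the ALB implementation: the function name `seed_weight`, the `fraction` parameter, its default of 0.1, and the assumption that a higher weight means a larger share of requests are all hypothetical.

```python
from statistics import mean

def seed_weight(pool_weights, fraction=0.1):
    """Return a starting weight for a peer that has just joined the pool.

    pool_weights: weights of the peers already in the pool.
    fraction: hypothetical share of the pool average given to the newcomer.
    """
    if not pool_weights:
        # No peers to compare against: fall back to a neutral weight (assumption).
        return 1.0
    return fraction * mean(pool_weights)

# Three warmed-up peers and one freshly started instance.
weights = {"a": 0.8, "b": 1.2, "c": 1.0}
weights["d"] = seed_weight(list(weights.values()))
```

Because the seed is relative to the pool average rather than to a fixed time window, the same setting would apply to pools with very different request rates, which is the property the text highlights.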

+ + +

Interestingly, two services overwrote the default load providers to emit custom load metrics based on background job processing. This proves that the defaults worked well for most services, but the solution was flexible enough to support other use cases.

+ + +

Summary

+ + +

Figure 11: Zone Weights rollout.

+ +

+ + +

The project delivered very significant efficiency wins. We can run our containers at higher utilization levels, and load imbalance is no longer problematic for stateless workloads. The hardware configuration improvements resulted in double wins from reduced imbalance and pure compute capacity.

+ + +

More interestingly, from the engineering blog perspective, the project also resulted in several learnings.

+ + +

The primary one was the importance of data. The problem was real, but we started the project under the wrong assumptions. We didn’t know how to measure it; once we agreed, we lacked the tools to measure it effectively, especially over the long term. Even after that, we realized the way we collected samples from the underlying infrastructure was flawed. At the same time, the data won arguments, helped us home in on issues, and helped prioritize the work with other teams. Another data lesson was to set up the data infrastructure right for the long term–it helped during the project but also before. We were able to use an existing data warehouse as a base, and now afterward we periodically get questions about the load imbalance. A link to the dashboard usually answers all the questions.

+ + +

The second lesson was to add workarounds in the right place of the stack to get the data we needed. Building proper real-time observability would have taken us months or quarters. Still, we quickly reached the right conclusions by using a targeted hack and selectively basing the observations on a sample of services. Related to that was the willingness to do a lot of manual grunt work: to build the understanding, we spent weeks staring at dashboards and verifying assumptions before we started coding. Later, when implementing ALB and Zone/Cluster weights, we started with relatively small changes, verified assumptions, and iterated to the next version.

+ + +

The third, arguably less generalizable lesson, was to trust in the platforms. We made a bet that our microservices would migrate to the common frameworks. Similarly, when implementing, we built on top of years of pre-existing investments in the platform–pre-existing tooling (dashboards, debug tooling, operational knowledge, rollout policies) was there, and we could roll out major changes reasonably quickly and safely. We built with the grain of the platform and avoided major rewrites that could have derailed the project.

+ + +

Acknowledgments

+ + +

There were many people involved in the project. We thank Avinash Palayadi, Prashant Varanasi, Zheng Shao, Hiren Panchasara, and Ankit Srivastava for their general contributions; Jeff Bean, Sahil Rihan, Vikrant Soman, Jon Nathan, and Vaidas Zlotkus for hardware help; Vytenis Darulis for observability fixes; Jia Zhan and Eric Chung for ALB reviews; Nisha Khater for the per-instance-stats project; and Allen Lu for rolling out yarpc globally.

+ + +

Logo attribution: “Scales of Justice – The Law – Lawyers and Attorneys” by weiss_paarz_photos is licensed under CC BY-SA 2.0.

Network IDS Ruleset Management with Aristotle v2

Uber — Thu, 29 Feb 2024 07:00:00 GMT

Introduction

+ + +

If you were to ask a veteran SOC (Security Operations Center) analyst about Network IDS (Intrusion Detection Systems) or IPS (Intrusion Prevention Systems), the response would probably contain phrases such as “too many alerts,” and “false positives.” At Uber, we face these same challenges of volume, accuracy, and manageability. Multiple times a day, more than 90,000 IDS rules are parsed, analyzed, updated, filtered, and deployed to our network sensors. Aristotle v2 was created to enable us to automate this process, apply induction-based intelligence extraction, and enhance rule metadata to reduce false positives and help ensure that appropriate IDS alerts receive proper attention.

+ + +

Overview

+ + +

The IDS ruleset update process at Uber involves multiple steps, as shown in Figure 1. Collating and distributing rules is straightforward and common to all Suricata™ deployments. Deciding which rules to include and how they should be modified is what happens in step 4, “Filter Rulesets,” and will be the focus of this blog.

+ + +

Figure 1: IDS Ruleset Update Process.

+ +

+ + +

Background

+ + +

IDS alerts are generated by IDS engines operating on logic governed by rules (or “rulesets”). At a basic level, IDS rules can be thought of as advanced pattern matching against network traffic and connection state. The most popular open source Network IDS engines are Suricata and Snort™. This article focuses on Suricata, but the concepts and practices can also apply to Snort.

+ + +

IDS Rule Selection Approaches

+ + +

Choosing which rules to apply to particular sensors can have a significant impact on false positive rates, undesired IDS alerts, and engine performance. For example, sensors protecting a pool of Linux® web servers don’t need to be running rules designed to detect attacks that target Windows® file sharing.

+ + +

Rule Classification

+ + +

Historically, ruleset consumers have used two big “knobs” when it comes to choosing which rules to enable or disable. The first is the “classtype,” a native rule keyword that attempts to categorize the rule using a finite set of options defined by the ruleset provider. Usually, no more than a few dozen classtype categories are defined, and common values include “trojan-activity,” “attempted-dos,” and “bad-unknown.” The second knob is the filename of the file that the rule is placed in by the ruleset provider, who will often segregate rules into different files with names like “sql.rules,” “scan.rules,” and “trojan.rules.”

+ + +

A major problem with these “knobs” is that they don’t allow for a one-to-many mapping. Each only supports a single value for a single rule. This lack of flexibility can be restrictive. For example, should a rule that detects recently seen exploit kit activity go into the “current-events.rules,” “exploit.rules,” or “web-client.rules” file (just to name a few options)? A similar challenge exists for the “classtype” field, where the activity being detected could legitimately be classified into multiple categories. These finite, blunt rule classification mechanisms are too broad to support the ruleset fine tuning flexibility needed for modern deployments.

+ + +

Manual Review

+ + +

In order to optimize rulesets for particular environments, they must be tuned. Often this results in a non-trivial, ongoing, and manual effort. In fact, some companies have a daily task of manually inspecting each new rule, deciding if it should be included, and then tuning it as necessary. However, this quickly becomes onerous and manifestly doesn’t scale, especially if existing rules have to undergo regular re-tuning as well.

+ + +

Metadata

+ + +

There exists a “metadata” keyword, supported by IDS engines like Suricata and Snort, that allows for arbitrary key-value pairs to be embedded into each rule. This can be extremely helpful in deciding which rules to enable because rules can be filtered based on the content of the metadata. Suricata will also include the metadata in the IDS alert, which can be used for more informed post-processing, decision making, and correlation.

+ + +

Metadata key-value pairs provide distinct advantages over traditional rule categorization, including:

+ + +

One-to-many mapping: For example, the “protocols” metadata key can have values “http” and “tcp”
Arbitrary key names and values: Classification doesn’t have to be limited to pre-defined, finite options
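
To make the one-to-many property concrete, here is a minimal sketch that reads the “metadata” keyword of a rule into a mapping from each key to a set of values. The helper name `parse_metadata` and the example rule (including its sid and msg) are made up for illustration, and the parsing is deliberately simplistic compared with a real rule parser.

```python
import re
from collections import defaultdict

def parse_metadata(rule: str) -> dict:
    """Collect metadata key-value pairs from a rule into key -> set of values."""
    pairs = defaultdict(set)
    # A rule may carry more than one metadata keyword; read them all.
    for body in re.findall(r"\bmetadata\s*:\s*([^;]*);", rule):
        for item in body.split(","):
            parts = item.strip().split(None, 1)
            if len(parts) == 2:
                key, value = parts
                pairs[key.lower()].add(value.strip())
    return dict(pairs)

# Hypothetical rule: the same key ("protocols") appears twice.
rule = (
    'alert http $EXTERNAL_NET any -> $HOME_NET any '
    '(msg:"Example rule"; sid:1000001; rev:1; '
    'metadata:protocols http, protocols tcp, created_at 2023-01-15;)'
)
meta = parse_metadata(rule)
```

Here `meta["protocols"]` holds both “http” and “tcp”, something a single classtype or filename cannot express.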

+ + +

A BETTER Way

+ + +

The BETTER (Better Enhanced Teleological and Taxonomic Embedded Rules) schema for key-value based IDS rule metadata was proposed in 2019. It recognized the need for one-to-many metadata mappings, and attempted to bring some structure and standardization to commonly used metadata keys and (in some cases) values. One vendor—Secureworks®—fully implemented BETTER in its Suricata ruleset offering, while other vendors such as Proofpoint ET Pro®, have rulesets with partial compatibility. BETTER never received widespread industry adoption, but its major concepts persist, and the use of metadata for ruleset filtering is still a solid strategy.

+ + +

Many ruleset providers do populate metadata, but almost all of them do so in a way that severely limits the effectiveness of using metadata as a means of rule filtering. Specifically, the rulesets have one or more of the following shortcomings:

+ + +

Missing metadata: Either applicable metadata key-value pairs are not used in the ruleset, or metadata key-value pairs are applied selectively instead of universally. Filtering rulesets based on metadata is most useful if all applicable metadata are applied to all applicable rules, and utility falls off sharply when this is not the case. For example, setting the metadata “attack-target http-server” on 20 rules in the ruleset when there are 400 more rules that could be classified the same way, makes filtering based on that key-value pair of limited value.
Inconsistent value formatting: For example, “cve” key values may appear as “cve_2023_1234,” “cve_2023_1234_cve_2023_2468,” “2023_1234,” “2023-1234,” etc. Without a normalized nomenclature, accurate filtering becomes challenging.
Poor value formatting: This includes things such as not using standard datetime formats like ISO 8601 when specifying time/date strings.

+ + +

Aristotle v1

+ + +

In 2019, Secureworks released Aristotle (v1), an open source Python tool that allowed users to “filter” (enable or disable) rules based on metadata key-value pairs. By using a concrete boolean algebra, “filter strings” can be defined to control rule selection. This can be quite powerful, but the usefulness of Aristotle v1 is limited by the richness (or rather, lack thereof) of the metadata in the provided rules, something controlled by ruleset vendors and onerous to maintain manually. Since most ruleset vendors do not provide comprehensive metadata and/or do not have metadata with the precision and consistency needed for accurate programmatic filtering, something more than Aristotle v1 is needed.

+ + +

Metadata and Beyond

+ + +

Aristotle v2

+ + +

Uber recently contributed significant improvements to Aristotle, resulting in Aristotle v2. These updates added support for metadata normalization, enhancement, and manipulation. Figure 2 shows the different components of Aristotle v2, which will be discussed in more detail.

+ + +

Figure 2: Aristotle v2 components.

+ +

+ + +

Filtering

+ + +

Aristotle v1—which is basically the “Filter Rulesets” step—did a good job of supporting boolean filtering on metadata, and even included the ability to specify numerical relationships for certain keys (e.g., “created_at > 2023-01-01”). To the list of keys that support such comparisons, Aristotle v2 added risk_score (more on that key later).

+ + +

Additionally, the ability to do regular expression based filtering was introduced in Aristotle v2. While this does impact filtering performance because it adds non-literal elements to the boolean expression, it does provide powerful and often needed capability. Specifically, regular expression matches can be applied to the entire rule with the “rule_regex” keyword, or scoped to just the “msg” field with the “msg_regex” keyword.
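
The kinds of filtering described above (boolean logic over metadata, a comparison on a date key, and a regular expression scoped to the “msg” field) can be sketched as composable predicates. This is an illustration only and does not reproduce Aristotle's filter string syntax or API: the helper names (`has`, `created_after`, `msg_regex`, `all_of`, `any_of`, `negate`), the dictionary representation of a rule, and the sample rules are all assumptions.

```python
import re
from datetime import date

def has(key, value):
    """True when the rule's metadata contains the given key-value pair."""
    return lambda rule: value in rule["metadata"].get(key, set())

def created_after(day):
    """True when any created_at value is later than the given day."""
    return lambda rule: any(
        date.fromisoformat(v) > day
        for v in rule["metadata"].get("created_at", set())
    )

def msg_regex(pattern):
    """True when the pattern matches the rule's msg field (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return lambda rule: bool(rx.search(rule["msg"]))

def all_of(*preds):
    return lambda rule: all(p(rule) for p in preds)

def any_of(*preds):
    return lambda rule: any(p(rule) for p in preds)

def negate(pred):
    return lambda rule: not pred(rule)

# Hypothetical, already-parsed rules.
rules = [
    {"sid": 1, "msg": "Example MALWARE checkin",
     "metadata": {"protocols": {"http", "tcp"}, "created_at": {"2023-03-01"}}},
    {"sid": 2, "msg": "Example SCAN probe",
     "metadata": {"protocols": {"tcp"}, "created_at": {"2021-06-10"}}},
    {"sid": 3, "msg": "Example POLICY share access",
     "metadata": {"protocols": {"smb"}, "created_at": {"2023-05-20"}}},
]

# tcp AND (created after 2023-01-01 OR msg matches "scan") AND NOT smb
keep = all_of(
    has("protocols", "tcp"),
    any_of(created_after(date(2023, 1, 1)), msg_regex(r"\bscan\b")),
    negate(has("protocols", "smb")),
)
enabled = [r["sid"] for r in rules if keep(r)]
```

The regular expression predicate is the only one that has to scan free text, which hints at why regex-based filtering costs more than literal key-value matching.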

+ + +

Normalization

+ + +

To address the filtering challenges that come from a lack of consistent metadata value format, Aristotle v2 supports the normalization of certain metadata key values. Specifically, the following normalizations are supported:

+ + +

CVE key value(s) normalized to format YYYY-<num>. If multiple CVEs are represented in the value and strung together with a “_” (e.g., “cve_2021_27561_cve_2021_27562” [sic]), then all identified CVEs will be normalized and included.
Values from the non-BETTER schema keys mitre_technique_id and mitre_tactic_id will be put into the standards compliant mitre_attack key.
Date key values (determined by any key name that ends with “_at” or “-at”, e.g., created_at) are normalized, where possible, to ISO 8601 format YYYY-MM-DD.
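
A rough sketch of the CVE and date normalizations listed above is shown below. The regular expressions and the function names (`normalize_cve`, `is_date_key`, `normalize_date`) are assumptions chosen for illustration; Aristotle v2's actual parsing rules may accept more or fewer input formats.

```python
import re
from datetime import date

# Year, separator, then the CVE sequence number (four or more digits).
CVE_RE = re.compile(r"(?<!\d)(\d{4})[-_](\d{4,})(?!\d)")
# Year, month, day with an optional "-", "_" or "/" between the parts.
DATE_RE = re.compile(r"^(\d{4})[-_/]?(\d{2})[-_/]?(\d{2})")

def normalize_cve(value: str) -> list:
    """Return every CVE found in the value, formatted as YYYY-<num>."""
    return [f"{year}-{num}" for year, num in CVE_RE.findall(value)]

def is_date_key(key: str) -> bool:
    """Date keys are recognized by their "_at" or "-at" suffix."""
    return key.endswith(("_at", "-at"))

def normalize_date(value: str):
    """Return the value as YYYY-MM-DD, or None if it is not a valid date."""
    m = DATE_RE.match(value.strip())
    if not m:
        return None
    try:
        return date(*map(int, m.groups())).isoformat()
    except ValueError:
        # Matched the shape but not a real calendar date (e.g., February 30).
        return None

cves = normalize_cve("cve_2021_27561_cve_2021_27562")
created = normalize_date("2019_01_01")
```

In this sketch a value that cannot be normalized is returned as `None` so that the caller can decide whether to keep the original string.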

+ + +

Enhancement

+ + +

While normalizing metadata is necessary and useful, it can’t address the issue of missing metadata. However, a rule is more than its metadata, so we asked the question, “can we identify, deduce, induce, or otherwise infer particulars from the rule’s ontology, and augment the rule metadata with that information?” This led us to give Aristotle v2 the ability to analyze the ontology of each rule and add/update the metadata with the following enhancements:

+ + +

flow key with values normalized to be either “to_server” or “to_client”
protocols key and applicable values
cve key and applicable values. The value(s) are based on data extracted from the raw rule, e.g., “msg” field, “reference” keyword, etc.
mitre_attack key and applicable values. The value(s) are based on data extracted from the rule’s “reference” keyword
hostile key and applicable values (“dest_ip” or “src_ip”)—the values are the inverse of values taken from the “target” keyword
classtype key and applicable values
filename key and applicable values—the value will be the filename the rule came from, if the rule was loaded from a file
originally_disabled key and boolean value get added on each rule internally, to be used for filtering
detection_direction key (see below)
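
A few of the enhancements above can be sketched as simple inference over the raw rule text. The sketch covers only the protocols, flow, cve, and hostile keys; the function name `infer_metadata`, the regular expressions, taking the protocol from the rule header, and the example rule are assumptions, and real rules have many syntactic variations this does not handle.

```python
import re

def infer_metadata(rule: str) -> dict:
    """Infer a handful of metadata keys from the body of a raw rule."""
    inferred = {}

    # protocols: take the protocol field from the rule header (assumption).
    header = rule.split("(", 1)[0].split()
    if len(header) >= 2:
        inferred["protocols"] = {header[1].lower()}

    # flow: collapse the flow options to "to_server" or "to_client".
    flow = re.search(r"\bflow\s*:\s*([^;]+);", rule)
    if flow:
        options = {o.strip() for o in flow.group(1).split(",")}
        if options & {"to_server", "from_client"}:
            inferred["flow"] = {"to_server"}
        elif options & {"to_client", "from_server"}:
            inferred["flow"] = {"to_client"}

    # cve: read CVE identifiers from "reference" keywords.
    cves = re.findall(
        r"\breference\s*:\s*cve\s*,\s*(?:CVE-)?(\d{4}-\d{4,})\s*;",
        rule, re.IGNORECASE)
    if cves:
        inferred["cve"] = set(cves)

    # hostile: the inverse of the "target" keyword.
    target = re.search(r"\btarget\s*:\s*(src_ip|dest_ip)\s*;", rule)
    if target:
        inferred["hostile"] = {
            "dest_ip" if target.group(1) == "src_ip" else "src_ip"
        }
    return inferred

# Hypothetical rule used only to exercise the sketch.
rule = (
    'alert http $EXTERNAL_NET any -> $HOME_NET any '
    '(msg:"Example exploit attempt"; flow:established,from_client; '
    'reference:cve,2021-27561; target:dest_ip; sid:1000002; rev:1;)'
)
inferred = infer_metadata(rule)
```

The inferred values would then be merged into any metadata the rule already carries, so that filtering works even when the ruleset provider supplied none.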

+ + +

Detection Direction

+ + +

While network IDS rules can be bidirectional, the overwhelming majority of them are written to target just one side of the client-server communication. Additionally, rules are typically scoped by specifying IP address groups for the source and destination. IP address groups are user-defined but almost always include the variables “HOME_NET” and “EXTERNAL_NET.” The idea is that HOME_NET is the group of IP addresses owned by the user or company, and intended to be protected; and EXTERNAL_NET is the group of IPs “outside” the user’s network, typically the general Internet. EXTERNAL_NET is often (but not necessarily) defined as everything that isn’t specified in HOME_NET.

+ + +

The detection_direction metadata key attempts to normalize the directionality of traffic on which the rule detects. To do this, the source and destination sections of the rule are processed and reduced down to “$HOME_NET”, “$EXTERNAL_NET”, “any”, or “UNDETERMINED”, and used to set the detection_direction value as seen in Figure 3.

+ + +

Figure 3: detection_direction values and conditions.
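
The reduction step described above can be sketched as follows. The figure remains the authoritative mapping: the direction labels used here (“inbound”, “outbound”, “internal”, “any”, “both”, “undetermined”) and the conditions attached to them are illustrative assumptions, as are the function names.

```python
# Hypothetical labels; Figure 3 defines the real values and conditions.
DIRECTIONS = {
    ("$EXTERNAL_NET", "$HOME_NET"): "inbound",
    ("$HOME_NET", "$EXTERNAL_NET"): "outbound",
    ("$HOME_NET", "$HOME_NET"): "internal",
    ("any", "any"): "any",
}

def reduce_group(token: str) -> str:
    """Reduce a source or destination field to one of four buckets."""
    token = token.strip()
    if token in ("$HOME_NET", "$EXTERNAL_NET", "any"):
        return token
    # Negations, lists, and literal addresses are not resolved in this sketch.
    return "UNDETERMINED"

def detection_direction(rule: str) -> str:
    """Classify a rule by the direction of the traffic it inspects."""
    # Header layout: action protocol source port direction destination port
    tokens = rule.split("(", 1)[0].split()
    if len(tokens) != 7:
        return "undetermined"
    _, _, src, _, arrow, dst, _ = tokens
    if arrow == "<>":
        return "both"
    return DIRECTIONS.get((reduce_group(src), reduce_group(dst)), "undetermined")

inbound = detection_direction(
    'alert tcp $EXTERNAL_NET any -> $HOME_NET 23 (msg:"Example"; sid:1;)')
outbound = detection_direction(
    'alert tcp $HOME_NET any -> $EXTERNAL_NET 23 (msg:"Example"; sid:2;)')
```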

+ +

+ + +

Knowing a rule’s “detection direction” is important in being able to determine the significance and seriousness of what it is detecting. For example, consider a rule that detects traffic known to be generated by devices infected with the Mirai malware. Such traffic seen inbound (coming from EXTERNAL_NET and directed to HOME_NET) can usually be classified as scanning and considered to be little more than Internet noise. Yet such traffic seen outbound (coming from HOME_NET and directed to EXTERNAL_NET), is a good indication that there is an infected device on your network and it is part of an active botnet. The latter case is more serious than the former and should be treated as such. The rule and its associated IDS alert need to be able to communicate these realities. Accurately classifying rules and their IDS alerts so that they can be programmatically responded to is important, and this is where Post Filter Modification comes into play.

+ + +

PFMod (Post Filter Modification)

+ + +

Aristotle v2 offers the option to further filter and modify the ruleset after normalization, enhancement, and initial filter string application. This is known as PFMod (Post Filter Modification) and allows for the identification of rules based on filter strings, and then particular “actions” taken on those rules.

+ + +

PFMod Actions

+ + +

PFMod actions include the ability to add/delete metadata, enable/disable rules, set priority, and do a regular expression based “find and replace” on the full rule. Supported PFMod actions include:

+ + +

disable: Disable the rule.
enable: Enable the rule. Note that for “disabled” rules to make it to PFMod for consideration, they must first match in the initial filter string matching phase.
add_metadata: YAML key-value pair where the (YAML) value is the metadata key-value pair to add, e.g., “protocols http”. If metadata using the given key already exists, it is not overwritten; the new pair is added alongside it. If the identical key-value pair is already present, nothing is added.
add_metadata_exclusive: YAML key-value pair where the (YAML) value is the metadata key-value pair to add (e.g., “priority high”). If the given metadata key already exists, overwrite it with the new value.
delete_metadata: If a metadata key-value pair is given (e.g., “former_category malware”), remove the key-value pair from the rule. If just a metadata key name is given (e.g., “former_category”), remove all metadata using the given key, regardless of the value(s).
regex_sub: Perform a regular expression find and replace on the rule.
set_<keyword>: Set the <keyword> in the IDS rule string to have the given value. If the rule does not contain the given keyword, add it and set the value to the given value. Supported keywords include “priority,” “sid,” “gid,” “rev,” “msg,” “classtype,” “reference,” “target,” “threshold,” and “flow.” For integer keywords (“priority,” “rev,” “gid,” and “sid”), relative values can be used by preceding the integer value with a ‘+’ or ‘-’. For example, the action ‘set_priority “-1”’ will cause the existing priority value in the rule to be decreased by 1.
set_<arbitrary_integer_metadata>: Similar to “add_metadata_exclusive,” allows for the setting or changing of an arbitrary integer-based metadata key value, but also supports relative values, along with default values. Format shown in Figure 4.

+ + +

Figure 4: PFMod action syntax for setting arbitrary integer-based metadata.

+ +

Notes:

+ + +

The <arbitrary_integer_metadata> string corresponds to the metadata key name and must contain at least one underscore (‘_’) character.
The metadata key being referenced should have a value corresponding to an integer.
A preceding ‘+’ or ‘-’ to the given <value> will cause the existing metadata value in the rule to be increased or decreased by the given <value>, respectively. If the metadata key does not exist, then the value will be set to the given <default> value, if provided; otherwise, no change will be made.
Examples are shown in Figure 5.

+ + +

Figure 5: Example PFMod actions setting arbitrary integer-based metadata key-value pairs.
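
The semantics in the notes above (absolute values, relative values with a leading sign, and the optional default) can be sketched in a few lines. This does not reproduce the PFMod action syntax from Figure 4; the function name `set_integer_metadata` and its argument layout are assumptions, and metadata is modeled as a plain dictionary with a single integer per key.

```python
from typing import Optional

def set_integer_metadata(metadata: dict, key: str, value: str,
                         default: Optional[int] = None) -> None:
    """Set or adjust an integer metadata value in place."""
    if "_" not in key:
        raise ValueError("metadata key name must contain an underscore")
    relative = value[:1] in ("+", "-")
    if not relative:
        # Absolute value: set or overwrite.
        metadata[key] = int(value)
    elif key in metadata:
        # Relative value: int("+10") is 10 and int("-5") is -5.
        metadata[key] = int(metadata[key]) + int(value)
    elif default is not None:
        # Key missing: fall back to the default when one is given.
        metadata[key] = default
    # Key missing and no default: leave the metadata unchanged.

meta = {"risk_score": 50}
set_integer_metadata(meta, "risk_score", "+10")   # 50 -> 60
set_integer_metadata(meta, "risk_score", "-25")   # 60 -> 35

fresh = {}
set_integer_metadata(fresh, "risk_score", "+10", default=40)

untouched = {}
set_integer_metadata(untouched, "risk_score", "+10")
```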

+ +

PFMod Rules

+ + +

PFMod conditions and actions are controlled by PFMod rules (not to be confused with IDS “rules”). PFMod rules are defined in YAML files and are processed in a depth-first, linear fashion. This means that you can define actions that apply broadly to many (or all) rules, and then have more specific PFMod rules that apply more precise actions to subsets of those rules. As shown in Figure 6, PFMod rules files can “include” other PFMod rules files for easy organization.
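
The broad-then-specific ordering can be sketched as an ordered list of condition and action pairs applied one after another. This is an illustration only: it ignores include handling, it does not use the PFMod YAML format, and the names (`apply_pfmod`, `set_score`, `bump_score`), the score values, and the “outbound” direction label are all hypothetical.

```python
def is_malware(rule):
    return "malware" in rule["msg"].lower()

def is_outbound_malware(rule):
    return (is_malware(rule)
            and rule["metadata"].get("detection_direction") == "outbound")

def set_score(value):
    def action(rule):
        rule["metadata"]["risk_score"] = value
    return action

def bump_score(delta):
    def action(rule):
        current = rule["metadata"].get("risk_score", 0)
        rule["metadata"]["risk_score"] = current + delta
    return action

# Evaluated top to bottom: a broad baseline first, refinements afterwards.
PFMOD_RULES = [
    (lambda rule: True, set_score(10)),
    (is_malware, set_score(50)),
    (is_outbound_malware, bump_score(30)),
]

def apply_pfmod(rule, pfmod_rules=PFMOD_RULES):
    """Apply every matching action to the rule, in order."""
    for matches, action in pfmod_rules:
        if matches(rule):
            action(rule)
    return rule

scan = apply_pfmod({"msg": "Example SCAN probe",
                    "metadata": {"detection_direction": "inbound"}})
bot = apply_pfmod({"msg": "Example MALWARE checkin",
                   "metadata": {"detection_direction": "outbound"}})
```

Because later entries see the result of earlier ones, the outbound malware rule ends up with the baseline replaced and then increased, while the scan rule keeps only the baseline.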

+ + +

Figure 6: Example file using include to load multiple PFMod files.

+ +

+ + +

In addition to include statements, a PFMod rule file can contain multiple rules. Figure 7 shows an example PFMod rule file that updates IDS rules and metadata.

+ + +

Figure 7: Example PFMod file with rules specified.

+ +

Risk Score

+ + +

Each Suricata rule that is deployed at Uber receives a “risk score.” These risk scores are automatically generated at ruleset compile time and applied as metadata values by PFMod rules. Rule metadata, including risk_score, are included in Suricata alerts and play an important role in event processing. More on this later.

+ + +

Maintenance

+ + +

When new rules are added to the rulesets used at Uber—which happens daily—they are automatically subjected to our existing Aristotle v2 pipeline which includes filtering and the application of non-trivial PFMod logic to shape and classify each rule accordingly. Thus, manual analysis and tweaking of each new rule is avoided by using Aristotle v2 as a reliable “set it and forget it” mechanism. Of course, judicious occasional revisiting of ruleset filtering and PFMod logic is done to align the ruleset with the current environments and expected traffic patterns.

+ + +

Documentation

+ + +

More details about Aristotle v2 and how to use it can be found in the online documentation.

+ + +

Aggregation, Correlation, Risk Score, and Alerting

+ + +

Uber processes billions of events a day, including hundreds of thousands of IDS alerts. Events come from myriad sources including log files, vendor products, in-house systems, custom detections, and sensing technologies like IDS. Security related events, referred to as “signals,” receive a “risk score” value that is represented by a single integer and associated with the signal. The risk score value for a signal plays a non-trivial role in downstream aggregation and correlation algorithms that ultimately determine if a signal or group of signals qualify for a formal meta-alert requiring a response. In other words, the “risk score” value quantifies the necessity and appropriate level of response. Depending on the level warranted, responses typically take the form of a manual investigation by a security analyst, and/or a series of programmatic actions in a Security Orchestration, Automation, and Response (SOAR) pipeline.

+ + +

The evaluation, aggregation, and correlation of signals at Uber is a sophisticated process (not to mention the response pipelines), the intricate details of which are outside the scope of this article. However, the general strategy revolves around what we call “Entity Based Alerting.” For a given time window, signals are grouped by entity (e.g., IP address, host, user, etc.) and a correlation algorithm is applied to determine if an actionable meta-alert should be created. The risk score values from individual signals play a significant part in this calculation, as they are weighted and added together, and ultimately compared against a threshold used to make a final determination. The weighting of signals—which can be thought of as adjusting risk scores up or down—is based on sundry criteria, including entity characteristics, and often involves correlation with other data sources. For example, signals for a user entity where the user is an Administrator are weighted higher than those related to a non-Administrator user. Similarly, a signal involving an IP address entity from a known sanctioned vulnerability scanner will be weighted lower, while an event involving an IP address entity that is known to be part of an isolated network responsible for financial transactions will be weighted higher. Note that given a high enough risk score and/or weighting modification, a single signal can be enough to generate a meta-alert and response.
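
The general shape of Entity Based Alerting can be sketched as follows: group the signals in a time window by entity, weight each risk score, sum per entity, and compare against a threshold. This is a simplified illustration, not Uber's correlation algorithm; the field names, tag names, weights, threshold, and the use of “greater than or equal” are all assumptions.

```python
from collections import defaultdict

def weight_for(signal) -> float:
    """Hypothetical weighting based on entity characteristics."""
    tags = signal.get("tags", set())
    if "sanctioned_scanner" in tags:
        return 0.2   # known vulnerability scanner: weight lower
    if "admin" in tags:
        return 1.5   # administrator account: weight higher
    return 1.0

def meta_alerts(signals, threshold, window_start, window_end):
    """Return entities whose weighted risk total meets the threshold."""
    totals = defaultdict(float)
    for s in signals:
        if window_start <= s["ts"] < window_end:
            totals[s["entity"]] += s["risk_score"] * weight_for(s)
    return {entity: total for entity, total in totals.items()
            if total >= threshold}

# Hypothetical signals; "ts" is seconds from the start of the day.
signals = [
    {"entity": "user:alice", "risk_score": 40, "ts": 10, "tags": {"admin"}},
    {"entity": "user:alice", "risk_score": 30, "ts": 20, "tags": {"admin"}},
    {"entity": "ip:10.0.0.5", "risk_score": 90, "ts": 15,
     "tags": {"sanctioned_scanner"}},
    {"entity": "user:bob", "risk_score": 60, "ts": 30},
    {"entity": "user:bob", "risk_score": 60, "ts": 5000},  # outside window
]
alerts = meta_alerts(signals, threshold=100, window_start=0, window_end=3600)
```

In this example only the administrator's two modest signals combine into a meta-alert, while the scanner's single high score is weighted down below the threshold.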

+ + +

In practice, aggregating and correlating signals related to common entities has proven to be an effective way to identify events and combinations of events worth responding to. With IDS alerts, Aristotle v2 plays a crucial role in being able to choose which rules are enabled, what metadata is included, and what risk score value each rule should carry. IDS alert metadata, especially the risk score value, allows Suricata alerts to be better managed so that analysts are not overwhelmed with alerts or false positives.

+ + +

Conclusion

+ + +

By using Aristotle v2 to normalize, enhance, and manipulate rule metadata, rules can be accurately described, programmatically understood, and willfully modified. Applying concrete boolean algebra against metadata key-value pairs results in powerful filtering capabilities that allow us to curate Suricata rulesets to only run applicable rules in particular environments. Rules are automatically tuned based on explicit and inferred teleological motivations and ontological realities. Custom metadata values such as “risk_score” are intelligently added to each rule which enables effective downstream correlation such that false positives are minimized and notable alerts receive appropriate attention. The result is a scalable, controllable, and accurate Suricata ruleset management and response solution.

Building Scalable, Real-Time Chat to Improve Customer Experience

Uber — Tue, 20 Feb 2024 07:00:00 GMT

Introduction

+ + +

Uber is a global business and has a customer base that’s spread throughout the world. Uber’s customer base is divided into many user personas, predominantly riders, drivers, eaters, couriers, and merchants. Being a global business, Uber’s customers also expect support at a global scale. We have customers reach out to us through various live (chat, phone) and non-live (inApp Messaging) channels, and expect swift resolutions to their issues. With millions of support interactions (known internally as contacts) being raised by Uber customers every week, our goal is to resolve these contacts within a predefined service level agreement (SLA). Contacts created by customers are resolved either via automation or with help from a customer support agent.

+ + +

For agent contacts, the cost of resolution of tickets plays an important role in how Uber structures its support channels and determines volumes across different live and non-live channels. Cost-per-contact (CPC) and FCR (first contact resolution) for the chat channel are most effective across different live channels, as they allow agents to handle multiple chat contacts concurrently while maintaining a lower average cost than channels like Phone. This channel is in the sweet spot for Uber, as it has a good CSAT score (customer satisfaction rating, measured in the range of 1 to 5) while generally reducing CPC. This channel allows for a higher automation rate, higher staffing efficiency (as agents can work on multiple chats at the same time), and high FCR, which are all beneficial to Uber while providing quality support for customers.

+ + +

Historically, from 2019 to early 2023, 1% of all contacts were served via the live chat channel, 58% were served via the in-app messaging channel (a non-live channel), and the rest were served via another live channel such as Phone. To achieve higher CSAT and FCR, the engineering team needed to scale the chat infrastructure to meet the demands of Uber’s growing business, as well as facilitate the migration of a large volume of in-app messaging and phone channel contacts to the chat channel. We will focus on the Chat live channel for this article.

+ + +

Challenges

+ + +

To scale the chat channel to support 40+% of the Uber contact volume which is routed to Agents, the following were some of the major challenges the team was facing:

+ + +

Reliability issues with delivering messages from backend systems to an agent’s browser
1. 46% of events originating from a customer trying to reach an agent were not delivered in time, resulting in delays for customers and wasted agent bandwidth. Note that 46% does not indicate the number of unique contacts here but the overall number of events across all the chat contacts.
Missing Insights
1. The observability to track the health of the chat contacts was unavailable.
2. Since agents were idle for large amounts of time but queues were also not empty, Ops was left wondering if they were overstaffed or if it was a tech issue resulting in disproportionate volumes (referred to as a tech vs staffing issue).

+ + +

Our legacy architecture was built using the WAMP protocol that was used primarily for message passing and PubSub over WebSockets to relay contact information to the agent’s machine.

+ + +

Figure 1: Describes the previous high-level flow of the chat contact from being created to being routed to an agent on the front end.

+ +

+ + +

Note: This is different from the data path involving the exchange of chat messages between the customer and Uber Support, facilitated through HTTP Server-Sent Events (SSE). For this purpose, Uber utilizes Ramen as an internal service, serving a dual role in the control and data paths. In the control path, Ramen provides bi-directional support for client-to-mobile use cases, allowing effective communication. Simultaneously, in the data path, Ramen offers SSE capabilities for client-to-web use cases.

+ + +

However, a noteworthy distinction arises in the data path, specifically for client-to-web use cases, where Ramen demonstrates a successful delivery rate of 94.5%. It operates in a unidirectional manner, prompting the necessity for new control flows. These new control flows are essential to detect and manage situations where the client is no longer responsive, thereby addressing the unidirectional limitation in the data path. In this blog, we will cover the new control flow to deliver the events from the backend to the Agent’s browser (Web) to enable the agent for the first reply.

+ + +

The team launched the E2E architecture on production, and it started to see issues. Not immediately, but as traffic scaled beyond the few tickets coming through, the team realized that the architecture could not scale beyond its initial capabilities easily and production management was not so straightforward. Listed below are some of these core issues:

+ + +

Reliability

+ + +

We were facing reliability issues with our 1.5X scaled traffic, resulting in up to 46% of events from the backend not being delivered to the Agent’s browser. This added to the customer’s wait time to speak to an agent.

+ + +

Scale

+ + +

Beyond a low RPS of around 10, the system’s performance in delivering contacts from the backend deteriorated significantly due to high memory usage or file descriptor leaks. Horizontal scalability was not supported due to limitations of the older version of the WAMP library in use, and upgrading it was a huge effort.

+ + +

Observability/ Debuggability

+ + +

The following were the major issues related to Observability:

+ + +

It was difficult to track the health of the chat contacts, that is, whether chat contacts were missing the SLA because of engineering-related or staffing-related concerns.
Chat contacts were not onboarded on the Queue-based architecture, resulting in over 8% of the chat volume not being routed to any Agent due to the agent’s attribute matching flow.
The WAMP protocol and libraries (eg1, eg2) used were deprecated and did not provide much insight into their inner workings, which made debugging much more difficult. Furthermore, we did not have Chat contact lifecycle debugging implemented end to end, and we were unable to accurately detect Chat SLA misses on the platform overall.

+ + +

Stateful

+ + +

The services were stateful, complicating maintenance and restarts, which caused spikes in message delivery time and message losses. The WebSocket proxy was added to perform authorization and because the services overall were stateful; however, it increased latency tremendously. The double socket proxy also caused issues when either side disconnected.

+ + +

Tech Requirements

+ + +

The following were some of the goals the tech team was working towards:

+ + +

Scale up Chat traffic from 1% to 40% of the overall contact volume by the end of 2023 (1.5 million tickets per week)
1. Onboard and Scale the Chat traffic on Queues to support the Insight related to Queues
2. Scale to handle over 80% of Uber’s overall contact volume by the end of 2024 (3 million tickets per week).
Reservation (connecting a customer to an agent on the first try after an agent has been identified) success via the proxy pipeline (known as the Push Pipeline) should be >= 99.5%
Build observability and debuggability over the entire Chat flow, end to end.
Stateless services that would not need recalibration if they horizontally scaled or if instances went down for any reason

+ + +

Solution

+ + +

The new architecture needed to be simple, both to improve transparency into its inner workings and to let the team scale it easily. The team decided to go ahead with the Push Pipeline: a simple, non-redundant WebSocket server that agent machines connect to in order to send and receive messages through one generic socket channel.

+ + +

High-Level Architecture

+ + +

The new architecture as it exists today is showcased below:

+ + +

Figure 2: Describes the new high-level flow following the journey of the chat contact being created through being routed to an agent on the front end.

+ +

+ + +

The architecture has the following parts:

+ + +

Front End

+ + +

Front End UI is used by agents to interact with customers. Widgets and different actions are available to agents to investigate and take appropriate actions for the customer.

+ + +

Contact Reservation

+ + +

Router is the service that finds the most appropriate match between the agent and contact. Upon finding the most suitable contact for an agent, the contact is pushed into a reserved state for the agent.

+ + +

Push Pipeline

+ + +

Upon successful reservation of the contact for the agent, the matched information is published to Apache Kafka®. On receiving this information through the socket via GraphQL subscriptions, Front End loads the contact for the agent along with all necessary widgets and actions enabling the agent to respond to the user.

+ + +

Agent State

+ + +

An agent who wants to start working goes online via a toggle on Front End, which, when triggered, updates the Agent State service with the agent’s new state.

+ + +

Edge Proxy

+ + +

Any connection between the client browser and backend services happens via the Edge Proxy which safeguards Uber services as a firewall and proxy layer.

+ + +

Ease of Operations and Better Insights

+ + +

The following are the important points:

+ + +

Onboarded the Chat traffic on the Queues where subscribed Agents will receive the contacts based on the concurrency set of the Agent’s profile. Concurrency defines the number of chat contacts that an agent can simultaneously work on.
Agent staffing to Queues becomes deterministic, and features such as SLA-based routing (prioritizing chat contacts based on Queue SLA), sticky routing (keeping reopened contacts with their agents), and priority routing (prioritizing based on different rules defined on the Queues) are supported by default.
With Queue onboarding, dashboards were repurposed/enhanced for Ops teams to view Chat Queues SLA and agent availability & their real-time status, including the contact lifecycle states, Queue inflow/outflow, agent’s session counts, etc.

+ + +

GQL Subscription Service

+ + +

The major highlights related to the GraphQL subscriptions are:

+ + +

Reconnection on Disconnection

+ + +

We have enabled ping-pong on the GraphQL subscription socket to make sure that the socket is disconnected automatically when the connection is unreliable. When the socket is disconnected, the respective agent becomes ineligible to receive new contacts. WebSocket reconnection is reattempted automatically. Upon successful reconnection, all the reserved/assigned contacts are fetched so the agent can accept them.

+ + +

Push Pipeline Reliability

+ + +

If the front end does not send an acknowledgment back to the chat service for a contact reserved for a given agent, we try to reserve the same contact for another available agent. We also check that both the WebSocket and HTTP protocols are working properly for the agent’s browser: a heartbeat is sent over the GraphQL subscription, and the agent’s browser responds to it via an HTTP API call, which tells us whether the agent is online.
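
The acknowledgment-driven re-reservation described above can be sketched roughly as follows. This is a minimal illustration, not the chat service's actual code: `reserve_with_ack` and the `push_and_wait_for_ack` hook are hypothetical names, and the hook stands in for "push over the socket, then wait a bounded time for the front end's acknowledgment".

```python
from typing import Callable, Iterable, Optional


def reserve_with_ack(
    contact_id: str,
    candidate_agents: Iterable[str],
    push_and_wait_for_ack: Callable[[str, str], bool],
) -> Optional[str]:
    """Return the first agent whose browser acknowledges the pushed contact.

    push_and_wait_for_ack(agent_id, contact_id) is a hypothetical hook that
    pushes the reservation and returns True only if an acknowledgment
    arrives within the allowed window.
    """
    for agent_id in candidate_agents:
        if push_and_wait_for_ack(agent_id, contact_id):
            return agent_id  # acknowledged: the contact stays with this agent
        # No acknowledgment: fall through and try the next available agent.
    return None  # nobody acknowledged; the caller can requeue the contact


# Usage: agent "a1" has an unresponsive tab, "a2" acknowledges.
responsive = {"a2"}
winner = reserve_with_ack("c-42", ["a1", "a2", "a3"], lambda a, c: a in responsive)
```

Passing the push step in as a callable keeps the retry policy separate from the transport, which also makes the flow easy to exercise without a real socket.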

+ + +

Technical Choices

+ + +

Outlined below are some of the tech choices we made to improve the reliability and robustness of the chat system, while also considering the end-to-end latency impact of our choices on the user’s perceived wait time. For this, we needed to keep this system simplified, while enabling select product enhancements.

+ + +

Using GraphQL over websocket with GraphQL subscriptions

+ + +

The front-end team uses GraphQL extensively for HTTP calls on its front-end services. This led the team to select GraphQL subscriptions for pushing data from the server to the client. The client sends messages to the server via subscription requests, and the server, on matching queries, sends messages back to the agent machines. GraphQL subscriptions are covered in more detail in the sections below.

+ + +

The graphql-ws library gave us confidence because it had 2.3m weekly downloads, was recommended by Apollo, and had 0 open issues. It is also modeled on the standard GraphQL over WS protocol and aligns its options completely over it, making it an ideal candidate for usage here.

+ + +

Stateless services

+ + +

The new services needed to be stateless so that they could scale horizontally without needing to rebalance every now and then.

+ + +

Websocket without HTTP fallback

+ + +

Since the system required bidirectional communication between agent machines and the proxy layer, having HTTP fallback would not really make any difference to the SLAs of the system. Hence, the team focused on increasing the availability of socket connections with the proxy via:

+ + +

Bidirectional ping pong messages to prevent hung sockets
Backed-off reconnects after disconnects, to prevent concurrent reconnects from overwhelming the service
Single proxy to connect sockets without any handover
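
As an illustration of the backed-off reconnects in the list above, here is a minimal sketch of exponential backoff with full jitter. The post does not say which backoff formula is used, so the function name, the base delay, and the cap below are all assumptions chosen for the example.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before reconnect number `attempt` (0-based).

    The upper bound doubles with every attempt up to `cap_seconds`, and the
    actual delay is drawn uniformly below that bound so that many clients
    disconnected at the same moment do not all reconnect together.
    """
    if rng is None:
        rng = random.Random()
    upper = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, upper)


# Usage: delays for the first five reconnect attempts of one client.
delays = [reconnect_delay(attempt) for attempt in range(5)]
```

The randomization is what spreads out concurrent reconnects; the exponential growth only bounds how aggressively a single client retries.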

+ + +

Using Apache Kafka® as a message service on the Backend

+ + +

The contact messages already flowed through various services via Kafka before reaching the proxy layer. It was decided to continue and extend the usage of Kafka, as it was reliable, fast, and supported broadcasting (PubSub) capabilities.

+ + +

Testing & Launch

+ + +

We have performed both functional and non-functional tests to ensure both customers and agents are provided with the best experience end to end. To predict performance, a few of the tests that were done before launch were:

+ + +

Load tests

+ + +

Around 10K socket connections could be established from a single machine, and this capacity scales horizontally as we add more machines. We successfully tested pushing events at 20X the rate of the old stack.

+ + +

Shadow traffic flows

+ + +

Existing traffic was directed through both the old system and the new pipeline to test its capacity with 40,000 contacts and 2,000 agents daily. This process revealed no problems, and data metrics showed that latency and availability were satisfactory and met the desired thresholds.

+ + +

Reverse shadow traffic flows

+ + +

Existing traffic was directed through the new system with the old user interface for agents, serving as a crucial reliability test. This was the initial use of the new system, and it successfully managed the traffic while maintaining latency within the defined SLAs.

+ + +

As we went along, we encountered unique system and agent behavior issues and made fixes to increase reliability and reduce latency on the pipeline overall. Some of the major issues were:

+ + +

Deletion of cookies from the browser

+ + +

Browser cookies, when cleared, created issues related to auth and subsequent API failures, which prevented the pushed events from being acted upon by the front end. Agents used to remain online without working on any contacts in such cases.

+ + +

Bugs in Auto-Logout Flows

+ + +

Agents were sometimes not logged out because of issues such as out-of-order or missing events. Agents who finished their work for the day remained online in the system if they simply closed their tabs. This increased customer wait times, as the pipeline tried to push events to agents who weren’t actually online. We then started automatically logging agents out based on recent acknowledgment misses, and tracing logouts to the right causes to improve confidence in the system.

+ + +

Results

+ + +

The Chat channel has been able to scale to about 36% of the overall Uber contact volume that is routed to agents, with more coming in the months ahead. The team appears to have regained trust in scaling the Chat channel and in improving the overall customer experience around it. The team was also able to massively improve reliability, with the error rate for delivering a contact dropping from around 46% in the old stack to roughly 0.45% in the new one. With each failed delivery, the customer’s ticket bounced back with a 30-second delay before delivery was retried, so bringing the failure rate down this far at scale was a massive win for customer and agent experience overall.

+ + +

We’ve also had other wins in this area, with perhaps the best one being simplicity. The new architecture has fewer services, fewer protocols, and better observability built into the system for visibility into contact delivery metrics, delay within the system, and end-to-end latency.

+ + +

Conclusion and Next Steps

+ + +

The new push pipeline enables the team to onboard other push use cases and opens up doors to improve user experience by providing real-time information for agents to act upon. Some use cases relating to Greenlight appointments and agent work overlaps on contacts will soon move on this new stack as a part of the next phase.

+ + +

We will also keep improving the user experience for the Chat channel as a whole, focusing on both enhancements and system architecture adjustments. This work will be driven by learnings from the expansion of the product and by issues reported by customers and agents.

+ + +


How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache

Uber — Thu, 15 Feb 2024 07:00:00 GMT

Introduction

+ + +

Docstore is Uber’s in-house, distributed database built on top of MySQL®. Storing tens of PBs of data and serving tens of millions of requests/second, it is one of the largest database engines at Uber and is used by microservices from all business verticals. Since its inception in 2020, the number of Docstore users and use cases has been growing, and so have the request volume and data footprint.

+ + +

The growing number of demands from business verticals and offerings introduces complex microservices and dependency call graphs. As a result, applications demand low latency, higher performance, and scalability from the database, while simultaneously generating higher workloads.

+ + +

Challenges

+ + +

Most of the microservices at Uber use databases backed by disk-based storage in order to persist data. However, every database faces challenges serving applications that require low-latency read access and high scalability.

+ + +

This came to a boiling point when one use case required much higher read throughput than any of our existing users. Docstore could have accommodated their needs, as it is backed by NVMe SSDs, which provide low latency and high throughput. However, using Docstore in the above scenario would have been cost prohibitive and would have posed many scaling and operational challenges.

+ + +

Before diving into the challenges, let’s understand the high-level architecture of Docstore.

+ + +

Docstore Architecture

+ + +

Docstore is mainly divided into three layers: a stateless query engine layer, a stateful storage engine layer, and a control plane. For the scope of this blog, we will talk about its query and storage engine layers.

+ + +

The stateless query engine layer is responsible for query planning, routing, sharding, schema management, node health monitoring, request parsing, validation, and AuthN/AuthZ.

+ + +

The storage engine layer is responsible for consensus via Raft, replication, transactions, concurrency control, and load management. A partition is typically composed of MySQL nodes backed by NVMe SSDs, which are capable of handling heavy read and write workloads. Additionally, data is sharded across multiple partitions containing one leader and two follower nodes using Raft for consensus.

+ + +

Figure 1: Docstore architecture.

+ +

+ + +

Now let’s look at some of the challenges faced when services demand low-latency reads at a high scale:

+ + +

Speed of data retrieval from disk has a threshold: There’s a limit to how far one can optimize application data models and queries to improve database latency and performance. Beyond that, squeezing out more performance is not possible.
Vertical scaling: Assigning more resources or upgrading to better hosts with higher performance has its limitations where the database engine itself becomes a bottleneck.
Horizontal scaling: Splitting shards further across more partitions helps solve the challenges to an extent; however, doing so is an operationally more complex and lengthy process. We have to ensure data durability and resiliency without any downtime. Also, this solution doesn’t fully solve the issue of hot keys/partitions/shards.
Request imbalance: Oftentimes the incoming rate of read requests is orders of magnitude higher than write requests. In such cases, the underlying MySQL node will struggle to keep up with the heavy workload and further impact latencies.
Cost: Vertical and horizontal scaling to improve latencies are costly in the long term. Costs are multiplied 6x to handle each of the 3 stateful nodes across both regions. Additionally, scaling doesn’t fully address the problem.

+ + +

To overcome this, microservices make use of caching. At Uber, we provide Redis™ as a distributed caching solution. A typical design pattern for microservices is to write to both the database and the cache while serving reads from the cache for improved latencies. However, this approach has the following challenges:

+ + +

Each team has to provision and maintain their own Redis cache for their respective services
Cache invalidation logic is implemented separately within each microservice
In case of region failover, services either have to maintain caching replication to stay hot or suffer higher latencies while the cache is warming up in other regions

+ + +

Individual teams have to expend a large amount of effort to implement their own custom caching solutions with the database. It became imperative to find a better, more efficient solution that not only serves requests at low latency, but is also easy to use and improves developer productivity.

+ + +

CacheFront

+ + +

We decided to build an integrated caching solution, CacheFront for Docstore, with the following goals in mind:

+ + +

Minimize the need for vertical and/or horizontal scaling to support low-latency read requests
Reduce resource allocation to the database engine layer; caching can be built from relatively cheap hosts, so overall cost efficiency is improved
Improve P50 and P99 latencies, and stabilize read latency spikes during microbursts
Replace most of the custom-built caching solutions that were (or will be) built by the individual teams to answer their needs, especially in the cases where the caching is not the core business or competency of the team
Make it transparent by reusing the existing Docstore client, so that users benefit from caching without any additional boilerplate
Increase developer productivity and allow us to release new features or replace the underlying caching technology transparently to customers
Detach caching solution from Docstore’s underlying sharding scheme to avoid problems that arise from hot keys, shards, or partitions
Allow us to horizontally scale out caching layer, independently of the storage engine
Move ownership of maintaining and being on call for Redis from feature teams to the Docstore team

+ + +

CacheFront Design

+ + +

Docstore Query Patterns

+ + +

Docstore supports different ways to query, by either primary key or partition key, optionally filtering the data. At a high level, the query patterns can be divided into the following:

+ + +

Key-type / Filter	No Filter	Filter by WHERE clause
Rows	ReadRows	–
Partitions	ReadPartition	QueryRows

+ + +

We wanted to build our solution incrementally, beginning with the most common query patterns. It turned out that more than 50% of the queries coming to Docstore are ReadRows requests, and since this also happened to be the simplest use case–no filters and point reads–it was a natural place to start the integration.

+ + +

High-Level Architecture

+ + +

Since Docstore’s query engine layer is responsible for serving reads and writes to clients, it is well suited to integrate the caching layer. It also decouples the cache from disk-based storage, allowing us to scale either of them independently. The query engine layer implements an interface to Redis for storing cached data along with a mechanism to invalidate cached entries. A high-level architecture looks like the following:

+ + +

Figure 2: CacheFront design.

+ +

+ + +

Docstore is a strongly consistent database. Although integrated caching provides faster query responses, some of the semantics around consistency may not be acceptable for every microservice while using cache. For example, cache invalidation may fail or lag behind database writes. For this reason, we made integrated caching an opt-in feature. Services can configure cache usage on a per-database, per-table, and even per-request basis.

+ + +

If certain flows require strong consistency (such as getting items in an eater’s cart) then the cache can be bypassed, whereas other flows with low write throughput (such as fetching a restaurant’s menu) would benefit from the cache.

+ + +

Cached Reads

+ + +

CacheFront uses a cache aside strategy to implement cached reads:

+ + +

Query engine layer gets a read request for one or more rows
If caching is enabled, try getting rows from Redis; stream response to users
Retrieve remaining rows (if any) from the storage engine
Asynchronously populate Redis with the remaining rows
Stream remaining rows to users
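
The steps above can be sketched as a small cache-aside read function. This is an illustration only: plain dictionaries stand in for Redis and the storage engine, the function name `read_rows` is hypothetical, and the cache is populated inline here, whereas the post describes that step as asynchronous.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def read_rows(
    keys: List[str],
    cache: Dict[str, Row],
    storage: Dict[str, Row],
    caching_enabled: bool = True,
) -> Tuple[Dict[str, Row], List[str]]:
    """Cache-aside read. Returns (rows found, keys served from the cache)."""
    rows: Dict[str, Row] = {}
    cache_hits: List[str] = []

    # Step 2: if caching is enabled, serve whatever the cache already has.
    if caching_enabled:
        for key in keys:
            if key in cache:
                rows[key] = cache[key]
                cache_hits.append(key)

    # Step 3: fetch the remaining rows from the storage engine.
    for key in keys:
        if key in rows:
            continue
        row = storage.get(key)
        if row is None:
            continue  # row does not exist in the database
        rows[key] = row
        # Step 4: populate the cache (asynchronous in the real system).
        if caching_enabled:
            cache[key] = row

    return rows, cache_hits


# Usage: "a" is cached, "b" is only in storage, "c" does not exist.
storage = {"a": {"name": "A"}, "b": {"name": "B"}}
cache = {"a": {"name": "A"}}
rows, hits = read_rows(["a", "b", "c"], cache, storage)
```

After this call, a second read for "b" would be served from the cache without touching storage.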

+ + +

Figure 3: CacheFront read path.

+ +

+ + +

Cache Invalidation

+ + +

“There are only two hard things in Computer Science: cache invalidation and naming things.”
+ + +
+– Phil Karlton

+ + +

Although the caching strategy in the previous section may seem simple, many details had to be considered in order to ensure the cache would work, especially cache invalidation. Without any explicit cache invalidation, cache entries expire with the configured TTL (by default, 5 minutes). While this may be OK in some cases, most users expect changes to be reflected faster than the TTL. The default TTL could be lowered; however, this would reduce our cache hit rate without meaningfully improving consistency guarantees.

+ + +

Conditional Update

+ + +

Docstore supports conditional updates, where one or more rows can be updated based on a filter condition, for example, updating the holiday schedule for all restaurant chains in a specified region. Since the results of a given filter can change, our caching layer can’t determine which rows would be affected by a conditional update until the actual rows are updated in the database engine. Because of this, we can’t invalidate and populate cached rows for conditional updates in the stateless query engine layer’s write path.

+ + +

Leveraging Change Data Capture for Cache Invalidation

+ + +

To fix this, we leveraged Docstore’s change data capture and streaming service, Flux. Flux tails the MySQL binlog events for each of the clusters in our storage engine layer and publishes the events to a list of consumers. Flux powers Docstore CDC (Change Data Capture), replication, materialized views, data lake ingestion, and validating data consistency among nodes in a cluster.

+ + +

A new consumer was written, which subscribes to data events and either invalidates or upserts the new rows in Redis. Now with this invalidation strategy, a conditional update will result in database change events for affected rows, which will be used to invalidate or populate rows in cache. As a result, we were able to make the cache consistent within seconds of the database change, as opposed to minutes. Additionally, by using binlogs, we don’t run the risk of letting uncommitted transactions pollute the cache.

+ + +

The final read and write path with cache invalidation looks like the following:

+ + +

Figure 4: CacheFront read and write paths for invalidation.

+ +

+ + +

Deduplicating Cache Writes Between Query Engine and Flux

+ + +

However, the above cache invalidation strategy has a flaw. Since writes to the cache can happen concurrently from both the read path and the write path, it is possible that we inadvertently write a stale row to the cache, overwriting the newest value that was retrieved from the database. To solve this, we deduplicate writes based on the timestamp of the row set in MySQL, which effectively serves as its version. The timestamp is parsed out from the encoded row value in Redis (see the later section on the codec).

+ + +

Redis supports executing custom Lua scripts atomically using the EVAL command. Our script takes the same parameters as MSET, but it also performs the deduplication logic, checking the timestamp values of any rows already written to the cache and ensuring that the value to be written is newer. By using EVAL, all of this can be performed in a single request instead of requiring multiple round trips between the query engine layer and the cache.
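
To make the deduplication rule concrete, the comparison itself can be modeled in a few lines. This sketch is not the Lua script described above: it uses a Python dictionary in place of Redis, stores each value as a `(timestamp, value)` pair instead of the real encoded row, and the name `set_if_newer` is hypothetical. In Redis, the check and the write would run inside one script so that they are atomic.

```python
from typing import Dict, Tuple

CachedRow = Tuple[int, str]  # (row timestamp used as version, encoded value)


def set_if_newer(cache: Dict[str, CachedRow], key: str, row_ts: int, value: str) -> bool:
    """Write the row only if it is newer than what the cache already holds.

    Returns True if the cache was updated, False if the write was dropped
    because the cached row has the same or a newer timestamp.
    """
    current = cache.get(key)
    if current is not None and current[0] >= row_ts:
        return False  # stale or duplicate write: keep the cached row
    cache[key] = (row_ts, value)
    return True


# Usage: the CDC consumer writes version 20 first, then a slow read-path
# write arrives carrying the older version 10 and is dropped.
cache: Dict[str, CachedRow] = {}
set_if_newer(cache, "row:1", 20, "new")
dropped = not set_if_newer(cache, "row:1", 10, "old")
```

Treating equal timestamps as duplicates means a repeated write of the same version is a no-op, which is safe because both writers would carry the same row contents.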

+ + +

Stronger Consistency Guarantees for Point Writes

+ + +

While Flux allows us to invalidate the cache much faster than if we were relying solely on Redis TTLs for expiration of cached entries, it still provides us with eventual consistency semantics. Yet some use cases require stronger consistency, such as reading-own-writes, so for these scenarios we added a dedicated API to the query engine that lets our users explicitly invalidate the cached rows after the corresponding writes have completed. This allowed us to provide stronger consistency guarantees for point writes, but not for conditional updates, which are still invalidated by Flux.

+ + +

Table Schemas

+ + +

Before getting into more details about the implementation let’s define a few key terms. Docstore tables have a primary key and partition key.

+ + +

A primary key (often referred to as a row key) uniquely identifies a row in the Docstore table and enforces a uniqueness constraint. Every table must have a primary key, which can be composed of one or more columns.

+ + +

A partition key is a prefix of the entire primary key and determines which shard the row will live in. The two are not completely separate–rather, the partition key is simply a part of (or equal to) the primary key.

+ + +

Figure 5: Example Docstore schemas and data modeling.

+ +

+ + +

In the example above, person_id is both the primary key and the partition key for the person table. For the orders table, cust_id is the partition key, and cust_id and order_id together form the primary key.

+ + +

Redis Codec

+ + +

Since we primarily cache row reads, we can uniquely identify a row value by its row key. Since Redis keys and values are stored as strings, we need a special codec to encode the MySQL data in a format that Redis accepts.

+ + +

The following codec was settled on, as it allows cache resources to be shared by different databases while still maintaining data isolation.

+ + +

+ +

Figure 6: CacheFront Redis codec.

+ +

+ + +

Features

+ + +

After completing the high-level design, our solution was functional. Now it was time for us to think about scale and resiliency:

+ + +

How to verify consistency between the database and cache in real time
How to tolerate zone/region failures
How to tolerate Redis failures

+ + +

Compare Cache

+ + +

All this talk about improving consistency means nothing if it’s not measurable, so we added a special mode that shadows read requests to the cache. When reading back, we compare the cached and database data and verify that they are the same. Any mismatches–either stale rows present in the cache or rows present in the cache, but not the database–are logged and emitted as metrics. With the addition of cache invalidation using Flux, the cache is 99.99% consistent.
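
A rough sketch of such a comparison pass is shown below. It only illustrates the classification of mismatches described above; the function and metric names are hypothetical, and dictionaries stand in for Redis and the database.

```python
from collections import Counter
from typing import Dict, Iterable


def compare_cache(
    keys: Iterable[str],
    cache: Dict[str, str],
    database: Dict[str, str],
) -> Counter:
    """Classify each cached row as matching, stale, or missing in the database."""
    metrics: Counter = Counter()
    for key in keys:
        if key not in cache:
            continue  # nothing cached for this key, so nothing to compare
        if key not in database:
            metrics["cached_but_not_in_db"] += 1
        elif cache[key] != database[key]:
            metrics["stale"] += 1
        else:
            metrics["match"] += 1
    return metrics


# Usage: one matching row, one stale row, one row deleted from the database.
database = {"a": "1", "b": "2"}
cache = {"a": "1", "b": "old", "c": "3"}
metrics = compare_cache(["a", "b", "c", "d"], cache, database)
consistency = metrics["match"] / sum(metrics.values())
```

A ratio like `consistency` is the kind of figure that can be emitted as a metric and tracked over time.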

+ + +

Figure 7: Compare cache design.

+ +

+ + +

Cache Warming

+ + +

A Docstore instance spans two different geographical regions to ensure high availability and fault tolerance. The deployment is active-active, meaning requests can be issued and served in any region and all writes are replicated across regions. In case of a region failover, another region must be able to serve all requests.

+ + +

This model poses a challenge for CacheFront, since caches should always be warm across regions. If they are not, a region fail-over will increase the number of requests to the database due to cache misses from the traffic originally served in the failed region. This will prevent us from scaling down the storage engine and reclaiming any capacity, since the database load would be as high as it would have been without any caching.

+ + +

The cold cache problem can be solved with cross-region Redis replication, but it poses a problem. Docstore has its own cross-region replication mechanism. If we replicate the cache content using Redis cross-region replication, we will have two independent replication mechanisms, which could lead to cache vs. storage engine inconsistency. In order to avoid this cache inconsistency problem for CacheFront, we enhanced Redis cross-region replication components by adding a new cache warming mode.

+ + +

To ensure that the cache is always warm, we tail the Redis write stream and replicate keys to the remote region. In the remote region, instead of directly updating the remote cache, read requests are issued to the query engine layer, which, upon a cache miss, reads from the database and writes to the cache as described in the Cached Reads section of the design. Because the database is read only upon a cache miss, we also avoid unnecessarily overloading the storage engine. The response stream of read rows from the query engine layer is simply discarded, since we are not really interested in the result.

+ + +

By replicating keys instead of values, we always ensure that the data in the cache is consistent with the database in its respective region and we keep the same working set of cached rows in Redis in both regions, while also limiting the amount of cross-region bandwidth used.
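
The key-only replication idea can be sketched as follows, under simplifying assumptions: dictionaries stand in for the remote region's Redis and database, `warm_remote_cache` is a hypothetical name, and the read that would go through the query engine layer is reduced to a lookup plus a cache write.

```python
from typing import Dict, Iterable


def warm_remote_cache(
    replicated_keys: Iterable[str],
    remote_cache: Dict[str, str],
    remote_database: Dict[str, str],
) -> int:
    """Warm the remote cache from keys tailed off the local Redis write stream.

    Only keys cross regions. Values always come from the remote region's own
    database, so the remote cache stays consistent with its local data.
    Returns the number of database reads issued.
    """
    database_reads = 0
    for key in replicated_keys:
        if key in remote_cache:
            continue  # already warm: no database read needed
        database_reads += 1
        row = remote_database.get(key)
        if row is not None:
            remote_cache[key] = row
        # The read result itself is discarded; only the side effect matters.
    return database_reads


# Usage: "k2" is already warm remotely, "k3" does not exist in the remote DB.
remote_database = {"k1": "v1-remote", "k2": "v2"}
remote_cache = {"k2": "v2"}
reads = warm_remote_cache(["k1", "k2", "k3"], remote_cache, remote_database)
```

Because the value for "k1" is read from the remote database rather than copied from the source region, replication lag between regions cannot put a value into the remote cache that its own database does not yet have.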

+ + +

Figure 8: Cache warming design.

+ +

+ + +

Negative Caching

+ + +

In scenarios where many of the reads are for non-existent rows, it is better to cache the negative result instead of having a cache miss and querying the database each time. To enable this, we built negative caching into CacheFront. Similar to the regular cache population strategy, where all rows returned from the database are written to the cache, we also keep track of any rows that were queried but not returned by the database. These non-existent rows are written to the cache with a special flag. In future reads, if the flag is found, we skip the row when querying the database and do not return any data to the user for that row.
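
A minimal sketch of negative caching is shown below. The sentinel object plays the role of the "special flag"; how CacheFront actually encodes that flag in Redis is not described here, so the sentinel, the function name, and the read counter are assumptions for illustration.

```python
from typing import Dict, Optional

NOT_FOUND = object()  # stand-in for the special flag marking a non-existent row


def read_row(
    key: str,
    cache: Dict[str, object],
    database: Dict[str, str],
    stats: Dict[str, int],
) -> Optional[str]:
    """Read one row, caching the negative result when the row does not exist."""
    if key in cache:
        cached = cache[key]
        # Flag found: the row is known not to exist, so skip the database.
        return None if cached is NOT_FOUND else cached  # type: ignore[return-value]

    stats["database_reads"] = stats.get("database_reads", 0) + 1
    row = database.get(key)
    cache[key] = NOT_FOUND if row is None else row
    return row


# Usage: the second lookup of a missing row never reaches the database.
cache: Dict[str, object] = {}
stats: Dict[str, int] = {}
first = read_row("missing", cache, {"present": "v"}, stats)
second = read_row("missing", cache, {"present": "v"}, stats)
```

Negative entries need the same invalidation as positive ones: when the row is later created, the flag has to be replaced or expire, otherwise readers keep seeing "not found".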

+ + +

Sharding

+ + +

Although Redis is not heavily impacted by hot partition issues, some of Docstore’s large customers generate a very large number of read-write requests, which would be challenging to cache in a single Redis cluster, since a cluster is typically limited in the maximum number of nodes it can have. To mitigate this, we allow a single Docstore instance to map to multiple Redis clusters. This also helps avoid a complete database meltdown, in which a large number of requests hit the database because multiple nodes in a single Redis cluster are down and the cache is unavailable for certain ranges of keys.

+ + +

However, even with data sharded across multiple Redis clusters, a single Redis cluster going down may create a hot-shard issue on the database. To mitigate this, we decided to shard Redis clusters by partition key, using a scheme that is different from the database sharding scheme in Docstore. Now we can avoid overloading a single database shard when a single Redis cluster goes down. All requests from a failed Redis shard will be distributed among all database shards, as shown below:

+ + +

Figure 9: Redis sharding request flows.
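
The effect of using two different sharding schemes can be illustrated with a small sketch. The hash functions, salts, and shard counts below are assumptions made for the example, not Docstore's or CacheFront's actual schemes; the point is only that keys owned by one failed Redis cluster spread across all database shards.

```python
import hashlib
from typing import Iterable, Set


def _bucket(salt: str, key: str, buckets: int) -> int:
    """Map a key to a bucket using a salted hash (illustrative scheme only)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def database_shard(partition_key: str, num_db_shards: int) -> int:
    return _bucket("db", partition_key, num_db_shards)


def redis_cluster(partition_key: str, num_clusters: int) -> int:
    # A different salt yields a placement independent of the database's.
    return _bucket("redis", partition_key, num_clusters)


def db_shards_hit_by_cluster_failure(
    partition_keys: Iterable[str],
    failed_cluster: int,
    num_clusters: int,
    num_db_shards: int,
) -> Set[int]:
    """Database shards that absorb the traffic of one failed Redis cluster."""
    return {
        database_shard(key, num_db_shards)
        for key in partition_keys
        if redis_cluster(key, num_clusters) == failed_cluster
    }


# Usage: with 4 Redis clusters and 4 database shards, see where the keys of
# failed cluster 0 land on the database side.
keys = [f"cust-{i}" for i in range(2000)]
shards_hit = db_shards_hit_by_cluster_failure(keys, 0, 4, 4)
```

If both layers used the same placement function, every key of the failed cluster would map to a single database shard, which is exactly the hot-shard problem described above.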

+ +

+ + +

Circuit Breakers

+ + +

If a Redis node goes down, we’d like to be able to short circuit requests to that node to avoid the unnecessary latency penalty of a Redis get/set request for which we have high confidence that it will fail. To achieve this, we use a sliding window circuit breaker. We count the number of errors on each node per time bucket and compute the number of errors in the sliding window width.

+ + +

Figure 10: Sliding window design.

+ +

The circuit breaker is configured to short circuit a fraction of the requests to that node, proportional to the error count. Once the maximum allowed error count is hit, the circuit breaker is tripped and no more requests can be made to the node until the sliding window passes.
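
A sliding-window breaker of this kind might look roughly like the sketch below. The class and method names are hypothetical, time is advanced explicitly with `advance()` instead of a real clock, and the random draw is passed in by the caller so that the behavior is deterministic and easy to test.

```python
from collections import deque


class SlidingWindowBreaker:
    """Per-node circuit breaker that counts errors in fixed time buckets."""

    def __init__(self, window_buckets: int, max_errors: int) -> None:
        self.max_errors = max_errors
        # One counter per time bucket; the deque drops the oldest automatically.
        self.buckets = deque([0] * window_buckets, maxlen=window_buckets)

    def record_error(self) -> None:
        self.buckets[-1] += 1

    def advance(self) -> None:
        """Start a new time bucket; the oldest bucket leaves the window."""
        self.buckets.append(0)

    def errors_in_window(self) -> int:
        return sum(self.buckets)

    def reject_fraction(self) -> float:
        """Fraction of requests to short circuit, proportional to the errors."""
        return min(1.0, self.errors_in_window() / self.max_errors)

    def allow(self, roll: float) -> bool:
        """`roll` is a uniform draw in [0, 1), e.g. from random.random()."""
        return roll >= self.reject_fraction()


# Usage: 5 errors out of a maximum of 10 short circuit about half the requests.
breaker = SlidingWindowBreaker(window_buckets=3, max_errors=10)
for _ in range(5):
    breaker.record_error()
half_open = breaker.reject_fraction()
```

Once the error count reaches `max_errors`, `reject_fraction()` is 1.0 and every request is short circuited until enough buckets have rolled out of the window.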

+ + +

Adaptive Timeouts

+ + +

We realized that it is sometimes difficult to set the right timeouts for Redis operations. A timeout that is too short causes Redis requests to fail too early, wasting Redis resources and putting extra load on the database engine. A timeout that is too long impacts the P99.9 and P99.99 latencies, and in the worst case a request may exhaust the entire timeout that is passed in the query. While it’s possible to mitigate these issues by configuring an arbitrarily low default timeout, we risk setting a timeout too low where many requests bypass the cache and go to the database or setting a timeout too high, which leads us back to the original issue.

+ + +

We needed to adjust request timeouts automatically and dynamically so that P99.99 of requests to Redis succeed within the allocated timeout, while cutting off the long tail of latencies entirely. Configuring adaptive timeouts means allowing the Redis get/set timeout value to be adjusted dynamically. With adaptive timeouts, we can set a timeout equivalent to the P99.99 latency of cache requests, thereby letting 99.99% of requests go to the cache with a fast response. The remaining 0.01% of requests, which would have taken too long, can be canceled sooner and served from the database.

+ + +

With adaptive timeouts enabled, we no longer need to tune the timeouts manually to match the desired P99 latency. Instead, we only need to set the maximum acceptable timeout limit, beyond which the framework is not allowed to go (because the maximum timeout is set by the client request anyway).
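
One simple way to derive such a timeout is to take a high percentile of recently observed cache latencies and clamp it to the configured maximum. The sketch below assumes a nearest-rank percentile over an in-memory sample; the function name and the sampling approach are illustrative, not a description of how CacheFront computes it.

```python
import math
from typing import Sequence


def adaptive_timeout_ms(
    recent_latencies_ms: Sequence[float],
    percentile: float,
    max_timeout_ms: float,
) -> float:
    """Timeout = the given percentile of recent latencies, capped at the maximum.

    With no samples yet, fall back to the maximum acceptable timeout.
    """
    if not recent_latencies_ms:
        return max_timeout_ms
    ordered = sorted(recent_latencies_ms)
    # Nearest-rank percentile: the smallest value covering `percentile` percent.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    observed = ordered[min(rank, len(ordered)) - 1]
    return min(observed, max_timeout_ms)


# Usage: latencies of 1..100 ms; the timeout follows the observed tail but
# never exceeds the limit derived from the client request.
samples = [float(ms) for ms in range(1, 101)]
timeout = adaptive_timeout_ms(samples, percentile=99, max_timeout_ms=1000.0)
capped = adaptive_timeout_ms(samples, percentile=99, max_timeout_ms=50.0)
```

Recomputing the value over a rolling sample lets the timeout tighten when Redis is healthy and loosen when latencies shift, without anyone retuning a static setting.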

+ + +

Figure 11: Adaptive timeouts latency improvements.

+ +

+ + +

Results

+ + +

So did we succeed? We originally set out to build an integrated cache that’s transparent to our users. We wanted our solution to improve latencies, scale easily, and curb load and costs on our storage engine, all while providing good consistency guarantees.

+ + +

Figure 12: Cache vs storage engine latency comparison.

+ +

+ + +

Request latencies with integrated cache are significantly better. P75 latency is down 75% and P99.9 latency is down over 67% while also limiting latency spikes, as seen above.

+ + +

Cache Invalidation using Flux and Compare cache mode help us ensure good consistency.
Since it sits behind our existing APIs, it is transparent to users and can be managed internally while still giving flexibility to users through header-based options.
Sharding and cache warming allow it to be scalable and fault tolerant. In fact, one of our largest initial use cases drives over 6M RPS with a 99% cache hit rate with a proven successful failover where all traffic was redirected to the remote region.
The same use case would originally have required approximately 60K CPU cores to serve 6M RPS directly from the storage engine. With CacheFront, we serve approximately 99.9% cache hits with only 3K Redis cores, allowing us to reduce capacity.

+ + +

Today CacheFront supports over 40M requests per second across all Docstore instances in production, and the number is growing.

+ + +

Figure 13: Total cache reads across all instances.

+ +

+ + +

We’ve addressed one of the core challenges in scaling the read workload on Docstore via CacheFront. It not only made it possible to onboard large-scale use cases that demand high throughput and low-latency reads, but also helped us reduce load on the storage engine and save resources, improving the overall cost of storage and allowing developers to focus on building products instead of managing infrastructure.

+ + +

If you like challenges related to distributed systems, databases, storage, and cache, please explore and apply to open positions here.

+ + +

Oracle, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

+ + +

Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd. Any use herein is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Uber.

Jupiter: Config Driven Adtech Batch Ingestion Platform

Uber — Tue, 06 Feb 2024 07:00:00 GMT

Introduction

+ + +

Uber’s mission is to reimagine the way the world moves for the better and provide earning opportunities globally through its marketplace. One effective approach to bring the Uber brand and marketplace closer to people is to invest in paid marketing strategies.

+ + +

Achieving an optimal equilibrium in the marketplace requires continuously balancing supply and demand. This means creating an environment that is affordable for spenders while remaining a great earning opportunity for earners. One way to accomplish this is to consistently introduce new users to the marketplace, an ongoing process that involves promoting Uber’s marketplace offerings across diverse marketing platforms such as Google, Meta, Apple, and others.

+ + +

Given that these are paid advertisements, our marketing teams continuously develop strategies to rapidly onboard more users to the platform. Therefore, receiving timely signals from these vendors is crucial for us to refine our approach effectively.

+ + +

This blog post aims to explore the constraints and difficulties encountered by our legacy ingestion system, MaRS (Marketing Reporting Service), responsible for gathering ad signals from external ad partners at fixed intervals. Furthermore, we will address how we enhanced our marketing operations through technological advancements and attained scalability by implementing our new system, Jupiter.

+ + +

In this blog, we have described paid marketing as a domain, while ad tech represents the systems within that same domain. These terms can be used interchangeably within this context.

+ + +

What Is The Performance Marketing User Flow?

+ + +

At a high level, the sequence below outlines the complete user journey: engaging with an ad, navigating to the Uber platform, and culminating in a conversion. A conversion is an action that is valuable to our business; in our context, it could be signing up on Uber, placing an order via Uber Eats, or taking a ride.

+ + +

Following the aforementioned action, the subsequent events are triggered:

+ + +

Conversion event: When a user clicks on the ad to download the Uber app, marking a conversion specific to that ad. This is one type of conversion event linked to downloading.

+ + +

Spend event: When a user views an ad, signifying expenditure to display that ad to the user.

+ + +

These spend events from the advertising partner need to be ingested, processed, and transmitted downstream. This is done to measure and optimize the ad’s performance.

+ + +

Figure 1: User Flow in Adtech.

+ +

Sample User Flow

+ + +

Step 1: The User Clicks on the Uber Ad on the Partner Page
Step 2: The User Arrives on the Uber App [Conversion Events]
Step 3: Adtech System Retrieves Data from Partner [Ingestion Platform]
Step 4: Compute Performance Metrics (ROAS)
Step 5: Optimization Engine enhances Bidding Algorithms by adjusting them according to computed Metrics.
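
For Step 4, ROAS (return on ad spend) is conventionally the revenue attributed to an ad divided by the spend on it. A minimal sketch is shown below; the function name and the choice to return `None` when there is no spend are assumptions, and how revenue is attributed to an ad is outside the scope of this example.

```python
def roas(attributed_revenue, ad_spend):
    """Return on ad spend: attributed revenue per unit of spend.

    Illustrative helper; attribution logic is assumed to happen upstream.
    """
    if ad_spend <= 0:
        # Undefined without spend; the caller decides how to treat this case.
        return None
    return attributed_revenue / ad_spend
```

Because the ratio depends on both inputs, missing or late spend signals from a partner directly distort the metric that the optimization engine consumes.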

+ + +

Why Is Timely Ingestion Critical?

+ + +

The prompt and precise ingestion of these advertising signals is crucial for Uber’s overall Performance Marketing. Even the smallest delay or inaccurate processing of timely ad signals from external partners can affect Uber’s capability to advertise on those platforms. As a result, this could influence the influx of users being onboarded onto the platform.

+ + +

To illustrate, during an outage lasting two days in which we were unable to ingest data from a single partner, the creation of key performance indicators (KPIs), specifically ROAS, downstream was delayed. This delay led to our machine learning algorithms in the bidding and optimization systems erroneously concluding that our ads were underperforming, causing a halt in ad spending.

+ + +

As a consequence, our ability to onboard new users was compromised, resulting in an imbalance in supply & demand. All this occurred due to an outage in one integration.

+ + +

Problem Statement

+ + +

As Uber operates across numerous countries worldwide, we engage with various local and global advertising partners or advertisers for our paid marketing efforts. This has resulted in the integration of multiple diverse technological systems at different levels of technological maturity, featuring heterogeneous data schemas, formats, varied transmission protocols, and discrepancies in data freshness, lineage, and completeness.

+ + +

The AdTech industry is undergoing a substantial transformation where partners, Mobile Measurement Platforms (MMPs), and external ad tech platforms are transitioning from user-based ad-tracking to a spectrum of privacy-centric alternatives. This shift has given rise to a diverse ecosystem with varying standards among partners, introducing complexities such as frequent and unpredictable changes in data schemas that challenge historical assumptions in the marketing and advertising domain.

+ + +

This complexity has presented a compounded challenge for the ingestion system due to its rapid evolution, scale, and the diverse nature of the datasets involved.

+ + +

Here’s a breakdown of issue categories and the time dedicated by the ingestion team previously:

+ + +

Figure 2: Split of Issue Categories.

+ +

+ + +

Reliability

+ + +

As evident from the data, the predominant portion of time is dedicated to ensuring the reliability of the ingestion system. The primary factors contributing to this can be classified as follows:

+ + +

High Latency

+ + +

Ensuring prompt availability of data in the warehouse was essential for reducing our Mean Time to Detect (MTTD) anomalies and enhancing the overall performance of our ad tech systems.

+ + +

Due to incomplete data or data latency issues, marketers struggled to distinguish between seasonality and actual ad performance.

+ + +

No Partial Data Availability

+ + +

As marketing data evolves with time (such as spend data exceeding 24 hours and conversions data extending beyond 28 days), it becomes highly important to provide partial data to downstream systems. This is especially crucial in cases where issues arise from the partner’s end at specific ad account levels. Given the frequency of such issues, having this capability could have prevented numerous data outages.

+ + +

Enhancements in Technology Stack

+ + +

The legacy system, MaRS, was designed to be tightly coupled to older advertising formats and domains. Even minor improvements to MaRS resulted in extended engineering cycles or caused multiple technology regressions. Consequently, accommodating new use cases made the system unwieldy and difficult to manage.

+ + +

Moreover, our outdated Python®-based technology stack was causing a slowdown. Taking advantage of this situation, we initiated an upgrade.

+ + +

Third-Party Dependencies

+ + +

Standardization

+ + +

Out of our global and local ad partners, some have advanced APIs for data sharing. However, there are smaller partners who, due to their limited maturity, share data through more manual methods like email and SFTP, etc. Therefore, it is imperative that a single system be able to handle data ingestion from this diverse array of sources.

+ + +

Moreover, the data formats and Service Level Agreements (SLAs) were not consistent among all partners. This lack of standardization posed challenges for maintenance. Consequently, it was necessary to establish uniform data standards across all partners for seamless consumption by downstream systems.

+ + +

Rate Limits

+ + +

Introducing new partner data, onboarding new data for an existing partner, or encountering a bug in the data processing layer required the ingestion system to import years of historical data (a backfill). Because of partner rate limiting, this process incurred significant latency, often taking multiple days to weeks, and it also impeded the normal flow of day-to-day pipelines.

+ + +

High Maintenance

+ + +

Sustaining partner-specific SDKs/APIs required substantial maintenance expenses, including dedicated headcount allocation, for frequent updates and bug fixes, which ultimately reduced developer productivity.

+ + +

Scale

+ + +

Huge Lead Time To Market

+ + +

Due to a substantial backlog, we could only attend to the P0 marketing requests. The backlog primarily stemmed from the fact that onboarding a new partner used to take multiple weeks, hindering our capacity for swift experimentation.

+ + +

For instance, in the case of an emerging partner, if we wanted to run ads on their platform, we would have to wait for several months before we had the resources to complete the onboarding process.

+ + +

High Dependency on Eng

+ + +

At present, the onboarding of any partner heavily relies on engineering resources to write boilerplate code for API integration, data transformation, validation, and testing. This consumes a significant portion of the onboarding process.

+ + +

Solution Strategies

+ + +

At first, MaRS was constructed with constraints tied to limited advertising spending. As Uber expanded globally, the demand for increasingly personalized marketing grew in both local and global markets. This necessitated a system that could swiftly adapt and incorporate specific nuances.

+ + +

Marketers needed a swifter onboarding process for new partners in our measurement pipelines to facilitate experimentation. They also sought data at a higher frequency to accelerate results, enabling them to fine-tune marketing strategies accordingly.

+ + +

Therefore, we developed a system to address gaps in the tech stack and accommodate future business requirements by employing a loosely coupled architecture.

+ + +

Build vs. Buy

+ + +

We conducted an assessment of external third-party vendors for ad signal ingestion rather than relying solely on in-house solutions. This was primarily to streamline maintenance costs.

+ + +

Additionally, there was a strong business directive from the marketing team to gain greater flexibility and control over primary channels (the top channels with higher spending) like Google, Apple, and Meta.

+ + +

As a result, we opted for a hybrid architecture that allows for a combination of external vendor data and direct in-house retrieval from partners. The decision on which approach to adopt during onboarding will be contingent on the business criticality of the integration.

+ + +

Plug and Play Architecture

+ + +

We also needed to ingest data beyond the existing adtech categories for various internal use cases, and many stop-gap solutions had been built in silos. We needed a single ingestion system that could handle ad hoc data sets with ease and minimal effort.

+ + +

We incorporated plug-and-play architecture for all the components so any ingestion can change its internal component to something else with minimal effort.

+ + +

Domain Agnostic Data Ingestion

+ + +

In our pursuit of creating an inclusive ingestion system for a diverse range of data, we needed to separate domain-specific intricacies and enable configurability through a fully self-service, config-based architecture.

+ + +

Reliability

+ + +

Dealing with a diverse range of partners, our system had to handle numerous immature data formats, inconsistent SLAs, and unexpected scenarios. The sizes of these ad signals also varied significantly, spanning from several gigabytes to terabytes in specific cases. Jupiter was specifically designed to adeptly manage these varied scenarios in a resilient manner.

+ + +

Architecture

+ + +

Figure 3: Jupiter Architecture.

+ +

Multi-Vendor Integration

+ + +

We had various business scenarios requiring the integration of distinct data sets from different vendors into our platform. These integrations needed to account for specific factors such as data formats, ingestion frequencies, and data maturity levels.

+ + +

Consequently, the platform was architected to accommodate any vendor for any of the data ingestion processes, allowing for seamless changes with minimal configuration.

+ + +

Integrating a new vendor involves configuring its specific integration details, after which the rest of the platform will seamlessly connect with it.

+ + +

Multi-Source Integrations

+ + +

Given our interactions with numerous vendors, we naturally encountered diverse data sources that required ingestion. To address this, we implemented configurable data sources with their specific attributes defined through configuration. Just as with vendors, these data sources can be switched at any time with minimal configuration effort.

+ + +

Currently, we have integrations with sources like Amazon Web Services, Google Drive, email, APIs, and more. The addition of a new source involves configuring its integration details, after which the rest of the platform will seamlessly adapt to it.

+ + +

Non-Transformed Data Sets

+ + +

In order to meet our business requirements, we found it necessary to ingest data over longer intervals without being constrained by partner capabilities. Additionally, our crucial need to swiftly detect any anomaly trends (MTTD) prompted us to implement a data copying process without applying any transformations.

+ + +

This approach enabled us to expedite debugging for any issues and efficiently backfill data when necessary.

+ + +

Config Driven Transformation Layer

+ + +

Due to the various data schemas we manage, customized transformations were crucial for standardization.

+ + +

A substantial part of the boilerplate code was dedicated to this particular component. To achieve a fully self-service ingestion system, we aimed to configure this component for each distinct use case.

+ + +

Consequently, we developed an internal library for this transformation layer. This library incorporates user-defined transformations, ranging from row-to-row, and column-to-column, to aggregate transformations. We’ve leveraged this library across internal systems and similar use cases for reusability purposes. Attached is a sample configuration.
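
To make the idea concrete, here is a small, hypothetical sketch of a config-driven transformation in the spirit described above. It is not the internal library: the config keys (`rename`, `row`, `aggregate`), the single supported `divide` operation, and the column names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical config; field and operation names are illustrative only.
CONFIG = {
    "rename": {"cost_micros": "spend", "camp": "campaign"},
    "row": [{"column": "spend", "op": "divide", "by": 1_000_000}],
    "aggregate": {"group_by": ["campaign"], "sum": ["spend", "clicks"]},
}


def apply_config(rows, config):
    # Column-to-column: rename partner-specific columns to standard names.
    renamed = [
        {config["rename"].get(k, k): v for k, v in row.items()} for row in rows
    ]

    # Row-to-row: apply each configured operation to every row.
    for step in config["row"]:
        if step["op"] != "divide":
            raise ValueError(f"unsupported op: {step['op']}")
        for row in renamed:
            row[step["column"]] = row[step["column"]] / step["by"]

    # Aggregate: group by the key columns and sum the configured measures.
    agg = config["aggregate"]
    totals = defaultdict(lambda: defaultdict(float))
    for row in renamed:
        key = tuple(row[c] for c in agg["group_by"])
        for measure in agg["sum"]:
            totals[key][measure] += row[measure]
    return [
        {**dict(zip(agg["group_by"], key)), **dict(sums)}
        for key, sums in totals.items()
    ]
```

With this shape, onboarding a partner whose report uses different column names or units would mean writing a new config rather than new transformation code.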

+ + +

+ +

Self-Serve Ingestion Onboarding

+ + +

We’ve optimized the entire onboarding process for the ingestion flow onto the platform, transitioning it into a self-service model with essential safeguards. This transformation involved implementing trigger-based mechanisms that operate seamlessly between all components, starting from fetching data from sources, initiating transformations, conducting tests, validating and promoting after all checks, and triggering post-validation procedures. Here’s the high-level flow attached for reference.

+ + +

Figure 4: Flow Diagram.

+ +

As a result of these improvements, the responsibility for this process has been transferred to the operations team, eliminating the necessity for continual engineering involvement.
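
The trigger-based flow can be thought of as a chain in which each stage starts only after the previous one succeeds. The sketch below is a simplified assumption of that idea, with made-up stage functions and result fields; it does not reflect Jupiter’s actual components.

```python
def run_ingestion(stages, payload):
    """Run stages in order; each successful stage triggers the next one."""
    completed = []
    for name, stage in stages:
        try:
            payload = stage(payload)
        except Exception as exc:
            # A failed check stops the chain, so later stages never run
            # on data that did not pass.
            return {"status": "failed", "failed_stage": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"status": "succeeded", "completed": completed, "data": payload}


# Hypothetical stages standing in for fetch, transform, and validate.
def fetch(_):
    return [{"campaign": "a", "spend": "2.5"}, {"campaign": "b", "spend": "1.0"}]


def transform(rows):
    return [{**r, "spend": float(r["spend"])} for r in rows]


def validate(rows):
    if any(r["spend"] < 0 for r in rows):
        raise ValueError("negative spend")
    return rows


STAGES = [("fetch", fetch), ("transform", transform), ("validate", validate)]
```

Calling `run_ingestion(STAGES, None)` runs the three stages in sequence and reports which stage failed if any check raises.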

+ + +

Impact

+ + +

At present, the prior system has been completely phased out and transitioned to Jupiter. Below, we present an overview of the metrics for both systems:

+ + +

Metric	Improvement %
Onboarding Time – New Ingestion	> 90%
Onboarding Time – New Vendor	> 75%
Onboarding Time – New Source	> 75%
Data Ingestion Frequency	> 75%
Data Ingestion Latency	> 70%

+ + +

Conclusion

+ + +

We’ve outlined the difficulties and potential advantages linked with ad tech domain networks, along with the process of obtaining dependable data from them and implementing intricate transformations to address various business needs. At present, we’ve retired the previous system and transitioned all use cases to the new platform, incorporating fresh data enhancements wherever feasible.

+ + +

We’ve successfully met our primary objective, accelerating the onboarding process for new partners and ensuring data reliability through a self-service approach for our stakeholders. However, there are still more intricate use cases to address. For instance, we’ve thus far focused on downloading a single report structure and applying transformations. Our next challenge is to download multiple structures, amalgamate them, and provide either a single or multiple datasets through a unified workflow. The next significant step is to expand this platform beyond its current specific-use-case support and transition it into a multi-tenant system.

+ + +

Acknowledgments

+ + +

We extend our special appreciation to both the core engineering and the product team, which includes Prathamesh Gabale (Engineering Manager), Akshit Jain (Software Engineer), Sarthak Chhillar (Software Engineer), Saurav Pradhan (Software Engineer), and Piyush Choudhary (Product Manager), for their pivotal roles in ensuring the success of this journey.

+ + +

We would also like to express our gratitude to Devesh Kumar, Diwakar Bhatia, and Vijayasaradhi Uppaluri for their invaluable feedback and unwavering support.

+ + +

Amazon Web Services, AWS, the Powered by AWS logo, and S3 are trademarks of Amazon.com, Inc. or its affiliates.

DataCentral: Uber’s Big Data Observability and Chargeback Platform

Uber — Thu, 01 Feb 2024 07:30:00 GMT

In this blog, we will walk you through DataCentral, Uber’s homegrown Big Data Observability, Attribution, and Governance platform. This blog gives a high-level overview of DataCentral’s key features. Before we get into the what and why of DataCentral, let’s do a quick primer of Uber’s Data ecosystem and its challenges.

+ + +

Introduction to Uber’s Big Data Landscape

+ + +

Figure 1: Uber’s Big Data Landscape.

+ +

+ + +

Uber’s data infrastructure is composed of a wide variety of compute engines, scheduling/execution solutions, and storage solutions. Compute engines such as Apache Spark^™, Presto^®, Apache Hive^™, Neutrino, Apache Flink^®, etc., allow Uber to run petabyte-scale operations on a daily basis. Further, scheduling and execution engines such as Piper (Uber’s fork of Apache Airflow^™), Query Builder (user platform for executing compute SQLs), Query Runner (proxy layer for execution of workloads), and Cadence (workflow orchestration engine, open-sourced by Uber) exist to allow scheduling and execution of compute workloads. Finally, a significant portion of storage is supported by HDFS, Google Cloud Storage (GCS), AWS S3, Apache Pinot^™, ElasticSearch^®, etc. Each engine supports thousands of executions, which are owned by multiple owners (uOwn) and sub-teams.

+ + +

Challenges

+ + +

With such a complex and diverse big data landscape operating at petabyte-scale and around a million applications/queries running each day, it’s imperative to provide the stakeholders a holistic view of the right performance and resource consumption insights.

+ + +

Stakeholder Personas

+ + +

The stakeholders of the big data ecosystem at Uber comprise the following:

+ + +

Job Owners: End users visit DataCentral to find out the metadata for their jobs such as duration, costs, resource consumption, query text and logs, etc. This allows DataCentral to serve as a powerful platform for debugging failed queries and applications.

+ + +

Big Data Teams: Big data engine teams like Spark, Presto, Hive, and Neutrino leverage the DataCentral platform to get high-level insights into the number of jobs failing, bad/abusive jobs, top error reasons, blocked queries, etc. In addition, DataCentral helps them to investigate SLA breaches, incidents, and job failures.

+ + +

Executive Leadership: DataCentral also supports business decision making by providing organization-level statistics, such as app/query level costs. It also offers information that can be leveraged to forecast hardware requirements and costs incurred.

+ + +

Some typical questions asked by the various personas involved with the big data platforms at Uber:

+ + +

Figure 2: User Personas of Data Platforms at Uber.

+ +

+ + +

Enter DataCentral

+ + +

At Uber, we have developed DataCentral, a comprehensive platform to provide users with essential insights into big data applications and queries. DataCentral empowers data platform users by offering detailed information on workflows and apps, improving productivity, and reducing debugging time. DataCentral provides the following key services for customers:

+ + +

Observability: Granular insights into performance trends, costs, and degradation signals for big data jobs.
Chargeback: Metrics and resource usage for big data tools and engines such as Presto, Apache Yarn^™, HDFS, Apache Kafka^®, etc.
Consumption Reduction Programs: Powers core cost reduction initiatives for Uber’s data ecosystem, such as HDFS growth reduction, Yarn usage reduction, etc.

+ + +

Figure 3: DataCentral and Offerings.

+ +

How does DataCentral help with Observability?

+ + +

Data Observability provides real-time insights into compute queries and applications. Since Uber’s data ecosystem is spread across different components, we have to track metrics across engines like Presto, Spark, Hive, and Neutrino. Aggregating these metrics and tying them to execution engines is also a challenge. We do this with the following offerings:

+ + +

Time series metrics (a.k.a. Clio): Every query run at Uber is fingerprinted and associated with a historical trend of executions (we call this in-house solution Clio). Customers can view historical trends for metrics like Costs, Duration, Efficiency, Data Read/Written, Shuffle, and much more. Having insights into historical trends allows customers to detect and debug applications faster. Further, we provide “config change markers” that allow easy correlation between config changes and changes in the historical trends (refer to Metrics trend in Figure 4). One major challenge we observed was that infrastructure introduced failures. To address this, we built observability into Yarn, HDFS, and correlation tools.
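
The post does not describe how Clio computes fingerprints. One common approach, shown here purely as an assumption, is to normalize the query text (lowercase it, replace literals with placeholders, collapse whitespace) and hash the result, so that recurring runs of the same query with different parameters map to the same key.

```python
import hashlib
import re


def fingerprint(sql):
    """Hash a normalized form of the query so recurring runs share one key.

    Illustrative only; the normalization rules are assumptions.
    """
    text = sql.strip().lower()
    text = re.sub(r"'[^']*'", "?", text)          # string literals -> placeholder
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)  # numeric literals -> placeholder
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
```

Metrics such as cost and duration for each execution could then be stored under the fingerprint to build the historical trend for that query.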

+ + +

Figure 4: Clio time series historical trends.

+ +

Yarn Observability: Saturated Yarn resources can cause job failures and slowness, which are difficult to debug. We offer solutions to observe and correlate the Yarn utilization in real time when applications are run. Further, we provide insights and suggestions when jobs get affected by Yarn.

+ + +

Figure 5: Application Level Yarn Queue Insights.

+ +

File System Observability: HDFS slowness and File-system-induced latencies are another factor causing degradations that are difficult to detect. We made changes to the Uber Hadoop client to add client-side monitoring of HDFS call counts and latencies with application-level granularity. Every developer can view the File System performance for their specific application/query, which makes the debugging process smoother. We further have correlation systems in place to capture the various infrastructure and engine metrics to root cause issues and suggest fixes–that’s for another blog.

+ + +

Figure 6: HDFS Metrics Surfaced to Users and Engine Teams.

+ +

+ + +

Figure 7: File System Insights User Interface on DataCentral.

+ +

+ + +

Figure 8: Historical Insights for File System Performance.

+ +

Contactless

+ + +

A good chunk of any data platform user’s time goes into debugging and troubleshooting failed queries and applications. To help them troubleshoot efficiently, we developed the “Contactless” system with the following objectives:

+ + +

Improve discoverability of errors
Identify and surface the root cause, from the right layer
Provide user-friendly explanations and suggestions
Provide actionable workflows to resolve errors

+ + +

The service enables engine teams to add regex-based rules into the system. A rule can also carry additional metadata, such as a user-friendly explanation, root cause layer, and priority. Once the stack traces are gathered for an application, the Contactless service matches the exception trace against the rules and surfaces the most relevant message back to the user.

+ + +

Whenever applications fail, DataCentral parses the error logs and applies Contactless rules to the stack traces. User-friendly suggestions and error messages are then displayed on the DataCentral console, enabling the end user to debug failures and find their root cause. Furthermore, a suggestions tab indicates the best actions that can be taken to resolve the error.
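
The rule matching described above can be sketched as follows. The `Rule` fields mirror the metadata mentioned in the text (explanation, root cause layer, priority), but the class, the function name, the tie-breaking by highest priority, and the example rules are assumptions for illustration rather than the service’s real schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str      # regex applied to the stack trace
    explanation: str  # user-friendly message
    layer: str        # root cause layer (hypothetical values below)
    priority: int     # assumed: higher wins when several rules match


def best_match(rules, stack_trace):
    """Return the highest-priority rule whose regex matches the trace, if any."""
    matches = [r for r in rules if re.search(r.pattern, stack_trace)]
    return max(matches, key=lambda r: r.priority, default=None)


# Hypothetical rules; real rules would be authored by engine teams.
RULES = [
    Rule(r"OutOfMemoryError",
         "The job ran out of memory; consider increasing memory settings.",
         "engine", 10),
    Rule(r"Exception",
         "The job failed with an unclassified exception.",
         "unknown", 1),
]
```

Here a trace that mentions both a generic exception and an `OutOfMemoryError` surfaces the more specific, higher-priority explanation.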

+ + +

Figure 9A: User Consoles for Configuring Rules.

+ + +

Figure 9B: User Consoles for Configuring Rules.

+ + +

Figure 10: Contactless in Action.

+ +

+ + +

How Does DataCentral Help with Cost Efficiency?

+ + +

Cost governance and reduction at Uber are driven by two concepts: attribution and cost efficiency.

+ + +

Chargeback

+ + +

Instead of setting hard limits and budgets on teams, Uber provides high transparency into costs and resource usage on several dimensions so that the stakeholders are equipped with the right data while making decisions. Resource usage and costs are tracked at a uOwn (Uber’s ownership platform) level granularity. Furthermore, the resource usage can be dissected across different granularities such as: User, Pipeline, Application, Schedule, and Queue level. Attribution is critical in driving conservation, identification of anti-patterns, and driving critical cost-reduction initiatives.

+ + +

Figure 11: HDFS Consumption and Usage Insights.

+ +

Consumption Reduction

+ + +

Once resources are attributed to the right teams and owners, stakeholders have insights into metrics such as the most expensive pipelines, continuously failing workloads, and unnecessary compute. As part of cost efficiency, we have taken up projects (such as HDFSRed, YarnRed, PrestoRed, etc.) that make automated, data-driven decisions to reduce costs. The HDFSRed project checks the access patterns of Uber’s data tables and creates Jira tickets for owners to push for table deletions and TTLs when data is not frequently accessed. Yarn and Presto reduction initiatives similarly check for anti-patterns and unnecessary compute and raise actionable Jira tickets to reduce or stop unused compute.
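
As a rough illustration of the kind of check HDFSRed might perform, the sketch below flags tables whose last access is older than a threshold. The function name, the 90-day default, and the input shape (a mapping from table name to last access date) are assumptions; the actual access-pattern analysis and ticket creation are not described in the post.

```python
from datetime import date, timedelta


def stale_tables(last_access, today, max_idle_days=90):
    """Return (table, idle_days) for tables idle longer than max_idle_days.

    Hypothetical helper; the threshold is an assumed default.
    """
    cutoff = today - timedelta(days=max_idle_days)
    stale = [
        (table, (today - seen).days)
        for table, seen in last_access.items()
        if seen < cutoff
    ]
    # Longest-idle tables first, as the strongest candidates for deletion or TTLs.
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Each returned entry could then feed a ticket for the owning team, using the ownership attribution described earlier.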

+ + +

Figure 12: Example Yarn Reduction JIRA Ticket.

+ +

+ + +

DataCentral’s Scale

+ + +

In order to provide real-time metadata for all Uber-wide applications, DataCentral has to match the scale of the various engines at Uber. Flink jobs keep up with engine-level scale to ingest real-time modeled data into the internal stores. With 500K Presto queries/day, 400K Spark apps/day, and 2M Hive queries/day, the data observability jobs handle 2K queries per minute and 30K RPS while reading the engine-level metrics. Further, HDFS insights handle over 10B calls per day and peak at 150K RPS (since this tracks calls on application-level granularity). The datastores have a 6-month retention span to handle the data growth and scale.

+ + +

Architecture

+ + +

It is critical to provide real-time insights and metadata in order to minimize time to debug and mitigate job failures. DataCentral’s architecture consists of the following components:

+ + +

Engines like Presto, Hive, Spark, and Neutrino emit query-level metadata to Kafka topics in real time. DataCentral has Flink jobs that constantly listen to these Kafka topics and consume any job-level metadata emitted by the engines.
The Flink jobs pre-process this data in real time and combine metadata on job-level granularity across multiple sources. Finally, the data is stored in internal stores like MySQL and Docstore (Uber’s internal datastore, providing strong consistency and high horizontal scaling).
DataCentral microservice stack consists of several APIs that serve a variety of use cases, such as the DataCentral UI, external teams, TTL setting on Hive tables, and much more.
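
The step of combining metadata at job-level granularity across multiple sources can be pictured with the plain-Python sketch below. It is not Flink code, and the event fields and the last-write-wins merge policy are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def merge_job_events(events):
    """Fold events from several sources into one record per job id.

    Illustrative only; field names are hypothetical.
    """
    jobs = defaultdict(dict)
    for event in events:
        record = jobs[event["job_id"]]
        # Assumed policy: later events overwrite earlier values for a field.
        record.update({k: v for k, v in event.items() if k != "job_id"})
    return dict(jobs)
```

In a streaming job, the same fold would typically be expressed as keyed state per job id, with the merged record written to the datastore as events arrive.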

+ + +

Figure 13: High-level DataCentral architecture.

+ +

The journey of the metadata right from the engines to Observability datastores takes under 500 ms. The DataCentral UI provides insights and metadata, which are served via the MySQL and Docstore datastores. The UI supporting APIs fetch data from disparate sources and join it into unified responses, finally serving the customer-facing views.

+ + +

Further batch workloads are run to power modeled datasets, which provide critical cost attribution data. Data from various engines is joined with HiveMetastore, uOwn, and other metadata sources to power rich insights, which are served on the DataCentral UI to downstream teams and leveraged for cost-reduction initiatives.

+ + +

DataCentral & Uber’s Move to Cloud

+ + +

As Uber moves to the cloud, our priority remains to provide cost transparency and cost reduction in the cloud ecosystem. Furthermore, DataCentral is supporting engine teams with critical metrics for performance testing, benchmarking, and identifying degradations when moving workloads to the cloud. This allows us to make the right decisions as we migrate critical jobs to the cloud. For example, File System Observability has allowed engine teams like Spark to observe the latency increase in the cloud and choose the right solutions for migration.

+ + +

Conclusion

+ + +

DataCentral has been a critical tool for engineers, data analysts, and platform teams at Uber. It provides advanced analytics and granular insights into big data applications and queries. DataCentral is used by stakeholders ranging from job owners and engine on-calls to platform teams and executive leadership. The self-serve nature of the platform has made it efficient for customers to debug jobs, mitigate incidents, and root cause SLA breaches. Another key offering of DataCentral is the consumption insights into big data compute and storage entities such as HDFS, Yarn, Kafka, Presto, etc. DataCentral acts as the single source of truth for stakeholders to get platform-, team-, and org-level insights. Furthermore, we plan to open-source the DataCentral toolkit for broader adoption and community collaboration.

+ + +

Apache®, Apache Pinot™, Apache Flink®, Apache Kafka®, Apache Hive™, Apache Hadoop®, Apache Spark™, Apache Yarn™, Kafka®, Apache Airflow™, Flink®, Hive™, Hadoop®, Spark™, Pinot™, Airflow™, and Yarn™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

+ + +

Presto is a registered trademark of LF Projects, LLC. No endorsement by LF Projects, LLC is implied by the use of these marks.

+ + +

Amazon Web Services, AWS, S3, and the Powered by AWS logo are trademarks of Amazon.com, Inc. or its affiliates.
Header image: “New York Grand Central Station” by jensfrickephotography is licensed under CC BY 2.0.

Stopping Uber Fraudsters Through Risk Challenges

Uber — Thu, 25 Jan 2024 08:00:00 GMT

Introduction

+ + +

As a marketplace-based, consumer-facing app, Uber encounters a multitude of sources of fraud across its platform. In one of the most common cases of fraud, bad actors use various methods to attempt to bypass payments for Uber rides, Eats orders, and other services, like Uber for Business. When this happens, failed transactions can occur, incurring losses that affect the drivers and businesses operating on Uber.

+ + +

To account for the serious financial implications of payment fraud, risk management is prioritized at Uber. Reflecting the risk solution landscape within the overall tech industry, our engineers have developed complex solutions to achieve the following:

+ + +

Detect fraud: Real-time fraud detection is driven by a system of business rules which run on Mastermind, Uber’s rules engine. In addition, machine learning models generate predictive scores that give insight into the probability that a user is a fraudster.
Mitigate fraud: Different forms of fraud mitigation are employed at Uber to act on and resolve triggered rules and threshold-passing scores, as appropriate. These involve both manual and automated strategies, and this is also where risk challenges come into play.

+ + +

Risk Challenges

+ + +

Risk challenges are experiences in which the user is asked to complete a certain task or process, often to verify the legitimacy of their identity or payment method. You have likely encountered a risk challenge before, and not just on the Uber app. A common one is to enter the CVV of a credit card when making a credit card transaction.

+ + +

One of the main desired outcomes of risk challenges is to effectively catch bad actors. This point is self-evident: a group that encounters risk challenges should have lower rates of fraud in comparison to a control group. However, protecting against fraud is not as simple as introducing risk challenges to everyone. The nature of risk challenges is highly individualized. Given the wide scope of Uber’s users, products, and geographies, risk challenges encountered on Uber will vary from user to user. Uber applies different risk challenges to different stages of the user journey, and users in different regions of the world may encounter different risk challenges. Let us consider just one risk challenge implemented in the Uber app: penny drop verification.

+ + +

Penny Drop Verification

+ + +

Consider the scenario where Uber receives a ride request from a user who does not seem to be the owner of the debit or credit card being used on the app. Based on the behavior and data of this Uber account, there is a very high probability that this user has stolen the card that they claim to own and plan to use for the ride.

+ + +

In the past, Uber would detect such users who were highly likely to be fraudsters and immediately prevent them from continuing to significantly engage with the app by employing certain strict actions. In the scenario described, the ride request would be declined, and the associated payment and/or user would be restricted in some cases.

+ + +

On the surface, this might seem effective in terms of avoiding payment fraud. While this strategy of strict actioning was in place, however, it became evident that it was not ideal for potential false positive users. Uber users whose access was restricted were required to contact customer service to resolve their status, which is often a resource-intensive process.

+ + +

Penny drop verification was introduced in the Uber app as a user-friendly method for individuals who might have previously been restricted from using the app to instead have a chance to prove ownership of their payment method. In this challenge, a user is asked to confirm to Uber the amounts of two small, random authorization holds before the holds expire within a given timeframe. Successful completion indicates that a user is likely the legitimate cardholder and not a fraudster.

+ + +

Figure 1: Throwing ban action versus risk challenge to a potentially risky user

+ +

+ + +

Technical Overview

+ + +

The penny drop verification challenge was implemented with the goal of being both an effective and intuitive method of fraud mitigation for users using a credit or debit card as their payment method. We can summarize how we achieved this through the following design principles, which apply to not just the penny drop verification challenge, but any other good risk challenge:

+ + +

Minimize false positives: Trust good users (and not bad users). Fraudsters should not be able to pass this challenge, while well-intentioned users should be able to self-resolve and recover should they fail.

+ + +

Create a seamless and empathetic user experience, with just the right amount of friction: This is necessary because trade-offs may exist between the frequency and degree of risk challenges, and user satisfaction and churn. For instance, if a user is presented with a risk challenge that they deem too hard or too long to complete, they may stop using the app altogether. A frustrating user experience should be avoided without compromising fraud mitigation.

+ + +

The following screens illustrate a happy path for a legitimate user that is thrown a new penny drop verification challenge.

+ + +

Figure 2: Happy path flow of the penny drop verification risk challenge

+ +

+ + +

As illustrated, the challenge is integrated into the mobile user experience in a way that should not significantly interfere with the user’s intended primary action–in this case, requesting a ride–through clear instructions and simple actions. At the same time, “skipping” the challenge is not possible, as the user who is thrown the challenge is required to complete it. Even if the user exits out of an initiated challenge in the user interface, the status of the challenge will still be active, and the user will be continually prompted to complete it if they try to resume relevant actions.

+ + +

Triggering Flows

+ + +

As mentioned earlier, certain risk challenges are designed for certain user journey flows on the Uber app. There are two main flows where the penny drop verification challenge may be displayed:

+ + +

When the user makes a request for either a ride or delivery order
When the user adds a payment method

+ + +

The challenge is not raised for every user at every occurrence of these flows. Rather, in the case that one of these two flows is initiated, downstream services are called on to fetch user-related features and to run risk rules in Mastermind. The rule results may indicate that further risk assessment is necessary, particularly to verify that the user is the owner of a specific payment method. If this occurs, backend services send an error code to the mobile side such that the user encounters mobile screens responsible for initiating the penny drop verification challenge.

+ + +

Figure 3: Risk challenge is initiated by a user action, like requesting a ride

+ +

User Consent

+ + +

Once the penny drop verification challenge is deemed necessary during a triggering flow, a modal is shown to allow the user to choose to verify their selected card. In some cases, the user may have already consented to the challenge but exited before successfully completing it, so they must re-consent.

+ + +

If the user chooses to switch to another payment method, they may not be asked for further verification. This is because the action of presenting a risk challenge depends on a user’s specific payment method. Often, an Uber user adds more than one card to their account, and they may or may not be the legitimate owner of any number of them. Different payment profiles of the same user can have different challenge statuses.

+ + +

Once a user clicks “Verify card,” backend processes check various conditions to determine what client-side screen to show to the user in the remainder of the challenge flow. It is first necessary to confirm that data related to the selected payment method, like its UUID and the status of the penny drop verification challenge, has been saved in Docstore, Uber’s backend database. If the card is entirely new, then relevant data about the card is written to Docstore for the first time.

+ + +

Figure 4: Challenge conditions are checked when a user consents to a risk challenge

+ +

Send Authorizations

+ + +

On a screen that provides more information about the risk challenge, the user is prompted to send authorization holds. In the overall context of electronic transactions and payments, authorization holds are temporary holds placed on a certain amount of funds; they are often used to determine whether a user has enough money to complete a transaction, and thus whether Uber is able to collect a payment from that user.

+ + +

After the user clicks “Send authorization holds,” two small monetary amounts are issued to the user’s designated bank account using the authorization hold protocol. To initiate this process, an internal payment operation service makes a request to a specialized “payment grant” service to create two distinct grants. This is done by supplying two randomly generated amounts to be held, as well as a specified void duration.

+ + +

The payment grant service interacts with a payment gateway or processor, which contacts the user’s bank or card issuer to formally request temporary authorization holds to be placed on the user’s payment account, in the specified amount. If the specified void duration lapses and the fund holds are released, and the user has not successfully verified the amounts to complete the challenge, then the user will have to re-send the authorization holds.
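
The hold lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not Uber's payment grant service: the `HoldGrant` type, the function names, and the 1 to 199 cent amount range are all assumptions made for the example.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HoldGrant:
    """One authorization hold. All field names are hypothetical."""
    amount_cents: int
    created_at: datetime
    void_after: timedelta

    def is_expired(self, now: datetime) -> bool:
        # Once the void duration lapses, the hold is released.
        return now >= self.created_at + self.void_after


def create_hold_grants(now, void_after, low_cents=1, high_cents=199):
    """Create two grants with randomly generated amounts (range is assumed)."""
    span = high_cents - low_cents + 1
    amounts = [low_cents + secrets.randbelow(span) for _ in range(2)]
    return [HoldGrant(a, now, void_after) for a in amounts]


def must_resend(grants, verified, now):
    """Holds must be re-sent if they lapsed before the user verified them."""
    return (not verified) and any(g.is_expired(now) for g in grants)
```

Using `secrets` rather than `random` is a design choice for this sketch: the amounts act as a shared secret between the issuer statement and the app, so they should not be predictable.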

+ + +

Figure 5: Authorization holds must be sent as part of the penny drop challenge

+ +

Amount Verification

+ + +

To successfully complete the penny drop challenge, users are required to review their bank statements and accurately enter the exact amounts of the authorization holds within the Uber app. These entered amounts are then subject to verification.

+ + +

Throughout this procedure, the user’s challenge status is updated within Docstore, Uber’s backend database. If the user does not successfully verify the authorization hold amounts within a certain number of attempts, they will have failed the challenge. In this case, their access to the Uber app will be restricted because we have strong indications that they are not the owner of the payment method. By contrast, if the user does successfully complete the challenge, they can seamlessly proceed with their intended actions that they had initiated before being thrown the challenge.
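
As a rough sketch of the status transitions described above, the challenge can be modeled as a small state machine. The class name, the status values, the limit of three attempts, and the order-insensitive comparison are assumptions for illustration; the article does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class ChallengeStatus(Enum):
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


@dataclass
class PennyDropChallenge:
    expected_cents: tuple  # the two hold amounts, in cents
    max_attempts: int = 3  # assumed limit; the article gives no number
    attempts: int = 0
    status: ChallengeStatus = ChallengeStatus.PENDING

    def verify(self, entered_cents):
        """Record one attempt and return the resulting status."""
        if self.status is not ChallengeStatus.PENDING:
            return self.status  # already resolved; ignore further attempts
        self.attempts += 1
        # Compare as sorted pairs so entry order does not matter (assumption).
        if sorted(entered_cents) == sorted(self.expected_cents):
            self.status = ChallengeStatus.PASSED
        elif self.attempts >= self.max_attempts:
            self.status = ChallengeStatus.FAILED
        return self.status
```

In a real system the status would be persisted after every call (the article mentions Docstore), so that an abandoned challenge remains active when the user returns.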

+ + +

Figure 6: Authorization hold amounts must be correctly verified to pass the penny drop challenge

+ +

+ + +

Conclusion

+ + +

We are continually fine-tuning the user experience of the penny drop verification challenge in a way that effectively mitigates risk without creating too much friction in the user experience. Through analysis of metrics like success rates and churn rates, we continually act on insights into how users are interacting with the challenges that are thrown to them.

+ + +

Penny drop verification is just one type of risk challenge integrated in the Uber app. Other challenges involve varying levels of difficulty. Based on what we understand of a given user and their intentions, one challenge may be considered better to use than another. Overall, risk challenges have been integral in our ongoing efforts to enhance security and user experiences on the Uber app. Their implementation has not only effectively served as a safeguard against fraud, but has also led to smoother user experiences, thus expediting onboarding for specialized offerings such as Uber for Business.

+ + +

Acknowledgements

+ + +

Thank you to the Spender Risk Team for sharing their expertise about a range of interesting engineering challenges related to fraud throughout my internship, including risk challenges.

+ + +

Special thanks to Diganta Sarkar, Qixiong Liu, and Susie Peng for providing insights relevant to this blog, as well as Neel Mouleeswaran and You Xu for supporting my internship and growth as a software engineer.

+ + +

Cover photo attribution: Image created by Nenad Stojkovic. Image license information: Link. No changes have been made.

Palette Meta Store Journey

Uber — Thu, 18 Jan 2024 08:00:00 GMT

Introduction

+ + +

The Machine Learning (ML) team at Uber is consistently developing new and innovative components to strengthen our ML Platform (Michelangelo).

+ + +

In machine learning, features are the data used to make model calculations and predict an outcome. You can think of them as the input to the learning model or attributes in your data that are relevant to the predictive modeling problem.

+ + +

When querying Uber’s data stores for feature data, it can be hard to:

+ + +

Figure out good Uber-specific features
Build pipelines to generate features
Compute features in real time
Guarantee that data used at training is the same as the data used for scoring predictions
Monitor features

+ + +

The Uber Michelangelo feature store, called Palette, is a database of Uber-specific curated and internally crowd-sourced features that are easy to use in machine learning projects. It addresses all of the challenges listed above. Pipelines are auto-generated for feature generation and feature dispersal. Palette supports various feature computation use cases, like batch and near real time, and includes precomputed features related to cities, drivers, and riders, as well as custom features generated for the Eats, Fraud, and Comms teams. Subject to our normal data access restrictions, Uber users can use many of the pruned features maintained by other Uber teams, or create their own, and directly incorporate these features in their machine learning models.

+ + +

Figure 1: Feature Generation graph shows job computing features. Feature Ingestion graph shows ingesting data to Hive and Cassandra. Feature Serving graph shows how features are served offline/online. Feature Metadata and Data Quality graph shows how featurestore metadata flows across offline and online stores.

+ +

+ + +

Palette Metastore Background

+ + +

Palette provides feature management infrastructure including feature discovery, creation, deprecation, offline and online serving setup in its Metastore.

+ + +

Palette Metastore is a metadata store of features in which Palette users can create and deprecate features and record details about ownership, backfill, and scheduling of feature generation pipelines, offline training, and HDFS location. Users can specify the Cassandra databases to which data should be copied for online serving, along with the Spark configuration, join keys, and feature list with feature metadata. Users can also specify which features should be copied for online serving and the SQL queries that generate the features from upstream dependencies, and the Metastore maintains an audit trail of changes.

+ + +

Figure 2: Feature Group Update flows from Palette Metadata repository to Offline Serving system and propagates to OnlineServing Cache eventually as well as is used by various systems for ETL/Training.

+ +

+ + +

A Closer Look: Problem and Motivation

+ + +

A major incident occurred in 2021 due to inadequate schema validation on Palette Metadata where a bad Metadata change was pushed, which resulted in OnlineServing breaking for major Tier1 use cases, since it was unable to load Palette Metadata during boot up.

+ + +

Schema validation logic used to be client side and lived in a script within the FeatureSpec repository, which is the Metadata repository where Palette customers make metadata-related changes. Updating validation was challenging because customer metadata updates did not always pick up the latest validation when customers had not rebased against the latest code changes. This led to incorrect metadata being merged into the master repository.

+ + +

Metadata discrepancies caused build failures for customers rebasing against master due to incorrect metadata changes being merged into master.

+ + +

Incident resolution took several hours due to several issues.

+ + +

Updating Palette Metadata in the OnlineServing stack. Changing a single feature group in the Palette Metadata repository led to updates for all of the hundreds of feature groups due to the lack of an incremental update system, prolonging rollbacks.
Lack of schema validation. The Feature Engine on-call had to dedicate substantial time to each customer diff, and the majority of on-call time was spent assisting with metadata changes in the FeatureSpec repo. The lack of a build job to verify the actual Hive table schema before merging led to failures at training time. Customers made errors when creating Palette tables, such as missing required columns.
Offline Metadata updates. Metadata updates took over an hour after landing changes in the FeatureSpec repo, since the entire metadata repository was updated even if only a minor change was made to one feature group.

+ + +

These issues highlight the challenges stemming from inadequate schema validation, leading to data loss, helpdesk burden, build failures, and confusion in pipeline updates. The complex process of updating metadata and the lack of automated schema verification further compounded the problems faced by the team.

+ + +

Deep Dive: Meta Store

+ + +

Feature Store Object Model

+ + +

Figure 3: FeatureGroup has OnlineSpec, OfflineSpec, ComputeSpec. OnlineSpec has Snapshot Backing which underneath is backed by Cassandra or Hive Backing. OnlineFeatureServingGroup is composed of online stores and online caches. Inference Server/Palette Service references OnlineFeatureServingGroup and indirectly references FeatureGroup.

+ +

Following are the new objects that we formally define in the new Palette Metadata system backed by protos:

+ + +

FeatureGroup: A logical table with a collection of features for both streaming and batch features backed by daily feature snapshots in Hive tables or Cassandra for the online store.

+ + +

Feature: A single feature corresponding to a column within the logical FeatureGroup (table).

+ + +

Dataset: Dataset represents the metadata needed to create a table in a database/storage for a given feature group. For example, keyspace, partition key and cluster key would be the metadata needed to create a table for a given C* cluster. These would be part of the Dataset spec.

+ + +

Storage: Storage is the underlying storage technology that is referenced by a dataset or an online feature serving group.

+ + +

FeatureServingGroup: A logical unit of serving in the online store that guarantees a certain SLA (throughput, latency). It is a collection of Storage (Cassandra/Redis clusters) that back the Feature Groups, and a routing map of FeatureGroups to the underlying Datasets. Note that it is common (in the case of very large use cases) for a FeatureServingGroup to contain multiple Cassandra clusters.

+ + +

Inference Server/Palette Service: Inference Server is the logical object holding metadata for Inference Serving for a given model within a Michelangelo project. Palette Service (a service where users can just fetch feature values without needing a model setup) similarly will hold metadata for serving via Palette Service.
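
To make the relationships between these objects concrete, here is a minimal sketch using Python dataclasses. The real definitions are backed by protos that are not shown in this post, so every field name below is an assumption chosen to mirror the descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    name: str  # a column within the logical feature group


@dataclass
class Storage:
    name: str
    technology: str  # e.g. "cassandra" or "redis"


@dataclass
class Dataset:
    feature_group: str
    storage: str        # name of the Storage that holds the table
    keyspace: str
    partition_key: str
    cluster_key: str


@dataclass
class FeatureGroup:
    name: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class FeatureServingGroup:
    name: str
    storages: Dict[str, Storage] = field(default_factory=dict)
    # Routing map: feature group name -> dataset backing it.
    routing: Dict[str, Dataset] = field(default_factory=dict)

    def storage_for(self, feature_group: str) -> Storage:
        """Resolve which storage cluster serves a feature group."""
        return self.storages[self.routing[feature_group].storage]
```

With the routing map held in one place, answering "which cluster serves this feature group?" becomes a single lookup.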

+ + +

Metadata Organization

+ + +

We organized the metadata inside the Palette Metadata repository into the following files to simplify how customers and Michelangelo on-calls interact with it: customers manage the offline-related metadata files, and Michelangelo on-calls manage the online-serving-related metadata files.

+ + +

Description.json: This file contains all the metadata related to offline serving as well as ownership and alerting setup backed by OfflineSpec defined above

+ + +

Features.json: This file will cover metadata related to features with schema backed by Feature CRD

+ + +

OnlineServing.json: This file contains all the metadata related to online serving

+ + +

HQL: This file contains Hive Queries for generating features

+ + +

Metadata Registration

+ + +

Figure 4: Palette Metadata repository updates go through server side validation and get registered in offline serving system and pushed to OnlineServing Cache and OnlineServing stack.

+ +

+ + +

To expedite Offline and Online Metadata updates, we moved the system handling Palette Metadata updates made by customers to incrementally compute the delta of the updates, and register those updates in the OfflineServing system.

+ + +

Once the updates land in UAPI, we use a Kubernetes-based controller to propagate those updates to our highly available Online Serving Cache, called ObjectConfig.

+ + +

The E2E updates to Offline and Online systems now take only 15 minutes instead of over an hour, since only incremental updates are pushed rather than the entire metadata repository.
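
One simple way to compute such a delta is to fingerprint each feature group's metadata and compare snapshots. The sketch below is illustrative only: the function names and the use of SHA-256 over canonical JSON are assumptions, not a description of the actual Palette registration system.

```python
import hashlib
import json


def fingerprint(metadata: dict) -> str:
    """Stable hash of one feature group's metadata."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compute_delta(registered: dict, incoming: dict):
    """Return (upserts, deletes) between two {group_name: metadata} snapshots."""
    upserts = {
        name: meta
        for name, meta in incoming.items()
        if name not in registered
        or fingerprint(registered[name]) != fingerprint(meta)
    }
    deletes = sorted(set(registered) - set(incoming))
    return upserts, deletes
```

Only the groups returned in `upserts` and `deletes` would then need to be registered downstream, which is what keeps the cost of an update proportional to the size of the change rather than the size of the repository.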

+ + +

Online Serving Re-Architecture

+ + +

Figure 5: Schema updates for Old and New Schema propagate from Metadata Service to Read only Cache and gets loaded to OnlineServing via Loader which is referenced by Wrapper.

+ +

+ + +

Metadata Unification

+ + +

In the old architecture, the metadata for online serving was fragmented across various services. We decided to consolidate all the metadata for online serving in one place, which is the Palette Metadata repository.

+ + +

Interface Redesign

+ + +

We made an interface change to deprecate the old schema which no longer was meeting the evolving needs of the Palette online system.

+ + +

Metadata Wrapper

+ + +

We introduced a wrapper during migration for two main purposes: interface adaptation and quick rollback. During the migration process, we made both versions of metadata available for Palette Online Serving. That gave us the ability to compare the metadata in memory. Because the meta loader transforms the metadata into a format better suited to serving needs, the metadata in memory is different from what we see in the metadata service. Comparing the metadata in memory gave us more confidence for a safe migration. But due to the interface redesign, we needed serving logic to support both interfaces, so the wrapper translated the legacy metadata into the format of the new interface. We also introduced a kill switch to tell the wrapper which version of the metadata it should provide to the serving logic, so that we could do a quick rollback if any metadata issue happened during migration.
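
The wrapper pattern described above might look roughly like this. The record shapes, the `adapt_legacy` translation, and the callable kill switch are all hypothetical; they only illustrate the idea of serving one interface from two metadata versions.

```python
from typing import Callable, Dict


def adapt_legacy(legacy: Dict) -> Dict:
    """Translate a legacy record into the new interface's shape.

    Both shapes are made up for illustration.
    """
    return {"feature_group": legacy["table"], "features": list(legacy["columns"])}


class MetadataWrapper:
    def __init__(self, legacy_store: Dict, new_store: Dict,
                 use_new: Callable[[], bool]):
        self._legacy = legacy_store
        self._new = new_store
        self._use_new = use_new  # kill switch, read on every call

    def get(self, name: str) -> Dict:
        """Serve metadata in the new format, whichever version backs it."""
        if self._use_new():
            return self._new[name]
        return adapt_legacy(self._legacy[name])

    def mismatches(self):
        """Names whose in-memory metadata differs between the two versions."""
        names = set(self._legacy) | set(self._new)
        return sorted(
            n for n in names
            if n not in self._legacy or n not in self._new
            or adapt_legacy(self._legacy[n]) != self._new[n]
        )
```

Because the switch is evaluated per call, flipping it takes effect without reloading metadata, which is the property that makes rollback quick in this sketch.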

+ + +

Migration Challenges

+ + +

Keep a smooth user experience during migration:
- We maintained scripts to automatically synchronize feature metadata between old and new systems. This could avoid data gaps when switching to the new system.
- Good and clean documentation was provided to help users to learn how to onboard features to the new Metadata Store.
+
Track correctness for migration:
- Comparison metrics and logs were created across the feature generation pipeline system and the offline serving system to clearly articulate the differences between old and new systems. They served as evidence of correctness for the migration.
- Traffic metrics were checked to make sure that no traffic comes through old systems after full migration.
+
Ensuring Backward Compatibility:
- The updated metadata introduced substantial changes in data formats and APIs. To maintain backward compatibility, it was essential to create a robust common API wrapper. This wrapper could seamlessly bridge the gap between legacy code and the new codebase. Subsequently, we could transition the underlying implementations of the Common API wrapper gradually, facilitating a seamless migration process.
+
Testing:
- The code modification incorporated itself into the Michelangelo team’s offline training, re-training, evaluation and prediction workflow. To guarantee the continued functionality of these integrations after the migration, it was imperative to conduct comprehensive integration testing involving all existing systems.
+
Rollback Plan:
- In case the migration encounters unexpected issues or doesn’t yield the desired results, we also defined a thorough rollback plan which could minimize downtime and mitigate risks.
+

+ + +

Result

+ + +

The result of the Metadata migration was that Palette onboarding deployment time dropped by more than 95%. In addition, the time to migrate Cassandra clusters has gone down by 90%, because all online serving configuration is now cleanly organized and on-calls no longer need to scramble to figure out which feature group gets served from which Cassandra cluster. Because the offline metadata update system was re-architected to process updates incrementally, the time for offline metadata updates has gone from hours to minutes. Additionally, we have introduced enhanced server validation for FeatureStore CRDs and cross-CRD validation.
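
As a rough illustration of what server-side and cross-object validation can catch, consider the sketch below. The specific checks, argument names, and required columns are hypothetical and are not the actual Palette validators.

```python
def validate_feature_group(features, online_features, table_columns,
                           required_columns):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    declared = set(features)
    columns = set(table_columns)

    # Cross-object check: online serving may only reference declared features.
    for name in online_features:
        if name not in declared:
            errors.append(f"online feature '{name}' is not declared")

    # Schema check: the backing table must contain the required columns.
    for col in required_columns:
        if col not in columns:
            errors.append(f"table is missing required column '{col}'")

    # Schema check: every declared feature needs a matching column.
    for name in sorted(declared - columns):
        errors.append(f"feature '{name}' has no matching table column")

    return errors
```

Running checks like these on the server, rather than in a client-side script, means every change is validated against the same, current rules regardless of how stale the author's checkout is.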

+ + +

Conclusion

+ + +

Overall, introduction of formal schema, consolidation of metadata, enhanced validation, and a very diligently planned migration have led to our new metadata system being easy to use for customers and Michelangelo on-calls, reducing deployment and customer onboarding time, as well as maintenance and operational costs.

+ + +

Acknowledgements

+ + +

This major step for Machine Learning at Uber could not have been done without the many teams who contributed to it. Huge thank you to the Feature Engine group within Uber’s Michelangelo Team, who spent 1+ year rearchitecting the Meta Store system.

+ + +

We also want to give a special thank you to our partners on the Michelangelo teams for making this idea a reality, as well as our former colleagues who helped initiate this idea.

+ + +

Header Image Attribution: The “Journey start here” image is covered by a CC BY 2.0 license and is credited to Johnragai-Moment Catcher. No changes have been made to the image.

Uber: GC Tuning for Improved Presto Reliability

Uber — Thu, 11 Jan 2024 08:00:00 GMT

Presto at Uber

+ + +

Uber uses open-source Presto to query nearly every data source, both in motion and at rest. Presto’s versatility empowers us to make intelligent, data-driven business decisions. We operate around 20 Presto clusters spanning over 10,000 nodes across two regions. We have about 12,000 weekly active users running approximately 500,000 queries daily, which read about 100 PB from HDFS. Today, Presto is used to query various data sources like Apache Hive, Apache Pinot, AresDb, MySQL, Elasticsearch, and Apache Kafka, through its extensible data source connectors.

+ + +

Figure 1

+ +

Our selection of cluster types can accommodate any request, whether for interactive or batch purposes. Interactive workloads cater to dashboards/desktop users waiting for the results, and batch workloads are scheduled jobs that run on a predefined schedule. Each of our clusters is classified based on their machine type. Most of our clusters comprise bigger machines, which are equipped with more than 300 GB of heap memory, while other clusters have smaller machines that are equipped with less than 200 GB of heap memory, and we have modified the concurrency of each cluster depending on its size and type of the machines that make it up.

+ + +

On a weekly basis, memory fragmentation optimization activity is carried out across all production clusters. Even though we were constantly improving fragmentation, we still suffered from constant Full Garbage Collections (very long pauses) and, sporadically, a few out-of-memory errors. To give a sense of the problem, here is the cumulative Presto Full GC count:

+ + +

Figure 2: Presto Full GC count per day.

+ +

+ + +

Overview – G1GC Garbage Collector

+ + +

G1GC is a garbage collector that tries to balance throughput and latency. G1 is a generational garbage collector, which differs from the newer concurrent garbage collectors (Shenandoah, ZGC, etc.). Generational means that the memory is divided into short and long-lived objects.

+ + +

The first important thing to differentiate here is that there are two types of memory: stack and heap. Stack allocations are cheap because we just need to move a pointer: whenever we call a function, we decrement the stack pointer (the stack grows downward), and once we are done with that function, we increment the pointer. Voilà: allocation and deallocation in a single statement each. On the other hand, heap allocation/deallocation is a little more expensive. For G1GC, allocation is similar to the stack in that we only need to bump a pointer, but deallocation requires GC to run.

+ + +

In Java, since all objects are allocated on the heap, then what do we allocate on the stack? There it goes, the “pointer” referencing the object on the heap. Then for the heap space, G1 divides it into small sections called “regions.”

+ + +

G1 tries to achieve at least 2,048 regions on the heap.

+ + +

Figure 3: Heap is divided into regions.

+ +

What’s the size of each region? It depends on your heap size, but it can go from 1-32 MB. The JVM decides which size ensures that we have 2,048 regions (or more).
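
The region-size choice can be approximated with a few lines of arithmetic. This is a sketch of the commonly described heuristic (divide the heap by the 2,048-region target, round down to a power of two, clamp to 1-32 MB); it assumes a fixed heap size and is not a copy of the JVM's code, so treat the exact rounding as an assumption.

```python
MB = 1024 * 1024
TARGET_REGIONS = 2048
MIN_REGION = 1 * MB
MAX_REGION = 32 * MB


def g1_region_size(heap_bytes: int) -> int:
    """Approximate G1's default region size for a given heap size."""
    raw = max(heap_bytes // TARGET_REGIONS, MIN_REGION)
    rounded = 1 << (raw.bit_length() - 1)  # round down to a power of two
    return min(max(rounded, MIN_REGION), MAX_REGION)


def region_count(heap_bytes: int) -> int:
    """Number of regions the heap is split into at that region size."""
    return heap_bytes // g1_region_size(heap_bytes)
```

Under these assumptions, a 300 GB heap (like the larger Presto machines mentioned earlier) hits the 32 MB cap and ends up with 9,600 regions, well above the 2,048 target.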

+ + +

Each region can be young generation, old generation, or free.

+ + +

Figure 4: Heap regions are categorized as young gen, old gen, or free.

+ +

Finally, the young generation is also divided into Eden and survivors. Eden is where any new allocation happens. For survivors, it would create two different arenas. Why? Young Gen’s approach to clear memory is copying objects between regions, so it needs an empty survivor to copy memory.

+ + +

So the full process is: whenever we do a new Object(), it gets allocated in Eden. GC runs and the object is not dead, so it gets copied to Survivor0. The next time GC runs and the object is still not dead, it gets copied to Survivor1. It continues to be copied back and forth between survivors until it eventually gets promoted to the old generation.

+ + +

Figure 5: Heap is fully divided into all the mentioned types.

+ +

To recap, the young generation uses a copying mechanism to release memory. So when do we allocate to the old generation? There are two scenarios:

+ + +

G1 has an age threshold. Every time a young gen object gets copied, we increase the age. Once we hit the threshold, it gets copied to the old gen.
Each region is 1-32 MB in size. Any object that is 50% or more of the size of the region gets allocated directly to the old generation. G1 calls this a humongous object.
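
The two scenarios above can be expressed as a pair of small decision functions. This is a simplified model for illustration: the function names are made up, the humongous check follows the "50% or more" wording used here (the JVM's exact boundary comparison is not verified), and the aging logic ignores details such as survivor-space overflow.

```python
def allocation_target(object_bytes: int, region_bytes: int) -> str:
    """Where a newly allocated object is placed."""
    # Follows the article's "50% or more" wording; the boundary case
    # (exactly half a region) is an assumption.
    if object_bytes * 2 >= region_bytes:
        return "old (humongous)"
    return "eden"


def after_young_gc(age: int, tenuring_threshold: int):
    """Return (next_space, new_age) for an object that survived a young GC."""
    if age >= tenuring_threshold:
        return "old", age          # threshold reached: promote
    return "survivor", age + 1     # copied to a survivor space and aged


def survivor_copies_before_promotion(tenuring_threshold: int) -> int:
    """Count how many times an object is copied between survivor spaces."""
    age, space, copies = 0, "eden", 0
    while space != "old":
        space, age = after_young_gc(age, tenuring_threshold)
        if space == "survivor":
            copies += 1
    return copies
```

The model makes the cost trade-off visible: a higher threshold means more copying of long-lived objects in the young generation, while a lower one pushes short-lived objects into the old generation sooner.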

+ + +

How does G1 clear the old generation? It uses an algorithm called “concurrent mark and sweep.” It is a graph traversal that starts from the root objects (thread stacks, global variables, etc.) and goes through every object still referenced. It is essential to mention that G1 uses SATB (snapshot-at-the-beginning), so any new object allocated after marking starts is considered alive regardless of its real liveness. Once it finishes, G1 knows which objects are still alive, and the ones that are not can be cleaned on the following mixed collection.

+ + +

What? A mixed collection? Indeed. A mixed collection is a young generation collection that would include old generation regions in the process. So it copies the objects that are still alive in another old generation region. This process is critical to reduce memory fragmentation.

+ + +

Who determines the size of each component (Eden, survivor, old gen, etc.)? The heap is always shrinking and expanding to fulfill its job, although there are certain limits. For instance, the young generation can only be 5-60% of the total heap.

+ + +

Today’s discussion doesn’t require going into more advanced G1GC topics, so let’s begin with what we have done at Uber.

+ + +

G1GC at Uber

+ + +

When Java became more widely used at Uber, we were running OpenJDK 8. Most of the time, the only tuning option we had to touch was -XX:InitiatingHeapOccupancyPercent=X. This threshold controls when G1 should start a concurrent mark and sweep cycle.

+ + +

It has a default value of 45%, which usually causes an increase in CPU usage, because any service that keeps a cache in memory would eventually exceed that threshold and keep triggering marking cycles endlessly. For instance, if service A stores all the users in memory and that causes the old generation to be ~60% of the total heap, the 45% threshold is always met.

+ + +

Then how do we tune it?

+ + +

Enable GC logs and GC metrics
Look for the peak old-generation utilization after mixed collections
Select a value slightly higher than that peak, usually 5-10% higher
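
The three steps above boil down to a small calculation once the GC logs have been collected. The sketch below assumes that the "5-10% higher" margin means percentage points of the heap, and the function name `suggest_ihop` is hypothetical.

```python
def suggest_ihop(old_gen_after_mixed_pct, margin_pct=5):
    """Suggest a value for -XX:InitiatingHeapOccupancyPercent.

    old_gen_after_mixed_pct: old-generation occupancy samples, as a
        percentage of the total heap, taken right after mixed collections.
    margin_pct: headroom added on top of the observed peak
        (assumed to be percentage points).
    """
    if not old_gen_after_mixed_pct:
        raise ValueError("need at least one utilization sample")
    peak = max(old_gen_after_mixed_pct)
    return min(100, round(peak + margin_pct))
```

For the service A example, where the old generation sits at about 60% of the heap, a 5-point margin would suggest a threshold of 65 instead of the default 45.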

+ + +

However, remember that Presto servers are now running on JDK 11. How do we tune them? This was our first attempt at tuning this version. Why is it different? Java introduced dynamic IHOP (InitiatingHeapOccupancyPercent). Then we no longer have a default value of 45%, and instead we have a value that can change all of the time, and it is only available in the GC logs.

+ + +

Tuning JDK 11

+ + +

How does dynamic IHOP get calculated? The basic idea is that it uses the current size of the young generation plus a free threshold (the actual formula is slightly more complex). This free threshold defaults to 10% of the total heap and is used as a buffer to allow GC to complete (remember that concurrent mark and sweep runs alongside your application).
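
The basic idea can be written as a one-line model. This is only the simplified version described above, not the formula the JVM actually uses, and the function name `approx_ihop_threshold_pct` is hypothetical.

```python
def approx_ihop_threshold_pct(young_gen_pct, reserve_pct=10):
    """Approximate old-gen occupancy (% of heap) at which marking starts.

    Simplified model: whatever is left of the heap after the young
    generation and the free reserve is the room the old generation
    may fill before concurrent marking begins.
    """
    return max(0, 100 - (young_gen_pct + reserve_pct))
```

Under this model, a young generation that has grown to 50% of the heap with the default 10% reserve leaves a threshold of 40%, while capping the young generation at 20% and reserving 35% gives 45%.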

+ + +

The process we follow is listed below (we waited 1-2 weeks between each step to have enough data to verify our experiments). We tried the following on one cluster first to avoid affecting all our users.

+ + +

Add more GC metrics

+ + +

We were missing young- and old-gen utilization, so we couldn’t easily know historical data about our utilizations.

+ + +

Decrease max young generation size from 60% to 20%

+ + +

We saw the young generation expand a few times to as much as 50% of the total heap. This caused long GC pauses and delayed the next concurrent marking cycle, because concurrent marking can’t run while we are still doing mixed collections.

+ + +

The result?

+ + +

Better GC pauses.
Still bad concurrent marking. This happened because by decreasing the max young generation size by 40 percentage points, we gave that space to the old generation, so concurrent marking was still starting late.

+ + +

Increase Free space from 10% to 35% & decrease Heap waste from 5% to 1%

+ + +

Let’s first talk about the heap waste percentage. This tuning option defaults to 5%, which tells G1 to keep reclaiming garbage only while the reclaimable garbage exceeds 5% of the total heap. Why? To avoid long GC pauses during mixed collections. After concurrent marking, G1 orders the old generation’s regions based on their utilization, and it first chooses the ones that have more free space, because they are faster to copy to a new region.
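
A minimal sketch of that selection logic, under simplifying assumptions: regions are plain tuples, the emptiest regions (least live data) are picked first, and collection stops once the remaining garbage fits within the allowed waste. The function name and the data layout are hypothetical, and real G1 applies additional limits not modeled here.

```python
def pick_mixed_collection_regions(regions, heap_size_mb, heap_waste_pct=5):
    """Choose old-gen regions to evacuate in mixed collections.

    regions: list of (name, live_mb, garbage_mb) tuples known after marking.
    Returns region names, cheapest to copy first, until the garbage
    left behind is within heap_waste_pct of the heap.
    """
    allowed_waste_mb = heap_size_mb * heap_waste_pct / 100
    remaining_garbage = sum(garbage for _, _, garbage in regions)
    chosen = []
    # Less live data means less copying, so those regions go first.
    for name, _live_mb, garbage_mb in sorted(regions, key=lambda r: r[1]):
        if remaining_garbage <= allowed_waste_mb:
            break
        chosen.append(name)
        remaining_garbage -= garbage_mb
    return chosen
```

Lowering the waste percentage makes the loop continue into regions with more live data, which is where the longer pauses come from.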

+ + +

For our 300G clusters, that translates to 15G that will never be cleaned. We decided to decrease that to 3G (-XX:G1HeapWastePercent=1) based on past experiences.
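
The arithmetic behind these numbers is straightforward; the helper below is a hypothetical illustration that converts a heap waste percentage into the amount of garbage G1 is allowed to leave uncollected.

```python
def heap_waste_gb(heap_gb, waste_pct):
    """Garbage (in GB) that G1 may leave uncollected for a given setting."""
    return heap_gb * waste_pct / 100


# 300G heap: the default 5% leaves up to 15G, while 1% leaves up to 3G.
default_waste = heap_waste_gb(300, 5)
tuned_waste = heap_waste_gb(300, 1)
reclaimed = default_waste - tuned_waste
```

With these inputs, moving from 5% to 1% lets G1 reclaim up to 12G more on a 300G heap.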

+ + +

For free space, we analyzed several GC logs and noticed that utilization stayed at 20-35% after mixed collections. Then 20% max young gen plus 35% free space would give us a threshold of 45% (100-(35+20)%). With this config, we are giving at least a 10% buffer (35 to 45%) to have some garbage to clean.

+ + +

The result?

+ + +

1% proved too aggressive, and we started seeing long pauses of >1s. This change was still helpful because, with the GC logs, we could identify that long pauses started to happen once mixed collections tried to go from 2% to 1% garbage.
35% performed well. Full GCs were reduced (~80% for this cluster).

+ + +

Increase Free space from 35% to 40% & increase Heap waste from 1% to 2%

+ + +

The result was:

+ + +

2% heap waste gave us an additional 9G and had little impact on latencies (~50-100ms vs. 1-1.5s with 1%).
40% performed slightly better than 35%, but we didn’t gain much (Full GCs reduced by 85-90% vs. 80%). We decided not to go even further to avoid thrashing.

+ + +

Try the same tuning options on a different cluster

+ + +

We tested the same config in a new cluster and verified the behavior before trying on all of them to see the impact. We decided to grab the cluster with the most Full GCs in the past few weeks. After 24 hours of the deployment, we could already see the impact:

+ + +

Figure 6: Full GCs cumulative count of a single cluster

+ +

Previously, we would start seeing Full GCs after just a few hours. After these changes, we didn’t see any.

+ + +

Conclusion

+ + +

After several weeks of testing with the tuning flags presented above, we decided to use the same flags for all clusters. After the flags were added/updated, all the clusters performed optimally with minimal to no internal OOM errors. Due to this change, the reliability of Presto clusters increased, thereby reducing reruns of the queries that were earlier failing with OOM errors, which improved the overall performance of Presto clusters. The flags that we used in the final tuning are:

+ + +

-XX:+UnlockExperimentalVMOptions

+ + +

-XX:G1MaxNewSizePercent=20

+ + +

-XX:G1ReservePercent=40

+ + +

-XX:G1HeapWastePercent=2

+ + +

These flags are specific to the Presto use case in Uber, which was finalized after multiple tuning iterations. We expect flags to differ for each organization based on the individual workloads, and they must be tuned on a case-by-case basis. With these flags enabled, we will see more frequent Garbage collection, but they allow us to have a more reliable Presto cluster and reduce the on-call burden for the owners.

+ + +

For all of our clusters, we observed the following impact:

+ + +

Figure 7: Cumulative Old generation count per day for all clusters.

+ +

Figure 8: Internal errors per day on all clusters.

+ +

+ + +

What’s Next?

+ + +

Most of the Garbage collection tuning we have done has been on product-facing applications, and we haven’t paid close attention to our storage applications. Therefore, we plan to continue tuning for the other solutions Uber provides. It would be an interesting learning experience because storage applications use large heaps, which differs from what we used to tune normally. We’ll share it with the community once we have more data.

+ + +

GC tuning done on Presto is an example of how improving garbage collection can improve a system’s overall performance and reliability. Our next focus will be further fine-tuning GCs for Presto clusters, especially with less powerful machines where we are still experiencing some Full GCs, and improving the system’s overall reliability.

+ + +

All the optimizations listed are specific to the Presto deployment in Uber and can’t be directly ported to other services. The flags listed are just for demonstration purposes to understand what flags we ended up using in our tuning. Also, we will come up with some of the best practices and guidelines that can be used by all of Uber’s storage applications, depending on their general usage, which will act as a good starting point. This will empower us to improve all of our storage applications, improving overall reliability and performance.

Product analytics for generative AI model and media asset companies using BigQuery

Tue, 07 May 2024 16:00:00 +0000

Over the last year, there’s been a lot of change in the commercial image and video asset industry: New generative AI applications let users create their own still and live images based on prompts, and traditional stock-media asset providers are offering customers richer search experiences that have a deep understanding of the image/live image content and that expose it with a natural language interface.

To continually push the state of the art, these organizations must use data to evolve their products rapidly, for example to:

+
Optimize still and live image generation models
+
+
Identify inappropriate content, such as violence or nudity
+
+
Analyze behavior to identify improvements to the user experience
+
+
Recommend similar images or prompts based on previous activity
+
+
Enhance static asset search capabilities
+

To do this, they need unstructured images, live images, and audio data, combined with structured user-experience data and metadata about the assets they are interacting with, whether they’re static or AI-generated.

In this post, we outline a solution based on our real-life engagements with leaders in the industry who operate at the scale of petabytes per day. This solution delivers several benefits:

+
Minimizes costs by avoiding duplicate data and storage, while facilitating AI model proximity to data for efficient inference
+
+
Simplifies development and delivery by combining diverse data types in a unified data architecture
+
+
Optimizes use of limited engineering resources through an integrated, scalable serverless platform that combines BigQuery and Google Cloud Storage
+
+
Allows users to augment and transform their data according to the needs of their business
+
+
Enables companies to develop lightweight, powerful analyses quickly and securely, to activate customer data and quickly iterate on the output of models
+

The challenge of unstructured data

Generated (unstructured) image data, the (semi-structured) prompts that made them, as well as user behavior data (structured, in tables) for things like session time and frequency, are all rich in potential insights. For example, knowing which types of prompts lead to successfully generating an image — and those that don’t — provides insights into product and model development opportunities.

But combining these different data types often requires advanced analytics to interpret them meaningfully. Technologies like natural language processing and computer vision are at the forefront of extracting these kinds of valuable insights. However, integrating unstructured data within an existing analytics framework of structured data, for example user behavior data in database tables, is not without its hurdles. Common challenges include:

+
Data security standards: Adhering to stringent data security standards to protect sensitive information is crucial. These standards include applying data masking to sensitive PII data and following least-privilege security principles for data access.
+
+
Data type silos: Unstructured data is often stored separately from structured data, preventing effective analysis across data types, for example, filtering media assets (unstructured) based on user profiles (structured), as they reside in separate systems.
+
+
High-performance, scalable cloud computing resources: Powerful computing resources are needed to manage and analyze large unstructured datasets effectively, due to the data's complexity, volume, and the potential need for real-time results. In addition, high-performance networking enables low-latency transfers of unstructured data between storage (Cloud Storage) and analytical layers (BigQuery, Vertex, etc.).
+
+
Maintaining data integrity across layers: As insights are extracted from unstructured data, preserving the original source of truth and ensuring consistency across intermediate (interstitial) layers is crucial for reliable, iterative analysis.
+

Streamlining data integration with Cloud Storage and BigQuery

To overcome the challenges of working with unstructured data, Cloud Storage and BigQuery can be used to centralize data, using BigQuery object tables to enable consistent data access to varied sources through one analytical platform. Below is an example of a simple yet effective architecture that harnesses BigQuery for both metadata generation and enhancement. This approach uses BigQuery's built-in generative AI functions, coupled with remote User Defined Functions (UDFs) that interface with Vertex AI APIs. The integration elevates the process of data enrichment and analysis, and offers a more streamlined and efficient workflow.

The power of BigQuery object tables

In the example below, we focus on a static image use case; however, the same technique could be used for images created using generative AI. The true potential of this architecture lies in its versatility. The use of object tables in BigQuery means this pattern can be adapted to any form of unstructured data (for example, images, audio, or documents), opening up a world of possibilities for data science and analysis. This flexibility ensures the architecture can evolve with the changing needs and types of data, helping the solution withstand the test of time in the dynamic field of image curation and generation.

This architecture shows the integration of structured and unstructured data, utilizing the strengths of both to enhance platform capabilities. BigQuery serves as a central hub, amalgamating user data information (for example: user demographics, images viewed and used, session duration, session frequency), image metadata, and queries. Concurrently, external AI APIs augment this dataset with insights about the content of the images, for example describing what is happening in a scene (e.g. “a photographic image of a dog playing with a ball on grass”).

This convergence of data facilitates the training of sophisticated image-generation models, tailored to meet the specific requirements of the platform's users. It also unlocks advanced search and image-curation functionalities, enabling users to navigate through an extensive collection of images. The project's ability to provide access to external systems and empower data augmentation within BigQuery helps to centralize analytic workloads. This not only streamlines data analysis but also fosters informed decision-making.

Solution overview

+ + + + + + + +

+ + +

+ + + + + + + + + + + +

+ + + + + +

The goal of the solution is to create a way to interact with unstructured data through BigQuery. Using BigQuery object tables to analyze unstructured data in Cloud Storage, you can perform analyses using generative AI models via remote functions, cloud APIs via Vertex AI, or perform inference by using BigQuery ML, and then join the results of these operations with the rest of your structured data in BigQuery.

Step 1. Creating an example dataset

Prerequisites
Data: Multiple image repositories on third-party sites like Kaggle and Hugging Face
Project setup: To get started, we need to enable the essential project APIs:

+
gcloud services enable cloudfunctions.googleapis.com
+
+
gcloud services enable cloudbuild.googleapis.com
+
+
gcloud services enable bigqueryconnection.googleapis.com
+
+
gcloud services enable vision.googleapis.com
+

Step 2. Create the object table

The object table provides the reference to the unstructured data (e.g., audio, live images, and images).

To do this, we create the BigQuery BigLake remote connection, building a bridge between BigQuery and Cloud Storage:

+
Command for creation: bq mk --connection --location=us-central1 --project_id=bq-object-tables-demo --connection_type=CLOUD_RESOURCE biglake-connection
+
+
To show the details of this new creation, use: bq show --connection bq-object-tables-demo.us-central1.biglake-connection
+

Then, give your BQ service account the correct permissions to access your Cloud Storage bucket.

Your serviceAccountId typically looks like this: {"serviceAccountId": "bqcx-012345678910-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}. It needs the object viewer permission, which can be granted with:

gsutil iam ch serviceAccount:bqcx-012345678910-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com:objectViewer gs://bq-object-tables-demo-data

Make your object table in BigQuery in an existing dataset, or create a dataset for your object table.

+
Create the dataset with: bq mk -d --data_location=us-central1 bq_ot_dataset
+

This is a sample query you can use to create the object table:

CREATE OR REPLACE EXTERNAL TABLE `bq-object-tables.bq_ot_dataset.bq_object_tables_external_table`
WITH CONNECTION `bq-object-tables.us-east1.biglake-connection`
OPTIONS (
  object_metadata = "DIRECTORY",
  uris = ['gs://bq-object-tables-demo-data/*'],
  max_staleness = INTERVAL 30 MINUTE,
  metadata_cache_mode = "AUTOMATIC"
);

The max_staleness option lets you manage the trade-off between data freshness and performance by specifying a tolerable level of staleness for the object table's cached metadata; this can help improve query response times and reduce costs. By setting an appropriate value, you can achieve consistently high performance while keeping costs under control, even when working with large, frequently changing datasets.
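
If you want to experiment with different staleness windows, one option is to render the DDL from a small helper and run it with your preferred client. The sketch below only builds the statement text; the function name `object_table_ddl` and its parameters are hypothetical, and the option values mirror the sample query above.

```python
def object_table_ddl(table, connection, uri, max_staleness_minutes=30):
    """Render a CREATE EXTERNAL TABLE statement for an object table.

    table, connection, and uri are placeholders supplied by the caller;
    the option names and values follow the sample query in this post.
    """
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        '  object_metadata = "DIRECTORY",\n'
        f"  uris = ['{uri}'],\n"
        f"  max_staleness = INTERVAL {int(max_staleness_minutes)} MINUTE,\n"
        '  metadata_cache_mode = "AUTOMATIC"\n'
        ");"
    )
```

A larger interval favors cached metadata and cheaper queries, while a smaller one favors freshness.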

Create metadata using Native BQ Functionality

These steps can all be automated into a Directed Acyclic Graph (DAG) for use in an orchestration tool such as Cloud Composer.

Step 3. Reference the model from a native generative AI BQML function

First create the link back to the model in your BQ dataset like this:

# Create Model
CREATE OR REPLACE MODEL
`bq-object-tables.bq_ot_dataset.myvisionmodel`
REMOTE WITH CONNECTION `bq-object-tables.us-east1.biglake-connection`
OPTIONS (remote_service_type = 'cloud_ai_vision_v1');

Annotate image

This code parses the images, extracts their contents and outputs a JSON array of words that describe the image and the model’s confidence that the description is correct. This function will then put the description into a table.

# Annotate image
SELECT *
FROM ML.ANNOTATE_IMAGE(
  MODEL `mydataset.myvisionmodel`,
  TABLE `mydataset.mytable`,
  STRUCT(['label_detection'] AS vision_features)
);
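
Once the annotations have been exported, a few lines of post-processing can keep only the confident labels. The sketch below assumes each result is a JSON object with a `label_annotations` list whose entries carry `description` and `score`; that shape and the function name `top_labels` are assumptions, so check them against your actual query output.

```python
import json


def top_labels(annotation, min_score=0.8):
    """Return (description, score) pairs at or above min_score, best first.

    annotation: a JSON string or an already parsed dict; the
    'label_annotations' key is an assumed shape, not a guaranteed one.
    """
    result = json.loads(annotation) if isinstance(annotation, str) else annotation
    labels = result.get("label_annotations", [])
    kept = [
        (label["description"], label["score"])
        for label in labels
        if label.get("score", 0) >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Filtering by confidence keeps low-quality labels out of downstream search and recommendation features.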

Step 4. Create a UDF in BigQuery

You can create a Cloud function using this basic code.

If you’re unsure how to create a cloud function, please see the docs for how to create a cloud function UDF.
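
For orientation, here is a minimal sketch of the request handling such a function needs. BigQuery remote functions send a JSON body whose `calls` field holds one argument list per row and expect a JSON response with a `replies` list of the same length. The helper `describe_image` is a hypothetical stand-in for the real Vision API call, and the HTTP wiring of the Cloud Function itself is omitted.

```python
def describe_image(signed_url):
    """Hypothetical stand-in for a call to an image annotation service."""
    return {"url": signed_url, "labels": []}


def handle_remote_function_request(body):
    """Turn a BigQuery remote function request into (payload, status).

    body: the parsed JSON request; 'calls' contains one list of
    arguments per input row, in order.
    """
    try:
        replies = [describe_image(*args) for args in body["calls"]]
        # One reply per call, in the same order as the calls.
        return {"replies": replies}, 200
    except Exception as exc:
        # A non-200 status with an errorMessage signals failure.
        return {"errorMessage": str(exc)}, 400
```

In a deployed function, the returned payload would be serialized to JSON and sent back with the given status code.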

Then, to deploy the Cloud Function, follow these steps:

4.1. Deploy your Cloud Function

+
You may need to enable Cloud Functions API.
+
+
You may need to enable Cloud Build APIs.
+

4.2. Grant the BigQuery connection service account access to the Cloud Function

+
One way you can find the service account is by using the BigQuery cli ‘show’ command
+

4.3. Reference the functions in BigQuery

+
Create a BigQuery remote function to reference the Cloud Function UDF
+

CREATE OR REPLACE FUNCTION `mydataset.vision_safe_search`(signed_url_ STRING) RETURNS JSON
REMOTE WITH CONNECTION `us.gcs-connection`
OPTIONS (
  endpoint = 'https://region-myproject.cloudfunctions.net/vision_safe_search',
  max_batching_rows = 1
);

CREATE OR REPLACE FUNCTION `mydataset.vision_annotation`(signed_url_ STRING) RETURNS JSON
REMOTE WITH CONNECTION `us.gcs-connection`
OPTIONS (
  endpoint = 'https://region-myproject.cloudfunctions.net/vision_annotation',
  max_batching_rows = 1
);

Step 5. Use the function in a query

CREATE TABLE `mydataset.mid_processing` AS
SELECT
  uri,
  mydataset.vision_safe_search(signed_url) AS safe_search,
  mydataset.vision_annotation(signed_url) AS annotation
FROM EXTERNAL_OBJECT_TRANSFORM(
  TABLE `mydataset.imageall`,
  ["SIGNED_URL"]
);
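
The safe_search column can then be used to flag inappropriate content. The sketch below assumes the UDF returns an object that maps categories such as `adult` and `violence` to Vision API likelihood names; that shape depends entirely on how your Cloud Function is written, and the function name `is_inappropriate` is hypothetical.

```python
# Vision API likelihood levels, from least to most likely.
LIKELIHOOD_ORDER = [
    "UNKNOWN",
    "VERY_UNLIKELY",
    "UNLIKELY",
    "POSSIBLE",
    "LIKELY",
    "VERY_LIKELY",
]


def is_inappropriate(safe_search, categories=("adult", "violence"), threshold="LIKELY"):
    """Return True if any watched category meets or exceeds the threshold.

    safe_search: dict of category -> likelihood name (assumed shape).
    """
    limit = LIKELIHOOD_ORDER.index(threshold)
    return any(
        LIKELIHOOD_ORDER.index(safe_search.get(category, "UNKNOWN")) >= limit
        for category in categories
    )
```

Choosing a lower threshold such as POSSIBLE trades more false positives for fewer missed images.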

Tap into unstructured data with BigQuery object tables and AI

This architecture demonstrates the power of streamlining data integration for centralized analyses through BigQuery. Although we reference image data for this example, this methodology is highly flexible: using object tables, we can reference any type of unstructured data in Cloud Storage buckets, such as audio files for a call center AI use case or live image files relevant to training a computer vision model.

By centralizing data in Cloud Storage and BigQuery and intelligently using object tables, you can efficiently manage both structured and unstructured data. For our image-based example, this unified approach provides a rich dataset that contains user IDs, original prompts, prompt categories, image safety ratings, and even additional ML-generated prompts.

The potential applications for these metadata sets are huge. Product teams could use them to build more robust image-generation models or create an advanced image-search system, providing highly relevant results aligned with users' search terms and image descriptions.

Take the next step

You can get started today using this framework. For additional help, ask your Google Cloud account manager to reach out to the Built with BigQuery team.

The Built with BigQuery team helps Independent Software Vendors (ISVs) and data providers build innovative applications with Google Data Cloud. Participating companies can:

+
Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices
+
+
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption
+

What’s new with Active Assist: New Hub UI and four new recommendations

Tue, 07 May 2024 16:00:00 +0000

The Active Assist portfolio of intelligent tools can help you reduce costs, increase performance, improve security, and even help you make more sustainable decisions. Today, we’re excited to announce some new Active Assist features that address some of our customers’ largest concerns. These features unlock some key functionality that help you better understand and use recommendations, all aimed to help make managing and optimizing your cloud simpler and easier.

Revamped Recommendation Hub

Recommendation Hub is a centralized page on Google Cloud that helps you view all of your recommendations in one place across multiple categories: cost, security, performance, reliability, manageability, and even sustainability. We recently made improvements to help you better understand the recommendations you have and to help you focus on the ones that are the most impactful:

1. Organization-view of recommendations
One of our most in-demand features: you can now view all recommendations across all of your projects in your organization in one UI! Simply change the picker at the top left of the screen to choose an organization, and Active Assist shows all the recommendations under your organization (as long as you have the correct IAM permissions).

+ + + + + + + +

+ + +

+ + + + + + + + + + + +

+ + + + + +

2. Pre-filtered recommendations by value category
You can now view all of your recommendations under one category in a simple table view, so you can prioritize and focus on the recommendations that are the most relevant and important to you.

3. Custom sorting and filtering
With our new table views, you can sort and filter by different fields, such as product category, recommendation, cost savings, priority, etc. so you can find and view recommendations more easily.

+ + + + + + + +

+ + +

+ + + + + + + + + + + +

+ + + + + +

Four new recommendations

We’re continually adding new recommendations to the Active Assist portfolio based on customer feedback, to help you manage risk and optimize operations.

1. Cloud deprecation and breaking change recommendations
At Google Cloud, we take pains to provide backwards compatibility for our services. However, from time to time, we need to evolve the platform in a way that could impact some users, e.g., for security purposes. In addition to following a stringent process to minimize customer impact, Active Assist now includes recommendations about potential breaking changes, providing an additional mechanism for customers to learn about them.

+ + + + + + + +

+ + +

+ + + + + + + + + + + +

+ + + + + +

Our new cloud deprecation and breaking changes recommender helps identify Cloud resources that will be affected by upcoming deprecations and breaking changes while providing guidelines on how to manage them. Like our other recommendations, you can view them through our Recommendation Hub UI, API, and BigQuery export.

Deprecation and breaking change recommendations are offered at no charge and are available for all users today. Making sure you get ahead of these changes is important to help prevent any disruptions to your environment and ensure you are on the most reliable and supported services.

2. IAM for BigQuery recommendations
We’ve expanded the popular IAM Recommender to include IAM recommendations on BigQuery datasets. If your principals have roles on BigQuery datasets but they are not using all of the permissions within that role, you can now receive recommendations to remove or replace any of those roles. These recommendations help you enforce the principle of least privilege by ensuring that principals have only the permissions that they actually need.

You can view your recommendations through the UI, the API, or BigQuery export. The recommendations are currently free but will require Security Command Center Premium after April 29th.

3. Advisory Notifications recommendations
Advisory Notifications provides IAM policy recommendations to ensure the right parties within your organization have access to view critical security and privacy notifications in the Google Cloud console, so that they can receive and quickly address security notifications.

4. Recent change recommendations
We want to help you detect and mitigate issues (e.g., service outages) caused by misconfigurations to your important cloud resources. The new recent change recommendations automatically flag recent risky changes to cloud resources that are identified as important based on their usage and other signals. For example, if you deleted a highly used project, recent change recommendations will proactively warn you about the risks associated with the change, helping to identify and prevent unintended issues.

We’re excited about these latest change risk recommendations, and hope they will help you both prevent and mitigate misconfigurations and risky changes to your infrastructure. Try out the new features on Recommendation Hub yourself. If you have any feedback, please feel free to reach out to active-assist-feedback@google.com.

+ + + + + + + +

repareo adopts cloud-based microservices architecture to scale its auto repair business

Tue, 07 May 2024 08:00:00 +0000

When it comes to auto repair, most of us are at the mercy of our mechanic. We take our car to the local garage, pay for the job, and hope for the best. And with little transparency into the work that has been done, or the costs involved, we can be left feeling unsure whether we got value for money.

At repareo, we are transforming the customer experience of vehicle repair, modernizing the market to make it e-commerce ready, while giving customers full transparency over the car-repair process. Now, they can describe their vehicle problem on our site and immediately receive a list of local garages, including customer reviews, cost breakdowns, and availability, allowing them to make an informed decision about where to take their car. And because our site has direct interfaces with the garages’ booking systems, customers are able to book their preferred garage and appointment directly on our site, saving time calling garages for availability.

Outgrowing our monolithic architecture

This year, we will be launching our new infrastructure on Google Cloud. Previously, repareo was built on a monolithic system using a small, hosted server, which was both easy for our small development team to maintain and allowed us to grow the business in a cost-efficient way. However, as we added more features and services, our monolith grew, which had an impact on our development speed, creating a bottleneck for the rest of our application.

Reliability became an issue too. repareo is integrated with car management fleets and leasing companies, and many drivers access our services through their car leasing app, making us highly dependent on third-party APIs to function effectively. As we grew, the increase in traffic resulted in these APIs becoming sluggish. During periods of peak traffic, such as the German tire-change season, we would see a 300% surge in traffic, placing a significant strain on our server, which was unable to scale effectively, causing our services to grind to a halt.

Modernizing infrastructure to modernize the market

The turning point came last year, when we signed a major deal with a leading global e-commerce player to integrate with its vehicle parts marketplace, enabling customers to book an installation during the checkout process. Realizing that we would need our infrastructure to be able to handle an expected tenfold increase in demand, while conforming with our new service-level agreements (SLAs), we decided it was time to move to the cloud.

We knew that a migration would bring other benefits too, enabling us to build a microservices architecture to develop services in modules, as well as allowing us to place certain workloads close to a leading global eCommerce player in vehicle parts in California, for fast response times. As we looked at cloud providers, Google Cloud stood out for the range of technologies and services it offered, with BigQuery and Apigee particularly impressing us as uniquely advanced solutions in the market.

Beyond the technology, however, we were just as impressed with Google Cloud’s deep understanding of our industry and its business network within the automotive sector, as well as by the personal relationship we quickly built with the Google Cloud team. Migrating to a new provider is a once-in-a-lifetime decision, a marriage of sorts, and with Google Cloud the relationship immediately felt like it was built to last.

A robust, reliable system for a smoother customer experience

We’re currently halfway through our migration, which has proved to be a steep learning curve for us, given the scale of the undertaking for our small team of developers. However, Apigee has helped to make the migration smooth by enabling us to easily set up a staging environment to test and adjust our system before going live, with no impact on our users.

We expect to have completed the migration in less than six months in total. Once live, we will have a robust, scalable system, capable of meeting the needs of a significantly larger user base. Building and managing our APIs with Apigee means we will be able to use the caching system to cache the high number of API requests on the site, allowing us to offer high-performance buffering algorithms without having to drastically increase the scale of the underlying system. And because Apigee’s logging system is so well developed, we will easily be able to spot and remedy any integration issues, to ensure our APIs function effectively. As a result, our customers will enjoy a smooth, reliable booking system and real-time repair updates, while garages will benefit from far wider reach.

We won’t need to worry about being able to handle fluctuations in demand either, as the autoscale feature of Google Kubernetes Engine (GKE) will automatically scale up our workloads to meet surges in traffic and scale down again during quiet periods to help ensure we aren’t using more compute than necessary. This cost-efficient provisioning means we will no longer have to worry about the ability of our system to handle periods of peak traffic, with our customers benefiting from a fast, reliable service.

Taking our developers up a gear with easy-to-use, managed services

Our development efficiency has already significantly improved, thanks to the built-in features and managed services of Google Cloud. Apigee, for example, has a built-in key management system to enable APIs to communicate securely, which means our development team doesn’t need to spend time and money building our own system. Similarly, the fact that Cloud SQL is a managed database means we don’t need to spend time updating and maintaining it. GKE, meanwhile, improves our developer efficiency thanks to its easy integration with automation tools, increasing our deployment speed by at least 15%.

All of this means that our developers are free to focus on the core business logic and developing new features, such as new RPA-based technology to gather appointment availability from garage websites, which we were able to build and release inside a week, where it would have taken a month using the old infrastructure.

A firm foundation for sustained growth

With a number of other significant deals in the pipeline, repareo is now entering a period of sustained growth as we rapidly increase our customer base and prepare to enter new markets. That level of scaling simply wouldn’t be possible without Google Cloud, whose global network and regional data centers make it easy to move into any new region and enjoy rapid response times. Meanwhile, its scalable architecture means we can be confident that our infrastructure will always be able to scale with us.

As we continue our mission to make auto repair more transparent and convenient for more people, we are confident that with Google Cloud we have the right provider to help move our business into the fast lane.

Maintain business continuity across regions with BigQuery managed disaster recovery

Mon, 06 May 2024 16:00:00 +0000

Geographical redundancy is a fundamental part of building a resilient cloud-based data strategy. For many years, BigQuery has offered an industry-leading 99.99% uptime service-level agreement (SLA) for availability within a single geographical region. Full redundancy across two data centers within a single region is included with every BigQuery dataset you create and is managed in a completely transparent manner.

For customers looking for enhanced redundancy across large geographic regions, we are now introducing managed disaster recovery for BigQuery. This feature, now in preview, offers automated failover of compute and storage and a new cross-regional SLA tailored for business-critical workloads. This feature enables you to ensure business continuity in the unlikely event of a total regional infrastructure outage. Managed disaster recovery also provides failover configurations for capacity reservations, so you can manage query and storage failover behavior. This feature is available through BigQuery Enterprise Plus edition.

How does it work?

Customers using BigQuery’s Enterprise Plus edition can now configure their capacity reservations to enable automated failover across distinct geographic regions. Extending the capabilities of BigQuery’s cross-region dataset replication, failover reservations ensure that the locations of both data and compute resources are coordinated during a disaster recovery event.

Slot capacity in the secondary region for Enterprise Plus edition reservations is provisioned and maintained automatically at no additional cost. Some competitive products require customers to duplicate their compute clusters in the secondary location.

In the event of a total regional outage, the secondary region can be promoted to the primary role for both compute and data. With BigQuery’s query routing layer, failover is completely transparent to end users and tools.

- Primary region: The region containing the current primary replica of a dataset. This is also the region where the dataset data can be modified (e.g. loads, DDL, or DML).
- Secondary region: The region where the failover reservation standby capacity and replicated datasets are available in the case of a regional outage.
- Failover reservation: An Enterprise Plus edition reservation configured with a primary/secondary region pair. Note: Datasets are attached to failover reservations.

The dataset replica in the primary region is the primary replica, and the replica in the secondary region is the secondary replica. These roles are swapped during the failover process.

The primary replica is writeable, and the secondary replica is read-only. Writes to the primary replica are asynchronously replicated to the secondary replica. Within each region, the data is stored redundantly in two zones. Network traffic never leaves the Google Cloud network.
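
The role swap described above can be pictured with a small toy model. This is only an illustrative sketch in Python, not BigQuery code: the `ReplicatedDataset` class and its methods are hypothetical names introduced here, and the region names are borrowed from the examples later in this post.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedDataset:
    """Toy model (hypothetical) of a dataset with one replica per region."""

    primary_region: str
    secondary_region: str

    def can_write(self, region: str) -> bool:
        # Only the primary replica accepts writes (loads, DDL, DML);
        # the secondary replica is read-only.
        return region == self.primary_region

    def fail_over(self) -> None:
        # Promotion swaps the primary and secondary roles.
        self.primary_region, self.secondary_region = (
            self.secondary_region,
            self.primary_region,
        )


dataset = ReplicatedDataset(primary_region="us-west1", secondary_region="us-east1")
writable_before = dataset.can_write("us-east1")  # secondary: read-only
dataset.fail_over()
writable_after = dataset.can_write("us-east1")   # now the primary
print(writable_before, writable_after)
```

Calling `fail_over()` a second time restores the original roles, which corresponds to the fail-back step.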

What is a region pair?

A region pair in BigQuery’s managed disaster recovery is a pair of regions that are geographically supported by turbo replication and compute redundancy. Within the defined region pair, BigQuery replicates data between the two regions and manages secondary available capacity. This replication allows BigQuery managed disaster recovery to provide high availability and durability for data. Customers are able to define their desired region pair (based on the supported regions) per failover reservation.

Supported region pairs

BigQuery’s managed disaster recovery feature supports failover reservations across specific region pairs (similar to Cloud Storage, for regions within a geographic area). You can designate either region in a pair for your initial primary or secondary region.

Capacity in the secondary region

BigQuery ensures that the capacity of your primary region will be available in your secondary region within five minutes of a failover. This assurance applies to your reservation baseline, whether it’s used or not. BigQuery also provides the same level of autoscaling availability as provided in the primary.

How much does it cost?

BigQuery's managed disaster recovery feature is available with the Enterprise Plus edition. Standby compute capacity in the secondary region is included with the per slot-hour price with no requirement to purchase separate standby capacity. As an option, you may choose to provision additional Enterprise Plus reservations in the secondary region, specifically for read-only queries.

Managed disaster recovery customers are billed for replicated storage in the primary and secondary regions for associated datasets. At GA, this feature will automatically use turbo replication for data transfer between regions.

| SKU | Billing description |
| --- | --- |
| Enterprise Plus edition | $0.10 / slot-hr (e.g., US pricing) |
| Storage | Storage bytes in the secondary region are billed at the same list price as storage bytes in the primary region. See BigQuery Storage pricing for more information. |
| Data transfer | Managed disaster recovery uses turbo replication*. Data transfer used during replication is charged based on physical bytes, on a per physical GB replicated basis. Note: Turbo replication will be 2x the pricing of “default replication”. |

* Turbo replication is not available during preview but will be enabled automatically at general availability (GA).
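
To make the billing model concrete, here is a rough back-of-the-envelope sketch in Python. The $0.10 per slot-hour figure and the 2x turbo replication multiplier come from the table above. Everything else is an assumption made for illustration: the 730-hour month, the function names, and in particular the default replication price per GB, which is a placeholder and not a published price.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_compute_cost(baseline_slots: int, price_per_slot_hour: float = 0.10) -> float:
    """Baseline slot cost; standby capacity in the secondary is included."""
    return baseline_slots * price_per_slot_hour * HOURS_PER_MONTH


def replication_transfer_cost(physical_gb: float, default_price_per_gb: float) -> float:
    """Turbo replication is described as 2x the default replication price."""
    return physical_gb * default_price_per_gb * 2


compute = monthly_compute_cost(baseline_slots=200)
# 0.02 is a hypothetical default replication price per GB, not a real quote.
transfer = replication_transfer_cost(physical_gb=500, default_price_per_gb=0.02)
print(f"compute: ${compute:,.2f}  transfer: ${transfer:,.2f}")
```

Storage is not modeled separately here: because secondary-region bytes are billed at the same list price as primary-region bytes, the storage bill for a replicated dataset is roughly double that of an unreplicated one.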

Recovery Time Objective (RTO)

Promotion of a secondary reservation and associated datasets takes less than five minutes, even if the primary region is down. All queries in flight are canceled and rejected during the RTO timeline.

Recovery Point Objective (RPO)

Data in secondary dataset replicas will be less than 15 minutes old, provided that the replicas are configured for a failover reservation between a supported region pair, turbo replication is enabled, and the initial replication (also known as backfill) has completed.

Note: Turbo replication and RTO/RPO with SLA are not available during preview.
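
As a way to reason about the RPO target, the following Python sketch checks whether a replica's last replication time falls within the 15-minute window. It is a generic illustration: `within_rpo` and `replica_staleness` are hypothetical helpers, and how you obtain the last replication timestamp for a real replica is outside the scope of this sketch.

```python
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(minutes=15)  # target stated in this section


def replica_staleness(last_replicated_at: datetime, now: datetime) -> timedelta:
    """How far the secondary replica lags behind at time `now`."""
    return now - last_replicated_at


def within_rpo(last_replicated_at: datetime, now: datetime,
               target: timedelta = RPO_TARGET) -> bool:
    """True when the data that could be lost on failover is inside the target."""
    return replica_staleness(last_replicated_at, now) < target


now = datetime(2024, 5, 6, 16, 0, tzinfo=timezone.utc)
fresh = within_rpo(now - timedelta(minutes=4), now)
stale = within_rpo(now - timedelta(minutes=22), now)
print(fresh, stale)
```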

Configuration in action

During preview, managed disaster recovery configuration is supported via the BigQuery Console (UI) and SQL. The following workflow shows how you can set up and manage disaster recovery in BigQuery:

Create a replica for a given dataset

To replicate a dataset, use the ALTER SCHEMA ADD REPLICA DDL statement.

After you add a replica, it takes time for the initial copy operation to complete. You can still run queries referencing the primary replica while the data is being replicated, with no reduction in query-processing capacity.

    -- Create the primary replica in the primary region.
    CREATE SCHEMA my_dataset OPTIONS(location='us-west1');
    -- Create a replica in the secondary region.
    ALTER SCHEMA my_dataset
    ADD REPLICA `us-east1`
    OPTIONS(location='us-east1');

Configure a failover reservation + attach a dataset

The first step is to create a failover reservation and specify its secondary location. Specifying a secondary location can also be done for existing Enterprise Plus reservations.

    CREATE RESERVATION `project1.region-us-west1.my_failover_reservation`
      OPTIONS (slot_capacity = 200, edition = ENTERPRISE_PLUS,
      secondary_location='us-east1');

The next step is to associate one or more datasets with the failover reservation. The dataset needs to be replicated in the same primary / secondary region as specified in the reservation.

    ALTER SCHEMA `my_dataset`
      SET OPTIONS (failover_reservation =
      `project1.my_failover_reservation`);

Promote the failover reservation + dataset in the secondary

Fail over the reservation and associated datasets. This must be performed from the secondary region.

    ALTER RESERVATION `project1.region-us-east1.my_failover_reservation`
      SET OPTIONS (is_primary = true);

Fail back to original primary

Fail back the reservation and associated datasets (performed from the new secondary/old primary).

    ALTER RESERVATION `project1.region-us-west1.my_failover_reservation`
      SET OPTIONS (is_primary = true);

Getting started

Business continuity is paramount for customers with mission-critical data environments. We are excited to make the preview of BigQuery’s managed disaster recovery feature available for your testing. You can learn more about managed disaster recovery and how to get started in the BigQuery managed disaster recovery QuickStart.

Advancing the art of AI-driven security with Google Cloud

Mon, 06 May 2024 13:00:00 +0000

The advent of generative AI has unlocked new opportunities to empower defenders and security professionals. We have already seen how AI can transform malware analysis at scale as we work to deliver better outcomes for defenders. In fact, using Gemini 1.5 Pro, we were recently able to reverse engineer and analyze the decompiled code of the WannaCry malware in a single pass — and identify the killswitch — in only 34 seconds.

Our vision for AI is to accelerate your ability to protect and defend against threats by shifting from manual, time-intensive efforts to assisted and, ultimately, semi-autonomous security — while providing you with curated tools and services to secure your AI data, models, applications, and infrastructure. We do this by empowering defenders with Gemini in Security, which uses SecLM, our security-tuned API, as well as providing tools and services to manage AI risk to your environment. Our Mandiant experts are able to help you secure your AI journey wherever you are.

Managing AI risk and empowering defenders with gen AI.

Today at the RSA Conference in San Francisco, we’re sharing more on our vision for the intersection between AI and cybersecurity, including how we help organizations secure AI systems and provide AI tools to support defenders. We are introducing new AI offerings from Mandiant Consulting and new features in Security Command Center Enterprise to help address security challenges when adopting AI. We are also announcing the general availability of Gemini across several security offerings including Google Threat Intelligence and Google Security Operations to further empower defenders with generative AI.

New services leverage security and AI expertise from Google

As customers integrate AI into every area of their business, they tell us that securing their use of AI is essential. The recent State of AI and Security Survey Report from the Cloud Security Alliance highlighted that while many professionals are confident in their organization’s ability to protect AI systems, a significant portion still recognizes the risks of underestimating threats.

Our Secure AI Framework (SAIF) provides a taxonomy of risks associated with AI workloads and recommended mitigations. Today we are announcing new offerings from Mandiant Consulting that can help organizations support SAIF and secure the use of AI. Mandiant's AI consulting services can help assess the security of your AI pipelines and test your AI defense and response with red teaming. These services can also help your defenders identify and implement ways to use AI to enhance cyber defenses and streamline investigative capabilities.

“The use of AI opens up a world of possibilities and enterprises recognize that in order to take advantage of the potential of these innovations, they need to get ahead of new security risks,” said Jurgen Kutscher, vice president, Mandiant Consulting, Google Cloud. “From helping secure training data to assessing AI applications for vulnerabilities, our Mandiant Consulting experts can provide recommendations based on Google’s own experience protecting and deploying AI. We’re excited to bring these new services to market to help our clients leverage AI more securely and transform their operations.”

Notebook Security Scanner identifies package vulnerabilities and recommends next steps to remediate individual packages.

Securing AI workloads against risks

We are also announcing new AI-protection capabilities that can help our customers implement SAIF by building on our release of Security Command Center Enterprise — our cloud risk-management solution that fuses cloud security and enterprise security operations:

- Notebook Security Scanner, now available in preview, detects and provides remediation advice for vulnerabilities introduced by open-source software installed in managed notebooks.
- Model Armor, expected to be in preview in Q3, can enable customers to inspect, route, and protect foundation model prompts and responses. It can help customers mitigate risks such as prompt injections, jailbreaks, toxic content, and sensitive data leakage. Model Armor will integrate with products across Google Cloud, including Vertex AI.

If you’d like to learn more about early access for Model Armor, you can sign up here.

Model Armor allows users to configure policies and set content safety filters to help block or redact inappropriate model prompts and responses.

Empowering defenders with new gen AI security tools

Today, we’ve also shared how security teams can better defend against threats with Google Security Operations, our AI-powered platform to help empower SOC teams to more easily detect and respond to threats. Gemini in Security Operations now includes a new assisted investigation feature that navigates users through the platform based on the context of an investigation. It can help hunt for the latest threats with vital information from Google Threat Intelligence and MITRE, analyze security events, create detections using natural language, and recommend next steps to take.

Users can also ask Gemini to create a response playbook using natural language, which can simplify the time-consuming task of manually constructing one. The user can further refine the generated playbook and simulate its execution. These new enhancements can give security teams a boost across the detection and response lifecycle.

“Gemini in Security Operations is enabling us to enhance the efficiency of our Cybersecurity Operations Center program as we continue to drive operational excellence,” said Ronald Smalley, senior vice president, cybersecurity operations, Fiserv. “Detection engineers can create detections and playbooks with less effort, and security analysts can find answers quickly with intelligent summarization and natural language search. This is critical as SOC teams continue to manage increasing data volumes and need to detect, validate, and respond to events faster.”

Gemini in Security Operations aids investigations and helps users easily create rules for detections.

We also are introducing Google Threat Intelligence, a new offering that can help you reduce the time it takes to identify and protect against novel threats by bringing together investigative learnings from Mandiant frontline experts, the VirusTotal intel community, and Google threat insights from protecting billions of devices and user accounts.

With Gemini in Threat Intelligence, analysts can now conversationally search Mandiant’s vast frontline research to understand threat actor behaviors in seconds, and read AI-powered summaries of relevant open-source intelligence (OSINT) articles the platform automatically ingests to help reduce investigation time.

“Our main objective is to understand the purpose of the threat actor. The AI summaries provided by Gemini in Threat Intelligence make it easy to get an overview of the actor, information about relevant entities, and which regions they're targeting,” said the director of information security at a leading multinational professional services organization. “The information flows really smoothly and helps us gather the intelligence we need in a fraction of the time.”

Plus, Gemini in Threat Intelligence includes Code Insight, which can inspect more than 200 file types, summarize their unique properties, and identify potentially malicious code. Gemini makes it easier for security professionals to understand the threats that matter most to their organization and take action to respond.

Gemini in Google Threat Intelligence allows users to conversationally search Mandiant’s vast corpus of frontline research.

Make Google part of your security team

With rapid advances in AI technology, the line of what is possible is a moving target. We have a vision for a world in which the practice of “doing security” is less laborious and more durable, as AI offloads routine tasks and frees the experts to focus on the most complex issues.

Organizations can now address security challenges with the same capabilities that Google uses to keep more people and organizations safe online than anyone else in the world.

To learn more about AI and security, and the rest of Google Cloud Security’s comprehensive portfolio, come meet us in person at our RSA Conference booth (N5644). You can also catch us at our RSA Conference keynotes, presentations, and meetups, and get the latest AI and Security updates here.

Introducing Google Security Operations: Intel-driven, AI-powered SecOps

Mon, 06 May 2024 13:00:00 +0000

In the generative AI era, security teams are looking for a fully operational, high-performing security operations solution that can drive productivity while empowering defenders to detect and mitigate new threats.

Today at the RSA Conference in San Francisco, we’re announcing AI innovations across the Google Cloud Security portfolio, including Google Threat Intelligence, and the latest release of Google Security Operations. Today’s update is designed to help reduce the do-it-yourself complexity of SecOps and enhance the productivity of your entire Security Operations Center.

Turn intelligence into action

At Next ‘24, we shared how Applied Threat Intelligence can help teams turn intelligence into action, uncover more threats with less effort, and unlock deeper threat hunting and investigation workflows. Today we are unveiling new features that will use AI to automatically generate detections based on new threat discoveries. Coming later this year, this new capability will help enable you to identify malicious activity operating in your environment, and share clear directions that guide you through triage and response.

“Google Security Operations provides access to unique threat intelligence and advanced capabilities that are highly integrated into the platform. It enables security teams to surface the latest threats in a turnkey way that doesn’t require complicated engineering,” said Michelle Abraham, research director, IDC. “Google is a potential partner for organizations in the fight against existing and emerging threats.”

Google Security Operations is a unified, AI and intel-driven platform for threat detection, investigation, and response.

Uncover the latest threats with curated detections

To help reduce manual processes and provide better security outcomes for our customers, Google Security Operations includes a rich set of curated detections. Developed and maintained regularly by Google and Mandiant experts, curated detections can enable customers to detect threats relevant to their environment. Notable new curated detections include:

- Cloud detections can address serverless threats, cryptomining incidents across Google Cloud, all Google Cloud and Security Command Center Enterprise findings, anomalous user behavior rules, machine learning-generated lists of prioritized endpoint alerts (based on factors such as user and entity context), and baseline coverage for AWS including identity, compute, data services, and secret management. We have also added detections based on learnings from the Mandiant Managed Defense team. These detections are now available in Google Security Operations Enterprise and Enterprise Plus packages.
- Frontline threat detections can provide coverage for recently detected methodologies, and are based on threat actor tactics, techniques and procedures (TTPs), including from nation-states and newly detected malware families. New threats discovered by Mandiant’s elite team, including during incident response engagements, are then made available as detections. They are now available in the Google Security Operations Enterprise Plus package.

Drive productivity for all with AI-powered SecOps

The addition of Gemini in Security Operations can elevate the skills of your security team. It can help reduce the time security analysts spend writing, running, and refining searches and triaging complex cases by approximately sevenfold. Security teams can search for additional context, better understand threat actor campaigns and tactics, initiate response sequences and receive guided recommendations on next steps — all using natural language. Today we are sharing two exciting updates to Gemini in Security Operations.

Now generally available, the Investigation Assistant feature can help security professionals make faster decisions and respond to threats with more precision and speed by answering questions, summarizing events, hunting for threats, creating rules, and recommending actions based on the context of investigations.

Investigation Assistant can help answer questions, summarize events, hunt for threats, create rules, and recommend actions.

Playbook Assistant, now in preview, can help teams easily build response playbooks, customize configurations, and incorporate best practices — helping simplify time-consuming tasks that require deep expertise.

Playbook Assistant can help build response playbooks, customize configurations, and incorporate best practices.

Reduce manual work with autonomous parsers

Getting data into the system and maintaining the pipeline is a critical yet time-consuming task in security operations. As log sources change and new fields need to be extracted, security engineers and architects are often required to spend considerable time writing new parsing logic and ensuring backward compatibility.

Today we are excited to announce that Google Security Operations can now automatically parse log files by extracting all key-value pairs to make them available for search, rules, and analytics. Available in preview, automatic parsing can help reduce the maintenance overhead of parsers in general, and also reduce the time-consuming task of creating custom parsers. It supports JSON-based logs, and we will be adding support for other log formats. Automatically parsing log files can help security teams have the right data and context, making for faster and more effective investigations and detection authoring.
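
To illustrate the general idea of extracting key-value pairs from a JSON log, here is a minimal Python sketch. It is not the parser used by Google Security Operations. The `flatten` helper, the dotted and bracketed key naming scheme, and the sample log line are all assumptions made for this example.

```python
import json
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Recursively turn nested JSON into flat key-value pairs."""
    pairs: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            pairs.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            pairs.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Scalars (strings, numbers, booleans, null) become leaf values.
        pairs[prefix] = value
    return pairs


# Hypothetical log line, used only to demonstrate the helper.
raw_log = (
    '{"event": {"type": "login", "user": {"id": 42, "name": "alice"}},'
    ' "tags": ["vpn", "mfa"]}'
)
fields = flatten(json.loads(raw_log))
for key, val in fields.items():
    print(f"{key} = {val}")
```

Because every leaf is addressed by its full path, a newly added field in the log source shows up as a new key without any change to the parsing logic, which is the maintenance benefit the paragraph above describes.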

Raise the bar for defense

For customers in need of expert support for managing Google Security Operations, we’ve got you covered. Google Security Operations can also work in concert with Mandiant Managed Defense and Mandiant Hunt, which can help you to reduce risks to your organization. Mandiant's team of seasoned defenders, analysts, and threat hunters works seamlessly with your security team and the AI-infused capabilities of Google Security Operations to quickly and effectively hunt or monitor, detect, triage, investigate, and respond to incidents.

And for our public sector customers that may have more specialized requirements, we offer Google SecOps CyberShield to help governments worldwide build an enhanced cyber threat capability.

To learn more about Google Security Operations, and the rest of Google Cloud Security’s comprehensive portfolio including an expanded Chrome Enterprise ecosystem, come meet us in person at our RSA Conference booth (N5644). You can also catch us at our keynotes, presentations, and meetups including our session, “Bye-Bye DIY: Frictionless Security Operations with Google,” on Tuesday, May 7, at 1:15 p.m. PDT.

Not attending RSAC? Join us for our upcoming webinar, “Stay ahead of the latest threats with intelligence-driven security operations,” on Wednesday, May 22, at 11:00 a.m. PDT.

Introducing Google Threat Intelligence: Actionable threat intelligence at Google scale

Mon, 06 May 2024 13:00:00 +0000

For decades, threat intelligence solutions have had two main challenges: They lack a comprehensive view of the threat landscape, and to get value from intelligence, organizations have to spend excess time, energy, and money trying to collect and operationalize the data.

Today at the RSA Conference in San Francisco, we are announcing Google Threat Intelligence, a new offering that combines the unmatched depth of our Mandiant frontline expertise, the global reach of the VirusTotal community, and the breadth of visibility only Google can deliver, based on billions of signals across devices and emails. Google Threat Intelligence includes Gemini in Threat Intelligence, our AI-powered agent that provides conversational search across our vast repository of threat intelligence, enabling customers to gain insights and protect themselves from threats faster than ever before.

“While there is no shortage of threat intelligence available, the challenge for most is to contextualize and operationalize intelligence relevant to their specific organization,” said Dave Gruber, principal analyst, Enterprise Strategy Group. “Unarguably, Google provides two of the most important pillars of threat intelligence in the industry today with VirusTotal and Mandiant. Integrating both into a single offering, enhanced with AI and Google threat insights, offers security teams a new means to operationalize actionable threat intelligence to better protect their organizations.”

Unmatched visibility into threats

Google Threat Intelligence provides unparalleled visibility into the global threat landscape. We offer deep insights from Mandiant’s leading incident response and threat research team, and combine them with our massive user and device footprint and VirusTotal’s broad crowdsourced malware database.

- Google threat insights: Google protects 4 billion devices and 1.5 billion email accounts, and blocks 100 million phishing attempts per day. This provides us with a vast sensor array and a unique perspective on internet and email-borne threats that allows us to connect the dots back to attack campaigns.
- Frontline intelligence: Mandiant's elite incident responders and security consultants dissect attacker tactics and techniques, using their experience to help customers defend against sophisticated and relentless threat actors across the globe in over 1,100 investigations annually.
- Human-curated threat intelligence: Mandiant’s global threat experts meticulously monitor threat actor groups for activity and changes in their behavior to contextualize ongoing investigations and provide the insights you need to respond.
- Crowdsourced threat intelligence: VirusTotal's global community of over 1 million users continuously contributes potential threat indicators, including files and URLs, to offer real-time insight into emerging attacks.
- Open-source threat intelligence: We use open-source threat intelligence to enrich our knowledge base with current discoveries from the security community.

Google Threat Intelligence boasts a diverse set of sources that provide a panoramic view of the global threat landscape and the granular details needed to make informed decisions.

This comprehensive view allows Google Threat Intelligence to help protect your organization in a variety of ways, including external threat monitoring, attack surface management, digital risk protection, Indicators of Compromise (IOC) analysis, and expertise.

AI-driven operationalization

Traditional approaches to operationalizing threat intelligence are labor-intensive and can delay your response to evolving threats by days or weeks.

Google Threat Intelligence uses Gemini to analyze potentially malicious code and provides a summary of its findings.

By combining our comprehensive view of the threat landscape with Gemini, we have supercharged the threat research processes, augmented defense capabilities, and reduced the time it takes to identify and protect against novel threats. Customers now have the ability to condense large data sets in seconds, quickly analyze suspicious files, and simplify challenging manual threat intelligence tasks.

How Gemini helps simplify and assist with threat intelligence

Gemini 1.5 Pro is a valuable part of Google Threat Intelligence, and we’ve integrated it so that it can more efficiently and effectively assist security professionals in combating malware.

Gemini 1.5 Pro offers the world’s longest context window, with support for up to 1 million tokens. It can dramatically simplify the technical and labor-intensive process of reverse engineering malware — one of the most advanced malware-analysis techniques available to cybersecurity professionals. In fact, it was able to process the entire decompiled code of the malware file for WannaCry in a single pass, taking 34 seconds to deliver its analysis and identify the killswitch.

We also offer a Gemini-driven entity extraction tool to automate data fusion and enrichment. It can automatically crawl the web for relevant open source intelligence (OSINT), and classify online industry threat reporting. It then converts this information to knowledge collections, with corresponding hunting and response packs pulled from motivations, targets, tactics, techniques, and procedures (TTPs), actors, toolkits, and Indicators of Compromise (IoCs).

Google Threat Intelligence can distill more than a decade of threat reports to produce comprehensive, custom summaries in seconds.

Make Google part of your security team

Google Threat Intelligence is just one way we can help you in your threat intelligence journey. Whether you need cyber threat intelligence training for your staff, assistance with prioritizing complex threats, or even a dedicated threat analyst embedded in your team, our experts can act as an extension of your own team.

Google Threat Intelligence is part of Google Cloud Security’s comprehensive security portfolio, which includes Google Security Operations, Mandiant Consulting, Security Command Center Enterprise, and Chrome Enterprise. With our offerings, organizations can address security challenges with the same capabilities Google uses to keep more people and organizations safe online than anyone else in the world.

To learn more about Google Threat Intelligence and the rest of Google Cloud Security’s comprehensive portfolio, come meet us in person at our RSA Conference booth (N5644), and catch us at our keynotes, presentations, and meetups. You can also register for our upcoming Google Threat Intelligence use-cases webinar series, and read our expert analysis and in-depth research at the Google Cloud Threat Intelligence blog.

Chrome Enterprise expands ecosystem to strengthen endpoint security and Zero Trust access

Mon, 06 May 2024 13:00:00 +0000

The modern workplace relies on web-based applications and cloud services, making browsers and their sensitive data a primary target for attackers. While the risks are significant, Chrome Enterprise can help organizations simplify and strengthen their endpoint security with secure enterprise browsing.

Following our recent Chrome Enterprise Premium launch, today at the RSA Conference in San Francisco, we’re announcing a growing ecosystem of security providers who are working with us to extend Chrome Enterprise’s browser-based protections and help enterprises protect their users working on the web and across corporate applications.

Expanding Zero Trust protections with Zscaler

Chrome Enterprise Premium offers advanced security across SaaS and private web applications for enterprises. Many organizations rely on Zscaler Private Access (ZPA) as an improved option over VPNs and firewalls to provide secure, Zero Trust access to private applications on-premises and in the cloud. Now security operations teams can add a layer of additional safeguards through Chrome Enterprise Premium, including:

+
Data protections: Critical DLP functions including data exfiltration controls, copy, paste, and print restrictions, and watermarking capabilities. This complements Zscaler's data protection across endpoints, email, SaaS and cloud.
+
+
Threat prevention: Advanced malware scanning, real-time phishing security, and credential protections, augmenting Zscaler's inline inspection of encrypted traffic and built in threat protections.
+
+
Security insights: Additional telemetry and reporting across insider and external risks.
+

Google has collaborated with Zscaler to provide enterprises with a solution guide that enables organizations to configure their network security products alongside Chrome Enterprise Premium for deeper security and protections.

Browser-based device trust with Cisco Duo

As attacks targeting end-users become more sophisticated, a multi-layered defense that includes a strong device access policy is crucial. Signals including user identity, device security, and location can enable dynamic, risk-based access decisions that further protect corporate data.

Enterprises can now use Duo Trusted Endpoints policy to enforce device trust using built-in Chrome Enterprise signals to deny access from unknown devices — without having to deploy additional agents and extensions. This integration allows organizations to:

+
Verify endpoint trust at login, and block unknown devices
+
+
Manage device access from a centralized Duo dashboard
+
+
Adjust granular policies for an organization of any size in a few clicks
+

Duo's Trusted Endpoints feature lets organizations grant secure access to applications with policies that verify systems using signals from Chrome.

Data loss prevention with Trellix

Data loss remains a top concern for enterprises, and the browser is a critical point for stopping data leaks. Trellix DLP for Chrome Enterprise is now available as an integration to customers managing Chrome from the cloud. With the Trellix DLP integration, organizations can prevent data leaks in Chrome by:

Monitoring and blocking file uploads with sensitive content
Tracking and preventing sensitive content from being copied and pasted to websites
Controlling print activity in Chrome browser and on local workstations

When sensitive information is detected in Chrome, the user is immediately notified with a pop-up.

Current Trellix DLP and Cisco Duo customers can implement these integrations by enrolling browsers into Chrome Enterprise Core and setting up a one-time configuration, at no additional cost. Learn more about the Trellix DLP integration here and Cisco Duo integration here.

Take the next step

To learn more about Chrome Enterprise, and the rest of Google Cloud Security’s comprehensive portfolio including our RSAC announcements on Google Cloud Security and AI, Google Threat Intelligence, and Google Security Operations, come meet us in person at our RSA Conference booth (N5644), and catch us at our keynotes, presentations, and meetups. You can also learn more about Chrome Enterprise here.

Uncomplicating the complex: How Spanner simplifies microservices-based architectures

Fri, 03 May 2024 16:00:00 +0000

In the realm of modern application design, developers have a range of choices available to them for crafting architectures that are not only simple, but also scalable, performant and resilient. Container platforms like Kubernetes (k8s) offer users the ability to seamlessly adjust node and pod specifications, so that services can scale. This scalability does not come at the expense of elasticity, and also ensures consistent performance for service consumers. So it’s no surprise that Kubernetes has become the de facto standard for building distributed and resilient systems in medium-to-large organizations.

Unfortunately, the level of maturity and standardization in the Kubernetes space available to system designers in the application layer doesn’t usually extend to the database layer that powers these services. And it goes without saying that the database layer also needs to be elastic, scalable, highly available, and resilient.

Further, the challenges are amplified when these services:

+
Are required to manage (transactional) states or
+
+
Orchestrate distributed processes across multiple (micro-) services.
+

Traditional relational database management software (RDBMS) brings with it side effects that are not aligned with a microservices way of thinking, and entails fairly significant trade-offs. In the sections below, we dive deeper into the scalability, availability and operational challenges faced by application designers specifically within the database tier. We then conclude with a description of how Spanner can help you build microservices-based systems without the often unspoken “impedance mismatch” between the application layer and the database layer.

We look at this problem from a scalability and availability perspective, specifically in the context of databases that cater to OLTP workloads. We explore the intricacies involved in accommodating highly variable workloads, shedding light on the complexities associated with managing higher demands through the utilization of both replicas and sharding techniques.

Wanted: scalability and availability

When it comes to scaling a traditional relational database, you have two choices (leaving aside caching strategies):

+
Scale up: To scale a database vertically, you typically augment its resources by adding more CPU power, increasing memory, or adding faster disks. However, these scale-up procedures usually incur downtime, affecting the availability of dependent services.
+
+
Scale out: Although vertically scaling up databases can be effective initially, it eventually encounters limitations. The alternative is to scale out database traffic, by introducing additional read replicas, employing sharding techniques, or even a combination of both. These methods come with their own trade-offs and introduce complexities, which lead to operational overhead.
+

In terms of availability, databases require maintenance, resulting in the need to coordinate regular periods of downtime. Relational databases can also be prone to hardware defects, network partitions, or subject to data center outages that bring a host of DR scenario challenges that you need to address and plan for.

Examples of planned downtime:

+
OS or database engine upgrades or patches
+
+
Schema changes - Most database engines require downtime for the duration of a schema change
+

Examples of unplanned downtime:

+
Zonal or regional outages
+
+
Network partitions
+

Most “mature” practices for handling traditional RDBMSs run counter to modern application design principles and can significantly impact the availability and performance of services. Depending on the nature of the business, this can have consequences for revenue streams, compliance with regulations, or adversely impact customer satisfaction.

Let’s go over some of the key challenges associated with RDBMSs.

Challenges associated with read-replicas

Database read replicas are a suitable tool for scaling out read operations and mitigating planned downtime, so that reads are at least available to the application layer.

In order to reduce load on the primary database instance, replicas can be created to distribute read load across multiple machines and thus handle more read requests concurrently.

Replication between the primary and its read replicas is usually asynchronous, so there can be a lag between when data is written to the primary and when it reaches the replicas. Reads directed to the replicas can therefore return slightly outdated (stale) data, and queries that require guaranteed consistency must be directed to the primary instance. Synchronous replication is rarely an option, particularly in geo-distributed topologies, because it is complex and comes with a range of issues such as:

+
Limiting the scalability of the system, as every write operation must wait for confirmation from the replica, causing performance bottlenecks and increasing latency
+
+
Introducing a single point of failure — if the replica becomes unavailable or experiences issues, it can impact the availability of the primary database as well
+
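The staleness window described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `Primary` and `Replica` classes and the `flush` step are hypothetical stand-ins for a real replication stream, not any database's API.

```javascript
// Hypothetical in-memory model of asynchronous primary-to-replica replication.
class Replica {
  constructor() {
    this.data = new Map();
  }
  read(key) {
    return this.data.get(key);
  }
}

class Primary {
  constructor(replicas) {
    this.data = new Map();
    this.replicas = replicas;
    this.pending = []; // writes acknowledged but not yet shipped to replicas
  }
  write(key, value) {
    this.data.set(key, value); // acknowledged immediately
    this.pending.push([key, value]); // replicated later
  }
  read(key) {
    return this.data.get(key);
  }
  // Stand-in for the replication stream catching up.
  flush() {
    for (const [key, value] of this.pending) {
      for (const replica of this.replicas) replica.data.set(key, value);
    }
    this.pending = [];
  }
}

// Route reads that must be consistent to the primary, all others to a replica.
function read(primary, replica, key, { consistent = false } = {}) {
  return consistent ? primary.read(key) : replica.read(key);
}

const replica = new Replica();
const primary = new Primary([replica]);
primary.write('balance', 100);
primary.flush();
primary.write('balance', 80); // not yet replicated

const staleRead = read(primary, replica, 'balance');
const consistentRead = read(primary, replica, 'balance', { consistent: true });
primary.flush();
const caughtUpRead = read(primary, replica, 'balance');
console.log({ staleRead, consistentRead, caughtUpRead });
```

Until the replica catches up, only the read routed to the primary sees the latest write, which is why consistent queries add load back onto the primary.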

And lastly, write throughput can become bottlenecked due to the limit on how much write traffic a single database can handle without performance degradation. Scaling writes still requires vertical scaling (more powerful hardware) or sharding (splitting data across multiple databases), which can lead to downtime, additional costs, and limits imposed by non-linearly escalating operational toil. Now let’s look at sharding challenges in a bit more detail.

Sharding challenges

Sharding is a powerful tool for database scalability. When implemented correctly, it can enable applications to handle a much larger volume of read and write transactions. However, sharding does not come without its challenges and brings its own set of complexities that need careful navigation.

There are multiple ways to shard databases. For instance,

+
they can be split by user id ranges,
+
+
regions or
+
+
channels (e.g. web, mobile), etc.
+

As shown in the above example, sharding by user id or region can lead to significant performance improvements, as smaller data ranges are hosted by individual databases and the traffic can be spread across these databases.
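Routing by user ID range can be sketched as follows. The shard names and range boundaries are hypothetical, and a real system would typically keep this map in configuration or a lookup service rather than in code.

```javascript
// Hypothetical shard map: each shard owns a half-open range of user IDs.
const shardMap = [
  { name: 'shard-a', minUserId: 0, maxUserId: 1000000 },
  { name: 'shard-b', minUserId: 1000000, maxUserId: 2000000 },
  { name: 'shard-c', minUserId: 2000000, maxUserId: Infinity },
];

// Returns the name of the shard that owns the given user ID.
function shardForUser(userId) {
  const shard = shardMap.find(
    (s) => userId >= s.minUserId && userId < s.maxUserId
  );
  if (!shard) throw new RangeError(`no shard owns user id ${userId}`);
  return shard.name;
}

// A query for one user touches one shard; a query across users may fan out.
function shardsForUsers(userIds) {
  return [...new Set(userIds.map(shardForUser))].sort();
}

console.log(shardForUser(42));
console.log(shardsForUsers([1, 2500000, 7]));
```

Single-user queries stay on one database, while anything that spans users has to be fanned out and merged by the application.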

Key considerations:

+
Deciding on the “right” kind of sharding: One of the primary challenges of sharding is the initial setup. Deciding on a sharding key, whether it be user ID, region, or another attribute, requires a deep understanding of your data access patterns. A poorly chosen sharding key can result in uneven data distribution, known as "shard imbalance," which can significantly dull the performance benefits of sharding.
+
+
Data integrity: When data is spread across multiple shards, maintaining foreign-key relationships becomes difficult. Transactions that span multiple shards become complex and can result in increased latency and weaker integrity guarantees.
+
+
Operational complexity: Sharding introduces operational complexity. Managing multiple databases requires a more sophisticated approach to maintenance, backups, and monitoring. Each shard may need to be backed up separately, and restoring a sharded database to a consistent state can be challenging.
+
+
Re-sharding: As an application grows, the chosen sharding scheme might need to change. This process involves redistributing the data across a new set of shards, which can be time-consuming and risky, often requiring significant downtime or degraded performance during the transition.
+
+
Increased development complexity: Application logic can become more complex because developers must account for the distribution of data. This could mean additional logic for routing queries to the correct shard, handling partial failures, and ensuring that transactions that need to operate across shards maintain consistency.
+
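To get a feel for why re-sharding is costly, here is a minimal sketch that assumes one simple placement rule: shard index equals user ID modulo the shard count. That rule is our assumption for illustration only; other placement schemes, such as range-based or consistent hashing, behave differently.

```javascript
// Simplified placement rule (an assumption): shard index = userId modulo the shard count.
function shardIndex(userId, shardCount) {
  return userId % shardCount;
}

// Counts how many user IDs in [0, totalUsers) land on a different shard
// after the shard count changes.
function countMovedKeys(totalUsers, oldShardCount, newShardCount) {
  let moved = 0;
  for (let userId = 0; userId < totalUsers; userId++) {
    if (shardIndex(userId, oldShardCount) !== shardIndex(userId, newShardCount)) {
      moved++;
    }
  }
  return moved;
}

const moved = countMovedKeys(10000, 4, 5);
console.log(`${moved} of 10000 keys change shards`);
```

Under this rule, growing from four to five shards relocates 8,000 of the 10,000 keys, so adding a single database means moving most of the data.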

Exploding complexity and operations

Over time, database complexity can grow along with increased traffic, adding further toil to operations. For large systems, a combination of sharding along with attached scale-out read replicas might be required to help ensure cost-effective scalability and performance.

This combined dual-strategy approach, while effective in handling increasing traffic, significantly ramps up the complexity of the system's architecture. The above illustration captures the need to add scalability and availability to a transactional relational database powering a service. It doesn’t even include full details on DR (e.g. backups), or geo-redundancy, nor does it cater to zero-to-low RPO/RTO requirements.

Furthermore, the dual-strategy approach described above can:

+
negatively impact the ease of service maintenance
+
+
increase operational demands, and
+
+
elevate the risk associated with the resolution of incidents
+

Doesn’t NoSQL address this?

NoSQL databases began to emerge in the early 2000s as a response to traditional RDBMSs’ above-mentioned limitations. In the new era of big data and web-scale applications, NoSQL databases were designed to overcome the challenges of scalability, performance, flexibility and availability that were imposed by the growing volume of semi-structured data.

The key tradeoff they made, however, was to drop sound relational models, SQL, and support for ACID-compliant transactions. Many prominent system architects have since questioned the wisdom of abandoning these well-worn relational concepts for OLTP workloads, as they are essential features that still power mission-critical applications. As such, there’s been a recent trend to (re)introduce relational database features into NoSQL databases, such as ACID transactions in MongoDB and Cassandra Query Language (CQL) in Cassandra.

Enter Spanner

Spanner eliminates much of this complexity and helps facilitate a simple and easy-to-maintain architecture without most of the above-mentioned compromises. It combines relational concepts and features (SQL, ACID transactions) with seamless horizontal scalability, and provides the geo-redundancy and up to 99.999% availability that you want when designing a microservices-based application.

We want to emphasize that we’re not arguing that Spanner is only a good fit for microservices. All the things that make Spanner a great fit for microservices also make it great for monolithic applications.

To summarize, a microservices architecture built on Spanner allows software architects to design systems where both the application and database provide:

+
“Scale insurance” for future growth scenarios
+
+
An easy way to handle traffic spikes
+
+
Cost efficiency through Spanner’s elastic and instant compute provisioning
+
+
Up to 99.999% availability with geo-redundancy
+
+
No downtime windows (for maintenance or other upgrades)
+
+
Enterprise-grade security such as encryption at rest and in-transit
+
+
Features to cater for transactional workloads
+
+
Increases in developer productivity (e.g. SQL)
+

You can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself free for 90 days, or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture.

Making API calls exactly once when using Workflows

Fri, 03 May 2024 16:00:00 +0000

Introduction

One challenge with any distributed system, including Workflows, is ensuring that requests sent from one service to another are processed exactly once, when needed; for example, when placing a customer order in a shipping queue, withdrawing funds from a bank account, or processing a payment.

In this blog post, we’ll provide an example of a website invoking Workflows, and Workflows in turn invoking a Cloud Function. We’ll show how to make sure both the workflow and the Cloud Function logic run only once. We’ll also talk about how to invoke Workflows exactly once when using HTTP callbacks, Pub/Sub messages, or Cloud Tasks.

Invoke Workflows exactly once

Imagine you have an online store and you’re using Workflows to create new orders, save to Firestore, and process payments by calling a Cloud Function:

A new customer order comes in, and the website makes an API call to Workflows but receives an error. Two possible scenarios are:

(1) The request is lost and the workflow is never invoked:

(2) The workflow is invoked and executes successfully, but the response is lost:

How can you make sure the workflow executes once?

To solve this, the website retries the same request. One easy solution is to check if a document already exists in Firestore:

    main:
      params: []
      steps:
        - init:
            assign:
              - project_id: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
              - order_id: "12345" # In practice we would pass in the order ID as a workflow parameter, e.g. ${params[0]}
              - firestore_collection: "orders"
              - URL: https://us-central1-<your_project_id>.cloudfunctions.net/processpayment
        - create_document:
            try:
              call: googleapis.firestore.v1.projects.databases.documents.createDocument
              args:
                collectionId: ${firestore_collection}
                parent: ${"projects/" + project_id + "/databases/(default)/documents"}
                query:
                  documentId: ${order_id}
            except:
              as: e
              steps:
                - endEarly:
                    return: ${e} # Exception is raised, e.g. ${e.code == 409} if doc already exists
        - processPayment:
            ...

The processPayment step will execute only if a document is successfully created. This is effectively a 1-bit state machine, idempotent, and a valid solution. The downside of this solution is that it’s not extensible. We might want to complete additional work in this handler before changing states, or expand the number of states within the system. Next, let’s continue with a more advanced solution for the same problem.
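The create-if-absent guard is easy to reason about in isolation. The sketch below models it in plain JavaScript, with an in-memory `Map` standing in for the Firestore collection. `createDocument` and `runWorkflow` here are hypothetical local functions that only mimic the behavior described above; they are not the Firestore or Workflows APIs.

```javascript
// In-memory stand-in for the Firestore "orders" collection (hypothetical).
const orders = new Map();

// Mimics createDocument: fails with code 409 if the document already exists.
function createDocument(orderId) {
  if (orders.has(orderId)) {
    const err = new Error('document already exists');
    err.code = 409;
    throw err;
  }
  orders.set(orderId, {});
}

let paymentsProcessed = 0;

// Mirrors the workflow: process the payment only if the document was created.
function runWorkflow(orderId) {
  try {
    createDocument(orderId);
  } catch (e) {
    return { skipped: true, code: e.code }; // end early, as in the endEarly step
  }
  paymentsProcessed++;
  return { skipped: false };
}

const first = runWorkflow('12345');
const retry = runWorkflow('12345'); // the website retries the same request
console.log(first, retry, paymentsProcessed);
```

The document's existence is the single bit of state: the retry observes it and returns early, so the payment count stays at one.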

Invoke Cloud Functions from Workflows exactly once

Let’s see what happens when the workflow uses a Cloud Function to process the payment. You might have the following step to call Cloud Functions:

    - processPayment:
        call: http.post
        args:
          url: https://us-central1-<your_project_id>.cloudfunctions.net/processpayment
          auth:
            type: OIDC

By default, Workflows offers at-most-once delivery (no retries) with HTTP requests. That’s usually OK because 99.9+% of the time, the call is successful, and a response is received.

In the rare case of failure, a ConnectionError might be raised. As in the website-to-workflow situation discussed previously, the workflow can’t tell which scenario occurred. Similarly, you can add retries.

Let’s add a default retry policy to handle this:

    - processPayment:
        try:
          call: http.post
          args:
            url: https://us-central1-<your_project_id>.cloudfunctions.net/processpayment
            auth:
              type: OIDC
        retry: ${http.default_retry} # Retries up to 5 times, includes 'ConnectionError'

Let's say the second delivery scenario occurs where the request is received by the Cloud Function but the response is lost. By adding retries, Workflows will likely invoke the Cloud Function multiple times. When this happens, how do you ensure that the code in the Cloud Function only runs once?
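To make the duplicate-invocation problem concrete, here is a small simulation of a lost response combined with retries. Everything in it is hypothetical: `processPayment` is a deliberately non-idempotent handler, `makeFlakyCall` fakes a network that drops the first response, and `callWithRetry` is a simplified loop, not the Workflows retry policy itself.

```javascript
let charges = 0;

// Hypothetical non-idempotent handler: every invocation charges the customer.
function processPayment() {
  charges++;
  return 'payment_successful';
}

// Simulated network: the handler always runs, but the first N responses are lost.
function makeFlakyCall(handler, lostResponses) {
  let remaining = lostResponses;
  return () => {
    const response = handler(); // the request reaches the handler
    if (remaining > 0) {
      remaining--;
      throw new Error('ConnectionError'); // the caller never sees the response
    }
    return response;
  };
}

// Simplified retry loop: try again on any error, up to maxRetries times.
function callWithRetry(call, maxRetries) {
  for (let attempt = 0; ; attempt++) {
    try {
      return call();
    } catch (e) {
      if (attempt >= maxRetries) throw e;
    }
  }
}

const result = callWithRetry(makeFlakyCall(processPayment, 1), 5);
console.log(result, charges);
```

The caller sees a single successful result, yet the handler ran twice, which is exactly the case the Cloud Function has to guard against.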

You’ll need to add extra logic to the Cloud Function to check and update the payment state in Firestore:

Let’s also assume you want to track the workflow EXECUTION_ID in Firestore and use the following order_state enum to allow for additional flexibility in payment processing:

    payment_not_processed // Initial state when an order is created
    payment_declined      // Payment was not successful
    payment_successful    // Payment processed successfully
    ...

You can expand on the previous workflow and call a Cloud Function to process the payment:

    main:
      params: []
      steps:
        - init:
            assign:
              - project_id: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
              - order_id: "12345" # In practice we would pass in the order ID as a workflow parameter, e.g. ${params[0]}
              - firestore_collection: "orders"
              - URL: https://us-central1-<your_project_id>.cloudfunctions.net/processpayment
        - create_document:
            try:
              call: googleapis.firestore.v1.projects.databases.documents.createDocument
              args:
                collectionId: ${firestore_collection}
                parent: ${"projects/" + project_id + "/databases/(default)/documents"}
                query:
                  documentId: ${order_id}
                body:
                  fields:
                    order_state: # We set an initial state
                      stringValue: "payment_not_processed"
                    workflow_id: # And also track this workflow execution ID
                      stringValue: ${sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}
            except:
              as: e
              steps:
                - endEarly:
                    return: ${e} # Exception is raised, e.g. ${e.code == 409} if doc already exists
        - processPayment:
            try:
              call: http.post
              args:
                url: ${URL} # Might get called multiple times!
                auth:
                  type: OIDC
                body:
                  order_id: ${order_id}
              result: r
            retry: ${http.default_retry}
        - returnStep:
            return: ${r}

Here’s the Cloud Function (Node.js v20) that processes the payment:

    const functions = require('@google-cloud/functions-framework');
    const firestore = require('@google-cloud/firestore');

    // Create the client once so it is reused across invocations.
    const fs = new firestore.Firestore();

    functions.http('helloHttp', async (req, res) => {
      const docRef = fs.doc('orders/' + req.body.order_id);
      try {
        // Read the current state from Firestore and update it within the same
        // transaction to make this handler idempotent. Using a transaction is
        // important. Note: the transaction body could run multiple times, but
        // it will only be committed once.
        const message = await fs.runTransaction(async (t) => {
          const doc = await t.get(docRef);
          let state = doc.data().order_state;
          // Only process the order if we haven't already
          if (state === 'payment_not_processed') {
            // Do payment stuff, e.g. debit account from another Firestore document
            // ...
            state = 'payment_successful';
            t.update(docRef, {order_state: state});
            return state;
          }
          return 'request ignored, state already: ' + state;
        });
        // Respond only after the transaction has committed.
        console.log('Transaction result: ', message);
        res.status(200).send(message);
      } catch (e) {
        // Rejected promises land here because the transaction is awaited.
        console.log('Transaction failure:', e);
        res.status(500).send('transaction failed');
      }
    });

package.json

    {
      "dependencies": {
        "@google-cloud/functions-framework": "^3.3.0",
        "@google-cloud/firestore": "^7.6.0"
      }
    }

The key takeaway is that all payment processing work occurs within a transaction, making all actions idempotent. The code within the transaction might run multiple times due to Workflows retries, but it’s only committed once.
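The check-then-update pattern can also be exercised without Firestore. In the sketch below, a `Map` stands in for the order document and the function body plays the role of the transaction. That substitution only holds because the example is single-threaded; in the real Cloud Function, the Firestore transaction is what protects against concurrent invocations. All names are hypothetical.

```javascript
// In-memory stand-in for the Firestore order document (hypothetical).
const orders = new Map([['12345', { order_state: 'payment_not_processed' }]]);
let debits = 0;

// Checks and updates the state in one step. In this single-threaded sketch
// the function body plays the role of the Firestore transaction.
function processPaymentOnce(orderId) {
  const order = orders.get(orderId);
  if (!order) return { status: 404, body: 'unknown order' };
  if (order.order_state === 'payment_not_processed') {
    debits++; // the side effect that must happen only once
    order.order_state = 'payment_successful';
    return { status: 200, body: order.order_state };
  }
  return {
    status: 200,
    body: 'request ignored, state already: ' + order.order_state,
  };
}

// Three deliveries of the same request, as retries might produce.
const responses = [1, 2, 3].map(() => processPaymentOnce('12345'));
console.log(responses, debits);
```

Every delivery gets a 200 response, but only the first one changes state and performs the debit.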

What about HTTP callbacks, Pub/Sub, Cloud Tasks?

So far, we’ve talked about how to make website-to-Workflows and Workflows-to-Cloud Functions requests exactly once. There are other ways of invoking or resuming Workflows such as HTTP callbacks, Pub/Sub messages or Cloud Tasks. How do you make those requests exactly once? Let’s take a look.

Callbacks

The good news is that Workflows HTTP callbacks are fully idempotent by default. It’s safe to retry a callback if it fails. For example:

    - createCallbackStep:
        call: events.create_callback_endpoint
        args:
          http_callback_method: "POST"
        result: callback_details
    - sendOutURL:
        call: http.post
        args:
          url: "https://your-endpoint.com/foo"
          body:
            callback_to_use: ${callback_details.url}
    ...
    - callbackWaitStep:
        call: events.await_callback
        args:
          callback: ${callback_details}

Let’s assume that the first callback returns an error to the external caller. Based on the error, the caller might not know if the workflow callback was received, and should retry the callback. On the second callback, the caller will receive one of the following HTTP status codes:

+
429 indicates that the first callback was received successfully. The second callback is rejected by the workflow.
+
+
200 indicates that the second callback was received successfully. The first callback was either never received, or was received and processed successfully. If the latter, the second callback is not processed because await_callback is called only once. The second callback is discarded at the end of the workflow.
+
+
404 indicates that a callback is not available. Either the first callback was received and processed and the workflow has completed, or the workflow is not running (and has failed, for example). To confirm this, you’ll need to send an API request to query the workflow execution state.
+
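On the caller's side, the three outcomes above translate into a small decision table. The sketch below is one way an external caller might encode it. The function name and return shape are hypothetical conventions of ours, and status codes outside the three listed are deliberately left unhandled.

```javascript
// Maps the status code returned to a retried callback onto the outcomes
// described above. The return shape is a hypothetical convention.
function interpretCallbackRetry(statusCode) {
  switch (statusCode) {
    case 429:
      return {
        delivered: true,
        note: 'first callback was received; the retry was rejected',
      };
    case 200:
      return {
        delivered: true,
        note: 'a callback was received by the workflow',
      };
    case 404:
      return {
        delivered: null, // unknown without further checks
        note: 'no callback available; query the workflow execution state',
      };
    default:
      return { delivered: null, note: 'status not covered by this sketch' };
  }
}

console.log(interpretCallbackRetry(429));
console.log(interpretCallbackRetry(404));
```

Only the 404 case needs a follow-up request, because it cannot distinguish a completed workflow from a failed one.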

For more details, see Invoke a workflow exactly once using callbacks.

Pub/Sub messages

When using Pub/Sub to trigger a new workflow execution, Pub/Sub uses at-least-once delivery with Workflows, and will retry on any delivery failure. Pub/Sub messages are automatically deduplicated within a 24-hour window, so you don’t need to worry about duplicate deliveries during that time.

Cloud Tasks

Cloud Tasks is commonly used to buffer workflow executions and provides at-least-once delivery, but it doesn’t have message deduplication. Workflow handlers should therefore be idempotent.

Conclusion

Exactly-once request processing is a hard problem. In this blog post, we’ve outlined some scenarios where you might need exactly-once request processing when you’re using Workflows. We also provided some ideas on how you can implement it. The exact solution will depend on the actual use case and the services involved.

Scalable multi-tenancy management with Config Sync and team scopes

Fri, 03 May 2024 16:00:00 +0000

Ensuring application and service teams have the resources they need is crucial for platform administrators. Fleet team management features in Google Kubernetes Engine (GKE) make this easier, allowing each team to function as a separate “tenant” within a fleet. In conjunction with Config Sync, a GitOps service in GKE, platform administrators can streamline resource management for their teams across the fleet.

Specifically, with Config Sync team scopes, platform admins can define fleet-wide and team-specific cluster configurations such as resource quotas and network policies, allowing each application team to manage their own workloads within designated namespaces across clusters.

Let's walk through a few scenarios.

Separating resources for frontend and backend teams

Let's say you need to provision resources for frontend and backend teams, each requiring their own tenant space. Using team scopes and fleet namespaces, you can control which teams access specific namespaces on specific member clusters.

For example, the backend team might access their bookstore and shoestore namespaces on us-east-cluster and us-west-cluster clusters, while the frontend team has their frontend-a and frontend-b namespaces on all three member clusters.

Unlocking Dynamic Resource Provisioning with Config Sync

You can enable Config Sync by default at the fleet level using Terraform. Here’s a sample Terraform configuration:

    resource "google_gke_hub_feature" "feature" {
      name     = "configmanagement"
      location = "global"
      provider = google
      fleet_default_member_config {
        configmanagement {
          config_sync {
            source_format = "unstructured"
            git {
              sync_repo   = "https://github.com/GoogleCloudPlatform/anthos-config-management-samples"
              sync_branch = "main"
              policy_dir  = "fleet-tenancy/config"
              secret_type = "none"
            }
          }
        }
      }
    }

Note: Fleet defaults are only applied to new clusters created in the fleet.

This Terraform configuration enables Config Sync as a default fleet-level feature. It installs Config Sync and instructs it to fetch Kubernetes manifests from a Git repository (specifically, the “main” branch and the “fleet-tenancy/config” folder). This configuration automatically applies to all clusters subsequently created within the fleet. This approach offers a powerful way of configuring manifests across fleet clusters without the need for manual installation and configuration on individual clusters.

Now that you’ve configured Config Sync as a default fleet setting, you might want to sync specific Kubernetes resources to designated namespaces and clusters for each team. Integrating Config Sync with team scopes streamlines this process.

Setting team scope
Following this example, let’s assume you want to apply a different network policy for the backend team compared to the frontend team. Fleet team management features simplify the process of provisioning and managing infrastructure resources for individual teams, treating each team as a separate “tenant” within the fleet.

To manage separate tenancy, as shown in the above team scope diagram, first set up team scopes for the backend and frontend teams. This involves defining fleet-level namespaces and adding fleet member clusters to each team scope.

Now, let's dive into those Kubernetes manifests that Config Sync syncs into the clusters.

Applying team scope in Config Sync
Each fleet namespace in the cluster is automatically labeled with fleet.gke.io/fleet-scope: <scope name>. For example, the backend team scope contains the fleet namespaces bookstore and shoestore, both labeled with fleet.gke.io/fleet-scope: backend.

Config Sync's NamespaceSelector utilizes this label to target specific namespaces within a team scope. Here's the configuration for the backend team:

```yaml
apiVersion: configmanagement.gke.io/v1
kind: NamespaceSelector
metadata:
  name: backend-scope
spec:
  mode: dynamic
  selector:
    matchLabels:
      fleet.gke.io/fleet-scope: backend
```

Applying NetworkPolicies for the backend team
By annotating resources with configmanagement.gke.io/namespace-selector: <NamespaceSelector name>, they're automatically applied to the right namespaces. Here’s the NetworkPolicy of the backend team:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: be-deny-all
  annotations:
    configmanagement.gke.io/namespace-selector: backend-scope
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector:
    matchLabels: null
```

This NetworkPolicy is automatically provisioned in the backend team's bookstore and shoestore namespaces, adapting to fleet changes like adding or removing namespaces and member clusters.
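To make the label matching concrete, here is a minimal Python sketch of how a matchLabels selector picks namespaces. It is an illustration only, not Config Sync's implementation; the `select_namespaces` function and the `cluster_namespaces` snapshot are hypothetical names introduced for this example.

```python
from typing import Dict, List


def select_namespaces(namespaces: Dict[str, Dict[str, str]],
                      match_labels: Dict[str, str]) -> List[str]:
    """Return the names of namespaces whose labels contain every matchLabels pair."""
    return sorted(
        name
        for name, labels in namespaces.items()
        if all(labels.get(key) == value for key, value in match_labels.items())
    )


# Hypothetical snapshot of the namespaces (and labels) on one member cluster.
cluster_namespaces = {
    "bookstore": {"fleet.gke.io/fleet-scope": "backend"},
    "shoestore": {"fleet.gke.io/fleet-scope": "backend"},
    "frontend-a": {"fleet.gke.io/fleet-scope": "frontend"},
    "kube-system": {},
}

backend_selector = {"fleet.gke.io/fleet-scope": "backend"}
selected_before = select_namespaces(cluster_namespaces, backend_selector)

# Simulate the fleet changing: a new namespace joins the backend team scope.
cluster_namespaces["toystore"] = {"fleet.gke.io/fleet-scope": "backend"}
selected_after = select_namespaces(cluster_namespaces, backend_selector)

print(selected_before)
print(selected_after)
```

Because selection is driven purely by labels, re-evaluating the selector after a namespace joins the scope picks it up without any change to the policy itself.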


Extending the concept: ResourceQuotas for the frontend team
Here's how a ResourceQuota is dynamically applied to the frontend team's namespaces:

```yaml
apiVersion: configmanagement.gke.io/v1
kind: NamespaceSelector
metadata:
  name: frontend-scope
spec:
  mode: dynamic
  selector:
    matchLabels:
      fleet.gke.io/fleet-scope: frontend
---
kind: ResourceQuota
apiVersion: v1
metadata:
  name: fe-quota
  annotations:
    configmanagement.gke.io/namespace-selector: frontend-scope
spec:
  hard:
    persistentvolumeclaims: "6"
```

Similarly, this ResourceQuota targets the frontend team's frontend-a and frontend-b namespaces, dynamically adjusting as the fleet's namespaces and member clusters evolve.


Delegating resource management with Config Sync: Empowering the backend team
To allow the backend team to manage their own resources within their designated bookstore namespace, you can use Config Sync's RepoSync, and a slightly different NamespaceSelector.

Targeting a specific fleet namespace
To zero in on the backend team's bookstore namespace, the following NamespaceSelector targets both the team scope and the namespace name by labels:

```yaml
apiVersion: configmanagement.gke.io/v1
kind: NamespaceSelector
metadata:
  name: backend-bookstore
spec:
  mode: dynamic
  selector:
    matchLabels:
      fleet.gke.io/fleet-scope: backend
      kubernetes.io/metadata.name: bookstore
```

Introducing RepoSync
Another Config Sync feature is RepoSync, which lets you delegate resource management within a specific namespace. For security reasons, RepoSync has no default access; you must explicitly grant the necessary RBAC permissions to the namespace.

Leveraging the NamespaceSelector, the following RepoSync resource and its respective RoleBinding can be applied dynamically to all bookstore namespaces across the backend team's member clusters. The RepoSync points to a repository owned by the backend team:

```yaml
kind: RepoSync
apiVersion: configsync.gke.io/v1beta1
metadata:
  name: repo-sync
  annotations:
    configmanagement.gke.io/namespace-selector: backend-bookstore
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/GoogleCloudPlatform/anthos-config-management-samples
    branch: main
    dir: fleet-tenancy/teams/backend/bookstore
    auth: none
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: be-bookstore
  annotations:
    configmanagement.gke.io/namespace-selector: backend-bookstore
subjects:
- kind: ServiceAccount
  name: ns-reconciler-bookstore
  namespace: config-management-system
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io
```

Note: The .spec.git section would reference the backend team's repository.

The backend team’s repository contains a ConfigMap. Config Sync ensures that the ConfigMap is applied to the bookstore namespaces across all backend team’s member clusters, supporting a GitOps approach to management.

Easier cross-team resource management

Managing resources across multiple teams within a fleet of clusters can be complex. Google Cloud's fleet team management features, combined with Config Sync, provide an effective solution to streamline this process.

In this blog, we explored a scenario with frontend and backend teams, each requiring their own tenant spaces and resources (NetworkPolicies, ResourceQuotas, RepoSync). Using Config Sync in conjunction with the fleet management features, we automated the provisioning of these resources, helping to ensure a consistent and scalable setup.

Next steps

- Learn how to use Config Sync to sync Kubernetes resources to team scopes and namespaces.
- To experiment with this setup, visit the example repository. Config Sync configuration settings are located within the config_sync block of the Terraform google_gke_hub_feature resource.
- For simplicity, this example uses a public Git repository. To use a private repository, create a Secret in each cluster to store authentication credentials.
- To learn more about Config Sync, see Config Sync overview.
- To learn more about fleets, see Fleet management overview.

Simplifying data modeling and schema generation in BigQuery using multi-modal LLMs

Fri, 03 May 2024 16:00:00 +0000

The intricate hierarchical data structures in data warehouses and lakes sourced from diverse origins can make data modeling a protracted and error-prone process. To quickly adapt and create data models that meet evolving business requirements without having to rework them excessively, you need data models that are flexible, modular and adaptable enough to accommodate many requirements. This requires advanced technologies, proficient personnel, and robust methodologies.

The advancements in generative AI offer numerous opportunities to address these challenges. Multimodal large language models (LLMs) can analyze examples of data in the data lake, including text descriptions, code, and even images of existing databases. By understanding this data and its relationships, LLMs can suggest or even automatically generate schema layouts, simplifying the laborious process of implementing the data model within the database, so developers can focus on higher value data management tasks.

In this blog, we walk you through how to use multimodal LLMs in BigQuery to create a database schema. To do so, we’ll take a real-world example of entity relationship (ER) diagrams and examples of data definition languages (DDLs), and create a database schema in three steps.

For this demonstration, we will use Data Beans, a fictional technology company built on BigQuery that provides a SaaS platform to coffee sellers. Data Beans leverages BigQuery’s integration with Vertex AI to access Google AI models like Gemini 1.0 Pro Vision to analyze unstructured data and integrate it with structured data, while using BigQuery to help with data modeling and generating insights.

STEP 1: Create an entity relationship diagram

The first step is to create an ER diagram using your favorite modeling tool, or to take a screenshot of an existing ER diagram. The ER diagram can contain primary key and foreign key relationships, and will then be used as an input to the Gemini 1.0 Pro Vision model to create relevant BigQuery DDLs.


STEP 2: Create a prompt with the ER image as input

Next, to create the DDL statements in BigQuery, write a prompt that takes the ER image as input. The prompt should include detailed and relevant rules that the Gemini model should follow. In addition, make sure the prompt captures learnings from previous iterations: update it as you experiment, for example by listing the errors that earlier runs produced. You can also give the model examples, such as a valid schema description for BigQuery. Providing a working example for the model to follow helps it create DDL that follows your desired rules.

```python
## Prompt to guide the model
llm_erd_prompt=f"""Use BigQuery SQL commands to create the following:
- Create a new BigQuery schema named "{dataset_id}".
- Use only BigQuery data types. Double and triple check this since it causes a lot of errors.
- Create the BigQuery DDLs for the attached ERD.
- Create primary keys for each table using the ALTER command. Use the "NOT ENFORCED" keyword.
- Create foreign keys for each table using the ALTER command. Use the "NOT ENFORCED" keyword.
- For each field add an OPTIONS for the description.
- Cluster the table by the primary key.
- For columns that can be null do not add "NULL" to the created SQL statement. BigQuery leaves this blank.
- All ALTER TABLE statements should be at the bottom of the generated script.
- The ALTER TABLE statements should be ordered by the primary key statements and then the foreign key statements. Order matters!
- Double check your work especially that you used ONLY BigQuery data types.


Previous Errors that have been generated by this script. Be sure to check your work to avoid encountering these.
- Query error: Type not found: FLOAT at [6:12]
- Query error: Table test.company does not have Primary Key constraints at [25:1]


## Example for model to influence from
Example:
CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.customer`
(
    customer_id INTEGER NOT NULL OPTIONS(description="Primary key. Customer table."),
    country_id INTEGER NOT NULL OPTIONS(description="Foreign key: Country table."),
    customer_llm_summary STRING NOT NULL OPTIONS(description="LLM generated summary of customer data."),
    customer_lifetime_value STRING NOT NULL OPTIONS(description="Total sales for this customer."),
    customer_cluster_id FLOAT64 NOT NULL OPTIONS(description="Clustering algorithm id."),
    customer_review_llm_summary STRING OPTIONS(description="LLM summary are all of the customer reviews."),
    customer_survey_llm_summary STRING OPTIONS(description="LLM summary are all of the customer surveys.")
)
CLUSTER BY customer_id;


CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.country`
(
    country_id INTEGER NOT NULL OPTIONS(description="Primary key. Country table."),
    country_name STRING NOT NULL OPTIONS(description="The name of the country.")
)
CLUSTER BY country_id;


ALTER TABLE `{project_id}.{dataset_id}.customer` ADD PRIMARY KEY (customer_id) NOT ENFORCED;
ALTER TABLE `{project_id}.{dataset_id}.country` ADD PRIMARY KEY (country_id) NOT ENFORCED;


ALTER TABLE `{project_id}.{dataset_id}.customer` ADD FOREIGN KEY (country_id) REFERENCES `{project_id}.{dataset_id}.country`(country_id) NOT ENFORCED;
"""
```

Now you have both a prompt and an image of an ER diagram to present to your LLM.

STEP 3: Call the Gemini 1.0 Pro Vision model

After creating a prompt in Step 2, you are now ready to call the Gemini 1.0 Pro Vision model to generate the output, using the image of your ER diagram as an input (left side of Figure 1). You can do this in a number of ways: either directly from Colab notebooks using Python, or through BigQuery ML, leveraging its integration with Vertex AI:

```python
# convert_png_to_base64 and GeminiProVisionLLM are not library calls; they are
# assumed to be helper functions defined in the accompanying notebook.
imageBase64 = convert_png_to_base64(menu_erd_filename)

llm_response = GeminiProVisionLLM(llm_erd_prompt, imageBase64, temperature=.2, topP=1, topK=32)
```
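The body of the convert_png_to_base64 helper is not shown in this post. As a rough sketch, and assuming it does nothing more than read the image file and base64-encode its bytes, it could look like the following; the notebook's real helper may differ, and the sample file written here is only a stand-in for an ER diagram image.

```python
import base64
import os
import tempfile
from pathlib import Path


def convert_png_to_base64(image_path: str) -> str:
    """Read an image file and return its bytes as a base64-encoded ASCII string."""
    raw_bytes = Path(image_path).read_bytes()
    return base64.b64encode(raw_bytes).decode("ascii")


# Demonstration with a stand-in file that contains only the 8-byte PNG signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_erd.png")
    Path(sample_path).write_bytes(PNG_SIGNATURE)
    encoded = convert_png_to_base64(sample_path)

print(encoded)
```

Encoding the image as text lets it travel inside a JSON request body alongside the prompt.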

Conclusions and resources

In this demonstration, we saw how the multimodal Gemini model can streamline the creation of data and schemas. And while manually writing prompts is fine, it can be a daunting task when you need to do it at enterprise scale to create thousands of assets such as DDLs. Leveraging the above process, you can parameterize and automate prompt generation, dramatically speeding up the workflow and providing consistency across thousands of generated artifacts. You can find the complete Colab Enterprise notebook source code here.
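As a minimal sketch of what parameterized prompt generation could look like, the snippet below fills one template per dataset and appends any errors recorded from earlier runs. The template text is abbreviated, and `PROMPT_TEMPLATE`, `build_prompts`, and the dataset names are hypothetical names used only for illustration.

```python
from string import Template
from typing import Dict, Iterable, List

# Abbreviated, hypothetical template; a real one would carry the full rule list.
PROMPT_TEMPLATE = Template(
    "Use BigQuery SQL commands to create the following:\n"
    '- Create a new BigQuery schema named "$dataset_id".\n'
    "- Use only BigQuery data types.\n"
    "- Create the BigQuery DDLs for the attached ERD.\n"
    "$known_errors"
)


def build_prompts(dataset_ids: Iterable[str], known_errors: List[str]) -> Dict[str, str]:
    """Return one prompt per dataset, each carrying the shared list of past errors."""
    error_section = ""
    if known_errors:
        error_section = "Previous errors to avoid:\n" + "".join(
            f"- {error}\n" for error in known_errors
        )
    return {
        dataset_id: PROMPT_TEMPLATE.substitute(
            dataset_id=dataset_id, known_errors=error_section
        )
        for dataset_id in dataset_ids
    }


prompts = build_prompts(
    ["sales", "menu"],
    ["Query error: Type not found: FLOAT at [6:12]"],
)
print(prompts["sales"])
```

Keeping the error list as data rather than hand-edited text means every generated prompt picks up a newly observed failure at once.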

BigQuery ML includes many new features to let you use Gemini Pro capabilities; for more, please see the documentation. Then, check out this tutorial to learn how to apply Google's models to your data, deploy models, and operationalize ML workflows, all without ever moving your data from BigQuery. Finally, for a behind-the-scenes look at how we made this demo, watch this video on how to build an end-to-end data analytics and AI application using advanced models like Gemini directly from BigQuery.

Googlers Luis Velasco, Navjot Singh, Skander Larbi and Manoj Gunti contributed to this blog post. Many Googlers contributed to make these features a reality.

Introducing Dataflux Dataset for Cloud Storage to accelerate PyTorch AI training

Thu, 02 May 2024 16:00:00 +0000

Introduction

Machine learning (ML) models thrive on massive datasets, and fast data loading is key for cost-effective ML training. We recently launched a PyTorch Dataset abstraction, the Dataflux Dataset, for accelerating data loading from Google’s Cloud Storage. Dataflux provides up to 3.5x faster training times compared to fsspec, with small files.

Today’s launch builds upon Google’s commitment to open standards that spans over two decades of OSS contributions like TensorFlow, JAX, TFX, MLIR, KubeFlow, and Kubernetes, as well as sponsorship for critical OSS data science initiatives like Project Jupyter and NumFOCUS.

We also validated the Dataflux Dataset on Deep Learning IO (DLIO) benchmarks and realized similar performance gains, even with larger files. Due to this broad performance boost, we recommend using Dataflux Dataset over other libraries or direct Cloud Storage API calls for training workflows.

Key Dataflux Dataset features include:

- Direct Cloud Storage integration: Eliminate the need to download data locally first.
- Performance optimization: Achieve up to 3.5x faster training times, especially with small files.
- PyTorch Dataset primitive: Work seamlessly with familiar PyTorch concepts.
- Checkpointing support: Save and load model checkpoints directly to/from Cloud Storage.

Using Dataflux Datasets

- Prerequisites: Python 3.8+
- Installation: $ pip install gcs-torch-dataflux
- Authentication: Use Google Cloud application-default authentication

Example: Loading images for training

There are only a few changes needed to enable the Dataflux Dataset. If you’re using PyTorch and have data in Cloud Storage, you most likely have written your own Dataset implementation. The snippet below shows how easy it is to create a Dataflux Dataset. For further details, check out our GitHub page.

```python
import numpy
import io
from PIL import Image
from dataflux_pytorch import dataflux_mapstyle_dataset


def transform(img_in_bytes):
    return numpy.asarray(Image.open(io.BytesIO(img_in_bytes)))


dataset = dataflux_mapstyle_dataset.DatafluxMapStyleDataset(
    project_name=PROJECT_NAME,
    bucket_name=BUCKET_NAME,
    config=dataflux_mapstyle_dataset.Config(prefix=PREFIX),
    data_format_fn=transform,
)

# Use "dataset" as usual in your ML-Training loop in combination with PyTorch DataLoader.
```

Under the hood

To achieve such significant performance gains for Dataflux, we addressed the data-loading performance bottlenecks in ML training workflows. In a training run, data is loaded in batches from storage, and after some processing, is sent from CPU to GPU for ML-Training computations. If reading and constructing a batch takes longer than GPU computation, then the GPU is effectively stalled and underutilized, leading to longer training times.
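The stall condition can be put into numbers with a small back-of-the-envelope model. This sketch assumes an idealized pipeline in which loading the next batch fully overlaps with computing on the current one (as with a prefetching data loader); the timings are made-up illustrative values, not measurements.

```python
def step_time(load_seconds: float, compute_seconds: float, overlapped: bool = True) -> float:
    """Time per training step: the slower stage dominates when loading and compute overlap."""
    if overlapped:
        return max(load_seconds, compute_seconds)
    return load_seconds + compute_seconds


def gpu_utilization(load_seconds: float, compute_seconds: float, overlapped: bool = True) -> float:
    """Fraction of each step during which the GPU is actually computing."""
    return compute_seconds / step_time(load_seconds, compute_seconds, overlapped)


# Hypothetical numbers: 0.1 s of GPU compute per batch.
slow_loader = gpu_utilization(load_seconds=0.3, compute_seconds=0.1)
fast_loader = gpu_utilization(load_seconds=0.05, compute_seconds=0.1)

print(f"slow loader utilization: {slow_loader:.2f}")
print(f"fast loader utilization: {fast_loader:.2f}")
```

In this model, once batch loading is faster than compute, further loader speedups no longer shorten the step; until then, every reduction in load time translates directly into shorter training.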

When data is in a cloud-based object storage system (like Google’s Cloud Storage), it takes longer to fetch the data than from a local disk, especially if the data is in small objects. This is due to time-to-first-byte latency. Once an object is ‘opened’ though, the cloud storage platform provides high throughput. In Dataflux, we employ a Cloud Storage feature called Compose Objects that can dynamically combine many smaller objects into a larger object. Then, instead of fetching (say) 1024 small objects (batch size), we only fetch 30 larger objects and download those to memory. The larger objects are then decomposed back to their individual smaller objects and served back as the dataset-samples. Any temporary composed objects created in the process are also cleaned up.
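The compose-then-decompose idea can be simulated locally in a few lines. The sketch below is not the Dataflux implementation and makes no Cloud Storage calls: "composing" is modeled as byte concatenation, the per-object sizes are assumed to be known from the listing step, and the group size of 32 is an arbitrary value chosen for illustration.

```python
from typing import Dict, List, Tuple


def chunk(names: List[str], group_size: int) -> List[List[str]]:
    """Split object names into consecutive groups of at most group_size."""
    return [names[i:i + group_size] for i in range(0, len(names), group_size)]


def compose(objects: Dict[str, bytes], names: List[str]) -> Tuple[bytes, List[Tuple[str, int]]]:
    """Concatenate the named objects; return the blob and a (name, size) manifest."""
    manifest = [(name, len(objects[name])) for name in names]
    blob = b"".join(objects[name] for name in names)
    return blob, manifest


def decompose(blob: bytes, manifest: List[Tuple[str, int]]) -> Dict[str, bytes]:
    """Slice a composed blob back into the original objects using the recorded sizes."""
    samples = {}
    offset = 0
    for name, size in manifest:
        samples[name] = blob[offset:offset + size]
        offset += size
    return samples


# Hypothetical batch of 1024 small objects with varying sizes.
small_objects = {f"img_{i:04d}.bin": bytes([i % 256]) * (1 + i % 7) for i in range(1024)}
groups = chunk(sorted(small_objects), group_size=32)

recovered: Dict[str, bytes] = {}
for group in groups:
    blob, manifest = compose(small_objects, group)  # one large fetch instead of many
    recovered.update(decompose(blob, manifest))

print(len(small_objects), "objects served from", len(groups), "composed fetches")
```

The point of the exercise is that the per-request latency is paid once per composed object rather than once per sample, while the samples handed to the training loop are byte-for-byte identical.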

Another optimization that Dataflux Datasets employs is high-throughput parallel-listing, speeding up the initial metadata needed for the dataset. Dataflux uses a sophisticated algorithm called work-stealing to significantly speed up listings; with it, even the first AI training run, or “epoch,” is faster compared to Dataflux Datasets without parallel-listing, even on datasets that have tens of millions of objects.
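For readers unfamiliar with work-stealing, the following toy, single-threaded simulation shows the general idea applied to listing a prefix tree: each worker lists prefixes from its own queue, and an idle worker takes pending work from the busiest one. It is a generic illustration of the technique, not Dataflux's algorithm; the `tree` structure and function name are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def list_with_work_stealing(tree: Dict[str, List[str]], root: str,
                            num_workers: int) -> Tuple[List[str], int]:
    """List all objects under root; return (sorted object names, number of steals).

    tree maps a prefix to its children. A child that is itself a key of tree
    is a sub-prefix; any other child is an object.
    """
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root)
    found: List[str] = []
    steals = 0
    while any(queues):  # loop until every worker's queue is empty
        for queue in queues:
            if not queue:
                # Idle worker: steal the oldest pending prefix from the busiest worker.
                victim = max(queues, key=len)
                if len(victim) <= 1:
                    continue  # nothing worth stealing this round
                queue.append(victim.popleft())
                steals += 1
            prefix = queue.pop()  # owners work on their newest prefix
            for child in tree[prefix]:
                if child in tree:
                    queue.append(child)
                else:
                    found.append(child)
    return sorted(found), steals


# Hypothetical bucket layout.
tree = {
    "": ["a/", "b/", "x.jpg"],
    "a/": ["a/1.jpg", "a/2.jpg"],
    "b/": ["b/c/", "b/3.jpg"],
    "b/c/": ["b/c/4.jpg"],
}
objects, steal_count = list_with_work_stealing(tree, root="", num_workers=2)
print(objects, steal_count)
```

The benefit in a real parallel setting is load balancing: no worker sits idle while another still holds a backlog of unlisted prefixes.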

Together, fast-listing and dynamic-composition help ensure that ML-training with Dataflux leads to minimal GPU stalls, leading to greatly reduced training time and increased accelerator utilization.

Fast-listing and dynamic-composition are part of the Dataflux Client Libraries and available on GitHub. Dataflux Dataset uses these client libraries under the hood.

Dataflux is available now

Give the Dataflux Dataset for PyTorch (or the Dataflux Python client library if writing your own ML training dataset code) a try and let us know how it boosts your workflows!

You can learn more about this and our other storage AI related capabilities from our Google Cloud Next ’24 recorded session, “How to define a storage infrastructure for AI and analytical workloads.”


Private networking patterns to Vertex AI workloads

Thu, 02 May 2024 16:00:00 +0000

As strategic enterprise use cases for AI increase, secure and reliable connectivity is more crucial than ever. Get ready to explore several private connectivity options for your Vertex AI workloads! In this blog, we'll dive into the existing options and reveal the services to get you connected on your AI journey.

Connectivity matrix

Vertex AI is a suite of products that support different AI workloads with varying functionality. By default, Vertex AI APIs are accessed through public endpoints, as is the case with Google APIs in general. Depending on your architecture, security and enterprise governance requirements may call for accessing these APIs privately, meaning traffic does not travel over the internet to the public address of the API. In these cases there are several options, which we explore later in this blog, and the available options vary depending on the Vertex AI product you are connecting to. The image below shows the connectivity matrix for accessing Vertex AI from on-premises and multicloud environments.


Options

As you can see from the previous matrix image, there are several methods in addition to the public internet. These include:

- Private Service Connect (PSC) for Google APIs - Provides private access to Google APIs over hybrid networking or within a Virtual Private Cloud (VPC) using customer-specified IP address(es) and a DNS endpoint name that can be leveraged for one or more use cases.
- Private Google Access - Provides private access to Google APIs over hybrid networking or within a Virtual Private Cloud (VPC) using a Google-defined subnet.
- Private Service Access (PSA) - Google and other providers, collectively known as "service providers," can offer services hosted within a Google-managed VPC network. PSA enables you to define IP addresses for the managed services in addition to establishing VPC peering to access the internal IP addresses of these Google and third-party services over hybrid networking or within the VPC.
- Private Service Connect endpoint - Enables consumers to securely access managed services hosted by Google or other providers from within their own Virtual Private Cloud (VPC) network or via hybrid networking, eliminating the need to define the producer's VPC network. Communication with managed services is established through PSC endpoints or backends defined by the consumer's IP space, facilitating multi-tenancy to producer services across VPCs.

Example

The following diagram shows a Vector Search architecture in which the Vector Search API is enabled and managed in a service project (appropriately named "serviceproject") as part of a Shared VPC deployment. The Vector Search Compute Engine resources are deployed as a Google-managed Infrastructure-as-a-Service (IaaS) in the service producer's VPC network.

Private Service Connect endpoints are deployed in the consumer's VPC network (serviceproject) for index query, in addition to Private Service Connect endpoints for Google APIs for private index creation deployed in the host project, the VPC where the cloud router resides. Both index creation and index query are accessible privately through hybrid networking or within the VPC.

If the organization requires public access to index query, you can use the same producer service as a Private Service Connect Network Endpoint Group (PSC NEG) backend to an external load balancer, thereby enabling public access to the endpoint while also providing WAF and DDoS protection capabilities when associated with Cloud Armor.


Get hands-on and learn more

This topic is hot right now, and there are many approaches you can use. There are a few resources available that you can use to get some hands-on experience. Please check out the following tutorials.

- Tutorial - Use Private Service Connect to access a Vector Search index from on-premises.
- Tutorial - Use Private Service Connect to access Generative AI on Vertex AI from on-premises.
- Tutorial - Use Private Service Connect to access Vertex AI online predictions from on-premises.
- Tutorial - Use Private Service Connect to access Vertex AI batch predictions from on-premises.

To learn more or share a thought, find me on LinkedIn.

RAG in production faster with Ray, LangChain and HuggingFace

Thu, 02 May 2024 16:00:00 +0000

We’re excited to announce the release of a quickstart solution and reference architecture for retrieval augmented generation (RAG) applications, designed to accelerate your journey to production. In this post, you’ll learn how to quickly deploy a complete RAG application on Google Kubernetes Engine (GKE), and Cloud SQL for PostgreSQL and pgvector, using Ray, LangChain, and Hugging Face.

What is RAG?

RAG can improve the outputs of foundation models, such as large language models (LLMs), for a specific application. Rather than relying purely on knowledge developed during training, AI apps equipped for RAG can retrieve the information most relevant to a user’s prompt from an external knowledge base, then add that information to the prompt before sending it to the generative model. The knowledge base can come in various forms, such as a vector database, traditional search index, or relational database. By accessing it, customer service chatbots can look up help center articles, digital shopping assistants can tap into product catalogs and customer reviews, and AI-powered travel agents can deliver up-to-date flight and hotel information.
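The retrieve-then-augment flow can be sketched in plain Python. This toy version ranks documents by bag-of-words cosine similarity instead of learned embeddings and stops at building the augmented prompt (no model is called); the knowledge base contents and function names are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    query_vector = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def augment_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Prepend the retrieved context to the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base for a travel assistant.
knowledge_base = [
    "Flight GA123 departs at 09:40 from gate B7.",
    "Our return policy allows refunds within 30 days.",
    "The hotel pool is open from 7am to 10pm.",
]
question = "When does flight GA123 depart?"
prompt = augment_prompt(question, knowledge_base)
print(prompt)
```

A production system would swap the token-count vectors for embeddings stored in a vector database, but the shape of the pipeline (score, select, prepend) stays the same.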


LLMs rely on their training data, which can quickly fall out of date and may not include data relevant to the application’s domain. Re-training or fine-tuning an LLM to provide fresh, domain-specific data can be an expensive and complex process. RAG not only gives the LLM access to such data without training or fine-tuning, but can also guide an LLM toward factual responses, thereby reducing hallucinations and enabling applications to provide human-verifiable source material.

For more background on how RAG works, see our blog on context-aware code generation.

AI Infrastructure for RAG

Prior to the rise of Generative AI, a typical application architecture might involve a database, a set of microservices, and a frontend. Even the most basic RAG applications introduce new requirements for serving LLMs, processing, and retrieving unstructured data. To meet these requirements, customers need infrastructure that is optimized specifically for AI workloads.

Many customers choose to access AI infrastructure like TPUs and GPUs via a fully managed platform, such as Vertex AI. Others, however, prefer to manage their own infrastructure on top of GKE while leveraging open-source frameworks and open models. This blog post is for the latter group.

Building an AI platform from scratch involves a number of key decisions, such as which frameworks to use for model serving, which machine shapes to use for inference, how to protect sensitive data, how to meet cost and performance requirements, and how to scale as traffic grows. Each decision involves many tradeoffs against a vast and fast-changing landscape of generative AI tools.

This is why we have developed a quickstart solution and reference architecture for RAG applications built on top of GKE, Cloud SQL, and open-source frameworks Ray, LangChain and Hugging Face. Our solution is designed to help you get started quickly and accelerate your journey to production with RAG best practices built-in from the start.

Benefits of RAG on GKE and Cloud SQL

GKE and Cloud SQL accelerate your journey to production in a variety of ways:

- Load Data Fast - Use Ray Data to seamlessly access data in parallel from your Ray cluster via GKE’s GCSFuse driver. Efficiently load your embeddings into Cloud SQL for PostgreSQL and pgvector to perform low latency vector search at scale.
- Fast deploy - Quickly deploy Ray, JupyterHub, and Hugging Face Text Generation Inference (TGI) to your GKE cluster
- Security made simple - Get move-in ready Kubernetes security with GKE. Filter out sensitive or toxic content using Sensitive Data Protection (SDP). Leverage Google-standard authentication with Identity-Aware Proxy so users can seamlessly connect to your LLM frontend and Jupyter notebooks.
- Cost efficiency & reduced management overhead - GKE reduces cluster maintenance and makes it easy to take advantage of cost-saving measures like spot nodes via YAML configuration.
- Scalability - GKE automatically provisions nodes as traffic grows, eliminating the need for manual configuration to scale up.

Deploying RAG on GKE and Cloud SQL

Our end-to-end RAG application and reference architecture provide the following:

- Google Cloud project - configures your project with the needed prerequisites to run the RAG application, including a GKE Cluster and Cloud SQL for PostgreSQL and pgvector instance
- AI frameworks - deploys Ray, JupyterHub, and Hugging Face TGI to GKE
- RAG Embedding Pipeline - generates embeddings and populates the Cloud SQL for PostgreSQL and pgvector instance
- Example RAG Chatbot Application - deploys a web-based RAG chatbot to GKE


The example chatbot application provides a web interface where users can interact with an open source LLM. It leverages data loaded by the RAG data pipeline into Cloud SQL for PostgreSQL with pgvector, providing more comprehensive and informative responses to user queries.
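To give a feel for the retrieval step against pgvector, the sketch below builds a parameterized nearest-neighbor query using pgvector's cosine-distance operator (`<=>`). The table and column names (`document_chunks`, `chunk_text`, `embedding`) are hypothetical, and the actual schema and query in the quickstart solution may differ; the snippet only constructs the SQL and parameters and does not connect to a database.

```python
from typing import Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text representation, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def build_similarity_query(embedding: Sequence[float], k: int = 3) -> Tuple[str, tuple]:
    """Return (sql, params) that fetch the k chunks closest to the query embedding."""
    sql = (
        "SELECT chunk_text FROM document_chunks "  # hypothetical table and column names
        "ORDER BY embedding <=> %s::vector "       # <=> is pgvector's cosine distance
        "LIMIT %s"
    )
    return sql, (to_vector_literal(embedding), k)


sql, params = build_similarity_query([0.1, 0.2], k=2)
print(sql)
print(params)
```

Passing the embedding and limit as parameters, rather than formatting them into the SQL string, keeps the query safe to reuse with a PostgreSQL driver that supports `%s` placeholders.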

Our end-to-end RAG solution serves as a starting point for further development, demonstrating the potential of this technology for a wide range of applications. By combining the power of RAG with the scalability and flexibility of GKE and Cloud SQL as well as security features of Google Cloud, developers can build powerful and versatile applications that can handle complex tasks and provide valuable insights.

We plan to evolve this solution over time, including the ability to add custom data sets, replace models, and update the dataset and vector database with new documents.

For more information, please check our README and GitHub instructions, and the reference RAG architecture. You can also view our Google Cloud Next 2024 session discussing RAG.

Sullivan County debuts generative AI chatbot, Saige, to answer constituent FAQs

Thu, 02 May 2024 16:00:00 +0000

Sullivan County, New York, is home to the Catskill Mountains. It’s a great place to live, play, and raise a family, and we have a robust group of visitors and tourists. We are also a fairly small county. Being on the cutting edge of technology isn’t typically something people associate with local governments, and I’m proud of Sullivan County for consistently innovating.

Traditionally, in offices like ours, our teams are routinely answering rote questions about operating hours or document filing processes. These calls take up hours of their day they’d like to spend helping solve more complex problems. In 2023, we introduced a virtual agent powered by Dialogflow that has helped us quickly provide our constituents with answers to simple questions. We embedded the agent into our chatbot on the county website to support those seeking information about our County Clerk’s and Treasurer’s offices.

With the chatbot’s support, we’ve seen a 62% reduction in call volume. This has allowed our teams to focus on resolving more complex issues. The success of that rollout attracted the attention of the other offices in the county, and we saw a chance to expand our chatbot’s capabilities.

Simplifying generative AI with Vertex AI to overhaul a chatbot in 10 weeks

Although more than 40 offices in Sullivan County were excited to implement a chatbot, we were initially concerned about the time and effort it would take to tackle a project of that scale. Some leaders assumed that many chatbots would require manually combing through information on office websites and building chatbot workflows for each of them. Then, I was introduced to Vertex AI Agent Builder. Vertex AI Agent Builder makes it easy, even for our nontechnical teams, to train machine learning models that grow and change over time. Since it can “learn,” we can create chatbots that scrape information from the respective office sites and dynamically determine the best answer to a question on any given day. We don’t have to design any of those flows manually.

My team worked with Google Cloud Premier Partner Quantiphi to implement a generative AI-powered version of our virtual agent (who we call Saige), and their enthusiasm about the project was palpable. The project was completed and connected every department in the county in just ten weeks, thanks to the support from Quantiphi.

Improving interactions with constituents as Saige learns over time

I grew up tinkering with computers in the 1980s, and I’ve been told my entire life that computers will make our lives easier, but training people on new technologies is often cumbersome and complicated. I believe that if a task using technology is more difficult or time-consuming than doing it manually, no one will adopt that new tool. With Vertex AI, I’m seeing technology make good on its promise to improve lives.

We’ve already seen Saige grow and evolve to make our jobs easier. As it gathered information from various department sites, it helped us identify sources that needed to be updated. If it responds to a question incorrectly, I can click “This was not helpful,” and ask the question again. Saige is able to search for and find new information immediately. This is especially valuable because information changes; holidays change operating hours or a file is updated, and our small teams don’t have time to manually update every chatbot workflow.

As Saige’s knowledge repository grows, our offices continue to get more and more time back to manage more complex issues that require the face-to-face, human touch. Our offices get more efficient, people get answers faster, and every interaction we have with a constituent helps inform the next.

Setting new goals for an AI-powered Sullivan County

Hearing about new experiences from our teams and the people of Sullivan County is one thing, and it’s another to be able to truly measure impact. We’ve implemented a Looker dashboard that is helping us track exactly how much impact our chat features have on our offices. I can easily view total user sessions, success rates of chat interactions, and peak hours, so we can refine our support to best meet our community’s needs.

I can also see what questions people are asking, which helps me to understand what subjects are trending. Gathering experiential data alongside quantifiable data helps us offer better services and information about popular topics, in more prominent places, on our websites.

When I think of the future of Sullivan County’s online resources, I think about what I would want as a constituent coming to our sites for information. As our chat function continues to improve over time, I’d love to offer additional services directly through chat, such as payments or form submissions. All of these tools can come together to further augment our amazing staff and provide Sullivan County constituents with the best possible service.

Interested in seeing how the state of New York is transforming? Learn more about how gen AI is transforming public sector services in New York.

AI can be the catalyst to reignite your digital transformation

Thu, 02 May 2024 13:00:00 +0000

Be honest with me: Is your “digital transformation” stuck? Did you start in earnest a few years back, and now you’re sitting on half-finished projects, uneven outcomes, and a distracted staff? It happens. Maintaining momentum for multi-year efforts isn’t easy. Especially efforts that are increasingly broad and complex! We need a catalyst to focus our efforts, motivate our teams, and simplify our work. Early signals tell us that generative AI is that missing catalyst, and Google Cloud is a unique partner for your journey.

How you got stuck

Digital transformation means many things to many people. Is it about becoming more efficient? Upgrading tools? Delivering new digital products to customers? Adopting a data-driven strategy? Changing internal culture? All of those things? We don’t always see a unifying purpose to these efforts that’s capable of rallying an organization. This lack of focus often results in a dizzying array of disparate projects.

Some of these projects are focused on new customer experiences. You might have tried launching new mobile or web experiences with solid, but not spectacular, results. There are always a handful of backend projects initiated to adopt public cloud, establish a real-time data infrastructure, set up developer-friendly application platforms, and upgrade security services. This inevitably sparks modernization programs to make data more accessible, apps more scalable, and infrastructure more automated. Smart companies complement these technology efforts with promises to invest in a company culture that embraces modern thinking and elite capabilities.

None of those are bad things! But sometimes they come with unintended consequences:

+
More complex infrastructure that straddles public cloud and private cloud
+
+
Legacy systems straining under load and change rates that they weren’t designed for
+
+
Regularly changing measures of success, from innovation to cost savings to optimization to efficiencies
+
+
Team demotivation, as this growing bag of projects seems increasingly disconnected from measurable outcomes
+

There’s a better way. In our experience, a corporate investment in generative AI brings focus, meaning, and acceleration to a host of important IT efforts.

Why generative AI catalyzes your team

A business strategy with generative AI at the center benefits customers and employees. Why? For customers, it helps focus your attention on delivering more personalized and engaging experiences. There are at least 101 examples of that. For staff, it puts a spotlight on how everyone can use smarter tools to design, deliver, and operate products — whether those are data reports, software applications, or infrastructure platforms. Everyone gets to join in!

And that’s the thing. It’s not just about generative AI; it’s about what it takes to be good at generative AI. Everyone needs to come together to fully commit to excellence in five supporting areas (that were usually left half-finished during a classic digital transformation):

+
Automate your infrastructure. Now’s the time to establish a full range of automation for provisioning, upgrading, and deleting all of the machinery that supports your (AI-hosting) infrastructure.
+
+
Upgrade your data platform. Your AI models won’t be any good without good data. Timely, accurate data is critical, and that means investing in flexible data pipelines, scalable databases, and a data warehouse that’s ready for AI.
+
+
Improve your developer experience. To build with AI, your developers need the tools, frameworks, and platform services that help them iterate quickly. It’s also time to finish those cultural upgrades that unleash your teams.
+
+
Modernize your security practices. Embracing generative AI requires a whole set of data, application, and infrastructure security considerations. You won’t deploy it if you don’t trust it. It’s key to make the necessary upgrades to your security posture.
+
+
Finish your cloud migration. It’s going to be hard to maximize the value of generative AI outside of the public cloud. Places like Google Cloud are purpose-built to support the access to innovation, elasticity, and scale that are so important right now.
+

What you need to succeed

Looking to avoid some of the challenges of past transformations? There’s more than one way to proceed with your generative AI strategy, but at Google Cloud, we see three crucial building blocks for your success.

You need proximity. Generative AI models and apps require proximity to dependent data. From AlloyDB to BigQuery, Google Cloud’s data services give you the speed, scale, and price performance to keep your AI-based systems grounded by your unique information. And especially now, you need proximity to expertise for your journey. This is a period of excitement and change, so you want Google’s world-class team partnering with you to help you architect, deliver, and optimize your AI-based solutions.

You need an integrated AI platform. This isn’t the time for building out complex, brittle, do-it-yourself AI platforms. Too much is evolving too quickly. Buy innovation and flexibility, not complexity. Our unique AI hypercomputer, Vertex AI platform for MLOps, and Gemini for Google Cloud offer best-in-class vertical and horizontal integrations that help you build, run, and optimize better than anywhere else.

Finally, you need cross-organization productivity assistance. AI is not just about different output; it’s about a different way of working. Gemini for Google Workspace helps everyone be more creative and productive. Gemini Code Assist gives software developers powerful tools for understanding and writing quality software. Gemini Cloud Assist will bring game-changing AI assistance to teams that need to troubleshoot and optimize their cloud systems.

Ready to get unstuck? Register for our Building Apps in an AI Era webinar to learn more about how Google Cloud can help you innovate faster, deliver unparalleled customer experiences, and secure a lasting competitive advantage.

Your modernization journey starts with the endpoint. A Forrester Consulting study shows why.

Thu, 02 May 2024 09:00:00 +0000

In today’s digital age, endpoints are a business requirement for collaborating with coworkers, engaging with customers, and building great products. However, with a rise in cyber attacks, increased scrutiny on cost, and pressure to innovate, IT departments require a new kind of endpoint that improves user experience, simplifies management, increases security and reduces costs—which are some of the key traits of what we refer to as the modern endpoint.

We commissioned Forrester Consulting to survey 652 IT professionals to explore the meaning of a modern endpoint to IT departments, which Forrester further defines in the study as “a mix of multiple next-generation capabilities that center around artificial intelligence (AI), web-based applications, the cloud, and the integration of data.” ¹

The study found that IT leaders are “prioritizing initiatives that lead to a modern endpoint.” In particular, the study found that “IT leaders are prioritizing AI, web-based applications, and endpoint management in the cloud because these initiatives are core to a modern endpoint and will allow businesses to evolve with their employees and customers needs.”¹

Specific to AI, the study found that IT respondents' “number-one priority over the next 12 months is to enable end users to take advantage of AI on the endpoint.”¹ The study speaks specifically to productivity gains for IT staff, who can use AI to “automate repetitive tasks, analyze data from endpoint devices and plan maintenance, or analyze user behavior to create more personalized computing experiences.” ¹

Businesses starting their journey to a true modern endpoint should consider adopting ChromeOS. In this blog post, we’ll visit three of the four key findings from the Forrester Consulting study “Delivering the Next-Generation endpoint,” and show how ChromeOS can help businesses realize their modern endpoint needs.

1. Endpoint security, management, and deployment are the top barriers to a modern endpoint.

In the study, Forrester Consulting found that IT respondents spent nearly half of their working hours on endpoint security (19%), management (15%), and deployment (14%).¹

ChromeOS can help IT departments reduce time securing, managing, and deploying devices. With regards to security, ChromeOS is built to be secure at every layer: at the lowest level, every ChromeOS device relies on Verified Boot which checks for tampering; second, every app and tab on ChromeOS is sandboxed, meaning each app has a clear perimeter in which it can operate; and third, by making the web the core application platform, apps are simply secure by design, with much more control over how they interact with powerful device features. This means that ransomware simply can't run on ChromeOS devices.*

The result? There has never been a successful ransomware or virus attack reported on ChromeOS devices—ever.*

ChromeOS can be centrally managed alongside your other devices via the web-based Google Admin console. From there, IT administrators gain a comprehensive view, allowing them to monitor and manage devices, track application deployments and versions, and control device policies and settings at scale, with changes applied across the fleet in seconds. Additionally, admins can revoke user access and securely wipe data from devices when necessary.

With zero-touch enrollment, it’s possible to deploy a fleet of ChromeOS devices without IT interaction. ChromeOS devices can be shipped directly to your end users, who can then get started in minutes—with policies, settings and apps all instantly applied on first boot as the user connects to the internet and signs in.

2. Web applications are the key to a modern endpoint.

While AI was identified as IT’s top priority, the study identified the second-highest priority over the next 12 months as the need to embrace more web-based applications, with 81% of respondents saying that “adopting web-based applications is part of their organization’s digital transformation goals.”¹ Forrester Consulting also found that 90% of surveyed IT respondents believe the future of end user computing is web-based and 78% of respondents indicated that companies that don’t embrace the web will be left behind.¹

The study found that IT departments perceive web-based applications not only as collaboration drivers, but as beneficial for IT departments as well. Why? Quoted from the study, “When the majority of applications that employees use are web-based, it makes all other elements of endpoint management easier to achieve, from simplifying management and improving security to unlocking the power of AI.”¹ In addition to the security benefits mentioned previously, the web helps businesses streamline ecosystem support, access, and deployment.

With progressive web apps, key software partnerships featured in Chrome Enterprise Recommended, simple application management, and more, ChromeOS can be the catalyst for businesses to realize the benefits of the web.

3. Achieving a modern endpoint benefits the business.

Finally, let's talk about cost. Forrester Consulting found that moving towards a modern endpoint would reduce costs to the IT department by roughly 19%, with 57% believing it will help reduce costs for the IT department.¹ Modern endpoints help IT departments save beyond the endpoint—including infrastructure costs, security software, support costs, and more.

ChromeOS devices can help streamline IT, slashing overall costs beyond the hardware, saving businesses $463 per device on average according to IDC.² These savings carry over even when deploying Chromebook Plus, a category of high performance laptops with powerful AI capabilities, and greater hardware specs at a greater value. Chromebook Plus comes with 10 years of automatic updates, staying secure and usable for even longer.

Embracing a modern endpoint strategy with ChromeOS can solve the most pressing IT challenges faced today. Check out the Forrester Consulting Study to learn more about the modern endpoint, or contact one of our team members to learn more about ChromeOS.

¹ Forrester Consulting 2024 Modern Endpoint Research, sponsored by Google, April 2024

² IDC Business Value Paper, sponsored by Google, The Business Value of ChromeOS, doc #49920522, March 2024

*As of May 2024 there has been no evidence of any documented, successful virus attack or ransomware attack on ChromeOS. Data based on ChromeOS monitoring of various national and internal databases.

Enhancing iEEG seizure identification and similarity search with Google Cloud

Wed, 01 May 2024 16:00:00 +0000

Globally, epilepsy affects approximately 50 million people. Located in Mountain View, CA, NeuroPace, Inc.¹ is committed to transforming the lives of those living with epilepsy by reducing or eliminating their seizures. The company's RNS® System,² a responsive neurostimulation device, monitors brain activity to detect seizure precursors and delivers targeted electrical stimulation to prevent seizures. This device also captures iEEG (intracranial electroencephalogram) data, with over 15 million recordings from over 5,000 patients collected to date, making it the largest collection of ambulatory iEEG records available.

NeuroPace's AI team has developed electrographic seizure classifier models using clinical trial data from the RNS System and has fine-tuned these models through transfer learning to identify seizure onset times. Previously, machine learning (“ML”) training was constrained by the limited number of Graphics Processing Units (GPUs) available in on-premises virtual machines (VMs), slowing down the optimization of models and training processes. NeuroPace tackled this challenge by scaling ML workloads with Google Cloud, moving away from on-premises VMs and utilizing Vertex AI for more efficient training and hyperparameter tuning.

Leveraging Google Cloud AI infrastructure

Google Cloud's technologies have significantly improved and accelerated NeuroPace’s ML training capabilities. Searching through more than a million iEEG records to identify similar ones, a task that previously took minutes to hours, can now be completed in milliseconds using Google's AlloyDB AI, part of the AlloyDB for PostgreSQL database. Further, the integration of Vertex AI, GPUs, Compute Engine, and Google Cloud Storage has revolutionized NeuroPace’s ML training processes, enhancing scalability, automation, and efficiency.

Vertex AI, Google Cloud’s AI development platform, supports the entire ML workflow, including data engineering, model training, deployment, and monitoring. This integration enables NeuroPace's AI team to use various GPUs for model training, with L4 GPUs offering better price-performance compared to on-premises resources. With it, they’ve developed a cloud-native ML training system, achieving desired scalability and efficiency through Vertex AI and GPUs.

Patient similarity search with AlloyDB AI

Identifying similar electrophysiological features across epilepsy patients has the potential to aid in discovering effective treatment options. NeuroPace has conducted research studies to identify similar iEEG patterns within a large dataset of over 1 million time-series iEEG records, utilizing the built-in vector search capabilities in AlloyDB AI. By employing IVFFlat and HNSW indexing methods, searches for similar iEEG records in this dataset can now be executed in approximately ten milliseconds.

AlloyDB AI enables storing data embeddings in vector form directly in the database, facilitating easier and faster similarity searches compared to standard PostgreSQL. This eliminates the need for elaborate external processing pipelines.
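
To make the idea of a similarity search over embeddings concrete, here is a minimal sketch in plain Python. It is not NeuroPace's pipeline and does not use AlloyDB AI: it performs an exact, brute-force scan using cosine similarity (an assumed distance measure), whereas IVFFlat and HNSW are approximate indexes that avoid scanning every record. The record IDs and three-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, records, k=2):
    """Return the IDs of the k records whose embeddings are closest to query.

    This scans every record (exact search); an index such as IVFFlat or
    HNSW trades a little accuracy to skip most of that work.
    """
    scored = [(cosine_similarity(query, emb), rid) for rid, emb in records.items()]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:k]]

# Hypothetical embeddings standing in for iEEG record embeddings.
records = {
    "rec-a": [1.0, 0.0, 0.0],
    "rec-b": [0.9, 0.1, 0.0],
    "rec-c": [0.0, 1.0, 0.0],
}

print(top_k_similar([1.0, 0.05, 0.0], records, k=2))  # ['rec-a', 'rec-b']
```

A brute-force scan like this grows linearly with the number of records, which is why index-backed search matters for datasets with more than a million entries.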

The next-generation disease management system

Data captured from the NeuroPace RNS System may provide insights into seizure trends and triggers leading to optimizing and personalizing epilepsy treatment.³ The research efforts to integrate Google Cloud's infrastructure with NeuroPace's RNS System data are directed towards creating a sophisticated disease management system for epilepsy, emphasizing tailored treatments and enhanced patient well-being.

1. https://www.neuropace.com/
2. Rx Only. The RNS® System is an adjunctive therapy for adults with refractory, partial onset seizures with no more than 2 epileptogenic foci. See important safety information at http://www.neuropace.com/safety/.
3. The RNS System does not currently incorporate functionality that is based upon or utilizes AI features.

Managing Cloud Storage soft delete at scale

Wed, 01 May 2024 16:00:00 +0000

Have you ever accidentally deleted your data? Unfortunately, many of us have, which is why most operating systems on personal computers have a recycle bin / trash can where you can go to get your files back. On the enterprise side, these accidental deletions can be at a much larger scale – sometimes involving millions or even billions of objects. There is also the prospect of someone gaining unauthorized access to your data and either performing a ransomware attack to try to hold your data hostage or simply deleting it!

We recently launched soft delete for Cloud Storage, an important new data protection feature compatible with all existing Cloud Storage features and workloads. It offers improved protection against accidental and malicious data deletion by providing you with a way to retain and restore recently deleted data at enterprise scale. With soft delete in place, you may also find that your organization can move more quickly when “pruning” old data, knowing that soft delete provides an undo mechanism in case of any mistakes.

In this blog, we provide you with the tools and insights you need to optimize your soft delete settings, even at scale, so that you use soft delete to protect your data based on its business criticality.

How does soft delete work and how is it billed?

When soft delete is enabled, deleted objects are retained in a hidden soft-deleted state for the soft delete retention duration set on that bucket, instead of being permanently deleted. If you need any of the soft-deleted objects back, simply run a restore and they are copied back to live state.

We introduced soft delete with a seven-day retention duration enabled on all existing buckets and as the default for newly created buckets. Soft delete is on by default because accidental deletion events are unfortunately all too common and much of the data stored in Cloud Storage is business-critical in nature. In addition to the seven-day default, you can select any number of days between 7 and 90, or you can disable the feature entirely.

Soft delete usage is billed based on the storage class of the recently deleted objects. In many cases, this only increases bills by a few percentage points, which hopefully represents a good value for the amount of protection that soft delete provides. However, enabling soft delete on buckets that contain a large amount of short-lived (frequently deleted) data can result in large billing increases, since an object deleted after an hour would be billed for the one hour the object was live, plus seven days of soft delete usage.
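
The short-lived data case can be sketched with a few lines of Python. This is an illustration only: it assumes live and soft-deleted time are billed at the same storage-class rate, and the function names are hypothetical.

```python
HOURS_PER_DAY = 24

def billed_hours(live_hours, retention_days):
    """Billable storage hours for one object: live time plus soft delete retention."""
    return live_hours + retention_days * HOURS_PER_DAY

def billing_multiplier(live_hours, retention_days):
    """How many times larger the bill is compared with live time alone."""
    return billed_hours(live_hours, retention_days) / live_hours

# An object deleted after one hour, with the default seven-day retention.
print(billed_hours(1, 7))        # 169
print(billing_multiplier(1, 7))  # 169.0

# An object that lives for a year before deletion is barely affected.
print(round(billing_multiplier(365 * HOURS_PER_DAY, 7), 3))  # 1.019
```

The contrast between the two multipliers is why deletion rate, rather than total stored volume, tends to drive the cost of soft delete.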

How valuable is your data?

To get to a state where soft delete protects you from data deletion risks at the lowest economic cost, we recommend that you ask yourself the following three questions:

+
How important is my organization’s data? Are we storing temporary objects or media transcodes that could be easily regenerated if they were lost? Soft delete protection is unlikely to be worth it in these cases. Or are we storing data that would put my business and/or customer relationships at risk if it were lost? Soft delete could provide a vital level of protection here.
+
+
What level of data protection do we already have? If Cloud Storage has the only copy of your business-critical data, then soft delete protection would be much more important than if you were storing long-term backups of all your data in another Google Cloud region, on-prem, or with another cloud provider.
+
+
How much data protection can we afford? Soft delete can be much less expensive than performing traditional enterprise backups, but can still have a significant impact on billing, depending mostly on your deletion rates. We recommend considering the cost of soft delete relative to your overall Google Cloud bill rather than only storage because it is protecting your business data relied on by your overall workloads. You may find that leaving soft delete enabled on all your buckets only has a single digit percentage impact on your cloud bill, which may be worth it given the protection it provides against both accidental and malicious deletion events.
+

Once you have a good idea as to where and how much you want to use soft delete, the next steps depend on your architectural choices and the overall complexity of your organization’s cloud presence. For the rest of this blog, we’ll cover how to assess soft delete’s impact and act on that information, starting with bucket-level metrics, then acting on bucket-level settings within a project, using Terraform for management, and concluding with organizational-level management approaches.

Assessing bucket-level impacts

You can estimate bucket-level soft delete costs using Cloud Monitoring metrics and visualize them using the Metrics Explorer. You might want to inspect a handful of buckets that are representative of different kinds of datasets to get a better idea of which ones are more and less expensive to protect with soft delete.

Storage metrics

Recently, we introduced new storage metrics that allow you to break down the object counts, bytes, and byte seconds by storage class, and then further by live vs. noncurrent vs. soft-deleted vs. multipart. These breakdowns can be extremely useful even beyond any soft delete analysis you may want to perform. In addition, you can now inspect the deletion rate using the new deleted_bytes metric:

The storage/v2/deleted_bytes metric is a delta count of deleted bytes per bucket, grouped by storage class. It can be used to estimate soft delete billing impact, even if soft delete is disabled or set to a different retention duration than the one being considered.

The absolute cost of soft delete can be calculated as follows: soft delete retention duration × deleted bytes × storage price. For example, the cost (assuming us-central1 and Standard storage) of enabling a 7-day soft delete policy with 100,000 GB of deletions during the course of a month is (7 / 30.4375 days) × 100,000 GB × $0.02/GB mo = $459.96 (where 30.4375 is the average number of days per month).

The relative cost of soft delete can also be calculated by comparing the storage/v2/deleted_bytes metric to the existing storage/v2/total_byte_seconds metric: soft delete retention duration × deleted bytes / total bytes. Continuing from the above example and given 1,000,000 GB-months of storage for the month, the relative cost of enabling soft delete in this case is: (7 / 30.4375 days) × 100,000 GB / 1,000,000 GB = 2.3% impact.
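
The two formulas above can be checked with a short Python sketch. The function and parameter names are hypothetical; the inputs are the figures from the worked example (7-day retention, 100,000 GB deleted, $0.02 per GB-month, 1,000,000 GB-months stored).

```python
AVG_DAYS_PER_MONTH = 30.4375

def soft_delete_monthly_cost(retention_days, deleted_gb_per_month, price_per_gb_month):
    """Absolute cost: retention duration x deleted bytes x storage price."""
    return (retention_days / AVG_DAYS_PER_MONTH) * deleted_gb_per_month * price_per_gb_month

def soft_delete_relative_cost(retention_days, deleted_gb_per_month, stored_gb_months):
    """Relative cost: retention duration x deleted bytes / total bytes stored."""
    return (retention_days / AVG_DAYS_PER_MONTH) * deleted_gb_per_month / stored_gb_months

cost = soft_delete_monthly_cost(7, 100_000, 0.02)
ratio = soft_delete_relative_cost(7, 100_000, 1_000_000)

print(f"${cost:.2f}")  # $459.96
print(f"{ratio:.1%}")  # 2.3%
```

Changing `retention_days` shows how the estimate scales if you are considering a retention duration other than the seven-day default.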

Metrics Explorer

You can use the Metrics Explorer to create charts that visualize estimated soft delete costs for a given bucket:

+
In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics explorer.
+
+
Verify that MQL is selected in the Language toggle.
+
+
Enter the following query into the query editor:
+

    {
      fetch gcs_bucket :: storage.googleapis.com/storage/v2/deleted_bytes
      | value val(0) * 604800.0's'
      | group_by [resource.bucket_name, metric.storage_class], window(), .sum;
      fetch gcs_bucket :: storage.googleapis.com/storage/v2/total_byte_seconds
      | filter metric.type != 'soft-deleted-object'
      | group_by [resource.bucket_name, metric.storage_class], window(1d), .mean
      | group_by [resource.bucket_name, metric.storage_class], window(), .sum
    }
    | every 30d
    | within 360d
    | ratio

Note: This query assumes a 7-day (604,800 seconds) soft delete window.

Taking action within a project

If you are a storage administrator making decisions about soft delete settings within a project, you may want to go over your list of buckets manually and make decisions based on your business knowledge of what should be protected versus what can go without soft delete. For a larger number of buckets, you might instead use the above metrics to generate a list of the buckets whose soft delete billing impact exceeds a threshold (e.g., 20%), and then disable soft delete on those buckets.
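
As a rough illustration of the threshold approach, the sketch below filters a set of per-bucket figures and returns the buckets whose relative impact exceeds a cutoff. It is not the published script: the `BucketUsage` type, the bucket names, and the sample numbers are hypothetical, and it assumes you have already exported deleted bytes and total byte-seconds for the same time window.

```python
from typing import NamedTuple

SECONDS_PER_DAY = 86_400

class BucketUsage(NamedTuple):
    name: str
    deleted_bytes: float       # bytes deleted during the window
    total_byte_seconds: float  # byte-seconds stored during the same window

def relative_impact(usage, retention_days=7):
    """Retention duration x deleted bytes / total byte-seconds."""
    if usage.total_byte_seconds == 0:
        return float("inf") if usage.deleted_bytes > 0 else 0.0
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * usage.deleted_bytes / usage.total_byte_seconds

def buckets_over_threshold(usages, threshold=0.20, retention_days=7):
    """Return gs:// URLs of buckets whose relative impact exceeds threshold."""
    return [
        f"gs://{u.name}"
        for u in usages
        if relative_impact(u, retention_days) > threshold
    ]

WINDOW_SECONDS = 30 * SECONDS_PER_DAY  # hypothetical 30-day window

usages = [
    # About 1 TB stored on average, 2 TB deleted in the window.
    BucketUsage("logs-scratch", 2e12, 1e12 * WINDOW_SECONDS),
    # About 10 TB stored on average, 0.1 TB deleted in the window.
    BucketUsage("archive", 1e11, 1e13 * WINDOW_SECONDS),
]

print(buckets_over_threshold(usages))  # ['gs://logs-scratch']
```

Writing the returned URLs to a file, one per line, gives a list that can be reviewed before any soft delete policy is changed.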

To assist with this, we published a soft delete billing impact Python script on GitHub that generates a list of buckets in a project that exceed the percentage of billing impact that you specify, factoring in the storage classes of objects inside a bucket. The script can also be used to update the soft delete policies based on a specified relative cost threshold.

We recommend you use the Google Cloud CLI to configure soft delete settings on one or more buckets within a project. After installing and signing in, the following gcloud storage commands are examples of actions you may want to take to enable, update, or disable soft delete policies within a specified project:

    # Set your project ID
    $ gcloud config set project $MY_PROJECT_ID

    # Disable Soft Delete on one bucket
    $ gcloud storage buckets update --clear-soft-delete gs://example-bucket

    # Disable Soft Delete on a list of buckets
    $ cat buckets.txt | gcloud storage buckets update -I --clear-soft-delete

    # Disable Soft Delete on all buckets in the project
    $ gcloud storage buckets update --clear-soft-delete gs://*

    # Enable Soft Delete on one bucket
    $ gcloud storage buckets update --soft-delete-duration=7d gs://example-bucket

    # Enable Soft Delete on a list of buckets
    $ cat buckets.txt | gcloud storage buckets update -I --soft-delete-duration=7d

    # Enable Soft Delete on all buckets in the project with a 14-day retention duration
    $ gcloud storage buckets update --soft-delete-duration=14d gs://*

Taking action with Terraform

If you use an orchestration layer like Terraform, adapting to soft delete should be as simple as updating your templates and deciding on the soft delete retention duration for each workload. This could also involve creating new templates dedicated to short-lived data so that soft delete is disabled for buckets created from those templates. Once you’ve defined your settings, Terraform can update existing buckets to conform to the templates, and new buckets should be created with your intended settings.

With Terraform, the primary thing you need to do is to update your template(s) to include a soft delete policy. Here is an example of setting the soft delete retention duration to seven days (604800 seconds) in a google_storage_bucket resource:

    resource "google_storage_bucket" "bucket" {
      name     = "example-bucket"
      location = "US"
      …
      soft_delete_policy {
        retention_duration_seconds = 604800
      }
      ...
    }

To disable soft delete instead, simply set retention_duration_seconds = 0.

For more information, please also see: Use Terraform to create storage buckets and upload objects.

Taking action across a large organization

If you work for a large enterprise with thousands of projects and millions of buckets and mostly don’t use an orchestration layer, then a manual approach is not realistic, and you will need to make decisions at scale. If this is your situation, we recommend that you first learn about the bucket-level metrics and how to take action within a project as described earlier. In this section, we’ll extend these techniques to the organization level. Again, we assume you have already installed an up-to-date version of the gcloud CLI which you will need for this section.

To implement a policy across even the most complex of organizations, you will likely need to approach it in three steps using the gcloud command line environment:

- Obtain permissions: ensure you can list and change bucket-level settings across the organization
- Assess: decide on an impact threshold above which you will disable soft delete, and obtain a list of buckets surpassing that threshold
- Act: disable soft delete on that list of buckets

Obtain permissions

Before you can do anything, you will need to identify someone with sufficient access permissions to analyze and change bucket-level configurations across your organization. This could be an existing Organization Administrator. Alternatively, your Organization Administrator could create a custom role and assign it to you or another administrator for the specific purpose of managing soft delete settings:

gcloud iam roles create storageBucketsUpdate \
  --organization=example-organization-id-1 \
  --title="Storage bucket update" \
  --description="Grants permission to get, list, and update Storage buckets." \
  --stage=GA \
  --permissions="storage.buckets.get,storage.buckets.list,storage.buckets.update"

# Custom organization roles are referenced by their full path.
# The condition is kept on one line because a backslash inside single quotes is not a line continuation.
gcloud organizations add-iam-policy-binding example-organization-id-1 \
  --member='user:test-user@example.com' \
  --role='organizations/example-organization-id-1/roles/storageBucketsUpdate' \
  --condition='expression=request.time < timestamp("2024-07-01T00:00:00Z"),title=expires_2024_07_01,description=Expires at midnight on 2024-07-01'

Once this process is complete and all your buckets are updated, the Organization Administrator can safely delete the custom role, provided there is no ongoing need for a role with access to these settings:

gcloud iam roles delete storageBucketsUpdate --organization=example-organization-id-1

Assess

Armed with the power to act on bucket-level configurations across your organization, you can apply the project-level analysis above to obtain a list of all buckets across your organization that exceed your chosen impact threshold. Alternatively, you might choose to apply a uniform setting like 0d or 14d across all buckets in your organization.
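As a rough sketch of the assessment step, assume the project-level analysis has produced a plain-text report with one bucket name and its estimated soft delete billing impact (in percent) per line. That report format and the function name buckets_over_threshold are assumptions made for illustration; they are not the output of any Google Cloud tool.

```shell
# Hypothetical input on stdin: "<bucket-name> <impact-percent>" per line.
# Prints the gs:// URL of every bucket whose impact is strictly above
# the threshold given as the first argument.
buckets_over_threshold() {
  awk -v threshold="$1" '$2 + 0 > threshold + 0 { print "gs://" $1 }'
}

# Example with made-up numbers and a 10% threshold:
printf '%s\n' 'logs-bucket 2.5' 'scratch-bucket 41' 'media-bucket 10' |
  buckets_over_threshold 10
```

The resulting list can be saved to a file such as buckets.txt and passed to the -I form of gcloud storage buckets update shown earlier.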

Act

To update the soft delete policy for all your buckets across all your projects, you can iterate through all your projects, making the appropriate changes to the buckets in each project. For example, the following command disables soft delete on all buckets across your organization:

gcloud projects list --format="value(projectId)" | while read -r project
do
  # Quote the wildcard so that gcloud, not the shell, expands it
  gcloud storage buckets update --project="$project" --clear-soft-delete "gs://*"
done

Alternatively, you can use the filter option of projects list to only target a subset of your projects. For example, you might want to update projects with a specific label (--filter="labels.environment:prod") or with a certain parent (--filter="parent.id:123456789").

As a best practice, we recommend that you consider replacing the per-project action above with a command that selectively disables soft delete on specific bucket IDs. For example, you could loop through your project list, running the soft delete billing impact Python script for each project to update your bucket settings according to a % impact threshold you select to get a much more tailored outcome.
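Before changing settings at this scale, it can help to preview the exact commands first. The sketch below only prints the per-bucket commands instead of running them. The function name print_disable_commands is hypothetical, while the --clear-soft-delete flag is the one used earlier in this post.

```shell
# Dry run: read one bucket URL per line on stdin and print (not execute)
# the command that would disable soft delete on that bucket.
print_disable_commands() {
  while IFS= read -r bucket; do
    # Skip blank lines
    [ -n "$bucket" ] || continue
    printf 'gcloud storage buckets update --clear-soft-delete %s\n' "$bucket"
  done
}

printf '%s\n' 'gs://example-bucket-1' 'gs://example-bucket-2' | print_disable_commands
```

Once the printed list has been reviewed, the same output could be piped to sh to run it.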

Summary

By following the best practices in this blog and taking advantage of the available tooling and controls, we hope that you now feel more confident in your ability to protect your business-critical data with soft delete while simultaneously minimizing its billing impact.

Kubernetes is turning 10! Join the party on June 6th

Sun, 07 Jan 2024 15:51:00 GMT

Over the last 10 years, Kubernetes has risen to become the backbone of modern application deployment and has completely changed the course of innovation.

Yicai (第一财经) - 正在 (Live) - https://www.yicai.com/brief

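Magnitude 4.0 earthquake strikes the Noto region of Ishikawa Prefecture, Japan

Sun, 07 Jan 2024 15:09:00 GMT

According to Japan's TV Asahi, a magnitude 4.0 earthquake struck the Noto region of Ishikawa Prefecture at about 23:44 local time on the 7th. The maximum seismic intensity was 3 and the focal depth was 10 km. (CCTV News)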

Japanese foreign minister holds talks with Ukrainian president

Sun, 07 Jan 2024 14:52:00 GMT

According to Japan's Jiji Press, Japanese Foreign Minister Yoko Kamikawa, who is visiting Ukraine, held talks with Ukrainian President Zelensky on the 7th local time. (CCTV News)

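Israeli military keeps bombing Khan Younis in southern Gaza, killing 71

Sun, 07 Jan 2024 14:40:00 GMT

The Palestinian news agency reported on the afternoon of January 7 local time that sustained Israeli air strikes and shelling of the Khan Younis area in the southern Gaza Strip since the night of the 6th have killed 71 people. (CCTV News)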

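Average vessel age hits a new high as the global shipping fleet shows signs of aging

Sun, 07 Jan 2024 14:39:00 GMT

The Financial Times reported on the 7th that, according to data from shipbroker Clarksons, the average age of the global shipping fleet reached 13.7 years in December 2023, the highest level since 2009. The container fleet's average age reached 14.3 years, the highest since the company began collecting this data in 1993, and the average age of tankers carrying oil and other liquids reached 12.9 years, a 20-year high. Although global pressure to decarbonize keeps growing, uncertainty over regulatory policy and new alternative-fuel technologies may leave many shipowners unwilling to order new vessels that run on green fuels. Industry insiders expect the aging trend to continue. (CCTV Finance)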

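Tom Cat (汤姆猫): AI applications under development are still being tested and refined

Sun, 07 Jan 2024 14:33:00 GMT

Tom Cat said on a recent conference call that its AI applications under development are still in the testing and refinement stage. The multimodal AI Tom Cat product and the Tom Cat AI storytelling product, developed by the company's domestic R&D team together with Xihu Xinchen (西湖心辰), have preliminarily completed testing of their main functions. The company is further enriching the content of Tom Cat AI storytelling and working with the Xihu Xinchen team to keep improving modules such as multimodality, long-term memory, and emotion perception. "Talking Ben AI", the first AI mobile game developed by the company's overseas team, has begun its first round of overseas testing in Slovenia, Cyprus, South Africa, and other regions. The company obtained ample compliant corpus data in this test and will use it to keep enriching the product's preset database. The overseas team will also work with Xihu Xinchen to further explore personalized interaction, improved model capabilities, and broader application of AI technology for Talking Ben AI.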

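Tomorrow's A-share opportunities at a glance

Sun, 07 Jan 2024 14:30:00 GMT

① The central bank adds to its gold reserves for the 14th straight month; gold stocks reach a right-side allocation window

Data released by the State Administration of Foreign Exchange on January 7 show that China's foreign exchange reserves stood at USD 3.238 trillion at the end of December 2023, up USD 66.2 billion, or 2.1%, from the end of November. Gold reserves stood at 71.87 million ounces at the same date, the 14th consecutive monthly increase.

Zhou Maohua, a macro researcher in the financial markets department of China Everbright Bank, said the central bank's increase in gold reserves mainly follows a global trend: it optimizes and diversifies the structure of official reserve assets, improves their stability, and strengthens resilience to external risks. With global political and economic uncertainty clearly rising, a reasonable increase in gold reserves helps diversify risk, makes official reserve assets more stable, and strengthens the resilience of the financial system.

Sinolink Securities said that, under the real interest rate framework, the market is now in the right-side allocation period for gold stocks that runs from the Fed's halt to rate hikes until it clearly signals rate cuts. The relative returns of gold stocks have not yet fully priced in expected gold price gains, and gold stocks are expected to enter their main rally once the Fed clearly signals cuts.

② A steady stream of supportive policy for the silver economy; the elderly-care industry chain is expected to keep benefiting

On the evening of January 5, the State Council executive meeting stressed that developing the silver economy is an important measure for actively responding to population aging and promoting high-quality development. It called for continued improvement of relevant policies, focusing on pressing problems for the elderly such as home-based care, medical treatment and medication, and health and nursing care.

The Central Economic Work Conference held at the end of 2023 said China should do its best within its means and secure a firm, well-targeted floor for people's livelihoods; give greater priority to employment and keep employment stable for key groups; weave a tight social safety net and improve the tiered and categorized social assistance system; and speed up improvement of the childbirth support policy system, develop the silver economy, and promote high-quality population development.

Fang Lianquan, a researcher at the National Institute of Social Development of the Chinese Academy of Social Sciences, said the market should play the decisive role in resource allocation and private-sector forces should be fully mobilized. The State Council's 2013 "Several Opinions on Accelerating the Development of the Elderly Care Service Industry" set out national-level plans for developing industries related to elderly care services, with policy support in finance, land, taxes and fees, and employment. Private capital has become increasingly involved in the elderly-care industry, which has both broadened investment and financing channels and created many jobs and business opportunities.

③ The "17 financial measures" for rental housing are released, helping speed up a new model of real estate development

On January 5, the People's Bank of China and the National Financial Regulatory Administration issued the "Opinions on Financial Support for the Development of the Housing Rental Market" (the "Opinions"). Its 17 measures include strengthening innovation in rental housing credit products and service models, meeting reasonable financing needs for bulk purchases of rental housing by organizations, and supporting the issuance of rental housing operating loans.

The Opinions call for supporting supply-side structural reform of rental housing. Financial support for the rental market should focus on priorities and target weak points, mainly in large cities, to address the housing problems of groups such as new urban residents and young people. It should support all types of entities in building, converting, and operating long-term rental housing, put existing housing stock to use, and effectively increase the supply of affordable and commercial rental housing.

On the same day, the Shenzhen Planning and Natural Resources Bureau and the Housing and Construction Bureau jointly drafted the "Implementation Opinions on Actively and Steadily Advancing Urban Village Redevelopment to Achieve High-Quality Development". It specifies that urban village redevelopment targets areas within Shenzhen consisting mainly of existing residential land actually occupied and used by the successor entities of former rural collective economic organizations and by former villagers, and divides redevelopment into three types: demolition and new construction, renovation and upgrading, and a combination of the two.

Founder Securities said the rental housing business model has largely been proven. Combined with stronger financial support, existing housing will be put to effective use and the supply of allocated affordable rental housing will effectively increase, helping accelerate a new model of real estate development.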

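Rates above 3%: structured deposits and large-denomination certificates of deposit are selling briskly

Sun, 07 Jan 2024 13:35:00 GMT

Amid a wave of rate cuts, structured deposits and large-denomination certificates of deposit are popular with both account managers and depositors, and some banks' products of this kind pay more than 3%. Experts say households can diversify their asset allocation according to their risk appetite, liquidity needs, and other factors to grow their wealth. (China Securities Journal)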

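Magnitude 4.7 earthquake strikes the Noto region of Ishikawa Prefecture, Japan

Sun, 07 Jan 2024 13:01:00 GMT

According to Japan's public broadcaster NHK, a magnitude 4.7 earthquake struck the Noto region of Ishikawa Prefecture at about 21:38 local time on January 7. The maximum seismic intensity was 4 and the focal depth was 10 km. (CCTV News)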

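Hokuriku Electric Power: small oil film found on the sea near the Shika nuclear power plant

Sun, 07 Jan 2024 12:54:00 GMT

Hokuriku Electric Power Company said on January 7 that, during a detailed inspection of the Shika nuclear power plant in Shika, Ishikawa Prefecture, it confirmed a small oil film floating on the sea near the plant. Oil film was also found in the ditches and on the roads around the transformer of the plant's Unit 2. The company said: "This may be because transformer piping at the Shika plant was damaged in the earthquake on the 1st of this month, causing oil used for insulation and cooling to leak; the fire suppression equipment then activated, causing the oil to splash and leak into the surroundings with rainwater." It added that the leaked oil did not come from within the radiation controlled area and has no radiological impact on the outside. The oil film has been treated with neutralizing agents and other means. (CCTV News)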

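Summary of Sunday's major events

Sun, 07 Jan 2024 12:44:00 GMT

[Current affairs]

① A Foreign Ministry spokesperson answers questions on countermeasures against US arms sales to China's Taiwan region and US sanctions on Chinese entities;

② Gao Peiyong, Member of the Academic Divisions of the Chinese Academy of Social Sciences: "stabilizing expectations" will be a keyword for China's economy in 2024;

③ Liu Liehong, head of the National Data Administration: data has increasing returns, and a distribution mechanism that balances efficiency and fairness needs to be explored;

④ Ministry of Education: online claims that compulsory education reform pilot zones will "shorten the length of schooling" or "abolish the high school entrance exam" are untrue;

⑤ China's gold reserves stood at 71.87 million ounces at the end of December 2023, versus 71.58 million ounces at the end of November, the 14th consecutive monthly increase;

⑥ China's foreign exchange reserves stood at USD 3.238 trillion at the end of December 2023, up USD 66.2 billion, or 2.1%, from the end of November;

⑦ United Airlines announces the grounding of its Boeing 737 MAX 9 aircraft;

[Company news]

① NIO legal department: will take all necessary legal action against pirated "龍龖龘" design products;

② Xingyuan Environment (兴源环境): receives an administrative penalty decision from the Zhejiang Securities Regulatory Bureau;

③ Tencent's WeChat team apologizes: the private Moments bug has been fully fixed;

④ Cooperation between CATL and Ford continues to advance and may still qualify for US IRA subsidies.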

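Israel Defense Forces say they struck multiple Hezbollah targets in Lebanon

Sun, 07 Jan 2024 12:23:00 GMT

On January 7 local time, the Israel Defense Forces (IDF) said on their official social media account that they had carried out air strikes on multiple Hezbollah targets in Lebanon earlier that day. The IDF said it struck a Hezbollah position near the town of Marwahin in southern Lebanon, as well as several Hezbollah buildings in Labbouneh, Majdal Zoun, and Bint Jbeil. The IDF also said that on the evening of the 6th its air defense forces intercepted an "aerial target" that had entered Israeli airspace from Lebanon in the Even Menachem area. Hezbollah has not yet responded. (CCTV News)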

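Current Israeli-Palestinian conflict has killed 22,835 people in the Gaza Strip

Sun, 07 Jan 2024 12:14:00 GMT

According to the latest figures released by the Gaza health authorities on January 7 local time, the death toll in the Gaza Strip has risen to 22,835, with another 58,416 injured, since the current round of the conflict broke out on October 7 last year. The authorities said the Israeli military carried out 12 attacks on the Gaza Strip in the past 24 hours, killing 113 people and injuring 250. (CCTV News)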

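CATL's cooperation with Ford continues to advance and may still qualify for US IRA subsidies

Sun, 07 Jan 2024 12:08:00 GMT

The trans-Pacific cooperation between CATL and Ford has made important progress. On January 5 local time, a Ford spokesperson said the company saw nothing in the IRA and its guidance "that excludes our technology licensing cooperation with Chinese battery maker CATL (from subsidy eligibility)". This means the technology cooperation between CATL and Ford has not been excluded from subsidies under the US Inflation Reduction Act (IRA), whose new detailed rules have taken effect. A CATL representative confirmed on the evening of January 6 that the cooperation with Ford is moving forward. (Shanghai Securities News)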

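Israeli prime minister: Israel will not ease its military operations against Hamas in the Gaza Strip

Sun, 07 Jan 2024 11:59:00 GMT

According to The Times of Israel, Israeli Prime Minister Netanyahu told a cabinet meeting on January 7 local time that Israel will not ease its military operations against the Palestinian Islamic Resistance Movement (Hamas) in the Gaza Strip until all its goals are achieved. Speaking about tensions on the Lebanon-Israel border, Netanyahu said: "If we can, we will resolve this through diplomacy; if not, we will act in other ways." (CCTV News)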

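NIO legal department: will take all necessary legal action against pirated "龍龖龘" design products

Sun, 07 Jan 2024 11:47:00 GMT

NIO's legal department said on its official social media account that it recently found many merchants on major e-commerce platforms selling products almost identical to NIO's "龍龖龘" design products ("pirated products"). Some merchants even used NIO's official promotional images directly on their product pages to advertise the pirated products. NIO said these actions seriously infringe its intellectual property rights, that it has preserved the relevant evidence of infringement, and that it will take all necessary legal action to protect its rights. NIO warned those responsible to immediately stop producing, selling, and promoting pirated "龍龖龘" design products and to cease any other infringement of NIO's intellectual property.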

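North Korea holds maritime artillery firing drills on the 7th

Sun, 07 Jan 2024 11:40:00 GMT

According to the Korean Central News Agency (KCNA), the Korean People's Army held maritime artillery firing drills on the 7th local time. (CCTV News)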

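Vacuum infusion test for the world's highest-altitude wind turbine blade succeeds in Qinghai

Sun, 07 Jan 2024 11:38:00 GMT

On January 5, a research test of high-altitude vacuum infusion technology for wind turbine blades, carried out in Golmud, Qinghai Province, succeeded. It set a new record for the world's highest-altitude wind turbine blade and verified that "zero defect" infusion is possible under low pressure at high altitude, marking China's wind turbine blade production technology as having reached a world-leading level. (CCTV News)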

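Zheshang Securities: the balance of payments still needs close attention

Sun, 07 Jan 2024 11:37:00 GMT

A Zheshang Securities research report noted that China's official foreign exchange reserves stood at USD 3,237.977 billion in December 2023, climbing back above USD 3.2 trillion, up USD 66.17 billion month on month; it estimates this is mainly explained by valuation factors. The renminbi fluctuated in December and rose quickly at month-end, possibly because of concentrated year-end foreign exchange settlement by companies, gaining 0.55% over the month to close at 7.09. That gain was smaller than the decline in the US dollar index (2.1%), and the CFETS RMB exchange rate index at the end of December was down 0.9% from the end of the previous month. This indicates that China's economic fundamentals are recovering weakly and that endogenous momentum still has room to recover; what follows depends on the effect of demand-side policy. The balance of payments remains a core policy consideration. The report expects the policy focus to shift toward industrial and fiscal policy taking the lead, with monetary policy strengthening coordination. In the short term, whether rates are cut in January depends mainly on the trend of DR007.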

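Israeli troops shoot Palestinians in Ramallah, leaving 1 dead and 2 injured

Sun, 07 Jan 2024 11:33:00 GMT

According to the Palestinian news agency, Israeli troops clashed with local Palestinians in the northern part of the West Bank city of Ramallah on January 7 local time. The troops opened fire on the Palestinians during the clash, killing 1 Palestinian and injuring 2 others. Israel has not yet responded. (CCTV News)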

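Israeli attack on the Al-Mawasi area of Gaza kills 2 journalists

Sun, 07 Jan 2024 11:13:00 GMT

The Palestinian newspaper Al-Quds reported on the 7th local time that Israeli forces attacked a group of journalists that day in the Al-Mawasi area near Khan Younis in southern Gaza, killing 2 journalists. (CCTV News)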

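North Korea accuses South Korea of speculating about its military's movements

Sun, 07 Jan 2024 10:56:00 GMT

According to KCNA, Kim Yo Jong, vice department director of the Central Committee of the Workers' Party of Korea, said in a statement on the 7th that South Korea had misjudged and speculated about the movements of the North Korean military. The report said South Korea claimed that North Korea fired shells northwest of Yeonpyeong Island on the afternoon of the 6th and that the shells landed in the maritime buffer zone north of the "Northern Limit Line" in the western waters. Kim Yo Jong said that the North Korean military had in fact detonated explosives simulating the sound of coastal artillery, and that the move was intended to observe South Korea's reaction. She said South Korea mistook the sound of the explosives for gunfire, speculated that it was an artillery provocation, and falsely claimed that the impact points were in the maritime buffer zone north of the "Northern Limit Line". Kim Yo Jong also warned that if South Korea makes even a small provocation, the North Korean military will immediately deliver a "baptism of fire". (Xinhua)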

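CICC Strategy: short-term disturbances do not change the long-term picture; allocation opportunities in 2024 are expected to be better than in 2023

Sun, 07 Jan 2024 10:54:00 GMT

CICC's strategy team said external uncertainty has risen at the margin since the start of 2024. Given some recently released Chinese economic data and the influence of the internal and external environment, A-shares failed to extend the late-2023 rebound in the first week and performed weakly, while the high-dividend theme delivered significant excess returns at the start of the year. This shows that, with an uncertain external environment and weak public expectations, investors are placing more weight on current returns, and high-dividend sectors continue to attract capital thanks to stable cash flows and dividends. Looking ahead, with expectations weak, it recommends watching for the gradual accumulation of positive factors, especially marginal changes in the domestic policy environment. Short-term disturbances do not change the long-term picture; with extreme valuations and positive factors gradually accumulating, there is no need to be overly pessimistic about the market outlook. Allocation opportunities in 2024 are expected to be better than in 2023, and for the next 3 to 6 months it recommends combining offense and defense through sectors with recovering business conditions and dividend assets.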

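2024 National Conference of Directors of Culture and Tourism Departments and Bureaus held

Sun, 07 Jan 2024 10:43:00 GMT

The 2024 National Conference of Directors of Culture and Tourism Departments and Bureaus was recently held in Beijing. The meeting stressed the need to release policy dividends and sustain the tourism industry's good momentum of recovery and high-quality development; to coordinate cultivation and regulation so that culture and tourism markets are prosperous, safe, and orderly; and to deepen practical cooperation so that cultural exchange and tourism promotion work in the same direction and reinforce each other.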

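Gao Peiyong, Member of the Academic Divisions of the Chinese Academy of Social Sciences: "stabilizing expectations" will be a keyword for China's economy in 2024

Sun, 07 Jan 2024 10:41:00 GMT

On January 7, Gao Peiyong, Member of the Academic Divisions of the Chinese Academy of Social Sciences (CASS) and former CASS vice president, said at the 25th Peking University Guanghua New Year Forum that the Central Economic Work Conference had made the major strategic decision to "stabilize expectations, stabilize growth, and stabilize employment". "Stabilizing expectations" will be an extremely important keyword for China's economy in 2024 and must be the key lever that China's economic work in 2024 takes firm hold of.

Gao said the new difficulties and challenges in the current economic recovery are concentrated in confidence and expectations. Of the "three stabilizations", stabilizing expectations is the foundation and the key. Only when the confidence of households and businesses strengthens and expectations stabilize can the problem of insufficient consumption and investment demand ease, can the problems arising on both the demand and supply sides be resolved, and can the effort to consolidate and strengthen the economic recovery rest on a solid foundation.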

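Runway at Japan's Haneda Airport where aircraft collided to reopen at 0:00 on the 8th

Sun, 07 Jan 2024 10:17:00 GMT

On the evening of the 7th local time, Japan's Ministry of Land, Infrastructure, Transport and Tourism officially announced that the runway at Tokyo's Haneda Airport where aircraft collided on the 2nd will reopen at 0:00 on the 8th local time. (CCTV News)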

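Turkish security forces kill 6 members of the Kurdistan Workers' Party (PKK)

Sun, 07 Jan 2024 10:16:00 GMT

Turkey's Ministry of National Defense said in a statement on January 7 local time that Turkish security forces carried out a military operation in northern Syria that day and killed 6 armed PKK members. (CCTV News)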

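Many flights canceled after US airliner makes emergency landing following a mid-air incident; passengers crowd airport for refunds and rebooking

Sun, 07 Jan 2024 10:16:00 GMT

An Alaska Airlines passenger plane suffered a mid-air incident shortly after takeoff on the 5th local time and made an emergency return. After the incident, Alaska Airlines and United Airlines canceled more than 200 flights. On the 6th, large numbers of passengers gathered at Portland airport to get refunds or rebook onto other flights. In the United States, only Alaska Airlines and United Airlines currently operate the 737 MAX 9. On the 6th, Alaska Airlines canceled 154 flights and United Airlines canceled 80. (CCTV News)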

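Ying Ying: the Shanghai Composite has bottomed at 2882 and will continue an intermediate-term rebound

Sun, 07 Jan 2024 10:02:00 GMT

Ying Ying wrote in a weekly market commentary on a personal public account that the market was weak this week, with softer December economic data and tension on the Korean Peninsula weighing on investor sentiment. The Shanghai-Shenzhen equity risk premium and the stock-bond yield gap are both highly similar to readings at previous market bottoms, and a large number of A-share stocks are now trading below net asset value. The valuation of the CSI 300 has kept falling and has attracted continued inflows of capital seeking high dividends, which shows that large investors are optimistic about the future. Taken together, the Shanghai Composite Index has bottomed at 2882 and will continue an intermediate-term rebound.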

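Institutions on the market outlook: A-share valuations are at historic lows; off-market allocation capital is expected to enter gradually

Sun, 07 Jan 2024 09:50:00 GMT

① Haitong Securities: A-share valuations are at historic lows, and large financials may have periodic opportunities

Haitong Securities said in a research report that, historically, a decline at the start of the year has little bearing on full-year performance; when valuations are low and policy is supportive, the full-year market can still be promising after an early-year decline. A-share valuations are now at historic lows, the central bank has restarted PSL (Pledged Supplementary Lending) to step up support for growth, and the Fed's tightening cycle overseas may be ending, so the stock market is expected to warm up. With policy stepping up, large financials may have periodic opportunities. During the mid-term earnings upcycle, blue-chip growth stocks are expected to do better, such as hard technology (for example electronics) and pharmaceuticals.

② CITIC Securities: off-market allocation capital is expected to enter gradually, and the market will reach an important turning point

CITIC Securities said the year-end crossover effect on capital flows was pronounced in the first week of the year, with institutions' rebalancing and additions both crowding into dividend low-volatility stocks, and the market's trading ecology and style divergence reaching extremes. Mid-January is a key point in time: as economic data and geopolitical disturbances play out, policy will keep stepping up, off-market allocation capital is expected to enter gradually, and the market will reach an important turning point. On allocation, the market is still in the second stage of its "three-stage allocation strategy"; it recommends prioritizing oversold growth stocks, represented by the STAR Market, with sector allocation centered on technology and pharmaceuticals. Technology includes the AI industry (domestic computing power, AI chip design, applications, and so on), intelligent driving (the Huawei supply chain, domestic automakers, and so on), recovering end-consumer demand (consumer electronics, the Android supply chain recovery, data elements, telecom operators), robotics, and satellite internet. In pharmaceuticals, it highlights innovative products going overseas (drugs, devices, and so on). The new energy sector also deserves active attention, and blue chips in Hong Kong stocks such as consumer and internet names can be positioned early and entered gradually.

③ CSC Financial (中信建投): continues to favor investment opportunities in medical devices in 2024, focusing on three themes

A CSC Financial research report said that, looking ahead to 2024, it continues to favor investment opportunities in the medical device sector along three themes: a policy turning point, internationalization, and innovation. It recommends watching companies that may benefit in 2024 from a recovery in surgery volumes and in tendering demand, companies whose overseas business has highly certain strong growth or may beat expectations, and companies whose earnings growth may accelerate as new products and technologies ramp up. It expects industry procurement to remain in a mild recovery in 2024, with first-half earnings certainty highest for some high-value consumables companies, followed by IVD, and then medical equipment companies. Share prices and valuations of some high-quality stocks have corrected recently, and it recommends taking the opportunity to add to stocks with relatively certain 2024 earnings.

④ CICC Strategy: short-term disturbances do not change the long-term picture; allocation opportunities in 2024 are expected to be better than in 2023

CICC's strategy team said external uncertainty has risen at the margin since the start of 2024. Given some recently released Chinese economic data and the influence of the internal and external environment, A-shares failed to extend the late-2023 rebound in the first week and performed weakly, while the high-dividend theme delivered significant excess returns at the start of the year. This shows that, with an uncertain external environment and weak public expectations, investors are placing more weight on current returns, and high-dividend sectors continue to attract capital thanks to stable cash flows and dividends. Looking ahead, with expectations weak, it recommends watching for the gradual accumulation of positive factors, especially marginal changes in the domestic policy environment. Short-term disturbances do not change the long-term picture; with extreme valuations and positive factors gradually accumulating, there is no need to be overly pessimistic about the market outlook. Allocation opportunities in 2024 are expected to be better than in 2023, and for the next 3 to 6 months it recommends combining offense and defense through sectors with recovering business conditions and dividend assets.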

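Protests erupt in Tel Aviv, Israel, demanding that the prime minister step down

Sun, 07 Jan 2024 09:48:00 GMT

On the 6th local time, large numbers of Israelis held a protest in Tel Aviv, demanding that the Israeli government immediately reach a deal with Hamas on the release of those being held. Protesters also demanded that the Netanyahu government step down, that parliament be dissolved, and that early elections be held. According to Israeli media, organizers said about 20,000 people took part. (CCTV News)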

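Summary of important evening announcements

Sun, 07 Jan 2024 09:46:00 GMT

[Major matters]

Xingyuan Environment (兴源环境): receives an administrative penalty decision from the Zhejiang Securities Regulatory Bureau

Meinian Health (美年健康): plans to sign a genetic testing project cooperation agreement with Meiyin Health (美因健康)

Jinxinnong (金新农): cumulative hog sales revenue of RMB 1.238 billion in 2023, down 31.95% year on year

Dinglong Technology (鼎龙科技): plans to acquire the minority stake in a controlled subsidiary

Foton Motor (福田汽车): cumulative vehicle sales in 2023 up 37.14% year on year

[Shareholding increases and reductions]

Lily & Beauty (丽人丽妆): shareholder Liren (丽仁) plans to reduce its holding by up to 0.99% of the company's shares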

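Dinglong Technology (鼎龙科技): plans to acquire the minority stake in a controlled subsidiary

Sun, 07 Jan 2024 09:36:00 GMT

Dinglong Technology announced that it plans to use RMB 29.541 million of its own funds to acquire the 40% equity stake in its controlled subsidiary Dingli Technology (鼎利科技) held by minority shareholder Shuangli Technology (双利科技). After the acquisition, Dingli Technology will become a wholly owned subsidiary.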

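CSC Financial (中信建投): continues to favor investment opportunities in medical devices in 2024, focusing on three themes

Sun, 07 Jan 2024 09:31:00 GMT

A CSC Financial research report said that, looking ahead to 2024, it continues to favor investment opportunities in the medical device sector along three themes: a policy turning point, internationalization, and innovation. It recommends watching companies that may benefit in 2024 from a recovery in surgery volumes and in tendering demand, companies whose overseas business has highly certain strong growth or may beat expectations, and companies whose earnings growth may accelerate as new products and technologies ramp up. It expects industry procurement to remain in a mild recovery in 2024, with first-half earnings certainty highest for some high-value consumables companies, followed by IVD, and then medical equipment companies.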

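She Yanmin, former Party leadership group secretary and director of the Chaozhou Transportation Bureau, expelled from the Party and dismissed from public office

Sun, 07 Jan 2024 09:18:00 GMT

According to the Nanyue Qingfeng website (南粤清风网), with the approval of the Chaozhou Municipal Party Committee, the Chaozhou Municipal Commission for Discipline Inspection and Supervisory Commission recently opened a case to review and investigate serious violations of discipline and law by She Yanmin, former Party leadership group secretary and director of the Chaozhou Transportation Bureau. In accordance with the Regulations of the Communist Party of China on Disciplinary Action, the Supervision Law of the People's Republic of China, the Law of the People's Republic of China on Administrative Sanctions for Public Officials, and other relevant provisions, after deliberation at meetings of the Standing Committee of the Chaozhou Municipal Commission for Discipline Inspection and the Standing Committee of the Municipal Party Committee, and after review by the Standing Committee of the Guangdong Provincial Commission for Discipline Inspection and approval by the Provincial Party Committee, it was decided to expel She Yanmin from the Party. The Chaozhou Municipal Supervisory Commission will dismiss She Yanmin from public office; She Yanmin's status as a delegate to the 15th Chaozhou Municipal Party Congress is terminated; the gains from the violations are confiscated; and the suspected crimes are transferred to the procuratorate for review and prosecution in accordance with the law, together with the property involved.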

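Dongguan issues a "gift package" to support workers and stabilize production: each company can receive up to RMB 300,000 in hiring subsidies

Sun, 07 Jan 2024 09:12:00 GMT

Dongguan, Guangdong recently issued and implemented the "Measures for Supporting Workers, Stabilizing Production, and Promoting Consumption Around the Spring Festival", deploying policy tools in three areas: stabilizing jobs and promoting employment, stabilizing and expanding production to get a head start, and enriching supply to promote consumption. From February 10 to March 9, 2024, companies that hire workers without Dongguan household registration who are taking their first job in Dongguan, sign labor contracts with them, and pay social insurance for them continuously for more than 3 months as required by law will receive a one-time new employment subsidy of RMB 1,000 per person, up to RMB 300,000 per company. (Xinhua)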

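Noto Peninsula earthquake in Japan turns some sea areas into land

Sun, 07 Jan 2024 09:09:00 GMT

Japanese media, citing researchers, reported on the 6th that the Noto Peninsula earthquake in Ishikawa Prefecture lifted parts of the coastal seabed so that they became land. According to the Yomiuri Shimbun, researchers at the Geospatial Information Authority of Japan analyzed satellite observation data and found that the seabed along the coast from Suzu through Wajima to Shika on the Noto Peninsula had risen above the water over a total length of about 85 km. Around Minazuki Bay in Wajima, the seabed rose by about 4 meters and the coastline advanced about 200 meters seaward. Researchers from the Association of Japanese Geographers analyzed aerial photographs and found that land area expanded by about 240 hectares along just the roughly 50 km of coast from Suzu to Wajima. (Xinhua)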

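Alpha Group (奥飞娱乐): new works including "Balala the Fairies 10" (巴啦啦小魔仙10) to follow

Sun, 07 Jan 2024 09:04:00 GMT

Alpha Group said on an investor interaction platform that "星缘蝶启1" and "星缘蝶启2" are the first and second seasons of "Balala the Fairies 9". There are currently no plans to produce a third season, and the company will later release new works such as "Balala the Fairies 10".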

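Wajima, Ishikawa Prefecture issues an evacuation order; a large-scale landslide may occur

Sun, 07 Jan 2024 09:02:00 GMT

On January 7 local time, Wajima Mayor Shigeru Sakaguchi said at a meeting of the Ishikawa Prefecture disaster response headquarters held that day that there are signs a large-scale landslide may occur in Wajima, and an evacuation order has therefore been issued to local residents. (Xinhua)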

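Shenzhen's former first jewelry stock sounds the delisting alarm

Sun, 07 Jan 2024 09:00:00 GMT

From December 22, 2023 to January 5, 2024, *ST Aidi (*ST爱迪) closed below RMB 1 for 10 consecutive trading days. Under exchange rules, if a company's closing price stays below RMB 1 for 20 consecutive trading days, it triggers mandatory trading-based delisting. Notably, *ST Aidi's latest closing price is RMB 0.62, so it would need to hit the daily limit-up in each of the next 10 trading days to have any hope of returning to RMB 1 on the 20th trading day and temporarily avoiding trading-based mandatory delisting. The company listed in 2015 with the halo of being Shenzhen's first jewelry stock and has carried out capital operations several times since. (China Fund News)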

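Zuoli Pharmaceutical (佐力药业): obtains drug registration certificate for Bailing Capsules

Sun, 07 Jan 2024 08:39:00 GMT

Zuoli Pharmaceutical announced that it recently received the Drug Registration Certificate for Bailing Capsules (百令胶囊) issued by the National Medical Products Administration. Indications of Bailing Capsules: tonifies the lung and kidney and replenishes essence and qi. It is used for cough, wheezing, hemoptysis, lower back pain, facial puffiness, and frequent clear urination at night caused by deficiency of both lung and kidney, and as adjuvant treatment for chronic bronchitis and chronic renal insufficiency.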

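Exclusive | Shake-up in the automotive aftermarket: Jiqun Chebao files for bankruptcy and some franchise stores have already changed their signs

Sun, 07 Jan 2024 08:39:00 GMT

Jiqun Chebao (集群车宝), founded by Gao Jiqun, former general manager of Gome's South China region, filed for bankruptcy on January 3. With the spread of new energy vehicles, the entry of giants such as JD.com, and the economic environment, the automotive aftermarket is changing, making survival harder for small and medium-sized service providers. On January 7, a Yicai reporter saw that the Jiqun Chebao franchise store on Yi'an Road in Guangzhou's Haizhu District had already changed the name on its storefront. A person in charge of the store said it will continue to honor the Jiqun Chebao services already sold to customers. Besides external factors, the bankruptcy is also related to the company's own troubled digital transformation of its aftermarket business. (Yicai reporter Wang Zhen)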

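Meinian Health (美年健康): plans to sign a genetic testing project cooperation agreement with Meiyin Health

Sun, 07 Jan 2024 08:31:00 GMT

Meinian Health announced that it plans to sign a "Genetic Testing Project Cooperation Agreement" with Meiyin Health Technology (Beijing) Co., Ltd. ("Meiyin Health", 美因健康) to sell Meiyin Health's genetic testing products to customers of the company and its subsidiaries. Meiyin Health will provide genetic testing services to those customers according to the product requirements set by the company and its subsidiaries. Meiyin Health is the domestic operating entity controlled through contractual arrangements by Mega Genomics Limited (HK.06667, "美因基因"), a company listed on the Hong Kong Main Board. Transactions between the two parties under the agreement are expected not to exceed RMB 150 million in 2024, and the agreement is valid for three years.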

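Israeli air strike on northern Gaza kills at least 20

Sun, 07 Jan 2024 08:30:00 GMT

The Palestinian news agency reported on the 7th that an Israeli air strike on a residence in the Jabalia refugee camp in the northern Gaza Strip on the night of the 6th killed at least 20 people. (Xinhua)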

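Jinxinnong (金新农): cumulative hog sales revenue of RMB 1.238 billion in 2023, down 31.95% year on year

Sun, 07 Jan 2024 08:23:00 GMT

Jinxinnong announced that it sold 99,300 hogs in December 2023 (74,100 market hogs, 21,500 piglets, and 3,700 breeding pigs), with sales revenue of RMB 126 million. The average selling price of market hogs was RMB 13.38/kg, down 8.33% month on month. From January to December 2023, the company sold a cumulative 1.0469 million hogs, down 16.67% year on year, with cumulative hog sales revenue of RMB 1.238 billion, down 31.95% year on year.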

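South Korea's Joint Chiefs of Staff: North Korea continued coastal artillery firing on the 7th

Sun, 07 Jan 2024 08:21:00 GMT

South Korea's Joint Chiefs of Staff said on January 7 local time that the North Korean military continued coastal artillery firing north of Yeonpyeong Island on the 7th. North Korea has not yet responded. (CCTV News)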

Foton Motor (福田汽车): cumulative vehicle sales in 2023 up 37.14% year on year

Sun, 07 Jan 2024 08:03:00 GMT

Foton Motor announced in the evening that it sold a total of 64,809 vehicles in December 2023, including 4,063 new energy vehicles. Cumulative vehicle sales from January to December were 631,000, up 37.14% year on year.

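Lily & Beauty (丽人丽妆): shareholder Liren plans to reduce its holding by up to 0.99% of the company's shares

Sun, 07 Jan 2024 08:02:00 GMT

Lily & Beauty announced in the evening that shareholder Liren (丽仁) plans to reduce its holding by up to 3.9798 million shares in total, or 0.99% of the company's total share capital, through block trades and/or centralized bidding.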

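Hengrui Pharmaceuticals (恒瑞医药): company and subsidiary obtain drug registration certificates

Sun, 07 Jan 2024 08:02:00 GMT

Hengrui Pharmaceuticals announced that it recently received the Drug Registration Certificate approved and issued by the National Medical Products Administration for Irinotecan Hydrochloride Liposome Injection (II). Its subsidiary Shandong Shengdi Pharmaceutical Co., Ltd. (山东盛迪医药有限公司) received Drug Registration Certificates approved and issued by the National Medical Products Administration for Henagliflozin and Metformin Extended-Release Tablets (I) and (II).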

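CITIC Securities: off-market allocation capital is expected to enter gradually, and the market will reach an important turning point

Sun, 07 Jan 2024 08:01:00 GMT

CITIC Securities said in a research report that the year-end crossover effect on capital flows was pronounced in the first week of the year, with institutions' rebalancing and additions both crowding into dividend low-volatility stocks, and the market's trading ecology and style divergence reaching extremes. Mid-January is a key point in time: as economic data and geopolitical disturbances play out, policy will keep stepping up, off-market allocation capital is expected to enter gradually, and the market will reach an important turning point. For now, it still recommends prioritizing oversold growth stocks, represented by the STAR Market.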

Adding color-blind themes to Kubecolor to make Kubernetes more inclusive

Sun, 05 May 2024 16:00:00 GMT

Ambassador post originally published on Sebastian “Prune” Thomas’s blog

Kubecolor is a thin wrapper over the kubectl command that adds coloring to the output.

I cloned the project and started maintaining it back in 2022 when the original author wasn’t active anymore.

KubeColor can reformat the output of most kubectl commands to add colors and clarity. It makes the output so much easier to read that I still don’t understand why it’s not more widely used. I actually gave a lightning talk about it at KubeCon’s Cloud Native Rejekts Europe 2024 in Paris, if you want a video pitch.

One of the longest-standing feature requests, discussed at length in the previous project, was the ability to customize the color theme used by KubeColor.

Actually, when I first cloned the original project, I applied a patch that globally changed the colors to make things less colorful and more standard, limited to a smaller set of colors. Some people started complaining right away, but that’s what people do anyway.

As of version 0.3.0, kubecolor supports custom color schemes and themes, thanks to the work of the other main contributor, AppleJag, who is a talented (Go) developer. Jag, I can’t thank you enough for all the help on this project.

What’s the problem?

By default kubecolor uses the set of colors from your terminal’s config, so it was always possible to configure it. Just change the theme of your terminal and you can adapt the colors to your needs!

But beyond the default colors, some people want to colorize specific fields differently, or use more colors to further differentiate things.

But there’s more…

According to this article, one man in 12 has some form of color blindness (or color vision deficiency). Women are less affected, at one in 200, but still, that’s a lot! (The numbers vary depending on the website, too…)

Check the Wikipedia page to learn more; there are tons of other sites on the subject. And still, it’s usually not something we think of right away.

For example, just look at my blog, with its low-contrast grey colors, and you’ll understand that color blindness was not my main concern at the time.

And, well, when we think of inclusion, it’s generally about gender and skin color, and when we think about accessibility, it’s about mobility impairment, deafness, and blindness.

Color blindness is usually not mentioned or taken care of. The CNCF website itself does not mention it directly. The only TAG (Technical Advisory Group) in the Accessibility section is focused on hearing issues.

Maybe it’s because color blindness is easier to live with, or because people are ashamed to talk about it. Either way, it’s a real problem and the numbers are huge, far higher than I had always believed.

Note that I’m not trying to rank any of these against the others, or to speak on behalf of the people affected. I’m not impaired myself; I just want to shed some light on inclusivity.

So, what to do with this?

As soon as kubecolor got the color theme functionality, I started thinking of adding one or more color themes for the various kinds of color blindness.

The first question that came to my mind was:

It’s quite usual to use green for good things (success) and red for bad things (errors). But is there a common pattern for color-blind people?

Well, so far I don’t have the answer. But while searching I learned a few things:

- Color is important, but so is contrast.
- There are also modifiers like bold and italic that can help differentiate things.
- It is usually better to add some text explaining the status rather than rely on color alone. That’s no problem for us, as we are adding colors to text that is already expressive.
- Maybe I should not add a theme at all, and let each person build their own.

Understanding KubeColor Themes

Thanks to Jag, kubecolor can process many kinds of definitions to configure the colors.

In short:

- Using regular color names (like red or blue) will use whatever is defined in your terminal application’s theme. white may or may not be white, but if you already have a terminal theme made for color blindness, you may not have to change anything.
- Using other ways to define colors, like HEX and RGB values, lets you use custom colors that are not part of your terminal’s theme.
- Using bg= or fg= changes the background or the foreground (text) color.
- Modifiers like bold, italic, and so on can be used to further tune the visibility of each field.
- Thanks to the KUBECOLOR_THEME_* ENV variables, it is possible to fully customize the output of “each” field, depending on the original command run against kubectl (like get or describe).
- The theme can also be created as a file, which makes it shareable: create a ~/.kube/color.yaml file (on macOS and Linux; the location may differ on Windows). We’ll dive into the format later, keep reading.
- kubecolor embeds default themes, in both dark and light modes:
  - dark
  - light
  - pre-0.0.21-dark: the previous color scheme from the original project
  - pre-0.0.21-light

You can check the content of each built-in theme in the config/theme.go file in the code.

How to build a theme

As said earlier, you can either use the KUBECOLOR_THEME_* env variables or create your theme in the ~/.kube/color.yaml file.

Using ENV Variables

The easiest way is to check the docs at https://github.com/kubecolor/kubecolor/blob/main/README.md#color-theme and experiment.

In any case, you have to pick a base theme by setting KUBECOLOR_PRESET, then override some of the colors. For example, you can change all the running pods to blue with:

KUBECOLOR_THEME_BASE_SUCCESS=blue KUBECOLOR_PRESET=dark kubecolor get pods -o wide

Using the config file

Create the file ~/.kube/color.yaml and add some content like:

preset: dark
theme:
  table:
    header: fg=red:bold:bg=blue

So basically, you take the ENV variable name and turn each part of it into a nested key.

With KUBECOLOR_THEME_STATUS_ERROR, you remove the KUBECOLOR prefix, so the final path is theme.status.error. To show pods in error in pink:

preset: dark
theme:
  table:
    header: fg=red:bold:bg=blue
  status:
    error: pink
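The naming rule described above can be sketched as a small shell helper. The function name env_to_yaml_path is hypothetical and not part of kubecolor, and the sketch assumes that every underscore after the KUBECOLOR_ prefix marks one nesting level, which matches the two examples in this post but may not hold for every key.

```shell
# Hypothetical helper: derive the color.yaml key path from a KUBECOLOR_* variable name.
# Assumption: each underscore after the KUBECOLOR_ prefix is one nesting level.
env_to_yaml_path() {
  printf '%s\n' "${1#KUBECOLOR_}" | tr 'A-Z_' 'a-z.'
}

env_to_yaml_path KUBECOLOR_THEME_STATUS_ERROR   # prints theme.status.error
env_to_yaml_path KUBECOLOR_THEME_TABLE_HEADER   # prints theme.table.header
```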

First, I want to clearly state that I do not have any color impairment, and the work I’m trying to achieve here is based on articles I read and on conversations with color-blind people. There’s no scientific work on my side.

The idea is to provide an out-of-the-box solution to help people with color blindness. The outcome may not be perfect or even useful, and I take no responsibility. It’s best effort. It’s open source. Bear with me.

After some research, I found the Chromatic Vision Simulator website, which lets you load an image and, using a quad view, see what color-blind people may see depending on the type of deficiency.

In short, if I upload one of the earlier screenshots of the k get pods -o wide command, we can check how it looks with the dark theme:

[Image: quad view with the regular, protanopia, deuteranopia, and tritanopia renderings]

Now I guess we all understand the issue with the current color scheme of the dark theme: anyone with impaired color vision will lose most of the color information. At that point, you might as well use plain kubectl commands…

So I tested some chromatic progressions to try to identify a palette that could work well at least most of the time.

Being color-blind is not just about not seeing green or red. There’s also quite a limitation in the range of hues that are perceived, so everything from green to red, where the color changes gradually for a regular eye, will look almost the same brownish yellow to someone with protanopia.

My conclusion is that it seems possible to build a theme that helps differentiate the content. What we need is distinct hues to show the difference between, mostly, good and bad situations, and color cycles for tables.

Using the Observable HQ website, I used the discrete 10 scheme to cut the rainbow into 10 usable colors:

#23171b
#4860e6
#2aabee
#2ee5ae
#6afd6a
#c0ee3d
#feb927
#fe6e1a
#c2270a
#900c00

Once rendered, we have:
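The rendered image is not reproduced here. As a stand-in, the following sketch prints each of the ten colors as a swatch, assuming a terminal that supports 24-bit (truecolor) ANSI escape sequences. The helper names hex_to_rgb and show_palette are made up for this example and are not part of kubecolor.

```shell
# Convert "#rrggbb" into "R G B" decimal components.
hex_to_rgb() {
  hex=${1#\#}
  r=$(printf '%s' "$hex" | cut -c1-2)
  g=$(printf '%s' "$hex" | cut -c3-4)
  b=$(printf '%s' "$hex" | cut -c5-6)
  printf '%d %d %d\n' "0x$r" "0x$g" "0x$b"
}

# Print one background-colored swatch per color, followed by its hex code.
show_palette() {
  for color in "$@"; do
    rgb=$(hex_to_rgb "$color")
    # Split "R G B" into its three fields
    red=${rgb%% *}; rest=${rgb#* }; green=${rest%% *}; blue=${rest#* }
    printf '\033[48;2;%d;%d;%dm      \033[0m %s\n' "$red" "$green" "$blue" "$color"
  done
}

show_palette '#23171b' '#4860e6' '#2aabee' '#2ee5ae' '#6afd6a' \
             '#c0ee3d' '#feb927' '#fe6e1a' '#c2270a' '#900c00'
```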

The dark theme only uses 6 colors (well, 5, as one is the default white for the dark theme, or black for the light theme). So here’s my selection:

Terminal Color	Matching Color	protanopia	deuteranopia	tritanopia
`yellow`	`#feb927`	`#f9bb27`	`#fbbc23`	`#ffacb6`
`magenta`	`#4860e6`	`#a77fe5`	`#888ee4`	`#257e7d`
`green`	`#6afd6a`	`#fee16c`	`#fee16c`	`#fee16c`
`red`	`#c2270a`	`#bb8b16`	`#936a15`	`#ff6579`
`cyan`	`#2aabee`	`#34adee`	`#22afef`	`#34b4b5`
Null color (white-ish)	`#2ee5ae`	`#e8d0b0`	`#c6beb3`	`#4ddfe0`

I also used bold for success, and I inverted error so that the background is red and the text is white. High contrast is usually a good helper when we’re limited in the colors we can use.
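One way to put a number on "high contrast" is the contrast ratio defined in the W3C's WCAG 2.x guidelines. This metric is not something the themes were built with; it is an assumption brought in here for illustration, and the function name contrast_ratio is hypothetical. The sketch takes two #rrggbb colors and prints their ratio, from 1.00 (identical colors) to 21.00 (black on white).

```shell
# Hypothetical helper: WCAG 2.x contrast ratio between two "#rrggbb" colors.
contrast_ratio() {
  awk -v a="$1" -v b="$2" '
    # Convert a hexadecimal string to a number
    function hexval(s,   i, n) {
      n = 0
      s = tolower(s)
      for (i = 1; i <= length(s); i++)
        n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
      return n
    }
    # Linearize one sRGB channel given as 0..255
    function lin(c) {
      c = c / 255
      return (c <= 0.03928) ? c / 12.92 : ((c + 0.055) / 1.055) ^ 2.4
    }
    # Relative luminance of a "#rrggbb" color
    function lum(hex) {
      sub(/^#/, "", hex)
      return 0.2126 * lin(hexval(substr(hex, 1, 2))) \
           + 0.7152 * lin(hexval(substr(hex, 3, 2))) \
           + 0.0722 * lin(hexval(substr(hex, 5, 2)))
    }
    BEGIN {
      la = lum(a); lb = lum(b)
      # The lighter color goes in the numerator
      if (la < lb) { tmp = la; la = lb; lb = tmp }
      printf "%.2f\n", (la + 0.05) / (lb + 0.05)
    }'
}

contrast_ratio '#ffffff' '#000000'   # prints 21.00
contrast_ratio '#ffffff' '#c2270a'   # white text on the red used for errors
```

WCAG suggests a ratio of at least 4.5:1 for normal-size text, which gives a rough yardstick when picking foreground and background pairs.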

The result seems to work pretty well in all situations.

Using the themes

Finally, alongside the 4 themes listed earlier, you can now use any of the new themes if you’re affected by color blindness. They are:

- protanopia-dark
- protanopia-light
- deuteranopia-dark
- deuteranopia-light
- tritanopia-dark
- tritanopia-light

Just set your env variable inline:

KUBECOLOR_PRESET=protanopia-dark kubecolor get pods

Or export it first:

export KUBECOLOR_PRESET=protanopia-dark
kubecolor get pods

Or set it in your config file ~/.kube/color.yaml like:

preset: protanopia-dark

Updating the theme

+ + + +

As the Themes are pretty much a first iteration and a work in proress, please, feel free to comment and open an issue if you feel the current themes can be enhenced.

+ + + +

Also, you can start creating your own theme, by modifying an existing one, then share it either in a issue or a Pull Request.

+ + + +

Simply start from original theme file and add more customization to the ~/.kube/color.yaml file:

+ + + +

preset: protanopia-dark
theme:
  base:
    key:
       - fg=#feb927
       - fg=white
    info:
    primary: fg=#4860e6
    secondary: fg=#2aabee
    success: fg=#6afd6a:bold
    warning: fg=#feb927
    danger: fg=white:bg=#c2270a
    muted: fg=#feb927
  options:
    flag: fg=#feb927
  table:
    header: fg=white:bold:bg=#2aabee
  status:
    error: fg=white:bg=#c2270a

+ + + +

YAML
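To help read the style strings used in the theme file, here is a small illustrative parser. It is only a sketch of how the `fg=...:bg=...:modifier` syntax seen above can be decomposed; it is not Kubecolor's own parser, and the `parse_style` name and the returned dictionary shape are assumptions made for this example.

```python
from typing import Any, Dict


def parse_style(style: str) -> Dict[str, Any]:
    """Split a style string such as 'fg=white:bold:bg=#2aabee' into its parts."""
    parsed: Dict[str, Any] = {"fg": None, "bg": None, "modifiers": []}
    for part in style.split(":"):
        if part.startswith("fg="):
            parsed["fg"] = part[len("fg="):]      # foreground color
        elif part.startswith("bg="):
            parsed["bg"] = part[len("bg="):]      # background color
        elif part:
            parsed["modifiers"].append(part)      # e.g. "bold"
    return parsed


if __name__ == "__main__":
    print(parse_style("fg=white:bold:bg=#2aabee"))
    print(parse_style("fg=#6afd6a:bold"))
```

Read this way, `fg=white:bg=#c2270a` is the inverted error style described earlier: white text on a red background.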

+ + + +

Note that, at the moment, the protanopia, deuteranopia and tritanopia themes are all the same. When you leave feedback, please mention your condition so that we can tune each theme to better suit each situation.

+ + + +

I would encourage you to set your default theme according to your type of color blindness so that you benefit from future changes.

+ + + +

Wrapping it up

+ + + +

Next time you see a co-worker’s screen showing strange colors, don’t smile or make fun: this person probably has some form of color blindness. Instead, just tell them that Kubecolor is now their friend.

+ + + +

Even worse, the next time you see someone using kubectl in monochrome, insist that they check out Kubecolor!

+ + + +

We put a lot of effort into this feature. We truly hope that it will help some people out there and make Kubernetes more inclusive. If not, it was a good adventure.

+ + + +

The feature ships in Kubecolor v0.3.0, available now!

+ + + + + + +

Top 5 cloud computing trends of 2024

Thu, 02 May 2024 16:00:00 GMT

Member post by Sameer Danave, Senior Director of Marketing, MSys Technologies

+ + + +

Every time I think I have this whole technology game down to a science, something changes in the blink of an eye. And if you’re as enthusiastic about the cloud as I am, you’ve likely experienced the same feeling of whiplash as cloud trends continue to shift.

+ + + +

Keeping up with the latest technology trends isn’t always easy. However, to stay ahead of the competition, it’s pivotal to stay ahead of them.

+ + + +

Fortunately, I’ve gathered all the information you need on the latest cloud computing trends straight from industry experts and MSys’ survey of 400+ technology professionals and crafted a Tech Lens 2024 guide just for you. Let’s delve into the top five trends from this guide for 2024.

+ + + +

Top 5 Key Cloud Computing Trends to Watch

+ + + +

Here are the top five trends that are expected to gain significant traction in the coming years.

+ + + +

AI As A Service (AIaaS)

+ + + +

In the forthcoming years, significant growth is foreseen in the integration of AI services into cloud solutions. Cloud infrastructure plays a crucial role in opening up AI’s economic and social benefits to enterprises. Training AI models, such as the robust large language model (LLM) powering ChatGPT, demands extensive data and substantial computing resources.

+ + + +

Enterprises are shifting away from constructing their own AI infrastructure and opting for AI-as-a-service provided by cloud platforms. This transition allows them to harness the transformative power of AI without the constraints of managing resources. AI as a Service offers pre-built AI models, tools, and APIs hosted on cloud platforms, enabling enterprises to seamlessly implement AI functionalities, even without specialized AI expertise and infrastructure.

+ + + +

Hybrid & Multi-Cloud Strategies

+ + + +

Multi-cloud and hybrid solutions have become incredibly popular among enterprises across the globe. The hybrid multi-cloud approach incorporates public cloud services from multiple providers, enabling portability across diverse cloud infrastructures. This enhances flexibility and reduces dependency on a single vendor, thus mitigating the risk of vendor lock-in.

+ + + +

Besides, hybrid cloud solutions offer a flexible approach to managing data storage complexities. By integrating public and private cloud environments, organizations can leverage existing infrastructure while gaining scalability, security, and redundancy. This approach optimizes storage resource allocation, strengthens disaster recovery capabilities, and fosters agility in response to evolving business requirements.

+ + + +

Moreover, hybrid cloud solutions provide enhanced control over IT infrastructure and bolstered security compared to alternative cloud options. Cloud vendors employ expert security professionals to ensure data protection, adhering to stringent protocols and compliance measures.

+ + + +

Edge AI Computing

+ + + +

The edge computing landscape is expected to witness significant traction in the forthcoming years. In the traditional cloud model, data transfers to a remote server for processing. In contrast, edge computing establishes a compact computing environment near the data source.

+ + + +

This reduces latency and enables real-time analysis and decision-making. The deployment of advanced networks like 5G, along with energy-efficient processors and algorithms, is expected to further bolster edge computing’s viability for evolving application needs by 2024.

+ + + +

Sustainable Cloud Computing

+ + + +

Sustainable computing is expected to experience significant growth in the years ahead. This trend is fueled by the understanding that approximately 1.8% to 3.9% of global greenhouse gas emissions stem from the information and communication technology (ICT) sector.

+ + + +

Green computing encompasses environmentally conscious practices across the lifecycle of computers, chips, and other technology components, spanning from design and manufacturing to usage and disposal. Its objective is to mitigate environmental impact by decreasing carbon emissions and energy consumption across all stages, including production, data centers, and end-user operations.

+ + + +

Additionally, green computing involves the selection of sustainably sourced materials, minimizing electronic waste, and promoting sustainability through the use of renewable resources.

+ + + +

Serverless Computing

+ + + +

Expected to see significant expansion with a Compound Annual Growth Rate (CAGR) of 23.17% between 2023 and 2028, serverless computing brings forth novel methods for creating and operating software applications and services. This emerging paradigm eradicates the necessity of infrastructure management, empowering users to write and deploy code devoid of the complexities of underlying systems.
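To put that growth rate in perspective, the sketch below compounds a CAGR over a number of years. It assumes the 23.17% rate compounds annually over the five years from 2023 to 2028; the function name is made up for this example, and the result is a relative multiple, not a market-size figure from the guide.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` once per year for `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 + cagr) ** years


if __name__ == "__main__":
    # 23.17% per year, compounded over 2023 -> 2028 (five yearly steps).
    multiple = growth_multiple(0.2317, 5)
    print(f"Market size multiple after 5 years: {multiple:.2f}x")
```

Under these assumptions, a market growing at this rate would be a little under three times its 2023 size by 2028.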

+ + + +

This transition offers numerous benefits for developers, including quicker time-to-market, improved scalability, and decreased deployment costs for new services. As a result, developers can focus on innovation rather than the intricacies of managing infrastructure.

+ + + +

Conclusion

+ + + +

In the whirlwind of technological advancement, keeping pace with cloud computing trends is both exhilarating and essential. Drawing from insights of industry experts and a survey of over 400 technology professionals by MSys, we’ve distilled the top five trends for 2024. From AI integration to sustainable practices and serverless architectures, these trends promise to reshape our approach to technology. By embracing them, we can propel our organizations forward and stay ahead of the competition. This guide offers actionable insights to navigate these trends effectively. Let’s embark on this journey together, pushing the boundaries of cloud computing’s possibilities.

+ + + +

About Sameer

+ + + +

Sameer is a seasoned technology marketing professional with 16 years of full-stack marketing experience. He believes in two Cs: ‘Customer Value’ and ‘Communications’. All his marketing campaigns and projects are built around them.

+ + + +

He drives phygital (physical + digital) campaigns that attract and pull customers towards the brand’s value. His marketing strategies apply omnichannel, conversational marketing tactics (Storytelling, Social, and Chatbot), AI-Enabled Inbound Marketing, backed by solid analytics and insights with ‘content’ as a core part of the strategy.

+ + + +

Sameer is a team player with meticulous planning, attention to detail, and the ability to perform effortlessly under pressure.

+ + + + + + +

Is your supply chain secure? Double check with our framework

Thu, 02 May 2024 16:00:00 GMT

A secure supply chain is a critical piece of cloud native security, and it can be tricky to get right because it covers such a broad expanse of factors from code to pipelines and beyond.

+ + + +

Join us on June 26 & 27 for CloudNativeSecurityCon North America 2024 in Seattle

+ + + +

The breadth of the supply chain also makes it vulnerable, and according to a survey from Security Magazine, 91% of organizations experienced attacks in 2023. The top three types of attacks were exploited vulnerabilities or misconfigurations, stolen secrets, and data breaches. The reverberations of a supply chain attack go far beyond the organization and include reputational damage, loss of revenue, and even legal liability. In fact, IBM’s 2023 “Cost of a Data Breach” survey found attackers cost organizations worldwide an average of $4.45 million, which is a 15% increase over the last three years.

+ + + +

Not surprisingly, 51% of survey respondents told IBM their organizations were planning to increase spending on security.

+ + + +

So, no matter where your organization is on the journey to a more secure supply chain, taking extra steps is never a bad idea. Our Security Technical Advisory Group has created a series of questions teams can ask to dig deeper. The framework is divided into four areas: source code, materials, build pipelines, and artefacts and deployments.

+ + + +

Start by verifying the source code, asking questions including:

+ + + +

Do you require signed commits?
Do you use git hooks or automated scans to prevent committing secrets to source control?
Have you defined an unacceptable risk level for vulnerabilities? For example: no code may be committed that includes Critical or High vulnerabilities
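As an illustration of the second question, here is a minimal sketch of the kind of check a git hook or automated scan might run to catch secrets before they are committed. The two patterns (an AWS-style access key ID and a PEM private key header) are only examples and nowhere near exhaustive; the rule names and the `find_secrets` function are hypothetical, and a real pipeline would normally rely on a dedicated scanner.

```python
import re
from typing import List, Tuple

# Illustrative rules only; real scanners ship far larger rule sets.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_no, rule_name))
    return findings


if __name__ == "__main__":
    staged = "user = 'alice'\naws_key = 'AKIAIOSFODNN7EXAMPLE'\n"
    for line_no, rule in find_secrets(staged):
        print(f"line {line_no}: possible secret ({rule})")
```

A pre-commit hook could feed the output of `git diff --cached` into a function like this and exit with a non-zero status when anything is found, which blocks the commit.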

+ + + +

Next, verify materials:

+ + + +

Do you verify that dependencies meet your minimum thresholds for quality and reliability?
Do you automatically scan dependencies for security issues and license compliance?
Do you automatically perform Software Composition Analysis on dependencies when they are downloaded/installed?

+ + + +

Make certain the build pipelines are protected:

+ + + +

Do you use hardened, minimal containers as the foundation for your build workers?
Do you maintain your build and test pipelines as Infrastructure-as-Code?
Do you automate every step in your build pipeline outside of code reviews and final sign-offs?

+ + + +

And finally, protect artefacts and deployments:

+ + + +

Is every artefact you produce (including metadata and intermediate artefacts) signed?
Do you distribute metadata in a way that can be verified by downstream consumers of your products?

+ + + +

Dive into the entire framework, but don’t stop there!

+ + + +

Join us in Seattle for CloudNativeSecurityCon North America 2024 on June 26 and 27 to learn from and network with experts in every facet of cloud native security.

+ + + +


+ + + + + + +

Early explorations and practices of Xline, a stateful application managed by Karmada

Wed, 01 May 2024 16:00:00 GMT

Member post by DatenLord

+ + + +

Background and Motivation

+ + + +

More and more IT vendors are now embracing cross-cloud multi-clustering as cloud-native technologies and cloud markets continue to mature. Here’s Flexera’s mid-2023 survey on the cloud-native market’s acceptance of multi-cloud, multi-cluster management. (info.flexera.com)

+ + + +

As Flexera’s report shows, 87 percent of organizations in the overall cloud-native market already use services from multiple cloud vendors at the same time, while only 13 percent use a single public cloud or a single private cloud. Among those using multi-cloud deployments, 15% choose multi-public or multi-private cloud deployments, and 72% adopt hybrid cloud deployments. These statistics reflect the maturity of cloud-native technologies and the cloud marketplace, and the future will be the era of programmatic multi-cloud managed services.

+ + + +

In addition to external trends, the limitations of single-cluster deployments have become an intrinsic motivation for users to embrace multi-cloud, multi-cluster management. Limitations of single cluster deployments include, but are not limited to:

+ + + +

A single point of failure, where cluster-level failures are difficult to tolerate, and a small cluster federation outperforms a large K8s cluster.
Boundary constraints of a single cluster, e.g., a Node runs at most 110 Pods by default, and a cluster can hold up to 5000 Nodes.
Business-level development needs, e.g., Xline itself is a cross-cluster cluster.
….

+ + + +

Karmada, as an open source multi-cluster management tool, has been used by Shopee, DaoCloud and other companies in the production environment. However, since Karmada currently lacks support for stateful application management, it is still mainly used for stateless application management in practice.

+ + + +

To better cope with the future trend of multi-cloud and multi-cluster management, and to better manage stateful applications in multi-cloud and multi-cluster scenarios, Xline and the Karmada community set up a working group to jointly promote Karmada’s support for stateful application management.

+ + + +

What are the challenges of managing stateful applications with Karmada?

+ + + +

Before understanding how Karmada manages stateful applications across multiple clusters, we need to look back at the K8s implementation of managing stateful applications in a single cluster.

+ + + +

Back in 2012, Randy Bias gave an influential talk on “Open and Scalable Cloud Architecture”. In that talk, he proposed the “Pets” versus “Cattle” analogy.

+ + + +

These two concepts correspond to stateful and stateless applications, respectively. Cattle don’t need names, and they are not unique. This means we can easily replace one with another when one of them has a problem. Pets are different. Each pet is unique, with its own name, and must be looked after carefully when it has a problem.

+ + + +

StatefulSet was introduced in Kubernetes 1.5 and stabilized in version 1.9. It provides a fixed Pod identity for managing Pods, persistent storage for each Pod, and a strict start/stop order among Pods.

+ + + +

The problems are: what exactly constitutes a state, and how Kubernetes addresses it.

+ + + +

In the Karmada multi-cluster scenario, stateful applications pose the following problems:

+ + + +

How to ensure that application instances across clusters have a globally uniform start/stop order. This affects the scale in/out and rolling updates of some application instances. For a distributed KV store based on a consensus protocol, scale in/out has to go through a membership change, which changes what counts as a majority in the cluster. If multiple member clusters scale out at the same time without a globally agreed ordering, the correctness of the consensus protocol is at risk.
How to ensure that all application instances across clusters have globally unique identifiers. A natural solution is to incorporate member cluster ids into instance identifiers.
How to solve cross-cluster application communication and provide a globally uniform network identity. In our current attempts and practices, we use Submariner to bridge network communication between member clusters, so the current implementation relies on a specific network plugin.
How to provide common functions such as updating and scaling stateful applications across clusters, along with more fine-grained update policies, such as Partition Update in member clusters.
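The first problem can be illustrated with a little quorum arithmetic. The sketch below is a deliberately simplified model, not how Xline or any specific consensus protocol implements membership change; `quorum` and `can_split_brain` are hypothetical helper names. It only shows why two member clusters adding nodes concurrently, each working from a stale view of the cluster size, can be dangerous.

```python
def quorum(members: int) -> int:
    """Smallest number of nodes that forms a majority of `members`."""
    return members // 2 + 1


def can_split_brain(actual_members: int, assumed_members: int) -> bool:
    """True if two disjoint groups could both reach the quorum that nodes
    compute from a stale (assumed) cluster size."""
    return 2 * quorum(assumed_members) <= actual_members


if __name__ == "__main__":
    # A 4-node cluster; two member clusters each add one node at the same time.
    # Each side believes the new size is 5, but the real size is 6.
    print("assumed quorum:", quorum(5))
    print("actual quorum:", quorum(6))
    print("split brain possible:", can_split_brain(actual_members=6, assumed_members=5))
```

With a single, globally ordered sequence of membership changes, every node computes the majority from the same cluster size, and two disjoint majorities cannot exist.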

+ + + +

In order to better solve the above-mentioned problems, we need to introduce a new workload on Karmada to implement a cross-cluster version of “StatefulSet”.

+ + + +

Some early attempts at Xline

+ + + +

Since the Karmada community has not yet discussed the implementation details of the new API, we have made some simple attempts to deploy, scale up and down, and update Xline under Karmada. The overall architecture of the program is as follows:

+ + + +

The overall architecture uses a two-tier Operator approach. In the Karmada control plane, we deploy a Karmada Xline Operator, which is responsible for interpreting and splitting the Xline resources defined in Karmada and sending them to member clusters. The Xline Operator on each member cluster watches for the creation of the corresponding resource and then enters the Reconcile process to complete the operation.

+ + + +

Deployment

+ + + +

Let’s take a look at a common deployment method for distributed application clusters under a single cluster (using etcd operator to deploy an etcd cluster as an example). etcd-operator can deploy an etcd cluster in two phases:

+ + + +

Bootstrap: create a seed node of etcd with an initial-cluster-state of new and a unique initial-cluster-token.
Scale out: perform a member add on the seed cluster to update the cluster network topology, and then start a new etcd node with an updated network topology and an initial-cluster-state of existing.
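The two phases can be sketched as the etcd command-line flags each node would be started with. This is a minimal illustration, not etcd-operator's code: the helper names, node names, and URLs are made up, and only the flags relevant to the two phases are shown (a real node needs listen and client URLs as well).

```python
from typing import Dict, List


def bootstrap_args(name: str, peer_url: str, token: str) -> List[str]:
    """Flags for the seed node: a brand-new, single-member cluster."""
    return [
        f"--name={name}",
        f"--initial-advertise-peer-urls={peer_url}",
        f"--initial-cluster={name}={peer_url}",
        "--initial-cluster-state=new",
        f"--initial-cluster-token={token}",
    ]


def scale_out_args(name: str, peer_url: str, existing: Dict[str, str]) -> List[str]:
    """Flags for a node joining after `member add` has updated the topology.

    `existing` maps the names of current members to their peer URLs.
    """
    topology = {**existing, name: peer_url}
    initial_cluster = ",".join(f"{n}={u}" for n, u in topology.items())
    return [
        f"--name={name}",
        f"--initial-advertise-peer-urls={peer_url}",
        f"--initial-cluster={initial_cluster}",
        "--initial-cluster-state=existing",
    ]


if __name__ == "__main__":
    seed = {"etcd-0": "http://etcd-0:2380"}
    print(bootstrap_args("etcd-0", seed["etcd-0"], token="demo-token"))
    print(scale_out_args("etcd-1", "http://etcd-1:2380", existing=seed))
```

The key point is the ordering: the member add against the running cluster has to complete before the new node starts with a state of existing, which is exactly the step that becomes hard to coordinate across member clusters.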

+ + + +

However, in the cross-cluster scenario, due to the lack of a globally standardized startup order for pods in different member clusters, Xline Operators in different member clusters will concurrently perform cluster expansion operations, which will adversely affect the membership change process of the consensus protocol. In order to bypass this problem, Xline adopts a static deployment method, as shown in the following figure:

+ + + +

First of all, users need to define the corresponding resources on Karmada to describe the topology of a cross-cluster Xline cluster. After observing that the resources have been applied, the Karmada Xline Operator interprets and splits them into XlineCluster CRs, one per member cluster, and then issues each CR to its member cluster. The XlineCluster CR contains the number of replicas that should be created on the current member cluster, as well as the member cluster ids of the other clusters and their corresponding replica counts. After observing the creation of the CR, the Xline Operator on the member cluster enters the Reconcile process, generates the DNS names of the other nodes in the Xline cluster from the issued cluster topology, and starts the Xline Pods.
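As a rough sketch of that last step, the function below derives a peer list from a topology that maps member cluster ids to replica counts. The DNS pattern (pod name, then cluster id, then service name and namespace under `svc.clusterset.local`) is an assumption chosen for illustration, as are the function and parameter names; the actual names generated by the Xline Operator may differ.

```python
from typing import Dict, List


def peer_dns_names(topology: Dict[str, int], name: str, namespace: str) -> List[str]:
    """Build one DNS name per Xline node from {member_cluster_id: replicas}.

    Embedding the member cluster id keeps instance identifiers globally unique
    even though Pod ordinals restart at 0 in every member cluster.
    """
    names = []
    for cluster_id, replicas in sorted(topology.items()):
        for index in range(replicas):
            names.append(
                f"{name}-{index}.{cluster_id}.{name}.{namespace}.svc.clusterset.local"
            )
    return names


if __name__ == "__main__":
    for dns_name in peer_dns_names({"member1": 2, "member2": 1}, "xline", "default"):
        print(dns_name)
```

Because every member cluster receives the full topology, each operator can compute the same peer list independently, which is what lets the static deployment avoid a membership change.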

+ + + +

In the early days of exploration, the static deployment approach bypassed the lack of a globally uniform startup order for application instances across Karmada’s member clusters, because it does not involve a membership change during deployment. However, there is no silver bullet in the software industry, and static deployment comes with trade-offs. The following table compares the characteristics of dynamic and static deployments in single-cluster and multi-cluster scenarios:

+ + + +

Scaling Up and Down

+ + + +

There are two specific types of scale in/out for stateful applications under Karmada:

+ + + +

Horizontal scale in/out — remove/add a member cluster and scale in/out nodes on it.
Vertical scale in/out — scale in/out on existing member clusters.

+ + + +

Horizontal scale out

+ + + +

As shown above, the overall process is as follows:

+ + + +

Create the corresponding member cluster, configure the submariner network, and add it to Karmada for management.
Modify the Xline resources on Karmada, adding a new record member4: 4 in the member cluster field to indicate that you want 4 Xline replicas on member4.
The Karmada Xline Operator splits the resources and distributes them to member4.
The Xline Operator on member4 receives the corresponding resources, enters the Reconcile process, calls the Xline client to execute member add, waits for consensus, starts the new Xline Pod, and repeats this process until the number of Xline replicas on member4 reaches the specified number.

+ + + +

Vertical scale out

+ + + +

For vertical scale out, the general process is also shown above:

+ + + +

Modify the Xline resources on Karmada, e.g., specify that the number of Xline Pods in member1 should be expanded from 3 to 4.
The Karmada Xline Operator splits the resource and distributes it to member1.
When the Xline Operator on member1 receives the notification of the resource modification, it enters the Reconcile process, calls the Xline client to perform member add, starts the new Xline Pod once consensus is reached, and repeats this process until the number of Xline replicas on member1 reaches the specified number.
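The reconcile step shared by both scale-out flows can be sketched as a loop that adds one member at a time. This is a simplified model rather than the Xline Operator's code: `member_add` and `start_pod` are hypothetical callbacks standing in for the Xline client call and Pod creation, and the naming scheme is invented for the example.

```python
from typing import Callable, List


def scale_out(
    current: int,
    desired: int,
    member_add: Callable[[str], None],
    start_pod: Callable[[str], None],
    name_prefix: str = "xline",
) -> List[str]:
    """Grow from `current` to `desired` replicas, one membership change at a time."""
    added = []
    for index in range(current, desired):
        node = f"{name_prefix}-{index}"
        member_add(node)   # returns only after the cluster agrees on the new member
        start_pod(node)    # the Pod starts only after the membership change
        added.append(node)
    return added


if __name__ == "__main__":
    log = []
    scale_out(
        3, 5,
        member_add=lambda node: log.append(("member_add", node)),
        start_pod=lambda node: log.append(("start_pod", node)),
    )
    for step in log:
        print(step)
```

Each iteration completes a full membership change before the next begins, so the cluster never has more than one pending change at a time.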

+ + + +

Currently, because scale in/out inevitably involves a membership change, and member clusters under Karmada are not synchronized with each other, the scaling process has limitations: a horizontal scale out can only add one member cluster at a time, and a vertical scale out can only scale one specified member cluster at a time.

+ + + +

Rolling updates

+ + + +

For a rolling update, the general process is shown above:

+ + + +

The user modifies the Xline resource on Karmada to change the Xline image version.
The Karmada Xline Operator splits the resource and distributes it to the member clusters.
The Xline Operator on the member cluster monitors the resource changes and enters the corresponding Reconcile process to perform a rolling update. The update process on the member cluster is no different from the update on a single cluster.

+ + + +

Currently, the main supported update method is the default rolling update, but from the perspective of practical application scenarios, at least the following two issues need to be considered:

+ + + +

The update process involves stopping old Xline nodes and starting new ones, which requires additional mechanisms to ensure that the cluster remains available during the update.
More fine-grained update policies, such as Partition Update, should be supported; among multiple member clusters, priority should be given to updating clusters where only the follower exists, and when updating the member cluster where the leader resides, the leader should be transferred to the updated member cluster to avoid extreme situations where the leader frequently steps down due to rolling updates.
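The ordering described in the second point can be sketched as follows. This is an illustration of the proposed policy only, not an existing Karmada or Xline feature; `update_order` and its inputs are hypothetical.

```python
from typing import List


def update_order(clusters: List[str], leader_cluster: str) -> List[str]:
    """Update follower-only member clusters first and the leader's cluster last."""
    followers_only = [c for c in clusters if c != leader_cluster]
    leader_last = [leader_cluster] if leader_cluster in clusters else []
    return followers_only + leader_last


if __name__ == "__main__":
    # The leader currently lives in member2, so member2 is updated last.
    print(update_order(["member1", "member2", "member3"], leader_cluster="member2"))
```

Before the final step, leadership would be transferred to an already updated member cluster, so the leader steps down once at most instead of repeatedly during the rollout.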

+ + + +

Conclusion

+ + + +

Given the trend of multi-cloud and multi-cluster management and the nature of Xline’s business, the Karmada and Xline communities have formed a working group to promote stateful application management in Karmada multi-cluster environments. To manage stateful applications across Karmada member clusters more elegantly, we need to introduce a new Karmada workload. Since the Karmada community has not yet reached a consensus on the implementation details of the new workload, in this early stage of experimentation Xline adopts a two-tier Operator approach, implemented by the Karmada Xline Operator together with the Xline Operator on each member cluster. The Karmada Xline Operator interprets and splits the top-level resources and sends them to the member clusters, and then the Xline Operator on each member cluster reconciles the resources.

+ + + +

In this way, we have made some early attempts to deploy Xline on Karmada and explore rolling updates, and made some preliminary preparations for the development and design of the new Karmada StatefulSet workload in the future.

+ + + + + + +

Accelerating Machine Learning with GPUs in Kubernetes using the NVIDIA Device Plugin

Mon, 29 Apr 2024 16:00:00 GMT

Member post originally published on the SuperOrbital blog by Keegan McCallum

+ + + +

NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads.

+ + + +

Introduction

+ + + +

Generative AI is having a moment right now, in no small part due to the immense scale of computing resources being leveraged to train and serve these models. Kubernetes has revolutionized the way we deploy and manage applications at scale, making it a natural choice for building large-scale computing platforms.

+ + + +

GPUs, with their parallel processing capabilities and high memory bandwidth, have become the go-to hardware for accelerating machine learning tasks. NVIDIA’s CUDA platform has emerged as the dominant framework for GPU computing, enabling developers to harness the power of GPUs for a wide range of applications. By combining the capabilities of Kubernetes with the extreme parallel computing power of modern GPUs like the NVIDIA H100, organizations are pushing the boundaries of what is possible with computers, from realistic video generation to analyzing entire novels worth of text and accurately answering questions about the contents.

+ + + +

However, orchestrating GPU-accelerated workloads in Kubernetes environments presents its own set of challenges. This is where the NVIDIA Device Plugin comes into play. It seamlessly integrates with Kubernetes, allowing you to expose GPUs on each node, monitor their health, and enable containers to leverage these powerful accelerators. By combining these two best of breed solutions, organizations are building robust, performant computing platforms to power the next generation of intelligent software.

+ + + +

Understanding the Nvidia Device Plugin for Kubernetes

+ + + +

The NVIDIA Device Plugin is a Kubernetes DaemonSet that simplifies the management of GPU resources across a cluster. Its primary function is to automatically expose the number of GPUs on each node, making them discoverable and allocatable by the Kubernetes scheduler. This allows pods to request and consume GPU resources in much the same way as CPU and memory. Under the hood, the device plugin communicates with the kubelet on each node, providing information about the available GPUs and their capacities. It also monitors the health of the GPUs and reports any issues to Kubernetes.

+ + + +

Some of the benefits of the NVIDIA Device Plugin include:

+ + + +

Automatic GPU discovery and allocation, eliminating the need to manually configure GPU resources on each node.
Seamless integration with Kubernetes, allowing you to manage GPUs with familiar tools and workflows.
GPU health monitoring, allowing Kubernetes to maintain stability and reliability for GPU-accelerated workloads.
Resource sharing, which allows multiple pods to use the same GPU. This is crucial in today’s environment, where GPUs are scarce and expensive.

+ + + +

Installing and Configuring the Nvidia Device Plugin

+ + + +

Prerequisites

+ + + +

Ensure that your GPU nodes have the necessary NVIDIA drivers (version ~= 384.81) installed.
Install the nvidia-container-toolkit (version >= 1.7.0) on each GPU node.
Configure the nvidia-container-runtime as the default runtime for Docker or containerd.
Kubernetes version >= 1.10
If using AWS EKS for example, these will be handled for you by default when using GPU nodes

+ + + +

Deploying the Device Plugin

+ + + +

First, we’ll install the daemonset using helm. To install the latest version (v0.14.5 at the time of writing) into a cluster with default settings, the most basic command is:

+ + + +

helm upgrade -i nvdp nvidia-device-plugin \
  --repo https://nvidia.github.io/k8s-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version v0.14.5

+ + + +

This will install OR upgrade a helm release named nvdp into the nvidia-device-plugin namespace, with default settings.

+ + + +

This will give you a basic setup, but there are many reasons you may want to customize the chart via values.yaml. We’ll dive into some of the most useful options as well as some best practices, but you can see the full set of values here. You’ll likely want to add taints to your GPU nodes (the method will depend on your Kubernetes setup and how you provision nodes) and then configure tolerations to ensure that the device plugin is only scheduled on GPU-enabled nodes. We’ll dive deeper into these types of configurations in part 2 of this series.

+ + + + + + + +

The nvidia-device-plugin supports 3 strategies for GPU sharing and oversubscription, allowing you to optimize GPU utilization based on your specific workload’s requirements. A quick overview of each, with examples of how to configure via values.yaml:

+ + + +

Time-slicing: This strategy allows multiple workloads to share a GPU by interleaving their execution. Each workload is allocated a specific time slice during which it has exclusive access to the GPU. Time-slicing is useful when you have many small workloads that don’t require the full power of a GPU simultaneously. One important point to note is that nothing special is done to isolate workloads that are granted replicas from the same underlying GPU: each workload has access to the full GPU memory and runs in the same fault domain as all of the others (meaning that if one pod’s workload crashes, they all do). In my experience, time-slicing usually isn’t what you’re looking for when it comes to GPU resource sharing. It basically lets all the pods access the single GPU in a free-for-all manner, executing things concurrently without any regard for each other. If you have workloads that don’t mind this, such as Jupyter notebooks for research that aren’t using the GPU at the same time, this setting COULD be useful, but I’d recommend looking at the other options first unless you know what you’re doing.

Example values.yaml for time-slicing:

+ + + +

config:
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
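To see what this configuration means for scheduling, here is a small sketch of the resulting capacity arithmetic. It assumes that each physical GPU is advertised `replicas` times under the same resource name, as the time-slicing description above implies; the function name is made up for this example.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas must be >= 1")
    return physical_gpus * replicas


if __name__ == "__main__":
    # A node with 4 physical GPUs and replicas: 10, as in the values.yaml above.
    print(advertised_gpus(physical_gpus=4, replicas=10))
```

Keep in mind that these replicas are only scheduling slots: as noted above, pods sharing a GPU this way get no memory or fault isolation from each other.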

+ + + +

Multi-Instance GPUs (MIG): To mitigate the potential downsides of time-slicing, NVIDIA supports MIG. MIG is a feature supported on certain NVIDIA GPUs (e.g., A100) that enables partitioning a single GPU into multiple smaller, isolated instances. Each instance behaves like a separate GPU with its own memory and compute resources. MIG is beneficial when you have workloads with varying resource requirements and want to ensure strict isolation between them. This is in contrast to MPS, which gives you more fine-grained control over memory and compute resource allocation but doesn’t provide full memory protection and error isolation between workloads. MIG supports both mixed and single strategies for exposing GPUs to Kubernetes; if interested, you can read more about how they work here. Mixed is more flexible, and I’d recommend using it unless you have a cluster large enough that exposing only a single MIG type per node is feasible. MIG is only supported on NVIDIA Ampere and newer data center GPUs, and while it is less flexible than MPS, it is the most complete solution for workload isolation if your workloads require that.

Example values.yaml for MIG:

+ + + +

config:
  map:
    default: |
      version: v1
      flags:
        migStrategy: "mixed"

+ + + +

CUDA Multi-Process Service (MPS): MPS is a runtime service that enables multiple CUDA processes to share a single GPU context. It allows fine-grained sharing of GPU resources among multiple pods by running CUDA kernels concurrently. This mode feels the most similar to the way Kubernetes allocates CPU and memory resources in a fine-grained way, and it is supported on almost every CUDA-compatible GPU. MPS will split up a GPU into equal slices of compute and memory, and the MPS control daemon will enforce these limits. Sharing with MPS is currently not supported on devices with MIG enabled. MPS is suitable when you have workloads that can efficiently share GPU resources without strict isolation requirements. If you don’t have strict isolation requirements, MPS is probably the right choice for you.

Example values.yaml for MPS:

+ + + +

config:
  map:
    default: |-
      version: v1
      sharing:
        mps:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
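
To make the "equal slices" behaviour concrete, here is a minimal Python sketch of the arithmetic behind the `replicas: 10` setting in the values.yaml above. The 40 GiB memory size and the `mps_slice` helper are illustrative assumptions, not part of the nvidia-device-plugin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpsSlice:
    """Share of one physical GPU given to a single MPS replica."""
    memory_gib: float
    compute_percent: float


def mps_slice(total_memory_gib: float, replicas: int) -> MpsSlice:
    """Divide one GPU into `replicas` equal slices of memory and compute."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return MpsSlice(
        memory_gib=total_memory_gib / replicas,
        compute_percent=100.0 / replicas,
    )


# Hypothetical 40 GiB GPU shared across 10 replicas.
share = mps_slice(total_memory_gib=40.0, replicas=10)
print(f"{share.memory_gib:.1f} GiB memory, "
      f"{share.compute_percent:.0f}% compute per replica")
```

A workload that needs more than its slice would need a lower replica count, or a GPU of its own.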

+ + + +

This should be a good introduction to GPU sharing to get you started. We will go into more detail about advanced configuration and best practices in part 2 of this series.

+ + + +

Allocating GPUs to Pods Using the Nvidia Device Plugin

+ + + +

Allocating GPUs to pods when using the nvidia-device-plugin is straightforward and should feel familiar to anyone comfortable with Kubernetes. It is highly recommended to use NVIDIA base images for your containers in order to have all the necessary dependencies installed and configured properly for your underlying workload. Setting a limit for nvidia.com/gpu is crucial; otherwise, all GPUs will be exposed inside the container. Finally, make sure to include tolerations for any taints set on your nodes so that the pod can be scheduled appropriately. Here’s a barebones example of a GPU-enabled pod:

+ + + +

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

+ + + +

Conclusion

+ + + +

The NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads. By abstracting the complexities of GPU management and providing seamless integration with Kubernetes, it empowers developers and data scientists to focus on building and deploying their models without worrying about the underlying infrastructure.

+ + + +

We’re just scratching the surface here, so if you’re interested to learn more please check out part 2 of this series where we’ll go into detail on advanced configuration, troubleshooting common issues, and some of the limitations of using the nvidia-device-plugin alone to manage GPUs. Also, check out the additional resources at the end of this article!

+ + + +

FinOps for Kubernetes: engineering cost optimization

Sun, 28 Apr 2024 16:00:00 GMT

Community post by Saqib Jan

+ + + +

Cloud has given on-demand access to compute resources, but high availability also makes cost a much more dynamic problem to forecast. This reverberates as companies continue to expand their cloud footprints and adopt more cloud technologies — the potential for waste also increases.

+ + + +

As the 2024 State of FinOps report underscores, organizations are focused on reducing waste. With efficiency top of mind, savings and cost optimization must be the top priority for engineering leaders considering a FinOps model in Kubernetes. Because estimating and optimizing cost is a black hole for platform engineering and finance teams, the biggest challenge for stakeholders is figuring out where costs originate. Addressing this requires some understanding of ownership costs.

+ + + +

Big tech engineering teams with mature finance and product management practices are solving for these challenges with cost models that help measure the total cost of ownership of their services and applications. Laurent Gil, Cloud Neutrality Advocate and Co-Founder of Cast AI, explains that a cost model is often sufficient only as an informed starting point, and that teams should settle for a good-enough model to avoid over-investing in it.

+ + + +

The first thing to consider is the cost drivers: the elements that contribute to the overall cost, and how to incorporate them into calculations. CPU, memory, and storage are allocated to each service and execution in Kubernetes. Workloads also grow larger over time, and there are a variety of costs involved in hosting, integrating, running, managing, and securing cloud workloads. While some charges directly relate to compute, data transfer, and storage consumption, other factors add complexity. Tooling and integrations with other cloud services must also be factored into total cost of ownership (TCO) calculations.
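
As a rough illustration of folding these cost drivers into one TCO figure, the Python sketch below multiplies each driver by a unit price and adds flat tooling fees. Every price, field name, and usage figure is a hypothetical placeholder, and real cloud bills have more dimensions than this.

```python
from dataclasses import dataclass

# Hypothetical unit prices in USD; replace them with your provider's rates.
UNIT_PRICES = {
    "cpu_core_hours": 0.04,
    "memory_gib_hours": 0.005,
    "storage_gib_months": 0.10,
    "data_transfer_gib": 0.09,
}


@dataclass(frozen=True)
class MonthlyUsage:
    """One service's monthly consumption of each cost driver."""
    cpu_core_hours: float
    memory_gib_hours: float
    storage_gib_months: float
    data_transfer_gib: float
    tooling_usd: float  # flat fees: tooling, integrations, managed add-ons


def monthly_tco(usage: MonthlyUsage) -> dict:
    """Return the cost per driver plus a 'total' entry, all in USD."""
    breakdown = {
        driver: getattr(usage, driver) * price
        for driver, price in UNIT_PRICES.items()
    }
    breakdown["tooling_usd"] = usage.tooling_usd
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical usage for a single service.
checkout = MonthlyUsage(1000, 4000, 500, 200, tooling_usd=150)
for driver, cost in monthly_tco(checkout).items():
    print(f"{driver:>20}: ${cost:,.2f}")
```

Keeping the per-driver breakdown alongside the total makes it easier to see which driver dominates before trying to optimize anything.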

+ + + +

Architect for Efficient Cloud Usage

+ + + +

If you run Kubernetes yourself, you need to have a strong engineering team. And unless you are in a business close to containerization and microservices technology, it’s essentially just a cost center and an inefficient use of resources.

+ + + +

Building out Kubernetes yourself is very challenging: you need to understand the nuances of all the different components that must be set up and configured properly. Anyone who is not in the business of running infrastructure will benefit from managed Kubernetes, because you don’t have to hire a team just to run your workloads at a reasonable level of reliability.

+ + + +

Richard Hartmann, Director of Community for Grafana Labs, shares two fundamental approaches to efficient cloud usage. One is to “go all-in on bespoke services, leveraging whatever you can to reduce undifferentiated heavy lifting and focus on solving problems that drive your business forward.” Alternatively, you can “use as few bespoke services as possible, relying solely on the baseline across all providers.” This approach allows you to maintain control over how your platform operates and facilitates easy migration between clouds. “Both approaches have merit,” but he cautions that being in between is usually not ideal, as it exposes you to the drawbacks of both.

+ + + +

Both solutions have similar problems. Cloud is expensive, and there is zero incentive for cloud providers to offer great cost controls. Hartmann points out the inherent conflict of interest: “that would literally enable users to pay less,” something no cloud provider would want. In the current macroeconomic uncertainty, this adds more pressure on engineering leaders already contending with shrinking budgets and heightened expectations for cost efficiency.

+ + + +

As a result, many companies are trying to figure out the right model and find a happy balance between knowing what’s running in production and knowing how to set up a dev-test environment effectively.

+ + + +

“A lot of our customers today are looking at different models internally for their projects that are running in the cloud through managed Kubernetes offerings, whether it be a showback or chargeback type model,” commented GitLab Field CTO Lee Faus. And he mentioned, “We’ve had a few customers who tried to implement quotas around what they’re allowed to spend using high water, low water marks. But in doing so, they’ve realized that because of the way most managed Kubernetes clusters work, they incentivize you to build things like auto-scaling.”

+ + + +

There are more reasons why organizations end up in situations with over-provisioned clusters, which not only lead to poor cycle times from a CPU and memory perspective but also ultimately result in a negative experience for end-user interactions with the applications.

+ + + +

To counteract the risk of uncontrolled spending, Hartmann says “we have implemented deep control over what specifically we do and built our cost controls for self-managed clusters and as well managed platforms.” This approach helps scrutinize operations, ultimately enforcing a chargeback program that encourages a shared sense of accountability across stakeholders.

+ + + +

Both Hartmann and Faus highlight challenges in managing costs and finding the right balance between control and cost efficiency. FinOps practices, they affirm, help organizations to anticipate, control, check, and optimize their cloud investments on a proactive and reactive basis.

+ + + +

FinOps for Kubernetes Cost Control Strategies

+ + + +

Cost management on the cloud side can get out of control, but a lot of that stems from not having good rigor in the software development lifecycle — where things are pushed into production before they’re actually ready, or when they haven’t been adequately tested from a performance perspective. And considering the business value, there has never been a more important time to adopt FinOps principles ‘inform, operate, and optimize’ because existing solutions do not capture the nuances needed to economically achieve the perfect balance between cost and performance.

+ + + +

FinOps is the discipline of promoting shared responsibility and bringing together all stakeholders (tech, business, and finance people) to establish policies and best practices for usage that are programmatically enforced. Adopting a FinOps approach can give platform engineering teams the visibility necessary to find ways to reduce costs without affecting performance.

+ + + +

When DevOps, for instance, gives developers tools and guardrails to build, deploy, and fully own an application, it’s important to also educate them about overall cost management, because empowering teams to take action is the top challenge. It’s usually not until the bill comes due at the end of the month that finance teams realize there is an issue with sudden spikes in costs.

+ + + +

I can tell you from supporting clients that most organizations leveraging Kubernetes struggle to manage their cloud expenses because there is no proper review and refining cycle for their processes, and also the pool of skilled workers in this segment is very dry.

+ + + +

Faus, in our conversation, stipulated, “There’s a term that we’re starting to see a lot of companies use, which revolves around value streams.” Value streams allow us to map back to key performance indicators (KPIs). And, these KPIs are defined at the CEO, CIO, and CFO level, where budgets are drawn, resource hiring is planned, and new product lines are decided. “This provides a high-level mapping back to those elements and around those value streams. When we drive throughout the given year, we need to have a way to ensure that we are actively tracking these aspects throughout the SDLC and in our cost management.”

+ + + +

Whatever you may call it, empowering development teams becomes imperative when using Kubernetes. Taking responsibility to make informed decisions will, in turn, make Kubernetes cost management timely, proactive, and cost-effective. As budgets get tighter, there is a great need for cost control strategies: building your own cost controls from a knowledgeable foundation and implementing a third-party solution, whether commercial or open source, to avoid linear cost increases. These strategies are effective for everyone, on any cloud provider and even for Kubernetes on bare-metal infrastructure.

+ + + +

Case Study: LambdaTest

+ + + +

A perfect example to showcase the lasting impact of FinOps practices can be seen with LambdaTest. This young company, providing infrastructure as a cloud platform for online browser and operating system testing, quickly scaled up its services after securing initial funding but then encountered challenges with sudden spikes in cloud costs during subsequent rounds of funding.

+ + + +

As the senior DevOps engineering leader, Shahid Ali Khan led the responsible development of LambdaTest’s Kubernetes infrastructure and overall infrastructure system. He shared invaluable insights on navigating the exhaustive platform engineering challenges and adopting FinOps principles, imperative in the process, for optimizing cloud resources and saving cloud costs.

+ + + +

This case study highlights LambdaTest’s journey to FinOps maturity, emphasizing cost optimization, and outlines a systematic approach, with insights from notable leaders, to proactively navigate these hurdles. The study discusses technology solutions, strategies, outcomes, and lessons from my one-on-one interviews with these leaders.

+ + + +

— The Challenges of Managing Infrastructure

+ + + +

As LambdaTest expanded their offerings, the complexity of their infrastructure, which relied on AWS and self-managed Kubernetes to support their data-heavy customers, also increased. This architecture allowed them to scale rapidly, and Mudit Singh, Head of Growth & Marketing, reflects on their initial decision: “When we started off with Kubernetes, no cloud provider was offering a static, stable solution for it. And at that time, in around 2017, AWS released Managed Kubernetes, which remained in testing for an extended period. As a startup with a talent shortage, we were unsure about managing our own cluster.”

+ + + +

As their usage increased, each month ended with sudden cost spikes that raised more questions about spend. Questions like ‘How much are we spending?’ ‘Is this normal?’ ‘How should cost be divided between teams, applications, and business units?’ and ‘What is the problem: over-provisioning, or using too much compute or memory?’ remained unanswered. This situation, Singh shares, “drained our DevOps leaders, platform engineering and Finance teams (including the Founders) to invest a significant amount of time and attention in understanding the hefty incoming invoices.”

+ + + +

These issues escalated over time: as usage grew, the risk of losing cost control also increased. The company sourced tools to report cloud consumption but struggled to identify and address cost drivers. Like many, LambdaTest also faced the balancing act of building a team with a FinOps culture while striving for enhanced cost visibility.

+ + + +

Singh stressed how identifying and addressing the underlying problems driving up costs proved to be a struggle for them while managing data centers across continents. And Khan expounds on the cross-functional initiatives they took to gain clarity on cost drivers, achieving visibility and transparency into spending and cloud usage.

+ + + +

— Create a Tagging Framework

+ + + +

It is difficult to align reports to business context without insight into workload allocations, and the industry has seen the adoption of more structured approaches to resource management.

+ + + +

Khan detailed, “We have multiple products that are running, and there are services shared among those products. It was getting hard for us to identify which service is contributing—or which particular service is contributing—to the cost of each product.” Tagging with labels also helps identify over-provisioned resources. And, “we began by implementing tagging and labeling, along with utilizing different node pools. This enabled us to precisely determine the cost allocation for each product in terms of specific resources, understand how their requirements have scaled over time, and effectively address those needs.”

+ + + +

It has now become a common practice to leverage namespaces for each product or service in Kubernetes, with a clear bifurcation of services. This approach not only lays the groundwork for resource management but also supports isolation, resource quotas, and simplified access control, enhancing operational efficiency. Most importantly, it makes cost analysis, reporting, and optimization easier for individual teams, services, or business lines.
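
Once resources carry a product label, per-product cost allocation becomes a simple aggregation. The Python sketch below sums cost by label and spreads the cost of shared services across products in proportion to their direct spend. The pro rata split, the label names, and the figures are assumptions for illustration; LambdaTest's actual allocation method is not described in the interviews.

```python
from collections import defaultdict


def allocate_costs(line_items, shared_label="shared"):
    """Sum cost per product label, then spread shared cost pro rata.

    line_items is an iterable of (product_label, cost) pairs, for example
    one pair per namespace or node pool taken from a billing export.
    """
    direct = defaultdict(float)
    shared = 0.0
    for label, cost in line_items:
        if label == shared_label:
            shared += cost
        else:
            direct[label] += cost

    direct_total = sum(direct.values())
    if direct_total == 0:
        # Nothing to apportion against; report shared cost on its own.
        return {shared_label: shared}

    return {
        label: cost + shared * cost / direct_total
        for label, cost in direct.items()
    }


# Hypothetical labelled line items in USD.
items = [("web-testing", 400.0), ("mobile-testing", 300.0),
         ("web-testing", 200.0), ("shared", 90.0)]
print(allocate_costs(items))
```

Other split rules (equal shares, or shares based on request counts) fit the same structure. Only the final dictionary comprehension would change.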

+ + + +

— From Data to Action

+ + + +

The volume of data subject to analysis for cost optimization is always considerable. Being able to vectorize that data and understand where there are errors, memory spikes, or CPU spikes lets you not only optimize the cost structure of managing applications but also provide feedback to the engineering teams. Faus underlines, “This involves actions like automatically promoting an issue or a ticket on the product side to ensure that something is going to be done about that cost as part of a current sprint.”

+ + + +

This process should also involve analyzing time-series data, which greatly helps to identify inefficiencies, make informed decisions about resource allocation, and find potential automation opportunities. There are other strategies and optimization tactics you could adopt, but at a basic level, it’s good to first consider what you are optimizing for.

+ + + +

So what is optimizing for cost? It’s really just another metric to consider. Say, for example, I already manage CPUs, pods, memory, storage, and compute capabilities. Each one, then, is a piece of the larger puzzle I’m piecing together. So, adding cost into the mix doesn’t change the fundamental approach; it’s just integrating another element into the array of resources we’re already balancing.

+ + + +

Khan emphasizes the key focus is on enabling “our tech personnel to efficiently extract and manipulate data to align with our business objectives, rather than the other way around. This strategy, which we also implement internally at LambdaTest, underscores the critical importance of fostering internal collaboration and knowledge exchange to effectively bridge the technological and business divides within our organization.”

+ + + +

— Cost Visibility

+ + + +

Monitoring is the most important aspect of optimizing cost and running the cluster without impacting performance and usage. It is the core pillar for building awareness and informing FinOps objectives for optimization strategies.

+ + + +

“We tried a lot of solutions and multiple plugins, but we could not get a clear understanding of the volume of requests, the performance of the cluster, or the overall system status,” Khan specified. “We implemented distributive tracking (and distributed tracing) inside the cluster to monitor each and every request, which helped us to identify how services are being used and pinpoint optimization opportunities within the system. This tremendously helped us to identify inefficiencies, which increased accountability – informing service owners to take action, while also enabling things like internal chargeback and showback models.”

+ + + +

Visibility underpins FinOps metrics (idle resources, under-optimized infrastructure) for tracking progress. But the key metric to consider is normalized cost, which, when adjusted for your operating business metrics, provides a more holistic view of your cloud spending relative to your business activities.
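
Normalized cost is simply cloud spend divided by a business metric. A minimal Python sketch, assuming test sessions as the business metric and using made-up monthly figures:

```python
def normalized_cost(cloud_cost_usd: float, business_units: float) -> float:
    """Cloud cost per unit of business activity (e.g. per test session)."""
    if business_units <= 0:
        raise ValueError("business_units must be positive")
    return cloud_cost_usd / business_units


# Hypothetical months: (cloud spend in USD, test sessions served).
months = {"March": (50_000, 2_000_000), "April": (60_000, 3_000_000)}
for name, (spend, sessions) in months.items():
    unit_cost = normalized_cost(spend, sessions)
    print(f"{name}: ${spend:,} total, "
          f"${unit_cost * 1000:.2f} per 1,000 sessions")
```

In this made-up example the bill grows by 20% while the cost per 1,000 sessions falls from $25 to $20, which is the kind of context a raw invoice hides.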

+ + + +

— Drill-Down Granularity

+ + + +

What you do next will depend on where your baseline is. But allowing issues to persist over time makes controlling costs at a later stage difficult. Even if you attempt to control them later, the effort required from your team would be substantial, diverting focus from implementing features.

+ + + +

It is here that a tagging framework with a Kubernetes cost management tool becomes helpful. This helps you drill down into the layers of your environment so you can see exactly how each application is impacting your costs, enabling proactive recommendations for cost savings.

+ + + +

Khan shared their approach, “We began observing all attributes that influence pricing, and based on that, we looked deeper into why there has been an increase, what could have caused it, and then took rather difficult — educational path to show individual teams how their environment impacts their department’s resources.”

+ + + +

The ideal solution, according to Khan’s recommendation, “should provide time-saving features.” An example would be a prioritized list of your environment’s most expensive components, ranked by cost. This allows you to focus on the areas that will yield the most significant cost savings first. “We realized this and implemented a proactive approach across teams, ensuring work could proceed in a manner that does not affect production.”
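
A prioritized list like the one described can be produced with a sort and a share-of-total calculation. The following Python sketch is illustrative only; the component names and costs are invented.

```python
def top_cost_drivers(costs, limit=3):
    """Return (name, cost, share_of_total) for the most expensive components."""
    total = sum(costs.values())
    if total <= 0:
        return []
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    return [(name, cost, cost / total) for name, cost in ranked[:limit]]


# Hypothetical monthly cost per component in USD.
monthly_costs = {"logging": 900.0, "ci-runners": 4200.0,
                 "staging": 600.0, "search-api": 1800.0}
for name, cost, share in top_cost_drivers(monthly_costs):
    print(f"{name:<12} ${cost:>8,.2f}  {share:.0%}")
```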

+ + + +

— Real-Time Alerting

+ + + +

Now, in cloud-native environments with auto-scaling enabled, your cluster or nodes can scale up or down. Therefore, implementing budgets and alerts within the cloud system is imperative, as untracked scaling can lead to significant expenses that the solution you are providing won’t justify. Applying custom rules programmatically lets you receive notifications when costs increase, so you can take corrective action on specific requests.

+ + + +

Khan remarked, “FinOps practices have significantly changed the way we work.” Through extensive data analysis, “we measure costs to a significant extent and set budgets for each product. For example, each product has clusters, and some of them share services. With simple tagging, we can set specific budgets for each product. And when we allocate a certain amount to a product, we get alerts if spending goes over a predetermined threshold.” A small increase triggers an amber alert, and a big jump triggers a red alert.
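
The amber/red scheme can be sketched in a few lines of Python. The thresholds here (amber for any overrun, red above 20% over budget) are assumptions chosen for illustration. The interviews do not state the values LambdaTest uses.

```python
from enum import Enum


class Alert(Enum):
    OK = "ok"
    AMBER = "amber"
    RED = "red"


def budget_alert(spend, budget, amber_over=0.0, red_over=0.20):
    """Classify spend against a budget.

    amber_over and red_over are the fractions by which spend may exceed the
    budget before an amber or red alert fires (hypothetical defaults).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    overrun = (spend - budget) / budget
    if overrun > red_over:
        return Alert.RED
    if overrun > amber_over:
        return Alert.AMBER
    return Alert.OK


# Hypothetical monthly spend for three products with a $1,000 budget each.
for product, spend in {"web": 950, "mobile": 1050, "api": 1300}.items():
    print(product, budget_alert(spend, budget=1000).value)
```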

+ + + +

The optimal solution should also alert you to abnormal cost spikes in real-time so you can examine the issue right away and remediate it, rather than waiting for weekly or monthly reports. And sometimes, these spikes may serve as an early warning sign of a cyberattack, which requires an immediate and proactive response to safeguard your infrastructure and data integrity.
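
One simple way to flag an abnormal spike is to compare each day's cost with a rolling baseline of the preceding days. The Python sketch below uses a mean plus three standard deviations rule. This rule, the window size, and the sample data are assumptions, and commercial tools typically use more sophisticated anomaly detection.

```python
from statistics import mean, stdev


def find_cost_spikes(daily_costs, window=7, threshold=3.0):
    """Return indexes of days whose cost exceeds the rolling baseline.

    A day is flagged when its cost is more than `threshold` standard
    deviations above the mean of the preceding `window` days.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        # Floor the spread at 1% of baseline so flat history still works.
        spread = max(stdev(history), 0.01 * baseline)
        if daily_costs[i] > baseline + threshold * spread:
            spikes.append(i)
    return spikes


# Hypothetical daily cluster cost in USD; day 8 is an abnormal jump.
costs = [100, 102, 98, 101, 99, 103, 100, 101, 180, 102]
print(find_cost_spikes(costs))
```

Running a check like this on hourly or daily billing exports, rather than waiting for a monthly invoice, is what allows the quick follow-up described above.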

+ + + +

Encourage FinOps Practices

+ + + +

The more intentional your approach to planning for change, the sooner you’ll target the places where change will be most effective. But the worst thing you could do as an organization is to say, ‘we’re going to inform’ without understanding the extent to which overspending is ingrained in your Ops and, importantly, where some of the key drivers are coming from.

+ + + +

The best way to manage Kubernetes at scale is to take a holistic and intentional approach, which also helps in calculating the total cost of ownership and allocating budgets, but it is not something most companies are doing. Bifurcation of resources is not what most companies are doing either. A lot of companies are managing huge infrastructures, but what they lack is a dedicated FinOps team for such instances. And the reactive approach that they take to incidents, in terms of cost management, leads to significant financial burdens.

+ + + +

Cloud lets you accelerate, but it can also be a double-edged sword without a proactive approach. According to the CNCF microsurvey report, over-provisioning, or having more resources than necessary, is one of the most common factors leading to overspending.

+ + + +

“We’ve analyzed usage data on thousands of applications, and there are three primary reasons companies overspend: over-provisioning, pod requests being set too high, and low usage of Spot instances,” Gil enumerated. The biggest source of overspend, however, is an overestimation of the real CPU/memory usage. “For more than 97% of the applications we analyzed, the pre-optimized utilization of CPU is only 12%. That means that, on average, nearly 90% of compute is paid for, but goes unused.” And these percentages, he laid out, “are consistent across application sizes and cloud providers.”
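
To see what a 12% utilization figure means in money terms, the Python sketch below computes utilization and the spend attributable to idle cores. The core counts, hourly price, and 730-hour month are hypothetical inputs, not numbers from Cast AI's analysis.

```python
def unused_compute(requested_cores, used_cores, usd_per_core_hour, hours):
    """Return (utilization, usd_paid_for_idle_cores) for a billing period."""
    if requested_cores <= 0:
        raise ValueError("requested_cores must be positive")
    utilization = used_cores / requested_cores
    idle_cores = max(requested_cores - used_cores, 0)
    return utilization, idle_cores * usd_per_core_hour * hours


# Hypothetical workload: 100 cores requested, 12 actually used, over
# a 730-hour month at an assumed $0.04 per core-hour.
utilization, wasted = unused_compute(100, 12, 0.04, 730)
print(f"utilization {utilization:.0%}, unused spend ${wasted:,.2f}")
```

Lowering pod requests toward observed usage shrinks `requested_cores`, which is why right-sizing requests is usually the first lever to pull.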

+ + + +

The underlying reason most companies overspend is a lack of education and empowerment. Tech, DevOps, and infrastructure teams often lack cost awareness. And change is not easy, because building a culture of transparency and openness requires sharing pricing information with engineers and creating a safe space for open communication.

+ + + +

This is difficult for nearly all companies because people are first concerned about not stepping on anyone else’s toes. And it can be very unhelpful if fewer people are bold enough to be involved. The key is to get everyone on the same page regarding the business objectives. This means sharing the plan, how things are looking in the near future, what kind of services are planned for wider rollout, and even the company’s gross margin. “It is transparency that spurs on shared understanding,” Hartmann remarks. When everyone sees the bigger picture, they can feel the “real pain” of overspending and how their work directly impacts it. This shared understanding empowers team members to contribute to cost control strategies.

+ + + +

Building on the strategies discussed, a Kubernetes governance platform can serve as an initial step to gain clarity into resource utilization and enable you to drill down into the various layers of your environment. It can also provide policy-based control for cloud-native environments, empowering teams to make informed financial decisions regarding Kubernetes by allowing them to grasp and adopt cost-control strategies.

+ + + +

Author: Saqib Jan

+ + + +

Email: sakimjan8@gmail.com

+ + + +

LinkedIn: https://linkedin.com/in/s-jan

+ + + +

BIO: Saqib Jan is a freelance analyst with experience in application development, cloud technologies, and consulting.

+ + + + + + +

The hidden economy of open source software

Thu, 25 Apr 2024 16:00:00 GMT

Member post originally published on Sysdig’s blog by Nigel Douglas

+ + + +

The recent discovery of a backdoor in XZ Utils (CVE-2024-3094), a data compression utility used by a wide array of open-source, Linux-based computer applications, underscores the importance of open-source software security. While it is often not consumer-facing, open-source software is a critical component of computing and internet functions, such as secure communications between machines.

+ + + +

Open source software (abbreviated as OSS) has become a cornerstone of the tech industry, influencing everything from small startups to global corporations. Despite its ubiquitous presence and foundational role in driving innovation, the true economic value of OSS has remained largely uncharted territory—until now. A groundbreaking study entitled “The Value of Open Source Software” by researchers Manuel Hoffmann, Frank Nagle, and Yanuo Zhou at Harvard Business School delves into this unexplored domain, revealing the astonishing economic impact of OSS throughout industry.

+ + + +

A Priceless Foundation with a Trillion-Dollar Impact

+ + + +

The study begins by addressing a fundamental paradox: How do you measure the value of something that is freely available? Traditionally, economic value is calculated by multiplying the price of a product by the quantity sold. However, this formula hits a snag when it comes to OSS—there’s no price tag on something that’s free, and tracking its usage is a Herculean task due to the decentralised nature of OSS distribution.

+ + + +

Leveraging unique global data sources and a novel approach, the research estimates the “supply-side” value (the cost to recreate the most widely used OSS) at $4.15 billion. But the true eye-opener is the “demand-side” value, pegged at a staggering $8.8 trillion. This figure represents the hypothetical cost that companies would face if they had to develop equivalent software internally, highlighting the immense savings and efficiency gains OSS provides to the global economy.

+ + + +

For instance, Falco, an open-source, cloud-native security tool, boasts contributions from 190 individuals dedicated to enhancing the software and ensuring it meets the evolving threats in cloud computing. If an organisation attempted to develop a custom threat detection engine in Go from scratch, it would be financially impractical to employ 190 staff members to continuously develop and maintain the tool. Although most of the 190 contributors likely engage with Falco as a side project rather than their primary employment, acknowledging the number of people actively committing to the project offers valuable insight into its collective human investment.

+ + + +

The Unsung Heroes of OSS

+ + + +

One of the most intriguing findings of the study is the concentration of value creation within the OSS community. A mere 5% of OSS developers are responsible for 96% of its demand-side value. This elite group of contributors has a disproportionate impact on the software landscape, emphasising the need for support and recognition from both the tech industry and policymakers.

+ + + +

Sticking to the topic of the recent XZ Utils backdoor, to prevent incidents like that from recurring, policymakers and software vendors must take proactive steps to enhance the security and integrity of existing OSS projects. Many OSS maintainers work on these projects voluntarily, without compensation, and often in addition to their regular employment. This can lead to overwork and burnout, creating vulnerabilities that adversaries can exploit to compromise software.

+ + + +

Without adequate safeguards and support systems, these maintainers operate in an environment that undervalues their crucial contributions and exposes them to significant risks. To address these challenges, there is a pressing need for policy interventions that recognise and financially support OSS development, along with industry-wide adoption of rigorous security practices. By implementing measures such as funding OSS projects, offering security training for maintainers, and developing comprehensive review processes, policymakers and vendors can protect maintainers from undue pressures and enhance the security of OSS.

+ + + +

The Programming Languages That Power the Economy

+ + + +

Digging deeper, the study finds that the lion’s share of OSS value is actually generated by a few key programming languages, with Go, JavaScript, and Java leading the pack. These languages are not just popular among developers; they are instrumental in creating billions of dollars in value, further emphasizing the strategic importance of investing in and nurturing the OSS ecosystem.

+ + + +

The notion of organisations opting to create proprietary programming languages rather than leveraging existing open-source options like JavaScript or Python libraries does not hold practical merit, considering the extensive resources and expertise required for such an endeavor.

+ + + +

Constructing a new programming language from scratch involves not just the immense initial development effort but also the continuous maintenance, development of libraries, tools, and community support to make it viable for production use. Moreover, the existing ecosystems around popular languages such as JavaScript and Python are the result of years of collective effort and contributions from a global community, encompassing vast libraries and frameworks that facilitate rapid development and deployment of applications.

+ + + +

These widely-used languages, however, are not without their vulnerabilities, including known Common Vulnerabilities and Exposures (CVEs) that pose significant security risks if left unpatched. Addressing these vulnerabilities often falls beyond the capacity of individual organisations, especially considering the breadth of open-source dependencies modern applications rely on. This scenario underscores the crucial role of large software vendors in enhancing the security infrastructure of the open-source ecosystem.

+ + + +

By contributing to the security of these languages and libraries, either through direct code contributions, funding, or the provision of advanced security tools and services, these vendors can significantly reduce the potential attack surface for organisations worldwide. Such collaborative efforts between individual maintainers, organisations, and large vendors are essential in bolstering the overall security posture of the open-source software that underpins much of today’s digital infrastructure.

+ + + +

How is the Falco project staying secure?

+ + + +

The Falco project emphasizes its commitment to maintaining vendor independence and the collective effort to bolster its security posture. A foundational pillar of Falco’s philosophy is its vendor-neutral stance, ensuring that the project benefits from a wide array of contributions without being tethered to any single company’s interests. This approach has fostered a diverse and robust community, with significant engineering resources dedicated by several leading companies.

+ + + +

To prove the project’s maturity and reliability, Falco successfully graduated from the Cloud Native Computing Foundation (CNCF) incubating status. This achievement was marked by a fairly rigorous Due Diligence process conducted by the CNCF Technical Oversight Committee (TOC), including a comprehensive third-party security audit. This graduation not only proved Falco’s growth and sustainability, but also solidified Falco’s position as a leader in the open-source runtime security ecosystem.

+ + + +

Reflecting its commitment to an inclusive development environment, Falco has 17 organizations actively committing to the project. Notably, approximately 38% of contributions come from committers affiliated with organizations such as Amazon, Cisco, Chainguard, Clastix, IBM, Microsoft, Red Hat, and SecureWorks, among others, alongside many individual contributors. This collective effort shows that Falco’s mission to foster a broad-based and resilient security tool is being put into practice.

+ + + +

Governance practices further cement Falco’s dedication to vendor neutrality, with specific measures to prevent any single entity from dominating the project’s direction. A key governance rule caps any organization’s eligible votes at 40%, ensuring balanced representation and decision-making within the project community.

+ + + +

Towards a Sustainable Future for OSS

+ + + +

The findings of the Harvard study are a clear call to action for organisations to reflect on the value of OSS in their business, while also highlighting that many of those projects are taking appropriate steps to have themselves audited. The paper further highlights the vital role of OSS in driving technological innovation and economic efficiency.

+ + + +

However, this digital commons, much like its physical counterparts, is vulnerable to overuse and underinvestment – as seen with the XZ Utils backdoor. The findings advocate for a concerted effort to support OSS development, ensuring its sustainability and continued contribution to the global economy.

“The Value of Open Source Software” study shines a spotlight on the hidden economic powerhouse that is OSS. By quantifying its value, the research not only celebrates the contributions of the OSS community but also highlights the critical need for strategic investment and support to secure its future. As we move forward in the digital era, the true value of OSS cannot be overstated—it is an indispensable resource that fuels innovation, drives efficiency, and shapes the technology landscape.

+ + + + + + +

Open source software in AI and cloud trends to watch in 2024: thoughts from the Netris community

Thu, 25 Apr 2024 16:00:00 GMT

Member post originally published on Netris’s blog

+ + + +

Let’s face it: The world of open source software can feel boring – in a good way. Open source has become so pervasive, and so deeply entrenched within modern software stacks and ecosystems, that it’s easy not to think much about it. The era of AI, cloud and big data is here – and now, more than ever, open-source is playing a critical role.

+ + + +

Yet the recent roundtable discussion that Netris hosted with Kelsey Hightower was a reminder that there is still plenty of change afoot for open source software and everyone who uses it. The event didn’t aim to focus on open source specifically – and participants did discuss other important topics, like cloud computing trends and the relationships between cloud competitors – but open source was a key part of the conversation.

+ + + +

Specifically, Hightower and other participants discussed three themes that are poised to have major consequences for open source software in 2024 and beyond.

+ + + +

1. Open source licensing changes

+ + + +

Looking back at the past year, Hightower observed that the open source ecosystem had been rocked by some messy debates surrounding licensing – namely, HashiCorp’s decision to change the licensing terms for future releases of some of its products (including Terraform, a popular Infrastructure-as-Code tool) and Red Hat’s new policy of placing source code for its Linux-based operating system behind what critics deem a “paywall.”

+ + + +

These developments affected only a handful of open source products, and neither turned previously open source solutions into closed source software. Nonetheless, they sparked a fair amount of controversy in the open source ecosystem about the long-term viability of open source licenses.

+ + + +

Hightower’s take was that the license changes probably don’t signal a wholesale crisis for open source, but they do reflect a new reality that more and more companies will need to embrace if they want to continue to benefit from open source: The need for a greater investment in open source projects by companies that can make meaningful contributions.

+ + + +

“It’s not sustainable to work for free,” Hightower said. “Open source sustainability is coming to a head.”

+ + + +

He added that “most people don’t know this but even Kubernetes struggled to get contributors,” referring to the open source container orchestration platform that he helped develop as a Distinguished Engineer at Google.

+ + + +

The solution, in Hightower’s view, is simple: Companies that want to use open source software need to pay more developers to contribute to it. “If you want to avoid a Red Hat ‘paywall,’ go help write the code,” he said.

+ + + +

2. Open source, AI and network infrastructure

+ + + +

Roundtable participants also tackled what has become a buzzworthy topic over the past year: The role of open source software in the generative AI space.

+ + + +

Alex Saroyan, Netris co-founder and CEO, noted that much of the discussion to date about open source and generative AI has centered on companies like Mistral, which aim to build open source generative AI models that can perform at least as well as those from vendors like OpenAI (which, despite its name, does not produce open source products).

+ + + +

That’s one important facet of open source in the realm of AI, Saroyan said. But another critical consideration – and one that hasn’t received nearly as much attention as it deserves – is the importance of providing open source AI projects with access to the cloud and networking infrastructure resources they need to train AI models.

+ + + +

The reason why is simple: Few, if any, open source projects own the massive compute infrastructure they need to train models. Instead, they rely on cloud infrastructure for training. For AI model training, as well as inference, leveraging the Big 3 public cloud providers – meaning Amazon, Azure and Google Cloud Platform – becomes prohibitively expensive, especially at scale. To “do” AI, businesses need AWS alternatives, GCP alternatives and so on.

+ + + +

“This is why we are seeing many new organizations launching AI cloud services for model training, as well as deploying private edge clouds for AI inference,” Saroyan said.

+ + + +

Indeed, making AI infrastructure more accessible through alternative public and private cloud providers is “one of the reasons why we’re seeing new generations of ethernet technology, like NVIDIA Spectrum-X,” Saroyan noted.

+ + + +

He added that “AI needs significantly more network and cloud infrastructure built in a highly-scalable but also highly cost-efficient manner. The new generation of networking solutions that Netris is helping customers to deliver depends on open source software and commodity hardware. DPUs are a big part of this picture,” he said, referring to special acceleration hardware known as Data Processing Units (DPUs) that are vital for scalable and efficient networks.

+ + + +

In short: Open source has a critical role to play in the future of generative AI, and it’s not limited to open source AI models. Expect to see open source crop up in other corners of the AI ecosystem – including the networks that serve as the vital link between AI workloads and the infrastructure they depend on.

+ + + +

3. Shareable open source AI models

+ + + +

Hightower offered another prediction about how generative AI and open source will converge: “We’ll treat models like shareable libraries.”

+ + + +

He meant that AI developers will use the open source example to build AI models that anyone can use and improve on. He envisions a world where borrowing someone else’s model is as simple as importing a module into your codebase or deploying a container from a public repository.

+ + + +

Hightower added that shareable open source AI models will require an “ecosystem of companies” to build, share and maintain AI software. “No one private entity can run away with these things,” he said.

+ + + +

Of course, given Hightower’s other observations during the roundtable about the importance for companies of backing open source products, any ecosystem that grows up surrounding open source models will need more than just volunteer labor. It will require committed investment from organizations with the means to support high-quality model development and training.

+ + + +

Conclusion: The future of open source

+ + + +

There’s plenty more to say about where open source is headed. But if Hightower and the rest of the Netris community are any guide – and we think they are! – expect new strategies for funding open source, as well as novel approaches to leveraging open source in the realm of AI, to become key open source trends during 2024.

+ + + +

Expect, too, to stop thinking of open source as a “boring” type of resource that developers can take for granted. The open source world is changing, and while we don’t know exactly what’s coming next, we are confident that developments like open source licensing changes and the advent of generative AI will force open source projects and communities to adopt new strategies.

+ + + +

+ + + + + + +

How Katalyst guarantees memory QoS for colocated applications

Wed, 24 Apr 2024 16:00:00 GMT

Member post originally published on Katalyst’s blog

+ + + +

In the previous post[1], we introduced Katalyst – a QoS-based resource management system that helps ByteDance improve resource efficiency through colocation of online and offline workloads. In the colocation scenario, memory management is a crucial topic. On the one hand, when memory is tight on nodes or containers, the performance of the application may be affected, leading to issues like latency jitter or OOM (Out of Memory) errors. In colocation scenarios, where memory is overcommitted, this problem can become more severe. On the other hand, there might be some memory on nodes that is less frequently used but not released, resulting in less available memory that can be allocated to offline jobs, thus hindering effective overcommitment. To address these issues, ByteDance has summarized its refined memory management strategies practiced during large-scale colocation into a user-space Kubernetes memory management solution called Memory Advisor, which has been open-sourced in the resource management system Katalyst. This article will focus on introducing the native memory management mechanisms of Kubernetes and the Linux kernel, their limitations, and how Katalyst, through Memory Advisor, improves memory utilization while ensuring the memory QoS for business applications.

+ + + +

Limitations of native memory management

+ + + +

Memory allocation and reclamation of Linux kernel

+ + + +

Because memory access is much faster than disk access, Linux tends to adopt a greedy memory allocation strategy, aiming for maximum allocation. It only triggers reclamation when the memory watermark is relatively high.

+ + + +

Memory allocation

+ + + +

The Linux kernel has a fast path and a slow path for memory allocation:

+ + + +

Fast path: It first attempts to do a fast path memory allocation and then assesses whether the overall free memory level will fall below the Low Watermark after allocation. If it does, a quick memory reclaim is performed before re-evaluating the possibility of allocation. If the condition is still not met, it enters the slow path.
Slow path: In the slow path, it wakes up Kswapd to perform asynchronous memory reclaim and then attempts another round of fast memory allocation. If allocation fails, it tries memory compaction. If allocation is still unsuccessful, it attempts global direct memory reclaim, which involves scanning all zones and is time-consuming. If this also fails, it triggers a system-wide OOM event to release some memory and then retries fast memory allocation.

+ + + +

Memory reclamation

+ + + +

Memory reclamation can be categorized into two types based on the target: Memcg-based and Zone-based. The kernel’s native memory reclamation methods include the following:

+ + + +

Memcg-level direct memory reclaim: If the Memory Usage of a cgroup reaches a threshold, it triggers synchronous memory reclamation at the memcg level to release some memory. If this is unsuccessful, it triggers a cgroup-level OOM event.
Fast path memory reclaim: As mentioned earlier in the discussion of fast path memory allocation, fast memory reclamation is quick because it only requires reclaiming the number of pages needed for the current allocation.

+ + + +

Asynchronous memory reclaim: As shown in the diagram above, when the overall free memory of the system drops to the Low Watermark, Kswapd is awakened to asynchronously reclaim memory in the background until the High Watermark is reached.
Direct memory reclaim: As depicted in the diagram above, if the overall free memory of the system drops to the Min Watermark, it triggers global direct memory reclaim. Since this process is synchronous and occurs in the context of process memory allocation, it has a significant impact on the performance of the system.
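
+ + + +

The relationship between the three watermarks and the global reclaim behaviours described above can be summarised in a few lines. The following is only an illustrative sketch, not kernel code: the `Watermarks` type and the function names are hypothetical, and the real kernel evaluates watermarks per zone with additional conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermarks:
    """Free-memory watermarks in pages, with min < low < high (hypothetical type)."""
    min_pages: int
    low_pages: int
    high_pages: int


def reclaim_action(free_pages: int, wm: Watermarks) -> str:
    """Map a free-memory level to the global reclaim behaviour described in the text."""
    if free_pages <= wm.min_pages:
        # Synchronous reclaim in the allocating process: the expensive case.
        return "direct reclaim"
    if free_pages <= wm.low_pages:
        # Background reclaim by kswapd, which does not block the allocating process.
        return "wake kswapd"
    return "none"


def kswapd_should_stop(free_pages: int, wm: Watermarks) -> bool:
    """kswapd keeps reclaiming until free memory is back at the High Watermark."""
    return free_pages >= wm.high_pages
```

For example, with watermarks of 100, 200 and 300 pages, a free level of 150 pages wakes kswapd, while a level of 100 pages falls into direct reclaim.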

+ + + +

Kubernetes Memory Management

+ + + +

Memory limit

+ + + +

Kubelet sets the cgroup interface memory.limit_in_bytes based on the memory limits declared by each container within the pod, constraining the maximum memory usage for both the pod and its containers. When the memory usage of the pod or container reaches this limit, it triggers direct memory reclaim or even an OOM event.

+ + + +

Eviction

+ + + +

When the memory on a node becomes insufficient, K8s selects certain pods for eviction and marks the node with the taint node.kubernetes.io/memory-pressure, preventing additional pods from being scheduled on that node. The trigger condition for memory eviction is when the node’s working set reaches a threshold:

+ + + +

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

+ + + +

Eviction is triggered when memory.available falls below the threshold configured by the user. When sorting pods for eviction, the following criteria are considered:

+ + + +

First, it checks if a pod’s memory usage exceeds its request; if so, it’s prioritized for eviction.
Next, it compares the pods based on their priority, with lower-priority pods evicted first.
Finally, it compares the difference between a pod’s memory usage and its request; pods with higher differences are evicted first.
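
+ + + +

The eviction signal and the three ranking criteria above can be expressed as a single sort key. This is a minimal sketch of the ordering as described in this article, not the kubelet implementation; `PodStats` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PodStats:
    """Hypothetical per-pod snapshot; memory values are in bytes."""
    name: str
    priority: int
    memory_request: int
    memory_usage: int


def memory_available(capacity: int, working_set: int) -> int:
    """The eviction signal: node capacity minus the node's working set."""
    return capacity - working_set


def eviction_order(pods: Iterable[PodStats]) -> List[PodStats]:
    """Return pods in the order they would be considered for eviction."""
    def key(pod: PodStats):
        exceeds_request = pod.memory_usage > pod.memory_request
        overuse = pod.memory_usage - pod.memory_request
        # 1) pods over their request first, 2) lower priority first,
        # 3) larger (usage - request) first.
        return (not exceeds_request, pod.priority, -overuse)

    return sorted(pods, key=key)
```

A pod that stays within its request is therefore considered only after every pod that exceeds its request, regardless of priority.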

+ + + +

OOM

+ + + +

If direct memory reclaim still cannot meet the memory demands of processes on the node, it will trigger a system-wide OOM event. When the Kubelet starts a container, it configures /proc/<pid>/oom_score_adj based on the QoS level of the container’s associated pod and its memory request. This affects the order in which the container is selected for OOM Kill:

+ + + +

For containers in critical pods or Guaranteed pods, their oom_score_adj is set to -997.
For containers in BestEffort pods, their oom_score_adj is set to 1000.
For containers in Burstable pods, their oom_score_adj is calculated using the following formula: min{max[1000 - (1000 * memoryRequest) / memoryCapacity, 1000 + guaranteedOOMScoreAdj], 999}, where guaranteedOOMScoreAdj is -997.
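
+ + + +

The three rules above can be sketched as one function. This is an illustrative approximation rather than the kubelet source: the function name and the string QoS labels are hypothetical, and the upper bound of 999 for Burstable pods is an assumption based on the author's understanding of upstream behaviour (it keeps Burstable containers strictly below BestEffort).

```python
GUARANTEED_OOM_SCORE_ADJ = -997
BEST_EFFORT_OOM_SCORE_ADJ = 1000


def container_oom_score_adj(qos_class: str, memory_request: int,
                            memory_capacity: int, critical: bool = False) -> int:
    """Return the oom_score_adj for a container, following the rules in the text.

    qos_class is one of "Guaranteed", "Burstable", "BestEffort" (hypothetical labels).
    memory_request and memory_capacity are in bytes.
    """
    if critical or qos_class == "Guaranteed":
        return GUARANTEED_OOM_SCORE_ADJ
    if qos_class == "BestEffort":
        return BEST_EFFORT_OOM_SCORE_ADJ
    # Burstable: the larger the request relative to node capacity,
    # the lower the score, so the container is killed later.
    adj = 1000 - (1000 * memory_request) // memory_capacity
    lower = 1000 + GUARANTEED_OOM_SCORE_ADJ   # 3: stay above Guaranteed pods
    upper = BEST_EFFORT_OOM_SCORE_ADJ - 1     # 999: stay below BestEffort pods
    return min(max(adj, lower), upper)
```

For instance, a Burstable container requesting 4 GiB on a 16 GiB node gets 1000 - 250 = 750.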

+ + + +

Memory QoS

+ + + +

Starting from version 1.22, K8s introduced the Memory QoS feature based on Cgroups v2 [3]. This feature ensures memory request guarantees for containers, thereby ensuring fairness in global memory reclaim among pods. The specific Cgroups configuration is as follows:

+ + + +

memory.min: Based on requests.memory configuration.
memory.high: Based on limits.memory * memory throttling factor (or node allocatable memory * memory throttling factor) configuration.
memory.max: Based on limits.memory (or node allocatable memory) configuration.

+ + + +

In version 1.27 of K8s, enhancements were made to the Memory QoS feature to address the following issues:

+ + + +

When container requests and limits are close, the throttle threshold configured in memory.high may not be effective due to memory.high > memory.min limitation.
The calculated memory.high may be too low, resulting in frequent throttling and affecting application performance.
The default value of the memory throttling factor (0.8) is too aggressive, causing frequent throttling for some Java applications that typically use more than 85% of memory.

+ + + +

To address these issues, the following optimizations were made:

+ + + +

Improvement in the calculation method of memory.high:

+ + + +

memory.high = floor{[requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory)] / pageSize} * pageSize

+ + + +

Adjustment of the default value of the memory throttling factor to 0.9.
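
+ + + +

As a worked example of the improved calculation, the sketch below evaluates the memory.high formula for a container. The function name and the default page size of 4096 bytes are assumptions made for illustration; all values are in bytes.

```python
import math


def memory_high(requests_memory: int, limit_or_allocatable: int,
                throttling_factor: float = 0.9, page_size: int = 4096) -> int:
    """Compute memory.high per the improved formula, rounded down to a page boundary.

    limit_or_allocatable is limits.memory if the container sets a limit,
    otherwise the node allocatable memory.
    """
    raw = requests_memory + throttling_factor * (limit_or_allocatable - requests_memory)
    return math.floor(raw / page_size) * page_size
```

With a request of 1 GiB, a limit of 2 GiB and the default factor of 0.9, throttling starts at roughly 1.9 GiB. Because the result interpolates between the request and the limit, memory.high always stays at or above the request, even when the request and limit are close.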

+ + + +

Limitations

+ + + +

From the introductions in the previous two sections, we can identify the following limitations in both K8s and the Linux kernel memory management mechanisms:

+ + + +

Lack of fairness mechanism in global memory reclamation: In scenarios where memory overcommitment occurs, even if the memory usage of all containers is significantly lower than the limit, the entire node’s memory may still reach the threshold for global memory reclaim. In the widely used Cgroups v1 environment, the memory request declared by containers is not reflected in Cgroups configuration by default, but serves only as a basis for scheduling. Therefore, there is a lack of fairness guarantee in global memory reclamation among pods, and available memory for containers is not divided proportionally based on requests, unlike CPU resources.
Lack of priority mechanism in global memory reclamation: In colocation scenarios, low-priority offline containers often run resource-intensive tasks and may request a large amount of memory. However, memory reclamation does not consider the priority of the applications, leading to high-priority online containers on nodes entering the slow path of direct memory reclaim, thereby disturbing the memory QoS of online applications.
Delayed triggering of native eviction mechanisms: K8s mainly ensures the priority and fairness of memory usage through kubelet-driven eviction. However, the triggering timing of native eviction mechanisms may occur after global memory reclamation, thus not taking effect promptly.
Impact on application performance by memcg-level direct memory reclaim: When the memory usage of a container reaches a threshold, memcg-level direct memory reclaim is triggered, causing latency in memory allocation, which may lead to business jitter.

+ + + +

Katalyst Memory Advisor

+ + + +

Overall architecture

+ + + +

The architecture of Katalyst Memory Advisor has undergone multiple discussions and iterations. It adopts a pluggable design, following a framework with plugins model, which enables developers to flexibly extend functionality and policies. The scopes of each component or module are as follows:

+ + + +

Katalyst Agent: Resource management agent running on each node. The following modules are involved for memory QoS management:
- Eviction Manager: A framework that extends the native eviction policies of the kubelet. It periodically invokes interfaces of eviction plugins, retrieves the results of eviction policy calculations, and executes eviction actions.
- Memory Eviction Plugins: Plugins for the Eviction Manager. The following plugins are involved for memory QoS management:
  - System Memory Pressure Plugin: Eviction strategy based on overall system-level memory pressure.
  - NUMA Memory Pressure Plugin: Eviction strategy based on NUMA Node-level memory pressure.
  - RSS Overuse Plugin: Eviction strategy based on Pod-level RSS overuse.
  - Reclaimed Resource Pressure Plugin: Eviction strategy based on memory resource fulfillment of offline pods.
- Memory QRM Plugin: Memory resource management plugin. For memory QoS management, it handles Memcg configuration for offline pods and implements the Drop Cache action.
- SysAdvisor: Algorithm module running on each node, supporting algorithm strategy extension through plugins. The following plugins are involved for memory QoS management:
  - Cache Reaper Plugin: Calculates the trigger timing for the Drop Cache action and identifies which pods need to have their cache dropped.
  - Memory Guard Plugin: Calculates the real-time Memory Limit for offline pods.
  - Memset Binder Plugin: Dynamically calculates which NUMA Node offline pods should be bound to.
- Reporter: Out-of-band information reporting framework. For memory QoS management, it reports memory pressure-related Taints to Nodes or CustomNodeResource CRDs.
- MetaServer: Metadata management component of Katalyst Agent. For memory QoS management, it provides metadata for Pods and Containers, caches metrics, and offers dynamic configuration capabilities.
Malachite: Metrics data collection component running on each node. For memory QoS management, it provides memory-related metrics at the Node, NUMA, and Container levels.
Katalyst Scheduler: The following plugins are involved for memory QoS management:
- Native TaintToleration Plugin: Filters based on Node Taints.
- Extended QoSAwareTaintToleration Plugin: Implements scheduling prohibitions based on Taints defined in CustomNodeResource CRDs for QoS awareness.

+ + + +

Detailed design

+ + + +

Multi-dimensional interference detection

+ + + +

Memory Advisor performs periodic interference detection to proactively sense memory pressure and trigger corresponding mitigation measures. Currently, the following dimensions of interference detection are supported:

+ + + +

System and NUMA-level memory watermark: Comparing the free memory watermark at the system and NUMA levels with the threshold watermark of global asynchronous memory reclamation (Low Watermark), to avoid triggering global direct memory reclaim as much as possible.
Kswapd memory reclamation rate at the system level: If the rate of global asynchronous memory reclamation is high and continues for an extended period, it indicates significant memory pressure on the system, which may likely trigger global direct memory reclaim in the future.
Pod-level RSS overuse: Overcommitment can fully utilize a node’s memory, but it cannot control whether the overcommitted memory is used for page cache or RSS. If the RSS usage of certain pods far exceeds their request, it may result in a high node memory watermark that cannot be reclaimed. This can leave other pods unable to use sufficient page cache, leading to performance degradation, or it may result in an OOM event.
QoS-level memory resource fulfillment: By comparing the supply of reclaimed memory on the node with the total memory request of reclaimed_cores QoS level on that node, it calculates the memory resource fulfillment of offline jobs to prevent severe impacts on the service quality of offline jobs.
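
+ + + +

To make the Kswapd reclamation-rate dimension more concrete, the sketch below derives a reclaim rate from two snapshots of /proc/vmstat and checks whether it has stayed high for several consecutive samples. It is an illustration only, not Katalyst code: it assumes the kernel exposes a `pgsteal_kswapd` counter in /proc/vmstat, and the function names, threshold and sample count are hypothetical.

```python
from typing import Dict, Sequence


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse the 'name value' lines of /proc/vmstat into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kswapd_steal_rate(before: Dict[str, int], after: Dict[str, int],
                      interval_seconds: float) -> float:
    """Pages reclaimed by kswapd per second between two snapshots."""
    delta = after.get("pgsteal_kswapd", 0) - before.get("pgsteal_kswapd", 0)
    return delta / interval_seconds


def sustained_pressure(rates: Sequence[float], threshold: float,
                       min_samples: int) -> bool:
    """True if the most recent min_samples rates all exceed the threshold."""
    recent = rates[-min_samples:]
    return len(recent) == min_samples and all(r > threshold for r in recent)
```

Requiring several consecutive samples above the threshold reflects the condition in the text that the rate is both high and sustained, so that a single burst of reclamation does not trigger mitigation.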

+ + + +

Multi-tiered mitigation measures

+ + + +

Based on the different levels of abnormality feedback from interference detection, Memory Advisor supports multi-tiered mitigation measures. While avoiding interference with high-priority pods, it aims to minimize the impact on victim pods.

+ + + +

Forbid Scheduling

+ + + +

Forbidding scheduling is the least impactful mitigation measure. When any level of system abnormality is detected by interference detection, scheduling is forbidden on the node to prevent further scheduling of pods, thus preventing the situation from worsening. Currently, Memory Advisor supports this feature for all pods through Node Taint. In the future, we will enable the scheduler to be aware of taints extended in CustomNodeResource CRDs to achieve fine-grained scheduling prohibition for reclaimed_cores pods.

+ + + +

Tune Memcg

+ + + +

Tune Memcg is a mitigation measure with a relatively minor impact on victim pods. When the degree of abnormality detected by interference detection is low, Tune Memcg operations are triggered. This selects some reclaimed_cores pods and configures them with higher memory reclamation trigger thresholds to trigger memory reclamation earlier, thereby avoiding triggering global direct memory reclaim as much as possible. Tune Memcg is not enabled by default because it requires the veLinux kernel’s open-source Memcg asynchronous memory reclamation feature [4]; leaving it disabled does not affect the use of the other measures.

+ + + +

Drop Cache

+ + + +

Drop Cache is a mitigation measure with a moderate impact on victim pods. When the degree of abnormality detected by interference detection is moderate, drop cache operations are triggered. This selects some reclaimed_cores pods with high cache usage and forcefully releases their cache to avoid triggering global direct memory reclaim as much as possible. In Cgroups v1 environments, cache release is triggered through the memory.force_empty interface:

+ + + +

echo 0 > memory.force_empty

+ + + +

In Cgroups v2 environments, cache release is triggered by writing a large value to the memory.reclaim interface, such as:

+ + + +

echo 100G > memory.reclaim

+ + + +

As drop cache is a time-consuming operation, we have implemented an asynchronous task execution framework to avoid blocking the main process. Technical details of this part will be discussed in future articles.
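
+ + + +

The two shell commands above, together with the idea of running them off the main path, can be sketched as follows. This is not the Katalyst asynchronous task framework, whose details are not covered here. It is a minimal stand-in using a thread pool, and the function names and the `cgroup_dir` parameter are hypothetical.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor


def drop_cache(cgroup_dir: str, cgroup_v2: bool, reclaim_amount: str = "100G") -> str:
    """Write to the cache-release interface of one cgroup and return the path written.

    cgroup v1: write 0 to memory.force_empty.
    cgroup v2: write a large amount to memory.reclaim.
    On a real cgroup v2 hierarchy the kernel may report an error when it cannot
    reclaim the full amount, so callers may need to tolerate an OSError here.
    """
    if cgroup_v2:
        path, value = os.path.join(cgroup_dir, "memory.reclaim"), reclaim_amount
    else:
        path, value = os.path.join(cgroup_dir, "memory.force_empty"), "0"
    with open(path, "w") as f:
        f.write(value)
    return path


# A single worker keeps drop-cache operations serialized and off the caller's thread.
_executor = ThreadPoolExecutor(max_workers=1)


def drop_cache_async(cgroup_dir: str, cgroup_v2: bool) -> Future:
    """Submit the drop-cache write to a background worker and return immediately."""
    return _executor.submit(drop_cache, cgroup_dir, cgroup_v2)
```

The caller can inspect the returned future later to learn whether the write succeeded, without blocking its own loop on the reclaim.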

+ + + +

Eviction

+ + + +

Eviction is a measure with a significant impact on victim pods, but it is the fastest and most effective fallback measure. When a high degree of abnormality is detected by interference detection, eviction at the system or NUMA level (or only for reclaimed_cores pods) is triggered to effectively avoid triggering global direct memory reclaim. Memory Advisor lets users configure custom sorting logic for the pods to be evicted. If users have not configured it, the default sorting logic is as follows:

+ + + +

Sort pods based on their QoS level, with reclaimed_cores > shared_cores / dedicated_cores.
Sort pods based on their priority, with lower priority pods evicted first.
Sort pods based on their memory usage, with higher usage pods evicted first.

+ + + +

We have abstracted an eviction manager framework in Katalyst agent. This framework delegates eviction policies to plugins and consolidates eviction actions in the manager, offering the following advantages:

+ + + +

Plugins and managers can communicate through local function calls or gRPC, allowing flexible plugin start and stop.
The manager can easily support governance operations such as filtering, rate limiting, sorting, and auditing for eviction.
Support for dry run on plugins in the manager, allowing thorough validation of strategies before they take effect.
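
+ + + +

The split between policy plugins and a central manager, combined with the default sorting logic listed earlier, might look roughly like the sketch below. It is a simplified illustration and not the Katalyst API: every class, function and parameter name is hypothetical, plugins are modelled as plain local callables rather than gRPC services, and rate limiting is reduced to a per-round cap.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# reclaimed_cores pods are evicted before shared_cores / dedicated_cores pods.
QOS_RANK = {"reclaimed_cores": 0, "shared_cores": 1, "dedicated_cores": 1}


@dataclass(frozen=True)
class Pod:
    name: str
    qos_level: str
    priority: int
    memory_usage: int


def default_sort(pods: Iterable[Pod]) -> List[Pod]:
    """Default order: QoS level, then lower priority first, then higher usage first."""
    return sorted(pods, key=lambda p: (QOS_RANK.get(p.qos_level, 1),
                                       p.priority, -p.memory_usage))


EvictionPlugin = Callable[[List[Pod]], List[Pod]]


class EvictionManager:
    """Collects candidates from plugins; only the manager performs evictions."""

    def __init__(self, evict: Callable[[Pod], None], max_evictions_per_round: int = 1):
        self._plugins = []  # (plugin, dry_run) pairs
        self._evict = evict
        self._max = max_evictions_per_round
        self.audit_log: List[str] = []

    def register(self, plugin: EvictionPlugin, dry_run: bool = False) -> None:
        self._plugins.append((plugin, dry_run))

    def run_once(self, pods: List[Pod]) -> List[Pod]:
        candidates = {}
        for plugin, dry_run in self._plugins:
            for pod in plugin(pods):
                if dry_run:
                    # Dry-run plugins are audited but never cause an eviction.
                    self.audit_log.append(f"dry-run: would evict {pod.name}")
                else:
                    candidates[pod.name] = pod
        evicted = default_sort(candidates.values())[: self._max]
        for pod in evicted:
            self._evict(pod)
            self.audit_log.append(f"evicted {pod.name}")
        return evicted
```

Keeping the eviction call inside the manager is what makes filtering, rate limiting, sorting and auditing possible in one place, while a new strategy can first be registered with `dry_run=True` and observed through the audit log.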

+ + + +

Resource cap for reclaimed_cores

+ + + +

To prevent offline containers from excessively using memory and affecting the service quality of online containers, we limit the total memory usage of reclaimed_cores pods through a resource cap. Specifically, we have added a memory guard plugin to SysAdvisor. This plugin periodically calculates the total amount of memory that reclaimed_cores pods can use as a whole, and the memory QRM plugin then writes this value to the memory.limit_in_bytes file of the BestEffort cgroup.

+ + + +

Memory migration

+ + + +

For applications like Flink, service performance is strongly correlated with memory bandwidth and memory latency, and these applications also consume a significant amount of memory. The default memory allocation strategy prioritizes allocation from the local NUMA node to achieve lower memory access latency. However, it may also lead to uneven memory usage across NUMA nodes, causing certain NUMA nodes to become hotspots under excessive pressure, which severely impacts service performance and leads to latency issues. Therefore, we use Memory Advisor to monitor the memory watermark of each NUMA node and dynamically adjust the NUMA node bindings of containers for memory migration to prevent any NUMA node from becoming a hotspot. During the implementation of the dynamic memory migration feature in production environments, we encountered exceptional situations that could lead to system hangs. As a result, we optimized the method of memory migration. This practical experience will be elaborated on in subsequent blogs.

+ + + +

Differentiated memcg-level reclamation strategy

+ + + +

Given that memcg-level direct memory reclaim can significantly impact application performance, the kernel team at ByteDance has enhanced the Linux kernel (i.e. veLinux) with memcg-level asynchronous memory reclamation features, which have been open-sourced [4]. In colocation scenarios, the typical I/O activities of online applications involve reading and writing logs, whereas those of offline tasks involve more frequent file I/O operations, with page cache having a significant impact on the performance of offline jobs. Therefore, through Memory Advisor, we support differentiated memory reclamation strategies at the memcg level:

+ + + +

For applications requiring a large amount of page cache (such as offline jobs), users can specify a relatively lower memcg-level asynchronous memory reclamation threshold through pod annotations. This conservative memory reclamation approach allows for more page cache usage.
Conversely, for applications requiring minimized performance degradation due to direct memory reclaim, users can configure a relatively aggressive memcg-level asynchronous reclamation strategy through pod annotations.

+ + + +

This feature is not enabled by default as it requires patches from the veLinux kernel.

+ + + +

Future plans

+ + + +

In subsequent versions of Katalyst, we will continue to iterate on Memory Advisor to enhance its support for a wider range of user scenarios.

+ + + +

Decoupling some capabilities from QoS

+ + + +

Memory Advisor has added enhanced memory management capabilities for colocation scenarios, some of which are orthogonal to QoS and remain applicable even in non-colocation scenarios. Therefore, in subsequent iterations, we will decouple features such as memcg-level differentiated reclamation strategy, interference detection, and mitigation from QoS enhancement. This will turn them into fine-grained memory management capabilities applicable to general scenarios, enabling users in non-colocation scenarios to utilize them as well.

+ + + +

OOM priority

+ + + +

As mentioned earlier, Kubernetes configures different oom_score_adj values for containers based on the pod’s QoS level. However, the final OOM Score can still be influenced by other factors such as memory usage. In tidal colocation [5] scenarios, where offline pods belong to the same QoS level, there may be no guarantee that offline pods will be OOM-killed before online pods. Therefore, there is a need to introduce a Katalyst QoS enhancement: QoS priority. Memory Advisor should be able to configure corresponding oom_score_adj values for containers belonging to different QoS priority levels in user space, ensuring a strict OOM sequence for offline pods. Additionally, the ByteDance kernel team recently submitted a patch to the Linux kernel [6], aiming to programmatically customize the kernel’s OOM behavior through BPF hooks. This initiative seeks to enhance flexibility in defining OOM strategies.

+ + + +

Cold memory offloading

+ + + +

There may be some less frequently used memory (referred to as cold memory) on the node that has not been released, leading to a limited amount of memory available for offline job usage. This situation prevents effective memory overcommitment, as the memory that could be allocated to offline jobs remains underutilized.

+ + + +

To increase the amount of memory available for allocation, we draw on Meta's Transparent Memory Offloading (TMO) paper [7]. In the future, Memory Advisor will use the kernel's procfs-based Pressure Stall Information (PSI) interface to monitor memory pressure from user space. When memory pressure is low, it will trigger memory reclamation proactively. We will also use the kernel's DAMON subsystem to detect memory hotness and gather information on memory access patterns. This information will be used to offload cold memory to relatively inexpensive storage devices or to compress it with zRAM, saving memory space and improving memory resource utilization. The technical details of this feature will be covered in subsequent blogs.
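
To make the PSI-driven decision above more concrete, here is a minimal sketch that parses the text format of `/proc/pressure/memory` and checks whether pressure is low enough to reclaim proactively. The function names and the 0.1% threshold are assumptions chosen for illustration. They are not taken from Katalyst or from the TMO paper's tuning.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text such as the contents of /proc/pressure/memory.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    The result maps "some"/"full" to a dict of the numeric fields.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def should_reclaim_proactively(psi: dict, threshold_pct: float = 0.1) -> bool:
    """Return True when the 10-second 'some' pressure is below the threshold.

    The threshold is a hypothetical value: the idea is to reclaim cold
    memory only while workloads are barely stalled on memory.
    """
    return psi["some"]["avg10"] < threshold_pct


def read_memory_psi(path: str = "/proc/pressure/memory") -> dict:
    """Read and parse node-level memory PSI (requires a kernel with PSI enabled)."""
    with open(path) as f:
        return parse_psi(f.read())
```

A controller built on this would poll `read_memory_psi()` periodically and, while `should_reclaim_proactively()` holds, ask the kernel to reclaim a small amount of memory. On cgroup v2 one possible mechanism is writing to a cgroup's `memory.reclaim` file. The controller would back off as soon as pressure rises.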

+ + + +

Summary

+ + + +

At ByteDance, Katalyst is deployed on more than 900,000 nodes and manages tens of millions of cores, unifying the management of various workload types, including microservices, search, advertising, storage, big data, and AI jobs. Katalyst has improved daily resource utilization at ByteDance from 20% to 60% while still meeting the QoS requirements of these workload types. Going forward, we will continue to iterate on and optimize Katalyst Memory Advisor. Subsequent blogs will provide further technical insight into features such as cold memory offloading and memory migration optimizations. Stay tuned!

+ + + +

References

+ + + +

[1] A brief introduction to Katalyst: https://www.cncf.io/blog/2023/12/26/katalyst-a-qos-based-resource-management-system-for-workload-colocation-on-kubernetes/
[2] Kubernetes eviction strategy: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
[3] Memory QoS KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos
[4] Memcg-level async reclaim: https://github.com/bytedance/kernel/commit/7d7386ec89caf078f21836c5cae33ffa886125c4
[5] Tidal colocation: https://gokatalyst.io/docs/user-guide/tidal-colocation/
[6] BPF hook for selecting victim task during OOM events: https://lore.kernel.org/lkml/20230804093804.47039-1-zhouchuyi@bytedance.com/
[7] TMO paper: https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf

+ + + + + + +