sudo ./gravity install command—is actually performing
the following steps:
- Perform a variety of pre-flight checks to verify the satisfaction of important system requirements
- Install and configure Docker
- Install and configure Planet, a containerized implementation of Kubernetes bundled with a set of custom cluster management tools
- Install Helm, an industry-standard tool for installing Kubernetes applications
- Load the Anaconda Enterprise container images into the internal Docker registry
- Use a standard Helm process to install the Anaconda Enterprise application
- Run final Anaconda-specific application configuration tasks
Hardware considerations
CPU and Memory: node considerations
In our preferred Gravity implementation, the primary node, where the central Kubernetes services and Anaconda Enterprise system containers run, does not host user workloads. Our standard recommendation of 16 cores and 64GB RAM provides ample headroom to ensure the correct operation of these functions. For worker nodes, where user workloads (sessions, deployments, and jobs) are scheduled, the most important quantities are the total number of cores and total amount of RAM across all worker nodes. That said, nodes with more cores and RAM are better than smaller nodes for two reasons: first, because they allow the aggregate user workload to be accommodated with less total hardware; and second, because they let the system accommodate truly large-memory workloads when necessary.
CPU and Memory: user workloads
Do not compromise the compute resources offered to your users. A decent data science laptop today ships with 6 cores and 16GB of RAM. While some of these resources are consumed by the operating system and other processes, the user's data science workloads are free to consume the vast majority.
Not all users are likely to be active at any given time, so it is not necessary to mirror this allocation on a 1:1 basis on your Anaconda Enterprise cluster. Kubernetes supports the notion of oversubscription, enabling CPU and memory allocations to exceed 100%. If we adopt a relatively standard oversubscription ratio of 4:1, we still need 75 cores and 200GB of RAM to support 50 users (50 × 6 cores ÷ 4, and 50 × 16GB ÷ 4). Rounding that down to 64 cores and 192GB of RAM seems reasonable, at least to start. Economizing further will come at the cost of productivity, and additional resources tend to be significantly less expensive than the data scientists who will use them!
The laptop comparison is imperfect in one very important respect. On a laptop, swap space can be employed to temporarily allow memory consumption in excess of the physical limit. Not so with Kubernetes: a process will be terminated if it exceeds its memory limit, likely resulting in a loss of work. This further emphasizes the need to ensure that users are given a generous memory limit, determined not by their average usage, but rather by their peak.
Some installations operate with just a single node serving both control plane and user workload functions. For installations with a small number of simultaneous users, this is a feasible approach, as long as the node is sized aggressively: say, 64 cores and 256GB RAM.
VM QoS / Oversubscription / Overcommitment
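The 50-user sizing estimate from the user workloads discussion, and the way a Kubernetes-level ratio compounds with a ratio applied at the virtual machine level, can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and are not part of any Anaconda Enterprise tooling.

```python
def cluster_size(users, cores_per_user, ram_gb_per_user, oversubscription):
    """Aggregate worker capacity when per-user allocations are divided by the ratio."""
    cores = users * cores_per_user / oversubscription
    ram_gb = users * ram_gb_per_user / oversubscription
    return cores, ram_gb


def effective_overcommit(kubernetes_ratio, vm_ratio):
    """Stacked oversubscription ratios multiply."""
    return kubernetes_ratio * vm_ratio


# Laptop baseline from the text: 6 cores and 16GB per user, 50 users, 4:1 ratio.
cores, ram_gb = cluster_size(users=50, cores_per_user=6, ram_gb_per_user=16, oversubscription=4)
print(f"{cores:g} cores, {ram_gb:g} GB RAM")  # 75 cores, 200 GB RAM

# A 4:1 Kubernetes ratio on top of a 4:1 VM-level ratio.
print(f"{effective_overcommit(4, 4)}:1")  # 16:1
```

Changing the user count or the ratio shows how quickly the aggregate requirement moves, which may help when weighing a smaller starting cluster against user productivity.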
Many on-premise data centers employ virtualization technology such as VMware to better manage compute resources. A common practice in such scenarios is oversubscription: the ability to schedule a greater number of virtual CPUs (vCPUs) than the number of physical CPUs (pCPUs) present on the system. Oversubscription is an essential component of cost-effective virtual machine management, since machines rarely see constant 100% usage levels.
Unfortunately, this approach is not necessarily compatible with Kubernetes. Kubernetes employs its own resource management strategy, including a notion of oversubscription. Our recommended practice for AE5 is to employ a ratio of 4:1 for user workloads. If this were compounded with, say, a 4:1 ratio at the virtual machine level, the true overcommitment level would be closer to 16:1. With no control over the other workloads sharing the same physical cores, there is a real risk of sporadic performance loss that impacts overall cluster health.
For this reason, we strongly recommend that any virtual machine intended to serve as an Anaconda Enterprise node be assigned to a guaranteed service class that ensures that its CPU and memory reservations are fully honored, with no oversubscription at the VM level. Allow the Kubernetes layer to manage oversubscription exclusively.
GPUs
One of the more challenging aspects of implementation is the enablement of GPU computation within Anaconda Enterprise. It is our view that NVIDIA is still in the process of maturing their “story” around the use of GPUs within Docker containers in general and Kubernetes in particular. As of February 2022, the official Kubernetes documentation about GPU scheduling marks it as an “experimental” feature.
In our experience, customers can be successful deploying GPUs in Anaconda Enterprise. Anaconda Enterprise ships a standard CUDA 11 library in user-facing containers, and the underlying Planet implementation is built with NVIDIA support components. That said, our experience leads us to offer these cautions proactively.
- GPUs cannot be shared between sessions, deployments, and jobs. That means that if a user launches a session with a GPU resource profile, that GPU is reserved for their container, even if it is idle.
- Not all versions of the NVIDIA driver set are compatible with the GPU container runtime.
- For some versions of the NVIDIA drivers, some manual rearrangement of the installed driver files is sometimes required in order for the Gravity/Planet container to “find” them.
Operating system
In this section, we highlight a number of the important considerations for our Gravity-based offering. For BYOK8s customers, these types of concerns are likely “baked-in” to the general objective of standing up a performant cluster. However, customers who build their own on-premise Kubernetes clusters will likely encounter similar concerns.
Kernel modules and settings
The system requirements provide sufficient detail on the kernel modules and other OS settings required to ensure effective operation of the Kubernetes layer. A common mistake is the failure to ensure that these settings are preserved upon reboot, so the cluster operates without incident until a system modification forces a reboot. System management software (see below) can often prevent these settings from persisting properly.
Firewall settings
Kubernetes itself actively manages the firewall settings on the master and worker nodes to ensure proper communication between nodes and pods. Introducing additional firewall settings runs the risk of interrupting Anaconda Enterprise functionality. Please make sure that additional firewall configurations are disabled or confirmed to be compatible with Anaconda Enterprise. This is another common configuration that can be corrupted by automated system management tools.
The Linux audit daemon (auditd)
The Linux audit system provides a flexible method to detect and log a variety of system issues, and is a genuinely useful tool that is commonly enabled. However, its monitoring can place a significant burden on the Kubernetes stack. For this reason, we have the following guidelines for exceptions and exclusions:
- /var/lib/gravity must be excluded from auditd monitoring.
- /opt/anaconda should be excluded as well. That said, we do not have strong evidence that system instability can be tied to monitoring of that directory.
- If managed persistence is hosted on the master node, then we encourage the exclusion of that directory as well. Conda environment management performs a significant number of disk operations, and slowing these operations can significantly diminish the user experience.
Antivirus / antimalware
Our customers utilize a variety of Linux antivirus and antimalware scanning tools, some of which include an on-demand scanning component. As with auditd, this scanning introduces a significant burden on proper Kubernetes operation. For this reason, our guidance for on-demand scanning mirrors that of auditd. In particular, /var/lib/gravity must be excluded from on-demand scanning.
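One way to audit an exclusion list, whether for auditd or for a scanning tool, is to confirm that every directory named in these guidelines is covered by an excluded path. The sketch below is only an illustration: it assumes the configured exclusions have already been collected into a plain list by hand, and it does not parse any particular tool's configuration format. The function names are hypothetical.

```python
from pathlib import PurePosixPath

# Directories named in the auditd and antivirus guidance above.
REQUIRED_EXCLUSIONS = ["/var/lib/gravity", "/opt/anaconda"]


def is_covered(path, excluded_dirs):
    """True if `path` equals, or lies beneath, any directory in `excluded_dirs`."""
    target = PurePosixPath(path)
    return any(
        target == PurePosixPath(d) or PurePosixPath(d) in target.parents
        for d in excluded_dirs
    )


def missing_exclusions(configured, required=REQUIRED_EXCLUSIONS):
    """Return the required directories that no configured exclusion covers."""
    return [p for p in required if not is_covered(p, configured)]


# Hypothetical exclusion list gathered from a tool's configuration.
print(missing_exclusions(["/var/lib/gravity/", "/tmp"]))  # ['/opt/anaconda']
```

Comparing whole path components rather than string prefixes avoids treating a sibling such as /var/lib/gravity2 as covered. If managed persistence is hosted on the master node, its directory can be appended to the required list.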
System management software
System management tools such as Chef or Puppet are a frequent culprit in the sudden loss of AE functionality. These tools are designed to automate and simplify the management of large numbers of servers. They run afoul of Anaconda Enterprise when the application requires exceptions to the configurations they enforce. It is essential that those exceptions are properly enabled. Otherwise, these tools can make fatal modifications to the underlying operating system unannounced: removing necessary kernel modules, reinstalling firewall rules, removing auditd exceptions, and others. If your organization uses tools such as these, please review the Anaconda Enterprise system requirements with their administrators and confirm that the necessary exceptions are permanently engaged, with clear documentation as to why. Otherwise, we find that customers will eventually encounter administrators who remove these important configuration details and thereby disrupt the operation of Anaconda Enterprise.
Backup solutions
Many organizations employ backup solutions on any server running critical applications or production environments. It is important to exclude Gravity from any scheduled backup, as backing it up will cause severe disk pressure. AE has its own scripts that can be used to make a backup of the application on a regular basis.
Disks
Disk space
The disk space requirements specified for Gravity installations for /var/lib/gravity, /opt/anaconda, and /tmp must be
respected. The installer includes disk space checks in its pre-flight checks.
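Free space can also be checked before running the installer. The following is a rough sketch and does not reproduce the installer's own pre-flight checks; the thresholds are placeholders chosen purely for illustration and must be replaced with the figures from the system requirements.

```python
import os
import shutil

# PLACEHOLDER thresholds in GB (assumptions for illustration only).
# Substitute the figures from the official system requirements.
PLACEHOLDER_MIN_FREE_GB = {
    "/var/lib/gravity": 100,
    "/opt/anaconda": 100,
    "/tmp": 10,
}


def nearest_existing(path):
    """Walk up from `path` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def check_free_space(min_free_gb):
    """Return {path: (free_gb, required_gb)} for each path short on space."""
    shortfalls = {}
    for path, required in min_free_gb.items():
        # A directory that does not exist yet is charged to the filesystem
        # that would hold it once created.
        free = shutil.disk_usage(nearest_existing(path)).free / 1024**3
        if free < required:
            shortfalls[path] = (free, required)
    return shortfalls


for path, (free, required) in check_free_space(PLACEHOLDER_MIN_FREE_GB).items():
    print(f"{path}: {free:.1f} GB free, {required} GB required")
```

Note that a directory whose dedicated volume has not been mounted yet will be measured against its parent filesystem, so the check is most meaningful after the final mounts are in place.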
With managed persistence, generous disk space allocations are even
more important. This disk holds a copy of every project (and one
copy for each collaborator), and every custom conda environment
created by users. A single conda environment can consume multiple
gigabytes. For this reason, we recommend that this
disk start at 1TB, and preferably support live resizing.
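To see how quickly this volume can fill, the description above can be turned into a rough estimate: each project is stored once plus once per collaborator, and every custom environment adds its own footprint. This is a back-of-the-envelope sketch; the function name and all input figures are hypothetical, not measurements.

```python
def persistence_estimate_gb(project_sizes_gb, collaborators_per_project, env_count, avg_env_gb):
    """One copy of each project, one more per collaborator, plus all custom environments."""
    project_total = sum(
        size * (1 + collaborators)
        for size, collaborators in zip(project_sizes_gb, collaborators_per_project)
    )
    return project_total + env_count * avg_env_gb


# Hypothetical inputs: three projects (2, 5, and 1 GB) with 0, 3, and 1
# collaborators, and 40 custom environments averaging 3 GB each.
estimate = persistence_estimate_gb([2, 5, 1], [0, 3, 1], env_count=40, avg_env_gb=3)
print(f"{estimate} GB")  # 144 GB
```

Even with these modest made-up inputs the environments dominate the total, which is consistent with the advice to start large and prefer storage that can be resized live.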
I/O performance
Low disk latency and high throughput in the /var/lib/gravity
directory are essential for the stability of the platform. In particular,
the master node hosts the Kubernetes etcd key-value store there.
In practice, we have found that the use of platter disks for
/var/lib/gravity is a primary cause of system instability.
Use of an SSD for this directory is effectively required.
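On Linux, the kernel reports whether a block device is rotational through sysfs, which gives a quick first check of the disk behind /var/lib/gravity. The sketch below assumes the device name (for example "sda") is already known; mapping a directory to its block device is left out. The helper name is illustrative.

```python
from pathlib import Path


def is_rotational(device, sysfs_root="/sys/block"):
    """Return True for a rotational disk, False for non-rotational, None if unknown."""
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        # The kernel writes "1" for rotational media and "0" otherwise.
        return flag.read_text().strip() == "1"
    except OSError:
        return None


# "sda" is an assumed device name; replace it with the device that
# actually backs /var/lib/gravity on your node.
print(is_rotational("sda"))
```

Treat the result as a hint rather than proof: on virtual machines the flag may not reflect the underlying physical storage, so measured latency and IOPS remain the better guide.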
Direct-attached storage is preferred whenever possible, but we
do believe that a sufficiently performant network-attached storage
volume for /opt/anaconda is acceptable. Indeed, our positive experience
with shared storage for BYOK8s installations validates this belief.
Auditing and antivirus software
As mentioned above, auditd daemons and antivirus software can significantly impact effective disk performance. For this reason, we mention here as well that the guidelines listed above for these tools must be honored.
Managed persistence
The new Managed Persistence functionality of AE5 requires the use of a shared volume that is accessible from all nodes, master and worker. So far, our customers have found that a performant enterprise NAS offers sufficient performance for their needs. It is also possible to export a directory from the master node via NFS, and we have multiple successful implementations using this approach. However, if an independent file sharing option is available, we recommend that instead, so that the master node can focus on Kubernetes-related duties. As our real-world experience with this feature is more limited, we will update these recommendations as more information comes in.
Cloud-specific concerns
Ensuring sufficient disk I/O performance is essential for a successful cloud-based implementation of AE. Fortunately, the common cloud providers make this relatively straightforward to achieve. If possible, select VMs with attached SSDs large enough to hold /var/lib/gravity. When it is necessary to use additional attached block volumes, respect the IOPS recommendations in our system requirements.
Each cloud provider offers different mechanisms for
ensuring disk performance.
- In practice, the larger the disk, the higher the base IOPS performance. If you are generous with disk space, you are less likely to have issues.
- With some providers (for example, Azure), the only mechanism for increasing performance is to increase the disk size.
- Providers like AWS offer managed IOPS, allowing you to provision size and IOPS separately. This is a reasonable approach, and may enable lower costs, but we recommend at least studying the cost of a larger disk instead of simply boosting IOPS.
Network
It is vitally important that the nodes of the cluster have unfettered access to each other. Whenever network performance is impacted by hardware or operating system issues, the Kubernetes cluster will be unstable, and thus so will Anaconda Enterprise itself.
Private networking
For very understandable reasons, customers usually need to place Anaconda Enterprise behind a firewall or VPN. It is important that this firewall does not interrupt communication between nodes, however. If possible, use private networking to connect the nodes to each other so that they may communicate over more direct connections even as the public-facing access to the cluster is restricted.
Load balancing
Anaconda Enterprise does not currently support being placed behind an SSL termination load balancer. Our experience is that it will function properly behind an SSL passthrough load balancer, however.
Proxies
Proxies may be required to access external data stores, repositories, and so forth. However, they must not be required for the nodes to speak with each other, and proxies must not be enabled at the OS/system level.
WAN accelerators (IDS, packet caching, and similar tools)
Network acceleration technology should be disabled. Kubernetes needs to manage its own traffic shaping.
Shared volume (NFS) access
As is commonly understood, losing access to an external NFS share can cause disk waits and other significant issues on Unix machines. This is true for Kubernetes clusters as well. The platform can be expected to behave unreliably until access to any attached NFS volumes is restored. Interruptions to access for the managed persistence volume in particular will be severely disruptive.
Cloud vs. On-premise
Most of our customers know in advance whether they will be deploying on on-premise hardware or on a major cloud provider (AWS, Google, Azure). Others are free to choose either, and look to us for advice on which to prefer. In our experience, cloud installations are smoother and more reliable for a number of reasons:
- It is easier to ensure that the hardware requirements are met. For each of the major cloud providers, we can recommend specific instance types that are known to provide good performance for Anaconda Enterprise.
- There tends to be less additional software installed on cloud hardware, reducing the likelihood of unexpected behavior caused by interactions with the Gravity stack.
- The provisioning process is faster, as is the process of adding additional nodes or disk when required.
- We have found it significantly easier to ensure a compatible GPU configuration in the cloud. On-premise GPU nodes often require BIOS modifications or other configuration changes to successfully deploy.
Bring Your Own Kubernetes
At a high level, many of the recommendations offered above have been developed with the assumption of an Anaconda-supplied, Gravity/Planet-based Kubernetes stack. In contrast, our BYOK8s customers will be able to leverage existing Kubernetes resources: either an on-premise Kubernetes cluster already configured to support multiple tenants, or a managed Kubernetes offering such as EKS (AWS), AKS (Azure), or GKE (Google). In these scenarios, many of the above concerns are not relevant:
- Concerns about disk performance for /var/lib/gravity are tied to the need to ensure a performant Kubernetes stack.
- Operating system requirements will likely be settled either by the Kubernetes administrator or the managed Kubernetes provider.
- Anaconda Enterprise will likely not have access to the Kubernetes control plane; instead, its own application containers will be running on worker nodes alongside user workloads.
Docker image sizes
Our Docker images are larger than many Kubernetes administrators are accustomed to. In particular, the Docker image on which users run their sessions, deployments, and jobs is nearly 20GB uncompressed. This is probably the most difficult requirement for some Kubernetes administrators to swallow. Here are a couple of points to emphasize when discussing this with your administrators.
First: this does not imply that every session, deployment, and job will consume 20GB of disk space. Docker images are shared across all containers that utilize them. Therefore, the disk space consumed by the image is amortized across all of its uses on a given node.
Second: the primary reason for this disk consumption is the set of pre-baked, global data science environments contained in this image. Future versions of AE5 will have the option to remove those environments or move them to shared storage; however, the image size is unlikely ever to drop below 5GB.
In our experience, the response to our image sizes among Kubernetes administrators is somewhat bimodal: some react strongly negatively to it, while others have already seen images of comparable size.
Resource profiles
In our experience, Kubernetes administrators who are not accustomed to serving data science workloads will be surprised by our requirements. For many microservice workloads, CPU limits of less than a single core and memory limits of less than 1GB are very common. Data science workloads require several times this much per session. On the other hand, our standard oversubscription recommendation of 4:1 (that is, the ratio between our memory/CPU limits and requests values) is a fairly standard choice. Higher levels of oversubscription will result in sporadic performance issues for your users. We reiterate here what we emphasized in the CPU and Memory section above: do not compromise the CPU and memory allocations for your users.
Storage
The /opt/anaconda/storage volume does not have the same strict performance
requirements that /var/lib/gravity has on a Gravity installation.
However, we definitely encourage the use of a “premium” performance tier for
this volume if possible.
A high-performance storage tier should be chosen for the managed persistence
volume as well. Remember, users will be interacting with that volume to
create Python environments and run data science workloads. Performance
limitations on this volume will directly impact the user experience.

