Known issues - Anaconda

Project is stuck in loading state on creation when email address is used as username

Subsequent Jobs using the “Run Now” schedule do not consume GPU resources

Unable to obtain Zeppelin credentials

Attempting to install new PyViz packages in JupyterLab results in error

Unable to download files when running JupyterLab in Chrome browser

Unexpected metadata in a package breaks Workbench channel

Custom conda configuration file may be overwritten

Incorrect information in command output

Error creating an environment immediately after installation

Cluster performance may degrade after extended use

The default limit for max_user_watches may be insufficient, and can be increased to improve cluster longevity.

Workaround

Run the following command on each node in the cluster to help the cluster remain active:

sysctl -w fs.inotify.max_user_watches=1048576

To ensure this change persists across reboots, you’ll also need to run the following command:

echo "fs.inotify.max_user_watches = 1048576" | sudo tee /etc/sysctl.d/10-fs.inotify.max_user_watches.conf
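
Before changing the setting, it may be worth checking whether a node is already at or above the target. The following is a minimal sketch, not part of the product: the function name needs_inotify_bump is hypothetical, and the commented example assumes the current value is read from /proc/sys/fs/inotify/max_user_watches.

```shell
# Hypothetical helper: succeeds (exit status 0) when the current limit is
# below the target.
# $1: current value of fs.inotify.max_user_watches
# $2: target value (defaults to the 1048576 used in the workaround)
needs_inotify_bump() {
    [ "$1" -lt "${2:-1048576}" ]
}

# Example use on a node (commented out so loading this file changes nothing):
# if needs_inotify_bump "$(cat /proc/sys/fs/inotify/max_user_watches)"; then
#     echo "max_user_watches is below 1048576 on $(hostname)"
# fi
```

Keeping the comparison in a function that takes the value as an argument makes it easy to run the same check across several nodes.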

Invalid issuer URL causes library to get stuck in a sync loop

When using the Workbench Operations Center to create an OIDC Auth Connector, if you enter an invalid issuer URL in the spec, the go-oidc library can get stuck in a sync loop. This will affect all connectors.

Workaround

On a single node cluster, you’ll need to do the following to shut down gravity:

Find the gravity services:

systemctl list-units | grep gravity

You will see output like this:

# systemctl list-units | grep gravity
gravity__gravitational.io__planet-master__0.1.87-1714.service          loaded active running
    Auto-generated service for the gravitational.io/planet-master:0.1.87-1714 package
gravity__gravitational.io__teleport__2.3.5.service                      loaded active running
    Auto-generated service for the gravitational.io/teleport:2.3.5 package

Shut down the teleport service:

systemctl stop gravity__gravitational.io__teleport__2.3.5.service

Shut down the planet-master service:

systemctl stop gravity__gravitational.io__planet-master__0.1.87-1714.service

On a multi-node cluster, you’ll need to shut down gravity AND all gravity-site pods:

kubectl delete pods -n kube-system gravity-site-XXXXX

In both cases, restart gravity services:

systemctl start gravity__gravitational.io__planet-master__0.1.87-1714.service
systemctl start gravity__gravitational.io__teleport__2.3.5.service
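
Because the version suffixes in the unit names differ between installations, the names can be extracted from the systemctl output rather than typed by hand. This is a minimal sketch, assuming the unit names follow the gravity__gravitational.io__<package>__<version>.service pattern seen in the sample output; the function name gravity_units is hypothetical.

```shell
# Hypothetical helper: read `systemctl list-units` output on stdin and print
# the planet-master and teleport unit names, one per line.
gravity_units() {
    awk '$1 ~ /^gravity__gravitational\.io__(planet-master|teleport)__.*\.service$/ { print $1 }'
}

# Example use (commented out): list the unit names to pass to systemctl.
# systemctl list-units | gravity_units
```

The helper only prints names; stopping and starting the services is left as an explicit manual step.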

GPU affinity setting reverts to default during upgrade

Install and post-install problems

Failed installations

If an installation fails, you can view the failed logs as part of the support bundle in the failed installation UI.

After executing sudo gravity enter, you can check /var/log/messages to troubleshoot a failed installation or related errors. From the same environment, you can also run journalctl to look at the logs of the gravity service:

journalctl -u gravity-23423lkqjfefqpfh2.service

Replace gravity-23423lkqjfefqpfh2.service with the name of your gravity service.
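
When reading these logs, filtering for etcd errors can narrow the search. The sketch below is an illustration only: the function name etcd_errors is hypothetical, and the two message strings are the ones this page associates with etcd problems during installation.

```shell
# Hypothetical helper: read log text on stdin and print only the lines that
# contain one of the two etcd messages named on this page.
etcd_errors() {
    grep -E 'etcd cluster is misconfigured|etcd has no leader'
}

# Example use inside the gravity environment (commented out):
# etcd_errors < /var/log/messages
# journalctl -u <your-gravity-service> | etcd_errors
```

The function exits with a non-zero status when no matching line is found, which makes it usable in a shell conditional.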

You may see messages in /var/log/messages related to errors such as “etcd cluster is misconfigured” and “etcd has no leader” from one of the installation jobs, particularly gravity-site. This usually indicates that etcd needs more compute power, needs more space, or is on a slow disk.

Workbench is very sensitive to disk latency, so we usually recommend using a better disk for /var/lib/gravity on target machines and/or putting etcd data on a separate disk. For example, you can mount etcd under /var/lib/gravity/planet/etcd on the hosts.

After a failed installation, you can uninstall Workbench and start over with a fresh installation.

Failed on pulling gravitational/rbac

If the node refuses to install and fails on pulling gravitational/rbac, create a new directory TMPDIR before installing and provide write access to user 1000.

“Cannot continue” error during install

This bug is caused by a previous failure of a kernel module check or other preflight check and subsequent attempt to reinstall.

Stop the install, make sure the preflight check failure is resolved, and restart the install again.

Problems during post-install or post-upgrade steps

Post-install and post-upgrade steps run as Kubernetes jobs. When they finish running, the pods used to run them are not removed. These and other stopped pods can be found using:

kubectl get pods -A

The logs in each of these three pods will be helpful for diagnosing issues in the following steps:

Pod	Issues in this step
`ae-wagonwheel`	post-install UI
`install`	installation step
`postupdate`	post-update steps
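
To pick the stopped pods out of a long listing, the output of kubectl get pods -A can be filtered on its STATUS column. This is a minimal sketch, assuming the default column layout (NAMESPACE NAME READY STATUS RESTARTS AGE); the function name stopped_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# "namespace/name status" for every pod whose STATUS is not Running.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
stopped_pods() {
    awk 'NR > 1 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Example use (commented out):
# kubectl get pods -A | stopped_pods
# kubectl logs -n <namespace> <pod-name>
```

The namespace and pod name it prints are the two values kubectl logs needs to show the logs for a given step.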

Post-install configuration doesn’t complete

After completing the post-install steps, clicking FINISH SETUP may not close the screen, which prevents you from continuing.

Workaround

You can complete the process by running the following commands within gravity:

To determine the site name:

SITE_NAME=$(gravity status --output=json | jq '.cluster.token.site_domain' -r)

To complete the post-install process:

gravity --insecure site complete

Re-starting the post-install configuration

To reinitialize the post-install configuration UI, whether to regenerate temporary (self-signed) SSL certificates or to reconfigure the platform based on your domain name, you must re-create and re-expose the service on a new port.

First, export the deployment’s resource manifest:

helm template --name anaconda-enterprise /var/lib/gravity/local/packages/unpacked/gravitational.io/AnacondaEnterprise/5.X.X/resources/Anaconda-Enterprise/ -x /var/lib/gravity/local/packages/unpacked/gravitational.io/AnacondaEnterprise/5.X.X/resources/Anaconda-Enterprise/templates/wagonwheel.yaml > wagon.yaml

Edit wagon.yaml, replacing:

image: ae-wagonwheel:5.X.X

with:

image: leader.telekube.local:5000/ae-wagonwheel:5.X.X
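
The same replacement can be scripted instead of made in an editor. This is a sketch only, assuming the manifest contains the image reference exactly in the form shown; the function name use_local_wagonwheel_image is hypothetical.

```shell
# Hypothetical helper: read the manifest on stdin and point the ae-wagonwheel
# image at the leader.telekube.local:5000 registry, leaving the tag untouched.
use_local_wagonwheel_image() {
    sed 's#image: ae-wagonwheel:#image: leader.telekube.local:5000/ae-wagonwheel:#'
}

# Example use (commented out): rewrite wagon.yaml via a temporary file.
# use_local_wagonwheel_image < wagon.yaml > wagon.yaml.new && mv wagon.yaml.new wagon.yaml
```

Running the filter a second time leaves the file unchanged, because the rewritten line no longer matches the original pattern.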

Recreate the ae-wagonwheel deployment using the updated YAML file:

kubectl create -f /var/lib/gravity/site/packages/unpacked/gravitational.io/AnacondaEnterprise/5.X.X/resources/wagon.yaml -n kube-system

Replace 5.X.X with your actual version number.

To ensure the deployment is running in the system namespace:

sudo gravity enter
kubectl get deploy -n kube-system

One of these should be ae-wagonwheel, the post-install configuration UI. To make this visible to the outside world, run:

kubectl expose deploy ae-wagonwheel --port=8000 --type=NodePort --name=post-install -n kube-system

This will run the UI on a new port, allocated by Kubernetes, under the name post-install.

To find out which port it is listening on, run:

kubectl get svc -n kube-system | grep post-install

Then navigate to:

http://<your domain>:<this port>

to access the post-install UI.
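
The port number can also be extracted from the service listing directly. This is a minimal sketch, assuming the default kubectl get svc columns (NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE) and a PORT(S) value of the form 8000:<node port>/TCP; the function name post_install_port is hypothetical.

```shell
# Hypothetical helper: read `kubectl get svc -n kube-system` output on stdin
# and print the node port of the service named post-install.
# Assumes PORT(S) is the fifth column and looks like 8000:<node port>/TCP.
post_install_port() {
    awk '$1 == "post-install" { split($5, parts, "[:/]"); print parts[2] }'
}

# Example use (commented out):
# port=$(kubectl get svc -n kube-system | post_install_port)
# echo "http://<your domain>:$port"
```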

Kernel parameters may be overwritten and cause networking errors

Removing collaborator from project with open session generates error

Workbench auth pod throws OutOfMemory Error

If you see an exception similar to the following, Workbench has exceeded the maximum heap size for the JVM:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "default task-248"
2018-08-29 23:13:26.327 UTC ERROR    XNIO001007: A channel event listener threw an exception: java.lang.OutOfMemoryError: Java heap space (default I/O-36) [org.xnio.listener]
2018-08-29 23:12:32.823 UTC ERROR    UT005023: Exception handling request to /auth/realms/AnacondaPlatform/protocol/openid-connect/token: java.lang.OutOfMemoryError: Java heap space (default task-86) [io.undertow.request]
2018-08-29 23:13:01.353 UTC ERROR    XNIO001007: A channel event listener threw an exception: java.lang.OutOfMemoryError: Java heap space

Workaround

Increase the JVM max heap size by doing the following:

Open the anaconda-enterprise-ap-auth deployment spec by running the following command in a terminal:
```
kubectl edit deploy anaconda-enterprise-ap-auth
```

Increase the value for JAVA_OPTS (example below):

spec:
  containers:
  - args:
    - cp /standalone-config/standalone.xml /opt/jboss/keycloak/standalone/configuration/
      && /opt/jboss/keycloak/bin/standalone.sh -Dkeycloak.migration.action=import
      -Dkeycloak.migration.provider=singleFile -Dkeycloak.migration.file=/etc/secrets/keycloak/keycloak.json
      -Dkeycloak.migration.strategy=IGNORE_EXISTING -b 0.0.0.0
  command:
  - /bin/sh
  - -c
  env:
  - name: DB_URL
    value: anaconda-enterprise-postgres:5432
  - name: SERVICE_MIGRATE
    value: auth_quick_migrate
  - name: SERVICE_LAUNCH
    value: auth_quick_launch
  - name: JAVA_OPTS
    value: -Xms64m -Xmx2048m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m
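
To see what the edited value should look like before changing the deployment, the -Xmx option can be rewritten in a copy of the string. This is a sketch under stated assumptions: the function name set_max_heap is hypothetical, and the 4096m in the example is an arbitrary illustration, not a recommended size.

```shell
# Hypothetical helper: print a JAVA_OPTS string with its -Xmx value replaced.
# $1: existing JAVA_OPTS string
# $2: new maximum heap size, for example 4096m (illustrative value only)
set_max_heap() {
    printf '%s\n' "$1" | sed "s/-Xmx[0-9]*[kKmMgG]*/-Xmx$2/"
}

# Example use (commented out):
# set_max_heap '-Xms64m -Xmx2048m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m' 4096m
```

Only the -Xmx option is touched; the initial heap and metaspace settings pass through unchanged.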

Affected versions

5.2.1

Fetch changes behavior in Apache Zeppelin may not be obvious to new users

Apache Zeppelin can’t locate conflicted files or non-Zeppelin notebook files

Create and Installer buttons are not visible on Channels page

Updating a package from the Anaconda metapackage

File size limit when uploading files

IE 11 compatibility issue when using Bokeh in projects (including sample projects)

IE 11 compatibility issue when downloading custom Anaconda installers

Project names over 40 characters may prevent JupyterLab launch

Long-running jobs may falsely report failure

New Notebook not found on IE11

Disk pressure errors on AWS

If your Workbench instance is on Amazon Web Services (AWS), overloading the system with reads and writes to the directory /opt/anaconda can cause disk pressure errors, which may result in the following:

Slow project starts.
Project failures.
Slow deployment completions.
Deployment failures.

To verify whether disk pressure is the cause, check the logs:

List all nodes:
```
kubectl get node
```
Identify the node experiencing issues and view its log:
```
kubectl describe node <master-node-name>
```
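
The node description is long, so it can help to extract only the disk pressure status. This is a minimal sketch, assuming the usual Conditions table in kubectl describe node output, where each row starts with the condition type followed by its status; the function name disk_pressure_status is hypothetical.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the Status value of the DiskPressure condition (normally True or False).
disk_pressure_status() {
    awk '$1 == "DiskPressure" { print $2 }'
}

# Example use (commented out):
# kubectl describe node <master-node-name> | disk_pressure_status
```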

If there is disk pressure, the output reports it, typically as a DiskPressure condition with a status of True.

Workaround

To relieve disk pressure, you can add disks to the instance by adding another Elastic Block Store (EBS) volume. If the disk pressure is caused by a backup, move the backup files elsewhere (for example, to an NFS mount). See Backing up and restoring Workbench for more information.

Steps to add disks:

Open the AWS console and add a new EBS volume provisioned to 3000 IOPS (for example, 500 GB).
Attach the volume to your Workbench master.
Format and mount the disk:
```
fdisk /dev/nvme1n1
mkfs /dev/nvme1n1p1
```

Shut down Kubernetes services:

systemctl stop gravity__gravitational.io__teleport__2.3.5.service
systemctl stop gravity__gravitational.io__planet-master__0.1.87-1714.service

Copy /opt/anaconda contents to the new disk:
```
rsync -va /opt/anaconda/ /opt/aetmp/
```

Update /etc/fstab and restart services:

systemctl start gravity__gravitational.io__teleport__2.3.5.service
systemctl start gravity__gravitational.io__planet-master__0.1.87-1714.service

Disk pressure error during backup

If a disk pressure error occurs while backing up your configuration, the amount of data being backed up has likely exceeded the available storage space. This triggers the Kubernetes eviction policy and causes the backup to fail.

To check your eviction policy, run the following commands on the master node:

sudo gravity enter
systemctl status | grep "/usr/bin/kubelet"
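
The kubelet command line is usually long, so the eviction settings are easy to miss. The sketch below splits it into one flag per line and keeps only the eviction flags. It assumes the kubelet was started with --eviction-* options on its command line; the function name eviction_flags is hypothetical.

```shell
# Hypothetical helper: read a kubelet command line on stdin and print each
# flag that starts with --eviction- on its own line.
eviction_flags() {
    tr -s ' ' '\n' | grep -e '^--eviction-'
}

# Example use inside the gravity environment (commented out):
# systemctl status | grep "/usr/bin/kubelet" | eviction_flags
```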

Workaround

Restart the backup process, specifying a location with sufficient space (for example, an NFS mount). See Backing up and restoring Workbench for more information.

General diagnostic and troubleshooting steps

Entering a Gravity Workbench environment

To enter the Workbench environment and gain access to kubectl and other commands within Workbench, use the command:

sudo gravity enter

Moving files and data

Occasionally you may need to move files and data from the host machine to the Workbench environment. If so, there are two shared mounts to pass data back and forth between the two environments:

host: /opt/anaconda/ -> Workbench environment: /opt/anaconda/
host: /var/lib/gravity/planet/share -> Workbench environment: /ext/share

If data is written to either of the locations, that data will be available on both the host machine and within the Workbench environment.

Debugging

AWS traffic needs to handle the public IPs and ports. You should either use a canonical security group with the proper ports opened or manually add the specific ports listed in Network Requirements.

Problems during air gap project migration

The command anaconda-project lock over-specifies the channel list, which triggers a conda bug where conda adds defaults from the internet to the list of channels.

Solution:

Add default_channels to the .condarc. This way, when conda adds “defaults” to the command, it adds the internal repo server and not the repo.continuum.io URLs.

EXAMPLE:

default_channels:
- anaconda
channels:
- our-internal
- out-partners
- rdkit
- bioconda
- defaults
- r-channel
- conda-forge
channel_alias: https://:8086/conda
auto_update_conda: false
ssl_verify: /etc/ssl/certs/ca.2048.cer
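
A quick check can confirm that a given .condarc actually defines default_channels before a migration is attempted. This is a minimal sketch that only looks for the top-level key and does not validate the channel list; the function name has_default_channels is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) when the .condarc file given
# as $1 contains a top-level default_channels key.
has_default_channels() {
    grep -q '^default_channels:' "$1"
}

# Example use (commented out):
# has_default_channels "$HOME/.condarc" || echo "default_channels is not set"
```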

LDAP error in ap-auth

[LDAP: error code 12 - Unavailable Critical Extension]; remaining name 'dc=acme, dc=com'

This error can be caused when pagination is turned on. Pagination is a server side extension and is not supported by some LDAP servers, notably the Sun Directory server.

Session startup errors

If you need to troubleshoot session startup, you can use a terminal to view the session startup logs. When session startup begins, the output of the anaconda-project prepare command is written to /opt/continuum/preparing, and when the command completes, the log is moved to /opt/continuum/prepare.log.
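
Because the log changes name when preparation finishes, a small helper can choose whichever file currently exists. This is a sketch only: the function name session_prepare_log is hypothetical, and the optional directory argument exists for testing, with /opt/continuum as the default.

```shell
# Hypothetical helper: print the path of the session startup log.
# Prefers the in-progress file (preparing) and falls back to the finished
# log (prepare.log). Returns non-zero if neither file exists.
# $1: directory to look in (defaults to /opt/continuum)
session_prepare_log() {
    dir=${1:-/opt/continuum}
    if [ -f "$dir/preparing" ]; then
        printf '%s\n' "$dir/preparing"
    elif [ -f "$dir/prepare.log" ]; then
        printf '%s\n' "$dir/prepare.log"
    else
        return 1
    fi
}

# Example use in a session terminal (commented out):
# log=$(session_prepare_log) && tail -n 50 "$log"
```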

Data Science & AI Workbench (5.8.1)