By Michael Sarahan
Happy Halloween, readers. At Anaconda, we’re not too scared about things that go bump in the night. We’ve examined the data and concluded that it’s just the cleaning staff upstairs. We are, however, kept awake by the ever-present concern of the security and experience of our users! We’d like to take this opportunity to discuss some of the scary stuff out there, and what we’re doing to mitigate the risks and prevent problems.
Package building and serving process
Have you ever read one of those horrible stories around Halloween time where someone is doing nasty things to candy and hurting kids? Well, we think of the potential for maliciously modified packages like that nasty, corrupted candy. Just as the candy was produced “good” and turned “bad” down the line, we worry about people taking good packages and using them to infect users’ computers. In order to prevent this happening to packages that you get from Anaconda’s “defaults” channel, we take many steps to ensure that no one gets between us and your packages.
- We build on private servers in our data center
- Our build servers are on a dedicated network that only our build team has access to
- Our build team is required to use multifactor authentication to access the build network
- We publish sha256 values for all packages at the time we upload them. These are contained in each channel’s index file, repodata.json. For example, here’s one of the defaults channels: https://repo.anaconda.com/pkgs/main/linux-64/repodata.json
- Conda verifies these sha256 values when installing packages, so the package must match the published index information
- All of our packages are served via https, so you can verify our certificate and be confident that there’s no man in the middle
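The sha256 check described above can be sketched in a few lines of Python. This is a minimal illustration, not conda's own implementation: the helper names `sha256_of_file` and `matches_index` are hypothetical, and the sketch assumes the index maps each package filename to an entry with a `sha256` field under a top-level `packages` key.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, repodata):
    """Compare a downloaded package file with its entry in the index.

    `repodata` is the parsed repodata.json (assumed layout, see lead-in).
    """
    entry = repodata.get("packages", {}).get(Path(path).name)
    if entry is None or "sha256" not in entry:
        return False  # not listed in the index: treat as unverified
    return sha256_of_file(path) == entry["sha256"]


if __name__ == "__main__":
    # Stand-in data: a fake package file and a fake index entry for it.
    with tempfile.TemporaryDirectory() as tmp:
        pkg = os.path.join(tmp, "example-1.0-0.tar.bz2")
        with open(pkg, "wb") as fh:
            fh.write(b"package contents")
        index = {"packages": {"example-1.0-0.tar.bz2": {
            "sha256": hashlib.sha256(b"package contents").hexdigest()}}}
        print("verified:", matches_index(pkg, index))
```

A file that has been altered after upload produces a different digest, so the comparison fails and the package is rejected.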
Compiled code security
Around this time in 2017, we released version 5 of the Anaconda Distribution. That release contained major upgrades to our compiler infrastructure, which allowed us to take advantage of several new security improvements provided by compilers. We’ve detailed these in a prior blog post, but suffice it to say that where older packages presented a juicy target for buffer overflow attacks, our new packages are much harder targets. Over the past several months, we have been working with the conda-forge community to extend these security benefits to their packages.
In our efforts to improve some parts of our projects, gremlins often sneak in where we’re not looking and mess up other parts. Recently, the speed of the conda package installation process has suffered. We’re grateful for your patience in tolerating these sloth gremlins, and we’re happy to report that we’re working hard to improve the situation. For one, we’re now using Air Speed Velocity (ASV) to benchmark conda continuously. You can check out our results and our benchmark code.
ASV has already been a tremendous tool. With it, we’ve discovered that we took a large speed hit around conda version 4.3.25. That version corresponds with the addition of timestamp data to the solver. We added that for the sake of conda-build 3, with its new hashes. At the time, any modification of a conda recipe at all resulted in a different hash. The timestamp optimization was meant as a tiebreaker so that conda would choose your latest package build, if everything else was equal. Sounds innocent, right? Well, when no packages had timestamps, this caused no penalty. We’ve been increasing the number of packages with timestamps, though. On the “main” channel, which is part of our defaults channel, all the packages have these timestamps. Due to a particular way that the solver operates, the timestamps are considered earlier in the solution process than they should be, leading conda to get very confused, take a long time, and sometimes come up with bizarre solutions. With our latest conda 4.6.0 beta, we’ve disabled this timestamp optimization by default for conda, but kept it for conda-build. This change has returned conda solve speed for the anaconda metapackage to the level that it had been prior to conda 4.3.25.
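To make the intended role of the timestamp concrete, here is a minimal sketch of a tiebreaker. It is not conda's solver, which is far more involved; the `Build` class and `pick_build` function are hypothetical names, and versions are simplified to tuples of integers. The point is only the ordering: the timestamp is compared last, so it matters only when version and build number are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Hypothetical, simplified record for one build of a package."""
    name: str
    version: tuple      # e.g. (1, 15, 4); real conda versions are richer
    build_number: int
    timestamp: int      # 0 stands in for "no timestamp recorded"


def pick_build(candidates):
    """Prefer the highest version, then the highest build number.

    The timestamp comes last in the sort key, so it only breaks ties
    between builds that are otherwise equal.
    """
    return max(candidates, key=lambda b: (b.version, b.build_number, b.timestamp))


if __name__ == "__main__":
    older = Build("numpy", (1, 15, 4), 0, timestamp=100)
    newer = Build("numpy", (1, 15, 4), 0, timestamp=200)
    print(pick_build([older, newer]).timestamp)
```

The slowdown described above arose when the timestamp was weighed earlier in the solution process than this ordering intends.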
But wait, that still can be a lot slower than pip, right? Yep. Pip does not use a solver (yet?). Instead, pip considers only the constraints provided by the package that it is installing at a given point in time. It does not consider the constraints from other packages being installed, nor the constraints from packages already in your environment. Here’s where gremlins creep in and break things. These are the gremlins of software version incompatibility: the versions required by one package may not be compatible with the versions required by another package. Conda’s solver is its greatest strength for avoiding them. As long as the package authors have captured the software compatibility bounds correctly in their conda recipe, conda will not break your environment by installing software that only works with part of your environment. Doing that right is worth some extra time, in our opinion. However, we are actively working to reduce the time the solver needs, both by improving the solver itself and by rethinking the input to the solver to reduce the size of the problem that it has to solve (sharding repodata into smaller chunks). The wait time for conda’s solver is something that we’ll seek to improve over time, but we appreciate your patience and feedback in the meantime!
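The difference between the two approaches can be shown with a toy example. Everything here is hypothetical: the package names, the `AVAILABLE` and `REQUIRES` tables, and both functions. `install_sequentially` is a naive stand-in for an installer that honors only the requirements of the package it is currently installing, and `solve` is a brute-force search, not conda's actual algorithm.

```python
from itertools import product

# Hypothetical index: package name -> available versions.
AVAILABLE = {"app_a": [1], "app_b": [1, 2], "libfoo": [1, 2]}

# Hypothetical requirements: (package, version) -> {dependency: allowed versions}.
REQUIRES = {
    ("app_a", 1): {"libfoo": {1}},
    ("app_b", 1): {"libfoo": {1}},
    ("app_b", 2): {"libfoo": {2}},
}


def is_consistent(env):
    """True if every package in env has its requirements met by env."""
    for name, version in env.items():
        for dep, allowed in REQUIRES.get((name, version), {}).items():
            if env.get(dep) not in allowed:
                return False
    return True


def install_sequentially(names):
    """Install the newest version of each package in turn, honoring only
    that package's own requirements (may undo what an earlier one needed)."""
    env = {}
    for name in names:
        version = max(AVAILABLE[name])
        env[name] = version
        for dep, allowed in REQUIRES.get((name, version), {}).items():
            env[dep] = max(allowed)
    return env


def solve(names):
    """Try every version combination, newest first, and return the first
    one that satisfies all requirements together (None if there is none)."""
    choices = [sorted(AVAILABLE[n], reverse=True) for n in names]
    for combo in product(*choices):
        env = dict(zip(names, combo))
        if is_consistent(env):
            return env
    return None


if __name__ == "__main__":
    naive = install_sequentially(["app_a", "app_b"])
    print(naive, is_consistent(naive))
    print(solve(["app_a", "app_b", "libfoo"]))
```

In this toy index the sequential installer ends up with a libfoo version that app_a cannot use, while the joint search settles on an older app_b so that everything works together. The brute-force search also shows why solving takes time: the number of combinations grows quickly with the number of packages and versions.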
The community’s success in creating a vibrant package ecosystem for conda is something we’re very happy to see, but it also vexes us. We’re vexed because there are a lot of ways that it can go wrong and lead to bad experiences for people who use conda. Binary incompatibility arises in a surprisingly wide variety of ways. We’ll describe this problem in detail in an upcoming blog post. For now, we advise limiting the number of channels that you use in a given environment and keeping to a single channel as much as possible.
Conflicts with system state
Unfortunately, Anaconda doesn’t install perfectly on the first try for everyone. We’re always working to improve this, but there are a few problems that we struggle with. These are almost always related to the state of the user’s system at install time. Anaconda is designed to be self-contained and non-disruptive to your other software. Unfortunately, it’s not so simple in reality. Existing Python installations, installations of Python modules in global locations, or libraries that have the same names as Anaconda libraries can prevent Anaconda from working properly. Here are a few places to check for this, if you’re having trouble.
- All platforms:
- PYTHONHOME and PYTHONPATH environment variables are sometimes used to help Python find local code on your system. When libraries are exposed to Python this way, they override any libraries provided by Anaconda, and can cause issues. We recommend never setting PYTHONPATH or PYTHONHOME. Instead, follow the instructions for creating a simple Python package at https://python-packaging-tutorial.readthedocs.io/en/latest/
- Configuration left over from old installations can cause trouble, even if those installations are not around anymore. When you change settings for software, it often stores a file on your hard disk so that your settings are saved. These settings can be things like which GUI library to use. When such a setting conflicts with how Anaconda is configured by default, there can be problems. We have a simple tool to help clean up existing configuration, called anaconda-clean. If you use this tool, we highly recommend using it with the --backup flag, to ensure that you don’t accidentally remove anything important.
- Windows:
- Python uses a registry key to store a global location from which Python modules are loaded. If you have a previous Python installation, and you’ve installed software into this global location, this can cause Anaconda to fail. To fix this, you can try to back up and remove the files in the global location, or perhaps remove the registry key. The registry key location depends on the kind of installation, either user-local or system-wide. For user-local, the registry location is HKCU\Software\Python\PythonCore, while for system-wide, it is HKLM\Software\Python\PythonCore. An example issue caused by this behavior is at https://github.com/ContinuumIO/anaconda-issues/issues/10236
- Anaconda depends on the PATH environment variable to tell software where to find its necessary libraries. The Windows library search path is a bit more complicated than just PATH, though. We’ve seen a lot of issues when third party vendors place libraries that have the same name as Anaconda-provided libraries into locations that preclude the PATH environment variable doing what we hope it will. The Windows search path rules are posted at https://msdn.microsoft.com/en-us/library/7d83bc18.aspx?f=255&MSPPError=-2147217396. An example issue caused by this behavior is at https://github.com/ContinuumIO/anaconda-issues/issues/8561
- Linux and macOS:
- The global location that we observe problems with on Linux and macOS differs from Windows: it is not a registry key, but rather a standard folder. The .local folder is a place where pip installs software when it doesn’t have write access to the site-packages folder. Unlike the site-packages folder, the .local folder is not associated with a specific Python installation, so something that you installed for a different Python installation can be incompatible with the software that Anaconda provides. The simplest fix is to back up and remove any files that pip has installed into the .local folder. An example of confusion around this issue is at https://github.com/sagemathinc/cocalc/issues/2403. It is a delicate balance between maintaining expected standard behavior for Python and insulating Anaconda users from potential conflicts on their system.
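Two of the checks above, the environment variables and the per-user package folder, are easy to script. The following is a minimal sketch, not an official Anaconda tool. The function names are hypothetical. It relies on the standard library's `site.getusersitepackages()`, which typically points under ~/.local on Linux. It only reports what it finds and removes nothing.

```python
import os
import site
import sys


def find_python_env_overrides(environ=None):
    """Return the PYTHONHOME/PYTHONPATH settings that are set and non-empty."""
    environ = os.environ if environ is None else environ
    return {name: environ[name]
            for name in ("PYTHONHOME", "PYTHONPATH")
            if environ.get(name)}


def user_site_with_content():
    """Return the per-user site-packages path if it exists and is non-empty,
    otherwise None."""
    path = site.getusersitepackages()
    if os.path.isdir(path) and os.listdir(path):
        return path
    return None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    for name, value in find_python_env_overrides().items():
        print(f"{name} is set to {value!r}; consider unsetting it")
    found = user_site_with_content()
    if found:
        print("Packages found in the per-user location:", found)
```

Run it with the Anaconda interpreter that is misbehaving, so that the reported paths are the ones that interpreter actually sees.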
Don’t Be Scared…
We’re working on improving our software to prevent the conflicts currently haunting you, but we hope these Halloween tips & treats will help you recover from any problems in the meantime!