Introducing Remote Content Caching with FSSpec

 

Fsspec is a library which acts as a common pythonic interface to many file system-like storage backends, such as remote (e.g., SSH, HDFS) and cloud (e.g., GCS, S3) services.

In this article, we will present its new ability to cache remote content, keeping a local copy for faster lookup after the initial read. Similar text first appeared in the fsspec documentation, but here we provide more details and use cases.

This work was inspired by the caching mechanism in Intake, which proved useful and popular, but was a) rather difficult to use for all but the simplest cases and b) only available within Intake catalogs, so of no use to anyone else. We have therefore made a similar concept available at a lower level.

Now, caching of whole or partial remote data is available to anyone who uses python files!

The Intake experience

Intake is all about describing data sources in catalogs, so that the right one can be found for a particular job, and the data can be loaded into python with the minimum of effort on the part of the user.

There are two principal reasons that you might want a local copy of remote data:

  • If the data are to be accessed multiple times and local storage is much faster than reading from remote, then copying on first access can prove a significant time saver.
  • Python is a very versatile language and hooks into many external libraries that were originally written in C or another compiled language. Even though fsspec provides access to remote data as if they were files, by implementing the python file-like interface, compiled code will usually require a real local file to work with.
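The second point can be illustrated with the standard library alone. In the sketch below, sqlite3 (which wraps the compiled SQLite library) stands in for any library that only accepts a path, and an in-memory BytesIO stands in for a remote file object. Both choices, and all file names, are assumptions made for illustration; neither fsspec nor Intake is involved.

```python
import io
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# Build a small SQLite database and read its raw bytes; the BytesIO below
# is a stand-in for a file-like object returned by a remote file system.
source_path = os.path.join(workdir, "source.db")
con = sqlite3.connect(source_path)
con.execute("CREATE TABLE iris (species TEXT)")
con.execute("INSERT INTO iris VALUES ('Iris-setosa')")
con.commit()
con.close()
with open(source_path, "rb") as f:
    remote_file = io.BytesIO(f.read())

# sqlite3.connect() takes a path, not a Python file object, so the bytes
# have to land in a real local file before the compiled code can use them.
local_path = os.path.join(workdir, "local_copy.db")
with open(local_path, "wb") as out:
    shutil.copyfileobj(remote_file, out)

con = sqlite3.connect(local_path)
rows = con.execute("SELECT species FROM iris").fetchall()
con.close()
print(rows)
```

Copying to a local file is exactly the step that a whole-file cache performs on the user's behalf.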

In addition, there are many nice extra benefits that can be provided, as in the Intake implementation, such as decompression on download, or fetching from systems that are not usually thought of as file systems but are mechanisms for data distribution nonetheless (e.g., DAT project and git).

Unfortunately, trying to create a spec to describe all of this, and to have Intake manage the cache through its internal config system, proved difficult to implement. When we wanted more features, such as cache expiry and multiple storage locations, reworking the code turned out to be intractable.

fsspec

“Filesystem spec”, or fsspec for short, is a new project to unify access to data in various locations other than the local disc. It provides a simple, comprehensive and familiar API, which is uniform whether you are accessing Amazon’s S3 service, an HDFS cluster or a remote server over SSH.

Since other file system implementations, Dask and Intake have all come to depend on fsspec, it is the obvious place to implement file caching, to make this facility available to all users. Also, the code is structured in a way which makes writing a caching layer as a file system implementation of its own rather easy.

File caching

The simplest thing you may wish to do is copy a remote file locally on the first access, and thereafter refer to the local copy. This is the “filecache” file system, implemented by the WholeFileCacheFileSystem class. You need to provide the protocol and any options for the remote file system; it will make calls on that remote system to list and download files, but use the local copy once downloaded.

As an example, in a previous article we showed how to incorporate remote storage with fsspec:

import fsspec
of = fsspec.open("s3://anaconda-public-datasets/iris/iris.csv", mode='rt', anon=True)
with of as f:
    print(f.readline())

produces the first line of data, the first specimen in the Iris dataset (“5.1,3.5,1.4,0.2,Iris-setosa”). This opens the remote file every time and downloads data. If we want to seamlessly provide local caching, we can do

import fsspec
of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt', 
                 cache_storage='/tmp/cache1',
                 target_protocol='s3', target_options={'anon': True})
with of as f:
    print(f.readline())

This produces the same output, but now we have a couple of files in the cache directory (/tmp/cache1), “cache” and “f89e764b2ba1a15b39e656eba3c67e583f8497bb68dfa760f07618deac3db7ff”. The second is the copy of the file from remote (with a hashed name), and the first is the metadata for all stored data. Now if you open the file again, the remote location is not polled, and the open happens much faster. Also, the output file f is in text mode, but, of course, the stored file is the original bytes from the remote source.
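To make the mechanism concrete, here is a minimal standard-library sketch of the idea described above: one hashed file per remote path, plus a single metadata file named “cache”. It is not fsspec's implementation. The function cached_open, the fetch callable, the JSON metadata layout and the choice of SHA-256 over the path are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile

def cached_open(path, fetch, cache_dir, mode="rb"):
    """Open a local copy of ``path``, calling ``fetch(path)`` only on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    meta_file = os.path.join(cache_dir, "cache")
    meta = {}
    if os.path.exists(meta_file):
        with open(meta_file) as f:
            meta = json.load(f)
    if path not in meta:
        # Cache miss: download once and store the bytes under a hashed name.
        name = hashlib.sha256(path.encode()).hexdigest()
        with open(os.path.join(cache_dir, name), "wb") as f:
            f.write(fetch(path))
        meta[path] = name
        with open(meta_file, "w") as f:
            json.dump(meta, f)
    # The stored file is always raw bytes; text decoding happens on open.
    return open(os.path.join(cache_dir, meta[path]), mode)

calls = []

def fake_fetch(path):
    # Hypothetical stand-in for a remote download.
    calls.append(path)
    return b"5.1,3.5,1.4,0.2,Iris-setosa\n"

cache_dir = tempfile.mkdtemp()
for _ in range(2):
    with cached_open("bucket/iris.csv", fake_fetch, cache_dir, mode="rt") as f:
        first_line = f.readline()
print(first_line.strip(), len(calls), sorted(os.listdir(cache_dir)))
```

The second pass through the loop never calls fake_fetch, which is the behaviour the “filecache” example relies on.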

Partial file caching

It makes sense to copy whole files locally if you intend to read them entirely. However, in many cases you may not want to read the whole thing, because you lack the storage space or don’t wish to wait to download the whole thing.

For example, the file “s3://anaconda-public-datasets/gdelt/csv/20150906.export.csv” is somewhat bigger, 47MB (of course files can get much, much bigger than that!).

of = fsspec.open("blockcache://anaconda-public-datasets/gdelt/csv/20150906.export.csv", 
                 mode='rt', target_protocol='s3', cache_storage='/tmp/cache2',
                 target_options={'anon': True, "default_block_size": 2**20})
with of as f:
    print(f.read(1000))

Again, running this a second time returns much faster than the first, because the data exists locally. Interestingly, though, listing the storage directory shows the following

total 2064
-rw-r--r--  1 mdurant   47319281 11 Oct 14:28 6edb94fb86a5c48a6d3993efba3b8fa1ff62af1b920f621ac39ffdff8a15c7e4
-rw-r--r--  1 mdurant        267 11 Oct 14:28 cache

On my system and the given disc, sparse files are possible, so running du on the directory (this command is available on Linux and macOS) shows that the apparently 47MB file only takes up 1MB of space, the “default_block_size” that was chosen in the code. More blocks would be filled in as we read through the file, but if this is the only part of the file that we are interested in, this is exactly the behaviour that we would like.

This second form of caching does come with a couple of caveats: you will only get the “sparse” behaviour if your OS and disc file system support it; the output is a python file object, so it only works with code that accepts this (i.e., python stuff, not C); and it only works where the fsspec implementation provides file instances that are also based on fsspec (true for local files, s3, gcs, ftp, hdfs).
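The block-wise behaviour can be sketched with the standard library as follows. This is an illustration of the general technique, not fsspec's code: the class BlockCachedReader, the fetch_range callable and the in-memory set of cached block indices are all hypothetical, and whether the local file is actually sparse depends on the OS and file system, as noted above.

```python
import os
import tempfile

class BlockCachedReader:
    """Read byte ranges, fetching each fixed-size block at most once."""

    def __init__(self, fetch_range, size, local_path, block_size=2**20):
        self.fetch_range = fetch_range  # callable(start, end) -> bytes
        self.size = size
        self.block_size = block_size
        self.local_path = local_path
        self.cached = set()  # indices of blocks already written locally
        # Give the local file its full length without writing any data;
        # on file systems that support it, this leaves a sparse file.
        with open(local_path, "wb") as f:
            f.truncate(size)

    def read(self, start, length):
        end = min(start + length, self.size)
        if end <= start:
            return b""
        first = start // self.block_size
        last = (end - 1) // self.block_size
        with open(self.local_path, "r+b") as f:
            for block in range(first, last + 1):
                if block not in self.cached:
                    lo = block * self.block_size
                    hi = min(lo + self.block_size, self.size)
                    f.seek(lo)
                    f.write(self.fetch_range(lo, hi))
                    self.cached.add(block)
            f.seek(start)
            return f.read(end - start)

remote = bytes(range(256)) * 40_000  # roughly 10MB stand-in for a remote file
fetched = []

def fetch_range(start, end):
    fetched.append((start, end))
    return remote[start:end]

path = os.path.join(tempfile.mkdtemp(), "blocks")
reader = BlockCachedReader(fetch_range, len(remote), path)
head = reader.read(0, 1000)
head_again = reader.read(0, 1000)  # served from the local file, no fetch
print(fetched)
```

The local file reports the full remote size from the start, yet only the blocks that were actually read have been fetched and written.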

Note that I have specified the cache locations of the two examples not to overlap, because filecache will error if it encounters the partial files created by blockcache.

More fun

File system chaining

By using the fsspec classes directly, you can chain together some pretty complex behaviours. For example, consider:

import fsspec
import pandas

fs = fsspec.filesystem('http')
f = fs.open('http://www.bkk.hu/gtfs/budapest_gtfs.zip')
fs2 = fsspec.filesystem('zip', fo=f)
# Note: newer fsspec releases may expect the instance as fs=fs2 rather than target_protocol=fs2
fs3 = fsspec.filesystem('filecache', target_protocol=fs2, cache_storage='/tmp/cache3')
f2 = fs3.open('stops.txt', 'rt')
df = pandas.read_csv(f2)
df.head()

This opened a compressed archive on a remote server, and cached only one contained file (“stops.txt”) locally, and passed this to Pandas.

Cache expiry and checking

Multiple caching file system instances can in theory access the same local storage and be aware of all of the files there: the cache metadata is reloaded automatically on a set cadence, every 10s by default.

Conversely, the cache can use multiple storage areas to check for data. It will search these in the order given, and if a file is not found in any of them, it will fetch from remote as usual. This would allow for a “level 2” cache on a shared network drive for data that is accessed frequently by people on the network. Only the last cache location in the list given would get written to in this scenario.
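The lookup order described here can be sketched in a few lines. The functions locate and get, the fetch callable and the directory layout are hypothetical names chosen for illustration; this shows the search-then-write-to-last rule only, not fsspec's implementation.

```python
import os
import tempfile

def locate(name, storage):
    """Return the first existing copy of ``name``, searching ``storage`` in order."""
    for directory in storage:
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return candidate
    return None

def get(name, fetch, storage):
    """Return a local path for ``name``; on a miss, write to the last location only."""
    found = locate(name, storage)
    if found is not None:
        return found
    target = os.path.join(storage[-1], name)
    with open(target, "wb") as f:
        f.write(fetch(name))
    return target

shared = tempfile.mkdtemp()    # stand-in for a shared network drive
personal = tempfile.mkdtemp()  # the user's own writable cache
with open(os.path.join(shared, "popular.csv"), "wb") as f:
    f.write(b"a,b\n")

downloads = []

def fetch(name):
    # Hypothetical stand-in for a remote download.
    downloads.append(name)
    return b"x,y\n"

hit = get("popular.csv", fetch, [shared, personal])
miss = get("rare.csv", fetch, [shared, personal])
print(hit, miss, downloads)
```

Here the shared location is only ever read, so it can be mounted read-only for most users.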

Both filecache and blockcache allow for expiry of cached files – when the on-disc version is older than a certain number of seconds, it will be updated from the remote source. In the case of sparse files, this is since the creation of the local file (i.e., the age of the oldest block).
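As a rough sketch of the expiry rule, assume each cached file has a metadata record holding the time its local file was created. The record layout, the key names and the function is_expired are hypothetical, not fsspec's actual metadata format.

```python
import time

def is_expired(entry, expiry_seconds, now=None):
    """True if the local file was created more than ``expiry_seconds`` ago."""
    now = time.time() if now is None else now
    return now - entry["time"] > expiry_seconds

# The creation time is recorded once and is not updated as later blocks of
# a sparse file are filled in, so the age is that of the oldest block.
entry = {"fn": "hypothetical-hashed-name", "time": 1_000_000.0}

fresh = is_expired(entry, expiry_seconds=3600, now=entry["time"] + 10)
stale = is_expired(entry, expiry_seconds=3600, now=entry["time"] + 7200)
print(fresh, stale)
```

An expired entry would then be treated like a cache miss and fetched again from the remote source.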

Finally, where the backend supports it (not HTTP, but basically everything else), you can ask the cache system to read the checksum or other unique identifier of the remote file on each access, so you can always be aware of whether it has changed, and so keep the cache up to date. Naturally, this still takes a little time, but generally much less time than downloading the whole thing from remote on every read.
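The check-on-each-access idea can be sketched as follows. FakeRemote, its identifier method and read_checked are hypothetical; a real backend would report an inexpensive identifier such as a checksum without transferring the file, whereas this stand-in simply hashes its in-memory bytes.

```python
import hashlib

class FakeRemote:
    """Stand-in for a remote store that can report an identifier per file."""

    def __init__(self):
        self.files = {}
        self.downloads = 0

    def identifier(self, path):
        # A real service would return this without sending the content.
        return hashlib.sha256(self.files[path]).hexdigest()

    def download(self, path):
        self.downloads += 1
        return self.files[path]

def read_checked(path, remote, cache):
    """Return cached bytes, downloading again only if the identifier changed."""
    uid = remote.identifier(path)
    entry = cache.get(path)
    if entry is None or entry["uid"] != uid:
        entry = {"uid": uid, "data": remote.download(path)}
        cache[path] = entry
    return entry["data"]

remote = FakeRemote()
remote.files["data.csv"] = b"v1"
cache = {}
a = read_checked("data.csv", remote, cache)
b = read_checked("data.csv", remote, cache)  # unchanged: no download
remote.files["data.csv"] = b"v2"
c = read_checked("data.csv", remote, cache)  # changed: downloaded again
print(a, b, c, remote.downloads)
```

Each access costs one identifier lookup, and a full download happens only when the remote content has changed.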

Summary

Caching at the file system level is available to all, wherever you get your data, via fsspec. This approach has turned out to be so simple to both implement and use that it will soon become the recommended approach within Intake as well.

 

