Intake: Parsing Data from Filenames and Paths

 

Motivation

Do you have data in collections of files, where information is encoded both in the contents and in the file and directory names, perhaps something like '{year}/{month}/{day}/{site}/measurement.csv'? This is a very common problem for which people write custom code all the time. Intake provides a systematic way to declare that information in a concise spec.

What is Intake?

Intake is a lightweight set of tools for loading and sharing data. You might have seen earlier blog posts introducing Intake and describing caching. Intake separates the role of the data engineer (the person curating, managing, and disseminating data) from that of the data user (the person analyzing and visualizing the data). The data engineer sets up catalog files describing data sources, and the data user loads data without needing to know how it is stored. Intake is designed to make new functionality easy to add; here we show a new feature for dealing with structured file names.

How to use it – data user

Intake abstracts away messy data storage practices so that data users don’t need to know about them. You get a full dataset from all the files, together with the information from the filenames, in just two lines of code:

cat = intake.load_catalog('catalog.yml')
data = cat.data_source().read()

As an example, we’ll use a catalog with real satellite imagery data from the Landsat project to calculate the Normalized Difference Vegetation Index (NDVI). An interactive session of the example is available on Binder, and the notebook is available for download.

How to use it – data engineer

When loading multiple files, the locations of the files can often be provided as a list or as a glob (a path containing "*" wildcards). For instance, let’s suppose that we have a number of CSV files containing precipitation forecasts for a number of models run under different emissions scenarios. Each forecast is stored in one file, with the model and emissions scenario encoded in the filename. The following glob pattern would match all of the files:

urlpath: 'data/SRLCC_*_Precip_*.csv'

To capture the data encoded in the names of the files, we can replace the "*" wildcards with field names, as follows, making what we’ll refer to as a path pattern:

urlpath: 'data/SRLCC_{emissions}_Precip_{model}.csv'

When the data source is opened, the value of each declared field is parsed from the path or filename and returned as part of the data.
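To see how a path pattern relates to the glob it replaces, here is a minimal sketch that uses only the Python standard library. The helper name pattern_to_glob is hypothetical and is not part of Intake's API; it simply swaps each named field for a "*" wildcard, which is one plausible way to find candidate files before their names are parsed. The extra path 'data/README.txt' is made up for illustration.

```python
import fnmatch
from string import Formatter


def pattern_to_glob(pattern):
    """Replace every {field} in a format-style pattern with a '*' wildcard."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(pattern):
        parts.append(literal)
        if field is not None:  # None marks trailing literal text with no field
            parts.append('*')
    return ''.join(parts)


pattern = 'data/SRLCC_{emissions}_Precip_{model}.csv'
glob_pattern = pattern_to_glob(pattern)
print(glob_pattern)

# Check which candidate paths the derived glob would select.
candidates = [
    'data/SRLCC_a1b_Precip_ECHAM5-MPI.csv',
    'data/SRLCC_b1_Precip_PCM-NCAR.csv',
    'data/README.txt',  # hypothetical unrelated file
]
matches = [p for p in candidates if fnmatch.fnmatch(p, glob_pattern)]
print(matches)
```

The derived glob is the same one shown earlier, so the path pattern carries strictly more information than the glob: it selects the same files and also names the pieces that vary.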

When passing an explicit list of paths, the argument path_as_pattern can be used to pass the pattern we want applied to the filenames:

urlpath:
  - 'data/SRLCC_a1b_Precip_ECHAM5-MPI.csv'
  - 'data/SRLCC_b1_Precip_PCM-NCAR.csv'
path_as_pattern: 'SRLCC_{emissions}_Precip_{model}.csv'

In this case the pattern is used to populate new columns (emissions and model) on the data, with the values for each set of data being populated from the paths. Note that the pattern can be just a piece of the path, as long as it is unambiguous where the piece starts and stops. For instance, '{emissions}_Precip_{model}' would not yield the intended outcome, since emissions would match everything before '_Precip_' ('data/SRLCC_b1') and model would match everything after ('PCM-NCAR.csv').
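The ambiguity is easy to reproduce with a small regular-expression sketch. This is an illustration under simplifying assumptions, not Intake's implementation: the helper parse_path is hypothetical, it treats every field as a greedy string match, it ignores format specs, and it anchors the pattern at the end of the path.

```python
import re
from string import Formatter


def parse_path(pattern, path):
    """Extract field values from the end of `path` using a format-style pattern.

    Illustrative only: every field is a greedy string match and format
    specs such as ':d' are ignored.
    """
    regex = ''
    for literal, field, _spec, _conversion in Formatter().parse(pattern):
        regex += re.escape(literal)
        if field is not None:
            regex += '(?P<{}>.*)'.format(field)
    match = re.search(regex + '$', path)
    return match.groupdict() if match else None


path = 'data/SRLCC_b1_Precip_PCM-NCAR.csv'

# The literal text on both sides pins down where each field starts and stops.
full = parse_path('SRLCC_{emissions}_Precip_{model}.csv', path)
print(full)

# With no literal text at either end, the fields absorb the rest of the path.
loose = parse_path('{emissions}_Precip_{model}', path)
print(loose)
```

In the second call nothing tells the parser to stop at 'SRLCC_' or '.csv', so the directory and the extension end up inside the field values.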

An interactive example of setting up a catalog with path_as_pattern is available on Binder, and the notebook is available for download.

How it works

The pattern uses Python format string syntax, but applied in reverse: parsing finds the set of arguments such that pattern.format(**arguments) == path.

The logic is implemented in the function intake.source.utils.reverse_format, which we can call directly to demonstrate that any (reasonable) format string syntax is supported and that the parsed values match the type implied by the format string:

>>> from intake.source.utils import reverse_format
>>> reverse_format('data_{year}_{month}_{day}.csv', 'data_2014_01_03.csv')
{'year': '2014', 'month': '01', 'day': '03'}
>>> reverse_format('data_{year:d}_{month:d}_{day:d}.csv', 'data_2014_01_03.csv')
{'year': 2014, 'month': 1, 'day': 3}
>>> reverse_format('data_{date:%Y_%m_%d}.csv', 'data_2016_10_01.csv')
{'date': datetime.datetime(2016, 10, 1, 0, 0)}
>>> reverse_format('{state:2}{zip:5}', 'PA19104')
{'state': 'PA', 'zip': '19104'}
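The inverse relationship can also be checked in the forward direction with plain str.format, without Intake. The sketch below plugs the parsed values from the examples above back into their patterns. It surfaces one caveat: a bare ':d' spec does not restore zero padding, so that example only round-trips exactly if the spec states the padding (':02d'). Whether reverse_format accepts padded specs is not something this sketch verifies.

```python
import datetime

# String fields reproduce the original filename exactly.
str_path = 'data_{year}_{month}_{day}.csv'.format(year='2014', month='01', day='03')

# Integer fields formatted with ':d' lose the leading zeros ...
int_path = 'data_{year:d}_{month:d}_{day:d}.csv'.format(year=2014, month=1, day=3)

# ... unless the format spec asks for zero padding explicitly.
padded_path = 'data_{year:d}_{month:02d}_{day:02d}.csv'.format(year=2014, month=1, day=3)

# Datetime fields are rendered through strftime-style directives.
date_path = 'data_{date:%Y_%m_%d}.csv'.format(date=datetime.datetime(2016, 10, 1))

# Fixed-width string fields concatenate with no separator.
fixed_width = '{state:2}{zip:5}'.format(state='PA', zip='19104')

for result in (str_path, int_path, padded_path, date_path, fixed_width):
    print(result)
```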

What’s next

  • So far only the CSV plugin and the intake-xarray plugin with the rasterio driver support this behavior. More plugins can be made to respect path_as_pattern notation, using the helper classes provided in Intake. The specific implementation may depend on the specifics of the third-party library.
  • This feature seems most helpful in the context of multiple file loading; however, the functionality may also be useful for parsing other similarly structured text in general.
  • Add regex support for path patterns.
  • Swap out reverse_format for an external library such as parse.
