Working with Data in the Cloud
Data is everywhere, and data is bigger than it used to be. It is no longer practical to download everything you need to your local machine and run your analysis there - download times are too long, and the data often won't fit in memory, or even on disk.
These days, data is stored with one of the big cloud vendors or within an institutional data lake. To analyze it, you not only need ways to interact with various remote file storage systems, you also need storage formats that let you access only the data you need rather than the whole dataset (goodbye, JSON and Excel). This presents an opportunity to process data faster and in parallel, and to distribute and catalog datasets for sharing among teams, rather than copying code and data.
Watch this on-demand webinar for a discussion about working with data in the cloud, including how to use these open-source tools:
- fsspec - for accessing remote filesystems
- Parquet and Zarr - two cloud-ready data formats
- Intake - a library for cataloging, distributing, and loading data
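As a taste of the first of these, here is a minimal fsspec sketch. It uses the in-memory backend so the example is self-contained; the path and file contents are invented for illustration. The same `open`/`ls`/`cat` calls work unchanged against real remote backends such as `"s3"` or `"gcs"` (given credentials).

```python
import fsspec

# fsspec exposes many storage backends (local, s3, gcs, http, ...) behind
# one filesystem interface; "memory" keeps this sketch self-contained.
fs = fsspec.filesystem("memory")

# Write a small file through the standard file-like API.
with fs.open("/demo/hello.txt", "wb") as f:
    f.write(b"hello from a remote-style filesystem")

# List and read it back with the same calls you would use on cloud storage.
files = fs.ls("/demo", detail=False)
content = fs.cat("/demo/hello.txt")
```

Swapping the protocol string is the only change needed to point the same code at actual cloud storage, which is what lets tools like Dask and Intake stay storage-agnostic.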
Meet the Speaker:
A former astrophysicist, Martin has held multiple academic positions, including in medical imaging research. He then became a data scientist and has worked at Anaconda for five years. In open source, he is a member of the Dask, Intake, Streamz, and Zarr maintenance teams, specializing in data access, remote filesystems, and data formats.