Big Data
Studio can load fairly large datasets directly into the browser, but there are limits.
When a user suddenly encounters a dataset that exceeds the limit, it becomes necessary to choose an alternative approach.
No single solution for handling big geospatial data covers every use case, but a range of standard techniques covers most of them:
Technique | Comments |
---|---|
Load full dataset into browser | Dataset size is limited; see below for details. |
Tiled data | Load data converted into tiles, such as Hex Tiles, Vector Tiles or Raster Tiles. |
Database queries | Store large datasets in databases, and then load partial data via queries. |
Preprocess data | Reduce dataset size before loading by filtering out rows, removing columns, etc. Often done in Python notebooks. |
Data Size Limits
As rough guidance, tables with about a million points, or less than 250MB in size, typically load and render smoothly when loaded directly into the browser. The exact limit depends on:
- the capabilities of the user's computer. For instance, a mobile device will be able to load significantly less data than a laptop.
- the structure of the data: e.g. a GeoJSON file can have just 1000 rows, but each row can be a polygon with 1000 points and this can be equivalent to a million point file.
- how many big datasets are already in the map.
Note: over time, we expect this limit to increase significantly, both as a result of computers becoming more powerful and as we continue to optimize the internal processing of data in the Studio platform. However, there will always be a limit to the amount of data that can be loaded into the browser, so the alternative techniques below will still apply.
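Because row count alone can be misleading for GeoJSON, it can help to count coordinate positions before loading a file. The following is a minimal sketch using only the Python standard library; the function names and the sample data are hypothetical, and GeometryCollection geometries are not handled.

```python
def count_vertices(coords):
    """Recursively count coordinate positions in a GeoJSON coordinates array."""
    # A position is a list of numbers such as [lon, lat]; anything else is nested.
    if coords and isinstance(coords[0], (int, float)):
        return 1
    return sum(count_vertices(c) for c in coords)


def total_vertices(feature_collection):
    """Sum vertex counts over all features, skipping null geometries."""
    total = 0
    for feature in feature_collection["features"]:
        geometry = feature.get("geometry")
        if geometry is not None:
            # GeometryCollection has no "coordinates" key and counts as 0 here.
            total += count_vertices(geometry.get("coordinates", []))
    return total


# Hypothetical data: 3 polygons, each with a single ring of 5 positions.
ring = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        }
        for _ in range(3)
    ],
}

print(total_vertices(sample))  # 15
```

A file whose vertex count is far above the rough one-million-point guidance is a candidate for preprocessing or tiling, even if it has few rows.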
Data Preprocessing
In many cases, datasets are only slightly too big for Studio (say, around a gigabyte in size). In such cases, some quick preprocessing can often reduce the size of a dataset to a point where Studio can load it directly. Sometimes it is as simple as removing a few unused columns before loading. At other times, standard techniques such as filtering or grouping operations can help get the data into a more efficient form.
If you are a data scientist who is prepared to write some code, Studio offers deep integration with the most common environments for data preparation, in particular Python notebooks, with full support for data processing libraries such as pandas and geopandas.
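The three reductions mentioned above (dropping columns, filtering rows, grouping) might look as follows in pandas. This is a sketch only: the column names and values are made up for illustration, and a real dataset would typically be read from a file rather than built inline.

```python
import pandas as pd

# Hypothetical trip data; in practice this would come from a file on disk.
df = pd.DataFrame(
    {
        "lat": [37.77, 37.78, 40.71, 40.72],
        "lng": [-122.42, -122.41, -74.00, -74.01],
        "city": ["SF", "SF", "NYC", "NYC"],
        "fare": [12.5, 8.0, 20.0, 15.0],
        "notes": ["a", "b", "c", "d"],  # not needed on the map
    }
)

# 1. Remove unused columns.
slim = df.drop(columns=["notes"])

# 2. Filter rows down to the area of interest.
sf_only = slim[slim["city"] == "SF"]

# 3. Group rows into one summary row per city.
per_city = slim.groupby("city", as_index=False).agg(
    lat=("lat", "mean"),
    lng=("lng", "mean"),
    total_fare=("fare", "sum"),
    trips=("fare", "count"),
)

print(per_city)
```

Any of the resulting frames can then be written out (for example as CSV) and loaded in place of the original, larger dataset.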
Database Queries
One approach to working with large tabular datasets is to upload them into a database. If your database of choice is supported by a Data Connector, partial data can then be loaded into Studio by performing a query.
Tip: The SQL LIMIT clause is your friend. It provides a simple but effective way of capping the amount of data returned from a query, ensuring that the result can be processed by Studio.
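To illustrate how LIMIT caps a result, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name, columns, and row cap are assumptions for the example; the same query pattern applies to whichever database sits behind your connector.

```python
import sqlite3

# In-memory database standing in for a real data warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, lat REAL, lng REAL)")
conn.executemany(
    "INSERT INTO points (lat, lng) VALUES (?, ?)",
    [(37.0 + i * 0.001, -122.0 - i * 0.001) for i in range(10_000)],
)

MAX_ROWS = 1000  # hypothetical cap chosen to stay well within browser limits

# Filter first, then cap the number of rows returned.
rows = conn.execute(
    "SELECT id, lat, lng FROM points WHERE lat >= ? ORDER BY id LIMIT ?",
    (37.5, MAX_ROWS),
).fetchall()

print(len(rows))  # 1000
```

Combining a WHERE filter with LIMIT keeps the query focused on the region of interest while still guaranteeing an upper bound on the result size.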
Tiled Data Formats
A standard approach to dealing with large geospatial datasets is to convert them into a tiled representation. The tiled representation is a large tree of small files, from which a small subset of tiles can be loaded to cover the user's current viewport at the appropriate level of detail. A growing number of tile formats is available, and the choice can be influenced by the structure of the data.
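To make the idea of loading only the tiles that cover the current view concrete, the sketch below computes tile indices using the widely used Web Mercator XYZ ("slippy map") scheme. This scheme is an assumption chosen for illustration and is not necessarily what Studio uses internally; the function names are hypothetical, and views that cross the antimeridian are not handled.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the XYZ tile containing a lon/lat point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_view(west, south, east, north, zoom):
    """List the (zoom, x, y) tiles needed to cover a bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


print(lonlat_to_tile(0, 0, 1))                       # (1, 1)
print(len(tiles_in_view(-180, -85, 180, 85, 1)))     # 4
```

At zoom level z the world is split into 4^z tiles, yet a typical view touches only a handful of them, which is why tiled datasets stay responsive regardless of their total size.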
Hex Tiles
Hex Tiles are Studio's solution to the problem of working with big geospatial data. They are a great choice when working analytically with massive datasets in the gigabyte or even terabyte range.
The Studio platform comes with integrated tools for converting your datasets to Hex Tiles.
Vector Tiles
The Studio platform has support for generating and visualizing vector tile datasets.
When working with data in the form of standard geospatial "features" (polygons, lines and points), such as very large GeoJSON files, converting the data to vector tiles is often a good choice.
Vector Tiles are recommended when the goal is to create fast-loading maps that are visually very similar to the tabular source dataset.
Raster Tiles
Raster tiles typically contain imagery but can also contain arbitrary analytical data.
Studio has full support for visualizing raster tiles and cloud-optimized GeoTIFFs, and the petabyte-sized Sentinel and Landsat archives are available in the Studio Data Catalog.