Accessing remote files with earthaccess
When we search for data using earthaccess, we get back a list of results from NASA's Common Metadata Repository (CMR). These results contain all the information we need to access the files represented by the metadata. earthaccess offers two access methods that operate on these results. The first is the well-known download(), which copies the files from their remote location to our local disk. If we run the code in AWS, say on a JupyterHub, the files are copied to the local VM disk.
The other method is open(), which uses fsspec to open remote files as if they were local. open() has advantages and disadvantages that we should understand before using it.
The main advantage of open() is that we don't have to download the file: we can stream it into memory. Depending on how we do it, however, we may run into network performance issues. If we run the code next to the data, access will be fast; if we run it locally on our laptops, it will be slow.
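To make the streaming idea concrete, here is a minimal sketch that reads only the first few bytes of a file-like object instead of copying the whole file. It assumes the object supports seek() and read(), as fsspec file objects generally do; the helper name looks_like_hdf5 is hypothetical, and an in-memory io.BytesIO stands in for a remote file so the example runs without network access or credentials.

```python
import io

# The 8-byte signature that begins an HDF5 file (netCDF-4 files are HDF5 too).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(fileobj) -> bool:
    """Return True if the file-like object starts with the HDF5 signature.

    Only the first 8 bytes are read, so for a remote file this would
    transfer a tiny fraction of the data. Files with a user block keep
    the signature at a later offset; this sketch checks offset 0 only.
    """
    fileobj.seek(0)
    return fileobj.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Stand-in for a remote file: an in-memory buffer (an assumption made
# purely so the example is self-contained).
fake_remote = io.BytesIO(HDF5_SIGNATURE + b"\x00" * 1024)
print(looks_like_hdf5(fake_remote))
```

The same function could be handed a local file opened in binary mode. Whether it is fast on a real remote object depends on how many round trips the underlying filesystem makes.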
import earthaccess
auth = earthaccess.login()
results = earthaccess.search_data(
    short_name="ATL06",
    cloud_hosted=False,
    temporal=("2019-01", "2019-02"),
    polygon=[(-100, 40), (-110, 40), (-105, 38), (-100, 40)],
)
results[0]
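The polygon passed to search_data above is a ring of (longitude, latitude) pairs whose last point repeats the first. To my knowledge CMR expects polygon rings to be closed and listed in counter-clockwise order, so a quick check before searching can save a failed request. The sketch below uses the planar shoelace formula, which is only an approximation on a sphere but is adequate for small polygons like this one; the helper names are hypothetical.

```python
def is_closed(ring: list[tuple[float, float]]) -> bool:
    """A ring is closed when it has at least 4 points and first == last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def signed_area(ring: list[tuple[float, float]]) -> float:
    """Planar shoelace area; positive means counter-clockwise order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


# The polygon used in the search above, as (lon, lat) pairs.
polygon = [(-100, 40), (-110, 40), (-105, 38), (-100, 40)]

print(is_closed(polygon))        # first and last points match
print(signed_area(polygon) > 0)  # counter-clockwise in the planar sense
```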
nsidc_url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2019.02.21/ATL06_20190221121851_08410203_005_01.h5"
lpcloud_url = "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc"
session = earthaccess.get_requests_https_session()
headers = {"Range": "bytes=0-100"}
r = session.get(lpcloud_url, headers=headers)
r
<Response [206]>
fs = earthaccess.get_fsspec_https_session()
with fs.open(lpcloud_url) as f:
    data = f.read(100)
data
---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
Cell In[6], line 2
      1 with fs.open(lpcloud_url) as f:
----> 2     data = f.read(100)
      3 data

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/implementations/http.py:637, in HTTPFile.read(self, length)
    635 else:
    636     length = min(self.size - self.loc, length)
--> 637 return super().read(length)

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/spec.py:2111, in AbstractBufferedFile.read(self, length)
   2108 if length == 0:
   2109     # don't even bother calling fetch
   2110     return b""
-> 2111 out = self.cache._fetch(self.loc, self.loc + length)
   2113 logger.debug(
   2114     "%s read: %i - %i %s",
   2115     self,
   (...)
   2118     self.cache._log_stats(),
   2119 )
   2120 self.loc += len(out)

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/caching.py:534, in BytesCache._fetch(self, start, end)
    532     self.total_requested_bytes += bend - start
    533     self.miss_count += 1
--> 534     self.cache = self.fetcher(start, bend)
    535     self.start = start
    536 else:

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54 coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/fsspec/implementations/http.py:692, in HTTPFile.async_fetch_range(self, start, end)
    689 if r.status == 416:
    690     # range request outside file
    691     return b""
--> 692 r.raise_for_status()
    694 # If the server has handled the range request, it should reply
    695 # with status 206 (partial content). But we'll guess that a suitable
    696 # Content-Range header or a Content-Length no more than the
    697 # requested range also mean we have got the desired range.
    698 response_is_range = (
    699     r.status == 206
    700     or self._parse_content_range(r.headers)[0] == start
    701     or int(r.headers.get("Content-Length", end + 1)) <= end - start
    702 )

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1249/lib/python3.12/site-packages/aiohttp/client_reqrep.py:629, in ClientResponse.raise_for_status(self)
    626 if not self._in_context:
    627     self.release()
--> 629 raise ClientResponseError(
    630     self.request_info,
    631     self.history,
    632     status=self.status,
    633     message=self.reason,
    634     headers=self.headers,
    635 )

ClientResponseError: 502, message='Bad Gateway', url='https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc'
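The traceback above ends in a 502 Bad Gateway raised while fsspec fetched a byte range. Whether this particular failure is transient is not established here, but if it is, one common mitigation is to retry the read with an increasing delay. The sketch below is a generic retry wrapper, not part of earthaccess or fsspec; the name read_with_retries and its parameters are hypothetical, and it catches every Exception for brevity, whereas real code would narrow this to the specific error types worth retrying.

```python
import time


def read_with_retries(read_once, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call read_once() until it succeeds or the attempts are used up.

    The delay doubles after each failure (base_delay, 2x, 4x, ...).
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return read_once()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky remote read: fails twice, then returns bytes.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("502 Bad Gateway (simulated)")
    return b"\x89HDF"


# sleep is replaced with a no-op so the example finishes immediately.
print(read_with_retries(flaky_read, sleep=lambda seconds: None))
```

With a real filesystem object, read_once could be a small function that opens the URL and reads the bytes of interest. If the server does not support the request at all, retrying will not help.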
%%time
import xarray as xr
files = earthaccess.open(results[0:2])
ds = xr.open_dataset(files[0], group="/gt1r/land_ice_segments")
ds
CPU times: user 1.34 s, sys: 168 ms, total: 1.51 s
Wall time: 9.08 s
<xarray.Dataset> Size: 7MB
Dimensions:                (delta_time: 153460)
Coordinates:
  * delta_time             (delta_time) datetime64[ns] 1MB 2019-01-03T06:49:0...
    latitude               (delta_time) float64 1MB ...
    longitude              (delta_time) float64 1MB ...
Data variables:
    atl06_quality_summary  (delta_time) int8 153kB ...
    fpb_warning_flag       (delta_time) int8 153kB ...
    h_li                   (delta_time) float32 614kB ...
    h_li_sigma             (delta_time) float32 614kB ...
    segment_id             (delta_time) float64 1MB ...
    sigma_geo_h            (delta_time) float32 614kB ...
Attributes:
    data_rate:    Data within this group are sparse. Data values are provide...
    description:  The land_ice_height group contains the primary set of deriv...
