How to download multiple files from a website

This question has been asked many times in various ways on this forum, but I've never found a simple answer to this very simple question:
It seems like there should be a two-line solution along the lines of:
url_list = get_urls('https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html','extension','.nc');
websave(url_list)
That would work if get_urls were a real function and if websave accepted a list of file urls and saved each file in the current directory.

3 Comments

This method works, but it seems extremely slow (probably due to the large file sizes and my poor internet connection at the moment).
webpageurl = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
%Read the webpage
str = webread(webpageurl);
%Get the hyperlinks from the webpage data
hl = regexp(str,'<a.*?/a>','match')'
hl = 294×1 cell array
    {'<a class="static" href="https://www.ngdc.noaa.gov/thredds/catalog/catalog.html">NCEI THREDDS Data Server</a>'}
    {'<code>ETOPO_2022_v1_15s_N00E000_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E060_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E075_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E090_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E105_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E120_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E135_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E150_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00E165_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W060_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W075_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W090_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W105_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W120_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W135_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W150_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W165_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N00W180_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E000_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E015_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E030_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E045_surface.nc</code>'}
    {'<code>ETOPO_2022_v1_15s_N15E060_surface.nc</code>'}
The pages that the hyperlinks on the given webpage lead to are not the same as the urls from which the data is downloaded.
%Prefix for the download url
fileurl = 'https://www.ngdc.noaa.gov/thredds/fileServer/global';
Please note that I have taken the link corresponding to the HTTP Server download option.
%Ignoring the header hyperlink
for k=2:5
    %Extract the part of the hyperlink between 'Scan' and the closing quote
    z = extractBetween(hl{k}, 'Scan', '"');
    %Combine the url prefix with that path and download the file with websave()
    new = strcat(fileurl, z{:});
    yo = websave(sprintf('File%d.nc', k-1), new);
end
ls
File1.nc File2.nc File3.nc File4.nc
ncdisp('File3.nc')
Source:
           /users/mss.system.b2I8c7/File3.nc
Format:
           netcdf4_classic
Global Attributes:
           GDAL_AREA_OR_POINT            = 'Area'
           node_offset                   = 1
           GDAL_TIFFTAG_COPYRIGHT        = 'DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce'
           GDAL_TIFFTAG_DATETIME         = 20220929130858
           GDAL_TIFFTAG_IMAGEDESCRIPTION = 'Topography-Bathymetry; EGM2008 height'
           Conventions                   = 'CF-1.5'
           GDAL                          = 'GDAL 3.3.2, released 2021/09/01'
           NCO                           = 'netCDF Operators version 4.9.1 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)'
Dimensions:
           lon = 3600
           lat = 3600
Variables:
    crs
           Size:       1x1
           Dimensions:
           Datatype:   char
           Attributes:
                       grid_mapping_name           = 'latitude_longitude'
                       long_name                   = 'CRS definition'
                       longitude_of_prime_meridian = 0
                       semi_major_axis             = 6378137
                       inverse_flattening          = 298.2572
                       spatial_ref                 = 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
                       GeoTransform                = '30 0.004166666666666667 0 0 0 -0.004166666666666667 '
    lat
           Size:       3600x1
           Dimensions: lat
           Datatype:   double
           Attributes:
                       standard_name = 'latitude'
                       long_name     = 'latitude'
                       units         = 'degrees_north'
    lon
           Size:       3600x1
           Dimensions: lon
           Datatype:   double
           Attributes:
                       standard_name = 'longitude'
                       long_name     = 'longitude'
                       units         = 'degrees_east'
    z
           Size:       3600x3600
           Dimensions: lon,lat
           Datatype:   single
           Attributes:
                       long_name     = 'z'
                       _FillValue    = -99999
                       grid_mapping  = 'crs'
                       units         = 'meters'
                       positive      = 'up'
                       standard_name = 'height'
                       vert_crs_name = 'EGM2008'
                       vert_crs_epsg = 'EPSG:3855'
Wow, thank you @Dyuman Joshi!
You are welcome!
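For anyone who wants every .nc link rather than only the first four, the same idea can probably be applied to the whole page in one pass. The following is a minimal offline sketch, not a tested downloader: the two-link snippet in str is hypothetical, and the href shape (catalog.html?dataset=globalDatasetScan/...) is an assumption inferred from the extractBetween(..., 'Scan', '"') call above. On the live site, str would come from webread(webpageurl) instead.

```matlab
% Hypothetical catalog snippet (assumed href shape, not copied from the site).
str = ['<a href="catalog.html?dataset=globalDatasetScan/ETOPO2022/15s/' ...
       '15s_surface_elev_netcdf/ETOPO_2022_v1_15s_N00E000_surface.nc">' ...
       '<code>ETOPO_2022_v1_15s_N00E000_surface.nc</code></a>' newline ...
       '<a href="catalog.html?dataset=globalDatasetScan/ETOPO2022/15s/' ...
       '15s_surface_elev_netcdf/ETOPO_2022_v1_15s_N00E015_surface.nc">' ...
       '<code>ETOPO_2022_v1_15s_N00E015_surface.nc</code></a>'];

% Prefix of the HTTP Server download urls (taken from the comment above).
fileurl = 'https://www.ngdc.noaa.gov/thredds/fileServer/global';

% Capture the path that follows 'Scan' in every href ending in .nc.
tok   = regexp(str, 'href="[^"]*?Scan(/[^"]*?\.nc)"', 'tokens');
paths = cellfun(@(t) t{1}, tok, 'UniformOutput', false);

% Build the download urls and take the last path segment as the file name.
urls  = strcat(fileurl, paths);
names = regexp(urls, '[^/]+$', 'match', 'once');

% To actually download (left commented out so the sketch runs offline):
% for k = 1:numel(urls)
%     websave(names{k}, urls{k});
% end
```

Using the original file names instead of File1.nc, File2.nc, ... keeps the tile coordinates visible in the saved files. If the real hrefs differ from the assumed shape, only the regexp pattern should need adjusting.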


Accepted Answer

Voss on 21 Nov 2023
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
% webread() the main page and parse out the links to .nc files:
data = webread(url);
C = regexp(data,'<a href=".*?(\?[^"]*\.nc)">','tokens');
temp_urls = strcat(url,vertcat(C{:}));
% webread() each linked url:
data = cell(size(temp_urls));
for ii = 1:numel(temp_urls)
    data{ii} = webread(temp_urls{ii});
end
% get the download link in each of those pages:
C = regexp(data,'<a href="([^"]*)">\s*<b>HTTPServer','tokens','once');
% append them to the (sub-)domain of the main URL to get the actual URLs
% for downloading the .nc files:
idx = find(url == '/',3);
nc_urls = strcat(url(1:idx(end)-1),vertcat(C{:}));
% construct file names to save to locally:
[~,filenames,ext] = fileparts(nc_urls);
filenames = strcat(filenames,ext);
% download all the files:
for ii = 1:numel(nc_urls)
    websave(filenames{ii},nc_urls{ii});
end

3 Comments

Awesome, thank you! Your solution works, and I want to make sure I understand it. What exactly is the first loop doing? I'm having trouble understanding why we need to call webread and regexp twice. Isn't all the information in temp_urls after the first call to webread?
Voss on 21 Nov 2023
You're welcome!
Each link on the main page goes to a distinct intermediate page which contains the link to download the actual .nc file.
The first webread/regexp gets the set of urls to those intermediate pages. The loop then calls webread on each of those intermediate pages, and the second regexp searches each page's contents for the download url, which is the url immediately preceding 'HTTPServer' on that page. There are several other urls on those pages, and that was the only way I could think of to be sure to get the right one.
Ooh, okay, that makes a lot of sense. Thanks @Voss!
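To make the two stages concrete, here is a small offline sketch that runs both regexp patterns from the answer on made-up HTML instead of live webread() output. Both main_page and sub_page are hypothetical snippets; their markup is an assumption shaped only by what the patterns expect, and the real pages may look different.

```matlab
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';

% Stage 1: the main page only links to an intermediate page per file.
% (Hypothetical one-link snippet standing in for webread(url).)
main_page = ['<a href="catalog.html?dataset=globalDatasetScan/ETOPO2022/15s/' ...
    '15s_surface_elev_netcdf/ETOPO_2022_v1_15s_N00E000_surface.nc">' ...
    '<code>ETOPO_2022_v1_15s_N00E000_surface.nc</code></a>'];
C = regexp(main_page, '<a href=".*?(\?[^"]*\.nc)">', 'tokens');
temp_urls = strcat(url, vertcat(C{:}));   % url of the intermediate page

% Stage 2: the intermediate page lists several access links; only the one
% followed by 'HTTPServer' is the plain file download.
% (Hypothetical snippet standing in for webread(temp_urls{1}).)
sub_page = ['<a href="/thredds/dodsC/global/ETOPO2022/15s/15s_surface_elev_netcdf/' ...
    'ETOPO_2022_v1_15s_N00E000_surface.nc.html"><b>OPENDAP</b></a>' newline ...
    '<a href="/thredds/fileServer/global/ETOPO2022/15s/15s_surface_elev_netcdf/' ...
    'ETOPO_2022_v1_15s_N00E000_surface.nc"><b>HTTPServer</b></a>'];
data = {sub_page};
C = regexp(data, '<a href="([^"]*)">\s*<b>HTTPServer', 'tokens', 'once');

% The captured link is site-relative, so prepend everything before the
% third '/' of the main url (scheme plus host).
idx = find(url == '/', 3);
nc_urls = strcat(url(1:idx(end)-1), vertcat(C{:}));
```

In this sketch temp_urls{1} is the catalog url with the ?dataset=... query appended, and nc_urls{1} is the host followed by the /thredds/fileServer/... path. The OPENDAP link is skipped because the text after its closing quote is not 'HTTPServer'.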


More Answers (0)

Release

R2023b
