parquetinfo

Get information about Parquet file

Description

ParquetInfo objects contain information about a Parquet file, such as file size, variable names and types, encoding, and compression schemes. To get information about a Parquet file, create a ParquetInfo object using the parquetinfo function.

Creation

Description

info = parquetinfo(filename) returns an info object for the Parquet file specified by filename.

Input Arguments

filename

Name of Parquet file, specified as a character vector or string scalar. ParquetInfo works with Parquet 1.0 or Parquet 2.0 files.

Depending on the location of the file, filename can take on one of these forms.

Location

Form

Current folder or folder on the MATLAB® path

Specify the name of the file in filename.

Example: 'data.parquet'

File in a folder

If the file is not in the current folder or in a folder on the MATLAB path, then specify the full or relative path name.

Example: 'C:\myFolder\data.parquet'

Example: 'myDir\myFile.ext'

Internet URL

If the file is specified as an internet uniform resource locator (URL), then filename must contain the protocol type 'http://' or 'https://' and end with '?raw=true'.

Example: 'http://hostname/path_to_file/my_data.parquet?raw=true'

Remote Location

If the file is stored at a remote location, then filename must contain the full path of the file specified with the form:

scheme_name://path_to_file/my_file.ext

Based on the remote location, scheme_name can be one of the values in this table.

Remote Location                 scheme_name
Amazon S3™                      s3
Windows Azure® Blob Storage     wasb, wasbs
HDFS™                           hdfs

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/data.parquet'

Data Types: char | string
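
The forms of filename described above can be sketched as follows. This is a minimal illustration, assuming that outages.parquet (the sample file used in the Examples section) is on the MATLAB path. The remote and URL calls are left as comments because they use the placeholder locations from this page and would need a real file and credentials.

```matlab
% Form 1: file name only (file is in the current folder or on the MATLAB path).
infoByName = parquetinfo('outages.parquet');

% Form 2: full path. The Filename property holds the absolute path,
% so it can be passed back in as a string scalar.
infoByPath = parquetinfo(infoByName.Filename);

% Both calls describe the same file, so the reported sizes should match.
sameFile = isequal(infoByName.FileSize, infoByPath.FileSize);

% Forms 3 and 4 (placeholders copied from this page, not real locations):
% infoUrl = parquetinfo('http://hostname/path_to_file/my_data.parquet?raw=true');
% infoS3  = parquetinfo('s3://bucketname/path_to_file/data.parquet');
```

Passing either a character vector or a string scalar should behave the same way, since both are accepted data types for filename.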

Properties

Filename

This property is read-only.

Absolute path to Parquet file, specified as a string scalar.

Data Types: string

FileSize

This property is read-only.

File size in bytes, specified as a double.

Data Types: double

NumRowGroups

This property is read-only.

Number of row groups, specified as a double.

Data Types: double

RowGroupHeights

This property is read-only.

Number of rows in each row group, specified as a double.

Data Types: double

VariableNames

This property is read-only.

Variable names, specified as a string array. If the Parquet file contains N variables, then VariableNames is an array of size 1-by-N containing the names of the variables.

Data Types: string

VariableTypes

This property is read-only.

Variable data types, specified as a string array. If the Parquet file contains N variables, then VariableTypes is an array of size 1-by-N containing datatype names for each variable. Each element in the array is the name of the MATLAB datatype to which the corresponding variable in the Parquet file maps.

Data Types: string

VariableCompression

This property is read-only.

Variable compression algorithm, specified as a string array. If the Parquet file contains N variables, then VariableCompression is an array of size 1-by-N containing compression algorithm names. Each element in the array corresponds to the compression algorithm used to compress that variable in the Parquet file. See parquetwrite for a list of supported compression algorithms.

Data Types: string

VariableEncoding

This property is read-only.

Variable encoding, specified as a string array. If the Parquet file contains N variables, then VariableEncoding is an array of size 1-by-N containing encoding scheme names. Each element in the array corresponds to the encoding scheme used to encode that variable in the Parquet file. See parquetwrite for a list of supported encodings.

Data Types: string

Version

This property is read-only.

Parquet version, specified as either "1.0" or "2.0".

Data Types: string
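
The size-related properties can be combined to get a quick picture of a file's layout. The following is a minimal sketch, assuming that outages.parquet (the sample file used in the Examples section) is on the MATLAB path. The variable names totalRows and bytesPerRow are illustrative and are not part of the ParquetInfo object.

```matlab
info = parquetinfo('outages.parquet');

% RowGroupHeights holds the number of rows in each row group,
% so summing it gives the total number of rows in the file.
totalRows = sum(info.RowGroupHeights);

% Rough average on-disk footprint of one row, in bytes.
% FileSize includes file metadata, so treat this as an estimate.
bytesPerRow = info.FileSize / totalRows;

fprintf('%d rows in %d row group(s), about %.1f bytes per row\n', ...
    totalRows, info.NumRowGroups, bytesPerRow);
```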

Examples

Use the parquetinfo function to create a ParquetInfo object containing information about the file.

info = parquetinfo('outages.parquet')
info = 
  ParquetInfo with properties:

               Filename: "/mathworks/devel/bat/filer/batfs1904-0/Bdoc24a.2528353/build/matlab/toolbox/matlab/demos/outages.parquet"
               FileSize: 44202
           NumRowGroups: 1
        RowGroupHeights: 1468
          VariableNames: ["Region"    "OutageTime"    "Loss"    "Customers"    "RestorationTime"    "Cause"]
          VariableTypes: ["string"    "datetime"    "double"    "double"    "datetime"    "string"]
    VariableCompression: ["snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"]
       VariableEncoding: ["plain"    "plain"    "plain"    "plain"    "plain"    "plain"]
                Version: "2.0"

Display the name, type, and compression scheme for the third variable in the file.

disp([info.VariableNames(3)  info.VariableTypes(3) info.VariableCompression(3)]) 
    "Loss"    "double"    "snappy"
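
To view the same details for every variable at once, the per-variable properties can be gathered into a table. This is a sketch rather than a documented workflow. It assumes outages.parquet is on the MATLAB path, and the table name varSummary and its column names are chosen here for illustration.

```matlab
info = parquetinfo('outages.parquet');

% Each per-variable property is a 1-by-N string array, so transpose
% them into columns and combine them with one row per variable.
varSummary = table(info.VariableNames', info.VariableTypes', ...
    info.VariableCompression', info.VariableEncoding', ...
    'VariableNames', {'Name', 'Type', 'Compression', 'Encoding'});

disp(varSummary)
```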

Version History

Introduced in R2019a