Syncing to Hive Metastore

This document walks through the steps to register an Apache XTable™ (Incubating) synced table on Hive Metastore (HMS).

Pre-requisites

Source table(s) (Hudi/Delta/Iceberg) already written to your local storage or external storage locations like S3/GCS/ADLS. If you don't have the source table written in place already, you can follow the steps in this tutorial to set it up.
A compute instance where you can run Apache Spark. This can be your local machine, docker, or a distributed system like Amazon EMR, Google Cloud's Dataproc, Azure HDInsight etc. This is a required step to register the table in HMS using a Spark client.
Clone the XTable™ (Incubating) repository and create the xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar by following the steps on the Installation page
This guide also assumes that you have configured the Hive Metastore locally or on EMR/Dataproc/HDInsight and is already running.

Steps

Running sync

Create my_config.yaml in the cloned Apache XTable™ (Incubating) directory.

targetFormat: HUDI
targetFormat: DELTA
targetFormat: ICEBERG

yaml
sourceFormat: DELTA|ICEBERG # choose only one
targetFormats:
  - HUDI
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name

yaml
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat

yaml
sourceFormat: HUDI|DELTA # choose only one
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: file:///path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat

Note:

Replace with appropriate values for sourceFormat, tableBasePath and tableName fields.
Replace file:///path/to/source/data to appropriate source data path if you have your source table in S3/GCS/ADLS i.e.
- S3 - s3://path/to/source/data
- GCS - gs://path/to/source/data or
- ADLS - abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>

From your terminal under the cloned Apache XTable™ (Incubating) directory, run the sync process using the below command.

shell

java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

Note:

At this point, if you check your bucket path, you will be able to see .hoodie or _delta_log or metadata directory with relevant metadata files that helps query engines to interpret the data as a Hudi/Delta/Iceberg table.

Register the target table in Hive Metastore

Now you need to register the Apache XTable™ (Incubating) synced target table in Hive Metastore.

targetFormat: HUDI
targetFormat: DELTA
targetFormat: ICEBERG

A Hudi table can directly be synced to the Hive Metastore using Hive Sync Tool and subsequently be queried by different query engines. For more information on the Hive Sync Tool, check Hudi Hive Metastore docs.

shell
cd $HUDI_HOME/hudi-sync/hudi-hive-sync

./run_sync_tool.sh  \
--jdbc-url <jdbc_url> \
--user <username> \
--pass <password> \
--partitioned-by <partition_field> \
--base-path <'/path/to/synced/hudi/table'> \
--database <database_name> \
--table <tableName>

Note:

Replace file:///path/to/source/data to appropriate source data path if you have your source table in S3/GCS/ADLS i.e.

S3 - s3://path/to/source/data
GCS - gs://path/to/source/data or
ADLS - abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>

Now you will be able to query the created table directly as a Hudi table from the same spark session or using query engines like Presto and/or Trino. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on Presto or Trino query engines for more information.

sql
SELECT * FROM <database_name>.<table_name>;

shell
spark-sql --packages io.delta:delta-core_2.12:2.0.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "spark.sql.catalogImplementation=hive"

In the spark-sql shell, you need to create a schema and table like below.

sql
CREATE SCHEMA delta_db;

CREATE TABLE delta_db.<table_name> USING DELTA LOCATION '/path/to/synced/delta/table';

Note:

Replace file:///path/to/source/data to appropriate source data path if you have your source table in S3/GCS/ADLS i.e.

S3 - s3://path/to/source/data
GCS - gs://path/to/source/data or
ADLS - abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>

Now you will be able to query the created table directly as a Delta table from the same spark session or using query engines like Presto and/or Trino. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on Presto or Trino query engines for more information.

sql
SELECT * FROM delta_db.<table_name>;

shell
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.1 \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" \
--conf "spark.sql.catalog.spark_catalog.type=hive" \
--conf "spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hive_prod.type=hive"

In the spark-sql shell, you need to create a schema and table like below.

sql
CREATE SCHEMA iceberg_db;

CALL hive_prod.system.register_table(
   table => 'hive_prod.iceberg_db.<table_name>',
   metadata_file => '/path/to/synced/iceberg/table/metadata/<VERSION>.metadata.json'
);

Note:

Replace the dataset path while creating a dataframe to appropriate data path if you have your table in S3/GCS/ADLS i.e.

S3 - s3://path/to/source/data
GCS - gs://path/to/source/data or
ADLS - abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>

Now you will be able to query the created table directly as an Iceberg table from the same spark session or using query engines like Presto and/or Trino. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on Presto or Trino query engines for more information.

sql
SELECT * FROM iceberg_db.<table_name>;

Conclusion

In this guide we saw how to,

sync a source table to create metadata for the desired target table formats using Apache XTable™ (Incubating)
catalog the data in the target table format in Hive Metastore
query the target table using Spark

Syncing to Hive Metastore

Pre-requisites​

Steps​

Running sync​

Register the target table in Hive Metastore​

Conclusion​

Pre-requisites

Steps

Running sync

Register the target table in Hive Metastore

Conclusion