Syncing to Glue Data Catalog

This document walks through the steps to register an Apache XTable™ (Incubating) synced table in Glue Data Catalog on AWS.

Pre-requisites

Source table(s) (Hudi/Delta/Iceberg) already written to Amazon S3. If you don't have a source table in S3 yet, you can follow the steps in this tutorial to set one up.
Set up access to interact with AWS APIs from the command line. If you haven't installed AWS CLI v2, install it by following the steps outlined in the AWS docs, then set up access credentials as described there.
Clone the Apache XTable™ (Incubating) repository and create the xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar by following the steps on the Installation page

Steps

Running sync

Create my_config.yaml in the cloned Apache XTable™ (Incubating) directory.

targetFormat: HUDI

yaml
sourceFormat: DELTA|ICEBERG # choose only one
targetFormats:
  - HUDI
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name

targetFormat: DELTA

yaml
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat

targetFormat: ICEBERG

yaml
sourceFormat: HUDI|DELTA # choose only one
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # you only need to specify partitionSpec for HUDI sourceFormat

Note:

Replace the sourceFormat, tableBasePath and tableName values with the ones that match your table.
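
If you prefer to create the file from the terminal, the config can be written with a heredoc. The following is a minimal sketch that assumes a Hudi source synced to a Delta target; the variable names (SOURCE_FORMAT, TARGET_FORMAT, TABLE_BASE_PATH, TABLE_NAME) are illustrative and not something XTable reads itself.

```shell
# Hypothetical values: replace them with your own table details.
SOURCE_FORMAT="HUDI"
TARGET_FORMAT="DELTA"
TABLE_BASE_PATH="s3://path/to/source/data"
TABLE_NAME="table_name"

# Write the dataset config that the sync command reads.
# The heredoc delimiter is unquoted so the variables above are expanded.
cat > my_config.yaml <<EOF
sourceFormat: ${SOURCE_FORMAT}
targetFormats:
  - ${TARGET_FORMAT}
datasets:
  -
    tableBasePath: ${TABLE_BASE_PATH}
    tableName: ${TABLE_NAME}
    partitionSpec: partitionpath:VALUE # only needed when sourceFormat is HUDI
EOF
```

Drop the partitionSpec line if your source table is not Hudi.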

From your terminal, in the cloned xtable directory, run the sync process using the command below.

shell

java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

Note:

At this point, if you check your bucket path, you will see the .hoodie (Hudi), _delta_log (Delta) or metadata (Iceberg) directory. It holds the metadata files that let query engines interpret the data as the target table.
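
The check described in the note can also be done from the command line. The sketch below assumes your AWS CLI credentials can list the bucket; metadata_dir and list_target_metadata are illustrative helper names, not part of XTable or the AWS CLI.

```shell
# Map a target format to the metadata directory the sync creates.
metadata_dir() {
  case "$1" in
    HUDI)    echo ".hoodie" ;;
    DELTA)   echo "_delta_log" ;;
    ICEBERG) echo "metadata" ;;
    *) echo "unsupported target format: $1" >&2; return 1 ;;
  esac
}

# List the metadata files under the table base path.
# $1 = target format, $2 = table base path (a trailing slash is tolerated).
list_target_metadata() {
  dir=$(metadata_dir "$1") || return 1
  aws s3 ls "${2%/}/${dir}/"
}
```

For example, `list_target_metadata DELTA s3://path/to/source/data` lists the contents of the _delta_log directory.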

Register the target table in Glue Data Catalog

From your terminal, create a Glue database.

shell

aws glue create-database --database-input "{\"Name\":\"xtable_synced_db\"}"

From your terminal, create a Glue crawler. Replace <yourAccountId>, <yourRoleName> and <path/to/source/data> with appropriate values.

shell
export accountId=<yourAccountId>
export roleName=<yourRoleName>
export s3DataPath=s3://<path/to/source/data>

targetFormat: HUDI

shell

aws glue create-crawler --name xtable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database xtable_synced_db --targets "{\"HudiTargets\":[{\"Paths\":[\"${s3DataPath}\"]}]}"

targetFormat: DELTA

shell

aws glue create-crawler --name xtable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database xtable_synced_db --targets "{\"DeltaTargets\":[{\"DeltaTables\":[\"${s3DataPath}\"]}]}"

targetFormat: ICEBERG

shell

aws glue create-crawler --name xtable_crawler --role arn:aws:iam::${accountId}:role/service-role/${roleName} --database xtable_synced_db --targets "{\"IcebergTargets\":[{\"Paths\":[\"${s3DataPath}\"]}]}"
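
The three commands differ only in the --targets JSON, so the format-specific part can be factored out. The following is a minimal sketch; crawler_targets is an illustrative helper name, not an AWS CLI feature. Note that in the Glue API, Hudi and Iceberg targets list their locations under Paths, while Delta targets use DeltaTables.

```shell
# Print the --targets JSON for a given target format and S3 path.
# $1 = HUDI | DELTA | ICEBERG, $2 = S3 path of the table.
crawler_targets() {
  case "$1" in
    HUDI)    printf '{"HudiTargets":[{"Paths":["%s"]}]}\n' "$2" ;;
    DELTA)   printf '{"DeltaTargets":[{"DeltaTables":["%s"]}]}\n' "$2" ;;
    ICEBERG) printf '{"IcebergTargets":[{"Paths":["%s"]}]}\n' "$2" ;;
    *) echo "unsupported target format: $1" >&2; return 1 ;;
  esac
}
```

You could then pass `--targets "$(crawler_targets DELTA "${s3DataPath}")"` to aws glue create-crawler instead of writing the escaped JSON by hand.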

From your terminal, run the Glue crawler.

shell

aws glue start-crawler --name xtable_crawler

Once the crawler succeeds, you’ll be able to query the synced table from Athena, EMR and/or Redshift query engines.
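
The start-crawler call returns before the crawl finishes, so you may want to wait for it from the terminal. The sketch below polls aws glue get-crawler until the crawler state is READY again and then prints the status of the last crawl (SUCCEEDED, CANCELLED or FAILED). The helper name wait_for_crawler and the 15-second default interval are assumptions made for illustration.

```shell
# Poll a Glue crawler until it is idle, then print the last crawl status.
# $1 = crawler name, $2 = seconds between polls (defaults to 15).
wait_for_crawler() {
  name="$1"
  interval="${2:-15}"
  while true; do
    # Crawler.State is READY, RUNNING or STOPPING.
    state=$(aws glue get-crawler --name "$name" \
      --query 'Crawler.State' --output text) || return 1
    [ "$state" = "READY" ] && break
    sleep "$interval"
  done
  aws glue get-crawler --name "$name" \
    --query 'Crawler.LastCrawl.Status' --output text
}
```

Calling `wait_for_crawler xtable_crawler` right after the start-crawler command blocks until the crawl is done.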

LIMITATION for Hudi target format:

To validate the results for a Hudi target table, make sure the query engine you use supports Hudi version 0.14.0.

Validating the results

After the crawler runs successfully, you can inspect the catalogued tables in Glue and also query the table in Amazon Athena like below:

SELECT * FROM xtable_synced_db.<table_name>;

Conclusion

In this guide we saw how to:

sync a source table to create metadata for the desired target table formats using Apache XTable™ (Incubating)
catalog the data in the target table format in Glue Data Catalog
query the target table using Amazon Athena
