Querying from Apache Spark
To read an Apache XTable™ (Incubating) synced target table (regardless of the table format) in Apache Spark, whether locally or on services like Amazon EMR, Google Cloud Dataproc, Azure HDInsight, or Databricks, you do not need any additional jars or configs beyond what is required by the respective table format.
Refer to the project-specific documentation for the required configurations that need to be passed in when you create the Spark session or when you submit a Spark job.
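For illustration, here is a minimal sketch of creating a PySpark session that can read a Delta target table; the `io.delta` package coordinates and version are assumptions and should be chosen to match your Spark and Scala versions (see the Delta Lake docs referenced below).

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session able to read Delta tables.
# The package version below is an assumption; pick the one matching
# your Spark/Scala versions (see the Delta Lake docs).
spark = (
    SparkSession.builder.appName("xtable-spark-read")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```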
- **targetFormat: HUDI**

  For Hudi, refer to the Spark Guide page.

  **LIMITATION for Hudi target format:** to validate the results of the Hudi target table, ensure that you are using Hudi version 0.14.0, as mentioned here.

  ```python
  # Read options for the synced Hudi table: enable the metadata table
  # and Hive-style partitioning.
  hudi_options = {
      "hoodie.metadata.enable": "true",
      "hoodie.datasource.write.hive_style_partitioning": "true",
  }
  df = spark.read.format("hudi").options(**hudi_options).load("/path/to/source/data")
  ```
- **targetFormat: DELTA**

  For Delta Lake, refer to the Set up interactive shell page.

  ```python
  df = spark.read.format("delta").load("/path/to/source/data")
  ```
- **targetFormat: ICEBERG**

  For Iceberg, refer to the Using Iceberg in Spark 3 page.

  ```python
  df = spark.read.format("iceberg").load("/path/to/source/data")
  ```
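Whichever format you load, the result is an ordinary Spark DataFrame, so you can query it directly; here is a small usage sketch (the view name is hypothetical):

```python
# Sanity-check the synced table; "xtable_synced" is a hypothetical view name.
df.createOrReplaceTempView("xtable_synced")
spark.sql("SELECT COUNT(*) AS row_count FROM xtable_synced").show()
```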