Upgrade to arrow/parquet 57.0.0 #17888
Conversation
Many of the current failures are because this used to work:

```sql
select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))')
```

or

```sql
SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t
```

After the arrow 57 upgrade it fails with errors like:

```
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)
```

I believe the problem is that the display format of the timestamp type (including the timezone) has changed. I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue.
force-pushed from ed43cc0 to ee2de0c
```diff
  // Create Flight client
- let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
+ let endpoint = Endpoint::new("http://localhost:50051")?;
```
This is due to the new version of tonic.
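For context, a minimal sketch of the new connection pattern, assuming a tonic 0.14-style `Endpoint` API and the `FlightServiceClient` generated by arrow-flight (the URL is the placeholder from the diff):

```rust
use arrow_flight::flight_service_client::FlightServiceClient;
use tonic::transport::Endpoint;

async fn connect_flight() -> Result<(), Box<dyn std::error::Error>> {
    // Newer tonic wants an explicit Endpoint built first...
    let endpoint = Endpoint::new("http://localhost:50051")?;
    // ...which is then connected into a Channel that the client wraps
    let channel = endpoint.connect().await?;
    let mut _client = FlightServiceClient::new(channel);
    Ok(())
}
```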
```diff
  // add an initial FlightData message that sends schema
  let options = arrow::ipc::writer::IpcWriteOptions::default();
+ let mut compression_context = CompressionContext::default();
```
```diff
- let validate =
-     T::validate_decimal_precision(new_value, self.target_precision);
+ let validate = T::validate_decimal_precision(
```
This is due to an upstream change (to get better error messages).
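As a rough illustration of the API being exercised here — a sketch assuming the long-standing two-argument form of `validate_decimal_precision` on arrow's `DecimalType` trait (the exact 57 signature may differ, per the reformatted call above):

```rust
use arrow::datatypes::{Decimal128Type, DecimalType};

fn demo() {
    // 1000 needs four digits, so it does not fit in precision 3 (max 999);
    // arrow rejects it with a descriptive message naming the offending value
    match Decimal128Type::validate_decimal_precision(1000, 3) {
        Ok(()) => println!("value fits"),
        Err(e) => println!("rejected: {e}"),
    }
}
```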
I'll add "Closes #3666" to the PR body 👍
```diff
- List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
+ List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)
```
Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket.
I will try to call out individual changes when I see them. Lists are way nicer now:
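To illustrate — a hedged sketch; the exact rendered text is my understanding of the arrow 57 `Display` impl:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

fn main() {
    let list = DataType::List(Arc::new(Field::new("item", DataType::Int64, true)));
    // Pre-57 this printed the full Field struct with dict_id, metadata, etc.;
    // arrow 57 renders it compactly, e.g. `List(nullable Int64)`
    println!("{list}");
}
```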
```diff
  05)--------ProjectionExec: expr=[]
  06)----------CoalesceBatchesExec: target_batch_size=8192
- 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
+ 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])
```
```diff
  SELECT arrow_typeof(now()::timestamp)
  ----
- Timestamp(Nanosecond, None)
+ Timestamp(ns)
```
I believe we'll need to update https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/data_types.md and https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md (the latter via updating the source docs in code) too, but that can be done in a follow-up.
```diff
  ## Timestamps: Create a table

+ statement ok
```
The timestamp format has changed (improved!), so let's also add tests for the new format.
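A sketch of the display change being tested, as I understand the arrow 57 formatting (the exact rendered text may differ slightly):

```rust
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // arrow 57 renders this as `Timestamp(ns)` instead of
    // the old `Timestamp(Nanosecond, None)`
    println!("{}", DataType::Timestamp(TimeUnit::Nanosecond, None));
    // and the zoned form as `Timestamp(ns, "+00:00")`
    println!("{}", DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".into())));
}
```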
```diff
  pbjson-types = { workspace = true }
  prost = { workspace = true }
- substrait = { version = "0.58", features = ["serde"] }
+ substrait = { version = "0.59", features = ["serde"] }
```
Since prost is updated, we must also update substrait.
force-pushed from 9d06200 to 1b7b559
force-pushed from 8ecbbed to d3b328b
force-pushed from f61623e to 9f6a390
force-pushed from d5bd26e to 7709acc
datafusion-cli/src/main.rs (Outdated)
```diff
- | alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
  | alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
  | lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
+ | alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false |
```
I don't know why the metadata size has increased. I will investigate
```diff
- let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
-     Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
- assert_eq!(expected, arrow_schema.to_string());
+ insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
```
Many, many diffs are due to the changes in formatting of Fields and DataTypes (see below).
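Most of those diffs follow the same mechanical pattern: replace a hand-built expected string with an insta inline snapshot, then refresh the inline literal. A minimal sketch of that pattern (the rendered string here is a hypothetical stand-in):

```rust
use insta::assert_snapshot;

fn render_schema() -> String {
    // stand-in for something like arrow_schema.to_string()
    r#"Field { "c0": nullable Boolean }"#.to_string()
}

#[test]
fn field_display_snapshot() {
    // `cargo insta review` rewrites the @"..." literal when the output changes
    assert_snapshot!(render_schema(), @r#"Field { "c0": nullable Boolean }"#);
}
```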
```diff
  +----------------------+
  | arrow_typeof(test.l) |
  +----------------------+
+ | List(nullable Int32) |
```
The new display is much easier to read, in my opinion.
Ok, the tests are now looking good enough to test with the new thrift decoder.
force-pushed from 7709acc to 5e1ea80
force-pushed from 7c58fa3 to a1f72e2
…th `FileDecryptionProperties` (#8626)

# Which issue does this PR close?
- Related to #7835
- Follow on to #8470

# Rationale for this change
While [testing arrow 57](apache/datafusion#17888) with DataFusion, I found there was a disconnect in the builders: the `FileDecryptionProperties` builder builds `Arc` while the `FileEncryptionProperties` builder builds the struct directly. Let's make the APIs consistent. This also allows encryption properties to be shared and cloned cheaply (I am not sure how often this is needed, but it seems like a reasonable thing to do).

# What changes are included in this PR?
See above

# Are these changes tested?
Yes, by existing tests

# Are there any user-facing changes?
This is an API change, started in #8470
# Which issue does this PR close?
- Related to #7835

# Rationale for this change
While testing the arrow 57 upgrade in DataFusion (apache/datafusion#17888) I found a few things that need to be fixed in parquet-rs. One was that the method `ArrowWriter::into_serialized_writer` was deprecated (which I know I suggested in #8389 🤦). However, when testing, it turned out that the constructor of `SerializedFileWriter` does a lot of work (like creating the parquet schema from the arrow schema and messing with metadata): https://github.com/apache/arrow-rs/blob/c4f0fc12199df696620c73d62523c8eef5743bf2/parquet/src/arrow/arrow_writer/mod.rs#L230-L263. Creating a `RowGroupWriterFactory` directly would involve a bunch of code duplication.

# What changes are included in this PR?
So let's not deprecate this method for now and instead add some additional docs to guide people to the right place.

# Are these changes tested?
I tested manually upstream

# Are there any user-facing changes?
force-pushed from c4d2c37 to 6eca757
force-pushed from c8bdedd to 31e6327
🤖: Benchmark completed
force-pushed from 31e6327 to 54f7bed
force-pushed from 54f7bed to b22026e
```diff
  | filename | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
  +-------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+------------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
- | ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [RLE_DICTIONARY, PLAIN, RLE] | | 4 | 46 | 121 | 123 |
+ | ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [PLAIN, RLE, RLE_DICTIONARY] | | 4 | 46 | 121 | 123 |
```
This is due to apache/arrow-rs#8587, which changed the order in which the encodings are displayed.
```diff
- | alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
  | alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
  | lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
+ | alltypes_plain.parquet | 1851 | 6957 | 2 | page_index=false |
```
The thrift remodel has made the in-memory footprint of `ParquetMetaData` significantly smaller, likely due to a more efficient in-memory representation of the PageIndex (huge thanks to @etseidl).
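For reference, a sketch of how a number like the one above can be measured, assuming parquet's `ParquetMetaDataReader` and `ParquetMetaData::memory_size` APIs (file handling elided):

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

fn metadata_memory_size(file: Bytes) -> Result<usize> {
    let metadata = ParquetMetaDataReader::new()
        // also decode the page index, matching the `page_index=true` rows
        .with_page_indexes(true)
        .parse_and_finish(&file)?;
    // estimate of the decoded metadata's in-memory footprint; this is the
    // figure that shrank with the arrow 57 thrift remodel
    Ok(metadata.memory_size())
}
```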
```diff
  // Create Flight client
- let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
+ let endpoint = Endpoint::new("http://localhost:50051")?;
```
This is required due to the tonic upgrade.
```diff
  fn setup_encryption(
      parquet_df: &DataFrame,
- ) -> Result<(FileEncryptionProperties, FileDecryptionProperties), DataFusionError> {
+ ) -> Result<(Arc<FileEncryptionProperties>, Arc<FileDecryptionProperties>), DataFusionError>
```
The upstream arrow APIs now use `Arc` instead of raw structs; see apache/arrow-rs#8470.
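A hedged sketch of what the `Arc`-returning builder looks like to a caller, per the #8470/#8626 discussion quoted earlier (the key bytes are placeholders, and this assumes parquet's `encryption` feature is enabled):

```rust
use std::sync::Arc;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

fn build_encryption() -> Result<Arc<FileEncryptionProperties>> {
    // With arrow-rs 57 the builder hands back Arc<FileEncryptionProperties>,
    // so the same properties can be shared across writers cheaply
    FileEncryptionProperties::builder(b"0123456789012345".to_vec()).build()
}
```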
```diff
  let schema = DFSchema::try_from_qualified_schema("t1", &test_schema_1())?;
  let arrow_schema = schema.as_arrow();
- insta::assert_snapshot!(arrow_schema, @r#"Field { name: "c0", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }"#);
+ insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
```
Due to the improvement in displaying Fields.
```diff
  pub use crate::config::{ConfigFileDecryptionProperties, ConfigFileEncryptionProperties};

- #[cfg(feature = "parquet_encryption")]
- pub fn map_encryption_to_config_encryption(
```
These methods were redundant and unused, so I removed them.
```diff
  impl ToPyArrow for ScalarValue {
-     fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
+     fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
```
Due to the pyo3 upgrade.
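A sketch of the Bound-API shape the upgrade requires, assuming pyo3 0.23+ conventions (the wrapper type and conversion body are illustrative, not DataFusion's actual impl):

```rust
use pyo3::prelude::*;

struct Wrapper(i64);

impl Wrapper {
    // The result now borrows the GIL lifetime 'py via Bound<'py, PyAny>
    // instead of returning a GIL-independent PyObject
    fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
        Ok(self.0.into_pyobject(py)?.into_any())
    }
}
```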
```diff
- _ => {
-     error!("fail to read page index.")
- }
+ let ColumnIndexMetaData::INT32(index) = int_col_index else {
```
The PageIndex structures have now changed
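A hedged sketch of the new typed enum in use (I believe the enum lives under parquet's page-index module in 57, but the path is an assumption; the let-else shape mirrors the diff above):

```rust
use parquet::file::page_index::column_index::ColumnIndexMetaData;

fn expect_int32(int_col_index: &ColumnIndexMetaData) {
    // Bail out unless the column index carries INT32 statistics,
    // replacing the old catch-all `_ =>` match arm
    let ColumnIndexMetaData::INT32(_index) = int_col_index else {
        eprintln!("fail to read page index.");
        return;
    };
    println!("read a typed INT32 column index");
}
```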
This PR is now ready for review.
LGTM 👍
```rust
// Can disable the cache even with filter pushdown by setting the size to 0. In this case we
// expect the inner records are reported but no records are read from the cache
```
nit: the wording is a bit off here, since it reads as

> Can disable the cache even with filter pushdown by setting the size to 0. In this case we no records are read from the cache and no metrics are reported

Should it be this, maybe?

> Can disable the cache even with filter pushdown by setting the size to 0. This results in no records being read from the cache and no metrics being reported
```diff
- statement error DataFusion error: type_coercion\ncaused by\nError during planning: Cannot coerce arithmetic expression Timestamp\(Nanosecond, Some\("\+00:00"\)\) \+ Utf8 to valid types
+ statement error
  select i_item_desc from test
  where d3_date > now() + '5 days';
+ ----
+ DataFusion error: type_coercion
+ caused by
+ Error during planning: Cannot coerce arithmetic expression Timestamp(ns, "+00:00") + Utf8 to valid types
```
I thought the expected error comes before the query, not after, for SLTs 🤔
# Which issue does this PR close?
- 57.0.0 (October 2025) arrow-rs#7835

Note: while this PR looks massive, a large portion is display updates due to the better display of Fields and DataTypes.

# Rationale for this change
Upgrade to the latest arrow. Also, there are several new features in arrow-57 that I want to be able to test, including Variant, arrow-avro, and a new parquet metadata reader.

# What changes are included in this PR?

# Are these changes tested?
By CI

# Are there any user-facing changes?
New arrow