Upgrade to arrow/parquet 57.0.0 by alamb · Pull Request #17888 · apache/datafusion · GitHub
@alamb

@alamb alamb commented Oct 2, 2025

Which issue does this PR close?

Note: while this PR looks massive, a large portion of it is display updates caused by the improved display of Fields and DataTypes.

Rationale for this change

Upgrade to the latest arrow

Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new parquet metadata reader.

What changes are included in this PR?

  1. Update arrow/parquet
  2. Update prost
  3. Update substrait
  4. Update pbjson
  5. Make API changes to avoid deprecated APIs

Are these changes tested?

By CI

Are there any user-facing changes?

New arrow

@github-actions github-actions bot added the common Related to common crate label Oct 2, 2025
@github-actions github-actions bot added substrait Changes to the substrait crate proto Related to proto crate labels Oct 2, 2025
@alamb

alamb commented Oct 2, 2025

Many of the current failures occur because this used to work:

select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))')

or

SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t

After the arrow 57 upgrade it fails with errors like

statement error DataFusion error: Execution error: Unsupported type 'Timestamp\(Nanosecond, None\)'\. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'\. Error expected double quoted string for Timezone, got 'None'
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)

I believe the problem is that the format of the timestamp type (including its timezone) has changed to forms like Timestamp(ns), and the FromStr method doesn't handle the old form. I will work on filing an update.

I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue
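
The idea of accepting both spellings can be sketched as follows. This is a minimal standalone illustration and not the arrow-rs parser: `TimeUnit`, `TimestampType`, `parse_unit`, `parse_tz`, and `parse_timestamp_type` are hypothetical names, and only the `ns` abbreviation is taken from the output shown in this PR (the other short unit names are assumptions).

```rust
/// Hypothetical stand-in for a time unit; not the arrow-rs type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical parsed form of a timestamp type name.
#[derive(Debug, PartialEq)]
struct TimestampType {
    unit: TimeUnit,
    tz: Option<String>,
}

/// Accept the long (old) unit names and short (new) ones.
/// Only "ns" appears in this PR; "s", "ms" and "us" are assumed.
fn parse_unit(s: &str) -> Option<TimeUnit> {
    match s {
        "Second" | "s" => Some(TimeUnit::Second),
        "Millisecond" | "ms" => Some(TimeUnit::Millisecond),
        "Microsecond" | "us" => Some(TimeUnit::Microsecond),
        "Nanosecond" | "ns" => Some(TimeUnit::Nanosecond),
        _ => None,
    }
}

/// Accept `None`, `Some("-05:00")` (old) and `"-05:00"` (new).
/// The outer Option signals a parse failure, the inner one the timezone.
fn parse_tz(s: &str) -> Option<Option<String>> {
    let s = s.trim();
    if s == "None" {
        return Some(None);
    }
    // Unwrap the old `Some(...)` wrapper if present, otherwise use the text as is.
    let inner = s
        .strip_prefix("Some(")
        .and_then(|rest| rest.strip_suffix(')'))
        .unwrap_or(s);
    // The timezone itself must be a double-quoted string.
    let tz = inner.strip_prefix('"')?.strip_suffix('"')?;
    Some(Some(tz.to_string()))
}

/// Parse either `Timestamp(Nanosecond, None)` or `Timestamp(ns)` style names.
fn parse_timestamp_type(s: &str) -> Option<TimestampType> {
    let body = s.trim().strip_prefix("Timestamp(")?.strip_suffix(')')?;
    match body.split_once(',') {
        // New short form without a timezone, e.g. `Timestamp(ns)`.
        None => Some(TimestampType { unit: parse_unit(body.trim())?, tz: None }),
        Some((unit, tz)) => Some(TimestampType {
            unit: parse_unit(unit.trim())?,
            tz: parse_tz(tz)?,
        }),
    }
}
```

With this sketch, `Timestamp(Nanosecond, Some("-05:00"))` and `Timestamp(ns, "-05:00")` parse to the same value, which is the behavior the comment above asks for.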


// Create Flight client
let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
let endpoint = Endpoint::new("http://localhost:50051")?;

This is due to the new version of tonic


// add an initial FlightData message that sends schema
let options = arrow::ipc::writer::IpcWriteOptions::default();
let mut compression_context = CompressionContext::default();

let validate =
T::validate_decimal_precision(new_value, self.target_precision);
let validate = T::validate_decimal_precision(

I'll add "Closes #3666" to the PR body 👍

List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)

Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket

I will try to call out individual changes when I see them. Lists are way nicer now:

05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])

SELECT arrow_typeof(now()::timestamp)
----
Timestamp(Nanosecond, None)
Timestamp(ns)

## Timestamps: Create a table

statement ok

The timestamp format has changed (improved!) so let's also add tests for the new format

pbjson-types = { workspace = true }
prost = { workspace = true }
substrait = { version = "0.58", features = ["serde"] }
substrait = { version = "0.59", features = ["serde"] }

Since prost is updated, we must also update substrait

@github-actions github-actions bot added the core Core DataFusion crate label Oct 2, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 9d06200 to 1b7b559 Compare October 2, 2025 18:56
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Oct 2, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 8ecbbed to d3b328b Compare October 3, 2025 15:48
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from f61623e to 9f6a390 Compare October 3, 2025 16:04
@github-actions github-actions bot added sql SQL Planner physical-expr Changes to the physical-expr crates optimizer Optimizer rules functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Oct 3, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from d5bd26e to 7709acc Compare October 3, 2025 20:26
| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
| alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false |

I don't know why the metadata size has increased. I will investigate

let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
assert_eq!(expected, arrow_schema.to_string());
insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);

Many of the diffs are due to changes in the formatting of Fields and DataTypes (see below)

+----------------------+
| arrow_typeof(test.l) |
+----------------------+
| List(nullable Int32) |

the new display is much easier to read in my opinion
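
To make the new rendering concrete, here is a small standalone sketch that reproduces the strings quoted in this PR. `SketchType`, `SketchField`, and `list_of` are hypothetical stand-ins, not the arrow types, and the `Display` logic only imitates the output shown above; it is not the arrow-rs implementation. How non-nullable fields are rendered is an assumption (the word `nullable` is simply dropped).

```rust
use std::fmt;

/// Hypothetical stand-in for a data type; not arrow's `DataType`.
enum SketchType {
    Boolean,
    Int32,
    Int64,
    List(Box<SketchField>),
}

/// Hypothetical stand-in for a field; not arrow's `Field`.
struct SketchField {
    name: String,
    data_type: SketchType,
    nullable: bool,
}

impl fmt::Display for SketchType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchType::Boolean => write!(f, "Boolean"),
            SketchType::Int32 => write!(f, "Int32"),
            SketchType::Int64 => write!(f, "Int64"),
            // The item name is omitted, matching `List(nullable Int32)` above.
            SketchType::List(item) => {
                let null = if item.nullable { "nullable " } else { "" };
                write!(f, "List({}{})", null, item.data_type)
            }
        }
    }
}

impl fmt::Display for SketchField {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let null = if self.nullable { "nullable " } else { "" };
        write!(f, "Field {{ \"{}\": {}{} }}", self.name, null, self.data_type)
    }
}

/// Build a list type whose item is a nullable field named "item".
fn list_of(item: SketchType) -> SketchType {
    SketchType::List(Box::new(SketchField {
        name: "item".to_string(),
        data_type: item,
        nullable: true,
    }))
}
```

Formatting `list_of(SketchType::Int32)` with `to_string()` yields the same text as the `arrow_typeof(test.l)` result above.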

@alamb

alamb commented Oct 3, 2025

Ok, the tests are now looking good enough to test with the new thrift decoder

@alamb

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: tpch_mem
Results will be posted here when complete

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 7c58fa3 to a1f72e2 Compare October 15, 2025 19:12
@github-actions github-actions bot added the execution Related to the execution crate label Oct 15, 2025
alamb added a commit to apache/arrow-rs that referenced this pull request Oct 16, 2025
…th `FileDecryptionProperties` (#8626)

# Which issue does this PR close?

- Related to #7835
- Follow on to #8470

# Rationale for this change

While [testing arrow
57](apache/datafusion#17888) with DataFusion, I
found an inconsistency between the builders: the `FileDecryptionProperties`
builder builds an `Arc`, while the `FileEncryptionProperties` builder
builds the struct directly

Let's make the APIs consistent

This also allows encryption properties to be shared and cloned cheaply
(I am not sure how often this is needed, but it seems like a reasonable
thing to do)

# What changes are included in this PR?

See above

# Are these changes tested?

Yes, by existing tests

# Are there any user-facing changes?

This is an API change, started in
#8470
alamb added a commit to apache/arrow-rs that referenced this pull request Oct 16, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Related to #7835


# Rationale for this change


While testing the arrow 57 upgrade in DataFusion I found a few things
that need to be fixed
in parquet-rs.

- apache/datafusion#17888

One was that the method `ArrowWriter::into_serialized_writer` was
deprecated, (which I know I suggested in
#8389 🤦 ). However, when
testing it turns out that the constructor of `SerializedFileWriter` does
a lot of work (like creating the parquet schema from the arrow schema
and messing with metadata)
https://github.com/apache/arrow-rs/blob/c4f0fc12199df696620c73d62523c8eef5743bf2/parquet/src/arrow/arrow_writer/mod.rs#L230-L263

Creating a `RowGroupWriterFactory` directly would involve a bunch of
code duplication

# What changes are included in this PR?

So let's not deprecate this method for now and instead add some
additional docs to guide people to the right place


# Are these changes tested?
I tested manually upstream

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch 4 times, most recently from c4d2c37 to 6eca757 Compare October 16, 2025 18:30
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch 2 times, most recently from c8bdedd to 31e6327 Compare October 17, 2025 19:29
@alamb

alamb commented Oct 18, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (31e6327) to 522403b diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb

alamb commented Oct 18, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2729.60 ms │             2598.37 ms │     no change │
│ QQuery 1     │  1395.37 ms │             1224.05 ms │ +1.14x faster │
│ QQuery 2     │  2554.95 ms │             2408.74 ms │ +1.06x faster │
│ QQuery 3     │  1186.23 ms │             1180.34 ms │     no change │
│ QQuery 4     │  2241.05 ms │             2206.69 ms │     no change │
│ QQuery 5     │ 27330.65 ms │            27905.39 ms │     no change │
│ QQuery 6     │  4231.81 ms │             4199.11 ms │     no change │
│ QQuery 7     │  3545.08 ms │             3482.55 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 45214.74ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 45205.24ms │
│ Average Time (HEAD)                   │  5651.84ms │
│ Average Time (alamb_upgrade_arrow_57) │  5650.66ms │
│ Queries Faster                        │          2 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          6 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.16 ms │                2.11 ms │     no change │
│ QQuery 1     │    54.21 ms │               48.90 ms │ +1.11x faster │
│ QQuery 2     │   149.91 ms │              133.31 ms │ +1.12x faster │
│ QQuery 3     │   165.66 ms │              159.62 ms │     no change │
│ QQuery 4     │  1037.82 ms │             1065.89 ms │     no change │
│ QQuery 5     │  1477.45 ms │             1499.46 ms │     no change │
│ QQuery 6     │     2.15 ms │                2.17 ms │     no change │
│ QQuery 7     │    53.97 ms │               54.21 ms │     no change │
│ QQuery 8     │  1419.19 ms │             1445.11 ms │     no change │
│ QQuery 9     │  1773.22 ms │             1800.19 ms │     no change │
│ QQuery 10    │   387.41 ms │              380.02 ms │     no change │
│ QQuery 11    │   433.91 ms │              432.05 ms │     no change │
│ QQuery 12    │  1336.75 ms │             1388.94 ms │     no change │
│ QQuery 13    │  2113.69 ms │             2137.76 ms │     no change │
│ QQuery 14    │  1255.69 ms │             1266.09 ms │     no change │
│ QQuery 15    │  1175.74 ms │             1206.58 ms │     no change │
│ QQuery 16    │  2654.19 ms │             2639.46 ms │     no change │
│ QQuery 17    │  2621.81 ms │             2627.30 ms │     no change │
│ QQuery 18    │  5160.02 ms │             4968.46 ms │     no change │
│ QQuery 19    │   127.45 ms │              127.98 ms │     no change │
│ QQuery 20    │  2075.40 ms │             1955.66 ms │ +1.06x faster │
│ QQuery 21    │  2413.93 ms │             2280.31 ms │ +1.06x faster │
│ QQuery 22    │  4069.35 ms │             3896.75 ms │     no change │
│ QQuery 23    │ 13016.35 ms │            12591.63 ms │     no change │
│ QQuery 24    │   226.10 ms │              214.40 ms │ +1.05x faster │
│ QQuery 25    │   513.93 ms │              508.46 ms │     no change │
│ QQuery 26    │   231.86 ms │              205.76 ms │ +1.13x faster │
│ QQuery 27    │  2915.87 ms │             2808.81 ms │     no change │
│ QQuery 28    │ 23143.02 ms │            24400.06 ms │  1.05x slower │
│ QQuery 29    │   960.56 ms │              969.13 ms │     no change │
│ QQuery 30    │  1314.71 ms │             1277.65 ms │     no change │
│ QQuery 31    │  1316.03 ms │             1310.51 ms │     no change │
│ QQuery 32    │  4370.67 ms │             4500.14 ms │     no change │
│ QQuery 33    │  5698.57 ms │             5663.60 ms │     no change │
│ QQuery 34    │  5864.74 ms │             5816.17 ms │     no change │
│ QQuery 35    │  2029.44 ms │             1974.93 ms │     no change │
│ QQuery 36    │   120.39 ms │              117.52 ms │     no change │
│ QQuery 37    │    51.92 ms │               52.26 ms │     no change │
│ QQuery 38    │   119.36 ms │              119.72 ms │     no change │
│ QQuery 39    │   195.98 ms │              194.92 ms │     no change │
│ QQuery 40    │    42.56 ms │               44.64 ms │     no change │
│ QQuery 41    │    40.59 ms │               41.78 ms │     no change │
│ QQuery 42    │    31.85 ms │               32.30 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 94165.60ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 94362.71ms │
│ Average Time (HEAD)                   │  2189.90ms │
│ Average Time (alamb_upgrade_arrow_57) │  2194.48ms │
│ Queries Faster                        │          6 │
│ Queries Slower                        │          1 │
│ Queries with No Change                │         36 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 165.91 ms │              176.22 ms │  1.06x slower │
│ QQuery 2     │  26.64 ms │               27.01 ms │     no change │
│ QQuery 3     │  40.66 ms │               35.64 ms │ +1.14x faster │
│ QQuery 4     │  28.31 ms │               28.36 ms │     no change │
│ QQuery 5     │  75.56 ms │               75.82 ms │     no change │
│ QQuery 6     │  19.24 ms │               19.57 ms │     no change │
│ QQuery 7     │ 208.29 ms │              208.78 ms │     no change │
│ QQuery 8     │  33.39 ms │               31.81 ms │     no change │
│ QQuery 9     │ 100.06 ms │              101.83 ms │     no change │
│ QQuery 10    │  57.15 ms │               59.70 ms │     no change │
│ QQuery 11    │  18.15 ms │               17.56 ms │     no change │
│ QQuery 12    │  50.35 ms │               49.92 ms │     no change │
│ QQuery 13    │  46.00 ms │               45.79 ms │     no change │
│ QQuery 14    │  13.34 ms │               13.33 ms │     no change │
│ QQuery 15    │  24.29 ms │               23.76 ms │     no change │
│ QQuery 16    │  24.01 ms │               24.69 ms │     no change │
│ QQuery 17    │ 145.09 ms │              146.70 ms │     no change │
│ QQuery 18    │ 316.87 ms │              317.19 ms │     no change │
│ QQuery 19    │  45.27 ms │               37.64 ms │ +1.20x faster │
│ QQuery 20    │  47.57 ms │               47.87 ms │     no change │
│ QQuery 21    │ 329.29 ms │              326.71 ms │     no change │
│ QQuery 22    │  20.56 ms │               24.50 ms │  1.19x slower │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1836.01ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 1840.41ms │
│ Average Time (HEAD)                   │   83.45ms │
│ Average Time (alamb_upgrade_arrow_57) │   83.66ms │
│ Queries Faster                        │         2 │
│ Queries Slower                        │         2 │
│ Queries with No Change                │        18 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘
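
For readers interpreting the "Change" column, the following sketch reproduces the labels in the tables above. It is not the benchmark script itself: `describe_change` is a hypothetical name, and the 5% noise threshold is an assumption inferred from the rows shown (for example, 2729.60 ms vs 2598.37 ms is reported as "no change", while 226.10 ms vs 214.40 ms is reported as "+1.05x faster").

```rust
/// Label a timing change in the style of the "Change" column above.
/// `baseline_ms` is the HEAD time, `branch_ms` the time on the PR branch.
/// The 5% threshold is an assumption inferred from the tables in this PR.
fn describe_change(baseline_ms: f64, branch_ms: f64) -> String {
    const NOISE_THRESHOLD: f64 = 0.05;
    // Relative difference of the branch against the baseline.
    let delta = (branch_ms - baseline_ms) / baseline_ms;
    if delta.abs() <= NOISE_THRESHOLD {
        "no change".to_string()
    } else if branch_ms < baseline_ms {
        // Speedup factor: how many times faster the branch is.
        format!("+{:.2}x faster", baseline_ms / branch_ms)
    } else {
        // Slowdown factor: how many times slower the branch is.
        format!("{:.2}x slower", branch_ms / baseline_ms)
    }
}
```

Under this reading, most rows fall inside the noise band, which matches the nearly identical total times in each summary.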

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 31e6327 to 54f7bed Compare October 19, 2025 19:54
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 54f7bed to b22026e Compare October 23, 2025 16:25
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 23, 2025
| filename | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
+-------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+------------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
| ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [RLE_DICTIONARY, PLAIN, RLE] | | 4 | 46 | 121 | 123 |
| ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [PLAIN, RLE, RLE_DICTIONARY] | | 4 | 46 | 121 | 123 |

This is due to apache/arrow-rs#8587, which changed the order in which the encodings are displayed

| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
| alltypes_plain.parquet | 1851 | 6957 | 2 | page_index=false |

The thrift-remodel has made the in-memory footprint of ParquetMetaData significantly smaller, likely due to a more efficient representation in memory for PageIndex

(huge thanks to @etseidl )


// Create Flight client
let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
let endpoint = Endpoint::new("http://localhost:50051")?;

this is required due to the tonic upgrade

fn setup_encryption(
parquet_df: &DataFrame,
) -> Result<(FileEncryptionProperties, FileDecryptionProperties), DataFusionError> {
) -> Result<(Arc<FileEncryptionProperties>, Arc<FileDecryptionProperties>), DataFusionError>

The arrow upstream APIs now use Arc instead of raw objects, see apache/arrow-rs#8470
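
As a rough illustration of why returning an `Arc` from a builder makes the result cheap to share, consider the sketch below. `SketchEncryptionProperties` and `SketchBuilder` are hypothetical stand-ins and do not reflect the parquet crate's actual fields or methods.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for encryption settings; not the parquet type.
#[derive(Debug)]
struct SketchEncryptionProperties {
    footer_key: Vec<u8>,
}

/// Hypothetical builder that hands out the finished value behind an `Arc`.
struct SketchBuilder {
    footer_key: Vec<u8>,
}

impl SketchBuilder {
    fn new(footer_key: Vec<u8>) -> Self {
        Self { footer_key }
    }

    /// Returning `Arc<T>` lets callers clone a pointer instead of the key bytes.
    fn build(self) -> Arc<SketchEncryptionProperties> {
        Arc::new(SketchEncryptionProperties {
            footer_key: self.footer_key,
        })
    }
}
```

Cloning the returned `Arc` only bumps a reference count, so a writer and a reader configuration can hold the same properties without copying the key material.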

let schema = DFSchema::try_from_qualified_schema("t1", &test_schema_1())?;
let arrow_schema = schema.as_arrow();
insta::assert_snapshot!(arrow_schema, @r#"Field { name: "c0", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }"#);
insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);

due to improvement in displaying Fields

pub use crate::config::{ConfigFileDecryptionProperties, ConfigFileEncryptionProperties};

#[cfg(feature = "parquet_encryption")]
pub fn map_encryption_to_config_encryption(

These methods are redundant and were not used, so I removed them


impl ToPyArrow for ScalarValue {
fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {

due to pyo3 upgrade

_ => {
error!("fail to read page index.")
}
let ColumnIndexMetaData::INT32(index) = int_col_index else {

The PageIndex structures have now changed
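
The `let ... else` pattern in the new line above can be sketched in isolation. `SketchColumnIndex` and `int32_mins` are hypothetical; the real `ColumnIndexMetaData` enum in the parquet crate has different variants and payloads, so treat this only as an illustration of the control flow.

```rust
/// Hypothetical stand-in for a typed column index; not the parquet enum.
enum SketchColumnIndex {
    None,
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

/// Return the INT32 values, or an error message for any other variant.
fn int32_mins(index: &SketchColumnIndex) -> Result<&[i32], String> {
    // `let ... else` binds when the pattern matches; otherwise the else
    // branch runs and must diverge (here by returning an error).
    let SketchColumnIndex::Int32(mins) = index else {
        return Err("expected an INT32 column index".to_string());
    };
    Ok(mins.as_slice())
}
```

Compared with a `match` that has a catch-all `_ =>` arm, this keeps the expected variant on the main path and handles every other case in one early exit.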

@alamb alamb changed the title [WIP] Upgrade to arrow/parquet 57.0.0 Upgrade to arrow/parquet 57.0.0 Oct 23, 2025
@alamb alamb marked this pull request as ready for review October 23, 2025 16:39
@alamb

alamb commented Oct 23, 2025

This PR is now ready for review


@Jefffrey Jefffrey left a comment

LGTM 👍

Comment on lines 634 to -635
// Can disable the cache even with filter pushdown by setting the size to 0. In this case we
// expect the inner records are reported but no records are read from the cache

nit: wording a bit off here, since it reads as

Can disable the cache even with filter pushdown by setting the size to 0. In this case we no records are read from the cache and no metrics are reported

Should be this maybe?

Can disable the cache even with filter pushdown by setting the size to 0. This results in no records being read from the cache and no metrics being reported


Comment on lines -88 to +94
statement error DataFusion error: type_coercion\ncaused by\nError during planning: Cannot coerce arithmetic expression Timestamp\(Nanosecond, Some\("\+00:00"\)\) \+ Utf8 to valid types
statement error
select i_item_desc from test
where d3_date > now() + '5 days';
----
DataFusion error: type_coercion
caused by
Error during planning: Cannot coerce arithmetic expression Timestamp(ns, "+00:00") + Utf8 to valid types

I thought the expected error comes before the query not after, for SLTs 🤔

Labels

- common: Related to common crate
- core: Core DataFusion crate
- datasource: Changes to the datasource crate
- documentation: Improvements or additions to documentation
- execution: Related to the execution crate
- functions: Changes to functions implementation
- optimizer: Optimizer rules
- physical-expr: Changes to the physical-expr crates
- physical-plan: Changes to the physical-plan crate
- proto: Related to proto crate
- sql: SQL Planner
- sqllogictest: SQL Logic Tests (.slt)
- substrait: Changes to the substrait crate

Development

Successfully merging this pull request may close these issues.

Incorrect error message for decimal with scale while input value is out of bound

5 participants