Upgrade to arrow/parquet 57.0.0 #17888
Conversation
Many of the current failures are because this used to work:

```sql
select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))')
```

or

```sql
SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t
```

After the arrow 57 upgrade it fails with errors like:

```
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)
```

I believe the problem is that the display format of the timestamp type (including the timezone) has changed. I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue.
force-pushed from ed43cc0 to ee2de0c
```diff
  // Create Flight client
- let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
+ let endpoint = Endpoint::new("http://localhost:50051")?;
```
This is due to the new version of tonic.
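For context, a minimal sketch of the new connection pattern, assuming a tonic 0.14-style `Endpoint` API and the `FlightServiceClient` generated by arrow-flight (the URL is the placeholder from the diff):

```rust
use arrow_flight::flight_service_client::FlightServiceClient;
use tonic::transport::Endpoint;

async fn connect_flight() -> Result<(), Box<dyn std::error::Error>> {
    // Newer tonic wants an explicit Endpoint built first...
    let endpoint = Endpoint::new("http://localhost:50051")?;
    // ...which is then connected into a Channel that the client wraps
    let channel = endpoint.connect().await?;
    let mut _client = FlightServiceClient::new(channel);
    Ok(())
}
```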
```diff
  // add an initial FlightData message that sends schema
  let options = arrow::ipc::writer::IpcWriteOptions::default();
+ let mut compression_context = CompressionContext::default();
```
```diff
- let validate =
-     T::validate_decimal_precision(new_value, self.target_precision);
+ let validate = T::validate_decimal_precision(
```
This is due to an upstream change (to get better error messages).
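As a rough illustration of the API being exercised here — a sketch assuming the long-standing two-argument form of `validate_decimal_precision` on arrow's `DecimalType` trait (the exact 57 signature may differ, per the reformatted call above):

```rust
use arrow::datatypes::{Decimal128Type, DecimalType};

fn demo() {
    // 1000 needs four digits, so it does not fit in precision 3 (max 999);
    // arrow rejects it with a descriptive message naming the offending value
    match Decimal128Type::validate_decimal_precision(1000, 3) {
        Ok(()) => println!("value fits"),
        Err(e) => println!("rejected: {e}"),
    }
}
```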
I'll add "Closes #3666" to the PR body 👍
```diff
- List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
+ List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)
```
Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket.
I will try to call out individual changes when I see them. Lists are way nicer now:
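To illustrate — a hedged sketch; the exact rendered text is my understanding of the arrow 57 `Display` impl:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

fn main() {
    let list = DataType::List(Arc::new(Field::new("item", DataType::Int64, true)));
    // Pre-57 this printed the full Field struct with dict_id, metadata, etc.;
    // arrow 57 renders it compactly, e.g. `List(nullable Int64)`
    println!("{list}");
}
```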
```diff
  05)--------ProjectionExec: expr=[]
  06)----------CoalesceBatchesExec: target_batch_size=8192
- 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
+ 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])
```
```diff
  SELECT arrow_typeof(now()::timestamp)
  ----
- Timestamp(Nanosecond, None)
+ Timestamp(ns)
```
I believe we'll need to update https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/data_types.md and https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md (the latter via updating the source docs in code) too, but that can be done in a follow-up.
```diff
  ## Timestamps: Create a table

+ statement ok
```
The timestamp format has changed (improved!), so let's also add tests for the new format.
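A sketch of the display change being tested, as I understand the arrow 57 formatting (the exact rendered text may differ slightly):

```rust
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // arrow 57 renders this as `Timestamp(ns)` instead of
    // the old `Timestamp(Nanosecond, None)`
    println!("{}", DataType::Timestamp(TimeUnit::Nanosecond, None));
    // and the zoned form as `Timestamp(ns, "+00:00")`
    println!("{}", DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".into())));
}
```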
```diff
  pbjson-types = { workspace = true }
  prost = { workspace = true }
- substrait = { version = "0.58", features = ["serde"] }
+ substrait = { version = "0.59", features = ["serde"] }
```
Since prost is updated, we must also update substrait.
force-pushed from 9d06200 to 1b7b559
force-pushed from 8ecbbed to d3b328b
force-pushed from f61623e to 9f6a390
force-pushed from d5bd26e to 7709acc
datafusion-cli/src/main.rs (Outdated)
```diff
- | alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
  | alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
  | lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
+ | alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false |
```
I don't know why the metadata size has increased. I will investigate
```diff
- let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
-     Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
- assert_eq!(expected, arrow_schema.to_string());
+ insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
```
Many, many diffs are due to the changes in formatting of Fields and DataTypes (see below).
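Most of those diffs follow the same mechanical pattern: replace a hand-built expected string with an insta inline snapshot, then refresh the inline literal. A minimal sketch of that pattern (the rendered string here is a hypothetical stand-in):

```rust
use insta::assert_snapshot;

fn render_schema() -> String {
    // stand-in for something like arrow_schema.to_string()
    r#"Field { "c0": nullable Boolean }"#.to_string()
}

#[test]
fn field_display_snapshot() {
    // `cargo insta review` rewrites the @"..." literal when the output changes
    assert_snapshot!(render_schema(), @r#"Field { "c0": nullable Boolean }"#);
}
```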
```diff
  +----------------------+
  | arrow_typeof(test.l) |
  +----------------------+
+ | List(nullable Int32) |
```
The new display is much easier to read, in my opinion.
Ok, the tests are now looking good enough to test with the new thrift decoder.
force-pushed from 7709acc to 5e1ea80
force-pushed from 7c58fa3 to a1f72e2
…th `FileDecryptionProperties` (#8626)

# Which issue does this PR close?
- Related to #7835
- Follow on to #8470

# Rationale for this change
While [testing arrow 57](apache/datafusion#17888) with DataFusion, I found there was a disconnect in the builders: the `FileDecryptionProperties` builder builds `Arc` while the `FileEncryptionProperties` builder builds the struct directly. Let's make the APIs consistent. This also allows encryption properties to be shared and cloned cheaply (I am not sure how often this is needed, but it seems like a reasonable thing to do).

# What changes are included in this PR?
See above

# Are these changes tested?
Yes, by existing tests

# Are there any user-facing changes?
This is an API change, started in #8470
# Which issue does this PR close?
- Related to #7835

# Rationale for this change
While testing the arrow 57 upgrade in DataFusion (apache/datafusion#17888) I found a few things that need to be fixed in parquet-rs. One was that the method `ArrowWriter::into_serialized_writer` was deprecated (which I know I suggested in #8389 🤦). However, when testing, it turned out that the constructor of `SerializedFileWriter` does a lot of work (like creating the parquet schema from the arrow schema and messing with metadata): https://github.com/apache/arrow-rs/blob/c4f0fc12199df696620c73d62523c8eef5743bf2/parquet/src/arrow/arrow_writer/mod.rs#L230-L263. Creating a `RowGroupWriterFactory` directly would involve a bunch of code duplication.

# What changes are included in this PR?
So let's not deprecate this method for now and instead add some additional docs to guide people to the right place.

# Are these changes tested?
I tested manually upstream

# Are there any user-facing changes?
force-pushed from c4d2c37 to 6eca757
force-pushed from c8bdedd to 31e6327
🤖: Benchmark completed
force-pushed from 31e6327 to 54f7bed
force-pushed from 54f7bed to b22026e
```diff
  | filename | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
  +-------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-------+-----------+-----------+------------------+----------------------+-----------------+-----------------+-------------+------------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
- | ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [RLE_DICTIONARY, PLAIN, RLE] | | 4 | 46 | 121 | 123 |
+ | ../datafusion/core/tests/data/fixed_size_list_array.parquet | 0 | 2 | 1 | 123 | 0 | 125 | 4 | "f0.list.item" | INT64 | 1 | 4 | 0 | | 1 | 4 | SNAPPY | [PLAIN, RLE, RLE_DICTIONARY] | | 4 | 46 | 121 | 123 |
```
This is due to apache/arrow-rs#8587, which changed the order in which the encodings are displayed.
```diff
- | alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
  | alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
  | lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
+ | alltypes_plain.parquet | 1851 | 6957 | 2 | page_index=false |
```
The thrift remodel has made the in-memory footprint of `ParquetMetaData` significantly smaller, likely due to a more efficient in-memory representation of the PageIndex (huge thanks to @etseidl).
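For reference, a sketch of how a number like the one above can be measured, assuming parquet's `ParquetMetaDataReader` and `ParquetMetaData::memory_size` APIs (file handling elided):

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;

fn metadata_memory_size(file: Bytes) -> Result<usize> {
    let metadata = ParquetMetaDataReader::new()
        // also decode the page index, matching the `page_index=true` rows
        .with_page_indexes(true)
        .parse_and_finish(&file)?;
    // estimate of the decoded metadata's in-memory footprint; this is the
    // figure that shrank with the arrow 57 thrift remodel
    Ok(metadata.memory_size())
}
```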
```diff
  // Create Flight client
- let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
+ let endpoint = Endpoint::new("http://localhost:50051")?;
```
This is required due to the tonic upgrade.
```diff
  fn setup_encryption(
      parquet_df: &DataFrame,
- ) -> Result<(FileEncryptionProperties, FileDecryptionProperties), DataFusionError> {
+ ) -> Result<(Arc<FileEncryptionProperties>, Arc<FileDecryptionProperties>), DataFusionError>
```
The upstream arrow APIs now use `Arc` instead of raw structs; see apache/arrow-rs#8470.
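A hedged sketch of what the `Arc`-returning builder looks like to a caller, per the #8470/#8626 discussion quoted earlier (the key bytes are placeholders, and this assumes parquet's `encryption` feature is enabled):

```rust
use std::sync::Arc;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

fn build_encryption() -> Result<Arc<FileEncryptionProperties>> {
    // With arrow-rs 57 the builder hands back Arc<FileEncryptionProperties>,
    // so the same properties can be shared across writers cheaply
    FileEncryptionProperties::builder(b"0123456789012345".to_vec()).build()
}
```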
```diff
  let schema = DFSchema::try_from_qualified_schema("t1", &test_schema_1())?;
  let arrow_schema = schema.as_arrow();
- insta::assert_snapshot!(arrow_schema, @r#"Field { name: "c0", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }"#);
+ insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
```
Due to the improvement in displaying Fields.
```diff
  pub use crate::config::{ConfigFileDecryptionProperties, ConfigFileEncryptionProperties};

- #[cfg(feature = "parquet_encryption")]
- pub fn map_encryption_to_config_encryption(
```
These methods were redundant and unused, so I removed them.
```diff
  impl ToPyArrow for ScalarValue {
-     fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
+     fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
```
Due to the pyo3 upgrade.
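A sketch of the Bound-API shape the upgrade requires, assuming pyo3 0.23+ conventions (the wrapper type and conversion body are illustrative, not DataFusion's actual impl):

```rust
use pyo3::prelude::*;

struct Wrapper(i64);

impl Wrapper {
    // The result now borrows the GIL lifetime 'py via Bound<'py, PyAny>
    // instead of returning a GIL-independent PyObject
    fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
        Ok(self.0.into_pyobject(py)?.into_any())
    }
}
```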
```diff
- _ => {
-     error!("fail to read page index.")
- }
+ let ColumnIndexMetaData::INT32(index) = int_col_index else {
```
The PageIndex structures have now changed
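A hedged sketch of the new typed enum in use (I believe the enum lives under parquet's page-index module in 57, but the path is an assumption; the let-else shape mirrors the diff above):

```rust
use parquet::file::page_index::column_index::ColumnIndexMetaData;

fn expect_int32(int_col_index: &ColumnIndexMetaData) {
    // Bail out unless the column index carries INT32 statistics,
    // replacing the old catch-all `_ =>` match arm
    let ColumnIndexMetaData::INT32(_index) = int_col_index else {
        eprintln!("fail to read page index.");
        return;
    };
    println!("read a typed INT32 column index");
}
```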
This PR is now ready for review.
LGTM 👍
```rust
// Can disable the cache even with filter pushdown by setting the size to 0. In this case we
// expect the inner records are reported but no records are read from the cache
```
nit: the wording is a bit off here, since it reads as

> Can disable the cache even with filter pushdown by setting the size to 0. In this case we no records are read from the cache and no metrics are reported

Should it be this, maybe?

> Can disable the cache even with filter pushdown by setting the size to 0. This results in no records being read from the cache and no metrics being reported
```diff
- statement error DataFusion error: type_coercion\ncaused by\nError during planning: Cannot coerce arithmetic expression Timestamp\(Nanosecond, Some\("\+00:00"\)\) \+ Utf8 to valid types
+ statement error
  select i_item_desc from test
  where d3_date > now() + '5 days';
+ ----
+ DataFusion error: type_coercion
+ caused by
+ Error during planning: Cannot coerce arithmetic expression Timestamp(ns, "+00:00") + Utf8 to valid types
```
I thought the expected error comes before the query, not after, for SLTs 🤔
# Which issue does this PR close?
- 57.0.0 (October 2025) arrow-rs#7835

Note: while this PR looks massive, a large portion is display updates due to the better display of Fields and DataTypes.

# Rationale for this change
Upgrade to the latest arrow. Also, there are several new features in arrow-57 that I want to be able to test, including Variant, arrow-avro, and a new parquet metadata reader.

# What changes are included in this PR?

# Are these changes tested?
By CI

# Are there any user-facing changes?
New arrow