-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Convert column to destination charset in DML applications #1003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b649015 to
9a3eb5f
Compare
| ) auto_increment=1; | ||
|
|
||
| insert into gh_ost_test values (null, 'átesting'); | ||
| insert into gh_ost_test values (null, 'átesting', '', ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes Testing: convert-utf8mb4..ERROR 1136 (21S01) at line 10: Column count doesn't match value count at row 1
Mismatch didn't seem intentional; I can revert if needed.
| primary key(id) | ||
| ) auto_increment=1 charset latin1 collate latin1_swedish_ci; | ||
|
|
||
| insert into gh_ost_test values (null, char(189)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
189 and 190 chosen specifically because they can be replaced with a utf8 character equivalent (prefixed with \xC2). Thus the checksum at the end of the test will pass.
|
@jbielick I'm curious to see how the tests run. Would you care to push same branch (and create similar PR) in https://github.com/openark/gh-ost ? (If not, I can to the same on your behalf; but you're the author so it's best if you do it) Notwithstanding #290, which I completely missed as I was phasing out, I'm surprised this issue only shows up now. |
|
@shlomi-noach me, too. I got them running locally with mysql 5.7.25 and a couple changes to the Dockerfile and they all passed. Looking forward to a CI run. Rebased and opened at: |
|
Thank you! |
799041e to
84e55ff
Compare
|
@shlomi-noach cherry-picked commits here. |
|
@timvaillancourt 👋 this looks legit to me. However, it works on data transfer (conversion of character sets) and I'd like to see how this works under production tests. Would you be able to deploy this change? |
@shlomi-noach 👋 absolutely! I can get this on some testing hosts next week 👍 |
|
Anything go awry? |
|
@timvaillancourt @shlomi-noach Hey y'all. 👋 I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge? To add some more context, we're in need of this change to successfully convert large tables from If I were to build a binary from this branch to test in our prod DB, should I rebase and do so from the openark repo? |
@jbielick / @shlomi-noach: 👋 the production tests of this PR ran into some unrelated test failures that are still under investigation. I think tests of this PR should be ran again to make sure this didn't break anything. cc'ing @dm-2 / @rashiq for help on tests And just calling out that while the automated tests will prove gh-ost still works in general, it won't trigger the charset logic from this PR as the production test runs Lastly, I deleted the |
|
Thanks for the update.
May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something. |
@jbielick what I meant there is the tests on Production replicas at GitHub use That's why I suggest test scenarios are added to |
@timvaillancourt makes sense. I've got some tests added here in |
96689c3 to
d18d088
Compare
|
@jbielick in light of #1158 being merged I just wanted to check if this is still needed? cc @wangzihuacool who can add some insight also Thanks all 🙇 |
| } | ||
| if column.Name == mappedColumn.Name && column.Charset != mappedColumn.Charset { | ||
| this.migrationContext.SharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset) | ||
| this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbielick curious if you can elaborate on the switch from SharedColumns to MappedSharedColumns. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, chronologically I believe I added this line before #1158 was opened. SharedColumns might be the correct usage for #1158 because it leverages convertArg and does a conversion of the value in go. I preferred the approach that was similar to the DatetimetoTimestamp and EnumToTextConversion approaches where the conversion occurs within mysql, as would normally occur if I altered the table without gh-ost.
So it's not that I changed this line, but it now appears in the diff as a change because the base changed and #1158 also introduced this line with a slightly different goal.
I'm using MappedSharedColumns because those are the set that allow replacing the "SET x=x" arg with a mysql expression instead of manipulating a raw value itself (which is possible with SharedColumns). We don't need to modify the bytes that are being sent, we need mysql to interpret them differently (and as proposed here convert them).
this test assumes a latin1-encoded table with content containing bytes in the \x80-\xFF, which are invalid single-byte characters in utf8 and cannot be inserted in the altered table when the column containing these characters is changed to utf8(mb4). since these characters cannot be inserted, gh-ost fails.
No, unfortunately the test I added in this PR fails on master. |
addresses github#290 Note: there is currently no issue backfilling the ghost table when the characterset changes, likely because it's a insert-into-select-from and it all occurs within mysql. However, when applying DML events (UPDATE, DELETE, etc) the values are sprintf'd into a prepared statement and due to the possibility of migrating text column data containing invalid characters in the destination charset, a conversion step is often necessary. For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to utf8mb4 column without conversion, they fail as they do not exist in the utf8 codepage. Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed. I only point out the "Note:" above because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert The former is a test that fails prior to this commit. The latter is a test that succeeds prior to this comment. Both are affected by the code in this commit. convert text to original charset, then destination converting text first to the original charset and then to the destination charset produces the most consistent results, as inserting the binary into a utf8-charset column may encounter an error if there is no prior context of latin1 encoding. mysql> select hex(convert(char(189) using utf8mb4)); +---------------------------------------+ | hex(convert(char(189) using utf8mb4)) | +---------------------------------------+ | | +---------------------------------------+ 1 row in set, 1 warning (0.00 sec) mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4)); +-------------------------------------------------------------+ | hex(convert(convert(char(189) using latin1) using utf8mb4)) | +-------------------------------------------------------------+ | C2BD | +-------------------------------------------------------------+ 1 row in set (0.00 sec) as seen in this failure on 5.5.62 Error 1300: Invalid utf8mb4 character string: 'BD'; query= replace /* gh-ost `test`.`_gh_ost_test_gho` */ into `test`.`_gh_ost_test_gho` (`id`, `t`) values (?, convert(? using utf8mb4))
3e8b31b to
fbe437f
Compare
Related issue: #290
Description
Avoid DML apply errors by
converting character data whencharsetchanges for a column.Background
When applying DML events (UPDATE, DELETE, etc) the values are sprintf'd into a prepared statement with the row snapshot (values). Due to the possibility of migrating text column data containing characters in the source table that are invalid in the destination table (due to charset), a conversion step is often necessary. This conversion does not occur when applying DML events and an error occurs writing invalid byte sequences to the ghost table.
For example, when migrating a table/column from
latin1toutf8mb4, thelatin1column may contain characters that are invalid single-byteutf8characters. Characters in the\x80-\xFFrange are most common. When written toutf8mb4column without conversion, they fail as they do not exist in theutf8codepage.Converting these texts/characters to the destination charset using
convert(? using {charset})will convert appropriately and the update/replace will succeed.Note: there is currently no issue backfilling the ghost table when the characterset changes, likely because it's a insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this:
latin1text-to-utf8mb4andlatin1text-to-ut8mb4-insert—the former is a test that fails prior to this commit. The latter is a test that succeeds prior to this comment. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated intoconvert-utf8mb4. They were helpful to do TDD.script/cibuildreturns with no formatting errors, build errors or unit test errors.