Convert column to destination charset in DML applications by jbielick · Pull Request #1003 · github/gh-ost · GitHub

Conversation

@jbielick

@jbielick jbielick commented Jul 9, 2021

Related issue: #290

Description

Avoid DML apply errors by converting character data when charset changes for a column.

Background

When applying DML events (UPDATE, DELETE, etc) the values are sprintf'd into a prepared statement with the row snapshot (values). Due to the possibility of migrating text column data containing characters in the source table that are invalid in the destination table (due to charset), a conversion step is often necessary. This conversion does not occur when applying DML events and an error occurs writing invalid byte sequences to the ghost table.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to a utf8mb4 column without conversion, they fail because they do not exist in the utf8 codepage.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.
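The statement shape this produces can be sketched with a small Go helper (hypothetical code, not gh-ost's actual internals; the function name and statement layout are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// buildValuePlaceholder illustrates the approach in this PR: when a shared
// column's charset differs between the source and ghost tables, the DML
// statement wraps its bind placeholder in a CONVERT() expression so MySQL
// performs the conversion server-side.
func buildValuePlaceholder(toCharset string) string {
	if toCharset == "" {
		return "?" // no charset change: plain placeholder
	}
	return fmt.Sprintf("convert(? using %s)", toCharset)
}

func main() {
	// Columns whose charset changes (here `t`: latin1 -> utf8mb4) get wrapped.
	placeholders := []string{
		buildValuePlaceholder(""),        // `id`: no charset change
		buildValuePlaceholder("utf8mb4"), // `t`: latin1 -> utf8mb4
	}
	fmt.Printf("replace into `_gh_ost_test_gho` (`id`, `t`) values (%s)\n",
		strings.Join(placeholders, ", "))
	// -> replace into `_gh_ost_test_gho` (`id`, `t`) values (?, convert(? using utf8mb4))
}
```

The raw row values are still passed as bind args unchanged; only the SQL text around the placeholder differs.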

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit; the latter is a test that succeeds prior to this commit. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into convert-utf8mb4. They were helpful to do TDD.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@jbielick jbielick force-pushed the latin1-utf8mb4 branch 2 times, most recently from b649015 to 9a3eb5f Compare July 9, 2021 21:10
) auto_increment=1;

insert into gh_ost_test values (null, 'átesting');
insert into gh_ost_test values (null, 'átesting', '', '');
Author

Fixes Testing: convert-utf8mb4..ERROR 1136 (21S01) at line 10: Column count doesn't match value count at row 1

Mismatch didn't seem intentional; I can revert if needed.

primary key(id)
) auto_increment=1 charset latin1 collate latin1_swedish_ci;

insert into gh_ost_test values (null, char(189));
Author

189 and 190 were chosen specifically because they can be replaced with a utf8 character equivalent (prefixed with \xC2). Thus the checksum at the end of the test will pass.

@shlomi-noach
Contributor

@jbielick I'm curious to see how the tests run. Would you care to push the same branch (and create a similar PR) in https://github.com/openark/gh-ost ? (If not, I can do the same on your behalf; but you're the author, so it's best if you do it.)

Notwithstanding #290, which I completely missed as I was phasing out, I'm surprised this issue only shows up now.

@jbielick
Author

@shlomi-noach me, too. I got them running locally with mysql 5.7.25 and a couple of changes to the Dockerfile, and they all passed. Looking forward to a CI run.

Rebased and opened at:

openark#27

@shlomi-noach
Contributor

Thank you!

@jbielick
Author

@shlomi-noach cherry-picked commits here.

@shlomi-noach
Contributor

@timvaillancourt 👋 this looks legit to me. However, it works on data transfer (conversion of character sets) and I'd like to see how this works under production tests. Would you be able to deploy this change?

@timvaillancourt
Collaborator

@timvaillancourt 👋 this looks legit to me. However, it works on data transfer (conversion of character sets) and I'd like to see how this works under production tests. Would you be able to deploy this change?

@shlomi-noach 👋 absolutely! I can get this on some testing hosts next week 👍

@timvaillancourt timvaillancourt added this to the v1.1.3 milestone Jul 18, 2021
@timvaillancourt timvaillancourt self-requested a review July 18, 2021 22:20
@timvaillancourt timvaillancourt temporarily deployed to production/mysql_role=ghost_testing July 19, 2021 16:02 Inactive
@jbielick
Author

Anything go awry?

@jbielick
Author

jbielick commented Sep 9, 2021

@timvaillancourt @shlomi-noach Hey y'all. 👋

I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge?

To add some more context, we're in need of this change to successfully convert large tables from latin1 to utf8mb4. Backfilling goes just fine, but updates that are applied encounter errors (as they do SQL writes/updates differently than the backfill) and the whole thing dies. Any issues or edge cases I can look into? This is understandably high-risk.

If I were to build a binary from this branch to test in our prod DB, should I rebase and do so from the openark repo?

@timvaillancourt
Collaborator

timvaillancourt commented Jun 29, 2022

I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge?

@jbielick / @shlomi-noach: 👋 the production tests of this PR ran into some unrelated test failures that are still under investigation. I think the tests for this PR should be run again to make sure this didn't break anything. cc'ing @dm-2 / @rashiq for help on tests

And just calling out that while the automated tests will prove gh-ost still works in general, it won't trigger the charset logic from this PR as the production test runs ENGINE=InnoDB only. localtests/ might be our best bet to test this logic. If there are any more test scenarios that can be added, that would make this safer @jbielick (no specific ideas right now)

Lastly, I deleted the ignore_versions file for the 5.5 version because we dropped support/tests for 5.5, so there's no need to ignore it.

@jbielick
Author

Thanks for the update.

it won't trigger the charset logic from this PR as the production test runs ENGINE=InnoDB only

May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something.

@ghost ghost modified the milestones: v1.1.5, v1.1.6 Jul 6, 2022
@timvaillancourt
Collaborator

timvaillancourt commented Jul 20, 2022

May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something.

@jbielick what I meant there is the tests on Production replicas at GitHub use --alter="ENGINE=InnoDB" as an ALTER statement, meaning this logic won't be triggered in that testing

That's why I suggest test scenarios are added to localtests/ because these tests CAN run ALTERs that trigger the logic in this PR (see create.sql and extra_args files for example)

@jbielick
Author

May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something.

@jbielick what I meant there is the tests on Production replicas at GitHub use --alter="ENGINE=InnoDB" as an ALTER statement, meaning this logic won't be triggered in that testing

That's why I suggest test scenarios are added to localtests/ because these tests CAN run ALTERs that trigger the logic in this PR (see create.sql and extra_args files for example)

@timvaillancourt makes sense. I've got some tests added here in localtests/ that reproduce the failure and assert success. Do those tests look sufficient or am I missing something that I can add here?

@timvaillancourt
Collaborator

@jbielick in light of #1158 being merged I just wanted to check if this is still needed? cc @wangzihuacool who can add some insight also

Thanks all 🙇

}
if column.Name == mappedColumn.Name && column.Charset != mappedColumn.Charset {
this.migrationContext.SharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset)
this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset)
Collaborator

@jbielick curious if you can elaborate on the switch from SharedColumns to MappedSharedColumns. Thanks!

Author

Well, chronologically I believe I added this line before #1158 was opened. SharedColumns might be the correct usage for #1158 because it leverages convertArg and does a conversion of the value in go. I preferred the approach that was similar to the DatetimetoTimestamp and EnumToTextConversion approaches where the conversion occurs within mysql, as would normally occur if I altered the table without gh-ost.

So it's not that I changed this line, but it now appears in the diff as a change because the base changed and #1158 also introduced this line with a slightly different goal.

I'm using MappedSharedColumns because those are the set that allows replacing the "SET x=x" arg with a mysql expression instead of manipulating the raw value itself (which is possible with SharedColumns). We don't need to modify the bytes that are being sent; we need mysql to interpret them differently (and, as proposed here, convert them).

this test assumes a latin1-encoded table with content containing bytes
in the \x80-\xFF range, which are invalid single-byte characters in utf8 and
cannot be inserted in the altered table when the column containing these
characters is changed to utf8(mb4).

since these characters cannot be inserted, gh-ost fails.
@jbielick
Author

@jbielick in light of #1158 being merged I just wanted to check if this is still needed? cc @wangzihuacool who can add some insight also

Thanks all 🙇

Hi @timvaillancourt

No, unfortunately the test I added in this PR fails on master.

Copy: 3/3 100.0%; Applied: 0; Backlog: 20/1000; Time: 5s(total), 5s(copy); streamer: mysql-bin.000003:3226925; Lag: 0.00s, HeartbeatLag: 0.03s, State: throttled, throttle-query; ETA: due
2022-09-21 17:47:43 ERROR Error 1366: Incorrect string value: '\x80' for column 't' at row 1; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, ?)
		; args=[7 [128]]
goroutine 60 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x19
github.com/openark/golib/log.logErrorEntry(0xc000481100, {0x7f5540, 0xc000481110})
	/go/src/github.com/github/gh-ost/vendor/github.com/openark/golib/log/log.go:188 +0x98
github.com/openark/golib/log.Errore(...)
	/go/src/github.com/github/gh-ost/vendor/github.com/openark/golib/log/log.go:234
github.com/github/gh-ost/go/base.(*simpleLogger).Errore(0xc0002402c0, {0x7f5540, 0xc000481110})
	/go/src/github.com/github/gh-ost/go/base/default_logger.go:51 +0x28
github.com/github/gh-ost/go/logic.(*Applier).ApplyDMLEventQueries(0xc0002402c0, {0xc0004a4a80, 0xa, 0xc000488e10})
	/go/src/github.com/github/gh-ost/go/logic/applier.go:1158 +0x12d
github.com/github/gh-ost/go/logic.(*Migrator).onApplyEventStruct.func2()
	/go/src/github.com/github/gh-ost/go/logic/migrator.go:1269 +0x2d
github.com/github/gh-ost/go/logic.(*Migrator).retryOperation(0xc000218000, 0xc0000d9f18, {0x0, 0x0, 0xc000216000})
	/go/src/github.com/github/gh-ost/go/logic/migrator.go:149 +0x79
github.com/github/gh-ost/go/logic.(*Migrator).onApplyEventStruct(0xc000218000, 0x0)
	/go/src/github.com/github/gh-ost/go/logic/migrator.go:1271 +0x25c
github.com/github/gh-ost/go/logic.(*Migrator).executeWriteFuncs(0xc000218000)
	/go/src/github.com/github/gh-ost/go/logic/migrator.go:1305 +0xb8
created by github.com/github/gh-ost/go/logic.(*Migrator).Migrate
	/go/src/github.com/github/gh-ost/go/logic/migrator.go:411 +0x625

addresses github#290

Note: there is currently no issue backfilling the ghost table when the
character set changes, likely because it's an insert-into-select-from and
it all occurs within mysql.

However, when applying DML events (UPDATE, DELETE, etc) the values are
sprintf'd into a prepared statement and due to the possibility of
migrating text column data containing invalid characters in the
destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the
latin1 column may contain characters that are invalid single-byte utf8
characters. Characters in the \x80-\xFF range are most common. When
written to a utf8mb4 column without conversion, they fail as they do not
exist in the utf8 codepage.

Converting these texts/characters to the destination charset using
convert(? using {charset}) will convert appropriately and the
update/replace will succeed.

I only point out the "Note:" above because there are two tests added
for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert

The former is a test that fails prior to this commit. The latter is a
test that succeeds prior to this commit. Both are affected by the code
in this commit.

convert text to original charset, then destination

converting text first to the original charset and then to the
destination charset produces the most consistent results, as inserting
the raw bytes into a utf8-charset column may encounter an error when
there is no prior context of latin1 encoding.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

as seen in this failure on 5.5.62

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))
@timvaillancourt timvaillancourt modified the milestones: v1.1.6, v1.1.7 Dec 7, 2023
@jbielick jbielick closed this Mar 20, 2024
@ramshad-sha ramshad-sha mentioned this pull request Mar 30, 2024
Closed