Feature Request: MySQL CLONE for vtbackup and vttablet restore #19061

@maxenglander

Summary

@nickvanw and I would like to make it possible to restore VTTablets, and power vtbackup, using MySQL CLONE.

With CLONE, it is possible to clone a remote "donor" MySQL server into a "recipient", overwriting the recipient's contents with the donor's.
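As a rough sketch of what this looks like at the SQL level, the recipient runs a short sequence of statements. The statements themselves (INSTALL PLUGIN, SET GLOBAL clone_valid_donor_list, CLONE INSTANCE FROM) are standard MySQL. The Python helper `build_clone_statements`, its parameters, and the example host and user names are hypothetical and only illustrate the ordering; this is not how Vitess would necessarily implement it.

```python
from typing import List


def build_clone_statements(donor_host: str, donor_port: int,
                           user: str, password: str) -> List[str]:
    """Return the SQL a recipient might run to clone from a donor.

    Hypothetical helper for illustration only; it is not part of Vitess.
    Values are interpolated without escaping, so a real implementation
    would need proper quoting and must avoid logging the password.
    """
    donor = f"{donor_host}:{donor_port}"
    return [
        # Load the clone plugin. This fails if the plugin is already
        # installed, so a real caller would check for it first.
        "INSTALL PLUGIN clone SONAME 'mysql_clone.so'",
        # The donor must be allow-listed on the recipient.
        f"SET GLOBAL clone_valid_donor_list = '{donor}'",
        # Overwrite the recipient's data with the donor's.
        f"CLONE INSTANCE FROM '{user}'@'{donor_host}':{donor_port} "
        f"IDENTIFIED BY '{password}'",
    ]


if __name__ == "__main__":
    for stmt in build_clone_statements(
            "donor.example.internal", 3306, "clone_user", "secret"):
        print(stmt + ";")
```

The same plugin must also be installed on the donor, which is not shown here.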

Requirements

In order for this to work:

  • Both the recipient and donor MySQL must be running MySQL 8.0.17 or later.
  • Both the recipient and donor MySQL must have the mysql_clone.so plugin installed and ACTIVE.
  • Cloning across point releases (e.g. 8.0.37 to 8.0.41) is not permitted before 8.0.37.
  • The recipient must connect to the donor as a user with the BACKUP_ADMIN privilege, and must have the CLONE_ADMIN privilege locally.
  • The donor's host:port must be included in @@global.clone_valid_donor_list.
  • A donor can only support a single CLONE at a time. This is possibly a bug (bugs#103206).
  • Only InnoDB tables can be cloned, meaning CLONE usage must be restricted to all-InnoDB donors.
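Several of these requirements could be checked up front, before a clone is attempted. The sketch below is one reading of the version rules listed above: both servers at 8.0.17 or later, and differing point releases allowed only when both are at 8.0.37 or later in the same series. The function and constant names are hypothetical. The two queries use standard INFORMATION_SCHEMA tables, but the list of excluded system schemas is an assumption.

```python
from typing import Tuple

MIN_CLONE_VERSION = (8, 0, 17)        # first release that ships CLONE
MIN_CROSS_POINT_VERSION = (8, 0, 37)  # first release allowing mixed point releases

# Expect PLUGIN_STATUS = 'ACTIVE' on both donor and recipient.
PLUGIN_STATUS_QUERY = (
    "SELECT PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS "
    "WHERE PLUGIN_NAME = 'clone'"
)

# Expect zero rows on the donor (system schema list is an assumption).
NON_INNODB_QUERY = (
    "SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_TYPE = 'BASE TABLE' AND ENGINE <> 'InnoDB' "
    "AND TABLE_SCHEMA NOT IN "
    "('mysql', 'information_schema', 'performance_schema', 'sys')"
)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse '8.0.37' or '8.0.37-log' into (8, 0, 37)."""
    numeric = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in numeric.split("."))
    return (major, minor, patch)


def clone_versions_compatible(donor: str, recipient: str) -> bool:
    """Return True if the two server versions may clone from one another."""
    d, r = parse_version(donor), parse_version(recipient)
    if d < MIN_CLONE_VERSION or r < MIN_CLONE_VERSION:
        return False
    if d == r:
        return True
    # Different point releases: same series, and both new enough.
    same_series = d[:2] == r[:2]
    return (same_series
            and d >= MIN_CROSS_POINT_VERSION
            and r >= MIN_CROSS_POINT_VERSION)


if __name__ == "__main__":
    print(clone_versions_compatible("8.0.37", "8.0.41"))
    print(clone_versions_compatible("8.0.36", "8.0.41"))
```

A check like this would let a tablet fail fast with a clear message instead of surfacing a MySQL error partway through a restore.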

API changes

Here are the API changes we have in mind to support this feature:

  • A new tablet can restore from its shard's primary with --clone-from-primary, mutually exclusive with --restore_from_backup.
  • A new tablet can restore from another non-primary tablet in the shard with --clone-from-tablet=<tablet_alias>, mutually exclusive with both --clone-from-primary and --restore_from_backup.
  • An existing tablet can be restored with a CloneTablet [--from-primary|--from-tablet=<tablet_alias>] <tablet_alias> RPC.
  • A primary cannot be the recipient in any of the above modalities.
  • vtbackup will support --clone-from-primary and --clone-from-tablet, replacing the restore-from-backup-storage phase, and making the catch-up phase effectively instantaneous.
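The mutual-exclusion rules above can be sketched as a small validation step. This is illustrative only: the flag names come from the proposal, but `validate_restore_flags` and its parameters are hypothetical, and an actual implementation would live in Vitess's Go flag handling instead of Python.

```python
from typing import Optional


def validate_restore_flags(clone_from_primary: bool,
                           clone_from_tablet: Optional[str],
                           restore_from_backup: bool,
                           recipient_is_primary: bool = False) -> None:
    """Raise ValueError if the flag combination violates the proposed rules."""
    selected = [
        name for name, is_set in (
            ("--clone-from-primary", clone_from_primary),
            ("--clone-from-tablet", bool(clone_from_tablet)),
            ("--restore_from_backup", restore_from_backup),
        ) if is_set
    ]
    # At most one restore source may be chosen.
    if len(selected) > 1:
        raise ValueError("mutually exclusive flags: " + ", ".join(selected))
    # A primary may never be overwritten by a clone.
    if recipient_is_primary and (clone_from_primary or clone_from_tablet):
        raise ValueError("a primary tablet cannot be a CLONE recipient")


if __name__ == "__main__":
    validate_restore_flags(True, None, False)  # accepted
    try:
        validate_restore_flags(True, None, True)
    except ValueError as err:
        print(err)
```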

Use cases

Repairing broken replica tablets

We periodically experience a class of issues in which all replicas in a shard hit errors such as HA_ERR_FOUND_DUPP_KEY and HA_ERR_KEY_NOT_FOUND and get stuck on the same GTID, unable to make progress. At least some of these issues are a result of MySQL bugs like bugs#105802.

When faced with these issues, restoring a new tablet from a pre-existing backup is not an option, as a new tablet would eventually reach the same GTID and fail to make progress. The current options we use are:

  • To take a vtctldclient Backup of the primary, potentially taking downtime depending on the backup engine, and restore a new tablet from it.
  • To take a backup some other way, e.g. with mysqldump, which is slow, has generally mixed success, and is not fun to do manually under the pressure of mounting replication lag.

CLONE is appealing here. It does not take the primary offline and, once integrated into Vitess, should be faster than a backup/restore to and from object storage, and hopefully much more reliable than manually running brittle mysqldump | socat [replica] incantations.

Initializing new tablets

One way to repair broken tablets in the situations described above is to simply create new ones with CLONE, and then throw away the old ones or demote them to SPARE for analysis.

Outside of incident situations, CLONE is also appealing for the anticipated speed improvements. Restoring a backup from cloud object storage can be fairly fast when carefully optimized, but we expect CLONE to be significantly faster. Using CLONE will also largely or completely eliminate the catch-up phase of replication, which can be long depending on the time between backups and the limits of multi-threaded replication.

Powering the restore phase of vtbackup

Likewise, we expect to be able to cut down the total runtime of vtbackup by replacing the restore phase with CLONE, which will also largely or completely eliminate the catch-up phase of replication. vtbackup, run in this mode, can be thought of as a kind of "live primary backup".

Generally, we like that a vtbackup cycle has the property of validating the previous restore. However, we occasionally run into issues where a large spike in customer activity inflates binary logs to such a degree that a regular vtbackup cannot complete in an acceptable time. In these situations, there is great appeal in being able to take a shortcut in order to satisfy a backup SLA.
