This topic provides information about troubleshooting and manually recovering data in PDS.
Troubleshoot diverged GTIDs in MySQL
The MySQL data service in PDS handles pod crashes and outages in most cases. For example, instances can fail over and rejoin the cluster automatically on reboot. In some cases, however, a pod is unable to reboot the cluster after an outage and keeps failing with the following error:
The instance `instance-a` has an incompatible Global Transaction Identifier (GTID) set with the seed instance `instance-b` (GTIDs diverged). If you wish to proceed, the `force` option must be explicitly set.
This means that the instances cannot agree on which one should become the new master because their data has diverged.
To troubleshoot this issue:
Review the GTIDs in the binary logs of the instances and choose the instance that contains the latest or most appropriate changes to continue with. You can inspect the transactions on the instances by:
opening a shell into the `mysql` container of the pods
using MySQL tools such as
Once you have selected the instance to use as the seed, force-reboot the cluster by running the following command inside the `mysql` container of the selected pod:
mysqlsh --host=$seed_instance --user=innodb-config --password=$password -- dba reboot-cluster-from-complete-outage --force --primary=$seed_instance:3306
Check the cluster status and wait for the cluster to recover:
mysqlsh --host=$seed_instance --user=innodb-config --password=$password -- cluster status
If the cluster does not become healthy, or if some nodes do not come online, continue by:
removing the failing instances:
mysqlsh ... -- cluster remove-instance <other_instance>
and re-adding the instances:
mysqlsh ... -- cluster add-instance <other_instance> --recoveryMethod=clone
See restoring and rebooting a cluster for more information.
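The inspection and status-check steps above can be sketched as two small shell helpers. This is a minimal sketch under assumptions, not the PDS procedure itself: the function names `show_gtids` and `wait_for_cluster_ok`, the retry count, and the sleep interval are hypothetical; the connection flags and the `$password` variable are copied from the commands above; and the check assumes that `cluster status` prints JSON whose cluster-level field reads `"status": "OK"` once the cluster is healthy.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, retry count, and sleep interval are assumptions.

# Print the executed GTID set of each instance so the sets can be compared.
show_gtids() {
  local host
  for host in "$@"; do
    echo "== ${host} =="
    mysqlsh --host="${host}" --user=innodb-config --password="${password}" \
      --sql -e "SELECT @@GLOBAL.gtid_executed;"
  done
}

# Poll `cluster status` until the cluster-level status is OK or retries run out.
# Arguments: seed host, number of attempts (default 30), seconds between attempts (default 10).
wait_for_cluster_ok() {
  local seed="$1" retries="${2:-30}" delay="${3:-10}" i
  for ((i = 1; i <= retries; i++)); do
    if mysqlsh --host="${seed}" --user=innodb-config --password="${password}" \
        -- cluster status 2>/dev/null | grep -Eq '"status": *"OK"'; then
      echo "cluster is OK"
      return 0
    fi
    sleep "${delay}"
  done
  echo "cluster did not reach OK after ${retries} attempts" >&2
  return 1
}
```

For instance, `show_gtids instance-a instance-b` could be run before choosing the seed, and `wait_for_cluster_ok "$seed_instance"` after the forced reboot. A cluster that stays in a degraded state never matches the `OK` check, which is the point at which the remove and re-add steps above apply.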
Recover Cassandra pods from corrupt commit logs
If you reboot the worker nodes after deploying the Cassandra data service, the Cassandra pods may fail to come up and form the cluster because of corrupt commit logs. The pod logs show an error such as:
cassandra ERROR 18:22:11 Exiting due to error while processing commit log during initialization.
cassandra org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Mutation checksum failure at 23031717 in Next section at 23028485 in CommitLog-7-1676881531203.log
To recover Cassandra pods from the corrupt commit logs:
Scale `pds-deployment-controller-manager` down to 0 so that the deployment operator does not revert the statefulset change made in the next step.
Edit the Cassandra statefulset by adding the following line under
command: ["/bin/sleep", "3650d"]
NOTE: The statefulset name is identical to the deployment name in the PDS UI.
Delete all Cassandra pods and wait for pod 0 to start running (it will start running, but will never become ready).
Shell into pod 0 and delete the corrupt commit log. For example:
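The error message names the corrupt file (`CommitLog-7-1676881531203.log` in the log above), but its directory depends on how the image is laid out. The following is a minimal sketch that searches for the file by name instead of assuming a path. The helper name `delete_commit_log` and the choice of `/` as the search root are assumptions, and the snippet is meant to be run from a shell inside pod 0.

```shell
# Hypothetical helper: find a commit log segment by name under a search root,
# print each match, and delete it. Errors from unreadable directories are hidden.
delete_commit_log() {
  local root="$1" name="$2"
  find "${root}" -type f -name "${name}" -print -delete 2>/dev/null
}

# Example call, using the file name reported in the error message:
# delete_commit_log / 'CommitLog-7-1676881531203.log'
```

Printing each match before deleting it leaves a record of what was removed, which helps if more than one segment turns out to be corrupt.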
Exit the shell of pod 0 and scale `pds-deployment-controller-manager` back to 1.
Wait approximately 30 seconds for the deployment operator to update the statefulset.
Delete Cassandra pod 0.
All Cassandra nodes should come up successfully. However, it is recommended to shell back into one of the Cassandra pods and run
Update Kubernetes secret after changing the
If you change the password for the `pds` user, you must also update the corresponding Kubernetes secret for the deployment. To `base64`-encode a string and update the Kubernetes secret:
Get the Kubernetes secret for the Couchbase data service:
kubectl get secrets -n <namespace-where-the-Couchbase-data-service-is-deployed>
Encode your new administrator password into `base64`:
echo <the-updated-password1> | base64
Update the Kubernetes secret with the new `base64`-encoded administrator password:
kubectl get secret cb-rke-qichff-creds -n cb -o json | jq '.data["password"]="UGFzc3dvcmQxCg=="' | kubectl apply -f -
secret/cb-rke-qichff-creds configured
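One detail worth checking when encoding: `echo` appends a newline, and that newline becomes part of the encoded value. The example value `UGFzc3dvcmQxCg==` above decodes to `Password1` followed by a newline. If the trailing newline is not wanted in the stored password, `printf '%s'` avoids it. The following is a minimal sketch; `encode_password` is a hypothetical helper name, and whether the data service tolerates a trailing newline is not stated here.

```shell
# Hypothetical helper: base64-encode a password without a trailing newline.
encode_password() {
  printf '%s' "$1" | base64
}

encode_password 'Password1'   # prints UGFzc3dvcmQx
echo 'Password1' | base64     # prints UGFzc3dvcmQxCg== (newline included)
```

Decoding the value currently stored in the secret with `base64 -d` is a quick way to confirm which form it contains.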
`pds` password in the `cqlshrc` file for Cassandra pods
If you change the password for the `pds` user, you must also update the `cqlshrc` file located on all Cassandra pods:
Open a shell in each Cassandra pod:
kubectl exec -it -n <NAMESPACE> <POD-NAME> -- bash
Change the default user password `pds` to the new password.
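The edit itself can be scripted once you are inside a pod. The following is a minimal sketch, assuming the file keeps the password on a line of the form `password = ...`, as in the standard `[authentication]` section of a `cqlshrc` file. The helper name `update_cqlshrc_password` is hypothetical, and the file path is passed as an argument because its location in the pod is not given here.

```shell
# Hypothetical helper: rewrite every line that starts with "password" and
# contains "=" so that it carries the new password; copy all other lines as-is.
update_cqlshrc_password() {
  local file="$1" new_password="$2" tmp line
  tmp="$(mktemp)"
  while IFS= read -r line || [ -n "${line}" ]; do
    case "${line}" in
      password*=*) printf 'password = %s\n' "${new_password}" ;;
      *) printf '%s\n' "${line}" ;;
    esac
  done < "${file}" > "${tmp}"
  # Write back through redirection so the original file keeps its permissions.
  cat "${tmp}" > "${file}" && rm -f "${tmp}"
}

# Example call (the path is a placeholder):
# update_cqlshrc_password <path-to-cqlshrc> '<the-updated-password>'
```

The loop avoids `sed` so that characters such as `|`, `&`, or `/` in the password need no escaping. The same call has to be repeated in every Cassandra pod.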