This post accompanies a recent presentation at the DataStax Accelerate conference, and describes the process we used to move clusters from Rackspace to Google Cloud.
In part 1 we looked at setting up the new datacenter and migrating the data to new nodes. In part 2 we looked at decommissioning the original datacenter.
In this final post, we look at some of the things that can go wrong and how to mitigate and recover from them.
Network Performance
If you are moving to a datacenter in a new location or with a different provider, there may be network performance considerations, as all the data stored in Cassandra needs to be transmitted between the datacenters. This raises two problems:
- Ensuring there is enough bandwidth
- Not stealing all the bandwidth
The available bandwidth can be tested with the bandwidth measurement tool iperf3. This is a client-server tool with a number of parameters to set different options.
iperf3 -s
iperf3 -c xxx.xxx.xxx.xxxx
iperf3 -c xxx.xxx.xxx.xxxx -b 10G
iperf3 -c xxx.xxx.xxx.xxxx -C yeah
-s runs iperf3 in server mode. Run this on one node in one datacenter; it will then wait for clients to connect to it.
-c runs it in client mode; also provide the IP address where the server is running.
-b specifies the maximum bandwidth to use (in bits per second), useful if you don’t want to saturate the connection between the two datacenters.
-C allows you to specify the congestion control algorithm to use. When moving between geographically separate datacenters, we found that “yeah” worked best, but it is a case of testing different scenarios.
Once you know the bandwidth you have available, you can work out whether it is sufficient to stream the data in a timely manner.
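As a rough back-of-the-envelope sketch (the figures here are purely illustrative, not from our migration): if each new node needs around 1.5 TB streamed to it and the link can sustain 2 Gbit/s, that is roughly 12,000 Gbit / 2 Gbit/s = 6,000 seconds, or a little under two hours per node, assuming nodes are rebuilt one at a time and nothing else is competing for the link.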
If you need to limit the bandwidth Cassandra uses to stream data between datacenters, you can set this via the following command:
nodetool setinterdcstreamthroughput xxx
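The value is in megabits per second, and you can check the current setting at any time with:
nodetool getinterdcstreamthroughput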
Views
This problem only occurred in DSE versions prior to 5.0.12 (Apache Cassandra 3.0.13).
Prior to 5.0.12, when data is streamed to a new node only the table data is streamed, not the materialized view data; the view is rebuilt on the node from the underlying table data. This rebuild is done using select statements, which normally works fine. However, if there are a large number of tombstones in the table, they may cause the selects to fail, even if the tombstone issue is not apparent when reading the table normally.
When this happened, we found the best way to solve it was to upgrade to a later version, which, given the number of view issues fixed after this release, is recommended anyway if you are using views.
Memory
During the rebuild of a node, the high level of streaming and compaction uses a lot of heap memory. While the node is rebuilding it is fine to increase the heap size without worrying too much about GC pauses: the node is not yet servicing client requests, so long pauses will not have any impact on the client requests hitting the other datacenter. Just remember to change the heap back to its normal setting before the node starts serving clients.
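Where the heap is set depends on your install; for a packaged install it is usually cassandra-env.sh. The value below is only an illustration and needs sizing to the machine (and reverting afterwards):
# cassandra-env.sh - temporary setting while the node is rebuilding (example value only)
MAX_HEAP_SIZE="24G"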
Compaction
There can also be a compaction backlog during the process: lots of small SSTables get created and queue for compaction, which also increases the amount of heap memory required. Once again, during the rebuild it is fine to increase the compaction throughput so compaction can keep up, via the nodetool command:
nodetool setcompactionthroughput xxxxx
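The value is in MB per second, and 0 removes the throttle altogether. You can check the current setting and keep an eye on the backlog of pending compactions while the rebuild runs:
nodetool getcompactionthroughput
nodetool compactionstats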
Streaming
Reducing the streaming throughput can also reduce the amount of memory required, giving the GC a better chance of keeping up. So this is another way to deal with memory issues:
nodetool setinterdcstreamthroughput xxxxx
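You can watch the effect of the change on the streams currently in flight, which nodetool reports per streaming session:
nodetool netstats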
Application Latency
The area we had the most problems with was application latency. There was a 22 ms latency between our datacenters, so when this was added on top of normal queries, latency issues would often build up quickly.
It was important that the development teams followed the instructions we had sent out early in the process.
Local Consistency
If you use QUORUM consistency, more than 50% of the replicas across all datacenters have to return a result; with LOCAL_QUORUM it is only more than 50% of the replicas in the local datacenter. So LOCAL_QUORUM stays in the local datacenter, whereas QUORUM will always have to cross the network to the remote datacenter.
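Consistency is normally set per session or per statement in the application driver, but you can see the difference quickly from cqlsh on a node in either datacenter (the keyspace and table here are just placeholders):
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM my_keyspace.my_table WHERE id = 42;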
Light Weight Transactions (LWT)
LWTs use an additional consistency level for the Paxos algorithm, the part that checks the IF condition. This can be either SERIAL or LOCAL_SERIAL. Using LOCAL_SERIAL means the check only happens in the local DC; SERIAL means the check happens across all the DCs, with the additional latency that involves.
During the initial build of the new datacenter, the serial consistency (like the normal consistency level) should definitely stay local, i.e. LOCAL_SERIAL. When the DCs are at a steady state, more thought needs to go into the decision. If all applications are moved over to the new DC at the same time, then LOCAL_SERIAL can continue to be used. However, if the applications migrate to the new DC gradually, you need to be aware that one application may write to the new DC while the IF condition is being checked in the other DC, causing an update to happen when it shouldn’t. If it is possible for this to happen, then LOCAL_SERIAL is not enough and you should use SERIAL, with the additional latency that brings.
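The serial consistency is set in the application driver alongside the normal consistency level; for example in cqlsh (again, the table and columns are placeholders) the following keeps the Paxos check in the local DC:
SERIAL CONSISTENCY LOCAL_SERIAL;
UPDATE my_keyspace.users SET email = 'new@example.com' WHERE id = 42 IF email = 'old@example.com';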
Load Balancing Policy
The load balancing policy determines which node a client request is passed to. It is important that the clients use a DC-aware policy, which ensures that requests go to the nodes local to the client. This is particularly important before the rebuild is completed, otherwise a request may be sent to the new DC before the data is actually there, resulting in no data being returned even though it is present in the original DC.
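Whichever driver you use, a DC-aware policy generally needs to be configured with (or able to infer) the name of the local datacenter, and that name must match exactly what the cluster itself reports, which you can check with:
nodetool status
The datacenter names appear as the Datacenter: headings in the output.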
Conclusion
We successfully migrated 93 clusters using this method. Some were easier than others, and sometimes it took several rebuilds of a node before we succeeded.
If you have any further questions do use the comments section, or get in touch directly.