The Last Pickle did a great blog post on TWCS a little while ago, explaining how Time Window Compaction is great for certain time series data.
To recap, TWCS is suitable when the following hold:
- Data should only be inserted and not updated afterwards
- Data must have a TTL attached
- Data shouldn’t be explicitly deleted, it should only be expired via the TTL
These are only recommendations. Cassandra will allow them to be broken, but if you do, there will be disk usage problems in the future. In this post I will explain the problems that occur if the data does not have a TTL.
As the name suggests, this works with time windows. Once a time window has passed, all the sstables created in that window are compacted together into one sstable holding the data created in the window. Once this happens, the default behaviour is for that sstable never to be compacted again.
Therefore the only way for the data to be removed from disk is for the whole sstable to be deleted. Cassandra does this once all the data in the sstable has expired via its TTL and gc_grace_seconds has also passed.
So, as long as all the data within an sstable has a TTL, the sstable will eventually be deleted, freeing up the disk space. However, if there is a record within the sstable without a TTL, then that record never expires, and the sstable can never be deleted.
The obvious answer to this is to delete the record, but this doesn’t work. An sstable is immutable once written to disk, so the delete will create a tombstone, but this tombstone will be in a newly created sstable for the current Time Window, rather than in the original sstable. Therefore the original sstable stays the same and cannot be compacted.
So there is no way for the sstable, with all the expired data in it, to be deleted.
But it is actually a lot worse than this. An sstable full of expired data is only dropped if it is the oldest sstable, so once one sstable cannot be deleted, none of the subsequent ones can be deleted either. Suddenly none of the data with a TTL is ever removed from disk.
At first glance this is, at best, counterintuitive, but there is a good reason for it. Consider the original record without a TTL that caused the problem. If an upsert later changes some of its values and also adds a TTL, this new version is written to a new sstable, with a TTL attached. At some point later all the records in that newer sstable will have expired, and the original record will not be shown in CQL, as it has been overwritten by the now-expired record.
In normal circumstances this newer sstable would then be ready for deletion. But if it were deleted, the original record without the TTL would come alive again, causing incorrect data to be displayed.
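The resurrection problem can be illustrated with a minimal Python sketch. This is a toy model of last-write-wins reads, not Cassandra's read path: the Row class, the read function and the timestamps are all hypothetical, and each list passed to read stands for the versions of one row found across the sstables still on disk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    """One version of a row as stored in a single sstable (simplified)."""
    write_ts: int          # write timestamp, in seconds
    ttl: Optional[int]     # TTL in seconds, or None for no TTL

    def is_live(self, now: int) -> bool:
        # A version without a TTL never expires.
        return self.ttl is None or now < self.write_ts + self.ttl


def read(versions: List[Row], now: int) -> Optional[Row]:
    """Last write wins: pick the newest version, then hide it if expired."""
    if not versions:
        return None
    newest = max(versions, key=lambda r: r.write_ts)
    return newest if newest.is_live(now) else None


original = Row(write_ts=0, ttl=None)    # in the old sstable, no TTL
upsert = Row(write_ts=100, ttl=300)     # in a newer sstable, with a TTL

now = 1000  # well after the upsert expired at t=400

# With both sstables on disk, the expired upsert shadows the original.
print(read([original, upsert], now))    # None

# If the newer sstable were dropped first, the original would be visible again.
print(read([original], now))
```

The second call returns the original version, which is exactly the stale data that keeping the newer sstable around prevents.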
So let's look at this in more detail:
CREATE TABLE twcs (
id int,
value int,
when timeuuid,
PRIMARY KEY (id, value)
) WITH CLUSTERING ORDER BY (value ASC)
AND bloom_filter_fp_chance = 0.01
AND comment = ''
AND gc_grace_seconds = 60
AND default_time_to_live = 300
AND compaction = {'compaction_window_size': '1',
'compaction_window_unit': 'MINUTES',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};
This creates the twcs table using TWCS with a compaction window of 1 minute. There is a TTL of 5 minutes and a gc_grace_seconds of 1 minute, meaning a record will expire 5 minutes after it is written and become eligible for removal from the sstable after a further minute.
So let's insert a record and look at the result, including the TTL information:
insert into twcs (id , value , when ) values (1 ,1, now());
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------------------+-----------
1 | 1 | 2019-06-03 13:34:50.635000+0000 | 287
This shows us the record has been inserted and has a TTL that is counting down from 300.
If we do a nodetool flush then look at the data directory, we can see the one sstable, and doing an sstabledump we can see the record within the sstable.
ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:36 md-1-big-Data.db
sstabledump md-1-big-Data.db
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 46,
"clustering" : [ 1 ],
"liveness_info" : { "tstamp" : "2019-06-03T13:34:50.612608Z", "ttl" : 300, "expires_at" : "2019-06-03T13:39:50Z", "expired" : false },
"cells" : [
{ "name" : "when", "value" : "5ccb9db0-8604-11e9-8ac8-2943194aee43" }
]
}
]
}
]
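The timings in this dump can be checked with a little arithmetic. The sketch below is only an illustration in Python: the function name purge_eligible_at is hypothetical, and the write time is the tstamp from the dump truncated to whole seconds.

```python
from datetime import datetime, timedelta, timezone


def purge_eligible_at(written_at, ttl_seconds, gc_grace_seconds):
    """Return (expiry time, earliest time the expired data can be purged)."""
    expires_at = written_at + timedelta(seconds=ttl_seconds)
    return expires_at, expires_at + timedelta(seconds=gc_grace_seconds)


# tstamp from the dump, truncated to whole seconds
written_at = datetime(2019, 6, 3, 13, 34, 50, tzinfo=timezone.utc)

# default_time_to_live = 300 and gc_grace_seconds = 60 from the table definition
expires_at, purgeable_at = purge_eligible_at(written_at, 300, 60)

print(expires_at.isoformat())    # 2019-06-03T13:39:50+00:00
print(purgeable_at.isoformat())  # 2019-06-03T13:40:50+00:00
```

The first value matches the expires_at field in the dump; the second is the earliest point at which the sstable could be dropped.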
When the TTL has expired, the record no longer appears in query results, and the sstable shows it as expired.
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------+-----------
(0 rows)
sstabledump md-1-big-Data.db
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 46,
"clustering" : [ 1 ],
"liveness_info" : { "tstamp" : "2019-06-03T13:34:50.612608Z", "ttl" : 300, "expires_at" : "2019-06-03T13:39:50Z", "expired" : true },
"cells" : [
{ "name" : "when", "value" : "5ccb9db0-8604-11e9-8ac8-2943194aee43" }
]
}
]
}
]
After a further minute this will become available for deletion within the sstable, and the sstable will be deleted.
sstabledump md-1-big-Data.db
Cannot find file /var/lib/cassandra/data/keyspace1/twcs-1c6d0a10860411e98ac82943194aee43/md-1-big-Data.db
So let's try this again with a bit more data, where the third record is inserted without a TTL. This gives us the following data and sstables:
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------------------+-----------
1 | 2 | 2019-06-03 13:55:48.219000+0000 | 84
1 | 3 | 2019-06-03 13:56:50.162000+0000 | 146
1 | 4 | 2019-06-03 13:57:41.793000+0000 | null
1 | 5 | 2019-06-03 13:58:30.482000+0000 | 246
1 | 6 | 2019-06-03 13:59:20.132000+0000 | 296
ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:56 md-1-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:57 md-2-big-Data.db
-rw-r--r-- 1 cassandra cassandra 51 Jun 3 15:58 md-3-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:59 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 16:00 md-5-big-Data.db
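One way to see why these five sstables stay separate is to work out which one-minute window each write falls into. The Python sketch below assumes the window is found by rounding the write time down to a multiple of the window width. That rule and the window_start helper are assumptions made for illustration; the real strategy works from sstable metadata rather than individual rows.

```python
from datetime import datetime, timezone


def window_start(ts: datetime, window_size: int = 1, unit_seconds: int = 60) -> datetime:
    """Round a write time down to the start of its compaction window."""
    epoch = int(ts.timestamp())
    width = window_size * unit_seconds
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)


def at(hour: int, minute: int, second: int) -> datetime:
    """Helper for the write times shown in the query output (3 June 2019, UTC)."""
    return datetime(2019, 6, 3, hour, minute, second, tzinfo=timezone.utc)


# clustering value -> write time, taken from the select output
writes = {
    2: at(13, 55, 48),
    3: at(13, 56, 50),
    4: at(13, 57, 41),
    5: at(13, 58, 30),
    6: at(13, 59, 20),
}

windows = {value: window_start(ts) for value, ts in writes.items()}
for value, start in windows.items():
    print(value, start.strftime("%H:%M"))

# Every write lands in its own window, so no two sstables are merged.
print(len(set(windows.values())))  # 5
```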
In md-3-big-Data.db the record does not have a TTL, as sstabledump shows:
sstabledump md-3-big-Data.db
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 44,
"clustering" : [ 4 ],
"liveness_info" : { "tstamp" : "2019-06-03T13:57:41.792168Z" },
"cells" : [
{ "name" : "when", "value" : "8e11b910-8607-11e9-9e9d-6f8e1005a055" }
]
}
]
}
]
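The missing ttl field in liveness_info is what marks this row out. As a rough sketch, sstabledump output like the dumps above could be scanned for such rows with a few lines of Python. The function name rows_without_ttl is hypothetical, and the check only looks at row-level liveness_info as it appears in these dumps, so a row whose TTL is recorded elsewhere (for example on individual cells) would also be reported.

```python
import json


def rows_without_ttl(dump_json: str):
    """Return (partition key, clustering) for rows whose liveness_info has no ttl."""
    found = []
    for partition in json.loads(dump_json):
        key = partition["partition"]["key"]
        for row in partition.get("rows", []):
            if row.get("type") != "row":
                continue  # skip anything that is not a plain row
            if "ttl" not in row.get("liveness_info", {}):
                found.append((key, row.get("clustering")))
    return found


# The md-3-big-Data.db dump shown above
SAMPLE = """
[
  {
    "partition" : { "key" : [ "1" ], "position" : 0 },
    "rows" : [
      {
        "type" : "row",
        "position" : 44,
        "clustering" : [ 4 ],
        "liveness_info" : { "tstamp" : "2019-06-03T13:57:41.792168Z" },
        "cells" : [
          { "name" : "when", "value" : "8e11b910-8607-11e9-9e9d-6f8e1005a055" }
        ]
      }
    ]
  }
]
"""

print(rows_without_ttl(SAMPLE))  # [(['1'], [4])]
```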
So slowly the data disappears from the DB:
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------------------+-----------
1 | 4 | 2019-06-03 13:57:41.793000+0000 | null
1 | 5 | 2019-06-03 13:58:30.482000+0000 | 44
1 | 6 | 2019-06-03 13:59:20.132000+0000 | 94
Leaving only the record without a TTL:
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------------------+-----------
1 | 4 | 2019-06-03 13:57:41.793000+0000 | null
This will allow the first two sstables to be deleted, but the third will block any further ones being deleted:
ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 51 Jun 3 15:58 md-3-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:59 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 16:00 md-5-big-Data.db
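The blocking behaviour can be reproduced with a small Python sketch. This is a deliberately simplified model rather than Cassandra's actual code. It assumes every sstable overlaps every other, which holds here because all rows share partition 1, and it assumes md-1 to md-5 hold clustering values 2 to 6 in order. The names SSTable, droppable, hms and table are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SSTable:
    name: str
    min_ts: int                       # oldest write timestamp in the sstable
    max_ts: int                       # newest write timestamp in the sstable
    max_deletion_time: Optional[int]  # latest expiry, None if any data never expires


def hms(text: str) -> int:
    """Convert HH:MM:SS to seconds since midnight."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


def table(name: str, written: str, ttl: Optional[int]) -> SSTable:
    """Build a one-row sstable from its write time and TTL."""
    ts = hms(written)
    return SSTable(name, ts, ts, None if ttl is None else ts + ttl)


def droppable(sstables: List[SSTable], now: int, gc_grace: int) -> List[str]:
    """Names of sstables that are fully expired and not blocked by older live data."""
    gc_before = now - gc_grace
    expired = [s for s in sstables
               if s.max_deletion_time is not None and s.max_deletion_time < gc_before]
    blockers = [s for s in sstables if s not in expired]
    oldest_live = min((s.min_ts for s in blockers), default=None)
    # An expired sstable can only go if all its data is older than any live data.
    return [s.name for s in expired if oldest_live is None or s.max_ts < oldest_live]


tables = [
    table("md-1", "13:55:48", 300),
    table("md-2", "13:56:50", 300),
    table("md-3", "13:57:41", None),   # the record without a TTL
    table("md-4", "13:58:30", 300),
    table("md-5", "13:59:20", 300),
]

print(droppable(tables, now=hms("14:08:00"), gc_grace=60))  # ['md-1', 'md-2']
```

In this model md-4 and md-5 are fully expired but still blocked, because md-3 holds live data that is older than anything in them.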
Even if we “fixed” this broken record by adding a TTL, it won’t help:
insert into twcs (id , value , when ) values (1 ,4, now()) ;
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------------------+-----------
1 | 4 | 2019-06-03 14:09:38.286000+0000 | 295
(1 rows)
The new upsert is added to a new sstable, md-6-big-Data.db, and the old one is not touched.
ls -l *Data*
-rw-r--r-- 1 cassandra cassandra 51 Jun 3 15:58 md-3-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 15:59 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 16:00 md-5-big-Data.db
-rw-r--r-- 1 cassandra cassandra 53 Jun 3 16:10 md-6-big-Data.db
Eventually the record will expire, but the sstables will never be deleted:
select id, value, dateof(when), ttl(when) from twcs;
id | value | system.dateof(when) | ttl(when)
----+-------+---------------------+-----------
(0 rows)
In conclusion, it is important that every single record has a TTL attached to it.
Therefore it is highly recommended that default_time_to_live is set in the table definition. This makes it quite difficult to accidentally create a record without a TTL.