Astiga's new synchronisation system
Lectori salutem.
If you are reading this, then I am going to presume that you know that Astiga is a music streaming service that streams music from your cloud storage (WebDAV, OneDrive, Google Drive, Dropbox, etc.) to your browser, phone, television, terminal or whatever you use.
You may have noticed that Astiga got a new synchronisation system (the system that fetches the files from your cloud storage, extracts the metadata and stores it in the database) on the 8th of February 2020. Either you noticed the blazing speed at which it can synchronise now, or you noticed the few teething problems it faced (especially on 12 and 13 February). With this blog post I want to give you some background on why the new system was introduced, how it differs from the old system, and what went wrong on the harrowing day that was 12 February 2020 (just kidding, it was not that bad).
Let's start with the why. This question is rather easy to answer. First of all, syncing was always a slow, cumbersome process. Now, most people did not mind waiting a bit longer for it to complete, but it is annoying nevertheless. Secondly, I wanted to include audio classification in Astiga Vibe. The problem is that classification requires quite a bit of CPU. The fact that it takes a bit longer is not an issue; however, it should not affect the performance of Astiga as a whole in any significant way. Thirdly, the previous system did not scale too well.
And with that, we segue into what the old system was, and how it differs from the new system. The old system consisted of one server. As soon as a synchronisation was started, it would spawn a new process that would first list all the files to add, and then add them in a fairly linear fashion. It would use a pool (whose size was 1 for normal users, and 3 for premium users), and regardless of what happened around it, it would just execute its work and be done with it. It did not look at what other syncs were running - all of them were completely independent. With infinite hardware, this scales really well. The problem is that the server has finite resources, and when you have to account for peaks as well, that means that on the whole a lot of resources are underused most of the time. The opposite is true as well: some peaks would exhaust the resources and cause the server to slow down.
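For illustration, here is a minimal sketch in Go of how such an independent per-sync pool behaves. The names, file lists and everything else are assumptions made up for the example, not Astiga's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncFiles mimics the old model: every sync spawns its own pool
// (size 1 for normal users, 3 for premium users) and works through
// its file list without any knowledge of other running syncs.
func syncFiles(user string, files []string, poolSize int) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < poolSize; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				time.Sleep(100 * time.Millisecond) // placeholder: download file, extract metadata
				fmt.Printf("[%s] synced %s\n", user, f)
			}
		}()
	}

	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
}

func main() {
	// Every sync is completely independent, so with many users syncing
	// at once the total resource usage is effectively unbounded.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); syncFiles("anna", []string{"a.flac", "b.mp3"}, 1) }()
	go func() { defer wg.Done(); syncFiles("bob", []string{"c.ogg", "d.mp3"}, 3) }()
	wg.Wait()
}
```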
The new system uses a custom global scheduler that runs on a server separate from the "main" server. Once a synchronisation is started, it is added to a queue. The second server then fetches the queue, retrieves the tasks (i.e. the files to be synced) for each synchronisation, and adds them to another queue specific to the user. The scheduler has a fixed set of slots that can each execute a task. So let's say we have two slots, 1 and 2, and three users with files: Anna, Bob and Charlie. What the new scheduler does is go to Anna, get the first task belonging to Anna (a file to be synchronised) from her list, and assign it to the first available slot (e.g. slot 1). It then goes to Bob, gets the first task belonging to Bob, and assigns it to the first available slot (slot 1 was taken by Anna, so it is assigned to slot 2). For Charlie, it does the same, but because no slots are available, it waits until one of the slots becomes available again, and then assigns it. After Charlie it goes back to Anna, and the circle is complete. This means that the available processing power is distributed fairly across all users. If Charlie were a premium user (which he of course should be :-) ), his first three tasks would be assigned to slots instead of just the first one. This means that the processing power can be used much more efficiently. If you are the only one syncing, then you get to use all the processing power; if millions of people are syncing, it will still only use a set maximum of processing power. The former case is much more likely than the latter.
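Here is a hedged sketch of that round-robin idea in Go. The slot count, the queue layout and the premium weighting are assumptions made up for the example; the real scheduler is Astiga-specific and not shown here:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type user struct {
	name   string
	tasks  []string // files still to be synced for this user
	weight int      // tasks handed out per round: 1 for normal, 3 for premium
}

func main() {
	users := []*user{
		{name: "anna", weight: 1, tasks: []string{"a1", "a2", "a3"}},
		{name: "bob", weight: 1, tasks: []string{"b1", "b2"}},
		{name: "charlie", weight: 3, tasks: []string{"c1", "c2", "c3", "c4"}},
	}

	const slots = 2
	sem := make(chan struct{}, slots) // fixed number of execution slots
	var wg sync.WaitGroup

	// Cycle over the users round-robin until every queue is empty.
	for pending := true; pending; {
		pending = false
		for _, u := range users {
			for i := 0; i < u.weight && len(u.tasks) > 0; i++ {
				task := u.tasks[0]
				u.tasks = u.tasks[1:]
				pending = true

				sem <- struct{}{} // block here until a slot frees up
				wg.Add(1)
				go func(name, task string) {
					defer func() { <-sem; wg.Done() }()
					time.Sleep(100 * time.Millisecond) // placeholder for the actual sync work
					fmt.Printf("[%s] %s done\n", name, task)
				}(u.name, task)
			}
		}
	}
	wg.Wait()
}
```

The buffered channel plays the role of the fixed set of slots: sending into it blocks exactly like Charlie waiting for slot 1 or 2 to become free, and the weight field is what gives a premium user three tasks per round instead of one.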
So what went wrong? The first thing that went wrong is pretty simple: the new system is a lot quicker and can handle a lot more simultaneous synchronisations. Every task is in principle independent, which means it makes its own connection to the database to add the file to your library. More simultaneous tasks means more connections. The database had a limit on the maximum number of simultaneous connections that was too low, which would cause connections to fail, which in turn was treated like a database error, which triggered the reconnect mechanism. As you might imagine, this caused the number of connections to rise even further. A relatively simple fix.
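A common way to avoid that kind of connection storm is to share one capped connection pool across all tasks instead of letting every task dial the database itself. Here is a minimal Go sketch with a placeholder driver, DSN and table; I am not claiming this is what Astiga does internally:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // placeholder driver; swap for whatever the backend actually uses
)

func main() {
	// One shared pool for all sync tasks instead of one connection per task.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/astiga")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the cap well below the database server's connection limit, so a
	// burst of tasks queues for a connection instead of exhausting the server.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(10)

	// Each task then borrows a connection from this pool, e.g.:
	// _, err = db.Exec("INSERT INTO files (path, title) VALUES (?, ?)", path, title)
}
```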
The tricky bit with a global scheduler is that it is global. Things go wrong, intentionally or unintentionally. For example, someone could turn off their NAS during a sync (which is a bad idea, but it happens). This causes some errors (which are caught). In the old system, however, there was no dependency on other syncs whatsoever. In the global system, this is not the case. This went wrong in two places: 1) The files for newly added synchronisations were being gathered independently, but the scheduler still waited for all of them to be finished before continuing. This did not so much cause an error, though it did slow things down considerably. 2) Tasks that got stuck (for example because a file kept retrying its download, the file was offline, or some other miscellaneous error occurred) would still occupy a slot, eventually starving the pool of available slots. This is what really caused the issue, as at some point the scheduler would simply stop and not continue (technically, it would continue after a timeout of three days, but that is not desirable). It has been solved by releasing a slot after a fixed amount of time, no matter what.
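To illustrate the "release the slot no matter what" fix, here is a hedged Go sketch built around a deadline. The durations and names are invented for the example and are not Astiga's actual values:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithDeadline runs one sync task but guarantees that its slot is
// released after maxTaskTime, even if the task hangs (offline NAS,
// endless download retries, and so on).
func runWithDeadline(sem chan struct{}, task func(context.Context) error, maxTaskTime time.Duration) {
	sem <- struct{}{} // occupy a slot
	go func() {
		defer func() { <-sem }() // always hand the slot back

		ctx, cancel := context.WithTimeout(context.Background(), maxTaskTime)
		defer cancel()

		done := make(chan error, 1)
		go func() { done <- task(ctx) }()

		select {
		case err := <-done:
			if err != nil {
				fmt.Println("task failed:", err)
			}
		case <-ctx.Done():
			// The stuck task is abandoned so that its slot becomes
			// available again instead of starving the scheduler.
			fmt.Println("task timed out, slot released")
		}
	}()
}

func main() {
	sem := make(chan struct{}, 2)
	stuckTask := func(ctx context.Context) error {
		time.Sleep(10 * time.Second) // simulates a download that never finishes
		return nil
	}
	runWithDeadline(sem, stuckTask, 2*time.Second)
	time.Sleep(3 * time.Second) // give the example time to print before main exits
}
```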
The fix was relatively simple, but because I was asleep when the problem occurred, it still caused the sync process to stall for a couple of hours.
Personally I believe transparency is key when it comes to these issues, so I tweeted about it, plus I sent an email to all those affected explaining what happened and what I was doing about it. I also restarted all syncs that got stuck. Oh, and I wrote a blog post (see what I did there?).
The number of Astiga users keeps growing at an increasing rate, and that is absolutely great! There are plenty of plans for the future (and you can of course add your own suggestions on the feedback forum). If you want to stay up to date with what happens, consider following Astiga on Twitter. This is also the right time to consider getting Astiga Premium. I am not saying that you should, just that you could.