Tuesday, January 3, 2023

Fediverse, (about) the design.

I've reiterated several times on the point that ActivityPub is a terrible protocol (which is easy to agree), what I didn't mentioned so far is how bad is the implementation of most used instances. Under the architect point of view, it looks to me like developers are trying to write a "single user instance (which is - therefore - not supposed to scale) while actually it will be abused, up to thousands of users"

First, we should draw a line between several kinds of instances. I would say, "single or little users", which doesn't need to scale, and "huge instances" which are supposed to scale. Or, they will NEED to scale, sooner or later.

Let's say:

type 1 : single user or little amount of users
type 2: large, scaling instances

All software I see around is type 1, often abused to lot of users, like in some large instances with more than 1M users.

About type 1, there is a little to say about implementation (seems quality of code and documentation aren't a priority) , so I will only describe the biggest issue I see : which is about data.

Small instances are often backed by sqlite3, Which makes sense, but we are not considering an issue. We have, basically, several types of data to store:

the database of users and related tokens
the database of incoming messages in the instance mailbox
the database of incoming messages for users
the storage for attachments

These are scaling in different ways, while there is little chance to limit them in size, or access time. All the software I know (Honk, gotosocial, etc) is using a single sqlite3 file , like the developer is looking for trouble. We perfectly know that events are asynchronous, and we perfectly know that the amount of write operation for the /inbox API (AKA "federated timeline" ) scales depending by the amount and size of servers we federate with. The amount of operations triggered by our users is scaling by our amount of users, by the amount of clients they own, and the amount of incoming messages for our users is depending by a mix of user behaviour and connected instances behaviour.

The size of attachment, indeed, scales mostly depending by the kind of contents (videos, images, music, depending by the kind of instance), and then by the maximum allowed size, and the amount of active users.

Each kind of traffic scales in a different way. /inbox is quite independent by the activity on our instance, and very sensitive on federated instances activity.

In such a case, if the aim is to use sqlite3, instead of looking for troubles and stress the concurrency on sqlite3 API, I would suggest to use at least 4 files:

one sqlite3 for federated messages received by instance /inbox , so that we can write without locking our users in the meantime, or take broken concurrency risks.
one for user's messages, although I would even say to use another sqlite3 per user. This would be useful for multiple users, even when they are small numbers.
another sqlite3 file for users and authorisation tokens. This would isolate user's data and prevent race conditions.
last database for attachments, to map ID to file.

This would seem very strange for a single-user instance, but for software like gotosocial, which is supposed to host many users, could be a bless. Sure, people will say "but you can setup gotosocial to use postgressql" , just because they believe in the superstition that , since the effort is moved to a RDBMS, everything is now efficient.

I'd like to spend some words about the idiocy of using PostgresQL for ActivityPUB. If you are old enough to remember the old LAMP hype, you remember that lamp was Linux, Apache, MySql, PHP. When MySQL was under Oracle's control, most of Opensource Hoplites migrated to Postgres, just "because of yes". What they didn't noticed is that Postgres was a completely different kind of database, because its design was based around the idea of serialising OOP Objects, and not single selfcointained JSON, or documents.

This resulted in some hybrid design of storage Postgres has, which needs "auto analisys" (like the developers, I guess) , which is what I call "performing but not efficient". Neither Postgres is easy to scale. I was just delighted when the developer of Mastodon had to migrate the database on a bigger hardware.Seriously, Eugen?

In a DECENT database, you MUST be able to add another instance or server, load balance, and then IMPROVE the performance. Since , at least, the birth of IBM360.

The fact "more hardware" is almost the only solution to PostgresQL capacity issues, should suggest how different it is from MySql , by example. Not that I am a big fan of MySQL, but at least is possible to add servers and improve performance. And no, CockroachDB didn't "fixed" the issue, since , in order to make a good cluster, you CANNOT use some "advanced" feature of PostgresQL protocol. There are also "workarounds", using third parties plugins, sharding and more, or like Postgres-XL and Citus, but you need to be very, very, very, very self-confident to use them. Besides, to restore lost data while using such plugins would be a nightmare, if one shard is lost.

Sure, PostgresQL can store data, sure it can easily store JSON, and fits most of serialisation issues, but it is a completely different kind of object when compared with MySql. If you think that MySql/MariaDB and Postgres are alternatives to each others, then:

you don't KNOW about databases
you don't UNDERSTAND databases.

and , definitively, you should leave IT and work in different fields, like farming, mining, and/or more.

When I see a software (and it happens quite often) which is using postgres under the assumption the amount of data will size around dozens of megabytes, (like misskey is doing) usually I know that I look at a terrible programmer: too incompetent to write some decent logic to handle data using the filesystem, they just move the issue to the RDBMS, thinking "someone else wrote the good code I am unable to write".

Wrong.

No database is implementing the shit of code logic you have in mind. The database is just a brunch of SOME software, decently written: no RDBMS is optimised for a few dozen megabytes. Sure, it runs fast with little data: this is not due of good code, is due to little data.

Plus, the overhead is horrible.

If you design an instance assuming little amount of data, other database are quite better, like Redis. And Redis scales and clusters, too, much better than a Jurasic database like Postgres.

That's it. There are tons of better databases people could use for the purpose. Given the serialised nature of activitypub payload, Mongo and similar No-Sql could serve the purpose much better, in case you insist on some powerful _whatever_.

I close it here.

Back to the topic, we are now discussing about huge instances. How to design some Instance which scales?

The point is, ActivityPub seems to be set of API , with incompetent design , horrible security, as rational as a drunken monkey on crack. In some ways it remember me some catastrophic nightmares from the past, like CORBA, in some other ways it looks like a RPC, together with some attempt to rebuild SOAP using JSON. If a train-crash happened inside the Chernobyl reactor 4, could have looked better.

Anyhow, if you drink enough tequila, ActivityPub looks like some API.

The point is, some APIs are scaling with the amount of federated servers, some are scaling with the amount of users in the local instance, some are scaling with the amount of times the users are accessing the instance, some with the amount of clients each user is connecting with. Some API are calling other instances, too, and the instances are utterly unpredictable.Meaning that, our instance is also consuming other's API. (Basically is a push AND pull protocol, like SMTP, while using a request-response protocol. AAAARGH!).

Since it's a set of API, the first approach could be to use an API gateway for ingress traffic. Since the API function is not completely reflecting the business logic (because of terrible, radioactive design) , we could also use some orchestration like kubernetes , and then a router-based infrastructure like traefik to isolate different load models.

In any case, we can say that:

OAuth2 APIs are scaling in traffic with the amount of users actually using the instance. That means, if a single user has 4 clients, "user" just means "client". Each one with a token to check, from time to time.
Instance's /inbox APIs (AKA "federated timeline") are scaling with the amount of servers the system federates with, given their size and activity.
The user's inbox is scaling with the amount of content users are posting, more than with the amount of clients. It's unlikely some user will use the instance from more than one client at a time.
Some call of "mastodon" APIs (i.e: mobile client refreshing the timelines) are scaling with the amount of clients, which are continuously asking for refresh of timelines.
Some call of "mastodon" APIs are scaling also with the size of attachment , like with PeerTube with videos, or depending by the size of attachment.
The outbound traffic TO other instances scales with the amount of user's activity to other instances, and the amount of instances known by the instance.

This identifies at least 6 microservices on the back-end. Each one MUST scale independently, while maybe the remaining ones have no demand to fulfil. It makes completely no sense to scale all together, just because one set of API on 6 is having a peak of demand.

If we think to something like autoscaling in AWS, by example, scaling all together would be a waste of resources. So better to scale each microservice in an independent way, like a kubernetes-like orchestration can do. (with proper resource policies).

About OAuth2, many api gateway, especially the one with a "developer portal" (Like Apigee or Kong) are coming with some OAuth2 capability. Otherwise, we need some instance of a vault-like software, capable of OAth2.

About the second issue, we need some microservice which ready /inbox call and just writes on some back-end database.

Same for 3 and 4, with the good news, most of API gateways can cache API result, mitigating the "4" load when lot of clients are reading the same timelines.

Microservice 5 should be some kind of object storage, like minio. This would just save attachments. Or, write to some cloud, like S3.

Microservice 6 is kinda "strange" microservice, 'cause from the API gateway perspective, this would be a call to some "external" API.

A similar infrastructure would scale, as much as the database and the object storage is capable to scale. Why? Because scaling the microservice without scaling the backend database will result, in high load situation, in some kind of DOS to the DB itself. Quite common mistake (to have a scaling frontend/backend with non-scaling DB), but still very, very, stupid pattern of design.

And my point here is: do we have anything similar? No. None of the SW we are using in the fediverse can fit on this design. So, I foresee lot of trouble in the future. So far, we know how hard it is to scale Mastodon, and we are talking of 1.3 millions users on the bigger instance.

Do we understand what could happen, if some amateur-designed software like Mastodon will hit a peak of 10.000.000 users, or more? Well, Eugen will try to use huge instances for the database, as he did already, and then bigger, and bigger again, until no server offered by any cloud will fit. And then? The idea of cloud is to parallel the power, scaling horizontally, which Postgres cannot really do, at least without tricks and very sensitive add-ons.

Let's say that Postgres pretend to be able to scale to multi-server situation, but ... I don't see how a jurasic software can survive the load of a REALLY big instance. If 50% of users of Twitter would happen to migrate on the fediverse, likely the "how to scale" issue could kill the whole thing.

And this is because the design of Mastodon is just amateur, unless someone will come with some "Mastodon Enterprise", based on a better design , like the one I described above.

Nowadays, the point is : we have nothing like this. Here I want do introduce some software, named Pleroma, written by a PHP-Javascript aware team who decided Elixir was cool. Let's discuss it's paradoxical un-scalability.

And sure, Elixir MAY BE cool , if people stops using it like it was PHP or, worst, Javascript. This is not because of the language itself (is clearly a Vogon-minded language, like Erlang) : it's because it runs with the erlang virtual machine. And the Erlang virtual machine is cool.

As a person working with Telcos &networking, I know tons of Ericsson products. So I know the ERlang virtual machine very well. I know what it CAN do. And when I compare what Pleroma is NOT doing, which the ERlang virtual machine can do out-of-the-box, I can say just one thing: "Lain", whoever he is, is just a pretentious poser with no real skills in good development.

Pleroma is good because it's easy to install and cheap on resource. This is why I am using it, too. Everything else in this software is more or less incompetent garbage, or if you prefer: PHP/Javascript translated to Elixir. This guy THINKS in PHP (or maybe some silly way to write javascript?) , he has no clue what the ERlang VM can do, so Pleroma reinvents the wheel, and the outcome is a square-shaped wheel.

Just for the record: Pleroma cannot scale horizontally. Which is as preposterous it can be, if you think that Erlang VM comes with an incredibly efficient system to create clusters.

https://www.erlang.org/doc/reference_manual/distributed.html

Plus, the Erlang VM comes with a distributed database too, both memory and disk based:

https://www.erlang.org/faq/mnesia.html

Now, when I see a Erlang-VM software which cannot really scale the backend because the database it uses is jurasic (like Postgres is) and the software can't make use of distributed Erlang, I understand one simple thing:

the developer never read a manual of the tool he was developing for.

Which makes me remember words like: "phony", "amateur", "unprofessional", "charlatan". When we ask ourselves which fediverse software COULD scale better, for sure the answer is "Pleroma has the biggest potential to scale, since it runs on the most scaling and reliable VM in the planet, often used in carrier-grade applications for telco industry".

But when we see what was the result, it looks like "lain" didn't even try to take advantage of the VM. It's using Elixir like it was just a bare school exercise, taking advantage of NONE of its features. The most interesting ones.

Nevertheless, since Pleroma has such a little footprint, we can try to use it like it was scalable, running more instances, just isolating APIs. I will keep it simple, you can add what you need.

Imagine we have a swarm stack running pleroma plus traefik. What we would do in such a stack is something like:

version: '3.9'

services:
pleroma:
hostname: bbs-whatever
restart: always
image: repo.org/pleroma:latest

...
ports:
- "4000:4000"

...

deploy:
labels:
- traefik.http.routers.pleroma.rule=Host(`ourinstance.net`)

.....

everything else in the file is interesting, but not referring to the issue. What I want to notice is that one instance is caring for ALL the APIs in the "ourinstance.net".

Now, we want to have TWO instances of pleroma, to try a workaround on "scaling horizontally" when we have more of one server. What We could do?

version: '3.9'

services:
pleroma-users:
hostname: bbs-whatever-users
restart: always
image: repo.org/pleroma:latest

...
ports:
- "4000:4000"

...

deploy:
labels:
- traefik.http.routers.pleroma.rule=Host(`ourinstance.net`)

.....

pleroma-inbox:
hostname: bbs-whatever-inbox
restart: always
image: repo.org/pleroma:latest

...
ports:
- "4000:4000"

...

deploy:
labels:
- traefik.http.routers.pleroma.rule=Host(`ourinstance.net`) && PathPrefix(`/inbox`)"

.....

So basically, we are running 2 instances, sharing same database, (which is still an issue, but postgres is postgres), with the only difference that one of the two containers will take all the load related to the instance's federated inbox, while the first one will get the rest. Which means, if the database can resist to the overall traffic, the federated inbox load will not impact the user experience from the clients.

You can repeat this schema isolating how many APIs you want, like "/api", or even more granular. This of course would require you to add more connections on the database, since each container will replicate the full number, but in the other side, each container could run on some different server of the swarm, somehow "distributing" the load.

This is the only way I see to "scale" this kind of systems, and it may work also for Mastodon itself, not only Pleroma.

Sure, we could replicate the whole Mastodon or the whole Pleroma on the whole traffic: the overhead of replicating everything else just because /inbox traffic increased is very stupid, since /inbox scales differently from everything else.

If you scale ABCD just replicating, because B is under load, what you get is:

ABCD

...

ABCD

while, if you scale A,B,C,D in separate instances, because B is under load , you get:

A

BBB...B

C

D

avoiding too much overhead or duplication of useless threads.

Of course you may say that the first way (scaling everything) seems not too dramatic, since a thread doing nothing is not taking much resources. I answer you with a single word: "mastodon", or, if you prefer, "ruby".

In general, the fediverse COULD scale, but to do it both the protocol and the software we are using needs to be rewritten , thinking of scaling to several orders of magnitude more than now.

Otherwise, if some big player will include the fediverse in their set of protocols, the result would be : all small instances knocked out, or forced to defederate just because they can't afford the federated timeline alone.