atproto and bluesky

Can atproto scale down?

2025-02-14 by phil (they/them)

It's frequently stated[by who?] that some core components of the AT-Protocol architecture are expensive to host and don't scale down. So expensive that they are out of reach except for VC-funded commercial companies like Bluesky PBC, and expensive due to the structure of the protocol itself. Very non-decentralized.

We're going to skip past your Personal Data Server (PDS; cheap), going to put aside the Relay costs for now, and consider Bluesky's expensive AppView component.

And skipping right to the end, my answer to "can it scale down" is just: "yes!". Here's my Raspberry Pi 4b, at home, consuming a few watts and pulling around 20GB of simplified firehose events per day. It's an AppView indexing all cross-repo references (backlinks) in the AT-mosphere, often up to 1,500 created per second. It's closing in on one billion backlinks, eating up an old SATA SSD connected over a salvaged USB adapter.

Backlinks can hydrate information about social interactions. Skyblur.uk is using the index on this very pi to show Bluesky interaction counts. The index can also list all quote-posts, replies, account followers and blockers; Frontpage story comments, upvotes, and so on.

Obviously this is doing less than Bluesky PBC's AppView so maybe you're not convinced, but zoom out with me: a Hard Thing that Bluesky's AppView implementation must do is serve over 31 million users (read load) with best-in-class feedgen. Our self-host dream doesn't involve that.

Our self-host dream does involve handling the same 31M user write load as Bluesky's, but I think this is where the "it's expensive" critique gets its wires crossed: Bluesky's read load is what's actually expensive. I have a billion links on a happy raspberry pi.

~

iffffff you want to know how I think we can go from a backlinks index to a self-hosted mostly-complete Bluesky-compatible AppView experience, I'll get into it below. My project to do it is microcosm.


Scaling down a Bluesky AppView

Dan Abramov's talk Web Without Walls is worth a watch if you're new to AT-Protocol.

Data flow

A hand-drawn diagram with arrows connecting from top to bottom: multiple PDS boxes, to a Relay, to an AppView labeled 'Bluesky', to a Client app. an arrow connects Client app back up to one of the PDS boxes.

AT-Protocol has this nice circular (unidirectional!) data flow. Everyone gets their own little personal data repository for their content that's hosted by a PDS, updates are aggregated by a Relay, broadcast to AppViews, which present that content back to you.

Since the AppView gets the full feed of all data from everyone posting in the world, it can be built as a mostly typical http app backend but with an unusual write path.

Building it as a typical backend will work well, and be expensive roughly proportional to the userbase size. That is: expensive for Bluesky; presently cheap for eg Smoke Signals.

New arrows labeled 'firehose' added to the previous diagram: from the Relay to a Scraper, from Relay to Jetstream, and from Jetstream to several apps: Firesky and Final Words

Bluesky's relay is open, so you can build your own AppView and receive all the global content just by connecting with a websocket. You can put your own data types (called Lexicons) into your users' PDS, and they will come out of the relay just like all the Bluesky data.

Bluesky's unfiltered relay output is called the firehose. Architecting updates into a feed like this will bring up the alternative term event log. An awesome firehose adapter that re-emits events as simplified JSON is jetstream.
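To make "simplified JSON" a bit more concrete, here is a minimal sketch of handling one Jetstream message. The field names follow the event shape Jetstream emits as I understand it; the endpoint URL is an assumption (check the Jetstream README for the current public hosts), and `parse_commit` is a hypothetical helper, not part of any library.

```python
import json

# Assumed public Jetstream endpoint, filtered to likes only.
# Hosts change over time; treat this URL as a placeholder.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.like"
)


def parse_commit(raw):
    """Return (operation, at_uri, record) for commit events, else None."""
    event = json.loads(raw)
    if event.get("kind") != "commit":
        return None  # identity and account events carry no records
    commit = event["commit"]
    at_uri = f"at://{event['did']}/{commit['collection']}/{commit['rkey']}"
    # Delete operations have no record body, hence .get().
    return commit["operation"], at_uri, commit.get("record")
```

Feeding it from the live stream is then a loop over text messages from any websocket client connected to that URL.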

Bluesky's AppView

A large 'AppView' box with a 'Firehose' arrow pointing into it, and multiple arrows exiting to 'Client app' boxes. In the AppView box are shapes: 'Big archive', 'Search', 'Media pipeline', 'Notifications', 'Chats (DMs)', 'Feedgen', 'Mutes', and 'CDN'

…I mean I don't work there so I don't know the exact breakdown of services. They do have a giant database (ScyllaDB) with a copy of all Bluesky content, but like most modern backends, features will be decomposed into smaller individual services.

The point is that when someone says "The Bluesky AppView", or even "AppView" as an AT-Protocol component, I think it obscures the fact that there's a lot going on in there. Breaking it down into feature-oriented pieces might lead us to discover self-hostable alternative ways to implement them.

No-AppView apps

a blue 'atproto-browser.vercel.app' box points at three different pink 'PDS' boxes

One more detour, about that big giant database in the Bluesky AppView. All the data in there is also available from its owners' PDSs, and you can fetch it directly any time you want (if you don't have to reply ultra-fast to 31M users).

You can do this right now, in your browser, with some awesome tooling folks are building:

And if you click around a bit, you might get the feeling that this is almost enough to be able to doomscroll Bluesky. Like really: a fully rendered Bluesky post. You could already rebuild your following feed without any AppView:

  1. fetch the list of accounts you follow from your own PDS
  2. fetch the latest posts from each of their PDSs
  3. render them
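
The three steps above can be sketched roughly like this, using the `com.atproto.repo.listRecords` XRPC endpoint that every PDS exposes. A few things here are assumptions for illustration: `resolve_pds` is a hypothetical function you supply to map a DID to its PDS base URL (in practice via the DID document), only the first page of results is read (no cursor handling), and "render" is just a formatted string.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def list_records(pds, repo, collection, limit=20, fetch=None):
    """Call com.atproto.repo.listRecords on a PDS and return its records."""
    query = urlencode({"repo": repo, "collection": collection, "limit": limit})
    url = f"{pds}/xrpc/com.atproto.repo.listRecords?{query}"
    # `fetch` is injectable so the sketch can run without a network.
    body = fetch(url) if fetch else urlopen(url).read()
    return json.loads(body)["records"]


def following_feed(my_did, resolve_pds, fetch=None, per_account=5):
    # 1. fetch the list of accounts you follow from your own PDS
    follows = list_records(
        resolve_pds(my_did), my_did, "app.bsky.graph.follow", 100, fetch
    )
    posts = []
    # 2. fetch the latest posts from each of their PDSs
    for follow in follows:
        did = follow["value"]["subject"]
        posts += list_records(
            resolve_pds(did), did, "app.bsky.feed.post", per_account, fetch
        )
    # 3. "render" them, newest first. Comparing createdAt as strings
    #    assumes consistently formatted ISO 8601 timestamps.
    posts.sort(key=lambda r: r["value"]["createdAt"], reverse=True)
    return [f'{r["uri"]}: {r["value"]["text"]}' for r in posts]
```

One request per followed account is exactly why this is slow: following a thousand accounts means a thousand round trips before anything shows up.
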

a crop from the same skyblur screenshot from earlier: screenshot of skyblur.uk post announcing that it uses constellation to fetch like and repost counts. it shows like and repost counts (very meta).

the previous 'atproto-browser.vercel.app' diagram has a new box, 'atproto link aggregator', with a thick blue 'firehose' arrow entering it from offscreen. one arrow points from the 'browser' box to the 'link aggregator' box, and the browser box URL now reads 'atproto-browser-plus-links.vercel.app' (and yes you can go there).

Social interactions will still be missing and it won't be fast. But I have a billion links on a raspberry pi so we can already solve the first part, remember ➡️

Here, have some backlinks in your PDS browser:

(Atproto link aggregator is now called Constellation. I drew these diagrams quite a while ago)
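
For a feel of what a backlink index does, here is a toy in-memory sketch. It is not Constellation's implementation or API: `extract_links` and `BacklinkIndex` are hypothetical names, and keying by (target, source collection, path inside the record) is my simplification of the idea. A real index would live on disk and handle deletes.

```python
from collections import defaultdict


def extract_links(value, path=""):
    """Yield (json_path, target) for every at-uri or DID string in a record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from extract_links(child, f"{path}.{key}")
    elif isinstance(value, list):
        for child in value:
            yield from extract_links(child, f"{path}[]")
    elif isinstance(value, str) and value.startswith(("at://", "did:")):
        yield path, value


class BacklinkIndex:
    def __init__(self):
        # (target, source collection, path) -> set of source record at-uris
        self.sources = defaultdict(set)

    def add_record(self, source_uri, collection, record):
        for path, target in extract_links(record):
            self.sources[(target, collection, path)].add(source_uri)

    def count(self, target, collection, path):
        return len(self.sources.get((target, collection, path), ()))
```

With that shape, a post's like count is `count(post_uri, "app.bsky.feed.like", ".subject.uri")`, and follower counts, reposts, and replies are the same query with a different collection and path.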

Faster feed generation

We can take different approaches to speed up the slow following feed generation above. Another micro-appview could listen to the firehose and pre-cache posts from accounts you follow. That pre-cache could expand to friend-of-a-friend, and you might get a decent hit rate on reposts for decent performance!
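
A minimal sketch of that pre-cache idea, assuming commit events have already been parsed into their parts. `FollowedPostCache` is a hypothetical name, and the fixed-size eviction is just one simple policy; friend-of-a-friend expansion would amount to growing the `followed` set.

```python
from collections import OrderedDict


class FollowedPostCache:
    """Keep the most recent posts from followed accounts, bounded in size."""

    def __init__(self, followed_dids, max_posts=1000):
        self.followed = set(followed_dids)
        self.max_posts = max_posts
        self.posts = OrderedDict()  # at-uri -> record, oldest first

    def on_commit(self, did, collection, rkey, operation, record=None):
        # Ignore everything except posts from accounts we follow.
        if collection != "app.bsky.feed.post" or did not in self.followed:
            return
        uri = f"at://{did}/{collection}/{rkey}"
        if operation == "delete":
            self.posts.pop(uri, None)
            return
        self.posts[uri] = record
        self.posts.move_to_end(uri)
        while len(self.posts) > self.max_posts:
            self.posts.popitem(last=False)  # evict the oldest entry

    def feed(self, limit=50):
        """Newest-first list of (at-uri, record) pairs."""
        return list(self.posts.items())[::-1][:limit]
```

Almost all of the firehose gets dropped on the first `if`, which is what keeps this cheap for a small number of users.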

We could adopt the custom feed generator APIs that the Bluesky app uses and plug in existing custom feeds. If you proactively render these as content comes in (push style, like Bluesky does actually), it might even feel like pretty good UX. For a small number of users it's not resource-demanding.

Composable micro-AppViews

a blue arrow labeled 'firehose' points down to a box called 'atproto link notifier'. it has arrows to and from stacked boxes below titled 'subscribers for notifications'.

So we have a backlink index for hydrating social interactions: exists and works today. A hand-wavy feedgen description that hopefully sounds plausible. Notifications are an important part of social media. And they're actually pretty easy!

All Bluesky notifications (except DMs) are just backlinks-as-they-happen. We already have code that extracts backlinks from the firehose in real time: adapt it to trigger webhooks or send a websocket message whenever a new backlink refers to your account or content!
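
Roughly, the matching step could look like the sketch below. It assumes backlinks arrive already extracted as (source record, collection, path, target) and that at-uris in records use DIDs as their authority. `LinkNotifier` and `target_did` are hypothetical names; the callback stands in for a webhook call or websocket send, and skipping self-interactions is my own choice.

```python
from collections import defaultdict


def target_did(target):
    """Account a link belongs to: the DID itself, or an at-uri's authority."""
    if target.startswith("at://"):
        return target[len("at://"):].split("/", 1)[0]
    return target


class LinkNotifier:
    def __init__(self):
        self.subscribers = defaultdict(list)  # did -> list of callbacks

    def subscribe(self, did, callback):
        self.subscribers[did].append(callback)

    def on_backlink(self, source_uri, collection, path, target):
        owner = target_did(target)
        if target_did(source_uri) == owner:
            return  # don't notify people about their own activity
        for callback in self.subscribers.get(owner, ()):
            callback({
                "source": source_uri,
                "collection": collection,
                "path": path,
                "target": target,
            })
```

A like, a reply, and a follow all go through the same `on_backlink` call; only the collection and path differ.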

Pieces are coming together:

diagram with a large central box that says 'pi-sized appview' with smaller boxes inside: atproto link aggregator, atproto link notifier, atproto record cache, and lazy cdn are all internally pointed at by 'MySky' for 'hydrate likes', 'notifications', 'feed', and 'media' respectively. from the outside above there is a thick blue arrow labeled 'firehose' that points at 'jetstream' which in turn points to several of the internal boxes. the link notifier, record cache, and lazy cdn point with lighter arrows to three PDS boxes above. finally, a stack of client app boxes below are pointed at by 'MySky'.

All of these components can run with minimal resources. I'm asserting it. They take a minute to build, maybe I'm over-confident. But I really think that self-hosting a Bluesky-compatible appview with most of the Bluesky experience is within reach.

Beyond Bluesky

Where I get excited about all this is: these micro-AppView services are largely not specific to Bluesky content. Backlinks tend to look the same across lexicons, so notification subscriptions for a whole new app can just work. PDSs store media as blobs in a generic way, so atproto CDNs can benefit all types.

I don't think the usefulness of these services is limited only to scaling down other existing AppViews either. Like if you build a new photo-sharing app on atproto, you could lean on this generic notification service, subscribing to interactions sourced from your lexicon, and get that feature working with a few API calls.

The big picture

I'd like to think of this as a bottom-up approach to scaling down. Can it get us to decentralization? If we scale to millions of copies of micro-AppViews, it will burden the relays. If we approach content hydration with heavy fetching against PDSs, it could overload them. If you self-host a viral skeet will you get a surprise bandwidth bill?

That's a silly thought experiment, because operating and orchestrating all the little services is not within reach for many people even if it is cheap. It will probably be a tiny number. So then is this a meaningful contribution to decentralization?

Well thanks, me, for asking. I don't know about meaningful but I have two sources of optimism:

  1. Many of the micro-AppView components can be shared, just like Bluesky's relay. Like the relay, the backlink index requires a global view, and it's somewhat neutral and substitutable. The notification service isn't specific to Bluesky's lexicon. Maybe the number of instances will grow at something like a log-N of the network size.

  2. Top-down approaches to decentralizing might also work out! Maybe another org with some cash or funding will start running a full-scale Bluesky-compatible alternative AppView with Big Giant Databases ready to serve millions of users' worth of read load.

I think a healthy future for AT-Protocol looks like both kinds, top-down and bottom-up decentralization* happening.

~

*feel free to quibble with my use of the word "decentralization" here and in this whole post. I do too.


Other notes