atproto and bluesky

consuming the jetstream firehose correctly

2025-02-07 by phil (they/them)

Some things I've learned through working with jetstream. This is going to be in-the-weeds and won't matter for a lot of apps!

If you're making any kind of "sampling" app, like firesky, final words, gitfeed, emojistats, or skymood, then you don't need to read this.

Or if you can tolerate missing a small amount of events from the firehose, you probably don't need to worry about any issues mentioned here!

Obviously feel free to read on, but it's easy and cheap to get going, right now. Jetstream works very well today.

If you're making an appview or any service where you want to reliably receive every event from the firehose (or at least know when you haven't), then this post might have some useful information for you.

ℹ️ The time_us property is jetstream-local
✨ The time_us property is a monotonically increasing clock
ℹ️ Events might be re-ordered
ℹ️ You won't know if you missed events
‼️ Jetstream can drop and reorder events when reconnecting with a cursor
‼️ Expect abruptly closed connections on overloaded instances
Conclusions

ℹ️ The `time_us` property is jetstream-local

Every jetstream event has a time_us property. This is not part of the atproto data! Jetstream adds time_us at a late stage during event processing, immediately before it is emitted to connected clients.

So, measuring time_us against your local clock is mostly measuring network latency + clock skew.
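A minimal sketch of that measurement, assuming time_us is a Unix timestamp in microseconds. The `lag_ms` helper is a hypothetical name, not part of any jetstream client:

```python
import time
from typing import Optional


def lag_ms(event: dict, now_us: Optional[int] = None) -> float:
    """Milliseconds between jetstream's time_us and the local clock.

    The result mixes network latency with clock skew between the jetstream
    host and this machine, so it can even be negative.
    """
    if now_us is None:
        now_us = time.time_ns() // 1_000  # local wall clock, in microseconds
    return (now_us - event["time_us"]) / 1_000


# Example with a fixed "now" so the result is reproducible.
event = {"time_us": 1_738_965_075_347_772}
print(lag_ms(event, now_us=1_738_965_075_397_772))  # 50.0
```

A steadily growing value suggests you are falling behind; a roughly constant offset is more likely latency and skew.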

The main implication to be aware of: you cannot use a time_us from one jetstream server to sync events with a different jetstream server and expect seamless continuity. The same events will have different time_us.

When switching between instances, it may be prudent to rewind your cursor a few seconds for gapless playback if you process events idempotently.
—jetstream readme
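A rough sketch of building the reconnect URL with that rewind applied. The host below is a placeholder, `resume_url` is a hypothetical helper, and the 5-second rewind is an arbitrary reading of "a few seconds":

```python
from urllib.parse import urlencode

REWIND_US = 5 * 1_000_000  # assumed rewind: 5 seconds, in microseconds


def resume_url(host: str, last_time_us: int, switching_instance: bool) -> str:
    """Build a subscribe URL that resumes from the last processed event.

    When moving to a different instance, time_us values are not comparable,
    so back the cursor up and rely on idempotent processing for the overlap.
    """
    cursor = last_time_us - REWIND_US if switching_instance else last_time_us
    return f"wss://{host}/subscribe?" + urlencode({"cursor": cursor})


print(resume_url("jetstream.example", 1_738_965_075_347_772, switching_instance=True))
```

The rewind only helps if replaying a few seconds of already-seen events is harmless for your app.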

✨ The `time_us` property is a monotonically increasing clock

This one is nice! The time_us property for events from a single jetstream instance is supposed to always increase and never repeat. So, if you connect with cursor=:your_last_events_time_us, you should get exactly that last event as the first event for your new connection. So:

it should* be safe to increment the cursor by one and reconnect to the same instance.
you shouldn't* need to worry about handling edge cases where multiple events with the same time_us could get replayed on reconnect.

*see next heading

Note that while it's implemented as monotonic, it's not actually documented as such, and in fact the readme almost implies that it's not:

When reconnecting, use the time_us from your most recently processed event and maybe provide a negative buffer (i.e. subtract a few seconds) to ensure gapless playback
—jetstream readme
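If you do reconnect with the last time_us (or an earlier one), a high-water mark is enough to skip the replayed events, assuming the same instance and the undocumented monotonic behaviour. `ReplayDeduper` is a hypothetical name, and this sketch is not safe if events arrive out of order:

```python
class ReplayDeduper:
    """Skips events at or below the highest time_us already processed."""

    def __init__(self, last_time_us: int = 0):
        self.last_time_us = last_time_us
        self.skipped = 0  # how many replayed events were ignored

    def accept(self, event: dict) -> bool:
        """Return True if the event is new and should be processed."""
        if event["time_us"] <= self.last_time_us:
            self.skipped += 1
            return False
        self.last_time_us = event["time_us"]
        return True


dedupe = ReplayDeduper(last_time_us=100)
stream = [{"time_us": 100}, {"time_us": 101}, {"time_us": 105}]
fresh = [e for e in stream if dedupe.accept(e)]
print(len(fresh), dedupe.skipped)  # 2 1
```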

ℹ️ Events might be re-ordered

…but not within a repo

Jetstream uses a parallel work scheduler to process events from the relay. Since multiple events are processed concurrently, they might be emitted by jetstream in a different order than they were received from the relay.

While events in general may be re-ordered, events for a particular account ("repository") are kept in-order by jetstream. See here for the logic that upholds that.

Since events will appear in a consistent order at a repository level, you do not need to handle things like delete-record events occurring before they are created. However, you can observe references across repositories happen in strange orders, like a Reply record to a bluesky post arriving before you see the post itself created—you have to handle this scenario even if you use a relay directly.

Multiple clients connected to a single jetstream instance will all receive events in the same order as each other.
Events are emitted in the same order during replay (connecting with a time_us cursor) on the same instance.
You cannot connect to a different jetstream instance and assume that seeing the same event means you're caught up and haven't missed any.
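One way to cope with a reply showing up before its parent post is to park it until the parent arrives. This is only a sketch: `ReplyBuffer` is a hypothetical name, URIs are treated as opaque strings, and a real version would need a timeout for parents that never show up (deleted, or from before your cursor).

```python
from collections import defaultdict
from typing import Optional


class ReplyBuffer:
    """Holds replies until the post they reference has been seen."""

    def __init__(self):
        self.known_posts = set()
        self.waiting = defaultdict(list)  # parent URI -> URIs waiting on it

    def on_post(self, uri: str, parent_uri: Optional[str]) -> list:
        """Return the URIs that are now safe to index, parents first."""
        if parent_uri is not None and parent_uri not in self.known_posts:
            self.waiting[parent_uri].append(uri)
            return []
        return self._release(uri)

    def _release(self, uri: str) -> list:
        self.known_posts.add(uri)
        ready = [uri]
        # Anything parked on this post can go out now, recursively.
        for child in self.waiting.pop(uri, []):
            ready.extend(self._release(child))
        return ready


buf = ReplyBuffer()
print(buf.on_post("at://bob/app.bsky.feed.post/2", "at://alice/app.bsky.feed.post/1"))
print(buf.on_post("at://alice/app.bsky.feed.post/1", None))
```

The first call returns an empty list because the parent is unknown; the second returns the parent followed by the parked reply.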

ℹ️ You won't know if you missed events

There is no way to detect if jetstream has dropped an event. With relays, there is a sequence number that will skip a step if you have missed an event, but jetstream omits this. Observing time_us can only indicate a problem with the order of events, not whether you missed one.

It doesn't take long to see events being dropped if you run jetstream locally. I just saw these after a few moments:

{
    "time": "2025-02-07T16:51:15.347772-05:00",
    "level": "ERROR",
    "msg": "failed to get record bytes",
    "component": "consumer",
    "repo": "did:plc:[REDACTED]",
    "seq": 466746XXXX,
    "commit": "[REDACTED]",
    "action": "update",
    "collection": "app.bsky.actor.profile",
    "error": "resolving rpath within mst: ipld: could not find [REDACTED]"
}
{
    "time": "2025-02-07T16:51:30.619181-05:00",
    "level": "ERROR",
    "msg": "failed to unmarshal record",
    "component": "consumer",
    "repo": "did:plc:[REDACTED]",
    "seq": 466760XXXX,
    "commit": "[REDACTED]",
    "action": "create",
    "collection": "app.bsky.feed.post",
    "error": "$type field must contain a non-empty string"
}

Missing the first could mean failing to act on someone deactivating or deleting their account; missing the second means missing someone's post.

Both of these examples seem likely to be problems external to jetstream, but internal issues are handled the same. For example, if pebbledb fails to write a perfectly valid event, it is just dropped and won't be emitted to live-tailers or for replay.

If your application cannot tolerate missing events (or at least needs to detect when it has missed some), jetstream cannot currently meet your needs.

Improving the situation might not be simple. There was a recent proposal to include relay sequence numbers with jetstream events, but it was rejected. Jetstream events are not 1:1 with relay events—I'm pretty sure that relay sequence gaps, duplicates, and re-orders as seen from jetstream could all be valid under the current implementation, so that's not enough.

‼️ Jetstream can drop and reorder events when reconnecting with a `cursor`

tracking issue

The cutover from event-replay to live-tailing in jetstream is tricky. There seems to be a data race that occasionally makes new events get emitted before the replay has caught up, and the cutover itself may skip over up to one second of events (~1,000).

Until this bug is fixed, it's quite difficult to ensure that you will receive all events as a client. Clients that can't tolerate gaps might need to open multiple connections (live-tailing + replay) and manage their own cutover with a few seconds of overlap to inspect.
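A very rough sketch of the merge step for a client-managed cutover, assuming you have already collected events from a replay connection and buffered events from a separate live-tailing connection to the same instance. `event_key` and `merge_cutover` are hypothetical names, and the key fields assume jetstream's commit event shape (did, commit.rev, commit.collection, commit.rkey, commit.operation):

```python
def event_key(event: dict) -> tuple:
    """Identify an event by its content rather than by arrival order."""
    commit = event.get("commit")
    if commit:
        return (event["did"], commit["rev"], commit["collection"],
                commit["rkey"], commit["operation"])
    # Non-commit events: fall back to time_us (same-instance assumption).
    return (event["did"], event.get("kind"), event["time_us"])


def merge_cutover(replayed: list, live: list) -> tuple:
    """Append live events not already seen in the replay.

    Returns (merged_events, overlap). An overlap of zero means the replay
    never reached the start of the live buffer, so there may be a gap.
    """
    seen = {event_key(e) for e in replayed}
    overlap = sum(1 for e in live if event_key(e) in seen)
    merged = replayed + [e for e in live if event_key(e) not in seen]
    return merged, overlap
```

If `overlap` comes back as zero, keep replaying (or rewind further) rather than trusting the merged stream.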

I created a rough test that can typically detect and show the problem on the production jetstream instances operated by Bluesky. As of this writing, it can still reliably demonstrate the issue.

‼️ Expect abruptly closed connections on overloaded instances

…if you connect with a `cursor`

tracking issue and a likely duplicate.

This is likely related to the dropped and reordered events issue above. Addressing the architecture that allows the data race could offer a way to avoid filling the client output channel that seems to cause this problem.
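On the client side, the practical consequence is a reconnect loop that resumes from the last processed time_us. A sketch under loud assumptions: `connect(cursor)` is a hypothetical callable returning an iterable of events that raises ConnectionError when the socket is closed, and the backoff numbers are arbitrary:

```python
import time


def backoff_delays(base: float = 0.5, cap: float = 30.0):
    """Yield base, 2*base, 4*base, ... seconds, never more than cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def consume(connect, handle, sleep=time.sleep, max_attempts=5):
    """Process events, reconnecting from the last seen time_us on failure."""
    cursor = None  # None means "start live-tailing"
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            for event in connect(cursor):
                handle(event)
                cursor = event["time_us"]
                delays = backoff_delays()  # progress was made: reset backoff
        except ConnectionError:
            pass  # abrupt close: fall through to sleep and reconnect
        sleep(next(delays))
    return cursor
```

Because the cursor is the last processed time_us, the first event after each reconnect is expected to be a repeat, so `handle` needs to tolerate that.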

Conclusions

Make your event processing idempotent if you can

If your system state stays the same whether an event is received once or multiple times, then you might be able to work around some of the consistency problems by conservatively replaying and minding the cutover. This is not always easy to achieve and sometimes subtle.
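For record commits, one possible shape for idempotent processing is an upsert keyed by (did, collection, rkey) that ignores anything not newer than what is stored. This sketch assumes jetstream's commit fields (rev, operation, collection, rkey, record) and that rev strings sort lexicographically in commit order; `apply_commit` and the dict store are hypothetical:

```python
def apply_commit(store: dict, event: dict) -> bool:
    """Apply a commit event; return False if it was stale or a repeat."""
    commit = event["commit"]
    key = (event["did"], commit["collection"], commit["rkey"])
    current = store.get(key)
    if current is not None and commit["rev"] <= current["rev"]:
        return False  # already applied this revision or a newer one
    # Deletes are kept as tombstones (record=None) so that a replayed,
    # older create cannot resurrect the record.
    record = None if commit["operation"] == "delete" else commit["record"]
    store[key] = {"rev": commit["rev"], "record": record}
    return True
```

Applying the same event twice, or replaying an old create after a delete, leaves the store unchanged. The subtle part is everything derived from records (counts, notifications), which needs the same care.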

Consider self-hosting jetstream

The issues around cutover from replay to live-tailing seem likely to be correlated with traffic load on the instance. Self-hosting your own can bring some risks into your own control at least.

Consider using a relay directly, instead of jetstream, probably

Relays offer a sequence number, which plays a similar role to jetstream's time_us cursor: you can keep track of which events you have seen, and resume from where you left off when you reconnect. However, instead of a timestamp, this sequence increments one step for each update, offering more confidence that you really have received every event, and a means to detect when you haven't.
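A sketch of what that detection could look like, assuming the sequence advances by exactly one per event as described above. If the relay you connect to allows gaps in its numbering, treat a reported gap as a hint to investigate rather than proof of loss. `SeqTracker` is a hypothetical name:

```python
from typing import Optional


class SeqTracker:
    """Classifies each relay event by comparing seq to the last one seen."""

    def __init__(self, last_seq: Optional[int] = None):
        self.last_seq = last_seq

    def observe(self, seq: int) -> str:
        """Return "new", "duplicate", or "gap" for this sequence number."""
        if self.last_seq is not None and seq <= self.last_seq:
            return "duplicate"  # already seen, e.g. replayed on reconnect
        gap = self.last_seq is not None and seq != self.last_seq + 1
        self.last_seq = seq
        return "gap" if gap else "new"


tracker = SeqTracker(last_seq=41)
print([tracker.observe(s) for s in (41, 42, 43, 45)])
# ['duplicate', 'new', 'new', 'gap']
```

There is no equivalent check for time_us, since consecutive jetstream events can be any number of microseconds apart.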

Relays also include data to cryptographically verify each update against the user identity that created it, which might be important to you.

As for downsides,

the full atproto relay firehose costs ~100x the network bandwidth.
processing relay events can consume significantly more CPU.

Jetstream's bandwidth requirement is low enough for me to self-host atproto services at home without worrying—it's been around 20GB/day recently. The full atproto relay stream might raise questions from my ISP.
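To put those numbers side by side, here is the back-of-envelope arithmetic, taking 20GB/day as 20 × 10^9 bytes and the ~100x factor at face value (both are rough figures from above, not measurements of the relay):

```python
GB = 10**9
SECONDS_PER_DAY = 86_400


def mbit_per_s(bytes_per_day: float) -> float:
    """Average sustained rate in megabits per second."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


jetstream_per_day = 20 * GB
relay_per_day = jetstream_per_day * 100  # assumed ~100x factor

print(round(mbit_per_s(jetstream_per_day), 2))  # 1.85
print(round(mbit_per_s(relay_per_day), 1))      # 185.2
print(relay_per_day * 30 / 10**12)              # 60.0 (TB per 30 days)
```

These are averages; peak rates will be higher.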

The full firehose might have its own surprises. The one I know of that's relevant to this post: a potential off-by-one in the sequence when connecting with a cursor, which could lead to duplicating a single event on reconnect if the client isn't careful to check the first event's sequence number (or to missing an event if the client auto-increments the cursor to avoid double-counting).

whew

Personally, I would like to use jetstream to build appviews. It's almost there, though I wonder if there is space for another lightweight firehose adapter that offers stronger reliability at the expense of some of Jetstream's amazing ease-of-use.

thanks to @caseyho.com and other members of the community bluesky dev discord for feedback on an earlier draft.