Which compression method is best for an API?
Back in my day-job years I didn't care about such niceties. The API returns too much data? Let's just turn gzip on in NGINX! Given that HTTP gives us hardly any other options anyway, nobody cares. But what if we actually had a choice?
It all started with shitposters. Out of curiosity I began studying the Pleroma API and harvested their "Known Network" timeline for about ten hours. I used websocket streaming to avoid polling, and a typical message looked like this:
{
"account": {
"acct": "Moon",
"avatar": "https://static.banky.club/shitposter.club/70a58f38fcbb352e0e3e0e8b01d66209286219132d9009844779c49140e3dbe7.png?name=Lunar-Map-1962.png",
"avatar_static": "https://static.banky.club/shitposter.club/70a58f38fcbb352e0e3e0e8b01d66209286219132d9009844779c49140e3dbe7.png?name=Lunar-Map-1962.png",
"bot": false,
"created_at": "2020-06-19T21:26:07.000Z",
"display_name": "Moon",
"emojis": [],
"fields": [],
"followers_count": 2588,
"following_count": 0,
"fqn": "Moon@shitposter.club",
"header": "https://shitposter.club/images/banner.png",
"header_static": "https://shitposter.club/images/banner.png",
"id": "9wFlObr33AYWFrExns",
"last_status_at": "2022-12-10T20:40:27",
"locked": true,
"note": "I just want to make friends on here. Anybody can interact with me. I am done making joke bios, they caused too much trouble.",
"pleroma": {
"accepts_chat_messages": true,
"also_known_as": [],
"ap_id": "https://shitposter.club/users/Moon",
"background_image": "https://shitposter.club/media/6a8746adaaff1db5bc7b9e7fc0e52c214cb55e0878335258d1b1e0c71127e239.jpg?name=1761719.jpg",
"favicon": null,
"hide_favorites": true,
"hide_followers": false,
"hide_followers_count": true,
"hide_follows": true,
"hide_follows_count": true,
"is_admin": true,
"is_confirmed": true,
"is_moderator": false,
"is_suggested": false,
"relationship": {},
"skip_thread_containment": false,
"tags": []
},
"source": {
"fields": [],
"note": "I just want to make friends on here. Anybody can interact with me. I am done making joke bios, they caused too much trouble.",
"pleroma": {
"actor_type": "Person",
"discoverable": true
},
"sensitive": false
},
"statuses_count": 77175,
"url": "https://shitposter.club/users/Moon",
"username": "Moon"
},
"application": null,
"bookmarked": false,
"card": null,
"content": "<span class="h-card"><a class="u-url mention" data-user="AIKLKUQHaDLXYzh9Bw" href="https://cum.salon/users/pernia" rel="ugc">@<span>pernia</span></a></span> I could have just used url-safe base64 I guess.",
"created_at": "2022-12-10T20:40:27.000Z",
"edited_at": null,
"emojis": [],
"favourited": false,
"favourites_count": 0,
"id": "AQTVoAm53UW6Aj6fUu",
"in_reply_to_account_id": "AIKLKUQHaDLXYzh9Bw",
"in_reply_to_id": "AQTTPDzDmCkizBEGLA",
"language": null,
"media_attachments": [],
"mentions": [
{
"acct": "pernia@cum.salon",
"id": "AIKLKUQHaDLXYzh9Bw",
"url": "https://cum.salon/users/pernia",
"username": "pernia"
}
],
"muted": false,
"pinned": false,
"pleroma": {
"content": {
"text/plain": "@pernia I could have just used url-safe base64 I guess."
},
"context": "https://shitposter.club/contexts/464e1407-eb1f-4af9-8afd-ad4e074a892e",
"conversation_id": 2089203544,
"direct_conversation_id": null,
"emoji_reactions": [],
"expires_at": null,
"in_reply_to_account_acct": "pernia@cum.salon",
"local": true,
"parent_visible": true,
"pinned_at": null,
"spoiler_text": {
"text/plain": ""
},
"thread_muted": false
},
"poll": null,
"reblog": null,
"reblogged": false,
"reblogs_count": 0,
"replies_count": 0,
"sensitive": false,
"spoiler_text": "",
"tags": [],
"text": null,
"uri": "https://shitposter.club/objects/40073253-e42f-4938-82da-07b7c793abe5",
"url": "https://shitposter.club/notice/AQTVoAm53UW6Aj6fUu",
"visibility": "public"
}
I stored the messages in my simple DaoDB, and the dump had reached 135,222,936 bytes by the time I stopped harvesting.
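I won't reproduce my actual harvester, but a minimal sketch of the idea looks like the following. The streaming URL (Pleroma serves the Mastodon-compatible streaming API) and DaoDB's append mode are assumptions on my part, and websockets is a third-party package:

import asyncio
import json

import websockets  # pip install websockets
from daodb import JsonDaoDB

# Assumed Mastodon-compatible streaming endpoint; "public" is the federated timeline.
STREAM_URL = 'wss://shitposter.club/api/v1/streaming/?stream=public'

async def harvest(dump_filename):
    # 'a' (append) mode is assumed here; use whatever mode your DaoDB version
    # provides for adding records to an existing file.
    with JsonDaoDB(dump_filename, 'a') as db:
        async with websockets.connect(STREAM_URL) as ws:
            async for frame in ws:
                event = json.loads(frame)
                # Statuses arrive as {"event": "update", "payload": "<json string>"}.
                if event.get('event') == 'update':
                    db.append(json.loads(event['payload']))

if __name__ == '__main__':
    asyncio.run(harvest('known-network.dump'))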
That's quite a large amount of data. I was consuming about 34 kbit/s of their bandwidth, so a 1 Gbps network adapter could feed roughly 30,000 clients like me at most. Taking other fediverse traffic into account, the real limit is much lower.
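A back-of-the-envelope check of that ceiling:

# Rough upper bound on how many websocket consumers like me a 1 Gbps adapter could feed.
stream_kbit_s = 34              # measured rate of my single stream
adapter_kbit_s = 1_000_000      # 1 Gbps, ignoring protocol overhead and everything else
print(f'{adapter_kbit_s / stream_kbit_s:,.0f} streams')   # about 29,000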
They don't compress their API responses. They could also design the responses better, without duplicating the full account object in every message, but that's a bigger problem inherited from the Mastodon API. What they could do right away is turn compression on.
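This is easy to verify: ask for compressed output and look at the Content-Encoding response header. A sketch, assuming the requests package and the standard Mastodon-style timeline endpoint:

import requests

r = requests.get(
    'https://shitposter.club/api/v1/timelines/public',
    params={'limit': 1},
    headers={'Accept-Encoding': 'gzip'},
)
# If the server compressed the body, this header says how;
# if it's missing, the JSON travelled uncompressed.
print(r.headers.get('Content-Encoding'), len(r.content))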
Another concern is my storage space. If I received the messages compressed I would store them as is, but as things stand, if I keep harvesting I have to compress them myself. So which compression method is the best, given that I run my odd jobs on very modest hardware? I took advantage of DaoDB's flexibility and added a few compression methods to it. Here's what I found:
Method | Compression time, s | Decompression time, s | Compressed size, bytes |
---|---|---|---|
copy | 5.242 | 1.297 | 135,222,936 |
gzip | 13.996 | 3.650 | 47,436,554 |
lz4 | 5.735 | 1.492 | 65,974,530 |
brotli | 366.928 | 2.631 | 41,130,246 |
snappy | 5.775 | 1.525 | 64,851,478 |
lzma | 95.022 | 7.386 | 47,782,232 |
Compression here means a db-to-db copy, and decompression is simply a DaoDB read. That is, for copy the former involves json➝dict and dict➝json conversions, while the latter is json➝dict only.
All compression methods use the default settings of their respective Python libraries. I did play with different compression levels but found no significant advantage in doing so.
Here is the code:
import time

from daodb import DaoDB, JsonDaoDB, JsonMixin, GzipMixin, LzmaMixin
from daodb.lz4 import Lz4Mixin
from daodb.brotli import BrotliMixin
from daodb.snappy import SnappyMixin


class GzipDaoDB(JsonMixin, GzipMixin, DaoDB):
    pass

class Lz4DaoDB(JsonMixin, Lz4Mixin, DaoDB):
    pass

class LzmaDaoDB(JsonMixin, LzmaMixin, DaoDB):
    pass

class BrotliDaoDB(JsonMixin, BrotliMixin, DaoDB):
    pass

class SnappyDaoDB(JsonMixin, SnappyMixin, DaoDB):
    pass


def compress(db_class, input_filename, output_filename):
    # db-to-db copy: read records from the plain JSON dump and append them
    # to a database that compresses each record on write
    start_time = time.monotonic()
    with JsonDaoDB(input_filename, 'r') as src_db:
        with db_class(output_filename, 'w') as dest_db:
            for record in src_db:
                dest_db.append(record)
    print('%s done in %.3f seconds' % (output_filename, time.monotonic() - start_time))


def decompress(db_class, filename):
    # plain read: iterate over all records, decompressing and parsing each one
    start_time = time.monotonic()
    with db_class(filename, 'r') as db:
        for record in db:
            pass
    print('%s done in %.3f seconds' % (filename, time.monotonic() - start_time))


def main():
    compress(JsonDaoDB, 'known-network.dump', 'known-network.copy')
    compress(GzipDaoDB, 'known-network.dump', 'known-network.gz')
    compress(Lz4DaoDB, 'known-network.dump', 'known-network.lz4')
    compress(BrotliDaoDB, 'known-network.dump', 'known-network.br')
    compress(SnappyDaoDB, 'known-network.dump', 'known-network.snap')
    compress(LzmaDaoDB, 'known-network.dump', 'known-network.lzma')

    decompress(JsonDaoDB, 'known-network.copy')
    decompress(GzipDaoDB, 'known-network.gz')
    decompress(Lz4DaoDB, 'known-network.lz4')
    decompress(BrotliDaoDB, 'known-network.br')
    decompress(SnappyDaoDB, 'known-network.snap')
    decompress(LzmaDaoDB, 'known-network.lzma')


if __name__ == '__main__':
    main()
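As an aside, a compression mixin is conceptually just a thin layer around the record bytes. The class and hook names below are made up for illustration (they are not DaoDB's actual API); the point is only that plugging in a new codec is cheap:

import gzip
import json

class PlainLayer:
    # Stand-in for a storage class with hypothetical pack/unpack hooks.
    def pack(self, record: dict) -> bytes:
        return json.dumps(record).encode('utf-8')

    def unpack(self, data: bytes) -> dict:
        return json.loads(data)

class GzipLayer(PlainLayer):
    # A "compression mixin" in spirit: compress after packing, decompress before unpacking.
    def pack(self, record: dict) -> bytes:
        return gzip.compress(super().pack(record))

    def unpack(self, data: bytes) -> dict:
        return super().unpack(gzip.decompress(data))

# Round-trip check.
layer = GzipLayer()
assert layer.unpack(layer.pack({'hello': 'fediverse'})) == {'hello': 'fediverse'}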
So, the clear loser is Brotli. Yes, it offers the highest compression ratio, but nobody is going to burn that much CPU time compressing dynamic API responses.
Gzip is the standard method for HTTP, but I'd prefer the faster LZ4 or Snappy. My gut feeling is that a bigger payload could already be fully transferred while gzip is still crunching, although that depends on the hardware. Anyway, if I were serving an API I'd choose LZ4. I doubt it will become a standard HTTP content coding in the foreseeable future, though: the RFC Index has no mention of LZ4 at all, yet Brotli somehow got pushed through.
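Nothing stops a custom client and server from agreeing on LZ4 between themselves, though. A minimal sketch, assuming the python-lz4 package on both ends; the 'lz4' Content-Encoding token here is invented, not a registered coding:

import json
import lz4.frame

def make_response(payload: dict) -> tuple[dict, bytes]:
    # Server side: serialize, compress, and label with a made-up encoding.
    body = lz4.frame.compress(json.dumps(payload).encode('utf-8'))
    headers = {
        'Content-Type': 'application/json',
        'Content-Encoding': 'lz4',   # non-standard: only my own clients understand it
    }
    return headers, body

def read_response(headers: dict, body: bytes) -> dict:
    # Client side: undo the custom encoding before parsing.
    if headers.get('Content-Encoding') == 'lz4':
        body = lz4.frame.decompress(body)
    return json.loads(body)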
Of course, for custom clients we can use whatever compression we like, but what about browsers? Decompress in WASM? Yuck!