Which compression method is best for an API?

December 14, 2022

Back when I had a day job I did not care about such niceties. The API returns too much data? Let's turn gzip on in NGINX! Given that we have no other options for HTTP so far, nobody cares. But what if they did?

It all started with shitposters. Out of curiosity I began to study the Pleroma API and harvested their "Known Network" timeline for about ten hours. I used websocket streams to avoid polling, and a typical message looked like this:

{
    "account": {
        "acct": "Moon",
        "avatar": "https://static.banky.club/shitposter.club/70a58f38fcbb352e0e3e0e8b01d66209286219132d9009844779c49140e3dbe7.png?name=Lunar-Map-1962.png",
        "avatar_static": "https://static.banky.club/shitposter.club/70a58f38fcbb352e0e3e0e8b01d66209286219132d9009844779c49140e3dbe7.png?name=Lunar-Map-1962.png",
        "bot": false,
        "created_at": "2020-06-19T21:26:07.000Z",
        "display_name": "Moon",
        "emojis": [],
        "fields": [],
        "followers_count": 2588,
        "following_count": 0,
        "fqn": "Moon@shitposter.club",
        "header": "https://shitposter.club/images/banner.png",
        "header_static": "https://shitposter.club/images/banner.png",
        "id": "9wFlObr33AYWFrExns",
        "last_status_at": "2022-12-10T20:40:27",
        "locked": true,
        "note": "I just want to make friends on here. Anybody can interact with me. I am done making joke bios, they caused too much trouble.",
        "pleroma": {
            "accepts_chat_messages": true,
            "also_known_as": [],
            "ap_id": "https://shitposter.club/users/Moon",
            "background_image": "https://shitposter.club/media/6a8746adaaff1db5bc7b9e7fc0e52c214cb55e0878335258d1b1e0c71127e239.jpg?name=1761719.jpg",
            "favicon": null,
            "hide_favorites": true,
            "hide_followers": false,
            "hide_followers_count": true,
            "hide_follows": true,
            "hide_follows_count": true,
            "is_admin": true,
            "is_confirmed": true,
            "is_moderator": false,
            "is_suggested": false,
            "relationship": {},
            "skip_thread_containment": false,
            "tags": []
        },
        "source": {
            "fields": [],
            "note": "I just want to make friends on here. Anybody can interact with me. I am done making joke bios, they caused too much trouble.",
            "pleroma": {
                "actor_type": "Person",
                "discoverable": true
            },
            "sensitive": false
        },
        "statuses_count": 77175,
        "url": "https://shitposter.club/users/Moon",
        "username": "Moon"
    },
    "application": null,
    "bookmarked": false,
    "card": null,
    "content": "<span class=\"h-card\"><a class=\"u-url mention\" data-user=\"AIKLKUQHaDLXYzh9Bw\" href=\"https://cum.salon/users/pernia\" rel=\"ugc\">@<span>pernia</span></a></span> I could have just used url-safe base64 I guess.",
    "created_at": "2022-12-10T20:40:27.000Z",
    "edited_at": null,
    "emojis": [],
    "favourited": false,
    "favourites_count": 0,
    "id": "AQTVoAm53UW6Aj6fUu",
    "in_reply_to_account_id": "AIKLKUQHaDLXYzh9Bw",
    "in_reply_to_id": "AQTTPDzDmCkizBEGLA",
    "language": null,
    "media_attachments": [],
    "mentions": [
        {
            "acct": "pernia@cum.salon",
            "id": "AIKLKUQHaDLXYzh9Bw",
            "url": "https://cum.salon/users/pernia",
            "username": "pernia"
        }
    ],
    "muted": false,
    "pinned": false,
    "pleroma": {
        "content": {
            "text/plain": "@pernia I could have just used url-safe base64 I guess."
        },
        "context": "https://shitposter.club/contexts/464e1407-eb1f-4af9-8afd-ad4e074a892e",
        "conversation_id": 2089203544,
        "direct_conversation_id": null,
        "emoji_reactions": [],
        "expires_at": null,
        "in_reply_to_account_acct": "pernia@cum.salon",
        "local": true,
        "parent_visible": true,
        "pinned_at": null,
        "spoiler_text": {
            "text/plain": ""
        },
        "thread_muted": false
    },
    "poll": null,
    "reblog": null,
    "reblogged": false,
    "reblogs_count": 0,
    "replies_count": 0,
    "sensitive": false,
    "spoiler_text": "",
    "tags": [],
    "text": null,
    "uri": "https://shitposter.club/objects/40073253-e42f-4938-82da-07b7c793abe5",
    "url": "https://shitposter.club/notice/AQTVoAm53UW6Aj6fUu",
    "visibility": "public"
}

I stored the messages in my simple DaoDB, and the data took up 135,222,936 bytes by the time I stopped harvesting.

That's quite a large amount of data. I consumed about 34 kbit/s of their bandwidth, and if they had a 1 Gbit/s network adapter, its theoretical limit would be roughly 29,000 users like me (1 Gbit/s divided by 34 kbit/s). Taking other fediverse traffic into account, the real limit is much lower.
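The bandwidth estimate can be re-derived with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the harvest lasted exactly ten hours, treats the stored size as the number of bytes on the wire, and pretends the whole 1 Gbit/s link is available to identical clients. The names `average_bitrate` and `max_clients` are made up for this illustration.

```python
# Back-of-the-envelope check of the bandwidth figures.
HARVESTED_BYTES = 135_222_936          # stored size reported in the post
HARVEST_HOURS = 10                     # assumption: "about ten hours" taken as exactly ten
LINK_BITS_PER_SECOND = 1_000_000_000   # assumption: a fully usable 1 Gbit/s link


def average_bitrate(total_bytes: int, hours: float) -> float:
    """Average stream rate in bits per second."""
    return total_bytes * 8 / (hours * 3600)


def max_clients(link_bps: float, client_bps: float) -> int:
    """How many identical streams fit into the link, ignoring all overhead."""
    return int(link_bps // client_bps)


if __name__ == "__main__":
    rate = average_bitrate(HARVESTED_BYTES, HARVEST_HOURS)
    print(f"average rate: {rate / 1000:.1f} kbit/s")
    print(f"clients per link: {max_clients(LINK_BITS_PER_SECOND, rate)}")
```

With exactly ten hours the average comes out at about 30 kbit/s, in the same ballpark as the 34 kbit/s quoted above. The ceiling on clients is simply the link capacity divided by the per-client rate.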

They don't compress their API responses. The responses could also be better designed, without duplicating the account info in every message, but that's a bigger problem related to Mastodon. What they could do immediately is turn on compression.
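For plain HTTP responses, turning compression on boils down to honouring the client's Accept-Encoding header. Below is a minimal sketch of that negotiation using only the Python standard library. The function name `encode_response` is hypothetical, q-values in the header are ignored for brevity, and this is not how Pleroma or NGINX actually implement it.

```python
import gzip
import json
from typing import Dict, Tuple


def encode_response(payload: dict, accept_encoding: str) -> Tuple[bytes, Dict[str, str]]:
    """Serialize payload to JSON and gzip it if the client offered gzip.

    Simplification: q-values such as "gzip;q=0" are not evaluated.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}

    # Split "gzip, deflate;q=0.5" into bare encoding names.
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }

    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
        headers["Vary"] = "Accept-Encoding"

    headers["Content-Length"] = str(len(body))
    return body, headers


if __name__ == "__main__":
    message = {"content": "hello " * 200, "visibility": "public"}
    plain, _ = encode_response(message, "identity")
    packed, packed_headers = encode_response(message, "gzip, deflate")
    print(len(plain), len(packed), packed_headers)
```

In practice this job is usually left to the reverse proxy, which is exactly the "turn gzip on in NGINX" approach mentioned at the beginning.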

Another concern is my storage space. If I received compressed messages I would store them as is, but right now, if I continue harvesting, I have to compress them myself. So what would be the best compression method, given that I run my odd jobs on very modest hardware? I took advantage of the flexibility of my DaoDB and added some compression methods to it. Here's what I found:

Method   Compression time (s)   Decompression time (s)   Size compressed (bytes)
copy                    5.242                    1.297                 135222936
gzip                   13.996                    3.650                  47436554
lz4                     5.735                    1.492                  65974530
brotli                366.928                    2.631                  41130246
snappy                  5.775                    1.525                  64851478
lzma                   95.022                    7.386                  47782232

Time is in seconds. Compression means a db-to-db copy, and decompression is simply a DaoDB read. I.e. for copy, the former involves json➝dict and dict➝json conversions, while the latter is json➝dict only.
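Since the copy row already contains the JSON and I/O overhead, subtracting it from the other rows gives a rough idea of what each codec costs on its own. The sketch below does that arithmetic on the measured numbers; the helper name `summarize` is invented for this example, and treating copy as a clean baseline is an assumption.

```python
# (compression seconds, decompression seconds, size in bytes), copied from the table
RESULTS = {
    "copy":   (5.242,   1.297, 135222936),
    "gzip":   (13.996,  3.650, 47436554),
    "lz4":    (5.735,   1.492, 65974530),
    "brotli": (366.928, 2.631, 41130246),
    "snappy": (5.775,   1.525, 64851478),
    "lzma":   (95.022,  7.386, 47782232),
}


def summarize(results: dict, baseline: str = "copy") -> dict:
    """Compression ratio and time overhead of each method relative to the baseline."""
    base_compress, base_decompress, base_size = results[baseline]
    summary = {}
    for method, (compress_s, decompress_s, size) in results.items():
        summary[method] = {
            "ratio": base_size / size,
            "compress_overhead_s": compress_s - base_compress,
            "decompress_overhead_s": decompress_s - base_decompress,
        }
    return summary


if __name__ == "__main__":
    for method, row in summarize(RESULTS).items():
        print(
            f"{method:7s} ratio {row['ratio']:.2f}  "
            f"+{row['compress_overhead_s']:.3f}s compress  "
            f"+{row['decompress_overhead_s']:.3f}s decompress"
        )
```

Read this way, gzip shrinks the data to about 35% of the original for roughly 8.8 extra seconds, while LZ4 gets to about 49% for about half a second.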

All compression methods use the default settings of the corresponding Python libraries. Although I did play with different compression levels, I didn't find any significant advantages.
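Anyone who wants to repeat the experiment with compression levels can start from something like the sketch below. It only covers the two codecs shipped with Python (zlib, which implements the same DEFLATE algorithm as gzip, and lzma), and it runs on a synthetic payload rather than on the real dump, so the numbers it prints say nothing about the results above. The helpers `sample_payload`, `measure` and `compare_levels` are hypothetical names.

```python
import json
import lzma
import time
import zlib


def sample_payload(count: int = 200) -> bytes:
    """Build a synthetic JSON blob; a stand-in for real harvested messages."""
    messages = [
        {
            "id": f"msg-{i}",
            "content": f"hypothetical message number {i}",
            "account": {"acct": "example", "url": "https://example.invalid/users/example"},
            "reblogged": False,
            "tags": [],
        }
        for i in range(count)
    ]
    return json.dumps(messages).encode("utf-8")


def measure(name, compress, data):
    """Return (name, compressed size in bytes, elapsed seconds) for one run."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(compressed), elapsed


def compare_levels(data: bytes) -> list:
    """Compress the same data with a few zlib levels and lzma presets."""
    results = []
    for level in (1, 6, 9):
        results.append(
            measure(f"zlib level {level}", lambda d, lv=level: zlib.compress(d, lv), data)
        )
    for preset in (0, 6, 9):
        results.append(
            measure(f"lzma preset {preset}", lambda d, p=preset: lzma.compress(d, preset=p), data)
        )
    return results


if __name__ == "__main__":
    payload = sample_payload()
    print(f"original: {len(payload)} bytes")
    for name, size, seconds in compare_levels(payload):
        print(f"{name:14s} {size:8d} bytes  {seconds * 1000:.2f} ms")
```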

Here is the code:

import time

from daodb import DaoDB, JsonDaoDB, JsonMixin, GzipMixin, LzmaMixin
from daodb.lz4 import Lz4Mixin
from daodb.brotli import BrotliMixin
from daodb.snappy import SnappyMixin

class GzipDaoDB(JsonMixin, GzipMixin, DaoDB):
    pass

class Lz4DaoDB(JsonMixin, Lz4Mixin, DaoDB):
    pass

class LzmaDaoDB(JsonMixin, LzmaMixin, DaoDB):
    pass

class BrotliDaoDB(JsonMixin, BrotliMixin, DaoDB):
    pass

class SnappyDaoDB(JsonMixin, SnappyMixin, DaoDB):
    pass

def compress(db_class, input_filename, output_filename):
    """Copy every record from the uncompressed source DB into a db_class DB and print the elapsed time."""
    start_time = time.monotonic()
    with JsonDaoDB(input_filename, 'r') as src_db:
        with db_class(output_filename, 'w') as dest_db:
            for record in src_db:
                dest_db.append(record)
    print('%s done in %.3f seconds' % (output_filename, time.monotonic() - start_time))

def decompress(db_class, filename):
    """Read every record back from a db_class DB, discard it, and print the elapsed time."""
    start_time = time.monotonic()
    with db_class(filename, 'r') as db:
        for record in db:
            pass
    print('%s done in %.3f seconds' % (filename, time.monotonic() - start_time))

def main():
    compress(JsonDaoDB,   'known-network.dump', 'known-network.copy')
    compress(GzipDaoDB,   'known-network.dump', 'known-network.gz')
    compress(Lz4DaoDB,    'known-network.dump', 'known-network.lz4')
    compress(BrotliDaoDB, 'known-network.dump', 'known-network.br')
    compress(SnappyDaoDB, 'known-network.dump', 'known-network.snap')
    compress(LzmaDaoDB,   'known-network.dump', 'known-network.lzma')

    decompress(JsonDaoDB,   'known-network.copy')
    decompress(GzipDaoDB,   'known-network.gz')
    decompress(Lz4DaoDB,    'known-network.lz4')
    decompress(BrotliDaoDB, 'known-network.br')
    decompress(SnappyDaoDB, 'known-network.snap')
    decompress(LzmaDaoDB,   'known-network.lzma')

if __name__ == '__main__':
    main()

So the clear outsider is Brotli. Yes, it offers the highest compression ratio, but nobody will waste that much CPU time compressing dynamic API responses.

Gzip is the standard method for HTTP, but I'd prefer the faster LZ4 or Snappy. My gut feeling is that the bigger payload could already be transferred while Gzip is still compressing, although this depends on the hardware configuration. Anyway, if I served an API I'd choose LZ4. But I doubt it will become a standard HTTP compression method in the foreseeable future: the RFC Index doesn't mention LZ4 at all. Yet they pushed Brotli for some reason.
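The gut feeling about Gzip versus LZ4 can be put into rough numbers with a very simple model: total time equals compression time plus compressed size divided by bandwidth. The sketch below uses the measured figures from the table and finds the bandwidth at which both methods take equally long. It is a deliberately crude model: it treats the whole dump as one transfer, ignores decompression and any overlap of compression with sending, and assumes the JSON and I/O overhead baked into both timings is the same. The function names are invented for this example.

```python
# (compression seconds, compressed size in bytes), taken from the table
GZIP = (13.996, 47436554)
LZ4 = (5.735, 65974530)


def total_time(method: tuple, bandwidth_bytes_per_s: float) -> float:
    """Compress-then-send time under the simple model described above."""
    compress_seconds, size_bytes = method
    return compress_seconds + size_bytes / bandwidth_bytes_per_s


def break_even_bandwidth(fast: tuple, slow: tuple) -> float:
    """Bandwidth in bytes/s at which the faster, weaker codec stops losing.

    Solves fast_time + fast_size / B == slow_time + slow_size / B for B.
    """
    fast_time, fast_size = fast
    slow_time, slow_size = slow
    return (fast_size - slow_size) / (slow_time - fast_time)


if __name__ == "__main__":
    threshold = break_even_bandwidth(LZ4, GZIP)
    print(f"break-even: {threshold * 8 / 1e6:.2f} Mbit/s")
    for mbit in (1, 10, 100, 1000):
        bandwidth = mbit * 1e6 / 8
        print(
            f"{mbit:5d} Mbit/s  gzip {total_time(GZIP, bandwidth):8.1f}s  "
            f"lz4 {total_time(LZ4, bandwidth):8.1f}s"
        )
```

Under these assumptions the break-even lands at roughly 18 Mbit/s: on slower links Gzip's smaller output wins, on faster ones LZ4's speed does.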

Of course, for custom things we can use any compression, but how about browsers? Decompress in wasm? Yuck!