
>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If your needs aren't that demanding, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k jobs a day, that's roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.



Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell your banker, or investors. For tech people it’s per second, per minute, maybe per hour; per day is self-aggrandizement.

A million requests a day sounds really impressive, but it’s 12 req/s, which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.


The other thing is that averaging over days says nothing about spikes in the rate - I imagine very few systems see more-or-less constant traffic over the course of an entire day


I worked on a SaaS CMS. A major consumer PC manufacturer was a customer and their "Welcome to your PC" app would pull their news feed from our system. People would turn on their PC and they'd pull a little JSON. Hundreds of thousands of requests per day, no big deal. The manufacturer decided that they wanted the updates to show up for everybody at the same time on the hour. Serving 100k req/s for a minute is a much different problem to solve.


This. Customers are not organized in neat lines waiting their turns around the clock IRL


Pretty common in industrial control. Your sensors take measurements at a constant rate all day, every day. Sometimes at a pretty high rate, e.g. IEC 61850 is at 4 kHz.


Anything other than operations per second raises my suspicions--a minute is a huge duration for most computing; the speaker is just angling to multiply the measure by 60 for whatever reason.


I snorted at the Jenkins suggestion… and you’re right… spooling up an agent, all that nonsense… lol. I imagine telling “the business” that your customer requests are handled timely by a Jenkins job queue


"It's not a lot" has basically been my career up until now, at least the bits where I interacted with a back-end; in the few cases where I witnessed a new back-end being set up, they over-engineered it without fully understanding the actual problem or needs; the engineering was often wishful thinking as well, as in, they wished they had the problems that their proposed solution would fix. Cargo culting, in other words.


You could use Mechanical Turk for 10k tasks per day.


> Even the most unoptimized database should be able to handle this.

Anybody had any success running a queue on top of... sqlite?

With the way the sqlite file locking mechanisms work, are you basically guaranteed really low concurrency? You can have lots of readers but not really a lot of writers, and in order to pop a job off of the queue you need to have a process spinning waiting for work, move its status from "to do" to "in progress" and then "done" or "error", which is sort of "write" heavy?

> An EXCLUSIVE lock is needed in order to write to the database file. Only one EXCLUSIVE lock is allowed on the file and no other locks of any kind are allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency, SQLite works to minimize the amount of time that EXCLUSIVE locks are held.

> You can avoid locks when reading, if you set database journal mode to Write-Ahead Logging (see: http://www.sqlite.org/wal.html).
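A minimal sketch of what that could look like, assuming WAL mode and an invented `jobs` table, with an optimistic UPDATE standing in for the `SELECT FOR UPDATE` SQLite lacks:

```python
import os
import sqlite3
import tempfile

# Hypothetical schema; in real use this would be a fixed, shared path.
db_path = os.path.join(tempfile.mkdtemp(), "queue.db")
conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit
conn.execute("PRAGMA journal_mode=WAL")  # readers no longer block the writer
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id      INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'todo'  -- todo / in_progress / done / error
    )
""")

def enqueue(payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def dequeue():
    # Pick the oldest 'todo' job, then claim it; the status check in the
    # UPDATE means only one worker wins if several race for the same row.
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'todo' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # queue is empty
    claimed = conn.execute(
        "UPDATE jobs SET status = 'in_progress' WHERE id = ? AND status = 'todo'",
        (row[0],),
    )
    if claimed.rowcount == 0:
        return dequeue()  # lost the race to another worker; try again
    return row
```

On SQLite 3.35+ the SELECT-then-UPDATE pair can be collapsed into a single `UPDATE ... RETURNING`, which removes the race entirely.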


I processed ~3m messages a day on Sqlite using a pair of ca. 2006-era Xeons on spinning rust (for redundancy; each could easily handle the load by itself). The queue processor was written in Ruby and ran on the (very slow) 1.8.x series (even then it used about 10% of a single core).

On modern hardware you should be able to trivially handle more than that.


Does this mean you are processing messages on only one machine, since it's Sqlite? Depending on what you are processing, that could take longer than the queue/dequeue.


The queue was managed by a server written in Ruby that spoke Stomp [1], so while for the biggest queue most of the processing did in fact happen on the same machine, that was just because it happened to fit.

[1] https://stomp.github.io/


SQLite is missing some features like `SELECT FOR UPDATE`, but you can work around some issues with a few extra queries. I wrote litequeue[0] with this specific purpose. I haven't been able to use it a lot, so I don't have real-world numbers of how it scales, but the scaling limits depend on how fast you can insert into the database.

[0]: https://github.com/litements/litequeue


I wrote https://github.com/TinyWebServices/tqs a couple of years ago. It is modelled after SQS and runs in a single threaded Tornado server.

I don’t know how many messages per second it does but for a podcast crawling side project I have processed hundreds of millions of messages through this little Python wrapper around SQLite. Zero problems. It just keeps running happily.



The huey project supports sqlite as one of its backend options (https://huey.readthedocs.io/en/latest/api.html#SqliteHuey). It works, is all I can say.


Why use something as complicated as SQLite? You can use a plain old set of directories and files as a queue, with sane, portable, exclusive, atomic locks, on any filesystem, with concurrent readers/writers. That's how we ran mail servers that handled millions (now billions) of messages a day, 20+ years ago.
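A toy version of the idea, assuming a maildir-style layout with invented directory names; the atomic rename() serves both as the enqueue "commit" and as the claim:

```python
import json
import os
import tempfile
import time
import uuid

# Hypothetical layout; in real use this would be a fixed, shared path.
QUEUE = tempfile.mkdtemp(prefix="queue-")
for d in ("new", "claimed"):
    os.makedirs(os.path.join(QUEUE, d), exist_ok=True)

def enqueue(payload):
    # Write to a temp name, then rename into place: readers never see a
    # half-written file, because rename() is atomic on POSIX filesystems.
    tmp = os.path.join(QUEUE, f".tmp-{uuid.uuid4().hex}")
    with open(tmp, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # the durability step discussed downthread
    final = os.path.join(QUEUE, "new", f"{time.time_ns()}-{uuid.uuid4().hex}")
    os.rename(tmp, final)

def dequeue():
    new_dir = os.path.join(QUEUE, "new")
    for name in sorted(os.listdir(new_dir)):  # names sort roughly FIFO
        src = os.path.join(new_dir, name)
        dst = os.path.join(QUEUE, "claimed", name)
        try:
            os.rename(src, dst)  # atomic claim; exactly one worker wins
        except FileNotFoundError:
            continue  # another worker claimed it first
        with open(dst) as f:
            return json.load(f)
    return None  # queue is empty
```

Recovery after a crash is then just sweeping stale files from `claimed/` back into `new/`.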


But your durability goes down the drain on power loss or similar. Files are _really_ complicated to get durable.


Sorry, I don't follow? SQLite also has durability problems on power loss. You can use exactly the same method that SQLite uses to flush data to disk. There is no need for a WAL because the queue operations are atomic, using rename() on POSIX filesystems (including NFS). There's no rollback of course, but you can recover items that are stalled in the queue. With a journaled filesystem, fdatasync() after each atomic operation, and battery-backed RAID, I can't imagine it getting more durable [in comparison to SQLite].

Interestingly, journaled rollback mode in SQLite is probably worse durability than the file/directory queue method. (https://www.sqlite.org/lockingv3.html#how_to_corrupt) (https://www.sqlite.org/atomiccommit.html#sect_9_0) It depends on the same system-level guarantees about the filesystem and hardware, except it also depends on POSIX advisory locks, which are fallible and implementation-specific; while the file/directory queue solely depends on rename() being atomic, which it always is.


> Sorry, I don't follow? SQLite also has durability problems on power loss.

No, it doesn't. Once COMMIT returns, your data is durable.

> You can use exactly the same method that SQLite uses to flush data to disk.

Good luck with that! It took them about ten years to get it right, and the SQLite people are world class. (You can't just copy their code wholesale, since they have one file to worry about and your scheme has tons.)

> With a journaled filesystem, fdatasync() after each atomic operation, and battery-backed RAID, I can't imagine it getting more durable [in comparison to SQLite].

The problem is lack of imagination. :-) See https://danluu.com/deconstruct-files/ (and the linked papers) for an overview of all the things that can go wrong. In particular, if you only ever fdatasync(), your files are not even guaranteed to show up after a power loss.


Solution: use a qmail install as the queue manager, and a custom mail server it relays to as a job handler.

There's already a spec for transferring SOAP over SMTP so the request/response part is already built for you.

(couldn't resist, sorry)


Been there, done that. In production.

That is, without the custom mail server you describe. You can feed every incoming message to a custom command with ".qmail", or forward with ".forward", so we used that for "push" delivery of messages and plain old POP3 for pull. E-mail really does have everything you need to build complex routing topologies.

We ran a mail provider, and so had a heavily customised Qmail setup, and when we needed a queue we figured we might as well use that. Meant we could trivially do things like debugging by cc:'ing messages to a mailbox someone connected a mail client to, for example.


My ISP days were not dissimilar, and had we a TARDIS it might be fun to go back to when both setups were still there and compare notes.

However, I'd hope the SOAP part made it clear I was exaggerating a little for effect :)

... also because the idea of the custom server meaning you could have the main system's qmail handle spaced retries based on SMTP status codes amused me.


I mean SOAP fit right in with the zeitgeist of that era... I seem to remember we were sending XML (shudder), as we had quite a lot of other stuff using XML, but nothing as verbose as SOAP.

Our main CGI (yes...) written in C++ (I can hear you squirm from here) was also loosely inspired by CORBA (stop that screaming) in terms of "sort-of" exposing objects via router that automatically mapped URLs to objects, and which used XML for persistence and an XML based rendering pipeline (not that different from React components, actually, except all C++ and server side).

Hit all the 90's buzz words.


Yeah, I had quite a bit of XML flying around, plus the inevitable "all our vendors want to communicate via CSV over FTP", and some customer facing perl CGIs plus a daemon or two I'd written that usually spoke XML over TCP sockets.

Plus lots of NFS for the shared filestore that the qmail SMTP nodes and the courier/qmail-pop3d mail receipt nodes mounted.

Plus ... yeah, you can see why I thought we'd not find each others' setups -too- surprising.

So, no, not going to squirm, because I mean, yes, I know, but it all (mostly) worked and the customers weren't unusually unhappier with us than they are with any provider ;)


Yeah, I'm super confused with this. Getting a few thousand per second seems relatively trivial, on an Arduino. Maybe there's something I'm missing here, or is this the abstractions that software lives at these days?


This limits simultaneous writes to the maximum number of open file handles supported by the OS. I don’t know what that is, but I don’t see how it can compare to multiplexed TCP/IP sockets.

When you’re writing billions of messages per day, I don’t see how a file system scales.


Close the file handle if you aren’t actively processing the file.


That still allows for perhaps 3000-4000 simultaneous writes, depending on how many file handles are in use by other processes.


On Linux, /proc/sys/fs/file-max has the maximum number of simultaneous open filehandles supported by the kernel. On my laptop this is about 1.6 million


And mine is about twice that.

But also keep in mind every executable has at minimum its own executable as an open file. Picking a random python process I currently have running, lsof reports it has 41 open *.so files.


Yes, but it's also highly unlikely that if you're trying to push transactions per second into that kind of range that you'd be doing it with individual processes per transaction. You'd also be likely to be hitting IO limits long before the number of file descriptors is becoming the issue.


Yep, I’ve used this approach for file transfers/messaging between two systems. Primitive, but get the lock-process-unlock-move sequence working and it works for that kind of use case.


The p3rl.org/Minion job queue has a sqlite backend and I'm aware of it handling what seemed to me a fairly acceptable amount of throughput and the source to https://metacpan.org/pod/Minion::Backend::SQLite looks pretty comfortable to me.

I've not run it myself in production, but I would definitely have heard if it went wrong; I think openSUSE's openQA thing does build worker boxes that way (though of course they'll have fairly slow jobs, so it may just be that the write throughput doesn't matter).

This being HN, I'd like to point out you can steal the SQL from the source for your own uses if you want to, this just happens to be the example full implementation I'm familiar with.


Oban [0] is a job queuing system in the Elixir world that supports both postgres and sqlite.

[0] https://getoban.pro/


The problem with a queue on SQLite is that every successful read implies a write. SQLite is fast enough that it may not matter, though.


> The problem with a queue on SQLite is that every successful read implies a write.

SQLite also supports in-memory databases with shared cache.

https://www.sqlite.org/inmemorydb.html

If a message queue does not require any form of persistence, writes don't sound like an issue.


But this is purely in-memory with no disk persistence. That’s quite risky


Not to be coy, but it only is if it is. For this application I'd agree, but there are plenty of apps that want queues for in-memory use, and if the thing that holds the memory for the queue dies, chances are the whole ecosystem has.

Sometimes this is fine.


We used sqlite for a queue at a prior startup - enabling the WAL was critical to maintain disk throughput.


I once saw WAL turned off to make a sqlite queue perform better.

But that was ... gremlins. More than one person tried to figure out wtf was going on, and eventually it was better business-wise to declare 'gremlins', and everybody involved in the incident has been annoyed about not figuring it out since.


I just did a quick test, I am seeing over 1k commits per second when being driven from Python 3.11 on an old macbook pro.


A simple implementation of a queue in SQL will need to acquire an exclusive lock on the table anyway. Although it's not necessary to use locking at the level of SQL itself: https://news.ycombinator.com/item?id=27482402


The article mentions row-level locking, not full table locking. Big difference in write performance.


For a client-server RDBMS with writes over the network, sure. For an SQLite database with writes to the local filesystem and an order of magnitude better write performance, not so much.


And, on the other hand, people shouldn't kid themselves about the ability of Postgres to handle millions of jobs per day as a queue.


We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.

Scaled much, much further than I would’ve guessed at the time when I called it a short-term solution :) — now I have much more confidence in Postgres ;)
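A sketch of that pattern, with invented table/column names; SKIP LOCKED makes concurrent workers pass over rows another transaction has already locked, so each job is claimed exactly once:

```sql
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```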


> We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.

That's very refreshing to hear. In a previous role I was in a similar situation to yours, but I pushed for RabbitMQ instead of postgres due to scaling concerns, with hypothetical ceilings smaller than the ones you faced. My team had to make a call without having hard numbers to support any decision and no time to put together a proof of concept. The design pressures were the simplicity of postgres vs. paying in complexity for the assurance of a purpose-built message broker. In the end I pushed for the most conservative approach and we went with RabbitMQ, because I didn't want to be the one having to explain why we had problems getting an RDBMS to act as a message broker when we could get a real message broker for free with a docker pull.

I was always left wondering if that was the right call, and apparently it wasn't, because RabbitMQ also put up a fight.

If there were articles out there showcasing case studies of real world applications of implementing message brokers over RDBMS then people like me would have an easier time pushing for saner choices.


> RabbitMQ also put up a fight.

I'm interested in hearing more about this (making a similar decision right now!). What pains did RabbitMQ give you?


> showcasing case studies of real world applications of implementing message brokers over RDBMS

You mean "industrial scale RDBMS" that you can license for thousands of dollars? No, you can't really implement message brokers on those.

You will never see those showcase articles. Nobody paying wants them.


No, industrial scale RDBMSes like PostgreSQL, that you can license for free. Obviously?


Those don't have money to fund studies about industry best practices. So you don't get many.

Almost everything you see on how to use a DBMS is an amateur blog or one of those studies. One of those is usually dismissed on any organization with more than one layer of management.


> Those don't have money to fund studies about industry best practices. So you don't get many.

Your comment reads like a strawman. I didn't need "studies". It would have been good enough if there was a guy with a blog saying "I used postgres as a message broker like this and I got these numbers", and they had a gitlab project page providing the public with the setup and benchmark code.


Just out of curiosity (as someone who hasn't done a lot of this kind of operational stuff) how does this approach to queueing with Postgres degrade as scale increases? Is it just that your job throughput starts to hit a ceiling?


Throughput is less of an issue than queue size—Postgres can handle a truly incredible amount of throughput as long as the jobs table is small enough that it can safely remain in memory for every operation. We can handle 800k jobs/hr with postgres, but if you have more than 5k or 10k jobs in the table at any given time, you're in dangerous territory. It's a different way of thinking about queue design than some other systems, but it's definitely worth it if you're interested in the benefits Postgres can bring (atomicity, reliability, etc.)


With Postgres, you also need to worry a lot about tombstoning and your ability to keep up with the vacuums necessary to deal with highly mutable data. This can depend a lot on what else is going on with the database and whether you have more than one index on the table.


One strategy for mitigating vacuum costs would be to adopt an append-only strategy and partition the table. Then you can just drop partitions and avoid the vacuum costs.

Really depends on the needs but this can unlock some very impressive and sustainable throughputs.
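Roughly, with invented names; dropping a partition is a near-instant metadata operation, so the dead rows never need vacuuming:

```sql
CREATE TABLE jobs (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE jobs_2024_01_01 PARTITION OF jobs
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- Once everything in the day's partition has been processed:
DROP TABLE jobs_2024_01_01;
```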


This! Most haven't tried. It goes incredibly far.


Because all popular articles are about multi million tps at bigtech scale, and everybody thinks they're big tech somehow.


That's the original problem, but then there are the secondary effects. Some of the people who made decision on that basis write blog posts about what they did, and then those blog posts end up on StackOverflow etc, and eventually it just becomes "this is how we do it by default" orthodoxy without much conscious reasoning involved - it's just a safe bet to do what works for everybody else even if it's not optimal.


My hobby project does ~1.5M jobs per day enqueued into Postgres, no sweat. I use https://github.com/bensheldon/good_job which uses PG's LISTEN/NOTIFY to lower worker poll latency.


> which uses PG's LISTEN/NOTIFY to lower worker poll latency

Can you elaborate on how do you do this?


I'm the GoodJob author. Here's the class that is responsible for implementing Postgres's LISTEN/NOTIFY functionality in GoodJob:

https://github.com/bensheldon/good_job/blob/10e9d9b714a668dc...

That's heavily inspired by Rail's Action Cable (websockets) Adapter for Postgres, which is a bit simpler and easier to understand:

https://github.com/rails/rails/blob/be287ac0d5000e667510faba...

Briefly, it spins up a background thread with a dedicated database connection and makes a blocking Postgres LISTEN query until results are returned, and then it forwards the result to other subscribing objects.


excited claps


I can't speak for how they do it, but when your worker polls the table and finds no rows, you will sleep. While sleeping, you can also LISTEN on a channel (and if you get a message, you abort your sleep).

Then, whenever you write a new job to the queue, you also do a NOTIFY on the same channel.

This lets you keep latency low while still polling relatively infrequently.

NOTIFY is actually transactional which makes this approach even better (the LISTENer won't be notified until the NOTIFY transaction commits)
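As a rough sketch (channel and table names invented): the worker listens while idle, and the producer's NOTIFY is only delivered once its transaction commits:

```sql
-- Worker, while idle:
LISTEN job_ready;
-- ...block until the client library reports a notification,
-- then resume polling the jobs table.

-- Producer, when enqueuing:
BEGIN;
INSERT INTO jobs (payload) VALUES ('...');
NOTIFY job_ready;  -- delivered to listeners only after COMMIT
COMMIT;
```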


These are somewhat obscure PostgreSQL SQL commands:

https://www.postgresql.org/docs/15/sql-listen.html

https://www.postgresql.org/docs/15/sql-notify.html

I think they have been around for ages, but handling the LISTEN responses may need special client library support.


I'm not kidding myself, postgresql easily handles 10s of thousands of queries per second.

No problem with millions of enqueue+dequeue per day.

A table for a queue is also going to be so tiny that postgresql might even outdo my own expectations.


A few million a day is a few dozen per second; we currently have a service running this order of magnitude of jobs with a SELECT/SKIP LOCKED pattern and no issue at all on a medium AWS box.


Have done millions on nearly a decade old hardware. A million is <12 a second, and that's trivial.


In other SQL databases an 'in memory' table could be a candidate. It looks like Postgres only has session specific temporary tables, but does have an UNLOGGED https://www.postgresql.org/docs/13/sql-createtable.html table type which has desirable properties for temporary data that must be shared.
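For example (hypothetical table): an UNLOGGED table skips the WAL entirely, trading crash durability (the table is truncated after a crash or unclean shutdown) for faster writes:

```sql
CREATE UNLOGGED TABLE job_queue (
    id      bigserial PRIMARY KEY,
    payload text NOT NULL
);
```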


PG really needs GLOBAL TEMP tables...


The properties of the UNLOGGED table suggest it fills this niche already.


It's not the same. You can emulate GLOBAL TEMP tables with UNLOGGED, but native GLOBAL TEMP would be better.


There's an extension for this: https://github.com/darold/pgtt


a well-tuned bare metal box in a master-slave config should easily handle (being conservative here) 10k/s... I assume a purpose-built box could handle 100k/s without breaking a sweat


I've used Postgres to handle 60M jobs per second (using FOR UPDATE SKIP LOCKED) in production, for two years, on a single dual core 8GB GCE VM. Postgres goes far.


Either you meant "per day" or you've got a really well tuned database.


Ah, my bad, I mistyped. It was 60k. I should have known when I typed out my original comment, since the design goal was only around 200M per day.


For many apps, using a mature library like Sidekiq or Celery (with a stock Redis) within a monolith will take you a very long way.


> Assuming 10k/jobs a day, that's roughly 7 jobs per minute.

I've seen systems at that scale where that's roughly true. But I've also seen systems where those jobs come in a daily batch, at a point in time of day, and then nothing until the next day's batch.


Yep, even websites can be highly non-periodic. I used to run the web services for the 50 or so magazines that Risk-Waters had at the time, and around lunch time was a massive peak of traffic, easily 100x our slower times.

Postgres could still handle that though, IMHO :)


Honestly this is how projects should start. Simple, easy, maintainable.


This is exactly the thesis behind neoq (https://github.com/acaloiaro/neoq). Match your queue infrastructure with your application infrastructure.


That's assuming those 10k jobs are equally distributed, which is almost certainly not the case.


Postgres can handle 10k batch inserts in seconds on even commodity hardware. Even if not done in batch, you should still get >100 inserts/second with a few indexes thrown in there.



