The biggest problem is the video codecs, which ultimately boils down to their use of interframe compression. This technique requires that a certain # of video frames be received and buffered before a final image can be produced, which imposes a baseline amount of latency that can never be engineered away. It is a hard trade-off in information theory.
Something to consider is that there are alternatives to interframe compression. Intraframe compression (e.g. JPEG) can bring your encoding latency per frame down to 0~10ms, at the cost of a dramatic increase in bandwidth. Another benefit is the ability to draw any frame the instant you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you often need some number of prior frames before you can reconstitute a complete picture.
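To make that trade-off concrete, here is a minimal sketch of per-frame intraframe encoding cost, written in Python with OpenCV as a stand-in rather than the C# prototype described below; the resolution, quality setting, and synthetic frame are just illustrative:

    # Minimal illustration of per-frame intraframe encoding cost (Python/OpenCV
    # stand-in, not the C# prototype). 1080p, quality=80, and the synthetic flat
    # frame are arbitrary; real sizes and timings depend on content.
    import time
    import numpy as np
    import cv2  # cv2.imencode wraps a libjpeg-family encoder

    frame = np.full((1080, 1920, 3), 128, dtype=np.uint8)

    start = time.perf_counter()
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Every encoded buffer is a complete, independently decodable picture.
    print(f"{len(jpeg)} bytes in {elapsed_ms:.1f} ms")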
For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using LibJpegTurbo, and I'm able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @ 1080p. Testing this approach at 60fps redraw with event feedback shows that, under ideal localhost conditions, the roundtrip latency is nearly indistinguishable from a native desktop application.
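Conceptually, the server loop is just "encode a complete picture, send it, repeat". A hedged Python sketch of that idea follows (my actual prototype is C#/AspNetCore; the websockets library, the port, and render_frame() below are stand-ins):

    # Conceptual sketch of the push loop (the real prototype is C#/AspNetCore;
    # the Python 'websockets' library and render_frame() are stand-ins).
    import asyncio
    import cv2
    import numpy as np
    import websockets

    def render_frame() -> np.ndarray:
        # Placeholder for whatever draws the framebuffer.
        return np.zeros((1080, 1920, 3), dtype=np.uint8)

    async def stream(ws, path=None):
        while True:
            _, jpeg = cv2.imencode(".jpg", render_frame())
            await ws.send(jpeg.tobytes())   # one complete picture per message
            await asyncio.sleep(1 / 60)     # ~60fps pacing

    async def main():
        async with websockets.serve(stream, "localhost", 8765):
            await asyncio.Future()          # run forever

    asyncio.run(main())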
The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to sacrifice bandwidth efficiency. My three-weekend project arguably already runs much better than Google Stadia in both latency and quality, but the market for streaming game & video conferencing services that require 50~100 Mbps of constant throughput (depending on resolution & refresh rate) is probably very limited for now. That said, it is also not entirely non-existent - think corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming gaming dream alive, even if it can't be fully realized until 10 Gbps+ LAN/internet is the default everywhere.
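For a rough sense of where the 50~100 Mbps figure comes from, a back-of-envelope calculation, assuming ~100 KB per compressed 1080p JPEG frame (heavily dependent on quality and content):

    # Back-of-envelope throughput for JPEG-per-frame streaming. The ~100 KB per
    # 1080p frame figure is an assumption; real sizes vary with quality/content.
    frame_kb = 100           # assumed compressed size of one 1080p JPEG frame
    fps = 60

    mbps = frame_kb * 8 * fps / 1000
    print(f"~{mbps:.0f} Mbps at 1080p{fps}")   # ~48 Mbps; 4K is roughly 4x that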
Interframes are not a problem, as long as they only reference previous frames, not future ones.
I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of 1 frame (16ms) of delay for the screen capture on the sender, plus 2-3 frames of latency to get through the OS stack to the screen at the receiving end. Encoding and decoding were about 5ms. Plus of course the network latency, but I only tested this on local wifi, so it didn't add much.
It's funny you mention MPEG1. That's where my journey with all of this began. For MPEG1 testing I was just piping my raw bitmap data to FFMPEG and piping the result to the client browser.
I was never satisfied with the lower latency bound for that approach and felt like I had to keep pushing into latency territory that was lower than my frame time.
That said, MPEG1 was probably the simplest way to get nearly-ideal latency conditions for an interframe approach.
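The pipe looked conceptually like this (a rough Python sketch rather than my actual code; the 1080p60 RGB input format and the bitrate are placeholders):

    # Rough sketch of the "pipe raw frames into FFMPEG, pipe MPEG1 out" idea.
    # Not the actual code; 1080p60 RGB input and the 4 Mbps bitrate are assumptions.
    import subprocess

    ffmpeg = subprocess.Popen(
        [
            "ffmpeg",
            "-f", "rawvideo", "-pix_fmt", "rgb24",
            "-s", "1920x1080", "-r", "60", "-i", "-",   # raw frames on stdin
            "-f", "mpegts", "-codec:v", "mpeg1video",
            "-b:v", "4000k", "-",                       # MPEG1-in-MPEG-TS on stdout
        ],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    # Write each rendered frame to ffmpeg.stdin, read the encoded stream from
    # ffmpeg.stdout, and relay those bytes to the browser over a websocket.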
Wouldn't you then hit issues where a single dropped packet can cause noticeable problems? In an intraframe solution, if you lose (part of) a frame, you just skip it and use the next one instead. But if you need that frame in order to render the next one, you either have to lag or display a corrupted image until your next keyframe.
I guess as long as keyframes are common and packet loss is low it'd work well enough.
You can also just configure your video encoder to not use B-frames. If you then make all consecutive frames P-frames, the size is very manageable. It gets trickier if your transport is lossy, since a dropped P-frame is a problem, but it's not unsolvable if you use LTR (long-term reference) frames intelligently.
All the benefits of efficient codecs, more manageable handling of the latency downsides.
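For example, with FFMPEG and libx264 a configuration along these lines keeps only I- and P-frames; the preset, GOP length, and file names below are just placeholders, not a recommendation:

    # Illustrative x264-through-FFMPEG settings for an I/P-only (no B-frame) stream.
    # Input path, preset, and GOP length are placeholder values.
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-i", "capture_source",             # placeholder input
        "-c:v", "libx264",
        "-preset", "ultrafast",
        "-tune", "zerolatency",             # drops lookahead and B-frames for low delay
        "-bf", "0",                         # explicitly: no B-frames
        "-g", "120",                        # keyframe interval (tune to loss tolerance)
        "-f", "mpegts", "output.ts",
    ])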
The challenge you'll run into instantly with JPEG is that the file-size increase & encoding/decoding time at large resolutions outstrip any benefits you see in limited tests. For video game applications you have to figure out how to pipeline your streaming more efficiently than transferring a small 10 KB image, because otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Finally, decode is the other side of the problem. HW video decoders are embarrassingly fast & super common. Your JPEG decode is going to be significantly slower.
EDIT: For your weekend project, are you testing it with cloud servers or locally? I would be surprised if, under equivalent network conditions, you're outperforming Stadia - so be careful that you're not benchmarking local network performance against Stadia's production performance on public networks.
I tested: localhost (no network packets on copper), within my home network (to the router and back), and across a very small WAN distance in the metro-local area (~75 Mbps link speed w/ 5-10 ms latency).
The only case that started to suck was the metro-local, and even then it was indistinguishable from the other cases until resolution or framerate were increased to the point of saturating the link.
One technique I did come up with to combat the exact concern raised above - encoding time relative to resolution - is to subdivide the work into multiple tiles which are independently encoded in parallel across however many cores are available. With this approach, it is possible to create the illusion that you are updating a full 1080p/4K+ scene within the same time frame that a single tile (e.g. 256x256) takes to encode+send+decode. I have started to seriously investigate this approach for building universal 2D business applications, since in those use cases you only have to transmit the tiles impacted by UI events, and at no particular frame rate.
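A rough Python sketch of the tiling idea (my implementation is C#; the tile size, the process pool, and OpenCV's JPEG encoder here are stand-ins):

    # Sketch of tile-parallel intraframe encoding: split the framebuffer into
    # fixed-size tiles, JPEG-encode only the dirty ones in parallel, and send
    # each tile with its (x, y) position. Tile size and pool size are assumptions.
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np
    import cv2

    TILE = 256

    def encode_tile(args):
        x, y, tile = args
        _, jpeg = cv2.imencode(".jpg", tile)
        return x, y, jpeg.tobytes()

    def encode_dirty_tiles(frame, dirty, pool):
        jobs = [(x, y, frame[y:y + TILE, x:x + TILE]) for (x, y) in dirty]
        # Each result is self-contained: the client can decode and blit the
        # tile at (x, y) the moment it arrives, at no particular frame rate.
        return list(pool.map(encode_tile, jobs))

    if __name__ == "__main__":
        frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
        with ProcessPoolExecutor() as pool:
            tiles = encode_dirty_tiles(frame, {(0, 0), (256, 0)}, pool)
            print([(x, y, len(b)) for x, y, b in tiles])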
Actually, there are commercial CUDA JPEG codecs (in both directions) operating at gigapixels per second. It's not a question of speed, but rather the fact that you can at least afford to use H.264 in I-frame-only mode for much lower bandwidth requirements.
Almost every hardware codec I've seen supports JPEG. MJPEG is certainly rarer than the more traditional video codecs, but it does get used.
You can also eliminate I-frames and distribute I-slices among several P-frames, so that you don't have spikes in bandwidth (and possibly latency, if the encoder needs more time to process an I-frame).
I think a larger issue is the focus on video as opposed to audio. Audio may be less sexy but it is far and away more important for most interpersonal communication (I'm not discussing gaming or streaming or whatever, but teleconferencing). Most of us don't care that much if we get super crisp, uninterrupted views of our colleagues or clients, but audio problems really impede discussion.
In my approach, these would be 2 completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to be within some safety margin of detected peak latency, and things should self-synchronize pretty well.
In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG brought the perfect trade-off. For reducing audio latency, I think you would just send raw PCM samples as soon as you generate them - maybe in really small batches (in case you have a client super-close to the server and you want virtually zero latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1 kHz 16-bit stereo audio is only about 1.4 Mbps. Most cellphones wouldn't have a problem with that these days.
Edit: The fundamental difference in information theory between video and audio is the dimensionality. JPEG makes sense for video, because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that samples arrive at a substantially higher rate (44,100/s) than video frames (60/s), so you need to buffer enough samples to cover the latency gap.
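To put rough numbers on the batching trade-off (assuming 44.1 kHz 16-bit stereo PCM; the batch sizes below are arbitrary examples):

    # Batch size vs. added latency, and raw bitrate, for 44.1 kHz 16-bit stereo PCM.
    # The batch sizes are arbitrary examples.
    RATE, BYTES_PER_SAMPLE, CHANNELS = 44_100, 2, 2

    bitrate_mbps = RATE * BYTES_PER_SAMPLE * CHANNELS * 8 / 1e6   # ~1.41 Mbps
    print(f"raw PCM bitrate: {bitrate_mbps:.2f} Mbps")

    for samples_per_batch in (64, 256, 1024):
        latency_ms = samples_per_batch / RATE * 1000
        print(f"{samples_per_batch:5d} samples/batch -> {latency_ms:5.1f} ms of buffering")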
Discord does something like what you describe. It's kind of awful for music (e.g. if it's a channel with a music bot), as you'll hear it speed up and slow down in an oscillating pattern. The same effect appears in games if you have a game loop that always tries to catch up to an ideal framerate by issuing more updates to match an average - the resulting oscillation, as the game suddenly slows down and then jerks forward, is hugely disruptive, so it's not really done this way in practice.
Oscillations are the main issue with "catch-ups" in synchronization, and dropping frames once your buffer is too far behind is often a more pleasant artifact. It's not really a one-size-fits-all engineering problem.
Audio conferencing at low latency is already solved by things like Mumble (https://www.mumble.info/). I think adding a video feed in complete parallel (as in, use mumble as-is, do the video in another process) with no regard for latency would be a pretty good first step to see what can be achieved.
Early versions of YouTube nailed this. The video would frequently pause, degrade, or glitch due to buffering delays, but the audio would continue to play. This made all the difference in user perception: YouTube felt smooth. Other streaming services would pause both video and audio, which did not feel smooth at all. Maybe they had some QoS code in their webapp to prioritize audio?
One technique that could be used to get high compression ratios while still compressing each frame independently is to train a compression "dictionary" on the first few seconds/minutes of the data stream, and then use that dictionary to compress/decompress each subsequent frame.
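A minimal sketch of what that could look like, assuming Python's zstandard bindings (the comment above doesn't prescribe a particular library) and treating an initial window of frames as the training set:

    # Sketch of dictionary-based per-frame compression: train a dictionary on an
    # initial window of frames, then compress every later frame independently
    # against it. The library choice (zstandard) and sizes are assumptions.
    import zstandard as zstd

    def train(frames, dict_size=112_640):
        # frames: raw frame buffers from the first few seconds of the stream,
        # used purely as training samples.
        return zstd.train_dictionary(dict_size, frames)

    def make_codecs(d):
        return zstd.ZstdCompressor(dict_data=d), zstd.ZstdDecompressor(dict_data=d)

    # Usage sketch: each compressed frame stays independently decodable, so a
    # lost frame never corrupts the frames that follow it.
    # comp, decomp = make_codecs(train(first_seconds_of_frames))
    # wire_bytes  = comp.compress(frame_bytes)
    # frame_bytes = decomp.decompress(wire_bytes)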