The biggest problem is the video codecs, which ultimately boils down to their use of interframe compression. This technique requires that a certain # of video frames be received and buffered before a final image can be produced, which imposes a baseline amount of latency that can never be engineered away. It is a hard trade-off in information theory.
Something to consider is that there are alternatives to interframe compression. Intraframe compression (e.g. JPEG) can bring your encoding latency per frame down to 0~10ms, at the cost of a dramatic increase in bandwidth. Another benefit is the ability to draw any frame the instant you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you often need some number of prior frames before you can reconstitute a complete picture.
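To make that trade-off concrete, here is a minimal sketch of per-frame intraframe encoding cost, written in Python with OpenCV as a stand-in rather than the C# prototype described below; the resolution, quality setting, and synthetic frame are just illustrative:

    # Minimal illustration of per-frame intraframe encoding cost (Python/OpenCV
    # stand-in, not the C# prototype). 1080p, quality=80, and the synthetic flat
    # frame are arbitrary; real sizes and timings depend on content.
    import time
    import numpy as np
    import cv2  # cv2.imencode wraps a libjpeg-family encoder

    frame = np.full((1080, 1920, 3), 128, dtype=np.uint8)

    start = time.perf_counter()
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Every encoded buffer is a complete, independently decodable picture.
    print(f"{len(jpeg)} bytes in {elapsed_ms:.1f} ms")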
For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using LibJpegTurbo, and I'm able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @ 1080p. Testing this approach at 60fps redraw with event feedback shows that, under ideal localhost conditions, the roundtrip latency is nearly indistinguishable from a native desktop application.
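Conceptually, the server loop is just "encode a complete picture, send it, repeat". A hedged Python sketch of that idea follows (my actual prototype is C#/AspNetCore; the websockets library, the port, and render_frame() below are stand-ins):

    # Conceptual sketch of the push loop (the real prototype is C#/AspNetCore;
    # the Python 'websockets' library and render_frame() are stand-ins).
    import asyncio
    import cv2
    import numpy as np
    import websockets

    def render_frame() -> np.ndarray:
        # Placeholder for whatever draws the framebuffer.
        return np.zeros((1080, 1920, 3), dtype=np.uint8)

    async def stream(ws, path=None):
        while True:
            _, jpeg = cv2.imencode(".jpg", render_frame())
            await ws.send(jpeg.tobytes())   # one complete picture per message
            await asyncio.sleep(1 / 60)     # ~60fps pacing

    async def main():
        async with websockets.serve(stream, "localhost", 8765):
            await asyncio.Future()          # run forever

    asyncio.run(main())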
The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to sacrifice bandwidth efficiency. My three-weekend project arguably already runs much better than Google Stadia in both latency and quality, but the market for streaming game & video conferencing services that require 50~100 Mbps of constant throughput (depending on resolution & refresh rate) is probably very limited for now. That said, it is also not entirely non-existent - think corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming gaming dream alive, even if it can't be fully realized until 10 Gbps+ LAN/internet is the default everywhere.
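For a rough sense of where the 50~100 Mbps figure comes from, a back-of-envelope calculation, assuming ~100 KB per compressed 1080p JPEG frame (heavily dependent on quality and content):

    # Back-of-envelope throughput for JPEG-per-frame streaming. The ~100 KB per
    # 1080p frame figure is an assumption; real sizes vary with quality/content.
    frame_kb = 100           # assumed compressed size of one 1080p JPEG frame
    fps = 60

    mbps = frame_kb * 8 * fps / 1000
    print(f"~{mbps:.0f} Mbps at 1080p{fps}")   # ~48 Mbps; 4K is roughly 4x that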
Interframes are not a problem, as long as they only reference previous frames, not future ones.
I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of 1 frame (16ms) of delay for the screen capture on the sender, plus 2-3 frames of latency to get through the OS stack to the screen at the receiving end. Encoding and decoding were about 5ms. Plus of course the network latency, but I only tested this on local wifi, so it didn't add much.
It's funny you mention MPEG1. That's where my journey with all of this began. For MPEG1 testing I was just piping my raw bitmap data to FFMPEG and piping the result to the client browser.
I was never satisfied with the lower latency bound for that approach and felt like I had to keep pushing into latency territory that was lower than my frame time.
That said, MPEG1 was probably the simplest way to get nearly-ideal latency conditions for an interframe approach.
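The pipe looked conceptually like this (a rough Python sketch rather than my actual code; the 1080p60 RGB input format and the bitrate are placeholders):

    # Rough sketch of the "pipe raw frames into FFMPEG, pipe MPEG1 out" idea.
    # Not the actual code; 1080p60 RGB input and the 4 Mbps bitrate are assumptions.
    import subprocess

    ffmpeg = subprocess.Popen(
        [
            "ffmpeg",
            "-f", "rawvideo", "-pix_fmt", "rgb24",
            "-s", "1920x1080", "-r", "60", "-i", "-",   # raw frames on stdin
            "-f", "mpegts", "-codec:v", "mpeg1video",
            "-b:v", "4000k", "-",                       # MPEG1-in-MPEG-TS on stdout
        ],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    # Write each rendered frame to ffmpeg.stdin, read the encoded stream from
    # ffmpeg.stdout, and relay those bytes to the browser over a websocket.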
Wouldn't you then hit issues where a single dropped packet can cause noticeable problems? In an intraframe solution, if you lose (part of) a frame, you just skip it and use the next one instead. But if you need that frame in order to render the next one, you either have to lag or display a corrupted image until your next keyframe.
I guess as long as keyframes are common and packet loss is low it'd work well enough.
You can also just configure your video encoder to not use B-frames. If you then make all consecutive frames P-frames, the size is very manageable. It gets trickier if your transport is lossy, since a dropped P-frame is a problem, but it's not unsolvable if you use LTR (long-term reference) frames intelligently.
All the benefits of efficient codecs, more manageable handling of the latency downsides.
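For example, with FFMPEG and libx264 a configuration along these lines keeps only I- and P-frames; the preset, GOP length, and file names below are just placeholders, not a recommendation:

    # Illustrative x264-through-FFMPEG settings for an I/P-only (no B-frame) stream.
    # Input path, preset, and GOP length are placeholder values.
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-i", "capture_source",             # placeholder input
        "-c:v", "libx264",
        "-preset", "ultrafast",
        "-tune", "zerolatency",             # drops lookahead and B-frames for low delay
        "-bf", "0",                         # explicitly: no B-frames
        "-g", "120",                        # keyframe interval (tune to loss tolerance)
        "-f", "mpegts", "output.ts",
    ])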
The challenge you'll run into instantly with JPEG is that the file-size increase & encoding/decoding time at large resolutions outstrip any benefits you see in limited tests. For video game applications you have to figure out how to pipeline your streaming more efficiently than transferring a small 10 KB image, because otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Finally, decode is the other side of the problem. HW video decoders are embarrassingly fast & super common. Your JPEG decode is going to be significantly slower.
EDIT: For your weekend project, are you testing it with cloud servers or locally? I would be surprised if, under equivalent network conditions, you're outperforming Stadia - so be careful that you're not benchmarking local network performance against Stadia's production performance on public networks.
I tested: localhost (no network packets on copper), within my home network (to the router and back), and across a very small WAN distance in the metro-local area (~75 Mbps link speed w/ 5-10 ms latency).
The only case that started to suck was the metro-local, and even then it was indistinguishable from the other cases until resolution or framerate were increased to the point of saturating the link.
One technique I did come up with to combat the exact concern raised above - encoding time relative to resolution - is to subdivide the work into multiple tiles which are independently encoded in parallel across however many cores are available. With this approach, it is possible to create the illusion that you are updating a full 1080p/4K+ scene within the same time frame that a single tile (e.g. 256x256) takes to encode+send+decode. I have started to seriously investigate this approach for building universal 2D business applications, since in those use cases you only have to transmit the tiles impacted by UI events, and at no particular frame rate.
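A rough Python sketch of the tiling idea (my implementation is C#; the tile size, the process pool, and OpenCV's JPEG encoder here are stand-ins):

    # Sketch of tile-parallel intraframe encoding: split the framebuffer into
    # fixed-size tiles, JPEG-encode only the dirty ones in parallel, and send
    # each tile with its (x, y) position. Tile size and pool size are assumptions.
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np
    import cv2

    TILE = 256

    def encode_tile(args):
        x, y, tile = args
        _, jpeg = cv2.imencode(".jpg", tile)
        return x, y, jpeg.tobytes()

    def encode_dirty_tiles(frame, dirty, pool):
        jobs = [(x, y, frame[y:y + TILE, x:x + TILE]) for (x, y) in dirty]
        # Each result is self-contained: the client can decode and blit the
        # tile at (x, y) the moment it arrives, at no particular frame rate.
        return list(pool.map(encode_tile, jobs))

    if __name__ == "__main__":
        frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
        with ProcessPoolExecutor() as pool:
            tiles = encode_dirty_tiles(frame, {(0, 0), (256, 0)}, pool)
            print([(x, y, len(b)) for x, y, b in tiles])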
Actually, there are commercial CUDA JPEG codecs (in both directions) operating at gigapixels per second. It's not a question of speed, but rather the fact that you can at least afford to use H.264 in I-frame-only mode for much lower bandwidth requirements.
Almost every hardware codec I've seen supports JPEG. MJPEG is certainly rarer than the more traditional video codecs, but it does get used.
You can also eliminate I-frames and distribute I-slices among several P-frames, so that you don't have spikes in bandwidth (and possibly latency, if the encoder needs more time to process an I-frame).
I think a larger issue is the focus on video as opposed to audio. Audio may be less sexy but it is far and away more important for most interpersonal communication (I'm not discussing gaming or streaming or whatever, but teleconferencing). Most of us don't care that much if we get super crisp, uninterrupted views of our colleagues or clients, but audio problems really impede discussion.
In my approach, these would be 2 completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to be within some safety margin of detected peak latency, and things should self-synchronize pretty well.
In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG brought the perfect trade-off. For reducing audio latency, I think you would just send raw PCM samples as soon as you generate them - maybe in really small batches (in case you have a client super-close to the server and you want virtually zero latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1 kHz 16-bit stereo audio is only about 1.4 Mbps. Most cellphones wouldn't have a problem with that these days.
Edit: The fundamental difference in information theory between video and audio is the dimensionality. JPEG makes sense for video, because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that samples arrive at a substantially higher rate (44,100/s) than video frames (60/s), so you need to buffer enough samples to cover the latency gap.
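To put rough numbers on the batching trade-off (assuming 44.1 kHz 16-bit stereo PCM; the batch sizes below are arbitrary examples):

    # Batch size vs. added latency, and raw bitrate, for 44.1 kHz 16-bit stereo PCM.
    # The batch sizes are arbitrary examples.
    RATE, BYTES_PER_SAMPLE, CHANNELS = 44_100, 2, 2

    bitrate_mbps = RATE * BYTES_PER_SAMPLE * CHANNELS * 8 / 1e6   # ~1.41 Mbps
    print(f"raw PCM bitrate: {bitrate_mbps:.2f} Mbps")

    for samples_per_batch in (64, 256, 1024):
        latency_ms = samples_per_batch / RATE * 1000
        print(f"{samples_per_batch:5d} samples/batch -> {latency_ms:5.1f} ms of buffering")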
Discord does something like what you describe. It's kind of awful for music (e.g. if it's a channel with a music bot), as you'll hear it speed up and slow down in an oscillating pattern. The same effect appears in games if you have a game loop that always tries to catch up to an ideal framerate by issuing more updates to match an average - the resulting oscillation, as the game suddenly slows down and then jerks forward, is hugely disruptive, so it's not really done this way in practice.
Oscillations are the main issue with "catch-ups" in synchronization, and dropping frames once your buffer is too far behind is often a more pleasant artifact. It's not really a one-size-fits-all engineering problem.
Audio conferencing at low latency is already solved by things like Mumble (https://www.mumble.info/). I think adding a video feed in complete parallel (as in, use mumble as-is, do the video in another process) with no regard for latency would be a pretty good first step to see what can be achieved.
Early versions of YouTube nailed this. The video would frequently pause, degrade, or glitch due to buffering delays, but the audio would continue to play. This made all the difference in user perception: YouTube felt smooth. Other streaming services would pause both video and audio, which did not feel smooth at all. Maybe they had some QoS code in their webapp to prioritize audio?
One technique that could be used to get high compression ratios while still compressing each frame independently is to train a compression "dictionary" on the first few seconds/minutes of the data stream, and then use that dictionary to compress/decompress each subsequent frame.
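A minimal sketch of what that could look like, assuming Python's zstandard bindings (the comment above doesn't prescribe a particular library) and treating an initial window of frames as the training set:

    # Sketch of dictionary-based per-frame compression: train a dictionary on an
    # initial window of frames, then compress every later frame independently
    # against it. The library choice (zstandard) and sizes are assumptions.
    import zstandard as zstd

    def train(frames, dict_size=112_640):
        # frames: raw frame buffers from the first few seconds of the stream,
        # used purely as training samples.
        return zstd.train_dictionary(dict_size, frames)

    def make_codecs(d):
        return zstd.ZstdCompressor(dict_data=d), zstd.ZstdDecompressor(dict_data=d)

    # Usage sketch: each compressed frame stays independently decodable, so a
    # lost frame never corrupts the frames that follow it.
    # comp, decomp = make_codecs(train(first_seconds_of_frames))
    # wire_bytes  = comp.compress(frame_bytes)
    # frame_bytes = decomp.decompress(wire_bytes)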