My application was essentially a user-space forwarder (of GTP-U packets) built on DPDK. Roughly 80% of the available cores (32 or 64 in total, I'm not sure which) were dedicated to that task. Each forwarding thread worked independently of every other thread, incrementing its own tx/rx counters and so on. Instead of a single global shared counter, we had per-thread 16-bit counters which were 'synced' only when they were about to overflow. That ended up reducing overall contention by quite a large margin.

Of course, the salient point here is that it was OK to be 'eventually correct'; a minor lag was always OK.
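
A minimal sketch of the idea in C, not the original code. All names here (`worker_stats`, `count_rx`, `flush_rx`, `read_rx_total`, `global_rx_packets`) are made up for illustration, and it assumes the per-thread counters get folded into one shared atomic total. Each forwarding thread would own exactly one `worker_stats`, so the hot path touches only private memory and hits the shared cache line about once per 65535 packets.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared total (hypothetical name). Written only when a local counter is
 * about to wrap, or when a worker flushes explicitly. */
static _Atomic uint64_t global_rx_packets;

/* Per-thread state (hypothetical layout). Only the owning thread reads or
 * writes rx_local, so it needs no atomics and no locks. */
struct worker_stats {
    uint16_t rx_local;
};

/* Hot path: count one received packet. */
static inline void count_rx(struct worker_stats *s)
{
    if (s->rx_local == UINT16_MAX) {
        /* The next increment would overflow: fold the local count into
         * the shared total and start again from zero. */
        atomic_fetch_add_explicit(&global_rx_packets, s->rx_local,
                                  memory_order_relaxed);
        s->rx_local = 0;
    }
    s->rx_local++;
}

/* Push whatever is left in the local counter, e.g. on thread shutdown. */
static inline void flush_rx(struct worker_stats *s)
{
    atomic_fetch_add_explicit(&global_rx_packets, s->rx_local,
                              memory_order_relaxed);
    s->rx_local = 0;
}

/* Reader side: the value can lag behind reality by up to 65535 packets
 * per worker, which is the 'eventually correct' trade-off. */
static inline uint64_t read_rx_total(void)
{
    return atomic_load_explicit(&global_rx_packets, memory_order_relaxed);
}
```

Relaxed memory ordering is enough in this sketch because the total is only a statistic and nothing else is synchronized through it. If the readout needs to be fresher, a worker could also call `flush_rx` on a timer, at the cost of more traffic on the shared counter.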