My application was essentially user-space forwarding of GTP-U packets using DPDK. Roughly 80% of the available cores (32 or 64 in total, I'm not sure which) were dedicated to that task. Each forwarding thread worked independently of every other thread, incrementing tx/rx counters and so on. Instead of a global shared counter, we had per-thread 16-bit counters that were 'synced' only when they overflowed, which ended up reducing the overall contention by quite a large margin.
Of course, the salient point here is that it was OK to be 'eventually correct'; a minor lag was always acceptable.
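
For anyone who wants to see the shape of it, here is a rough sketch of the per-thread counter idea in plain C11. This is not the original code: every name here (`global_rx_packets`, `worker_counter`, `worker_count_packet`, `worker_flush`, `read_total`) is made up for illustration, it uses standard `<stdatomic.h>` rather than any DPDK primitives, and the 64-byte alignment is an assumption about the cache line size.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared total. Workers touch it only once per 65536 packets. */
static _Atomic uint64_t global_rx_packets;

/* One instance per forwarding thread. The alignment (assumed 64-byte
 * cache line) keeps two workers' counters off the same line. */
struct worker_counter {
    _Alignas(64) uint16_t local; /* written only by the owning thread */
};

/* Hot path: a plain, non-atomic increment. The shared counter is
 * updated only when the 16-bit value wraps from 65535 back to 0. */
static inline void worker_count_packet(struct worker_counter *c)
{
    if (++c->local == 0) {
        atomic_fetch_add_explicit(&global_rx_packets, 65536u,
                                  memory_order_relaxed);
    }
}

/* Optional: push the remainder, e.g. when the owning worker stops.
 * Must be called by the owning thread. */
static inline void worker_flush(struct worker_counter *c)
{
    atomic_fetch_add_explicit(&global_rx_packets, c->local,
                              memory_order_relaxed);
    c->local = 0;
}

/* Reader side: the value can lag by up to 65535 packets per worker,
 * which is the 'eventually correct' trade-off. */
static inline uint64_t read_total(void)
{
    return atomic_load_explicit(&global_rx_packets, memory_order_relaxed);
}
```

With this layout each worker performs one atomic operation per 65536 packets instead of one per packet, so the shared cache line is rarely written. The cost is that `read_total()` undercounts until a worker wraps or flushes.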