It works great even for large heap sizes. I moved my ES cluster (running with around 92G heap size) from G1GC to ZGC and saw huge improvements in GC. The best part about ZGC is that you don't need to touch any GC parameters; it autotunes everything.
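If you want to try it, the switch is basically a flag swap in config/jvm.options (or a file under config/jvm.options.d/ on newer Elasticsearch versions). The snippet below is only a sketch: ZGC is production-ready from JDK 15, and on JDK 11-14 it also needs the experimental-options flag.

    # replace the existing GC flags with:
    -Xms92g
    -Xmx92g
    -XX:+UseZGC
    # JDK 11-14 only:
    # -XX:+UnlockExperimentalVMOptions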
Whether G1 or ZGC is the best choice depends on the workload and requirements, but G1 in recent JDK versions also requires virtually no tuning (if your G1 setup has flags other than the maximum heap size, maybe the minimum heap size, and maybe a pause target, try again without them).
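For concreteness, a "no tuning" G1 setup on a recent JDK really is just the heap size, since G1 has been the default collector since JDK 9. The sizes below are placeholders:

    # app.jar is a placeholder; G1 is already the default on JDK 9+
    java -Xms31g -Xmx31g -jar app.jar
    # add -XX:MaxGCPauseMillis=<ms> only if you have an explicit pause goal
    # (the default target is 200ms)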
I'm curious about this choice. The Elasticsearch documentation recommends a maximum heap slightly below 32GB [1].
Is this not a problem anymore with G1GC/ZGC, or are you simply "biting the bullet" and using 92G of heap because you can't afford to scale horizontally?
Heaps "slightly below 32GB" are usually because of the -XX:+UseCompressedOops option, which allows Java to address up to 32GB of memory with a smaller pointer. Between 32-35GB of heap, you're just paying off the savings you would have gotten with compressed object pointers, but if you keep cranking your heap further after that, you'll start getting benefits again.
It doesn't have to be that you can't afford to scale horizontally; it's often more efficient and cheaper to scale vertically first, both in monetary cost and in time/maintenance cost.
On hardware, but not on a cloud setup? We run several hundred big ES nodes on AWS, and I believe we stick to the heap sizing guidelines (though I’ve long wondered if fewer instances with giant heaps might actually work ok, too)
Cloud is trickier to price than real hardware. On real hardware, filling the ram slots is clearly cheaper than buying a second machine, if ram is the only issue. If you need to replace with higher density ram, sometimes it's more cost effective to buy a second machine. Adding more processor sockets to get more ram slots is also sometimes more, sometimes less cost effective than adding more machines. Often, you might need more processing to go with the ram, which can change the balance.
In cloud, with defined instance types, more ram usually comes with more of everything else, and from the pricing listed at https://www.awsprices.com/ for US East, it looks like within an instance type, $/ram is usually consistent. The least expensive (per unit ram) class of instances is x1/x1e, which run from 122 GB to 3,904 GB, so that does lean towards bigger instances being cost effective.
Exceptions I saw (again per unit ram): c1.xlarge is less expensive than c1.medium; c4.xlarge is less expensive than the other c4 types, and the c4 family is more expensive than the others; in the m1 family, m1.medium < m1.large == m1.xlarge < m1.small; m3.medium is more expensive than the other m3 types; p2.16xlarge is more expensive than the other p2 types; t2.small is less expensive than the other t2 types. Many of these differences are a tenth of a penny per hour, though.
How (and how much) did these improvements manifest? For example, did you measure consistently faster response times when running ZGC rather than G1GC? If so, by how much? I’m always looking for a way to improve ES response times for our users.
We mainly capture GC metrics and alert on them. One good thing is that GC-related alerts no longer fire in production at all. Tail latency for API calls from Kibana to ES also improved.
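For anyone wanting to capture the same data, JDK 9+ unified GC logging is enough to feed pause times and collection counts into whatever you alert with; the path and rotation values here are just placeholders:

    -Xlog:gc*:file=logs/gc.log:time,uptime,level,tags:filecount=8,filesize=64m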