More

opisthenar84 · on March 1, 2024

Sorry as this is not a "wants to be hired" post, but I'm curious how well this monthly post works for people actually looking for new opportunities. Can anybody share their experiences?

opisthenar84 · on Feb 20, 2024

Might be a noob question but for truly important data, couldn't SDCs be detected by using ECC everywhere?

jandrewrogers · on Feb 20, 2024

ECC isn’t free and ECC has a limited ability to detect all statistically plausible errors. Additionally, error correction in hardware is frequently defined by standards, some of which have backward compatibility requirements that go back decades. This is why, for example, reliable software often uses (quasi-)cryptographic checksums at all I/O boundaries. There is error correction in the hardware but in some parts of the silicon that error correction is weak enough that it is likely to eventually deliver a false negative in large scale systems.

None of this is free, and there are both hardware and software solutions for mitigating various categories of risk. It is explicitly modeled as an economics problem i.e. how does the cost of not mitigating a risk, if it materializes, compare to the cost of minimizing or eliminating it. In many cases, the optimal solution is unintuitive, such as computing everything twice or thrice and comparing the results rather than using error correction.

teaearlgraycold · on Feb 20, 2024

There are errors within the CPU.

As for adding ECC within the CPU, I think that would require you to essentially have a second CPU in parallel to compare against.

MertsA · on Feb 20, 2024

Actually it's not uncommon for there to be ECC used within components as a method to guard against stuff like this. I don't think it's practical to ever have complete coverage without going full blown dual/triple redundant CPU but for stuff like SSD controllers they have ECC coverage internally on the data path.

moonchild · on Feb 20, 2024

Caches, register files, and coherency traffic all definitely include error-correction.

XorNot · on Feb 20, 2024

You actually need 3 - which is how it's done for space (I believe SpaceX uses this as a solution to avoiding radiation hardened costs).

2 will tell you if they diverge, but you lose both if they do. 3 let's you retain 2 in operation if one does diverge.

jorticka · on Feb 20, 2024

If you're not hard realtime 2 is enough, you just redo the computation.

MertsA · on Feb 20, 2024

But if it's a consistent fault, like the silent data corruption covered in the linked paper, redoing the computation is still going to end up with no way to identify which core is faulty. If it's an intermittent fault, then even for hard realtime you can accomplish that with one core, just compute 3x and go with majority result.

vlovich123 · on Feb 20, 2024

Yup exactly. The only way independent hardware can help is if the fault is state dependent in a way on the hardware (eg differences in behavior due to thermal load or different internal state corruption or something) in which case repeated computations may not help if the repeated computation is not sufficiently decoupled temporally to get rid of that state. The other thing with independent hardware is that you don’t pay a 3x performance penalty (instead 3x cost penalty). That being said, none of these fault modes are what are really what is being discussed in the paper.

The other one that freaks me out is miscompilation by the compiler and JITs in the data path of an application. Like we’re using these machines to process hundreds of millions of transactions and trillions of dollars - how much are these silent mistakes costing us?

paganel · on Feb 20, 2024

I think that strictly looking at it in terms of money-related operations stuff can still be managed/double-checked externally, i.e. by the real world, which means that whatever mistakes/inconsistencies might show up there's still a "hard reality" out there that will start screaming "hey! this money figure is not correct!" because people tend to notice when there are big money-discrepancies and the "mistakes" are, generally speaking, reversible when it comes to money.

What's worrying is when systems like these get used in real-time life-and-death situations, and there's basically no reversibility because that would imply dead people returning to life. For example the code used for stuff like outer space exploration, sure that right now we can add lots and lots of redundancies and check-ups in the software being used in that domain because the money is there to be spent and we still don't have that many people out there in space. But what will happen when we'll think of hosting hundreds, even thousands of people inside a big orbital station? How will we be able to make sure that all the safety-related code for that very big structure (certainly much bigger than we have now in space) doesn't cause the whole thing to go kaboom based on an unknown-unknown software error?

And leaving aside scenarios that are not there yet, right now we've started using software more and more when it comes to warfare (for example for battle simulations based on which real-life decisions are taken), what will happen to the lives of soldiers whose conduct in war has been lead by faulty software?

vlovich123 · on Feb 20, 2024

The financial impact was just to highlight a scope in terms that can result in a single calculable easy to understand number. And also most transactions are automated and rarely validated manually so I’m not sure how many inconsistencies we’re catching. Look at the UK post office scandal and that was basic distributed systems bugs used in the auditing software where the system was granted privilege over manual review (sure there’s lots wrong with that scandal but it is illustrative of how much deference we give to automated systems since that tends to be the right tradeoff to make).

jorticka · on Feb 20, 2024

The recent Ukraine war shows that soldiers lifes are cheap - according to commanders.

So many soldiers on both sides died because of really dumb commander decisions, missing kit, political needs, that worrying about CPU errors is truly way way down the list.

paganel · on Feb 20, 2024

At the tactical level, of course that what you're saying is true, but the big Ukrainian counter-offensive from last year had been preceded by lots and lots of allusions made to "war games simulations" set-up by Ukraine's allies in the West (mostly the US and the UK), and it is my understanding that those war games were heavily taken into consideration as a basis for that counter-offensive decision. I'm not saying that the code behind those simulations was faulty, I'm just saying that software is already used at an operational level (at least) when it comes to war.

As per the sources, here's this one in The Economist [1] from September 2023, just as it had become obvious that the counter-offensive had fizzled out:

> American and British officials worked closely with Ukraine in the months before it launched its counter-offensive in June. They gave intelligence and advice, conducted detailed war games to simulate how different attacks might play out

And another one from earlier on [2], in July 2023, when things were stil undecided:

> Ukraine’s allies had spent months conducting wargames and simulations to predict how an assault might unfold.

[1] https://archive.is/1u7OK

[2] https://archive.is/NyGJI

jorticka · on Feb 20, 2024

It was mostly an old-school kind of wargames.

There was an article about a giant real 3D map in a room with all the commanders there discussing what each one has to do, contingencies, ...

The reason for the counter-offensive failure were multiple and complex, good or bad software would not have changed the result significantly.

jorticka · on Feb 20, 2024

If it's consistent and persistent, wouldn't that classify as broken hardware requiring device change?

Even with 3 chips, if one is permanently wrong you are then left with only 2 working ones so no redundancy is left for further degradation.

> just compute 3x

That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.

MertsA · on Feb 20, 2024

>wouldn't that classify as broken hardware requiring device change?

Yes but you need to catch it first to know what to take out of production.

>That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.

That's kind of my point. Either it's a heisen-bug and you never see those results again when you repeat the original program or it's permanently broken and you need to swap out the sketchy CPU. If you only care about the first case then you only need one core. If you care about the second case then you need 3 if you want to come up with an accurate result instead of just determining that one of them is faulty. It's like that old adage about clocks on ships. Either take one clock or take three, never two.

namibj · on Feb 20, 2024

You don't need to know which one of the two was bad; it's not worth the extra overhead to avoid scrapping two in the rare case you catch a persistent glitch; sudden hardware death (blown VRM or such, for example) will dominate either way, so you might as well build your "servers" to have two parts that check each other and force-reset when they don't agree. If it reboot-loops you take it out of the fleet.

MertsA · on Feb 20, 2024

Right, but the comment I was replying to was in response to this:

> 2 will tell you if they diverge, but you lose both if they do. 3 let's you retain 2 in operation if one does diverge.

If you care about resilience then you either need to settle with one and accept that you can't catch the class of errors that are persistent or go with three if you actually need resilience to those failures as well. If you don't need that kind of resilience like an aerospace application would need then you're probably better off with catching this at a higher layer in the overall distributed systems design. Rather than trying to make a resilient and perfectly accurate server, design your service to be resilient to hardware faults and stack checksums on checksums so you can catch errors (whether HW or software) where some invariant is violated. Meta also has a paper on their "Tectonic filesystem" where there's a checksum of every 4K chunk fragment, a checksum of the whole chunk, and a checksum of the erasure encoded block constructed out of the chunks. Once you add in yet another layer of replication above this then even when some machine is computing corrupt checksums or inconsistent checksums where the checksum and the data are corrupt then you can still catch it and you have a separate copy to avoid data loss.

lobochrome · on Feb 20, 2024

In those cases, the CPU makes a false calculation independent of what's done in RAM. It can be solved by having flop redundancy as in system z - but nobody at Google or Meta would be considering big metal.

From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.

Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.

delroth · on Feb 20, 2024

> a non-issue at scale business-wise in modern data centers.

I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.

We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.

lobochrome · on Feb 20, 2024

Sure. It’s a cool issue to work on and maybe actually relevant at Google scale. But I’ve asked your colleagues multiple time if the business side actually cared about the issue and they never confirmed.

Again, cool to work on at Google. Not sure anybody else cares. If you care (finance) you fix it in hardware (system z).

withinboredom · on Feb 20, 2024

Why would the business side ever care about technical details? It's like asking the business what days the dumpsters get emptied. Nobody gives a fuck; they just care that it gets done and gets done quickly, correctly, and safely.

lobochrome · on Feb 22, 2024

A CFO knows which factors have a significant impact on the bottom line.

withinboredom · on Feb 22, 2024

If a CFO knows which days the dumpster is emptied, you have a strange CFO. The metaphor is to point out that there’s a lot of technical details that aren’t tracked (like usually refactoring isn’t tracked independently) and shouldn’t be tracked because they are the normal part of the technical job. A CFO can’t measure it even if they wanted to because nobody else is measuring crazy things, like how fast you walk to the bathroom or any other metrics that are specifically related to doing your job.

opisthenar84 · on Feb 20, 2024

I'm surprised he didn't mention much about vector search

H8crilA · on Feb 20, 2024

It is a finished thing, no? So many systems/products have it built in for quite a few years now.

opisthenar84 · on Feb 12, 2024

I fail to see how programmable logic patched together == "the future of video game preservation". There's community, software, testing, etc... involved as well.

opisthenar84 · on Feb 2, 2024

This story has close to 500 up votes as of 2:19PM Pacific and is no longer on the front page. Why flag an article that is simply trying to show how HN can potentially be improved?

dang · on Feb 3, 2024

It got flagged by users. We turned that off per https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.... The story has been on the front page for 9 hours now and is still there.

opisthenar84 · on Jan 22, 2024

A seemingly common theme through all these layoffs is that engineers are less impacted than other orgs.

stg22 · on Jan 23, 2024

In general, the more direct the impact of your role on short-term revenue, the safer you are when your company starts sacrificing people to the gods of quarterly accounts.

Ancillary service workers always get hit badly, but good salespersons are often safer than good engineers.

resolutebat · on Jan 23, 2024

Only if their salesperson manages to hit their number very quarter without fail. I've seen people blow their quota out of the water in Q1, and get shitcanned when Q2 is quiet.

ChrisMarshallNY · on Jan 23, 2024

That is exactly what sales is like, and it has been like that, for as long as I can remember.

However, good salespeople can make a lot of money.

jncfhnb · on Jan 23, 2024

So can mediocre lucky ones. A lot of systems are very poorly designed.

ChrisMarshallNY · on Jan 23, 2024

Famous sales pep talk: https://youtube.com/watch?v=wVQPY4LlbJ4

DaiPlusPlus · on Jan 23, 2024

This is HN, so it’s Always Be Coding

smt88 · on Jan 23, 2024

This is true with all layoffs. If you can't explain your direct role in creating your product/service, you're going to be higher on the chopping block.

Sales, marketing, HR, admins, PMs, etc. are usually the first to go because they don't "keep the lights on" and it's hard to measure their impact.

While it's true that it's also hard to measure a single engineer's impact, it's scarier to fire someone who may be expensive to replace and who may take institutional knowledge of the inner workings of the product with them.

resolutebat · on Jan 23, 2024

Sales does not belong on that list, since their impact is the easiest of all to measure ($$$).

Also, top-down layoffs tend to target expensive staff with large paychecks, without accounting for institutional knowledge or intangible value delivered.

conductr · on Jan 23, 2024

Easy to measure isn’t always good. If the org tends to sell $5m per sale person and they try to scale it by hiring it might just fall to $4m per sales person. When things go under the microscope, they’ll want to put that back in balance by reducing heads.

smt88 · on Jan 23, 2024

Sales isn't easy to measure.

If you generate the same number of leads, do you need 300 salespeople to convert them? Or could your top 100 performers do it? Do you need separate salespeople for each product or can one person sell it all?

At the trough of interest rates, big companies had insane numbers of salespeople touching each deal (easily 10+ at some orgs) because they'd over-hired and the forced people to hyper-specialize to justify it.

beau_g · on Jan 23, 2024

Is it though? Without products there are no sales at all. A compelling enough product will sell with no marketing/sales effort, so sales is really only whatever that baseline is + (extra units sold from sales and marketing efforts - cost of sales of marketing). I would think this is actually quite hard to measure, you also have to normalize against many external market factors and behavior of competitors.

tryauuum · on Jan 23, 2024

> A compelling enough product will sell with no marketing/sales effort

would it? maybe you are talking about video games and a good enough game would sell itself. But for other businesses sales / marketing / ads do matter. Imagine you provide a 50% better service than your competitors, but you target audience never reaches your page because they are bombarded with google ads of your competitors

MaxHoppersGhost · on Jan 23, 2024

For enterprise SaaS companies sales is one of the most important roles, tied with engineering. Especially if your engineering org/product is weak.

disgruntledphd2 · on Jan 23, 2024

To quote Larry Ellison, you're either building the product or selling the product.

whatyesaid · on Jan 23, 2024

How do you know who they're firing? Nobody knows what "core tech" is. They are clearly getting rid of people in LoR. And game companies aren't bloated with sales people.

cannonpalms · on Jan 23, 2024

There are typically crowdsourced spreadsheets available from https://layoffs.fyi that cover title, etc.

opisthenar84 · on Jan 17, 2024

True. Pure vectorstores seem limited and kind of overrated. Combining many sources of data is challenging but the right thing to do.

opisthenar84 · on June 26, 2023

Depends on the podcaster. Usually between 1x and 1.5x and rarely higher than that unless I'm pressed for time.

JNRowe · on June 26, 2023

Full agreement about it depending on the podcast, finding a player where speed is feed specific was a huge improvement for me. There are a few podcasts I listen to where the accent and cadence makes me feel genuinely angry if I have to listen to them at 1x, yet they become interesting enough to subscribe to when they're up around 2.25x.

It makes me wonder how much of a pattern there would be based purely on the listener's birth location and general regional cadence. I know I have a tendency to speak too quickly for some people in my native English, and also in languages I've learned later in life. Perhaps there are people who'd like a 0.4x button for interacting with me ;)

opisthenar84 · on June 23, 2023

The industry is finally figuring out that tech is a tool, not a crutch.

mcbutterbunz · on June 23, 2023

Tech is a tool to sell more flashy features in cars. It really needs to last only 3-5 years. If it breaks after your warranty is up, oh well.

mjhay · on June 23, 2023

That does erode the long-term customer loyalty, which is disastrous in the long run. However, that is of no concern to executives that make these kind of decisions.

turminal · on June 23, 2023

Doesn't matter as long as all other brands do the same.

opisthenar84 · on June 8, 2023

So... drink more Red Bull?

gnulinux · on June 8, 2023

Cat food has tons of taurine in it because cooking food destroys taurine but unlike humans cats cannot produce taurine themselves.

ThrowawayR2 · on June 8, 2023

More like eat more sashimi (which is uncooked by definition) or sushi that contains sashimi since uncooked fish meat contains a fair bit of taurine. Carpaccio would be a good option as well. If my budget supported it, that's a dietary change I'd be more than happy to make.

Scoundreller · on June 8, 2023

Or think of mass-market cat food as a dietary supplement

SketchySeaBeast · on June 8, 2023

That's good to know my retirement plan will keep me looking young.

readthenotes1 · on June 8, 2023

Bad plan

Cat food is incredibly expensive!

Blackthorn · on June 8, 2023

Inflation is a funny thing. I once had a fencing coach who ate cat food because it was cheap and the school didn't pay him a living wage.

Scoundreller · on June 8, 2023

At the low end of the market (had a cat that lived 19y on this stuff), Meow Mix original is 3500 calories per kg. A 10kg bag costs $35 in Canadian dollars or about US$28.

So just a couple dollars a day for an adult.

SketchySeaBeast · on June 8, 2023

Well sure, but it'll also keep me clear of urinary crystals.

canucker2016 · on June 8, 2023

each Red Bull has 1000mg of taurine (source: https://www.redbullsingapore.com/products)