I don't know if it's because it's early and my coffee isn't done yet, or the phrase is just too dense, or it's just me, or some combination thereof, but it took two reads of that phrase, a complete and thoughtful read of your comment, and then two more reads of the phrase before I could unpack it enough to make sense of it and of how your comment tied into it.
I'm leaning towards the coffee explanation, but the multiple negative (or negatively connoted) words probably didn't help...
No need to apologize, I agree with the assessment.
In the clear light of day, that comment was just me vomiting words onto the forum in a vain attempt to justify the effort I'd put into grokking that comment. :/
"The best architecture is no architecture" ... became a popular catchphrase for me in the early nineties. I built a streaming kind of trading system back then that integrated data flow and functional views, as some problems were easier to grok in one and harder in the other. Just thinking about things as simple functions: a = f(b) and code as a way of organising like with like worked for me.
Sometimes the object approach gets a bit silly. Organisation by the first parameter being special doesn't always work too well. For example, GAMS is a better classification for math than simply putting all vector or matrix code in the one classification. Curation is hard at the end of the day.
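A rough sketch of what I mean (the functions and namespaces below are invented for illustration): group plain functions by the problem they solve, GAMS-style, rather than by whichever type happens to be the first parameter.

    // Sketch only: free functions grouped by problem area ("like with like"),
    // not methods hung off whichever type comes first.
    #include <cstdio>
    #include <vector>

    namespace interpolation {                 // grouped by what it solves...
    double linear(double x0, double y0, double x1, double y1, double x) {
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
    }
    }

    namespace smoothing {                     // ...not "it takes a vector, so it's a Vector method"
    std::vector<double> ewma(const std::vector<double>& xs, double alpha) {
        std::vector<double> out;
        double acc = xs.empty() ? 0.0 : xs.front();
        for (double x : xs) { acc = alpha * x + (1.0 - alpha) * acc; out.push_back(acc); }
        return out;
    }
    }

    int main() {
        // Still just a = f(b): plain functions, organised by problem area.
        std::printf("%f\n", interpolation::linear(0.0, 0.0, 1.0, 10.0, 0.25));
        std::printf("%f\n", smoothing::ewma({1.0, 2.0, 3.0, 4.0}, 0.5).back());
    }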
It improved from there. We evolved unit tests and full system tests. A release would have to undergo a full simulation to show that it was not only problem free but that it still produced profitable results that looked correct. We had test/debug shims for glibc functions that might cause syscalls, such as time or direct/indirect locale use (for example via a printf), and these would error on use to stop sneaky, costly timing hits from creeping into the code.
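A rough sketch of the shim idea (not the actual code; the function set and messages are illustrative): the interposed symbols just fail loudly when anything in a test binary touches them.

    // Illustrative only. Build as a shared object and preload it over test runs:
    //   g++ -shared -fPIC -o libnosyscalls.so nosyscalls.cpp
    //   LD_PRELOAD=./libnosyscalls.so ./strategy_tests
    #include <cstdio>
    #include <cstdlib>

    // Interposes glibc's time(); time_t is long on x86-64 Linux.
    extern "C" long time(long*) noexcept {
        std::fprintf(stderr, "FATAL: time() called from latency-critical code\n");
        std::abort();
    }

    // Catches indirect locale use, e.g. pulled in via printf-family paths.
    extern "C" char* setlocale(int, const char*) noexcept {
        std::fprintf(stderr, "FATAL: setlocale() called in a latency-critical build\n");
        std::abort();
    }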
We did have one rather insidious error where a risk check was too strict and was preventing some orders from going out. It took us a few months to track down, as it made the system feel clumsy but it was still working; we assumed the market had changed a bit. It cost us a few million, and it was a mistake from our best software dev. It didn't change the fact that he was our best software dev (he's at Google now). The lesson learnt was to check for the positives and not just the errors.
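Checking for the positives amounts to alerting on missing activity, not just on explicit errors. A rough sketch with invented names and thresholds:

    // Sketch only: alert on the *absence* of expected activity. A too-strict
    // risk check that silently drops orders shows up as a low send ratio.
    #include <cstdio>

    struct ActivityStats {
        long signals_seen = 0;   // opportunities the strategy wanted to act on
        long orders_sent  = 0;   // orders that actually made it out past risk
    };

    // Returns true if activity looks healthy; false should page a human.
    bool positives_look_ok(const ActivityStats& s, double min_send_ratio = 0.5) {
        if (s.signals_seen == 0) return true;      // quiet market, nothing to judge
        double ratio = double(s.orders_sent) / double(s.signals_seen);
        if (ratio < min_send_ratio) {
            std::printf("ALERT: only %.0f%% of signals became orders (baseline %.0f%%)\n",
                        100 * ratio, 100 * min_send_ratio);
            return false;
        }
        return true;
    }

    int main() {
        positives_look_ok({1000, 120});   // e.g. a risk check quietly eating orders
    }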
We introduced a number of risk check evolutions. A separately developed shim was the final check on orders, making sure that, regardless of what risk had decided, the order made sense, e.g. not a zero price. This code was structured to be as independent as possible from the code in the main system. That saved our butt a few times. Adding timing throttles was also important to stop the system reacting in a way that would send a silly number of orders per second. We also evolved to giving the broker HTML web pages where they could view the risk in real time and control it if necessary. This often included integration with broker risk systems, taking their risk files and folding those constraints into our engines.
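A rough sketch of those last-line checks (all the limits are invented): a sanity check on each outgoing order plus a simple per-second throttle, deliberately independent of the main engine's logic.

    // Sketch only: last-line order sanity check plus a per-second throttle.
    #include <cstdio>
    #include <cstdint>

    struct Order { double price; int64_t qty; };

    // Does the order make sense at all, regardless of what risk already decided?
    bool order_is_sane(const Order& o) {
        if (o.price <= 0.0) return false;             // e.g. never send a zero price
        if (o.qty <= 0 || o.qty > 100000) return false;
        return true;
    }

    // Crude throttle: at most max_per_sec orders in any one second.
    struct Throttle {
        int64_t max_per_sec;
        int64_t current_sec = -1;
        int64_t sent_this_sec = 0;
        bool allow(int64_t now_sec) {
            if (now_sec != current_sec) { current_sec = now_sec; sent_this_sec = 0; }
            return ++sent_this_sec <= max_per_sec;
        }
    };

    int main() {
        Throttle t{100};
        Order o{0.0, 10};
        if (!order_is_sane(o) || !t.allow(/*now_sec=*/1)) std::puts("order blocked");
    }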
At a later firm, we had automated test systems that also ran performance tests on hardware like that which Metamako now provides. A code check-in would run not only the unit tests but also a suite of performance tests, each of which would reconfigure the network and run with external performance measurement on the network. This allowed us to track performance bumps of tens of nanoseconds back to specific code deltas. Very useful indeed. A slightly customised version of Graphite let us chart the performance of all components and tests over time.
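The gating logic itself was simple; the hard part was the measurement. A rough sketch with invented thresholds (in the real setup the samples came from external network measurement, not from the code under test):

    // Sketch only: compare a run's latency percentile against a stored baseline
    // and fail the check-in on a regression.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    double percentile(std::vector<double> ns, double p) {
        std::sort(ns.begin(), ns.end());
        size_t idx = static_cast<size_t>(p * (ns.size() - 1));
        return ns[idx];
    }

    int main() {
        std::vector<double> samples = {812, 820, 801, 835, 808, 990};  // ns, placeholder data
        double p99 = percentile(samples, 0.99);
        double baseline_p99 = 850.0;   // from the previous accepted build
        double budget_ns    = 20.0;    // allowed drift before we blame the code delta
        if (p99 > baseline_p99 + budget_ns) {
            std::printf("FAIL: p99 %.0fns vs baseline %.0fns (+%.0fns)\n",
                        p99, baseline_p99, p99 - baseline_p99);
            return 1;
        }
        std::puts("PASS");
    }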
Further to this, we evolved to replicating specific kernel, OS, and BIOS settings so that we could exactly reproduce a production system, and vice versa. Tuned BIOS settings and Linux kernels became important.
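A rough sketch of what replicated settings can mean in practice (the paths and expected values here are just examples): refuse to start if the host doesn't match the pinned tuning.

    // Sketch only: verify at startup that the host matches the pinned tuning.
    #include <cstdio>
    #include <fstream>
    #include <string>

    bool file_equals(const std::string& path, const std::string& expected) {
        std::ifstream in(path);
        std::string value;
        if (!(in >> value)) return false;
        return value == expected;
    }

    int main() {
        bool ok = true;
        // Example checks: CPU frequency governor and swappiness pinned to prod values.
        ok &= file_equals("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "performance");
        ok &= file_equals("/proc/sys/vm/swappiness", "0");
        if (!ok) { std::puts("host does not match pinned production tuning"); return 1; }
        std::puts("host tuning OK");
    }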
The risk controls and the unit and system tests are probably the most important things an HFT firm does. YOLO is very true.
The critical stuff was VHDL & C++. Mainly C++. Python & Perl around the edges for housekeeping. Can't overstate the role of bash, as there tends to be a lot of scripting of simple components.
Getting to production meant getting a branch to pass unit and system tests locally, then running in an acceptance test environment against the official test exchange. The test exchange wasn't always totally realistic; you'd typically have to mirror some captured production traffic to make it somewhat realistic. Also, some exchanges had slightly different production versus test versions. That trapped us once in Canada: some spaces were insignificant in the spec and on the official test exchange but not allowed in production. Uggh. No real way to test for that.
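To make the spaces trap concrete, a rough sketch with an invented field layout: the same message passes a lenient parser (like the test exchange's) but gets rejected by a strict production one.

    // Sketch only: lenient vs strict handling of trailing spaces in a field.
    #include <cstdio>
    #include <string>

    bool lenient_accepts(const std::string& field) {
        // only requires something non-blank once padding is trimmed, so "100  " looks fine
        return field.find_last_not_of(' ') != std::string::npos;
    }

    bool strict_accepts(const std::string& field) {
        // production: no spaces allowed anywhere in the field
        return !field.empty() && field.find(' ') == std::string::npos;
    }

    int main() {
        std::string qty = "100  ";   // trailing spaces the spec treated as insignificant
        std::printf("test exchange: %s, production: %s\n",
                    lenient_accepts(qty) ? "accepted" : "rejected",
                    strict_accepts(qty) ? "accepted" : "rejected");
    }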
Deployment evolved to a Linux repo where a yum command would summon the right versions & scripts. Convenient for rollbacks too.
Another aspect was testing the ML parameter set. This would typically be updated daily, and even though it wasn't a code change it was effectively like one, as it affected behaviour. ML parameter sets would have to pass profit simulation benchmarks to make it to production, which was often a challenge given the construction and testing on the grid needed to meet deadlines.
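A rough sketch of that gate, with invented metric names and thresholds: treat the daily parameter set like a code change and only promote it if the simulation clears the benchmark.

    // Sketch only: gate the daily ML parameter set on simulation results,
    // the same way a code change is gated on tests.
    #include <cstdio>

    struct SimResult {
        double pnl;            // simulated profit for the candidate parameter set
        double max_drawdown;   // worst peak-to-trough loss in the simulation
    };

    // Hypothetical benchmark: beat a pnl floor and stay under a drawdown cap.
    bool promote_to_production(const SimResult& r, double pnl_floor, double drawdown_cap) {
        return r.pnl >= pnl_floor && r.max_drawdown <= drawdown_cap;
    }

    int main() {
        SimResult candidate{125000.0, 18000.0};
        if (promote_to_production(candidate, /*pnl_floor=*/100000.0, /*drawdown_cap=*/25000.0))
            std::puts("parameter set promoted");
        else
            std::puts("parameter set rejected: rerun on the grid");
    }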