Production Engineering at a Trading Firm
¥27 billion in 90 seconds. $440 million in 45 minutes. When every order is sacred, "99.99% reliability" is a number that hides the catastrophes — and monitoring inverts almost every instinct a normal SaaS engineer has.
The Job
Production engineering — also called DevOps or SRE in other shops — is the discipline of keeping software running when the world misbehaves: hardware fails, data centers lose cooling, fiber gets cut, exchanges have their own outages, and engineers occasionally ship bugs. At a trading firm, that ordinary problem statement collides with an unusual environment. The system you're protecting takes recent market data, decides what to buy or sell, sends an order through an order engine to an exchange, and books the resulting position. It does this thousands of times a second. Every single order is connected to a bank account. That changes everything downstream — what you alert on, who you alert, how much you trust an "okay" signal, and what counts as fixing a problem.
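As a concrete (and heavily simplified) picture of that loop, here is a minimal Python sketch. Every name in it (Tick, Order, decide, OrderEngine, book) is a hypothetical illustration, not Jane Street's actual code:

    from dataclasses import dataclass

    @dataclass
    class Tick:
        symbol: str
        price: float

    @dataclass
    class Order:
        symbol: str
        side: str   # "buy" or "sell"
        qty: int
        price: float

    class OrderEngine:
        def send(self, order: Order) -> Order:
            # Real life: a wire protocol to the exchange, acks, rejects, partial fills.
            return order

    def decide(tick: Tick) -> Order | None:
        # Placeholder strategy: buy one lot whenever the price looks cheap.
        if tick.price < 100.0:
            return Order(tick.symbol, "buy", 1, tick.price)
        return None

    positions: dict[str, int] = {}

    def book(fill: Order) -> None:
        # Book the resulting position, which is backed by a real account.
        delta = fill.qty if fill.side == "buy" else -fill.qty
        positions[fill.symbol] = positions.get(fill.symbol, 0) + delta

    engine = OrderEngine()
    for tick in [Tick("ABC", 99.5), Tick("ABC", 101.0)]:  # thousands per second in production
        order = decide(tick)
        if order is not None:
            book(engine.send(order))
    print(positions)  # {'ABC': 1}

The point of the sketch is the shape: every arrow in it touches money, so every arrow needs monitoring.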
Four Features That Change Everything
The trading production environment differs from SaaS in four concrete ways. Each one reshapes how you build, monitor, and respond.
Every Order Is Important
A web server can drop one request in ten thousand and call it 99.99% reliability. A trading firm cannot. The 0.01% you missed is the one that bankrupts you. And it isn't only orders — instrument symbology, prices, multipliers, quoting conventions all sit at the same level. Get any single field wrong, and the system you built to make money is now built to lose it. Compliance compounds the bar: regulators don't accept "near-perfect" reporting either.
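The arithmetic makes the point brutally. A quick sketch, where the order rate is an illustrative assumption rather than a Jane Street figure:

    # What "99.99% reliable" means at trading volumes.
    # The order rate is an illustrative assumption.
    orders_per_second = 2_000
    session_seconds = 6.5 * 3600      # one US equities trading day
    drop_rate = 1e-4                  # the 0.01% a SaaS shop shrugs off

    mishandled = orders_per_second * session_seconds * drop_rate
    print(f"{mishandled:.0f} mishandled orders per day")  # 4680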
Trading Is Genuinely Scary
Connecting computer programs directly to your bank account and asking them to operate in a tight loop is, as Mark puts it, a scary sentence even before the logic gets complicated. Prices come with multipliers, multipliers shift between pre- and post-trade contexts, and the surface area of "things that have to be exactly right" grows quickly. The trading environment also has a unique enemy: adverse selection. The moment you place a bad trade, every other market participant wants to take the other side as fast as they can. Mistakes don't sit politely waiting to be undone.
Timing Matters, All the Time
The trading day is shaped like a barbell. Volume spikes at the open and close; the in-between is steadier. A surprising fraction of total volume happens in those small windows. Layered on top is a rhythm of scheduled events — FOMC minutes, earnings, central-bank announcements — and unscheduled ones: a tweet, an invasion, a tariff. Underneath all of it is a fragile time-anchored daily flow: 7:30 AM metadata downloads, feed crossover at 9:00, staged dependency loads — a two-hour ticking clock where any slowdown means you scramble to make the open.
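That flow is concrete enough to check mechanically. A minimal sketch, where the step names and times mirror the examples above and everything else is assumed:

    from datetime import datetime

    # Sketch of a time-anchored pre-open flow with a hard deadline.
    OPEN = datetime(2024, 1, 2, 9, 30)

    steps = [
        ("download instrument metadata", datetime(2024, 1, 2, 7, 30)),
        ("feed crossover",               datetime(2024, 1, 2, 9, 0)),
        ("load staged dependencies",     datetime(2024, 1, 2, 9, 10)),
    ]

    def check_progress(now: datetime, done: set[str]) -> None:
        # Any step still pending past its anchor time eats into the margin
        # before the open: page early, while there is still time to react.
        for name, deadline in steps:
            if name not in done and now >= deadline:
                margin = OPEN - now
                print(f"LATE: {name!r}; {margin} left before the open")

    check_progress(datetime(2024, 1, 2, 9, 5), done={"download instrument metadata"})
    # LATE: 'feed crossover'; 0:25:00 left before the open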
Internal Users Are a Superpower
The most underrated feature, in Mark's view, is also the simplest: the users of Jane Street's technology mostly sit in the same building. There are no customer-support scripts, no "we're investigating" PR statements, no language to soften. A trader can say "spoos market data looks stale" and a tech responder, hearing the same vocabulary, can locate the exact order engine in seconds. High-bandwidth, no-BS communication isn't a perk. It's a load-bearing piece of incident response.
Two Cautionary Tales
Real incidents, real numbers. The lesson in both: do not count on the financial system to catch catastrophic orders before they fill.
Mizuho — December 8, 2005
A trader at Mizuho meant to sell 1 share at ¥610,000. He sold 610,000 shares at ¥1. He noticed instantly and tried to cancel — three times. The exchange refused every cancel. The order filled. Mizuho lost roughly ¥27 billion (~$225M), the Nikkei dropped sharply on the news, and Mizuho later sued the exchange and recovered partial damages. The reflexive response: "surely the system protects you from a fat-finger that bad." The lesson Mark draws: do not count on the financial system to save you. Build assuming the catastrophic order will go through if you let it leave the building.
Knight Capital — August 2012
Knight was one of the leading U.S. market makers. They had dead code that would lose money on every order if a certain config flag were ever set. Then they did two things at once: repurposed that flag to mean something new, and started sending it from another system. The new feature didn't have the bug — but the rollout missed one of eight servers. The market opened. The lone unupgraded server hemorrhaged money on every trade. Engineers misidentified the upgraded servers as the culprit and rolled them back — accelerating the bleeding by a factor of eight. Forty-five minutes after the open, ~$440 million was gone. Knight's stock collapsed; the firm was acquired soon after. The SEC's postmortem is now standard reading.
Monitoring, Inverted
The pedagogical center of the talk. Four principles that diverge sharply from standard SRE practice.
Event-Based, Not SLO-Based
The classic SLO alert — "page me when error rate exceeds 0.01%" — is exactly wrong for trading. If 100% of orders are important, then 99.99% reliability is just "we accept catastrophic loss every ten thousand orders." Jane Street does very little SLO-based alerting on live trading systems. Instead, code paths carry event-based alerts — explicit checks at the call sites that matter. The cost is enormous: every edge case has to be enumerated. The benefit: no important condition gets averaged into a healthy-looking dashboard.
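A minimal sketch of the contrast, with hypothetical names throughout: the SLO check averages a catastrophic order into a healthy-looking rate, while the event-based check fires at the exact call site that matters.

    # SLO-style: alert only when an aggregate rate crosses a threshold.
    def slo_check(errors: int, total: int, budget: float = 1e-4) -> bool:
        return errors / total > budget    # 1 bad order in 20,000 stays silent

    # Event-based: explicit checks where the order leaves the building.
    # `page` and the field names here are hypothetical.
    def page(msg: str) -> None:
        print(f"PAGE: {msg}")

    def send_order(symbol: str, qty: int, price: float, multiplier: float) -> None:
        # Every condition that matters is enumerated at the call site.
        if qty <= 0:
            page(f"non-positive qty {qty} for {symbol}")
            return
        if price <= 0 or multiplier <= 0:
            page(f"bad price/multiplier for {symbol}: {price} x {multiplier}")
            return
        ...  # hand off to the order engine from here

    print(slo_check(errors=1, total=20_000))                 # False: averaged away
    send_order("ABC", qty=-5, price=101.0, multiplier=50.0)  # PAGE fires immediately

The enumeration cost the section mentions is visible even here: every branch has to be written, argued about, and maintained by hand.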
Alert on Symptoms, Not Causes
The intuitive move is to alert when the database goes down. Mark argues you shouldn't. You'll need an alert on user-facing 500s anyway — there are a thousand reasons users see 500s, and the database is only one of them. Alerting on both gives you duplicate pages. And the assumption that "database down = bad" rots over time: you'll add maintenance windows, then ad-hoc maintenance, then layers of duct tape. Users care about the 500s, not the database. The debugging signal that "the database is down" belongs on a dashboard or attached as alert metadata — not as its own page in the middle of the night.
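Sketched minimally (the probe functions are hypothetical stand-ins): the page is tied to the symptom, and the cause rides along as metadata.

    # Page on the user-facing symptom; attach causes as context, not pages.
    def probe_500_rate() -> float:
        return 0.12          # fraction of user requests failing right now

    def probe_database_up() -> bool:
        return False

    def check() -> None:
        if probe_500_rate() > 0.01:
            # One page, for the thing users actually feel.
            context = {"database_up": probe_database_up()}  # debug signal rides along
            print(f"PAGE: users seeing 500s; context={context}")
        # Deliberately no separate page for the database: its state lives on a
        # dashboard and in the alert metadata above.

    check()  # PAGE: users seeing 500s; context={'database_up': False}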
Orthogonal, Epistemic Alerts
Mark's favorite alert at Jane Street is called feel too good. It fires when the firm has made too much money. It is not tied to any particular service. Market data could be wrong, instrument generation could be off, an order engine could be misbehaving, a strategy could have a bug — anywhere in the trading stack — and feel too good may catch it. A sibling alert: high % of market volume. If you're normally 3% of a market's daily volume and suddenly you're 60%, something is wrong. These are alerts on the firm's epistemic state — on the world looking weird — not on any one service's health.
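Both alerts reduce to simple comparisons against what "normal" looks like. A sketch with made-up numbers and thresholds:

    # Two epistemic alerts: the world looking weird, not a service being down.
    def feel_too_good(pnl_today: float, typical_daily_pnl: float, k: float = 10.0) -> bool:
        # Making far more than usual is as suspicious as losing it.
        return pnl_today > k * typical_daily_pnl

    def market_share_alarm(our_volume: int, market_volume: int,
                           normal_share: float = 0.03, k: float = 5.0) -> bool:
        return our_volume / market_volume > k * normal_share

    print(feel_too_good(pnl_today=2_400_000, typical_daily_pnl=150_000))   # True
    print(market_share_alarm(our_volume=600_000, market_volume=1_000_000)) # True: 60% vs a 3% norm

The simplicity is the feature: a check this coarse depends on almost nothing, so it keeps working when some layer underneath it is the thing that broke.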
Signal-to-Noise Is a Cultural Job
Event-based alerting can only work if everyone — traders, devs, production engineers — shares one belief: noisy alerts are worse than useless. Without that culture, every new edge case ships as a new page, and the on-call rotation drowns. Jane Street treats human attention as the scarce resource and spends real time tuning thresholds, retiring stale alerts, and arguing about edge cases until the right ones earn a page. The technique is unglamorous; the discipline is non-negotiable.
Defense in Depth
No single alert system is infallible. The answer is redundancy across independent vantage points — with monitoring itself as the most reliable layer.
Similar Alerts, Different Teams
What if the alert itself has a bug? The only honest answer is to run similar alerts in different systems written by different teams, keeping their underlying dependencies as separate as possible. The trading system has its own risk checks under a trader's eye. The order engine has its own. An external risk enforcer reviews activity after the fact and can apply its own veto. Three independent vantage points, three different codebases, three different on-call rotations. The weakness — named directly by Mark — is anything those layers share. So you scrutinize the shared dependencies, and sometimes you duplicate them.
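In miniature, the structure is a set of independently written predicates, any one of which can veto; the limits and the all-must-pass rule below are illustrative assumptions.

    # Three position checks, written as if by three teams, sharing nothing.
    def strategy_side_check(position: int) -> bool:
        return abs(position) <= 10_000            # trader-owned limit

    def order_engine_check(position: int) -> bool:
        return -12_000 <= position <= 12_000      # separately coded, separately owned

    def external_enforcer_check(position: int) -> bool:
        return abs(position) < 15_000             # after-the-fact reviewer with a veto

    def allowed(position: int) -> bool:
        # Any single layer can halt trading on its own.
        return all(check(position) for check in
                   (strategy_side_check, order_engine_check, external_enforcer_check))

    print(allowed(9_000))    # True
    print(allowed(13_000))   # False: vetoed (here by more than one layer)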
Tech Health and Trading Health, Side by Side
Engineers monitor whether software, hardware, and connectivity are working. Traders monitor whether the firm's market impact, P&L, and orders look right. The two views are deliberately orthogonal — and they catch each other's blind spots. A trader noticing that something feels wrong is often the first signal that something is wrong, even when every dashboard is green. The communication channel between them is the work; the coverage is the payoff.
Monitoring Is Not Optional
In some shops monitoring is the project you give the new joiner — if it crashes, no users notice. At a trading firm it cannot work that way. Jane Street's stance is blunt: don't trade unless it's being monitored. Monitoring systems are among the firm's most robust, most redundant, most-tested infrastructure. They are more reliable than the trading systems they watch. If it comes down to a choice, you'd rather the trading system go down than the monitoring system.
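One way to make that stance mechanical is a startup gate; the heartbeat scheme below is an assumption for illustration, not a description of Jane Street's actual mechanism.

    import time

    # "Don't trade unless it's being monitored," as a precondition.
    last_monitor_heartbeat = time.time() - 45    # the watcher last checked in 45s ago

    def trading_allowed(now: float, max_staleness: float = 30.0) -> bool:
        # If the monitoring side goes quiet, trading stops, not the reverse.
        return now - last_monitor_heartbeat <= max_staleness

    print(trading_allowed(time.time()))  # False: monitoring is stale, so no trading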
Business Context Cuts Both Ways
Internal users only matter if you can speak to them. A trader saying "spoos market data looks stale" is a complete sentence to a tech responder who knows that spoos means S&P 500 index futures, that those trade on the CME, and that the CME's matching engine is in Aurora, Illinois — so the order engine they want lives under an AUR prefix. Two seconds of recognition, not two minutes of lookups. The same logic runs in reverse: a technical staffer who knows that an FOMC announcement is in twenty minutes will hear a request to roll a system very differently than one who doesn't. The shared vocabulary is the speed.
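The vocabulary itself is small enough to write down. A sketch using only the facts from the example above; rows beyond the spoos entry are omitted rather than invented.

    # Shared vocabulary as a lookup table.
    SLANG = {
        "spoos": {
            "instrument": "S&P 500 index futures",
            "exchange": "CME",
            "matching_engine_site": "Aurora, Illinois",
            "engine_prefix": "AUR",
        },
    }

    report = "spoos market data looks stale"
    for slang, info in SLANG.items():
        if slang in report:
            # Two seconds of recognition, not two minutes of lookups.
            print(f"look at order engines under {info['engine_prefix']}")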
A Sample Incident, Minute by Minute
A composite of real Jane Street incidents — representative in shape. A market-data bug surfaced as an order-engine alert. The teaching point: an alert on the world looking weird found a problem no service-health check would have caught.
A routine, well-tested roll of an order engine completes. Bug fixes and a little refactoring — including, innocuously, a change to price serialization.
feel too good alerts begin firing across multiple symbols. Individual traders see one each; nobody yet sees the pattern.
The order-engines team receives a too many feel too goods meta-alert. The order engine automatically halts. The first overall picture forms: something is systemically off.
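A meta-alert of this kind can be as simple as counting correlated firings inside a window; the window, threshold, and halt mechanism below are assumptions.

    from collections import deque

    # Many correlated "feel too good" firings become one page and a halt.
    WINDOW_SECONDS = 60
    THRESHOLD = 5
    recent: deque[tuple[float, str]] = deque()
    engine_halted = False

    def on_feel_too_good(ts: float, symbol: str) -> None:
        global engine_halted
        recent.append((ts, symbol))
        while recent and ts - recent[0][0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= THRESHOLD and not engine_halted:
            engine_halted = True
            print(f"META-ALERT: {len(recent)} feel-too-goods in {WINDOW_SECONDS}s; halting engine")

    for i, sym in enumerate(["A", "B", "C", "D", "E"]):
        on_feel_too_good(ts=100.0 + i, symbol=sym)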
Order Engines opens an incident. Their what changed tool surfaces this morning's roll, including the price-serialization refactor. Plausible suspect — but the refactor went to many other engines too. Either alerting is right and the refactor is a red herring, or alerting is incomplete and other engines are bleeding silently.
Traders join the call. "Market data looks stale." They've been trading on opening-tick data — no fresh ticks arriving for a set of symbols. The price-serialization theory falls away instantly. Market Data joins.
Market Data investigates. They find an exchange-driven change (EDC) missed two weeks ago: an exchange added a new market-data partition; 20% of the symbol universe moved to it. Jane Street's feed does have logic for marking unsubscribed symbols as stale on a missed EDC — but that code path had never been exercised in production, and it had a bug. Traders received yesterday's closing prices, marked as live.
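The logic that failed is easy to state. A sketch of the intended behavior, with all structures hypothetical:

    # Symbols on a partition the feed never subscribed to must be marked
    # stale, not left at their last known price.
    subscribed_partitions = {"P1"}                      # the new "P2" was missed
    symbol_partition = {"ACME": "P1", "GIZMO": "P2"}    # some symbols moved to P2
    last_price = {"ACME": 101.2, "GIZMO": 55.0}         # GIZMO's is yesterday's close

    def snapshot(symbol: str) -> tuple[float, bool]:
        price = last_price[symbol]
        stale = symbol_partition[symbol] not in subscribed_partitions
        # The buggy version effectively returned stale=False here, so traders
        # saw yesterday's close marked as live.
        return price, stale

    print(snapshot("ACME"))   # (101.2, False): live
    print(snapshot("GIZMO"))  # (55.0, True): correctly flagged stale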
A config change subscribes the feed to the new partition. The engine is unhalted. feel too good alerts stop. Incident resolved.
The team resolved in fourteen minutes — fast. But all the damage was already done at 09:31. The real question isn't "could we have resolved faster?" It's "could we have caught this twenty minutes before the open?" Candidates: pre-open trading runs, additional EDC tooling, exercising the partition-handling code path on every release. In a trading firm, post-mortems live more in the minutes before the incident than the minutes after.
One Sentence to Carry
Every other engineering discipline tells you to optimize for the common case and accept the long tail. Trading inverts the contract: the long tail is where the firm dies, and the only durable defense is to alert on the world looking weird, then to make the alerter itself the most reliable thing you own.