Infographics & Charts

langchain and fireworks just shipped the eval move worth stealing

Real prompt shared by @rohit4verse on X

Prompt

langchain and fireworks just shipped the eval move worth stealing: a fine-tuned qwen judge that flags "perceived error" on every production trace and runs up to 100x cheaper than opus.

the cost number gets the attention. the transfer result matters more. they trained the judge on one app, their docs q&a agent. then they pointed it at fleet, a separate product, with no retraining. it beat every frontier model on that domain. 90.8% against opus at 90.2%.

most evaluators break the second you move them to a new app, because the rubric is app-specific. "perceived error" travels because the signal is behavioral: the user corrects you, or repeats the request. that pattern holds across every product.

one design choice stands out. they fed the judge human and ai messages only and dropped every tool call. their bet is that the correction signal lives in the conversation itself.

anyone can rent the model in your loop. a judge trained on your own traces, cheap enough to run on all of them, is the moat they cannot buy.
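The design choice described above, keeping human and AI messages and dropping every tool call, can be sketched roughly as follows. This is an illustration only, not the pipeline LangChain or Fireworks actually use: the trace format (a list of dicts with `role` and `content`), the role names, and the helpers `conversation_only` and `render_for_judge` are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical trace format: a list of {"role": ..., "content": ...} dicts.
# The role names are assumptions for illustration, not a real library schema.
CONVERSATION_ROLES = {"human", "ai"}


def conversation_only(trace: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep human and AI messages; drop tool calls and any other role."""
    return [m for m in trace if m.get("role") in CONVERSATION_ROLES]


def render_for_judge(trace: List[Dict[str, str]]) -> str:
    """Flatten the filtered conversation into the text a judge model would read."""
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in conversation_only(trace))


example_trace = [
    {"role": "human", "content": "how do I stream tokens?"},
    {"role": "tool", "content": "search_docs(query='stream tokens')"},
    {"role": "ai", "content": "Use the batch endpoint."},
    # The behavioral signal: the user corrects the assistant.
    {"role": "human", "content": "no, I asked about streaming, not batch."},
]

judge_input = render_for_judge(example_trace)
```

In this sketch the user's correction in the last turn survives the filter while the tool call does not, which is the bet the post describes: the signal lives in the conversation itself.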

Originally by