It doesn't seem like it. These chain-of-thought models kinda broke the mold. https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda

What they did was to continue developing open-source LLMs, which were trained at great expense by others, right?
Skipping labeled data? Seems like a bold move for RL in the world of LLMs. I've learned that pure RL is slower upfront (trial and error takes time), but it eliminates the costly, time-intensive labeling bottleneck. In the long run, it'll be faster, more scalable, and way more efficient for building reasoning models, mostly because they learn on their own. https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-ways-to-use-it

The team at DeepSeek wanted to test whether it's possible to train a powerful reasoning model using pure reinforcement learning (RL). This form of "pure" reinforcement learning works without labeled data.
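To make the "no labeled data" point concrete, here's a minimal toy sketch of the idea (my own illustration, not DeepSeek's actual setup): the reward comes from an automatic verifier that checks the answer programmatically, and a REINFORCE-style update nudges the policy toward whatever gets rewarded. The two-strategy "policy", the `verify` check, and all parameter values here are hypothetical stand-ins; a real reasoning model would be an LLM, not a two-arm bandit.

```python
import math
import random

def verify(question, answer):
    """Rule-based reward: no human labels needed; the answer is
    checked programmatically (here: trivial addition)."""
    a, b = question
    return 1.0 if answer == a + b else 0.0

def train(steps=2000, seed=0):
    rng = random.Random(seed)
    # Toy "policy": preference weights over two answer strategies.
    prefs = {"compute": 0.0, "guess": 0.0}
    lr, baseline = 0.1, 0.5
    for _ in range(steps):
        q = (rng.randint(0, 9), rng.randint(0, 9))
        # Sample a strategy from a softmax over the preferences.
        weights = {k: math.exp(v) for k, v in prefs.items()}
        total = sum(weights.values())
        r, acc, choice = rng.random() * total, 0.0, None
        for k, w in weights.items():
            acc += w
            if r <= acc:
                choice = k
                break
        answer = sum(q) if choice == "compute" else rng.randint(0, 18)
        reward = verify(q, answer)  # trial and error, no labels anywhere
        # REINFORCE-style update: reinforce whatever earned reward.
        for k in prefs:
            grad = (1.0 if k == choice else 0.0) - weights[k] / total
            prefs[k] += lr * (reward - baseline) * grad
    return prefs

prefs = train()
print(prefs)  # the verifier-rewarded "compute" strategy should dominate
```

The slow part is visible even in the toy: early steps are mostly wasted guesses, and learning only happens through the reward signal. But nothing in the loop ever needed a human-written label, which is the bottleneck pure RL removes.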