Hey there! In my latest blog post, I share some great reads and resources: tackling tech debt, lessons from MrBeast for engineers, a study on self-correcting language models, how our brain processes experiences during sleep, Signal app quirks, a visual guide to SSH tunneling, the importance of media product focus, respecting old methods, and a delightful Date-Me Chicken recipe. Enjoy!
- Paying down tech debt: further learnings: Using small tech-debt fixes as a way to get into a flow state, and why big rewrites need heavyweight support.
- What the world’s #1 YouTuber can teach you about being a badass software engineer: Surprisingly, the MrBeast playbook holds lessons for software engineers too.
- Training Language Models to Self-Correct via Reinforcement Learning: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks. (A toy sketch of the reward-shaping idea follows the list below.)
- Sleep on it: How the brain processes many experiences — even when ‘offline’: In a new study, Yale researchers uncover how the brain, during sleep, replays and bundles many of the experiences that occur in our waking hours.
- A Signal run a day keeps the un-link away: I have a couple of people who are best reachable on the Signal messaging app, but not that many. This exposes me to an awkward edge case of Signal’s design decisions: whenever I get a message on my phone and want to reply on my laptop, I discover that Signal has un-linked my laptop due to inactivity. It then won’t sync the message history from my phone, making it impossible to quote-reply to messages. (A small keep-alive sketch follows the list below.)
- Visual guide to SSH tunneling and port forwarding: A diagram-based walkthrough of how SSH tunnels work. (The standard forwarding commands are sketched after the list below.)
- Product is the core of a media brand: C-level media executives are focused on brand. And product should be at the core of that focus.
- “We’ve always done it this way” isn’t so bad after all: “We’ve always done it this way” may be obstinate inflexibility or it may be respect for the reasoning of the people who made the rules.
- Date-Me Chicken: This quick one-pan recipe combines sweet dates, savory shallots, and caramelized lemon with crispy-skinned chicken for an impressive yet easy weeknight dinner.
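
A few of the links above are technical enough that a quick sketch helps. First, the SCoRe paper: its abstract describes a two-stage recipe, where a first RL phase produces an initialization that resists collapse, and a second phase adds a reward bonus that favors improving between attempts. Here is a toy illustration of that reward-shaping idea; `check_answer`, `alpha`, and the two-attempt structure are my own simplifications, not the paper’s implementation:

```python
# Toy sketch of SCoRe-style reward shaping for two-attempt self-correction.
# NOT the paper's implementation: `check_answer` and `alpha` are illustrative
# stand-ins for a task-specific correctness check and a bonus weight.

def check_answer(attempt: str, reference: str) -> float:
    """Binary correctness signal; a real setup would use a verifier or tests."""
    return 1.0 if attempt.strip() == reference.strip() else 0.0

def score_reward(first_attempt: str, second_attempt: str, reference: str,
                 alpha: float = 0.5) -> float:
    """Reward the final answer, plus a bonus for improving on the first try.

    The bonus term pushes the policy toward genuine self-correction instead
    of simply producing the same high-reward answer twice.
    """
    r1 = check_answer(first_attempt, reference)
    r2 = check_answer(second_attempt, reference)
    return r2 + alpha * (r2 - r1)

# A correct revision of a wrong first attempt earns more than being right twice:
print(score_reward("41", "42", reference="42"))  # 1.5
print(score_reward("42", "42", reference="42"))  # 1.0
```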
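
Next, the Signal item: the workaround in the title is literally a Signal run a day, i.e., launching Signal Desktop regularly so the linked device never crosses the inactivity threshold. A minimal sketch, assuming the Linux binary name `signal-desktop` (adjust the command for macOS or Windows) and meant to be scheduled daily with cron or your OS’s task scheduler:

```python
# Minimal keep-alive sketch: launch Signal Desktop, let it sync for a few
# minutes, then quit. Assumes the binary is on PATH as `signal-desktop`
# (Linux); adjust the command for other platforms. Schedule daily via
# cron / launchd / Task Scheduler.

import subprocess
import time

SIGNAL_CMD = ["signal-desktop"]  # assumption: Linux binary name
SYNC_SECONDS = 300               # give the client time to fetch messages

def keep_signal_linked() -> None:
    proc = subprocess.Popen(SIGNAL_CMD)
    try:
        time.sleep(SYNC_SECONDS)
    finally:
        proc.terminate()         # ask Signal to shut down cleanly
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()          # fall back to a hard stop

if __name__ == "__main__":
    keep_signal_linked()
```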
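
Finally, as a companion to the SSH guide: the three standard OpenSSH forwarding modes that such guides diagram, wrapped in Python for consistency with the sketches above. The hosts and ports are placeholders; the `ssh` flags themselves are standard:

```python
# The three standard OpenSSH forwarding modes. Hostnames and ports are
# placeholders; -N means "forward only, run no remote command".

import subprocess

GATEWAY = "user@gateway.example.com"  # placeholder gateway/jump host

# Local forwarding: connections to localhost:8080 on THIS machine travel
# through the tunnel and reach intranet-db:5432 (as resolved from the gateway).
local = ["ssh", "-N", "-L", "8080:intranet-db.example.com:5432", GATEWAY]

# Remote forwarding: connections to port 9000 ON THE GATEWAY travel back
# through the tunnel to localhost:3000 on this machine.
remote = ["ssh", "-N", "-R", "9000:localhost:3000", GATEWAY]

# Dynamic forwarding: a local SOCKS proxy on port 1080 that routes any
# destination through the gateway.
dynamic = ["ssh", "-N", "-D", "1080", GATEWAY]

if __name__ == "__main__":
    # Run exactly one of these; each blocks while the tunnel is open
    # (add "-f" to background the ssh process instead).
    subprocess.run(local)
```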