The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the numbers that were put out? Is it for real, and are AI agents about to grab your mouse and shake your cursor? Weights and Biases' Weave: https://wandb.me/ai_explained

Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom Calls) and an updated NotebookLM. AI, without the hype.


AI Insiders: https://www.patreon.com/AIExplained

Chapters:
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update

New Claude: https://www.anthropic.com/news/3-5-models-and-computer-use
https://www.anthropic.com/research/developing-computer-use
Paper: https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf
Demo Diversion: https://x.com/AnthropicAI/status/1848742761278611504
https://www.youtube.com/watch?v=jqx18KgIzAE
o1 Comparison: https://openai.com/index/learning-to-reason-with-llms/
https://www.swebench.com/
Tau Bench: https://arxiv.org/pdf/2406.12045
OSWorld: https://arxiv.org/pdf/2404.07972
GSM Reasoning: https://arxiv.org/pdf/2410.05229
Sierra Valuation: https://www.theinformation.com/articles/bret-taylors-ai-agent-startup-nears-deal-that-could-value-it-at-over-4-billion?rc=sy0ihq
Claude Impressions: https://x.com/skirano/status/1848750867245133982
o1 System Card: https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf
NotebookLM: https://notebooklm.google/
Runway Act-One: https://runwayml.com/research/introducing-act-one
HeyGen Zoom: https://labs.heygen.com/interactive-avatar/vicky
Ministral Comparison: https://x.com/armandjoulin/status/1846581336909230255


My Coursera Course - The 8 Most Controversial Terms in AI: https://imp.i384100.net/m57g3M

Non-hype Newsletter: https://signaltonoise.beehiiv.com/

I use Descript to edit my videos (no pauses or filler words!): https://get.descript.com/ldgxfuj2bhnb

Many people expense AI Insiders for work. Feel free to use the Template in the 'About Section' of my Patreon.

https://www.patreon.com/AIExplained