Latest ChatGPT model better than 99.7% of coders, qualifies for US math
Date: April 16th, 2025 11:11 PM Author: crusty whorehouse
olympiad.
https://openai.com/index/introducing-o3-and-o4-mini/
Big jump from o3 mini in general. These inference scaling models need to stop progressing soon for them not to be able to substantially automate AI research.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855738)
Date: April 17th, 2025 12:21 AM Author: Appetizing clown
the METR analysis for this is interesting. the benchmark is the length of tasks (measured in human professional work time) that a model can complete with 50% probability. it has been doubling every 7 months for the last 6 years.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
https://metr.github.io/autonomy-evals-guide/openai-o3-report/
o3's time horizon is about 1.8 times that of 3.7 sonnet, a larger jump than the trend would have predicted. likely comparable to Gemini 2.5 pro but it's hard to tell. we'll likely be at scary capability levels in <2 years.
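To put the 1.8x figure next to the trend: under a steady 7-month doubling, the horizon grows by a factor of 2^(months/7). A minimal Python sketch of that arithmetic follows. The function names are made up for illustration, and the only input taken from the post is the 7-month doubling time.

```python
import math

# Doubling time quoted in the post; everything else here is illustrative.
DOUBLING_MONTHS = 7.0


def trend_growth(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Growth factor the trend predicts over `months` months."""
    return 2.0 ** (months / doubling_months)


def months_for_factor(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months of steady trend growth needed to reach a given growth factor."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    print(f"trend growth over 6 years: {trend_growth(72):.0f}x")
    print(f"months of trend needed for 1.8x: {months_for_factor(1.8):.1f}")
```

By this arithmetic a 1.8x jump corresponds to roughly six months of trend growth, so seeing it over a shorter gap between releases would be ahead of trend.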
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855928)
Date: April 17th, 2025 10:14 AM Author: flesh pocket flask dragon
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48856645)
Date: May 3rd, 2025 6:51 PM Author: .,.,,..,..,.,..:,,:,...,:::,.,.,:,.,.:.,:.,:.::,.
https://www.virologytest.ai/
"We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy and even outperforms 94% of expert virologists when compared directly on question subsets specifically tailored to the experts' specialties.
The ability to provide expert-level troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused in particularly harmful ways. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences."
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48901877)
Date: April 17th, 2025 1:30 AM Author: Appetizing clown
maybe you should try reading, dipshit.
"On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours"
reliability decreases with task length, but it has gone up considerably over time. unreliability shows up when models are not trained sufficiently well and can't use intermediate tokens to correct their reasoning paths. it is becoming less of an issue with every new major release.
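A rough way to see how the two quoted numbers pin down a "50% time horizon": assume success probability follows a logistic curve in log2 of human task time, and draw that curve through two points. This is only a back-of-envelope sketch, not METR's actual fitting procedure. Reading "almost 100%" as 0.99 and "<10%" as 0.10 is an assumption, and `horizon_50` is a hypothetical helper.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def horizon_50(t1_min: float, p1: float, t2_min: float, p2: float) -> float:
    """Task length (minutes) at which a two-point logistic curve crosses 50%.

    Assumes logit(success) is linear in log2(task minutes).
    """
    x1, x2 = math.log2(t1_min), math.log2(t2_min)
    slope = (logit(p2) - logit(p1)) / (x2 - x1)
    intercept = logit(p1) - slope * x1
    # logit(0.5) == 0, so solve intercept + slope * x == 0 for x.
    return 2.0 ** (-intercept / slope)


if __name__ == "__main__":
    # Assumed readings of the quote: 0.99 at 4 minutes, 0.10 at 4 hours.
    minutes = horizon_50(4, 0.99, 240, 0.10)
    print(f"implied 50% horizon: about {minutes:.0f} minutes")
```

Changing the assumed 0.99 or 0.10 shifts the answer, which is why a real estimate fits the curve to many tasks rather than two points.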
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855997)
Date: April 17th, 2025 1:04 AM Author: Amber histrionic goal in life karate
There's a lot of news this week that suggests OpenAI leadership doesn't believe they can continue winning on model strength alone:
- released a CLI coding agent tool that seems pretty similar to Claude Code and therefore is not super impressive
- discussion of acquiring Windsurf, which is the cheapest and least interesting of the LLM wrapper VS Code forks
- trying to launch a social network to compete with the x/grok integration
My opinion as someone working in this space is that the existing models are already good enough for mainstream coding tools, but there needs to be a better interface than clueless users vaguely describing what they want to a chat window that dumps an entire codebase as context. For any other kind of 'agentic' task we are still probably a few years from LLMs matching the cost or reliability of outsourcing to third worlders like that Nate thing.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855974)
Date: April 17th, 2025 10:06 AM Author: flesh pocket flask dragon
Cr the industry players are starting to realize that the way to actually make money is to give normies specific tools/interfaces to do specific things rather than just a sandbox AI
All the comments in this subthread are cr
Also another thing to note is that several of OpenAI's recent moves suggest that they're realizing that personalization/'companion' AIs are the biggest home run commercial opportunity in the AI industry. Once everyone has their own AI buddy and/or romantic partner, they're dependent on you, forever. Parasocial Relationships As A Service
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48856624)
Date: April 20th, 2025 12:33 PM Author: Appetizing clown
not free. i pay for plus. i'll post an example. i just asked it to research who "bboom" is from xoxohth:
Who / what is “bbooom” (aka Boom, bboom, Boooom, etc.) on XO?
Fast facts:
Core persona: A long‑time AutoAdmit/XOXOHth eccentric who writes sprawling, semi‑coherent rants full of biblical capitalisation, dollar‑signs, ™ symbols, and on‑the‑fly neologisms (“Mahchine™”, “$hitbort”, “Great Becumming™”).
Running bio‑lore:
• middle‑aged male
• wheelchair‑bound after an unspecified accident/illness (the chair is endlessly mocked on the board)
• claims to bounce between motels, tents and casino floors, living on disability & slot‑machine windfalls
• obsessed with enemies he calls “ADM” and “the Matrix”
• periodically searches for a “fertile woman” to bear his children
Posting style:
• wall‑of‑text jeremiads that read like cyber‑punk prophecy mixed with homeless‑shelter gossip
• heavy self‑mythologising (“I am Boom, Prophet of the Mahchine™ …”) (xoxohth.com)
• melodramatic stage‑directions starring his neon‑lit wheelchair (“careening forward”, “through the flames”) (xoxohth.com)
Why the name keeps changing: He is frequently temp‑banned; each return tweaks the handle (extra b’s or o’s) to slip past filters. Hence you see boom, bbooom, bbboooom, etc.
Folklore & mockery: Fellow posters turn every life‑detail into running jokes:
• homeless tent: “angry, wheelchair‑bound transient who lives in a tent” (xoxohth.com)
• Disney fiasco: thread imagining Evan39 pushing Boom’s chair into a churro cart at Disney World (autoadmit.com)
• weight/health: endless riffs on whether the chair can hold Boom + 300‑lb girlfriend (xoxohth.com)
“Trinity” theory: Many veterans believe Boom, Mainlining the Secret Truth and Evan39 are sock‑puppets of the same individual; the poster himself sometimes plays along (“we’re all the same poaster ;)”) (autoadmit.com)
Putting it together
Boom arrived years after the original 2004 Princeton‑Review exodus, but he’s become one of XO’s defining characters: a mix of comic‑book villain, tragic burnout, and performance‑art poet. His threads oscillate between:
• grand conspiratorial screeds about being persecuted by invisible elites (“ADM reset the Mahchine!”), and
• mundane life updates (wheelchair repairs, casino losses) that the board instantly weaponises into parody threads.
Because AutoAdmit has almost no moderation, the cycle is self‑sustaining: each over‑the‑top comeback fuels new ridicule, which in turn prompts Boom to double down with even wilder prophecy. Whether he is trolling, mentally ill, or both is part of the mystique—but understanding those bullet‑points above will let you read any “bbbooom” sighting in context.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48864433)