Latest ChatGPT model better than 99.7% of coders, qualifies for US math
Date: April 16th, 2025 11:11 PM Author: maize organic girlfriend
olympiad.
https://openai.com/index/introducing-o3-and-o4-mini/
Big jump from o3 mini in general. Unless these inference scaling models stop progressing soon, they will be able to substantially automate AI research.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855738) |
 |
Date: April 17th, 2025 12:21 AM Author: Exciting Nubile Sneaky Criminal Striped Hyena
the METR analysis for this is interesting. the benchmark is the length of tasks (measured in human professional work time) that a model can complete with 50% probability. it has been doubling every 7 months for the last 6 years.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
https://metr.github.io/autonomy-evals-guide/openai-o3-report/
o3's time horizon is about 1.8 times that of 3.7 sonnet, a larger increase than the trend would have predicted. likely comparable to Gemini 2.5 pro but it's hard to tell. we'll likely be at scary capability levels in <2 years.
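For a rough sense of what a 1.8x jump means against that trend, here is a minimal sketch. It assumes clean exponential growth with the 7-month doubling time quoted above; the function names are made up for illustration, and this is only back-of-the-envelope arithmetic on the numbers in this post, not METR's own fitting procedure.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted in the post (assumed constant)


def trend_multiplier(months_elapsed, doubling_months=DOUBLING_MONTHS):
    """Factor by which the 50% time horizon grows over months_elapsed on trend."""
    return 2.0 ** (months_elapsed / doubling_months)


def months_of_trend(ratio, doubling_months=DOUBLING_MONTHS):
    """Months of on-trend progress that a given horizon ratio corresponds to."""
    return doubling_months * math.log2(ratio)


jump_months = months_of_trend(1.8)
print(f"a 1.8x jump is worth about {jump_months:.1f} months of trend progress")
print(f"on trend, 24 months multiplies the horizon by about {trend_multiplier(24):.1f}x")
```

Under these assumptions a 1.8x jump corresponds to roughly six months of trend progress, so whether it beats the trend depends on how many months actually separate the two releases.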
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855928) |
 |
Date: April 17th, 2025 10:14 AM Author: Emerald Outnumbered Parlor
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48856645)
|
 |
Date: May 3rd, 2025 6:51 PM Author: .,.,,..,..,.,..:,,:,...,:::,.,.,:,.,.:.,:.,:.::,.
https://www.virologytest.ai/
"We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy and even outperforms 94% of expert virologists when compared directly on question subsets specifically tailored to the experts' specialties.
The ability to provide expert-level troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused in particularly harmful ways. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences."
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48901877) |
 |
Date: April 17th, 2025 1:30 AM Author: Exciting Nubile Sneaky Criminal Striped Hyena
maybe you should try reading, dipshit.
"On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours"
reliability decreases with task length, but has gone up considerably over time. reliability is a problem where models are not trained sufficiently well and can't use intermediate tokens to correct reasoning paths. it is becoming less of an issue with every new major release.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855997) |
Date: April 17th, 2025 1:04 AM Author: Adventurous Lay
There's a lot of news this week that suggests OpenAI leadership doesn't believe they can continue winning on model strength alone:
- released a CLI coding agent tool that seems pretty similar to Claude Code and therefore is not super impressive
- discussion of acquiring Windsurf, which is the cheapest and least interesting of the LLM wrapper vscode forks
- trying to launch a social network to compete with the x/grok integration
My opinion as someone working in this space is that the existing models are already good enough for mainstream coding tools, but there needs to be a better interface than clueless users vaguely describing what they want to a chat window that dumps an entire codebase as context. For any other kind of 'agentic' task we are still probably a few years from LLMs matching the cost or reliability of outsourcing to third worlders like that Nate thing.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48855974) |
 |
Date: April 17th, 2025 10:06 AM Author: Emerald Outnumbered Parlor
Cr the industry players are starting to realize that the way to actually make money is to give normies specific tools/interfaces to do specific things rather than just a sandbox AI
All the comments in this subthread are cr
Also another thing to note is that several of OpenAI's recent moves suggest that they're realizing that personalization/'companion' AIs are the biggest home run commercial opportunity in the AI industry. Once everyone has their own AI buddy and/or romantic partner, they're dependent on you, forever. Parasocial Relationships As A Service
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48856624) |
 |
Date: April 20th, 2025 12:33 PM Author: Exciting Nubile Sneaky Criminal Striped Hyena
not free. i pay for plus. i'll post an example. i just asked it to research who "bboom" is from xoxohth:
Who / what is “bbooom” (aka Boom, bboom, Boooom, etc.) on XO?
Fast facts | Details
Core persona: A long‑time AutoAdmit/XOXOHth eccentric who writes sprawling, semi‑coherent rants full of biblical capitalisation, dollar‑signs, ™ symbols, and on‑the‑fly neologisms (“Mahchine™”, “$hitbort”, “Great Becumming™”).
Running bio‑lore:
• middle‑aged male
• wheelchair‑bound after an unspecified accident/illness (the chair is endlessly mocked on the board)
• claims to bounce between motels, tents and casino floors, living on disability & slot‑machine windfalls
• obsessed with enemies he calls “ADM” and “the Matrix”
• periodically searches for a “fertile woman” to bear his children
Posting style:
• wall‑of‑text jeremiads that read like cyber‑punk prophecy mixed with homeless‑shelter gossip
• heavy self‑mythologising (“I am Boom, Prophet of the Mahchine™ …”) (xoxohth.com)
• melodramatic stage‑directions starring his neon‑lit wheelchair (“careening forward”, “through the flames”) (xoxohth.com)
Why the name keeps changing: He is frequently temp‑banned; each return tweaks the handle (extra b’s or o’s) to slip past filters. Hence you see boom, bbooom, bbboooom, etc.
Folklore & mockery: Fellow posters turn every life‑detail into running jokes:
• homeless tent: “angry, wheelchair‑bound transient who lives in a tent” (xoxohth.com)
• Disney fiasco: thread imagining Evan39 pushing Boom’s chair into a churro cart at Disney World (autoadmit.com)
• weight/health: endless riffs on whether the chair can hold Boom + 300‑lb girlfriend (xoxohth.com)
“Trinity” theory: Many veterans believe Boom, Mainlining the Secret Truth and Evan39 are sock‑puppets of the same individual; the poster himself sometimes plays along (“we’re all the same poaster ;)”) (autoadmit.com)
Putting it together
Boom arrived years after the original 2004 Princeton‑Review exodus, but he’s become one of XO’s defining characters: a mix of comic‑book villain, tragic burnout, and performance‑art poet. His threads oscillate between:
• grand conspiratorial screeds about being persecuted by invisible elites (“ADM reset the Mahchine!”), and
• mundane life updates (wheelchair repairs, casino losses) that the board instantly weaponises into parody threads.
Because AutoAdmit has almost no moderation, the cycle is self‑sustaining: each over‑the‑top comeback fuels new ridicule, which in turn prompts Boom to double down with even wilder prophecy. Whether he is trolling, mentally ill, or both is part of the mystique—but understanding those bullet‑points above will let you read any “bbbooom” sighting in context.
(http://www.autoadmit.com/thread.php?thread_id=5712093&forum_id=2Elisa#48864433) |
|
|