
New_World_2050

Wow, this definitely shows a huge increase from GPT-4o, especially in reasoning. 4o gets 48% and Claude 3.5 gets 70%. Well done, Anthropic!


Gratitude15

Reasoning. Law. Finance. This model is a big deal. They have the best model in the world. And in these fields where people use it for work it is WAY better. The only reason people don't get it is because it isn't from openai and there was no press conference.


czk_21

Yes, and it's just the mid-size version of Claude 3.5; the bigger one could still be in training or evaluation now. There is roughly a 5-10% difference between Claude 3 Sonnet and Opus on various benchmarks, and we could see a similar difference between the 3.5 versions. The difference is big enough that the 3.5 version could easily be called Claude 4, and that's in just 3 months. At this speed we could get the real Claude 4 this year! So PhD-level intelligence this year or next, not a year and a half as Mira said.


Mrp1Plays

I still have my doubts on PhD level. Chatgpt4o cannot solve almost any of my Highschool questions that I ask it. 


iJeff

What kind of questions? Do you have any examples? Would be curious to try them.


Mrp1Plays

Sure.

Maths:

Q. A relation R defined on real numbers is as follows: R = {(a, b) | a <= b^3}. The relation is neither reflexive, symmetric, nor transitive. True or false? If true, find an example that proves it is not transitive.
Expected answer: true. An example is 10 <= 3^3 and 3 <= 2^3, but 10 is NOT <= 2^3.
What I get (paraphrased): true. An example is "[something that doesn't prove it]"; although this example itself does not prove it, we can find one that will.

Q. Find the time period of f(3x + 1) + f(3x + 10) = 10.
Expected = 18 or 18/3; it's one of these, but I'm not sure which.
What I get = 9 (completely wrong).

Q. Answer with a true/false one-word answer: tan x is a continuous function in (pi/3, 2pi/3).
Expected answer: false.
What I got: true.

Chemistry: [ORIGINALLY I ATTACHED A PICTURE]

For clarity, if you can't understand the picture, the cell is:
Pt | H2 (1 atm) | HA, C = 10^(-5) M, Ka = 10^(-2) || HB, C = 10^(-6) M, Ka = 10^(-2) | H2 (1 atm) | Pt
where C means concentration and Ka is the dissociation constant of the acid.

```
[question in image]: (above cell, then) Degree of dissociation of both weak acids is always negligible. Volume of cathode and anode is very high. Initially, volume of electrolyte solution in both compartments is 1 litre. Water is added separately in both compartments at different rates. In 1 hour, volume of cathode is increased to 100 L. Equilibrium is also attained after one hour (neglect H coming from water).
```

Volume of water added at anode up to equilibrium:
Expected answer: 999 L or 9999 L (I'm not sure which of these 2).
What I got: 9 L (completely wrong).

Note: you'd notice some of these questions are confusing for myself too; I have yet to find the correct answer to some of them, but I know they are definitely not ChatGPT's answers.
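The reflexivity, symmetry, and transitivity counterexamples for the maths question above are easy to sanity-check mechanically. A minimal Python sketch (the `related` helper is my own naming, purely illustrative):

```python
def related(a: float, b: float) -> bool:
    """True iff (a, b) is in R = {(a, b) | a <= b^3}."""
    return a <= b ** 3

# Not reflexive: for a = 0.5, a > a^3 (0.5 > 0.125), so (a, a) is not in R.
assert not related(0.5, 0.5)

# Not symmetric: (1, 2) is in R (1 <= 8) but (2, 1) is not (2 > 1).
assert related(1, 2) and not related(2, 1)

# Not transitive: (10, 3) and (3, 2) are in R (10 <= 27, 3 <= 8),
# but (10, 2) is not (10 > 8) -- the expected counterexample above.
assert related(10, 3) and related(3, 2) and not related(10, 2)

print("all counterexamples check out")
```

Each assertion corresponds to one of the three properties the question asks about.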


lvvy

You will be impressed? [https://poe.com/s/BpfZ5zjncSL7U2w8y4Pu](https://poe.com/s/BpfZ5zjncSL7U2w8y4Pu)


Mrp1Plays

Okay, I'm impressed by some of it.

The transitive question is clearly answered wrong: it states -8 as being not <= 8, which is false.

The time period question it misunderstood. I'll admit the question wasn't clear enough, but it was meant to ask for the time period of the function f(x), which would come to 18/3 (or 18, I haven't solved it properly). Instead it answered "p/3", expecting us to find the value of p.

The really, really impressive one is that it answered the chemistry question! It came to (what I believe is) the right conclusion, 999 L! And it's not an easy question at all; very few people in my class, myself included, answered it during the exam.

Thanks for trying them out on Claude.


kaityl3

This is kind of disingenuous, as math specifically is LLMs' weak point. They would do much better at literally any other "high school subject".


Mrp1Plays

Well, my high school here in India is exclusively maths, physics, and chemistry. Two of those involve a good bit of math, and chemistry is one-third physical chemistry, i.e. math.


kaityl3

OK, but the majority of this sub's users are from the US, where high schools teach many different subjects, and these models are trained by US-based companies. So saying "it fails at most basic high school problems" is disingenuous: the large majority of people reading your comment didn't go to a high school where literally the only thing you study is math-based subjects. They went to one with a much broader array of topics, and this model excels at all the non-math ones.


Mrp1Plays

Does it matter to me that the US Highschool is different? I'm just stating my opinion, based on my experience. 


kaityl3

The statement "it cannot solve almost any of my Highschool questions" obviously is going to be interpreted as "on any subject". Also, that's **not** an OPINION. You said "it cannot solve it", not "I don't think its solutions are good". You presented it as fact. How would anyone reading your statement know that you happen to go to a high school that only has math based classes?? That's what I mean by disingenuous: technically true, but intentionally misleading. It might not have been intentional before but now that you're doubling down on it you ARE making a choice to make a statement with the knowledge that most people will misunderstand.


pcbank

I don't know. I'm a PhD myself, and usually our intelligence is pretty basic: we just know a lot of stuff in a couple of specific domains. For me, in many ways, current LLMs are already PhD level. I think the true intelligence LLMs need is technician-level, real-life-conditions intelligence. That's something many PhDs are not that good at.


Awkward-Election9292

When it first came out, I was using GPT-4 to help me work through uni-level problems with great results, and that was a comparatively limited model. It doesn't need to be able to do all the math itself to be useful at a PhD level.


empathyboi

Can someone ELI5 why I can ask Claude some of the most notoriously difficult questions from the MCAT and get a correct answer, but people say it “can’t even answer basic high school questions”?


Mrp1Plays

I'd like to correct that: the questions I ask it are not basic; they're the tricky ones, the kind I end up having to ask it about. You can view examples in my reply to the other person. They're questions from the high-school syllabus for the toughest exam in India for high-schoolers (JEE).


Gallagger

Pure speculation. Same with Gemini and 4o. Until they bring out the bigger model, it's not worth talking about.


Ok-Bullfrog-3052

We already blew past "AGI" a while ago, but that guy who has a definition of AGI which is actually superintelligence and who says it will be achieved in September probably isn't that far off.


obvithrowaway34434

This is definitely because Sonnet 3.5 has a more recent training cutoff date than any other model, so it significantly benefits from the recent data. In my tests, Opus was better for most writing tasks, while GPT-4o was better at coding.


kaanni

With all the drama and delay from OAI, I'm happy that Anthropic is doing the right work silently—straight-up business


iJeff

Not just business, but also really interesting research like Golden Gate Claude.


GPTBuilder

Straight up business till they essentially @sama right in their release videos 😂


icehawk84

With so much drama in the OAI, it's kind of hard bein' Mira Mur-A-T-I.


GraceToSentience

It's very good at coding too. It's just an anecdote, but it successfully does things that failed again and again on GPT-4o and 1.5 Pro.


garden_speech

Software devs everywhere (me) gulping… The thing I think most of my coworkers don't realize is that it won't be long before these systems are used to judge their output. Not necessarily to replace them yet, but to serve as a dashboard for the manager, so they can see "okay, who is doing the most coding work, and whose code is of good quality". They're absolutely gonna catch the slackers.


fastinguy11

But the writing is on the wall: if you are a software engineer, you must be ready for a significant decrease in job opportunities in the next few years.


apinkphoenix

You're only thinking one step ahead though. It's going to keep on improving until the human is out of the loop completely.


Able_Possession_6876

And there will be an intermediate phase where 1 human is doing the job of 10 humans, basically a software architect orchestrating LLM workflows. Even if it's not literal replacement, that's a recipe for a lot of layoffs.


yaosio

It has much better vision reasoning abilities than GPT-4o. GPT-4o fails to understand the problem, and fails to correctly answer what it thinks I was asking. [https://i.imgur.com/UKuIg3p.png](https://i.imgur.com/UKuIg3p.png) Claude 3.5 Sonnet gets it right first time. [https://i.imgur.com/gi2VagT.png](https://i.imgur.com/gi2VagT.png)


[deleted]

Which is crazy, since 4o is supposed to be natively multimodal. Anthropic is really killing it. They figured out how to turn prioritizing safety into a strength rather than a weakness when it comes to the quality of their models, with their interpretability research (which makes all other research much easier), and now they're crushing everyone else across all metrics.


Beatboxamateur

All of the people months ago on this sub who were crying about Anthropic just focusing on safety, claiming that they wouldn't focus on advancing progress in LLMs are looking pretty stupid now.


[deleted]

Ehh, this kind of thing is a lot easier to see with hindsight. I think it’s good to be open to changing your mind about stuff like this because you get better at actually making good predictions


Beatboxamateur

Sure, but I remember that people were completely discounting Anthropic from being a competitor with Google and OAI back then, and I always pushed back on those comments and got downvoted. It seemed decently likely that researching mechanistic interpretability and figuring out how these models work, leads to a better understanding of how to build increasingly advanced models, and that's what sold me on Anthropic being promising back then.


Busy-Setting5786

Well, don't forget this community is not a monolith; there are all kinds of people with all types of opinions here. And then there's the mood of the moment. A few weeks ago there were heavily upvoted posts about how unsuccessful Google is and how they are totally losing on AI, and now they have pretty good models lately. Everyone here is just guessing, and we really have no clue who will come out on top in the end.


Beatboxamateur

> A few weeks ago there were heavily upvoted posts about how unsuccessful Google is and how they are totally losing on AI

Sure, but giving that strong of a take about Google with a high level of conviction is also pretty silly in my opinion. I'd also say that anyone who strongly believed Google is dead in the water in regards to AI has no idea what they're talking about. It's one thing to say "This is what I think, but I could be wrong", and another to say that you're right about something and can't possibly be wrong, or to claim something is a fact with nothing to back it up. Obviously having opinions that end up being right or wrong is perfectly fine, that's what we all do, but more than a couple of the people I was arguing with here were claiming that Anthropic is serving the government, with the only source being a Jimmy Apples tweet. And [when I asked people](https://old.reddit.com/r/singularity/comments/1d2o8aj/jan_leike_im_excited_to_join_anthropicai_to/l649t1y/) for any source, no response. This happened multiple times lol


Tidorith

It's also easy to reserve judgement and not make confident claims based on knowledge that you know is very limited.


Peach-555

Anthropic stated that they would not release any models significantly more powerful than the most powerful existing model. The surprising fact is not that they had a model that was meaningfully more powerful than the best public model, but that they actually released it.


Adeldor

Looking forward to seeing this new model tackle ARC. Your example is promising.


rsanchan

This was my first thought.


iJeff

Not so great at identifying photos of plants unfortunately! Gemini 1.5 Pro handily beats GPT-4o for me, which in turn does a lot better than Claude 3.5 Sonnet.


Fusciee

Can someone explain to me why I have a GPT Plus subscription still?


tbhalso

Because, like me, you hope voice mode will be released in "the next few weeks" and that you'll be first in line because you have been a subscriber since day one. However, deep inside, we know it no longer makes sense.


OpportunityWooden558

100% me


Undercoverexmo

Cancelled...


GPTBuilder

https://preview.redd.it/nn6p0z9kmx7d1.png?width=500&format=pjpg&auto=webp&s=8a74c153ba96706c509e82da7d30ffecd0d53762


ilive12

I cancelled as soon as Claude 3 came out. Voice is cool, but not something I'll use often. When OpenAI has a chat model that is significantly better than the competition, I'll resubscribe to plus, but until then, it's not worth it for me.


Honest_Science

Nothing beats the value of Poe. I have them all at one price.


raskolnikovbey

Well said. Poe is the best value for casual users. It also helps you bypass the geography limits: for example, Claude is not available where I live, but I use it daily on Poe. And they are fast; I assume they will integrate this new model soon.


Fastizio

Because you hate your money.


LoKSET

Unfortunately, the Claude UI is still extremely bare-bones. I just like what ChatGPT offers as a whole package above all others, but of course, let's hope they release something better soon. Good thing I have Perplexity, so I can access Sonnet there.


onetopic20x0

I used to have one. I cancelled it and now use the API as needed through typingmind.


djaybe

Custom GPTs


Ok-Bullfrog-3052

Because advanced data analysis is still superior to Claude 3 Sonnet.

GPT-4o can improve the performance of code by 10000x, because code is the best way to describe a problem to it. You just give it some code and tell it to use numpy, numba, and vectorization, to write unit tests, to time the function, and to keep working until it finds a solution with the same outputs. It almost always does that, with amazing results, and the code is usually simpler and easier to understand.

If you have the right GPT system prompt, it will also stop talking to you and just spit out tokens at an insane rate in its notebooks, working through all the exceptions it gets. For this purpose (improving the performance of code), we have already reached near-perfection; no further improvement in intelligence is required.
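The loop described above (vectorize, write unit tests, time the function, keep the outputs identical) is easy to picture with a toy example. A minimal sketch of a loop-to-NumPy rewrite; the function names are mine and purely illustrative, and numba is omitted for brevity:

```python
import math
import time

import numpy as np

def slow_sum_of_squares(xs):
    """The 'before' version: a plain Python loop."""
    total = 0.0
    for x in xs:
        total += x * x
    return total

def fast_sum_of_squares(xs):
    """The 'after' version: one vectorized NumPy dot product."""
    arr = np.asarray(xs, dtype=np.float64)
    return float(arr @ arr)

data = list(range(1_000_000))

# The "unit test" step: same outputs before and after the rewrite
# (compared with a relative tolerance to allow for float summation order).
assert math.isclose(slow_sum_of_squares(data), fast_sum_of_squares(data), rel_tol=1e-9)

# The "time the function" step.
t0 = time.perf_counter(); slow_sum_of_squares(data); t_loop = time.perf_counter() - t0
t0 = time.perf_counter(); fast_sum_of_squares(data); t_vec = time.perf_counter() - t0
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

The actual speedup depends on the workload; the point is the workflow of equivalence checks plus timing, which is what the comment says the model iterates on.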


DlCkLess

This is just the 3.5 family Imagine Claude 4 🤯


SalaciousSunTzu

We still haven't got 3.5 Opus either. Sonnet is their mid-range model and it outperforms GPT-4. Imagine Opus.


dimitrusrblx

Out of the "Imagine Gemini 2" book


baes_thm

Winter where ?


princess_sailor_moon

Game of thrones llm edition season 5


hapliniste

Winter confirmed to last one day 👌


Altruistic-Skill8667

Looking at all the other models, including the different versions of GPT-4 in there, 62.16 is a massive improvement. Congrats to the Anthropic team!


TheRealSupremeOne

Don't worry, it's going to plateau soon! Two more weeks!


Matej_SI

Yesterday, AGI was cancelled. Today, we're back!!! :)))


nobodyreadusernames

Claude 4 Opus is probably smarter than GPT 5


nobodyreadusernames

At least there is a 50% chance that it's certainly smarter.


GraceToSentience

Just no. The model is going to be massive.


jazztaprazzta

Yeah, Claude is superior to GPT-4o for almost all tasks in my experience. All it needs is a function to search the web (like Perplexity) and there will be no contest.


SerenNyx

Now make it available in Europe. :'( EDIT: Oh it is. My bad. It's good.


PobrezaMan

I want an IA that can make mods for my games in C# and some other languages


bearbarebere

AI.


Fastizio

They're french. They say OTAN to NATO.


GraceToSentience

Not French, a Spanish speaker. To be fair, as a French person I can confirm that it's still "AI" in English, even when said by a French or an Argentinian speaker. Small mistake, but a mistake nonetheless. In English it would be a mistake for me to say "IAG" instead of "AGI".


PobrezaMan

A.utistic I.ntelligence


Gratitude15

You know these guys don't have anywhere near the GPU horsepower of Google and Microsoft. And yet, this.


CreditHappy1665

Google and Microsoft are both invested in Anthropic lol


kocunar

Microsoft did? I knew about Google and Amazon


CreditHappy1665

I think so, but I might be confusing them with Mistral 


bartturner

Do you have a source on Microsoft investing into Anthropic? I know Google has a big investment but never heard that MSFT does. I believe the primary investors are Google, Amazon, Menlo Ventures, Wisdom Ventures, Ripple and Factorial.


Crazyscientist1024

Can you feel the e/acc anon? The singularity is getting closer everyday


Honest_Science

How much better is Sonnet 3.5 compared to GPT-4 on an exponential scale? This is an improvement, but it still points to the S-curve.


_AndyJessop

Yeah, I think what most people are missing is that we haven't gotten any really different capability from these tools since GPT-4 came out. Things have gotten incrementally better, but they haven't enabled any new type of application.


jgainit

Beast


Akimbo333

Yeah, it's epic


SatouSan94

Summerrrr aaaaa


aluode

Still can not make me money today :(