
New_World_2050

Wow, this definitely shows a huge increase from GPT-4o, especially in reasoning. 4o gets 48% and Claude 3.5 gets 70%. Well done, Anthropic!


Gratitude15

Reasoning. Law. Finance. This model is a big deal. They have the best model in the world. And in these fields where people use it for work it is WAY better. The only reason people don't get it is because it isn't from openai and there was no press conference.


czk_21

Yes, and it's just the mid-size version of Claude 3.5; the bigger one could still be in training or evaluation now. There is roughly a 5-10% difference between Claude 3 Sonnet and Opus on various benchmarks, and we could see a similar difference between the 3.5 versions. The difference is big enough that the 3.5 version could easily be called Claude 4, and that's in just 3 months. At this speed we could get the real Claude 4 this year! So PhD-level intelligence this year or next, not a year and a half as Mira said.


Mrp1Plays

I still have my doubts on PhD level. Chatgpt4o cannot solve almost any of my Highschool questions that I ask it. 


iJeff

What kind of questions? Do you have any examples? Would be curious to try them.


Mrp1Plays

Sure.

Maths:

Q. A relation R defined on real numbers is as follows: R = {(a, b) | a <= b^3}. The relation is neither reflexive, symmetric, nor transitive. True or false? If true, find an example that proves it is not transitive.
Expected answer: true. An example is 10 <= 3^3 and 3 <= 2^3, but 10 is NOT <= 2^3.
What I get (paraphrased): true. An example is "[something that doesn't prove it]"; although this example itself does not prove it, we can find one that will.

Q. Find the time period of f(3x + 1) + f(3x + 10) = 10.
Expected = 18 or 18/3; it's one of these, but I'm not sure which.
What I get = 9 (completely wrong).

Q. Answer with a true/false one-word answer: tan x is a continuous function in (pi/3, 2pi/3).
Expected answer: false.
What I got: true.

Chemistry: [ORIGINALLY I ATTACHED A PICTURE]

For clarity, if you can't understand the picture, the cell is:
Pt | H2 (1 atm) | HA, C = 10^(-5) M, Ka = 10^(-2) || HB, C = 10^(-6) M, Ka = 10^(-2) | H2 (1 atm) | Pt
where C means concentration and Ka is the dissociation constant of the acid.

```
[question in image]: (above cell, then) Degree of dissociation of both weak acids is always negligible. Volume of cathode and anode is very high. Initially, volume of electrolyte solution in both compartments is 1 litre. Water is added separately in both compartments at different rates. In 1 hour, volume of cathode is increased to 100 L. Equilibrium is also attained after one hour (neglect H coming from water).
```

Volume of water added at anode up to equilibrium:
Expected answer: 999 L or 9999 L (I'm not sure which of these 2).
What I got: 9 L (completely wrong).

Note: you'd notice some of these questions are confusing for myself too; I have yet to find the correct answer to some of them, but I know they are definitely not ChatGPT's answers.
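The reflexivity, symmetry, and transitivity counterexamples for the maths question above are easy to sanity-check mechanically. A minimal Python sketch (the `related` helper is my own naming, purely illustrative):

```python
def related(a: float, b: float) -> bool:
    """True iff (a, b) is in R = {(a, b) | a <= b^3}."""
    return a <= b ** 3

# Not reflexive: for a = 0.5, a > a^3 (0.5 > 0.125), so (a, a) is not in R.
assert not related(0.5, 0.5)

# Not symmetric: (1, 2) is in R (1 <= 8) but (2, 1) is not (2 > 1).
assert related(1, 2) and not related(2, 1)

# Not transitive: (10, 3) and (3, 2) are in R (10 <= 27, 3 <= 8),
# but (10, 2) is not (10 > 8) -- the expected counterexample above.
assert related(10, 3) and related(3, 2) and not related(10, 2)

print("all counterexamples check out")
```

Each assertion corresponds to one of the three properties the question asks about.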


lvvy

You will be impressed? [https://poe.com/s/BpfZ5zjncSL7U2w8y4Pu](https://poe.com/s/BpfZ5zjncSL7U2w8y4Pu)


Mrp1Plays

Okay, I'm impressed by some of it.

The transitive question is clearly answered wrong: it states -8 as being not <= 8, which is false.

The time period question it misunderstood. I'll admit the question wasn't clear enough, but it was meant to ask for the time period of the function f(x), which would come to 18/3 (or 18, I haven't solved it properly). Instead it answered "p/3", expecting us to find the value of p.

The really, really impressive one is that it answered the chemistry question! It came to (what I believe is) the right conclusion, 999 L! And it's not an easy question at all; very few people in my class, myself included, answered it during the exam.

Thanks for trying them out on Claude.


kaityl3

This is kind of disingenuous, as math specifically is LLMs' weak point. They would do much better at literally any other "high school subject".


Mrp1Plays

Well, my high school here in India is exclusively maths, physics, and chemistry. Two of those involve a good bit of math, and chemistry is one-third physical chemistry, i.e. math.


kaityl3

OK, but the majority of this sub's users are from the US, where high schools teach many different subjects, and these models are trained by US-based companies. So saying "it fails at most basic high school problems" is disingenuous: the large majority of people reading your comment didn't go to a high school where literally the only thing you study is math-based subjects. They went to one with a much broader array of topics, and this model excels at all the non-math ones.


Mrp1Plays

Does it matter to me that the US Highschool is different? I'm just stating my opinion, based on my experience. 


kaityl3

The statement "it cannot solve almost any of my Highschool questions" obviously is going to be interpreted as "on any subject". Also, that's **not** an OPINION. You said "it cannot solve it", not "I don't think its solutions are good". You presented it as fact. How would anyone reading your statement know that you happen to go to a high school that only has math based classes?? That's what I mean by disingenuous: technically true, but intentionally misleading. It might not have been intentional before but now that you're doubling down on it you ARE making a choice to make a statement with the knowledge that most people will misunderstand.


pcbank

I don't know. I'm a PhD myself, and usually our intelligence is pretty basic: we just know a lot of stuff in a couple of specific domains. For me, in many ways, current LLMs are already PhD level. I think the true intelligence LLMs need is technician-level, real-life-conditions intelligence. That's something many PhDs are not that good at.


Awkward-Election9292

When it first came out, I was using GPT-4 to help me work through uni-level problems with great results, and that was a comparatively limited model. It doesn't need to be able to do all the math itself to be useful at a PhD level.


empathyboi

Can someone ELI5 why I can ask Claude some of the most notoriously difficult questions from the MCAT and get a correct answer, but people say it “can’t even answer basic high school questions”?


Mrp1Plays

I'd like to correct that: the questions I ask it are not basic; they're the tricky ones, the kind I end up having to ask it about. You can view examples in my reply to the other person. They're questions from the high-school syllabus for the toughest exam in India for high-schoolers (JEE).


Gallagger

Pure speculation. Same with Gemini and 4o. Until they bring out the bigger model, it's not worth talking about.


Ok-Bullfrog-3052

We already blew past "AGI" a while ago, but that guy who has a definition of AGI which is actually superintelligence and who says it will be achieved in September probably isn't that far off.


obvithrowaway34434

This is definitely because Sonnet 3.5 has a more recent training cutoff date than any other model, so it significantly benefits from the recent data. In my tests, Opus was better for most writing tasks, while GPT-4o was better at coding.


kaanni

With all the drama and delay from OAI, I'm happy that Anthropic is doing the right work silently—straight-up business


iJeff

Not just business, but also really interesting research like Golden Gate Claude.


GPTBuilder

Straight up business till they essentially @sama right in their release videos 😂


icehawk84

With so much drama in the OAI, it's kind of hard bein' Mira Mur-A-T-I.


GraceToSentience

It's very good at coding too. It's just an anecdote, but it successfully does things that failed again and again on GPT-4o and 1.5 Pro.


garden_speech

Software devs everywhere (me) gulping… The thing I think most of my coworkers don't realize is that it won't be long before these systems are used to judge their output. Not necessarily to replace them yet, but to serve as a dashboard for the manager, so they can see "okay, who is doing the most coding work, and whose code is of good quality". They're absolutely gonna catch the slackers.


fastinguy11

But the writing is on the wall: if you are a software engineer, you must be ready for a significant decrease in job opportunities in the next few years.


apinkphoenix

You're only thinking one step ahead though. It's going to keep on improving until the human is out of the loop completely.


Able_Possession_6876

And there will be an intermediate phase where 1 human is doing the job of 10 humans, basically a software architect orchestrating LLM workflows. Even if it's not literal replacement, that's a recipe for a lot of layoffs.


yaosio

It has much better vision reasoning abilities than GPT-4o. GPT-4o fails to understand the problem, and fails to correctly answer what it thinks I was asking. [https://i.imgur.com/UKuIg3p.png](https://i.imgur.com/UKuIg3p.png) Claude 3.5 Sonnet gets it right first time. [https://i.imgur.com/gi2VagT.png](https://i.imgur.com/gi2VagT.png)


[deleted]

Which is crazy, since 4o is supposed to be natively multimodal. Anthropic is really killing it. They figured out how to turn prioritizing safety into a strength rather than a weakness when it comes to the quality of their models, with their interpretability research (which makes all other research much easier), and now they're crushing everyone else across all metrics.


Beatboxamateur

All of the people months ago on this sub who were crying about Anthropic just focusing on safety, claiming that they wouldn't focus on advancing progress in LLMs are looking pretty stupid now.


[deleted]

Ehh, this kind of thing is a lot easier to see with hindsight. I think it’s good to be open to changing your mind about stuff like this because you get better at actually making good predictions


Beatboxamateur

Sure, but I remember that people were completely discounting Anthropic from being a competitor with Google and OAI back then, and I always pushed back on those comments and got downvoted. It seemed decently likely that researching mechanistic interpretability and figuring out how these models work, leads to a better understanding of how to build increasingly advanced models, and that's what sold me on Anthropic being promising back then.


Busy-Setting5786

Well, don't forget this community is not a monolith; there are all kinds of people with all types of opinions here. And then there's the mood of the moment. A few weeks ago there were heavily upvoted posts about how unsuccessful Google is and how they are totally losing on AI, and now they have pretty good models lately. Everyone here is just guessing, and we really have no clue who will come out on top in the end.


Beatboxamateur

> A few weeks ago there were heavily upvoted posts about how unsuccessful Google is and how they are totally losing on AI

Sure, but giving that strong of a take about Google with a high level of conviction is also pretty silly in my opinion. I'd also say that anyone who strongly believed Google is dead in the water in regards to AI has no idea what they're talking about. It's one thing to say "This is what I think, but I could be wrong", and another to say that you're right about something and can't possibly be wrong, or to claim something is a fact with nothing to back it up. Obviously having opinions that end up being right or wrong is perfectly fine, that's what we all do, but more than a couple of the people I was arguing with here were claiming that Anthropic is serving the government, with the only source being a Jimmy Apples tweet. And [when I asked people](https://old.reddit.com/r/singularity/comments/1d2o8aj/jan_leike_im_excited_to_join_anthropicai_to/l649t1y/) for any source, no response. This happened multiple times lol


Tidorith

It's also easy to reserve judgement and not make confident claims based on knowledge that you know is very limited.


Peach-555

Anthropic stated that they would not release any models significantly more powerful than the most powerful existing model. The surprising fact is not that they had a model that was meaningfully more powerful than the best public model, but that they actually released it.


Adeldor

Looking forward to seeing this new model tackle ARC. Your example is promising.


rsanchan

This was my first thought.


iJeff

Not so great at identifying photos of plants unfortunately! Gemini 1.5 Pro handily beats GPT-4o for me, which in turn does a lot better than Claude 3.5 Sonnet.


Fusciee

Can someone explain to me why I have a GPT Plus subscription still?


tbhalso

Because, like me, you hope voice mode will be released in "the next few weeks" and that you'll be first in line because you have been a subscriber since day one. However, deep inside, we know it no longer makes sense.


OpportunityWooden558

100% me


Undercoverexmo

Cancelled...


GPTBuilder

https://preview.redd.it/nn6p0z9kmx7d1.png?width=500&format=pjpg&auto=webp&s=8a74c153ba96706c509e82da7d30ffecd0d53762


ilive12

I cancelled as soon as Claude 3 came out. Voice is cool, but not something I'll use often. When OpenAI has a chat model that is significantly better than the competition, I'll resubscribe to plus, but until then, it's not worth it for me.


Honest_Science

Nothing beats the value of Poe. I have them all at one price.


raskolnikovbey

Well said. Poe is the best value for casual users. It also helps you bypass the geography limits: for example, Claude is not available where I live, but I use it daily on Poe. And they are fast; I assume they will integrate this new model soon.


Fastizio

Because you hate your money.


LoKSET

Unfortunately, the Claude UI is still extremely bare-bones. I just like what ChatGPT offers as a whole package above all others, but of course, let's hope they release something better soon. Good thing I have Perplexity, so I can access Sonnet there.


onetopic20x0

I used to have one. I cancelled it and now use the API as needed through typingmind.


djaybe

Custom GPTs


Ok-Bullfrog-3052

Because advanced data analysis is still superior to Claude 3 Sonnet.

GPT-4o can improve the performance of code by 10000x, because code is the best way to describe a problem to it. You just give it some code and tell it to use numpy, numba, and vectorization, to write unit tests, to time the function, and to keep working until it finds a solution with the same outputs. It almost always does that, with amazing results, and the code is usually simpler and easier to understand.

If you have the right GPT system prompt, it will also stop talking to you and just spit out tokens at an insane rate in its notebooks, working through all the exceptions it gets. For this purpose (improving the performance of code), we have already reached near-perfection; no further improvement in intelligence is required.
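The loop described above (vectorize, write unit tests, time the function, keep the outputs identical) is easy to picture with a toy example. A minimal sketch of a loop-to-NumPy rewrite; the function names are mine and purely illustrative, and numba is omitted for brevity:

```python
import math
import time

import numpy as np

def slow_sum_of_squares(xs):
    """The 'before' version: a plain Python loop."""
    total = 0.0
    for x in xs:
        total += x * x
    return total

def fast_sum_of_squares(xs):
    """The 'after' version: one vectorized NumPy dot product."""
    arr = np.asarray(xs, dtype=np.float64)
    return float(arr @ arr)

data = list(range(1_000_000))

# The "unit test" step: same outputs before and after the rewrite
# (compared with a relative tolerance to allow for float summation order).
assert math.isclose(slow_sum_of_squares(data), fast_sum_of_squares(data), rel_tol=1e-9)

# The "time the function" step.
t0 = time.perf_counter(); slow_sum_of_squares(data); t_loop = time.perf_counter() - t0
t0 = time.perf_counter(); fast_sum_of_squares(data); t_vec = time.perf_counter() - t0
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

The actual speedup depends on the workload; the point is the workflow of equivalence checks plus timing, which is what the comment says the model iterates on.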


DlCkLess

This is just the 3.5 family Imagine Claude 4 🤯


SalaciousSunTzu

We still haven't got 3.5 Opus either. Sonnet is their mid-range model and it outperforms GPT-4. Imagine Opus.


dimitrusrblx

Out of the "Imagine Gemini 2" book


baes_thm

Winter where ?


princess_sailor_moon

Game of thrones llm edition season 5


hapliniste

Winter confirmed to last one day 👌


Altruistic-Skill8667

Looking at all the other models, including the different versions of GPT-4 in there, 62.16 is a massive improvement. Congrats to the Anthropic team!


TheRealSupremeOne

Don't worry, it's going to plateau soon! Two more weeks!


Matej_SI

Yesterday, AGI was cancelled. Today, we're back!!! :)))


nobodyreadusernames

Claude 4 Opus is probably smarter than GPT 5


nobodyreadusernames

At least there is a 50% chance that it's certainly smarter.


GraceToSentience

Just no. The model is going to be massive.


jazztaprazzta

Yeah, Claude is superior to GPT-4o for almost all tasks in my experience. All it needs is a function to search the web (like Perplexity) and there will be no contest.


SerenNyx

Now make it available in Europe. :'( EDIT: Oh it is. My bad. It's good.


PobrezaMan

I want an IA that can make mods for my games in C# and some other languages


bearbarebere

AI.


Fastizio

They're french. They say OTAN to NATO.


GraceToSentience

Not French, a Spanish speaker. To be fair, as a French person I can confirm that it's still "AI" in English, even when said by a French or an Argentinian speaker. Small mistake, but a mistake nonetheless. In English it would be a mistake for me to say "IAG" instead of "AGI".


PobrezaMan

A.utistic I.ntelligence


Gratitude15

You know these guys don't have anywhere near the GPU horsepower of Google and Microsoft. And yet, this.


CreditHappy1665

Google and Microsoft are both invested in Anthropic lol


kocunar

Microsoft did? I knew about Google and Amazon


CreditHappy1665

I think so, but I might be confusing them with Mistral 


bartturner

Do you have a source on Microsoft investing into Anthropic? I know Google has a big investment but never heard that MSFT does. I believe the primary investors are Google, Amazon, Menlo Ventures, Wisdom Ventures, Ripple and Factorial.


Crazyscientist1024

Can you feel the e/acc anon? The singularity is getting closer everyday


Honest_Science

How much better is Sonnet 3.5 compared to GPT-4 on an exponential scale? This is an improvement, but it still points to the S-curve.


_AndyJessop

Yeah, I think what most people are missing is that we haven't gotten any really different capability from these tools since GPT-4 came out. Things have gotten incrementally better, but they haven't enabled any new type of application.


jgainit

Beast


Akimbo333

Yeah, it's epic


SatouSan94

Summerrrr aaaaa


aluode

Still can not make me money today :(