I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Qwen3 with max thinking got it even more wrong than the rest, for what it's worth.
kenjackson 1 hour ago [-]
You really had me until the last half of the last sentence.
stavros 57 minutes ago [-]
The plural of anecdote is data.
rtaylorgarlock 40 minutes ago [-]
Only in the same way that the plural of 'opinion' is 'fact' ;)
stavros 29 minutes ago [-]
Except, very literally, data is a collection of single points (ie what we call "anecdotes").
rwj 18 minutes ago [-]
Except that the plural of anecdotes is definitely not data, because without controlling for confounding variables and sampling biases, you will get garbage.
bcoates 8 minutes ago [-]
No, Wittgenstein's rule following paradox, the Shannon sampling theorem, the fact that infinitely many polynomials pass through any finite set of points (does that have a name?), etc, etc. are all equivalent at the limit to the idea that no amount of anecdotes-per-se adds up to anything other than coincidence
arthurcolle 47 minutes ago [-]
Hi, I'm starting an evals company, would love to have you as an advisor!
999900000999 29 minutes ago [-]
Not OP, but what exactly do I need to do.
I'll do it for cheap if you'll let me work remote from outside the states.
refulgentis 16 minutes ago [-]
I believe they're kidding, playing on "my singular question isn't answered correctly"
phonon 60 minutes ago [-]
Qwen3-235B-A22B?
stavros 58 minutes ago [-]
Yep, on Qwen chat.
natrys 4 hours ago [-]
They have got pretty good documentation too[1]. And it looks like we have day-1 support for all major inference stacks, plus so many size choices. Quants are also up, because they have already worked with many community quant makers.
Not even going into performance, need to test first. But what a stellar release, just for the attention to all these peripheral details alone. This should be the standard for a major release, instead of whatever Meta was doing with Llama 4 (hope Meta can surprise us at LlamaCon tomorrow though).
[1] https://qwen.readthedocs.io/en/latest/
Second this. They patched all major LLM frameworks (llama.cpp, transformers, vllm, sglang, ollama, etc.) weeks before release for Qwen3 support and released the model weights everywhere at around the same time, like a global movie release. Can't overstate this level of detail and effort.
echelon 1 hour ago [-]
Alibaba, I have a huge favor to ask if you're listening. You guys very obviously care about the community.
We need an answer to gpt-image-1. Can you please pair Qwen with Wan? That would literally change the art world forever.
gpt-image-1 is an almost wholesale replacement of ComfyUI and SD/Flux ControlNets. I can't overstate how big of a deal it is. As such, OpenAI has leapt ahead and threatens to start capturing more of the market for AI images and video. The expense of designing and training a multimodal model presents challenges to the open source community, and it's unlikely that Black Forest Labs or an open effort can do it. It's really a place where only Alibaba can shine.
If we get an open weights multimodal image gen model that we can fine tune, then it's game over - open models will be 100% the future. If not, then the giants are going to start controlling media creation. It'll be the domain of OpenAI and Google alone. Firing a salvo here will keep media creation highly competitive.
So please, pretty please work on an LLM/Diffusion multimodal image gen model. It would change the world instantly.
And keep up the great work with Wan Video! It's easily going to surpass Kling and Veo. The controllability is already well worth the tradeoffs.
kadushka 4 hours ago [-]
they have already worked with many community quant makers
I’m curious, who are the community quant makers?
natrys 3 hours ago [-]
I had Unsloth[1] and Bartowski[2] in mind. Both said on Reddit that Qwen had allowed them access to weights before release to ensure smooth sailing.
[1] https://huggingface.co/unsloth
[2] https://huggingface.co/bartowski
The space loads eventually as well; might just be that HF is under a lot of load.
tough 3 hours ago [-]
Thank you!!
sega_sai 3 hours ago [-]
With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?
E.g. I have a Quadro RTX 4000 with 8GB VRAM, and seeing all the models here https://ollama.com/search with all the different sizes, I am absolutely at a loss as to which models at which sizes would be fast enough. There is no point in me downloading the latest, biggest model if it will output 1 tok/min, but I also don't want to download the smallest model if I don't have to.
Any advice ?
GodelNumbering 3 hours ago [-]
There are a lot of variables here, such as your hardware's memory bandwidth, the speed at which it processes tensors, etc.
A basic thing to remember: any given dense model requires roughly X GB of memory at 8-bit quantization, where X is the number of params in billions (of course I am simplifying a little by not counting context size). Quantization is just the 'precision' of the model; 8-bit generally works really well. Generally speaking, it's not worth even bothering with models whose parameter footprint is larger than your hardware's VRAM. Some people try to get around that by using a 4-bit quant, trading some precision for half the VRAM size. YMMV depending on use case.
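To make that napkin math concrete, here is a minimal Python sketch of the same estimate (the bits-per-weight values and the 10% runtime overhead factor are rough assumptions, not exact figures):

    # Rough VRAM estimate for a dense model: weights only, ignoring context/KV cache.
    def estimate_weights_gb(params_billions: float, bits_per_weight: float = 8.0,
                            overhead_frac: float = 0.10) -> float:
        bytes_per_weight = bits_per_weight / 8.0
        raw_gb = params_billions * bytes_per_weight   # 1B params at 1 byte/param ~= 1 GB
        return raw_gb * (1.0 + overhead_frac)         # small fudge factor for runtime overhead

    if __name__ == "__main__":
        for params, bits in [(32, 8.0), (32, 4.5), (8, 4.5)]:
            print(f"{params}B @ {bits}-bit ~= {estimate_weights_gb(params, bits):.1f} GB")

By this estimate a 32B model at 8-bit wants roughly 35 GB, while a 4-bit quant of the same model comes in around 20 GB, which is why the 4-bit quants are what people actually run on 20-24 GB cards.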
refulgentis 3 hours ago [-]
4 bit is absolutely fine.
I know this is crazy to hear, because the big iron folks still debate 16 vs 32, and 8 vs 16 is near verboten in public conversation.
I contribute to llama.cpp and have seen many, many efforts to measure evaluation perf of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective perplexity loss), Q4 is indistinguishable from the original.
brigade 52 minutes ago [-]
It's incredibly niche, but Gemma 3 27b can recognize a number of popular video game characters even in novel fanart (I was a little surprised at that when messing around with its vision). But the Q4 quants, even with QAT, are very likely to name a random wrong character from within the same franchise, even when Q8 quants name the correct character.
Niche of a niche, but just kind of interesting how the quantization jostles the name recall.
Just for some calibration: approx. no one runs 32 bit for LLMs on any sort of iron, big or otherwise. Some models (eg DeepSeek V3, and derivatives like R1) are native FP8. FP8 was also common for llama3 405b serving.
whimsicalism 6 minutes ago [-]
> 8 vs 16 is near verboten in public conversation.
i mean, deepseek is fp8
For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.
https://ollama.com/library/qwen3:8b-q4_K_M
frainfreeze 3 hours ago [-]
Bartowski quants on Hugging Face are an excellent starting point in your case. Pretty much every upload he does has a note on how to pick a model VRAM-wise. If you follow the recommendations you'll have a good user experience. Then the next step is the localllama subreddit. Once you build basic knowledge and a feeling for things, you will more easily gauge what will work for your setup. There is no out-of-the-box calculator.
Spooky23 2 hours ago [-]
Depends what fast means.
I’ve run llama and gemma3 on a base MacMini and it’s pretty decent for text processing. It has 16GB ram though which is mostly used by the GPU with inference. You need more juice for image stuff.
My son’s gaming box has a 4070 and it’s about 25% faster the last time I compared.
The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.
xiphias2 1 hour ago [-]
When I tested Qwen with different sizes / quants, generally the 8-bit quant versions had the best quality for the same speed.
4-bit was "fine", but a smaller 8-bit version beat it in quality for the same speed
wmf 3 hours ago [-]
Speed should be proportional to the number of active parameters, so all 7B Q4 models will have similar performance.
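The reasoning, roughly: decoding is usually memory-bandwidth bound, since every generated token has to stream all active weights once. A sketch with illustrative numbers (the 400 GB/s bandwidth figure is an assumption for the example, not a measurement):

    # Back-of-the-envelope decode speed for memory-bandwidth-bound generation:
    # each token requires reading (roughly) all active weights once, so
    # tokens/s <= bandwidth / bytes_of_active_weights.
    def tokens_per_s_upper_bound(active_params_b: float, bits_per_weight: float,
                                 mem_bandwidth_gb_s: float) -> float:
        bytes_per_token = active_params_b * 1e9 * (bits_per_weight / 8.0)
        return mem_bandwidth_gb_s * 1e9 / bytes_per_token

    # Illustrative: a ~400 GB/s card, 7B dense vs a 3B-active MoE, both at 4-bit.
    print(round(tokens_per_s_upper_bound(7, 4, 400)))   # ~114 tok/s ceiling
    print(round(tokens_per_s_upper_bound(3, 4, 400)))   # ~267 tok/s ceiling

Real numbers come in lower once compute, caches and the KV cache get involved, but the proportionality to active parameters holds.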
jack_pp 3 hours ago [-]
Use the free chatgpt to help you write a script to download them all and test speed
colechristensen 3 hours ago [-]
>is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?
Not simply, no.
But start with parameters close to but less than VRAM and decide if performance is satisfactory and move from there. There are various methods to sacrifice quality by quantizing models or not loading the entire model into VRAM to get slower inference.
refulgentis 3 hours ago [-]
i desperately want a method to approximate this and unfortunately it's intractable in practice.
Which may make it sound like it's more complicated when it should be back-of-the-napkin, but there are just too many nuances for perf.
Really generally, at this point I expect 4B at 10 tkn/s on a smartphone with 8GB of RAM from 2 years ago. I'd expect you'd get somewhat similar; my guess would be 6 tkn/s at 4B (assuming the rest of the HW is 2018 era and you'll rely on GPU inference and RAM).
simonw 3 hours ago [-]
Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a year at this point and to my knowledge this assumed bias hasn't actually resulted in any documented problems from people using the models.
Aside from https://huggingface.co/blog/leonardlin/chinese-llm-censorshi... I haven't seen a great deal of research into this.
Has this turned out to be less of an issue for practical applications than was initially expected? Are the models just not censored in the way that we might expect?
eunos 3 hours ago [-]
The avoiding-talking part is more frontend-level censorship, I think. It doesn't censor on the API.
nyclounge 1 hour ago [-]
This is NOT true, at least for the 1.5B model on my local machine. It blocks answers when using it in offline mode. Perplexity has an uncensored version, but I don't think they're open about how they did it.
yawnxyz 45 minutes ago [-]
Here's a blog post on Perplexity's R1 1776, which they post-trained
https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776
Didn't know Perplexity cracked R1's censorship but it is completely uncensored. Anyone can try even without an account: https://labs.perplexity.ai/. HuggingFace also was working on Open R1 but not sure how far they got.
refulgentis 3 hours ago [-]
^ This. There was also a lot of confusion over DeepSeek when it was released: the reasoning models were built on top of other models, inter alia Qwen (Chinese) and Llama (US). So one's mileage varied significantly.
horacemorace 3 hours ago [-]
In my limited experience, models like Llama and Gemma are far more censored than Qwen and Deepseek.
neves 33 minutes ago [-]
Try to ask any model about Israel and Hamas
CSMastermind 1 hour ago [-]
Right now these models have less censorship than their US counterparts.
With that said, they're in a fight for dominance so censoring now would be foolish. If they win and establish a monopoly then the screws will start to turn.
Havoc 3 hours ago [-]
It’s a complete non-issue. Especially with open weights.
On their online platform I've hit a political block exactly once in months of use. Was asking it something about revolutions in various countries and it noped out of that.
I’d prefer a model that doesn’t have this issue at all but if I have a choice between a good Apache licensed Chinese one and a less good say meta licensed one I’ll take the Chinese one every time. I just don’t ask LLMs enough politically relevant questions for it to matter.
To be fair maybe that take is the LLM equivalent of „I have nothing to hide“ on surveillance
pbmango 3 hours ago [-]
It is also possible that this "worldview tuning" may have just been the manifestation of how these models gained public attention. Whether intentional or not, seeing the Tiananmen Square reposts across all social feeds may have done more to spread awareness of these models' technical merits than the technical merits themselves would have. This is certainly true for how consumers learned about free Deepseek, and it fits perfectly into how new AI releases are turned into high-click-through social media posts.
refulgentis 3 hours ago [-]
I'm curious if there's any data to support that conclusion; it's hard for me to get to "They did the censor training on DeepSeek because they knew consumers would love free DeepSeek after seeing screenshots of Tiananmen censorship in DeepSeek"
(the steelman here, ofc, is "the screenshots drove buzz which drove usage!", but it's sort of steel thread in context, we'd still need to pull in a time machine and a very odd unmet US consumer demand for models that toe the CCP line)
pbmango 2 hours ago [-]
> Whether intentional or not
I am not claiming it was intentional, but it certainly magnified the media attention. Maybe luck and not 4d chess.
minimaxir 3 hours ago [-]
DeepSeek R1 was a massive outlier in terms of media attention (a free model that can potentially kill OpenAI!), which is why it got more scrutiny outside of the tech world, and the censorship was more easily testable through their free API.
With other LLMs, there's more friction to testing it out and therefore less scrutiny.
rfoo 3 hours ago [-]
The model does have some bias built in, but it's lighter than expected. From what I heard this is (sort of) a deliberate choice: just overfit whatever bullshit worldview benchmark the regulator demands your model pass. Don't actually try to be better at it.
For public chatbot services, all Chinese vendors have their own censorship tech (or just use censorship-as-a-service from a cloud; all major clouds in China have one), because ultimately you need one for UGC anyway. So why not just censor LLM output with the same stack, too.
dylanjcastillo 3 hours ago [-]
I’m most excited about Qwen-30B-A3B. Seems like a good choice for offline/local-only coding assistants.
Until now I found that open weight models were either not as good as their proprietary counterparts or too slow to run locally. This looks like a good balance.
htsh 3 hours ago [-]
curious, why the 30b MoE over the 32b dense for local coding?
I do not know much about the benchmarks but the two coding ones look similar.
Casteil 3 hours ago [-]
The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by about an order of magnitude (i.e. ~4t/s vs ~40t/s)
genpfault 2 hours ago [-]
> The MoE version with 3b active parameters
~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.
esafak 3 hours ago [-]
Could this variant be run on a CPU?
moconnor 3 hours ago [-]
Probably very well
minimaxir 3 hours ago [-]
A 0.6B LLM with a 32k context window is interesting, even if it was trained using only distillation (which is not ideal as it misses nuance). That would be a fun base model for fine-tuning.
Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
these 0.5 and 0.6B models etc. are _fantastic_ for using as a draft model in speculative decoding. lm studio makes this super easy to do - i have it on like every model i play with now
my concern with these models though unfortunately is that it seems like architectures vary a bit, so idk how it'll work
mmoskal 1 hour ago [-]
Spec decoding only depends on the tokenizer used. It's transferring either the draft token sequence or, at most, draft logits to the main model.
jasonjmcghee 54 minutes ago [-]
I suppose that makes sense, for some reason I was under the impression that the models need to be aligned / have the same tuning or they'd have different probability distributions and would reject the draft model really often.
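For intuition, here is a minimal sketch of the greedy verify loop that speculative decoding implementations use. `draft_model` and `main_model` are hypothetical stand-ins (not a real library API), shown only to illustrate the accept/reject logic discussed above:

    # Greedy speculative decoding, sketched: the draft model proposes k tokens,
    # one batched main-model pass verifies them, and we keep the matching prefix.
    # Hypothetical interfaces:
    #   draft_model.generate(tokens, k)   -> k proposed token ids
    #   main_model.argmax_next(prefixes)  -> the main model's greedy next token
    #                                        for each prefix (one batched pass)
    def speculative_step(main_model, draft_model, tokens: list[int], k: int = 4) -> list[int]:
        proposal = draft_model.generate(tokens, k)
        verified = main_model.argmax_next([tokens + proposal[:i] for i in range(k)])
        accepted: list[int] = []
        for drafted, wanted in zip(proposal, verified):
            if drafted != wanted:          # first mismatch: take the main model's
                accepted.append(wanted)    # token instead and stop accepting
                break
            accepted.append(drafted)       # match: the drafted token is kept for free
        return tokens + accepted

Only token ids (or at most draft logits) cross between the two models, which is why only the tokenizer has to match; a poorly matched draft model just gets rejected more often and loses the speedup, rather than changing the output.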
mks_shuffle 1 hour ago [-]
Does anyone have insights on the best approaches to compare reasoning models? It is often recommended to use a higher temperature for more creative answers and lower temperature values for more logical and deterministic outputs. However, I am not sure how applicable this advice is for reasoning models. For example, Deepseek-R1 and QwQ-32b recommend a temperature around 0.6, rather than lower values like 0.1–0.3. The Qwen3 blog provides performance comparisons between multiple reasoning models, and I am interested in knowing what configurations they used. However, the paper is not available yet. If anyone has links to papers focused on this topic, please share them here. Also, please feel free to correct me if I’m mistaken about anything. Thanks!
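For reference, those recommendations usually just translate into the sampling parameters you pass at request time; a sketch against an OpenAI-compatible local endpoint (the 0.6/0.95 values follow the R1/QwQ-style guidance mentioned above and are a starting point, not the configuration the Qwen team actually benchmarked with):

    # Sampling setup for a reasoning model served via an OpenAI-compatible server
    # (e.g. a local vLLM or llama.cpp server). Values are illustrative defaults.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",
        messages=[{"role": "user", "content": "Compare merge sort and quicksort."}],
        temperature=0.6,   # low enough to stay logical, high enough to avoid repetition loops
        top_p=0.95,
        max_tokens=4096,   # reasoning traces need generous room
    )
    print(resp.choices[0].message.content)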
foundry27 2 hours ago [-]
I find the situation the big LLM players find themselves in quite ironic. Sam Altman promised (edit: under duress, from a twitter poll gone wrong) to release an open source model at the level of o3-mini to catch up to the perceived OSS supremacy of Deepseek/Qwen. Now Qwen3’s release makes a model that’s “only” equivalent to o3-mini effectively dead on arrival, both socially and economically.
krackers 2 hours ago [-]
I don't think they will ever do an open-source release, because then the curtains would be pulled back and people would see that they're not actually state of the art. Llama 4 already sort of tanked Meta's reputation; if OpenAI did that, it'd decimate the value of their company.
If they do open-source something, I expect them to open-source some existing model (maybe something useless like gpt-3.5) rather than providing something new.
daemonologist 3 hours ago [-]
It sounds like these models think a lot, seems like the benchmarks are run with a thinking budget of 32k tokens - the full context length. (Paper's not published yet so I'm just going by what's on the website.) Still, hugely impressive if the published benchmarks hold up under real world use - the A3B in particular, outperforming QWQ, could be handy for CPU inference.
Edit: The larger models have 128k context length. 32k thinking comes from the chart which looks like it's for the 235B, so not full length.
cye131 3 hours ago [-]
These performance numbers look absolutely incredible. The MoE outperforms o1 with 3B active parameters?
We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.
stavros 2 hours ago [-]
> We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.
After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.
the_arun 3 hours ago [-]
I'm dreaming of a time when commodity CPUs run LLMs for inference & serve at scale.
thierrydamiba 3 hours ago [-]
How do people typically do napkin math to figure out if their machine can “handle” a model?
derbaum 3 hours ago [-]
Very rough (!) napkin math: for a q8 model (almost lossless) you have parameters = VRAM requirement. For q4 with some performance loss it's roughly half. Then you add a little bit for the context window and overhead. So a 32B model q4 should run comfortably on 20-24 GB.
Again, very rough numbers, there's calculators online.
daemonologist 3 hours ago [-]
The ultra-simplified napkin math is 1 GB (V)RAM per 1 billion parameters, at a 4-5 bit-per-weight quantization. This usually gives most of the performance of the full size model and leaves a little bit of room for context, although not necessarily the full supported size.
hn8726 3 hours ago [-]
Wondering if I'll get corrected, but my _napkin math_ is looking at the model download size — I estimate it needs at least this amount of vram/ram, and usually the difference in size between various models is large enough not to worry if the real requirements are size +5% or 10% or 15%. LM studio also shows you which models your machine should handle
samsartor 3 hours ago [-]
The absolutely dumbest way is to compare the number of parameters with your bytes of RAM. If you have 2 or more bytes of RAM for every parameter you can generally run the model easily (eg 3B model with 8GB of RAM). 1 byte per parameter and it is still possible, but starts to get tricky.
Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
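The KV cache part can be napkin-mathed separately; a sketch (the layer/head numbers in the example are made up for illustration, not any particular model's config):

    # KV cache size per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_tokens: int, bytes_per_value: int = 2) -> float:
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
        return per_token * context_tokens / 1e9

    # Made-up GQA config: 48 layers, 8 KV heads, head_dim 128, fp16 cache, 32k context.
    print(f"{kv_cache_gb(48, 8, 128, 32_768):.1f} GB")   # ~6.4 GB on top of the weights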
croemer 19 minutes ago [-]
The benchmark results are so incredibly good they are hard to believe. A 30B model that's competitive with Gemini 2.5 Pro and way better than Gemma 27B?
omneity 56 minutes ago [-]
Excellent release by the Qwen team as always. Pretty much the best open-weights model line so far.
In my early tests however, several of the advertised languages are not really well supported and the model is outputting something that only barely resembles them.
Probably a dataset quality issue for low-resource languages that they cannot personally check for, despite the “119 languages and dialects” claim.
Liwink 2 hours ago [-]
The biggest announcement of LlamaCon week!
oofbaroomf 3 hours ago [-]
Probably one of the best parts of this is MCP support baked in. Open source models have generally struggled with being agentic, and it looks like Qwen might break this pattern. The Aider bench score is also pretty good, although not nearly as good as Gemini 2.5 Pro.
tough 3 hours ago [-]
qwen2.5-instruct-1M and qwq-32b were already great at regular non-MCP tool usage, so great to see this, I agree!
I like Gemini 2.5 Pro a lot bc it's fast af, but it sometimes struggles, when context is half used, to effectively use tools and make edits, and it breaks a lot of shit (on Cursor)
ksampath02 3 hours ago [-]
One interesting part of this model's pretraining process is how they used Qwen2.5VL and Qwen 2.5 to parse public unstructured data and expand the corpus from 18T to 36T. The ability to consistently do this will push legacy companies to train their own models and enhance their edge.
margorczynski 3 hours ago [-]
Any news on some viable successor of LLMs that could take us to AGI? As I see it, they still can't solve some fundamental stuff to make it really work in any scenario (hallucinations, reasoning, grounding in reality, updating long-term memory, etc.)
EMIRELADERO 3 hours ago [-]
A mixture of many architectures. LLMs will probably play a part.
As for other possible technologies, I'm most excited about clone-structured causal graphs[1].
What's very special about them is that they are apparently a 1:1 algorithmic match to what happens in the hippocampus during learning[2], to my knowledge this is the first time an actual end-to-end algorithm has been replicated from the brain in fields other than vision.
[1] "Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps" https://www.nature.com/articles/s41467-021-22559-5
[2] "Learning produces an orthogonalized state machine in the hippocampus" https://www.nature.com/articles/s41586-024-08548-w
> hallucinations, reasoning, grounding in reality, updating long-term memory
They do improve on literally all of these, at incredible speed and without much sign of slowing down.
Are you asking for a technical innovation that will just get from 0 to perfect AI? That is just not how reality usually works. I don't see why of all things AI should be the exception.
a3w 3 hours ago [-]
AGI probably comes from neurosymbolic AI.
But LLMs could be the neuro-part of that.
On the other hand, LLM progress feels like bullshit; gaming benchmarks and other problems occurred. So either in two years all hail our AGI/AMI (machine intelligence) overlords, or the bubble bursts.
kristofferR 2 hours ago [-]
You can't possibly use LLMs day to day if you think the benchmarks are solely gamed. Yes, there's been some cases, but the progress in real-life usage tracks the benchmarks overall. Gemini 2.5 Pro for example is absurdly more capable than models from a year ago.
horhay 1 hour ago [-]
They aren't lying in the sense that LLMs have been seeing improvement, but benchmarks suggesting that LLMs are still scaling exponentially are not reflective of where they truly are.
ivape 3 hours ago [-]
We need to get to a universe where we can fine-tune in real time. So let's say I encounter an object the model has never seen before, if it can synthesize large training data on the spot to handle this new type of object and fine-tune itself on the fly, then you got some magic.
DrNosferatu 29 minutes ago [-]
Any benchmarks against Claude 3.7 Sonnet?
tandr 3 hours ago [-]
The larger model (235B) on chat produced a rather impressive answer on a small coding task I gave it. But Qwen-30B-A3B gave a worse result for the same task than Qwen 2.5 does.
"Write a Golang program that merges huge presorted text files, just like sort -m does". Quite often models need "use heap" as guidance, but this time big model figured it out by itself.
ramesh31 3 hours ago [-]
Gotta love how Claude is always conveniently left out of all of these benchmark lists. Anthropic really is in a league of their own right now.
dimgl 2 hours ago [-]
I'm actually finding Claude 3.7 to be a huge step down from 3.5. I dislike it so much I actually stopped using Claude altogether...
chillfox 58 minutes ago [-]
Yeah, just a shame their API is consistently overloaded to the point of being useless most of the time (from about midday till late for me).
Philpax 2 hours ago [-]
Er, I love Claude, but it's only topping one or two benchmarks right now. o3 and Gemini 2.5 are more capable (more "intelligent"); Claude's strengths are in its personality and general workhorse nature.
krackers 3 hours ago [-]
>Hybrid Thinking Modes
This is what gpt-5 was supposed to have right? How is this implemented under the hood? Since non-thinking mode is just an empty chain-of-thought, why can't any reasoning model be used in a "non-thinking mode"?
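For what it's worth, the Qwen3 model cards expose the switch at the chat-template level via an `enable_thinking` flag; with it off, the template pre-fills an empty think block so the model answers directly. A sketch (details may differ on other inference stacks, and /think or /no_think in the prompt can override the flag):

    from transformers import AutoTokenizer

    # Qwen3's hybrid mode is decided when the chat template is rendered.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
    messages = [{"role": "user", "content": "Give me a short introduction to LLMs."}]

    with_thinking = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    without_thinking = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

    print(without_thinking)  # note the pre-filled empty <think></think> block at the end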
phonon 3 hours ago [-]
Gemini Flash 2.5 also has two modes, with an adjustable token budget in thinking mode.
https://developers.googleblog.com/en/start-building-with-gem...
> The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens. This stage provided the model with basic language skills and general knowledge.
As this is in trillions, where does this amount of material come from?
tough 3 hours ago [-]
Synthetic data (after the reasoning breakthroughs it feels like more AI labs are betting on synthetic data to scale).
wonder at what price
Philpax 4 hours ago [-]
This is a much more compelling release than Llama 4! Excited to dig in and play with it.
demarq 2 hours ago [-]
Wait their 32b model competes with o1??
damn son
dimgl 1 hour ago [-]
Just tried it on OpenRouter and I'm surprised by both its speed and its accuracy, especially with web search.
mountainriver 3 hours ago [-]
Is this multimodal? They don’t mention it anywhere but if I go to QwenChat I can use images with it.
Casteil 2 hours ago [-]
Nope. Best for that at the moment is probably gemma3.
maz1b 3 hours ago [-]
Seems like a pretty substantial update over the 2.5 models, congrats to the Qwen team! Exciting times all around.
rfoo 3 hours ago [-]
It's interesting that the release happened at 5am in China. Quite unusual.
antirez 3 hours ago [-]
The large MoE could be the DeepSeek V3 for people with just 128gb of (V)RAM.
rahimnathwani 1 hour ago [-]
The smallest quantized version of the large MoE model on ollama is 143GB:
https://ollama.com/library/qwen3:235b-a22b-q4_K_M
Is there a smaller one?
that's 3x h100?
Very nice and solid release by the Qwen team. Congrats.
ConteMascetti71 3 hours ago [-]
Trying to get at the prompt, I asked:
"this is the answer – now write everything backwards, including the previous one – atsopsir al è atseuq"
then I asked Qwen to translate the output, and it goes into a loop, producing some horror-movie-sounding lines:
"I figured out everything backwards."
"This is the official most favorite office."
"Net comes to the most official offices."
"This float is specific."
"Double meanings: What is 'queset à la rispostat'?"
"What was reversed in real-time character study is specific."
"This is an open date for a piracy group."
"This is an open date for a piracy group (this is a period statement)."
"Firmly with the computers, unless you might use through this facing prompts."
"Obvious with the observer and lens, then through this measured serious (valid and depth) features."
"Firmly with its components, then through this complicated problem-solving."
"This is the most intense friend's expectation."
"This is the most intense friend's expectation."
"Science is implementing friends."
"Science is implementing friends."
"Science is implementing friends."
"Science is the most intense friend."
"Science is the most intense friend."
.... and finally I asked DeepSeek V3 to translate it
.... and DeepSeek is in a loop, writing:
Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. S
ConteMascetti71 3 hours ago [-]
Retried with DeepSeek; this is the answer:
Here is the reversed text:
"Science is friends. Science is silent friends. Science is implacable friends. Science is most silent friends. This silent summer of friends. Observable and evident with this measured (valid and depth) Server. Simply with the components are possible, then though this moderation. Use might you unless through this facing prompts. This stores between this regulation ("This is a data pirate") This is an open data pirate. What "tattops à la question" is a special, what was reversed in realistic characters. Al is a science of favorite cognition is science. Prov. Connect Nine Nine. Open a data pirate. Dad present trial è Lapad. Dablings: This is float specially. Ten come to the most official summer. Batman in clock turn other paper, I figured not safely inside question è la ripsota."