Give us feedback!
What is your favourite XeyonAI model so far and why?
Is there something it doesn't do that you wish it would?
Let us know and we'll consider it for the next release!
Perhaps you could enlighten me as to why larger quants aren't made available on some of your newer models?
For example, your "Mercury v3.2" model is available in Q8_0 at 13GB but doesn't have f16 available, while your "Grok" models are available in f16 at ~24GB but have no Q8 listed. But your newest models, "Nebula" and "Saturn", max out at only having Q6_K.
Then there's your "Claude" model, which has f16 but doesn't have Q8 or Q6 and only has Q5 as the second-largest option.
Basically, is there a particular reason why Q8 and f16 (and Q6 in the case of Claude) aren't available on all of your models? I was hoping for more complete quants being made available across the board.
Hey.. yeah it's been a bit inconsistent due to time and tbh I didn't know there would be demand for certain quant types... as you can probably tell I've had practically zero feedback and you're the first person to mention it. So bear with me, I'll be happy to upload the f16 and Q8 versions for all releases going forward.
My personal pet theory is that "mere" 12B models are perhaps treated as a bit of a "has been" now that unified memory architectures and powerful integrated graphics are a thing from basically all of the relevant consumer-facing players (AMD, Intel, Apple), while the small business types are more than happy to throw big bucks at a few RTX 5090s or whatever (I don't believe for a second that a big chunk of Nvidia's quarterly "gaming" revenue isn't just selling RTX 5090s for AI workloads...)
I primarily tend to rely on hand-me-down PC hardware and I just received a Ryzen 5800H + RX 6600M system with 16GB DDR4 + 8GB video RAM (I'd like to upgrade the system RAM, but I'm fine waiting for RAM pricing to get sane), and doing CPU+GPU is where I was running those ~18GB models comfortably.
The main reason I found your models is because I've been trying a slew of different models to get them to achieve a specific workflow that...seems to actually be difficult to achieve correctly? (the "big name" online LLMs, e.g. grok.com, succeed at it rather easily though).
What I like to do is attach one or more Ren'Py or kirikiri script(s) from a visual novel and have the LLM write a new chapter that even "just works" as a drop-in addition or replacement that runs within the visual novel engine itself.
The problem is that many of the local models I've tried, such as your own, often don't seem to want to do this sort of mixed combination of story dialog and a coding-like script (for reference, Ren'Py uses python scripting), and your "Saturn" model for example wanted to continue the story without the python script aspects while your "Nebula" model wanted to commentate on the attached script files even though my input prompt was "continue the attached Ren'Py 7 script (don't include commentary)".
If you need an example to test with, I like to use the free visual novel "Camera" specifically for more experimental testing since it's short with a small file size, light-hearted and fully SFW (and even relevant to my interests - see my username), and has the Ren'Py script completely out in the open located at /game/script.rpy
https://vndb.org/r57845
https://madocactus.itch.io/camera
Thanks for the detailed feedback... this is genuinely useful, and this is absolutely a training issue rather than a limitation of the model's capability.
What you're describing.. generating mixed story dialogue + engine-specific script syntax, requires the model to maintain two simultaneous modes: natural language storytelling and structured code/markup. Most 12B models (and even larger ones) tend to collapse into one or the other when given ambiguous instructions, because their training rarely includes examples of that hybrid format.
Saturn defaulting to pure prose and Nebula wanting to commentate are both the same underlying failure. Neither had enough training signal for "produce valid Ren'Py script as the creative output." The "don't include commentary" instruction should have been enough, but it's fighting against the model's learned tendency to treat code-adjacent content as something to explain rather than produce.
This is something I can specifically train for. The fix would be a set of shards that demonstrate exactly this pattern: a Ren'Py (or similar) script as input, with "continue this script" as the instruction, and a correct continuation... valid script syntax, proper labels, dialogue blocks, conditional logic as the response. No commentary, no explanation, just drop-in output.
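To make the shard idea concrete, here is a minimal sketch of what one such training example could look like when laid out as ChatML text. This is only an illustration, not the actual shard format used for these models: the `make_shard` helper, the sample script lines, the character name `e`, and the label names are all hypothetical, and the system prompt is omitted for brevity.

```python
# Hypothetical sketch: one "continue this script" training example in ChatML.
# The helper name, sample script and labels are illustrative assumptions.

IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def make_shard(script: str, continuation: str,
               instruction: str = "Continue the attached Ren'Py 7 script "
                                  "(don't include commentary).") -> str:
    """Return one example: instruction + script in, pure script out."""
    # rstrip() only: leading indentation is significant in Ren'Py scripts.
    user = f"{instruction}\n\n{script.rstrip()}"
    return (
        f"{IM_START}user\n{user}{IM_END}\n"
        f"{IM_START}assistant\n{continuation.rstrip()}{IM_END}\n"
    )


script_in = '''label start:
    scene bg room
    show eileen happy
    e "Did you bring the camera?"
'''

# The target response is script only: a menu, jumps and a new label,
# with no explanation or commentary around it.
script_out = '''    menu:
        "Yes":
            jump has_camera
        "No":
            jump no_camera

label has_camera:
    e "Great, let's take some pictures."
    return
'''

shard = make_shard(script_in, script_out)
print(shard)
```

The point of the layout is that the assistant turn contains nothing except valid script, so the training signal is "script in, script out" rather than "script in, explanation out".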
Leave it with me. I am due to release the next series of models very soon, and I'd like to see if I can get it doing this from a LoRA training run first before looking at other methods.
So I've done the LoRA run and merged it in. There's a definite improvement.. the model is now staying in script mode, using correct Ren'Py structure, labels, jumps, pausing and audio syntax throughout. The writing quality is also strong and the character voices hold up well.
However it's not quite there yet. The stage directions are still occasionally coming out as hybrid syntax rather than clean show statements. The model knows what it's supposed to do but isn't fully committing to the format consistently.
This tells me what I suspected... the capability is partially in the base but not strongly enough embedded for a LoRA to surface it completely. A LoRA can reinforce existing knowledge but it can't fully install something that isn't there. To get clean, consistent Ren'Py output every time would require a full weight training run, which is a different level of work.
I'm happy to take that on but it's not something I can do as a freebie as it involves significant compute cost and time. If that's something you'd want to explore, get in touch and we can talk about what that looks like.
Either way, the current version is a meaningful step up from stock and should handle straightforward script continuation better than before.
I forgot to say, at least both Nebula (v1, haven't tried v2 yet) and especially Saturn really like to output <|im_end|> at the end of each of their outputs, which gets troublesome when trying to get them to continue a message for longer.
Saturn also kind of often likes to try to write a short story with a conclusion all in one go, like in only maybe 200-300 words, even if I said 10000 (yes, ten thousand) words or some other overly large number that I'd think would result in it just continuing to go until whatever the single-message token limit is configured to.
Also thanks for including the f16 version in Nebula v2, but actually a Q8 version would have been the ideal for my RAM capacity currently since 16GB system RAM + 8GB VRAM leaves no room for the OS and programs with a 24GB model XD (though I would seriously consider a RAM upgrade if RAM pricing was sane at which point I could probably do f16). Grok v4 and Claude Opus v1 are also missing Q8 versions (Grok v4 is even missing a Q6 version O_o).
But I still appreciate f16 being present for the principle of having a "lossless master copy" so to say.
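For reference, the sizes quoted in this thread line up with simple bits-per-weight arithmetic. The sketch below assumes a model of roughly 12B parameters (the thread only says "12B", so the exact count is an assumption) and uses the nominal block sizes of the llama.cpp quant formats; real GGUF files differ slightly because some tensors are stored at other precisions. The 4 GB headroom for the OS and other programs is a made-up illustrative figure.

```python
# Rough file-size estimate from bits per weight.
# Assumptions: ~12B parameters, nominal bits per weight per format,
# and a hypothetical 4 GB of headroom for the OS and programs.

BITS_PER_WEIGHT = {
    "f16": 16.0,
    "Q8_0": 8.5,      # 34 bytes per block of 32 weights
    "Q6_K": 6.5625,   # 210 bytes per block of 256 weights
}


def est_gb(params_billion: float, quant: str) -> float:
    """Approximate model size in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str,
         total_mem_gb: float, headroom_gb: float) -> bool:
    """True if the model fits in RAM + VRAM with the given headroom left over."""
    return est_gb(params_billion, quant) <= total_mem_gb - headroom_gb


TOTAL_GB = 16 + 8  # 16GB system RAM + 8GB VRAM, as described above

for quant in BITS_PER_WEIGHT:
    size = est_gb(12, quant)
    verdict = "fits" if fits(12, quant, TOTAL_GB, 4) else "too big"
    print(f"{quant:5s} ~{size:.2f} GB -> {verdict}")
```

Under those assumptions f16 comes out at about 24 GB and Q8_0 at about 12.75 GB, which matches the "~24GB" and "13GB" figures mentioned earlier and shows why Q8 is the practical ceiling on a 16GB + 8GB machine.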
Ok.. nice one.
On the quants: I'm uploading a Q8 of Nebula v2 now. I'll do the same for the other Series 4 models as soon as I get time. Grok v4 should have had the Q6 go up but somehow the connection must have dropped on the upload. I'll get that sorted.
On the <|im_end|> showing up at the end of outputs... that's the model correctly emitting its turn terminator; it's not corruption, it's how ChatML works. The issue is just that your frontend isn't treating <|im_end|> as a stop token, so it renders as visible text instead of halting there. Add <|im_end|> to your stop strings and it'll sort the continuation problem straight away.
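If your frontend has no stop-string setting, the same effect can be approximated on the client side by cutting the output at the first stop string. This is a minimal sketch, not how KoboldCpp or HWUI actually implement it; the `truncate_at_stop` helper is hypothetical. Most llama.cpp-based servers also accept a list of stop strings in the request itself, but the field name varies, so check your frontend's API docs rather than relying on this.

```python
# Hypothetical client-side cleanup: cut generated text at the first stop string.
# Server-side stop strings are preferable because they also halt generation.

from typing import Iterable

CHATML_STOPS = ["<|im_end|>"]


def truncate_at_stop(text: str, stop_strings: Iterable[str]) -> str:
    """Return text up to (not including) the earliest stop string found."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = 'e "See you tomorrow!"\n    return<|im_end|>'
clean = truncate_at_stop(raw, CHATML_STOPS)
print(clean)
```

With the terminator stripped, the cleaned text can be fed back in as the start of the next request when you want the model to keep going.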
Which actually leads me to a suggestion. You might want to try HWUI, my own frontend for these models. It's free, and it handles all the ChatML stop-token and cleanup stuff automatically, so the kind of thing you're hitting just doesn't happen. It'd also mean we're on the same setup, which makes it way easier for me to reproduce and fix anything you run into. You can get it here: https://github.com/XeyonAI/Helcyon-WebUI .. give it a go if you fancy it. The free version has basically everything aside from the memory system but happy to sort you out with more down the line given how helpful you've been.
On Saturn wrapping a story up in 200-300 words when you've asked for way more ... yeah, that's a Saturn thing. It's a roleplay model and still getting tuned, so it's not as dialled-in on long continuation as the others yet. Long-form "keep going, don't conclude" behaviour is something I'm actively working on for the RP side (it ties into the Ren'Py continuation stuff too), so that feedback's landing at a good time.
Cheers again. The reports are appreciated.
Isn't Helcyon-WebUI just a front-end for llama.cpp? I've been using koboldcpp, which is also just a front-end for llama.cpp, so I'm not sure what difference it'd make in terms of the end result and output...
But koboldcpp doubles as a front-end for stable diffusion which is the other thing I've been doing locally now that the image generation function of grok.com seems to require being logged in as of the last month 😐
(also I really like how koboldcpp is portable which allows me to run it completely offline in Linux live ISOs)
Yes it is a front end for llama.cpp.. it all depends on what you want of course. I built Helcyon-WebUI with project folders, a proper working memory system and web search because I'm big on wanting the AI to feel like it really knows me, can remember our last convo and randomly bring things up that make the conversation feel more alive, and the others don't have those things. But yeah as for portable.. that is a very handy feature to have so I'll stick it on the to-do list.
I'm into stable diffusion too so I'll be looking to hook that into it as well. What's handy of course is I can build a feature and then train the model to use it, which is what I've been doing, so the model and HWUI are becoming symbiotic in that way. Like the model emitting a search command that HWUI catches and runs.
But yeah if you ever feel like testing it out, let me know. I'll ship you the full version😊
(perhaps in the future I should make a separate discussion thread for any given individual issue I find?)
Actually I may have found a limitation(?) of koboldcpp that, for all I know, HWUI doesn't have. Alternatively it could be an issue with Nebula (v2) that could very well also occur on HWUI - think you could sanity-check the following?
I did a very simple test of just inserting an image of the number 4 and asked what number it is and, spoiler alert, the generated responses basically never said 4 and would say some other numbers or not even a number at all.
(also it's a bit amusing to hear you talk about a memory system because the opposite is true for me - clean slates and always starting from scratch so that new conversations aren't "tainted" by previous ones is my primary way of doing things and, fun fact, was a big reason I preferred using grok.com when logged out, but grok.com seems to require you to be logged in to do anything now and is therefore a big reason I'm doing more with locally-run stuff...which is sad because grok.com was the only one I knew of that allowed attachments when logged out - Gemini and ChatGPT do not, and Claude never was available without an account last I checked)
Ha, no worries! Nebula isn't a vision model, so that's not a bug or a KoboldCpp limitation, it just can't process images at all. None of the Helcyon series are vision-capable. What you're seeing is the model trying to respond to what it can't actually perceive, which is why the answers are all over the place. If you want vision capability locally, something like a LLaVA-based model or one of the Qwen-VL variants would be the direction to look.
On the clean slate preference... yeah that's a completely valid way to work and a lot of people feel the same way. The memory system in HWUI is very much an opt-in thing built for people who want the opposite experience, so it's just a different philosophy rather than one being better than the other.
And yeah, a separate thread per issue is probably a good idea going forward. Makes it easier to track things and means other people with the same question can find the answer. The feedback's been useful regardless, so cheers for taking the time.