Concerned about the use of PPL scores
But... what about Q8?
The mountain moved:
150 points better: PPL = 8.5850 +/- 0.05881
VS: BASE/ORIGINAL: PPL = 8.6012 +/- 0.05900
Looking at these numbers, it appears that 150 points = 0.015 PPL. If this is the case, then the margin of error on these scores is +/- 588 points, and the 150 points of improvement is well within that margin of error, essentially making it meaningless as a marker for the improvement itself.
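The arithmetic behind that point can be sketched as follows. This is only a rough check, and it rests on assumptions that are not stated in the thread: that each +/- value can be treated as a standard error, and that the two estimates are independent. Both runs most likely used the same text, so the true uncertainty on the difference may be smaller than this suggests. The function name `ppl_difference` is made up for illustration.

```python
import math

def ppl_difference(ppl_new, err_new, ppl_base, err_base):
    """Compare two perplexity scores against their reported uncertainties.

    Assumption (not from the thread): each +/- is a standard error and the
    two estimates are independent, so the errors combine in quadrature.
    """
    diff = ppl_base - ppl_new                      # positive means the new model is better
    combined = math.sqrt(err_new**2 + err_base**2)
    return diff, combined, diff / combined

# Scores quoted above: remastered model vs. base/original.
diff, combined, ratio = ppl_difference(8.5850, 0.05881, 8.6012, 0.05900)

print(f"improvement:           {diff:.4f} PPL")
print(f"combined uncertainty:  {combined:.4f} PPL")
print(f"improvement / uncertainty: {ratio:.2f}")
```

A ratio well below 1 means the improvement is smaller than the uncertainty, which is the concern being raised about this single pair of scores.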
It's possible that this 32 bit remaster is indeed much more intelligent and capable than its 16 bit counterpart, but it doesn't appear that this improvement is being indicated within the PPL scores themselves. I would love to see some other benchmark differences between these two versions of PsyCet, and potentially some A/B blind testing as well if possible.
Completely agree - that is why you need to look at all the scores of all quants.
It is a margin of movement.
If only a few quants show change, it is error; if they all show change, it is movement.
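One way to put a number on the "few quants vs. all quants" idea is a simple sign test, sketched below. This is not the method used for the model; it is only an illustration. The per-quant deltas are HYPOTHETICAL placeholders, not measurements, and the test treats quants as independent, which they are not (they all come from the same weights), so the resulting probabilities are optimistic.

```python
from math import comb

def sign_test_p_value(deltas):
    """One-sided sign test on per-quant PPL changes (new minus base).

    Returns (improved, n, p), where p is the chance of seeing at least this
    many improvements if each quant were equally likely to move up or down.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    improved = sum(1 for d in nonzero if d < 0)    # lower PPL counts as an improvement
    tail = sum(comb(n, k) for k in range(improved, n + 1))
    return improved, n, tail / 2**n

# HYPOTHETICAL deltas for eight quants; placeholders only.
all_move = [-0.016, -0.021, -0.012, -0.030, -0.018, -0.025, -0.014, -0.019]
few_move = [-0.016, 0.011, -0.004, 0.020, 0.007, -0.009, 0.013, 0.002]

for name, deltas in (("all quants improve", all_move), ("only a few improve", few_move)):
    improved, n, p = sign_test_p_value(deltas)
    print(f"{name}: {improved}/{n} improved, p = {p:.4f}")
```

When every quant moves the same way, the chance of that happening by coin flip shrinks quickly with the number of quants; a mixed pattern is indistinguishable from noise.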
Likewise, the +/- is a RANGE, not just a margin of error.
And like perplexity, range is also an average, which makes it a terrible indicator of true mathematical movement, especially in something as large and complex as an LLM.
The actual size of the range is also a factor, and it is relative to the base perplexity level, in this case 8-9.
For example, if the range were, say, 1.000 or higher, you would have a possibly unstable model.
However, if the base perplexity is 15, an "error" range of 1 would be no issue.
This is because perplexity is not linear; it is closer to orders of magnitude.
It is only a relative, rough "30,000 feet" view of the model.
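The non-linearity can be made concrete with a small sketch. It assumes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood, so the natural scale for comparing shifts is the logarithm. The helper `log_shift` and the base values 8.5 and 15.0 are illustrative choices, not figures from the model.

```python
import math

def log_shift(base_ppl, delta):
    """Change in mean negative log-likelihood (nats per token)
    when perplexity moves from base_ppl to base_ppl + delta."""
    return math.log(base_ppl + delta) - math.log(base_ppl)

# The same absolute swing of 1.0 PPL at two different base levels.
for base in (8.5, 15.0):
    print(f"base PPL {base:>4}: +1.0 PPL = {log_shift(base, 1.0):.4f} nats/token")
```

The same swing of 1.0 costs roughly 0.11 nats per token at a base of 8.5 but only about 0.065 at a base of 15, which is why the size of the range only means something relative to the base level.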
And then there is the file used to calculate perplexity itself.
A wild card.
Change this, you change everything.
The original creator of this model tested it himself.
To put it mildly, he was blown away. His comments are all over his Discord, KoboldAI.
Members of his group familiar with the model also tested it.
Without exception all of them were impressed.
Likewise, a lot of real-world testing, comparing the original against the new and improved version, was done prior to release to further confirm the positive change.
These methods confirm or deny perplexity changes and also reveal positive and/or negative changes.
As per Jeb Carter, creator of the model:
- instruction following has improved dramatically.
- new abilities have emerged.
- he had to REDUCE the instruction sets used because the model no longer needed such specific instructions.
- prose, nuance and depth have all improved.
- issues with the original model have disappeared.
This is not "something for nothing"; it is a method of ensuring maximum precision at every step just before "ggufing" the model.
The methods employed only ensure precision loss is minimized or eliminated.
It is mathematically and theoretically sound.
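The size of the rounding error at stake can be illustrated with a toy example. This is not the pipeline used for the model; it only shows how much a single value moves when stored in 16-bit versus 32-bit floating point. The weight value is an arbitrary placeholder, and `round_trip` is a hypothetical helper built on Python's standard `struct` module.

```python
import struct

def round_trip(value, fmt):
    """Store a Python float in the given IEEE format and read it back.

    'e' is 16-bit half precision, 'f' is 32-bit single precision.
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

weight = 0.1234567   # arbitrary placeholder value, not a real model weight

as_fp16 = round_trip(weight, "e")
as_fp32 = round_trip(weight, "f")
err16 = abs(as_fp16 - weight)
err32 = abs(as_fp32 - weight)

print(f"fp16: {as_fp16!r}  error {err16:.3e}")
print(f"fp32: {as_fp32!r}  error {err32:.3e}")
```

For this value the 16-bit error is several thousand times larger than the 32-bit one. Whether errors of that size accumulate into a visible quality difference after quantization is exactly what the testing described above is meant to establish.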
I believe what is being said, and it absolutely makes sense that maintaining 32 bit precision from the beginning would lead to better overall precision in the final product. I just wanted to point out why I don't believe PPL is very useful as a measurement for the improvement being seen.
I appreciate the insight!
The model stheno by sao10k is really good. It scores almost 70 on the leaderboard with just 8B parameters. Is it possible to make an fp32 version of that model too?
I am aware of this model. If I recall there are a number of versions?
Do you mean one in particular?
From his post, it seems like 3.2 is not his final model. I don't know, maybe he will update the model. But in my experience it already outperforms even 70B models. An fp32 version would pack quite a punch right now, since the 3.2 version is kinda popular rn.