TODO: Fix GGUF Model Context Window Error and Optimize Speed
Tasks
- Modify generate method in model_loader_gguf.py to dynamically adjust max_tokens based on prompt length
- Tune n_threads in model initialization for maximum speed
- Test the changes to ensure nothing breaks
Details
- Approximate prompt tokens by word count (split on whitespace)
- Calculate allowed max_tokens = 4000 - prompt_tokens
- If the requested max_tokens exceeds the allowed value, reduce it and log a warning
- Raise an error if the prompt is too long (allowed max_tokens <= 0)
- Set n_threads to os.cpu_count() for speed
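
The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual code in model_loader_gguf.py: the helper names (estimate_prompt_tokens, clamp_max_tokens, default_n_threads) and the CONTEXT_BUDGET constant are hypothetical, and the 4000 figure is taken from the notes above. Word count usually undercounts real tokens, so a safety margin may be needed in practice.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Token budget taken from the notes above (hypothetical constant name).
CONTEXT_BUDGET = 4000


def estimate_prompt_tokens(prompt: str) -> int:
    """Approximate the token count by splitting on whitespace."""
    return len(prompt.split())


def clamp_max_tokens(prompt: str, requested_max_tokens: int,
                     budget: int = CONTEXT_BUDGET) -> int:
    """Return a max_tokens value that fits in the remaining budget.

    Raises ValueError if the prompt alone uses up the whole budget.
    """
    prompt_tokens = estimate_prompt_tokens(prompt)
    allowed = budget - prompt_tokens
    if allowed <= 0:
        raise ValueError(
            f"Prompt too long: ~{prompt_tokens} tokens, budget is {budget}"
        )
    if requested_max_tokens > allowed:
        logger.warning(
            "Reducing max_tokens from %d to %d to fit the context window",
            requested_max_tokens, allowed,
        )
        return allowed
    return requested_max_tokens


def default_n_threads() -> int:
    """os.cpu_count() can return None, so fall back to 1 in that case."""
    return os.cpu_count() or 1
```

The generate method would then call clamp_max_tokens(prompt, max_tokens) before invoking the model, and the model initialization would pass default_n_threads() as n_threads. Note that os.cpu_count() reports logical cores; whether that is the fastest setting is worth measuring as part of the tuning task.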