TODO: Fix GGUF Model Context Window Error and Optimize Speed

Tasks

  • Modify generate method in model_loader_gguf.py to dynamically adjust max_tokens based on prompt length
  • Tune n_threads in model initialization for maximum speed
  • Test the changes to ensure nothing breaks

Details

  • Approximate prompt tokens by word count (split on whitespace); this is a rough estimate that usually undercounts real tokens
  • Calculate allowed max_tokens = 4000 - prompt_tokens
  • If the requested max_tokens exceeds the allowed value, reduce it and log a warning
  • Raise an error if the prompt is too long to leave room for any output tokens
  • Set n_threads to os.cpu_count() for speed (fall back to 1 if it returns None)
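
Sketch

The details above could be sketched roughly as follows. This is a minimal illustration, not the actual code in model_loader_gguf.py: the names `CONTEXT_BUDGET`, `estimate_prompt_tokens`, `clamp_max_tokens`, and `default_n_threads` are hypothetical, and the choice of `ValueError` for an over-long prompt is an assumption.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Token budget taken from the TODO (4000); the name is hypothetical.
CONTEXT_BUDGET = 4000


def estimate_prompt_tokens(prompt: str) -> int:
    """Approximate the token count by splitting on whitespace."""
    return len(prompt.split())


def clamp_max_tokens(prompt: str, requested_max_tokens: int,
                     budget: int = CONTEXT_BUDGET) -> int:
    """Return a max_tokens value that fits in the remaining budget."""
    prompt_tokens = estimate_prompt_tokens(prompt)
    allowed = budget - prompt_tokens
    if allowed <= 0:
        # Assumption: ValueError is used here; the TODO only says "raise error".
        raise ValueError(
            f"Prompt too long: ~{prompt_tokens} tokens, budget is {budget}"
        )
    if requested_max_tokens > allowed:
        logger.warning(
            "Reducing max_tokens from %d to %d to fit the context window",
            requested_max_tokens, allowed,
        )
        return allowed
    return requested_max_tokens


def default_n_threads() -> int:
    """Use all logical CPUs; os.cpu_count() can return None."""
    return os.cpu_count() or 1
```

The generate method would then pass the clamped value as max_tokens, and model initialization would pass default_n_threads() as n_threads. Because the word count is only an estimate, leaving some headroom below the real context size may be safer than using the full budget.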