Can this model be layer pruned?

#12

by Rnake - opened Apr 16

Apr 16

I want to run this model on my phone, but after layer pruning followed by distillation, quality drops dramatically. Is this related to the model's use of RoPE?
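For readers unfamiliar with the term: layer pruning here means dropping some of the transformer blocks and then distilling the smaller model against the original. The post does not say which layers were removed or how many, so the following is only a rough sketch of one common heuristic (keeping evenly spaced layers, including the first and last). The function name `layers_to_keep` and the 12-layer figure are hypothetical and not taken from this model.

```python
def layers_to_keep(num_layers: int, num_keep: int) -> list[int]:
    """Return num_keep layer indices spread evenly over range(num_layers).

    Illustrative heuristic only: the first and last layers are always kept
    (when num_keep >= 2) and the rest are spaced as uniformly as possible.
    """
    if not 1 <= num_keep <= num_layers:
        raise ValueError("num_keep must be between 1 and num_layers")
    if num_keep == 1:
        return [0]
    span = num_layers - 1
    gaps = num_keep - 1
    # Integer arithmetic with round-half-up, to avoid float rounding surprises.
    return [(i * span + gaps // 2) // gaps for i in range(num_keep)]


# Hypothetical example: prune a 12-layer encoder down to 6 layers.
kept = layers_to_keep(12, 6)
print(kept)
```

The more aggressive the pruning ratio, the more the distillation step has to recover, which is one reason a small model can degrade sharply.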

jupyterjazz

Jina AI org Apr 16

Hey, I don't think it's related to RoPE. It could be that the nano model is already quite compact, leaving little room for layer pruning depending on how many layers you removed, or that the distillation setup needs some tuning. Either way, if your goal is on-device inference, I'd suggest trying the quantized GGUF versions instead. Q4_K_M (157 MB, down from 424 MB) is probably the best tradeoff between size and quality.
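To put the two file sizes quoted above in perspective, here is a quick back-of-the-envelope calculation. It uses only the 157 MB and 424 MB figures from the reply; the helper name `size_reduction` is made up for illustration.

```python
def size_reduction(original_mb: float, quantized_mb: float) -> tuple[float, float]:
    """Return (fraction of original size remaining, fraction saved)."""
    ratio = quantized_mb / original_mb
    return ratio, 1.0 - ratio


# Sizes as quoted in the reply: 424 MB original, 157 MB for Q4_K_M.
ratio, saved = size_reduction(424, 157)
print(f"Q4_K_M is {ratio:.1%} of the original size ({saved:.1%} smaller)")
```

So the Q4_K_M file is a bit over a third of the original size, without changing the number of layers.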

Rnake

Apr 20

Are there any metrics for the performance of Q4_K_M?

michael-guenther

Jina AI org Apr 21

We don't have a detailed evaluation, but @hanxiao ran this one: https://www.linkedin.com/posts/hxiao87_low-quant-weights-make-the-embedding-model-activity-7449362879687880704-k25C?utm_source=share&utm_medium=member_desktop&rcm=ACoAACC3cG0Be5GVkDYvYPegcoac1w5VakM7_t8 It showed that Q4, specifically Q4_K_M, is probably a good tradeoff.
