---
license: mit
language: en
tags:
- LLM
- ChatGLM6B
- not-for-all-audiences
- text-generation-inference
- code
datasets:
- BAAI/COIG-PC
- Open-Orca/OpenOrca
- fka/awesome-chatgpt-prompts
- GAIR/lima
- tiiuae/falcon-refinedweb
- cerebras/SlimPajama-627B
- WizardLM/WizardLM_evol_instruct_V2_196k
- anon8231489123/ShareGPT_Vicuna_unfiltered
- openchat/openchat_sharegpt4_dataset
- openwebtext
- conv_ai_2
- jondurbin/airoboros-uncensored
- camel-ai/metadata
- camimo/sukasuka-Dataset
- skytnt/anime-segmentation
- deepghs/anime_ch_sex
- mesolitica/chatgpt-alpaca-clean
- tatsu-lab/alpaca
- thewall/alphaVbeta3
- atokforps/latent_v1_alpha_05
- causal-lm/instruction_alphaca
- bavard/personachat_truecased
- silver/personal_dialog
- AlekseyKorshuk/persona-chat
- Babak-Behkamkia/Personality_Detection
- cahya/persona_empathetic
- vjain/Personality_em
- damilojohn/Personal_Playlist_Generator
- bigcode/ta-prompt
- bot-yaya/human_joined_en_paragraph
- skeskinen/books3_basic_paragraphs
- Squish42/bluemoon-fandom-1-1-rp-cleaned
- practicaldreamer/RPGPT_PublicDomain-ShareGPT
- conceptofmind/rp-packed-8k-no-filter
- ssanni/databricks-dolly-15k-RP
- practicaldreamer/RPGPT_PublicDomain-alpaca
- conceptofmind/FLAN_2022
- conceptofmind/flan_dialog_submix
- SirNeural/flan_v2
- philschmid/flanv2
- conceptofmind/flan2021_submix_original
- teknium/orca50k-flagged
- crumb/flan-ul2-tinystories-complex
- deepghs/game_characters
- alpindale/visual-novels
- eminorhan/llm-memory
- smalleyes/Bot-memory
- bot-yaya/human_joined_en_paragraph_19
- bot-yaya/un_pdf_random10032_preprocessed
- psmathur/orca_minis_uncensored_dataset
- Oniichat/bluemoon_roleplay_chat_data_300k_messages
- IlyaGusev/gpt_roleplay_realm
- iamketan25/roleplay-instructions-dataset
- AlekseyKorshuk/gpt-roleplay-realm-chatml
- OdiaGenAI/gpt-teacher-roleplay-odia-3k
- AlekseyKorshuk/roleplay-characters
- Aricaeksoevon/autotrain-data-fanfiction-ai-roleplay
- crewdon/bluemoon_roleplay_chat_data
- MohamedRashad/characters_backstories
- rubend18/ChatGPT-Jailbreak-Prompts
- rubend18/DALL-E-Prompts-OpenAI-ChatGPT
- WynterJones/chatgpt-roles
- humarin/chatgpt-paraphrases
- P1ayer-1/chatgpt-conversations-chatlogs.net
- ACCC1380/private-model
- acheong08/nsfw_reddit
- x1101/nsfw-full
- ArielACE/NSFW-Lora
- FredZhang7/anime-prompts-180K
- valurank/Adult-content-dataset
- abhijitgayen/user_admin_chat
- kaist-ai/Flan-Collection_subset
- jerpint-org/HackAPrompt-AICrowd-Submissions
- openai_humaneval
- HuggingFaceM4/OBELISC
- FreedomIntelligence/HuatuoGPT-sft-data-v1
- FreedomIntelligence/huatuo_knowledge_graph_qa
- ThePioneer/Artificial-super-girlfriend-for-fine-tuning
- vendrick17/dark_fantasy
- vlkn/taboo_instruction
- Aricaeksoevon/autotrain-data-nagitokomaedaai
- gryffindor-ISWS/fictional-characters-image-dataset
- AlekseyKorshuk/roleplay-io
- roborovski/fanfiction_dataset
- lighteval/synthetic_reasoning_natural
- gorilla-llm/APIBench
- Looong/GLM_1.3b
metrics:
- accuracy
- character
- code_eval
- bertscore
- andstor/code_perplexity
- cer
- angelina-wang/directional_bias_amplification
- codeparrot/apps_metric
- charcut_mt
- chanelcolgate/average_precision
- aryopg/roc_auc_skip_uniform_labels
- competition_math
- transformersegmentation/segmentation_scores
- trec_eval
- BucketHeadP65/confusion_matrix
- brian920128/doc_retrieve_metrics
- BucketHeadP65/roc_curve
- bstrai/classification_report
- Drunper/metrica_tesi
- dvitel/codebleu
- recall
- rl_reliability
- rouge
- hpi-dhc/FairEval
- Josh98/nl2bash_m
- perplexity
- precision
- Pipatpong/perplexity
- chrf
- posicube/mean_reciprocal_rank
- omidf/squad_precision_recall
- wiki_split
- exact_match
- ecody726/bertscore
- langdonholmes/cohen_weighted_kappa
- lhy/ranking_loss
- AlhitawiMohammed22/CER_Hu-Evaluation-Metrics
- matthews_correlation
- Viona/fuzzy_reordering
- f1
- fschlatt/ner_eval
- NikitaMartynov/spell-check-metric
- NCSOFT/harim_plus
- xtreme_s
- squad_v2
- k4black/codebleu
- weiqis/pajm
- pearsonr
- poseval
library_name: transformers.js
---
## Breaking News!

**We know what you want, and here you go!**

- A newly released lyraChatGLM model, suitable for Ampere (A100/A10) as well as Volta (V100) GPUs.
- lyraChatGLM has been further optimized, reaching **9000+ tokens/s** on A100 and **3900+ tokens/s** on V100, about **5.5x** faster than the latest official version (as of 2023/6/1).
- Memory usage has been optimized too: batch_size can now be set up to **256** on A100!
- INT8 weight-only PTQ (post-training quantization) is supported.

**Note that the code has been fully updated as well, so you need to use the new API; see `Uses` below.**

If you like our work and are considering joining us, feel free to drop a line to benbinwu@tencent.com.

P.S. Recently we have received a lot of inquiries about accelerating customized models. At this moment we **do not plan** to release the conversion tool, nor do we think it would be possible to apply the current release to your customized models.
****
## Model Card for lyraChatGLM

lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.

lyraChatGLM achieves a **300x** inference speedup over the early original version. We are still working hard to further improve the performance.

Its main features (updated on 2023-06-20):
- weights: original ChatGLM-6B weights released by THUDM.
- device: Nvidia GPU with Ampere or Volta architecture (A100, A10, V100...).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- CUDA versions 11.X and 12.X are both supported.
- Model loading is faster: down from a few minutes to less than 10 s in non-INT8 mode, and around 1 min in INT8 mode.

## Speed
- original version (fixed batch infer): commit id 1d240ba
  
### test on A100 40G
1. The maximum batch size and maximum speed table for each version of the model.
|version|max_batch_size|max_speed|
|:-:|:-:|:-:|
|original|1|30 tokens/s|
|original (fixed batch infer)|192|1638.52 tokens/s|
|lyraChatGLM(current)|256|9082.60 tokens/s|
2. The speed table for the same batch size.
|version|1 batch_size|8 batch_size| 64 batch_size | 128 batch_size |
|:-:|:-:|:-:|:-:|:-:|
|original|30 tokens/s| - | - | - |
|original (fixed batch infer)|34.48 tokens/s|356.29 tokens/s|1638.52 tokens/s|1338.45 tokens/s|
|lyraChatGLM(current)|110.05 tokens/s|843.60 tokens/s|4926.92 tokens/s|7235.04 tokens/s|

### test on V100
1. The maximum batch size and maximum speed table for each version of the model.
|version|max_batch_size|max_speed|
|:-:|:-:|:-:|
|original|1|17.83 tokens/s|
|original (fixed batch infer)|128|992.20 tokens/s|
|lyraChatGLM(current)|192|3958.39 tokens/s|
2. The speed table for the same batch size.
|version|1 batch_size|8 batch_size| 64 batch_size | 128 batch_size |
|:-:|:-:|:-:|:-:|:-:|
|original|17.83 tokens/s| - | - | - |
|original (fixed batch infer)|17.83 tokens/s|228.95 tokens/s|889.7 tokens/s|922.20 tokens/s|
|lyraChatGLM(current)|59.33 tokens/s|514.15 tokens/s|2849.88 tokens/s|3958.39 tokens/s|
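
The per-batch-size speedups implied by the tables above can be worked out with a few lines of Python. This is only a convenience sketch: the numbers are copied from the tables, while the dictionary layout and the `speedups` helper are illustrative names, not part of lyraChatGLM.

```python
# Throughput figures (tokens/s) copied from the same-batch-size tables above.
BASELINE = "original (fixed batch infer)"
LYRA = "lyraChatGLM (current)"

a100 = {
    BASELINE: {1: 34.48, 8: 356.29, 64: 1638.52, 128: 1338.45},
    LYRA: {1: 110.05, 8: 843.60, 64: 4926.92, 128: 7235.04},
}
v100 = {
    BASELINE: {1: 17.83, 8: 228.95, 64: 889.7, 128: 922.20},
    LYRA: {1: 59.33, 8: 514.15, 64: 2849.88, 128: 3958.39},
}


def speedups(table):
    """Return {batch_size: lyra_speed / baseline_speed}, rounded to 2 decimals."""
    base, lyra = table[BASELINE], table[LYRA]
    return {bs: round(lyra[bs] / base[bs], 2) for bs in base}


for name, table in (("A100 40G", a100), ("V100", v100)):
    print(name, speedups(table))

# Peak-to-peak ratio on A100: each version at its own maximum batch size.
peak_ratio_a100 = round(9082.60 / 1638.52, 2)
print("A100 peak ratio:", peak_ratio_a100)
```

The A100 peak-to-peak ratio is where the roughly 5.5x figure quoted at the top comes from; the same-batch-size ratios are smaller at low batch sizes and grow as the batch size increases.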

## Model Sources

- **Repository:** https://huggingface.co/THUDM/chatglm-6b

## Docker Environment Recommendation

- For CUDA 11.X: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

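```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Mount the current directory (the repository) into the container and start a shell there
docker run --rm -it --gpus all -v "$(pwd)":/lyraChatGLM -w /lyraChatGLM nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container:
pip install -r requirements.txt
python demo.py
```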

## Uses

```python
from lyraChatGLM import LyraChatGLM6B

model_path = "./models/1-gpu-fp16.h5"
tokenizer_path = "./models"
data_type = "fp16"
int8_mode = 0   # 1 for INT8 WEIGHT ONLY PTQ
max_output_length = 150
arch = "Ampere" # Ampere or Volta
cuda_version = 12

model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch, cuda_version)
prompt = "列出3个不同的机器学习算法,并说明它们的适用范围."
test_batch_size = 256

prompts = [prompt, ]

# To get different outputs for identical prompts in the same batch, set do_sample to True
output_texts = model.generate(prompts, output_length=max_output_length, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, do_sample=False)

print(output_texts)

```
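
To get a rough tokens/s figure for your own setup, a small timing helper along these lines can wrap the `generate` call. This is a hedged sketch, not the benchmark script behind the tables above: `measure_throughput` is a hypothetical helper, and it assumes every prompt produces exactly `output_length` new tokens, which overestimates throughput whenever generation stops early.

```python
import time
from typing import Callable, List


def measure_throughput(
    generate_fn: Callable[[List[str]], List[str]],
    prompts: List[str],
    output_length: int,
) -> float:
    """Return an estimate of generated tokens per second for one batched call.

    Assumption: each prompt yields `output_length` new tokens.
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    # Guard against a zero interval on very fast stand-in functions.
    elapsed = max(time.perf_counter() - start, 1e-9)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return len(prompts) * output_length / elapsed


if __name__ == "__main__":
    # Stand-in for model.generate so the sketch runs without a GPU.
    def fake_generate(batch: List[str]) -> List[str]:
        time.sleep(0.01)
        return ["" for _ in batch]

    print(f"{measure_throughput(fake_generate, ['hello'] * 8, 150):.1f} tokens/s")
```

With the real model, one would pass something like `lambda p: model.generate(p, output_length=max_output_length, do_sample=False)` as `generate_fn` and a list of `test_batch_size` prompts; repeating the same prompt to fill the batch is an assumption made here for illustration.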
## Demo output

### input
列出3个不同的机器学习算法,并说明它们的适用范围.

### output
以下是三个常见的机器学习算法及其适用范围:

1. 决策树(Decision Tree):决策树是一种基于分类和回归问题的朴素贝叶斯模型。它通过构建一系列逐步分裂的分支来预测结果。适用于那些具有简单特征、大量数据且数据集大小在可接受范围内的情况。

2. 随机森林(Random Forest):随机森林是一种集成学习算法,由多个决策树组成。它的优点是能够处理大规模数据和高维度的特征。适用于需要对多个变量进行建模的场景,例如医疗诊断、金融风险评估等。

3. 支持向量机(Support Vector Machine):支持向量机是一种监督学习方法,通常用于分类问题。它可以处理高维数据,并且具有较高的准确性。适用于需要对高维数据进行分类或回归的问题,例如图像识别、自然语言处理等。

## INT8 

**Int8 usage**: 

Our current version supports INT8 weight-only PTQ. To enable this mode, set `int8_mode` to `1` in the demo.py file.

**In this mode, GPU memory usage can be further reduced by about half and the speed can be doubled.**

This solves the issue mentioned in https://github.com/THUDM/ChatGLM-6B/issues/1042. 

However, the speed gain is best achieved with a batch size of no more than 128. If you are not using an A100 GPU, you can
reduce the batch size to get the benefits; we recommend a batch size of 64. This mode is well suited to GPUs with
limited VRAM, or to real-time services where large batch sizes are hard to use.

It should be noted that although we have aligned the accuracy in our test cases, there may be slight differences 
in accuracy in some untested scenarios with int8. Please be aware of this.
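
For readers unfamiliar with the term, the idea behind weight-only INT8 post-training quantization can be sketched in plain Python. This is a generic illustration under stated assumptions (symmetric per-row scaling, activations kept in floating point), not the kernels lyraChatGLM actually uses; `quantize_rows` and `matvec_weight_only` are hypothetical names.

```python
from typing import List, Tuple


def quantize_rows(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row INT8 quantization: w is approximated by q * scale, q in [-127, 127]."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales


def matvec_weight_only(q_rows: List[List[int]], scales: List[float], x: List[float]) -> List[float]:
    """Compute y = W @ x with W stored as INT8; the activations x stay in floating point."""
    return [
        scale * sum(q * xi for q, xi in zip(row, x))
        for row, scale in zip(q_rows, scales)
    ]


weights = [[0.12, -0.50, 0.33], [1.20, 0.05, -0.70]]
x = [1.0, 2.0, -1.0]

q_rows, scales = quantize_rows(weights)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
approx = matvec_weight_only(q_rows, scales, x)
print("exact :", exact)
print("approx:", approx)

# Storage: 1 byte per INT8 weight versus 2 bytes per FP16 weight
# (the per-row scales add a small overhead on top).
n_weights = sum(len(row) for row in weights)
print("fp16 bytes:", 2 * n_weights, "int8 bytes:", n_weights)
```

Storing one byte per weight instead of two is what makes a reduction of about half in weight memory plausible, and the small gap between `exact` and `approx` is the kind of rounding error behind the accuracy caveat above.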


## Citation
``` bibtex
@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM to 9000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}
```

## Reporting bugs
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark bug reports with `[bug]` in the title.