Sherwinroger002 committed on
Commit
e8a2adf
·
0 Parent(s):

Initial upload

Files changed (11)
  1. .gitattributes +1 -0
  2. README.md +156 -0
  3. config.json +9 -0
  4. merges.txt +0 -0
  5. model.py +211 -0
  6. model.safetensors +3 -0
  7. special_tokens_map.json +5 -0
  8. test.py +27 -0
  9. tokenizer.json +0 -0
  10. tokenizer_config.json +20 -0
  11. vocab.json +0 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,156 @@
+ # Custom GPT Language Model
+
+ A custom GPT-style autoregressive transformer language model implemented from scratch in PyTorch.
+
+ This project contains:
+ - custom multi-head self-attention
+ - transformer blocks
+ - causal masking
+ - autoregressive text generation
+ - mixed precision training
+ - top-k / top-p sampling
+ - safetensors model weights
+
+ The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.
+
+ ---
+
+ # Architecture
+
+ Model configuration:
+
+ ```python
+ {
+     "vocab_size": 50257,
+     "context_length": 256,
+     "emb_dim": 768,
+     "n_heads": 12,
+     "n_layers": 12,
+     "drop_rate": 0.1,
+     "qkv_bias": False
+ }
+ ```
+
+ Approximate parameter count:
+ - ~162M parameters (~124M excluding the output head, which is not weight-tied to the token embeddings)
+
+ Architecture components:
+ - token embeddings
+ - positional embeddings
+ - masked multi-head self-attention
+ - feed-forward MLP blocks
+ - pre-layer normalization
+ - residual connections
+ - causal language modeling head
+
+ ---
+
+ # Training
+
+ Training setup:
+ - PyTorch
+ - AdamW optimizer
+ - Automatic Mixed Precision (AMP)
+ - Gradient clipping
+ - Top-k / Top-p text generation
+
+ Hardware used:
+ - RTX 3060 Ti 8GB
+
+ Dataset:
+ - FineWeb-Edu subset (10M tokens)
+
+ Tokenizer:
+ - GPT-2 tokenizer
+
+ ---
+
+ # Installation
+
+ Install dependencies:
+
+ ```bash
+ pip install torch transformers safetensors
+ ```
+
+ ---
+
+ # Loading The Model
+
+ ```python
+ import json
+ import torch
+
+ from safetensors.torch import load_file
+ from transformers import AutoTokenizer
+
+ from model import GPTModel
+
+ # load config
+ with open("config.json") as f:
+     cfg = json.load(f)
+
+ # create model
+ model = GPTModel(cfg)
+
+ # load weights
+ state_dict = load_file("model.safetensors")
+
+ model.load_state_dict(state_dict)
+
+ model.to("cuda").eval()  # the generation helpers in model.py expect the model on CUDA
+
+ # tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(".")
+ ```
+
+ ---
+
+ # Text Generation Example
+
+ ```python
+ from model import generate_and_print_sample
+
+ generate_and_print_sample(model, tokenizer, "cuda", "The world is big")  # prints the sample; returns None
+ ```
+
+ ---
+
+ # Sample Generations
+
+ Example generations from early-stage training:
+
+ > "The world is big and is a whole for children. The best part of which has been made in the lives, and the state is an ideal man, but also the same one is in the world. “The only one has been created by people,” said the new study of the journal In the past, it is the best “s not people who have no longer to have not been seen in a few years.” “The only one who have one, the most famous in the country has no one at least three years. “If you’re very low, it is not a big or less than one’s risk.” The study is a study of people who have already been reported that the risk of people who are diagnosed with HIV-S"
+
+ The model currently demonstrates:
+ - syntactic coherence
+ - topic persistence
+ - autoregressive language modeling
+ - early semantic structure
+
+ ---
+
+
+ # Files
+
+ ```text
+ model.py              # GPT architecture
+ model.safetensors     # trained weights
+ config.json           # model configuration
+ tokenizer files       # GPT2 tokenizer assets
+ README.md             # project documentation
+ ```
+
+ ---
+
+ # Notes
+
+ This is a custom PyTorch implementation and is not directly compatible with Hugging Face `AutoModelForCausalLM`.
+
+ Users should load the model using the provided `model.py` architecture.
+
+ ---
+
+ # License
+
+ MIT License.
config.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "vocab_size": 50257,
+     "context_length": 256,
+     "emb_dim": 768,
+     "n_heads": 12,
+     "n_layers": 12,
+     "drop_rate": 0.1,
+     "qkv_bias": false
+ }
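As a rough check on the model size implied by this configuration, the sketch below tallies parameters layer by layer. It assumes the layer shapes defined in `model.py` in this commit (an untied output head with bias, biased attention projection and MLP layers, and norm layers with scale and shift); `count_parameters` is a hypothetical helper written for this illustration, not part of the repository.

```python
# Parameter tally for the configuration in config.json.
# Assumption: layer shapes follow model.py in this commit.
cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "qkv_bias": False,
}


def count_parameters(cfg):
    """Return a per-component parameter count plus the total."""
    d = cfg["emb_dim"]
    qkv_bias = d if cfg["qkv_bias"] else 0
    # W_query, W_key, W_value (optional bias) plus the output projection (with bias)
    attention = 3 * (d * d + qkv_bias) + (d * d + d)
    # Expansion to 4*d and contraction back to d, both with bias
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two norm layers per block, each with a scale and a shift vector
    norms = 2 * (2 * d)
    block = attention + feed_forward + norms

    counts = {
        "tok_emb": cfg["vocab_size"] * d,
        "pos_emb": cfg["context_length"] * d,
        "blocks": cfg["n_layers"] * block,
        "final_norm": 2 * d,
        "out_head": d * cfg["vocab_size"] + cfg["vocab_size"],
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters(cfg)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")

# Assumption: the causal-mask buffers registered in each attention layer
# were saved alongside the weights, and everything is stored as float32.
mask_elements = cfg["n_layers"] * cfg["context_length"] ** 2
fp32_bytes = 4 * (counts["total"] + mask_elements)
print(f"fp32 tensor bytes: {fp32_bytes:,}")
```

Under these assumptions the total comes to roughly 162M parameters, or about 124M without the output head, which is the figure usually quoted for a GPT-2-small-sized model with tied embeddings. The float32 byte estimate lands within about 18 KB of the `size 653043260` recorded in the `model.safetensors` LFS pointer, which is consistent with the remainder being the safetensors header.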
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.py ADDED
@@ -0,0 +1,211 @@
+ import torch
+ from torch.utils.data import Dataset, DataLoader
+ import torch.nn as nn
+
+
+
+ class MultiHeadAttention(nn.Module):
+     def __init__(self,d_in,d_out,context_length,dropout,qkv_bias,n_heads):
+         super().__init__()
+         self.n_heads = n_heads
+         self.head_dim = d_out // n_heads
+         self.d_out = d_out
+         self.W_key = nn.Linear(d_in,d_out,bias=qkv_bias)
+         self.W_query = nn.Linear(d_in,d_out,bias=qkv_bias)
+         self.W_value = nn.Linear(d_in,d_out,bias=qkv_bias)
+         self.dropout = nn.Dropout(dropout)
+         self.proj = nn.Linear(d_out,d_out)
+         self.register_buffer(
+             'mask',
+             torch.triu(torch.ones(context_length, context_length),
+                        diagonal=1)
+         )
+
+     def forward(self,x):
+         b,n_tokens,d_out = x.shape
+         keys = self.W_key(x).view(b,n_tokens,self.n_heads,self.head_dim)
+         queries = self.W_query(x).view(b,n_tokens,self.n_heads,self.head_dim)
+         values = self.W_value(x).view(b,n_tokens,self.n_heads,self.head_dim)
+
+         keys = keys.transpose(1,2)
+         queries = queries.transpose(1,2)
+         values = values.transpose(1,2)
+
+         attn_scores = queries @ keys.transpose(2,3)
+         attn_scores = attn_scores.masked_fill_(self.mask.bool()[:n_tokens,:n_tokens],-torch.inf)
+
+         attn_weights = torch.softmax(attn_scores/ keys.shape[-1]**0.5, dim=-1)
+         attn_weights = self.dropout(attn_weights)
+
+         cntx_vec = (attn_weights @ values).transpose(1,2)
+
+         cntx_vec = cntx_vec.contiguous().view(b,n_tokens,self.d_out)
+
+         return self.proj(cntx_vec)
+
+
+
+
+ class NormLayer(nn.Module):
+     def __init__(self,emb_dim):
+         super().__init__()
+         self.eps = 1e-5
+         self.scale = nn.Parameter(torch.ones(emb_dim))
+         self.shift = nn.Parameter(torch.zeros(emb_dim))
+
+     def forward(self,x):
+         mean = x.mean(dim=-1,keepdim=True)
+         var = x.var(dim=-1,keepdim=True,unbiased=False)
+         return self.scale * ((x-mean)/torch.sqrt(var+self.eps)) + self.shift
+
+
+
+ class GELU(nn.Module):
+     def __init__(self):
+         super().__init__()
+
+     def forward(self, x):
+         return 0.5 * x * (1 + torch.tanh(
+             torch.sqrt(torch.tensor(2.0 / torch.pi)) *
+             (x + 0.044715 * torch.pow(x, 3))
+         ))
+
+
+
+ class FeedForward(nn.Module):
+     def __init__(self, cfg):
+         super().__init__()
+         self.layers = nn.Sequential(
+             nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
+             GELU(),
+             nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
+         )
+
+     def forward(self, x):
+         return self.layers(x)
+
+
+ class TransformerBlock(nn.Module):
+     def __init__(self,cfg):
+         super().__init__()
+         self.attn = MultiHeadAttention(d_in=cfg["emb_dim"],d_out=cfg["emb_dim"],context_length=cfg["context_length"],dropout=cfg["drop_rate"],qkv_bias=cfg["qkv_bias"],n_heads=cfg["n_heads"])
+         self.ff = FeedForward(cfg)
+         self.norm1 = NormLayer(cfg["emb_dim"])
+         self.norm2 = NormLayer(cfg["emb_dim"])
+         self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
+
+     def forward(self,x):
+         shortcut = x
+         x = self.norm1(x)
+         x = self.attn(x)
+         x = self.drop_shortcut(x)
+         x = x + shortcut
+
+         shortcut = x
+         x = self.norm2(x)
+         x = self.ff(x)
+         x = self.drop_shortcut(x)
+         x = x + shortcut
+
+         return x
+
+ vocab_size=50257  # GPT-2 vocabulary size; GPTModel reads cfg["vocab_size"] instead
+
+ class GPTModel(nn.Module):
+     def __init__(self,cfg):
+         super().__init__()
+         self.tok_emb = nn.Embedding(cfg["vocab_size"],cfg["emb_dim"])
+         self.pos_emb = nn.Embedding(cfg["context_length"],cfg["emb_dim"])
+         self.drop_emb = nn.Dropout(cfg["drop_rate"])
+         self.tranf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
+         self.out_head = nn.Linear(cfg["emb_dim"],cfg["vocab_size"])
+         self.final_norm = NormLayer(cfg["emb_dim"])
+
+     def forward(self,x):
+         b,n_inp = x.shape
+         tok_emb = self.tok_emb(x)
+         pos_emb = self.pos_emb(torch.arange(n_inp,device=x.device))
+         x = tok_emb + pos_emb
+         x= self.drop_emb(x)
+         x = self.tranf_blocks(x)
+         x = self.final_norm(x)
+         x = self.out_head(x)
+
+         return x
+
+
+
+
+ def generate_text(
+     model,
+     idx,
+     max_new_tokens,
+     context_size,
+     temperature=0.7,
+     top_k=40
+ ):
+
+     model.eval()
+
+     for _ in range(max_new_tokens):
+
+         idx_cond = idx[:, -context_size:]
+
+         with torch.no_grad():
+             with torch.amp.autocast("cuda"):
+
+                 logits = model(idx_cond)
+
+         logits = logits[:, -1, :]
+
+         # temperature scaling
+         logits = logits / temperature
+
+         # top-k filtering
+         top_logits, top_indices = torch.topk(
+             logits,
+             top_k
+         )
+
+         # probabilities only over top-k
+         top_probas = torch.softmax(
+             top_logits,
+             dim=-1
+         )
+
+         # sample from top-k
+         idx_next = top_indices.gather(
+             -1,
+             torch.multinomial(top_probas, 1)
+         )
+
+         idx = torch.cat((idx, idx_next), dim=1)
+
+     return idx
+
+
+
+
+ def text_to_token_ids(text, tokenizer):
+     encoded = tokenizer.encode(text)
+     encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add a batch dimension
+     return encoded_tensor
+
+ def token_ids_to_text(token_ids, tokenizer):
+     flat = token_ids.squeeze(0)  # remove the batch dimension
+     return tokenizer.decode(flat.tolist())
+
+
+
+ def generate_and_print_sample(model, tokenizer, device, start_context):
+     model.eval()
+     context_size = model.pos_emb.weight.shape[0]
+     encoded = text_to_token_ids(start_context, tokenizer).to(device)
+     with torch.no_grad():
+         token_ids = generate_text(
+             model=model, idx=encoded,
+             max_new_tokens=200, context_size=context_size,temperature=0.85,top_k=40
+         )
+     decoded_text = token_ids_to_text(token_ids, tokenizer)
+     print(decoded_text.replace("\n", " "))  # print on a single line
+     model.train()
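The README mentions top-p sampling, while `generate_text` in `model.py` applies only temperature scaling and top-k filtering. For comparison, here is a minimal, framework-free sketch of nucleus (top-p) filtering over a single logit vector. The helpers `softmax`, `top_p_filter`, and `sample_top_p` are hypothetical names introduced for this illustration and are not part of the repository; a PyTorch version would operate on batched tensors instead of Python lists.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(logits, top_p=0.9, temperature=1.0):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, and renormalize their probabilities."""
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_top_p(logits, top_p=0.9, temperature=1.0, rng=random):
    """Draw one token id from the renormalized nucleus distribution."""
    nucleus = top_p_filter(logits, top_p, temperature)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
nucleus = top_p_filter(logits, top_p=0.9)
print(sorted(nucleus))  # token ids that survive the filter
print(sample_top_p(logits, top_p=0.9, rng=random.Random(0)))
```

Unlike top-k, which always keeps a fixed number of candidates, the size of the nucleus adapts to how peaked the distribution is: a confident prediction may keep a single token, while a flat one keeps many.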
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec8dfec355348b16ee3c12dee51e77fb6b0d1c38057ce888c0375a5c619ab29f
+ size 653043260
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+     "bos_token": "<|endoftext|>",
+     "eos_token": "<|endoftext|>",
+     "unk_token": "<|endoftext|>"
+ }
test.py ADDED
@@ -0,0 +1,27 @@
+ import json
+ import torch
+
+ from safetensors.torch import load_file
+ from transformers import AutoTokenizer
+
+ from model import GPTModel,generate_and_print_sample
+
+ # load config
+ with open("config.json") as f:
+     cfg = json.load(f)
+
+ # create model
+ model = GPTModel(cfg)
+ model.to("cuda")
+
+ # load weights
+ state_dict = load_file("model.safetensors")
+
+ model.load_state_dict(state_dict)
+
+ model.eval()
+
+ # tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(".")
+
+ generate_and_print_sample(model, tokenizer, "cuda", "The world is big")  # prints the sample; returns None
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+     "add_prefix_space": false,
+     "added_tokens_decoder": {
+         "50256": {
+             "content": "<|endoftext|>",
+             "lstrip": false,
+             "normalized": true,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         }
+     },
+     "bos_token": "<|endoftext|>",
+     "clean_up_tokenization_spaces": false,
+     "eos_token": "<|endoftext|>",
+     "extra_special_tokens": {},
+     "model_max_length": 1024,
+     "tokenizer_class": "GPT2Tokenizer",
+     "unk_token": "<|endoftext|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff