Completely wrong translations from English to Korean.

#3
by titericz - opened

Helsinki-NLP/opus-mt-tc-big-en-ko is not working.

Me too, lol

I found a way.

You can build a target_vocab.json file from the "spm" files in the model repository using the code below. (You will need to adapt it to your model, though.)

import json

from transformers import MarianTokenizer

# Load the tokenizer whose SentencePiece ("spm") models you want to export.
# Change the model name to match the model you are fixing.
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-tc-big-ko-en')

# Source vocabulary: map each SentencePiece piece to its ID, then add the padding token.
vocab = {tokenizer.spm_source.id_to_piece(i): i for i in range(tokenizer.spm_source.get_piece_size())}
vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("vocab_ko_en.json", "w") as f:
    json.dump(vocab, f, indent=2)

# Target vocabulary: same procedure, using the target-side SentencePiece model.
target_vocab = {tokenizer.spm_target.id_to_piece(i): i for i in range(tokenizer.spm_target.get_piece_size())}
target_vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("target_vocab_ko_en.json", "w") as f:
    json.dump(target_vocab, f, indent=2)

Then load the tokenizer like this, pointing target_vocab_file at the file you just wrote:

tokenizer = MarianTokenizer.from_pretrained(model_name, separate_vocabs=True, target_vocab_file="./target_vocab_ko_en.json")

And it will work!
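For anyone adapting this, here is a minimal sketch of the same piece-to-ID export written as a reusable helper. The names `build_vocab`, `PieceTable`, and `FakeProcessor` are hypothetical and only for illustration. The fake processor stands in for a real SentencePiece model so the logic can be checked without downloading anything, and its piece list and pad ID are made up.

```python
import json
from typing import Dict, Protocol


class PieceTable(Protocol):
    """The two SentencePiece methods the export relies on."""

    def get_piece_size(self) -> int: ...

    def id_to_piece(self, piece_id: int) -> str: ...


def build_vocab(sp: PieceTable, pad_token: str, pad_token_id: int) -> Dict[str, int]:
    """Map every piece to its ID, then register the padding token."""
    vocab = {sp.id_to_piece(i): i for i in range(sp.get_piece_size())}
    vocab[pad_token] = pad_token_id
    return vocab


class FakeProcessor:
    """Hypothetical stand-in for a SentencePiece processor (illustration only)."""

    def __init__(self, pieces):
        self._pieces = list(pieces)

    def get_piece_size(self) -> int:
        return len(self._pieces)

    def id_to_piece(self, piece_id: int) -> str:
        return self._pieces[piece_id]


# Made-up pieces and pad ID; a real model has tens of thousands of pieces.
fake = FakeProcessor(["<unk>", "<s>", "</s>", "▁hello", "▁world"])
target_vocab = build_vocab(fake, pad_token="<pad>", pad_token_id=5)

# ensure_ascii=False keeps non-ASCII pieces (e.g. Korean) readable in the JSON.
serialized = json.dumps(target_vocab, indent=2, ensure_ascii=False)
```

With the real tokenizer from the post above, the call would presumably be `build_vocab(tokenizer.spm_target, tokenizer.pad_token, tokenizer.pad_token_id)`, and likewise with `tokenizer.spm_source` for the source side.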

@dee

This didn't work for me; I still get some very odd results. I made one amendment, also passing the source vocab to from_pretrained, but to no avail :/
