Completely wrong translations from English to Korean.

by titericz - opened Apr 27, 2023

Discussion

titericz

Apr 27, 2023

Helsinki-NLP/opus-mt-tc-big-en-ko is not working.

jisukim8873

May 18, 2023

Me too, lol

deepbluechip7

May 28, 2025

•

edited May 28, 2025

I found a way.

You can build a target vocab JSON file from the "spm" files in the repo's file list using the code below. (You should modify it for your model, though.)

from transformers import MarianTokenizer
import json

# Swap in the checkpoint you are actually using.
model_name = 'Helsinki-NLP/opus-mt-tc-big-ko-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Map every source-side SentencePiece piece to its id, then add the pad token.
spm_source = tokenizer.spm_source
vocab = {spm_source.id_to_piece(i): i for i in range(spm_source.get_piece_size())}
vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("vocab_ko_en.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, indent=2, ensure_ascii=False)

# Do the same for the target side.
spm_target = tokenizer.spm_target
target_vocab = {spm_target.id_to_piece(i): i for i in range(spm_target.get_piece_size())}
target_vocab[tokenizer.pad_token] = tokenizer.pad_token_id

with open("target_vocab_ko_en.json", "w", encoding="utf-8") as f:
    json.dump(target_vocab, f, indent=2, ensure_ascii=False)

Then load the tokenizer like this, pointing target_vocab_file at the file written above:

tokenizer = MarianTokenizer.from_pretrained(model_name, target_vocab_file="./target_vocab_ko_en.json", separate_vocabs=True)

And it will work!
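
One thing worth checking before loading the generated file: the pad token is added with tokenizer.pad_token_id, which may or may not coincide with an id already taken by a SentencePiece piece. The following is only a minimal sketch of such a check. `find_duplicate_ids` is a hypothetical helper (not part of transformers), and the toy dictionary stands in for the contents of the JSON file written above.

```python
from collections import defaultdict


def find_duplicate_ids(vocab):
    """Return {id: [tokens]} for every id that more than one token maps to."""
    tokens_by_id = defaultdict(list)
    for token, idx in vocab.items():
        tokens_by_id[idx].append(token)
    return {idx: sorted(toks) for idx, toks in tokens_by_id.items() if len(toks) > 1}


# Toy stand-in for a generated vocab; in practice, load the JSON file instead.
toy_vocab = {"<unk>": 0, "</s>": 1, "▁hello": 2, "<pad>": 2}
duplicates = find_duplicate_ids(toy_vocab)
print(duplicates)
```

An empty result means every id maps to exactly one token. A non-empty result would be one possible source of garbled output, since two tokens would then decode from the same id.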

mathdons

Jun 11, 2025

@dee

That didn't work for me; I still get some very odd results. I made one amendment, passing the source vocab to from_pretrained as well, but to no avail :/
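
A possible way to narrow this down is to compare the ids in a generated vocab against a reference mapping, for example the vocab.json shipped with the checkpoint (assuming that file reflects the ids the model was trained with, which is not confirmed in this thread). This is a rough sketch only. `diff_vocab_ids` is a hypothetical helper, and the toy dictionaries stand in for the two loaded JSON files.

```python
def diff_vocab_ids(generated, reference):
    """Return {token: (generated_id, reference_id)} for shared tokens whose ids differ."""
    shared = generated.keys() & reference.keys()
    return {
        tok: (generated[tok], reference[tok])
        for tok in shared
        if generated[tok] != reference[tok]
    }


# Toy stand-ins; in practice, json.load both files and pass the dicts in.
generated = {"<unk>": 0, "▁hello": 2, "▁world": 3}
reference = {"<unk>": 1, "▁hello": 2, "▁bye": 4}
mismatches = diff_vocab_ids(generated, reference)
print(mismatches)
```

If many shared tokens come back with different ids, the generated file and the model probably disagree about the id space, and passing it to from_pretrained would be unlikely to help.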
