SahandNZ/cryptonews-articles-with-price-momentum-labels
Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.
If you want to test whether text alpha adds signal over market/FnG-style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.
Primary text encoder used in this release: boltuix/bert-lite.
This work uses the public Hugging Face dataset SahandNZ/cryptonews-articles-with-price-momentum-labels.
Please cite and credit the original dataset creator (SahandNZ) when reusing these artifacts.
.npy embedding tensors.
Stored under results/:
results/metrics_xgb_cls_vs_numeric.json
results/results_summary.csv
Stored under embeddings/:
embeddings/bertlite_full_fresh__train_cls_embeddings.npy
embeddings/bertlite_full_fresh__val_cls_embeddings.npy
embeddings/bertlite_full_fresh__test_cls_embeddings.npy
embeddings/finbert_full_fresh__train_cls_embeddings.npy
embeddings/finbert_full_fresh__val_cls_embeddings.npy
embeddings/finbert_full_fresh__test_cls_embeddings.npy
Stored under embeddings/:
embeddings/embeddings_manifest.csv
import numpy as np
from xgboost import XGBClassifier
# 1) Load precomputed text embeddings
X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")
# 2) Load your numeric features aligned to the same row order
# X_num_train, X_num_val = ...
# 3) Fuse text + numeric features
# X_train = np.concatenate([X_num_train, X_text_train], axis=1)
# X_val = np.concatenate([X_num_val, X_text_val], axis=1)
# 4) Train a downstream model
# clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_val)
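The fusion step in the snippet above only works if the numeric features and the embeddings have the same number of rows in the same order. A minimal sketch of a guarded concatenation follows. The helper name `fuse_features` and the synthetic demo arrays are assumptions for illustration, not part of this release.

```python
import numpy as np

def fuse_features(X_num, X_text):
    """Concatenate numeric and text features column-wise after a basic alignment check."""
    X_num = np.asarray(X_num)
    X_text = np.asarray(X_text)
    if X_num.ndim != 2 or X_text.ndim != 2:
        raise ValueError("both inputs must be 2-D (rows x features)")
    if X_num.shape[0] != X_text.shape[0]:
        raise ValueError(
            f"row mismatch: {X_num.shape[0]} numeric rows vs {X_text.shape[0]} embedding rows"
        )
    # Numeric columns come first, embedding columns second, as in the snippet above.
    return np.concatenate([X_num, X_text], axis=1)

# Synthetic stand-ins (hypothetical shapes); replace with your numeric features
# and an array loaded from embeddings/ via np.load.
rng = np.random.default_rng(0)
X_num_demo = rng.normal(size=(8, 3))
X_text_demo = rng.normal(size=(8, 16))
X_fused_demo = fuse_features(X_num_demo, X_text_demo)
print(X_fused_demo.shape)  # (8, 19)
```

A matching row count does not prove matching row order. If your numeric features were sorted or filtered differently from the embedding files, the check passes while the rows are silently misaligned, so keep a shared key or index on your side.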
Minimal pseudocode:
load CLS embeddings
align with numeric feature rows
concatenate [numeric, CLS]
train XGBoost
compare vs numeric-only baseline
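The last pseudocode step, comparing against a numeric-only baseline, can be sketched as a small harness that trains the same model twice and reports both scores. Everything here is an assumption for illustration: the function name `compare_feature_sets`, the use of plain accuracy as the metric, the synthetic data, and the tiny `NearestCentroid` stand-in, which is swapped in for XGBoost only to keep the example free of extra dependencies.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in classifier exposing fit/predict; not the model used in this release."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every row to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def compare_feature_sets(make_model, X_num_tr, X_text_tr, y_tr, X_num_va, X_text_va, y_va):
    """Return validation accuracy for numeric-only and numeric+CLS feature sets."""
    feature_sets = {
        "numeric_only": (X_num_tr, X_num_va),
        "numeric_plus_cls": (
            np.concatenate([X_num_tr, X_text_tr], axis=1),
            np.concatenate([X_num_va, X_text_va], axis=1),
        ),
    }
    results = {}
    for name, (X_tr, X_va) in feature_sets.items():
        model = make_model()  # fresh model per feature set
        model.fit(X_tr, y_tr)
        results[name] = float(np.mean(model.predict(X_va) == y_va))
    return results

# Synthetic data: numeric features are noise, "text" features carry the label.
rng = np.random.default_rng(0)

def make_split(n):
    y = rng.integers(0, 2, size=n)
    X_num = rng.normal(size=(n, 3))
    X_text = rng.normal(size=(n, 4)) + 3.0 * (2 * y[:, None] - 1)
    return X_num, X_text, y

X_num_tr, X_text_tr, y_tr = make_split(200)
X_num_va, X_text_va, y_va = make_split(100)
scores = compare_feature_sets(
    NearestCentroid, X_num_tr, X_text_tr, y_tr, X_num_va, X_text_va, y_va
)
print(scores)
```

To run the comparison with XGBoost instead, pass a factory such as `lambda: XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)` as `make_model`; any estimator with `fit` and `predict` should fit this harness.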
Artifacts are precomputed embeddings (.npy), not fine-tuned checkpoints.
Use embeddings/embeddings_manifest.csv to verify integrity.
Artifacts are embedding arrays (.npy), not raw text records.
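The column layout of embeddings/embeddings_manifest.csv is not described here, so the sketch below does not read it. Instead it shows a complementary sanity check on the arrays themselves: each split should be a finite 2-D float array, and all splits from one encoder should share the same embedding width. The helper name `check_embedding_splits` and the demo shapes are assumptions.

```python
import numpy as np

def check_embedding_splits(splits):
    """Validate a dict of split name -> array and return the shared embedding width."""
    widths = set()
    for name, arr in splits.items():
        if arr.ndim != 2:
            raise ValueError(f"{name}: expected a 2-D array, got shape {arr.shape}")
        if not np.issubdtype(arr.dtype, np.floating):
            raise ValueError(f"{name}: expected a float dtype, got {arr.dtype}")
        if not np.isfinite(arr).all():
            raise ValueError(f"{name}: contains NaN or infinite values")
        widths.add(arr.shape[1])
    if len(widths) != 1:
        raise ValueError(f"splits disagree on embedding width: {sorted(widths)}")
    return widths.pop()

# Synthetic stand-ins with hypothetical shapes. With the real files you would build
# the dict from np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy") etc.
rng = np.random.default_rng(0)
demo_splits = {
    "train": rng.normal(size=(10, 16)).astype(np.float32),
    "val": rng.normal(size=(4, 16)).astype(np.float32),
    "test": rng.normal(size=(5, 16)).astype(np.float32),
}
embedding_dim = check_embedding_splits(demo_splits)
print(embedding_dim)  # 16
```

Run the check per encoder (bertlite and finbert separately), since the two encoders are not guaranteed to produce the same embedding width.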