# PoreCodec-RSQ5x4x2-605261

PoreCodec is a high-performance, standardized neural tokenization ecosystem designed for raw nanopore electrical signals (squiggles).

`pore-codec-rsq5x4x4-2605` is the first-generation release, built on a Residual Finite Scalar Quantization (RSQ) architecture. It compresses high-frequency continuous electrical squiggles into discrete token IDs, bridging raw genomic signals and generative language models (e.g., downstream Genomic Foundation Models or PoreGPT).

## Model Architecture

The architecture consists of three interconnected blocks engineered for high-throughput genomic data:

1. Backbone (`PoreCNNModel`): A 1D convolutional neural network with standard receptive-field scaling. The encoder downsamples by a factor (stride) of 4, shortening the raw signal fourfold while capturing transient current changes.
2. Quantizer (`PoreResidualFSQ`): A multi-codebook system that applies Finite Scalar Quantization (FSQ) with a straight-through estimator (STE).
   - Level Configuration: `[5, 5, 5, 5]` per quantizer layer ($5^4 = 625$ codes per layer).
   - Residual Depth: 2 residual quantizer layers (x2), allowing coarse-to-fine signal discretization.
3. Outer Projection Layers: Fully tied linear projections (`project_in` and `project_out`) that handle the dimensionality mapping around the quantizer, with weights stored natively in safetensors format.
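The coarse-to-fine behaviour of the quantizer can be illustrated with a small pure-Python sketch. It is not the `PoreResidualFSQ` implementation: the tanh bounding, the mixed-radix ID packing, and the per-layer scale schedule are assumptions borrowed from common FSQ formulations, and only the forward pass is shown (the straight-through estimator matters for gradients during training and is omitted here).

```python
import math

LEVELS = [5, 5, 5, 5]      # FSQ levels per dimension (5^4 = 625 codes per layer)
NUM_QUANTIZERS = 2         # residual depth


def fsq_codes(vector):
    """Bound each dimension with tanh, then round to one of its integer levels."""
    codes = []
    for value, levels in zip(vector, LEVELS):
        half = (levels - 1) / 2                       # 2.0 when levels == 5
        codes.append(round(math.tanh(value) * half))  # integer in [-2, 2]
    return codes


def codes_to_index(codes):
    """Pack per-dimension codes into one mixed-radix ID in [0, 625)."""
    index = 0
    for code, levels in zip(reversed(codes), reversed(LEVELS)):
        index = index * levels + (code + levels // 2)
    return index


def residual_fsq(vector):
    """Quantize coarse-to-fine; return (per-layer IDs, reconstruction)."""
    residual = list(vector)
    approximation = [0.0] * len(vector)
    indices = []
    for layer in range(NUM_QUANTIZERS):
        # Assumed schedule: each layer works on a 4x finer scale than the last.
        scale = 1.0 / (LEVELS[0] - 1) ** layer
        codes = fsq_codes([r / scale for r in residual])
        dequantized = [c / (lv // 2) * scale for c, lv in zip(codes, LEVELS)]
        approximation = [a + d for a, d in zip(approximation, dequantized)]
        residual = [r - d for r, d in zip(residual, dequantized)]
        indices.append(codes_to_index(codes))
    return indices, approximation


# One 4-dimensional latent frame yields one ID per residual layer.
ids, approx = residual_fsq([0.3, -0.7, 1.2, 0.0])
print("Per-layer IDs:", ids)
print("Reconstruction:", approx)
```

Each latent frame produces one ID per residual layer. The second layer quantizes whatever error the first layer left behind, which is why adding layers refines the reconstruction instead of enlarging a single codebook.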

## Installation

Ensure you have `torch`, `transformers`, `einops`, and `numpy` installed in your environment:

```bash
pip install torch transformers einops numpy
```


## Quick Start

The model registers itself with the Hugging Face `transformers` auto classes, so `AutoModel.from_pretrained` can load it directly when `trust_remote_code=True` is set.

### 1. Signal Tokenization (Encoding)

Convert continuous raw nanopore electrical currents into discrete token sequences:

```python
import torch
from transformers import AutoModel

# 1. Load Model with Auto-Class API
model_id = "ShuaiAnwo/pore-codec-rsq5x4x4-2605"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# 2. Prepare mock signal input with shape [Batch, Channels (1), Signal_Length]
# Random noise stands in for a raw squiggle that has already been padded and normalized
mock_raw_signal = torch.randn(2, 1, 12000)

# 3. Tokenize signals down to discrete codebook IDs
# Set layer=0 to collapse all residual quantizers, or select a specific depth
with torch.no_grad():
    token_ids = model.encode_signal(mock_raw_signal, layer=0)

print("Encoded Token Shape:", token_ids.shape) # Expected: [2, 3000] (due to stride=4)
print("Sample Tokens:", token_ids[0, :10])
```

### 2. Signal Reconstruction (Decoding)

Reconstruct an estimate of the continuous signal from the discrete token IDs:

```python
# Reconstruct raw squiggles directly from tokens
# (reuses `model` and `token_ids` from the encoding example above)
with torch.no_grad():
    reconstructed_signal = model.decode_token(token_ids, layer=0)

print("Reconstructed Signal Shape:", reconstructed_signal.shape)  # Expected: [2, 1, 12000]
```
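To judge how faithful a reconstruction is, the original and decoded signals can be compared numerically. The helper below is a hypothetical sketch: the model card does not name an evaluation metric, so mean squared error and signal-to-noise ratio are assumed here as common choices. It works on plain Python sequences so it runs without any model download.

```python
import math


def reconstruction_metrics(original, reconstructed):
    """Return (mse, snr_db) for two equal-length sequences of floats."""
    if len(original) != len(reconstructed):
        raise ValueError("signals must have the same length")
    if len(original) == 0:
        raise ValueError("signals must be non-empty")

    n = len(original)
    # Mean squared error between the two signals (the "noise" power).
    noise_power = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / n
    signal_power = sum(o ** 2 for o in original) / n

    if noise_power == 0:
        snr_db = math.inf            # perfect reconstruction
    elif signal_power == 0:
        snr_db = -math.inf           # all-zero reference signal
    else:
        snr_db = 10 * math.log10(signal_power / noise_power)
    return noise_power, snr_db


# Toy example: the reconstruction has half the amplitude of the original.
mse, snr_db = reconstruction_metrics([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
print(f"MSE: {mse:.4f}, SNR: {snr_db:.2f} dB")
```

With the tensors from the examples above, one read could be scored as `reconstruction_metrics(mock_raw_signal[0, 0].tolist(), reconstructed_signal[0, 0].tolist())`. Random noise is not a representative input, so meaningful numbers require real, normalized squiggles.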

## Technical Specifications

| Parameter | Value | Description |
|---|---|---|
| Stride / Downsample | 4 | Downsampling factor of the CNN backbone |
| FSQ Levels | [5, 5, 5, 5] | Codebook structure per residual layer |
| Single Codebook Size | 625 | Number of discrete codes ($5^4$) per layer |
| Num Quantizers | 2 | Number of cascaded residual quantization steps |
| Effective Vocabulary | $625^2$ | Theoretical total combination space (390,625 combinations across the 2 layers) |
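The figures in this section follow from a few lines of arithmetic. The sketch below assumes the combination space is the per-layer codebook size raised to the number of quantizers, and that the token count is the signal length floor-divided by the stride (no extra padding); the function names are illustrative and not part of the model's API.

```python
import math

FSQ_LEVELS = [5, 5, 5, 5]   # levels per FSQ dimension
NUM_QUANTIZERS = 2          # residual layers
STRIDE = 4                  # CNN downsampling factor


def codebook_size(levels):
    """Number of distinct codes in one FSQ layer."""
    return math.prod(levels)


def combined_vocabulary(levels, num_quantizers):
    """Distinct ID combinations across all residual layers (assumed product rule)."""
    return codebook_size(levels) ** num_quantizers


def token_count(signal_length, stride):
    """Tokens per read, assuming floor division and no padding."""
    return signal_length // stride


print("Codes per layer:", codebook_size(FSQ_LEVELS))
print("Combined combinations:", combined_vocabulary(FSQ_LEVELS, NUM_QUANTIZERS))
print("Tokens for 12000 samples:", token_count(12000, STRIDE))
print("Bits per token position:", math.log2(combined_vocabulary(FSQ_LEVELS, NUM_QUANTIZERS)))
```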

Model size: 1.15M parameters (Safetensors, tensor type F32)