# PoreCodec-RSQ5x4x2-605261
PoreCodec is a high-performance, standardized neural tokenization ecosystem designed for raw nanopore electrical signals (squiggles).
pore-codec-rsq5x4x4-2605 represents the first-generation release featuring a Residual Finite Scalar Quantization (RSQ) architecture. It compresses high-frequency continuous electrical squiggles into discrete token IDs, acting as a critical bridge between raw genomic signals and generative language models (e.g., downstream Genomic Foundation Models or PoreGPT).
## Model Architecture
The architecture consists of three interconnected blocks engineered for high-throughput genomic data:
- Backbone (PoreCNNModel): A 1D Convolutional Neural Network optimized with standard receptive-field scaling. The encoder yields a downsampling factor (stride) of 4, compressing the raw signal length while capturing transient current alterations.
- Quantizer (PoreResidualFSQ): A multi-codebook system deploying Finite Scalar Quantization (FSQ) with a straight-through estimator (STE).
  - Level Configuration: `[5, 5, 5, 5]` per quantizer layer ($5^4 = 625$ codebook size per layer).
  - Residual Depth: 2 layers of residual quantizers (`x2`), allowing coarse-to-fine signal discretization.
- Outer Projection Layers: Fully tied linear projections (`project_in` and `project_out`) that handle the dimensionality mapping and keep weight states native to `safetensors`.
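
To make the quantizer description concrete, here is a minimal sketch of residual FSQ in plain PyTorch. It is not the `PoreResidualFSQ` implementation: the function names `fsq_quantize` and `residual_fsq`, the `tanh` bounding, and the per-depth scale factor are assumptions modelled on common residual FSQ designs. Only the level configuration `[5, 5, 5, 5]` and the depth of 2 come from this card.

```python
import torch

LEVELS = torch.tensor([5, 5, 5, 5])  # levels per latent dimension (from this card)


def fsq_quantize(z: torch.Tensor, levels: torch.Tensor = LEVELS):
    """Quantize z of shape (..., 4); return (values in [-1, 1], integer code index)."""
    half = (levels - 1) / 2                  # 2.0 for each 5-level dimension
    bounded = torch.tanh(z) * half           # squash into [-2, 2] (assumed bounding)
    rounded = torch.round(bounded)           # snap to the integer grid {-2, ..., 2}
    # Straight-through estimator: forward pass uses `rounded`,
    # backward pass treats the rounding as identity.
    quantized = bounded + (rounded - bounded).detach()
    digits = (rounded + half).long()         # per-dimension digit in {0, ..., 4}
    basis = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long), levels[:-1]]), dim=0
    )                                        # [1, 5, 25, 125]
    index = (digits * basis).sum(dim=-1)     # mixed-radix code index in [0, 624]
    return quantized / half, index


def residual_fsq(z: torch.Tensor, num_quantizers: int = 2):
    """Quantize z coarse-to-fine; each layer encodes what the previous one missed."""
    residual, total, indices = z, torch.zeros_like(z), []
    for depth in range(num_quantizers):
        scale = float(LEVELS[0] - 1) ** -depth   # assumed: finer grid at each depth
        q, idx = fsq_quantize(residual / scale)
        q = q * scale
        residual = residual - q.detach()
        total = total + q
        indices.append(idx)
    return total, torch.stack(indices, dim=-1)


z = torch.randn(2, 3000, 4)                  # [batch, tokens, latent dims]
recon, ids = residual_fsq(z)
print(ids.shape)                             # torch.Size([2, 3000, 2])
```

Each token position yields one ID per residual layer, and every ID lies in `[0, 624]`, matching the 625-entry codebook described above.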
## Installation
Ensure you have `torch`, `transformers`, `einops`, and `numpy` installed in your environment:
```bash
pip install torch transformers einops numpy
```
## Quick Start
Since this model is natively integrated with the Hugging Face `transformers` API via dynamic auto-class registration, you can initialize and load it directly with `trust_remote_code=True`.
### 1. Signal Tokenization (Encoding)
Convert continuous raw nanopore electrical currents into discrete token sequences:
```python
import torch
from transformers import AutoModel
# 1. Load Model with Auto-Class API
model_id = "ShuaiAnwo/pore-codec-rsq5x4x4-2605"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()
# 2. Prepare mock signal input [Batch, Channels (1), Signal_Length]
# Imagine a raw squiggle vector padded/normalized
mock_raw_signal = torch.randn(2, 1, 12000)
# 3. Tokenize signals down to discrete codebook IDs
# Set layer=0 to collapse all residual quantizers, or select a specific depth
with torch.no_grad():
    token_ids = model.encode_signal(mock_raw_signal, layer=0)

print("Encoded Token Shape:", token_ids.shape)  # Expected: [2, 3000] (due to stride=4)
print("Sample Tokens:", token_ids[0, :10])
```
### 2. Signal Reconstruction (Decoding)
Reconstruct the estimated continuous signal curve back from the discrete token IDs:
```python
# Reconstruct raw squiggles directly from tokens
with torch.no_grad():
    reconstructed_signal = model.decode_token(token_ids, layer=0)

print("Reconstructed Signal Shape:", reconstructed_signal.shape)  # Expected: [2, 1, 12000]
```
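
The mock input above is assumed to be already padded and normalized. The sketch below shows one way a single raw read could be brought into the `[Batch, 1, Signal_Length]` layout with a length divisible by the stride. The helper `prepare_signal` is hypothetical and not part of this repository, and the median/MAD normalization is an assumption borrowed from common nanopore preprocessing; the normalization this model was trained with may differ.

```python
import torch
import torch.nn.functional as F


def prepare_signal(raw: torch.Tensor, stride: int = 4) -> tuple[torch.Tensor, int]:
    """Normalize a 1D raw signal and right-pad it to a multiple of `stride`.

    Returns a tensor of shape [1, 1, padded_length] and the original length.
    """
    raw = raw.float()
    median = raw.median()
    # Median absolute deviation, clamped to avoid division by zero
    mad = (raw - median).abs().median().clamp_min(1e-6)
    normed = (raw - median) / (1.4826 * mad)   # assumed normalization scheme
    pad = (-raw.numel()) % stride              # samples needed to reach a multiple
    padded = F.pad(normed, (0, pad))           # zero-pad on the right
    return padded.view(1, 1, -1), raw.numel()


signal, original_length = prepare_signal(torch.randn(11999))
print(signal.shape)             # torch.Size([1, 1, 12000])
print(signal.shape[-1] // 4)    # 3000 tokens at stride 4
```

Keeping `original_length` lets you trim the reconstructed signal back to the unpadded length after decoding.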
## Technical Specifications
| Parameter | Value | Description |
|---|---|---|
| Stride / Downsample | 4 | Downsampling factor of the CNN backbone |
| FSQ Levels | `[5, 5, 5, 5]` | Codebook structure per residual layer |
| Single Codebook Size | 625 | Number of discrete items ($5^4$) per layer |
| Num Quantizers | 2 | Number of cascading residual quantization steps |
| Effective Vocabulary | $625^2$ | Theoretical total combination space across the 2 residual layers |
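
The sizes in this table follow from simple arithmetic, sketched below for the listed level configuration and two residual layers. The helpers `flatten_ids` and `unflatten_ids` are hypothetical illustrations of how per-layer IDs could be packed into one joint ID; they are not functions exposed by this model.

```python
import math

levels = [5, 5, 5, 5]        # FSQ levels per residual layer (from the table)
num_quantizers = 2           # residual layers (from the table)
stride = 4                   # CNN downsampling factor (from the table)

codebook_size = math.prod(levels)                      # 5**4 = 625
joint_combinations = codebook_size ** num_quantizers   # 625**2 = 390625
tokens_per_read = 12000 // stride                      # 3000 for a 12000-sample signal


def flatten_ids(layer_ids):
    """Pack per-layer IDs (coarse layer first) into one joint ID."""
    joint = 0
    for idx in layer_ids:
        joint = joint * codebook_size + idx
    return joint


def unflatten_ids(joint):
    """Inverse of flatten_ids: recover the per-layer IDs."""
    ids = []
    for _ in range(num_quantizers):
        joint, idx = divmod(joint, codebook_size)
        ids.append(idx)
    return ids[::-1]


print(codebook_size, joint_combinations, tokens_per_read)  # 625 390625 3000
print(flatten_ids([3, 7]))                                  # 1882
print(unflatten_ids(1882))                                  # [3, 7]
```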