---
title: SMS Spam Clusters
emoji: "🗺️"
colorFrom: gray
colorTo: pink
sdk: gradio
python_version: 3.11
app_file: app.py
pinned: false
license: mit
short_description: Where does your SMS land among its semantic peers?
models:
  - BAAI/bge-base-en-v1.5
datasets:
  - jngb-labs/sms-spam
tags:
  - sms
  - spam-detection
  - clustering
  - embeddings
  - topic-modeling
  - bge
---

# SMS Spam Clusters

A map of what 5,159 text messages mean. Drop yours in, see where it lands.

## How to use it

Paste an SMS into the box, click **Place it on the map**, and the model finds the five most similar messages in the dataset, votes on a cluster, and drops your message onto the same map as a black diamond. Below the map, the five neighbours appear in order of similarity, so the assignment is auditable rather than oracular.

## How to read the map

Each dot is one real SMS. Dots that sit close together mean similar things in a sense the model worked out for itself by reading the messages and nothing else; it was never told what spam was, never shown a labelled example, never given a list of topics to fit messages into.

Spam takes up the right half of the map as a single category, since for everyday purposes spam is spam. Ham takes up the left half, broken into roughly thirty sub-genres that the algorithm decided existed, sorted in the legend from largest to smallest.

Click a cluster in the legend to hide it. Double-click to isolate it. Use the **+** and **-** buttons on the chart toolbar to zoom, drag to pan.

## How it works

Every message is turned into a 768-dimensional vector by [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5), a sentence encoder that places semantically similar text close together. Those vectors are reduced to ten dimensions with UMAP and clustered with HDBSCAN, which is the standard recipe for unsupervised topic discovery and what powers BERTopic and Top2Vec.
For the visible map the vectors are reduced once more to two dimensions; for the cluster assignment of a new message the live nearest-neighbour search runs in the full 768-dimensional space.

The same embedding model loads with the Space, so the user's message lives in the same vector space as the cached corpus and cosine similarity against the cached vectors is meaningful rather than approximate.

Cluster names were written once, offline, by a language model looking at ten random samples per cluster, then baked into the Space. Nothing in the runtime path talks to an external model.

## What it isn't

Cluster names are written by a language model from a small sample, so they hide as much as they reveal.

The dataset is the SMS Spam Collection v.1, deduplicated and normalised; it skews UK and Singaporean English from before 2011, and a 2026 phishing SMS will often land in a cluster whose original members don't quite resemble it.

The dot a new message gets placed at is a weighted average of its five nearest neighbours' coordinates, not a fresh UMAP fit, so the position is approximate rather than authoritative; the neighbours panel underneath shows what that approximation rests on.

## Related

- Dataset: [`jngb-labs/sms-spam`](https://huggingface.co/datasets/jngb-labs/sms-spam)
- Classifier Space: [`jngb-labs/sms-spam-classifier`](https://huggingface.co/spaces/jngb-labs/sms-spam-classifier)
- Classical model: [`jngb-labs/sms-spam-classical`](https://huggingface.co/jngb-labs/sms-spam-classical)
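
## Sketch: placing a new message

The placement step described under "How it works" and "What it isn't" (find the five nearest neighbours by cosine similarity, vote on a cluster, average their map coordinates) can be illustrated roughly as below. This is a minimal sketch, not the Space's `app.py`: the function names `cosine` and `place_message`, the use of similarity scores as averaging weights, and the tie-breaking rule are all assumptions made for illustration, and the toy vectors have three dimensions rather than 768.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def place_message(query_vec, corpus_vecs, corpus_labels, corpus_xy, k=5):
    """Return (cluster, (x, y), neighbours) for a new message.

    Hypothetical helper: the weighting and tie-breaking are assumptions.
    """
    # Nearest-neighbour search in the full embedding space.
    sims = [cosine(query_vec, v) for v in corpus_vecs]
    top = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]

    # Majority vote among the neighbours. On a tie, Counter keeps the
    # label that was seen first, i.e. the one with the closest neighbour.
    cluster = Counter(corpus_labels[i] for i in top).most_common(1)[0][0]

    # Assumed weighting: each neighbour counts in proportion to its
    # (non-negative) similarity; fall back to equal weights if all are zero.
    weights = [max(sims[i], 0.0) for i in top]
    if sum(weights) == 0:
        weights = [1.0] * len(top)
    total = sum(weights)
    x = sum(w * corpus_xy[i][0] for w, i in zip(weights, top)) / total
    y = sum(w * corpus_xy[i][1] for w, i in zip(weights, top)) / total

    # Neighbours are returned most-similar first, as in the panel below the map.
    return cluster, (x, y), [(i, sims[i]) for i in top]


# Toy corpus: made-up embeddings, labels, and 2-D map coordinates.
toy_vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1], [0.8, 0.2, 0]]
toy_labels = ["spam", "spam", "ham: plans", "ham: plans", "ham: chat", "spam"]
toy_xy = [(1.0, 0.0), (0.9, 0.1), (-1.0, 0.5), (-0.9, 0.6), (-0.5, -1.0), (0.8, 0.2)]

cluster, xy, neighbours = place_message([1, 0.05, 0], toy_vecs, toy_labels, toy_xy, k=3)
print(cluster, xy, neighbours)
```

Because the position is an average of existing dots, the new message always falls inside the bounding box of its neighbours. This is one reason the placement is approximate: a message unlike anything in the corpus still lands among its closest matches rather than in empty space.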