---
language: en
license: mit
tags:
- oceanography
- wave-forecasting
- time-series
- lightgbm
- regression
datasets:
- surfe-diem/wave-archive-USA-southwest
metrics:
- mae
library_name: lightgbm
---

# Surfe Diem — Dominant Wave Period Forecast v1 (USA Southwest, 12h)

## Model Description

A LightGBM regression model trained to predict **dominant wave period in seconds** 12 hours in advance using real-time buoy observations from NOAA's National Data Buoy Center (NDBC).

**Developed by:** Surfe Diem
**Model type:** Gradient Boosted Decision Trees (LightGBM)
**Language:** Python
**License:** MIT

## Intended Use

### Primary Use Case
Predict dominant wave period (seconds) 12 hours ahead of the latest buoy observation, for surf forecasting applications along the California coast. Forecast horizon: **12 hours**.

### Out-of-Scope Use
- Horizons other than 12 hours (separate models exist for 6h, 24h, and 48h)
- Wave height or direction
- Regions outside the California coast (model trained on USA Southwest NDBC stations only)
- Real-time safety-critical applications without human oversight

## Training Data

**Source:** [NOAA NDBC Buoy Spectral Wave Density Data](https://huggingface.co/datasets/surfe-diem/wave-archive-USA-southwest)

**Stations:** 15 NDBC buoys along the California coast
`46011, 46012, 46013, 46014, 46022, 46025, 46026, 46027, 46028, 46042, 46047, 46053, 46054, 46069, 46086`

**Records:** ~2.08M observations (259 Parquet files in which standard meteorological (stdmet) and spectral columns are aligned)

**Features:**
- Meteorological: wave height, period, direction, wind speed/direction, pressure, temperature
- **Spectral compression:** 9 physics-informed features derived from ~150 raw spectral bands
  - Ground swell energy, direction, quality (< 0.08 Hz)
  - Mid-range energy, direction, quality (0.08–0.12 Hz)
  - Wind wave energy, direction, quality (> 0.12 Hz)
- Circular decomposition: sin/cos encoding for all direction columns
- Temporal lag features: 1h, 3h, 6h, 12h lags across all features
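The feature families above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the output column names, and the bin-width integration are assumptions, and the band "direction" and "quality" features are omitted because the card does not define how they are computed. Only the band edges (0.08 Hz and 0.12 Hz), the sin/cos encoding, and the lag offsets come from the list above.

```python
import numpy as np
import pandas as pd

def band_energies(freqs_hz, density):
    """Integrate spectral density (m^2/Hz) over the three frequency bands."""
    f = np.asarray(freqs_hz, dtype=float)
    s = np.asarray(density, dtype=float)
    widths = np.gradient(f)  # approximate width of each frequency bin (assumption)
    bands = {
        "ground_swell": f < 0.08,
        "mid_range": (f >= 0.08) & (f <= 0.12),
        "wind_wave": f > 0.12,
    }
    return {name: float(np.sum(s[mask] * widths[mask])) for name, mask in bands.items()}

def encode_direction(df, col):
    """Replace a compass-degree column with its sin/cos components."""
    radians = np.deg2rad(df[col])
    out = df.drop(columns=[col])
    out[f"{col}_sin"] = np.sin(radians)
    out[f"{col}_cos"] = np.cos(radians)
    return out

def add_lags(df, cols, lags=(1, 3, 6, 12)):
    """Append lagged copies of cols, assuming one row per hour for one station."""
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}h"] = out[col].shift(lag)
    return out

# Toy spectrum with uniform 0.02 Hz bins and unit density.
energies = band_energies([0.04, 0.06, 0.08, 0.10, 0.12, 0.14], [1.0] * 6)

# Toy hourly record: 350 and 10 degrees are close on the compass, and their
# cosine components are equal even though the raw values differ by 340.
hourly = pd.DataFrame({"mwd": [350.0, 10.0, 270.0], "wvht": [2.0, 2.5, 3.0]})
features = add_lags(encode_direction(hourly, "mwd"), ["wvht"], lags=(1,))
```

The sin/cos pair avoids the artificial jump between 359 and 0 degrees that a raw direction column would present to the trees. Lagged columns start with missing values for the first rows of each station's history.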

**Split:** 80/20 train/test, time-series ordered (no shuffle)
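A time-ordered split of this kind might look like the sketch below. The `timestamp` column name and the helper name are assumptions for illustration; the card only states the 80/20 ratio and that rows are not shuffled.

```python
import pandas as pd

def time_ordered_split(df, time_col="timestamp", train_frac=0.8):
    """Sort by time and cut once, so every test row is later than every train row."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Ten hourly rows of toy dominant-period values.
times = pd.Timestamp("2024-01-01") + pd.to_timedelta(list(range(10)), unit="h")
obs = pd.DataFrame({"timestamp": times, "dpd": [float(v) for v in range(10, 20)]})

train, test = time_ordered_split(obs)
```

Cutting once on the sorted timeline keeps future observations out of the training set, which a random shuffle would not guarantee.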

## Model Performance

**Test MAE: 1.7986 seconds**

MAE is in **seconds**. Dominant period typically ranges 5–20s.

Evaluated on held-out data with proper time-series validation (train on past, test on future).
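For reference, MAE is the mean of the absolute differences between observed and predicted periods. The short example below uses made-up numbers, not values from the evaluation set:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute error, in the same units as the inputs (seconds here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values only: dominant period observed vs. predicted, in seconds.
observed = [12.0, 14.0, 8.0, 17.0]
predicted = [13.0, 12.0, 9.0, 16.0]

mae = mean_absolute_error(observed, predicted)  # errors are 1, 2, 1, 1 -> 1.25
```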

## Training Details

**Algorithm:** LightGBM
**Objective:** Regression (MAE / L1 loss)
**Learning rate:** 0.05
**Num leaves:** 31
**Feature fraction:** 0.9
**Bagging fraction:** 0.8
**Max iterations:** 2000 (early stopping, patience=50)
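The settings above map onto LightGBM's Python API roughly as shown below. This is a sketch under assumptions rather than the project's training script: `bagging_freq` is not stated in this card (LightGBM applies `bagging_fraction` only when `bagging_freq` is greater than 0, so a value of 1 is assumed), and the `train_model` helper with its arguments is hypothetical.

```python
# Hyperparameters as listed in this card, expressed as a LightGBM params dict.
PARAMS = {
    "objective": "regression_l1",  # MAE / L1 loss
    "metric": "l1",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,  # assumption: not stated in the card
}

MAX_ITERATIONS = 2000
EARLY_STOPPING_PATIENCE = 50

def train_model(X_train, y_train, X_valid, y_valid):
    """Hypothetical helper: fit a booster with early stopping on a validation set."""
    # Imported here so the settings above can be inspected without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=MAX_ITERATIONS,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=EARLY_STOPPING_PATIENCE)],
    )
```

With early stopping, training halts once the validation MAE has not improved for 50 consecutive rounds, so the final booster usually has fewer than 2000 trees.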

**Feature engineering:**
- Station IDs encoded as fixed `CategoricalDtype` for inference consistency
- Lag features filled with 0 for single-observation inference
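One way these two steps could be implemented with pandas is sketched here. The `station_id` column name comes from the usage example in this card. The helper name, the lag column name, and the choice to store station IDs as strings are assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# The 15 stations listed under Training Data, in a fixed order so that each
# station always maps to the same category code at training and inference time.
STATION_IDS = [
    "46011", "46012", "46013", "46014", "46022", "46025", "46026", "46027",
    "46028", "46042", "46047", "46053", "46054", "46069", "46086",
]
STATION_DTYPE = CategoricalDtype(categories=STATION_IDS, ordered=False)

def prepare_single_observation(row, lag_columns):
    """Build a one-row frame with a fixed station dtype and zero-filled lags."""
    frame = pd.DataFrame([row])
    frame["station_id"] = frame["station_id"].astype(str).astype(STATION_DTYPE)
    for col in lag_columns:
        if col not in frame:
            frame[col] = 0.0  # no history available for a single observation
    frame[lag_columns] = frame[lag_columns].fillna(0.0)
    return frame

single = prepare_single_observation(
    {"station_id": 46025, "wvht": 2.5},
    lag_columns=["wvht_lag1h"],  # hypothetical lag column name
)
```

Without a fixed dtype, pandas would infer categories from whatever stations appear in a given frame, and a single-row frame would always assign code 0 to its only station.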

## How to Use

```python
import lightgbm as lgb
import pandas as pd
import numpy as np
from huggingface_hub import hf_hub_download

# Load model
model_path = hf_hub_download(
    repo_id="surfe-diem/surfe-diem-v1-usa-southwest-dpd-12h-model",
    filename="surfe_diem_v1_usa_southwest_dpd_12h_model.txt",
)
model = lgb.Booster(model_file=model_path)

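# Prepare one observation with engineered features + lags + station_id.
# The frame must contain every training feature, in training order
# (model.feature_name() lists them); only the raw standard-met subset is
# shown here, so this snippet is not runnable until the rest is filled in.
# See the full inference pipeline in the GitHub repo.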
obs = pd.DataFrame({
    'wvht': [2.5], 'dpd': [12.0], 'apd': [8.5],
    'mwd': [270], 'wspd': [15.0], 'wdir': [280],
    'pres': [1013.0], 'atmp': [18.0], 'wtmp': [16.0],
    # ... + spectral band features + lag features + station_id
})

prediction = model.predict(obs)[0]  # seconds
```

Full inference pipeline available in the [GitHub repo](https://github.com/crubio/surfe-diem-api).

## Limitations

- **No history for single observations:** Lag features are set to 0 for real-time single-row inference, which slightly degrades accuracy compared with buffered inference that has past observations available
- **Regional specificity:** Trained only on California coast buoys
- **Forecast horizon:** 12 hours only; separate models cover 6h, 24h, and 48h
- **Spectral dependency:** Full accuracy requires spectral band data; older buoy files without spectral data contribute only standard met features

## Citation

```bibtex
@misc{surfediem2026wave,
  author = {Surfe Diem},
  title = {Wave Forecasting Models v1 - USA Southwest},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/surfe-diem}}
}
```

## Model Card Contact

For questions or issues, please open an issue in the [GitHub repository](https://github.com/crubio/surfe-diem-api).