RoophaSharon committed on
Commit ce7af77 · 1 Parent(s): de518f6

Initial public release: Approach 1, Approach 2, Baseline + sample data

.gitignore ADDED
@@ -0,0 +1,26 @@
+ # Python
+ __pycache__/
+ *.pyc
+ *.pyo
+ .venv/
+ venv/
+
+ # Streamlit
+ .streamlit/secrets.toml
+
+ # Jupyter
+ .ipynb_checkpoints/
+
+ # OS / editor
+ .DS_Store
+ Thumbs.db
+ .vscode/
+ .idea/
+
+ # Anaconda envs
+ *.conda
+ *.egg-info/
+
+ # Temp
+ ~WRL*.tmp
+ *.tmp
README.md ADDED
@@ -0,0 +1,184 @@
+ # Metadata Hierarchy Construction — TFM
+
+ Master's thesis prototype: automatic hierarchy construction from data-dictionary metadata.
+ Three algorithms are implemented for comparison.
+
+ ## Approaches
+
+ - **Baseline** — Pure clustering baseline. Plain TF-IDF / Word2Vec embeddings + hierarchical
+   clustering. Documented in `README_baseline.md`.
+
+ - **Approach 1** — Global embedding pipeline. Uses SBERT + N×M concept-table alignment
+   (Gonçalves 2019) + HiExpan refinement (Shen et al. KDD 2018) + Castanet parallel facets.
+   Optionally retrieves concept context from Wikidata / Wikipedia / WordNet / BioPortal.
+
+ - **Approach 2** — Dataset-constrained multi-aspect hierarchy. Algorithmic pipeline with no
+   domain hardcoding:
+   1. Group-anchored L1/L2 (from detected metadata column structure — BISE 2026)
+   2. Phrase-slot mining (IE-style slot induction) for descriptions with regular structure
+   3. **FASTopic** semantic aspect discovery (Wu et al. NeurIPS 2024) — replaces NMF as the
+      primary method
+   4. NMF lexical fallback for small groups
+   5. GMM + BIC for small clusters, MiniBatchKMeans + silhouette for large ones
+   6. Deterministic 5-stage label generation (description prefix → group anchor → IDF filter
+      → bigram-preferred TF-IDF → optional LLM refinement)
+   7. **Optional local-LLM label refinement** via Ollama + Qwen 2.5 (TopicTag pattern, DocEng
+      2024). Strict grounding check rejects labels not derived from CSV evidence. Per-node
+      provenance recorded.
+   8. TraCo-inspired hierarchy diagnostics (AAAI 2024)
+
+ Approach 2 builds no facet trees; it produces a single coherent LoD tree.
+
+ See each script's "Method" tab in the running app for the full algorithm and paper references.
+
+ ## Paper stack
+
+ | Component | Paper |
+ |---|---|
+ | Multi-aspect taxonomy scaffold | Zhu et al. 2025, EMNLP |
+ | Canonical metadata text objects | Gonçalves et al. 2019, ESWC |
+ | Semantic aspect discovery | Wu et al. 2024 (FASTopic), NeurIPS, arXiv:2405.17978 |
+ | Phrase-slot mining | IE / slot-induction literature (ACM CSUR 2022) |
+ | LLM label refinement pattern | Eren et al. 2024 (TopicTag), DocEng, arXiv:2407.19616 |
+ | Local LLM (used for refinement) | Qwen Team 2024 (Qwen 2.5), arXiv:2412.15115 |
+ | Hierarchy quality diagnostics | Wu et al. 2024 (TraCo), AAAI, arXiv:2401.14113 |
+ | Group-anchored entry strategy | Motamedi, Novalija, Rei 2026, Springer BISE |
+ | Multidimensional taxonomy motivation | Kargupta et al. 2025 (TaxoAdapt), ACL |
+ | Future-work semantic consistency | SC-Taxo 2026, arXiv:2605.00620 |
+ | Concept-label evaluation framework | Kejriwal et al. 2022 (TICL), EAAI |
+
+ ## Project layout
+
+ ```
+ Hierarchy tool/
+ ├── baseline.py          # Pure clustering baseline (Streamlit app)
+ ├── approach_1.py        # Approach 1 (Streamlit app)
+ ├── approach_2.py        # Approach 2 (Streamlit app)
+ ├── approach_1.ipynb     # Approach 1 reproducible notebook
+ ├── approach_2.ipynb     # Approach 2 reproducible notebook
+ ├── baseline.ipynb       # Baseline reproducible notebook
+ ├── launcher.py          # Run all three apps simultaneously on different ports
+ ├── data/                # Sample input CSVs (AI-MIND, HCP, etc.)
+ ├── outputs/             # Generated hierarchies (JSON)
+ └── requirements.txt
+ ```
+
+ ## Running locally
+
+ ### 1. Install Python dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Python 3.10 or 3.11 is recommended.
+
+ ### 2. (Approach 2 only) Install Ollama for the local-LLM label refinement layer
+
+ **This is optional — Approach 2 produces deterministic labels without it.** If you want
+ the optional TopicTag-style LLM label refinement:
+
+ 1. Download and install Ollama from https://ollama.com/download
+ 2. Open Ollama once so the background service starts (icon in the system tray)
+ 3. Pull the recommended model:
+    ```bash
+    ollama pull qwen2.5:3b-instruct
+    ```
+    (For higher quality at higher RAM cost: `ollama pull qwen2.5:7b-instruct`.)
+ 4. Verify the server is reachable:
+    - In a browser open `http://localhost:11434/api/tags`
+    - Or run `ollama list`
+
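The reachability check in step 4 can also be scripted. The sketch below is not taken from `approach_2.py`. It assumes the default address shown above and that `/api/tags` answers with JSON containing a `models` list whose entries carry a `name` field; `model_names` and `ollama_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def model_names(payload: dict) -> list:
    """Extract model names from a decoded /api/tags response (assumed shape)."""
    return [m.get("name", "") for m in payload.get("models", [])]


def ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


# Offline illustration with a hand-written payload (not real server output).
print(model_names({"models": [{"name": "qwen2.5:3b-instruct"}]}))
```

Calling `ollama_models()` with no arguments queries the local server; a return value of `None` corresponds to the "Ollama not reachable" case.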
+ When Approach 2 starts it auto-detects Ollama and the "Refine labels with LLM" checkbox
+ defaults to ON. Uncheck it at any time. The deterministic pipeline is the canonical thesis
+ result; the LLM is an optional re-phraser of evidence already in the CSV.
+
+ To override the default URL or model:
+
+ ```bash
+ # Optional environment variables (Windows cmd syntax)
+ set OLLAMA_URL=http://localhost:11434/v1
+ set OLLAMA_MODEL=qwen2.5:3b-instruct
+ # On macOS / Linux shells use `export` instead of `set`
+ ```
+
+ Or change them live in the Approach 2 sidebar.
+
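How `approach_2.py` reads these variables is not shown in this diff. As a rough sketch, a script could resolve them as below, with the example values above used as fallbacks; `ollama_settings` and the two `DEFAULT_*` constants are hypothetical names.

```python
import os

# Fallbacks copied from the README example; treat them as assumptions.
DEFAULT_URL = "http://localhost:11434/v1"
DEFAULT_MODEL = "qwen2.5:3b-instruct"


def ollama_settings(env=None) -> dict:
    """Resolve URL and model from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "url": (env.get("OLLAMA_URL") or DEFAULT_URL).rstrip("/"),
        "model": env.get("OLLAMA_MODEL") or DEFAULT_MODEL,
    }


print(ollama_settings({}))
print(ollama_settings({"OLLAMA_MODEL": "qwen2.5:7b-instruct"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the real environment.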
+ ### 3. Run one app at a time
+
+ ```bash
+ streamlit run baseline.py
+ # or
+ streamlit run approach_1.py
+ # or
+ streamlit run approach_2.py
+ ```
+
+ Each opens at http://localhost:8501 by default.
+
+ ### 4. Run all three apps simultaneously (for side-by-side comparison)
+
+ ```bash
+ python launcher.py
+ ```
+
+ This opens three browser tabs:
+
+ - http://localhost:8501 — Baseline
+ - http://localhost:8502 — Approach 1
+ - http://localhost:8503 — Approach 2
+
+ Press **Enter** in the launcher terminal to stop all servers.
+
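`launcher.py` itself is not part of the rendered diff. One plausible way to get the behaviour described above is to start one Streamlit process per app on a fixed port. The sketch below assumes the script-to-port mapping listed above; `streamlit_command`, `launch_all`, and `stop_all` are hypothetical names, not the launcher's actual functions.

```python
import subprocess
import sys

# Script-to-port mapping as listed in the README.
APPS = {"baseline.py": 8501, "approach_1.py": 8502, "approach_2.py": 8503}


def streamlit_command(script: str, port: int) -> list:
    """Build the command line that serves one app on a fixed port."""
    return [sys.executable, "-m", "streamlit", "run", script,
            "--server.port", str(port)]


def launch_all(apps: dict) -> list:
    """Start one Streamlit server per app and return the process handles."""
    return [subprocess.Popen(streamlit_command(s, p)) for s, p in apps.items()]


def stop_all(procs: list) -> None:
    """Ask every server to terminate, then wait for each one to exit."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()


# Only print the commands here; nothing is launched by this snippet.
for script, port in APPS.items():
    print(" ".join(streamlit_command(script, port)))
```

A launcher built on this would call `launch_all(APPS)`, block on `input()` until Enter is pressed, and then call `stop_all` on the returned handles.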
+ ## Using the apps
+
+ 1. Upload one or more metadata CSV / TSV / XLSX / JSON files in the sidebar.
+ 2. Confirm the auto-detected column roles (leaf / group / text / meta).
+ 3. Click **Build hierarchy**.
+ 4. Inspect the LoD tree, evaluation metrics, label provenance (Approach 2), and export JSON.
+
+ Sample data is in `data/`:
+ - `ai-mind-variable-descriptions(in).csv`
+ - `HCP_S1200_DataDictionary_Oct_30_2023.csv`
+
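The auto-detected column roles in step 2 start from keyword hits in the column names. The sketch below mirrors the `norm` and `kscore` helpers from `baseline.py` with shortened keyword lists. It leaves out the uniqueness and text-length bonuses that `profile_columns` adds, and `guess_role` is a hypothetical wrapper, so treat it as an illustration of the idea only.

```python
import re

# Shortened keyword lists; baseline.py uses longer ones.
KEYS = {
    "leaf": ["variable", "field", "name", "code", "item"],
    "group": ["task", "category", "domain", "section", "group"],
    "text": ["description", "definition", "label", "title", "note"],
    "meta": ["type", "unit", "format", "range", "scale"],
}


def norm(column: str) -> str:
    """Lower-case a column name and collapse non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", str(column).strip().lower()).strip("_")


def kscore(column: str, keys: list) -> int:
    """Count how many keywords occur as substrings of the normalised name."""
    name = norm(column)
    return sum(1 for key in keys if key in name)


def guess_role(column: str) -> str:
    """Return the role with the most keyword hits, or 'unknown' if none match."""
    scores = {role: kscore(column, keys) for role, keys in KEYS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


for col in ["Variable Name", "Task", "Description", "Data Type", "xyz"]:
    print(col, "->", guess_role(col))
```

Because matching is by substring, short keywords can hit unrelated column names, which is one reason the apps ask the user to confirm the detected roles.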
+ ## Outputs
+
+ - **Baseline / Approach 1** export two JSON files compatible with the VIANNA viewer:
+   - `*_lod.json` — primary LoD tree
+   - `*_facets.json` — parallel Castanet facet trees
+
+ - **Approach 2** exports a single LoD JSON:
+   - `*_approach2_lod.json` — primary LoD tree (every aggregation node carries
+     `label_provenance` with source stage, confidence, and evidence terms)
+
+ Filenames are derived from the uploaded CSV file name, so different CSVs export under
+ different filenames into `outputs/approach 2/`.
+
+ Existing output examples are in `outputs/approach 1/` and `outputs/approach 2/`.
+
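To inspect an exported tree outside the apps, a short script is enough. The sketch below assumes the LoD JSON is a flat list of node dicts with `id`, `type`, and `related` (child ids), which is the node shape `baseline.py` builds in memory. The on-disk export may wrap or extend this, so check a real file first. `leaf_counts` is a hypothetical helper.

```python
import json


def leaf_counts(nodes: list) -> dict:
    """Map every node id to the number of attribute leaves beneath it."""
    by_id = {int(n["id"]): n for n in nodes}
    cache = {}

    def count(nid: int) -> int:
        if nid in cache:
            return cache[nid]
        node = by_id[nid]
        if node.get("type") == "attribute":
            total = 1
        else:
            total = sum(count(int(c)) for c in node.get("related", [])
                        if int(c) in by_id)
        cache[nid] = total
        return total

    return {nid: count(nid) for nid in by_id}


# Tiny hand-made tree in the assumed node shape (not real output).
sample = json.loads("""[
  {"id": 0, "name": "project", "type": "root", "related": [3, 4]},
  {"id": 1, "name": "age", "type": "attribute", "related": []},
  {"id": 2, "name": "sex", "type": "attribute", "related": []},
  {"id": 3, "name": "Demographics", "type": "aggregation", "related": [1, 2]},
  {"id": 4, "name": "score", "type": "attribute", "related": []}
]""")
print(leaf_counts(sample))
```

For a real export, replace the inline sample with `json.loads(Path(...).read_text(encoding="utf-8"))` pointing at one of the files under `outputs/`.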
+ ## Defensibility highlights for Approach 2
+
+ - **No domain hardcoding.** Slot names, group anchors, and labels are all derived from the
+   detected metadata columns + the uploaded CSV — no hand-curated domain vocabulary.
+ - **Deterministic by default.** Tree topology and all five label-generation stages are
+   reproducible from the input CSV alone. Local LLM is opt-in.
+ - **Grounded LLM refinement.** Every LLM-proposed label must pass a strict grounding
+   check — every word in the label must appear in the extracted evidence. Failed proposals
+   are rejected and the deterministic label is used instead. Per-node provenance lets
+   you answer "did the LLM invent this?" with hard evidence.
+ - **Local-only LLM.** Qwen 2.5 runs on the thesis machine via Ollama. No external API
+   calls, no third-party data sharing, no key management.
+
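The grounding rule described above (every word of an LLM label must appear in the extracted evidence, otherwise the deterministic label is kept) can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `approach_2.py`: tokenisation by lower-cased alphanumeric runs is a guess, and apart from `llm_rejected`, which the Troubleshooting table mentions, the provenance field names are hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lower-cased alphanumeric word tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(label: str, evidence_terms: list) -> bool:
    """True if the label is non-empty and every word occurs in the evidence."""
    words = tokens(label)
    evidence = set()
    for term in evidence_terms:
        evidence |= tokens(term)
    return bool(words) and words <= evidence


def choose_label(llm_label: str, deterministic_label: str,
                 evidence_terms: list) -> dict:
    """Keep the LLM proposal only if grounded; record where the label came from."""
    if is_grounded(llm_label, evidence_terms):
        return {"label": llm_label, "source": "llm", "llm_rejected": False}
    return {"label": deterministic_label, "source": "deterministic",
            "llm_rejected": True}


evidence = ["reaction time", "memory task", "accuracy"]
print(choose_label("Memory Task Accuracy", "Memory / Accuracy", evidence))
print(choose_label("Cognitive Performance", "Memory / Accuracy", evidence))
```

The first proposal only rearranges evidence words and is accepted; the second introduces words absent from the evidence, so the deterministic label is kept and the rejection is recorded.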
+ ## Troubleshooting
+
+ | Symptom | Fix |
+ |---|---|
+ | `FASTopic not installed` warning | `pip install fastopic` (also installs `torch`) |
+ | `openai` package missing | `pip install openai` |
+ | `Ollama not reachable` in sidebar | Open the Ollama app from the Start menu; the service runs in the system tray |
+ | Model not found | `ollama pull qwen2.5:3b-instruct` |
+ | Build very slow with LLM on | Expected for HCP — ~15–40 min on CPU with a 3B model. Disable LLM for fast iteration. |
+ | `LLM-labeled nodes: 0/N` after build | The grounding check rejected every LLM proposal. Check the **🔍 Label Provenance** tab — counts under `llm_rejected = True` show what happened. |
+ | Hierarchy too shallow | Increase the `Max LoD tree depth` slider (top of sidebar in Approach 2) |
+
+ ## License
+
+ For thesis evaluation only.
approach_1.py ADDED
The diff for this file is too large to render. See raw diff
 
approach_2.py ADDED
The diff for this file is too large to render. See raw diff
 
baseline.py ADDED
@@ -0,0 +1,760 @@
+ # baseline.py — Metadata Hierarchy Builder — Baseline (Taxonomizer)
+ #
+ # Pure Taxonomizer baseline — NO hardcoded, domain-specific patterns.
+ # The only lexical resource is a generic English stop-word list (standard
+ # information-retrieval practice, not dataset-specific).
+ #
+ # Pipeline (dataset-only, no external APIs, no sentence-transformers):
+ #   1. Load metadata file (CSV / TSV / XLSX / JSON)
+ #   2. Detect column roles (leaf / group / text / meta)
+ #   3. Build canonical schema (_leaf_id, _leaf_label, _group_path, _text)
+ #   4. Represent each variable as a TF-IDF text object
+ #   5. Recursively cluster variables (agglomerative, cosine distance) into an
+ #      abstract-to-concrete taxonomy; internal-node labels are the most
+ #      discriminative terms of each cluster — derived from the data, not hardcoded
+ #   6. Visualise (Sunburst / Treemap)
+ #   7. Export VIANNA-compatible JSON + canonical CSV
+ #
+ # Papers:
+ #   [TAX] Taxonomizer (Sultanum et al.) — leaf=attribute, internal=abstract group
+ #         built bottom-up by recursively clustering item feature vectors and
+ #         labelling each internal node with its members' shared/discriminative terms
+ #   [GON] Goncalves et al. — TF-IDF text objects + cosine distance
+ #   [HIE] HiExpan (adapted) — discriminative-term node labelling
+
+ from __future__ import annotations
+ import csv, json, re, warnings
+ from collections import defaultdict
+ from pathlib import Path
+ import tempfile
+
+ import numpy as np
+ import pandas as pd
+ import plotly.graph_objects as go
+ import streamlit as st
+ from sklearn.cluster import AgglomerativeClustering
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score, silhouette_score
+ from sklearn.metrics.pairwise import cosine_distances
+ from sklearn.preprocessing import LabelEncoder
+
+ warnings.filterwarnings('ignore')
+
+ st.set_page_config(page_title='Metadata Hierarchy — Baseline', page_icon='🌿', layout='wide')
+ st.title('Metadata Hierarchy Builder — Baseline (Taxonomizer)')
+ st.caption(
+     'Pure Taxonomizer baseline: TF-IDF text objects + recursive agglomerative '
+     'clustering into an abstract-to-concrete taxonomy, with internal-node labels '
+     'derived from each cluster’s discriminative terms. No hardcoded domain '
+     'patterns, no external APIs, no sentence embeddings — works on any dataset.'
+ )
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # CONSTANTS
+ # ─────────────────────────────────────────────────────────────────────────────
+ LEAF_KEYS = 'variable var field column attribute name code id item indicator question measure concept'.split()
+ GROUP_KEYS = 'task category domain module section table dataset assessment test variant group topic instrument form subscale construct'.split()
+ TEXT_KEYS = 'description definition desc label title question meaning note notes text display full details explanation comment'.split()
+ META_KEYS = 'type dtype data_type datatype unit units format decimal precision values value coding codebook range min max scale'.split()
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # FILE LOADING
+ # ─────────────────────────────────────────────────────────────────────────────
+ def safe_name(name: str) -> str:
+     return ''.join(ch if ch.isalnum() or ch in '-_.' else '_' for ch in name)
+
+ def try_read_csv(path: Path) -> pd.DataFrame:
+     # Try every encoding/separator pair; keep the parse with the most columns
+     # (ties broken by the lower share of missing cells).
+     best, best_score = None, -1
+     for enc in ['utf-8-sig', 'utf-8', 'latin1']:
+         for sep in [None, ',', '\t', ';', '|']:
+             try:
+                 df = pd.read_csv(path, sep=sep, engine='python', encoding=enc)
+                 score = df.shape[1] * 10 - float(df.isna().mean().mean())
+                 if score > best_score:
+                     best, best_score = df, score
+             except Exception:
+                 pass
+     if best is None:
+         raise ValueError(f'Could not read {path.name}')
+     best.columns = [str(c).strip().replace(';', '') for c in best.columns]
+     # Repair comma-packed rows (AI-Mind format)
+     if len(best) > 0:
+         first = best.iloc[:, 0].astype(str)
+         other_null = best.iloc[:, 1:].isna().mean().mean() if best.shape[1] > 1 else 1.0
+         if first.str.contains(',').mean() > 0.50 and other_null > 0.70:
+             lines = path.read_text(encoding='utf-8-sig', errors='replace').splitlines()
+             if lines:
+                 header = [h.strip().replace(';', '') for h in lines[0].split(',')]
+                 rows = []
+                 for line in lines[1:]:
+                     line = line.strip().rstrip(';')
+                     if not line:
+                         continue
+                     if line.startswith('"') and line.endswith('"'):
+                         line = line[1:-1]
+                     try:
+                         parts = next(csv.reader([line], quotechar='"'))
+                     except Exception:
+                         continue
+                     if len(parts) >= len(header):
+                         rows.append(parts[:len(header)])
+                 if rows:
+                     best = pd.DataFrame(rows, columns=header)
+     best.columns = [str(c).strip().replace(';', '') for c in best.columns]
+     return best
+
+ def load_any(path: Path) -> pd.DataFrame:
+     s = path.suffix.lower()
+     if s in ['.csv', '.tsv', '.txt']:
+         return try_read_csv(path)
+     if s in ['.xlsx', '.xls']:
+         return pd.read_excel(path)
+     if s == '.json':
+         obj = json.loads(path.read_text(encoding='utf-8', errors='replace'))
+         if isinstance(obj, list):
+             return pd.json_normalize(obj)
+         if isinstance(obj, dict):
+             # Use the first list-valued entry as the table
+             for v in obj.values():
+                 if isinstance(v, list):
+                     return pd.json_normalize(v)
+         # JSON parsed but held no record list: report that, not a bad extension
+         raise ValueError(f'No list of records found in {path.name}')
+     raise ValueError(f'Unsupported file type: {s}')
+
+ def save_upload(f) -> Path:
+     tmp = Path(tempfile.mkdtemp(prefix='baseline_'))
+     p = tmp / safe_name(f.name)
+     p.write_bytes(f.getbuffer())
+     return p
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # ROLE DETECTION [GON]
+ # ─────────────────────────────────────────────────────────────────────────────
+ def norm(c: str) -> str:
+     return re.sub(r'[^a-z0-9]+', '_', str(c).strip().lower()).strip('_')
+
+ def kscore(c: str, keys: list) -> int:
+     nc = norm(c)
+     return sum(1 for k in keys if k in nc)
+
+ def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
+     out = []
+     n = max(len(df), 1)
+     for col in df.columns:
+         s = df[col]
+         non = float(s.notna().mean())
+         nun = int(s.nunique(dropna=True))
+         ur = nun / n
+         avg = float(s.dropna().astype(str).map(len).mean()) if s.notna().any() else 0
+         out.append({
+             'column': str(col),
+             'non_null': round(non, 3),
+             'unique_values': nun,
+             'unique_ratio': round(ur, 3),
+             'avg_length': round(avg, 1),
+             'leaf_score': 4*kscore(col, LEAF_KEYS) + (3 if 0.5 <= ur <= 1 else 0) + (1 if avg < 80 else 0),
+             'group_score': 4*kscore(col, GROUP_KEYS) + (3 if 1 < nun < min(n*0.5, 80) else 0) + (1 if avg < 60 else 0),
+             'text_score': 5*kscore(col, TEXT_KEYS) + (4 if avg > 50 else 0) + (1 if non > 0.5 else 0),
+             'metadata_score': 4*kscore(col, META_KEYS) + (2 if 1 < nun < min(n*0.8, 100) else 0),
+         })
+     return pd.DataFrame(out)
+
+ def detect_roles(df: pd.DataFrame) -> tuple:
+     prof = profile_columns(df)
+     leaf = prof.sort_values(['leaf_score', 'unique_ratio'], ascending=False).head(1)['column'].tolist()
+     text = (prof[(prof.text_score >= 4) | (prof.avg_length > 80)]
+             .sort_values('text_score', ascending=False)['column'].tolist()) or leaf.copy()
+     group = (prof[(prof.group_score >= 4) & (~prof.column.isin(leaf)) & (prof.unique_values > 1)]
+              .sort_values('group_score', ascending=False)['column'].head(3).tolist())
+     meta = (prof[(prof.metadata_score >= 4) & (~prof.column.isin(text + leaf + group))]
+             .sort_values('metadata_score', ascending=False)['column'].head(5).tolist())
+     return {'leaf_cols': leaf, 'group_cols': group, 'text_cols': text, 'metadata_cols': meta}, prof
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # CANONICAL SCHEMA [GON]
+ # ─────────────────────────────────────────────────────────────────────────────
+ def sv(x) -> str:
+     return '' if pd.isna(x) else str(x).strip()
+
+ def build_canonical(df: pd.DataFrame, cfg: dict, source: str) -> pd.DataFrame:
+     leaf_cols = cfg.get('leaf_cols', [])
+     group_cols = cfg.get('group_cols', [])
+     text_cols = cfg.get('text_cols', [])
+     meta_cols = cfg.get('metadata_cols', [])
+     rows = []
+     for i, row in df.iterrows():
+         leaf_parts = [sv(row.get(c, '')) for c in leaf_cols]
+         leaf_parts = [p for p in leaf_parts if p]
+         label = ' / '.join(leaf_parts) if leaf_parts else f'variable_{i+1}'
+         group_parts = [sv(row.get(c, '')) for c in group_cols]
+         group_parts = [p for p in group_parts if p and p.lower() not in ['nan', 'none']]
+         gpath = ' > '.join(group_parts) if group_parts else 'Ungrouped'
+         parts = []
+         for c in list(dict.fromkeys(group_cols + leaf_cols + text_cols + meta_cols)):
+             v = sv(row.get(c, ''))
+             if v:
+                 parts.append(f'{c}: {v}')
+         text = ' | '.join(parts) if parts else label
+         rows.append({
+             '_source_file': source,
+             '_row_index': int(i),
+             '_leaf_label': label,
+             '_leaf_id': f'{gpath}.{label}' if gpath != 'Ungrouped' else label,
+             '_group_path': gpath,
+             '_text': text,
+         })
+     can = pd.DataFrame(rows)
+     # Make duplicate leaf ids unique by suffixing the repeat count
+     if can['_leaf_id'].duplicated().any():
+         cnt: dict = defaultdict(int)
+         ids = []
+         for lid in can['_leaf_id']:
+             cnt[lid] += 1
+             ids.append(lid if cnt[lid] == 1 else f'{lid}__{cnt[lid]}')
+         can['_leaf_id'] = ids
+     return can
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # TAXONOMIZER CORE [TAX + GON]
+ #
+ # Everything here is data-driven: TF-IDF over the variable text objects, cosine
+ # distance, agglomerative clustering with the number of clusters chosen by
+ # silhouette, and internal-node labels taken from each cluster's most
+ # discriminative terms. The ONLY lexical resource is the generic English
+ # stop-word list (standard IR practice — not dataset-specific).
+ # ─────────────────────────────────────────────────────────────────────────────
+ def vectorize_texts(texts: list):
+     """TF-IDF text objects [GON]. Generic English stop-words only."""
+     vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
+                           max_features=2000, min_df=1, sublinear_tf=True)
+     X = vec.fit_transform(texts)
+     return X, vec
+
+ def best_k(dist: np.ndarray, n: int, k_min: int = 2, k_max: int = 8) -> int:
+     """Pick the number of clusters that maximises the silhouette score.
+
+     Fully data-driven — no fixed cluster count. Returns 1 if no split with
+     >=2 clusters is well separated.
+     """
+     k_hi = min(k_max, n - 1)
+     if k_hi < k_min:
+         return 1
+     best, best_s = 1, -1.0
+     for k in range(k_min, k_hi + 1):
+         labels = AgglomerativeClustering(n_clusters=k, metric='precomputed',
+                                          linkage='average').fit_predict(dist)
+         if len(set(labels)) < 2:
+             continue
+         try:
+             s = silhouette_score(dist, labels, metric='precomputed')
+         except Exception:
+             continue
+         if s > best_s:
+             best_s, best = s, k
+     return best
+
+ def discriminative_label(inside: np.ndarray, outside, terms: np.ndarray,
+                          used: set, top_n: int = 2) -> str:
+     """Label a cluster by the terms that most separate it from its siblings.
+
+     inside  = mean TF-IDF vector of the cluster's members
+     outside = mean TF-IDF vector of the sibling pool (or 0 if none)
+     """
+     scores = inside - (outside if outside is not None else 0)
+     picks: list = []
+     for i in np.argsort(scores)[::-1]:
+         term = terms[i]
+         if len(term) <= 2 or scores[i] <= 0 or term in used:
+             continue
+         picks.append(term)
+         if len(picks) >= top_n:
+             break
+     if not picks:  # degenerate: fall back to highest raw mean term
+         for i in np.argsort(inside)[::-1]:
+             if len(terms[i]) > 2:
+                 picks = [terms[i]]
+                 break
+     return ' / '.join(p.title() for p in picks) if picks else 'Group'
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # HIERARCHY CONSTRUCTION [TAX + GON]
+ # ─────────────────────────────────────────────────────────────────────────────
+ def _nmap(nodes: list) -> dict:
+     return {int(n['id']): n for n in nodes}
+
+ def _next_id(nodes: list) -> int:
+     return max((int(n['id']) for n in nodes), default=0) + 1
+
+ def _add_child(nodes: list, parent_id: int, child_id: int):
+     m = _nmap(nodes)
+     p = m.get(int(parent_id))
+     if p is None:
+         return
+     rel = list(p.get('related', []))
+     if int(child_id) not in rel:
+         rel.append(int(child_id))
+     p['related'] = rel
+
+ def _make_agg(nid: int, name: str, desc: str = '') -> dict:
+     return {'id': int(nid), 'name': str(name), 'related': [],
+             'type': 'aggregation', 'isShown': True, 'desc': desc, 'dtype': 'determine'}
+
+ def _leaf_ids(nodes: list, nid: int) -> list:
+     m = _nmap(nodes)
+     out: list = []
+     def rec(x):
+         n = m.get(int(x))
+         if not n:
+             return
+         if n.get('type') == 'attribute':
+             out.append(int(x))
+             return
+         for c in n.get('related', []):
+             rec(int(c))
+     rec(nid)
+     return list(dict.fromkeys(out))
+
+ def build_hierarchy(can: pd.DataFrame, project: str = 'project',
+                     max_depth: int = 3, min_cluster_size: int = 6,
+                     branch_max: int = 8) -> list:
+     """Pure Taxonomizer construction [TAX].
+
+     Builds an abstract-to-concrete taxonomy by recursively clustering the
+     variables' TF-IDF text objects. At each level the number of clusters is
+     chosen by silhouette; each resulting internal node is labelled with the
+     terms that most discriminate its members from their siblings. No group
+     column, no hardcoded patterns are used in construction — so the recovered
+     structure can be fairly evaluated against the original group column.
+     """
+     # ── root node (id 0) + leaf attribute nodes (ids 1..N) ───────────────────
+     nodes: list = [{'id': 0, 'name': project, 'type': 'root',
+                     'dtype': 'root', 'isShown': True, 'related': [], 'desc': 'Root node'}]
+     row_to_node: list = []
+     for i, (_, r) in enumerate(can.iterrows(), start=1):
+         nodes.append({'id': i, 'name': r['_leaf_label'], 'dtype': 'determine',
+                       'related': [], 'isShown': True, 'type': 'attribute',
+                       'desc': r['_text'],
+                       'metadata': {'leaf_id': r['_leaf_id'], 'group_path': r['_group_path']}})
+         row_to_node.append(i)
+     row_to_node = np.array(row_to_node)
+
+     # ── TF-IDF text objects + full cosine distance matrix [GON] ───────────────
+     texts = (can['_leaf_label'].astype(str) + ' . ' + can['_text'].astype(str)).tolist()
+     X, vec = vectorize_texts(texts)
+     Xd = X.toarray()
+     terms = vec.get_feature_names_out()
+     full_dist = cosine_distances(X).astype(float)
+     np.fill_diagonal(full_dist, 0.0)
+
+     # ── recursive clustering ─────────────────────────────────────────────────
+     def attach_leaves(parent_id: int, idx: np.ndarray):
+         for i in idx:
+             _add_child(nodes, parent_id, int(row_to_node[i]))
+
+     def recurse(parent_id: int, idx: np.ndarray, depth: int, used: set):
+         n = len(idx)
+         if n <= min_cluster_size or depth >= max_depth:
+             attach_leaves(parent_id, idx)
+             return
+
+         sub = full_dist[np.ix_(idx, idx)]
+         k_cap = min(branch_max, max(2, n // min_cluster_size))
+         k = best_k(sub, n, k_min=2, k_max=k_cap)
+         if k <= 1:
+             attach_leaves(parent_id, idx)
+             return
+
+         labels = AgglomerativeClustering(n_clusters=k, metric='precomputed',
+                                          linkage='average').fit_predict(sub)
+         pool_Xd = Xd[idx]
+         for c in range(k):
+             mask = labels == c
+             members = idx[mask]
+             if len(members) == 0:
+                 continue
+             if len(members) == 1:  # don't create singleton internal nodes
+                 _add_child(nodes, parent_id, int(row_to_node[members[0]]))
+                 continue
+             inside = pool_Xd[mask].mean(axis=0)
+             outside = pool_Xd[~mask].mean(axis=0) if (~mask).any() else None
+             label = discriminative_label(inside, outside, terms, used)
+             nid = _next_id(nodes)
+             nodes.append(_make_agg(nid, label,
+                                    desc=f'Cluster of {len(members)} variables — '
+                                         f'discriminative terms: {label}'))
+             _add_child(nodes, parent_id, nid)
+             # Record each label term separately: discriminative_label tests
+             # single terms against `used`, so the joined "A / B" string would
+             # never match and ancestor terms could repeat in descendants.
+             recurse(nid, members, depth + 1,
+                     used | {t.lower() for t in label.split(' / ')})
+
+     recurse(0, np.arange(len(can)), 0, set())
+
+     for n in nodes:
+         n['related'] = list(dict.fromkeys(int(x) for x in n.get('related', [])))
+     return nodes
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # VISUALISATION
+ # ─────────────────────────────────────────────────────────────────────────────
+ def _parent_map(nodes: list) -> dict:
+     pm: dict = {}
+     for n in nodes:
+         for c in n.get('related', []):
+             if int(c) not in pm:
+                 pm[int(c)] = int(n['id'])
+     return pm
+
+ # ─────────────────────────────────────────────────────────────────────────────
+ # EVALUATION HELPERS
+ # ─────────────────────────────────────────────────────────────────────────────
+ def _eval_cluster_assignments(nodes: list, can: pd.DataFrame) -> list[int]:
+     """Return predicted cluster id (depth-1 aggregation ancestor) for each row in can."""
+     pm = _parent_map(nodes)
+     def depth1(nid: int) -> int:
+         # Walk up until our parent is root (id==0) or we have no parent
+         while pm.get(nid, -1) not in (-1, 0):
+             nid = pm[nid]
+         return nid
+     lid_to_nid = {n['metadata']['leaf_id']: int(n['id'])
+                   for n in nodes if n.get('type') == 'attribute' and 'metadata' in n}
+     return [depth1(lid_to_nid[lid]) if lid in lid_to_nid else -1
+             for lid in can['_leaf_id']]
+
+ def _purity(y_true, y_pred) -> float:
+     from collections import Counter
+     # Group true labels by predicted cluster, then count each cluster's majority
+     clusters: dict = {}
+     for t, p in zip(y_true, y_pred):
+         clusters.setdefault(p, []).append(t)
+     correct = sum(Counter(v).most_common(1)[0][1] for v in clusters.values())
+     return correct / max(len(y_true), 1)
+
+ def _structural_stats(nodes: list) -> dict:
+     pm = _parent_map(nodes)
+     def depth_of(nid: int) -> int:
+         d = 0
+         while nid in pm:
+             nid = pm[nid]
+             d += 1
+         return d
+     agg = [n for n in nodes if n.get('type') == 'aggregation']
+     leafs = [n for n in nodes if n.get('type') == 'attribute']
+     depths = [depth_of(int(n['id'])) for n in leafs]
+     branches = [len(n.get('related', [])) for n in agg]
+     singletons = sum(1 for b in branches if b == 1)
+     return {
+         'n_aggregation_nodes': len(agg),
+         'max_depth': int(max(depths, default=0)),
+         'avg_leaf_depth': round(float(np.mean(depths)), 2) if depths else 0.0,
+         'avg_branching_factor': round(float(np.mean(branches)), 2) if branches else 0.0,
+         'singleton_nodes_%': round(100.0 * singletons / max(len(agg), 1), 1),
+     }
+
+ def _wrap(text: str, width: int = 70) -> str:
+     """Wrap long hover text onto multiple <br> lines so it never runs off-screen."""
+     import textwrap
+     text = str(text).replace('<', '&lt;')
+     lines: list = []
+     for para in text.split('\n'):
+         wrapped = textwrap.wrap(para, width=width) or ['']
+         lines.extend(wrapped)
+     return '<br>'.join(lines)
+
+ def plot_sunburst(nodes: list, max_depth: int = 4) -> go.Figure:
+     pm = _parent_map(nodes)
+     ids, labels, parents, values, hover = [], [], [], [], []
+     for n in nodes:
+         nid = int(n['id'])
+         lc = len(_leaf_ids(nodes, nid))
+         ids.append(str(nid))
+         labels.append(str(n.get('name', ''))[:40])
+         parents.append('' if nid == 0 else str(pm.get(nid, 0)))
+         values.append(max(1, lc))
+         desc = _wrap(n.get('desc', ''))
+         hover.append(f'<b>{_wrap(n.get("name",""))}</b><br>Type: {n.get("type","")}'
+                      f'<br>Variables: {lc}<br><br>{desc}')
+     fig = go.Figure(go.Sunburst(
+         ids=ids, labels=labels, parents=parents, values=values,
+         branchvalues='total', hovertext=hover, hoverinfo='text',
+         maxdepth=max_depth, insidetextorientation='radial',
+         marker=dict(colorscale='Greens', line=dict(width=1, color='white')),
+     ))
+     fig.update_layout(height=700, margin=dict(l=10, r=10, t=40, b=10),
+                       title='Click a sector to drill down — click centre to go back')
+     return fig
+
+ def plot_treemap(nodes: list) -> go.Figure:
+     pm = _parent_map(nodes)
+     ids, labels, parents, values, hover = [], [], [], [], []
+     for n in nodes:
+         nid = int(n['id'])
+         lc = len(_leaf_ids(nodes, nid))
+         ids.append(str(nid))
+         labels.append(str(n.get('name', ''))[:40])
+         parents.append('' if nid == 0 else str(pm.get(nid, 0)))
+         values.append(max(1, lc))
+         desc = _wrap(n.get('desc', ''))
+         hover.append(f'<b>{_wrap(n.get("name",""))}</b><br>Variables: {lc}<br>{desc}')
+     fig = go.Figure(go.Treemap(
+         ids=ids, labels=labels, parents=parents, values=values,
+         branchvalues='total', hovertext=hover, hoverinfo='text',
+         textinfo='label+value',
+         marker=dict(colorscale='Greens', line=dict(width=1, color='white')),
+     ))
+     fig.update_layout(height=700, margin=dict(l=10, r=10, t=10, b=10))
+     return fig
+
500
+ # ─────────────────────────────────────────────────────────────────────────────
501
+ # SIDEBAR
502
+ # ─────────────────────────────────────────────────────────────────────────────
503
+ with st.sidebar:
504
+ st.header('1. Upload')
505
+ uploaded = st.file_uploader(
506
+ 'Upload a metadata file',
507
+ type=['csv', 'tsv', 'txt', 'xlsx', 'xls', 'json'],
508
+ accept_multiple_files=False,
509
+ )
510
+ st.header('2. Taxonomizer settings')
511
+ tx_max_depth = st.slider('Max taxonomy depth', 2, 5, 3, 1,
512
+ help='How many abstract-to-concrete levels to build')
513
+ tx_min_size = st.slider('Min cluster size', 3, 20, 6, 1,
514
+ help='Clusters smaller than this stop splitting (leaves attach directly)')
515
+ tx_branch = st.slider('Max branches per node', 3, 12, 8, 1,
516
+ help='Upper bound on clusters per split; the actual number is chosen by silhouette')
517
+
518
+ st.header('3. Display')
519
+ max_items = st.slider('Maximum variables', 25, 1200, 300, 25)
520
+ group_filter = st.text_input('Group filter (optional)', value='',
521
+ help='Filter rows whose group path contains this text')
522
+ display_depth = st.slider('Sunburst depth', 2, 6, 4, 1)
523
+
524
+ # ─────────────────────────────────────────────────────────────────────────────
525
+ # MAIN
526
+ # ─────────────────────────────────────────────────────────────────────────────
527
+ if not uploaded:
528
+ st.info('Upload a metadata CSV / XLSX / JSON file to begin.')
529
+ st.markdown("""
530
+ ### Baseline algorithm — pure Taxonomizer
531
+
532
+ The simplest of the three approaches — no hardcoded domain patterns, no
533
+ external APIs, no neural embeddings. Works on any dataset.
534
+
535
+ | Step | Method | Paper |
536
+ |------|--------|-------|
537
+ | Text object | Concatenate all metadata fields per variable | Gonçalves et al. |
538
+ | Representation | TF-IDF (generic English stop-words only) | Gonçalves et al. |
539
+ | Hierarchy construction | Recursive agglomerative clustering (cosine), #clusters chosen by silhouette | Taxonomizer (Sultanum et al.) |
540
+ | Node labelling | Most discriminative terms of each cluster vs its siblings | Taxonomizer / HiExpan |
541
+
542
+ The group column is **not** used for construction, so the recovered taxonomy
543
+ can be fairly evaluated against it (NMI / ARI / Purity in the Evaluation tab).
544
+
545
+ **Approach 1** adds SBERT embeddings + Wikidata/BioPortal enrichment + HiExpan refinement.
546
+
547
+ **Approach 2** adds NMF/FASTopic aspect discovery + GMM clustering + optional LLM labels.
548
+ """)
549
+ st.stop()
550
+
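The node-labelling row of the table above ("most discriminative terms of each cluster vs its siblings") can be illustrated with a small frequency contrast. This is a minimal sketch under stated assumptions, not the Taxonomizer or HiExpan scoring and not the code in `build_hierarchy`: the function name `discriminative_terms`, the whitespace tokenizer, and the score (a term's share in the cluster minus its mean share in the siblings) are all chosen here for illustration.

```python
from collections import Counter


def discriminative_terms(clusters, top_k=2):
    """Map each cluster name to its top_k terms by share contrast against siblings."""
    shares = {}
    for name, docs in clusters.items():
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        total = sum(counts.values()) or 1
        shares[name] = {term: c / total for term, c in counts.items()}

    labels = {}
    for name, own in shares.items():
        siblings = [s for other, s in shares.items() if other != name]

        def score(term):
            # Share inside this cluster minus the mean share across siblings.
            sib = sum(s.get(term, 0.0) for s in siblings) / max(len(siblings), 1)
            return own[term] - sib

        # Sort by descending score, then alphabetically to break ties.
        labels[name] = sorted(own, key=lambda t: (-score(t), t))[:top_k]
    return labels


toy_clusters = {
    'a': ['mean correct latency', 'median correct latency'],
    'b': ['total errors', 'total correct'],
}
toy_labels = discriminative_terms(toy_clusters)
```

The word `correct` is frequent in both toy clusters, so it is ranked below terms that separate one cluster from the other.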
551
+ path = save_upload(uploaded)
552
+
553
+ @st.cache_data(show_spinner=False)
554
+ def _load_profile(path_str: str):
555
+ df = load_any(Path(path_str))
556
+ cfg, prof = detect_roles(df)
557
+ return df, cfg, prof
558
+
559
+ with st.spinner('Loading file…'):
560
+ df, auto_cfg, prof = _load_profile(str(path))
561
+
562
+ st.subheader('Step 1 — File preview')
563
+ with st.expander(f'📄 {uploaded.name} ({len(df):,} rows, {len(df.columns)} columns)',
564
+ expanded=False):
565
+ st.dataframe(df.head(10), use_container_width=True)
566
+ score_cols = [c for c in ['column', 'leaf_score', 'group_score', 'text_score', 'metadata_score']
567
+ if c in prof.columns]
568
+ st.dataframe(prof[score_cols].sort_values('leaf_score', ascending=False),
569
+ use_container_width=True)
570
+
571
+ st.subheader('Step 2 — Confirm column roles')
572
+ cols = list(df.columns)
573
+ with st.expander('Column configuration', expanded=True):
574
+ left, right = st.columns(2)
575
+ with left:
576
+ leaf_cols = st.multiselect('Leaf variable column(s)', cols,
577
+ default=[c for c in auto_cfg.get('leaf_cols', []) if c in cols], key='leaf')
578
+ group_cols = st.multiselect('Group/task column(s)', cols,
579
+ default=[c for c in auto_cfg.get('group_cols', []) if c in cols], key='group')
580
+ with right:
581
+ text_cols = st.multiselect('Text/description column(s)', cols,
582
+ default=[c for c in auto_cfg.get('text_cols', []) if c in cols], key='text')
583
+ meta_cols = st.multiselect('Metadata/type column(s)', cols,
584
+ default=[c for c in auto_cfg.get('metadata_cols', []) if c in cols], key='meta')
585
+
586
+ if not leaf_cols:
587
+ st.error('Choose at least one leaf variable column.')
588
+ st.stop()
589
+
590
+ cfg = {'leaf_cols': leaf_cols, 'group_cols': group_cols,
591
+ 'text_cols': text_cols, 'metadata_cols': meta_cols}
592
+
593
+ if st.button('Build baseline hierarchy', type='primary'):
594
+ with st.spinner('Building hierarchy…'):
595
+ _can = build_canonical(df, cfg, source=Path(uploaded.name).stem)
596
+
597
+ if group_filter.strip():
598
+ _can = _can[_can['_group_path'].str.contains(
599
+ group_filter.strip(), case=False, na=False, regex=False)].copy()
600
+
601
+ if len(_can) > max_items:
602
+ _can = _can.head(max_items).copy()
603
+
604
+ _can = _can.reset_index(drop=True)
605
+
606
+ if len(_can) < 2:
607
+ st.error('Need at least 2 variables after filtering.')
608
+ st.stop()
609
+
610
+ _pname = Path(uploaded.name).stem
611
+ _nodes = build_hierarchy(_can, project=_pname,
612
+ max_depth=tx_max_depth,
613
+ min_cluster_size=tx_min_size,
614
+ branch_max=tx_branch)
615
+
616
+ st.session_state['_bl_nodes'] = _nodes
617
+ st.session_state['_bl_can'] = _can
618
+ st.session_state['_bl_project'] = _pname
619
+
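The group filter is typed as free text, while pandas `Series.str.contains` interprets its pattern as a regular expression unless `regex=False` is passed. The sketch below uses only the standard `re` module to show the difference between literal and regex matching. The helper `literal_contains` and the sample paths are hypothetical.

```python
import re


def literal_contains(values, needle):
    """Case-insensitive substring test that treats `needle` as plain text."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return [bool(pattern.search(v)) for v in values]


paths = ['DMS / DMS Recommended Standard', 'PAL (extended)', 'MOT Tone 2.0']
matches = literal_contains(paths, 'pal (')

# The same text used directly as a regex is rejected: the parenthesis is unbalanced.
try:
    re.compile('pal (')
    regex_rejected = False
except re.error:
    regex_rejected = True
```

Literal matching is usually the safer default for a filter box, since group names often contain parentheses, dots, or plus signs.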
620
+ if '_bl_nodes' not in st.session_state:
621
+ st.info('Configure columns above then click **Build baseline hierarchy**.')
622
+ st.stop()
623
+
624
+ nodes = st.session_state['_bl_nodes']
625
+ can = st.session_state['_bl_can']
626
+ project_name = st.session_state['_bl_project']
627
+
628
+ _sm = _structural_stats(nodes)
629
+ n_leaves = len([n for n in nodes if n['type'] == 'attribute'])
630
+ n_internal = len([n for n in nodes if n['type'] == 'aggregation'])
631
+
632
+ st.divider()
633
+ c1, c2, c3, c4 = st.columns(4)
634
+ c1.metric('Variables', n_leaves)
635
+ c2.metric('Aggregation nodes', n_internal)
636
+ c3.metric('Max depth', _sm['max_depth'])
637
+ c4.metric('Avg branching', _sm['avg_branching_factor'])
638
+
639
+ tabs = st.tabs(['Sunburst', 'Treemap', 'Node detail', 'Canonical table', 'Export', '📊 Evaluation'])
640
+
641
+ with tabs[0]:
642
+ st.plotly_chart(plot_sunburst(nodes, max_depth=display_depth), use_container_width=True)
643
+ st.caption('Green = Baseline. Click a sector to drill down; click the centre to go back.')
644
+
645
+ with tabs[1]:
646
+ st.plotly_chart(plot_treemap(nodes), use_container_width=True)
647
+
648
+ with tabs[2]:
649
+ nm = _nmap(nodes)
650
+ agg_nodes = [n for n in nodes if n['type'] in ('aggregation', 'root')]
651
+ options = [f'{n["name"]} [{len(_leaf_ids(nodes, int(n["id"])))} vars]'
652
+ for n in agg_nodes]
653
+ if options:
654
+ sel = st.selectbox('Select a node', options)
655
+ sel_name = sel.rsplit(' [', 1)[0]
656
+ sel_node = next((n for n in agg_nodes if n['name'] == sel_name), None)
657
+ if sel_node:
658
+ lids = _leaf_ids(nodes, int(sel_node['id']))
659
+ leaf_ids_set = {nm[i]['metadata']['leaf_id']
660
+ for i in lids if i in nm and 'metadata' in nm[i]}
661
+ sub = can[can['_leaf_id'].isin(leaf_ids_set)]
662
+ st.write(f'**{len(lids)} variables** under "{sel_node["name"]}"')
663
+ st.dataframe(sub[['_leaf_label', '_group_path', '_text']].reset_index(drop=True),
664
+ use_container_width=True)
665
+
666
+ with tabs[3]:
667
+ st.dataframe(can, use_container_width=True)
668
+
669
+ with tabs[4]:
670
+ _base = safe_name(project_name)
671
+ col1, col2 = st.columns(2)
672
+ with col1:
673
+ st.download_button(
674
+ 'Hierarchy JSON',
675
+ data=json.dumps(nodes, indent=2, ensure_ascii=False).encode('utf-8'),
676
+ file_name=f'{_base}_baseline_hierarchy.json',
677
+ mime='application/json',
678
+ use_container_width=True,
679
+ )
680
+ with col2:
681
+ st.download_button(
682
+ 'Canonical CSV',
683
+ data=can.to_csv(index=False).encode('utf-8'),
684
+ file_name=f'{_base}_baseline_canonical.csv',
685
+ mime='text/csv',
686
+ use_container_width=True,
687
+ )
688
+
689
+ st.divider()
690
+ # ── Save directly into the project's outputs/baseline/ folder ──────────────
691
+ _out_dir = Path(__file__).resolve().parent / 'outputs' / 'baseline'
692
+ st.markdown('### Save to project folder')
693
+ st.caption(
694
+ 'The download buttons above go to your browser’s Downloads folder (a browser '
695
+ f'restriction). This button instead writes the files into `{_out_dir}` with the '
696
+ 'dataset name — convenient for `evaluate_all.py`.'
697
+ )
698
+ if st.button('💾 Save all to outputs/baseline/', type='primary',
699
+ use_container_width=True):
700
+ try:
701
+ _out_dir.mkdir(parents=True, exist_ok=True)
702
+ (_out_dir / f'{_base}_baseline_hierarchy.json').write_text(
703
+ json.dumps(nodes, indent=2, ensure_ascii=False), encoding='utf-8')
704
+ can.to_csv(_out_dir / f'{_base}_baseline_canonical.csv', index=False)
705
+ st.success(f'Saved to `{_out_dir}`:\n\n'
706
+ f'- {_base}_baseline_hierarchy.json\n'
707
+ f'- {_base}_baseline_canonical.csv')
708
+ except Exception as _e:
709
+ st.error(f'Could not save: {_e}')
710
+
711
+ with tabs[5]:
712
+ import hierarchy_eval as he
713
+
714
+ st.subheader('Hierarchy Quality Evaluation')
715
+ st.caption(
716
+ 'The group column is a *construction input* (Gonçalves text object), so it '
717
+ 'cannot serve as ground truth. The primary metrics below are **reference-free** '
718
+ '— they assess the hierarchy itself, with no gold standard.'
719
+ )
720
+
721
+ with st.spinner('Computing reference-free metrics…'):
722
+ tm = he.traco_metrics(nodes)
723
+ npmi = he.npmi_coherence(nodes, can['_text'].tolist())
724
+
725
+ # ── PRIMARY: reference-free hierarchy quality ─────────────────────────────
726
+ st.markdown('#### Primary — reference-free hierarchy quality')
727
+ p1, p2, p3 = st.columns(3)
728
+ p1.metric('Parent–child coherence', tm['pc_coherence'],
729
+ help='TraCo (Wu et al., AAAI 2024). Mean similarity of each node to its parent. '
730
+ 'Higher = children correctly nest under their parent theme.')
731
+ p2.metric('Sibling diversity', tm['sibling_diversity'],
732
+ help='TraCo (Wu et al., AAAI 2024). Mean distance between sibling nodes. '
733
+ 'Higher = siblings are distinct (LOW = redundant/repeated siblings).')
734
+ p3.metric('NPMI label coherence', npmi,
735
+ help='Lau et al., EACL 2014. Whether node-label terms genuinely co-occur in the '
736
+ 'data. Higher = meaningful labels, not arbitrary term salads.')
737
+ st.caption(f'Embedding backend: **{tm["encoder"]}**. '
738
+ 'Coherence & diversity ∈ [−1, 1]; NPMI ∈ [−1, 1].')
739
+
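For reference, NPMI for a pair of terms is the pointwise mutual information divided by the negative log of the joint probability, which bounds it to [-1, 1]. The sketch below computes it from document-level co-occurrence. It is an illustration only: `he.npmi_coherence` is not shown in this hunk, so the function names, the set-of-tokens document representation, and the handling of the edge cases are assumptions.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two terms from document co-occurrence; docs is a list of token sets."""
    n = len(docs)
    d1 = sum(1 for d in docs if w1 in d)
    d2 = sum(1 for d in docs if w2 in d)
    d12 = sum(1 for d in docs if w1 in d and w2 in d)
    if d12 == 0:
        return -1.0  # never co-occur: lower bound
    p1, p2, p12 = d1 / n, d2 / n, d12 / n
    if p12 == 1.0:
        return 1.0  # convention chosen here for terms present in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def label_npmi(terms, docs):
    """Mean NPMI over all pairs of label terms (0.0 if fewer than two terms)."""
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


docs = [{'mean', 'latency'}, {'median', 'latency'},
        {'total', 'errors'}, {'total', 'correct'}]
coherent = label_npmi(['mean', 'latency'], docs)   # terms that co-occur
unrelated = npmi_pair('mean', 'total', docs)       # terms that never co-occur
```

A label made of terms that appear together in the descriptions scores above zero, while a label mixing terms from unrelated descriptions is pushed towards -1.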
740
+ # ── Structural metrics ────────────────────────────────────────────────────
741
+ st.markdown('#### Structural statistics')
742
+ sm = he.structural_stats(nodes)
743
+ s1, s2, s3, s4, s5 = st.columns(5)
744
+ s1.metric('Aggregation nodes', sm['n_aggregation_nodes'])
745
+ s2.metric('Max leaf depth', sm['max_depth'])
746
+ s3.metric('Avg leaf depth', sm['avg_leaf_depth'])
747
+ s4.metric('Avg branching', sm['avg_branching_factor'])
748
+ s5.metric('Singleton nodes', f"{sm['singleton_nodes_%']}%",
749
+ help='Aggregation nodes with a single child (sparse-hierarchy indicator)')
750
+
751
+ # ── SECONDARY: group preservation (caveated) ──────────────────────────────
752
+ st.markdown('#### Secondary — group-structure preservation *(descriptive)*')
753
+ st.caption(
754
+ '⚠️ The group column was an **input** to construction, so these are NOT accuracy '
755
+ 'metrics — they only describe how much the discovered hierarchy still reflects the '
756
+ 'pre-existing group column. High values are expected and not evidence of quality.'
757
+ )
758
+ gp = he.group_preservation(nodes, can)
759
+ g1, g2, g3 = st.columns(3)
760
+ g1.metric('NMI', gp['NMI']); g2.metric('ARI', gp['ARI']); g3.metric('Purity', gp['Purity'])
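Of the three group-preservation numbers, purity is the simplest to reproduce: each cluster is credited with its most frequent group label, and the credited items are divided by the total. The sketch below is a minimal version under stated assumptions. How `he.group_preservation` flattens the hierarchy into cluster ids is not shown, so the `purity` function and its flat inputs are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, group_labels):
    """Share of items that belong to the majority group of their cluster."""
    if not cluster_ids:
        return 0.0
    by_cluster = defaultdict(Counter)
    for cluster, group in zip(cluster_ids, group_labels):
        by_cluster[cluster][group] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(cluster_ids)


groups = ['DMS', 'DMS', 'PAL', 'PAL', 'PAL', 'MOT']
two_clusters = purity([0, 0, 0, 1, 1, 1], groups)   # majority 2 + 2 out of 6
all_singletons = purity(list(range(6)), groups)     # every item in its own cluster
```

The singleton case reaches a purity of 1.0 without any meaningful grouping, which is one reason purity is usually read alongside NMI and ARI rather than on its own.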
data/Data_Dictionary.csv ADDED
The diff for this file is too large to render. See raw diff
 
data/HCP_S1200_DataDictionary_Oct_30_2023.csv ADDED
The diff for this file is too large to render. See raw diff
 
data/ai-mind-variable-descriptions(in).csv ADDED
@@ -0,0 +1,109 @@
1
+ Task,Variant,name,description,Decimal Places
2
+ DMS,DMS Recommended Standard,DMSCC,"DMS Mean Choices to Correct: The mean number of choices that the subject made on each trial, including the correct choice. Calculated across all trials where the subject eventually made the correct choice (simultaneous and all delays).",2
3
+ DMS,DMS Recommended Standard,DMSL0SD,"DMS Correct Latency Standard Deviation (SD) (0 second delay): The standard deviation of response latencies for trials containing a zero second delay between the presentation of target and response stimuli, where subjects selected the correct box on their first attempt. Calculated across all assessed trials containing a zero second delay.",4
4
+ DMS,DMS Recommended Standard,DMSL12SD,"DMS Correct Latency Standard Deviation (SD) (12 second delay): The standard deviation of response latencies for trials containing a twelve second delay between the presentation of target and response stimuli, where subjects selected the correct box on their first attempt. Calculated across all assessed trials containing a twelve second delay.",4
5
+ DMS,DMS Recommended Standard,DMSL4SD,"DMS Correct Latency Standard Deviation (SD) (4 second delay): The standard deviation of response latencies for trials containing a four second delay between the presentation of target and response stimuli, where subjects selected the correct box on their first attempt. Calculated across all assessed trials containing a four second delay.",4
6
+ DMS,DMS Recommended Standard,DMSLADSD,"DMS Correct Latency Standard Deviation (SD) (all delays): The standard deviation of response latencies for trials containing a delay between the presentation of target stimulus and response stimuli, where subjects selected the correct box on their first attempt. Calculated across all assessed trials containing a delay.",4
7
+ DMS,DMS Recommended Standard,DMSLSD,DMS Correct Latency Standard Deviation (SD): The standard deviation of response latencies for trials where subjects selected the correct box on their first attempt. Calculated across all correct assessed trials (simultaneous and all delays).,4
8
+ DMS,DMS Recommended Standard,DMSLSSD,"DMS Correct Latency Standard Deviation (SD) (simultaneous): The standard deviation of response latencies for trials containing a simultaneous presentation of target and response stimuli, where subjects selected the correct box on their first attempt. Calculated across all assessed trials containing simultaneous presentations.",4
9
+ DMS,DMS Recommended Standard,DMSMDL,DMS Median Correct Latency: The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt. Calculated across all correct assessed trials (simultaneous and all delays).,4
10
+ DMS,DMS Recommended Standard,DMSMDL0,DMS Median Correct Latency (0 seconds delay): The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a zero second delay. Calculated across all assessed trials containing a zero second delay.,4
11
+ DMS,DMS Recommended Standard,DMSMDL12,DMS Median Correct Latency (12 seconds delay): The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a twelve second delay. Calculated across all assessed trials containing a twelve second delay.,4
12
+ DMS,DMS Recommended Standard,DMSMDL4,DMS Median Correct Latency (4 seconds delay): The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a four second delay. Calculated across all assessed trials containing a four second delay.,4
13
+ DMS,DMS Recommended Standard,DMSMDLAD,DMS Median Correct Latency (all delays): The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a delay between target and response stimuli presentation. Calculated across all assessed trials containing a delay.,4
14
+ DMS,DMS Recommended Standard,DMSMDLS,DMS Median Correct Latency (simultaneous): The median latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a simultaneous presentation of target and response stimuli. Calculated across all assessed trials containing simultaneous presentation.,4
15
+ DMS,DMS Recommended Standard,DMSML,DMS Mean Correct Latency: The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt. Calculated across all correct assessed trials (simultaneous and all delays).,4
16
+ DMS,DMS Recommended Standard,DMSML0,DMS Mean Correct Latency (0 seconds delay): The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a zero second delay. Calculated across all assessed trials containing a zero second delay.,4
17
+ DMS,DMS Recommended Standard,DMSML12,DMS Mean Correct Latency (12 seconds delay): The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a twelve second delay. Calculated across all assessed trials containing a twelve second delay.,4
18
+ DMS,DMS Recommended Standard,DMSML4,DMS Mean Correct Latency (4 seconds delay): The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a four second delay. Calculated across all assessed trials containing a four second delay.,4
19
+ DMS,DMS Recommended Standard,DMSMLAD,DMS Mean Correct Latency (all delays): The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a delay between target and response stimuli presentation. Calculated across all assessed trials containing a delay.,4
20
+ DMS,DMS Recommended Standard,DMSMLS,DMS Mean Correct Latency (simultaneous): The mean latency between the presentation of the response stimuli options and the subject selecting the correct box on their first attempt for trials containing a simultaneous presentation of target and response stimuli. Calculated across all assessed trials containing simultaneous presentation.,4
21
+ DMS,DMS Recommended Standard,DMSPC,DMS Percent Correct: The percentage of assessment trials during which the subject chose the correct box on their first box choice. Calculated across all assessed trials (simultaneous presentation and all delays).,0
22
+ DMS,DMS Recommended Standard,DMSPC0,KEY: DMS Percent Correct (0 seconds delay): The percentage of assessment trials containing a zero second delay during which the subject chose the correct box on their first box choice. Calculated across all assessed trials containing a zero second delay.,0
23
+ DMS,DMS Recommended Standard,DMSPC12,KEY: DMS Percent Correct (12 second delay): The percentage of assessment trials containing a twelve second delay during which the subject chose the correct box on their first box choice. Calculated across all assessed trials containing a twelve second delay.,0
24
+ DMS,DMS Recommended Standard,DMSPC4,KEY: DMS Percent Correct (4 second delay): The percentage of assessment trials containing a four second delay during which the subject chose the correct box on their first box choice. Calculated across all assessed trials containing a four second delay.,0
25
+ DMS,DMS Recommended Standard,DMSPCAD,KEY: DMS Percent Correct (all delays): The percentage of assessment trials containing a delay during which the subject chose the correct box on their first box choice. Calculated across all assessed trials containing a delay.,0
26
+ DMS,DMS Recommended Standard,DMSPCS,KEY: DMS Percent Correct (simultaneous): The percentage of assessment trials where the target and response stimuli were presented simultaneously during which the subject chose the correct box on their first box choice. Calculated across all assessed trials containing the simultaneous presentation of stimuli.,0
27
+ DMS,DMS Recommended Standard,DMSPEGC,DMS Probability of Error Given Correct: This measure reports the probability of an error being made when the previous trial was responded to correctly by the subject. Calculated across all assessed trials (simultaneous and all delays).,4
28
+ DMS,DMS Recommended Standard,DMSPEGE,KEY: DMS Probability of Error Given Error: This measure reports the probability of an error occurring when the previous trial was responded to incorrectly. Calculated across all assessed trials (simultaneous and all delays).,4
29
+ DMS,DMS Recommended Standard,DMSTC,DMS Total Correct: The total number of times a subject chose the correct answer on their first box choice. Calculated across all assessed trials (simultaneous presentation and all delays).,0
30
+ DMS,DMS Recommended Standard,DMSTC0,DMS Total Correct (0 second delay): The total number of times a subject chose the correct answer on their first box choice for trials where the response stimuli appeared on screen after a 0 second delay after the target stimulus was shown. Calculated across all assessed trials which contained a delay of zero seconds.,0
31
+ DMS,DMS Recommended Standard,DMSTC12,DMS Total Correct (12 second delay): The total number of times a subject chose the correct answer on their first box choice for trials where the response stimuli appeared on screen after a 12 second delay after the target stimulus was shown. Calculated across all assessed trials which contained a delay of twelve seconds.,0
32
+ DMS,DMS Recommended Standard,DMSTC4,DMS Total Correct (4 second delay): The total number of times a subject chose the correct answer on their first box choice for trials where the response stimuli appeared on screen after a 4 second delay after the target stimulus was shown. Calculated across all assessed trials which contained a delay of four seconds.,0
33
+ DMS,DMS Recommended Standard,DMSTCAD,DMS Total Correct (all delays): The total number of times a subject chose the correct answer on their first box choice for all trials where the response stimuli were presented after a delay. Calculated across all assessed trials containing a delay.,0
34
+ DMS,DMS Recommended Standard,DMSTCS,DMS Total Correct (simultaneous): The total number of times a subject chose the correct answer on their first box choice for trials where the target stimulus and response stimuli appeared on screen simultaneously. Calculated across all assessed trials that included a simultaneous presentation (no delay) of target and response stimuli.,0
35
+ DMS,DMS Recommended Standard,DMSTE,"DMS Total Errors: The total number of times a subject failed to choose the correct box on their first selection, thus making an error. Calculated across all assessed trials (simultaneous and all delays) regardless of which incorrect box (out of the 3 possible incorrect boxes) was chosen.",0
36
+ DMS,DMS Recommended Standard,DMSTEAD,DMS Total Errors (all delays): The total number of times a subject failed to choose the correct box on their first selection for any trial containing a delay between the presentation of the target stimulus and response stimuli. Calculated across all assessed trials containing a delay component.,0
37
+ DMS,DMS Recommended Standard,DMSTEC,"DMS Error (incorrect colour): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained the same pattern/ physical attributes, but different colours. Calculated across all assessed trials (simultaneous and all delays).",0
38
+ DMS,DMS Recommended Standard,DMSTECAD,"DMS Error (all delays, incorrect colour): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained the same colour elements, but different physical attributes. Calculated across all assessed trials which contained a delay component.",0
39
+ DMS,DMS Recommended Standard,DMSTED,"DMS Error (distractor): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained no common elements to the original target stimulus. Calculated across all assessed trials (simultaneous and all delays).",0
40
+ DMS,DMS Recommended Standard,DMSTEDAD,"DMS Error (all delays, distractor): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained no common elements to the original target stimulus. Calculated across all assessed trials which contained a delay component.",0
41
+ DMS,DMS Recommended Standard,DMSTEP,"DMS Error (incorrect pattern): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained the same colour elements, but different pattern/ physical attributes. Calculated across all assessed trials (simultaneous and all delays).",0
42
+ DMS,DMS Recommended Standard,DMSTEPAD,"DMS Error (all delays, incorrect pattern): The number of times that the subject failed to select the correct box on their first selection, and instead chose the distractor stimulus that contained the same pattern/ physical attributes, but different colour elements. Calculated across all assessed trials which contained a delay component.",0
43
+ MOT,MOT Tone 2.0,MOTML,The mean latency from the display of a stimulus to a correct response to that stimulus during assessment trials.,1
44
+ MOT,MOT Tone 2.0,MOTSDL,"This is the standard deviation of the latency, calculated from the display of a stimulus to a correct response to that stimulus during assessment trials.",2
45
+ MOT,MOT Tone 2.0,MOTTC,The total number of assessment trials on which the subject made a correct response.,0
46
+ MOT,MOT Tone 2.0,MOTTE,The total number of assessment trials on which the subject failed to make a correct response.,0
47
+ PAL,PAL Recommended Standard Extended,PALFAMS28,"KEY: PAL First Attempt Memory Score: The number of times a subject chose the correct box on their first attempt when recalling the pattern locations. Calculated across assessed trials, omitting 12 box level to provide a direct comparison to Recommended Standard..",0
48
+ PAL,PAL Recommended Standard Extended,PALMETS28,PAL Mean Errors to Success: The mean number of attempts made by a subject needed for them to successfully complete the stage. Does not include 12 box level to provide a direct comparison to Recommended Standard.,0
49
+ PAL,PAL Recommended Standard Extended,PALNPR28,PAL Number of Patterns Reached: The number of patterns presented to the subject on the last problem they reached.,0
50
+ PAL,PAL Recommended Standard Extended,PALTA12,PAL Total Attempts 12 Patterns: The total number of attempts made (but not necessarily completed) by the subject during assessment problems containing a total of 12 shapes to recall.,0
51
+ PAL,PAL Recommended Standard Extended,PALTA2,PAL Total Attempts 2 Patterns: The total number of attempts made (but not necessarily completed) by the subject during assessment problems containing a total of 2 shapes to recall.,0
52
+ PAL,PAL Recommended Standard Extended,PALTA28,PAL Total Attempts: The total number of attempts made (but not necessarily completed) by the subject during assessment problems. Does not include 12 box level to provide a direct comparison to Recommended Standard.,0
53
+ PAL,PAL Recommended Standard Extended,PALTA4,PAL Total Attempts 4 patterns: The total number of attempts made (but not necessarily completed) by the subject during assessment problems containing a total of 4 shapes to recall.,0
54
+ PAL,PAL Recommended Standard Extended,PALTA6,PAL Total Attempts 6 Patterns: The total number of attempts made (but not necessarily completed) by the subject during assessment problems containing a total of 6 shapes to recall.,0
55
+ PAL,PAL Recommended Standard Extended,PALTA8,PAL Total Attempts 8 Patterns: The total number of attempts made (but not necessarily completed) by the subject during assessment problems containing a total of 8 shapes to recall.,0
56
+ PAL,PAL Recommended Standard Extended,PALTE12,PAL Total Errors 12 Patterns: The total number of times a subject selected an incorrect box when attempting to recall a pattern location on trials containing a total of 12 patterns. Calculated across all 12-pattern assessed trials.,0
57
+ PAL,PAL Recommended Standard Extended,PALTE2,PAL Total Errors 2 Patterns: The total number of times a subject selected an incorrect box when attempting to recall a pattern location on trials containing a total of 2 patterns. Calculated across all 2-pattern assessed trials.,0
58
+ PAL,PAL Recommended Standard Extended,PALTE28,PAL Total Errors: The total number of times a subject selected an incorrect box when attempting to recall a pattern location. Calculated across all assessed trials. Does not include 12 box level to provide a direct comparison to Recommended Standard.,0
59
+ PAL,PAL Recommended Standard Extended,PALTE4,PAL Total Errors 4 Patterns: The total number of times a subject selected an incorrect box when attempting to recall a pattern location on trials containing a total of 4 patterns. Calculated across all 4-pattern assessed trials.,0
60
+ PAL,PAL Recommended Standard Extended,PALTE6,PAL Total Errors 6 Patterns: The total number of times a subject selected an incorrect box when attempting to recall a pattern location on trials containing a total of 6 patterns. Calculated across all 6-pattern assessed trials.,0
61
+ PAL,PAL Recommended Standard Extended,PALTE8,PAL Total Errors 8 Patterns: The total number of times a subject selected an incorrect box when attempting to recall a pattern location on trials containing a total of 8 patterns. Calculated across all 8-pattern assessed trials.,0
62
+ PAL,PAL Recommended Standard Extended,PALTEA12,"PAL Total Errors 12 Shapes (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems, where the number of shapes was equal to 12 (PALTE12), plus an adjustment for the estimated number of errors they would have made on any other 12 pattern problems, attempts and recalls they did not reach.",0
63
+ PAL,PAL Recommended Standard Extended,PALTEA2,"PAL Total Errors 2 Shapes (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems, where the number of shapes required to remember was equal to 2 (PALTE2), plus an adjustment for the estimated number of errors they would have made on any other 2 pattern problems, attempts and recalls they did not reach.",0
64
+ PAL,PAL Recommended Standard Extended,PALTEA28,"KEY: PAL Total Errors (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems (PALTE), plus an adjustment for the estimated number of errors they would have made on any problems, attempts and recalls they did not reach. This measure allows you to compare performance on errors made across all subjects regardless of those who terminated early versus those completing the final stage of the task. In this task variant PALTEA does not include 12 box level to provide a direct comparison to Recommended Standard.",0
65
+ PAL,PAL Recommended Standard Extended,PALTEA4,"PAL Total Errors 4 Shapes (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems, where the number of shapes was equal to 4 (PALTE4), plus an adjustment for the estimated number of errors they would have made on any other 4 pattern problems, attempts and recalls they did not reach.",0
66
+ PAL,PAL Recommended Standard Extended,PALTEA6,"PAL Total Errors 6 Shapes (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems, where the number of shapes was equal to 6 (PALTE6), plus an adjustment for the estimated number of errors they would have made on any other 6 pattern problems, attempts and recalls they did not reach.",0
67
+ PAL,PAL Recommended Standard Extended,PALTEA8,"PAL Total Errors 8 Shapes (Adjusted): The number of times the subject chose the incorrect box for a stimulus on assessment problems, where the number of shapes was equal to 8 (PALTE8), plus an adjustment for the estimated number of errors they would have made on any other 8 pattern problems, attempts and recalls they did not reach.",0
68
+ PRM,PRM Recommended Standard 18 Extended,PRMCLSDD,"PRM Correct Latency (SD) Delayed: The standard deviation for the latency of a subject's response to correctly choose the appropriate pattern in the delayed forced-choice condition, measured in milliseconds.",2
69
+ PRM,PRM Recommended Standard 18 Extended,PRMCLSDI,"PRM Correct Latency (SD) Immediate: The standard deviation for the latency of a subject's response to correctly select the appropriate pattern in the immediate forced-choice condition, measured in milliseconds.",2
70
+ PRM,PRM Recommended Standard 18 Extended,PRMMCLD,"PRM Mean Correct Latency Delayed: The mean latency for a subject to correctly select the appropriate pattern during the delayed forced-choice condition, measured in milliseconds.",2
71
+ PRM,PRM Recommended Standard 18 Extended,PRMMCLI,"PRM Mean Correct Latency Immediate: The mean latency for a subject to correctly select the appropriate pattern during the immediate forced-choice condition, measured in milliseconds.",2
72
+ PRM,PRM Recommended Standard 18 Extended,PRMMDCLD,"PRM Median Correct Latency Delayed: The median latency for a subject to correctly select the appropriate pattern during the delayed forced-choice condition, measured in milliseconds.",2
73
+ PRM,PRM Recommended Standard 18 Extended,PRMMDCLI,"PRM Median Correct Latency Immediate: The median latency for a subject to correctly select the appropriate pattern during the immediate forced-choice condition, measured in milliseconds.",2
74
+ PRM,PRM Recommended Standard 18 Extended,PRMPCD,"KEY: PRM Percent Correct Delayed: The number of correct patterns selected by the subject in the delayed forced-choice condition, expressed as a percentage.",2
75
+ PRM,PRM Recommended Standard 18 Extended,PRMPCI,"KEY: PRM Percent Correct Immediate: The number of correct patterns selected by the subject in the immediate forced-choice condition, expressed as a percentage.",2
76
+ PRM,PRM Recommended Standard 18 Extended,PRMTSDSP,PRM Time Since Delayed Stimuli Presentation: The length of time between the end of the stimuli presentation for the delayed phase and the start of the delayed forced-choice condition.,2
77
+ RVP,RVP 3 Targets,RVPA,"KEY: RVP A′: A′ (A prime) is the signal detection measure of a subject's sensitivity to the target sequence (string of three numbers), regardless of response tendency (the expected range is 0.00 to 1.00; bad to good). In essence, this metric is a measure of how good the subject is at detecting target sequences.",4
78
+ RVP,RVP 3 Targets,RVPLSD,RVP Response Latency (SD): The standard deviation of response latency on trials where the subject responded correctly. Calculated across all assessed trials.,4
79
+ RVP,RVP 3 Targets,RVPMDL,KEY: RVP Median Response Latency: The median response latency on trials where the subject responded correctly. Calculated across all assessed trials.,4
80
+ RVP,RVP 3 Targets,RVPML,RVP Mean Response Latency: The mean response latency on trials where the subject responded correctly. Calculated across all assessed trials.,4
81
+ RVP,RVP 3 Targets,RVPPFA,KEY: RVP Probability of False Alarm: The number of sequence presentations that were false alarms divided by the number of sequence presentations that were false alarms plus the number of sequence presentations that were correct rejections: (False Alarms ÷ (False Alarms + Correct Rejections)),4
82
+ RVP,RVP 3 Targets,RVPPH,"RVP Probability of Hit: The number of target sequences during assessment blocks that were correctly responded to within the time allowed, divided by the number of target sequences during assessment blocks (Correct hits ÷ total number of sequences)",4
83
+ RVP,RVP 3 Targets,RVPTFA,RVP Total False Alarms: The total number of stimulus presentations during assessment blocks that were false alarms.,0
84
+ RVP,RVP 3 Targets,RVPTH,RVP Total Hits: The total number of target sequences that were correctly responded to (Correct Hits) within the allowed time during assessment sequence blocks.,0
85
+ RVP,RVP 3 Targets,RVPTM,RVP Total Misses: The total number of target sequences that were not responded to within the allowed time during assessment sequence blocks.,0
86
+ SWM,SWM Recommended Standard 2.0 Extended,SWMBE12,KEY: SWM Between errors 12 boxes: The number of times the subject revisits a box in which a token has previously been found. Calculated across all trials with 12 tokens only.,0
87
+ SWM,SWM Recommended Standard 2.0 Extended,SWMBE4,KEY: SWM Between errors 4 boxes: The number of times a subject revisits a box in which a token has previously been found. Calculated across all trials with 4 tokens only.,0
88
+ SWM,SWM Recommended Standard 2.0 Extended,SWMBE468,"KEY: SWM Between Errors: The number of times the subject incorrectly revisits a box in which a token has previously been found. Calculated across all assessed four, six and eight token trials.",0
89
+ SWM,SWM Recommended Standard 2.0 Extended,SWMBE6,KEY: SWM Between errors 6 boxes: The number of times the subject revisits a box in which a token has previously been found. Calculated across all trials with 6 tokens only.,0
90
+ SWM,SWM Recommended Standard 2.0 Extended,SWMBE8,KEY: SWM Between errors 8 boxes: The number of times the subject revisits a box in which a token has previously been found. Calculated across all trials with 8 tokens only.,0
91
+ SWM,SWM Recommended Standard 2.0 Extended,SWMDE12,SWM Double errors 12 boxes: The number of times a subject commits an error that is both a within error and a between error. Calculated across all trials with 12 tokens only.,0
92
+ SWM,SWM Recommended Standard 2.0 Extended,SWMDE4,SWM Double errors 4 boxes: The number of times a subject commits an error that is both a within error and a between error. Calculated across all trials with 4 tokens only.,0
93
+ SWM,SWM Recommended Standard 2.0 Extended,SWMDE468,"SWM Double Errors: The number of times a subject commits an error that is both a within error and a between error. Calculated across all assessed four, six and eight token trials.",0
94
+ SWM,SWM Recommended Standard 2.0 Extended,SWMDE6,SWM Double errors 6 boxes: The number of times a subject commits an error that is both a within error and a between error. Calculated across all trials with 6 tokens only.,0
95
+ SWM,SWM Recommended Standard 2.0 Extended,SWMDE8,SWM Double errors 8 boxes: The number of times a subject commits an error that is both a within error and a between error. Calculated across all trials with 8 tokens only.,0
96
+ SWM,SWM Recommended Standard 2.0 Extended,SWMPR,"SWM Problem Reached: This measure reports the problem number that the subject reached, but did not necessarily complete.",0
97
+ SWM,SWM Recommended Standard 2.0 Extended,SWMS,"KEY: SWM Strategy (6-8 boxes): The number of times a subject begins a new search pattern from the same box they started with previously. If they always begin a search from the same starting point we infer that the subject is employing a planned strategy for finding the tokens. Therefore a low score indicates high strategy use (1 = they always begin the search from the same box), a high score indicates that they are beginning their searches from many different boxes. Calculated across assessed trials with 6 tokens or 8 tokens.",0
98
+ SWM,SWM Recommended Standard 2.0 Extended,SWMS6,"SWM Strategy (6 box only): This measure computes the strategy score for the 6 box stage of the task only. The strategy score is calculated based on the number of times a subject begins a new search pattern from the same box they started with previously. If they always begin a search from the same starting point we infer that the subject is employing a planned strategy for finding the tokens. Therefore a low score indicates high strategy use (1 = they always begin the search from the same box), a high score indicates that they are beginning their searches from many different boxes.",0
99
+ SWM,SWM Recommended Standard 2.0 Extended,SWMSX,"SWM Strategy (6-12 boxes): The number of times a subject begins a new search pattern from the same box they started with previously. If they always begin a search from the same starting point we infer that the subject is employing a planned strategy for finding the tokens. Therefore a low score indicates high strategy use (1 = they always begin the search from the same box), a high score indicates that they are beginning their searches from many different boxes. Calculated across assessed trials with 6 tokens or more.",0
100
+ SWM,SWM Recommended Standard 2.0 Extended,SWMTE12,"SWM Total errors 12 boxes: The number of times a box is selected that is certain not to contain a token and therefore should not have been visited by the subject, i.e. between errors + within errors - double errors. Calculated across all trials with 12 tokens only.",0
101
+ SWM,SWM Recommended Standard 2.0 Extended,SWMTE4,"SWM Total errors 4 boxes: The number of times a box is selected that is certain not to contain a token and therefore should not have been visited by the subject, i.e. between errors + within errors - double errors. Calculated across all trials with 4 tokens only.",0
102
+ SWM,SWM Recommended Standard 2.0 Extended,SWMTE468,"SWM Total Errors: The total number of times a box is selected that is certain not to contain a token and therefore should not have been visited by the subject, i.e. between errors + within errors - double errors. Calculated across all assessed four, six and eight token trials.",0
103
+ SWM,SWM Recommended Standard 2.0 Extended,SWMTE6,"SWM Total errors 6 boxes: The number of times a box is selected that is certain not to contain a token and therefore should not have been visited by the subject, i.e. between errors + within errors - double errors. Calculated across all trials with 6 tokens only.",0
104
+ SWM,SWM Recommended Standard 2.0 Extended,SWMTE8,"SWM Total errors 8 boxes: The number of times a box is selected that is certain not to contain a token and therefore should not have been visited by the subject, i.e. between errors + within errors - double errors. Calculated across all trials with 8 tokens only.",0
105
+ SWM,SWM Recommended Standard 2.0 Extended,SWMWE12,SWM Within errors 12 boxes: The number of times a subject revisits a box already found to be empty during the same search. Calculated across all trials with 12 tokens only.,0
106
+ SWM,SWM Recommended Standard 2.0 Extended,SWMWE4,SWM Within errors 4 boxes: The number of times a subject revisits a box already found to be empty during the same search. Calculated across all trials with 4 tokens only.,0
107
+ SWM,SWM Recommended Standard 2.0 Extended,SWMWE468,"SWM Within Errors: The number of times a subject revisits a box already shown to be empty during the same search. Calculated across all assessed four, six and eight token trials.",0
108
+ SWM,SWM Recommended Standard 2.0 Extended,SWMWE6,SWM Within errors 6 boxes: The number of times a subject revisits a box already found to be empty during the same search. Calculated across all trials with 6 tokens only.,0
109
+ SWM,SWM Recommended Standard 2.0 Extended,SWMWE8,SWM Within errors 8 boxes: The number of times a subject revisits a box already found to be empty during the same search. Calculated across all trials with 8 tokens only.,0
data/dictionary_harmonized_categories.csv ADDED
@@ -0,0 +1,571 @@
1
+ "","variable_codename_use","variable_description_use","harmonized_categories","harmonized_categories_description","in_dataset","corrected_codename_2day_separate"
2
+ "1","DMDBORN4","In what country {were you/was SP} born?","1","Born in 50 US states or Washington, DC","Demographics",NA
3
+ "2","DMDBORN4","In what country {were you/was SP} born?","2","Others","Demographics",NA
4
+ "3","DMDBORN4","In what country {were you/was SP} born?","77","Refused","Demographics",NA
5
+ "4","DMDBORN4","In what country {were you/was SP} born?","99","Don't Know","Demographics",NA
6
+ "5","DMDEDUC2","Education level - Adults 20+","1","Less Than 9th Grade","Demographics",NA
7
+ "6","DMDEDUC2","Education level - Adults 20+","2","9-11th Grade (Includes 12th grade with no diploma)","Demographics",NA
8
+ "7","DMDEDUC2","Education level - Adults 20+","3","High School Grad/GED or Equivalent","Demographics",NA
9
+ "8","DMDEDUC2","Education level - Adults 20+","4","Some College or AA degree","Demographics",NA
10
+ "9","DMDEDUC2","Education level - Adults 20+","5","College Graduate or above","Demographics",NA
11
+ "10","DMDEDUC2","Education level - Adults 20+","7","Refused","Demographics",NA
12
+ "11","DMDEDUC2","Education level - Adults 20+","9","Don't know","Demographics",NA
13
+ "12","DMDEDUC3","Education level - Children/Youth 6-19","0","Never Attended / Kindergarten Only","Demographics",NA
14
+ "13","DMDEDUC3","Education level - Children/Youth 6-19","1","1st Grade","Demographics",NA
15
+ "14","DMDEDUC3","Education level - Children/Youth 6-19","2","2nd Grade","Demographics",NA
16
+ "15","DMDEDUC3","Education level - Children/Youth 6-19","3","3rd Grade","Demographics",NA
17
+ "16","DMDEDUC3","Education level - Children/Youth 6-19","4","4th Grade","Demographics",NA
18
+ "17","DMDEDUC3","Education level - Children/Youth 6-19","5","5th Grade","Demographics",NA
19
+ "18","DMDEDUC3","Education level - Children/Youth 6-19","6","6th Grade","Demographics",NA
20
+ "19","DMDEDUC3","Education level - Children/Youth 6-19","7","7th Grade","Demographics",NA
21
+ "20","DMDEDUC3","Education level - Children/Youth 6-19","8","8th Grade","Demographics",NA
22
+ "21","DMDEDUC3","Education level - Children/Youth 6-19","9","9th Grade","Demographics",NA
23
+ "22","DMDEDUC3","Education level - Children/Youth 6-19","10","10th Grade","Demographics",NA
24
+ "23","DMDEDUC3","Education level - Children/Youth 6-19","11","11th Grade","Demographics",NA
25
+ "24","DMDEDUC3","Education level - Children/Youth 6-19","12","12th Grade, No Diploma","Demographics",NA
26
+ "25","DMDEDUC3","Education level - Children/Youth 6-19","13","High School Graduate","Demographics",NA
27
+ "26","DMDEDUC3","Education level - Children/Youth 6-19","14","GED or Equivalent","Demographics",NA
28
+ "27","DMDEDUC3","Education level - Children/Youth 6-19","15","More than high school","Demographics",NA
29
+ "28","DMDEDUC3","Education level - Children/Youth 6-19","55","Less Than 5th Grade","Demographics",NA
30
+ "29","DMDEDUC3","Education level - Children/Youth 6-19","66","Less Than 9th Grade","Demographics",NA
31
+ "30","DMDEDUC3","Education level - Children/Youth 6-19","77","Refused","Demographics",NA
32
+ "31","DMDEDUC3","Education level - Children/Youth 6-19","99","Don't know","Demographics",NA
33
+ "32","DMDFMSIZ","Total number of people in the Family","1","1","Demographics",NA
34
+ "33","DMDFMSIZ","Total number of people in the Family","2","2","Demographics",NA
35
+ "34","DMDFMSIZ","Total number of people in the Family","3","3","Demographics",NA
36
+ "35","DMDFMSIZ","Total number of people in the Family","4","4","Demographics",NA
37
+ "36","DMDFMSIZ","Total number of people in the Family","5","5","Demographics",NA
38
+ "37","DMDFMSIZ","Total number of people in the Family","6","6","Demographics",NA
39
+ "38","DMDFMSIZ","Total number of people in the Family","7","7 or more people in the Family","Demographics",NA
40
+ "39","DMDHHSIZ","Total number of people in the Household","1","1","Demographics",NA
41
+ "40","DMDHHSIZ","Total number of people in the Household","2","2","Demographics",NA
42
+ "41","DMDHHSIZ","Total number of people in the Household","3","3","Demographics",NA
43
+ "42","DMDHHSIZ","Total number of people in the Household","4","4","Demographics",NA
44
+ "43","DMDHHSIZ","Total number of people in the Household","5","5","Demographics",NA
45
+ "44","DMDHHSIZ","Total number of people in the Household","6","6","Demographics",NA
46
+ "45","DMDHHSIZ","Total number of people in the Household","7","7 or more people in the Household","Demographics",NA
47
+ "46","DMDHRAGE","Age in years of the household reference person at the time of HH screening.","1","<20 years","Demographics",NA
48
+ "47","DMDHRAGE","Age in years of the household reference person at the time of HH screening.","2","20-39 years","Demographics",NA
49
+ "48","DMDHRAGE","Age in years of the household reference person at the time of HH screening.","3","40-59 years","Demographics",NA
50
+ "49","DMDHRAGE","Age in years of the household reference person at the time of HH screening.","4","60+ years","Demographics",NA
51
+ "50","DMDYRSUS","Length of time the participant has been in the US.","1","Less than 1 year","Demographics",NA
52
+ "51","DMDYRSUS","Length of time the participant has been in the US.","2","1 yr., less than 5 yrs.","Demographics",NA
53
+ "52","DMDYRSUS","Length of time the participant has been in the US.","3","5 yrs., less than 10 yrs.","Demographics",NA
54
+ "53","DMDYRSUS","Length of time the participant has been in the US.","4","10 yrs., less than 15 yrs.","Demographics",NA
55
+ "54","DMDHRBR4","HH reference person's country of birth","1","Born in 50 US states or Washington, DC","Demographics",NA
56
+ "55","DMDHRBR4","HH reference person's country of birth","2","Others","Demographics",NA
57
+ "56","DMDHRBR4","HH reference person's country of birth","77","Refused","Demographics",NA
58
+ "57","DMDHRBR4","HH reference person's country of birth","99","Don't Know","Demographics",NA
59
+ "58","DMDHREDU","HH reference person's education level","1","Less than high school degree","Demographics",NA
60
+ "59","DMDHREDU","HH reference person's education level","2","High school grad/GED or some college/AA degree","Demographics",NA
61
+ "60","DMDHREDU","HH reference person's education level","3","College graduate or above","Demographics",NA
62
+ "61","DMDHREDU","HH reference person's education level","7","Refused","Demographics",NA
63
+ "62","DMDHREDU","HH reference person's education level","9","Don't know","Demographics",NA
64
+ "63","DMDHREDU","HH reference person's education level","9","Don't Know","Demographics",NA
65
+ "64","DMDHREDU","HH reference person's education level","3","High school grad/GED or some college/AA degree","Demographics",NA
66
+ "65","DMDHRGND","Gender of the household reference person","1","Male","Demographics",NA
67
+ "66","DMDHRGND","Gender of the household reference person","2","Female","Demographics",NA
68
+ "67","DMDHRMAR","Marital Status of household reference person","1","Married/Living with partner","Demographics",NA
69
+ "68","DMDHRMAR","Marital Status of household reference person","2","Widowed/Divorced/Separated","Demographics",NA
70
+ "69","DMDHRMAR","Marital Status of household reference person","3","Never Married","Demographics",NA
71
+ "70","DMDHRMAR","Marital Status of household reference person","77","Refused","Demographics",NA
72
+ "71","DMDHRMAR","Marital Status of household reference person","99","Don't Know","Demographics",NA
73
+ "72","DMDHSEDU","HH reference person's spouse's education level","1","Less than high school degree","Demographics",NA
74
+ "73","DMDHSEDU","HH reference person's spouse's education level","2","High school grad/GED or some college/AA degree","Demographics",NA
75
+ "74","DMDHSEDU","HH reference person's spouse's education level","3","College graduate or above","Demographics",NA
76
+ "75","DMDHSEDU","HH reference person's spouse's education level","7","Refused","Demographics",NA
77
+ "76","DMDHSEDU","HH reference person's spouse's education level","9","Don't Know","Demographics",NA
78
+ "77","DMDMARTL","Marital status","1","Married","Demographics",NA
79
+ "78","DMDMARTL","Marital status","2","Widowed","Demographics",NA
80
+ "79","DMDMARTL","Marital status","3","Divorced","Demographics",NA
81
+ "80","DMDMARTL","Marital status","4","Separated","Demographics",NA
82
+ "81","DMDMARTL","Marital status","5","Never married","Demographics",NA
83
+ "82","DMDMARTL","Marital status","6","Living with partner","Demographics",NA
84
+ "83","DMDMARTL","Marital status","77","Refused","Demographics",NA
85
+ "84","DMDMARTL","Marital status","99","Don't know","Demographics",NA
86
+ "85","DMDYRSUS","Length of time the participant has been in the US.","5","15 yrs., less than 20 yrs.","Demographics",NA
87
+ "86","DMDYRSUS","Length of time the participant has been in the US.","6","20 yrs., less than 30 yrs.","Demographics",NA
88
+ "87","DMDYRSUS","Length of time the participant has been in the US.","7","30 yrs., less than 40 yrs.","Demographics",NA
89
+ "88","DMDYRSUS","Length of time the participant has been in the US.","8","40 yrs., less than 50 yrs.","Demographics",NA
90
+ "89","DMDYRSUS","Length of time the participant has been in the US.","9","50 years or more","Demographics",NA
91
+ "90","DMDYRSUS","Length of time the participant has been in the US.","77","Refused","Demographics",NA
92
+ "91","DMDYRSUS","Length of time the participant has been in the US.","99","Don't know","Demographics",NA
93
+ "92","FIALANG","Language of the Family Interview Instrument","1","English","Demographics",NA
94
+ "93","FIALANG","Language of the Family Interview Instrument","2","Spanish","Demographics",NA
95
+ "94","FIALANG","Language of the Family Interview Instrument","3","Other","Demographics",NA
96
+ "95","INDFMIN2","Total family income (reported as a range value in dollars)","1","$ 0 to $ 4,999","Demographics",NA
97
+ "96","INDFMIN2","Total family income (reported as a range value in dollars)","2","$ 5,000 to $ 9,999","Demographics",NA
98
+ "97","INDFMIN2","Total family income (reported as a range value in dollars)","3","$10,000 to $14,999","Demographics",NA
99
+ "98","INDFMIN2","Total family income (reported as a range value in dollars)","4","$15,000 to $19,999","Demographics",NA
100
+ "99","INDFMIN2","Total family income (reported as a range value in dollars)","5","$20,000 to $24,999","Demographics",NA
101
+ "100","INDFMIN2","Total family income (reported as a range value in dollars)","6","$25,000 to $34,999","Demographics",NA
102
+ "101","INDFMIN2","Total family income (reported as a range value in dollars)","7","$35,000 to $44,999","Demographics",NA
103
+ "102","INDFMIN2","Total family income (reported as a range value in dollars)","8","$45,000 to $54,999","Demographics",NA
104
+ "103","INDFMIN2","Total family income (reported as a range value in dollars)","16","$50,000 and over","Demographics",NA
105
+ "104","INDFMIN2","Total family income (reported as a range value in dollars)","99","Don't know","Demographics",NA
106
+ "105","INDFMIN2","Total family income (reported as a range value in dollars)","9","$55,000 to $64,999","Demographics",NA
107
+ "106","INDFMIN2","Total family income (reported as a range value in dollars)","10","$65,000 to $74,999","Demographics",NA
108
+ "107","INDFMIN2","Total family income (reported as a range value in dollars)","12","$20,000 and Over","Demographics",NA
109
+ "108","INDFMIN2","Total family income (reported as a range value in dollars)","13","Under $20,000","Demographics",NA
110
+ "109","INDFMIN2","Total family income (reported as a range value in dollars)","14","$75,000 to $99,999","Demographics",NA
111
+ "110","INDFMIN2","Total family income (reported as a range value in dollars)","15","$100,000 and Over","Demographics",NA
112
+ "111","INDFMIN2","Total family income (reported as a range value in dollars)","77","Refused","Demographics",NA
113
+ "112","INDFMIN2","Total family income (reported as a range value in dollars)","11","$75,000 and Over","Demographics",NA
114
+ "113","RIDRETH1","Recode of reported race and Hispanic origin information","1","Mexican American","Demographics",NA
115
+ "114","RIDRETH1","Recode of reported race and Hispanic origin information","3","Non-Hispanic White","Demographics",NA
116
+ "115","RIDRETH1","Recode of reported race and Hispanic origin information","4","Non-Hispanic Black","Demographics",NA
117
+ "116","RIDRETH1","Recode of reported race and Hispanic origin information","5","Other Race - Including Multi-Racial","Demographics",NA
118
+ "117","RIDRETH1","Recode of reported race and Hispanic origin information","2","Other Hispanic","Demographics",NA
119
+ "118","RIDSTATR","Interview and Examination Status of the Sample Person.","1","Interviewed Only","Demographics",NA
120
+ "119","RIDSTATR","Interview and Examination Status of the Sample Person.","2","Both Interviewed and MEC examined","Demographics",NA
121
+ "120","MCD180B","Age when told you had congestive heart failure","16","16 years or younger","Questionnaire",NA
122
+ "121","MCD180B","Age when told you had congestive heart failure","17-79","17-79 years old","Questionnaire",NA
123
+ "122","MCD180B","Age when told you had congestive heart failure","17-84","17-84 years old","Questionnaire",NA
124
+ "123","MCD180B","Age when told you had congestive heart failure","17-89","17-89 years old","Questionnaire",NA
125
+ "124","MCD180B","Age when told you had congestive heart failure","18-79","18-79 years old","Questionnaire",NA
126
+ "125","MCD180B","Age when told you had congestive heart failure","80","80 years or older","Questionnaire",NA
127
+ "126","MCD180B","Age when told you had congestive heart failure","85","85 years or older","Questionnaire",NA
128
+ "127","MCD180B","Age when told you had congestive heart failure","90","90 + years","Questionnaire",NA
129
+ "128","MCD180B","Age when told you had congestive heart failure","99999","Don't know","Questionnaire",NA
130
+ "129","MCD180B","Age when told you had congestive heart failure","77777","Refused","Questionnaire",NA
131
+ "130","MCD180C","Age when told had coronary heart disease","16","16 years or younger","Questionnaire",NA
132
+ "131","MCD180C","Age when told had coronary heart disease","17-79","17-79 years old","Questionnaire",NA
133
+ "132","MCD180C","Age when told had coronary heart disease","17-84","17-84 years old","Questionnaire",NA
134
+ "133","MCD180C","Age when told had coronary heart disease","20-79","20-79 years old","Questionnaire",NA
135
+ "134","MCD180C","Age when told had coronary heart disease","80","80 years or older","Questionnaire",NA
136
+ "135","MCD180C","Age when told had coronary heart disease","85","85 years or older","Questionnaire",NA
137
+ "136","MCD180C","Age when told had coronary heart disease","99999","Don't know","Questionnaire",NA
138
+ "137","MCD180C","Age when told had coronary heart disease","77777","Refused","Questionnaire",NA
139
+ "138","MCD180D","Age when told you had angina pectoris","16","16 years or younger","Questionnaire",NA
140
+ "139","MCD180D","Age when told you had angina pectoris","17-84","17-84 years old","Questionnaire",NA
141
+ "140","MCD180D","Age when told you had angina pectoris","85","85 years or older","Questionnaire",NA
142
+ "141","MCD180D","Age when told you had angina pectoris","99999","Don't know","Questionnaire",NA
143
+ "142","MCD180D","Age when told you had angina pectoris","77777","Refused","Questionnaire",NA
144
+ "143","MCD180D","Age when told you had angina pectoris","17-79","17-79 years old","Questionnaire",NA
145
+ "144","MCD180D","Age when told you had angina pectoris","20-79","20-79 years old","Questionnaire",NA
146
+ "145","MCD180D","Age when told you had angina pectoris","80","80 years or older","Questionnaire",NA
147
+ "146","MCD180E","Age when told you had heart attack","16","16 years or younger","Questionnaire",NA
148
+ "147","MCD180E","Age when told you had heart attack","17-79","17-79 years old","Questionnaire",NA
149
+ "148","MCD180E","Age when told you had heart attack","17-84","17-84 years old","Questionnaire",NA
150
+ "149","MCD180E","Age when told you had heart attack","17-88","17-88 years old","Questionnaire",NA
151
+ "150","MCD180E","Age when told you had heart attack","19-79","19-79 years old","Questionnaire",NA
152
+ "151","MCD180E","Age when told you had heart attack","80","80 years or older","Questionnaire",NA
153
+ "152","MCD180E","Age when told you had heart attack","85","85 years or older","Questionnaire",NA
154
+ "153","MCD180E","Age when told you had heart attack","90","90 + years","Questionnaire",NA
155
+ "154","MCD180E","Age when told you had heart attack","99999","Don't know","Questionnaire",NA
156
+ "155","MCD180E","Age when told you had heart attack","77777","Refused","Questionnaire",NA
157
+ "156","MCD180F","Age when told you had a stroke","16","16 years or younger","Questionnaire",NA
158
+ "157","MCD180F","Age when told you had a stroke","17-79","17-79 years old","Questionnaire",NA
159
+ "158","MCD180F","Age when told you had a stroke","17-84","17-84 years old","Questionnaire",NA
160
+ "159","MCD180F","Age when told you had a stroke","17-89","17-89 years old","Questionnaire",NA
161
+ "160","MCD180F","Age when told you had a stroke","80","80 years or older","Questionnaire",NA
162
+ "161","MCD180F","Age when told you had a stroke","85","85 years or older","Questionnaire",NA
163
+ "162","MCD180F","Age when told you had a stroke","90","90 + years","Questionnaire",NA
164
+ "163","MCD180F","Age when told you had a stroke","99999","Don't know","Questionnaire",NA
165
+ "164","MCD180F","Age when told you had a stroke","77777","Refused","Questionnaire",NA
166
+ "165","MCD180G","Age when told you had emphysema","16","16 years or younger","Questionnaire",NA
+ "166","MCD180G","Age when told you had emphysema","17-79","17-79 years old","Questionnaire",NA
+ "167","MCD180G","Age when told you had emphysema","17-84","17-84 years old","Questionnaire",NA
+ "168","MCD180G","Age when told you had emphysema","17-89","17-89 years old","Questionnaire",NA
+ "169","MCD180G","Age when told you had emphysema","80","80 years or older","Questionnaire",NA
+ "170","MCD180G","Age when told you had emphysema","85","85 years or older","Questionnaire",NA
+ "171","MCD180G","Age when told you had emphysema","90","90 + years","Questionnaire",NA
+ "172","MCD180G","Age when told you had emphysema","99999","Don't know","Questionnaire",NA
+ "173","MCD180G","Age when told you had emphysema","77777","Refused","Questionnaire",NA
+ "174","MCD180K","Age when told you had chronic bronchitis","16","16 years or younger","Questionnaire",NA
+ "175","MCD180K","Age when told you had chronic bronchitis","17-79","17-79 years old","Questionnaire",NA
+ "176","MCD180K","Age when told you had chronic bronchitis","17-83","17-83 years old","Questionnaire",NA
+ "177","MCD180K","Age when told you had chronic bronchitis","17-89","17-89 years old","Questionnaire",NA
+ "178","MCD180K","Age when told you had chronic bronchitis","80","80 years or older","Questionnaire",NA
+ "179","MCD180K","Age when told you had chronic bronchitis","85","85 years or older","Questionnaire",NA
+ "180","MCD180K","Age when told you had chronic bronchitis","90","90 + years","Questionnaire",NA
+ "181","MCD180K","Age when told you had chronic bronchitis","99999","Don't know","Questionnaire",NA
+ "182","MCD180K","Age when told you had chronic bronchitis","77777","Refused","Questionnaire",NA
+ "183","MCD180L","Age when told you had a liver condition","16","16 years or younger","Questionnaire",NA
+ "184","MCD180L","Age when told you had a liver condition","17-78","17-78 years old","Questionnaire",NA
+ "185","MCD180L","Age when told you had a liver condition","17-79","17-79 years old","Questionnaire",NA
+ "186","MCD180L","Age when told you had a liver condition","17-83","17-83 years old","Questionnaire",NA
+ "187","MCD180L","Age when told you had a liver condition","80","80 years or older","Questionnaire",NA
+ "188","MCD180L","Age when told you had a liver condition","85","85 years or older","Questionnaire",NA
+ "189","MCD180L","Age when told you had a liver condition","99999","Don't know","Questionnaire",NA
+ "190","MCD180L","Age when told you had a liver condition","77777","Refused","Questionnaire",NA
+ "191","MCQ180H","Age when told you had a goiter","16","16 years or younger","Questionnaire",NA
+ "192","MCQ180H","Age when told you had a goiter","17-84","17-84 years old","Questionnaire",NA
+ "193","MCQ180H","Age when told you had a goiter","90","90 + years","Questionnaire",NA
+ "194","MCQ180H","Age when told you had a goiter","99999","Don't know","Questionnaire",NA
+ "195","MCD180M","Age when told you had thyroid problem","17-89","17-89 years old","Questionnaire",NA
+ "196","MCD180M","Age when told you had thyroid problem","16","16 years or younger","Questionnaire",NA
+ "197","MCD180M","Age when told you had thyroid problem","99999","Don't know","Questionnaire",NA
+ "198","MCD180M","Age when told you had thyroid problem","17-84","17-84 years old","Questionnaire",NA
+ "199","MCD180M","Age when told you had thyroid problem","80","80 years or older","Questionnaire",NA
+ "200","MCD180M","Age when told you had thyroid problem","85","85 years or older","Questionnaire",NA
+ "201","MCD180M","Age when told you had thyroid problem","77777","Refused","Questionnaire",NA
+ "202","MCD180M","Age when told you had thyroid problem","17-79","17-79 years old","Questionnaire",NA
+ "203","MCD180N","Age when told you had gout","16","16 years or younger","Questionnaire",NA
+ "204","MCD180N","Age when told you had gout","17-79","17-79 years old","Questionnaire",NA
+ "205","MCD180N","Age when told you had gout","17-86","17-86 years old","Questionnaire",NA
+ "206","MCD180N","Age when told you had gout","80","80 years or older","Questionnaire",NA
+ "207","MCD180N","Age when told you had gout","99999","Don't know","Questionnaire",NA
+ "208","MCD180N","Age when told you had gout","77777","Refused","Questionnaire",NA
+ "209","MCQ025","Age when first had asthma","1-19","1-19 years old","Questionnaire",NA
+ "210","MCQ025","Age when first had asthma","1-79","1-79 years old","Questionnaire",NA
+ "211","MCQ025","Age when first had asthma","1-84","1-84 years old","Questionnaire",NA
+ "212","MCQ025","Age when first had asthma","1-88","1-88 years old","Questionnaire",NA
+ "213","MCQ025","Age when first had asthma","80","80 years or older","Questionnaire",NA
+ "214","MCQ025","Age when first had asthma","85","85 years or older","Questionnaire",NA
+ "215","MCQ025","Age when first had asthma","99999","Don't know","Questionnaire",NA
+ "216","MCQ025","Age when first had asthma","1","Less than 1 year","Questionnaire",NA
+ "217","MCQ025","Age when first had asthma","77777","Refused","Questionnaire",NA
+ "218","MCD180A","Age when told you had arthritis","16","16 years or younger","Questionnaire",NA
+ "219","MCD180A","Age when told you had arthritis","17-89","17-89 years old","Questionnaire",NA
+ "220","MCD180A","Age when told you had arthritis","90","90 + years","Questionnaire",NA
+ "221","MCD180A","Age when told you had arthritis","99999","Don't know","Questionnaire",NA
+ "222","MCD180A","Age when told you had arthritis","17-79","17-79 years old","Questionnaire",NA
+ "223","MCD180A","Age when told you had arthritis","80","80 years or older","Questionnaire",NA
+ "224","MCD180A","Age when told you had arthritis","77777","Refused","Questionnaire",NA
+ "225","MCD180A","Age when told you had arthritis","17-84","17-84 years old","Questionnaire",NA
+ "226","MCD180A","Age when told you had arthritis","85","85 years or older","Questionnaire",NA
+ "227","MCQ180H","Age when told you had a goiter","17-72","17-72 years old","Questionnaire",NA
+ "228","MCQ180H","Age when told you had a goiter","85","85 years or older","Questionnaire",NA
+ "229","MCQ180H","Age when told you had a goiter","77777","Refused","Questionnaire",NA
+ "230","MCQ195","Which type of arthritis was it","9","Don't know","Questionnaire",NA
+ "231","MCQ195","Which type of arthritis was it","2","Osteoarthritis or degenerative arthritis","Questionnaire",NA
+ "232","MCQ195","Which type of arthritis was it","4","Other","Questionnaire",NA
+ "233","MCQ195","Which type of arthritis was it","3","Psoriatic arthritis","Questionnaire",NA
+ "234","MCQ195","Which type of arthritis was it","7","Refused","Questionnaire",NA
+ "235","MCQ195","Which type of arthritis was it","1","Rheumatoid arthritis","Questionnaire",NA
+ "236","MCQ240A","Age when bladder cancer first diagnosed","17-78","17-78 years old","Questionnaire",NA
+ "237","MCQ240A","Age when bladder cancer first diagnosed","17-83","17-83 years old","Questionnaire",NA
+ "238","MCQ240A","Age when bladder cancer first diagnosed","16","16 years or younger","Questionnaire",NA
+ "239","MCQ240A","Age when bladder cancer first diagnosed","80","80 years or older","Questionnaire",NA
+ "240","MCQ240A","Age when bladder cancer first diagnosed","85","85 years or older","Questionnaire",NA
+ "241","MCQ240A","Age when bladder cancer first diagnosed","99999","Don't know","Questionnaire",NA
+ "242","MCQ240A","Age when bladder cancer first diagnosed","77777","Refused","Questionnaire",NA
+ "243","MCQ240B","Age when blood cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "244","MCQ240B","Age when blood cancer was first diagnosed","17-66","17-66 years old","Questionnaire",NA
+ "245","MCQ240B","Age when blood cancer was first diagnosed","17-70","17-70 years old","Questionnaire",NA
+ "246","MCQ240B","Age when blood cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "247","MCQ240B","Age when blood cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "248","MCQ240B","Age when blood cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "249","MCQ240B","Age when blood cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "250","MCQ240C","Age when bone cancer was first diagnosed","17-77","17-77 years old","Questionnaire",NA
+ "251","MCQ240C","Age when bone cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "252","MCQ240C","Age when bone cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "253","MCQ240C","Age when bone cancer was first diagnosed","55-76","55-76 years old","Questionnaire",NA
+ "254","MCQ240C","Age when bone cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "255","MCQ240C","Age when bone cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "256","MCQ240C","Age when bone cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "257","MCQ240C","Age when bone cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "258","MCQ240CC","Age when uterine cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "259","MCQ240CC","Age when uterine cancer was first diagnosed","17-77","17-77 years old","Questionnaire",NA
+ "260","MCQ240CC","Age when uterine cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "261","MCQ240CC","Age when uterine cancer was first diagnosed","20-72","20-72 years old","Questionnaire",NA
+ "262","MCQ240CC","Age when uterine cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "263","MCQ240CC","Age when uterine cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "264","MCQ240CC","Age when uterine cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "265","MCQ240CC","Age when uterine cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "266","MCQ240D","Age when brain cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "267","MCQ240D","Age when brain cancer was first diagnosed","17-73","17-73 years old","Questionnaire",NA
+ "268","MCQ240D","Age when brain cancer was first diagnosed","17-75","17-75 years old","Questionnaire",NA
+ "269","MCQ240D","Age when brain cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "270","MCQ240D","Age when brain cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "271","MCQ240D","Age when brain cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "272","MCQ240D","Age when brain cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "273","MCQ240DD","Age when some other type of cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "274","MCQ240DD","Age when some other type of cancer was first diagnosed","17-77","17-77 years old","Questionnaire",NA
+ "275","MCQ240DD","Age when some other type of cancer was first diagnosed","17-78","17-78 years old","Questionnaire",NA
+ "276","MCQ240DD","Age when some other type of cancer was first diagnosed","17-83","17-83 years old","Questionnaire",NA
+ "277","MCQ240DD","Age when some other type of cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "278","MCQ240DD","Age when some other type of cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "279","MCQ240DD","Age when some other type of cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "280","MCQ240DD","Age when some other type of cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "281","MCQ240DK","Age when cancer was first diagnosed","20-80","20-80 years old","Questionnaire",NA
+ "282","MCQ240DK","Age when cancer was first diagnosed","23-47","23-47 years old","Questionnaire",NA
+ "283","MCQ240DK","Age when cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "284","MCQ240DK","Age when cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "285","MCQ240DK","Age when cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "286","MCQ240DK","Age when cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "287","MCQ240E","Age when breast cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "288","MCQ240E","Age when breast cancer was first diagnosed","17-78","17-78 years old","Questionnaire",NA
+ "289","MCQ240E","Age when breast cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "290","MCQ240E","Age when breast cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "291","MCQ240E","Age when breast cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "292","MCQ240E","Age when breast cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "293","MCQ240E","Age when breast cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "294","MCQ240E","Age when breast cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "295","MCQ240F","Age when cervical cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "296","MCQ240F","Age when cervical cancer was first diagnosed","17-65","17-65 years old","Questionnaire",NA
+ "297","MCQ240F","Age when cervical cancer was first diagnosed","17-73","17-73 years old","Questionnaire",NA
+ "298","MCQ240F","Age when cervical cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "299","MCQ240F","Age when cervical cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "300","MCQ240F","Age when cervical cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "301","MCQ240F","Age when cervical cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "302","MCQ240G","Age when colon cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "303","MCQ240G","Age when colon cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "304","MCQ240G","Age when colon cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "305","MCQ240G","Age when colon cancer was first diagnosed","21-79","21-79 years old","Questionnaire",NA
+ "306","MCQ240G","Age when colon cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "307","MCQ240G","Age when colon cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "308","MCQ240G","Age when colon cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "309","MCQ240G","Age when colon cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "310","MCQ240L","Age when leukemia was first diagnosed","17-70","17-70 years old","Questionnaire",NA
+ "311","MCQ240L","Age when leukemia was first diagnosed","17-75","17-75 years old","Questionnaire",NA
+ "312","MCQ240L","Age when leukemia was first diagnosed","28-84","28-84 years old","Questionnaire",NA
+ "313","MCQ240L","Age when leukemia was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "314","MCQ240L","Age when leukemia was first diagnosed","80","80 years or older","Questionnaire",NA
+ "315","MCQ240L","Age when leukemia was first diagnosed","85","85 years or older","Questionnaire",NA
+ "316","MCQ240L","Age when leukemia was first diagnosed","99999","Don't know","Questionnaire",NA
+ "317","MCQ240L","Age when leukemia was first diagnosed","77777","Refused","Questionnaire",NA
+ "318","MCQ240N","Age when lung cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "319","MCQ240N","Age when lung cancer was first diagnosed","17-76","17-76 years old","Questionnaire",NA
+ "320","MCQ240N","Age when lung cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "321","MCQ240N","Age when lung cancer was first diagnosed","29-79","29-79 years old","Questionnaire",NA
+ "322","MCQ240N","Age when lung cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "323","MCQ240N","Age when lung cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "324","MCQ240N","Age when lung cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "325","MCQ240N","Age when lung cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "326","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "327","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","17-76","17-76 years old","Questionnaire",NA
+ "328","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","17-80","17-80 years old","Questionnaire",NA
+ "329","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","19-79","19-79 years old","Questionnaire",NA
+ "330","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","80","80 years or older","Questionnaire",NA
+ "331","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","85","85 years or older","Questionnaire",NA
+ "332","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","99999","Don't know","Questionnaire",NA
+ "333","MCQ240O","Age when lymphoma or Hodgkin's Disease was first diagnosed","77777","Refused","Questionnaire",NA
+ "334","MCQ240P","Age when melanoma was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "335","MCQ240P","Age when melanoma was first diagnosed","17-78","17-78 years old","Questionnaire",NA
+ "336","MCQ240P","Age when melanoma was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "337","MCQ240P","Age when melanoma was first diagnosed","17-83","17-83 years old","Questionnaire",NA
+ "338","MCQ240P","Age when melanoma was first diagnosed","80","80 years or older","Questionnaire",NA
+ "339","MCQ240P","Age when melanoma was first diagnosed","85","85 years or older","Questionnaire",NA
+ "340","MCQ240P","Age when melanoma was first diagnosed","99999","Don't know","Questionnaire",NA
+ "341","MCQ240P","Age when melanoma was first diagnosed","77777","Refused","Questionnaire",NA
+ "342","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "343","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "344","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","27-70","27-70 years old","Questionnaire",NA
+ "345","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","30-70","30-70 years old","Questionnaire",NA
+ "346","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "347","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "348","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "349","MCQ240Q","Age when mouth, tongue, or lip cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "350","MCQ240U","Age when prostate cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "351","MCQ240U","Age when prostate cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "352","MCQ240U","Age when prostate cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "353","MCQ240U","Age when prostate cancer was first diagnosed","32-79","32-79 years old","Questionnaire",NA
+ "354","MCQ240U","Age when prostate cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "355","MCQ240U","Age when prostate cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "356","MCQ240U","Age when prostate cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "357","MCQ240U","Age when prostate cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "358","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "359","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","17-78","17-78 years old","Questionnaire",NA
+ "360","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "361","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "362","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "363","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "364","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "365","MCQ240W","Age when non-melanoma skin cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "366","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "367","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "368","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","17-84","17-84 years old","Questionnaire",NA
+ "369","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","18-79","18-79 years old","Questionnaire",NA
+ "370","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "371","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "372","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "373","MCQ240X","Age when the unknown kind of skin cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "374","MCQ240Z","Age when stomach cancer was first diagnosed","16","16 years or younger","Questionnaire",NA
+ "375","MCQ240Z","Age when stomach cancer was first diagnosed","17-79","17-79 years old","Questionnaire",NA
+ "376","MCQ240Z","Age when stomach cancer was first diagnosed","22-82","22-82 years old","Questionnaire",NA
+ "377","MCQ240Z","Age when stomach cancer was first diagnosed","32-76","32-76 years old","Questionnaire",NA
+ "378","MCQ240Z","Age when stomach cancer was first diagnosed","80","80 years or older","Questionnaire",NA
+ "379","MCQ240Z","Age when stomach cancer was first diagnosed","85","85 years or older","Questionnaire",NA
+ "380","MCQ240Z","Age when stomach cancer was first diagnosed","99999","Don't know","Questionnaire",NA
+ "381","MCQ240Z","Age when stomach cancer was first diagnosed","77777","Refused","Questionnaire",NA
+ "382","MCQ280","About how old was she when she fractured her hip (the first time)?","1-101","1-101 years old","Questionnaire",NA
+ "383","MCQ280","About how old was she when she fractured her hip (the first time)?","555","50 +","Questionnaire",NA
+ "384","MCQ280","About how old was she when she fractured her hip (the first time)?","9-107","9-107 years old","Questionnaire",NA
+ "385","MCQ280","About how old was she when she fractured her hip (the first time)?","99999","Don't know","Questionnaire",NA
+ "386","MCQ280","About how old was she when she fractured her hip (the first time)?","77777","Refused","Questionnaire",NA
+ "387","MCQ280","About how old was she when she fractured her hip (the first time)?","444","Under 50","Questionnaire",NA
+ "388","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","16","16 years or younger","Questionnaire",NA
+ "389","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","17-79","17-79 years old","Questionnaire",NA
+ "390","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","17-85","17-85 years old","Questionnaire",NA
+ "391","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","80","80 years or older","Questionnaire",NA
+ "392","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","999","Don't know","Questionnaire",NA
+ "393","MCQ320","How old {were you/was SP} when {you/he} first had {your/his} PSA test?","777","Refused","Questionnaire",NA
+ "394","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","11-99","11-99 years old","Questionnaire",NA
+ "395","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","15-79","15-79 years old","Questionnaire",NA
+ "396","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","20-87","20-87 years old","Questionnaire",NA
+ "397","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","80","80 years or older","Questionnaire",NA
+ "398","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","90","90 + years","Questionnaire",NA
+ "399","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","99999","Don't know","Questionnaire",NA
+ "400","MCQ570","How old {were you/was SP} when {you /s/he} first had gallbladder surgery?","77777","Refused","Questionnaire",NA
+ "401","PAD120","[Over the past 30 days], how often did {you/SP} do these tasks in or around {your/his/her} home or yard, that is tasks requiring at least moderate effort? [Such as raking leaves, mowing the lawn or heavy cleaning.]","1-95","1-95 times","Questionnaire",NA
+ "402","PAD120","[Over the past 30 days], how often did {you/SP} do these tasks in or around {your/his/her} home or yard, that is tasks requiring at least moderate effort? [Such as raking leaves, mowing the lawn or heavy cleaning.]","1-99","1-99 times","Questionnaire",NA
+ "403","PAD120","[Over the past 30 days], how often did {you/SP} do these tasks in or around {your/his/her} home or yard, that is tasks requiring at least moderate effort? [Such as raking leaves, mowing the lawn or heavy cleaning.]","100","100 +","Questionnaire",NA
+ "404","PAD120","[Over the past 30 days], how often did {you/SP} do these tasks in or around {your/his/her} home or yard, that is tasks requiring at least moderate effort? [Such as raking leaves, mowing the lawn or heavy cleaning.]","99999","Don't know","Questionnaire",NA
+ "405","PAD120","[Over the past 30 days], how often did {you/SP} do these tasks in or around {your/his/her} home or yard, that is tasks requiring at least moderate effort? [Such as raking leaves, mowing the lawn or heavy cleaning.]","77777","Refused","Questionnaire",NA
+ "406","PAD460","[Over the past 30 days], how often did {you/SP} do these physical activities? [Activities designed to strengthen {your/his/her} muscles such as lifting weights, push-ups or sit-ups.]","1-91","1-91 times","Questionnaire",NA
+ "407","PAD460","[Over the past 30 days], how often did {you/SP} do these physical activities? [Activities designed to strengthen {your/his/her} muscles such as lifting weights, push-ups or sit-ups.]","1-99","1-99 times","Questionnaire",NA
+ "408","PAD460","[Over the past 30 days], how often did {you/SP} do these physical activities? [Activities designed to strengthen {your/his/her} muscles such as lifting weights, push-ups or sit-ups.]","100","100 +","Questionnaire",NA
+ "409","PAD460","[Over the past 30 days], how often did {you/SP} do these physical activities? [Activities designed to strengthen {your/his/her} muscles such as lifting weights, push-ups or sit-ups.]","999","Don't know","Questionnaire",NA
+ "410","PAD460","[Over the past 30 days], how often did {you/SP} do these physical activities? [Activities designed to strengthen {your/his/her} muscles such as lifting weights, push-ups or sit-ups.]","777","Refused","Questionnaire",NA
+ "411","PAQ050Q","[Over the past 30 days], how often did {you/SP} do this? [Walk or bicycle as part of getting to and from work, or school, or to do errands.] PROBE: How many times per day, per week, or per month did {you/s/he} do these activities?","1-91","1-91 times","Questionnaire",NA
+ "412","PAQ050Q","[Over the past 30 days], how often did {you/SP} do this? [Walk or bicycle as part of getting to and from work, or school, or to do errands.] PROBE: How many times per day, per week, or per month did {you/s/he} do these activities?","1-99","1-99 times","Questionnaire",NA
+ "413","PAQ050Q","[Over the past 30 days], how often did {you/SP} do this? [Walk or bicycle as part of getting to and from work, or school, or to do errands.] PROBE: How many times per day, per week, or per month did {you/s/he} do these activities?","100","100 +","Questionnaire",NA
+ "414","PAQ050Q","[Over the past 30 days], how often did {you/SP} do this? [Walk or bicycle as part of getting to and from work, or school, or to do errands.] PROBE: How many times per day, per week, or per month did {you/s/he} do these activities?","99999","Don't know","Questionnaire",NA
+ "415","PAQ050Q","[Over the past 30 days], how often did {you/SP} do this? [Walk or bicycle as part of getting to and from work, or school, or to do errands.] PROBE: How many times per day, per week, or per month did {you/s/he} do these activities?","77777","Refused","Questionnaire",NA
+ "416","BMIWAIST","Waist Circumference Comment","1","Could not obtain","Response",NA
+ "417",NA,NA,"1","Breakfast","Dietary","DR1.030Z"
+ "418",NA,NA,"2","Brunch","Dietary","DR1.030Z"
+ "419",NA,NA,"3","Lunch","Dietary","DR1.030Z"
+ "420",NA,NA,"4","Snack/beverage","Dietary","DR1.030Z"
+ "421",NA,NA,"5","Dinner/supper","Dietary","DR1.030Z"
+ "422",NA,NA,"6","Infant feeding","Dietary","DR1.030Z"
+ "423",NA,NA,"7","Extended consumption","Dietary","DR1.030Z"
+ "424",NA,NA,"8","Other","Dietary","DR1.030Z"
+ "425",NA,NA,"9","Desayuno (Spanish)","Dietary","DR1.030Z"
+ "426",NA,NA,"10","Almuerzo (Spanish)","Dietary","DR1.030Z"
+ "427",NA,NA,"11","Comida (Spanish)","Dietary","DR1.030Z"
+ "428",NA,NA,"12","Merienda (Spanish)","Dietary","DR1.030Z"
+ "429",NA,NA,"13","Cena (Spanish)","Dietary","DR1.030Z"
+ "430",NA,NA,"14","Entre comida/bebida (Spanish)","Dietary","DR1.030Z"
+ "431",NA,NA,"15","Bocadillo (Spanish)","Dietary","DR1.030Z"
+ "432",NA,NA,"16","Botana (Spanish)","Dietary","DR1.030Z"
+ "433",NA,NA,"99","Don't know","Dietary","DR1.030Z"
+ "434",NA,NA,"2","Lunch","Dietary","DR1.030Z"
+ "435",NA,NA,"3","Dinner/supper","Dietary","DR1.030Z"
+ "436",NA,NA,"5","Brunch","Dietary","DR1.030Z"
+ "437",NA,NA,"6","Snack/beverage","Dietary","DR1.030Z"
+ "438",NA,NA,"8","Infant feeding","Dietary","DR1.030Z"
+ "439",NA,NA,"9","Extended consumption","Dietary","DR1.030Z"
+ "440",NA,NA,"10","Desayano (Spanish)","Dietary","DR1.030Z"
+ "441",NA,NA,"11","Almuerzo (Spanish)","Dietary","DR1.030Z"
+ "442",NA,NA,"12","Comida (Spanish)","Dietary","DR1.030Z"
+ "443",NA,NA,"13","Merienda (Spanish)","Dietary","DR1.030Z"
+ "444",NA,NA,"14","Cena (Spanish)","Dietary","DR1.030Z"
+ "445",NA,NA,"15","Entre comida/bebida/tentempie (Spanish)","Dietary","DR1.030Z"
447
+ "446",NA,NA,"17","Bocadillo (Spanish)","Dietary","DR1.030Z"
448
+ "447",NA,NA,"91","Other","Dietary","DR1.030Z"
449
+ "448",NA,NA,"3","Dinner","Dietary","DR1.030Z"
450
+ "449",NA,NA,"4","Supper","Dietary","DR1.030Z"
451
+ "450",NA,NA,"6","Snack","Dietary","DR1.030Z"
452
+ "451",NA,NA,"7","Drink","Dietary","DR1.030Z"
453
+ "452",NA,NA,"10","Desayano (breakfast)","Dietary","DR1.030Z"
454
+ "453",NA,NA,"11","Almuerzo (breakfast)","Dietary","DR1.030Z"
455
+ "454",NA,NA,"12","Comida (lunch)","Dietary","DR1.030Z"
456
+ "455",NA,NA,"13","Merienda (snack)","Dietary","DR1.030Z"
457
+ "456",NA,NA,"14","Cena (dinner)","Dietary","DR1.030Z"
458
+ "457",NA,NA,"15","Entre comida (snack)","Dietary","DR1.030Z"
459
+ "458",NA,NA,"16","Botana (snack)","Dietary","DR1.030Z"
460
+ "459",NA,NA,"17","Bocadillo (snack)","Dietary","DR1.030Z"
461
+ "460",NA,NA,"18","Tentempie (snack)","Dietary","DR1.030Z"
462
+ "461",NA,NA,"19","Bebida (drink)","Dietary","DR1.030Z"
463
+ "462",NA,NA,"0","Non-combination food","Dietary","DR1CCMTX"
464
+ "463",NA,NA,"90","Other mixtures","Dietary","DR1CCMTX"
465
+ "464",NA,NA,"9","Dried beans and vegetable w/ additions","Dietary","DR1CCMTX"
466
+ "465",NA,NA,"1","Beverage w/ additions","Dietary","DR1CCMTX"
467
+ "466",NA,NA,"3","Bread/baked products w/ additions","Dietary","DR1CCMTX"
468
+ "467",NA,NA,"2","Cereal w/ additions","Dietary","DR1CCMTX"
469
+ "468",NA,NA,"14","Chips w/ additions","Dietary","DR1CCMTX"
470
+ "469",NA,NA,"12","Meat, poultry, fish","Dietary","DR1CCMTX"
471
+ "470",NA,NA,"7","Frozen meals","Dietary","DR1CCMTX"
472
+ "471",NA,NA,"10","Fruit w/ additions","Dietary","DR1CCMTX"
473
+ "472",NA,NA,"4","Salad","Dietary","DR1CCMTX"
474
+ "473",NA,NA,"5","Sandwiches","Dietary","DR1CCMTX"
475
+ "474",NA,NA,"6","Soup","Dietary","DR1CCMTX"
476
+ "475",NA,NA,"11","Tortilla products","Dietary","DR1CCMTX"
477
+ "476",NA,NA,"1","Beverage w/ adds","Dietary","DR1CCMTX"
478
+ "477",NA,NA,"2","Cereal w/ adds","Dietary","DR1CCMTX"
479
+ "478",NA,NA,"3","Bread/baked products w/ adds","Dietary","DR1CCMTX"
480
+ "479",NA,NA,"8","Ice cream/frozen yogurt w/ additions","Dietary","DR1CCMTX"
481
+ "480",NA,NA,"9","Dried beans and vegetable w/ adds","Dietary","DR1CCMTX"
482
+ "481",NA,NA,"10","Fruit w/ adds","Dietary","DR1CCMTX"
483
+ "482",NA,NA,"11","Tortilla Products","Dietary","DR1CCMTX"
484
+ "483",NA,NA,"13","Lunchables","Dietary","DR1CCMTX"
485
+ "484","DRXDRSTZ","Dietary Recall Status","1","Reliable and met the minimum criteria","Dietary","DR1DRSTZ"
486
+ "485","DRXDRSTZ","Dietary Recall Status","2","Not reliable or not met the minimum criteria","Dietary","DR1DRSTZ"
487
+ "486","DRXDRSTZ","Dietary Recall Status","9","Interview lost due to computer malfunction or file transfer problem","Dietary","DR1DRSTZ"
488
+ "487","DRXDRSTZ","Dietary Recall Status","4","Reported consuming breast-milk","Dietary","DR1DRSTZ"
489
+ "488","DRXDRSTZ","Dietary Recall Status","88","Blank but applicable","Dietary","DR1DRSTZ"
490
+ "489","DRXDRSTZ","Dietary Recall Status","5","Not done","Dietary","DR1DRSTZ"
491
+ "490",NA,NA,"2","No","Dietary","DR1.040Z"
492
+ "491",NA,NA,"1","Yes (home)","Dietary","DR1.040Z"
493
+ "492",NA,NA,"7","Refused","Dietary","DR1.040Z"
494
+ "493",NA,NA,"9","Don't know","Dietary","DR1.040Z"
495
+ "494","DRXTWSZ","Tap Water Source","1","Community supply","Dietary","DR1TWSZ"
496
+ "495","DRXTWSZ","Tap Water Source","91","Other","Dietary","DR1TWSZ"
497
+ "496","DRXTWSZ","Tap Water Source","4","Don't drink tap water","Dietary","DR1TWSZ"
498
+ "497","DRXTWSZ","Tap Water Source","99","Don't know","Dietary","DR1TWSZ"
499
+ "498","DBQ095Z","Type of salt you usually add at table","4","Doesn't use or add salt products at the table","Dietary","DBQ095Z"
500
+ "499","DBQ095Z","Type of salt you usually add at table","1","Ordinary salt [includes regular iodized salt, sea salt and seasoning salts made with regular salt]","Dietary","DBQ095Z"
501
+ "500","DBQ095Z","Type of salt you usually add at table","2","Lite salt","Dietary","DBQ095Z"
502
+ "501","DBQ095Z","Type of salt you usually add at table","3","Salt substitute","Dietary","DBQ095Z"
503
+ "502","DBQ095Z","Type of salt you usually add at table","88","Blank but applicable","Dietary","DBQ095Z"
504
+ "503","DBQ095Z","Type of salt you usually add at table","99","Don't know","Dietary","DBQ095Z"
505
+ "504","DBQ095Z","Type of salt you usually add at table","7","Refused","Dietary","DBQ095Z"
506
+ "505","DBQ095Z","Type of salt you usually add at table","91","Other","Dietary","DBQ095Z"
507
+ "506","DRXHELP","Who helped in responding for this interview","1","SP","Dietary","DR1HELP"
508
+ "507","DRXHELP","Who helped in responding for this interview","4","Parent of SP","Dietary","DR1HELP"
509
+ "508","DRXHELP","Who helped in responding for this interview","5","Spouse of SP","Dietary","DR1HELP"
510
+ "509","DRXHELP","Who helped in responding for this interview","6","Child of SP","Dietary","DR1HELP"
511
+ "510","DRXHELP","Who helped in responding for this interview","7","Grandparent of SP","Dietary","DR1HELP"
512
+ "511","DRXHELP","Who helped in responding for this interview","8","Friend, Partner, Non Relative","Dietary","DR1HELP"
513
+ "512","DRXHELP","Who helped in responding for this interview","9","Translator, not a HH member","Dietary","DR1HELP"
514
+ "513","DRXHELP","Who helped in responding for this interview","10","Child care provider, Caretaker","Dietary","DR1HELP"
515
+ "514","DRXHELP","Who helped in responding for this interview","11","Other Relative","Dietary","DR1HELP"
516
+ "515","DRXHELP","Who helped in responding for this interview","12","No One","Dietary","DR1HELP"
517
+ "516","DRXHELP","Who helped in responding for this interview","14","Other specify","Dietary","DR1HELP"
518
+ "517","DRXHELP","Who helped in responding for this interview","77","Refused","Dietary","DR1HELP"
519
+ "518","DRXHELP","Who helped in responding for this interview","99","Don't know","Dietary","DR1HELP"
520
+ "519","DRXMRESP","Who was the main respondent for this interview?","1","SP","Dietary","DR1MRESP"
521
+ "520","DRXMRESP","Who was the main respondent for this interview?","97","Proxy","Dietary","DR1MRESP"
522
+ "521","DRXMRESP","Who was the main respondent for this interview?","98","SP and proxy","Dietary","DR1MRESP"
523
+ "522","DRXMRESP","Who was the main respondent for this interview?","88","Blank but applicable","Dietary","DR1MRESP"
524
+ "523","DRXMRESP","Who was the main respondent for this interview?","2","Mother of SP","Dietary","DR1MRESP"
525
+ "524","DRXMRESP","Who was the main respondent for this interview?","3","Father of SP","Dietary","DR1MRESP"
526
+ "525","DRXMRESP","Who was the main respondent for this interview?","5","Spouse of SP","Dietary","DR1MRESP"
527
+ "526","DRXMRESP","Who was the main respondent for this interview?","6","Child of SP","Dietary","DR1MRESP"
528
+ "527","DRXMRESP","Who was the main respondent for this interview?","7","Grandparent of SP","Dietary","DR1MRESP"
529
+ "528","DRXMRESP","Who was the main respondent for this interview?","8","Friend, Partner, Non Relative","Dietary","DR1MRESP"
530
+ "529","DRXMRESP","Who was the main respondent for this interview?","9","Translator, not a HH member","Dietary","DR1MRESP"
531
+ "530","DRXMRESP","Who was the main respondent for this interview?","10","Child care provider, Caretaker","Dietary","DR1MRESP"
532
+ "531","DRXMRESP","Who was the main respondent for this interview?","11","Other Relative","Dietary","DR1MRESP"
533
+ "532","DRXMRESP","Who was the main respondent for this interview?","14","Other specify","Dietary","DR1MRESP"
534
+ "533","DRXMRESP","Who was the main respondent for this interview?","77","Refused","Dietary","DR1MRESP"
535
+ "534","DRXMRESP","Who was the main respondent for this interview?","99","Don't know","Dietary","DR1MRESP"
536
+ "535","DRXTWSZ","Tap Water Source","1","Community supply","Dietary","DR2TWSZ"
537
+ "536","DRXTWSZ","Tap Water Source","91","Other","Dietary","DR2TWSZ"
538
+ "537","DRXTWSZ","Tap Water Source","4","Don't drink tap water","Dietary","DR2TWSZ"
539
+ "538","DRXTWSZ","Tap Water Source","99","Don't know","Dietary","DR2TWSZ"
540
+ "539","DRXHELP","Who helped in responding for this interview","1","SP","Dietary","DR2HELP"
541
+ "540","DRXHELP","Who helped in responding for this interview","4","Parent of SP","Dietary","DR2HELP"
542
+ "541","DRXHELP","Who helped in responding for this interview","5","Spouse of SP","Dietary","DR2HELP"
543
+ "542","DRXHELP","Who helped in responding for this interview","6","Child of SP","Dietary","DR2HELP"
544
+ "543","DRXHELP","Who helped in responding for this interview","7","Grandparent of SP","Dietary","DR2HELP"
545
+ "544","DRXHELP","Who helped in responding for this interview","8","Friend, Partner, Non Relative","Dietary","DR2HELP"
546
+ "545","DRXHELP","Who helped in responding for this interview","9","Translator, not a HH member","Dietary","DR2HELP"
547
+ "546","DRXHELP","Who helped in responding for this interview","10","Child care provider, Caretaker","Dietary","DR2HELP"
548
+ "547","DRXHELP","Who helped in responding for this interview","11","Other Relative","Dietary","DR2HELP"
549
+ "548","DRXHELP","Who helped in responding for this interview","12","No One","Dietary","DR2HELP"
550
+ "549","DRXHELP","Who helped in responding for this interview","14","Other specify","Dietary","DR2HELP"
551
+ "550","DRXHELP","Who helped in responding for this interview","77","Refused","Dietary","DR2HELP"
552
+ "551","DRXHELP","Who helped in responding for this interview","99","Don't know","Dietary","DR2HELP"
553
+ "552","DRXMRESP","Who was the main respondent for this interview?","1","SP","Dietary","DR2MRESP"
554
+ "553","DRXMRESP","Who was the main respondent for this interview?","2","Mother of SP","Dietary","DR2MRESP"
555
+ "554","DRXMRESP","Who was the main respondent for this interview?","3","Father of SP","Dietary","DR2MRESP"
556
+ "555","DRXMRESP","Who was the main respondent for this interview?","5","Spouse of SP","Dietary","DR2MRESP"
557
+ "556","DRXMRESP","Who was the main respondent for this interview?","6","Child of SP","Dietary","DR2MRESP"
558
+ "557","DRXMRESP","Who was the main respondent for this interview?","7","Grandparent of SP","Dietary","DR2MRESP"
559
+ "558","DRXMRESP","Who was the main respondent for this interview?","8","Friend, Partner, Non Relative","Dietary","DR2MRESP"
560
+ "559","DRXMRESP","Who was the main respondent for this interview?","9","Translator, not a HH member","Dietary","DR2MRESP"
561
+ "560","DRXMRESP","Who was the main respondent for this interview?","10","Child care provider, Caretaker","Dietary","DR2MRESP"
562
+ "561","DRXMRESP","Who was the main respondent for this interview?","11","Other Relative","Dietary","DR2MRESP"
563
+ "562","DRXMRESP","Who was the main respondent for this interview?","14","Other specify","Dietary","DR2MRESP"
564
+ "563","DRXMRESP","Who was the main respondent for this interview?","77","Refused","Dietary","DR2MRESP"
565
+ "564","DRXMRESP","Who was the main respondent for this interview?","99","Don't know","Dietary","DR2MRESP"
566
+ "565","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","1","Rarely","Dietary","DBD100"
567
+ "566","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","2","Occasionally","Dietary","DBD100"
568
+ "567","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","3","Very often","Dietary","DBD100"
569
+ "568","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","88","Blank but applicable","Dietary","DBD100"
570
+ "569","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","9","Don't know","Dietary","DBD100"
571
+ "570","DBD100","How often {do you/does SP} add ordinary salt to {your/his/her/SP's} food at the table? Would you say . . .","7","Refused","Dietary","DBD100"
data/tidytuesday_json_val.json ADDED
@@ -0,0 +1,1911 @@
1
+ [
2
+ {
3
+ "date_posted": "2023-02-28",
4
+ "project_name": "African Language Sentiment",
5
+ "project_source": [
6
+ "https://r4ds.io/join",
7
+ "https://arxiv.org/pdf/2302.08956.pdf",
8
+ "https://github.com/shmuhammad2004",
9
+ "https://github.com/afrisenti-semeval/afrisent-semeval-2023"
10
+ ],
11
+ "description": "The data this week comes fromAfriSenti: Sentiment Analysis dataset for 14 African languagesvia@shmuhammad2004(the corresponding author on theassociated paper, and an active member of theR4DS Online Learning Community Slack). This repository contains data for the SemEval 2023 Shared Task 12: Sentiment Analysis in African Languages (AfriSenti-SemEval). The source repository also includes sentiment lexicons for several languages.",
12
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28",
13
+ "data_dictionary": [
14
+ {
15
+ "variable": [
16
+ "language_iso_code",
17
+ "tweet",
18
+ "label",
19
+ "intended_use"
20
+ ],
21
+ "class": [
22
+ "character",
23
+ "character",
24
+ "character",
25
+ "character"
26
+ ],
27
+ "description": [
28
+ "The unique code used to identify the language",
29
+ "The text content of a tweet",
30
+ "A sentiment label of positive, negative, or neutral assigned by a native speaker of that language",
31
+ "Whether the data came from the dev, test, or train set for that language"
32
+ ]
33
+ },
34
+ {
35
+ "variable": [
36
+ "language_iso_code",
37
+ "language"
38
+ ],
39
+ "class": [
40
+ "character",
41
+ "character"
42
+ ],
43
+ "description": [
44
+ "The unique code used to identify the language",
45
+ "The name of the language"
46
+ ]
47
+ },
48
+ {
49
+ "variable": [
50
+ "language_iso_code",
51
+ "script"
52
+ ],
53
+ "class": [
54
+ "character",
55
+ "character"
56
+ ],
57
+ "description": [
58
+ "The unique code used to identify the language",
59
+ "The script used to write the language"
60
+ ]
61
+ },
62
+ {
63
+ "variable": [
64
+ "language_iso_code",
65
+ "country"
66
+ ],
67
+ "class": [
68
+ "character",
69
+ "character"
70
+ ],
71
+ "description": [
72
+ "The unique code used to identify the language",
73
+ "A country in which the language is spoken"
74
+ ]
75
+ },
76
+ {
77
+ "variable": [
78
+ "country",
79
+ "region"
80
+ ],
81
+ "class": [
82
+ "character",
83
+ "character"
84
+ ],
85
+ "description": [
86
+ "A country in which the language is spoken",
87
+ "The region of Africa in which that country is categorized. Note that Mozambique is categorized as \\\"East Africa\\\", \\\"Southern Africa\\\", and \\\"Southeastern Africa\\\""
88
+ ]
89
+ }
90
+ ],
91
+ "data": {
92
+ "file_name": [
93
+ "afrisenti.csv",
94
+ "country_regions.csv",
95
+ "language_countries.csv",
96
+ "language_scripts.csv",
97
+ "languages.csv"
98
+ ],
99
+ "file_url": [
100
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28/afrisenti.csv",
101
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28/country_regions.csv",
102
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28/language_countries.csv",
103
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28/language_scripts.csv",
104
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-28/languages.csv"
105
+ ]
106
+ },
107
+ "data_load": {
108
+ "file_name": [
109
+ "afrisenti.csv",
110
+ "country_regions.csv",
111
+ "language_countries.csv",
112
+ "language_scripts.csv",
113
+ "languages.csv"
114
+ ],
115
+ "file_url": [
116
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-28/afrisenti.csv",
117
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-28/country_regions.csv",
118
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-28/language_countries.csv",
119
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-28/language_scripts.csv",
120
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-28/languages.csv"
121
+ ]
122
+ }
123
+ },
124
+ {
125
+ "date_posted": "2023-05-02",
126
+ "project_name": "The Portal Project",
127
+ "project_source": [
128
+ "https://www.weecology.org/",
129
+ "https://weecology.github.io/portalr/",
130
+ "https://portal.weecology.org/",
131
+ "https://datacarpentry.org/ecology-workshop/",
132
+ "https://www.data-retriever.org/"
133
+ ],
134
+ "description": "The data this week comes from thePortal Project. This is a long-term ecological research site studying the dynamics of desert rodents, plants, ants and weather in Arizona. The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been used to study the interactions among rodents, ants and plants and their respective responses to climate. To study the interactions among organisms, they experimentally manipulate access to 24 study plots. This study has produced over 100 scientific papers and is one of the longest running ecological studies in the U.S. TheWeecology research groupmonitors rodents, plants, ants, and weather. All data from the Portal Project are made openly available in near real-time so that they can provide the maximum benefit to scientific research and outreach. The core dataset is managed using an automated living data workflow run using GitHub and Continuous Analysis. This dataset focuses on the rodent data. Full data is available through these resources: The Portal Project data can also be accessed through the Data Retriever, a package manager for data. Data Retriever A teaching focused version of the dataset is also maintained with some of the complexities of the data removed to make it easy to use for computational training purposes. This dataset serves as the core dataset for theData Carpentry Ecologymaterial and has been downloaded almost 50,000 times. Thanks to @ethanwhite for the data cleaning script. This script downloads the data using the{portalr}package. It filters for the species and plot data, and years greater than 1977.",
135
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-02",
136
+ "data_dictionary": [
137
+ {
138
+ "variable": [
139
+ "plot",
140
+ "treatment"
141
+ ],
142
+ "class": [
143
+ "double",
144
+ "character"
145
+ ],
146
+ "description": [
147
+ "Plot number",
148
+ "Treatment type"
149
+ ]
150
+ },
151
+ {
152
+ "variable": [
153
+ "species",
154
+ "scientificname",
155
+ "taxa",
156
+ "commonname",
157
+ "censustarget",
158
+ "unidentified",
159
+ "rodent",
160
+ "granivore",
161
+ "minhfl",
162
+ "meanhfl",
163
+ "maxhfl",
164
+ "minwgt",
165
+ "meanwgt",
166
+ "maxwgt",
167
+ "juvwgt"
168
+ ],
169
+ "class": [
170
+ "character",
171
+ "character",
172
+ "character",
173
+ "character",
174
+ "double",
175
+ "double",
176
+ "double",
177
+ "double",
178
+ "double",
179
+ "double",
180
+ "double",
181
+ "double",
182
+ "double",
183
+ "double",
184
+ "double"
185
+ ],
186
+ "description": [
187
+ "Species",
188
+ "Scientific Name",
189
+ "Taxa",
190
+ "Common Name",
191
+ "Target species (0 or 1)",
192
+ "Unidentified (0 or 1)",
193
+ "Rodent (0 or 1)",
194
+ "Granivore (0 or 1)",
195
+ "Minimum hindfoot length",
196
+ "Mean hindfoot length",
197
+ "Maximum hindfoot length",
198
+ "Minimum weight",
199
+ "Mean weight",
200
+ "Maximum weight",
201
+ "Juvenile weight"
202
+ ]
203
+ },
204
+ {
205
+ "variable": [
206
+ "censusdate",
207
+ "month",
208
+ "day",
209
+ "year",
210
+ "treatment",
211
+ "plot",
212
+ "stake",
213
+ "species",
214
+ "sex",
215
+ "reprod",
216
+ "age",
217
+ "testes",
218
+ "vagina",
219
+ "pregnant",
220
+ "nipples",
221
+ "lactation",
222
+ "hfl",
223
+ "wgt",
224
+ "tag",
225
+ "note2",
226
+ "ltag",
227
+ "note3"
228
+ ],
229
+ "class": [
230
+ "double",
231
+ "double",
232
+ "double",
233
+ "double",
234
+ "character",
235
+ "double",
236
+ "double",
237
+ "character",
238
+ "character",
239
+ "character",
240
+ "character",
241
+ "character",
242
+ "character",
243
+ "character",
244
+ "character",
245
+ "character",
246
+ "double",
247
+ "double",
248
+ "character",
249
+ "character",
250
+ "character",
251
+ "character"
252
+ ],
253
+ "description": [
254
+ "Census date",
255
+ "Month",
256
+ "Day",
257
+ "Year",
258
+ "Treatment type",
259
+ "Plot number",
260
+ "Stake number",
261
+ "Species code",
262
+ "Sex",
263
+ "Reproductive condition",
264
+ "Age",
265
+ "Testes (Scrotal, Recent, or Minor)",
266
+ "Vagina (Swollen, Plugged, or Both)",
267
+ "Pregnant",
268
+ "Nipples (Enlarged, Swollen, or Both)",
269
+ "Lactating",
270
+ "Hindfoot length",
271
+ "Weight",
272
+ "Primary individual identifier",
273
+ "Newly tagged individual for 'tag'",
274
+ "Secondary tag information when ear tags were used in both ears",
275
+ "Newly tagged individual for 'ltag'"
276
+ ]
277
+ }
278
+ ],
279
+ "data": {
280
+ "file_name": [
281
+ "plots.csv",
282
+ "species.csv",
283
+ "surveys.csv"
284
+ ],
285
+ "file_url": [
286
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-02/plots.csv",
287
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-02/species.csv",
288
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-02/surveys.csv"
289
+ ]
290
+ },
291
+ "data_load": {
292
+ "file_name": [
293
+ "plots.csv",
294
+ "species.csv",
295
+ "surveys.csv"
296
+ ],
297
+ "file_url": [
298
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-02/plots.csv",
299
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-02/species.csv",
300
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-02/surveys.csv"
301
+ ]
302
+ }
303
+ },
304
+ {
305
+ "date_posted": "2023-04-04",
306
+ "project_name": "Premier League Match Data 2021-2022",
307
+ "project_source": [
308
+ "https://www.kaggle.com/datasets/evangower/premier-league-match-data",
309
+ "https://theathletic.com/3459766/2022/07/29/liverpool-manchester-city-premier-league-fouls-yellow-card/",
310
+ "https://github.com/evangower",
311
+ "https://www.kaggle.com/code/evangower/who-wins-the-epl-if-games-end-at-half-time/"
312
+ ],
313
+ "description": "The data this week comes from thePremier League Match Data 2021-2022viaEvan Goweron Kaggle. You can explore match day statistics of every game and every team during the 2021-22 season of the English Premier League Data. Data includes teams playing, date, referee, and stats for home and away side such as fouls, shots, cards, and more! Also included is a dataset of the weekly rankings for the season. The data was collected from the official website of the Premier League. Evan then cleaned the data using google sheets. Evan did an analysis ofWho wins the EPL if games end at half time?and there'san article from the Athleticabout fouls conceded per yellow card article. No data cleaning",
314
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-04-04",
315
+ "data_dictionary": [
316
+ {
317
+ "variable": [
318
+ "Date",
319
+ "HomeTeam",
320
+ "AwayTeam",
321
+ "FTHG",
322
+ "FTAG",
323
+ "FTR",
324
+ "HTHG",
325
+ "HTAG",
326
+ "HTR",
327
+ "Referee",
328
+ "HS",
329
+ "AS",
330
+ "HST",
331
+ "AST",
332
+ "HF",
333
+ "AF",
334
+ "HC",
335
+ "AC",
336
+ "HY",
337
+ "AY",
338
+ "HR",
339
+ "AR"
340
+ ],
341
+ "class": [
342
+ "character",
343
+ "character",
344
+ "character",
345
+ "double",
346
+ "double",
347
+ "character",
348
+ "double",
349
+ "double",
350
+ "character",
351
+ "character",
352
+ "double",
353
+ "double",
354
+ "double",
355
+ "double",
356
+ "double",
357
+ "double",
358
+ "double",
359
+ "double",
360
+ "double",
361
+ "double",
362
+ "double",
363
+ "double"
364
+ ],
365
+ "description": [
366
+ "The date when the match was played",
367
+ "The home team",
368
+ "The away team",
369
+ "Full time home goals",
370
+ "Full time away goals",
371
+ "Full time result",
372
+ "Halftime home goals",
373
+ "Halftime away goals",
374
+ "Halftime results",
375
+ "Referee of the match",
376
+ "Number of shots taken by the home team",
377
+ "Number of shots taken by the away team",
378
+ "Number of shots on target by the home team",
379
+ "Number of shots on target by the away team",
380
+ "Number of fouls by the home team",
381
+ "Number of fouls by the away team",
382
+ "Number of corners taken by the home team",
383
+ "Number of corners taken by the away team",
384
+ "Number of yellow cards received by the home team",
385
+ "Number of yellow cards received by the away team",
386
+ "Number of red cards received by the home team",
387
+ "Number of red cards received by the away team"
388
+ ]
389
+ }
390
+ ],
391
+ "data": {
392
+ "file_name": [
393
+ "soccer21-22.csv"
394
+ ],
395
+ "file_url": [
396
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-04-04/soccer21-22.csv"
397
+ ]
398
+ },
399
+ "data_load": {
400
+ "file_name": [
401
+ "soccer21-22.csv"
402
+ ],
403
+ "file_url": [
404
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-04/soccer21-22.csv"
405
+ ]
406
+ }
407
+ },
408
+ {
409
+ "date_posted": "2023-02-07",
410
+ "project_name": "Big Tech Stock Prices",
411
+ "project_source": [
412
+ "https://github.com/rfordatascience/tidytuesday/issues/509",
413
+ "https://www.morningstar.com/articles/1129535/5-charts-on-big-tech-stocks-collapse",
414
+ "https://www.kaggle.com/datasets/evangower/big-tech-stock-prices"
415
+ ],
416
+ "description": "The data this week comes from Yahoo Finance viaKaggle(byEvan Gower). This dataset consists of the daily stock prices and volume of 14 different tech companies, including Apple (AAPL), Amazon (AMZN), Alphabet (GOOGL), and Meta Platforms (META) and more! A number of articles have examined the collapse of \"Big Tech\" stock prices, includingthis article from morningstar.com. Note: Allstock_symbols have 3271 prices, except META (2688) and TSLA (3148) because they were not publicly traded for part of the period examined.",
417
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07",
418
+ "data_dictionary": [
419
+ {
420
+ "variable": [
421
+ "stock_symbol",
422
+ "date",
423
+ "open",
424
+ "high",
425
+ "low",
426
+ "close",
427
+ "adj_close",
428
+ "volume"
429
+ ],
430
+ "class": [
431
+ "character",
432
+ "double",
433
+ "double",
434
+ "double",
435
+ "double",
436
+ "double",
437
+ "double",
438
+ "double"
439
+ ],
440
+ "description": [
441
+ "stock_symbol",
442
+ "date",
443
+ "The price at market open.",
444
+ "The highest price for that day.",
445
+ "The lowest price for that day.",
446
+ "The price at market close, adjusted for splits.",
447
+ "The closing price after adjustments for all applicable splits and dividend distributions. Data is adjusted using appropriate split and dividend multipliers, adhering to Center for Research in Security Prices (CRSP) standards.",
448
+ "The number of shares traded on that day."
449
+ ]
450
+ },
451
+ {
452
+ "variable": [
453
+ "stock_symbol",
454
+ "company"
455
+ ],
456
+ "class": [
457
+ "character",
458
+ "character"
459
+ ],
460
+ "description": [
461
+ "stock_symbol",
462
+ "Full name of the company."
463
+ ]
464
+ }
465
+ ],
466
+ "data": {
467
+ "file_name": [
468
+ "big_tech_companies.csv",
469
+ "big_tech_stock_prices.csv"
470
+ ],
471
+ "file_url": [
472
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07/big_tech_companies.csv",
473
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07/big_tech_stock_prices.csv"
474
+ ]
475
+ },
476
+ "data_load": {
477
+ "file_name": [
478
+ "big_tech_companies.csv",
479
+ "big_tech_stock_prices.csv"
480
+ ],
481
+ "file_url": [
482
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_companies.csv",
483
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_stock_prices.csv"
484
+ ]
485
+ }
486
+ },
487
+ {
488
+ "date_posted": "2023-03-21",
489
+ "project_name": "Programming Languages",
490
+ "project_source": [
491
+ "https://github.com/rfordatascience/tidytuesday/issues/530",
492
+ "https://pldb.com/posts/does-every-programming-language-support-line-comments.html",
493
+ "https://pldb.com/csv.html",
494
+ "https://pldb.com/index.html",
495
+ "https://pldb.com/posts/index.html"
496
+ ],
497
+ "description": "The data this week comes from the Programming Language DataBase. Thanks to Jesus M. Castagnetto for the suggestion! The PLDB has a blog with numerous articles exploring the data, such as \"Does every programming language have line comments?\". The data is user-submitted, so you might want to confirm the accuracy of anything particularly surprising that you find before stating it with certainty! The full data dictionary is available from PLDB.com.",
498
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-21",
499
+ "data_dictionary": [
500
+ {
501
+ "variable": [
502
+ "pldb_id",
503
+ "title",
504
+ "description",
505
+ "type",
506
+ "appeared",
507
+ "creators",
508
+ "website",
509
+ "domain_name",
510
+ "domain_name_registered",
511
+ "reference",
512
+ "isbndb",
513
+ "book_count",
514
+ "semantic_scholar",
515
+ "language_rank",
516
+ "github_repo",
517
+ "github_repo_stars",
518
+ "github_repo_forks",
519
+ "github_repo_updated",
520
+ "github_repo_subscribers",
521
+ "github_repo_created",
522
+ "github_repo_description",
523
+ "github_repo_issues",
524
+ "github_repo_first_commit",
525
+ "github_language",
526
+ "github_language_tm_scope",
527
+ "github_language_type",
528
+ "github_language_ace_mode",
529
+ "github_language_file_extensions",
530
+ "github_language_repos",
531
+ "wikipedia",
532
+ "wikipedia_daily_page_views",
533
+ "wikipedia_backlinks_count",
534
+ "wikipedia_summary",
535
+ "wikipedia_page_id",
536
+ "wikipedia_appeared",
537
+ "wikipedia_created",
538
+ "wikipedia_revision_count",
539
+ "wikipedia_related",
540
+ "features_has_comments",
541
+ "features_has_semantic_indentation",
542
+ "features_has_line_comments",
543
+ "line_comment_token",
544
+ "last_activity",
545
+ "number_of_users",
546
+ "number_of_jobs",
547
+ "origin_community",
548
+ "central_package_repository_count",
549
+ "file_type",
550
+ "is_open_source"
551
+ ],
552
+ "class": [
553
+ "character",
554
+ "character",
555
+ "character",
556
+ "character",
557
+ "double",
558
+ "character",
559
+ "character",
560
+ "character",
561
+ "double",
562
+ "character",
563
+ "double",
564
+ "double",
565
+ "integer",
566
+ "double",
567
+ "character",
568
+ "double",
569
+ "double",
570
+ "double",
571
+ "double",
572
+ "double",
573
+ "character",
574
+ "double",
575
+ "double",
576
+ "character",
577
+ "character",
578
+ "character",
579
+ "character",
580
+ "character",
581
+ "double",
582
+ "character",
583
+ "double",
584
+ "double",
585
+ "character",
586
+ "double",
587
+ "double",
588
+ "double",
589
+ "double",
590
+ "character",
591
+ "logical",
592
+ "logical",
593
+ "logical",
594
+ "character",
595
+ "double",
596
+ "double",
597
+ "double",
598
+ "character",
599
+ "double",
600
+ "character",
601
+ "logical"
602
+ ],
603
+ "description": [
604
+ "A standardized, uniquified version of the language name, used as an ID on the PLDB site.",
605
+ "The official title of the language.",
606
+ "Description of the repo on GitHub.",
607
+ "Which category in PLDB's subjective ontology does this entity fit into.",
608
+ "What year was the language publicly released and/or announced?",
609
+ "Name(s) of the original creators of the language delimited by \\\" and \\\"",
610
+ "URL of the official homepage for the language project.",
611
+ "If the project website is on its own domain.",
612
+ "When was this domain first registered?",
613
+ "A link to more info about this entity.",
614
+ "Books about this language from ISBNdb.",
615
+ "Computed; the number of books found for this language at isbndb.com",
616
+ "Papers about this language from Semantic Scholar.",
617
+ "Computed; A rank for the language, taking into account various online rankings. The computation for this column is not currently clear.",
618
+ "URL of the official GitHub repo for the project if it is hosted there.",
619
+ "How many stars of the repo?",
620
+ "How many forks of the repo?",
621
+ "What year was the last commit made?",
622
+ "How many subscribers to the repo?",
623
+ "When was the GitHub repo for this entity created?",
624
+ "Description of the repo on GitHub.",
625
+ "How many issues on the repo?",
626
+ "What year was the first commit made in this git repo?",
627
+ "GitHub has a set of supported languages as defined here",
628
+ "The TextMate scope that represents this programming language.",
629
+ "Either data, programming, markup, prose, or nil.",
630
+ "A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use \\\"text\\\" if a mode does not exist.",
631
+ "An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically).",
632
+ "How many repos for this language does GitHub report?",
633
+ "URL of the entity on Wikipedia, if and only if it has a page dedicated to it.",
634
+ "How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api.",
635
+ "How many pages on WP link to this page?",
636
+ "What is the text summary of the language from the Wikipedia page?",
637
+ "What is the internal ID for this entity on WP?",
638
+ "When does Wikipedia claim this entity first appeared?",
639
+ "When was the Wikipedia page for this entity created?",
640
+ "How many revisions does this page have?",
641
+ "What languages does Wikipedia have as related?",
642
+ "Does this language have a comment character?",
643
+ "Does indentation have semantic meaning in this language?",
644
+ "Does this language support inline comments (as opposed to comments that must span an entire line)?",
645
+ "Defined as a token that can be placed anywhere on a line and starts a comment that cannot be stopped except by a line break character or end of file.",
646
+ "Computed; The most recent of any year field in the PLDB for this language.",
647
+ "Computed; \\\"Crude user estimate from a linear model.",
648
+ "Computed; The estimated number of job openings for programmers in this language.",
649
+ "In what community(ies) did the language first originate?",
650
+ "Number of packages in a central repository. If this value is not known, it is set to 0 (so \\\"0\\\" can mean \\\"no repository exists\\\", \\\"the repository exists but is empty\\\" (unlikely), or \\\"we do not know if a repository exists\\\". This value is definitely incorrect for R.",
651
+ "What is the file encoding for programs in this language?",
652
+ "Is it an open source project?"
653
+ ]
654
+ }
655
+ ],
656
+ "data": {
657
+ "file_name": [
658
+ "languages.csv"
659
+ ],
660
+ "file_url": [
661
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-21/languages.csv"
662
+ ]
663
+ },
664
+ "data_load": {
665
+ "file_name": [
666
+ "languages.csv"
667
+ ],
668
+ "file_url": [
669
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-21/languages.csv"
670
+ ]
671
+ }
672
+ },
673
+ {
674
+ "date_posted": "2023-05-23",
675
+ "project_name": "Central Park Squirrel Census",
676
+ "project_source": [
677
+ "https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw",
678
+ "https://www.thesquirrelcensus.com/"
679
+ ],
680
+ "description": "Squirrel data! The data this week comes from the 2018 Central Park Squirrel Census. The Squirrel Census is a multimedia science, design, and storytelling project focusing on the Eastern gray (Sciurus carolinensis). They count squirrels and present their findings to the public. The dataset contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans. No data cleaning",
681
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-23",
682
+ "data_dictionary": [
683
+ {
684
+ "variable": [
685
+ "X",
686
+ "Y",
687
+ "Unique Squirrel ID",
688
+ "Hectare",
689
+ "Shift",
690
+ "Date",
691
+ "Hectare Squirrel Number",
692
+ "Age",
693
+ "Primary Fur Color",
694
+ "Highlight Fur Color",
695
+ "Combination of Primary and Highlight Color",
696
+ "Color notes",
697
+ "Location",
698
+ "Above Ground Sighter Measurement",
699
+ "Specific Location",
700
+ "Running",
701
+ "Chasing",
702
+ "Climbing",
703
+ "Eating",
704
+ "Foraging",
705
+ "Other Activities",
706
+ "Kuks",
707
+ "Quaas",
708
+ "Moans",
709
+ "Tail flags",
710
+ "Tail twitches",
711
+ "Approaches",
712
+ "Indifferent",
713
+ "Runs from",
714
+ "Other Interactions",
715
+ "Lat/Long"
716
+ ],
717
+ "class": [
718
+ "double",
719
+ "double",
720
+ "character",
721
+ "character",
722
+ "character",
723
+ "double",
724
+ "double",
725
+ "character",
726
+ "character",
727
+ "character",
728
+ "character",
729
+ "character",
730
+ "character",
731
+ "character",
732
+ "character",
733
+ "logical",
734
+ "logical",
735
+ "logical",
736
+ "logical",
737
+ "logical",
738
+ "character",
739
+ "logical",
740
+ "logical",
741
+ "logical",
742
+ "logical",
743
+ "logical",
744
+ "logical",
745
+ "logical",
746
+ "logical",
747
+ "character",
748
+ "character"
749
+ ],
750
+ "description": [
751
+ "Longitude coordinate for squirrel sighting point",
752
+ "Latitude coordinate for squirrel sighting point",
753
+ "Identification tag for each squirrel sighting. The tag is comprised of \\\"Hectare ID\\\" + \\\"Shift\\\" + \\\"Date\\\" + \\\"Hectare Squirrel Number.\\\"",
754
+ "ID tag, which is derived from the hectare grid used to divide and count the park area. One axis that runs predominantly north-to-south is numerical (1-42), and the axis that runs predominantly east-to-west is roman characters (A-I).",
755
+ "Value is either \\\"AM\\\" or \\\"PM,\\\" to communicate whether or not the sighting session occurred in the morning or late afternoon.",
756
+ "Concatenation of the sighting session day and month.",
757
+ "Number within the chronological sequence of squirrel sightings for a discrete sighting session.",
758
+ "Value is either \\\"Adult\\\" or \\\"Juvenile.\\\"",
759
+ "Primary Fur Color - value is either \\\"Gray,\\\" \\\"Cinnamon\\\" or \\\"Black.\\\"",
760
+ "Discrete value or string values comprised of \\\"Gray,\\\" \\\"Cinnamon\\\" or \\\"Black.\\\"",
761
+ "A combination of the previous two columns; this column gives the total permutations of primary and highlight colors observed.",
762
+ "Sighters occasionally added commentary on the squirrel fur conditions. These notes are provided here.",
763
+ "Value is either \\\"Ground Plane\\\" or \\\"Above Ground.\\\" Sighters were instructed to indicate the location of where the squirrel was when first sighted.",
764
+ "For squirrel sightings on the ground plane, fields were populated with a value of \\\"FALSE.\\\"",
765
+ "Sighters occasionally added commentary on the squirrel location. These notes are provided here.",
766
+ "Squirrel was seen running.",
767
+ "Squirrel was seen chasing another squirrel.",
768
+ "Squirrel was seen climbing a tree or other environmental landmark.",
769
+ "Squirrel was seen eating.",
770
+ "Squirrel was seen foraging for food.",
771
+ "Other activities squirrels were observed doing.",
772
+ "Squirrel was heard kukking, a chirpy vocal communication used for a variety of reasons.",
773
+ "Squirrel was heard quaaing, an elongated vocal communication which can indicate the presence of a ground predator such as a dog.",
774
+ "Squirrel was heard moaning, a high-pitched vocal communication which can indicate the presence of an air predator such as a hawk.",
775
+ "Squirrel was seen flagging its tail. Flagging is a whipping motion used to exaggerate squirrel's size and confuse rivals or predators. Looks as if the squirrel is scribbling with tail into the air.",
776
+ "Squirrel was seen twitching its tail. Looks like a wave running through the tail, like a breakdancer doing the arm wave. Often used to communicate interest, curiosity.",
777
+ "Squirrel was seen approaching human, seeking food.",
778
+ "Squirrel was indifferent to human presence.",
779
+ "Squirrel was seen running from humans, seeing them as a threat.",
780
+ "Sighter notes on other types of interactions between squirrels and humans.",
781
+ "Latitude and longitude"
782
+ ]
783
+ }
784
+ ],
785
+ "data": {
786
+ "file_name": [
787
+ "squirrel_data.csv"
788
+ ],
789
+ "file_url": [
790
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-23/squirrel_data.csv"
791
+ ]
792
+ },
793
+ "data_load": {
794
+ "file_name": [
795
+ "squirrel_data.csv"
796
+ ],
797
+ "file_url": [
798
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-23/squirrel_data.csv"
799
+ ]
800
+ }
801
+ },
802
+ {
803
+ "date_posted": "2023-01-17",
804
+ "project_name": "Art History",
805
+ "project_source": [
806
+ "https://research.repository.duke.edu/concern/datasets/q811kk70n?locale=en",
807
+ "https://github.com/hollandstam1/thesis",
808
+ "https://saralemus7.github.io/arthistory/",
809
+ "https://github.com/saralemus7/arthistory"
810
+ ],
811
+ "description": "The data this week comes from the arthistory data package. This dataset contains data that was used for Holland Stam's thesis work, titled Quantifying art historical narratives. The data was collected to assess the demographic representation of artists through editions of Janson's History of Art and Gardner's Art Through the Ages, two of the most popular art history textbooks used in the American education system. In this package specifically, both artist-level and work-level data was collected along with variables regarding the artists' demographics and numeric metrics for describing how much space they or their work took up in each edition of each textbook. This package contains three datasets: Acknowledging arthistory Citation Lemus S, Stam H (2022). arthistory: Art History Textbook Data. https://github.com/saralemus7/arthistory, https://saralemus7.github.io/arthistory/. Examples of analyses are included in Holland Stam's thesis in Quarto files. No data cleaning",
812
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17",
813
+ "data_dictionary": [
814
+ {
815
+ "variable": [
816
+ "artist_name",
817
+ "edition_number",
818
+ "year",
819
+ "artist_nationality",
820
+ "artist_nationality_other",
821
+ "artist_gender",
822
+ "artist_race",
823
+ "artist_ethnicity",
824
+ "book",
825
+ "space_ratio_per_page_total",
826
+ "artist_unique_id",
827
+ "moma_count_to_year",
828
+ "whitney_count_to_year",
829
+ "artist_race_nwi"
830
+ ],
831
+ "class": [
832
+ "character",
833
+ "double",
834
+ "double",
835
+ "character",
836
+ "character",
837
+ "character",
838
+ "character",
839
+ "character",
840
+ "character",
841
+ "double",
842
+ "double",
843
+ "double",
844
+ "double",
845
+ "character"
846
+ ],
847
+ "description": [
848
+ "The name of each artist",
849
+ "The edition number of the textbook from either Janson's History of Art or Gardner's Art Through the Ages.",
850
+ "The year of publication for a given edition of Janson or Gardner.",
851
+ "The nationality of a given artist.",
852
+ "The nationality of the artist. Of the total count of artists through all editions of Janson's History of Art and Gardner's Art Through the Ages, 77.32% account for French, Spanish, British, American and German. Therefore, the categorical strings of this variable are French, Spanish, British, American, German and Other",
853
+ "The gender of the artist",
854
+ "The race of the artist",
855
+ "The ethnicity of the artist",
856
+ "Which book, either Janson or Gardner, the particular artist was included in at that particular time.",
857
+ "The area in centimeters squared of both the text and the figure of a particular artist in a given edition of Janson's History of Art divided by the area in centimeters squared of a single page of the respective edition. This variable is continuous.",
858
+ "The unique identifying number assigned to artists across books is denoted in alphabetical order. This variable is discrete.",
859
+ "The total count of exhibitions ever held by the Museum of Modern Art (MoMA) of a particular artist at a given year of publication. This variable is discrete.",
860
+ "The count of exhibitions held by The Whitney of a particular artist at a particular moment of time, as highlighted by year. This variable is discrete.",
861
+ "The non-white indicator for artist race, meaning if an artist's race is denoted as either white or non-white."
862
+ ]
863
+ }
864
+ ],
865
+ "data": {
866
+ "file_name": [
867
+ "artists.csv"
868
+ ],
869
+ "file_url": [
870
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-17/artists.csv"
871
+ ]
872
+ },
873
+ "data_load": {
874
+ "file_name": [
875
+ "artists.csv"
876
+ ],
877
+ "file_url": [
878
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-17/artists.csv"
879
+ ]
880
+ }
881
+ },
882
+ {
883
+ "date_posted": "2023-07-04",
884
+ "project_name": "Historical Markers",
885
+ "project_source": [
886
+ "http://www.geonames.org/",
887
+ "https://www.hmdb.org/geolists.asp?c=United%20States%20of%20America",
888
+ "https://www.hmdb.org/stats.asp",
889
+ "https://www.hmdb.org/",
890
+ "https://github.com/rfordatascience/tidytuesday/issues/574#issuecomment-1601050053"
891
+ ],
892
+ "description": "The data this week comes from the Historical Marker Database USA Index. Learn more about the markers on the HMDb.org site, which includes a number of articles, including Database Counts and Statistics. We included a dataset of places that do not have entries in the Historical Markers Database. You might try to combine that with information from geonames.org (code: HSTS) to find markers that need to be submitted. Thanks to Jesus M. Castagnetto for the geonames tip!",
893
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-07-04",
894
+ "data_dictionary": [
895
+ {
896
+ "variable": [
897
+ "marker_id",
898
+ "marker_no",
899
+ "title",
900
+ "subtitle",
901
+ "addl_subtitle",
902
+ "year_erected",
903
+ "erected_by",
904
+ "latitude_minus_s",
905
+ "longitude_minus_w",
906
+ "street_address",
907
+ "city_or_town",
908
+ "section_or_quarter",
909
+ "county_or_parish",
910
+ "state_or_prov",
911
+ "location",
912
+ "missing",
913
+ "link"
914
+ ],
915
+ "class": [
916
+ "double",
917
+ "character",
918
+ "character",
919
+ "character",
920
+ "character",
921
+ "integer",
922
+ "character",
923
+ "double",
924
+ "double",
925
+ "character",
926
+ "character",
927
+ "character",
928
+ "character",
929
+ "character",
930
+ "character",
931
+ "character",
932
+ "character"
933
+ ],
934
+ "description": [
935
+ "Unique ID for this marker in the HMDb.",
936
+ "Number of this marker in the state numbering scheme.",
937
+ "Main title of the marker.",
938
+ "Subtitle of the marker, if any.",
939
+ "Additional subtitle text.",
940
+ "The year in which the marker was erected.",
941
+ "The organization which erected the marker.",
942
+ "The latitude of the marker.",
943
+ "The longitude of the marker.",
944
+ "The street address of the marker, if available.",
945
+ "The city, town, etc in which the marker is located.",
946
+ "The section of the city, town, etc, when available.",
947
+ "The county, parish, or similar designation in which the marker appears.",
948
+ "The state, province, territory, etc in which the marker appears.",
949
+ "A description of the marker's location.",
950
+ "Whether the marker is \\\"Reported missing\\\" or \\\"Confirmed missing\\\". NA values indicate that the marker has neither been reported missing nor confirmed as missing.",
951
+ "The HMDb link to the marker. Links include additional details, such as photos and topic lists to which this marker belongs."
952
+ ]
953
+ },
954
+ {
955
+ "variable": [
956
+ "county",
957
+ "state"
958
+ ],
959
+ "class": [
960
+ "character",
961
+ "character"
962
+ ],
963
+ "description": [
964
+ "County or equivalent.",
965
+ "State or territory."
966
+ ]
967
+ }
968
+ ],
969
+ "data": {
970
+ "file_name": [
971
+ "historical_markers.csv",
972
+ "no_markers.csv"
973
+ ],
974
+ "file_url": [
975
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-07-04/historical_markers.csv",
976
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-07-04/no_markers.csv"
977
+ ]
978
+ },
979
+ "data_load": {
980
+ "file_name": [
981
+ "historical_markers.csv",
982
+ "no_markers.csv"
983
+ ],
984
+ "file_url": [
985
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-07-04/historical_markers.csv",
986
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-07-04/no_markers.csv"
987
+ ]
988
+ }
989
+ },
990
+ {
991
+ "date_posted": "2023-02-14",
992
+ "project_name": "Hollywood Age Gaps",
993
+ "project_source": [
994
+ "https://www.data-is-plural.com/archive/2018-02-07-edition/",
995
+ "https://tidytues.day/2021/2021-03-09",
996
+ "https://hollywoodagegap.com/"
997
+ ],
998
+ "description": "The data this week comes from Hollywood Age Gap via Data Is Plural. An informational site showing the age gap between movie love interests. The data follows certain rules: the two (or more) actors play actual love interests (not just friends, coworkers, or some other non-romantic type of relationship); the youngest of the two actors is at least 17 years old; no animated characters. We previously provided a dataset about the Bechdel Test. It might be interesting to see whether there is any correlation between these datasets! The Bechdel Test dataset also included additional information about the films that were used in that dataset. Note: The age gaps dataset includes \"gender\" columns, which always contain the values \"man\" or \"woman\". These values appear to indicate how the characters in each film identify. Some of these values do not match how the actor identifies. We apologize if any characters are misgendered in the data!",
999
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-14",
1000
+ "data_dictionary": [
1001
+ {
1002
+ "variable": [
1003
+ "movie_name",
1004
+ "release_year",
1005
+ "director",
1006
+ "age_difference",
1007
+ "couple_number",
1008
+ "actor_1_name",
1009
+ "actor_2_name",
1010
+ "character_1_gender",
1011
+ "character_2_gender",
1012
+ "actor_1_birthdate",
1013
+ "actor_2_birthdate",
1014
+ "actor_1_age",
1015
+ "actor_2_age"
1016
+ ],
1017
+ "class": [
1018
+ "character",
1019
+ "integer",
1020
+ "character",
1021
+ "integer",
1022
+ "integer",
1023
+ "character",
1024
+ "character",
1025
+ "character",
1026
+ "character",
1027
+ "date",
1028
+ "date",
1029
+ "integer",
1030
+ "integer"
1031
+ ],
1032
+ "description": [
1033
+ "Name of the film",
1034
+ "Release year",
1035
+ "Director of the film",
1036
+ "Age difference between the characters in whole years",
1037
+ "An identifier for the couple in case multiple couples are listed for this film",
1038
+ "The name of the older actor in this couple",
1039
+ "The name of the younger actor in this couple",
1040
+ "The gender of the older character, as identified by the person who submitted the data for this couple",
1041
+ "The gender of the younger character, as identified by the person who submitted the data for this couple",
1042
+ "The birthdate of the older member of the couple",
1043
+ "The birthdate of the younger member of the couple",
1044
+ "The age of the older actor when the film was released",
1045
+ "The age of the younger actor when the film was released"
1046
+ ]
1047
+ }
1048
+ ],
1049
+ "data": {
1050
+ "file_name": [
1051
+ "age_gaps.csv"
1052
+ ],
1053
+ "file_url": [
1054
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-14/age_gaps.csv"
1055
+ ]
1056
+ },
1057
+ "data_load": {
1058
+ "file_name": [
1059
+ "age_gaps.csv"
1060
+ ],
1061
+ "file_url": [
1062
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv"
1063
+ ]
1064
+ }
1065
+ },
1066
+ {
1067
+ "date_posted": "2023-08-15",
1068
+ "project_name": "Spam E-mail",
1069
+ "project_source": [
1070
+ "https://vincentarelbundock.github.io/Rdatasets/index.html",
1071
+ "https://archive.ics.uci.edu/dataset/94/spambase",
1072
+ "https://search.r-project.org/CRAN/refmans/kernlab/html/spam.html",
1073
+ "https://vincentarelbundock.github.io/Rdatasets/doc/DAAG/spam7.html"
1074
+ ],
1075
+ "description": "The data this week comes from Vincent Arel-Bundock's Rdatasets package (https://vincentarelbundock.github.io/Rdatasets/index.html). Rdatasets is a collection of 2246 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development. We're working with the spam email dataset. This is a subset of the spam e-mail database. This is a dataset collected at Hewlett-Packard Labs by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt and shared with the UCI Machine Learning Repository. The dataset classifies 4601 e-mails as spam or non-spam, with additional variables indicating the frequency of certain words and characters in the e-mail. First column was removed.",
1076
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15",
1077
+ "data_dictionary": [
1078
+ {
1079
+ "variable": [
1080
+ "crl.tot",
1081
+ "dollar",
1082
+ "bang",
1083
+ "money",
1084
+ "n000",
1085
+ "make",
1086
+ "yesno"
1087
+ ],
1088
+ "class": [
1089
+ "double",
1090
+ "double",
1091
+ "double",
1092
+ "double",
1093
+ "double",
1094
+ "double",
1095
+ "character"
1096
+ ],
1097
+ "description": [
1098
+ "Total length of uninterrupted sequences of capitals",
1099
+ "Occurrences of the dollar sign, as percent of total number of characters",
1100
+ "Occurrences of ‘!’, as percent of total number of characters",
1101
+ "Occurrences of ‘money’, as percent of total number of characters",
1102
+ "Occurrences of the string ‘000’, as percent of total number of words",
1103
+ "Occurrences of ‘make’, as a percent of total number of words",
1104
+ "Outcome variable, a factor with levels 'n' not spam, 'y' spam"
1105
+ ]
1106
+ }
1107
+ ],
1108
+ "data": {
1109
+ "file_name": [
1110
+ "spam.csv"
1111
+ ],
1112
+ "file_url": [
1113
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv"
1114
+ ]
1115
+ },
1116
+ "data_load": {
1117
+ "file_name": [
1118
+ "spam.csv"
1119
+ ],
1120
+ "file_url": [
1121
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-15/spam.csv"
1122
+ ]
1123
+ }
1124
+ },
1125
+ {
1126
+ "date_posted": "2023-03-07",
1127
+ "project_name": "Numbats in Australia",
1128
+ "project_source": [
1129
+ "/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-07/data/numbats.csv",
1130
+ "https://www.ala.org.au",
1131
+ "https://github.com/numbats/numbats-tidytuesday",
1132
+ "https://bie.ala.org.au/species/https://biodiversity.org.au/afd/taxa/6c72d199-f0f1-44d3-8197-224a2f7cff5f"
1133
+ ],
1134
+ "description": "The data this week comes from the Atlas of Living Australia. Thanks to Di Cook for preparing this week's dataset! This Numbat page at the Atlas of Living Australia talks about these endangered species in greater detail. A csv file of numbat sightings is provided. The code to refresh the data is below. Questions that would be interesting to answer are:",
1135
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-07",
1136
+ "data_dictionary": [
1137
+ {
1138
+ "variable": [
1139
+ "decimalLatitude",
1140
+ "decimalLongitude",
1141
+ "eventDate",
1142
+ "scientificName",
1143
+ "taxonConceptID",
1144
+ "recordID",
1145
+ "dataResourceName",
1146
+ "year",
1147
+ "month",
1148
+ "wday",
1149
+ "hour",
1150
+ "day",
1151
+ "dryandra",
1152
+ "prcp",
1153
+ "tmax",
1154
+ "tmin"
1155
+ ],
1156
+ "class": [
1157
+ "double",
1158
+ "double",
1159
+ "datetime",
1160
+ "factor",
1161
+ "factor",
1162
+ "character",
1163
+ "factor",
1164
+ "integer",
1165
+ "factor",
1166
+ "factor",
1167
+ "integer",
1168
+ "date",
1169
+ "logical",
1170
+ "double",
1171
+ "double",
1172
+ "double"
1173
+ ],
1174
+ "description": [
1175
+ "decimalLatitude",
1176
+ "decimalLongitude",
1177
+ "eventDate",
1178
+ "Either \\\"Myrmecobius fasciatus\\\" or \\\"Myrmecobius fasciatus rufus\\\"",
1179
+ "The URL for this (sub)species",
1180
+ "recordID",
1181
+ "dataResourceName",
1182
+ "The 4-digit year of the event (when available)",
1183
+ "The 3-letter month abbreviation of the event (when available)",
1184
+ "The 3-letter weekday abbreviation of the event (when available)",
1185
+ "The hour of the event (when available)",
1186
+ "The date of the event (when available)",
1187
+ "Whether the observation was in Dryandra Woodland",
1188
+ "Precipitation on that day in Dryandra Woodland (when relevant), in millimeters",
1189
+ "Maximum temperature on that day in Dryandra Woodland (when relevant), in degrees Celsius",
1190
+ "Minimum temperature on that day in Dryandra Woodland (when relevant), in degrees Celsius"
1191
+ ]
1192
+ }
1193
+ ],
1194
+ "data": {
1195
+ "file_name": [
1196
+ "numbats.csv"
1197
+ ],
1198
+ "file_url": [
1199
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-03-07/numbats.csv"
1200
+ ]
1201
+ },
1202
+ "data_load": {
1203
+ "file_name": [
1204
+ "numbats.csv"
1205
+ ],
1206
+ "file_url": [
1207
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-07/numbats.csv"
1208
+ ]
1209
+ }
1210
+ },
1211
+ {
1212
+ "date_posted": "2023-11-28",
1213
+ "project_name": "Doctor Who Episodes",
1214
+ "project_source": [
1215
+ "https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(2005%E2%80%93present)",
1216
+ "https://github.com/KittJonathan/datardis/tree/main/misc",
1217
+ "https://cran.r-project.org/package=datardis",
1218
+ "https://github.com/KittJonathan/datardis"
1219
+ ],
1220
+ "description": "Doctor Who is an extremely long-running British television program. The show was revived in 2005, and has proven very popular since then. To celebrate this year's 60th anniversary of Doctor Who, we have three datasets. The data this week comes from Wikipedia's [List of Doctor Who episodes](https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(2005%E2%80%93present)) via the {datardis} package by Jonathan Kitt. Thank you to Jonathan for compiling and sharing this data! As of 2023-11-24, the data only includes episodes from the \"revived\" era. For an added challenge, consider submitting a pull request to the {datardis} package to update the data-extraction scripts to also fetch the \"classic\" era data! Clean data from the {datardis} package.",
1221
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28",
1222
+ "data_dictionary": [
1223
+ {
1224
+ "variable": [
1225
+ "era",
1226
+ "season_number",
1227
+ "serial_title",
1228
+ "story_number",
1229
+ "episode_number",
1230
+ "episode_title",
1231
+ "type",
1232
+ "first_aired",
1233
+ "production_code",
1234
+ "uk_viewers",
1235
+ "rating",
1236
+ "duration"
1237
+ ],
1238
+ "class": [
1239
+ "character",
1240
+ "double",
1241
+ "character",
1242
+ "character",
1243
+ "double",
1244
+ "character",
1245
+ "character",
1246
+ "double",
1247
+ "character",
1248
+ "double",
1249
+ "double",
1250
+ "double"
1251
+ ],
1252
+ "description": [
1253
+ "Whether the episode is in the \\\"classic\\\" or \\\"revived\\\" era. All data in this dataset is within the \\\"revived\\\" era.",
1254
+ "The season number within the era. Note that some episodes are outside of a season.",
1255
+ "Serial title if available",
1256
+ "Story number",
1257
+ "Episode number in season",
1258
+ "Episode title",
1259
+ "\\\"episode\\\" or \\\"special\\\"",
1260
+ "Date the episode first aired in the U.K.",
1261
+ "Episode's production code if available",
1262
+ "Number of U.K. viewers (millions)",
1263
+ "Episode's rating",
1264
+ "Episode's duration in minutes"
1265
+ ]
1266
+ },
1267
+ {
1268
+ "variable": [
1269
+ "story_number",
1270
+ "director"
1271
+ ],
1272
+ "class": [
1273
+ "character",
1274
+ "character"
1275
+ ],
1276
+ "description": [
1277
+ "Story number",
1278
+ "Episode's director"
1279
+ ]
1280
+ },
1281
+ {
1282
+ "variable": [
1283
+ "story_number",
1284
+ "writer"
1285
+ ],
1286
+ "class": [
1287
+ "character",
1288
+ "character"
1289
+ ],
1290
+ "description": [
1291
+ "Story number",
1292
+ "Episode's writer"
1293
+ ]
1294
+ }
1295
+ ],
1296
+ "data": {
1297
+ "file_name": [
1298
+ "drwho_directors.csv",
1299
+ "drwho_episodes.csv",
1300
+ "drwho_writers.csv"
1301
+ ],
1302
+ "file_url": [
1303
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_directors.csv",
1304
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_episodes.csv",
1305
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-28/drwho_writers.csv"
1306
+ ]
1307
+ },
1308
+ "data_load": {
1309
+ "file_name": [
1310
+ "drwho_directors.csv",
1311
+ "drwho_episodes.csv",
1312
+ "drwho_writers.csv"
1313
+ ],
1314
+ "file_url": [
1315
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-28/drwho_directors.csv",
1316
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-28/drwho_episodes.csv",
1317
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-28/drwho_writers.csv"
1318
+ ]
1319
+ }
1320
+ },
1321
+ {
1322
+ "date_posted": "2023-11-14",
1323
+ "project_name": "Diwali Sales Data",
1324
+ "project_source": [
1325
+ "https://www.kaggle.com/code/bhushanshelke69/diwali-data-exploration",
1326
+ "https://github.com/vikasvachheta08/Diwali_Sales_Analysis_Using_Python",
1327
+ "https://www.kaggle.com/datasets/saadharoon27/diwali-sales-dataset"
1328
+ ],
1329
+ "description": "This week is Diwali, the festival of lights! The data this week comes from sales data for a retail store during the Diwali festival period in India. The data is shared on Kaggle by Saad Haroon. This week we're sharing Python data analysis examples! There are a few out there, but these ones from Brushan Shelke or Vikas Vachheta (see the Diwali_Sales_Analysis.ipynb file for the code) are some data exploration analyses. Data was downloaded from Kaggle, and the Status and unnamed1 columns removed.",
1330
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-14",
1331
+ "data_dictionary": [
1332
+ {
1333
+ "variable": [
1334
+ "User_ID",
1335
+ "Cust_name",
1336
+ "Product_ID",
1337
+ "Gender",
1338
+ "Age Group",
1339
+ "Age",
1340
+ "Marital_Status",
1341
+ "State",
1342
+ "Zone",
1343
+ "Occupation",
1344
+ "Product_Category",
1345
+ "Orders",
1346
+ "Amount"
1347
+ ],
1348
+ "class": [
1349
+ "double",
1350
+ "character",
1351
+ "character",
1352
+ "character",
1353
+ "character",
1354
+ "double",
1355
+ "double",
1356
+ "character",
1357
+ "character",
1358
+ "character",
1359
+ "character",
1360
+ "double",
1361
+ "double"
1362
+ ],
1363
+ "description": [
1364
+ "User identification number",
1365
+ "Customer name",
1366
+ "Product identification number",
1367
+ "Gender of the customer (e.g. Male, Female)",
1368
+ "Age group of the customer",
1369
+ "Age of the customer",
1370
+ "Marital status of the customer (e.g. Married, Single)",
1371
+ "State of the customer",
1372
+ "Geographic zone of the customer",
1373
+ "Occupation of the customer",
1374
+ "Category of the product",
1375
+ "Number of orders made by the customer",
1376
+ "Amount in Indian rupees spent by the customer"
1377
+ ]
1378
+ }
1379
+ ],
1380
+ "data": {
1381
+ "file_name": [
1382
+ "diwali_sales_data.csv"
1383
+ ],
1384
+ "file_url": [
1385
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-14/diwali_sales_data.csv"
1386
+ ]
1387
+ },
1388
+ "data_load": {
1389
+ "file_name": [
1390
+ "diwali_sales_data.csv"
1391
+ ],
1392
+ "file_url": [
1393
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-14/diwali_sales_data.csv"
1394
+ ]
1395
+ }
1396
+ },
1397
+ {
1398
+ "date_posted": "2023-12-12",
1399
+ "project_name": "Holiday Movies",
1400
+ "project_source": [
1401
+ "https://networkdatascience.ceu.edu/article/2019-12-16/christmas-movies",
1402
+ "https://developer.imdb.com/non-commercial-datasets/"
1403
+ ],
1404
+ "description": "Happy holidays! This week we're exploring \"holiday\" movies: movies with \"holiday\", \"Christmas\", \"Hanukkah\", or \"Kwanzaa\" (or variants thereof) in their title! The data this week comes from the Internet Movie Database. We don't have an article using exactly this dataset, but you might get inspiration from this Christmas Movies blog post by Milán Janosov at Central European University.",
1405
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-12-12",
1406
+ "data_dictionary": [
1407
+ {
1408
+ "variable": [
1409
+ "tconst",
1410
+ "title_type",
1411
+ "primary_title",
1412
+ "original_title",
1413
+ "year",
1414
+ "runtime_minutes",
1415
+ "genres",
1416
+ "simple_title",
1417
+ "average_rating",
1418
+ "num_votes",
1419
+ "christmas",
1420
+ "hanukkah",
1421
+ "kwanzaa",
1422
+ "holiday"
1423
+ ],
1424
+ "class": [
1425
+ "character",
1426
+ "character",
1427
+ "character",
1428
+ "character",
1429
+ "double",
1430
+ "double",
1431
+ "character",
1432
+ "character",
1433
+ "double",
1434
+ "double",
1435
+ "logical",
1436
+ "logical",
1437
+ "logical",
1438
+ "logical"
1439
+ ],
1440
+ "description": [
1441
+ "alphanumeric unique identifier of the title",
1442
+ "the type/format of the title (movie, video, or tvMovie)",
1443
+ "the more popular title / the title used by the filmmakers on promotional materials at the point of release",
1444
+ "original title, in the original language",
1445
+ "the release year of a title",
1446
+ "primary runtime of the title, in minutes",
1447
+ "includes up to three genres associated with the title (comma-delimited)",
1448
+ "the title in lowercase, with punctuation removed, for easier filtering and grouping",
1449
+ "weighted average of all the individual user ratings on IMDb",
1450
+ "number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)",
1451
+ "whether the title includes \\\"christmas\\\", \\\"xmas\\\", \\\"x mas\\\", etc",
1452
+ "whether the title includes \\\"hanukkah\\\", \\\"chanukah\\\", etc",
1453
+ "whether the title includes \\\"kwanzaa\\\"",
1454
+ "whether the title includes the word \\\"holiday\\\""
1455
+ ]
1456
+ },
1457
+ {
1458
+ "variable": [
1459
+ "tconst",
1460
+ "genres"
1461
+ ],
1462
+ "class": [
1463
+ "character",
1464
+ "character"
1465
+ ],
1466
+ "description": [
1467
+ "alphanumeric unique identifier of the title",
1468
+ "genres associated with the title, one row per genre"
1469
+ ]
1470
+ }
1471
+ ],
1472
+ "data": {
1473
+ "file_name": [
1474
+ "holiday_movie_genres.csv",
1475
+ "holiday_movies.csv"
1476
+ ],
1477
+ "file_url": [
1478
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-12-12/holiday_movie_genres.csv",
1479
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-12-12/holiday_movies.csv"
1480
+ ]
1481
+ },
1482
+ "data_load": {
1483
+ "file_name": [
1484
+ "holiday_movie_genres.csv",
1485
+ "holiday_movies.csv"
1486
+ ],
1487
+ "file_url": [
1488
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-12-12/holiday_movie_genres.csv",
1489
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-12-12/holiday_movies.csv"
1490
+ ]
1491
+ }
1492
+ },
1493
+ {
1494
+ "date_posted": "2024-02-13",
1495
+ "project_name": "Valentine's Day Consumer Data",
1496
+ "project_source": [
1497
+ "https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-01-25",
1498
+ "https://nrf.com/research-insights/holiday-data-and-trends/valentines-day/valentines-day-data-center",
1499
+ "https://www.kaggle.com/datasets/infinator/happy-valentines-day-2022",
1500
+ "https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-01-18"
1501
+ ],
1502
+ "description": "Happy Valentine's Day! This week we're exploring Valentine's Day survey data. The National Retail Federation in the United States conducts surveys and has created a Valentine's Day Data Center so you can explore the data on how consumers celebrate. The NRF has surveyed consumers about how they plan to celebrate Valentine’s Day annually for over a decade. Take a deeper dive into the data from the last 10 years, and use the interactive charts to explore a demographic breakdown of total spending, average spending, types of gifts planned and spending per type of gift. The NRF has continued to collect data. The data for this week is from 2010 to 2022, as organized by Suraj Das for a Kaggle dataset. In the historical surveys gender was collected as only 'Men' and 'Women', which does not accurately include all genders. If you're looking for other Valentine's Day type datasets, check out previous datasets on chocolate or board games (a good Valentine's Day activity!). Data was downloaded from Sunja aa Kaggle dataset. Data from historical_gift_trends_per_person_spending.csv, historical_spending_average_expected_spending.csv and historical_spending_percent_celebrating.csv were combined into historical_spending.csv. Data from planned_gifts_age.csv and spending_or_celebrating_age_1.csv were combined into gifts_age.csv. Data from planned_gifts_gender.csv and spending_or_celebrating_gender_1.csv were combined into gifts_gender.csv. Percentage signs and dollar signs were removed from all numerical values.",
1503
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-02-13",
1504
+ "data_dictionary": [
1505
+ {
1506
+ "variable": [
1507
+ "Year",
1508
+ "PercentCelebrating",
1509
+ "PerPerson",
1510
+ "Candy",
1511
+ "Flowers",
1512
+ "Jewelry",
1513
+ "GreetingCards",
1514
+ "EveningOut",
1515
+ "Clothing",
1516
+ "GiftCards"
1517
+ ],
1518
+ "class": [
1519
+ "double",
1520
+ "double",
1521
+ "double",
1522
+ "double",
1523
+ "double",
1524
+ "double",
1525
+ "double",
1526
+ "double",
1527
+ "double",
1528
+ "double"
1529
+ ],
1530
+ "description": [
1531
+ "Year",
1532
+ "Percent of people celebrating Valentines Day",
1533
+ "Average amount each person is spending",
1534
+ "Average amount spending on candy",
1535
+ "Average amount spending on flowers",
1536
+ "Average amount spending on jewelry",
1537
+ "Average amount spending on greeting cards",
1538
+ "Average amount spending on an evening out",
1539
+ "Average amount spending on clothing",
1540
+ "Average amount spending on gift cards"
1541
+ ]
1542
+ },
1543
+ {
1544
+ "variable": [
1545
+ "Age",
1546
+ "SpendingCelebrating",
1547
+ "Candy",
1548
+ "Flowers",
1549
+ "Jewelry",
1550
+ "GreetingCards",
1551
+ "EveningOut",
1552
+ "Clothing",
1553
+ "GiftCards"
1554
+ ],
1555
+ "class": [
1556
+ "character",
1557
+ "double",
1558
+ "double",
1559
+ "double",
1560
+ "double",
1561
+ "double",
1562
+ "double",
1563
+ "double",
1564
+ "double"
1565
+ ],
1566
+ "description": [
1567
+ "Age",
1568
+ "Percent spending money on or celebrating Valentines Day",
1569
+ "Average percent spending on candy",
1570
+ "Average percent spending on flowers",
1571
+ "Average percent spending on jewelry",
1572
+ "Average percent spending on greeting cards",
1573
+ "Average percent spending on an evening out",
1574
+ "Average percent spending on clothing",
1575
+ "Average percent spending on gift cards"
1576
+ ]
1577
+ },
1578
+ {
1579
+ "variable": [
1580
+ "Gender",
1581
+ "SpendingCelebrating",
1582
+ "Candy",
1583
+ "Flowers",
1584
+ "Jewelry",
1585
+ "GreetingCards",
1586
+ "EveningOut",
1587
+ "Clothing",
1588
+ "GiftCards"
1589
+ ],
1590
+ "class": [
1591
+ "character",
1592
+ "double",
1593
+ "double",
1594
+ "double",
1595
+ "double",
1596
+ "double",
1597
+ "double",
1598
+ "double",
1599
+ "double"
1600
+ ],
1601
+ "description": [
1602
+ "Gender only including Men or Women",
1603
+ "Percent spending money on or celebrating Valentines Day",
1604
+ "Average percent spending on candy",
1605
+ "Average percent spending on flowers",
1606
+ "Average percent spending on jewelry",
1607
+ "Average percent spending on greeting cards",
1608
+ "Average percent spending on an evening out",
1609
+ "Average percent spending on clothing",
1610
+ "Average percent spending on gift cards"
1611
+ ]
1612
+ }
1613
+ ],
1614
+ "data": {
1615
+ "file_name": [
1616
+ "gifts_age.csv",
1617
+ "gifts_gender.csv",
1618
+ "historical_spending.csv"
1619
+ ],
1620
+ "file_url": [
1621
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-02-13/gifts_age.csv",
1622
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-02-13/gifts_gender.csv",
1623
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-02-13/historical_spending.csv"
1624
+ ]
1625
+ },
1626
+ "data_load": {
1627
+ "file_name": [
1628
+ "gifts_age.csv",
1629
+ "gifts_gender.csv",
1630
+ "historical_spending.csv"
1631
+ ],
1632
+ "file_url": [
1633
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-02-13/gifts_age.csv",
1634
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-02-13/gifts_gender.csv",
1635
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-02-13/historical_spending.csv"
1636
+ ]
1637
+ }
1638
+ },
1639
+ {
1640
+ "date_posted": "2023-08-08",
1641
+ "project_name": "Hot Ones Episodes",
1642
+ "project_source": [
1643
+ "https://en.wikipedia.org/wiki/List_of_Hot_Ones_episodes",
1644
+ "https://github.com/borstell",
1645
+ "https://github.com/rfordatascience/tidytuesday/issues/591",
1646
+ "https://en.wikipedia.org/wiki/Hot_Ones"
1647
+ ],
1648
+ "description": "The data this week comes from Wikipedia articles: Hot Ones and List of Hot Ones episodes. Thank you to Carl Börstell for the suggestion and cleaning script! Hot Ones is an American YouTube talk show, created by Chris Schonberger, hosted by Sean Evans and produced by First We Feast and Complex Media. Its basic premise involves celebrities being interviewed by Evans over a platter of increasingly spicy chicken wings.",
1649
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08",
1650
+ "data_dictionary": [
1651
+ {
1652
+ "variable": [
1653
+ "season",
1654
+ "episode_overall",
1655
+ "episode_season",
1656
+ "title",
1657
+ "original_release",
1658
+ "guest",
1659
+ "guest_appearance_number",
1660
+ "finished"
1661
+ ],
1662
+ "class": [
1663
+ "integer",
1664
+ "integer",
1665
+ "integer",
1666
+ "character",
1667
+ "date",
1668
+ "character",
1669
+ "integer",
1670
+ "logical"
1671
+ ],
1672
+ "description": [
1673
+ "The season number.",
1674
+ "The overall count of this episode, from 1-300.",
1675
+ "The count of this episode within this season.",
1676
+ "The title of the episode.",
1677
+ "The date on which the episode was originally available on YouTube.",
1678
+ "The name of the guest.",
1679
+ "The number of appearances by this guest so far as of this date.",
1680
+ "Whether the guest finished trying all of the sauces."
1681
+ ]
1682
+ },
1683
+ {
1684
+ "variable": [
1685
+ "season",
1686
+ "sauce_number",
1687
+ "sauce_name",
1688
+ "scoville"
1689
+ ],
1690
+ "class": [
1691
+ "integer",
1692
+ "integer",
1693
+ "character",
1694
+ "integer"
1695
+ ],
1696
+ "description": [
1697
+ "The season number.",
1698
+ "The number of this sauce, from 1 (least hot) to 10 (hottest).",
1699
+ "The name of the sauce.",
1700
+ "The rating of the sauce in Scoville heat units."
1701
+ ]
1702
+ },
1703
+ {
1704
+ "variable": [
1705
+ "season",
1706
+ "episodes",
1707
+ "note",
1708
+ "original_release",
1709
+ "last_release"
1710
+ ],
1711
+ "class": [
1712
+ "integer",
1713
+ "integer",
1714
+ "character",
1715
+ "date",
1716
+ "date"
1717
+ ],
1718
+ "description": [
1719
+ "The season number.",
1720
+ "The count of episodes in this season.",
1721
+ "Notes about this season.",
1722
+ "The date of the first episode in this season.",
1723
+ "The date of the last episode of this season (if that episode has aired at the time of scraping)."
1724
+ ]
1725
+ }
1726
+ ],
1727
+ "data": {
1728
+ "file_name": [
1729
+ "episodes.csv",
1730
+ "sauces.csv",
1731
+ "seasons.csv"
1732
+ ],
1733
+ "file_url": [
1734
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08/episodes.csv",
1735
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08/sauces.csv",
1736
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-08/seasons.csv"
1737
+ ]
1738
+ },
1739
+ "data_load": {
1740
+ "file_name": [
1741
+ "episodes.csv",
1742
+ "sauces.csv",
1743
+ "seasons.csv"
1744
+ ],
1745
+ "file_url": [
1746
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv",
1747
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv",
1748
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/seasons.csv"
1749
+ ]
1750
+ }
1751
+ },
1752
+ {
1753
+ "date_posted": "2023-07-25",
1754
+ "project_name": "Scurvy",
1755
+ "project_source": [
1756
+ "https://github.com/higgi13425/medicaldata/tree/master/data-raw",
1757
+ "https://htmlpreview.github.io/?https://github.com/higgi13425/medicaldata/blob/master/man/description_docs/scurvy_desc.html",
1758
+ "https://higgi13425.github.io/medicaldata/"
1759
+ ],
1760
+ "description": "The data this week comes from the medicaldata R package. This is a data package from Peter Higgins, with 19 medical datasets for teaching Reproducible Medical Research with R. We're using the scurvy dataset. Source: This data set is from a study published in 1757 in A Treatise on the Scurvy in Three Parts, by James Lind. This data set contains 12 participants with scurvy. In 1757, it was not known that scurvy is a manifestation of vitamin C deficiency. A variety of remedies had been anecdotally reported, but Lind was the first to test different regimens of acidic substances (including citrus fruits) against each other in a randomized, controlled trial. 6 distinct therapies were tested in 12 seamen with symptomatic scurvy, who were selected for similar severity. Six days of therapy were provided, and endpoints were reported in the text at the end of 6 days. These include rotting of the gums, skin sores, weakness of the knees, and lassitude, which are described in terms of severity. These have been translated into Likert scales from 0 (none) to 3 (severe). A dichotomous endpoint, fitness for duty, was also reported. Scurvy was a common affliction of seamen on long voyages, leading to mouth sores, skin lesions, weakness of the knees, and lassitude. Scurvy could be fatal on long voyages. James Lind reported the treatment of 12 seamen with scurvy in 1757, in A Treatise on the Scurvy in Three Parts. This 476 page bloviation can be found scanned to the Google Books website A Treatise on the Scurvy. Pages 149-153 are a rare gem among what can be generously described as 400+ pages of evidence-free blathering, and these 4 pages may represent the first report of a controlled clinical trial. Lind was the ship’s surgeon on board the HMS Salisbury, and had a number of scurvy-affected seamen at his disposal. Many remedies had been described and advocated for, with no more than anecdotal evidence. On May 20, 1747, Lind decided to try the 6 therapies on the Salisbury in a comparative study in 12 affected seamen. He selected 12 with roughly similar severity, with notable skin and mouth sores, weakness of the knees, and significant lassitude, making them unfit for duty. They each received the standard shipboard diet of gruel and mutton broth, supplemented with occasional biscuits and puddings. Each treatment was a dietary supplement (including citrus fruits) or a medicinal. This data frame was reconstructed from Lind’s account as recorded on these 4 pages, with his estimates of severity translated to a 4 point Likert scale (0-3) for each of the symptoms he described at his chosen endpoint on day 6. A somewhat fanciful study_id variable was added, along with detailed descriptions of the dosing schedule of each treatment. Of note, there is some dispute about whether this was truly the first clinical trial, or whether it actually happened, as there are no contemporaneous corroborating accounts. See link about the historical debate. Lind reported that the seamen treated with 2 lemons and an orange daily did best, followed by those treated with cider. Those treated with elixir of vitriol only had improvement in mouth sores. One imagines that acidic substances (like dilute sulfuric acid, vinegar, cider, and citrus fruits) might have been rather painful on these mouth sores. Unfortunately, the burial of the 4 valuable pages of data in 476 pages of noise, a publication delay of 10 years, and Lind’s half-hearted conclusions (he was focused on acidity), meant that it took until 1795 before the British Navy mandated daily limes for seamen. The first column was removed from the scurvy.csv file available at https://github.com/higgi13425/medicaldata/tree/master/data-raw.",
1761
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-07-25",
1762
+ "data_dictionary": [
1763
+ {
1764
+ "variable": [
1765
+ "study_id",
1766
+ "treatment",
1767
+ "dosing_regimen_for_scurvy",
1768
+ "gum_rot_d6",
1769
+ "skin_sores_d6",
1770
+ "weakness_of_the_knees_d6",
1771
+ "lassitude_d6",
1772
+ "fit_for_duty_d6"
1773
+ ],
1774
+ "class": [
1775
+ "double",
1776
+ "character",
1777
+ "character",
1778
+ "character",
1779
+ "character",
1780
+ "character",
1781
+ "character",
1782
+ "character"
1783
+ ],
1784
+ "description": [
1785
+ "Participant ID",
1786
+ "Treatment; cider, dilute_sulfuric_acid, vinegar, sea_water, citrus, purgative_mixture",
1787
+ "Dosing Regimen; 1 quart per day; 25 drops of elixir of vitriol, three times a day; two spoonfuls, three times daily; half pint daily; two lemons and an orange daily; a nutmeg-sized paste of garlic, mustard seed, horseradish, balsam of Peru, and gum myrrh three times a day",
1788
+ "Gum Rot on Day 6; 0_none, 1_mild, 2_moderate, 3_severe",
1789
+ "Skin Sores on Day 6; 0_none, 1_mild, 2_moderate, 3_severe",
1790
+ "Weakness of the Knees on Day 6; 0_none, 1_mild, 2_moderate, 3_severe",
1791
+ "Lassitude on Day 6; 0_none, 1_mild, 2_moderate, 3_severe",
1792
+ "Fit for Duty on Day 6; 0_no, 1_yes"
1793
+ ]
1794
+ }
1795
+ ],
1796
+ "data": {
1797
+ "file_name": [
1798
+ "scurvy.csv"
1799
+ ],
1800
+ "file_url": [
1801
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-07-25/scurvy.csv"
1802
+ ]
1803
+ },
1804
+ "data_load": {
1805
+ "file_name": [
1806
+ "scurvy.csv"
1807
+ ],
1808
+ "file_url": [
1809
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-07-25/scurvy.csv"
1810
+ ]
1811
+ }
1812
+ },
1813
+ {
1814
+ "date_posted": "2023-11-07",
1815
+ "project_name": "US House Election Results",
1816
+ "project_source": [
1817
+ "https://electionlab.mit.edu/",
1818
+ "https://electionlab.mit.edu/articles/new-report-how-we-voted-2022",
1819
+ "https://docs.posit.co/ide/user/ide/guide/tools/copilot.html",
1820
+ "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2"
1821
+ ],
1822
+ "description": "It's election day in the United States! To celebrate, the data this week comes from the MIT Election Data and Science Lab (MEDSL). Hat tip this week to the RStudio GitHub Copilot integration, which suggested the MEDSL. From the MEDSL's report New Report: How We Voted in 2022: The Survey of the Performance of American Elections (SPAE) provides information about how Americans experienced voting in the most recent federal election. The survey has been conducted after federal elections since 2008, and is the only public opinion project in the country that is dedicated explicitly to understanding how voters themselves experience the election process. We're specifically providing data on House elections from 1976-2022. Check out the MEDSL website for additional datasets and tools. Be sure to cite the MEDSL in your work. Clean data and dictionary downloaded from the Harvard Dataverse.",
1823
+ "data_source_url": "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-07",
1824
+ "data_dictionary": [
1825
+ {
1826
+ "variable": [
1827
+ "year",
1828
+ "state",
1829
+ "state_po",
1830
+ "state_fips",
1831
+ "state_cen",
1832
+ "state_ic",
1833
+ "office",
1834
+ "district",
1835
+ "stage",
1836
+ "runoff",
1837
+ "special",
1838
+ "candidate",
1839
+ "party",
1840
+ "writein",
1841
+ "mode",
1842
+ "candidatevotes",
1843
+ "totalvotes",
1844
+ "unofficial",
1845
+ "version",
1846
+ "fusion_ticket"
1847
+ ],
1848
+ "class": [
1849
+ "double",
1850
+ "character",
1851
+ "character",
1852
+ "double",
1853
+ "double",
1854
+ "double",
1855
+ "character",
1856
+ "character",
1857
+ "character",
1858
+ "logical",
1859
+ "logical",
1860
+ "character",
1861
+ "character",
1862
+ "logical",
1863
+ "character",
1864
+ "double",
1865
+ "double",
1866
+ "logical",
1867
+ "double",
1868
+ "logical"
1869
+ ],
1870
+ "description": [
1871
+ "year in which election was held",
1872
+ "state name",
1873
+ "U.S. postal code state abbreviation",
1874
+ "State FIPS code",
1875
+ "U.S. Census state code",
1876
+ "ICPSR state code",
1877
+ "U.S. House (constant)",
1878
+ "district number. At-large districts are coded as 0 (zero)",
1879
+ "electoral stage (gen = general elections, pri = primary elections)",
1880
+ "runoff election",
1881
+ "special election",
1882
+ "name of the candidate as it appears in the House Clerk report",
1883
+ "party of the candidate (always entirely lowercase) (Parties are as they appear in the House Clerk report. In states that allow candidates to appear on multiple party lines, separate vote totals are indicated for each party. Therefore, for analysis that involves candidate totals, it will be necessary to aggregate across all party lines within a district. For analysis that focuses on two-party vote totals, it will be necessary to account for major party candidates who receive votes under multiple party labels. Minnesota party labels are given as they appear on the Minnesota ballots. Future versions of this file will include codes for candidates who are endorsed by major parties, regardless of the party label under which they receive votes.)",
1884
+ "vote totals associated with write-in candidates",
1885
+ "mode of voting; states with data that doesn't break down returns by mode are marked as \\\"total\\\"",
1886
+ "votes received by this candidate for this particular party",
1887
+ "total number of votes cast for this election",
1888
+ "TRUE/FALSE indicator for unofficial result (to be updated later); this appears only for 2018 data in some cases",
1889
+ "date when this dataset was finalized",
1890
+ "A TRUE/FALSE indicator as to whether the given candidate is running on a fusion party ticket, which will in turn mean that a candidate will appear multiple times, but by different parties, for a given election. States with fusion tickets include Connecticut, New Jersey, New York, and South Carolina."
1891
+ ]
1892
+ }
1893
+ ],
1894
+ "data": {
1895
+ "file_name": [
1896
+ "house.csv"
1897
+ ],
1898
+ "file_url": [
1899
+ "https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-11-07/house.csv"
1900
+ ]
1901
+ },
1902
+ "data_load": {
1903
+ "file_name": [
1904
+ "house.csv"
1905
+ ],
1906
+ "file_url": [
1907
+ "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-07/house.csv"
1908
+ ]
1909
+ }
1910
+ }
1911
+ ]
data/variables.csv ADDED
@@ -0,0 +1,411 @@
1
+ dataset_id,dataset,variable_id,variable,type,description
2
+ 1,tournaments,1,key_id,integer,The unique ID number for the observation.
3
+ 1,tournaments,2,tournament_id,text,"The unique ID number for the tournament. Has the format {WC-####}, where the number is the year of the tournament."
4
+ 1,tournaments,3,tournament_name,text,The name of the tournament.
5
+ 1,tournaments,4,year,integer,The year of the tournament.
6
+ 1,tournaments,5,start_date,date,The start date of the tournament in the format {YYYY-MM-DD}.
7
+ 1,tournaments,6,end_date,date,The end date of the tournament in the format {YYYY-MM-DD}.
8
+ 1,tournaments,7,host_country,text,The unique ID number for the country that hosted the tournament. References {team_id} in the {teams} dataset.
9
+ 1,tournaments,8,winner,text,The name of the team that won the tournament.
10
+ 1,tournaments,9,host_won,boolean,Whether one of the host countries won the tournament. Coded {1} if one of the host countries won and {0} otherwise.
11
+ 1,tournaments,10,count_teams,integer,The number of teams that participated in the tournament.
12
+ 1,tournaments,11,group_stage,boolean,Whether there was a group stage. Coded {1} if there was a group stage and {0} otherwise.
13
+ 1,tournaments,12,second_group_stage,boolean,Whether there was a second group stage. Coded {1} if there was a second group stage and {0} otherwise.
14
+ 1,tournaments,13,final_round,boolean,Whether there was a final round. Coded {1} if there was a final round and {0} otherwise.
15
+ 1,tournaments,14,round_of_16,boolean,Whether there was a round of 16 stage. Coded {1} if there was a round of 16 stage and {0} otherwise.
16
+ 1,tournaments,15,quarter_finals,boolean,Whether there was a quarter-finals stage. Coded {1} if there was a quarter-finals stage and {0} otherwise.
17
+ 1,tournaments,16,semi_finals,boolean,Whether there was a semi-finals stage. Coded {1} if there was a semi-finals stage and {0} otherwise.
18
+ 1,tournaments,17,third_place_match,boolean,Whether there was a third-place match. Coded {1} if there was a third-place match and {0} otherwise.
19
+ 1,tournaments,18,final,boolean,Whether there was a final match. Coded {1} if there was a final match and {0} otherwise.
20
+ 2,confederations,1,key_id,integer,The unique ID number for the observation.
21
+ 2,confederations,2,confederation_id,text,"The unique ID number for the confederation. Has the format {CF-#}, where the number is a counter that is assigned with the confederations sorted in alphabetical order."
22
+ 2,confederations,3,confederation_name,text,The name of the confederation.
23
+ 2,confederations,4,confederation_code,text,The abbreviation for the confederation.
24
+ 2,confederations,5,confederation_wikipedia_link,text,The Wikipedia link for the confederation.
25
+ 3,teams,1,key_id,integer,The unique ID number for the observation.
26
+ 3,teams,2,team_id,text,"The unique ID number for the team. Has the format {T-##}, where the number is a counter that is assigned with the data sorted by the year of the team's first tournament and then by the team's name."
27
+ 3,teams,3,team_name,text,The name of the team.
28
+ 3,teams,4,team_code,text,The 3-letter code for the team.
29
+ 3,teams,5,mens_team,boolean,Whether the country's men's team has qualified for a tournament.
30
+ 3,teams,6,womens_team,boolean,Whether the country's women's team has qualified for a tournament.
31
+ 3,teams,7,federation_name,text,The name of the team's federation.
32
+ 3,teams,8,region_name,text,The name of the region that the country is located in.
33
+ 3,teams,9,confederation_id,text,The unique ID number for the confederation. References {confederation_id} in the {confederations} dataset.
34
+ 3,teams,10,confederation_name,text,The name of the confederation.
35
+ 3,teams,11,confederation_code,text,The abbreviation for the confederation.
36
+ 3,teams,12,mens_team_wikipedia_link,text,The Wikipedia link for the country's men's team. Coded {not applicable} if the country's men's team has not qualified for a tournament.
37
+ 3,teams,13,womens_team_wikipedia_link,text,The Wikipedia link for the country's women's team. Coded {not applicable} if the country's women's team has not qualified for a tournament.
38
+ 3,teams,14,federation_wikipedia_link,text,The Wikipedia link of the team's federation.
39
+ 4,players,1,key_id,integer,The unique ID number for the observation.
40
+ 4,players,2,player_id,text,"The unique ID number for the player. Has the format {P-#####}, where the number is a randomly drawn, uniquely identifying number."
41
+ 4,players,3,family_name,text,The family name of the player.
42
+ 4,players,4,given_name,text,The given name of the player.
43
+ 4,players,5,birth_date,date,The birth date of the player in the format {YYYY-MM-DD}.
44
+ 4,players,6,female,boolean,Whether the player is female. Coded {1} if the player is female and {0} if the player is male.
45
+ 4,players,7,goal_keeper,boolean,Whether the player was a goal keeper. Coded {1} if the player was a goal keeper and {0} otherwise.
46
+ 4,players,8,defender,boolean,Whether the player was a defender. Coded {1} if the player was a defender and {0} otherwise.
47
+ 4,players,9,midfielder,boolean,Whether the player was a midfielder. Coded {1} if the player was a midfielder and {0} otherwise.
48
+ 4,players,10,forward,boolean,Whether the player was a forward. Coded {1} if the player was a forward and {0} otherwise.
49
+ 4,players,11,count_tournaments,integer,The number of tournaments that the player participated in.
50
+ 4,players,12,list_tournaments,text,"A list of tournaments that the player participated in, separated by a comma."
51
+ 4,players,13,player_wikipedia_link,text,The Wikipedia link for the player.
52
+ 5,managers,1,key_id,integer,The unique ID number for the observation.
53
+ 5,managers,2,manager_id,text,"The unique ID number for the manager. Has the format {M-####}, where the number is a counter that is assigned with the data sorted by the year of the manager's first appearance, then by the manager's family name, and then by the manager's given name."
54
+ 5,managers,3,family_name,text,The family name of the manager.
55
+ 5,managers,4,given_name,text,The given name of the manager.
56
+ 5,managers,5,female,boolean,Whether the manager is female. Coded {1} if the manager is female and {0} if the manager is male.
57
+ 5,managers,6,country_name,text,The name of the manager's home country.
58
+ 5,managers,7,manager_wikipedia_link,text,The Wikipedia link for the manager.
59
+ 6,referees,1,key_id,integer,The unique ID number for the observation.
60
+ 6,referees,2,referee_id,text,"The unique ID number for the referee. Has the format {R-####}, where the number is a counter that is assigned with the data sorted by the year of the referee's first appearance, then by the referee's family name, and then by the referee's given name."
61
+ 6,referees,3,family_name,text,The family name of the referee.
62
+ 6,referees,4,given_name,text,The given name of the referee.
63
+ 6,referees,5,female,boolean,Whether the referee is female. Coded {1} if the referee is female and {0} if the referee is male.
64
+ 6,referees,6,country_name,text,The name of the referee's home country.
65
+ 6,referees,7,confederation_id,text,The unique ID number for the confederation. References {confederation_id} in the {confederations} dataset.
66
+ 6,referees,8,confederation_name,text,The name of the confederation.
67
+ 6,referees,9,confederation_code,text,The abbreviation for the confederation.
68
+ 6,referees,10,referee_wikipedia_link,text,The Wikipedia link for the referee.
69
+ 7,stadiums,1,key_id,integer,The unique ID number for the observation.
70
+ 7,stadiums,2,stadium_id,text,"The unique ID number for the stadium. Has the format {S-###}, where the number is a counter that is assigned with the data sorted by country, then by city, then by the name of the stadium."
71
+ 7,stadiums,3,stadium_name,text,The name of the stadium.
72
+ 7,stadiums,4,city_name,text,The name of the city in which the stadium is located.
73
+ 7,stadiums,5,country_name,text,The name of the country in which the stadium is located.
74
+ 7,stadiums,6,stadium_capacity,integer,The approximate capacity of the stadium.
75
+ 7,stadiums,7,stadium_wikipedia_link,text,The Wikipedia link for the stadium.
76
+ 7,stadiums,8,city_wikipedia_link,text,The Wikipedia link for the city in which the stadium is located.
77
+ 8,matches,1,key_id,integer,The unique ID number for the observation.
78
+ 8,matches,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
79
+ 8,matches,3,tournament_name,text,The name of the tournament.
80
+ 8,matches,4,match_id,text,"The unique ID number for the match. Has the format {M-####-##}, where the first number is the year of the tournament and the second number is a within-tournament counter that is assigned with the data sorted by the date of the match, then by the time of the match, then by the name of the group, and then by the name of the home team."
81
+ 8,matches,5,match_name,text,The name of the match.
82
+ 8,matches,6,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
83
+ 8,matches,7,group_name,text,The name of the group.
84
+ 8,matches,8,group_stage,boolean,Whether the match is a group stage match. Coded {1} if the match is a group stage match and {0} otherwise.
85
+ 8,matches,9,knockout_stage,boolean,Whether the match is a knockout stage match. Coded {1} if the match is a knockout stage match and {0} otherwise.
86
+ 8,matches,10,replayed,boolean,Whether the match was replayed. Coded {1} if the match was replayed and {0} otherwise.
87
+ 8,matches,11,replay,boolean,Whether the match was a replay. Coded {1} if the match was a replay and {0} otherwise.
88
+ 8,matches,12,match_date,date,The date of the match in the format {YYYY-MM-DD}.
89
+ 8,matches,13,match_time,integer,The time of the match in the format {HH:MM}.
90
+ 8,matches,14,stadium_id,text,The unique ID number for the stadium. References {stadium_id} in the {stadiums} dataset.
91
+ 8,matches,15,stadium_name,text,The name of the stadium.
92
+ 8,matches,16,city_name,text,The city in which the match was played.
93
+ 8,matches,17,country_name,text,The name of the country in which the match was played.
94
+ 8,matches,18,home_team_id,text,The unique ID number for the home team. References {team_id} in the {teams} dataset.
95
+ 8,matches,19,home_team_name,text,The name of the home team. See the {teams} dataset.
96
+ 8,matches,20,home_team_code,text,The 3-letter code for the home team.
97
+ 8,matches,21,away_team_id,text,The unique ID number for the away team. References {team_id} in the {teams} dataset.
98
+ 8,matches,22,away_team_name,text,The name of the away team. See the {teams} dataset.
99
+ 8,matches,23,away_team_code,text,The 3-letter code for the away team.
100
+ 8,matches,24,score,text,"The score of the match in the format {#-#}, where the first number is the score of the home team and the second number is the score of the away team."
101
+ 8,matches,25,home_team_score,integer,The score of the home team.
102
+ 8,matches,26,away_team_score,integer,The score of the away team.
103
+ 8,matches,27,home_team_score_margin,integer,The score margin for the home team.
104
+ 8,matches,28,away_team_score_margin,integer,The score margin for the away team.
105
+ 8,matches,29,extra_time,boolean,Whether the match went to extra time. Coded {1} if the match went to extra time and {0} otherwise.
106
+ 8,matches,30,penalty_shootout,boolean,Whether the match ended in a penalty shootout. Coded {1} if the match ended in a penalty shootout and {0} otherwise.
107
+ 8,matches,31,score_penalties,text,The score of the penalty shootout in the format {#-#}. Coded {0-0} if there was not a penalty shootout.
108
+ 8,matches,32,home_team_score_penalties,integer,The score of the home team in the penalty shootout. Coded {NA} if there was not a penalty shootout.
109
+ 8,matches,33,away_team_score_penalties,integer,The score of the away team in the penalty shootout. Coded {NA} if there was not a penalty shootout.
110
+ 8,matches,34,result,enum,"The result of the match. The possible values are: {home team win}, {away team win}, {draw}, {replayed}."
111
+ 8,matches,35,home_team_win,boolean,Whether the home team won the match. Coded {1} if the home team won the match and {0} otherwise.
112
+ 8,matches,36,away_team_win,boolean,Whether the away team won the match. Coded {1} if the away team won the match and {0} otherwise.
113
+ 8,matches,37,draw,boolean,Whether the match ended in a draw. Coded {1} if the match ended in a draw and {0} otherwise.
114
+ 9,awards,1,key_id,integer,The unique ID number for the observation.
115
+ 9,awards,2,award_id,text,"The unique ID number for the award. Has the format {A-#}, where the number is a counter."
116
+ 9,awards,3,award_name,enum,"The name of the award. The possible values are: {Golden Ball}, {Silver Ball}, {Bronze Ball}, {Golden Boot}, {Silver Boot}, {Bronze Boot}, {Golden Glove}, {Best Young Player}."
117
+ 9,awards,4,award_description,text,A description of the award.
118
+ 9,awards,5,year_introduced,integer,The year the award was first introduced.
119
+ 10,qualified_teams,1,key_id,integer,The unique ID number for the observation.
120
+ 10,qualified_teams,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
121
+ 10,qualified_teams,3,tournament_name,text,The name of the tournament.
122
+ 10,qualified_teams,4,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
123
+ 10,qualified_teams,5,team_name,text,The name of the team.
124
+ 10,qualified_teams,6,team_code,text,The 3-letter code for the team.
125
+ 10,qualified_teams,7,count_matches,integer,The number of matches that the team played in the tournament.
126
+ 10,qualified_teams,8,performance,text,The furthest stage of the tournament reached by the team.
127
+ 11,squads,1,key_id,integer,The unique ID number for the observation.
128
+ 11,squads,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
129
+ 11,squads,3,tournament_name,text,The name of the tournament.
130
+ 11,squads,4,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
131
+ 11,squads,5,team_name,text,The name of the team of the player.
132
+ 11,squads,6,team_code,text,The 3-letter code for the team.
133
+ 11,squads,7,player_id,text,The unique ID number for the player. References {player_id} in the {players} dataset.
134
+ 11,squads,8,family_name,text,The family name of the player.
135
+ 11,squads,9,given_name,text,The given name of the player.
136
+ 11,squads,10,shirt_number,integer,The shirt number of the player.
137
+ 11,squads,11,position_name,enum,"The position of the player. The possible values are: {goal keeper}, {defender}, {midfielder}, {forward}."
138
+ 11,squads,12,position_code,enum,"The code for the position of the player. The possible values are: {GK}, {DF}, {MF}, {FW}."
139
+ 12,manager_appointments,1,key_id,integer,The unique ID number for the observation.
140
+ 12,manager_appointments,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
141
+ 12,manager_appointments,3,tournament_name,text,The name of the tournament.
142
+ 12,manager_appointments,4,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
143
+ 12,manager_appointments,5,team_name,text,The name of the team of the manager.
144
+ 12,manager_appointments,6,team_code,text,The 3-letter code for the team.
145
+ 12,manager_appointments,7,manager_id,text,The unique ID number for the manager. References {manager_id} in the {managers} dataset.
146
+ 12,manager_appointments,8,family_name,text,The family name of the manager.
147
+ 12,manager_appointments,9,given_name,text,The given name of the manager.
148
+ 12,manager_appointments,10,country_name,text,The name of the manager's home country.
149
+ 13,referee_appointments,1,key_id,integer,The unique ID number for the observation.
150
+ 13,referee_appointments,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
151
+ 13,referee_appointments,3,tournament_name,text,The name of the tournament.
152
+ 13,referee_appointments,4,referee_id,text,The unique ID number for the referee. References {referee_id} in the {referees} dataset.
153
+ 13,referee_appointments,5,family_name,text,The family name of the referee.
154
+ 13,referee_appointments,6,given_name,text,The given name of the referee.
155
+ 13,referee_appointments,7,country_name,text,The name of the referee's home country.
156
+ 13,referee_appointments,8,confederation_id,text,The unique ID number for the confederation. References {confederation_id} in the {confederations} dataset.
157
+ 13,referee_appointments,9,confederation_name,text,The name of the confederation.
158
+ 13,referee_appointments,10,confederation_code,text,The abbreviation for the confederation.
159
+ 14,team_appearances,1,key_id,integer,The unique ID number for the observation.
160
+ 14,team_appearances,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
161
+ 14,team_appearances,3,tournament_name,text,The name of the tournament.
162
+ 14,team_appearances,4,match_id,text,The unique ID number for the match. References {match_id} in the {matches} dataset.
163
+ 14,team_appearances,5,match_name,text,The name of the match.
164
+ 14,team_appearances,6,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
165
+ 14,team_appearances,7,group_name,text,The name of the group.
166
+ 14,team_appearances,8,group_stage,boolean,Whether the match is a group stage match. Coded {1} if the match is a group stage match and {0} otherwise.
167
+ 14,team_appearances,9,knockout_stage,boolean,Whether the match is a knockout stage match. Coded {1} if the match is a knockout stage match and {0} otherwise.
168
+ 14,team_appearances,10,replayed,boolean,Whether the match was replayed. Coded {1} if the match was replayed and {0} otherwise.
169
+ 14,team_appearances,11,replay,boolean,Whether the match was a replay. Coded {1} if the match was a replay and {0} otherwise.
170
+ 14,team_appearances,12,match_date,date,The date of the match in the format {YYYY-MM-DD}.
171
+ 14,team_appearances,13,match_time,integer,The time of the match in the format {HH:MM}.
172
+ 14,team_appearances,14,stadium_id,text,The unique ID number for the stadium. References {stadium_id} in the {stadiums} dataset.
173
+ 14,team_appearances,15,stadium_name,text,The name of the stadium.
174
+ 14,team_appearances,16,city_name,text,The city in which the match was played.
175
+ 14,team_appearances,17,country_name,text,The name of the country in which the match was played.
176
+ 14,team_appearances,18,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
177
+ 14,team_appearances,19,team_name,text,The name of the team.
178
+ 14,team_appearances,20,team_code,text,The 3-letter code for the team.
179
+ 14,team_appearances,21,opponent_id,text,The unique ID number for the team's opponent. References {team_id} in the {teams} dataset.
180
+ 14,team_appearances,22,opponent_name,text,The name of the team's opponent.
181
+ 14,team_appearances,23,opponent_code,text,The 3-letter code for the team's opponent.
182
+ 14,team_appearances,24,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
183
+ 14,team_appearances,25,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
184
+ 14,team_appearances,26,goals_for,integer,The number of goals scored by the team.
185
+ 14,team_appearances,27,goals_against,integer,The number of goals scored against the team.
186
+ 14,team_appearances,28,goal_differential,integer,The team's goal differential.
187
+ 14,team_appearances,29,extra_time,boolean,Whether the match went to extra time. Coded {1} if the match went to extra time and {0} otherwise.
188
+ 14,team_appearances,30,penalty_shootout,boolean,Whether the match ended in a penalty shootout. Coded {1} if the match ended in a penalty shootout and {0} otherwise.
189
+ 14,team_appearances,31,penalties_for,integer,"The number of penalties scored by the team, if the match ended in a penalty shootout. Coded {0} if there was not a shootout."
190
+ 14,team_appearances,32,penalties_against,integer,"The number of penalties scored by the opponent, if the match ended in a penalty shootout. Coded {0} if there was not a shootout."
191
+ 14,team_appearances,33,result,enum,"The result of the match. The possible values are: {home team win}, {away team win}, {draw}, {replayed}."
192
+ 14,team_appearances,34,win,boolean,Whether the team won the match. Coded {1} if the team won the match and {0} otherwise.
193
+ 14,team_appearances,35,lose,boolean,Whether the team lost the match. Coded {1} if the team lost the match and {0} otherwise.
194
+ 14,team_appearances,36,draw,boolean,Whether the match ended in a draw. Coded {1} if the match ended in a draw and {0} otherwise.
195
+ 15,player_appearances,1,key_id,integer,The unique ID number for the observation.
196
+ 15,player_appearances,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
197
+ 15,player_appearances,3,tournament_name,text,The name of the tournament.
198
+ 15,player_appearances,4,match_id,text,The unique ID number for the match. References {match_id} in the {matches} dataset.
199
+ 15,player_appearances,5,match_name,text,The name of the match.
200
+ 15,player_appearances,6,match_date,date,The date of the match in the format {YYYY-MM-DD}.
201
+ 15,player_appearances,7,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
202
+ 15,player_appearances,8,group_name,text,The name of the group.
203
+ 15,player_appearances,9,team_id,text,The unique ID number for the team of the player. References {team_id} in the {teams} dataset.
204
+ 15,player_appearances,10,team_name,text,The name of the team of the player.
205
+ 15,player_appearances,11,team_code,text,The 3-letter code for the team of the player.
206
+ 15,player_appearances,12,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
207
+ 15,player_appearances,13,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
208
+ 15,player_appearances,14,player_id,text,The unique ID number for the player. References {player_id} in the {players} dataset.
209
+ 15,player_appearances,15,family_name,text,The family name of the player.
210
+ 15,player_appearances,16,given_name,text,The given name of the player.
211
+ 15,player_appearances,17,shirt_number,integer,The shirt number of the player.
212
+ 15,player_appearances,18,position_name,text,The name of the position of the player.
213
+ 15,player_appearances,19,position_code,text,A 2-letter or 3-letter code that indicates the position of the player.
214
+ 15,player_appearances,20,starter,boolean,Whether the player started the match. Coded {1} if the player started the match and {0} otherwise.
215
+ 15,player_appearances,21,substitute,boolean,Whether the player was a substitute. Coded {1} if the player was a substitute and {0} otherwise.
216
+ 16,manager_appearances,1,key_id,integer,The unique ID number for the observation.
217
+ 16,manager_appearances,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
218
+ 16,manager_appearances,3,tournament_name,text,The name of the tournament.
219
+ 16,manager_appearances,4,match_id,text,The unique ID number for the match. References {match_id} in the {matches} dataset.
220
+ 16,manager_appearances,5,match_name,text,The name of the match.
221
+ 16,manager_appearances,6,match_date,date,The date of the match in the format {YYYY-MM-DD}.
222
+ 16,manager_appearances,7,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
223
+ 16,manager_appearances,8,group_name,text,The name of the group.
224
+ 16,manager_appearances,9,team_id,text,The unique ID number for the team of the manager. References {team_id} in the {teams} dataset.
225
+ 16,manager_appearances,10,team_name,text,The name of the team of the manager.
226
+ 16,manager_appearances,11,team_code,text,The 3-letter code for the team of the manager.
227
+ 16,manager_appearances,12,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
228
+ 16,manager_appearances,13,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
229
+ 16,manager_appearances,14,manager_id,text,The unique ID number for the manager. References {manager_id} in the {managers} dataset.
230
+ 16,manager_appearances,15,family_name,text,The family name of the manager.
231
+ 16,manager_appearances,16,given_name,text,The given name of the manager.
232
+ 16,manager_appearances,17,country_name,text,The name of the manager's home country.
233
+ 17,referee_appearances,1,key_id,integer,The unique ID number for the observation.
234
+ 17,referee_appearances,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
235
+ 17,referee_appearances,3,tournament_name,text,The name of the tournament.
236
+ 17,referee_appearances,4,match_id,text,The unique ID number for the match. References {match_id} in the {matches} dataset.
237
+ 17,referee_appearances,5,match_name,text,The name of the match.
238
+ 17,referee_appearances,6,match_date,date,The date of the match in the format {YYYY-MM-DD}.
239
+ 17,referee_appearances,7,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
240
+ 17,referee_appearances,8,group_name,text,The name of the group.
241
+ 17,referee_appearances,9,referee_id,text,The unique ID number for the referee. References {referee_id} in the {referees} dataset.
242
+ 17,referee_appearances,10,family_name,text,The family name of the referee.
243
+ 17,referee_appearances,11,given_name,text,The given name of the referee.
244
+ 17,referee_appearances,12,country_name,text,The name of the referee's home country.
245
+ 17,referee_appearances,13,confederation_id,text,The unique ID number for the confederation. References {confederation_id} in the {confederations} dataset.
246
+ 17,referee_appearances,14,confederation_name,text,The name of the confederation.
247
+ 17,referee_appearances,15,confederation_code,text,The abbreviation for the confederation.
248
+ 18,goals,1,key_id,integer,The unique ID number for the observation.
249
+ 18,goals,2,goal_id,text,"The unique ID number for the goal. Has the format {G-####}, where the number is a counter that is assigned with the data sorted by the match ID, then the minute of the goal."
250
+ 18,goals,3,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
251
+ 18,goals,4,tournament_name,text,The name of the tournament.
252
+ 18,goals,5,match_id,text,The unique ID number for the match in which the goal occurred. References {match_id} in the {matches} dataset.
253
+ 18,goals,6,match_name,text,The name of the match in which the goal occurred.
254
+ 18,goals,7,match_date,date,The date of the match in the format {YYYY-MM-DD}.
255
+ 18,goals,8,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
256
+ 18,goals,9,group_name,text,The name of the group.
257
+ 18,goals,10,team_id,text,"The unique ID number for the team that scored the goal. References {team_id} in the {teams} dataset. For own goals, this is the team that is awarded the goal, not the team of the player who scored the own goal."
258
+ 18,goals,11,team_name,text,The name of the team that scored the goal.
259
+ 18,goals,12,team_code,text,The 3-letter code for the team that scored the goal.
260
+ 18,goals,13,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
261
+ 18,goals,14,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
262
+ 18,goals,15,player_id,text,The unique ID number for the player who scored the goal. References {player_id} in the {players} dataset.
263
+ 18,goals,16,family_name,text,The family name of the player who scored the goal.
264
+ 18,goals,17,given_name,text,The given name of the player who scored the goal.
265
+ 18,goals,18,shirt_number,integer,The shirt number of the player who scored the goal.
266
+ 18,goals,19,player_team_id,text,"The unique ID number for the team of the player who scored the goal. References {team_id} in the {teams} dataset. For own goals, this is the team of the player who scored the own goal, not the team that is awarded the goal."
267
+ 18,goals,20,player_team_name,text,The name of the team of the player who scored the goal.
268
+ 18,goals,21,player_team_code,text,The 3-letter code for the team of the player who scored the goal.
269
+ 18,goals,22,minute_label,text,The minute of the match in which the goal occurred in the format {#'} or {#'+#'}.
270
+ 18,goals,23,minute_regulation,integer,The minute of regulation time in which the goal occurred.
271
+ 18,goals,24,minute_stoppage,integer,The minute of stoppage time in which the goal occurred. Coded {0} if the goal did not occur during stoppage time.
272
+ 18,goals,25,match_period,enum,"The period of the match in which the goal occurred. The possible values are: {first half}, {first half, stoppage time}, {second half}, {second half, stoppage time}, {extra time, first half}, {extra time, first half, stoppage time}, {extra time, second half}, {extra time, second half, stoppage time}, {after extra time}."
273
+ 18,goals,26,own_goal,boolean,Whether the goal was an own goal. Coded {1} if the goal was an own goal and {0} otherwise.
274
+ 18,goals,27,penalty,boolean,"Whether the goal was a penalty that occurred during the game, as opposed to during a penalty shootout. Coded {1} if the goal was a penalty that occurred during the game and {0} otherwise."
275
+ 19,penalty_kicks,1,key_id,integer,The unique ID number for the observation.
276
+ 19,penalty_kicks,2,penalty_kick_id,text,"The unique ID number for the penalty kick. Has the format {PK-####}, where the number is a counter that is assigned with the data sorted by the match ID, then the minute of the penalty kick."
277
+ 19,penalty_kicks,3,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
278
+ 19,penalty_kicks,4,tournament_name,text,The name of the tournament.
279
+ 19,penalty_kicks,5,match_id,text,The unique ID number for the match in which the penalty kick occurred. References {match_id} in the {matches} dataset.
280
+ 19,penalty_kicks,6,match_name,text,The name of the match in which the penalty kick occurred.
281
+ 19,penalty_kicks,7,match_date,date,The date of the match in the format {YYYY-MM-DD}.
282
+ 19,penalty_kicks,8,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
283
+ 19,penalty_kicks,9,group_name,text,The name of the group.
284
+ 19,penalty_kicks,10,team_id,text,The unique ID number for the team of the player who took the penalty kick. References {team_id} in the {teams} dataset.
285
+ 19,penalty_kicks,11,team_name,text,The name of the team of the player who took the penalty kick.
286
+ 19,penalty_kicks,12,team_code,text,The 3-letter code for the team of the player who took the penalty kick.
287
+ 19,penalty_kicks,13,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
288
+ 19,penalty_kicks,14,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
289
+ 19,penalty_kicks,15,player_id,text,The unique ID number for the player who took the penalty kick. References {player_id} in the {players} dataset.
290
+ 19,penalty_kicks,16,family_name,text,The family name of the player who took the penalty kick.
291
+ 19,penalty_kicks,17,given_name,text,The given name of the player who took the penalty kick.
292
+ 19,penalty_kicks,18,shirt_number,integer,The shirt number of the player who took the penalty kick.
293
+ 19,penalty_kicks,19,converted,boolean,Whether the penalty kick was converted. Coded {1} if the penalty kick was converted and {0} otherwise.
294
+ 20,bookings,1,key_id,integer,The unique ID number for the observation.
295
+ 20,bookings,2,booking_id,text,"The unique ID number for the booking. Has the format {B-####}, where the number is a counter that is assigned with the data sorted by the match ID, then the minute of the booking."
296
+ 20,bookings,3,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
297
+ 20,bookings,4,tournament_name,text,The name of the tournament.
298
+ 20,bookings,5,match_id,text,The unique ID number for the match in which the booking occurred. References {match_id} in the {matches} dataset.
299
+ 20,bookings,6,match_name,text,The name of the match in which the booking occurred.
300
+ 20,bookings,7,match_date,date,The date of the match in the format {YYYY-MM-DD}.
301
+ 20,bookings,8,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
302
+ 20,bookings,9,group_name,text,The name of the group.
303
+ 20,bookings,10,team_id,text,The unique ID number for the team of the player who was booked. References {team_id} in the {teams} dataset.
304
+ 20,bookings,11,team_name,text,The name of the team of the player who was booked.
305
+ 20,bookings,12,team_code,text,The 3-letter code for the team of the player who was booked.
306
+ 20,bookings,13,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
307
+ 20,bookings,14,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
308
+ 20,bookings,15,player_id,text,The unique ID number for the player who was booked. References {player_id} in the {players} dataset.
309
+ 20,bookings,16,family_name,text,The family name of the player who was booked.
310
+ 20,bookings,17,given_name,text,The given name of the player who was booked.
311
+ 20,bookings,18,shirt_number,integer,The shirt number of the player who was booked.
312
+ 20,bookings,19,minute_label,text,The minute of the match in which the booking occurred in the format {#'} or {#'+#'}.
313
+ 20,bookings,20,minute_regulation,integer,The minute of regulation time in which the booking occurred.
314
+ 20,bookings,21,minute_stoppage,integer,The minute of stoppage time in which the booking occurred. Coded {0} if the booking did not occur during stoppage time.
315
+ 20,bookings,22,match_period,enum,"The period of the match in which the booking occurred. The possible values are: {first half}, {first half, stoppage time}, {second half}, {second half, stoppage time}, {extra time, first half}, {extra time, first half, stoppage time}, {extra time, second half}, {extra time, second half, stoppage time}, {after extra time}."
316
+ 20,bookings,23,yellow_card,boolean,Whether the booking was a yellow card. Coded {1} if the card is a yellow card and {0} otherwise.
317
+ 20,bookings,24,red_card,boolean,Whether the booking was a red card. Coded {1} if the card is a red card and {0} otherwise.
318
+ 20,bookings,25,second_yellow_card,boolean,Whether the booking was a second yellow card. Coded {1} if the booking is a second yellow and {0} otherwise.
319
+ 20,bookings,26,sending_off,boolean,Whether the booking resulted in the player being sent off. Coded {1} if the player was sent off and {0} otherwise.
320
+ 21,substitutions,1,key_id,integer,The unique ID number for the observation.
321
+ 21,substitutions,2,substitution_id,text,"The unique ID number for the substitution. Has the format {S-####}, where the number is a counter that is assigned with the data sorted by the match ID, then the minute of the substitution, then whether the player is going off."
322
+ 21,substitutions,3,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
323
+ 21,substitutions,4,tournament_name,text,The name of the tournament.
324
+ 21,substitutions,5,match_id,text,The unique ID number for the match in which the substitution occurred. References {match_id} in the {matches} dataset.
325
+ 21,substitutions,6,match_name,text,The name of the match in which the substitution occurred.
326
+ 21,substitutions,7,match_date,date,The date of the match in the format {YYYY-MM-DD}.
327
+ 21,substitutions,8,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
328
+ 21,substitutions,9,group_name,text,The name of the group.
329
+ 21,substitutions,10,team_id,text,The unique ID number for the team of the player who was substituted. References {team_id} in the {teams} dataset.
330
+ 21,substitutions,11,team_name,text,The name of the team of the player who was substituted.
331
+ 21,substitutions,12,team_code,text,The 3-letter code for the team of the player who was substituted.
332
+ 21,substitutions,13,home_team,boolean,Whether the team was the home team. Coded {1} if the team was the home team and {0} otherwise.
333
+ 21,substitutions,14,away_team,boolean,Whether the team was the away team. Coded {1} if the team was the away team and {0} otherwise.
334
+ 21,substitutions,15,player_id,text,The unique ID number for the player who was substituted. References {player_id} in the {players} dataset.
335
+ 21,substitutions,16,family_name,text,The family name of the player who was substituted.
336
+ 21,substitutions,17,given_name,text,The given name of the player who was substituted.
337
+ 21,substitutions,18,shirt_number,integer,The shirt number of the player who was substituted.
338
+ 21,substitutions,19,minute_label,text,The minute of the match in which the substitution occurred in the format {#'} or {#'+#'}.
339
+ 21,substitutions,20,minute_regulation,integer,The minute of regulation time in which the substitution occurred.
340
+ 21,substitutions,21,minute_stoppage,integer,The minute of stoppage time in which the substitution occurred. Coded {0} if the substitution did not occur during stoppage time.
341
+ 21,substitutions,22,match_period,enum,"The period of the match in which the substitution occurred. The possible values are: {first half}, {first half, stoppage time}, {second half}, {second half, stoppage time}, {extra time, first half}, {extra time, first half, stoppage time}, {extra time, second half}, {extra time, second half, stoppage time}, {after extra time}."
342
+ 21,substitutions,23,going_off,boolean,Whether the player was going off the field. Coded {1} if the player was going off and {0} otherwise.
343
+ 21,substitutions,24,coming_on,boolean,Whether the player was coming on the field. Coded {1} if the player was coming on and {0} otherwise.
344
+ 22,host_countries,1,key_id,integer,The unique ID number for the observation.
345
+ 22,host_countries,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
346
+ 22,host_countries,3,tournament_name,text,The name of the tournament.
347
+ 22,host_countries,4,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
348
+ 22,host_countries,5,team_name,text,The name of the team.
349
+ 22,host_countries,6,team_code,text,The 3-letter code for the team.
350
+ 22,host_countries,7,performance,text,The furthest stage of the tournament reached by the host country's team.
351
+ 23,tournament_stages,1,key_id,integer,The unique ID number for the observation.
352
+ 23,tournament_stages,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
353
+ 23,tournament_stages,3,tournament_name,text,The name of the tournament.
354
+ 23,tournament_stages,4,stage_number,integer,The number of the stage.
355
+ 23,tournament_stages,5,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
356
+ 23,tournament_stages,6,group_stage,boolean,Whether the match is a group stage match. Coded {1} if the match is a group stage match and {0} otherwise.
357
+ 23,tournament_stages,7,knockout_stage,boolean,Whether there was a knockout stage. Coded {1} if there was a knockout stage and {0} otherwise.
358
+ 23,tournament_stages,8,unbalanced_groups,boolean,Whether there were unbalanced groups. Coded {1} if there were unbalanced groups and {0} otherwise.
359
+ 23,tournament_stages,9,start_date,date,The start date of the stage in the format {YYYY-MM-DD}.
360
+ 23,tournament_stages,10,end_date,date,The end date of the stage in the format {YYYY-MM-DD}.
361
+ 23,tournament_stages,11,count_matches,integer,The number of matches in the stage.
362
+ 23,tournament_stages,12,count_teams,integer,The number of teams that participated in the stage.
363
+ 23,tournament_stages,13,count_scheduled,integer,The number of games that were scheduled in the stage.
364
+ 23,tournament_stages,14,count_replays,integer,The number of replays in the stage.
365
+ 23,tournament_stages,15,count_playoffs,integer,The number of playoff games in the stage.
366
+ 23,tournament_stages,16,count_walkovers,integer,The number of walkovers in the stage.
367
+ 24,groups,1,key_id,integer,The unique ID number for the observation.
368
+ 24,groups,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
369
+ 24,groups,3,tournament_name,text,The name of the tournament.
370
+ 24,groups,4,stage_number,integer,The number of the stage.
371
+ 24,groups,5,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
372
+ 24,groups,6,group_name,text,The name of the group.
373
+ 24,groups,7,count_teams,integer,The number of teams in the group.
374
+ 25,group_standings,1,key_id,integer,The unique ID number for the observation.
375
+ 25,group_standings,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
376
+ 25,group_standings,3,tournament_name,text,The name of the tournament.
377
+ 25,group_standings,4,stage_number,integer,The number of the stage.
378
+ 25,group_standings,5,stage_name,enum,"The stage of the tournament in which the match occurred. The possible values are: {first round}, {second round}, {group stage}, {round of sixteen}, {quarter-finals}, {semi-finals}, {third place match}, {final}. Note that not all values are applicable to all tournaments."
379
+ 25,group_standings,6,group_name,text,The name of the group.
380
+ 25,group_standings,7,position,integer,The team's final position in the group.
381
+ 25,group_standings,8,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
382
+ 25,group_standings,9,team_name,text,The name of the team.
383
+ 25,group_standings,10,team_code,text,The 3-letter code for the team.
384
+ 25,group_standings,11,played,integer,The number of matches that the team played in the group.
385
+ 25,group_standings,12,wins,integer,The number of matches that the team won in the group stage.
386
+ 25,group_standings,13,draws,integer,The number of matches that the team drew in the group stage.
387
+ 25,group_standings,14,losses,integer,The number of matches that the team lost in the group stage.
388
+ 25,group_standings,15,goals_for,integer,The number of goals scored by the team in the group stage.
389
+ 25,group_standings,16,goals_against,integer,The number of goals scored against the team in the group stage.
390
+ 25,group_standings,17,goal_difference,integer,The team's goal difference in the group stage.
391
+ 25,group_standings,18,points,integer,The number of points that the team earned in the group.
392
+ 25,group_standings,19,advanced,boolean,Whether the team advanced out of the group. Coded {1} if the team advanced and {0} otherwise.
393
+ 26,tournament_standings,1,key_id,integer,The unique ID number for the observation.
394
+ 26,tournament_standings,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
395
+ 26,tournament_standings,3,tournament_name,text,The name of the tournament.
396
+ 26,tournament_standings,4,position,integer,The place of the team in the final standings.
397
+ 26,tournament_standings,5,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
398
+ 26,tournament_standings,6,team_name,text,The name of the team.
399
+ 26,tournament_standings,7,team_code,text,The 3-letter code for the team.
400
+ 27,award_winners,1,key_id,integer,The unique ID number for the observation.
401
+ 27,award_winners,2,tournament_id,text,The unique ID number for the tournament. References {tournament_id} in the {tournaments} dataset.
402
+ 27,award_winners,3,tournament_name,text,The name of the tournament.
403
+ 27,award_winners,4,award_id,text,The unique ID number for the award. References {award_id} in the {awards} dataset.
404
+ 27,award_winners,5,award_name,enum,"The name of the award. The possible values are: {Golden Ball}, {Silver Ball}, {Bronze Ball}, {Golden Boot}, {Silver Boot}, {Bronze Boot}, {Golden Glove}, {Best Young Player}. "
405
+ 27,award_winners,6,shared,boolean,Whether the award was shared between multiple players. Coded {1} if the award was shared and {0} otherwise.
406
+ 27,award_winners,7,player_id,text,The unique ID number for the player who won the award. References {player_id} in the {players} dataset.
407
+ 27,award_winners,8,family_name,text,The family name of the player who won the award.
408
+ 27,award_winners,9,given_name,text,The given name of the player who won the award.
409
+ 27,award_winners,10,team_id,text,The unique ID number for the team. References {team_id} in the {teams} dataset.
410
+ 27,award_winners,11,team_name,text,The name of the team of the player who won the award.
411
+ 27,award_winners,12,team_code,text,The 3-letter code for the team of the player who won the award.
hierarchy_eval.py ADDED
@@ -0,0 +1,524 @@
+ """
+ hierarchy_eval.py: shared, reference-free hierarchy evaluation for the TFM.
+
+ WHY REFERENCE-FREE?
+ -------------------
+ In all three approaches the dataset's group column is a *construction input*
+ (Gonçalves text object in baseline / Approach 1; explicit group-anchored L1/L2
+ in Approach 2). An input cannot also serve as the ground truth, so measuring
+ the hierarchy against the group column is circular (for Approach 2 it is
+ circular by design). The defensible evaluation is reference-free.
+
+ PRIMARY METRICS (no gold standard required)
+ -------------------------------------------
+   • Parent–child coherence: TraCo (Wu et al., AAAI 2024, arXiv:2401.14113)
+   • Sibling diversity: TraCo (same paper)
+   • NPMI label coherence: Lau et al., EACL 2014 (aclanthology.org/E14-1056);
+     orig. Mimno et al., EMNLP 2011
+   • Structural statistics: HiExpan-style reporting (Shen et al., KDD 2018)
+
+ SECONDARY (descriptive, explicitly caveated)
+ --------------------------------------------
+   • Group-structure preservation (NMI / ARI / Purity vs the group column).
+     Reported only as "how much the discovered hierarchy still reflects the
+     pre-existing group column that was used as input". NOT an accuracy metric.
+
+ All metrics are computed the same way for every approach, so cross-approach
+ comparison is fair.
+ """
+ from __future__ import annotations
+
+ import re
+ from collections import Counter
+
+ import numpy as np
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Tree helpers
+ # ──────────────────────────────────────────────────────────────────────────────
+ def build_parent_map(nodes: list) -> dict:
+     pm: dict = {}
+     for n in nodes:
+         for c in n.get('related', []):
+             cid = int(c)
+             if cid not in pm:
+                 pm[cid] = int(n['id'])
+     return pm
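
To make the node format concrete, here is a minimal sketch of how a parent map and leaf depths fall out of a node list. The `id`, `type`, `name` and `related` keys are the ones the helpers above read; the toy tree itself and the standalone `depth_of` signature are hypothetical and only for illustration.

```python
def build_parent_map(nodes):
    """Map each child id to the id of the first node listing it in 'related'."""
    pm = {}
    for n in nodes:
        for c in n.get('related', []):
            cid = int(c)
            if cid not in pm:
                pm[cid] = int(n['id'])
    return pm


def depth_of(nid, pm):
    """Count edges from a node up to the root by following parent links."""
    d = 0
    while nid in pm:
        nid = pm[nid]
        d += 1
    return d


# Hypothetical toy tree: one root, two aggregation nodes, three attribute leaves.
toy_nodes = [
    {'id': 0, 'type': 'root', 'name': 'root', 'related': [1, 2]},
    {'id': 1, 'type': 'aggregation', 'name': 'match info', 'related': [3, 4]},
    {'id': 2, 'type': 'aggregation', 'name': 'player info', 'related': [5]},
    {'id': 3, 'type': 'attribute', 'name': 'match_id'},
    {'id': 4, 'type': 'attribute', 'name': 'match_date'},
    {'id': 5, 'type': 'attribute', 'name': 'player_id'},
]

parent_map = build_parent_map(toy_nodes)
leaf_depths = {n['name']: depth_of(n['id'], parent_map)
               for n in toy_nodes if n['type'] == 'attribute'}
```

Because only the first parent is recorded, a node listed under two parents is counted once, under whichever parent appears first in the list.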
+
+
+ def structural_stats(nodes: list) -> dict:
+     pm = build_parent_map(nodes)
+
+     def depth_of(nid: int) -> int:
+         d = 0
+         while nid in pm:
+             nid = pm[nid]
+             d += 1
+         return d
+
+     agg = [n for n in nodes if n.get('type') == 'aggregation']
+     leafs = [n for n in nodes if n.get('type') == 'attribute']
+     depths = [depth_of(int(n['id'])) for n in leafs]
+     branches = [len(n.get('related', [])) for n in agg]
+     singletons = sum(1 for b in branches if b == 1)
+     return {
+         'n_aggregation_nodes': len(agg),
+         'max_depth': int(max(depths, default=0)),
+         'avg_leaf_depth': round(float(np.mean(depths)), 2) if depths else 0.0,
+         'avg_branching_factor': round(float(np.mean(branches)), 2) if branches else 0.0,
+         'singleton_nodes_%': round(100.0 * singletons / max(len(agg), 1), 1),
+     }
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Encoder: SBERT if available, TF-IDF fallback. Loaded once, reused.
+ # ──────────────────────────────────────────────────────────────────────────────
+ _SBERT = None
+ _SBERT_TRIED = False
+
+
+ def _get_sbert():
+     global _SBERT, _SBERT_TRIED
+     if _SBERT_TRIED:
+         return _SBERT
+     _SBERT_TRIED = True
+     try:
+         from sentence_transformers import SentenceTransformer
+         _SBERT = SentenceTransformer('all-MiniLM-L6-v2')
+     except Exception:
+         _SBERT = None
+     return _SBERT
+
+
+ def encode(texts: list):
+     """Return (unit-normalised vectors, backend_name)."""
+     texts = [str(t) if str(t).strip() else '_' for t in texts]
+     model = _get_sbert()
+     if model is not None:
+         v = model.encode(texts, normalize_embeddings=True, show_progress_bar=False)
+         return np.asarray(v, dtype=float), 'SBERT (all-MiniLM-L6-v2)'
+     from sklearn.feature_extraction.text import TfidfVectorizer
+     X = TfidfVectorizer(stop_words='english', max_features=2000,
+                         min_df=1).fit_transform(texts).toarray().astype(float)
+     norms = np.linalg.norm(X, axis=1, keepdims=True)
+     return X / np.where(norms == 0, 1.0, norms), 'TF-IDF (SBERT unavailable)'
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # TraCo reference-free metrics (Wu et al., AAAI 2024)
+ # ──────────────────────────────────────────────────────────────────────────────
+ def traco_metrics(nodes: list) -> dict:
+     """Parent–child coherence and sibling diversity over node *labels*."""
+     usable = [n for n in nodes if n.get('type') in ('aggregation', 'attribute')]
+     if len(usable) < 2:
+         return {'pc_coherence': 0.0, 'sibling_diversity': 0.0, 'encoder': 'n/a'}
+
+     ids = [int(n['id']) for n in usable]
+     labels = [str(n.get('name', '')) for n in usable]
+     vecs, backend = encode(labels)
+     id2v = {i: vecs[k] for k, i in enumerate(ids)}
+
+     pc_sims, sib_divs = [], []
+     for n in nodes:
+         if n.get('type') == 'root':
+             continue
+         pid = int(n['id'])
+         if pid not in id2v:
+             continue
+         children = [int(c) for c in n.get('related', []) if int(c) in id2v]
+         for cid in children:
+             pc_sims.append(float(np.dot(id2v[pid], id2v[cid])))
+         if len(children) >= 2:
+             cv = np.array([id2v[c] for c in children])
+             S = cv @ cv.T
+             nc = len(children)
+             divs = [1.0 - float(S[i, j]) for i in range(nc) for j in range(i + 1, nc)]
+             sib_divs.append(float(np.mean(divs)))
+
+     return {
+         'pc_coherence': round(float(np.mean(pc_sims)), 4) if pc_sims else 0.0,
+         'sibling_diversity': round(float(np.mean(sib_divs)), 4) if sib_divs else 0.0,
+         'encoder': backend,
+     }
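
The two label-based quantities above can be checked by hand on a single parent. The sketch below uses hand-picked unit vectors in place of real SBERT or TF-IDF embeddings (an assumption made purely for illustration) and plain Python instead of NumPy. Coherence is the mean parent-child cosine; diversity is the mean of one minus the cosine over sibling pairs.

```python
from itertools import combinations


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def parent_child_coherence(parent_vec, child_vecs):
    """Mean cosine similarity between one parent label and each child label."""
    return sum(dot(parent_vec, c) for c in child_vecs) / len(child_vecs)


def sibling_diversity(child_vecs):
    """Mean (1 - cosine) over all unordered pairs of sibling labels."""
    pairs = list(combinations(child_vecs, 2))
    return sum(1.0 - dot(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-D unit vectors standing in for label embeddings.
parent = (1.0, 0.0)
children = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]

pc = parent_child_coherence(parent, children)   # (1.0 + 0.6 + 0.0) / 3
sd = sibling_diversity(children)                # (0.4 + 1.0 + 0.2) / 3
```

In `traco_metrics` the parent-child similarities are pooled over every edge in the tree, while sibling diversity is first averaged within each parent and then across parents; the sketch covers only the single-parent case.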
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # NPMI label coherence (Lau et al., EACL 2014; Mimno et al., EMNLP 2011)
+ # Reference corpus = the variable descriptions themselves.
+ # ──────────────────────────────────────────────────────────────────────────────
+ _TOKEN_RE = re.compile(r'[a-z][a-z]{2,}')
+ _STOP = set(
+     'the a an and or of to in for on with by at from as is are be this that these '
+     'those it its was were has have had not no than then so such can will may '
+     'group description name label value type using used per each'.split()
+ )
+
+
+ def _tokens(text: str) -> set:
+     return {w for w in _TOKEN_RE.findall(str(text).lower()) if w not in _STOP}
+
+
+ def npmi_coherence(nodes: list, corpus_texts: list, topn: int = 5) -> float:
+     """Average NPMI of each aggregation node's label terms over the corpus.
+
+     Returns a value in roughly [-1, 1]; higher = node labels use term
+     combinations that genuinely co-occur in the data (meaningful, not random).
+     """
+     docs = [_tokens(t) for t in corpus_texts]
+     docs = [d for d in docs if d]
+     N = len(docs)
+     if N < 2:
+         return 0.0
+
+     df: Counter = Counter()
+     for d in docs:
+         for w in d:
+             df[w] += 1
+
+     # Collect the term sets we actually need (node labels)
+     label_termsets: list = []
+     needed_terms: set = set()
+     for n in nodes:
+         if n.get('type') != 'aggregation':
+             continue
+         terms = [w for w in _tokens(n.get('name', '')) if df.get(w, 0) > 0]
+         terms = sorted(terms, key=lambda w: df[w], reverse=True)[:topn]
+         if len(terms) >= 2:
+             label_termsets.append(terms)
+             needed_terms.update(terms)
+
+     if not label_termsets:
+         return 0.0
+
+     # Pair co-occurrence counts (only for needed pairs)
+     needed_pairs = set()
+     for terms in label_termsets:
+         for i in range(len(terms)):
+             for j in range(i + 1, len(terms)):
+                 needed_pairs.add(frozenset((terms[i], terms[j])))
+
+     co: Counter = Counter()
+     for d in docs:
+         present = d & needed_terms
+         if len(present) < 2:
+             continue
+         pl = list(present)
+         for i in range(len(pl)):
+             for j in range(i + 1, len(pl)):
+                 pair = frozenset((pl[i], pl[j]))
+                 if pair in needed_pairs:
+                     co[pair] += 1
+
+     eps = 1e-12
+     node_scores: list = []
+     for terms in label_termsets:
+         pair_npmis: list = []
+         for i in range(len(terms)):
+             for j in range(i + 1, len(terms)):
+                 wi, wj = terms[i], terms[j]
+                 c_ij = co.get(frozenset((wi, wj)), 0)
+                 p_ij = (c_ij + eps) / N
+                 p_i = df[wi] / N
+                 p_j = df[wj] / N
+                 pmi = np.log(p_ij / (p_i * p_j + eps) + eps)
+                 npmi = pmi / (-np.log(p_ij + eps))
+                 pair_npmis.append(float(npmi))
+         if pair_npmis:
+             node_scores.append(float(np.mean(pair_npmis)))
+
+     return round(float(np.mean(node_scores)), 4) if node_scores else 0.0
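
As a sanity check on the score range, here is a small worked example of pairwise NPMI over document-level co-occurrence, using the textbook definition NPMI = log(p_ij / (p_i p_j)) / (-log p_ij). It is a sketch under assumptions: the four-document toy corpus is hypothetical, and the boundary cases are handled with explicit conventions (−1 for pairs that never co-occur, 0 for the undefined 0/0 case) rather than the `eps` smoothing used in `npmi_coherence`, so values near the boundaries differ slightly.

```python
import math


def npmi(word_i, word_j, docs):
    """Normalised PMI of two words over a list of token sets (one set per doc)."""
    n = len(docs)
    p_i = sum(word_i in d for d in docs) / n
    p_j = sum(word_j in d for d in docs) / n
    p_ij = sum(word_i in d and word_j in d for d in docs) / n
    if p_ij == 0.0:
        return -1.0  # convention: the words never co-occur
    if p_ij == 1.0:
        return 0.0   # 0/0 is undefined; 0.0 is an arbitrary choice in this sketch
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


# Hypothetical toy corpus: each set holds the tokens of one description.
docs = [
    {'team', 'unique', 'code'},
    {'team', 'unique', 'name'},
    {'stage', 'code'},
    {'stage', 'name'},
]

always_together = npmi('team', 'unique', docs)  # p_i = p_j = p_ij = 0.5
independent = npmi('team', 'code', docs)        # p_ij = 0.25 = p_i * p_j
never_together = npmi('team', 'stage', docs)    # p_ij = 0
```

The three pairs land at the top, middle and bottom of the range, which is the reading the docstring gives: higher means the label terms co-occur more than chance would predict.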
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Secondary (descriptive, caveated): group-structure preservation
+ # ──────────────────────────────────────────────────────────────────────────────
+ def _depth1_assignments(nodes: list, can) -> list:
+     pm = build_parent_map(nodes)
+
+     def depth1(nid: int) -> int:
+         while pm.get(nid, -1) not in (-1, 0):
+             nid = pm[nid]
+         return nid
+
+     lid_to_nid = {n['metadata']['leaf_id']: int(n['id'])
+                   for n in nodes if n.get('type') == 'attribute' and 'metadata' in n}
+     return [depth1(lid_to_nid[lid]) if lid in lid_to_nid else -1
+             for lid in can['_leaf_id']]
+
+
+ def _purity(y_true, y_pred) -> float:
+     clusters: dict = {}
+     for t, p in zip(y_true, y_pred):
+         clusters.setdefault(p, []).append(t)
+     correct = sum(Counter(v).most_common(1)[0][1] for v in clusters.values())
+     return correct / max(len(y_true), 1)
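
Purity is easy to verify by hand, and a tiny example also shows why it is reported only descriptively. The sketch below mirrors the majority-count logic of `_purity`; the group names and depth-1 ids are hypothetical.

```python
from collections import Counter, defaultdict


def purity(y_true, y_pred):
    """Fraction of items that belong to the majority true group of their cluster."""
    clusters = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        clusters[p].append(t)
    majority_total = sum(Counter(v).most_common(1)[0][1] for v in clusters.values())
    return majority_total / max(len(y_true), 1)


# Hypothetical leaves: true group per leaf, and the depth-1 node each landed in.
groups = ['match', 'match', 'player', 'player', 'player']
depth1 = [1, 1, 1, 2, 2]

score = purity(groups, depth1)                 # cluster 1 -> 2 of 3, cluster 2 -> 2 of 2
degenerate = purity(groups, [0, 1, 2, 3, 4])   # one cluster per leaf
```

Putting every leaf in its own cluster yields a purity of 1.0, so purity on its own rewards over-splitting; NMI and ARI do not share that behaviour in the same way, which is a reason to read the three figures together.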
+
+
+ def group_preservation(nodes: list, can) -> dict:
+     """NMI / ARI / Purity of the depth-1 partition vs the group column.
+
+     CAVEAT: the group column is a construction input in every approach, so this
+     is a descriptive 'structure preservation' figure, NOT an accuracy metric.
+     """
+     from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
+     from sklearn.preprocessing import LabelEncoder
+     import pandas as pd
+
+     y_true_raw = can['_group_path'].apply(
+         lambda x: str(x).split(' > ')[0].strip()
+         if pd.notna(x) and str(x) not in ('', 'nan') else 'Ungrouped'
+     ).tolist()
+     y_pred_raw = _depth1_assignments(nodes, can)
+
+     y_true = LabelEncoder().fit_transform(y_true_raw)
+     y_pred = LabelEncoder().fit_transform(y_pred_raw)
+     return {
+         'NMI': round(float(normalized_mutual_info_score(
+             y_true, y_pred, average_method='arithmetic')), 4),
+         'ARI': round(float(adjusted_rand_score(y_true, y_pred)), 4),
+         'Purity': round(_purity(y_true_raw, y_pred_raw), 4),
+     }
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Gold-standard comparison: Edge-F1 / Ancestor-F1
+ #
+ # HiExpan (Shen et al., KDD 2018) scores a system taxonomy against a hand-built
+ # gold taxonomy with Edge-F1 (direct parent–child links) and Ancestor-F1 (all
+ # ancestor links). Because our internal-node *labels* differ between the gold
+ # tree and each system, we use the label-free leaf-pair formulation (the
+ # pair-counting tradition, Fowlkes & Mallows 1983):
+ #
+ #   • Edge-F1: over pairs of leaves that share the same IMMEDIATE parent
+ #     (i.e. they are siblings). Strict: rewards correct granularity.
+ #   • Ancestor-F1: over pairs of leaves that share ANY non-root ancestor
+ #     (i.e. they are grouped together somewhere). Lenient.
+ #
+ # Leaves are matched between gold and system by their attribute-node NAME (the
+ # variable label), the one field all three approaches expose for every leaf.
+ # Only leaves present in BOTH the gold subset and the system tree are scored, so
+ # a gold subset of 50–100 variables fairly evaluates a full hierarchy.
+ # ──────────────────────────────────────────────────────────────────────────────
+ def _pred_leaf_lineage(nodes: list) -> dict:
+     """leaf name → list of ancestor node ids (root-most first, excl. root & leaf)."""
+     pm = build_parent_map(nodes)
+     id_to_node = {int(n['id']): n for n in nodes}
+     lineage: dict = {}
+     for n in nodes:
+         if n.get('type') != 'attribute':
+             continue
+         name = str(n.get('name', ''))
+         cur = int(n['id'])
+         anc, seen = [], set()
+         while cur in pm and cur not in seen:
+             seen.add(cur)
+             cur = pm[cur]
+             nd = id_to_node.get(cur)
+             if nd is None or nd.get('type') == 'root':
+                 break
+             anc.append(cur)
+         anc.reverse()
+         lineage[name] = anc
+     return lineage
+
+
+ def _gold_leaf_lineage(gold_df) -> dict:
+     """leaf name → list of cumulative path-prefix strings (the gold ancestors)."""
+     lineage: dict = {}
+     for _, r in gold_df.iterrows():
+         name = str(r['leaf_label'])
+         path = str(r.get('gold_path', '') or '')
+         comps = [c.strip() for c in path.split('>')
+                  if c.strip() and c.strip().lower() != 'ungrouped']
+         anc, pref = [], ''
+         for c in comps:
+             pref = c if not pref else f'{pref} > {c}'
+             anc.append(pref)
+         lineage[name] = anc
+     return lineage
+
+
+ def _sibling_pairs(lineage: dict) -> set:
+     from collections import defaultdict
+     groups: dict = defaultdict(list)
+     for name, anc in lineage.items():
+         if anc:
+             groups[anc[-1]].append(name)
+     pairs: set = set()
+     for members in groups.values():
+         m = sorted(members)
+         for i in range(len(m)):
+             for j in range(i + 1, len(m)):
+                 pairs.add((m[i], m[j]))
+     return pairs
+
+
+ def _cogrouped_pairs(lineage: dict) -> set:
+     from collections import defaultdict
+     occ: dict = defaultdict(set)
+     for name, anc in lineage.items():
+         for a in anc:
+             occ[a].add(name)
+     pairs: set = set()
+     for members in occ.values():
+         m = sorted(members)
+         for i in range(len(m)):
+             for j in range(i + 1, len(m)):
+                 pairs.add((m[i], m[j]))
+     return pairs
+
+
+ def _prf(pred_set: set, gold_set: set) -> dict:
+     if not pred_set and not gold_set:
+         return {'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
+     tp = len(pred_set & gold_set)
+     p = tp / len(pred_set) if pred_set else 0.0
+     r = tp / len(gold_set) if gold_set else 0.0
+     f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+     return {'precision': round(p, 4), 'recall': round(r, 4), 'f1': round(f, 4)}
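
The pair-counting formulation is easiest to follow on a four-leaf example. The sketch below is a simplified stand-in for the `_sibling_pairs` plus `_prf` combination: it takes sibling groups directly as sets rather than deriving them from lineages, and the groupings are hypothetical.

```python
from itertools import combinations


def leaf_pairs(groups):
    """All unordered leaf pairs that fall inside the same group."""
    pairs = set()
    for members in groups:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def prf(pred_pairs, gold_pairs):
    """Precision, recall and F1 over two sets of leaf pairs."""
    tp = len(pred_pairs & gold_pairs)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold_pairs) if gold_pairs else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Hypothetical sibling groups: the system merges three team columns,
# while gold keeps team_code with player_id.
pred_groups = [{'team_id', 'team_name', 'team_code'}, {'player_id'}]
gold_groups = [{'team_id', 'team_name'}, {'team_code', 'player_id'}]

precision, recall, f1 = prf(leaf_pairs(pred_groups), leaf_pairs(gold_groups))
```

The system proposes three pairs and gold has two; only `(team_id, team_name)` is shared, giving precision 1/3, recall 1/2 and F1 0.4.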
+
+
+ def gold_comparison(nodes: list, gold_df) -> dict:
+     """Edge-F1 and Ancestor-F1 of a system tree vs a hand-built gold tree."""
+     pred = _pred_leaf_lineage(nodes)
+     gold = _gold_leaf_lineage(gold_df)
+     shared = set(pred) & set(gold)
+     pred = {k: v for k, v in pred.items() if k in shared}
+     gold = {k: v for k, v in gold.items() if k in shared}
+     return {
+         'n_matched_leaves': len(shared),
+         'edge_f1': _prf(_sibling_pairs(pred), _sibling_pairs(gold)),
+         'ancestor_f1': _prf(_cogrouped_pairs(pred), _cogrouped_pairs(gold)),
+     }
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Granularity-tolerant, label-independent structural F1 (set-overlap matching)
+ #
+ # Edge-F1 punishes a system for adding *correct* extra depth, because two leaves
+ # that gold lists as siblings stop being immediate siblings once the system
+ # refines them into sub-tiers. That makes edge-F1 unfair to deliberately deeper
+ # trees (Approaches 1 & 2). Set-overlap F1 fixes this: it matches each gold
+ # cluster (the set of leaves under a gold path-prefix) to the system node whose
+ # leaf set overlaps it most (Jaccard), regardless of that node's depth or label.
+ #
+ #   • precision: for each system aggregation node, its best Jaccard with any
+ #     gold cluster, averaged. Low when the system invents groups
+ #     gold does not have (e.g. one node per delay value = over-split).
+ #   • recall: for each gold cluster, its best Jaccard with any system node,
+ #     averaged. Low when the system fails to recover a gold group.
+ #
+ # This is the cluster-matching / overlap-F1 tradition (e.g. ontology alignment,
+ # hierarchical-clustering evaluation). Label-free, so it compares the three
+ # approaches fairly even though their internal-node labels differ.
+ # ──────────────────────────────────────────────────────────────────────────────
+ def _system_clusters(nodes: list) -> list:
+     """Each aggregation node → frozenset of leaf NAMES in its subtree (size ≥ 2)."""
+     id_to_node = {int(n['id']): n for n in nodes}
+     out: list = []
+     for n in nodes:
+         if n.get('type') != 'aggregation':
+             continue
+         leaves: list = []
+         stack = [int(n['id'])]
+         seen: set = set()
+         while stack:
+             x = stack.pop()
+             if x in seen:
+                 continue
+             seen.add(x)
+             nd = id_to_node.get(x)
+             if nd is None:
+                 continue
+             if nd.get('type') == 'attribute':
+                 leaves.append(str(nd.get('name', '')))
+             else:
+                 stack.extend(int(c) for c in nd.get('related', []))
+         s = frozenset(leaves)
+         if len(s) >= 2:
+             out.append(s)
+     return out
+
+
+ def _gold_clusters(gold_df) -> list:
+     """Each gold path-prefix → frozenset of leaf NAMES under it (size ≥ 2)."""
+     from collections import defaultdict
+     occ: dict = defaultdict(set)
+     for name, anc in _gold_leaf_lineage(gold_df).items():
+         for a in anc:
+             occ[a].add(name)
+     return [frozenset(v) for v in occ.values() if len(v) >= 2]
+
+
+ def set_overlap_f1(nodes: list, gold_df) -> dict:
+     """Granularity-tolerant, label-free hierarchical F1 via best leaf-set Jaccard."""
+     pred_names = set(_pred_leaf_lineage(nodes))
+     gold_names = {str(x) for x in gold_df['leaf_label']}
+     shared = pred_names & gold_names
+     if len(shared) < 2:
+         return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
+
+     sys_cl = [c & shared for c in _system_clusters(nodes)]
+     sys_cl = [c for c in sys_cl if len(c) >= 2]
+     gold_cl = [c & shared for c in _gold_clusters(gold_df)]
+     gold_cl = [c for c in gold_cl if len(c) >= 2]
+     if not sys_cl or not gold_cl:
+         return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
+
+     def jac(a: frozenset, b: frozenset) -> float:
+         u = len(a | b)
+         return len(a & b) / u if u else 0.0
+
+     prec = float(np.mean([max(jac(s, g) for g in gold_cl) for s in sys_cl]))
+     rec = float(np.mean([max(jac(s, g) for s in sys_cl) for g in gold_cl]))
+     f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
+     return {'precision': round(prec, 4), 'recall': round(rec, 4), 'f1': round(f1, 4)}
+
+
+ def refinement_breakdown(nodes: list, gold_df) -> dict:
+     """Decompose edge-F1 disagreements into harmless refinement vs real errors.
+
+     • wrong_merge_rate: system sibling pairs that gold does NOT co-group anywhere
+       (genuine mistakes: variables wrongly placed together).
+     • refinement_rate: gold sibling pairs the system keeps co-grouped but at a
+       FINER level (split into sub-tiers). These are deeper-but-consistent
+       placements, which edge-F1 unfairly penalises.
+     • missed_rate: gold sibling pairs the system fails to co-group at all
+       (real recall failures).
+     """
+     pred = _pred_leaf_lineage(nodes)
+     gold = _gold_leaf_lineage(gold_df)
+     shared = set(pred) & set(gold)
+     pred = {k: v for k, v in pred.items() if k in shared}
+     gold = {k: v for k, v in gold.items() if k in shared}
+
+     sys_sib = _sibling_pairs(pred)
+     sys_cog = _cogrouped_pairs(pred)
+     gold_sib = _sibling_pairs(gold)
+     gold_cog = _cogrouped_pairs(gold)
+
+     wrong_merge = len(sys_sib - gold_cog)
+     refined = len((gold_sib & sys_cog) - sys_sib)
+     missed = len(gold_sib - sys_cog)
+     return {
+         'wrong_merge_rate': round(wrong_merge / len(sys_sib), 4) if sys_sib else 0.0,
+         'refinement_rate': round(refined / len(gold_sib), 4) if gold_sib else 0.0,
+         'missed_rate': round(missed / len(gold_sib), 4) if gold_sib else 0.0,
+     }
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # One-call bundle
+ # ──────────────────────────────────────────────────────────────────────────────
+ def evaluate(nodes: list, corpus_texts: list | None = None, can=None,
+              gold_df=None) -> dict:
+     """Compute the full metric bundle for one hierarchy."""
+     out: dict = {}
+     out.update(traco_metrics(nodes))
+     out['npmi_coherence'] = (npmi_coherence(nodes, corpus_texts)
+                              if corpus_texts is not None else None)
+     out.update({f'struct_{k}': v for k, v in structural_stats(nodes).items()})
+     if can is not None:
+         out['group_preservation'] = group_preservation(nodes, can)
+     if gold_df is not None:
+         out['gold'] = gold_comparison(nodes, gold_df)
+     return out
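The contrast between edge-F1 and set-overlap F1 described in the comments above can be checked on a toy tree. This is a minimal sketch, not the repository's code: the helper names (`jaccard`, `overlap_f1`, `pairs`, `pair_f1`) and the six-leaf example are assumptions made for illustration, and sibling pairs are assumed to mean leaves that share the same immediate parent.

```python
from itertools import combinations
from statistics import fmean


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two leaf sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_f1(sys_clusters: list, gold_clusters: list) -> tuple:
    """Best-match Jaccard averaged over system nodes (precision) and gold clusters (recall)."""
    prec = fmean([max(jaccard(s, g) for g in gold_clusters) for s in sys_clusters])
    rec = fmean([max(jaccard(s, g) for s in sys_clusters) for g in gold_clusters])
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return prec, rec, f1


def pairs(groups: list) -> set:
    """All unordered leaf pairs that fall inside the same group."""
    return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}


def pair_f1(pred_pairs: set, gold_pairs: set) -> float:
    """Plain F1 over two sets of leaf pairs (the edge-F1 style of scoring)."""
    tp = len(pred_pairs & gold_pairs)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold_pairs) if gold_pairs else 0.0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Gold: two flat groups. System: same top-level groups, but {a,b,c,d} is
# refined into two sub-tiers, so only (a,b), (c,d), (e,f) are immediate siblings.
gold_groups = [frozenset('abcd'), frozenset('ef')]
sys_sibling_groups = [frozenset('ab'), frozenset('cd'), frozenset('ef')]
sys_clusters = [frozenset('abcd'), *sys_sibling_groups]  # every aggregation node

edge = pair_f1(pairs(sys_sibling_groups), pairs(gold_groups))
prec, rec, f1 = overlap_f1(sys_clusters, gold_groups)
print(f'edge-F1={edge:.4f}  overlap P={prec:.4f} R={rec:.4f} F1={f1:.4f}')
```

In this toy case the refined tree loses sibling recall under pair-based scoring, while the set-overlap score only pays a precision cost for the two extra sub-tier nodes.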
launcher.py ADDED
@@ -0,0 +1,137 @@
+ """
+ launcher.py: start Baseline, Approach 1 and Approach 2 on different ports,
+              open them in browser tabs, and shut them all down at once when
+              you press Enter.
+
+ Usage:
+     python launcher.py
+
+ Each app has its own file uploader: upload a CSV in each tab (the same file,
+ if you want to compare approaches side by side).
+ """
+
+ from __future__ import annotations
+ import socket
+ import subprocess
+ import sys
+ import time
+ import webbrowser
+ from pathlib import Path
+
+ HERE = Path(__file__).resolve().parent
+
+ JOBS = [
+     ('baseline.py', 8501, 'Baseline'),
+     ('approach_1.py', 8502, 'Approach 1'),
+     ('approach_2.py', 8503, 'Approach 2'),
+ ]
+
+ # TIP: to compare TWO datasets at once you do NOT need extra ports. Streamlit
+ # gives every browser tab its own independent session (separate upload + state),
+ # so just open the same URL twice, e.g. open http://localhost:8501 in two tabs,
+ # load AI-MIND in one and HCP in the other.
+
+ OPEN_BROWSER = True
+ STARTUP_WAIT_SECS = 5
+
+
+ def _port_in_use(port: int) -> bool:
+     """Return True if something is already listening on this port."""
+     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+         s.settimeout(0.5)
+         return s.connect_ex(('127.0.0.1', port)) == 0
+
+
+ def _kill_tree(p: subprocess.Popen) -> None:
+     """Kill a process and, where possible, all its children (Windows and POSIX)."""
+     if sys.platform == 'win32':
+         subprocess.call(
+             ['taskkill', '/F', '/T', '/PID', str(p.pid)],
+             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
+         )
+     else:
+         try:
+             import os, signal
+             pgid = os.getpgid(p.pid)
+             # Children started without their own session share the launcher's
+             # process group; signalling that group would kill the launcher too.
+             if pgid != os.getpgrp():
+                 os.killpg(pgid, signal.SIGTERM)
+             else:
+                 p.terminate()
+         except Exception:
+             p.terminate()
+     try:
+         p.wait(timeout=5)
+     except subprocess.TimeoutExpired:
+         p.kill()
+
+
+ def main() -> int:
+     # Validate scripts
+     missing = [s for s, _, _ in JOBS if not (HERE / s).is_file()]
+     if missing:
+         print(f'ERROR: missing files: {missing}')
+         return 1
+
+     # Abort if any port is already occupied; this prevents the duplicate-tab problem
+     busy = [(label, port) for _, port, label in JOBS if _port_in_use(port)]
+     if busy:
+         for label, port in busy:
+             print(f'ERROR: port {port} ({label}) is already in use.')
+         print('\nKill the existing servers first (Task Manager → python.exe → End Task),')
+         print('then run launcher.py again.')
+         return 1
+
+     procs: list[subprocess.Popen] = []
+     print(f'Working directory: {HERE}')
+     print(f'Launching {len(JOBS)} Streamlit instance(s)…\n')
+
+     for script, port, label in JOBS:
+         cmd = [
+             sys.executable, '-m', 'streamlit', 'run', str(HERE / script),
+             '--server.port', str(port),
+             '--server.headless', 'true',  # suppress Streamlit's own browser open
+             '--browser.gatherUsageStats', 'false',
+         ]
+         try:
+             # Do NOT use CREATE_NEW_PROCESS_GROUP: it breaks taskkill /T
+             p = subprocess.Popen(cmd)
+             procs.append(p)
+             print(f' ✓ {label:<12} pid={p.pid:<6} → http://localhost:{port}')
+         except Exception as e:
+             print(f' ✗ FAILED {label}: {e}')
+
+     if not procs:
+         print('Nothing started.')
+         return 1
+
+     # Wait for each server to actually be reachable before opening the browser
+     print(f'\nWaiting for servers to come up (max {STARTUP_WAIT_SECS}s each)…')
+     for _, port, label in JOBS:
+         for _ in range(STARTUP_WAIT_SECS * 2):
+             if _port_in_use(port):
+                 print(f' ✓ {label} ready')
+                 break
+             time.sleep(0.5)
+         else:
+             print(f' ⚠ {label} did not respond in time; opening anyway')
+
+     if OPEN_BROWSER:
+         print('\nOpening browser tabs…')
+         for _, port, label in JOBS:
+             url = f'http://localhost:{port}'
+             webbrowser.open_new_tab(url)
+             print(f' • {label} → {url}')
+             time.sleep(0.3)  # small gap so tabs open in order
+
+     print('\nAll servers running.')
+     print('Press Enter (in THIS terminal) to stop all servers and exit.\n')
+     try:
+         input()
+     except KeyboardInterrupt:
+         pass
+
+     print('\nStopping servers…')
+     for p in procs:
+         _kill_tree(p)
+     print('Done.')
+     return 0
+
+
+ if __name__ == '__main__':
+     raise SystemExit(main())
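The readiness wait in `main()` polls each port until something accepts a TCP connection. A minimal standalone sketch of that idea follows; `wait_for_port` and its parameters are hypothetical names, not part of launcher.py, and the demo uses a throwaway listening socket instead of a Streamlit server.

```python
import socket
import time


def port_in_use(port: int, host: str = '127.0.0.1') -> bool:
    """True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0


def wait_for_port(port: int, timeout_secs: float = 5.0, interval: float = 0.5) -> bool:
    """Poll until the port accepts connections or the deadline passes."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if port_in_use(port):
            return True
        time.sleep(interval)
    return False


# Demo: bind a throwaway listener on an OS-assigned port, then poll it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(('127.0.0.1', 0))
    server.listen()
    demo_port = server.getsockname()[1]
    ready_while_listening = wait_for_port(demo_port, timeout_secs=2.0)

# The listener is closed now, so polling should time out.
ready_after_close = wait_for_port(demo_port, timeout_secs=0.6, interval=0.2)
print(ready_while_listening, ready_after_close)
```

A deadline based on `time.monotonic()` keeps the total wait bounded even when individual connection attempts are slow, which a fixed iteration count does not guarantee.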
requirements.txt ADDED
@@ -0,0 +1,17 @@
+ streamlit>=1.30
+ pandas>=2.0
+ numpy>=1.24
+ scikit-learn>=1.3
+ plotly>=5.18
+ sentence-transformers>=2.5
+ requests>=2.31
+ openpyxl>=3.1
+
+ # Approach 2: semantic aspect discovery (recent SOTA, NeurIPS 2024)
+ # Optional but recommended. Pulls torch as a transitive dependency.
+ fastopic>=0.0.5
+
+ # Approach 2: local LLM label refinement (TopicTag-style, evidence-grounded).
+ # Uses an Ollama server on localhost:11434 with the OpenAI-compatible /v1
+ # endpoint. See README for `ollama pull` instructions.
+ openai>=1.30
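The comments above mark `fastopic` as optional. One way an app could detect an optional package at run time is sketched below; `has_module` is a hypothetical helper, and the diff does not show how the apps actually guard this import.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


# Report which optional Approach 2 dependencies are available.
for optional in ('fastopic', 'openai'):
    status = 'available' if has_module(optional) else 'missing (feature disabled)'
    print(f'{optional}: {status}')
```

Checking with `find_spec` avoids importing a heavy package (and its torch dependency) just to find out whether it is installed.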