android-reverse-engineering.../plugins/android-reverse-engineering/skills
Simone Avogadro f68d9ce3be feat: post-filter --urls to drop dictionary noise while keeping IPs and apex hosts
The hardening patch widened STRICT_URL to recover IPv4 literals, apex
2-label domains and internal hosts that the PR's strict-only regex
discarded as collateral while killing Kotlin-stdlib dictionary noise.
Widening alone reopened a narrow noise class: 'word.word' fragments such
as "www.this" / "this.introduction" pass as apex domains.

Keep extraction permissive and add a small awk pass that decides per host:
- IPv4 literal: always keep (dict fragments are words, never dotted-quads)
- >=3 labels: always keep (any TLD; same tolerance as the original regex)
- any host with a :port or /path: always keep (structured = high signal)
- bare 2-label apex: keep only when the TLD is a real one, matched as a
  whole field (so "introduction" != "in"; a single mega-regex would
  prefix-match here)
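
The four rules above can be sketched as a small POSIX awk pass. This is an
illustrative sketch, not the code shipped in this commit: the function name
filter_urls, the short TLD list, and the choice to drop a bare single-label
host are all assumptions made for the example. It reads one URL per line on
stdin and prints the URLs it keeps.

```shell
#!/bin/sh
# Hypothetical sketch of the per-host decision; names and TLD list are assumed.
filter_urls() {
  awk -v tlds="com net org io in" '
    BEGIN {
      # Build a set of accepted TLDs so the lookup is a whole-field match.
      n = split(tlds, t, " ")
      for (i = 1; i <= n; i++) real[t[i]] = 1
    }
    {
      rest = $0
      s = index(rest, "://")                  # strip the scheme if present
      if (s > 0) rest = substr(rest, s + 3)

      p = index(rest, "/")                    # split host from path
      host = (p > 0) ? substr(rest, 1, p - 1) : rest
      path = (p > 0) ? substr(rest, p) : ""
      structured = (length(path) > 1)         # a lone "/" is not a path here

      if (host ~ /:[0-9]+$/) {                # an explicit port also counts
        structured = 1
        sub(/:[0-9]+$/, "", host)
      }

      # Rule 1: IPv4 literal, always keep.
      if (host ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) { print; next }

      nlab = split(host, label, "[.]")

      # Rules 2 and 3: three or more labels, or a port/path, always keep.
      if (nlab >= 3 || structured) { print; next }

      # Rule 4: bare 2-label apex, keep only for a listed TLD.
      # Bare single-label hosts fall through and are dropped (assumption).
      if (nlab == 2 && (tolower(label[2]) in real)) print
    }
  '
}
```

Fed "https://this.introduction" and "https://example.in", the sketch keeps
only the second: the TLD is looked up as the whole last label, so
"introduction" never matches "in". "https://foo.store" is dropped while
"https://foo.store/cart" is kept, which is the trade-off described below.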

Trade-off documented inline: a first-party host referenced bare with an
uncommon TLD (e.g. https://foo.store with no path) is dropped; a path or
port keeps it. The awk pass uses only POSIX features (sub, split, ~,
print), so it is more portable than the bash>=4 'declare -A' already used
in the summary header.

Verified: dictionary noise dropped; IPs, apex, internal and subdomain
hosts kept; --all on a zero-match tree still exits 0; host list and
full-URL list stay consistent (no orphan hosts).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 11:06:30 +02:00
android-reverse-engineering feat: post-filter --urls to drop dictionary noise while keeping IPs and apex hosts 2026-06-10 11:06:30 +02:00