feat: post-filter --urls to drop dictionary noise while keeping IPs and apex hosts
The hardening patch widened STRICT_URL to recover IPv4 literals, apex 2-label domains and internal hosts that the PR's strict-only regex discarded as collateral while killing Kotlin-stdlib dictionary noise. Widening alone reopened a narrow noise class: 'word.word' fragments such as "www.this" / "this.introduction" pass as apex domains. Keep extraction permissive and add a small awk pass that decides per host: - IPv4 literal: always keep (dict fragments are words, never dotted-quads) - >=3 labels: always keep (any TLD; same tolerance as the original regex) - any host with a :port or /path: always keep (structured = high signal) - bare 2-label apex: keep only when the TLD is a real one, matched as a whole field (so "introduction" != "in" — the prefix-match bug a single mega-regex would have) Trade-off documented inline: a first-party host referenced bare with an uncommon TLD (e.g. https://foo.store with no path) is dropped; a path or port keeps it. awk is POSIX (sub/split/~/print) — more portable than the bash>=4 'declare -A' already used in the summary header. Verified: dictionary noise dropped; IPs, apex, internal and subdomain hosts kept; --all on a zero-match tree still exits 0; host list and full-URL list stay consistent (no orphan hosts). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
ed97b8508b
commit
f68d9ce3be
|
|
@ -230,8 +230,35 @@ if [[ "$SEARCH_ALL" == true || "$SEARCH_URLS" == true ]]; then
|
|||
|
||||
TMP="$(mktemp)"
|
||||
trap 'rm -f "$TMP"' EXIT
|
||||
# Extraction (STRICT_URL) is deliberately permissive; this awk pass drops the
|
||||
# residual Kotlin-stdlib dictionary noise WITHOUT losing the high-signal
|
||||
# shapes a strict-only regex discards (IPs, apex domains, internal hosts).
|
||||
# Decision table, top-down, on the host (authority before any :port / /path):
|
||||
# * IPv4 literal -> keep (dict fragments are words,
|
||||
# never dotted-quads)
|
||||
# * >=3 labels (sub.domain.tld) -> keep (any TLD; same tolerance the
|
||||
# original strict regex had)
|
||||
# * any host WITH a :port or /path -> keep (structured = high signal:
|
||||
# localhost:3000, svc/health)
|
||||
# * bare 2-label apex, no port/path -> keep ONLY if the TLD is a real one,
|
||||
# compared as a whole field (kills
|
||||
# "www.this" / "this.introduction",
|
||||
# keeps "mytrackera-api.com")
|
||||
# Trade-off: a first-party host referenced bare with an uncommon TLD (e.g.
|
||||
# https://foo.store with no path) is dropped — give it a path/port, or add the
|
||||
# TLD to the list below, if you hit that case.
|
||||
{ grep -rhoE --include='*.java' --include='*.kt' "$STRICT_URL" "$SOURCE_DIR" 2>/dev/null || true; } \
|
||||
| sort -u > "$TMP"
|
||||
| sort -u \
|
||||
| awk '
|
||||
{ rest=$0; sub(/^https?:\/\//,"",rest)
|
||||
host=rest; sub(/[/:].*/,"",host)
|
||||
haspathport = (rest ~ /[/:]/)
|
||||
if (host ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) { print; next } # IPv4
|
||||
n = split(host, a, ".")
|
||||
if (n >= 3) { print; next } # sub.domain.tld
|
||||
if (haspathport) { print; next } # has :port or /path
|
||||
if (n == 2 && a[2] ~ /^(com|net|org|io|co|app|dev|me|ai|xyz|info|biz|gov|edu|mil|int|tech|cloud|uk|de|fr|it|es|nl|in|us|ca|au|jp|cn|br|ru|eu|ch|se|no|fi|dk|pl|pt|gr|ie|be|at|cz|sg|hk|kr|tw|mx|ar|cl|za|nz)$/) print # real apex TLD
|
||||
}' > "$TMP"
|
||||
|
||||
# Extract host: strip scheme, take part up to first ':' or '/'.
|
||||
HOSTS_TMP="$(mktemp)"
|
||||
|
|
|
|||
Loading…
Reference in New Issue