Hardening ScamDetector against prompt injection, hallucinations, and abuse
Defense against prompt injection, prevention of model hallucinations, layered rate limiting, and the rest of the changes that hardened ScamDetector for real production use.

After publishing ScamDetector, documenting its architecture under the hood, and going over the changes that came in afterward, the next step was inevitable. It's not enough for something to work; it has to hold up. Spending an afternoon reviewing the code with an attacker's mindset is cheaper than doing it later, when you've got a real problem on your hands.
This article covers the changes I made to prepare ScamDetector for real threats. These aren't generic best practices; they're specific decisions driven by problems I found when I asked myself, "what would I do if I wanted to break this?"
When the model makes up URLs that don't exist
One of ScamDetector's features is extracting URLs from screenshots. If you get a suspicious SMS and take a screenshot, the tool detects the visible URLs in the image so you can scan them without copying them by hand.
The problem showed up with cropped screenshots: an SMS that ends with a link cut off by the edge of the screen, or a forwarded message where the URL is only partially visible. It took me a while to figure out what was happening, because tests with full screenshots worked perfectly. With real-world screenshots, though, Gemini Flash tried to "complete" the URL it couldn't fully see, generating web addresses that looked plausible but simply weren't in the original image.
The fix had two parts. On the one hand, I tightened the extraction prompt with explicit rules that forbid inventing, completing, or inferring partial URLs: only URLs that are fully visible and readable get extracted. On the other, I added stricter server-side validation that discards incomplete-looking URLs before sending them to the scanner.
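A minimal sketch of that second layer, assuming a hypothetical looksComplete() helper and a deliberately short TLD allowlist (the real heuristics may differ):

```javascript
// Hypothetical server-side filter: discard URLs that look truncated
// before they ever reach the scanner. Names and heuristics are illustrative.
const KNOWN_TLDS = new Set(['com', 'net', 'org', 'es', 'dev', 'info']);

function looksComplete(candidate) {
  let url;
  try {
    url = new URL(candidate.startsWith('http') ? candidate : `https://${candidate}`);
  } catch {
    return false; // not even parseable as a URL
  }
  const labels = url.hostname.split('.');
  const tld = labels[labels.length - 1];
  // A hostname cropped mid-word usually ends in a fragment that isn't
  // a real TLD, or in a trailing hyphen left behind by the cut.
  if (!KNOWN_TLDS.has(tld)) return false;
  if (url.hostname.endsWith('-')) return false;
  return true;
}

const extracted = ['https://paypa1-secure.com/login', 'https://secure-banco.c'];
const safeToScan = extracted.filter(looksComplete);
// -> only the first survives; the truncated candidate never gets scanned
```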
It's a reminder that you can't blindly trust a model's output. LLMs are designed to be helpful, and sometimes being helpful means making up a convincing answer when the real information isn't available.
Prompt injection as a fraud signal
ScamDetector takes user-written text, and that text is included in the context sent to the AI model. This is the classic prompt injection scenario, where someone tries to manipulate the model's behavior by inserting instructions into their message.
The obvious defense is to sanitize the input so the model doesn't confuse user data with system instructions. I implemented that. But I also added something I find more interesting: if the model detects an injection attempt in the message being analyzed, it treats it as a fraud indicator and raises the risk level.
That makes sense. Someone trying to manipulate a scam detection tool probably isn't doing it out of academic curiosity. The result is that attacking the tool makes it better at its original job. Instead of being a vulnerability, the injection attempt becomes evidence.
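The wiring from detection to risk is simple enough to sketch. A minimal illustration with made-up patterns and scoring; in the real tool the detection comes from the model itself, not from regexes:

```javascript
// Illustrative only: the pattern list and scoring are assumptions, and
// in ScamDetector the model does the detecting, not a regex layer.
const INJECTION_PATTERNS = [
  /ignore (all |your )?previous instructions/i,
  /you are now/i,
  /reveal (the |your )?system prompt/i,
];

function injectionSignal(userText) {
  const hits = INJECTION_PATTERNS.filter((re) => re.test(userText));
  return {
    detected: hits.length > 0,
    // The attempt itself becomes evidence: it raises the risk score
    // instead of merely being stripped out of the input.
    riskBonus: hits.length * 15,
  };
}
```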
Unicode sanitization, what you can't see can still fool you
Prompt injection isn't the only way to manipulate the text that reaches the model. There are Unicode characters that are invisible but change the identity of a string. A zero-width character (ZWSP, ZWJ, ZWNJ) inserted in the middle of a URL makes it look identical, but technically it's a completely different address. Bidirectional control characters can alter the order in which text is displayed without the user noticing. And variation selectors change how a character is represented without changing how it looks in most fonts.
Another defense I implemented is homoglyph detection. The Cyrillic letter "а" is visually identical to the Latin "a", but they're different code points. A domain that mixes Latin characters with Cyrillic or Greek ones in the same word is a classic phishing signal, and now ScamDetector catches it before the text reaches the model.
Following the same philosophy as with prompt injection, sanitization doesn't block the analysis; it generates flags that are passed to the model as additional context. If a message contains suspicious invisible characters or mixes alphabets in an unusual way, the model knows about it and can factor that into its risk assessment. Flagging instead of blocking, because the final decision still comes from the full analysis.
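Put together, the whole layer fits in one small function. A minimal sketch with hypothetical flag names; the real module covers more character ranges and more scripts:

```javascript
// Sketch of flag-based sanitization. Flag names and character ranges
// are illustrative; the real implementation is more thorough.
const INVISIBLE = /[\u200B-\u200D\u2060\uFE00-\uFE0F\u202A-\u202E\u2066-\u2069]/;

function sanitizeText(input) {
  const flags = [];
  if (INVISIBLE.test(input)) flags.push('invisible_unicode');
  const clean = input.replace(new RegExp(INVISIBLE.source, 'g'), '');

  // Mixed-script check per token: Latin plus Cyrillic or Greek in the
  // same word is the classic homoglyph trick ("а" vs "a").
  for (const token of clean.split(/\s+/)) {
    const latin = /\p{Script=Latin}/u.test(token);
    const confusable = /[\p{Script=Cyrillic}\p{Script=Greek}]/u.test(token);
    if (latin && confusable) {
      flags.push('mixed_script');
      break;
    }
  }
  // Nothing is blocked here: the flags travel with the text
  // as extra context for the model.
  return { clean, flags };
}
```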
Cloudflare Turnstile, CAPTCHA without the friction
The honeypot fields I described in the architecture article caught the most basic bots, but not the ones that bother to simulate a real browser. I needed something stronger without hurting the experience for legitimate users. I chose Cloudflare Turnstile instead of reCAPTCHA for practical reasons: the domain was already on Cloudflare, Turnstile doesn't use tracking cookies, and it's invisible to the user in the vast majority of cases.
The implementation protects the /api/analyze and /api/extract-urls endpoints. The original flow required one Turnstile token for each API call, but in Safari this caused a real problem: when resetting the widget to get a second token (for extract-urls after analyze), Safari forced an interactive checkbox in the middle of the loading spinner. The fix was to switch to an ephemeral session model. The user completes a single Turnstile challenge when submitting the form, the server verifies it against Cloudflare, and issues a temporary single-use session token. That token is sent along with the later calls to analyze and extract-urls, and it's invalidated after the analysis. The change is invisible to the user (they still don't see any CAPTCHA under normal conditions) but it removed a real point of friction in Safari.
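The session piece itself is small. A rough sketch of the idea, with invented names and a TTL chosen purely for illustration:

```javascript
const crypto = require('node:crypto');

// token -> expiry timestamp. Names and the 5-minute TTL are assumptions.
const sessions = new Map();

function issueSession() {
  const token = crypto.randomBytes(32).toString('hex');
  sessions.set(token, Date.now() + 5 * 60 * 1000);
  return token; // handed to the browser after Turnstile verification
}

function verifySession(token) {
  const expiry = sessions.get(token);
  return expiry !== undefined && expiry > Date.now();
}

function invalidateSession(token) {
  sessions.delete(token); // called once the analysis finishes
}
```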
I designed verification so a CAPTCHA service outage doesn't prevent people from using the tool. It's a conscious tradeoff between availability and security, with extra mitigations in the other defense layers. The site key is served to the frontend from the server instead of being hardcoded in the JavaScript, so rotating it only requires changing an environment variable.
Turnstile and the ephemeral session work for real users in a browser, but they don't fit machine clients. An AI agent, a script, or a backend that wanted to query ScamDetector would run into a JavaScript challenge it can't solve. For those cases I added a second authentication path through Authorization: Bearer <API_KEY> that lives alongside the main flow. If that header is present, the handler validates the key with crypto.timingSafeEqual against a list configured in the environment (each key has an optional label that shows up in the logs), skips Turnstile and session handling, and applies its own per-key rate limit (60 requests every 10 minutes by default, configurable). If the key is invalid, it returns 401 immediately without falling back to the session flow. If there's no header, everything works as before. It's a few lines of code that opened the door to integrations without making anything more complicated for the end user.
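The delicate part is the key comparison, since a naive string check leaks timing information. A sketch that assumes keys arrive as a comma-separated API_KEYS variable (the real format may differ):

```javascript
const crypto = require('node:crypto');

// Assumed format: API_KEYS="key1:mobile-app,key2:partner-x"
const API_KEYS = (process.env.API_KEYS ?? '')
  .split(',')
  .filter(Boolean)
  .map((entry) => {
    const [key, label = 'unlabeled'] = entry.split(':');
    return { key, label };
  });

function matchApiKey(provided) {
  const providedBuf = Buffer.from(provided);
  for (const { key, label } of API_KEYS) {
    const keyBuf = Buffer.from(key);
    // timingSafeEqual throws on length mismatch, so check lengths first.
    if (
      keyBuf.length === providedBuf.length &&
      crypto.timingSafeEqual(keyBuf, providedBuf)
    ) {
      return label; // the label goes to the logs, never the key itself
    }
  }
  return null; // caller answers 401 without touching the session flow
}
```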
Rate limiting that survives creativity
The architecture article explained how ScamDetector applies per-user usage limits with hashed IPs to preserve privacy. That layer is still there, but it wasn't enough as the only line of defense.
An attacker with access to multiple IPs (which is trivial today with residential proxies) could spread requests around so no single IP goes over the threshold. The fix was to add extra limiting layers that operate independently and complement each other.
Another detail I fixed was persistence. In a Docker environment where containers are recreated on every deploy, the rate limiting counters were lost, and anyone could start from zero just by waiting for the next redeploy. Now all counters persist on disk and survive restarts, as long as the data directory is mounted as a volume.
Each endpoint has its own limiting bucket, so legitimate use of one feature doesn't eat into the quota for another.
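Condensed, the mechanism looks something like this; the file path, window, and bucket layout are illustrative:

```javascript
const fs = require('node:fs');

// Illustrative path and window; the real values are configurable.
const FILE = '/app/data/ratelimits.json';
const WINDOW_MS = 10 * 60 * 1000;

// Counters load from disk at startup, so a redeploy doesn't reset anyone.
let buckets = fs.existsSync(FILE)
  ? JSON.parse(fs.readFileSync(FILE, 'utf8'))
  : {};

function allow(endpoint, userHash, limit) {
  const key = `${endpoint}:${userHash}`; // one bucket per endpoint and user
  const now = Date.now();
  const entry = buckets[key] ?? { count: 0, resetAt: now + WINDOW_MS };
  if (now > entry.resetAt) {
    entry.count = 0;
    entry.resetAt = now + WINDOW_MS;
  }
  if (entry.count >= limit) return false;
  entry.count += 1;
  buckets[key] = entry;
  fs.writeFileSync(FILE, JSON.stringify(buckets)); // survives restarts
  return true;
}
```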
Over time I realized the flat limit of 10 requests every 10 minutes treated everyone the same, and that didn't make sense. A legitimate user analyzing suspicious messages shouldn't have the same tolerance as someone who has already tried prompt injection. So I added progressive penalties tied to guardrail signals. Each injection detection progressively lowers the allowed request limit, and if detections pile up, access is temporarily blocked. Off-topic abuse (for example, asking the model to write poems) also earns a temporary block if it's repeated.
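The arithmetic behind the penalties is deliberately simple. An illustration that assumes the base limit of 10 and a made-up step of 2 per detection:

```javascript
// Illustrative penalty math; the real step and thresholds may differ.
function effectiveLimit(baseLimit, injectionCount) {
  // Every detected injection attempt shaves the allowance down.
  return Math.max(baseLimit - injectionCount * 2, 0); // 0 = temporarily blocked
}

effectiveLimit(10, 0); // 10 -> normal user, untouched
effectiveLimit(10, 3); //  4 -> repeated injection attempts
effectiveLimit(10, 5); //  0 -> blocked until the penalty expires
```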
The penalties persist on disk using the same mechanism as the base counters, so restarting the container doesn't wipe them. And to stop someone from disabling penalties by accident in production, the testing bypass requires a double guard of two environment variables, both of which are explicitly excluded from the production Dockerfile. Penalties are always active in production by design.
When external APIs fail
ScamDetector depends on external services to work. AI models, reputation lookups, URL scanning. Any of them can return a transient error, get saturated, or simply take longer than expected.
I centralized retry logic in a shared utility that implements exponential backoff. If a request fails with a transient error, it automatically retries while waiting longer between each attempt. If the service says how long to wait, it respects that.
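A minimal sketch of that shared helper, assuming the global fetch available in modern Node (attempt counts and delays are illustrative):

```javascript
// Sketch of the shared retry utility. Limits and delays are illustrative.
async function fetchWithRetry(url, options = {}, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(url, options);
    // Only transient failures are worth retrying: 429 and 5xx.
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt === maxAttempts) return res;
    // If the service says how long to wait, respect it;
    // otherwise back off exponentially: 1s, 2s, ...
    const retryAfter = Number(res.headers.get('retry-after'));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 500;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```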
On top of that, all external calls have a timeout. If a service doesn't respond in time, the request is canceled and the user gets a partial response instead of waiting forever. Phone reputation lookup, for example, sends the query to the primary provider and a backup one in parallel, so if one takes too long the other can respond.
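The parallel lookup maps almost directly onto Promise.any plus AbortSignal.timeout. A sketch with placeholder provider URLs and a timeout picked for illustration:

```javascript
// Provider URLs and the 8-second budget are placeholders.
async function lookupPhoneReputation(phone) {
  const signal = AbortSignal.timeout(8000); // cancel anything slower
  try {
    // Promise.any resolves with the first provider that answers;
    // the slower one is simply ignored.
    return await Promise.any([
      fetch(`https://primary.example/lookup?phone=${encodeURIComponent(phone)}`, { signal }),
      fetch(`https://backup.example/lookup?phone=${encodeURIComponent(phone)}`, { signal }),
    ]);
  } catch {
    // Both failed or timed out: degrade gracefully and let the
    // message analysis continue without the reputation data.
    return null;
  }
}
```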
The key is graceful degradation. If the phone lookup fails but message analysis works, the user gets the analysis with a note saying the number's reputation couldn't be checked. A partial answer is better than no answer.
From three backends to two
In the architecture article I described how ScamDetector supported three interchangeable AI backends, including n8n as a visual orchestrator. n8n is a powerful tool, but for what it was actually doing here (a couple of calls to the OpenRouter API) it was a heavy dependency that added maintenance complexity without enough benefit to justify it. So I removed it.
ScamDetector now works with two backends. Direct OpenRouter is the main path, the lightest one and with no middleman. Vercel AI Gateway stays as the serverless alternative with built-in observability. The AI_GATEWAY environment variable still controls which one is used, and switching between them doesn't require a redeploy.
What made removing n8n clean instead of painful was an earlier decision to extract all prompts into a centralized module and share the normalizeResponse() function across backends. When I removed n8n, there was no logic to move and no prompts to rewrite; an adapter that was no longer needed just disappeared.
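The shape of that boundary is easy to sketch (module paths and names here are illustrative):

```javascript
// Illustrative module paths; the real layout may differ.
const { normalizeResponse } = require('./prompts/normalize');

function getBackend() {
  return process.env.AI_GATEWAY === 'true'
    ? require('./backends/ai-gateway') // Vercel AI Gateway, with observability
    : require('./backends/openrouter'); // direct OpenRouter, no middleman
}

async function analyze(prompt) {
  const raw = await getBackend().analyze(prompt);
  // Every adapter funnels its output through the shared normalizeResponse(),
  // so callers never know which backend answered, and removing an adapter
  // touches exactly one file.
  return normalizeResponse(raw);
}
```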
Shared code, shared bugs
While I was adding backends, I realized I had identical functions copied across five different modules: the utility for identifying the user, JSON parsing, log sanitization. The kind of technical debt that doesn't bother you until you have to fix a bug in one place and discover it still survives in four others.
I consolidated everything into a shared module. It's pure refactoring and adds no new functionality, but it lowers the chance that a bug gets fixed in one place and slips by in another.
Along the way, I found a subtle bug in the rate limiting counter cleanup. The cleanup function iterated over the counter map to remove expired entries while modifying it at the same time. It worked almost always, but it was a race condition waiting for the right moment. The fix was trivial (take a snapshot before iterating), but the bug only existed because there were five nearly identical implementations evolving separately. It's the kind of thing that, as QA, you know will fail eventually; you just don't know when.
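The fix, roughly, with illustrative names:

```javascript
// Before: deleting entries from `counters` while walking its live
// iterator. After: iterate over a snapshot, mutate the original.
function cleanupExpired(counters) {
  const now = Date.now();
  for (const [key, entry] of [...counters.entries()]) {
    if (entry.resetAt < now) counters.delete(key);
  }
}
```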
More than 240 tests so nothing breaks
With so many layers piled up (Turnstile, Unicode sanitization, progressive penalties, injection detection, multiple backends), testing by hand after every change stopped being viable. ScamDetector is vanilla JavaScript with no framework, and I applied the same philosophy to tests: Node.js's native runner (node:test), no Jest, no Vitest, without adding a single dependency to the project.
Unit tests cover the pure functions that support the security side: response normalization, SSRF validation, prompt injection detection, Unicode sanitization, rate limiting, and session management. Integration tests exercise the four HTTP handlers (analyze, verify, extract-urls, urlscan) with mocked external APIs to validate the full flow without spending credits. Internal functions are exposed through module.exports._internal so they can be tested without redesigning the public API.
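A minimal single-file example of the pattern with the native runner; hashUser is a made-up stand-in for whatever the modules actually hide:

```javascript
// Single-file illustration: hashUser is an invented internal helper.
const crypto = require('node:crypto');
const { test } = require('node:test');
const assert = require('node:assert');

function hashUser(ip, salt) {
  // Privacy-preserving identifier: the raw IP never reaches the counters.
  return crypto.createHash('sha256').update(ip + salt).digest('hex');
}

// In the real modules this lives in module.exports._internal so test
// files can import it without widening the public API:
module.exports._internal = { hashUser };

test('hashUser never leaks the raw IP', () => {
  const hash = module.exports._internal.hashUser('203.0.113.7', 'pepper');
  assert.ok(!hash.includes('203.0.113.7'));
  assert.strictEqual(hash.length, 64); // hex-encoded SHA-256
});
```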
On top of that, the suite includes 25 end-to-end tests that run against a real instance and exercise the full stack, including Turnstile verification and calls to the model. The suite supports selective execution by number, range, or name pattern. And while I was writing tests to validate that the model doesn't generate false positives, I improved the prompts with a step-by-step reasoning methodology and explicit anti-false-positive rules. For example, if a domain matches the official service it claims to represent, that is treated as a safe signal instead of a suspicious one. The tests don't just verify that nothing breaks, they also help tune the quality of the analysis.
Accessibility you don't see
I reviewed the interface against WCAG AA accessibility criteria and found two things. Several secondary texts didn't reach the minimum contrast ratio of 4.5:1 in either light or dark mode. And interactive elements didn't show a visual indicator when navigating with the keyboard.
I adjusted the colors and added focus-visible styles to all interactive elements. Unlike plain focus, focus-visible only activates when the user is navigating with a keyboard, not when clicking with the mouse. Most users won't notice these changes, but for people who depend on a keyboard or assistive technology, they make the difference between being able to use the tool and not.
Minimal infrastructure, but done right
Three small changes that prevent big problems.
The application was already running as a non-root user in Docker, but the data directory was created by the build process as root. When Docker mounted a volume over that directory, the application didn't have write permissions. The fix was to create the directory with the right ownership before switching to the non-privileged user.
I consolidated all persistent files into a single directory. In Dokploy that translates to a single bind mount, which simplifies configuration and lowers the chance of forgetting to mount something.
And I removed flexible version ranges from dependencies. Where there used to be a ^ that allowed automatic updates, there are now exact versions. Updates are an explicit decision, not something that happens quietly.
Hardening as an attitude
None of these changes came from an incident. There was no prompt injection attack, no massive abuse, no accessibility complaint. It all came from sitting down to review the code with the question "what would I do if I wanted to break this?" and iterating on each answer until I had a concrete defense.
That question is more productive than any security checklist. It forces you to think like a creative attacker instead of like an auditor with a list. IP-based rate limiting is enough until someone has ten IPs. URL validation is correct until the model invents one. Prompts are safe until someone tries to inject instructions. An invisible CAPTCHA seems unnecessary until you see automated traffic in the logs. Unicode characters are harmless until someone mixes Cyrillic and Latin in a phishing URL. And a test suite is a luxury until a change breaks something that worked yesterday.
ScamDetector is more solid now, but hardening isn't a state you reach, it's a process that never ends. Every change closes one door and reveals another one you hadn't seen.
With all these layers active, what I was still missing was finding out the moment someone tried to break something. The logs were there, but I wasn't looking at them. I fixed that by setting up push alerts on my phone with self-hosted ntfy, which is what I cover in the next article in the series.
Try it at scamdetector.josemanuelortega.dev, and if you find something that could be improved, let me know.