Surviving an SEO Spam Cloaking Compromise: How I Found and Cleaned It Up

By Abhishek Ghimire

I found out my site was hacked from Google search results. Not from a security scanner, not from a log alert — from the little grey URL under the search title. Where it should have said new.kathmandu.gov.np, it said sdining-umeda.com. A restaurant in Osaka, apparently. My pages, my content, my breadcrumbs — flying someone else's flag.

This is the story of what that actually is, how I confirmed it, and how I cleaned it up. If you ever see a domain you don't recognize sitting on top of your own content in Google, this post is for you.

The symptom that makes no sense at first

Here's what made it confusing: the search results were unmistakably mine. The titles were my page titles. The breadcrumb trails showed my real URL structure — wards, cooperatives, all the genuine paths. If you clicked through, you landed on my actual site, working normally. Nothing looked wrong when I visited.

Only the displayed domain was foreign. That combination — my content, my paths, someone else's hostname — is not a coincidence or a Google glitch. It's a signature. It's SEO spam cloaking.

What cloaking actually is

Cloaking means serving different content to search engine crawlers than to real human visitors. The attacker injects code that checks who is asking. If the request looks like a normal browser, it serves your real site — which is exactly why everything looks fine when you check. If the request looks like Googlebot, it serves manipulated content: a foreign canonical tag, a different hostname, spam links, whatever the attacker is monetizing.

Google indexes what Googlebot sees. So Google slowly attributed my content to the attacker's domain, funneling my hard-earned ranking and authority to sdining-umeda.com. The visitors never see it. The owner never sees it in a casual check. It quietly bleeds your SEO into someone else's pocket for as long as it goes unnoticed.

That "invisible unless you know how to look" property is what makes cloaking so nasty. Your site works. Your analytics look normal-ish. And meanwhile you're ranking for a stranger.

Step 1: Confirm it's actually cloaking

The single most useful diagnostic is to ask as Googlebot and compare. Fetch your own page twice — once as a normal browser, once impersonating the crawler:

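# As a normal visitor (send a browser-like User-Agent; curl's default is "curl/x.y")
curl -s -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" https://new.kathmandu.gov.np/wards -o normal.html

# As Googlebot (the crawler's full User-Agent string)
curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://new.kathmandu.gov.np/wards -o bot.html

# Compare
diff normal.html bot.html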

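If the two files are identical, user-agent cloaking isn't happening on that URL, at least not for the string you sent. (Cloaking that verifies Googlebot by IP address or reverse DNS won't show up in a curl test; Search Console's URL Inspection tool, which fetches as the real crawler, covers that case.) Dynamic pages can also differ harmlessly in nonces or timestamps, so read the diff rather than just noting that one exists. If the Googlebot version contains a foreign domain, a different <link rel="canonical">, or injected links, you've confirmed it: the attacker's code is branching on the user-agent string.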

Also grep both files directly for the foreign hostname:

grep -i "sdining-umeda" bot.html normal.html

Seeing it only in bot.html is the smoking gun.
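
A raw diff gets noisy on dynamic pages, so it can help to pull out only the tags Google uses for attribution. Here is a minimal sketch, assuming the two files from the curl commands above have been read into strings. The names HeadLinkParser, attribution_hosts, and foreign_hosts are my own for illustration, not part of any tool.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadLinkParser(HTMLParser):
    """Collects the URLs a crawler would use to attribute a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []  # list of (source, url) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value); value can be None for bare attributes.
        a = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.urls.append(("canonical", a.get("href", "")))
        elif tag == "meta" and a.get("property", "").lower() == "og:url":
            self.urls.append(("og:url", a.get("content", "")))


def attribution_hosts(html):
    """Return the hostnames named by canonical and og:url tags (relative URLs are skipped)."""
    parser = HeadLinkParser()
    parser.feed(html)
    hosts = set()
    for _source, url in parser.urls:
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return hosts


def foreign_hosts(html, expected_host):
    """Hostnames in the attribution tags that are not the one you expect."""
    return attribution_hosts(html) - {expected_host}
```

Run foreign_hosts over the contents of bot.html and normal.html with your real hostname as expected_host. An empty set for the browser copy and a non-empty set for the Googlebot copy points at the same conclusion as the grep.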

Step 2: Figure out how they're doing it

There are three common mechanisms, and it's worth knowing which one you're dealing with because the cleanup differs.

Injected canonical / meta tags. The most common. Your pages output something like <link rel="canonical" href="https://sdining-umeda.com/..."> (often only when the visitor looks like a crawler). Google treats a canonical as a strong hint rather than a command, but it usually follows it and hands the credit to that domain. Check the <head> for canonical tags, Open Graph og:url, and any hardcoded absolute URLs.

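Host-header injection. If your app builds absolute URLs from the incoming Host header without validating it, an attacker can send requests with a spoofed Host and get your app to reflect their domain into canonicals, sitemaps, and redirects. That only reaches Google if the crawler sees the poisoned response, typically because a cache stored it or because the attacker pointed their own domain at your server, so crawls of that domain return your pages under their hostname. This is a code-level bug, not injected files. On a Next.js/FastAPI stack, this means auditing anywhere you construct a URL from request headers (Host, X-Forwarded-Host) instead of a fixed, configured base URL.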
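
To make the validation idea concrete, here is a small framework-agnostic sketch of a Host allowlist check. ALLOWED_HOSTS, normalize_host, and is_allowed_host are illustrative names I made up, and the allowlist content is an assumption about what your deployment should answer to.

```python
# Assumption: the only hostname this deployment should ever answer for.
ALLOWED_HOSTS = {"new.kathmandu.gov.np"}


def normalize_host(raw_host):
    """Lowercase a Host header value and strip the port and any trailing dot."""
    host = raw_host.strip().lower()
    if host.startswith("["):
        # IPv6 literal such as "[::1]:8000": keep only the bracketed part.
        return host.split("]", 1)[0] + "]"
    host = host.split(":", 1)[0]
    return host.rstrip(".")


def is_allowed_host(raw_host, allowed=ALLOWED_HOSTS):
    """True only when the normalized Host value is an exact allowlist match."""
    return normalize_host(raw_host) in allowed
```

Exact matching is deliberate: suffix or substring checks are easy to fool with names like new.kathmandu.gov.np.example.com. On FastAPI you probably don't need to hand-roll this, since Starlette ships TrustedHostMiddleware, which takes an allowed_hosts list and rejects everything else.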

Poisoned sitemap or robots. Sometimes the attacker just adds foreign URLs to sitemap.xml or points things at their domain, seeding Google with the association. Always check:

curl -s https://new.kathmandu.gov.np/sitemap.xml | grep -i "sdining"
curl -s https://new.kathmandu.gov.np/robots.txt
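
The grep only finds a hostname you already know about. A slightly broader check is to list every sitemap entry whose host isn't yours. This is a rough sketch using only the standard library; foreign_sitemap_urls is a hypothetical helper name, and it assumes you have already downloaded the sitemap body.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def foreign_sitemap_urls(sitemap_xml, expected_host):
    """Return every <loc> URL in the sitemap whose hostname is not expected_host."""
    root = ET.fromstring(sitemap_xml)
    foreign = []
    for element in root.iter():
        # Sitemap tags are namespaced, e.g.
        # "{http://www.sitemaps.org/schemas/sitemap/0.9}loc", so compare the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            url = element.text.strip()
            if urlparse(url).hostname != expected_host:
                foreign.append(url)
    return foreign
```

It works on both a plain sitemap and a sitemap index, because both use <loc> for their entries. Anything it returns deserves a look, even if it turns out to be a legitimate second domain of yours.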

Step 3: Find the entry point (this is the part you can't skip)

Removing injected code without finding how they got in just means you'll do this again next week. The compromise is a symptom; the entry point is the disease.

Look at what changed and when. Recently modified files are the loudest clue, though an attacker can reset timestamps, so a clean result here doesn't prove much on its own:

# Files changed in the last 7 days under your web root
find /var/www -type f -mtime -7 -ls

# Look specifically for suspicious writes into places that should be static
find /var/www -type f -mtime -7 \( -name "*.php" -o -name "*.js" -o -name "*.html" \)

Then check who's been on the box and whether they belong (on RHEL-family systems the SSH log is /var/log/secure instead of /var/log/auth.log):

last          # successful logins — should be only you / your ISP ranges
sudo lastb    # failed attempts — brute force noise is normal; a success is not
sudo grep -i "accepted" /var/log/auth.log
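
Reading the "Accepted" lines by eye works on a quiet box, but on a busy one it is easier to filter out the addresses you trust. Below is a sketch that assumes the usual OpenSSH log wording ("Accepted <method> for <user> from <address> port <n>"). The function name unexpected_logins and the trusted ranges in the usage note are placeholders, not values from my server.

```python
import ipaddress
import re

# Matches the standard sshd success line, e.g.
# "... sshd[1234]: Accepted publickey for deploy from 203.0.113.5 port 40022 ssh2"
ACCEPTED = re.compile(r"Accepted (\S+) for (\S+) from (\S+) port \d+")


def unexpected_logins(log_lines, trusted_networks):
    """Return (user, address, method) for successful logins from outside trusted_networks."""
    networks = [ipaddress.ip_network(net) for net in trusted_networks]
    hits = []
    for line in log_lines:
        match = ACCEPTED.search(line)
        if not match:
            continue
        method, user, address = match.groups()
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            # Not a parseable IP (unusual); report it rather than silently skip it.
            hits.append((user, address, method))
            continue
        if not any(ip in net for net in networks):
            hits.append((user, address, method))
    return hits
```

Feed it the lines of the auth log and a list of CIDR ranges such as your office or ISP range. A password login for root from an address you don't recognize is exactly the kind of row you are hunting for.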

And check what's actually listening, in case something extra got installed:

sudo ss -tulpn

You're looking for processes bound to ports you didn't put there, or apps listening on 0.0.0.0 that should be bound to 127.0.0.1.
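
If you want that check to be repeatable, the ss output can be filtered mechanically. This sketch assumes the column layout that ss -tulpn prints on recent iproute2 versions (Netid, State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, Process); exposed_listeners and the expected-port set are illustrative names and values.

```python
# Addresses that mean "listening on every interface".
WILDCARD_ADDRS = {"0.0.0.0", "*", "[::]", "::"}


def exposed_listeners(ss_output, expected_public_ports):
    """Return (netid, port, process) for wildcard listeners on ports you did not expect."""
    findings = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "Netid":
            continue
        # Fifth column is "address:port"; split on the last colon to handle IPv6.
        address, _, port = fields[4].rpartition(":")
        if address in WILDCARD_ADDRS and port.isdigit():
            if int(port) not in expected_public_ports:
                process = " ".join(fields[6:]) or "unknown"
                findings.append((fields[0], int(port), process))
    return findings
```

Pipe the output of sudo ss -tulpn into it with the ports you intend to expose, for example {22, 80, 443}. Services bound to 127.0.0.1 are ignored on purpose, since those are not reachable from outside.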

Common real-world entry points to rule in or out: a user-writable upload directory that allows executable files, an outdated dependency with a known RCE, leaked credentials (a committed .env, an exposed admin token), or a weak/reused SSH password. Key-based SSH with password auth disabled closes one of the biggest doors outright.

Step 4: Clean up

Once you know the mechanism and the entry point:

  1. Remove the injected code. Whether it's a rogue canonical, a malicious middleware, or a planted file — excise it. Diff against a known-good copy (git is your friend here) rather than eyeballing.

  2. Rotate everything. SSH keys, API tokens, database passwords, any secret that could have been read. Assume anything on that box was seen.

  3. Patch the entry point. Disable password SSH auth, make upload directories non-executable, update dependencies, validate the Host header against an allowlist, and, if you sit behind Cloudflare, lock your origin to Cloudflare's IP ranges so nobody can reach it directly.

  4. Pin your canonical the right way. Set canonicals from a fixed, configured base URL, never from the request's Host header. On Next.js, keep the site URL in one env var (NEXT_PUBLIC_SITE_URL is a common name) and pass it to metadataBase, so there is a single source of truth for absolute URLs.
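
On the Python side of the stack, the same rule from step 4 can be sketched as a helper that never looks at request headers. SITE_URL as an environment variable name and canonical_url as a function name are assumptions for illustration; use whatever your config already calls them.

```python
import os
from urllib.parse import urlsplit

# Assumption: the base URL lives in config under a name like SITE_URL.
SITE_URL = os.environ.get("SITE_URL", "https://new.kathmandu.gov.np")


def canonical_url(path, base=SITE_URL):
    """Build a canonical URL from the configured base and the request path only.

    Any scheme, host, query string, or fragment smuggled into `path` is dropped,
    so the result always points at the configured site.
    """
    parts = urlsplit(path)
    clean_path = "/" + parts.path.lstrip("/")
    return base.rstrip("/") + clean_path
```

The point of routing every absolute URL through one function is that there is exactly one place to audit. If a template or sitemap generator builds a URL any other way, that is the line to fix.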

Step 5: Tell Google, then wait

Cleaning the server doesn't un-poison your search results — Google is still holding the bad association until it recrawls.

  • In Google Search Console, check the Security Issues report and, if flagged, request a review after cleanup.

  • Use the URL Inspection tool's live test to fetch affected pages as Google and confirm they now render clean.

  • Request reindexing of the affected URLs.

  • Resubmit a clean sitemap.xml.

Recovery here is measured in days to weeks, not minutes. The recrawl is on Google's schedule. That waiting period is the real punishment for not catching it sooner — which is the best argument for monitoring.
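
Monitoring can be as small as a scheduled script that fetches a page with both user agents and reports hostnames that only the crawler version mentions. The sketch below is one way to do it with the standard library. The function names are mine, the browser string is just an example, and it only catches user-agent cloaking, not cloaking keyed on Google's IP ranges.

```python
import re
import urllib.request

BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Crude on purpose: grab the host part of every absolute http(s) URL in the body.
HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def fetch(url, user_agent):
    """Download a page body while presenting the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=15) as response:
        return response.read().decode("utf-8", errors="replace")


def hosts_in(html):
    """Set of lowercase hostnames referenced by absolute URLs in the HTML."""
    return {host.lower().rstrip(".") for host in HOST_RE.findall(html)}


def bot_only_hosts(url, fetcher=fetch):
    """Hostnames that appear in the Googlebot response but not in the browser one."""
    normal = hosts_in(fetcher(url, BROWSER_UA))
    bot = hosts_in(fetcher(url, GOOGLEBOT_UA))
    return bot - normal
```

Call bot_only_hosts on a handful of important URLs from cron and alert when the result is non-empty. The fetcher parameter exists so the comparison logic can be tested without touching the network.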

Hardening so it doesn't happen again

After the cleanup, I closed the systemic gaps. Worth doing whether or not you've been hit:

  • HSTS at the Nginx layer so browsers refuse to downgrade to HTTP.

  • A strict Content-Security-Policy, including frame-src, so injected external frames get blocked by the browser rather than silently rendering.

  • Origin locked to Cloudflare IPs — attackers can't hit your server directly to bypass the WAF.

  • X-Content-Type-Options: nosniff and correct content types on every endpoint.

  • Password SSH auth disabled, fail2ban running, inbound firewall limited to 22/80/443.

  • A canonical URL that comes from config, not from headers. This one line of discipline kills the host-header cloaking vector entirely.
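
Headers have a habit of disappearing during config changes, so it is worth checking them from the outside now and then. Here is a minimal sketch that inspects a response's headers for the three items in the list above; missing_hardening is a hypothetical helper, and the checks are intentionally shallow (presence, not policy quality).

```python
def missing_hardening(headers):
    """Return human-readable problems found in a mapping of response headers."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    problems = []

    hsts = h.get("strict-transport-security", "")
    if "max-age=" not in hsts.lower():
        problems.append("Strict-Transport-Security missing or has no max-age")

    csp = h.get("content-security-policy", "").lower()
    if not csp:
        problems.append("Content-Security-Policy missing")
    elif not any(d in csp for d in ("frame-src", "child-src", "default-src")):
        # frame-src falls back to child-src, then default-src; none present
        # means frames are unrestricted.
        problems.append("Content-Security-Policy does not restrict frames")

    if h.get("x-content-type-options", "").strip().lower() != "nosniff":
        problems.append("X-Content-Type-Options is not nosniff")

    return problems
```

Pass it the headers from any HTTP client response (for example dict(response.headers) with urllib). An empty list means the basics are in place; it does not mean the CSP itself is a good one.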

The lesson

The thing I keep coming back to: my site worked the whole time. Every human visitor had a normal experience. The compromise lived exclusively in the version of reality served to crawlers, which is the one place I never looked. If I hadn't happened to glance at the domain under a search result, it could have run for months.

So the practical takeaway is boring but real: occasionally fetch your own site as Googlebot and read what it says. curl -A "Googlebot" costs you ten seconds and shows you the half of your website you otherwise never see. That's the whole trick.
