
Leaky proxies: how we find true hosting

Most bots disguise their true hosting by proxying HTTP traffic through a different IP address. This can be simply to get a larger pool of IP addresses to make IP-based blocking more difficult, or to appear to come from a residential or mobile network despite being hosted in a datacentre. In many cases we can still tell the true hosting location because it is leaked in non-HTTP traffic.

This post looks at some bot signals using techniques around WebSockets, DNS, and WebRTC. If you are running a bot and you want to check that you aren't leaking these sorts of signals, get in touch and we can take a look.

WebSockets

One example is bots that bypass the HTTP proxy when making WebSocket connections. One bot has been fetching HTTP traffic from our collector sites for the last 6 months via a wide range of OVH datacentre IP addresses, but its WebSocket connections persistently come from the same single IP address. That address is in a netblock named after Fortinet, so we presume this bot is operated by Fortinet.

One way this could happen is if the proxy is configured based on the protocol scheme ("http://" or "https://"). WebSocket URIs use a "ws://" or "wss://" scheme, which such a configuration would not match, so WebSocket connections would be made directly.
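As a rough illustration of how a scheme-keyed proxy configuration can miss WebSocket URLs, here is a minimal Python sketch. The function name `proxy_for`, the `PROXIES` table and the proxy address are hypothetical; they are not taken from any particular bot or client library.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical proxy table keyed by URL scheme; only HTTP(S) is listed.
PROXIES = {
    "http": "http://proxy.example.net:8080",
    "https": "http://proxy.example.net:8080",
}


def proxy_for(url: str, proxies: dict = PROXIES) -> Optional[str]:
    """Return the proxy to use for `url`, or None to connect directly."""
    scheme = urlsplit(url).scheme.lower()
    return proxies.get(scheme)


print(proxy_for("https://collector.example/page"))  # goes via the proxy
print(proxy_for("wss://collector.example/socket"))  # None: direct connection
```

A client built like this would need explicit "ws" and "wss" entries (or a catch-all default) before its WebSocket traffic followed the same path as its HTTP traffic.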

Another possibility is that the proxy server does not support WebSockets, in which case the browser may be forced to make those connections directly.

DNS

Another example is that we can see where the bot's DNS requests come from.

The BotForensics collector page generates a fresh domain name for every session, which allows our custom DNS server to associate each DNS request with its browser session.
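The per-session domain name idea can be sketched roughly as follows. This is not the BotForensics implementation: the zone `collector.example`, the 16-character hex label and both function names are assumptions made for illustration.

```python
import secrets
from typing import Optional

ZONE = "collector.example"  # hypothetical zone served by the custom DNS server


def new_session_hostname(zone: str = ZONE) -> tuple[str, str]:
    """Create a random session ID and the one-off hostname that carries it."""
    session_id = secrets.token_hex(8)  # 16 lowercase hex characters
    return session_id, f"{session_id}.{zone}"


def session_from_qname(qname: str, zone: str = ZONE) -> Optional[str]:
    """Recover the session ID from a DNS query name, or None if it is not ours."""
    name = qname.rstrip(".").lower()  # DNS names compare case-insensitively
    suffix = "." + zone
    if not name.endswith(suffix):
        return None
    label = name[: -len(suffix)]
    # Only accept a single label directly under the zone.
    return label if label and "." not in label else None
```

The page would embed the generated hostname in a resource URL, and the DNS server would call something like `session_from_qname` on each incoming query to attach it to the right session.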

This tells us not only the IP address that the DNS request came from, but also other properties, such as:

  • does it do DNS over UDP (common) or TCP (rare)?
  • does it randomise the case in the domain name?
  • does it support EDNS? which options?
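To make those properties concrete, here is a hedged sketch of how a DNS server might summarise one raw query. It assumes the caller already knows the transport (and has stripped the two-byte length prefix used over TCP), and that the hostnames we hand out are all lowercase, so any mixed case in the query was introduced by the resolver. The function names and the returned dictionary layout are illustrative only.

```python
import struct

OPT_TYPE = 41  # EDNS(0) OPT pseudo-record type (RFC 6891)


def _read_name(msg: bytes, offset: int) -> tuple[str, int]:
    """Read an uncompressed name; return it and the offset just past it."""
    labels = []
    while True:
        length = msg[offset]
        offset += 1
        if length == 0:
            return ".".join(labels), offset
        if length & 0xC0:  # compression pointer: not expected in a query
            raise ValueError("compressed name in query")
        labels.append(msg[offset:offset + length].decode("ascii"))
        offset += length


def describe_query(msg: bytes, transport: str) -> dict:
    """Summarise the fingerprintable properties of one raw DNS query."""
    _id, _flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", msg[:12])
    offset = 12
    qname = ""
    for _ in range(qdcount):
        qname, offset = _read_name(msg, offset)
        offset += 4  # skip QTYPE and QCLASS
    edns, option_codes = False, []
    for _ in range(ancount + nscount + arcount):
        _name, offset = _read_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        rdata = msg[offset:offset + rdlength]
        offset += rdlength
        if rtype == OPT_TYPE:
            edns = True
            pos = 0
            while pos + 4 <= len(rdata):  # each option: code, length, data
                code, optlen = struct.unpack("!HH", rdata[pos:pos + 4])
                option_codes.append(code)
                pos += 4 + optlen
    letters = [c for c in qname if c.isalpha()]
    mixed_case = any(c.isupper() for c in letters) and any(c.islower() for c in letters)
    return {
        "transport": transport,
        "qname": qname,
        "mixed_case": mixed_case,  # hints at case randomisation
        "edns": edns,
        "edns_option_codes": option_codes,
    }
```

The mixed-case check is only a heuristic: a randomising resolver could, by chance, emit a name that is entirely one case, which is unlikely for names with many letters but not impossible.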

The majority of traffic is using Google or Cloudflare for DNS resolution. This isn't immediately a bot smell because ordinary users can use 8.8.8.8 (Google) or 1.1.1.1 (Cloudflare). But where a bot is browsing from a residential IP address but doing DNS lookups from Hetzner, that's much more suspicious.
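The reasoning in that paragraph can be written down as a small rule. This is a simplified sketch, not our actual scoring: the network labels ("residential", "datacentre"), the operator names and the function `dns_verdict` are hypothetical, and it assumes some separate lookup has already classified both IP addresses.

```python
# Operators of well-known public resolvers, e.g. 8.8.8.8 and 1.1.1.1.
PUBLIC_RESOLVER_OPERATORS = {"google", "cloudflare"}


def dns_verdict(http_network: str, resolver_operator: str, resolver_network: str) -> str:
    """Classify a session from where its HTTP and DNS traffic originate."""
    if resolver_operator.lower() in PUBLIC_RESOLVER_OPERATORS:
        # Ordinary users can choose these too, so this alone says little.
        return "neutral"
    if http_network == "residential" and resolver_network == "datacentre":
        # Browsing from a home connection but resolving from a hosting provider.
        return "suspicious"
    return "neutral"


print(dns_verdict("residential", "Hetzner", "datacentre"))  # suspicious
print(dns_verdict("residential", "Google", "datacentre"))   # neutral
```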

WebRTC STUN/TURN

WebRTC is a way for web applications to communicate with each other directly, peer-to-peer. But most residential users are behind NAT.

The first implication of being behind NAT is that the browser doesn't necessarily know its own external IP address.

The browser can connect to a STUN server to find out its public IP address as seen by that server. The browser sends a request to the STUN server, and the server replies saying what IP address and port number it saw the packet come from.
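For readers who have not looked at STUN on the wire, the exchange is small enough to sketch. The following Python follows the message layout in RFC 5389 (a 20-byte header with a fixed magic cookie, and an XOR-MAPPED-ADDRESS attribute in the reply). It only builds and parses bytes; it handles IPv4 only, and sending the request over UDP is left out. The function names are our own for this example.

```python
import ipaddress
import os
import struct
from typing import Optional

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request() -> bytes:
    """A STUN Binding Request with no attributes: just the 20-byte header."""
    transaction_id = os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_xor_mapped_address(response: bytes) -> Optional[tuple[str, int]]:
    """Return the (ip, port) the server saw, from an IPv4 XOR-MAPPED-ADDRESS."""
    _msg_type, msg_len, cookie = struct.unpack("!HHI", response[:8])
    if cookie != MAGIC_COOKIE:
        return None
    offset, end = 20, 20 + msg_len
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[offset:offset + 4])
        value = response[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and attr_len >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            port = xport ^ (MAGIC_COOKIE >> 16)  # port is XORed with the top 16 bits
            ip = ipaddress.IPv4Address(xaddr ^ MAGIC_COOKIE)
            return str(ip), port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 32-bit aligned
    return None
```

The address in the reply is whatever the server observed as the packet's source, which is why it reflects the real network path of the STUN request rather than the path of the HTTP traffic.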

If the browser connects to the STUN server directly, instead of going through its HTTP proxy, then the STUN server will see the browser's real hosting and not the proxy server.

Unfortunately the STUN protocol doesn't provide any way to bundle session-identifying information into the STUN request, so we don't use it for BotForensics. (If we wanted to, the way to do it would be to have the STUN server listen on thousands of port numbers and allocate each port number to only one session at a time. A request arriving on a particular port number would then most likely be from that session.)
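Purely to illustrate the port-per-session idea (which, as noted, we do not use), the bookkeeping might look something like this. The class name `PortLeases` and its methods are hypothetical, and the sketch ignores lease expiry and the actual socket handling.

```python
from typing import Optional


class PortLeases:
    """Hand out each listening port to at most one session at a time."""

    def __init__(self, first_port: int, count: int) -> None:
        self._free = list(range(first_port, first_port + count))
        self._by_port: dict[int, str] = {}

    def lease(self, session_id: str) -> Optional[int]:
        """Reserve a port for a session, or return None if all are in use."""
        if not self._free:
            return None
        port = self._free.pop()
        self._by_port[port] = session_id
        return port

    def release(self, port: int) -> None:
        """Return a port to the pool once its session is finished."""
        if self._by_port.pop(port, None) is not None:
            self._free.append(port)

    def session_for(self, port: int) -> Optional[str]:
        """Which session, if any, currently owns this port?"""
        return self._by_port.get(port)
```

The attribution is only "most likely" correct: a late packet from an earlier session could arrive after its port has been released and leased again.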

But we don't have to do that, because we can use TURN instead. TURN is intended for cases where STUN & NAT hole-punching are not sufficient. In the case of TURN, the "peer-to-peer" traffic is itself carried through the TURN server. TURN allows for authentication, so for BotForensics we configure the collector site to send its session ID in the TURN credentials, and then our custom TURN server knows which session the TURN request came from.
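A rough sketch of the server side of this, in Python: the collector generates the ICE server configuration that the page would pass to `RTCPeerConnection`, with the session ID as the TURN username. The TURN URL, the shared secret and the HMAC-derived credential are all assumptions for this example; the only part taken from the text above is that the session ID travels in the TURN credentials.

```python
import base64
import hashlib
import hmac

TURN_URL = "turn:turn.example.net:3478"  # hypothetical TURN server address


def credential_for(username: str, secret: bytes) -> str:
    """Derive a password from the username, so the TURN server can recompute it."""
    digest = hmac.new(secret, username.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def ice_config_for_session(session_id: str, secret: bytes) -> dict:
    """Build the configuration object handed to the browser for this session."""
    return {
        "iceServers": [
            {
                "urls": TURN_URL,
                "username": session_id,  # the TURN server reads the session ID here
                "credential": credential_for(session_id, secret),
            }
        ]
    }
```

When an authenticated TURN request arrives, the server can read the username to learn the session, recompute the expected credential from the same secret, and record the source IP address of the request against that session.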

This is a very reliable signal. Among the most recent 20,000 sessions on our collector site that sent TURN traffic, 44% sent the TURN request from a different IP address to their HTTP traffic.

What can I do?

If you are running a bot and you want to check that you aren't leaking these sorts of signals, get in touch and we can take a look.