I'm working on a web crawler and I'm trying to understand how the IP substitution works.
From what I have read, the crawler should resolve the hostname to one of its IP addresses and use that IP instead of the hostname in requests. Supposedly this improves performance (the crawler can resolve and cache DNS itself), because the user agent no longer has to do its own DNS lookups.
It doesn't seem to work with HTTPS. I tried the following approaches:
With Node.js and playwright:
import { chromium } from "playwright";
import { resolve4 } from "dns/promises";

export const crawlPage = async (pageUrl: string) => {
  const url = new URL(pageUrl);
  const dns = await resolve4(url.hostname);
  console.log(dns);
  const ip = dns[0]!;

  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto(pageUrl);
  console.log(`Page title: ${await page.title()}`);

  url.hostname = ip;
  await page.goto(url.toString());
  console.log(`Page title: ${await page.title()}`);

  await browser.close();
};
And invoked like this: await crawlPage("https://example.com");
The output looks like this:
[
'23.192.228.80',
'23.192.228.84',
...
]
Page title: Example Domain
node:internal/process/promises:391
triggerUncaughtException(err, true /* fromPromise */);
^
page.goto: net::ERR_CERT_COMMON_NAME_INVALID at https://23.192.228.80/
Call log:
- navigating to "https://23.192.228.80/", waiting until "load"
... internal call stack
Node.js v20.18.0
With curl it looks similar:
$ curl -H "Host: example.com" https://23.192.228.80
curl: (60) schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - The target principal name is incorrect.
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the webpage mentioned above.
What should this look like for it to work?
P.S. Am I overthinking this? Should I just drop the idea and use the hostname?
I initially tried substituting the hostname with its resolved IP address directly, expecting it to improve performance by skipping DNS resolution on the client side. With HTTPS this fails with certificate verification errors (ERR_CERT_COMMON_NAME_INVALID). TLS certificates are issued for domain names, not raw IPs, so when the URL contains a raw IP the browser has no hostname to send as SNI and validates the certificate against the IP, which the certificate does not cover.
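It helped me to confirm that the problem is only about which name the TLS layer sees, not about connecting to an IP as such. Outside of Playwright, plain Node can connect to a pre-resolved IP and still validate the certificate against the original hostname, as long as the SNI servername and the Host header are set explicitly. A minimal sketch, separate from my crawler code (fetchViaResolvedIp is just an illustrative name):

import https from "https";
import { resolve4 } from "dns/promises";

// Illustration only: connect to a pre-resolved IP while the TLS layer
// still validates the certificate against the original hostname.
const fetchViaResolvedIp = async (hostname: string) => {
  const ip = (await resolve4(hostname))[0]!;
  return new Promise<number | undefined>((fulfil, reject) => {
    const req = https.request(
      {
        host: ip,                    // TCP connection goes to the resolved IP
        path: "/",
        servername: hostname,        // SNI + certificate check use the real hostname
        headers: { Host: hostname }, // HTTP Host header stays correct too
      },
      (res) => {
        res.resume();
        res.on("end", () => fulfil(res.statusCode));
      },
    );
    req.on("error", reject);
    req.end();
  });
};

console.log(await fetchViaResolvedIp("example.com")); // 200

curl can do the same with --resolve (e.g. curl --resolve example.com:443:23.192.228.80 https://example.com), which pins the connection to a specific IP while keeping the hostname for SNI and certificate validation. As far as I know, Playwright does not expose an equivalent per-request option, which is why I went with a proxy instead.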
To keep resolving DNS myself without breaking certificate validation, I implemented a SOCKS5 proxy. The browser hands the proxy the original hostname rather than an IP, the proxy performs the DNS lookup and opens the TCP connection, and the TLS handshake between the browser and the site still uses the original hostname, so SNI and certificate validation keep working.
Here’s how I configured Playwright to use the proxy:
const browser = await chromium.launch();
const context = await browser.newContext({
  proxy: { server: `socks5://localhost:${port}` },
});
const page = await context.newPage();
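As far as I can tell, Chromium hands the unresolved hostname straight to a SOCKS v5 proxy instead of resolving it locally, so with this configuration all DNS lookups end up going through the proxy.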
And here’s my basic SOCKS5 proxy implementation:
import * as net from "net";
import { DnsResolved, LookupDns, ResolveDns } from "./core/context.types";
import { ResolvedDns } from "./core/types";

const SOCKS_VERSION = 5;

type Props = {
  port: number;
  lookupDns: LookupDns;
  resolveDns: ResolveDns;
  dnsResolved: DnsResolved;
};

export const startSocksProxy = async ({
  port,
  lookupDns,
  resolveDns,
  dnsResolved,
}: Props) => {
  // Try the cache first, fall back to a real DNS query, then store the result.
  const resolveDnsIp = async (hostname: string): Promise<ResolvedDns> => {
    let resolvedDns = await lookupDns(hostname);
    if (!resolvedDns) {
      resolvedDns = await resolveDns(hostname);
      if (!resolvedDns?.addresses?.length) {
        throw new Error(`Failed to resolve DNS for ${hostname}`);
      }
      setImmediate(() => dnsResolved(hostname, resolvedDns));
    }
    return resolvedDns;
  };

  const server = net.createServer((clientSocket) => {
    // Greeting: VER, NMETHODS, METHODS... — only "no authentication" (0x00) is accepted.
    clientSocket.once("data", async (buffer) => {
      const [socksVersion, , ...authMethods] = buffer;
      if (socksVersion !== SOCKS_VERSION || !authMethods.includes(0x00)) {
        clientSocket.destroy();
        return;
      }
      clientSocket.write(Buffer.from([SOCKS_VERSION, 0x00]));

      // Connection request: VER, CMD, RSV, ATYP, DST.ADDR, DST.PORT.
      clientSocket.once("data", async (innerBuffer) => {
        const addressType = innerBuffer[3];
        if (addressType !== 0x03) {
          // Only domain-name targets (ATYP 0x03) are supported.
          clientSocket.destroy();
          return;
        }
        try {
          const domainLength = innerBuffer[4] ?? 0;
          const targetHost = innerBuffer.subarray(5, 5 + domainLength).toString();
          const targetPort = innerBuffer.readUInt16BE(5 + domainLength);
          const targetIp = (await resolveDnsIp(targetHost)).addresses[0]?.address ?? "";

          const remoteSocket = net.createConnection(targetPort, targetIp, () => {
            // Reply "succeeded", then tunnel bytes in both directions.
            clientSocket.write(
              Buffer.from([SOCKS_VERSION, 0x00, 0x00, 0x01, 0, 0, 0, 0, 0, 0]),
            );
            clientSocket.pipe(remoteSocket);
            remoteSocket.pipe(clientSocket);
          });
          remoteSocket.on("error", (err) => {
            console.error(`Remote error: ${err.message}`);
            clientSocket.destroy();
          });
        } catch (err) {
          console.error(err);
          clientSocket.destroy();
        }
      });
    });

    clientSocket.on("error", (err) => console.error(`Client error: ${err.message}`));
  });

  server.listen(port, () => console.log(`SOCKS5 Proxy Server running on port ${port}`));
};
With this setup, the proxy resolves hostnames and forwards the traffic while the original domain name stays intact during the TLS handshake. This avoids the certificate errors while still letting me resolve, cache, and reuse DNS results for performance.
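For completeness, here is roughly how the pieces fit together. The lookupDns / resolveDns / dnsResolved callbacks below are reduced to a plain in-memory Map plus dns.resolve4, so they are simplified stand-ins for my real caching layer, and the import path and port are just placeholders:

import { chromium } from "playwright";
import { resolve4 } from "dns/promises";
import { startSocksProxy } from "./socks-proxy"; // wherever the proxy above lives

// Simplified in-memory cache standing in for my real caching layer.
const cache = new Map<string, { addresses: { address: string }[] }>();
const port = 1080;

await startSocksProxy({
  port,
  lookupDns: async (hostname) => cache.get(hostname),
  resolveDns: async (hostname) => ({
    addresses: (await resolve4(hostname)).map((address) => ({ address })),
  }),
  dnsResolved: (hostname, resolved) => {
    cache.set(hostname, resolved);
  },
});

const browser = await chromium.launch();
const context = await browser.newContext({
  proxy: { server: `socks5://localhost:${port}` },
});
const page = await context.newPage();
await page.goto("https://example.com"); // DNS resolved by the proxy, TLS against the hostname
console.log(`Page title: ${await page.title()}`);
await browser.close();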
This solution allowed me to achieve my original goal—resolving DNS manually while keeping HTTPS working properly.