Web Scraping with Mobile Proxies: 7 Case Studies, Instructions, and Metrics

mobile proxies are often described as the strongest proxy type for scraping and most people stop there, as if “strongest” settles every choice. it doesn’t. mobile IPs have real advantages, real constraints, and a cost profile that makes them the wrong call for some jobs. this guide covers all of it, honestly.

what you’ll find here: why mobile IPs beat datacenter and residential for hard anti-bot targets, seven deep case studies with actual numbers and working code, when mobile proxies are the wrong choice, sticky session math, retry and backoff patterns, headless browser stealth, DNS leak prevention, robots.txt, legality, and a troubleshooting matrix. it’s long because scraping with mobile proxies has enough nuance that a short guide is more dangerous than no guide.

before we go deeper, a quick note on what “mobile proxy” means in this context: a proxy that routes traffic through a real SIM card on a real 4G or 5G connection. in Singapore’s case that means Singtel, M1, StarHub, or an MVNO riding one of those networks. the IP you get is from the carrier’s mobile CGNAT pool, the same range your phone at Changi Airport would get. that matters enormously for how websites score the traffic.

for the underlying explainer on what a mobile proxy is and how the IP assignment works, see what is a mobile proxy.

why mobile IPs beat datacenter and residential for hard targets

most scraping guides treat proxy types as interchangeable and just rank them by price. that’s wrong. the three types have structurally different trust scores built into how websites and CDNs score traffic, and the differences compound at scale.

the trust hierarchy websites actually use

modern anti-bot systems (Cloudflare, Akamai, Imperva, DataDome, and the homebuilt variants run by large e-commerce platforms) don’t just block individual IPs. they score ASNs, IP ranges, and traffic patterns. datacenter ASNs are the first to get hardcoded into block lists because they are cheap, fast to provision, and almost never associated with real organic traffic. a /24 from AWS us-east-1 has been scraped through so many times that its baseline trust score approaches zero on any serious target.

residential IPs are better because they’re registered to ISPs serving actual homes. the problem is the market. residential proxy networks are built by injecting SDK code into consumer apps (often VPN apps with buried ToS clauses) and routing commercial traffic through people’s home connections. the practices here range from grey to clearly exploitative, and anti-bot vendors know how the residential proxy networks are built. they track the ASN ranges those networks use, flag IPs that appear in proxy rotator patterns, and score them accordingly. residential IPs are still useful for many targets but their effective trust score has dropped significantly as the detection arms race has matured.

mobile IPs sit in a different category. they come from carrier CGNAT pools shared across hundreds or thousands of real handsets. the IP a spider gets from a Singapore Mobile Proxy (SMP) port might have served a student streaming YouTube, a commuter checking train times, and a parent in a WhatsApp group, all in the same hour. from the scoring system’s point of view that IP looks exactly like what it is: general mobile traffic. the behavioral fingerprint is fundamentally different from a datacenter range, and carrier IP ranges are not something a website can afford to block without losing a meaningful slice of real users.

factor	datacenter	residential	mobile (SMP)
ASN trust score (cold)	low, instantly flagged	medium	high
carrier range blocking risk	n/a	medium	very low (blocks real users)
IP churn / CGNAT rotation	static or slow	varies	natural, every session or faster
mobile-specific rendering	no	no	yes, real carrier IP
cost per GB	very low	medium	higher
best for	bulk public data, low-risk targets	general residential pages	hard anti-bot, geo-specific, mobile UX

the practical implication: for a target running Cloudflare Bot Management or a serious custom WAF, a datacenter IP will hit a CAPTCHA or 403 within a few requests. a residential IP might get three to five pages before scoring high enough to trigger a challenge. a Singapore mobile IP, properly rate-limited and with correct mobile headers, can run for minutes on the same session before anti-bot systems escalate.

when NOT to use mobile proxies

this section exists because I’ve seen teams burn budget running mobile proxies on targets where cheaper alternatives work fine.

don’t use mobile proxies if:

the target has no meaningful anti-bot protection and a datacenter IP works in testing
you need very high parallelism (100+ concurrent workers) and budget is constrained, residential or datacenter with rotation is often more economical at that scale
your job is bulk download of public static files (PDFs, images, archives), where IP reputation doesn’t affect access
you’re behind an official API with a token, the proxy type is irrelevant
you need specific static IPs for an IP allowlist on a target system you control

if a datacenter proxy gets you the data reliably, use it. mobile proxies solve a specific problem, they’re not a universal upgrade.

what SMP gives you for scraping

Singapore Mobile Proxy routes your traffic through real Singapore SIM cards on Singtel, M1, and other local carriers. from any target website’s perspective, you are a real Singapore mobile user. the key capabilities for scraping work:

IP rotation modes: rotate on a schedule (every N minutes), rotate on demand via a special HTTP request to the rotation endpoint, or hold a sticky session for a defined window. mixing modes is normal.

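HTTP(S) and SOCKS5: both protocols supported. HTTP(S) for simple requests, Scrapy, and browser automation with port credentials, SOCKS5 for anything needing full TCP tunneling. Chromium does not support username/password authentication on SOCKS5 proxies, so Playwright and Puppeteer with credentialed ports generally need the HTTP(S) endpoint.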

concurrent ports: multiple ports means multiple simultaneous IPs. each port is a separate SIM, separate carrier, potentially separate cell tower.

Singapore-specific: if your target geo-gates content to Singapore or serves different prices, inventory, or ads to local mobile users, you need a local mobile IP. a Singapore residential proxy helps with geo but doesn’t give you the carrier trust score or the mobile network fingerprint. for context on Southeast Asian platform scraping, see TikTok, Shopee, Lazada Singapore proxies.

sticky sessions vs rotation for scraping

these are two different tools and choosing wrong wastes either bandwidth or session continuity.

rotation swaps your IP at a defined interval or on request. use this when:

you’re hitting many different URLs on a domain (category pages, product listings)
each request is independent (no session state needed)
you want to spread request fingerprints across many IPs
you’re running parallel workers where each job is self-contained

sticky sessions hold the same IP for a window you set (typically 5-30 minutes). use this when:

you need to paginate through results (page 1, 2, 3… must look like one user)
you’re filling a form or navigating through a checkout flow
you’ve logged in and need the session cookie to remain valid
the target uses progressive trust scoring where early requests build a “session reputation”

a common pattern: one port per domain, sticky for 10 minutes while you paginate, then rotate once the pagination is exhausted so the next keyword or category starts on a fresh IP. this feels natural to anti-bot systems because real users do browse in sessions, not in stateless one-shot requests.

sticky session length guidance:

session type	recommended window
single SERP + 3 pages pagination	8-12 min
product listing full category	12-20 min
login + dashboard scrape	15-30 min
single page, no session	rotate on each request

avoid sessions longer than 30 minutes unless you have a specific reason. a single IP holding a session for hours looks anomalous even in mobile traffic.
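
one way to enforce these windows in code is to record when the current IP was acquired and rotate once the window has passed. the sketch below is illustrative only: StickySession and the WINDOWS values are hypothetical names and numbers picked from inside the ranges in the table above, not part of any SMP client library.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# window lengths in seconds, chosen from inside the ranges in the table (assumed values)
WINDOWS = {
    "serp_pagination": 10 * 60,
    "category_listing": 15 * 60,
    "login_dashboard": 20 * 60,
}
MAX_WINDOW = 30 * 60  # upper bound recommended above


@dataclass
class StickySession:
    session_type: str
    started_at: float = field(default_factory=time.monotonic)

    @property
    def window(self) -> float:
        # unknown session types fall back to the shortest configured window
        chosen = WINDOWS.get(self.session_type, min(WINDOWS.values()))
        return min(chosen, MAX_WINDOW)

    def expired(self, now: Optional[float] = None) -> bool:
        """true once the sticky window has elapsed and a new IP should be requested."""
        now = time.monotonic() if now is None else now
        return now - self.started_at >= self.window

    def restart(self, now: Optional[float] = None) -> None:
        """call right after rotating the IP to start a fresh window."""
        self.started_at = time.monotonic() if now is None else now
```

in a scraper loop you would check session.expired() before each request, trigger the rotation when it returns True, and then call session.restart().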

concurrency and bandwidth math

before you run anything at scale, size your proxy capacity honestly.

requests per port per day depends on your target’s rate sensitivity. a conservative limit for hard anti-bot targets is 1 request every 3-5 seconds per port, with random jitter. that’s roughly 12-20 requests per minute, or about 17,000-29,000 per port per day if you run 24 hours. most scraping workloads don’t run 24/7 from a single port. a realistic daily capacity per port on a careful scrape of a sensitive target: 5,000-10,000 successful requests.

bandwidth per request varies enormously. a mobile-rendered search results page: 150-300 KB of compressed HTML. a marketplace product card with images loaded: 800 KB to 2 MB. a Playwright full-page render with JavaScript: 2-5 MB. this adds up fast.

example calculation for a daily price monitor on 50,000 SKUs:

50,000 SKUs x 1 request/day
= 50,000 requests
average response: 300 KB compressed
= 15 GB raw data transfer
add 15% for retries and failed requests
= ~17.25 GB/day

at 1 req/3s per port, need:
50,000 / (20 req/min x 480 min workday) = ~5.2 ports minimum

so a 6-port SMP plan handles that comfortably inside a workday. see plans for current port options.
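
the same arithmetic is easy to wrap in a helper so you can rerun it with your own numbers. this is a rough sketch: plan_capacity and its parameter names are made up for this example, and the defaults simply mirror the figures above (300 KB per response, 15% retry overhead, one request every 3 seconds, a 480-minute workday, decimal units).

```python
import math


def plan_capacity(
    requests_per_day: int,
    avg_response_kb: float = 300.0,
    retry_overhead: float = 0.15,
    seconds_per_request: float = 3.0,
    window_minutes: int = 480,
) -> dict:
    """estimate daily bandwidth and the number of ports needed (1 GB = 1,000,000 KB)."""
    raw_gb = requests_per_day * avg_response_kb / 1_000_000
    total_gb = raw_gb * (1 + retry_overhead)
    # requests one port can send inside the working window
    per_port_capacity = (60 / seconds_per_request) * window_minutes
    # like the calculation above, the port count ignores retry traffic
    ports_exact = requests_per_day / per_port_capacity
    return {
        "raw_gb": raw_gb,
        "total_gb": total_gb,
        "ports_exact": ports_exact,
        "ports_needed": math.ceil(ports_exact),
    }


plan = plan_capacity(50_000)
print(plan)
```

for the 50,000 SKU example this gives 15 GB raw, about 17.25 GB with retries, and 6 ports after rounding up.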

bandwidth vs data limits: SMP ports have data limits per billing cycle. for continuous heavy scraping, track your bandwidth consumption per port and rotate to a fresh port or plan accordingly. the pool page shows current port availability.

handling 403s, 429s, and CAPTCHAs

these three responses have different causes and different fixes.

403 Forbidden

a 403 from an anti-bot system usually means one of:

your User-Agent is missing or obviously non-browser
your IP range is pre-blocked (rare with mobile IPs but possible on some targets)
your headers are inconsistent (desktop UA but mobile screen size)
you’re missing required headers (Accept-Language, Sec-Fetch-* headers)
TLS fingerprint doesn’t match the declared UA (see headless detection section)

fix 403s by checking your full request headers first. a working mobile Chrome on Android looks like this:

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-SG,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-CH-UA": '"Chromium";v="125", "Not.A/Brand";v="24"',
    "Sec-CH-UA-Mobile": "?1",
    "Sec-CH-UA-Platform": '"Android"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Upgrade-Insecure-Requests": "1",
}
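
because contradictory headers are such a common 403 cause, it can help to lint a header set before sending it. the checker below is a rough sketch: find_header_problems is a hypothetical helper, and its rules only cover the mismatches listed above, not everything an anti-bot system inspects.

```python
REQUIRED = (
    "User-Agent", "Accept", "Accept-Language",
    "Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site",
)


def find_header_problems(headers: dict) -> list:
    """return human-readable warnings for missing or contradictory headers."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    problems = [f"missing {name}" for name in REQUIRED if name.lower() not in h]

    ua = h.get("user-agent", "")
    ch_mobile = h.get("sec-ch-ua-mobile")
    # a mobile Chrome UA contains the token "Mobile"; the client hint must agree
    if ch_mobile is not None and (ch_mobile == "?1") != ("Mobile" in ua):
        problems.append("Sec-CH-UA-Mobile contradicts the User-Agent")

    platform = h.get("sec-ch-ua-platform", "").strip('"')
    # only the Android case is checked here; other platforms would need their own rules
    if platform and (platform == "Android") != ("Android" in ua):
        problems.append("Sec-CH-UA-Platform contradicts the User-Agent")
    return problems


sample = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
    "Accept": "text/html,*/*;q=0.8",
    "Accept-Language": "en-SG,en;q=0.9",
    "Sec-CH-UA-Mobile": "?0",  # deliberately wrong for a mobile UA
    "Sec-CH-UA-Platform": '"Android"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}
print(find_header_problems(sample))
```

an empty list does not guarantee the request passes, it only means none of these specific contradictions are present.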

429 Too Many Requests

429 means you’ve hit a rate limit on the current IP. the response often includes a Retry-After header. if it does, honour it. if it doesn’t, rotate to a new IP and wait at least 30 seconds before hitting the same domain from the new IP. don’t hammer through 429s with a retry loop, you’ll burn through IPs and accelerate bans.
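
honouring Retry-After takes a little care because the header can carry either a number of seconds or an HTTP date. a minimal sketch using only the standard library, assuming a hypothetical helper name retry_after_seconds and the 30-second fallback suggested above:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str],
    default: float = 30.0,
    now: Optional[datetime] = None,
) -> float:
    """convert a Retry-After header (delta-seconds or HTTP date) into seconds to wait."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # a date in the past means no extra wait is required
    return max(0.0, (when - now).total_seconds())
```

the now parameter exists only to make the function easy to test; in a scraper you would call retry_after_seconds(resp.headers.get("Retry-After")).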

CAPTCHAs

a CAPTCHA challenge (hCaptcha, reCAPTCHA, Cloudflare Turnstile) means the anti-bot system’s confidence score dropped below a threshold. common causes:

too many requests in too short a window
header inconsistencies
missing cookies from a prior session
JavaScript fingerprinting returning bot-like values (headless browser tells)

CAPTCHAs are expensive to solve (even with solver services) and signal that something in your approach is wrong. fix the root cause rather than adding a CAPTCHA solver as a first response. if you need a CAPTCHA solver for a small percentage of challenges, services like 2captcha, CapMonster, and nopecha integrate into most Python scraping stacks, but they should be a fallback, not the primary strategy.

retry and backoff patterns

a scraper without proper retry logic is a scraper that breaks at 2am when you’re not watching. the pattern below handles the common failure modes without hammering a target during an outage.

import asyncio
import random
import httpx

async def fetch_with_retry(
    url: str,
    proxy_url: str,
    max_retries: int = 4,
    base_delay: float = 2.0,
    headers: dict | None = None,
) -> httpx.Response | None:
    """
    exponential backoff with jitter. rotates proxy IP on 429.
    returns None on a hard-stop status or when all retries are exhausted.
    """
    for attempt in range(max_retries):
        backoff = base_delay * (2 ** attempt)
        try:
            async with httpx.AsyncClient(
                proxy=proxy_url,  # needs httpx 0.26 or newer
                timeout=20.0,
                follow_redirects=True,
            ) as client:
                resp = await client.get(url, headers=headers or {})
        except (httpx.TimeoutException, httpx.ConnectError, httpx.ProxyError) as e:
            wait = backoff + random.uniform(0, 2.0)
            print(f"attempt {attempt+1} failed: {e!r}, retrying in {wait:.1f}s")
            await asyncio.sleep(wait)
            continue

        if resp.status_code == 200:
            return resp

        if resp.status_code == 429:
            # Retry-After can be seconds or an HTTP date; fall back to 60s
            try:
                retry_after = float(resp.headers.get("Retry-After", 60))
            except ValueError:
                retry_after = 60.0
            # trigger IP rotation via SMP rotation endpoint
            await rotate_proxy_ip(proxy_url)
            wait = max(retry_after, backoff)
            await asyncio.sleep(wait + random.uniform(0, wait * 0.2))
            continue

        if resp.status_code in (403, 407, 503):
            # hard stop: blocked, proxy auth failure, or challenge page
            print(f"non-retryable {resp.status_code} on {url}")
            return None

        # any other status: back off instead of retrying immediately
        await asyncio.sleep(backoff + random.uniform(0, 2.0))

    print(f"all retries exhausted for {url}")
    return None


async def rotate_proxy_ip(proxy_url: str) -> None:
    """
    SMP exposes a rotation endpoint. hit it to get a fresh IP
    without changing your port credentials.
    replace with your actual rotation endpoint URL.
    """
    rotation_endpoint = "http://your-smp-port-host:port/rotate"
    async with httpx.AsyncClient() as client:
        try:
            await client.get(rotation_endpoint, timeout=5.0)
        except Exception:
            pass  # best-effort, failure is fine


# usage
async def main():
    proxy = "http://username:password@sg1.singaporemobileproxy.com:45001"
    result = await fetch_with_retry(
        "https://target-site.com/product/12345",
        proxy_url=proxy,
    )
    if result:
        print(result.text[:500])

if __name__ == "__main__":
    asyncio.run(main())

key decisions in this pattern:

exponential backoff with jitter: base_delay * (2 ** attempt) + random(0, jitter) prevents synchronized thundering-herd retries when running parallel workers
IP rotation on 429: pull a fresh IP before waiting, so the wait happens on a clean IP, not a flagged one
hard stop on 403: retrying 403 with the same IP and headers doesn’t help. fix the request first
timeout of 20s: mobile connections are real networks with variable latency. 5s timeouts cause false negatives on legitimate slow pages

cookies and session warm-up

a session is more than an IP. it’s also the cookie jar your browser accumulates. many anti-bot systems require a valid session cookie before they’ll serve content, and getting that cookie means making a first request that lands cleanly, usually to the homepage or a lightweight API endpoint.

the pattern:

get a sticky IP on your SMP port
make a warm-up request to the homepage with full browser headers (no authentication needed, just establish cookies)
save the cookies from that response
make your actual scrape requests with those cookies attached
when the sticky session ends and you get a new IP, repeat the warm-up

import httpx

def make_session_with_warmup(proxy_url: str, target_domain: str) -> httpx.Client:
    """
    creates an httpx client with cookies populated by a warmup request.
    """
    client = httpx.Client(
        proxy=proxy_url,
        timeout=20.0,
        follow_redirects=True,
        headers={
            "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
            "Accept-Language": "en-SG,en;q=0.9",
        },
    )
    # warmup: just GETs the homepage to collect cookies
    try:
        client.get(f"https://{target_domain}/")
    except Exception:
        pass
    return client

for long-running scrapers, persist cookie jars to disk between runs so you don’t pay the warm-up cost every restart. httpx doesn’t have built-in cookie persistence, but you can serialize client.cookies to JSON and reload it.
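
a minimal sketch of that persistence, written against the standard library CookieJar, which is the type httpx exposes as client.cookies.jar. dump_cookies, load_cookies and _make_cookie are hypothetical helper names, and the sketch deliberately keeps only name, value, domain and path, so expiry and secure flags are lost.

```python
import json
import os
import tempfile
from http.cookiejar import Cookie, CookieJar


def _make_cookie(name: str, value: str, domain: str, path: str) -> Cookie:
    # build a session cookie with neutral defaults for everything we don't store
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=bool(domain),
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=bool(path),
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def dump_cookies(jar: CookieJar, path: str) -> None:
    """write name, value, domain and path of every cookie in the jar to a JSON file."""
    data = [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in jar
    ]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)


def load_cookies(path: str) -> CookieJar:
    """rebuild a CookieJar from a file written by dump_cookies."""
    jar = CookieJar()
    with open(path, encoding="utf-8") as fh:
        for item in json.load(fh):
            jar.set_cookie(_make_cookie(**item))
    return jar


# round-trip demo with a throwaway jar and a temporary file
demo = CookieJar()
demo.set_cookie(_make_cookie("sid", "abc123", "example.sg", "/"))
demo_path = os.path.join(tempfile.mkdtemp(), "cookies.json")
dump_cookies(demo, demo_path)
restored = load_cookies(demo_path)
print([(c.name, c.value) for c in restored])
```

with httpx you would call dump_cookies(client.cookies.jar, path) on shutdown and pass cookies=load_cookies(path) when building the next client.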

when to clear cookies: when rotating IPs, clear cookies before the warm-up on the new IP. carrying cookies from one IP session to a different IP creates an inconsistency: the server sees a cookie that was issued to a different IP, which some systems flag. treat each IP as a fresh browser.

headless browser detection and Playwright stealth

when httpx or requests isn’t enough (JavaScript-heavy targets, React SPAs, dynamically loaded content), you need a headless browser. the tradeoff: headless browsers are harder to make look like real users, and they use more bandwidth per page.

what headless detection actually checks

modern detection focuses on a cluster of signals that differ between a headless browser and a real Chrome on Android:

navigator.webdriver: set to true in vanilla Playwright/Puppeteer, while real browsers report false
Chrome runtime: window.chrome is undefined or differently structured in headless
permissions API: in older headless Chrome, Notification.permission reports 'denied' while navigator.permissions.query({name: 'notifications'}) reports 'prompt', a contradiction real browsers don’t show
plugins array: empty in headless, populated in real Chrome
WebGL renderer: reports “SwiftShader” or similar in headless, real GPU strings in physical devices
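TLS fingerprint (JA3/JA4): the TLS handshake has to match the browser named in the UA. plain Python clients such as requests and httpx send a handshake that looks nothing like Chrome’s, and a desktop Chromium build claiming to be Chrome on Android can also differ in the details. Cloudflare and others fingerprint this. this is the hard one to fake without patching the TLS stack.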

Playwright with stealth on SMP

the best open-source option for stealth Playwright is playwright-stealth (Python) or puppeteer-extra-plugin-stealth (Node). they patch the most detectable signals. use them together with SMP proxies:

import asyncio
import random
from playwright.async_api import async_playwright

async def scrape_with_stealth(url: str, proxy_config: dict) -> str:
    """
    proxy_config = {
        "server": "http://sg1.singaporemobileproxy.com:45001",
        "username": "your_user",
        "password": "your_pass",
    }
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                # add "--no-sandbox" only if you run as root inside a container
            ],
        )
        try:
            context = await browser.new_context(
                proxy=proxy_config,
                user_agent="Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
                # Android-sized viewport to match the Samsung UA (approximate values)
                viewport={"width": 412, "height": 915},
                device_scale_factor=3.5,
                is_mobile=True,
                has_touch=True,
                locale="en-SG",
                timezone_id="Asia/Singapore",
            )

            # patch navigator.webdriver and related tells.
            # plugins are left alone: Chrome on Android typically reports an empty list
            await context.add_init_script("""
                Object.defineProperty(navigator, 'webdriver', {get: () => false});
                window.chrome = {runtime: {}};
            """)

            page = await context.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)

            # human-like behaviour: short random pause before reading the DOM
            await asyncio.sleep(random.uniform(1.5, 2.5))

            return await page.content()
        finally:
            # always release the browser, even if navigation fails
            await browser.close()


if __name__ == "__main__":
    html = asyncio.run(scrape_with_stealth(
        "https://example-marketplace.sg/product/123",
        proxy_config={
            "server": "http://sg1.singaporemobileproxy.com:45001",
            "username": "user",
            "password": "pass",
        },
    ))
    print(len(html))

important notes on this config:

is_mobile=True makes Chromium honour the meta viewport tag and lay pages out as a phone would, and has_touch=True enables touch events, so the JS environment matches the mobile UA instead of only the header claiming it
timezone_id="Asia/Singapore" and locale="en-SG" align your JS environment fingerprint with a Singapore user. mismatches (US timezone, Singapore proxy IP) are a weak signal but worth eliminating
the add_init_script patches are the absolute minimum. for serious targets, use the full playwright-stealth library

the TLS problem: even with all the JS patches, Playwright’s TLS handshake still comes from the Chromium build it ships, not from a real mobile Chrome. this is currently the hardest detection vector to defeat without compiling a patched Chromium or using browser-specific fingerprint-matching services. for most targets the JS patches are sufficient. for hard targets (Cloudflare Bot Management on aggressive settings), you may need to switch to a service like browserless.io with hardware rendering or accept that some pages require manual validation.

DNS leaks: why they matter and how to prevent them

a DNS leak happens when your scraper resolves hostnames using your local DNS server (your ISP, or 8.8.8.8) instead of routing DNS queries through the proxy. the result: the target site sees your proxy IP for the connection, but your DNS queries go to your local resolver, potentially leaking your real location or triggering geo-inconsistency flags.

for Python requests and httpx: with an HTTP(S) proxy, the proxy server resolves the target hostname, so there’s no leak. with SOCKS5 in requests, socks5:// resolves hostnames locally through the OS resolver, and only socks5h:// (note the h) sends the hostname to the proxy. use socks5h:// wherever the client accepts it. recent httpx releases do, older ones reject the scheme, so check your installed version.

# leaks DNS (standard SOCKS5)
proxy = "socks5://user:pass@sg1.singaporemobileproxy.com:45002"

# proxies DNS through SOCKS5 (correct)
proxy = "socks5h://user:pass@sg1.singaporemobileproxy.com:45002"

in Python with httpx via SOCKS (requires httpx[socks]):

import httpx

# correct: DNS proxied through the SOCKS5 tunnel
client = httpx.Client(
    proxy="socks5h://user:pass@sg1.singaporemobileproxy.com:45002"
)
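
to keep a plain socks5:// URL from slipping into a config file, you can normalise proxy URLs before handing them to the client. a small sketch, assuming a hypothetical helper name with_remote_dns and assuming your HTTP client accepts the socks5h scheme:

```python
from urllib.parse import urlsplit, urlunsplit


def with_remote_dns(proxy_url: str) -> str:
    """rewrite socks5:// to socks5h:// so the proxy resolves hostnames; leave other schemes alone."""
    parts = urlsplit(proxy_url)
    if parts.scheme.lower() == "socks5":
        parts = parts._replace(scheme="socks5h")
    return urlunsplit(parts)


print(with_remote_dns("socks5://user:pass@sg1.singaporemobileproxy.com:45002"))
```

HTTP(S) proxy URLs pass through unchanged, since the proxy already resolves the target hostname for them.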

for Playwright and Puppeteer: Chromium hands the hostname to the server in the proxy config, for HTTP(S) and for SOCKS5, so DNS is resolved on the proxy side. verify by navigating to a DNS leak test page through your headless browser before running production scrapes.

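for Scrapy: the built-in HttpProxyMiddleware handles HTTP(S) proxies only, and with those the proxy resolves the target hostname, so stick to the HTTP(S) endpoint. SOCKS5 in Scrapy needs a third-party download handler, and you should confirm that it resolves hostnames on the proxy side before relying on it.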

robots.txt, ToS, and legality

this section is honest, not legal advice. the actual legal landscape is jurisdiction-specific and shifting.

robots.txt: robots.txt is a request, not a technical block. your spider can technically ignore it. whether you should is a different question. for competitive intelligence on public data, many commercial scrapers ignore restrictive robots.txt files and operate in a legal grey zone. for academic or research use, respecting robots.txt is both ethical and practically safer. at minimum, read the robots.txt of your target before running at scale and understand what you’re choosing to do.
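
reading robots.txt programmatically is straightforward with the standard library. the sketch below parses an invented robots.txt (the rules, the target-site.sg host and the my-price-monitor agent name are all made up for illustration) and asks which paths a crawler is asked to avoid:

```python
from urllib.robotparser import RobotFileParser

# invented example file; fetch the real one from https://<target>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
Crawl-delay: 5
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for path in ("/category/phones", "/checkout/cart"):
    allowed = rules.can_fetch("my-price-monitor", "https://target-site.sg" + path)
    print(path, "allowed" if allowed else "disallowed")
print("crawl delay:", rules.crawl_delay("my-price-monitor"))
```

what you do with the answer is still your decision, but at least it becomes an explicit one that you can log.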

ToS: most large platforms explicitly prohibit scraping in their ToS. again, this is a civil matter in most jurisdictions and the platforms’ ability to enforce it varies enormously. scraping publicly available data for competitive intelligence is commonly practiced by businesses of all sizes. scraping behind authentication (logged-in data) is a harder legal and ethical case.

computer fraud laws: in the US, the CFAA (Computer Fraud and Abuse Act) has historically been used against scrapers, though the legal interpretation of “unauthorized access” for public web pages has been contested. the hiQ vs LinkedIn case went through years of appeals and made it harder to frame scraping of publicly available data as a CFAA violation. the Singapore context: the Computer Misuse Act covers unauthorized access, and to my knowledge scraping public, non-login pages has not been prosecuted under it.

personal data: this is where the law is clearest. scraping personal data (names, contact info, private profiles) in Singapore falls under PDPA. scraping public business information, prices, product data, and aggregate statistics is a different category. when in doubt: don’t scrape personal data, anonymize what you collect, and get legal advice for your specific use case.

practical guidelines:

only scrape publicly available pages (no authentication bypass)
respect rate limits as stated in ToS or robots.txt where commercially feasible
don’t re-publish scraped data in ways that violate copyright
store data securely if it incidentally contains any personal information
keep logs of what you scraped and when, for audit purposes

Scrapy integration

Scrapy is the standard for large-scale Python scraping. here’s a working middleware configuration for SMP.

# settings.py

# SMP proxy credentials and rotation endpoint (placeholders, use your own)
SMP_PROXY = "http://username:password@sg1.singaporemobileproxy.com:45001"
SMP_ROTATE_ENDPOINT = "http://sg1.singaporemobileproxy.com:45001/rotate"

# keep the built-in HttpProxyMiddleware (priority 750) enabled: it turns the
# credentials in request.meta["proxy"] into a Proxy-Authorization header
DOWNLOADER_MIDDLEWARES = {
    "yourproject.middlewares.SmpProxyMiddleware": 100,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

RETRY_TIMES = 3
# 429 is handled by SmpProxyMiddleware so the IP rotates before the retry
RETRY_HTTP_CODES = [500, 502, 503, 504]
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# mobile Chrome UA
USER_AGENT = "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36"

# middlewares.py

import requests as req_lib


class SmpProxyMiddleware:
    """
    injects SMP proxy into every request.
    handles 429 by rotating IP before retry, at most MAX_ROTATIONS times.
    """

    MAX_ROTATIONS = 3

    def __init__(self, proxy_url, rotate_endpoint):
        self.proxy_url = proxy_url
        self.rotate_endpoint = rotate_endpoint

    @classmethod
    def from_crawler(cls, crawler):
        # read the values from settings.py instead of hardcoding them twice
        return cls(
            crawler.settings.get("SMP_PROXY"),
            crawler.settings.get("SMP_ROTATE_ENDPOINT"),
        )

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url

    def process_response(self, request, response, spider):
        if response.status != 429:
            return response
        rotations = request.meta.get("smp_rotations", 0)
        if rotations >= self.MAX_ROTATIONS:
            spider.logger.error(f"giving up on {request.url} after {rotations} rotations")
            return response
        spider.logger.warning(f"429 on {request.url}, rotating IP")
        self._rotate_ip(spider)
        retry_req = request.copy()
        retry_req.dont_filter = True
        retry_req.meta["smp_rotations"] = rotations + 1
        return retry_req

    def _rotate_ip(self, spider):
        # blocking call: stalls the reactor for up to 5s, tolerable for rare rotations
        try:
            req_lib.get(self.rotate_endpoint, timeout=5)
        except Exception as e:
            spider.logger.error(f"rotation failed: {e}")

building a scraping stack step by step

putting it all together. here’s how to build a scraping stack from zero for a Singapore marketplace target.

step 1: assess the target

open the target URL in a real mobile browser (Chrome on Android)
open DevTools or a proxy tool like Charles, record the network requests
note which headers are present, what cookies are set, whether JavaScript is required
check if content is server-rendered HTML or loaded via API endpoints. API endpoints are often much simpler to call directly

step 2: test with a single request

before writing a full spider, test your proxy and headers manually:

curl -x "http://user:pass@sg1.singaporemobileproxy.com:45001" \
  -H "User-Agent: Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36" \
  -H "Accept-Language: en-SG,en;q=0.9" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  -L https://target-site.sg/category/phones \
  -o /tmp/test.html -w "%{http_code}"

a 200 here means your proxy and headers pass the first check. a 403 means you need to look at additional headers. a timeout means a proxy connectivity issue.

step 3: extract and parse

from bs4 import BeautifulSoup

def parse_product_listing(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select(".product-card"):  # adjust selector
        name = card.select_one(".product-name")
        price = card.select_one(".price")
        link = card.select_one("a[href]")
        if name is None or price is None or link is None:
            continue  # skip cards that don't match the expected layout
        products.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
            "sku": card.get("data-sku", ""),
            "url": link["href"],
        })
    return products

step 4: add pagination

import asyncio
import random

# relies on fetch_with_retry and parse_product_listing defined earlier
async def scrape_all_pages(base_url: str, proxy_url: str, max_pages: int = 200) -> list[dict]:
    all_products = []
    # hard cap in case the site never returns an empty page
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = await fetch_with_retry(url, proxy_url)
        if resp is None:
            break
        products = parse_product_listing(resp.text)
        if not products:
            break  # no more products = end of catalogue
        all_products.extend(products)
        await asyncio.sleep(3 + random.uniform(0, 2))  # polite delay
    return all_products

step 5: store and validate

import sqlite3
from contextlib import closing
from datetime import datetime, timezone

def store_products(products: list[dict], db_path: str = "prices.db") -> None:
    # one timezone-aware timestamp for the whole batch
    scraped_at = datetime.now(timezone.utc).isoformat()
    # closing() guarantees the connection is released even if an insert fails
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS products (
                sku TEXT,
                name TEXT,
                price TEXT,
                url TEXT,
                scraped_at TEXT
            )
        """)
        rows = [
            (p["sku"], p["name"], p["price"], p["url"], scraped_at)
            for p in products
        ]
        conn.executemany(
            "INSERT INTO products VALUES (?,?,?,?,?)",
            rows,
        )
        conn.commit()

step 6: schedule and monitor

run as a cron job or Airflow DAG. track: total requests, 2xx rate, average response time, retry rate, distinct IPs seen. alert if 2xx rate drops below 90% for more than 5 minutes.
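
the alert rule can be implemented with a sliding window over recent responses. a minimal sketch, assuming a hypothetical SuccessRateMonitor class and assuming the 2xx rate is measured over the last 5 minutes of traffic (the text above does not fix the measurement window, so treat both 300-second defaults as a choice):

```python
from collections import deque
from typing import Optional


class SuccessRateMonitor:
    """tracks the 2xx rate over a sliding time window and flags sustained drops."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.90, grace_s: float = 300.0):
        self.window_s = window_s      # how far back the rate is measured
        self.threshold = threshold    # minimum acceptable 2xx rate
        self.grace_s = grace_s        # how long the rate may stay low before alerting
        self.events = deque()         # (timestamp, was_2xx) pairs
        self.below_since: Optional[float] = None

    def record(self, status_code: int, now: float) -> bool:
        """add one response; return True when an alert should fire."""
        self.events.append((now, 200 <= status_code < 300))
        # drop events that have fallen out of the window
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        rate = sum(ok for _, ok in self.events) / len(self.events)
        if rate >= self.threshold:
            self.below_since = None
            return False
        if self.below_since is None:
            self.below_since = now
        return now - self.below_since >= self.grace_s
```

call record(resp.status_code, time.monotonic()) after every request and route a True result to whatever alerting channel you already use.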

scenario 1: SEO analytics and local SERP scraping

who this is for: SEO agencies, in-house SEO teams, local businesses with branch networks. the goal is to gather search results as a real Singapore mobile user, capturing local packs, featured snippets, maps, shopping blocks, FAQ boxes, and ranking positions per keyword.

why mobile proxies win here: Google serves different results to mobile users than desktop users, and different results based on location and carrier. a datacenter IP gets datacenter results, which may have no relation to what a real Singapore phone user sees on Singtel. mobile IPs get the real mobile SERP. this is the only way to accurately track mobile search rankings.

setup:

for each target city or district, create a dedicated SMP port. set sticky sessions to 8-10 minutes so that paginating through results looks like one continuous user session. rotate IPs between keyword groups, not within them.

import asyncio
import random
import httpx
from urllib.parse import quote

PROXY = "http://user:pass@sg1.singaporemobileproxy.com:45001"

MOBILE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
    "Accept-Language": "en-SG,en;q=0.9,zh-SG;q=0.7",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Sec-CH-UA-Mobile": "?1",
}

async def scrape_serp(keyword: str, pages: int = 3) -> list[dict]:
    results = []
    # one client for all pages so cookies persist across the pagination
    async with httpx.AsyncClient(proxy=PROXY, timeout=20, follow_redirects=True) as client:
        for page_num in range(pages):
            start = page_num * 10
            url = f"https://www.google.com/search?q={quote(keyword)}&start={start}&hl=en&gl=sg&num=10"
            resp = await client.get(url, headers=MOBILE_HEADERS)
            if resp.status_code != 200:
                print(f"got {resp.status_code} on page {page_num+1}")
                break
            # parse results here with BeautifulSoup
            results.append({"page": page_num+1, "html_length": len(resp.text)})
            # 4-6 second pause with real random jitter between pages
            await asyncio.sleep(4 + random.uniform(0, 2))
    return results

results from practice: an SEO agency tracking 12 Singapore districts with 520 keywords at 8-hour frequency. previously using datacenter proxies: average success rate 82-88%, frequent 429 errors on peak hours, SERP data often from a generic datacenter perspective rather than a local mobile view. after switching to SMP with 8-minute sticky sessions and 3-second delays: 97.8% success rate, CAPTCHA rate dropped by 64%, positional consistency between runs improved by 23%. total bandwidth cost: roughly 4.2 GB per day across all keywords and districts.

optimisations for SERP scraping:

always use &gl=sg&hl=en to pin geography and language in the URL, don’t rely on the proxy IP alone for geo targeting
capture the full HTML including the rendered DOM structure, not just the visible text. Google changes its SERP layout frequently and you want the raw HTML for your parser to evolve against
track SERP feature presence (maps block, shopping carousel, AI overview, FAQ) as separate fields, not just organic positions. the feature mix is what clients actually care about
for local pack data (Google Maps snippets in SERP), you need JavaScript rendering via Playwright. raw HTML requests don’t get the full local pack
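a minimal sketch of the first and third points above: pinning gl and hl in the URL, and storing SERP features as separate fields next to the raw HTML. `build_serp_url` and `SerpSnapshot` are illustrative names made up for this example, and the record layout is an assumption about how you might store the data, not a fixed schema.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


def build_serp_url(keyword: str, page: int = 0, gl: str = "sg", hl: str = "en") -> str:
    """Build a Google search URL with geography and language pinned explicitly."""
    params = {"q": keyword, "start": page * 10, "hl": hl, "gl": gl, "num": 10}
    return "https://www.google.com/search?" + urlencode(params)


@dataclass
class SerpSnapshot:
    # hypothetical record layout: one row per keyword, page, and run
    keyword: str
    page: int
    raw_html: str  # full HTML, kept so the parser can be re-run later
    organic_positions: list[str] = field(default_factory=list)
    # feature presence stored as separate fields, not folded into positions
    has_maps_block: bool = False
    has_shopping_carousel: bool = False
    has_ai_overview: bool = False
    has_faq: bool = False


snapshot = SerpSnapshot(keyword="coffee near me", page=1, raw_html="<html></html>")
print(build_serp_url(snapshot.keyword, page=snapshot.page))
```

how each feature flag gets set depends on selectors that change often, so that part is deliberately left out here.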

scenario 2: price and availability monitoring on marketplaces

who this is for: brands, distributors, e-commerce analytics teams. the goal is to track prices, stock levels, seller rankings, discounts, and listing positions across Shopee, Lazada, Carousell, and comparable platforms in Singapore and the region.

why mobile proxies win here: Shopee and Lazada heavily segment their mobile app experience from the desktop web. price floors, flash sale access, and voucher availability can differ. monitoring the mobile version gives you the price a real Shopee app user sees. both platforms use aggressive bot detection on their web interfaces. for scraping at scale, see also TikTok, Shopee, Lazada Singapore.

setup: segment your SKU list by category and assign separate ports per marketplace. use sticky sessions for paginating through search results and category listings (one session per category, rotate between categories). for individual product cards, rotation on each request is fine since each card is stateless.

import asyncio
import httpx
from bs4 import BeautifulSoup

PROXY = "http://user:pass@sg1.singaporemobileproxy.com:45001"

async def scrape_shopee_product(product_url: str) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
        "Referer": "https://shopee.sg/",
        "Accept-Language": "en-SG,en;q=0.9",
        "X-Requested-With": "XMLHttpRequest",
    }
    # Shopee product data is often available via their API endpoint
    # convert product URL to API call if possible
    if "/product/" in product_url:
        parts = product_url.rstrip("/").split("/")
        shop_id, item_id = parts[-2], parts[-1]
        api_url = f"https://shopee.sg/api/v4/item/get?itemid={item_id}&shopid={shop_id}"
        async with httpx.AsyncClient(proxy=PROXY, timeout=20) as client:
            resp = await client.get(api_url, headers=headers)
            if resp.status_code == 200:
                data = resp.json()
                item = data.get("data") or {}  # guard against a null "data" field
                return {
                    "name": item.get("name"),
                    "price": (item.get("price") or 0) / 100000,  # Shopee uses 100000ths
                    "stock": item.get("stock"),
                    "sold": item.get("historical_sold"),
                    "rating": (item.get("item_rating") or {}).get("rating_star"),
                }
    return {}


async def monitor_category(category_url: str, max_pages: int = 10) -> list[dict]:
    """paginate through a Shopee category listing"""
    all_items = []
    async with httpx.AsyncClient(proxy=PROXY, timeout=20, follow_redirects=True) as client:
        for page in range(max_pages):
            # Shopee listing API accepts 'by' and 'page' params
            params = {"by": "relevancy", "page": page, "limit": 60}
            # skeleton only: derive the listing API URL from category_url, then
            # request it with params=params and extend all_items with the parsed items
            await asyncio.sleep(3 + asyncio.get_event_loop().time() % 2)
    return all_items

results from practice: a brand monitoring 43,000 SKUs across 8 Southeast Asian markets, daily volume approximately 1.25 million requests. datacenter proxies gave 78% success before migration. after SMP with concurrency throttled to 1 request per 3 seconds per port, running 12 ports: 99.1% valid responses, average TTFB decreased by 17%, retry rate fell from 22% to under 4%. data lag (time from price change to detection) improved from 28 minutes to 12 minutes average. total daily bandwidth: approximately 22 GB across all ports and markets.

optimisations:

Shopee and Lazada both expose JSON APIs that are cleaner to call than scraping HTML. reverse-engineer the API calls the mobile app makes using mitmproxy or Charles
capture the promotional_price and original_price separately, not just the displayed price. flash sale prices are the promotional_price field and disappear after the sale
store raw JSON responses alongside extracted fields. marketplaces change their data model every few months and you want to re-extract from raw data rather than re-scrape
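a small sketch of the last two points, assuming the item payload is already parsed into a dict. the field names `promotional_price` and `original_price` come from the text above and may be named differently on your target; `extract_prices` is a hypothetical helper, not a marketplace API.

```python
import hashlib
import json


def extract_prices(raw_item: dict) -> dict:
    """Pull both price fields out of one raw item payload and keep the payload."""
    # serialise deterministically so the same payload always hashes the same
    raw_json = json.dumps(raw_item, sort_keys=True, ensure_ascii=False)
    promo = raw_item.get("promotional_price")
    return {
        "promotional_price": promo,
        "original_price": raw_item.get("original_price"),
        "on_promotion": promo is not None,
        # raw payload stored alongside the extracted fields for later re-extraction
        "raw_json": raw_json,
        "raw_sha256": hashlib.sha256(raw_json.encode("utf-8")).hexdigest(),
    }


record = extract_prices({"promotional_price": 1990, "original_price": 2990})
print(record["on_promotion"], record["raw_sha256"][:12])
```

the hash makes it cheap to skip writing a payload that has not changed since the previous poll.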

scenario 3: monitoring public social media content

who this is for: social media managers, brands, competitive intelligence analysts. the goal is to monitor public posts, reaction counts, hashtag performance, and media format trends on platforms where no official API covers the needed data.

why mobile proxies matter here: TikTok, Instagram, and Xiaohongshu (RedNote) serve substantially different content on mobile IP ranges versus datacenter ranges. mobile IPs are more likely to get the actual local Singapore For You feed rather than a generic global feed. for TikTok specifically, the mobile IP signals to the serving layer that this is app traffic, which can unlock content that the web interface restricts.

for a detailed guide to TikTok specifically, see mobile proxy for TikTok Singapore.

important constraint: scraping social platforms is the most legally and ToS-sensitive area in this guide. official APIs exist for most major platforms and should be your first option. scraping public data on platforms where APIs are unavailable or insufficient is widely practiced but sits in a grey zone. do not scrape private profiles, DMs, or any content requiring authentication.

setup: 20-minute rotation intervals, one port per platform, request rate no faster than 1 per 5 seconds. collect public fields only (post text, timestamp, like/share/comment counts, author handle if public).
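the "no faster than 1 per 5 seconds" rule can be enforced in code rather than by sprinkling sleeps around. a minimal sketch, assuming one limiter instance per port; `MinIntervalLimiter` is a name invented for this example.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Allow at most one request per `min_interval` seconds on one port."""

    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time of the previous request, if any

    async def wait(self) -> float:
        """Sleep until the next request is allowed; return the seconds slept."""
        async with self._lock:
            slept = 0.0
            if self._last is not None:
                remaining = self.min_interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
                    slept = remaining
            self._last = time.monotonic()
            return slept


async def demo() -> None:
    limiter = MinIntervalLimiter(min_interval=0.05)  # short interval for the demo
    for _ in range(3):
        slept = await limiter.wait()
        print(f"slept {slept:.3f}s before request")


asyncio.run(demo())
```

call `await limiter.wait()` immediately before each request on that port. the lock keeps concurrent tasks from slipping through together.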

results from practice: a team tracking 280 public communities across 5 content categories on a regional social platform. after switching to SMP: 98.9% success rate at peak hours (was 71% with datacenter IPs, which hit a WAF block on this particular platform), false “empty page” rate dropped by 22% because mobile-rendered pages have more stable selectors than the desktop versions, and UTM link detection accuracy improved because mobile pages unfurl links more completely.

scenario 4: reviews and ratings from open catalogues

who this is for: product teams, customer experience analysts, quality teams. gathering new reviews and ratings from Google Play, Apple App Store, Lazada, Shopee, and Amazon.sg helps track sentiment, surface regressions, and prioritise feature work.

why mobile proxies help here: app stores serve different review content to mobile clients. the full review text, developer response, and sub-ratings (screenshots, helpfulness votes) are often only fully rendered for mobile UA requests. platform anti-scraping on review sections is aggressive because reviews are commercially sensitive.

setup: 15-minute rotation per port, sticky for 10 minutes within a single product card’s pagination, then rotate. collect: rating, review text, author handle (public), date, helpful count, response status.

import asyncio
import httpx
from bs4 import BeautifulSoup
from datetime import datetime

PROXY = "http://user:pass@sg1.singaporemobileproxy.com:45001"

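async def fetch_google_play_reviews(app_id: str, max_pages: int = 5) -> list[dict]:
    """
    Google Play review scraping via the web endpoint.
    this sketch parses only the first page; following the 'next page token'
    from the response body (up to max_pages) is left to implement.
    the selectors below are examples, check them against the live markup.
    """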
    base_url = "https://play.google.com/store/apps/details"
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-SG,en;q=0.9",
    }
    reviews = []
    async with httpx.AsyncClient(proxy=PROXY, timeout=20, follow_redirects=True) as client:
        resp = await client.get(
            base_url,
            params={"id": app_id, "hl": "en", "gl": "SG"},
            headers=headers,
        )
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, "html.parser")
            for review_el in soup.select('[data-reviewid]'):
                text_el = review_el.select_one('.review-text')
                rating_el = review_el.select_one('[aria-label*="stars"]')
                reviews.append({
                    "text": text_el.get_text(strip=True) if text_el else "",
                    "rating": rating_el.get("aria-label", "") if rating_el else "",
                    "scraped_at": datetime.utcnow().isoformat(),
                })
    return reviews

results from practice: a product team monitoring 50 app/product cards across 2 catalogues, approximately 220,000 reviews per month. after integrating SMP: sentiment analysis pipeline completed 27% faster due to more consistent page loading, duplicate rate in the incoming stream dropped from 18% to 4% (because stable mobile pages don’t intermittently serve truncated content), average lag for detecting new reviews dropped to 9 minutes from 23 minutes.

optimisations:

save both the raw HTML of the review section and the extracted JSON. when a platform changes its template (which they do every 2-4 months), you can re-extract from saved HTML without re-scraping
poll frequency should match actual review velocity. a small app gets 5 new reviews per day, polling every 3 hours is plenty. a viral app might need hourly polls. over-polling wastes bandwidth and raises flags

scenario 5: ad and creative verification

who this is for: advertisers, agencies, ad tech teams. verifying that mobile ad creatives are displaying correctly in the right geos, with correct UTM tagging, landing pages loading, and no broken redirects, is an operational requirement for any campaign running at scale.

why mobile proxies are essential here: ad serving is geo and device-targeted. a creative showing a Singapore Singtel promotion may only serve to users on a Singapore mobile IP with a mobile UA. verifying this from a datacenter IP in the US shows nothing. you need the actual target audience’s network fingerprint to verify the ad is actually serving to them.

setup: create a dedicated “ad-verify” port per target geo. use 5-minute sticky sessions so a warmup request establishes cookies before the verification page load. capture HAR files (HTTP Archive format) using Playwright for full network log analysis.

import asyncio
from playwright.async_api import async_playwright
import json

async def verify_ad_placement(
    landing_url: str,
    expected_utm_source: str,
    proxy_config: dict,
) -> dict:
    """
    loads a landing URL through SMP and verifies:
    - page loads with 200
    - UTM parameters present
    - no broken redirects
    - creative block is visible
    """
    result = {"url": landing_url, "status": None, "utm_ok": False, "load_time_ms": None}

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            proxy=proxy_config,
            user_agent="Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
            viewport={"width": 390, "height": 844},
            is_mobile=True,
        )
        page = await context.new_page()
        start = asyncio.get_event_loop().time()
        response = await page.goto(landing_url, wait_until="networkidle", timeout=30000)
        load_time = (asyncio.get_event_loop().time() - start) * 1000

        result["status"] = response.status if response else None
        result["load_time_ms"] = round(load_time)
        result["final_url"] = page.url
        result["utm_ok"] = expected_utm_source in page.url

        await browser.close()
    return result


async def verify_campaign_placements(placements: list[dict], proxy_config: dict) -> list[dict]:
    """verify a list of ad placements concurrently (max 3 at once)"""
    semaphore = asyncio.Semaphore(3)
    async def bounded_verify(placement):
        async with semaphore:
            return await verify_ad_placement(
                placement["url"],
                placement["expected_utm"],
                proxy_config,
            )
    return await asyncio.gather(*[bounded_verify(p) for p in placements])
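
the `utm_ok` check above tests whether the expected string appears anywhere in the final URL, so it would also match a path segment or an unrelated parameter. a stricter variant, sketched here with a hypothetical helper `utm_source_matches`, parses the query string and compares only the `utm_source` value.

```python
from urllib.parse import parse_qs, urlsplit


def utm_source_matches(final_url: str, expected_utm_source: str) -> bool:
    """True only if the utm_source query parameter equals the expected value."""
    query = parse_qs(urlsplit(final_url).query)
    return expected_utm_source in query.get("utm_source", [])


print(utm_source_matches("https://example.sg/landing?utm_source=meta&utm_medium=cpc", "meta"))
print(utm_source_matches("https://example.sg/meta-offer?utm_source=google", "meta"))
```

the second call is the case the substring check gets wrong: "meta" appears in the path, but the source is a different one.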

results from practice: an agency verifying up to 1,200 placements daily across Singapore geo-targeted campaigns. with datacenter proxies: 31% false negatives where creatives appeared unavailable but were actually serving correctly to real users. SMP eliminated those false negatives. SLA for confirming campaign launch dropped from 6 hours to 2 hours. broken redirect rate detection improved because the Singapore mobile IP actually triggered the correct ad serving chain.

scenario 6: monitoring availability and performance of mobile site versions

who this is for: DevOps/SRE teams, website owners running Singapore-facing services. measuring real mobile performance from Singapore carrier networks, rather than from a synthetic monitoring probe in a datacenter, reveals performance issues that synthetic monitoring misses.

why this matters: a CDN may perform differently for Singtel mobile users than for generic Singapore IP traffic. TTFB from Singtel LTE to your origin can spike during peak hours in ways your datacenter probe won’t detect. CGNAT adds latency that your test infrastructure doesn’t have. you need to measure from where your actual users are.

setup: create monitoring pools on multiple carriers (Singtel, M1) and run checks every 5-15 minutes on critical pages. store metrics in ClickHouse or TimescaleDB and alert on deviations above 20% from the rolling median.

import asyncio
import httpx
import time

PROXY = "http://user:pass@sg1.singaporemobileproxy.com:45001"

async def measure_page_performance(url: str) -> dict:
    """measure TTFB and total load time through SMP"""
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
        "Cache-Control": "no-cache",
    }
    result = {"url": url, "ttfb_ms": None, "total_ms": None, "status": None, "error": None}
    start = time.monotonic()
    try:
        async with httpx.AsyncClient(proxy=PROXY, timeout=30) as client:
            # use streaming to measure TTFB separately
            async with client.stream("GET", url, headers=headers) as response:
                ttfb = (time.monotonic() - start) * 1000
                result["ttfb_ms"] = round(ttfb)
                result["status"] = response.status_code
                content = await response.aread()
                result["total_ms"] = round((time.monotonic() - start) * 1000)
                result["content_length"] = len(content)
    except Exception as e:
        result["error"] = str(e)
    return result
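
the alerting rule from the setup (flag deviations above 20% from the rolling median) can be sketched in a few lines. the window size and the five-sample warm-up are assumptions chosen for illustration, and `RollingMedianAlert` is not part of any monitoring library.

```python
from collections import deque
from statistics import median


class RollingMedianAlert:
    """Flag samples that deviate from the rolling median by more than `threshold`."""

    def __init__(self, window: int = 24, threshold: float = 0.20, warmup: int = 5):
        self.samples = deque(maxlen=window)  # most recent TTFB values in ms
        self.threshold = threshold
        self.warmup = warmup  # assumed: no alerts until this many samples exist

    def observe(self, ttfb_ms: float) -> bool:
        """Record one sample; return True if it should raise an alert."""
        alert = False
        if len(self.samples) >= self.warmup:
            baseline = median(self.samples)
            alert = baseline > 0 and abs(ttfb_ms - baseline) / baseline > self.threshold
        self.samples.append(ttfb_ms)
        return alert


monitor = RollingMedianAlert()
for value in [100, 102, 98, 101, 99, 105, 150]:
    print(value, monitor.observe(value))
```

feed it `result["ttfb_ms"]` from each check, one instance per URL and carrier, so a slow page does not skew the baseline of a fast one.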

results from practice: an online store discovered that from Singtel LTE specifically, TTFB spiked 150-200ms during evening peak hours (7-10pm), due to CDN routing differences between Singtel’s network and the generic Singapore IP ranges used by their datacenter-based synthetic monitors. reconfiguring CDN origin routing based on this data reduced median TTFB by 18% for mobile users and correlated with a 4% improvement in mobile conversion rate.

scenario 7: building industry datasets from open sources

who this is for: data science teams, market researchers, content analysts. building high-quality training or analytics datasets from publicly available product catalogues, price histories, business listings, and open content sources.

why mobile proxies help at scale: the volume required for dataset collection (millions of pages) makes robust anti-bot evasion essential. datacenter IPs are blocked within thousands of requests on any major property. residential IPs at this volume become expensive. mobile proxies with careful rate limiting and rotation handle the scale sustainably.

dataset collection architecture:

import asyncio
import json
import sqlite3
from asyncio import Queue
from datetime import datetime

import httpx  # needed at module level: _fetch() uses it

class DatasetCollector:
    """
    producer-consumer architecture for large-scale dataset collection.
    producer enqueues URLs, multiple workers consume with concurrency control.
    """

    def __init__(self, proxy_url: str, workers: int = 6, delay: float = 3.0):
        self.proxy_url = proxy_url
        self.workers = workers
        self.delay = delay
        self.queue: Queue = Queue()
        self.results = []

    async def producer(self, url_list: list[str]) -> None:
        for url in url_list:
            await self.queue.put(url)
        # signal completion
        for _ in range(self.workers):
            await self.queue.put(None)

    async def worker(self, worker_id: int) -> None:
        while True:
            url = await self.queue.get()
            if url is None:
                break
            result = await self._fetch(url, worker_id)
            if result:
                self.results.append(result)
            await asyncio.sleep(self.delay + (worker_id * 0.3))  # stagger workers

    async def _fetch(self, url: str, worker_id: int) -> dict | None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36",
        }
        try:
            async with httpx.AsyncClient(proxy=self.proxy_url, timeout=20) as client:
                resp = await client.get(url, headers=headers)
                if resp.status_code == 200:
                    return {"url": url, "html": resp.text, "scraped_at": datetime.utcnow().isoformat()}
        except Exception as e:
            print(f"worker {worker_id}: {url} failed: {e}")
        return None

    async def run(self, url_list: list[str]) -> list[dict]:
        import httpx  # imported here for clarity
        self.proxy_url = self.proxy_url
        await asyncio.gather(
            self.producer(url_list),
            *[self.worker(i) for i in range(self.workers)],
        )
        return self.results


async def main():
    import httpx
    collector = DatasetCollector(
        proxy_url="http://user:pass@sg1.singaporemobileproxy.com:45001",
        workers=6,
        delay=3.0,
    )
    urls = ["https://example.sg/product/1", "https://example.sg/product/2"]  # your URL list
    results = await collector.run(urls)
    print(f"collected {len(results)} pages")

asyncio.run(main())

results from practice: a team collected an open corpus of 9.7 million product cards across 14 categories, running 6 workers at approximately 45 successful requests per second total throughput (across all workers and ports). with SMP: proportion of “dirty” HTML (malformed, truncated, or CAPTCHA pages) decreased by 19% compared to residential proxies used previously, because stable mobile markup is more consistently formed than the sometimes-degraded desktop responses that residential IPs receive under load. a category classifier trained on mobile-snapshot HTML improved F1 by 3.4 percentage points over the previous datacenter-scraped dataset, likely because the mobile layouts have more consistent structure.

dataset quality practices:

maintain a version log of CSS selectors and XPath expressions. add a comment like # working as of 2026-06-14, last verified 2026-06-14 above each selector
run daily random samples: pull 50 random records, inspect manually, verify all fields extracted correctly. this catches silent extractor breakage before it corrupts thousands of records
store raw HTML alongside extracted fields in a keyed store (hash of URL + date as key). when the template changes, re-extract from saved HTML rather than re-scraping
deduplicate by a stable identifier (SKU, product ID) not by URL, since marketplaces frequently add tracking parameters that make the same product appear at many URLs
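a rough sketch of the last two practices. the tracking-parameter list is an assumption and needs extending per target; `canonical_url`, `raw_store_key`, and `dedupe_by_id` are illustrative helpers, not an existing API.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# assumed tracking parameters; extend for your targets
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def canonical_url(url: str) -> str:
    """Drop tracking parameters and the fragment, sort what remains."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(sorted(kept)), "")
    )


def raw_store_key(url: str, scrape_date: str) -> str:
    """Key for the raw HTML store: hash of canonical URL plus date."""
    return hashlib.sha256(f"{canonical_url(url)}|{scrape_date}".encode()).hexdigest()


def dedupe_by_id(records: list[dict], id_field: str = "sku") -> list[dict]:
    """Keep the first record per stable identifier; records without one pass through."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get(id_field)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(rec)
    return unique


print(canonical_url("https://Example.sg/product/1?utm_source=x&color=red&fbclid=abc"))
```

canonicalising the URL before hashing means the same page reached through two tracking links lands on one storage key for a given date.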

troubleshooting matrix

when something goes wrong, this table helps narrow down the cause before you start changing things randomly.

symptom	most likely cause	check	fix
403 on first request	wrong UA or missing headers	curl the URL with the same headers manually	add full mobile Chrome headers (see code above)
403 after 50-100 requests	IP flagged by WAF	check if new IP also gets 403	rotate IP, reduce request rate, add jitter
429 with Retry-After	rate limit hit	read Retry-After header value	rotate IP, wait at least the retry-after period
429 without Retry-After	quota or pattern limit	check timing of 429s	reduce concurrency, increase delay, add jitter
CAPTCHA appearing	bot score too high	test same URL in real mobile browser	fix root cause: headers, JS tells, rate
connection timeout	proxy unreachable	ping proxy host, check port status in SMP dashboard	check port status, try alternate port
SSL error	DNS leak or TLS mismatch	check if using socks5h:// for SOCKS5	switch to HTTP(S) proxy or use socks5h:// scheme
empty response body	JavaScript rendering required	view source in browser, check for JS-injected content	switch to Playwright for this target
cookies missing after rotation	cookie jar not cleared on IP change	log cookies before and after rotation	clear cookies on each IP rotation, warm up new IP
inconsistent data across runs	sticky session expiring mid-pagination	check sticky session duration vs pagination time	increase sticky session window
correct status but wrong content	geo-served wrong version	verify SMP IP is Singapore via ipinfo.io through proxy	confirm port is active and IP resolves to SG
Scrapy ignoring proxy setting	middleware order conflict	check DOWNLOADER_MIDDLEWARES order	ensure custom middleware runs before built-in HTTP proxy
Playwright request not going through proxy	proxy not set on the browser or the context	load ipinfo.io/json inside the page and check the exit IP	pass proxy to `launch()` or `new_context()`; Playwright has no per-page proxy setting

FAQ

is it legal to scrape with mobile proxies in Singapore?

collecting publicly available data (product listings, prices, public posts) from websites is not illegal under Singapore law per se. the Computer Misuse Act covers unauthorised access to protected systems, not reading public web pages. platforms’ ToS prohibitions are civil contracts, not criminal statutes. the harder cases involve personal data (PDPA) and authenticated content. always consult a lawyer for your specific use case. in practice, scraping public product data and SERP results is standard commercial practice across the industry.

what’s the difference between HTTP and SOCKS5 mode for my scraper?

HTTP(S) proxies receive the target hostname and resolve DNS themselves, so the lookup never goes through your local resolver. SOCKS5 proxies are lower-level and tunnel raw TCP. with a plain socks5:// URL many clients resolve the hostname locally and send only the IP, which is why you need socks5h:// to push DNS resolution to the proxy. for most Python scraping with requests or httpx, HTTP(S) proxies are simpler and work correctly. SOCKS5 is needed for tools that require raw TCP tunneling, like some versions of Playwright or Puppeteer configured in certain ways.

how many concurrent requests can I run per port?

for hard anti-bot targets (Cloudflare, Akamai, platform-specific WAFs): 1 request at a time per port, with 3-5 second delays. for medium anti-bot targets: 2 concurrent requests per port. for easy targets: up to 5 concurrent. more than this from a single mobile IP looks anomalous and will trigger rate limiting quickly. if you need more throughput, add ports. see plans for multi-port options.

how do I verify my proxy is actually routing traffic correctly?

test with httpbin or ipinfo.io through your proxy before running production scrapes:

curl -x "http://user:pass@sg1.singaporemobileproxy.com:45001" \
  https://ipinfo.io/json

the response should show a Singapore IP (country SG) with the org field naming one of the Singapore MNOs. if you see your real IP or a datacenter IP, something is wrong with your proxy configuration.

what happens when a sticky session expires mid-scrape?

SMP automatically assigns a new IP when the sticky window closes. your next request will come from a fresh IP. if you’re mid-pagination, the new IP has no session history with the target, which sometimes causes the server to redirect you to the homepage or reset filters. the clean pattern is to detect the IP change (compare IP via ipinfo.io before and after the window), then redo the session warmup before resuming pagination.

can I use SMP proxies with Scrapy’s built-in HTTP proxy middleware?

yes. set the http_proxy and https_proxy environment variables (Scrapy’s built-in HttpProxyMiddleware reads both), or set request.meta['proxy'] per request with your SMP credentials. for more control (per-request rotation, sticky management), write a custom middleware like the one in the Scrapy section above.

do I need to rotate User-Agent alongside the IP?

for simple targets, a consistent mobile Chrome UA works fine. for harder targets, rotating between a few realistic mobile UAs (different Samsung models, Pixel versions, all running recent Chrome builds) adds variation that makes the traffic pattern look more like a pool of real users. never use obviously fake UAs (old IE, bots, or library defaults like python-httpx/0.27).

I’m getting correct responses but the data looks different from what I see in the browser, why?

three common causes: (1) the page requires JavaScript to render the data you need, raw HTTP only gets the server-side HTML shell. switch to Playwright. (2) the page serves different content based on cookies from a prior session (personalisation, A/B tests). add a warmup request to establish cookies first. (3) the target is A/B testing layouts and your scraper hit a variant the manual browser didn’t. compare the HTML structure carefully.

bottom line

mobile proxies are not magic. a Singapore mobile IP gives you a better starting trust score with anti-bot systems, correct mobile rendering, and accurate geo-specific content, but you still need correct headers, sane rate limits, proper session management, and genuine care about the legality of what you’re collecting. a mobile IP on an aggressive scraper with wrong headers and no rate limiting will get blocked faster than a datacenter IP on a careful scraper with correct configuration.

the right way to think about SMP proxies in your stack: they raise the ceiling on what you can reliably collect, they don’t eliminate the need to be thoughtful about how you collect it.

for jobs where mobile IPs genuinely matter, the difference is significant. SERP tracking, Singapore marketplace monitoring, TikTok and social platform data, ad verification, and mobile performance monitoring all produce better results with real Singapore mobile IPs than with any alternative.

if you want to test this on your specific target before committing to a plan, start a free trial at /client/trial and run your test against the actual target. the trial gives you a real SMP port with Singapore carrier IPs, not a sandbox, so the test reflects real production behaviour.

advanced fingerprinting: what anti-bot systems actually measure

I want to go deeper on detection signals because “get mobile IPs and use mobile UA” is only the first layer. serious anti-bot vendors like Cloudflare Bot Management and DataDome collect dozens of signals and run machine learning models that score the probability of a request being automated. understanding each signal helps you decide which ones matter for your specific target.

layer 1: IP reputation

the first check, happens before your request is even parsed. the system looks up the incoming IP in its own dataset and several third-party enrichment sources:

ASN category: is this IP range registered to a data center, a residential ISP, or a mobile carrier? mobile carrier = good. cloud provider = bad.
proxy/VPN reputation: commercial proxy services, even residential ones, appear in threat intelligence databases compiled from honeypots, abuse reports, and scanning data. a fresh Singapore Singtel IP from a SIM card does not appear in these lists.
recent abuse history: if this specific IP was seen in a credential stuffing attack or large-scale scraping campaign in the past 30 days, it gets a negative score. CGNAT means mobile IPs rotate constantly, so the probability that your session shares a past-abuse history with the previous user of that IP is low. this is one of the structural advantages.
geolocation consistency: does the IP’s registered country match the Accept-Language header and other geographic signals? a Singapore mobile IP with en-SG and Singapore timezone is consistent. a datacenter IP in Singapore with headers claiming a US locale is suspicious.

layer 2: TLS fingerprinting

before your application-layer request is processed, the TLS handshake has already happened and been fingerprinted. the two dominant fingerprinting standards are:

JA3 hashes the combination of TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats from the client hello. every browser and HTTP library has a characteristic JA3. python-requests with urllib3 has a very different JA3 from real Chrome on Android. some targets check JA3 against known browser fingerprints and reject anything that doesn’t match.
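
to make the hashing concrete: JA3 joins the five fields with commas, joins the values inside each field with dashes, and takes the MD5 of that string. the sketch below assumes the ClientHello has already been parsed into integer lists with GREASE values removed; the numbers in the example call are illustrative, not a real browser's handshake.

```python
import hashlib


def ja3_string(
    version: int,
    ciphers: list[int],
    extensions: list[int],
    curves: list[int],
    point_formats: list[int],
) -> str:
    """Build the comma/dash-delimited JA3 string from parsed ClientHello fields."""
    lists = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in values) for values in lists]
    return ",".join(fields)


def ja3_hash(
    version: int,
    ciphers: list[int],
    extensions: list[int],
    curves: list[int],
    point_formats: list[int],
) -> str:
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


print(ja3_string(771, [4865, 4866], [0, 23], [29, 23], [0]))
print(ja3_hash(771, [4865, 4866], [0, 23], [29, 23], [0]))
```

because the values are hashed in the order the client sent them, two libraries offering the same ciphers in a different order still produce different hashes.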

JA4 is the successor to JA3, developed by FoxIO, with better handling of cipher suite ordering and extension entropy. it’s more granular and harder to spoof without patching the TLS stack.

what this means for you: if you use requests or httpx with a Chrome User-Agent but the JA3 fingerprint looks like Python’s urllib3, a JA3-checking WAF will flag the mismatch. the fixes:

use a Python library that impersonates Chrome’s TLS stack. curl-cffi does this by binding to curl with built-in impersonation profiles. example:

from curl_cffi import requests as curl_requests

# impersonates Chrome 124 TLS fingerprint
session = curl_requests.Session(impersonate="chrome124")
response = session.get(
    "https://target-site.sg/",
    proxies={"https": "http://user:pass@sg1.singaporemobileproxy.com:45001"},
)

use Playwright, which runs real Chromium and produces a real Chrome TLS handshake. the remaining detection challenge with Playwright is JavaScript-layer tells, not TLS.
for the highest-security targets, use a real Android device or a cloud mobile device farm and automate via Appium. this produces 100% authentic fingerprints but is expensive and slow.

layer 3: HTTP/2 fingerprinting

beyond TLS, HTTP/2 itself has a characteristic fingerprint. the SETTINGS frame sent at the start of an HTTP/2 connection has parameters (header table size, concurrent streams, initial window size, max frame size) that differ between browsers and HTTP libraries. Python’s httpx, when HTTP/2 is enabled with http2=True, sends different SETTINGS values than real Chrome.

curl-cffi handles HTTP/2 fingerprinting correctly when using impersonation mode. standard requests speaks only HTTP/1.1, and httpx does the same unless you enable HTTP/2, so both avoid the H2 fingerprint issue. some targets now use H2 fingerprinting as an additional signal, though, and a client that claims to be Chrome but never negotiates HTTP/2 can itself stand out.

layer 4: JavaScript execution fingerprinting

for targets that require JavaScript (which is most modern sites), the page executes code that reads properties of the browser environment and sends the results back to the server. the properties checked include:

canvas fingerprint: render a specific string with a specific font and read the pixel values. headless Chromium and real Chrome produce slightly different results due to font rendering and GPU vs software rendering differences.
WebGL fingerprint: query the WEBGL_debug_renderer_info extension and call getParameter(UNMASKED_RENDERER_WEBGL) and getParameter(UNMASKED_VENDOR_WEBGL). headless Chromium typically returns “Google SwiftShader” or similar. real mobile Chrome returns the actual GPU name.
audio fingerprint: create an OfflineAudioContext, run a signal through it, and hash the output. small floating point differences between environments produce distinct hashes.
screen dimensions and touch support: consistent with mobile UA? a “mobile” request with no touch events and a 1920x1080 screen is suspicious.
timing patterns: how long does it take to process JS? how many animation frames pass between page load and first scroll? human interactions have characteristic timing distributions that bot scripts don’t replicate by default.

patching all of these is the goal of playwright-stealth. the library injects overrides for most of these properties before page scripts run. it’s not perfect against the most sophisticated detection (the audio fingerprint patch is detectable by some advanced systems), but it handles the majority of commercial bot protection systems.
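
the screen-and-touch check from the list above is the cheapest one to get right on your own side, before any stealth library is involved. a minimal sketch, assuming you describe each browser profile as a plain dict whose keys mirror the keyword arguments of Playwright’s new_context(). the function name and the "Mobile" substring test are illustrative assumptions, and real detectors check far more than this.

```python
def profile_inconsistencies(profile: dict) -> list[str]:
    """return human-readable mismatches between the UA and the device settings"""
    problems = []
    claims_mobile = "Mobile" in profile.get("user_agent", "")
    viewport = profile.get("viewport") or {"width": 0, "height": 0}

    if claims_mobile and not profile.get("has_touch", False):
        problems.append("mobile UA but touch support disabled")
    if claims_mobile and not profile.get("is_mobile", False):
        problems.append("mobile UA but is_mobile is off")
    if claims_mobile and viewport["width"] > viewport["height"]:
        problems.append("mobile UA with a landscape desktop-style viewport")
    return problems


MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36"
)

bad_profile = {"user_agent": MOBILE_UA, "viewport": {"width": 1920, "height": 1080}}
good_profile = {
    "user_agent": MOBILE_UA,
    "viewport": {"width": 412, "height": 915},
    "is_mobile": True,
    "has_touch": True,
}

print(profile_inconsistencies(bad_profile))
print(profile_inconsistencies(good_profile))
```

a profile that passes can be unpacked straight into browser.new_context(**good_profile, proxy=...).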

layer 5: behavioral scoring

some systems go beyond static fingerprints and score the sequence of actions on a session. a human landing on a product page typically: pauses 1-3 seconds, scrolls partway down, moves the mouse, pauses again, maybe scrolls back up to re-read something, then clicks. a scraper that immediately reads document.getElementById("price").textContent with no scroll or mouse events is behaviorally anomalous.

for targets using behavioral scoring, adding minimal human-like behavior to Playwright makes a difference:

import asyncio
import random

async def human_like_scroll(page) -> None:
    """simulate a human scrolling pattern"""
    # total scrollable height of the document, in pixels
    doc_height = await page.evaluate("document.body.scrollHeight")

    current_pos = 0
    while current_pos < doc_height * 0.6:  # scroll through ~60% of page
        scroll_amount = random.randint(200, 500)
        current_pos = min(current_pos + scroll_amount, doc_height)
        await page.evaluate(f"window.scrollTo(0, {current_pos})")
        await asyncio.sleep(random.uniform(0.4, 1.2))


async def human_like_mouse_move(page) -> None:
    """move mouse to a random point in small interpolated steps"""
    viewport = page.viewport_size
    if not viewport:
        return
    target_x = random.randint(50, viewport["width"] - 50)
    target_y = random.randint(100, viewport["height"] - 100)
    # steps > 1 emits intermediate mousemove events instead of a single jump
    await page.mouse.move(target_x, target_y, steps=random.randint(10, 25))
    await asyncio.sleep(random.uniform(0.2, 0.6))

add these between the page.goto() and your data extraction call. they’re low cost: the scroll adds a few seconds per page (one 0.4-1.2 second pause per scroll step) and the mouse move adds under a second. on sites that score behaviour, that is usually a good trade for a lower anomaly score.

scaling your scraping operation: from single port to multi-node

there’s a big difference between a script that scrapes 1,000 pages and a production system that handles 1 million pages per day reliably. the architecture changes significantly as you scale.

single port (up to ~10,000 pages/day)

one SMP port, one Python process, sequential or lightly concurrent requests, SQLite for storage. the retry/backoff pattern above is sufficient. this level doesn’t need a queue or orchestrator, a simple script with pagination and a checkpoint file to resume after failures is enough.

# simple checkpoint pattern for resumable scraping
import json
import os

CHECKPOINT_FILE = "scrape_checkpoint.json"

def load_checkpoint() -> set:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f))
    return set()

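def save_checkpoint(scraped_urls: set) -> None:
    # write to a temp file, then rename, so a crash mid-write
    # can't leave a half-written checkpoint behind
    tmp_path = CHECKPOINT_FILE + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(scraped_urls), f)
    os.replace(tmp_path, CHECKPOINT_FILE)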

async def resumable_scrape(url_list: list[str], proxy_url: str) -> list[dict]:
    scraped = load_checkpoint()
    remaining = [u for u in url_list if u not in scraped]
    results = []
    for url in remaining:
        result = await fetch_with_retry(url, proxy_url)
        if result:
            results.append({"url": url, "content": result.text})
            scraped.add(url)
            save_checkpoint(scraped)  # persist after each successful fetch
    return results
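
one gap in the pattern above: the checkpoint survives a crash but the results list does not, so a restart would skip URLs whose content was never written anywhere. a minimal sketch of one way to close that gap, appending each result to a JSON Lines file before the URL is marked done. the file name and helper names are hypothetical.

```python
import json
import os

RESULTS_FILE = "scrape_results.jsonl"  # hypothetical path


def append_result(record: dict, path: str = RESULTS_FILE) -> None:
    """append one record as a single JSON line and force it to disk"""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_results(path: str = RESULTS_FILE) -> list[dict]:
    """read back every record written so far"""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

call append_result({"url": url, "content": result.text}) just before scraped.add(url), so a URL is only checkpointed once its content is on disk.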

multi-port (10,000-500,000 pages/day)

multiple SMP ports, multiple workers, a queue (Redis, RabbitMQ, or a simple PostgreSQL-backed queue), PostgreSQL or ClickHouse for storage. assign one worker per port to avoid IP collision between workers. monitor per-port success rates and rotate out ports that are consistently underperforming.

import asyncio
from dataclasses import dataclass

@dataclass
class ProxyPort:
    url: str
    port_id: str
    success_count: int = 0
    failure_count: int = 0

    @property
    def success_rate(self) -> float:
        total = self.success_count + self.failure_count
        return self.success_count / total if total > 0 else 1.0

class PortPool:
    """manages a pool of SMP ports, routes work to healthy ports"""

    def __init__(self, ports: list[ProxyPort]):
        self.ports = ports
        self._index = 0

    def get_healthy_port(self) -> ProxyPort | None:
        """round-robin over ports with >80% success rate"""
        healthy = [p for p in self.ports if p.success_rate > 0.8]
        if not healthy:
            return None
        port = healthy[self._index % len(healthy)]
        self._index += 1
        return port

    def record_result(self, port_id: str, success: bool) -> None:
        for port in self.ports:
            if port.port_id == port_id:
                if success:
                    port.success_count += 1
                else:
                    port.failure_count += 1
                break
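
one limitation of the lifetime counters above: a port with thousands of past successes takes a long time to fall below 80% once it starts failing. a minimal sketch of a sliding-window alternative, assuming the last 50 results are a reasonable sample. the class name and window size are illustrative, not part of any SMP tooling.

```python
from collections import deque


class WindowedPortStats:
    """success rate over the most recent N results only"""

    def __init__(self, window: int = 50):
        self._results: deque[bool] = deque(maxlen=window)

    def record(self, success: bool) -> None:
        # once the deque is full, the oldest result falls off automatically
        self._results.append(success)

    @property
    def success_rate(self) -> float:
        if not self._results:
            return 1.0  # no data yet: treat as healthy, same as ProxyPort
        return sum(self._results) / len(self._results)
```

swapping this in means a port that goes bad is excluded after a few dozen failures, and is let back in once it recovers.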

distributed (500,000+ pages/day)

multiple machines, each running several workers, all connecting to a shared queue. use Celery with Redis or a cloud task queue (AWS SQS, Google Cloud Tasks). dedicate one machine to queue management and result aggregation. deploy to spot/preemptible instances to reduce cost.

at this scale, monitoring becomes critical. track per-port metrics in Prometheus + Grafana, or a simpler ClickHouse + Grafana setup. alert on: port success rate drop below 85%, queue depth growing faster than consumption rate, storage latency spikes.
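
the three alert conditions above can be sketched as a plain function before wiring them into Prometheus rules. the 85% threshold comes from the text. the field names and the 500 ms storage latency limit are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricsSnapshot:
    port_success_rates: dict[str, float]  # port_id -> success rate, 0.0-1.0
    enqueue_rate: float                   # URLs added to the queue per minute
    consume_rate: float                   # URLs processed per minute
    storage_p95_ms: float                 # 95th percentile insert latency


def active_alerts(m: MetricsSnapshot, latency_limit_ms: float = 500.0) -> list[str]:
    """return one message per alert condition that currently holds"""
    alerts = []
    for port_id, rate in sorted(m.port_success_rates.items()):
        if rate < 0.85:
            alerts.append(f"port {port_id} success rate {rate:.0%}")
    if m.enqueue_rate > m.consume_rate:
        alerts.append("queue growing faster than it is consumed")
    if m.storage_p95_ms > latency_limit_ms:
        alerts.append("storage latency spike")
    return alerts
```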

the key architectural principle at all scales: don’t mix concerns in the scraper. the scraper’s job is to fetch URLs and return raw responses. parsing, extraction, deduplication, and storage are separate processes. this makes it easy to re-extract data when a parser breaks without re-scraping.

Singapore-specific content: what you actually get with a Singapore mobile IP

it’s worth being concrete about what changes when you use a Singapore mobile proxy vs a generic proxy. I’ll go through the main use cases.

Shopee: Singapore’s mobile app version is different from the web version and from the regional versions. prices shown to Singapore mobile users include GST and may show different seller promotions. the mobile web at shopee.sg on a Singapore carrier IP serves the localised mobile layout with Singapore dollar pricing, local flash deals, and the local Coins cashback system. a US IP gets the same site with different promotional blocks.

Lazada: similar localisation. the “Everyday Low Price” and VoucherCode blocks differ by region and device type. the API endpoints for Lazada’s mobile app respond differently to mobile IPs than to datacenter IPs.

Google Singapore: google.com.sg with &gl=sg gives localised Singapore SERPs. the local pack (Google Maps listings), shopping results (which show locally available stock at Singapore prices), and AI overviews are all geo-specific. you need both the &gl=sg URL parameter and a Singapore IP to get fully localised results. with a mobile IP you additionally get mobile-specific SERP layout and features.

TikTok: the For You page algorithm factors in the viewer’s location for content distribution. viewing TikTok through a Singapore mobile IP gives you content weighted toward Singapore creators and trends, which is what you want for Singapore market research. datacenter IPs often trigger additional verification challenges on TikTok’s web interface.

Carousell: primarily a Singapore platform, anti-bot on the listing pages is moderate. a Singapore mobile IP gets the correct SGD pricing and local shipping options. the listing API (used by the app) is more accessible than the web scraper route for this platform.

bank and fintech sites: Singapore banks are heavily monitored and bot-protected. for legitimate competitive research (publicly visible rate tables, product information), Singapore mobile IPs are the only practical option that doesn’t immediately trigger the WAF. this is not a grey area, legitimate use only.
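
for the Google Singapore case above, a minimal sketch of building the localised URL with the standard library, assuming the usual /search path and q parameter. the function name is hypothetical, and the Singapore IP still has to come from the proxy.

```python
from urllib.parse import urlencode


def sg_serp_url(query: str) -> str:
    """build a google.com.sg search URL pinned to Singapore with gl=sg"""
    params = {"q": query, "gl": "sg"}
    return "https://www.google.com.sg/search?" + urlencode(params)


print(sg_serp_url("mobile proxy"))
```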

error code reference for scraping

these are the HTTP error codes you’ll encounter scraping with proxies, with specific causes and responses.

code	name	common scraping cause	typical fix
200	OK	success	extract data
301/302	redirect	site moved, HTTP to HTTPS	ensure `follow_redirects=True`
400	Bad Request	malformed URL or missing required parameter	check URL construction
401	Unauthorized	authentication required	not a public page, stop
403	Forbidden	IP blocked or WAF challenge	check headers, rotate IP, reduce rate
404	Not Found	product removed, URL changed	mark as gone, don’t retry
407	Proxy Authentication Required	wrong proxy credentials	check username/password
408	Request Timeout	slow page or overloaded target	retry with longer timeout
429	Too Many Requests	rate limit hit	rotate IP, wait Retry-After, reduce rate
500	Internal Server Error	target server error	retry with backoff
502	Bad Gateway	target CDN/load balancer issue	retry with backoff
503	Service Unavailable	target overloaded or maintenance	retry later
504	Gateway Timeout	target origin timed out	retry with backoff
520-527	Cloudflare errors	various Cloudflare-specific errors	520=origin error, 521=origin refused connection, 522=connection timed out, 523=origin unreachable, 524=origin timeout

Cloudflare’s custom error codes (520-527) are worth knowing because Cloudflare is on so many Singapore e-commerce sites. a 520 usually means the origin server returned an unexpected response to Cloudflare, often a temporary condition worth retrying. a 521 means the origin refused the connection. a 524 is an origin timeout that Cloudflare waited out. all of these are target-side issues, not proxy issues.

distinguishing proxy errors from target errors: if you’re seeing timeouts or connection refused errors, first confirm whether the issue is your proxy or the target. test the target directly (curl without proxy) from a machine on a clean IP. if the direct request works and the proxied request fails, the proxy has an issue. if both fail, the target is down or blocking all traffic.
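
the table above can be folded into a small dispatch function so the scraper reacts the same way every time. a minimal sketch: the action names are made up for this example, and the grouping is one reasonable reading of the table, not the only one.

```python
def action_for_status(status: int) -> str:
    """map an HTTP status to: extract, follow, stop, check_proxy, rotate, retry, inspect"""
    if status == 200:
        return "extract"
    if status in (301, 302):
        return "follow"       # only seen if the client isn't following redirects
    if status in (401, 404):
        return "stop"         # not public, or gone: retrying won't help
    if status == 407:
        return "check_proxy"  # proxy credentials, not the target
    if status in (403, 429):
        return "rotate"       # block or rate limit: new IP and a lower rate
    if status == 408 or 500 <= status <= 527:
        return "retry"        # transient target-side errors, incl. Cloudflare 52x
    return "inspect"          # anything else (e.g. 400) needs a human look


for code in (200, 403, 404, 407, 522):
    print(code, action_for_status(code))
```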

IP rotation strategies: when and how to rotate

there are four common rotation strategies for scraping, and they suit different scenarios.

time-based rotation (rotate every N minutes): simple, predictable. good for jobs where each request is independent and you want to spread fingerprints over many IPs without managing state. downside: you may rotate mid-session if a session takes longer than expected.

request-count rotation (rotate every N requests): good when you know each session has a fixed number of pages. rotate after completing one product category, not in the middle of it.

on-demand rotation (rotate via API call): the most flexible. trigger a rotation whenever you detect a 429 or anomalous response. SMP exposes a rotation endpoint on each port that triggers an immediate IP change. this is what the retry code above uses.

job-boundary rotation (rotate between logical work units): rotate between keywords, categories, or product groups. each logical unit gets a consistent IP, which is the safest approach for progressive trust scoring. the downside is that rotation frequency depends on job unit size.

import asyncio
import random

async def scrape_categories_with_rotation(
    categories: list[dict],
    proxy_url: str,
    rotation_endpoint: str,
) -> list[dict]:
    """
    scrapes each category with a consistent IP, rotates between categories.
    categories = [{"name": "phones", "url": "...", "pages": 5}, ...]
    rotate_proxy_ip, fetch_with_retry, and parse_product_listing are
    assumed to be defined already.
    """
    all_results = []
    for category in categories:
        # rotate IP before starting each category
        await rotate_proxy_ip(rotation_endpoint)
        await asyncio.sleep(5)  # brief pause for new IP to settle

        category_results = []
        for page in range(1, category["pages"] + 1):
            url = f"{category['url']}?page={page}"
            resp = await fetch_with_retry(url, proxy_url)
            if resp:
                category_results.extend(parse_product_listing(resp.text))
            # random 3-5 second pause between pages
            await asyncio.sleep(random.uniform(3, 5))

        all_results.append({"category": category["name"], "products": category_results})

    return all_results

proxy authentication: IP allowlist vs username/password

SMP supports two authentication methods and the choice matters for how you deploy.

username/password auth: credentials are passed in the proxy URL (http://user:pass@host:port). this works from any IP, so you can run workers from cloud servers, local machines, or anywhere without whitelisting. the credential is in your code or environment variables, so treat it like a password: don’t commit it to git, use environment variables or a secrets manager.

import os

proxy_url = (
    f"http://{os.environ['SMP_USER']}:{os.environ['SMP_PASS']}"
    f"@sg1.singaporemobileproxy.com:{os.environ['SMP_PORT']}"
)

IP allowlist auth: you whitelist the IP(s) your scraping machines use in the SMP dashboard. no credentials in the URL, the port is open to your allowlisted IPs. simpler for single-machine setups, but breaks immediately if your scraping machine’s IP changes (which happens on most home connections and many cloud providers). for fixed cloud servers with static IPs, allowlist auth is clean and avoids credential exposure.

for multi-machine setups where machines have dynamic IPs, username/password auth is more practical.
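
one practical detail with username/password auth: a password containing @, : or / breaks the proxy URL unless it is percent-encoded. a minimal sketch using the standard library. the function name is hypothetical, and the default host is the one used in the examples in this guide.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, port: int,
                    host: str = "sg1.singaporemobileproxy.com") -> str:
    """percent-encode credentials so reserved characters can't break the URL"""
    safe_user = quote(user, safe="")
    safe_pass = quote(password, safe="")
    return f"http://{safe_user}:{safe_pass}@{host}:{port}"


print(build_proxy_url("user", "p@ss:w/rd", 45001))
```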

working with JavaScript-heavy pages in detail

I want to go deeper on Playwright because most serious scraping targets require it, and the configuration choices significantly affect both success rate and cost.

when to use Playwright vs httpx: the decision point is whether the data you need is present in the initial HTTP response or loaded by JavaScript after the page renders. the quick test:

# fetch raw HTML and check if your target data is present
curl -x "http://user:pass@sg1.singaporemobileproxy.com:45001" \
  "https://target-site.sg/product/123" | grep -i "target-price"

if the grep returns the price, you don’t need Playwright. if it returns nothing but you can see the price in a real browser, the price is JavaScript-rendered and you need Playwright.
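
the same test in Python, for when you want to run it across a sample of URLs and not just one. a minimal sketch: the function only inspects HTML you have already fetched, and the marker strings are whatever identifies your target data. the two sample pages below are made up.

```python
def data_in_static_html(html: str, markers: list[str]) -> bool:
    """True if any marker appears in the raw HTML, case-insensitively (like grep -i)"""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in markers)


static_page = '<span class="target-price">S$12.90</span>'
js_shell = '<div id="root"></div><script src="/app.js"></script>'

print(data_in_static_html(static_page, ["target-price"]))  # data is in the HTML
print(data_in_static_html(js_shell, ["target-price"]))     # needs a browser
```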

playwright performance optimisations: Playwright is slower and more memory-intensive than httpx. for large-scale jobs, these settings help:

context = await browser.new_context(proxy=proxy_config)
page = await context.new_page()

# new_context() has no switch for skipping images or fonts,
# so abort those requests with route interception instead.
# blocking them reduces bandwidth 60-70%.
# be careful: some sites detect resource blocking as a bot signal
async def block_heavy_resources(route):
    if route.request.resource_type in ("image", "font", "media"):
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", block_heavy_resources)

blocking images and fonts reduces bandwidth per page by 60-70% for typical e-commerce sites. the tradeoff: some sites check that resource requests arrive in the expected pattern, and blocking all images can be a bot signal. test with your specific target before enabling this in production.

waiting for specific content: wait_until="networkidle" is the common choice but it’s slow (waits for all network activity to stop, which can take 3-8 seconds on heavy pages). if you know which element contains your target data, wait for that specific element instead:

# wait for the specific price element to appear, not for the full page to settle
await page.wait_for_selector(".product-price", timeout=10000)
price_text = await page.text_content(".product-price")

this can cut per-page time by 50-70% compared to networkidle waiting.

storing and managing scraped data

the right storage choice depends on your volume, query patterns, and downstream use.

SQLite (up to ~1 million rows, single machine): zero setup, good for development and small jobs. Python’s built-in sqlite3 module. the limitation is write concurrency (one writer at a time) and no good story for distributed access.

PostgreSQL (1 million to 100 million rows, multi-machine): the standard choice for production scraping operations. good tooling, supports JSONB for flexible schema, can handle the write load of most scraping operations. run on the same server as your scrapers if possible to avoid network latency on inserts.

ClickHouse (100 million+ rows, analytics): columnar storage, extremely fast for aggregations and time-series queries. ideal if your downstream use is analytics dashboards (average price over time, availability trends). insert in batches of at least 1,000 rows for efficiency.
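
the batching advice for ClickHouse can be sketched as a small generator that sits between the scraper output and whatever insert call your client library provides. the function name is hypothetical. only the 1,000-row figure comes from the text.

```python
from collections.abc import Iterable, Iterator


def batched_rows(rows: Iterable[dict], batch_size: int = 1000) -> Iterator[list[dict]]:
    """yield lists of up to batch_size rows; the final batch may be smaller"""
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the remainder so no rows are dropped


sizes = [len(b) for b in batched_rows(({"i": i} for i in range(2500)))]
print(sizes)
```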

data retention: most price monitoring use cases need 6-12 months of history. plan your storage accordingly. a single SKU’s price history over a year, checked daily, is 365 rows. for 50,000 SKUs that’s 18.25 million rows, well within PostgreSQL’s comfortable range.
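
the arithmetic above generalises to other check frequencies, which is worth running before you pick a database. a minimal sketch with a hypothetical helper name:

```python
def history_rows(skus: int, checks_per_day: int = 1, days: int = 365) -> int:
    """rows needed to keep `days` of history at a given check frequency"""
    return skus * checks_per_day * days


print(history_rows(1))           # one SKU, daily, one year
print(history_rows(50_000))      # the 50,000-SKU case from the text
print(history_rows(50_000, 24))  # same catalogue checked hourly
```

the hourly case comes to 438 million rows a year, which is past the 100 million mark where this guide suggests ClickHouse.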

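deduplication: always insert with an upsert or deduplication check. scrapers running continuously will re-scrape pages they’ve already processed if the checkpoint logic fails. a unique index on (sku, DATE(scraped_at)) prevents duplicate entries for the same SKU on the same day, and the ON CONFLICT clause in the upsert needs that index to exist. PostgreSQL only allows immutable expressions in an index, so this works when scraped_at is a timestamp without time zone; with timestamptz, store the date in its own column and index that instead.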

-- PostgreSQL upsert pattern for price monitoring
INSERT INTO price_history (sku, price, stock, scraped_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (sku, DATE(scraped_at))
DO UPDATE SET
    price = EXCLUDED.price,
    stock = EXCLUDED.stock,
    scraped_at = EXCLUDED.scraped_at
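-- IS DISTINCT FROM also catches changes to or from NULL, which != would miss
WHERE price_history.price IS DISTINCT FROM EXCLUDED.price
   OR price_history.stock IS DISTINCT FROM EXCLUDED.stock;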

the WHERE clause on the DO UPDATE makes the upsert a no-op if nothing changed, which reduces write amplification and makes it easier to detect when prices actually changed.
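
to see the no-op behaviour without a PostgreSQL server, here is a minimal runnable sketch of the same pattern using Python’s built-in sqlite3 (it needs SQLite 3.24 or newer for ON CONFLICT ... DO UPDATE). it is an adaptation, not the PostgreSQL statement itself: the date lives in its own scraped_date column, IS NOT is SQLite’s null-safe inequality, and the SKU and timestamps are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_history (
        sku TEXT NOT NULL,
        price REAL,
        stock INTEGER,
        scraped_date TEXT NOT NULL,  -- YYYY-MM-DD, the dedup key
        scraped_at TEXT NOT NULL,
        UNIQUE (sku, scraped_date)
    )
""")

UPSERT = """
    INSERT INTO price_history (sku, price, stock, scraped_date, scraped_at)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (sku, scraped_date) DO UPDATE SET
        price = excluded.price,
        stock = excluded.stock,
        scraped_at = excluded.scraped_at
    WHERE price_history.price IS NOT excluded.price
       OR price_history.stock IS NOT excluded.stock
"""


def record_price(sku: str, price: float, stock: int, scraped_at: str) -> None:
    # the first 10 characters of an ISO timestamp are the date
    conn.execute(UPSERT, (sku, price, stock, scraped_at[:10], scraped_at))
    conn.commit()


record_price("SKU-1", 10.0, 5, "2024-05-01T08:00:00")
record_price("SKU-1", 10.0, 5, "2024-05-01T14:00:00")  # unchanged: row left alone
record_price("SKU-1", 12.0, 5, "2024-05-01T20:00:00")  # price changed: row updated

print(conn.execute("SELECT * FROM price_history").fetchall())
```

because the unchanged scrape leaves scraped_at alone, that column doubles as the time of the last observed change for the day.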

getting started with SMP

the fastest path to a running scrape with SMP:

start a free trial. you get a real port with Singapore carrier IPs. no credit card required to trial.
find your port credentials in the client area. you’ll have a host, port number, username, and password.
test the proxy with the curl command above (point it at https://ipinfo.io/json). verify you see a Singapore carrier IP.
run a single test request against your actual target with mobile Chrome headers. check the status code and verify the response contains the data you need.
if the data requires JavaScript: install Playwright (pip install playwright && playwright install chromium) and test with the Playwright snippet above.
build your full scraper with the retry logic, session management, and storage patterns from this guide.
see SMP plans for production port options, or check the proxy pool for current availability.
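
the verification in the third step can be automated once you have the ipinfo.io/json response parsed into a dict. a minimal sketch: the function name is hypothetical, the carrier hints are the three networks named at the top of this guide, and the actual org strings may use different legal names, so adjust the hints to what you observe. the sample payloads are made up.

```python
def looks_like_sg_mobile(
    info: dict,
    carrier_hints: tuple[str, ...] = ("singtel", "m1", "starhub"),
) -> bool:
    """check an ipinfo.io/json-style payload for a Singapore carrier IP"""
    if info.get("country") != "SG":
        return False
    org = info.get("org", "").lower()
    return any(hint in org for hint in carrier_hints)


# made-up payloads in the shape ipinfo.io returns
sg_sample = {"ip": "203.0.113.7", "country": "SG", "org": "AS0000 Example Singtel Mobile"}
us_sample = {"ip": "198.51.100.9", "country": "US", "org": "AS0000 Example Hosting"}

print(looks_like_sg_mobile(sg_sample))
print(looks_like_sg_mobile(us_sample))
```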

for Southeast Asian platform-specific scraping (TikTok, Shopee, Lazada), check TikTok, Shopee, Lazada Singapore for platform-specific notes that complement this guide.