EP actual statistic analysis

  • oceanwave
    Member
    Forum Explorer
    • Dec 2024
    • 36

    #1

    EP actual statistic analysis

    Problem: bad AI scraper bots are causing some serious load spikes. The traffic comes from residential proxies, with only a few requests per IP.

    I'm having difficulty using the LVE EP stat to show the load spike situation.

    For one account I have EP limit set to 20.

    If I view the statistics graphs for "Yesterday" on either an hourly or per-minute basis, the Entry Processes graph shows the green line staying well below the red 20-limit line.

    However, in the table, if I set it to hourly, the EP F (fault) column shows 4210 for one hour, while other hours show EP F values of 0, 11, 44, 63, 183, 332, 0, 107, 1198, and so on...

    On an hourly basis, the A column seems to show an average for that hour, since the values are all 1 to 2.

    The Entry Processes graph doesn't seem to plot the green line accurately when set to the minute duration - shouldn't it be showing spikes over the L=20 red line when the table shows F=4210 was hit?

    It would be nice to be able to get a better picture of the situation, other than scrolling the per minute table.
  • bogdan.sh
    Administrator
    • Nov 2016
    • 1289

    #2
    Hi,

    The behavior you're seeing is expected and there's a useful interpretation hiding in it.

    EP (Entry Processes) is a concurrent limit, not a rate limit. When the 21st simultaneous entry process arrives, LVE doesn't let it run - it rejects it immediately and increments the F (faults) counter. The rejected processes never become "current EPs", so they can't appear in the EP usage line. That's why the green line stays under 20 even when thousands of faults occur in the same hour: the limit is doing its job and shedding the overflow.
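
    To make the concurrency-vs-rate distinction concrete, here is a toy Python sketch (not LVE's actual code) of a concurrent limit that sheds overflow - arrivals beyond the limit become faults and never touch the usage line:

```python
# Toy model of a concurrent limit (what EP is), as opposed to a rate limit.
# An arrival that finds `current` already at the limit is rejected on the
# spot and counted as a fault; it never becomes a "current EP", so it can
# never raise the usage line.

def simulate(events, limit):
    """events: '+' for a request arriving, '-' for a running one finishing."""
    current = peak = faults = 0
    for ev in events:
        if ev == '+':
            if current < limit:
                current += 1
                peak = max(peak, current)
            else:
                faults += 1  # rejected immediately -> one F, zero EP usage
        elif ev == '-':
            current = max(0, current - 1)
    return peak, faults

# A burst of 30 simultaneous arrivals against limit 20:
peak, faults = simulate('+' * 30, 20)
print(peak, faults)  # -> 20 10 (peak never exceeds the limit; overflow faults)
```

    Note how the peak reported by the usage side caps out exactly at the limit, which is why the graph can look calm while F climbs.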

    Also, each chart point is an aggregated value over its period (avg/max of the bucket). With 1-second sampling, a burst lasting a few hundred milliseconds barely moves the average - but every request that didn't fit during that burst is one F.
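
    A quick arithmetic sketch of why averaging hides bursts (the numbers are illustrative, not from your server):

```python
# One minute of per-second EP samples: a 2-second burst at the limit (20),
# near-idle otherwise. Illustrative numbers only.
samples = [20, 20] + [1] * 58

avg = sum(samples) / len(samples)   # what an "average" chart point plots
peak = max(samples)                 # what actually happened in the bucket

print(round(avg, 2), peak)  # -> 1.63 20
```

    An average-based chart point for that minute sits at ~1.6 even though the limit was pinned (and overflow faulted) during the burst.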

    So in your case the F column is the truer signal of what's happening, not the EP line. F = 4210/hour means the account was getting hammered with concurrent requests fast enough that LVE rejected ~70/minute on average during that hour.

    Better tools than scrolling the per-minute table

    lveinfo CLI - much more flexible than the UI:


    Code:
    lveinfo --period=1d --by-fault=ep --user=<user> --show-all
    Code:
    lveinfo --period=1h --from='YYYY-MM-DD HH:MM' --to='YYYY-MM-DD HH:MM' --user=<user>
    Sort by EP faults, filter to a specific window, dump CSV with --csv or --json for further analysis.
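
    Once you have a CSV export, a few lines of Python will rank the worst windows for you. The column names below ("From", "aEP", "EPf") are assumptions - check the header line of your actual export and adjust:

```python
import csv
import io

# Hypothetical lveinfo --csv output; the header names ("From", "aEP",
# "EPf") are placeholders -- match them to your real export.
sample = """From,aEP,EPf
2025-01-01 03:00,1,4210
2025-01-01 04:00,2,1198
2025-01-01 05:00,1,0
"""

def top_fault_windows(text, n=2):
    """Return the n time windows with the most EP faults, worst first."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: int(r["EPf"]), reverse=True)
    return [(r["From"], int(r["EPf"])) for r in rows[:n]]

print(top_fault_windows(sample))
# -> [('2025-01-01 03:00', 4210), ('2025-01-01 04:00', 1198)]
```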

    lvetop for live view while it's happening — you'll see the bursts in real time and can correlate to web server logs.

    The raw lvestats DB at /var/lve/lvestats5/lvestats.db (SQLite) holds 1-second / 5-second granularity samples; you can query it directly if you want minute-by-minute heatmaps or hourly histograms.
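
    If you do query the DB, inspect the schema first (sqlite3 /var/lve/lvestats5/lvestats.db ".schema") - it varies between lvestats versions. The table and column names in this sketch ("samples", "ts", "ep_fault") are placeholders, demonstrated against an in-memory toy DB:

```python
import sqlite3

# Sketch only: real lvestats5 table/column names differ by version; the
# names here ("samples", "ts", "ep_fault") are placeholders. We build an
# in-memory toy DB with a few 1-second samples (ts = seconds, ep_fault =
# faults recorded in that sample).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (ts INTEGER, ep_fault INTEGER)")
con.executemany("INSERT INTO samples VALUES (?, ?)",
                [(0, 0), (30, 5), (70, 7), (90, 1)])

# Bucket the samples into minutes and sum EP faults per minute - the
# per-minute heatmap data the UI won't hand you directly.
rows = con.execute(
    "SELECT ts / 60 AS minute, SUM(ep_fault) FROM samples "
    "GROUP BY minute ORDER BY minute").fetchall()
print(rows)  # -> [(0, 5), (1, 8)]
```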

    Cross-reference with the web server access log for the matching minute - a sudden spike of requests from many residential IPs to a small set of URLs (search, product listings, cart, etc.) is the classic AI-scraper signature.


    Raising the EP limit will let more of them through, not fewer - that's not the lever you want. Three more useful directions:
    • Imunify360 bot protection (if you have it) — blocks AI scraper user-agents and challenges suspicious traffic before it ever reaches PHP, so EP doesn't get touched.
    • Caching / a CDN in front — scraper traffic on cached URLs costs you nothing.
    • At the web server level, a robots.txt plus rate-limit/challenge rules on the heavy endpoints help even when the IPs rotate, because scrapers tend to hit the same URL patterns.
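
    As an illustration of the last point, an nginx limit_req rule on a heavy endpoint looks like this (the zone name, path, and rates below are placeholders, not a recommendation):

```nginx
# In the http{} context. Keyed per client IP here; keying the zone on
# $request_uri instead caps the endpoint overall, which helps when the
# scraper rotates IPs but hammers the same URL pattern.
limit_req_zone $binary_remote_addr zone=botheavy:10m rate=5r/s;

server {
    listen 80;

    # Throttle only the endpoints scrapers pound (search, listings, ...)
    location /search {
        limit_req zone=botheavy burst=10 nodelay;
        limit_req_status 429;  # excess requests get 429 instead of 503
    }
}
```

    The point is that the excess gets shed before it ever spawns a PHP process, so EP never sees it.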

    The fact that EP is firing rather than CPU/IO is actually good news - your account is being protected and other tenants on the box aren't impacted. The work is in identifying which URLs the scrapers are pounding and short-circuiting them earlier in the request path.

    Hope this helps.

    Comment

    • oceanwave
      Member
      Forum Explorer
      • Dec 2024
      • 36

      #3
      Thanks very much -- I'm going to read your reply a few times. With your explanation, it's starting to make sense: when bad-bot requests exceed L, F increments rapidly instead of those requests adding to A, the actual EP (average per time unit), if I understand it correctly now.

      When I tried decreasing EP, I started getting a few 508 errors, so I had been trying to use the A stat/graph to figure out where to set L, which doesn't work at all.

      It seems the only way is to adjust L and then watch the results, while possibly producing some unnecessary 508 errors for real users.

      If the AI bots continue to get worse, a tool to suggest L values based on an analysis of actual daily traffic could be very useful.

      I'm concerned about returning inadvertent 508 errors to real users or to legitimate search engine spiders, which could cause a loss of indexing, so I'm trying to set L as high as possible while still keeping it low enough to stop the bots from overwhelming the server.

      Comment

      • bogdan.sh
        Administrator
        • Nov 2016
        • 1289

        #4
        You know, it's always a balancing game: how to keep limits high for users without overloading the server in edge cases.

        Comment

        • oceanwave
          Member
          Forum Explorer
          • Dec 2024
          • 36

          #5
          It feels very different this year -- these AI bots suddenly seem coded to behave as badly as possible. In the past, the companies with the resources to cause a DoS attack acted responsibly because they didn't want a bad reputation, plus there was a network hierarchy... now, with residential proxies and randomized user agents, it's something else. They just don't care if they're ticking off or inconveniencing real humans.

          As I'm browsing various other websites myself, I'm also getting frustrated by the number of cloudflare "please wait while we assess the security of your connection" delay screens.

          So the bad bots are hitting from all sides suddenly.

          I spent a few hours fine-tuning the L value; we'll see how it performs tomorrow morning compared to tonight. A tool that analyzes a day's or week's logs and computes a recommended L value -- minimizing false positives while chopping off the bot spikes -- would be cool in the future.

          Comment
