ℍ𝕂-𝟞𝟝

  • 1 Post
  • 218 Comments
Joined 1 year ago
Cake day: July 14th, 2024



  • I’ve a slight manageable case of ADHD and I tend to obsessively hyperfocus on tasks. It’s a good relationship because I get a lot of shit done well, and enjoy my work.

    If you start forcing me to plan out my day every day, down to 15-minute increments, my productivity drops by around 60%, because I stop concentrating on getting shit done and start working to rule. Not because I’m vindictive, but because that’s what you asked me to do.





  • MongoDB is huge, though for all the wrong reasons: businesses think that just because it’s JS, they can just have frontend devs - sorry, they are “fullstack” now - doing DBA work.

    I worked as one of two NoSQL DBAs for a Fortune 50 finance company, and there is a ton of CV-driven development going on, which gives NoSQL a bad name. Most use cases don’t need NoSQL. And for those which do, NoSQL is almost always harder to implement than a simple SQL-based RDBMS.






  • Yeah, but it doesn’t matter what the objective of the scraper is. The only thing that matters is that it’s an automated client that is going to send mass requests to you. If it wasn’t, Anubis would not be a problem for it.

    The effect is the same: increased hosting costs and less access for legitimate clients. And sites want to defend against it.

    That said, it is not mandatory. As a host, you can simply not use Anubis; nobody is forcing you to. And as someone who regularly gets locked out of services because I use a VPN, I find Anubis one of the least intrusive protection methods out there.





  • “AI does not triple traffic. It’s a completely irrational statement to make.”

    Multiple testimonials from people who host sites say it does. Multiple Lemmy instances have also supported this claim.

    “I would bet that the number of requests per year of a resource by an AI scrapper is on the dozens at most.”

    You obviously don’t know much about hosting a public server. Try dozens per second.

    There is a booming startup industry all over the world training AI, and scraping data to sell to companies training AI. It’s not just Microsoft, Facebook and Twitter doing it, but also Chinese companies trying to compete. Also companies not developing public models, but models for internal use. They all use public cloud IPs, so the traffic is coming from all over incessantly.

    “Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?”

    It means that when Microsoft buys a server for scraping, they are going to be running it 24/7, with the CPU/network maxed out, at maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites. If it can scrape 1000, it will scrape 1000, and if it can do 10, it will do 10.

    It will not stop scraping ever, as it is the equivalent of shutting down a production line. Everyone always uses their scrapers as much as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more “smart” and less “hard” at scraping and training AI.

    Oh, and it’s S-C-R-A-P-I-N-G, not scrapping. It comes from the word “scrape”, meaning to remove the surface from an object using a sharp instrument, not “scrap”, which means to take something apart for its components.
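
    To put the fixed-budget argument above into numbers, here is a minimal sketch. The function name and every figure in it are made up for illustration; the only point is that a scraper that always runs flat out spends the same compute either way, so a higher cost per site means fewer sites, not more energy.

```python
def sites_scraped(budget_cpu_seconds: float,
                  fetch_cost: float,
                  challenge_cost: float = 0.0) -> int:
    """Sites a scraper gets through with a fixed compute budget.

    All parameters are hypothetical: the budget is what the server can
    burn in a given period, and the two costs are CPU seconds per site.
    """
    return int(budget_cpu_seconds // (fetch_cost + challenge_cost))


# One CPU-hour, always fully used (illustrative numbers only).
budget = 3600.0
without_challenge = sites_scraped(budget, fetch_cost=0.5)
with_challenge = sites_scraped(budget, fetch_cost=0.5, challenge_cost=1.5)

# Same budget burned in both cases; only the number of sites changes.
print(without_challenge, with_challenge)
```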



  • Websites were under a constant background noise of malicious requests even before AI, but now AI scraping of Lemmy instances usually triples traffic. While some sites can cope with this, it means a three-fold increase in hosting costs in order to essentially fuel investment portfolios.

    AI scrapers will already use as much energy as is available to them, so making them use more per site means fewer sites being scraped, not more total energy used.

    And this is not a DDoS: the objective of scrapers is to get the data, not to bring the site down. So while the server must still reply to all requests, the clients can’t get the data out without doing more work than the server.
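
    The “more work than the server” part can be sketched with a generic hash-based proof of work. This is not Anubis’s actual code, just an illustration of the asymmetry under my own assumptions: the challenge string, the difficulty value and the function names are all made up. The client has to try many nonces, while the server checks the answer with a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one SHA-256 call, regardless of difficulty."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> tuple[int, int]:
    """Client side: brute-force nonces until one passes.

    Returns the nonce and the number of hashes it took to find it
    (about 2**difficulty on average).
    """
    for attempts, nonce in enumerate(itertools.count(), start=1):
        if verify(challenge, nonce, difficulty):
            return nonce, attempts


# Hypothetical challenge and difficulty, chosen only so this runs quickly.
nonce, attempts = solve(b"example-challenge", difficulty=12)
print(f"client hashed {attempts} times, server verifies with 1 hash")
```

    A single visitor pays this once and barely notices. A scraper hitting thousands of pages pays it over and over, which is where the cost lands.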