This blog post has been sitting in my drafts folder for a long time. It’s time I finished it. A while ago, I did some work for Vidhi, scraping the Supreme Court of India website. Later on, I started work on scraping a couple of High Courts as well. Here are a few quick lessons from my experience:
- Remember to be a good citizen. Scrape with a delay between each request and a descriptive user-agent. This won’t always be enough, but as far as possible, make it easy for the site operators to figure out you’re scraping (a minimal sketch follows this list).
- ASP-based websites are difficult to scrape. A bunch of Indian court websites are built on ASP, and you can’t submit their forms without JavaScript. I couldn’t get PhantomJS or any of those libraries to work either. If you can get them working, please talk to me! Sandeep has taken over from me and I’m sure he’ll find it useful. (The usual non-browser workaround is sketched after this list.)
- Data is inconsistently inconsistent. This is a problem. You can make no assumptions about the data while scraping. The best you can do is collect everything and find patterns later. For example, a judge’s name may be written in different ways from case to case; you can normalize the names later (see the sketch after this list).
- These sites aren’t highly available, so plan for retrying and backing off in your code (a sketch follows the list). In fact, I’d recommend running the scraper overnight, and never between 8 am and 12 pm.
- Assume failure. Be ready for it. The first few times you run the code, keep a close watch: it will fail in many different ways, and you should be ready to add yet another `except` clause :)
- Get more data than you need, because re-scraping will cost time. (A sketch combining this point and the previous one follows the list.)
- Gujarat High Court has a JavaScript-based frontend. There’s an XHR endpoint that returns JSON. It’s the only site I’ve scraped that had a pleasant developer experience (a final sketch below shows how little code it takes).
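
On being a good citizen: here’s a minimal sketch of what that looks like in practice. The URL and contact address are placeholders, not real court endpoints.

```python
import time

import requests

# Placeholder URL; the real court endpoints are not reproduced here.
BASE_URL = "https://example-court.gov.in/case-status"

HEADERS = {
    # Identify your scraper and a way to reach you, so site operators
    # can get in touch instead of blocking you blindly.
    "User-Agent": "court-scraper/1.0 (contact: you@example.org)",
}

def fetch(url: str, delay: float = 2.0) -> requests.Response:
    """Fetch a page, then sleep so requests stay politely spaced out."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    time.sleep(delay)
    return response
```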
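
On the ASP problem: the usual workaround, assuming the site is classic ASP.NET WebForms, is to skip the headless browser entirely. Fetch the page once, copy the server-generated hidden fields (`__VIEWSTATE` and friends), and replay the POST yourself. The URL and the visible field names below are made up for illustration; only the hidden-field trick is the point.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical search page, for illustration only.
FORM_URL = "https://example-court.gov.in/CaseSearch.aspx"

session = requests.Session()

# GET the form first to pick up the server-generated state fields
# (__VIEWSTATE, __EVENTVALIDATION, etc.).
soup = BeautifulSoup(session.get(FORM_URL, timeout=30).text, "html.parser")
payload = {
    field["name"]: field.get("value", "")
    for field in soup.select("input[type=hidden]")
    if field.get("name")
}

# Add the visible form fields; these names are invented for the sketch.
payload["txtCaseNumber"] = "1234"
payload["btnSearch"] = "Search"

result = session.post(FORM_URL, data=payload, timeout=30)
print(result.status_code)
```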
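
On normalizing judge names: something like this is a reasonable starting point. The honorific list is my assumption, not an exhaustive rule set for Indian court data.

```python
import re

# Common honorifics seen in Indian court listings; extend as you find more.
HONORIFICS = re.compile(
    r"^(HON'?BLE\s+)?(MR\.?|MS\.?|MRS\.?|DR\.?|JUSTICE|SHRI|SMT\.?)\s+",
    re.IGNORECASE,
)

def normalize_judge(raw: str) -> str:
    """Collapse the many spellings of a judge's name into one canonical key."""
    name = re.sub(r"\s+", " ", raw.strip().upper())
    while HONORIFICS.match(name):
        name = HONORIFICS.sub("", name, count=1)  # strip one honorific at a time
    return name.rstrip(" ,.")

# "Hon'ble Mr. Justice A. B. Kumar" and "JUSTICE A. B. KUMAR"
# both normalize to "A. B. KUMAR".
```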
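
On retrying: simple exponential backoff with jitter goes a long way. This is a generic sketch, not the exact code I ran.

```python
import random
import time

import requests

def fetch_with_retry(url: str, attempts: int = 5, base_delay: float = 5.0):
    """Retry flaky sites with exponential backoff plus a little jitter."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller deal with it
            # Sleep 5s, 10s, 20s, ... with jitter so retries don't synchronize.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 2))
```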
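
On assuming failure and keeping more than you need, one sketch covers both: persist the raw HTML before parsing, and treat every parse failure as the next `except` clause waiting to be written. `parse_case` and the `raw_pages` layout are hypothetical, named only for illustration.

```python
import hashlib
import logging
from pathlib import Path

RAW_DIR = Path("raw_pages")  # assumed layout; adjust to taste
RAW_DIR.mkdir(exist_ok=True)

def parse_case(html: str) -> None:
    """Hypothetical parser for one case page; the real logic goes here."""

def process(url: str, html: str) -> None:
    # Persist the raw page first: parsing can be redone, fetching can't.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    (RAW_DIR / name).write_text(html, encoding="utf-8")
    try:
        parse_case(html)
    except Exception:
        # A failure mode you haven't seen yet: log it, keep scraping,
        # fix the parser later against the saved raw page.
        logging.exception("parse failed for %s", url)
```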
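
And for contrast, scraping a JSON endpoint like Gujarat High Court’s is about this boring, which is exactly what you want. The endpoint path and parameter here are placeholders, not the real API.

```python
import requests

# Placeholder endpoint and parameter, for illustration only.
ENDPOINT = "https://example-hc.gov.in/api/case-status"

data = requests.get(ENDPOINT, params={"case_no": "1234"}, timeout=30).json()
print(data)
```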