Batch download HN comments you've upvoted

While this post describes how to bulk download comments you've upvoted on Hacker News, the process is virtually identical for upvoted submissions - the URL format is just slightly different, e.g., https://news.ycombinator.com/upvoted?id=miles&p=2 (though there are apparently better ways of downloading upvoted submissions - see the Related section below).
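
For example, with the session cookie obtained in step 1 below, a single page of upvoted submissions can be fetched like so (a minimal sketch - swap in your own username and cookie value):

# Hypothetical one-off fetch of page 2 of upvoted submissions
# (same idea as the comments loop in step 2, minus &comments=t):
curl --compressed -o submissions-2.htm \
-H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
"https://news.ycombinator.com/upvoted?id=miles&p=2"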

1. Get cURL (complete with cookie) from web browser

  1. Log in to Hacker News and open your profile page

  2. Open the network tab in your web browser's dev tools (e.g., Safari: Develop → Show Web Inspector → Network ⓐ)

  3. Click the "comments (private)" link on your profile page¹

  4. In the network tab, right-click "upvoted" ⓑ, then click "Copy as cURL" ⓒ:
    [Screenshot: Safari Network tab]
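
    The copied command will look roughly like the following abbreviated, hypothetical example (your cookie value and headers will differ) before the edits described in step 2:

    curl 'https://news.ycombinator.com/upvoted?id=miles&comments=t' \
    -X 'GET' \
    -H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
    -H 'Accept-Encoding: gzip, deflate, br'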

2. Download upvoted comments

Paste the cURL command from the clipboard into a for loop² like the one below, making sure to:

  1. specify the desired range of comment pages to download (e.g., 7 to 11)
  2. add --compressed³ to the curl command (if HTTP compression is specified, as below⁴)
  3. change the single quotes around the URL to double quotes (so that ${i} is expanded by the shell)
  4. append &p=${i} to the URL
  5. append -o ${i}.htm to the last line of the cURL command
for i in {7..11}; do
curl --compressed "https://news.ycombinator.com/upvoted?id=miles&comments=t&p=${i}" \
-X 'GET' \
-H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: news.ycombinator.com' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15' \
-H 'Accept-Language: en-us' \
-H 'Referer: https://news.ycombinator.com/' \
-H 'Connection: keep-alive' -o ${i}.htm
sleep 5
done
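
After the loop finishes, a quick check can flag any pages that hit HN's rate limit (the "Sorry, we're not able to serve your requests this quickly." message described in footnote 2) - a minimal sketch, assuming the pages were saved to the current directory:

# List downloaded pages containing HN's rate-limit message;
# re-fetch these after waiting a while.
grep -l "not able to serve your requests this quickly" *.htm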

Footnotes

  1. The URL for the first page of upvoted comments looks like https://news.ycombinator.com/upvoted?id=miles&comments=t, while subsequent pages have &p=# appended, e.g., https://news.ycombinator.com/upvoted?id=miles&comments=t&p=2.

  2. HN rate limits requests, so throttling is necessary (otherwise, you will receive a "Sorry, we're not able to serve your requests this quickly." response and ultimately your IP address may be banned). While wget offers a --wait=seconds option, the closest curl comes is --limit-rate <speed>, which sadly did not prevent the warning from occurring when requesting pages back to back, even at rates as slow as 1000 bytes per second - hence the for loop with a delay. See Implement wget's --wait, --random-wait #5406.
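
     For comparison, a rough (untested) wget equivalent that leans on wget's built-in throttling might look like the sketch below; filenames are derived from the query string, and the cookie is the same one copied in step 1:

     # Hypothetical wget alternative: --wait/--random-wait handle the delay,
     # and bash brace expansion generates one URL per page (7 through 11).
     wget --wait=5 --random-wait \
     --header='Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
     "https://news.ycombinator.com/upvoted?id=miles&comments=t&p="{7..11}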

  3. Otherwise, the files are saved still gzip-compressed, necessitating something along the lines of gzip -d -f -S "" * to decompress them. [1, 2, 3, 4]

  4. Without the -H 'Accept-Encoding: gzip, deflate, br' header, 5-6 times the bandwidth is used to download the uncompressed HTML.

Sources

Related

/nix | Sep 05, 2021

