Batch download HN comments you've upvoted

While this post describes how to bulk download comments you've upvoted on Hacker News, the process is virtually identical for upvoted submissions - the URL format is just slightly different, e.g., https://news.ycombinator.com/upvoted?id=miles&p=2 (though there are apparently better ways of downloading upvoted submissions - see the Related section below).
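
For example, with the session cookie obtained in step 1 below, a single page of upvoted submissions can be fetched like so (a minimal sketch - swap in your own username and cookie value):

# Hypothetical one-off fetch of page 2 of upvoted submissions
# (same idea as the comments loop in step 2, minus &comments=t):
curl --compressed -o submissions-2.htm \
-H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
"https://news.ycombinator.com/upvoted?id=miles&p=2"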

1. Get cURL (complete with cookie) from web browser

  1. Log in to Hacker News and open your profile page

  2. Open the network tab in your web browser's dev tools (e.g., Safari: Develop → Show Web Inspector → Network ⓐ)

  3. Click the "comments (private)" link on your profile page¹

  4. In the network tab, right-click "upvoted" ⓑ, then click "Copy as cURL" ⓒ:
    [Screenshot: Safari Network tab]
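
    The copied command will look roughly like the following abbreviated, hypothetical example (your cookie value and headers will differ) before the edits described in step 2:

    curl 'https://news.ycombinator.com/upvoted?id=miles&comments=t' \
    -X 'GET' \
    -H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
    -H 'Accept-Encoding: gzip, deflate, br'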

2. Download upvoted comments

Paste the cURL command from the clipboard into a for loop² like the one below, making sure to:

  1. specify the desired range of comment pages to download (e.g., 7 to 11)
  2. add --compressed³ to the curl command (if HTTP compression is specified, as below⁴)
  3. change the single quotes around the URL to double quotes (so that ${i} is expanded by the shell)
  4. append &p=${i} to the URL
  5. append -o ${i}.htm to the last line of the cURL command
for i in {7..11}; do
curl --compressed "https://news.ycombinator.com/upvoted?id=miles&comments=t&p=${i}" \
-X 'GET' \
-H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: news.ycombinator.com' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15' \
-H 'Accept-Language: en-us' \
-H 'Referer: https://news.ycombinator.com/' \
-H 'Connection: keep-alive' -o ${i}.htm
sleep 5
done
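
After the loop finishes, a quick check can flag any pages that hit HN's rate limit (the "Sorry, we're not able to serve your requests this quickly." message described in footnote 2) - a minimal sketch, assuming the pages were saved to the current directory:

# List downloaded pages containing HN's rate-limit message;
# re-fetch these after waiting a while.
grep -l "not able to serve your requests this quickly" *.htm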

Footnotes

  1. The URL for the first page of upvoted comments looks like https://news.ycombinator.com/upvoted?id=miles&comments=t, while subsequent pages have &p=# appended, e.g., https://news.ycombinator.com/upvoted?id=miles&comments=t&p=2.

  2. HN rate limits requests, so throttling is necessary (otherwise, you will receive a "Sorry, we're not able to serve your requests this quickly." response and ultimately your IP address may be banned). While wget offers a --wait=seconds option, the closest curl comes is --limit-rate <speed>, which sadly did not prevent the warning from occurring when requesting pages back to back, even at rates as slow as 1000 bytes per second - hence the for loop with a delay. See Implement wget's --wait, --random-wait #5406.
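
     For comparison, a rough (untested) wget equivalent that leans on wget's built-in throttling might look like the sketch below; filenames are derived from the query string, and the cookie is the same one copied in step 1:

     # Hypothetical wget alternative: --wait/--random-wait handle the delay,
     # and bash brace expansion generates one URL per page (7 through 11).
     wget --wait=5 --random-wait \
     --header='Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
     "https://news.ycombinator.com/upvoted?id=miles&comments=t&p="{7..11}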

  3. Otherwise, the files are saved still gzip-compressed, necessitating something along the lines of gzip -d -f -S "" * to decompress them. [1, 2, 3, 4]

  4. Without the -H 'Accept-Encoding: gzip, deflate, br' header, 5-6 times the bandwidth is used to download the uncompressed HTML.

Sources

Related

/nix | Sep 05, 2021

