While this post describes how to bulk download comments you've upvoted on Hacker News, the process is virtually identical for upvoted submissions; the URL format is just slightly different, e.g., https://news.ycombinator.com/upvoted?id=miles&p=2 (though there are apparently better ways of downloading upvoted submissions; see the Related section below).
Log in to Hacker News and open your profile page
Open the network tab in your web browser's dev tools (e.g., Safari: Develop → Show Web Inspector → Network ⓐ)
Click the "comments (private)" link on your profile page1
In the network tab, right click "upvoted" ⓑ then click "Copy as cURL" ⓒ:
Paste cURL command from clipboard into a for loop² like the one below, making sure to:
add --compressed³ to the curl command (if HTTP compression is specified, as below⁴)
add &p=${i} to the URL
add -o ${i}.htm to the last line of the cURL command

for i in {7..11}; do
curl --compressed "https://news.ycombinator.com/upvoted?id=miles&comments=t&p=${i}" \
-X 'GET' \
-H 'Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: news.ycombinator.com' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15' \
-H 'Accept-Language: en-us' \
-H 'Referer: https://news.ycombinator.com/' \
-H 'Connection: keep-alive' -o ${i}.htm
sleep 5; done
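Once the loop finishes, a quick way to confirm that none of the saved pages are actually HN's rate-limit error page (see note 2) is something along these lines:

# List any downloaded page containing the rate-limit message instead of comments.
grep -l "not able to serve your requests this quickly" *.htm

Any file it lists can be re-fetched by re-running the loop for just that page number.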
¹ The URL for the first page of upvoted comments looks like https://news.ycombinator.com/upvoted?id=miles&comments=t, while subsequent pages have &p=# appended, e.g., https://news.ycombinator.com/upvoted?id=miles&comments=t&p=2.
² HN rate limits requests, so throttling is necessary (otherwise, you will receive a "Sorry, we're not able to serve your requests this quickly." response, and ultimately your IP address may be banned). While wget offers a --wait=seconds option, the closest curl comes is --limit-rate <speed>, which sadly did not prevent the warning from occurring when requesting pages in sequence, even at rates as slow as 1000 bytes per second; hence the for loop with a delay (a rough wget sketch follows these notes). See Implement wget's --wait, --random-wait #5406.
³ Otherwise, the files are saved compressed, necessitating something along the lines of gzip -d -f -S "" * to decompress. [1, 2, 3, 4]
⁴ Without -H 'Accept-Encoding: gzip, deflate, br', 5-6 times the bandwidth is used to download uncompressed HTML.
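For comparison, here is a rough, untested sketch of the wget route mentioned in note 2, reusing the same (redacted) session cookie; the page range is illustrative. wget's --wait handles the throttling itself, though output filenames are derived from the URLs rather than templated:

# Sketch only: --input-file=- reads the URLs from stdin; --wait pauses 5 seconds between
# retrievals. Files are saved under URL-derived names (e.g. "upvoted?id=miles&comments=t&p=2"),
# and without an Accept-Encoding header the HTML arrives uncompressed (see note 4).
for i in {2..6}; do
  echo "https://news.ycombinator.com/upvoted?id=miles&comments=t&p=${i}"
done | wget --wait=5 --input-file=- \
  --header='Cookie: user=miles&XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'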
user's answer to CURL to access a page that requires a login from a different page
Arne Brasseur's answer to curl lacks a pause between requests, wget lacks dynamic file output names, is there a ready alternative for sequential range of file downloads?
kenorb's comment on curl .gz file and pipe it for decompression
Related

In Searching Upvoted Articles on HN, Ravi Lingineni uses Google Custom Search Engine to search through the content of his upvoted articles.
Download submissions and comments for any HN user via hacker-news-to-sqlite (a rough usage sketch follows this list).
Show HN: Archivy-HN – automatically save and download your upvoted HN posts
hacker-news-favorites-api: "A simple script that will scrape the favorites for a provided hacker news user id."
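For instance, hacker-news-to-sqlite (mentioned above) pulls a user's submissions and comments into a SQLite database; the invocation below is assumed from the project's documentation, so verify it against the README:

# Assumed usage; check the hacker-news-to-sqlite README for current syntax.
pip install hacker-news-to-sqlite
hacker-news-to-sqlite user hacker-news.db miles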
/nix | Sep 05, 2021