So, after having no luck finding a good way to scrape Reddit for submissions and posts, I was lucky enough to stumble upon PullPush.io. It was a godsend for me since it has basically all the useful functions from pushshift.io, which is much appreciated. It's a regular web API and doesn't require credentials (for now). The odd thing I found is that there was still no API wrapper or anything like that for it; there are search websites and tools that use PullPush as the backend, but I still found that odd. I did look at PMAW, the multithreaded wrapper for the Pushshift API, but it seemed too complicated, and I wanted something simple to use and maintain.
So that’s when I decided to make a new one from scratch. I was already planning to make a 2023 recap/wrap-up for r/BlueArchive, so I whipped one up and was quite satisfied with it. As of writing it isn’t a Python package yet, but I am planning to make it one. Currently it supports multithreading for faster scraping, scraping based on a timeframe, and all the basic functions you’d expect. It can also scrape the entire history of a subreddit (more than 100 submissions) and more.
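To give an idea of what "multithreading plus timeframe scraping" means here, the core trick is splitting the epoch range into one chunk per worker and querying each chunk in parallel. Below is a minimal sketch; the endpoint path and query parameters (`subreddit`, `after`, `before`, `size`) are my assumption that PullPush mirrors Pushshift's search API, so double-check against the actual service:

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed Pushshift-compatible endpoint exposed by PullPush (verify before use).
PULLPUSH_URL = "https://api.pullpush.io/reddit/search/submission/"


def split_timeframe(start, end, chunks):
    """Split a [start, end) epoch-seconds range into `chunks` contiguous pieces."""
    step = (end - start) // chunks
    ranges = [(start + i * step, start + (i + 1) * step) for i in range(chunks)]
    ranges[-1] = (ranges[-1][0], end)  # last chunk absorbs the rounding remainder
    return ranges


def fetch_chunk(subreddit, after, before, size=100):
    """Fetch up to `size` submissions in one time chunk (network call)."""
    params = urllib.parse.urlencode(
        {"subreddit": subreddit, "after": after, "before": before, "size": size}
    )
    with urllib.request.urlopen(f"{PULLPUSH_URL}?{params}") as resp:
        return json.load(resp)["data"]


def scrape(subreddit, start, end, workers=4):
    """One thread per time chunk; flatten the per-chunk results into one list."""
    chunks = split_timeframe(start, end, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: fetch_chunk(subreddit, *c), chunks)
    return [post for chunk in results for post in chunk]
```

The chunking is the part worth getting right: the last range absorbs any integer-division remainder so the chunks tile the full window with no gap.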
More plans are in the works:
- Google Custom Search Engine integration for more detailed searches
- DB integration
- memory safety by offloading data to disk when there is too much of it (I thought about using generators, but that seems unnecessary to me)
- data analysis features (I have some already, but they need more work and polishing)
- better task splitting between threads for performance
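The disk-offloading item could be as simple as a buffer that spills batches to a JSONL file once it grows past a cap. This is just a sketch of the idea, not anything from the wrapper itself; the class name and threshold are made up:

```python
import json
import os


class SpillBuffer:
    """Accumulate scraped records in memory; spill to a JSONL file past a cap."""

    def __init__(self, path, max_in_memory=10_000):
        self.path = path
        self.max_in_memory = max_in_memory
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_in_memory:
            self.flush()

    def flush(self):
        """Append buffered records to disk, one JSON object per line."""
        with open(self.path, "a", encoding="utf-8") as f:
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        self.buffer.clear()

    def read_all(self):
        """Stream everything back: flushed rows first, then what's still in memory."""
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)
        yield from self.buffer
```

JSONL keeps appends cheap and reads streamable, which matters when a full-history scrape won't fit comfortably in RAM.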