yeongmin's archive

idk what I'm doing

So, about the BAscraper

maxjo, 2024-01-09

So, after having no luck finding a good way to scrape Reddit submissions, I came across PullPush.io. It was a godsend for me since it has basically all the useful functions of pushshift.io, which is much appreciated. It is a regular web API and doesn’t require credentials (for now). The odd thing is that there was still no API wrapper or anything like that for it. There were search websites and such that used PullPush as the backend, but no wrapper. I looked at PMAW, the multithreaded wrapper for the Pushshift API, but it looked too complicated; I wanted something simple to use and maintain.
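To give an idea of what "a web API without credentials" looks like in practice, here is a minimal sketch of building a search URL. The endpoint path and the `subreddit`, `size`, `after` and `before` parameter names are assumptions carried over from the Pushshift-style API that PullPush mirrors, and `build_search_url` is a hypothetical helper, not part of BAScraper.

```python
from urllib.parse import urlencode

# Assumed endpoint: PullPush is expected to mirror the old Pushshift search routes.
BASE_URL = "https://api.pullpush.io/reddit/search/submission/"


def build_search_url(subreddit, after=None, before=None, size=100):
    """Build a submission search URL; filters left as None are omitted."""
    params = {
        "subreddit": subreddit,
        "size": size,
        "after": after,    # epoch seconds, lower bound (assumed parameter name)
        "before": before,  # epoch seconds, upper bound (assumed parameter name)
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}"


# All of 2023 (UTC) for r/BlueArchive.
print(build_search_url("BlueArchive", after=1672531200, before=1704067200))
```

The resulting URL can be fetched with any HTTP client; no API key or OAuth step is involved, which is the main convenience compared to the official Reddit API.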

So that’s when I decided to make a new one from scratch. I was planning to make a 2023 recap/wrap-up for r/BlueArchive anyway, so I whipped one up and I was quite satisfied with it. As of writing, it is not a Python package yet, but I am planning to make it into one. Currently it supports multithreading for faster scraping, scraping based on timeframe, and all the basic functions you’d expect. It can also scrape the full history of a subreddit (more than 100 submissions) and more.
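The general idea behind timeframe scraping, going past 100 submissions, and multithreading can be sketched roughly like this. It is not the actual BAScraper code: `fetch_page` is a hypothetical callable standing in for one API request, and the 100-item page cap and the newest-first ordering are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 100  # assumed maximum number of results per request


def split_timeframe(start, end, parts):
    """Split [start, end) in epoch seconds into `parts` contiguous windows."""
    step = (end - start) / parts
    edges = [start + round(i * step) for i in range(parts)] + [end]
    return list(zip(edges[:-1], edges[1:]))


def scrape_window(fetch_page, after, before):
    """Page backwards through one window until a short page comes back.

    `fetch_page(after, before)` is hypothetical: it should return at most
    PAGE_SIZE dicts with a `created_utc` key, newest first, created in
    [after, before). Posts sharing the exact timestamp of a page boundary
    can be missed with this simple cursor.
    """
    results = []
    while True:
        page = fetch_page(after, before)
        results.extend(page)
        if len(page) < PAGE_SIZE:
            return results
        # Move the upper bound down to the oldest post seen so far.
        before = page[-1]["created_utc"]


def scrape(fetch_page, start, end, workers=4):
    """Give each thread its own time window and merge the results."""
    windows = split_timeframe(start, end, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda w: scrape_window(fetch_page, *w), windows)
        return [item for chunk in chunks for item in chunk]
```

Splitting by equal time spans is the simplest option, but busy periods end up in the same window as quiet ones, so some threads finish much earlier than others.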

More is planned:

  • Google’s Custom Search Engine integration for more detailed search
  • DB integration
  • memory safety by offloading data to disk when there is too much of it (I thought of using generators, but that seems unnecessary to me)
  • data analysis features (I already have some, but they need more work and polishing)
  • better task splitting between threads for performance
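For the disk offloading item, one possible shape is a small buffer that appends to a JSON Lines file once it holds too many records. This is only a sketch of the idea, not something BAScraper does today; the `SpillBuffer` class, its method names and the `max_items` threshold are all made up for illustration.

```python
import json
from pathlib import Path


class SpillBuffer:
    """Hold records in memory and append them to a JSON Lines file at a limit."""

    def __init__(self, path, max_items=1000):
        self.path = Path(path)
        self.max_items = max_items  # hypothetical threshold, tune to taste
        self.items = []

    def add(self, record):
        self.items.append(record)
        if len(self.items) >= self.max_items:
            self.flush()

    def flush(self):
        """Write buffered records to disk, one JSON object per line."""
        if not self.items:
            return
        with self.path.open("a", encoding="utf-8") as f:
            for record in self.items:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        self.items.clear()

    def iter_records(self):
        """Flush whatever is pending, then read every record back from disk."""
        self.flush()
        if not self.path.exists():
            return
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```

Appending line by line means a crash mid-scrape loses at most the records still in memory, and the file can be read back without loading everything at once.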
GitHub: BAScraper

©2025 yeongmin's archive