First release of BAScraper

maxjo, 2024-01-31

So after a bit, I released the proper version of BAScraper on PyPI. I can't guarantee it'll work properly since I don't have a testing method yet, but I'll fix things ASAP since I use it quite often myself. The following is the documentation for the current version (as of writing), _v0.0.1_. The docs below do mention the pip install part, and while that works, the scraping might not work properly since the PyPI release is behind the version on GitHub.

BAScraper

A little (multi-threaded) API wrapper for PullPush.io – the 3rd-party replacement API for Reddit. After the 2023 Reddit API controversy, PushShift.io (and wrappers such as PSAW and PMAW) became available only to Reddit admins, and Reddit's PRAW is honestly useless when trying to get lots of data or data from a specific timeframe. PullPush.io thankfully solves this issue, and this is a wrapper for that API. For more info on the API (TOS, forum, docs, etc.), go to PullPush.io.

BAScraper (Blue Archive Scraper) was initially made and used for the 2023 recap/wrap-up of r/BlueArchive, hence the name. It's pretty basic, but I'm planning to add more features as it goes. This is also my first time releasing an actual Python package, so bear with me. It uses multithreading to make requests to the PullPush.io endpoint and returns the result as a JSON (dict) object.

currently it can:

  • get submissions/comments from a certain subreddit, in the order/sorting methods supported by the PullPush.io API docs
  • get the comments under each retrieved submission
  • get all the submissions based on either a number of submissions or a timeframe you specify
  • recover (if possible) deleted/removed submissions/comments in the returned result (see the sketch below)
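
For instance, combining the comment fetching and recovery options might look like this (a minimal sketch based on the parameters documented further down; the exact shape of the output may differ):

from datetime import datetime
from BAScraper import Pushpull

pp = Pushpull(sleepsec=2, threads=2)

# December 2023 submissions from r/bluearchive, each with its comments
# attached; keep the original version of edited/deleted duplicates
result = pp.get_submissions(
    after=datetime(2023, 12, 1),
    before=datetime(2024, 1, 1),
    subreddit='bluearchive',
    get_comments=True,
    duplicate_action='keep_original',
)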

I also have a tool that can organize, clean, and analyze Reddit submission and comment data, which I'm planning to release with this or separately after some polishing.

Also, please ask the PullPush.io owner before making large amounts of requests, and respect the cooldown times. Hammering the endpoint stresses the server and can cause inconvenience for everyone.

basic usage & requirements

here's the rundown on how to set it up and use it.

install BAScraper using pip:

pip install BAScraper

Python 3.10+ is recommended

Example usage

from datetime import datetime
import json

from BAScraper import Pushpull

# 2s cooldown between requests, 2 worker threads
pp = Pushpull(sleepsec=2, threads=2)

# fetch all r/bluearchive submissions from December 2023, newest first
result = pp.get_submissions(after=datetime(2023, 12, 1), before=datetime(2024, 1, 1),
                            subreddit='bluearchive', sort='desc')

# save the result as JSON
with open("example.json", "w", encoding='utf-8') as outfile:
    json.dump(result, outfile, indent=4)

Parameters

Pushpull (__init__)

all parameters are optional

| parameter | type | description | default value |
|---|---|---|---|
| creds | List[Creds] | not implemented yet | |
| sleepsec | int | cooldown time between each request | 1 |
| backoffsec | int | backoff time after each failed request | 3 |
| max_retries | int | number of retries for failed requests before giving up | 5 |
| timeout | int | time until a request is considered a timeout error | 10 |
| threads | int | number of threads used when multithreading | 4 |
| comment_t | int | number of threads used for comment fetching | threads |
| batch_size | int | not implemented yet | |
| log_level | str | log level displayed on the console, as a string representation of a logging level | INFO |
| cwd | str | directory path where all the log files and JSON files will be stored | os.getcwd() (current directory) |
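
For example, a slower, more polite configuration that goes easier on the server might look like this (a sketch using the parameters above; the values are just illustrative):

from BAScraper import Pushpull

pp = Pushpull(
    sleepsec=5,        # 5s cooldown between requests
    backoffsec=10,     # wait 10s after a failed request
    max_retries=3,     # give up after 3 failed retries
    timeout=15,        # consider a request timed out after 15s
    threads=2,         # fewer threads -> less load on the server
    log_level='DEBUG', # more verbose console output
    cwd='./logs',      # where log files and JSON output go
)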

Pushpull.get_submissions & Pushpull.get_comments

all parameters are optional

| parameter | type | description | default value | get_submissions | get_comments |
|---|---|---|---|---|---|
| after | datetime.datetime | return results after this date (inclusive, >=) | | ✅ | ✅ |
| before | datetime.datetime | return results before this date (exclusive, <) | | ✅ | ✅ |
| get_comments | bool | if True, each result will contain a comments field (List[dict]) holding all the comments for that post | False | ✅ | |
| duplicate_action | str | what to do with duplicate results (usually caused by edited/deleted submissions/comments); accepts 'newest', 'oldest', 'remove', 'keep_original', 'keep_removed'; can recover deleted posts if possible (not guaranteed) | newest | ✅ | ✅ |
| filters | List[str] | filter the results to only the fields you want | | ✅ | ✅ |
| sort | str | sort results in a specific order; accepts 'desc', 'asc' | desc | ✅ | ✅ |
| sort_type | str | sort by a specific attribute; accepts 'created_utc', 'score', 'num_comments'; if after or before is used, defaults to 'created_utc' | created_utc | ✅ | ✅ |
| limit | int | number of results to return per request; maximum of 100, recommended to keep the default | 100 | ✅ | ✅ |
| ids | List[str] | get specific submissions/comments via their ids | | ✅ | ✅ |
| link_id | str | return comments from a particular submission | | | ✅ |
| q | str | search term; searches ALL possible fields | | ✅ | ✅ |
| title | str | searches the title field only | | ✅ | |
| selftext | str | searches the selftext field only | | ✅ | |
| author | str | restrict to a specific author | | ✅ | ✅ |
| subreddit | str | restrict to a specific subreddit | | ✅ | ✅ |
| score | int | restrict results based on score | | ✅ | |
| num_comments | int | restrict results based on number of comments | | ✅ | |
| over_18 | bool | restrict to NSFW or SFW content | | ✅ | |
| is_video | bool | restrict to video content (as of writing, this parameter is broken and returns an err 500) | | ✅ | |
| locked | bool | return locked or unlocked threads only | | ✅ | |
| stickied | bool | return stickied or un-stickied content only | | ✅ | |
| spoiler | bool | exclude or include spoilers only | | ✅ | |
| contest_mode | bool | exclude or include contest mode submissions | | ✅ | |
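
As a quick illustration of the search-related parameters, a comment search might look like this (a sketch based on the table above; the parameter values are just illustrative and I haven't verified the exact shape of the returned data):

from BAScraper import Pushpull

pp = Pushpull()

# comments on r/bluearchive mentioning 'event', oldest first,
# trimmed down to just the fields we care about
comments = pp.get_comments(
    subreddit='bluearchive',
    q='event',
    sort='asc',
    filters=['author', 'body', 'created_utc', 'score'],
)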