The interesting corner

Using AI to automatically generate and upload Instagram reels

Contents:

  - Introduction
  - Getting a Reddit post
  - Converting the text to speech
  - Transcribing
  - Generating the video
  - Creating a video URL
  - Uploading to IG

Introduction

Recently I saw a video in Instagram reels about a guy who built quite a popular account just by copying Reddit posts, converting them to speech with AI, adding subtitles and pasting the result over a Subway Surfers or GTA 5 mega ramp video. Basically something like this Instagram account. He explained how he did it manually: going onto Reddit, copying the text of a post, adding it to CapCut to add TTS, transcribing it and adding subtitles, and then manually posting it on Instagram (or TikTok). I saw this and thought: this can easily be automated. So I got to work. I used Python, for obvious reasons.

This will be a full walkthrough of the code, how it works and what the results look like. I've divided all the parts into separate modules that are connected by the main Python file, so it's easy to swap out a module for something else (like a different TTS engine).

Getting a Reddit post

Step 1 is to get the text of a post from Reddit so it can be processed into a video. We want the video to read the title and the body text of a post, so we can model a post as a class. A Reddit post contains these elements: an ID, a title, a description (the body text), the comments and the subreddit it was posted in.

So a class for a post will look like this:

                    
class RedditPost:
    def __init__(self, id, title, description, comments, subreddit) -> None:
        self.id = id
        self.title = title
        self.description = description
        self.comments = comments
        self.subreddit = subreddit
        self.sanitize_post()

    def into_text(self) -> str:
        return self.title + ".\n" + self.description

    def __str__(self) -> str:
        return "Id: " + self.id + "\nTitle: " + self.title + "\nDescription: " + self.description + "\nComments: " + str(len(self.comments))
                

The into_text method will be used to get the text of the post so it can be converted into speech. Sometimes Reddit posts contain text that is hard to turn into speech, and we want to censor "bad" words because otherwise IG won't push the video as much. We also want to remove any extra periods, because those will be spoken literally by the TTS AI. I added some (crude) sanitization to the RedditPost class to prevent this:

                
def sanitize_post(self):
    self.description = self.description.replace("LGBTQ","L G B T Q")
    self.description = self.description.replace("+","plus")
    self.description = self.description.replace("/"," slash ")
    self.description = self.description.replace("TLDR","To summarize: ")
    self.description = self.description.replace("&", "and")
    self.description = self.description.replace("ä", "ae")
    self.description = self.description.replace("ö", "oe")
    self.description = self.description.replace("ü", "ue")
    self.description = self.description.replace("ß", "ss")
    self.description = self.description.replace("*","")
    self.description = self.description.replace("_","")
    self.description = self.description.replace('"'," ")

    # profanities
    self.description = self.description.replace("fuck", "frick")
    self.description = self.description.replace("Fuck", "Frick")
    self.description = self.description.replace("Shit", "Shot")
    self.description = self.description.replace("shit", "shot")
    self.description = self.description.replace(" ass", " butt")
    self.description = self.description.replace("asshole", "a-hole")
    self.description = self.description.replace(" Ass", " Butt")
    self.description = self.description.replace("Asshole", "A-hole")
    self.description = self.description.replace(" buttum", " assum") # Assume also contains "Ass"
    self.description = self.description.replace(" Buttum", " Assum")
    self.description = self.description.replace("kill", "unalive")
    self.description = self.description.replace("Kill", "Unalive")
    self.description = self.description.replace("death", "unalive")
    self.description = self.description.replace("Death", "Unalive")
    self.description = self.description.replace("murder", "unalive")
    self.description = self.description.replace("Murder", "Unalive")
    self.description = self.description.replace("suicide", "self unalive")
    self.description = self.description.replace("Suicide", "Self unalive")
    self.description = self.description.replace("pedofile", "pdf ile")
    self.description = self.description.replace("Pedofile", "Pdf ile")
    self.description = self.description.replace("sex", "s*x")
    self.description = self.description.replace("Sex", "s*x")

    self.title = self.title.replace("fuck", "frick")
    self.title = self.title.replace("Fuck", "Frick")
    self.title = self.title.replace("Shit", "Shot")
    self.title = self.title.replace("shit", "shot")

    # AmITheAsshole
    self.description = self.description.replace("AITA","Am I the a-hole")
    self.title = self.title.replace("AITA","Am I the a-hole")
    # tifu
    self.description = self.description.replace("TIFU","Today I fricked up")
    self.title = self.title.replace("TIFU","Today I fricked up")
    # lifeProTips
    self.description = self.description.replace("LPT","Life pro tip")
    self.title = self.title.replace("LPT","Life pro tip")

    self.description = stringutils.remove_trailing_periods(self.description)
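
To see what this rewriting looks like in practice, here's a tiny self-contained demo. The sanitize_demo function is just for illustration and only uses a small subset of the replacements above:

```python
# Illustration only: apply a small subset of the sanitize_post replacements.
def sanitize_demo(text: str) -> str:
    for old, new in [("AITA", "Am I the a-hole"), ("fuck", "frick"), ("&", "and")]:
        text = text.replace(old, new)
    return text

print(sanitize_demo("AITA for saying fuck & leaving?"))
# → Am I the a-hole for saying frick and leaving?
```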
            

The stringutils module contains some functionality for processing text:

stringutils.py

                
import logging
logger = logging.getLogger(__name__)

alphabet = "qwertyuiopasdfghjklzxcvbnm"

def remove_trailing_periods(text: str) -> str:
    for i in range(len(text)):
        if (i < len(text)-1) and text[i].lower() not in alphabet and text[i+1] == ".":
            # remove the period after this one
            text = text[:i+1] + text[i + 2:]
            logger.info("removing extra period at index " + str(i+1))
            return remove_trailing_periods(text)
    return text
    

def remove_period_after(character: str, text: str) -> str:
    for i in range(len(text) - 1): # stop before the last character so text[i+1] is always valid
        if text[i] == character and text[i+1] == ".":
            # remove the period after this one
            text = text[:i+1] + text[i + 2:]
            logger.info("removing extra period at index " + str(i+1))
            return remove_period_after(character, text)
    return text

def remove_repeating_periods(text: str) -> str:
    return remove_period_after(".",text)
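
To show remove_trailing_periods in action, here's a self-contained example; the function is repeated from stringutils.py above (without the logging) so the snippet runs on its own:

```python
# Same logic as stringutils.remove_trailing_periods, repeated so this runs standalone:
# a period that directly follows a non-letter character gets removed, so the
# TTS doesn't read stray periods out loud.
alphabet = "qwertyuiopasdfghjklzxcvbnm"

def remove_trailing_periods(text: str) -> str:
    for i in range(len(text)):
        if (i < len(text) - 1) and text[i].lower() not in alphabet and text[i + 1] == ".":
            text = text[:i + 1] + text[i + 2:]
            return remove_trailing_periods(text)
    return text

print(remove_trailing_periods("I give up!. Or do I..."))
# → I give up! Or do I.
```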
            

Now that we have a class to represent a Reddit post, we need to actually retrieve posts. For this I used the PRAW Python package. I put the functionality into a RedditEngine class:

                
class RedditEngine:
    MAX_IG_SHORT_LENGTH = 1620 # max video length is 1:30; ~1620 characters of text fits in that
    REDDIT_IDS_FILENAME = "reddit_ids"
    TTS_FOLDER_NAME = "tts"
    SUBREDDITS_STORIES_FILENAME = "subreddits_stories"
    DEFAULT_POST_AMOUNT = 30

    def __init__(self) -> None:
        clientid = "your reddit client id"
        secret = "your reddit secret"
        user_agent = "praw_scraper_1.0"

        self.reddit = praw.Reddit(username='your username',
                                  password='your password',
                                  client_id=clientid,
                                  client_secret=secret,
                                  user_agent=user_agent)
        self.posts: List[RedditPost] = []
        self.already_used_ids = []
        with open(RedditEngine.REDDIT_IDS_FILENAME, "r") as reddit_ids:
            for line in reddit_ids:
                self.already_used_ids.append(line.strip())
        logger.info("IDs already used:")
        logger.info(self.already_used_ids)
    
    def get_posts(self, subreddit_name, limit):
        subreddit = self.reddit.subreddit(subreddit_name)
        logger.info("getting hot " + str(limit) + " posts for subreddit: " + subreddit.display_name)

        for submission in subreddit.hot(limit=limit):
            if RedditEngine.check_post(subreddit_name, submission) and ((len(submission.title) + len(submission.selftext)) < RedditEngine.MAX_IG_SHORT_LENGTH):
                self.posts.append(RedditPost(str(submission),submission.title, submission.selftext, submission.comments, subreddit_name))

    @staticmethod
    def check_post(subreddit_name, submission):
        if "UPDATE" in submission.title or "(Part" in submission.title:
            return False
        if subreddit_name == "AmITheAsshole" and "Monthly Open" in submission.title:
            return False
        # parentheses are needed here, otherwise the "or" clauses would apply to every subreddit
        elif subreddit_name == "talesfromtechsupport" and ("POSTING RULES" in submission.title or "Mr_Cartographer" in submission.title or "(Part" in submission.title or str(submission) == "16u1gxn"):
            return False

        return True

    def choose_id(self, id: str) -> bool:
        """
        Checks if the post with the given ID is in the already used ids or not

        Parameters:
            id: id of the post to check
        Returns:
            True if the post has not yet been used, false otherwise
        """
        return id not in self.already_used_ids
    
    def exclude_id(self, id: str):
        """
        Adds the ID to the already used ids file.
        """
        self.already_used_ids.append(id)
        with open(RedditEngine.REDDIT_IDS_FILENAME, "a") as f:
            f.write(id + "\n")
                
            

Let's break that down. I want to always get the 30 hottest posts for a specific subreddit, but I don't want to use the same post twice. That's where REDDIT_IDS_FILENAME comes in: it points to a file that the ID of every used post gets written to, one ID per line. For example:

1eig16p
1ei8uz8
1eiqnyi
1ehi5il
1ejghtc
1ejjj51

The TTS_FOLDER_NAME is the folder the main script saves the generated text-to-speech files to. The SUBREDDITS_STORIES_FILENAME points to a file containing the names of all subreddits that posts can be taken from. This is also used by the main script. It looks like this:

tifu
nosleep
relationships
LifeProTips
pettyrevenge
talesfromtechsupport
confessions
AmITheAsshole
TrueOffMyChest

To be able to scrape Reddit for posts, you need to give PRAW access to your account by entering a client ID and secret. To get those, create an app on Reddit and copy the client ID and secret from it.

In the constructor, PRAW will get initialized and the already used posts are read into a list. The get_posts method gets the hot 30 posts for a subreddit so that one can be chosen. The check_post method will check if a post is not an announcement or update post, because we only want stand-alone posts to make a video out of. The choose_id method will check if the given ID is not already used, and the exclude_id method will add a post ID to the already used IDs list.

In the script that generates one video, a random subreddit is chosen from the file and the hot 30 posts for that are gathered. From those posts, the first one that is not yet in the list of used posts gets chosen. This is visible in the auto_post_video and generate_video_for_subreddit functions:

                
def generate_video_for_subreddit(subreddit: str, reddit_engine: get_reddit_posts.RedditEngine) -> bool:
    reddit_engine.get_posts(subreddit, get_reddit_posts.RedditEngine.DEFAULT_POST_AMOUNT)
    id_accepted = False
    i = 0
    post = None
    while not id_accepted:
        if i == len(reddit_engine.posts):
            return False
        post = reddit_engine.posts[i]
        if reddit_engine.choose_id(post.id):
            id_accepted = True
        else:
            i += 1

    generate_story_video_for_post(post,reddit_engine)
    return True

def auto_post_video():
    reddit_engine = get_reddit_posts.RedditEngine()
    
    subreddits = []
    with open(get_reddit_posts.RedditEngine.SUBREDDITS_STORIES_FILENAME, "r") as f:
        subreddits = [line.strip() for line in f] # strip the trailing newlines, so PRAW gets a clean subreddit name

    subreddit = random.choice(subreddits)
    logger.info("getting post from subreddit " + subreddit)
    video_result = generate_video_for_subreddit(subreddit, reddit_engine)
    if not video_result:
        logger.warning("should use another subreddit")
                
            

After having chosen a Reddit post, it is further processed into a video.

Converting the text to speech

After a Reddit post is chosen, the next step is to convert the text of the post into speech. I wanted to do this using AI because it's incredibly easy to use nowadays. The first thing I tried was ElevenLabs. The results it generates are great, but unfortunately there's a character limit, and I'm not gonna pay for any of this.

text_to_speech_elevenlabs.py

                    
import requests
import random
import logging
logger = logging.getLogger(__name__)

class ElevenLabsVoice:
    def __init__(self, voice_id, name):
        self.voice_id = voice_id
        self.name = name

    def __str__(self) -> str:
        return "Voice ID: " + self.voice_id + ", Name: " + self.name

class ElevenLabsTTS:
    API_KEY = "your API key"
    CHUNK_SIZE = 1024

    def __init__(self, api_key):
        self.api_key = api_key
        self.all_voices = []
        self.current_voice = None

    def get_all_voices(self):
        logger.info("retrieving all voices...")
        url = "https://api.elevenlabs.io/v1/voices"
        headers = {
            "Accept": "application/json",
            "xi-api-key": self.api_key
        }
        response = requests.get(url, headers=headers)
        for voice in response.json()["voices"]:
            self.all_voices.append(ElevenLabsVoice(voice["voice_id"], voice["name"]))

    def select_random_voice(self):
        self.current_voice = random.choice(self.all_voices)
        logger.info("Selected random voice: " + self.current_voice.name)

    def write_to_file(self,filename,text) -> bool:
        if self.current_voice is None:
            raise Exception("No voice selected")
        
        logger.info("writing text to file " + filename + "...")
        logger.info(text)
        logger.info("using voice: " + self.current_voice.name)
        url = "https://api.elevenlabs.io/v1/text-to-speech/" + self.current_voice.voice_id

        headers = {
            "Accept": "audio/mpeg",
            "Content-Type": "application/json",
            "xi-api-key": self.api_key # use the key the class was constructed with
        }
        data = {
            "text": text,
            "voice_settings": {
                "stability": 0.3,
                "similarity_boost": 0.5
            }}

        response = requests.post(url, json=data, headers=headers)
        logger.info("GOT RESPONSE")
        logger.info(response)
        logger.info(response.headers)
        logger.info(response.text)
        if (response.status_code != 200):
            return False

        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=ElevenLabsTTS.CHUNK_SIZE):
                if chunk:
                    f.write(chunk)
        logger.info("Done writing to file!")
        return True
                    
                

The next thing I tried was running a TTS model locally on the VM that will upload these videos. I looked at Coqui TTS: an open-source text-to-speech toolkit that's pretty easy to use. I got it working fairly quickly, but I wasn't satisfied with the results.

text_to_speech_coqui_tts.py

                
import torch
from TTS.api import TTS
from pydub import AudioSegment
import os
import stringutils
import time
import logging
logger = logging.getLogger(__name__)

class CoquiTTSEngine:
    def __init__(self):
        model_name = "tts_models/en/ljspeech/fast_pitch"
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tts = TTS(model_name=model_name, progress_bar=True).to(self.device)

    def synthesize_speech(self, text: str, file_path: str) -> bool:
        logger.info(" >>>>> Synthesizing text\n" + text + "\n >>>>> to file " + file_path)
        new_text = text.replace("\\","").replace("*","")
        new_text = stringutils.remove_trailing_periods(new_text)
        logger.info("text after processing a little: " + new_text)
        tmp_file = "tmp_audio.mp3"
        logger.info("Synthesizing speech...")
        try:
            self.tts.tts_to_file(text=new_text, file_path=tmp_file)
        except Exception as e:
            logger.error(e)
            return False
        time.sleep(1) # wait a little before reading the file
        logger.info("speeding up audio file...")
        orig_file = AudioSegment.from_file(tmp_file)
        sped_up_file = orig_file.speedup(1.3)
        sped_up_file.export(file_path,format="mp3")
        os.remove(tmp_file)
        return True
                
            

After some more searching, I came across the TikTok TTS API. I didn't know it existed, and since it's used by almost all reels and TikToks that use AI TTS, it was the perfect choice. It also does not have a character limit, as far as I know. There are multiple voices to choose from, so I made the script choose a random English one every time a video gets made.

                
import sys
sys.path.append("TikTok-Voice-TTS")
from tiktokvoice import tts

import random
import logging
logger = logging.getLogger(__name__)

voices_en = [
        # ENGLISH VOICES
    'en_au_001',                  # English AU - Female
    'en_au_002',                  # English AU - Male
    'en_uk_001',                  # English UK - Male 1
    'en_uk_003',                  # English UK - Male 2
    'en_us_001',                  # English US - Female (Int. 1)
    'en_us_002',                  # English US - Female (Int. 2)
    'en_us_006',                  # English US - Male 1
    'en_us_007',                  # English US - Male 2
    'en_us_009',                  # English US - Male 3
    'en_us_010',                  # English US - Male 4
]
class TiktokTTSApi:

    @staticmethod
    def choose_random_voice() -> str:
        chosen_voice = random.choice(voices_en)
        logger.info("choosing random tiktok voice " + chosen_voice)
        return chosen_voice

    def tts(self, text: str, filename: str) -> str:
        logger.info("converting text to speech!")
        voice = TiktokTTSApi.choose_random_voice()
        tts(text, voice, filename) # the tts function imported from tiktokvoice
        return voice
            

This is then used in the script to generate a single video:

                
def generate_story_video_for_post(post: get_reddit_posts.RedditPost, reddit_engine: get_reddit_posts.RedditEngine):
    mp3_filename = post.id + ".mp3"
    reddit_id_tts_file = os.path.join(os.getcwd(),get_reddit_posts.RedditEngine.TTS_FOLDER_NAME, mp3_filename)
    tiktok_tts_api = text_to_speech_tiktok_api.TiktokTTSApi()
    voice = tiktok_tts_api.tts(post.into_text(),reddit_id_tts_file)

    ...
            

Transcribing

After generating a TTS mp3 file for a Reddit post, the next step is to transcribe the spoken text, so we know when each word is spoken. This tells us when to show which word on the screen. To do this, we can use another AI model called Whisper. It's made by OpenAI (from ChatGPT, duh) and it works very well. It's also free to use, and you can run it locally by downloading the model yourself. It can be used as a command line tool or as a Python package, perfect for this use case.

Using it in Python is very straightforward. You load the model you want, pass in the filename of the mp3 file you want to transcribe, and Bob's your uncle🥳. I put the functionality into a class so it stays modular:

whisper_transcribe.py

                
import whisper
import logging
logger = logging.getLogger(__name__)

class WhisperTranscriber:
    def __init__(self) -> None:
        logger.info("loading whisper model base.en")
        self.model = whisper.load_model("base.en") # english-only base model
        self.text_array = []
        self.fps = 0

    def transcribe(self, audio_filename: str) -> dict:
        logger.info("transcribing " + audio_filename)
        return self.model.transcribe(audio_filename,fp16=False,word_timestamps=True) # using CPU, FP32 must be used

Note that, because I run this on a VM (and I don't have GPU passthrough set up for this VM), I need to pass fp16=False to force FP32 on the CPU. The word_timestamps=True parameter is also very useful, as it gives us a timestamp for each word, rather than for each sentence. This will come in later when we create the actual video.
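
With word_timestamps=True, the dict that transcribe returns contains a "segments" list, and each segment carries a "words" list with "word", "start" and "end" entries. Here's a small sketch of flattening that into (word, start, end) tuples; the hard-coded example_result just mimics the shape Whisper returns, so no model is needed to run it:

```python
# Flatten a Whisper result (word_timestamps=True) into (word, start, end) tuples.
def flatten_words(result: dict) -> list:
    words = []
    for segment in result["segments"]:
        for word in segment["words"]:
            words.append((word["word"].strip(), word["start"], word["end"]))
    return words

# Hard-coded stand-in for an actual transcription result.
example_result = {
    "text": " Today I fricked up.",
    "segments": [
        {"words": [
            {"word": " Today", "start": 0.0, "end": 0.4},
            {"word": " I", "start": 0.4, "end": 0.5},
            {"word": " fricked", "start": 0.5, "end": 0.9},
            {"word": " up.", "start": 0.9, "end": 1.2},
        ]}
    ],
}

print(flatten_words(example_result)[0])  # → ('Today', 0.0, 0.4)
```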

After having made the class, it can be added to the method to generate a video for a story:

            
def generate_story_video_for_post(post: get_reddit_posts.RedditPost, reddit_engine: get_reddit_posts.RedditEngine):
    ...

    transcriber = whisper_transcribe.WhisperTranscriber()
    result = transcriber.transcribe(reddit_id_tts_file)    

    ...
            

Generating the video

After transcribing, it's time to do some video editing. The difficult part is figuring out when to display which part of a sentence. Luckily, we have the timestamp of each word thanks to that handy-dandy word_timestamps parameter from Whisper. I found a video that explains a bit about how to create subtitles with moviepy, but I didn't really like that implementation, so I modified it a bit.

We begin with (of course) a Video class📽️:

                
class Video:
    def __init__(self,filename,width,height,duration, fps = 0, clip = None) -> None:
        self.filename = filename
        self.width = width
        self.height = height
        self.duration = duration
        self.fps = fps
        self.transcribed_text = []
        self.clip = clip
            

It contains the filename to save the video to, the size of the video, the duration in seconds, the FPS, a list of the transcribed sentences and a reference to a moviepy clip. The first part of creating the video is to crop it to the correct aspect ratio for Instagram, remove the original audio and add the TTS audio. In the process of making a video, these things happen:

  1. Create a video clip and an audio clip
  2. Crop the video to a 9:16 aspect ratio
  3. Select a random start time in the video
  4. Clip it to the length of the TTS audio

This is done in the add_audio method:

                
def add_audio(video_path, audio_path, output_path) -> Video:
    logger.info("adding audio file " + audio_path + " to video file " + video_path + " and saving to " + output_path)
    video = mpe.VideoFileClip(video_path)
    audio = mpe.AudioFileClip(audio_path)

    # calculate width to make video 9:16 aspect ratio
    W,H = video.size
    new_width = (float(H)/16.0)*9.0
    new_width_start = (float(W)/2.0) - new_width/2.0
    new_width_end = new_width_start + new_width
    logger.info("Width of original video is " + str(W) + ". Setting width to " + str(new_width))
    logger.info("cropping width from " + str(new_width_start) + " to " + str(new_width_end))

    # make video as long as the audio
    audio_duration = audio.duration # duration in seconds
    video_duration = video.duration
    logger.info("audio is " + str(audio.duration) + " seconds, video is " + str(video.duration) + " seconds")
    start = random.randrange(0,int(video_duration-audio_duration)) # random start point in video

    logger.info("clipping video from " + str(start) + " seconds to " + str(start + audio_duration))
    clip = video.subclip(start, start + audio_duration).without_audio().set_audio(audio)
    cropped_clip = moviepy.video.fx.all.crop(clip,x1=new_width_start,width=new_width)
    if cropped_clip.fps > 60:
        cropped_clip = cropped_clip.set_fps(60) # set_fps returns a new clip, so reassign it
    logger.info("FPS IS " + str(cropped_clip.fps))
    
    return Video(output_path,new_width,H,audio_duration,cropped_clip.fps,cropped_clip)
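
As a sanity check on the crop math in add_audio: for a 1920x1080 source video, the numbers work out like this (the resolution is just an example):

```python
# The 9:16 crop math from add_audio, for an example 1920x1080 source.
W, H = 1920, 1080
new_width = (float(H) / 16.0) * 9.0              # width that makes the clip 9:16
new_width_start = (float(W) / 2.0) - new_width / 2.0  # the crop is centered horizontally
new_width_end = new_width_start + new_width

print(new_width, new_width_start, new_width_end)
# → 607.5 656.25 1263.75
```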
            

This returns a Video object that is then used to add the subtitles for the transcribed text. To represent a transcribed line of text, I made a TranscribedLineInfo class. It represents a line with a start and end time, given either in seconds or in frames:

                
class TranscribedLineInfo:
    def __init__(self, line: str, fps: float, in_seconds: bool, start_frame: int = 0, end_frame: int = 0, start_second = 0, end_second = 0) -> None:
        self.text = line
        self.start_frame = start_frame
        self.end_frame = end_frame
        if in_seconds:
            self.start_second = start_second
            self.end_second = end_second
        else:
            # convert frame numbers to seconds
            self.start_second = start_frame / fps
            self.end_second = end_frame / fps

        self.duration = self.end_second - self.start_second
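
To connect Whisper's word timestamps to subtitle lines like these, one possible approach (a sketch, not necessarily exactly what the rest of this project does) is to chunk the words into short lines, each spanning from its first word's start time to its last word's end time:

```python
# Sketch: group (word, start, end) tuples into subtitle lines of at most
# `max_words` words. Each line spans from its first word's start time to
# its last word's end time.
def group_into_lines(words: list, max_words: int = 4) -> list:
    lines = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w for w, _, _ in chunk)
        start = chunk[0][1]   # start time of the first word in the line
        end = chunk[-1][2]    # end time of the last word in the line
        lines.append((text, start, end))
    return lines

# Example word timestamps, shaped like Whisper's word_timestamps output.
words = [("Today", 0.0, 0.4), ("I", 0.4, 0.5), ("fricked", 0.5, 0.9),
         ("up.", 0.9, 1.2), ("Badly.", 1.4, 1.9)]
print(group_into_lines(words, max_words=4))
# → [('Today I fricked up.', 0.0, 1.2), ('Badly.', 1.4, 1.9)]
```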
            

Creating a video URL

Uploading to IG