Process_multiple_chat_files

Snippet Name: Process_multiple_chat_files

Tags:

Created Date: 2025-06-16 13:12:40
Last Modified Date: 2025-05-31 10:32:08

Dependencies:

DBA Notes:

Sample Output:

Title: Process multiple chat files
Author: user
Help me modify the attached Python script. Right now, it is hardcoded to a single folder. Instead, I would like it to find each chat.html under the folder file:///G:/My%20Drive/_Data/ChatGPT/ and process each of them.

Author: ChatGPT
Here's an updated version of your script that will search for all `chat.html` files under `G:/My Drive/_Data/ChatGPT/` and process each one:

### **Modifications:**
1. Uses `os.walk()` to recursively find all `chat.html` files.
2. Processes each found file instead of a single hardcoded path.
3. Ensures the output directory remains organized based on the original file's location.

### **Updated Script:**
```python
import asyncio
import os
import re
from pathlib import Path

from pyppeteer import launch

async def process_chat_html(html_file_path):
    """Processes a single chat.html file and extracts conversations."""
    browser = await launch(headless=True, executablePath='C:/Program Files/Google/Chrome/Application/chrome.exe')
    try:
        page = await browser.newPage()
        # Build a properly encoded file:// URL (the path contains spaces)
        await page.goto(Path(html_file_path).resolve().as_uri())

        await page.waitForSelector('.conversation')

        conversation_elements = await page.querySelectorAll('.conversation')

        # Determine output directory based on the source file location
        base_dir = os.path.dirname(html_file_path)
        output_directory = os.path.join(base_dir, "Extracted_Conversations")
        os.makedirs(output_directory, exist_ok=True)

        for index, conversation in enumerate(conversation_elements, start=1):
            title = await page.evaluate('(element) => element.querySelector("h4").textContent', conversation)
            title = title.strip()
            # Remove characters that Windows does not allow in filenames
            title_clean = re.sub(r'[<>:"/\\|?*]', '', title)

            messages = await conversation.querySelectorAll('.message')

            conversation_text = f"Title: {title}\n"
            for message in messages:
                author = await page.evaluate('(element) => element.querySelector(".author").textContent', message)
                text = await page.evaluate('(element) => element.querySelector("div:nth-child(2)").textContent', message)
                conversation_text += f"Author: {author}\n{text}\n\n"

            filename = f'{index}_{title_clean}.sql'
            file_path = os.path.join(output_directory, filename)

            with open(file_path, 'w', encoding='utf-8') as output_file:
                output_file.write(conversation_text)

            print(f'Saved: {file_path}')
    finally:
        # Close the browser even if extraction fails partway through
        await browser.close()


async def find_and_process_chats(root_folder):
    """Finds all chat.html files under the specified root folder and processes them."""
    chat_files = []

    for root, _, files in os.walk(root_folder):
        for file in files:
            if file.lower() == "chat.html":
                chat_files.append(os.path.join(root, file))

    if not chat_files:
        print("No chat.html files found.")
        return

    for chat_file in chat_files:
        print(f"Processing: {chat_file}")
        await process_chat_html(chat_file)


if __name__ == "__main__":
    root_folder = "G:/My Drive/_Data/ChatGPT"
    asyncio.run(find_and_process_chats(root_folder))
```

### **Changes & Improvements**
- **Recursively searches** for `chat.html` files.
- **Processes each found file** separately.
- **Saves extracted conversations** in a subfolder (`Extracted_Conversations`) inside each respective chat history folder.
- **Removes Windows-illegal characters** from filenames.

Now, when you run the script, it will automatically scan and extract data from **all** `chat.html` files under the given root folder.
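
To illustrate the filename cleanup step, here is a small sketch that applies the same regular expression to a sample title. The helper name `clean_title` and the sample title are hypothetical; only the pattern comes from the script.

```python
import re

def clean_title(title):
    """Remove characters Windows forbids in filenames (hypothetical helper)."""
    # Same character class as in the script: < > : " / \ | ? *
    return re.sub(r'[<>:"/\\|?*]', '', title)

# Sample title invented for illustration
print(clean_title('SQL: "Top 10" queries?'))  # prints: SQL Top 10 queries
```

Note that the pattern only deletes characters; it does not shorten long titles or handle names Windows reserves, such as `CON`.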

Author: user
It is not obvious to me where the output folder is for the generated files.

Author: ChatGPT
I see how that could be unclear. Here's how the script determines the **output folder** for the extracted conversations:

### **Output Folder Location:**
For each `chat.html` file found, the script creates a corresponding **"Extracted_Conversations"** folder **inside the same directory as the `chat.html` file**.

#### **Example Folder Structure Before Running the Script:**
```
G:/My Drive/_Data/ChatGPT/
├── ChatGPTHistory-2024-01-01/
│   ├── chat.html
│   └── some_other_file.txt
├── ChatGPTHistory-2024-02-01/
│   ├── chat.html
│   └── another_file.json
```

#### **After Running the Script:**
```
G:/My Drive/_Data/ChatGPT/
├── ChatGPTHistory-2024-01-01/
│   ├── chat.html
│   ├── some_other_file.txt
│   └── Extracted_Conversations/  <-- New folder
│       ├── 1_Conversation_Title.sql
│       └── 2_Another_Title.sql
├── ChatGPTHistory-2024-02-01/
│   ├── chat.html
│   ├── another_file.json
│   └── Extracted_Conversations/  <-- Another new folder
│       ├── 1_Example_Title.sql
│       └── 2_Different_Chat.sql
```
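
If you would like to see these locations before extracting anything, a dry run along the following lines could list them. This is a minimal sketch: `preview_output_folders` is a hypothetical helper, not part of the script, and it only walks the folder tree without launching a browser or writing files.

```python
import os

def preview_output_folders(root_folder):
    """Return the Extracted_Conversations path for each folder holding a chat.html."""
    targets = []
    for root, _, files in os.walk(root_folder):
        # Same case-insensitive match the script uses
        if any(name.lower() == "chat.html" for name in files):
            targets.append(os.path.join(root, "Extracted_Conversations"))
    return sorted(targets)

if __name__ == "__main__":
    for folder in preview_output_folders("G:/My Drive/_Data/ChatGPT"):
        print(folder)
```

Each printed path is where the script would create its `Extracted_Conversations` folder for that export.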

### **If You Want a Centralized Output Folder**
If you want **all extracted files to go into a single folder instead**, modify this part of the script:
```python
# Determine output directory based on the source file location
base_dir = os.path.dirname(html_file_path)
output_directory = os.path.join(base_dir, "Extracted_Conversations")
```
**Change it to:**
```python
# Centralized output directory
output_directory = "G:/My Drive/_Data/ChatGPT/Extracted_Conversations"
```

This will ensure that **all extracted conversations** go into a **single folder** instead of separate ones.
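
One thing to watch with a single shared folder: filenames start with a per-file index, so `1_Some_Title.sql` from one export could overwrite a same-named file from another. A minimal sketch of one way around that, assuming you are happy to prefix each filename with the name of the export folder (`central_file_path` is a hypothetical helper, not part of the script):

```python
import os

# Centralized output directory from the example above
CENTRAL_OUTPUT = "G:/My Drive/_Data/ChatGPT/Extracted_Conversations"

def central_file_path(html_file_path, index, title_clean, output_directory=CENTRAL_OUTPUT):
    """Build an output path whose filename includes the export folder name."""
    # e.g. ".../ChatGPTHistory-2024-01-01/chat.html" -> "ChatGPTHistory-2024-01-01"
    export_name = os.path.basename(os.path.dirname(html_file_path))
    filename = f"{export_name}_{index}_{title_clean}.sql"
    return os.path.join(output_directory, filename)

print(central_file_path(
    "G:/My Drive/_Data/ChatGPT/ChatGPTHistory-2024-01-01/chat.html",
    1,
    "Conversation_Title",
))
```

Inside the loop, the result of this helper would take the place of the existing `file_path` assignment.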

Would you like me to update the script for that?

