Snippet Name: Process_multiple_chat_files
Tags:
Created Date: 2025-06-16 13:12:40
Last Modified Date: 2025-05-31 10:32:08
Dependencies:
DBA Notes:
Sample Output:
Title: Process multiple chat files
Author: user
help me modify the attached python script. right now, it is hardcoded to a single folder. instead, i would like it to find each chat.html found under the folder file:///G:/My%20Drive/_Data/ChatGPT/ and process each of them
Author: ChatGPT
Here's an updated version of your script that will search for all `chat.html` files under `G:/My Drive/_Data/ChatGPT/` and process each one:
### **Modifications:**
1. Uses `os.walk()` to recursively find all `chat.html` files.
2. Processes each found file instead of a single hardcoded path.
3. Ensures the output directory remains organized based on the original file's location.
### **Updated Script:**
```python
import asyncio
import os
import re
from pathlib import Path

from pyppeteer import launch


async def process_chat_html(html_file_path):
    """Processes a single chat.html file and extracts conversations."""
    browser = await launch(headless=True, executablePath='C:/Program Files/Google/Chrome/Application/chrome.exe')
    try:
        page = await browser.newPage()
        # Build a properly encoded file:// URL (handles spaces and backslashes in the path)
        await page.goto(Path(html_file_path).absolute().as_uri())
        await page.waitForSelector('.conversation')
        conversation_elements = await page.querySelectorAll('.conversation')
        # Determine output directory based on the source file location
        base_dir = os.path.dirname(html_file_path)
        output_directory = os.path.join(base_dir, "Extracted_Conversations")
        os.makedirs(output_directory, exist_ok=True)
        for index, conversation in enumerate(conversation_elements, start=1):
            title = await page.evaluate('(element) => element.querySelector("h4").textContent', conversation)
            # Strip characters that Windows does not allow in file names
            title_clean = re.sub(r'[<>:"/\\|?*]', '', title)
            messages = await conversation.querySelectorAll('.message')
            conversation_text = f"Title: {title}\n"
            for message in messages:
                author = await page.evaluate('(element) => element.querySelector(".author").textContent', message)
                text = await page.evaluate('(element) => element.querySelector("div:nth-child(2)").textContent', message)
                conversation_text += f"Author: {author}\n{text}\n\n"
            filename = f'{index}_{title_clean}.sql'
            file_path = os.path.join(output_directory, filename)
            with open(file_path, 'w', encoding='utf-8') as output_file:
                output_file.write(conversation_text)
            print(f'Saved: {file_path}')
    finally:
        # Close the browser even if extraction fails partway through
        await browser.close()


async def find_and_process_chats(root_folder):
    """Finds all chat.html files under the specified root folder and processes them."""
    chat_files = []
    for root, _, files in os.walk(root_folder):
        for file in files:
            if file.lower() == "chat.html":
                chat_files.append(os.path.join(root, file))
    if not chat_files:
        print("No chat.html files found.")
        return
    for chat_file in chat_files:
        print(f"Processing: {chat_file}")
        await process_chat_html(chat_file)


if __name__ == "__main__":
    root_folder = "G:/My Drive/_Data/ChatGPT"
    asyncio.run(find_and_process_chats(root_folder))
```
### **Changes & Improvements**
- **Recursively searches** for `chat.html` files.
- **Processes each found file** separately.
- **Saves extracted conversations** in a subfolder (`Extracted_Conversations`) inside each respective chat history folder.
- **Removes Windows-illegal characters** from filenames.

Now, when you run the script, it will automatically scan and extract data from **all** `chat.html` files in the given directory.
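If you want to check the filename cleanup on its own, here is a minimal sketch. The helper name `clean_title` is hypothetical (it is not part of your script), and the extra `.strip()` is my own assumption to avoid leading or trailing spaces in filenames.

```python
import re

# Characters Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*]')


def clean_title(title):
    """Hypothetical helper: drop Windows-illegal characters from a title."""
    # .strip() is an added assumption, not something the script does today.
    return ILLEGAL_CHARS.sub('', title).strip()


if __name__ == "__main__":
    print(clean_title('Fix "C:\\temp" path?'))
```

Running this against a few of your real conversation titles is a quick way to see what the saved filenames will look like before processing every export.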
Author: user
it is not obvious to me where the output folder is for the files generated
Author: ChatGPT
I see how that could be unclear. Here's how the script determines the **output folder** for the extracted conversations:
### **Output Folder Location:**
For each `chat.html` file found, the script creates a corresponding **"Extracted_Conversations"** folder **inside the same directory as the `chat.html` file**.
#### **Example Folder Structure Before Running the Script:**
```
G:/My Drive/_Data/ChatGPT/
├── ChatGPTHistory-2024-01-01/
│   ├── chat.html
│   └── some_other_file.txt
├── ChatGPTHistory-2024-02-01/
│   ├── chat.html
│   └── another_file.json
```
#### **After Running the Script:**
```
G:/My Drive/_Data/ChatGPT/
├── ChatGPTHistory-2024-01-01/
│   ├── chat.html
│   ├── some_other_file.txt
│   └── Extracted_Conversations/   <-- New folder
│       ├── 1_Conversation_Title.sql
│       └── 2_Another_Title.sql
├── ChatGPTHistory-2024-02-01/
│   ├── chat.html
│   ├── another_file.json
│   └── Extracted_Conversations/   <-- Another new folder
│       ├── 1_Example_Title.sql
│       └── 2_Different_Chat.sql
```
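If you would like to see where the output folders will land before running the full extraction, a small dry run along these lines could help. This is only a sketch: `planned_output_dirs` is a hypothetical helper name, and it assumes the same `Extracted_Conversations` naming rule as the script. It does not launch a browser or write any files.

```python
import os


def planned_output_dirs(root_folder):
    """Hypothetical helper: list the output folder for each chat.html found."""
    planned = []
    for root, _, files in os.walk(root_folder):
        # Same case-insensitive match the script uses for chat.html.
        if any(name.lower() == "chat.html" for name in files):
            planned.append(os.path.join(root, "Extracted_Conversations"))
    return sorted(planned)


if __name__ == "__main__":
    for folder in planned_output_dirs("G:/My Drive/_Data/ChatGPT"):
        print(folder)
```

Each printed path is a folder the script would create next to the corresponding `chat.html`.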
### **If You Want a Centralized Output Folder**
If you want **all extracted files to go into a single folder instead**, modify this part of the script:
```python
# Determine output directory based on the source file location
base_dir = os.path.dirname(html_file_path)
output_directory = os.path.join(base_dir, "Extracted_Conversations")
```
**Change it to:**
```python
# Centralized output directory
output_directory = "G:/My Drive/_Data/ChatGPT/Extracted_Conversations"
```
This will ensure that **all extracted conversations** go into a **single folder** instead of separate ones.
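One thing to watch with a single folder: filenames are built from the conversation index and title, so two exports that both contain `1_Some Title.sql` would overwrite each other. A possible way around that, sketched below, is to prefix each filename with the name of the export folder. The helper `central_output_path` and its parameters are hypothetical and not part of your script.

```python
import os

# Assumed central location, taken from the example above.
CENTRAL_DIR = "G:/My Drive/_Data/ChatGPT/Extracted_Conversations"


def central_output_path(html_file_path, index, title_clean, central_dir=CENTRAL_DIR):
    """Hypothetical helper: build an output path that includes the export folder name."""
    # e.g. ".../ChatGPTHistory-2024-01-01/chat.html" -> "ChatGPTHistory-2024-01-01"
    export_name = os.path.basename(os.path.dirname(html_file_path))
    filename = f"{export_name}_{index}_{title_clean}.sql"
    return os.path.join(central_dir, filename)
```

With this approach, the first conversation from each export gets a distinct filename even when the titles match.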
Would you like me to update the script for that?