Extract_Bank_Statement_Data

Snippet Name: Extract_Bank_Statement_Data

Tags:

Created Date: 2025-06-16 13:12:50
Last Modified Date: 2025-05-31 10:32:19

Dependencies:

DBA Notes:

Sample Output:

Title: Extract Bank Statement Data
Author: user
I have a folder containing 12 months of bank statements in PDF format. Can you help me write a Python script to extract the details that are in the table labeled "Account Activity For Share Draft"?

Author: ChatGPT
Thanks for uploading the file. Based on the PDF content, we can target the section labeled **"ACCOUNT ACTIVITY FOR SHARE DRAFT"**, which contains a structured table format including:

- **Date Posted**
- **Description**
- **Debit Amount**
- **Credit Amount**
- **Balance**

We'll write a Python script using `pdfplumber` to extract those tabular entries specifically from that section. Here's a sample script to get you started:

---

### Python Script to Extract "Account Activity for Share Draft"
```python
import pdfplumber
import re
import csv

# Path to your PDF
pdf_path = 'Jan_2024_Document.pdf'

# Regular expression patterns
row_pattern = re.compile(r'^\d{2}/\d{2}/\d{4}')
amount_pattern = re.compile(r'-?\d{1,3}(?:,\d{3})*(?:\.\d{2})?$')

entries = []
inside_table = False

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if not text:
            continue

        lines = text.split('\n')

        for i, line in enumerate(lines):
            # Identify the start of the target section
            if "ACCOUNT ACTIVITY FOR SHARE DRAFT" in line:
                inside_table = True
                continue

            # If we're inside the section
            if inside_table:
                if line.strip() == "" or "CLEARED CHECKS" in line:
                    inside_table = False  # End of table
                    break

                # If line starts with a date, it's the beginning of a transaction
                if row_pattern.match(line.strip()):
                    # Combine multi-line descriptions
                    description = ""
                    j = i + 1
                    while j < len(lines) and not row_pattern.match(lines[j]) and not amount_pattern.search(lines[j]):
                        description += " " + lines[j].strip()
                        j += 1

                    # Extract all amounts from the line and subsequent lines
                    full_text = line + " " + description
                    amounts = amount_pattern.findall(full_text)

                    # Only process lines with at least 2 amounts
                    if len(amounts) >= 2:
                        date = line[:10]
                        desc = full_text[10:].strip()
                        debit, credit, balance = ("", "", "")
                        if len(amounts) == 3:
                            debit, credit, balance = amounts
                        elif '-' in amounts[0]:
                            debit, balance = amounts[0], amounts[1]
                        else:
                            credit, balance = amounts[0], amounts[1]

                        entries.append([date, desc, debit, credit, balance])

# Save to CSV
with open("share_draft_activity.csv", "w", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Description", "Debit", "Credit", "Balance"])
    writer.writerows(entries)

print("Extraction complete. Data saved to 'share_draft_activity.csv'.")
```

---

### Notes:
- This assumes `pdfplumber` is installed (`pip install pdfplumber`); a preview sketch after this list shows how to check what it extracts from each page
- The script handles multi-line descriptions and extracts up to 3 monetary values: **debit**, **credit**, and **balance**
- It stops when it reaches the **"CLEARED CHECKS"** section to avoid pulling unrelated content
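
If the heading text in your statements differs slightly (extra spaces, different casing, a "(continued)" suffix), the easiest way to check is to print what `pdfplumber` actually extracts before parsing. A minimal preview sketch, assuming the same sample file name used above:

```python
import pdfplumber

pdf_path = 'Jan_2024_Document.pdf'  # assumed sample file name from the script above

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""
        # Show only the lines mentioning the target section so the output stays short
        for line in text.split('\n'):
            if "ACCOUNT ACTIVITY" in line.upper():
                print(f"page {page_num}: {line}")
```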

Would you like to expand this to loop over all PDFs in a folder for automation?

Author: user
I've only made a trivial change to the code related to the paths for the input and output. However, the actual CSV file only has the header information; the variable "entries" seems to have no content.

Author: ChatGPT
Thanks for sharing the full script. You're right to check `entries`, since it looks like it's not being populated. Here's what's likely happening and how to fix it:

---

### Root Cause:
Your condition:
```python
if "ACCOUNT ACTIVITY FOR SHARE DRAFT" in line:
```
detects the start of the section, but the original loop then gives up too early: it breaks out of the page at the first blank line, even though the section is **continued across pages** (e.g., "ACCOUNT ACTIVITY FOR SHARE DRAFT (continued)"). As a result, the continued rows on later pages are never parsed.

---

### Fix Strategy:
We'll make the following improvements:

1. **Loosen the match** so the heading is still recognized when it appears as `"ACCOUNT ACTIVITY FOR SHARE DRAFT (continued)"`.
2. **Preserve state across pages**, rather than stopping at the first blank line or at "CLEARED CHECKS".
3. **Improve transaction tracking** (handle multi-line descriptions better and avoid skipping lines).

---

### Updated Fix Code Snippet:
Update the relevant portion like this:

```python
# Start condition becomes fuzzy match
if "ACCOUNT ACTIVITY FOR SHARE DRAFT" in line:
    inside_table = True
    continue

# Stay inside table until a reliable stop signal
if inside_table:
    if "ACCOUNT ACTIVITY FOR" in line and "SHARE DRAFT" not in line:
        inside_table = False
        continue
    if "CLEARED CHECKS" in line:
        inside_table = False
        continue

    # If line starts with a date, it's a new transaction
    if row_pattern.match(line.strip()):
        description = ""
        date = line[:10]
        desc_line = line[10:].strip()
        j = i + 1

        # Collect continuation lines (typically indented)
        while j < len(lines) and not row_pattern.match(lines[j]):
            continuation = lines[j].strip()
            if continuation == "" or amount_pattern.match(continuation):
                break
            desc_line += " " + continuation
            j += 1

        full_line = line + " " + desc_line
        amounts = amount_pattern.findall(full_line)

        if len(amounts) >= 2:
            debit, credit, balance = ("", "", "")
            if len(amounts) == 3:
                debit, credit, balance = amounts
            elif '-' in amounts[0]:
                debit, balance = amounts[0], amounts[1]
            else:
                credit, balance = amounts[0], amounts[1]

            entries.append([date, desc_line, debit, credit, balance])
```

---

### Final Tip:
Try printing out a few values while iterating to debug:
```python
print(f"DEBUG: Date={date}, Desc={desc_line}, Amounts={amounts}")
```

Let me know if you'd like to:
- Expand this to loop through multiple PDFs
- Add filtering by date or amount ranges
- Normalize the output formatting

Want me to patch and return the corrected version of your script with all the above applied?

Author: user
It appears that the [Description] column varies from two to three lines per entry. As a result, the string " Withdrawal Bill Payment #400122103298 
 GOOGLE *Auxbrain Inc 855-836-3987 CA
 Card 2538" is misinterpreted, and the last three characters "538" are misapplied as the amount, for example. Items with only two lines end up missing the Amount completeyl. here's the debug statements for the first four rows. DEBUG: Date=01/01/2024, Desc=Withdrawal Bill Payment #400122103298 -3.99 1,007.88 GOOGLE *Auxbrain Inc 855-836-3987 CA Card 2538, Amounts=['538']
DEBUG: Date=01/01/2024, Desc=Withdrawal Bill Payment #400120103354 -5.99 1,001.89 GOOGLE *SayGames 855-836-3987 CA Card 2538, Amounts=['538']
DEBUG: Date=01/04/2024, Desc=Withdrawal ACH MSPBNA -300.00 701.89 TYPE: ACH TRNSFR ID: 1391321258 DATA: E*TRADE SECURITIES CO: MSPBNA, Amounts=[]
DEBUG: Date=01/05/2024, Desc=Deposit Transfer From Share 00 5,000.00 5,701.89, Amounts=['5,701.89']

Author: ChatGPT
Thanks for the detailed debug output, that's really helpful. You're exactly right: the problem comes down to **how we decide where the description ends and where the amounts begin**, especially since text like `Card 2538` falsely matches your amount regex.

---

### Updated Plan

We need to **decouple the amount detection from the description** by doing the following:

1. **Capture all lines** for a transaction until the next date or section change.
2. **At the end of the captured block**, extract:
   - First line: the **date**
   - The joined lines: the **description**
   - The end of the combined text: up to **3 valid monetary values**

---

### Updated Code Logic

Here's a cleaner way to rework the loop so that it always groups lines reliably and extracts the right values:

```python
import pdfplumber
import re
import csv

pdf_path = 'C:/Users/lowell/Downloads/Jan_2024_Document.pdf'

row_start_pattern = re.compile(r'^\d{2}/\d{2}/\d{4}')
amount_pattern = re.compile(r'-?\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})')

entries = []
inside_table = False
buffer = []

def process_transaction(block):
    if not block:
        return None
    # First line holds the date
    date = block[0][:10]
    full_text = ' '.join(block)
    amounts = amount_pattern.findall(full_text)

    # Get the last two or three valid amount strings, stripping $ if present
    cleaned_amounts = [a.replace('$','') for a in amounts[-3:]]

    debit = credit = balance = ""
    if len(cleaned_amounts) == 3:
        debit, credit, balance = cleaned_amounts
    elif len(cleaned_amounts) == 2:
        if '-' in cleaned_amounts[0]:
            debit, balance = cleaned_amounts
        else:
            credit, balance = cleaned_amounts
    else:
        return None  # Ignore malformed

    # Keep the full block text as the description (the amount strings remain in the text)
    desc_lines = block.copy()
    desc = ' '.join(desc_lines).strip()
    return [date, desc, debit, credit, balance]


with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if not text:
            continue

        lines = text.split('\n')
        for line in lines:
            if "ACCOUNT ACTIVITY FOR SHARE DRAFT" in line:
                inside_table = True
                continue
            if inside_table:
                if "CLEARED CHECKS" in line or "ACCOUNT ACTIVITY FOR" in line and "SHARE DRAFT" not in line:
                    inside_table = False
                    continue

                if row_start_pattern.match(line.strip()):
                    if buffer:
                        tx = process_transaction(buffer)
                        if tx:
                            entries.append(tx)
                        buffer = []
                    buffer.append(line.strip())
                else:
                    if inside_table:
                        buffer.append(line.strip())

        # At end of page, flush the buffer
        if buffer:
            tx = process_transaction(buffer)
            if tx:
                entries.append(tx)
            buffer = []

# Output CSV
with open("C:/Users/lowell/Downloads/share_draft_activity.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Date", "Description", "Debit", "Credit", "Balance"])
    writer.writerows(entries)

print(f"??? Extracted {len(entries)} transactions to 'share_draft_activity.csv'")
```

---

### Why This Works Better

- **Multi-line descriptions** are safely grouped.
- **Amounts must include a cents portion and are taken only from the end of the block**, so text like `"Card 2538"` is no longer misread as an amount (see the quick check after this list).
- Cleaner separation of parsing into `process_transaction()` for easier debugging.
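
As a quick check, here is the tightened pattern applied to the first line from your debug output; because a `.dd` cents portion is now required, strings like `2538` or `855-836-3987` no longer match and only the real amounts survive:

```python
import re

amount_pattern = re.compile(r'-?\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})')

sample = ("Withdrawal Bill Payment #400122103298 -3.99 1,007.88 "
          "GOOGLE *Auxbrain Inc 855-836-3987 CA Card 2538")

# Only values with a .dd cents part match now
print(amount_pattern.findall(sample))  # ['-3.99', '1,007.88']
```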

---

Let me know if you'd like this modified further to:
- Add a `TransactionType` field
- Normalize negative numbers
- Format dates
- Process multiple PDFs in a folder

Want me to apply this and upload the fixed script file for download?

Author: user
That worked perfectly.

Author: ChatGPT
Awesome, glad to hear it worked perfectly!

If you plan to extract future months the same way, here are a few upgrade ideas you might want to consider later:

- **Batch mode**: Loop through a folder of PDFs and generate one CSV per file (or one combined file).
- **Cleaner output**: Trim excess whitespace, normalize text casing, or add a transaction type classifier (e.g., ATM, ACH, etc.).
- **Date filtering**: Only include rows from a specific date range or calendar month (see the sketch after this list).
- **Analysis-ready format**: Output to Excel or a SQLite DB for easier tracking/reporting.
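
For example, date filtering could be a small post-processing step on the `entries` list the current script already builds. A sketch, assuming the MM/DD/YYYY dates shown in your statements:

```python
from datetime import datetime

def in_month(date_str, year, month):
    """True if an MM/DD/YYYY date string falls in the given year and month."""
    try:
        d = datetime.strptime(date_str, "%m/%d/%Y")
    except ValueError:
        return False
    return d.year == year and d.month == month

# entries holds [date, desc, debit, credit, balance] rows built by the script above
january_2024 = [row for row in entries if in_month(row[0], 2024, 1)]
print(f"{len(january_2024)} transactions in January 2024")
```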

Just let me know if/when you'd like to extend it. Happy to help with the next step.

Author: user
I think if we just add a loop for every "*Document.pdf" in a given folder, and append all the output to the CSV file, that would be all we would need.

Author: ChatGPT
Perfect, that's a straightforward and powerful enhancement. Here's an updated version of the script that:

1. **Loops over all** `*Document.pdf` files in a given folder.
2. **Appends** all extracted transactions into **one single CSV file**.
3. Adds a `SourceFile` column to track which statement the row came from.

---

### Script: Loop Over Folder and Append All Output

```python
import pdfplumber
import re
import csv
import os
from pathlib import Path

# Set folder path where PDFs are located
pdf_folder = Path('C:/Users/lowell/Downloads')
output_file = pdf_folder / 'all_share_draft_activity.csv'

# Patterns
row_start_pattern = re.compile(r'^\d{2}/\d{2}/\d{4}')
amount_pattern = re.compile(r'-?\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})')

def process_transaction(block, source_file):
    if not block:
        return None
    date = block[0][:10]
    full_text = ' '.join(block)
    amounts = amount_pattern.findall(full_text)
    cleaned_amounts = [a.replace('$','') for a in amounts[-3:]]

    debit = credit = balance = ""
    if len(cleaned_amounts) == 3:
        debit, credit, balance = cleaned_amounts
    elif len(cleaned_amounts) == 2:
        if '-' in cleaned_amounts[0]:
            debit, balance = cleaned_amounts
        else:
            credit, balance = cleaned_amounts
    else:
        return None

    desc = ' '.join(block).strip()
    return [date, desc, debit, credit, balance, source_file.name]

# Initialize output file
with open(output_file, "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Date", "Description", "Debit", "Credit", "Balance", "SourceFile"])

    for pdf_path in sorted(pdf_folder.glob("*Document.pdf")):
        entries = []
        inside_table = False
        buffer = []

        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if not text:
                    continue

                lines = text.split('\n')
                for line in lines:
                    if "ACCOUNT ACTIVITY FOR SHARE DRAFT" in line:
                        inside_table = True
                        continue
                    if inside_table:
                        if "CLEARED CHECKS" in line or ("ACCOUNT ACTIVITY FOR" in line and "SHARE DRAFT" not in line):
                            inside_table = False
                            continue

                        if row_start_pattern.match(line.strip()):
                            if buffer:
                                tx = process_transaction(buffer, pdf_path)
                                if tx:
                                    entries.append(tx)
                                buffer = []
                            buffer.append(line.strip())
                        else:
                            if inside_table:
                                buffer.append(line.strip())

                # End of page flush
                if buffer:
                    tx = process_transaction(buffer, pdf_path)
                    if tx:
                        entries.append(tx)
                    buffer = []

        # Append entries to the output file
        writer.writerows(entries)
        print(f"??? Processed {len(entries)} transactions from '{pdf_path.name}'")

print(f"\n??? All done. Combined CSV saved to: {output_file}")
```

---

### Output:
- `all_share_draft_activity.csv` will include **all transactions** with an added `SourceFile` column so you know which PDF each row came from.
- Rows are **appended in order of file name** (a quick verification sketch follows below).
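
If you want a quick sanity check after a run, a small sketch like this (assuming the same output path) counts how many rows each statement contributed:

```python
import csv
from collections import Counter
from pathlib import Path

output_file = Path('C:/Users/lowell/Downloads') / 'all_share_draft_activity.csv'

# Tally rows per source PDF using the SourceFile column written by the script
counts = Counter()
with open(output_file, newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        counts[row["SourceFile"]] += 1

for source, n in sorted(counts.items()):
    print(f"{source}: {n} transactions")
```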

---

Want to also include `"Account Activity for Prime Share"` while we're at it?

Author: user
That was flawless. What an excellent implementation, thank you.

Author: ChatGPT
You're very welcome, Lowell, and thank *you* for the kind words and super clear requirements. This kind of well-scoped project is a pleasure to help with.

If you ever want to:

- Add other account types like `"PRIME SHARE"` or `"PERSONAL LOAN"` (see the sketch below)
- Generate monthly spending summaries or charts from the CSV
- Mask sensitive values (e.g., last 4 of card or member number)
- Push results to a database or upload to cloud storage

Just say the word. Happy to build on this anytime.
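
For instance, pulling in another section such as `"ACCOUNT ACTIVITY FOR PRIME SHARE"` would mostly mean replacing the hard-coded heading check with a small list of target headings. A rough sketch (the PRIME SHARE heading text is an assumption and should be verified against your PDFs):

```python
# Sketch only: headings other than SHARE DRAFT are assumptions, check your statements
TARGET_SECTIONS = [
    "ACCOUNT ACTIVITY FOR SHARE DRAFT",
    "ACCOUNT ACTIVITY FOR PRIME SHARE",  # hypothetical heading text
]

def section_starts(line):
    """Return the matching heading if this line opens a section we want, else None."""
    return next((s for s in TARGET_SECTIONS if s in line), None)

def section_ends(line):
    """True when the line signals that the current section is over."""
    return "CLEARED CHECKS" in line or (
        "ACCOUNT ACTIVITY FOR" in line and section_starts(line) is None
    )
```

Inside the page loop you would call `section_starts(line)` instead of the hard-coded `in` check, and you could write the matched heading into an extra CSV column to tell the account types apart.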

