62 changes: 53 additions & 9 deletions Readme.md
@@ -2,8 +2,17 @@

## Overview

A simple and straightforward Python utility that converts a Markdown file (`.md`) to a Microsoft Word document (`.docx`). It supports many Markdown elements, including headings, bold and italic text, and both unordered and ordered lists.
A simple and straightforward Python utility that converts Markdown files (`.md`) to Microsoft Word documents (`.docx`) and vice versa. It supports many Markdown elements, including headings, bold and italic text, and both unordered and ordered lists.

## Word to Markdown Conversion Example:
#### Input .docx file:
![image](https://github.com/user-attachments/assets/2891ebdf-ff36-4fd5-af2f-b35413264b06)

#### Output .md file:
![image](https://github.com/user-attachments/assets/e46c096b-762e-4f0c-a0ab-f81c3069a533)


## Markdown to Word Conversion Example:
#### Input .md file:
![image](https://github.com/user-attachments/assets/c2325e52-05a7-4e11-8f28-4eeb3d8c06f5)

@@ -13,18 +22,22 @@ Simple and straight forward Python utility that converts a Markdown file (`.md`)

## Features

- Converts Markdown headers (`#`, `##`, `###`) to Word document headings.
- Supports bold and italic text formatting.
- Converts unordered (`*`, `-`) and ordered (`1.`, `2.`) lists.
- Handles paragraphs with mixed content.
- Bi-directional conversion between Markdown and Word documents
- Handles code blocks in Word documents written in various programming languages, such as Python and Ruby
- Converts Markdown headers (`#`, `##`, `###`) to Word document headings and back
- Supports bold and italic text formatting
- Converts unordered (`*`, `-`) and ordered (`1.`, `2.`) lists
- Handles paragraphs with mixed content
- Preserves document structure during conversion
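
As an illustration of the heading and list features above, the following sketch maps a Word paragraph style name to the Markdown prefix for that line. The function name `markdown_prefix` and the style names it recognizes (`Heading N`, `List Bullet`, `List Number`) are assumptions made for this example; they are not part of the package's API.

```python
import re


def markdown_prefix(style_name):
    """Return the Markdown prefix for a Word paragraph style name (hypothetical helper)."""
    style = style_name.lower()
    if style.startswith("heading"):
        # Use the trailing digits as the level; Markdown has at most six levels
        match = re.search(r"\d+$", style)
        level = min(int(match.group()), 6) if match else 1
        return "#" * level + " "
    if style.startswith("list bullet"):
        return "* "
    if style.startswith("list number"):
        return "1. "
    return ""  # plain paragraph, no prefix


for name in ["Heading 2", "List Bullet", "List Number", "Normal"]:
    print(repr(markdown_prefix(name) + "text"))
```

Keeping this mapping in a pure function makes it possible to test the style handling without opening a Word file.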

## Prerequisites

You need to have Python installed on your system along with the following libraries:

- `markdown` for converting Markdown to HTML.
- `python-docx` for creating and editing Word documents.
- `beautifulsoup4` for parsing HTML.
- `markdown` for converting Markdown to HTML
- `python-docx` for creating and editing Word documents
- `beautifulsoup4` for parsing HTML
- `mammoth` for converting Word to HTML
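
A quick way to confirm that the prerequisites are available is to look up each import module before running a conversion. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the only facts it relies on are that `python-docx` is imported as `docx` and `beautifulsoup4` as `bs4`.

```python
import importlib.util

# PyPI name -> import name. Two of the packages are imported under a
# different name than the one used to install them.
REQUIRED = {
    "markdown": "markdown",
    "python-docx": "docx",
    "beautifulsoup4": "bs4",
    "mammoth": "mammoth",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All prerequisites are installed.")
```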


@@ -74,7 +87,33 @@ This code will create a file named `amazon_case_study.docx`, which is the conver

---

#### For Converting Word to Markdown
Use the `word_to_markdown()` function to convert your Word document to Markdown:

```python
word_to_markdown(word_file, markdown_file)
```

- `word_file`: The path to the Word document you want to convert
- `markdown_file`: The desired path and name for the output Markdown file


Here's a complete example:

```python
from md2docx_python.src.docx2md_python import word_to_markdown

# Define the paths to your files
word_file = "sample_files/test_document.docx"
markdown_file = "sample_files/test_document_output.md"

# Convert the Word document to a Markdown file
word_to_markdown(word_file, markdown_file)
```

This code will create a file named `test_document_output.md`, which is the conversion of `test_document.docx` to the Markdown format.
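
When several documents need converting, the output path can be derived from the input path instead of being typed by hand. The sketch below is an assumption-laden illustration: `markdown_path_for` and `convert_folder` are hypothetical helpers, and the converter is passed in as an argument so the code does not depend on how the package is installed.

```python
from pathlib import Path


def markdown_path_for(word_file, suffix="_output"):
    """Build 'dir/name_output.md' from 'dir/name.docx' (hypothetical helper)."""
    source = Path(word_file)
    return source.with_name(f"{source.stem}{suffix}.md")


def convert_folder(folder, converter):
    """Call converter(word_file, markdown_file) for every .docx file in folder."""
    outputs = []
    for word_file in sorted(Path(folder).glob("*.docx")):
        markdown_file = markdown_path_for(word_file)
        converter(str(word_file), str(markdown_file))
        outputs.append(markdown_file)
    return outputs
```

With the package installed, `convert_folder("sample_files", word_to_markdown)` would convert each Word document in that folder and return the list of Markdown paths it wrote.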

---

## Why this repo and not others?

@@ -108,6 +147,11 @@ Here are some reasons why this repo might be considered better or more suitable
### 8. **Privacy**
- If you work at a company and convert your Markdown files to Word with an online tool, the tool may store your files, which could leak sensitive company information. With this repo, the conversion runs entirely on your own system.

### 9. **Bi-directional Conversion**
- **Complete Workflow**: Convert documents in both directions, allowing for round-trip document processing
- **Format Preservation**: Maintains formatting and structure when converting between formats
- **Flexibility**: Easily switch between Markdown and Word formats based on your needs
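
One way to check how well a round trip preserves a document is to compare the original Markdown with the Markdown obtained after converting to Word and back. The following is a rough sketch using the standard library's `difflib`; the function name `roundtrip_similarity` is hypothetical, and the line-based comparison is a simplifying assumption rather than a measure the project itself defines.

```python
import difflib


def roundtrip_similarity(original_md, roundtrip_md):
    """Return a 0..1 similarity ratio between two Markdown strings.

    Blank lines and trailing whitespace are ignored, since converters
    often differ only in spacing.
    """

    def normalize(text):
        return [line.rstrip() for line in text.splitlines() if line.strip()]

    matcher = difflib.SequenceMatcher(
        None, normalize(original_md), normalize(roundtrip_md)
    )
    return matcher.ratio()
```

A ratio of `1.0` means every non-blank line survived unchanged; lower values indicate lines that were altered, dropped, or added during conversion.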

### Comparison to Other Scripts
- **Feature Set**: Some scripts may lack comprehensive support for Markdown features or may not handle lists and text formatting well.
- **Performance**: Depending on the implementation, performance might vary. This script is designed to be efficient for typical Markdown files.
95 changes: 95 additions & 0 deletions build/lib/md2docx_python/src/docx2md_python.py
@@ -0,0 +1,95 @@
from docx import Document
import re


def word_to_markdown(word_file, markdown_file):
    """
    Convert a Word document to Markdown format

    Args:
        word_file (str): Path to the input Word document
        markdown_file (str): Path to the output Markdown file
    """
    # Open the Word document
    doc = Document(word_file)
    markdown_content = []

    for paragraph in doc.paragraphs:
        # Skip empty paragraphs
        if not paragraph.text.strip():
            continue

        # Get paragraph style
        style = paragraph.style.name.lower()

        # Handle code blocks
        if style.startswith("code block") or style.startswith("source code"):
            markdown_content.append(f"```\n{paragraph.text.strip()}\n```\n\n")
            continue

        # Handle headings
        if style.startswith("heading"):
            # Read the level from the trailing digits of the style name;
            # default to 1 if there are none and cap at Markdown's six levels
            match = re.search(r"\d+$", style)
            level = min(int(match.group()), 6) if match else 1
            markdown_content.append(f"{'#' * level} {paragraph.text.strip()}\n")
            continue

        # Handle lists
        if style.startswith("list bullet"):
            markdown_content.append(f"* {paragraph.text.strip()}\n")
            continue
        if style.startswith("list number"):
            markdown_content.append(f"1. {paragraph.text.strip()}\n")
            continue

        # Handle regular paragraphs with formatting
        formatted_text = ""
        for run in paragraph.runs:
            text = run.text
            if text.strip():
                # Handle inline code (typically monospace font)
                if run.font.name in [
                    "Consolas",
                    "Courier New",
                    "Monaco",
                ] or style.startswith("code"):
                    if "\n" in text:
                        text = f"```\n{text}\n```"
                    else:
                        text = f"`{text}`"
                # Apply both bold and italic (checked first so that the
                # single-flag branches below do not shadow it)
                elif run.bold and run.italic:
                    text = f"***{text}***"
                # Apply bold
                elif run.bold:
                    text = f"**{text}**"
                # Apply italic
                elif run.italic:
                    text = f"*{text}*"
            formatted_text += text

        if formatted_text:
            markdown_content.append(f"{formatted_text}\n")

        # Add an extra newline after paragraphs
        markdown_content.append("\n")

    # Write to markdown file
    with open(markdown_file, "w", encoding="utf-8") as f:
        f.writelines(markdown_content)


def clean_markdown_text(text):
    """
    Clean and normalize markdown text

    Args:
        text (str): Text to clean

    Returns:
        str: Cleaned text
    """
    # Collapse runs of spaces and tabs; newlines are left alone so the
    # blank-line rule below still has something to match
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n\s*\n\s*\n", "\n\n", text)
    return text.strip()
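
The inline formatting step in `word_to_markdown` depends on the order in which the monospace, bold, and italic checks run. The sketch below isolates that decision in a pure function so it can be tested without a Word file. The name `format_run` and its parameters are hypothetical and not part of this module; the font list mirrors the one used in `word_to_markdown`.

```python
MONOSPACE_FONTS = {"Consolas", "Courier New", "Monaco"}


def format_run(text, bold=False, italic=False, font_name=None):
    """Wrap one run's text in Markdown markers (hypothetical helper)."""
    if not text.strip():
        return text  # leave whitespace-only runs untouched
    if font_name in MONOSPACE_FONTS:
        return f"`{text}`"
    if bold and italic:
        # Must be tested before the single-flag cases, otherwise the
        # bold branch would match first and italic would be lost
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text
```

In python-docx, `run.bold` and `run.italic` can be `None` when the value is inherited from the style, so the truthiness checks here treat inherited formatting as not set.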
Binary file added dist/md2docx_python-1.0.0-py3-none-any.whl
Binary file not shown.