🌀 Recursive Currents is a data-driven exploration of global news, clustering 50,000+ articles by shared named entities (people, places, organizations). It reveals how world narratives connect—across borders, topics, and timelines—through the names that keep appearing together. Built with Python, scikit-learn, and Plotly.

Magken/Recursive-Currents

# 🌀 Recursive Currents: Entity-Based Global News Clustering

**Recursive Currents** is a data visualization project that reveals the hidden architecture of global news coverage by clustering 50,000+ articles based on shared named entities (e.g., people, places, organizations). By recursively analyzing co-occurrence patterns, the project uncovers how geopolitical narratives, economic threads, and social phenomena converge across regions and topics.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/enowgeorge/kosmopulse-annotated-news-dataset-worldwide2025

---

## 📌 Overview

This Colab notebook is the computational backend of the project. It covers:

- Cleaning and canonicalizing named entities  
- Transforming articles into binary entity vectors  
- Constructing a sparse co-occurrence matrix  
- Performing hierarchical clustering of articles  
- Generating recursive trees of entity-based clusters  
- Visualizing the results as interactive treemaps and sunbursts  

---

## 📁 Dataset

- **Input**: `kosmopulse_articles_with_entities.csv`  
  ~50,000 global news articles, each with a list of named entities, headline, source, and date.

---

## ⚙️ Dependencies

Install all required Python packages:

```bash
pip install -r requirements.txt
```

### 🔑 Key Packages

- pandas, numpy, scikit-learn, scipy
- plotly (interactive charts) and kaleido (high-resolution PDF export)

## 🚀 Running the Pipeline

Each notebook section represents a modular step:

### 1. Load & Parse Dataset

Convert stringified entity lists into real Python lists.
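A minimal sketch of this step, assuming the entity column is named `entities` and holds Python-style list literals (both are assumptions; check the actual CSV header). `parse_entity_list` is a hypothetical helper, and the in-memory CSV only stands in for `kosmopulse_articles_with_entities.csv`.

```python
import ast
import io

import pandas as pd


def parse_entity_list(raw):
    """Turn a stringified list such as "['Trump', 'China']" into a real list."""
    if not isinstance(raw, str):
        return []  # missing cells (NaN) become an empty list
    try:
        parsed = ast.literal_eval(raw)  # safe literal parsing, unlike eval()
    except (ValueError, SyntaxError):
        return []  # malformed cell
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Tiny in-memory stand-in for the real CSV file (column names are assumed).
sample_csv = io.StringIO(
    "headline,entities\n"
    "Trade talks resume,\"['China', 'United States']\"\n"
    "Storm hits coast,\n"
)
df = pd.read_csv(sample_csv)
df["entities"] = df["entities"].apply(parse_entity_list)
print(df["entities"].tolist())
```

`ast.literal_eval` is used here because it only evaluates literals, so a malformed or hostile cell cannot execute code.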

### 2. Clean & Canonicalize Entities

Remove junk entities, standardize names (e.g., “us” → “United States of America”), strip punctuation, and remove source-related tokens.
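The cleaning rules could look roughly like the sketch below. The `CANONICAL` and `JUNK` tables are hypothetical examples, not the notebook's actual lists, and `clean_entities` is an illustrative name.

```python
import string

# Hypothetical lookup tables; the notebook's real lists are not shown here.
CANONICAL = {"us": "United States of America", "usa": "United States of America"}
JUNK = {"", "reuters", "ap", "bbc"}  # assumed junk and source-related tokens

STRIP_CHARS = string.punctuation + string.whitespace


def clean_entities(entities):
    """Strip punctuation, drop junk tokens, map aliases, and de-duplicate."""
    cleaned = []
    for entity in entities:
        stripped = entity.strip(STRIP_CHARS)
        key = stripped.lower()
        if key in JUNK:
            continue
        name = CANONICAL.get(key, stripped)
        if name not in cleaned:  # keep first occurrence, preserve order
            cleaned.append(name)
    return cleaned


print(clean_entities(["us", "US", " Reuters ", "Ukraine!", ""]))
# ['United States of America', 'Ukraine']
```

Matching on a lowercased key while keeping the original casing for unmapped names avoids turning every entity into lowercase text.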

### 3. Vectorize Entities

Convert each article into a binary entity vector using scikit-learn's `CountVectorizer`.

### 4. Build Co-Occurrence Matrix

Compute a sparse matrix that counts how often each pair of entities co-occurs across articles.
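With a binary article-by-entity matrix `X`, the product `X.T @ X` gives pairwise co-occurrence counts. The sketch below uses a hand-written toy matrix; zeroing the diagonal (each entity's own article count) is an assumption about what the notebook does.

```python
import numpy as np
from scipy import sparse

# Toy binary matrix: rows are articles, columns are entities.
X = sparse.csr_matrix(np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]))

# Entry (i, j) counts the articles that mention both entity i and entity j.
cooc = (X.T @ X).tolil()
cooc.setdiag(0)  # drop self-co-occurrence (assumed to be unwanted)
cooc = cooc.tocsr()
cooc.eliminate_zeros()  # remove explicitly stored zeros

print(cooc.toarray())
```

Because the product stays sparse, this scales to large vocabularies as long as most entity pairs never appear together.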

### 5. Cluster Articles

Use `TruncatedSVD` to reduce dimensionality, then SciPy's hierarchical clustering to group articles recursively.
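A small sketch of the two stages on toy data. The number of components, the Ward linkage method, and the two-cluster cut are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy binary matrix with two obvious groups of articles (rows).
X = csr_matrix(np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float))

# Stage 1: project the sparse entity vectors into a low-dimensional space.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)

# Stage 2: agglomerative clustering; Z encodes the full merge tree.
Z = linkage(reduced, method="ward")

# Cutting the tree into two flat clusters is only for inspection here.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`TruncatedSVD` accepts sparse input directly, which is why it is a common choice before distance-based clustering on large entity matrices.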

### 6. Generate Recursive Tree

Traverse the clustering tree and label each internal node with the most frequent entity among its children.
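The traversal might be sketched as follows, using SciPy's `to_tree` to walk the linkage result. Here a node's label is the most frequent entity across all articles beneath it, which is one reading of "among its children". The `label`/`size`/`children` dictionary layout and the `build_tree` helper are hypothetical, and the 2-D points stand in for SVD-reduced vectors.

```python
import json
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Toy data: entity lists plus 2-D coordinates standing in for reduced vectors.
article_entities = [
    ["Trump", "China"],
    ["Trump", "Ukraine"],
    ["China", "Singapore"],
    ["China", "Pakistan"],
]
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
root = to_tree(linkage(points, method="ward"))


def build_tree(node, max_depth, depth=0):
    """Return a nested dict labelled by the most frequent entity per node."""
    leaf_ids = node.pre_order()  # indices of all articles under this node
    counts = Counter(e for i in leaf_ids for e in article_entities[i])
    result = {"label": counts.most_common(1)[0][0], "size": len(leaf_ids)}
    if not node.is_leaf() and depth < max_depth:
        result["children"] = [
            build_tree(child, max_depth, depth + 1)
            for child in (node.get_left(), node.get_right())
        ]
    return result


tree = build_tree(root, max_depth=1)
print(json.dumps(tree, indent=2))
```

Limiting `max_depth` keeps the exported tree small enough to plot, since a full binary merge tree has one leaf per article.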

### 7. Visualize

Use Plotly to render:

- Treemaps: area-based hierarchical clusters of entity groups
- Sunbursts: radial recursive cluster visualization

Both visualizations display the top 3–5 levels of clustering.
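Both chart types take the same flat `ids`/`parents`/`labels`/`values` input, so a nested tree has to be flattened first. The sketch below assumes a hypothetical nested-dict tree with `label`, `size`, and `children` keys; `flatten_tree` and `make_figures` are illustrative names, not functions from the notebook.

```python
def flatten_tree(node, max_depth, parent_id="", path="root", depth=0):
    """Flatten a nested {'label', 'size', 'children'} dict into plot rows."""
    rows = [{
        "id": path,  # path-based ids stay unique even when labels repeat
        "parent": parent_id,
        "label": node["label"],
        "value": node["size"],
    }]
    if depth < max_depth:
        for i, child in enumerate(node.get("children", [])):
            rows += flatten_tree(child, max_depth, path, f"{path}/{i}", depth + 1)
    return rows


def make_figures(tree, max_depth=5):
    """Build a treemap and a sunburst from the same flattened rows."""
    # Imported here so flatten_tree can be used without Plotly installed.
    import plotly.graph_objects as go

    rows = flatten_tree(tree, max_depth)
    common = dict(
        ids=[r["id"] for r in rows],
        parents=[r["parent"] for r in rows],
        labels=[r["label"] for r in rows],
        values=[r["value"] for r in rows],
        branchvalues="total",  # a parent's value already includes its children
    )
    return go.Figure(go.Treemap(**common)), go.Figure(go.Sunburst(**common))


# Toy tree in the assumed layout.
tree = {
    "label": "China", "size": 4,
    "children": [
        {"label": "Trump", "size": 2, "children": [
            {"label": "Ukraine", "size": 1},
            {"label": "Trump", "size": 1},
        ]},
        {"label": "China", "size": 2},
    ],
}
rows = flatten_tree(tree, max_depth=1)
print([r["id"] for r in rows])
```

Calling `make_figures(tree, max_depth=5)` would return the two figures; the depth limit corresponds to the number of levels displayed.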


## 📊 Sample Output

- `entity_cluster_tree.json`: JSON-formatted recursive tree
- `treemap_top_5_layers.pdf`: PDF of the treemap visualization
- `sunburst_top_5_layers.pdf`: PDF of the sunburst visualization

## 💡 What It Shows

This isn’t just a popularity chart.

The results reflect how different regions and media ecosystems interpret world events. Clusters often converge on narrative hubs such as Trump, the U.S., China, or Ukraine, not always because these entities are central to the story, but because they act as semantic intermediaries in a web of global discourse.


## ✨ Example Insight

“Why are Trump, Pakistan, the UK, and Singapore connected?”

Because they often co-occur through diplomatic events, trade agreements, or security stories — even if indirectly. The recursive structure reveals these flows, which traditional keyword or popularity analysis would miss.


## 📅 Export

Plotly uses kaleido internally to save high-resolution PDF visualizations (e.g., for publication or sharing).

See the relevant notebook cells for exporting:

```python
fig.write_image("treemap_top_5_layers.pdf", format="pdf", width=3000, height=2000)
```

## 🔧 Credits

Developed as part of the Kosmopulse project. Entity extraction and article sourcing by [Your Name or Organization].


## 📜 License

MIT License. Attribution is appreciated if the project is used in publications, visual media, or research.
