
πŸ•ΈοΈ Web Scraping Financial Data of U.S. Public Companies

This project scrapes financial and organizational data of the top U.S. public companies, ranked by revenue, directly from Wikipedia. It processes the first table on the Wikipedia page for Fortune 500 companies and saves the cleaned data to a structured CSV file for further analysis.


πŸ“ Project Structure

.
β”œβ”€β”€ Web_Scraping.ipynb          # Jupyter Notebook containing scraping and data cleaning logic
β”œβ”€β”€ Public_Company_List.csv     # Final output CSV file with company data
└── README.md                   # Project documentation (this file)

πŸ“Œ What This Project Does

βœ… Automates retrieval of Fortune 500 company data from Wikipedia
βœ… Extracts structured data from the first HTML table on the page
βœ… Dynamically reads and stores table headers
βœ… Cleans the extracted table rows and standardizes them
βœ… Converts the data into a well-formatted pandas.DataFrame
βœ… Saves the final dataset as Public_Company_List.csv in the working directory
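
The checklist above maps onto a short requests/BeautifulSoup/pandas pipeline. Below is a minimal sketch of that flow; the URL, variable names, and cleaning steps are illustrative assumptions rather than the notebook's exact code:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Assumed target page; the notebook may point at a different Wikipedia URL
URL = "https://en.wikipedia.org/wiki/Fortune_500"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")  # the first HTML table on the page

# Read the headers dynamically so columns track Wikipedia's markup
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Each later <tr> holds one company; strip whitespace from every cell
rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(headers):  # skip malformed or header-only rows
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
df.to_csv("Public_Company_List.csv", index=False)
```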


πŸ” Extracted Fields

The CSV file contains the following columns, as dynamically extracted from the table headers:

  • Rank
  • Name
  • Industry
  • Revenue
  • Profit
  • Employees
  • Headquarters
  • and possibly other metadata depending on Wikipedia's table structure

The structure of the table may vary over time, but this notebook adapts by programmatically parsing the headers.
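
For example, Wikipedia headers sometimes carry footnote markers such as "[1]". A small cleaning pass keeps the dynamically parsed column names stable; this helper is an illustrative assumption, not necessarily how the notebook does it:

```python
import re

def clean_header(text: str) -> str:
    # Strip footnote markers like "[1]" and trim whitespace so CSV
    # column names stay stable across edits to the Wikipedia markup.
    return re.sub(r"\[\w+\]", "", text).strip()

print(clean_header("Revenue (US$ millions)[1] "))  # Revenue (US$ millions)
```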


πŸ§ͺ Technologies Used

  • Python 3
  • Jupyter Notebook β€” for step-by-step documentation and reproducibility
  • pandas β€” for handling tabular data
  • requests β€” for fetching webpage content
  • BeautifulSoup4 β€” for HTML parsing and table extraction

πŸš€ How to Run

1. Install Dependencies

You can install the necessary Python libraries using pip:

pip install pandas requests beautifulsoup4

2. Open the Notebook

Open Web_Scraping.ipynb in Jupyter Notebook or Jupyter Lab.

3. Run the Cells

Run all cells sequentially. This will:

  • Fetch the Wikipedia page
  • Parse the HTML table
  • Create a clean dataset
  • Save the output as Public_Company_List.csv

The final dataset will be saved to your working directory.
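
If you prefer to run everything headlessly instead of opening the notebook interactively, Jupyter's nbconvert can execute all cells from the command line (it writes an executed copy of the notebook alongside the original):

jupyter nbconvert --to notebook --execute Web_Scraping.ipynb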


πŸ“‚ Output Description

Public_Company_List.csv
A structured CSV file containing financial and organizational details of the top U.S. companies, including:

| Rank | Name    | Industry | Revenue | Profit | Employees | Headquarters |
|------|---------|----------|---------|--------|-----------|--------------|
| 1    | Walmart | Retail   | $600B   | $13.7B | 2,300,000 | Arkansas     |
| ...  | ...     | ...      | ...     | ...    | ...       | ...          |
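
To sanity-check the output, the CSV loads straight back into pandas:

```python
import pandas as pd

df = pd.read_csv("Public_Company_List.csv")
print(df.head())            # preview the first few companies
print(df.columns.tolist())  # column names as parsed from Wikipedia
```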

πŸ‘¨β€πŸ’» Contributor
