CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

📋 Features

  • Scrapes company names, websites, and leadership information
  • Automatically creates timestamped CSV files
  • Avoids duplicate entries (see the sketch after this list)
  • Real-time progress display
  • Visits company websites to find leadership info
  • Error handling and graceful failures
  • Respects rate limits with delays between requests
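
Duplicate avoidance can be as simple as a set of normalised company names. A minimal sketch (the actual script may key entries differently):

raw_entries = [("Codilar Technologies Pvt.Ltd", "https://www.codilar.com"),
               ("Codilar Technologies Pvt.Ltd", "https://www.codilar.com")]

seen = set()
companies = []
for name, website in raw_entries:
    key = name.strip().lower()      # normalise so repeated names match
    if key not in seen:
        seen.add(key)
        companies.append({"name": name, "website": website})

print(companies)   # one entry; the duplicate is dropped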

🎯 Extracted Information

The scraper extracts the following information for each company:

  1. Company Name - Official company name
  2. Website - Company website URL
  3. MD/CEO/Chairman - Leadership name, when one can be found on the company's website

🛠️ Requirements

Python Version

  • Python 3.6 or higher

Dependencies

pip install requests beautifulsoup4

Or install from requirements.txt:

pip install -r requirements.txt

requirements.txt

requests>=2.31.0
beautifulsoup4>=4.12.0

📦 Installation

  1. Clone or download the script

    # Create a project directory
    mkdir cyberparks-scraper
    cd cyberparks-scraper

  2. Save the script

    • Save the Python code as scraper.py

  3. Install dependencies

    pip install requests beautifulsoup4

🚀 Usage

Basic Usage

Simply run the script:

python scraper.py

What Happens

  1. The script connects to https://cyberparks.in/companies-at-park/
  2. Extracts company names and websites from the main page (a minimal sketch of these first steps follows the list)
  3. Visits each company's website to find leadership information
  4. Displays real-time progress for each company
  5. Creates a CSV file with all extracted data
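
Steps 1 and 2 amount to a requests call followed by BeautifulSoup parsing. A minimal sketch, assuming company entries can be pulled from anchor tags (the selectors in scraper.py depend on the live page structure and may differ):

import requests
from bs4 import BeautifulSoup

URL = "https://cyberparks.in/companies-at-park/"
headers = {"User-Agent": "Mozilla/5.0"}       # plain requests are often blocked

response = requests.get(URL, headers=headers, timeout=15)
response.raise_for_status()                   # fail loudly on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical extraction: treat outbound links as (name, website) pairs
for link in soup.find_all("a", href=True):
    name = link.get_text(strip=True)
    website = link["href"]
    if name and website.startswith("http"):
        print(name, "->", website)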

Output

The script creates a CSV file named: cyberparks_companies_YYYYMMDD_HHMMSS.csv

Example: cyberparks_companies_20250929_143022.csv

Sample Output Format

Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
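
A minimal sketch of how such a file can be written, using the column names from the sample above (the data row is illustrative):

import csv
from datetime import datetime

# Timestamped filename, e.g. cyberparks_companies_20250929_143022.csv
filename = f"cyberparks_companies_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"

with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Company Name", "Website", "MD/CEO/Chairman"])
    writer.writerow(["Codilar Technologies Pvt.Ltd",
                     "https://www.codilar.com", "Mahaveer Devabalan"])

print(f"Data saved to: {filename}")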

📊 Console Output

The script provides detailed console output:

======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website, MD/CEO/Chairman
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
  - Total companies: 72
  - With Leadership info: 45 (62%)
  - With Website: 72 (100%)
======================================================================

Scraping completed!

⚙️ Configuration

Timeout Settings

You can adjust the timeout for HTTP requests:

response = requests.get(url, headers=headers, timeout=15)  # Change timeout value

Rate Limiting

The script includes a 1-second delay between company website visits:

time.sleep(1)  # Adjust delay as needed

Maximum Companies

To limit the number of companies scraped:

if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break

🔧 Troubleshooting

Issue: No companies found

Solution:

  • Check your internet connection
  • Verify the website URL is accessible
  • The website structure may have changed; inspect debug_page.html if one was created

Issue: No leadership information extracted

Solution:

  • Leadership info might not be publicly available on company websites (extraction is heuristic; see the sketch below)
  • Check if the company websites are accessible
  • Some companies may not list executive information online
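
Leadership extraction is heuristic pattern matching, which is why misses are common. A sketch of the kind of regex involved (the actual patterns in scraper.py may differ):

import re

# Capture 2-4 capitalised words directly before a title keyword
TITLE_PATTERN = re.compile(
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3})\s*[,:-]?\s*"
    r"(?:MD|CEO|Chairman|Managing Director|Founder)"
)

page_text = "Mahaveer Devabalan, CEO of Codilar Technologies ..."
match = TITLE_PATTERN.search(page_text)
print(match.group(1) if match else "Not found")   # -> Mahaveer Devabalan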

Issue: Connection timeout

Solution:

# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)

Issue: Too many requests / Rate limiting

Solution:

# Increase delay between requests
time.sleep(2)  # or higher

Important Notes

  1. Terms of Service: Always check the website's Terms of Service before scraping
  2. robots.txt: Respect the website's robots.txt file (see the sketch after this list)
  3. Rate Limiting: The script includes delays to avoid overwhelming the server
  4. Data Usage: Use scraped data responsibly and in accordance with privacy laws
  5. Personal Data: Be mindful of personal information (executive names) and comply with GDPR/data protection laws
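
For note 2, Python's standard library can check robots.txt before any request is made; a minimal sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://cyberparks.in/robots.txt")
rp.read()                               # fetches and parses robots.txt

url = "https://cyberparks.in/companies-at-park/"
if rp.can_fetch("*", url):              # "*" = rules for any user agent
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)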

Best Practices

  • Run the scraper during off-peak hours
  • Use reasonable delays between requests
  • Don't run the scraper too frequently
  • Cache results to avoid repeated scraping (a sketch follows this list)
  • Respect website bandwidth and server resources
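
For the caching point, a small JSON file keyed by URL is enough to avoid re-scraping between runs. A sketch, with cache_results.json as a hypothetical file name:

import json
import os

CACHE_FILE = "cache_results.json"       # hypothetical cache file name

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)

cache = load_cache()
url = "https://www.codilar.com"
if url not in cache:
    cache[url] = "Mahaveer Devabalan"   # result of a fresh scrape
    save_cache(cache)
print(cache[url])                       # served from the cache on later runs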

🐛 Known Limitations

  1. Dynamic Content: May not work with JavaScript-heavy websites (would need Selenium)
  2. Authentication: Cannot access pages requiring login
  3. CAPTCHA: Cannot bypass CAPTCHA protection
  4. Structure Changes: Will need updates if website structure changes
  5. Leadership Info: Not all companies publicly list executive information

🔄 Updates & Maintenance

If the website structure changes:

  1. Run the script to generate debug_page.html
  2. Inspect the HTML structure
  3. Update the CSS selectors and extraction patterns (see the sketch after this list)
  4. Test with a small subset of companies first
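
As an illustration of step 3, updating a selector is usually a small, local change. A hypothetical before/after, assuming the companies moved from a table into div.company-card elements:

from bs4 import BeautifulSoup

# Hypothetical new markup after a site redesign
html = '<div class="company-card"><h3>Acme Ltd</h3><a href="https://acme.example">site</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Old selector that no longer matches: soup.select("table.companies tr")
for card in soup.select("div.company-card"):      # updated selector
    name = card.select_one("h3").get_text(strip=True)
    website = card.select_one("a")["href"]
    print(name, website)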

📧 Support

If you encounter issues:

  1. Check the console output for error messages
  2. Verify all dependencies are installed
  3. Ensure Python version is 3.6+
  4. Check internet connectivity
  5. Verify the target website is accessible

📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

🙏 Acknowledgments

  • Built with Python, Requests, and BeautifulSoup4
  • Designed for CyberParks company directory

Version: 1.0.0
Last Updated: September 29, 2025
Python Version: 3.6+