CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

📋 Features

  • Scrapes company names, websites, and leadership information
  • Automatically creates timestamped CSV files
  • Avoids duplicate entries (see the sketch after this list)
  • Real-time progress display
  • Visits company websites to find leadership info
  • Error handling and graceful failures
  • Respects rate limits with delays between requests
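
Duplicate avoidance can be as simple as a set of normalised company names. A minimal sketch (the actual script may key entries differently):

raw_entries = [("Codilar Technologies Pvt.Ltd", "https://www.codilar.com"),
               ("Codilar Technologies Pvt.Ltd", "https://www.codilar.com")]

seen = set()
companies = []
for name, website in raw_entries:
    key = name.strip().lower()      # normalise so repeated names match
    if key not in seen:
        seen.add(key)
        companies.append({"name": name, "website": website})

print(companies)   # one entry; the duplicate is dropped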

🎯 Extracted Information

The scraper extracts the following information for each company:

  1. Company Name - Official company name
  2. Website - Company website URL
  3. MD/CEO/Chairman - Leadership name, when one can be found on the company's website

🛠️ Requirements

Python Version

  • Python 3.6 or higher

Dependencies

pip install requests beautifulsoup4

Or install from requirements.txt:

pip install -r requirements.txt

requirements.txt

requests>=2.31.0
beautifulsoup4>=4.12.0

📦 Installation

  1. Clone or download the script

    # Create a project directory
    mkdir cyberparks-scraper
    cd cyberparks-scraper

  2. Save the script

    • Save the Python code as scraper.py

  3. Install dependencies

    pip install requests beautifulsoup4

🚀 Usage

Basic Usage

Simply run the script:

python scraper.py

What Happens

  1. The script connects to https://cyberparks.in/companies-at-park/
  2. Extracts company names and websites from the main page (a minimal sketch of these first steps follows the list)
  3. Visits each company's website to find leadership information
  4. Displays real-time progress for each company
  5. Creates a CSV file with all extracted data
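
Steps 1 and 2 amount to a requests call followed by BeautifulSoup parsing. A minimal sketch, assuming company entries can be pulled from anchor tags (the selectors in scraper.py depend on the live page structure and may differ):

import requests
from bs4 import BeautifulSoup

URL = "https://cyberparks.in/companies-at-park/"
headers = {"User-Agent": "Mozilla/5.0"}       # plain requests are often blocked

response = requests.get(URL, headers=headers, timeout=15)
response.raise_for_status()                   # fail loudly on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical extraction: treat outbound links as (name, website) pairs
for link in soup.find_all("a", href=True):
    name = link.get_text(strip=True)
    website = link["href"]
    if name and website.startswith("http"):
        print(name, "->", website)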

Output

The script creates a CSV file named: cyberparks_companies_YYYYMMDD_HHMMSS.csv

Example: cyberparks_companies_20250929_143022.csv

Sample Output Format

Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
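
A minimal sketch of how such a file can be written, using the column names from the sample above (the data row is illustrative):

import csv
from datetime import datetime

# Timestamped filename, e.g. cyberparks_companies_20250929_143022.csv
filename = f"cyberparks_companies_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"

with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Company Name", "Website", "MD/CEO/Chairman"])
    writer.writerow(["Codilar Technologies Pvt.Ltd",
                     "https://www.codilar.com", "Mahaveer Devabalan"])

print(f"Data saved to: {filename}")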

📊 Console Output

The script provides detailed console output:

======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website, MD/CEO/Chairman
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
  - Total companies: 72
  - With Leadership info: 45 (62%)
  - With Website: 72 (100%)
======================================================================

Scraping completed!

⚙️ Configuration

Timeout Settings

You can adjust the timeout for HTTP requests:

response = requests.get(url, headers=headers, timeout=15)  # Change timeout value

Rate Limiting

The script includes a 1-second delay between company website visits:

time.sleep(1)  # Adjust delay as needed

Maximum Companies

To limit the number of companies scraped:

if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break

🔧 Troubleshooting

Issue: No companies found

Solution:

  • Check your internet connection
  • Verify the website URL is accessible
  • The website structure may have changed; inspect debug_page.html if one was created

Issue: No leadership information extracted

Solution:

  • Leadership info might not be publicly available on company websites (extraction is heuristic; see the sketch below)
  • Check if the company websites are accessible
  • Some companies may not list executive information online
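
Leadership extraction is heuristic pattern matching, which is why misses are common. A sketch of the kind of regex involved (the actual patterns in scraper.py may differ):

import re

# Capture 2-4 capitalised words directly before a title keyword
TITLE_PATTERN = re.compile(
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3})\s*[,:-]?\s*"
    r"(?:MD|CEO|Chairman|Managing Director|Founder)"
)

page_text = "Mahaveer Devabalan, CEO of Codilar Technologies ..."
match = TITLE_PATTERN.search(page_text)
print(match.group(1) if match else "Not found")   # -> Mahaveer Devabalan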

Issue: Connection timeout

Solution:

# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)

Issue: Too many requests / Rate limiting

Solution:

# Increase delay between requests
time.sleep(2)  # or higher

Important Notes

  1. Terms of Service: Always check the website's Terms of Service before scraping
  2. robots.txt: Respect the website's robots.txt file (see the sketch after this list)
  3. Rate Limiting: The script includes delays to avoid overwhelming the server
  4. Data Usage: Use scraped data responsibly and in accordance with privacy laws
  5. Personal Data: Be mindful of personal information (executive names) and comply with GDPR/data protection laws
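
For note 2, Python's standard library can check robots.txt before any request is made; a minimal sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://cyberparks.in/robots.txt")
rp.read()                               # fetches and parses robots.txt

url = "https://cyberparks.in/companies-at-park/"
if rp.can_fetch("*", url):              # "*" = rules for any user agent
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)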

Best Practices

  • Run the scraper during off-peak hours
  • Use reasonable delays between requests
  • Don't run the scraper too frequently
  • Cache results to avoid repeated scraping (a sketch follows this list)
  • Respect website bandwidth and server resources
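
For the caching point, a small JSON file keyed by URL is enough to avoid re-scraping between runs. A sketch, with cache_results.json as a hypothetical file name:

import json
import os

CACHE_FILE = "cache_results.json"       # hypothetical cache file name

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)

cache = load_cache()
url = "https://www.codilar.com"
if url not in cache:
    cache[url] = "Mahaveer Devabalan"   # result of a fresh scrape
    save_cache(cache)
print(cache[url])                       # served from the cache on later runs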

🐛 Known Limitations

  1. Dynamic Content: May not work with JavaScript-heavy websites (would need Selenium)
  2. Authentication: Cannot access pages requiring login
  3. CAPTCHA: Cannot bypass CAPTCHA protection
  4. Structure Changes: Will need updates if website structure changes
  5. Leadership Info: Not all companies publicly list executive information

🔄 Updates & Maintenance

If the website structure changes:

  1. Run the script to generate debug_page.html
  2. Inspect the HTML structure
  3. Update the CSS selectors and extraction patterns (see the sketch after this list)
  4. Test with a small subset of companies first
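
As an illustration of step 3, updating a selector is usually a small, local change. A hypothetical before/after, assuming the companies moved from a table into div.company-card elements:

from bs4 import BeautifulSoup

# Hypothetical new markup after a site redesign
html = '<div class="company-card"><h3>Acme Ltd</h3><a href="https://acme.example">site</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Old selector that no longer matches: soup.select("table.companies tr")
for card in soup.select("div.company-card"):      # updated selector
    name = card.select_one("h3").get_text(strip=True)
    website = card.select_one("a")["href"]
    print(name, website)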

📧 Support

If you encounter issues:

  1. Check the console output for error messages
  2. Verify all dependencies are installed
  3. Ensure Python version is 3.6+
  4. Check internet connectivity
  5. Verify the target website is accessible

📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

🙏 Acknowledgments

  • Built with Python, Requests, and BeautifulSoup4
  • Designed for CyberParks company directory

Version: 1.0.0
Last Updated: September 29, 2025
Python Version: 3.6+