# CyberParks Company Information Scraper

A Python web scraping tool that extracts company information from the CyberParks website and saves it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Handles errors and fails gracefully
- ✅ Pauses between requests to limit load on the target servers

## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Name of the company's leader, when it can be found on the company website (left blank otherwise)

## 🛠️ Requirements

### Python Version

- Python 3.7 or higher (`requests` 2.31.0 does not support Python 3.6)

### Dependencies

```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:

```bash
pip install -r requirements.txt
```

### requirements.txt

```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Create a project directory**

   ```bash
   # Create a project directory
   mkdir cyberparks-scraper
   cd cyberparks-scraper
   ```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**

   ```bash
   pip install requests beautifulsoup4
   ```

## 🚀 Usage

### Basic Usage

Run the script:

```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data

### Output

The script creates a CSV file named:
`cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`

### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website.
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...
✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
  - Total companies: 72
  - With Leadership info: 45 (62%)
  - With Website: 72 (100%)
======================================================================

Scraping completed!
```

## ⚙️ Configuration

### Timeout Settings

You can adjust the timeout (in seconds) for HTTP requests:

```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting

The script includes a 1-second delay between company website visits:

```python
time.sleep(1)  # Adjust delay as needed
```

### Maximum Companies

To limit the number of companies scraped, change the threshold in this check inside the scraping loop:

```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```

## 🔧 Troubleshooting

### Issue: No companies found

**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed - inspect `debug_page.html` if created

### Issue: No leadership information extracted

**Solution:**
- Leadership info might not be publicly available on company websites
- Check if the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout

**Solution:**

```python
# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / Rate limiting

**Solution:**

```python
# Increase delay between requests
time.sleep(2)  # or higher
```

## 📝 Legal & Ethical Considerations

### Important Notes

1. **Terms of Service**: Always check the website's Terms of Service before scraping
2. **robots.txt**: Respect the website's robots.txt file
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR/data protection laws

### Best Practices

- ✅ Run the scraper during off-peak hours
- ✅ Use reasonable delays between requests
- ✅ Don't run the scraper too frequently
- ✅ Cache results to avoid repeated scraping
- ✅ Respect website bandwidth and server resources

## 🐛 Known Limitations

1. **Dynamic Content**: May not work with JavaScript-heavy websites (would need Selenium)
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if website structure changes
5. **Leadership Info**: Not all companies publicly list executive information

## 🔄 Updates & Maintenance

If the website structure changes:

1. Run the script to generate `debug_page.html`
2. Inspect the HTML structure
3. Update the CSS selectors and extraction patterns
4. Test with a small subset of companies first

## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify all dependencies are installed
3. Ensure Python version is 3.7+
4. Check internet connectivity
5. Verify the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for CyberParks company directory

---

**Version:** 1.0.0

**Last Updated:** September 29, 2025

**Python Version:** 3.7+
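
## 📎 Appendix: Sketch of the CSV Output Step

The README describes timestamped CSV files and duplicate avoidance but does not show how `scraper.py` implements them. The following is a minimal sketch of one way to do it, not the script's actual code. The function names (`dedupe_companies`, `output_filename`, `save_companies`) and the rule that two rows are duplicates when their names match case-insensitively are assumptions made for illustration. Only the column names and the file name pattern come from this README.

```python
import csv
from datetime import datetime
from pathlib import Path

# Column names taken from the sample CSV in this README.
FIELDNAMES = ["Company Name", "Website", "MD/CEO/Chairman"]


def dedupe_companies(rows):
    """Drop repeated companies, keeping the first occurrence.

    Assumption: two rows are duplicates when their names match after
    trimming whitespace and ignoring case. The real script may differ.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row["Company Name"].strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def output_filename(now=None):
    """Build a name like cyberparks_companies_20250929_143022.csv."""
    now = now or datetime.now()
    return f"cyberparks_companies_{now:%Y%m%d_%H%M%S}.csv"


def save_companies(rows, out_dir=".", now=None):
    """Write de-duplicated rows to a timestamped CSV and return its path."""
    path = Path(out_dir) / output_filename(now)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(dedupe_companies(rows))
    return path
```

Calling `save_companies(rows)` with a list of dictionaries keyed by the three column names writes the file into the current directory. Passing `now` explicitly makes the file name predictable, which is convenient when testing.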