# CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Error handling and graceful failures
- ✅ Respects rate limits with delays between requests

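The "graceful failures" behavior can be sketched as a small wrapper around `requests`: any request error is caught and turned into a `None` result so one unreachable site does not abort the run. This is an illustrative sketch, not the script's actual code:

```python
import requests

def safe_get(url: str, timeout: int = 15):
    """Fetch a URL, returning the page text or None on any request error."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        # Covers timeouts, connection errors, bad URLs, and HTTP errors.
        return None
```

The caller can then simply skip companies whose websites fail to load.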
## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Leadership name, when publicly available

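Leadership information is found by scanning each company site's text for executive titles. A minimal, hypothetical heuristic (the regex pattern is illustrative, not the script's actual one):

```python
import re

# Illustrative pattern: an executive title followed by a capitalized full name.
TITLE_RE = re.compile(
    r"(?:CEO|MD|Chairman|Managing Director|Founder)[:,\s-]+"
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"
)

def find_leader(page_text: str) -> str:
    """Return the first matched executive name, or '' if none is found."""
    match = TITLE_RE.search(page_text)
    return match.group(1) if match else ""
```

Real pages vary widely, which is why the leadership column is sometimes left empty.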
## 🛠️ Requirements

### Python Version
- Python 3.6 or higher

### Dependencies
```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:
```bash
pip install -r requirements.txt
```

### requirements.txt
```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Clone or download the script**
```bash
# Create a project directory
mkdir cyberparks-scraper
cd cyberparks-scraper
```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**
```bash
pip install requests beautifulsoup4
```

## 🚀 Usage

### Basic Usage

Simply run the script:
```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data

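Steps 1 and 2 can be sketched as a fetch-and-parse pair. The CSS selector below is a placeholder (inspect the real page, or `debug_page.html`, to find the right one), and the helper names are illustrative:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; research-scraper)"}

def fetch(url: str) -> str:
    """Step 1: download a page (15 s timeout, as in the Configuration section)."""
    response = requests.get(url, headers=HEADERS, timeout=15)
    response.raise_for_status()
    return response.text

def parse_companies(html: str) -> list:
    """Step 2: pull (name, website) pairs off the directory page."""
    soup = BeautifulSoup(html, "html.parser")
    companies, seen = [], set()
    # Placeholder selector: every external link on the page.
    for link in soup.select("a[href^='http']"):
        name = link.get_text(strip=True)
        site = link.get("href")
        if name and site not in seen:  # avoids duplicate entries
            seen.add(site)
            companies.append({"Company Name": name, "Website": site})
    return companies
```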

### Output

The script creates a CSV file named: `cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`

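The timestamped filename follows the standard `strftime` format `%Y%m%d_%H%M%S`; a minimal sketch (the helper name is illustrative):

```python
from datetime import datetime

def output_filename(now: datetime) -> str:
    """Build the timestamped CSV filename, e.g. cyberparks_companies_20250929_143022.csv."""
    return f"cyberparks_companies_{now.strftime('%Y%m%d_%H%M%S')}.csv"
```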
### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```
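Rows in this format are naturally written with the standard library's `csv.DictWriter`; a self-contained sketch (writing to a string buffer for illustration, where the real script writes to a file):

```python
import csv
import io

FIELDNAMES = ["Company Name", "Website", "MD/CEO/Chairman"]

def rows_to_csv(rows: list) -> str:
    """Render company dicts as CSV text with the header row shown above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)  # missing leadership values come out as empty cells
    return buffer.getvalue()
```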

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
- Total companies: 72
- With Leadership info: 45 (62%)
- With Website: 72 (100%)
======================================================================

Scraping completed!
```
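The whole-number percentages in the summary (e.g. 45 of 72 → 62%) can be computed by simple truncation; a small sketch, assuming that is the rounding the script uses:

```python
def pct(part: int, total: int) -> int:
    """Whole-number percentage, truncated toward zero; 0 when total is 0."""
    return int(100 * part / total) if total else 0
```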

## ⚙️ Configuration

### Timeout Settings
You can adjust the timeout for HTTP requests:
```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting
The script includes a 1-second delay between company website visits:
```python
time.sleep(1)  # Adjust delay as needed
```

### Maximum Companies
To limit the number of companies scraped:
```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```

## 🔧 Troubleshooting

### Issue: No companies found
**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed; inspect `debug_page.html` if it was created

### Issue: No leadership information extracted
**Solution:**
- Leadership info might not be publicly available on company websites
- Check if the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout
**Solution:**
```python
# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / rate limiting
**Solution:**
```python
# Increase delay between requests
time.sleep(2)  # or higher
```

## 📝 Legal & Ethical Considerations

### Important Notes

1. **Terms of Service**: Always check the website's Terms of Service before scraping
2. **robots.txt**: Respect the website's robots.txt file
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR and other data protection laws

### Best Practices

- ✅ Run the scraper during off-peak hours
- ✅ Use reasonable delays between requests
- ✅ Don't run the scraper too frequently
- ✅ Cache results to avoid repeated scraping
- ✅ Respect website bandwidth and server resources

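The robots.txt check in note 2 can be done with the standard library's `urllib.robotparser`. A sketch that parses robots.txt content directly (the real script would first fetch the site's `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_lines: list, url: str, agent: str = "*") -> bool:
    """Return True if robots.txt (given as a list of lines) permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)
```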
## 🐛 Known Limitations

1. **Dynamic Content**: May not work with JavaScript-heavy websites (would need Selenium)
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if the website structure changes
5. **Leadership Info**: Not all companies publicly list executive information

## 🔄 Updates & Maintenance

If the website structure changes:

1. Run the script to generate `debug_page.html`
2. Inspect the HTML structure
3. Update the CSS selectors and extraction patterns
4. Test with a small subset of companies first

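Step 1's debug page is just a saved copy of the fetched HTML; a minimal sketch (the helper name is illustrative):

```python
def save_debug_page(html: str, path: str = "debug_page.html") -> str:
    """Write the fetched page to disk so selectors can be inspected offline."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```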
## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify all dependencies are installed
3. Ensure your Python version is 3.6+
4. Check internet connectivity
5. Verify the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for the CyberParks company directory

---

**Version:** 1.0.0
**Last Updated:** September 29, 2025
**Python Version:** 3.6+