# CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries (see the sketch after this list)
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Error handling and graceful failures
- ✅ Respects rate limits with delays between requests
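
Duplicate entries are filtered by company name. A minimal sketch of how such a check can work, assuming companies are collected as dicts; `seen_names` and `add_company` are illustrative names, not necessarily the script's actual identifiers:

```python
seen_names = set()
companies = []

def add_company(name, website):
    """Record a company only if its name hasn't been seen before."""
    key = name.strip().lower()  # normalize so "Acme" and "ACME " match
    if key in seen_names:
        return False  # duplicate, skipped
    seen_names.add(key)
    companies.append({"Company Name": name, "Website": website})
    return True
```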

## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Leadership name, when one can be found on the company's website

## 🛠️ Requirements

### Python Version
- Python 3.6 or higher

### Dependencies
```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:
```bash
pip install -r requirements.txt
```

### requirements.txt
```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Clone or download the script**
   ```bash
   # Create a project directory
   mkdir cyberparks-scraper
   cd cyberparks-scraper
   ```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**
   ```bash
   pip install requests beautifulsoup4
   ```

## 🚀 Usage

### Basic Usage

Simply run the script:
```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data
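
At a high level, the flow might look like the sketch below. This is a minimal, self-contained illustration, not the script itself: the CSS selector is a guess at the page structure, and the leadership lookup and CSV step are omitted here.

```python
import time

import requests
from bs4 import BeautifulSoup

URL = "https://cyberparks.in/companies-at-park/"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # plain browser-like user agent

def main():
    # Step 1: fetch the directory page
    response = requests.get(URL, headers=HEADERS, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Step 2: collect candidate company links (selector is a guess;
    # the real page structure may differ)
    companies = []
    for link in soup.select("a[href^='http']"):
        name = link.get_text(strip=True)
        if name:
            companies.append({"Company Name": name, "Website": link["href"]})

    # Steps 3-4: report progress per company
    for i, company in enumerate(companies, 1):
        print(f"{i}. {company['Company Name']}")
        print(f"   Website: {company['Website']}")
        time.sleep(1)  # be polite between requests

if __name__ == "__main__":
    main()
```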

### Output

The script creates a CSV file named: `cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`
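
The timestamp comes from the run's start time. A one-line way to build such a name with the standard library, matching the pattern above:

```python
from datetime import datetime

# e.g. cyberparks_companies_20250929_143022.csv
filename = f"cyberparks_companies_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
print(filename)
```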

### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```
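
A company without leadership info simply leaves the last field empty, as in the third row above. A minimal sketch of writing such a file with `csv.DictWriter` (field names taken from the header above; `save_to_csv` is an illustrative name, not necessarily the script's):

```python
import csv

FIELDS = ["Company Name", "Website", "MD/CEO/Chairman"]

def save_to_csv(companies, filename):
    """Write company dicts to CSV; missing keys become empty cells."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(companies)
```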

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website.
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
- Total companies: 72
- With Leadership info: 45 (62%)
- With Website: 72 (100%)
======================================================================

Scraping completed!
```

## ⚙️ Configuration

### Timeout Settings
You can adjust the timeout for HTTP requests:
```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting
The script includes a 1-second delay between company website visits:
```python
time.sleep(1)  # Adjust delay as needed
```

### Maximum Companies
To limit the number of companies scraped:
```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```

## 🔧 Troubleshooting

### Issue: No companies found
**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed - inspect `debug_page.html` if created

### Issue: No leadership information extracted
**Solution:**
- Leadership info might not be publicly available on company websites
- Check if the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout
**Solution:**
```python
# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / Rate limiting
**Solution:**
```python
# Increase delay between requests
time.sleep(2)  # or higher
```
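
If a larger fixed delay is not enough, a simple retry with exponential backoff is another option. This is a sketch of the general technique, not code from the current script:

```python
import time

import requests

def get_with_backoff(url, headers, retries=3, base_delay=2):
    """Retry a GET request, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=15)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries, let the caller handle it
            time.sleep(base_delay * (2 ** attempt))  # waits 2s, 4s, 8s, ...
```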
## 📝 Legal & Ethical Considerations
|
||||
|
||||
### Important Notes
|
||||
|
||||
1. **Terms of Service**: Always check the website's Terms of Service before scraping
|
||||
2. **robots.txt**: Respect the website's robots.txt file
|
||||
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
|
||||
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
|
||||
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR/data protection laws
|
||||
|
||||
### Best Practices
|
||||
|
||||
- ✅ Run the scraper during off-peak hours
|
||||
- ✅ Use reasonable delays between requests
|
||||
- ✅ Don't run the scraper too frequently
|
||||
- ✅ Cache results to avoid repeated scraping
|
||||
- ✅ Respect website bandwidth and server resources
|
||||
|
||||
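
The robots.txt check mentioned above can be automated with the standard library, for example:

```python
from urllib.robotparser import RobotFileParser

# Check whether fetching the target path is allowed by robots.txt.
rp = RobotFileParser("https://cyberparks.in/robots.txt")
rp.read()
if rp.can_fetch("*", "https://cyberparks.in/companies-at-park/"):
    print("Allowed by robots.txt")
else:
    print("Disallowed - do not scrape this path")
```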

## 🐛 Known Limitations

1. **Dynamic Content**: May not work with JavaScript-heavy websites (would need Selenium)
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if website structure changes
5. **Leadership Info**: Not all companies publicly list executive information
## 🔄 Updates & Maintenance
|
||||
|
||||
If the website structure changes:
|
||||
|
||||
1. Run the script to generate `debug_page.html`
|
||||
2. Inspect the HTML structure
|
||||
3. Update the CSS selectors and extraction patterns
|
||||
4. Test with a small subset of companies first
|
||||
|
||||
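
Assuming the script saves the fetched page as `debug_page.html` when parsing fails (as the troubleshooting section suggests), the dump itself is a short snippet like:

```python
import requests

response = requests.get("https://cyberparks.in/companies-at-park/", timeout=15)

# Save the raw HTML so the selectors can be inspected in a browser or editor.
with open("debug_page.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```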

## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify all dependencies are installed
3. Ensure Python version is 3.6+
4. Check internet connectivity
5. Verify the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for the CyberParks company directory

---

**Version:** 1.0.0
**Last Updated:** September 29, 2025
**Python Version:** 3.6+