# CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Error handling and graceful failures
- ✅ Adds a 1-second delay between company website visits to limit server load

## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
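2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Leadership name found on the company website (left blank when none is found)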

## 🛠️ Requirements

### Python Version
- Python 3.7 or higher (`requests>=2.31.0`, required by `requirements.txt`, does not support Python 3.6)

### Dependencies
```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:
```bash
pip install -r requirements.txt
```

### requirements.txt
```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Create a project directory**
   ```bash
   # Create a project directory
   mkdir cyberparks-scraper
   cd cyberparks-scraper
   ```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**
   ```bash
   pip install requests beautifulsoup4
   ```

## 🚀 Usage

### Basic Usage

Simply run the script:
```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data
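
Step 3 is the least predictable part of the run, because every company site words its leadership page differently. The README does not document how the script locates names, so the following is only a minimal sketch of one possible heuristic: search the page text for a capitalized name next to a title such as CEO or Managing Director. The `find_leader` function, the title list, and both patterns are assumptions for illustration, not the script's actual logic.

```python
import re

# Hypothetical title list and name shape; the real script may use others.
TITLES = r"(?:CEO|Chief Executive Officer|Managing Director|MD|Chairman)"
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+){1,2})"  # two or three capitalized words

PATTERNS = [
    # "CEO: Jane Doe" or "Managing Director - Jane Doe"
    re.compile(rf"\b{TITLES}\b\s*[:\-,]\s*{NAME}"),
    # "Jane Doe, CEO" or "Jane Doe - Chairman"
    re.compile(rf"{NAME}\s*[,\-]\s*{TITLES}\b"),
]


def find_leader(text):
    """Return the first name found next to a leadership title, or ''."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return ""


print(find_leader("Our CEO: Jane Doe founded the company in 2010."))
print(find_leader("Jane Doe, Managing Director of the company"))
```

A heuristic like this is crude: it can pick up a preceding capitalized word as part of the name and it misses names with initials or non-Latin characters, which is one reason the leadership column is often empty.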

### Output

The script creates a CSV file named: `cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`
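
A file name in this pattern can be produced with `datetime.strftime`. The sketch below assumes a helper named `build_output_filename`, which is hypothetical; the script may build the name differently.

```python
from datetime import datetime


def build_output_filename(now=None):
    """Return a name such as cyberparks_companies_20250929_143022.csv."""
    now = now or datetime.now()  # default to the current local time
    return f"cyberparks_companies_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Reproduces the example file name above for a fixed timestamp.
print(build_output_filename(datetime(2025, 9, 29, 14, 30, 22)))
```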

### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```
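
As a rough illustration of how rows in this format could be written while skipping duplicates, here is a minimal sketch using the standard `csv` module. The `write_companies` helper and the choice to deduplicate on a case-insensitive company name are assumptions; the script's own duplicate check is not documented here.

```python
import csv

FIELDNAMES = ["Company Name", "Website", "MD/CEO/Chairman"]


def write_companies(rows, out_file):
    """Write rows to out_file, skipping repeated company names.

    Returns the number of rows written (header excluded).
    """
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    seen = set()
    written = 0
    for row in rows:
        # Assumption: two entries are duplicates if their names match
        # after trimming whitespace and ignoring case.
        key = row["Company Name"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        # Missing fields (for example, no leadership name) become empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
        written += 1
    return written
```

When writing to disk, open the file with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls line endings.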

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website.
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
  - Total companies: 72
  - With Leadership info: 45 (62%)
  - With Website: 72 (100%)
======================================================================

Scraping completed!
```
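
The statistics block can be derived directly from the collected rows. The sketch below assumes each company is a dict keyed by the CSV column names and that percentages are truncated to whole numbers; both are assumptions, and `summarize` is a hypothetical helper.

```python
def summarize(rows):
    """Count rows with a non-empty leadership name and website."""
    total = len(rows)

    def share(field):
        count = sum(1 for row in rows if row.get(field))
        # Truncate to a whole percent; guard against an empty result set.
        percent = int(100 * count / total) if total else 0
        return count, percent

    return {
        "total": total,
        "leadership": share("MD/CEO/Chairman"),
        "website": share("Website"),
    }


# 72 companies, 45 of them with a leadership name, as in the sample output.
sample = [
    {"Company Name": f"Company {i}",
     "Website": "https://example.com",
     "MD/CEO/Chairman": "Jane Doe" if i < 45 else ""}
    for i in range(72)
]
stats = summarize(sample)
print(f"  - Total companies: {stats['total']}")
print("  - With Leadership info: {} ({}%)".format(*stats["leadership"]))
print("  - With Website: {} ({}%)".format(*stats["website"]))
```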

## ⚙️ Configuration

### Timeout Settings
You can adjust the timeout for HTTP requests:
```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting
The script includes a 1-second delay between company website visits:
```python
time.sleep(1)  # Adjust delay as needed
```
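
A fixed `time.sleep(1)` adds a full second even when the previous request was already slow. One possible refinement, shown here only as a sketch, is to wait just long enough that consecutive requests are at least a minimum interval apart. The `Throttle` class is hypothetical and not part of the script.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous wait() call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each company website request would then replace the unconditional `time.sleep(1)`.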

### Maximum Companies
To limit the number of companies scraped:
```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```
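
Editing these values in the source works, but they could also be exposed as command-line options. The following is a sketch only: the flag names `--max-companies`, `--timeout`, and `--delay` are hypothetical and do not exist in the script as described; the defaults mirror the values shown above.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical scraper options; defaults match the README values."""
    parser = argparse.ArgumentParser(
        description="CyberParks scraper options (illustrative)"
    )
    parser.add_argument("--max-companies", type=int, default=72,
                        help="stop after this many companies")
    parser.add_argument("--timeout", type=float, default=15,
                        help="HTTP request timeout in seconds")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="pause between company website visits in seconds")
    return parser.parse_args(argv)


options = parse_args(["--max-companies", "10", "--delay", "2"])
print(options.max_companies, options.timeout, options.delay)
```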

## 🔧 Troubleshooting

### Issue: No companies found
**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed - inspect `debug_page.html` if created

### Issue: No leadership information extracted
**Solution:**
- Leadership info might not be publicly available on company websites
- Check if the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout
**Solution:**
```python
# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / Rate limiting
**Solution:**
```python
# Increase delay between requests
time.sleep(2)  # or higher
```
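
If timeouts or rate-limit errors are intermittent, retrying with a growing delay can help more than a single larger value. This is a minimal sketch, not something the script is known to do: `fetch_with_retries` is a hypothetical helper, and `fetch` stands for any callable that performs the request, for example a small wrapper around `requests.get(url, headers=headers, timeout=30)`.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With Requests, passing `retry_on=(requests.exceptions.RequestException,)` would limit retries to network-level errors instead of every exception.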

## 📝 Legal & Ethical Considerations

### Important Notes

1. **Terms of Service**: Always check the website's Terms of Service before scraping
2. **robots.txt**: Respect the website's robots.txt file
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR/data protection laws
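
The robots.txt check in point 2 can be automated with the standard library's `urllib.robotparser`. The sketch below parses rules supplied as a list of lines so that it runs without network access; the `is_allowed` helper and the sample rules are illustrative assumptions, not the actual rules of any site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(rules, "https://example.com/private/data"))
print(is_allowed(rules, "https://example.com/companies/"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real robots.txt before `can_fetch()` is called.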

### Best Practices

- ✅ Run the scraper during off-peak hours
- ✅ Use reasonable delays between requests
- ✅ Don't run the scraper too frequently
- ✅ Cache results to avoid repeated scraping
- ✅ Respect website bandwidth and server resources

## 🐛 Known Limitations

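1. **Dynamic Content**: Pages that render their content with JavaScript may return little or no data, because the scraper only parses static HTML (a browser automation tool such as Selenium would be needed)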
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if website structure changes
5. **Leadership Info**: Not all companies publicly list executive information

## 🔄 Updates & Maintenance

If the website structure changes:

1. Run the script to generate `debug_page.html`
2. Inspect the HTML structure
3. Update the CSS selectors and extraction patterns
4. Test with a small subset of companies first

## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify all dependencies are installed
3. Ensure Python version is 3.7+
4. Check internet connectivity
5. Verify the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for CyberParks company directory

---

**Version:** 1.0.0  
**Last Updated:** September 29, 2025  
**Python Version:** 3.7+