# CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Error handling and graceful failures
- ✅ Respects rate limits with delays between requests

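The "graceful failures" behavior can be sketched as a small wrapper around `requests`: any request error is caught and turned into a `None` result so one unreachable site does not abort the run. This is an illustrative sketch, not the script's actual code:

```python
import requests

def safe_get(url: str, timeout: int = 15):
    """Fetch a URL, returning the page text or None on any request error."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        # Covers timeouts, connection errors, bad URLs, and HTTP errors.
        return None
```

The caller can then simply skip companies whose websites fail to load.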
## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Leadership name, when publicly available

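Leadership information is found by scanning each company site's text for executive titles. A minimal, hypothetical heuristic (the regex pattern is illustrative, not the script's actual one):

```python
import re

# Illustrative pattern: an executive title followed by a capitalized full name.
TITLE_RE = re.compile(
    r"(?:CEO|MD|Chairman|Managing Director|Founder)[:,\s-]+"
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)+)"
)

def find_leader(page_text: str) -> str:
    """Return the first matched executive name, or '' if none is found."""
    match = TITLE_RE.search(page_text)
    return match.group(1) if match else ""
```

Real pages vary widely, which is why the leadership column is sometimes left empty.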
## 🛠️ Requirements

### Python Version
- Python 3.6 or higher

### Dependencies
```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:
```bash
pip install -r requirements.txt
```

### requirements.txt
```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Clone or download the script**
```bash
# Create a project directory
mkdir cyberparks-scraper
cd cyberparks-scraper
```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**
```bash
pip install requests beautifulsoup4
```

## 🚀 Usage

### Basic Usage

Simply run the script:
```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data

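Steps 1 and 2 can be sketched as a fetch-and-parse pair. The CSS selector below is a placeholder (inspect the real page, or `debug_page.html`, to find the right one), and the helper names are illustrative:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; research-scraper)"}

def fetch(url: str) -> str:
    """Step 1: download a page (15 s timeout, as in the Configuration section)."""
    response = requests.get(url, headers=HEADERS, timeout=15)
    response.raise_for_status()
    return response.text

def parse_companies(html: str) -> list:
    """Step 2: pull (name, website) pairs off the directory page."""
    soup = BeautifulSoup(html, "html.parser")
    companies, seen = [], set()
    # Placeholder selector: every external link on the page.
    for link in soup.select("a[href^='http']"):
        name = link.get_text(strip=True)
        site = link.get("href")
        if name and site not in seen:  # avoids duplicate entries
            seen.add(site)
            companies.append({"Company Name": name, "Website": site})
    return companies
```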

### Output

The script creates a CSV file named: `cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`

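The timestamped filename follows the standard `strftime` format `%Y%m%d_%H%M%S`; a minimal sketch (the helper name is illustrative):

```python
from datetime import datetime

def output_filename(now: datetime) -> str:
    """Build the timestamped CSV filename, e.g. cyberparks_companies_20250929_143022.csv."""
    return f"cyberparks_companies_{now.strftime('%Y%m%d_%H%M%S')}.csv"
```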
### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```
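Rows in this format are naturally written with the standard library's `csv.DictWriter`; a self-contained sketch (writing to a string buffer for illustration, where the real script writes to a file):

```python
import csv
import io

FIELDNAMES = ["Company Name", "Website", "MD/CEO/Chairman"]

def rows_to_csv(rows: list) -> str:
    """Render company dicts as CSV text with the header row shown above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)  # missing leadership values come out as empty cells
    return buffer.getvalue()
```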

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
- Total companies: 72
- With Leadership info: 45 (62%)
- With Website: 72 (100%)
======================================================================

Scraping completed!
```
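The whole-number percentages in the summary (e.g. 45 of 72 → 62%) can be computed by simple truncation; a small sketch, assuming that is the rounding the script uses:

```python
def pct(part: int, total: int) -> int:
    """Whole-number percentage, truncated toward zero; 0 when total is 0."""
    return int(100 * part / total) if total else 0
```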

## ⚙️ Configuration

### Timeout Settings
You can adjust the timeout for HTTP requests:
```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting
The script includes a 1-second delay between company website visits:
```python
time.sleep(1)  # Adjust delay as needed
```

### Maximum Companies
To limit the number of companies scraped:
```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```

## 🔧 Troubleshooting

### Issue: No companies found
**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed; inspect `debug_page.html` if it was created

### Issue: No leadership information extracted
**Solution:**
- Leadership info might not be publicly available on company websites
- Check if the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout
**Solution:**
```python
# Increase timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / rate limiting
**Solution:**
```python
# Increase delay between requests
time.sleep(2)  # or higher
```

## 📝 Legal & Ethical Considerations

### Important Notes

1. **Terms of Service**: Always check the website's Terms of Service before scraping
2. **robots.txt**: Respect the website's robots.txt file
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR and other data protection laws

### Best Practices

- ✅ Run the scraper during off-peak hours
- ✅ Use reasonable delays between requests
- ✅ Don't run the scraper too frequently
- ✅ Cache results to avoid repeated scraping
- ✅ Respect website bandwidth and server resources

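The robots.txt check in note 2 can be done with the standard library's `urllib.robotparser`. A sketch that parses robots.txt content directly (the real script would first fetch the site's `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_lines: list, url: str, agent: str = "*") -> bool:
    """Return True if robots.txt (given as a list of lines) permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)
```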
## 🐛 Known Limitations

1. **Dynamic Content**: May not work with JavaScript-heavy websites (would need Selenium)
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if the website structure changes
5. **Leadership Info**: Not all companies publicly list executive information

## 🔄 Updates & Maintenance

If the website structure changes:

1. Run the script to generate `debug_page.html`
2. Inspect the HTML structure
3. Update the CSS selectors and extraction patterns
4. Test with a small subset of companies first

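Step 1's debug page is just a saved copy of the fetched HTML; a minimal sketch (the helper name is illustrative):

```python
def save_debug_page(html: str, path: str = "debug_page.html") -> str:
    """Write the fetched page to disk so selectors can be inspected offline."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```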
## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify all dependencies are installed
3. Ensure your Python version is 3.6+
4. Check internet connectivity
5. Verify the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for the CyberParks company directory

---

**Version:** 1.0.0
**Last Updated:** September 29, 2025
**Python Version:** 3.6+