From 2e5186ed1a343d938d7e5abe296786906f511a40 Mon Sep 17 00:00:00 2001
From: "koushik.m"
Date: Mon, 29 Sep 2025 12:56:19 +0530
Subject: [PATCH] fix: file_added
---
 README.md | 241 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 241 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..7ff88c0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,241 @@

# CyberParks Company Information Scraper

A Python web scraping tool to extract company information from the CyberParks website and save it to a CSV file.

## 📋 Features

- ✅ Scrapes company names, websites, and leadership information
- ✅ Automatically creates timestamped CSV files
- ✅ Avoids duplicate entries
- ✅ Real-time progress display
- ✅ Visits company websites to find leadership info
- ✅ Error handling and graceful failures
- ✅ Respects rate limits with delays between requests

## 🎯 Extracted Information

The scraper extracts the following information for each company:

1. **Company Name** - Official company name
2. **Website** - Company website URL
3. **MD/CEO/Chairman** - Leadership name, when one can be found on the company website

## 🛠️ Requirements

### Python Version
- Python 3.6 or higher

### Dependencies
```bash
pip install requests beautifulsoup4
```

Or install from requirements.txt:
```bash
pip install -r requirements.txt
```

### requirements.txt
```
requests>=2.31.0
beautifulsoup4>=4.12.0
```

## 📦 Installation

1. **Clone or download the script**
   ```bash
   # Create a project directory
   mkdir cyberparks-scraper
   cd cyberparks-scraper
   ```

2. **Save the script**
   - Save the Python code as `scraper.py`

3. **Install dependencies**
   ```bash
   pip install requests beautifulsoup4
   ```

## 🚀 Usage

### Basic Usage

Simply run the script:
```bash
python scraper.py
```

### What Happens

1. The script connects to https://cyberparks.in/companies-at-park/
2. Extracts company names and websites from the main page
3. Visits each company's website to find leadership information
4. Displays real-time progress for each company
5. Creates a CSV file with all extracted data

### Output

The script creates a CSV file named: `cyberparks_companies_YYYYMMDD_HHMMSS.csv`

Example: `cyberparks_companies_20250929_143022.csv`

### Sample Output Format

```csv
Company Name,Website,MD/CEO/Chairman
Codilar Technologies Pvt.Ltd,https://www.codilar.com,Mahaveer Devabalan
ABANA Technology Private Limited,http://www.abanatechnology.com,John Smith
Analystor Technologies,http://www.analystortech.com,
```

## 📊 Console Output

The script provides detailed console output:

```
======================================================================
CyberParks Company Information Scraper
Extracting: Company Name, Website, MD/CEO/Chairman
======================================================================

Fetching data from https://cyberparks.in/companies-at-park/...

✓ Page loaded successfully

Found 150 potential entries. Processing...

1. Codilar Technologies Pvt.Ltd
   Website: https://www.codilar.com
   → Checking website for leadership info...
   ✓ Leadership: Mahaveer Devabalan

2. ABANA Technology Private Limited
   Website: http://www.abanatechnology.com
   → Checking website for leadership info...
   ✗ Leadership: Not found

...

======================================================================
✓ Successfully scraped 72 companies!
✓ Data saved to: cyberparks_companies_20250929_143022.csv

Statistics:
  - Total companies: 72
  - With Leadership info: 45 (62%)
  - With Website: 72 (100%)
======================================================================

Scraping completed!
```

## ⚙️ Configuration

### Timeout Settings
You can adjust the timeout for HTTP requests:
```python
response = requests.get(url, headers=headers, timeout=15)  # Change timeout value
```

### Rate Limiting
The script includes a 1-second delay between company website visits:
```python
time.sleep(1)  # Adjust delay as needed
```

### Maximum Companies
To limit the number of companies scraped:
```python
if len(companies) >= 72:  # Change this number
    print("Reached limit. Stopping...")
    break
```

## 🔧 Troubleshooting

### Issue: No companies found
**Solution:**
- Check your internet connection
- Verify the website URL is accessible
- The website structure may have changed - inspect `debug_page.html` if it was created

### Issue: No leadership information extracted
**Solution:**
- Leadership info might not be publicly available on company websites
- Check that the company websites are accessible
- Some companies may not list executive information online

### Issue: Connection timeout
**Solution:**
```python
# Increase the timeout value
response = requests.get(url, headers=headers, timeout=30)
```

### Issue: Too many requests / Rate limiting
**Solution:**
```python
# Increase the delay between requests
time.sleep(2)  # or higher
```

## 📝 Legal & Ethical Considerations

### Important Notes

1. **Terms of Service**: Always check the website's Terms of Service before scraping
2. **robots.txt**: Respect the website's robots.txt file
3. **Rate Limiting**: The script includes delays to avoid overwhelming the server
4. **Data Usage**: Use scraped data responsibly and in accordance with privacy laws
5. **Personal Data**: Be mindful of personal information (executive names) and comply with GDPR/data protection laws

### Best Practices

- ✅ Run the scraper during off-peak hours
- ✅ Use reasonable delays between requests
- ✅ Don't run the scraper too frequently
- ✅ Cache results to avoid repeated scraping
- ✅ Respect website bandwidth and server resources

## 🐛 Known Limitations

1. **Dynamic Content**: May not work with JavaScript-heavy websites (these would need Selenium)
2. **Authentication**: Cannot access pages requiring login
3. **CAPTCHA**: Cannot bypass CAPTCHA protection
4. **Structure Changes**: Will need updates if the website structure changes
5. **Leadership Info**: Not all companies publicly list executive information

## 🔄 Updates & Maintenance

If the website structure changes:

1. Run the script to generate `debug_page.html`
2. Inspect the HTML structure
3. Update the CSS selectors and extraction patterns
4. Test with a small subset of companies first

## 📧 Support

If you encounter issues:

1. Check the console output for error messages
2. Verify that all dependencies are installed
3. Ensure your Python version is 3.6+
4. Check internet connectivity
5. Verify that the target website is accessible

## 📄 License

This script is provided as-is for educational purposes. Use responsibly and ethically.

## 🙏 Acknowledgments

- Built with Python, Requests, and BeautifulSoup4
- Designed for the CyberParks company directory

---

**Version:** 1.0.0
**Last Updated:** September 29, 2025
**Python Version:** 3.6+
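
## 🧩 Appendix: Minimal Parsing Sketch

The README refers to `scraper.py` without including it. The sketch below illustrates only the parse-and-save steps described under "What Happens": extracting unique name/website pairs and writing the timestamped CSV in the sample output format. The `a[href^='http']` selector and the `parse_companies`/`save_csv` names are illustrative assumptions, not the actual script or the real CyberParks markup - inspect the live page (or `debug_page.html`) and adjust the selector accordingly.

```python
import csv
from datetime import datetime

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def parse_companies(html):
    """Extract unique (name, website) pairs from the listing-page HTML.

    NOTE: the CSS selector below is a placeholder guess, not the real
    CyberParks page structure - update it after inspecting the markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    companies, seen = [], set()
    for link in soup.select("a[href^='http']"):
        name = link.get_text(strip=True)
        url = link["href"]
        if name and url not in seen:  # skip duplicate entries
            seen.add(url)
            companies.append({"Company Name": name, "Website": url})
    return companies


def save_csv(companies, now=None):
    """Write rows to a timestamped CSV matching the sample output format."""
    now = now or datetime.now()
    filename = "cyberparks_companies_{:%Y%m%d_%H%M%S}.csv".format(now)
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["Company Name", "Website", "MD/CEO/Chairman"]
        )
        writer.writeheader()
        for row in companies:
            # Leadership defaults to empty, as in the sample CSV
            writer.writerow({"MD/CEO/Chairman": "", **row})
    return filename
```

To complete the flow, fetch the listing page with `requests.get(url, headers=headers, timeout=15)` as shown in the configuration section, pass `response.text` to `parse_companies`, and hand the result to `save_csv`.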