How to Scrape LinkedIn Company Data Using the Voyager API

"Unlocking Insights Responsibly: Empowering Education through LinkedIn Scraping and Voyager API"

What is LinkedIn Scraping?

LinkedIn scraping is the process of extracting data from LinkedIn's website using automated scripts or tools. It involves collecting information from profiles, company pages, job postings, and other publicly available data. It is a specific case of web scraping, which means automatically gathering data from websites by sending HTTP requests and extracting specific information from the HTML content of the responses.

Note: Violating LinkedIn's terms of service can lead to account suspension, IP blocking, or legal action.

💡 To avoid legal or ethical issues, it's essential to review the website's terms of service and API usage guidelines before scraping.
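To make the definition of web scraping above more concrete, here is a minimal sketch of the "extract specific information from HTML" half of the process, using only Python's standard library. The page content in SAMPLE_HTML, the LinkCollector class, and the parse_page helper are illustrative assumptions, not part of the original script; in a real scraper the HTML would come from the body of an HTTP response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


# Hypothetical page content standing in for a fetched HTTP response body.
SAMPLE_HTML = """
<html>
  <head><title>Example Company</title></head>
  <body>
    <a href="https://example.com/about">About</a>
    <a href="https://example.com/jobs">Jobs</a>
  </body>
</html>
"""


def parse_page(html):
    """Return (title, list_of_links) for an HTML document."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.title.strip(), parser.links


title, links = parse_page(SAMPLE_HTML)
print(title)
print(links)
```

The Voyager approach described later in this article skips this HTML parsing step, because the endpoint returns JSON directly.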

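What is the Voyager API?

The name Voyager refers to two unrelated products, and they are easy to confuse. Voyager Search is a geospatial search platform; its REST API enables developers to integrate enterprise applications with the platform and provides a web-based interface for exploring and interacting with Voyager's data and services. It offers two types of open interfaces: XML over HTTP and RESTful APIs. That product is not the one used in this article.

The Voyager API used here is the service LinkedIn developed to provide a more resilient platform for its own web and mobile applications, based on the Play framework and the GraphQL query language. Its endpoints are served under linkedin.com/voyager/api. Because it was built for LinkedIn's own front ends rather than published as a documented public API, its endpoints and response formats may change without notice.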

How to Obtain the JSESSIONID and CSRF Token

JSESSIONID: When a user accesses a website, the web server generates a unique session identifier. Java-based servers conventionally store it in a cookie named JSESSIONID.

  • The browser keeps it as a cookie and sends it along with subsequent requests so the server can identify the user's session.

  • You can find the JSESSIONID by inspecting the cookies in your browser's developer tools while visiting the page.

  • Look for a cookie named "JSESSIONID" or something similar. Keep in mind that some websites use different cookie names for session tracking.
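If you copy the whole Cookie request header out of the developer tools rather than a single cookie, the value can be picked out programmatically. The following is a small sketch using the standard library's http.cookies module; the helper name get_cookie_value and the header text in raw_header are made up for illustration. Note that SimpleCookie removes the double quotes that surround a quoted cookie value.

```python
from http.cookies import SimpleCookie


def get_cookie_value(cookie_header, name):
    """Return the value of cookie `name` from a raw Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


# Hypothetical header text, as it might be copied from the browser's
# developer tools. The values are placeholders, not real credentials.
raw_header = 'lang=en-us; JSESSIONID="ajax:0123456789"; li_at=PLACEHOLDER'

print(get_cookie_value(raw_header, "JSESSIONID"))
print(get_cookie_value(raw_header, "li_at"))
```

Treat these cookie values like passwords: anyone who has them can act as the logged-in user, so avoid committing them to source control.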

CSRF Token: The CSRF token (Cross-Site Request Forgery token) is a security measure that prevents hostile websites from sending unauthorized requests on a user's behalf.

  • It is frequently required when sending POST, PUT, or DELETE requests to web services.

  • The method for obtaining the CSRF token differs from one website or online service to another. In some cases, the token is hidden in a form on a web page and can be extracted using web scraping techniques.

  • In other cases, the CSRF token is sent in the response headers of an initial GET call to the web service. It can then be read from those headers and used in subsequent requests as needed.

  • In the script later in this article, the csrf-token header is set to the value of the JSESSIONID cookie with any surrounding double quotes removed.
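The two generic lookup strategies described above (response headers first, then a hidden form field) can be sketched as follows. This is an illustration only: the header name "X-CSRF-Token" and the field name "csrf_token" are assumptions chosen for the example, and real sites use their own names.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Record the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            tag == "input"
            and attrs.get("name") == self.field_name
            and self.value is None
        ):
            self.value = attrs.get("value")


def find_csrf_token(headers, html,
                    header_name="X-CSRF-Token", field_name="csrf_token"):
    """Look in the response headers first, then in the page's form fields."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for key, value in headers.items():
        if key.lower() == header_name.lower():
            return value
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


# Hypothetical page with the token embedded in a hidden form field.
page = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'

print(find_csrf_token({}, page))
print(find_csrf_token({"x-csrf-token": "from-header"}, page))
```

The function returns None when neither source contains a token, which lets the caller decide whether to proceed without one.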

Step-by-step walkthrough of the Python script, which uses the requests module to perform HTTP requests

  1. The script imports the requests library, a well-known Python module for performing HTTP requests.

  2. It assigns a user-agent string to the headers variable. The user agent identifies the type of client making the request to the server.

  3. It defines the variable company_link, which holds the URL of the company information API endpoint. The URL has the form linkedin.com/voyager/api/entities/companies/{company_id}, where {company_id} is replaced with the company's entity ID.

  4. The script then creates a requests.session() object, which keeps cookies and headers across requests.

  5. It uses the s.cookies property to set the required cookies (li_at and JSESSIONID) in the session. LinkedIn uses these cookies to authenticate and authorize the request.

  6. To resemble a standard web browser request, the script sets the user-agent and csrf-token headers on the session.

  7. It calls s.get(company_link) on the session object to send an HTTP GET request to the company_link URL.

  8. The API responds with data in JSON format.

  9. The JSON response is parsed into a Python dictionary, and the script prints the dictionary's contents, which include company information such as employee count range, website URL, company type, industries, description, and more.

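import requests


def getdatafromvoyagerlinkedin(company_id):
    # Browser-like user agent so the request resembles normal web traffic
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
               }

    # Voyager endpoint for a single company, keyed by its numeric ID
    company_link = f'https://www.linkedin.com/voyager/api/entities/companies/{company_id}'

    with requests.session() as s:
        # Placeholders: paste the values from your own logged-in browser session
        s.cookies['li_at'] = "AQEDATf5D_XXXXXXXXXXXXXXu"
        s.cookies["JSESSIONID"] = "ajax:1XXXXXXXXXXXXXX0"
        # update() keeps the session's default headers instead of replacing them
        s.headers.update(headers)
        # The CSRF token is the JSESSIONID value without surrounding quotes
        s.headers["csrf-token"] = s.cookies["JSESSIONID"].strip('"')
        response = s.get(company_link)
        # Raise on HTTP errors (for example, expired cookies) before parsing
        response.raise_for_status()
        response_dict = response.json()
        return response_dict


print(getdatafromvoyagerlinkedin(16198870))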
# Output
# {'employeeCountRange': '11-50',
#  'specialties': ['Web Development',
#  'Mobile App Development', 'Web Design',
#  'Node', 'Flutter', 'Ionic', 'AWS', 'Digital Ocean', 'Laravel', 'WordPress',
#  'React Native', 'React JS experts', 'PHP development', 'PrestaShop', 'OpenCart',
#  'SEO and SEM', 'Joomla'],
#  'entityUrn': 'urn:li:fs_company:16198870',
#  'websiteUrl': 'https://www.bytescrum.com',
#  'companyType': 'Privately Held', 'foundedDate': {'year': 2017},
#  'entityInfo': {'objectUrn': 'urn:li:company:16198870',
#  'trackingId': 'ymxYkL2eSUSOWQIt+Mn3xQ=='},
#  'industries': ['Information Technology and Services'],
#  'description': 'ByteScrum taps into its strong business acumen to find solutions to the unique set of challenges and constraints imposed by each new project and delivers solutions that fill performance gaps.
#  Our founders understood for the first time how good software development services can transform the needs of entire business communities, especially emerging technologies.
#  We have a proven track record of successfully meeting deadlines and executing the most complex projects within budget while consistently maintaining the highest quality.\n\nOur specialities:\n\n● Mobile app development
#  (Android and iOS)\n● Web app development (MERN, MEAN, Vue JS, PHP, Laravel, WordPress)\n● Custom Software development\n● Web designing (PSD to HTML/WordPress)\n● Api integration \n●
#  CMS development\n● Web and app service integration\n● SEO and SEM services\n\nContact Us:\nhttps://www.bytescrum.com/contact-us/',
#  'basicCompanyInfo': {'headquarters': 'Lucknow',
#  'followingInfo': {'entityUrn': 'urn:li:fs_followingInfo:urn:li:company:16198870',
#  'dashFollowingStateUrn': 'urn:li:fsd_followingState:urn:li:fsd_company:16198870', 'following': False, 'trackingUrn': 'urn:li:company:16198870', 'followingType': 'DEFAULT'},
#  'miniCompany': {'objectUrn': 'urn:li:company:16198870', 'entityUrn': 'urn:li:fs_miniCompany:16198870', 'name': 'ByteScrum Technologies Private Limited',
#  'showcase': False,
#  'active': True,
#  'logo': {'com.linkedin.common.VectorImage': {'artifacts': [{'width': 200, 'fileIdentifyingUrlPathSegment': '200_200/0/1653201669588?e=1698883200&v=beta&t=GE_5HHCt3u_xxKWDV1d3KmNBx0-AJXvIyjkIxSaXp-E',
#  'expiresAt': 1698883200000, 'height': 200}, {'width': 100, 'fileIdentifyingUrlPathSegment': '100_100/0/1653201669588?e=1698883200&v=beta&t=rbIH_vzfS4YkrOV-inNhuY9XXdbj28K9l4ZY_4-I41o',
#  'expiresAt': 1698883200000, 'height': 100}, {'width': 400, 'fileIdentifyingUrlPathSegment':
#  '400_400/0/1653201669588?e=1698883200&v=beta&t=rARzTyswXT1D9vObNkCAh9ljFivi4r6T0QxC_WwLVvQ', 'expiresAt': 1698883200000, 'height': 400}],
#  'rootUrl': 'https://media.licdn.com/dms/image/C4D0BAQHzTgUzh6WpUw/company-logo_'}}, 'universalName': 'bytescrum', 'dashCompanyUrn': 'urn:li:fsd_company:16198870',
#  'trackingId': 'XXXXXXXXXXXXXXXXX'}}}
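The response is deeply nested, so in practice you will usually want only a handful of fields. The sketch below flattens a few of them. The function name summarize_company is made up for this example, and the key names are taken from the sample output above; since the response shape is not guaranteed, every lookup uses .get() so a missing key yields None instead of an exception.

```python
def summarize_company(response_dict):
    """Flatten a few fields of a company response into a simple dict."""
    # .get() with defaults keeps the function from raising if a key is absent.
    basic = response_dict.get("basicCompanyInfo", {})
    mini = basic.get("miniCompany", {})
    return {
        "name": mini.get("name"),
        "website": response_dict.get("websiteUrl"),
        "type": response_dict.get("companyType"),
        "employees": response_dict.get("employeeCountRange"),
        "founded": response_dict.get("foundedDate", {}).get("year"),
        "headquarters": basic.get("headquarters"),
        "industries": response_dict.get("industries", []),
    }


# Trimmed copy of the sample output above, used instead of a live request.
sample = {
    "employeeCountRange": "11-50",
    "websiteUrl": "https://www.bytescrum.com",
    "companyType": "Privately Held",
    "foundedDate": {"year": 2017},
    "industries": ["Information Technology and Services"],
    "basicCompanyInfo": {
        "headquarters": "Lucknow",
        "miniCompany": {"name": "ByteScrum Technologies Private Limited"},
    },
}

for field, value in summarize_company(sample).items():
    print(f"{field}: {value}")
```

With a live session, the same function could be applied to the dictionary returned by getdatafromvoyagerlinkedin.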

LinkedIn assigns a unique numeric ID to each company page. Once you have the ID of the company you are interested in, pass it to the function in place of 16198870.

Summary
LinkedIn scraping is a form of web scraping: it gathers publicly accessible data from LinkedIn's website by sending HTTP requests automatically and extracting specific information from the responses. To avoid penalties such as account suspension or legal action, it is critical to follow LinkedIn's terms of service. The name Voyager is also used by a geospatial search platform with its own REST API; the API used in this article is the Voyager API that LinkedIn developed for its web and mobile applications. The JSESSIONID is a unique session identifier issued by the web server, whereas the CSRF token protects against unauthorized requests. Both can be obtained from browser cookies, web scraping tools, or response headers. Overall, understanding the rules and ethical issues around data extraction is critical when working with web scraping and APIs like Voyager.

Thank you for reading our blog. Our top priority is your success and satisfaction. We are ready to assist with any questions or additional help.

Warm regards,

ByteScrum Blog Team,

ByteScrum Technologies Private Limited! 🙏