DevExtrac Demo

DevExtrac (Beta) Demo:

Enter an article/post page URL to get the readable content.
Get an article for testing here: https://news.google.com/


What is DevExtrac?

DevExtrac enables extraction of data from the web at scale. Using a number of algorithms based on manual observation of thousands of pages, it decides whether a webpage contains readable content and retrieves the content that is suitable for analytics and sharing.

About DevExtrac

DevExtrac is a Java library that takes a URL or HTML as input and decides whether the webpage is an article page or a category/home page. More generally, the algorithm decides whether the webpage contains any readable content.
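To make the article vs. category/home page decision more concrete, here is a minimal sketch of one structure-based heuristic: link density. It is not DevExtrac's actual algorithm, which is not described here. The class name `PageTypeSketch`, the method `looksLikeArticle`, and both thresholds are hypothetical and chosen only for illustration. The sketch uses the HTML parser bundled with the JDK, so it needs no NLP and no regular expressions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageTypeSketch {

    // Hypothetical thresholds, not taken from DevExtrac.
    static final int MIN_TEXT_CHARS = 200;
    static final double MAX_LINK_DENSITY = 0.5;

    /** Counts text characters on the page, and how many of them sit inside links. */
    static final class TextCounter extends HTMLEditorKit.ParserCallback {
        int anchorDepth = 0;
        int totalChars = 0;
        int linkChars = 0;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                anchorDepth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.A && anchorDepth > 0) {
                anchorDepth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            totalChars += data.length;
            if (anchorDepth > 0) {
                linkChars += data.length;
            }
        }
    }

    /**
     * Returns true when the page has enough text and most of it is not link text.
     * Pages dominated by link text (menus, headline lists) are treated as
     * category/home pages.
     */
    static boolean looksLikeArticle(String html) {
        TextCounter counter = new TextCounter();
        try {
            new ParserDelegator().parse(new StringReader(html), counter, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (counter.totalChars < MIN_TEXT_CHARS) {
            return false; // too little text to count as readable content
        }
        double linkDensity = (double) counter.linkChars / counter.totalChars;
        return linkDensity < MAX_LINK_DENSITY;
    }
}
```

Because the decision depends only on where text sits in the markup, the same check applies to a page in any human language. A real classifier would likely combine several such signals.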

  • Retrieves the right content of any webpage before deciding the type of page.

  • Lightweight: does not depend on NLP or regular expressions.

  • Optional retrieval of the most relevant image. Use it only if needed, as it slows down processing. To optimize performance, it reads image headers only.

  • Concise, short, scalable, and dependent on the DOM structure of the page.

  • Human-language independent: since the library relies on the structure of the webpage, the language of the webpage doesn't matter for text retrieval and decision making.

  • Source code: you can tweak it as per your requirements.

  • GPLv3/commercial license available for use for USD 550.
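
The image bullet above mentions reading image headers only. As a rough illustration of that idea (not DevExtrac's own code), the sketch below uses the JDK's `javax.imageio` API to obtain an image's width and height; `ImageReader.getWidth` and `getHeight` typically parse just the header metadata instead of decoding the pixel data. The class name `ImageHeaderSketch` and the method `readDimensions` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageHeaderSketch {

    /**
     * Returns {width, height} of the first image in the stream,
     * or null if no installed reader recognizes the format.
     */
    static int[] readDimensions(InputStream in) {
        try (ImageInputStream stream = ImageIO.createImageInputStream(in)) {
            if (stream == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
            if (!readers.hasNext()) {
                return null; // unknown or unsupported image format
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only basic header info is needed
                reader.setInput(stream, true, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a remote image, the `InputStream` would come from an HTTP connection, and the caller could close it as soon as the dimensions are known, so that the rest of the file is not downloaded.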