Continuum Analytics has always been a leader in open source development and state-of-the-art technology. A prime example of our dedication to addressing complex problems and testing the boundaries of data science can be seen in our work on the Defense Advanced Research Projects Agency (DARPA) Memex program, which has been featured on 60 Minutes.

Commercially available web search engines typically do not work well for many government use cases. To help overcome this limitation, DARPA launched the Memex program to provide mechanisms for better content discovery, information extraction, information retrieval, user collaboration, and search.

Benefits of the Memex program include:

  • Development of next-generation search technologies to revolutionize the discovery, organization, and presentation of domain-specific content.
  • Creation of a new domain-specific search paradigm to discover relevant content and organize it in ways that are more immediately useful to specific tasks.
  • Extension of current search capabilities to the deep web and nontraditional content.
  • Improved interfaces for military, government, and commercial enterprises to find and organize publicly available information on the Internet.

A key use case of the technology is helping law enforcement discover and address human trafficking on the Web. The number of websites facilitating human trafficking is increasing, and current software and search approaches do not address the scale and scope of the ever-expanding Web. Nor do they effectively integrate interactive and social media, text, images, and video, all of which are required for performing “deep searches.”

In an effort to shake up the traditional search industry, which is controlled by a handful of companies, and to expand the scope of search capabilities for government agencies, DARPA has decided to open source various components of Memex. Law enforcement, companies, and others are welcome to take the technologies and adapt them for their own use. All open-source projects can be found on the Memex Open Catalog.

Continuum Analytics is collaborating with NASA’s Jet Propulsion Laboratory (JPL) and Kitware to support and extend a collection of search utilities like ImageCat, FacetSpace, LegisGATE, and ImageSpace built on top of Apache Software Foundation projects.

Continuum Analytics is also collaborating with New York University (NYU), using their Domain Discovery Tool (DDT) and Ache web crawler. These tools provide a powerful platform to explore new domains, train classifiers to determine relevance to that domain, and perform directed web crawling using those trained classifiers to find novel content.
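To make the idea of classifier-directed crawling concrete, here is a minimal sketch using only the Python standard library. It is not how DDT or Ache are implemented. The toy in-memory "web", the tiny naive Bayes classifier, and the names `TOY_WEB`, `NaiveBayes`, and `directed_crawl` are all assumptions made up for illustration. The point is simply that a trained relevance model decides which pages are worth expanding.

```python
import heapq
import math
from collections import Counter

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("python data analysis tutorial", ["a", "b"]),
    "a": ("pandas data analysis with python", ["c"]),
    "b": ("celebrity gossip and fashion news", ["d"]),
    "c": ("numpy arrays for data science", []),
    "d": ("more fashion gossip", []),
}


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing and equal priors."""

    def __init__(self, relevant_docs, irrelevant_docs):
        self.counts = {
            True: Counter(w for d in relevant_docs for w in tokenize(d)),
            False: Counter(w for d in irrelevant_docs for w in tokenize(d)),
        }
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_likelihood(self, words, label):
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab)
        # Words never seen in training are ignored.
        return sum(math.log((counts[w] + 1) / total) for w in words if w in self.vocab)

    def relevance(self, text):
        """Return the estimated probability that the text is on-topic."""
        words = tokenize(text)
        pos = self._log_likelihood(words, True)
        neg = self._log_likelihood(words, False)
        return 1.0 / (1.0 + math.exp(neg - pos))


def directed_crawl(seed, classifier, web, threshold=0.5, max_pages=10):
    """Best-first crawl: only links found on relevant pages are followed."""
    frontier = [(-1.0, seed)]  # negated scores turn heapq into a max-heap
    seen = {seed}
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = classifier.relevance(text)
        if score < threshold:
            continue  # off-topic page: do not expand its links
        relevant.append(url)
        for link in links:
            if link not in seen and link in web:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


classifier = NaiveBayes(
    relevant_docs=["python data science", "data analysis with numpy and pandas"],
    irrelevant_docs=["fashion news", "celebrity gossip"],
)
crawl_result = directed_crawl("seed", classifier, TOY_WEB)
print(crawl_result)
```

In this toy run the off-topic page is fetched and scored but never expanded, so the page it links to is never visited. A real directed crawler would add much more, such as politeness rules, link-context features, and retraining as new examples are labeled.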

Both Apache Nutch and Ache are available through our product, Memex Explorer. This Django-based web app, built upon the Anaconda platform, provides a unified front end to explore new domains, run directed crawls for domain-specific searches, and visualize information about the quality, scope, and relevance of the crawled data. Memex Explorer combines the two crawlers into one system. When a crawl is run, each webpage or other piece of web content is passed through Apache Tika, which extracts metadata about the content, and then the content and metadata are passed into Elasticsearch. Once everything is indexed, we can visualize information about the collected content using Kibana or Bokeh.
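The crawl, extract, and index flow described above can be sketched in a few lines of standard-library Python. This is an illustration only and does not use the real components: `PageExtractor` is a hypothetical stand-in for the metadata extraction step, `InMemoryIndex` is a hypothetical stand-in for the search index, and the example URLs and pages are made up.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Stand-in for the extraction step: collects the title and text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(url, html):
    """Turn raw HTML into a document of content plus simple metadata."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {"url": url, "title": parser.title.strip(), "text": text, "length": len(text)}


class InMemoryIndex:
    """Stand-in for the search index: a tiny inverted index over title and text."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def index(self, doc):
        self.docs[doc["url"]] = doc
        for word in (doc["title"] + " " + doc["text"]).lower().split():
            self.postings[word].add(doc["url"])

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


# Hypothetical crawl output: URL -> raw HTML.
crawled = {
    "http://example.com/a": "<html><head><title>Deep Web</title></head>"
                            "<body><p>Search beyond the surface</p></body></html>",
    "http://example.com/b": "<html><head><title>Crawlers</title></head>"
                            "<body><p>Directed search of web content</p></body></html>",
}

index = InMemoryIndex()
for url, html in crawled.items():
    index.index(extract(url, html))

print(index.search("surface"))
```

Keeping extraction and indexing as separate steps mirrors the description above: either stage could be swapped for a production component without changing the other.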

Once semi-structured content is available through Elasticsearch, various extractors can identify relevant information, which will then be indexed and fed to search utilities. For some domain-specific searches, we use DIG, developed at the University of Southern California’s Information Sciences Institute (USC-ISI), to extract content to feed topic modeling in our TopicSpace application. Topic modeling is an unsupervised clustering technique used to discover interesting patterns (or topics) within a large collection of text. The intention of TopicSpace is to provide an easy way to explore crawled data and aid in investigating the content as the corpus of documents grows larger.
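As a rough illustration of grouping documents without labels, the sketch below clusters a handful of made-up documents and lists the most frequent terms in each group. It deliberately swaps in a simple k-means-style clustering over word counts with cosine similarity. It is not a probabilistic topic model, and it is not the method used by TopicSpace. All function names and the sample documents are assumptions for this example.

```python
import math
from collections import Counter


def vectorize(doc):
    """Bag-of-words term counts for one document."""
    return dict(Counter(doc.lower().split()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: c / len(vectors) for w, c in total.items()}


def cluster(docs, k, iterations=10):
    """k-means-style clustering that assigns each document to its most similar center."""
    vectors = [vectorize(d) for d in docs]
    # Deterministic farthest-first initialization: start from the first document,
    # then repeatedly add the document least similar to the centers chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centers[j] = mean_vector(members)
    return labels


def top_terms(docs, labels, label, n=3):
    """Most frequent terms among the documents assigned to one cluster."""
    counts = Counter()
    for doc, doc_label in zip(docs, labels):
        if doc_label == label:
            counts.update(doc.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Made-up sample corpus.
docs = [
    "crawler fetch page link crawler",
    "page link fetch web",
    "topic model cluster text",
    "cluster text topic corpus",
]
labels = cluster(docs, k=2)
for label in range(2):
    print(label, top_terms(docs, labels, label))
```

The per-cluster term lists play the role of rough "topics" that a user could browse. A real topic model would instead let each document mix several topics and would weight terms more carefully than raw counts.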

Continuum Analytics is excited to be participating in DARPA’s Memex program. We use a combination of many open-source technologies from the Python, Java, and R communities to make this work successful. We look forward to seeing the impact these tools have on the landscape of the Internet.

