rattle/muddyit_fu
Muddy is an information extraction platform. For further details, see the ‘Getting Started with Muddy’ article. This gem provides access to the Muddy platform via its API (see the Muddy Developer Guide).

sudo gem install muddyit_fu

Muddy supports OAuth and HTTP Basic auth for authentication and authorisation. We recommend you use OAuth wherever possible when accessing Muddy. An example of using OAuth with the Muddy platform is described in the Building with Muddy and OAuth article.

---
consumer_key: YOUR_CONSUMER_KEY
consumer_secret: YOUR_CONSUMER_SECRET
access_token: YOUR_ACCESS_TOKEN
access_token_secret: YOUR_ACCESS_TOKEN_SECRET
---
username: YOUR_USERNAME
password: YOUR_PASSWORD
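
The two YAML documents above are presumably alternative contents for the config file passed to Muddyit.new: the first for OAuth, the second for HTTP Basic auth. As a minimal sketch, assuming those key names, a config could be checked before use like this. The `auth_scheme` helper is hypothetical and not part of the gem.

```ruby
require 'yaml'

OAUTH_KEYS = %w[consumer_key consumer_secret access_token access_token_secret].freeze
BASIC_KEYS = %w[username password].freeze

# Hypothetical helper (not part of muddyit_fu): reports which auth scheme
# a parsed config hash describes, or raises if required keys are missing.
def auth_scheme(config)
  return :oauth if OAUTH_KEYS.all? { |key| config.key?(key) }
  return :basic if BASIC_KEYS.all? { |key| config.key?(key) }

  raise ArgumentError, 'config must contain OAuth keys or username/password'
end

oauth_config = YAML.safe_load(<<~YML)
  ---
  consumer_key: YOUR_CONSUMER_KEY
  consumer_secret: YOUR_CONSUMER_SECRET
  access_token: YOUR_ACCESS_TOKEN
  access_token_secret: YOUR_ACCESS_TOKEN_SECRET
YML

basic_config = YAML.safe_load(<<~YML)
  ---
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
YML

puts auth_scheme(oauth_config)
puts auth_scheme(basic_config)
```

A check like this catches a misspelt or missing key locally, before any request is made.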

This example uses the basic ‘extract’ method to retrieve a list of entities from a piece of source text.

require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
page = muddyit.extract(ARGV[0])
page.entities.each do |entity|
  puts "\t#{entity.term}, #{entity.uri}, #{entity.classification}"
end

Muddy uses an intelligent extraction method to identify the key text on any given web page, meaning that the entities extracted are relevant to the article and don’t include spurious results from navigation sidebars or page footers. To analyse a web page rather than a piece of text, pass its URL instead:

page = muddyit.extract('http://news.bbc.co.uk/1/hi/northern_ireland/8450854.stm')

Muddy allows you to store the entity extraction results so aggregate operations can be performed over a collection of content (a ‘collection’ has many analysed ‘pages’). A basic Muddy account provides a single ‘collection’ where extraction results can be stored.

To store a page against a collection, the collection must first be found:

collection = muddyit.collections.find(:all).first

Once a collection has been found, entity extraction results can be stored in it:

collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minimum_confidence => 0.2})

A collection allows aggregate operations to be performed on itself and on its members. A collection is identified by its ‘collection token’, a six-character alphanumeric string (e.g. ‘a0ret4’). A collection can be found directly if its token is known:

collection = muddyit.collections.find('a0ret4')

You can iterate through all the analysed pages in a collection. Be aware that the Muddy API returns pages as paginated sets, so working through every page in a large collection may take some time: each new set of results requires a further HTTP request.

require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find(:all).first
collection.pages.find(:all) do |page|
  puts page.title
  page.entities.each do |entity|
    puts "\t#{entity.uri}"
  end
end
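
To illustrate why iterating over a collection involves repeated requests, here is a minimal sketch of block-based iteration over paginated results. It does not use the gem: `each_item`, the `fetch_page` lambda and the page size of two are all hypothetical stand-ins, and the gem's own implementation may differ.

```ruby
# Hypothetical sketch of block-based iteration over paginated results.
# `fetch_page` stands in for one HTTP request; it is not a muddyit_fu method.
def each_item(fetch_page)
  page_number = 1
  loop do
    batch = fetch_page.call(page_number) # one request per paginated set
    break if batch.empty?

    batch.each { |item| yield item }
    page_number += 1
  end
end

# Stub data source: five made-up titles served two per page.
TITLES = ['Page A', 'Page B', 'Page C', 'Page D', 'Page E'].freeze
requests = 0
fetch_page = lambda do |page_number|
  requests += 1
  TITLES.each_slice(2).to_a[page_number - 1] || []
end

each_item(fetch_page) { |title| puts title }
puts "requests made: #{requests}"
```

With five items and two per page, the loop makes four requests; the last one returns an empty set and ends the iteration.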

Each page in a collection is assigned a unique alphanumeric identifier. Whilst this can be used to find a given page in a collection, it is also possible to search for the page using other attributes:

page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first

A page can be ‘refreshed’ (the entity extraction is run again) by calling the ‘update’ method on a page object:

page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
updated_page = page.update

A page can be removed from a collection by calling the ‘destroy’ method on a page object:

page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
page.destroy

To find all pages that reference the grounded entity for ‘Gordon Brown’, search for it using its DBpedia URI:

require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find('a0ret4')
collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
  puts "#{page.identifier} - #{page.title}"
end

To find other entities that occur frequently with ‘Gordon Brown’ in this collection:

require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find('a0ret4')
puts "Related entity\tOccurrence"
collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
  puts "#{entry[:entity].uri}\t#{entry[:count]}"
end
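
To make the idea of ‘related’ entities concrete, here is an illustrative sketch that counts how often other entities appear on pages that also mention a target entity. This is an assumption about what the counts represent, not Muddy's actual algorithm. The `related_entities` helper and the sample pages are made up, and entities are plain URI strings rather than the gem's entity objects.

```ruby
# Illustrative sketch only: counts co-occurrences with a target entity.
# Not part of muddyit_fu, and not necessarily how Muddy computes its results.
def related_entities(pages, target_uri)
  counts = Hash.new(0)
  pages.each do |entity_uris|
    next unless entity_uris.include?(target_uri)

    entity_uris.uniq.each do |uri|
      counts[uri] += 1 unless uri == target_uri
    end
  end
  # Most frequent co-occurring entities first; ties broken by URI.
  counts.sort_by { |uri, count| [-count, uri] }
        .map { |uri, count| { :entity => uri, :count => count } }
end

brown   = 'http://dbpedia.org/resource/Gordon_Brown'
darling = 'http://dbpedia.org/resource/Alistair_Darling'
cameron = 'http://dbpedia.org/resource/David_Cameron'

# Each inner array holds the entity URIs found on one (made-up) page.
pages = [
  [brown, darling],
  [brown, darling, cameron],
  [cameron]
]

puts "Related entity\tOccurrence"
related_entities(pages, brown).each do |entry|
  puts "#{entry[:entity]}\t#{entry[:count]}"
end
```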

To find other content in the collection that shares entities with the analysed page whose URI is ‘http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm’:

require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find(:all).first
page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm').first
puts "Page : #{page.title}\n\n"
page.related_content.each do |results|
  puts "#{results[:page].title} #{results[:count]}"
end
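
As a rough illustration of ‘shares similar entities’, the sketch below ranks other pages by the number of entities they have in common with a given page. The scoring, the `related_content` helper and the sample data are all assumptions made for illustration; the platform's own ranking may work differently.

```ruby
# Illustrative sketch only: ranks other pages by the number of entities
# they share with the named page. Not part of muddyit_fu.
def related_content(pages, title)
  source = pages.fetch(title)
  pages.reject { |other_title, _uris| other_title == title }
       .map { |other_title, uris| { :page => other_title, :count => (source & uris).size } }
       .select { |result| result[:count] > 0 }
       .sort_by { |result| [-result[:count], result[:page]] }
end

# Made-up pages, keyed by title, with shortened entity names.
pages = {
  'Budget statement'  => %w[Gordon_Brown Alistair_Darling HM_Treasury],
  'Pre-budget report' => %w[Alistair_Darling HM_Treasury],
  'Election debate'   => %w[Gordon_Brown David_Cameron],
  'Football results'  => %w[Premier_League]
}

related_content(pages, 'Budget statement').each do |result|
  puts "#{result[:page]} #{result[:count]}"
end
```

Pages with no entities in common are dropped, so ‘Football results’ does not appear in the output.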

The Muddy platform runs a background job queue that allows many requests to be made in quick succession, rather than waiting for each extraction request to complete. Pages are analysed asynchronously via the queue and stored in the collection at a later time, which can be useful when analysing large content collections. To send a request to the queue, use:

collection = muddyit.collections.find('a0ret4')
collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})
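
When analysing many URLs, the same call can be made in a loop. The sketch below shows one way to do that. So that it runs without the gem or a network connection, it uses a stand-in object in place of `collection.pages`. Both `RecordingPages` and `enqueue_all` are hypothetical names introduced for illustration.

```ruby
# Stand-in for `collection.pages` so the sketch runs without the gem or a
# network connection; it only records what would have been sent.
class RecordingPages
  attr_reader :requests

  def initialize
    @requests = []
  end

  def create(uri, options = {})
    @requests << { :uri => uri, :options => options }
    true
  end
end

# Hypothetical helper: queues every URL for asynchronous analysis.
def enqueue_all(pages, urls)
  urls.each { |url| pages.create(url, :realtime => false) }
end

pages = RecordingPages.new
enqueue_all(pages, [
  'http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm',
  'http://news.bbc.co.uk/1/hi/business/8186840.stm'
])
puts "queued #{pages.requests.size} pages"
```

In real use, `collection.pages` would be passed in place of the stand-in object.
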

Author: Rob Lee
Email: support [at] muddy.it
Main Repository: http://github.com/rattle/muddyit_fu/tree/master

A ruby client library for the muddy.it service