Question 1

I admit that this is basically a duplicate of the question "Use freebase data on local server?", but I need more detailed answers than have already been given there.

I've fallen absolutely in love with Freebase. What I want now is to essentially create a very simple Freebase clone for storing content that may not belong on Freebase itself but can be described using the Freebase schema. Essentially what I want is a simple and elegant way to store data like Freebase itself does and be able to easily use that data in a Python (CherryPy) web application.

Chapter 2 of the MQL reference guide states:

The database that underlies Metaweb is fundamentally different than the relational databases that you may be familiar with. Relational databases store data in the form of tables, but the Metaweb database stores data as a graph of nodes and relationships between those nodes.

Which I guess means that I should be using either a triplestore or a graph database such as Neo4j? Does anybody here have any experience with using one of those from a Python environment?
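To make the "graph of nodes and relationships" idea concrete, here is a minimal in-memory sketch of a triple store in Python. It is not how Metaweb or graphd works internally; the class name, method names and the sample ids are made up for illustration, and a real triplestore or graph database would add persistence and a query language on top of this.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory graph: each fact is a (source, property, destination) triple."""

    def __init__(self):
        # Two indexes so the graph can be walked in both directions.
        self._out = defaultdict(lambda: defaultdict(set))  # source -> property -> destinations
        self._in = defaultdict(lambda: defaultdict(set))   # destination -> property -> sources

    def add(self, source, prop, destination):
        self._out[source][prop].add(destination)
        self._in[destination][prop].add(source)

    def objects(self, source, prop):
        """Everything that `source` points at through `prop`."""
        return sorted(self._out.get(source, {}).get(prop, ()))

    def subjects(self, prop, destination):
        """Everything that points at `destination` through `prop`."""
        return sorted(self._in.get(destination, {}).get(prop, ()))


store = TripleStore()
# Illustrative ids only; they follow the Freebase naming style.
store.add('/en/blade_runner', '/type/object/type', '/film/film')
store.add('/en/blade_runner', '/film/film/directed_by', '/en/ridley_scott')
store.add('/en/alien', '/film/film/directed_by', '/en/ridley_scott')

print(store.objects('/en/blade_runner', '/film/film/directed_by'))
print(store.subjects('/film/film/directed_by', '/en/ridley_scott'))
```

Keeping a second index keyed by destination is what makes "which films did this person direct?" as cheap as "who directed this film?".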

(What I've actually tried so far is to create a relational database schema which would be able to easily store Freebase topics, but I'm having issues with configuring the mappings in SQLAlchemy).

Things I'm looking into

http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/
http://librdf.org/

UPDATE [28/12/2011]:

I found an article on the Freebase blog that describes the proprietary tuple store / database Freebase themselves use (graphd): http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/

Answer

This is what worked for me. It lets you load an entire Freebase dump into a standard MySQL installation using less than 100 GB of disk space. The key is understanding the data layout in a dump and then transforming it (optimizing it for space and speed).

Freebase notions you should understand before you attempt to use this (all taken from the documentation):

Topic - anything of type '/common/topic', pay attention to the different types of ids you may encounter in Freebase - 'id', 'mid', 'guid', 'webid', etc.
Domain
Type - 'is a' relationship
Properties - 'has a' relationship
Schema
Namespace
Key - human readable in the '/en' namespace

Some other important Freebase specifics:

the query editor is your friend
understand the 'source', 'property', 'destination' and 'value' notions described here
everything has a mid, even things like '/', '/m', '/en', '/lang', '/m/0bnqs_5', etc.; Test using the query editor: [{'id':'/','mid':null}]
you don't know what any entity (i.e. row) in the data dump is, you have to get to its types to do that (for instance how do I know '/m/0cwtm' is a human);
every entity has at least one type (but usually many more)
every entity has at least one id/key (but usually many more)
the ontology (i.e. metadata) is embedded in the same dump and the same format as the data (not the case with other distributions like DBPedia, etc.)
the 'destination' column in the dump is the confusing one; it may contain a mid or a key (see how the transforms below deal with this)
the domains, types, properties are namespace levels at the same time (whoever came up with this is a genius IMHO);
understand what is a Topic and what is not (absolutely crucial): for example, the entity '/m/03lmb2f' of type '/film/performance' is NOT a Topic (I think of these as the equivalent of Blank Nodes in RDF, although this may not be philosophically accurate), while '/m/04y78wb' of type '/film/director' (among others) is.
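The 'source', 'property', 'destination' and 'value' columns can be illustrated with a small sketch that classifies one line of the tab-separated dump. The three categories reflect my reading of the dump layout (destination only, destination plus value, value only); the function name, the category labels, the key and the date in the sample rows are made up for illustration.

```python
def classify_row(line):
    """Return (kind, fields) for one tab-separated line of the quad dump."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 4:
        return 'malformed', fields
    source, prop, destination, value = fields[:4]
    if destination and not value:
        return 'link', fields          # points at another entity
    if destination and value:
        return 'namespaced', fields    # key in a namespace, or text in a language
    if value:
        return 'value', fields         # plain literal
    return 'malformed', fields


# Hypothetical sample rows; only the shape matters here.
rows = [
    '/m/0cwtm\t/type/object/type\t/people/person\t\n',
    '/m/0cwtm\t/type/object/key\t/en\texample_key\n',
    '/m/0cwtm\t/people/person/date_of_birth\t\t1900-01-01\n',
]
kinds = [classify_row(row)[0] for row in rows]
print(kinds)
```

The first kind of row is what answers "what is this entity?": collecting every '/type/object/type' link for a source gives you its types.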

Transforms

(see the Python code at the bottom)

TRANSFORM 1 (from shell, split links from namespaces ignoring notable_for and non /lang/en text):

python parse.py freebase.tsv  #end up with freebase_links.tsv and freebase_ns.tsv

TRANSFORM 2 (from Python console, split freebase_ns.tsv into freebase_ns_types.tsv, freebase_ns_props.tsv plus 15 other files which we ignore for now)

import e
e.split_external_keys( 'freebase_ns.tsv' )

TRANSFORM 3 (from Python console, convert property and destination to mids)

import e
ns = e.get_namespaced_data( 'freebase_ns_types.tsv' )
e.replace_property_and_destination_with_mid( 'freebase_links.tsv', ns )    #produces freebase_links_pdmids.tsv
e.replace_property_with_mid( 'freebase_ns_props.tsv', ns ) #produces freebase_ns_props_pmids.tsv
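What TRANSFORM 3 does to a single row can be sketched as follows. This is a simplified stand-in for e.mid(...), not a copy of it: the function name resolve_mid is hypothetical, the mids in the lookup table are made up, and the table has the same shape as the dict returned by e.get_namespaced_data, i.e. (namespace, key) -> mid.

```python
def resolve_mid(key, ns):
    """Return the mid for an id such as '/film/film/directed_by', or None."""
    if not key.startswith('/'):
        return None
    if key.startswith('/m/'):
        return key                       # already a mid
    namespace, _, name = key.rpartition('/')
    return ns.get((namespace, name))


# (namespace, key) -> mid; the mids below are invented for the example.
ns = {
    ('/film/film', 'directed_by'): '/m/0aaaa1',
    ('/film', 'film'): '/m/0aaaa2',
}

row = ['/m/0bbbb3', '/film/film/directed_by', '/m/0bbbb4', '']
converted = [row[0], resolve_mid(row[1], ns), resolve_mid(row[2], ns), row[3]]
print(converted)
```

Once property and destination are both mids, every column of the links table fits in a short fixed-width string, which is where most of the space saving comes from.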

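TRANSFORM 4 (from MySQL console, load freebase_links_pdmids.tsv, freebase_ns_props_pmids.tsv and the types file into the DB; the LOAD statement below calls the types file freebase_ns_base_plus_types.tsv, so use whatever name your types file has):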

CREATE TABLE links(
    source      VARCHAR(20),
    property    VARCHAR(20),
    destination VARCHAR(20),
    value       VARCHAR(1)
) ENGINE=MyISAM CHARACTER SET utf8;

CREATE TABLE ns(
    source      VARCHAR(20),
    property    VARCHAR(20),
    destination VARCHAR(40),
    value       VARCHAR(255)
) ENGINE=MyISAM CHARACTER SET utf8;

CREATE TABLE types(
    source      VARCHAR(20),
    property    VARCHAR(40),
    destination VARCHAR(40),
    value       VARCHAR(40)
) ENGINE=MyISAM CHARACTER SET utf8;

LOAD DATA LOCAL INFILE "/data/freebase_links_pdmids.tsv" INTO TABLE links FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE "/data/freebase_ns_props_pmids.tsv" INTO TABLE ns FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE "/data/freebase_ns_base_plus_types.tsv" INTO TABLE types FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE INDEX links_source            ON links (source)             USING BTREE;
CREATE INDEX ns_source               ON ns    (source)             USING BTREE;
CREATE INDEX ns_value                ON ns    (value)              USING BTREE;
CREATE INDEX types_source            ON types (source)             USING BTREE;
CREATE INDEX types_destination_value ON types (destination, value) USING BTREE;
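Once the tables are loaded, a typical lookup joins links against types to turn property mids back into readable ids. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the two tables mirror the schema above, while the rows, the mids and the helper name readable_links are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE links(source TEXT, property TEXT, destination TEXT, value TEXT);
CREATE TABLE types(source TEXT, property TEXT, destination TEXT, value TEXT);
CREATE INDEX links_source ON links (source);
CREATE INDEX types_source ON types (source);
""")

# types: mid of a schema object, then the namespace and key that name it.
conn.executemany('INSERT INTO types VALUES (?, ?, ?, ?)', [
    ('/m/0aaaa1', '/type/object/key', '/film/film', 'directed_by'),
    ('/m/0aaaa2', '/type/object/key', '/film', 'film'),
])
# links: source, property and destination are all mids after TRANSFORM 3.
conn.executemany('INSERT INTO links VALUES (?, ?, ?, ?)', [
    ('/m/0bbbb3', '/m/0aaaa1', '/m/0bbbb4', ''),
])


def readable_links(conn, source_mid):
    """Return (property id, destination mid) pairs for one source entity."""
    rows = conn.execute(
        """
        SELECT t.destination || '/' || t.value, l.destination
        FROM links AS l
        JOIN types AS t ON t.source = l.property
        WHERE l.source = ?
        ORDER BY 1, 2
        """,
        (source_mid,),
    )
    return rows.fetchall()


print(readable_links(conn, '/m/0bbbb3'))
```

The query only touches the links_source and types_source indexes, so it works in the source-to-destination direction without needing an index on links.destination.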

Code

Save this as e.py:

import sys

#returns a dict to be used by mid(...), replace_property_and_destination_with_mid(...) below
def get_namespaced_data( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        elements = line[:-1].split('\t')
        if len( elements ) < 4:
            print( 'Skip...' )
            continue
        result[(elements[2], elements[3])] = elements[0]
    return result

#runs out of memory
def load_links( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        if len( result ) % 1000000 == 0:
            print( len(result) )
        elements = line[:-1].split('\t')
        src, prop, dest = elements[0], elements[1], elements[2]
        if result.get( src, False ):
            if result[ src ].get( prop, False ):
                result[ src ][ prop ].append( dest )
            else:
                result[ src ][ prop ] = [dest]
        else:
            result[ src ] = dict([( prop, [dest] )])
    return result

#same as load_links but for the namespaced data
def load_ns( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        if len( result ) % 1000000 == 0:
            print( len(result) )
        elements = line[:-1].split('\t')
        src, prop, value = elements[0], elements[1], elements[3]
        if result.get( src, False ):
            if result[ src ].get( prop, False ):
                result[ src ][ prop ].append( value )
            else:
                result[ src ][ prop ] = [value]
        else:
            result[ src ] = dict([( prop, [value] )])
    return result

#returns the set of all source ids in the file
def links_in_set( file_name ):
    f = open( file_name )
    result = set()
    for line in f:
        elements = line[:-1].split('\t')
        result.add( elements[0] )
    return result

#resolves a key such as '/film/film/directed_by' to its mid, returns False if it can't
def mid( key, ns ):
    if key == '':
        return False
    elif key == '/':
        key = '/boot/root_namespace'
    parts = key.split('/')
    if len(parts) == 1:           #cover the case of something which doesn't start with '/'
        print( key )
        return False
    if parts[1] == 'm':           #already a mid
        return key
    namespace = '/'.join(parts[:-1])
    key = parts[-1]
    return ns.get( (namespace, key), False )

def replace_property_and_destination_with_mid( file_name, ns ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_mids = open(fn+'_pdmids'+'.tsv', 'w')
    def convert_to_mid_if_possible( value ):
        m = mid( value, ns )
        if m: return m
        else: return None
    counter = 0
    for line in f:
        elements = line[:-1].split('\t')
        md   = convert_to_mid_if_possible(elements[1])
        dest = convert_to_mid_if_possible(elements[2])
        if md and dest:
            elements[1] = md
            elements[2] = dest
            f_out_mids.write( '\t'.join(elements)+'\n' )
        else:
            counter += 1
    print( 'Skipped: ' + str( counter ) )

def replace_property_with_mid( file_name, ns ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_mids = open(fn+'_pmids'+'.tsv', 'w')
    def convert_to_mid_if_possible( value ):
        m = mid( value, ns )
        if m: return m
        else: return None
    for line in f:
        elements = line[:-1].split('\t')
        md = convert_to_mid_if_possible(elements[1])
        if md:
            elements[1]=md
            f_out_mids.write( '\t'.join(elements)+'\n' )
        else:
            #print 'Skipping ' + elements[1]
            pass

#cPickle
#ns=e.get_namespaced_data('freebase_2.tsv')
#import cPickle
#cPickle.dump( ns, open('ttt.dump','wb'), protocol=2 )
#ns=cPickle.load( open('ttt.dump','rb') )

#fn='/m/0'
#n=fn.split('/')[2]
#dir = n[:-1]

def is_mid( value ):
    parts = value.split('/')
    if len(parts) == 1:   #it doesn't start with '/'
        return False
    if parts[1] == 'm':
        return True
    return False

def check_if_property_or_destination_are_mid( file_name ):
    f = open( file_name )
    for line in f:
        elements = line[:-1].split('\t')
        #if is_mid( elements[1] ) or is_mid( elements[2] ):
        if is_mid( elements[1] ):
            print( line )

#routes every line of the namespace file to an output file chosen by its destination
def split_external_keys( file_name ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_extkeys  = open(fn+'_extkeys' + '.tsv', 'w')
    f_out_intkeys  = open(fn+'_intkeys' + '.tsv', 'w')
    f_out_props    = open(fn+'_props'   + '.tsv', 'w')
    f_out_types    = open(fn+'_types'   + '.tsv', 'w')
    f_out_m        = open(fn+'_m'       + '.tsv', 'w')
    f_out_src      = open(fn+'_src'     + '.tsv', 'w')
    f_out_usr      = open(fn+'_usr'     + '.tsv', 'w')
    f_out_base     = open(fn+'_base'    + '.tsv', 'w')
    f_out_blg      = open(fn+'_blg'     + '.tsv', 'w')
    f_out_bus      = open(fn+'_bus'     + '.tsv', 'w')
    f_out_soft     = open(fn+'_soft'    + '.tsv', 'w')
    f_out_uri      = open(fn+'_uri'     + '.tsv', 'w')
    f_out_quot     = open(fn+'_quot'    + '.tsv', 'w')
    f_out_frb      = open(fn+'_frb'     + '.tsv', 'w')
    f_out_tag      = open(fn+'_tag'     + '.tsv', 'w')
    f_out_guid     = open(fn+'_guid'    + '.tsv', 'w')
    f_out_dtwrld   = open(fn+'_dtwrld'  + '.tsv', 'w')
    for line in f:
        elements = line[:-1].split('\t')
        parts_2 = elements[2].split('/')
        if len(parts_2) == 1:                 #the blank destination elements - '', plus the root domain ones
            if elements[1] == '/type/object/key':
                f_out_types.write( line )
            else:
                f_out_props.write( line )
        elif elements[2] == '/lang/en':
            f_out_props.write( line )
        elif (parts_2[1] == 'wikipedia' or parts_2[1] == 'authority') and len( parts_2 ) > 2:
            f_out_extkeys.write( line )
        elif parts_2[1] == 'm':
            f_out_m.write( line )
        elif parts_2[1] == 'en':
            f_out_intkeys.write( line )
        elif parts_2[1] == 'source' and len( parts_2 ) > 2:
            f_out_src.write( line )
        elif parts_2[1] == 'user':
            f_out_usr.write( line )
        elif parts_2[1] == 'base' and len( parts_2 ) > 2:
            if elements[1] == '/type/object/key':
                f_out_types.write( line )
            else:
                f_out_base.write( line )
        elif parts_2[1] == 'biology' and len( parts_2 ) > 2:
            f_out_blg.write( line )
        elif parts_2[1] == 'business' and len( parts_2 ) > 2:
            f_out_bus.write( line )
        elif parts_2[1] == 'soft' and len( parts_2 ) > 2:
            f_out_soft.write( line )
        elif parts_2[1] == 'uri':
            f_out_uri.write( line )
        elif parts_2[1] == 'quotationsbook' and len( parts_2 ) > 2:
            f_out_quot.write( line )
        elif parts_2[1] == 'freebase' and len( parts_2 ) > 2:
            f_out_frb.write( line )
        elif parts_2[1] == 'tag' and len( parts_2 ) > 2:
            f_out_tag.write( line )
        elif parts_2[1] == 'guid' and len( parts_2 ) > 2:
            f_out_guid.write( line )
        elif parts_2[1] == 'dataworld' and len( parts_2 ) > 2:
            f_out_dtwrld.write( line )
        else:
            f_out_types.write( line )

Save this as parse.py:

import sys

def parse_freebase_quadruple_tsv_file( file_name ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_links = open(fn+'_links'+'.tsv', 'w')
    f_out_ns    = open(fn+'_ns'   +'.tsv', 'w')
    for line in f:
        elements = line[:-1].split('\t')
        if len( elements ) < 4:
            print( 'Skip...' )
            continue
        #print 'Processing ' + str( elements )
        #cases described here http://wiki.freebase.com/wiki/Data_dumps
        if elements[1].endswith('/notable_for'):                               #ignore notable_for, it has JSON in it
            continue
        elif elements[2] and not elements[3]:                                  #case 1, linked
            f_out_links.write( line )
        elif not (elements[2].startswith('/lang/') and elements[2] != '/lang/en'):   #ignore languages other than English
            f_out_ns.write( line )

if len(sys.argv[1:]) == 0:
    print( 'Pass a list of .tsv filenames' )

for file_name in sys.argv[1:]:
    parse_freebase_quadruple_tsv_file( file_name )

Notes:

Depending on the machine the index creation may take anywhere from a few to 12+ hours (consider the amount of data you are dealing with though).
To traverse the data in both directions you also need an index on links.destination. I found building it too expensive time-wise; it never finished.
Many other optimizations are possible here. For example the 'types' table is small enough to be loaded in memory in a Python dict (see e.get_namespaced_data( 'freebase_ns_types.tsv' ))

And the standard disclaimer: it has been a few months since I did this. I believe it is mostly correct, but I apologize if my notes missed something. Unfortunately the project I needed it for fell through the cracks, but I hope this helps someone else. If something isn't clear, drop a comment here.

