Website: http://bret.appspot.com/
Original page: http://bret.appspot.com/entry/we-need-a-wikipedia-for-data
Most of the information I've been adding is out there and available for free, it's just not in a form that people can use. I retrieved a lot of financial data by parsing SEC documents, which is all that the companies (like Capital IQ) selling the data to Google and Yahoo are doing. Even without getting companies to "donate their data" we can do a lot better just by sharing our parsers and putting stuff in one place.
Freebase supports Creative Commons, so it's relatively easy to mirror information there, but the way I see it, shouldn't data reside at its original store (let's say a database of genes), while you slurp the metadata into Freebase for more advanced querying? Or is the vision to convert Freebase into THE store for entity extraction, with the original data sources as just repositories? That was the discussion that I had with Danny at Scifoo last year. It just feels somewhat limiting and doesn't really fit with the "web of data" idea either.
In addition to theinfo.org mentioned below, let's not forget a site like Swivel.com. That's pretty much dedicated to datasets, but without the wiki-style flexibility.
It was created to be a repository for public data.
imeem has partnered with all four major music labels as well as thousands of indies and video providers to offer their content for free, on-demand streaming on imeem (supported by advertising). Through the API, imeem delivers the underlying infrastructure and content licenses for our entire catalog of media, allowing developers to focus on what they do best: building apps. We’re really excited to see what everyone is able to create now that they can access the rich metadata for all music albums from all labels that you mention above.
Today we allow developers to create and test their apps in a sandbox environment. The next stage will be to roll out your apps to 24 million users on imeem. Check it out and let us know what you like/dislike about the platform, as well as what features you would like to see added!
Sachin
As many of the commenters identify, the Linked Data activity around the Semantic Web is moving in the direction of a lot of what you discuss; dbpedia etc. are great examples of this start. The Linked Data workshop at the World Wide Web conference later this month should be an interesting event: http://events.linkeddata.org/ldow2008/
One big issue is that of effectively licensing this data so that contributors, consumers, and downstream users can all be clear about their rights and obligations. Whilst there's great work for software, creative works, etc, there's been a lot less done with data. Working with legal opinion, the Science Commons project over at Creative Commons and others, we at Talis were involved in some work that resulted in the release of the Open Data Commons Public Domain Dedication and License (http://opendatacommons.org/) at the end of last year; I'll be talking about it and the economic drivers it addresses at the Web Conference in just over a week...
I'd love to see GPS traces from the Street View fleet submitted back to OSM, but maybe that's too much to wish for.
At least in the USA they have decent Free GIS data. There is basically *no* geographic data for most countries in the EU, that's why OSM exists.
As well as Google providing the GPS traces, they could also provide the camera view images to OSM; that way OSM could also add pedestrian crossings, post boxes, etc. to the database.
I love the idea of empowering users to change data, though the issue I could see is validation.
Are you thinking of something like this? www.jigsaw.com
It applies to contact data only, but in a sense it is community-based data gathering and updating, with a penalty/reward system based on the community's reaction to data quality. Nobody is uploading huge datasets, I think, but presumably it has the potential.
all.
There are lots of other items like this out there, if you haven't seen them.
Keep your eye on http://twine.com as well.
In all seriousness, this would be a tremendous boon to application developers.
Details here: http://www.librarything.com/thingology/2008/01/...
Rather than upload Google's Street View data, it would be better if Google allowed people to trace over their satellite imagery. Yahoo lets the OpenStreetMap project trace over their imagery; Google should do the same.
I agree with most of your points. Getting some big companies on board is important. It seems like there are a couple of issues for them right now: profitability and legality. Dillard's and Sam's Club donated a large retail dataset to the Teradata University Network last year, but like many released datasets, it has noncommercial restrictions. The danger of unintentionally releasing customer information or material financial data seems to be a concern, along with competitive advantage/profitability. Another approach I've used to bootstrap data collection like this is to use Mechanical Turk...
A few good links:
http://en.wikipedia.org/wiki/Open_data
http://numbrary.com/
http://theinfo.org
I did a related blog post a few months ago here: http://www.datawrangling.com/some-datasets-avai... and have a bunch of links to interesting datasets tagged in delicious:
http://del.icio.us/pskomoroch/dataset
-Pete
http://www.masseyratings.com
Go there and email me if you can help.
They have a great API ( http://www.freebase.com/view/freebase/api ) and there are various other projects for jQuery, Python & JavaScript going on ( http://code.google.com/hosting/search?q=freebas... )
They recently got some more cash ( http://www.techcrunch.com/2008/01/16/freebase-t... & http://www.techcrunch.com/2007/03/09/this-is-co... )
Cheers,
Stephen Edgar
* A site with all the laws from all countries in the world (local, state, and federal). You could augment that with information about each law, like when it was passed, what was debated about it, what court cases it's been used in, and who voted for or against it.
* For that matter, an open database of court cases, including all documents and transcripts.
* An open database containing all businesses, their location, what they do, who they employ, etc.
* An open database with all news reports (from television, print, radio, and internet services).
* A database containing the budgets for all public institutions (schools, governments, police stations, etc.).
I would suggest that Wikipedia is not a good analogy. Wikipedia deals with adding, cutting, pasting, and generally mercilessly editing content. The sort of site you're talking about is not best understood as a community collaborating to produce a product (like Wikipedia), but a community collaborating to create a data repository, the contents of which are collaboratively added and organized and which can be freely reused.
Better analogy: Creative Commons for data.
(Note: a while has passed; Mathematica probably supports this by now.)
Of course, it's Wolfram, so they want $$$ for it :-(
There is a great need also to keep this type of open database clean and updated--developing protocols to ensure that the data is timely. Sometimes "open source" means "better quality", but not always...
Maybe Google can just license everything and then all developers can chip in £1 each to help offset the cost.
Then all we need is a standard XML markup and the rest is easy!
Ultimately, if we are able to mine the data and present the actual facts, I think we will bring about a change in this country.
Besides the myriad 'expert' sites -- ones that specialize in a certain knowledge domain -- you'll find CKAN, which concentrates on indexing information sources, http://numbrary.com which is also CTM and has a wealth of open government data, http://freebase.com which is a sprawling integration of human-scale information, and http://swivel.com, which allows live interaction and visualization with their data.
I think the main virtues of http://infochimps.org lie in its suckiness:
- we're "messy". We're looking to loosely couple data: make it discoverable, make it publicly curated, make it interconnect -- but not to impose any kind of strict structure or format or ontology. You can sit happily in our DB with nothing more than a title, a list of credits, and a few tags.
- we're "not live". As much as possible, we'd like to give infochimps data to work with on their machine, using their tools (and not incidentally their CPU cycles).
- we're "not good at any one specific thing". There are sites with economic data, with UN data, with astronomical data, with baseball statistics, with social network graphs. We need a place that allows and in fact inspires connections among all these rich sources of data, and gives you immediate access to them.
If you suspect you may also be an infochimp, please get in touch. The project will only succeed with community involvement.
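To make the "messy" point above concrete, here is a rough sketch of how small such a record could be. This is not infochimps' actual schema: the class name, field names, helper functions, and sample entries are all made up for illustration. The only idea taken from the comment is that a title, a list of credits, and a few tags are enough, and that tags alone can already connect otherwise unrelated datasets.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical minimal catalog entry: only a title is required."""
    title: str
    credits: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    url: Optional[str] = None  # optional pointer to the downloadable data


def build_tag_index(records):
    """Map each lower-cased tag to the titles of the records carrying it."""
    index = defaultdict(list)
    for record in records:
        for tag in record.tags:
            index[tag.lower()].append(record.title)
    return dict(index)


def related(records, title):
    """Return titles of other records sharing at least one tag with `title`."""
    by_title = {r.title: r for r in records}
    wanted = {t.lower() for t in by_title[title].tags}
    return sorted(
        r.title for r in records
        if r.title != title and wanted & {t.lower() for t in r.tags}
    )


# Made-up sample entries; no structure beyond title, credits, and tags.
catalog = [
    DatasetRecord("Baseball statistics", ["example contributor"], ["sports", "statistics"]),
    DatasetRecord("UN population data", tags=["population", "statistics"]),
    DatasetRecord("Social network graph", tags=["graphs"]),
]

print(related(catalog, "Baseball statistics"))
```

Nothing here imposes a format or ontology on the data itself; the record only describes it, which is roughly the loose coupling the comment argues for.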
Anyone should be able to set up auto-updates from their DB into this using an API with zero programming. And of course anyone should be able to pull out any stream. I've been thinking a lot about this recently.
This would not be a system for integrating large collections of free data -- something like Freebase is more suitable for this. Instead, it would be another source of data for such a system: a source providing metadata for tens of thousands of people, books, locations, etc.
I have contributed to the MediaWiki codebase before, and I'll be actively pushing to make this become reality. If I can make it happen, I will try to work closely with the dbpedia folks (chances are good I think, I have talked to them before). Exciting times!
1. You say "No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in." I would argue the exact opposite: going forward a growing portion of our economy will depend on data being their competitive advantage.
2. Not sure a Wiki is the right model to achieve your end result. Several comments here about Semantic Web technologies etc. make this point as well. Don't forget that Internet Search killed the Yahoo Directory.
3. A big point that's easily overlooked is that the metadata is usually as important as the data itself.
4. Network Effects will play a major role. I wrote about a related idea regarding data availability for mashups a while back (http://blog.snaplogic.org/?p=147) that is now due for an update.
5. Data, data everywhere, nor any drop to drink. We're awash in data; the problem is finding what you need, understanding the semantics, and accessing it in a simple, efficient manner.
We don't need a Wikipedia for Data. We need a Google for Metadata.
Dear xxxxxxxxx
Human Solutions Pty Ltd is contracted by the Australian Government
Department of Health & Ageing to collect and maintain public toilet
information for the National Public Toilet Map
(http://www.toiletmap.gov.au/).
Unfortunately, the National Public Toilet Map data is not currently
available to commercial third parties. Non-commercial providers (such as
other government agencies, charities, or associations) can submit a
request by email (project@toiletmap.gov.au) to the Department of Health &
Ageing to access the toilet data. If you submit a request to the
Department of Health and Ageing and do not receive a response, it can be
assumed that your request has been denied.
We do not make the entire National Public Toilet Map available to any
individual or organisation, with the exceptions mentioned above. This has
been done by design, to restrict access to the dataset, to deter
commercial providers from on-selling this data for a profit.
The providers of toilet facilities give us their toilet information on the
understanding that this will be used for the National Public Toilet Map, a
non-profit project for the benefit of the public. If the entire dataset
was made available, then it would be near impossible to control
unauthorised use of the toilet providers' information for commercial gain.
If you have any further queries, please do not hesitate to contact the
National Public Toilet Map Helpline on 1800 990 646.
Regards,
the National Public Toilet Map team
- so we can't even have access to the data set of public toilet locations in Australia!
Thanks for posting this. Illustrates the deep-rooted problems we have to overcome to make something like this happen.
I think that in Portugal all requests are handled this way. I've made several attempts to access government data and only rarely got any answer.
Yahoo just shut down their weather service, which provided weather information from around the world for free in an easy-to-parse XML format. We really need to make sure that those services keep running, in a way that doesn't rely on a single company.
This is a huge problem right now and it really hinders innovation in a lot of areas. The conclusion of my morning brainstorm was that a non-profit organization needs to be created to gather and distribute this information. Maybe Wikipedia just needs to add some more structure to the data and make libraries available that make it easy to integrate into programs. Once the basic information is available, a lot of interesting applications can be developed.
1. Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. As the blurb on the site states:
> CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own - be that a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents.
>
> Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.
We launched CKAN around a year ago and it now has over 160 'packages', including many of those mentioned in the comments in this thread.
2. 'Developing' data particularly using many contributors and a versioning (wiki-like) model. This seems a general problem and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net ...). This then leads into:
3. Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: 'One Ring to Rule them All' vs. 'Small Pieces, Loosely Joined'). After all, it seems unlikely that any one organization, however large, can hold 'all the data', and in any case doing so would negate the benefits of having 'many minds' working on a problem. It is our hope that CKAN would start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/...). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.
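As a rough illustration of what software-style packaging could look like for data, the sketch below resolves the order in which small data packages would need to be fetched so that each package's dependencies arrive first. It is not CKAN's API or data model: the registry layout, the `requires` key, and the package names are hypothetical, chosen only to mirror how freshmeat- or CPAN-style registries treat software.

```python
def install_order(registry, name):
    """Return `name` and everything it depends on, dependencies first."""
    order, visiting = [], set()

    def visit(pkg):
        if pkg in order:
            return  # already scheduled
        if pkg in visiting:
            raise ValueError(f"dependency cycle at {pkg!r}")
        if pkg not in registry:
            raise KeyError(f"unknown package {pkg!r}")
        visiting.add(pkg)
        for dep in registry[pkg]["requires"]:
            visit(dep)
        visiting.discard(pkg)
        order.append(pkg)

    visit(name)
    return order


# Hypothetical registry: small pieces, loosely joined by declared dependencies.
registry = {
    "country-codes": {"requires": []},
    "population-density": {"requires": ["country-codes"]},
    "patents-by-country": {"requires": ["country-codes"]},
    "patents-per-capita": {"requires": ["population-density", "patents-by-country"]},
}

print(install_order(registry, "patents-per-capita"))
```

The derived package holds no copy of the others; it only names them, so each piece can live with, and be maintained by, whoever knows it best.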
I think that it's more likely that we'll see Open Data split vertically rather than one big open data warehouse. People are more able to concentrate on creating a TV guide (like TVIV) or business listings (like Openguides) than stare at some big spreadsheet-style interface and come up with some "data" to share. I think projects like Freebase will be great for aggregating data created by more vertical projects.
I've started a project called Vinismo, where we're documenting every wine in every country in the world, both with unstructured text and with structured (RDF) data. It's an exciting project, and I think there are lots of other similar ones out there.
I agree with Evan that we don't necessarily need a huge ultimate data store for all and everything. What we need are relations between open datasets, in the way of the LinkingOpenData Project:
http://esw.w3.org/topic/SweoIG/TaskForces/Commu...
The project already interlinks data from Wikipedia, GeoNames, MusicBrainz, WordNet and many more.
Make it interactive and fun to use and I think you'd get all sorts of data from all sorts of people - not just the tech community.
(Peter Murray-Rust)
http://freebase.com/
"No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same, high quality data, all of our products would be better for it."
This is certainly true, but I think you've missed some of the incentives.
First, no one wants to compete in an efficient market. Efficient markets are really hard to make money in. It's why venture capital money does not go to people who want to sell wheat (or corn/oil/insert-random-commodity-here). INEFFICIENT markets are where the money is -- that is, in an inefficient market, if your company has figured out what really matters, you have a huge competitive edge. In an efficient market, all players know exactly what matters and a competitive advantage is hard to find.
I should be clear that I am not arguing with your conclusion -- I think the world would be a much better place if the market for data sets was basically an efficient commodity market. I don't, however, believe that any of the current players in this inefficient market have a lot of incentives to move us in that direction. Quite the opposite, in fact; a winner in an inefficient market is basically a player who has "solved" the game, and that player has every incentive to keep the solution secret.
http://www.dealipedia.com/
http://www.dealipedia.com/newsletter_subscribe.php
So it's imperative that these systems support open, commonly used formats and APIs.
Specifically for geo data, a project I work on is Mapufacture, which brings together various data sources in a large number of formats and then shares them out via common formats, so a user (or developer) can just use the format that makes sense for them.
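A small sketch of the "many formats in, common formats out" idea described above, assuming a CSV of points as the input and GeoJSON as the shared output. Neither choice is confirmed as something Mapufacture does; the function name, the `lat`/`lon` column names, and the sample row are assumptions made for the example.

```python
import csv
import io
import json


def csv_points_to_geojson(csv_text, lat_field="lat", lon_field="lon"):
    """Convert CSV rows with latitude/longitude columns to a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_field))
        lon = float(row.pop(lon_field))
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column is carried over as a property.
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample row, for illustration only.
sample = "name,lat,lon\nTown hall,-33.8688,151.2093\n"
collection = csv_points_to_geojson(sample)
print(json.dumps(collection, indent=2))
```

Once everything is normalised to one intermediate structure like this, adding another output format means writing one more serializer rather than one converter per pair of formats.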
Unfortunately it was going to cost us $6k-$9k/month (there is only one source here in Australia) which is crazy, so we obviously decided this was not an option.
If these were in an open system, more people can experiment and solve real problems with any technology (of course, we'd prefer it if people used HTMs).
required. To me, XML and vertical XML standards are the way to go.
Once you get the data inside your firewall, what you convert it to is your business. If you create data at the 60 Hz 120 V standard, you should be able to roll your meter backwards. It costs money to collect data so it should be worth something in both directions.
I have also wondered about the possibility of a DataWiki (I like the term, Bret!), or perhaps of Wikipedia taking on the challenge itself (Wikipedia already has a large number of company profiles and there is now a formal Companies WikiProject: http://en.wikipedia.org/wiki/Wikipedia:WikiProj...). I agree it would probably have to be seeded by a significant player in the data space, something that will be painful to do for a company that gets significant revenue from licensing its data. I also agree with many of the other comments made here about the importance of a data standard (probably by vertical, as xml dude suggested). In our case, solving the challenge of a single standard for various data sets would be HUGE. The issue of seeding, while a challenge, is only temporary. Data is becoming increasingly commoditized and the price is fast approaching zero. Case in point: Jigsaw's recent open data initiative, in which they decided to give away their database of basic company profiles in order to generate interest in their service (and their less commoditized data for individual executive contacts). There's some seed right there!
You might be interested in The Data Tank. We just opened up a technology preview on http://thedatatank.com and are looking for enthusiastic people around the world to join our team!
Thanks for the nice facts.