Bret Taylor's blog: We need a Wikipedia for data - Bret Taylor's blog

  • Toby Segaran · 1 year ago
    Echoing the Freebase suggestions... I work at Metaweb and our team has been furiously adding new datasets to prove the value of a "Wikipedia for data". For example, our entry for Wal-mart, http://freebase.com/view/en/wal-mart, has retail locations, board members, notable employees, political contributions, links to SIC and NAICS codes and some financial data (revenue, income, market cap).

    Most of the information I've been adding is out there and available for free, it's just not in a form that people can use. I retrieved a lot of financial data by parsing SEC documents, which is all that the companies (like Capital IQ) selling the data to Google and Yahoo are doing. Even without getting companies to "donate their data" we can do a lot better just by sharing our parsers and putting stuff in one place.
  • mndoci · 1 year ago
    Toby,

    Freebase supports Creative Commons, so it's relatively easy to mirror information there, but the way I see it, shouldn't data reside at its original store (let's say a database of genes), but you can slurp the metadata into Freebase for more advanced querying? Or is the vision to convert Freebase into THE store for entity extraction and the original data sources as just repositories? That was the discussion that I had with Danny at Scifoo last year. It just feels somewhat limiting and doesn't really fit with the "web of data" idea either.
  • Bret Taylor · 1 year ago
    It seems interesting, but it also seems to be focused (currently) on different data. There are single digit millions of entries according to commenters, but over 12 million small businesses in the US alone (and many more people). There are a lot more roads on maps, and millions of data points of stock data produced every day. It seems there is a fairly big difference between the types of data sets I mentioned and the data currently being put in Freebase unless I am misunderstanding something. Why do you think that is? The differences seem acute enough that I worry the infrastructure and focus of Freebase are quite different from what I am looking for, but I am unfamiliar with the project.
  • mndoci · 1 year ago
    The focus of Freebase is to make it a data commons in a way. You can pretty much load up any public/CC licensed dataset, e.g. people have uploaded a ton of genomic information. The secret sauce comes from the ability to add structure and then query along that graph.

    In addition to theinfo.org mentioned below, let's not forget a site like Swivel.com. That's pretty much dedicated to datasets, but without the wiki-style flexibility.
  • Aaron Swartz · 1 year ago
  • Bret Taylor · 1 year ago
    Interesting - surprised I hadn't stumbled on your site before. I will check it out.
  • Christian · 1 year ago
    Check out Numbrary: http://www.numbrary.com

    It was created to be a repository for public data.
  • Sachin Rekhi · 1 year ago
    Great post, Bret. I recommend you check out the API we recently launched at imeem (www.imeem.com/developers). We’ve added a new twist to our social network open platform by allowing developers to access our entire media library to create applications on the imeem Media Platform.

    imeem has partnered with all four major music labels as well as thousands of indies and video providers to offer their content for free, on-demand streaming on imeem (supported by advertising). Through the API, imeem delivers the underlying infrastructure and content licenses for our entire catalog of media, allowing developers to focus on what they do best: building apps. We’re really excited to see what everyone is able to create now that they can access the rich metadata for all music albums from all labels that you mention above.

    Today we allow developers to create and test their apps in a sandbox environment. The next stage will be to roll out your apps to 24 million users on imeem. Check it out and let us know what you like/dislike about the platform as well as what features you would like to see added!

    Sachin
  • Paul Miller · 1 year ago
    Bret

    as many of the commenters identify, the Linked Data activity around the Semantic Web is moving in the direction of a lot of what you discuss; dbpedia etc are great examples of this start. The Linked Data workshop at the World Wide Web conference later this month should be an interesting event; http://events.linkeddata.org/ldow2008/

    One big issue is that of effectively licensing this data so that contributors, consumers, and downstream users can all be clear about their rights and obligations. Whilst there's great work for software, creative works, etc, there's been a lot less done with data. Working with legal opinion, the Science Commons project over at Creative Commons and others, we at Talis were involved in some work that resulted in the release of the Open Data Commons Public Domain Dedication and License (http://opendatacommons.org/) at the end of last year; I'll be talking about it and the economic drivers it addresses at the Web Conference in just over a week...
  • Bruno · 1 year ago
    Actually... Google (as usual) is already working on that. http://blog.wired.com/wiredscience/2008/01/goog...
  • mndoci · 1 year ago
    This effort is a lot more limited. They are essentially providing hosting for extremely large scientific datasets (too large for most to host). But the idea is actually along the principles discussed in the post. They don't plan on providing any apps from the data or doing anything with it. It just sits there, indexed and ready to use.
  • John Brimacombe · 1 year ago
  • Mike Purvis · 1 year ago
    As far as mapping data goes, some amazing progress has been made at OpenStreetMap, especially in major urban centers: http://www.openstreetmap.org/

    I'd love to see GPS traces from the Street View fleet submitted back to OSM, but maybe that's too much to wish for.
  • Rory McCann · 9 months ago
    +1 for OSM.

    At least in the USA they have decent Free GIS data. There is basically *no* geographic data for most countries in the EU, that's why OSM exists.

    As well as Google providing the GPS traces, they could also provide the camera view images to OSM; that way OSM could also add pedestrian crossings, post boxes, etc. to the database.
  • Rui Lopes · 1 year ago
    Do you mean having sth like http://dbpedia.org ?
  • Tom · 1 year ago
    It's more historical data than the examples you mentioned, but http://freebase.com/ aims to do that as well. It's got a little over 3 Million entries so far.
  • mndoci · 1 year ago
    Exactly ... to me that's the role Freebase and dbpedia are fulfilling, but I still think that unless we are specifically seeking datasets, the web should be our database. We should just make a conscious effort to keep our data open and in forms that can be mined, etc.
  • THedudeabides · 1 year ago
    Great idea and about time we look at data in a way that we can benefit application developers and users. So far your blog is great. Keep it up.
  • jmason · 1 year ago
    hi -- take a look at CKAN.net; it seems vaguely related.
  • DanTMG · 1 year ago
    I made a post last year in my blog about content aggregation as a business and how it could be made to work; see: http://danielgardner.wordpress.com/2007/03/25/b...

    I love the idea of empowering the users to change data, though the issue I could see is validation.
  • diddlediddledumpling · 1 year ago
    Great Idea.

    Are you thinking of something like this? www.jigsaw.com
    It applies to contact data only, but in a sense it is community based data gathering and updating, with a penalty / reward system based on the community's reaction to data quality. Nobody is uploading huge datasets, I think, but presumably it has the potential.
  • Tinus · 1 year ago
    Amen! I've been saying this for years. How come there's no free database for postal codes in The Netherlands? I thought (free) Web Services would solve this problem, but they didn't.
  • chex · 1 year ago
  • Tinus · 1 year ago
    Well, that list only contains a list of cities and regions. Not usable at all.
  • Pedro · 1 year ago
    I agree. Saw this the other day on Metafilter, something called Infochimps: http://projects.metafilter.com/1422/Infochimpso...
  • hendler · 1 year ago
    Have you seen http://freebase.com?
    There are lots of other items like this out there if you haven't seen.
    Keep your eye on http://twine.com as well.
  • dc crowley · 1 year ago
    Great, great idea. Equally amazing but fairly trivial is that you made Techmeme with your first blog post
  • John Davidson · 1 year ago
    You should check out Freebase at http://www.freebase.com/. This seems to be very close to what you are asking for.
  • Brent · 1 year ago
    A great vision. I do agree that it is more of a business problem than a technical one.
  • Brandon Watson · 1 year ago
    Check out opentick.com for open sourced financial pricing data.
  • Greg Gershman · 1 year ago
    Half the fun of putting together a new application is figuring out how and where you're going to get your data! Who doesn't have fond memories of writing a parser for US census data to extract zip code data?

    In all seriousness, this would be a tremendous boon to application developers.
  • Seo Sanghyeon · 1 year ago
    One amazing project is LibraryThing, a book cataloging web service. With the help of its enthusiastic users, it built an "edition disambiguation" web service. You send an ISBN and receive "related" ISBNs, like the ISBN of the hardcover edition when you give the ISBN of the paperback edition. Then you can aggregate reviews for what are essentially the same books. A company called OCLC provides a paid service for this, but LibraryThing is catching up both in quality and quantity.

    Details here: http://www.librarything.com/thingology/2008/01/...
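
    To make the "aggregate reviews" step concrete, here is a minimal sketch in Python. It does not call any real service: `RELATED_ISBNS` is a hypothetical in-memory stand-in for whatever a related-ISBN lookup returns, the ISBNs are made up, and `work_key` and `aggregate_reviews` are illustrative names only.

```python
from collections import defaultdict

# Hypothetical stand-in for an edition-disambiguation lookup: each ISBN
# maps to the set of ISBNs that belong to the same work. The ISBNs are
# invented for illustration.
RELATED_ISBNS = {
    "1111111111": {"1111111111", "2222222222"},  # paperback + hardcover
    "2222222222": {"1111111111", "2222222222"},
    "3333333333": {"3333333333"},
}


def work_key(isbn):
    """Return one stable key per work: the smallest related ISBN."""
    related = RELATED_ISBNS.get(isbn, {isbn})
    return min(related)


def aggregate_reviews(reviews):
    """Group (isbn, review_text) pairs by work instead of by edition."""
    by_work = defaultdict(list)
    for isbn, text in reviews:
        by_work[work_key(isbn)].append(text)
    return dict(by_work)


reviews = [
    ("1111111111", "Loved the paperback."),
    ("2222222222", "Hardcover is sturdy."),
    ("3333333333", "A different book."),
]
grouped = aggregate_reviews(reviews)
```

    The paperback and hardcover reviews end up under one key, while the unrelated book keeps its own. An ISBN the lookup has never seen simply falls back to being its own work.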
  • LudoA · 1 year ago
    http://infochimps.org/ is the most-used one I believe.
  • Rory McCann · 1 year ago
    I'd like to second the recommendation for OpenStreetMap. It's basically a wiki map of the planet. The USA is lucky because it has very good public domain maps. Places like the UK have to be reverse engineered.

    Rather than upload Google's Street View data, it would be better if Google allowed people to trace over their satellite imagery. Yahoo lets the OpenStreetMap project trace over their imagery; Google should do the same.
  • Pete Skomoroch · 1 year ago
    Bret,

    I agree with most of your points. Getting some big companies on board is important. It seems like there are a couple of issues for them right now: profitability and legality. Dillard's and Sam's Club donated a large retail dataset to the Teradata University Network last year, but like many released datasets, it has noncommercial restrictions. The danger of unintentional release of customer information or material financial data seems to be a concern, along with competitive advantage/profitability. Another approach I've used to bootstrap data collection like this is to use Mechanical Turk...

    A few good links:

    http://en.wikipedia.org/wiki/Open_data
    http://numbrary.com/
    http://theinfo.org

    I did a related blog post a few months ago here: http://www.datawrangling.com/some-datasets-avai... and have a bunch of links to interesting datasets tagged in delicious:

    http://del.icio.us/pskomoroch/dataset

    -Pete
  • Bret Taylor · 1 year ago
    Great info, thanks. I will check these out.
  • Chris Estes · 1 year ago
    It's always bothered me that US legislative districts are public information but to get them in a usable format and tie them to a database of addresses you have to use a private vendor. Same with trying to get Plus4 ZIP codes.
  • kL · 1 year ago
    "Rich meta data for all musical albums" = musicbrainz.org?
  • Daniel Schumacher · 1 year ago
    I think you're right about the degree to which lack of such a resource stifles innovation. It is a matter of standardization. I think it is a fact of political philosophy that standards benefit the society as a whole, but are not necessarily in the (perceived) interest of the capitalist. This is why standards are so few and there are so many areas crying out for standardization (one small one, a standard re-usable grocery bag.) The FOSS movement is a political movement as much as a technical one. The calculation of who wins vs. who loses is at the heart of why such endeavors are promoted or suppressed. Welcome to the western world!
  • daveevans · 1 year ago
    I've been mulling over a similar problem space for a few years, then I saw Freebase.com, which is close to what you are looking for.
  • Justin Bozonier · 1 year ago
    This is the purpose of the Semantic Web. There is a site called http://www.dbpedia.org. Instead of creating a whole other approach, I think we may want to contribute to projects such as these.
  • PRime · 1 year ago
    Right on. I think in general you're spot on... what do you think of Freebase?
  • Kenneth · 1 year ago
    I am trying to collect sports scores at
    http://www.masseyratings.com
    Go there and email me if you can help.
  • Stephen Edgar · 1 year ago
    Echoing others with DBpedia & Freebase

    They have a great API ( http://www.freebase.com/view/freebase/api ) there are various other projects for jQuery, Python & Javascript going on ( http://code.google.com/hosting/search?q=freebas... )

    They recently got some more cash ( http://www.techcrunch.com/2008/01/16/freebase-t... & http://www.techcrunch.com/2007/03/09/this-is-co... )

    Cheers,

    Stephen Edgar
  • Mastermind · 1 year ago
    I've found Wikipedia to be pretty good with album/movie information. If you really wanted to, it would be pretty easy to write a spider that collected information from the site. It wouldn't be perfect but it would get the job done. Also, the DBpedia linked by others seems like an attempt to do just that. In addition to your list I would like to see:

    * A site with all the laws from all countries in the world (local, state, and federal). You could augment that with information about the law, like when it was passed, what was debated about it, what court cases it's been used in, and who voted for or against it.

    * For that matter, an open database of court cases, including all documents and transcripts.

    * An open database containing all businesses, their location, what they do, who they employ, etc.

    * An open database with all news reports (from television, print, radio, and internet services).

    * A database containing the budgets for all public institutions (schools, governments, police stations, etc.).
  • ReparateMe.com · 1 year ago
    What do you think that Swivel.com is? Another ReparateMe.com or something?
  • paul · 1 year ago
    Take a look at swivel.com. I had this idea too, but they're already so far, with venture capital to make it happen.
  • Martin Focazio · 1 year ago
    Good idea. Too bad that the very lists you have described are, in fact, "intellectual property" (thus the lawyers) and there have been plenty of cases that have gone to court (like the real-estate industry and the Multiple Listing Service) and come out the other side with a clear and enforced set of laws that restrict their use.
  • Car Fan · 1 year ago
    Great post! I'm looking for a list of US city and state pairs and a list of US car dealers -- but this stuff is hard to find for free.
  • Mike Johnson · 1 year ago
    I think something generally like what you're sketching out is definitely needed.

    I would suggest that Wikipedia is not a good analogy. Wikipedia deals with adding, cutting, pasting, and generally mercilessly editing content. The sort of site you're talking about is not best understood as a community collaborating to produce a product (like Wikipedia), but a community collaborating to create a data repository, the contents of which are collaboratively added and organized and which can be freely reused.

    Better analogy: Creative Commons for data.
  • Joe Rosenblum · 1 year ago
    Nice post. Have you seen Aboutus.org? Ward Cunningham (inventor of the wiki) is CTO there, and they are working on a subset of this problem (corporate/white pages info), which points to the possible monetization options for an open source project like you describe.
  • hthth · 1 year ago
    http://www.Twine.com has some promise wrt. this. But due to the variety of data, and the different means in which they need to be displayed -- I fear no current system lives up to the ideal, and that we'll have to wait a while before it emerges.
  • spot · 1 year ago
  • jOE · 1 year ago
    that's just a great idea. interestingly, i can remember that i have read that the folks at wolfram research are going to incorporate a lot of data right into their product (afaik it had also the abilities to download missing data). assuming the incorporation in mathematica is straightforward and relatively simple, it adds new dimensions to using it ...
    (note: a while has passed, probably mathematica already supports this by now)
  • pozorvlak · 1 year ago
    That's in version 6, and it includes a lot of stuff mentioned already, like financial and geographic data.

    Of course, it's Wolfram, so they want $$$ for it :-(
  • Steve Lynch · 1 year ago
    There's also Freebase: http://www.freebase.com/
  • Sean Gorman · 1 year ago
    Great post and good insight. We took a shot at the map data side of this about a year ago with a project called GeoCommons. We aggregated a good pool of data but ran into some serious scaling issues, especially when you are having to read and write to the database frequently. At about one billion data entities MySQL craps out, and at about 5 billion data entities PostGIS craps out, so we ended up having to develop a custom object store solution, and will be launching that part of it at Where 2.0. Each of the bullets you gave has its own serious technical challenges. I'm not sure if any one datawiki can solve them all, but I do think there is potential in federating repositories in clever ways.
  • Music Lessons · 1 year ago
    Great idea! I was reminded of this while trying to get zip code data with latitude and longitude for the whole US. It should be freely available, but I have to pay someone to get it. And then trying to get this internationally is quite an issue.

    There is a great need also to keep this type of open database clean and updated--developing protocols to ensure that the data is timely. Sometimes "open source" means "better quality", but not always...
  • Jon · 1 year ago
    YES! This is a great idea.

    Maybe google can just licence everything and then all developers can chip in a £1 each to help offset the cost.

    Then all we need is a standard xml markup and the rest is easy!
  • Norman · 1 year ago
    Great idea - and one that not only programmers, but even the ordinary citizen would benefit from. Here is an example - voting records of politicians at all levels (city, county, state, house and senate) who in their pre-election speeches espouse some causes, but vote diametrically opposite during their tenure. Another piece of data that is vital would be their main campaign funders and their backgrounds.
    Ultimately, if we are able to mine the data and present the actual facts, I think we will bring about a change in this country.
  • Steve · 1 year ago
    Don't forget food nutrient data!
  • @mrflip · 1 year ago
    I've started a project, http://infochimps.org (also mentioned above) to try to build part of the "Almanac" to sit next to wikipedia's "Encyclopedia".

    Besides the myriad 'expert' sites -- ones that specialize in a certain knowledge domain -- you'll find CKAN, which concentrates on indexing information sources, http://numbrary.com which is also CTM and has a wealth of open government data, http://freebase.com which is a sprawling integration of human-scale information, and http://swivel.com, which allows live interaction and visualization with their data.

    I think the main virtues of http://infochimps.org lie in its suckiness:
    - we're "messy". We're looking to loosely couple data: make it discoverable, make it publicly curated, make it interconnect -- but not to impose any kind of strict structure or format or ontology. You can sit happily in our DB with nothing more than a title, a list of credits, and a few tags.
    - we're "not live". As much as possible, we'd like to give infochimps data to work with on their machine, using their tools (and not incidentally their CPU cycles).
    - we're "not good at any one specific thing". There's sites with Economic data, with UN data, with astronomical data, with baseball statistics, with social network graphs. We need a place that allows and in fact inspires connections among all these rich sources of data, and gives you immediate access to them.

    If you suspect you may also be an infochimp, please get in touch. The project will only succeed with community involvement.
  • Paul Pedersen · 1 year ago
    Bret - good idea. I've been thinking about the same issue for a while. For the past year I've been developing a publicly accessible web repository. The bootstrap problem can be solved by building the repository one useful step at a time and charging a very light cost for the immediate consumers of the repository data. Running a shared crawl makes a lot of sense to me - everyone involved gets a better data set than if they ran solo, and at a way lower cost than running in-house. It's pretty hard work, OTOH, and any help would be great. The service is at page-store.com. The current crawl has been running for about 3 months, and the repository is shaping up pretty nicely. I've tried not to step on any toes while collecting web pages. I'm interested in making the collection process as transparent and polite as possible. I'm also interested in the entire value-add data extraction process.
  • Murat Aktihanoglu · 1 year ago
    But just putting data into a repository and forgetting about it doesn't solve the problem. The data in this repository should be live, streamed-in periodically from the real source. For example what's the use for a 5 year old new york pizza restaurants directory?
    Anyone should be able to set up auto-updates from their DB into this using an API with zero programming. And of course anyone should be able to pull out any stream. I've been thinking a lot about this recently.
  • Daniel Kinzler · 1 year ago
    Here's my own take on how to get linked data into (or rather, out of) Wikipedia itself: http://brightbyte.de/page/WikiData_light

    This would not be a system for integrating large collections of free data -- something like freebase is more suitable for this. Instead, it would be another source of data for such a system: a source providing meta data for tens of thousands of people, books, locations, etc...

    I have contributed to the MediaWiki codebase before, and I'll be actively pushing to make this become reality. If I can make it happen, I will try to work closely with the dbpedia folks (chances are good I think, I have talked to them before). Exciting times!
  • Colin M. Saunders · 1 year ago
    Also check out http://millipedia.org , a collaborative semantic encyclopedia of facts. Anyone (or any machine) can post a fact, which is a (noun, relationship, noun) tuple such as "rock beats paper". Users can then vote that fact up or down. This collection of facts and associated weights forms the knowledge base, which can then be queried programmatically. More or less a thought experiment for now, but (IMO) interesting nonetheless.
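
    As a rough illustration of that data model (not the site's actual implementation, which is not described here), a fact base of voted tuples could be sketched like this in Python. `FactBase`, `vote`, and `query` are hypothetical names, and the rule that a fact's weight is simply the sum of its votes is an assumption.

```python
from collections import defaultdict


class FactBase:
    """Toy store of (subject, relation, object) facts with vote weights."""

    def __init__(self):
        # Assumption: a fact's weight is the running sum of its votes.
        self.weights = defaultdict(int)

    def vote(self, subject, relation, obj, delta=1):
        """Post a fact or vote on it: +1 for up, -1 for down."""
        self.weights[(subject, relation, obj)] += delta

    def query(self, subject=None, relation=None, obj=None, min_weight=1):
        """Return matching facts, highest weight first; None is a wildcard."""
        results = []
        for (s, r, o), weight in self.weights.items():
            if weight < min_weight:
                continue
            if subject is not None and s != subject:
                continue
            if relation is not None and r != relation:
                continue
            if obj is not None and o != obj:
                continue
            results.append(((s, r, o), weight))
        return sorted(results, key=lambda item: -item[1])


kb = FactBase()
kb.vote("rock", "beats", "paper")      # someone posts the fact
kb.vote("rock", "beats", "paper", -1)  # someone else votes it down
kb.vote("paper", "beats", "rock")
kb.vote("paper", "beats", "rock")
top = kb.query(relation="beats")
```

    With the default `min_weight=1`, the down-voted fact drops out of the results, so the votes act as a crude validation filter on what programmatic queries see.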
  • tedmcg · 1 year ago
    Have you checked out Numberzoom http://www.numberzoom.com/ for user contributed phone numbers of unlisted Caller IDs that show up on their phones?
  • ChaosMotor · 1 year ago
    I've had the same idea before, and I support you in this endeavor.
  • dd · 1 year ago
    make a proposal at meta : )
  • Tom A · 1 year ago
    This reminds me of http://museum.media.org/edgar/ discussing how public internet access to SEC data came about.
  • a good one · 1 year ago
    I think you want MediaWiki, not Wikipedia...
  • krishnan · 1 year ago
    Good post Bret. I have been thinking about this problem for quite some time too. Another problem which I have in mind and want someone else to implement (:-)) is the idea of open reports. This idea follows from the open data idea. Companies like Gartner and Forrester are making tons of money by tapping the elusive nature of data in the world. With open data and the free processing power of idle human minds, we should be able to build a repository of open reports. In other words, open source reports making some meaning out of open data. Such open reports will help those individuals/companies who want to make meaningful decisions based on the open data available. This is nothing new. It is an extension from the scientific community. We need to bring such collaboration into the technology and business community too. In fact, we can get started with an open reports repository even with whatever data we have access to at this point. We just have to put the processing power of our minds into writing a report and, more importantly, aggregate it in a wikipedia kinda system.
  • Virtual :: Spirit · 1 year ago
    Hmm, a really good idea!
  • Chris Marino · 1 year ago
    Hey Bret, obviously, based on the comment activity, your post has struck a nerve. There are a lot of people thinking about this problem, including me. A few comments:

    1. You say "No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in." I would argue the exact opposite: going forward a growing portion of our economy will depend on data being their competitive advantage.

    2. Not sure a Wiki is the right model to achieve your end result. Several comments here about Semantic Web technologies make this point as well. Don't forget that Internet Search killed the Yahoo Directory.

    3. Big point that's easily overlooked is that the meta data is usually as important as the data itself.

    4. Network Effects will play a major role. I wrote about a related idea regarding data availability for mashups a while back (http://blog.snaplogic.org/?p=147) that is now due for an update.

    5. Data data everywhere nor any drop to drink. We're awash in data; the problem is finding what you need, understanding the semantics and accessing it in a simple efficient manner.

    We don't need a Wikipedia for Data. We need a Google for Metadata.
  • TimG · 1 year ago
    this is a real response to a request for some government data on location of public toilets in Australia......

    Dear xxxxxxxxx

    Human Solutions Pty Ltd is contracted by the Australian Government
    Department of Health & Ageing to collect and maintain public toilet
    information for the National Public Toilet Map
    (http://www.toiletmap.gov.au/).

    Unfortunately, the National Public Toilet Map data is not currently
    available to commercial third parties. Non-commercial providers (such as
    other government agencies, charities, or associations) can submit a
    request by email (project@toiletmap.gov.au) to the Department of Health &
    Ageing to access the toilet data. If you submit a request to the
    Department of Health and Ageing and do not receive a response, it can be
    assumed that your request has been denied.

    We do not make the entire National Public Toilet Map available to any
    individual or organisation, with the exceptions mentioned above. This has
    been done by design, to restrict access to the dataset, to deter
    commercial providers from on-selling this data for a profit.

    The providers of toilet facilities give us their toilet information on the
    understanding that this will be used for the National Public Toilet Map, a
    non-profit project for the benefit of the public. If the entire dataset
    was made available, then it would be near impossible to control
    unauthorised use of the toilet providers' information for commercial gain.

    If you have any further queries, please do not hesitate to contact the
    National Public Toilet Map Helpline on 1800 990 646.

    Regards,

    the National Public Toilet Map team

    - so we can't even have access to the data set of public toilet locations in australia !!!!
  • Bret Taylor · 1 year ago
    Absolutely unbelievable! Seriously, your tax dollars pay for this stuff, and you aren't "allowed" to have it?

    Thanks for posting this. Illustrates the deep-rooted problems we have to overcome to make something like this happen.
  • TimG · 1 year ago
    the most stupid thing is, you could scrape it off the site pretty easily anyway, since they expose it record by record in all its beautiful toiletness glory.
  • Sérgio Nunes · 1 year ago
    "If you submit a request to the Department of Health and Ageing and do not receive a response, it can be assumed that your request has been denied."

    I think that in Portugal all requests are handled this way. I've made several attempts to access government data and only very rarely got any answer.
  • kambiz · 1 year ago
    agreed, we definitely need that. By the way, welcome to the blog world:)
  • Anders Rune Jensen · 1 year ago
    Funny, I was actually thinking about this exact problem this morning. For getting movie information a lot of open source projects resort to getting information from imdb as it is a huge database. This is often done by parsing the html pages as no xml feed is freely available. This comes at the price of depending on their html structure not changing. I contacted them in order to get access to the data for one of my open source projects, but they basically said pay up or no deal.

    Yahoo just shut down their weather service which provided weather information from around the world for free in an easy to parse xml format. We really need to make sure that those services keep running, in a way that doesn't rely on single companies.

    This is a huge problem right now and it really hinders innovation in a lot of areas. The conclusion of my morning brainstorm was that a non-profit organization needs to be created which gathers and distributes this information. Maybe wikipedia just needs to add some more structure to the data and make libraries available that make it easy to integrate into programs. Once the basic information is available, a lot of interesting applications can be developed.
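
    The fragility of html parsing mentioned above is easy to show with a small sketch. This is Python using only the standard library's `html.parser`; the two page snippets and the `class="title"` markup are invented for illustration and do not reflect any real site's pages.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside <h1 class="title"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h1" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()


def extract_title(html):
    """Return the title text, or None if the expected markup is absent."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title


# The same information, before and after an imagined site redesign.
old_page = '<html><body><h1 class="title">Some Movie (1999)</h1></body></html>'
new_page = '<html><body><div class="heading">Some Movie (1999)</div></body></html>'

before = extract_title(old_page)
after = extract_title(new_page)
```

    The extractor finds the title on the old page and silently returns `None` on the redesigned one, even though the data itself never changed. A structured feed with a stable schema would not break this way.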
  • Rufus Pollock · 1 year ago
    There seem to be several distinct issues you (and your commenters) are concerned with:

    1. Discoverability of datasets. For this you want a registry of some kind and this is exactly what the Comprehensive Knowledge Archive Network (CKAN) is designed to do. As the blurb on the site states:

    > CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents.
    >
    > Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.

    We launched CKAN around a year ago and it now has over 160 'packages' including many of those mentioned in the comments in this thread.

    2. 'Developing' data particularly using many contributors and a versioning (wiki-like) model. This seems a general problem and one which I wrote about in this post on the collaborative development of data back in February last year. Since then various projects have launched or developed which attempt to address this issue, even if only partially (e.g. Freebase, Swivel, Numbrary, http://www.openeconomics.net ...). This then leads into:

    3. Componentizing data so that one can easily plug different datasets together rather than having to aggregate data together in one big place (crudely: 'One Ring to Rule them All' vs. 'Small Pieces, Loosely Joined'). After all, it seems unlikely that any one organization, however large, can hold 'all the data', and in any case doing so would negate the benefits of having 'many minds' working on a problem. It is our hope that CKAN would start to facilitate the kind of packaging that one frequently observes in software but is, as yet, fairly rare for knowledge (data/content/...). More on this can be found in this blog post on componentization plus the slides from our presentation at XTech.
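
    As a rough illustration of 'Small Pieces, Loosely Joined', here is a minimal Python sketch in which two independently maintained datasets are combined on a shared key instead of being merged into one store. The datasets, the `join_on_key` helper, and the population figures are made up for illustration; none of this is a CKAN API.

```python
# Two hypothetical, independently maintained datasets keyed by ISO country code.
# The population figures are illustrative placeholders, not real statistics.
population = {"FR": 64_000_000, "DE": 82_000_000, "NL": 16_000_000}
capitals = {"FR": "Paris", "DE": "Berlin", "IT": "Rome"}


def join_on_key(left, right):
    """Return {key: (left_value, right_value)} for keys present in both datasets."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


joined = join_on_key(population, capitals)
for code in sorted(joined):
    people, capital = joined[code]
    print(f"{code}: capital={capital}, population={people}")
```

    Each dataset can keep its own maintainers and release cycle; the only thing they have to agree on is the key.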
  • dop · 1 year ago
    You're so right.
  • Evan Prodromou · 1 year ago
    Hi, Bret. As someone very involved in Open Content and Open Data, I'm glad to see the firestorm of discussion you've started.

    I think that it's more likely that we'll see Open Data split vertically rather than one big open data warehouse. People are more able to concentrate on creating a TV guide (like TVIV) or business listings (like Openguides) than stare at some big spreadsheet-style interface and come up with some "data" to share. I think projects like Freebase will be great for aggregating data created by more vertical projects.

    I've started a project called Vinismo, where we're documenting every wine in every country in the world, both with unstructured text and with structured (RDF) data. It's an exciting project, and I think there are lots of other similar ones out there.
  • marc wick · 1 year ago
    Bret

    I agree with Evan that we don't necessarily need a huge, ultimate data store for everything. What we need are relations between open datasets, along the lines of the LinkingOpenData project:
    http://esw.w3.org/topic/SweoIG/TaskForces/Commu...

    The project already interlinks data from wikipedia, GeoNames, MusicBrainz, WordNet and many more.
  • Jan Horna · 1 year ago
    How about microformats? If I have data for public sharing, I should be able to export it into a reusable XML format with a given structure (e.g. a microformat). I can imagine data flowing among different web apps this way.
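
    A minimal sketch of that export/import round trip in Python, assuming a made-up `<listings>` structure. This is plain XML rather than an actual microformat (microformats are conventions embedded in HTML class attributes), and the element names and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def export_listings(listings):
    """Serialize a list of dicts into a simple <listings> XML document (hypothetical schema)."""
    root = ET.Element("listings")
    for listing in listings:
        item = ET.SubElement(root, "listing")
        for field, value in listing.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def import_listings(xml_text):
    """Parse the same structure back into a list of dicts of strings."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item} for item in root.findall("listing")]


# Made-up sample record; one app exports it, another app reads it back.
sample = [{"name": "Example Cafe", "city": "Prague"}]
xml_text = export_listings(sample)
print(xml_text)
print(import_listings(xml_text))
```

    Any app that knows the agreed structure can read the data without knowing anything about the app that produced it.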
  • Chris DeBrusk · 1 year ago
    While accessibility is really important, I think usability is equally important to make this concept work. The sites out there today are either too complicated to interface with unless you are versed in esoteric W3C specifications, or too one-dimensional: upload data, download data, repeat.

    Make it interactive and fun to use and I think you'd get all sorts of data from all sorts of people - not just the tech community.
  • peter murray-rust · 1 year ago
    Just to endorse Rufus and many other posters on this. I'm a scientist and have campaigned for Open Data (see the Wikipedia entry (http://en.wikipedia.org/wiki/Open_data) for what I hope is a fair summary). I believe that almost all scientists want their data to be Open but don't realise the problems and the methods for making it Open.
    (Peter Murray-Rust)
  • Michael Gaio · 1 year ago
    We have an open-source "wikipedia" of data:

    http://freebase.com/
  • patrickatevri · 1 year ago
    Interesting post. But this part is provoking:

    "No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same, high quality data, all of our products would be better for it."

    This is certainly true, but I think you've missed some of the incentives.

    First, no one wants to compete in an efficient market. Efficient markets are really hard to make money in. It's why venture capital money does not go to people who want to sell wheat (or corn/oil/insert-random-commodity-here). INEFFICIENT markets are where the money is -- that is, in an inefficient market, if your company has figured out what really matters, you have a huge competitive edge. In an efficient market, all players know exactly what matters and a competitive advantage is hard to find.

    I should be clear that I am not arguing with your conclusion -- I think the world would be a much better place if the market for data sets was basically an efficient commodity market. I don't, however, believe that any of the current players in this inefficient market have a lot of incentives to move us in that direction. Quite the opposite, in fact; a winner in an inefficient market is basically a player who has "solved" the game, and that player has every incentive to keep the solution secret.
  • Jo jo · 1 year ago
    In the early '80s I was a member of the ANSI cartographic data standards committee (can't remember the exact name). The idea was to at least get the government agencies that provide such data to do it in standard ways. You're right, it's a hard problem. But it could be sooo valuable.
  • Scatman Dave · 1 year ago
    WELL Agreed
  • JP · 1 year ago
    Dealipedia, the business deal wiki, currently has almost 20,000 transactions on record and offers a free daily newsletter roundup of recent M&A, VC investment, IPO and bankruptcy deals.

    http://www.dealipedia.com/
    http://www.dealipedia.com/newsletter_subscribe.php
  • ajturner · 1 year ago
    There are already several efforts like this going on (as is easy to notice in all the comments) in different specific application spaces. But what's most important is that they keep this data open and easily shareable, so that one person can build a Geospatial Data "Wiki", and someone else a Business Listing Wiki, and these two could be brought together by someone looking for local businesses (as an example).

    So it's imperative that these systems support open, commonly used formats and APIs.

    Specifically for geo data, a project I work on is Mapufacture, which brings together various data sources in a large number of formats and then shares them out via common formats, so a user (or developer) can just use the format that makes sense for them.
  • Jono · 1 year ago
    I agree. We were looking at displaying a TV guide on our site a while back, not to bring in additional revenue, but just in an attempt to be more user-focused, as it is something that would benefit much of our target audience.

    Unfortunately it was going to cost us $6k-$9k/month (there is only one source here in Australia) which is crazy, so we obviously decided this was not an option.
  • Phillip Shoemaker · 1 year ago
    Bret, I absolutely agree with your thoughts on the wikipedia for data. I work at Numenta, where we are focusing on creating an intelligent platform. Programming for the platform is fairly straightforward; however, once you want to solve a problem like finding a pattern for predictive toxicology, machine vision, or audio problems, you run into the issue of datasets. Where do you find a good dataset for handling the toxicology of certain drugs on myriad people? Well, the answer is, you don't. Not easily, anyway. Additionally, for audio issues, people tend to go to NIST and pay a lot of money for datasets.

    If these were in an open system, more people could experiment and solve real problems with any technology (of course, we'd prefer it if people used HTMs).
  • srw · 1 year ago
    IMHO, as a specific subset of data, we need Open Marketing Data, where companies of all sizes can benefit from consumer information and analytics. A good initiative would be sharing data between companies of different sizes in a secure way, without compromising consumers' identities. Social networks are very trendy, but there is a window of opportunity for business networks in an SOA sense. Google Analytics recently added a benchmarking option, but it's not enough for a serious change.
  • ChemSpiderman · 1 year ago
    At ChemSpider we've been working hard to put together a free-access website for chemical structures and related information and data. At present we have close to 20 million structures linked to other websites and data sources and, other than the efforts of the NIH and PubChem, ChemSpider has one of the richest (and crowdsource-curated) datasets available online. We are currently working hard to curate the Wikipedia Chemistry dataset with members of WP:Chem (http://www.chemconnector.com/chemunicating/dedi...). I agree that we need more data online and available. It is interesting to note how few are willing to PROVIDE data, though, even among the advocates within this domain.
  • Nils Hitze · 1 year ago
    This is so true, thanks for writing this.
  • xml dude · 1 year ago
    I really like your ideas and have had similar ones. There is a parallel between electricity and data, which is this. It is really valuable, because it allows us to do all the things we need to do, and is very adaptable to being converted into forms usable directly or indirectly to achieve our goals. To that end, data needs to have a 60 Hz 120 V standard which might not be perfect for every application but can be transformed to whatever is required. To me, XML and vertical XML standards are the way to go.

    Once you get the data inside your firewall, what you convert it to is your business. If you create data at the 60 Hz 120 V standard, you should be able to roll your meter backwards. It costs money to collect data so it should be worth something in both directions.
  • macdavid · 1 year ago
    Bret... this is something many of us are thinking about... but, as you point out, it is impossibly complex to do well. In specific areas of focus it is feasible, of course, but data per se is very complicated. I work around the world supporting the development of different institutions and countries; if you or others establish a group looking at this, let me know... data and effective information management are the key to stimulating change and advancement.
  • anon · 1 year ago
    Great idea - but check Freebase, this has some of what you're proposing.
  • Marc Perramond · 1 year ago
    Hear, hear! My company (http://www.insideview.com) faces all of the challenges you outlined above for a specific vertical application... aggregating and making sense of all the business information available to sales and marketing professionals. We license data from traditional editorial players like Reuters, D&B, and Hoovers, as well as web site harvesters like SimplyHired and ZoomInfo, user-contributed communities like Jigsaw, and social networks such as LinkedIn and Facebook. Each of these has "data issues", i.e. each model brings different strengths and weaknesses to the table as a data source. And perhaps most interestingly for us, since we have to do the heavy lifting as a meta-aggregator, they all have different formats and unique identifiers. It's tough enough for our algorithm... imagine individual end users trying to make sense of it all.

    I have also wondered about the possibility of a DataWiki (I like the term, Bret!), or perhaps of Wikipedia taking on the challenge itself (Wikipedia already has a large number of company profiles and there is now a formal Companies WikiProject: http://en.wikipedia.org/wiki/Wikipedia:WikiProj...). I agree it would probably have to be seeded by a significant player in the data space, something that will be painful to do for a company that gets significant revenue from licensing its data. I also agree with many of the other comments made here about the importance of a data standard (probably by vertical, as xml dude suggested). In our case, solving the challenge of a single standard for various data sets would be HUGE. The issue of seeding, while a challenge, is only temporary. Data is becoming increasingly commoditized and the price is fast approaching zero. Case in point: take Jigsaw's recent open data initiative, in which they decided to give away their database of basic company profiles in order to generate interest in their service (and their less commoditized data for individual executive contacts). There's some seed right there!
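
    To make the "different formats and unique identifiers" problem concrete, here is a toy Python sketch that links records from two sources by normalized company name. It is not our actual algorithm; the records, field names, identifier schemes, and suffix list are all hypothetical.

```python
import re

# Hypothetical records from two sources, each with its own identifier scheme.
source_a = [{"a_id": "A-17", "name": "Acme Corp."}, {"a_id": "A-42", "name": "Globex, Inc."}]
source_b = [{"b_id": 9001, "company": "ACME Corporation"}, {"b_id": 9002, "company": "Initech LLC"}]

# Legal-form suffixes to ignore when comparing names (an illustrative, incomplete list).
SUFFIXES = {"corp", "corporation", "inc", "llc", "ltd", "co"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal-form suffixes."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def link(records_a, records_b):
    """Map source A ids to source B ids where the normalized names match exactly."""
    index = {normalize(r["company"]): r["b_id"] for r in records_b}
    return {
        r["a_id"]: index[normalize(r["name"])]
        for r in records_a
        if normalize(r["name"]) in index
    }


print(link(source_a, source_b))
```

    Exact matching on normalized names is crude and misses abbreviations, renames, and subsidiaries, which is exactly why a shared identifier standard per vertical would help so much.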
  • Bart Van Loon · 8 months ago
    Hi,

    You might be interested in The Data Tank. We just opened up a technology preview on http://thedatatank.com and are looking for enthusiastic people around the world to join our team!