Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
17 years of government contracts on BigQuery (github.com/antontarasenko)
136 points by anton_tarasenko on June 5, 2016 | hide | past | favorite | 20 comments


BigQuery has two unusual technical features that make it uniquely suited to hosting public data like this:

1. It has a separate "storage tier" and "compute tier", so we can all run queries against the same dataset at the same time.

2. The "compute tier" is 100% shared, so I can use BigQuery even if I just want to run a few queries.

I hope we'll see more projects like this. Having public data available in SQL is a great thing. SQL is easy enough that a lot of people can figure it out, but powerful enough to do real work.



Also check out http://government-contracts.insidegov.com/ and http://government-contractors.insidegov.com/. Data is slightly outdated, but is being updated as we speak to account for the recent changes USASpending.gov made to their file schema. We did work to tie in all subsidiaries and branches into summary profiles through Duns numbers to give a true estimate for the obligated contract amount won by large corporations. Thanks for posting the other sites with this data, always great to check out how others handle these problems!

Full disclosure, I work for Graphiq which owns InsideGov.


Do you have an email you can share? I'm at manigandham @ gmail.

We're looking to move to GCP and love what's there so far but need some help from the startup side.


tweet at me @thetinot


A lot of this data can also be browsed by mere mortals at http://openbeta-contracts-explorer.usaspending.gov/ . More information is available at https://openbeta.usaspending.gov/ too.


I checked but is there any reason why the totals are pretty low? for example looking for Microsoft I found a total of 33M http://openbeta-contracts-explorer.usaspending.gov/#!/year/A...


Probably because of the way government contracting works with (value added) resellers. One such is "Carahsoft" (see http://openbeta-contracts-explorer.usaspending.gov/#!/year/A...). If you dig into the details of the transactions you'll see that many are for technology companies' software and services. One example link that includes Salesforce, HP, VMWare, Symantec, etc.: http://openbeta-contracts-explorer.usaspending.gov/#!/year/A...


And for those companies providing services (i.e. people doing work, rather than just SaaS) there's a lot of subcontracting that doesn't show up in this data.


I don't know much about this new explorer tool, but it's based off of USASpending.gov which is a bit fickle...which is to be expected given the messiness of the data involved (companies have varied names, subsidiaries etc., among other real-life complexities).

But you can do an "advanced search" by registered company name, and this is what comes up for "Microsoft" -- probably doesn't include all of its subsidiaries that don't have "Microsoft" in the name: ~$1.4 billion in awarded contracts

https://www.usaspending.gov/Pages/AdvancedSearch.aspx?sub=y&...

edit: Oh I know the reason for the discrepancy...the homepage of the beta app says it's only partial data:

> Added more years of contract data to now include 2011 - 2015


You really have to have a groomed database to get reasonably accurate data. At a former employer (federal market research) we maintained a separate database of contractors mapping the relationship between companies based on DUNS number, which enabled rolling up totals for subsidiaries. Of course that doesn't account for JVs or subcontracting.


Government requires competition for purchases. So vendors like Dell, SHI, CDW, Carahsoft, etc are all wholesalers who compete over the pennies for fulfillment to sell Microsoft (and most other) software.


Two reasons:

1. Agencies often buy from distributors and integrators, not publishers. In this case, original manufacturers may appear in the variable `descriptionofcontractrequirement`.

2. Spelling may differ. Variable `dunsnumber` is be a better indicator for uniqueness.


And DUNS number is also pretty useless as it is based on location (not organization).

A search on sam.gov for "lockheed" results in 207 records (all probably the same company).


Yep, and http://government-contractors.insidegov.com/ is pretty easy to use.


Thanks for sharing Anton!

Fixing the link for mobile: (GitHub doesn't render the link)

- http://nbviewer.jupyter.org/github/antontarasenko/gpq/blob/m...

Some quick charts with DataStudio (https://medium.com/@hoffa/showing-off-the-new-free-google-an...):

http://i.imgur.com/ZHOkrN2.png


Is there an open source equivalent of the software-part of BigQuery?


I'm a bit late here, but maybe https://en.wikipedia.org/wiki/Apache_Drill ?


There's really no direct open source equivalent, since BigQuery is a multi-tenant serverless database (more at https://cloud.google.com/blog/big-data/2016/01/bigquery-unde...).

That said, there are plenty of open source projects that scale horizontally and let you run SQL faster than RDBMS - Presto, Impala, Spark, Druid, Hive, etc.

Happy to answer more questions on BigQuery.





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: