AdamDKing's comments | Hacker News

The script for this scam hasn't changed in years! Here's Kitboga with the same scam over 2 years ago: https://twitter.com/kitboga/status/1009830578407997440 (it's almost always 22 pounds of cocaine)

Here's the same scam from his stream 3 days ago: https://www.twitch.tv/videos/668914814?t=5h4m20s An abandoned Toyota Corolla on the southern border of Texas with blood, _exactly_ 22 pounds of cocaine, and cash. Multiple bank accounts and addresses registered with your SSN, funds wired out of the US.


Possibly due to the current Google Cloud outage. https://status.cloud.google.com/incident/compute/19008


I thought the entire point of the cloud was to make stuff like this not cause your site to go down. Or did someone realize that only had to be part of the marketing?


The cloud doesn't go down; regions and services inside those regions go down.

A region (or AZ for Amazon) is no more or less reliable than your run-of-the-mill DC.


True, but then the question is: is it cheaper to build your own redundancy, or to duplicate all/most of your infrastructure across multiple zones? Once that becomes the question, the cloud no longer looks so simple or cost-effective.


No, the point is that when the cloud goes down you can stay in bed, because someone else is sorting it out for you.


Excrement does hit the air conditioning from time to time.


On the <|endoftext|>: GPT-2 and this model were trained by sampling fixed-length segments of text from a set of web pages. So if the sample happens to start near the end of one page then it will fill in the rest of the length with the beginning of another page. The model learns to do the same. TalkToTransformer.com hides this by not showing what comes after the <|endoftext|> token.
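Roughly, the data pipeline is something like the sketch below (a simplified illustration, not the actual training code: the 1024-token context length comes from the public GPT-2 release, while the whitespace "tokenizer" and function names are placeholders):

    import random

    EOT = "<|endoftext|>"   # separator appended after each document
    CONTEXT_LEN = 1024      # GPT-2's training context length

    def build_stream(pages):
        """Concatenate all pages into one long token stream, separated by <|endoftext|>."""
        stream = []
        for page in pages:
            stream.extend(page.split())  # stand-in for a real BPE tokenizer
            stream.append(EOT)
        return stream

    def sample_segment(stream):
        """Sample a fixed-length training segment from an arbitrary offset.

        If the offset lands near the end of one page, the segment runs across
        <|endoftext|> into the start of the next page, which is why the model
        learns to begin a fresh, unrelated document after that token.
        """
        start = random.randrange(len(stream) - CONTEXT_LEN)
        return stream[start:start + CONTEXT_LEN]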


That explains why sometimes the talktotransformer samples are so short!


>How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.

TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.
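For the "changing the number of workers" part, a minimal sketch with the official Kubernetes Python client might look like the following (the deployment name and namespace are made up; restarting preempted workers needs no extra code, since the Deployment controller recreates pods on its own to maintain the replica count):

    from kubernetes import client, config

    def scale_workers(replicas, name="gpt2-worker", namespace="default"):
        """Resize the worker Deployment to the requested number of replicas."""
        config.load_kube_config()  # use load_incluster_config() when running inside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        scale_workers(4)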

To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it do a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them _before_ decoding the string.
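The byte-concatenation point matters because decoding each chunk on its own can fail when a token boundary splits a character. A small illustration in Python (the chunking here is contrived; only the UTF-8 behaviour is the point):

    # A token boundary can fall inside a multi-byte character, so a raw chunk
    # may not be valid UTF-8 on its own. Concatenate the bytes first, then decode.
    data = "café".encode("utf-8")          # b'caf\xc3\xa9'
    chunks = [data[:4], data[4:]]          # the two bytes of 'é' end up in different chunks

    # Decoding chunk-by-chunk raises an error on the partial character:
    try:
        print("".join(c.decode("utf-8") for c in chunks))
    except UnicodeDecodeError as e:
        print("per-chunk decode fails:", e)

    # Concatenating the raw bytes before decoding works:
    print(b"".join(chunks).decode("utf-8"))  # -> café

In the browser, a streaming TextDecoder does the equivalent of this buffering for you.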

While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.


Any thoughts on the larger model? Doesn't seem materially better than the last one. Maybe the fine tuning exercises will show the benefit?


Thanks for the upvotes! The site is running quite a bit slower than planned, but I think I know why. I should be able to get it going at full speed sometime tomorrow.


The site was using one K80 GPU, but it was slowing down significantly, so I'm adding a second one. The servers run on Google Kubernetes Engine.


I'm also working on implementing that particular workflow with GPT-2/GKE/GPUs (I'm curious about your deployment strategy if you want to talk more).

You may want to use preemptible GPUs if you aren't already.


You seem to be saying this work is based on convolutional neural networks. That's incorrect. It uses the same attention mechanisms from natural language processing, which involve no convolution operations.

Convolutions have a different set of weights for each position offset (with a fixed window size), and reuse those weights across the entire input space.

Transformer-based networks like this work compute attention functions between the current position's encoding and every previous position, then use the outputs to compute a weighted sum of the encodings at those positions. Hence they can look at an arbitrarily large window and the number of parameters they have is independent of the size of that window.
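To make the parameter-count point concrete, here's a bare single-head attention sketch in plain numpy (the names are mine; real transformers add multi-head projections, causal masking, learned position information, and so on):

    import numpy as np

    d = 64                                   # encoding dimension
    rng = np.random.default_rng(0)

    # The only learned weights: three d x d projection matrices. Nothing here
    # depends on how many positions are attended over.
    Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]

    def attend(x):
        """x: (seq_len, d) encodings. Every position attends to every position;
        a causal mask would restrict this to previous positions only."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len) attention logits
        scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ v                               # weighted sum of value encodings

    # The same weights handle any window size:
    print(attend(rng.standard_normal((10, d))).shape)    # (10, 64)
    print(attend(rng.standard_normal((500, d))).shape)   # (500, 64)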


It seems that using "fixed attention" for text would encourage the network to periodically summarize the context so far and put it in that fixed column for the rows below to access.

Maybe the reason "strided attention" didn't work as well is that it would require the network to put this context summary in every column lest the rows below be unable to access it. That would waste features since the summary wouldn't vary much over time but would still be stored in full at each step.

If this is true, the approach they used for images might actually be inefficient in a similar way.
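For anyone who hasn't read the paper, here's my rough understanding of the two patterns, written as boolean masks (a simplification: both heads are merged into one mask, the stride/block size l is an arbitrary small value, and the fixed pattern reserves only the last column of each block, whereas the paper uses several):

    import numpy as np

    n, l = 16, 4   # sequence length and stride/block size (illustrative values)

    def strided_mask(n, l):
        """Position i attends to the previous l positions and to every l-th earlier position."""
        m = np.zeros((n, n), dtype=bool)
        for i in range(n):
            for j in range(i + 1):
                if i - j < l or (i - j) % l == 0:
                    m[i, j] = True
        return m

    def fixed_mask(n, l):
        """Position i attends within its own block and to the last column of every
        earlier block -- the 'summary' column the comment above is describing."""
        m = np.zeros((n, n), dtype=bool)
        for i in range(n):
            for j in range(i + 1):
                same_block = (j // l) == (i // l)
                summary_col = (j % l) == l - 1
                if same_block or summary_col:
                    m[i, j] = True
        return m

    print(strided_mask(n, l).astype(int))
    print(fixed_mask(n, l).astype(int))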


NVIDIA just released the code: https://github.com/nvlabs/spade/


"To reproduce the results reported in the paper, you would need an NVIDIA DGX1 machine with 8 V100 GPUs."


That line refers to training the model from scratch. You can still run the trained model very quickly with one "cheap" GPU.

That said, I'm not sure why one wouldn't get a similar result training on the EC2 or GCE instances that have 8 V100s. Or even training with fewer GPUs but accumulating gradients to get the same batch size.
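The gradient-accumulation idea is the standard trick; a minimal PyTorch-style sketch (the model is assumed to return a loss, and the loader and optimizer are placeholders):

    import torch

    ACCUM_STEPS = 8   # e.g. 1 GPU emulating the effective batch size of 8 GPUs

    def train_epoch(model, loader, optimizer):
        """Accumulate gradients over several small batches before each optimizer step,
        so the update sees the same effective batch size as multi-GPU training."""
        optimizer.zero_grad()
        for step, (x, y) in enumerate(loader):
            loss = model(x, y) / ACCUM_STEPS  # scale so the summed gradient matches the big batch
            loss.backward()                   # gradients accumulate in .grad across iterations
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()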


What software do you use to make those 3D DNN architecture images?


For all the 3D diagrams that I made (including the animated one at the end) I wrote code that used https://threejs.org/ and my custom library. It worked, but with a lot of hassle. In the future I'll likely try using Blender.


I've no idea what NVIDIA use, but you could do this pretty easily in Blender with the Freestyle NPR renderer and the built-in "Import Images as Planes" add-on.

