multi: implement new safe static channel backup and recovery scheme, RPCs, and cli commands #2313

Roasbeef · 2018-12-11T01:03:45Z

Overview

In this PR, we implement a new safe scheme for static channel backups (SCBs) for lnd. We say safe, as care has been taken to ensure that there are no foot guns in this method of backing up channels, vs doing things like rsyncing or copying the channel.db file periodically. Those methods can be dangerous as one never knows if they have the latest state of a channel or not. Instead, we aim to provide a simple, safe scheme that allows users to recover the settled funds in their channels in the case of partial or complete data loss. The backups themselves are encrypted using a key derived from the user's seed. This way we protect the privacy of the user's channels in the backup state, and ensure that a random node can't attempt to import another user's channels.

Once this PR is merged, given their seed and the latest backup file, the user will be able to recover both their on-chain funds, and also funds that are fully settled within their channels. By "fully settled" we mean funds that are in the base commitment outputs, and not HTLCs. We can restore only these funds because all the data required to back them up is available right after the channel is created. In contrast, in order to resolve HTLCs, we would also need to update the backup state with each new channel update, which is tricky to do without additional infrastructure. This infrastructure will be built out in the near future, but until then we have this scheme, which will also serve as a fallback in the scenario that any higher level mechanisms fail.

At a later point, we also plan to propose this backup scheme as an addition to the spec, as even with the change to make the "to self" outputs static, we still need this SCB information in order to restore user funds. Additionally, the current serialization format is a bit up in the air. Atm, we use the same "codec" as we do within the wire protocol for the BOLT specs. However, we'll likely move to a TLV (type-length-value) format as it's extremely flexible and allows us to add/remove fields in the future once we gain new channel types, or modifications are made in the protocol that warrant a change to the backup format. Most importantly, if aezeed and this chanbackup scheme are added to the spec, then it will be possible to write a simple program that, given a seed+backup from any of the implementations, will be able to recover all funds (sweep to an address) and then shut down.

Recovery Flow

Skipping the backup flow for a second, given their 24-word aezeed seed and a special channels.backup file, the recovery flow would be something like the following:

1. The user uses lncli create or the gRPC WalletUnlocker.Init call to input their seed and fully serialized backups.
   1. Alternatively, if they already have a new node set up, they can use the cli and RPC commands to import channels one at a time, or the entire file.
2. lnd boots up and the wallet performs a rescan from the wallet's birthday (encoded in their aezeed) to restore all on-chain funds. Once this process is complete, the main lnd server will start up.
3. Given the set of channels to recover, the server will then (using the new chanbackup package) insert a series of "channel shells" into the database. These contain only the information required to initiate the DLP (data loss protection) protocol and nothing more. As a result, they're marked as "recovered" channels in the database, and we'll disallow trying to use them for any other process.
4. Once the channel is recovered, the chanbackup package will attempt to insert a LinkNode that contains all prior addresses that we were able to reach the peer at. During the process, we'll also insert the edge for that channel (only our outgoing direction) into the database as well.
5. lnd will then start up, and as usual attempt to establish connections to all peers that we have channels open with.
6. Once we connect with a peer, we'll then initiate the DLP protocol. The remote peer will discover that we've lost data, and then immediately force close their channel. Before they do though, they'll send over their latest unrevoked commitment point, which we need to derive keys (will be fixed in BOLT 1.1 by making the key static) to sweep our funds.
7. Once the commitment transaction confirms, given information within the SCB we'll re-derive all keys we need, and then sweep the funds.

Backup + Recovery Methods

This PR exposes multiple safe ways to back up and recover a channel. We expect only one of them to be used primarily by unsophisticated end users, but have provided other mechanisms for more advanced users and businesses that already script lnd via the gRPC system.

First, the easiest method for backup+recovery. After this PR, lnd will maintain a channels.backup file in the same location that we store all the other files. Users will at any time be able to safely copy and back up this file. Each time a channel is opened or closed, lnd will update this file with the latest channel state. Users can use scripts to detect changes to the file, and upload them to their backup location. Something like fsnotify can notify a script each time the file changes, so that it can be backed up once again. The file is encrypted using an AEAD scheme, so it can safely be stored plainly in cloud storage, your SD card, etc. The file uses a special format and can be imported via any of the recovery methods described below.

The second mechanism is via the new SubscribeChanBackups streaming gRPC method. Each time a channel is opened or closed, you'll get a new notification with all the chanbackup.Single files (described below), and a single chanbackup.Multi that contains all the information for all channels.

Finally, users are able to request a backup of a single channel, or of all the channels, via the cli and RPC methods. Here are a few ways users can obtain backups; see the PR for full details:

⛰ lncli --network=simnet exportchanbackup --chan_point=29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0
{
    "chan_point": "29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0",
    "chan_backup": "02e7b423c8cf11038354732e9696caff9d5ac9720440f70a50ca2b9fcef5d873c8e64d53bdadfe208a86c96c7f31dc4eb370a02631bb02dce6611c435753a0c1f86c9f5b99006457f0dc7ee4a1c19e0d31a1036941d65717a50136c877d66ec80bb8f3e67cee8d9a5cb3f4081c3817cd830a8d0cf851c1f1e03fee35d790e42d98df5b24e07e6d9d9a46a16352e9b44ad412571c903a532017a5bc1ffe1369c123e1e17e1e4d52cc32329aa205d73d57f846389a6e446f612eeb2dcc346e4590f59a4c533f216ee44f09c1d2298b7d6c"
}

⛰ lncli --network=simnet exportchanbackup --all
{
    "chan_points": [
        "29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0"
    ],
    "multi_chan_backup": "fd73e992e5133aa085c8e45548e0189c411c8cfe42e902b0ee2dec528a18fb472c3375447868ffced0d4812125e4361d667b7e6a18b2357643e09bbe7e9110c6b28d74f4f55e7c29e92419b52509e5c367cf2d977b670a2ff7560f5fe24021d246abe30542e6c6e3aa52f903453c3a2389af918249dbdb5f1199aaecf4931c0366592165b10bdd58eaf706d6df02a39d9323a0c65260ffcc84776f2705e4942d89e4dbefa11c693027002c35582d56e295dcf74d27e90873699657337696b32c05c8014911a7ec8eb03bdbe526fe658be8abdf50ab12c4fec9ddeefc489cf817721c8e541d28fbe71e32137b5ea066a9f4e19814deedeb360def90eff2965570aab5fedd0ebfcd783ce3289360953680ac084b2e988c9cbd0912da400861467d7bb5ad4b42a95c2d541653e805cbfc84da401baf096fba43300358421ae1b43fd25f3289c8c73489977592f75bc9f73781f41718a752ab325b70c8eb2011c5d979f6efc7a76e16492566e43d94dbd42698eb06ff8ad4fd3f2baabafded"
}

⛰ lncli --network=simnet exportchanbackup --all --output_file=channels.backup

⛰ ll channels.backup
-rw-r--r--  1 roasbeef  staff   381B Dec  9 18:16 channels.backup

Static Channel Backup Scheme

Crypto

For encryption, we utilize chacha20poly1305 with a random 24 byte nonce. We use a larger nonce size as this can be safely generated via a CSPRNG without fear of frequency collisions between nonces generated. To encrypt a blob, we then use this nonce as the AD (associated data) and prepend the nonce to the front of the ciphertext package.

For key generation, in order to ensure the user only needs their passphrase and the backup file, we utilize the existing keychain to derive a private key. In order to ensure that we don't force any hardware signer to be aware of our crypto operations, we instead opt to utilize a public key that will be hashed to derive our private key. The assumption here is that this key will only be exposed to this software, and never derived as a public facing address.

`chanbackup.Single`

The SCB contains all information required to initiate the data loss protection protocol once we restore the channel and connect to the remote channel peer.

The primary way outside callers will interact with this package is via the Pack and Unpack methods. Packing means writing a serialized+encrypted version of the SCB to an io.Writer. Unpacking does the opposite.

The encoding format itself uses the same encoding as we do on the wire within Lightning. Each encoded backup begins with a version so we can easily add or modify the serialization format in the future, if new channel types appear, or we need to add/remove fields. The backup contains:

The chain a channel belongs to.
The chanPoint of the channel.
The shortChanID of the channel.
The public key of the remote node.
The series of addresses that we can use to reach the node.
The CSV delay of the channel (required to later reconstruct our output script after BOLT 1.1).
A keychain.KeyLocator that allows us to re-derive the payment base point we need to sweep our funds.
A keychain.KeyDescriptor that we need in order to re-derive our shachain root to validate the information the remote party gives us during the DLP protocol. (see the next section for the complications that arose here)

`chanbackup.Multi`

Multi is a series of static channel backups. This type of backup can contain ALL the channel backup state in a single packed blob. This is suitable for storing on your file system, cloud storage, etc. Systems will be in place within lnd to ensure that one can easily obtain the latest version of the Multi for the node, and also that it will be kept up to date if channel state changes.

Implementation Complications and Open Questions

The main complication that arose during the implementation was that I realized late in development that we also need to back up the details w.r.t. how we derive our shachain root. We got a bit lucky here as we store the private key we use as the root, and not the public key itself. In order to derive the shachain roots, we use a special keychain.KeyFamily. However, we don't store the keychain.KeyLocator information, which is a two-tuple that allows us to derive a key w/o knowing the public key or having any state in the wallet. Instead, within the backup, we're forced to store the entire public key and not just the key locator information. As a result, I needed to modify keychain.SecretKeyRing.DerivePrivKey to support a brute force scan to allow us to derive the key. In the future, we'll want to do a migration to also store the key locator information so we don't always need to do this brute force scan. In order to ensure we don't scan to infinity if we don't actually know the public key, I've added a cap on the max number of iterations.

As a result of the case above, it's now the case that any future hardware signers need to be aware of the shachain protocol, in order to generate and validate any points we receive.

The one other section that we may want to modify is the way we derive the key we use for encryption. We made an attempt to ensure that any future hardware signers don't actually need to understand our encryption protocol. So instead what we do is use a public point with the assumption that it will never be used for an address and be unveiled to the outside world. One alternative that I had (but scrapped, idk why TBH) is to use a point, but then have the hardware signer provide us with an ECDH of that point and another. This would ensure that the key is derived from secret data, but allow us to not store any private data in the backup.

TODO's

write integration tests
write additional unit tests in channeldb
real world recovery attempts
update docs on how to use the recovery tools
after rpc: Add SubscribeChannels RPC. #1988 is in, finish hooking up the chanbackup.SubSwapper so we can auto update the backup file on disk

Fixes #175

alexbosworth · 2018-12-11T01:09:51Z

chanbackup/backupfile.go

+	}
+}
+
+// UpdateAndSwap will attempt write a new temporary backup file to disk with


Suggested change

-// UpdateAndSwap will attempt write a new temporary backup file to disk with
+// UpdateAndSwap will attempt to write a new temporary backup file to disk with

lsching17 · 2018-12-13T13:12:04Z

"First, the easiest method for backup+recovery. After this PR, lnd will maintain a channels.backup file in the same location that we store all the other files. .."

Can a dedicated folder be used? If it is mounted with sshfs or nfs, the channels.backup and channel.db files can be kept on different machines.

Roasbeef · 2018-12-13T21:31:26Z

Can a dedicated folder be used?

I don't see why not. We can add a config flag for the backup file location.

Roasbeef · 2018-12-25T02:53:54Z

Alrighty, I've broken this PR up into 5 distinct PRs. Each new PR depends on the prior PR. As a result, they can go in one by one and be reviewed in smaller units, rather than waiting for the final dependents of this larger PR to be finalized. I'll keep this one as is though, as it has the full description, and also builds, allowing users to experiment with the set of commands. Once the final PR is ready for review (as all the prior PRs have been merged), I'll rebase this on top of that, so everyone can use this as a central point of end to end testing.

Roasbeef · 2019-02-01T04:07:03Z

Pushed up a rebased version as all the dependent PRs have been merged. Once #1988 is in, I'll start the final push to getting this merged!

Roasbeef · 2019-02-09T03:48:47Z

Pushed up a new version that maintains the backup file on disk and modifies it based on new/closed channels. Will push up the integration tests next, and after that it's ready for review.

In this commit, we modify the main `closeObserver` dispatch loop to only look for the local force close if we didn't recover the channel. We do this because it isn't possible for us to force close a recovered channel.

…recovered chan In this commit, we modify the `closeObserver` to fast path the DLP dispatch case if we detect that the channel has been restored. We do this as otherwise, we may inadvertently enter one of the other cases erroneously, causing us to not properly look up their DLP commitment point.

In this commit, we convert the server's Start/Stop methods to use sync.Once. We do this in order to fix concurrency issues that would allow certain queries to be sent to the server before it has actually fully started up. Before this commit, we would set started to 1 at the very top of the method, allowing certain queries to pass before the rest of the daemon had started up. In order to fix this issue, we've converted the server to use a sync.Once, and added two new atomic variables for clients to query to see if the server has fully started up, or is in the process of stopping.
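A minimal sketch of that pattern, assuming a stripped-down server type: the field and method names below (`active`, `stopping`, `Started`, `Stopped`) are illustrative and not necessarily those used in lnd. The key point is that the "started" flag is only set at the end of the one-time start routine, after the subsystems are up.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// server is a stripped-down stand-in for the real daemon server.
type server struct {
	start sync.Once
	stop  sync.Once

	active   int32 // set atomically once startup has fully completed
	stopping int32 // set atomically once shutdown has begun

	starts int // counts how many times the start body actually ran
	quit   chan struct{}
}

func newServer() *server {
	return &server{quit: make(chan struct{})}
}

// Start runs the startup routine at most once, no matter how many
// goroutines call it.
func (s *server) Start() error {
	var startErr error
	s.start.Do(func() {
		s.starts++
		// ... bring up subsystems here; on failure set startErr and return.

		// Only mark the server as started once everything is up, so
		// queries can't slip through mid-startup.
		atomic.StoreInt32(&s.active, 1)
	})
	return startErr
}

// Stop runs the shutdown routine at most once.
func (s *server) Stop() error {
	s.stop.Do(func() {
		atomic.StoreInt32(&s.stopping, 1)
		close(s.quit)
	})
	return nil
}

// Started reports whether the server has fully started up.
func (s *server) Started() bool { return atomic.LoadInt32(&s.active) != 0 }

// Stopped reports whether the server is in the process of stopping.
func (s *server) Stopped() bool { return atomic.LoadInt32(&s.stopping) != 0 }

func main() {
	s := newServer()
	fmt.Println(s.Started()) // false: not started yet
	_ = s.Start()
	_ = s.Start() // second call is a no-op
	fmt.Println(s.Started(), s.starts)
	_ = s.Stop()
	fmt.Println(s.Stopped())
}
```

An RPC handler could then check `Started()` and reject requests until it returns true.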

During the restore process, it may be possible that we have already heard about our prior edge from a node on the network (or our channel peers). As a result, we shouldn't exit if this happens, and instead should continue with the rest of the restoration process.

In this commit, we modify the `RestoreNodeWithSeed` and `RestartNode` methods to also accept an SCB. This will be useful in new integration tests to properly exercise the various restore/restart scenarios using static channel backups.

In this commit, we update all uses of the `getChanPointFundingTxid` to match the new function signature. We no longer need to convert to a chainhash.Hash, as the method does so underneath now.

…t to new func In this commit, we modify the core testDataLossProtection test to extract the primary DLP assertion logic into a new function. We do this as the upcoming SCB tests will fall back to this test after some initial setup.

In this commit, we add 4 new itests for exercising the SCB restore process via 4 primary scenarios: recover from backup using RPC, recover from file using RPC, recover channels during init/creation, recover channels during unlock. With all fields populated there're a total of 24 new scenarios to cover. At the time of authoring of this commit, the other scenarios (bits are: initiator, updates, private) have been left out for now, as they increased the run time of the integration tests significantly.

molxyz · 2019-03-30T16:56:40Z

Tested on a testnet node that has been running with noseedbackup, SCB still let me do exportchanbackup. Shouldn't this result in an error message instead?
https://hastebin.com/raw/urohupeful

Roasbeef · 2019-04-01T20:40:58Z

@molxyz at runtime, lnd doesn't know if you actually got a seed or not.

Roasbeef · 2019-04-01T20:46:19Z

In that case, you wouldn't actually be able to decrypt the SCB unless you read out the private data of the database.

cfromknecht

Awesome work on this feature @Roasbeef!

It is time to start securing our bags.

LGTM 💰

ZapUser77 · 2019-04-02T00:23:21Z

Any chance you can include what the commands are to restore (exact syntax), and what the expected outputs would be (just an example)? Considering how important this is, just guessing and 'trying to figure it out' may not be the best idea.

From my understanding, this isn't actually a "back up" of the channels, and is instead a "channel funds recovery mechanism". Correct? If you restored using this, you'd have a node with zero channels, and would have to start opening channels from scratch. Correct?

Roasbeef · 2019-04-02T00:28:35Z

Check out the PR description, more docs will be provided later.

ZapUser77 · 2019-04-02T00:49:22Z

"Check out the PR description"
I did, read the entire thing, multiple times. I wouldn't have asked before reading.

"more docs will be provided later."
Thanks as always for your diligent hard work. It really is appreciated.

gijswijs · 2019-05-15T07:46:42Z

Would it be possible to create a "backup" manually, in the scenario where a node is lost and the original channels.backup file isn't accessible anymore?

Put in another way, given that you know the chan_point, could you insert the "channel shell" into the database, so that the DLP protocol can be initiated?

Roasbeef added this to In progress in High Priority via automation Dec 18, 2018

guggero mentioned this pull request Jan 4, 2019

walletunlocker+lnd: add command line flag to allow passing admin macaroon after wallet creation #1288

Merged

Roasbeef added this to the 0.6 milestone Jan 16, 2019

Roasbeef force-pushed the static-chan-backups branch from 84e91ad to f31163d Compare February 1, 2019 04:05

Roasbeef force-pushed the static-chan-backups branch from f31163d to 747b1c2 Compare February 7, 2019 02:36

wpaulino mentioned this pull request Feb 7, 2019

How to be sure that my node is protected when its offline #2613

Closed

Roasbeef force-pushed the static-chan-backups branch from 747b1c2 to 861c6fb Compare February 9, 2019 03:47

Roasbeef mentioned this pull request Feb 15, 2019

lnd crash, panic: page 1072 already freed #1107

Closed

seth586 mentioned this pull request Feb 19, 2019

BTRFS with optional USB thumb drive as RAID1 raspiblitz/raspiblitz#329

Closed

mrfelton mentioned this pull request Mar 5, 2019

Lost Funds after Deleting Wallet LN-Zap/zap-desktop#1600

Closed